Running translation on arbitrary image files. What preprocessing is needed?

I’m looking into the accuracy of OpenNMT on the im2text problem (http://zh.opennmt.net/OpenNMT-py/im2text.html).

Since a pre-trained model exists, I downloaded it from http://lstm.seas.harvard.edu/latex/py-model.pt. Then I copied my test image into a directory (input/), and created an input file (manifest.txt) listing the image's path relative to that directory.

Finally, I ran translate.py with the following flags (I realise this runs on my CPU, but I'm only testing here):

python translate.py -data_type img -model py-model.pt  -src_dir input -src input/manifest.txt -output input/pred.txt -max_length 500 -beam_size 5

However, this fails with the error:

RuntimeError: Given groups=1, weight of size 64 3 3 3, expected input[1, 4, 77, 424] to have 3 channels, but got 4 channels instead

Since I assume this is related to the image channels, I’ve tried -image_channel_size 3, but that didn’t change anything. Next I tried converting the image to greyscale (with convert input/1.png -set colorspace Gray -separate -average -alpha off input/1.grey.png) and re-running with -image_channel_size 1, however this results in:

RuntimeError: Given groups=1, weight of size 64 3 3 3, expected input[1, 1, 77, 424] to have 3 channels, but got 1 channels instead

I can’t get the input channels right, so I assume I’m missing a key preprocessing step. The documentation does show the use of preprocess.py, but this seems to cater only for training data: it expects inputs like tgt-train.txt, which I don’t have for my test file.

What preprocessing needs to be run on my file 1.png in order for OpenNMT to work?

There is some discussion here:

I think you should use a conversion tool to ensure the input images have 3 channels.
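
A minimal sketch of one way to do that conversion, assuming Python with the third-party Pillow library is available (neither is mentioned in the question; the function name convert_to_rgb and the output name 1.rgb.png are made up for illustration). The weight size 64 3 3 3 in the error suggests the model's first convolution takes 3 input channels, which would explain why both the 4-channel original and the 1-channel greyscale version are rejected. The sketch flattens any transparency onto a white background instead of simply dropping the alpha channel, since transparent pixels may carry black colour values that would otherwise hide the content.

```python
def convert_to_rgb(src_path, dst_path, background=(255, 255, 255)):
    """Save a 3-channel (RGB) copy of the image at src_path to dst_path.

    Transparent areas are filled with `background` (white by default).
    """
    # Pillow is third-party (pip install Pillow); it is imported inside the
    # function so this file can still be loaded where Pillow is missing.
    from PIL import Image

    # Normalise the input to RGBA first so that greyscale, palette and
    # RGB images all go through the same path as images with transparency.
    img = Image.open(src_path).convert("RGBA")

    # Start from an opaque canvas of the same size.
    canvas = Image.new("RGB", img.size, background)

    # Paste the image using its own alpha band as the mask: fully
    # transparent pixels leave the background colour untouched.
    canvas.paste(img, mask=img.getchannel("A"))

    canvas.save(dst_path)


# Example call (paths follow the question; the output name is hypothetical):
# convert_to_rgb("input/1.png", "input/1.rgb.png")
```

After converting, list the new file in manifest.txt instead of the original and re-run the same translate.py command.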