I’ve built quite a few models now, including two in production use, and for the first time I am getting the warning “source/target not aligned” during preprocessing after adding some new data to previously used training data. I have double-checked the tokenized source & target files and everything seems to be perfectly aligned. What is causing this warning, I wonder.
Do the files have the exact same number of lines? You could use wc -l to quickly check.
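For anyone wanting to script that check, here is a minimal sketch. The function names and the file names in the example comment are made up for illustration; substitute your own tokenized source and target files. Note that wc -l counts newline characters, so a file whose last line lacks a trailing newline reports one line fewer.

```shell
#!/bin/sh
# Sketch only: count_lines and same_line_count are hypothetical helper names.

# Print the number of lines in a file, without the file name.
count_lines() {
    wc -l < "$1" | tr -d ' '
}

# Succeed (exit status 0) only when both files have the same number of lines.
same_line_count() {
    src_n=$(count_lines "$1")
    tgt_n=$(count_lines "$2")
    if [ "$src_n" -eq "$tgt_n" ]; then
        echo "OK: both files have $src_n lines"
    else
        echo "MISMATCH: $1 has $src_n lines, $2 has $tgt_n lines"
        return 1
    fi
}

# Example call (hypothetical file names):
# same_line_count train.src.tok train.tgt.tok
```

Matching line counts only rule out missing or extra lines; they do not show that each pair of lines actually corresponds.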
Indeed - that’s on my basic checklist before lift-off
What is the exact warning message?
The message in the code reads:
_G.logger:warning('SENT %s : source/target not aligned (%d%d)', tostring(idx), length1, length2)
and I can see that this refers to a length disparity.
Most of the data is “old data” (preprocessed in already built models) and I’m effectively doing some incremental training. As far as I can see there are no length disparities in this new data.
What version are you using? Also are you using the
Using v7 and in fact have only encountered this problem since installing v7. Not using -check_length option. Will try that.
This warning is not in v0.7 so you should re-check which version you are using. Can you do:
git checkout v0.7.1
Ah, the error was in my own training script: I had not given an absolute path to my latest version, so it was pointing to an older one. Preprocessing has now completed correctly. Sorry for wasting your time.
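In case it helps someone else who hits this, below is a rough sketch of how a script could confirm which checkout it is really using before it starts preprocessing. The helper names and the example path are hypothetical, and it assumes the toolkit directory is a git checkout with tags.

```shell
#!/bin/sh
# Sketch only: abs_dir and checkout_version are hypothetical helper names.

# Print the absolute, symlink-free path of a directory.
abs_dir() {
    (cd "$1" && pwd -P)
}

# Print the nearest tag (or the abbreviated commit hash if no tag is
# reachable) of the git checkout in the given directory.
checkout_version() {
    git -C "$1" describe --tags --always
}

# Example call (hypothetical path):
# TOOLKIT_DIR=$(abs_dir ../path/to/toolkit)
# echo "Using $TOOLKIT_DIR at $(checkout_version "$TOOLKIT_DIR")"
```

Printing the resolved path and version at the top of a training script makes it obvious when a relative path has quietly picked up an older install.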