I’ve built quite a few models now, including two in production use, and for the first time I am getting the warning “source/target not aligned” during preprocessing after adding some new data to previously used training data. I have double-checked the tokenized source & target files and everything seems to be perfectly aligned. What is causing this warning, I wonder.
Do the files have exactly the same number of lines? You could use wc -l to quickly check.
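If it helps, the same line-count check can be scripted. This is only a minimal sketch in Python: the helper names `count_lines` and `same_line_count` are made up for illustration, and it counts newline characters the way wc -l does.

```python
def count_lines(path):
    """Count newline characters in a file, mirroring what `wc -l` reports."""
    total = 0
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large corpora do not need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            total += chunk.count(b"\n")
    return total


def same_line_count(src_path, tgt_path):
    """Return True when both files contain the same number of lines."""
    return count_lines(src_path) == count_lines(tgt_path)
```

You would call it as `same_line_count("train.src.tok", "train.tgt.tok")`, where both file names are placeholders for your own tokenized files.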
Indeed - that’s on my basic checklist before lift-off
What is the exact warning message?
The message in the code reads:
_G.logger:warning ('SENT %s : source/target not aligned (%d%d)', tostring(idx), length1, length2)
and I can see that this refers to a length disparity.
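To look for such a disparity by hand, something like the following could be used. It is a rough sketch, not the toolkit's own logic: it assumes whitespace-separated tokens and 1-based sentence numbers, and `find_length_mismatches` is a hypothetical helper name.

```python
def find_length_mismatches(src_lines, tgt_lines):
    """Return (sentence_index, src_len, tgt_len) for pairs whose token counts differ.

    Assumes tokens are separated by whitespace and sentences are numbered from 1.
    Pairs beyond the end of the shorter input are ignored.
    """
    mismatches = []
    for idx, (src, tgt) in enumerate(zip(src_lines, tgt_lines), start=1):
        src_len = len(src.split())
        tgt_len = len(tgt.split())
        if src_len != tgt_len:
            mismatches.append((idx, src_len, tgt_len))
    return mismatches


if __name__ == "__main__":
    # Tiny in-memory example; real use would read the tokenized files line by line.
    source = ["the cat sat", "hello world"]
    target = ["DET NOUN VERB", "INTJ"]
    for idx, n_src, n_tgt in find_length_mismatches(source, target):
        print("SENT %d: lengths differ (%d vs %d)" % (idx, n_src, n_tgt))
```

A check like this is only meaningful when source and target are expected to have the same number of tokens per line. Ordinary translation pairs will differ in length as a matter of course.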
Most of the data is “old data” (preprocessed in already built models) and I’m effectively doing some incremental training. As far as I can see there are no length disparities in this new data.
What version are you using? Also, are you using the -check_plength option?
Using v7, and in fact I have only encountered this problem since installing v7. I'm not using the -check_plength option. Will try that.
This warning is not in v0.7 so you should re-check which version you are using. Can you do:
git checkout v0.7.1
and retry?
Ah, the error was in my own training script: I had not put an absolute path to my latest version, so it was pointing to an older version. Preprocessing has now completed correctly. Sorry for wasting your time.