I’m trying to reuse what works for de-en with another language pair, hi-en. While de-en at least manages to overfit on the training data and gives me decent translation results, hi-en isn’t going beyond 35-45% accuracy and a perplexity around 90. Feeling overly optimistic, I ran translation on the test set as well - nearly everything comes out as unknowns.
I don’t know what tokenization is enabled at this point. I’m kind of running OpenNMT-py as a black box, hoping it’ll get me some baseline results. I didn’t run the data through the perl tokenization script - perhaps that’s one huge mistake.
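For reference, this is roughly what the invocation I skipped would look like - the file names are placeholders, the script’s path may differ in your checkout, and I’m not sure how well it handles Devanagari beyond generic punctuation splitting:

```bash
# Moses-style tokenizer shipped with OpenNMT; exact path may differ.
# -l selects language-specific rules; Hindi likely falls back to the defaults.
perl tools/tokenizer.perl -l hi < train.hi > train.tok.hi
perl tools/tokenizer.perl -l en < train.en > train.tok.en
```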
Model options - I just used the usual source/target setup, 30 epochs, 5 layers (encoder and decoder).
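For concreteness, that corresponds to roughly this (paths are placeholders, and the flag names follow the older, epoch-based OpenNMT-py CLI, so they may differ in other versions):

```bash
# Preprocess with defaults, then train a 5-layer encoder/decoder for 30 epochs.
python preprocess.py -train_src train.hi -train_tgt train.en \
    -valid_src valid.hi -valid_tgt valid.en -save_data data/hi-en
python train.py -data data/hi-en -save_model hi-en-model -layers 5 -epochs 30
```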
How do I enable this? In the source I can find an MLP and some PositionalEncoding, but reading the code is turning out to be cumbersome - is there some documentation?
Inconsistent tokenization could lead to such an issue (i.e. many unknowns on the target side). You should consider applying some tokenization, even a very basic one. We have some tools in OpenNMT-lua, but you can use any scripts you find useful.
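For example, a very basic pass with the OpenNMT-lua tokenizer would look something like this (check tools/tokenize.lua in your checkout for the exact options; the important part is applying the same command to train, validation and test files):

```bash
# Basic rule-based tokenization from OpenNMT-lua, applied consistently
# to every file that goes into preprocessing or translation.
th tools/tokenize.lua < train.hi > train.tok.hi
th tools/tokenize.lua < train.en > train.tok.en
th tools/tokenize.lua < test.hi  > test.tok.hi
```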
It’s controlled by the feat_merge option but note that it won’t solve your issue.
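For reference, it is a training-time flag, with concat, sum, or mlp as possible values - something like this, reusing the placeholder paths from above (flag names may vary by version):

```bash
# Choose how word-feature embeddings are merged with the word embedding.
python train.py -data data/hi-en -save_model hi-en-model \
    -layers 5 -epochs 30 -feat_merge mlp
```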
After some tinkering, I figured out that the prediction of so many unknowns was due to OpenNMT-py’s default vocabulary size limit.
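For anyone else hitting this: the vocabulary cap is applied at preprocessing time, so that’s where it has to be raised - something like the following (placeholder paths; the default looks like 50k words per side if I’m reading the options right, and flag names may differ across versions):

```bash
# Raise the source/target vocabulary caps; anything beyond the cap
# becomes <unk> during training and translation.
python preprocess.py -train_src train.tok.hi -train_tgt train.tok.en \
    -valid_src valid.tok.hi -valid_tgt valid.tok.en -save_data data/hi-en \
    -src_vocab_size 80000 -tgt_vocab_size 80000
```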
The training figures are still the same - only close to 30% accuracy, with most predictions being very similar to each other and not much like the gold translations. My hunch is that this is a problem with the corpora I’m using - is there any way to do some analysis and conclusively establish that the corpora are the issue?
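As a rough first check, something like the sketch below (file names are placeholders) would at least show how much of the test-side vocabulary is covered by the training vocabulary under a given cap - though I’m not sure how conclusive that is:

```python
from collections import Counter

def vocab(path):
    """Count token frequencies in a whitespace-tokenized file."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    return counts

train = vocab("train.tok.hi")
test = vocab("test.tok.hi")

# Keep only the top-k training types, mimicking the preprocessing vocab cap.
cap = 50000
kept = set(w for w, _ in train.most_common(cap))

test_tokens = sum(test.values())
oov_tokens = sum(c for w, c in test.items() if w not in kept)
oov_types = sum(1 for w in test if w not in kept)

print(f"train types: {len(train)}, test types: {len(test)}")
print(f"test token OOV rate with cap={cap}: {oov_tokens / test_tokens:.1%}")
print(f"test type OOV rate with cap={cap}: {oov_types / len(test):.1%}")
```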
In what cases would OpenNMT and the underlying methods work or not work, with respect to the size and characteristics of the dataset?