OpenNMT-py doubts

Hello,

I’m trying to apply the setup that works for de-en to another language pair, hi-en. While de-en at least manages to overfit the training data and gives me decent translation results, hi-en isn’t going beyond 35-45% accuracy and a perplexity of around 90. Feeling overly optimistic, I ran translation on the test set as well - nearly everything comes out as unknowns.

I’m using this dataset.

What could be the possible reasons? I also have the following doubts at this point:

  • Could embeddings learnt on monolingual corpora help with the issue?
  • Could it be an issue with the nature of the dataset?
  • Is the MLP that learns the embedding (something I saw in the source) enabled by default?

Hello,

What tokenization do you use? And what model options are you using (if any)?

  • Pretrained monolingual embeddings can make convergence faster but will not improve the final performance by much (see the sketch after this list).
  • I don’t know about this dataset. Maybe other users here have experience with it.
  • The linear transformation applied on the embeddings is not enabled by default.
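
If you still want to try pretrained embeddings, here is a minimal sketch of how they are usually plugged in. The -pre_word_vecs_enc / -pre_word_vecs_dec flags are from the legacy train.py options; the paths are placeholders, and the embedding files must first be converted to serialized tensors (OpenNMT-py ships a helper under tools/ for this, check your release):

    # hypothetical paths; the *.pt files are pre-converted embedding tensors,
    # not raw GloVe/word2vec text files
    python train.py -data data/hi-en -save_model hi-en-model \
        -pre_word_vecs_enc data/embeddings.enc.pt \
        -pre_word_vecs_dec data/embeddings.dec.pt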

Hello,

I don’t know what tokenization is applied at this point. I’m running OpenNMT-py mostly as a black box, hoping it will give me some baseline results. I didn’t run the data through the perl tokenization script - perhaps that’s one huge mistake.

Model options - I just used the usual source/target setup with 30 epochs and 5 layers (encoder and decoder), roughly as in the commands below.
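
For reference, a rough sketch of what I ran (file paths are placeholders, and the flag names are from the legacy preprocess.py/train.py scripts, so they may differ between releases):

    # hypothetical paths; vocabulary sizes left at their defaults here
    python preprocess.py -train_src data/train.hi -train_tgt data/train.en \
        -valid_src data/valid.hi -valid_tgt data/valid.en \
        -save_data data/hi-en

    # 5-layer encoder/decoder, 30 epochs, everything else at defaults
    python train.py -data data/hi-en -save_model hi-en-model \
        -layers 5 -epochs 30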

How do I enable this? In the source I can find an MLP and some PositionalEncoding, but reading the code is turning out to be cumbersome - is there some documentation?

Inconsistent tokenization could lead to such an issue (i.e. many target unknowns). You should consider applying a tokenization step, even a very basic one. We have some tools in OpenNMT-lua, but you can use any scripts that you find useful.
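
For example, a very basic pass with the Moses tokenizer script would already help (assuming you have the mosesdecoder scripts available; the -l option mainly selects language-specific rules such as the non-breaking prefix list):

    # scripts/tokenizer/tokenizer.perl from the mosesdecoder repository
    perl tokenizer.perl -l en < train.en > train.tok.en
    # Hindi may fall back to the default rules if no non-breaking prefix file exists for it
    perl tokenizer.perl -l hi < train.hi > train.tok.hi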

The linear transformation on the embeddings is controlled by the feat_merge option, but note that it won’t solve your issue.

How much would non-breaking prefixes matter? Can you point me to some literature/resource on this?

After some tinkering, I figured out that the prediction of so many unknowns was due to OpenNMT-py’s limited default vocabulary size.
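
Concretely, raising the vocabulary sizes at preprocessing time removed most of the unknowns. A sketch of what I ran (the 80k values are just what I tried, not a recommendation, and paths are placeholders):

    # larger source/target vocabularies than the default (50k per side, if I read the options correctly)
    python preprocess.py -train_src data/train.tok.hi -train_tgt data/train.tok.en \
        -valid_src data/valid.tok.hi -valid_tgt data/valid.tok.en \
        -save_data data/hi-en \
        -src_vocab_size 80000 -tgt_vocab_size 80000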

The training figures are still the same - only close to 30% accuracy, with most predictions being very similar to each other and not very close to the gold translations. My hunch is that this is a problem with the corpus I’m using - is there any way to do some analysis and conclusively assert that the corpus is the issue?
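
So far, as a first sanity check, I only put together a quick script of my own (not OpenNMT-specific; the file names and the 50k cut-off are just assumptions matching my setup) to see how much of the test-set vocabulary is covered by the training data:

    from collections import Counter

    def vocab(path, top_k=None):
        """Collect whitespace tokens from a tokenized corpus file."""
        counts = Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts.update(line.split())
        if top_k is None:
            return set(counts)
        return {w for w, _ in counts.most_common(top_k)}

    # hypothetical file names; top_k mimics the preprocessing vocabulary cut-off
    train_vocab = vocab("data/train.tok.en", top_k=50000)
    test_tokens = [tok
                   for line in open("data/test.tok.en", encoding="utf-8")
                   for tok in line.split()]

    oov = sum(1 for tok in test_tokens if tok not in train_vocab)
    print("test tokens:", len(test_tokens))
    print("OOV rate vs. training vocab: %.2f%%" % (100.0 * oov / len(test_tokens)))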

In what cases would OpenNMT and the underlying methods work or fail, with respect to the size and characteristics of the dataset?