I’m trying to reuse what works for de-en on hi-en, a different language pair. While de-en at least manages to overfit the training data and gives me decent translation results, hi-en isn’t going beyond 35-45% accuracy and a perplexity of around 90. Feeling overly optimistic, I ran translation on the test set as well - nearly everything comes out as unknown tokens.
I’m using this dataset.
What could be the possible reasons? I also have the following doubts at this point.
- Could an embedding learnt from monolingual corpora help with this?
- Could it be an issue with the nature of the dataset?
- Is the MLP that learns the embedding (something I saw in the source) enabled by default?