I have been working on a Finnish-to-English translation project using OpenNMT-tf on Linux.
Initially, with a small dataset, the model's accuracy was fairly good, around 60-70%. But as we increased the data, the accuracy fell drastically, to 16-17%, and many words and numbers were missing from the target file.
The dataset had one sentence per line and was tokenized. We used the same code from GitHub that is available for German-to-English translation; no features were modified.
Can you suggest why the model failed, and also whether the code from GitHub can be used directly to translate Finnish-to-English text, or indeed any language pair?
A beginner in the machine learning field
Can you give more details? The dataset size, the preprocessing, the vocabulary size, etc.
Are you referring to these scripts?
Thanks for writing back.
Sorry for the mistake; we have actually been using the OpenNMT-py model.
So the scripts we have been using were from: https://github.com/OpenNMT/OpenNMT-
Initially we used 2k lines of data for training, which gave us an accuracy of 70%. Later, when we trained with 200,000 (2 lakh) lines, the accuracy turned out to be 16%.
The default vocabulary size of 50k was used for both the source and target files.
As part of preprocessing, we just tokenized the data on a delimiter and then preprocessed it further using the commands given in OpenNMT.
Let me know if you need any further information. We are just hoping to find a solution and develop a better model.
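One likely culprit is the 50k vocabulary cap: Finnish is morphologically rich, so 200k sentences can easily contain far more than 50k distinct word forms, and any word outside the top 50k is replaced by `<unk>` at preprocessing time, which would explain words and numbers going missing from the output. A minimal sketch of that effect (the corpus and numbers here are illustrative, not from the actual data):

```python
from collections import Counter

def unk_rate(sentences, vocab_size):
    """Fraction of running tokens that fall outside the top-`vocab_size` word types."""
    counts = Counter(tok for s in sentences for tok in s.split())
    kept = {w for w, _ in counts.most_common(vocab_size)}
    total = sum(counts.values())
    oov = sum(c for w, c in counts.items() if w not in kept)
    return oov / total

# Toy corpus: inflected variants of "talo" (house) inflate the type count,
# as Finnish case endings do at scale.
corpus = [
    "talo on iso", "talossa on kissa", "talosta tuli ääni",
    "taloon meni mies", "talolla seisoo auto",
]

# With a tight vocab cap, the rarer inflected forms all become <unk>.
print(unk_rate(corpus, vocab_size=5))
```

On a real 200k-sentence Finnish corpus you can run the same count over the actual training file to see how much of the text a 50k vocabulary truly covers.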
The preprocessing included tokenizing the data on spaces, with one sentence per line. And the vocabulary size was the default 50k, as mentioned above.
Our customers have been very pleased with the Dutch-English translations made by a Transformer model (TensorFlow) trained with data first processed with SentencePiece. Have you looked at that?
I haven’t seen any such material yet. Could you please share the link with me?
Everything you need to know to get started is here: https://github.com/google/sentencepiece
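To illustrate the idea behind subword segmentation (the real SentencePiece library learns its piece inventory from your corpus; the pieces below are hand-picked for the toy example), a greedy longest-match segmenter shows how inflected Finnish forms decompose into shared pieces, so the model sees "talo" + a case ending instead of many rare whole words:

```python
def segment(word, pieces):
    """Greedy longest-match segmentation into known subword pieces.
    Falls back to single characters, so it never produces <unk>."""
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in pieces or j == i + 1:
                out.append(word[i:j])
                i = j
                break
    return out

# Hand-picked toy pieces; SentencePiece would learn these from data.
pieces = {"talo", "ssa", "sta", "lla", "on", "kissa"}

print(segment("talossa", pieces))  # ['talo', 'ssa']
print(segment("talosta", pieces))  # ['talo', 'sta']
```

The practical upshot: with subwords, a fixed vocabulary (often 8k-32k pieces) covers the whole corpus, which is why it helps morphologically rich languages like Finnish far more than whitespace tokenization with a 50k word cap.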
I shall go through it.