Hello everyone. I have a problem with a Russian-English translation model, which produces senseless output.
Several examples:
src: Меня зовут Антон
translation: Lei Scorpii
correct: My name is Anton
src: Джадд Трамп выиграл свой первый титул чемпиона мира по снукеру в 2019
translation: The other day gave birth to a rich girl, bearing 201.
correct: Judd Trump won his first World Snooker Championship title in 2019
src: Я буду ждать тебя у оперного театра
translation: I know a lot of things.
correct: I will wait for you by the opera house
Preprocessing steps for training the model (several corpora were used):
- Cynical data selection for the ParaCrawl corpus (this corpus only)
- Tokenization of all corpora using the Moses tokenizer
- Formal cleaning of the datasets (removing pairs where both sides are identical, overly long sentences, sentences with a large share of non-Cyrillic/non-Latin characters, etc.)
- Application of BPE
- Final preprocessing using preprocess.py from OpenNMT
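To make the BPE step above concrete, here is a minimal stdlib-only sketch of how subword segmentation works. The real pipeline uses subword-nmt with merges learned from the training corpora; the `MERGES` table, the English example word, and the function names here are made up purely for illustration.

```python
# Toy illustration of BPE application (not the real subword-nmt code).
# MERGES is a hypothetical, priority-sorted merge table.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]

def segment(word, merges):
    """Greedily apply BPE merges (assumed priority-sorted) to one word."""
    pieces = list(word)
    changed = True
    while changed:
        changed = False
        for a, b in merges:
            i = 0
            while i < len(pieces) - 1:
                if pieces[i] == a and pieces[i + 1] == b:
                    pieces[i:i + 2] = [a + b]  # merge the adjacent pair
                    changed = True
                else:
                    i += 1
    return pieces

def to_bpe_tokens(word, merges):
    """Mark non-final subword pieces with the '@@' continuation marker."""
    pieces = segment(word, merges)
    return [p + "@@" for p in pieces[:-1]] + [pieces[-1]]

print(to_bpe_tokens("lower", MERGES))  # ['low@@', 'er']
```

The important point is that the model's vocabulary consists of these subword pieces, so any text fed to the model at inference time must be segmented with the exact same merge table.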
For inference we use the following pipeline (the WMT set was used for inference):
- Tokenization of the WMT data
- Application of BPE to the WMT data
- Removal of BPE characters and detokenization
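The last step of the inference pipeline (stripping the BPE continuation markers before detokenization) can be sketched as follows. This assumes the standard `@@ ` marker convention of subword-nmt; detokenization itself would be done with the Moses detokenizer, which is not shown here.

```python
import re

def remove_bpe(text):
    """Join subword pieces marked with the '@@' continuation marker,
    e.g. 'An@@ ton' -> 'Anton'. Assumes subword-nmt's marker convention."""
    return re.sub(r"@@( |$)", "", text)

print(remove_bpe("My name is An@@ ton"))  # My name is Anton
```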
For translation on the server we use tokenization only (different implementations of tokenization were tested, e.g. sacremoses, pyonmttok, and razdel (a Python tokenization library for Russian)), but the result is the same.
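One quick way to check whether the server-side preprocessing matches what the model was trained on is to measure how many of the produced tokens actually exist in the model's vocabulary; tokens outside it become `<unk>`, and a high `<unk>` rate typically yields exactly this kind of unrelated output. A hypothetical sketch (the `VOCAB` set, token lists, and `vocab_coverage` helper are all made up for illustration):

```python
# Hypothetical diagnostic: a model trained on BPE-segmented text has a
# vocabulary of subword pieces, so word-level tokens may not be in it.
VOCAB = {"My", "name", "is", "An@@", "ton"}  # toy stand-in vocabulary

def vocab_coverage(tokens, vocab):
    """Fraction of tokens the model has actually seen during training."""
    return sum(t in vocab for t in tokens) / len(tokens)

word_level = ["My", "name", "is", "Anton"]   # tokenization only, no BPE
bpe_level = ["My", "name", "is", "An@@", "ton"]

print(vocab_coverage(word_level, VOCAB))  # 0.75
print(vocab_coverage(bpe_level, VOCAB))   # 1.0
```

Running such a check on the real vocabulary file would show whether the server input is reaching the model in the same form as the training data.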