Recommended training time, and how to tokenize


I’m a complete newbie at OpenNMT (and at Python), trying to train a TR->SAH (Turkish-Sakha) model with OpenNMT-py on Ubuntu 20.04. I trained my model for 1,000 steps (100,000 takes far too long) and still get very poor translations. Could the lack of a tokenization step be what makes my translations so poor?

I currently have training and vocabulary files. Should I use both of them, or are the training files alone enough?
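For reference, here is roughly the config I have in mind, based on the OpenNMT-py quickstart (all file paths are placeholders for my own TR/SAH files):

```yaml
# config.yaml -- paths below are placeholders for my own data
save_data: run/example
src_vocab: run/example.vocab.src
tgt_vocab: run/example.vocab.tgt

data:
    corpus_1:
        path_src: data/train.tr
        path_tgt: data/train.sah
    valid:
        path_src: data/valid.tr
        path_tgt: data/valid.sah
```

As far as I understand, I would then run `onmt_build_vocab -config config.yaml -n_sample -1` to build the vocabulary files, and `onmt_train -config config.yaml` to train (after adding `save_model`, `train_steps`, and `valid_steps` entries). Is that right?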

I also don’t really know what the minimum recommended numbers of training and validation steps are, how to use the validation files, or how to tokenize from the terminal.
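To check that I at least understand what tokenization is supposed to do, I wrote a toy greedy subword splitter (I gather real pipelines learn the subword vocabulary from data with SentencePiece, e.g. `spm_train --input=train.txt --model_prefix=sp --vocab_size=8000` and then `spm_encode --model=sp.model`). The vocabulary below is made up just for illustration; am I on the right track?

```python
# Toy greedy longest-match subword tokenizer, for illustration only.
# Real systems (SentencePiece/BPE) learn the subword inventory from data;
# this hand-picked vocabulary is purely a made-up example.
SUBWORDS = {"▁", "▁kitap", "lar", "ım", "▁ev", "de", "n"}

def tokenize(sentence: str, vocab=SUBWORDS) -> list[str]:
    """Split a sentence into the longest subword units found in vocab."""
    # Mark word boundaries with "▁", SentencePiece-style.
    text = "▁" + sentence.replace(" ", "▁")
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: keep it as a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("kitaplarım"))  # prints ['▁kitap', 'lar', 'ım']
```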

My laptop has an AMD graphics card, so I can only train on its CPU.

Thank you a lot, I appreciate it.

How big is your dataset?

I’m afraid you will struggle to train a proper model on a laptop CPU, though.
You may try building PyTorch for ROCm (the AMD equivalent of CUDA) to train on your GPU. We have never tested it, but it might be worth a try.

You may also try Google Colab, which provides free access to GPUs.


My dataset includes no more than 200 parallel sentences at the moment, but I will add more later.

By the way, I will try to train using ROCm.

You won’t get anywhere with so few examples. A proper NMT setup requires hundreds of thousands or even millions of examples.
You may want to look for research about ‘low-resource machine translation’, and read about topics such as pre-training, back-translation, etc.
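To make the back-translation idea concrete, here is a rough sketch; `translate_sah_to_tr` is a stand-in for a real reverse (SAH->TR) model, and the Sakha sentences are just illustrative:

```python
# Sketch of back-translation for data augmentation.
# Idea: use a reverse (target->source) model to translate monolingual
# target-language text, producing synthetic source/target training pairs.

def translate_sah_to_tr(sentence: str) -> str:
    """Placeholder for a trained Sakha->Turkish model (hypothetical)."""
    return "<synthetic TR for: " + sentence + ">"

def back_translate(monolingual_sah: list[str]) -> list[tuple[str, str]]:
    """Turn monolingual target-side text into synthetic (src, tgt) pairs."""
    pairs = []
    for sah_sentence in monolingual_sah:
        synthetic_tr = translate_sah_to_tr(sah_sentence)
        # The synthetic Turkish is the source; the real Sakha is the target.
        pairs.append((synthetic_tr, sah_sentence))
    return pairs

corpus = ["Дорообо!", "Хайдах олороҕун?"]  # illustrative monolingual Sakha
augmented = back_translate(corpus)
```

The synthetic pairs are then mixed into the parallel training data; since the target side is genuine text, the model still learns to produce fluent output even though the source side is machine-generated.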