I’m a complete newbie at OpenNMT (and also at Python), trying to train a TR->SAH (Turkish-Sakha) model with OpenNMT-py on Ubuntu 20.04. I trained my model for 1,000 steps (100,000 is too much for my hardware) and still get poor translation results. Is it the lack of tokenization that makes my translations so poor?
I generally use both training and vocabulary files. Should I use both of them, or are the training files alone enough?
I also don’t really know what the recommended minimum numbers of training and validation steps are, how to use the validation files, or how to tokenize from the terminal.
My laptop’s graphics card is an AMD one, so I can only train on the CPU.
I’m afraid you will struggle to train a proper model on a laptop CPU, though.
You may try to build PyTorch for ROCm (the AMD equivalent of CUDA) to train on your GPU. We never tested it, but it might be worth a try.
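If you go that route, note that recent PyTorch releases ship prebuilt ROCm wheels, so you may not need to build from source. A rough sketch (the ROCm version tag in the URL is an assumption — check https://pytorch.org/get-started/locally/ for the current command):

```bash
# Prebuilt ROCm wheel; the rocm5.6 version tag is a placeholder --
# use whatever https://pytorch.org/get-started/locally/ currently lists.
# Note that ROCm officially supports only a limited set of (mostly
# discrete desktop) GPUs, so a laptop chip may not be supported at all.
pip3 install torch --index-url https://download.pytorch.org/whl/rocm5.6

# ROCm builds expose the GPU through the regular torch.cuda API:
python3 -c "import torch; print(torch.cuda.is_available())"
```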
You may also try Google Colab, which provides free access to GPUs.
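In case it helps, getting started there is roughly the following (this assumes a GPU runtime is enabled via Runtime > Change runtime type; `OpenNMT-py` is the actual pip package name):

```bash
# Run inside a Colab notebook cell (the leading ! escapes to the shell):
!nvidia-smi              # confirm a GPU is actually attached
!pip install OpenNMT-py  # installs onmt_build_vocab, onmt_train, etc.
```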
You won’t get anywhere with so few examples. A proper NMT setup requires hundreds of thousands, or even millions, of sentence pairs.
You may want to look into research on ‘low-resource machine translation’ and read about topics such as pre-training, back-translation, etc.
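Coming back to the original questions about vocabulary files, validation files, and terminal tokenization: a minimal OpenNMT-py 2.x workflow might look roughly like the sketch below. All file names, the vocabulary size, and the step counts are placeholders for illustration — the OpenNMT-py quickstart is the authoritative reference. The short answers: you need both (the vocabulary files are built from the tokenized training files before training starts), validation data is simply a held-out pair of files listed under `valid`, and SentencePiece gives you tokenization from the terminal.

```bash
# 1) Train one shared SentencePiece model on both languages, then
#    tokenize every file with it (names and vocab size are placeholders).
spm_train --input=train.tr,train.sah --model_prefix=spm_trsah \
          --vocab_size=8000 --character_coverage=1.0
spm_encode --model=spm_trsah.model < train.tr  > train.tok.tr
spm_encode --model=spm_trsah.model < train.sah > train.tok.sah
spm_encode --model=spm_trsah.model < valid.tr  > valid.tok.tr
spm_encode --model=spm_trsah.model < valid.sah > valid.tok.sah

# 2) A minimal OpenNMT-py 2.x config: the training corpus, a held-out
#    validation set (the special 'valid' entry), and the vocab paths.
cat > config.yaml <<'EOF'
save_data: run/trsah
src_vocab: run/trsah.vocab.src
tgt_vocab: run/trsah.vocab.tgt
data:
    corpus_1:
        path_src: train.tok.tr
        path_tgt: train.tok.sah
    valid:
        path_src: valid.tok.tr
        path_tgt: valid.tok.sah
train_steps: 100000          # 1000 is far too few; expect this to be very slow on CPU
valid_steps: 5000            # score the validation set every 5000 steps
save_checkpoint_steps: 5000
EOF

# 3) Build the vocabulary files from the training data, then train.
onmt_build_vocab -config config.yaml -n_sample -1
onmt_train -config config.yaml
```

Translation afterwards would be `onmt_translate -model <checkpoint> -src test.tok.tr -output pred.sah` on SentencePiece-tokenized input, followed by detokenization with `spm_decode`.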