Build custom Dataset for a custom language

Hi community, I am planning to train a network for a non-standard language (native to the town where I live). Looking at the OpenNMT-py tutorial, my first problem was that it points me to download English-German datasets, for example, and although I know that is for tutorial purposes, I have not found a way to create my own dataset for the language in question and then train the model. I would like to know whether it is possible to use OpenNMT-py for an unknown language. Thanks.

OpenNMT is a framework for training machine translation models, not a tool for collecting datasets. You will need to rely on your own skills and resources to collect bitexts pairing your non-standard language with whatever other language you would like to translate to (English?). Once you have the datasets, you can start the process of training a model.


That’s my question: how do I create these datasets? Is there a specific structure for OpenNMT? I already have the phrases and words of the language in question paired with Spanish (for now it will only be unidirectional, Lang->Spanish).

https://opennmt.net/OpenNMT-py/quickstart.html#step-1-prepare-the-data

Basically you just need two files. They have to have the same number of lines, with each line corresponding to the src and tgt respectively.
So for a 2-line bitext, you have something like:

en.txt:
Happy birthday!
Hello

es.txt:
¡Feliz cumpleaños!
Hola
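
If you have your collected pairs in a single spreadsheet-like file, a few lines of Python are enough to turn them into these two aligned files. This is just a minimal sketch, assuming a tab-separated file named pairs.tsv (that name, the 90/10 split and the .src/.tgt extensions are my own choices, not anything OpenNMT requires):

import random

# Read (source-language, Spanish) pairs from a tab-separated file.
pairs = []
with open("pairs.tsv", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2 and parts[0] and parts[1]:
            pairs.append(parts)

random.seed(42)
random.shuffle(pairs)
split = int(len(pairs) * 0.9)  # 90% train, 10% validation

def write_parallel(subset, prefix):
    # Line i of <prefix>.src must be the translation of line i of <prefix>.tgt.
    with open(f"{prefix}.src", "w", encoding="utf-8") as src, \
         open(f"{prefix}.tgt", "w", encoding="utf-8") as tgt:
        for s, t in subset:
            src.write(s + "\n")
            tgt.write(t + "\n")

write_parallel(pairs[:split], "train")
write_parallel(pairs[split:], "valid")

The resulting train/valid files are what you point the quickstart config at (the path_src/path_tgt entries in the YAML, in recent OpenNMT-py versions).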

There is typically some preprocessing involved, such as tokenization or subword segmentation. It's up to you to figure out what works best for your language pair. You will generally need at least a few hundred thousand lines to get good results.
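
For instance, subword segmentation with SentencePiece is a common choice for low-resource languages. A rough sketch (the file names, the shared model for both sides, and the vocab size are assumptions to adapt):

import sentencepiece as spm

# Train one subword model on both sides of the training data (assumed file names).
spm.SentencePieceTrainer.train(
    input="train.src,train.tgt",
    model_prefix="spm_lang_es",
    vocab_size=8000,          # small corpora usually need a smaller vocab
    character_coverage=1.0,   # keep every character of the low-resource language
)

sp = spm.SentencePieceProcessor()
sp.load("spm_lang_es.model")

def tokenize_file(path_in, path_out):
    # Replace each line with its space-joined subword pieces.
    with open(path_in, encoding="utf-8") as fin, \
         open(path_out, "w", encoding="utf-8") as fout:
        for line in fin:
            pieces = sp.encode_as_pieces(line.strip())
            fout.write(" ".join(pieces) + "\n")

for name in ("train.src", "train.tgt", "valid.src", "valid.tgt"):
    tokenize_file(name, name + ".sp")

Recent OpenNMT-py versions can also apply SentencePiece on the fly through transforms in the training config, so it's worth checking the current docs before pre-tokenizing everything yourself.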
