Pre-trained word embeddings

I am new to OpenNMT. I am trying to train on English-Arabic data. I managed to create a model using GloVe pre-trained embeddings via the '-emb_file_both' option, but the accuracy was very bad, I guess because the file contained only English vocabulary. My questions are:

  1. How can I add pre-trained Arabic embeddings? Do I have to combine them with the GloVe file and feed in a single file that contains both languages' word embeddings?
  2. Do you keep track of model accuracies for different languages trained on OpenNMT by different users? It would be interesting for OpenNMT users to compare their results against baselines.
  3. Is there a way for an OpenNMT model to learn word embeddings during training, instead of using pre-trained word embeddings that may cover fewer of the words in the corpus?

I would appreciate it if you could provide a tutorial on using different word embeddings in OpenNMT.

Thank you

Hi,

  1. You can have separate embeddings for the encoder and decoder sides (see the sketch after this list).

  2. Unfortunately not. Also, any metric depends heavily on the setup and on the training, validation, and test data, so results wouldn't necessarily be comparable in a straightforward way. Your best bet is to try to replicate a paper or task, if you can find one with detailed setup information.

  3. Not sure if I understand this correctly. By default, if you do not provide any pre-trained embeddings, these will be learned jointly with the other parameters (the training sketch below covers both cases).
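Regarding (1), here is a minimal sketch, assuming the OpenNMT-py `embeddings_to_torch.py` script from the FAQ (the same script that provides `-emb_file_both`). The file names are placeholders: GloVe vectors for the English (encoder) side and, for example, fastText vectors for the Arabic (decoder) side. Flag names can change between versions, so check the script's `--help`:

```
# Sketch only: convert separate English and Arabic vector files into
# per-side embedding tensors aligned with the preprocessed vocabulary.
python tools/embeddings_to_torch.py \
    -emb_file_enc glove.6B.300d.txt \
    -emb_file_dec cc.ar.300.vec \
    -dict_file data/enar.vocab.pt \
    -output_file data/enar_embeddings
```

This should produce one tensor per side (e.g. `data/enar_embeddings.enc.pt` and `data/enar_embeddings.dec.pt`), so the English and Arabic vectors never need to be merged into a single file.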

As for a tutorial, you can have a look at the FAQ entry, the dedicated options, or search for similar topics in the forum.
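And a hedged sketch of the corresponding training command, again with placeholder paths. The `-pre_word_vecs_enc` / `-pre_word_vecs_dec` options load the tensors produced above; if you drop them, you get the default behaviour from (3), i.e. embeddings initialized randomly and learned jointly with the rest of the model:

```
# Sketch only: load the per-side pre-trained embedding tensors at training time.
# Omitting the two -pre_word_vecs_* options means the embeddings are simply
# learned from scratch during training.
python train.py -data data/enar \
    -save_model data/enar_model \
    -word_vec_size 300 \
    -pre_word_vecs_enc data/enar_embeddings.enc.pt \
    -pre_word_vecs_dec data/enar_embeddings.dec.pt
```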