Pre-train encoder-side UTF-16 word embeddings in OpenNMT-py?

Hello,
I’m training an ancient Greek<->English model. I would like to use pre-trained embeddings on the encoder side only. I have a couple of questions.

  1. Is it possible to train a Greek<->Greek model in OpenNMT-py, extract the embeddings, and then use those embeddings in another OpenNMT model? I tried using extract_embeddings.py and then embeddings_to_torch.py to create the encoder-side embeddings I would need for the -pre_word_vecs_enc option.

Both scripts seem to assume UTF-8, while my Greek text is encoded as UTF-16. I tried changing the encode() and decode() calls to UTF-16, but kept getting errors that some bytes still could not be decoded. The sketch below shows the workaround I’m considering instead.
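In case it helps frame the question: rather than patching the scripts, I’m thinking of just re-encoding my files as UTF-8 before running them, so the stock tools work unmodified. A minimal sketch (file names are placeholders for my actual data):

```python
# Re-encode a UTF-16 file as UTF-8 so the stock OpenNMT-py scripts,
# which assume UTF-8, can read it without modification.

def utf16_to_utf8(src_path, dst_path):
    # "utf-16" honours a BOM; use "utf-16-le" / "utf-16-be" explicitly
    # if the files were written without one.
    with open(src_path, "r", encoding="utf-16") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line)

utf16_to_utf8("greek.utf16.txt", "greek.utf8.txt")  # placeholder paths
```

Since UTF-8 covers the full Greek range, nothing should be lost in the conversion, but I’d welcome confirmation that this is a reasonable approach.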

  2. What is another method for using pre-trained word embeddings only on the encoder side in OpenNMT? I have read the examples of using word2vec embeddings, but since those tutorials still rely on embeddings_to_torch.py, I’m concerned they won’t work for a UTF-16 language either (see the sketch below for what I have in mind).
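For reference, here is a sketch of the alternative I’m considering, assuming gensim is available; paths and hyperparameters are illustrative, and the header line that word2vec’s text format writes may need to be stripped depending on what embeddings_to_torch.py expects:

```python
# Train word2vec on the (already UTF-8-converted) Greek source corpus
# and save the vectors in plain-text format.
from gensim.models import Word2Vec

# One tokenized sentence per line, encoded as UTF-8.
with open("greek.utf8.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

# vector_size is the gensim >= 4 name; older versions call it "size".
model = Word2Vec(sentences, vector_size=500, window=5, min_count=1)
model.wv.save_word2vec_format("greek.emb.txt", binary=False)
```

The resulting greek.emb.txt would then go through embeddings_to_torch.py, and the output would be passed to training via -pre_word_vecs_enc. Does that sound right?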

Thank you!
