OpenNMT Forum

OpenNMT Pytorch - Using FastText Pretrained Embedding Tutorial for beginner


Step 1: Install fastText

git clone https://github.com/facebookresearch/fastText.git

cd fastText

make


Step 2: Train fastText

Create a directory called result inside the fastText directory.

Please prepare a large text file for training.


./fasttext skipgram -input ./emb_data.txt -output result/model

-input: the large text file prepared for training (here emb_data.txt).

-output: the directory path and base name of the model files to be created.

As a result, a *.vec and a *.bin file are created (here result/model.vec and result/model.bin).
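The *.vec file is plain text: the first line holds the vocabulary size and the dimension, and each following line holds a word and its vector. A minimal sketch of reading it (using a tiny hand-written file in place of the real result/model.vec):

```python
# Parse a fastText .vec file: first line is "vocab_size dim",
# each following line is "word v1 v2 ... v_dim".
def read_vec(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        vocab_size, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, dim, vectors

# Tiny example file standing in for result/model.vec
with open("toy.vec", "w", encoding="utf-8") as f:
    f.write("2 3\n")
    f.write("hello 0.1 0.2 0.3\n")
    f.write("world 0.4 0.5 0.6\n")

vocab_size, dim, vectors = read_vec("toy.vec")
print(vocab_size, dim, vectors["hello"])  # 2 3 [0.1, 0.2, 0.3]
```

The dim value read from this header is the one that has to line up with the embedding size you train OpenNMT with later.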

Learn about the key options:

./fasttext skipgram -input data/fil9 -output result/fil9 -minn 2 -maxn 5 -dim 300

-minn / -maxn: minimum and maximum length of the character n-grams.

./fasttext skipgram -input data/fil9 -output result/fil9 -epoch 1 -lr 0.5

-epoch / -lr: number of passes over the data and the learning rate.

./fasttext skipgram -input data/fil9 -output result/fil9 -thread 4

-thread: number of threads used for training.

The most important option is -dim, the embedding dimension: it must match the -word_vec_size you later pass to OpenNMT-py (512 in this tutorial).

skipgram or cbow: your choice!


…/fasttext skipgram -input ko_emb_data.txt -output result_dimension_512/ko-vec-pcj -minn 2 -maxn 5 -dim 512

Step 3 (optional): Explore additional functionality

Print word vectors to inspect them visually

echo "asparagus pidgey yellow" | ./fasttext print-word-vectors result/fil9.bin


echo "YOUR WORD" | …/…/fasttext print-word-vectors YOUR_MODEL_BIN_FILE

Nearest neighbor queries

./fasttext nn result/fil9.bin


…/…/fasttext nn YOUR_MODEL_BIN_FILE

Word analogies

./fasttext analogies result/fil9.bin


…/…/fasttext analogies YOUR_MODEL_BIN_FILE

Now let's look at how to use the pretrained embeddings in OpenNMT-py.

Step 1: Preprocess the data


python3 …/…/…/ -train_src ./src-train.txt -train_tgt ./tgt-train.txt -valid_src ./src-val.txt -valid_tgt ./tgt-val.txt -save_data ./data -src_vocab_size 32000 -tgt_vocab_size 32000
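The -src_vocab_size / -tgt_vocab_size options cap each vocabulary at the 32,000 most frequent tokens. A rough sketch of that frequency cutoff (simplified: the real preprocessing also adds special tokens such as the unknown and padding symbols):

```python
from collections import Counter

def build_vocab(lines, max_size):
    """Keep the max_size most frequent tokens, most common first."""
    counts = Counter(tok for line in lines for tok in line.split())
    return [tok for tok, _ in counts.most_common(max_size)]

# Toy corpus standing in for src-train.txt
corpus = ["the cat sat", "the cat ran", "a dog ran"]
vocab = build_vocab(corpus, max_size=4)
print(vocab)  # ['the', 'cat', 'ran', 'sat']
```

Words that fall outside this cutoff are mapped to the unknown token at training time, so they will not receive a pretrained embedding row either.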

Step 2: Prepare embedding

Run a command like the one below:

python3 …/…/…/tools/ -emb_file_both "…/…/ko_embedding/ko.vec" -dict_file ./ -output_file "./embeddings"

python3 …/…/…/tools/ -emb_file_both "YOUR_VEC_FILE" -dict_file YOUR_VOCAB_PT_FILE -output_file "./embeddings"

As a result, the encoder and decoder embedding files are created.
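Conceptually, this conversion step matches each word of the OpenNMT vocabulary against the pretrained .vec file: when the word is found, its pretrained vector is copied; otherwise the row falls back to a random initialization. A simplified sketch of that matching (plain Python lists instead of torch tensors, with hypothetical toy data):

```python
import random

def build_embedding_matrix(vocab, pretrained, dim):
    """One row per vocab word: pretrained vector if available, else random init."""
    matrix, hits = [], 0
    for word in vocab:
        if word in pretrained:
            matrix.append(pretrained[word])
            hits += 1
        else:
            # Out-of-vocabulary fallback: small random values
            matrix.append([random.uniform(-0.1, 0.1) for _ in range(dim)])
    print(f"matched {hits}/{len(vocab)} words")
    return matrix

vocab = ["<unk>", "hello", "world"]                       # from the vocab .pt file
pretrained = {"hello": [0.1, 0.2], "world": [0.3, 0.4]}   # e.g. loaded from ko.vec
matrix = build_embedding_matrix(vocab, pretrained, dim=2)
```

The "matched N/M words" coverage is worth checking: a low match rate usually means a tokenization mismatch between the embedding corpus and the translation corpus.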

Step 3: Transformer Training with Pretrained embedding


python3 …/…/ -data ./data -save_model ./model/model -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding -train_steps 500000 -max_generator_batches 2 -dropout 0.1 -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 -pre_word_vecs_enc "./" -pre_word_vecs_dec "./" -world_size 8 -gpu_ranks 0 1 2 3 4 5 6 7 -log_file ./log &
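The -decay_method noam / -warmup_steps 8000 / -learning_rate 2 combination follows the schedule from the original Transformer paper: the rate rises linearly for warmup_steps and then decays with the inverse square root of the step. A sketch of that formula (the noam_lr name is mine; OpenNMT-py's internal scaling may differ slightly, and rnn_size 512 is used as the model dimension to match the command above):

```python
def noam_lr(step, model_dim=512, warmup=8000, factor=2.0):
    """Noam schedule: linear warmup, then inverse-square-root decay."""
    return factor * model_dim ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Rate climbs during warmup, peaks around step 8000, then decays.
print(noam_lr(1), noam_lr(8000), noam_lr(100000))
```

This is why -learning_rate 2 looks so large: it is a multiplicative factor on a very small base, not the actual step size.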

Key parameters:
-pre_word_vecs_enc "./" -pre_word_vecs_dec "./"

So far, this was a simple pretrained embedding tutorial using fastText. :grinning:

Can the same method be used with pretrained BERT embeddings, and then train the Transformer on those BERT embeddings?

You could try Cross-lingual Language Model Pretraining (XLM), which is based on BERT. It is currently not supported by OpenNMT, but the repo has everything needed to train a translation model.