OpenNMT PyTorch - Using FastText Pretrained Embeddings: Tutorial for Beginners

Step 1: Install fastText

git clone https://github.com/facebookresearch/fastText.git

cd fastText

make

Step 2: Train fastText

Create a directory called result in the fasttext directory.

Please prepare a large text file for training.

Training

…/fasttext skipgram -input ./emb_data.txt -output result/model

emb_data.txt: the large text file used for training

-output: the directory path and name prefix of the model files to be created

As a result, *.vec and *.bin files are created.
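The *.vec file is a plain-text file in word2vec format: the first line holds the vocabulary size and the embedding dimension, and each following line holds a word and its vector. Here is a minimal Python sketch to sanity-check the output, assuming the model was saved as result/model as in the command above:

    # Sanity-check the generated .vec file (plain-text word2vec format).
    # Assumes fastText was run with "-output result/model" as above.
    with open("result/model.vec", encoding="utf-8") as f:
        vocab_size, dim = map(int, f.readline().split())  # header line: "<n_words> <dim>"
        first = f.readline().split()
        print(f"{vocab_size} words, {dim} dimensions")
        print("first word:", first[0])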

Learn about key options

./fasttext skipgram -input data/fil9 -output result/fil9 -minn 2 -maxn 5 -dim 300

./fasttext skipgram -input data/fil9 -output result/fil9 -epoch 1 -lr 0.5

./fasttext skipgram -input data/fil9 -output result/fil9 -thread 4


The most important options:

-dim: the embedding dimension

skipgram or cbow: your choice!
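If you prefer to work from Python, the same options are exposed by the official fastText Python bindings (pip install fasttext). A rough equivalent of the commands above, as a sketch (parameter names follow the bindings' train_unsupervised API):

    import fasttext

    # Roughly equivalent to:
    #   ./fasttext skipgram -input emb_data.txt -output result/model -minn 2 -maxn 5 -dim 300
    model = fasttext.train_unsupervised(
        "emb_data.txt",
        model="skipgram",  # or "cbow" - your choice
        dim=300,           # embedding dimension
        minn=2, maxn=5,    # character n-gram range
        epoch=5, lr=0.05,  # same defaults as the CLI; tune like -epoch / -lr above
        thread=4,
    )
    model.save_model("result/model.bin")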


Usage

…/fasttext skipgram -input ko_emb_data.txt -output result_dimension_512/ko-vec-pcj -minn 2 -maxn 5 -dim 512

Step 3 (optional): Explore other fastText functionality

Visually check words

echo "asparagus pidgey yellow" | ./fasttext print-word-vectors result/fil9.bin

Try

echo "YOUR WORD" | …/…/fasttext print-word-vectors YOUR_MODEL_BIN_FILE

Nearest neighbor queries

./fasttext nn result/fil9.bin

Try

…/…/fasttext nn YOUR_MODEL_BIN_FILE

Word analogies

./fasttext analogies result/fil9.bin

Try

…/…/fasttext analogies YOUR_MODEL_BIN_FILE
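The same queries can also be scripted through the fastText Python bindings, which is handy for batch checks. A small sketch, assuming a model saved as result/fil9.bin:

    import fasttext

    model = fasttext.load_model("result/fil9.bin")

    # print-word-vectors equivalent
    print(model.get_word_vector("asparagus")[:5])

    # nn (nearest-neighbor) equivalent
    print(model.get_nearest_neighbors("asparagus", k=5))

    # analogies equivalent (berlin - germany + france)
    print(model.get_analogies("berlin", "germany", "france", k=5))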



Now let's look at how to use the pretrained embeddings in OpenNMT-py.


Step 1: Preprocess the data

Preprocess

python3 …/…/…/preprocess.py -train_src ./src-train.txt -train_tgt ./tgt-train.txt -valid_src ./src-val.txt -valid_tgt ./tgt-val.txt -save_data ./data -src_vocab_size 32000 -tgt_vocab_size 32000
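preprocess.py writes the serialized train/valid shards plus a data.vocab.pt file; the latter is what Step 2 passes as -dict_file. Its exact internal layout differs between OpenNMT-py releases, so here is just an exploratory sketch to see what was produced:

    import torch

    # data.vocab.pt is created by preprocess.py (-save_data ./data).
    # Its structure depends on the OpenNMT-py version (a list of (side, vocab)
    # pairs in older releases, a dict of torchtext fields in newer ones),
    # so simply load and inspect it.
    vocab = torch.load("data.vocab.pt")
    print(type(vocab))
    print(vocab)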

Step 2: Prepare the embeddings

Run a command like the ones below:

python3 …/…/…/tools/embeddings_to_torch.py -emb_file_both "…/…/ko_embedding/ko.vec" -dict_file ./data.vocab.pt -output_file "./embeddings"

python3 …/…/…/tools/embeddings_to_torch.py -emb_file_both "YOUR_VEC_FILE" -dict_file YOUR_VOCAB_PT_FILE -output_file "./embeddings"

As a result, embeddings.enc.pt and embeddings.dec.pt are created.
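Before training, it is worth checking that the saved tensors have the expected shape, i.e. (vocabulary size, embedding dimension); the dimension must match the -word_vec_size used below. A small sketch, assuming each .pt file holds a single embedding tensor (which is what embeddings_to_torch.py saves at the time of writing):

    import torch

    # Each file should contain one embedding matrix of shape (vocab_size, dim).
    # dim must equal -word_vec_size in the training command (512 here).
    for path in ("embeddings.enc.pt", "embeddings.dec.pt"):
        emb = torch.load(path)
        print(path, tuple(emb.shape))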


Step 3: Transformer training with the pretrained embeddings

Command

python3 …/…/train.py -data ./data -save_model ./model/model -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding -train_steps 500000 -max_generator_batches 2 -dropout 0.1 -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 -pre_word_vecs_enc "./embeddings.enc.pt" -pre_word_vecs_dec "./embeddings.dec.pt" -world_size 8 -gpu_ranks 0 1 2 3 4 5 6 7 -log_file ./log &

Key parameters:

-pre_word_vecs_enc "./embeddings.enc.pt" -pre_word_vecs_dec "./embeddings.dec.pt"

Note that -word_vec_size must match the -dim used when training the fastText embeddings (512 in this example).


So far, this was a simple pretrained-embedding tutorial using fastText. :grinning:


Can the same method be used on pretrained BERT embeddings?

And then train the Transformer on the BERT embeddings?

You could try Cross-lingual Language Model Pretraining, which is based on BERT. It is currently not supported by OpenNMT, but the repo has everything needed to train a translation model.

Thanks for the tutorial!

I have got some silly doubts as I am new to NMT.

  1. Will the training of the fastText model (Step 2) be done separately for the source and target languages, or will the large text file you are talking about contain both source and target languages in the same file, such that eventually 'embeddings.enc.pt' is generated for the source language and 'embeddings.dec.pt' for the target language?

  2. If training of the fastText model should be done separately for the source and target languages, then how should the embeddings be prepared separately for the encoder and decoder, given that Step 2 (Prepare the embeddings) takes only the single vector file generated when the fastText model is trained?

  3. When we do the preprocessing of the training and validation sets, I guess (please correct me if I am wrong) that by default OpenNMT uses a SentencePiece model for tokenization; in that case, what would the fastText embeddings for subwords look like at training time?

  4. How should named entities be handled in OpenNMT? Is there any tag that we have to provide for named entities in the training dataset?
    Currently, I am thinking of identifying named entities using a pretrained transformer NER model and replacing them with their tags (for example ORG or PER) in the training and validation datasets. Please give your suggestions!

Thanks!