
OpenNMT Pytorch - Using FastText Pretrained Embeddings: A Tutorial for Beginners


Step 1: Install fastText

git clone https://github.com/facebookresearch/fastText.git

cd fastText

make

Step 2: Train fastText

Create a directory called result inside the fastText directory.

Prepare a large plain-text file to train on.

Training

…/fasttext skipgram -input ./emb_data.txt -output result/model

-input ./emb_data.txt: the large text file used for training

-output result/model: the directory path and name prefix for the model files that will be created

As a result, *.vec and *.bin files are created.

Learn about key options

./fasttext skipgram -input data/fil9 -output result/fil9 -minn 2 -maxn 5 -dim 300   # -minn/-maxn: min/max length of the character n-grams, -dim: vector dimension

./fasttext skipgram -input data/fil9 -output result/fil9 -epoch 1 -lr 0.5   # -epoch: number of passes over the data, -lr: learning rate

./fasttext skipgram -input data/fil9 -output result/fil9 -thread 4   # -thread: number of CPU threads to use


The most important options:

-dim: the dimension of the word vectors (it must match the embedding size you use later in OpenNMT-py, i.e. -word_vec_size).

skipgram or cbow: the training model; the choice is yours!


Usage

…/fasttext skipgram -input ko_emb_data.txt -output result_dimension_512/ko-vec-pcj -minn 2 -maxn 5 -dim 512
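
Before moving on, it is worth sanity-checking the .vec file that training produced. Its first line is a header containing the vocabulary size and the vector dimension, and the dimension must match the embedding size you will use in OpenNMT-py later (512 in this tutorial). A minimal Python sketch (YOUR_MODEL.vec is a placeholder for your own .vec file):

# Read the .vec header line: "<vocab_size> <dim>"
with open("YOUR_MODEL.vec", encoding="utf-8") as f:
    vocab_size, dim = f.readline().split()

print("vocabulary size:", vocab_size)
print("vector dimension:", dim)   # should be 512 here, matching -dim and -word_vec_size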

Step 3 (optional): Explore additional functionality

Print word vectors for individual words

echo "asparagus pidgey yellow" | ./fasttext print-word-vectors result/fil9.bin

Try

echo "YOUR WORD" | …/…/fasttext print-word-vectors YOUR_MODEL_BIN_FILE

Nearest neighbor queries

./fasttext nn result/fil9.bin

Try

…/…/fasttext nn YOUR_MODEL_BIN_FILE

Word analogies

./fasttext analogies result/fil9.bin

Try

…/…/fasttext analogies YOUR_MODEL_BIN_FILE
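
If you prefer to query the vectors from Python instead of the fastText CLI, the .vec file is in the standard word2vec text format and can be loaded with gensim (a separate library, not part of fastText or OpenNMT-py). A minimal sketch; replace the path and query word with your own:

from gensim.models import KeyedVectors

# Load the .vec file (word2vec text format).
vectors = KeyedVectors.load_word2vec_format("result/fil9.vec", binary=False)

# Nearest neighbours by cosine similarity, similar to "./fasttext nn".
print(vectors.most_similar("asparagus", topn=5))

Note that unlike ./fasttext nn, gensim only knows the words stored in the .vec file; it cannot build vectors for out-of-vocabulary words from subword n-grams (for that you need the .bin model).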



Now let's look at how to use the pretrained embeddings in OpenNMT-py


Step 1: Preprocess the data

Preprocess

python3 …/…/…/preprocess.py -train_src ./src-train.txt -train_tgt ./tgt-train.txt -valid_src ./src-val.txt -valid_tgt ./tgt-val.txt -save_data ./data -src_vocab_size 32000 -tgt_vocab_size 32000
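
With -save_data ./data, preprocessing writes the serialized train/validation data plus a data.vocab.pt file; this vocab file is what embeddings_to_torch.py needs in the next step. If you want to peek at it, here is a minimal sketch (the exact structure of data.vocab.pt differs between OpenNMT-py versions, so it only prints what it finds):

import torch

vocab = torch.load("data.vocab.pt")
# Depending on the OpenNMT-py version this is a dict of torchtext fields
# or a list of (side, vocab) pairs; just show which sides are present.
if isinstance(vocab, dict):
    print(list(vocab.keys()))
else:
    print([name for name, _ in vocab])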

Step 2: Prepare the embeddings

Run a command like the following (the second line shows the general form):

python3 …/…/…/tools/embeddings_to_torch.py -emb_file_both "…/…/ko_embedding/ko.vec" -dict_file ./data.vocab.pt -output_file "./embeddings"

python3 …/…/…/tools/embeddings_to_torch.py -emb_file_both "YOUR_VEC_FILE" -dict_file YOUR_VOCAB_PT_FILE -output_file "./embeddings"

As a result, embeddings.enc.pt and embeddings.dec.pt are created.
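
A quick way to check the result before training: a minimal sketch, assuming embeddings_to_torch.py saved plain tensors (as it did in the OpenNMT-py versions this tutorial is written against):

import torch

enc = torch.load("embeddings.enc.pt")
dec = torch.load("embeddings.dec.pt")

# Each should have shape (vocab_size, dim); dim must equal the fastText -dim
# and the -word_vec_size used in the training command below (512 here).
print("encoder embeddings:", tuple(enc.shape))
print("decoder embeddings:", tuple(dec.shape))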


Step 3: Transformer training with pretrained embeddings

Command

python3 …/…/train.py -data ./data -save_model ./model/model -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding -train_steps 500000 -max_generator_batches 2 -dropout 0.1 -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 -pre_word_vecs_enc "./embeddings.enc.pt" -pre_word_vecs_dec "./embeddings.dec.pt" -world_size 8 -gpu_ranks 0 1 2 3 4 5 6 7 -log_file ./log &

Key parameters
-pre_word_vecs_enc "./embeddings.enc.pt" -pre_word_vecs_dec "./embeddings.dec.pt"
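
One practical note: the command above assumes 8 GPUs (-world_size 8 -gpu_ranks 0 1 2 3 4 5 6 7). If you have fewer, reduce both accordingly; a quick check of what is visible to PyTorch:

import torch

# Number of visible CUDA devices; -world_size / -gpu_ranks must not exceed this.
print(torch.cuda.device_count())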


This was a simple pretrained embedding tutorial using fastText. :grinning: