OpenNMT PyTorch - Using FastText Pretrained Embeddings: A Tutorial for Beginners
Step 1: Install FastText
git clone https://github.com/facebookresearch/fastText.git
cd fastText
make
Step 2: Train FastText
Create a directory called result inside the fastText directory.
Prepare a large plain-text file for training the embeddings.
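If your corpus still needs cleaning, a minimal Python sketch like the one below is enough; FastText only expects whitespace-separated tokens, one sentence or document per line. The file names are assumptions: raw_corpus.txt is a hypothetical raw input, and emb_data.txt is the file used in the training command that follows.
import re

# Hypothetical clean-up step: lowercase, collapse whitespace, drop empty lines.
with open("raw_corpus.txt", encoding="utf-8") as fin, open("emb_data.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        line = re.sub(r"\s+", " ", line.strip().lower())
        if line:
            fout.write(line + "\n")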
Training
…/fasttext skipgram -input ./emb_data.txt -output result/model
emb_data.txt: the large text file used for training
output: the directory and base name of the model files to be created
As a result, *.vec and *.bin files are created.
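If you want to sanity-check the output, the *.vec file is plain text: the first line gives the vocabulary size and the dimension, and every following line is a word followed by its vector. A minimal Python sketch, assuming the output prefix result/model used above:
import itertools

with open("result/model.vec", encoding="utf-8") as f:
    vocab_size, dim = map(int, f.readline().split())  # header: "<vocab_size> <dim>"
    print(vocab_size, "words,", dim, "dimensions")
    for line in itertools.islice(f, 3):  # peek at the first few entries
        print(line.split(" ", 1)[0])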
Learn about key options
./fasttext skipgram -input data/fil9 -output result/fil9 -minn 2 -maxn 5 -dim 300
./fasttext skipgram -input data/fil9 -output result/fil9 -epoch 1 -lr 0.5
./fasttext skipgram -input data/fil9 -output result/fil9 -thread 4
The most important option is -dim, the embedding dimension: it must match the -word_vec_size you use when training the OpenNMT model later (512 in this tutorial). -minn and -maxn set the character n-gram lengths, -epoch and -lr the number of epochs and the learning rate, and -thread the number of threads.
skipgram or cbow: your choice!
Usage:
…/fasttext skipgram -input ko_emb_data.txt -output result_dimension_512/ko-vec-pcj -minn 2 -maxn 5 -dim 512
Step 3 (optional): Explore additional functionality
Visually check word vectors
echo "asparagus pidgey yellow" | ./fasttext print-word-vectors result/fil9.bin
Try
echo "YOUR WORD" | …/…/fasttext print-word-vectors YOUR_MODEL_BIN_FILE
Nearest neighbor queries
./fasttext nn result/fil9.bin
Try
…/…/fasttext nn YOUR_MODEL_BIN_FILE
Word analogies
./fasttext analogies result/fil9.bin
Try
…/…/fasttext analogies YOUR_MODEL_BIN_FILE
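For intuition, nn and analogies are essentially cosine-similarity searches over the word vectors. A rough Python sketch of a nearest-neighbor query done by hand from the *.vec file (assumes numpy, a model small enough to fit in memory, and that the query word exists in your vocabulary; it is only an illustration of what nn computes):
import numpy as np

def load_vec(path):
    # Read the plain-text .vec format: header line, then one "word v1 v2 ..." per line.
    with open(path, encoding="utf-8") as f:
        n, dim = map(int, f.readline().split())
        words, vecs = [], np.zeros((n, dim), dtype=np.float32)
        for i, line in enumerate(f):
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            vecs[i] = np.asarray(parts[1:1 + dim], dtype=np.float32)
    return words, vecs

words, vecs = load_vec("result/model.vec")
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-8  # normalize for cosine similarity
query_word = "asparagus"  # pick any word that exists in your vocabulary
scores = vecs @ vecs[words.index(query_word)]
for i in scores.argsort()[::-1][1:6]:  # top 5 neighbors, skipping the query word itself
    print(words[i], float(scores[i]))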
Now let’s look at how to use pretrained embeddings in OpenNMT-py.
Step 1: Preprocess the data
Preprocess
python3 …/…/…/preprocess.py -train_src ./src-train.txt -train_tgt ./tgt-train.txt -valid_src ./src-val.txt -valid_tgt ./tgt-val.txt -save_data ./data -src_vocab_size 32000 -tgt_vocab_size 32000
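preprocess.py writes the serialized training/validation data plus a data.vocab.pt file next to the -save_data prefix; that vocab file is what the embedding conversion in the next step reads. A quick, hedged check that it loads (the exact structure of the loaded object depends on your OpenNMT-py version):
import torch

# Hypothetical quick check. The vocab file is a pickled OpenNMT/torchtext structure,
# so its exact type varies across versions; on recent torch you may need
# torch.load("data.vocab.pt", weights_only=False).
vocab = torch.load("data.vocab.pt")
print(type(vocab))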
Step 2: Prepare the embeddings
The command looks like the examples below:
python3 …/…/…/tools/embeddings_to_torch.py -emb_file_both "…/…/ko_embedding/ko.vec" -dict_file ./data.vocab.pt -output_file "./embeddings"
python3 …/…/…/tools/embeddings_to_torch.py -emb_file_both "YOUR_VEC_FILE" -dict_file YOUR_VOCAB_PT_FILE -output_file "./embeddings"
As a result, embeddings.enc.pt and embeddings.dec.pt are created.
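Before launching training you can sanity-check the converted files. Each one should hold a single (vocab_size x dim) embedding matrix, and the second dimension has to equal the -word_vec_size passed to train.py (512 in this tutorial). A minimal sketch, assuming the files contain plain tensors saved with torch.save:
import torch

enc = torch.load("embeddings.enc.pt")
dec = torch.load("embeddings.dec.pt")
print(enc.shape, dec.shape)  # expected: (source vocab size, 512) and (target vocab size, 512)
assert enc.shape[1] == dec.shape[1] == 512, "dim must match -word_vec_size in train.py"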
Step 3: Transformer Training with Pretrained Embeddings
Command
python3 …/…/train.py -data ./data -save_model ./model/model -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding -train_steps 500000 -max_generator_batches 2 -dropout 0.1 -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 -pre_word_vecs_enc "./embeddings.enc.pt" -pre_word_vecs_dec "./embeddings.dec.pt" -world_size 8 -gpu_ranks 0 1 2 3 4 5 6 7 -log_file ./log &
Key parameters
-pre_word_vecs_enc "./embeddings.enc.pt" -pre_word_vecs_dec "./embeddings.dec.pt"
These point the encoder and decoder to the converted FastText embeddings; make sure -word_vec_size matches the FastText -dim (512 in this tutorial).
So far, this has been a simple pretrained embedding tutorial using FastText.