Translation Example in OpenNMT 2.0 Docs

The translation example in the OpenNMT documentation includes a step to tokenize the test dataset before passing it to the translate script (i.e., during inference). From what I understand, we don’t have to tokenize the input files for training, since OpenNMT 2.0 can do this on the fly.

My question, then: when running inference with the trained model, do we still need to tokenize the input test data, or is this step in the documentation left over from before 2.0?

Yes. On-the-fly tokenization works during training only.
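For inference you subword the test file yourself and then undo the subwording on the output, along these lines (the file and model names here are just placeholders, not from the docs):

spm_encode --model=sp-model.ru.model --output_format=piece < test.ru > test.ru.sp
onmt_translate -model model_step_10000.pt -src test.ru.sp -output pred.en.sp -gpu 0
spm_decode --model=sp-model.en.model --input_format=piece < pred.en.sp > pred.en   # undo the subwording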

Ah, great, thanks Yasmin. I do have one other question. In the translation example in the OpenNMT documentation, a script is used to call ‘spm_train’ to train the SentencePiece model. In the script, the source and target sentences are concatenated into one file (i.e. source file 1, target file 1, source file 2, target file 2, etc.). Is this correct? So SentencePiece model training does not require that the language pairs be matched?

UPDATE: I might have answered my own question… I found a thread on the forum that mentions using two models, one for source and one for target. I will give that a try.

Using one SentencePiece model for both the source and the target usually means you will use shared vocabulary during training. So to avoid confusion while you are still trying things out, just start by using separate SentencePiece models and separate vocabularies, i.e. one for the source and one for the target. Later, you can research using shared/joint vocabulary for future experiments.
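For example, one spm_train run per side, roughly like this (the vocab size and file names are only illustrative):

spm_train --input=train.ru --model_prefix=sp-model.ru --vocab_size=32000 --character_coverage=1.0
spm_train --input=train.en --model_prefix=sp-model.en --vocab_size=32000 --character_coverage=1.0

Each run gives you a .model and a .vocab file for that side.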

Good!

All the best,
Yasmin

Thank you so much Yasmin. So I have my script generating two SentencePiece models, one for the source and one for the target. SentencePiece also generates vocab files, and I also generate OpenNMT vocab files with the onmt_build_vocab command. Which should I use for training the model: the SentencePiece vocab files or the onmt_build_vocab vocab files?

So I think I can answer this myself… lol. The SentencePiece vocab is used in the subword section as follows:

src_subword_type: sentencepiece
src_subword_model: sp-model.ru.model
src_subword_vocab: sp-model.ru.vocab
tgt_subword_type: sentencepiece
tgt_subword_model: sp-model.en.model
tgt_subword_vocab: sp-model.en.vocab
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
src_seq_length: 150
tgt_seq_length: 150

Whereas the onmt_build_vocab vocab files are used by the Transformer model itself. Is this correct?

The latter, i.e. the onmt_build_vocab vocab files.
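Roughly, that part looks something like this (the paths are placeholders, not your actual files):

onmt_build_vocab -config config.yaml -n_sample -1

and then in config.yaml, alongside the subword section you posted:

src_vocab: run/vocab.src
tgt_vocab: run/vocab.tgt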

Shouldn’t the spm_to_vocab.py script from the OpenNMT-py repository on GitHub be used to convert the SentencePiece vocab files, and the output then used to train the model?
Otherwise, there will be all sorts of incompatibilities between the two SentencePiece models, increasing the chances of OOV.
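As far as I remember, it reads the SentencePiece .vocab from stdin and writes an OpenNMT-style vocab to stdout, so the conversion is something like this (the output file names are up to you, and do check the script itself to be sure):

python spm_to_vocab.py < sp-model.ru.vocab > sp-model.ru.onmt_vocab
python spm_to_vocab.py < sp-model.en.vocab > sp-model.en.onmt_vocab

and then src_vocab / tgt_vocab in the config point at the converted files.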

Interesting, I did not know that script exists. I will have to give it a try after I finish the current training run to see if it improves the BLEU scores. Right now, I used the SPM-generated vocab files (separate source and target) with the SPM models… but my current training run is using the OpenNMT-generated vocab.

Hi James!

For OpenNMT-tf, there is a similar script, and I have used it. I did not know about one for OpenNMT-py, and I have not tried it. If it works well, then yes, it should be used.

There are not two conflicting SentencePiece models here. Actually, you subword your training and development data with the SentencePiece model you created, and then build the vocab on this subworded data, so there is no incompatibility.

Still, as I said, using the script you mentioned would be better if it works. Thanks for referring to it.
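Conceptually, the flow is something like this (in OpenNMT-py 2.x the sentencepiece transform can do the subwording on the fly, so take this only as an illustration of where the vocab comes from):

spm_encode --model=sp-model.ru.model --output_format=piece < train.ru > train.ru.sp
# the vocab is then counted over train.ru.sp, i.e. over the same pieces the SentencePiece model produces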

Kind regards,
Yasmin

Yes, I was having the OOV issue when moving from OpenNMT-py 1.x to 2.x, and I was advised to use that script to convert my manually trained SentencePiece model.

Hi James,

Thanks for this info… can you briefly explain OOV? I am not sure what that is.

V/r
Ken

Out of vocabulary. The idea of subwords is to prevent getting UNKs, but if the subword models and vocabs are mixed up, training can still produce out-of-vocabulary tokens (<unk>), or at least that is what I was experiencing.
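As a quick illustration (the exact pieces depend on your trained model, so the output below is only indicative):

echo "hyperparameters" | spm_encode --model=sp-model.en.model --output_format=piece
# possible output: ▁hyper para meter s
# a rare word is covered by in-vocabulary pieces instead of becoming <unk>

If the SentencePiece model and the training vocab do not match, those pieces may not be in the vocab at all, and you end up with <unk> anyway.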