I'm not 100% sure how on-the-fly tokenization works with my OpenNMT-py models.
At the simplest level (with no subword model trained), I have the following:
1) onmt_build_vocab -config vanilla.yaml -n_sample -1
2) onmt_train -config vanilla.yaml
3) onmt_translate --model model_step_80000.pt --src src-test.txt --output pred.txt -replace_unk
4) sacrebleu tgt-test.txt < pred.txt -m bleu
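For the scoring step 4), the sacrebleu Python API should give the same number as the command-line call; a minimal sketch, using the same pred.txt and tgt-test.txt as above:

```python
# Sketch: the same BLEU computation as step 4), via sacrebleu's Python API.
import sacrebleu

with open("pred.txt", encoding="utf-8") as f:
    hyps = [line.rstrip("\n") for line in f]
with open("tgt-test.txt", encoding="utf-8") as f:
    refs = [line.rstrip("\n") for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])  # one reference set
print(bleu.format())
```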
With on-the-fly tokenization, I do no preprocessing and simply specify onmt_tokenize as a transform on the corpus.
Is it that simple, or do I need to do something else?
Do I need to detokenize at any stage, e.g. between steps 2) and 3) above, or between 3) and 4)? Or is detokenization handled automatically by onmt_translate?
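If it turns out I do need to detokenize myself between 3) and 4), I assume it would look roughly like the sketch below, using pyonmttok with the same kwargs as in my config (this is just my guess, not necessarily what the onmt_tokenize transform does internally):

```python
# Rough sketch (my assumption): a tokenize/detokenize round trip with pyonmttok,
# built from the same options as the onmttok_kwargs in my config below.
import pyonmttok

tokenizer = pyonmttok.Tokenizer("none", spacer_annotate=True)

tokens, _ = tokenizer.tokenize("Is it really that simple ?")
print(tokens)                        # tokens (what I assume the transform emits)
print(tokenizer.detokenize(tokens))  # back to plain text before running sacrebleu
```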
I have gone through all the documentation in detail, but I still can't see whether detokenization is handled automatically. Thanks.
Séamus.
My YAML file is outlined below:
# use on the fly tokenization with onmt_tokenize
## Where the vocab(s) will be written
src_vocab: data/vocab.src
tgt_vocab: data/vocab.tgt
### Transform related
# Tokenisation options
src_subword_type: none
src_subword_model: none
tgt_subword_type: none
tgt_subword_model: none
# Number of candidates for SentencePiece sampling
subword_nbest: 20
# Smoothing parameter for SentencePiece sampling
subword_alpha: 0.1
# Specific arguments for pyonmttok
src_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"
tgt_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"
#### Filter
src_seq_length: 150
tgt_seq_length: 150
# Corpus opts:
data:
    # (required for train run type).
    corpus_1:
        path_src: data/src-train.txt
        path_tgt: data/tgt-train.txt
        transforms: [onmt_tokenize, filtertoolong]
        weight: 1
    valid:
        path_src: data/src-val.txt
        path_tgt: data/tgt-val.txt
        transforms: [onmt_tokenize]
# Transformer Model Hyperparameters
save_data: data/
save_model: models/model
save_checkpoint_steps: 40000
valid_steps: 2000
train_steps: 80000
seed: 5151
report_every: 1000
keep_checkpoint: 2
# Model and optimization parameters.
emb_size: 100
rnn_size: 500
# (optional) Width of the beam search (default: 1).
beam_width: 5
# Training options.
# (optional when batch_type=tokens) If not set, the training will search the largest
# possible batch size.
batch_size: 64
# silently ignore empty lines in the data
skip_empty_level: silent
# To resume training from previous checkpoint, specify name e.g. model_step_10000.pt
# train_from: model_step_10000.pt
# Logging
tensorboard: true
log_file: data/training_log.txt
# Train on a single GPU
world_size: 1
gpu_ranks: [0]
model_dtype: fp32
optim: adam
learning_rate: '2'
warmup_steps: '8000'
decay_method: noam
adam_beta2: '0.998'
max_grad_norm: '0'
label_smoothing: '0.1'
param_init: '0'
param_init_glorot: 'true'
normalization: tokens
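As a quick sanity check (not part of the OpenNMT-py pipeline), I load the config with PyYAML to confirm the data block nests the way I intend:

```python
# Quick check that vanilla.yaml parses and the data block nests as intended.
import yaml

with open("vanilla.yaml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

for name, corpus in cfg["data"].items():
    print(name, corpus["path_src"], corpus.get("transforms"))
```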