Using SentencePiece/Byte Pair Encoding on a Model

Oh boy, that was kind of it: there was a typo in the name of the file… After I fixed it, it worked fine. It would be nice, though, to have some kind of file-open error thrown instead of it just hanging with no output. Thanks!

Ah good catch, that’s indeed an issue.

EDIT: This has been improved in 1.15.2.

So I’ve run the tokenizer/BPE on a compilation of several of the OPUS datasets. A couple of lines of the tokenized/BPE’d output file are:

New York ■, 2 ■-■ 2■ 7 May 2■ 0■ 0■ 5
2 Statement by the delegation of Malaysia ■, on behalf of the Group of Non ■-■ Aligned States Parties to the Treaty on the Non ■-■ Proliferation of Nuclear Weapons ■, at the plenary of the 2■ 0■ 0■ 5 Review Conference of the Parties to the Treaty on the Non ■-■ Proliferation of Nuclear Weapons
■, concerning the adoption of the agenda ■, New York ■, 1■ 1 May 2■ 0■ 0■ 5
3 The Non ■-■ Aligned States Parties to the NPT welcome the adoption of the agenda of the 2■ 0■ 0■ 5 Review Conference of the Parties to the NPT ■.

The code that I used is:

import pyonmttok
import os

# Base tokenizer used both for BPE training and for the final tokenization.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, segment_numbers=True)
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)

print("Ingesting")
if os.path.isfile("data/full/tgt-train.txt"):
    with open("data/full/tgt-train.txt", "r") as wi:
        print(wi.readline())
    learner.ingest_file("data/full/tgt-train.txt")
else:
    print("File doesn't exist")
print("Learning")
# learn() writes the BPE model to disk and returns a tokenizer that applies it.
tokenizer = learner.learn("data/full/bpe-en.model", verbose=True)
print("Tokenizing")
tokenizer.tokenize_file("data/full/tgt-train.txt", "data/full/tgt-train.txt.token")
print("Finished")

Does this all look normal to everyone? Just wanted to make sure that I wasn’t missing anything.

EDIT: Actually, from looking at the output, I don’t think the BPE was applied. Any ideas?

Usage looks OK. The output could be a valid BPE output depending on the training file.

Do you find some examples with words that are segmented?

I didn’t see any BPE applied, so I ran it again with the line:
tokenizer = pyonmttok.Tokenizer("aggressive", bpe_model_path="data/full/bpe-en.model", joiner_annotate=True, segment_numbers=True)
before the tokenization step, and that seemed to fix it.

The tokenizer returned by learn should already apply the learned BPE model. I’m pretty sure we have a unit test for this. I will have a look.

EDIT: tested and worked as expected. Maybe you did not run the commands sequentially?
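
For reference, a minimal sketch of the expected sequential flow, reusing the paths from the script above (the sample sentence is a placeholder, and the tokenize return signature assumes the pyonmttok 1.x API used in this thread):

import pyonmttok

base = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, segment_numbers=True)
learner = pyonmttok.BPELearner(tokenizer=base, symbols=32000)
learner.ingest_file("data/full/tgt-train.txt")

# learn() writes the model file and returns a tokenizer that already applies it.
bpe_tokenizer = learner.learn("data/full/bpe-en.model")

# Quick check: with BPE applied, less frequent words should be split into subword pieces.
tokens, _ = bpe_tokenizer.tokenize("Proliferation of Nuclear Weapons")
print(tokens)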

So @guillaumekln et al, to clarify with SentencePiece:
I should feed the UNDERSCORED output as the input to ONMT pre-processing (that generates the .pt)?

Q. Including the underscores?

  1. corpora -> SentencePiece -> underscored corpora
  2. underscored corpora -> opennmt pre-processing -> *train.0.pt etc
  3. train.0.pt etc -> opennmt train -> model.pt (model)

I’m about to reproduce a 4.5M sentence Transformer training on 4 GPUs so I’d like to get it right!


Yes, with the underscores.
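
As a minimal sketch of step 1 of that pipeline, using the SentencePiece Python API (file names here are placeholders; spm_encode on the command line does the same job):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("spm.model")

# Each corpus line becomes space-separated pieces, keeping the "▁" spacers.
# This underscored file is what goes into OpenNMT pre-processing.
with open("train.src") as fin, open("train.src.sp", "w") as fout:
    for line in fin:
        fout.write(" ".join(sp.EncodeAsPieces(line.strip())) + "\n")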


Using OpenNMT 2.0, I’ve created a translation model, initially without using subwords, that achieves a BLEU score of 37.7.

I’ve modified the commands and updated the YAML so that I now have a translation model that uses BPE from SentencePiece. The problem is that the BLEU score is nearly identical, which makes me suspect that I have something wrong. All other parameters, such as the number of training steps, have been kept the same. I’ve also tried vocab_size values of 8k, 16k and 32k. It generates myspm.model, myspm.vocab, vocab.tgt, vocab.src, src-test.txt.sp and tgt-test.txt.sp. The .sp files have the correct format, with underscores throughout. The validation accuracy of the model is over 70% in both cases, so it looks like it is building the models correctly.

I haven’t used preprocess.py since that’s no longer available in OpenNMT 2.0. The training file, train.txt, is a large file of whitespace-separated tokens and contains both the source and target languages.
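
For reference, a combined training file like that can be produced by simply concatenating the two sides; the file names below are assumptions taken from the config further down:

# Build a single SentencePiece training file containing both language sides.
with open("data/train.txt", "w", encoding="utf-8") as out:
    for path in ("data/src-train.txt", "data/tgt-train.txt"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                out.write(line)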

Are the following workflow commands and config.yaml correct? Thanks.

spm_train --input=train.txt --model_prefix=myspm \
    --vocab_size=16000 --character_coverage=1.0 --model_type=bpe

onmt_build_vocab -config config.yaml -n_sample=-1
onmt_train -config config.yaml

spm_encode --model=myspm.model < src-test.txt > src-test.txt.sp
spm_encode --model=myspm.model < tgt-test.txt > tgt-test.txt.sp

onmt_translate --model model_step_100000.pt \
    --src src-test.txt.sp --output pred.sp -replace_unk

spm_decode --model=myspm.model --input_format=piece < pred.sp > pred.txt

sacrebleu tgt-test.txt < pred.txt

The following is my config:

src_vocab: data/vocab.src
tgt_vocab: data/vocab.tgt

src_subword_type: sentencepiece
src_subword_model: data/myspm.model
tgt_subword_type: sentencepiece
tgt_subword_model: data/myspm.model

subword_nbest: 20
subword_alpha: 0.1

src_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"
tgt_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"

src_seq_length: 150
tgt_seq_length: 150

data:
    corpus_1:
        path_src: data/src-train.txt
        path_tgt: data/tgt-train.txt
        transforms: [onmt_tokenize, filtertoolong]
        weight: 1
    valid:
        path_src: data/src-val.txt
        path_tgt: data/tgt-val.txt
        transforms: [onmt_tokenize]

save_data: data/
save_model: models/model
save_checkpoint_steps: 50000
valid_steps: 2000
train_steps: 100000
seed: 5151
report_every: 1000
keep_checkpoint: 2

rnn_size: 512
batch_size: 64

skip_empty_level: silent

world_size: 1
gpu_ranks: [0]
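
As a quick sanity check of what the onmt_tokenize transform will produce with these options, the same SentencePiece model can be loaded directly in pyonmttok (the sample sentence is a placeholder, and the tokenize return signature assumes pyonmttok 1.x):

import pyonmttok

tokenizer = pyonmttok.Tokenizer("none", sp_model_path="data/myspm.model", spacer_annotate=True)

# The pieces should carry the "▁" spacer, mirroring the .sp files produced by spm_encode.
tokens, _ = tokenizer.tokenize("This is a sample source sentence.")
print(tokens)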

Dear Séamus,

I had a similar experience with French-English. If the model is already good, BLEU cannot assess such subtleties. However, when I analysed some English sentences generated by the MT model, I noticed they were smoother with fewer/no leftovers from the French source compared to the old model.

There is one thing you can try though. Are you generating one SentencePiece model for both the target and the source? Is this intentional? I usually generate two different SentencePiece models. It is something you can try especially if the source and target languages are not similar.

Kind regards,
Yasmin
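
A minimal sketch of what training two separate SentencePiece models could look like, one per side (file names, vocab size and model type are assumptions based on the commands earlier in the thread):

import sentencepiece as spm

# One subword model per language side.
for side in ("src", "tgt"):
    spm.SentencePieceTrainer.Train(
        "--input=data/%s-train.txt --model_prefix=data/myspm.%s "
        "--vocab_size=16000 --character_coverage=1.0 --model_type=bpe" % (side, side)
    )

The config would then point src_subword_model and tgt_subword_model at the two resulting .model files.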

Hi Yasmin, thanks for that. Yes, I’m just using one SP model. Using 2 separate SP models had crossed my mind as well, so I have just built a translation model using 2 SP models, one for the source and one for the target. Unfortunately, there was no change, so it may be a limit for the particular config and datasets that I have, unless I am missing something else. Currently I am using transforms: [onmt_tokenize, filtertoolong] as my transforms. I tried replacing this with transforms: [sentencepiece, filtertoolong] but it crashes when I start training.

Dear Séamus,

My understanding is that you are using an RNN model rather than a Transformer model, right? If so, you have a great opportunity to improve your model by using the Transformer architecture. I will leave the configuration I use below this reply.

Note: If you only have one GPU, change this as follows:

world_size: 1
gpu_ranks: [0]

Another point is that you are using Google Colab. I do not oppose this. I just cannot tell exactly to what extent Google Colab can be used for larger models.

Kind regards,
Yasmin

# Training files
data:
    corpus_1:
        path_src: data/train.en
        path_tgt: data/train.hi
        transforms: [sentencepiece, filtertoolong]
    valid:
        path_src: data/dev.en
        path_tgt: data/dev.hi
        transforms: [sentencepiece, filtertoolong]

# Tokenization options
src_subword_model: subword/bpe/en.model
tgt_subword_model: subword/bpe/hi.model

# Vocabulary files
src_vocab: run/enhi.vocab.src
tgt_vocab: run/enhi.vocab.tgt

early_stopping: 4
log_file: logs/train.bpe.log
save_model: models/model.enhi

save_checkpoint_steps: 10000
keep_checkpoint: 10
seed: 3435
train_steps: 200000
valid_steps: 10000
warmup_steps: 8000
report_every: 100

decoder_type: transformer
encoder_type: transformer
word_vec_size: 512
rnn_size: 512
layers: 6
transformer_ff: 2048
heads: 8

accum_count: 4
optim: adam
adam_beta1: 0.9
adam_beta2: 0.998
decay_method: noam
learning_rate: 2.0
max_grad_norm: 0.0

batch_size: 2048    #original: 4096
batch_type: tokens
normalization: tokens
dropout: 0.1
label_smoothing: 0.1

max_generator_batches: 2

param_init: 0.0
param_init_glorot: 'true'
position_encoding: 'true'

world_size: 2
gpu_ranks: [0,1]

Hi Yasmin, thanks for posting the configuration. It’s good to note that you are using the sentencepiece transform whereas I am using onmt_tokenize. You are correct in pointing out that I am using a simple vanilla NMT implementation. At present I am trying to replicate results in a paper that uses a vanilla approach, so I am restricted to using simpler models for that particular experiment. However, I might move on from that and concentrate on a Transformer architecture. Using Transformers, I have also seen significant improvements in BLEU scores. I have a Pro version of Colab, which is working great. The standard version is too slow for building these types of models.

Regards,

Séamus.

Hello, I’m going to jump into this conversation.
I am training in a low-resource setting; here are some things I noticed that could be beneficial.

  1. Using RelativeTransformer model instead of Transformer. (+1 BLEU)
  2. Word dictionary. (+1 BLEU)
  3. Keep long sentences and paragraphs. (+6 BLEU)
  4. Back translation. (+1 BLEU)
  5. Copy monolingual target text to source.
  6. Tagged features (e.g. <ru> <bt> <v2> ▁I ▁love ▁music .)
  7. Average the last two models, then continue training with the averaged model. (+0.5 BLEU)
  8. Train several SentencePiece variants for a single NMT model, with a shared vocabulary built by adding the pieces of all SP variants, on both the source and target side, to one vocab file. (+1 BLEU) See the sketch after this list.
    For example:
    v1 vocab size:28500, maximum piece length: 14
    v2 vocab size:25000, maximum piece length: 12
    v3 vocab size:22000, maximum piece length: 10
    v4 vocab size:19000, maximum piece length: 8
    v5 vocab size:16000, maximum piece length: 6

If something needs to be more clear, I could share further details.
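
A minimal sketch of point 8, under the assumption that the variants are trained with the SentencePiece Python API and its --max_sentencepiece_length option, and that the shared vocabulary is simply the union of all pieces (file names are placeholders):

import sentencepiece as spm

# Vocab size / maximum piece length pairs from point 8 above.
variants = [(28500, 14), (25000, 12), (22000, 10), (19000, 8), (16000, 6)]

shared_vocab = set()
for i, (vocab_size, max_len) in enumerate(variants, start=1):
    prefix = "spm_v%d" % i
    spm.SentencePieceTrainer.Train(
        "--input=train.txt --model_prefix=%s --vocab_size=%d "
        "--max_sentencepiece_length=%d" % (prefix, vocab_size, max_len)
    )
    # Collect every piece of this variant into one shared vocabulary.
    with open(prefix + ".vocab", encoding="utf-8") as f:
        for line in f:
            shared_vocab.add(line.split("\t")[0])

with open("shared.vocab", "w", encoding="utf-8") as f:
    for piece in sorted(shared_vocab):
        f.write(piece + "\n")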


Dear Nart,

Many thanks for sharing your tips when it comes to low-resource languages! Sure, changing BPE settings can help in some cases.

I would be grateful if you can share more details about the first point, using the “Relative Transformer”. What are the different parameters between it and the standard “Transformer”?

Many thanks!
Yasmin

@ymoslem I’m using the relative Transformer model that comes out of the box with OpenNMT-tf:

onmt-main --model_type TransformerRelative --config data.yml --auto_config train

The implementation seems to correspond to this paper: [1803.02155] Self-Attention with Relative Position Representations
It’s the last item in this list: Model — OpenNMT-tf 2.16.0 documentation


Many thanks, Nart, for the detailed answer!

My understanding is that in OpenNMT-py position_encoding: 'true' achieves the same purpose. Could you please confirm or correct me, François? Thanks! @francoishernandez

Kind regards,
Yasmin


Positional encoding and relative position representations are not the same thing.
Positional encoding (position_encoding) comes from the original Transformer architecture (section 3.5 of “Attention Is All You Need”).
Relative position representations were introduced a bit later, in the paper that @Nart mentions. They are implemented in OpenNMT-py and can be enabled by setting max_relative_positions.
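
For example, in an OpenNMT-py training YAML this could look as follows (the value is only an example; see the paper for the effect of the clipping distance):

# Enable relative position representations (Shaw et al., 2018).
max_relative_positions: 20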


Some interesting tips there :-). I am also doing some “low resource” work.


I am curious, what language are you working on?