Using SentencePiece/Byte Pair Encoding on a Model

Dear Séamus,

I had a similar experience with French-English. If the model is already good, BLEU cannot capture such subtleties. However, when I analysed some English sentences generated by the new MT model, I noticed they were smoother, with fewer (or no) leftovers from the French source, compared to the old model.

There is one thing you can try, though. Are you generating a single SentencePiece model for both the source and the target? Is this intentional? I usually generate two separate SentencePiece models, one for each language. It is worth trying, especially if the source and target languages are not similar.
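
For illustration, here is a minimal sketch of what I mean, using the sentencepiece Python package; the file paths, vocabulary size, and model type are only placeholders, not your actual settings:

# Hypothetical sketch: train one SentencePiece model per language instead of a
# single joint model. All paths and sizes below are placeholders.
import sentencepiece as spm

for lang in ("fr", "en"):
    spm.SentencePieceTrainer.train(
        input=f"data/train.{lang}",          # raw (untokenized) training text
        model_prefix=f"subword/bpe/{lang}",  # writes {lang}.model and {lang}.vocab
        model_type="bpe",                    # or "unigram"
        vocab_size=32000,
        character_coverage=1.0,
    )

# Quick sanity check that each model loads and segments a sample sentence.
sp_en = spm.SentencePieceProcessor(model_file="subword/bpe/en.model")
print(sp_en.encode("This is a test sentence.", out_type=str))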

Kind regards,
Yasmin

Hi Yasmin, Thanks for that. Yes, I’m just using one SP model. Two separate SP models had crossed my mind as well, so I’m just after building a translation model using 2 SP models, one for the source and one for the target. Unfortunately, there was no change, so it may be a limit of the particular config and datasets that I have, unless I am missing something else. Currently I am using transforms: [onmt_tokenize, filtertoolong] as my transforms. I tried replacing this with transforms: [sentencepiece, filtertoolong], but it crashes when I start training.

Dear Séamus,

My understanding is that you are using an RNN model rather than a Transformer model, right? If so, you have a great opportunity to improve your model by using the Transformer architecture. I will leave the configuration I use below this reply.

Note: If you only have one GPU, change this as follows:

world_size: 1
gpu_ranks: [0]

Another point is that you are using Google Colab. I am not against this; I just cannot tell exactly to what extent Google Colab can be used for larger models.

Kind regards,
Yasmin

# Training files
data:
    corpus_1:
        path_src: data/train.en
        path_tgt: data/train.hi
        transforms: [sentencepiece, filtertoolong]
    valid:
        path_src: data/dev.en
        path_tgt: data/dev.hi
        transforms: [sentencepiece, filtertoolong]

# Tokenization options
src_subword_model: subword/bpe/en.model
tgt_subword_model: subword/bpe/hi.model

# Vocabulary files
src_vocab: run/enhi.vocab.src
tgt_vocab: run/enhi.vocab.tgt

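# Training and checkpoint options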
early_stopping: 4
log_file: logs/train.bpe.log
save_model: models/model.enhi

save_checkpoint_steps: 10000
keep_checkpoint: 10
seed: 3435
train_steps: 200000
valid_steps: 10000
warmup_steps: 8000
report_every: 100

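# Model architecture (Transformer)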
decoder_type: transformer
encoder_type: transformer
word_vec_size: 512
rnn_size: 512
layers: 6
transformer_ff: 2048
heads: 8

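# Optimization (Adam with Noam decay)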
accum_count: 4
optim: adam
adam_beta1: 0.9
adam_beta2: 0.998
decay_method: noam
learning_rate: 2.0
max_grad_norm: 0.0

batch_size: 2048    #original: 4096
batch_type: tokens
normalization: tokens
dropout: 0.1
label_smoothing: 0.1

max_generator_batches: 2

param_init: 0.0
param_init_glorot: 'true'
position_encoding: 'true'

world_size: 2
gpu_ranks: [0,1]

Hi Yasmin, Thanks for posting the configuration. It’s good to note that you are using the sentencepiece transform whereas I am using onmt_tokenize. You are correct in pointing out that I am using a simple vanilla NMT implementation. At present I am trying to replicate the results of a paper that uses a vanilla approach, so I am restricted to using simpler models for that particular experiment. However, I might move on from that and concentrate on a Transformer architecture. Using Transformers, I have also seen significant improvements in BLEU scores. I have the Pro version of Colab, which is working great. The standard version is too slow for building these types of models.

Regards,

Séamus.

Hello, I’m going to jump into this conversation.
I am training in a low-resource setting; here are some things I noticed that could be beneficial.

  1. Using the TransformerRelative model instead of the standard Transformer. (+1 BLEU)
  2. Adding a word dictionary. (+1 BLEU)
  3. Keeping long sentences and paragraphs. (+6 BLEU)
  4. Back-translation. (+1 BLEU)
  5. Copying monolingual target text to the source side.
  6. Tagged features (e.g. <ru> <bt> <v2> ▁I ▁love ▁music .)
  7. Averaging the last two checkpoints, then continuing training from the averaged model. (+0.5 BLEU)
  8. Training several SentencePiece versions for a single NMT model, with a shared vocab built by adding all the SentencePiece tokens of every version, on both the source and target side, to one vocab file (see the sketch after this list). (+1 BLEU)
    (e.g.)
    v1 vocab size: 28500, maximum piece length: 14
    v2 vocab size: 25000, maximum piece length: 12
    v3 vocab size: 22000, maximum piece length: 10
    v4 vocab size: 19000, maximum piece length: 8
    v5 vocab size: 16000, maximum piece length: 6
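
To make point 8 more concrete, here is a rough Python sketch (not my exact pipeline; all file names and numbers are placeholders) that trains several SentencePiece versions with decreasing vocabulary sizes and maximum piece lengths, then merges every version’s pieces into one shared vocabulary file:

# Hypothetical sketch of point 8: several SentencePiece versions, one merged
# vocabulary. Paths and sizes are placeholders, not the actual experiment.
import sentencepiece as spm

versions = [  # (tag, vocab size, maximum piece length)
    ("v1", 28500, 14),
    ("v2", 25000, 12),
    ("v3", 22000, 10),
    ("v4", 19000, 8),
    ("v5", 16000, 6),
]

shared_vocab = set()
for tag, vocab_size, max_len in versions:
    spm.SentencePieceTrainer.train(
        input=["data/train.src", "data/train.tgt"],  # both sides, as in point 8
        model_prefix=f"sp/{tag}",
        model_type="bpe",
        vocab_size=vocab_size,
        max_sentencepiece_length=max_len,
    )
    sp = spm.SentencePieceProcessor(model_file=f"sp/{tag}.model")
    shared_vocab.update(sp.id_to_piece(i) for i in range(sp.get_piece_size()))

# Union of all versions' pieces; adapt the format to whatever your NMT toolkit
# expects for its vocab file (e.g. one token per line, possibly with counts).
with open("sp/shared-vocab.txt", "w", encoding="utf-8") as f:
    for piece in sorted(shared_vocab):
        f.write(piece + "\n")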

If anything needs to be clearer, I can share further details.


Dear Nart,

Many thanks for sharing your tips when it comes to low-resource languages! Sure, changing BPE settings can help in some cases.

I would be grateful if you can share more details about the first point, using the “Relative Transformer”. What are the different parameters between it and the standard “Transformer”?

Many thanks!
Yasmin

@ymoslem I’m using the relative Transformer model that comes out of the box with OpenNMT-tf:

onmt-main --model_type TransformerRelative --config data.yml --auto_config train

The implementation seems to correspond to this paper: [1803.02155] Self-Attention with Relative Position Representations
It’s the last item in this list: Model — OpenNMT-tf 2.16.0 documentation


Many thanks, Nart, for the detailed answer!

My understanding is that in OpenNMT-py, position_encoding: 'true' achieves the same purpose. Could you please confirm or correct me, François? Thanks! @francoishernandez

Kind regards,
Yasmin


Positional encoding and relative position representations are not the same thing.
Positional encoding, or position_encoding, comes from the original Transformer architecture (section 3.5 of “Attention Is All You Need”).
Relative position representations were introduced a bit later, in the paper that @Nart mentions. This was implemented in OpenNMT-py and can be enabled by setting max_relative_positions.


Some interesting tips there :-). I am also doing some “low-resource” work.


I am curious, what language are you working on?

I am not sure why I can’t update my previous post anymore!
Back-translation, first iteration. (+4.5 BLEU)

Hi @Nart,

Many thanks for sharing your insight and hints.

I’m particularly curious about this one (point 8, training several SentencePiece versions with a shared vocab):

Is there a paper describing this technique? Wouldn’t the final vocabulary be too big (after adding all tokens for both languages) for a low-resource setting? I guess the effect should be more or less similar to applying BPE dropout; have you tried it too?

My answers are all speculative, no concrete analysis or experiments yet.

I have not tried BPE dropout. From the paper’s description, the dropout seems to be applied randomly, so the model is affected by both the bad and the good signal. In my case, the BPE segmentation is still unique to each SentencePiece model, but the NMT model is aware of other possible segmentations.

The vocab is slightly bigger than it would have been if I had only used the largest model (all models’ vocab: 62681; largest model’s vocab: 57000). Afterwards, you can pick the model with the best BLEU score and reduce the NMT model’s vocab size just for that model (vocab_update).

This is based on my own experiments, no papers published.

Hi @Nart, Just spotted your question as I didn’t get a notification. I’m focussing on Tagalog (Filipino)-English.

How is your progress? What strategy are you using?

I’ve described this work at some length in various posts in the Facebook Machine Translation Group. It started with SMT, then various small NMT models, then back-translation, then a Transformer model.


I have some questions regarding the way you determine the BLEU score.

  1. Did you check all of these individually? Could there be a chance that keeping long sentences and paragraphs considerably reduces the gains from the other ones?

  2. Was your BLEU score determined on a completely independent test set of the same kind of data? I’m asking because I’m doing some similar tests and I use both the paragraphs and the sentences, plus the chunked sentences. I wasn’t sure of the best way to build my script in order to guarantee my test set is completely independent and yet not throw away any data. Since I’m doing lots of filtering on the data, if I apply these filters at the paragraph level… I will lose lots of data!
    Another challenge is that when I include full sentences and chunks of sentences… I need to compare my BLEU score on the same kind of data as before splitting them. You usually get a much better score when the sentences are smaller, which wouldn’t be a fair comparison.

I just did a test where I keep long sentences, plus, when the source and target have exactly the same punctuation (.!?,:;), I break those long sentences into small chunks based on the punctuation. The results seem to be significantly better, but for the reasons mentioned above I have not put much time into figuring out how to compare the BLEU scores…
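
For what it’s worth, here is a rough sketch of that kind of punctuation-based chunking (only an illustration, not my actual script, and the splitting rules are simplified):

# Illustrative sketch only: split a parallel pair into aligned chunks when
# source and target contain exactly the same punctuation sequence.
import re

PUNCT = ".!?,:;"
SPLITTER = re.compile(r"([.!?,:;])")  # capturing group keeps the punctuation

def punct_signature(text):
    return "".join(ch for ch in text if ch in PUNCT)

def split_pair(src, tgt):
    sig = punct_signature(src)
    if not sig or sig != punct_signature(tgt):
        return [(src, tgt)]  # keep the pair whole if the punctuation differs
    src_parts, tgt_parts = SPLITTER.split(src), SPLITTER.split(tgt)
    # Re-attach each punctuation mark to the text preceding it.
    src_chunks = [t.strip() + p for t, p in zip(src_parts[0::2], src_parts[1::2])]
    tgt_chunks = [t.strip() + p for t, p in zip(tgt_parts[0::2], tgt_parts[1::2])]
    # Keep any trailing text that has no final punctuation mark (on both sides).
    if src_parts[-1].strip() and tgt_parts[-1].strip():
        src_chunks.append(src_parts[-1].strip())
        tgt_chunks.append(tgt_parts[-1].strip())
    return list(zip(src_chunks, tgt_chunks))

print(split_pair("Hello, how are you? I am fine.",
                 "Bonjour, comment vas-tu ? Je vais bien."))
# -> three aligned chunks, split on the shared ",", "?", "." sequence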

You did a good job too, I enjoyed the results :)

Hello @SamuelLacombe

I have not analyzed those independently; this definitely needs fine-tuning and individual testing to figure out what works best.

Yes, the testing data is not part of the training data, but it’s in-domain.

True.

I get the BLEU scores on the validation/test data for each model (5 SentencePiece versions feeding a single NMT model). BLEU seems to be a relative metric, but it can be a good indicator of whether the models are heading in the right direction (a rough sacrebleu sketch follows the scores below).

BLEU-0: Tokenized validation data
BLEU-1: Detokenized validation data
BLEU-2: Tokenized test data
BLEU-3: Detokenized test data

28500 v6/src-vocab.txt
BLEU-0 = 34.76, 61.1/41.1/31.6/25.8 (BP=0.919, ratio=0.922, hyp_len=57572, ref_len=62445)
BLEU-1 = 27.46, 53.0/33.0/23.3/17.7 (BP=0.942, ratio=0.944, hyp_len=37347, ref_len=39568)
BLEU-2 = 36.32, 62.8/43.3/33.4/27.6 (BP=0.913, ratio=0.916, hyp_len=30255, ref_len=33015)
BLEU-3 = 29.16, 55.1/35.0/25.2/19.6 (BP=0.934, ratio=0.936, hyp_len=19841, ref_len=21200)

25000 v7/src-vocab.txt
BLEU-0 = 35.27, 61.2/41.6/32.2/26.3 (BP=0.920, ratio=0.923, hyp_len=59325, ref_len=64243)
BLEU-1 = 27.45, 52.9/32.8/23.3/17.8 (BP=0.943, ratio=0.945, hyp_len=37372, ref_len=39568)
BLEU-2 = 36.37, 62.4/43.2/33.5/27.6 (BP=0.916, ratio=0.919, hyp_len=31232, ref_len=33973)
BLEU-3 = 28.82, 54.6/34.7/24.9/19.3 (BP=0.933, ratio=0.935, hyp_len=19822, ref_len=21200)

22000 v8/src-vocab.txt
BLEU-0 = 36.14, 61.6/42.7/33.0/26.9 (BP=0.925, ratio=0.927, hyp_len=62436, ref_len=67324)
BLEU-1 = 27.20, 52.9/32.8/23.1/17.5 (BP=0.940, ratio=0.942, hyp_len=37255, ref_len=39568)
BLEU-2 = 36.61, 62.4/43.5/33.6/27.4 (BP=0.921, ratio=0.924, hyp_len=32956, ref_len=35675)
BLEU-3 = 27.98, 54.0/33.8/23.9/18.2 (BP=0.937, ratio=0.939, hyp_len=19909, ref_len=21200)

19000 v9/src-vocab.txt
BLEU-0 = 37.18, 61.7/43.6/33.8/27.5 (BP=0.935, ratio=0.937, hyp_len=68244, ref_len=72811)
BLEU-1 = 26.67, 52.3/32.0/22.4/16.8 (BP=0.946, ratio=0.948, hyp_len=37501, ref_len=39568)
BLEU-2 = 37.92, 62.6/44.7/34.7/28.3 (BP=0.931, ratio=0.934, hyp_len=36004, ref_len=38564)
BLEU-3 = 27.57, 53.6/33.3/23.5/17.7 (BP=0.938, ratio=0.940, hyp_len=19934, ref_len=21200)

16000 v10/src-vocab.txt
BLEU-0 = 38.48, 61.8/44.8/35.0/28.3 (BP=0.946, ratio=0.947, hyp_len=79756, ref_len=84217)
BLEU-1 = 24.92, 50.8/30.3/20.6/15.1 (BP=0.947, ratio=0.949, hyp_len=37537, ref_len=39568)
BLEU-2 = 39.34, 63.1/46.2/36.2/29.3 (BP=0.939, ratio=0.941, hyp_len=42260, ref_len=44924)
BLEU-3 = 25.53, 52.1/31.4/21.3/15.6 (BP=0.939, ratio=0.941, hyp_len=19948, ref_len=21200)
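
For reference, a minimal sketch of how corpus-level BLEU scores like the ones above can be computed with sacrebleu; the file names are placeholders, and this is not necessarily the exact scoring setup I used:

# Hedged sketch: corpus BLEU with sacrebleu. File names are placeholders.
# Pass detokenized hypotheses/references for the detokenized scores, or
# tokenized text for the tokenized ones.
import sacrebleu

with open("hyp.detok.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("ref.detok.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu)  # corpus BLEU, n-gram precisions, brevity penalty, and lengths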

A couple of questions:

  1. What does transforms do under data? I see you have mentioned sentencepiece, so during onmt_build_vocab does it automatically use SentencePiece to generate the source and target vocab?
  2. Regarding training time
    My config/parameters are:
    Training data - 2.5 Million Sentence Pairs
    Transformer architecture
    Training on 2 GPUs (3080s)
    Layers:4
    Heads:6
    Batch_type: tokens
    Batch_size: 4096
    Train_steps: 400000
    Valid_steps: 1000
    Src vocab size: 54k
    Tgt vocab size: 61k

The number of parameters comes out to be ~67M
During training I’m getting 16-18k source tokens/sec and 18-20k target tokens/sec
Along with this I’m getting a value in seconds, which I believe is the total training time elapsed so far; I’m using this to estimate the overall training time.
For 13k steps it took 4.34 hours, which seems like a lot to me, since at that rate 400k steps will take roughly 6 days.
Does this seem appropriate to you, or am I doing something wrong here?

Also, at 13k steps it has only loaded up to weighted corpora 8 (I am not sure about the total number, but I would expect it to be more than 500, if not in the thousands) and has achieved a training accuracy of 54.73, ppl: 6.03, xent: 1.80, and a validation accuracy of 56.8, ppl: 9.25.
The accuracy seems weirdly low to me, even though it has seen only a small part of the dataset. Does this seem right to you?

Any help/feedback would be highly appreciated! It would be a bummer to find out I was doing something wrong only after the training has finished, which takes about a whole week.