How to train an NMT model with subword regularization? (Sampling a segmentation every epoch)

Hi, y’all

I’m trying to apply the subword regularization method from SentencePiece (where a segmentation is sampled from the input sentence’s candidate segmentations every epoch, so the input changes every epoch) to train my NMT model. But as far as I know, in OpenNMT-py the inputs are fixed during training. Is there any method or suggestion I can reference?

Thank you.

Which version of OpenNMT are you using / would like to use?

I’m using OpenNMT-py (PyTorch), the latest version available on GitHub.

Is there any source that I can reference?

Thanks!

Everything regarding tokenization should happen before the preprocessing step.
Basically, you have your data, you tokenize it with your subword model, and then call preprocess.py to build the dataset ‘shards’ as well as the vocabulary(ies) that will be used to train the model.
So, your best bet, I think, would be to apply SentencePiece regularization before preprocessing, i.e. dump your N samples for each example of your dataset, and then run the preprocessing on that.
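In case it helps, here is a minimal sketch of that dumping step, assuming you already have a trained SentencePiece model (the model path, file names and N are just placeholders):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("spm.model")  # placeholder path to your trained SentencePiece model

N = 4  # number of sampled copies of the corpus (placeholder)

with open("train.raw.src") as f:
    sentences = f.read().splitlines()

for i in range(N):
    with open(f"train.sample{i}.src", "w") as out:
        for sent in sentences:
            # SampleEncodeAsPieces(text, nbest_size, alpha):
            # nbest_size=-1 samples from the full segmentation lattice,
            # alpha controls the smoothness of the sampling distribution.
            pieces = sp.SampleEncodeAsPieces(sent, -1, 0.1)
            out.write(" ".join(pieces) + "\n")
```

The same would have to be done on the target side before feeding the files to preprocess.py.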

Oh I read too quickly. Regarding the ‘changing input every epoch’, there is no proper way of doing this, but with the ‘shards’ mechanism you might be able to pull it off quite easily.
I think you can:

  • sample your dataset into N different parts;
  • build the full vocabulary on all the parts;
  • preprocess each part with the -src_vocab (and possibly -tgt_vocab) option;
  • renumber your shards so that your different samples follow one another in sequential order.

When training, the inputter iterator will loop over the shards, thus simulating a change every epoch.
I encourage you to test on a small amount of data to get a feel for all of this if you’re not used to how OpenNMT-py works.
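For reference, something along these lines for the last two bullet points, assuming you have already built vocabulary files over all the parts (the file names and number of parts are placeholders, and the exact preprocess.py options should be checked against your installed OpenNMT-py version):

```python
import subprocess

N = 4  # number of sampled parts (placeholder)

for i in range(N):
    # Each part is preprocessed against the same vocabulary files so that
    # all shards share identical token ids.
    subprocess.run([
        "python", "preprocess.py",
        "-train_src", f"train.sample{i}.src",
        "-train_tgt", f"train.sample{i}.tgt",
        "-valid_src", "valid.src",
        "-valid_tgt", "valid.tgt",
        "-save_data", f"data/demo.sample{i}",
        "-src_vocab", "full_vocab.src",
        "-tgt_vocab", "full_vocab.tgt",
    ], check=True)

# The generated *.train.*.pt shards would then still need to be renamed /
# renumbered under a single data prefix (the renumbering step above) so the
# training iterator loops over them in the intended order.
```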

Dear Hernandez,

Thank you for your quick and kind reply.
I really appreciate it.

So you mean that if I were to run, say, 200 epochs, then I should generate 200 different “shards” and iterate over them.

I guess I should check how to generate those “shards”.

Thank you, so much.

Just had a look at the paper. They seem to have quite good results with a sampling size of 64. The method I suggested might be a bit data-heavy for such sampling sizes.
Not sure exactly how sampling works with SentencePiece, but if the vocab is fixed you might want to adapt the inputter code a bit to do tokenization on the fly instead of dumping everything beforehand.
I think most of it could happen in onmt.inputters.inputter, in batch_iter.
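Just to illustrate what I mean by “on the fly”, here is a toolkit-agnostic sketch of the idea; the helper names and the model path are made up for illustration, not actual OpenNMT-py code:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("spm.model")  # placeholder path

def sample_tokenize(raw_sentence, alpha=0.1):
    """Return one sampled segmentation of a raw (untokenized) sentence.

    Called every time the example is batched, so each epoch sees a
    different segmentation of the same sentence.
    """
    return sp.SampleEncodeAsPieces(raw_sentence, -1, alpha)

def make_batches(raw_examples, batch_size):
    """Illustrative batching loop: tokenization happens at batch time,
    not at preprocessing time."""
    batch = []
    for src_raw, tgt_raw in raw_examples:
        batch.append((sample_tokenize(src_raw), sample_tokenize(tgt_raw)))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```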

@francoishernandez
Thank you for your attention and such a helpful reply. In the SentencePiece library, they provide the function “SampleEncodeAsPieces”, which samples one segmentation candidate from a given text. I think I can use this function while the vocab stays fixed by changing the inputter code you mentioned (onmt.inputters.inputter).
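Just to check my understanding, something like this (the model path is a placeholder) shows that repeated calls indeed return different segmentations of the same sentence:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("spm.model")  # placeholder path to a trained model

# Each call samples one segmentation from the candidate lattice,
# so the output typically differs from call to call.
for _ in range(3):
    print(sp.SampleEncodeAsPieces("New York is a big city.", -1, 0.1))
```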

Thanks again.

Keep us posted if you get interesting results and/or want to submit a PR with your adaptations!

I sure will. Thank you!

@francoishernandez

Hi, while looking around OpenNMT-tf (the TensorFlow version),

I’ve noticed that there is an advanced feature, “on-the-fly tokenization”, and I think this is the feature I’m looking for. Is this not implemented in the PyTorch version?

Thank you.

This is not implemented yet in OpenNMT-py, hence the required adaptations I mentioned before.

Thank you. Then I guess I’ll have to change the inputter code.

Thank you!