How to train an NMT model with subword regularization? (Sampling segmentation every epoch)

JJumSSu · September 16, 2019, 2:26pm

Hi, y’all

I’m trying to apply the subword regularization method from SentencePiece (which samples one of the input sequence’s candidate segmentations every epoch, so the input changes every epoch) to train my NMT model. But in OpenNMT (torch), the inputs are fixed during training (as far as I know). Is there any method or suggestion that I can refer to?

Thank you.

francoishernandez · September 16, 2019, 2:32pm

Which version of OpenNMT are you using / would like to use?

JJumSSu · September 17, 2019, 5:10am

I’m using OpenNMT-py (PyTorch), the latest version available on GitHub.

Is there any source that I can reference?

Thanks!

francoishernandez · September 17, 2019, 6:59am

Everything regarding tokenization should happen before the preprocessing step.
Basically, you have your data, you tokenize it with your subword model, and then call preprocess.py to build the dataset ‘shards’ as well as the vocabulary (or vocabularies) that will be used to train the model.
So, your best bet I think would be to apply sentencepiece regularization before preprocessing, i.e. dump your N samples for each example of your dataset, and then preprocess on this.

francoishernandez · September 17, 2019, 7:06am

Oh I read too quickly. Regarding the ‘changing input every epoch’, there is no proper way of doing this, but with the ‘shards’ mechanism you might be able to pull it off quite easily.
I think you can:

1. sample your dataset N times, producing N differently segmented parts;
2. build the full vocabulary on all the parts;
3. preprocess each part with the -src_vocab (and possibly -tgt_vocab) option;
4. renumber your shards so that your different samples come in consecutive order.

When training, the inputter iterator will loop over the shards, thus simulating a change every epoch.
I encourage you to test on little data to get the feel of all of it if you’re not used to how OpenNMT-py works.
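
The preparation step for this shard-based workaround could look roughly like the sketch below. It is not OpenNMT-py code: `write_sampled_parts`, the file naming, and the `sample_fn` callable are hypothetical names, and `sample_fn` stands for whatever returns one randomly sampled segmentation of a sentence (for instance a SentencePiece sampling call). Each written part would then still have to go through preprocess.py with the shared vocabulary.

```python
import os
from collections import Counter


def write_sampled_parts(sentences, sample_fn, n_parts, out_dir, prefix="train.src"):
    """Write n_parts differently segmented copies of `sentences`.

    sample_fn(sentence) must return a list of subword pieces and is expected
    to be stochastic, so that each part gets a different segmentation.
    Returns the written paths and a Counter over every piece seen in any
    part, which can be used to build one vocabulary shared by all parts.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    vocab = Counter()
    for part in range(n_parts):
        path = os.path.join(out_dir, f"{prefix}.{part}.txt")
        with open(path, "w", encoding="utf-8") as f:
            for sentence in sentences:
                pieces = sample_fn(sentence)  # one sampled segmentation
                vocab.update(pieces)
                f.write(" ".join(pieces) + "\n")
        paths.append(path)
    return paths, vocab
```

Note that disk usage grows linearly with the number of parts, since every part is a full copy of the corpus.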

JJumSSu · September 17, 2019, 8:10am

Dear Hernandez,

Thank you for your quick and kind reply.
I really appreciate it.

So you mean that if I were to run, say, 200 epochs, I should generate 200 different “shards” and then iterate over them?

I guess I should check how to generate those “shards”.

Thank you so much.

francoishernandez · September 17, 2019, 8:25am

Just had a look at the paper. They seem to have quite good results with a sampling size of 64. The method I suggested might be a bit data heavy for such sampling sizes.
I'm not sure exactly how sampling works with sentencepiece, but if the vocab is fixed you might want to adapt the inputter code a bit to do tokenization on the fly instead of dumping everything beforehand.
I think most of it could happen in onmt.inputters.inputter in batch_iter.
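
As a rough illustration of the on-the-fly idea, independent of OpenNMT-py's actual inputter code: keep the raw text in memory and re-segment it every time the data is iterated. `ResamplingCorpus` and the two `sample_fn` callables are hypothetical names; wiring something like this into onmt.inputters.inputter would need adaptations that are not shown here.

```python
class ResamplingCorpus:
    """Iterable over (src_pieces, tgt_pieces) that re-segments on every pass."""

    def __init__(self, src_sentences, tgt_sentences, src_sample_fn, tgt_sample_fn):
        if len(src_sentences) != len(tgt_sentences):
            raise ValueError("source and target must have the same number of lines")
        self.src_sentences = src_sentences
        self.tgt_sentences = tgt_sentences
        self.src_sample_fn = src_sample_fn
        self.tgt_sample_fn = tgt_sample_fn

    def __len__(self):
        return len(self.src_sentences)

    def __iter__(self):
        # Each new iteration (one per epoch) calls the samplers again,
        # so the segmentation can differ from the previous epoch.
        for src, tgt in zip(self.src_sentences, self.tgt_sentences):
            yield self.src_sample_fn(src), self.tgt_sample_fn(tgt)
```

Because the raw sentences are stored only once, memory use does not depend on the number of epochs, unlike dumping one sampled copy per epoch.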

JJumSSu · September 17, 2019, 9:16am

@francoishernandez
Thank you for your attention and such a helpful reply. The SentencePiece library provides the function “SampleEncodeAsPieces”, which samples one segmentation candidate for a given text. I think I can use this function with a fixed vocab by changing the inputter code you mentioned (onmt.inputters.inputter).
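
For reference, the sampling call might be wrapped as in the sketch below. `SampleEncodeAsPieces(text, nbest_size, alpha)` is the call shown in the SentencePiece Python README; the default values -1 and 0.1 are the ones used in that README's example, not tuned settings, and the model path in the usage comment is a placeholder. `load_processor` and `make_sampler` are hypothetical helper names.

```python
def load_processor(model_path):
    """Load a trained SentencePiece model (needs the sentencepiece package)."""
    import sentencepiece as spm  # imported lazily; only needed for real models

    sp = spm.SentencePieceProcessor()
    sp.Load(model_path)
    return sp


def make_sampler(processor, nbest_size=-1, alpha=0.1):
    """Return a function mapping raw text to one sampled list of pieces.

    nbest_size=-1 samples from all segmentation candidates; alpha is the
    smoothing parameter of the sampling distribution.
    """
    def sample(text):
        return processor.SampleEncodeAsPieces(text, nbest_size, alpha)

    return sample


# Usage (assumes a model trained beforehand; the path is a placeholder):
# sampler = make_sampler(load_processor("spm.model"))
# print(sampler("This is a test"))  # the pieces can differ from call to call
```

Since every sampled piece comes from the model's own piece inventory, the vocabulary stays fixed even though the segmentation changes.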

Thanks again.

francoishernandez · September 17, 2019, 9:33am

Keep us posted if you get interesting results and/or want to submit a PR with your adaptations!

JJumSSu · September 18, 2019, 3:59am

Will sure do. Thank you!

JJumSSu · September 18, 2019, 7:50am

@francoishernandez

Hi, while looking around OpenNMT-tf (the TensorFlow version), I noticed that there is an advanced feature, “on-the-fly tokenization”, and I think this is the feature I’m looking for. Is this not implemented in the PyTorch version?

Thank you.

francoishernandez · September 18, 2019, 8:11am

This is not implemented yet in OpenNMT-py, hence the required adaptations I mentioned before.

JJumSSu · September 18, 2019, 9:58am

Thank you. Then I guess I’ll have to change the inputter code.

Thank you!