Increase the number of threads for on-the-fly tokenization

Hi,

Before OpenNMT-py v2, I used the Tokenizer to tokenize files with BPE. It had an option to set the number of threads used to preprocess the whole file. Is there such an option in the new on-the-fly tokenization with onmt_tokenizer? Or is this handled automatically?
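
For reference, the old offline workflow looked roughly like this (a minimal sketch assuming the pyonmttok bindings; the model path, file paths, and thread count are placeholders):

```python
# Minimal sketch of offline BPE tokenization, assuming the pyonmttok
# bindings; paths and the thread count below are placeholders.
import pyonmttok

tokenizer = pyonmttok.Tokenizer("conservative", bpe_model_path="codes.bpe")

# tokenize_file processes the whole file and accepts a thread count.
tokenizer.tokenize_file("train.src", "train.src.tok", num_threads=4)
```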

Thanks

Hey there,
For now, the number of threads used for on-the-fly data processing is the same as the number of GPUs, for simplicity.
One idea to allow more parallelization would be to spawn N ‘producer’ threads per GPU instead of 1. It may require a few adaptations though.
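Purely as an illustration of that producer idea (this is not the actual OpenNMT-py code; the queue size, shard logic, and `tokenize` function are all made up):

```python
# Hypothetical sketch of N producer threads feeding one GPU consumer.
# This does not mirror the OpenNMT-py internals; it only illustrates
# the parallelization idea described above.
import queue
import threading

def producer(lines, out_queue, tokenize):
    # Each producer tokenizes its share of the corpus on the fly.
    for line in lines:
        out_queue.put(tokenize(line))

def run(corpus, tokenize, num_producers=4):
    q = queue.Queue(maxsize=1000)
    shards = [corpus[i::num_producers] for i in range(num_producers)]
    threads = [
        threading.Thread(target=producer, args=(shard, q, tokenize))
        for shard in shards
    ]
    for t in threads:
        t.start()
    # The 'consumer' (e.g. the batch builder for one GPU) would pull
    # tokenized examples from the queue here.
    for _ in range(len(corpus)):
        print(q.get())
    for t in threads:
        t.join()

if __name__ == "__main__":
    run(["hello world", "how are you"], tokenize=str.split, num_producers=2)
```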

  • Do you have any performance issues with on-the-fly tokenization?
  • If so, what is your setup (CPU/GPU/task)?

OK, thanks for your answer. I was just wondering; I don’t have any performance issues.

However, I have another question about the on-the-fly data processing step. I’ve seen that when using BPE, a directory called samples is created with tokenized examples during vocabulary creation. Are these tokenized examples only generated for vocabulary building? During training, are the examples tokenized again on the fly, or are they reused?

Yes, these samples are mainly there to simplify any ‘visual check’ you might want to do to verify your transforms are working as expected.
By the way, this samples dump will soon be made optional through this PR (which also allows quicker vocabulary building using multiple threads): https://github.com/OpenNMT/OpenNMT-py/pull/1897
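
Conceptually, the training-time behaviour is like this simplified sketch (not the actual OpenNMT-py transform pipeline; `tokenize` stands in for whatever transform chain you configured): the raw corpus is re-read and re-tokenized on the fly, and the samples dump is never read back.

```python
# Simplified sketch: during training, raw examples are read and the
# configured transforms (e.g. BPE) are applied on the fly on every
# pass over the data; the 'samples' directory is not read back.
def example_stream(corpus_path, tokenize):
    with open(corpus_path) as f:
        for line in f:
            yield tokenize(line.strip())  # tokenized again here, each time
```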

Perfect, that’s what I thought. I’ll check out the PR. Thanks!