Before OpenNMT-py v2, I used Tokenizer to tokenize files with BPE. It had an option to set the number of threads used to preprocess the whole file. Is there such an option for the new on-the-fly tokenization with the onmt_tokenize transform, or is this handled automatically?
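For reference, my pre-v2 workflow looked roughly like this. A minimal sketch using the pyonmttok bindings with a worker pool in place of the CLI's thread option; the model path, tokenization mode, and file names are placeholders, and it assumes the pyonmttok 1.x API where `tokenize` returns a `(tokens, features)` pair:

```python
# Rough sketch of the old offline approach: tokenize a whole file with BPE
# across several worker processes. Model path, mode, and file names are
# placeholders, not the exact options I used.
from multiprocessing import Pool

import pyonmttok

BPE_MODEL = "bpe.model"  # placeholder: path to a trained BPE model

_tokenizer = None

def init_worker():
    # One tokenizer instance per worker process.
    global _tokenizer
    _tokenizer = pyonmttok.Tokenizer("conservative", bpe_model_path=BPE_MODEL)

def tokenize_line(line):
    tokens, _ = _tokenizer.tokenize(line.rstrip("\n"))
    return " ".join(tokens)

if __name__ == "__main__":
    with open("train.src") as fin, \
         open("train.src.tok", "w") as fout, \
         Pool(processes=4, initializer=init_worker) as pool:
        for tokenized in pool.imap(tokenize_line, fin, chunksize=1000):
            fout.write(tokenized + "\n")
```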
Hey there,
For now, and for simplicity, the number of threads used for on-the-fly data processing is the same as the number of GPUs.
One idea to allow more parallelization would be to spawn N ‘producer’ threads per GPU instead of one, but that may require a few adaptations.
Are you seeing performance issues with on-the-fly tokenization?
OK, thanks for your answer. I was just wondering; I don’t have any performance issues.
However, I have another question about the on-the-fly data processing step. I’ve noticed that when using BPE, a directory called samples is created for the tokenized examples during vocabulary building. Are these tokenized examples only generated for vocabulary building? During training, are the examples tokenized again on the fly, or are the dumped samples reused?
Yes, exactly: the dumped samples are not reused during training; the examples are tokenized again on the fly. The dump is mainly there to simplify any ‘visual check’ you might want to do to verify your transforms are working correctly.
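If you want to double-check without digging through the dump, you can also replicate the transform by hand on a raw line. A quick sketch, assuming a plain BPE setup; the mode and model path are placeholders and must match what you set in your config:

```python
# Spot check: apply the same BPE model the onmt_tokenize transform uses to one
# raw line and eyeball the output. Mode and model path are placeholders and
# must match your training config.
import pyonmttok

tokenizer = pyonmttok.Tokenizer("conservative", bpe_model_path="bpe.model")
tokens, _ = tokenizer.tokenize("An example sentence to spot-check the transform.")
print(" ".join(tokens))
```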
By the way, this “samples dump” will be made optional very soon through this PR, which also allows quicker vocabulary building using multiple threads: https://github.com/OpenNMT/OpenNMT-py/pull/1897