Running OpenNMT-tf Quickstart Very Slow

I’m running the quickstart for OpenNMT-tf on FloydHub using a GPU (Tesla K80).

The training log reports about 0.16 steps/sec, which seems far too slow; training would take forever at that rate. Is this the kind of speed we should expect?

The auto_config flag enables gradient accumulation so that the effective batch size is the same as the original Transformer paper. With gradient accumulation, one training step processes multiple batches, hence the apparently low steps/sec value.
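As a rough back-of-the-envelope illustration (the batch sizes below are hypothetical placeholders, not the values your run actually uses; check your own configuration), the reported steps/sec understates how many batches are processed per second:

```python
# Hypothetical values for illustration only; the real per-pass and effective
# batch sizes come from the model's auto_config and your hardware.
per_pass_batch_tokens = 3072       # tokens processed in one forward/backward pass
effective_batch_tokens = 25000     # target effective batch size per training step

# Number of batches accumulated before one optimizer update (one "step").
accum_batches = -(-effective_batch_tokens // per_pass_batch_tokens)  # ceil division

reported_steps_per_sec = 0.16
passes_per_sec = reported_steps_per_sec * accum_batches
print(f"1 step = {accum_batches} accumulated batches, "
      f"so {reported_steps_per_sec} steps/sec is about "
      f"{passes_per_sec:.2f} batches/sec")
```

So a low steps/sec number does not necessarily mean the GPU is idle; each step is doing several batches' worth of work.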

In any case, don’t spend too much time on the quickstart. You can’t get usable results with the small training data that is used.

Thanks, that’s good to know.

Is there an approximate minimum dataset size needed to get usable results? I’m gathering my own data, and right now I’ve only got around 100K sentences/utterances.

There is no definite answer to that. It depends on the task and the final expectation in terms of quality. For machine translation, people usually work with millions of examples.

Is there an approximate minimum dataset size needed to get usable results?

There are many approaches for low-resource NMT depending on the available amount of monolingual, bilingual, or multilingual data.
Which language pair do you want to train?
You could get good results by fine-tuning BART with your dataset (for long-form generative question answering and dialog response generation); a rough sketch of what that looks like is below.
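If you go that route, a minimal fine-tuning loop with the Hugging Face transformers library might look like the following. The model name, example data, and learning rate are placeholders for illustration, not a tested recipe:

```python
# Sketch of fine-tuning BART on (context, response) pairs with Hugging Face
# transformers. Model name, data, and hyperparameters are placeholders.
from transformers import BartTokenizer, BartForConditionalGeneration
import torch

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# Each training example pairs an incoming message with the reply you gave.
pairs = [("How was your day?", "Pretty good, just busy with work.")]

model.train()
for context, response in pairs:
    inputs = tokenizer(context, return_tensors="pt", truncation=True)
    labels = tokenizer(response, return_tensors="pt", truncation=True).input_ids
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```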

That’s an interesting paper on Low Resource NMT!

I trained a Tagalog-English system with about 120K sentence pairs, using bilingual materials I found on the Australian immigration website and several tourist phrase books. The model gave intelligible results within the domains covered, but outside those areas it was useless. I’ve generally found you need between two and three million sentence pairs to get translations of any practical use.

I’m trying to create a chatbot that responds like I do for a hobby project. It’s my first time using OpenNMT, or doing any sort of larger-scale machine learning. I’ve got about 100K sentences gathered (so far) from my Facebook messages, SMS, and Slack conversations.

There’s no way I could get 2 or 3 million sentences of my own speech. I think I’m going to try to train on the 100K dataset I have now. If that fails, I’ll start adding more conversational data that isn’t my own.
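For the data prep, my rough plan is to pair each incoming message with the reply I sent and write the two sides to parallel line-aligned text files, which is the sort of source/target format OpenNMT-tf trains on. Just a sketch of the idea; the (sender, text) log format, the "me" label, and the file names are my own assumptions:

```python
# Sketch: turn a chronological chat log into parallel src/tgt files.
# The log format, "me" label, and output file names are assumptions.
messages = [
    ("friend", "Are you coming tonight?"),
    ("me", "Yeah, I should be there around 8."),
    ("friend", "Cool, see you then."),
]

with open("src-train.txt", "w") as src, open("tgt-train.txt", "w") as tgt:
    for (prev_sender, prev_text), (sender, text) in zip(messages, messages[1:]):
        # Treat any message I sent as the "translation" of the message before it.
        if sender == "me" and prev_sender != "me":
            src.write(prev_text.replace("\n", " ") + "\n")
            tgt.write(text.replace("\n", " ") + "\n")
```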