Running OpenNMT-tf Quickstart Very Slow

I’m running the quickstart for OpenNMT-tf on FloydHub using a GPU (Tesla K80).

The training log reports about 0.16 steps/sec, which seems far too slow; training would take forever at that rate. Is this the kind of speed we should expect?

The auto_config flag enables gradient accumulation so that the effective batch size is the same as the original Transformer paper. With gradient accumulation, one training step processes multiple batches, hence the apparently low steps/sec value.
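As a rough back-of-the-envelope illustration (the batch sizes below are hypothetical placeholders, not the values your run actually uses; check your own configuration), the reported steps/sec understates how many batches are processed per second:

```python
# Hypothetical values for illustration only; the real per-pass and effective
# batch sizes come from the model's auto_config and your hardware.
per_pass_batch_tokens = 3072       # tokens processed in one forward/backward pass
effective_batch_tokens = 25000     # target effective batch size per training step

# Number of batches accumulated before one optimizer update (one "step").
accum_batches = -(-effective_batch_tokens // per_pass_batch_tokens)  # ceil division

reported_steps_per_sec = 0.16
passes_per_sec = reported_steps_per_sec * accum_batches
print(f"1 step = {accum_batches} accumulated batches, "
      f"so {reported_steps_per_sec} steps/sec is about "
      f"{passes_per_sec:.2f} batches/sec")
```

So a low steps/sec number does not necessarily mean the GPU is idle; each step is doing several batches' worth of work.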

In any case, don’t spend too much time on the quickstart. You can’t get usable results with the small training data that is used.

Thanks, that’s good to know.

Is there an approximate minimum dataset size needed to get usable results? I’m gathering my own data, and right now I’ve only got around 100K sentences/utterances.

There is no definite answer to that. It depends on the task and the final expectation in terms of quality. For machine translation, people usually work with millions of examples.

Is there an approximate minimum dataset size needed to get usable results?

There are many approaches for low-resource NMT depending on the available amount of monolingual, bilingual, or multilingual data.
Which language pair do you want to train?
You could get good results by fine-tuning BART with your dataset (for long-form generative question answering and dialog response generation); a rough sketch of what that looks like is below.
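If you go that route, a minimal fine-tuning loop with the Hugging Face transformers library might look like the following. The model name, example data, and learning rate are placeholders for illustration, not a tested recipe:

```python
# Sketch of fine-tuning BART on (context, response) pairs with Hugging Face
# transformers. Model name, data, and hyperparameters are placeholders.
from transformers import BartTokenizer, BartForConditionalGeneration
import torch

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# Each training example pairs an incoming message with the reply you gave.
pairs = [("How was your day?", "Pretty good, just busy with work.")]

model.train()
for context, response in pairs:
    inputs = tokenizer(context, return_tensors="pt", truncation=True)
    labels = tokenizer(response, return_tensors="pt", truncation=True).input_ids
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```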

That’s an interesting paper on Low Resource NMT!

I trained a Tagalog-English system with about 120K sentence pairs, using bilingual materials I found on the Australian immigration website and several tourist phrase books. The model gave intelligible results within the domains covered, but outside those areas it was useless. I’ve generally found you need between two and three million sentence pairs to get translations of any practical use.

I’m trying to create a chatbot that responds like I do for a hobby project. It’s my first time using OpenNMT, or doing any sort of larger-scale machine learning. I’ve got about 100K sentences gathered (so far) from my Facebook messages, SMS, and Slack conversations.

There’s no way I could get 2 or 3 million sentences of my own speech. I think I’m going to try to train on the 100K dataset I have now. If that fails, I’ll start adding more conversational data that isn’t my own.
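For the data prep, my rough plan is to pair each incoming message with the reply I sent and write the two sides to parallel line-aligned text files, which is the sort of source/target format OpenNMT-tf trains on. Just a sketch of the idea; the (sender, text) log format, the "me" label, and the file names are my own assumptions:

```python
# Sketch: turn a chronological chat log into parallel src/tgt files.
# The log format, "me" label, and output file names are assumptions.
messages = [
    ("friend", "Are you coming tonight?"),
    ("me", "Yeah, I should be there around 8."),
    ("friend", "Cool, see you then."),
]

with open("src-train.txt", "w") as src, open("tgt-train.txt", "w") as tgt:
    for (prev_sender, prev_text), (sender, text) in zip(messages, messages[1:]):
        # Treat any message I sent as the "translation" of the message before it.
        if sender == "me" and prev_sender != "me":
            src.write(prev_text.replace("\n", " ") + "\n")
            tgt.write(text.replace("\n", " ") + "\n")
```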