Experiment with limited data set - how small is too small?

Hi all,

We are starting an experiment where we have a limited number of sentences in English (a vocabulary of a few hundred words generating fewer than 2,000 sentences, with a maximum of 5,000) which need to be translated to an experimental language that also has a few hundred words (and roughly 2,000 sentences, with a maximum of 5,000). We can guarantee that every English sentence has an equivalent translation in the target language and that the mapping is deterministic, i.e. the target language is domain-specific and limited.

We are not doing this manually because we have numerous target languages, so automating this is better.
(If this works, the input languages can also expand beyond English.)

We came across seq2seq, which seemed promising, but then found OpenNMT, which seemed even more promising, and we are evaluating whether OpenNMT is the better fit. Before we go deeper into OpenNMT, we wanted to check whether it is even fit for purpose given the limited training examples (see below).

Can OpenNMT be used in this scenario? How many sentences does OpenNMT need to be trained well?

Thanks,
kc

Hi,

This is too small for training a model from scratch, which typically requires millions of examples.

In this context, there are techniques to start training on a large and generic corpus and then run a few more iterations on small and domain-specific data. You can search for “specialization” or “domain adaptation”.
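To illustrate the idea only (this is not OpenNMT’s actual API, just a minimal PyTorch-style sketch with placeholder model, data and hyperparameters): you pretrain on the large generic corpus, then continue training for a few more steps on the small in-domain set, typically with a lower learning rate.

```python
# Minimal sketch of two-phase training (generic pretraining, then domain adaptation).
# The model, data and hyperparameters below are placeholders, not OpenNMT code.
import torch
import torch.nn as nn

def train(model, batches, steps, lr):
    """Run a fixed number of optimization steps over (src, tgt) batches."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    it = iter(batches)
    for _ in range(steps):
        try:
            src, tgt = next(it)
        except StopIteration:          # cycle over the data when it runs out
            it = iter(batches)
            src, tgt = next(it)
        logits = model(src)            # stand-in for a real seq2seq forward pass
        loss = loss_fn(logits.view(-1, logits.size(-1)), tgt.view(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Toy model and random data so the sketch runs end to end.
vocab = 100
model = nn.Sequential(nn.Embedding(vocab, 32), nn.Linear(32, vocab))
generic = [(torch.randint(0, vocab, (8, 10)), torch.randint(0, vocab, (8, 10)))
           for _ in range(50)]
in_domain = [(torch.randint(0, vocab, (8, 10)), torch.randint(0, vocab, (8, 10)))
             for _ in range(5)]

train(model, generic, steps=1000, lr=1e-3)   # phase 1: large generic corpus
train(model, in_domain, steps=100, lr=1e-4)  # phase 2: a few iterations on in-domain data
torch.save(model.state_dict(), "adapted_model.pt")
```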

However, I’m not sure that 5k examples are enough with this approach. Maybe other users have more experience with this.

Thank you for the information. To factor this in, we will try to rework the architecture to combine multiple targets to ensure that we have a large enough sample size.
E.g.:
Sample 1: Source: “English: How are you?”, Target 1: “Spanish: equivalent of ‘how are you’”
Sample 2: Source: “English: Where are you?”, Target 2: “Deutsch: equivalent of ‘where are you’”
Sample 3: Source: “English: When are you coming?”, Target 3: “Dutch: equivalent of ‘when are you coming’”

We will test whether OpenNMT picks this up and predicts the language (“Spanish”/“Deutsch”/“Dutch”) along with the message.
Let us see how it goes. Will update in a few weeks if this works.
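For concreteness, here is a rough Python sketch of how we might prepare such combined training data by prepending a target-language token to the English source, similar to the tag-based multilingual NMT setup; the file names and the tag format (e.g. “<2es>”, “<2de>”) are just our placeholders.

```python
# Sketch of preparing training files with a target-language token prepended to
# the English source. File names and the "<2xx>" tag format are placeholders.

def tag_source(src_sentence: str, target_lang: str) -> str:
    """Prefix the English source with a token naming the desired target language."""
    return f"<2{target_lang}> {src_sentence}"

pairs = [
    ("How are you?",         "es", "equivalent of 'how are you' in Spanish"),
    ("Where are you?",       "de", "equivalent of 'where are you' in Deutsch"),
    ("When are you coming?", "nl", "equivalent of 'when are you coming' in Dutch"),
]

with open("train.src", "w", encoding="utf-8") as src_f, \
     open("train.tgt", "w", encoding="utf-8") as tgt_f:
    for english, lang, translation in pairs:
        src_f.write(tag_source(english, lang) + "\n")  # e.g. "<2es> How are you?"
        tgt_f.write(translation + "\n")
```

Merging all target languages into a single model this way should multiply the effective number of training pairs, which is the point of the rework.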

I don’t fully get your experimental setting, but have a look at Grammatical Framework. It produces high-quality translations and is easily extendable to multiple languages. The only reason it is not used in a wider range of settings is its small vocabulary, which could be beneficial in this case.

You could try SMT if you want to stick to a training-based approach for your low-resource translations, or collect monolingual texts and treat this as unsupervised NMT.