Sequence-to-sequence model not making any attempt to translate with OpenNMT-py

I’m trying to train an English-Polish model with OpenNMT-py and can’t get anything resembling a translation out of it.

Input:

Alan Mathison Turing (ur. 23 czerwca 1912 w Londynie, zm. 7 czerwca 1954 w Wilmslow k. Manchesteru) – brytyjski matematyk, kryptolog, twórca koncepcji maszyny Turinga i jeden z twórców informatyki. Uważany za ojca sztucznej inteligencji.

Output:

23 23low 23 23 23 23 23 23 23log1212.lowlowlow czerwcalowlowlowlowlowlowlowlowlowlowloglogloglogloglogloglogloglowlowlowlowlowlowlowalow. Uważa Uważa Uważa sztucznej…

Real translation:

Alan Mathison Turing (born June 23, 1912 in London, died June 7, 1954 in Wilmslow near Manchester) - British mathematician, cryptologist, creator of the Turing machine concept and one of the founders of computer science. Considered the father of artificial intelligence.

The sentence boundary detection and tokenization look fine:


sentences ['Alan Mathison Turing (ur. 23 czerwca 1912 w Londynie, zm. 7 czerwca 1954 w Wilmslow k. Manchesteru) – brytyjski matematyk, kryptolog, twórca koncepcji maszyny Turinga i jeden z twórców informatyki.', 'Uważany za ojca sztucznej inteligencji.']
tokenized [['▁Alan', '▁Math', 'ison', '▁Tur', 'ing', '▁(', 'ur', '.', '▁23', '▁czerwca', '▁19', '12', '▁w', '▁Londynie', ',', '▁z', 'm', '.', '▁7', '▁czerwca', '▁19', '54', '▁w', '▁Wil', 'm', 's', 'low', '▁k', '.', '▁Manchester', 'u', ')', '▁–', '▁brytyjski', '▁matematyk', ',', '▁krypto', 'log', ',', '▁', 'twór', 'ca', '▁koncepcji', '▁maszyny', '▁Tur', 'ing', 'a', '▁i', '▁jeden', '▁z', '▁twórców', '▁inform', 'a', 'tyki', '.'], ['▁Uważa', 'ny', '▁za', '▁ojca', '▁sztuczne', 'j', '▁inteligencji', '.']]
translated_batches [[{'tokens': ['▁23', '▁23', 'low', '▁23', '▁23', '▁23', '▁23', '▁23', '▁23', '▁23', 'log', '12', '12', '.', 'low', 'low', 'low', '▁czerwca', 'low', 'low', 'low', 'low', 'low', 'low', 'low', 'low', 'low', 'low', 'log', 'log', 'log', 'log', 'log', 'log', 'log', 'log', 'log', 'low', 'low', 'low', 'low', 'low', 'low', 'low', 'a', 'low', '.'], 'score': -10.890531539916992}], [{'tokens': ['▁Uważa', '▁Uważa', '▁Uważa', '▁sztuczne', 'j', '.', '.'], 'score': -2.8631486892700195}]]

The model was trained on 82,402,198 parallel sentences for 60,000 steps. I've also tried averaging the 50,000-step checkpoint with the 60,000-step checkpoint. I'm exporting to CTranslate2 with 8-bit quantization.
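For reference, this is roughly how I'm doing the averaging and the export. The file names are placeholders, and I'm assuming the onmt_average_models entry point shipped with OpenNMT-py 2.x and the standard CTranslate2 converter flags; the exact options may differ slightly between versions:

# Average the last two checkpoints (file names are placeholders)
onmt_average_models -models model_step_50000.pt model_step_60000.pt -output model_avg.pt

# Convert the averaged checkpoint to CTranslate2 with 8-bit quantization
ct2-opennmt-py-converter --model_path model_avg.pt --output_dir enpl_ctranslate2 --quantization int8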

One possibility is that the model simply needs to train longer. However, I had pretty good results with OpenNMT-tf after 30,000 steps, which makes this seem unlikely, even though the PyTorch models seem to be a little larger.

My plan is to try another language and see if it works, but any advice is appreciated.

Full code

I'm not sure your training data is tokenized. You should either tokenize the data offline (before training) or enable on-the-fly tokenization by selecting the appropriate data transforms.

@francoishernandez Can you confirm that?


Yes, @argosopentech, you need to enable the tokenization transform in your configuration, such as in this example or this one.


Whoops, thanks, that would do it. I think I copied the config from @francoishernandez's second example:

#### Subword
src_subword_model: sentencepiece.model
tgt_subword_model: sentencepiece.model
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
#### Filter
src_seq_length: 150
tgt_seq_length: 150

# silently ignore empty lines in the data
skip_empty_level: silent

onmt_build_vocab -config config.yml -n_sample -1

Is the issue that I’m missing this?

src_subword_type: sentencepiece
tgt_subword_type: sentencepiece

No, the issue is that you need to explicitly define the transforms.


I see, thanks:

# Corpus opts:
data:
    commoncrawl:
        path_src: data/wmt/commoncrawl.de-en.en
        path_tgt: data/wmt/commoncrawl.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 23
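So for my English-Polish run, the data section should look something like this. The corpus name and paths are placeholders for my own files; the key part is listing the sentencepiece and filtertoolong transforms for each corpus alongside the subword model options I already had:

# Corpus opts (sketch for my en-pl data; corpus name and paths are placeholders):
data:
    corpus_1:
        path_src: data/enpl/train.en
        path_tgt: data/enpl/train.pl
        transforms: [sentencepiece, filtertoolong]

#### Subword
src_subword_model: sentencepiece.model
tgt_subword_model: sentencepiece.model

#### Filter
src_seq_length: 150
tgt_seq_length: 150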