Same output sentence for any EnFr input

tlysenko · July 8, 2019, 8:55am

Hi,

I’m facing a really weird bug. I’m training an EnFr model, and for any input sentence the model keeps returning one and the same sentence at inference time.
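To put a number on the symptom, something like the following could be run over the translations of a held-out set. It is only a minimal sketch: the function name `output_diversity` and the sample sentences are made up for illustration, and it assumes the hypotheses are available as one string per input.

```python
from collections import Counter


def output_diversity(hypotheses):
    """Return (distinct_ratio, most_common_sentence, share_of_most_common)."""
    cleaned = [h.strip() for h in hypotheses]
    if not cleaned:
        return 0.0, "", 0.0
    counts = Counter(cleaned)
    sentence, freq = counts.most_common(1)[0]
    return len(counts) / len(cleaned), sentence, freq / len(cleaned)


# Hypothetical model outputs for three different source sentences.
hyps = [
    "Le comité a adopté le rapport .",
    "Le comité a adopté le rapport .",
    "Le comité a adopté le rapport .",
]
ratio, top, share = output_diversity(hyps)
print(f"distinct ratio={ratio:.2f}, top sentence share={share:.2f}")
```

A top-sentence share close to 1.0 on varied inputs would match the behavior described here.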

For training I’m using an 8M-sentence shuffled and cleaned subset of Gigafr, eubook, multiun and newscom, downloaded from opus.nlpl.eu in Moses format.

The model was trained for 25,000 steps (~10 epochs).

The French corpus is not transliterated. Is that required for this language pair?

The dataset is shuffled, cleaned and preprocessed with a joint BPE SentencePiece model.

The SentencePiece model training parameters are:

# train: 5M-sentence shuffled subset of the main corpus (2.5M En + 2.5M Fr)
spm_train \
  --input=train \
  --model_prefix=spm \
  --vocab_size=32000 \
  --input_format=text \
  --num_threads=5 \
  --input_sentence_size=4999995 \
  --max_sentence_length=500 \
  --shuffle_input_sentence=true \
  --character_coverage=0.9995 \
  --model_type=bpe
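For reference, the mixed En + Fr input for the SentencePiece training could be assembled roughly like this. This is a sketch, not the exact script used: `sample_joint_spm_input` is a hypothetical helper, it holds everything in memory, and the tiny lists stand in for the real corpus files.

```python
import random


def sample_joint_spm_input(en_lines, fr_lines, per_language, seed=0):
    """Draw the same number of lines from each language, then shuffle them together."""
    rng = random.Random(seed)
    en = rng.sample(en_lines, min(per_language, len(en_lines)))
    fr = rng.sample(fr_lines, min(per_language, len(fr_lines)))
    mixed = en + fr
    rng.shuffle(mixed)
    return mixed


# Toy stand-ins for the English and French sides of the corpus.
en_side = ["The report was adopted .", "Thank you .", "The session is closed ."]
fr_side = ["Le rapport est adopté .", "Merci .", "La séance est levée ."]
spm_input = sample_joint_spm_input(en_side, fr_side, per_language=2)
print(len(spm_input))
```

Sampling the same count per language keeps either side from dominating the shared vocabulary.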

Any ideas what could cause this behavior?

Thank you!

tlysenko · July 23, 2019, 1:01pm

Hi, I’d like to post an update on this. It was all about the quality of the dataset. After cleaning it with the Bicleaner tool, the issue was gone.
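To give an idea of the kind of noise involved, here is a very rough rule-based filter for sentence pairs. To be clear, this is not what Bicleaner does internally (it scores pairs with a trained classifier on top of its own rules). It is only an illustrative sketch, and the name `keep_pair` and both thresholds are arbitrary assumptions.

```python
def keep_pair(src, tgt, max_tokens=250, max_ratio=3.0):
    """Heuristically decide whether a (source, target) pair looks like a usable translation."""
    src, tgt = src.strip(), tgt.strip()
    if not src or not tgt:
        return False  # one side is empty
    if src == tgt:
        return False  # untranslated copy of the source
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src > max_tokens or n_tgt > max_tokens:
        return False  # overly long segment
    # Both counts are at least 1 here, so the division is safe.
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio


# Toy pairs, invented for illustration.
pairs = [
    ("Hello world .", "Bonjour le monde ."),
    ("Hello", ""),
    ("Same text", "Same text"),
    ("Hi", "Bonjour à tous et bienvenue ici"),
]
kept = [p for p in pairs if keep_pair(*p)]
print(len(kept), "of", len(pairs), "pairs kept")
```

Rules like these only catch the obvious cases. Misaligned but plausible-looking pairs need a proper scorer, which is where a dedicated tool helps.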