Cannot translate some sentences

Hello,

I am developing an English-Spanish translator, but I have found some strange behaviour while testing it. Setting a baseline, I got a BLEU score of 0.37 on the test dataset, and in general the translations are decent despite the limited vocabulary, and an accuracy of 71 on the validation dataset while training.

However, I have noticed that sometimes complete sentences come out in the source language after trying to translate them.

I am going to give an example.
Imagine I want to translate this sentence into Spanish:

“Musicological influences and references can be found throughout his work; he has even included musical notation in the text to make a point.”

Even though the preceding and following sentences are translated OK (there are some errors, but the result in general is decent), this sentence is translated as:

" and can be found throughout his he even even music in the text to make a "

However, if I make a little change, just adding “his” at the beginning of the sentence, like:

“His musicological influences and references can be found throughout his work; he has even included musical notation in the text to make a point.”

The translation is:

“Sus influencias y referencias pueden ser encontradas en su incluso ha incluido la notación musical en el texto para hacer un punto.”

This makes sense in Spanish, despite the "unk"s. What’s more, the words “influences” and “references”, which were translated as “unk” in the first example, are well translated in this second example: “influencias” means “influences” and “referencias” means “references”.

This has happened to me with more examples; I am not sure why the texts are in general well translated but some sentences appear untranslated. Maybe my model is quite poor. So please, if you can give me some advice, I’d be grateful.

Thanks! Gracias!
Ana

Hi,
Please give more info on the dataset you used, your preprocessing command line and your training workflow.
Thanks.

Hello,

First of all, I am using the PyTorch implementation. I created a dataset using various open-source corpora I found on the internet. These are:

  • Europarl
  • UN (United Nations)
  • One from different articles on Wikipedia
  • Common Crawl
  • News Commentary

Right now I am just setting a baseline, so I use texts on different topics. I got about 16.8 million sentences.

Once I had a single text in Spanish and the same text in English, I split it into train, test and validation text files, using 99%, 0.5% and 0.5% respectively, because I didn’t want the test and validation files to be too long.
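
In case it helps, the split I did is roughly equivalent to this small script (the file names are just illustrative and the script itself is only a sketch of the idea, not the exact code I used):

import random

def split_corpus(src_path, tgt_path, prefix):
    # read the two files in parallel so source and target stay aligned
    with open(src_path, encoding="utf-8") as f_src, open(tgt_path, encoding="utf-8") as f_tgt:
        pairs = list(zip(f_src, f_tgt))
    random.shuffle(pairs)
    n_small = max(1, int(0.005 * len(pairs)))  # 0.5% for validation, 0.5% for test
    splits = {
        "val": pairs[:n_small],
        "test": pairs[n_small:2 * n_small],
        "train": pairs[2 * n_small:],  # remaining ~99%
    }
    for name, chunk in splits.items():
        with open(f"{name}-{prefix}.en", "w", encoding="utf-8") as out_en, \
             open(f"{name}-{prefix}.es", "w", encoding="utf-8") as out_es:
            for en, es in chunk:
                out_en.write(en)
                out_es.write(es)

split_corpus("all_corpus.en", "all_corpus.es", "all_corpus")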

And after that, I used these commands from the library:

python preprocess.py -train_src …\Datasets\train-all_corpus.en -train_tgt …\Datasets\train-all_corpus.es -valid_src …\Datasets\val-all_corpus.en -valid_tgt …\Datasets\test\val-all_corpus.es -save_data data\all_corpus

It successfully created the train .pt files, valid .pt files and the vocab file. After that, I trained the model just using:

python train.py -data data/all_corpus -save_model all_corpus-model -gpu_ranks 0 --batch_size 32 --valid_batch_size 4 --valid_steps 100000 -train_steps 300000 --valid_steps 20000 -start_decay_steps 200000

I must say, I used batch=32 in training and batch=4 in validation because my GPU has 6 GB, so with bigger batches a CUDA out-of-memory error would appear.
I have used the model from step 240000 because it was the one which gave me the best validation accuracy, 71 as I have said. I made a translation from English to Spanish of the test file and after that I computed the BLEU score, which gave me 0.37. I am interested in getting Spanish translations which can be easily understood, so I read the translation I got and it was decent. Of course it can be improved, especially the “unk” predictions, but I noticed the problem I have described with some particular sentences, so I wonder if I am doing something wrong.
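
In case it is useful, the translation and scoring step looks roughly like this (the checkpoint and file names here are approximate, and I believe multi-bleu.perl is the script shipped with the library, though there are other ways to score):

python translate.py -model all_corpus-model_step_240000.pt -src …\Datasets\test-all_corpus.en -output pred-all_corpus.es -gpu 0

perl tools\multi-bleu.perl …\Datasets\test-all_corpus.es < pred-all_corpus.es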

Right now I am training a model with just the Europarl dataset (2 million lines), just to see if having a narrower topic improves the results.

Thanks again, and by the way, congratulations on the library and the documentation, really intuitive!
Ana

And I am sorry, I just noticed the examples I gave were incomplete.
The first phrase is translated as:

“unk unk and unk can be found throughout his unk he even even unk music in the text to make a unk”

and the second one:

“Sus influencias unk y referencias pueden ser encontradas en su unk incluso ha incluido la notación musical en el texto para hacer un punto.”

The “unk”s which appeared were deleted while copying and pasting, sorry. I don’t know if they are important for understanding what is happening, but just in case.

How did you tokenize your dataset? What vocab size did you pass?

I didn’t tokenize my dataset, is it necessary before preprocessing?
I kept the default vocabulary size for both src and target, 50,000.
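
(In case it matters, I understand the vocabulary size is fixed at preprocessing time; if I am reading the options correctly, setting it explicitly would look something like the following, with the same paths as before:)

python preprocess.py -train_src … -train_tgt … -valid_src … -valid_tgt … -save_data data\all_corpus -src_vocab_size 50000 -tgt_vocab_size 50000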

Yes, it is.
Try to follow some of the tutorials that you will find here or in some other places.
Cheers.


Hi Ana, bienvenida

I will try to answer from the experience of my inexperience.

I am not sure what you mean when you say “there are complete sentences which appear in the source language after trying to translate them”. As far as I have seen, if you are not using -replace_unk in translate.py you should not have any English in the Spanish translation. The unknown words should come up as <unk>. Obviously, if you use -replace_unk, the matching source words should come out in English.
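
For example (the model and file names here are just placeholders), the difference is only the flag passed to translate.py:

python translate.py -model my-model_step_240000.pt -src test.en -output pred.es

python translate.py -model my-model_step_240000.pt -src test.en -output pred.es -replace_unk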

If English words not related to unks are coming up, it is probably related to your train/validation files.

Also, very minor changes will change the whole translation. I think this is expected.

For the neural network, “eat”, “Eat”, “eating”, “eat!” and “eat,” are not related. For instance, your “… work; …” sentence is probably not translated well because “work” alone is known, but “work;” is not.

You can overcome this problem with more examples or by increasing the vocabulary, at the cost of corpus availability and computer resources. It will be hard to find, for instance, a corpus with all the tenses of a Spanish verb, or to deal with unlimited memory sizes.

A first approach to reduce the vocabulary is word segmentation tokenization (this is done for instance by the Moses tokenizer.perl included with OpenNMT-py). Then tokenization can be improved by removing casing, or even by splitting words as in BPE or, more recently, SentencePiece. Digging through the forum or the web you will easily find many ways.
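
For instance (assuming you have the Moses scripts available; the exact paths and file names depend on your installation), the word tokenization plus truecasing I mention is roughly:

perl tokenizer.perl -l en < train-all_corpus.en > train-all_corpus.tok.en
perl tokenizer.perl -l es < train-all_corpus.es > train-all_corpus.tok.es
perl train-truecaser.perl --model truecase-model.en --corpus train-all_corpus.tok.en
perl truecase.perl --model truecase-model.en < train-all_corpus.tok.en > train-all_corpus.tok.tc.en

(and the same for the validation/test files and the Spanish side; at translation time you tokenize and truecase the input the same way, and detokenize the output.)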

As an example, an untokenized sentence such as

His musicological influences and references can be found throughout 
his work; he has even included musical notation in the text to make a point.

is translated as:

sus <unk> y referencias pueden encontrarse en toda su <unk> , 
incluso ha incluido <unk> musicales en el texto para hacer
de un <unk> <unk>

Properly tokenized:

his musicological influences and references can be found throughout
his work ; he has even included musical notation in the text to make a point .

is translated as

su influencia y sus referencias pueden encontrarse en toda su labor ; 
incluso ha incluido <unk> musicales en el texto para hacer 
un comentario .

The BLEU score I got with the Europarl en-es download from the OPUS repository is 40.5, using the Moses tokenization system (word tokenization and truecasing) and the default model. Notice that even with your raw input you are not far. But again, there are more ways (OpenNMT-lua had its own way, there is a C++ tokenizer, BPE, SentencePiece, …).

Be also aware that if you want to reach high BLEU scores, your starting corpus has to be the best you can get. Europarl is not a very trustworthy corpus, in my opinion, for real work. Even with 200 million lines of Europarl corpus your BLEU score probably won’t rise much more (humble opinion, as I do not have 200 million Europarl lines). Probably a BLEU translation score of less than 60/70 is useless. Even BLEU scores have to be reviewed with great care, since, for instance, the same corpus processed in Moses or OpenNMT will have very similar BLEU values. For instance, I bet a neural translation with BLEU score 70 is “better” than a statistical translation with BLEU score 80.

And finally there is the design of your model/neural network. This is not a trivial task, but hopefully you will find some help here :slight_smile:

Have a nice day!
Miguel


Hello Miguel,

Thank you so much for your help, I really appreciate it.
I am going to start with tokenization, since I was following a tutorial from the web which didn’t mention it, so I didn’t tokenize at all.

Thanks again,
Ana