OpenNMT and WMT21 Similar Language Task for the Spanish-Catalan and Spanish-Portuguese Language Pair

miguelknals · April 26, 2022, 8:35pm

Hi!

Our team presented its T4T solution to the Similar Language Translation shared task at WMT21 (the Sixth Conference on Machine Translation, held with EMNLP 2021 in November 2021), using OpenNMT as an “out-of-the-box” NMT toolkit.

The task covers translation between similar language pairs (in our case ES<>CA and ES<>PT).

We focused on corpus cleaning (from both a “physical” and a “statistical” point of view). We also tried syllabic word segmentation as an alternative to byte-pair encoding (BPE). Finally, we used OpenNMT to build our MT models.
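To give an idea of what a “physical” cleaning pass can look like, here is a minimal Python sketch. The concrete rules (Unicode normalization, control-character removal, length and length-ratio limits, deduplication) and the thresholds `MAX_TOKENS` and `MAX_LENGTH_RATIO` are assumptions chosen for illustration, not the exact T4T recipe.

```python
import re
import unicodedata

# Hypothetical thresholds, for illustration only.
MAX_TOKENS = 200
MAX_LENGTH_RATIO = 2.5

_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")
_WHITESPACE = re.compile(r"\s+")


def normalize(line: str) -> str:
    """Apply Unicode NFC, replace control characters, collapse whitespace."""
    line = unicodedata.normalize("NFC", line)
    line = _CONTROL_CHARS.sub(" ", line)
    return _WHITESPACE.sub(" ", line).strip()


def clean_pairs(pairs):
    """Yield cleaned (source, target) pairs, skipping suspicious ones."""
    seen = set()
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if not src or not tgt:
            continue  # one side is empty
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) > MAX_TOKENS:
            continue  # segment too long
        if max(n_src, n_tgt) / min(n_src, n_tgt) > MAX_LENGTH_RATIO:
            continue  # lengths too different for a similar language pair
        if (src, tgt) in seen:
            continue  # exact duplicate after normalization
        seen.add((src, tgt))
        yield src, tgt
```

For similar languages such as ES and CA, source and target lengths tend to be close, so a length-ratio limit is a cheap way to catch misaligned segments.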

We found that, after a good “physical” cleaning, other recipes such as “statistical” cleaning (trying to remove translations that have a low probability according to a corpus dictionary) or alternatives to BPE (such as syllabic segmentation) provided little or unclear improvement.
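The “statistical” cleaning idea can be sketched roughly as follows: score each sentence pair with a word-level translation dictionary and drop the pairs whose score is too low. The toy `LEXICON`, the `FLOOR` probability and the scoring function below are hypothetical and only show the principle; they are not the scoring used in T4T.

```python
import math

# Hypothetical toy dictionary: P(target_word | source_word).
LEXICON = {
    ("el", "el"): 0.9,
    ("gato", "gat"): 0.8,
    ("negro", "negre"): 0.85,
}
FLOOR = 1e-4  # probability given to unseen word pairs (assumption)


def pair_score(src: str, tgt: str) -> float:
    """Average log-probability of each target word given its best source word."""
    src_words = src.lower().split()
    tgt_words = tgt.lower().split()
    if not src_words or not tgt_words:
        return math.log(FLOOR)
    total = 0.0
    for t in tgt_words:
        best = max(LEXICON.get((s, t), FLOOR) for s in src_words)
        total += math.log(best)
    return total / len(tgt_words)


def filter_by_score(pairs, threshold):
    """Keep only the pairs whose score reaches the threshold."""
    return [(s, t) for s, t in pairs if pair_score(s, t) >= threshold]
```

In a real setting the dictionary would be estimated from the corpus itself (for example with a word aligner), and the threshold would need tuning on held-out data.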

We used the less demanding OpenNMT RNN models for training and evaluation, and used the Transformer model only for the final selection. This means we could work in a reasonable local environment (2 x 8 GB GPUs in an i7 machine with 48 GB of RAM) throughout the whole process.

The final result is based on common sense, that is: clean the corpus as much as we can, use standard techniques such as BPE to reduce the vocabulary, and use a proven toolkit such as OpenNMT with the Transformer model.

The outcome of the competition is that our system was always close to the top, if not the best one. This is good news, in the sense that you can still get state-of-the-art results with tools that need only reasonable computing power. In the following table you can compare our results (T4T) against the best score among the participants. For BLEU and RIBES higher is better; for TER lower is better. Notice how close our results are to the best ones (when they are not the best themselves).

Pair	BLEU Best score	BLEU T4T	RIBES Best score	RIBES T4T	TER Best score	TER T4T
PT-ES	47.71	46.29	87.11	87.04	39.21	40.18
ES-PT	40.74	40.74	85.69	85.69	43.34	43.34
CA-ES	82.79	77.93	96.98	96.04	10.92	16.50
ES-CA	79.69	78.60	96.24	96.24	14.63	16.13
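To make “how close” concrete, the gap between T4T and the best score can be computed directly from the table. This is just a small Python sketch over the numbers above; the helper names are made up for the example.

```python
# Scores copied from the table: (best score, T4T) per metric.
RESULTS = {
    "PT-ES": {"BLEU": (47.71, 46.29), "RIBES": (87.11, 87.04), "TER": (39.21, 40.18)},
    "ES-PT": {"BLEU": (40.74, 40.74), "RIBES": (85.69, 85.69), "TER": (43.34, 43.34)},
    "CA-ES": {"BLEU": (82.79, 77.93), "RIBES": (96.98, 96.04), "TER": (10.92, 16.50)},
    "ES-CA": {"BLEU": (79.69, 78.60), "RIBES": (96.24, 96.24), "TER": (14.63, 16.13)},
}
# TER is an error rate, so lower is better.
HIGHER_IS_BETTER = {"BLEU": True, "RIBES": True, "TER": False}


def gap_to_best(best, t4t, higher_is_better):
    """Return how far T4T is behind the best score (0.0 means a tie)."""
    diff = best - t4t if higher_is_better else t4t - best
    return round(diff, 2)


def gaps():
    """Compute the gap for every language pair and metric."""
    return {
        pair: {
            metric: gap_to_best(best, t4t, HIGHER_IS_BETTER[metric])
            for metric, (best, t4t) in metrics.items()
        }
        for pair, metrics in RESULTS.items()
    }


if __name__ == "__main__":
    for pair, metrics in gaps().items():
        print(pair, metrics)
```

A gap of 0.0 means T4T matched the best score, which is the case for all three ES-PT metrics and for RIBES on ES-CA.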

The results reinforce the idea that if you have a clean and coherent corpus your results will be pretty good with OpenNMT.

The technical paper we submitted to WMT21 is “T4T Solution: WMT21 Similar Language Task for the Spanish-Catalan and Spanish-Portuguese Language Pair”.

Besides the technical paper, if you are interested in more detailed information, see www.mknals.com

I also would like to thank all the community supporting OpenNMT!!!
Have a nice day!
Miguel