Larger Vocab = Worse BLEU?

bcomeyes · June 13, 2024, 6:30pm

Hello, All.

I’m brand new to OpenNMT and this community. First off, thank you to all those who have built this amazing tool.

After following the tutorial to train the wmt_ende model with OpenNMT-tf, I ventured out on my first solo flight and built two models using the translation memories from “84,000” (https://84000.co/). This group is translating ancient Buddhist texts from Tibetan into English by hand, but still saves its translation memories for those who want to venture into NMT.

I first created a model that translates from Tibetan into English (and also a separate model to go from English into Tibetan). There are about 275,000 translation memories, which I figured was a reasonable sample size for training/validation/testing. I was able to train all of my models to completion (i.e., until they stopped early on a plateaued BLEU score).

I have two questions that I’m turning to this community for guidance:
1) For the model that translates from Tibetan to English, I started with a SentencePiece model with an 8K vocab, then built one with a 16K vocab and then one with a 32K vocab. Each time my BLEU score increased a bit, and it maxed out around 29. I wasn’t thrilled, but it at least allows me to start comparing manual translations with NMT translations. I was certainly hoping for a higher score but wasn’t sure what to experiment with next. Can anyone provide guidance on best practices for trying other available models in OpenNMT, hyperparameter tuning, or other ideas so that I can keep raising my score?

2) Then, I tried to build a model that went from English to Tibetan. My 8K vocab model achieved a BLEU of around 25. I then tried a 32K vocab, and the model stopped early once my BLEU score quit increasing after 4 evaluations, with a final BLEU score of around 16. I was shocked at its poor performance. I checked and double-checked my code and couldn’t find any irregularities. My assumption was that an increase in vocab “should” create a higher BLEU score. Apparently this isn’t true. Can anyone shed any light on this topic? Or guide me on what I might have done wrong and what I might do to remedy this?
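
For readers unfamiliar with the stopping rule mentioned above, here is a minimal sketch of a "stop when BLEU has not improved over the last 4 evaluations" check. It is only an illustration of the idea, not OpenNMT-tf's actual implementation; the function name, its parameters, and the scores are hypothetical.

```python
def should_stop_early(bleu_history, patience=4, min_improvement=0.0):
    """Return True when the last `patience` evaluations failed to beat the
    best earlier BLEU score by more than `min_improvement`.

    Hypothetical helper for illustration; not part of OpenNMT-tf.
    """
    if len(bleu_history) <= patience:
        # Not enough evaluations yet to compare against an earlier best.
        return False
    best_before = max(bleu_history[:-patience])
    recent_best = max(bleu_history[-patience:])
    return recent_best - best_before <= min_improvement


# Made-up scores: BLEU peaks at the third evaluation, then plateaus.
plateaued = [10.0, 14.0, 16.0, 15.9, 15.8, 16.0, 15.7]
print(should_stop_early(plateaued))
```

A rule like this only says that BLEU stopped improving. It says nothing about whether the plateau is a good one, which is why a run can stop early at a much lower score than a previous configuration reached.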

Thanks in advance for any guidance.

Warm regards,
Matt

thejonnyt · June 21, 2024, 4:28pm

Hey @bcomeyes / Matt,

What exactly is your data? I looked at the website, and it does not seem to provide the standard format for machine translation, which is a parallel corpus where an English sentence corresponds to a Tibetan sentence in a 1:1 (per line) fashion. At least I couldn’t find it. However, achieving a BLEU of 25 is a strong baseline result in such a (low resource?) scenario. Do you have 275,000 sentence pairs, or are the memories documents? What are you training on exactly? I am wondering because these results sound a little bit too good to be true.

Cheers,
Jonny

ymoslem · June 22, 2024, 1:00am

Hi Matt,

Please refer to Section 4, “Results and Analysis”, of this paper. If the dataset is small, increasing the vocabulary size so much can negatively affect the quality.

Kind regards,
Yasmin

bcomeyes · June 22, 2024, 5:41pm

Hi, Jonny.
Yes, the data came from 275,205 sentence pairs. It’s encouraging that a BLEU of 28.78 seems high to you. I based my disappointment on the BLEU interpretation in this Google table:
https://cloud.google.com/translate/docs/advanced/automl-evaluate#:~:text=BLEU%20(BiLingual%20Evaluation%20Understudy)%20is,of%20high%20quality%20reference%20translations.

bcomeyes · June 22, 2024, 5:45pm

Thank you, Yasmin. This is exactly the technical response I’m looking for as I’m a little flummoxed by vocabulary size and its impact on the model. When I did a unique word count of my English data, I have about 60,000 unique words. Therefore, when I was faced with the limitations of 8K, 16K, and 32K vocab in SentencePiece, I thought more has to be better. I’ll do a deep dive into the article to challenge my original assumption.
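
A rough way to look at the "unique words versus vocab size" question is to count how many word types occur only once. One common explanation for larger vocabularies hurting on small datasets is that more rare units get their own entry with very few training examples. The sketch below uses plain whitespace tokenization as a simplifying assumption (it is not what SentencePiece does), and the function name and sample sentences are made up for illustration.

```python
from collections import Counter


def vocabulary_profile(sentences):
    """Count word tokens, unique word types, and types that occur once.

    Assumes whitespace tokenization, which is only a rough proxy for
    English and is not how SentencePiece segments text.
    """
    counts = Counter(
        word for sentence in sentences for word in sentence.lower().split()
    )
    singletons = sum(1 for freq in counts.values() if freq == 1)
    return {
        "tokens": sum(counts.values()),
        "types": len(counts),
        "singletons": singletons,
    }


# Toy sentences, not taken from the 84000 data.
sample = [
    "the teacher spoke",
    "the students listened",
    "the teacher listened",
]
profile = vocabulary_profile(sample)
print(profile)
```

If a large share of the roughly 60,000 unique words turn out to be singletons, that would be consistent with a smaller subword vocabulary generalizing better, though only a comparison of trained models can confirm it.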

Another question based on your response. Is a data size of 275,000 sentence pairs a “small data” set? Is it too small?

bcomeyes · June 22, 2024, 5:55pm

Hello, again, Jonny.
I realized I missed a part of your question. Yes, the dataset file “tmx_to_xlxs.xlsx” (Tibetan-English-OpenNMT-tf-Google-Colab-Notebook-/tmx_to_xlxs.xlsx at main · bcomeyes/Tibetan-English-OpenNMT-tf-Google-Colab-Notebook- · GitHub) is a sentence-to-sentence file. The only reason I used an Excel format was that it did a good job of showing the Tibetan script when I was converting the .tmx files to a usable format, and it is still easily read by Pandas. The file “PrepareData_NMT.ipynb” uses very basic Pandas and Sklearn code (I just noticed that I left the Numpy import in as well, an embarrassing example of muscle memory when I do imports) to read in the data frame, separate the two languages into variables, and then split the data into feature and label files for training (70%) and validation (15%), plus a test dataset (15%) for inference and later comparison to hand-translated files.
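
For anyone following along without the notebook, the 70/15/15 split can be sketched roughly as below. This is a standard-library stand-in rather than the Pandas/Sklearn code in “PrepareData_NMT.ipynb”; the function name, the seed, and the placeholder pairs are assumptions for illustration only.

```python
import random


def split_pairs(pairs, train=0.70, valid=0.15, seed=42):
    """Shuffle sentence pairs and split them into train/valid/test lists.

    Whatever is left after the train and valid portions becomes the test
    set, so the three parts always cover the input exactly once.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )


# Placeholder (source, target) pairs standing in for the real data.
pairs = [(f"source sentence {i}", f"target sentence {i}") for i in range(100)]
train_set, valid_set, test_set = split_pairs(pairs)
print(len(train_set), len(valid_set), len(test_set))
```

Keeping each source sentence and its translation together in one tuple before shuffling is what preserves the line-by-line alignment when the two sides are later written out to separate files.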

thejonnyt · June 25, 2024, 8:58am

275k sentence pairs is somewhat low resource. It’s 1-10% of what other, higher-resource languages have to work with. E.g., the German-English pairing has close to 100 million sentence pairs on Opus.nlpl (granted, most likely not all unique, and the quality might not be top notch for some). You should compare your result to language pairings with similar resources. In that case, I believe, 30ish BLEU is already a really good result. There might be some techniques, however, to improve these results even further by using methods designed for low-resource machine translation scenarios. There are multiple survey papers on the topic, e.g. Low Res MT Survey. Reading one or two should provide you with an overview of the different possibilities.

Cheers

bcomeyes · June 25, 2024, 9:44pm

Thanks, Jonny.
That article is super helpful. Since each model costs about $10 on GCP and ~6 hours, it’s really helpful to have some direction where to go next to refine this model.