How to choose the best vocab size?

After building the vocab, my vocab size is 5 million tokens for the source and 5.5 million for the target. What vocab size should I choose when I train the model? I have 400 GB of RAM. And if my vocab size is bigger, will my accuracy increase too?

A good starting point is to apply subword tokenization, for example with SentencePiece, using a vocabulary size of 32k.

A larger vocabulary size could increase the accuracy, but not always. You should run experiments on your data.
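One cheap check before running full experiments is to measure how much of the running text a given vocab size actually covers. The sketch below is only an illustration under simple assumptions (whitespace tokenization, a toy corpus, and a hypothetical helper name `coverage_by_vocab_size`); it is not part of any toolkit.

```python
from collections import Counter


def coverage_by_vocab_size(tokenized_sentences, vocab_sizes):
    """Return {vocab_size: fraction of running tokens covered by the top-k types}."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    total = sum(counts.values())
    # Token frequencies, sorted from most to least common.
    freqs = [count for _, count in counts.most_common()]
    result = {}
    for k in vocab_sizes:
        covered = sum(freqs[:k])
        result[k] = covered / total if total else 0.0
    return result


# Toy whitespace-tokenized corpus; replace with your own training data.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
coverage = coverage_by_vocab_size(corpus, [1, 3, 10])
for size, fraction in coverage.items():
    print(f"vocab size {size}: {fraction:.1%} of tokens covered")
```

Tokens outside the chosen vocab are the ones that would become unknowns in a word-level model, so a low coverage at your target size is a hint that subwording is worth trying.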

Hello. I am trying to train an English-Hindi model and have been running experiments for a few days now. The training was done on a little over 1M sentences. Here are the experiments I ran and the results I found:

  1. I first trained a simple transformer model with the default vocab size of 50K, even though my vocab had around ~250K items each for the source and the target language. The model was trained for about 90K steps and gave good translations, very good actually (even on very long sentences). The only problem was the ‘unk’ token appearing every time a proper noun or a different form of a word was introduced.
  2. Next I created a subword model with a subword vocab size of 32K and model type bpe. This reduced the overall vocab (the one created by running the onmt vocab builder) to about ~27K items per language file. As usual, the vocab was built on the subworded training data. The model was again trained for about 80K steps, after which the results did not improve at all. This model did not produce any ‘unk’ tokens, but it gave bad translations whenever the input sentence had more than 5 or 6 tokens.
  3. Next up I tried to create a full vocab subwording model. Here the translation quality was better than the one I received in the 32K model, but was again nowhere comparable to what was achieved in the first model.

Any inputs on what can be done so as to make a model which translates with good accuracy and tries to best predict the unknown words?

Dear Anurag,

These are brilliant experiments. The next step is to experiment with two approaches:
1- tagged back-translation
2- filtering crawled data

For Tagged Back-translation, you will add all the data (original and synthetic) to your vocab build step; this will help with unknowns.

This approach aims at augmenting the available parallel training data with synthetic data that represent the purpose of the model. The technique depends on the following steps:

  1. For an English-to-Hindi model, we train another Hindi-to-English model (i.e. in the other direction), using publicly available data from OPUS;
  2. Selecting monolingual data in Hindi publicly available (e.g. at OSCAR), which must have domains and linguistic features similar to the potential texts to be translated;
  3. Using the Hindi-to-English model to create a synthetic dataset, by translating the Hindi monolingual data into English. Note here that only the English side (source for EN-HI) is MTed while the Hindi side (target for EN-HI) is human-generated text;
  4. Augmenting the original English-to-Hindi training dataset with the synthetic dataset;
  5. Training a new English-to-Hindi model using the dataset generated from the previous step. The vocab must be built on all the data, both the original and the synthetic datasets;
  6. For low-resource languages like Hindi, the technique works well with 1:1 synthetic to original data, but you can experiment with different portions.
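Steps 4 to 6 can be sketched roughly as follows. This is only an illustration: the `<BT>` tag string, the function name `build_tagged_bt_corpus`, and the toy sentences are assumptions, and a real pipeline would read from and write to files.

```python
import random

BT_TAG = "<BT>"  # assumed tag; any reserved symbol works if the subword model knows it


def build_tagged_bt_corpus(original_pairs, synthetic_pairs, ratio=1.0, seed=0):
    """Mix authentic and back-translated (source, target) pairs.

    ratio is the number of synthetic pairs per original pair (1.0 means 1:1).
    The source side of every synthetic pair is prefixed with BT_TAG.
    """
    rng = random.Random(seed)
    n_synthetic = min(len(synthetic_pairs), int(len(original_pairs) * ratio))
    sampled = rng.sample(synthetic_pairs, n_synthetic)
    tagged = [(f"{BT_TAG} {src}", tgt) for src, tgt in sampled]
    mixed = list(original_pairs) + tagged
    rng.shuffle(mixed)
    return mixed


# Toy data: sources are English; targets stand in for Hindi sentences.
original = [("I like tea .", "hi-1"), ("Where is the station ?", "hi-2")]
# Synthetic pairs: machine-translated English source, human-written Hindi target.
synthetic = [("he go market", "hi-3"), ("rain come today", "hi-4"), ("book is good", "hi-5")]

corpus = build_tagged_bt_corpus(original, synthetic, ratio=1.0)
print(len(corpus), "training pairs")
```

Changing `ratio` is an easy way to try other synthetic-to-original proportions.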

The following papers are good references for the Tagged Back-translation approach:

As for crawled bilingual data, currently OPUS has approx. 27 million sentences for English-Hindi. However, most of this data is crawled, which might mean many sentences are misaligned. So you can first filter this data and then train a model on as much as possible of these sentences, and see your results.
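As a rough idea of what such filtering can look like, here is a minimal rule-based sketch. The heuristics (empty lines, length limits, length ratio, untranslated copies, duplicates) and the thresholds are common choices that I am assuming here, not a recommendation tuned for English-Hindi; `filter_parallel` is a hypothetical helper name.

```python
def filter_parallel(pairs, max_len=200, max_ratio=2.5):
    """Keep (source, target) pairs that pass simple sanity checks."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src == 0 or n_tgt == 0:
            continue  # empty side
        if n_src > max_len or n_tgt > max_len:
            continue  # overly long segment
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # suspicious length ratio, likely misaligned
        if src == tgt:
            continue  # source copied to target, not a translation
        if (src, tgt) in seen:
            continue  # exact duplicate
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


# Toy examples; the targets are placeholders for Hindi sentences.
raw = [
    ("Hello world", "namaste duniya"),
    ("Hello world", "namaste duniya"),          # duplicate
    ("", "kuch"),                               # empty source
    ("one two three four five six", "ek"),      # length ratio 6:1
    ("Same text", "Same text"),                 # untranslated copy
]
clean = filter_parallel(raw)
print(f"kept {len(clean)} of {len(raw)} pairs")
```

Language identification and alignment scoring would be natural next steps, but they need external tools and are left out of this sketch.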

All the best,


Thanks for the reply @ymoslem

Using your suggestions I made two different models, the first one being a back translation model with original:synthetic ratio being 1:2, and the second one having a 1:1 ratio. Overall now the training was done on 3M and 2M parallel sentences respectively.

With these, my initial problem of long-sentence translation quality is somewhat solved. The translation quality is now fairly close to what my base model (with unknowns) achieved. With subwording, the model is now able to handle a lot of non-proper nouns as well. For proper nouns, subwording is applied too, but the translation quality does not improve.

My expectation here as a user would be to transliterate these proper nouns. Is this a problem widely encountered (proper nouns transliteration)?

Also currently while training the hindi-english model, the training file is being provided in cased form (‘HellO how are YOU?’ is not being trained as ‘hello how are you?’), can this type of change possibly help in improving the quality?

The current best BLEU score I am able to achieve on my custom test dataset is 26. Is there an online test dataset that has been benchmarked by different individuals and entities specifically for these Indic languages?

For unknown tokens look at CTranslate2 replace_unknowns.

I am aware of replace_unknowns. The unknowns I am referring to here are the words that came out as ‘unk’ in the plain non-subword transformer model, which I then tried to address with the subword models.

The idea was that the model should translate the various forms of a given word (which it now does after subwording) and transliterate proper nouns, since the translation it currently produces for them is a fairly random word. To be clear, the model is no longer giving me ‘unk’; rather, the output that replaces the ‘unk’ is poor.

Dear Anurag,

I do not think an MT system would transliterate proper nouns by itself, unless it is covered by the training data or sub-wording. The XLEnt corpus includes named entities; however, it must be filtered as it has many misalignments.

For low-resource languages, I prefer lower-casing the data. However, if you are submitting a paper, you are usually required to give the true-case version; so you can train a true-caser or use Moses’ true-caser for English.

Try averaging and ensemble decoding (you can find examples in the forum). Make sure you filter your data, especially the dev and test sets.
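Conceptually, checkpoint averaging is just an element-wise mean of the parameters saved at several training steps. The sketch below shows the idea on plain Python lists; it assumes each checkpoint is a dict from parameter name to a flat list of floats, which is a simplification (real toolkits work on tensors and ship their own averaging scripts).

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of parameter dicts of the form {name: [floats]}."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for name in checkpoints[0]:
        vectors = [ckpt[name] for ckpt in checkpoints]
        # zip(*vectors) groups the i-th value of this parameter across checkpoints.
        averaged[name] = [sum(values) / len(values) for values in zip(*vectors)]
    return averaged


# Two toy "checkpoints" with the same parameter names and shapes.
ckpt_a = {"w": [1.0, 3.0], "b": [0.0]}
ckpt_b = {"w": [3.0, 5.0], "b": [2.0]}
avg = average_checkpoints([ckpt_a, ckpt_b])
print(avg)
```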

EDIT: You can also try iterative back-translation. Now, as you have a better HI-EN model, back-translate EN monolingual data for the reverse model EN-HI, and then use the new EN-HI model to back-translate the same HI monolingual data you used for the first run, to create a new version of the HI-EN model. You already ran one iteration of back-translation, you can try to run one or two extra iterations.
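The alternation can be sketched generically as below. Everything here is a placeholder: `train` and `translate` are hypothetical callables standing in for your actual training and inference commands, and "forward"/"backward" simply name the two translation directions.

```python
def iterative_back_translation(parallel, mono_src, mono_tgt, train, translate, rounds=2):
    """Alternate between the two directions, re-translating the same monolingual data.

    parallel: list of authentic (source, target) pairs.
    train(pairs) -> model and translate(model, sentences) -> list are supplied by the caller.
    Returns (forward_model, backward_model).
    """
    reverse_parallel = [(tgt, src) for src, tgt in parallel]
    # Initial backward model (target -> source) trained on authentic data only.
    backward = train(reverse_parallel)
    forward = None
    for _ in range(rounds):
        # Synthetic sources for the forward model; targets stay human-written.
        synthetic_src = translate(backward, mono_tgt)
        forward = train(parallel + list(zip(synthetic_src, mono_tgt)))
        # Synthetic inputs for the backward model, produced by the improved forward model.
        synthetic_tgt = translate(forward, mono_src)
        backward = train(reverse_parallel + list(zip(synthetic_tgt, mono_src)))
    return forward, backward


# Toy stand-ins so the sketch runs end to end; replace with real training/inference.
def toy_train(pairs):
    return {"n_pairs": len(pairs)}


def toy_translate(model, sentences):
    return [f"mt({s})" for s in sentences]


forward, backward = iterative_back_translation(
    parallel=[("a", "x"), ("b", "y")],
    mono_src=["c"],
    mono_tgt=["z"],
    train=toy_train,
    translate=toy_translate,
    rounds=2,
)
print(forward, backward)
```

Note that the monolingual data stays the same in every round; only the model producing the synthetic side changes.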

Kind regards,


This is a truly excellent and thorough explanation of the way to tackle the “low resource” challenge.


@ab585 By the way, have you added tags to the MTed segments? In this case, you will also have to add a tag like <BT> to your SentencePiece model as well, using the option --user_defined_symbols

Also, use the following options:
--split_by_number to split tokens by numbers (0-9)
--byte_fallback to decompose unknown pieces into UTF-8 byte pieces
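Put together, the training flags could look like the sketch below. The file name, model prefix, and helper names are placeholders I made up for illustration; the flags themselves are the SentencePiece trainer options mentioned in this thread, and `SentencePieceTrainer.train` accepts them as a single flag string.

```python
def build_spm_train_flags(input_path, model_prefix, vocab_size=32000,
                          model_type="bpe", user_defined_symbols=("<BT>",)):
    """Build the flag string for the SentencePiece trainer."""
    flags = [
        f"--input={input_path}",
        f"--model_prefix={model_prefix}",
        f"--vocab_size={vocab_size}",
        f"--model_type={model_type}",
        # Reserve the back-translation tag so it is never split into pieces.
        f"--user_defined_symbols={','.join(user_defined_symbols)}",
        "--split_by_number=true",   # split tokens by numbers (0-9)
        "--byte_fallback=true",     # decompose unknown pieces into UTF-8 bytes
    ]
    return " ".join(flags)


def train_sentencepiece(flags):
    """Run training; requires the sentencepiece package and the input file to exist."""
    # Imported here so the flag-building part works without the package installed.
    import sentencepiece as spm
    spm.SentencePieceTrainer.train(flags)


# "train.en.txt" and "spm_en_bpe32k" are hypothetical names.
flags = build_spm_train_flags("train.en.txt", "spm_en_bpe32k")
print(flags)
# train_sentencepiece(flags)  # uncomment once the input file is in place
```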

As for proper nouns, in addition to what is mentioned above, consider also experimenting with having a shared vocab (and one SentencePiece model) instead of two separate vocabs. This can help ONLY if the source has the (sub-)words that you want to move to the target.

What does this mean? Having everything lower case? The tradeoff would be better translations but losing upper and lower case information.

Does this work well? It’s a really cool idea, but if you did it too many times I’m guessing the quality would deteriorate.

I’ve done such iterative back-translation with my Tagalog-English model (using SentencePiece subwords). Since I don’t speak Tagalog, I test against a set of 20,000 human-created bilingual examples from a Tagalog learning site, and there’s usually a fair degree of correspondence.


Hi PJ!

Yes, it is a trade-off. However, if the target is English, there are already pre-trained true-casers, or you can train one.

You train back-translation models only 2-3 times. The idea here is not to increase the data (that is a different approach), but rather to back-translate the same data with a better system. Check this paper for example:

Kind regards,


Interesting, thanks for the link!

That makes sense. So the idea is to re-translate the same monolingual target-language data with a better model, not necessarily to generate new data.
