One vocab file or two?

Hello.
I’m confused by the documentation (Vocabulary — OpenNMT-tf 2.32.0 documentation) that indicates the following:
In most cases, you should configure vocabularies with source_vocabulary and target_vocabulary in the data block of the YAML configuration, for example:
data:
  source_vocabulary: src_vocab.txt
  target_vocabulary: tgt_vocab.txt

I have two questions:

  1. If SentencePiece only produces one vocabulary file when the tokenization model is trained, how would two separate vocabularies be generated for the source and target languages?

  2. The next section of the documentation indicates the following:
    However, some models may require a different configuration:
    Language models require a single vocabulary:
    data:
      vocabulary: vocab.txt

What is meant by a “Language model”? Aren’t all of these language models? Why does this one need only one vocabulary file?

Hi @bcomeyes,

To my understanding, if you generate a joint vocabulary over the source and target languages, you can pass the path to that same vocabulary file for both source_vocabulary and target_vocabulary in the YAML file. If you don’t create such a joint vocabulary, you need to make sure a vocabulary file is present for each language, source and target. It’s somewhat specific to the kind of tokenization approach you are going for.
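To make that concrete, something like this (the file names are just placeholders, and depending on your OpenNMT-tf version you may need to convert SentencePiece’s .vocab output into the plain one-token-per-line format described on the Vocabulary docs page): with SentencePiece you can either train one model per language, which gives you one vocabulary per side, or train a single joint model on the concatenated source and target text and point both keys at the same file.

# Separate vocabularies: one SentencePiece model per language
data:
  source_vocabulary: src_vocab.txt
  target_vocabulary: tgt_vocab.txt

# Joint vocabulary: one SentencePiece model trained on source + target text together
data:
  source_vocabulary: joint_vocab.txt
  target_vocabulary: joint_vocab.txt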

As for your second question: Large Language Models and machine translation models are somewhat different. LLMs are usually text-generating systems: the model learns the representation of one specific language and learns to generate the next word based on the surrounding context in that language. Hence there is no need for a second language or a second vocabulary; all of the information comes from a monolingual corpus. Training usually exploits some kind of self-supervised objective, where one or more words in a sequence are masked (hidden) and the model has to learn to predict the hidden word. Once it predicts the masked words accurately, we shift to the text generation task by ‘simply’ masking the final word of each sequence, so the model always predicts the next word, and so on.
In machine translation, however, the decoder’s prediction task depends on the relationship between the target language and the source language (learned from some kind of parallel corpus), and the representations learnt by the encoder and the decoder depend on their respective vocabularies. The vocabulary is the lookup table from which the contextual meaning is derived on the source side, and on the target side it is the lookup table used to knit the translated context back into the target-language representation. In the first case, we start with Smarties and end up with Smarties. In the second case, we start with Smarties and end up with M&Ms. It’s similar, but not the same.
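In OpenNMT-tf terms, the data blocks end up looking roughly like this (the train_* keys and file names are just how I would set it up, so double-check against the docs for your model type):

# Language model: one monolingual stream, so a single vocabulary
data:
  train_features_file: mono_train.txt
  vocabulary: vocab.txt

# Translation model: a parallel corpus, so one vocabulary per side
data:
  train_features_file: train.src
  train_labels_file: train.tgt
  source_vocabulary: src_vocab.txt
  target_vocabulary: tgt_vocab.txt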

(Disclaimer: I’m just a student trying to wrap his head around these concepts as well, currently investigating the domain by writing my master’s thesis on Neural Machine Translation. Feel free to correct me!)

Cheers :slight_smile:


Hey, Jonny.
Ahhh. That makes sense… When the documentation simply listed Language Models (and not Large Language Models), I was stumped. I just didn’t make the connection that it was referring to single-language models such as an LLM. Thanks for the help as I try to make sense of this new world.