In Translation.py (onmt.translate.Translation.py), the make_features function generates a {Variable} object called data. This object contains a {LongTensor}, also called data, that holds the vocabulary index of each source token for each source sentence. It is a matrix of size X (the length of the longest sentence) by Y (the number of sentences).
The batch object passed into make_features contains each tokenized sentence (for example, batch.dataset.examples.00.src) and a {LongTensor} called src_map (for example, batch.dataset.examples.00.src_map, not to be confused with the {Variable} batch.src_map). The src_map {LongTensor} contains one number for each token in the source sentence, but I cannot tell what these numbers mean. Are they an index of some sort, like in the {LongTensor} data described above?
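For concreteness, here is a toy sketch of the shape I am describing (the vocabulary indexes below are made up for illustration, not values from a real model):

```python
import torch

# Two tokenized source sentences, the shorter one padded.
#   sentence 1 -> invented indexes [4, 12, 7]
#   sentence 2 -> invented indexes [25, 31, 1]
data = torch.LongTensor([[4, 25],
                         [12, 31],
                         [7, 1]])   # shape: (longest sentence X = 3, number of sentences Y = 2)
print(data.size())  # torch.Size([3, 2])
```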
Thank you very much, Nikhil! This was very helpful. You said, “Here src_map is just the source sentence as a list of indexes.” I can see that you’re right by looking at the _dynamic_dict function. The line you highlighted shows that src_map is created by pulling the index from src_vocab.stoi for each word token (w) in src.
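If I am reading it correctly, that line boils down to something like this (paraphrased here as a standalone function for illustration; build_src_map is my own name, not one from the code):

```python
import torch

def build_src_map(src, src_vocab):
    # For each source token w, look up its index in the per-sentence
    # source vocabulary; this mirrors the line highlighted in _dynamic_dict.
    return torch.LongTensor([src_vocab.stoi[w] for w in src])
```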
So another question: what is src_vocab.stoi, and what is the logic behind the ordering of the source tokens in it? Indexes for all of the tokens are contained in it, but they are not in the same order as in the source sentence. And what does “stoi” mean? Is it an acronym?
Yes, they are in random order. As far as I can tell, stoi is basically “s to i”: source to indexes (hopefully). You also have the inverse, src_vocab.itos: indexes to source tokens (maybe?).
UPDATE 1: stoi is string to index and itos is index to string
UPDATE 2: They are not in random order. They are first sorted by their frequency and then alphabetically. I have no idea if there is a specific reason behind this.
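To illustrate that ordering, here is a small reconstruction of the rule as I understand it (this mimics what I believe torchtext’s Vocab does; the special tokens and exact sort are my assumption, not code copied from the library):

```python
from collections import Counter

tokens = "the cat sat on the mat".split()
counter = Counter(tokens)

# Sort alphabetically first, then stable-sort by descending frequency,
# so ties in frequency keep their alphabetical order. Special tokens
# are prepended (the "<unk>"/"<pad>" names here are just placeholders).
words = sorted(counter)
words.sort(key=counter.get, reverse=True)
itos = ["<unk>", "<pad>"] + words            # index -> string
stoi = {w: i for i, w in enumerate(itos)}    # string -> index

print(itos)          # ['<unk>', '<pad>', 'the', 'cat', 'mat', 'on', 'sat']
print(stoi["the"])   # 2
```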
Ah, the naming makes sense. Thanks for that explanation, Nikhil. Is there a reason the indexes in stoi are in random order? Why not just keep them in sentence order? Is the randomness important, perhaps somewhere else in the code? Does it aid the attention mechanism or something like that?
The order of the words in the dictionaries is not important for the encoder-decoder, since these dictionaries are “only” used to convert from indexes to strings and vice versa when creating the system input data and when building the output translation.
In particular, the word indexes are used to look up the word embeddings in the embedding layers on both the source and target sides.
The attention module, if I am not mistaken, works with the positions of the words in the source sentence rather than with the particular word indexes.
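As a minimal sketch of that point (the sizes below are arbitrary toy values, not OpenNMT’s defaults): the embedding lookup only cares about the index values themselves, and attention later assigns one weight per source position.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 50, 8          # arbitrary toy sizes
embedding = nn.Embedding(vocab_size, emb_dim)

# Word indexes for one source sentence (made-up values).
src_indexes = torch.LongTensor([4, 12, 7])

# The lookup depends only on the index values, not on how the
# vocabulary happened to be ordered when it was built.
src_embeddings = embedding(src_indexes)      # shape: (3, 8)

# Attention would then produce one weight per source *position*,
# i.e. a length-3 distribution here, regardless of the indexes used.
print(src_embeddings.size())                 # torch.Size([3, 8])
```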