What are the values in src_map?

Scuba · May 22, 2018, 4:56am

In Translation.py (onmt.translate.Translation.py) the make_features def generates an object {Variable} called data. This object contains a {LongTensor} also called data that holds the vocabulary index of each source token for each source sentence. It is a matrix of size X (the longest sentence) by Y (the number of sentences).

The batch object passed into the make_features def contains each tokenized sentence (for example, batch.dataset.examples.00.src) and a {LongTensor} called src_map (for example, batch.dataset.examples.00.src_map, not to be confused with {Variable} batch.src_map). {LongTensor} src_map contains a group of numbers, one for each token in the source sentence. However, I cannot tell what these numbers mean. Are they an index of some sort, like in the {LongTensor} data described above?

nikhilweee · May 22, 2018, 10:22am

Though I’m not sure, I think this might be of some help?

https://github.com/OpenNMT/OpenNMT-py/blob/0ecec8b4c16fdec7d8ce2646a0ea47ab6535d308/onmt/io/TextDataset.py#L279

    return num_feats


# Below are helper functions for intra-class use only.
def _dynamic_dict(self, examples_iter):
    for example in examples_iter:
        src = example["src"]
        src_vocab = torchtext.vocab.Vocab(Counter(src),
                                          specials=[UNK_WORD, PAD_WORD])
        self.src_vocabs.append(src_vocab)
        # Mapping source tokens to indices in the dynamic dict.
        src_map = torch.LongTensor([src_vocab.stoi[w] for w in src])
        example["src_map"] = src_map


        if "tgt" in example:
            tgt = example["tgt"]
            mask = torch.LongTensor(
                [0] + [src_vocab.stoi[w] for w in tgt] + [0])
            example["alignment"] = mask
        yield example

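Here src_map is just the source sentence as a list of indexes. Each index points into src_vocab, the small per-example vocabulary that the snippet builds from that one source sentence.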
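To make that concrete, here is a minimal sketch in plain Python of what the snippet computes for one example. The `build_stoi` helper is hypothetical: it only imitates how `torchtext.vocab.Vocab` assigns indexes (specials first, then tokens by descending frequency, ties broken alphabetically). The `<unk>` and `<blank>` strings and the fallback to index 0 for unknown tokens are assumptions, not something confirmed in this thread.

```python
from collections import Counter

# Assumed placeholder strings for UNK_WORD and PAD_WORD.
UNK_WORD, PAD_WORD = "<unk>", "<blank>"


def build_stoi(tokens, specials=(UNK_WORD, PAD_WORD)):
    """Hypothetical stand-in for torchtext.vocab.Vocab: token -> index."""
    counter = Counter(tokens)
    for special in specials:
        counter.pop(special, None)
    # Alphabetical first, then a stable sort by descending frequency.
    ordered = sorted(counter.items(), key=lambda tup: tup[0])
    ordered.sort(key=lambda tup: tup[1], reverse=True)
    itos = list(specials) + [word for word, _ in ordered]
    return {word: index for index, word in enumerate(itos)}


src = ["the", "cat", "saw", "the", "dog"]
tgt = ["the", "dog", "ran"]

stoi = build_stoi(src)
# One index per source token, in sentence order.
src_map = [stoi[w] for w in src]
# Target tokens missing from the source fall back to 0 (assumed unk index).
alignment = [0] + [stoi.get(w, 0) for w in tgt] + [0]

print(stoi)
print(src_map)    # [2, 3, 5, 2, 4]
print(alignment)  # [0, 2, 4, 0, 0]
```

Note that the repeated token "the" gets the same index both times, so src_map keeps sentence order even though the vocabulary itself does not.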

Scuba · May 22, 2018, 4:19pm

Thank you very much, Nikhil! This was very helpful. You said, “Here src_map is just the source sentence as a list of indexes.” I can see that you’re right by looking at the _dynamic_dict def. The line that you highlighted shows that src_map is created by pulling the index from src_vocab.stoi for each word token (w) in src.

So another question: what is src_vocab.stoi and what is the logic to the ordering of the source tokens in it (indexes to all of the tokens are contained in it but they are not in the original order as in the source sentence)? And what does “stoi” mean? Is it an acronym?

nikhilweee · May 23, 2018, 2:26pm

Yes, they are in random order. As far as I can understand, stoi is basically "s to i": source to indexes (hopefully). You also have the inverse, src_vocab.itos: indexes to source tokens (maybe?).

UPDATE 1: stoi is string to index and itos is index to string

https://github.com/pytorch/text/blob/405a87282fb5363cf8171531843f86e007cb3173/torchtext/vocab.py#L29-L29

itos: A list of token strings indexed by their numerical identifiers.

UPDATE 2: They are not in random order. They are first sorted by their frequency and then alphabetically. I have no idea if there is a specific reason behind this.

https://github.com/pytorch/text/blob/405a87282fb5363cf8171531843f86e007cb3173/torchtext/vocab.py#L65-L72

# sort by frequency, then alphabetically
words_and_frequencies = sorted(counter.items(), key=lambda tup: tup[0])
words_and_frequencies.sort(key=lambda tup: tup[1], reverse=True)


for word, freq in words_and_frequencies:
    if freq < min_freq or len(self.itos) == max_size:
        break
    self.itos.append(word)
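The two-pass sort in that excerpt relies on Python's sort being stable: the second sort (by frequency) keeps the alphabetical order from the first sort among tokens with equal counts. A small sketch with made-up tokens, leaving out specials, min_freq, and max_size for brevity:

```python
from collections import Counter

# Toy token stream; the words are illustrative only.
counter = Counter("pear apple fig apple pear pear kiwi".split())

# Same two passes as the excerpt: alphabetical, then stable sort by count.
words_and_frequencies = sorted(counter.items(), key=lambda tup: tup[0])
words_and_frequencies.sort(key=lambda tup: tup[1], reverse=True)

itos = [word for word, _ in words_and_frequencies]
stoi = {word: index for index, word in enumerate(itos)}

# An equivalent single pass: descending count, then alphabetical.
single_pass = sorted(counter.items(), key=lambda tup: (-tup[1], tup[0]))

print(itos)  # ['pear', 'apple', 'fig', 'kiwi']
print(stoi)
```

So the ordering is deterministic for a given set of counts, but it has nothing to do with where a token appears in the sentence.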

Scuba · May 23, 2018, 3:59pm

Ah, the naming makes sense. Thanks for that explanation, Nikhil. Is there a reason the indexes in stoi are in random order? Why not just keep them in sentence order? Is the randomness important, perhaps somewhere else in the code? Does it aid the attention mechanism or something like that?

nikhilweee · May 25, 2018, 5:27am

I’ve updated my comment above. Hope that helps.

emartinezVic · May 25, 2018, 7:48am

The order of the words in the dictionaries is not important for the encoder-decoder, since you “only” use these dictionaries to convert from indexes to strings and vice versa when creating the system input data and when building the output translation.
In particular, the word indexes are used by the embedding layers to look up the word embeddings on both the source and target sides.
The attention module, if I am not mistaken, works on the positions of the words in the source sentence and does not take the particular word indexes into account.
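A toy illustration of that point, in plain Python rather than OpenNMT-py's actual embedding layer: if the vocabulary is reordered and the rows of the lookup table are reordered with it, every token still maps to the same vector. All names and vector values here are made up for the example.

```python
def embed(sentence, stoi, table):
    """Look up one vector per token via its vocabulary index."""
    return [table[stoi[word]] for word in sentence]


# Illustrative per-word vectors.
vectors = {
    "cat": [0.1, 0.2],
    "dog": [0.3, 0.4],
    "saw": [0.5, 0.6],
    "the": [0.7, 0.8],
}

# Ordering A: alphabetical. Ordering B: reversed.
order_a = ["cat", "dog", "saw", "the"]
order_b = list(reversed(order_a))

stoi_a = {word: index for index, word in enumerate(order_a)}
stoi_b = {word: index for index, word in enumerate(order_b)}
table_a = [vectors[word] for word in order_a]
table_b = [vectors[word] for word in order_b]

sentence = ["the", "cat", "saw", "the", "dog"]
embedded_a = embed(sentence, stoi_a, table_a)
embedded_b = embed(sentence, stoi_b, table_b)

# The indexes differ, but the resulting vectors are identical.
print(stoi_a["the"], stoi_b["the"])
print(embedded_a == embedded_b)
```

The index is only a key into the table, which is why any consistent ordering of the vocabulary works.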

Scuba · May 25, 2018, 9:53pm

Even better! Thanks, Nikhil.

Scuba · May 25, 2018, 9:53pm

Good information, as always. Thanks, Eva!