Does src_vocab_size work correctly?

I'm experimenting with OpenNMT-py, playing with the vocab size and watching how it is handled inside the framework.
When I run:
python ./preprocess.py
-train_src ./data/test/train_src
-train_tgt ./data/test/train_tgt
-valid_src ./data/test/valid_src
-valid_tgt ./data/test/valid_tgt
-save_data ./data/test/t1
-share_vocab
-dynamic_dict
-src_vocab_size 2
-tgt_vocab_size 2
-overwrite

and afterwards look inside the file t1.vocab.pt:
v = torch.load("./data/test/t1.vocab.pt")
v['tgt'].base_field.vocab.freqs
I see all the words there, not only 2.
Is this correct behavior?

PS: Logs:
[2019-10-14 13:23:03,473 INFO] Extracting features…
[2019-10-14 13:23:03,473 INFO] * number of source features: 0.
[2019-10-14 13:23:03,473 INFO] * number of target features: 0.
[2019-10-14 13:23:03,473 INFO] Building Fields object…
[2019-10-14 13:23:03,474 INFO] Building & saving training data…
[2019-10-14 13:23:03,474 INFO] Reading source and target files: ./data/test/train_src ./data/test/train_tgt.
[2019-10-14 13:23:03,474 INFO] Building shard 0.
[2019-10-14 13:23:03,474 INFO] * saving 0th train data shard to ./data/test/t1.train.0.pt.
[2019-10-14 13:23:03,498 INFO] * tgt vocab size: 6.
[2019-10-14 13:23:03,498 INFO] * src vocab size: 4.
[2019-10-14 13:23:03,498 INFO] * merging src and tgt vocab…
[2019-10-14 13:23:03,498 INFO] * merged vocab size: 6.
[2019-10-14 13:23:03,499 INFO] Building & saving validation data…
[2019-10-14 13:23:03,499 INFO] Reading source and target files: ./data/test/valid_src ./data/test/valid_tgt.
[2019-10-14 13:23:03,499 INFO] Building shard 0.
[2019-10-14 13:23:03,499 INFO] * saving 0th valid data shard to ./data/test/t1.valid.0.pt.

You’re looking at the .freqs counter. In this counter, every word is kept.
The actual vocab used is stored in .stoi and .itos, and should be properly cut to the size you defined.
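For instance, with the vocab file produced by the run above (a minimal sketch, using the same paths and field layout as your snippet):

import torch

v = torch.load("./data/test/t1.vocab.pt")
tgt_vocab = v["tgt"].base_field.vocab

# .freqs is a collections.Counter over every token seen in the data,
# so it lists all words regardless of -tgt_vocab_size.
print(len(tgt_vocab.freqs))

# .itos / .stoi hold the actual vocabulary, truncated to the size you asked
# for plus the special tokens; with your logs this should print 6
# (2 words + <unk>, <blank>, <s>, </s>).
print(len(tgt_vocab.itos))
print(tgt_vocab.itos)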

So, if I want to decrease the vocab size (by setting small values for src_vocab_size and tgt_vocab_size), it won't actually work?

It does work.
Look at the size of your .stoi/.itos. The .freqs is simply a counter of “what has been seen in the data”, not the vocabulary itself.

Maybe "the size of the vocab" is a bit confusing.
I mean: OpenNMT-py stores all the words in the file, right? So the size of the file will be nearly the same for the same input data, whatever src_vocab_size is. Correct?
The same goes for RAM (CPU/CUDA): internally, OpenNMT-py calls torch.load("some.vocab"), so it loads all the data from the file, and I cannot decrease the in-memory size of the vocab using src_vocab_size. Correct?

This ‘vocab.pt’ file is peanuts compared to other things in RAM (data, model parameters, states, etc.).
What’s important is the size of your vocab (which will determine part of the size of your model, with consequences on memory), not the file itself.
I’m not sure I understand your point here. Maybe tell us more about your use case and the underlying issue you’re facing.
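Rough numbers, to give an idea (the 512 embedding dimension and fp32 weights below are assumptions, just for illustration):

# Back-of-the-envelope only; assumes a 512-dim embedding and fp32 weights.
vocab_size = 50000
emb_dim = 512
embedding_params = vocab_size * emb_dim          # 25.6M parameters
print(embedding_params * 4 / 1e6, "MB in fp32")  # ~102 MB for one embedding table
# The generator (output projection over the vocab) scales with vocab size in
# the same way, and the optimizer keeps extra states for all these parameters.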

OK, a rather abstract question: I want to use BPE encoding. It solves the “back-off dict” issue, but I want to know whether it also helps with RAM (CPU/CUDA).
I'm working on a summarisation task, using --share_vocab and --src_vocab_size=50000.
Sometimes I get a CUDA out-of-memory error.
I know I can address this with other techniques, but I want to know whether BPE encoding (with a smaller dict) improves this RAM problem:
I convert the input data (tokenize -> detokenize) with a dict smaller than 50k, using BPE via this library: https://github.com/google/sentencepiece
Then I use this dict in preprocess.py, pointing --src_vocab at it: http://opennmt.net/OpenNMT-py/options/preprocess.html#Vocab
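Roughly, the sentencepiece part of that pipeline could look like this (the 8k size, file names, and output paths are just placeholders, not from the thread):

import sentencepiece as spm

# Learn a BPE model on the raw source data (size chosen for illustration).
spm.SentencePieceTrainer.Train(
    "--input=./data/test/train_src --model_prefix=bpe8k "
    "--vocab_size=8000 --model_type=bpe"
)

# Apply it line by line before feeding the data to preprocess.py.
sp = spm.SentencePieceProcessor()
sp.Load("bpe8k.model")
with open("./data/test/train_src") as fin, open("./data/test/train_src.bpe", "w") as fout:
    for line in fin:
        fout.write(" ".join(sp.EncodeAsPieces(line.strip())) + "\n")

The tokens from the learned model can then be written one per line and passed to preprocess.py via --src_vocab.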

Thanks for your response.

I think your issue may be more with overly long sentences / too big batches than with the vocab size. Do you use “-batch_type tokens”?
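For reference, token-based batching is enabled on the train.py side along these lines (the 4096 value and model path are just an example):

python ./train.py
-data ./data/test/t1
-save_model ./data/test/model
-batch_type tokens
-batch_size 4096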

Thanks, I know, and I'm trying to deal with the overly long sentences.
Yes, I use -batch_type tokens.

Hi @francoishernandez
I am running a translation task. I am using a BPE vocab size of 24k, and the default src and tgt vocab size in OpenNMT's preprocess.py is 50k. I want to know how to decide on the vocab size (both src and tgt) in preprocess.py. What if I am missing many words by keeping the 50k default when it should be more than 50k?

I don’t understand your point. If you use BPE 24k, your vocab won’t be much bigger than 24k, so you won’t ‘miss’ words by keeping the default 50k value.

@francoishernandez
This is not the case. When I use a 24k vocab size in BPE and then run preprocess.py on my src and tgt, the NMT vocab I get is:
src vocab size = 60002
tgt vocab size = 60004
not just 24k.
So I think NMT preprocess.py has a different way of defining the vocab?

You may have an issue in your BPE tokenization process. Do you learn the BPE model on your whole data or on an extract?

@francoishernandez
Whole data.
Do you mean that the BPE vocab size == src vocab size == tgt vocab size after OpenNMT's preprocess.py?

src vocab size == tgt vocab size --> depends on whether you use -share_vocab or not, but it appears you do.
BPE is defined by merge operations, not by vocab size, but the ensuing vocab should be just a little bigger than the number of operations, i.e. for 24k merge operations you shouldn’t have 60k of vocab.
You need to inspect your data and find the reason why your vocab is exploding. Maybe you can have a look at your vocab file (preprocessed) and compare it with the merge operations (your bpe file) to try and see what’s going on.
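Something like this quick check could help (the file name, marker characters, and length threshold are placeholders to adapt to your setup):

import torch

# Hypothetical path: the vocab file written by your preprocess.py run.
v = torch.load("data.vocab.pt")
src_vocab = v["src"].base_field.vocab
print("src vocab size:", len(src_vocab.itos))

# Subword tokens usually carry a marker ("▁" for sentencepiece, "@@" for
# subword-nmt); long tokens with no marker often mean some lines were never
# BPE-encoded, which would explain the vocab blowing up past 24k.
suspects = [t for t in src_vocab.itos if "▁" not in t and "@@" not in t and len(t) > 15]
print(len(suspects), "suspicious tokens, e.g.:", suspects[:10])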

ok thanks @francoishernandez