How can I decide the vocabulary size for a given corpus size?

By default the vocab size is 50k.

I started with a 50k-sentence parallel corpus and now have almost 800k Hn-En parallel sentences in total.

So far I have used a 50k vocabulary, but I am not sure how much I should increase it for each additional 100k parallel sentences.

Can changing the vocab size help improve accuracy?

I have seen that increasing the corpus size consistently reduces the number of UNKs: around 104 UNKs with the 300k corpus, 94 with 400k, and 77 with 500k.
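
For reference, here is a rough sketch of how I check what fraction of training tokens a given vocabulary size would cover ("train.hn" is just a placeholder for my tokenized Hindi-side training file, and the candidate sizes are arbitrary):

```python
from collections import Counter

# Count word frequencies over the tokenized training corpus
# ("train.hn" is a placeholder file name).
counts = Counter()
with open("train.hn", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

total_tokens = sum(counts.values())
ranked_freqs = [freq for _, freq in counts.most_common()]

# Token coverage for a few candidate vocabulary sizes.
for vocab_size in (50_000, 75_000, 100_000):
    covered = sum(ranked_freqs[:vocab_size])
    oov_types = max(len(counts) - vocab_size, 0)
    print(f"vocab {vocab_size}: {covered / total_tokens:.2%} token coverage, "
          f"{oov_types} word types left out (mapped to UNK)")
```

This makes it easier to see whether going beyond 50k actually buys much extra coverage as the corpus grows.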

My final training log with 500k parallel sentences looks like this:

[2018-11-13 04:36:37,265 INFO] Validation perplexity: 5.97246
[2018-11-13 04:36:37,266 INFO] Validation accuracy: 68.0482
[2018-11-13 04:36:37,266 INFO] Saving checkpoint demo-model_step_300000.pt
[2018-11-13 04:44:22,695 INFO] Loading train dataset from data/demo.train.0.pt, number of examples: 471795
[2018-11-13 04:46:03,939 INFO] Step 300500/300506; acc: 84.75; ppl: 1.79; xent: 0.58; lr: 0.00000; 5884/5885 tok/s; 331002 sec

And for 400k it is:

[2018-11-03 18:18:06,682 INFO] Loading valid dataset from data/demo.valid.0.pt, number of examples: 56222
[2018-11-03 18:21:37,464 INFO] Validation perplexity: 5.48158
[2018-11-03 18:21:37,465 INFO] Validation accuracy: 71.1213
[2018-11-03 18:25:03,947 INFO] Loading train dataset from data/demo.train.0.pt, number of examples: 324990

[2018-11-03 19:35:18,649 INFO] Step 174000/175000; acc: 88.92; ppl: 1.54; xent: 0.43; lr: 0.00012; 5260/5615 tok/s; 196250 sec
[2018-11-03 19:44:27,036 INFO] Step 174500/175000; acc: 86.10; ppl: 1.66; xent: 0.51; lr: 0.00012; 6856/5695 tok/s; 196798 sec
[2018-11-03 19:53:32,873 INFO] Step 175000/175000; acc: 85.64; ppl: 1.74; xent: 0.55; lr: 0.00012; 4735/4204 tok/s; 197344 sec

Usually yes, but it also comes with increased memory usage and slower execution.

You should try a subword tokenization technique such as BPE or SentencePiece for better vocabulary coverage.
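
For example, a minimal SentencePiece sketch (assuming a reasonably recent `sentencepiece` Python package; the file name, model prefix, vocab size, and model type below are placeholders, not tuned values):

```python
import sentencepiece as spm

# Train a subword model on the raw training text.
# "train.hn" and the 32k vocab size are placeholders -- tune per language/corpus.
spm.SentencePieceTrainer.train(
    input="train.hn",
    model_prefix="hn_subword",
    vocab_size=32000,
    model_type="bpe",
)

# Encode the corpus into subword pieces before running OpenNMT preprocessing.
sp = spm.SentencePieceProcessor(model_file="hn_subword.model")
print(sp.encode("यह एक उदाहरण वाक्य है", out_type=str))
```

You would then encode both sides of the corpus this way before preprocessing, so that rare words are split into known pieces instead of becoming UNK.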
