For this fake corpus
`when engage what`
its character vocabulary size is 7 (`e a h w n g t`).
If we learn BPE with num_operations = 2 and apply it with the two generated codes (`wh` and `en`), we get:
`wh@@ en en@@ g@@ a@@ g@@ e wh@@ a@@ t`
The final vocabulary size is 7 (`a@@ wh@@ g@@ e t en en@@`), not 9.
Am I calculating it wrong?
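To double-check, here is a toy sketch of BPE application (my own illustrative code, not subword-nmt itself) that applies the two merges and counts the unique tokens:

```python
# Toy BPE application: apply the learned merges to each word, then mark
# every non-final subword with "@@" (subword-nmt's convention).
corpus = ["when", "engage", "what"]
merges = [("w", "h"), ("e", "n")]  # the two learned operations

def apply_bpe(word, merges):
    symbols = list(word)
    for a, b in merges:
        i, merged = 0, []
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    # every subword except the last gets the "@@" continuation marker
    return [s + "@@" for s in symbols[:-1]] + [symbols[-1]]

encoded = [tok for word in corpus for tok in apply_bpe(word, merges)]
print(" ".join(encoded))  # wh@@ en en@@ g@@ a@@ g@@ e wh@@ a@@ t
print(len(set(encoded)))  # 7
```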
In my opinion, the equation `final vocabulary size = character vocabulary + num_operations` is based on the assumption that every merge operation generates exactly one new token.
But in this case, the merge of `e` and `n` generates two tokens in the encoded text, `en` and `en@@`, and this phenomenon is totally unpredictable in advance. To make sure there are no unknown words, the final vocabulary size would have to be 18??
(`e a h w n g t wh en e@@ a@@ h@@ w@@ n@@ g@@ t@@ wh@@ en@@`)
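That worst-case count just enumerates every symbol in both its word-final and its `@@`-continuation form:

```python
# 7 characters + 2 merge results, each possibly appearing in two forms
symbols = ["e", "a", "h", "w", "n", "g", "t", "wh", "en"]
safe_vocab = set(symbols) | {s + "@@" for s in symbols}
print(len(safe_vocab))  # 18
```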
I am really confused!
How is the final vocabulary generated, and how can its size be controlled exactly?