For this fake corpus
`when engage what`
its character vocabulary size is 7 (`e a h w n g t`).
If we learn BPE with num_operations = 2 and apply it with the two generated codes (`wh` and `en`), we get:
`wh@@ en en@@ g@@ a@@ g@@ e wh@@ a@@ t`
The final vocabulary size is 7 (`a@@ wh@@ g@@ e t en en@@`), not 9.
Am I calculating it wrong?
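To double-check, here is a toy sketch of BPE application (my own illustrative code, not subword-nmt itself) that applies the two merges and counts the unique tokens:

```python
# Toy BPE application: apply the learned merges to each word, then mark
# every non-final subword with "@@" (subword-nmt's convention).
corpus = ["when", "engage", "what"]
merges = [("w", "h"), ("e", "n")]  # the two learned operations

def apply_bpe(word, merges):
    symbols = list(word)
    for a, b in merges:
        i, merged = 0, []
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    # every subword except the last gets the "@@" continuation marker
    return [s + "@@" for s in symbols[:-1]] + [symbols[-1]]

encoded = [tok for word in corpus for tok in apply_bpe(word, merges)]
print(" ".join(encoded))  # wh@@ en en@@ g@@ a@@ g@@ e wh@@ a@@ t
print(len(set(encoded)))  # 7
```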
In my opinion, the equation `final vocabulary size = character vocabulary + num_operations` is based on the assumption that every merge operation generates exactly one new token.
But in this case, the merge of `e` and `n` generates two tokens in the encoded text, `en` and `en@@`, and this phenomenon is totally unpredictable in advance. To make sure there are no unknown words, the final vocabulary size would have to be 18??
(`e a h w n g t wh en e@@ a@@ h@@ w@@ n@@ g@@ t@@ wh@@ en@@`)
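That worst-case count just enumerates every symbol in both its word-final and its `@@`-continuation form:

```python
# 7 characters + 2 merge results, each possibly appearing in two forms
symbols = ["e", "a", "h", "w", "n", "g", "t", "wh", "en"]
safe_vocab = set(symbols) | {s + "@@" for s in symbols}
print(len(safe_vocab))  # 18
```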
I am really confused!
How is the final vocabulary generated, and how can its size be controlled exactly?