"unknown" words with BPE tokenization

When we train a BPE model on training data, what happens to “unknown” words from a test set?

Do they most likely get split almost letter by letter?

And if no merge operation matches during BPE encoding, are they left as is?


Yes to the first question: the very first step of BPE encoding is to split the word (or input) into individual characters; then the merge operations learnt during BPE training are applied repeatedly to merge pairs of symbols.

No to the second: if none of the learnt merge operations match, the input stays in its split form, i.e. as a sequence of characters.
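
Both behaviours can be sketched in a few lines. The following is a minimal, hypothetical BPE encoder (the function name, the toy merge list, and the example words are all made up for illustration): it splits the input into characters, then repeatedly applies the earliest-learnt merge that still occurs. A word covered by the merges collapses into larger subwords; an unknown word with no matching merges stays as single characters.

```python
def bpe_encode(word, merges):
    """Apply learnt BPE merges to a word.

    `merges` is an ordered list of symbol pairs from training,
    earliest-learnt (highest priority) first.
    """
    symbols = list(word)  # first step: split into individual characters
    while len(symbols) > 1:
        # pairs of adjacent symbols currently present in the word
        pairs = {(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)}
        candidates = [m for m in merges if m in pairs]
        if not candidates:
            break  # no merge operation matches: leave the split as is
        best = min(candidates, key=merges.index)  # earliest-learnt merge wins
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Toy merges, as if learnt on some training corpus (hypothetical):
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_encode("lower", merges))  # merges apply: ['low', 'er']
print(bpe_encode("xyz", merges))    # no merge matches: ['x', 'y', 'z']
```

Real implementations add details such as end-of-word markers and a priority-queue lookup instead of a linear scan over the merge list, but the character-split-then-merge loop is the same.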