Using a joint vocab (src + tgt) for BPE modelling has some issues. I observed some "invented" words in the target translations. Apparently I am not the only one; here is an extract from the UoEDIN people's paper for WMT17:
2.1 Subword Segmentation
Like last year, we use joint byte-pair encoding (BPE) for subword segmentation (Sennrich et al., 2016c) (except for ZH↔EN, where we train two separate BPE models). Joint BPE introduces undesirable edge cases in that it may produce subword units that have only been observed in one side of the parallel training corpus, and may thus be out-of-vocabulary at test time. To prevent this, we have modified our BPE script to only produce subword units at test time that have been observed in the source side of the training corpus. Out-of-vocabulary subword units are recursively segmented into smaller units until this condition is met.
We use the same technique to disallow rare subword units (words occurring less than 50 times in the training corpus), both at test time and in the training corpus, both on the source-side and the target-side. This reduces the number of vocabulary symbols reserved for spurious, low-frequency subword units, and allows for more compact models. For example, for EN↔DE, using 90000 joint BPE operations, this filtering reduces the network vocabulary size for English from 80581 to 51092, with only a minor increase in sequence length (+0.2%). In preliminary experiments, this did not significantly affect BLEU, but slightly reduced the number of spurious OOVs produced – on EN→DE, unigram precision for OOVs increased from 0.34 to 0.36 on newstest2015 (N = 1168).
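To make the first trick concrete, here is a rough sketch of what vocabulary-restricted BPE application could look like: apply the learned merges as usual, then recursively undo any merge whose result was never observed (often enough) on the relevant side of the training corpus. This is my own toy code, not their modified script; `apply_bpe`, `split_unknown`, the merges and the allowed set are all made up for illustration, and real BPE implementations handle end-of-word markers and other details I skip here.

```python
# Toy vocabulary-restricted BPE application, assuming:
#  - `merges` is the ordered list of merge pairs learned on the joint corpus
#  - `allowed` is the set of subword units observed on the source side
# (illustrative names, not the authors' actual script)

def apply_bpe(word, merges):
    """Greedily apply the merge operations, in learned order, to one word."""
    symbols = list(word)
    for pair in merges:
        i, merged = 0, []
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

def split_unknown(unit, merges, allowed):
    """Recursively split a unit not in the allowed vocabulary by undoing
    the merge that produced it, down to single characters if needed."""
    if unit in allowed or len(unit) == 1:
        return [unit]
    for left, right in reversed(merges):
        if left + right == unit:
            return split_unknown(left, merges, allowed) + \
                   split_unknown(right, merges, allowed)
    return list(unit)  # no merge reproduces the unit: fall back to characters

def segment(word, merges, allowed):
    """Segment a word, keeping only subword units from `allowed`."""
    out = []
    for unit in apply_bpe(word, merges):
        out.extend(split_unknown(unit, merges, allowed))
    return out

# Toy example: 'low' was only ever merged on the target side,
# so it gets re-split into source-side units 'lo' + 'w'.
merges = [('l', 'o'), ('lo', 'w')]
allowed = {'l', 'o', 'w', 'lo'}
print(segment('low', merges, allowed))   # -> ['lo', 'w']
```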
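The frequency filtering (the "less than 50 times" part) then just amounts to building that allowed vocabulary from counts over the segmented training corpus. A minimal sketch, again with names of my own choosing:

```python
from collections import Counter

def build_allowed_vocab(segmented_corpus, min_count=50):
    """Keep only subword units occurring at least `min_count` times
    (the paper uses 50) in an already BPE-segmented corpus."""
    counts = Counter()
    for sentence in segmented_corpus:
        counts.update(sentence.split())
    return {unit for unit, c in counts.items() if c >= min_count}
```

If I remember correctly, Rico Sennrich's subword-nmt supports this out of the box via the `--vocabulary` and `--vocabulary-threshold` options of `apply_bpe.py`, so there should be no need to roll your own.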