Shared Vocab usage & workflow

Hi. Do we generally see higher accuracy with shared vocabularies, even when the languages are unrelated? According to these experiments, shared is always better, even for Ja/En.

So do I implement this by training a SentencePiece model on one file containing both languages, then converting that shared vocab to work with OpenNMT?
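Concretely, I'm imagining something like this for the conversion step (a rough sketch on my part; I'm assuming SentencePiece writes a `shared.vocab` file with one `token<TAB>score` pair per line, and that OpenNMT-py will accept a plain token-per-line vocab file passed as both the source and target vocab, e.g. together with its shared-vocab option):

```python
# After training one SentencePiece model on the combined file,
# rewrite its .vocab file into a plain token list for OpenNMT-py.
with open("shared.vocab", encoding="utf-8") as fin, \
        open("shared.onmt.vocab", "w", encoding="utf-8") as fout:
    for line in fin:
        token, _score = line.rstrip("\n").split("\t")
        if token in ("<unk>", "<s>", "</s>"):
            continue  # skip specials, assuming OpenNMT adds its own
        fout.write(f"{token}\t1\n")  # dummy count; I assume only the token order matters
```

Is that roughly the intended workflow, or is there a built-in tool for this?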

Also, I wonder if you have seen this paper on a novel sharing technique for the embedding space; they got good results even for Japanese/Chinese → English.

Regards,
Matt

The paper was interesting! As I understand it, they first determine which (sub)words are similar across the two languages and then push their embeddings closer together. The advantage is that this makes encoding and decoding easier for the model.
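I haven't tried reproducing it, but the "pull matched embeddings together" idea could look something like the following (a hypothetical sketch, not the paper's actual method; `similar_pairs` is an assumed precomputed list of matched source/target subword ids):

```python
import torch

def similarity_penalty(embedding: torch.nn.Embedding,
                       similar_pairs: list[tuple[int, int]],
                       weight: float = 0.1) -> torch.Tensor:
    """Extra loss term that pulls the embeddings of matched
    source/target subwords closer together (cosine distance)."""
    src_ids = torch.tensor([s for s, _ in similar_pairs])
    tgt_ids = torch.tensor([t for _, t in similar_pairs])
    src_vecs = embedding(src_ids)
    tgt_vecs = embedding(tgt_ids)
    cos = torch.nn.functional.cosine_similarity(src_vecs, tgt_vecs, dim=-1)
    return weight * (1.0 - cos).mean()

# During training this term would simply be added to the usual
# translation loss: loss = nll_loss + similarity_penalty(emb, pairs)
```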

For Argos Translate I put all of the data into one file and then train the tokenizer on that file, roughly like the sketch below.
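Something along these lines (a minimal sketch of the idea, assuming SentencePiece as the tokenizer; the file names are placeholders and the real training scripts differ in the details):

```python
import sentencepiece as spm

# Concatenate the source and target sides into one training file
# so the tokenizer sees both languages.
with open("combined.txt", "w", encoding="utf-8") as out:
    for path in ("data.en", "data.ja"):  # placeholder file names
        with open(path, encoding="utf-8") as f:
            for line in f:
                out.write(line)

# Train a single shared SentencePiece model on the combined file.
spm.SentencePieceTrainer.train(
    input="combined.txt",
    model_prefix="sentencepiece",  # writes sentencepiece.model and sentencepiece.vocab
    vocab_size=32000,
    character_coverage=0.9995,  # helpful when Japanese or Chinese is involved
)
```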


The method described in the paper seems needlessly complex. They classify pairs of words in the source and target vocab as lexically similar, words of the same form, or unrelated. Why not just compare all words (source and target) based on some general measure of similarity? That would remove the arbitrary similarity boundaries and allow you to exploit similarities within a language.
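For example, even something as crude as character n-gram overlap, scored over every pair in the joint vocabulary, would give a continuous notion of similarity instead of hard classes (a toy sketch, not anything from the paper):

```python
def char_ngrams(token: str, n: int = 3) -> set[str]:
    """Character n-grams of a token, used as a crude similarity signal."""
    return {token[i:i + n] for i in range(max(len(token) - n + 1, 1))}

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of character trigrams, in [0, 1]."""
    na, nb = char_ngrams(a), char_ngrams(b)
    return len(na & nb) / len(na | nb)

# Toy joint vocabulary; the same scoring covers cross-language pairs
# (nation/nacional) and within-language pairs (nation/national).
vocab = ["nation", "national", "nacional", "国家"]
pairs = [(a, b, similarity(a, b)) for a in vocab for b in vocab if a != b]
```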