The paper was interesting! As I understand it, they first determine which (sub)words across the source and target languages are similar and then push the embeddings of those (sub)words closer together. The advantage of this is that it makes encoding and decoding easier for the model.
For Argos Translate I put all of the data into one file and then train the tokenizer on that file:
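Roughly like this, a minimal sketch assuming a shared SentencePiece tokenizer (the file names and vocab size are placeholders, not the exact Argos Translate pipeline):

```python
import sentencepiece as spm

# Concatenate the source and target corpora into a single training file.
with open("all_data.txt", "w", encoding="utf-8") as out:
    for path in ["source.txt", "target.txt"]:
        with open(path, encoding="utf-8") as f:
            for line in f:
                out.write(line)

# Train one shared tokenizer on the combined file, so both languages
# share the same subword vocabulary.
spm.SentencePieceTrainer.train(
    input="all_data.txt",
    model_prefix="sentencepiece",
    vocab_size=32000,
    character_coverage=0.9995,
)
```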
The method described in the paper seems needlessly complex. They classify pairs of words in the source and target vocab as lexically similar, words of the same form, or unrelated. Why not just compare all words (source and target) based on some general measure of similarity? That would remove the arbitrary similarity boundaries and allow you to exploit similarities within a language.
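To illustrate what I mean, a hypothetical sketch: score every pair of vocabulary items (source and target mixed together, including pairs within the same language) with a single general string-similarity measure instead of discrete categories. The word list and the choice of measure here are just placeholders:

```python
from difflib import SequenceMatcher
from itertools import combinations

# Toy mixed vocabulary; in practice this would be the full source + target vocab.
vocab = ["nation", "nación", "nationale", "dog", "perro"]

def similarity(a: str, b: str) -> float:
    # Ratio of matching characters, in [0, 1]; any other measure would work.
    return SequenceMatcher(None, a, b).ratio()

# Score all pairs and list the most similar ones first.
scores = {(a, b): similarity(a, b) for a, b in combinations(vocab, 2)}
for (a, b), s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{a:10s} {b:10s} {s:.2f}")
```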