CTranslate2: generating a vocabulary mapping file for M2M100

Hello, I’m trying to generate a vocabulary mapping file to improve the performance of M2M100, but I can’t apply the method described here: papers/WNMT2018/vmap at master · OpenNMT/papers

The M2M100 model comes with a vocabulary file, but it is a plain text file rather than a phrase table. So I have to convert it, and the documentation says to run:
docker run --rm -v MYCORPUSPATH:/root/corpus build-pt CORPUSNAME SS TT N > phrase-table.gz
where the arguments refer to the files CORPUSPATH/CORPUSNAME.{SS,TT}

Is it enough to pass the path to the vocabulary file? I haven’t got the script working yet, so any help would be appreciated :wink:
If I understood correctly, the text must be in another format.

Note: I need the vocabulary for all languages, not just one.

As indicated in the README, you should provide tokenized source and target files which usually correspond to the data used to train the model (or a subset of these data).
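To make the expected input concrete, here is a minimal sketch of how a pair of line-aligned, tokenized files named CORPUSNAME.SS and CORPUSNAME.TT could be written from sentence pairs. The function names and the whitespace tokenizer are hypothetical placeholders, not part of the vmap tooling: for M2M100 you would presumably pass a tokenizer that applies the same subword model the translation model was trained with.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def whitespace_tokenize(sentence: str) -> List[str]:
    # Placeholder tokenizer (assumption): splits on whitespace only.
    # Replace it with the tokenization used to train the model.
    return sentence.split()


def write_parallel_corpus(
    pairs: Iterable[Tuple[str, str]],
    corpus_dir: Path,
    corpus_name: str,
    src_lang: str,
    tgt_lang: str,
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> Tuple[Path, Path]:
    """Write CORPUSNAME.SS and CORPUSNAME.TT, one tokenized sentence per line."""
    corpus_dir.mkdir(parents=True, exist_ok=True)
    src_path = corpus_dir / f"{corpus_name}.{src_lang}"
    tgt_path = corpus_dir / f"{corpus_name}.{tgt_lang}"
    with src_path.open("w", encoding="utf-8") as src_file, \
            tgt_path.open("w", encoding="utf-8") as tgt_file:
        for source, target in pairs:
            src_tokens = tokenize(source)
            tgt_tokens = tokenize(target)
            # Skip pairs with an empty side so both files stay line-aligned.
            if not src_tokens or not tgt_tokens:
                continue
            src_file.write(" ".join(src_tokens) + "\n")
            tgt_file.write(" ".join(tgt_tokens) + "\n")
    return src_path, tgt_path
```

For example, `write_parallel_corpus(pairs, Path("corpus"), "train", "en", "fr")` would produce `corpus/train.en` and `corpus/train.fr`, which matches the CORPUSPATH/CORPUSNAME.{SS,TT} layout the command refers to. Since the command takes a single SS/TT pair, covering several languages would presumably mean repeating this step for each pair.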

You should read the Fairseq documentation to regenerate the M2M100 training data:

