CTranslate2: generating a vocabulary mapping file for M2M100

Jourdelune · February 4, 2022, 7:13am

Hello, I’m trying to generate a vocabulary mapper file to improve the performance of M2M100 but I can’t apply the method described here: papers/WNMT2018/vmap at master · OpenNMT/papers (github.com).

The M2M100 model comes with a vocabulary file, but it is a plain text file rather than a phrase table. So I have to convert it, and the documentation says to run:
docker run --rm -v MYCORPUSPATH:/root/corpus build-pt CORPUSNAME SS TT N > phrase-table.gz
with CORPUSPATH/CORPUSNAME.{SS,TT} as the argument.

Is it enough to pass the path to the vocabulary file? I haven’t got the script working yet, so any help would be appreciated. If I understood correctly, the text must be in another format.

Note: I need the vocabulary for all languages, not just one.

guillaumekln · February 4, 2022, 8:45am

As indicated in the README, you should provide tokenized source and target files which usually correspond to the data used to train the model (or a subset of these data).
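To illustrate the file layout that build-pt expects, here is a minimal Python sketch that writes a parallel corpus as CORPUSPATH/CORPUSNAME.{SS,TT}, one tokenized sentence per line. The function name write_tokenized_corpus and its parameters are hypothetical and not part of any OpenNMT tool. The tokenize callable is a placeholder: for M2M100 it would presumably wrap the model's own SentencePiece tokenization so that the tokens match the model vocabulary, but that is an assumption.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def write_tokenized_corpus(
    pairs: Iterable[Tuple[str, str]],
    tokenize: Callable[[str], List[str]],
    corpus_dir: Path,
    corpus_name: str,
    src_lang: str,
    tgt_lang: str,
) -> Tuple[Path, Path]:
    """Write aligned, tokenized files named CORPUSNAME.SS and CORPUSNAME.TT.

    `tokenize` is a placeholder (assumption): pass whatever tokenizer
    the model was trained with. Tokens are joined with single spaces.
    """
    corpus_dir.mkdir(parents=True, exist_ok=True)
    src_path = corpus_dir / f"{corpus_name}.{src_lang}"
    tgt_path = corpus_dir / f"{corpus_name}.{tgt_lang}"

    with src_path.open("w", encoding="utf-8") as src_f, \
            tgt_path.open("w", encoding="utf-8") as tgt_f:
        for src, tgt in pairs:
            # Line i of the source file must align with line i of the target file.
            src_f.write(" ".join(tokenize(src)) + "\n")
            tgt_f.write(" ".join(tokenize(tgt)) + "\n")

    return src_path, tgt_path
```

The directory passed as corpus_dir would then be the one mounted in place of MYCORPUSPATH in the docker command, with corpus_name, src_lang and tgt_lang matching CORPUSNAME, SS and TT. Covering several languages would mean repeating this for each language pair.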

You should read the Fairseq documentation to regenerate the M2M100 training data: