Get VMap from the corpus to be used in CTranslate2

Please refer to this issue.

I am getting an empty gzipped phrase table using the commands from this.
Can anyone please confirm that this works?

Thanks.

It should work.

The N value is the maximum N-gram length. So it should probably be around 3 or 4.
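To illustrate, the phrase extraction considers phrases of up to N tokens. A minimal sketch of enumerating all n-grams up to that length (illustration only, not the actual extraction code):

def ngrams_up_to(tokens, max_n):
    # Yield every n-gram of length 1..max_n from a token list.
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

# With N=3, a sentence contributes unigrams, bigrams, and trigrams:
print(list(ngrams_up_to("the cat sat on the mat".split(), 3)))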

What is the size of the corpus you are using as input?

The corpus is the WMT en-de training set, around 4.5 million sentence pairs (tokenized and provided by OpenNMT-py).
The exact commands I used are:

docker build -f Dockerfile . -t build-pt
sudo docker run -v $(pwd):/root/corpus build-pt train en de 3 > phrase-table.gz

But I still got an empty phrase table.

Screenshot for reference.

If I remember correctly, there were some noisy lines in this corpus that should be removed first.

Let me take a look. It’s possible I had some local changes to make it work on this corpus.


@gvskalyan I updated the script to filter out empty lines:

Could you update, rebuild the Docker image, and try again? I verified that it produces a non-empty phrase table for the corpus you mentioned.
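For anyone hitting the same issue, the fix amounts to dropping sentence pairs where either side is empty. A minimal sketch of that kind of filter in Python (the file names here are placeholders, not the actual script):

# Drop parallel sentence pairs where either side is empty.
with open("train.en") as src_in, open("train.de") as tgt_in, \
     open("train.filtered.en", "w") as src_out, \
     open("train.filtered.de", "w") as tgt_out:
    for src, tgt in zip(src_in, tgt_in):
        if src.strip() and tgt.strip():
            src_out.write(src)
            tgt_out.write(tgt)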

Thanks. The phrase table was generated, and its gzipped version is 2.2 GB.

I generated the vmap using:
python build-vmap.py -pt ../../../phrase-table -ms 3 -mf 2 -km 20 -tv target_vocabulary -z zg_list > vmap

I am unable to evaluate the vmap generated in the above step:
python eval-vmap.py -vmap vmap -tv target_vocabulary -src ../../../train.en -tgt ../../../train.de

What is the error?

This line https://github.com/OpenNMT/papers/blob/master/WNMT2018/vmap/eval-vmap.py#L23 is giving the following error:

(src,tgt) = l.split('\t')
ValueError: too many values to unpack
Though I am able to convert the model with CTranslate2 and run translate_file on it.

Thanks.

The eval-vmap script does not appear to be very robust, so it's probably better to feed it a smaller and less noisy test file (the training data probably contains all sorts of things).
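If you do want to run it on noisier data, one way to make the parsing more tolerant is to skip lines that do not split into exactly two fields, along the lines of this sketch (not a patch to the actual script):

with open("test.tsv") as f:  # hypothetical tab-separated src/tgt file
    for l in f:
        parts = l.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # skip malformed lines instead of raising ValueError
        src, tgt = parts
        # ... evaluate the pair as eval-vmap.py would ...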

In any case, the vocabulary map you generated should now work with CTranslate2.
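For completeness, using the vocabulary map at translation time looks roughly like this with the CTranslate2 Python API (assuming the vmap file has been placed in the converted model directory, e.g. as vmap.txt; the model path and tokens below are placeholders):

import ctranslate2

# "ende_ct2/" stands in for your converted model directory.
translator = ctranslate2.Translator("ende_ct2/")

# use_vmap=True restricts the target vocabulary using the map.
results = translator.translate_batch([["▁Hello", "▁world"]], use_vmap=True)
print(results[0].hypotheses[0])  # recent API; older versions return dicts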
