Get VMap from the corpus to be used in CTranslate2

Please refer to this issue.

I am getting an empty gzipped phrase table using the commands from this.
Can anyone please confirm that this works?

Thanks.

It should work.

The N value is the maximum N-gram length. So it should probably be around 3 or 4.
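To illustrate, the phrase extraction considers phrases of up to N tokens. A minimal sketch of enumerating all n-grams up to that length (illustration only, not the actual extraction code):

def ngrams_up_to(tokens, max_n):
    # Yield every n-gram of length 1..max_n from a token list.
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

# With N=3, a sentence contributes unigrams, bigrams, and trigrams:
print(list(ngrams_up_to("the cat sat on the mat".split(), 3)))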

What is the size of the corpus you are using as input?

The corpus is the WMT en-de training set, around 4.5 million sentence pairs (tokenized and provided by OpenNMT-py).
The exact commands I used are:

docker build -f Dockerfile . -t build-pt
sudo docker run -v $(pwd):/root/corpus build-pt train en de 3 > phrase-table.gz

But I still got an empty phrase table.

Screenshot for reference.

If I remember correctly, there were some noisy lines in this corpus that should be removed first.

Let me take a look. It’s possible I had some local changes to make it work on this corpus.


@gvskalyan I updated the script to filter out empty lines:

Could you update, rebuild the Docker image, and try again? I verified that it produces a non-empty phrase table for the corpus you mentioned.
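For anyone hitting the same issue, the fix amounts to dropping sentence pairs where either side is empty. A minimal sketch of that kind of filter in Python (the file names here are placeholders, not the actual script):

# Drop parallel sentence pairs where either side is empty.
with open("train.en") as src_in, open("train.de") as tgt_in, \
     open("train.filtered.en", "w") as src_out, \
     open("train.filtered.de", "w") as tgt_out:
    for src, tgt in zip(src_in, tgt_in):
        if src.strip() and tgt.strip():
            src_out.write(src)
            tgt_out.write(tgt)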

Thanks. The phrase table was generated, and its gzipped version is 2.2 GB.

I generated the vmap using:
python build-vmap.py -pt ../../../phrase-table -ms 3 -mf 2 -km 20 -tv target_vocabulary -z zg_list > vmap

I am unable to evaluate the vmap generated in the above step:
python eval-vmap.py -vmap vmap -tv target_vocabulary -src ../../../train.en -tgt ../../../train.de

What is the error?

This line https://github.com/OpenNMT/papers/blob/master/WNMT2018/vmap/eval-vmap.py#L23 is giving the following error:

(src,tgt) = l.split('\t')
ValueError: too many values to unpack
Though I am able to convert the model with CTranslate2 and run translate_file on it.

Thanks.

The eval-vmap script does not appear to be very robust, so it's probably better to feed it a smaller and less noisy test file (the training data probably contains all sorts of things).
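If you do want to run it on noisier data, one way to make the parsing more tolerant is to skip lines that do not split into exactly two fields, along the lines of this sketch (not a patch to the actual script):

with open("test.tsv") as f:  # hypothetical tab-separated src/tgt file
    for l in f:
        parts = l.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # skip malformed lines instead of raising ValueError
        src, tgt = parts
        # ... evaluate the pair as eval-vmap.py would ...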

In any case, the vocabulary map you generated should now work with CTranslate2.
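For completeness, using the vocabulary map at translation time looks roughly like this with the CTranslate2 Python API (assuming the vmap file has been placed in the converted model directory, e.g. as vmap.txt; the model path and tokens below are placeholders):

import ctranslate2

# "ende_ct2/" stands in for your converted model directory.
translator = ctranslate2.Translator("ende_ct2/")

# use_vmap=True restricts the target vocabulary using the map.
results = translator.translate_batch([["▁Hello", "▁world"]], use_vmap=True)
print(results[0].hypotheses[0])  # recent API; older versions return dicts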
