OpenNMT-py REST server BPE tokenization problem

icanfast · May 25, 2020, 2:45pm

Hello!
I have trained my OpenNMT-py model. For tokenization I called
python OpenNMT-py/tools/learn_bpe.py
and did so quite blindly, I must confess.

Now I’m faced with the task of running a REST server as described in this tutorial:

The issue is, I don’t know which tokenization parameters to pass to the config (in the available models directory). Which type did I use by default, and do I need to provide any arguments besides the type and the path to “src.code”?

Wishing all the best to this community,
Maxim

guillaumekln · May 25, 2020, 2:50pm

Hi,

Did you run any tokenization script before calling learn_bpe.py?

icanfast · May 25, 2020, 2:57pm

Hello!
No, I provided raw files.
I then ran tools/apply_bpe.py for train and inference.

guillaumekln · May 25, 2020, 3:08pm

As far as I know, learn_bpe.py splits on spaces, so can you try the following configuration?

{
    "type": "pyonmttok",
    "mode": "space",
    "params": {
        "joiner": "@@",
        "joiner_annotate": true,
        "bpe_model_path": "src.code"
    }
}

Please note that BPE requires a pretokenization, so you might want to revise this process in the future.
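
For context, the tokenizer block above normally sits inside one model entry of the server's conf.json. The sketch below builds such a file in Python so the nesting is visible. Only the tokenizer settings come from this thread; the model id, the checkpoint name "model.pt", the models_root directory, and the opt values are placeholders assumed for illustration and should be checked against the server tutorial.

```python
import json

# Tokenizer settings copied from the reply above.
tokenizer = {
    "type": "pyonmttok",
    "mode": "space",
    "params": {
        "joiner": "@@",
        "joiner_annotate": True,
        "bpe_model_path": "src.code",
    },
}

# Everything below is an assumption for illustration: the id, the
# checkpoint name "model.pt", the models_root directory, and the opt values.
config = {
    "models_root": "./available_models",
    "models": [
        {
            "id": 100,
            "model": "model.pt",
            "timeout": 600,
            "on_timeout": "to_cpu",
            "load": True,
            "opt": {"gpu": -1, "beam_size": 5},
            "tokenizer": tokenizer,
        }
    ],
}

# Serialize to the JSON text that would be saved as conf.json.
conf_text = json.dumps(config, indent=4)
print(conf_text)
```

Where "src.code" must live relative to the models directory is not stated in this thread, so it is worth verifying on your installation.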

icanfast · May 25, 2020, 4:01pm

I did just that and it… did not crash, but the quality of the translations somehow degraded. It now returns empty strings or single dashes for almost every query. There is definitely a difference though.
And yes, the delimiter is always "@@ " - two “ats” followed by a space:
“dissection” > “dis@@ section”
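
To make the "@@ " convention concrete, here is a minimal sketch of how segmented text can be merged back into words. It only illustrates the delimiter described above and is not the code the server itself runs; the function name remove_bpe is made up for this example.

```python
import re

def remove_bpe(text, separator="@@"):
    """Merge subword pieces back into words by dropping the separator."""
    # Remove "@@ " between pieces, and a dangling "@@" at the end of the line.
    pattern = re.escape(separator) + r"( |$)"
    return re.sub(pattern, "", text)

print(remove_bpe("a dis@@ section of the ab@@ dom@@ en"))
# prints: a dissection of the abdomen
```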

guillaumekln · May 25, 2020, 4:09pm

Can you give an example and compare onmt_translate with the server response?
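
One way to get the server side of that comparison is to post a single sentence and print the response. The sketch below assumes the server listens on 127.0.0.1:5000 with the URL root /translator and that the model id is 100; all three are assumptions to adjust to your own setup.

```python
import json
import urllib.request

# Assumed defaults: adjust host, port, URL root, and model id to your setup.
SERVER_URL = "http://127.0.0.1:5000/translator/translate"

def build_request(text, model_id, url=SERVER_URL):
    """Build a POST request carrying one source sentence for one model."""
    payload = json.dumps([{"src": text, "id": model_id}]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def translate_via_server(text, model_id, url=SERVER_URL):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(text, model_id, url)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, print(translate_via_server("dissection", 100)) gives the server output for one sentence. Running onmt_translate on the same sentence, after applying BPE to it, gives the output to compare against.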

icanfast · May 25, 2020, 4:28pm

They are the same. Thanks! I should have checked myself. BPE works correctly with that config.
Although I ran a few more tests, and performance is far better without BPE across the board…

Thank you for resolving my problem!

guillaumekln · May 25, 2020, 4:30pm

This sounds like you trained your model on raw data and not on BPE-tokenized data?
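
A quick way to check this is to count how many training lines contain the "@@ " marker. The helper below is a rough sketch (the name bpe_marker_ratio is made up for this example); a ratio near zero on the files passed to preprocessing would suggest the model saw raw text.

```python
def bpe_marker_ratio(lines, marker="@@ "):
    """Return the fraction of non-empty lines containing the BPE marker."""
    non_empty = [line for line in lines if line.strip()]
    if not non_empty:
        return 0.0
    with_marker = sum(1 for line in non_empty if marker in line)
    return with_marker / len(non_empty)

raw = ["a dissection of the abdomen", "the patient recovered"]
segmented = ["a dis@@ section of the ab@@ dom@@ en", "the patient recovered"]
print(bpe_marker_ratio(raw))        # 0.0
print(bpe_marker_ratio(segmented))  # 0.5
```

On real data, pass the open training file instead of a list, for example bpe_marker_ratio(open("train.src", encoding="utf-8")), where "train.src" stands for whatever file was actually used for training.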

icanfast · May 25, 2020, 4:37pm

I may have… Or partly at least.
And I have not pretokenized either.
Anyway, a reason to look into tokenization deeper, thanks!