CTranslate2 on OpenNMT-py Server

Hello!

I have just installed the latest versions of OpenNMT-py 2.0 and CTranslate2. I tried to use the OpenNMT-py server with the following configuration:

{
    "models_root": "/home/available_models",
    "models": [
        {
            "id": 100,
            "ct2_model": "ct2/hien",
            "model": "ct2/hien",
            "device": "cpu",
            "timeout": 1000,
            "on_timeout": "to_cpu",
            "load": true,
            "tokenizer": {
                "type": "sentencepiece",
                "model": "subword/hien/bpe/hi.model"
            },            
            "opt": {
                "beam_size": 5,
                "replace_unk": true,
                "verbose": true
            }
        }
    ]
}
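
For reference, I send translation requests to the server roughly like this (a minimal sketch of my client code; the route follows from url_root "/translator", and the payload is the server's REST format with my model id 100):

import requests

# Client sketch: port and url_root match the launch command below.
url = "http://localhost:3333/translator/translate"
payload = [{"src": "यह श्रीमती जी सब कुछ चुकता करेंगी", "id": 100}]
response = requests.post(url, json=payload)
print(response.json())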

• The model loaded successfully.
• When I translate for the first time, I get the error below.

root@mt:/home# python3 OpenNMT-py/server.py --ip "0.0.0.0" --port 3333 --url_root "/translator" --config available_models/conf.json > available_models/logs/log.log
[2021-01-25 23:22:09,630 INFO] Loading tokenizer
[2021-01-25 23:22:10,058 INFO] Loading model 100
[2021-01-25 23:22:13,504 INFO] Running translation using 100
[2021-01-25 23:22:13,504 ERROR] Error: The model for this translator was unloaded
[2021-01-25 23:22:13,504 ERROR] repr(text_to_translate): ['▁यह ▁श्रीमती ▁जी ▁सब ▁कुछ ▁चुकता ▁करेंगी']
[2021-01-25 23:22:13,504 ERROR] model: #100
[2021-01-25 23:22:13,504 ERROR] model opt: {'models': ['/home/available_models/ct2/hien'], 'fp32': False, 'int8': False, 'avg_raw_probs': False, 'data_type': 'text', 'src': 'dummy_src', 'tgt': None, 'tgt_prefix': False, 'shard_size': 10000, 'output': 'pred.txt', 'report_align': False, 'report_time': False, 'block_ngram_repeat': 0, 'ignore_when_blocking': [], 'replace_unk': True, 'ban_unk_token': False, 'phrase_table': '', 'min_length': 0, 'max_length': 100, 'max_sent_length': None, 'beam_size': 5, 'random_sampling_topk': 0, 'random_sampling_topp': 0, 'random_sampling_temp': 1.0, 'seed': -1, 'stepwise_penalty': False, 'length_penalty': 'none', 'ratio': -0.0, 'coverage_penalty': 'none', 'alpha': 0.0, 'beta': -0.0, 'log_file': '', 'log_file_level': '0', 'verbose': True, 'attn_debug': False, 'align_debug': False, 'dump_beam': '', 'n_best': 1, 'batch_size': 30, 'batch_type': 'sents', 'gpu': -1, 'cuda': False}
[2021-01-25 23:22:13,505 ERROR] Traceback (most recent call last):
  File "/home/OpenNMT-py/onmt/translate/translation_server.py", line 488, in run
    else self.opt.batch_size)
  File "/home/OpenNMT-py/onmt/translate/translation_server.py", line 114, in translate
    num_hypotheses=self.n_best
RuntimeError: The model for this translator was unloaded

[2021-01-25 23:22:13,505 INFO] Unloading model 100

• When I translate again, there is no error and I get the translation.
• I tried rebooting the server machine; no change.

What should I do? Thanks!

Kind regards,
Yasmin

1 Like

@francoishernandez Francois, I thought that removing "load": true from the configuration file had solved it, but the issue happened again. Any hints? Thanks!

I can reproduce the issue.
The CT2 wrapping in the server is not very clean or thoroughly tested. I’ve been using it mostly with models running on GPU, and the issue seems to be specific to the CPU path.
Some time ago, @guillaumekln introduced a way to unload a GPU model to CPU memory, which I then used in the onmt server here (a preloading mechanism, because CT2 had quite some overhead when first building the necessary objects):

and here:

The implicit assumption is that, when a GPU model has been unloaded to CPU, it will be moved back to GPU when it is needed again. However, for CPU models it does not work that way: once unload_model is called, the model is considered unloaded and raises the error you see:

We can probably handle the case in the server code, but maybe @guillaumekln would also like to catch it directly in CT2, not sure.
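
Something along these lines could work as a server-side guard (a rough sketch with hypothetical attribute names, not the actual translation_server.py code):

# Hypothetical wrapper method: skip the unload when the model already
# lives in CPU memory, so it is never marked as unloaded.
def unload_model(self, to_cpu=False):
    if to_cpu and self.device == "cpu":
        return  # nothing to move: the model is already on CPU
    self.translator.unload_model(to_cpu=to_cpu)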

1 Like

Yes, I think the unloading/loading mechanism of the translation server should be updated to not assume the model is running on GPU.

But coincidentally, a recent CTranslate2 commit will fix this specific error. unload_model(to_cpu=True) is now a no-op for translators that are already running on CPU:
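
To illustrate (a minimal sketch against the CT2 Python API, reusing the model path from the config above):

import ctranslate2

translator = ctranslate2.Translator("ct2/hien", device="cpu")
translator.unload_model(to_cpu=True)  # now a no-op: the model stays loaded
# The next call no longer raises "The model for this translator was unloaded".
results = translator.translate_batch([["▁यह", "▁श्रीमती", "▁जी"]], beam_size=5)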

Note this has been largely improved in recent versions and may not be needed anymore.

2 Likes

Many thanks, Guillaume and François!

The main reason I would use CTranslate2 models is to gain more speed. However, running a CTranslate2 model with the OpenNMT-py server to translate a sentence takes 0.52 seconds on average, while running the *.pt version of the same model on the same sentence takes 0.31 seconds on average.

I was thinking that if I am going to use CTranslate2, maybe I do not need the OpenNMT-py server. However, this would mean CTranslate2 has to load the model for each new translation request.

Thanks for your insights on this!

Kind regards,
Yasmin

How did you make this performance comparison? Are all parameters the same?

Dear Guillaume,

I used the translation time entry from the OpenNMT-py server log, with the same config file as above. I used this command to convert the model:

ct2-opennmt-py-converter --model_path model.pt --model_spec TransformerBase --output_dir model_ctranslate

Note though that I had one difference from TransformerBase during training, which is batch_size: 2048.

I translated the same sentence 10 times. For the *.pt model, the translation time ranges from 0.25 to 0.37 seconds. For the CTranslate2 model, it stays at approximately 0.52 seconds.

Thanks!
Yasmin

1 Like

The number of threads for CTranslate2 is hardcoded to 1 in the translation server:

So when running the server for OpenNMT-py models, you should set the environment variable OMP_NUM_THREADS=1 to get comparable numbers. We expect CTranslate2 to always be faster when settings are comparable.
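
For example, mirroring the launch command from earlier in the thread:

OMP_NUM_THREADS=1 python3 OpenNMT-py/server.py --ip "0.0.0.0" --port 3333 --url_root "/translator" --config available_models/conf.json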

The translation server should definitely be updated to improve CPU support.

1 Like

Fully agreed. I don’t have the bandwidth right now, but I opened this issue to keep track of it. If anyone is willing to contribute, feel free to ask questions on the issue.

2 Likes

Many thanks, François and Guillaume!

Now, as I am trying to import CTranslate2 into a Flask app directly, the website does not load at all via HTTPS. It says “your connection is not secure”. Is there anything in CTranslate2 that conflicts with HTTPS? Is there any way to solve this?

Thanks!

No, this is not related to CTranslate2. You should probably check the Flask documentation about this.

OK, thanks, Guillaume! It only happens when I import CTranslate2; if I remove the import, it works fine. I will try to figure out what is wrong. Thanks!

Hi again, François and Guillaume!

Just an update that I managed to use CTranslate2 directly with Flask. The big news is that I managed to hugely cut my costs (from 8 GB to 3 GB of CPU RAM, and less disk space).

I first used the OpenNMT-py script onmt_release_model, then created a CTranslate2 model with --quantization int8, and finally, in the code, set the argument use_vmap=True in translate_batch().
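
Concretely, the pipeline looked roughly like this (a sketch; model paths are placeholders and the exact flag spellings are from memory, so treat them as assumptions):

onmt_release_model --model model.pt --output model_released.pt
ct2-opennmt-py-converter --model_path model_released.pt --model_spec TransformerBase --quantization int8 --output_dir model_ctranslate

In the Flask app, the translator is created once at startup, so the model is not reloaded per request (again a minimal sketch of my app; the route name, token handling, and the CT2 1.x result format are simplified assumptions):

import ctranslate2
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the int8 model once at startup; it is reused across requests.
translator = ctranslate2.Translator("model_ctranslate", device="cpu")

@app.route("/translate", methods=["POST"])
def translate():
    tokens = request.get_json()["tokens"]  # already-tokenized subwords
    results = translator.translate_batch([tokens], beam_size=5, use_vmap=True)
    # First hypothesis of the single example (CT2 1.x dict result format).
    return jsonify(results[0][0]["tokens"])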

If there are more performance tricks I should follow, I will be even more grateful.

Many thanks!
Yasmin

1 Like

Note that just enabling use_vmap will not make a difference. The vocabulary mapping file should be generated using this procedure. It may not be easy to apply for each model.

If there are more performance tricks I should follow, I will be even more grateful.

There are some general ideas here:

1 Like