OpenNMT Forum

CTranslate2 on OpenNMT-py Server

Hello!

I have just installed the latest version of OpenNMT-py 2.0 and CTranslate2. I tried to use the OpenNMT-py server with the following configuration:

{
    "models_root": "/home/available_models",
    "models": [
        {
            "id": 100,
            "ct2_model": "ct2/hien",
            "model": "ct2/hien",
            "device": "cpu",
            "timeout": 1000,
            "on_timeout": "to_cpu",
            "load": true,
            "tokenizer": {
                "type": "sentencepiece",
                "model": "subword/hien/bpe/hi.model"
            },            
            "opt": {
                "beam_size": 5,
                "replace_unk": true,
                "verbose": true
            }
        }
    ]
}
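For reference, a request against the server configured above can be built like this. This is a sketch: the endpoint path follows the --url_root and port used later in this thread, and the payload shape (a list of segments with "src" and the model "id" from conf.json) is the one the OpenNMT-py REST server expects.

```python
import json

# Build the request body expected by the OpenNMT-py REST server:
# a list of segments, each carrying the source text and the model id
# declared in conf.json.
def build_request(text, model_id=100):
    return [{"src": text, "id": model_id}]

payload = json.dumps(build_request("Hello world"), ensure_ascii=False)

# With the server from this thread running, the call would be roughly
# (requests is assumed to be installed; the URL mirrors --url_root):
# import requests
# r = requests.post("http://0.0.0.0:3333/translator/translate",
#                   data=payload,
#                   headers={"Content-Type": "application/json"})
# print(r.json())
```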

• The model loaded successfully.
• When I translate for the first time, I get the error below.

root@mt:/home# python3 OpenNMT-py/server.py --ip "0.0.0.0" --port 3333 --url_root "/translator" --config available_models/conf.json > available_models/logs/log.log
[2021-01-25 23:22:09,630 INFO] Loading tokenizer
[2021-01-25 23:22:10,058 INFO] Loading model 100
[2021-01-25 23:22:13,504 INFO] Running translation using 100
[2021-01-25 23:22:13,504 ERROR] Error: The model for this translator was unloaded
[2021-01-25 23:22:13,504 ERROR] repr(text_to_translate): ['▁यह ▁श्रीमती ▁जी ▁सब ▁कुछ ▁चुकता ▁करेंगी']
[2021-01-25 23:22:13,504 ERROR] model: #100
[2021-01-25 23:22:13,504 ERROR] model opt: {'models': ['/home/available_models/ct2/hien'], 'fp32': False, 'int8': False, 'avg_raw_probs': False, 'data_type': 'text', 'src': 'dummy_src', 'tgt': None, 'tgt_prefix': False, 'shard_size': 10000, 'output': 'pred.txt', 'report_align': False, 'report_time': False, 'block_ngram_repeat': 0, 'ignore_when_blocking': [], 'replace_unk': True, 'ban_unk_token': False, 'phrase_table': '', 'min_length': 0, 'max_length': 100, 'max_sent_length': None, 'beam_size': 5, 'random_sampling_topk': 0, 'random_sampling_topp': 0, 'random_sampling_temp': 1.0, 'seed': -1, 'stepwise_penalty': False, 'length_penalty': 'none', 'ratio': -0.0, 'coverage_penalty': 'none', 'alpha': 0.0, 'beta': -0.0, 'log_file': '', 'log_file_level': '0', 'verbose': True, 'attn_debug': False, 'align_debug': False, 'dump_beam': '', 'n_best': 1, 'batch_size': 30, 'batch_type': 'sents', 'gpu': -1, 'cuda': False}
[2021-01-25 23:22:13,505 ERROR] Traceback (most recent call last):
  File "/home/OpenNMT-py/onmt/translate/translation_server.py", line 488, in run
    else self.opt.batch_size)
  File "/home/OpenNMT-py/onmt/translate/translation_server.py", line 114, in translate
    num_hypotheses=self.n_best
RuntimeError: The model for this translator was unloaded

[2021-01-25 23:22:13,505 INFO] Unloading model 100

• When I translate again, there is no error and I get the translation.
• I tried rebooting the server machine; no change.

What should I do? Thanks!

Kind regards,
Yasmin

@francoishernandez Francois, I thought that removing "load": true from the configuration file had solved it, but the issue happened again. Any hints? Thanks!

I can reproduce the issue.
The CT2 wrapping in the server is not very clean nor thoroughly tested. I’ve been using it mostly with models running on GPU, and the issue seems to be specific to the CPU path.
Some time ago, @guillaumekln introduced a way to unload a GPU model to CPU memory, which I then used in the onmt server here (a preloading mechanism, because CT2 had quite some overhead when first building the necessary objects):

and here:

The implicit assumption is that, when a GPU model has been unloaded to CPU, it will be moved back to GPU when it is needed again. For CPU models, however, it does not work that way: when unload_model is called, the model is considered unloaded and raises the error you see:

We can probably handle the case in the server code, but maybe @guillaumekln would also like to catch it directly in CT2, not sure.

Yes, I think the unloading/loading mechanism of the translation server should be updated so that it does not assume the model is running on GPU.

But coincidentally, a recent CTranslate2 commit will fix this specific error. unload_model(to_cpu=True) is now a no-op for translators that are already running on CPU:

Note this has been largely improved in recent versions and may not be needed anymore.

Many thanks, Guillaume and François!

The main reason I would use CTranslate2 models is to gain more speed. However, running a CTranslate2 model with the OpenNMT-py server takes on average 0.52 seconds to translate a sentence, while running the *.pt version of the same model on the same sentence takes on average 0.31 seconds.

I was thinking that if I am going to use CTranslate2, maybe I do not need the OpenNMT-py server at all. However, that would mean CTranslate2 has to load the model for each new translation request.
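One way to avoid per-request loading without the server is to construct the translator once and reuse it across requests. A minimal sketch of that pattern, with the CTranslate2-specific constructor stubbed out behind a factory so only the caching logic is shown (the real call, shown in the comment, is an assumption about the usual API):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_translator(model_dir):
    # In a real app this would be something like:
    #   import ctranslate2
    #   return ctranslate2.Translator(model_dir, device="cpu")
    # Stubbed with a plain object so the sketch is self-contained.
    return object()

# The first call pays the model-loading cost; subsequent calls with the
# same path reuse the cached instance, so each HTTP request is fast.
t1 = get_translator("ct2/hien")
t2 = get_translator("ct2/hien")
```

In a Flask app, calling get_translator() inside the view function then behaves like a lazily initialized singleton per model directory.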

Thanks for your insights on this!

Kind regards,
Yasmin

How did you make this performance comparison? Are all parameters the same?

Dear Guillaume,

I used the translation time entry from the log of the OpenNMT-py server, with the same config file as above. I used this command to convert the model:

ct2-opennmt-py-converter --model_path model.pt --model_spec TransformerBase --output_dir model_ctranslate

Note though that I had one difference from TransformerBase during training, which is batch_size: 2048.

I translated the same sentence 10 times. For the *.pt model, the translation time ranges from 0.25 to 0.37 seconds. For the CTranslate2 model, it stays at approximately 0.52 seconds.
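When benchmarking like this, it also helps to time the calls in-process and discard the first run, since it includes one-off initialization. A small harness along these lines (translate is a placeholder for whichever backend is being measured):

```python
import time

def time_translations(translate, sentence, runs=10):
    """Time repeated calls to a translate function, discarding the
    first call as warm-up (object/model initialization)."""
    timings = []
    for i in range(runs + 1):
        start = time.perf_counter()
        translate(sentence)
        elapsed = time.perf_counter() - start
        if i > 0:  # skip the warm-up call
            timings.append(elapsed)
    # Report both the best and the average time.
    return min(timings), sum(timings) / len(timings)

# Example with a dummy backend standing in for a real translator:
best, avg = time_translations(lambda s: s.upper(), "test sentence")
```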

Thanks!
Yasmin

The number of threads for CTranslate2 is hardcoded to 1 in the translation server:

So when running the server with OpenNMT-py models, you should set the environment variable OMP_NUM_THREADS=1 to get comparable numbers. We expect CTranslate2 to always be faster when the settings are comparable.
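If the server is launched from a Python entry point rather than the shell, the variable has to be set before the libraries that read it are imported; once the OpenMP runtime has initialized, changing it has no effect. A sketch (the commented import names are the usual ones for this stack, shown as an assumption):

```python
import os

# Must run before torch / ctranslate2 are imported, otherwise the
# OpenMP runtime has already picked its thread count.
os.environ["OMP_NUM_THREADS"] = "1"

# import torch          # only import after the variable is set
# import ctranslate2
```

From the shell, the equivalent is prefixing the launch command with OMP_NUM_THREADS=1.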

The translation server should definitely be updated to improve CPU support.

Fully agreed. I don’t have the bandwidth right now, but I opened this issue to keep track of it. If anyone is willing to contribute on this, feel free to ask questions on the issue.

Many thanks, François and Guillaume!

Now, as I am trying to import CTranslate2 into a Flask app directly, the website does not load at all via HTTPS. It says “your connection is not secure”. Is there anything in CTranslate2 that conflicts with HTTPS, and any way to solve this?

Thanks!

No this is not related to CTranslate2. You should probably check the Flask documentation about this.

OK, thanks, Guillaume! It just happens when I import CTranslate2. If I remove the import, it works fine. I will try to figure out what is wrong. Thanks!

Hi again, François and Guillaume!

Just an update that I managed to use CTranslate2 directly with Flask. The big news is that I managed to hugely cut my costs (from 8 CPU/RAM down to 3 CPU/RAM, plus less disk space).

I first used the OpenNMT-py option onmt_release_model, then created a CTranslate2 model with --quantization int8, and finally, in the code, set the argument use_vmap=True in translate_batch().

If there are more performance tricks I should follow, I will be even more grateful.

Many thanks!
Yasmin

Note that just enabling use_vmap will not make a difference by itself. The vocabulary mapping file should be generated using this procedure. It may not be easy to apply for each model.

> If there are more performance tricks I should follow, I will be even more grateful.

There are some general ideas here:
