In our API we have to load models from disk at every request: translator = ctranslate2.Translator(translation_model_path)
Would it be possible to load the model into memory only once and then point the translator at it, e.g. via a memory stream? That would save us the initial disk read delay.
In my application I have about 20 different models, and the Translator constructor requires a model_path, which is different for every model. Is there a way to reuse the same Translator object when all parameters are identical except for the model_path?
Of course, if you have multiple models you should create multiple Translator instances. However, you should try to create only one translator per model during the lifetime of your application.
For example, you can store the translators in a dictionary mapping language pairs to Translator instances. When your API is called, look up this dictionary to get the corresponding ready-to-use translator.
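A minimal sketch of that per-model cache, with the expensive loading call injected so the logic is visible on its own (the language pairs and model paths below are placeholders; in the real application the loader would be ctranslate2.Translator):

```python
class TranslatorPool:
    """Caches one ready-to-use translator per language pair."""

    def __init__(self, model_paths, load):
        # model_paths: dict mapping a language pair to its model directory.
        # load: the expensive model-loading call; in the real application
        #       this would be ctranslate2.Translator.
        self._model_paths = model_paths
        self._load = load
        self._cache = {}

    def get(self, pair):
        # Load the model on first use only; later calls reuse the instance.
        if pair not in self._cache:
            self._cache[pair] = self._load(self._model_paths[pair])
        return self._cache[pair]
```

Built once at startup, e.g. `pool = TranslatorPool(paths, ctranslate2.Translator)`, each API call can then serve requests with `pool.get(("en", "de")).translate_batch(batch)` without touching the disk again.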
As a feature request, please consider adding a Translator constructor that takes an input stream or stream reader, as not all servers have a local disk attached.
In our particular case, we store our models in a GCS bucket and have to download them first. With an input stream, we could load a new Translator directly from storage.
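In the meantime, our download step looks roughly like this; a sketch assuming the google-cloud-storage Python client, where the bucket and the model prefix are placeholders (the destination can be a tmpfs mount so nothing hits a physical disk):

```python
import os
import tempfile

def download_model(bucket, prefix, dest_dir=None):
    # bucket: a google.cloud.storage Bucket; prefix: the "directory"
    # in the bucket that holds one converted model.
    dest_dir = dest_dir or tempfile.mkdtemp(prefix="ct2-model-")
    for blob in bucket.list_blobs(prefix=prefix):
        # Recreate the blob layout under dest_dir.
        rel = os.path.relpath(blob.name, prefix)
        local_path = os.path.join(dest_dir, rel)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        blob.download_to_filename(local_path)
    return dest_dir
```

The Translator is then built from the local copy once, e.g. `translator = ctranslate2.Translator(download_model(bucket, "en-de"))`, and kept for the lifetime of the process.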
For reference, this is already possible with the C++ API, which offers a way to customize how the model files are read. It looks like the GCS C++ API could fit nicely into this usage.
We can consider bringing a similar functionality to Python.