At my workplace we train NMT models using OpenNMT-tf. We also have some other NLP models built with the Hugging Face and SentenceTransformers libraries. We were able to compile those models with the AWS Neuron SDK and run inference on an EC2 inf1.xlarge instance, but for OpenNMT-tf we are just not able to compile the NMT models with the Neuron SDK. We have some other models trained with fairseq and MarianNMT which we were able to compile and run on inf1.xlarge, but those models are not in production. From a cost perspective, management is now asking the ML team to replace OpenNMT-tf with either fairseq or MarianNMT so that we can use inf1.xlarge-based inference. However, we are quite accustomed to OpenNMT-tf because of its fast model saving during training and its great data shuffling options. Would it be possible to add AWS Inferentia-based compilation to the CTranslate2 pipeline? I also think it would significantly boost CTranslate2's popularity, since cheap inference is what everyone is looking for. Thank you!
Just a quick thought: have you tried serving a CTranslate2 model on a regular CPU and comparing the results? CTranslate2 with quantization is already very fast.
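For reference, a minimal sketch of what int8 CPU inference with CTranslate2's Python API looks like. The model directory name (`ende_ct2`) and the sample tokens are placeholders; you would first convert your OpenNMT-tf model with the CTranslate2 converter and tokenize input with your own SentencePiece model.

```python
def translate_int8(model_dir, token_batches, beam_size=2):
    """Translate pre-tokenized batches on CPU with int8 quantization.

    Requires a CTranslate2 model directory produced by one of the
    ct2-*-converter tools (e.g. from an OpenNMT-tf export).
    """
    import ctranslate2  # pip install ctranslate2

    translator = ctranslate2.Translator(
        model_dir,
        device="cpu",
        compute_type="int8",  # quantize weights at load time
    )
    results = translator.translate_batch(token_batches, beam_size=beam_size)
    # Each result holds n-best hypotheses; keep the best one per input.
    return [result.hypotheses[0] for result in results]

# Hypothetical usage with SentencePiece-style tokens:
sample_batch = [["▁Hello", "▁world"]]
# translate_int8("ende_ct2", sample_batch)
```

With `compute_type="int8"` the weights are quantized when the model is loaded, so the same converted model can be benchmarked against `float32` on GPU without re-converting.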
Currently we are serving them on g4dn.xlarge. The quantized versions are very fast on CPU, but not quite as fast as on GPU; even the base model on GPU seems a lot faster.