The current priorities are improving the training scripts to better automate training, and collecting user input [1, 2] to identify which models are most valuable to train.
Looking forward, I plan to keep the current package format for at least most of 2022. When breaking changes do occur, I’m considering single-character tokenization and seq2seq sentence boundary detection. Depending on how the machine translation field progresses, few-shot translation, which is already implemented, may also play a larger role in later versions.
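To illustrate the few-shot idea (this is a conceptual sketch, not Argos Translate’s actual prompt format), a general language model can be conditioned on a handful of example translations and then asked to continue the pattern:

```python
# Conceptual sketch of few-shot translation prompting.
# The prompt layout and example pairs are illustrative assumptions,
# not Argos Translate's actual implementation.

EXAMPLE_PAIRS = [
    ("Hello", "Bonjour"),
    ("How are you?", "Comment allez-vous ?"),
]

def build_few_shot_prompt(source_text: str) -> str:
    """Show the model a few English -> French pairs, then ask it to
    continue the pattern for the new sentence."""
    blocks = [f"English: {en}\nFrench: {fr}" for en, fr in EXAMPLE_PAIRS]
    blocks.append(f"English: {source_text}\nFrench:")
    return "\n\n".join(blocks)

# Whatever the language model generates after the final "French:" line
# is taken as the translation.
print(build_few_shot_prompt("Good morning"))
```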
I’d also like to expand into using CTranslate2 language models for more tasks, possibly including Q&A, text summarization, text generation, messaging, and more. It’s currently possible to use Argos Translate for custom tasks by training a custom translation model, but pretrained models and better support would make this much easier.
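For reference, here is roughly what the existing Python API looks like for installing a model package and translating with it; a custom-task model trained with the same tooling would be installed and invoked the same way (the en/es package is just an example, and details may vary between versions):

```python
import argostranslate.package
import argostranslate.translate

# Download and install an English -> Spanish package from the package index.
argostranslate.package.update_package_index()
available_packages = argostranslate.package.get_available_packages()
package_to_install = next(
    p for p in available_packages if p.from_code == "en" and p.to_code == "es"
)
argostranslate.package.install_from_path(package_to_install.download())

# Look up the installed languages and translate between them.
installed = argostranslate.translate.get_installed_languages()
en = next(lang for lang in installed if lang.code == "en")
es = next(lang for lang in installed if lang.code == "es")
print(en.get_translation(es).translate("Hello world"))
```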
Another promising area is combining multiple pieces of functionality into one model; this allows for larger models in absolute terms (instead of many small ones) that can share an understanding of language and the world. For example, the sentence boundary detection models currently used in the Mac app are separate from the translation models. In the future it could become possible to combine them.
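One common way to combine tasks like this (sketched below as an assumption about how it could work, not a committed design) is to train a single seq2seq model that switches tasks based on a special prefix token on the input:

```python
import ctranslate2

# Hypothetical sketch: one shared seq2seq model handles both translation and
# sentence boundary detection, selected by a task-prefix token. The model
# path and token names are assumptions, not a real Argos Translate model.
translator = ctranslate2.Translator("combined_model/")

def run_task(task_token, tokens):
    """Prepend a task token so the shared model knows which task to perform."""
    results = translator.translate_batch([[task_token] + tokens])
    # Recent CTranslate2 versions return TranslationResult objects;
    # older versions return dicts with a "tokens" key.
    return results[0].hypotheses[0]

# The same weights serve two tasks:
translation = run_task("<translate-en-es>", ["Hello", "world", "."])
boundaries = run_task("<detect-boundaries>", ["Hi", "there", ".", "Bye", "."])
```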