I have to say judging from all the tech jargon on this forum I’m not even sure if I am in the right place. I do not know what OpenNMT is.
When translating Dutch to English using Argos Translate, it's often a disaster; the output is frequently incomprehensible. Going the other direction I can't be a good judge, but it seems slightly better. Still quite rough, though.
I sent the original English version along with the Argos-produced Dutch to a native Dutch speaker who said the Dutch is rough. They are rewriting the Dutch version manually.
My question: can the manually created Dutch version and the English source be submitted somewhere in order to improve future Argos translations?
What you are looking for is to further fine-tune the machine translation model. Somewhere in the Argos system you should have a model file on which the translations are based. That file can either be swapped out for a better one (check out Hugging Face's Model Hub) or potentially improved further by fine-tuning. I don't know for sure, but it should be possible in principle to do the fine-tuning with OpenNMT and a 'simple' onmt_train call with the right configuration. For this you need to specify 'train_from' and reset the optimizer in the configuration file / call. In general, OpenNMT is the 'framework' or library you can use to train specific machine translation models.
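To make that concrete: I haven't looked at how Argos stores its checkpoints, but with plain OpenNMT-py a fine-tuning run would look roughly like the config below. All the file paths, step counts, and the learning rate are placeholders I made up, not actual Argos files or recommended values:

```yaml
# finetune.yaml -- hypothetical OpenNMT-py fine-tuning config (paths are illustrative)
data:
    corrections:
        path_src: corrected.en   # your English source sentences, one per line
        path_tgt: corrected.nl   # the human-corrected Dutch, aligned line by line
save_model: models/en_nl_finetuned

# continue from the existing checkpoint instead of training from scratch
train_from: models/en_nl_base.pt
# discard the saved optimizer state so the new learning rate takes effect
reset_optim: all

train_steps: 2000
learning_rate: 0.0001
```

You would then run `onmt_train -config finetune.yaml`. With only a handful of corrected sentence pairs this is likely to overfit, so treat it as a sketch of the mechanism rather than a recipe.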
Best to talk to @argosopentech about it. I'm fairly sure he doesn't store the OpenNMT-py model files due to lack of space, so the best solution would be to figure out how to train a new one (maybe using argos-train, not sure).
I’m just curious, what are some of the bad examples you’ve found?
I’m not sure how to DM you on this platform. And I don’t see a way to attach files to a post. This forum UI is quite limited. I don’t see how to strikethrough text. Otherwise I could put a sample here that would show the Argos output and the human corrections. Or I would DM you the whole 12 page pdf (before and after the human corrections).
As far as becoming a machine learning trainer goes, that sounds like a substantial effort. Which makes me wonder if the translation was rough to begin with because there is no simple crowd-sourcing mechanism for the masses to contribute good translations to a community tool without each submitter needing to become knowledgeable about Python and machine learning.
I appreciate the information. It all sounds a bit complex for what I was hoping would be a crowd-sourcing service where I could simply submit the English source and an accurate human-created Dutch translation, which would then be integrated into the model. I have no background in Python or machine learning. I probably don't need a customised improved translation just for myself, at least not if it involves a lot of effort and learning. I was just hoping to submit a translation that could be used to improve Argos for everyone.
That link you posted has 172 models for the nl/en language pair, IIUC. Does that mean I have 172 models to try for each translation I need to perform? Or are the models all combined into one?
Huggingface is an AI model hub; it's possible to find a better translation model there that can be converted to CTranslate2 format and put into the .argospackage used to hold the models locally. You'd have to look at the CTranslate2 conversion docs to familiarize yourself.
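As a rough sketch of the conversion step (the model name here is just an example of a Dutch-English model on the Hub, and the exact flags may vary by CTranslate2 version):

```shell
# install the converter, then convert a Hugging Face MarianMT model to CTranslate2 format
pip install ctranslate2 transformers sentencepiece
ct2-transformers-converter --model Helsinki-NLP/opus-mt-nl-en --output_dir opus-mt-nl-en-ct2
```

The resulting directory would then still have to be packaged into an .argospackage by hand; I'd check the Argos package layout before attempting that.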
As for ArgosTranslate crowd sourcing, this forum serves the OpenNMT ecosystem, which is a series of libraries for translating/generating from trained AI models as well as for training the models themselves. You'd need @argosopentech's input for any sharing of data for Argos.
Might be better to open an issue in the Github repo for Argos to get his attention.
As for training data issues, this is simply an inherent issue in the training process of AI models. The models are only as good as the data they are trained on, and sequence-to-sequence models usually require a huge amount of data (millions of sentence pairs) to produce production-ready translation models. There are huge datasets of translated text on the web from sources funded by governments, companies, the UN, etc., but they all share the same problems of misalignment, bad translations, or even the wrong language.
As time goes on, tools have been developed to try to curb this, but there is always room to do better.
We’re collecting them but currently not really using them because the dataset we’ve collected from users is still too small.
You could try posting on the LibreTranslate forum to request someone train a new Dutch model. The process for training is somewhat involved but there are several people on the forum who are able to do it well.