Are there any pre-trained Chinese-to-English models already available?

HamiguaLu · July 5, 2021, 6:18pm

Hi there, I am quite new to machine translation and OpenNMT. Actually, I only started searching on Google yesterday, so sorry if my questions look stupid.
Is there any pre-trained Chinese-to-English model already available?
Or is there any free Chinese-to-English data set available to train on?
I want to use OpenNMT to translate web pages from Chinese to English (and vice versa).

Thanks in advance

miguelknals · July 5, 2021, 7:53pm

Hi

Google is your best friend.

There was a similar question in the forum some time ago → Other pretrained models

But note that the models and corpora you will find are mainly toys you can play with, and are actually far from suitable for any serious use.

The task you are interested in (translating web pages) from scratch with OpenNMT is difficult (very difficult indeed), because you have to deal not only with the core translation (a whole problem by itself) but also with the real world (setting up a whole flow from the input web page to the output page: servers, file pipeline).

Humble opinion here: you should select a more manageable task, for instance “learning” neural translation, or how to set up a translation system for web pages. It depends a little on your interests.

Hope this helps!
Have a nice day
Miguel Canals

ymoslem · July 5, 2021, 11:02pm

Hello!

I agree with Miguel that it is important to determine the purpose you have and find the most suitable solution. Still, I would like to answer this specific question.

Yes, Argos Translate has pre-trained models. In reality, they are CTranslate2 models, compressed as a zip archive. So if you want to use them in CTranslate2, you just need to extract them. The models can be downloaded HERE, and as far as I can see, they have zh_en and en_zh.

Both Argos Translate and its models are under the MIT or CC0 licence, so I am not sure if I have to say “no affiliation”. What I want to say, though, is that PJ @argosopentech might want to comment here in case I missed something, or if you have more questions.

All the best,
Yasmin

argosopentech · July 5, 2021, 11:47pm

Thanks @ymoslem! Yes, Argos Translate has pretrained English to Chinese models (Argos package index). Like Yasmin said, they are quantized CTranslate2 models packaged in a zip archive with a SentencePiece and Stanza model. You can either use the models with Argos Translate, or extract them from the zip archive and use them with CTranslate2 directly. Be aware that the CTranslate2 models depend on the specific tokenization from the SentencePiece tokenizer.
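
For readers who want to try the second route, here is a minimal sketch of using an extracted package with CTranslate2 directly. It assumes the extracted directory contains a `model/` folder (the CTranslate2 model) and a `sentencepiece.model` file; that layout, the directory name, and the helper names `load_argos_model` and `translate_sentences` are assumptions for illustration, so check your extracted archive. Input is expected to be already split into sentences.

```python
from typing import List


def translate_sentences(sentences: List[str], translator, sp) -> List[str]:
    """Translate pre-split sentences with a CTranslate2 translator.

    `sp` must be the SentencePiece model shipped with the translation
    model: the model only understands that exact tokenization.
    """
    # Text -> subword pieces (strings), one token list per sentence.
    tokenized = [sp.encode(s, out_type=str) for s in sentences]
    # translate_batch returns one result per input; hypotheses[0] is the best.
    results = translator.translate_batch(tokenized)
    # Subword pieces -> plain text.
    return [sp.decode_pieces(r.hypotheses[0]) for r in results]


def load_argos_model(package_dir: str):
    """Load translator and tokenizer from an extracted package directory.

    Assumed layout (verify against your archive):
      <package_dir>/model/               CTranslate2 model
      <package_dir>/sentencepiece.model  SentencePiece model
    """
    # Imported here so the helper above can be used without these packages.
    import ctranslate2
    import sentencepiece as spm

    translator = ctranslate2.Translator(f"{package_dir}/model", device="cpu")
    sp = spm.SentencePieceProcessor(model_file=f"{package_dir}/sentencepiece.model")
    return translator, sp


if __name__ == "__main__":
    # "extracted_zh_en" is a hypothetical path to the unzipped package.
    translator, sp = load_argos_model("extracted_zh_en")
    print(translate_sentences(["你好，世界。"], translator, sp))
```

Both libraries install with `pip install ctranslate2 sentencepiece`. Sentence splitting (which Argos Translate handles with the bundled Stanza model) is left out of this sketch.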

For data I recommend the OPUS parallel corpus.
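
OPUS corpora can be downloaded in a line-aligned plain-text format, where line N of the source file is the translation of line N of the target file. A minimal sketch for reading such a pair of files follows; the file names in the usage part are hypothetical, and the only cleaning done here is dropping pairs with an empty side.

```python
from itertools import zip_longest
from typing import Iterator, Tuple


def read_parallel(src_path: str, tgt_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned files."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        for s, t in zip_longest(src, tgt):
            # A missing line on either side means the files are not aligned.
            if s is None or t is None:
                raise ValueError("source and target files have different lengths")
            s, t = s.strip(), t.strip()
            # Skip pairs where one side is empty; they carry no training signal.
            if s and t:
                yield s, t


if __name__ == "__main__":
    # Hypothetical file names for a downloaded zh-en corpus.
    for zh, en in read_parallel("corpus.zh", "corpus.en"):
        print(zh, "\t", en)
        break
```

Real corpora usually need more filtering (duplicates, misaligned or very long pairs) before training.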

Like @miguelknals said, translating XML is difficult, but translating headers, paragraphs, etc. individually may be easier.
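
As a rough illustration of the "translate pieces individually" idea, the sketch below walks an HTML page with Python's standard `html.parser`, leaves all tags untouched, and passes each text segment to a `translate` callable. The class name `SegmentTranslator`, the helper `translate_html`, and the callable itself are hypothetical; any function that maps a string to a string can be plugged in.

```python
from html import escape
from html.parser import HTMLParser
from typing import Callable, List


class SegmentTranslator(HTMLParser):
    """Rebuild an HTML document, translating text nodes one at a time."""

    SKIP = {"script", "style"}  # contents of these tags are copied verbatim

    def __init__(self, translate: Callable[[str], str]):
        super().__init__(convert_charrefs=True)
        self.translate = translate
        self.out: List[str] = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())  # original tag text, attributes intact

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # self-closing tags such as <br/>

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_data(self, data):
        core = data.strip()
        if self.skip_depth or not core:
            self.out.append(data)  # code, styles, or pure whitespace
            return
        # Keep the surrounding whitespace so the layout is preserved.
        lead = data[: len(data) - len(data.lstrip())]
        trail = data[len(data.rstrip()):]
        # Text arrives with entities decoded, so escape it again on output.
        self.out.append(lead + escape(self.translate(core), quote=False) + trail)


def translate_html(html_text: str, translate: Callable[[str], str]) -> str:
    parser = SegmentTranslator(translate)
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


if __name__ == "__main__":
    page = "<h1>标题</h1><p>你好 <b>世界</b></p>"
    # Stand-in translator that just brackets each segment.
    print(translate_html(page, lambda s: f"[{s}]"))
```

Note the limitation: inline tags such as `<b>` split a sentence into separate fragments, which are then translated without context. A more careful pipeline would group the text of each block-level element before translating.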