Quick Start Tutorial: Google Colab

chopinml · April 13, 2021, 4:16pm

I have included both text and code blocks in this Colab. It is a very fast way to see file and console outputs without installing anything on your local machine.

Plkmoi · April 15, 2021, 12:04pm

This is interesting. If this were combined with the Tatoeba Project's tab-delimited bilingual sentence pairs (manythings.org), good translation models could be trained for those languages.

chopinml · April 15, 2021, 12:43pm

Hello @Plkmoi

I’m not very experienced in deep learning; I have only been using the library for a month.

This link seems very useful, but the number of sentences might not be adequate for many languages.

http://www.manythings.org/bilingual/

English - Russian (421K)
English - Italian (345K)
English - German (227K)

As far as I know from previous forum topics, at least 1M sentences are needed for decent translations. But I will try with the same Colab; some preprocessing is needed (wget the zip file, create two separate files from the tab-separated values, etc.).
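
The preprocessing mentioned above could look roughly like the sketch below. It assumes the zip has already been downloaded (for example with wget) and that each line of the text file inside holds the source sentence and the target sentence in its first two tab-separated columns; any further columns are ignored. The function names `split_pairs` and `write_parallel_files` are made up for illustration and are not part of any library.

```python
import io
import zipfile
from pathlib import Path


def split_pairs(lines):
    """Split tab-separated lines into two aligned lists of sentences.

    Assumption: column 0 is the source sentence and column 1 is the
    target sentence. Extra columns (if any) are ignored.
    """
    src, tgt = [], []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Skip blank or malformed lines that lack two non-empty columns.
        if len(cols) < 2 or not cols[0].strip() or not cols[1].strip():
            continue
        src.append(cols[0].strip())
        tgt.append(cols[1].strip())
    return src, tgt


def write_parallel_files(zip_path, member, src_out, tgt_out):
    """Read one text member from the zip and write two line-aligned files."""
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8")
            src, tgt = split_pairs(text)
    # One sentence per line; line N of src_out pairs with line N of tgt_out.
    Path(src_out).write_text("\n".join(src) + "\n", encoding="utf-8")
    Path(tgt_out).write_text("\n".join(tgt) + "\n", encoding="utf-8")
    return len(src)
```

A call might then be `write_parallel_files("pairs.zip", "pairs.txt", "train.src", "train.tgt")`, where all four names are placeholders. The actual member name inside a given zip can be checked with `zipfile.ZipFile(path).namelist()`.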

Which language pair would you like me to try first from this corpus?

Thank you.

Plkmoi · April 15, 2021, 5:21pm

English and Berber would be interesting, since the data includes Latin script alongside Tifinagh script, whose letters are very different: https://www.manythings.org/anki/ber-eng.zip. According to the Tatoeba language index, there are 542,769 Kabyle sentences (Kabyle is a variant of Berber) and 382,839 Berber sentences.

chopinml · April 15, 2021, 5:32pm

131,357 sentences; maybe that could give some results. I will try that in a different Colab then.