What should an optimal parallel corpus look like?

Hello,

Let’s assume that I have decided to build a parallel corpus of 1 million sentences from the ground up:

  1. What are the minimum/maximum sentence lengths?
  2. What pitfalls should I avoid (e.g. initials, abbreviations, digits)?
  3. Are there any recommended sources for best practices for such a task?

Thank you,
Nart.
@ymoslem

OPUS has lots of good datasets. I normally use all of the largest ones on the assumption that more data generally helps.

Thank you, but my question was how to build one from the ground up.

Hi Nart,

As a corpus with 1 million sentences does not seem a general-purpose one, I assume the main question here is: “what is the purpose of this corpus?” The answer to this question can help answer your other questions.

Is it for a low-resource language? Then quality is the most important factor; we do not need another crawled (unrevised) corpus. Is it a technical dataset? Then note that technical sentences are usually shorter, and so on.

As for references, I would say that papers about filtering approaches can give insights into the quality issues that should be avoided in the first place when creating a new dataset. Examples include:

All the best,
Yasmin


You can use WikiMatrix or something like it directly. It generates vector representations across different languages and then creates parallel data from similar vectors.
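To give a rough feel for that embed-and-match idea (not the actual WikiMatrix pipeline, which is built around LASER and more sophisticated scoring), here is a minimal sketch using LaBSE through the sentence-transformers library. The file names and the 0.8 threshold are assumptions for illustration:

```python
# Sketch: mine candidate pairs by embedding sentences from both languages
# into one vector space and matching by cosine similarity.
# Assumes two plain-text files, one sentence per line (hypothetical names).
import numpy as np
from sentence_transformers import SentenceTransformer

src_sentences = open("source.txt", encoding="utf-8").read().splitlines()
tgt_sentences = open("target.txt", encoding="utf-8").read().splitlines()

model = SentenceTransformer("sentence-transformers/LaBSE")
src_emb = model.encode(src_sentences, normalize_embeddings=True)
tgt_emb = model.encode(tgt_sentences, normalize_embeddings=True)

# Cosine similarity reduces to a dot product on normalized embeddings.
sim = np.dot(src_emb, tgt_emb.T)

# For each source sentence, keep its best target if the score is high enough.
THRESHOLD = 0.8  # assumption; tune on a held-out sample
for i, row in enumerate(sim):
    j = int(np.argmax(row))
    if row[j] >= THRESHOLD:
        print(f"{src_sentences[i]}\t{tgt_sentences[j]}\t{row[j]:.3f}")
```

Note that the brute-force similarity matrix above only works for small files; real mining tools use approximate nearest-neighbour search (e.g. FAISS) and margin-based scoring to scale to millions of sentences.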

It is for a low-resource language; the dataset will be created from e-book text, which will be edited, split, and then translated.

Yes

It is part of an effort to kickstart a program at the university where students edit and translate sentences by hand to build up a high-quality parallel corpus.

If we build a parallel corpus of 1 million sentences by hand, can we get a usable general-purpose model?

What is considered high quality for training general-purpose models?

I haven’t looked into the references yet.

Thank you,
Nart.

Hi Nart!

There are a few options here. If these students are really going to translate 1 million sentences, it is better for these to be real translations. I mean, instead of just collecting data randomly and translating it, check with a couple of non-profit organizations and cooperate with them to translate content that can help others. Translators without Borders and AMARA are two among many examples.

Another option would be what @argosopentech suggested, which can be broken down into two approaches:

  1. find a crawled dataset in your language pair, filter it automatically, and then assign human editors to check the remaining sentences. They can either approve the translation of a segment or correct it (see the sketch after this list); or
  2. use the same kind of tools to crawl and match a new parallel dataset, and then maybe also do #1. Examples of these tools: Facebook created LASER, and Google (1, 2) created LaBSE and m-USE.
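To make option 1 more concrete, here is a minimal sketch that scores an already-aligned pair of files with LaBSE and routes low-scoring segments to human editors. The file names and the 0.75 cut-off are assumptions, not recommendations:

```python
# Sketch: score existing source/target pairs and split them into
# "likely fine" and "needs human review" buckets.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

with open("crawled.src", encoding="utf-8") as f_src, \
     open("crawled.tgt", encoding="utf-8") as f_tgt:
    src = f_src.read().splitlines()
    tgt = f_tgt.read().splitlines()

src_emb = model.encode(src, normalize_embeddings=True)
tgt_emb = model.encode(tgt, normalize_embeddings=True)
scores = (src_emb * tgt_emb).sum(axis=1)  # cosine similarity per pair

CUTOFF = 0.75  # assumption; inspect the score distribution before choosing
with open("keep.tsv", "w", encoding="utf-8") as keep, \
     open("review.tsv", "w", encoding="utf-8") as review:
    for s, t, sc in zip(src, tgt, scores):
        out = keep if sc >= CUTOFF else review
        out.write(f"{s}\t{t}\t{sc:.3f}\n")
```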

I would still like to answer this specific question:

This can differ based on field and language. For example, sentences in legal texts are longer than those in technical texts, and the same idea in Spanish could be expressed in more words than in English.

So, basically, what I do is take a dataset from different fields, delete any sentence that is longer than x words (e.g. 150), and then check the average word count of the remaining sentences.
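A minimal sketch of that length check, assuming a plain-text file with one sentence per line (the 150-word cap comes from the example above; the file name is an assumption):

```python
# Sketch: drop sentences longer than a word cap and report the average
# length of what remains. Splitting on whitespace is a rough word count.
MAX_WORDS = 150  # example cap mentioned above

kept = []
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        n_words = len(line.split())
        if 0 < n_words <= MAX_WORDS:
            kept.append(n_words)

if kept:
    print(f"kept {len(kept)} sentences, "
          f"average length {sum(kept) / len(kept):.1f} words")
```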

Note that the maximum token length and the filter-too-long option in OpenNMT are counted in subwords. So check the default or the value you assign and make sure your numbers match.
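To see how word and subword counts diverge, here is a small sketch using a SentencePiece model (the model path and the example sentence are assumptions; any trained model will do):

```python
# Sketch: compare whitespace word counts with subword counts, since
# OpenNMT length filters operate on subword tokens.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm.model")  # path is an assumption

sentence = "Low-resource morphology often splits into many subword pieces."
words = sentence.split()
pieces = sp.encode(sentence, out_type=str)

print(len(words), "words ->", len(pieces), "subword tokens")
# A sentence well under a word limit can still exceed a subword limit,
# so set the filter threshold with subword counts in mind.
```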

All the best,
Yasmin


I would chime in that OPUS datasets are riddled with mistakes (wrong languages, mistranslations, etc.). It’s virtually impossible to go through them manually, so there are all sorts of cleaning/filtering techniques out there, like OpusTools and Bicleaner, to help with the filtering, e.g. by the ratio of source/target length.
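As a rough illustration of that length-ratio rule, here is a minimal sketch (the 2.0 ratio and the file names are assumptions; tools like Bicleaner apply many more hard rules plus a learned classifier):

```python
# Sketch: drop sentence pairs whose source/target word-count ratio is
# too far from 1, a common hard rule in corpus cleaning.
MAX_RATIO = 2.0  # assumption; tune per language pair

def ratio_ok(src: str, tgt: str, max_ratio: float = MAX_RATIO) -> bool:
    ls, lt = len(src.split()), len(tgt.split())
    if ls == 0 or lt == 0:
        return False
    return max(ls, lt) / min(ls, lt) <= max_ratio

with open("raw.src", encoding="utf-8") as fs, \
     open("raw.tgt", encoding="utf-8") as ft, \
     open("clean.src", "w", encoding="utf-8") as out_s, \
     open("clean.tgt", "w", encoding="utf-8") as out_t:
    for s, t in zip(fs, ft):
        if ratio_ok(s, t):
            out_s.write(s)
            out_t.write(t)
```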
