Corpus compactization

szhitansky · June 21, 2017, 4:25pm

Hello,

We have a lot of parallel training data, and a full training run can take months.
I'm trying to work out a method for corpus compactization. Here is the main idea:

Create a histogram of unique words
Cut the vocabulary to the most used words (top 100K)
Define a weight for each string: each word from the vocabulary adds +1 to the weight of the string
Cut strings with a low weight (<2 by default)

Any comments or advice?

Thanks!

Etienne38 · June 21, 2017, 4:45pm

Not sure that the number of unknown words is the best criterion. Even with several rare words a sentence could be very interesting, while sentences with only common words would be of no real interest.

Perhaps I would test something like:

Randomly select a given number of sentences: 2M, 5M, …?
Do a fast training of a model on them.
Do a fast translation of your whole corpus, using beam_size=1. This could be very time-consuming if your data set is really huge, so you may also have to restrict this step to a randomly chosen subset.
Keep the best-translated sentences: 2M, 5M, …?
Train a model thoroughly on them.
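
The selection parts of this procedure (the random draw and keeping the best-translated sentences) might look like the sketch below. The training and translation steps are left out. The helper names are hypothetical, and the per-sentence score is an assumption: the post does not say how "best translated" is measured, so the scores are simply taken as a list supplied by the caller, with higher meaning better.

```python
import heapq
import random


def sample_indices(n_total, n_sample, seed=0):
    """Pick n_sample distinct sentence indices uniformly at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_total), min(n_sample, n_total)))


def keep_best(pairs, scores, n_keep):
    """Return the n_keep highest-scoring pairs, in original corpus order.

    scores[i] is a caller-supplied quality score for pairs[i]
    (hypothetical; higher is assumed to mean a better translation).
    """
    best = heapq.nlargest(n_keep, range(len(pairs)), key=lambda i: scores[i])
    return [pairs[i] for i in sorted(best)]
```

Returning the kept pairs in corpus order is a design choice made here so that the output stays aligned with the original files; sorting by score would work just as well.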

Just an idea… along the lines of this one:

szhitansky · June 21, 2017, 5:00pm

I mean that we can remove sentences in which every word is unknown, because there is no sense in using them for training.