Corpus compaction


We have a lot of parallel training material, and full training can take months.
I'm trying to work out a scheme for corpus compaction. Here is the main idea:

  • Create a histogram of unique words
  • Cut the vocabulary to the most frequent words (the first 100K)
  • Define a weight for each string: each word found in the vocabulary adds +1 to the string's weight
  • Cut strings with low weights (< 2 by default)
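The steps above can be sketched in a few lines of Python. This is a minimal, whitespace-tokenized sketch; the function names (`build_vocab`, `string_weight`, `compact`) and the default parameters are just illustrative choices, not an existing API:

```python
from collections import Counter

def build_vocab(sentences, size=100_000):
    # Histogram of unique words across the corpus,
    # cut to the `size` most frequent ones.
    counts = Counter(w for s in sentences for w in s.split())
    return {w for w, _ in counts.most_common(size)}

def string_weight(sentence, vocab):
    # Each token present in the vocabulary adds +1 to the weight.
    return sum(1 for w in sentence.split() if w in vocab)

def compact(sentences, vocab_size=100_000, min_weight=2):
    # Drop strings whose weight falls below the threshold.
    vocab = build_vocab(sentences, vocab_size)
    return [s for s in sentences if string_weight(s, vocab) >= min_weight]
```

Note that with `min_weight=2`, sentences with at most one in-vocabulary word are discarded, regardless of how long they are.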

Any comments or advice?


I'm not sure the number of unknown words is the best criterion. Even with several rare words a sentence could be very interesting, while sentences containing only common words may be of no real interest.

Perhaps I would test something like:

  • randomly select a given number of sentences: 2M, 5M, …?
  • do a fast training of a model on them
  • do a fast translation, with beam_size=1, of your whole corpus. This could be very time-consuming if your data set is really huge; you may have to restrict this step to a randomly chosen subset as well.
  • keep the best-translated sentences: 2M, 5M, …?
  • do a full training of a model on them

Only an idea… along the same lines as this one:

I mean that we can remove sentences made up entirely of unknown words, because there is no sense in using them for training.
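That weaker rule reduces to a one-line predicate: keep a sentence only if at least one of its tokens is in the vocabulary. A minimal sketch (the function name is just illustrative):

```python
def keep_sentence(tokens, vocab):
    # Drop sentences where every token is out of vocabulary;
    # keep the sentence if any token is known.
    return any(t in vocab for t in tokens)
```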