We have a lot of parallel training data, and a full training run can take months.
I'm trying to work out a logic for corpus compaction. Here is the main idea:
- Build a frequency histogram of the unique words.
- Cut the vocabulary to the most frequent words (top 100K).
- Define a weight for each sentence: each word that is in the vocabulary adds +1 to the sentence's weight.
- Drop sentences with low weight (< 2 by default).
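Roughly, a sketch of this in Python, just to illustrate the idea: file names, the 100K cut-off and the weight threshold are placeholders, and I assume plain whitespace tokenisation on line-aligned source/target files.

```python
from collections import Counter

VOCAB_SIZE = 100_000   # keep the 100K most frequent words
MIN_WEIGHT = 2         # drop sentences with weight < 2

def build_vocab(path):
    """Histogram of unique words, truncated to the most frequent ones."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    return {w for w, _ in counts.most_common(VOCAB_SIZE)}

def sentence_weight(line, vocab):
    """Each in-vocabulary token adds +1 to the sentence weight."""
    return sum(1 for tok in line.split() if tok in vocab)

def filter_corpus(src_path, tgt_path, out_src, out_tgt):
    """Keep only sentence pairs whose source weight reaches the threshold."""
    vocab = build_vocab(src_path)
    with open(src_path, encoding="utf-8") as fs, \
         open(tgt_path, encoding="utf-8") as ft, \
         open(out_src, "w", encoding="utf-8") as fos, \
         open(out_tgt, "w", encoding="utf-8") as fot:
        for src, tgt in zip(fs, ft):
            if sentence_weight(src, vocab) >= MIN_WEIGHT:
                fos.write(src)
                fot.write(tgt)

# Placeholder file names for illustration only.
filter_corpus("train.src", "train.tgt", "train.filtered.src", "train.filtered.tgt")
```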
Any comments or advice?
I'm not sure the number of unknown words is the best criterion. A sentence containing several rare words can still be very interesting, while a sentence made only of common words may be of no real interest.
Perhaps I would test something like:
- randomly select a given number of sentences (2M, 5M, …?);
- quickly train a model on them;
- do a fast translation of your whole corpus with beam_size=1; this could be very time-consuming if your data set is really huge, so perhaps you will also have to restrict it to a randomly chosen subset;
- keep the best-translated sentences (2M, 5M, …?), as in the sketch after this list;
- train the final model on them.
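To make the "keep the best-translated sentences" step concrete, here is one possible way to rank the pairs; a minimal sketch under my own assumptions: it compares the fast beam_size=1 output against the reference target, using difflib only as a cheap stand-in for a proper metric such as chrF or sentence-level BLEU, and the file names and KEEP_N are placeholders.

```python
import difflib

KEEP_N = 2_000_000  # e.g. 2M, 5M, ...

def similarity(hyp, ref):
    """Cheap character-level similarity between hypothesis and reference."""
    return difflib.SequenceMatcher(None, hyp, ref).ratio()

def keep_best(src_path, ref_path, hyp_path, out_src, out_ref):
    """Rank pairs by how well the fast translation matches the reference,
    then keep the top KEEP_N pairs.  Loads everything in memory for simplicity."""
    with open(src_path, encoding="utf-8") as fs, \
         open(ref_path, encoding="utf-8") as fr, \
         open(hyp_path, encoding="utf-8") as fh:
        triples = list(zip(fs, fr, fh))
    scored = sorted(triples,
                    key=lambda t: similarity(t[2].strip(), t[1].strip()),
                    reverse=True)
    with open(out_src, "w", encoding="utf-8") as fos, \
         open(out_ref, "w", encoding="utf-8") as fot:
        for src, ref, _ in scored[:KEEP_N]:
            fos.write(src)
            fot.write(ref)

# Placeholder file names for illustration only.
keep_best("train.src", "train.tgt", "train.hyp",
          "train.best.src", "train.best.tgt")
```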
Just an idea… along the lines of this one:
I mean that we can remove sentences where all the words are unknown, since there is no point in using them in training.
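For that case the check is even simpler; a minimal sketch, assuming the same whitespace tokenisation and a vocabulary set built as above:

```python
def all_unknown(line, vocab):
    """True only if no token of the sentence is in the vocabulary."""
    tokens = line.split()
    return bool(tokens) and all(tok not in vocab for tok in tokens)

# Toy example; in practice vocab would come from the 100K histogram above.
vocab = {"the", "cat", "sat"}
print(all_unknown("the cat sat on xyzzy", vocab))  # False -> keep
print(all_unknown("xyzzy qwerty", vocab))          # True  -> drop
```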