Filter big monolingual corpus

valentinmace · June 4, 2019, 1:24pm

Hi

I have a big monolingual corpus (>350 GB) that I want to use for back-translation.

I need to filter it to reduce its size, because running inference with my NMT model on the whole file would take too long.

I first thought of removing lines shorter than some length. I’m also thinking of using fastText to detect the language of each line and remove bad sentences.

Could you suggest some techniques to filter my corpus and keep the lines most relevant for NMT?
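
For reference, the two filters mentioned above (a minimum length and fastText language identification) could be sketched roughly as follows. This is only an illustration, not a tested pipeline: the names `filter_lines` and `make_fasttext_detector`, the thresholds, and the model path are assumptions, and the language detector is passed in as a plain function so that the length filter can be tried without fastText installed.

```python
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical type: a detector maps a sentence to (language code, confidence).
LangDetector = Callable[[str], Tuple[str, float]]


def make_fasttext_detector(model_path: str) -> LangDetector:
    """Wrap a fastText language-ID model (for example the published lid.176.bin)."""
    import fasttext  # imported lazily so the rest of the sketch runs without it

    model = fasttext.load_model(model_path)

    def detect(sentence: str) -> Tuple[str, float]:
        # predict() does not accept newlines; labels look like "__label__en".
        labels, probs = model.predict(sentence.replace("\n", " "), k=1)
        return labels[0].replace("__label__", ""), float(probs[0])

    return detect


def filter_lines(
    lines: Iterable[str],
    detect: LangDetector,
    target_lang: str = "en",   # assumed target language
    min_tokens: int = 5,       # illustrative threshold, tune on a sample
    min_conf: float = 0.8,     # illustrative threshold, tune on a sample
) -> Iterator[str]:
    """Yield lines that are long enough and confidently in target_lang."""
    for line in lines:
        sentence = line.strip()
        # Cheap length check first, so the detector only sees plausible lines.
        if len(sentence.split()) < min_tokens:
            continue
        lang, conf = detect(sentence)
        if lang == target_lang and conf >= min_conf:
            yield sentence


# Possible usage (file names are placeholders). Iterating over the file object
# streams it line by line, so the whole corpus is never held in memory:
#
#   detect = make_fasttext_detector("lid.176.bin")
#   with open("mono.txt", encoding="utf-8") as src, \
#        open("mono.filtered.txt", "w", encoding="utf-8") as dst:
#       for sentence in filter_lines(src, detect, target_lang="en"):
#           dst.write(sentence + "\n")
```

Running the length check before language identification keeps the more expensive model call off the lines that would be dropped anyway.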

vince62s · June 4, 2019, 6:14pm

There are some papers on corpus filtering from WMT18.

park · June 4, 2019, 11:40pm

I suggest Corpora Cleaning Tools: tools for filtering and cleaning parallel and monolingual corpora in order to train better (neural) machine translation systems.

valentinmace · June 6, 2019, 9:02am

Thanks!

I’ll make use of these scripts.