Hi
I have a large monolingual corpus (>350 GB) that I want to use for back-translation.
I need to filter it to reduce its size, because running inference with my NMT model on the whole file would take too long.
My first idea was to remove lines shorter than some minimum length. I'm also thinking of using fastText to detect the language and remove bad sentences.
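To make this concrete, here is a rough sketch of what I have in mind. It is only an illustration: the thresholds (`min_tokens`, `max_tokens`, `min_conf`) are placeholder values I have not tuned, the helper names are my own, and the model path assumes fastText's pretrained language identification model (`lid.176.bin`) has been downloaded locally. The language predictor is passed in as a function so the filter itself can be tried without fastText installed.

```python
from typing import Callable, Iterable, Iterator, Tuple

# A predictor takes a sentence and returns (language_code, confidence).
Predictor = Callable[[str], Tuple[str, float]]


def make_fasttext_predictor(model_path: str) -> Predictor:
    """Wrap a fastText language ID model (assumed path, e.g. 'lid.176.bin')."""
    import fasttext  # third-party; imported here so the rest runs without it

    model = fasttext.load_model(model_path)

    def predict(text: str) -> Tuple[str, float]:
        # fastText's predict() rejects newlines, so flatten them first.
        labels, probs = model.predict(text.replace("\n", " "), k=1)
        return labels[0].replace("__label__", ""), float(probs[0])

    return predict


def keep_line(line: str, predict_lang: Predictor, lang: str = "en",
              min_tokens: int = 3, max_tokens: int = 100,
              min_conf: float = 0.8) -> bool:
    """Return True if the line passes the length and language checks."""
    text = line.strip()
    n_tokens = len(text.split())  # crude whitespace tokenization
    if n_tokens < min_tokens or n_tokens > max_tokens:
        return False
    label, conf = predict_lang(text)
    return label == lang and conf >= min_conf


def filter_lines(lines: Iterable[str], predict_lang: Predictor,
                 **kwargs) -> Iterator[str]:
    """Stream over the corpus so the whole file never has to fit in memory."""
    for line in lines:
        if keep_line(line, predict_lang, **kwargs):
            yield line.rstrip("\n")
```

On the real corpus I would call it roughly as `filter_lines(open("corpus.txt", encoding="utf-8"), make_fasttext_predictor("lid.176.bin"), lang="en")` and write the surviving lines to a new file (the file name here is just an example).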
Could you advise me on some techniques to filter my corpus and keep the most relevant lines for NMT?