Preprocess argument "-filter_valid" - what does it mean?

kargintima · November 29, 2019, 12:11pm

http://opennmt.net/OpenNMT-py/options/preprocess.html
Here we can see that the “-filter_valid” argument does “Filter validation data by src and/or tgt length”.
But what does that mean? Unfortunately, I didn’t find an answer or a more detailed description.
So, if I add the argument “-filter_valid 100”, will any lines from the validation set with length < 100 be ignored, or does it do something else?

jean.senellart · November 30, 2019, 1:41pm

Hello - this option is implemented here: https://github.com/OpenNMT/OpenNMT-py/blob/b98fb3d7cb96cd0677a95ca524b280351bf283c4/onmt/bin/preprocess.py#L176 - and it applies the same length filters that --src_seq_length and --tgt_seq_length define for the training data to the validation data as well.
It is a boolean option, so just pass --filter_valid without a value. Also, the filtering in this case removes sentences that are longer (not shorter) than the defined value.
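
To make the behaviour concrete, here is a minimal sketch of the idea in Python. It is not the actual OpenNMT-py code: the names keep_pair and filter_pairs are hypothetical, and treating a sentence whose length equals the limit as kept is an assumption.

```python
from typing import List, Tuple

# A sentence pair as (source tokens, target tokens).
Pair = Tuple[List[str], List[str]]


def keep_pair(src: List[str], tgt: List[str],
              src_seq_length: int, tgt_seq_length: int) -> bool:
    """Hypothetical helper: True if neither side is longer than its limit.

    Assumption: a sentence exactly at the limit is kept.
    """
    return len(src) <= src_seq_length and len(tgt) <= tgt_seq_length


def filter_pairs(pairs: List[Pair], src_seq_length: int,
                 tgt_seq_length: int, filter_enabled: bool) -> List[Pair]:
    """Hypothetical helper: drop over-long pairs only when filtering is on.

    filter_enabled plays the role of the boolean --filter_valid switch;
    the limits play the role of --src_seq_length and --tgt_seq_length.
    """
    if not filter_enabled:
        return list(pairs)
    return [(s, t) for s, t in pairs
            if keep_pair(s, t, src_seq_length, tgt_seq_length)]


# Toy validation set: the second pair has a 6-token source.
valid = [
    ("a b c".split(), "x y".split()),
    ("a b c d e f".split(), "x".split()),
]

kept = filter_pairs(valid, src_seq_length=5, tgt_seq_length=5,
                    filter_enabled=True)
untouched = filter_pairs(valid, src_seq_length=5, tgt_seq_length=5,
                         filter_enabled=False)
```

With the switch on, only the first pair survives because the second source sentence exceeds the limit of 5 tokens; with the switch off, the validation set is passed through unchanged.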