Here we can see that the “-filter_valid” argument does “Filter validation data by src and/or tgt length”.
But what does that mean? Unfortunately, I couldn't find an answer or a more detailed description.
So if I add the argument “-filter_valid 100”, will any lines from the validation set with length < 100 be ignored, or what?
Hello - this option, implemented here: https://github.com/OpenNMT/OpenNMT-py/blob/b98fb3d7cb96cd0677a95ca524b280351bf283c4/onmt/bin/preprocess.py#L176, applies the same filters defined for the train data with --tgt_seq_length to the validation data.
It is a boolean option, so just pass --filter_valid without a value. Also, the filtering in this case removes sentences longer (not shorter) than the defined value.
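To make the direction of the filter concrete, here is a minimal sketch of length-based filtering on sentence pairs. It is not the OpenNMT-py code: the names `keep_pair` and `filter_pairs`, the whitespace tokenization, and the inclusive boundary (a sentence exactly at the limit is kept) are all assumptions made for illustration.

```python
# Illustrative only: hypothetical helpers, not OpenNMT-py's implementation.

def keep_pair(src, tgt, max_src_len, max_tgt_len):
    """Return True if neither side exceeds its maximum length.

    Length is counted in whitespace-separated tokens (an assumption),
    and a sentence exactly at the limit is kept (also an assumption).
    """
    return len(src.split()) <= max_src_len and len(tgt.split()) <= max_tgt_len


def filter_pairs(pairs, max_src_len, max_tgt_len):
    """Drop every (src, tgt) pair where either side is too long."""
    return [(s, t) for s, t in pairs if keep_pair(s, t, max_src_len, max_tgt_len)]


valid_pairs = [
    ("a short sentence", "une phrase courte"),               # 3 and 3 tokens
    ("this source sentence is much too long", "trop long"),  # 7 and 2 tokens
]

# With a limit of 5 tokens per side, only the first pair survives.
kept = filter_pairs(valid_pairs, max_src_len=5, max_tgt_len=5)
print(kept)
```

So the limit is an upper bound: pairs that are too long are dropped and shorter ones are kept.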