Dear Fellow Researchers!
Greetings for the day!
I am trying to train a model that has a noise word that is intentionally kept in the training data.
I want to train the model so that this particular word, let's say “xxxx”, is ignored during training on the training dataset.
Is there any method to ignore a word during training but still produce a result if the noise word appears in a test sentence?
Thanks and regards
If the noise word is ignored in the target data, it will naturally be ignored by the model.
How can the noise word be ignored in the target data?
By not appearing in the target data. Maybe you can preprocess your data accordingly.
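As a rough illustration of that preprocessing, here is a minimal sketch. It assumes the target side is plain text with whitespace-separated tokens and that the noise word is the literal token “xxxx”; the function names are made up for this example and are not part of OpenNMT.

```python
def strip_noise_token(line, noise_token="xxxx"):
    """Drop every whitespace-separated token equal to noise_token."""
    return " ".join(tok for tok in line.split() if tok != noise_token)


def clean_target_lines(lines, noise_token="xxxx"):
    """Apply strip_noise_token to each target sentence."""
    return [strip_noise_token(line, noise_token) for line in lines]


# Hypothetical target sentences that still contain the noise word.
target = ["hello xxxx world", "xxxx good morning xxxx"]
cleaned = clean_target_lines(target)
```

Only the target file would be cleaned this way; the source file keeps the noise word so the model still sees it at training time.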
I saw some work by [Paul Michel](https://www.cs.cmu.edu/~pmichel1/index.html) on MTNT: Machine Translation of Noisy Text.
Does OpenNMT handle noisy data too?
The model only generates what is seen in the target data. If you have noise in the target, the model will generate noise.
If you want to handle noisy data, you should prepare your training data such that the source is noisy but the target is clean.
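One possible way to build such pairs is sketched below, assuming whitespace tokenization and a single noise token “xxxx” inserted at random positions in the source. The helper names, the insertion probability, and the choice to keep the original clean pair alongside the noisy one are all assumptions for illustration, not an OpenNMT feature.

```python
import random


def add_noise(sentence, noise_token="xxxx", prob=0.2, rng=None):
    """Insert noise_token before each token with probability prob."""
    if rng is None:
        rng = random.Random()
    out = []
    for tok in sentence.split():
        if rng.random() < prob:
            out.append(noise_token)
        out.append(tok)
    return " ".join(out)


def make_noisy_pairs(pairs, noise_token="xxxx", prob=0.2, seed=0):
    """For each (source, target) pair, keep it and add a noisy-source copy.

    The target is never modified, so the model only ever sees clean output.
    """
    rng = random.Random(seed)
    result = []
    for src, tgt in pairs:
        result.append((src, tgt))
        result.append((add_noise(src, noise_token, prob, rng), tgt))
    return result


# Hypothetical parallel data.
pairs = [("i want some water", "je veux de l'eau")]
augmented = make_noisy_pairs(pairs, prob=0.5)
```

The resulting source and target sides can then be written to separate files, one sentence per line, and used as ordinary parallel training data.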
I did this, but when I generated the output, it was the same for every input.
Can you make an example?
Perhaps you can replace the specific word with a synonym in the training data. I am using this approach to enrich my own dataset.
@Backware, actually, the noise is kind of important, which is why I cannot change it. I just want a method to ignore the noise word while translating.
Imagine, for example, a person who stammers trying to convey a message. In that case the noise will be very useful for predicting the translated output from that person's noisy input.
Have a further look at data augmentation:
You can train a model with a noisy source and a clean target, but the neural output is seldom perfect, even for “syntactic” structures. Automatic post-editing or a grammatical filter as a post-processing step promises higher-quality output.
Can you please make a concrete example with a “particular word”?