I am new to Machine Translation and am working on the Europarl dataset for English-French MT.
I applied Moses+BPE to preprocess the data, then trained the Transformer model as suggested in the OpenNMT documentation.
I am getting tokens such as &amp;apos;s or &amp;quot; in my output, which look like HTML or XML escape entities (please correct me if I am wrong). These were created by the Moses tokenizer.
Should I just remove them after translation using the detokenizer?
Or should I use an HTML parser to remove them before training?
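For what it's worth, those tokens are standard XML character entities that the Moses tokenizer emits by default (its `-no-escape` flag skips this step), so no HTML parser is needed; a minimal sketch of undoing them with the Python standard library, using a made-up output line:

```python
import html

# Hypothetical MT output containing Moses-style XML escape entities
hyp = "the chairman &apos;s report was &quot; excellent &quot;"

# html.unescape resolves &apos; &quot; &amp; etc. back to raw characters
clean = html.unescape(hyp)
print(clean)
# → the chairman 's report was " excellent "
```

The Moses detokenizer script also reverses this escaping by default, so if you detokenize the output with it, the entities should disappear on their own.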
I would suggest not using the Moses tokenizer.
See, for example, the OpenNMT Tokenizer, which can train a BPE model in about 4 lines of Python code:
@guillaumekln many thanks for the reply. Just one more question.
Any advice on vocab size when using SentencePiece? I am using the 34-million-sentence WMT-14 English-French dataset.
Can I also go beyond 32000?
What would be a suitable number?