OpenNMT Forum

How to remove unwanted tokens like &apos;s or &quot;

I am new to machine translation and am working on the Europarl dataset for English-French MT.

I preprocessed the data with Moses tokenization plus BPE, then trained the Transformer model as suggested in the OpenNMT documentation.

I am getting tokens (mentioned in the title) such as &apos;s or &quot;, which look like HTML or XML tags (please correct me if I am wrong). These were created by the Moses tokenizer.

Should I just remove them after translation using the detokenizer?

Or should I use an HTML parser to remove them before training?

Please help.

I would suggest not using the Moses tokenizer.
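If the data has already been processed with Moses, those tokens are not tags to strip out. Moses escapes apostrophes and double quotes as the entities `&apos;` and `&quot;`, so they stand for ordinary characters and can be converted back after translation. As far as I know, the Moses detokenizer reverses this escaping itself. Here is a minimal sketch using only Python's standard library; the function name `unescape_moses` is just an illustrative helper, not part of Moses or OpenNMT:

```python
import html


def unescape_moses(line: str) -> str:
    """Convert escaped entities such as &apos; and &quot; back to characters.

    html.unescape also handles &amp;, &lt;, &gt; and numeric references
    like &#124;, so no hand-written replacement table is needed.
    """
    return html.unescape(line)


tokenized = "the commission &apos;s proposal is &quot; balanced &quot; ."
print(unescape_moses(tokenized))
# prints: the commission 's proposal is " balanced " .
```

This only restores the characters. Reattaching `'s` to the previous word is still the detokenizer's job.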

See for example the OpenNMT Tokenizer which can train a BPE model in about 4 lines of Python code:

@guillaumekln many thanks for the reply. Just one more question.

Any advice on vocabulary size when using SentencePiece? I am using the 34M WMT-14 English-French dataset.

Can I also go beyond 32,000?
What would be a suitable number?

32,000 is usually fine.