How to remove unwanted words like &apos;s or &quot;

Rishi · July 12, 2020, 10:54pm

I am new to machine translation and am working on the Europarl dataset for English-French MT.

I preprocessed the data with Moses + BPE and then trained the Transformer model as suggested in the OpenNMT documentation.

I am getting the tokens mentioned in the title, `&apos;s` and `&quot;`, which look like HTML or XML tags (please correct me if I am wrong). These were created by the Moses tokenizer.

Should I just remove them after translation using the detokenizer?

Or should I use an HTML parser to remove them before training?

Please help.

guillaumekln · July 15, 2020, 8:12am

I would suggest not using the Moses tokenizer.
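For data that has already been tokenized with Moses, the `&apos;` and `&quot;` tokens are the escaped forms the Moses tokenizer writes for `'` and `"`, not leftover markup. A minimal sketch of mapping them back, assuming a generic HTML unescape from Python's standard library is acceptable (the helper name `unescape_moses` is hypothetical and is not part of Moses or OpenNMT):

```python
import html


def unescape_moses(text: str) -> str:
    """Map escapes such as &apos; and &quot; back to plain characters.

    Assumption: a generic HTML unescape is good enough here. It also
    decodes any other entity that happens to appear in the text.
    """
    return html.unescape(text)


tokenized = "it &apos;s a &quot; test &quot; , he said"
print(unescape_moses(tokenized))  # it 's a " test " , he said
```

Note that this only restores the characters. The text stays tokenized, so spacing around punctuation still has to be handled by a detokenizer.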

See for example the OpenNMT Tokenizer which can train a BPE model in about 4 lines of Python code:

github.com/OpenNMT/Tokenizer/blob/master/bindings/python/README.md

# Python bindings

```bash
pip install pyonmttok
```

## Tokenization

```python
import pyonmttok

tokenizer = pyonmttok.Tokenizer(
    mode: str,
    bpe_model_path: str = "",
    vocabulary_path: str = "",
    vocabulary_threshold: int = 0,
    sp_model_path: str = "",
    sp_nbest_size: int = 0,
    sp_alpha: float = 0.1,
    joiner: str = "￭",
    # remaining parameters are truncated in the original excerpt
)
```

Rishi · July 16, 2020, 6:15pm

@guillaumekln many thanks for the reply. Just one more question.

Any advice on the vocabulary size when using SentencePiece? I am using the 34M WMT14 English-French dataset.

Can I also go beyond 32,000?
What would be a suitable number?

guillaumekln · July 17, 2020, 7:49am

32,000 is usually fine.
