How to incorporate subwords in the configuration

Once the parallel data has been segmented with the subword model (GitHub - rsennrich/subword-nmt: Unsupervised Word Segmentation for Neural Machine Translation and Text Generation), how do I add it to the configuration file and build the vocabulary?

The next question is how to translate the test data. Should this test data also be translated as subwords, and how do I restore the original text after translation? (sed -r 's/(@@ )|(@@ ?$)//g')

Please confirm this BPE step.

Read the docs: How do I use Pretrained embeddings (e.g. GloVe)? — OpenNMT-py documentation

should this test data also be translated as subwords

If you’re using the translate script, you need to tokenize your source beforehand, with the same subword model you applied to the training data. If you’re using the server, you can set the tokenization config there and send untokenized text.
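
For the restore step, here is a minimal Python sketch that should behave like the sed expression quoted in the question. It assumes the output uses the default "@@" separator from subword-nmt; the function name restore_bpe is just an illustrative choice, not part of OpenNMT-py or subword-nmt.

```python
import re

# Same pattern as the sed expression: drop "@@ " between subword pieces,
# and a dangling "@@" (optionally followed by a space) at the end of a line.
# Assumes the default "@@" separator; adjust if a different one was used.
_BPE_PATTERN = re.compile(r"(@@ )|(@@ ?$)")


def restore_bpe(line: str) -> str:
    """Merge BPE pieces in one translated line back into whole words."""
    return _BPE_PATTERN.sub("", line.rstrip("\n"))


if __name__ == "__main__":
    import sys

    # Read segmented translations on stdin, write restored text to stdout.
    for raw in sys.stdin:
        print(restore_bpe(raw))
```

You would run this over the file written by the translate script, one sentence per line, the same way you would pipe it through sed.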

Browse the forum. Most of these topics are widely covered.
