OpenNMT

EOS in source segments

Hello,

I’ve been trying to append “</s>” to my source segments with this code, right after SentencePiece tokenization:

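with open(inputList[i], 'r') as rf, open(outputList[i], 'w') as wf:
    # Source-side files get an explicit ' </s>' appended; target-side files do not.
    if ('src' in inputList[i] and not inverteLanguage) or ('tgt' in inputList[i] and inverteLanguage):
        for line in rf:
            # rstrip('\n') rather than line[:-1], so a final line without a newline keeps its last character
            wf.write(' '.join(sourceSP.encode_as_pieces(line.rstrip('\n'))) + ' </s>' + '\n')
    else:
        for line in rf:
            wf.write(' '.join(targetSP.encode_as_pieces(line.rstrip('\n'))) + '\n')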

But when I start the training, I see this in the log:

2022-01-06 19:37:52.291000: I inputter.py:318]  - special tokens: BOS=no, EOS=no
2022-01-06 19:37:52.320000: I inputter.py:318] Initialized target input layer:
2022-01-06 19:37:52.320000: I inputter.py:318]  - vocabulary size: 8001
2022-01-06 19:37:52.320000: I inputter.py:318]  - special tokens: BOS=yes, EOS=yes

Note that the first line reports EOS=no.

Does this mean I applied “</s>” the wrong way?

If not, how can I make sure the MT model takes the tag into account?

Best regards,
Samuel

Hi,

In OpenNMT-tf there is an option to add </s> automatically:

data:
  source_sequence_controls:
    end: true

So you don’t need to add it manually when using this option.
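With that option enabled, the preprocessing script would only need to tokenize. Here is a minimal sketch of what that could look like. The helper name tokenize_lines and the stand-in encoder whitespace_pieces are made up for illustration; in the real script you would pass sourceSP.encode_as_pieces or targetSP.encode_as_pieces instead.

```python
def tokenize_lines(lines, encode_as_pieces):
    """Yield one space-joined line of pieces per input line.

    No ' </s>' is appended here: the assumption is that
    source_sequence_controls (end: true) adds it at training time.
    """
    for line in lines:
        # Strip only the trailing newline, not the last character blindly.
        yield ' '.join(encode_as_pieces(line.rstrip('\n')))


def whitespace_pieces(text):
    """Hypothetical stand-in for SentencePieceProcessor.encode_as_pieces,
    used only so this sketch runs without a trained model file."""
    return text.split()


if __name__ == '__main__':
    sample = ['hello world\n', 'last line without newline']
    for tokenized in tokenize_lines(sample, whitespace_pieces):
        print(tokenized)
```

After setting the option, I would expect the source "special tokens" line in the training log to report EOS=yes, which is a quick way to check that the tag is being applied.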
