End and Start Tokens

ymoslem · July 21, 2021, 11:13am

Hello!

It is common in NMT tutorials to say that we need to add start and end tokens, something like <s> and <\s>; however, I did not see a reference to this practice in any recent papers.

My question is, does it really make a difference?

Many thanks,
Yasmin

guillaumekln · July 21, 2021, 12:08pm

Hi,

On the target side, these tokens are added automatically by the NMT frameworks since they are required to make the NMT decoding work. The decoding should start from scratch (<s>) and we should know when the sentence is finished (</s>).

On the source side, they may or may not be added by the NMT frameworks. By default OpenNMT does not add them, but we recently found that adding an end token on the source side tends to help for short sentences.

ymoslem · July 21, 2021, 12:47pm

Got it. Thanks, Guillaume!

Does this mean if I prepare data for OpenNMT, I can only add these tokens to the source because adding them to the target would be redundancy (as they are already added to the target by default)?

guillaumekln · July 21, 2021, 12:59pm

Yes, it’s redundant to add them to the target.

ymoslem · August 28, 2021, 5:21pm

Just a comment for readers of this thread. SentencePiece gives IDs 1 and 2 to the tokens <s> and </s>. You can try this by running:

sp = spm.SentencePieceProcessor(sp_source_model)
sp.PieceToId('<s>')
sp.PieceToId('</s>')

Hence, tokens <s> and </s> should be added independently. So if I use SentencePiece to tokenize my source (before feeding the data to OpenNMT), I will have something like this:

# Generate a list of tokens from the source string
source = sp.encode_as_pieces(source.strip())
# Add '<s>' and '</s>' to the list of tokens
source = ['<s>'] + source + ['</s>']

On the other hand, OpenNMT’s transform, on-the-fly tokenization (correct me if I am wrong) does not give this special treatment to <s> and </s>, so you will end up with each character split into a token. To solve this, during training the SentencePiece model, this flag should be added to the training command --user_defined_symbols='<s>,</s>' which is not a recommended practice, but it will allow you to use <s> and </s> properly with OpenNMT’s transform, on-the-fly tokenization.

SamuelLacombe · November 5, 2021, 11:53pm

Hello Guillaume,
Quick question about the <s> and </s>.

If added to the target, will opennmt add them in double or will it recognize the presence of <s> and </s> and do nothing?

Also do we need to have a space following <s> and a space before </s>?

ymoslem · November 7, 2021, 8:32pm

If you use OpenNMT, add them to the source only. See here:

Kind regards,
Yasmin

ymoslem · November 8, 2021, 1:08pm

Hi Samuel! @SamuelLacombe

Here is the right way to add them:

ymoslem · November 26, 2021, 3:10pm

Hi Guillaume!

So all sentences will be like:
<s> this is a sentence </s>

My question is, if we use Tagged Back-Translation, should the <BT> be added after the start token or replacing it?

I assume I would have it like #1, but some might argue that it should be #2 or #3 to avoid having a different number of starting tags.

<s> <BT> this is a sentence </s>
<BT> this is a sentence </s>
<BT> this is a sentence </BT>

Thanks for your insights!

Kind regards,
Yasmin

guillaumekln · November 26, 2021, 3:21pm

Hi,

I would go for 1. because <s> and </s> are usually considered special tokens that cannot be moved or removed.

SamuelLacombe · November 26, 2021, 5:31pm

On the same line of thought, would there be a value to put a tag in between each words? When you use bpe the model lose the notion of words.

guillaumekln · November 29, 2021, 11:08am

It is not useful in my opinion and will just make the tokenized sentences longer.