On the target side, these tokens are added automatically by the NMT framework, since they are required for decoding to work: generation must start from <s> and stop once </s> is produced, so the model knows when the sentence is finished.
On the source side, they may or may not be added by the NMT framework. By default, OpenNMT does not add them, but we recently found that adding an end token on the source side tends to help with short sentences.
Hence, tokens <s> and </s> may need to be added yourself. So if I use SentencePiece to tokenize my source (before feeding the data to OpenNMT), I will have something like this:
import sentencepiece as spm

# Load the trained SentencePiece model (adjust the path)
sp = spm.SentencePieceProcessor(model_file="source.model")

# Generate a list of tokens from the source string
source = sp.encode_as_pieces(source.strip())

# Add '<s>' and '</s>' to the list of tokens
source = ['<s>'] + source + ['</s>']
On the other hand, OpenNMT’s on-the-fly tokenization transform (correct me if I am wrong) does not give this special treatment to <s> and </s>, so each of them ends up split into individual character tokens. To solve this, add the flag --user_defined_symbols='<s>,</s>' to the command when training the SentencePiece model. This is not a recommended practice, but it will allow you to use <s> and </s> properly with OpenNMT’s on-the-fly tokenization transform.