Joiner Annotate

Dear all,
I have a question about the joiner annotate option. I understand how useful it can be for detokenizing the target sentence after generation. However, should we also use it on the source sentences (with or without BPE)?

If yes, what is the rationale for using it: for instance, getting “con @@ gratu @@ lation @@ for your pri @@ ze @@ .” as source? Would it help the model understand that the tokens come from the same word, and thus trigger the correct translation of these BPE tokens?

Thanks in advance for your answer

Hello,

That’s a good question. Indeed, if you don’t require the tokenization to be reversible, there is no need to include such tokens.
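
To illustrate the reversibility point, here is a minimal sketch using the pyonmttok wrapper of the OpenNMT Tokenizer (the example sentence and the exact tokens shown are only illustrative):

```python
import pyonmttok

# With joiner_annotate=True, the tokenizer marks where tokens were split off
# from their neighbours (default marker "￭"), so the original spacing can be
# restored later.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

tokens, _ = tokenizer.tokenize("Congratulations for your prize.")
print(tokens)
# e.g. ['Congratulations', 'for', 'your', 'prize', '￭.']

# Detokenization uses the joiner marks to glue tokens back together,
# recovering the original sentence exactly.
print(tokenizer.detokenize(tokens))
# Congratulations for your prize.
```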

However, it’s common to apply the same processing on source and target for simplicity (you can then just reverse the training files to train the opposite direction), and it’s also common to have a shared vocabulary between source and target.

Also, I don’t think there is evidence that such tokens help the training in any way. I would suggest the opposite, since they increase either the vocabulary size or the sequence lengths. Maybe other people have experimented with that.
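
As a rough illustration of that trade-off (again assuming the pyonmttok wrapper; the outputs are indicative):

```python
import pyonmttok

sentence = "Congratulations for your prize."

# Joiner attached to a neighbouring token: the sequence length is unchanged,
# but "." and "￭." become two distinct vocabulary entries.
attached = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

# Joiner emitted as its own token (joiner_new=True): the vocabulary is not
# inflated, but every split adds one extra token to the sequence.
separate = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, joiner_new=True)

print(attached.tokenize(sentence)[0])
# e.g. ['Congratulations', 'for', 'your', 'prize', '￭.']
print(separate.tokenize(sentence)[0])
# e.g. ['Congratulations', 'for', 'your', 'prize', '￭', '.']
```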

Thanks a lot, that’s well understood!
I will test with & without and see how it goes :slight_smile: