Custom spacer in noisy translation

Hello.

I tried to apply some word-level noise in back-translation and ran into some issues, perhaps related to my custom BPE implementation. My config looks like this:

params:
  decoding_subword_token: · 
  decoding_noise:
    - dropout: 0.1
    - replacement: [0.1, <unk>]
    - permutation: 3

If I just remove the first line (decoding_subword_token), then everything seems to work, but the noise is applied at the subword level. However, I think it makes more sense to apply it at the word level, so that reordering noise, for example, does not generate gibberish out of subwords. My BPE implementation is custom, and I use this raised dot symbol to denote a space. Here is an example of my tokenised input:

if· you· ˿' re· in· there· ˿,· I· have· to· talk· to· you· ˿!

The · symbol denotes a word boundary (space) and ˿ denotes that the subword should be concatenated to the previous one without a space. The latter is probably irrelevant here. So I pass the spacer symbol in my config, and I also modified OpenNMT-tf to enforce is_spacer=True, which I verified by printing. I had to do this because the current implementation considers anything other than SentencePiece's spacer to be a joiner, which is not true in my case.
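To make the intended behaviour concrete, here is a rough plain-Python sketch of the word-level grouping I would expect with a suffix spacer (tokens_to_words is just my name for the helper here, not OpenNMT-tf code):

SPACER = "·"

def tokens_to_words(tokens):
    # A word ends at every token that *ends* with the spacer, instead of
    # starting at every token that begins with "▁" as in SentencePiece.
    words, current = [], []
    for token in tokens:
        current.append(token)
        if token.endswith(SPACER):
            words.append(current)
            current = []
    if current:  # the last word may have no trailing spacer
        words.append(current)
    return words

tokens = "if· you· ˿' re· in· there· ˿,· I· have· to· talk· to· you· ˿!".split()
print(tokens_to_words(tokens))
# [['if·'], ['you·'], ["˿'", 're·'], ['in·'], ['there·'], ['˿,·'],
#  ['I·'], ['have·'], ['to·'], ['talk·'], ['to·'], ['you·'], ['˿!']]

With this change I am now getting the following error at inference: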

Traceback (most recent call last):
  File "/home/estergiadis/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/home/estergiadis/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/estergiadis/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Need minval < maxval, got 0 >= -1
	 [[{{node transformer/map/while/cond/random_uniform}}]]

During handling of the above exception, another exception occurred:

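For what it's worth, the low-level error is easy to reproduce on its own. My guess (purely a guess about the internals) is that with suffix spacers no token starts with the spacer, so the word count comes out as zero and the noise ends up sampling from an empty range:

import tensorflow as tf  # TF 1.x, matching the traceback above

num_words = 0  # what the code might see if no token *starts* with the spacer
with tf.Session() as sess:
    # minval=0, maxval=num_words - 1 = -1: the same invalid range as above
    sess.run(tf.random.uniform([], minval=0, maxval=num_words - 1, dtype=tf.int32))
# tensorflow.python.framework.errors_impl.InvalidArgumentError:
#     Need minval < maxval, got 0 >= -1
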
Any ideas?

Hi,

Why are you using both spacers and joiners?

For this example, and considering only spacers, the issue is that they are used as a suffix, while the code expects SentencePiece-like spacers, i.e. spacers that prefix the tokens.

Maybe you could just configure the joiner as decoding_subword_token?
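
For example, just adapting the config from your first post (untested sketch):

params:
  decoding_subword_token: ˿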

This makes sense.

The reason we are using both a spacer and a joiner is to avoid trailing punctuation creating new subwords. For example, "Cat" and "Cat," should not result in different BPE segmentations (illustrated below). The same goes for the first word in a sentence, where SentencePiece would probably assign a different BPE to "I" given this data:

  • Hey,I like yoga
  • Hey, I like yoga

as only the second "I" is preceded by a space and therefore becomes ▁I.
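
To make the punctuation point concrete, here is what I would expect the segmentations to look like, following the same pattern as the tokenised example above:

  Cat   →  Cat·
  Cat,  →  Cat· ˿,·

so the subword for Cat itself stays identical in both cases.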

As a result, I doubt the proposed solution will work well, because my joiner doesn't always appear: it only does when there is punctuation. So, for example, we wouldn't join something like astonish ing·, because the only thing giving away the fact that astonish is not a full word is the absence of the spacer · after it. I hope this is somewhat clear.
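
To illustrate, using the same kind of toy grouping as in my sketch above (words_by_joiner is again just my own name for the helper):

JOINER = "˿"

def words_by_joiner(tokens):
    # Start a new word unless the token begins with the joiner.
    words = []
    for token in tokens:
        if token.startswith(JOINER) and words:
            words[-1].append(token)  # attach to the previous word
        else:
            words.append([token])
    return words

print(words_by_joiner(["astonish", "ing·"]))
# [['astonish'], ['ing·']] <- wrongly treated as two words: the split inside
# "astonishing" is signalled only by the *missing* spacer on "astonish",
# not by a joiner on "ing·".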