Inconsistent SentencePiece behaviour

tel34 · November 23, 2019, 11:11am

For a particular project I am working with OpenNMT-tf 1.25.0, with which I have trained a Transformer model (en-es) after applying SentencePiece (BPE). I have just found the following strange behaviour:
I take the following source sentence:
“I want to build a new system at the station where my friend designed the restaurant”
The command line onmt-main infer, with on-the-fly SentencePiece encoding and decoding, correctly gives:
“Quiero construir un nuevo sistema en la estación donde mi amigo diseñó el restaurante.”
The server provided by the nmtwizard/opennmt-tf Docker image, with the SP model specified in config.json, gives:
“Quiero construir un nuevo sistema en la estaci\u00f3n donde mi amigo dise\u00f1\u00f3 el restaurante.”
I can write a post-processing script to rectify this, but I would like to find out why it is happening. The same SentencePiece model has been applied in each case, and these characters are “correctly” rendered in the shared vocabulary. Any ideas are welcome.

guillaumekln · November 23, 2019, 11:38am

That’s how Unicode characters are escaped in JSON: \u00f3 is the escaped form of ó, and \u00f1 of ñ. If the client code is using a valid JSON parser, it should restore the Unicode characters automatically.

tel34 · November 23, 2019, 11:52am

Yes, of course. Thanks. Will check this in the Java client.

tel34 · November 24, 2019, 6:31pm

Yes, org.apache.commons.lang3.StringEscapeUtils does this in one line of code in the Java client.
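
For readers who want to see what that unescaping step amounts to, here is a minimal sketch using only the JDK. It is not the commons-lang3 implementation: the class name JsonUnicodeUnescape and the method unescapeUnicode are hypothetical, and it only handles the four-hex-digit Unicode escapes seen in the server output above. Other JSON escapes are copied through unchanged. A real client would normally let its JSON parser decode the whole response instead.

```java
// Illustrative sketch only: class and method names are hypothetical,
// and only four-hex-digit Unicode escapes are decoded.
final class JsonUnicodeUnescape {

    // True if the four characters starting at 'from' are all hex digits.
    private static boolean isHex4(String s, int from) {
        if (from + 4 > s.length()) {
            return false;
        }
        for (int i = from; i < from + 4; i++) {
            if (Character.digit(s.charAt(i), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    // Replaces each backslash-u escape with the UTF-16 code unit it denotes.
    // Any other backslash pair is copied verbatim, so an escaped backslash
    // followed by a literal 'u' is not misread as a Unicode escape.
    static String unescapeUnicode(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char next = s.charAt(i + 1);
                if (next == 'u' && isHex4(s, i + 2)) {
                    int codeUnit = Integer.parseInt(s.substring(i + 2, i + 6), 16);
                    out.append((char) codeUnit);
                    i += 6;
                } else {
                    out.append(c).append(next);
                    i += 2;
                }
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes make this the raw text returned by the server.
        String fromServer = "la estaci\\u00f3n donde mi amigo dise\\u00f1\\u00f3 el restaurante.";
        System.out.println(unescapeUnicode(fromServer));
    }
}
```

Because surrogate pairs are written in JSON as two consecutive escapes, appending each decoded code unit in order also reconstructs characters outside the Basic Multilingual Plane.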