For a particular project I am working with OpenNMT-tf 1.25.0, with which I have trained a Transformer model (en-es) after applying SentencePiece (BPE). I have just found the following strange behaviour:
I take the following source sentence:
“I want to build a new system at the station where my friend designed the restaurant”
Command-line onmt-main infer, with on-the-fly SentencePiece encoding and decoding, correctly gives:
“Quiero construir un nuevo sistema en la estación donde mi amigo diseñó el restaurante.”
The server provided by the nmtwizard/opennmt-tf Docker image, with the SP model specified in config.json, gives:
“Quiero construir un nuevo sistema en la estaci\u00f3n donde mi amigo dise\u00f1\u00f3 el restaurante.”
I can write a post-processing script to rectify this, but I would like to find out why it is happening. The same SentencePiece model was applied in each case, and these characters are rendered correctly in the shared vocabulary. Any ideas welcome.
That’s how Unicode characters are encoded in JSON. If the client code is using a valid JSON parser, it should restore the Unicode characters automatically.
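To illustrate, here is a minimal Python sketch of what a JSON parser does with those escapes. The response body and its "tgt" key are assumptions made up for this example, not the server's actual response format. The last lines also show that Python's json.dumps escapes non-ASCII characters by default, which is one plausible reason a Python-based server would emit \u00f3 in the first place.

```python
import json

# Hypothetical response body: the "tgt" key is an assumption for illustration,
# not the real schema returned by the server.
raw_body = r'{"tgt": "Quiero construir un nuevo sistema en la estaci\u00f3n donde mi amigo dise\u00f1\u00f3 el restaurante."}'


def extract_text(body: str, key: str = "tgt") -> str:
    """Parse a JSON document and return the string stored under `key`.

    json.loads turns every \\uXXXX escape back into the character it encodes.
    """
    return json.loads(body)[key]


decoded = extract_text(raw_body)
print(decoded)

# Serializing the same text again: the default (ensure_ascii=True) escapes
# non-ASCII characters, while ensure_ascii=False writes them as-is.
# Both forms are valid JSON and parse to the same string.
escaped = json.dumps({"tgt": decoded})
unescaped = json.dumps({"tgt": decoded}, ensure_ascii=False)
print(escaped)
print(unescaped)
```

So the two outputs in the question carry the same text; only the serialization differs, and no post-processing should be needed once the response goes through a JSON parser.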
Yes, of course. Thanks. Will check this in the Java client.
Yes, org.apache.commons.lang3.StringEscapeUtils does this in one line of code in the Java client.