Where can I find the sentencepiece_model of my model?

Hello, I’m using OpenNMT-tf.
I trained an English-Tetun model.

  1. I translate using the command from the docs (the file test.txt contains the text "My name is Duy"):

onmt-main infer --auto_config --config data.yml --features_file test.txt

Result:

Nia haʼu-nia naran mak Duy

=> This result is acceptable to me.
  2. I deployed this model using Docker and TensorFlow Serving, then started serving with this command:

tensorflow_model_server --port=9000 --model_name=ente --model_base_path=/serving/ente/latest &> ente_log &

Result:

Running gRPC ModelServer at 0.0.0.0:9000

  3. I’m using the Python code of ende_client.py from this folder:
    https://github.com/OpenNMT/OpenNMT-tf/tree/master/examples/serving
  4. I run ende_client.py with the main function modified as below:

    def main():
        parser = argparse.ArgumentParser(description="Translation client example")
        args = parser.parse_args()
        args.host = "172.17.0.2"
        args.model_name = "ente"
        args.port = 9000
        args.sentencepiece_model = "wmtende.model"
        args.timeout = 15.0
        channel = grpc.insecure_channel("%s:%d" % (args.host, args.port))
        stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
        tokenizer = pyonmttok.Tokenizer("none", sp_model_path=args.sentencepiece_model)
        batch_input = ["My name is Duy"]
        batch_output = translate(stub, args.model_name, batch_input, tokenizer,
                                 timeout=args.timeout)
        #print(batch_output)
        x = {"translated": batch_output[0]}
        xx = json.dumps(x)
        print(xx)

Result:

{"translated": "Rejistu:Stella,stellaciorra@hotmailCamposanoUS$3"}

=> The result is totally different from the command-line result above (step 1).
I did the same process with the ende500k model, and it returned a good result, because it has wmtende.model (I think).

Questions:
Q1. I tried to generate a "sentencepiece_model" for my model with https://github.com/google/sentencepiece (because I cannot find a "wmtende.model" in my exported model), using:

spm_train --input=src-train.txt --model_prefix=ente --vocab_size=10000 --character_coverage=1.0

-> So I got a model file named "ente.model", and I put it in the config in the main function above.
But it still returns the wrong result.
=> Where can I find the sentencepiece_model of my model?
Q2. Is there any way to run my model in Python code without using sentencepiece_model?

Thanks for any advice you can give me.

The example assumes that the data were prepared using SentencePiece. If you did not use SentencePiece, you should adapt the example code to apply your tokenization instead.
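
For example (a minimal sketch, assuming the model was trained on plain whitespace-tokenized text), the SentencePiece tokenizer in the client could be swapped for a pass-through tokenizer; `translate()` here is the helper defined in ende_client.py:

    import pyonmttok

    # Hypothetical adaptation of ende_client.py: the model is assumed to have
    # been trained on whitespace-tokenized text, so a plain "space" tokenizer
    # replaces the SentencePiece one.
    tokenizer = pyonmttok.Tokenizer("space")
    batch_input = ["My name is Duy"]
    batch_output = translate(stub, args.model_name, batch_input, tokenizer,
                             timeout=args.timeout)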


When I run SentencePiece from my home directory ~/ I find my model there as "my_model.model" (or whatever); yours should be "ente.model", since you specified "ente" as your prefix. That’s what you need to refer to for inference, and in your config.json file for serving, giving the full path to your model.


Yes, I didn’t use SentencePiece or anything else. I just arranged the data as one sentence per line and trained on it.
So can I skip using "sentencepiece_model"? Thank you!

Yes, I have "ente.model", and I put it in my config, but it still returns the wrong result.
Thank you!

If you post an e-mail address I’d be happy to send you my config.json file. But the TensorFlow model server expects to receive tokenized text, whether you use SentencePiece or another mode of tokenization. Personally I’ve found it easiest to use SentencePiece, but that means your training data and evaluation files all need to be processed with SentencePiece.
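
For illustration, here is a rough Python sketch of that preprocessing step (the sentencepiece package and the file names src-train.txt / ente are assumptions taken from this thread, not a definitive recipe):

    import sentencepiece as spm

    # Train a SentencePiece model (equivalent to the spm_train command above).
    spm.SentencePieceTrainer.Train(
        "--input=src-train.txt --model_prefix=ente "
        "--vocab_size=10000 --character_coverage=1.0")

    # Encode the training file so the model server receives tokenized text.
    sp = spm.SentencePieceProcessor()
    sp.Load("ente.model")
    with open("src-train.txt") as fin, open("src-train.sp", "w") as fout:
        for line in fin:
            fout.write(" ".join(sp.EncodeAsPieces(line.strip())) + "\n")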


My email is nguyenduy1324@gmail.com. I tried to tokenize all of my training data files with SentencePiece (https://github.com/google/sentencepiece). But it only has two types of encoding (piece and id):

% spm_encode --model=<model_file> --output_format=piece < input > output

Returns a text file where every word has a ▁ prefix (▁My ▁name ▁is ▁Duy) for each of my data files (e.g. src-train.txt, …)

% spm_encode --model=<model_file> --output_format=id < input > output

Returns a text file with ids replacing the words (1231 4353 3132 3333) for each of my data files (e.g. src-train.txt, …)

But the data files of the example model in the quick start are normal sentences (https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz). I have no idea how to tokenize with these two SentencePiece formats to get files like the "toy-ende" data.

I have mailed you a screenshot of my config file. Are you running spm_decode on your output? The Docker server model takes care of all that: I feed in raw sentences, they are spm-encoded, inference is done, and the output is spm-decoded back into raw sentences.
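
In other words, the round trip looks roughly like this (a sketch using the sentencepiece Python bindings; ente.model and the server call are placeholders, not the actual serving code):

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor()
    sp.Load("ente.model")  # assumed model path

    # spm-encode the raw sentence before sending it to the server.
    pieces = sp.EncodeAsPieces("My name is Duy")
    # ... send the pieces to the model server and collect the predicted pieces ...
    # spm-decode the server output back into a raw sentence:
    # text = sp.DecodePieces(predicted_pieces)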


Maybe there is some confusion here.

@duynx is running the example from https://github.com/OpenNMT/OpenNMT-tf/tree/master/examples/serving which hardcodes the SentencePiece tokenization. To use this client, the SentencePiece tokenization should be replaced by whatever tokenization was used to train the model.

@tel34 is referring to the nmt-wizard-docker integration which provides a wrapper around TensorFlow Serving and a configurable tokenization. See some instructions here:


Sorry for confusing things!


Finally, I found this link: https://github.com/OpenNMT/Tokenizer/tree/master/bindings/python
I realized that I don’t have to use the ente.model file; I can just skip it and use the basic options.
Now the results from the OpenNMT command line and from serving are equal.
Thanks to both of you for supporting me, @tel34 and @guillaumekln

Hi Duy. I’m curious: what’s your interest in the Tetun language? I developed a translator for Tetun a few years ago and am looking to upgrade to OpenNMT. Would you be interested in collaborating?

To come straight to the point and solve the problem: if you don’t want to use SentencePiece, just make this tweak to the code of your Python client file and it will run.

  def __init__(self, export_dir):
    imported = tf.saved_model.load(export_dir)
    self._translate_fn = imported.signatures["serving_default"]
    # No SentencePiece model is loaded from the export directory:
    #sp_model_path = os.path.join(export_dir, "assets.extra", "<>.model")
    #sp_model_path = os.path.join(export_dir, "saved_model.pb")
    # A rule-based pyonmttok tokenizer is used instead of SentencePiece:
    self._tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, segment_numbers=True)
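
For reference, this is roughly how that tokenizer behaves; the joiner annotations are what let detokenize() restore the original text afterwards:

    import pyonmttok

    tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, segment_numbers=True)
    tokens, _ = tokenizer.tokenize("My name is Duy.")
    # Punctuation is split into its own token and marked with the joiner
    # character, so the same tokenizer can reverse the process:
    print(tokenizer.detokenize(tokens))  # -> "My name is Duy."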