Not getting the desired translation, getting "⁇" as output

Greetings fellow researchers,

Recently I was working on building a model for a translation task. But somehow, after training, when I translate a source string I get ' ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ' as output.

Can anyone help me with this or point me in the right direction so that I can get the right output?
Below is my config file setup:

model_dir: /media/secondary-disk/data/models/TransformerTiny

data:
  source_tokenization:
    type: SentencePieceTokenizer
    params:
      model: /media/secondary-disk/data/ml_data/trg_data/spm_small.model
  target_tokenization:
    type: SentencePieceTokenizer
    params:
      model: /media/secondary-disk/data/ml_data/trg_data/spm_small.model
  train_features_file: /media/secondary-disk/data/ml_data/trg_data/train_src.txt
  train_labels_file: /media/secondary-disk/data/ml_data/trg_data/train_tgt.txt
  eval_features_file: /media/secondary-disk/data/ml_data/trg_data/val_src.txt
  eval_labels_file: /media/secondary-disk/data/ml_data/trg_data/val_tgt.txt
  source_vocabulary: /media/secondary-disk/data/ml_data/trg_data/spm_small.vocab
  target_vocabulary: /media/secondary-disk/data/ml_data/trg_data/spm_small.vocab

train:
  batch_size: 0
  batch_type: tokens
  save_checkpoints_steps: 5000
  keep_checkpoint_max: 3
  max_step: 1000000

params:
  optimizer: Adam
  optimizer_params:
    beta_1: 0.8
    beta_2: 0.998
  learning_rate: 1.0
  dropout: 0.3
  regularization:
    type: l2  # can be "l1", "l2", "l1_l2" (case-insensitive).
    scale: 1e-4  # if using "l1_l2" regularization, this should be a YAML list.
  decay_type: NoamDecay
  decay_params:
    model_dim: 512
    warmup_steps: 5000
  decay_step_duration: 1
  start_decay_steps: 50000
  minimum_learning_rate: 0.0001
  beam_width: 5
  minimum_decoding_length: 6
  maximum_decoding_length: 6
  share_embeddings: 3

eval:
  scorers: bleu
  steps: 5000
  early_stopping:
    metric: loss
    min_improvement: 0.001
    steps: 10
  export_on_best: bleu

infer:
  batch_size: 256
  batch_type: tokens
  n_best: 1
  with_scores: true

Below is the signature of the model that was trained.

2022-12-06 11:43:28.928112: I tensorflow/core/util/] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
The given SavedModel SignatureDef contains the following input(s):
  inputs['text'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: serving_default_text:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['log_probs'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 1)
      name: StatefulPartitionedCall_4:0
  outputs['text'] tensor_info:
      dtype: DT_STRING
      shape: (-1, 1)
      name: StatefulPartitionedCall_4:1
Method name is: tensorflow/serving/predict

And I am using the following code to do the serving:

import argparse
import os

import tensorflow as tf
import tensorflow_addons as tfa  # Register TensorFlow Addons kernels.

import pyonmttok

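class Translator(object):
    def __init__(self, export_dir, sp_model_path=None):
        imported = tf.saved_model.load(export_dir)
        self._translate_fn = imported.signatures["serving_default"]
        # Assumption: default to the SentencePiece model path used in the config.
        if sp_model_path is None:
            sp_model_path = "/media/secondary-disk/data/ml_data/trg_data/spm_small.model"
        self._tokenizer = pyonmttok.Tokenizer("none", sp_model_path=sp_model_path)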

    def translate(self, src):
        """Translates a batch of texts."""
        inputs = self._preprocess(src)
        outputs = self._translate_fn(**inputs)
        return self._postprocess(outputs)

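    def _preprocess(self, src):
        all_tokens_src = []

        for text_src in src:
            tokens_src, _ = self._tokenizer.tokenize(text_src)
            # Assumption: the signature takes one string per sentence, so the
            # tokens are joined back into a single space-separated string.
            all_tokens_src.append(" ".join(tokens_src))

        inputs = {"text": tf.constant(all_tokens_src, dtype=tf.string)}
        return inputs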

    def _postprocess(self, outputs):
        texts = []
        # outputs["text"] has shape (batch, n_best); keep the first hypothesis.
        for hypotheses in outputs["text"].numpy():
            texts.append(hypotheses[0].decode("utf-8"))
        return texts

After using this serving code, I am getting question marks as my output.

import tensorflow as tf
import tensorflow_text
translator = Translator("model_folder_path")
data = ["Source String entered here for translation "]
output = translator.translate(data)
target = output[0]
' ⁇  ⁇  ⁇  ⁇  ⁇  ⁇ '

Your input will be highly appreciated.
Best Regards,


For a translation task, these values look unexpected to me. Here are some recommendations:

  • Use the models Transformer or TransformerBig
  • Remove the params block from your configuration to rely on default values
  • Make sure to train with --auto_config

I looked into this further and found out that the "⁇" symbols are the default unknowns in SentencePieceTrainer.

When training the SentencePiece tokenizer, what is the --vocab_size parameter and what value should be passed to it? How can we calculate it so that we don't pick a number that is too small and later leads to unknowns in the model's output?

I have tried both Transformer and TransformerTiny; both give unknowns, i.e. "⁇", as output.
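
One way to narrow down where the unknowns come from is to compare the characters of an input sentence with the pieces listed in the vocabulary file. The following is a minimal sketch in plain Python, not part of SentencePiece or OpenNMT-tf. It assumes the .vocab file has one piece per line, optionally followed by a tab and a score, and the helper names load_pieces and uncovered_chars are made up for illustration.

```python
def load_pieces(lines):
    """Collect vocabulary pieces from lines of the form 'piece<TAB>score'."""
    pieces = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Keep only the first column; the score, if present, is ignored.
        pieces.add(line.split("\t", 1)[0])
    return pieces


def uncovered_chars(text, pieces):
    """Return the characters of `text` that appear in no vocabulary piece."""
    known = {" "}  # spaces are represented by the U+2581 marker, not by a piece
    for piece in pieces:
        # Skip special tokens such as <unk>, <s>, </s> (assumed naming).
        if piece.startswith("<") and piece.endswith(">"):
            continue
        known.update(piece)
    return sorted({ch for ch in text if ch not in known})


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = ["<unk>\t0", "▁he\t-1.5", "llo\t-2.0", "▁wor\t-2.5", "ld\t-3.0"]
pieces = load_pieces(toy_vocab)
print(uncovered_chars("hello world", pieces))
print(uncovered_chars("héllo wörld", pieces))
```

With a real file, the lines could come from open(path, encoding="utf-8"). Characters reported by uncovered_chars cannot be built from any piece, so they are likely candidates for the unknown token. An empty result does not rule out other causes.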


You can find instructions for SentencePiece and some recommended values for vocab_size in the google/sentencepiece repository on GitHub.
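
As a rough companion to those recommendations, the vocabulary has to be at least large enough to hold the characters needed to reach the requested character coverage, plus the special tokens. The sketch below estimates that lower bound in plain Python. It is not SentencePiece's own algorithm; the function name min_vocab_size, the default coverage of 0.9995 and the count of three special tokens are assumptions for illustration.

```python
from collections import Counter


def min_vocab_size(sentences, coverage=0.9995, num_special_tokens=3):
    """Estimate a lower bound on the vocabulary size for a corpus.

    Characters are added from most to least frequent until the requested
    share of character occurrences is covered. Whitespace is ignored here.
    """
    counts = Counter(
        ch for sentence in sentences for ch in sentence if not ch.isspace()
    )
    total = sum(counts.values())
    if total == 0:
        return num_special_tokens
    covered = 0
    kept = 0
    for _, count in counts.most_common():
        if covered / total >= coverage:
            break
        covered += count
        kept += 1
    return kept + num_special_tokens


# Hypothetical toy corpus, for illustration only.
corpus = ["aaaa", "aaab", "aabc"]
print(min_vocab_size(corpus, coverage=1.0))
print(min_vocab_size(corpus, coverage=0.75))
```

This only gives a floor: a usable vocab_size is normally much larger, since most of the vocabulary consists of multi-character pieces rather than single characters.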
