Not getting the desired translation, getting "⁇" as output

ishaansharma · December 6, 2022, 6:29am

Greetings fellow researchers,

I have been building a model for a translation task. After training, when I translate a source string, I get ' ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ' as output.

Can anyone help me with this, or point me in the right direction so that I can get the correct output?
Below is my config file setup:

model_dir: /media/secondary-disk/data/models/TransformerTiny

data:
  source_tokenization:
    type: SentencePieceTokenizer
    params:
      model: /media/secondary-disk/data/ml_data/trg_data/spm_small.model
  target_tokenization:
    type: SentencePieceTokenizer
    params:
      model: /media/secondary-disk/data/ml_data/trg_data/spm_small.model
  train_features_file: /media/secondary-disk/data/ml_data/trg_data/train_src.txt
  train_labels_file: /media/secondary-disk/data/ml_data/trg_data/train_tgt.txt
  eval_features_file: /media/secondary-disk/data/ml_data/trg_data/val_src.txt
  eval_labels_file: /media/secondary-disk/data/ml_data/trg_data/val_tgt.txt
  source_vocabulary: /media/secondary-disk/data/ml_data/trg_data/spm_small.vocab
  target_vocabulary: /media/secondary-disk/data/ml_data/trg_data/spm_small.vocab

train:
  batch_size: 0
  batch_type: tokens
  save_checkpoints_steps: 5000
  keep_checkpoint_max: 3
  max_step: 1000000

params:
  optimizer: Adam
  optimizer_params:
    beta_1: 0.8
    beta_2: 0.998
  learning_rate: 1.0
  dropout: 0.3
  regularization:
    type: l2  # can be "l1", "l2", "l1_l2" (case-insensitive).
    scale: 1e-4  # if using "l1_l2" regularization, this should be a YAML list.
  decay_type: NoamDecay
  decay_params:
    model_dim: 512
    warmup_steps: 5000
  decay_step_duration: 1
  start_decay_steps: 50000
  minimum_learning_rate: 0.0001
  beam_width: 5
  minimum_decoding_length: 6
  maximum_decoding_length: 6
  share_embeddings: 3

eval:
  scorers: bleu
  steps: 5000
  early_stopping:
    metric: loss
    min_improvement: 0.001
    steps: 10
  export_on_best: bleu

infer:
  batch_size: 256
  batch_type: tokens
  n_best: 1
  with_scores: true

Below is the signature of the trained model:

2022-12-06 11:43:28.928112: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
The given SavedModel SignatureDef contains the following input(s):
  inputs['text'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: serving_default_text:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['log_probs'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 1)
      name: StatefulPartitionedCall_4:0
  outputs['text'] tensor_info:
      dtype: DT_STRING
      shape: (-1, 1)
      name: StatefulPartitionedCall_4:1
Method name is: tensorflow/serving/predict
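
Read literally, this signature takes a 1-D batch of raw strings under the key text and returns one hypothesis per input (shape (-1, 1), which is consistent with n_best: 1). That suggests tokenization may already be part of the exported graph. A minimal sketch of a client written under that assumption is shown here; the helper names decode_hypotheses and call_text_signature are hypothetical, and importing tensorflow_text to register the in-graph tokenizer ops is an assumption based on the import used later in this post.

```python
def decode_hypotheses(rows):
    """Turn a (batch, n_best) array of UTF-8 byte strings into lists of str."""
    return [[hyp.decode("utf-8") for hyp in row] for row in rows]


def call_text_signature(export_dir, sentences):
    """Send raw, untokenized sentences to a SavedModel whose input key is 'text'."""
    # Imported lazily so decode_hypotheses stays usable without TensorFlow.
    import tensorflow as tf
    import tensorflow_text  # noqa: F401  (assumed to register in-graph tokenizer ops)

    imported = tf.saved_model.load(export_dir)
    translate_fn = imported.signatures["serving_default"]
    # Shape (batch,): one raw string per example, matching shape (-1) above.
    outputs = translate_fn(text=tf.constant(sentences, dtype=tf.string))
    texts = decode_hypotheses(outputs["text"].numpy())
    return texts, outputs["log_probs"].numpy()
```

If the exported model really does tokenize in-graph, the returned strings would already be detokenized text, so no external tokenizer would be involved in this path.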

I am using the following code for serving:

import tensorflow as tf
import tensorflow_addons as tfa  # Register TensorFlow Addons kernels.

import pyonmttok


class Translator:
    def __init__(self, export_dir):
        # Load the exported SavedModel and take its default serving signature.
        imported = tf.saved_model.load(export_dir)
        self._translate_fn = imported.signatures["serving_default"]
        # Same SentencePiece model as in the training configuration.
        sp_model_path = "/media/secondary-disk/data/ml_data/trg_data/spm_small.model"
        self._tokenizer = pyonmttok.Tokenizer("none", sp_model_path=sp_model_path)

    def translate(self, src):
        """Translates a batch of texts."""
        inputs = self._preprocess(src)
        outputs = self._translate_fn(**inputs)
        return self._postprocess(outputs)

    def _preprocess(self, src):
        # Tokenize each source text with the SentencePiece model.
        all_tokens_src = []
        for text_src in src:
            tokens_src, _ = self._tokenizer.tokenize(text_src)
            all_tokens_src.append(tokens_src)
        return {"text": tf.constant(all_tokens_src, dtype=tf.string)}

    def _postprocess(self, outputs):
        # outputs["text"] has shape (batch, n_best); detokenize each row.
        texts = []
        for row in outputs["text"].numpy():
            tokens = list(row)
            texts.append(self._tokenizer.detokenize(tokens))
        return texts

With this serving code, I get question marks as output:

import tensorflow as tf
import tensorflow_text
translator = Translator("model_folder_path")
data = ["Source String entered here for translation "]
output = translator.translate(data)
target = output[0]
target
' ⁇  ⁇  ⁇  ⁇  ⁇  ⁇ '

Your input will be highly appreciated.
Best Regards,

guillaumekln · December 6, 2022, 11:49am

Hi,

For a translation task, these values look unexpected to me. Here are some recommendations:

- Use the models Transformer or TransformerBig
- Remove the params block from your configuration to rely on default values
- Make sure to train with --auto_config

ishaansharma · December 6, 2022, 12:03pm

I looked into this further and found that "⁇" is the default symbol SentencePiece outputs for unknown tokens.
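
One way to see where the unknowns come from is to measure how many SentencePiece pieces of a sentence are missing from the vocabulary the model is given. The sketch below is a hedged diagnostic, not part of OpenNMT-tf: the helper names are hypothetical and the sample lines are made up. It relies on the fact that a SentencePiece .vocab file stores each piece together with a score, separated by a tab. Whether OpenNMT-tf accepts that layout directly is not established in this thread and is worth checking against the OpenNMT-tf vocabulary documentation.

```python
def read_vocab_lines(path):
    """Return the non-empty lines of a vocabulary file, without newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]


def lines_with_scores(lines):
    """Count lines that look like 'piece<TAB>score' (SentencePiece .vocab layout)."""
    return sum(1 for line in lines if "\t" in line)


def unknown_rate(pieces, vocab):
    """Fraction of pieces that are absent from the vocabulary set."""
    if not pieces:
        return 0.0
    return sum(1 for piece in pieces if piece not in vocab) / len(pieces)


# Made-up sample in the 'piece<TAB>score' layout.
sample_lines = ["<unk>\t0", "\u2581the\t-3.1", "\u2581cat\t-5.2"]
pieces = ["\u2581the", "\u2581cat", "\u2581sat"]

# If every whole line is treated as one token, nothing matches.
rate_whole_lines = unknown_rate(pieces, set(sample_lines))
# Keeping only the text before the tab recovers the pieces themselves.
rate_pieces_only = unknown_rate(pieces, {l.split("\t")[0] for l in sample_lines})
```

A rate close to 1.0 on real training sentences would point at a vocabulary mismatch rather than at a vocab_size that is too small.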

When training the SentencePiece tokenizer, what does the --vocab_size parameter mean, and what value should be passed to it? How can we calculate a suitable value, so that we don't choose a number so small that it later leads to unknowns in the model?

I have tried both Transformer and TransformerTiny, and both give unknowns, i.e. "⁇", as output.

Regards.

guillaumekln · December 6, 2022, 1:18pm

You can find some instructions about SentencePiece and some recommended values for vocab_size in the README of the google/sentencepiece repository on GitHub (Unsupervised text tokenizer for Neural Network-based text generation).
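
As a rough illustration of where vocab_size enters, the sketch below trains a SentencePiece model through the Python package's SentencePieceTrainer.train. The value 8000 is one of the example sizes given in the SentencePiece README (8000, 16000, or 32000) and is not a recommendation for this dataset. The helper names spm_training_args and train_tokenizer are hypothetical, as are any paths passed to them.

```python
def spm_training_args(corpus_path, model_prefix, vocab_size=8000,
                      character_coverage=1.0):
    """Build keyword arguments for SentencePieceTrainer.train."""
    if vocab_size <= 0:
        raise ValueError("vocab_size must be a positive integer")
    return {
        "input": corpus_path,            # one sentence per line
        "model_prefix": model_prefix,    # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,
        # The README suggests 1.0 for languages with a small character set;
        # characters left uncovered are mapped to the unknown token.
        "character_coverage": character_coverage,
    }


def train_tokenizer(corpus_path, model_prefix, **overrides):
    """Train a SentencePiece model; requires the sentencepiece package."""
    import sentencepiece as spm  # imported lazily; only needed for training

    args = spm_training_args(corpus_path, model_prefix, **overrides)
    spm.SentencePieceTrainer.train(**args)
```

For example, train_tokenizer("train_all.txt", "spm_small", vocab_size=16000) would write spm_small.model and spm_small.vocab, where train_all.txt stands for a hypothetical file containing both source and target training text.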