Data not being tokenized properly

I am creating a Seq2Seq Transformer model using the Python API of OpenNMT-tf. Here is the code:


import opennmt as onmt

model = onmt.models.Transformer(
    source_inputter=onmt.inputters.WordEmbedder(embedding_size=512),
    target_inputter=onmt.inputters.WordEmbedder(embedding_size=512),
    num_layers=6,
    num_units=512,
    num_heads=8,
    ffn_inner_dim=2048,
    dropout=0.1,
    attention_dropout=0.1,
    ffn_dropout=0.1)

def train_models(model_dir, src_vocab, tgt_vocab, src_data, tgt_data, src_val, tgt_val):
    config = {
        "model_dir": model_dir,
        "data": {
            "source_vocabulary": src_vocab,
            "target_vocabulary": tgt_vocab,
            "train_features_file": src_data,
            "train_labels_file": tgt_data,
            "eval_features_file": src_val,
            "eval_labels_file": tgt_val,
            "source_tokenization": {
                "type": "OpenNMTTokenizer",
                "params": {
                    "mode": "Space",
                },
            },
            "target_tokenization": {
                "type": "OpenNMTTokenizer",
                "params": {
                    "mode": "Space",
                },
            },
        },
        "params": {
            "optimizer": "Adam",
            "minimum_decoding_length": 2,
            "maximum_decoding_length": 2,
        },
        "train": {
            "save_checkpoints_steps": 1000,
            "keep_checkpoint_max": 1,
            #"save_summary_steps": 100,
            "max_step": 10000,
            #"effective_batch_size": 200,
            "maximum_features_length": 8,
            "maximum_labels_length": 2,
        },
        "eval": {
            "steps": 1000,
            "early_stopping": {
                "metric": "loss",
                "min_improvement": 0.01,
                "steps": 4,
            },
        },
        "infer": {
            "n_best": 1,
            "with_scores": False,
        },
    }

    runner = onmt.Runner(model, config, auto_config=True)
    runner.train(num_devices=2, with_eval=True)

The input is a fixed-length text line of 8 words and the output is a fixed-length line of 2 words. The files have the following format:

Source:
THIS IS MY FIRST EXAMPLE AND IT IS
MY NAME IS JOHN AND I AM A
WE MUST DO THE BEST WE CAN IN

Target:
VERY GOOD
GOOD MAN
THIS SITUATION

Each word is separated by a space and sentences are separated by newlines. I am using a "space" tokenizer with a maximum features length of 8 and a maximum labels length of 2.
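As a quick sanity check (this is just my own helper script, not an OpenNMT-tf call, and the file names are placeholders), I confirmed the number of space-separated tokens per line in the data files:

# My own check, not part of OpenNMT-tf: count space-separated tokens per line.
def max_tokens_per_line(path):
    with open(path, encoding="utf-8") as f:
        return max(len(line.split()) for line in f if line.strip())

print(max_tokens_per_line("src_train.txt"))  # expected: 8 (source lines have 8 words)
print(max_tokens_per_line("tgt_train.txt"))  # expected: 2 (target lines have 2 words)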
However, there seems to be a problem; I am getting the following error:

“RuntimeError: No training steps were executed. This usually means the training file is empty or all examples were filtered out.
For the latter, verify that the maximum_*_length values are consistent with your data.”

I suspect that the data is not being tokenized by space but by character, so all the examples are being filtered out because their character-based length is more than 8. If I increase maximum_features_length to 200, the error does not occur, but the vocabulary and results no longer meet my requirements.

How can I ensure that the data is tokenized by space? Do I need to change the config values, or can I change the data file?

Thanks in advance.

If your data is already tokenized, you don’t need to set any tokenization options in the configuration. Can you try without these options?
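For example (file names here are just placeholders), the data section would reduce to the vocabularies and the plain text files:

config["data"] = {
    "source_vocabulary": "src_vocab.txt",
    "target_vocabulary": "tgt_vocab.txt",
    "train_features_file": "src_train.txt",
    "train_labels_file": "tgt_train.txt",
    "eval_features_file": "src_val.txt",
    "eval_labels_file": "tgt_val.txt",
    # no source_tokenization / target_tokenization:
    # the lines are treated as already space-separated tokens
}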

Yes, I first tried without any tokenization config options and got the same error.

The value of maximum_labels_length is actually too restrictive because the target sequence starts with the special token <s>, so it should be set to at least 3.

In general you don’t need to set exact values for these options; they are just meant to filter out unusually long sequences that can cause out-of-memory issues.
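For instance (the exact numbers are just an illustration), the length filters in your train section could look like:

config["train"]["maximum_features_length"] = 8   # 8 source words
config["train"]["maximum_labels_length"] = 3     # 2 target words + the <s> start token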


Thanks. I increased the maximum_labels_length by 1. It is working properly now.