I am creating a Seq2Seq Transformer model using the Python API of OpenNMT-tf. Here is the code.
import opennmt as onmt

model = onmt.models.Transformer(
    source_inputter=onmt.inputters.WordEmbedder(embedding_size=512),
    target_inputter=onmt.inputters.WordEmbedder(embedding_size=512),
    num_layers=6,
    num_units=512,
    num_heads=8,
    ffn_inner_dim=2048,
    dropout=0.1,
    attention_dropout=0.1,
    ffn_dropout=0.1)

def train_models(model_dir, src_vocab, tgt_vocab, src_data, tgt_data, src_val, tgt_val):
    config = {
        "model_dir": model_dir,
        "data": {
            "source_vocabulary": src_vocab,
            "target_vocabulary": tgt_vocab,
            "train_features_file": src_data,
            "train_labels_file": tgt_data,
            "eval_features_file": src_val,
            "eval_labels_file": tgt_val,
            "source_tokenization": {
                "type": "OpenNMTTokenizer",
                "params": {
                    "mode": "Space",
                },
            },
            "target_tokenization": {
                "type": "OpenNMTTokenizer",
                "params": {
                    "mode": "Space",
                },
            },
        },
        "params": {
            "optimizer": "Adam",
            "minimum_decoding_length": 2,
            "maximum_decoding_length": 2,
        },
        "train": {
            "save_checkpoints_steps": 1000,
            "keep_checkpoint_max": 1,
            # "save_summary_steps": 100,
            "max_step": 10000,
            # "effective_batch_size": 200,
            "maximum_features_length": 8,
            "maximum_labels_length": 2,
        },
        "eval": {
            "steps": 1000,
            "early_stopping": {
                "metric": "loss",
                "min_improvement": 0.01,
                "steps": 4,
            },
        },
        "infer": {
            "n_best": 1,
            "with_scores": False,
        },
    }
    runner = onmt.Runner(model, config, auto_config=True)
    runner.train(num_devices=2, with_eval=True)
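For reference, I call the function along these lines (the file names are just placeholders for my actual vocabulary and data files):

train_models(
    model_dir="run",           # placeholder paths below
    src_vocab="src-vocab.txt",
    tgt_vocab="tgt-vocab.txt",
    src_data="src-train.txt",
    tgt_data="tgt-train.txt",
    src_val="src-val.txt",
    tgt_val="tgt-val.txt")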
The input is a fixed-length text line of 8 words and the output is a fixed-length line of 2 words. The files are in the following format:
Source:
THIS IS MY FIRST EXAMPLE AND IT IS
MY NAME IS JOHN AND I AM A
WE MUST DO THE BEST WE CAN IN
Target:
VERY GOOD
GOOD MAN
THIS SITUATION
Words are separated by spaces and sentences by newlines. I am using a “space” tokenizer with a maximum features length of 8 and a maximum labels length of 2.
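As a quick check on the data files themselves (file names below are placeholders for my actual files), I count the whitespace-separated tokens per line like this:

# Sanity-check that every line respects the configured
# maximum_features_length / maximum_labels_length when counted in words.
for path, limit in [("src-train.txt", 8), ("tgt-train.txt", 2)]:  # placeholder paths
    with open(path, encoding="utf-8") as f:
        lengths = [len(line.split()) for line in f if line.strip()]
    print(path, "max tokens per line:", max(lengths),
          "- lines over limit:", sum(1 for n in lengths if n > limit))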
However, there seems to be a problem. I am getting the following error.
“RuntimeError: No training steps were executed. This usually means the training file is empty or all examples were filtered out.
For the latter, verify that the maximum_*_length values are consistent with your data.”
I suspect that the data is not being tokenized by space but by character, so all the examples are being filtered out because their character-based length is more than 8. If I increase maximum_features_length to 200, the error does not occur, but the vocabulary and results no longer meet my requirements.
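To test my suspicion, I also tried tokenizing a sample line directly with the standalone pyonmttok package (a minimal sketch; I am assuming this mirrors what the OpenNMTTokenizer setting uses internally, and I use the lowercase "space" mode string listed in the pyonmttok documentation):

import pyonmttok

# Standalone check of the "space" tokenization mode.
tokenizer = pyonmttok.Tokenizer("space")
tokens, _ = tokenizer.tokenize("THIS IS MY FIRST EXAMPLE AND IT IS")
print(len(tokens), tokens)  # I expect 8 word-level tokens here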
How can I ensure that the data is tokenized by space? Do I need to change the config values, or can I change the data files?
Thanks in advance.