Error raised while using OpenNMT-tf to train a Chinese→English engine

tensorflow

#1

If I set num_gpus=1, everything works fine, but when I set num_gpus=3 (or any value >= 2), the following error is raised:

```
InvalidArgumentError (see above for traceback): logits and labels must have the same first dimension, got logits shape [832,203347] and labels shape [896]
```

Before training I tokenized the input text files, and the batch size was set to 64. I noticed that the first dimension of the logits was always one batch size smaller than that of the labels.

Why is that?
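The numbers in the message line up with that observation exactly; a quick check (plain arithmetic, nothing OpenNMT-tf specific):

```python
# Sanity arithmetic on the shapes reported in the error above.
batch_size = 64    # per-GPU batch size mentioned above
labels_dim = 896   # first dimension of the labels tensor
logits_dim = 832   # first dimension of the logits tensor

# Both first dimensions are exact multiples of the batch size ...
assert labels_dim % batch_size == 0   # 896 = 14 * 64
assert logits_dim % batch_size == 0   # 832 = 13 * 64

# ... and they differ by exactly one batch, matching the observation
# that the logits are always one batch size short of the labels.
missing_batches = (labels_dim - logits_dim) // batch_size
print(missing_batches)  # prints 1
```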


(Guillaume Klein) #2

Can you specify the model and run configurations you are using?


#3

Yes, the model code is listed below:

```python
import opennmt as onmt
import tensorflow as tf

def model():
  return onmt.models.SequenceToSequence(
      source_inputter=onmt.inputters.WordEmbedder(
          vocabulary_file_key="source_words_vocabulary",
          embedding_size=512),
      target_inputter=onmt.inputters.WordEmbedder(
          vocabulary_file_key="target_words_vocabulary",
          embedding_size=512),
      encoder=onmt.encoders.BidirectionalRNNEncoder(
          num_layers=4,
          num_units=512,
          reducer=onmt.utils.ConcatReducer(),
          cell_class=tf.contrib.rnn.LSTMCell,
          dropout=0.1,
          residual_connections=False),
      decoder=onmt.decoders.AttentionalRNNDecoder(
          num_layers=4,
          num_units=512,
          bridge=onmt.utils.CopyBridge(),
          attention_mechanism_class=tf.contrib.seq2seq.LuongAttention,
          cell_class=tf.contrib.rnn.LSTMCell,
          dropout=0.1,
          residual_connections=False))
```
and the config file is:

```yaml
model_dir: work_dir

data:
  train_features_file: data/chdata/cn_token.txt
  train_labels_file: data/chdata/en.txt
  eval_features_file: data/chdata/test.txt
  eval_labels_file: data/chdata/test1.txt
  source_words_vocabulary: data/chdata/src-vocab.txt
  target_words_vocabulary: data/chdata/tgt-vocab.txt

params:
  optimizer: AdamOptimizer
  learning_rate: 0.002
  param_init: 0.1
  clip_gradients: 5.0
  average_loss_in_time: false
  decay_type: exponential_decay
  decay_rate: 0.9
  decay_steps: 10000
  staircase: true
  start_decay_steps: 30000
  minimum_learning_rate: 0.00001
  scheduled_sampling_type: linear
  scheduled_sampling_read_probability: 1
  scheduled_sampling_k: 0
  label_smoothing: 0
  beam_width: 5
  length_penalty: 0.2
  maximum_iterations: 200

train:
  batch_size: 128
  save_checkpoints_steps: 500
  keep_checkpoint_max: 3
  save_summary_steps: 100
  train_steps: 1000000
  eval_delay: 720000
  save_eval_predictions: false
  external_evaluators: BLEU
  maximum_features_length: 70
  maximum_labels_length: 70
  num_buckets: 5
  num_parallel_process_calls: 4
  prefetch_buffer_size: 100000
  sample_buffer_size: 1000000

infer:
  batch_size: 10
  num_parallel_process_calls: 1
  prefetch_buffer_size: 1000
  n_best: 1
```

In fact, this error occurred after the following messages:

```
INFO:tensorflow:Saving checkpoints for 1 into ./work_dir\model.ckpt.
INFO:tensorflow:step = 1, loss = 91.267555
```

which means the model had been initialized, I think.

Thank you!


#4

In addition, I used this same model to train a German→English engine and it ran with no problem.


#5

If every line of the feature file and the label file contains only one word (that is, every sequence has length 1), it works fine.


(Guillaume Klein) #6

Would it be possible to share your training data?


#7

Yes, but how can I send it to you?


(Guillaume Klein) #8

Google Drive or Dropbox. Whatever service works for you.


#9

Thanks very much!

I've sent you a message.