Training OpenNMT-tf with multiple sources on TensorFlow 2.0

Lee · March 17, 2020, 7:14am

Hello Fellow Researchers,

I am trying to run OpenNMT-tf with multiple sources.
My YAML config file is as follows:

model_dir: zh_yue_transformer

data:
  source_tokenization: space.yml
  target_tokenization: space.yml
  train_features_file: 
    - data/train.zh
    - data/train.zh_f
  train_labels_file: data/train.yue
  eval_features_file: 
    - data/vaild.zh
    - data/vaild.zh_f
  eval_labels_file: data/vaild.yue
  source_1_vocabulary: data/zh.vocab
  source_2_vocabulary: data/zh_f.vocab
  target_vocabulary: data/yue.vocab

train:
  save_checkpoints_step: 1000

eval:
  external_evaluators: BLEU
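
A multi-source setup like the one above supplies one features file and one source_N_vocabulary entry per source. As a minimal sketch of a sanity check before launching training, the snippet below compares those counts. The helper name check_multi_source_config is hypothetical and is not part of OpenNMT-tf; the dictionary simply mirrors the data section of the YAML above.

```python
import re


def check_multi_source_config(data):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    # Collect keys of the form source_<N>_vocabulary.
    vocab_keys = [k for k in data if re.fullmatch(r"source_\d+_vocabulary", k)]
    for key in ("train_features_file", "eval_features_file"):
        files = data.get(key)
        if files is None:
            continue  # key not set, nothing to compare
        if not isinstance(files, list):
            problems.append(f"{key} should be a list with one file per source")
        elif len(files) != len(vocab_keys):
            problems.append(
                f"{key} lists {len(files)} files but "
                f"{len(vocab_keys)} source vocabularies are defined"
            )
    return problems


# Mirrors the data section of the config in this post.
data = {
    "train_features_file": ["data/train.zh", "data/train.zh_f"],
    "train_labels_file": "data/train.yue",
    "eval_features_file": ["data/vaild.zh", "data/vaild.zh_f"],
    "eval_labels_file": "data/vaild.yue",
    "source_1_vocabulary": "data/zh.vocab",
    "source_2_vocabulary": "data/zh_f.vocab",
    "target_vocabulary": "data/yue.vocab",
}

print(check_multi_source_config(data))
```

For the config in this post the check finds nothing to report, which suggests the data section itself is consistent.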

Running the command below produces the following output, which ends with an error. Can anyone tell me what I am doing wrong and how to train a model with multiple source features?

onmt-main --model multi_source_nmt.py --config nmt.yml --auto_config train --num_gpus 1

INFO:tensorflow:Creating model directory zh_yue_transformer
INFO:tensorflow:Using parameters:
data:
  eval_features_file:
  - data/vaild.zh
  - data/vaild.zh_f
  eval_labels_file: data/vaild.yue
  source_1_vocabulary: data/zh.vocab
  source_2_vocabulary: data/zh_f.vocab
  source_tokenization: space.yml
  target_tokenization: space.yml
  target_vocabulary: data/yue.vocab
  train_features_file:
  - data/train.zh
  - data/train.zh_f
  train_labels_file: data/train.yue
eval:
  batch_size: 32
  external_evaluators: BLEU
infer:
  batch_size: 32
  length_bucket_width: 5
model_dir: zh_yue_transformer
params:
  average_loss_in_time: true
  beam_width: 4
  decay_params:
    model_dim: 512
    warmup_steps: 8000
  decay_type: NoamDecay
  label_smoothing: 0.1
  learning_rate: 2.0
  num_hypotheses: 1
  optimizer: LazyAdam
  optimizer_params:
    beta_1: 0.9
    beta_2: 0.998
score:
  batch_size: 64
train:
  average_last_checkpoints: 8
  batch_size: 3072
  batch_type: tokens
  effective_batch_size: 25000
  keep_checkpoint_max: 8
  length_bucket_width: 1
  max_step: 500000
  maximum_features_length:
  - 100
  - 100
  maximum_labels_length: 100
  sample_buffer_size: -1
  save_checkpoints_step: 1000
  save_summary_steps: 100

2020-03-17 06:22:14.146976: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-03-17 06:22:14.288324: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:3d:00.0
2020-03-17 06:22:14.288962: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-03-17 06:22:14.292157: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-03-17 06:22:14.296020: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-03-17 06:22:14.296643: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-03-17 06:22:14.299689: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-03-17 06:22:14.301906: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-03-17 06:22:14.308034: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-17 06:22:14.324893: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-03-17 06:22:14.325436: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-03-17 06:22:14.338858: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2100000000 Hz
2020-03-17 06:22:14.342544: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x52219b0 executing computations on platform Host. Devices:
2020-03-17 06:22:14.342580: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version
2020-03-17 06:22:15.022707: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5279400 executing computations on platform CUDA. Devices:
2020-03-17 06:22:15.022751: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Tesla V100-PCIE-16GB, Compute Capability 7.0
2020-03-17 06:22:15.026849: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:3d:00.0
2020-03-17 06:22:15.026963: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-03-17 06:22:15.026980: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-03-17 06:22:15.026994: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-03-17 06:22:15.027008: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-03-17 06:22:15.027022: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-03-17 06:22:15.027036: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-03-17 06:22:15.027051: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-17 06:22:15.036168: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-03-17 06:22:15.036253: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-03-17 06:22:15.060902: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-17 06:22:15.060997: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-03-17 06:22:15.061011: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-03-17 06:22:15.077117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14926 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:3d:00.0, compute capability: 7.0)
WARNING:tensorflow:No checkpoint to restore in zh_yue_transformer
INFO:tensorflow:Training on 2364910 examples
INFO:tensorflow:Accumulate gradients of 9 iterations to reach effective batch size of 25000
2020-03-17 06:22:37.154883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:3d:00.0
2020-03-17 06:22:37.155073: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-03-17 06:22:37.155102: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-03-17 06:22:37.155130: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-03-17 06:22:37.155178: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-03-17 06:22:37.155200: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-03-17 06:22:37.155229: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-03-17 06:22:37.155265: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-17 06:22:37.162789: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-03-17 06:22:37.162855: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-17 06:22:37.162868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-03-17 06:22:37.162878: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-03-17 06:22:37.171687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/device:GPU:0 with 14926 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:3d:00.0, compute capability: 7.0)
2020-03-17 06:22:37.246945: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_array_ops.py", line 8961, in shape
    name, _ctx._post_execution_callbacks, input, "out_type", out_type)
tensorflow.python.eager.core._FallbackException: This function does not handle the case of the path where all inputs are not already EagerTensors.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/onmt-main", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/opennmt/bin/main.py", line 189, in main
    checkpoint_path=args.checkpoint_path)
  File "/usr/local/lib/python3.6/dist-packages/opennmt/runner.py", line 198, in train
    mixed_precision=self._mixed_precision)
  File "/usr/local/lib/python3.6/dist-packages/opennmt/training.py", line 44, in __init__
    self._model.create_variables()
  File "/usr/local/lib/python3.6/dist-packages/opennmt/models/model.py", line 286, in create_variables
    _ = self(features, labels=labels, training=True, step=0)
  File "/usr/local/lib/python3.6/dist-packages/opennmt/models/model.py", line 95, in __call__
    return super(Model, self).__call__(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 891, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/opennmt/models/sequence_to_sequence.py", line 163, in call
    training=training)
  File "/usr/local/lib/python3.6/dist-packages/opennmt/models/sequence_to_sequence.py", line 197, in _decode_target
    initial_state=encoder_state)
  File "/usr/local/lib/python3.6/dist-packages/opennmt/decoders/decoder.py", line 178, in initial_state
    batch_size = tf.shape(sentinel)[0]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/array_ops.py", line 432, in shape_v2
    return shape(input, name, out_type)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/array_ops.py", line 458, in shape
    return shape_internal(input, name, optimize=True, out_type=out_type)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/array_ops.py", line 486, in shape_internal
    return gen_array_ops.shape(input, name=name, out_type=out_type)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_array_ops.py", line 8966, in shape
    input, out_type=out_type, name=name, ctx=_ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_array_ops.py", line 9005, in shape_eager_fallback
    _attr_T, (input,) = _execute.args_to_matching_eager([input], _ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/execute.py", line 257, in args_to_matching_eager
    t, dtype, preferred_dtype=default_dtype, ctx=ctx))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1296, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/constant_op.py", line 286, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/constant_op.py", line 227, in constant
    allow_broadcast=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/constant_op.py", line 235, in _constant_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/constant_op.py", line 96, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
ValueError: Attempt to convert a value (None) with an unsupported type (<class 'NoneType'>) to a Tensor.
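
The final ValueError says that a None value reached tf.shape. To see which OpenNMT-tf call passed it, it helps to find the deepest frame that still belongs to the opennmt package. The snippet below is a small sketch of that reading strategy; the helper name last_frame_in_package is hypothetical, and the excerpt is three frames copied from the traceback above, with plain double quotes as Python prints them.

```python
import re

# Matches the 'File "...", line N, in func' header of a traceback frame.
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')


def last_frame_in_package(traceback_text, package):
    """Return (path, line, func) of the deepest frame located in `package`, or None."""
    last = None
    for match in FRAME_RE.finditer(traceback_text):
        if f"/{package}/" in match.group("path"):
            last = (match.group("path"), int(match.group("line")), match.group("func"))
    return last


excerpt = '''
  File "/usr/local/lib/python3.6/dist-packages/opennmt/models/sequence_to_sequence.py", line 197, in _decode_target
    initial_state=encoder_state)
  File "/usr/local/lib/python3.6/dist-packages/opennmt/decoders/decoder.py", line 178, in initial_state
    batch_size = tf.shape(sentinel)[0]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/array_ops.py", line 432, in shape_v2
    return shape(input, name, out_type)
'''

print(last_frame_in_package(excerpt, "opennmt"))
```

Here the deepest OpenNMT-tf frame is initial_state in decoders/decoder.py, line 178, where tf.shape(sentinel) is called, so sentinel is the value that was None.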

The CUDA version on my server is currently 10.0, and it cannot be upgraded to 10.1 because I am not allowed to upgrade the drivers, so TensorFlow 2.1 cannot be installed. OpenNMT-tf 2.8.0 requires TensorFlow 2.1, so I have to use OpenNMT-tf 2.4.0, which depends on TensorFlow 2.0.

Thanks & Regards

Lee · March 17, 2020, 7:21am

The code of multi_source_nmt.py is as follows:

import opennmt as onmt
  
from opennmt.utils import misc


class DualSourceTransformer(onmt.models.Transformer):

  def __init__(self):
    super(DualSourceTransformer, self).__init__(
      source_inputter=onmt.inputters.ParallelInputter([
          onmt.inputters.WordEmbedder(embedding_size=512),
          onmt.inputters.WordEmbedder(embedding_size=512)]),
      target_inputter=onmt.inputters.WordEmbedder(embedding_size=512),
      num_layers=6,
      num_units=512,
      num_heads=8,
      ffn_inner_dim=2048,
      dropout=0.1,
      attention_dropout=0.1,
      ffn_dropout=0.1,
      share_encoders=True)

  def auto_config(self, num_replicas=1):
    config = super(DualSourceTransformer, self).auto_config(num_replicas=num_replicas)
    max_length = config["train"]["maximum_features_length"]
    return misc.merge_dict(config, {
        "train": {
            "maximum_features_length": [max_length, max_length]
        }
    })


model = DualSourceTransformer
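
To illustrate what the auto_config override above does, here is a minimal sketch in plain Python. The merge_dict function below is a stand-in written for this example; it assumes that opennmt.utils.misc.merge_dict merges nested dictionaries recursively, which is an assumption about the library rather than its actual code. The base values of 100 are taken from the logged parameters in the first post.

```python
import copy


def merge_dict(base, override):
    """Recursively merge `override` into a copy of `base` (stand-in helper)."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_dict(merged[key], value)
        else:
            merged[key] = value
    return merged


# Assumed excerpt of the automatic configuration before the override.
base_config = {
    "train": {
        "maximum_features_length": 100,
        "maximum_labels_length": 100,
    }
}

# Same override as in auto_config: one maximum length per source.
max_length = base_config["train"]["maximum_features_length"]
config = merge_dict(base_config, {
    "train": {
        "maximum_features_length": [max_length, max_length]
    }
})

print(config["train"])
# prints {'maximum_features_length': [100, 100], 'maximum_labels_length': 100}
```

The resulting list of two values matches the maximum_features_length entry shown in the logged parameters, so the override appears to be applied as intended.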

guillaumekln · March 17, 2020, 8:26am

Hi,

This was already reported in:

It was fixed in OpenNMT-tf 2.8.0.

Lee · March 18, 2020, 8:25am

Thanks a lot!