OpenNMT-py translate.py throws AssertionError with a case feature input file

miguelknals · April 13, 2019, 1:17pm

Hi

I am running some tests in OpenNMT-py. I have a set of src/tgt files with case feature/joiner/aggressive that I was able to process in OpenNMT. I can run preprocess and train without any error.

The problem arises when I try to translate: I get an assertion error. Running without the case feature works fine.

I have seen several posts in the forum, but they either did not lead me to a solution or were left unanswered.

Reviewing the stack, it looks like some torch file is the culprit. The system is pretty new, with OpenNMT-py installed just a couple of weeks ago on Ubuntu 18.10 and, as far as I recall, PyTorch build Stable 1.0, Python 3.7.1 and CUDA 10.0 (if I run “conda install pytorch torchvision cudatoolkit=10.0 -c pytorch” I get “# All requested packages already installed.”)

nvidia-smi reports 418.56, CUDA version 10.1. I have two 8 GB GTX 1070 cards.

python replies with "Python 3.7.1 (default, Dec 14 2018, 19:28:38) "

As I am not a Linux expert, take this with care!

I have seen an openmt/opennmt-py image on Docker, but it looks really old. I am not sure whether I can use it with the nvidia-docker setup that I currently use for the Docker OpenNMT image (by the way, thanks for providing it!)

So I run:

python /home/laika/OpenNMT-py/preprocess.py -train_src src.atokCJA     -train_tgt tgt.atokCJA \
                                     -valid_src src_tunning.atokCJA -valid_tgt tgt_tunning.atokCJA \
                                     -save_data CAES_CJA.data
[2019-04-13 14:06:33,121 INFO] Extracting features...
[2019-04-13 14:06:33,121 INFO]  * number of source features: 1.
[2019-04-13 14:06:33,121 INFO]  * number of target features: 1.
[2019-04-13 14:06:33,121 INFO] Building `Fields` object...
[2019-04-13 14:06:33,121 INFO] Building & saving training data...
[2019-04-13 14:06:33,122 INFO] Reading source and target files: src.atokCJA tgt.atokCJA.
[2019-04-13 14:06:33,173 INFO] Building shard 0.
[2019-04-13 14:06:38,813 INFO]  * saving 0th train data shard to CAES_CJA.data.train.0.pt.
[2019-04-13 14:06:43,956 INFO] Building & saving validation data...
[2019-04-13 14:06:43,956 INFO] Reading source and target files: src_tunning.atokCJA tgt_tunning.atokCJA.
[2019-04-13 14:06:43,958 INFO] Building shard 0.
[2019-04-13 14:06:44,196 INFO]  * saving 0th valid data shard to CAES_CJA.data.valid.0.pt.
[2019-04-13 14:06:44,556 INFO] Building & saving vocabulary...
[2019-04-13 14:06:45,748 INFO]  * reloading CAES_CJA.data.train.0.pt.
[2019-04-13 14:06:47,330 INFO]  * tgt vocab size: 40442.
[2019-04-13 14:06:47,330 INFO]  * tgt_feat_0 vocab size: 9.
[2019-04-13 14:06:47,387 INFO]  * src vocab size: 36771.
[2019-04-13 14:06:47,387 INFO]  * src_feat_0 vocab size: 7.

Then (export CUDA_VISIBLE_DEVICES=0):

python /home/laika/OpenNMT-py/train.py -data CAES_CJA.data -save_model CAES_CJA.data.model -world_size 1 -gpu_ranks 0
[2019-04-13 14:10:26,868 INFO]  * src vocab size = 36771
[2019-04-13 14:10:26,868 INFO]  * src_feat_0 vocab size = 7
[2019-04-13 14:10:26,868 INFO]  * tgt vocab size = 40442
[2019-04-13 14:10:26,868 INFO]  * tgt_feat_0 vocab size = 9
[2019-04-13 14:10:26,868 INFO] Building model...
[2019-04-13 14:10:29,683 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(36771, 500, padding_idx=1)
          (1): Embedding(7, 3, padding_idx=1)
        )
      )
    )
    (rnn): LSTM(503, 500, num_layers=2, dropout=0.3)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(40442, 500, padding_idx=1)
          (1): Embedding(9, 4, padding_idx=1)
        )
      )
    )
    (dropout): Dropout(p=0.3)
    (rnn): StackedLSTM(
      (dropout): Dropout(p=0.3)
      (layers): ModuleList(
        (0): LSTMCell(1004, 500)
        (1): LSTMCell(500, 500)
      )
    )
    (attn): GlobalAttention(
      (linear_in): Linear(in_features=500, out_features=500, bias=False)
      (linear_out): Linear(in_features=1000, out_features=500, bias=False)
    )
  )
  (generator): Sequential(
    (0): Linear(in_features=500, out_features=40442, bias=True)
    (1): Cast()
    (2): LogSoftmax()
  )
)
[2019-04-13 14:10:29,684 INFO] encoder: 22399521
[2019-04-13 14:10:29,684 INFO] decoder: 46248478
[2019-04-13 14:10:29,684 INFO] * number of parameters: 68647999
[2019-04-13 14:10:29,685 INFO] Starting training on GPU: [0]
[2019-04-13 14:10:29,685 INFO] Start training loop and validate every 10000 steps...
[2019-04-13 14:10:30,991 INFO] Loading dataset from CAES_CJA.data.train.0.pt, number of examples: 85656
[2019-04-13 14:10:37,727 INFO] Step 50/100000; acc:   4.56; ppl: 560408.95; xent: 13.24; lr: 1.00000; 9100/8965 tok/s;      8 sec
.
.
.

Then:

python /home/laika/OpenNMT-py/translate.py -model CAES_CJA.data.model_step_10000.pt -src src_verify.atokCJA -output output.txt -replace_unk -verbose -gpu 0
[2019-04-13 14:46:46,807 INFO] Translating shard 0.
Traceback (most recent call last):
  File "/home/laika/OpenNMT-py/translate.py", line 48, in <module>
    main(opt)
  File "/home/laika/OpenNMT-py/translate.py", line 32, in main
    attn_debug=opt.attn_debug
  File "/home/laika/OpenNMT-py/onmt/translate/translator.py", line 322, in translate
    batch, data.src_vocabs, attn_debug
  File "/home/laika/OpenNMT-py/onmt/translate/translator.py", line 511, in translate_batch
    return_attention=attn_debug or self.replace_unk)
  File "/home/laika/OpenNMT-py/onmt/translate/translator.py", line 658, in _translate_batch
    batch_offset=beam._batch_offset)
  File "/home/laika/OpenNMT-py/onmt/translate/translator.py", line 549, in _decode_and_generate
    decoder_in, memory_bank, memory_lengths=memory_lengths, step=step
  File "/home/laika/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/laika/OpenNMT-py/onmt/decoders/decoder.py", line 212, in forward
    tgt, memory_bank, memory_lengths=memory_lengths)
  File "/home/laika/OpenNMT-py/onmt/decoders/decoder.py", line 374, in _run_forward_pass
    emb = self.embeddings(tgt)
  File "/home/laika/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/laika/OpenNMT-py/onmt/modules/embeddings.py", line 245, in forward
    source = self.make_embedding(source)
  File "/home/laika/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/laika/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/laika/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/laika/OpenNMT-py/onmt/modules/util_class.py", line 25, in forward
    assert len(self) == len(inputs_)
AssertionError

I will really appreciate any help!
Have a nice day!
Miguel

eduamf · April 29, 2019, 3:27am

class Elementwise(nn.ModuleList):
    """
    A simple network container.
    Parameters are a list of modules.
    Inputs are a 3d Tensor whose last dimension is the same length
    as the list.
    Outputs are the result of applying modules to inputs elementwise.
    An optional merge parameter allows the outputs to be reduced to a
    single Tensor.
    """

    def __init__(self, merge=None, *args):
        assert merge in [None, 'first', 'concat', 'sum', 'mlp']
        self.merge = merge
        super(Elementwise, self).__init__(*args)

    def forward(self, inputs):
        inputs_ = [feat.squeeze(2) for feat in inputs.split(1, dim=2)]
...

To everyone who thinks it is a good idea to use features: either use OpenNMT-lua or abandon the idea.

While debugging, I saw that the error occurs at the end of the sentence. The last token arrives with length = 1.

eduamf · April 29, 2019, 4:08am

The error remains even when sending a target without features. It happens at the end of the sentence, when a one-dimensional vector of 2s is returned. I think the number 2 represents <eos>. In fact, <eos> doesn’t have any feature.
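
To make the mismatch described above concrete, here is a minimal sketch in plain Python (no torch) of the length check that fails in `Elementwise.forward`. It is only an illustration under assumptions: the helper names `split_features` and `check_elementwise` and the feature index 4 are hypothetical, and nested lists stand in for the 3d tensor. The decoder embeddings in the model printout hold two lookup tables (word and case feature), so an input whose last dimension is 1 trips the assertion.

```python
def split_features(step_input):
    """Split a [len][batch][nfeat] nested list into one [len][batch] list per feature.

    This mimics splitting a 3d tensor along its last dimension.
    """
    nfeat = len(step_input[0][0])
    return [
        [[token[k] for token in row] for row in step_input]
        for k in range(nfeat)
    ]


def check_elementwise(num_lookup_tables, step_input):
    """Apply the same kind of check as `assert len(self) == len(inputs_)`."""
    columns = split_features(step_input)
    assert num_lookup_tables == len(columns), (
        "expected %d feature columns, got %d" % (num_lookup_tables, len(columns))
    )
    return columns


# Input with a word index and one feature index per token (4 is a made-up value).
with_feature = [[[2, 4], [2, 4]]]
# Input as described in the thread: only the word index 2, no feature column.
without_feature = [[[2], [2]]]

ok = check_elementwise(2, with_feature)

try:
    check_elementwise(2, without_feature)
    failed = False
except AssertionError:
    failed = True
```

With two lookup tables, the first call passes and the second raises `AssertionError`, which is consistent with translation working once the case feature is removed.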