OpenNMT Forum

OpenNMT-py error when training with large amount of data

I’m normally able to train models fine, but when trying to train a Polish model with 55,743,802 lines of data I get this error, which I can’t make much sense of:

[2021-03-17 03:43:22,204 WARNING] Corpus corpus_1's weight should be given. We default it to 1 for you.
[2021-03-17 03:43:22,204 INFO] Parsed 2 corpora from -data.
[2021-03-17 03:43:22,204 INFO] Get special vocabs from Transforms: {'src': set(), 'tgt': set()}.
[2021-03-17 03:43:22,204 INFO] Loading vocab from text file...
[2021-03-17 03:43:22,204 INFO] Loading src vocabulary from opennmt_data/openmt.vocab
[2021-03-17 03:43:22,309 INFO] Loaded src vocab has 37600 tokens.
Traceback (most recent call last):
  File "/home/argosopentech/.local/bin/onmt_train", line 11, in <module>
    load_entry_point('OpenNMT-py', 'console_scripts', 'onmt_train')()
  File "/home/argosopentech/OpenNMT-py/onmt/bin/train.py", line 169, in main
    train(opt)
  File "/home/argosopentech/OpenNMT-py/onmt/bin/train.py", line 103, in train
    checkpoint, fields, transforms_cls = _init_train(opt)
  File "/home/argosopentech/OpenNMT-py/onmt/bin/train.py", line 80, in _init_train
    fields, transforms_cls = prepare_fields_transforms(opt)
  File "/home/argosopentech/OpenNMT-py/onmt/bin/train.py", line 33, in prepare_fields_transforms
    fields = build_dynamic_fields(
  File "/home/argosopentech/OpenNMT-py/onmt/inputters/fields.py", line 32, in build_dynamic_fields
    _src_vocab, _src_vocab_size = _load_vocab(
  File "/home/argosopentech/OpenNMT-py/onmt/inputters/inputter.py", line 309, in _load_vocab
    for token, count in vocab:
ValueError: not enough values to unpack (expected 2, got 1)

This has nothing to do with the amount of data. The error seems to indicate that a line in your vocab file is not valid.
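
For reference, the failing line in the traceback (for token, count in vocab:) unpacks each vocab entry into a token and a frequency, so any line that doesn’t split into exactly two fields will raise that ValueError. This is not OpenNMT-py’s actual loading code, just a minimal sketch of the same unpacking pattern:

# Not the real loader, just the same unpacking pattern as the traceback.
lines = ["▁Dia\t-8.0681", "▁who"]  # second entry is missing its frequency
for line in lines:
    try:
        token, count = line.split()
        print("ok:", token, count)
    except ValueError as err:
        print("bad line:", repr(line), "->", err)
# bad line: '▁who' -> not enough values to unpack (expected 2, got 1)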

Thanks, that does seem to be the issue, but I’m not sure what the root cause is. My sentencepiece.vocab file looks fine, though I can’t easily check every line:

▁Dia    -8.0681
▁who    -8.06953
▁high   -8.07295
ra      -8.07622
ka      -8.08309
▁He     -8.08323
▁New    -8.08698
▁So     -8.08781
ru      -8.08827
▁18     -8.09131
Y       -8.09186
▁#      -8.09324
▁hari   -8.09674
▁9      -8.09719
el      -8.10969
ro      -8.11083

Hi,

The error message indicates that one or more of your lines are missing either the token or the frequency. It should be fairly easy to find these lines with grep or a similar tool.
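
For example, a short Python check along these lines should surface any malformed entries. This is just a sketch; point it at your actual vocab file (e.g. sentencepiece.vocab or opennmt_data/openmt.vocab):

# Print any vocab lines that do not contain exactly two
# whitespace-separated fields (token and frequency).
with open("sentencepiece.vocab", encoding="utf-8") as f:  # adjust the path
    for lineno, line in enumerate(f, start=1):
        if len(line.split()) != 2:
            print(lineno, repr(line))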