I’m normally able to train models fine, but when trying to train a Polish model with 55,743,802 lines of data I get this error, which I can’t make much sense of:
[2021-03-17 03:43:22,204 WARNING] Corpus corpus_1's weight should be given. We default it to 1 for you.
[2021-03-17 03:43:22,204 INFO] Parsed 2 corpora from -data.
[2021-03-17 03:43:22,204 INFO] Get special vocabs from Transforms: {'src': set(), 'tgt': set()}.
[2021-03-17 03:43:22,204 INFO] Loading vocab from text file...
[2021-03-17 03:43:22,204 INFO] Loading src vocabulary from opennmt_data/openmt.vocab
[2021-03-17 03:43:22,309 INFO] Loaded src vocab has 37600 tokens.
Traceback (most recent call last):
File "/home/argosopentech/.local/bin/onmt_train", line 11, in <module>
load_entry_point('OpenNMT-py', 'console_scripts', 'onmt_train')()
File "/home/argosopentech/OpenNMT-py/onmt/bin/train.py", line 169, in main
train(opt)
File "/home/argosopentech/OpenNMT-py/onmt/bin/train.py", line 103, in train
checkpoint, fields, transforms_cls = _init_train(opt)
File "/home/argosopentech/OpenNMT-py/onmt/bin/train.py", line 80, in _init_train
fields, transforms_cls = prepare_fields_transforms(opt)
File "/home/argosopentech/OpenNMT-py/onmt/bin/train.py", line 33, in prepare_fields_transforms
fields = build_dynamic_fields(
File "/home/argosopentech/OpenNMT-py/onmt/inputters/fields.py", line 32, in build_dynamic_fields
_src_vocab, _src_vocab_size = _load_vocab(
File "/home/argosopentech/OpenNMT-py/onmt/inputters/inputter.py", line 309, in _load_vocab
for token, count in vocab:
ValueError: not enough values to unpack (expected 2, got 1)
Thanks, that does seem to be the issue, but I’m not sure what the root cause is. My sentencepiece.vocab file seems fine, even though I can’t easily check every line:
▁Dia -8.0681
▁who -8.06953
▁high -8.07295
ra -8.07622
ka -8.08309
▁He -8.08323
▁New -8.08698
▁So -8.08781
ru -8.08827
▁18 -8.09131
Y -8.09186
▁# -8.09324
▁hari -8.09674
▁9 -8.09719
el -8.10969
ro -8.11083
The error message indicates that one or more of your lines are missing either the token or the frequency. It should be fairly easy to find these lines with grep or a similar tool.
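For anyone curious where exactly this blows up: the traceback ends in “for token, count in vocab:”, which unpacks every vocab entry into exactly two fields. Assuming OpenNMT-py splits each line on whitespace (the same assumption the checking script below makes), a vocab line with a missing token reproduces the error:

good = '▁hello\t-8.1234'      # hypothetical well-formed entry
bad = '\t-17.1576'            # token field missing, only the score is left
token, count = good.split()   # fine: two whitespace-separated fields
token, count = bad.split()    # ValueError: not enough values to unpack (expected 2, got 1)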
For anyone with the same problem: @panosk is right, it’s a bad line in the sentencepiece.vocab file. It seems to only happen on large datasets, so I’m guessing this is a SentencePiece issue, but I’m still trying to figure it out.
Here’s the script for finding bad lines:
VOCAB_FILE = 'sentencepiece.vocab'
lines = open(VOCAB_FILE).readlines()
for i, line in enumerate(lines):
print(line)
split = line.split()
assert(len(split) == 2)
assert(split[1][-1].isdigit())
print(f'Checked line {i}/{len(lines)}')
And the output:
Checked line 31877/32000
𐍅 -17.1575
Checked line 31878/32000
⟨ -17.1576
Checked line 31879/32000
-17.1576
Traceback (most recent call last):
File "search_for_bad_data.py", line 6, in <module>
assert(len(split) == 2)
AssertionError
Normally lines are <character><tab><number>; this line (near the end of the file) is just <tab><number>.
Actually, looking at this again, I think the token may be <carriage return (ASCII 13)><new line (ASCII 10)>. It may be that you need enough data for these two characters to get tokenized together as a single piece. When that happens, though, maybe OpenNMT reads them as separate lines?
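One way to check whether the bad token really is a carriage return (rather than a genuinely empty field) is to read the vocab file in binary, so Python’s universal-newline handling doesn’t quietly split the ‘\r’ onto its own line. A minimal sketch, with the file name as a placeholder:

VOCAB_FILE = 'sentencepiece.vocab'
with open(VOCAB_FILE, 'rb') as f:
    for i, raw in enumerate(f, start=1):
        fields = raw.rstrip(b'\n').split(b'\t')
        # flag entries with a missing token, or a token that is pure whitespace/control characters (e.g. b'\r')
        if len(fields) != 2 or not fields[0].strip():
            print(i, repr(raw))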
Hello @argosopentech, I am trying to use OpenNMT to train a multilingual NMT model. I also use SentencePiece to get the vocab file, but I always get an OOM error when using large data, like 90 million sentences. Have you ever run into this issue before?
In addition to the other replies you received from colleagues, please note that currently the default value of SentencePiece’s --input_sentence_size is 0, i.e. the whole corpus. If you make it something like 10000000, these 10 million sentences will be sampled from the corpus, and they are enough for creating a good SentencePiece model.
There are also a few notes here:
The default SentencePiece value for --vocab_size is 8000. You can go for a higher value, somewhere between 30000 and 50000, and up to 100000 for a big corpus. Still, note that smaller values will encourage the model to make more splits on words, which might be better in the case of a multilingual model if the languages share the alphabet.
After you segment your source and target files with the generated SentencePiece models, you must build vocab using OpenNMT-py to generate vocab files compatible with it.
When you start training with OpenNMT-py, you must set src_vocab_size and tgt_vocab_size exactly as you set the --vocab_size for SentencePiece. The default is 50000, which is usually good.
If you do not set it, it will take the default of 50000, which will still work. However, it is better to stay consistent with what you set in SentencePiece.
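To tie these flags together, here is a minimal sketch of training SentencePiece through its Python API with the options discussed in this thread (file names and sizes are only examples, adjust them to your corpus):

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='train.txt',              # raw training text, one sentence per line
    model_prefix='sentencepiece',   # writes sentencepiece.model and sentencepiece.vocab
    vocab_size=32000,               # keep src_vocab_size / tgt_vocab_size in OpenNMT-py consistent with this
    input_sentence_size=10000000,   # sample 10M sentences instead of loading the whole corpus
    shuffle_input_sentence=True,    # sample across the whole file, not just its beginning
)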
Thank you for your reply. I set --input_sentence_size to 70000000, as my training data is a multilingual corpus of about 1.3 billion sentences. I think more data is needed to learn a good vocab, but there is always an OOM error. However, there are a lot of multilingual research papers using SentencePiece to learn a joint vocab, so I think it should be possible to handle a large corpus. I am still looking for a way to solve this OOM error.
You will need hundreds of GB of RAM to train a model on that much data; like I said, EC2 instances are very affordable. Make sure you also set --shuffle_input_sentence=true to get a representative sample of your data.
The point of adding swap space is that you can use disk space as overflow RAM on Linux. It’s slower, but I find it works well for training large SentencePiece models.
import sentencepiece as spm

sourceSP = spm.SentencePieceProcessor(model_file=sourceVocabPath + '.model')  # sourceVocabPath: my SentencePiece model prefix
print([[sourceSP.id_to_piece(id), id] for id in range(sourceSP.get_piece_size())])
I could clearly see the culprit: ...7974], ['\r', 7975], ['b', 7976], ['.', 7977], ['v', 7978], ['k', 7979], ['’', 7980], ['j', 7981], ['“', 7982], ['”', 7983], ...
['\r', 7975] = carriage return.
The vocab file generated by SentencePiece confirms it: there is a line jump at the exact spot where that character would be in the file.
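If you don’t want to eyeball the full piece list, the same API can filter directly for suspicious pieces (the model file name is a placeholder):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='sentencepiece.model')
for piece_id in range(sp.get_piece_size()):
    piece = sp.id_to_piece(piece_id)
    if any(ord(ch) < 32 for ch in piece):  # control characters such as '\r' or '\t'
        print(piece_id, repr(piece))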
Well, it turned out that removing ‘\r’ in my preprocessing didn’t solve the issue.
After some investigation, it seems that when my training files get created, every line ends with ‘\r\n’ (since I’m on Windows; on Unix it would have been ‘\n’).
Since SentencePiece consumes those files directly and splits on ‘\n’, every ‘\r’ is interpreted as a single character and creates a line jump in the resulting vocab file. Most people don’t hit this issue because they use the default value for normalization_rule_name in SentencePiece, which removes all the ‘\r’ characters. In my case, I need to set normalization_rule_name=identity since I want my pretokenization to stay as is.
Now, I’m playing around with the parameters of the function I’m using to generate my txt files, so that they only generate ‘\n’. I will keep you posted if I find the solution…
I finally succeeded, but the solution is not the most elegant I’ve seen.
I’m personally using pandas DataFrame.to_csv to generate my training/testing/validation files. The way to force ‘\n’ and prevent Windows from adding ‘\r’ is to do what is mentioned here:
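For reference (this is a sketch of the usual approach, not necessarily word for word what that link describes): pass an explicit line terminator to to_csv. The parameter is called line_terminator in older pandas and lineterminator in pandas >= 1.5.

import pandas as pd

# placeholder data; force '\n' endings even on Windows so SentencePiece never sees a stray '\r'
df = pd.DataFrame({'text': ['hello world', 'bonjour le monde']})
df.to_csv('train.txt', index=False, header=False, encoding='utf-8', line_terminator='\n')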