Traceback AssertionError while training in Vast.ai

pocakka · September 8, 2022, 12:25pm

Hello! I am trying to run a small training package on Vast.ai, but I always get this error after uploading the .argosdata package:

Traceback (most recent call last):
  File "/home/argosopentech/env/bin/argos-train", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/home/argosopentech/argos-train/bin/argos-train", line 18, in <module>
    train.train(from_code, to_code, from_name, to_name, version, package_version, argos_version, data_exists)
  File "/home/argosopentech/argos-train/argostrain/train.py", line 78, in train
    source, target = dataset.data()
  File "/home/argosopentech/argos-train/argostrain/dataset.py", line 247, in data
    self.local_dataset = LocalDataset(filepath)
  File "/home/argosopentech/argos-train/argostrain/dataset.py", line 152, in __init__
    assert len(dir_names) > 0
AssertionError
(env) argosopen

I followed this tutorial:
https://libretranslate.fortytwo-it.com/training.php

I tried with large files (2M+ lines) and very small ones (500 lines), but the error is the same.

I have checked everything several times:
target, source, JSON, etc.

Can you help me with any advice?
Thanks.

guillaumekln · September 8, 2022, 12:56pm

Hi,

We can’t help with that since the error is coming from argos-train and not from OpenNMT. Maybe you want to contact the author of this tool?

cc @argosopentech

pocakka · September 8, 2022, 1:15pm

Oh, thanks. Google took me here, and I saw “argosopentech” replying in several topics.

argosopentech · September 9, 2022, 2:23am

I think the issue is that there’s no data available for that language. I’ll try to add a better error message.

happy-code-time · April 5, 2023, 9:15pm

Hello @argosopentech, I have a problem creating custom .argosdata files for argos-train and running a training for a new language.

My steps (from the official GitHub page):

Set up the Argos environment with the Docker image argosopentech/argostrain
Initial installation

[+] su argosopentech
[+] source ~/argos-train-init

Create a metadata.json file
{
    "package_version": "1.0",
    "argos_version": "1.0",
    "name": "Translation from pl to de",
    "type": "data",
    "size": "4096",
    "from_code": "pl",
    "from_name": "Polish",
    "to_code": "de",
    "to_name": "German"
}
Create a line-by-line translation with a source and a target file.
Zip it. I have tried a manual zip with the .zip extension, PHP Zip with the .argosdata extension, and zipping the data folder with a Python script:

import os
import zipfile


def zipDir(path, ziph):
    """
    Inserts directory (path) into zipfile instance (ziph)
    """
    # Name of the folder being zipped; used as the top-level folder in the archive.
    folder = os.path.basename(os.path.normpath(path))
    for root, dirs, files in os.walk(path):
        for file in files:
            # Zip entry names use "/" as the separator, on every platform.
            ziph.write(os.path.join(root, file), folder + "/" + file)


def makeZip(pathToFolder):
    """
    Creates a zip file with the specified folder
    """
    # Note: this writes a .argosmodel extension, although the post is about .argosdata packages.
    zipf = zipfile.ZipFile(pathToFolder + '.argosmodel', 'w', zipfile.ZIP_DEFLATED)
    zipDir(pathToFolder, zipf)
    zipf.close()
    print("Zip file saved to: " + pathToFolder)


makeZip('argosmodels/translate-en_de-1_0')
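A minimal sketch of another way to build such an archive. It assumes the package is a plain zip file with a single top-level folder holding metadata.json, source and target; that layout is inferred from this thread and is not confirmed by the Argos Train documentation. The names build_package and top_level_dirs are hypothetical helpers. Writing the metadata with json.dumps avoids hand-written syntax slips such as a trailing comma, and the entry names are built with forward slashes.

```python
import json
import zipfile
from pathlib import Path


def build_package(folder, metadata, out_path):
    """Write metadata.json into folder, then zip folder under one top-level directory."""
    folder = Path(folder)
    out_path = Path(out_path)
    (folder / "metadata.json").write_text(
        json.dumps(metadata, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    # Collect the files first so the archive itself is never picked up.
    files = sorted(p for p in folder.rglob("*") if p.is_file())
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in files:
            # Entry name: "<folder name>/<relative path>", always with "/".
            arcname = f"{folder.name}/{path.relative_to(folder).as_posix()}"
            zf.write(path, arcname)
    return out_path


def top_level_dirs(package):
    """Return the top-level directory names found inside the archive."""
    with zipfile.ZipFile(package) as zf:
        return {name.split("/", 1)[0] for name in zf.namelist() if "/" in name}
```

Calling top_level_dirs on the finished file is a quick way to see whether the archive contains a directory at all. An archive whose entries are named with a backslash would come back as an empty set here.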

When I run the argos-train script I get the error.

Given this, I tried some files from the existing data-index.json file:

The first entry from the data-index.json file below is not working (the same error).
The second one is working (and has the same file structure as the one I made).

I customized the data-index.json file for the tests:

[
    {
        "name": "Wiktionary",
        "type": "data",
        "from_code": "en",
        "to_code": "de",
        "size": 3048,
        "reference": "Wikiextract - Tatu Ylonen",
        "links": [
            "http://data.argosopentech.com/data-wiktionary-en_de.argosdata"
        ]
    },
    {
        "name": "Europarl",
        "type": "data",
        "from_code": "en",
        "to_code": "ca",
        "size": 1965734,
        "links": [
            "https://data.argosopentech.com/data-europarl-en_ca.argosdata"
        ]
    }
]
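A small sketch of how one might check which entries in an index like this cover a given language pair. It relies only on the fields visible in the posted JSON (type, from_code, to_code, size, links); find_datasets and SAMPLE_INDEX are hypothetical names and not part of Argos Train.

```python
import json

# Sample index with the same fields as the entries posted above.
SAMPLE_INDEX = """
[
  {"name": "Wiktionary", "type": "data", "from_code": "en", "to_code": "de",
   "size": 3048,
   "links": ["http://data.argosopentech.com/data-wiktionary-en_de.argosdata"]},
  {"name": "Europarl", "type": "data", "from_code": "en", "to_code": "ca",
   "size": 1965734,
   "links": ["https://data.argosopentech.com/data-europarl-en_ca.argosdata"]}
]
"""


def find_datasets(index_json, from_code, to_code):
    """Return the data entries whose language pair matches from_code -> to_code."""
    entries = json.loads(index_json)
    return [
        entry
        for entry in entries
        if entry.get("type") == "data"
        and entry.get("from_code") == from_code
        and entry.get("to_code") == to_code
    ]
```

With this sample, find_datasets(SAMPLE_INDEX, "pl", "de") returns an empty list, which would match the earlier remark in this thread that no data may be available for a language pair.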

What am I doing wrong when creating a .argosdata file, such that it is not accepted by the check
assert zipfile.is_zipfile(filepath)?
Is it possible to load files from a local drive (for example, by uploading them to the Docker container at build time)? When I create the data locally, I have to push each packed dataset to GitHub and link it in the data-index.json file so that it sits on a publicly accessible server, because the Docker image does not have access to a local server with a mapped domain like my-domain.com.
What does the value of “size” in the metadata.json file mean? I have not found any information about it on the GitHub setup page or elsewhere.

argosopentech · April 5, 2023, 10:36pm

The is_zipfile assertion error normally means it failed to download.
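One way to tell a failed download from a real archive is to look at the file on disk. The sketch below uses zipfile.is_zipfile from the standard library, the same check named in the question; describe_download is a hypothetical helper and the 64-byte preview length is arbitrary.

```python
import zipfile
from pathlib import Path


def describe_download(path):
    """Classify a downloaded file as missing, empty, a zip, or something else."""
    path = Path(path)
    if not path.is_file():
        return "missing"
    if path.stat().st_size == 0:
        return "empty"
    if zipfile.is_zipfile(path):
        return "zip"
    # Show the first bytes: an HTML error page, for example, starts with "<".
    with open(path, "rb") as f:
        head = f.read(64)
    return f"not a zip, starts with {head!r}"
```

If the result starts with "not a zip", the preview of the first bytes often shows what the server actually returned.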

There currently isn’t very good support for using local data packages but there’s a ticket to improve this. The easiest way to use local data is to put your data in run/source and run/target with no .argosdata packages.

The size is used by Argos Train to decide what data to download and use, it sometimes excludes large datasets. You can get the size with wc -l source.
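For reference, a Python sketch that is roughly equivalent to wc -l. It assumes source and target hold one sentence per line and are aligned by line number, as described earlier in the thread; count_lines and check_parallel are hypothetical helper names.

```python
def count_lines(path):
    """Count newline characters in a file, like `wc -l` does."""
    count = 0
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large corpora do not need to fit in memory.
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            count += chunk.count(b"\n")
    return count


def check_parallel(source_path, target_path):
    """Return the shared line count, or raise if the two files differ in length."""
    n_source = count_lines(source_path)
    n_target = count_lines(target_path)
    if n_source != n_target:
        raise ValueError(f"source has {n_source} lines but target has {n_target}")
    return n_source
```

Like wc -l, this counts newline characters, so a last line without a trailing newline is not counted.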

happy-code-time · April 6, 2023, 11:52am

@argosopentech Thanks for your response.

I’m now inside the Docker container. The assertion that fails now is the source size check:

assert len(source_data) > VALID_SIZE

If I check it with a print statement, I get:

Len source is 1275 and the limit is: 2000

So is the minimum size 2000 characters?
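What len() measures depends on the type of source_data, which the thread does not show. The sketch below illustrates the difference under the assumption that it is a sequence with one entry per line; in that case the limit counts lines, not characters. passes_size_check is a hypothetical stand-in for the assertion, and 2000 is the limit printed in the post.

```python
VALID_SIZE = 2000  # limit value printed in the post above


def passes_size_check(source_data, valid_size=VALID_SIZE):
    """Mirror the posted assertion: strictly more than valid_size entries."""
    return len(source_data) > valid_size


def as_lines(text):
    """Split raw text into one entry per line."""
    return text.splitlines()
```

For a text of 1275 short lines, passes_size_check(as_lines(text)) is False, while the same check on the raw string can pass because len() of a string counts characters. The comparison is strict, so exactly 2000 entries would still fail.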

If I comment it out, I get the next error:

"""
denormalizer_spec {}
trainer_interface.cc(350) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(181) LOG(INFO) Loading corpus: run/split_data/all.txt
trainer_interface.cc(406) LOG(INFO) Loaded all 0 sentences
trainer_interface.cc(422) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(422) LOG(INFO) Adding meta_piece: <s>
trainer_interface.cc(422) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(427) LOG(INFO) Normalizing sentences...
spm_train_main.cc(275) [status.ok()] Internal: src/trainer_interface.cc(428) [!sentences.empty()]
Program terminated with an unrecoverable error.
Corpus corpus_1's weight should be given. We default it to 1 for you.
Traceback (most recent call last):
  File "/home/argosopentech/env/bin/onmt_build_vocab", line 33, in <module>
    sys.exit(load_entry_point('OpenNMT-py', 'console_scripts', 'onmt_build_vocab')())
  File "/home/argosopentech/OpenNMT-py/onmt/bin/build_vocab.py", line 71, in main
    build_vocab_main(opts)
  File "/home/argosopentech/OpenNMT-py/onmt/bin/build_vocab.py", line 32, in build_vocab_main
    transforms = make_transforms(opts, transforms_cls, fields)
  File "/home/argosopentech/OpenNMT-py/onmt/transforms/transform.py", line 235, in make_transforms
    transform_obj.warm_up(vocabs)
  File "/home/argosopentech/OpenNMT-py/onmt/transforms/tokenize.py", line 147, in warm_up
    load_src_model.Load(self.src_subword_model)
  File "/home/argosopentech/env/lib/python3.10/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/home/argosopentech/env/lib/python3.10/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
OSError: Not found: "run/sentencepiece.model": No such file or directory Error #2
[2023-04-06 11:39:41,855 WARNING] Corpus corpus_1's weight should be given. We default it to 1 for you.
[2023-04-06 11:39:41,856 INFO] Parsed 2 corpora from -data.
Traceback (most recent call last):
  File "/home/argosopentech/env/bin/onmt_train", line 33, in <module>
    sys.exit(load_entry_point('OpenNMT-py', 'console_scripts', 'onmt_train')())
  File "/home/argosopentech/OpenNMT-py/onmt/bin/train.py", line 172, in main
    train(opt)
  File "/home/argosopentech/OpenNMT-py/onmt/bin/train.py", line 106, in train
    checkpoint, fields, transforms_cls = _init_train(opt)
  File "/home/argosopentech/OpenNMT-py/onmt/bin/train.py", line 58, in _init_train
    ArgumentParser.validate_prepare_opts(opt)
  File "/home/argosopentech/OpenNMT-py/onmt/utils/parse.py", line 197, in validate_prepare_opts
    cls._validate_fields_opts(opt, build_vocab_only=build_vocab_only)
  File "/home/argosopentech/OpenNMT-py/onmt/utils/parse.py", line 151, in _validate_fields_opts
    cls._validate_file(opt.src_vocab, info='src vocab')
  File "/home/argosopentech/OpenNMT-py/onmt/utils/parse.py", line 18, in _validate_file
    raise IOError(f"Please check path of your {info} file!")
OSError: Please check path of your src vocab file!
Traceback (most recent call last):
  File "/home/argosopentech/env/bin/argos-train", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/home/argosopentech/argos-train/bin/argos-train", line 27, in <module>
    train.train(from_code, to_code, from_name, to_name, version, package_version, argos_version, data_exists, epochs_count)
  File "/home/argosopentech/argos-train/argostrain/train.py", line 173, in train
    str(opennmt_checkpoints[-2].f),
IndexError: list index out of range
"""

OSError: Not found: "run/sentencepiece.model": No such file or directory Error #2
OSError: Please check path of your src vocab file!

How can I create the sentencepiece.model file, what content should it have, and what am I doing wrong?
The source and target files are inside this directory (pwd):

/home/argosopentech/argos-train/run

Please help me get this working.
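Reading the log, the earliest failure appears to be SentencePiece loading 0 sentences from run/split_data/all.txt. The missing run/sentencepiece.model, the missing src vocab file and the final IndexError come later and may simply follow from that first step. A minimal sketch for inspecting the files after a failed run is shown below. The paths are taken from the log and from the earlier reply about run/source and run/target; the order in which they are checked is an assumption about the pipeline, and first_broken_artifact is a hypothetical helper.

```python
from pathlib import Path

# Paths relative to the argos-train working directory, in assumed pipeline order.
ARTIFACTS = (
    "run/source",
    "run/target",
    "run/split_data/all.txt",
    "run/sentencepiece.model",
)


def first_broken_artifact(base_dir, artifacts=ARTIFACTS):
    """Return a note on the first missing or empty file, or None if all look fine."""
    for rel in artifacts:
        path = Path(base_dir) / rel
        if not path.is_file():
            return f"{rel}: missing"
        if path.stat().st_size == 0:
            return f"{rel}: empty"
    return None
```

Run against /home/argosopentech/argos-train, this would point at the first file in the chain that is absent or empty, which is usually a better place to start than the last traceback.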