Hello,
My config file is similar to the example given on the website, and all the files have been tokenized.
Here is the traceback with the error:
```
Traceback (most recent call last):
  File "C:####Python\Python310\site-packages\torch\utils\data\_utils\worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "C:####Python\Python310\site-packages\torch\utils\data\_utils\fetch.py", line 43, in fetch
    data = next(self.dataset_iter)
  File "C:####Python\Python310\site-packages\onmt\inputters\dynamic_iterator.py", line 290, in __iter__
    for bucket in self._bucketing():
  File "C:####Python\Python310\site-packages\onmt\inputters\dynamic_iterator.py", line 231, in _bucketing
    for ex in self.mixer:
  File "C:####Python\Python310\site-packages\onmt\inputters\dynamic_iterator.py", line 83, in __iter__
    item = next(iterator)
  File "C:####Python\Python310\site-packages\onmt\inputters\text_corpus.py", line 209, in __iter__
    yield from indexed_corpus
  File "C:####Python\Python310\site-packages\onmt\inputters\text_corpus.py", line 186, in _add_index
    for i, item in enumerate(stream):
  File "C:####Python\Python310\site-packages\onmt\inputters\text_corpus.py", line 170, in _transform
    for example in stream:
  File "C:####Python\Python310\site-packages\onmt\inputters\text_corpus.py", line 153, in _tokenize
    for example in stream:
  File "C:####Python\Python310\site-packages\onmt\inputters\text_corpus.py", line 69, in load
    sline = sline.decode('utf-8')
MemoryError
```
Should I start reducing the training data? There are only 65k sentences in the training set.
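In case it helps narrow things down: since the MemoryError is raised while a single line is decoded as UTF-8 in text_corpus.py, here is a minimal check I can run on the raw corpus files to see whether any one line is abnormally large (for example because of a missing newline). This is just a sketch; the file paths are placeholders for my actual training files.

```python
# Minimal check (paths are placeholders for my actual corpus files):
# report the longest line in bytes for each file, since the traceback shows
# the MemoryError happening while one line is decoded as UTF-8.
files = ["data/src-train.txt", "data/tgt-train.txt"]  # hypothetical paths

for path in files:
    max_len = 0
    with open(path, "rb") as f:  # read raw bytes, line by line, like the loader does
        for line in f:
            max_len = max(max_len, len(line))
    print(f"{path}: longest line is {max_len} bytes")
```

If one of the files reports a line of hundreds of megabytes, that would explain the error better than the overall corpus size would.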