Hi,
thank you for your help so far. I tried to tokenize the data, but it seems that preserve_placeholders still tokenizes the features?
The untokenized input data looks like this:
enode((V PTCP PST))
miscalibrate((V PST))
scrutinise((V PST))
incense((V PTCP PST))
pen((V PTCP PST))
greenlight((V PST))
polygamise((V PTCP PST))
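To make it easy to reproduce in isolation, here is a single-line version of what I describe below (the printed list is exactly what I get back):

import pyonmttok

tokenizer = pyonmttok.Tokenizer("char", preserve_placeholders=True)
print(tokenizer("enode((V PTCP PST))"))
# -> ['e', 'n', 'o', 'd', 'e', '(', '(', 'V', 'P', 'T', 'C', 'P', 'P', 'S', 'T', ')', ')']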
I tried doing it both with tokenizer.tokenize_file and manually, and I'm always getting the same result.
tokenizer = pyonmttok.Tokenizer("char", preserve_placeholders=True)
tokenizer.tokenize_file(
    input_path="/content/drive/MyDrive/Colab Notebooks/R&D project/English/train_source.txt",
    output_path="/content/drive/MyDrive/Colab Notebooks/R&D project/English/train_source.txt_tk.txt",
    num_threads=1,
    verbose=False,
    training=True,
    tokens_delimiter=" ",
)
OR
tokenizer = pyonmttok.Tokenizer("char", preserve_placeholders=True)

def tokenize(fname):
    # read the raw lines
    with open(fname, mode="r", encoding="utf-8") as f:
        flines = f.readlines()
    # write one token list per input line to <fname>_tk.txt
    f_new = open(f"{fname}_tk.txt", "w", encoding="utf-8")
    for line in flines:
        tk = tokenizer(line)  # list of character tokens
        f_new.writelines("\n" + str(tk))  # writes the Python list repr, preceded by a newline
    f_new.close()
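For reference, I call it on the same file as in the first option:

tokenize("/content/drive/MyDrive/Colab Notebooks/R&D project/English/train_source.txt")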
and the tokenized data comes out like this (this is from the second option; the first gives the same tokens, just space-delimited):
['e', 'n', 'o', 'd', 'e', '(', '(', 'V', 'P', 'T', 'C', 'P', 'P', 'S', 'T', ')', ')']
['m', 'i', 's', 'c', 'a', 'l', 'i', 'b', 'r', 'a', 't', 'e', '(', '(', 'V', 'P', 'S', 'T', ')', ')']
['s', 'c', 'r', 'u', 't', 'i', 'n', 'i', 's', 'e', '(', '(', 'V', 'P', 'S', 'T', ')', ')']
['i', 'n', 'c', 'e', 'n', 's', 'e', '(', '(', 'V', 'P', 'T', 'C', 'P', 'P', 'S', 'T', ')', ')']
['p', 'e', 'n', '(', '(', 'V', 'P', 'T', 'C', 'P', 'P', 'S', 'T', ')', ')']
['g', 'r', 'e', 'e', 'n', 'l', 'i', 'g', 'h', 't', '(', '(', 'V', 'P', 'S', 'T', ')', ')']
['p', 'o', 'l', 'y', 'g', 'a', 'm', 'i', 's', 'e', '(', '(', 'V', 'P', 'T', 'C', 'P', 'P', 'S', 'T', ')', ')']
I tried training with both just to see, and I'm now also getting this error when translating, despite accuracy increasing during training:
!onmt_translate -model run/ED_model.en_step_10000.pt -src test_source.txt -output pred.txt -gpu 0 -verbose -beam_size 12
Traceback (most recent call last):
  File "/usr/local/bin/onmt_translate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/onmt/bin/translate.py", line 60, in main
    translate(opt)
  File "/usr/local/lib/python3.8/dist-packages/onmt/bin/translate.py", line 41, in translate
    _, _ = translator._translate(
  File "/usr/local/lib/python3.8/dist-packages/onmt/translate/translator.py", line 345, in _translate
    batch_data = self.translate_batch(
  File "/usr/local/lib/python3.8/dist-packages/onmt/translate/translator.py", line 723, in translate_batch
    return self._translate_batch_with_strategy(
  File "/usr/local/lib/python3.8/dist-packages/onmt/translate/translator.py", line 768, in _translate_batch_with_strategy
    src, enc_final_hs, enc_out, src_len = self._run_encoder(batch)
  File "/usr/local/lib/python3.8/dist-packages/onmt/translate/translator.py", line 732, in _run_encoder
    enc_out, enc_final_hs, src_len = self.model.encoder(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/onmt/encoders/rnn_encoder.py", line 72, in forward
    packed_emb = pack(emb, src_len_list, batch_first=True)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/utils/rnn.py", line 262, in pack_padded_sequence
    _VF._pack_padded_sequence(input, lengths, batch_first)
RuntimeError: Length of all samples has to be greater than 0, but found an element in 'lengths' that is <= 0
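In case it helps, my (unverified) guess is that a zero-length sample ends up in the -src file, e.g. a blank line; the leading "\n" my manual script writes would create exactly that at the top of the output file. This is just the sanity check I plan to run (check_blank_lines and the path are my own, not anything official):

def check_blank_lines(fname):
    # report any line that is empty, since pack_padded_sequence rejects length 0
    with open(fname, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                print(f"blank line at {i}")

check_blank_lines("test_source.txt")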
Thank you!