Hi,
thank you for your help so far. I tried to tokenize the data, but it seems that preserve_placeholders still tokenizes the features?
The untokenized input data looks like this:
enode((V PTCP PST))
miscalibrate((V PST))
scrutinise((V PST))
incense((V PTCP PST))
pen((V PTCP PST))
greenlight((V PST))
polygamise((V PTCP PST))
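To make it easy to reproduce in isolation, here is a single-line version of what I describe below (the printed list is exactly what I get back):

import pyonmttok

tokenizer = pyonmttok.Tokenizer("char", preserve_placeholders=True)
print(tokenizer("enode((V PTCP PST))"))
# -> ['e', 'n', 'o', 'd', 'e', '(', '(', 'V', 'P', 'T', 'C', 'P', 'P', 'S', 'T', ')', ')']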
I tried doing it both with tokenizer.tokenize_file and manually, and I'm always getting the same result.
tokenizer = pyonmttok.Tokenizer("char", preserve_placeholders=True)
tokenizer.tokenize_file(
    input_path="/content/drive/MyDrive/Colab Notebooks/R&D project/English/train_source.txt",
    output_path="/content/drive/MyDrive/Colab Notebooks/R&D project/English/train_source.txt_tk.txt",
    num_threads=1,
    verbose=False,
    training=True,
    tokens_delimiter=" ",
)
OR
tokenizer = pyonmttok.Tokenizer("char", preserve_placeholders=True)

def tokenize(fname):
    # read the raw lines
    with open(fname, mode="r", encoding="utf-8") as f:
        flines = f.readlines()
    # write one token list per input line to <fname>_tk.txt
    f_new = open(f"{fname}_tk.txt", "w", encoding="utf-8")
    for line in flines:
        tk = tokenizer(line)  # list of character tokens
        f_new.writelines("\n" + str(tk))  # writes the Python list repr, preceded by a newline
    f_new.close()
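For reference, I call it on the same file as in the first option:

tokenize("/content/drive/MyDrive/Colab Notebooks/R&D project/English/train_source.txt")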
and the tokenized data comes out like this (this is from the second option; the first gives the same tokens, just space-delimited):
['e', 'n', 'o', 'd', 'e', '(', '(', 'V', 'P', 'T', 'C', 'P', 'P', 'S', 'T', ')', ')']
['m', 'i', 's', 'c', 'a', 'l', 'i', 'b', 'r', 'a', 't', 'e', '(', '(', 'V', 'P', 'S', 'T', ')', ')']
['s', 'c', 'r', 'u', 't', 'i', 'n', 'i', 's', 'e', '(', '(', 'V', 'P', 'S', 'T', ')', ')']
['i', 'n', 'c', 'e', 'n', 's', 'e', '(', '(', 'V', 'P', 'T', 'C', 'P', 'P', 'S', 'T', ')', ')']
['p', 'e', 'n', '(', '(', 'V', 'P', 'T', 'C', 'P', 'P', 'S', 'T', ')', ')']
['g', 'r', 'e', 'e', 'n', 'l', 'i', 'g', 'h', 't', '(', '(', 'V', 'P', 'S', 'T', ')', ')']
['p', 'o', 'l', 'y', 'g', 'a', 'm', 'i', 's', 'e', '(', '(', 'V', 'P', 'T', 'C', 'P', 'P', 'S', 'T', ')', ')']
I tried training with both just to see, and I'm now also getting this error when translating, despite accuracy increasing during training:
!onmt_translate -model run/ED_model.en_step_10000.pt -src test_source.txt -output pred.txt -gpu 0 -verbose -beam_size 12
Traceback (most recent call last):
  File "/usr/local/bin/onmt_translate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/onmt/bin/translate.py", line 60, in main
    translate(opt)
  File "/usr/local/lib/python3.8/dist-packages/onmt/bin/translate.py", line 41, in translate
    _, _ = translator._translate(
  File "/usr/local/lib/python3.8/dist-packages/onmt/translate/translator.py", line 345, in _translate
    batch_data = self.translate_batch(
  File "/usr/local/lib/python3.8/dist-packages/onmt/translate/translator.py", line 723, in translate_batch
    return self._translate_batch_with_strategy(
  File "/usr/local/lib/python3.8/dist-packages/onmt/translate/translator.py", line 768, in _translate_batch_with_strategy
    src, enc_final_hs, enc_out, src_len = self._run_encoder(batch)
  File "/usr/local/lib/python3.8/dist-packages/onmt/translate/translator.py", line 732, in _run_encoder
    enc_out, enc_final_hs, src_len = self.model.encoder(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/onmt/encoders/rnn_encoder.py", line 72, in forward
    packed_emb = pack(emb, src_len_list, batch_first=True)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/utils/rnn.py", line 262, in pack_padded_sequence
    _VF._pack_padded_sequence(input, lengths, batch_first)
RuntimeError: Length of all samples has to be greater than 0, but found an element in 'lengths' that is <= 0
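In case it helps, my (unverified) guess is that a zero-length sample ends up in the -src file, e.g. a blank line; the leading "\n" my manual script writes would create exactly that at the top of the output file. This is just the sanity check I plan to run (check_blank_lines and the path are my own, not anything official):

def check_blank_lines(fname):
    # report any line that is empty, since pack_padded_sequence rejects length 0
    with open(fname, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                print(f"blank line at {i}")

check_blank_lines("test_source.txt")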
Thank you!