NMT PyTorch: I wish to generate a new output if I change a token. Let's say my model gives w1 w2 w3 w4 w5 as the original output. Then it might be possible that, while decoding, if I replace w3 with some w7, I get an output like w1 w2 w7 w8 w9 w10. How can I experiment with this?
If you trained a standard Transformer model, you could achieve that with CTranslate2. Search for “target prefix”.
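For example, here is a minimal sketch of decoding with a forced target prefix via the CTranslate2 Python API (the model directory and tokens are only placeholders, and the exact result format differs between CTranslate2 versions):

import ctranslate2

translator = ctranslate2.Translator("ende_ctranslate2/")  # placeholder path to a converted model

source = [["▁Hello", "▁world", "▁."]]
prefix = [["▁Hallo"]]  # the decoder is forced to start with these tokens, then continues freely

results = translator.translate_batch(source, target_prefix=prefix)
print(results[0])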
I am trying to experiment with CTranslate2. I can convert and use the downloaded Transformer model mentioned in the documentation. However, when I try to use my own Transformer model, it gives me the unk token for all inputs…
After model conversion I successfully get the model.bin file and the source and target vocab files.
My model was trained with torch 1.0.1.post and torchtext 0.4.
I have experimented with all versions
ct2-opennmt-py-converter --model_path models/model_en-hi_exp-5.6_2019-12-09-model_step_250000.pt --model_spec TransformerBase --output_dir models/test
What am I missing?
@guillaumekln
Are the inputs tokenized?
Yes.
The model uses SentencePiece.
The input is:
translator.translate_batch([['▁According', '▁to', '▁the', '▁police', '▁his', '▁death', '▁was', '▁due', '▁to', '▁drown', 'ing', '▁.']])
The thing is, it works for the model downloaded with wget https://s3.amazonaws.com/opennmt-models/transformer-ende-wmt-pyOnmt.tar.gz
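For reference, this is roughly how I load and query the converted model (a minimal sketch; the result format may differ depending on the CTranslate2 version):

import ctranslate2

translator = ctranslate2.Translator("models/test")  # directory produced by ct2-opennmt-py-converter above

tokens = ["▁According", "▁to", "▁the", "▁police", "▁his", "▁death",
          "▁was", "▁due", "▁to", "▁drown", "ing", "▁."]
results = translator.translate_batch([tokens])
print(results[0])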
Could it be a version issue of any kind?
I used OpenNMT version
setup(name='OpenNMT-py',
description='A python implementation of OpenNMT',
version='0.8.2',
to train my transformer model
and I think CTranslate2 is using the latest OpenNMT-py.
Did you train a standard Transformer model as described here: https://opennmt.net/OpenNMT-py/FAQ.html#how-do-i-use-the-transformer-model?
yes
and it works perfectly fine. I have trained lots of models.
Only here with CTranslate2 am I facing this always-unk issue.
Can you share the model model_en-hi_exp-5.6_2019-12-09-model_step_250000.pt
privately?
How? Google Drive?
And according to you, it can't be a version issue between the OpenNMT-py version I trained the model with and the one CTranslate2 is using?
Also, I think there is some issue with the model conversion. When I increased the batch size to 5 and the number of hypotheses to 5, I received output tokens other than unks (although they were not correct translations). However, the same model with the same input, when served directly with OpenNMT-py (translate function), yields perfectly fine results.
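For reference, a sketch of that call (parameter names from the CTranslate2 Python API; the variable holding the five tokenized sentences is a placeholder):

results = translator.translate_batch(
    batch,             # placeholder: a list of 5 tokenized sentences
    beam_size=5,
    num_hypotheses=5,
)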
@guillaumekln
There is an issue with your vocabulary. It contains unwanted quotes and brackets on each token. How did you preprocess your data with OpenNMT-py?
@guillaumekln
I tokenized with Moses, then applied BPE with SentencePiece.
When I convert the model with CTranslate2 I get two dict files, source and target, whose contents look perfectly fine, in a similar format to the vocab file of the default "ende_ctranslate2" model.
e.g.:
the
of
and
to
in
a
is
Also, this tokenization, encoding, etc. works perfectly fine in my OpenNMT-py (version 0.8).
This is not what I found in the model you sent me:
<unk>
<blank>
'▁the',
'▁of',
'▁,',
'▁.']
'NnUuMm',
'▁to',
'▁and',
'▁',
'▁in',
'.',
'▁is',
'▁a',
'▁that',
The tokens are actually like this in the model itself. This looks like the format of the verbose output of the translation command…
This is exactly how I am making the request with CTranslate2:
translator.translate_batch([['▁According', '▁to', '▁the', '▁police', '▁his', '▁death', '▁was', '▁due', '▁to', '▁drown', 'ing', '▁.']])
Any idea why the same thing works on OpenNMT-py (0.8.2)?
It does not work with OpenNMT-py either:
Input: ▁According ▁to ▁the ▁police ▁his ▁death ▁was ▁due ▁to ▁drown ing ▁.
Output: <unk>
Try this input:
['▁According', '▁to', '▁the', '▁police', '▁his', '▁death', '▁was', '▁due', '▁to', '▁drown', 'ing', '▁.']
I use SentencePiece outside of OpenNMT-py.
Not sure how you got that, but you trained your model on incorrect SentencePiece output. Check their documentation: https://github.com/google/sentencepiece#usage-instructions
If you still want to proceed with this model, you should at least be consistent with your incorrect preprocessing. Try this instead:
translator.translate_batch([["['▁According',", "'▁to',", "'▁the',", "'▁police',", "'▁his',", "'▁death',", "'▁was',", "'▁due',", "'▁to',", "'▁drown',", "'ing',", "'▁.']"]])
Thanks, this format is working.
About the SentencePiece model: not sure why you are calling it wrong, as it is what is shown in their Python wrapper documentation, and my format matches theirs.
>>> sp.EncodeAsPieces("This is a test")
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est']
>>> sp.EncodeAsPieces('吾輩は猫である')
['▁', '吾', '輩', 'は', '猫', 'である']
This is how Python is formatting lists. The actual SentencePiece output does not contain brackets, quotes, and commas.
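In other words, a minimal sketch of the difference (the SentencePiece model path is a placeholder):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("spm.model")  # placeholder model path

pieces = sp.EncodeAsPieces("According to the police his death was due to drowning .")

correct = " ".join(pieces)  # "▁According ▁to ▁the ..." — what the training data should contain
incorrect = str(pieces)     # "['▁According', '▁to', ...]" — the Python list repr, with brackets, quotes and commas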
I’m going to close the topic as you were able to get an output with CTranslate2.