Dynamic output from NMT

NMT with PyTorch: I wish to generate a new output if I change a token. Let's say my model gives w1 w2 w3 w4 w5 as the original output. Then it might be possible that, while decoding, if I replace w3 with some w7, I get an output like w1 w2 w7 w8 w9 w10. How can I experiment with this?

If you trained a standard Transformer model, you could achieve that with CTranslate2. Search for “target prefix”.
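
For reference, a minimal sketch of how a target prefix can be supplied through CTranslate2's Python API; the model directory and the prefix tokens below are placeholders, not from this thread:

import ctranslate2

# Load a converted Transformer model (placeholder path).
translator = ctranslate2.Translator("ende_ctranslate2/")

source = [["▁Hello", "▁world", "▁."]]

# Force decoding to start with these tokens; the decoder then continues freely,
# so the words after the replaced token may change.
prefix = [["▁w1", "▁w2", "▁w7"]]

results = translator.translate_batch(source, target_prefix=prefix)
print(results[0])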

thanks @guillaumekln
looking into it

I am trying to experiment with CTranslate2. I am able to convert and use the downloaded Transformer model mentioned in the documentation. However, when I try to use my own Transformer model, it gives me the unk token for all inputs…
After model conversion I successfully get the model.bin file and the source and target vocab files.
My model was trained with torch 1.0.1.post and torchtext 0.4.
I have experimented with all versions.
ct2-opennmt-py-converter --model_path models/model_en-hi_exp-5.6_2019-12-09-model_step_250000.pt --model_spec TransformerBase --output_dir models/test
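
For anyone following along, loading the converted directory and running a quick test looks roughly like this; the paths and tokens are taken from this thread, and the snippet is only a sketch:

import ctranslate2

# "models/test" is the --output_dir passed to ct2-opennmt-py-converter above.
translator = ctranslate2.Translator("models/test")

# Inputs must already be tokenized with the same SentencePiece model used for training.
tokens = ["▁According", "▁to", "▁the", "▁police", "▁his", "▁death",
          "▁was", "▁due", "▁to", "▁drown", "ing", "▁."]

results = translator.translate_batch([tokens])
print(results[0])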

What am I missing?
@guillaumekln

Are the inputs tokenized?

Yes.
The model is using SentencePiece.
The input is:
translator.translate_batch([['▁According', '▁to', '▁the', '▁police', '▁his', '▁death', '▁was', '▁due', '▁to', '▁drown', 'ing', '▁.']])

@guillaumekln

The thing is, it is working for the model from
wget https://s3.amazonaws.com/opennmt-models/transformer-ende-wmt-pyOnmt.tar.gz
Can it be a version issue of any kind?
I used the following OpenNMT-py version to train my Transformer model:
setup(name='OpenNMT-py',
      description='A python implementation of OpenNMT',
      version='0.8.2',

And I think CTranslate2 is using the latest OpenNMT-py.

Did you train a standard Transformer model as described here: https://opennmt.net/OpenNMT-py/FAQ.html#how-do-i-use-the-transformer-model?

Yes, and it works perfectly fine. I have trained lots of models.
Only with CTranslate2 am I facing this always-unk issue.

Can you share the model model_en-hi_exp-5.6_2019-12-09-model_step_250000.pt privately?

How? Google Drive?
And according to you, it can't be a version issue between the OpenNMT-py I used to train the model and what CTranslate2 is using?
Also, I think there is some issue with the model conversion, because when I increased the batch size to 5 and the number of hypotheses to 5, I received output tokens other than unk (although they are not correct translations). However, the same model and the same input served directly with OpenNMT-py (the translate function) yield perfectly fine results.
@guillaumekln
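
As an aside, a minimal sketch of how those decoding options can be passed to translate_batch; the beam_size value is an assumption here, since the number of returned hypotheses is bounded by the beam size:

# Reusing the translator and tokenized input from above.
results = translator.translate_batch(
    [tokens],
    beam_size=5,       # assumption: not stated in the thread
    num_hypotheses=5,  # return the 5 best hypotheses per sentence
)
print(results[0])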

There is an issue with your vocabulary. It contains unwanted quotes and brackets on each token. How did you preprocess your data with OpenNMT-py?

@guillaumekln
I tokenized using Moses, then applied BPE using SentencePiece.
When I convert the model with CTranslate2 I get two vocab files, source and target, whose contents look perfectly fine and are in a similar format to the default "ende_ctranslate2" model's vocab file.
For example:

the
of
and
to
in
a
is

Also, this tokenization, encoding, etc. all work perfectly fine with my OpenNMT-py (version 0.8).

This is not what I found in the model you sent me:

<unk>
<blank>
'▁the',
'▁of',
'▁,',
'▁.']
'NnUuMm',
'▁to',
'▁and',
'▁',
'▁in',
'.',
'▁is',
'▁a',
'▁that',

The tokens are actually like this in the model itself. This looks like the format of the verbose output of the translation command…

This is exactly how I am making the request with CTranslate2:
translator.translate_batch([['▁According', '▁to', '▁the', '▁police', '▁his', '▁death', '▁was', '▁due', '▁to', '▁drown', 'ing', '▁.']])

Any idea why the same thing is working on OpenNMT-py (0.8.2)?

It does not work with OpenNMT-py either:

Input: ▁According ▁to ▁the ▁police ▁his ▁death ▁was ▁due ▁to ▁drown ing ▁.
Output: <unk>

Try this input:
['▁According', '▁to', '▁the', '▁police', '▁his', '▁death', '▁was', '▁due', '▁to', '▁drown', 'ing', '▁.']

I use SentencePiece outside of OpenNMT-py.

Not sure how you got that but you trained your model on a wrong SentencePiece output. Check their documentation: https://github.com/google/sentencepiece#usage-instructions

If you still want to proceed with this model, you should at least be consistent with your incorrect preprocessing. Try this instead:

translator.translate_batch([["['▁According',", "'▁to',", "'▁the',", "'▁police',", "'▁his',", "'▁death',", "'▁was',", "'▁due',", "'▁to',", "'▁drown',", "'ing',", "'▁.']"]])

Thanks, this format is working.
About the SentencePiece model, I am not sure why you are calling it wrong, as it is what is mentioned in their Python wrapper documentation, and my format matches their format:

>>> sp.EncodeAsPieces("This is a test")
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est']
>>> sp.EncodeAsPieces('吾輩は猫である')
['▁', '吾', '輩', 'は', '猫', 'である']

This is how Python is formatting lists. The actual SentencePiece output does not contain brackets, quotes, and commas.
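
For anyone hitting the same problem, a minimal sketch of the difference, assuming the standard sentencepiece Python API and a hypothetical spm.model path; the pieces should be joined with spaces, not dumped as a Python list:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("spm.model")  # hypothetical path to the trained SentencePiece model

pieces = sp.EncodeAsPieces("According to the police his death was due to drowning.")

# Correct: one space-separated line of pieces per sentence, e.g. "▁According ▁to ▁the ..."
correct_line = " ".join(pieces)

# Incorrect: the Python list repr bakes quotes, commas and brackets into the
# training data, which then end up inside the vocabulary tokens.
incorrect_line = str(pieces)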

I’m going to close the topic as you were able to get an output with CTranslate2.