How can I leave unknown words that show up in the target translation, as untranslated?
I can’t find any documentation or options on this. I am looking for something similar to the way Moses handles OOVs:
http://www.statmt.org/moses/?n=Advanced.OOVs
Dear Steve,
Can you please give an example sentence showing the translation behaviour you have now, and explain the behaviour you would rather have?
Kind regards,
Yasmin
Hi Yasmin,
OK. So let me make sure we are on the same page. This is during translation, NOT training. So here is the first sentence straight from the Wikipedia article on machine translation:
Source Text to Machine Translate:
Machine translation, sometimes referred to by the abbreviation MT (not to be confused with computer-aided translation, machine-aided human translation (MAHT) or interactive translation) is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another.
EN>ES Model trained with OpenNMT-py:
La traducción automática, a veces referida por la abreviatura MT (no confundir con la traducción <unk>, traducción humana <unk> (<unk>) o traducción interactiva) es un <unk> de <unk> computacional que investiga el uso del software para traducir texto o discurso de un idioma.
I don’t want the <unk> tags in my target text. I would like to have the OOV (unknown) words left untranslated in the target language, like the translation below:
EN>ES Model trained with Moses SMT:
Máquina traducción, a veces refieren a por la abreviación mt (no deben confundirse con computer-aided traducción, machine-aided traducción humano (MAHT) o traducción interactivo) es un sub-field de lingüística computacional que investigates el uso de software para traducir texto o habla de una lengua a otra.
I hope this is more clear. Obviously there is an issue with tokenizing hyphenated words so that they can be translated, but let’s forget that for now. I want unknown words left untranslated in the target text.
Thanks,
Steve
Dear Steve,
There are two translation options that can help you:
1. Add -replace_unk to the translation command, and it will replace the <unk> tag with the original source word, i.e. it will keep it untranslated.
2. Add -phrase_table to the translation command, followed by the path to a dictionary file, to replace the <unk> tag with a translation from that file. The -replace_unk option should be there as well.
The phrase table file should include a single translated word (token) per line in the format:
source|||target
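For example, a small phrase table could look like this (the entries here are just illustrative):
investigates|||investiga
sub-field|||subcampo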
Recently, I implemented the -phrase_table translation option in OpenNMT-py. It was documented from the Lua version but was not actually implemented in the PyTorch version. So if you use it and have any feedback, please let me know.
You can find more details about the options at:
http://opennmt.net/OpenNMT-py/options/translate.html?highlight=phrase%20table
You can also refer to this page - from the Lua version, but the concept is the same:
http://opennmt.net/OpenNMT/translation/unknowns/
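Putting both options together, the translation command could look like this (model and file names here are placeholders):
python3 translate.py -model model.pt -src input.tok.en -output pred.tok.es -replace_unk -phrase_table dict.txt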
I hope this helps.
Kind regards,
Yasmin
Thank you very much Yasmin!
Two more questions:
1. How can I log unknown words? Moses SMT has a feature for this; it was supposed to log unknown words to a file, but I could never get it to work.
2. Hyphenated words are a problem. This seems like a tokenization issue. If I could break up hyphenated words, I think I would have far fewer unknown words.
Thanks!
Steve
One more question, sorry. How do you add an option with no value to the config.json file for translating with server.py?
Check out this unanswered question asking the same thing:
http://forum.opennmt.net/t/how-do-you-use-replace-unk-on-server-py-translationserver-py/1758
My options in the config.json file look like this:
"opt":
{
"gpu": -1,
"beam_size": 5,
"replace_unk": true
}
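For reference, the surrounding structure of the file follows the standard server layout (the id and model name here are placeholders):
{
    "models_root": "./available_models",
    "models": [
        {
            "id": 100,
            "model": "model.pt",
            "timeout": 600,
            "on_timeout": "to_cpu",
            "load": true,
            "opt": {
                "gpu": -1,
                "beam_size": 5,
                "replace_unk": true
            }
        }
    ]
}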
When I set replace_unk to true, instead of replacing the unknown word with the untranslated source word, it simply leaves the word out.
Please advise how to set this option and how to turn on options that are just boolean.
Thanks,
Steve
Dear Steve,
- How can I log unknown words? Moses SMT has a feature for this; it was supposed to log unknown words to a file, but I could never get it to work.
As far as I know, there is no feature like this, but it can easily be added in the translation.py file, after this part:
# replace each <unk> with the source token that received the highest attention weight
if self.replace_unk and attn is not None and src is not None:
    for i in range(len(tokens)):
        if tokens[i] == tgt_field.unk_token:
            # find the source position with the maximum attention
            _, max_index = attn[i].max(0)
            tokens[i] = src_raw[max_index.item()]
Under the second if, you can write the current src_raw[max_index.item()] to a file.
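A minimal sketch of that change (the log file name unknown_words.log is just an assumption):
if self.replace_unk and attn is not None and src is not None:
    for i in range(len(tokens)):
        if tokens[i] == tgt_field.unk_token:
            _, max_index = attn[i].max(0)
            tokens[i] = src_raw[max_index.item()]
            # append the untranslated source token to a log file (name is just an example)
            with open("unknown_words.log", "a", encoding="utf-8") as log_file:
                log_file.write(src_raw[max_index.item()] + "\n")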
- Hyphenated words are a problem. This seems like a tokenization issue. If I could break up hyphenated words, I think I would have far fewer unknown words.
This depends on the tokenizer you use, and you have to use the same tokenizer during both training and translation.
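For example, with the Moses-style script shipped with OpenNMT-py, you would run the same command on both the training data and the text you translate (file names here are placeholders):
perl tools/tokenizer.perl -l en < train.en > train.en.tok
perl tools/tokenizer.perl -l es < train.es > train.es.tok
perl tools/tokenizer.perl -l en < input.en > input.en.tok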
Kind regards,
Yasmin
"replace_unk": true
is correct. So the questions are:
replace_unk
option.Kind regards,
Yasmin
Hi Yasmin,
So this is not looking good. The answers to your questions: yes, I tried translate.py directly with the -replace_unk option, and without the option the <unk> tags show up.
I am confused now. I don’t know what could cause the -replace_unk switch to simply drop the unknown word rather than leave it untranslated as designed.
I do use the tokenizer that comes with OpenNMT-py.
OpenNMT-py/tools/tokenizer.perl
However, there appears to be another tokenizer that is recommended.
Do you think my using the perl tokenizer is the cause of my misery? That doesn’t seem likely to me. I am training and translating with this same perl tokenizer.
Can someone confirm that the -replace_unk option is working as designed?
Thanks,
Steve
P.S. I trained this model with a fairly recent copy of OpenNMT-py. Here is the output from a “git pull” command that I just ran (and then ran the translate commands with -replace_unk and without):
steve@gp1:~/workspace/OpenNMT-py$ git pull
remote: Enumerating objects: 163, done.
remote: Counting objects: 100% (163/163), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 210 (delta 155), reused 160 (delta 154), pack-reused 47
Receiving objects: 100% (210/210), 56.94 KiB | 6.33 MiB/s, done.
Resolving deltas: 100% (166/166), completed with 69 local objects.
From https://github.com/OpenNMT/OpenNMT-py
db4d2fb..1b3cc33 master -> origin/master
e74a63e..2777a74 gh-pages -> origin/gh-pages
* [new tag] 0.9.0 -> 0.9.0
Updating db4d2fb..1b3cc33
Fast-forward
.travis.yml | 14 +++++-----
CHANGELOG.md | 6 +++++
README.md | 7 ++++-
docs/source/FAQ.md | 34 ++++++++++++++++++++++--
docs/source/Library.md | 2 +-
docs/source/onmt.modules.rst | 2 +-
onmt/__init__.py | 2 +-
onmt/decoders/cnn_decoder.py | 6 ++++-
onmt/decoders/decoder.py | 12 ++++++++-
onmt/decoders/transformer.py | 13 +++++++++-
onmt/encoders/audio_encoder.py | 15 ++++++++---
onmt/encoders/cnn_encoder.py | 5 +++-
onmt/encoders/image_encoder.py | 6 ++++-
onmt/encoders/rnn_encoder.py | 5 +++-
onmt/encoders/transformer.py | 12 ++++++++-
onmt/inputters/inputter.py | 239 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------------
onmt/model_builder.py | 2 +-
onmt/models/model.py | 4 +++
onmt/modules/embeddings.py | 4 +++
onmt/modules/multi_headed_attn.py | 3 +++
onmt/modules/position_ffn.py | 4 +++
onmt/opts.py | 29 ++++++++++++++++-----
onmt/tests/test_preprocess.py | 12 ++++-----
onmt/train_single.py | 13 ++++++++--
onmt/trainer.py | 20 +++++++++++++--
onmt/translate/translator.py | 2 +-
onmt/utils/parse.py | 32 ++++++++++++++++++-----
preprocess.py | 167 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------------
requirements.txt | 2 +-
setup.py | 2 +-
30 files changed, 520 insertions(+), 156 deletions(-)
Dear Steve,
Personally, I have thought about this once: should I tokenize on hyphens? I believe “no”, because you will have words like “co-operation” or “break-in” (the noun, not the verb “break in”), etc. Such compound words should be trained (and translated) as they are. But this is my opinion, and only about English.
Tokenization cannot affect replace_unk if there is really an <unk> in the output.
If you use the server, you have to restart it before any file changes can take effect.
Kind regards,
Yasmin
Hi Yasmin,
I have restarted the services and even rebooted the server. When I add the -replace_unk option, it just drops the unknown word completely. I get the same result using server.py and translate.py with the latest code.
I’m not sure where to go from here. Moses SMT does a really good job of handling this (for me anyway). Check out the output below.
One thing I noticed. If I translate an unknown word by itself:
bippittyboppityboo
It will leave it untranslated (as designed).
Now, if I prefix that unknown word with a sentence:
Source: This is a sentence bippittyboppityboo
Translation: Esta es una frase bippittyboppityboo
It leaves the word untranslated!
However, if I add one more word or any punctuation, it drops the unknown word:
Source: This is a sentence with bippittyboppityboo
Translation: Esta es una frase con with
The more I play around with that, the stranger the output. For instance:
Source: This is a sentence Spanish bippittyboppityboo
Translation: Esta es una frase bippittyboppityboo
What do you advise going forward?
Thanks,
Steve
Dear Steve,
I believe I got it now. Your model is a Transformer model, right? If so, the -replace_unk and -phrase_table options do not work with a Transformer model. They work perfectly otherwise.
Kind regards,
Yasmin
Hi Yasmin,
It is a Transformer model, but -replace_unk is a requirement.
What do I lose by changing from a Transformer model (and what kind of model am I changing back to)? The use case is real-time translation.
Below are the hyperparameters that I used. I just took this from the documentation somewhere. I understand some of the parameters, but there is a lot there. Do you have any recommended settings? I am using 20 million segments from the ParaCrawl Corpus. This is for EN>ES.
Feedback is greatly appreciated!
Steve
python3 train.py -data data/demo -save_model paracrawl_enes-model \
    -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
    -encoder_type transformer -decoder_type transformer -position_encoding \
    -train_steps 200000 -max_generator_batches 2 -dropout 0.1 \
    -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 \
    -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
    -max_grad_norm 0 -param_init 0 -param_init_glorot \
    -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 \
    -world_size 1 -gpu_ranks 0
Did you try subword tokenization like BPE or SentencePiece already?
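For instance, a minimal SentencePiece setup looks like this (a sketch, assuming the sentencepiece Python package; file names and vocabulary size are placeholders):
import sentencepiece as spm

# train a subword model on the source-side training text
spm.SentencePieceTrainer.Train("--input=train.en --model_prefix=spm_en --vocab_size=32000")

# load the model and segment a sentence into subword pieces
sp = spm.SentencePieceProcessor()
sp.Load("spm_en.model")
print(sp.EncodeAsPieces("This is a sentence with bippittyboppityboo"))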
Dear Steve,
The first question I would ask myself if I get many unknown words is “why”. An in-domain corpus gives better translations of in-domain sentences. Your test sentence is very specialized compared to the corpus you used. Even Moses gives you many unknown words, because the idea is the same.
The Transformer model is known for producing better translations (this is what it was invented for). However, sometimes you get better translations without it. It makes sense to just try the default options and compare the results.
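For example, training the default model (if I remember correctly, a 2-layer LSTM with 500 hidden units) needs only a minimal command, something like:
python3 train.py -data data/demo -save_model demo-rnn-model -world_size 1 -gpu_ranks 0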
One important point is that you are trying to translate long sentences, and this can affect the quality. It makes sense to split your sentences for translation when possible. Also, if you are going to train another, non-Transformer model, search the forum for posts about long sentences, because there are options like “layers” and “word_vec_size” that can help. It can also help to include words and short phrases in your training data, not only long sentences.
As you can see, there is plenty of room for research and trial-and-error learning.
Kind regards,
Yasmin
Hi Yasmin,
I don’t think this is really a matter of in-domain vs. out-of-domain corpora. This corpus is the “kitchen sink”: the ParaCrawl corpus of 20 million sentences, crawled from text publicly available on the web. I just used that sentence because it has a lot of unknown words (mainly hyphenated words). I know it’s a long run-on sentence.
Thanks for all your help!
Steve
@stevebpdx @ymoslem @guillaumekln
What is the definition of the unk token when we are using subword BPE? I mean, for person names, a word-level model returns an unk token when it is not able to translate, but a subword-based model does not return unk; instead it makes up a wrong name by joining subwords. I don’t want a new subword-based name; I want the original name only, left untranslated. Any suggestion on how to achieve this?
Do you mean that if I use an RNN model with BPE tokenization, -replace_unk will work perfectly? Because if I use words as tokens, -replace_unk messes things up and replaces the unknown with the wrong word from the source sentence.