Vocab not recognized during translation, producing <unk> all over

oxo · June 16, 2023, 4:27am

I’m going through the OpenNMT-py tutorial: Quickstart — OpenNMT-py documentation

I successfully generated the vocab lists. Then I ran

onmt_train -config toy_en_de.yaml

I saw ‘The first 10 tokens of the vocabs are:['<unk>', '<blank>', '<s>', '</s>', 'the\t12670\r', ',\t9710\r', '.\t9647\r', 'of\t6634\r', 'and\t5787\r', 'to\t5610\r']’ and I’m not sure if I should be worried.

For training, my config file looks like this:

save_data: run
# Prevent overwriting existing files in the folder
#overwrite: False

# Vocabulary files that were just created
src_vocab: toy-ende/vocab.src
tgt_vocab: toy-ende/vocab.tgt
        

# Corpus opts:
data:
    corpus_1:
        path_src: toy-ende/src-train.txt
        path_tgt: toy-ende/tgt-train.txt
    valid:
        path_src: toy-ende/src-val.txt
        path_tgt: toy-ende/tgt-val.txt
        

# Train on a single GPU
world_size: 1
gpu_ranks: [0]

# Where to save the checkpoints
save_model: run/model
save_checkpoint_steps: 500
train_steps: 1000
valid_steps: 500

I’m using Windows.

After training for 1000 steps, I ran

onmt_translate -model run/model_step_1000.pt -src toy-ende/src-mytest.txt -output run/pred_1000.txt -gpu 0 -verbose

I’ve shortened the test file to 4 sentences, and the first sentence in the test file is the same as the first sentence in the training data. This is what I got:

I expected at least the first sentence (from the source file) to be recognized correctly. What can I do?

oxo · June 16, 2023, 4:48pm

OK, I figured out why. Basically, when the vocab list is loaded, it’s not parsed correctly.
Solution:
I changed a line in inputter.py to

lines = [line.strip() for line in f if line.strip()]
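
For anyone hitting the same thing, here is a minimal sketch of what seems to be going on. It assumes the vocab file was written with Windows (CRLF) line endings and that the loader strips only "\n" before checking for a count column. `parse_vocab` is a hypothetical stand-in written for illustration, not the actual function in inputter.py.

```python
# Raw lines as they might look in a vocab file with CRLF line endings
# (the two entries are taken from the log message above; the empty line is illustrative).
raw_lines = ["the\t12670\r\n", ",\t9710\r\n", "\r\n"]


def parse_vocab(lines, strip_chars=None):
    """Hypothetical parser: return tokens, dropping the count column if one is detected."""
    # strip_chars=None strips all surrounding whitespace, including "\r".
    cleaned = [line.strip(strip_chars) for line in lines if line.strip(strip_chars)]
    first = cleaned[0].split(None, 1)
    # A trailing "\r" makes "12670\r".isdigit() False, so the counts are not detected.
    has_count = len(first) == 2 and first[-1].isdigit()
    if has_count:
        return [line.split(None, 1)[0] for line in cleaned]
    return cleaned


broken = parse_vocab(raw_lines, "\n")  # strips only "\n": "\r" survives
fixed = parse_vocab(raw_lines)         # strips all whitespace, like the changed line

print(broken)
print(fixed)
```

With only "\n" stripped, each whole line (token, tab, count and "\r") ends up as a single vocab entry, which matches the ‘the\t12670\r’ entries in the log. Real tokens such as "the" are then missing from the vocab and come out as <unk>. Converting the vocab files to LF line endings before training would probably work as well, but I have not tried that.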

Lynn · November 23, 2023, 6:02pm

Hi! I am struggling with the same issue right now. Could you provide more details about how to fix it? Thank you so much.