stewart (Stewart) | August 31, 2020, 12:35am | #1
I am a new user and trying to use the provided language translation model as follows:
onmt_translate -model /Downloads/iwslt-brnn2.s131_acc_62.71_ppl_7.74_e20.pt -src data/src-test.txt -output preds456.txt -verbose
The performance seems very poor on the provided data set. Here is an example:
[2020-08-30 17:04:05,878 INFO]
SENT 10: ['Jet', 'makers', 'feud', 'over', 'seat', 'width', 'with', 'big', 'orders', 'at', 'stake']
PRED 10: with big
PRED SCORE: -4.8279
What am I missing?
Thanks!
Stewart
This model expects a German sentence as input.
You should also apply the same tokenization that was used for the training data. See:
#!/usr/bin/env bash
# Copyright (c) 2017-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the LICENSE file in
# the root directory of this source tree. An additional grant of patent rights
# can be found in the PATENTS file in the same directory.
#
# Adapted from https://github.com/facebookresearch/MIXER/blob/master/prepareData.sh
echo 'Cloning Moses github repository (for tokenization scripts)...'
git clone https://github.com/moses-smt/mosesdecoder.git
SCRIPTS=mosesdecoder/scripts
TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
LC=$SCRIPTS/tokenizer/lowercase.perl
CLEAN=$SCRIPTS/training/clean-corpus-n.perl
URL="https://wit3.fbk.eu/archive/2014-01/texts/de/en/de-en.tgz"
GZ=de-en.tgz
(Script truncated.)
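Putting that advice together, the intended workflow is: tokenize and lowercase the raw German input with the Moses scripts cloned above, then translate with the pretrained model from the original post. A hedged sketch (the file names `raw-test.de`, `tok-test.de`, and `src-test.de` are illustrative, and the snippet is guarded so it is a no-op where the tools are not installed):

```shell
# Illustrative preprocessing + translation pipeline. Assumes
# mosesdecoder/ was cloned by the script above and that onmt_translate
# and the pretrained model are available; otherwise this does nothing.
if command -v onmt_translate >/dev/null && [ -d mosesdecoder ]; then
  # Tokenize the raw German text the same way the training data was.
  perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l de \
      < raw-test.de > tok-test.de
  # Lowercase, matching the lowercased training corpus.
  perl mosesdecoder/scripts/tokenizer/lowercase.perl \
      < tok-test.de > src-test.de
  # Translate with the pretrained German-English model.
  onmt_translate -model iwslt-brnn2.s131_acc_62.71_ppl_7.74_e20.pt \
      -src src-test.de -output preds.txt -verbose
fi
```

The key point is that the text reaching `-src` must have gone through the same tokenization and casing as the model's training data.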
stewart (Stewart) | August 31, 2020, 11:46pm | #3
Thanks so much for the help!
In the documentation (https://github.com/OpenNMT/OpenNMT-py), it looks like they run translate on the raw text file, as in:
onmt_translate -model demo-model_acc_XX.XX_ppl_XXX.XX_eX.pt -src data/src-test.txt -output pred.txt -replace_unk -verbose
Are you saying that the file data/src-test.txt cannot just be raw German text, but must be turned into a list of tokens or something?
data/src-test.txt is already tokenized (see, for example, the space before the periods).
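To make the "space before the periods" point concrete: Moses's tokenizer.perl splits punctuation off words, so tokenized text looks visibly different from raw text. A crude sed stand-in (not the real tokenizer, which also handles abbreviations, Unicode, and escaping) shows the effect on one raw German sentence:

```shell
# Crude approximation of what Moses tokenization does to punctuation:
# insert a space before each punctuation mark, so "immer." -> "immer ."
echo "Orlando Bloom und Miranda Kerr lieben sich noch immer." \
  | sed 's/\([.,!?;:]\)/ \1/g'
# prints: Orlando Bloom und Miranda Kerr lieben sich noch immer .
```

A file in that form is what the model saw at training time, which is why src-test.txt already has spaces before its periods.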