stewart (Stewart) | August 31, 2020, 12:35am | #1
I am a new user and trying to use the provided language translation model as follows:
onmt_translate -model /Downloads/iwslt-brnn2.s131_acc_62.71_ppl_7.74_e20.pt -src data/src-test.txt -output preds456.txt -verbose
The performance seems very poor on the provided data set. Here is an example:
[2020-08-30 17:04:05,878 INFO]
SENT 10: ['Jet', 'makers', 'feud', 'over', 'seat', 'width', 'with', 'big', 'orders', 'at', 'stake']
PRED 10: with big
PRED SCORE: -4.8279
What am I missing?
Thanks!
Stewart
This model expects a German sentence as input.
You should also apply the same tokenization that was used for the training data. See:
#!/usr/bin/env bash
# Copyright (c) 2017-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the LICENSE file in
# the root directory of this source tree. An additional grant of patent rights
# can be found in the PATENTS file in the same directory.
#
# Adapted from https://github.com/facebookresearch/MIXER/blob/master/prepareData.sh
echo 'Cloning Moses github repository (for tokenization scripts)...'
git clone https://github.com/moses-smt/mosesdecoder.git
SCRIPTS=mosesdecoder/scripts
TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
LC=$SCRIPTS/tokenizer/lowercase.perl
CLEAN=$SCRIPTS/training/clean-corpus-n.perl
URL="https://wit3.fbk.eu/archive/2014-01/texts/de/en/de-en.tgz"
GZ=de-en.tgz
(Script truncated.)
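Putting that advice together, the intended workflow is: tokenize and lowercase the raw German input with the Moses scripts cloned above, then translate with the pretrained model from the original post. A hedged sketch (the file names `raw-test.de`, `tok-test.de`, and `src-test.de` are illustrative, and the snippet is guarded so it is a no-op where the tools are not installed):

```shell
# Illustrative preprocessing + translation pipeline. Assumes
# mosesdecoder/ was cloned by the script above and that onmt_translate
# and the pretrained model are available; otherwise this does nothing.
if command -v onmt_translate >/dev/null && [ -d mosesdecoder ]; then
  # Tokenize the raw German text the same way the training data was.
  perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l de \
      < raw-test.de > tok-test.de
  # Lowercase, matching the lowercased training corpus.
  perl mosesdecoder/scripts/tokenizer/lowercase.perl \
      < tok-test.de > src-test.de
  # Translate with the pretrained German-English model.
  onmt_translate -model iwslt-brnn2.s131_acc_62.71_ppl_7.74_e20.pt \
      -src src-test.de -output preds.txt -verbose
fi
```

The key point is that the text reaching `-src` must have gone through the same tokenization and casing as the model's training data.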
stewart (Stewart) | August 31, 2020, 11:46pm | #3
Thanks so much for the help!
In the documentation (https://github.com/OpenNMT/OpenNMT-py), it looks like they run translate on the raw text file, as in:
onmt_translate -model demo-model_acc_XX.XX_ppl_XXX.XX_eX.pt -src data/src-test.txt -output pred.txt -replace_unk -verbose
Are you saying that the file data/src-test.txt cannot just be raw German text, but must be turned into a list of tokens or something?
data/src-test.txt is already tokenized (see, for example, the space before the periods).
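To make the "space before the periods" point concrete: Moses's tokenizer.perl splits punctuation off words, so tokenized text looks visibly different from raw text. A crude sed stand-in (not the real tokenizer, which also handles abbreviations, Unicode, and escaping) shows the effect on one raw German sentence:

```shell
# Crude approximation of what Moses tokenization does to punctuation:
# insert a space before each punctuation mark, so "immer." -> "immer ."
echo "Orlando Bloom und Miranda Kerr lieben sich noch immer." \
  | sed 's/\([.,!?;:]\)/ \1/g'
# prints: Orlando Bloom und Miranda Kerr lieben sich noch immer .
```

A file in that form is what the model saw at training time, which is why src-test.txt already has spaces before its periods.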