OpenNMT Forum

Installed but can't get pre-processing working

I’ve installed OpenNMT-py, then downloaded the training data sets with wget and unpacked them into data/multi30k, but I can’t get the pre-processing commands from this tutorial working:

http://opennmt.net/OpenNMT-py/extended.html

for l in en de; do for f in data/multi30k/*.$l; do if [[ "$f" != *"test"* ]]; then sed -i "$ d" $f; fi; done; done
for l in en de; do for f in data/multi30k/*.$l; do perl tools/tokenizer.perl -a -no-escape -l $l -q < $f > $f.atok; done; done
onmt_preprocess -train_src data/multi30k/train.en.atok -train_tgt data/multi30k/train.de.atok -valid_src data/multi30k/val.en.atok -valid_tgt data/multi30k/val.de.atok -save_data data/multi30k.atok.low -lower

This looks like a bash script.
I’m running it in my PERSONAL openNMT folder with data/multi30k in it.

I can see it’s accessing the .en .de files but it fails with errors:

sed: 1: "data/multi30k/train.en": extra characters at the end of d command
sed: 1: "data/multi30k/val.en": extra characters at the end of d command
sed: 1: "data/multi30k/train.de": extra characters at the end of d command
sed: 1: "data/multi30k/val.de": extra characters at the end of d command
Can't open perl script "tools/tokenizer.perl": No such file or directory
Can't open perl script "tools/tokenizer.perl": No such file or directory
Can't open perl script "tools/tokenizer.perl": No such file or directory
Can't open perl script "tools/tokenizer.perl": No such file or directory
Can't open perl script "tools/tokenizer.perl": No such file or directory
Can't open perl script "tools/tokenizer.perl": No such file or directory

Why?
Why do I get script errors like “extra characters at the end of d command”?
I have perl.
But why should I mysteriously have a tools/tokenizer.perl structure?
Where do I get that?
Where should I be running this?

Are you on macOS? Your sed error looks very much like the known difference in how the macOS (BSD) sed handles the -i option.
For the tokenizer error, make sure you execute this from your OpenNMT-py folder.
As a side note, this translation tutorial is not really up to date. To save some time, have a look at the Quickstart first, then at the transformer, and read about subword tokenization.
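
For reference, the sed failure can be reproduced and avoided as follows. On macOS, the BSD sed -i option takes a backup suffix as its next argument, so in sed -i "$ d" FILE the "$ d" is consumed as the suffix and the file name is parsed as the script. That script starts with the letter d (the delete command) followed by more characters, hence "extra characters at the end of d command". Below is a minimal sketch of a workaround that should behave the same with BSD and GNU sed, assuming it is acceptable to write through a temporary file. The names drop_last_line and demo.txt are hypothetical and not part of OpenNMT-py.

```shell
# Drop the last line of a file without using sed -i, whose argument
# handling differs between BSD (macOS) and GNU sed.
drop_last_line() {
    # $1: path of the file to edit
    # Write the result to a temporary file, then replace the original.
    sed '$d' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Demo on a throwaway file (hypothetical name) in a temporary directory.
demo_dir=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$demo_dir/demo.txt"
drop_last_line "$demo_dir/demo.txt"
cat "$demo_dir/demo.txt"
```

In the tutorial loop, this would replace the sed -i "$ d" $f call with drop_last_line "$f". On macOS only, sed -i '' '$ d' "$f" (an explicit empty suffix) should also work, but that form fails with GNU sed.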


Yes, I’m on macOS.
Out of date tutorial? OK.
I need to run in ‘my’ OpenNMT folder? You mean where it’s installed? It’s a mystery to me where it’s installed. I’ll try and find it.
And I should go to Quickstart instead?
OK . .

Out of date tutorial? OK.

It’s functional, but the methods and command line may not be the most appropriate now.

I need to run in ‘my’ OpenNMT folder? You mean where it’s installed? It’s a mystery to me where it’s installed. I’ll try and find it.

As you wrote “I’m running it in my PERSONAL openNMT folder with data/multi30k in it.” I thought you had done a git clone. If you installed via pip it may be easier for you to get the perl scripts separately. You can retrieve them here.
You can also use OpenNMT’s Tokenizer for tokenization.
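
The "Can't open perl script" message comes from the relative path tools/tokenizer.perl, which is resolved against the directory the command is run from. A small sketch of a pre-flight check follows, assuming the tutorial commands are meant to run from the top of a repository checkout. The function name require_file is hypothetical, and the file created in the demo is an empty placeholder, not the real tokenizer script.

```shell
# Report whether a relative path used by the tutorial resolves from the
# current directory; return non-zero if it does not.
require_file() {
    if [ -f "$1" ]; then
        echo "found: $1"
    else
        echo "missing: $1 (current directory: $(pwd))" >&2
        return 1
    fi
}

# Demo in a throwaway directory that mimics the layout of a checkout.
work_dir=$(mktemp -d)
mkdir -p "$work_dir/tools"
: > "$work_dir/tools/tokenizer.perl"   # empty placeholder, not the real script

# From the top-level directory the relative path resolves.
(cd "$work_dir" && require_file tools/tokenizer.perl)
# From any other directory it does not, which is the situation in the error above.
(cd "$work_dir/tools" && require_file tools/tokenizer.perl) || echo "wrong directory"
```

Running require_file tools/tokenizer.perl before the tokenization loop shows straight away whether the working directory is the right one.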


I just want to get it running.
Yes, I installed via pip after installing PyTorch via Anaconda.

I looked in that github repo.
It’s unclear to me what I should do next to get a demo running.

I can usually get demos running, but OpenNMT is a few steps beyond me (I’m a Python ML coder, not a systems administrator or whatever it is you guys are :) ).

I can’t even work out where OpenNMT is installed. I’ve looked in all sorts of alien locations like /usr/bin/local, /anaconda/ . . /lib/ . . etc. I found the license file and a few other things. NO data folder or perl folder.

HELP!
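
One way to see where a pip-installed package lives is to ask Python for the location it imports the package from. The sketch below assumes python3 on the PATH is the interpreter of the environment where the package was installed (for example the Anaconda one), and that the importable module name for OpenNMT-py is onmt. The function name package_dir is hypothetical.

```shell
# Print the directory a Python package is imported from.
# $1: importable module name
package_dir() {
    python3 -c 'import importlib.util, os, sys
spec = importlib.util.find_spec(sys.argv[1])
if spec is None or spec.origin is None:
    sys.exit("module not found: " + sys.argv[1])
print(os.path.dirname(spec.origin))' "$1"
}

package_dir json     # standard-library module, used here only as a demo
# package_dir onmt   # assumed module name for OpenNMT-py, if it is installed
```

The directory printed this way holds only the Python package itself. As noted in this thread, the data and tools folders come from the git repository, so they would not be expected there.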

OK, so that GitHub repo IS the tools folder . .
OK.
I’ll try that.
But how will I fix the sed error?

The Quickstart is there for that purpose. As for the paths, they are meant to run from a git clone. We only added pip support a few weeks ago.

The simplest option, I think, is to adapt the paths to your own setup, e.g. just retrieve the data folder from the git repo.


OK.
What about my sed error?

Just don’t use this data (and hence those sed/tokenizer commands). It won’t give you any good results anyway; you’ll need much bigger datasets.
Use the data folder from the repo to try to get the commands from the Quickstart running.
