Installed but can't get pre-processing working

I’ve installed OpenNMT-py, wget-ed and unpacked the training data sets into data/multi30k, but I can’t get the steps from this tutorial working:

http://opennmt.net/OpenNMT-py/extended.html

for l in en de; do for f in data/multi30k/*.$l; do if [[ "$f" != *"test"* ]]; then sed -i "$ d" $f; fi; done; done
for l in en de; do for f in data/multi30k/*.$l; do perl tools/tokenizer.perl -a -no-escape -l $l -q < $f > $f.atok; done; done
onmt_preprocess -train_src data/multi30k/train.en.atok -train_tgt data/multi30k/train.de.atok -valid_src data/multi30k/val.en.atok -valid_tgt data/multi30k/val.de.atok -save_data data/multi30k.atok.low -lower

This looks like a bash script.
I’m running it in my PERSONAL openNMT folder with data/multi30k in it.

I can see it’s accessing the .en .de files but it fails with errors:

sed: 1: "data/multi30k/train.en": extra characters at the end of d command
sed: 1: "data/multi30k/val.en": extra characters at the end of d command
sed: 1: "data/multi30k/train.de": extra characters at the end of d command
sed: 1: "data/multi30k/val.de": extra characters at the end of d command
Can't open perl script "tools/tokenizer.perl": No such file or directory
Can't open perl script "tools/tokenizer.perl": No such file or directory
Can't open perl script "tools/tokenizer.perl": No such file or directory
Can't open perl script "tools/tokenizer.perl": No such file or directory
Can't open perl script "tools/tokenizer.perl": No such file or directory
Can't open perl script "tools/tokenizer.perl": No such file or directory

Why?
Why script errors like extra chars after d?
I have perl.
But why should I mysteriously have a tools/tokenizer.perl structure?
Where do I get that?
Where should I be running this?

Are you on macOS? Your sed error looks very much like this.
For the tokenizer error, make sure you execute this from your OpenNMT-py folder.
Sidenote, this Translation tutorial is not really up to date. To save you some time, you can have a look at the quickstart first, and then at the transformer and read about subword tokenization.
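
On the sed message itself: BSD sed (the macOS default) expects a backup suffix right after -i, so it takes "$ d" as the suffix and then tries to parse the file name as the script, which is where "extra characters at the end of d command" comes from. A minimal sketch of a workaround that should behave the same on macOS and Linux is to skip -i and write through a temporary file. The helper name drop_last_line and the demo file are hypothetical, not part of the tutorial:

```shell
# Hypothetical helper: remove the last line of a file without `sed -i`,
# whose argument handling differs between GNU sed (Linux) and BSD sed (macOS).
drop_last_line() {
    f="$1"
    # '$d' deletes the last line; the result goes to a temp file, then replaces the original.
    sed '$d' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
}

# Demo on a throwaway file with three lines.
demo_file="$(mktemp)"
printf 'line 1\nline 2\nline 3\n' > "$demo_file"
drop_last_line "$demo_file"
cat "$demo_file"
```

On macOS only, passing an explicit empty suffix (sed -i '' '$ d' "$f") should also work, but that form is not portable to GNU sed.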

Yes, I’m on Mac OS.
Out of date tutorial? OK.
I need to run in ‘my’ OpenNMT folder? You mean where it’s installed? It’s a mystery to me where it’s installed. I’ll try and find it.
And I should go to Quickstart instead?
OK . .

Out of date tutorial? OK.

It’s functional, but the methods and command line may not be the most appropriate now.

I need to run in ‘my’ OpenNMT folder? You mean where it’s installed? It’s a mystery to me where it’s installed. I’ll try and find it.

As you wrote “I’m running it in my PERSONAL openNMT folder with data/multi30k in it.” I thought you had done a git clone. If you installed via pip it may be easier for you to get the perl scripts separately. You can retrieve them here.
You can also use OpenNMT’s Tokenizer for tokenization.
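
If it is unclear where pip put the package, Python itself can report it. The sketch below assumes python3 is the interpreter the package was installed into; module_path is a hypothetical helper name. It only looks the module up and does not import it. Bear in mind that the installed package is just the Python code, so the tools and data folders from the repository will most likely not be next to it:

```shell
# Hypothetical helper: print the file Python would load a module from,
# without actually importing the module.
module_path() {
    python3 - "$1" <<'EOF'
import importlib.util
import sys

# find_spec returns None when the module cannot be found on sys.path.
spec = importlib.util.find_spec(sys.argv[1])
print(spec.origin if spec else "not installed")
EOF
}

# "onmt" is the module name used by OpenNMT-py (see `from onmt.bin.preprocess import main`).
module_path onmt || echo "python3 not found on PATH"
```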

I just want to get it running.
Yes I installed via Pip after a pytorch install via Anaconda.

I looked in that github repo.
It’s unclear to me what I should do next to get a demo running.

I can usually get demos running but openNMT is a few steps beyond me (I’m a python ML coder not a systems administrator or whatever it is you guys are :) ).

I can’t even work out where openNMT is installed. I’ve looked in all sorts of alien locations like /usr/bin/local, /anaconda/ . . /lib/ . . etc etc. I found the license file and a few other things. NO data folder or perl folder.

HELP!

OK, so that github repo IS the tools folder . .
OK.
I’ll try that.
But how will I fix the sed error?

The Quickstart is there for that purpose. As for the paths, they are meant to run from a git clone. We only added pip support a few weeks ago.

Simplest is to adapt the paths to your own usage I think, e.g. just retrieve the data folder of the git repo.

OK.
What about my sed error?

Just don’t use this data (and hence those sed/tokenizer commands). It won’t give you good results anyway; you’ll need much bigger datasets.
Use the data folder from the repo to try and have the commands from Quickstart running.

OK, so now (on Mac OS 10.11.4) I’ve:

  • Downloaded via github
  • Installed via: python setup.py install

PROBLEMS

  1. onmt_preprocess -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo
    FAILS
    Illegal instruction: 4

  2. For optional features: pip install -r requirements.opt.txt
    FAILS

ERROR: Command errored out with exit status -4:
command: //anaconda3/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/6m/r5_59y7n1yndj0mrgpqydmv00000gn/T/pip-req-build-fwajfc9a/setup.py'"'"'; __file__='"'"'/private/var/folders/6m/r5_59y7n1yndj0mrgpqydmv00000gn/T/pip-req-build-fwajfc9a/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/6m/r5_59y7n1yndj0mrgpqydmv00000gn/T/pip-req-build-fwajfc9a/pip-egg-info
cwd: /private/var/folders/6m/r5_59y7n1yndj0mrgpqydmv00000gn/T/pip-req-build-fwajfc9a/
Complete output (0 lines):
----------------------------------------
ERROR: Command errored out with exit status -4: python setup.py egg_info Check the logs for full command output.

I still can’t get OpenNMT working . .
Python 3.7 installed via Anaconda.
Git clone works fine.
python setup.py install works fine.
Data folder contains the demo files.

But then the Quickstart demo does NOT work.
See ^ for errors.

Help!

Looks like there is something broken in your install.
You can try executing the scripts directly, from the directory in which you cloned.
onmt_preprocess --> preprocess.py
onmt_train --> train.py
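
As a dry run, that mapping can be sketched as follows. to_direct is a hypothetical helper that only prints the equivalent command, and the onmt_<name> to <name>.py pattern is an assumption based on the two examples above:

```shell
# Hypothetical helper: turn an `onmt_<name> ...` call into the equivalent
# `python <name>.py ...` call and print it (nothing is executed).
to_direct() {
    entry="$1"
    shift
    # ${entry#onmt_} strips the leading "onmt_" from the entry-point name.
    echo "python ${entry#onmt_}.py $*"
}

# Same arguments as the failing Quickstart command; run the printed line
# from the directory where the repository was cloned.
to_direct onmt_preprocess -train_src data/src-train.txt -train_tgt data/tgt-train.txt \
    -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo
```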

With or without the -train_src data/src-train.txt argument, it fails:

(base) PaulPalhysiMac4:OpenNMT-py paul$ python3 preprocess.py
Illegal instruction: 4
(base) PaulPalhysiMac4:OpenNMT-py paul$ vi preprocess.py
(base) PaulPalhysiMac4:OpenNMT-py paul$ python3 preprocess.py -train_src data/src-train.txt
Illegal instruction: 4

and preprocess.py only contains:

#!/usr/bin/env python
from onmt.bin.preprocess import main


if __name__ == "__main__":
    main()

This very much looks like a macOS-specific issue, and probably not specifically related to OpenNMT-py.

By the way, you might have some trouble getting anything running decently on macOS, unless you 1) have an NVIDIA GPU (hence a rather old version of macOS), 2) manage to install the proper drivers, CUDA, etc., and 3) compile PyTorch for your configuration. I believe there may also be some way to compile PyTorch for AMD GPUs, but I have never tried it.

The preprocess script you’re looking at is just a wrapper. The “real” script is imported at the top from onmt.bin.preprocess.
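
To narrow down where the crash comes from, it can help to import each dependency in a fresh interpreter and look at the exit status. "Illegal instruction: 4" and pip's "exit status -4" both point at signal 4 (SIGILL), which usually means some compiled code uses CPU instructions the processor does not support. This is only a sketch: try_import is a hypothetical helper, and the choice of torch as the first suspect is an assumption, not something confirmed in this thread:

```shell
# Signal 4 is SIGILL; `kill -l` maps a signal number to its name.
sig_name="$(kill -l 4)"
echo "signal 4 = SIG$sig_name"

# Hypothetical helper: import one module in a fresh interpreter and report
# how that interpreter exited, so a crash in one import cannot hide the others.
try_import() {
    status=0
    python3 -c "import $1" >/dev/null 2>&1 || status=$?
    echo "$1: exit status $status"
}

# torch is only a guess at the culprit; onmt is the module preprocess.py imports.
try_import torch
try_import onmt
```

An exit status of 132 (128 + 4) means the interpreter was killed by SIGILL during that import, 1 typically means an ordinary Python error such as a missing module, and 0 means the import went through.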

I could load it on my PC.
But all the demos use Linux and bash scripts . .
Do I load Linux on my PC first? An emulator?
Or just load the PC version of openNMT?

What do you recommend?

OpenNMT-py is mainly developed and tested on Linux. Easiest would be an Ubuntu install on your PC (not a VM, as it would require PCIe passthrough to get the GPU).

Thnx.
It loaded on my PC (on DOS shell, not Linux), installed & the training demo is running!
(But I had to use the direct python X.py -xxxx version of commands)

My laptop probably has no GPU so the training will take . . 172 hrs!

Thanks everyone.