Masking/noise during training

hammondm · April 14, 2021, 2:44pm

Hi

I’ve been trying to get masking to work, but the relevant parameters seem to have no effect. For example:

mask_ratio: 0.1
mask_length: word
rotate_ratio: 0.1
insert_ratio: 0.1
permute_sent_ratio: 0.1

Whatever value I set, there’s no difference.

Is there some other parameter that turns these on?

thanks,

mike h.

francoishernandez · April 14, 2021, 4:46pm

You need to activate the corresponding transform, "bart", on the datasets you want to use it on. This is not very clear in the docs; we should probably update them.

Add transforms: [bart] either at the root of your config, if you want it applied to all datasets, or under specific datasets in your data entries.
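
As a rough illustration of the two placements, here is a minimal sketch that assumes the usual YAML layout with a data section of named corpora. The corpus name and file paths are hypothetical placeholders; the noise options are the ones from the question.

```yaml
# Option A: root-level list, used for every corpus under `data`.
transforms: [bart]

# Noise options from the question; they are only read once the
# bart transform is active.
mask_ratio: 0.1
mask_length: word
rotate_ratio: 0.1
insert_ratio: 0.1
permute_sent_ratio: 0.1

data:
    corpus_1:
        path_src: data/src-train.txt   # hypothetical path
        path_tgt: data/tgt-train.txt   # hypothetical path
        # Option B: instead of the root-level list, enable the
        # transform for this corpus only.
        # transforms: [bart]
```

Use one option or the other, depending on whether every corpus should be noised or only some of them.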

hammondm · April 14, 2021, 8:34pm

Francois:

Thank you. I tried that and now I’m getting this error:

AttributeError: 'BARTNoiseTransform' object has no attribute 'vocabs'

Any ideas?

mike h

francoishernandez · April 15, 2021, 7:41am

cc @Zenglinxiao

Zenglinxiao · April 15, 2021, 8:52am

Could you please share the OpenNMT version (commit) you are working with and the detailed error trace?

hammondm · April 15, 2021, 1:45pm

Hi Linxiao

I’m running it in a docker image and installed opennmt via pip. Here’s what that shows:

root@a33d70122f86:/mh# pip list | grep nmt
pyonmttok              1.25.0

I’m running onmt_train from a shell script. It does a bunch of data wrangling, builds the vocab, and then seems to hit the error. That error seems to generate a bunch of additional problems, but I’ve cut them off below.

Mike H.

root@a33d70122f86:/mh# ./both.sh 
ady
gre
ice
ita
khm
lav
mlt_latn
rum
slv
wel_sw
Corpus corpus_1's weight should be given. We default it to 1 for you.
[2021-04-15 13:42:42,294 INFO] Counter vocab from 7000 samples.
[2021-04-15 13:42:42,294 INFO] Build vocab on 7000 transformed examples/corpus.
[2021-04-15 13:42:42,301 INFO] corpus_1's transforms: TransformPipe(BARTNoiseTransform(None))
[2021-04-15 13:42:42,301 INFO] Loading ParallelCorpus(/workspace/big/BIG_src-train.txt, /workspace/big/BIG_tgt-train.txt, align=None)...
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/opt/conda/lib/python3.8/site-packages/onmt/inputters/corpus.py", line 298, in build_sub_vocab
    maybe_example = DatasetAdapter._process(item, is_train=True)
  File "/opt/conda/lib/python3.8/site-packages/onmt/inputters/corpus.py", line 69, in _process
    maybe_example = transform.apply(
  File "/opt/conda/lib/python3.8/site-packages/onmt/transforms/transform.py", line 189, in apply
    example = transform.apply(
  File "/opt/conda/lib/python3.8/site-packages/onmt/transforms/bart.py", line 380, in apply
    if is_train and self.vocabs is not None:
AttributeError: 'BARTNoiseTransform' object has no attribute 'vocabs'
"""

The above exception was the direct cause of the following exception:
...

Zenglinxiao · April 16, 2021, 9:05am

Hi @hammondm,
I just checked the code. The error does not come from training; it is raised while building the vocab. I’ll open a PR to fix this issue.
As a temporary workaround, you can remove the bart transform from the config used for build_vocab and add it only in the train config. The error will disappear, and it won’t affect the result.
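
A minimal sketch of that workaround as two separate configs, shown as two YAML documents. The file names, save_data prefix, and vocab paths are hypothetical; the corpus paths are the ones from the log above.

```yaml
# build_vocab.yaml (hypothetical name), passed to onmt_build_vocab.
# No bart transform here, so the vocab step never calls it.
save_data: run/example              # hypothetical output prefix
src_vocab: run/example.vocab.src    # hypothetical vocab path
tgt_vocab: run/example.vocab.tgt    # hypothetical vocab path
data:
    corpus_1:
        path_src: /workspace/big/BIG_src-train.txt
        path_tgt: /workspace/big/BIG_tgt-train.txt
---
# train.yaml (hypothetical name), passed to onmt_train.
# Same data and vocab files, with the bart transform switched on.
src_vocab: run/example.vocab.src
tgt_vocab: run/example.vocab.tgt
transforms: [bart]
mask_ratio: 0.1
mask_length: word
data:
    corpus_1:
        path_src: /workspace/big/BIG_src-train.txt
        path_tgt: /workspace/big/BIG_tgt-train.txt
```

Both steps read the same vocab files, so the only difference between the two configs is whether the noise transform is listed.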

hammondm · April 16, 2021, 9:00pm

Hi Linxiao

Excellent. Yes, that’s working; thanks!

mike h