Hi all,
I am following the instructions here for supervised learning on a specific head. I was wondering what the recommended parameter values are, specifically for alignment_layer and alignment_heads? I am using the standard Transformer model with 6 layers and 8 heads.
Hello ArbinTimilsina,
For the guided alignments, I encourage you to take a look at the original paper. I followed all of its experiments to implement and test this feature in OpenNMT. During testing, we set lambda_align to 0.05, as mentioned in the paper.
And alignment_layer is NOT the same as in the paper. The original implementation uses the Vaswani-style Transformer (the one from "Attention Is All You Need"), but the Transformer in OpenNMT is slightly different (it is a Pre-Norm Transformer), so the best layer differs as well. In the paper they use the 5th of 6 layers; in our experiments we found the 4th of 6 to be the best. Note that this argument is used as the index of a list, so use --alignment_layer 3 or --alignment_layer -3 for the 4th of 6.
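To illustrate the indexing (this is just plain Python list indexing, not OpenNMT code): the decoder's layers behave like a Python list, so index 3 and index -3 both select the 4th of 6 layers.

```python
# Illustration only: a 6-layer decoder viewed as a Python list.
decoder_layers = ["layer1", "layer2", "layer3", "layer4", "layer5", "layer6"]

# Both forms of the flag point at the 4th of 6 layers:
assert decoder_layers[3] == "layer4"   # --alignment_layer 3
assert decoder_layers[-3] == "layer4"  # --alignment_layer -3
```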
As for alignment_heads: if you want to supervise a single head, just set --alignment_heads 1. For larger values, we supervise the average across that number of heads, which is what I used for the layer-average baseline during testing.
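A toy sketch of what "the average across this number of heads" means (this is not the OpenNMT implementation; the function name and list-of-matrices representation are invented for illustration):

```python
# Hypothetical sketch: average attention probabilities over the first
# `alignment_heads` heads. Each head's attention is a [tgt_len][src_len]
# matrix of floats, here as nested lists.
def average_heads(attn_per_head, alignment_heads):
    heads = attn_per_head[:alignment_heads]
    tgt_len, src_len = len(heads[0]), len(heads[0][0])
    return [
        [sum(h[t][s] for h in heads) / len(heads) for s in range(src_len)]
        for t in range(tgt_len)
    ]

# With --alignment_heads 1, this reduces to supervising the first head alone.
head_a = [[0.25, 0.75]]
head_b = [[0.75, 0.25]]
assert average_heads([head_a, head_b], 1) == head_a
assert average_heads([head_a, head_b], 2) == [[0.5, 0.5]]
```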
These are only the values I used during implementation and comparison with the paper. You may well find better ones.
As for your second question, unfortunately I haven't seen this kind of issue in the tests I've done.
Could you please print the error traceback to see where it goes wrong?
Traceback (most recent call last):
File "/home/atimilsina/Work/guided-alignment/OpenNMT-py/onmt/trainer.py", line 377, in _gradient_accumulation
trunc_size=trunc_size)
File "/home/atimilsina/Work/guided-alignment/OpenNMT-py/onmt/utils/loss.py", line 165, in __call__
for shard in shards(shard_state, shard_size):
File "/home/atimilsina/Work/guided-alignment/OpenNMT-py/onmt/utils/loss.py", line 381, in shards
torch.autograd.backward(inputs, grads)
File "/home/atimilsina/.virtualenvs/for-alignment/lib/python3.6/site-packages/torch/autograd/__init__.py", line 87, in backward
grad_tensors = _make_grads(tensors, grad_tensors)
File "/home/atimilsina/.virtualenvs/for-alignment/lib/python3.6/site-packages/torch/autograd/__init__.py", line 28, in _make_grads
raise RuntimeError("grad can be implicitly created only for scalar outputs")
RuntimeError: grad can be implicitly created only for scalar outputs
[2020-02-06 15:29:56,317 INFO] At step 1, we removed a batch - accum 0
In the make_batch_align_matrix function, we create a tensor from the alignment index pairs of each batch. Its size is padded_src * padded_tgt, and we fill this tensor with values at the corresponding indices.
So, when the error is raised here, I suspect there may be a misaligned example in the data, i.e. a mismatch between a src-tgt sentence pair and its alignments.
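A rough, simplified sketch of the idea (pure Python, not the actual make_batch_align_matrix code, which builds a sparse torch tensor): parse "src-tgt" index pairs and mark them in a padded_src x padded_tgt grid. An alignment index that falls outside the padded sentence lengths is exactly the kind of mismatch suspected here.

```python
# Simplified illustration of building one alignment matrix from a line
# of fast_align-style "s-t" pairs. Not OpenNMT's implementation.
def build_align_matrix(align_line, padded_src, padded_tgt):
    matrix = [[0.0] * padded_tgt for _ in range(padded_src)]
    for pair in align_line.split():
        s, t = (int(i) for i in pair.split("-"))
        if s >= padded_src or t >= padded_tgt:
            # A misaligned src/tgt/alignment triple would trip this check.
            raise ValueError(f"pair {pair} outside {padded_src}x{padded_tgt}")
        matrix[s][t] = 1.0
    return matrix

m = build_align_matrix("0-0 1-2", padded_src=3, padded_tgt=4)
assert m[0][0] == 1.0 and m[1][2] == 1.0
```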
I don't know whether you produced the alignments with GIZA++ or fast_align. Maybe you should check your data for blank lines in the src/tgt files or in the alignments...
Hi @Zenglinxiao,
I am using fast_align. I have made sure that there are no blank lines on the src/tgt side (otherwise fast_align throws an error). The resulting alignment output had some blank lines, which I simply replaced with 0-0. Maybe some mismatch was introduced by this procedure. I will check my data and report back on the results.
Btw, would it be possible to share the scripts/tools/commands you used to get the alignments from GIZA++ or fast_align?
Hello,
I use the scripts in lilt/alignment-scripts, with some commands to delete blank alignments.
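Deleting blank alignments here means dropping the whole sentence triple, keeping the src, tgt, and alignment files in sync, rather than replacing blanks with 0-0. A hypothetical helper (not from lilt/alignment-scripts) could look like:

```python
# Hypothetical cleanup sketch: drop every (src, tgt, align) triple whose
# alignment line is blank, so all three files stay line-aligned.
def drop_blank_alignments(src_lines, tgt_lines, align_lines):
    kept = [
        (s, t, a)
        for s, t, a in zip(src_lines, tgt_lines, align_lines)
        if a.strip()
    ]
    return tuple(zip(*kept)) if kept else ((), (), ())

src, tgt, aln = drop_blank_alignments(
    ["a b", "c d"], ["x y", "z w"], ["0-0 1-1", ""]
)
assert aln == ("0-0 1-1",)
```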
I personally recommend starting with the DE-EN experiment, as it comes with test data that helps verify everything is correct. Once training works on the DE-EN dataset, it should also work for your own data.
Good luck.
@Zenglinxiao,
Btw, is it possible to train using the layer-average baseline, or is supervised training (with alignment labels from external statistical toolkits) the only option?
It's all about accuracy. You can use the layer average to get alignments, but they are much less accurate than the supervised version.
To get the alignments of the layer-average baseline, you just need to pass --report_align when calling translate. For a model trained before this feature, this will set the argument --alignment_layer -2; you can change the layer by editing the update_model_opts function in onmt/utils/parse.py.
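The reported alignments come back as index pairs in the same "s-t" text format used by the alignment toolkits. A generic parser sketch (not an OpenNMT API; shown only for working with the pairs downstream):

```python
# Generic parser for alignment pairs like "0-0 1-2 2-1" (pharaoh format).
def parse_alignments(align_str):
    return [tuple(int(i) for i in p.split("-")) for p in align_str.split()]

assert parse_alignments("0-0 1-2 2-1") == [(0, 0), (1, 2), (2, 1)]
```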
We always recommend training a supervised version if you want decent alignments.
Thanks for the detailed answer regarding the average baseline.
Regarding the error I mentioned earlier: there was an issue in the way I was producing the input file for fast_align. After fixing that, training resumed smoothly.