Recommended parameter values for 'Supervised learning on a specific head'

Hi all,
I am following the instructions here for supervised learning on a specific head. I was wondering what the recommended parameter values are, specifically for alignment_layer and alignment_heads. I am using the standard Transformer model with 6 layers and 8 heads.

@Zenglinxiao, maybe you have some idea?


Also, I just noticed that lambda_align value > 0.0 results in the following error:

RuntimeError: grad can be implicitly created only for scalar outputs

Any suggestions regarding this as well? Btw, I am using the master branch of OpenNMT-py.

Hello ArbinTimilsina,
For guided alignments, I encourage you to take a look at the original paper. I followed all of its experiments to implement and test this feature in OpenNMT. During testing, we used lambda_align = 0.05, as mentioned in the paper.
The best alignment_layer is NOT the same as in the paper. The original implementation uses the Vaswani-style Transformer (from "Attention is all you need"), but the Transformer in OpenNMT is a little different from the original paper (it is a PreNorm Transformer), so the best layer differs too. The paper uses the 5th layer of 6; we found the 4th of 6 to be best in our experiments. Note that we use this argument as a list index, so pass --alignment_layer 3 or --alignment_layer -3 for the 4th of 6 layers.
As for alignment_heads, if you want to supervise a single head, just set --alignment_heads 1. For larger values, we supervise the average across that number of heads, which is what I used for the layer-average baseline during testing.
These values are just what I used during implementation and comparison with the paper. You may always be able to find better ones.
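To make the indexing and head-averaging concrete, here is a small sketch (not OpenNMT-py's actual code) of what supervising on the first N heads of a given layer means, assuming attention tensors of shape (batch, heads, tgt_len, src_len):

```python
import torch

# Hypothetical shapes matching the question: 8 heads per layer.
batch, heads, tgt_len, src_len = 2, 8, 5, 7
attn = torch.softmax(torch.randn(batch, heads, tgt_len, src_len), dim=-1)

# --alignment_layer 3 and --alignment_layer -3 pick the same layer
# out of 6, since the argument is used as a Python list index:
layers = list(range(6))
assert layers[3] == layers[-3]  # both are the 4th of 6 layers

# --alignment_heads 1: supervise a single head.
head_attn = attn[:, :1, :, :].mean(dim=1)

# --alignment_heads 8: average across all heads (layer-average baseline).
avg_attn = attn[:, :8, :, :].mean(dim=1)
```

Since each head's attention rows are softmax-normalized, the average across heads still sums to 1 over the source dimension.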


As for your second question, unfortunately I haven't seen this kind of issue in the tests I've done.
Could you please post the full error traceback so we can see where it goes wrong?

Hello Linxiao,
Thanks a lot for the detailed answer.


@Zenglinxiao, Here is the error:

Traceback (most recent call last):
  File "/home/atimilsina/Work/guided-alignment/OpenNMT-py/onmt/", line 377, in _gradient_accumulation
  File "/home/atimilsina/Work/guided-alignment/OpenNMT-py/onmt/utils/", line 165, in __call__
    for shard in shards(shard_state, shard_size):
  File "/home/atimilsina/Work/guided-alignment/OpenNMT-py/onmt/utils/", line 381, in shards
    torch.autograd.backward(inputs, grads)
  File "/home/atimilsina/.virtualenvs/for-alignment/lib/python3.6/site-packages/torch/autograd/", line 87, in backward
    grad_tensors = _make_grads(tensors, grad_tensors)
  File "/home/atimilsina/.virtualenvs/for-alignment/lib/python3.6/site-packages/torch/autograd/", line 28, in _make_grads
    raise RuntimeError("grad can be implicitly created only for scalar outputs")
RuntimeError: grad can be implicitly created only for scalar outputs
[2020-02-06 15:29:56,317 INFO] At step 1, we removed a batch - accum 0
Traceback (most recent call last):
  File "/home/atimilsina/Work/guided-alignment/OpenNMT-py/onmt/", line 377, in _gradient_accumulation

Looks like the max_generator_batches option is > 0. Would you mind trying with it set to 0?

Also, it could be useful to see your full command line, to see which other arguments might be causing the issue here.

Ah, setting max_generator_batches to 0 gets the training going, but I am seeing the following error in the middle of training:

[2020-02-07 10:58:31,589 INFO] Step 75/50000; acc:  11.48; ppl: 4920.76; xent: 8.50; lr: 0.00001; 5167/5861 tok/s;     58 sec
[2020-02-07 10:58:32,207 INFO] Step 76/50000; acc:  10.00; ppl: 4871.80; xent: 8.49; lr: 0.00001; 4862/6077 tok/s;     59 sec
Traceback (most recent call last):
  File "/home/atimilsina/Work/guided-alignment/OpenNMT-py/onmt/", line 377, in _gradient_accumulation
  File "/home/atimilsina/Work/guided-alignment/OpenNMT-py/onmt/utils/", line 160, in __call__
    shard_state = self._make_shard_state(batch, output, trunc_range, attns)
  File "/home/atimilsina/Work/guided-alignment/OpenNMT-py/onmt/utils/", line 269, in _make_shard_state
    align_idx, align_matrix_size, normalize=True)
  File "/home/atimilsina/Work/guided-alignment/OpenNMT-py/onmt/utils/", line 20, in make_batch_align_matrix
    index_tensor.t(), value_tensor, size=size, device=device).to_dense()
RuntimeError: size is inconsistent with indices: for dim 1, size is 10 but found index 11
[2020-02-07 10:58:32,320 INFO] At step 77, we removed a batch - accum 0
[2020-02-07 10:58:32,641 INFO] Step 77/50000; acc:   4.49; ppl: 5700.81; xent: 8.65; lr: 0.00001; 6863/4416 tok/s;     59 sec

Below are my arguments for training:

#!/usr/bin/env bash


python OpenNMT-py/ -data data/preprocessed_data/preprocessed \
                           -save_model data/saved_model/transformer_model \
                           -train_steps ${N_TRAIN_STEPS} -batch_size ${N_TRAIN_BATCH_SIZE} \
                           -valid_batch_size ${N_VALID_BATCH_SIZE} -valid_steps ${N_VALID_STEPS} \
                           -report_every ${N_REPORT_EVERY} \
                           -save_checkpoint_steps ${N_SAVE_CHECKPOINT_STEPS} -keep_checkpoint ${N_KEEP_CHECKPOINT} \
                           -log_file data/log/train_log.txt \
                           -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
                           -encoder_type transformer -decoder_type transformer -position_encoding \
                           -max_generator_batches 0 -dropout 0.1 \
                           -batch_type tokens -normalization tokens -accum_count 2 \
                           -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
                           -max_grad_norm 0 -param_init 0  -param_init_glorot \
                           -label_smoothing 0.1 \
                           -gpu_ranks 0 \
                           -lambda_align 0.05 \
                           -alignment_layer 3 \
                           -alignment_heads 1

@Zenglinxiao any idea?

In the make_batch_align_matrix function, we create a tensor from the alignment index pairs of each batch. The size argument is padded_src * padded_tgt, and we fill this tensor with values at the corresponding indices.
So when this error is raised, I suspect there are misaligned examples in the data, where a src-tgt sentence pair does not match its alignments.
I don't know whether you produced the alignments with giza++ or fastalign. Maybe you should check your data for blank lines in the src / tgt files or the alignments…

Hi @Zenglinxiao,
I am using fast_align. I have made sure that there are no blank lines on the src/tgt side (otherwise fast_align throws an error). The resulting alignment output had some blank lines, which I simply replaced with 0-0. Maybe there is some mismatch introduced by this procedure. I will check my data and report back on the results.
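For reference, the blank-line patching described above can be done with a one-liner like the sketch below (a hypothetical helper, not part of any toolkit). Note that substituting "0-0" keeps line counts in sync but invents an alignment between the first tokens, which may itself be a source of mismatch:

```python
def patch_blank_alignments(lines):
    # Replace empty alignment lines with a dummy "0-0" pair so the
    # alignment file stays line-aligned with src/tgt.
    return [line if line.strip() else "0-0" for line in lines]
```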

Btw, would it be possible to share the scripts/tools/commands that you used to get the alignments from Giza++ or fast_align?


I use the scripts in lilt/alignment-scripts, with some extra commands to delete blank alignments.
I personally recommend starting with the DE-EN experiments, since they come with test data that helps you verify everything is correct. Once training works on the DE-EN dataset, it should also work on your own data.
Good luck.
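The "delete blank alignments" step mentioned above needs to drop the same lines from all three files to keep them parallel. A hypothetical helper sketching that (any of the three being blank removes the whole triple):

```python
def drop_blank_triples(src, tgt, align):
    # Keep only triples where src, tgt, and alignment are all non-empty,
    # so the three files stay line-aligned after filtering.
    kept = [(s, t, a) for s, t, a in zip(src, tgt, align)
            if s.strip() and t.strip() and a.strip()]
    if not kept:
        return [], [], []
    src_out, tgt_out, align_out = map(list, zip(*kept))
    return src_out, tgt_out, align_out
```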

Thank you. I will give this a try and will let you know how it goes.

Btw, is it possible to train using the layer-average baseline, or is supervised training (with alignment labels from external statistical toolkits) the only option?

It's all about accuracy. You can use the layer average to get alignments, but it is much less accurate than the supervised version.
To get alignments from the layer-average baseline, you just need to specify --report_align when calling translate. For a model trained before this feature, this will set the argument --alignment_layer -2; you can change the layer by editing the update_model_opts function in onmt/utils/
We always recommend training a supervised version if you want decent alignments.

Thanks for the detailed answer regarding the average baseline.

Regarding the error I mentioned earlier: there was an issue in the way I was producing the input file for fast_align. After fixing that, training is running smoothly.

Best regards
