Multimodal: initialize hidden state of encoder + transformer

Hi, could you please explain how using the image as additional data to initialise the encoder hidden states (Calixto et al., 2017) works when it is implemented with the transformer model?
We are using MultimodalNMT, which is based on the PyTorch port of OpenNMT (OpenNMT-py), an open-source (MIT) neural machine translation system.

python train_mm.py -data dataset/bpe -save_model model/IMGE_ADAM -gpuid 0 -path_to_train_img_feats image_feat/train_vgg19_bn_cnn_features.hdf5 -path_to_valid_img_feats image_feat/valid_vgg19_bn_cnn_features.hdf5 -enc_layers 6 -dec_layers 6 -encoder_type transformer -decoder_type transformer -position_encoding -epochs 300 -dropout 0.1 -batch_size 128 -batch_type tokens -optim adam -learning_rate 0.01 --multimodal_model_type imge

When running the above command, does the system ignore the image features and train a text-only transformer model?
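As a quick sanity check on our side, we thought of inspecting the saved checkpoint for image-related parameters. Below is a minimal sketch; the "model" key, the placeholder checkpoint filename, and the "img"/"image" substring filter are our assumptions about how OpenNMT-py/MultimodalNMT names things, not confirmed API details:

```python
# Sketch: list parameters in a saved checkpoint whose names suggest an
# image-projection layer. The "model" key and the name filter are assumptions
# about the OpenNMT-py / MultimodalNMT checkpoint layout.
import torch

# Hypothetical checkpoint name; substitute the file actually written by -save_model.
ckpt = torch.load("model/IMGE_ADAM_acc_XX.XX_ppl_XX.XX_eYY.pt", map_location="cpu")
state_dict = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt

img_params = {name: tuple(p.shape) for name, p in state_dict.items()
              if "img" in name.lower() or "image" in name.lower()}

if img_params:
    print("Image-related parameters found:")
    for name, shape in img_params.items():
        print(f"  {name}: {shape}")
else:
    print("No image-related parameters found; the image features may be ignored.")
```

If no image-related parameters show up (or if they show up but never receive gradients), that would suggest the multimodal path is inactive for the transformer configuration.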

In Calixto et al. (2017), the authors only discuss the attention mechanism of Bahdanau et al. (2014), not the transformer model.
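For reference, our understanding of the IMGE idea in that paper is roughly the following: a global CNN feature vector of the image is projected through an affine layer and used as the initial hidden state of the recurrent encoder. The sketch below (layer names and sizes are ours, not the library's) illustrates why this mechanism presupposes a recurrent encoder; a transformer encoder has no initial hidden state to set, so it is unclear to us how IMGE would apply:

```python
# Sketch of the IMGE idea (Calixto et al., 2017), with our own names and sizes:
# project a global image feature and use it as the RNN encoder's initial state.
import torch
import torch.nn as nn

class ImageInitEncoder(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=512, hidden_dim=512, img_feat_dim=4096):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        # W_I * q + b_I: affine projection of the global image feature
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)

    def forward(self, src_tokens, img_feats):
        # img_feats: (batch, img_feat_dim) global CNN feature (e.g. a VGG19 layer)
        h0 = torch.tanh(self.img_proj(img_feats)).unsqueeze(0)  # (1, batch, hidden_dim)
        emb = self.embedding(src_tokens)                        # (batch, src_len, emb_dim)
        # The image conditions the encoder only through its initial hidden state.
        outputs, h_n = self.rnn(emb, h0)
        return outputs, h_n

# A transformer encoder is not recurrent, so there is no h0 to initialise;
# the image information would have to enter some other way (e.g. an extra token).
```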

We trained the transformer model text-only (with the train.py script) and multimodal (with the train_mm.py script), but there was no improvement: the BLEU scores are almost the same.

So our assumption is that, even when we use the train_mm.py script with one of the multimodal model types to train a transformer model (as above), the script ignores the multimodal approach and simply trains the text-only version.
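One control experiment we could run to test this assumption: create a copy of the validation/test image features with all values zeroed and translate with it. If the output is identical to the output obtained with the real features, the image features are not influencing the model. A minimal sketch (the HDF5 layout is copied directly from the original file, so no assumptions about dataset names are needed):

```python
# Sketch: write a copy of the image-feature file with all features zeroed,
# to check whether the trained model's output changes at all.
import h5py
import numpy as np

src_path = "image_feat/valid_vgg19_bn_cnn_features.hdf5"
dst_path = "image_feat/valid_vgg19_bn_cnn_features_zeroed.hdf5"

with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
    def copy_zeroed(name, obj):
        # Recreate every dataset with the same shape/dtype, but filled with zeros.
        if isinstance(obj, h5py.Dataset):
            dst.create_dataset(name, data=np.zeros(obj.shape, dtype=obj.dtype))
    src.visititems(copy_zeroed)
```

Translating the validation set once with the original file and once with the zeroed file (passing the corresponding path to the multimodal translation script) and diffing the two outputs would tell us whether the image features are used at all.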

Our goal is to train a multimodal NMT model with a transformer architecture.