I am new to OpenNMT and am experimenting with the Transformer model in OpenNMT-tf. I trained the default Transformer model with the provided transformer_1gpu.yml settings.
Now I am trying to run inference with the default settings (batch_size = 32). I sorted the input by the number of tokens per sentence, as suggested in this thread (Low performance on Inference with the transformer model on single GPU). However, the system stops responding within a few minutes of starting inference. As a test, I trimmed the input file to a single sentence containing a single token, but the memory of the Python process still climbs quickly until it exhausts all 15 GB of available RAM, at which point the system becomes unresponsive. I am using a Google Compute Engine instance with 15 GB of RAM, 4 vCPUs, and an NVIDIA Tesla K80 with about 12 GB of GPU memory.
Is this behavior expected for inference even with a single input sentence? Any tips to resolve the issue would be appreciated.
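For reference, the sort-by-token-count preprocessing mentioned above can be sketched roughly as follows. This is only a minimal illustration, not the exact script used: it assumes whitespace-separated tokens, and `sort_by_length` and `restore_order` are hypothetical helper names, not part of OpenNMT-tf.

```python
from typing import List, Tuple


def sort_by_length(lines: List[str]) -> Tuple[List[str], List[int]]:
    """Sort sentences by token count (assumes whitespace tokenization).

    Returns the sorted sentences and, for each sorted position, the index
    the sentence had in the original input.
    """
    # sorted() is stable, so sentences of equal length keep their input order.
    order = sorted(range(len(lines)), key=lambda i: len(lines[i].split()))
    return [lines[i] for i in order], order


def restore_order(outputs: List[str], order: List[int]) -> List[str]:
    """Put per-sentence outputs back into the original input order."""
    restored = [""] * len(outputs)
    for position, original_index in enumerate(order):
        restored[original_index] = outputs[position]
    return restored
```

The sorted sentences would be written to the file passed to inference, and `restore_order` applied to the predictions afterwards so they line up with the original input.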