OpenNMT-tf memory error

aquorio15 · May 14, 2022, 8:12am

I have been trying to train a Transformer model, but it keeps giving me an error. My log file is attached below:
https://drive.google.com/file/d/1LQ7-67l2NKbcWqs_MfAHUGIk6Kr7B6sl/view?usp=sharing

guillaumekln · May 16, 2022, 8:23am

2022-05-14 08:02:16.682724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5154 MB memory: → device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:06:00.0, compute capability: 7.0

TensorFlow could only create the device with 5154 MB on a 32 GB V100, so it looks like another process is using memory on the first GPU.
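
One way to confirm this, as a rough sketch: ask `nvidia-smi` for the memory use of each GPU and for the compute processes currently holding memory. The helper name `show_gpu_usage` is only illustrative and not part of OpenNMT-tf; the query fields are standard `nvidia-smi` options, and the function just prints a notice on machines where the tool is not installed.

```shell
#!/bin/sh
# show_gpu_usage is a hypothetical helper name, not part of OpenNMT-tf.
# It prints memory use per GPU and the compute processes holding GPU memory.
show_gpu_usage() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found on this machine"
        return 0
    fi
    # One CSV row per GPU: index, name, used memory, total memory.
    nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv
    # One CSV row per compute process that currently holds GPU memory.
    nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
}

show_gpu_usage
```

A GPU whose used memory is already close to its total before training starts is most likely occupied by another job.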

aquorio15 · May 16, 2022, 9:39am

Hi @guillaumekln, can you tell me how to change the GPU ID? I tried using CUDA_VISIBLE_DEVICES, but it is not working. Is there a way to set this in the code or the YAML file, as in OpenNMT-py?

Thank you
Command Line:
CUDA_VISIBLE_DEVICES=1 onmt-main --model_type Transformer --config data.yaml --auto_config train --with_eval

guillaumekln · May 16, 2022, 9:41am

Why is CUDA_VISIBLE_DEVICES not working?

aquorio15 · May 16, 2022, 9:57am

I am running the command line from a .job file, and if I use CUDA_VISIBLE_DEVICES it gives me an invalid command error.
The .job file is attached for your reference:
https://drive.google.com/file/d/1z2ZIyKuWjxVL68YV1XrJe4zcUymZOfXy/view?usp=sharing

guillaumekln · May 16, 2022, 10:10am

Your usage of NV_GPU with nvidia-docker should also work.
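
As a hedged aside on the "invalid command" error: `VAR=value command` is shell syntax, so it only works where a shell parses the line. If the assignment ends up as the first word handed to another launcher (for example right after the image name in `docker run`), it is treated as the program name. The sketch below uses `env` to avoid that, with a harmless `sh -c` standing in for `onmt-main`. The helper `run_on_gpu` and the placeholder `IMAGE` are assumptions for illustration and are not taken from the .job file.

```shell
#!/bin/sh
# run_on_gpu is a hypothetical helper: run a command with only the given
# GPU index visible to CUDA. It uses env, which is a real program, so the
# same "env VAR=value command" form also works where no shell parses the line.
run_on_gpu() {
    gpu="$1"
    shift
    env CUDA_VISIBLE_DEVICES="$gpu" "$@"
}

# Stand-in for onmt-main: print what the child process sees.
run_on_gpu 1 sh -c 'echo "child sees CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'

# Possible Docker variants (not executed here; IMAGE is a placeholder):
#   NV_GPU=1 nvidia-docker run --rm IMAGE onmt-main [options]
#   docker run --gpus device=1 --rm IMAGE onmt-main [options]
```

With `NV_GPU` or `--gpus`, the selected card is typically exposed inside the container as device 0, so no further setting is needed in the training command.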

However, the last command looks incorrect because you are running the OpenNMT-tf script inside a PyTorch container.

aquorio15 · May 16, 2022, 10:23am

Can you explain a bit more clearly, @guillaumekln? I did not understand, because I have been running the OpenNMT-tf command line for a while and it did not give me any error.

guillaumekln · May 16, 2022, 10:29am

nvidia-docker run --name olr2021 -v /iitdh/PhD/jagabandhu/:/home -w /home --rm nvcr.io/nvidia/pytorch:21.09-py3 onmt-main --model_type Transformer --config data.yaml --auto_config train --with_eval

This command line is using a PyTorch Docker image, but OpenNMT-tf is using TensorFlow. You probably used a different command line before?
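
A quick way to check which framework a container actually provides, sketched with a hypothetical helper `has_python_module`: run it inside the container before launching training. It assumes `python3` is on the PATH, and it only tests whether the module can be located, not whether it works with the GPU.

```shell
#!/bin/sh
# has_python_module is an illustrative helper, not part of OpenNMT-tf.
# It succeeds (exit status 0) if python3 can locate the named top-level module.
has_python_module() {
    python3 -c 'import importlib.util, sys
sys.exit(0 if importlib.util.find_spec(sys.argv[1]) else 1)' "$1"
}

# tensorflow and opennmt are needed by OpenNMT-tf; torch is listed only to
# show which framework the image was built for.
for module in tensorflow torch opennmt; do
    if has_python_module "$module"; then
        echo "$module: available"
    else
        echo "$module: missing"
    fi
done
```

If `tensorflow` or `opennmt` is reported as missing, the image is probably not the one the earlier runs used.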