How to coordinate GPU memory usage

I’m training a model on four GPUs, and the GPU memory usage is shown below. I want to know why the memory usage of GPU 1 and GPU 2 is very high while the memory usage of GPU 0 and GPU 3 is low. What is the reason, and how can I coordinate the GPU memory usage? Thanks very much.

Every 0.5s: nvidia-smi        Wed Nov 15 21:05:53 2017

Wed Nov 15 21:05:53 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Graphics Device     Off  | 0000:04:00.0     Off |                  N/A |
| 38%   63C    P2    93W / 250W |   5161MiB / 11172MiB |     52%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Graphics Device     Off  | 0000:05:00.0     Off |                  N/A |
| 42%   70C    P2    97W / 250W |  11119MiB / 11172MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Graphics Device     Off  | 0000:06:00.0     Off |                  N/A |
| 40%   67C    P2    93W / 250W |  10163MiB / 11172MiB |     90%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Graphics Device     Off  | 0000:07:00.0     Off |                  N/A |
| 43%   71C    P2   124W / 250W |   9233MiB / 11172MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

At which epoch did you report these numbers?

The GPU memory usage is about the same from epoch 1 to epoch 6.

You should try running the training with:

export THC_CACHING_ALLOCATOR=0

It will give more interpretable numbers.
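For example, assuming the training script is called train.py (a hypothetical name, since the actual script isn't shown in this thread), the flag can also be set for a single run instead of exporting it:

THC_CACHING_ALLOCATOR=0 python train.py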

OK, thanks very much!
I have been training the model for one week, and I still want to know why the memory usage of GPU 1 and GPU 2 remains very high while the memory usage of GPU 0 and GPU 3 remains low. Thanks!

It is expected that one GPU has higher memory usage, as it keeps the storage used to copy parameters to and from the other replicas. But without the flag above, the numbers are difficult to interpret, since some memory caching is done at lower levels.
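For context, here is a minimal sketch of the usual nn.DataParallel setup that produces this kind of imbalance (the actual training code isn't shown in this thread, so the model, batch size, and device ordering below are hypothetical). The device given as output_device (by default the first entry of device_ids) is where the inputs are scattered from and where the outputs and gradients are gathered, so it ends up holding extra memory compared with the other replicas; choosing a different output_device only moves that extra load to another card rather than removing it.

import torch
import torch.nn as nn

# Hypothetical placeholder model; the thread does not show the real one.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))

# Replicate the model across the 4 GPUs; GPU 0 here is the "primary" device
# that holds the master copy of the parameters and gathers the outputs.
model = nn.DataParallel(model.cuda(0), device_ids=[0, 1, 2, 3], output_device=0)

inputs = torch.randn(64, 1024, device="cuda:0")  # the batch is split across the 4 GPUs
outputs = model(inputs)                          # per-GPU outputs are gathered back onto GPU 0
loss = outputs.sum()
loss.backward()                                  # gradients are reduced back onto GPU 0 as well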