Server memory creep?

I’ve noticed with both the REST and 0mq server versions that when translation is finished, the memory usage of luajit creeps up slowly with each job. I’ve done more extensive testing on the 0mq version than REST, but I verified that the issue occurs with both.

I have a model that I’ve been testing with that is around 1.2GB in memory when I first launch my server (in an nvidia-docker). I then monitor using nvidia-smi -l 1 while I send requests of varying batch sizes to the server. The memory usage goes up, then comes back down, but never to baseline. It seems to increase with each request, and the larger the batch, the greater the creep.

Is there some caching going on? If so, is there a way to clear it at the end of a request?

Hello David,

I am only using the REST one, but extensively, so I can tell you what is happening.

Try using the TH_CACHING_ALLOCATOR=0 setting.

It will help to release the memory.

However, there is still a small fixed amount of memory that is never released until the .lua process is killed.

Not a big deal, but just so you know.


Thanks @vince62s! I either did something wrong, or that’s not working for the 0mq server…

I put ENV TH_CACHING_ALLOCATOR=0 in my Dockerfile and rebuilt the image. I sent through some batches of 300, and mem use is still 2.5GB (up from 1.2) when it’s done… :confused:

I looked at the zmq code and it can’t work. As a matter of fact, for the REST version I handled this in the new PR here: https://github.com/OpenNMT/OpenNMT/pull/196

The point is the following:
this line https://github.com/OpenNMT/OpenNMT/blob/master/tools/translation_server.lua#L79
loads the model.
Releasing it requires setting translator = nil and calling collectgarbage(),
but the code is not structured that way.
In the new REST version, I “unload” the model after a timeout and that releases the memory.
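
For illustration, here is a minimal sketch of that unload idea (not the actual code of the PR). It assumes the translator is created with onmt.translate.Translator.new as in tools/translation_server.lua, and the load_model/unload_model names are just placeholders:

```lua
-- Hypothetical sketch: unload the model after a timeout (or between requests)
-- so the GPU memory backing it can be reclaimed.
require('onmt.init')  -- same initialization as tools/translation_server.lua

local translator = nil

local function load_model(opt)
  -- Same call as translation_server.lua; check the exact signature for your version.
  translator = onmt.translate.Translator.new(opt)
end

local function unload_model()
  -- Dropping the reference alone is not enough: collectgarbage() must run
  -- so the Torch tensors backing the model are actually finalized.
  translator = nil
  collectgarbage()
  collectgarbage()  -- a second pass finalizes userdata whose __gc ran in the first
  -- Note: with the default caching allocator the freed GPU blocks stay in
  -- Torch's cache; launch luajit with TH_CACHING_ALLOCATOR=0 if you want
  -- them returned to the driver.
end
```

With a scheme like this, the server has to reload the model on the next request (paying the load time again), which is the trade-off the timeout-based unload makes.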


Hmm… yeah… that’s less than optimal. A Translator method to do GC would be helpful, but I’m a lua neophyte.

@guillaumekln - is that a doable feature request? I’d prefer not to have to make a new instance of a Translator object periodically to avoid the memory creep. How do you guys do this for production translation servers?

What happens when you send some batches of 300 again? Is it still growing?

No, it seems to hover around 2.5GB. But if I send it a batch that’s too big, it throws an error but keeps running. Then when I reduce the batch size, I can no longer get translations even though the mem usage falls back down from the max to ~4GB.

(E.g., send 300: OK; another 300: OK; 500: error; then even 10: error.)

The memory usage is proportional to the maximum batch size and it will grow each time you send a bigger batch. Then the memory usage should be stable.

Is 300 an arbitrary or selected value? It is quite large (with the default beam size of 5, that is an actual batch of 300 × 5 = 1500 sequences!) and I remember GPU translation losing efficiency well before this value, mostly because of the beam search.


It’s arbitrary; I was experimenting to see how large a batch size I could get away with given the constraints of model size & available memory. It’s certainly much faster to run 1000 segments in 300-segment batches (35 sec) than in 30-segment batches (49 sec). :slight_smile:

I do still worry about the inability of both REST & 0mq servers to recover from a batch that’s too big.