How do you solve GPU memory overflow?

When I start the engine with the th Tools/rest_translation_server.lua command, GPU memory usage increases with each translation. Once the GPU memory is full, the engine can no longer be used.

Has anyone else had this problem? How did you resolve it?


What is the request size you are sending to the server (batch size, sequence length)?

Hello guillaumekln,

All I can see is that a lot of data is being sent; I don't have an exact figure. After a few days, the engine dies, and restarting it manually every time it goes down is too much trouble. Could the program empty the cache automatically, the way SQL Server's caching mechanism does, or could the cache feature be turned off permanently?


Is the client constraining the request size? If not, memory usage increases whenever a request is larger than any request seen previously.
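
To illustrate what constraining the request size on the client could look like, here is a minimal sketch, not part of the server itself. The limits MAX_BATCH_SIZE and MAX_TOKENS, the function names, and the [{"src": ...}] payload shape are all assumptions made for illustration; pick limits that match your own traffic. The idea is that if no request ever exceeds a fixed batch size and sequence length, the buffers stop growing once a request of that maximum size has been seen.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical limits, chosen only for illustration.
MAX_BATCH_SIZE = 16   # maximum number of sentences per request
MAX_TOKENS = 50       # maximum whitespace-separated tokens per sentence


def split_long_sentence(sentence: str, max_tokens: int = MAX_TOKENS) -> List[str]:
    """Cut a sentence into pieces of at most max_tokens whitespace tokens.

    Empty or whitespace-only sentences yield no pieces.
    """
    tokens = sentence.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


def bounded_requests(
    sentences: Iterable[str],
    max_batch: int = MAX_BATCH_SIZE,
    max_tokens: int = MAX_TOKENS,
) -> Iterator[List[Dict[str, str]]]:
    """Yield request payloads that never exceed the given size limits."""
    pieces: List[str] = []
    for sentence in sentences:
        pieces.extend(split_long_sentence(sentence, max_tokens))
    for i in range(0, len(pieces), max_batch):
        # Assumed payload shape: a JSON list of {"src": ...} objects.
        yield [{"src": piece} for piece in pieces[i:i + max_batch]]
```

Each yielded payload can then be sent as one request. Cutting a sentence at a fixed token count can hurt translation quality, so in practice you may prefer to split on punctuation or to reject oversized input instead.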

No, the client does not restrict the request size. But why can't this caching mechanism be turned off?

There are 2 levels of cache:

  1. the CUDA allocator
  2. the internal buffers of the model (e.g. the output of each layer)

For 1., as I already suggested, set THC_CACHING_ALLOCATOR=0 in the environment before starting the server.

For 2., these buffers are not freed for performance reasons and can only grow in size. You can manually release them by calling this method on the main model object: