CUDA out of memory - but there's plenty of memory #1866
I can tell from the log that you are not using the latest master version. There have been substantial GPU changes, so please re-do your test with the latest master.
Edited OP to reflect what happens on the latest commit [254a7a7]
I can't reproduce this issue on my machine.
What are the specs of your machine? Which model did you test with?
Hmmm. You have 16GB of RAM but only 12GB of VRAM, if my guess on those GPUs is accurate. Can you confirm whether RAM / VRAM usage aligns with what it should be for the number of layers offloaded?
Yes, I can confirm that it works correctly on my machine.
So the RAM usage aligns or no? As I mentioned, it would appear to work correctly if your RAM capacity weren't an issue. Any suggestions for how else I can test? I've tried a few different models and on different machines, and see the same thing in all cases.
Saw a new build come through (a09f919) - issue persists. If I up my RAM to 64GB it runs fine, like you say. But surely when I have 48GB of VRAM and the model needs 38GB of memory, I shouldn't be using any RAM, should I?
Agreed, it seems counter-intuitive: why would you need RAM if the layers are going to be in VRAM? Why buffer the entire model in RAM before passing it to the GPU in the first place?
@ggerganov any ideas on this one? I'd rather not have to buy RAM to get around a bug 👀 If @JohannesGaessler can't look into this, that's what I'll have to do to run any model that doesn't fit into RAM.
I mean, I can't look into it until I know how to reproduce the issue. Right now I'm just waiting for other people to report the same problem to see if there is a pattern.
Sorry @JohannesGaessler, all I meant was that your test approach isn't going to replicate the issue, because you're not in a situation where you have more VRAM than RAM. E.g. if you can reduce your available system RAM to 8GB or less (perhaps run a memory stress test that lets you set how many GB to use) and load an approx ~10GB model fully offloaded into your 12GB of VRAM, you should be able to replicate it.
I don't see why that would make a difference.
??? Let's start from the top:
This makes no sense.
The entire model is never loaded into RAM when offloading. When CUDA says it's out of memory it's referring to VRAM. My guess is that for some reason the logic for splitting tensors across GPUs doesn't work correctly on your system, so everything gets put onto just one GPU and you run out of memory.
I'm looking at the VRAM utilisation while loading and both cards are doing the same thing, getting filled up to around 18GB out of 24. If you have the time, I'd be happy to let you TeamViewer in or something and take a look. I'm certain I'm not messing this up on my end, and I'm not sure how else I can rule out user error. Like you say, it doesn't make any sense that system RAM is seeing much use at all, let alone being completely filled.
Sorry, but to me fixing this issue simply isn't as urgent as it is to you. I'm perfectly happy to just wait until more people provide information. I am willing to do remote debugging, but not via TeamViewer or similar software; I only do it via SSH or equivalent.
Okay... I didn't say it was urgent to me, and I'm not trying to rush you; just trying to offer my help to solve this. I'm on a different system now, this one with a 4080 16GB and 128GB of RAM. I can load a 65B model with no layers offloaded to GPU and llama.cpp will occupy 56GB of RAM. If I offload 20 layers to GPU (llama.cpp occupies 12GB of VRAM) it will also occupy... 56GB of RAM. That's pretty definitive. If reports from other users are what you need in order to warrant looking into this, I'll see who else I can find to replicate the issue and refer them here 👍
@LoganDark For something like this please make a separate issue rather than commenting on an existing, unrelated issue.
I am also having a similar issue where it seems like it's buffering to system RAM along with filling up the VRAM. I.e., the more GPU layers I offload, the more system RAM it takes up. On a 3090, for example, VRAM doesn't fill up fully (around 10-15GB out of 24) while system RAM usage jumps to almost double. Lowering gpu_layers results in less VRAM usage and an overall memory footprint closer to the model's actual size. Windows 11.
I have a similar problem with 2x4090, but I have 98GB of RAM and it still doesn't work.
This issue was closed because it has been inactive for 14 days since being marked as stale. |
I believe I'm seeing this too, with the official server-cuda image, pulled today, although note the "failed to initialize CUDA", which no one above seemed to mention. I'm running with 128GB RAM, 96GB VRAM (1x3090, 3xP40), loading Meta-Llama-3-70B-Instruct.Q5_K_M.gguf.

This is with:

```
$ sudo lsmod | grep nvidia_uvm
```

and the `llama-cpp` compose service, and I get:

```
llama-cpp | ggml_cuda_init: failed to initialize CUDA: unknown error
```

But this works fine:

```
$ docker compose up check-gpu
```
TLDR: When offloading all layers to GPU, RAM usage is the same as if no layers were offloaded. In situations where VRAM is sufficient to load the model but RAM is not, a CUDA out-of-memory error occurs even though there is plenty of VRAM still available.
System specs
OS: Windows + conda
CPU: 13900K
RAM: 32GB DDR5
GPU: 2x RTX 3090 (48GB total VRAM)
When trying to load a 65B ggml 4bit model, regardless of how many layers I offload to GPU, system RAM is filled and I get a CUDA out of memory error.
I've tried with all 80 layers offloaded to GPUs, and with no layers offloaded to the GPUs at all, and the RAM usage doesn't change in either scenario. There is still about 12GB total VRAM free when the out of memory error is thrown.
Screenshot of RAM / VRAM usage with all layers offloaded to GPUs: https://i.imgur.com/vTl04qL.png
Interestingly, the system RAM usage hits a ceiling while loading the model, but the error isn't thrown until the end of the loading sequence. If I had to guess at what's happening, I would say llama.cpp isn't doing garbage collection on the buffer contents. When CUDA goes to use some system memory, it can't see any as available and so crashes.
Bonus: Without `-ngl` set, loading succeeds and I actually get a few tokens' worth of inference before

```
CUDA error 2 at D:\AI\llama.cpp\ggml-cuda.cu:994: out of memory
```

is thrown. The model needs ~38GB of RAM and I only have 32GB, so I assume it's using the swapfile, but with no layers offloaded it's odd that an error still comes from CUDA.