
CUDA out of memory - but there's plenty of memory #1866

Closed
Energiz3r opened this issue Jun 15, 2023 · 24 comments

@Energiz3r

Energiz3r commented Jun 15, 2023

TLDR: When offloading all layers to GPU, RAM usage is the same as if no layers were offloaded. In situations where VRAM is sufficient to load the model but RAM is not, a CUDA out-of-memory error occurs even though there is plenty of VRAM still available.

System specs
OS: Windows + conda
CPU: 13900K
RAM: 32GB DDR5
GPU: 2x RTX 3090 (48GB total VRAM)

When trying to load a 65B ggml 4bit model, regardless of how many layers I offload to GPU, system RAM is filled and I get a CUDA out of memory error.

I've tried with all 80 layers offloaded to GPUs, and with no layers offloaded to the GPUs at all, and the RAM usage doesn't change in either scenario. There is still about 12GB total VRAM free when the out of memory error is thrown.

Screenshot of RAM / VRAM usage with all layers offloaded to GPUs: https://i.imgur.com/vTl04qL.png

Interestingly, the system RAM usage hits a ceiling while loading the model, but the error isn't thrown until the end of the loading sequence. If I had to guess at what's happening, I would say llama.cpp isn't freeing the host-side buffer contents after they've been loaded. When CUDA then goes to use some system memory, it can't see any available and crashes.
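
To illustrate the pattern I'd expect for offloaded tensors (a rough sketch only, not llama.cpp's actual loader; the helper name and sizes are made up): the host staging copy gets released as soon as the upload finishes, so resident RAM stays bounded by roughly one tensor at a time.

// Rough sketch, not llama.cpp's actual loader: stream one tensor from the
// model file to the GPU through a temporary host buffer, then free the
// host copy immediately so RAM usage stays bounded by a single tensor.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper; real tensor names/sizes come from the model file.
void* upload_tensor(FILE* f, size_t nbytes) {
    void* host = std::malloc(nbytes);                  // temporary staging buffer
    if (!host || std::fread(host, 1, nbytes, f) != nbytes) {
        std::free(host);
        return nullptr;
    }
    void* dev = nullptr;
    if (cudaMalloc(&dev, nbytes) != cudaSuccess) {
        std::free(host);
        return nullptr;
    }
    cudaMemcpy(dev, host, nbytes, cudaMemcpyHostToDevice);
    std::free(host);                                   // host copy released right away
    return dev;                                        // only the VRAM copy remains
}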

E:\llama.cpp release 254a7a7>main -t 8 -n -1 -ngl 80 --color -c 2048 --temp 0.7 --repeat_penalty 1.2 --mirostat 2 --interactive-first  -m ../models/ggml-LLaMa-65B-quantized/ggml-LLaMa-65B-q4_0.bin -i -ins
main: build = 670 (254a7a7)
main: seed  = 1686799791
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090
  Device 1: NVIDIA GeForce RTX 3090
llama.cpp: loading model from ../models/ggml-LLaMa-65B-quantized/ggml-LLaMa-65B-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.18 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llama_model_load_internal: mem required  = 10814.46 MB (+ 5120.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 64 layers to GPU
llama_model_load_internal: total VRAM used: 28308 MB
....................................................................................................
llama_init_from_file: kv self size  = 5120.00 MB

system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling: repeat_last_n = 64, repeat_penalty = 1.200000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.700000, mirostat = 2, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 2


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

CUDA error 2 at D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:2342: out of memory

Bonus: Without -ngl set, loading succeeds and I actually get a few tokens' worth of inference before CUDA error 2 at D:\AI\llama.cpp\ggml-cuda.cu:994: out of memory is thrown. The model needs ~38GB of RAM and I only have 32GB, so I assume it's using the swapfile, but with no layers offloaded it's odd that an error still comes from CUDA.

@JohannesGaessler
Collaborator

I can tell from the log that you are not using the latest master version. There have been substantial GPU changes so please re-do your test with the latest master version.

@Energiz3r
Author

Energiz3r commented Jun 15, 2023

Edited OP to reflect what happens on the latest commit [254a7a7]

@JohannesGaessler
Collaborator

I can't reproduce this issue on my machine.

@Energiz3r
Author

> I can't reproduce this issue on my machine.

What are the specs of your machine? Which model did you test with?

@JohannesGaessler
Collaborator

$ neofetch
johannesg@johannes-ms7850
-------------------------
OS: Manjaro Linux x86_64
Host: MS-7850 1.0
Kernel: 6.3.0-1-MANJARO
Uptime: 27 mins
Packages: 1100 (pacman)
Shell: zsh 5.9
Terminal: /dev/pts/2
CPU: Intel i5-4570S (4) @ 3.600GHz
GPU: NVIDIA GeForce GTX 1050 Ti
GPU: NVIDIA GeForce GTX 1070
Memory: 362MiB / 15921MiB

$ ./main --model models/opt/llama-${model_size}-ggml-${quantization}.bin --ignore-eos --n_predict 128 --ctx_size 2048 --batch_size 512 --seed 1337 --threads 4 --gpu_layers 32 --mlock | tee chat.txt
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 670 (254a7a7)
main: seed  = 1337
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1070
  Device 1: NVIDIA GeForce GTX 1050 Ti
llama.cpp: loading model from models/opt/llama-33b-ggml-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0,13 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce GTX 1070) as main device
llama_model_load_internal: mem required  = 10570,53 MB (+ 3124,00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloaded 32/63 layers to GPU
llama_model_load_internal: total VRAM used: 9699 MB
....................................................................................................
llama_init_from_file: kv self size  = 3120,00 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 128, n_keep = 0


 ← The Writer’s Block: A Video Q&A with Kathleen Duey
The Writer’s Block: A Video Q&A with Shannon Hale →
by keplertalk | September 26, 2012 · 3:14 pm
Blog Tour Kick-Off: The Dark Unwinding by Sharon Cameron
As we have mentioned in the past, we here at KEPLER’S BOOKS LOVE to read. So what better way to spend our days than helping to put great books into people’s hands?
llama_print_timings:        load time = 100207,50 ms
llama_print_timings:      sample time =    89,00 ms /   128 runs   (    0,70 ms per token)
llama_print_timings: prompt eval time =  1473,93 ms /     2 tokens (  736,96 ms per token)
llama_print_timings:        eval time = 103572,14 ms /   127 runs   (  815,53 ms per token)
llama_print_timings:       total time = 105181,02 ms

@Energiz3r
Author

Energiz3r commented Jun 16, 2023

Hmmm. You have 16GB of RAM but only 12GB of VRAM, if my guess on those GPUs is accurate. Can you confirm whether RAM / VRAM usage aligns with what it should be for the number of layers offloaded?

@JohannesGaessler
Collaborator

Yes, I can confirm that it works correctly on my machine.

@Energiz3r
Author

Energiz3r commented Jun 16, 2023

So does the RAM usage align or not? As I mentioned, it would appear to work correctly as long as your RAM capacity isn't the limiting factor. Any suggestions for how else I can test? I've tried a few different models and on different machines and see the same thing in all cases.

@Energiz3r
Author

Energiz3r commented Jun 16, 2023

Saw a new build come through (a09f919) and the issue persists. If I up my RAM to 64GB it runs fine, like you say. But surely when I have 48GB of VRAM and the model needs 38GB of memory I shouldn't be using any RAM, should I?

@hmage

hmage commented Jun 17, 2023

Agreed, it seems counter-intuitive: why would you need RAM if the layers are going to end up in VRAM? Why buffer the entire model in RAM before passing it to the GPU in the first place?

@Energiz3r
Author

@ggerganov any ideas on this one? I'd rather not have to buy RAM to get around a bug 👀 If @JohannesGaessler can't look into this, that's what I'll have to do to run any model that doesn't fit into RAM.

@JohannesGaessler
Collaborator

I mean, I can't look into it until I know how to reproduce the issue. Right now I'm just waiting for other people to report the same problem to see if there is a pattern.

@Energiz3r
Author

Energiz3r commented Jun 17, 2023

Sorry @JohannesGaessler, all I meant was that your test approach isn't going to replicate the issue, because you're not in a situation where you have more VRAM than RAM.

E.g. if you can reduce your available system RAM to 8GB or less (perhaps by running a memory stress test that lets you set how many GB to occupy, something like the sketch below) and then load an approximately 10GB model fully offloaded into your 12GB of VRAM, you should be able to replicate it.
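
A quick hypothetical sketch of such a RAM-eater (any off-the-shelf memory stress tool that lets you pick a size would do the same job):

// Allocate N GiB (default 8), touch every byte so the memory is actually
// resident rather than just reserved, then hold it until the process is killed.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <chrono>
#include <thread>
#include <vector>

int main(int argc, char** argv) {
    const size_t gib = argc > 1 ? std::strtoull(argv[1], nullptr, 10) : 8;
    std::vector<char*> blocks;
    for (size_t i = 0; i < gib; ++i) {
        char* p = static_cast<char*>(std::malloc(1ull << 30));      // one 1 GiB block
        if (!p) break;                                              // stop when RAM runs out
        std::memset(p, 1, 1ull << 30);                              // touch it so it is resident
        blocks.push_back(p);
        std::printf("holding %zu GiB\n", blocks.size());
    }
    for (;;) std::this_thread::sleep_for(std::chrono::seconds(60)); // hold until killed
}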

@JohannesGaessler
Collaborator

I don't see why that would make a difference.

@Energiz3r
Author

Energiz3r commented Jun 17, 2023

???

Let's start from the top:

  • I have a 38GB model. I have 48GB of VRAM, but only 32GB of RAM, so I cannot run it on CPU
  • I fully offload it to the GPUs
  • I get a CUDA out-of-memory error
  • There is ~12GB of VRAM free when the error is thrown
  • System RAM is completely full

This makes no sense.

@JohannesGaessler
Collaborator

The entire model is never loaded into RAM when offloading. When CUDA says it's out of memory it's referring to VRAM. My guess is that for some reason the logic for splitting tensors across GPUs doesn't work correctly on your system so everything gets put onto just one GPU and you run out of memory.
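
To make that guess concrete, here is a toy sketch (not the actual ggml-cuda implementation; device sizes and row counts are made up) of how the rows of one tensor would be divided across devices in proportion to their VRAM. If the proportions ever degenerate, one GPU ends up holding everything while the other sits nearly empty:

// Toy illustration of proportional tensor splitting across devices.
// Not the real ggml-cuda code; device sizes and row counts are hypothetical.
#include <cstdio>
#include <vector>

int main() {
    const std::vector<double> vram_gib = {24.0, 24.0};  // e.g. 2x RTX 3090
    const int nrows = 8192;                             // rows of one weight matrix

    double total = 0;
    for (double v : vram_gib) total += v;

    int assigned = 0;
    for (size_t d = 0; d < vram_gib.size(); ++d) {
        // the last device takes the remainder so every row is covered exactly once
        const int rows = (d + 1 == vram_gib.size())
                             ? nrows - assigned
                             : static_cast<int>(nrows * vram_gib[d] / total);
        std::printf("device %zu gets rows [%d, %d)\n", d, assigned, assigned + rows);
        assigned += rows;
    }
    return 0;
}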

@Energiz3r
Author

Energiz3r commented Jun 17, 2023

I'm looking at the VRAM utilisation while loading and both cards are doing the same thing, filling up to around 18GB out of 24.

If you have the time I'd be happy to let you TeamViewer in or something and take a look. I'm certain I'm not messing this up on my end, and I'm not sure how else I can rule out user error. Like you say, it doesn't make any sense that system RAM is seeing much use at all, let alone being completely filled.

@JohannesGaessler
Collaborator

Sorry but to me fixing this issue simply isn't as urgent as it is to you. I'm perfectly happy with just waiting until more people provide information. I am willing to do remote debugging but not via Teamviewer or similar software. I only do it via SSH or equivalent.

@Energiz3r
Author

Energiz3r commented Jun 20, 2023

Okay... I didn't say it was urgent to me, and I'm not trying to rush you. I'm just trying to offer my help to solve this.

I'm on a different system now, this one with a 4080 16GB and 128GB of RAM. I can load a 65B model with no layers offloaded to GPU and llama.cpp will occupy 56GB of RAM. If I offload 20 layers to GPU (llama.cpp occupies 12GB of VRAM) it will also occupy... 56GB of RAM. That's pretty definitive.

If reports from other users are what you need in order to warrant looking into this, I'll see who else I can find to replicate the issue and refer them here 👍

@JohannesGaessler
Collaborator

@LoganDark For something like this please make a separate issue rather than commenting on an existing, unrelated issue.

@Mradr

Mradr commented Jun 27, 2023

I am also having a similar issue where it seems like it's buffering into system RAM as well as filling up VRAM; that is, the more GPU layers I use, the more system RAM it takes up. On a 3090, for example, VRAM doesn't fill up fully (around 10-15GB out of 24) while system RAM usage jumps up to almost double. Lowering gpu_layers results in less VRAM usage and an overall memory footprint closer to the size the model should actually take.

Windows 11
wizardLM-13B-Uncensored.ggmlv3.q4_0.bin
CUDA supported
32GB of RAM, 3090 video card 24 GB of VRAM

@ex3ndr

ex3ndr commented Dec 22, 2023

I have a similar problem with 2x 4090s, but I have 98GB of RAM and it still doesn't work.

@github-actions bot added the stale label Mar 25, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@lee-b

lee-b commented Jun 25, 2024

I believe I'm seeing this too, with the official server-cuda image pulled today, although note the "failed to initialize CUDA" error, which no one above seemed to mention.

I'm running with 128GB RAM, 96GB VRAM (1x3090, 3xP40), loading Meta-Llama-3-70B-Instruct.Q5_K_M.gguf.

This is with:

$ sudo lsmod | grep nvidia_uvm
nvidia_uvm 1380352 0
nvidia 56410112 55 nvidia_uvm,nvidia_modeset

and:

llama-cpp:
  image: ghcr.io/ggerganov/llama.cpp:server-cuda
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            capabilities: [gpu]
            count: 4

and I get:

llama-cpp | ggml_cuda_init: failed to initialize CUDA: unknown error
...
llama-cpp | llm_load_tensors: offloading 80 repeating layers to GPU
llama-cpp | llm_load_tensors: offloading non-repeating layers to GPU
llama-cpp | llm_load_tensors: offloaded 81/81 layers to GPU
llama-cpp | llm_load_tensors: CPU buffer size = 47628.36 MiB
llama-cpp | ...................................................................................................
llama-cpp | llama_new_context_with_model: n_ctx = 8192
llama-cpp | llama_new_context_with_model: n_batch = 2048
llama-cpp | llama_new_context_with_model: n_ubatch = 512
llama-cpp | llama_new_context_with_model: flash_attn = 0
llama-cpp | llama_new_context_with_model: freq_base = 500000.0
llama-cpp | llama_new_context_with_model: freq_scale = 1
llama-cpp | ggml_cuda_host_malloc: failed to allocate 2560.00 MiB of pinned memory: unknown error
llama-cpp | llama_kv_cache_init: CPU KV buffer size = 2560.00 MiB
llama-cpp | llama_new_context_with_model: KV self size = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB
llama-cpp | ggml_cuda_host_malloc: failed to allocate 0.98 MiB of pinned memory: unknown error
llama-cpp | llama_new_context_with_model: CPU output buffer size = 0.98 MiB
llama-cpp | ggml_cuda_host_malloc: failed to allocate 1104.01 MiB of pinned memory: unknown error
llama-cpp | llama_new_context_with_model: CUDA_Host compute buffer size = 1104.01 MiB

But this works fine:

check-gpu:
  image: nvidia/cuda:11.4.3-runtime-ubuntu20.04
  command: nvidia-smi
  profiles: ["check-gpu"]
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            capabilities: [gpu]

$ docker compose up check-gpu
[+] Running 1/0
✔ Container solution-check-gpu-1 Created 0.0s
Attaching to check-gpu-1
check-gpu-1 |
check-gpu-1 | ==========
check-gpu-1 | == CUDA ==
check-gpu-1 | ==========
check-gpu-1 |
check-gpu-1 | CUDA Version 11.4.3
check-gpu-1 |
check-gpu-1 | Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
check-gpu-1 |
check-gpu-1 | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
check-gpu-1 | By pulling and using the container, you accept the terms and conditions of this license:
check-gpu-1 | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
check-gpu-1 |
check-gpu-1 | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
check-gpu-1 |
check-gpu-1 | Tue Jun 25 18:07:29 2024
check-gpu-1 | +-----------------------------------------------------------------------------+
check-gpu-1 | | NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
check-gpu-1 | |-------------------------------+----------------------+----------------------+
check-gpu-1 | | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
check-gpu-1 | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
check-gpu-1 | | | | MIG M. |
check-gpu-1 | |===============================+======================+======================|
check-gpu-1 | | 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
check-gpu-1 | | 0% 38C P8 8W / 350W | 1MiB / 24576MiB | 0% Default |
check-gpu-1 | | | | N/A |
check-gpu-1 | +-------------------------------+----------------------+----------------------+
check-gpu-1 | | 1 Tesla P40 Off | 00000000:0E:00.0 Off | Off |
check-gpu-1 | | N/A 35C P8 9W / 250W | 0MiB / 24576MiB | 0% Default |
check-gpu-1 | | | | N/A |
check-gpu-1 | +-------------------------------+----------------------+----------------------+
check-gpu-1 | | 2 Tesla P40 Off | 00000000:12:00.0 Off | Off |
check-gpu-1 | | N/A 36C P8 9W / 250W | 0MiB / 24576MiB | 0% Default |
check-gpu-1 | | | | N/A |
check-gpu-1 | +-------------------------------+----------------------+----------------------+
check-gpu-1 | | 3 Tesla P40 Off | 00000000:17:00.0 Off | Off |
check-gpu-1 | | N/A 34C P8 10W / 250W | 0MiB / 24576MiB | 0% Default |
check-gpu-1 | | | | N/A |
check-gpu-1 | +-------------------------------+----------------------+----------------------+
check-gpu-1 |
check-gpu-1 | +-----------------------------------------------------------------------------+
check-gpu-1 | | Processes: |
check-gpu-1 | | GPU GI CI PID Type Process name GPU Memory |
check-gpu-1 | | ID ID Usage |
check-gpu-1 | |=============================================================================|
check-gpu-1 | | No running processes found |
check-gpu-1 | +-----------------------------------------------------------------------------+
check-gpu-1 exited with code 0
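
As an aside on the ggml_cuda_host_malloc warnings above: the wording suggests a try-pinned-then-fall-back pattern roughly like the sketch below. This is an illustrative guess at the shape of that code, not the actual ggml-cuda implementation, and the "unknown error" most likely shares a root cause with the earlier ggml_cuda_init failure rather than being a real memory shortage.

// Illustrative sketch only (not the actual ggml-cuda code): prefer pinned
// (page-locked) host memory for faster host<->device transfers, and fall
// back to ordinary pageable memory with a warning if CUDA refuses.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

void* host_malloc_prefer_pinned(size_t nbytes) {
    void* ptr = nullptr;
    const cudaError_t err = cudaMallocHost(&ptr, nbytes);        // pinned allocation
    if (err != cudaSuccess) {
        std::fprintf(stderr,
                     "warning: failed to allocate %zu bytes of pinned memory: %s\n",
                     nbytes, cudaGetErrorString(err));
        return std::malloc(nbytes);                              // pageable fallback
    }
    return ptr;
}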
