
CUDA out of memory - but there's plenty of memory #1866

Closed
Energiz3r opened this issue Jun 15, 2023 · 24 comments

@Energiz3r

Energiz3r commented Jun 15, 2023

TLDR: When offloading all layers to GPU, RAM usage is the same as if no layers were offloaded. In situations where VRAM is sufficient to load the model but RAM is not, a CUDA out-of-memory error occurs even though there is plenty of VRAM still available.

System specs
OS: Windows + conda
CPU: 13900K
RAM: 32GB DDR5
GPU: 2x RTX 3090 (48GB total VRAM)

When trying to load a 65B ggml 4bit model, regardless of how many layers I offload to GPU, system RAM is filled and I get a CUDA out of memory error.

I've tried with all 80 layers offloaded to GPUs, and with no layers offloaded to the GPUs at all, and the RAM usage doesn't change in either scenario. There is still about 12GB total VRAM free when the out of memory error is thrown.

Screenshot of RAM / VRAM usage with all layers offloaded to GPUs: https://i.imgur.com/vTl04qL.png

Interestingly, the system RAM usage hits a ceiling while loading the model, but the error isn't thrown until the end of the loading sequence. If I had to guess at what's happening, I would say llama.cpp isn't freeing the host-side buffer contents after they've been loaded. When CUDA then goes to use some system memory, it can't see any available and crashes.
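
To illustrate the pattern I'd expect for offloaded tensors (a rough sketch only, not llama.cpp's actual loader; the helper name and sizes are made up): the host staging copy gets released as soon as the upload finishes, so resident RAM stays bounded by roughly one tensor at a time.

// Rough sketch, not llama.cpp's actual loader: stream one tensor from the
// model file to the GPU through a temporary host buffer, then free the
// host copy immediately so RAM usage stays bounded by a single tensor.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper; real tensor names/sizes come from the model file.
void* upload_tensor(FILE* f, size_t nbytes) {
    void* host = std::malloc(nbytes);                  // temporary staging buffer
    if (!host || std::fread(host, 1, nbytes, f) != nbytes) {
        std::free(host);
        return nullptr;
    }
    void* dev = nullptr;
    if (cudaMalloc(&dev, nbytes) != cudaSuccess) {
        std::free(host);
        return nullptr;
    }
    cudaMemcpy(dev, host, nbytes, cudaMemcpyHostToDevice);
    std::free(host);                                   // host copy released right away
    return dev;                                        // only the VRAM copy remains
}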

E:\llama.cpp release 254a7a7>main -t 8 -n -1 -ngl 80 --color -c 2048 --temp 0.7 --repeat_penalty 1.2 --mirostat 2 --interactive-first  -m ../models/ggml-LLaMa-65B-quantized/ggml-LLaMa-65B-q4_0.bin -i -ins
main: build = 670 (254a7a7)
main: seed  = 1686799791
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090
  Device 1: NVIDIA GeForce RTX 3090
llama.cpp: loading model from ../models/ggml-LLaMa-65B-quantized/ggml-LLaMa-65B-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.18 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llama_model_load_internal: mem required  = 10814.46 MB (+ 5120.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 64 layers to GPU
llama_model_load_internal: total VRAM used: 28308 MB
....................................................................................................
llama_init_from_file: kv self size  = 5120.00 MB

system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling: repeat_last_n = 64, repeat_penalty = 1.200000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.700000, mirostat = 2, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 2


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

CUDA error 2 at D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:2342: out of memory

Bonus: Without -ngl set, loading succeeds and I actually get a few tokens' worth of inference before CUDA error 2 at D:\AI\llama.cpp\ggml-cuda.cu:994: out of memory is thrown. The model needs ~38GB of RAM and I only have 32GB, so I assume it's using the swapfile, but with no layers offloaded it's odd that an error still comes from CUDA.

@JohannesGaessler
Collaborator

I can tell from the log that you are not using the latest master version. There have been substantial GPU changes so please re-do your test with the latest master version.

@Energiz3r
Author

Energiz3r commented Jun 15, 2023

Edited OP to reflect what happens on the latest commit [254a7a7]

@JohannesGaessler
Collaborator

I can't reproduce this issue on my machine.

@Energiz3r
Author

> I can't reproduce this issue on my machine.

What are the specs of your machine? Which model did you test with?

@JohannesGaessler
Collaborator

$ neofetch
johannesg@johannes-ms7850
-------------------------
OS: Manjaro Linux x86_64
Host: MS-7850 1.0
Kernel: 6.3.0-1-MANJARO
Uptime: 27 mins
Packages: 1100 (pacman)
Shell: zsh 5.9
Terminal: /dev/pts/2
CPU: Intel i5-4570S (4) @ 3.600GHz
GPU: NVIDIA GeForce GTX 1050 Ti
GPU: NVIDIA GeForce GTX 1070
Memory: 362MiB / 15921MiB

$ ./main --model models/opt/llama-${model_size}-ggml-${quantization}.bin --ignore-eos --n_predict 128 --ctx_size 2048 --batch_size 512 --seed 1337 --threads 4 --gpu_layers 32 --mlock | tee chat.txt
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 670 (254a7a7)
main: seed  = 1337
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1070
  Device 1: NVIDIA GeForce GTX 1050 Ti
llama.cpp: loading model from models/opt/llama-33b-ggml-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0,13 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce GTX 1070) as main device
llama_model_load_internal: mem required  = 10570,53 MB (+ 3124,00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloaded 32/63 layers to GPU
llama_model_load_internal: total VRAM used: 9699 MB
....................................................................................................
llama_init_from_file: kv self size  = 3120,00 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 128, n_keep = 0


 ← The Writer’s Block: A Video Q&A with Kathleen Duey
The Writer’s Block: A Video Q&A with Shannon Hale →
by keplertalk | September 26, 2012 · 3:14 pm
Blog Tour Kick-Off: The Dark Unwinding by Sharon Cameron
As we have mentioned in the past, we here at KEPLER’S BOOKS LOVE to read. So what better way to spend our days than helping to put great books into people’s hands?
llama_print_timings:        load time = 100207,50 ms
llama_print_timings:      sample time =    89,00 ms /   128 runs   (    0,70 ms per token)
llama_print_timings: prompt eval time =  1473,93 ms /     2 tokens (  736,96 ms per token)
llama_print_timings:        eval time = 103572,14 ms /   127 runs   (  815,53 ms per token)
llama_print_timings:       total time = 105181,02 ms

@Energiz3r
Author

Energiz3r commented Jun 16, 2023

Hmmm. You have 16GB of RAM but only 12GB of VRAM, if my guess on those GPUs is accurate. Can you confirm whether RAM / VRAM usage aligns with what it should be for the number of layers offloaded?

@JohannesGaessler
Collaborator

Yes, I can confirm that it works correctly on my machine.

@Energiz3r
Author

Energiz3r commented Jun 16, 2023

So does the RAM usage align or not? As I mentioned, it would appear to work correctly as long as your RAM capacity isn't the limiting factor. Any suggestions for how else I can test? I've tried a few different models and on different machines and see the same thing in all cases.

@Energiz3r
Author

Energiz3r commented Jun 16, 2023

Saw a new build come through (a09f919) and the issue persists. If I up my RAM to 64GB it runs fine, like you say. But surely when I have 48GB of VRAM and the model needs 38GB of memory I shouldn't be using any RAM, should I?

@hmage

hmage commented Jun 17, 2023

Agreed, it seems counter-intuitive: why would you need RAM if the layers are going to end up in VRAM? Why buffer the entire model in RAM before passing it to the GPU in the first place?

@Energiz3r
Author

@ggerganov any ideas on this one? I'd rather not have to buy RAM to get around a bug 👀 If @JohannesGaessler can't look into this, that's what I'll have to do to run any model that doesn't fit into RAM.

@JohannesGaessler
Collaborator

I mean, I can't look into it until I know how to reproduce the issue. Right now I'm just waiting for other people to report the same problem to see if there is a pattern.

@Energiz3r
Author

Energiz3r commented Jun 17, 2023

Sorry @JohannesGaessler, all I meant was that your test approach isn't going to replicate the issue, because you're not in a situation where you have more VRAM than RAM.

E.g. if you can reduce your available system RAM to 8GB or less (perhaps by running a memory stress test that lets you set how many GB to occupy, something like the sketch below) and then load an approximately 10GB model fully offloaded into your 12GB of VRAM, you should be able to replicate it.
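
A quick hypothetical sketch of such a RAM-eater (any off-the-shelf memory stress tool that lets you pick a size would do the same job):

// Allocate N GiB (default 8), touch every byte so the memory is actually
// resident rather than just reserved, then hold it until the process is killed.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <chrono>
#include <thread>
#include <vector>

int main(int argc, char** argv) {
    const size_t gib = argc > 1 ? std::strtoull(argv[1], nullptr, 10) : 8;
    std::vector<char*> blocks;
    for (size_t i = 0; i < gib; ++i) {
        char* p = static_cast<char*>(std::malloc(1ull << 30));      // one 1 GiB block
        if (!p) break;                                              // stop when RAM runs out
        std::memset(p, 1, 1ull << 30);                              // touch it so it is resident
        blocks.push_back(p);
        std::printf("holding %zu GiB\n", blocks.size());
    }
    for (;;) std::this_thread::sleep_for(std::chrono::seconds(60)); // hold until killed
}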

@JohannesGaessler
Collaborator

I don't see why that would make a difference.

@Energiz3r
Author

Energiz3r commented Jun 17, 2023

???

Let's start from the top:

  • I have a 38GB model. I have 48GB of VRAM, but only 32GB of RAM, so I cannot run it on CPU
  • I fully offload it to the GPUs
  • I get a CUDA out-of-memory error
  • There is ~12GB of VRAM free when the error is thrown
  • System RAM is completely full

This makes no sense.

@JohannesGaessler
Collaborator

The entire model is never loaded into RAM when offloading. When CUDA says it's out of memory it's referring to VRAM. My guess is that for some reason the logic for splitting tensors across GPUs doesn't work correctly on your system so everything gets put onto just one GPU and you run out of memory.
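
To make that guess concrete, here is a toy sketch (not the actual ggml-cuda implementation; device sizes and row counts are made up) of how the rows of one tensor would be divided across devices in proportion to their VRAM. If the proportions ever degenerate, one GPU ends up holding everything while the other sits nearly empty:

// Toy illustration of proportional tensor splitting across devices.
// Not the real ggml-cuda code; device sizes and row counts are hypothetical.
#include <cstdio>
#include <vector>

int main() {
    const std::vector<double> vram_gib = {24.0, 24.0};  // e.g. 2x RTX 3090
    const int nrows = 8192;                             // rows of one weight matrix

    double total = 0;
    for (double v : vram_gib) total += v;

    int assigned = 0;
    for (size_t d = 0; d < vram_gib.size(); ++d) {
        // the last device takes the remainder so every row is covered exactly once
        const int rows = (d + 1 == vram_gib.size())
                             ? nrows - assigned
                             : static_cast<int>(nrows * vram_gib[d] / total);
        std::printf("device %zu gets rows [%d, %d)\n", d, assigned, assigned + rows);
        assigned += rows;
    }
    return 0;
}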

@Energiz3r
Author

Energiz3r commented Jun 17, 2023

I'm looking at the VRAM utilisation while loading and both cards are doing the same thing, filling up to around 18GB out of 24.

If you have the time I'd be happy to let you TeamViewer in or something and take a look. I'm certain I'm not messing this up on my end, and I'm not sure how else I can rule out user error. Like you say, it doesn't make any sense that system RAM is seeing much use at all, let alone being completely filled.

@JohannesGaessler
Collaborator

Sorry but to me fixing this issue simply isn't as urgent as it is to you. I'm perfectly happy with just waiting until more people provide information. I am willing to do remote debugging but not via Teamviewer or similar software. I only do it via SSH or equivalent.

@Energiz3r
Author

Energiz3r commented Jun 20, 2023

Okay... I didn't say it was urgent to me, and I'm not trying to rush you. I'm just trying to offer my help to solve this.

I'm on a different system now, this one with a 4080 16GB and 128GB of RAM. I can load a 65B model with no layers offloaded to GPU and llama.cpp will occupy 56GB of RAM. If I offload 20 layers to GPU (llama.cpp occupies 12GB of VRAM) it will also occupy... 56GB of RAM. That's pretty definitive.

If reports from other users are what you need in order to warrant looking into this, I'll see who else I can find to replicate the issue and refer them here 👍

@JohannesGaessler
Collaborator

@LoganDark For something like this please make a separate issue rather than commenting on an existing, unrelated issue.

@Mradr

Mradr commented Jun 27, 2023

I am also having a similar issue where it seems like it's buffering into system RAM as well as filling up VRAM; that is, the more GPU layers I use, the more system RAM it takes up. On a 3090, for example, VRAM doesn't fill up fully (around 10-15GB out of 24) while system RAM usage jumps up to almost double. Lowering gpu_layers results in less VRAM usage and an overall memory footprint closer to the size the model should actually take.

Windows 11
wizardLM-13B-Uncensored.ggmlv3.q4_0.bin
CUDA supported
32GB of RAM, 3090 video card 24 GB of VRAM

@ex3ndr

ex3ndr commented Dec 22, 2023

I have a similar problem with 2x 4090s, but I have 98GB of RAM and it still doesn't work.

@github-actions bot added the stale label Mar 25, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@lee-b

lee-b commented Jun 25, 2024

I believe I'm seeing this too, with the official server-cuda image pulled today, although note the "failed to initialize CUDA" error, which no one above seemed to mention.

I'm running with 128GB RAM, 96GB VRAM (1x3090, 3xP40), loading Meta-Llama-3-70B-Instruct.Q5_K_M.gguf.

This is with:

$ sudo lsmod | grep nvidia_uvm
nvidia_uvm 1380352 0
nvidia 56410112 55 nvidia_uvm,nvidia_modeset

and:

llama-cpp:
  image: ghcr.io/ggerganov/llama.cpp:server-cuda
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            capabilities: [gpu]
            count: 4

and I get:

llama-cpp | ggml_cuda_init: failed to initialize CUDA: unknown error
...
llama-cpp | llm_load_tensors: offloading 80 repeating layers to GPU
llama-cpp | llm_load_tensors: offloading non-repeating layers to GPU
llama-cpp | llm_load_tensors: offloaded 81/81 layers to GPU
llama-cpp | llm_load_tensors: CPU buffer size = 47628.36 MiB
llama-cpp | ...................................................................................................
llama-cpp | llama_new_context_with_model: n_ctx = 8192
llama-cpp | llama_new_context_with_model: n_batch = 2048
llama-cpp | llama_new_context_with_model: n_ubatch = 512
llama-cpp | llama_new_context_with_model: flash_attn = 0
llama-cpp | llama_new_context_with_model: freq_base = 500000.0
llama-cpp | llama_new_context_with_model: freq_scale = 1
llama-cpp | ggml_cuda_host_malloc: failed to allocate 2560.00 MiB of pinned memory: unknown error
llama-cpp | llama_kv_cache_init: CPU KV buffer size = 2560.00 MiB
llama-cpp | llama_new_context_with_model: KV self size = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB
llama-cpp | ggml_cuda_host_malloc: failed to allocate 0.98 MiB of pinned memory: unknown error
llama-cpp | llama_new_context_with_model: CPU output buffer size = 0.98 MiB
llama-cpp | ggml_cuda_host_malloc: failed to allocate 1104.01 MiB of pinned memory: unknown error
llama-cpp | llama_new_context_with_model: CUDA_Host compute buffer size = 1104.01 MiB

But this works fine:

check-gpu:
  image: nvidia/cuda:11.4.3-runtime-ubuntu20.04
  command: nvidia-smi
  profiles: ["check-gpu"]
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            capabilities: [gpu]

$ docker compose up check-gpu
[+] Running 1/0
✔ Container solution-check-gpu-1 Created 0.0s
Attaching to check-gpu-1
check-gpu-1 |
check-gpu-1 | ==========
check-gpu-1 | == CUDA ==
check-gpu-1 | ==========
check-gpu-1 |
check-gpu-1 | CUDA Version 11.4.3
check-gpu-1 |
check-gpu-1 | Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
check-gpu-1 |
check-gpu-1 | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
check-gpu-1 | By pulling and using the container, you accept the terms and conditions of this license:
check-gpu-1 | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
check-gpu-1 |
check-gpu-1 | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
check-gpu-1 |
check-gpu-1 | Tue Jun 25 18:07:29 2024
check-gpu-1 | +-----------------------------------------------------------------------------+
check-gpu-1 | | NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
check-gpu-1 | |-------------------------------+----------------------+----------------------+
check-gpu-1 | | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
check-gpu-1 | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
check-gpu-1 | | | | MIG M. |
check-gpu-1 | |===============================+======================+======================|
check-gpu-1 | | 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
check-gpu-1 | | 0% 38C P8 8W / 350W | 1MiB / 24576MiB | 0% Default |
check-gpu-1 | | | | N/A |
check-gpu-1 | +-------------------------------+----------------------+----------------------+
check-gpu-1 | | 1 Tesla P40 Off | 00000000:0E:00.0 Off | Off |
check-gpu-1 | | N/A 35C P8 9W / 250W | 0MiB / 24576MiB | 0% Default |
check-gpu-1 | | | | N/A |
check-gpu-1 | +-------------------------------+----------------------+----------------------+
check-gpu-1 | | 2 Tesla P40 Off | 00000000:12:00.0 Off | Off |
check-gpu-1 | | N/A 36C P8 9W / 250W | 0MiB / 24576MiB | 0% Default |
check-gpu-1 | | | | N/A |
check-gpu-1 | +-------------------------------+----------------------+----------------------+
check-gpu-1 | | 3 Tesla P40 Off | 00000000:17:00.0 Off | Off |
check-gpu-1 | | N/A 34C P8 10W / 250W | 0MiB / 24576MiB | 0% Default |
check-gpu-1 | | | | N/A |
check-gpu-1 | +-------------------------------+----------------------+----------------------+
check-gpu-1 |
check-gpu-1 | +-----------------------------------------------------------------------------+
check-gpu-1 | | Processes: |
check-gpu-1 | | GPU GI CI PID Type Process name GPU Memory |
check-gpu-1 | | ID ID Usage |
check-gpu-1 | |=============================================================================|
check-gpu-1 | | No running processes found |
check-gpu-1 | +-----------------------------------------------------------------------------+
check-gpu-1 exited with code 0
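
As an aside on the ggml_cuda_host_malloc warnings above: the wording suggests a try-pinned-then-fall-back pattern roughly like the sketch below. This is an illustrative guess at the shape of that code, not the actual ggml-cuda implementation, and the "unknown error" most likely shares a root cause with the earlier ggml_cuda_init failure rather than being a real memory shortage.

// Illustrative sketch only (not the actual ggml-cuda code): prefer pinned
// (page-locked) host memory for faster host<->device transfers, and fall
// back to ordinary pageable memory with a warning if CUDA refuses.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

void* host_malloc_prefer_pinned(size_t nbytes) {
    void* ptr = nullptr;
    const cudaError_t err = cudaMallocHost(&ptr, nbytes);        // pinned allocation
    if (err != cudaSuccess) {
        std::fprintf(stderr,
                     "warning: failed to allocate %zu bytes of pinned memory: %s\n",
                     nbytes, cudaGetErrorString(err));
        return std::malloc(nbytes);                              // pageable fallback
    }
    return ptr;
}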
