AIO - memory issue - embedding #2126

Open

shuther opened this issue Apr 25, 2024 · 4 comments
Labels: bug (Something isn't working), unconfirmed

Comments

@shuther

shuther commented Apr 25, 2024

LocalAI version:
container image: AIO Cuda12-latest

Environment, CPU architecture, OS, and Version:
Ubuntu 22.04 (latest) VM
NVIDIA GeForce RTX 2060

Describe the bug
I get a memory error when switching between models and testing multiple prompts.
The error occurs for embeddings, while image generation works fine.

curl http://linuxmain.local:8445/embeddings \
  -X POST -H "Content-Type: application/json" \
  -d '{
      "input": "Your text string goes here",
      "model": "text-embedding-ada-002"
    }'

{"error":{"code":500,"message":"could not load model (no success): Unexpected err=OutOfMemoryError('CUDA out of memory. Tried to allocate 46.00 MiB. GPU 0 has a total capacty of 5.62 GiB of which 55.50 MiB is free. Process 46 has 0 bytes memory in use. Process 52 has 0 bytes memory in use. Process 122 has 0 bytes memory in use. Process 158 has 0 bytes memory in use. Process 223 has 0 bytes memory in use. Process 303 has 0 bytes memory in use. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF'), type(err)=\u003cclass 'torch.cuda.OutOfMemoryError'\u003e","type":""}}

To Reproduce
Run all the curl tests published in the documentation.

Expected behavior
No error; previously loaded models should be evicted when memory pressure gets too high.

Logs

localai-docker-api-1  | curl http://localhost:8080/embeddings -X POST -H "Content-Type: application/json" -d '{
localai-docker-api-1  |   "input": "Your text string goes here",
localai-docker-api-1  |   "model": "text-embedding-ada-002"
localai-docker-api-1  | }'}
localai-docker-api-1  | 8:36AM INF Loading model 'all-MiniLM-L6-v2' with backend sentencetransformers
localai-docker-api-1  | 8:36AM DBG Loading model in memory from file: /build/models/all-MiniLM-L6-v2
localai-docker-api-1  | 8:36AM DBG Loading Model all-MiniLM-L6-v2 with gRPC (file: /build/models/all-MiniLM-L6-v2) (backend: sentencetransformers): {backendString:sentencetransformers model:all-MiniLM-L6-v2 threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc001c7ee00 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
localai-docker-api-1  | 8:36AM DBG Loading external backend: /build/backend/python/sentencetransformers/run.sh
localai-docker-api-1  |
localai-docker-api-1  | 8:36AM DBG Loading GRPC Process: /build/backend/python/sentencetransformers/run.sh
localai-docker-api-1  | 8:36AM DBG GRPC Service for all-MiniLM-L6-v2 will be running at: '127.0.0.1:44963'
localai-docker-api-1  | 8:36AM DBG GRPC Service state dir: /tmp/go-processmanager596413675
localai-docker-api-1  | 8:36AM DBG GRPC Service Started
localai-docker-api-1  | 8:36AM DBG GRPC(all-MiniLM-L6-v2-127.0.0.1:44963): stderr /opt/conda/envs/transformers/lib/python3.11/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
localai-docker-api-1  | 8:36AM DBG GRPC(all-MiniLM-L6-v2-127.0.0.1:44963): stderr   warnings.warn(
localai-docker-api-1  | 8:36AM DBG GRPC(all-MiniLM-L6-v2-127.0.0.1:44963): stderr Server started. Listening on: 127.0.0.1:44963
localai-docker-api-1  | 8:36AM DBG GRPC Service Ready
localai-docker-api-1  | 8:36AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:all-MiniLM-L6-v2 ContextSize:512 Seed:2025187195 NBatch:512 F16Memory:false MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:4 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/build/models/all-MiniLM-L6-v2 Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
localai-docker-api-1  | [172.30.0.1]:41244 500 - POST /embeddings

Additional context
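
The error above suggests tuning PYTORCH_CUDA_ALLOC_CONF (max_split_size_mb). A rough, untested sketch of how that could be passed to the container (the image tag and port mapping below are my assumptions, adjust to your setup):

# Sketch only: forward the allocator setting suggested by the PyTorch error to the container.
# The image tag is an assumption of the AIO CUDA 12 image, not taken from my actual setup.
docker run -d --gpus all -p 8445:8080 \
  -e PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 \
  localai/localai:latest-aio-gpu-nvidia-cuda-12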

@shuther added the bug (Something isn't working) and unconfirmed labels on Apr 25, 2024
@localai-bot
Contributor

Hello shuther,

Thank you for reporting this issue. I apologize for the inconvenience you're experiencing with the LocalAI container image. Based on the information you've provided, it seems this is a memory allocation issue on your VM.

Could you please provide the output of the following command to help us gather more details about your GPU and system capabilities?

nvidia-smi

This command will give us information about the GPU usage and memory details.

Additionally, please try to run the command you provided but with the following modification:

CUDA_LAUNCH_BLOCKING=0

This will disable CUDA launch blocking, which might help to prevent the GPU memory error. You should run the command like this:

CUDA_LAUNCH_BLOCKING=0 curl http://linuxmain.local:8445/embeddings \
  -X POST -H "Content-Type: application/json" \
  -d '{"input": "Your text string goes here", "model": "text-embedding-ada-002}"'

I understand that this issue is crucial for your project, and I assure you that we'll do our best to investigate this problem and provide a solution as soon as possible.

Best regards,

The Github Bot (Experiment of @mudler)

@shuther
Author

shuther commented Apr 25, 2024

It works fine if I stop the Docker container and start it again; we likely need a better eviction approach when switching models?
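
For now the workaround is a full restart of the stack (a rough sketch, assuming the compose project from my logs; names may differ):

# Workaround sketch: restart the LocalAI stack to free the GPU memory.
# "localai-docker" is inferred from the container name in my logs (localai-docker-api-1).
cd localai-docker
docker compose down      # stops the api container and releases VRAM
docker compose up -d     # starts fresh; embeddings work again until memory fills up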

@shuther
Author

shuther commented Apr 25, 2024

Extra logs:
nvidia-smi # right after launching the Docker container (initial load)

Thu Apr 25 11:14:44 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2060        Off | 00000000:13:00.0  On |                  N/A |
| 38%   38C    P8              16W / 160W |    258MiB /  6144MiB |     22%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+


+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2158      G   /usr/lib/xorg/Xorg                          131MiB |
|    0   N/A  N/A      2686      G   /usr/bin/gnome-shell                         67MiB |
|    0   N/A  N/A      3376      G   /usr/bin/nextcloud                            3MiB |
|    0   N/A  N/A     24782      G   ...30092458,1701102826035513081,262144       50MiB |
+---------------------------------------------------------------------------------------+

I also spotted this error:

localai-docker-api-1  | 9:15AM INF Trying to load the model '5c7cd056ecf9a4bb5b527410b97f48cb' with all the available backends: llama-cpp, llama-ggml, gpt4all, bert-embeddings, rwkv, whisper, stablediffusion, tinydream, piper, /build/backend/python/vall-e-x/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/diffusers/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/vllm/run.sh, /build/backend/python/exllama2/run.sh, /build/backend/python/bark/run.sh, /build/backend/python/transformers/run.sh, /build/backend/python/autogptq/run.sh, /build/backend/python/coqui/run.sh, /build/backend/python/mamba/run.sh, /build/backend/python/transformers-musicgen/run.sh, /build/backend/python/petals/run.sh, /build/backend/python/exllama/run.sh
localai-docker-api-1  | 9:15AM INF [llama-cpp] Attempting to load
localai-docker-api-1  | 9:15AM INF Loading model '5c7cd056ecf9a4bb5b527410b97f48cb' with backend llama-cpp
localai-docker-api-1  | 9:15AM DBG Loading model in memory from file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb
localai-docker-api-1  | 9:15AM DBG Loading Model 5c7cd056ecf9a4bb5b527410b97f48cb with gRPC (file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb) (backend: llama-cpp): {backendString:llama-cpp model:5c7cd056ecf9a4bb5b527410b97f48cb threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0000bae00 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
localai-docker-api-1  | 9:15AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-cpp
localai-docker-api-1  | 9:15AM DBG GRPC Service for 5c7cd056ecf9a4bb5b527410b97f48cb will be running at: '127.0.0.1:44089'
localai-docker-api-1  | 9:15AM INF [llama-cpp] Fails: fork/exec /tmp/localai/backend_data/backend-assets/grpc/llama-cpp: permission denied
localai-docker-api-1  | 9:15AM INF [llama-ggml] Attempting to load
localai-docker-api-1  | 9:15AM DBG GRPC Service for 5c7cd056ecf9a4bb5b527410b97f48cb will be running at: '127.0.0.1:44789'
localai-docker-api-1  | 9:15AM INF [rwkv] Fails: fork/exec /tmp/localai/backend_data/backend-assets/grpc/rwkv: permission denied
...
localai-docker-api-1  | 9:15AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/whisper
localai-docker-api-1  | 9:15AM DBG GRPC Service for 5c7cd056ecf9a4bb5b527410b97f48cb will be running at: '127.0.0.1:42503'
localai-docker-api-1  | 9:15AM INF [whisper] Fails: fork/exec /tmp/localai/backend_data/backend-assets/grpc/whisper: permission denied
localai-docker-api-1  | 9:15AM INF [stablediffusion] Attempting to load
...
localai-docker-api-1  | 9:15AM INF [/build/backend/python/vall-e-x/run.sh] Fails: grpc process not found: /tmp/localai/backend_data/backend-assets/grpc/build/backend/python/vall-e-x/run.sh. some backends(stablediffusion, tts) require LocalAI compiled with GO_TAGS

Now with LOCALAI_SINGLE_ACTIVE_BACKEND=true the embedding request works.
I would recommend changing the Docker Compose YAML file to load the .env by default (and updating the documentation, since it seems to be a crucial parameter).
Still, shouldn't eviction be attempted when a memory error occurs?
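
Something like this is what I have in mind (just a sketch; whether the compose file already forwards the variable to the container is an assumption on my side):

# Sketch: enable single-active-backend mode before starting the stack.
# This only helps if the compose file forwards the variable to the container
# (e.g. via an env_file or environment entry) - that is the change I'm suggesting.
echo "LOCALAI_SINGLE_ACTIVE_BACKEND=true" >> .env
docker compose down && docker compose up -d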

nvidia-smi

Thu Apr 25 11:19:50 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2060        Off | 00000000:13:00.0  On |                  N/A |
| 38%   39C    P8              13W / 160W |   4422MiB /  6144MiB |     20%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2158      G   /usr/lib/xorg/Xorg                          131MiB |
|    0   N/A  N/A      2686      G   /usr/bin/gnome-shell                         67MiB |
|    0   N/A  N/A      3376      G   /usr/bin/nextcloud                            3MiB |
|    0   N/A  N/A     24782      G   ...30092458,1701102826035513081,262144       50MiB |
|    0   N/A  N/A   1647486      C   python                                        0MiB |
|    0   N/A  N/A   1647698      C   python                                        0MiB |
+---------------------------------------------------------------------------------------+

@jtwolfe
Contributor

jtwolfe commented Apr 26, 2024

I believe that the eviction process is being assessed at the moment; it may be related to #2047 and #2102.
