Using the Docker image to run localGPT does not work. #480

Closed

chiehpower opened this issue Sep 17, 2023 · 20 comments · May be fixed by #484

Comments

@chiehpower

chiehpower commented Sep 17, 2023

hi @KonradHoeffner and everyone,

I followed the Docker section to build the Docker image and start the container; however, it failed with the error below.

Steps

  1. docker build . -t localgpt
  2. docker run -it --mount src="$HOME/.cache",target=/root/.cache,type=bind --gpus=all localgpt

Error message

localGPT ›› docker run -it --mount src="$HOME/.cache",target=/root/.cache,type=bind --gpus=all localgpt                                                        

==========
== CUDA ==
==========

CUDA Version 11.7.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

2023-09-17 02:43:59,219 - INFO - run_localGPT.py:212 - Running on: cuda
2023-09-17 02:43:59,219 - INFO - run_localGPT.py:213 - Display Source Documents set to: False
2023-09-17 02:43:59,219 - INFO - run_localGPT.py:214 - Use history set to: False
2023-09-17 02:43:59,497 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
Downloading (…)c7233/.gitattributes: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 1.48k/1.48k [00:00<00:00, 12.9MB/s]
Downloading (…)_Pooling/config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 270/270 [00:00<00:00, 2.45MB/s]
Downloading (…)/2_Dense/config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 116/116 [00:00<00:00, 943kB/s]
Downloading pytorch_model.bin: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 3.15M/3.15M [00:00<00:00, 9.00MB/s]
Downloading (…)9fb15c7233/README.md: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 66.3k/66.3k [00:00<00:00, 24.6MB/s]
Downloading (…)b15c7233/config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 1.53k/1.53k [00:00<00:00, 12.0MB/s]
Downloading (…)ce_transformers.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 122/122 [00:00<00:00, 990kB/s]
Downloading pytorch_model.bin: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1.34G/1.34G [02:39<00:00, 8.41MB/s]
Downloading (…)nce_bert_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 53.0/53.0 [00:00<00:00, 433kB/s]
Downloading (…)cial_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 2.20k/2.20k [00:00<00:00, 19.1MB/s]
Downloading spiece.model: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 792k/792k [00:00<00:00, 9.28MB/s]
Downloading (…)c7233/tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 2.42M/2.42M [00:00<00:00, 3.14MB/s]
Downloading (…)okenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 2.41k/2.41k [00:00<00:00, 19.2MB/s]
Downloading (…)15c7233/modules.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 461/461 [00:00<00:00, 4.13MB/s]
load INSTRUCTOR_Transformer
max_seq_length  512
2023-09-17 02:46:51,770 - INFO - posthog.py:16 - Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
2023-09-17 02:46:51,888 - INFO - run_localGPT.py:50 - Loading Model: TheBloke/Llama-2-7b-Chat-GGUF, on: cuda
2023-09-17 02:46:51,888 - INFO - run_localGPT.py:51 - This action can take a few minutes!
2023-09-17 02:46:51,888 - INFO - load_models.py:38 - Using Llamacpp for GGUF/GGML quantized models
Downloading (…)-7b-chat.Q4_K_M.gguf: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 4.08G/4.08G [08:14<00:00, 8.25MB/s]
Traceback (most recent call last):
  File "//run_localGPT.py", line 250, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "//run_localGPT.py", line 220, in main
    qa = retrieval_qa_pipline(device_type, use_history, promptTemplate_type="llama")
  File "//run_localGPT.py", line 139, in retrieval_qa_pipline
    qa = RetrievalQA.from_chain_type(llm=llm, 
  File "/usr/local/lib/python3.10/dist-packages/langchain/chains/retrieval_qa/base.py", line 100, in from_chain_type
    combine_documents_chain = load_qa_chain(
  File "/usr/local/lib/python3.10/dist-packages/langchain/chains/question_answering/__init__.py", line 249, in load_qa_chain
    return loader_mapping[chain_type](
  File "/usr/local/lib/python3.10/dist-packages/langchain/chains/question_answering/__init__.py", line 73, in _load_stuff_chain
    llm_chain = LLMChain(
  File "/usr/local/lib/python3.10/dist-packages/langchain/load/serializable.py", line 74, in __init__
    super().__init__(**kwargs)
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for LLMChain
llm
  none is not an allowed value (type=type_error.none.not_allowed)

Does anyone else encounter this problem?

@finnishbroccoli

Yes, I get the same error message.
I am running this on Arch Linux. For reference, here is my installation history:

python3.10 -m venv /mnt/yhteinen/docker-localgpt
source bin/activate
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.83 --no-cache-dir
sudo docker build . -t localgpt
sudo docker run -it --mount src="$HOME/.cache",target=/root/.cache,type=bind --gpus=all localgpt

.....

2023-09-17 05:54:52,701 - INFO - run_localGPT.py:212 - Running on: cuda
2023-09-17 05:54:52,701 - INFO - run_localGPT.py:213 - Display Source Documents set to: False
2023-09-17 05:54:52,701 - INFO - run_localGPT.py:214 - Use history set to: False
2023-09-17 05:54:52,933 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length 512
2023-09-17 05:54:57,061 - INFO - posthog.py:16 - Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
2023-09-17 05:54:57,154 - INFO - run_localGPT.py:50 - Loading Model: TheBloke/Llama-2-7b-Chat-GGUF, on: cuda
2023-09-17 05:54:57,154 - INFO - run_localGPT.py:51 - This action can take a few minutes!
2023-09-17 05:54:57,154 - INFO - load_models.py:38 - Using Llamacpp for GGUF/GGML quantized models
Downloading (…)-7b-chat.Q4_K_M.gguf: 100%|████████████████████████████████████████████| 4.08G/4.08G [01:02<00:00, 65.0MB/s]
Traceback (most recent call last):
  File "//run_localGPT.py", line 250, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "//run_localGPT.py", line 220, in main
    qa = retrieval_qa_pipline(device_type, use_history, promptTemplate_type="llama")
  File "//run_localGPT.py", line 139, in retrieval_qa_pipline
    qa = RetrievalQA.from_chain_type(llm=llm,
  File "/usr/local/lib/python3.10/dist-packages/langchain/chains/retrieval_qa/base.py", line 100, in from_chain_type
    combine_documents_chain = load_qa_chain(
  File "/usr/local/lib/python3.10/dist-packages/langchain/chains/question_answering/__init__.py", line 249, in load_qa_chain
    return loader_mapping[chain_type](
  File "/usr/local/lib/python3.10/dist-packages/langchain/chains/question_answering/__init__.py", line 73, in _load_stuff_chain
    llm_chain = LLMChain(
  File "/usr/local/lib/python3.10/dist-packages/langchain/load/serializable.py", line 74, in __init__
    super().__init__(**kwargs)
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for LLMChain
llm
  none is not an allowed value (type=type_error.none.not_allowed)

@KonradHoeffner
Contributor

Thanks for the report! I haven't tested the Dockerfile with the new GGUF models and will investigate!

@sp1d3rino

sp1d3rino commented Sep 18, 2023

Found it:
the Dockerfile is missing the llama-cpp-python installation:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

@KonradHoeffner
Contributor

Oh I see, that used to be inside requirements.txt.
Commit 16f949e contains:

  • Removed llamacpp from requirement.txt file. It needs to be installed separately to ensure it supports GPU

I will add it back to the Dockerfile.
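For reference, a minimal sketch of the kind of line that could be added (this is an assumption about the fix, not the actual contents of #484); both variables have to be set on the same RUN instruction so that llama-cpp-python gets compiled with cuBLAS support:

# Dockerfile (sketch): install llama-cpp-python with cuBLAS support, after the other requirements
RUN CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 \
    pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir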

@KonradHoeffner
Contributor

Created a PR at #484, but I currently don't have a GPU to test; please report here and in the PR whether it works @sp1d3rino, @chiehpower, @finnishbroccoli.

@sp1d3rino

I finally tested the Docker image on runpod.io and the GPU worked, but only with this model set in constants.py:

MODEL_ID = "TheBloke/WizardLM-13B-V1.2-GPTQ"
MODEL_BASENAME = "model.safetensors"
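For comparison, a rough sketch of the relevant constants.py lines (the GGUF values are taken from the download log above; the actual layout of the file may differ):

# constants.py (sketch): the GPTQ model that worked on the GPU for me,
# with the default GGUF model from the logs kept as a commented-out alternative
MODEL_ID = "TheBloke/WizardLM-13B-V1.2-GPTQ"
MODEL_BASENAME = "model.safetensors"
# MODEL_ID = "TheBloke/Llama-2-7b-Chat-GGUF"
# MODEL_BASENAME = "llama-2-7b-chat.Q4_K_M.gguf"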

@chiehpower
Author

Created a PR at #484, but I currently don't have a GPU to test; please report here and in the PR whether it works @sp1d3rino, @chiehpower, @finnishbroccoli.

Sure! Thanks for updating. I will test it later and report back here soon.

@KonradHoeffner
Contributor

@sp1d3rino What is the error when you use a GGUF model?

@sp1d3rino

@KonradHoeffner actually there are no errors; the GPU is simply not used. Only when I use a GPTQ model can I see the GPU working.

@KonradHoeffner
Contributor

Can you paste the log?

@sp1d3rino

@KonradHoeffner there is no log ... it works, but it only uses the CPU.

@KonradHoeffner
Contributor

I mean the console output.

@sp1d3rino

This is with GGUF, and it doesn't use the GPU, only the CPU:

2023-09-19 06:49:37,377 - INFO - run_localGPT.py:221 - Running on: cuda
2023-09-19 06:49:37,377 - INFO - run_localGPT.py:222 - Display Source Documents set to: False
2023-09-19 06:49:37,377 - INFO - run_localGPT.py:223 - Use history set to: False
2023-09-19 06:49:37,491 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length 512
2023-09-19 06:49:40,533 - INFO - posthog.py:16 - Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
2023-09-19 06:49:40,599 - INFO - run_localGPT.py:56 - Loading Model: TheBloke/Llama-2-7b-Chat-GGUF, on: cuda
2023-09-19 06:49:40,599 - INFO - run_localGPT.py:57 - This action can take a few minutes!
2023-09-19 06:49:40,599 - INFO - load_models.py:38 - Using Llamacpp for GGUF/GGML quantized models
Downloading (…)-7b-chat.Q4_K_M.gguf: 100%|████████████████████████████████████████████████████████████████████| 4.08G/4.08G [00:48<00:00, 83.7MB/s]
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ./models/models--TheBloke--Llama-2-7b-Chat-GGUF/snapshots/11156f865e384ddb66c191b8e0887f59dedd46cc/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor 0: token_embd.weight q4_K [ 4096, 32000, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 3: blk.0.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 4: blk.0.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 5: blk.0.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 6: blk.0.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 7: blk.0.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 8: blk.0.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 9: blk.0.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 10: blk.1.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 11: blk.1.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 12: blk.1.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 13: blk.1.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 14: blk.1.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 15: blk.1.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 16: blk.1.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 17: blk.1.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 18: blk.1.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 19: blk.10.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 20: blk.10.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 21: blk.10.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 22: blk.10.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 23: blk.10.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 24: blk.10.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 25: blk.10.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 26: blk.10.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 27: blk.10.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 28: blk.11.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 29: blk.11.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 30: blk.11.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 31: blk.11.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 32: blk.11.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 33: blk.11.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 34: blk.11.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 35: blk.11.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 36: blk.11.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 37: blk.12.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 38: blk.12.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 39: blk.12.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 40: blk.12.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 41: blk.12.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 42: blk.12.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 43: blk.12.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 44: blk.12.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 45: blk.12.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 46: blk.13.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 47: blk.13.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 48: blk.13.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 49: blk.13.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 50: blk.13.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 51: blk.13.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 52: blk.13.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 53: blk.13.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 54: blk.13.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 55: blk.14.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 56: blk.14.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 57: blk.14.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 58: blk.14.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 59: blk.14.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 60: blk.14.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 61: blk.14.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 62: blk.14.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 63: blk.14.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 64: blk.15.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 65: blk.15.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 66: blk.15.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 67: blk.15.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 68: blk.15.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 69: blk.15.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 70: blk.15.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 71: blk.15.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 72: blk.15.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 73: blk.16.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 74: blk.16.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 75: blk.16.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 76: blk.16.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 77: blk.16.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 78: blk.16.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 79: blk.16.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 80: blk.16.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 81: blk.16.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 82: blk.17.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 83: blk.17.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 84: blk.17.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 85: blk.17.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 86: blk.17.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 87: blk.17.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 88: blk.17.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 89: blk.17.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 90: blk.17.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 91: blk.18.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 92: blk.18.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 93: blk.18.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 94: blk.18.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 95: blk.18.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 96: blk.18.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 97: blk.18.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 98: blk.18.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 99: blk.18.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 100: blk.19.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 101: blk.19.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 102: blk.19.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 103: blk.19.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 104: blk.19.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 105: blk.19.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 106: blk.19.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 107: blk.19.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 108: blk.19.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 109: blk.2.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 110: blk.2.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 111: blk.2.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 112: blk.2.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 113: blk.2.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 114: blk.2.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 115: blk.2.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 116: blk.2.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 117: blk.2.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 118: blk.20.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 119: blk.20.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 120: blk.20.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 121: blk.20.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 122: blk.20.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 123: blk.20.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 124: blk.20.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 125: blk.20.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 126: blk.20.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 127: blk.21.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 128: blk.21.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 129: blk.21.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 130: blk.21.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 131: blk.21.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 132: blk.21.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 133: blk.21.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 134: blk.21.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 135: blk.21.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 136: blk.22.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 137: blk.22.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 138: blk.22.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 139: blk.22.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 140: blk.22.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 141: blk.22.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 142: blk.22.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 143: blk.22.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 144: blk.22.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 145: blk.23.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 146: blk.23.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 147: blk.23.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 148: blk.23.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 149: blk.23.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 150: blk.23.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 151: blk.23.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 152: blk.23.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 153: blk.23.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 154: blk.3.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 155: blk.3.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 156: blk.3.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 157: blk.3.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 158: blk.3.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 159: blk.3.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 160: blk.3.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 161: blk.3.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 162: blk.3.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 163: blk.4.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 164: blk.4.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 165: blk.4.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 166: blk.4.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 167: blk.4.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 168: blk.4.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 169: blk.4.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 170: blk.4.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 171: blk.4.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 172: blk.5.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 173: blk.5.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 174: blk.5.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 175: blk.5.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 176: blk.5.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 177: blk.5.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 178: blk.5.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 179: blk.5.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 180: blk.5.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 181: blk.6.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 182: blk.6.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 183: blk.6.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 184: blk.6.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 185: blk.6.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 186: blk.6.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 187: blk.6.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 188: blk.6.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 189: blk.6.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 190: blk.7.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 191: blk.7.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 192: blk.7.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 193: blk.7.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 194: blk.7.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 195: blk.7.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 196: blk.7.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 197: blk.7.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 198: blk.7.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 199: blk.8.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 200: blk.8.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 201: blk.8.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 202: blk.8.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 203: blk.8.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 204: blk.8.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 205: blk.8.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 206: blk.8.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 207: blk.8.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 208: blk.9.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 209: blk.9.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 210: blk.9.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 211: blk.9.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 212: blk.9.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 213: blk.9.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 214: blk.9.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 215: blk.9.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 216: blk.9.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 217: output.weight q6_K [ 4096, 32000, 1, 1 ]
llama_model_loader: - tensor 218: blk.24.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 219: blk.24.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 220: blk.24.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 221: blk.24.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 222: blk.24.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 223: blk.24.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 224: blk.24.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 225: blk.24.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 226: blk.24.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 227: blk.25.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 228: blk.25.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 229: blk.25.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 230: blk.25.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 231: blk.25.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 232: blk.25.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 233: blk.25.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 234: blk.25.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 235: blk.25.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 236: blk.26.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 237: blk.26.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 238: blk.26.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 239: blk.26.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 240: blk.26.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 241: blk.26.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 242: blk.26.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 243: blk.26.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 244: blk.26.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 245: blk.27.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 246: blk.27.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 247: blk.27.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 248: blk.27.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 249: blk.27.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 250: blk.27.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 251: blk.27.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 252: blk.27.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 253: blk.27.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 254: blk.28.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 255: blk.28.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 256: blk.28.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 257: blk.28.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 258: blk.28.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 259: blk.28.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 260: blk.28.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 261: blk.28.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 262: blk.28.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 263: blk.29.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 264: blk.29.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 265: blk.29.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 266: blk.29.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 267: blk.29.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 268: blk.29.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 269: blk.29.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 270: blk.29.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 271: blk.29.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 272: blk.30.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 273: blk.30.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 274: blk.30.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 275: blk.30.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 276: blk.30.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 277: blk.30.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 278: blk.30.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 279: blk.30.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 280: blk.30.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 281: blk.31.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 282: blk.31.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 283: blk.31.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 284: blk.31.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 285: blk.31.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 286: blk.31.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 287: blk.31.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 288: blk.31.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 289: blk.31.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 290: output_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: llama.context_length u32
llama_model_loader: - kv 3: llama.embedding_length u32
llama_model_loader: - kv 4: llama.block_count u32
llama_model_loader: - kv 5: llama.feed_forward_length u32
llama_model_loader: - kv 6: llama.rope.dimension_count u32
llama_model_loader: - kv 7: llama.attention.head_count u32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 10: general.file_type u32
llama_model_loader: - kv 11: tokenizer.ggml.model str
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr
llama_model_loader: - kv 13: tokenizer.ggml.scores arr
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv 18: general.quantization_version u32
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = mostly Q4_K - Medium
llm_load_print_meta: model size = 6.74 B
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.09 MB
llm_load_tensors: mem required = 3891.34 MB (+ 2048.00 MB per state)
..................................................................................................
llama_new_context_with_model: kv self size = 2048.00 MB
llama_new_context_with_model: compute buffer total size = 281.47 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

Enter a query:
llama_print_timings: load time = 13356.09 ms
llama_print_timings: sample time = 172.36 ms / 23 runs ( 7.49 ms per token, 133.44 tokens per second)
llama_print_timings: prompt eval time = 13356.05 ms / 90 tokens ( 148.40 ms per token, 6.74 tokens per second)
llama_print_timings: eval time = 5307.39 ms / 22 runs ( 241.24 ms per token, 4.15 tokens per second)
llama_print_timings: total time = 18961.33 ms

@KonradHoeffner
Contributor

@sp1d3rino Hm, this shows BLAS = 0; are you sure you checked out the correct branch and rebuilt the Docker image?
I will have to test this later when I have a GPU available.
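For anyone retesting, the image has to be rebuilt from the PR branch so the new install layer is actually used; roughly (the local branch name pr-484 is just an example):

git fetch origin pull/484/head:pr-484   # fetch the PR branch into a local branch (name is arbitrary)
git checkout pr-484
docker build --no-cache -t localgpt .   # --no-cache avoids reusing the old layer without llama-cpp-python
docker run -it --mount src="$HOME/.cache",target=/root/.cache,type=bind --gpus=all localgpt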

@sp1d3rino

Hmm, but I compiled the requirements with CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1,
and now I don't understand why it shows BLAS=0.

@chiehpower
Author

hi @KonradHoeffner
I already pulled the latest code from your branch and tested it.
It did work! Thanks!

Although I noticed GPU memory rising by almost ~2 GB, I also found CPU memory almost full, and a single query takes very long.

[screenshots attached: 2023-09-19_23-07, 2023-09-19_23-03]

@KonradHoeffner
Contributor

Hmm, but I compiled the requirements with CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1, and now I don't understand why it shows BLAS=0.

What do you mean? Your locally installed Python dependencies don't have any effect on Docker images.

@sp1d3rino

sp1d3rino commented Sep 20, 2023

? In the Dockerfile there are two env vars that have to be set before installing the requirements with pip.
That is why I wrote that I checked inside the container whether the cuBLAS parameter was on or not.
Anyway, you said that you noticed BLAS=0 in my logs. Where exactly can I enable it in the Dockerfile?
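One way to check inside the built image whether llama-cpp-python was actually compiled with cuBLAS (a sketch; it assumes python3 is on the image's PATH and that the installed llama-cpp-python version still exposes the llama_print_system_info binding):

docker run --rm --gpus=all --entrypoint python3 localgpt \
    -c "import llama_cpp; print(llama_cpp.llama_print_system_info().decode())"

The printed feature line should contain BLAS = 1 if the wheel was built against cuBLAS; BLAS = 0 means the plain CPU wheel is installed.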

@antonio-castellon

antonio-castellon commented Sep 22, 2023

Sorry guys, but I just downloaded and ran the Dockerfile from the master branch, and the GPU is not used in my case.
It's always using the CPU! :-(

llm_load_print_meta: n_ff           = 11008
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 7B
llm_load_print_meta: model ftype    = mostly Q4_K - Medium
llm_load_print_meta: model size     = 6.74 B
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.09 MB
llm_load_tensors: mem required  = 3891.34 MB (+ 2048.00 MB per state)
..................................................................................................
llama_new_context_with_model: kv self size  = 2048.00 MB
llama_new_context_with_model: compute buffer total size =  281.47 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

No sign of GPU usage.

On the same computer, running another test directly with llama-cpp-python shows GPU usage and returns results faster:

llm_load_tensors: ggml ctx size =    0.09 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =  102.63 MB (+  256.00 MB per state)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloading v cache to GPU
llm_load_tensors: offloading k cache to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 5426 MB
....................................................................................................
llama_new_context_with_model: kv self size  =  256.00 MB
llama_new_context_with_model: compute buffer total size =   71.97 MB
llama_new_context_with_model: VRAM scratch buffer: 70.50 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
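For reference, a minimal standalone sketch of that kind of direct llama-cpp-python test (the model path and layer count here are illustrative, and it assumes the installed wheel was built with cuBLAS); layers are only offloaded when n_gpu_layers is greater than zero, otherwise inference stays on the CPU:

from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-chat.Q4_K_M.gguf",  # hypothetical local path to the GGUF file
    n_gpu_layers=35,  # > 0 offloads layers to the GPU; the default of 0 keeps everything on the CPU
    n_ctx=4096,
)
out = llm("Q: What does BLAS = 1 in the llama.cpp banner mean? A:", max_tokens=32)
print(out["choices"][0]["text"])

If this script's startup banner shows BLAS = 1 while the localGPT container shows BLAS = 0, the difference is in how llama-cpp-python was installed inside the image.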

@chiehpower
Author

Closing the issue since the bug was fixed.
As for the GPU not being used, we can open another issue to discuss it. Thanks.
