Local models for llama.cpp #51
From our current design perspective, OpenAI's LLM interface essentially functions as an HTTP proxy. Therefore, our implementation could follow the same pattern: sending LLM requests to an HTTP gateway listening on 127.0.0.1. This approach simplifies the process and keeps our design consistent. I believe this would be a practical and effective way to move forward, and I look forward to hearing your thoughts on this proposal.
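To make the idea concrete, here is a minimal sketch of what that pattern could look like. The payload shape simply mirrors OpenAI's chat API; the gateway URL, route, model name, and helper function are illustrative assumptions, not existing OpenDAN code.

```python
# Illustrative only: forward an LLM request to a local OpenAI-compatible HTTP gateway.
# The gateway address, route, and model name are placeholders, not existing OpenDAN code.
import requests

def local_chat_completion(messages, model="local-llama", base="http://127.0.0.1:8000"):
    """Send an OpenAI-style chat request to a local gateway and return the reply text."""
    resp = requests.post(
        f"{base}/v1/chat/completions",
        json={"model": model, "messages": messages, "max_tokens": 128},
        timeout=600,  # local CPU inference can be slow
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(local_chat_completion([{"role": "user", "content": "Hello!"}]))
```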
I have found another project: https://github.com/abetlen/llama-cpp-python.git. It provides two ways to use models: a high-level Python API and an OpenAI-compatible HTTP server.
Compared to implementing the APIs ourselves, a ready-made project is more conducive to interface standardization, so I am discarding the previously selected project.
I am going to provide a module for dynamically loading local LLM nodes. The tables below list the machines I tested on and the performance of each model; the steps for launching a node follow.
| ID | CPU | Memory size | GPU |
|---|---|---|---|
| A | Intel(R) Core(TM) i5-9400F CPU @ 2.90GHz | 16 GB | - |
| B | AMD Ryzen 7 5800X 8-Core Processor | 128 GB | NVIDIA GeForce RTX 3060 |
| C | 13th Gen Intel(R) Core(TM) i7-13700K | 128 GB | NVIDIA RTX A6000 |
| Name | Download Link | Model Introduction | Performance (A) | Performance (B) | Performance (C) |
|---|---|---|---|---|---|
| Llama-2-70B-chat | Llama-2-70B-chat-GGUF | TheBloke/Llama-2-70B-chat-GGUF | - | llama_print_timings: load time = 67333.57 ms<br>llama_print_timings: sample time = 18.90 ms / 47 runs (0.40 ms per token, 2486.64 tokens per second)<br>llama_print_timings: prompt eval time = 67333.46 ms / 175 tokens (384.76 ms per token, 2.60 tokens per second)<br>llama_print_timings: eval time = 56067.47 ms / 46 runs (1218.86 ms per token, 0.82 tokens per second)<br>llama_print_timings: total time = 123741.27 ms | llama_print_timings: load time = 482531.21 ms<br>llama_print_timings: sample time = 32.81 ms / 111 runs (0.30 ms per token, 3383.42 tokens per second)<br>llama_print_timings: prompt eval time = 519715.31 ms / 570 tokens (911.78 ms per token, 1.10 tokens per second)<br>llama_print_timings: eval time = 107253.35 ms / 110 runs (975.03 ms per token, 1.03 tokens per second)<br>llama_print_timings: total time = 627724.99 ms |
| Llama-2-13B-chat | Llama-2-13B-chat-GGUF | TheBloke/Llama-2-13B-chat-GGUF | llama_print_timings: load time = 44175.59 ms<br>llama_print_timings: sample time = 62.91 ms / 83 runs (0.76 ms per token, 1319.30 tokens per second)<br>llama_print_timings: prompt eval time = 44175.25 ms / 185 tokens (238.79 ms per token, 4.19 tokens per second)<br>llama_print_timings: eval time = 27077.26 ms / 82 runs (330.21 ms per token, 3.03 tokens per second)<br>llama_print_timings: total time = 71906.77 ms | llama_print_timings: load time = 12763.66 ms<br>llama_print_timings: sample time = 37.38 ms / 95 runs (0.39 ms per token, 2541.53 tokens per second)<br>llama_print_timings: prompt eval time = 12763.55 ms / 175 tokens (72.93 ms per token, 13.71 tokens per second)<br>llama_print_timings: eval time = 22880.37 ms / 94 runs (243.41 ms per token, 4.11 tokens per second)<br>llama_print_timings: total time = 36057.42 ms | llama_print_timings: load time = 56523.37 ms<br>llama_print_timings: sample time = 49.18 ms / 73 runs (0.67 ms per token, 1484.43 tokens per second)<br>llama_print_timings: prompt eval time = 64706.64 ms / 598 tokens (108.21 ms per token, 9.24 tokens per second)<br>llama_print_timings: eval time = 16400.27 ms / 72 runs (227.78 ms per token, 4.39 tokens per second)<br>llama_print_timings: total time = 82099.84 ms |
- Launch the `Llama-cpp-python` service with `Docker`

  You can directly use the model you downloaded to launch the Docker image provided by `Llama-cpp-python`:
```
docker run --rm -it -p 8000:8000 -v ${/path/to/models}:/models -e MODEL=/models/${model-filename} ghcr.io/abetlen/llama-cpp-python:latest
```
Depending on the performance of your computer, the startup may take a few minutes. When `docker` outputs the following log, the service has started:
```
llama_new_context_with_model: kv self size = 640.00 MB
llama_new_context_with_model: compute buffer total size = 305.47 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
INFO: Started server process [173]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
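If you are scripting the startup, a simple readiness check is to poll until the HTTP server answers. This is a minimal sketch; it probes the /docs page mentioned below only because it is a cheap GET, and the URL and timeout are assumptions.

```python
# Minimal readiness probe: wait until the local llama-cpp-python server answers HTTP.
# Uses the /docs page only because it is a cheap GET; the URL and timeout are examples.
import time
import requests

def wait_for_server(url="http://localhost:8000/docs", timeout_s=600):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                return True
        except requests.ConnectionError:
            pass
        time.sleep(2)
    return False

if __name__ == "__main__":
    print("server ready" if wait_for_server() else "server did not come up in time")
```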
This service provides an `http` interface similar to `OpenAI`'s. Through the http://localhost:8000/docs page, we can see the prototypes of these interfaces. For more detailed information, see [Llama-cpp-python](https://github.com/abetlen/llama-cpp-python).
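Since the routes mimic OpenAI's, an OpenAI-style client can simply be pointed at the local service. Below is a minimal sketch assuming the `openai` Python package (v1.x); the exact request schemas can be confirmed on the /docs page, and the model name is only a placeholder (the server answers with whatever model it was started with).

```python
# Minimal sketch: talk to the local llama-cpp-python server with the OpenAI client.
# Assumes openai>=1.0 is installed and the server is reachable on localhost:8000.
from openai import OpenAI

# The local server does not check the API key, but the client requires a non-empty value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-no-key-needed")

response = client.chat.completions.create(
    model="llama-2-13b-chat",  # placeholder; the server serves the model it was started with
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```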
- Launch GPU acceleration

```
git clone https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python
docker build -t llama-cpp-python-cuda docker/cuda_simple/
docker run --gpus all --rm -it -p 8000:8000 -v ${/path/to/models}:/models -e MODEL=/models/${model-filename} llama-cpp-python-cuda python3 -m llama_cpp.server --n_gpu_layers ${X} --n_ctx ${L}
```
I have tested it with computers B and C:
| Model | n_gpu_layers | Performance (B) | Performance (C) |
|---|---|---|---|
| Llama-2-13B-chat | 43 | llama_print_timings: load time = 6563.91 ms<br>llama_print_timings: sample time = 39.52 ms / 97 runs (0.41 ms per token, 2454.39 tokens per second)<br>llama_print_timings: prompt eval time = 6563.80 ms / 185 tokens (35.48 ms per token, 28.18 tokens per second)<br>llama_print_timings: eval time = 2659.15 ms / 96 runs (27.70 ms per token, 36.10 tokens per second)<br>llama_print_timings: total time = 9653.26 ms | llama_print_timings: load time = 3669.67 ms<br>llama_print_timings: sample time = 20.01 ms / 73 runs (0.27 ms per token, 3647.63 tokens per second)<br>llama_print_timings: prompt eval time = 3856.10 ms / 642 tokens (6.01 ms per token, 166.49 tokens per second)<br>llama_print_timings: eval time = 1176.83 ms / 72 runs (16.34 ms per token, 61.18 tokens per second)<br>llama_print_timings: total time = 5751.94 ms |
| Llama-2-70B-chat | 83 | (n_gpu_layers=24)<br>llama_print_timings: load time = 12639.14 ms<br>llama_print_timings: sample time = 27.72 ms / 59 runs (0.47 ms per token, 2128.20 tokens per second)<br>llama_print_timings: prompt eval time = 12639.07 ms / 185 tokens (68.32 ms per token, 14.64 tokens per second)<br>llama_print_timings: eval time = 58749.19 ms / 58 runs (1012.92 ms per token, 0.99 tokens per second)<br>llama_print_timings: total time = 71768.85 ms | llama_print_timings: load time = 5321.81 ms<br>llama_print_timings: sample time = 30.52 ms / 111 runs (0.27 ms per token, 3637.56 tokens per second)<br>llama_print_timings: prompt eval time = 5918.75 ms / 594 tokens (9.96 ms per token, 100.36 tokens per second)<br>llama_print_timings: eval time = 7527.38 ms / 110 runs (68.43 ms per token, 14.61 tokens per second)<br>llama_print_timings: total time = 14166.71 ms |
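If you want to experiment with different `--n_gpu_layers` and `--n_ctx` values without restarting the Docker container each time, the same library can also be driven directly from Python. This is only a sketch under the assumption that a CUDA-enabled build of `llama-cpp-python` is installed locally; the model path and parameter values are placeholders.

```python
# Sketch: load a GGUF model with llama-cpp-python and offload layers to the GPU.
# Assumes a CUDA-enabled build of llama-cpp-python; the path and values are examples.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/llama-2-13b-chat.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=43,  # layers to offload; lower this if you run out of VRAM
    n_ctx=2048,       # context window, mirrors --n_ctx on the server
)

out = llm("Q: Name three planets. A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```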
- Support for other models

  Here I only introduce a few models that I have experimented with. `Llama-cpp-python` also supports many other models; if you have other needs, you can select other models on Hugging Face.
Here are a few things to note:
- Models provided by meta-llama require approval, so you need to wait for their reply email. You can try to find equivalent models in the TheBloke repository.
- `Llama-cpp-python` needs the `hf` version of `gguf` format models:
  - If you have downloaded the original `.pth` suffix model, you need to use the `convert.py` script in `Llama.cpp` to convert the format:
```
python convert.py --outfile ${/your/target/file/path} ${/your/.pth/file/path}
```
  - If you have downloaded a `ggml` format model, you need to use the `convert-llama-ggml-to-gguf.py` script in `Llama.cpp` to convert the format:
```
python ./convert-llama-ggml-to-gguf.py --eps 1e-5 --input ${/your/ggml/model/path} --output ${/your/gguf/model/path}
# If the model you downloaded is 70B, you need to add the parameter --gqa 8
```
Integrate the `Llama` service into the `OpenDAN` system

We have successfully launched a local `LLM` node based on the `Llama` model. The `shell` of the `OpenDAN` system provides the following commands for managing these dynamic nodes:
```
# Create a new node
/node create
```
* This command will execute the above shell commands interactively to start the service.
* We can specify the parameters for starting the node according to the wizard's instructions.
```
# Add an existing node
/node add $model_name $url

# Remove a node
/node rm $model_name $url

# List the currently joined nodes
/node list
```
There are two parameters, `$model_name` and `$url`; here is an explanation:
- `$model_name`: Give this model a name, which should be exactly the same as the `llm_model_name` specified in the `agent` you built based on `OpenDAN`. The computing task of this `agent` will be assigned to this node. If there are multiple models with the same name in the system, `OpenDAN` will randomly assign the computing task to one of them.
- `$url`: The access address of the `Llama-cpp-python` service started with `Docker` before, such as http://192.168.0.123:8000
Feedback and discussion are welcome
This project and other projects mentioned in the text are rapidly iterating. The various commands and operations mentioned above may become outdated. If you encounter any problems during use, please feel free to give me feedback, and I will follow up as soon as possible.
Great job! I am interested in understanding the performance of our 70B and 13B models on typical hardware environments. Specifically, I would like to know the performance metrics when running on two types of devices.
This information will be invaluable in helping us optimize our models and plan for future hardware requirements. I appreciate your assistance in providing these details.
Thank you for your attention. I will gradually add more models to the device list.
I have a local server for llama, built with this project:
https://github.com/soulteary/llama-docker-playground
I want to add it to the system, but it depends on my custom API, so I will add an example for it.
But I find that the base class `ComputeNode` is provided in `src/aios_kernel`, and the folder `src` is not a package. I think we should provide the external interfaces in a separate package, or split them into another project?
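For what it is worth, wrapping a custom HTTP llama server might look roughly like the sketch below once the interface is importable from a proper package. Everything here is hypothetical: the class name, method name, and request format are placeholders, and in practice the class would subclass the real `ComputeNode` from `src/aios_kernel`.

```python
# Hypothetical sketch: wrapping a custom HTTP llama server as a compute node.
# The method name and request format are placeholders; the real ComputeNode
# interface in src/aios_kernel will differ.
import requests

class CustomLlamaNode:  # in practice this would subclass aios_kernel's ComputeNode
    def __init__(self, model_name: str, base_url: str):
        self.model_name = model_name
        self.base_url = base_url

    def complete(self, prompt: str, max_tokens: int = 128) -> str:
        """Forward a completion request to the custom llama server."""
        resp = requests.post(
            f"{self.base_url}/completion",  # placeholder route for the custom API
            json={"prompt": prompt, "max_tokens": max_tokens},
            timeout=600,
        )
        resp.raise_for_status()
        return resp.json().get("content", "")
```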