Local models for llama.cpp #51

Open · streetycat opened this issue Sep 14, 2023 · 6 comments

Comments
@streetycat (Contributor) commented Sep 14, 2023

I have a local server for Llama; it's built with this project:

https://github.com/soulteary/llama-docker-playground

I want to add it to the system, but it depends on my custom API, so I will add it as an example.

However, I find that the base class ComputeNode is provided in src/aios_kernel, and the folder src is not a package. I think we should provide the external interfaces in a separate package, or split them into another project?

@waterflier (Collaborator)

From our current design perspective, OpenAI's LLM interface essentially functions as an HTTP proxy. Therefore, our implementation could follow the same pattern: sending LLM requests to an HTTP gateway operating at 127.0.0.1.

This approach simplifies the process and ensures consistency in our design. I believe this would be a practical and effective way to move forward. I look forward to hearing your thoughts on this proposal.
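
To illustrate the proposal, here is a minimal, hypothetical sketch (not the actual ComputeNode interface from src/aios_kernel): a node class that simply forwards a chat request to an OpenAI-compatible HTTP gateway on 127.0.0.1. The class name, port, and endpoint path are assumptions for illustration only.

import requests

class LocalLlamaNode:  # hypothetical name, for illustration only
    def __init__(self, base_url="http://127.0.0.1:8000"):
        self.base_url = base_url

    def chat(self, prompt: str) -> str:
        # Forward the request to an OpenAI-style /v1/chat/completions endpoint,
        # e.g. the one exposed by a local llama-cpp-python server.
        resp = requests.post(
            f"{self.base_url}/v1/chat/completions",
            json={"messages": [{"role": "user", "content": prompt}]},
            timeout=600,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

node = LocalLlamaNode()
print(node.chat("Hello, who are you?"))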

@streetycat (Contributor, Author)

I have found another project:

https://github.com/abetlen/llama-cpp-python.git

It provides 2 methods to use models:

  1. Integrate it in the same process. We can use it very conveniently (see the sketch below).
  2. Set up a local service with a Docker image. This makes it system-independent and more flexible to deploy; it can run on any computer.

Compared to implementing APIs ourselves, ready-made projects are more conducive to interface standardization, so I am discarding the previously selected project.
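
As a rough illustration of option 1 (the in-process method), here is a minimal sketch using llama-cpp-python; the GGUF file path and the generation parameters are placeholders, not recommendations.

from llama_cpp import Llama

# Load a local GGUF model in the same process (the path is a placeholder).
llm = Llama(model_path="./models/llama-2-13b-chat.Q4_K_M.gguf", n_ctx=2048)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])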

@streetycat (Contributor, Author) commented Oct 13, 2023

I am going to provide a module for dynamically loading local LLM nodes.

Launch a Llama node on a personal server

  1. Download the Llama model

I have tested the following models on 3 computers:

| ID | CPU | Memory size | GPU |
| --- | --- | --- | --- |
| A | Intel(R) Core(TM) i5-9400F CPU @ 2.90GHz | 16 GB | - |
| B | AMD Ryzen 7 5800X 8-Core Processor | 128 GB | NVIDIA GeForce RTX 3060 |
| C | 13th Gen Intel(R) Core(TM) i7-13700K | 128 GB | NVIDIA RTX A6000 |
Llama-2-70B-chat (Download Link: Llama-2-70B-chat-GGUF · Model Introduction: TheBloke/Llama-2-70B-chat-GGUF)

  • Performance (A): -
  • Performance (B):
    llama_print_timings: load time = 67333.57 ms
    llama_print_timings: sample time = 18.90 ms / 47 runs ( 0.40 ms per token, 2486.64 tokens per second)
    llama_print_timings: prompt eval time = 67333.46 ms / 175 tokens ( 384.76 ms per token, 2.60 tokens per second)
    llama_print_timings: eval time = 56067.47 ms / 46 runs ( 1218.86 ms per token, 0.82 tokens per second)
    llama_print_timings: total time = 123741.27 ms
  • Performance (C):
    llama_print_timings: load time = 482531.21 ms
    llama_print_timings: sample time = 32.81 ms / 111 runs ( 0.30 ms per token, 3383.42 tokens per second)
    llama_print_timings: prompt eval time = 519715.31 ms / 570 tokens ( 911.78 ms per token, 1.10 tokens per second)
    llama_print_timings: eval time = 107253.35 ms / 110 runs ( 975.03 ms per token, 1.03 tokens per second)
    llama_print_timings: total time = 627724.99 ms

Llama-2-13B-chat (Download Link: Llama-2-13B-chat-GGUF · Model Introduction: TheBloke/Llama-2-13B-chat-GGUF)

  • Performance (A):
    llama_print_timings: load time = 44175.59 ms
    llama_print_timings: sample time = 62.91 ms / 83 runs ( 0.76 ms per token, 1319.30 tokens per second)
    llama_print_timings: prompt eval time = 44175.25 ms / 185 tokens ( 238.79 ms per token, 4.19 tokens per second)
    llama_print_timings: eval time = 27077.26 ms / 82 runs ( 330.21 ms per token, 3.03 tokens per second)
    llama_print_timings: total time = 71906.77 ms
  • Performance (B):
    llama_print_timings: load time = 12763.66 ms
    llama_print_timings: sample time = 37.38 ms / 95 runs ( 0.39 ms per token, 2541.53 tokens per second)
    llama_print_timings: prompt eval time = 12763.55 ms / 175 tokens ( 72.93 ms per token, 13.71 tokens per second)
    llama_print_timings: eval time = 22880.37 ms / 94 runs ( 243.41 ms per token, 4.11 tokens per second)
    llama_print_timings: total time = 36057.42 ms
  • Performance (C):
    llama_print_timings: load time = 56523.37 ms
    llama_print_timings: sample time = 49.18 ms / 73 runs ( 0.67 ms per token, 1484.43 tokens per second)
    llama_print_timings: prompt eval time = 64706.64 ms / 598 tokens ( 108.21 ms per token, 9.24 tokens per second)
    llama_print_timings: eval time = 16400.27 ms / 72 runs ( 227.78 ms per token, 4.39 tokens per second)
    llama_print_timings: total time = 82099.84 ms
  2. Launch the Llama-cpp-python service with Docker

You can directly use the model you downloaded to launch the Docker image provided by Llama-cpp-python:

docker run --rm -it -p 8000:8000 -v ${/path/to/models}:/models -e MODEL=/models/${model-filename} ghcr.io/abetlen/llama-cpp-python:latest

Depending on the performance of your computer, the startup may take a few minutes. When docker outputs the following log, it means that the service has started:

llama_new_context_with_model: kv self size  =  640.00 MB
llama_new_context_with_model: compute buffer total size =  305.47 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
INFO:     Started server process [173]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

This service provides an HTTP interface similar to OpenAI's. Through the http://localhost:8000/docs page, we can see the prototypes of these interfaces.

For more detailed information, see Llama-cpp-python.
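
As an example, once the container is running, the OpenAI-compatible endpoint can be exercised with the official openai Python client pointed at the local server; the model name and api_key values below are placeholders (the local server does not validate the key).

from openai import OpenAI

# Point the OpenAI client at the local llama-cpp-python server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="local-llama",  # placeholder; the server answers with the model it was started with
    messages=[{"role": "user", "content": "Explain llama.cpp in one sentence."}],
)
print(resp.choices[0].message.content)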

  • Launch with GPU acceleration
git clone https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python
docker build -t llama-cpp-python-cuda docker/cuda_simple/
docker run --gpus all --rm -it -p 8000:8000 -v ${/path/to/models}:/models -e MODEL=/models/${model-filename} llama-cpp-python-cuda python3 -m llama_cpp.server --n_gpu_layers ${X} --n_ctx ${L}

I have tested it on computers B and C:

Llama-2-13B-chat (n_gpu_layers = 43)

  • Performance (B):
    llama_print_timings: load time = 6563.91 ms
    llama_print_timings: sample time = 39.52 ms / 97 runs ( 0.41 ms per token, 2454.39 tokens per second)
    llama_print_timings: prompt eval time = 6563.80 ms / 185 tokens ( 35.48 ms per token, 28.18 tokens per second)
    llama_print_timings: eval time = 2659.15 ms / 96 runs ( 27.70 ms per token, 36.10 tokens per second)
    llama_print_timings: total time = 9653.26 ms
  • Performance (C):
    llama_print_timings: load time = 3669.67 ms
    llama_print_timings: sample time = 20.01 ms / 73 runs ( 0.27 ms per token, 3647.63 tokens per second)
    llama_print_timings: prompt eval time = 3856.10 ms / 642 tokens ( 6.01 ms per token, 166.49 tokens per second)
    llama_print_timings: eval time = 1176.83 ms / 72 runs ( 16.34 ms per token, 61.18 tokens per second)
    llama_print_timings: total time = 5751.94 ms

Llama-2-70B-chat (n_gpu_layers = 83)

  • Performance (B) (with n_gpu_layers = 24):
    llama_print_timings: load time = 12639.14 ms
    llama_print_timings: sample time = 27.72 ms / 59 runs ( 0.47 ms per token, 2128.20 tokens per second)
    llama_print_timings: prompt eval time = 12639.07 ms / 185 tokens ( 68.32 ms per token, 14.64 tokens per second)
    llama_print_timings: eval time = 58749.19 ms / 58 runs ( 1012.92 ms per token, 0.99 tokens per second)
    llama_print_timings: total time = 71768.85 ms
  • Performance (C):
    llama_print_timings: load time = 5321.81 ms
    llama_print_timings: sample time = 30.52 ms / 111 runs ( 0.27 ms per token, 3637.56 tokens per second)
    llama_print_timings: prompt eval time = 5918.75 ms / 594 tokens ( 9.96 ms per token, 100.36 tokens per second)
    llama_print_timings: eval time = 7527.38 ms / 110 runs ( 68.43 ms per token, 14.61 tokens per second)
    llama_print_timings: total time = 14166.71 ms
  3. Support for other models

Here I only introduce a few models that I have experimented with. Llama-cpp-python also supports many other models. If you have other needs, you can select other models on Hugging Face.

Here are a few things to note:

  • Models provided by meta-llama require waiting for an approval email. You can try to find equivalent models in the TheBloke repositories.
  • Llama-cpp-python needs models in GGUF format:
    • If you have downloaded an original .pth model, you need to use the convert.py script in llama.cpp to convert the format:
    python convert.py --outfile ${/your/target/file/path} ${/your/.pth/file/path}
    
    • If you have downloaded a GGML format model, you need to use the convert-llama-ggml-to-gguf.py script in llama.cpp to convert the format:
    python ./convert-llama-ggml-to-gguf.py --eps 1e-5 --input ${/your/ggml/model/path} --output ${/your/gguf/model/path}
    # If the model you downloaded is `70B`, you need to add the parameter `--gqa 8`
    

Integrate the Llama service into the OpenDAN system

We have successfully launched a local LLM node based on the Llama model. The shell of the OpenDAN system provides the following commands for managing these dynamic nodes:

# Create a new node
/node create

* This command will execute the above shell commands interactively to start the service.
* We can specify the parameters for starting the node according to the wizard's instructions.

# Add an existing node
/node add $model_name $url

# Remove a node
/node rm $model_name $url

# List the currently joined nodes
/node list

There are two parameters, $model_name and $url; here is an explanation (a worked example follows the list):

  • $model_name: Give this model a name, which should be exactly the same as the llm_model_name specified in the agent you built on OpenDAN. The computing tasks of this agent will be assigned to this node. If there are multiple nodes with the same model name in the system, OpenDAN will randomly assign the computing task to one of them.
  • $url: The access address of the Llama-cpp-python service started with Docker above, such as http://192.168.0.123:8000
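
For example, to register the 13B service launched above and then confirm it (the address is the placeholder from above; substitute your own host and port, and set the agent's llm_model_name to the same name):

/node add Llama-2-13B-chat http://192.168.0.123:8000
/node list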

Feedback and discussion are welcome

This project and other projects mentioned in the text are rapidly iterating. The various commands and operations mentioned above may become outdated. If you encounter any problems during use, please feel free to give me feedback, and I will follow up as soon as possible.

@waterflier (Collaborator)

Great job!

I am interested in understanding the performance of our 70B and 13B models on typical hardware environments. Specifically, I would like to know the performance metrics when running on two types of devices:

  1. CPU-only Devices: What performance can we expect when running our models purely on CPU?

  2. GPU Devices: What is the inference performance when using typical GPUs, such as the 3060 or 4060 series?

This information will be invaluable in helping us optimize our models and plan for future hardware requirements. I appreciate your assistance in providing these details.

@streetycat (Contributor, Author)

> Great job!
>
> I am interested in understanding the performance of our 70B and 13B models on typical hardware environments. Specifically, I would like to know the performance metrics when running on two types of devices:
>
>   1. CPU-only Devices: What performance can we expect when running our models purely on CPU?
>   2. GPU Devices: What is the inference performance when using typical GPUs, such as the 3060 or 4060 series?
>
> This information will be invaluable in helping us optimize our models and plan for future hardware requirements. I appreciate your assistance in providing these details.

Thank you for your attention. I will gradually add results for more models and devices to the list.

@streetycat (Contributor, Author) commented Oct 31, 2023

I will try to test more open source LLMs and add guidelines:

  1. Alpaca
  2. Vicuna
  3. Mistral
  4. MPT
  5. Aquila

streetycat changed the title from "Local models for llama" to "Local models for llama.cpp" on Nov 23, 2023