Local models for llama.cpp #51
From our current design perspective, OpenAI's LLM interface essentially functions as an HTTP proxy. Therefore, our implementation could follow the same pattern: sending LLM requests to an HTTP gateway listening on 127.0.0.1. This approach simplifies the process and keeps our design consistent. I believe this would be a practical and effective way to move forward, and I look forward to hearing your thoughts on this proposal.
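To make the idea concrete, here is a minimal sketch of what that pattern could look like. The payload shape simply mirrors OpenAI's chat API; the gateway URL, route, model name, and helper function are illustrative assumptions, not existing OpenDAN code.

```python
# Illustrative only: forward an LLM request to a local OpenAI-compatible HTTP gateway.
# The gateway address, route, and model name are placeholders, not existing OpenDAN code.
import requests

def local_chat_completion(messages, model="local-llama", base="http://127.0.0.1:8000"):
    """Send an OpenAI-style chat request to a local gateway and return the reply text."""
    resp = requests.post(
        f"{base}/v1/chat/completions",
        json={"model": model, "messages": messages, "max_tokens": 128},
        timeout=600,  # local CPU inference can be slow
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(local_chat_completion([{"role": "user", "content": "Hello!"}]))
```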
I have found another project: https://github.com/abetlen/llama-cpp-python.git. It provides two ways to use models: a high-level Python API and an OpenAI-compatible HTTP server.
Compared to implementing the APIs ourselves, a ready-made project is more conducive to interface standardization, so I am discarding the previously selected project.
I am going to provide a module for dynamically loading local LLM nodes. The tables below list the machines I tested on and the performance of each model; the steps for launching a node follow.
| ID | CPU | Memory size | GPU |
|---|---|---|---|
| A | Intel(R) Core(TM) i5-9400F CPU @ 2.90GHz | 16 GB | - |
| B | AMD Ryzen 7 5800X 8-Core Processor | 128 GB | NVIDIA GeForce RTX 3060 |
| C | 13th Gen Intel(R) Core(TM) i7-13700K | 128 GB | NVIDIA RTX A6000 |
| Name | Download Link | Model Introduction | Performance (A) | Performance (B) | Performance (C) |
|---|---|---|---|---|---|
| Llama-2-70B-chat | Llama-2-70B-chat-GGUF | TheBloke/Llama-2-70B-chat-GGUF | - | llama_print_timings: load time = 67333.57 ms<br>llama_print_timings: sample time = 18.90 ms / 47 runs (0.40 ms per token, 2486.64 tokens per second)<br>llama_print_timings: prompt eval time = 67333.46 ms / 175 tokens (384.76 ms per token, 2.60 tokens per second)<br>llama_print_timings: eval time = 56067.47 ms / 46 runs (1218.86 ms per token, 0.82 tokens per second)<br>llama_print_timings: total time = 123741.27 ms | llama_print_timings: load time = 482531.21 ms<br>llama_print_timings: sample time = 32.81 ms / 111 runs (0.30 ms per token, 3383.42 tokens per second)<br>llama_print_timings: prompt eval time = 519715.31 ms / 570 tokens (911.78 ms per token, 1.10 tokens per second)<br>llama_print_timings: eval time = 107253.35 ms / 110 runs (975.03 ms per token, 1.03 tokens per second)<br>llama_print_timings: total time = 627724.99 ms |
| Llama-2-13B-chat | Llama-2-13B-chat-GGUF | TheBloke/Llama-2-13B-chat-GGUF | llama_print_timings: load time = 44175.59 ms<br>llama_print_timings: sample time = 62.91 ms / 83 runs (0.76 ms per token, 1319.30 tokens per second)<br>llama_print_timings: prompt eval time = 44175.25 ms / 185 tokens (238.79 ms per token, 4.19 tokens per second)<br>llama_print_timings: eval time = 27077.26 ms / 82 runs (330.21 ms per token, 3.03 tokens per second)<br>llama_print_timings: total time = 71906.77 ms | llama_print_timings: load time = 12763.66 ms<br>llama_print_timings: sample time = 37.38 ms / 95 runs (0.39 ms per token, 2541.53 tokens per second)<br>llama_print_timings: prompt eval time = 12763.55 ms / 175 tokens (72.93 ms per token, 13.71 tokens per second)<br>llama_print_timings: eval time = 22880.37 ms / 94 runs (243.41 ms per token, 4.11 tokens per second)<br>llama_print_timings: total time = 36057.42 ms | llama_print_timings: load time = 56523.37 ms<br>llama_print_timings: sample time = 49.18 ms / 73 runs (0.67 ms per token, 1484.43 tokens per second)<br>llama_print_timings: prompt eval time = 64706.64 ms / 598 tokens (108.21 ms per token, 9.24 tokens per second)<br>llama_print_timings: eval time = 16400.27 ms / 72 runs (227.78 ms per token, 4.39 tokens per second)<br>llama_print_timings: total time = 82099.84 ms |
- Launch the `Llama-cpp-python` service with `Docker`

  You can directly use the model you downloaded to launch the Docker image provided by `Llama-cpp-python`:
```
docker run --rm -it -p 8000:8000 -v ${/path/to/models}:/models -e MODEL=/models/${model-filename} ghcr.io/abetlen/llama-cpp-python:latest
```
Depending on the performance of your computer, the startup may take a few minutes. When `docker` outputs the following log, the service has started:
```
llama_new_context_with_model: kv self size = 640.00 MB
llama_new_context_with_model: compute buffer total size = 305.47 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
INFO: Started server process [173]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
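If you are scripting the startup, a simple readiness check is to poll until the HTTP server answers. This is a minimal sketch; it probes the /docs page mentioned below only because it is a cheap GET, and the URL and timeout are assumptions.

```python
# Minimal readiness probe: wait until the local llama-cpp-python server answers HTTP.
# Uses the /docs page only because it is a cheap GET; the URL and timeout are examples.
import time
import requests

def wait_for_server(url="http://localhost:8000/docs", timeout_s=600):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                return True
        except requests.ConnectionError:
            pass
        time.sleep(2)
    return False

if __name__ == "__main__":
    print("server ready" if wait_for_server() else "server did not come up in time")
```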
This service provides an `http` interface similar to `OpenAI`'s. Through the http://localhost:8000/docs page, we can see the prototypes of these interfaces. For more detailed information, see [Llama-cpp-python](https://github.com/abetlen/llama-cpp-python).
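Since the routes mimic OpenAI's, an OpenAI-style client can simply be pointed at the local service. Below is a minimal sketch assuming the `openai` Python package (v1.x); the exact request schemas can be confirmed on the /docs page, and the model name is only a placeholder (the server answers with whatever model it was started with).

```python
# Minimal sketch: talk to the local llama-cpp-python server with the OpenAI client.
# Assumes openai>=1.0 is installed and the server is reachable on localhost:8000.
from openai import OpenAI

# The local server does not check the API key, but the client requires a non-empty value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-no-key-needed")

response = client.chat.completions.create(
    model="llama-2-13b-chat",  # placeholder; the server serves the model it was started with
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```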
- Launch GPU acceleration

```
git clone https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python
docker build -t llama-cpp-python-cuda docker/cuda_simple/
docker run --gpus all --rm -it -p 8000:8000 -v ${/path/to/models}:/models -e MODEL=/models/${model-filename} llama-cpp-python-cuda python3 -m llama_cpp.server --n_gpu_layers ${X} --n_ctx ${L}
```
I have tested it with computers B and C:
| Model | n_gpu_layers | Performance (B) | Performance (C) |
|---|---|---|---|
| Llama-2-13B-chat | 43 | llama_print_timings: load time = 6563.91 ms<br>llama_print_timings: sample time = 39.52 ms / 97 runs (0.41 ms per token, 2454.39 tokens per second)<br>llama_print_timings: prompt eval time = 6563.80 ms / 185 tokens (35.48 ms per token, 28.18 tokens per second)<br>llama_print_timings: eval time = 2659.15 ms / 96 runs (27.70 ms per token, 36.10 tokens per second)<br>llama_print_timings: total time = 9653.26 ms | llama_print_timings: load time = 3669.67 ms<br>llama_print_timings: sample time = 20.01 ms / 73 runs (0.27 ms per token, 3647.63 tokens per second)<br>llama_print_timings: prompt eval time = 3856.10 ms / 642 tokens (6.01 ms per token, 166.49 tokens per second)<br>llama_print_timings: eval time = 1176.83 ms / 72 runs (16.34 ms per token, 61.18 tokens per second)<br>llama_print_timings: total time = 5751.94 ms |
| Llama-2-70B-chat | 83 | (n_gpu_layers=24)<br>llama_print_timings: load time = 12639.14 ms<br>llama_print_timings: sample time = 27.72 ms / 59 runs (0.47 ms per token, 2128.20 tokens per second)<br>llama_print_timings: prompt eval time = 12639.07 ms / 185 tokens (68.32 ms per token, 14.64 tokens per second)<br>llama_print_timings: eval time = 58749.19 ms / 58 runs (1012.92 ms per token, 0.99 tokens per second)<br>llama_print_timings: total time = 71768.85 ms | llama_print_timings: load time = 5321.81 ms<br>llama_print_timings: sample time = 30.52 ms / 111 runs (0.27 ms per token, 3637.56 tokens per second)<br>llama_print_timings: prompt eval time = 5918.75 ms / 594 tokens (9.96 ms per token, 100.36 tokens per second)<br>llama_print_timings: eval time = 7527.38 ms / 110 runs (68.43 ms per token, 14.61 tokens per second)<br>llama_print_timings: total time = 14166.71 ms |
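If you want to experiment with different `--n_gpu_layers` and `--n_ctx` values without restarting the Docker container each time, the same library can also be driven directly from Python. This is only a sketch under the assumption that a CUDA-enabled build of `llama-cpp-python` is installed locally; the model path and parameter values are placeholders.

```python
# Sketch: load a GGUF model with llama-cpp-python and offload layers to the GPU.
# Assumes a CUDA-enabled build of llama-cpp-python; the path and values are examples.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/llama-2-13b-chat.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=43,  # layers to offload; lower this if you run out of VRAM
    n_ctx=2048,       # context window, mirrors --n_ctx on the server
)

out = llm("Q: Name three planets. A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```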
- Support for other models

  Here I only introduce a few models that I have experimented with. `Llama-cpp-python` also supports many other models; if you have other needs, you can select other models on Hugging Face.
Here are a few things to note:
- Models provided by meta-llama require approval, so you need to wait for their reply email. You can try to find equivalent models in the TheBloke repository.
- `Llama-cpp-python` needs the `hf` version of `gguf` format models:
  - If you have downloaded the original `.pth` suffix model, you need to use the `convert.py` script in `Llama.cpp` to convert the format:
```
python convert.py --outfile ${/your/target/file/path} ${/your/.pth/file/path}
```
  - If you have downloaded a `ggml` format model, you need to use the `convert-llama-ggml-to-gguf.py` script in `Llama.cpp` to convert the format:
```
python ./convert-llama-ggml-to-gguf.py --eps 1e-5 --input ${/your/ggml/model/path} --output ${/your/gguf/model/path}
# If the model you downloaded is 70B, you need to add the parameter --gqa 8
```
Integrate the `Llama` service into the `OpenDAN` system

We have successfully launched a local `LLM` node based on the `Llama` model. The `shell` of the `OpenDAN` system provides the following commands for managing these dynamic nodes:
```
# Create a new node
/node create
```
* This command will execute the above shell commands interactively to start the service.
* We can specify the parameters for starting the node according to the wizard's instructions.
```
# Add an existing node
/node add $model_name $url

# Remove a node
/node rm $model_name $url

# List the currently joined nodes
/node list
```
There are two parameters, `$model_name` and `$url`; here is an explanation:
- `$model_name`: Give this model a name, which should be exactly the same as the `llm_model_name` specified in the `agent` you built based on `OpenDAN`. The computing task of this `agent` will be assigned to this node. If there are multiple models with the same name in the system, `OpenDAN` will randomly assign the computing task to one of them.
- `$url`: The access address of the `Llama-cpp-python` service started with `Docker` before, such as http://192.168.0.123:8000
Feedback and discussion are welcome
This project and other projects mentioned in the text are rapidly iterating. The various commands and operations mentioned above may become outdated. If you encounter any problems during use, please feel free to give me feedback, and I will follow up as soon as possible.
Great job! I am interested in understanding the performance of our 70B and 13B models on typical hardware environments. Specifically, I would like to know the performance metrics when running on two types of devices.
This information will be invaluable in helping us optimize our models and plan for future hardware requirements. I appreciate your assistance in providing these details.
Thank you for your attention. I will gradually add more models to the device list.
I have a local server for llama, built with this project:
https://github.com/soulteary/llama-docker-playground
I want to add it to the system, but it depends on my custom API, so I will add an example for it.
But I find that the base class `ComputeNode` is provided in `src/aios_kernel`, and the folder `src` is not a package. I think we should provide the external interfaces in a separate package, or split them into another project?
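For what it is worth, wrapping a custom HTTP llama server might look roughly like the sketch below once the interface is importable from a proper package. Everything here is hypothetical: the class name, method name, and request format are placeholders, and in practice the class would subclass the real `ComputeNode` from `src/aios_kernel`.

```python
# Hypothetical sketch: wrapping a custom HTTP llama server as a compute node.
# The method name and request format are placeholders; the real ComputeNode
# interface in src/aios_kernel will differ.
import requests

class CustomLlamaNode:  # in practice this would subclass aios_kernel's ComputeNode
    def __init__(self, model_name: str, base_url: str):
        self.model_name = model_name
        self.base_url = base_url

    def complete(self, prompt: str, max_tokens: int = 128) -> str:
        """Forward a completion request to the custom llama server."""
        resp = requests.post(
            f"{self.base_url}/completion",  # placeholder route for the custom API
            json={"prompt": prompt, "max_tokens": max_tokens},
            timeout=600,
        )
        resp.raise_for_status()
        return resp.json().get("content", "")
```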