ChatLLM.cpp

Chinese version

License: MIT

Inference of a variety of models, ranging from under 1B to over 300B parameters, for real-time chatting with RAG on your computer (CPU). A pure C++ implementation based on @ggerganov's ggml.

| Supported Models | Download Quantized Models |

What's New:

  • 2024-06-15: Tool calling
  • 2024-06-07: Qwen2
  • 2024-06-06: GLM-4
  • 2024-06-03: XVERSE
  • 2024-06-01: Codestral
  • 2024-05-30: MAP-Neo
  • 2024-05-29: ggml is now maintained as a fork instead of a submodule
  • 2024-05-26: Aya-23 from Cohere
  • 2024-05-24: DeepSeek-V2 Light
  • 2024-05-23: Mistral v0.3
  • 2024-05-22: Phi3-medium
  • 2024-05-15: StarCoder2
  • 2024-05-14: OpenAI API, CodeGemma Base & Instruct supported
  • 2024-05-13: Yi 1.5 Chat models
  • 2024-05-08: Layer shuffling

Features

  • Accelerated memory-efficient CPU inference with int4/int8 quantization, optimized KV cache and parallel computing;

  • Object-oriented design to factor out the similarities between different Transformer-based models;

  • Streaming generation with typewriter effect;

  • Continuous chatting (content length is virtually unlimited)

    Two methods are available: Restart and Shift. See the --extending option (a usage sketch follows this list).

  • Retrieval Augmented Generation (RAG) 🔥

  • LoRA;

  • Python/JavaScript/C Bindings, web demo, and more possibilities.
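
As a usage sketch for the context-extending methods mentioned above: the --extending option itself is named in this README, but the exact value below ("shift") is an assumption inferred from the Restart/Shift method names; run ./build/bin/main -h for the authoritative syntax.

# Hypothetical example: continue a long chat using the "Shift" extending method
./build/bin/main -m quantized.bin -i --extending shift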

Usage

Preparation

Clone the ChatLLM.cpp repository to your local machine:

git clone --recursive https://github.com/foldl/chatllm.cpp.git && cd chatllm.cpp

If you forgot the --recursive flag when cloning the repository, run the following command in the chatllm.cpp folder:

git submodule update --init --recursive

Quantize Model

Some quantized models can be downloaded from here.

Install dependencies of convert.py:

pip install -r requirements.txt

Use convert.py to transform models into quantized GGML format. For example, to convert an fp16 base model to a q8_0 (quantized int8) GGML model, run:

# For models such as ChatGLM-6B, ChatGLM2-6B, InternLM, LlaMA, LlaMA-2, Baichuan-2, etc.
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin

# For some models such as CodeLlaMA, model type should be provided by `-a`
# Find `-a ...` option for each model in `docs/models.md`.
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a CodeLlaMA

Use -l to specify the path of the LoRA model to be merged, such as:

python3 convert.py -i path/to/model -l path/to/lora/model -o quantized.bin

Note: Only the HF format is supported (with a few exceptions); the format of the generated .bin files is different from the one (GGUF) used by llama.cpp.

Build

You have several options for building this project.

  • Using make:

    Prepare for using make on Windows:

    1. Download the latest Fortran version of w64devkit.
    2. Extract w64devkit on your PC.
    3. Run w64devkit.exe, then cd to the chatllm.cpp folder.

    Then build with:

    make

    The executable is ./obj/main.

  • Using CMake:

    cmake -B build
    # On Linux, WSL:
    cmake --build build -j
    # On Windows with MSVC:
    cmake --build build -j --config Release

    The executable is ./build/bin/main.

Run

Now you may chat with a quantized model by running:

./build/bin/main -m chatglm-ggml.bin                            # ChatGLM-6B
# 你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题。 (Hello👋! I am the AI assistant ChatGLM-6B, nice to meet you. Feel free to ask me anything.)
./build/bin/main -m llama2.bin  --seed 100                      # Llama-2-Chat-7B
# Hello! I'm here to help you with any questions or concerns ....

To run the model in interactive mode, add the -i flag. For example:

# On Windows
.\build\bin\Release\main -m model.bin -i

# On Linux (or WSL)
rlwrap ./build/bin/main -m model.bin -i

In interactive mode, your chat history will serve as the context for the next-round conversation.

Run ./build/bin/main -h to explore more options!

Acknowledgements

  • This project started as a refactoring of ChatGLM.cpp, without which this project would not have been possible.

  • Thanks to those who have released their model sources and checkpoints.

Note

This is my hobby project for learning DL & GGML, and it is under active development. PRs for new features won't be accepted, while PRs for bug fixes are warmly welcomed.
