llm_stream_endpoint

Minimal LLM Rust API streaming endpoint.

It is a minimalist service for interacting with an LLM in streaming mode.

It is designed to run a quantized version of Llama 2, Mistral, or Phi-2 on a CPU.

It is a very simple REST streaming API built with the following stack (a minimal endpoint sketch follows the list):

  • Rust
  • Warp
  • Candle
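
For orientation, here is a minimal sketch of what such a Warp streaming endpoint can look like. This is not the repository's actual code: the route and the JSON field are taken from the curl example further below, the dependencies (warp, tokio, futures-util, serde) are assumed, and the model-driven token stream is replaced by a placeholder that simply echoes the query word by word.

    use futures_util::stream;
    use serde::Deserialize;
    use warp::{hyper::Body, Filter};

    #[derive(Deserialize)]
    struct Query {
        query: String,
    }

    #[tokio::main]
    async fn main() {
        // POST /token_stream with a JSON body such as {"query": "..."}
        let route = warp::path("token_stream")
            .and(warp::post())
            .and(warp::body::json())
            .map(|q: Query| {
                // Placeholder: echo the query word by word. The real service
                // would push tokens here as the Candle model generates them.
                let tokens: Vec<Result<String, std::convert::Infallible>> = q
                    .query
                    .split_whitespace()
                    .map(|w| Ok(format!("{w} ")))
                    .collect();
                warp::reply::Response::new(Body::wrap_stream(stream::iter(tokens)))
            });

        warp::serve(route).run(([127, 0, 0, 1], 3030)).await;
    }

Because the response body is a chunked stream, a client such as curl --no-buffer sees tokens as they are produced rather than one final payload.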

How to use the service

The model is selected via a Cargo feature: phi-2 (the default), mistral, or llama.

A Makefile provides the clean, update, build, and run targets.

Prior to execution, please run:

make clean

and then

make update



To build the service, just type:

With Phi-2 (the default), type:

  • to build for CPU

make build

  • to build using CUDA

make build_cuda



Or, with Mistral, type:

  • to build for CPU

make FEATURE=mistral build

  • to build using CUDA

make FEATURE=mistral build_cuda



Or, with Llama, type:

  • to build for CPU

make FEATURE=llama build

  • to build using CUDA

make FEATURE=llama build_cuda



Then, to run it:

make run



Once launched, you can use the API as follows.

You can specify a custom model and a tokenizer file

Provided these models are compatible with Phi-2, Mistral, or Llama, you can specify your own Hugging Face repo and quantized file, as well as a custom tokenizer repo (the model file and the tokenizer are usually hosted in different repos).

You can type:

make FEATURE=mistral build

make run MODEL_REPO="Your quantized model repo" MODEL_FILE="Your quantized gguf file" TOKENIZER_REPO="Your tokenizer repo"

This is useful if you want to run a fine-tuned version of Phi-2, Mistral, or Llama.
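
Under the hood, a Candle-based service typically resolves such a repo/file pair through the hf-hub crate, as Candle's own examples do. The sketch below illustrates that mechanism under this assumption; it is not the repository's actual loading code, and the tokenizer.json filename is likewise an assumption.

    use hf_hub::api::sync::Api;

    fn main() -> anyhow::Result<()> {
        // Values that would normally come from MODEL_REPO / MODEL_FILE / TOKENIZER_REPO.
        let model_repo = "Your quantized model repo".to_string();
        let model_file = "Your quantized gguf file";
        let tokenizer_repo = "Your tokenizer repo".to_string();

        // Download (or reuse from the local Hugging Face cache) the quantized
        // weights and the tokenizer from their respective repos.
        let api = Api::new()?;
        let weights = api.model(model_repo).get(model_file)?;
        // Assumes the tokenizer repo exposes a standard tokenizer.json file.
        let tokenizer = api.model(tokenizer_repo).get("tokenizer.json")?;

        println!("weights:   {}", weights.display());
        println!("tokenizer: {}", tokenizer.display());
        Ok(())
    }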



For example, here is a Phi-2 model fine-tuned using the guidelines described in the tutorial https://youtu.be/J0RbOtLrJhQ?si=2lcEAzxX-ToeMPWR

make run MODEL_REPO="fcn94/phi-2-finetuned-med-text" MODEL_FILE="model-v2-q4k.gguf" TOKENIZER_REPO="fcn94/phi-2-finetuned-med-text"

You can test the following prompt with the standard Phi-2 model and with this fine-tuned model:

curl -X POST -H "Content-Type: application/json" --no-buffer 'http://127.0.0.1:3030/token_stream' -d '{"query":"I have a headache with low fever. What should I do ?"}'
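
If you would rather call the endpoint from Rust than from curl, here is a hedged client sketch. Only the route and the JSON field come from the curl example above; the client itself (reqwest with its json and stream features, plus tokio, serde_json and futures-util) is an assumption and not part of this repository.

    use std::io::Write;

    use futures_util::StreamExt;
    use serde_json::json;

    #[tokio::main]
    async fn main() -> anyhow::Result<()> {
        let client = reqwest::Client::new();
        let mut stream = client
            .post("http://127.0.0.1:3030/token_stream")
            .json(&json!({ "query": "I have a headache with low fever. What should I do ?" }))
            .send()
            .await?
            .bytes_stream();

        // Print each chunk as soon as it arrives instead of waiting for the full answer.
        while let Some(chunk) = stream.next().await {
            print!("{}", String::from_utf8_lossy(&chunk?));
            std::io::stdout().flush()?;
        }
        Ok(())
    }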

What about specific targeted open-source models?

For this repo, the phi-2 and mistral features use GGUF files generated by Candle's 'tensor-tools'.

Most open-source GGUF files on Hugging Face follow the Llama format.

If you are using such a file, here is a suggested approach.

You can type:

make FEATURE=llama build

and

make run MODEL_REPO="Your quantized model repo" MODEL_FILE="Your quantized gguf file" TOKENIZER_REPO="Your tokenizer repo"

For example, using a popular repo:

make run MODEL_REPO="TheBloke/Mistral-7B-Instruct-v0.2-GGUF" MODEL_FILE="mistral-7b-instruct-v0.2.Q4_K_M.gguf" TOKENIZER_REPO="mistralai/Mistral-7B-Instruct-v0.2"

You can specify a context type (general, sql, classifier, math)

Four context prompts are defined in ./config/prompt_comfig.toml (a sketch of a possible layout follows the examples below).

You can type:

for default (general)

make run

or

for classifier

make run CONTEXT_TYPE=classifier

or

for sql

make run CONTEXT_TYPE=sql

or

for math

make run CONTEXT_TYPE=math
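
The exact layout of ./config/prompt_comfig.toml is not shown here, so the structure below is only an assumption: a sketch of how a table mapping each context type to a prompt prefix could be loaded with serde and the toml crate, with the context selected via the CONTEXT_TYPE variable as in the make invocations above.

    use std::collections::HashMap;

    use serde::Deserialize;

    // Hypothetical layout: one prompt prefix per context type. The real file
    // in ./config may use different keys and prompt texts.
    #[derive(Debug, Deserialize)]
    struct PromptConfig {
        contexts: HashMap<String, String>,
    }

    fn main() -> anyhow::Result<()> {
        let raw = r#"
            [contexts]
            general    = "You are a helpful assistant."
            sql        = "Translate the request into a SQL query."
            classifier = "Classify the request into one of the known labels."
            math       = "Solve the problem step by step."
        "#;

        let config: PromptConfig = toml::from_str(raw)?;
        let context_type = std::env::var("CONTEXT_TYPE").unwrap_or_else(|_| "general".into());
        let prompt = config
            .contexts
            .get(&context_type)
            .expect("unknown context type");
        println!("{prompt}");
        Ok(())
    }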

