llm_stream_endpoint

Minimal LLM Rust API streaming endpoint.

It is a minimalist service for interacting with an LLM in streaming mode.

It is designed to run a quantized version of Llama 2, Mistral, or Phi-2 on a CPU.

It is a very simple REST streaming API built with the following stack (a minimal endpoint sketch follows the list):

  • Rust
  • Warp
  • Candle
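
For orientation, here is a minimal sketch of what such a Warp streaming endpoint can look like. This is not the repository's actual code: the route and the JSON field are taken from the curl example further below, the dependencies (warp, tokio, futures-util, serde) are assumed, and the model-driven token stream is replaced by a placeholder that simply echoes the query word by word.

    use futures_util::stream;
    use serde::Deserialize;
    use warp::{hyper::Body, Filter};

    #[derive(Deserialize)]
    struct Query {
        query: String,
    }

    #[tokio::main]
    async fn main() {
        // POST /token_stream with a JSON body such as {"query": "..."}
        let route = warp::path("token_stream")
            .and(warp::post())
            .and(warp::body::json())
            .map(|q: Query| {
                // Placeholder: echo the query word by word. The real service
                // would push tokens here as the Candle model generates them.
                let tokens: Vec<Result<String, std::convert::Infallible>> = q
                    .query
                    .split_whitespace()
                    .map(|w| Ok(format!("{w} ")))
                    .collect();
                warp::reply::Response::new(Body::wrap_stream(stream::iter(tokens)))
            });

        warp::serve(route).run(([127, 0, 0, 1], 3030)).await;
    }

Because the response body is a chunked stream, a client such as curl --no-buffer sees tokens as they are produced rather than one final payload.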

How to use the service

The model is selected via a Cargo feature: phi-2 (the default), mistral, or llama.

A Makefile provides the clean, update, build, and run targets.

Prior to execution, please run:

make clean

and then

make update



To build the service, just type:

With Phi-2 (the default), type:

  • to build for CPU

make build

  • to build using CUDA

make build_cuda



Or, with Mistral, type:

  • to build for CPU

make FEATURE=mistral build

  • to build using CUDA

make FEATURE=mistral build_cuda



Or, with Llama, type:

  • to build for CPU

make FEATURE=llama build

  • to build using CUDA

make FEATURE=llama build_cuda



Then, to run it:

make run



Once launched, you can use the API as follows.

You can specify a custom model and a tokenizer file

Provided these models are compatible with Phi-2, Mistral, or Llama, you can specify your own Hugging Face repo and quantized file, as well as a custom tokenizer repo (the model file and the tokenizer are usually hosted in different repos).

You can type:

make FEATURE=mistral build

make run MODEL_REPO="Your quantized model repo" MODEL_FILE="Your quantized gguf file" TOKENIZER_REPO="Your tokenizer repo"

This is useful if you want to run a fine-tuned version of Phi-2, Mistral, or Llama.
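
Under the hood, a Candle-based service typically resolves such a repo/file pair through the hf-hub crate, as Candle's own examples do. The sketch below illustrates that mechanism under this assumption; it is not the repository's actual loading code, and the tokenizer.json filename is likewise an assumption.

    use hf_hub::api::sync::Api;

    fn main() -> anyhow::Result<()> {
        // Values that would normally come from MODEL_REPO / MODEL_FILE / TOKENIZER_REPO.
        let model_repo = "Your quantized model repo".to_string();
        let model_file = "Your quantized gguf file";
        let tokenizer_repo = "Your tokenizer repo".to_string();

        // Download (or reuse from the local Hugging Face cache) the quantized
        // weights and the tokenizer from their respective repos.
        let api = Api::new()?;
        let weights = api.model(model_repo).get(model_file)?;
        // Assumes the tokenizer repo exposes a standard tokenizer.json file.
        let tokenizer = api.model(tokenizer_repo).get("tokenizer.json")?;

        println!("weights:   {}", weights.display());
        println!("tokenizer: {}", tokenizer.display());
        Ok(())
    }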



For example, here is a Phi-2 model fine-tuned using the guidelines described in the tutorial https://youtu.be/J0RbOtLrJhQ?si=2lcEAzxX-ToeMPWR

make run MODEL_REPO="fcn94/phi-2-finetuned-med-text" MODEL_FILE="model-v2-q4k.gguf" TOKENIZER_REPO="fcn94/phi-2-finetuned-med-text"

You can test the following prompt with the standard Phi-2 model and with this fine-tuned model:

curl -X POST -H "Content-Type: application/json" --no-buffer 'http://127.0.0.1:3030/token_stream' -d '{"query":"I have a headache with low fever. What should I do ?"}'
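
If you would rather call the endpoint from Rust than from curl, here is a hedged client sketch. Only the route and the JSON field come from the curl example above; the client itself (reqwest with its json and stream features, plus tokio, serde_json and futures-util) is an assumption and not part of this repository.

    use std::io::Write;

    use futures_util::StreamExt;
    use serde_json::json;

    #[tokio::main]
    async fn main() -> anyhow::Result<()> {
        let client = reqwest::Client::new();
        let mut stream = client
            .post("http://127.0.0.1:3030/token_stream")
            .json(&json!({ "query": "I have a headache with low fever. What should I do ?" }))
            .send()
            .await?
            .bytes_stream();

        // Print each chunk as soon as it arrives instead of waiting for the full answer.
        while let Some(chunk) = stream.next().await {
            print!("{}", String::from_utf8_lossy(&chunk?));
            std::io::stdout().flush()?;
        }
        Ok(())
    }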

What about specific targeted open-source models?

For this repo, the phi-2 and mistral features use GGUF files generated by Candle's 'tensor-tools'.

Most open-source GGUF files on Hugging Face follow the Llama format.

If you are using such a file, here is a suggested approach.

You can type:

make FEATURE=llama build

and

make run MODEL_REPO="Your quantized model repo" MODEL_FILE="Your quantized gguf file" TOKENIZER_REPO="Your tokenizer repo"

For example, using a popular repo:

make run MODEL_REPO="TheBloke/Mistral-7B-Instruct-v0.2-GGUF" MODEL_FILE="mistral-7b-instruct-v0.2.Q4_K_M.gguf" TOKENIZER_REPO="mistralai/Mistral-7B-Instruct-v0.2"

You can specify a context type (general, sql, classifier, math)

Four context prompts are defined in ./config/prompt_comfig.toml (a sketch of a possible layout follows the examples below).

You can type:

for default (general)

make run

or

for classifier

make run CONTEXT_TYPE=classifier

or

for sql

make run CONTEXT_TYPE=sql

or

for math

make run CONTEXT_TYPE=math
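
The exact layout of ./config/prompt_comfig.toml is not shown here, so the structure below is only an assumption: a sketch of how a table mapping each context type to a prompt prefix could be loaded with serde and the toml crate, with the context selected via the CONTEXT_TYPE variable as in the make invocations above.

    use std::collections::HashMap;

    use serde::Deserialize;

    // Hypothetical layout: one prompt prefix per context type. The real file
    // in ./config may use different keys and prompt texts.
    #[derive(Debug, Deserialize)]
    struct PromptConfig {
        contexts: HashMap<String, String>,
    }

    fn main() -> anyhow::Result<()> {
        let raw = r#"
            [contexts]
            general    = "You are a helpful assistant."
            sql        = "Translate the request into a SQL query."
            classifier = "Classify the request into one of the known labels."
            math       = "Solve the problem step by step."
        "#;

        let config: PromptConfig = toml::from_str(raw)?;
        let context_type = std::env::var("CONTEXT_TYPE").unwrap_or_else(|_| "general".into());
        let prompt = config
            .contexts
            .get(&context_type)
            .expect("unknown context type");
        println!("{prompt}");
        Ok(())
    }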

