Auto-Ollama & Auto-GGUF ⚡️

Run Inference on or Quantize Large Language Models (LLMs) Locally with a Single Command

Overview

Auto-Ollama is a toolkit designed to simplify inference and quantization of Large Language Models (LLMs) directly in your local environment. With an emphasis on ease of use and flexibility, Auto-Ollama supports both running pre-quantized models directly and converting models into an efficient format (GGUF) for local deployment.

For quantization, check out the new package Auto-QuantLLM ⚡️. It is currently under development, but it aims to provide a streamlined, user-friendly approach to quantizing large language models (LLMs) with various quantization methods.

Getting Started

Installation

Clone the repository to get started with Auto-Ollama:

git clone https://github.com/monk1337/auto-ollama.git
cd auto-ollama
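
Depending on your environment, the scripts may need execute permission before they can be run; granting it with chmod is one way to do this (an assumption, not documented in the repository):

# Assumption: grant execute permission to the scripts if needed
chmod +x scripts/autollama.sh scripts/autogguf.sh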

Quick Tour

Running Auto-Ollama

Use the autollama.sh script to quickly run inference on LLMs. The script requires the Hugging Face model path and the quantized GGUF file name as arguments.

# Deploy Large Language Models (LLMs) locally with Auto-Ollama
# Usage:
# ./scripts/autollama.sh -m <model path> -g <gguf file name>


# Example command:
./scripts/autollama.sh -m TheBloke/MistralLite-7B-GGUF -g mistrallite.Q4_K_M.gguf
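
Assuming autollama.sh registers the model with a locally installed Ollama instance, you can then interact with it through the standard Ollama CLI. The model name below is an assumption derived from the GGUF file name; use whatever name the script reports:

# Hypothetical follow-up: chat with the newly registered model.
# "mistrallite" is an assumed name; check the script's output for the actual one.
ollama run mistrallite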

Handling Non-Quantized Models with AutoGGUF

If your desired model is not available in a quantized format suitable for local deployment, Auto-Ollama offers the AutoGGUF utility. This tool can convert any Hugging Face model into the GGUF format and, optionally, upload the result to the Hugging Face model hub.

# Convert your Hugging Face model to GGUF format for local deployment
# Usage:
# ./scripts/autogguf.sh -m <MODEL_ID> [-u USERNAME] [-t TOKEN] [-q QUANTIZATION_METHODS]

# Example command:
./scripts/autogguf.sh -m unsloth/gemma-2b

More Options

# To upload the GGUF model to the Hugging Face Hub after conversion,
# provide your username and access token.
# Example command:
./scripts/autogguf.sh -m unsloth/gemma-2b -u user_name -t hf_token


# To specify the quantization methods explicitly:
# Example command:
./scripts/autogguf.sh -m unsloth/gemma-2b -u user_name -t hf_token -q "q4_k_m,q5_k_m"
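
Once converted (and uploaded, if you provided credentials), the quantized model can be served locally with autollama.sh. The repository and file names below are assumptions based on common GGUF naming conventions; substitute the names that AutoGGUF actually prints:

# Assumed naming: <username>/<model>-GGUF and <model>.<METHOD>.gguf
./scripts/autollama.sh -m user_name/gemma-2b-GGUF -g gemma-2b.Q4_K_M.gguf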

Quantization Recommendations

  • Use Q5_K_M for the best balance of performance and resource usage (see the example after this list).
  • Q4_K_M is a good choice if you need to save memory.
  • K_M versions generally perform better than K_S.
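
For instance, following the recommendation above, you could quantize with only the Q5_K_M method (a minimal sketch reusing the example model from earlier; the upload flags are optional and omitted here):

# Example: quantize with only the recommended Q5_K_M method
./scripts/autogguf.sh -m unsloth/gemma-2b -q "q5_k_m"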

Support and Contributions

For issues, suggestions, or contributions, please open an issue or pull request in the GitHub repository. We welcome contributions from the community to make Auto-Ollama even better!