Merge pull request #198 from imjwang/issue-197-cpu-quant
ggml quant cpu mps support
PromtEngineer committed Jul 4, 2023
2 parents 925d63c + ec150d8 commit 89ac90d
Showing 3 changed files with 83 additions and 48 deletions.
60 changes: 37 additions & 23 deletions README.md
@@ -100,6 +100,41 @@ In order to ask a question, run a command like:
python run_localGPT.py --device_type cpu
```

# Run quantized for M1/M2:

GGML quantized models for Apple Silicon (M1/M2) are supported through the llama-cpp library ([example](https://huggingface.co/TheBloke/Wizard-Vicuna-13B-Uncensored-GGML)). GPTQ quantized models, which rely on auto-gptq, will not work on Apple Silicon ([see here](https://github.com/PanQiWei/AutoGPTQ/issues/133#issuecomment-1575002893)). GGML models run on either CPU or MPS.
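
As a quick reference, here is a minimal sketch of loading such a GGML model with llama-cpp through LangChain. The filename below is only an assumption; pick the exact `.bin` name from the model repo's "Files and versions" tab.

```python
from huggingface_hub import hf_hub_download
from langchain.llms import LlamaCpp

# Download one GGML quantization variant (filename is an assumption; check the repo)
model_path = hf_hub_download(
    repo_id="TheBloke/Wizard-Vicuna-13B-Uncensored-GGML",
    filename="Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin",
)

# Runs on CPU by default; llama-cpp handles the quantized weights
llm = LlamaCpp(model_path=model_path, n_ctx=2048, max_tokens=2048, temperature=0)
print(llm("What does GGML quantization change about a model?"))
```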

## Troubleshooting

**Install MPS:**
1- Follow this [page](https://developer.apple.com/metal/pytorch/) to set up PyTorch with Metal Performance Shaders (MPS) support. PyTorch uses the MPS backend for GPU acceleration on Apple Silicon. It is good practice to verify MPS support with a simple Python script, as mentioned in the provided link; a minimal check is included after the commands below.

2- Following that page, here is an example of the commands you might run in your terminal:

```shell
xcode-select --install
conda install pytorch torchvision torchaudio -c pytorch-nightly
pip install chardet
pip install cchardet
pip uninstall charset_normalizer
pip install charset_normalizer
pip install pdfminer.six
pip install xformers
```
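
To confirm that MPS is actually available after these steps, here is a minimal check using PyTorch's public API (run it in the same environment you installed into):

```python
import torch

# Both should print True on Apple Silicon with an MPS-enabled PyTorch build
print(torch.backends.mps.is_available())  # an MPS device can be used right now
print(torch.backends.mps.is_built())      # this PyTorch build was compiled with MPS support
```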

**Upgrade packages:**
Your langchain or llama-cpp-python version may be outdated. Upgrade your packages by re-running the install:

```shell
pip install -r requirements.txt
```

If you are still getting errors, try installing the latest llama-cpp-python with the flags below ([see thread](https://github.com/abetlen/llama-cpp-python/issues/317#issuecomment-1587962205)):

```shell
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir
```

# Run the UI

1. Start by opening up `run_localGPT_API.py` in a code editor of your choice. If you are using a GPU, skip to step 3.
@@ -158,7 +193,7 @@ The following will provide instructions on how you can select a different LLM model
5. For models whose names end with HF or that have a .bin file in the "Files and versions" section of their HuggingFace page.

- Make sure you have a model_id selected. For example -> `model_id = "TheBloke/guanaco-7B-HF"`
- If you go to its HuggingFace [Site] (https://huggingface.co/TheBloke/guanaco-7B-HF) and go to "Files and versions" you will notice model files that end with a .bin extension.
- If you go to its HuggingFace [repo](https://huggingface.co/TheBloke/guanaco-7B-HF) and go to "Files and versions" you will notice model files that end with a .bin extension.
- Any model file with a .bin extension will be loaded by the code that follows the `# load the LLM for generating Natural Language responses` comment.
- `model_id = "TheBloke/guanaco-7B-HF"`

@@ -168,7 +203,7 @@ The following will provide instructions on how you can select a different LLM model

- Make sure you have a model_id selected. For example -> `model_id = "TheBloke/wizardLM-7B-GPTQ"`
- You will also need its model basename file selected. For example -> `model_basename = "wizardLM-7B-GPTQ-4bit.compat.no-act-order.safetensors"`
- If you go to its HuggingFace [Site] (https://huggingface.co/TheBloke/wizardLM-7B-GPTQ) and go to "Files and versions" you will notice a model file that ends with a .safetensors extension.
- If you go to its HuggingFace [repo](https://huggingface.co/TheBloke/wizardLM-7B-GPTQ) and go to "Files and versions" you will notice a model file that ends with a .safetensors extension.
- Any model file whose name contains no-act-order or ends with a .safetensors extension will be loaded by the code that follows the `# load the LLM for generating Natural Language responses` comment (see the sketch after this list).
- `model_id = "TheBloke/WizardLM-7B-uncensored-GPTQ"`

@@ -203,27 +238,6 @@ To install a C++ compiler on Windows 10/11, follow these steps:

Follow this [page](https://linuxconfig.org/how-to-install-the-nvidia-drivers-on-ubuntu-22-04) to install NVIDIA Drivers.

### M1/M2 Macbook users:

1- Follow this [page](https://developer.apple.com/metal/pytorch/) to set up PyTorch with Metal Performance Shaders (MPS) support. PyTorch uses the MPS backend for GPU acceleration on Apple Silicon. It is good practice to verify MPS support with a simple Python script, as mentioned in the provided link.

2- Following that page, here is an example of the commands you might run in your terminal:

```shell
xcode-select --install
conda install pytorch torchvision torchaudio -c pytorch-nightly
pip install chardet
pip install cchardet
pip uninstall charset_normalizer
pip install charset_normalizer
pip install pdfminer.six
pip install xformers
```

3- Please keep in mind that quantized models are not yet supported on Apple Silicon (M1/M2) by the auto-gptq library used for loading quantized models, [see here](https://github.com/PanQiWei/AutoGPTQ/issues/133#issuecomment-1575002893). Therefore, you will not be able to run quantized models on M1/M2.



## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=PromtEngineer/localGPT&type=Date)](https://star-history.com/#PromtEngineer/localGPT&Date)
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,7 +1,7 @@
# Natural Language Processing
langchain==0.0.191
chromadb==0.3.22
llama-cpp-python==0.1.48
llama-cpp-python==0.1.66
pdfminer.six==20221105
InstructorEmbedding
sentence-transformers
69 changes: 45 additions & 24 deletions run_localGPT.py
@@ -3,9 +3,10 @@
import click
import torch
from auto_gptq import AutoGPTQForCausalLM
from huggingface_hub import hf_hub_download
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.llms import HuggingFacePipeline, LlamaCpp

# from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.vectorstores import Chroma
@@ -39,33 +40,45 @@ def load_model(device_type, model_id, model_basename=None):
Raises:
ValueError: If an unsupported model or device type is provided.
"""
if device_type.lower() in ["cpu", "mps"]:
model_basename = None

logging.info(f"Loading Model: {model_id}, on: {device_type}")
logging.info("This action can take a few minutes!")

if model_basename is not None:
# The code supports all huggingface models that end with GPTQ and have some variation
# of .no-act.order or .safetensors in their HF repo.
logging.info("Using AutoGPTQForCausalLM for quantized models")

if ".safetensors" in model_basename:
# Remove the ".safetensors" ending if present
model_basename = model_basename.replace(".safetensors", "")

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
logging.info("Tokenizer loaded")

model = AutoGPTQForCausalLM.from_quantized(
model_id,
model_basename=model_basename,
use_safetensors=True,
trust_remote_code=True,
device="cuda:0",
use_triton=False,
quantize_config=None,
)
if device_type.lower() in ["cpu", "mps"]:
logging.info("Using Llamacpp for quantized models")
model_path = hf_hub_download(repo_id=model_id, filename=model_basename)
if device_type.lower() == "mps":
return LlamaCpp(
model_path=model_path,
n_ctx=2048,
max_tokens=2048,
temperature=0,
repeat_penalty=1.15,
n_gpu_layers=1000,
)
return LlamaCpp(model_path=model_path, n_ctx=2048, max_tokens=2048, temperature=0, repeat_penalty=1.15)

else:
# The code supports all huggingface models that end with GPTQ and have some variation
# of .no-act.order or .safetensors in their HF repo.
logging.info("Using AutoGPTQForCausalLM for quantized models")

if ".safetensors" in model_basename:
# Remove the ".safetensors" ending if present
model_basename = model_basename.replace(".safetensors", "")

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
logging.info("Tokenizer loaded")

model = AutoGPTQForCausalLM.from_quantized(
model_id,
model_basename=model_basename,
use_safetensors=True,
trust_remote_code=True,
device="cuda:0",
use_triton=False,
quantize_config=None,
)
elif (
device_type.lower() == "cuda"
): # The code supports all huggingface models that ends with -HF or which have a .bin
@@ -198,6 +211,14 @@ def main(device_type, show_sources):
# model_id = "TheBloke/WizardLM-7B-uncensored-GPTQ"
# model_basename = "WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors"

# for GGML (quantized cpu+gpu+mps) models - check if they support llama.cpp
# model_id = "TheBloke/wizard-vicuna-13B-GGML"
# model_basename = "wizard-vicuna-13B.ggmlv3.q4_0.bin"
# model_basename = "wizard-vicuna-13B.ggmlv3.q6_K.bin"
# model_basename = "wizard-vicuna-13B.ggmlv3.q2_K.bin"
# model_id = "TheBloke/orca_mini_3B-GGML"
# model_basename = "orca-mini-3b.ggmlv3.q4_0.bin"

llm = load_model(device_type, model_id=model_id, model_basename=model_basename)

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True)
