Merge pull request #198 from imjwang/issue-197-cpu-quant
ggml quant cpu mps support
PromtEngineer committed Jul 4, 2023
2 parents 925d63c + ec150d8 commit 89ac90d
Showing 3 changed files with 83 additions and 48 deletions.
60 changes: 37 additions & 23 deletions README.md
@@ -100,6 +100,41 @@ In order to ask a question, run a command like:
python run_localGPT.py --device_type cpu
```

# Run quantized for M1/M2:

GGML quantized models for Apple Silicon (M1/M2) are supported through the llama-cpp library ([example](https://huggingface.co/TheBloke/Wizard-Vicuna-13B-Uncensored-GGML)). GPTQ quantized models, which rely on auto-gptq, will not work on Apple Silicon ([see here](https://github.com/PanQiWei/AutoGPTQ/issues/133#issuecomment-1575002893)). GGML models run on either CPU or MPS.
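
As a quick reference, here is a minimal sketch of loading such a GGML model with llama-cpp through LangChain. The filename below is only an assumption; pick the exact `.bin` name from the model repo's "Files and versions" tab.

```python
from huggingface_hub import hf_hub_download
from langchain.llms import LlamaCpp

# Download one GGML quantization variant (filename is an assumption; check the repo)
model_path = hf_hub_download(
    repo_id="TheBloke/Wizard-Vicuna-13B-Uncensored-GGML",
    filename="Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin",
)

# Runs on CPU by default; llama-cpp handles the quantized weights
llm = LlamaCpp(model_path=model_path, n_ctx=2048, max_tokens=2048, temperature=0)
print(llm("What does GGML quantization change about a model?"))
```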

## Troubleshooting

**Install MPS:**
1- Follow this [page](https://developer.apple.com/metal/pytorch/) to set up PyTorch with Metal Performance Shaders (MPS) support. PyTorch uses the MPS backend for GPU acceleration on Apple Silicon. It is good practice to verify MPS support with a simple Python script, as mentioned in the provided link; a minimal check is included after the commands below.

2- Following that page, here is an example of the commands you might run in your terminal:

```shell
xcode-select --install
conda install pytorch torchvision torchaudio -c pytorch-nightly
pip install chardet
pip install cchardet
pip uninstall charset_normalizer
pip install charset_normalizer
pip install pdfminer.six
pip install xformers
```
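
To confirm that MPS is actually available after these steps, here is a minimal check using PyTorch's public API (run it in the same environment you installed into):

```python
import torch

# Both should print True on Apple Silicon with an MPS-enabled PyTorch build
print(torch.backends.mps.is_available())  # an MPS device can be used right now
print(torch.backends.mps.is_built())      # this PyTorch build was compiled with MPS support
```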

**Upgrade packages:**
Your langchain or llama-cpp-python version may be outdated. Upgrade your packages by re-running the install:

```shell
pip install -r requirements.txt
```

If you are still getting errors, try installing the latest llama-cpp-python with the flags below ([see thread](https://github.com/abetlen/llama-cpp-python/issues/317#issuecomment-1587962205)):

```shell
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir
```

# Run the UI

1. Start by opening up `run_localGPT_API.py` in a code editor of your choice. If you are using a GPU, skip to step 3.
@@ -158,7 +193,7 @@ The following will provide instructions on how you can select a different LLM model
5. For models whose names end with HF or that have a .bin file in the "Files and versions" section of their HuggingFace page.

- Make sure you have a model_id selected. For example -> `model_id = "TheBloke/guanaco-7B-HF"`
- If you go to its HuggingFace [Site] (https://huggingface.co/TheBloke/guanaco-7B-HF) and go to "Files and versions" you will notice model files that end with a .bin extension.
- If you go to its HuggingFace [repo](https://huggingface.co/TheBloke/guanaco-7B-HF) and go to "Files and versions" you will notice model files that end with a .bin extension.
- Any model file with a .bin extension will be loaded by the code that follows the `# load the LLM for generating Natural Language responses` comment.
- `model_id = "TheBloke/guanaco-7B-HF"`

@@ -168,7 +203,7 @@ The following will provide instructions on how you can select a different LLM model

- Make sure you have a model_id selected. For example -> `model_id = "TheBloke/wizardLM-7B-GPTQ"`
- You will also need its model basename file selected. For example -> `model_basename = "wizardLM-7B-GPTQ-4bit.compat.no-act-order.safetensors"`
- If you go to its HuggingFace [Site] (https://huggingface.co/TheBloke/wizardLM-7B-GPTQ) and go to "Files and versions" you will notice a model file that ends with a .safetensors extension.
- If you go to its HuggingFace [repo](https://huggingface.co/TheBloke/wizardLM-7B-GPTQ) and go to "Files and versions" you will notice a model file that ends with a .safetensors extension.
- Any model file whose name contains no-act-order or ends with a .safetensors extension will be loaded by the code that follows the `# load the LLM for generating Natural Language responses` comment (see the sketch after this list).
- `model_id = "TheBloke/WizardLM-7B-uncensored-GPTQ"`

@@ -203,27 +238,6 @@ To install a C++ compiler on Windows 10/11, follow these steps:

Follow this [page](https://linuxconfig.org/how-to-install-the-nvidia-drivers-on-ubuntu-22-04) to install NVIDIA Drivers.

### M1/M2 Macbook users:

1- Follow this [page](https://developer.apple.com/metal/pytorch/) to set up PyTorch with Metal Performance Shaders (MPS) support. PyTorch uses the MPS backend for GPU acceleration on Apple Silicon. It is good practice to verify MPS support with a simple Python script, as mentioned in the provided link.

2- Following that page, here is an example of the commands you might run in your terminal:

```shell
xcode-select --install
conda install pytorch torchvision torchaudio -c pytorch-nightly
pip install chardet
pip install cchardet
pip uninstall charset_normalizer
pip install charset_normalizer
pip install pdfminer.six
pip install xformers
```

3- Please keep in mind that quantized models are not yet supported on Apple Silicon (M1/M2) by the auto-gptq library used for loading quantized models, [see here](https://github.com/PanQiWei/AutoGPTQ/issues/133#issuecomment-1575002893). Therefore, you will not be able to run quantized models on M1/M2.



## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=PromtEngineer/localGPT&type=Date)](https://star-history.com/#PromtEngineer/localGPT&Date)
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,7 +1,7 @@
# Natural Language Processing
langchain==0.0.191
chromadb==0.3.22
llama-cpp-python==0.1.48
llama-cpp-python==0.1.66
pdfminer.six==20221105
InstructorEmbedding
sentence-transformers
69 changes: 45 additions & 24 deletions run_localGPT.py
@@ -3,9 +3,10 @@
import click
import torch
from auto_gptq import AutoGPTQForCausalLM
from huggingface_hub import hf_hub_download
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.llms import HuggingFacePipeline, LlamaCpp

# from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.vectorstores import Chroma
@@ -39,33 +40,45 @@ def load_model(device_type, model_id, model_basename=None):
Raises:
ValueError: If an unsupported model or device type is provided.
"""
if device_type.lower() in ["cpu", "mps"]:
model_basename = None

logging.info(f"Loading Model: {model_id}, on: {device_type}")
logging.info("This action can take a few minutes!")

if model_basename is not None:
# The code supports all huggingface models that end with GPTQ and have some variation
# of .no-act.order or .safetensors in their HF repo.
logging.info("Using AutoGPTQForCausalLM for quantized models")

if ".safetensors" in model_basename:
# Remove the ".safetensors" ending if present
model_basename = model_basename.replace(".safetensors", "")

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
logging.info("Tokenizer loaded")

model = AutoGPTQForCausalLM.from_quantized(
model_id,
model_basename=model_basename,
use_safetensors=True,
trust_remote_code=True,
device="cuda:0",
use_triton=False,
quantize_config=None,
)
if device_type.lower() in ["cpu", "mps"]:
logging.info("Using Llamacpp for quantized models")
model_path = hf_hub_download(repo_id=model_id, filename=model_basename)
if device_type.lower() == "mps":
return LlamaCpp(
model_path=model_path,
n_ctx=2048,
max_tokens=2048,
temperature=0,
repeat_penalty=1.15,
n_gpu_layers=1000,
)
return LlamaCpp(model_path=model_path, n_ctx=2048, max_tokens=2048, temperature=0, repeat_penalty=1.15)

else:
# The code supports all huggingface models that end with GPTQ and have some variation
# of .no-act.order or .safetensors in their HF repo.
logging.info("Using AutoGPTQForCausalLM for quantized models")

if ".safetensors" in model_basename:
# Remove the ".safetensors" ending if present
model_basename = model_basename.replace(".safetensors", "")

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
logging.info("Tokenizer loaded")

model = AutoGPTQForCausalLM.from_quantized(
model_id,
model_basename=model_basename,
use_safetensors=True,
trust_remote_code=True,
device="cuda:0",
use_triton=False,
quantize_config=None,
)
elif (
device_type.lower() == "cuda"
): # The code supports all huggingface models that ends with -HF or which have a .bin
@@ -198,6 +211,14 @@ def main(device_type, show_sources):
# model_id = "TheBloke/WizardLM-7B-uncensored-GPTQ"
# model_basename = "WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors"

# for GGML (quantized cpu+gpu+mps) models - check if they support llama.cpp
# model_id = "TheBloke/wizard-vicuna-13B-GGML"
# model_basename = "wizard-vicuna-13B.ggmlv3.q4_0.bin"
# model_basename = "wizard-vicuna-13B.ggmlv3.q6_K.bin"
# model_basename = "wizard-vicuna-13B.ggmlv3.q2_K.bin"
# model_id = "TheBloke/orca_mini_3B-GGML"
# model_basename = "orca-mini-3b.ggmlv3.q4_0.bin"

llm = load_model(device_type, model_id=model_id, model_basename=model_basename)

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True)
