KonradHoeffner/localgpt

Chat with the textbook "Health Information Systems - Technological and Management Perspectives" (2023)

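Chat with the textbook using the 4-bit quantized LLaMA 2 7b Chat model and LangChain. This is a different approach from finetuning the language model on the textbook: instead, the book is split into chunks that are transformed into embeddings, and the chunks most similar to a question are passed to the model as context.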

Requirements

The 4-bit 7b-parameter model is 3.5 GB in size, and it needs to fit into CPU or GPU RAM. CPU inference is very slow, around 30 s per answer on an Intel i9-10900K. GPU inference requires an NVIDIA GPU with CUDA 11 installed. Later we can try it with the 30b model.

Resources

  • Konrad's home desktop: 48 GB RAM, NVIDIA 3070 with 8 GB GPU RAM.
  • Konrad's office desktop: 32 GB RAM, no GPU.
  • Server 1: 8xNvidia Tesla A30 with 24 GB GPU RAM
  • Server 2: 4xNvidia Tesla V100 with 32 GB GPU RAM

First experiments are done with LLaMA 2 7b on the desktop. Later, LLaMA 2 30b can be used on the server when it comes out. LLaMA 2 70b would need several GPUs; it is not yet clear whether that can be done with localGPT.

Problems

  • the default GGML model does not work with CUDA yet, so we use GPTQ instead

Arch Linux Support

Arch Linux has CUDA 12, which is not supported by PyTorch. I tried installing CUDA 11.7 from the Arch Linux package archives, but that depends on gcc 11, which depends on gcc-libs 11, and downgrading that breaks the system. I also tried installing gcc 11 from AUR, but that was still compiling after an hour, and it is unclear whether CUDA will actually choose the correct one if both are installed. I could install gcc 11 in the local Conda environment, but CUDA 12 wasn't using that.

Right now I am trying with cuda-11.0 and cudnn8-cuda11.0 from AUR. This still fails at pip install . in the AutoGPTQ directory (version 0.2.2) with RuntimeError: The current installed version of g++ (13.1.1) is greater than the maximum required version by CUDA 11.0. Please make sure to use an adequate version of g++ (>=5.0.0, <10.0). Manually changing the maximum g++ version in cpp_extension.py also produces an error, so the limit seems to be there for a reason.

This could be dealt with using Docker. That would make a very large container, but it may still be worth investigating because of the difficult setup. It is unclear whether that works, though, because CUDA would need to be installed in the container but depends on system drivers. On systems with CUDA 11 it shouldn't be a problem.

This is a fork of localGPT; the original readme with setup instructions is included below. You need some space on your drive for the model and for large libraries and dependencies like CUDA. Set up and run:

python ingest.py
python run

localGPT

LocalGPT is an open-source initiative that allows you to converse with your documents without compromising your privacy. With everything running locally, you can be assured that no data ever leaves your computer. Dive into the world of secure, local document interactions with LocalGPT.

Features 🌟

  • Utmost Privacy: Your data remains on your computer, ensuring 100% security.
  • Versatile Model Support: Seamlessly integrate a variety of open-source models, including HF, GPTQ, GGML, and GGUF.
  • Diverse Embeddings: Choose from a range of open-source embeddings.
  • Reuse Your LLM: Once downloaded, reuse your LLM without the need for repeated downloads.
  • Chat History: Remembers your previous conversations (in a session).
  • API: LocalGPT has an API that you can use for building RAG Applications.
  • Graphical Interface: LocalGPT comes with two GUIs, one uses the API and the other is standalone (based on streamlit).
  • GPU, CPU & MPS Support: Supports multiple platforms out of the box. Chat with your data using CUDA, CPU or MPS and more!

Technical Details 🛠️

By selecting the right local models and using the power of LangChain, you can run the entire RAG pipeline locally, without any data leaving your environment, and with reasonable performance.

  • ingest.py uses LangChain tools to parse the document and create embeddings locally using InstructorEmbeddings. It then stores the result in a local vector database using Chroma vector store.
  • run_localGPT.py uses a local LLM to understand questions and create answers. The context for the answers is extracted from the local vector store using a similarity search to locate the right piece of context from the docs.
  • You can replace this local LLM with any other LLM from HuggingFace. Make sure whatever LLM you select is in the HF format.
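
To make the similarity search step concrete, here is a minimal sketch in plain Python. It is not the Chroma or LangChain implementation: the function names, the toy three-dimensional vectors, and the in-memory list standing in for the vector store are all assumptions made for illustration. It ranks stored chunks by cosine similarity to a query vector and returns the best matches, which is roughly what the retrieval step does before the chunks are handed to the LLM.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vector, store, k=4):
    """Return the texts of the k chunks most similar to the query.

    `store` is a hypothetical in-memory stand-in for the vector database:
    a list of (chunk_text, embedding_vector) pairs.
    """
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy data: real embeddings have hundreds of dimensions, these have three.
store = [
    ("chunk about hospitals", [1.0, 0.0, 0.0]),
    ("chunk about databases", [0.0, 1.0, 0.0]),
    ("chunk about hospital databases", [0.7, 0.7, 0.0]),
]
```

Calling top_k_chunks([1.0, 0.1, 0.0], store, k=2) returns the hospital chunk first and the mixed chunk second. The default k=4 mirrors the four sources that --show_sources displays by default.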

This project was inspired by the original privateGPT.

Environment Setup 🌍

  1. 📥 Clone the repo using git:
git clone https://github.com/PromtEngineer/localGPT.git
  2. 🐍 Install conda for virtual environment management. Create and activate a new virtual environment.
conda create -n localGPT python=3.10.0
conda activate localGPT
  3. 🛠️ Install the dependencies using pip.

To set up your environment to run the code, first install all requirements:

pip install -r requirements.txt

Installing LLAMA-CPP:

LocalGPT uses LlamaCpp-Python for GGML (you will need llama-cpp-python <=0.1.76) and GGUF (llama-cpp-python >=0.1.83) models.
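
The version constraint above can be checked before loading a model. The following is a small sketch, not part of LocalGPT: the helper names are hypothetical, and detecting the format from the file name (a .gguf suffix, or "ggml" somewhere in the name) is an assumption about how model files are typically named.

```python
def parse_version(version):
    """Turn a dotted version string like '0.1.83' into a comparable tuple (0, 1, 83)."""
    return tuple(int(part) for part in version.split("."))


def llama_cpp_supports(model_file, installed_version):
    """Check a llama-cpp-python version against the limits stated in this README.

    GGUF models need >= 0.1.83 and GGML models need <= 0.1.76.
    The format is guessed from the file name, which is an assumption.
    """
    version = parse_version(installed_version)
    name = model_file.lower()
    if name.endswith(".gguf"):
        return version >= (0, 1, 83)
    if "ggml" in name:
        return version <= (0, 1, 76)
    raise ValueError(f"cannot tell whether {model_file!r} is GGML or GGUF")
```

For example, llama_cpp_supports("llama-2-7b-chat.Q4_K_M.gguf", "0.1.83") is True, while a GGML file with the same installed version is rejected.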

If you want to use BLAS or Metal with llama-cpp you can set appropriate flags:

For NVIDIA GPU support, use cuBLAS:

# Example: cuBLAS
CMAKE_ARGS="-DLLAMA_CUBLAS=on" CMAKE_CUDA_COMPILER=$(which nvcc) FORCE_CMAKE=1 pip install llama-cpp-python==0.1.83 --upgrade --force-reinstall

For Apple Metal (M1/M2) support, use

# Example: METAL
CMAKE_ARGS="-DLLAMA_METAL=on"  FORCE_CMAKE=1 pip install llama-cpp-python==0.1.83 --no-cache-dir

For more details, please refer to llama-cpp

Docker 🐳

Installing the required packages for GPU inference on Nvidia GPUs, like gcc 11 and CUDA 11, may cause conflicts with other packages in your system. As an alternative to Conda, you can use Docker with the provided Dockerfile. The image includes CUDA; your system just needs Docker, BuildKit, your Nvidia GPU driver and the Nvidia container toolkit. Build with docker build . -t localgpt (this requires BuildKit). Docker BuildKit does not currently support the GPU at docker build time, only at docker run time. Run with docker run -it --mount src="$HOME/.cache",target=/root/.cache,type=bind --gpus=all localgpt.

Test dataset

For testing, this repository comes with the Constitution of the USA as an example file.

Ingesting your OWN Data

Put your files in the SOURCE_DOCUMENTS folder. You can put multiple folders within the SOURCE_DOCUMENTS folder and the code will recursively read your files.

Supported file formats:

LocalGPT currently supports the following file formats. LocalGPT uses LangChain for loading these file formats. The code in constants.py uses a DOCUMENT_MAP dictionary to map a file format to the corresponding loader. To add support for another file format, simply add an entry to this dictionary with the file extension and the corresponding loader from LangChain.

DOCUMENT_MAP = {
    ".txt": TextLoader,
    ".md": TextLoader,
    ".py": TextLoader,
    ".pdf": PDFMinerLoader,
    ".csv": CSVLoader,
    ".xls": UnstructuredExcelLoader,
    ".xlsx": UnstructuredExcelLoader,
    ".docx": Docx2txtLoader,
    ".doc": Docx2txtLoader,
}
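
As a sketch of how such a mapping can be extended and queried, the snippet below adds an entry for .html files and looks up the loader for a given path. To keep it runnable without LangChain installed, the loader classes are represented by their names as strings, which is an assumption made for illustration; in constants.py the values are the loader classes themselves. The helper loader_for is hypothetical, and UnstructuredHTMLLoader is used as an example of a LangChain loader that could be registered.

```python
import os

# Strings stand in for the LangChain loader classes so the sketch has no dependencies.
DOCUMENT_MAP = {
    ".txt": "TextLoader",
    ".md": "TextLoader",
    ".py": "TextLoader",
    ".pdf": "PDFMinerLoader",
    ".csv": "CSVLoader",
    ".xls": "UnstructuredExcelLoader",
    ".xlsx": "UnstructuredExcelLoader",
    ".docx": "Docx2txtLoader",
    ".doc": "Docx2txtLoader",
}

# Adding support for another format is one more dictionary entry.
DOCUMENT_MAP[".html"] = "UnstructuredHTMLLoader"


def loader_for(path):
    """Return the loader registered for the file's extension, or None if unsupported."""
    extension = os.path.splitext(path)[1].lower()
    return DOCUMENT_MAP.get(extension)
```

Lowercasing the extension means that a file named Chapter1.PDF still resolves to the PDF loader; whether the actual ingest code does the same is not stated here.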

Ingest

Run the following command to ingest all the data.

If you have CUDA set up on your system:

python ingest.py

You will see output like this: (screenshot, 2023-09-14 at 3:36:27 PM)

Use the device type argument to specify a given device. To run on cpu

python ingest.py --device_type cpu

To run on M1/M2

python ingest.py --device_type mps

Use help for a full list of supported devices.

python ingest.py --help

This will create a new folder called DB and use it for the newly created vector store. You can ingest as many documents as you want, and all will be accumulated in the local embeddings database. If you want to start from an empty database, delete the DB and reingest your documents.
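
Starting from an empty database amounts to removing that folder before ingesting again. A minimal sketch, assuming the vector store lives in a folder named DB relative to the current working directory (the helper name reset_vector_store is hypothetical):

```python
import shutil
from pathlib import Path


def reset_vector_store(db_dir="DB"):
    """Delete the vector store folder if it exists. Returns True if something was removed."""
    path = Path(db_dir)
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False
```

After calling reset_vector_store(), run python ingest.py again to rebuild the embeddings from the files in SOURCE_DOCUMENTS.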

Note: When you run this for the first time, it will need internet access to download the embedding model (default: Instructor Embedding). In subsequent runs, no data will leave your local environment and you can ingest data without an internet connection.

Ask questions to your documents, locally!

In order to chat with your documents, run the following command (by default, it will run on cuda).

python run_localGPT.py

You can also specify the device type, just like with ingest.py:

python run_localGPT.py --device_type mps # to run on Apple silicon

This will load the ingested vector store and embedding model. You will be presented with a prompt:

> Enter a query:

After typing your question, hit enter. LocalGPT will take some time to answer, depending on your hardware. You will get a response like the one in this screenshot: (screenshot, 2023-09-14 at 3:33:19 PM)

Once the answer is generated, you can then ask another question without re-running the script, just wait for the prompt again.

Note: When you run this for the first time, it will need an internet connection to download the LLM (default: TheBloke/Llama-2-7b-Chat-GGUF). After that you can turn off your internet connection and inference will still work. No data gets out of your local environment.

Type exit to finish the script.
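
The interaction described above is a simple read-answer loop. The sketch below shows its shape only and is not the code of run_localGPT.py: chat_loop is a hypothetical name, and the answer callback stands in for the retrieval and LLM call. The read and write parameters are there so the loop can be driven without a terminal.

```python
def chat_loop(answer, read=input, write=print):
    """Prompt for queries until the user types 'exit'.

    `answer` is any callable that maps a question string to a response string.
    """
    while True:
        query = read("\n> Enter a query: ")
        if query.strip() == "exit":
            break
        write(answer(query))
```

For a quick interactive try-out, chat_loop(lambda q: "You asked: " + q) echoes each question until exit is typed.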

Extra Options with run_localGPT.py

You can use the --show_sources flag with run_localGPT.py to show which chunks were retrieved by the embedding model. By default, it will show 4 different sources/chunks. You can change the number of sources/chunks.

python run_localGPT.py --show_sources

Another option is to enable chat history. Note: This is disabled by default and can be enabled with the --use_history flag. The context window is limited, so keep in mind that the history takes up part of it and may cause it to overflow.

python run_localGPT.py --use_history
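
One common way to keep history from overflowing the context window is to drop the oldest exchanges once a token budget is exceeded. The sketch below illustrates that idea under stated assumptions: it is not how LocalGPT manages history, the name trim_history is hypothetical, and counting whitespace-separated words is only a rough stand-in for the model's tokenizer.

```python
def trim_history(history, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep the most recent (question, answer) pairs that fit into max_tokens.

    Walks the history from newest to oldest and stops at the first pair
    that would exceed the budget, then restores chronological order.
    """
    kept = []
    used = 0
    for question, answer in reversed(history):
        cost = count_tokens(question) + count_tokens(answer)
        if used + cost > max_tokens:
            break
        kept.append((question, answer))
        used += cost
    kept.reverse()
    return kept
```

With a budget of 6 and three exchanges costing 4, 4 and 2 words, only the last two exchanges are kept.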

Run the Graphical User Interface

  1. Open constants.py in an editor of your choice and set the LLM you want to use. By default, the following model will be used:

    MODEL_ID = "TheBloke/Llama-2-7b-Chat-GGUF"
    MODEL_BASENAME = "llama-2-7b-chat.Q4_K_M.gguf"
  2. Open up a terminal and activate your python environment that contains the dependencies installed from requirements.txt.

  3. Navigate to the /LOCALGPT directory.

  4. Run the following command: python run_localGPT_API.py. The API should begin to run.

  5. Wait until everything has loaded in. You should see something like INFO:werkzeug:Press CTRL+C to quit.

  6. Open up a second terminal and activate the same python environment.

  7. Navigate to the /LOCALGPT/localGPTUI directory.

  8. Run the command python localGPTUI.py.

  9. Open up a web browser and go to the address http://localhost:5111/.

How to select different LLM models?

To change the models you will need to set both MODEL_ID and MODEL_BASENAME.

  1. Open up constants.py in the editor of your choice.

  2. Change the MODEL_ID and MODEL_BASENAME. If you are using a quantized model (GGML, GPTQ, GGUF), you will need to provide MODEL_BASENAME. For unquantized models, set MODEL_BASENAME to None.

  3. A number of example models from HuggingFace have already been tested: original trained models (names ending with HF, or with a .bin file in "Files and versions") and quantized models (names ending with GPTQ, or with a .no-act-order or .safetensors file in "Files and versions").

  4. For models whose name ends with HF or that have a .bin file in "Files and versions" on their HuggingFace page:

    • Make sure you have a MODEL_ID selected. For example -> MODEL_ID = "TheBloke/guanaco-7B-HF"
    • Go to the HuggingFace Repo
  5. For models whose name contains GPTQ or that have a .no-act-order or .safetensors file in "Files and versions" on their HuggingFace page:

    • Make sure you have a MODEL_ID selected. For example -> MODEL_ID = "TheBloke/wizardLM-7B-GPTQ"
    • Go to the corresponding HuggingFace Repo and select "Files and versions".
    • Pick one of the model names and set it as MODEL_BASENAME. For example -> MODEL_BASENAME = "wizardLM-7B-GPTQ-4bit.compat.no-act-order.safetensors"
  6. Follow the same steps for GGUF and GGML models.
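
The rule that quantized models need a MODEL_BASENAME can be expressed as a small sanity check. This is a sketch only: validate_model_config is a hypothetical helper that is not part of constants.py, and detecting quantization from the repository name is an assumption that works for the example names in this README but not necessarily for every model.

```python
QUANTIZED_MARKERS = ("GPTQ", "GGML", "GGUF")


def validate_model_config(model_id, model_basename):
    """Return True if the model looks quantized, False otherwise.

    Raises ValueError when a quantized model has no MODEL_BASENAME set.
    """
    is_quantized = any(marker in model_id.upper() for marker in QUANTIZED_MARKERS)
    if is_quantized and not model_basename:
        raise ValueError(f"{model_id} looks quantized, so MODEL_BASENAME must be set")
    return is_quantized
```

With the default configuration, validate_model_config("TheBloke/Llama-2-7b-Chat-GGUF", "llama-2-7b-chat.Q4_K_M.gguf") returns True, and an unquantized model such as "TheBloke/guanaco-7B-HF" with a basename of None returns False.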

GPU and vRAM Requirements

Below is the vRAM requirement for different models depending on their size (billions of parameters). The estimates in the table do not include the vRAM used by the embedding models, which use an additional 2 GB - 7 GB of vRAM depending on the model.

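| Model Size (B) | float32  | float16  | GPTQ 8bit         | GPTQ 4bit          |
|----------------|----------|----------|-------------------|--------------------|
| 7B             | 28 GB    | 14 GB    | 7 GB - 9 GB       | 3.5 GB - 5 GB      |
| 13B            | 52 GB    | 26 GB    | 13 GB - 15 GB     | 6.5 GB - 8 GB      |
| 32B            | 130 GB   | 65 GB    | 32.5 GB - 35 GB   | 16.25 GB - 19 GB   |
| 65B            | 260.8 GB | 130.4 GB | 65.2 GB - 67 GB   | 32.6 GB - 35 GB    |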
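
The lower bounds in the table follow from simple arithmetic: the number of parameters times the bytes per weight. A minimal sketch of that calculation (the function name is hypothetical, and it deliberately ignores activations, the KV cache, and the embedding model, which is why real usage lands at the upper end of the ranges):

```python
def estimate_weights_gb(params_billion, bits_per_weight):
    """Memory for the model weights alone, in GB (10^9 bytes).

    params_billion * 10^9 weights * (bits_per_weight / 8) bytes, divided by 10^9.
    """
    return params_billion * bits_per_weight / 8
```

For the 7B model this gives 28 GB at float32, 14 GB at float16, 7 GB at 8 bit, and 3.5 GB at 4 bit, matching the first row of the table and the 3.5 GB figure quoted for the 4-bit 7b model.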

System Requirements

Python Version

To use this software, you must have Python 3.10 or later installed. Earlier versions of Python will not work.

C++ Compiler

If you encounter an error while building a wheel during the pip install process, you may need to install a C++ compiler on your computer.

For Windows 10/11

To install a C++ compiler on Windows 10/11, follow these steps:

  1. Install Visual Studio 2022.
  2. Make sure the following components are selected:
    • Universal Windows Platform development
    • C++ CMake tools for Windows
  3. Download the MinGW installer from the MinGW website.
  4. Run the installer and select the "gcc" component.

NVIDIA Driver Issues

Follow this page to install NVIDIA Drivers.

Disclaimer

This is a test project to validate the feasibility of a fully local solution for question answering using LLMs and vector embeddings. It is not production ready, and it is not meant to be used in production. Vicuna-7B is based on the Llama model, so it is subject to the original Llama license.

About

Fork of localgpt for the SNIK project
