⚠️ privateGPT has made significant changes to its codebase. Please visit its repo for the latest documentation.

Multi-doc QA based on privateGPT

privateGPT is an open-source project built on llama-cpp-python and LangChain. It provides an interface for analyzing local documents and asking a large model questions about their content. It works with model files compatible with GPT4All or llama.cpp, so both the documents and the questions stay on your machine. This article shows how to use privateGPT, taking a GGML-format llama.cpp model as the example.

For more details and usage, please refer to the official privateGPT repository:

Prerequisites: Install llama-cpp-python

Because privateGPT loads GGML models through llama.cpp, you need to install llama-cpp-python first. Note: the installation command below does not enable any acceleration library.

$ pip install llama-cpp-python

💡 (Recommended) If you want to install a version adapted to OpenBLAS/cuBLAS/CLBlast/Metal, please refer to:

Must-read for Mac M series chip users

Make sure the Python interpreter in the current environment is built for the arm64 architecture; otherwise execution will be more than 10x slower. To check, run the following Python commands after installing llama-cpp-python, replacing the model path with a local GGML model file supported by llama.cpp.

>>> from llama_cpp import Llama
>>> llm = Llama(model_path="./models/7B/ggml-model.bin")

If the log shows NEON = 1, the installation is correct; NEON = 0 means llama-cpp-python was not installed for the arm64 architecture. Below is an example log with ARM NEON acceleration enabled.

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
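
To check the flag in a script rather than by eye, the log line can be parsed into a dictionary. This is a minimal sketch, assuming the system_info line keeps the `NAME = value |` layout shown above; `parse_system_info` is a hypothetical helper, not part of llama-cpp-python.

```python
import re


def parse_system_info(line: str) -> dict:
    """Parse 'NAME = 0 | NAME = 1 | ...' pairs from a system_info log line."""
    # Drop the 'system_info:' prefix when it is present.
    _, _, body = line.partition("system_info:")
    flags = {}
    for name, value in re.findall(r"(\w+)\s*=\s*(\d+)", body or line):
        flags[name] = int(value)
    return flags


# Shortened copy of the example log above.
log = "system_info: n_threads = 8 / 10 | AVX = 0 | NEON = 1 | ARM_FMA = 1 | BLAS = 1 |"
flags = parse_system_info(log)
if flags.get("NEON") == 1:
    print("NEON enabled: arm64 build looks correct")
else:
    print("NEON missing: check that Python is an arm64 build")
```

For `n_threads = 8 / 10` only the first number is kept, which is enough for checking the acceleration flags.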

How to install python adapted for arm64?

If you use conda, you can create the relevant environment with the following command, selecting Python 3.10 to meet the requirements of privateGPT.

$ CONDA_SUBDIR=osx-arm64 conda create -n privategpt python=3.10 -c conda-forge

Step 1: Clone the repository and install dependencies

After successfully installing llama-cpp-python, install privateGPT with the following commands (requires Python >= 3.10).

$ git clone
$ cd privateGPT
$ pip3 install -r requirements.txt

Step 2: Modify the configuration file

Create a .env configuration file in the root directory of privateGPT and set the following variables:

  • MODEL_TYPE: Fill in as LlamaCpp
  • PERSIST_DIRECTORY: Specify where the analysis files are stored. A db directory will be created in the root directory of privateGPT.
  • MODEL_PATH: Point to where the large model is stored, which in this case is a GGML file supported by llama.cpp.
  • MODEL_N_CTX: The maximum token limit of the large model, set to 4096 (same as the -c parameter in llama.cpp). For the long-context 16K model series you can raise this value, up to 16384 (16K).
  • MODEL_N_BATCH: Batch size for prompt processing (same as the -b parameter in llama.cpp).
  • EMBEDDINGS_MODEL_NAME: Location of the SentenceTransformers embedding model. You can give a model name on HuggingFace (it will be downloaded automatically). For other officially supported models, refer to:
  • TARGET_SOURCE_CHUNKS: Number of chunks used to answer questions.
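
The variables above can be collected and rendered as KEY=VALUE lines, as in the sketch below. MODEL_TYPE, the db directory and the 4096 context size come from the text; the model path and embeddings model name are placeholders, and the MODEL_N_BATCH and TARGET_SOURCE_CHUNKS values are assumptions chosen only for illustration. `render_env` is a hypothetical helper, not part of privateGPT.

```python
settings = {
    "MODEL_TYPE": "LlamaCpp",
    "PERSIST_DIRECTORY": "db",
    "MODEL_PATH": "path/to/ggml-model.bin",  # placeholder: your local GGML file
    "MODEL_N_CTX": 4096,
    "MODEL_N_BATCH": 512,  # assumed value
    "EMBEDDINGS_MODEL_NAME": "name-or-path-of-embeddings-model",  # placeholder
    "TARGET_SOURCE_CHUNKS": 4,  # assumed value
}


def render_env(values: dict) -> str:
    """Render settings as KEY=VALUE lines after checking the context limit."""
    n_ctx = int(values["MODEL_N_CTX"])
    if not 0 < n_ctx <= 16384:
        raise ValueError(f"MODEL_N_CTX must be between 1 and 16384, got {n_ctx}")
    return "\n".join(f"{key}={value}" for key, value in values.items()) + "\n"


env_text = render_env(settings)
print(env_text)
```

The printed text can be saved as .env in the privateGPT root directory once the placeholders are replaced.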

Step 3: Analyzing Local Files

privateGPT can analyze many document formats. The most commonly used ones are:

  • Word files: .doc, .docx
  • PPT files: .ppt, .pptx
  • PDF files: .pdf
  • Plain text files: .txt
  • CSV files: .csv
  • Markdown files: .md
  • Email files: .eml, .msg
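
Before ingesting, it can help to see which files in a folder match the formats listed above. The sketch below only uses the extensions from that list, so a file it reports as "other" may still be supported by privateGPT; `split_by_support` is a hypothetical helper.

```python
from pathlib import Path

# Extensions taken from the list above (only the most common formats).
SUPPORTED_EXTENSIONS = {
    ".doc", ".docx", ".ppt", ".pptx", ".pdf",
    ".txt", ".csv", ".md", ".eml", ".msg",
}


def split_by_support(folder):
    """Return (listed, other) file names found directly inside folder."""
    listed, other = [], []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue  # ignore sub-directories
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            listed.append(path.name)
        else:
            other.append(path.name)
    return listed, other


# Only runs when a source_documents folder exists in the working directory.
if Path("source_documents").is_dir():
    listed, other = split_by_support("source_documents")
    print("listed formats:", listed)
    print("other files:", other)
```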

Place the documents to be analyzed (one or more) in the source_documents directory under the privateGPT root directory. In this example, three Word files related to "Musk's Visit to China" have been placed there. The directory looks like this:

$ ls source_documents
musk1.docx	musk2.docx	musk3.docx

Next, run the command to analyze the documents.

$ python

The output is as follows (the test environment is an M1 Max; parsing took only a few seconds). Note that the first run downloads the embedding model named in the configuration file if it is given as a HuggingFace address rather than a local path.

Creating new vectorstore
Loading documents from source_documents
Loading new documents: 100%|██████████████████████| 3/3 [00:02<00:00,  1.11it/s]
Loaded 3 new documents from source_documents
Split into 7 chunks of text (max. 500 tokens each)
Creating embeddings. May take some minutes...
Ingestion complete! You can now run to query your documents

⚠️ Note: If the db directory already contains analysis files, newly ingested data is added on top of them. If you only want to query the current documents, clear the db directory before ingesting.
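
Clearing the directory can be scripted. This is a minimal sketch, assuming the persist directory is the db folder described above; `reset_persist_directory` is a hypothetical helper, and it permanently deletes the folder, so it is only defined here and not called.

```python
import shutil
from pathlib import Path


def reset_persist_directory(path="db"):
    """Delete the persist directory so the next ingest starts from an empty index.

    Returns True if a directory was removed, False if there was nothing to remove.
    """
    target = Path(path)
    existed = target.is_dir()
    if existed:
        shutil.rmtree(target)  # removes the directory and everything inside it
    return existed


# Usage (run from the privateGPT root directory before ingesting):
# reset_persist_directory("db")
```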

Step 4: Modify Decoding Strategy

Acceleration Strategy

Before running, adjust the model's decoding parameters to get the best speed and output quality. privateGPT calls the llama-cpp-python interface, so the default decoding settings are used unless the code is changed. Open the privateGPT script that defines the model and find the following statement (around line 35; the exact line varies by version).

llm = LlamaCpp(model_path=model_path, max_tokens=model_n_ctx, callbacks=callbacks, verbose=False)

This is where the LlamaCpp model is defined. You can pass in more custom parameters according to the definition of the llama-cpp-python interface. Here's an example:

  • n_threads: Consistent with the -t parameter in llama.cpp, defining the number of decoding threads, which helps increase decoding speed. Adjust according to the actual number of physical cores.
  • n_ctx: Consistent with the -c parameter in llama.cpp, defining the context window size. The default is 512. Here it is set to the model_n_ctx value from the configuration file, which is 4096.
  • n_gpu_layers: Consistent with the -ngl parameter in llama.cpp, defining the number of layers offloaded to the GPU; on Apple M series chips it can be set to 1.
  • rope_freq_scale: Default value is 1.0. If you are using a 16K-context model, change this value to 0.25.
llm = LlamaCpp(model_path=model_path, max_tokens=model_n_ctx, 
               callbacks=callbacks, verbose=False, 
               n_threads=8, n_ctx=model_n_ctx, n_gpu_layers=1, rope_freq_scale=1.0)
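
The extra arguments can also be gathered in one place so that switching between a standard and a 16K model changes a single flag. This is a sketch under the limits stated in this article (4096 tokens by default, up to 16384 and rope_freq_scale 0.25 for the 16K series); `llama_cpp_kwargs` and `use_16k_model` are hypothetical names.

```python
def llama_cpp_kwargs(model_n_ctx, use_16k_model=False, n_threads=8, n_gpu_layers=1):
    """Collect the extra LlamaCpp keyword arguments described above."""
    max_ctx = 16384 if use_16k_model else 4096
    if not 0 < model_n_ctx <= max_ctx:
        raise ValueError(f"model_n_ctx must be in 1..{max_ctx}, got {model_n_ctx}")
    return {
        "n_threads": n_threads,  # match the number of physical cores
        "n_ctx": model_n_ctx,
        "n_gpu_layers": n_gpu_layers,
        "rope_freq_scale": 0.25 if use_16k_model else 1.0,
    }


print(llama_cpp_kwargs(4096))
print(llama_cpp_kwargs(16384, use_16k_model=True))
```

The resulting dictionary can be unpacked into the call shown above, for example `LlamaCpp(model_path=model_path, max_tokens=model_n_ctx, callbacks=callbacks, verbose=False, **llama_cpp_kwargs(model_n_ctx))`.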

Work with Alpaca-2 Instruction Template

The default decoding method does not apply any instruction template. This section shows how to wrap the prompt in the Alpaca-2 instruction template so that the model receives input in the format it expects.

In the same script, find the following statement (around line 40; the exact line varies by version).

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", 
                                 retriever=retriever, return_source_documents= not args.hide_source)

Replace with the following code (note the adjustment of the indent):

alpaca2_prompt_template = (
    "[INST] <<SYS>>\n"
    "You are a helpful assistant. 你是一个乐于助人的助手。\n"
    "<</SYS>>\n\n"
    "{context}\n\n{question} [/INST]"
)

from langchain import PromptTemplate
input_with_prompt = PromptTemplate(template=alpaca2_prompt_template, 
                                   input_variables=["context", "question"])

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, 
                                 return_source_documents= not args.hide_source, 
                                 chain_type_kwargs={"prompt": input_with_prompt})
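
To see what the model actually receives, the template can be filled in by hand with plain string formatting, without loading LangChain or a model. This is only an illustration: it assumes the retrieved chunks are joined with blank lines before being substituted for {context}, the template includes the closing <</SYS>> tag of the Llama-2 chat format, and `render_prompt` and the sample chunks are hypothetical.

```python
alpaca2_prompt_template = (
    "[INST] <<SYS>>\n"
    "You are a helpful assistant. 你是一个乐于助人的助手。\n"
    "<</SYS>>\n\n"
    "{context}\n\n{question} [/INST]"
)


def render_prompt(chunks, question):
    """Fill the template with the retrieved chunks and the user's question."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return alpaca2_prompt_template.format(context=context, question=question)


prompt = render_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What is the document about?",
)
print(prompt)
```

Printing the result makes it easy to confirm that the system block, the context, and the question appear in the intended order.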

Refer to the example code here >>> scripts/privategpt/

Step 5: Asking Questions about Local Documents

After completing the document analysis in the previous step, you can run the following command to start asking questions about the document:

$ python

When the following prompt appears, type your question. The example below asks, in Chinese, "What might be the purpose of Musk's visit to China?":

Enter a query: 马斯克此次访华可能有什么目的?

The result is as follows (source document output part omitted):

> Question:

> Answer (took 48.29 s.):





Reading the prompt is not very fast, while generating the answer is relatively quick. Overall, it took about half a minute to return relevant results, and the output includes the text of four source chunks.

Enter exit to end the script.

Optimize LangChain Strategy

The default chain type privateGPT uses when calling LangChain is stuff, which places all retrieved chunks into a single prompt. This makes it unsuitable for particularly long texts. If results are poor with long or multiple documents, you can switch to a chain type such as refine or map_reduce. To use refine, first define two prompt templates (note the adjustment of the indent):

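  alpaca2_refine_prompt_template = (
      "[INST] <<SYS>>\n"
      "You are a helpful assistant. 你是一个乐于助人的助手。\n"
      "<</SYS>>\n\n"
      "这是原始问题: {question}\n"  # "This is the original question"
      "已有的回答: {existing_answer}\n"  # "Existing answer"
      "{context_str}\n\n"  # the new passage used to refine the answer
      "请根据新的文段,进一步完善你的回答。 [/INST]"  # "Refine your answer based on the new passage."
  )

  alpaca2_initial_prompt_template = (
      "[INST] <<SYS>>\n"
      "You are a helpful assistant. 你是一个乐于助人的助手。\n"
      "<</SYS>>\n\n"
      "{context_str}\n"  # background passage
      "请根据以上背景知识,回答这个问题:{question} [/INST]"  # "Answer this question using the background above."
  )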

Then initialize qa in the following way, replacing the definition of qa around line 39 in the original code (note the adjustment of the indent):

    from langchain import PromptTemplate
    refine_prompt = PromptTemplate(
        input_variables=["question", "existing_answer", "context_str"],
        template=alpaca2_refine_prompt_template,
    )
    initial_qa_prompt = PromptTemplate(
        input_variables=["context_str", "question"],
        template=alpaca2_initial_prompt_template,
    )
    chain_type_kwargs = {"question_prompt": initial_qa_prompt, "refine_prompt": refine_prompt}
    qa = RetrievalQA.from_chain_type(
        llm=llm, chain_type="refine",
        retriever=retriever, return_source_documents=not args.hide_source,
        chain_type_kwargs=chain_type_kwargs,
    )

For reference, see the example code >>> scripts/privategpt/
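
The difference between the two chain types comes down to how many model calls are made and what each call sees. The sketch below imitates that control flow in plain Python with a stand-in model; it is not LangChain's implementation, and `run_stuff`, `run_refine`, `CountingLLM`, and the prompt wording are hypothetical.

```python
class CountingLLM:
    """Stand-in for a model: records each prompt and returns a numbered draft."""

    def __init__(self):
        self.prompts = []

    def __call__(self, prompt):
        self.prompts.append(prompt)
        return f"draft answer {len(self.prompts)}"


def run_stuff(llm, chunks, question):
    """stuff: put every chunk into one prompt and call the model once."""
    context = "\n\n".join(chunks)
    return llm(f"{context}\n\n{question}")


def run_refine(llm, chunks, question):
    """refine: answer from the first chunk, then revise once per remaining chunk."""
    if not chunks:
        raise ValueError("at least one chunk is required")
    answer = llm(f"{chunks[0]}\n\nQuestion: {question}")
    for chunk in chunks[1:]:
        answer = llm(
            f"Question: {question}\nExisting answer: {answer}\n"
            f"New text: {chunk}\nRefine the answer."
        )
    return answer


chunks = ["chunk one", "chunk two", "chunk three"]
stuff_llm, refine_llm = CountingLLM(), CountingLLM()
run_stuff(stuff_llm, chunks, "What happened?")
run_refine(refine_llm, chunks, "What happened?")
print("stuff calls:", len(stuff_llm.prompts))
print("refine calls:", len(refine_llm.prompts))
```

With stuff the single prompt grows with every chunk, while refine keeps each prompt to one chunk plus the running answer at the cost of one model call per chunk.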

