
⚠️ privateGPT has significant changes to their codebase. Please visit their repo for the latest doc.

Multi-doc QA based on privateGPT

privateGPT is an open-source project based on llama-cpp-python and LangChain that provides an interface for analyzing local documents and asking a large model questions about them, entirely on your own machine. Users can have privateGPT index local documents and then query their content with model files compatible with GPT4All or llama.cpp, keeping data local and private. This article describes how to use privateGPT, taking a GGML-format model from llama.cpp as the example.

For more detailed content and usage, please refer to the official privateGPT repository: https://github.com/imartinez/privateGPT

Prerequisites: Install llama-cpp-python

Since privateGPT loads GGML models through llama.cpp, you need to install the llama-cpp-python package in advance. Note: the following installation method does not use any acceleration library.

$ pip install llama-cpp-python

💡 (Recommended) If you want to install a version adapted to OpenBLAS/cuBLAS/CLBlast/Metal, please refer to: https://github.com/abetlen/llama-cpp-python#installation-with-openblas--cublas--clblast
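Either way, a quick check (standard library only; not from the privateGPT docs) confirms that the package is installed and importable:

import importlib.metadata

import llama_cpp  # fails here if llama-cpp-python was not installed correctly

# Print the installed version of the package
print(importlib.metadata.version("llama-cpp-python"))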

Must-read for Mac M series chip users

Make sure the Python in your environment is built for the arm64 architecture; otherwise execution will be more than 10x slower. To check, run the following Python commands after installing llama-cpp-python, replacing the model path with a GGML model file supported by your local llama.cpp.

>>> from llama_cpp import Llama
>>> llm = Llama(model_path="./models/7B/ggml-model.bin")

If the log shows NEON = 1, the installation is fine; NEON = 0 means the package was not built for the arm64 architecture. Below is an example log with ARM NEON acceleration enabled.

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |

How to install a Python build for arm64?

If you use conda, you can create a suitable environment with the following command; Python 3.10 is chosen to meet privateGPT's requirements.

$ CONDA_SUBDIR=osx-arm64 conda create -n privategpt python=3.10 -c conda-forge
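After creating the environment, you can confirm that the interpreter is an arm64 build with a quick standard-library check (a minimal sketch, not from the privateGPT docs):

import platform

# Should print 'arm64' on Apple Silicon; 'x86_64' means you are running an
# Intel/Rosetta build of Python, and llama.cpp will be much slower.
print(platform.machine())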

Step 1: Clone the repository and install dependencies

After successfully installing llama-cpp-python, you can install privateGPT itself with the following commands (note that Python >= 3.10 is required).

$ git clone https://github.com/imartinez/privateGPT.git
$ cd privateGPT
$ pip3 install -r requirements.txt

Step 2: Modify the configuration file

Create a .env configuration file in the root directory of privateGPT. Here's an example:

MODEL_TYPE=LlamaCpp
PERSIST_DIRECTORY=db
MODEL_PATH=your-path-to-ggml-model.bin
MODEL_N_CTX=4096
MODEL_N_BATCH=512
EMBEDDINGS_MODEL_NAME=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
TARGET_SOURCE_CHUNKS=4

  • MODEL_TYPE: Set to LlamaCpp.
  • PERSIST_DIRECTORY: Where the vectorstore (the analysis results) is persisted. A db directory will be created in the root directory of privateGPT.
  • MODEL_PATH: Path to the large model, in this case a GGML file supported by llama.cpp.
  • MODEL_N_CTX: Maximum context size (in tokens) of the model, set here to 4096 (same as the -c parameter in llama.cpp). For the long-context 16K model series, this value can be increased up to 16384 (16K).
  • MODEL_N_BATCH: Prompt batch size (same as the -b parameter in llama.cpp).
  • EMBEDDINGS_MODEL_NAME: Location of the SentenceTransformers embedding model. You can specify a model name on Hugging Face (it will be downloaded automatically). For other officially supported models, refer to: https://www.sbert.net/docs/pretrained_models.html
  • TARGET_SOURCE_CHUNKS: Number of retrieved chunks used to answer a question.
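For reference, these settings are read from the .env file when privateGPT starts. The sketch below only illustrates how the values surface in Python (privateGPT uses python-dotenv for this; the variable names here are illustrative):

import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current working directory

model_path = os.environ.get("MODEL_PATH")
model_n_ctx = int(os.environ.get("MODEL_N_CTX", 4096))
embeddings_model_name = os.environ.get("EMBEDDINGS_MODEL_NAME")
persist_directory = os.environ.get("PERSIST_DIRECTORY", "db")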

Step 3: Analyzing Local Files

privateGPT can analyze many common document formats; the most commonly used ones include:

  • Word files: .doc, .docx
  • PPT files: .ppt, .pptx
  • PDF files: .pdf
  • Plain text files: .txt
  • CSV files: .csv
  • Markdown files: .md
  • Email files: .eml, .msg

Place the documents to be analyzed (more than one is allowed) in the source_documents directory under the privateGPT root directory. In this example, three Word files related to "Musk's visit to China" have been placed there. The directory structure looks like this:

$ ls source_documents
musk1.docx	musk2.docx	musk3.docx

Next, run the ingest.py command to analyze the documents.

$ python ingest.py

The output is as follows (test environment: M1 Max; parsing took only a few seconds). Note that on first use, the embedding model specified in the configuration file will be downloaded (if a Hugging Face model name was given rather than a local path).

Creating new vectorstore
Loading documents from source_documents
Loading new documents: 100%|██████████████████████| 3/3 [00:02<00:00,  1.11it/s]
Loaded 3 new documents from source_documents
Split into 7 chunks of text (max. 500 tokens each)
Creating embeddings. May take some minutes...
Ingestion complete! You can now run privateGPT.py to query your documents

⚠️ Note: If the db directory already contains previously ingested data, newly ingested data will be added on top of it. If you only want to index the current documents, clear the db directory before running ingest.py.
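If you do want a clean start, a small helper like the following (not part of privateGPT; the path must match PERSIST_DIRECTORY in your .env) removes the old vectorstore before re-ingesting:

import shutil
from pathlib import Path

db_dir = Path("db")  # must match PERSIST_DIRECTORY in .env
if db_dir.is_dir():
    shutil.rmtree(db_dir)  # remove the previous vectorstore before re-ingesting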

Step 4: Modify Decoding Strategy

Acceleration Strategy

Before running, you need to adjust the model's decoding-related parameters to get the best speed and output quality.

privateGPT.py actually calls the llama-cpp-python interface, so the default decoding strategy is used if no code changes are made. Open privateGPT.py and find the following statement (around line 35, may vary depending on the version).

llm = LlamaCpp(model_path=model_path, max_tokens=model_n_ctx, callbacks=callbacks, verbose=False)

This is where the LlamaCpp model is instantiated. You can pass additional custom parameters supported by the llama-cpp-python interface; the most relevant ones are described below, followed by an example:

  • n_threads: Equivalent to the -t parameter in llama.cpp; sets the number of decoding threads, which helps speed up decoding. Set it according to the actual number of physical cores.
  • n_ctx: Equivalent to the -c parameter in llama.cpp; sets the context window size (default 512). Here it is set to model_n_ctx from the configuration file, i.e. 4096.
  • n_gpu_layers: Equivalent to the -ngl parameter in llama.cpp; sets the number of layers offloaded to the GPU. Apple M series chips can set this to 1.
  • rope_freq_scale: Defaults to 1.0. If you are using a 16K context model, change this value to 0.25.

llm = LlamaCpp(model_path=model_path, max_tokens=model_n_ctx, 
               callbacks=callbacks, verbose=False, 
               n_threads=8, n_ctx=model_n_ctx, n_gpu_layers=1, rope_freq_scale=1.0)
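For example, with a long-context 16K model (and MODEL_N_CTX=16384 in .env), the call might look like the following. This is a hedged variant based on the notes above, not code taken from the repository:

llm = LlamaCpp(model_path=model_path, max_tokens=model_n_ctx,
               callbacks=callbacks, verbose=False,
               n_threads=8,            # match your number of physical cores
               n_ctx=model_n_ctx,      # 16384 for the 16K series
               n_gpu_layers=1,         # Metal offload on Apple M series chips
               rope_freq_scale=0.25)   # 0.25 for 16K models, 1.0 otherwise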

Work with Alpaca-2 Instruction Template

The default setup does not apply any instruction template. This section describes how to wrap queries in the Alpaca-2 instruction template so that the model is prompted in its intended format.

Open privateGPT.py and find the following statement (around line 40, may vary depending on the version).

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", 
                                 retriever=retriever, return_source_documents= not args.hide_source)

Replace it with the following code (adjust the indentation to match the surrounding code):

alpaca2_prompt_template = (
    "[INST] <<SYS>>\n"
    "You are a helpful assistant. 你是一个乐于助人的助手。\n"
    "<</SYS>>\n\n"
    "{context}\n\n{question} [/INST]"
)

from langchain import PromptTemplate
input_with_prompt = PromptTemplate(template=alpaca2_prompt_template, 
                                   input_variables=["context", "question"])

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, 
                                 return_source_documents= not args.hide_source, 
                                 chain_type_kwargs={"prompt": input_with_prompt})

Refer to the example code here >>> scripts/privategpt/privateGPT.py
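To see what the wrapped prompt actually looks like, you can render the template with placeholder values. This is just a quick sanity check (not part of privateGPT), reusing the template defined above:

from langchain import PromptTemplate

alpaca2_prompt_template = (
    "[INST] <<SYS>>\n"
    "You are a helpful assistant. 你是一个乐于助人的助手。\n"
    "<</SYS>>\n\n"
    "{context}\n\n{question} [/INST]"
)
prompt = PromptTemplate(template=alpaca2_prompt_template,
                        input_variables=["context", "question"])
# Render with placeholder values to inspect the [INST]/<<SYS>> wrapping
print(prompt.format(context="(retrieved document chunks)",
                    question="(your question)"))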

Step 5: Asking Questions about Local Documents

After completing the document analysis in the previous step, you can run the following command to start asking questions about the document:

$ python privateGPT.py

When the following prompt appears, you can type in a question; for example, the question below asks what the purposes of Musk's visit to China might be:

Enter a query: 马斯克此次访华可能有什么目的?

The result is as follows (source document output part omitted):

> Question:
马斯克此次访华可能有什么目的?

> Answer (took 48.29 s.):
根据路透社披露的消息和报道分析,马斯克访华可能有以下一些目的:

1.加强与中国政府高层的互动沟通,推动特斯拉在中国市场的发展计划。此前中国监管部门对特斯拉在销售、售后服务等方面进行了多项整改措施,此次访问也可能涉及解决上述问题并寻求政府的支持。

2.参观特斯拉在上海拥有的超级工厂以及探索进一步扩大规模的可能性。上海工厂是目前全球最大的电动汽车工厂之一,扩建可能有利于加速产能提升和提高产量水平。

3.探讨与中国本土汽车制造商在市场上竞争的问题。随着特斯拉在中国市场的份额逐渐增加,其在与国产品牌之间的竞争关系也可能越来越重要,通过此次访问,马斯克可能会就这一问题提出建议或寻求解决办法。

4.推动电动汽车产业的全球合作和发展。作为全球最大的新能源汽车市场之一,中国市场对特斯拉的发展具有重要的战略意义。如果成功地拓展到中国,特斯拉将能够进一步扩大其在全球范围内的影响力并加速电动车普及进程。

Prompt processing (reading the retrieved context) is not particularly fast, while generating the answer is relatively quick. Overall, it took about half a minute to return a relevant result, together with excerpts from four source chunks.

Enter exit to end the script.

Optimize LangChain Strategy

The default strategy privateGPT.py uses when calling LangChain is stuff, which is not well suited to particularly long texts. If results are poor on long or numerous documents, you can switch to a strategy such as refine or map_reduce (a rough map_reduce sketch is given at the end of this section). To use refine, first define two prompt templates (adjust the indentation to match the surrounding code):

  alpaca2_refine_prompt_template = (
      "[INST] <<SYS>>\n"
      "You are a helpful assistant. 你是一个乐于助人的助手。\n"
      "<</SYS>>\n\n"
      "这是原始问题:{question}\n"
      "已有的回答: {existing_answer}\n"
      "现在还有一些文字,(如果有需要)你可以根据它们完善现有的回答。"
      "\n\n{context}\n\n"
      "请根据新的文段,进一步完善你的回答。 [/INST]"
  )

  alpaca2_initial_prompt_template = (
      "[INST] <<SYS>>\n"
      "You are a helpful assistant. 你是一个乐于助人的助手。\n"
      "<</SYS>>\n\n"
      "以下为背景知识:\n{context}\n"
      "请根据以上背景知识,回答这个问题:{question} [/INST]"
  )

Then initialize qa as follows, replacing the original definition of qa around line 39 (adjust the indentation to match the surrounding code):

    from langchain import PromptTemplate
    refine_prompt = PromptTemplate(
        input_variables=["question", "existing_answer", "context_str"],
        template=alpaca2_refine_prompt_template,
    )
    initial_qa_prompt = PromptTemplate(
        input_variables=["context_str", "question"],
        template=alpaca2_initial_prompt_template,
    )
    chain_type_kwargs = {"question_prompt": initial_qa_prompt, "refine_prompt": refine_prompt}
    qa = RetrievalQA.from_chain_type(
        llm=llm, chain_type="refine",
        retriever=retriever, return_source_documents= not args.hide_source,
        chain_type_kwargs=chain_type_kwargs)

For reference, see the example code >>> scripts/privategpt/privateGPT_refine.py
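If you prefer map_reduce instead, the chain takes a map prompt (applied to each retrieved chunk) and a combine prompt (merging the per-chunk answers). The sketch below is only a rough illustration: the keyword names follow the LangChain version used at the time, and the combine prompt is a simplified placeholder rather than a template shipped with the project:

from langchain import PromptTemplate

# Map step: answer the question against each retrieved chunk ("context").
alpaca2_map_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=("[INST] <<SYS>>\n"
              "You are a helpful assistant. 你是一个乐于助人的助手。\n"
              "<</SYS>>\n\n"
              "{context}\n\n{question} [/INST]"),
)
# Reduce step: merge the per-chunk answers ("summaries") into a final answer.
alpaca2_combine_prompt = PromptTemplate(
    input_variables=["summaries", "question"],
    template=("[INST] <<SYS>>\n"
              "You are a helpful assistant. 你是一个乐于助人的助手。\n"
              "<</SYS>>\n\n"
              "The following are partial answers to the same question:\n{summaries}\n\n"
              "Combine them into a final answer to: {question} [/INST]"),
)

qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="map_reduce",
    retriever=retriever, return_source_documents=not args.hide_source,
    chain_type_kwargs={"question_prompt": alpaca2_map_prompt,
                       "combine_prompt": alpaca2_combine_prompt})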
