
Dolly-3B model cannot produce correct replies when using my own dataset #60

Open
AndyLinOuO opened this issue Jul 31, 2023 · 0 comments

I'm trying to use Databricks' open-source Dolly model to build a bot that gives expert answers to the questions covered by my dataset.

[screenshot: sample entry from the dataset]

The entries in my dataset are mostly step-by-step instructions like this for solving a problem. I want to implement this with LangChain, but when I actually run it, the replies are not the answers I expect. Here is my code:

```python
from PyPDF2 import PdfReader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from transformers import pipeline
import torch

# Embedding model used to index the PDF chunks.
hf_embed = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# Load the manual from Google Drive.
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
reader = PdfReader('/content/gdrive/My Drive/data/operation Manual.pdf')

# Concatenate the extracted text of every page.
raw_text = ''
for page in reader.pages:
    text = page.extract_text()
    if text:
        raw_text += text

# Split the text into overlapping chunks and build a FAISS index over them.
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)
docsearch = FAISS.from_texts(texts, hf_embed)

# Load Dolly through the transformers pipeline and wrap it for LangChain.
model_name = "databricks/dolly-v2-3b"
instruct_pipeline = pipeline(
    model=model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    return_full_text=True,
    max_new_tokens=256,
    top_p=0.95,
    top_k=50,
)
hf_pipe = HuggingFacePipeline(pipeline=instruct_pipeline)

# Prompt that stuffs the retrieved chunks into the context.
prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

# Retrieve the most similar chunks and run the "stuff" QA chain over them.
query = "I forgot my login password."
docs = docsearch.similarity_search(query)
chain = load_qa_chain(llm=hf_pipe, chain_type="stuff", prompt=PROMPT)
chain({"input_documents": docs, "question": query}, return_only_outputs=True)
```
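(A quick way to tell whether the problem is in retrieval or in generation is to print the chunks `similarity_search` returns before running the chain. This is a minimal sketch using the `docsearch` and `query` objects defined above; if the printed chunks don't contain the password-reset steps from the manual, the PDF extraction or splitting step is the place to look.)

```python
# Sanity check: inspect what the FAISS index actually returns for the query.
docs = docsearch.similarity_search(query, k=4)
for i, doc in enumerate(docs):
    print(f"--- chunk {i} ---")
    print(doc.page_content[:300])  # first 300 characters of each retrieved chunk
```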

Not only is the reply not the answer to this question from my dataset, generating a single response also takes about 2.5 hours.
I don't know which step I got wrong.
I'd really appreciate it if someone could help me figure this out. Thanks!
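(Generation times of ~2.5 hours for a 3B model usually mean the weights ended up on the CPU, or offloaded to disk, rather than on a GPU. A minimal check, assuming `instruct_pipeline` was loaded with `device_map="auto"` via accelerate, which sets `hf_device_map` on the model:)

```python
import torch

# Where did accelerate place each module? Entries like 'cpu' or 'disk'
# mean those layers are not on the GPU, which makes generation very slow.
print(torch.cuda.is_available())
print(instruct_pipeline.model.hf_device_map)
```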
