How to customize document embedding and document retrieval logic in v0.2.10 #3015

michaelxu1107 · 2024-02-20T14:12:32Z

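As the title says: can document embedding and document retrieval logic be customized in v0.2.10?
For reference, see the official LangChain example of RAG over semi-structured documents:
https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_Structured_RAG.ipynb
The rough approach is:
1. Parse the PDF with partition_pdf, which outputs text elements and table elements.
2. Generate a summary for each table and each text element, and embed the summaries into the vector database.
3. Store the original tables and text in a cache. In practice this store needs to be persisted so that it is available at query time.
4. Build the mapping between the summaries and the original data.
5. Create a multi-vector retriever.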
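
The five steps can be sketched without any framework. The block below is only an illustration of the idea (search over summaries, return the originals), not the LangChain or Langchain-Chatchat implementation. The names `embed`, `cosine`, and `MultiVectorStore` are hypothetical, the bag-of-words "embedding" stands in for a real embedding model, and the in-memory dict stands in for the persisted store from step 3.

```python
import math
import uuid
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class MultiVectorStore:
    """Indexes summaries for search but returns the original elements."""

    def __init__(self):
        self.vectors = []   # (doc_id, summary vector): steps 2 and 4
        self.docstore = {}  # doc_id -> original text or table: step 3 (in memory here)

    def add(self, original: str, summary: str) -> str:
        doc_id = str(uuid.uuid4())  # shared key linking a summary to its original
        self.vectors.append((doc_id, embed(summary)))
        self.docstore[doc_id] = original
        return doc_id

    def retrieve(self, query: str, k: int = 1) -> list:
        # Step 5: rank by similarity to the summaries, then look up the originals.
        q = embed(query)
        ranked = sorted(self.vectors, key=lambda item: cosine(q, item[1]), reverse=True)
        return [self.docstore[doc_id] for doc_id, _ in ranked[:k]]


store = MultiVectorStore()
store.add("| year | revenue |\n| 2023 | 10 |", "table of yearly revenue figures")
store.add("The model was trained on 1T tokens.", "paragraph about model training data")
print(store.retrieve("revenue by year"))  # returns the original table, not its summary
```

The key design point is the shared `doc_id`: the vector index only ever sees summaries, while the original table or text is fetched by id at query time.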

Answered by dosubot[bot]

@michaelxu1107 · 2024-02-20T14:12:43Z

dosubot[bot]
bot Feb 20, 2024

🤖

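Hey, @michaelxu1107! Great to see you again. I wonder what interesting conversation we will have this time? 👾

Yes, Langchain-Chatchat v0.2.10 supports customizing document embedding and document retrieval logic. The context you provided shows a Python script that modifies the embedding model's tokenizer so that the model includes keywords for embedding. This is done through the add_keyword_to_model and add_keyword_to_embedding_model functions.

Note, however, that this feature is marked as deprecated as of version 0.3.0, as shown by the @deprecated decorator on the get_keyword_embedding function. This means the feature will be removed in a future release and is not recommended for new projects.

The relevant code is as follows: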

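# The message below reads: "Custom keywords are being rewritten in
# Langchain-Chatchat 0.3.x; the related 0.2.x functionality will be deprecated."
@deprecated(
    since="0.3.0",
    message="自定义关键词 Langchain-Chatchat 0.3.x 重写, 0.2.x中相关功能将废弃",
    removal="0.3.0"
)
def get_keyword_embedding(bert_model, tokenizer, key_words):
    # Tokenize all keywords as one padded batch of PyTorch tensors.
    tokenizer_output = tokenizer(key_words, return_tensors="pt", padding=True, truncation=True)
    input_ids = tokenizer_output['input_ids']
    # Drop the first and last position of every row.
    input_ids = input_ids[:, 1:-1]

    # Look up the word embeddings and average them over dimension 1 (tokens).
    keyword_embedding = bert_model.embeddings.word_embeddings(input_ids)
    keyword_embedding = torch.mean(keyword_embedding, 1)
    return keyword_embedding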

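This function takes a BERT model, a tokenizer, and a list of keywords. It tokenizes the keywords and obtains their input IDs, then looks those IDs up in the BERT model's word embedding layer. Finally, it averages over dimension 1 (the token dimension) to produce one embedding per keyword.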
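
To make the slicing and averaging concrete, here is a dependency-free sketch that mirrors the same logic with plain lists instead of tensors. The function name `mean_pool_keywords`, the tiny embedding table, and the token ids (101 and 102 as special tokens, 0 as padding) are assumptions chosen for illustration, not values taken from the project.

```python
def mean_pool_keywords(input_ids, embedding_table):
    """Mirror of the pooling logic using plain lists instead of tensors."""
    pooled = []
    for ids in input_ids:
        inner = ids[1:-1]  # same effect as input_ids[:, 1:-1] on one row
        vectors = [embedding_table[i] for i in inner]
        dim = len(vectors[0])
        # Average each embedding dimension over the remaining token positions.
        pooled.append([sum(v[d] for v in vectors) / len(vectors) for d in range(dim)])
    return pooled


# Hypothetical 2-dimensional embedding table, indexed by token id.
table = {0: [0.0, 0.0], 101: [9.0, 9.0], 102: [9.0, 9.0], 7: [1.0, 3.0], 8: [3.0, 5.0]}

# A keyword tokenized as [start] 7 8 [end]: only tokens 7 and 8 are averaged.
print(mean_pool_keywords([[101, 7, 8, 102]], table))  # [[2.0, 4.0]]

# A shorter keyword padded on the right as [start] 7 [end] [pad]:
# the fixed slice keeps the end token, so it leaks into the average.
print(mean_pool_keywords([[101, 7, 102, 0]], table))  # [[5.0, 6.0]]
```

The second call illustrates a possible caveat: if keywords tokenize to different lengths and the tokenizer pads on the right, the fixed slice no longer strips exactly the special tokens for the shorter rows. Whether this matters for a given keyword file is something to check rather than assume.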

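Next, the add_keyword_to_model function uses these keyword embeddings: it adds the keywords to the tokenizer as new tokens, resizes the model's embedding matrix, and writes the keyword embeddings into the new rows. The modified model is then saved to the specified output path.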

def add_keyword_to_model(model_name=EMBEDDING_MODEL, keyword_file: str = "", output_model_path: str = None):
    # Read one keyword per line from the keyword file.
    key_words = []
    with open(keyword_file, "r") as f:
        for line in f:
            key_words.append(line.strip())

    # Load the sentence-transformers model and reach into its transformer module.
    st_model = SentenceTransformer(model_name)
    key_words_len = len(key_words)
    word_embedding_model = st_model._first_module()
    bert_model = word_embedding_model.auto_model
    tokenizer = word_embedding_model.tokenizer
    # Compute keyword embeddings before the vocabulary is changed.
    key_words_embedding = get_keyword_embedding(bert_model, tokenizer, key_words)

    embedding_weight = bert_model.embeddings.word_embeddings.weight
    embedding_weight_len = len(embedding_weight)
    # Register the keywords as new tokens and grow the embedding matrix
    # (its size is rounded up to a multiple of 32).
    tokenizer.add_tokens(key_words)
    bert_model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=32)
    embedding_weight = bert_model.embeddings.word_embeddings.weight
    # Write the keyword embeddings into the rows that follow the original vocabulary.
    with torch.no_grad():
        embedding_weight[embedding_weight_len:embedding_weight_len + key_words_len, :] = key_words_embedding

    # Optionally save the modified model, including a safetensors copy of the weights.
    if output_model_path:
        os.makedirs(output_model_path, exist_ok=True)
        word_embedding_model.save(output_model_path)
        safetensors_file = os.path.join(output_model_path, "model.safetensors")
        metadata = {'format': 'pt'}
        save_model(bert_model, safetensors_file, metadata)
        print("save model to {}".format(output_model_path))

The add_keyword_to_embedding_model function is a wrapper that calls add_keyword_to_model with the appropriate arguments.

def add_keyword_to_embedding_model(path: str = EMBEDDING_KEYWORD_FILE):
    keyword_file = os.path.join(path)
    model_name = MODEL_PATH["embed_model"][EMBEDDING_MODEL]
    model_parent_directory = os.path.dirname(model_name)
    current_time = datetime.now().strftime('%Y%m%d_%H%M%S')
    output_model_name = "{}_Merge_Keywords_{}".format(EMBEDDING_MODEL, current_time)
    output_model_path = os.path.join(model_parent_directory, output_model_name)
    add_keyword_to_model(model_name, keyword_file, output_model_path)
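
As a small illustration of where the merged model ends up, the path logic of the wrapper can be isolated as below. The helper name `merged_model_path` and the example model path and key are hypothetical; the real values come from the project's MODEL_PATH and EMBEDDING_MODEL configuration.

```python
import os
from datetime import datetime


def merged_model_path(model_path: str, model_key: str, now: datetime) -> str:
    """Build '<parent dir of model>/<model_key>_Merge_Keywords_<timestamp>'."""
    parent = os.path.dirname(model_path)
    stamp = now.strftime('%Y%m%d_%H%M%S')
    return os.path.join(parent, "{}_Merge_Keywords_{}".format(model_key, stamp))


# Hypothetical local model directory and configuration key.
print(merged_model_path("/models/bge-large-zh", "bge-large-zh",
                        datetime(2024, 2, 20, 14, 12, 43)))
# On a POSIX system: /models/bge-large-zh_Merge_Keywords_20240220_141243
```

Because the parent directory is derived with os.path.dirname, the result is only a sibling of the original model when the configured value is a local path. A hub-style identifier such as "org/name" would yield the relative directory "org" instead.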

Source:

embeddings/add_embedding_keywords.py

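As for how partition_pdf parses PDF documents into text and table data in Langchain-Chatchat v0.2.10: partition_pdf is not mentioned directly in the provided context. Instead, a function named pdf2text is used in Langchain-Chatchat v0.2.10 to parse PDF documents into text and table data. It works as follows:

1. A function named pdf2text is defined, taking a file path as its argument.
2. The PDF document is opened with the fitz package from PyMuPDF.
3. A loop iterates over every page in the document.
4. The get_text method extracts the text of each page, which is appended to the resp string.
5. The get_image_info method extracts the images on each page. An image whose size exceeds a certain threshold is ignored.
6. If the page has a rotation angle, the image is rotated to the correct orientation.
7. OCR (optical character recognition) is applied to the image to extract text, which is then appended to the resp string.
8. The extracted text is then partitioned with the partition_text function from the unstructured.partition.text module.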
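
The text-extraction part of these steps (opening the document, looping over pages, appending get_text output) can be sketched as follows. This is a simplified illustration, not the project's pdf2text: the image handling, OCR, and partition_text steps are left out, and the helper names `collect_page_text` and `pdf_text` are hypothetical. It assumes PyMuPDF is installed when `pdf_text` is called.

```python
def collect_page_text(pages) -> str:
    """Append the text of each page to one string, one newline after each page."""
    resp = ""
    for page in pages:
        resp += page.get_text() + "\n"
    return resp


def pdf_text(filepath: str) -> str:
    """Open a PDF with PyMuPDF and return the concatenated page text."""
    import fitz  # PyMuPDF; imported lazily so collect_page_text has no dependency

    with fitz.open(filepath) as doc:
        return collect_page_text(doc)  # iterating a document yields its pages
```

Keeping the loop separate from the file handling makes it easy to try out with any objects that expose a get_text method.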

Note that the partition_text function is not defined in the provided context, so its exact behavior cannot be described from the information given.
Source:

document_loaders/mypdfloader.py

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.
