
Add new splitter to process QA-type files (currently only JSON is supported) and add a toggle button on the knowledge_base page #3298

Open
wants to merge 1 commit into base: master

Conversation

Donovan-Ye

I wrote a new splitter to improve the processing of QA-type knowledge (currently only JSON is supported, as shown in the example). I also added a toggle button on the knowledge_base page to switch between the QA splitter and the normal splitter (ChineseRecursiveTextSplitter, defined in kb_config.py).

I created a PR because I noticed that many people are encountering the same issue (#3164, #893, and others).

Here are the updated page and test results for the QA splitter:
[screenshot: updated knowledge_base page]
[screenshot: QA splitter test results]
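A QA splitter along these lines could look roughly as follows. This is a minimal sketch, not the PR's actual code: the class name `QATextSplitter` and the input schema (a JSON array of `{"question": ..., "answer": ...}` objects) are assumptions based on the description above.

```python
import json
from typing import List


class QATextSplitter:
    """Hypothetical sketch of a QA splitter: one chunk per Q-A pair.

    Assumes the file content is a JSON array of objects with
    "question" and "answer" keys (schema is an assumption, not
    taken from the PR).
    """

    def split_text(self, text: str) -> List[str]:
        pairs = json.loads(text)
        # Keep each question with its answer, so a length-based
        # splitter can never separate them.
        return [
            f"question: {p['question']}\nanswer: {p['answer']}"
            for p in pairs
        ]


splitter = QATextSplitter()
chunks = splitter.split_text(
    '[{"question": "What is RAG?", "answer": "Retrieval-augmented generation."}]'
)
```

The point of the design is that a Q-A pair is an atomic unit of knowledge; splitting mid-answer (as a recursive character splitter may do) degrades retrieval quality.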

@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Mar 13, 2024
@chuanSir123

Hi, I made the changes following this code, but splitting still went through ChineseRecursiveTextSplitter. I can see the same in your screenshot.

@chuanSir123

> Hi, I made the changes following this code, but splitting still went through ChineseRecursiveTextSplitter. I can see the same in your screenshot.

Solved it. A network problem caused a different splitter to be selected by default.

@Donovan-Ye (Author)

> Hi, I made the changes following this code, but splitting still went through ChineseRecursiveTextSplitter. I can see the same in your screenshot.
>
> Solved it. A network problem caused a different splitter to be selected by default.

Yes, exactly. If huggingface can't be reached, it falls back to the default splitter.

@Donovan-Ye (Author)

> Hi, I'd like to ask: I defined my own qa_text_splitter.py, so why does it still need network access to huggingface? I don't quite understand this part.

You can follow the logic for vectorizing uploaded files; it eventually reaches this point:
https://github.com/Donovan-Ye/Langchain-Chatchat/blob/2ef5d1fafe164797151ad79c8c42f04e39cc4876/server/knowledge_base/utils.py#L189

You can see that it loads the splitter and tokenizer according to `source` and `tokenizer_name_or_path`. Because I set the `source` of `qa_text_splitter` to huggingface, it goes through that branch and tries to load the corresponding tokenizer. If loading fails, it falls into the catch below and uses the default splitter.

I haven't studied this in depth; for my use cases: 1. If you're using a local model, set `tokenizer_name_or_path` to `''`. 2. If you're going through the OpenAI API, set `tokenizer_name_or_path` to `gpt2`.

That said, I just took a closer look: you could try setting `source` to `''`, because there is also an `else` branch:

try:
    # ...
    if text_splitter_dict[splitter_name]["source"] == "tiktoken":  # load from tiktoken
        # ...
    elif text_splitter_dict[splitter_name]["source"] == "huggingface":  # load from huggingface
        # ...
    else:
        try:
            text_splitter = TextSplitter(
                pipeline="zh_core_web_sm",
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap
            )
        except Exception:
            text_splitter = TextSplitter(
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap
            )
except Exception as e:
    print(e)
    text_splitter_module = importlib.import_module('langchain.text_splitter')
    TextSplitter = getattr(text_splitter_module, "RecursiveCharacterTextSplitter")
    text_splitter = TextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
# ...
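For reference, the configuration this code reads could be sketched as below. This is a hedged illustration: the entry name `"qa_text_splitter"` and the exact values are assumptions based on the thread, not copied from the project's kb_config.py.

```python
# Illustrative sketch of a text_splitter_dict entry for the QA splitter.
# The key name "qa_text_splitter" and the values are assumptions from the
# discussion above, not the project's actual config.
text_splitter_dict = {
    "qa_text_splitter": {
        # An empty source skips both the tiktoken and huggingface
        # branches, so no network access is needed.
        "source": "",
        # Per the discussion: '' for local models, "gpt2" for the OpenAI API.
        "tokenizer_name_or_path": "",
    },
}
```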

@nauyiahc

Hi, I'd like to ask: during database initialization I used qa_text_splitter.py, but I only want to vectorize the question part, not the answer. How can I do that? Right now, with qa_text_splitter.py, the whole Q-A pair gets vectorized...

@Donovan-Ye (Author)

> Hi, I'd like to ask: during database initialization I used qa_text_splitter.py, but I only want to vectorize the question part, not the answer. How can I do that? Right now, with qa_text_splitter.py, the whole Q-A pair gets vectorized...

Do you mean just this part?
[screenshot]

@nauyiahc

> Hi, I'd like to ask: during database initialization I used qa_text_splitter.py, but I only want to vectorize the question part, not the answer. How can I do that? Right now, with qa_text_splitter.py, the whole Q-A pair gets vectorized...
>
> Do you mean just this part? [screenshot]

I implemented a simple version: in the embed_documents method of EmbeddingsFunAdapter in base.py, I use a regular expression to extract the question from texts before vectorizing. That way only the question gets vectorized.
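A minimal sketch of that regex idea, assuming each chunk is formatted as `question: ...\nanswer: ...` (the exact chunk format and where this hooks into embed_documents are assumptions, not shown in the thread):

```python
import re
from typing import List


def extract_questions(texts: List[str]) -> List[str]:
    """Keep only the question part of each Q-A chunk before embedding.

    Assumes chunks look like "question: ...\nanswer: ..." (an
    assumption); chunks that don't match are embedded unchanged.
    """
    out = []
    for t in texts:
        # Accept either the ASCII ':' or the full-width '：' separator.
        m = re.search(r"question[:：]\s*(.*?)\s*(?:\n|answer[:：])", t, re.S)
        out.append(m.group(1) if m else t)
    return out


chunks = ["question: What is RAG?\nanswer: Retrieval-augmented generation."]
print(extract_questions(chunks))  # → ['What is RAG?']
```

These extracted strings would then be passed to the embedding model in place of the full chunks, while the stored documents keep the complete Q-A text for retrieval output.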

@nauyiahc

> Hi, I'd like to ask: during database initialization I used qa_text_splitter.py, but I only want to vectorize the question part, not the answer. How can I do that? Right now, with qa_text_splitter.py, the whole Q-A pair gets vectorized...
>
> Do you mean just this part? [screenshot]
>
> I implemented a simple version: in the embed_documents method of EmbeddingsFunAdapter in base.py, I use a regular expression to extract the question from texts before vectorizing. That way only the question gets vectorized.

Vectorizing only the question:
[screenshot]
I added the following function to the embed_documents method:
[screenshot]

@nauyiahc

nauyiahc commented Apr 23, 2024

I think converting texts directly to a dict and taking the value of question would also work, using try to handle extraction failures. I want to do this during database initialization and incremental updates; I haven't considered the front-end page for now. If only the question is vectorized, the retrieval threshold can be set lower and matching becomes more precise.
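That dict-based variant could be sketched like this, assuming each chunk is itself a JSON object string with a `"question"` key (an assumption; the thread does not show the exact chunk format):

```python
import json
from typing import List


def questions_only(texts: List[str]) -> List[str]:
    """Parse each chunk as JSON and keep the "question" value.

    Falls back to the original text if parsing fails or the key is
    missing, per the try-based approach described above.
    """
    out = []
    for t in texts:
        try:
            out.append(json.loads(t)["question"])
        except (json.JSONDecodeError, KeyError, TypeError):
            out.append(t)
    return out


print(questions_only(
    ['{"question": "What is RAG?", "answer": "..."}', "plain chunk"]
))  # → ['What is RAG?', 'plain chunk']
```

Compared to the regex version, this is stricter about input shape but avoids pattern-matching edge cases; the try/except keeps non-JSON chunks working unchanged.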

@chuanSir123

> Hi, I'd like to ask: during database initialization I used qa_text_splitter.py, but I only want to vectorize the question part, not the answer. How can I do that? Right now, with qa_text_splitter.py, the whole Q-A pair gets vectorized...
>
> Do you mean just this part? [screenshot]
>
> I implemented a simple version: in the embed_documents method of EmbeddingsFunAdapter in base.py, I use a regular expression to extract the question from texts before vectorizing. That way only the question gets vectorized.
>
> Vectorizing only the question: [screenshot] I added the following function to the embed_documents method: [screenshot]

I made the change at the code location you showed, but the print doesn't seem to be triggered. Are you sure QA mode goes through this method?
