
Add new splitter to process QA-type files (currently only JSON is supported) and add a toggle button on the knowledge_base page #3298

Open
wants to merge 1 commit into base: master

Conversation

Donovan-Ye

I wrote a new splitter to improve the processing of QA-type knowledge (currently only JSON is supported, as shown in the example). I also added a toggle button on the knowledge_base page to switch between the QA splitter and the normal splitter (ChineseRecursiveTextSplitter, defined in kb_config.py).

I created a PR because I noticed that many people are encountering the same issue (#3164, #893, and others).

Here are the updated page and test results for the QA splitter:
[screenshot: updated knowledge_base page]
[screenshot: QA splitter test results]
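A QA splitter along these lines could look roughly as follows. This is a minimal sketch, not the PR's actual code: the class name `QATextSplitter` and the input schema (a JSON array of `{"question": ..., "answer": ...}` objects) are assumptions based on the description above.

```python
import json
from typing import List


class QATextSplitter:
    """Hypothetical sketch of a QA splitter: one chunk per Q-A pair.

    Assumes the file content is a JSON array of objects with
    "question" and "answer" keys (schema is an assumption, not
    taken from the PR).
    """

    def split_text(self, text: str) -> List[str]:
        pairs = json.loads(text)
        # Keep each question with its answer, so a length-based
        # splitter can never separate them.
        return [
            f"question: {p['question']}\nanswer: {p['answer']}"
            for p in pairs
        ]


splitter = QATextSplitter()
chunks = splitter.split_text(
    '[{"question": "What is RAG?", "answer": "Retrieval-augmented generation."}]'
)
```

The point of the design is that a Q-A pair is an atomic unit of knowledge; splitting mid-answer (as a recursive character splitter may do) degrades retrieval quality.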

@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Mar 13, 2024
@chuanSir123

Hi, I made the changes following this code, but splitting still went through ChineseRecursiveTextSplitter. I can see the same in your screenshot.

@chuanSir123

> Hi, I made the changes following this code, but splitting still went through ChineseRecursiveTextSplitter. I can see the same in your screenshot.

Solved it. A network problem caused a different splitter to be selected by default.

@Donovan-Ye (Author)

> Hi, I made the changes following this code, but splitting still went through ChineseRecursiveTextSplitter. I can see the same in your screenshot.
>
> Solved it. A network problem caused a different splitter to be selected by default.

Yes, exactly. If huggingface can't be reached, it falls back to the default splitter.

@Donovan-Ye (Author)

> Hi, I'd like to ask: I defined my own qa_text_splitter.py, so why does it still need network access to huggingface? I don't quite understand this part.

You can follow the logic for vectorizing uploaded files; it eventually reaches this point:
https://github.com/Donovan-Ye/Langchain-Chatchat/blob/2ef5d1fafe164797151ad79c8c42f04e39cc4876/server/knowledge_base/utils.py#L189

You can see that it loads the splitter and tokenizer according to `source` and `tokenizer_name_or_path`. Because I set the `source` of `qa_text_splitter` to huggingface, it goes through that branch and tries to load the corresponding tokenizer. If loading fails, it falls into the catch below and uses the default splitter.

I haven't studied this in depth; for my use cases: 1. If you're using a local model, set `tokenizer_name_or_path` to `''`. 2. If you're going through the OpenAI API, set `tokenizer_name_or_path` to `gpt2`.

That said, I just took a closer look: you could try setting `source` to `''`, because there is also an `else` branch:

try:
    # ...
    if text_splitter_dict[splitter_name]["source"] == "tiktoken":  # load from tiktoken
        # ...
    elif text_splitter_dict[splitter_name]["source"] == "huggingface":  # load from huggingface
        # ...
    else:
        try:
            text_splitter = TextSplitter(
                pipeline="zh_core_web_sm",
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap
            )
        except Exception:
            text_splitter = TextSplitter(
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap
            )
except Exception as e:
    print(e)
    text_splitter_module = importlib.import_module('langchain.text_splitter')
    TextSplitter = getattr(text_splitter_module, "RecursiveCharacterTextSplitter")
    text_splitter = TextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
# ...
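For reference, the configuration this code reads could be sketched as below. This is a hedged illustration: the entry name `"qa_text_splitter"` and the exact values are assumptions based on the thread, not copied from the project's kb_config.py.

```python
# Illustrative sketch of a text_splitter_dict entry for the QA splitter.
# The key name "qa_text_splitter" and the values are assumptions from the
# discussion above, not the project's actual config.
text_splitter_dict = {
    "qa_text_splitter": {
        # An empty source skips both the tiktoken and huggingface
        # branches, so no network access is needed.
        "source": "",
        # Per the discussion: '' for local models, "gpt2" for the OpenAI API.
        "tokenizer_name_or_path": "",
    },
}
```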

@nauyiahc

Hi, I'd like to ask: during database initialization I used qa_text_splitter.py, but I only want to vectorize the question part, not the answer. How can I do that? Right now, with qa_text_splitter.py, the whole Q-A pair gets vectorized...

@Donovan-Ye (Author)

> Hi, I'd like to ask: during database initialization I used qa_text_splitter.py, but I only want to vectorize the question part, not the answer. How can I do that? Right now, with qa_text_splitter.py, the whole Q-A pair gets vectorized...

Do you mean just this part?
[screenshot]

@nauyiahc

> Hi, I'd like to ask: during database initialization I used qa_text_splitter.py, but I only want to vectorize the question part, not the answer. How can I do that? Right now, with qa_text_splitter.py, the whole Q-A pair gets vectorized...
>
> Do you mean just this part? [screenshot]

I implemented a simple version: in the embed_documents method of EmbeddingsFunAdapter in base.py, I use a regular expression to extract the question from texts before vectorizing. That way only the question gets vectorized.
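A minimal sketch of that regex idea, assuming each chunk is formatted as `question: ...\nanswer: ...` (the exact chunk format and where this hooks into embed_documents are assumptions, not shown in the thread):

```python
import re
from typing import List


def extract_questions(texts: List[str]) -> List[str]:
    """Keep only the question part of each Q-A chunk before embedding.

    Assumes chunks look like "question: ...\nanswer: ..." (an
    assumption); chunks that don't match are embedded unchanged.
    """
    out = []
    for t in texts:
        # Accept either the ASCII ':' or the full-width '：' separator.
        m = re.search(r"question[:：]\s*(.*?)\s*(?:\n|answer[:：])", t, re.S)
        out.append(m.group(1) if m else t)
    return out


chunks = ["question: What is RAG?\nanswer: Retrieval-augmented generation."]
print(extract_questions(chunks))  # → ['What is RAG?']
```

These extracted strings would then be passed to the embedding model in place of the full chunks, while the stored documents keep the complete Q-A text for retrieval output.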

@nauyiahc

> Hi, I'd like to ask: during database initialization I used qa_text_splitter.py, but I only want to vectorize the question part, not the answer. How can I do that? Right now, with qa_text_splitter.py, the whole Q-A pair gets vectorized...
>
> Do you mean just this part? [screenshot]
>
> I implemented a simple version: in the embed_documents method of EmbeddingsFunAdapter in base.py, I use a regular expression to extract the question from texts before vectorizing. That way only the question gets vectorized.

Vectorizing only the question:
[screenshot]
I added the following function to the embed_documents method:
[screenshot]

@nauyiahc

nauyiahc commented Apr 23, 2024

I think converting texts directly to a dict and taking the value of question would also work, using try to handle extraction failures. I want to do this during database initialization and incremental updates; I haven't considered the front-end page for now. If only the question is vectorized, the retrieval threshold can be set lower and matching becomes more precise.
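That dict-based variant could be sketched like this, assuming each chunk is itself a JSON object string with a `"question"` key (an assumption; the thread does not show the exact chunk format):

```python
import json
from typing import List


def questions_only(texts: List[str]) -> List[str]:
    """Parse each chunk as JSON and keep the "question" value.

    Falls back to the original text if parsing fails or the key is
    missing, per the try-based approach described above.
    """
    out = []
    for t in texts:
        try:
            out.append(json.loads(t)["question"])
        except (json.JSONDecodeError, KeyError, TypeError):
            out.append(t)
    return out


print(questions_only(
    ['{"question": "What is RAG?", "answer": "..."}', "plain chunk"]
))  # → ['What is RAG?', 'plain chunk']
```

Compared to the regex version, this is stricter about input shape but avoids pattern-matching edge cases; the try/except keeps non-JSON chunks working unchanged.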

@chuanSir123

> Hi, I'd like to ask: during database initialization I used qa_text_splitter.py, but I only want to vectorize the question part, not the answer. How can I do that? Right now, with qa_text_splitter.py, the whole Q-A pair gets vectorized...
>
> Do you mean just this part? [screenshot]
>
> I implemented a simple version: in the embed_documents method of EmbeddingsFunAdapter in base.py, I use a regular expression to extract the question from texts before vectorizing. That way only the question gets vectorized.
>
> Vectorizing only the question: [screenshot] I added the following function to the embed_documents method: [screenshot]

I made the change at the code location you showed, but the print doesn't seem to be triggered. Are you sure QA mode goes through this method?
