
[Bug]: Chinese support not very well? #317

Closed
ablozhou opened this issue Apr 28, 2023 · 11 comments

Comments

@ablozhou

Current Behavior

I tested the official similarity example in the README.

onnx = Onnx()
data_manager = get_data_manager(CacheBase("sqlite"), VectorBase("faiss", dimension=onnx.dimension))
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
    )

...

but it doesn't support Chinese very well. Whatever question I ask, it keeps returning the same answer:

q: 俄罗斯总统是谁 (Who is the president of Russia?)
目前的俄罗斯总统是弗拉基米尔·普京。 (The current president of Russia is Vladimir Putin.)
q: 你是谁? (Who are you?)
目前的俄罗斯总统是弗拉基米尔·普京。 (The current president of Russia is Vladimir Putin.)
q: 东风夜放花千树 (a line of classical Chinese poetry)
目前的俄罗斯总统是弗拉基米尔·普京。 (The current president of Russia is Vladimir Putin.)
q: who are you?
I am an AI language model developed by OpenAI. I am designed to assist and provide information to users through conversation.
q: 我儿子8岁, 我3年后比我儿子2倍大3岁, 我多少岁? (My son is 8; in 3 years I will be 3 years older than twice his age. How old am I?)
目前你的年龄是13岁,因为(8+3)*2=22。 (You are currently 13, because (8+3)*2=22.)

q: 东风夜放花千树 (the same line of poetry again)
目前的俄罗斯总统是弗拉基米尔·普京。 (The current president of Russia is Vladimir Putin.)
Time consuming: 0.10s
2023-04-28 18:30:01,839 - 140497058133568 - _internal.py-_internal:186 - INFO: 127.0.0.1 - - [28/Apr/2023 18:30:01] "POST / HTTP/1.1" 302 -
2023-04-28 18:30:01,853 - 140497184024128 - _internal.py-_internal:186 - INFO: 127.0.0.1 - - [28/Apr/2023 18:30:01] "GET /?result=目前的俄罗斯总统是弗拉基米尔·普京。 HTTP/1.1" 200 -

How can I avoid these problems?

Thank you!
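For context on why this happens: GPTCache decides a cache hit by comparing embedding vectors, and with SearchDistanceEvaluation a small vector distance is treated as "same question". If the embedding model represents Chinese poorly, unrelated Chinese queries can land close together in vector space and wrongly reuse a cached answer. A minimal stdlib sketch of that hit/miss logic (the vectors, threshold, and similarity measure here are illustrative, not GPTCache's actual internals):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def cache_lookup(query_vec, cached, threshold=0.8):
    # Return the cached answer whose embedding is most similar to the
    # query, but only if it clears the similarity threshold.
    best_answer, best_score = None, -1.0
    for vec, answer in cached:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer if best_score >= threshold else None

# A model that maps unrelated queries to nearly identical vectors
# makes every lookup a false hit:
cached = [([1.0, 0.0, 0.1], "Putin is the current president of Russia.")]
print(cache_lookup([0.98, 0.05, 0.1], cached))  # false hit: returns the cached answer
print(cache_lookup([0.0, 1.0, 0.0], cached))    # distinct vector: returns None
```

The fix discussed below is therefore to swap in an embedding model that separates unrelated Chinese sentences properly.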

Expected Behavior

Match the right question and give the right answer.

Steps To Reproduce

Run the similarity-match example from the README.

Environment

ubuntu 22.04

Anything else?

using onnx

@junjiejiangjjj
Contributor

Try other embedding models: https://github.com/zilliztech/GPTCache/blob/main/gptcache/embedding/__init__.py
For example:

from gptcache.embedding import SBERT, Huggingface, FastText

@ablozhou
Author

ablozhou commented May 4, 2023

I just replaced ONNX with other embedding models, but they all report the error below.
Are there any samples showing how to use other embedding models?

Traceback (most recent call last):
  File "tsim.py", line 53, in <module>
    response = openai.ChatCompletion.create(
  File "/home/zhh/git/GPTCache/gptcache/adapter/openai.py", line 79, in create
    return adapt(
  File "/home/zhh/git/GPTCache/gptcache/adapter/adapter.py", line 52, in adapt
    cache_data_list = time_cal(
  File "/home/zhh/git/GPTCache/gptcache/utils/time.py", line 9, in inner
    res = func(*args, **kwargs)
  File "/home/zhh/git/GPTCache/gptcache/manager/data_manager.py", line 319, in search
    return self.v.search(data=embedding_data, top_k=top_k)
  File "/home/zhh/git/GPTCache/gptcache/manager/vector_data/faiss.py", line 45, in search
    dist, ids = self._index.search(np_data, top_k)
  File "/home/zhh/anaconda3/envs/ai/lib/python3.8/site-packages/faiss/class_wrappers.py", line 329, in replacement_search
    assert d == self.d

@123seven

123seven commented May 4, 2023

> I just replaced ONNX with other models, but they all report the error below… (quoting the comment and traceback above)

Delete the faiss.index and sqlite files. The index on disk was built with the previous model's embedding dimension, so searching it with a different model fails the `assert d == self.d` dimension check.
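The assertion in the traceback fires because a FAISS index remembers the dimension it was built with and rejects queries of any other dimension. A stdlib sketch of that same check (the dimensions here are illustrative; the real ones come from each embedding model):

```python
class TinyIndex:
    # Mimics the dimension check a vector index performs on search:
    # the index remembers the dimension it was built with (self.d).
    def __init__(self, d):
        self.d = d
        self.vectors = []

    def add(self, vec):
        assert len(vec) == self.d
        self.vectors.append(vec)

    def search(self, vec):
        d = len(vec)
        assert d == self.d, f"query dim {d} != index dim {self.d}"
        return self.vectors

# Index persisted by one model, queried with a different-sized one:
index = TinyIndex(d=768)
try:
    index.search([0.0] * 384)
except AssertionError as e:
    print(e)  # query dim 384 != index dim 768
```

This is why deleting the stale faiss.index (and the sqlite metadata that points at it) and letting GPTCache rebuild them with the new model resolves the error.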

@jaelgu
Collaborator

jaelgu commented May 4, 2023

> I just replaced ONNX with other models, but they all report the error below… (quoting the comment and traceback above)

What embedding model do you use? You can find the built-in embedding methods, with examples, in our docs: https://gptcache.readthedocs.io/en/latest/references/embedding.html

@scguoi

scguoi commented May 4, 2023

I found a lot of embedding models in the docs. Which one is recommended for Chinese?

@SimFG
Collaborator

SimFG commented May 4, 2023

I use the uer/albert-base-chinese-cluecorpussmall model from Hugging Face. Here is a simple demo you can try:

import os
import time

from gptcache.embedding import Huggingface
from gptcache import cache
from gptcache.adapter import openai
from gptcache.manager import get_data_manager, VectorBase
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

huggingface = Huggingface(model='uer/albert-base-chinese-cluecorpussmall')
vector_base = VectorBase('faiss', dimension=huggingface.dimension)
data_manager = get_data_manager('sqlite', vector_base)
cache.init(
    embedding_func=huggingface.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
    )
os.environ['OPENAI_API_KEY'] = 'YOUR API KEY'
cache.set_openai_key()

questions = [
    '什么是Github',
    '你可以解释下什么是Github吗',
    '可以告诉我关于Github一些信息吗'
]


def response_text(openai_resp):
    return openai_resp['choices'][0]['message']['content']


for question in questions:
    for _ in range(2):
        start_time = time.time()
        response = openai.ChatCompletion.create(
            model='gpt-3.5-turbo',
            messages=[
                {
                    'role': 'user',
                    'content': question
                }
            ],
        )
        print(f'Question: {question}')
        print('Time consuming: {:.2f}s'.format(time.time() - start_time))
        print(f'Answer: {response_text(response)}\n')

output:

Question: 什么是Github (What is GitHub?)
Time consuming: 6.96s
Answer: Github是一个基于Git版本控制系统的代码托管平台,用于协作开发、分享和存储代码。用户可以在Github上创建一个仓库,并将自己的代码提交到该仓库中。Github支持多种语言,包括但不限于Java、Python、JavaScript、Ruby等。通过Github,用户可以方便地共享自己的代码,并与其他开发者协作开发项目。Github也提供了许多功能,如Pull Request、Issues、Projects等,可以帮助开发者更好地管理和协作开发项目。同时,Github也是一个开源社区,用户可以在平台上浏览、学习和贡献开源项目。
(English gist: GitHub is a code-hosting platform based on the Git version-control system, used for collaborative development and for sharing and storing code; it supports many languages and offers features such as Pull Requests, Issues, and Projects, and also serves as an open-source community.)

Question: 什么是Github (What is GitHub?)
Time consuming: 0.08s
Answer: (identical answer, served from the cache)

Question: 你可以解释下什么是Github吗 (Can you explain what GitHub is?)
Time consuming: 0.10s
Answer: (identical answer, served from the cache)

Question: 你可以解释下什么是Github吗 (Can you explain what GitHub is?)
Time consuming: 0.15s
Answer: (identical answer, served from the cache)

Question: 可以告诉我关于Github一些信息吗 (Can you tell me some information about GitHub?)
Time consuming: 0.11s
Answer: (identical answer, served from the cache)

Question: 可以告诉我关于Github一些信息吗 (Can you tell me some information about GitHub?)
Time consuming: 0.10s
Answer: (identical answer, served from the cache)

@jaelgu
Collaborator

jaelgu commented May 4, 2023

> I found a lot of embedding models in the docs. Which one is recommended for Chinese?

You can always pass your own embedding function to GPTCache. There are also many open-source multilingual or Chinese-capable models available on Hugging Face. If you use a model from Hugging Face, you can pass the model name as shown in the example here: https://gptcache.readthedocs.io/en/latest/references/embedding.html#module-gptcache.embedding.huggingface
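To illustrate the "pass your own embedding function" route: the `embedding_func` given to `cache.init` just needs to be a callable that maps text to a fixed-length numeric vector whose length matches the `dimension` passed to the vector store. A toy stdlib example of such a callable (the name `toy_embedding` and the hash-based vector are made up to show the required interface only; a real deployment would use a proper multilingual model for semantic similarity):

```python
import hashlib

DIMENSION = 8  # must match VectorBase(..., dimension=DIMENSION)

def toy_embedding(text, **_):
    # Deterministically map text to a fixed-length float vector.
    # This demonstrates only the shape/interface contract, not a
    # semantically meaningful embedding.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:DIMENSION]]

vec = toy_embedding("什么是Github")
print(len(vec))  # 8 -- a fixed dimension, as the vector store requires
```

Whatever function you plug in, keep the declared dimension and the function's output length in sync, or the index search will fail the dimension assertion shown earlier in this thread.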

@SimFG
Collaborator

SimFG commented May 5, 2023

@ablozhou If the answer above has solved your problem, I will close this issue.

@EricKong1985

EricKong1985 commented Aug 19, 2023

from gptcache.embedding import Huggingface
from gptcache import cache
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation


def get_content_func(data, **_):
    # Embed only the text after the last "Question" marker in the prompt.
    return data.get("prompt").split("Question")[-1]


cache_base = CacheBase('sqlite')
huggingface = Huggingface(model='uer/albert-base-chinese-cluecorpussmall')
vector_base = VectorBase('milvus', host='127.0.0.1',
                         port='19530',
                         dimension=huggingface.dimension,
                         collection_name='chatbot')
data_manager = get_data_manager(cache_base, vector_base)
cache.init(
    pre_embedding_func=get_content_func,
    embedding_func=huggingface.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Milvus
from langchain.document_loaders import TextLoader

loader = TextLoader('customer_data/data.txt', encoding="utf-8")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
vector_db = Milvus.from_documents(
    docs,
    embeddings,
    connection_args={"host": "127.0.0.1", "port": "19530"},
)
query = "我的产品名字叫什么?"  # "What is my product's name?"
docs = vector_db.similarity_search(query)

from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

from gptcache.adapter.langchain_models import LangChainLLMs

llm = LangChainLLMs(llm=OpenAI(temperature=0))
chain = load_qa_chain(llm, chain_type="stuff")
chain({"input_documents": docs, "question": query}, return_only_outputs=True)
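A note on the `get_content_func` used as `pre_embedding_func` above: it keeps only the text after the last "Question" marker in the prompt, so that a long retrieved context does not dominate the cache lookup. Its behavior can be checked in isolation (the sample prompt is hypothetical):

```python
def get_content_func(data, **_):
    # Keep only what follows the last "Question" marker in the prompt.
    return data.get("prompt").split("Question")[-1]

prompt = "Use the context below.\nContext: ...\nQuestion: 我的产品名字叫什么?"
print(get_content_func({"prompt": prompt}))  # ': 我的产品名字叫什么?'
```

Note that if the prompt contains no "Question" marker, `split` returns the whole prompt unchanged, so the function degrades gracefully.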

@SimFG
Then I hit the error below. Does anyone know what is happening?

Traceback (most recent call last):
  File "C:\codes\demo\gptcache.py", line 59, in <module>
    from gptcache.adapter.langchain_models import LangChainLLMs
  File "C:\codes\demor\gptcache\adapter\langchain_models.py", line 30, in <module>
    class LangChainLLMs(LLM, BaseModel):
  File "pydantic\main.py", line 197, in pydantic.main.ModelMetaclass.__new__
  File "pydantic\fields.py", line 506, in pydantic.fields.ModelField.infer
  File "pydantic\fields.py", line 436, in pydantic.fields.ModelField.__init__
  File "pydantic\fields.py", line 546, in pydantic.fields.ModelField.prepare
  File "pydantic\fields.py", line 570, in pydantic.fields.ModelField._set_default_and_type
  File "pydantic\fields.py", line 439, in pydantic.fields.ModelField.get_default
  File "pydantic\utils.py", line 693, in pydantic.utils.smart_deepcopy
  File "D:\Py310\lib\copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "D:\Py310\lib\copy.py", line 271, in _reconstruct
    state = deepcopy(state, memo)
  File "D:\Py310\lib\copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "D:\Py310\lib\copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  ... (the deepcopy/_reconstruct/_deepcopy_dict cycle above repeats several more times) ...
  File "D:\Py310\lib\copy.py", line 161, in deepcopy
    rv = reductor(4)
TypeError: cannot pickle 'module' object

Process finished with exit code 1

@EricKong1985

@SimFG

@SimFG
Collaborator

SimFG commented Aug 21, 2023

@EricKong1985 This does not seem to be caused by GPTCache; it may be a similar problem to this one: https://stackoverflow.com/questions/2790828/python-cant-pickle-module-objects-error.
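A minimal reproduction of that error, independent of GPTCache: the traceback shows pydantic calling `copy.deepcopy` on a field default, and `deepcopy` fails on any object graph that contains a module, because modules cannot be pickled. The `Holder` class below is a made-up stand-in for whatever object in the real stack ends up holding a module reference:

```python
import copy
import math

class Holder:
    def __init__(self):
        # Storing a module on an instance makes the object graph
        # un-deepcopy-able: deepcopy falls back to pickle's reduce
        # protocol, and modules cannot be pickled.
        self.mod = math

try:
    copy.deepcopy(Holder())
except TypeError as e:
    print(e)  # e.g. "cannot pickle 'module' object"
```

One thing worth checking in the report above: the failing script is named `C:\codes\demo\gptcache.py`, and a script that shares its name with the `gptcache` package can shadow the real package on import, which commonly produces confusing failures like this.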
