
[Bug]: Chinese support not very well? #317

Closed
ablozhou opened this issue Apr 28, 2023 · 11 comments

Comments

@ablozhou

Current Behavior

I tested the official similarity example in the README.

onnx = Onnx()
data_manager = get_data_manager(CacheBase("sqlite"), VectorBase("faiss", dimension=onnx.dimension))
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
    )

...

but it doesn't support Chinese very well. Whatever question I ask, it keeps returning the same answer:

q: 俄罗斯总统是谁 (Who is the president of Russia?)
目前的俄罗斯总统是弗拉基米尔·普京。 (The current president of Russia is Vladimir Putin.)
q: 你是谁? (Who are you?)
目前的俄罗斯总统是弗拉基米尔·普京。 (The current president of Russia is Vladimir Putin.)
q: 东风夜放花千树 (a line of classical Chinese poetry)
目前的俄罗斯总统是弗拉基米尔·普京。 (The current president of Russia is Vladimir Putin.)
q: who are you?
I am an AI language model developed by OpenAI. I am designed to assist and provide information to users through conversation.
q: 我儿子8岁, 我3年后比我儿子2倍大3岁, 我多少岁? (My son is 8; in 3 years I will be 3 years older than twice his age. How old am I?)
目前你的年龄是13岁,因为(8+3)*2=22。 (You are currently 13, because (8+3)*2=22.)

q: 东风夜放花千树 (the same line of poetry again)
目前的俄罗斯总统是弗拉基米尔·普京。 (The current president of Russia is Vladimir Putin.)
Time consuming: 0.10s
2023-04-28 18:30:01,839 - 140497058133568 - _internal.py-_internal:186 - INFO: 127.0.0.1 - - [28/Apr/2023 18:30:01] "POST / HTTP/1.1" 302 -
2023-04-28 18:30:01,853 - 140497184024128 - _internal.py-_internal:186 - INFO: 127.0.0.1 - - [28/Apr/2023 18:30:01] "GET /?result=目前的俄罗斯总统是弗拉基米尔·普京。 HTTP/1.1" 200 -

How can I avoid these problems?

Thank you!
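For context on why this happens: GPTCache decides a cache hit by comparing embedding vectors, and with SearchDistanceEvaluation a small vector distance is treated as "same question". If the embedding model represents Chinese poorly, unrelated Chinese queries can land close together in vector space and wrongly reuse a cached answer. A minimal stdlib sketch of that hit/miss logic (the vectors, threshold, and similarity measure here are illustrative, not GPTCache's actual internals):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def cache_lookup(query_vec, cached, threshold=0.8):
    # Return the cached answer whose embedding is most similar to the
    # query, but only if it clears the similarity threshold.
    best_answer, best_score = None, -1.0
    for vec, answer in cached:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer if best_score >= threshold else None

# A model that maps unrelated queries to nearly identical vectors
# makes every lookup a false hit:
cached = [([1.0, 0.0, 0.1], "Putin is the current president of Russia.")]
print(cache_lookup([0.98, 0.05, 0.1], cached))  # false hit: returns the cached answer
print(cache_lookup([0.0, 1.0, 0.0], cached))    # distinct vector: returns None
```

The fix discussed below is therefore to swap in an embedding model that separates unrelated Chinese sentences properly.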

Expected Behavior

Match the right question and give the right answer.

Steps To Reproduce

Run the similarity-match example from the README.

Environment

ubuntu 22.04

Anything else?

using onnx

@junjiejiangjjj
Contributor

Try other embedding models: https://github.com/zilliztech/GPTCache/blob/main/gptcache/embedding/__init__.py
For example:

from gptcache.embedding import SBERT, Huggingface, FastText

@ablozhou
Author

ablozhou commented May 4, 2023

I just replaced ONNX with other embedding models, but they all report the error below.
Are there any samples showing how to use other embedding models?

Traceback (most recent call last):
  File "tsim.py", line 53, in <module>
    response = openai.ChatCompletion.create(
  File "/home/zhh/git/GPTCache/gptcache/adapter/openai.py", line 79, in create
    return adapt(
  File "/home/zhh/git/GPTCache/gptcache/adapter/adapter.py", line 52, in adapt
    cache_data_list = time_cal(
  File "/home/zhh/git/GPTCache/gptcache/utils/time.py", line 9, in inner
    res = func(*args, **kwargs)
  File "/home/zhh/git/GPTCache/gptcache/manager/data_manager.py", line 319, in search
    return self.v.search(data=embedding_data, top_k=top_k)
  File "/home/zhh/git/GPTCache/gptcache/manager/vector_data/faiss.py", line 45, in search
    dist, ids = self._index.search(np_data, top_k)
  File "/home/zhh/anaconda3/envs/ai/lib/python3.8/site-packages/faiss/class_wrappers.py", line 329, in replacement_search
    assert d == self.d

@123seven

123seven commented May 4, 2023

> I just replaced ONNX with other models, but they all report the error below… (quoting the comment and traceback above)

Delete the faiss.index and sqlite files. The index on disk was built with the previous model's embedding dimension, so searching it with a different model fails the `assert d == self.d` dimension check.
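The assertion in the traceback fires because a FAISS index remembers the dimension it was built with and rejects queries of any other dimension. A stdlib sketch of that same check (the dimensions here are illustrative; the real ones come from each embedding model):

```python
class TinyIndex:
    # Mimics the dimension check a vector index performs on search:
    # the index remembers the dimension it was built with (self.d).
    def __init__(self, d):
        self.d = d
        self.vectors = []

    def add(self, vec):
        assert len(vec) == self.d
        self.vectors.append(vec)

    def search(self, vec):
        d = len(vec)
        assert d == self.d, f"query dim {d} != index dim {self.d}"
        return self.vectors

# Index persisted by one model, queried with a different-sized one:
index = TinyIndex(d=768)
try:
    index.search([0.0] * 384)
except AssertionError as e:
    print(e)  # query dim 384 != index dim 768
```

This is why deleting the stale faiss.index (and the sqlite metadata that points at it) and letting GPTCache rebuild them with the new model resolves the error.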

@jaelgu
Collaborator

jaelgu commented May 4, 2023

> I just replaced ONNX with other models, but they all report the error below… (quoting the comment and traceback above)

What embedding model do you use? You can find the built-in embedding methods, with examples, in our docs: https://gptcache.readthedocs.io/en/latest/references/embedding.html

@scguoi

scguoi commented May 4, 2023

I found a lot of embedding models in the docs. Which one is recommended for Chinese?

@SimFG
Collaborator

SimFG commented May 4, 2023

I use the uer/albert-base-chinese-cluecorpussmall model from Hugging Face. Here is a simple demo you can try:

import os
import time

from gptcache.embedding import Huggingface
from gptcache import cache
from gptcache.adapter import openai
from gptcache.manager import get_data_manager, VectorBase
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

huggingface = Huggingface(model='uer/albert-base-chinese-cluecorpussmall')
vector_base = VectorBase('faiss', dimension=huggingface.dimension)
data_manager = get_data_manager('sqlite', vector_base)
cache.init(
    embedding_func=huggingface.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
    )
os.environ['OPENAI_API_KEY'] = 'YOUR API KEY'
cache.set_openai_key()

questions = [
    '什么是Github',
    '你可以解释下什么是Github吗',
    '可以告诉我关于Github一些信息吗'
]


def response_text(openai_resp):
    return openai_resp['choices'][0]['message']['content']


for question in questions:
    for _ in range(2):
        start_time = time.time()
        response = openai.ChatCompletion.create(
            model='gpt-3.5-turbo',
            messages=[
                {
                    'role': 'user',
                    'content': question
                }
            ],
        )
        print(f'Question: {question}')
        print('Time consuming: {:.2f}s'.format(time.time() - start_time))
        print(f'Answer: {response_text(response)}\n')

output:

Question: 什么是Github (What is GitHub?)
Time consuming: 6.96s
Answer: Github是一个基于Git版本控制系统的代码托管平台,用于协作开发、分享和存储代码。用户可以在Github上创建一个仓库,并将自己的代码提交到该仓库中。Github支持多种语言,包括但不限于Java、Python、JavaScript、Ruby等。通过Github,用户可以方便地共享自己的代码,并与其他开发者协作开发项目。Github也提供了许多功能,如Pull Request、Issues、Projects等,可以帮助开发者更好地管理和协作开发项目。同时,Github也是一个开源社区,用户可以在平台上浏览、学习和贡献开源项目。
(English gist: GitHub is a code-hosting platform based on the Git version-control system, used for collaborative development and for sharing and storing code; it supports many languages and offers features such as Pull Requests, Issues, and Projects, and also serves as an open-source community.)

Question: 什么是Github (What is GitHub?)
Time consuming: 0.08s
Answer: (identical answer, served from the cache)

Question: 你可以解释下什么是Github吗 (Can you explain what GitHub is?)
Time consuming: 0.10s
Answer: (identical answer, served from the cache)

Question: 你可以解释下什么是Github吗 (Can you explain what GitHub is?)
Time consuming: 0.15s
Answer: (identical answer, served from the cache)

Question: 可以告诉我关于Github一些信息吗 (Can you tell me some information about GitHub?)
Time consuming: 0.11s
Answer: (identical answer, served from the cache)

Question: 可以告诉我关于Github一些信息吗 (Can you tell me some information about GitHub?)
Time consuming: 0.10s
Answer: (identical answer, served from the cache)

@jaelgu
Collaborator

jaelgu commented May 4, 2023

> I found a lot of embedding models in the docs. Which one is recommended for Chinese?

You can always pass your own embedding function to GPTCache. There are also many open-source multilingual or Chinese-capable models available on Hugging Face. If you use a model from Hugging Face, you can pass the model name as shown in the example here: https://gptcache.readthedocs.io/en/latest/references/embedding.html#module-gptcache.embedding.huggingface
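To illustrate the "pass your own embedding function" route: the `embedding_func` given to `cache.init` just needs to be a callable that maps text to a fixed-length numeric vector whose length matches the `dimension` passed to the vector store. A toy stdlib example of such a callable (the name `toy_embedding` and the hash-based vector are made up to show the required interface only; a real deployment would use a proper multilingual model for semantic similarity):

```python
import hashlib

DIMENSION = 8  # must match VectorBase(..., dimension=DIMENSION)

def toy_embedding(text, **_):
    # Deterministically map text to a fixed-length float vector.
    # This demonstrates only the shape/interface contract, not a
    # semantically meaningful embedding.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:DIMENSION]]

vec = toy_embedding("什么是Github")
print(len(vec))  # 8 -- a fixed dimension, as the vector store requires
```

Whatever function you plug in, keep the declared dimension and the function's output length in sync, or the index search will fail the dimension assertion shown earlier in this thread.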

@SimFG
Collaborator

SimFG commented May 5, 2023

@ablozhou If the answer above has solved your problem, I will close this issue.

@EricKong1985

EricKong1985 commented Aug 19, 2023

from gptcache.embedding import Huggingface
from gptcache import cache
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation


def get_content_func(data, **_):
    # Embed only the text after the last "Question" marker in the prompt.
    return data.get("prompt").split("Question")[-1]


cache_base = CacheBase('sqlite')
huggingface = Huggingface(model='uer/albert-base-chinese-cluecorpussmall')
vector_base = VectorBase('milvus', host='127.0.0.1',
                         port='19530',
                         dimension=huggingface.dimension,
                         collection_name='chatbot')
data_manager = get_data_manager(cache_base, vector_base)
cache.init(
    pre_embedding_func=get_content_func,
    embedding_func=huggingface.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Milvus
from langchain.document_loaders import TextLoader

loader = TextLoader('customer_data/data.txt', encoding="utf-8")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
vector_db = Milvus.from_documents(
    docs,
    embeddings,
    connection_args={"host": "127.0.0.1", "port": "19530"},
)
query = "我的产品名字叫什么?"  # "What is my product's name?"
docs = vector_db.similarity_search(query)

from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

from gptcache.adapter.langchain_models import LangChainLLMs

llm = LangChainLLMs(llm=OpenAI(temperature=0))
chain = load_qa_chain(llm, chain_type="stuff")
chain({"input_documents": docs, "question": query}, return_only_outputs=True)
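A note on the `get_content_func` used as `pre_embedding_func` above: it keeps only the text after the last "Question" marker in the prompt, so that a long retrieved context does not dominate the cache lookup. Its behavior can be checked in isolation (the sample prompt is hypothetical):

```python
def get_content_func(data, **_):
    # Keep only what follows the last "Question" marker in the prompt.
    return data.get("prompt").split("Question")[-1]

prompt = "Use the context below.\nContext: ...\nQuestion: 我的产品名字叫什么?"
print(get_content_func({"prompt": prompt}))  # ': 我的产品名字叫什么?'
```

Note that if the prompt contains no "Question" marker, `split` returns the whole prompt unchanged, so the function degrades gracefully.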

@SimFG
Then I hit the error below. Does anyone know what is happening?

Traceback (most recent call last):
  File "C:\codes\demo\gptcache.py", line 59, in <module>
    from gptcache.adapter.langchain_models import LangChainLLMs
  File "C:\codes\demor\gptcache\adapter\langchain_models.py", line 30, in <module>
    class LangChainLLMs(LLM, BaseModel):
  File "pydantic\main.py", line 197, in pydantic.main.ModelMetaclass.__new__
  File "pydantic\fields.py", line 506, in pydantic.fields.ModelField.infer
  File "pydantic\fields.py", line 436, in pydantic.fields.ModelField.__init__
  File "pydantic\fields.py", line 546, in pydantic.fields.ModelField.prepare
  File "pydantic\fields.py", line 570, in pydantic.fields.ModelField._set_default_and_type
  File "pydantic\fields.py", line 439, in pydantic.fields.ModelField.get_default
  File "pydantic\utils.py", line 693, in pydantic.utils.smart_deepcopy
  File "D:\Py310\lib\copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "D:\Py310\lib\copy.py", line 271, in _reconstruct
    state = deepcopy(state, memo)
  File "D:\Py310\lib\copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "D:\Py310\lib\copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  ... (the deepcopy/_reconstruct/_deepcopy_dict cycle above repeats several more times) ...
  File "D:\Py310\lib\copy.py", line 161, in deepcopy
    rv = reductor(4)
TypeError: cannot pickle 'module' object

Process finished with exit code 1

@EricKong1985

@SimFG

@SimFG
Collaborator

SimFG commented Aug 21, 2023

@EricKong1985 This does not seem to be caused by GPTCache; it may be a similar problem to this one: https://stackoverflow.com/questions/2790828/python-cant-pickle-module-objects-error.
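A minimal reproduction of that error, independent of GPTCache: the traceback shows pydantic calling `copy.deepcopy` on a field default, and `deepcopy` fails on any object graph that contains a module, because modules cannot be pickled. The `Holder` class below is a made-up stand-in for whatever object in the real stack ends up holding a module reference:

```python
import copy
import math

class Holder:
    def __init__(self):
        # Storing a module on an instance makes the object graph
        # un-deepcopy-able: deepcopy falls back to pickle's reduce
        # protocol, and modules cannot be pickled.
        self.mod = math

try:
    copy.deepcopy(Holder())
except TypeError as e:
    print(e)  # e.g. "cannot pickle 'module' object"
```

One thing worth checking in the report above: the failing script is named `C:\codes\demo\gptcache.py`, and a script that shares its name with the `gptcache` package can shadow the real package on import, which commonly produces confusing failures like this.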
