
data did not match any variant of untagged enum PyPreTokenizerTypeWrapper #3910

Closed
jiusi9 opened this issue Apr 28, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@jiusi9
jiusi9 commented Apr 28, 2024

Problem Description
The Docker image builds successfully, but the container fails to start.
Could you help me figure out where the problem is? I haven't been able to pinpoint it myself; failing that, could you tell me which module this error comes from?

Steps to Reproduce
==============================Langchain-Chatchat Configuration==============================
OS: Linux-5.15.0-76-generic-x86_64-with-glibc2.29
Python version: 3.8.10 (default, Nov 22 2023, 10:22:35)
[GCC 9.4.0]
Project version: v0.2.10
langchain version: 0.0.344, fastchat version: 0.2.36

Current text splitter: ChineseRecursiveTextSplitter
Currently running LLM models: ['CodeQwen1.5-7B-Chat', 'openai-api'] @ cuda
{'device': 'cuda',
'host': '0.0.0.0',
'infer_turbo': False,
'model_path': '/opt/models/CodeQwen1.5-7B-Chat',
'model_path_exists': True,
'port': 20002}
{'api_base_url': 'https://api.openai.com/v1',
'api_key': '',
'device': 'auto',
'host': '0.0.0.0',
'infer_turbo': False,
'model_name': 'gpt-3.5-turbo',
'online_api': True,
'openai_proxy': '',
'port': 20002}
Current Embeddings model: bge-large-en-v1.5 @ cuda
==============================Langchain-Chatchat Configuration==============================
2024-04-28 06:29:43,432 - startup.py[line:650] - INFO: Starting services:
2024-04-28 06:29:43,433 - startup.py[line:651] - INFO: To view llm_api logs, see /opt/Langchain-ChatChat/logs
2024-04-28 06:29:48 | ERROR | stderr | INFO: Started server process [475]
2024-04-28 06:29:48 | ERROR | stderr | INFO: Waiting for application startup.
2024-04-28 06:29:48 | ERROR | stderr | INFO: Application startup complete.
2024-04-28 06:29:48 | ERROR | stderr | INFO: Uvicorn running on http://0.0.0.0:20000 (Press CTRL+C to quit)
2024-04-28 06:29:48 | INFO | model_worker | Loading the model ['CodeQwen1.5-7B-Chat'] on worker 131939df ...
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:00<00:02, 1.13it/s]
Loading checkpoint shards:  50%|█████     | 2/4 [00:01<00:01, 1.10it/s]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:02<00:00, 1.07it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.21it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.16it/s]
2024-04-28 06:29:54 | ERROR | stderr |
2024-04-28 06:29:54 | ERROR | stderr | Process model_worker - CodeQwen1.5-7B-Chat:
2024-04-28 06:29:54 | ERROR | stderr | Traceback (most recent call last):
2024-04-28 06:29:54 | ERROR | stderr | File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
2024-04-28 06:29:54 | ERROR | stderr | self.run()
2024-04-28 06:29:54 | ERROR | stderr | File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
2024-04-28 06:29:54 | ERROR | stderr | self._target(*self._args, **self._kwargs)
2024-04-28 06:29:54 | ERROR | stderr | File "/opt/Langchain-ChatChat/startup.py", line 386, in run_model_worker
2024-04-28 06:29:54 | ERROR | stderr | app = create_model_worker_app(log_level=log_level, **kwargs)
2024-04-28 06:29:54 | ERROR | stderr | File "/opt/Langchain-ChatChat/startup.py", line 214, in create_model_worker_app
2024-04-28 06:29:54 | ERROR | stderr | worker = ModelWorker(
2024-04-28 06:29:54 | ERROR | stderr | File "/usr/local/lib/python3.8/dist-packages/fastchat/serve/model_worker.py", line 77, in __init__
2024-04-28 06:29:54 | ERROR | stderr | self.model, self.tokenizer = load_model(
2024-04-28 06:29:54 | ERROR | stderr | File "/usr/local/lib/python3.8/dist-packages/fastchat/model/model_adapter.py", line 353, in load_model
2024-04-28 06:29:54 | ERROR | stderr | model, tokenizer = adapter.load_model(model_path, kwargs)
2024-04-28 06:29:54 | ERROR | stderr | File "/usr/local/lib/python3.8/dist-packages/fastchat/model/model_adapter.py", line 1706, in load_model
2024-04-28 06:29:54 | ERROR | stderr | tokenizer = AutoTokenizer.from_pretrained(
2024-04-28 06:29:54 | ERROR | stderr | File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/tokenization_auto.py", line 862, in from_pretrained
2024-04-28 06:29:54 | ERROR | stderr | return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
2024-04-28 06:29:54 | ERROR | stderr | File "/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py", line 2089, in from_pretrained
2024-04-28 06:29:54 | ERROR | stderr | return cls._from_pretrained(
2024-04-28 06:29:54 | ERROR | stderr | File "/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py", line 2311, in _from_pretrained
2024-04-28 06:29:54 | ERROR | stderr | tokenizer = cls(*init_inputs, **init_kwargs)
2024-04-28 06:29:54 | ERROR | stderr | File "/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_fast.py", line 111, in __init__
2024-04-28 06:29:54 | ERROR | stderr | fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
2024-04-28 06:29:54 | ERROR | stderr | Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 12564 column 3

Environment Information

  • langchain-ChatGLM version/commit: v0.2.10 (same configuration dump as in "Steps to Reproduce" above)
  • Is Docker deployment used (yes/no): yes
  • Model used (ChatGLM2-6B / Qwen-7B, etc.): CodeQwen1.5-7B-Chat
  • Embedding model used (moka-ai/m3e-base, etc.): bge-large-en-v1.5
  • Vector store type (faiss / milvus / pg_vector, etc.): milvus
  • Operating system and version: Linux-5.15.0-76-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10 [GCC 9.4.0]
  • Other relevant environment information: Nvidia GPU A30

@jiusi9 jiusi9 added the bug Something isn't working label Apr 28, 2024

dosubot bot commented Apr 28, 2024

The error "data did not match any variant of untagged enum PyPreTokenizerTypeWrapper" you're encountering is likely due to a mismatch or misconfiguration in the tokenizer setup within your Docker-deployed project. This issue can arise from several factors, including Python version discrepancies, compatibility issues between Langchain, Fastchat, and the tokenizer, or incorrect tokenizer configuration.

  1. Check Python Version Compatibility: Your environment mentions Python 3.8.10, but ensure that this version is indeed compatible with the Langchain and Fastchat versions you're using. If your Dockerfile specifies a different Python version (e.g., Python 3.11), adjust it to match the required Python version for your project dependencies.

  2. Review Dependency Versions: Ensure that the versions of Langchain (0.0.344) and Fastchat (0.2.36) are compatible with each other and with the CodeQwen1.5-7B-Chat model. It's crucial to verify that these versions work well together and with the specific tokenizer you're using.

  3. Tokenizer Configuration: The error suggests an issue with the tokenizer's initialization. Double-check the tokenizer configuration, especially how the PyPreTokenizerTypeWrapper is expected to be set up for your project's versions of Langchain and Fastchat, and make sure the configuration matches the expected format; a quick isolation test is sketched right after this list.

  4. Update and Consult Documentation: If the issue persists, consider updating your dependencies to the latest compatible versions. Additionally, consult the documentation for Langchain, Fastchat, and the tokenizer for any specific setup or compatibility notes. Community forums or GitHub issues for these projects may also offer insights from users who have encountered similar problems.
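
To test point 3 in isolation, one can load the tokenizer directly with transformers, bypassing Langchain-Chatchat and FastChat entirely. A minimal sketch, using the model path from the configuration dump above:

```python
# If this raises the same "data did not match any variant of untagged enum
# PyPreTokenizerTypeWrapper" exception, the incompatibility sits between
# transformers/tokenizers and the model's tokenizer.json file, not in
# Langchain-Chatchat itself.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/opt/models/CodeQwen1.5-7B-Chat")
print(type(tokenizer).__name__)
```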

For the ChineseRecursiveTextSplitter, ensure it is configured with "huggingface" as its source and, if necessary, the correct tokenizer_name_or_path. This setup is required for the splitter's tokenizer to load correctly within your project.
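
A sketch of what that splitter configuration typically looks like; the file and field names (configs/kb_config.py, text_splitter_dict) are assumptions based on the v0.2.x template, so verify them against your checkout:

```python
# Assumed shape of the splitter configuration in configs/kb_config.py
# (Langchain-Chatchat v0.2.x). With an empty tokenizer_name_or_path the
# splitter does not pull in a Hugging Face tokenizer at all.
text_splitter_dict = {
    "ChineseRecursiveTextSplitter": {
        "source": "huggingface",       # where the splitter's tokenizer comes from
        "tokenizer_name_or_path": "",  # a HF repo id or local path, if needed
    },
}
```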

Rebuild your Docker image after making any changes to ensure they are applied. If you need further assistance or have more questions, feel free to ask.


@jwang-paradise

CodeQwen may be incompatible with newer versions of tokenizers. Which versions of transformers and tokenizers are you using?

See here Qwen/CodeQwen1.5-7B-Chat · Having trouble loading this with transformers
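
A sketch of how to check that from inside the running container:

```python
# Print the installed versions of the two libraries implicated in the error.
import tokenizers
import transformers

print("tokenizers  :", tokenizers.__version__)
print("transformers:", transformers.__version__)
```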

@jiusi9
Author

jiusi9 commented Apr 29, 2024

CodeQwen may be incompatible with newer versions of tokenizers. Which versions of transformers and tokenizers are you using?

See here Qwen/CodeQwen1.5-7B-Chat · Having trouble loading this with transformers

Same as in that issue; I'll try other versions.
tokenizers: 0.19.1
transformers: 4.40.1

Do you think this could be a compatibility problem between the LLM model and transformers/tokenizers?
I'm also not sure whether the embedding model bge-large-en-v1.5 is involved; in any case, the BAAI bge-large models all produce the same error.
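
One way to check whether the embedding model is actually involved is to load each tokenizer independently. A sketch, using the local CodeQwen path from the configuration above and the Hugging Face repo id for the embedding model (an assumption about how it is referenced in this deployment):

```python
# Load each model's tokenizer separately to see which one actually fails.
from transformers import AutoTokenizer

for name in ["/opt/models/CodeQwen1.5-7B-Chat", "BAAI/bge-large-en-v1.5"]:
    try:
        AutoTokenizer.from_pretrained(name)
        print(f"OK   {name}")
    except Exception as exc:
        print(f"FAIL {name}: {exc}")
```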

@jiusi9
Author

jiusi9 commented Apr 30, 2024

This issue was resolved.
I downgraded the transformers and tokenizers versions.

Now, the version is:
tokenizers: 0.15.2
transformers: 4.38.2
accelerate: 0.25.0
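
To make the fix reproducible, pin those versions when building the image (e.g. pip install "transformers==4.38.2" "tokenizers==0.15.2" "accelerate==0.25.0"). A sketch of a startup sanity check that the pins actually made it into the container:

```python
# Fail fast if the image was rebuilt with incompatible library versions.
# The pins are the versions reported as working in this thread.
from importlib.metadata import version  # available since Python 3.8

PINS = {"tokenizers": "0.15.2", "transformers": "4.38.2", "accelerate": "0.25.0"}
for pkg, wanted in PINS.items():
    installed = version(pkg)
    if installed != wanted:
        raise RuntimeError(f"{pkg}=={installed}, expected =={wanted}")
print("All pinned versions present.")
```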

@jiusi9 jiusi9 closed this as completed Apr 30, 2024