[BUG] Running the RapidOCRDocLoader example in mydocloader.py directly raises zipfile.BadZipFile: File is not a zip file #3890
Comments
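Hey, @zhoufurong2022! Looks like we meet again 😄. I hope everything has been going well for you. I am looking into your issue carefully and will get back to you with a full answer shortly.
I recently tested a docx document with no images or special characters, and it still could not be converted into a Document. Documents with images fail as well.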
To address the issue with
These steps should help you resolve the issue with converting DOCX documents when encountering a content type error. If the problem continues, consider providing more details about the DOCX files for further assistance.
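The problem is that the file I am using is the test file that ships with the project, and the exception is: zipfile.BadZipFile: File is not a zip file
After I added some content to the file with WPS, even converting it directly into a Document stopped working.
I have tried the first 4 points above, and running the upgrade for python-doc also reports that it is already at the latest version.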
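Problem Description
I did not change anything and only ran the main method, yet a step appeared in which the docx file was handled as a zip file.
Steps to Reproduce
Run the RapidOCRDocLoader in mydocloader.py directly.
Expected Result
The loader parses the file and returns results normally.
Actual Result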
RapidOCRDocLoader block index: 3: 100%|██████████| 4/4 [00:00<00:00, 4.76it/s]
Traceback (most recent call last):
File "/Users/xxx/code/knowledge-chat/document_loaders/mydocloader.py", line 70, in <module>
docs = loader.load()
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/langchain_community/document_loaders/unstructured.py", line 87, in load
elements = self._get_elements()
File "/Users/xxx/code/knowledge-chat/document_loaders/mydocloader.py", line 65, in _get_elements
return partition_text(text=text, **self.unstructured_kwargs)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/unstructured/partition/text.py", line 93, in partition_text
return _partition_text(
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/unstructured/documents/elements.py", line 518, in wrapper
elements = func(*args, **kwargs)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 591, in wrapper
elements = func(*args, **kwargs)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 546, in wrapper
elements = func(*args, **kwargs)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/unstructured/chunking/__init__.py", line 52, in wrapper
elements = func(*args, **kwargs)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/unstructured/partition/text.py", line 190, in _partition_text
element = element_from_text(ctext)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/unstructured/partition/text.py", line 235, in element_from_text
elif is_possible_narrative_text(text):
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/unstructured/partition/text_type.py", line 87, in is_possible_narrative_text
if "eng" in languages and (sentence_count(text, 3) < 2) and (not contains_verb(text)):
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/unstructured/partition/text_type.py", line 189, in contains_verb
pos_tags = pos_tag(text)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/unstructured/nlp/tokenize.py", line 44, in pos_tag
_download_nltk_package_if_not_present(
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/unstructured/nlp/tokenize.py", line 21, in _download_nltk_package_if_not_present
nltk.find(f"{package_category}/{package_name}")
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/nltk/data.py", line 555, in find
return find(modified_name, paths)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/nltk/data.py", line 542, in find
return ZipFilePathPointer(p, zipentry)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/nltk/compat.py", line 41, in _decorator
return init_func(*args, **kwargs)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/nltk/data.py", line 394, in __init__
zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/nltk/compat.py", line 41, in _decorator
return init_func(*args, **kwargs)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/nltk/data.py", line 935, in __init__
zipfile.ZipFile.__init__(self, filename)
File "/usr/local/Cellar/[email protected]/3.10.14/Frameworks/Python.framework/Versions/3.10/lib/python3.10/zipfile.py", line 1271, in init
self._RealGetContents()
File "/usr/local/Cellar/[email protected]/3.10.14/Frameworks/Python.framework/Versions/3.10/lib/python3.10/zipfile.py", line 1338, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
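Reading the traceback above, the exception is raised inside nltk/data.py while NLTK opens one of its own data archives (`nltk.find` → `ZipFilePathPointer` → `zipfile.ZipFile`), not while the docx is being read. That suggests, although nothing in this thread confirms it, that a downloaded NLTK data zip is incomplete or corrupt. Below is a minimal diagnostic sketch under that assumption. The helper name `find_bad_zips` is hypothetical and the function uses only the standard library; which directories to scan is left to the caller.

```python
import os
import zipfile
from typing import Iterable, List


def find_bad_zips(search_dirs: Iterable[str]) -> List[str]:
    """Return paths of *.zip files under search_dirs that fail to open or verify.

    Hypothetical helper for diagnosis only; it does not delete or repair anything.
    """
    bad: List[str] = []
    for root_dir in search_dirs:
        if not os.path.isdir(root_dir):
            continue  # skip data directories that do not exist
        for dirpath, _dirnames, filenames in os.walk(root_dir):
            for name in filenames:
                if not name.lower().endswith(".zip"):
                    continue
                path = os.path.join(dirpath, name)
                try:
                    with zipfile.ZipFile(path) as zf:
                        # testzip() returns the first member with a bad CRC,
                        # or None when every member verifies.
                        if zf.testzip() is not None:
                            bad.append(path)
                except zipfile.BadZipFile:
                    # Same exception type as in the traceback above.
                    bad.append(path)
    return bad
```

If NLTK is importable in the same virtualenv, calling `find_bad_zips(nltk.data.path)` scans the directories NLTK itself searches. Any path it reports would be a candidate for deleting and downloading again; whether that resolves this issue is an assumption, not something verified here.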
Environment Information
langchain-ChatGLM version/commit: v0.2.10
Deployed with Docker (yes/no): no
Model used (ChatGLM2-6B / Qwen-7B, etc.): none
Embedding model used (moka-ai/m3e-base, etc.): none
Vector store type (faiss / milvus / pg_vector, etc.): none
Operating system and version: ProductName: macOS
ProductVersion: 14.4.1
BuildVersion: 23E224
Python version: 3.10
Other relevant environment information:
unstructured 0.11.8
transformers 4.34.0
python-dateutil 2.8.2
python-decouple 3.8
python-docx 1.1.0
python-dotenv 1.0.0
python-iso639 2023.12.11
python-magic 0.4.27
python-multipart 0.0.6
python-pptx 0.6.23
langchain 0.0.354
langchain-community 0.0.8
langchain-core 0.1.6
Additional Information
Add any other information related to the issue.