[BUG] Running the RapidOCRDocLoader example in mydocloader.py directly raises zipfile.BadZipFile: File is not a zip file #3890

Closed
zhoufurong2022 opened this issue Apr 26, 2024 · 6 comments

@zhoufurong2022

Problem Description
I made no changes at all and only ran the main method, yet a step appeared that treats the docx file as a zip file.

Steps to Reproduce
Run the RapidOCRDocLoader example in mydocloader.py directly.

Expected Result
The loader parses the document and returns the result normally.

Actual Result
RapidOCRDocLoader block index: 3: 100%|██████████| 4/4 [00:00<00:00, 4.76it/s]
Traceback (most recent call last):
File "/Users/xxx/code/knowledge-chat/document_loaders/mydocloader.py", line 70, in <module>
docs = loader.load()
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/langchain_community/document_loaders/unstructured.py", line 87, in load
elements = self._get_elements()
File "/Users/xxx/code/knowledge-chat/document_loaders/mydocloader.py", line 65, in _get_elements
return partition_text(text=text, **self.unstructured_kwargs)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/unstructured/partition/text.py", line 93, in partition_text
return _partition_text(
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/unstructured/documents/elements.py", line 518, in wrapper
elements = func(*args, **kwargs)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 591, in wrapper
elements = func(*args, **kwargs)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 546, in wrapper
elements = func(*args, **kwargs)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/unstructured/chunking/__init__.py", line 52, in wrapper
elements = func(*args, **kwargs)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/unstructured/partition/text.py", line 190, in _partition_text
element = element_from_text(ctext)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/unstructured/partition/text.py", line 235, in element_from_text
elif is_possible_narrative_text(text):
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/unstructured/partition/text_type.py", line 87, in is_possible_narrative_text
if "eng" in languages and (sentence_count(text, 3) < 2) and (not contains_verb(text)):
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/unstructured/partition/text_type.py", line 189, in contains_verb
pos_tags = pos_tag(text)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/unstructured/nlp/tokenize.py", line 44, in pos_tag
_download_nltk_package_if_not_present(
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/unstructured/nlp/tokenize.py", line 21, in _download_nltk_package_if_not_present
nltk.find(f"{package_category}/{package_name}")
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/nltk/data.py", line 555, in find
return find(modified_name, paths)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/nltk/data.py", line 542, in find
return ZipFilePathPointer(p, zipentry)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/nltk/compat.py", line 41, in _decorator
return init_func(*args, **kwargs)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/nltk/data.py", line 394, in __init__
zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/nltk/compat.py", line 41, in _decorator
return init_func(*args, **kwargs)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/nltk/data.py", line 935, in __init__
zipfile.ZipFile.__init__(self, filename)
File "/usr/local/Cellar/python@3.10/3.10.14/Frameworks/Python.framework/Versions/3.10/lib/python3.10/zipfile.py", line 1271, in __init__
self._RealGetContents()
File "/usr/local/Cellar/python@3.10/3.10.14/Frameworks/Python.framework/Versions/3.10/lib/python3.10/zipfile.py", line 1338, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
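
Note on the traceback above: the failing frames are in nltk/data.py (nltk.find, then ZipFilePathPointer), not in python-docx, so the file being opened as a zip appears to be an archive in an NLTK data directory rather than the .docx itself. A minimal sketch for checking that, assuming a corrupted or partially downloaded archive is the cause (the thread does not confirm this). find_bad_zips is a hypothetical helper name, and only the standard library is used:

```python
import os
import zipfile


def find_bad_zips(search_dirs):
    """Return sorted paths of *.zip files under search_dirs that are not valid zip archives."""
    bad = []
    for root_dir in search_dirs:
        if not os.path.isdir(root_dir):
            continue  # skip search locations that do not exist on this machine
        for dirpath, _dirnames, filenames in os.walk(root_dir):
            for name in filenames:
                if name.lower().endswith(".zip"):
                    path = os.path.join(dirpath, name)
                    # is_zipfile checks for the zip end-of-central-directory record
                    if not zipfile.is_zipfile(path):
                        bad.append(path)
    return sorted(bad)
```

Calling it as find_bad_zips(nltk.data.path) would scan the directories NLTK searches. Any archive it reports could then be deleted and downloaded again.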

Environment Information

  • langchain-ChatGLM version/commit: v0.2.10

  • Deployed with Docker (yes/no): no

  • Model used (ChatGLM2-6B / Qwen-7B, etc.): none

  • Embedding model used (moka-ai/m3e-base, etc.): none

  • Vector store type (faiss / milvus / pg_vector, etc.): none

  • Operating system and version: ProductName: macOS
    ProductVersion: 14.4.1
    BuildVersion: 23E224

  • Python version: 3.10

  • Other relevant environment information:
    unstructured 0.11.8
    transformers 4.34.0
    python-dateutil 2.8.2
    python-decouple 3.8
    python-docx 1.1.0
    python-dotenv 1.0.0
    python-iso639 2023.12.11
    python-magic 0.4.27
    python-multipart 0.0.6
    python-pptx 0.6.23
    langchain 0.0.354
    langchain-community 0.0.8
    langchain-core 0.1.6

Additional Information
Add any other information related to the issue.

@zhoufurong2022 zhoufurong2022 added the bug Something isn't working label Apr 26, 2024

dosubot bot commented Apr 26, 2024

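Hey, @zhoufurong2022! Looks like we meet here again 😄. I hope all is well with you lately! I'm taking a close look at your issue and will have a full answer for you shortly.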

@zhoufurong2022
Author

zhoufurong2022 commented Apr 26, 2024

I just tested a docx document with no images or special characters, and it still cannot be converted into a Document:
Traceback (most recent call last):
File "/Users/xxx/code/knowledge-chat/document_loaders/mydocloader.py", line 70, in <module>
docs = loader.load()
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/langchain_community/document_loaders/unstructured.py", line 87, in load
elements = self._get_elements()
File "/Users/xxx/code/knowledge-chat/document_loaders/mydocloader.py", line 63, in _get_elements
text = doc2text(self.file_path)
File "/Users/xxx/code/knowledge-chat/document_loaders/mydocloader.py", line 19, in doc2text
doc = Document(filepath)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/docx/api.py", line 26, in Document
raise ValueError(tmpl % (docx, document_part.content_type))
ValueError: file '../tests/samples/mydocx2.docx' is not a Word file, content type is 'application/vnd.ms-word.document.macroEnabled.main+xml'

Documents that contain images fare even worse.
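
The content type in the ValueError, application/vnd.ms-word.document.macroEnabled.main+xml, is the one used for macro-enabled Word documents (normally saved as .docm), which python-docx rejects here even though the file carries a .docx extension. A rough sketch for reading the declared content type without python-docx, assuming the main part sits at the conventional /word/document.xml path. main_document_content_type is a hypothetical helper:

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of [Content_Types].xml in Open Packaging Conventions packages
CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"
# Content type of the main part in a plain (non-macro) .docx
DOCX_MAIN = (
    "application/vnd.openxmlformats-officedocument"
    ".wordprocessingml.document.main+xml"
)


def main_document_content_type(path):
    """Return the content type declared for /word/document.xml, or None if absent."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("[Content_Types].xml"))
    for override in root.iter(CT_NS + "Override"):
        # Assumption: the main document part uses the conventional part name
        if override.get("PartName", "").lower() == "/word/document.xml":
            return override.get("ContentType")
    return None
```

If the returned value differs from DOCX_MAIN, re-saving the file as a plain .docx in a word processor is one option to try.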


dosubot bot commented Apr 26, 2024

To address the issue with mydocloader.py in the RapidOCRDocLoader example where DOCX files are not recognized correctly, follow these steps:

  1. Check if the DOCX File is Corrupted: Ensure the file can be opened with a word processor like Microsoft Word or LibreOffice Writer to rule out corruption.

  2. Update python-docx Library: Make sure you have the latest version of python-docx installed. Update it using pip with pip install --upgrade python-docx.

  3. Validate DOCX File Before Conversion: Implement a check to ensure the file is a valid DOCX before attempting conversion. For example, you can check the file extension:

    import os
    
    def is_valid_docx(filepath):
        return os.path.splitext(filepath)[1].lower() == '.docx'
  4. Catch Exceptions for Unsupported Files: Modify the doc2text function to handle exceptions gracefully, which will help in cases where files are not recognized as Word documents:

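    from docx import Document

    def doc2text(filepath):
        try:
            doc = Document(filepath)
        except ValueError as e:
            # Raised by python-docx when the main part is not a plain .docx document
            print(f"Error loading document: {e}")
            return ""
        # The existing text-extraction logic continues here, using doc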
  5. Investigate Compatibility Issues: If problems persist, the issue might be with the DOCX files themselves, especially if they were generated by non-standard tools. Try opening and re-saving the document with a standard word processor to normalize its structure.

These steps should help you resolve the issue with converting DOCX documents when encountering a content type error. If the problem continues, consider providing more details about the DOCX files for further assistance.
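
The extension check in step 3 passes for any file that is merely named .docx. A slightly stronger check, sketched here with only the standard library, is to confirm the file is a zip archive containing the parts a .docx normally has. looks_like_docx is a hypothetical name, and the word/document.xml location is a convention assumed here, not a guarantee:

```python
import zipfile

# Members a typical .docx package contains (assumed conventional layout)
REQUIRED_MEMBERS = ("[Content_Types].xml", "word/document.xml")


def looks_like_docx(filepath):
    """Return True if filepath is a zip archive containing the usual .docx parts."""
    if not zipfile.is_zipfile(filepath):
        return False  # missing file, plain text, legacy .doc, truncated download, ...
    with zipfile.ZipFile(filepath) as zf:
        names = set(zf.namelist())
    return all(member in names for member in REQUIRED_MEMBERS)
```

This only checks the container structure. A file can pass it and still be rejected by python-docx, for example because of its declared content type.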


@zhoufurong2022
Author

The thing is, the file I'm using is the test file bundled with the project, and the exception is this one: zipfile.BadZipFile: File is not a zip file

@zhoufurong2022
Author

After I added some content to the file with WPS, even converting it directly into a Document no longer works:
is_valid_docx=True
2024-04-26 11:56:02,275 33480 ERROR "There is no item named 'NULL' in the archive"
Traceback (most recent call last):
File "/Users/xxx/code/knowledge-chat/document_loaders/mydocloader.py", line 27, in doc2text
doc = Document(filepath)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/docx/api.py", line 23, in Document
document_part = Package.open(docx).main_document_part
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/docx/opc/package.py", line 116, in open
pkg_reader = PackageReader.from_file(pkg_file)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/docx/opc/pkgreader.py", line 25, in from_file
sparts = PackageReader._load_serialized_parts(
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/docx/opc/pkgreader.py", line 53, in _load_serialized_parts
for partname, blob, reltype, srels in part_walker:
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/docx/opc/pkgreader.py", line 86, in _walk_phys_parts
for partname, blob, reltype, srels in next_walker:
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/docx/opc/pkgreader.py", line 81, in _walk_phys_parts
blob = phys_reader.blob_for(partname)
File "/Users/xxx/code/knowledge-chat/venv3.10/lib/python3.10/site-packages/docx/opc/phys_pkg.py", line 83, in blob_for
return self._zipf.read(pack_uri.membername)
File "/usr/local/Cellar/python@3.10/3.10.14/Frameworks/Python.framework/Versions/3.10/lib/python3.10/zipfile.py", line 1485, in read
with self.open(name, "r", pwd) as fp:
File "/usr/local/Cellar/python@3.10/3.10.14/Frameworks/Python.framework/Versions/3.10/lib/python3.10/zipfile.py", line 1524, in open
zinfo = self.getinfo(name)
File "/usr/local/Cellar/python@3.10/3.10.14/Frameworks/Python.framework/Versions/3.10/lib/python3.10/zipfile.py", line 1451, in getinfo
raise KeyError(
KeyError: "There is no item named 'NULL' in the archive"
KeyError loading document: "There is no item named 'NULL' in the archive"
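
The KeyError above is raised while python-docx walks the package relationships and asks the archive for a member named 'NULL', which suggests a relationship whose target is not present in the zip. A sketch for listing such relationships with the standard library, assuming that is what the edited file contains (not confirmed in the thread). dangling_relationships is a hypothetical helper:

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the *.rels relationship files inside the package
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def dangling_relationships(path):
    """List (rels_file, target) pairs whose internal target is missing from the archive."""
    missing = []
    with zipfile.ZipFile(path) as zf:
        names = set(zf.namelist())
        for rels_name in sorted(n for n in names if n.endswith(".rels")):
            # "word/_rels/document.xml.rels" resolves targets relative to "word/"
            base_dir = posixpath.dirname(posixpath.dirname(rels_name))
            root = ET.fromstring(zf.read(rels_name))
            for rel in root.iter(REL_NS + "Relationship"):
                if rel.get("TargetMode") == "External":
                    continue  # external targets (e.g. hyperlinks) are not archive members
                target = rel.get("Target", "")
                if target.startswith("/"):
                    member = target.lstrip("/")
                else:
                    member = posixpath.normpath(posixpath.join(base_dir, target))
                if member not in names:
                    missing.append((rels_name, target))
    return missing
```

An empty list means every internal relationship resolves to a member of the archive. A non-empty list points at the relationship entries worth inspecting.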

@zhoufurong2022
Author

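I have tried the first four points above. Running the python-docx upgrade also reports that I already have the latest version.
As for the compatibility issue, do you mean that opening the file in WPS and adding some text introduced other problems?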

@zRzRzRzRzRzRzR zRzRzRzRzRzRzR self-assigned this May 5, 2024
@zRzRzRzRzRzRzR zRzRzRzRzRzRzR closed this as not planned (won't fix, can't repro, duplicate, stale) May 5, 2024