fix TypeError: argument of type 'PSLiteral' is not iterable #883

Open: wants to merge 4 commits into base: master

@nathtest nathtest commented May 3, 2023

Pull request

Fix TypeError: argument of type 'PSLiteral' is not iterable, raised for PDFs where "W" and "H" are null in obj.

Traceback of the error:

Traceback (most recent call last):
  File "/opt/editik_engine/src/editik_engine/commands_class/engine/engine.py", line 179, in generate_custom_documents
    SplitVpc(self.computed_out_path + file.fic_nom, self.computed_out_path).run()
  File "/opt/editik_engine/src/editik_engine/commands_class/document_generator/split_vpc.py", line 93, in run
    extracted_text = page.extract_text()
  File "/opt/editik_engine/src/venv3/lib/python3.9/site-packages/pdfplumber/page.py", line 260, in extract_text
    return utils.extract_text(self.chars, **kwargs)
  File "/opt/editik_engine/src/venv3/lib/python3.9/site-packages/pdfplumber/container.py", line 48, in chars
    return self.objects.get("char", [])
  File "/opt/editik_engine/src/venv3/lib/python3.9/site-packages/pdfplumber/page.py", line 161, in objects
    self._objects = self.parse_objects()
  File "/opt/editik_engine/src/venv3/lib/python3.9/site-packages/pdfplumber/page.py", line 222, in parse_objects
    for obj in self.iter_layout_objects(self.layout._objs):
  File "/opt/editik_engine/src/venv3/lib/python3.9/site-packages/pdfplumber/page.py", line 110, in layout
    interpreter.process_page(self.page_obj)
  File "/opt/editik_engine/src/venv3/lib/python3.9/site-packages/pdfminer/pdfinterp.py", line 895, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/opt/editik_engine/src/venv3/lib/python3.9/site-packages/pdfminer/pdfinterp.py", line 908, in render_contents
    self.execute(list_value(streams))
  File "/opt/editik_engine/src/venv3/lib/python3.9/site-packages/pdfminer/pdfinterp.py", line 933, in execute
    func(*args)
  File "/opt/editik_engine/src/venv3/lib/python3.9/site-packages/pdfminer/pdfinterp.py", line 840, in do_EI
    if 'W' in obj and 'H' in obj:
TypeError: argument of type 'PSLiteral' is not iterable
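
The last frame fails because Python's `in` operator requires the right-hand operand to support membership tests (via `__contains__`, `__iter__`, or `__getitem__`). The sketch below reproduces the same kind of failure without pdfminer installed. `FakeLiteral` and `has_dimensions` are hypothetical stand-ins for illustration only; they are not pdfminer's `PSLiteral` or its `do_EI` method.

```python
class FakeLiteral:
    """Hypothetical stand-in for a PostScript literal (not pdfminer's class)."""

    def __init__(self, name):
        self.name = name


def has_dimensions(obj):
    # Same shape as the unguarded check in the traceback:
    # it assumes obj supports the `in` operator.
    return 'W' in obj and 'H' in obj


# A dict-like object works as expected.
dict_result = has_dimensions({'W': 10, 'H': 5})

# An object with no membership support raises TypeError.
try:
    has_dimensions(FakeLiteral("ID"))
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

The exact wording of the exception message depends on the Python version, so the sketch only captures it rather than comparing it to a fixed string.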

How Has This Been Tested?

A few PDFs at work could not be read because W and H were present but null.
This change is a safer way to check whether those keys are in obj.
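
One possible shape for such a check is sketched below. It is not the diff from this pull request, which is not shown here. It assumes the object is either a plain dict-like mapping or something else entirely; real pdfminer stream objects are not `Mapping` subclasses and would need their own type check. The helper name `has_inline_image_size` is hypothetical.

```python
from collections.abc import Mapping


def has_inline_image_size(obj):
    """Return True only if obj is dict-like and has non-null 'W' and 'H'.

    Assumption: obj is a plain mapping when it carries image attributes.
    Anything else (for example a literal or keyword object) is rejected
    instead of being passed to the `in` operator.
    """
    if not isinstance(obj, Mapping):
        return False
    # Treat a key that is present but None the same as a missing key.
    return obj.get('W') is not None and obj.get('H') is not None
```

With this guard, a non-mapping object or a null width or height results in `False` rather than an exception, so the caller can skip the inline image and continue with the rest of the page.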

This fix solved the issue for those PDFs and did not affect other PDFs.

We handle more than 10k PDFs per day, so I am confident this is well tested.

Checklist

  • I have read CONTRIBUTING.md.
  • [] I have added a concise human-readable description of the change to CHANGELOG.md.
  • I have tested that this fix is effective or that this feature works.
  • [] I have added docstrings to newly created methods and classes.
  • I have updated the README.md and the readthedocs documentation. Or verified that this is not necessary.

@pietermarsman
Member

@nathtest Thanks for your time and contribution!

I'm happy to merge this if I can test it on a PDF that shows the issue. Can you share the PDF and the code that you are using?
