[Bug]: Crash on multiple .pdf files #1312

Closed
olafure opened this issue May 15, 2024 · 5 comments

olafure commented May 15, 2024

Describe the bug

OCRmyPDF crashes with an AssertionError on multiple .pdf files. I am running the latest master version.

Steps to reproduce

pip install git+https://github.com/ocrmypdf/OCRmyPDF.git
(.venv) user@host:~/ocr$ ocrmypdf  --version 
16.2.1.dev5+g5caf654

wget https://archive.org/download/PopularMechanics1945/Popular_Mechanics_09_1945.pdf

ocrmypdf -j 1 --pages 45 Popular_Mechanics_09_1945.pdf /tmp/out.pdf 
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 149/149 0:00:00
   45  lots of diacritics - possibly poor OCR                                                                                                                                                                                                 tesseract.py:241
OCR                   ━━━━━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━  30%  44/149 0:00:01
An exception occurred while executing the pipeline                                                                                                                                                                                              _common.py:284
Traceback (most recent call last):                                                                                                                                                                                                                            
  File "/home/user/ocr-all/.venv/lib/python3.10/site-packages/ocrmypdf/_pipelines/_common.py", line 249, in cli_exception_handler                                                                                                                          
    return fn(options, plugin_manager)                                                                                                                                                                                                                        
  File "/home/user/ocr-all/.venv/lib/python3.10/site-packages/ocrmypdf/_pipelines/ocr.py", line 191, in _run_pipeline                                                                                                                                      
    optimize_messages = exec_concurrent(context, executor)                                                                                                                                                                                                    
  File "/home/user/ocr-all/.venv/lib/python3.10/site-packages/ocrmypdf/_pipelines/ocr.py", line 117, in exec_concurrent                                                                                                                                    
    executor(                                                                                                                                                                                                                                                 
  File "/home/user/ocr-all/.venv/lib/python3.10/site-packages/ocrmypdf/_concurrent.py", line 78, in __call__                                                                                                                                               
    self._execute(                                                                                                                                                                                                                                            
  File "/home/user/ocr-all/.venv/lib/python3.10/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 144, in _execute                                                                                                                              
    result = future.result()                                                                                                                                                                                                                                  
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result                                                                                                                                                                                 
    return self.__get_result()                                                                                                                                                                                                                                
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result                                                                                                                                                                           
    raise self._exception                                                                                                                                                                                                                                     
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run                                                                                                                                                                                    
    result = self.fn(*self.args, **self.kwargs)                                                                                                                                                                                                               
  File "/home/user/ocr-all/.venv/lib/python3.10/site-packages/ocrmypdf/_pipelines/ocr.py", line 81, in _exec_page_sync                                                                                                                                     
    ocr_out, text_out = _image_to_ocr_text(page_context, ocr_image_out)                                                                                                                                                                                       
  File "/home/user/ocr-all/.venv/lib/python3.10/site-packages/ocrmypdf/_pipelines/ocr.py", line 63, in _image_to_ocr_text                                                                                                                                  
    ocr_out = render_hocr_page(hocr_out, page_context)                                                                                                                                                                                                        
  File "/home/user/ocr-all/.venv/lib/python3.10/site-packages/ocrmypdf/_pipeline.py", line 780, in render_hocr_page                                                                                                                                        
    HocrTransform(                                                                                                                                                                                                                                            
  File "/home/user/ocr-all/.venv/lib/python3.10/site-packages/ocrmypdf/hocrtransform/_hocr.py", line 211, in to_pdf                                                                                                                                        
    self._do_line(                                                                                                                                                                                                                                            
  File "/home/user/ocr-all/.venv/lib/python3.10/site-packages/ocrmypdf/hocrtransform/_hocr.py", line 289, in _do_line                                                                                                                                      
    assert line_box.ury > line_box.lly  # lly is top, ury is bottom                                                                                                                                                                                           
AssertionError                                                                                                                     

Files

https://archive.org/download/PopularMechanics1945/Popular_Mechanics_09_1945.pdf

How did you download and install the software?

PyPI (pip, poetry, pipx, etc.), source build

OCRmyPDF version

16.2.1.dev5+g5caf654

Relevant log output

$ uname -a 
Linux host 6.5.0-28-generic #29~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr  4 14:39:20 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

$ ocrmypdf  --version 
16.2.1.dev5+g5caf654

$ python --version 
Python 3.10.12

olafure commented May 15, 2024

Attached a verbose (-V 2) logfile:
log.txt

olafure changed the title from "[Bug]: Crash on a multiple .pdf files" to "[Bug]: Crash on multiple .pdf files" on May 15, 2024

jbarlow83 (Collaborator) commented

Can't reproduce here. Possibly, this is a tesseract bug.

What is the output of tesseract --version on the machine that produced the issue?

olafure commented May 17, 2024

$ tesseract --version 
tesseract 4.1.1
 leptonica-1.82.0
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
 Found SSE
 Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8

jbarlow83 (Collaborator) commented

Can you try upgrading to Tesseract 5.x?
For Ubuntu, here is the PPA:
https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr5
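
Since the crash in this thread tracked the Tesseract major version (4.1.1 crashed, 5.3.4 did not), a script that drives OCRmyPDF could check the version up front. The following is a minimal sketch and not part of OCRmyPDF: the function names `parse_tesseract_major`, `looks_affected` and `installed_tesseract_output` are hypothetical, and the "older than 5" threshold is an assumption drawn only from this report.

```python
import re
import subprocess


def parse_tesseract_major(version_output: str) -> int:
    """Extract the major version from the text printed by `tesseract --version`."""
    # The first line looks like "tesseract 4.1.1"; some builds print "tesseract v5.0.0".
    match = re.search(r"^tesseract\s+v?(\d+)\.", version_output, re.MULTILINE)
    if match is None:
        raise ValueError("could not find a Tesseract version line")
    return int(match.group(1))


def looks_affected(version_output: str) -> bool:
    """Assumption from this thread only: 4.x crashed, 5.x did not."""
    return parse_tesseract_major(version_output) < 5


def installed_tesseract_output() -> str:
    """Run the local tesseract binary and return whatever it printed."""
    proc = subprocess.run(
        ["tesseract", "--version"], capture_output=True, text=True, check=True
    )
    # Combine both streams, since older releases wrote the version to stderr.
    return proc.stdout + proc.stderr
```

Calling `looks_affected(installed_tesseract_output())` before a batch run would flag a machine that still has a 4.x build installed.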

olafure commented May 17, 2024

Yep, that solves it, thanks!

tesseract 5.3.4
 leptonica-1.82.0
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
 Found libcurl/7.81.0 OpenSSL/3.0.2 zlib/1.2.11 brotli/1.0.9 zstd/1.4.8 libidn2/2.3.2 libpsl/0.21.0 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.43.0 librtmp/2.3 OpenLDAP/2.5.17

jbarlow83 added a commit that referenced this issue May 19, 2024
Addresses [Bug]: Crash on multiple .pdf files #1312

Not actually a fix, but at least it will get us better diagnostics. Appears old Tesseract 4.x generates bad line boxes at times.
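
The commit itself is not shown in this thread. As an illustration of what "better diagnostics" could mean for the bare `assert line_box.ury > line_box.lly` in the traceback, here is a hypothetical sketch that reports the offending coordinates instead of raising an empty `AssertionError`. `Box`, `BadLineBoxError` and `check_line_box` are names invented for this example, not OCRmyPDF's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a line bounding box (not OCRmyPDF's class)."""
    llx: float
    lly: float
    urx: float
    ury: float


class BadLineBoxError(ValueError):
    """Raised when an OCR line box has zero or negative height."""


def check_line_box(box: Box, page: int) -> None:
    """Same condition as the assert in the traceback, with a descriptive message."""
    if not box.ury > box.lly:
        raise BadLineBoxError(
            f"page {page}: bad line box {box!r}: "
            f"ury ({box.ury}) must be greater than lly ({box.lly})"
        )
```

With a check like this, a failing run would name the page and the coordinates, which makes it easier to tell whether the OCR engine emitted a degenerate box.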