[Bug]: The file size increases significantly by OCR even without image recompression #1278
Comments
It seems to me that it is related to With the option So the OCR layer takes 59 KB for So the questions are:
I wrote a script that redoes the OCR on PDFs: it deletes any original text from the file, uses ocrmypdf to generate new OCR, and adds that to the original file. I use it mainly to replace the often poor OCR in JSTOR files. It uses Ghostscript to remove the text and relies on a few other tools that you can see in the code.
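A rough sketch of the two-step idea described in this comment (strip the existing text with Ghostscript, then run ocrmypdf again). This is not the commenter's script: the function names and the intermediate file name are made up for illustration, and it OCRs the stripped copy directly instead of merging the new text layer back into the original file. It assumes `gs` and `ocrmypdf` are on the PATH and that the Ghostscript build supports `-dFILTERTEXT`. Note that Ghostscript's `pdfwrite` device may re-encode images, which can itself change the file size.

```python
import subprocess
from pathlib import Path


def strip_text_cmd(src, dst):
    """Build a Ghostscript command that rewrites src to dst without text."""
    # -dFILTERTEXT asks Ghostscript to drop text while rewriting the PDF.
    return ["gs", "-o", str(dst), "-sDEVICE=pdfwrite", "-dFILTERTEXT", str(src)]


def ocr_cmd(src, dst):
    """Build an ocrmypdf command that keeps plain PDF output (not PDF/A)."""
    return ["ocrmypdf", "--output-type", "pdf", str(src), str(dst)]


def redo_ocr(src, dst, run=subprocess.run):
    """Strip the old text layer from src, then OCR the result into dst.

    `run` is injectable so the command sequence can be inspected
    without actually invoking the external tools.
    """
    src, dst = Path(src), Path(dst)
    # Hypothetical name for the intermediate, text-free copy.
    stripped = dst.with_name(dst.stem + ".notext.pdf")
    run(strip_text_cmd(src, stripped), check=True)
    run(ocr_cmd(stripped, dst), check=True)
    return dst
```

Calling `redo_ocr("in.pdf", "out.pdf")` would leave an intermediate `out.notext.pdf` next to the output; comparing the sizes of all three files shows which step adds the bytes.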
Describe the bug
I'm trying to OCR scanned books while preserving the small size of the input file. To do so I use the --output-type pdf option. However, the size increases by about 40% even without image recompression. Moreover, the size increases even further after a second pass despite the --redo-ocr flag.
My current version is 16.1.1, installed on Arch Linux from the AUR.
In a previous version (16.0.4 or so) I did not notice such an increase in the file size.
I observe such a problem for various files with high enough compression. Below, a part of such a book is attached as an example.
Steps to reproduce
For the given small part of the book the file sizes are:
251 KB → 349 KB → 447 KB
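To put the reported sizes in relative terms, here is a small calculation using only the three numbers above (the helper name `growth_percent` is just for illustration). It shows that each pass adds the same absolute amount, which matches the roughly 40% figure for the first pass.

```python
def growth_percent(before_kb, after_kb):
    """Relative size increase from before_kb to after_kb, in percent."""
    return (after_kb - before_kb) / before_kb * 100


# Sizes reported in this issue: input, first OCR pass, second OCR pass.
sizes_kb = [251, 349, 447]

for before, after in zip(sizes_kb, sizes_kb[1:]):
    added = after - before
    print(f"{before} KB -> {after} KB: +{added} KB ({growth_percent(before, after):.0f}%)")

# Prints:
# 251 KB -> 349 KB: +98 KB (39%)
# 349 KB -> 447 KB: +98 KB (28%)
```

Both passes add 98 KB, so the second pass appears to add about as much data as the first did.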
Files
Here is the part of one book.
Watson1.pdf
Watson2.pdf
Watson3.pdf
How did you download and install the software?
No response
OCRmyPDF version
16.1.1
Relevant log output