feat/ocr_layer_to_pdf #2991

punjabdhaputar · 2024-05-08T22:37:37Z

Is your feature request related to a problem? Please describe.
When I OCR a PDF, I would like to be able to open the PDF and see the OCRed text as a hidden layer.

Describe the solution you'd like
I would like to have an option to output a new PDF file after the "partition" method that will be the original + a hidden text layer of the OCR text.

Additional context
Slack Thread: https://unstructuredw-kbe4326.slack.com/archives/C044N0YV08G/p1715109355171469

MthwRobinson · 2024-05-10T12:21:10Z

Hi @punjabdhaputar - could you describe the use case you have in mind for this feature? And do I understand correctly that your proposed solution would output a new PDF rather than a list of Element objects?

punjabdhaputar · 2024-05-10T17:01:02Z

Hello @MthwRobinson!

Actually I am thinking about another optional argument to the "partition" function like the following:

from unstructured.partition.auto import partition
elements = partition("my_pdf.pdf", path_for_ocr_pdf="ocr_pdf.pdf")

Here the partition function would write out a new PDF, with the OCR text as a hidden layer, to "ocr_pdf.pdf".

My use case is to view the PDF with the text layer and highlight specific text in it (e.g. a short phrase, or a subset of the chunks generated earlier).

MthwRobinson · 2024-05-10T20:16:46Z

Thanks @punjabdhaputar! Definitely see the use case there. Writing to PDF is outside the scope of what we'd like to do within the partition functions themselves. If you wanted to contribute an elements_to_pdf similar to elements_to_json, though, we'd be happy to consider that, as long as it doesn't introduce new dependencies.

punjabdhaputar added the enhancement New feature or request label May 8, 2024

scanny added the ocr Related to optical character recognition (OCR). label May 9, 2024
