feat/ocr_layer_to_pdf #2991

Open
punjabdhaputar opened this issue May 8, 2024 · 3 comments
Labels
enhancement New feature or request ocr Related to optical character recognition (OCR).

Comments

@punjabdhaputar

Is your feature request related to a problem? Please describe.
When I OCR a PDF, I would like to be able to open the PDF and see the OCRed text as a hidden layer.

Describe the solution you'd like
I would like an option, applied after the "partition" call, to output a new PDF file consisting of the original plus a hidden text layer containing the OCR text.

Additional context
Slack Thread: https://unstructuredw-kbe4326.slack.com/archives/C044N0YV08G/p1715109355171469

@punjabdhaputar punjabdhaputar added the enhancement New feature or request label May 8, 2024
@scanny scanny added the ocr Related to optical character recognition (OCR). label May 9, 2024
@MthwRobinson
Contributor

Hi @punjabdhaputar - could you describe the use case you have in mind for this feature? And do I understand correctly that your proposed solution would output a new PDF rather than a list of Element objects?

@punjabdhaputar
Author

punjabdhaputar commented May 10, 2024

Hello @MthwRobinson!

Actually, I am thinking of an additional optional argument to the "partition" function, like the following:

from unstructured.partition.auto import partition
elements = partition("my_pdf.pdf", path_for_ocr_pdf="ocr_pdf.pdf")

Here, the partition function would write a new PDF with the hidden OCR text layer to "ocr_pdf.pdf".

My use case is to view the PDF with the text layer and highlight specific text (e.g. a small phrase, or a subset of the previously generated chunks).

@MthwRobinson
Contributor

Thanks @punjabdhaputar! I definitely see the use case there. Writing to PDF is outside the scope of what we'd like to do within the partition functions themselves. That said, if you wanted to contribute an elements_to_pdf function similar to elements_to_json, we'd be happy to consider it, as long as it doesn't introduce new dependencies.
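
For anyone picking this up, here is a rough, standard-library-only sketch of what the core of such a helper might look like. Everything in it is an assumption for illustration: `TextBox` and `text_layer_pdf` are hypothetical names, not part of unstructured, and `TextBox` is a simplified stand-in for an element's text and coordinates. It writes a single-page PDF containing only invisible text (PDF text rendering mode 3), and it does not merge that layer onto the original scan; doing that would most likely need a PDF library, which is exactly the dependency question raised above.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """Hypothetical stand-in for an OCR'd element (not unstructured's Element)."""
    text: str
    x: float              # distance from the left page edge, in points
    y_top: float          # distance from the top page edge, in points
    font_size: float = 10.0


def _escape(text: str) -> str:
    # Backslashes and parentheses are special inside PDF literal strings.
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def text_layer_pdf(boxes, width=612.0, height=792.0) -> bytes:
    """Build a one-page PDF whose only content is invisible text."""
    ops = ["BT", "3 Tr"]  # "3 Tr" selects render mode 3: neither fill nor stroke
    for b in boxes:
        # PDF's origin is bottom-left, so flip y; the baseline is approximated
        # as one font size below the top of the box.
        baseline = height - b.y_top - b.font_size
        ops.append(f"/F1 {b.font_size:g} Tf")
        ops.append(f"1 0 0 1 {b.x:.2f} {baseline:.2f} Tm")
        ops.append(f"({_escape(b.text)}) Tj")
    ops.append("ET")
    # Simplification: characters outside Latin-1 are replaced with "?".
    stream = "\n".join(ops).encode("latin-1", errors="replace")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        (f"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 {width:g} {height:g}] "
         "/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>").encode("ascii"),
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of each object, for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"          # each xref entry is exactly 20 bytes
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)
```

A call such as `open("ocr_layer.pdf", "wb").write(text_layer_pdf([TextBox("Hello", 72, 72)]))` would write a page that looks blank but whose text can be selected and searched in viewers that honor invisible text. Mapping real element coordinates (which may be in pixels rather than points) onto the page is left out of this sketch.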
