PAPERLESS_APPS to extend the default PDF parser to handle application/octet-stream #6563

bkanuka · 2024-05-04T15:48:06Z

This is in relation to this (closed) bug: #776

I have some PDFs (water bills, tax bills, etc.) that are emailed to me by my city. Unfortunately, they have the MIME type application/octet-stream. In the bug above, and in other places, the suggested solution is to run the PDF through qpdf so that it gets the right MIME type.

Unfortunately, pre-processing scripts only run after the MIME type check, and application/octet-stream is not supported by Paperless, so the file is rejected. This means that I can't use a pre-processing step to run the PDF through qpdf.

I believe the solution is to create an "App" that registers itself as a handler for files with the extension *.pdf and the MIME type application/octet-stream. This app would simply call qpdf on the input file to produce a valid PDF, and then call the default PDF handler for the rest of the processing. Essentially, it could be a very thin wrapper.
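To make the qpdf step concrete, here is a minimal sketch of the rewrite the wrapper would perform. It assumes qpdf is installed and on the PATH inside the container; the function names `build_qpdf_command` and `repair_pdf` are hypothetical and not part of Paperless.

```python
import subprocess
from pathlib import Path

# qpdf exits with 0 on success and with 3 when it succeeded but emitted
# warnings, which is common for slightly damaged input files.
QPDF_OK_EXIT_CODES = (0, 3)


def build_qpdf_command(src: Path, dst: Path) -> list[str]:
    """Return the argv for a plain `qpdf infile outfile` rewrite."""
    return ["qpdf", str(src), str(dst)]


def repair_pdf(src: Path, dst: Path) -> Path:
    """Rewrite src through qpdf into dst and return dst (hypothetical helper)."""
    result = subprocess.run(
        build_qpdf_command(src, dst),
        capture_output=True,
        text=True,
    )
    if result.returncode not in QPDF_OK_EXIT_CODES:
        raise RuntimeError(
            f"qpdf failed with exit code {result.returncode}: {result.stderr.strip()}"
        )
    return dst
```

Writing to a separate output path rather than rewriting in place keeps the original file untouched if qpdf fails.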

My plan:

1. Write this extension/app
2. Install it in the Docker container using https://docs.paperless-ngx.com/advanced_usage/#custom-container-initialization
3. Add the new app to the PAPERLESS_APPS config
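As a rough sketch of what such an app might contain, the snippet below mirrors the pattern that the bundled parser apps appear to use: a receiver for the `document_consumer_declaration` signal that returns a parser factory, a weight, and a MIME type map. The package name `paperless_octet_pdf` and the class `OctetStreamPdfParser` are hypothetical, and the exact signal contract is an assumption that should be checked against the Paperless source for the installed version.

```python
# paperless_octet_pdf/signals.py  (hypothetical package name)


def get_parser(*args, **kwargs):
    # Imported lazily so this module can be loaded before Django is ready.
    # OctetStreamPdfParser is a hypothetical class defined by the app itself.
    from paperless_octet_pdf.parsers import OctetStreamPdfParser

    return OctetStreamPdfParser(*args, **kwargs)


def octet_pdf_consumer_declaration(sender, **kwargs):
    """Declare this app as the handler for application/octet-stream files."""
    return {
        "parser": get_parser,
        # Assumption: the parser with the highest weight wins for a MIME type.
        "weight": 0,
        "mime_types": {"application/octet-stream": ".pdf"},
    }


# paperless_octet_pdf/apps.py would then connect the receiver, roughly:
#
#   from django.apps import AppConfig
#
#   class PaperlessOctetPdfConfig(AppConfig):
#       name = "paperless_octet_pdf"
#
#       def ready(self):
#           from documents.signals import document_consumer_declaration
#           document_consumer_declaration.connect(octet_pdf_consumer_declaration)
#           AppConfig.ready(self)
```

With a layout like this, the value added to PAPERLESS_APPS would be the package name (here `paperless_octet_pdf`), and the package would have to be importable from the Python path that Paperless runs with inside the container.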

My 2 questions are:

1. Does anyone have an example of a stupid-simple "app" that I could use as a template? I think I understand the basics, but the part I'm missing is how to install the app so that Django/Paperless picks it up when I put its name in PAPERLESS_APPS. I'm a decent Python dev, but I have never touched Django.
2. Is paperless_tesseract.RasterisedDocumentParser the default/main PDF parser? If so, can my app subclass RasterisedDocumentParser instead of DocumentParser? Then I would only need to override the parse method so that it makes a subprocess call to qpdf and then calls super().parse. Am I thinking about this correctly?
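The idea in the second question could be sketched as a small mixin that repairs the file and then defers to the next `parse` in the method resolution order. This is only an illustration under assumptions: the `parse(document_path, mime_type, file_name=None)` signature and the `tempdir` attribute are assumed to match the Paperless base parser, and `QpdfRepairMixin` and `OctetStreamPdfParser` are hypothetical names.

```python
import subprocess
from pathlib import Path


class QpdfRepairMixin:
    """Run the input through qpdf, then defer to the wrapped parser."""

    def repair(self, document_path):
        # Assumption: the Paperless base parser provides self.tempdir.
        repaired = Path(self.tempdir) / "repaired.pdf"
        result = subprocess.run(
            ["qpdf", str(document_path), str(repaired)],
            capture_output=True,
            text=True,
        )
        # qpdf exit code 3 means "succeeded with warnings".
        if result.returncode not in (0, 3):
            raise RuntimeError(f"qpdf failed: {result.stderr.strip()}")
        return repaired

    def parse(self, document_path, mime_type, file_name=None):
        repaired = self.repair(document_path)
        # Hand the repaired file on as a regular PDF so that the wrapped
        # parser does not see application/octet-stream.
        return super().parse(repaired, "application/pdf", file_name)


# In the real app this would be combined with the stock parser, assuming it
# lives in paperless_tesseract.parsers:
#
#   from paperless_tesseract.parsers import RasterisedDocumentParser
#
#   class OctetStreamPdfParser(QpdfRepairMixin, RasterisedDocumentParser):
#       pass
```

Keeping the qpdf call in its own `repair` method means the rest of the class can be exercised without qpdf installed, and the override of `parse` stays a one-line delegation.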

Any guidance here would be appreciated! Thanks!
