I have some PDFs (water bills, tax bills, etc.) that are emailed to me by my city. Unfortunately, they have the mime type application/octet-stream. In the related bug (#776), and in other places, the suggested solution is to run the PDF through qpdf so that it gets the right mime type.
Unfortunately, pre-processing scripts only run after the mime-type check, and application/octet-stream is unsupported by Paperless, so the file is rejected. This means that I can't use a pre-processing step to run the PDF through qpdf.
I believe the solution is to create an "App" that registers itself as a handler for files with extension *.pdf and mime type application/octet-stream. This app would simply call qpdf with the input file, produce a valid PDF, and then call the default PDF handler for the rest of the processing - essentially a very thin wrapper.
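As a rough sketch of the qpdf step described above (not tested against Paperless itself): the helper names `build_qpdf_command` and `rewrite_pdf` are made up for illustration, and it assumes that a plain `qpdf input output` rewrite is enough to produce a file that is detected as application/pdf.

```python
import shutil
import subprocess
from pathlib import Path


def build_qpdf_command(src, dst, qpdf_bin="qpdf"):
    """Return the argv list for rewriting `src` into `dst` with qpdf."""
    return [qpdf_bin, str(src), str(dst)]


def rewrite_pdf(src, dst, qpdf_bin="qpdf"):
    """Run qpdf on `src` and write the rewritten PDF to `dst`."""
    if shutil.which(qpdf_bin) is None:
        raise FileNotFoundError(f"{qpdf_bin} not found on PATH")
    result = subprocess.run(
        build_qpdf_command(src, dst, qpdf_bin),
        capture_output=True,
        text=True,
    )
    # Assumption: exit code 3 means "warnings only" and the output file was
    # still written, so only other non-zero codes are treated as failures.
    if result.returncode not in (0, 3):
        raise RuntimeError(f"qpdf failed: {result.stderr.strip()}")
    return Path(dst)
```

Calling `rewrite_pdf("bill.pdf", "bill-fixed.pdf")` would then give the file that gets handed to the normal PDF handler.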
Does anyone have an example of a stupid-simple "app" that I could use as a template? I think I understand the basics, but the part I'm missing is how I would install this app so that it can be picked up by django/paperless when I put its name in PAPERLESS_APPS. I'm a decent Python dev, but have never touched django.
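For reference, this is my current understanding of how an app announces itself, based on how the bundled paperless_tesseract app appears to do it; please treat it as an unverified sketch. The package name `paperless_qpdf`, the class `QpdfWrappingParser`, and the weight value are all hypothetical, and the signal name `document_consumer_declaration` should be checked against the installed Paperless version.

```python
def get_parser(*args, **kwargs):
    # Imported lazily so the parser module only loads once Django is ready.
    # `paperless_qpdf.parsers.QpdfWrappingParser` is a hypothetical name.
    from paperless_qpdf.parsers import QpdfWrappingParser

    return QpdfWrappingParser(*args, **kwargs)


def qpdf_consumer_declaration(sender, **kwargs):
    """Declare which mime types this app wants to handle."""
    return {
        "parser": get_parser,
        # Assumption: a higher weight wins when several parsers match.
        "weight": 10,
        "mime_types": {"application/octet-stream": ".pdf"},
    }


# Wiring, shown as comments because it needs Django to run (assumed layout):
#
#   # paperless_qpdf/apps.py
#   from django.apps import AppConfig
#
#   class PaperlessQpdfConfig(AppConfig):
#       name = "paperless_qpdf"
#
#       def ready(self):
#           from documents.signals import document_consumer_declaration
#           document_consumer_declaration.connect(qpdf_consumer_declaration)
```

If this is right, the package would only need to be importable by the Paperless process (for example, on its Python path) and listed in PAPERLESS_APPS.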
Is paperless_tesseract.RasterisedDocumentParser the default/main PDF parser? If so, can my app just subclass RasterisedDocumentParser instead of DocumentParser? Then I would only need to override the parse method so that it makes a subprocess call to qpdf and then calls super().parse. Am I thinking about this correctly?
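To make the idea concrete, here is a minimal sketch of the override pattern. `StandInPdfParser` is a hypothetical stand-in for RasterisedDocumentParser so the example runs without Paperless installed; the `parse(document_path, mime_type, file_name=None)` signature is an assumption about the real class, and `QpdfWrappingParser` and `run_qpdf` are illustrative names.

```python
import subprocess
import tempfile
from pathlib import Path


class StandInPdfParser:
    """Hypothetical stand-in for paperless_tesseract's RasterisedDocumentParser."""

    def parse(self, document_path, mime_type, file_name=None):
        # The real parser would run OCR here; this one just records its input.
        self.parsed = (Path(document_path), mime_type, file_name)


def run_qpdf(src, dst):
    # check=True raises CalledProcessError on any non-zero exit code.
    subprocess.run(["qpdf", str(src), str(dst)], check=True)


class QpdfWrappingParser(StandInPdfParser):
    """Rewrite the input with qpdf, then defer to the parent PDF parser."""

    def __init__(self, repair=run_qpdf):
        self.repair = repair  # injectable so the qpdf step can be swapped out
        self.workdir = Path(tempfile.mkdtemp(prefix="qpdf-"))

    def parse(self, document_path, mime_type, file_name=None):
        fixed = self.workdir / "fixed.pdf"
        self.repair(Path(document_path), fixed)
        # Hand the rewritten file to the parent, now labelled as a real PDF.
        super().parse(fixed, "application/pdf", file_name)
```

In the real app the base class would be swapped for RasterisedDocumentParser and the constructor would have to forward whatever arguments that class expects.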
This is in relation to this (closed) bug: #776
Any guidance here would be appreciated! Thanks!