Correct recognized OCR data missing in search index #473
Hi there,
OSS puts the filename and metadata directly into the index, but leaves the OCR data to be added later. That is done by Apache Tika.
Try indexing single files manually from the command line and check whether Tika returns an HTTP Error 500.
In my experience, the service hangs when it has to process too much at once.
It also stops adding OCR text when disk space is low.
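A rough sketch of such a manual check in Python: it sends one PDF straight to the Tika server and returns the HTTP status together with the extracted text. The URL `http://localhost:9998/tika`, the function names and the timeout are assumptions for illustration and are not taken from the OSS configuration, so adjust them to your setup. As far as I know, Tika server accepts a per-request Tesseract language through the `X-Tika-OCRLanguage` header.

```python
import urllib.error
import urllib.request

# Assumption: Tika server is reachable here; change to match your installation.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes, ocr_language="eng", tika_url=TIKA_URL):
    """Build a PUT request that asks Tika for the plain text of one PDF."""
    return urllib.request.Request(
        tika_url,
        data=pdf_bytes,
        method="PUT",
        headers={
            "Accept": "text/plain",
            "Content-Type": "application/pdf",
            # Tesseract language code, e.g. "eng" or "deu"
            "X-Tika-OCRLanguage": ocr_language,
        },
    )


def extract_text(pdf_path, ocr_language="eng", tika_url=TIKA_URL, timeout=300):
    """Return (http_status, body) for a single file sent to Tika."""
    with open(pdf_path, "rb") as fh:
        request = build_tika_request(fh.read(), ocr_language, tika_url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # A 500 lands here; the body usually carries Tika's error message.
        return err.code, err.read().decode("utf-8", errors="replace")
```

Calling `extract_text("invoice.pdf", ocr_language="deu")` (the file name is only a placeholder) and getting status 500 back would point at Tika rather than at the index.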
Furthermore, there is a parameter somewhere that disables the second OCR pass. If you have already run a better-calibrated OCR solution beforehand, it then uses the text of the original OCRed PDF.
By default it uses Google Tesseract with English as the language.
Make sure you set the OCR language to the language of your documents.
I use Chronoscan with a mix of Tesseract and Nuance.
It avoids the extra spaces that split words into unnecessary tokens.
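To show why the extra spaces matter, here is a deliberately simplified Python sketch that uses the two invoice strings from the question. It assumes a plain whitespace tokenizer and exact token matching; the real analyzer in the search index is more elaborate, and the function names are made up for the example.

```python
def whitespace_tokens(text):
    """Split text into lowercase tokens on whitespace only (simplified)."""
    return text.lower().split()


def matches(query, text):
    """True if the query equals one whole token of the text."""
    return query.lower() in whitespace_tokens(text)


extracted_text = "Invoice 12345 6 AV"   # poorly recognized text
ocr_tab_text = "Invoice 123456 /W"      # correctly recognized text

# The stray space splits the number into "12345" and "6",
# so a search for the full number finds no matching token.
print(matches("123456", extracted_text))
print(matches("123456", ocr_tab_text))
```

Under these assumptions, only the cleanly recognized string contains `123456` as a single token, which fits a search for "123456" returning nothing when only the poor text was indexed.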
Best regards
Andy
I have some .pdf files where OCR of the embedded graphics works perfectly, and the recognized text is displayed correctly in the OCR tab of the search results, but I cannot find this text or its contents in the search index itself.
Does anyone have an idea why the OCR text does not appear in the search index?
The extracted text tab only contains very poorly recognized text, e.g.
"tems Ltg Am Rohiance 3 5S300 WetterCar"
"Invoice 12345 6 AV"
In the OCR tab the text is correctly recognized:
"Car Systems Ltd Am Rohlande 3 58300 Wetter"
"Invoice 123456 /W"
A search for "123456", for example, returns no results. I'm a bit at a loss right now.