Correct recognized OCR data missing in search index #473

playbackandrewind · 2024-01-31T12:23:23Z

I have some .pdf files where OCR of the embedded graphics works perfectly, and the recognized text is displayed correctly in the OCR tab of the search results, but I cannot find this text or its contents through the search index itself.

Does anyone have an idea why the OCR text does not appear in the search index?

The extracted text tab only contains very poorly recognized text, e.g.
"tems Ltg Am Rohiance 3 5S300 WetterCar"
"Invoice 12345 6 AV"

In the OCR tab the text is correctly recognized:
"Car Systems Ltd Am Rohlande 3 58300 Wetter"
"Invoice 123456 /W"

A search for "123456", for example, returns no results. I'm a bit at a loss right now.
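One possible reading of the symptom, sketched in Python: if only the poorly recognized text from the extracted text tab reaches the index, the invoice number is split into two tokens and an exact search cannot match it. The `tokens` helper is hypothetical and uses plain whitespace splitting, which is only an assumption about how the index tokenizes text.

```python
def tokens(text):
    """Split text on whitespace; a stand-in for the index tokenizer (assumption)."""
    return text.lower().split()


# Sample strings quoted above from the two tabs.
extracted_text = "Invoice 12345 6 AV"
ocr_text = "Invoice 123456 /W"

query = "123456"

# The extracted text yields the tokens "12345" and "6", so the query has no match.
found_in_extracted = query in tokens(extracted_text)
# The OCR text contains the number as a single token.
found_in_ocr = query in tokens(ocr_text)

if __name__ == "__main__":
    print("extracted text matches:", found_in_extracted)
    print("OCR text matches:", found_in_ocr)
```

If this is what happens, the missing hit would mean the OCR field is not being indexed, rather than that the OCR itself failed.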

mosea3 · 2024-01-31T12:54:41Z

Hi there, OSS takes the filename and metadata directly into the index, but leaves the OCR data to be added later. That is done by Apache Tika. Try indexing single files manually from the command line to see whether Tika returns an HTTP Error 500. In my experience the service hangs when it has to process too much at once. It also stops adding OCR when disk space is low. Furthermore, there is a parameter somewhere that lets you disable double OCR if you already have a better-calibrated OCR solution upstream; it then takes the original OCRed PDF. By default it uses Google Tesseract with English as the language. Make sure you set the OCR language to the language of your document content. I use Chronoscan with a mix of Tesseract and Nuance. It avoids unnecessary tokenization (the extra spaces). Best regards, Andy
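The single-file check suggested above could be sketched roughly as follows. This is not the OSS command-line indexer; it talks to a Tika server directly so that an HTTP 500 becomes visible. The URL `http://localhost:9998/tika` is Tika server's default and only an assumption about this installation, the language code `deu` (Tesseract's code for German) is a guess based on the sample text, and `sample.pdf` is a hypothetical file name.

```python
import urllib.error
import urllib.request

# Assumption: a Tika server is reachable on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes, url=TIKA_URL, ocr_language="deu"):
    """Build a PUT request asking Tika for plain text with a given OCR language."""
    headers = {
        "Accept": "text/plain",
        "Content-Type": "application/pdf",
        # Tika server header that selects the Tesseract language.
        "X-Tika-OCRLanguage": ocr_language,
    }
    return urllib.request.Request(url, data=pdf_bytes, headers=headers, method="PUT")


def extract_text(pdf_bytes, url=TIKA_URL, ocr_language="deu"):
    """Return (HTTP status, response body) for a single document."""
    request = build_tika_request(pdf_bytes, url, ocr_language)
    try:
        with urllib.request.urlopen(request, timeout=300) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # A 500 here would suggest Tika itself is failing on this file.
        return err.code, err.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    with open("sample.pdf", "rb") as handle:  # hypothetical file name
        status, text = extract_text(handle.read())
    print("HTTP status:", status)
    print("contains 123456:", "123456" in text)
```

A status of 200 together with the correct invoice number in the returned text would point away from Tika and toward the indexing step or the language setting in the OSS configuration.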
