GetDatapath can't find the default path that tesseract should find on windows #326

Open
NewUserHa opened this issue Oct 3, 2023 · 7 comments


@NewUserHa

GetDatapath doesn't find the path that tesseract itself uses by default on Windows, C:\Program Files\Tesseract-OCR\tessdata, nor a tessdata directory relative to wherever the tesseract binary is.

@darnox

darnox commented Oct 12, 2023

The same is true on Linux. It has been working fine with tesserocr==2.6.0, but since tesserocr==2.6.1, the package has stopped discovering the default tessdata paths.

@sirfz
Owner

sirfz commented Oct 12, 2023

tesserocr relies on tesseract's GetDatapath function to get the default data path and doesn't make any assumptions itself. Not sure if it's an issue with the way the CI binary is built or with the newer tesseract versions. Did you try pip install --no-binary tesserocr tesserocr and see if you get the same behavior?
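To compare a wheel install with a source build, a small check along these lines may help. It is only a sketch: report_datapath is a hypothetical helper name, and it assumes tesserocr.get_languages() returns a (path, languages) tuple as described in the project README.

```python
import importlib


def report_datapath():
    """Return (tesseract version, tessdata path, languages), or None if tesserocr is not installed."""
    try:
        tesserocr = importlib.import_module("tesserocr")
    except ImportError:
        return None
    # Called without arguments, get_languages() uses tesserocr's default data path
    # (assumed here to be the one obtained from tesseract's GetDatapath).
    path, languages = tesserocr.get_languages()
    return tesserocr.tesseract_version(), path, languages


if __name__ == "__main__":
    print(report_datapath())
```

An empty language list next to an unexpected path would suggest that the install is looking in the wrong place rather than that the trained data is missing.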

@darnox

darnox commented Oct 13, 2023

Indeed, it seems to be a problem with the binary. Installing with pip install --no-binary tesserocr tesserocr works just fine, even with the newest version of tesserocr.

@sirfz
Owner

sirfz commented Oct 13, 2023

Which version of tesseract did you build against (see tesserocr.tesseract_version())? The v2.6.2 binary wheels are built against 5.3.3.

@darnox

darnox commented Oct 13, 2023

tesseract 5.3.2
 leptonica-1.79.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1

@sirfz
Owner

sirfz commented Oct 17, 2023

@nijel do you have any idea what could cause the difference in behavior by the binaries built in our CI pipeline?

@nijel
Contributor

nijel commented Oct 17, 2023

None of #331, #330, or #329 should change the behavior.

Anyway, I don't think the CI workflow builds Windows wheels, so I don't see how --no-binary could make a difference there.

For Linux, the default tesseract prefix seems to be /usr/local, while distros use /usr by default, so using the system library could behave differently here. Since the changed behavior has been observed from 2.6.1 onwards (the introduction of binary wheels), it may have been broken for the binary wheels from the beginning, and changing the default prefix here would fix that:

PREFIX="${PREFIX:-/usr/local}"

I actually remember seeing this issue after upgrading to 2.6.1, but as we wanted to introduce dynamic downloading of the trained data anyway, this was just a reason to do that upon the upgrade without investigating the root cause.
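Until the prefix question is settled, one possible workaround is to locate the tessdata directory explicitly instead of relying on the compiled-in default. The sketch below is illustrative only: find_tessdata and DEFAULT_CANDIDATES are hypothetical names, and the candidate directories (apart from the Windows path quoted in this issue) are assumptions about common install layouts, not an exhaustive list.

```python
import os
from pathlib import Path

# Candidate locations are assumptions for illustration only.
DEFAULT_CANDIDATES = (
    r"C:\Program Files\Tesseract-OCR\tessdata",  # Windows path quoted in this issue
    "/usr/share/tessdata",
    "/usr/share/tesseract-ocr/5/tessdata",
    "/usr/local/share/tessdata",
)


def find_tessdata(candidates=DEFAULT_CANDIDATES, env=None):
    """Return the first directory that contains *.traineddata files, or None."""
    env = os.environ if env is None else env
    paths = []
    prefix = env.get("TESSDATA_PREFIX")
    if prefix:
        # Check the variable's value and a tessdata subdirectory of it,
        # since conventions have differed between tesseract versions.
        paths.append(prefix)
        paths.append(os.path.join(prefix, "tessdata"))
    paths.extend(candidates)
    for candidate in paths:
        directory = Path(candidate)
        if directory.is_dir() and any(directory.glob("*.traineddata")):
            return str(directory)
    return None


if __name__ == "__main__":
    print(find_tessdata())
```

If a directory is found, it could be passed explicitly, for example tesserocr.PyTessBaseAPI(path=find_tessdata()), so the result no longer depends on which prefix the wheel was built with.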


4 participants