
Encoding issue - invalid byte sequence in US-ASCII (ArgumentError) #121

Open
intellisense opened this issue Nov 24, 2014 · 13 comments

@intellisense

I am getting several errors like these. Any workaround? Thanks!

Exception("ErrorCode 1: /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_cleaner.rb:49:in `scan': invalid byte sequence in US-ASCII (ArgumentError)
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_cleaner.rb:49:in `block in clean'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_cleaner.rb:48:in `loop'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_cleaner.rb:48:in `clean'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit.rb:79:in `clean_text'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_extractor.rb:92:in `block in clean_text'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_extractor.rb:88:in `open'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_extractor.rb:88:in `clean_text'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_extractor.rb:78:in `extract_from_ocr'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_extractor.rb:36:in `block in extract'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_extractor.rb:32:in `each'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_extractor.rb:32:in `extract'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit.rb:45:in `extract_text'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/command_line.rb:46:in `run'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/command_line.rb:37:in `initialize'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/bin/docsplit:5:in `new'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/bin/docsplit:5:in `<top (required)>'
from /usr/bin/docsplit:23:in `load'
from /usr/bin/docsplit:23:in `<main>')
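
For anyone triaging this: the exception can be reproduced in plain Ruby, independent of docsplit. Regexp operations such as the `scan` call in text_cleaner.rb raise ArgumentError whenever a string's bytes are invalid for its declared encoding — here, UTF-8 bytes tagged as US-ASCII. A minimal sketch (not docsplit's code):

```ruby
# UTF-8 bytes for "résumé", but the string is tagged as US-ASCII,
# so bytes above 127 make it invalid for its declared encoding.
text = "r\xC3\xA9sum\xC3\xA9".force_encoding(Encoding::US_ASCII)

text.valid_encoding?   # => false

begin
  text.scan(/\w+/)     # regexp match on an invalid string raises
rescue ArgumentError => e
  puts e.message       # "invalid byte sequence in US-ASCII"
end
```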
@knowtheory
Member

First question: are you OCRing English or non-English docs?

If the latter, you can set the --no-clean flag (if you're using Docsplit from the command line). If you upgrade to 0.7.6, setting the --language flag will automatically set --no-clean.

If you're OCRing an English-language doc, we'd be interested in seeing a sample doc (as our TextCleaner isn't doing the right thing if that's the case).

@intellisense
Author

Thanks for the quick reply. Yes, we are OCRing the docs, no matter what language they're in. This doc failed with the same error as above, although it is English with some handwriting in it. And I am using docsplit 0.7.5.

@knowtheory
Member

Alrighty, mind letting us know what tesseract version you're using? We're on docsplit 0.7.6 and tesseract 3.03 (which succeeded in processing the doc linked above). Looks like you're on Ubuntu?

@intellisense
Author

Here are the full environment details:

Ubuntu 14.04 (trusty)
docsplit 0.7.5
tesseract 3.03

Let me know if you want any more information. Thanks

@intellisense
Author

Here are some more files to test on: doc1 and doc2.
Please tell me the solution for this issue, as we are in production :( Also, what are the consequences of using the --no-clean flag?

Thanks!

@knowtheory
Member

The TextCleaner will strip out character sequences that look like garbage in English (lots of consonants in a row, for example). So if your input is clean-ish, turning it off won't do much.
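
To make that concrete, here's a toy version of such a consonant-run heuristic. This is an illustration only, not docsplit's actual text_cleaner.rb logic:

```ruby
# Toy heuristic: English words rarely contain five or more
# consecutive consonants, so treat such runs as OCR garbage.
CONSONANT_RUN = /[bcdfghjklmnpqrstvwxz]{5,}/i

def looks_like_garbage?(word)
  !!(word =~ CONSONANT_RUN)
end

# Drop "words" that trip the heuristic, keep everything else.
def clean(text)
  text.split.reject { |w| looks_like_garbage?(w) }.join(" ")
end

clean("the qwrtpsk report")  # => "the report"
```

The trade-off knowtheory describes falls out directly: a rule tuned to English letter patterns will happily throw away perfectly valid words in other languages.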

@intellisense
Author

So the text extraction only works on English? Is there any handy tool you can recommend that can easily extract plain text from non-English PDFs?

@knowtheory
Member

Text cleaning only works in English. Docsplit will OCR in non-English languages if you specify the input language.

@nathanstitt
Member

@intellisense: My environment is pretty close to yours and I'm able to extract your documents successfully.

Can you tell me what docsplit command you are running? I ran: docsplit text <pdf_file>

Can you also provide the Ruby version from ruby --version? Mine is ruby 1.9.3p484 (2013-11-22 revision 43786) [x86_64-linux]

@intellisense
Author

@nathanstitt I am using this command: docsplit text --output /output/path/abc.txt /input/path/abc.pdf
The Ruby version is exactly the same as yours: ruby 1.9.3p484 (2013-11-22 revision 43786) [x86_64-linux].

I just ran the command with --no-clean flag and it works. But without this flag I am having trouble as mentioned above.

@nathanstitt
Member

Hm. Since our commands and ruby versions are the same, I'm thinking that the culprit may be Tesseract. Perhaps your version is generating some sequence of UTF characters that Docsplit/Ruby doesn't like.

My tesseract --version reports:
tesseract 3.03 leptonica-1.70 libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0

Do your versions differ?

I should also note that docsplit/tesseract didn't do a very good job on the second document you linked above. Since the scans were blurry, the text is pretty garbled. The text cleaner attempted to clean it, but the difference between using --no-clean and the normal command line wasn't very large. I think you'll be fine using the --no-clean flag if we can't get to the bottom of the issue.

@intellisense
Author

The tesseract version is exactly the same as yours, with every image library you mentioned; no difference whatsoever. I think I should go with the --no-clean flag, but it's not an optimal solution, as I want to support text extraction from non-English documents as well. You can close this if you want to. Thanks for the help, I highly appreciate it.

@nathanstitt
Member

Hey @intellisense. Sorry for the confusion, but you absolutely can extract text from non-English documents, with or without the --no-clean flag. In fact, if you are extracting from non-English documents, the no-clean flag is set to true internally and its usage is ignored.

All the option does is disable running the TextCleaner (which removes invalid characters) on the OCR'ed text. Since the TextCleaner only knows how to recognize garbage sequences in English, that's the only language it's effective on.
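
If you just need to neutralize invalid bytes yourself before handing OCR output to something encoding-strict, Ruby's standard library can do that. A sketch (String#scrub needs Ruby >= 2.1; the encode round-trip works on 1.9):

```ruby
# "\xE9" is a lone Latin-1 byte, invalid inside a UTF-8 string.
raw = "caf\xE9 menu"
raw.valid_encoding?   # => false

# Ruby >= 2.1: replace invalid byte sequences (here, drop them).
scrubbed = raw.scrub("")
scrubbed              # => "caf menu"

# Ruby 1.9.x equivalent: round-trip through another encoding,
# replacing invalid input bytes along the way.
legacy = raw.encode("UTF-16", invalid: :replace, replace: "").encode("UTF-8")
legacy                # => "caf menu"
```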
