
Encoding issue - invalid byte sequence in US-ASCII (ArgumentError) #121

Open
intellisense opened this issue Nov 24, 2014 · 13 comments

@intellisense

I am getting several errors like these. Any workaround? Thanks!

Exception("ErrorCode 1: /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_cleaner.rb:49:in `scan': invalid byte sequence in US-ASCII (ArgumentError)
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_cleaner.rb:49:in `block in clean'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_cleaner.rb:48:in `loop'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_cleaner.rb:48:in `clean'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit.rb:79:in `clean_text'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_extractor.rb:92:in `block in clean_text'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_extractor.rb:88:in `open'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_extractor.rb:88:in `clean_text'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_extractor.rb:78:in `extract_from_ocr'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_extractor.rb:36:in `block in extract'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_extractor.rb:32:in `each'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/text_extractor.rb:32:in `extract'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit.rb:45:in `extract_text'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/command_line.rb:46:in `run'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/lib/docsplit/command_line.rb:37:in `initialize'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/bin/docsplit:5:in `new'
from /var/lib/gems/1.9.1/gems/docsplit-0.7.5/bin/docsplit:5:in `<top (required)>'
from /usr/bin/docsplit:23:in `load'
from /usr/bin/docsplit:23:in `<main>')
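
For anyone triaging this: the exception can be reproduced in plain Ruby, independent of docsplit. Regexp operations such as the `scan` call in text_cleaner.rb raise ArgumentError whenever a string's bytes are invalid for its declared encoding — here, UTF-8 bytes tagged as US-ASCII. A minimal sketch (not docsplit's code):

```ruby
# UTF-8 bytes for "résumé", but the string is tagged as US-ASCII,
# so bytes above 127 make it invalid for its declared encoding.
text = "r\xC3\xA9sum\xC3\xA9".force_encoding(Encoding::US_ASCII)

text.valid_encoding?   # => false

begin
  text.scan(/\w+/)     # regexp match on an invalid string raises
rescue ArgumentError => e
  puts e.message       # "invalid byte sequence in US-ASCII"
end
```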
@knowtheory
Member

First question: are you OCRing English or non-English docs?

If the latter, you can set the --no-clean flag (if you're using Docsplit from the command line). If you upgrade to 0.7.6, setting the --language flag will automatically set --no-clean.

If you're OCRing an English-language doc, we'd be interested in seeing a sample doc (as our TextCleaner isn't doing the right thing if that's the case).

@intellisense
Author

Thanks for the quick reply. Yes, we are OCRing the docs, no matter what language they're in. This doc failed with the same error as above, although it is English with some handwriting in it. And I am using docsplit 0.7.5.

@knowtheory
Member

Alrighty, mind letting us know what tesseract version you're using? We're on docsplit 0.7.6 and tesseract 3.03 (which succeeded in processing the doc linked above). Looks like you're on Ubuntu?

@intellisense
Author

Here are the full environment details:

Ubuntu 14.04 (trusty)
docsplit 0.7.5
tesseract 3.03

Let me know if you want any more information. Thanks

@intellisense
Author

Here are some more files to test on: doc1 and doc2.
Please tell me the solution for this issue, as we are in production :( Also, what are the consequences of using the --no-clean flag?

Thanks!

@knowtheory
Member

The TextCleaner will strip out character sequences that look like garbage in English (lots of consonants in a row, for example). So if your input is clean-ish, turning it off won't do much.
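
To make that concrete, here's a toy version of such a consonant-run heuristic. This is an illustration only, not docsplit's actual text_cleaner.rb logic:

```ruby
# Toy heuristic: English words rarely contain five or more
# consecutive consonants, so treat such runs as OCR garbage.
CONSONANT_RUN = /[bcdfghjklmnpqrstvwxz]{5,}/i

def looks_like_garbage?(word)
  !!(word =~ CONSONANT_RUN)
end

# Drop "words" that trip the heuristic, keep everything else.
def clean(text)
  text.split.reject { |w| looks_like_garbage?(w) }.join(" ")
end

clean("the qwrtpsk report")  # => "the report"
```

The trade-off knowtheory describes falls out directly: a rule tuned to English letter patterns will happily throw away perfectly valid words in other languages.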

@intellisense
Author

So the text extraction only works on English? Is there any handy tool you can recommend that can easily extract plain text from non-English PDFs?

@knowtheory
Member

Text cleaning only works in English. Docsplit will OCR in non-English languages if you specify the input language.

@nathanstitt
Member

@intellisense: My environment is pretty close to yours and I'm able to extract your documents successfully.

Can you tell me what docsplit command you are running? I ran: docsplit text <pdf_file>

Can you also provide the Ruby version from ruby --version? Mine is ruby 1.9.3p484 (2013-11-22 revision 43786) [x86_64-linux]

@intellisense
Author

@nathanstitt I am using this command: docsplit text --output /output/path/abc.txt /input/path/abc.pdf
The Ruby version is exactly the same as yours: ruby 1.9.3p484 (2013-11-22 revision 43786) [x86_64-linux].

I just ran the command with --no-clean flag and it works. But without this flag I am having trouble as mentioned above.

@nathanstitt
Member

Hm. Since our commands and ruby versions are the same, I'm thinking that the culprit may be Tesseract. Perhaps your version is generating some sequence of UTF characters that Docsplit/Ruby doesn't like.

My tesseract --version reports:
tesseract 3.03 leptonica-1.70 libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0

Do your versions differ?

I should also note that docsplit/tesseract didn't do a very good job on the second document you linked above. Since the scans were blurry, the text is pretty garbled. The text cleaner attempted to clean it, but the difference between using --no-clean and the normal command line wasn't very large. I think you'll be fine using the --no-clean flag if we can't get to the bottom of the issue.

@intellisense
Author

The tesseract version is exactly the same as yours, with every image library you mentioned; no difference whatsoever. I think I should go with the --no-clean flag, but it's not an optimal solution, as I want to support text extraction from non-English documents as well. You can close this if you want to. Thanks for the help, I highly appreciate it.

@nathanstitt
Member

Hey @intellisense. Sorry for the confusion, but you absolutely can extract text from non-English documents, with or without the --no-clean flag. In fact, if you are extracting from non-English documents, the no-clean flag is set to true internally and its usage is ignored.

All the option does is disable running the TextCleaner (which removes invalid characters) on the OCR'ed text. Since the TextCleaner only knows how to recognize garbage sequences in English, that's the only language it's effective on.
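
If you just need to neutralize invalid bytes yourself before handing OCR output to something encoding-strict, Ruby's standard library can do that. A sketch (String#scrub needs Ruby >= 2.1; the encode round-trip works on 1.9):

```ruby
# "\xE9" is a lone Latin-1 byte, invalid inside a UTF-8 string.
raw = "caf\xE9 menu"
raw.valid_encoding?   # => false

# Ruby >= 2.1: replace invalid byte sequences (here, drop them).
scrubbed = raw.scrub("")
scrubbed              # => "caf menu"

# Ruby 1.9.x equivalent: round-trip through another encoding,
# replacing invalid input bytes along the way.
legacy = raw.encode("UTF-16", invalid: :replace, replace: "").encode("UTF-8")
legacy                # => "caf menu"
```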
