Extracting table data? #1

Open
munikarmanish opened this issue Jan 2, 2019 · 5 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@munikarmanish

Right now, it only seems to perform OCR, i.e., convert an image to raw text. Is any table-specific extraction performed? Basically, I'm researching good algorithms for extracting tabular data from scanned documents.

Thanks in advance. :)

@cseas
Owner

cseas commented Jan 11, 2019

Hi, @munikarmanish!

You're correct. The OCR currently only works for pre-processed images.

While it does extract data from PDFs with tables, it currently performs a horizontal scan and doesn't perform any table-based classification on the text yet. I'm still trying to figure out how to make that work.

A make-do solution could be to classify the text after extraction based on the length of the columns, but that will only work if every column has a fixed number of words, which is not the case in most scenarios.
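To make the make-do idea concrete, here is a minimal sketch, not code from this project. It assumes each OCR line is one table row and that every column holds a known, fixed number of words. The function name `split_by_word_counts` and the sample rows are hypothetical.

```python
from typing import List, Sequence


def split_by_word_counts(line: str, widths: Sequence[int]) -> List[str]:
    """Group the words of one OCR text line into columns.

    `widths` gives the assumed number of words in each column.
    """
    words = line.split()
    if len(words) != sum(widths):
        # The fixed-width assumption is visibly broken for this row.
        raise ValueError(
            f"expected {sum(widths)} words, got {len(words)}: {line!r}"
        )
    cells = []
    start = 0
    for width in widths:
        cells.append(" ".join(words[start:start + width]))
        start += width
    return cells


# Hypothetical rows: name (2 words), age (1 word), city (1 word).
widths = (2, 1, 1)
good = split_by_word_counts("Jane Doe 34 Paris", widths)

# Same word count but a different layout: the split succeeds and is wrong.
misaligned = split_by_word_counts("Ana 29 New York", widths)
```

The second call shows why this only works in narrow cases: a one-word name and a two-word city still add up to four words, so the row is split into `"Ana 29"`, `"New"`, and `"York"` without any error.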

@cseas added the enhancement and help wanted labels on Jan 11, 2019
@aribornstein

The way to do this is to use code to do table detection (columns and rows) and then perform the OCR within each cell of the table. It's a really hard problem, though.
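As a rough illustration of the detection step, here is a sketch that uses projection profiles, which is one simple technique and not necessarily what anyone in this thread had in mind. It assumes the page is already binarized and deskewed, that it is represented as a list of rows with 1 for ink and 0 for background, and that cells are separated by completely empty rows and columns. The names `find_runs` and `detect_cells` are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # 1 = ink, 0 = background (assumed representation)
Box = Tuple[int, int, int, int]  # (top, bottom, left, right), half-open


def find_runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges where the profile is non-zero."""
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i  # a run of ink begins
        elif not value and start is not None:
            runs.append((start, i))  # a gap ends the run
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def detect_cells(image: Image) -> List[List[Box]]:
    """Split a binary image into a grid of cell boxes using ink projections."""
    row_profile = [sum(row) for row in image]        # ink per pixel row
    col_profile = [sum(col) for col in zip(*image)]  # ink per pixel column
    rows = find_runs(row_profile)
    cols = find_runs(col_profile)
    return [[(top, bottom, left, right) for (left, right) in cols]
            for (top, bottom) in rows]


# Tiny synthetic "page": two text rows and two text columns.
image = [
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 1, 1],
]
cells = detect_cells(image)
```

Each box could then be cropped and passed to whatever OCR engine is in use, so the recognized text comes back already assigned to a row and column. Real scans are much messier: noise, skew, ruled lines, and merged cells all break the empty-gap assumption, which is a large part of why the problem is hard.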

@jaysinghr

Hi @munikarmanish, did you find anything regarding the research you mentioned above?

@munikarmanish
Author

> Hi @munikarmanish, did you find anything regarding the research you mentioned above?

Yes, I've found a few interesting approaches:

@SAIVENKATARAJU

I am also facing the above issue. Has anyone found a good solution after 2 years?

5 participants