Parsing table with empty columns not optimal. result_type='text' produces better table formatting. #150

Open
ggjx22 opened this issue Apr 19, 2024 · 0 comments
Labels
bug Something isn't working


ggjx22 commented Apr 19, 2024

Hello, I have been trying to parse documents (purchase orders and invoices) that contain tables, using parsing_instructions to handle awkward tables. However, I still cannot get a good parsing result after rewriting my prompts many times. result_type='markdown' is my preferred result type because it allows more control over the output. Are there any 'special' prompts the parser needs so that it behaves as instructed?

In one of my latest prompts to the parser, I wrote the following instructions for table extraction, along with formatting notes to handle a few edge cases:

STEP 7: Extract tables of itemized list of items:
- Tables of itemized list are line items of description of goods or
services
- All rows and columns of the table needs to be extracted.
Notes: Exclude total amounts of the table. Replace special characters such
as "|" in the description with "," as they may cause the table to be 
misaligned. Take note of empty rows or columns and apply the appropriate
markdown formatting in the output.
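
The "|" replacement described in the notes above could also be done as a deterministic post-processing step instead of relying on the prompt. Here is a minimal sketch, assuming the cell values are already available as plain strings; `sanitize_cell` and `to_markdown_row` are hypothetical helper names, not part of the parser:

```python
def sanitize_cell(text: str, replacement: str = ",") -> str:
    """Replace pipe characters so a value cannot split a markdown table cell."""
    # Trim whitespace around each piece so "carton | jiffy" becomes "carton,jiffy".
    parts = [part.strip() for part in text.split("|")]
    return replacement.join(part for part in parts if part)


def to_markdown_row(cells: list[str]) -> str:
    """Render one table row; empty strings stay as empty cells, so columns do not shift."""
    return "| " + " | ".join(sanitize_cell(cell) for cell in cells) + " |"
```

Passing an empty string for an empty column keeps its position in the row, which is the behavior the prompt asks for.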

As I cannot share the company document publicly, here is a screenshot of the portion of the table that the parser struggles with.
image

With result_type='markdown', I get the following result. The columns are shifted: the numeric values sit under Extra Charges and Qty, leaving Weight empty.

| Receiver Suburb    | Extra Charges | Qty | Weight | Description |
|--------------------|---------------|-----|--------|-------------|
| Emerald            | 2             | 3   |        | carton,jiffy|
| Rockhampton Dc     | 2             | 53  |        | skid,jiffy  |
| Emerald            | 2             | 3   |        | jiffy       |
| Rockhampton Dc     | 2             | 72  |        | skid,carton |
| Rockhampton Dc     | 1             | 5   |        | carton      |
| Emerald            | 1             | 1   |        | jiffy       |
| Paget              | 1             | 11  |        | carton      |
| Paget              | 1             | 5   |        |carton       |
| Emerald            | 2             | 165 |        | skid,jiffy  |
| Rockhampton Dc     | 1             | 8   |        | carton      |
| Rockhampton Dc     | 1             | 1   |        | jiffy       |
| Rockhampton Dc     | 1             | 8   |        | carton      |
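
To check a markdown result like the one above automatically, one option is to parse the table and report which columns are empty in every row. This is only a sketch with hypothetical helper names (`parse_markdown_table`, `empty_columns`), and it assumes a simple pipe table with a single separator line after the header:

```python
def split_row(line: str) -> list[str]:
    """Split one pipe-table line into stripped cell values."""
    line = line.strip()
    # Drop exactly one leading and one trailing pipe, so empty edge cells survive.
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def parse_markdown_table(markdown: str) -> tuple[list[str], list[list[str]]]:
    """Return (header, rows); the second line is assumed to be the |---| separator."""
    lines = [ln for ln in markdown.strip().splitlines() if ln.strip()]
    header = split_row(lines[0])
    rows = [split_row(ln) for ln in lines[2:]]
    return header, rows


def empty_columns(header: list[str], rows: list[list[str]]) -> list[str]:
    """Names of the columns that have no value in any row."""
    return [
        name
        for i, name in enumerate(header)
        if all(i >= len(row) or row[i] == "" for row in rows)
    ]
```

Run on the table above, this reports Weight as the empty column, which can then be compared against the column that is empty in the source document.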

With result_type='text', I get the following result. No columns were shifted.

Receiver Suburb  Extra Charges  Qty  Weight         Description
Emerald                          2       3             carton,jiffy
Rockhampton Dc                   2      53               skid,jiffy
Emerald                          2       3                     jiffy
Rockhampton Dc                   2      72            skid,carton
Rockhampton Dc                   1       5                  carton
Emerald                          1       1                     jiffy
Paget                            1      11                  carton
Paget                            1       5                  carton
Emerald                          2      165              skid,jiffy
Rockhampton Dc                   1       8                  carton
Rockhampton Dc                   1       1                     jiffy
Rockhampton Dc                   1       8                  carton
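
Since the text output keeps the column positions, a possible workaround is to rebuild the markdown table from it. The sketch below is a heuristic, not something the parser provides: it assumes columns are separated by at least two spaces, and it assigns each value to the header whose character span it overlaps most (or is nearest to). All function names are hypothetical.

```python
import re

# Words separated by single spaces count as one token, e.g. "Rockhampton Dc".
TOKEN = re.compile(r"\S+(?: \S+)*")


def token_spans(line: str) -> list[tuple[int, int, str]]:
    """Return (start, end, text) for every token in a fixed-width line."""
    return [(m.start(), m.end(), m.group()) for m in TOKEN.finditer(line)]


def align_to_header(header_line: str, row_line: str) -> list[str]:
    """Place each row value under the header column it overlaps most."""
    columns = token_spans(header_line)
    cells = [""] * len(columns)
    for start, end, text in token_spans(row_line):
        # Positive score = overlap length; negative score = gap to that column.
        def score(i: int) -> int:
            col_start, col_end, _ = columns[i]
            return min(end, col_end) - max(start, col_start)

        best = max(range(len(columns)), key=score)
        cells[best] = (cells[best] + " " + text).strip()
    return cells


def text_table_to_markdown(lines: list[str]) -> str:
    """Convert a fixed-width text table (header first) into a markdown table."""
    header_line, *row_lines = [ln for ln in lines if ln.strip()]
    names = [text for _, _, text in token_spans(header_line)]
    out = ["| " + " | ".join(names) + " |", "|" + "|".join("---" for _ in names) + "|"]
    for row in row_lines:
        out.append("| " + " | ".join(align_to_header(header_line, row)) + " |")
    return "\n".join(out)
```

With this approach, columns with no value in a row stay empty instead of collapsing, as long as the spacing in the text output is preserved.
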
logan-markewich added the bug label on May 14, 2024