
ai_extraction=True not working locally #11

Open
sisyga opened this issue Apr 26, 2024 · 2 comments

sisyga commented Apr 26, 2024

Hi! Not sure if this is a bug or a feature, but I'd love to use the ai_extraction option to improve the handling of PDF documents. However, enabling this option overrides the local=True option.

MWE:

from thepipe.thepipe_api import thepipe 
source = 'example.pdf'
messages = thepipe.extract(source, local=True, verbose=True, ai_extraction=True)

Throws the error:
Failed to extract from example.pdf: No valid API key given. Visit https://thepi.pe/docs to learn more.

It works without ai_extraction enabled, but then every page is added as an image to the messages, which massively increases the token count for longer PDFs.
As a workaround, I adapted the extract_pdf function to extract PDF pages as images only if the page contains an image. It would be great to have this as an option. (I know this approach is not optimal, as it misses tables and some images consisting only of SVG objects; a better option may be possible based only on the fitz library, but I am no expert in this package.)

def extract_pdf(file_path: str, ai_extraction: bool = False, text_only: bool = False, verbose: bool = False, limit: int = None) -> List[Chunk]:
    chunks = []
    if ai_extraction:
        with open(file_path, "rb") as f:
            response = requests.post(
                url=API_URL,
                files={'file': (file_path, f)},
                data={'api_key': THEPIPE_API_KEY, 'ai_extraction': ai_extraction, 'text_only': text_only, 'limit': limit}
            )
        try:
            response_json = response.json()
        except json.JSONDecodeError:
            raise ValueError(f"Our backend likely couldn't handle this request. This can happen with large content such as videos, streams, or very large files/websites. Re")
        if 'error' in response_json:
            raise ValueError(f"{response_json['error']}")
        messages = response_json['messages']
        chunks = create_chunks_from_messages(messages)
    else:
        import fitz  # PyMuPDF
        # extract text and images of each page from the PDF
        doc = fitz.open(file_path)
        for page in doc:
            text = page.get_text()
            image_list = page.get_images(full=True)
            if text_only:
                chunks.append(Chunk(path=file_path, text=text, image=None, source_type=SourceTypes.PDF))
            elif image_list:
                # page embeds at least one image: snapshot the whole page
                pix = page.get_pixmap()
                img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                chunks.append(Chunk(path=file_path, text=text, image=img, source_type=SourceTypes.PDF))
            else:
                chunks.append(Chunk(path=file_path, text=text, image=None, source_type=SourceTypes.PDF))
        doc.close()
    return chunks
emcf (Owner) commented Apr 26, 2024

Hi @sisyga , the ai_extraction parameter is only available from the API at the moment.

When running locally on PDFs with lots of pages, I experience this problem too. That is a reasonable workaround, although I don't think it is sufficient for the reasons you mentioned.

I am actually not sure what would be sufficient -- I am toying with the idea of training a page-image classifier to filter out pages without visuals/tables, but this is quite demanding. If you have any additional ideas, I would love to hear them!

sisyga (Author) commented Apr 29, 2024

Hey, thanks for working to open-source the AI classifier. In the meantime, I use the following workaround:

def extract_pdf(file_path: str, ai_extraction: bool = False, text_only: bool = False, verbose: bool = False, limit: int = None) -> List[Chunk]:
    chunks = []
    if ai_extraction:
        with open(file_path, "rb") as f:
            response = requests.post(
                url=API_URL,
                files={'file': (file_path, f)},
                data={'api_key': THEPIPE_API_KEY, 'ai_extraction': ai_extraction, 'text_only': text_only, 'limit': limit}
            )
        try:
            response_json = response.json()
        except json.JSONDecodeError:
            raise ValueError(f"Our backend likely couldn't handle this request. This can happen with large content such as videos, streams, or very large files/websites. Re")
        if 'error' in response_json:
            raise ValueError(f"{response_json['error']}")
        messages = response_json['messages']
        chunks = create_chunks_from_messages(messages)
    else:
        import fitz  # PyMuPDF
        # extract text and images of each page from the PDF
        doc = fitz.open(file_path)
        for page in doc:
            text = page.get_text()
            image_list = page.get_image_info()
            drawing_count = len(page.get_drawings())
            if text_only:
                chunks.append(Chunk(path=file_path, text=text, image=None, source_type=SourceTypes.PDF))
            elif image_list or drawing_count > 5:
                # only snapshot the page if it embeds an image or has more than 5 drawing commands
                pix = page.get_pixmap()
                img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                chunks.append(Chunk(path=file_path, text=text, image=img, source_type=SourceTypes.PDF))
            else:
                chunks.append(Chunk(path=file_path, text=text, image=None, source_type=SourceTypes.PDF))
        doc.close()
    return chunks

Basically, I count the page's drawing commands, and if the count exceeds a threshold (here 5, which could be exposed as an option), I take an image snapshot of the page. This works well in practice because complex formulas and table rules also count toward the drawing commands, which is exactly what I want captured.
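To make the threshold configurable, the decision could be factored out of extract_pdf into a small predicate. A minimal sketch of that logic follows; the helper name needs_snapshot and the drawing_threshold parameter are hypothetical, not part of thepipe (in the real function, image_count and drawing_count would come from page.get_image_info() and page.get_drawings()):

```python
def needs_snapshot(image_count: int, drawing_count: int,
                   text_only: bool = False, drawing_threshold: int = 5) -> bool:
    """Decide whether a PDF page should be rendered to an image.

    A page is snapshotted when it embeds at least one raster image,
    or when its vector-drawing command count exceeds the threshold
    (formula strokes and table rules both count as drawings).
    """
    if text_only:
        return False
    return image_count > 0 or drawing_count > drawing_threshold

# A plain text page stays text-only; a table-heavy page gets snapshotted.
print(needs_snapshot(image_count=0, drawing_count=2))   # → False
print(needs_snapshot(image_count=0, drawing_count=12))  # → True
```

Raising drawing_threshold trades image coverage for a lower token count, so a caller could tune it per document type.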

emcf added and then removed the wontfix (This will not be worked on) label on Apr 30, 2024