How to get the page number of each figure? #75

LiyingCheng95 · 2024-03-20T07:05:38Z

I want to crop all the figures/images/tables in one PDF. Can I get the page number of each figure from doc.figures[x]?

kyleclo · 2024-03-20T23:07:35Z

Please check out this example snippet from #63:

import json
import os
import pathlib

from papermage.magelib import Document
from papermage.recipes import CoreRecipe
from papermage.visualizers.visualizer import plot_entities_on_page

# load doc
recipe = CoreRecipe()
pdfpath = pathlib.Path(__file__).parent.parent / "tests/fixtures/2020.acl-main.447.pdf"
doc = recipe.from_pdf(pdf=pdfpath)

# visualize figures on a page
page_id = 0
figures = doc.pages[page_id].intersect_by_box("figures")
plot_entities_on_page(page_image=doc.images[page_id], entities=figures)

# get the image of a page and its dimensions
page_image = doc.images[page_id]
page_w, page_h = page_image.pilimage.size

# get the bounding box of a figure
figure_box = figures[0].boxes[0]

# convert it
figure_box_xy = figure_box.to_absolute(page_width=page_w, page_height=page_h).xy_coordinates

# crop the image using PIL
page_image._pilimage.crop(figure_box_xy)
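
For readers unsure what the conversion step produces: PIL's crop expects a pixel tuple (left, upper, right, lower). The sketch below shows the arithmetic under the assumption that a papermage Box stores relative (left, top, width, height) values. relative_box_to_pixels is a hypothetical helper for illustration only, not the library's to_absolute implementation.

```python
def _clamp(value, upper):
    """Round to a whole pixel and keep it inside [0, upper]."""
    return min(max(int(round(value)), 0), upper)


def relative_box_to_pixels(l, t, w, h, page_w, page_h):
    """Scale a relative (left, top, width, height) box to pixel (x1, y1, x2, y2).

    Assumption: l, t, w, h are fractions of the page size in [0, 1].
    """
    x1 = _clamp(l * page_w, page_w)
    y1 = _clamp(t * page_h, page_h)
    x2 = _clamp((l + w) * page_w, page_w)
    y2 = _clamp((t + h) * page_h, page_h)
    return (x1, y1, x2, y2)


# A box covering the middle half of the width, starting halfway down the page.
crop_box = relative_box_to_pixels(0.25, 0.5, 0.5, 0.25, page_w=800, page_h=1000)
print(crop_box)  # (200, 500, 600, 750)
```

A tuple in this form is what gets passed to crop in the snippet above.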

LiyingCheng95 · 2024-03-21T08:16:27Z

Thanks for your prompt reply. However, it doesn't work in my case. For example, there is a figure on page 8 of my PDF file. The code below crops that figure correctly, but it requires me to specify the page of each detected figure by hand.

recipe = CoreRecipe()
doc = recipe.run("path to my pdf")

# get the image of a page and its dimensions
page_image = doc.images[8]
page_w, page_h = page_image.pilimage.size

# get the bounding box of a figure
figure_box = doc.figures[0].boxes[0]

# convert it
figure_box_xy = figure_box.to_absolute(page_width=page_w, page_height=page_h).xy_coordinates

# crop the image using PIL
cropped_image = page_image._pilimage.crop(figure_box_xy)

cropped_image.save('cropped_image.jpg')

But when I ran the code below, it raised this error at the line figure_box = figures[0].boxes[0]: "IndexError: list index out of range"

# load doc
recipe = CoreRecipe()
pdfpath = pathlib.Path(__file__).parent / "path to my pdf"
doc = recipe.from_pdf(pdf=pdfpath)

# visualize figures on a page
page_id = 8
figures = doc.pages[page_id].intersect_by_box("figures")
plot_entities_on_page(page_image=doc.images[page_id], entities=figures)

# get the image of a page and its dimensions
page_image = doc.images[page_id]
page_w, page_h = page_image.pilimage.size

# get the bounding box of a figure
figure_box = figures[0].boxes[0]

# convert it
figure_box_xy = figure_box.to_absolute(page_width=page_w, page_height=page_h).xy_coordinates

# crop the image using PIL
cropped_image = page_image._pilimage.crop(figure_box_xy)
cropped_image.save('cropped_image.jpg')

Not sure what's wrong there?

kyleclo · 2024-03-21T08:43:29Z

Do you mind emailing the PDF file?

kyleclo · 2024-03-21T20:21:43Z

Thanks @LiyingCheng95, this is definitely a bug; I'm looking into patching it!

First, it seems like the figure is actually being detected correctly. For example:

recipe = CoreRecipe()
doc = recipe.from_pdf(pdf='your-file.pdf')

# asserts there are definitely figures on page 8
figures = [figure for figure in doc.figures if figure.boxes[0].page == 8]
assert len(figures) > 0
print(f"{figures[0].boxes}")

> [Box[0.12299907267594538, 0.05627375260667959, 0.731138803177521, 0.19940743706854958, 8]]

# i can visualize that figure on page 8
plot_entities_on_page(page_image=doc.images[8], entities=figures)

So I looked into where the bug is coming from. It seems that this cross-layer indexing operation is not finding a match:

figures[0].intersect_by_box("pages")
> []

doc.pages[8].intersect_by_box("figures")
> []

This is super weird because the boxes definitely overlap

doc.pages[8].boxes[0]
> Box[0.027564877832563207, 0.2701246785544094, 0.943916833476601, 0.523800017428919, 8]

figure.boxes[0]
> Box[0.12299907267594538, 0.05627375260667959, 0.731138803177521, 0.19940743706854958, 8]

So I checked and it looks like there's a bug in my Box.is_overlap logic:

figure.boxes[0].is_overlap(page.boxes[0])
> False

I'll work on fixing this.
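
For reference, an axis-aligned overlap test is usually written along the lines below. This is a minimal sketch and not papermage's Box.is_overlap: RelBox and boxes_overlap are hypothetical names, and the field order (left, top, width, height, page) is an assumption based on the printed boxes in this thread.

```python
from typing import NamedTuple


class RelBox(NamedTuple):
    # Hypothetical stand-in for a box; the field order is an assumption.
    l: float
    t: float
    w: float
    h: float
    page: int


def boxes_overlap(a: RelBox, b: RelBox) -> bool:
    """True if both boxes are on the same page and share some area."""
    if a.page != b.page:
        return False
    # Two intervals overlap when each one starts before the other ends.
    x_overlap = a.l < b.l + b.w and b.l < a.l + a.w
    y_overlap = a.t < b.t + b.h and b.t < a.t + a.h
    return x_overlap and y_overlap


full_page = RelBox(0.0, 0.0, 1.0, 1.0, 8)
figure = RelBox(0.1, 0.05, 0.7, 0.2, 8)
below = RelBox(0.1, 0.5, 0.7, 0.2, 8)

print(boxes_overlap(figure, full_page))                    # same page, shared area
print(boxes_overlap(figure, below))                        # same page, disjoint in y
print(boxes_overlap(figure, full_page._replace(page=3)))   # different pages
```

Checking the x and y intervals separately makes it easy to see which axis causes a missed match.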

In the meantime, you should be able to grab all the figures using doc.figures. To check which page a figure is on, read figure.boxes[0].page; for example, [figure for figure in doc.figures if figure.boxes[0].page == 8] keeps only the figures on page 8.
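
The per-page check above can be extended to index every figure by page in one pass. A minimal sketch: group_by_page is a hypothetical helper, not part of papermage, and it relies only on the figure.boxes[0].page attribute shown in this thread. FakeBox and FakeFigure are stand-ins so the snippet runs without a PDF.

```python
from collections import defaultdict
from dataclasses import dataclass, field


def group_by_page(entities):
    """Map page index -> list of entities, using each entity's first box."""
    by_page = defaultdict(list)
    for entity in entities:
        if not entity.boxes:  # skip entities that have no box at all
            continue
        by_page[entity.boxes[0].page].append(entity)
    return dict(by_page)


# Stand-ins for illustration only; with papermage, pass doc.figures instead.
@dataclass
class FakeBox:
    page: int


@dataclass
class FakeFigure:
    name: str
    boxes: list = field(default_factory=list)


figures = [
    FakeFigure("a", [FakeBox(8)]),
    FakeFigure("b", [FakeBox(2)]),
    FakeFigure("c", [FakeBox(8)]),
    FakeFigure("no-box"),
]
pages = group_by_page(figures)
for page, figs in sorted(pages.items()):
    print(page, [f.name for f in figs])
```

With a real document, group_by_page(doc.figures) would give the page index to pass to doc.images[...] when cropping each figure.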
