Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bnewm0609/layer slice #44

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Bnewm0609/layer slice #44

wants to merge 2 commits into from

Conversation

bnewm0609
Copy link
Contributor

@bnewm0609 bnewm0609 commented Aug 16, 2023

Begins an implementation of Layer to wrap the List[Entities] and allow for more intuitive slicing (e.g. doc.sentences[:3].text instead of [sent.text for sent in doc.sentences]. This addresses #24 .

The new data structure is implemented in papermage/magelib/layer.py and inherits from python's UserList. The changes into integrate Layer were mainly made in the Document data structure.

One design decision to consider is what to do with chained access: e.g. doc.pages.paragraphs.sentences.tokens. Currently, each access creates a new layer, so doing the above would create a four-dimensional list. Two consequences of this decision:

  1. To get the first token, you would have to write doc.pages.paragraphs.sentences.tokens[0][0][0][0], which is a bit ugly.
  2. doc.pages.paragraphs.pages does not return the original Layer of pages., which is a bit uninutitive

The main question is: "Should chained accessing return the union of all of the entities in a single layer or should it return the entities in the shape of the chained accessing?"

As another example, if the doc is

Paragraph 1: "I am. I was."
Paragraph 2: "You are. You were."

Sentence 1: "I am."
Sentence 2: "I was."
Sentence 3: "You are."
Sentence 4: "You were."

Which should doc.paragraphs.sentences.text return?

# Option 1 - currently implemented
[[["I", "am", "."], ["I", "was", "."]], [["You", "are", "."], ["You", "were", "."]]]

# Option 2
["I", "am", ".", "I", "was", ".", "You", "are", ".", "You", "were", "."]

self.assertListEqual(doc.chunks[2].tokens, [tokens[5]])
self.assertSequenceEqual(doc.chunks[0].tokens, tokens[0:3])
self.assertSequenceEqual(doc.chunks[1].tokens, tokens[3:5])
self.assertSequenceEqual(doc.chunks[2].tokens, [tokens[5]])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assertListEqual enforces that both arguments are list, while assertSequenceEqual does not. Because the first argument is a Layer, we use assertSequenceEqual.

@bnewm0609 bnewm0609 linked an issue Aug 16, 2023 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Layer definition with Slice compatibility
1 participant