Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Content Iterator #177

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Conversation

mickael-menu
Copy link
Member

See the full proposal

Summary

This proposal introduces a new Publication Service to iterate through a publication's content extracted as semantic elements.

The Content Iterator service provides a building block for many high-level features requiring access to the raw content of a publication, such as:

  • Text to speech
  • Accessibility readers
  • Basic search
  • Full-text search indexing
  • Image or audio indexes

Today, implementing such features is complex because you need to:

  1. (maybe) Find a starting location in one of the reading order resources.
  2. Iterate through the reading order, opening and closing resources when needed.
  3. Extracting the textual or media content from the raw resource, which is different for every supported media type.

The Content Iterator handles all that in a media type agnostic way.

Review notes

  • The whole Content class and its components need some review to account for all needs.

* `previous() -> Content?`
* Returns the previous `Content` element in the iterator, or `null` when reaching the beginning.
* `next() -> Content?`
* Returns the next `Content` element in the iterator, or `null` when reaching the end.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should make it explicit that these methods make the iterator move forward or backward.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the documentation comment? I'm fine making it more explicit there.

I wouldn't change the function signature as it's pretty common though:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was talking about the documentation.

##### Properties

* `locator: Locator`
* Locator targeting this element in the `Publication`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we be able to get a locator for any target data? Think of three successive images without HTML ids. I believe that neither fragments nor text after/before can be used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't guarantee that it will be the case for all media types, but for the ones we have so far I think so.

With image elements, you can use a cssSelector in the locations object. I filled it for all locators in the HtmlResourceContentIterator, as even with a text context it can help limit the search scope to a parent element.

If there's ever a case where we can't target precisely, this Locator should at least match the closest targetable parent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see.

* `style: TextStyle`
* Semantic style for this element.
* `spans: [TextSpan]`
* List of text spans in this element.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the rationale behind grouping multiple TextSpans into a single Content.Data.Text?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example if you have the body paragraph:

<body lang="en">
    <p>The correct pronunciation is <span lang="fr">croissant</span>, and not croissant.</p>
</body>

We want the whole paragraph as a single semantic Text element. However to not loose the information about the language (which is important for TTS), you can split the Text element into a list of spans which are basically "attributed ranges" in the text element:

{
  "locator": {},
  "data": {
    "style": "body",
    "spans": [
      { "text": "The correct pronunciation is ", "language": "en" },
      { "text": "croissant", "language": "fr" },
      { "text": ", and not croissant.", "language": "en" }
    ]
  }
}

Like @danielweck mentioned on the call, the term span can be confusing when parsing HTML, as they are no 1-1 relationship between the HTML spans and the produced elements.

However thinking more about this, I think the term is still the most accurate. The problem happens only with HTML media types and the semantic seems to be correct:

I think it even matches the meaning from the HTML spec:

The span element doesn't mean anything on its own, but can be useful when used together with the global attributes, e.g. class, lang, or dir.
https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-span-element

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danielweck I'm renaming span into segment to lift any ambiguity. (Other candidates: part, chunk, portion)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will also rename style into role, to remove the notion of appearance that style brings.


### Extracting the data from `Content` elements

The `Content` elements are value objects containing:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some new elements to take into account:

  • SVG
    • can have an aria-label or title child element
  • MathML
    • extracting the tt child elements could be useful
  • MathJax

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants