How to extract inner HTML of an element? #212

Open
Jared-Sprague opened this issue Mar 13, 2024 · 2 comments


@Jared-Sprague

Hello!

I'm trying to use lol_html as an HTML parser to extract all the content within an element, i.e. its inner HTML. However, I haven't figured out the right methods for this yet. Here is what I want to do, given the following HTML snippet:

<article id="main-content">
  <div>
    <p>foo</p>
  </div>
</article>

I want to extract all the content within the element matching the CSS selector article#main-content and store it in a String named content. After this runs, the value of content should be equal to:

  <div>
    <p>foo</p>
  </div>

What would be perfect is a method on the element called get_inner_content(). There seem to be plenty of methods for manipulating the inner content, such as the two below, but none for actually getting it:
set_inner_content
remove_and_keep_content

Maybe I'm thinking about this wrong. Any help with just getting the inner content of an element would be so helpful!

@bglw

bglw commented Mar 14, 2024

I'm just a user, not a maintainer here, but I can at least answer and say that there isn't a good way to achieve this.

Everything in lol-html revolves around it being a streaming HTML transformer, and as a result it doesn't hold on to that stream for very long as it passes through. set_inner_content is trivial, since it can just ignore its stream for a while and use the provided content, and likewise remove_and_keep_content doesn't block the stream since it just removes some of the values passing through.

Anything like get_inner_content, though, would require buffering an arbitrary amount of data from the stream, which would inherently stop it from streaming. If you called get_inner_content on the html element itself, lol-html would cease to be a streaming parser and would have to store the whole document.

There are some solutions for inner text, by adding a text! handler and appending the text to a buffer yourself, but there isn't an equivalent for HTML as far as I'm aware. You would have to create an element! handler and re-serialize the tag/attributes yourself, alongside the text! handler for the text nodes.
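To make the "re-serialize the tag/attributes yourself" step concrete, here is a minimal sketch of just the serialization part, written against the standard library only. The helper names (escape_attr, serialize_start_tag, serialize_end_tag) are hypothetical and not part of lol_html. In a real handler the name and attribute pairs would come from the element passed to the element! handler, and the text from the text! handler; how lol_html exposes those, and whether attribute values arrive already escaped, is something to check against its documentation rather than take from this sketch.

```rust
/// Escape a value so it is safe inside a double-quoted HTML attribute.
/// Assumption: the incoming value is unescaped text.
fn escape_attr(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for c in value.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '"' => out.push_str("&quot;"),
            _ => out.push(c),
        }
    }
    out
}

/// Rebuild a start tag such as `<p class="note">` from its name and
/// (name, value) attribute pairs, in the order given.
fn serialize_start_tag(name: &str, attrs: &[(&str, &str)]) -> String {
    let mut tag = format!("<{name}");
    for (key, value) in attrs {
        tag.push_str(&format!(" {key}=\"{}\"", escape_attr(value)));
    }
    tag.push('>');
    tag
}

/// Rebuild the matching end tag, e.g. `</p>`.
fn serialize_end_tag(name: &str) -> String {
    format!("</{name}>")
}
```

Each handler would then push its piece onto a shared String buffer in document order: start tag, text chunks, end tag. Note that this produces an equivalent serialization, not a byte-for-byte copy of the source markup (original quoting and whitespace inside tags are lost).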

The only viable path I can think of would be to use the el.prepend() and el.append() methods to insert some delimiters into the stream, then process the output yourself afterward to extract the innerHTML between the delimiters.
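The post-processing half of the delimiter idea could look roughly like the sketch below. It assumes the rewritten output is already available as a string and that the delimiters were inserted with el.prepend() and el.append() as HTML comments. The marker strings and the extract_between helper are made up for illustration, and the markers must not otherwise occur in the document.

```rust
// Hypothetical delimiter strings; any markers that cannot appear in the
// real document would do.
const START: &str = "<!--inner-start-->";
const END: &str = "<!--inner-end-->";

/// Return the slice of `output` between the first `start` marker and the
/// first `end` marker that follows it, or None if either is missing.
fn extract_between<'a>(output: &'a str, start: &str, end: &str) -> Option<&'a str> {
    // Byte offset just past the start marker.
    let from = output.find(start)? + start.len();
    // Length of the content up to the end marker, searched only after `from`.
    let len = output[from..].find(end)?;
    Some(&output[from..from + len])
}
```

With the article from the original question, the rewritten output would carry the markers just inside the article tags, and extract_between(&output, START, END) would yield the div and its contents. This still means holding the whole rewritten output in memory, which is the buffering cost described above.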

For anything more involved, I'd look at using kuchiki(ki) instead.

@mwcz
Contributor

mwcz commented Mar 26, 2024

Related #40 #78
