How to extract inner HTML of an element? #212

Jared-Sprague · 2024-03-13T14:33:53Z

Hello!

I'm trying to use lol_html as an HTML parser to extract all content within an element, i.e. its inner HTML. However, I haven't figured out the right methods for this yet. Here is what I want to do, given the following HTML snippet:

<article id="main-content">
  <div>
    <p>foo</p>
  </div>
</article>

I want to extract all the content within the element matching the CSS selector article#main-content and store it in a String named content. After this runs, the value of content should be equal to:

  <div>
    <p>foo</p>
  </div>

What would be perfect is a method on the element called get_inner_content(). There seem to be methods for manipulating the inner content, such as the following, but none for actually getting it:
set_inner_content
remove_and_keep_content

Maybe I'm thinking about this wrong. Any help with just getting the inner content of an element would be so helpful!

bglw · 2024-03-14T07:28:02Z

I'm just a user, not a maintainer here, but I can at least answer and say that there isn't a good way to achieve this.

Everything in lol-html revolves around it being a streaming HTML transformer, and as a result it doesn't hold on to that stream for very long as it passes through. set_inner_content is trivial, since it can just ignore its stream for a while and use the provided content, and likewise remove_and_keep_content doesn't block the stream since it just removes some of the values passing through.

Anything like get_inner_content though would require buffering an arbitrary amount of data from the stream, which would inherently stop it from streaming. If you called get_inner_content on the html element itself, lol-html would cease to be a streaming parser and would have to store the whole document.

There are some solutions for inner text, by adding a text! handler and appending the text to a buffer yourself, but there isn't an equivalent for HTML as far as I'm aware. You would have to create an element! handler and re-serialize the tag/attributes yourself, alongside the text! handler for the text nodes.
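
To make the re-serializing idea a bit more concrete, here is a rough sketch of the buffer side only, using nothing but the standard library. InnerHtmlBuffer and its methods are hypothetical names of mine, not part of lol-html. The assumption is that your element! handler would call push_start_tag with the tag name and attributes it sees, your text! handler would call push_text, and something would call push_end_tag when the element closes. It also assumes attribute values arrive unescaped; if the handler hands you raw source text, drop the escaping step.

```rust
/// Hypothetical helper (not part of lol-html): collects markup that the
/// caller re-serializes piece by piece while the document streams past.
#[derive(Default)]
pub struct InnerHtmlBuffer {
    html: String,
}

impl InnerHtmlBuffer {
    /// Append a start tag built from a tag name and (name, value) pairs.
    /// Assumption: attribute values are passed in unescaped.
    pub fn push_start_tag(&mut self, name: &str, attrs: &[(&str, &str)]) {
        self.html.push('<');
        self.html.push_str(name);
        for (key, value) in attrs {
            self.html.push(' ');
            self.html.push_str(key);
            self.html.push_str("=\"");
            // Escape the characters that would break a double-quoted value.
            for ch in value.chars() {
                match ch {
                    '&' => self.html.push_str("&amp;"),
                    '"' => self.html.push_str("&quot;"),
                    _ => self.html.push(ch),
                }
            }
            self.html.push('"');
        }
        self.html.push('>');
    }

    /// Append a text chunk verbatim.
    pub fn push_text(&mut self, text: &str) {
        self.html.push_str(text);
    }

    /// Append an end tag for the given tag name.
    pub fn push_end_tag(&mut self, name: &str) {
        self.html.push_str("</");
        self.html.push_str(name);
        self.html.push('>');
    }

    /// Consume the buffer and return the collected markup.
    pub fn into_string(self) -> String {
        self.html
    }
}
```

This only shows why it gets tedious: comments, void elements, and whitespace all need their own handling before the result matches the original markup byte for byte.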

The only viable path I can think of would be to use the el.prepend() and el.append() methods to insert some delimiters into the stream, then process the rewritten output yourself afterward to extract the innerHTML between the delimiters.
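
A minimal sketch of the post-processing half of that delimiter idea, again standard library only. It assumes the rewritten output is already in a string and that the handler for article#main-content inserted two marker comments as HTML via el.prepend() and el.append(). The marker strings and the extract_between name are made up for illustration; pick markers that cannot occur in the real document.

```rust
/// Hypothetical helper: return the text between the first `start` marker
/// and the first `end` marker that follows it, or None if either is missing.
pub fn extract_between<'a>(output: &'a str, start: &str, end: &str) -> Option<&'a str> {
    // Index just past the start marker.
    let from = output.find(start)? + start.len();
    // Offset of the end marker, relative to `from`.
    let len = output[from..].find(end)?;
    Some(&output[from..from + len])
}
```

With the markers sitting just inside the article tags, calling extract_between(&rewritten, "<!--inner-start-->", "<!--inner-end-->") should hand back the div block from the original question, surrounding whitespace included. Note that this still buffers the whole output, so it gives up the streaming benefit.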

For anything more involved, I'd look at using kuchiki(ki) instead.

mwcz · 2024-03-26T16:50:40Z

Related #40 #78
