Merge pull request #8 from flavorjones/flavorjones-always-return-nodeset

Inference.parse always returns a NodeSet for fragments
flavorjones · May 5, 2024 · 137331a
2 parents bbe3531 + 7426a36
Showing 4 changed files with 64 additions and 78 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,6 +2,7 @@
 
 - Use a `<template>` tag as the context node for the majority of fragment parsing, which greatly simplifies this gem. #7 @flavorjones @stevecheckoway
 - Clean up the README. @marcoroth
+- `Nokogiri::HTML5::Inference.parse` always returns a `Nokogiri::XML::NodeSet` for fragments. Previously this method sometimes returned a `Nokogiri::HTML5::DocumentFragment`, but some API inconsistencies between `DocumentFragment` and `NodeSet` made using the returned object tricky. We hope this provides a more consistent development experience. @flavorjones
 
 
 ## [0.2.0] - 2024-04-26

diff --git a/README.md b/README.md
@@ -2,22 +2,15 @@
 
 Given HTML5 input, make a reasonable guess at how to parse it correctly.
 
-`Nokogiri::HTML5::Inference` makes reasonable inferences that work for both HTML5 documents and HTML5
-fragments, and for all the different HTML5 tags that a web developer might need in a view library.
+`Nokogiri::HTML5::Inference` makes reasonable inferences that work for both HTML5 documents and HTML5 fragments, and for all the different HTML5 tags that a web developer might need in a view library.
 
 This is useful for parsing trusted content like view snippets, particularly for morphing cases like StimulusReflex.
 
 ## The problem this library solves
 
-The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very precise
-context-dependent parsing rules which can make it challenging to "just parse" a fragment of HTML
-without knowing the parent node -- also called the "context node" -- in which it will be inserted.
+The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very precise context-dependent parsing rules which can make it challenging to "just parse" a fragment of HTML without knowing the parent node -- also called the "context node" -- in which it will be inserted.
 
-Most content in an HTML5 document can be parsed assuming the parser's mode will be in the
-["in body" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody),
-but there are some notable exceptions. Perhaps the most problematic to web developers are the
-table-related tags, which will not be parsed properly unless the parser is in the
-["in table" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
+Most content in an HTML5 document can be parsed assuming the parser's mode will be in the ["in body" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody), but there are some notable exceptions. Perhaps the most problematic to web developers are the table-related tags, which will not be parsed properly unless the parser is in the ["in table" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
 
 For example:
 
@@ -26,9 +19,7 @@ Nokogiri::HTML5::DocumentFragment.parse("<td>foo</td>").to_html
 # => "foo" # where did the tag go!?
 ```
 
-In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here",
-and drop the tag. This particular fragment must be parsed "in the context" of a table in order to
-parse properly.
+In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here", and drop the tag. This particular fragment must be parsed "in the context" of a table in order to parse properly.
 
 Thankfully, libgumbo and Nokogiri allow us to set the context node:
 
@@ -41,9 +32,7 @@ Nokogiri::HTML5::DocumentFragment.new(
 # => "<tbody><tr><td>foo</td></tr></tbody>"
 ```
 
-This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be
-_intermediate parent tags_ that the HTML5 spec requires to be inserted by the parser. In this case,
-the `<td>` tag must be wrapped in `<tbody><tr>` tags.
+This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be _intermediate parent tags_ that the HTML5 spec requires to be inserted by the parser. In this case, the `<td>` tag must be wrapped in `<tbody><tr>` tags.
 
 We can fix this to only return the tags we provided by using the `<template>` tag as the context node, which the HTML5 spec provides exactly for this purpose:
 
@@ -103,19 +92,7 @@ Nokogiri::HTML5::Inference.parse(html)
 # })
 ```
 
-If the input is a fragment that is parsed normally, you'll either get a `Nokogiri::HTML5::DocumentFragment` back:
-
-``` ruby
-Nokogiri::HTML5::Inference.parse("<div>hello,</div><div>world!</div>")
-# => #(DocumentFragment:0x34f8 {
-# name = "#document-fragment",
-# children = [
-# #(Element:0x3624 { name = "div", children = [ #(Text "hello,")] }),
-# #(Element:0x3804 { name = "div", children = [ #(Text "world!")] })]
-# })
-```
-
-or, if there are intermediate parent tags that need to be removed, you'll get a `Nokogiri::XML::NodeSet`:
+If the input is a fragment, you'll get back a `Nokogiri::XML::NodeSet`:
 
 ``` ruby
 Nokogiri::HTML5::Inference.parse("<tr><td>hello</td><td>world!</td></tr>")
@@ -128,14 +105,12 @@ Nokogiri::HTML5::Inference.parse("<tr><td>hello</td><td>world!</td></tr>")
 # ]
 ```
 
-All of these return types respond to the same query methods like `#css` and `#xpath`, tree-traversal
-methods like `#children`, and serialization methods like `#to_html`.
+Both of these return types respond to the same query methods like `#css` and `#xpath`, tree-traversal methods like `#children`, and serialization methods like `#to_html`.
 
 
 ## Caveats
 
-The implementation is currently pretty hacky and only looks at the first tag in the input to make
-decisions. Nonetheless, it is a step forward from what Nokogiri and libgumbo do out-of-the-box.
+The implementation is currently pretty hacky and only looks at the first tag in the input to make decisions. Nonetheless, it is a step forward from what Nokogiri and libgumbo do out-of-the-box.
 
 The implementation also is almost certainly incomplete, meaning there are HTML5 tags that aren't handled by this library as you might expect.
 

diff --git a/lib/nokogiri/html5/inference.rb b/lib/nokogiri/html5/inference.rb
@@ -12,15 +12,18 @@ module Nokogiri
  module HTML5
  # :markup: markdown
  #
- # The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very precise
- # context-dependent parsing rules which can make it challenging to "just parse" a fragment of HTML
- # without knowing the parent node -- also called the "context node" -- in which it will be inserted.
+ # The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very
+ # precise context-dependent parsing rules which can make it challenging to "just parse" a
+ # fragment of HTML without knowing the parent node -- also called the "context node" -- in
+ # which it will be inserted.
  #
  # Most content in an HTML5 document can be parsed assuming the parser's mode will be in the
- # ["in body" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody),
- # but there are some notable exceptions. Perhaps the most problematic to web developers are the
- # table-related tags, which will not be parsed properly unless the parser is in the
- # ["in table" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
+ # ["in body" insertion
+ # mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody), but there
+ # are some notable exceptions. Perhaps the most problematic to web developers are the
+ # table-related tags, which will not be parsed properly unless the parser is in the ["in
+ # table" insertion
+ # mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
  #
  # For example:
  #
@@ -29,9 +32,9 @@ module HTML5
  # # => "foo" # where did the tag go!?
  # ```
  #
- # In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here",
- # and drop the tag. This particular fragment must be parsed "in the context" of a table in order to
- # parse properly.
+ # In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed
+ # here", and drop the tag. This particular fragment must be parsed "in the context" of a
+ # table in order to parse properly.
  #
  # Thankfully, libgumbo and Nokogiri allow us to set the context node:
  #
@@ -44,11 +47,12 @@ module HTML5
  # # => "<tbody><tr><td>foo</td></tr></tbody>"
  # ```
  #
- # This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be
- # _intermediate parent tags_ that the HTML5 spec requires to be inserted by the parser. In this case,
- # the `<td>` tag must be wrapped in `<tbody><tr>` tags.
+ # This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action:
+ # there may be _intermediate parent tags_ that the HTML5 spec requires to be inserted by the
+ # parser. In this case, the `<td>` tag must be wrapped in `<tbody><tr>` tags.
  #
- # We can fix this to only return the tags we provided by using the `<template>` tag as the context node, which the HTML5 spec provides exactly for this purpose:
+ # We can fix this to only return the tags we provided by using the `<template>` tag as the
+ # context node, which the HTML5 spec provides exactly for this purpose:
  #
  # ``` ruby
  # Nokogiri::HTML5::DocumentFragment.new(
@@ -59,7 +63,7 @@ module HTML5
  # # => "<td>foo</td>"
  # ```
  #
- # Huzzah! That works. And it's precisely what Nokogiri::HTML5::Inference.parse does:
+ # Huzzah! That works. And it's precisely what `Nokogiri::HTML5::Inference.parse` does:
  #
  # ``` ruby
  # Nokogiri::HTML5::Inference.parse("<td>foo</td>").to_html
@@ -88,7 +92,7 @@ module PluckRegexp # :nodoc:
  class << self
  #
  # call-seq:
- # parse(input, pluck: true) => (Nokogiri::HTML5::Document | Nokogiri::HTML5::DocumentFragment | Nokogiri::XML::NodeSet)
+ # parse(input, pluck: true) => (Nokogiri::HTML5::Document | Nokogiri::XML::NodeSet)
  #
  # Based on the start of the input HTML5 string, guess whether it's a full document or a
  # fragment and, using the fragment context node if necessary, parse it properly and
@@ -112,8 +116,7 @@ class << self
  #
  # [Returns]
  # - A +Nokogiri::HTML5::Document+ if the input appears to represent a full document.
- # - A +Nokogiri::HTML5::DocumentFragment+ or a +Nokogiri::XML::NodeSet+ if the input
- # appears to be a fragment.
+ # - A +Nokogiri::XML::NodeSet+ if the input appears to be a fragment.
  #
  def parse(input, pluck: true)
  context = Nokogiri::HTML5::Inference.context(input)
@@ -124,7 +127,7 @@ def parse(input, pluck: true)
  if pluck && (path = pluck_path(input))
  fragment.xpath(path)
  else
- fragment
+ fragment.children
  end
  end
  end

diff --git a/test/nokogiri/html5/test_inference.rb b/test/nokogiri/html5/test_inference.rb
@@ -61,37 +61,40 @@
  describe ".parse" do
  describe "passed a Document with doctype" do
  it "returns a Document" do
- assert_equal(
- "<!DOCTYPE html><html><head></head><body></body></html>",
- Nokogiri::HTML5::Inference.parse("<!doctype html><html><head></head><body></body></html>").to_html
- )
- assert_equal(
- "<!DOCTYPE html><html><head></head><body></body></html>",
- Nokogiri::HTML5::Inference.parse("<!DOCTYPE HTML><HTML><HEAD></HEAD><BODY></BODY></HTML>").to_html
- )
+ actual = Nokogiri::HTML5::Inference.parse("<!doctype html><html><head></head><body></body></html>")
+
+ assert_kind_of(Nokogiri::HTML5::Document, actual)
+ assert_equal("<!DOCTYPE html><html><head></head><body></body></html>", actual.to_html)
+
+ actual = Nokogiri::HTML5::Inference.parse("<!DOCTYPE HTML><HTML><HEAD></HEAD><BODY></BODY></HTML>")
+
+ assert_kind_of(Nokogiri::HTML5::Document, actual)
+ assert_equal("<!DOCTYPE html><html><head></head><body></body></html>", actual.to_html)
  end
  end
 
  describe "passed a Document without doctype" do
  it "returns a Document" do
- assert_equal(
- "<html><head></head><body></body></html>",
- Nokogiri::HTML5::Inference.parse("<html><head></head><body></body></html>").to_html
- )
- assert_equal(
- "<html><head></head><body></body></html>",
- Nokogiri::HTML5::Inference.parse("<HTML><HEAD></HEAD><BODY></BODY></HTML>").to_html
- )
+ actual = Nokogiri::HTML5::Inference.parse("<html><head></head><body></body></html>")
+
+ assert_kind_of(Nokogiri::HTML5::Document, actual)
+ assert_equal("<html><head></head><body></body></html>", actual.to_html)
+
+ actual = Nokogiri::HTML5::Inference.parse("<HTML><HEAD></HEAD><BODY></BODY></HTML>")
+
+ assert_kind_of(Nokogiri::HTML5::Document, actual)
+ assert_equal("<html><head></head><body></body></html>", actual.to_html)
  end
  end
 
  fragment_actions.each do |context, fragments|
  describe "passed a fragment requiring 'in #{context}' insertion mode" do
  fragments.each do |fragment|
  it "parses '#{fragment}' correctly" do
- actual = Nokogiri::HTML5::Inference.parse(fragment).to_html
+ actual = Nokogiri::HTML5::Inference.parse(fragment)
 
- assert_equal(fragment, actual)
+ assert_kind_of(Nokogiri::XML::NodeSet, actual)
+ assert_equal(fragment, actual.to_html)
  end
  end
  end
@@ -100,37 +103,41 @@
  describe "passed a Fragment containing head and body" do
  it "returns a Fragment containing both head and body" do
  fragment = "<head></head><body></body>"
- actual = Nokogiri::HTML5::Inference.parse(fragment).to_html
+ actual = Nokogiri::HTML5::Inference.parse(fragment)
 
- assert_equal(fragment, actual)
+ assert_kind_of(Nokogiri::XML::NodeSet, actual)
+ assert_equal(fragment, actual.to_html)
  end
  end
 
  describe "passed a Fragment containing head and p" do
  it "returns a Fragment containing both head and body" do
  fragment = "<head><p>"
  expected = "<head></head><body><p></p></body>"
- actual = Nokogiri::HTML5::Inference.parse(fragment).to_html
+ actual = Nokogiri::HTML5::Inference.parse(fragment)
 
- assert_equal(expected, actual)
+ assert_kind_of(Nokogiri::XML::NodeSet, actual)
+ assert_equal(expected, actual.to_html)
  end
  end
 
  describe "multiple children" do
  it "parses correctly" do
  fragment = "<tr><td>hello</td></tr><tr><td>world</td></tr>"
- actual = Nokogiri::HTML5::Inference.parse(fragment).to_html
+ actual = Nokogiri::HTML5::Inference.parse(fragment)
 
- assert_equal(fragment, actual)
+ assert_kind_of(Nokogiri::XML::NodeSet, actual)
+ assert_equal(fragment, actual.to_html)
  end
 
  describe "with pluck: false" do
  it "includes the additional sibling nodes created" do
  fragment = "<body><div>hello</div></body>"
  expected = "<head></head>#{fragment}"
- actual = Nokogiri::HTML5::Inference.parse(fragment, pluck: false).to_html
+ actual = Nokogiri::HTML5::Inference.parse(fragment, pluck: false)
 
- assert_equal(expected, actual)
+ assert_kind_of(Nokogiri::XML::NodeSet, actual)
+ assert_equal(expected, actual.to_html)
  end
  end
  end