Commit

Merge pull request #8 from flavorjones/flavorjones-always-return-nodeset
Inference.parse always returns a NodeSet for fragments
flavorjones committed May 5, 2024
2 parents bbe3531 + 7426a36 commit 137331a
Showing 4 changed files with 64 additions and 78 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -2,6 +2,7 @@

- Use a `<template>` tag as the context node for the majority of fragment parsing, which greatly simplifies this gem. #7 @flavorjones @stevecheckoway
- Clean up the README. @marcoroth
+- `Nokogiri::HTML5::Inference.parse` always returns a `Nokogiri::XML::NodeSet` for fragments. Previously this method sometimes returned a `Nokogiri::HTML5::DocumentFragment`, but some API inconsistencies between `DocumentFragment` and `NodeSet` made using the returned object tricky. We hope this provides a more consistent development experience. @flavorjones


## [0.2.0] - 2024-04-26
41 changes: 8 additions & 33 deletions README.md
@@ -2,22 +2,15 @@

Given HTML5 input, make a reasonable guess at how to parse it correctly.

-`Nokogiri::HTML5::Inference` makes reasonable inferences that work for both HTML5 documents and HTML5
-fragments, and for all the different HTML5 tags that a web developer might need in a view library.
+`Nokogiri::HTML5::Inference` makes reasonable inferences that work for both HTML5 documents and HTML5 fragments, and for all the different HTML5 tags that a web developer might need in a view library.

This is useful for parsing trusted content like view snippets, particularly for morphing cases like StimulusReflex.

## The problem this library solves

-The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very precise
-context-dependent parsing rules which can make it challenging to "just parse" a fragment of HTML
-without knowing the parent node -- also called the "context node" -- in which it will be inserted.
+The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very precise context-dependent parsing rules which can make it challenging to "just parse" a fragment of HTML without knowing the parent node -- also called the "context node" -- in which it will be inserted.

-Most content in an HTML5 document can be parsed assuming the parser's mode will be in the
-["in body" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody),
-but there are some notable exceptions. Perhaps the most problematic to web developers are the
-table-related tags, which will not be parsed properly unless the parser is in the
-["in table" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
+Most content in an HTML5 document can be parsed assuming the parser's mode will be in the ["in body" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody), but there are some notable exceptions. Perhaps the most problematic to web developers are the table-related tags, which will not be parsed properly unless the parser is in the ["in table" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).

For example:

@@ -26,9 +19,7 @@ Nokogiri::HTML5::DocumentFragment.parse("<td>foo</td>").to_html
# => "foo" # where did the tag go!?
```

-In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here",
-and drop the tag. This particular fragment must be parsed "in the context" of a table in order to
-parse properly.
+In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here", and drop the tag. This particular fragment must be parsed "in the context" of a table in order to parse properly.

Thankfully, libgumbo and Nokogiri allow us to set the context node:

@@ -41,9 +32,7 @@ Nokogiri::HTML5::DocumentFragment.new(
# => "<tbody><tr><td>foo</td></tr></tbody>"
```

-This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be
-_intermediate parent tags_ that the HTML5 spec requires to be inserted by the parser. In this case,
-the `<td>` tag must be wrapped in `<tbody><tr>` tags.
+This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be _intermediate parent tags_ that the HTML5 spec requires to be inserted by the parser. In this case, the `<td>` tag must be wrapped in `<tbody><tr>` tags.

We can fix this to only return the tags we provided by using the `<template>` tag as the context node, which the HTML5 spec provides exactly for this purpose:

@@ -103,19 +92,7 @@ Nokogiri::HTML5::Inference.parse(html)
# })
```

-If the input is a fragment that is parsed normally, you'll either get a `Nokogiri::HTML5::DocumentFragment` back:
-
-``` ruby
-Nokogiri::HTML5::Inference.parse("<div>hello,</div><div>world!</div>")
-# => #(DocumentFragment:0x34f8 {
-# name = "#document-fragment",
-# children = [
-# #(Element:0x3624 { name = "div", children = [ #(Text "hello,")] }),
-# #(Element:0x3804 { name = "div", children = [ #(Text "world!")] })]
-# })
-```
-
-or, if there are intermediate parent tags that need to be removed, you'll get a `Nokogiri::XML::NodeSet`:
+If the input is a fragment, you'll get back a `Nokogiri::XML::NodeSet`:

``` ruby
Nokogiri::HTML5::Inference.parse("<tr><td>hello</td><td>world!</td></tr>")
@@ -128,14 +105,12 @@ Nokogiri::HTML5::Inference.parse("<tr><td>hello</td><td>world!</td></tr>")
# ]
```

-All of these return types respond to the same query methods like `#css` and `#xpath`, tree-traversal
-methods like `#children`, and serialization methods like `#to_html`.
+Both of these return types respond to the same query methods like `#css` and `#xpath`, tree-traversal methods like `#children`, and serialization methods like `#to_html`.


## Caveats

-The implementation is currently pretty hacky and only looks at the first tag in the input to make
-decisions. Nonetheless, it is a step forward from what Nokogiri and libgumbo do out-of-the-box.
+The implementation is currently pretty hacky and only looks at the first tag in the input to make decisions. Nonetheless, it is a step forward from what Nokogiri and libgumbo do out-of-the-box.

The implementation also is almost certainly incomplete, meaning there are HTML5 tags that aren't handled by this library as you might expect.

41 changes: 22 additions & 19 deletions lib/nokogiri/html5/inference.rb
@@ -12,15 +12,18 @@ module Nokogiri
module HTML5
# :markup: markdown
#
-# The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very precise
-# context-dependent parsing rules which can make it challenging to "just parse" a fragment of HTML
-# without knowing the parent node -- also called the "context node" -- in which it will be inserted.
+# The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very
+# precise context-dependent parsing rules which can make it challenging to "just parse" a
+# fragment of HTML without knowing the parent node -- also called the "context node" -- in
+# which it will be inserted.
#
# Most content in an HTML5 document can be parsed assuming the parser's mode will be in the
-# ["in body" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody),
-# but there are some notable exceptions. Perhaps the most problematic to web developers are the
-# table-related tags, which will not be parsed properly unless the parser is in the
-# ["in table" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
+# ["in body" insertion
+# mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody), but there
+# are some notable exceptions. Perhaps the most problematic to web developers are the
+# table-related tags, which will not be parsed properly unless the parser is in the ["in
+# table" insertion
+# mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
#
# For example:
#
@@ -29,9 +32,9 @@ module HTML5
# # => "foo" # where did the tag go!?
# ```
#
-# In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here",
-# and drop the tag. This particular fragment must be parsed "in the context" of a table in order to
-# parse properly.
+# In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed
+# here", and drop the tag. This particular fragment must be parsed "in the context" of a
+# table in order to parse properly.
#
# Thankfully, libgumbo and Nokogiri allow us to set the context node:
#
@@ -44,11 +47,12 @@ module HTML5
# # => "<tbody><tr><td>foo</td></tr></tbody>"
# ```
#
-# This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be
-# _intermediate parent tags_ that the HTML5 spec requires to be inserted by the parser. In this case,
-# the `<td>` tag must be wrapped in `<tbody><tr>` tags.
+# This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action:
+# there may be _intermediate parent tags_ that the HTML5 spec requires to be inserted by the
+# parser. In this case, the `<td>` tag must be wrapped in `<tbody><tr>` tags.
#
-# We can fix this to only return the tags we provided by using the `<template>` tag as the context node, which the HTML5 spec provides exactly for this purpose:
+# We can fix this to only return the tags we provided by using the `<template>` tag as the
+# context node, which the HTML5 spec provides exactly for this purpose:
#
# ``` ruby
# Nokogiri::HTML5::DocumentFragment.new(
@@ -59,7 +63,7 @@ module HTML5
# # => "<td>foo</td>"
# ```
#
-# Huzzah! That works. And it's precisely what Nokogiri::HTML5::Inference.parse does:
+# Huzzah! That works. And it's precisely what `Nokogiri::HTML5::Inference.parse` does:
#
# ``` ruby
# Nokogiri::HTML5::Inference.parse("<td>foo</td>").to_html
@@ -88,7 +92,7 @@ module PluckRegexp # :nodoc:
class << self
#
# call-seq:
-# parse(input, pluck: true) => (Nokogiri::HTML5::Document | Nokogiri::HTML5::DocumentFragment | Nokogiri::XML::NodeSet)
+# parse(input, pluck: true) => (Nokogiri::HTML5::Document | Nokogiri::XML::NodeSet)
#
# Based on the start of the input HTML5 string, guess whether it's a full document or a
# fragment and, using the fragment context node if necessary, parse it properly and
@@ -112,8 +116,7 @@ class << self
#
# [Returns]
# - A +Nokogiri::HTML5::Document+ if the input appears to represent a full document.
-# - A +Nokogiri::HTML5::DocumentFragment+ or a +Nokogiri::XML::NodeSet+ if the input
-# appears to be a fragment.
+# - A +Nokogiri::XML::NodeSet+ if the input appears to be a fragment.
#
def parse(input, pluck: true)
context = Nokogiri::HTML5::Inference.context(input)
@@ -124,7 +127,7 @@ def parse(input, pluck: true)
if pluck && (path = pluck_path(input))
fragment.xpath(path)
else
-fragment
+fragment.children
end
end
end
59 changes: 33 additions & 26 deletions test/nokogiri/html5/test_inference.rb
@@ -61,37 +61,40 @@
describe ".parse" do
describe "passed a Document with doctype" do
it "returns a Document" do
-assert_equal(
-"<!DOCTYPE html><html><head></head><body></body></html>",
-Nokogiri::HTML5::Inference.parse("<!doctype html><html><head></head><body></body></html>").to_html
-)
-assert_equal(
-"<!DOCTYPE html><html><head></head><body></body></html>",
-Nokogiri::HTML5::Inference.parse("<!DOCTYPE HTML><HTML><HEAD></HEAD><BODY></BODY></HTML>").to_html
-)
+actual = Nokogiri::HTML5::Inference.parse("<!doctype html><html><head></head><body></body></html>")
+
+assert_kind_of(Nokogiri::HTML5::Document, actual)
+assert_equal("<!DOCTYPE html><html><head></head><body></body></html>", actual.to_html)
+
+actual = Nokogiri::HTML5::Inference.parse("<!DOCTYPE HTML><HTML><HEAD></HEAD><BODY></BODY></HTML>")
+
+assert_kind_of(Nokogiri::HTML5::Document, actual)
+assert_equal("<!DOCTYPE html><html><head></head><body></body></html>", actual.to_html)
end
end

describe "passed a Document without doctype" do
it "returns a Document" do
-assert_equal(
-"<html><head></head><body></body></html>",
-Nokogiri::HTML5::Inference.parse("<html><head></head><body></body></html>").to_html
-)
-assert_equal(
-"<html><head></head><body></body></html>",
-Nokogiri::HTML5::Inference.parse("<HTML><HEAD></HEAD><BODY></BODY></HTML>").to_html
-)
+actual = Nokogiri::HTML5::Inference.parse("<html><head></head><body></body></html>")
+
+assert_kind_of(Nokogiri::HTML5::Document, actual)
+assert_equal("<html><head></head><body></body></html>", actual.to_html)
+
+actual = Nokogiri::HTML5::Inference.parse("<HTML><HEAD></HEAD><BODY></BODY></HTML>")
+
+assert_kind_of(Nokogiri::HTML5::Document, actual)
+assert_equal("<html><head></head><body></body></html>", actual.to_html)
end
end

fragment_actions.each do |context, fragments|
describe "passed a fragment requiring 'in #{context}' insertion mode" do
fragments.each do |fragment|
it "parses '#{fragment}' correctly" do
-actual = Nokogiri::HTML5::Inference.parse(fragment).to_html
+actual = Nokogiri::HTML5::Inference.parse(fragment)

-assert_equal(fragment, actual)
+assert_kind_of(Nokogiri::XML::NodeSet, actual)
+assert_equal(fragment, actual.to_html)
end
end
end
@@ -100,37 +103,41 @@
describe "passed a Fragment containing head and body" do
it "returns a Fragment containing both head and body" do
fragment = "<head></head><body></body>"
-actual = Nokogiri::HTML5::Inference.parse(fragment).to_html
+actual = Nokogiri::HTML5::Inference.parse(fragment)

-assert_equal(fragment, actual)
+assert_kind_of(Nokogiri::XML::NodeSet, actual)
+assert_equal(fragment, actual.to_html)
end
end

describe "passed a Fragment containing head and p" do
it "returns a Fragment containing both head and body" do
fragment = "<head><p>"
expected = "<head></head><body><p></p></body>"
-actual = Nokogiri::HTML5::Inference.parse(fragment).to_html
+actual = Nokogiri::HTML5::Inference.parse(fragment)

-assert_equal(expected, actual)
+assert_kind_of(Nokogiri::XML::NodeSet, actual)
+assert_equal(expected, actual.to_html)
end
end

describe "multiple children" do
it "parses correctly" do
fragment = "<tr><td>hello</td></tr><tr><td>world</td></tr>"
-actual = Nokogiri::HTML5::Inference.parse(fragment).to_html
+actual = Nokogiri::HTML5::Inference.parse(fragment)

-assert_equal(fragment, actual)
+assert_kind_of(Nokogiri::XML::NodeSet, actual)
+assert_equal(fragment, actual.to_html)
end

describe "with pluck: false" do
it "includes the additional sibling nodes created" do
fragment = "<body><div>hello</div></body>"
expected = "<head></head>#{fragment}"
-actual = Nokogiri::HTML5::Inference.parse(fragment, pluck: false).to_html
+actual = Nokogiri::HTML5::Inference.parse(fragment, pluck: false)

-assert_equal(expected, actual)
+assert_kind_of(Nokogiri::XML::NodeSet, actual)
+assert_equal(expected, actual.to_html)
end
end
end
