Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues with list items #357

Open
q00u opened this issue Oct 23, 2020 · 5 comments · May be fixed by #358
Open

Issues with list items #357

q00u opened this issue Oct 23, 2020 · 5 comments · May be fixed by #358

Comments

@q00u
Copy link

q00u commented Oct 23, 2020

Issue 1: Headers inside list items.

Consider:

<li><h4>Header inside list item</h4></li>

This renders in html as:
chrome_eudXDirqJY

In turndown, this becomes:

*   ####Header inside list item

Which renders as:

  • ####Header inside list item

I propose that headers inside list items be treated as strong:

*   **Header inside list item**

Which renders as:

  • Header inside list item

Issue 2: Standalone headers following other non-list-item nodes.

Consider:

<img src="https://picsum.photos/id/1/200" /><li>Standalone list item</li>

This renders in html as:
chrome_mfoJdvDZY0

In turndown, this becomes:

![](https://picsum.photos/id/1/200)*   Standalone list item

Which renders as:
* Standalone list item

I propose that list items that follow non-list-items should be placed on a new line:

![](https://picsum.photos/id/1/200)
*   Standalone list item

Which renders as:

  • Standalone list item
@q00u q00u linked a pull request Oct 23, 2020 that will close this issue
@pavelhoral
Copy link
Collaborator

Just an observation: The second example is invalid HTML. <li> element is only valid inside <ul> or <ol>. The first example, unfortunatelly, seems like a valid HTML.

@q00u
Copy link
Author

q00u commented Dec 5, 2020

True, but browsers will still render list items outside of lists. If the goal is for the resulting markdown to most-closely render as the original html renders, then it should be moved to a new line just like the browser does.

@dchacke
Copy link

dchacke commented Aug 27, 2023

I can reproduce the first issue:

<ol>
  <li>
    <h1>foo</h1>
  </li>
</ol>

is converted to:

1.  foo
    ===

Which a markdown parser (I'm using this one) turns into:

<ol>
<li>foo
===</li>
</ol>

So the heading is lost.

I don't think turning a heading into bold text is a good idea (#358). It may retain some semblance of a heading visually, but semantically and structurally, it does not adequately replace the original HTML. In addition, the same issue seems to extend to lis having any block-level content as their first child, in which case the suggested fix doesn't generalize. Consider blockquotes:

<ol>
  <li>
    <blockquote>foo</blockquote>
  </li>
</ol>

...is converted into...

1.  > foo

Passed back into a markdown parser, that is turned into:

<ol>
<li>> foo</li>
</ol>

So it treats the part "> foo" as regular text and keeps the ">" sign.

The good news is that there's no need to turn headings into bold text because there's a more general fix. For the first code block in this comment, turndown should instead return this:

1. 

    foo
    ===

That is, with one space after the period and four spaces before the "f" in "foo". Then the markdown parser recognizes the heading properly:

<ol>
<li><h1>foo</h1></li>
</ol>

The same fix works for blockquotes as well:

1. 

    > foo

is turned into:

<ol>
<li><blockquote>
  <p>foo</p>
</blockquote></li>
</ol>

As a temporary workaround, all list items could have this spacing, but that looks odd and is unnecessary when a list item starts with inline content.

@dchacke
Copy link

dchacke commented Aug 27, 2023

Here's a workaround for the consumer side to fix the first issue for now. I created it with the help of AI and only lightly tested it. I'm a Turndown noob but maybe it helps someone:

let td = new TurndownService();

td.addRule('listItemWithBlockAsFirstChild', {
  filter: node => {
    return (
      node.nodeName === 'LI' &&
      node.firstChild &&
      // Block elements listed on https://www.w3schools.com/htmL/html_blocks.asp
      [
        'address', 'article', 'aside', 'blockquote', 'canvas', 'dd', 'div',
        'dl', 'dt', 'fieldset', 'figcaption', 'figure', 'footer', 'form',
        'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'header', 'hr', 'li', 'main',
        'nav', 'noscript', 'ol', 'p', 'pre', 'section', 'table', 'tfoot',
        'ul', 'video'
      ]
      .map(t => t.toUpperCase())
      .includes(node.firstChild.nodeName)
    );
  },
  replacement: (content, node, options) => {
    let indicator;

    if (node.parentNode.nodeName === 'OL') {
      let count = 1; // Start counting from 1
      let sibling = node.previousElementSibling;

      while (sibling) {
        if (sibling.nodeName === 'LI') {
          count++;
        }

        sibling = sibling.previousElementSibling;
      }

      indicator = `${count}.`;
    } else {
      indicator = options.bulletListMarker;
    }

    let innerMarkdown = td.turndown(node.innerHTML);

    // Add one space behind the indicator and four spaces underneath
    return `${indicator} \n\n    ${innerMarkdown}\n`;
  }
});

In short, the filter applies to any li that has a block-level element as its first child.

We then determine the indicator (either n. with some positive integer n for ordered lists or the given bullet-list marker for unordered lists). If the li is part of an ordered list, we determine the indicator by counting the previous li elements in the same list, increasing a counter for each such previous element. If the li is part of an unordered list instead, we just use the given bullet-list marker.

Lastly, we parse the li's inner HTML and construct the final markdown string: the indicator followed by a space, then two line breaks, the last of which is followed by four spaces.

(I wrote this explanation, not an AI.)

dchacke added a commit to dchacke/turndown that referenced this issue Aug 27, 2023
Fixes the first issue described in mixmark-io#357. The fix the described in more detail here: mixmark-io#357
dchacke added a commit to dchacke/turndown that referenced this issue Aug 27, 2023
Fixes the first issue mentioned in mixmark-io#357. I describe the bug in more detail here: mixmark-io#357 (comment)
@dchacke
Copy link

dchacke commented Aug 27, 2023

Disregard previous workaround in favor of this fix as per my PR #447:

rules.listItem = {
  filter: 'li',
  replacement: function (content, node, options) {
    content = content
      .replace(/^\n+/, '') // remove leading newlines
      .replace(/\n+$/, '\n') // replace trailing newlines with just a single one
      .replace(/\n/gm, '\n    ') // indent
    var prefix = options.bulletListMarker + '   '
    var parent = node.parentNode

    // Check if the first child is a block-level element
    var blockElements = [
      'address', 'article', 'aside', 'blockquote', 'canvas', 'dd', 'div',
      'dl', 'dt', 'fieldset', 'figcaption', 'figure', 'footer', 'form',
      'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'header', 'hr', 'li', 'main',
      'nav', 'noscript', 'ol', 'p', 'pre', 'section', 'table', 'tfoot',
      'ul', 'video'
    ].map(function (t) {
      return t.toUpperCase()
    })

    var firstChildIsBlock = node.firstChild && blockElements.includes(node.firstChild.nodeName)

    if (parent.nodeName === 'OL') {
      var start = parent.getAttribute('start')
      var index = Array.prototype.indexOf.call(parent.children, node)
      prefix = (start ? Number(start) + index : index + 1) + '.  '
    }

    return (
      prefix + (firstChildIsBlock ? '\n\n    ' : '') + content + (node.nextSibling && !/\n$/.test(content) ? '\n' : '')
    )
  }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants