🐛 Bug <br> is converted into two new lines (\n\n) #40

Open
prologic opened this issue Aug 2, 2021 · 9 comments
Labels
bug Something isn't working

Comments

@prologic

prologic commented Aug 2, 2021

Describe the bug

In my testing I've found that the HTML tag <br /> gets turned into two newlines (\n\n).

Example:

(⎈ |local:default)
prologic@Jamess-iMac
Mon Aug 02 11:37:55
~/tmp/html2md
 (master) 130
$ ./html2md -i
Hello<br />World
Hello

World

HTML Input

Hello<br />World

Generated Markdown

Hello

World

Expected Markdown

Hello
World

Additional context

Is there any way to control this behaviour? I get that this might be getting interpreted as a "paragraph", but I would only expect that if there are two <br />(s) or an actual paragraph <p>...</p>. Thanks!

@prologic prologic added the bug Something isn't working label Aug 2, 2021
@wcalandro
Contributor

This is expected behavior. A line break in Markdown requires two newline characters. A single newline character will not render as a line break; instead, it will render as a space.

@akuehnis

According to this page (https://www.markdownguide.org/basic-syntax) a newline in markdown shall be formatted as follows:
To create a line break or new line (<br>), end a line with two or more spaces, and then type return.

I have also seen implementations where <br> and <p></p> are converted to one and two newlines (as prologic recommends).

I don't know if there is a real standard for this. However, <br> must be treated differently from <p></p> so as not to lose information when converting from HTML to Markdown.

@JohannesKaufmann
Owner

Take this HTML as the input:

<p>Line 1<br />Line 2</p>

With html-to-markdown and the normal commonmark behaviour for "br" with two newlines we get:

Line 1

Line 2

With Commonmark (see playground) this renders as:

<p>Line 1</p>
<p>Line 2</p>

[Screenshot: two_newlines]


If you add a custom rule for "br" that just returns a single newline with:

return String("\n")

You get this output:

Line 1
Line 2

With Commonmark (see playground) this renders as:

<p>Line 1
Line 2</p>

[Screenshot: one_newline]


If we compare the different implementations (see babelmark) this behaviour is mostly shared between implementations.

[Screenshot: babelmark]

The markdown rendering on github.com works differently however 🤷‍♂️

[Screenshot: github_dot_com]


If we want to be extra precise, the html-to-markdown library would need to also support hard line breaks. However that would require some other changes.

So for now, the current behaviour is going to stay as it is. Changing it would break it for other implementations. However, you are free to change the behaviour by writing a very simple custom rule.
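To make the difference between the two outputs concrete, here is a minimal standard-library sketch. It does not use html-to-markdown's API: `convertBreaks` and `brTag` are hypothetical names, and the regular expression only stands in for a real HTML parser. It just shows what swapping the "br" replacement from "\n\n" to "\n" does to the generated text.

```go
package main

import (
	"fmt"
	"regexp"
)

// brTag matches <br>, <br/> and <br />, case-insensitively.
// A regex is only good enough for this illustration, not for real HTML.
var brTag = regexp.MustCompile(`(?i)<br\s*/?>`)

// convertBreaks is a hypothetical stand-in for a converter whose "br"
// rule can be replaced; it is NOT the library's actual interface.
func convertBreaks(html, replacement string) string {
	return brTag.ReplaceAllLiteralString(html, replacement)
}

func main() {
	input := "Line 1<br />Line 2"
	// Two newlines: the behaviour reported in this issue.
	fmt.Printf("%q\n", convertBreaks(input, "\n\n"))
	// One newline: what a custom rule returning "\n" would produce.
	fmt.Printf("%q\n", convertBreaks(input, "\n"))
}
```

With the real library, the equivalent change would live in a custom rule for "br", as described above.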

@suntong
Contributor

suntong commented May 10, 2023

The markdown rendering on github.com works differently however

Then can we have the GitHub-flavored markdown to use single line breaks please?
(without the need of hard line breaks, as the GitHub-flavored markdown is supposed to be tailored towards github.com)

And the change would be minimal, I'd presume, i.e. instead of outputting \n\n, do the following:

output "\n"
if (not in the GitHub-flavored markdown mode) output "\n"

Thanks
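The pseudocode above could be sketched in Go roughly as follows. This is only an illustration of the proposal: the `githubFlavored` flag and the `brOutput` function are hypothetical and do not exist in html-to-markdown.

```go
package main

import (
	"fmt"
	"strings"
)

// brOutput returns the text emitted for one <br> tag.
// githubFlavored is a hypothetical mode switch, assumed here only to
// mirror the proposal; the library has no such parameter.
func brOutput(githubFlavored bool) string {
	var b strings.Builder
	b.WriteString("\n") // always emit the first newline
	if !githubFlavored {
		b.WriteString("\n") // second newline outside the GitHub mode
	}
	return b.String()
}

func main() {
	fmt.Printf("%q\n", brOutput(true))  // "\n"
	fmt.Printf("%q\n", brOutput(false)) // "\n\n"
}
```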

@JohannesKaufmann
Owner

Then can we have the GitHub-flavored markdown to use single line breaks please?

There are other renderers, like the GitHub Flavored Markdown Extension from goldmark, that also implement the spec. And I don't want to break those.

Right now, it seems like it's only github.com that causes the problem...

@ImportTaste

Then can we have the GitHub-flavored markdown to use single line breaks please?

There are other renderers, like the GitHub Flavored Markdown Extension from goldmark, that also implement the spec. And I don't want to break those.

Right now, it seems like it's only github.com that causes the problem...

What about an additional built-in rule for these line breaks? @suntong seems to be against the idea of altering the behavior when this project's GFM plugin is used, or adding a new parameter to accomplish this.

@ImportTaste

@suntong I doubt you want a PR for this, but: ImportTaste/html2md@082a6fb

Works well for me. I really don't think @JohannesKaufmann is going to budge.

@suntong
Contributor

suntong commented Jun 15, 2023

NP, I'd love to, since it works well for you, and also because I agree with you that such a feature might never be accepted here. So, send the PR pls.

@zturtleman

Expanding on a previous comment (#40 (comment)):

From the official Markdown specification:

When you do want to insert a <br /> break tag using Markdown, you end a line with two or more spaces, then type return.

Converting this HTML to Markdown:

<p>Line 1<br />Line 2<br />Line 3</p>

Should be this Markdown (with two spaces at the end of the first two lines where the <br /> tags were):

Line 1  
Line 2  
Line 3

Using + to visualize the spaces, it would look like this:

Line 1++
Line 2++
Line 3

Though the Markdown itself should use spaces, not +.


This works on GitHub, CommonMark, and 27 implementations on babelmark.

The reference Markdown (note the two spaces at the end of the first two lines):

Line 1  
Line 2  
Line 3

GitHub:

Line 1
Line 2
Line 3

CommonMark (web demo):

<p>Line 1<br />
Line 2<br />
Line 3</p>

[Screenshot from 2024-05-25 18-11-14]

babelmark (web demo):
[Screenshot from 2024-05-25 18-15-27]

It seems like the solution for how html-to-markdown should handle <br /> tags is to convert <br /> tags to two spaces and a new line (\x20\x20\n) rather than one (\n) or two new lines (\n\n). This behavior is defined by the official Markdown specification and it seems well supported by various implementations.
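As a rough illustration of that conversion, here is a standard-library Go sketch. It is not how html-to-markdown is implemented: `hardBreaks` and `brTag` are hypothetical names, the regular expression stands in for a real HTML parser, and the input is assumed to be the inline content of a single paragraph.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// brTag matches <br>, <br/> and <br />, case-insensitively.
var brTag = regexp.MustCompile(`(?i)<br\s*/?>`)

// hardBreaks (hypothetical helper) joins the text around each <br>
// with two spaces and a newline, i.e. a Markdown hard line break.
func hardBreaks(inline string) string {
	parts := brTag.Split(inline, -1)
	for i, p := range parts {
		// Trim so the only trailing spaces are the two added below.
		parts[i] = strings.TrimSpace(p)
	}
	return strings.Join(parts, "  \n")
}

func main() {
	md := hardBreaks("Line 1<br />Line 2<br />Line 3")
	// Print with '+' in place of the trailing spaces, as in the
	// visualisation above.
	fmt.Println(strings.ReplaceAll(md, "  \n", "++\n"))
}
```

Because the separator is only placed between segments, the last line gets no trailing spaces, matching the example above.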


7 participants