-
Notifications
You must be signed in to change notification settings - Fork 228
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HTML blocks are split up in events #803
Comments
Hello, text events are not guaranteed to be emitted in only one because the incremental nature of pulldown-cmark, and the same happens here with HTML blocks. However, it's easy to wrap the event iterator to fix this, as is done for text events in https://github.com/raphlinus/pulldown-cmark/blob/master/src/utils.rs. An example is also available in the main rustdoc, and it will be reflected in the documentation website when the 0.10 release (the first one with this utility) is launched:
I hope this solves your issue. |
While it is nice that this is offered as an addition, I'd like to disagree that it fully solves the problem. One of the big benefits of having something built into the library to do this is that it would be able to take advantage of the fact that these strings are genuinely adjacent to each other in the source code and avoid allocating a new string to concatenate them together. In most cases the performance benefits are near-zero, but again, this library already uses Additionally, this only does this for Would you be willing to merge an adapter that takes this into consideration? I'd be more than happy to dig into the code and make one, if you think it'd be good to upstream. |
Such a contribution will be welcome!, just take into account it is not trivial due to the incremental nature of the parser. But if you get a smart manner of achieving it, it would be a good enhancement for the library. Thanks! |
The main reason the borrowed lines cannot be given as a single event is that they aren't always adjacent in the source text. This can happen when the HTML block is indented due to external structure:
commonmark.js: <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document xmlns="http://commonmark.org/xml/1.0">
<list type="bullet" tight="true">
<item>
<paragraph>
<text>Hello</text>
<text>!</text>
</paragraph>
<html_block><!--
Testing
--></html_block>
</item>
</list>
</document> As you can see, the indentation is trimmed, as per the spec. It would be impossible to give this as a single borrowed Another example is if there are CRLF line breaks in the input file. |
I wrote the The text events don't always map one-to-one to the original source buffer of your input. Hence, at least some allocations are necessary. Of course, any performance improvements are welcome and appreciated - it is indeed true that the |
Given the following test markdown:
The following events are generated by the commonmark dingus:
Whereas these events are generated by
pulldown-cmark
:As you'll notice, the spec tries to keep the entire HTML block together as one unit, whereas
pulldown-cmark
will instead split the HTML at newlines, which isn't exactly ideal.The main reason why I found this relevant is that it runs into an issue in zola when I try to match a
<!-- more -->
comment spread across multiple lines. The only way to potentially fix this issue would be to proactively combine these events, which would mean converting the borrowed slices into owned strings.It seems like it would make the most sense for
pulldown-cmark
to keep these blocks together when returning them as events, since it seems to correctly parse it, just split it up weirdly.The text was updated successfully, but these errors were encountered: