option to join multiple CDATA sections into one when parsing #599

Open
schulzer opened this issue Nov 30, 2023 · 2 comments

Comments

schulzer commented Nov 30, 2023

This is similar to 'Comments split Text into multiple Nodes #546'. Consider a complicated document with lots of escaping:

<node> some&amp;data ]]> more&amp;data </node>

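This could be rewritten (somewhat simplified) using CDATA, but the naive version is erroneous because the data itself contains ']]>':

<node> <![CDATA[some&data ]]> more&data]]> </node>

The corrected version splits the ']]>' across two CDATA sections:

<node> <![CDATA[some&data ]]]><![CDATA[]> more&data]]> </node>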
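The split used in the corrected form can be produced mechanically. The following is a minimal sketch using only the C++ standard library; `to_cdata` is a hypothetical helper name chosen for illustration and is not part of pugixml. It ends the current section after the first ']' of every ']]>' and starts a new section with the remaining ']>'.

```cpp
#include <string>

// Hypothetical helper (not part of pugixml): wraps `text` in one or more
// CDATA sections. Every "]]>" in the input is split so that the first ']'
// ends one section and "]>" starts the next one.
std::string to_cdata(const std::string& text) {
    const std::string terminator = "]]>";
    // "]" + end of section + start of next section + "]>"
    const std::string split = "]]]><![CDATA[]>";

    std::string out = "<![CDATA[";
    std::string::size_type pos = 0;
    for (;;) {
        const std::string::size_type hit = text.find(terminator, pos);
        if (hit == std::string::npos) break;
        out.append(text, pos, hit - pos);  // copy text up to the "]]>"
        out += split;                      // emit the split form instead
        pos = hit + terminator.size();
    }
    out.append(text, pos, std::string::npos);  // copy the remaining tail
    out += "]]>";
    return out;
}
```

Calling `to_cdata("some&data ]]> more&data")` produces the same two-section form as the corrected example, which is why a single logical value can end up stored as several CDATA child nodes.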

In both cases (the original and the final form) the user would like to call doc.child("node").child_value() and get the stored data, without complicated logic that iterates over all children and concatenates their values. This matters in particular because the node could be called 'url' and could arrive in several different encodings, each of which currently requires different code to retrieve the full URL. As a user, I would like an option to retrieve all of those flavours cleanly de-escaped through a single access point, because from a logical, high-level standpoint only a single value (the URL) is encoded. From a low-level standpoint I understand why the sections are split and stored as multiple child nodes.

If this is an acceptable change, I could submit a pull request.

zeux (Owner) commented Dec 15, 2023

What if there's a mix of PCDATA and CDATA content?

schulzer (Author) commented

Just to clarify, you mean something like

<node> <![CDATA[some data]]> more data </node>

and not the case

<node> <![CDATA[some data]]> <child ... /> <![CDATA[more data]]> </node>

because the latter case would be the same as in #546 (where I assume text is not merged across the child element, because it cannot be, but the pieces before the child and the pieces after it would each be merged).

The former case is a bit tricky, yes, but here I would consider a CDATA section to be just another kind of escaping, comparable to &amp; encoding, and therefore not important to the user. As usual, it would follow the general whitespace rules (and their modification through options), e.g.

<node><![CDATA[some data]]></node>
    yields "some data"
<node> <![CDATA[some data]]> </node>
    yields " some data " (or "some data", as in the previous case, when 'parse_trim_pcdata' is set)
<node><![CDATA[some data]]>more data</node>
    yields "some datamore data"
<node> <![CDATA[some data]]> more data </node>
    yields " some data more data "
<node> <![CDATA[some data ]]>more data </node>
    yields " some data more data "
<node> <![CDATA[some data ]]> more data </node>
    yields " some data  more data " with two spaces in between
