Proper XML Namespaces support #367

Open
ddevienne opened this issue Aug 19, 2020 · 1 comment

Comments

@ddevienne

I've searched the issues, and although a few relate to namespaces, none tackles the real issue,
which is that every node's namespace URI should be available. Knowing a node's qualified name
(i.e. NS-Prefix colon Local-Name) is insufficient, because of default namespaces and because prefixes are arbitrary:
only namespace URIs matter. Namespaces are pretty essential in XML.

The XML parser is best placed to keep track of in-scope namespaces and assign them to nodes.
Otherwise, client code has to scan the whole document for namespace-related attributes and
element names, maintain a stack of in-scope namespaces, and keep an external shadow DOM
(i.e. a map) to know the NS URI of every node, which is possible but cumbersome and inefficient.

(BTW, I don't see how the namespace-uri() = X XPath predicate can be correct and efficient w/o the
parse-tree knowing about the NS of all nodes, as described above)

I'm currently using https://github.com/svgpp/rapidxml_ns, but would welcome being able to replace it with pugixml,
provided it gains proper XML Namespaces support. Please consider this a formal request for enhancement. Thanks, --DD

@zeux
Owner

zeux commented Sep 22, 2020

> (BTW, I don't see how the namespace-uri() = X XPath predicate can be correct and efficient w/o the
> parse-tree knowing about the NS of all nodes, as described above)

To get the namespace URI for a single node, it's enough to scan the ancestry chain for the relevant attributes. It's not as efficient as already having the information, but it doesn't require scanning the entire document.
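
As a rough illustration of the ancestry scan, here is a minimal sketch. It deliberately does not use pugixml's types: `Element`, `namespace_uri` and `demo_inherited_default_namespace` are hypothetical names invented for this example, and attributes are modelled as a plain map. It also ignores the reserved `xml` prefix and attribute namespaces.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a parsed element (NOT pugixml's xml_node).
struct Element {
    std::string name;                               // qualified name, e.g. "svg:rect"
    std::map<std::string, std::string> attributes;  // includes xmlns / xmlns:* declarations
    const Element* parent = nullptr;                // nullptr for the root element
};

// Resolves the namespace URI of `node` by walking up the ancestry chain and
// returning the nearest declaration that matches the element's prefix.
// Returns an empty string when no declaration is in scope.
std::string namespace_uri(const Element& node) {
    const std::string::size_type colon = node.name.find(':');
    // Unprefixed elements use the default namespace, declared via plain "xmlns".
    const std::string declaration = colon == std::string::npos
        ? std::string("xmlns")
        : "xmlns:" + node.name.substr(0, colon);

    for (const Element* e = &node; e != nullptr; e = e->parent) {
        const auto it = e->attributes.find(declaration);
        // An empty value (xmlns="") undeclares the default namespace, which
        // naturally yields "" here as well.
        if (it != e->attributes.end()) return it->second;
    }
    return std::string();
}

// Small demonstration: a child without declarations inherits the root's default namespace.
std::string demo_inherited_default_namespace() {
    Element root;
    root.name = "svg";
    root.attributes["xmlns"] = "http://www.w3.org/2000/svg";

    Element child;
    child.name = "rect";
    child.parent = &root;
    assert(child.attributes.empty());  // the child declares nothing itself

    return namespace_uri(child);
}
```

The cost per lookup is proportional to the node's depth times the number of attributes on each ancestor, which is the trade-off mentioned above: slower than a stored URI, but with no whole-document scan.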

First-class support for namespaces would introduce memory overhead for all nodes to store the extra URI data, and make the parser slower because of the need to identify xmlns attributes and maintain the relevant structures. Because of this I actually believe that the external tracking approach is ideal: with it, users only pay for the extra namespace information when it's relevant. The implementation would be less efficient than a first-class implementation, but not by much, and it doesn't make the core more complex.

It's possible to include a helper like this in pugixml, maybe a separate class xml_namespaces that you can create from xml_document which would pre-record the association between nodes and namespace URIs. The implementation requires a hash map but there's already an implementation for compact mode that could be reused for this.

Perhaps an interface like this could work. This would assume that the tree doesn't mutate after construction.

class xml_namespaces
{
public:
    xml_namespaces();
    explicit xml_namespaces(xml_node root); // alternatively an explicit reset() method

    void reset(xml_node root);

    const char* local_name(xml_node node) const;
    const char* namespace_uri(xml_node node) const;

    // possibly also something like this for more efficient lookup:
    const void* get_namespace(const char* uri) const;
    bool has_namespace(xml_node node, const void* id) const;
};
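
To make the pre-recording idea more concrete, here is a hedged sketch of how such a side table could be built in one depth-first pass. It is not pugixml code and not a proposed implementation: `Element` and `NamespaceIndex` are hypothetical stand-ins, `std::unordered_map` replaces the compact-mode hash map mentioned above, and, as in the interface above, the tree is assumed not to mutate after construction.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a parsed element (NOT pugixml's xml_node).
struct Element {
    std::string name;                               // qualified name, e.g. "m:math"
    std::map<std::string, std::string> attributes;  // includes xmlns declarations
    std::vector<std::unique_ptr<Element>> children;

    Element* add_child(std::string child_name) {
        children.push_back(std::make_unique<Element>());
        children.back()->name = std::move(child_name);
        return children.back().get();
    }
};

// Records the namespace URI of every element once, so later lookups are a hash probe.
class NamespaceIndex {
public:
    explicit NamespaceIndex(const Element& root) {
        record(root, std::map<std::string, std::string>());
    }

    // Namespace URI of an element, or "" when it has none or was not indexed.
    const std::string& namespace_uri(const Element& node) const {
        static const std::string none;
        const auto it = uri_by_node_.find(&node);
        return it == uri_by_node_.end() ? none : it->second;
    }

    static std::string local_name(const Element& node) {
        const auto colon = node.name.find(':');
        return colon == std::string::npos ? node.name : node.name.substr(colon + 1);
    }

private:
    // `in_scope` maps prefix -> URI ("" is the default namespace). It is passed
    // by value so each subtree gets its own copy; simple, but not the cheapest option.
    void record(const Element& node, std::map<std::string, std::string> in_scope) {
        for (const auto& attribute : node.attributes) {
            if (attribute.first == "xmlns")
                in_scope[""] = attribute.second;
            else if (attribute.first.rfind("xmlns:", 0) == 0)
                in_scope[attribute.first.substr(6)] = attribute.second;
        }
        const auto colon = node.name.find(':');
        const std::string prefix =
            colon == std::string::npos ? std::string() : node.name.substr(0, colon);
        const auto bound = in_scope.find(prefix);
        uri_by_node_[&node] = bound == in_scope.end() ? std::string() : bound->second;

        for (const auto& child : node.children) record(*child, in_scope);
    }

    std::unordered_map<const Element*, std::string> uri_by_node_;
};

// Small demonstration: a prefixed child resolves through a declaration on the root.
std::string demo_prefixed_child() {
    Element root;
    root.name = "svg";
    root.attributes["xmlns"] = "http://www.w3.org/2000/svg";
    root.attributes["xmlns:m"] = "http://www.w3.org/1998/Math/MathML";
    Element* math = root.add_child("m:math");

    NamespaceIndex index(root);
    return index.namespace_uri(*math);
}
```

Storing a full string per node is the simplest choice for a sketch; interning each distinct URI once and storing a pointer per node would be closer to the `get_namespace` / `has_namespace` idea above, where a namespace is compared by identity rather than by string.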


2 participants