[ENH] make parsing more tolerant by allowing to specify https:// namespaces #746

Open
alexandru opened this issue Sep 5, 2022 · 1 comment

Comments

@alexandru

Hello,

I bumped into the feed of a Jekyll-generated website that uses this namespace:
https://www.w3.org/2005/Atom

Using https:// instead of http:// breaks SimplePie's parser, with the feed not recognized as being an Atom feed.

While https:// may be incorrect, it would be great if it worked, as that's the direction the web is taking. The namespace also has an associated web page: my browser automatically redirects the http: link to https://www.w3.org/2005/Atom, which still works. As a rookie in website generation, I could easily make the mistake of copying and pasting the URL from the browser's location bar.

Cheers,

@skyzyx
Member

skyzyx commented Sep 6, 2022

It's just a namespace, and the http form is the _official_ namespace. Think of com.company.project in Java or Project\Namespace\Class in PHP. It's a URI, but not necessarily a URL, even though it looks like one.

The official namespace for the Atom spec is http://www.w3.org/2005/Atom. The https is officially wrong/invalid.
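To make that concrete, here is a minimal sketch using PHP's DOM extension rather than SimplePie itself. The sample feed string is made up for illustration. It shows that a namespace-aware parser treats the namespace name as an opaque string, so the https:// variant simply does not match the official one.

```php
<?php
// Official Atom 1.0 namespace name.
const ATOM_NS = 'http://www.w3.org/2005/Atom';

// Hypothetical feed that declares the https:// variant by mistake.
$xml = '<feed xmlns="https://www.w3.org/2005/Atom"><title>Example</title></feed>';

$doc = new DOMDocument();
$doc->loadXML($xml);

// The namespace name exactly as the document declared it.
$declared = $doc->documentElement->namespaceURI;

// Namespace names are compared as plain strings, so the scheme matters.
$matchesOfficial = ($declared === ATOM_NS);

// A namespace-aware lookup under the official name finds no <title>.
$titles = $doc->getElementsByTagNameNS(ATOM_NS, 'title');

var_dump($matchesOfficial);
var_dump($titles->length);
```

Nothing is ever fetched from that address; the parser only compares the two strings.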

Does that make sense?

I haven't worked on this project in over 12 years (other than helping answer questions), so the current maintainer will need to weigh in on whether to make this change.

My personal opinion is that this is a mistake that will inevitably continue to occur as HTTPS has become the default over the last 10 years. SimplePie has always been about parsing the feed and implementing the spec, without being a stickler for the spec. And in cases where we can't programmatically know what to do, we allow users to call $simplepie->force_feed(). If I were still leading this project (which I'm not), this is something for which I would implement support.

In terms of implementation of a patch:

SimplePie handles multiple namespaces, each for a different spec. IIRC, SimplePie's namespace parsing treats each unique URL as a set of tags that it understands — however, RSS and Atom are core specs and are managed differently. Right now, I think all of that code is in SimplePie_Parser (or wherever that code exists today) where it figures out what kind of content it has received, and then determines what to do with it.

For other namespaces, the http and https versions would be treated as two separate namespaces for two separate specs (since that's how namespaces fundamentally work in XML), but since Atom is a core spec it's possible that this could be solved by:

  • Supporting regexes for namespaces, which I believe we had to do for Yahoo! Media RSS since they screwed up the spec namespace a few times; you can probably pull from that code.

  • Mapping multiple known URIs to the same namespace. I have a vague memory of having to do this for iTunes RSS feeds at some point.

  • Removing the scheme altogether from all namespaces.

  • Or some blend of the above.
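The three options above could look roughly like the sketch below. This is not SimplePie's actual code: the function names, the alias table, and the regex are all hypothetical, and a real patch would have to hook into wherever the parser currently resolves namespaces.

```php
<?php
// Official Atom 1.0 namespace name.
const ATOM_NS = 'http://www.w3.org/2005/Atom';

// Hypothetical alias table: known-wrong URIs mapped to the official one.
const NAMESPACE_ALIASES = [
    'https://www.w3.org/2005/Atom' => ATOM_NS,
];

// Option 1 (regex): accept either scheme for one specific namespace.
function is_atom_namespace(string $uri): bool
{
    return preg_match('#^https?://www\.w3\.org/2005/Atom$#', $uri) === 1;
}

// Option 2 (mapping): rewrite known aliases, leave everything else alone.
function canonical_namespace(string $uri): string
{
    return NAMESPACE_ALIASES[$uri] ?? $uri;
}

// Option 3 (scheme removal): compare namespaces without http:// or https://.
function namespace_without_scheme(string $uri): string
{
    // preg_replace() returns null on failure; fall back to the input.
    return preg_replace('#^https?://#i', '', $uri) ?? $uri;
}
```

The mapping approach is the most conservative, since it only touches URIs that are explicitly listed. Scheme removal covers every namespace at once, but it also merges http and https names that XML itself treats as distinct.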

Depending on how up-to-date (or out-of-date) my knowledge of the current source code is, I think a patch that puts a band-aid on this single issue would probably be very small. But I think that supporting https namespaces across the ecosystem of feeds probably needs a bit more engineering thinking and a moderately sized patch.
