New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix absolutize URL for several cases #861
base: master
Are you sure you want to change the base?
Conversation
There were a number of bugs related to the fact that `Item::get_links()` and `Item::get_base()` call each-other, making a nice mess during initialisation. See tests. Furthermore, the standard Atom `self` link was not supported, wrongly falling back to `alternate`. In the same PR because otherwise the tests from both PRs would fail.
Running the additional tests without the patches returns:
|
yield 'Test RSS 2.0 with channel link and enclosure' => [ | ||
<<<XML | ||
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/"> | ||
<channel> | ||
<link>http://example.net/tests/</link> | ||
<item> | ||
<link>/tests/3/</link> | ||
<media:content url="/images/3.jpg" medium="image"></media:content> | ||
</item> | ||
</channel> | ||
</rss> | ||
XML | ||
, | ||
'http://example.net/images/3.jpg', | ||
]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was the original bug I faced, which had me investigate the issue (which turned out to be more severe and complex than anticipated...)
</rss> | ||
XML | ||
, | ||
'http://example.net/images/3.jpg', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was wrongly returning /images/3.jpg
before this patch
</rss> | ||
XML | ||
, | ||
'http://example.net/images/4.jpg', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was wrongly returning /images/4.jpg
before this patch
</rss> | ||
XML | ||
, | ||
'http://example.net/tests/1/', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was wrongly returning http://example.com/tests/1/
before this patch (side-effect of the enclosure)
@@ -1199,11 +1217,11 @@ public function get_enclosures() | |||
// PLAYER | |||
if ($player_parent = $this->get_item_tags(\SimplePie\SimplePie::NAMESPACE_MEDIARSS, 'player')) { | |||
if (isset($player_parent[0]['attribs']['']['url'])) { | |||
$player_parent = $this->sanitize($player_parent[0]['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI); | |||
$player_parent = $this->sanitize($player_parent[0]['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($player_parent[0])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There were no tests for this section and I have not added any. Help welcome if anyone is motivated.
For sure not providing the base URL will lead to wrong absolute URLs
This is especially relevant for HTML+XPath mode, for which we rely on proper URL "absolutize" Upstream PR simplepie/simplepie#861
Downstream PR FreshRSS/FreshRSS#6270 |
This is especially relevant for HTML+XPath mode, for which we rely on proper URL "absolutize" Upstream PR simplepie/simplepie#861
*/ | ||
protected function get_own_base(array $element = []): string | ||
{ | ||
if (!empty($element['xml_base_explicit']) && isset($element['xml_base'])) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently, xml:base
itself should be resolved recursively relative to xml:base
in parent elements. Thankfully, this appears to be handled by our own Parser
class.
src/Item.php
Outdated
* Similar to `get_base()` but can safely be used during initialisation methods | ||
* such as `get_links()` (`get_base()` and `get_links()` call each-other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How will the mutual recursion be prevented when xml:base
is not set? Would not it be the same as calling SimplePie::get_base()
directly in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If xml:base
is not set, it will rely on the feed's SimplePie::get_base()
, which does not depend on Item::*
, so there should not be mutual dependencies anymore, unlike before this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I missed that the comment is talking about Item::get_base()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improved comment 13a398d
* Uses `<xml:base>` if available, | ||
* otherwise uses the first 'self' link or the first 'alternate' link of the feed, | ||
* or failing that, the URL of the feed itself. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Original RSS specification requires URLs to include scheme. I would expect that if the feed has relative URLs the content is taken from the HTML (alternate) version unchanged, and so the links should be resolved relative to that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also reflected in the previous definition, as self
link is basically the canonical version of subscribe URL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though I guess self
before alternate
might make sense for mrss elements, since those only exist in the feed. (But really, it will depend on how the feed is generated. mrss only mandates direct URL, which is not very useful.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like https://www.rssboard.org/news/151/relative-links discusses this and recommends self
link as a fallback, not mentioning alternate
at all. (Note that URLs in the comments are displayed as domains, you will need to check the link href for the examples to make sense.)
And for completeness Atom only seems to mention xml:base
:
Any element defined by this specification MAY have an
xml:base
attribute [W3C.REC-xmlbase-20010627]. Whenxml:base
is used in an Atom Document, it serves the function described in section 5.1.1 of RFC3986, establishing the base URI (or IRI) for resolving any relative references found within the effective scope of thexml:base
attribute.
And, as mentioned in one of the comments on the RSS article, the xml:base
specification also suggests the document URI (i.e. subscribe URL after redirects, which I would expect to match self
link) for the fallback:
The attribute
xml:base
may be inserted in XML documents to specify a base URI other than the base URI of the document or external entity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to follow-up on this. It looks like we agree, right? In other words, there does not seem to be any (new) test in contradiction.
There were a number of bugs related to the fact that
Item::get_links()
andItem::get_base()
call each-other, making a nice mess during initialisation. See tests.Furthermore, the standard Atom 1.0
self
link was not supported for absolutize URL, wrongly using onlyalternate
. In the same PR because otherwise the tests from both PRs would fail.