Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix absolutize URL for several cases #861

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
56 changes: 37 additions & 19 deletions src/Item.php
Original file line number Diff line number Diff line change
Expand Up @@ -117,9 +117,27 @@ public function get_item_tags(string $namespace, string $tag)
return null;
}

/**
* Get base URL of the item itself.
* Returns `<xml:base>` or feed base URL.
* Similar to `Item::get_base()` but can safely be used during initialisation methods
* such as `Item::get_links()` (`Item::get_base()` and `Item::get_links()` call each-other)
* and is not affected by enclosures.
*
* @param array<string, mixed> $element
* @see get_base
*/
protected function get_own_base(array $element = []): string
{
if (!empty($element['xml_base_explicit']) && isset($element['xml_base'])) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently, xml:base itself should be resolved recursively relative to xml:base in parent elements. Thankfully, this appears to be handled by our own Parser class.

return $element['xml_base'];
}
return $this->feed->get_base();
}

/**
* Get the base URL value.
* Uses `<xml:base>`, or item link, or feed base URL.
* Uses `<xml:base>`, or item link, or enclosure link, or feed base URL.
*
* @param array<string, mixed> $element
* @return string
Expand Down Expand Up @@ -812,27 +830,27 @@ public function get_links(string $rel = 'alternate')
foreach ((array) $this->get_item_tags(\SimplePie\SimplePie::NAMESPACE_ATOM_10, 'link') as $link) {
if (isset($link['attribs']['']['href'])) {
$link_rel = (isset($link['attribs']['']['rel'])) ? $link['attribs']['']['rel'] : 'alternate';
$this->data['links'][$link_rel][] = $this->sanitize($link['attribs']['']['href'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($link));
$this->data['links'][$link_rel][] = $this->sanitize($link['attribs']['']['href'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_own_base($link));
}
}
foreach ((array) $this->get_item_tags(\SimplePie\SimplePie::NAMESPACE_ATOM_03, 'link') as $link) {
if (isset($link['attribs']['']['href'])) {
$link_rel = (isset($link['attribs']['']['rel'])) ? $link['attribs']['']['rel'] : 'alternate';
$this->data['links'][$link_rel][] = $this->sanitize($link['attribs']['']['href'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($link));
$this->data['links'][$link_rel][] = $this->sanitize($link['attribs']['']['href'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_own_base($link));
}
}
if ($links = $this->get_item_tags(\SimplePie\SimplePie::NAMESPACE_RSS_10, 'link')) {
$this->data['links']['alternate'][] = $this->sanitize($links[0]['data'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($links[0]));
$this->data['links']['alternate'][] = $this->sanitize($links[0]['data'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_own_base($links[0]));
}
if ($links = $this->get_item_tags(\SimplePie\SimplePie::NAMESPACE_RSS_090, 'link')) {
$this->data['links']['alternate'][] = $this->sanitize($links[0]['data'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($links[0]));
$this->data['links']['alternate'][] = $this->sanitize($links[0]['data'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_own_base($links[0]));
}
if ($links = $this->get_item_tags(\SimplePie\SimplePie::NAMESPACE_RSS_20, 'link')) {
$this->data['links']['alternate'][] = $this->sanitize($links[0]['data'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($links[0]));
$this->data['links']['alternate'][] = $this->sanitize($links[0]['data'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_own_base($links[0]));
}
if ($links = $this->get_item_tags(\SimplePie\SimplePie::NAMESPACE_RSS_20, 'guid')) {
if (!isset($links[0]['attribs']['']['isPermaLink']) || strtolower(trim($links[0]['attribs']['']['isPermaLink'])) === 'true') {
$this->data['links']['alternate'][] = $this->sanitize($links[0]['data'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($links[0]));
$this->data['links']['alternate'][] = $this->sanitize($links[0]['data'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_own_base($links[0]));
}
}

Expand Down Expand Up @@ -1199,11 +1217,11 @@ public function get_enclosures()
// PLAYER
if ($player_parent = $this->get_item_tags(\SimplePie\SimplePie::NAMESPACE_MEDIARSS, 'player')) {
if (isset($player_parent[0]['attribs']['']['url'])) {
$player_parent = $this->sanitize($player_parent[0]['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI);
$player_parent = $this->sanitize($player_parent[0]['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($player_parent[0]));
Copy link
Contributor Author

@Alkarex Alkarex Apr 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were no tests for this section and I have not added any. Help welcome if anyone is motivated.
For sure not providing the base URL will lead to wrong absolute URLs

}
} elseif ($player_parent = $parent->get_channel_tags(\SimplePie\SimplePie::NAMESPACE_MEDIARSS, 'player')) {
if (isset($player_parent[0]['attribs']['']['url'])) {
$player_parent = $this->sanitize($player_parent[0]['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI);
$player_parent = $this->sanitize($player_parent[0]['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($player_parent[0]));
}
}

Expand Down Expand Up @@ -1323,13 +1341,13 @@ public function get_enclosures()
if ($thumbnails = $this->get_item_tags(\SimplePie\SimplePie::NAMESPACE_MEDIARSS, 'thumbnail')) {
foreach ($thumbnails as $thumbnail) {
if (isset($thumbnail['attribs']['']['url'])) {
$thumbnails_parent[] = $this->sanitize($thumbnail['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI);
$thumbnails_parent[] = $this->sanitize($thumbnail['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($thumbnail));
}
}
} elseif ($thumbnails = $parent->get_channel_tags(\SimplePie\SimplePie::NAMESPACE_MEDIARSS, 'thumbnail')) {
foreach ($thumbnails as $thumbnail) {
if (isset($thumbnail['attribs']['']['url'])) {
$thumbnails_parent[] = $this->sanitize($thumbnail['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI);
$thumbnails_parent[] = $this->sanitize($thumbnail['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($thumbnail));
}
}
}
Expand Down Expand Up @@ -1453,7 +1471,7 @@ public function get_enclosures()
if (isset($content['attribs']['']['width'])) {
$width = $this->sanitize($content['attribs']['']['width'], \SimplePie\SimplePie::CONSTRUCT_TEXT);
}
$url = $this->sanitize($content['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI);
$url = $this->sanitize($content['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($content));

// Checking the other optional media: elements. Priority: media:content, media:group, item, channel

Expand Down Expand Up @@ -1712,9 +1730,9 @@ public function get_enclosures()

// PLAYER
if (isset($content['child'][\SimplePie\SimplePie::NAMESPACE_MEDIARSS]['player'])) {
$player = $this->sanitize($content['child'][\SimplePie\SimplePie::NAMESPACE_MEDIARSS]['player'][0]['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI);
$player = $this->sanitize($content['child'][\SimplePie\SimplePie::NAMESPACE_MEDIARSS]['player'][0]['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($content['child'][\SimplePie\SimplePie::NAMESPACE_MEDIARSS]['player']));
} elseif (isset($group['child'][\SimplePie\SimplePie::NAMESPACE_MEDIARSS]['player'])) {
$player = $this->sanitize($group['child'][\SimplePie\SimplePie::NAMESPACE_MEDIARSS]['player'][0]['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI);
$player = $this->sanitize($group['child'][\SimplePie\SimplePie::NAMESPACE_MEDIARSS]['player'][0]['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($group['child'][\SimplePie\SimplePie::NAMESPACE_MEDIARSS]['player']));
} else {
$player = $player_parent;
}
Expand Down Expand Up @@ -1804,14 +1822,14 @@ public function get_enclosures()
// THUMBNAILS
if (isset($content['child'][\SimplePie\SimplePie::NAMESPACE_MEDIARSS]['thumbnail'])) {
foreach ($content['child'][\SimplePie\SimplePie::NAMESPACE_MEDIARSS]['thumbnail'] as $thumbnail) {
$thumbnails[] = $this->sanitize($thumbnail['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI);
$thumbnails[] = $this->sanitize($thumbnail['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($thumbnail));
}
if (is_array($thumbnails)) {
$thumbnails = array_values(array_unique($thumbnails));
}
} elseif (isset($group['child'][\SimplePie\SimplePie::NAMESPACE_MEDIARSS]['thumbnail'])) {
foreach ($group['child'][\SimplePie\SimplePie::NAMESPACE_MEDIARSS]['thumbnail'] as $thumbnail) {
$thumbnails[] = $this->sanitize($thumbnail['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI);
$thumbnails[] = $this->sanitize($thumbnail['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($thumbnail));
}
if (is_array($thumbnails)) {
$thumbnails = array_values(array_unique($thumbnails));
Expand Down Expand Up @@ -1909,7 +1927,7 @@ public function get_enclosures()
$width = $this->sanitize($content['attribs']['']['width'], \SimplePie\SimplePie::CONSTRUCT_TEXT);
}
if (isset($content['attribs']['']['url'])) {
$url = $this->sanitize($content['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI);
$url = $this->sanitize($content['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($content));
}
// Checking the other optional media: elements. Priority: media:content, media:group, item, channel

Expand Down Expand Up @@ -2064,7 +2082,7 @@ public function get_enclosures()
// PLAYER
if (isset($content['child'][\SimplePie\SimplePie::NAMESPACE_MEDIARSS]['player'])) {
if (isset($content['child'][\SimplePie\SimplePie::NAMESPACE_MEDIARSS]['player'][0]['attribs']['']['url'])) {
$player = $this->sanitize($content['child'][\SimplePie\SimplePie::NAMESPACE_MEDIARSS]['player'][0]['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI);
$player = $this->sanitize($content['child'][\SimplePie\SimplePie::NAMESPACE_MEDIARSS]['player'][0]['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($content['child'][\SimplePie\SimplePie::NAMESPACE_MEDIARSS]['player'][0]));
}
} else {
$player = $player_parent;
Expand Down Expand Up @@ -2120,7 +2138,7 @@ public function get_enclosures()
if (isset($content['child'][\SimplePie\SimplePie::NAMESPACE_MEDIARSS]['thumbnail'])) {
foreach ($content['child'][\SimplePie\SimplePie::NAMESPACE_MEDIARSS]['thumbnail'] as $thumbnail) {
if (isset($thumbnail['attribs']['']['url'])) {
$thumbnails[] = $this->sanitize($thumbnail['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI);
$thumbnails[] = $this->sanitize($thumbnail['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($thumbnail));
}
}
if (is_array($thumbnails)) {
Expand Down
13 changes: 9 additions & 4 deletions src/SimplePie.php
Original file line number Diff line number Diff line change
Expand Up @@ -2456,8 +2456,9 @@ public function get_image_tags(string $namespace, string $tag)
/**
* Get the base URL value from the feed
*
* Uses `<xml:base>` if available, otherwise uses the first link in the
* feed, or failing that, the URL of the feed itself.
* Uses `<xml:base>` if available,
* otherwise uses the first 'self' link or the first 'alternate' link of the feed,
* or failing that, the URL of the feed itself.
Comment on lines +2459 to +2461
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Original RSS specification requires URLs to include scheme. I would expect that if the feed has relative URLs the content is taken from the HTML (alternate) version unchanged, and so the links should be resolved relative to that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also reflected in the previous definition, as self link is basically the canonical version of subscribe URL.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though I guess self before alternate might make sense for mrss elements, since those only exist in the feed. (But really, it will depend on how the feed is generated. mrss only mandates direct URL, which is not very useful.)

Copy link
Contributor

@jtojnar jtojnar Apr 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like https://www.rssboard.org/news/151/relative-links discusses this and recommends self link as a fallback, not mentioning alternate at all. (Note that URLs in the comments are displayed as domains, you will need to check the link href for the examples to make sense.)

And for completeness Atom only seems to mention xml:base:

Any element defined by this specification MAY have an xml:base attribute [W3C.REC-xmlbase-20010627]. When xml:base is used in an Atom Document, it serves the function described in section 5.1.1 of RFC3986, establishing the base URI (or IRI) for resolving any relative references found within the effective scope of the xml:base attribute.

And, as mentioned in one of the comments on the RSS article, the xml:base specification also suggests the document URI (i.e. subscribe URL after redirects, which I would expect to match self link) for the fallback:

The attribute xml:base may be inserted in XML documents to specify a base URI other than the base URI of the document or external entity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to follow-up on this. It looks like we agree, right? In other words, there does not seem to be any (new) test in contradiction.

*
* @see get_link
* @see subscribe_url
Expand All @@ -2469,8 +2470,12 @@ public function get_base(array $element = [])
{
if (!empty($element['xml_base_explicit']) && isset($element['xml_base'])) {
return $element['xml_base'];
} elseif ($this->get_link() !== null) {
return $this->get_link();
}
if (($link = $this->get_link(0, 'self')) !== null) {
return $link;
}
if (($link = $this->get_link(0, 'alternate')) !== null) {
return $link;
}

return $this->subscribe_url() ?? '';
Expand Down
32 changes: 32 additions & 0 deletions tests/Unit/EnclosureTest.php
Original file line number Diff line number Diff line change
Expand Up @@ -88,6 +88,38 @@ public static function getLinkProvider(): iterable
,
'http://example.net/link?a=%22b%22&amp;c=%3Cd%3E',
];

yield 'Test RSS 2.0 with channel link and enclosure' => [
<<<XML
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
<channel>
<link>http://example.net/tests/</link>
<item>
<link>/tests/3/</link>
<media:content url="/images/3.jpg" medium="image"></media:content>
</item>
</channel>
</rss>
XML
,
'http://example.net/images/3.jpg',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was wrongly returning /images/3.jpg before this patch

];
Comment on lines +92 to +106
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the original bug I faced, which had me investigate the issue (which turned out to be more severe and complex than anticipated...)


yield 'Test RSS 2.0 with Atom channel link and enclosure' => [
<<<XML
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
<channel>
<atom:link href="http://example.net/tests/" rel="self" type="application/rss+xml" />
<item>
<link>/tests/4/</link>
<media:content url="/images/4.jpg" medium="image"></media:content>
</item>
</channel>
</rss>
XML
,
'http://example.net/images/4.jpg',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was wrongly returning /images/4.jpg before this patch

];
}

/**
Expand Down
44 changes: 44 additions & 0 deletions tests/Unit/ItemTest.php
Original file line number Diff line number Diff line change
Expand Up @@ -3262,6 +3262,50 @@ public static function getPermalinkDataProvider(): array
,
'http://example.com/',
],
'Test RSS 2.0 with channel link and enclosure from another domain' => [
<<<XML
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
<channel>
<link>http://example.net/tests/</link>
<item>
<link>/tests/1/</link>
<media:content url="http://example.com/images/1.jpg" medium="image"></media:content>
</item>
</channel>
</rss>
XML
,
'http://example.net/tests/1/',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was wrongly returning http://example.com/tests/1/ before this patch (side-effect of the enclosure)

],
'Test RSS 2.0 with Atom channel link and relative enclosure' => [
<<<XML
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
<channel>
<atom:link href="http://example.net/tests/" rel="self" type="application/rss+xml" />
<item>
<link>/tests/2/</link>
<media:content url="/images/2.jpg" medium="image"></media:content>
</item>
</channel>
</rss>
XML
,
'http://example.net/tests/2/',
],
'Test RSS 2.0 with xml:base and enclosure from another domain' => [
<<<XML
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
<channel>
<item>
<link xml:base="http://example.net/tests/">/tests/3/</link>
<media:content url="http://example.com/images/3.jpg" medium="image"></media:content>
</item>
</channel>
</rss>
XML
,
'http://example.net/tests/3/',
],
'Test Atom 1.0 xmlbase 1' => [
<<<EOT
<feed xmlns="http://www.w3.org/2005/Atom" xml:base="http://example.com/">
Expand Down