Fix absolutize URL for several cases #861

Alkarex · 2024-04-05T22:17:36Z

Several bugs stemmed from the fact that `Item::get_links()` and `Item::get_base()` call each other, which made a nice mess during initialisation. See tests.

Furthermore, the standard Atom 1.0 `self` link was not supported when absolutizing URLs; only `alternate` was used. Both fixes are in the same PR because otherwise the tests from each PR would fail.

Alkarex · 2024-04-05T22:22:33Z

Running the additional tests without the patches returns:

There were 4 failures:

1) SimplePie\Tests\Unit\EnclosureTest::test_get_link with data set "Test RSS 2.0 with channel link and enclosure" ('            <rss version="2.0...</rss>', 'http://example.net/images/3.jpg')
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'http://example.net/images/3.jpg'
+'/images/3.jpg'

/home/alex/GitHub/simplepie/tests/Unit/EnclosureTest.php:40

2) SimplePie\Tests\Unit\EnclosureTest::test_get_link with data set "Test RSS 2.0 with Atom channel link and enclosure" ('            <rss version="2.0...</rss>', 'http://example.net/images/4.jpg')
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'http://example.net/images/4.jpg'
+'/images/4.jpg'

/home/alex/GitHub/simplepie/tests/Unit/EnclosureTest.php:40

3) SimplePie\Tests\Unit\ItemTest::test_get_permalink with data set "Test RSS 2.0 with channel link and enclosure from another domain" ('<rss version="2.0" xmlns:medi...</rss>', 'http://example.net/tests/1/')
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'http://example.net/tests/1/'
+'http://example.com/tests/1/'

/home/alex/GitHub/simplepie/tests/Unit/ItemTest.php:3436

4) SimplePie\Tests\Unit\ItemTest::test_get_permalink with data set "Test RSS 2.0 with Atom channel link and relative enclosure" ('<rss version="2.0" xmlns:atom...</rss>', 'http://example.net/tests/2/')
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'http://example.net/tests/2/'
+'/tests/2/'

/home/alex/GitHub/simplepie/tests/Unit/ItemTest.php:3436

FAILURES!
Tests: 2044, Assertions: 2926, Failures: 4.
Script phpunit handling the test event returned with error code 1

Alkarex · 2024-04-05T22:40:52Z

tests/Unit/EnclosureTest.php

+ yield 'Test RSS 2.0 with channel link and enclosure' => [
+ <<<XML
+ <rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
+ <channel>
+ <link>http://example.net/tests/</link>
+ <item>
+ <link>/tests/3/</link>
+ <media:content url="/images/3.jpg" medium="image"></media:content>
+ </item>
+ </channel>
+ </rss>
+XML
+ ,
+ 'http://example.net/images/3.jpg',
+ ];


This was the original bug I faced, which led me to investigate the issue (which turned out to be more severe and complex than anticipated...)
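
For readers who want to see the expected behaviour in isolation, here is a minimal sketch of resolving a reference against a base URL. `sketch_absolutize()` is a hypothetical helper written for this example, not SimplePie's own resolver, and it deliberately skips dot-segment removal and other RFC 3986 details.

```php
<?php

/**
 * Hypothetical helper for illustration only (not SimplePie's implementation).
 * Resolves a reference against a base URL for the simple cases seen in the tests.
 */
function sketch_absolutize(string $reference, string $base): string
{
    // A reference that already has a scheme is returned unchanged.
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $reference) === 1) {
        return $reference;
    }

    $parts = parse_url($base);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        // Without a usable base there is nothing to resolve against.
        return $reference;
    }

    // Scheme-relative reference: only the scheme comes from the base.
    if (strpos($reference, '//') === 0) {
        return $parts['scheme'] . ':' . $reference;
    }

    $origin = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }

    // Absolute-path reference: keep the origin, replace the path.
    if (strpos($reference, '/') === 0) {
        return $origin . $reference;
    }

    // Relative-path reference: append to the directory part of the base path.
    $path = $parts['path'] ?? '/';
    $directory = substr($path, 0, (int) strrpos($path, '/') + 1);

    return $origin . $directory . $reference;
}
```

With the channel link from the test above as base, `sketch_absolutize('/images/3.jpg', 'http://example.net/tests/')` gives `http://example.net/images/3.jpg`, which is the value the new test expects.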

Alkarex · 2024-04-05T22:41:52Z

tests/Unit/EnclosureTest.php

+ </rss>
+XML
+ ,
+ 'http://example.net/images/3.jpg',


Was wrongly returning /images/3.jpg before this patch

Alkarex · 2024-04-05T22:42:20Z

tests/Unit/EnclosureTest.php

+ </rss>
+XML
+ ,
+ 'http://example.net/images/4.jpg',


Was wrongly returning /images/4.jpg before this patch

Alkarex · 2024-04-05T22:43:48Z

tests/Unit/ItemTest.php

+</rss>
+XML
+ ,
+ 'http://example.net/tests/1/',


Was wrongly returning http://example.com/tests/1/ before this patch (side-effect of the enclosure)

Alkarex · 2024-04-05T22:46:51Z

src/Item.php

@@ -1199,11 +1217,11 @@ public function get_enclosures()
 // PLAYER
 if ($player_parent = $this->get_item_tags(\SimplePie\SimplePie::NAMESPACE_MEDIARSS, 'player')) {
 if (isset($player_parent[0]['attribs']['']['url'])) {
- $player_parent = $this->sanitize($player_parent[0]['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI);
+ $player_parent = $this->sanitize($player_parent[0]['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($player_parent[0]));


There were no tests for this section and I have not added any. Help is welcome if anyone is motivated.
Not providing the base URL will certainly lead to wrong absolute URLs.


Alkarex · 2024-04-06T14:56:46Z

Downstream PR FreshRSS/FreshRSS#6270

This is especially relevant for HTML+XPath mode, for which we rely on proper URL "absolutize".

Upstream PR: simplepie/simplepie#861

jtojnar · 2024-04-10T08:31:12Z

src/Item.php

+ */
+ protected function get_own_base(array $element = []): string
+ {
+ if (!empty($element['xml_base_explicit']) && isset($element['xml_base'])) {


Apparently, xml:base itself should be resolved recursively relative to xml:base in parent elements. Thankfully, this appears to be handled by our own Parser class.
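
To illustrate what "resolved recursively" means here, the following sketch folds a chain of `xml:base` values, outermost element first, into one effective base. It is not the Parser class's code: `sketch_effective_base()` and the tiny `$demoResolve` closure are assumptions made for this example, and the closure only handles absolute URLs and references relative to a base ending in `/`.

```php
<?php

/**
 * Hypothetical illustration: fold a chain of xml:base values, outermost first,
 * into one effective base. $resolve stands in for any RFC 3986 resolver.
 *
 * @param string[] $xmlBases xml:base attribute values from root to current element
 * @param callable(string, string): string $resolve (reference, base) => absolute URL
 */
function sketch_effective_base(string $documentUrl, array $xmlBases, callable $resolve): string
{
    $base = $documentUrl;
    foreach ($xmlBases as $xmlBase) {
        // Each xml:base is itself resolved against the base in effect so far.
        $base = $resolve($xmlBase, $base);
    }

    return $base;
}

// Deliberately tiny resolver for the demo: absolute URLs pass through,
// anything else is appended to a base that is assumed to end in "/".
$demoResolve = static function (string $reference, string $base): string {
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $reference) === 1) {
        return $reference;
    }

    return $base . $reference;
};

$effective = sketch_effective_base(
    'http://example.net/feed.xml',
    ['http://example.net/tests/', 'media/'],
    $demoResolve
);
// $effective is 'http://example.net/tests/media/'
```

The inner `media/` is resolved against the outer `http://example.net/tests/`, not against the document URL, which is the behaviour the comment above refers to.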

jtojnar · 2024-04-10T08:35:14Z

src/Item.php

+ * Similar to `get_base()` but can safely be used during initialisation methods
+ * such as `get_links()` (`get_base()` and `get_links()` call each-other)


How will the mutual recursion be prevented when xml:base is not set? Wouldn't it be the same as calling SimplePie::get_base() directly in that case?

If xml:base is not set, it will rely on the feed's SimplePie::get_base(), which does not depend on Item::*, so there should no longer be any mutual dependency, unlike before this PR.

Ah, I missed that the comment is talking about Item::get_base().

Improved comment 13a398d
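
The dependency structure discussed in this thread can be sketched with two simplified stand-in classes. `SketchFeed` and `SketchItem` are hypothetical and only loosely modelled on the diff above; the link handling is reduced to absolute-path references so that the example stays short.

```php
<?php

// Simplified stand-ins for illustration, not the real SimplePie classes.
final class SketchFeed
{
    public function get_base(): string
    {
        // Does not depend on any item, so it is safe to call at any time.
        return 'http://example.net/tests/';
    }
}

class SketchItem
{
    /** @var SketchFeed */
    private $feed;

    /** @var array<string, mixed> */
    private $element;

    /** @param array<string, mixed> $element */
    public function __construct(SketchFeed $feed, array $element)
    {
        $this->feed = $feed;
        $this->element = $element;
    }

    // Safe during initialisation: never calls get_links().
    protected function get_own_base(): string
    {
        if (!empty($this->element['xml_base_explicit']) && isset($this->element['xml_base'])) {
            return $this->element['xml_base'];
        }

        return $this->feed->get_base();
    }

    /** @return string[] */
    public function get_links(): array
    {
        $base = $this->get_own_base();
        $origin = parse_url($base, PHP_URL_SCHEME) . '://' . parse_url($base, PHP_URL_HOST);

        $links = [];
        foreach ($this->element['links'] ?? [] as $href) {
            // Sketch only: absolute-path references get the origin of the base prepended.
            $links[] = strpos($href, '/') === 0 ? $origin . $href : $href;
        }

        return $links;
    }

    public function get_base(): string
    {
        if (!empty($this->element['xml_base_explicit']) && isset($this->element['xml_base'])) {
            return $this->element['xml_base'];
        }

        // May use the item's links, because get_links() no longer calls back into get_base().
        $links = $this->get_links();

        return $links[0] ?? $this->feed->get_base();
    }
}
```

In this sketch the call graph is `get_base()` → `get_links()` → `get_own_base()` → feed, with no edge back from `get_links()` to `get_base()`, so neither method can observe the other in a half-initialised state.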

jtojnar · 2024-04-10T08:50:08Z

src/SimplePie.php

+ * Uses `<xml:base>` if available,
+ * otherwise uses the first 'self' link or the first 'alternate' link of the feed,
+ * or failing that, the URL of the feed itself.


The original RSS specification requires URLs to include a scheme. I would expect that if the feed has relative URLs, the content was taken unchanged from the HTML (alternate) version, and so the links should be resolved relative to that.

This is also reflected in the previous definition, as the self link is basically the canonical version of the subscribe URL.

Though I guess self before alternate might make sense for mrss elements, since those only exist in the feed. (But really, it will depend on how the feed is generated. mrss only mandates a direct URL, which is not very useful.)

Looks like https://www.rssboard.org/news/151/relative-links discusses this and recommends self link as a fallback, not mentioning alternate at all. (Note that URLs in the comments are displayed as domains, you will need to check the link href for the examples to make sense.)

And for completeness, Atom only seems to mention xml:base:

> Any element defined by this specification MAY have an xml:base attribute [W3C.REC-xmlbase-20010627]. When xml:base is used in an Atom Document, it serves the function described in section 5.1.1 of RFC3986, establishing the base URI (or IRI) for resolving any relative references found within the effective scope of the xml:base attribute.

And, as mentioned in one of the comments on the RSS article, the xml:base specification also suggests the document URI (i.e. subscribe URL after redirects, which I would expect to match self link) for the fallback:

> The attribute xml:base may be inserted in XML documents to specify a base URI other than the base URI of the document or external entity.

Just to follow up on this: it looks like we agree, right? In other words, there does not seem to be any (new) test that contradicts this.
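
For reference, the fallback order stated in the docblock under review can be written out as a small function. This is a sketch under assumptions: `sketch_feed_base()` and its parameters are invented for the example, and the `self` before `alternate` order simply follows the docblock wording quoted above, which is the very point being discussed in this thread.

```php
<?php

/**
 * Hypothetical sketch of the fallback order described in the docblock;
 * the function and parameter names are assumptions made for this example.
 *
 * @param string|null $xmlBase explicit xml:base value, if any
 * @param array<string, string[]> $links link URLs keyed by rel ('self', 'alternate')
 * @param string $feedUrl URL the feed was fetched from
 */
function sketch_feed_base(?string $xmlBase, array $links, string $feedUrl): string
{
    // 1. An explicit xml:base always wins.
    if ($xmlBase !== null && $xmlBase !== '') {
        return $xmlBase;
    }

    // 2. Otherwise the first 'self' link, then the first 'alternate' link.
    foreach (['self', 'alternate'] as $rel) {
        if (!empty($links[$rel][0])) {
            return $links[$rel][0];
        }
    }

    // 3. Failing that, the URL of the feed itself.
    return $feedUrl;
}
```

Swapping the two entries in the `['self', 'alternate']` list is all it would take to express the alternate-first order, which makes the trade-off discussed above easy to compare against the tests.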

pull-request-size bot added the size/L label Apr 5, 2024

Alkarex added 2 commits April 6, 2024 00:30

Minor style

ceae5eb

Fix PHPStan

8541e5d


Alkarex added a commit to Alkarex/FreshRSS that referenced this pull request Apr 6, 2024

Fix SimplePie absolutize URL for several cases

01f520a


Alkarex mentioned this pull request Apr 6, 2024

Fix SimplePie absolutize URL for several cases FreshRSS/FreshRSS#6270

Merged

Alkarex added a commit to FreshRSS/FreshRSS that referenced this pull request Apr 8, 2024

Fix SimplePie absolutize URL for several cases (#6270)

6e12781


jtojnar reviewed Apr 10, 2024


Improved comment

13a398d
