Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix absolutize URL for several cases #861

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

Alkarex
Copy link
Contributor

@Alkarex Alkarex commented Apr 5, 2024

There were a number of bugs related to the fact that Item::get_links() and Item::get_base() call each-other, making a nice mess during initialisation. See tests.

Furthermore, the standard Atom 1.0 self link was not supported for absolutize URL, wrongly using only alternate. In the same PR because otherwise the tests from both PRs would fail.

There were a number of bugs related to the fact that `Item::get_links()` and `Item::get_base()` call each-other, making a nice mess during initialisation. See tests.

Furthermore, the standard Atom `self` link was not supported, wrongly falling back to `alternate`. In the same PR because otherwise the tests from both PRs would fail.
@Alkarex
Copy link
Contributor Author

Alkarex commented Apr 5, 2024

Running the additional tests without the patches returns:

There were 4 failures:

1) SimplePie\Tests\Unit\EnclosureTest::test_get_link with data set "Test RSS 2.0 with channel link and enclosure" ('            <rss version="2.0...</rss>', 'http://example.net/images/3.jpg')
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'http://example.net/images/3.jpg'
+'/images/3.jpg'

/home/alex/GitHub/simplepie/tests/Unit/EnclosureTest.php:40

2) SimplePie\Tests\Unit\EnclosureTest::test_get_link with data set "Test RSS 2.0 with Atom channel link and enclosure" ('            <rss version="2.0...</rss>', 'http://example.net/images/4.jpg')
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'http://example.net/images/4.jpg'
+'/images/4.jpg'

/home/alex/GitHub/simplepie/tests/Unit/EnclosureTest.php:40

3) SimplePie\Tests\Unit\ItemTest::test_get_permalink with data set "Test RSS 2.0 with channel link and enclosure from another domain" ('<rss version="2.0" xmlns:medi...</rss>', 'http://example.net/tests/1/')
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'http://example.net/tests/1/'
+'http://example.com/tests/1/'

/home/alex/GitHub/simplepie/tests/Unit/ItemTest.php:3436

4) SimplePie\Tests\Unit\ItemTest::test_get_permalink with data set "Test RSS 2.0 with Atom channel link and relative enclosure" ('<rss version="2.0" xmlns:atom...</rss>', 'http://example.net/tests/2/')
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'http://example.net/tests/2/'
+'/tests/2/'

/home/alex/GitHub/simplepie/tests/Unit/ItemTest.php:3436

FAILURES!
Tests: 2044, Assertions: 2926, Failures: 4.
Script phpunit handling the test event returned with error code 1

Comment on lines +92 to +106
yield 'Test RSS 2.0 with channel link and enclosure' => [
<<<XML
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
<channel>
<link>http://example.net/tests/</link>
<item>
<link>/tests/3/</link>
<media:content url="/images/3.jpg" medium="image"></media:content>
</item>
</channel>
</rss>
XML
,
'http://example.net/images/3.jpg',
];
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the original bug I faced, which had me investigate the issue (which turned out to be more severe and complex than anticipated...)

</rss>
XML
,
'http://example.net/images/3.jpg',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was wrongly returning /images/3.jpg before this patch

</rss>
XML
,
'http://example.net/images/4.jpg',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was wrongly returning /images/4.jpg before this patch

</rss>
XML
,
'http://example.net/tests/1/',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was wrongly returning http://example.com/tests/1/ before this patch (side-effect of the enclosure)

@@ -1199,11 +1217,11 @@ public function get_enclosures()
// PLAYER
if ($player_parent = $this->get_item_tags(\SimplePie\SimplePie::NAMESPACE_MEDIARSS, 'player')) {
if (isset($player_parent[0]['attribs']['']['url'])) {
$player_parent = $this->sanitize($player_parent[0]['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI);
$player_parent = $this->sanitize($player_parent[0]['attribs']['']['url'], \SimplePie\SimplePie::CONSTRUCT_IRI, $this->get_base($player_parent[0]));
Copy link
Contributor Author

@Alkarex Alkarex Apr 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were no tests for this section and I have not added any. Help welcome if anyone is motivated.
For sure not providing the base URL will lead to wrong absolute URLs

Alkarex added a commit to Alkarex/FreshRSS that referenced this pull request Apr 6, 2024
This is especially relevant for HTML+XPath mode, for which we rely on proper URL "absolutize"

Upstream PR simplepie/simplepie#861
@Alkarex
Copy link
Contributor Author

Alkarex commented Apr 6, 2024

Downstream PR FreshRSS/FreshRSS#6270

Alkarex added a commit to FreshRSS/FreshRSS that referenced this pull request Apr 8, 2024
This is especially relevant for HTML+XPath mode, for which we rely on proper URL "absolutize"

Upstream PR simplepie/simplepie#861
*/
protected function get_own_base(array $element = []): string
{
if (!empty($element['xml_base_explicit']) && isset($element['xml_base'])) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently, xml:base itself should be resolved recursively relative to xml:base in parent elements. Thankfully, this appears to be handled by our own Parser class.

src/Item.php Outdated
Comment on lines 123 to 124
* Similar to `get_base()` but can safely be used during initialisation methods
* such as `get_links()` (`get_base()` and `get_links()` call each-other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will the mutual recursion be prevented when xml:base is not set? Would not it be the same as calling SimplePie::get_base() directly in that case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If xml:base is not set, it will rely on the feed's SimplePie::get_base(), which does not depend on Item::*, so there should not be mutual dependencies anymore, unlike before this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I missed that the comment is talking about Item::get_base().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improved comment 13a398d

Comment on lines +2459 to +2461
* Uses `<xml:base>` if available,
* otherwise uses the first 'self' link or the first 'alternate' link of the feed,
* or failing that, the URL of the feed itself.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Original RSS specification requires URLs to include scheme. I would expect that if the feed has relative URLs the content is taken from the HTML (alternate) version unchanged, and so the links should be resolved relative to that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also reflected in the previous definition, as self link is basically the canonical version of subscribe URL.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though I guess self before alternate might make sense for mrss elements, since those only exist in the feed. (But really, it will depend on how the feed is generated. mrss only mandates direct URL, which is not very useful.)

Copy link
Contributor

@jtojnar jtojnar Apr 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like https://www.rssboard.org/news/151/relative-links discusses this and recommends self link as a fallback, not mentioning alternate at all. (Note that URLs in the comments are displayed as domains, you will need to check the link href for the examples to make sense.)

And for completeness Atom only seems to mention xml:base:

Any element defined by this specification MAY have an xml:base attribute [W3C.REC-xmlbase-20010627]. When xml:base is used in an Atom Document, it serves the function described in section 5.1.1 of RFC3986, establishing the base URI (or IRI) for resolving any relative references found within the effective scope of the xml:base attribute.

And, as mentioned in one of the comments on the RSS article, the xml:base specification also suggests the document URI (i.e. subscribe URL after redirects, which I would expect to match self link) for the fallback:

The attribute xml:base may be inserted in XML documents to specify a base URI other than the base URI of the document or external entity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to follow-up on this. It looks like we agree, right? In other words, there does not seem to be any (new) test in contradiction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants