Extract feed and item images from more places #220

infogulch · 2024-02-21T20:47:42Z

Additional locations from which image extraction is now attempted:

media:content extension https://www.rssboard.org/media-rss#media-content
The first <img> in content or description

Fixes #133
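The fallback lookup described above could be sketched roughly as follows. This is only an illustration and not the code from this PR: `firstImgSrc` and `pickImage` are hypothetical names, the fallback order shown is an assumption, and a regular expression stands in for a real HTML parser to keep the sketch within the standard library.

```go
package main

import (
	"fmt"
	"regexp"
)

// imgSrcRe captures the src attribute of an <img> tag. A regular expression is
// a simplification for this sketch; a real HTML parser is more robust.
var imgSrcRe = regexp.MustCompile(`(?is)<img\b[^>]*?\ssrc\s*=\s*["']([^"']+)["']`)

// firstImgSrc (hypothetical helper) returns the src of the first <img> found
// in an HTML fragment, or "" if there is none.
func firstImgSrc(fragment string) string {
	m := imgSrcRe.FindStringSubmatch(fragment)
	if m == nil {
		return ""
	}
	return m[1]
}

// pickImage (hypothetical helper) returns the first non-empty candidate,
// so callers list image sources from most to least preferred.
func pickImage(candidates ...string) string {
	for _, c := range candidates {
		if c != "" {
			return c
		}
	}
	return ""
}

func main() {
	explicit := "" // the feed declared no image element
	media := ""    // the feed carried no media:content URL
	content := `<p>Hello</p><img alt="x" src="https://example.com/a.png">`
	description := `<img src="https://example.com/b.png">`

	// Assumed order: explicit image, media:content, content, description.
	fmt.Println(pickImage(explicit, media, firstImgSrc(content), firstImgSrc(description)))
}
```

With the inputs in `main`, the content image is chosen because the two preferred sources are empty.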

testdata/translator/rss/feed_image_-_rss_description.xml

infogulch · 2024-02-21T20:50:16Z

Besides a few tests that have the issue mentioned in the review above, I think this should work fine.

I'd like to get some input on the review above before I convert this from a draft.

mmcdole · 2024-02-23T06:28:26Z

@infogulch I think the fallback image sources you added to the translator function look clean and make sense to me, including the HTML parsing code. I had no clue that many feeds stash their images in there, lol.

mmcdole · 2024-02-23T18:13:06Z

@infogulch update looks good to me.

I might create a separate issue to think about what to do with naked HTML markup within tags.

testdata/translator/rss/feed_item_image_-_rss_channel_item_content.xml

mmcdole · 2024-02-23T18:32:17Z

Thank you for your contribution @infogulch !

Now I just need to tackle #210, and hopefully re-enable gating PRs on passing tests.

@JLugagne

PR mmcdole#220 introduced a failing test for detecting images in the "content" element. It should instead be testing the "content:encoded" element. But that uncovered an issue with how extensions were being detected (the "content" namespace was being detected as an extension namespace). As a more robust way of checking for the "content" namespace, this PR exposes `shared.PrefixForNamspace()` as a public function so it can be used in the rss parser. This provides a solution to PR mmcdole#211 (and includes @JLugagne's test case from that PR). Once the fixes to xml:base handling in mmcdole#222 are merged, this should fix the remaining failing test reported in mmcdole#210.
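To illustrate why matching on the namespace rather than on the bare name "content" matters, here is a minimal sketch using Go's standard `encoding/xml` package. It is not the gofeed parser code and does not use `shared.PrefixForNamspace()`; the function name `contentEncoded` is hypothetical. The namespace URI is the one defined by the RSS 1.0 content module.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// contentNS is the namespace URI of the RSS 1.0 content module.
const contentNS = "http://purl.org/rss/1.0/modules/content/"

// contentEncoded (hypothetical helper) returns the text of the first
// content:encoded element. It matches on the namespace URI, so the prefix
// chosen by the feed author does not matter, and a plain <content> element
// without that namespace is ignored.
func contentEncoded(doc string) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return "", nil // no matching element found
		}
		if err != nil {
			return "", err
		}
		start, ok := tok.(xml.StartElement)
		if !ok || start.Name.Space != contentNS || start.Name.Local != "encoded" {
			continue
		}
		var body string
		if err := dec.DecodeElement(&body, &start); err != nil {
			return "", err
		}
		return body, nil
	}
}

func main() {
	// The feed binds the content namespace to the prefix "c" instead of "content".
	doc := `<rss version="2.0" xmlns:c="http://purl.org/rss/1.0/modules/content/">
<channel><item>
<content>plain element, no namespace</content>
<c:encoded><![CDATA[<img src="https://example.com/a.png">]]></c:encoded>
</item></channel></rss>`

	body, err := contentEncoded(doc)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(body)
}
```

The un-namespaced `<content>` element is skipped, and the element bound to the content module namespace is found even though its prefix is `c`.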

spacecowboy · 2024-02-29T20:54:38Z

I'd like to comment that fetching the first <img> inside body isn't such a great idea.

Take for example the feed from slashdot: https://rss.slashdot.org/Slashdot/slashdotMain

The first image in the body will be https://a.fsdn.com/sd/twitter_icon_large.png, which is 56x20 pixels. That is clearly unsuitable as a thumbnail for an article.

Perhaps it would be better to place the first body image as an extension? Then clients can choose if they want to consider it or not?
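One way to picture this proposal is to keep body images separate from the image the feed declares explicitly, and let the client opt in. The sketch below is purely illustrative: the `Item` type, its `BodyImages` field, and the `thumbnail` function are hypothetical and are not part of gofeed's API.

```go
package main

import "fmt"

// Item is a hypothetical, simplified item type for this sketch only.
type Item struct {
	Image      string   // image declared explicitly by the feed, if any
	BodyImages []string // images found in the HTML body, kept separate
}

// thumbnail returns the explicit image when present. Body images are only
// considered when the client opts in via useBodyImages.
func thumbnail(it Item, useBodyImages bool) string {
	if it.Image != "" {
		return it.Image
	}
	if useBodyImages && len(it.BodyImages) > 0 {
		return it.BodyImages[0]
	}
	return ""
}

func main() {
	it := Item{BodyImages: []string{"https://a.fsdn.com/sd/twitter_icon_large.png"}}

	fmt.Printf("%q\n", thumbnail(it, false)) // client ignores body images
	fmt.Printf("%q\n", thumbnail(it, true))  // client accepts body images
}
```

A client that wants thumbnails of a reasonable size can leave the fallback off, while one that prefers any image over none can turn it on.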

infogulch marked this pull request as ready for review February 23, 2024 02:41

infogulch marked this pull request as draft February 23, 2024 02:43

Extract feed and item images from more places

b1ed5bf

infogulch force-pushed the find-image branch from 1eed9c3 to b1ed5bf Compare February 23, 2024 07:03

infogulch marked this pull request as ready for review February 23, 2024 18:30

mmcdole merged commit 454d6a3 into mmcdole:master Feb 23, 2024
1 check failed

rystaf pushed a commit to rystaf/gofeed that referenced this pull request Feb 25, 2024

Extract feed and item images from more places (mmcdole#220)

bec69a3

cristoper mentioned this pull request Feb 29, 2024

Unit tests are failing #210

Closed

cristoper mentioned this pull request Feb 29, 2024

Fix handling of RSS content:encoded #223

Merged

infogulch deleted the find-image branch February 29, 2024 21:11
