Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract feed and item images from more places #220

Merged
merged 1 commit into from
Feb 23, 2024

Conversation

infogulch
Copy link
Contributor

@infogulch infogulch commented Feb 21, 2024

Additional locations where images are attempted to be extracted:

Fixes #133

@infogulch
Copy link
Contributor Author

infogulch commented Feb 21, 2024

Besides a few tests that have the issue mentioned above in the review I think this should work fine.

I'd like to get some input on the review above before I convert this from a draft.

@infogulch infogulch marked this pull request as ready for review February 23, 2024 02:41
@infogulch infogulch marked this pull request as draft February 23, 2024 02:43
@mmcdole
Copy link
Owner

mmcdole commented Feb 23, 2024

@infogulch I think the fallback image sources in the translator function you added look clean and make sense to me, including the HTML parsing code. I had no clue that many images stash their images in there, lol.

@mmcdole
Copy link
Owner

mmcdole commented Feb 23, 2024

@infogulch update looks good to me.

I might create a separate issue to think about what to do with naked HTML markup within tags.

@infogulch infogulch marked this pull request as ready for review February 23, 2024 18:30
@mmcdole mmcdole merged commit 454d6a3 into mmcdole:master Feb 23, 2024
1 check failed
@mmcdole
Copy link
Owner

mmcdole commented Feb 23, 2024

Thank you for your contribution @infogulch !

Now I just need to tackle #210, and hopefully turn back on gating of PRs for tests passing.

rystaf pushed a commit to rystaf/gofeed that referenced this pull request Feb 25, 2024
@cristoper cristoper mentioned this pull request Feb 29, 2024
cristoper added a commit to cristoper/gofeed that referenced this pull request Feb 29, 2024
PR mmcdole#220 introduced a failing test for detecting images in the "content"
element. It should instead be testing the "content:encoded" element. But
that uncovered an issue with how extensions were being detected (the
"content" namespace was being detected as an extension namespace).

As a more robust way of checking for the "content" namespace, this PR
exposes `shared.PrefixForNamspace()` as a public function so it can be
used in the rss parser. This provides a solution to PR mmcdole#211 (and
includes @JLugagne's test case from that PR).

Once the fixes to xml:base handling in mmcdole#222 are merged, this should fix
the remaining failing test reported in mmcdole#210.
cristoper added a commit to cristoper/gofeed that referenced this pull request Feb 29, 2024
PR mmcdole#220 introduced a failing test for detecting images in the "content"
element. It should instead be testing the "content:encoded" element. But
that uncovered an issue with how extensions were being detected (the
"content" namespace was being detected as an extension namespace).

As a more robust way of checking for the "content" namespace, this PR
exposes `shared.PrefixForNamspace()` as a public function so it can be
used in the rss parser. This provides a solution to PR mmcdole#211 (and
includes a test based on @JLugagne's test case from that PR).

Once the fixes to xml:base handling in mmcdole#222 are merged, this should fix
the remaining failing test reported in mmcdole#210.
cristoper added a commit to cristoper/gofeed that referenced this pull request Feb 29, 2024
PR mmcdole#220 introduced a failing test for detecting images in the "content"
element. It should instead be testing the "content:encoded" element. But
that uncovered an issue with how extensions were being detected (the
"content" namespace was being detected as an extension namespace).

As a more robust way of checking for the "content" namespace, this PR
exposes `shared.PrefixForNamspace()` as a public function so it can be
used in the rss parser. This should also fix PR mmcdole#211 (and includes
@JLugagne's test case from that PR).

Once the fixes to xml:base handling in mmcdole#222 are merged, this should fix
the remaining failing test reported in mmcdole#210.
@spacecowboy
Copy link
Contributor

I'd like to comment that fetching the first <img> inside body isn't such a great idea.

Take for example the feed from slashdot: https://rss.slashdot.org/Slashdot/slashdotMain

The first image in the body will be https://a.fsdn.com/sd/twitter_icon_large.png which is 56x20 pixels. This is directly unsuitable as a thumbnail for an article.

Perhaps it would be better to place the first body image as an extension? Then clients can choose if they want to consider it or not?

@infogulch infogulch deleted the find-image branch February 29, 2024 21:11
mmcdole pushed a commit that referenced this pull request Mar 1, 2024
PR #220 introduced a failing test for detecting images in the "content"
element. It should instead be testing the "content:encoded" element. But
that uncovered an issue with how extensions were being detected (the
"content" namespace was being detected as an extension namespace).

As a more robust way of checking for the "content" namespace, this PR
exposes `shared.PrefixForNamspace()` as a public function so it can be
used in the rss parser. This should also fix PR #211 (and includes
@JLugagne's test case from that PR).

Once the fixes to xml:base handling in #222 are merged, this should fix
the remaining failing test reported in #210.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Image always nil
3 participants