RSS feature request: sanitize/filter out HTML from $description #1539

Open
Mikaela opened this issue May 29, 2023 · 3 comments


@Mikaela
Contributor

Mikaela commented May 29, 2023

The $description of many RSS feeds (e.g. GitHub, GitLab, crt.sh, Tor blog) contains HTML tags, which makes the announcements messy to read.

2023-W21-4 00:25:03 +0300 <@R-66Y> https://blog.torproject.org/new-alpha-release-tor-browser-125a6/ torproject: New Alpha Release: Tor Browser 12.5a6 (Android, Windows, macOS, Linux) <article class="blog-post"> <source media="(min-width:415px)" type="image/webp" /> <source type="image/webp" /> <img class="lead" src="https://blog.torproject.org/new-alpha-release-tor-browser-125a6/lead.png" /> <div class="body"><p>Tor Browser 12.5a6

I think that having Limnoria clean them up and send only the user-visible text would greatly improve the readability, and thus the usability, of the plugin.
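As a rough illustration of the requested behavior, here is a minimal sketch that keeps only the text nodes of an HTML fragment, using Python's standard `html.parser` module. The names `TextExtractor` and `html_to_text` are hypothetical, the sample string is an abbreviated version of the Tor blog output above, and none of this is Limnoria's actual code.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; tags and attributes are ignored.
        self.parts.append(data)


def html_to_text(fragment):
    """Return the text content of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Abbreviated sample, modeled on the Tor blog description quoted above.
sample = (
    '<article class="blog-post"> <img class="lead" src="lead.png" /> '
    '<div class="body"><p>Tor Browser 12.5a6 is now available.</p></div></article>'
)
print(html_to_text(sample))  # Tor Browser 12.5a6 is now available.
```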

Although it uses a different protocol and has different capabilities, the Matrix bot Hookshot has this ability: matrix-org/matrix-hookshot#738

Possibly related:

@progval
Owner

progval commented May 29, 2023

I assume the example you submitted to matrix-org/matrix-hookshot#732 is from https://bodhi.fedoraproject.org/rss/updates/ and the one you have here is from https://blog.torproject.org/feed.xml

In both of these feeds, the description does not contain HTML tags, but escaped HTML tags. For example, respectively:

<item><title>libphidget22-1.15.20230526-1.fc39</title><link>https://bodhi.fedoraproject.org/updates/FEDORA-2023-ffb20eb9af</link><description>&lt;h1&gt;FEDORA-2023-ffb20eb9af&lt;/h1&gt;
&lt;h2&gt;Packages in this update:&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;libphidget22-1.15.20230526-1.fc39&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Update description:&lt;/h2&gt;
&lt;p&gt;Automatic update for libphidget22-1.15.20230526-1.fc39.&lt;/p&gt;
&lt;h5&gt;&lt;strong&gt;Changelog&lt;/strong&gt;&lt;/h5&gt;
&lt;pre&gt;&lt;code&gt;* Mon May 29 2023 Richard Shaw &amp;lt;&lt;a href="mailto:[email protected]"&gt;[email protected]&lt;/a&gt;&amp;gt; - 1.15.20230526-1
- Update to 1.15.20230526.

&lt;/code&gt;&lt;/pre&gt;</description><pubDate>Mon, 29 May 2023 12:06:21 +0000</pubDate></item>

and

<entry><title>New Alpha Release: Tor Browser 12.5a6 (Android, Windows, macOS, Linux)</title><link href="https://blog.torproject.org/new-alpha-release-tor-browser-125a6/" rel="alternate"></link><updated>2023-05-24T00:00:00Z</updated><author><name>richard</name></author><id>urn:uuid:3d4a5097-1fc1-35ce-960d-7c29c6d28676</id><content type="html">&lt;article class="blog-post"&gt;
    &lt;picture&gt;
      &lt;source media="(min-width:415px)" srcset="https://blog.torproject.org/new-alpha-release-tor-browser-125a6/lead.webp" type="image/webp"&gt;
&lt;source srcset="https://blog.torproject.org/new-alpha-release-tor-browser-125a6/lead_small.webp" type="image/webp"&gt;

      &lt;img class="lead" referrerpolicy="no-referrer" loading="lazy" src="https://blog.torproject.org/new-alpha-release-tor-browser-125a6/lead.png"&gt;
    &lt;/picture&gt;
    &lt;div class="body"&gt;&lt;p&gt;Tor Browser 12.5a6 is now available from the &lt;a href="https://www.torproject.org/download/alpha/"&gt;Tor Browser download page&lt;/a&gt; and also from our &lt;a href="https://www.torproject.org/dist/torbrowser/12.5a6/"&gt;distribution directory&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This release updates Firefox 102.11.0esr, including bug fixes, stability improvements and important &lt;a href="https://www.mozilla.org/en-US/security/advisories/mfsa2023-17/"&gt;security updates&lt;/a&gt;. There were no Android-specific security updates to backport from the Firefox 113 release.&lt;/p&gt;
&lt;h2&gt;Build-Signing Infrastructure Updates&lt;/h2&gt;
&lt;p&gt;We are in the process of updating our build signing infrastructure, and unfortunately are unable to ship code-signed 12.5a6 installers for Windows systems currently. Therefore we will not be providing full Window installers for this release. However, automatic build-to-build upgrades from 12.5a4 and 12.5a5 should continue to work as expected.&lt;/p&gt;
&lt;h2&gt;Full changelog&lt;/h2&gt;
&lt;p&gt;The full changelog since &lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser-build/-/raw/main/projects/browser/Bundle-Data/Docs/ChangeLog.txt"&gt;Tor Browser 12.5a5&lt;/a&gt; is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;All Platforms&lt;ul&gt;
&lt;li&gt;Updated Translations&lt;/li&gt;
&lt;li&gt;Updated Go to 11.9.9&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser-build/-/issues/40860"&gt;Bug tor-browser-build#40860&lt;/a&gt;: Improve the transition from the old fontconfig file to the new one&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41728"&gt;Bug tor-browser#41728&lt;/a&gt;: Pin bridges.torproject.org domains to Let's Encrypt's root cert public key&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41738"&gt;Bug tor-browser#41738&lt;/a&gt;: Replace the patch to disable live reload with its preference&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41757"&gt;Bug tor-browser#41757&lt;/a&gt;: Rebase Tor Browser Alpha to 102.11.0esr&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41763"&gt;Bug tor-browser#41763&lt;/a&gt;: TTP-02-003 WP1: Data URI allows JS execution despite safest security level (Low)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41764"&gt;Bug tor-browser#41764&lt;/a&gt;: TTP-02-004 OOS: No user-activation required to download files (Low)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41775"&gt;Bug tor-browser#41775&lt;/a&gt;: Avoid re-defining some macros in nsUpdateDriver.cpp&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Windows + macOS + Linux&lt;ul&gt;
&lt;li&gt;Updated Firefox to 102.11esr&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41607"&gt;Bug tor-browser#41607&lt;/a&gt;: Update "New Circuit" icon&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41736"&gt;Bug tor-browser#41736&lt;/a&gt;: Customize the default CustomizableUI toolbar using CustomizableUI.jsm&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41770"&gt;Bug tor-browser#41770&lt;/a&gt;: Keyboard navigation broken leaving the toolbar tor circuit button&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41777"&gt;Bug tor-browser#41777&lt;/a&gt;: Internally shippped manual does not adapt to RTL languages (it always align to the left)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Windows + Linux&lt;ul&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41654"&gt;Bug tor-browser#41654&lt;/a&gt;: UpdateInfo jumped into Data&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Linux&lt;ul&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41732"&gt;Bug tor-browser#41732&lt;/a&gt;: implement linux font whitelist as defense-in-depth&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41776"&gt;Bug tor-browser#41776&lt;/a&gt;: System fonts are temporarily leaked on Linux after the browser is updated from 12.5a4 or earlier&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Android&lt;ul&gt;
&lt;li&gt;Updated GeckoView to 102.11esr&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Build System&lt;ul&gt;
&lt;li&gt;All Platforms&lt;ul&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser-build/-/issues/33953"&gt;Bug tor-browser-build#33953&lt;/a&gt;: Provide a way for easily updating Go dependencies of projects&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser-build/-/issues/40673"&gt;Bug tor-browser-build#40673&lt;/a&gt;: Avoid building each go module separately&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser-build/-/issues/40818"&gt;Bug tor-browser-build#40818&lt;/a&gt;: Enable wasm target for rust compiler&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser-build/-/issues/40841"&gt;Bug tor-browser-build#40841&lt;/a&gt;: Adapt signing scripts to new signing machines&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser-build/-/issues/40849"&gt;Bug tor-browser-build#40849&lt;/a&gt;: Move Go dependencies to the projects dependent on them, not as a standalone projects&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser-build/-/issues/40856"&gt;Bug tor-browser-build#40856&lt;/a&gt;: Unblock nightly builds&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Windows&lt;ul&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser-build/-/issues/40846"&gt;Bug tor-browser-build#40846&lt;/a&gt;: Temporarily disable Windows signing&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

    &lt;/div&gt;
  &lt;div class="categories"&gt;
    &lt;ul&gt;&lt;li&gt;
        &lt;a href="https://blog.torproject.org/../category/applications"&gt;
          applications
        &lt;/a&gt;
      &lt;/li&gt;&lt;li&gt;
        &lt;a href="https://blog.torproject.org/../category/releases"&gt;
          releases
        &lt;/a&gt;
      &lt;/li&gt;&lt;/ul&gt;
  &lt;/div&gt;
  &lt;/article&gt;
</content></entry>

You can see there are lots of &lt; and &gt; sequences in these feeds, which are the escapes for < and >.
In other words, the descriptions do not actually contain HTML tags, and that's because these feeds are buggy.
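To make the distinction between tags and escaped tags concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`. The item is an abbreviated copy of the Fedora example above. This only shows how a generic XML parser sees the data; it is not how Limnoria or feedparser read feeds.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the Fedora item quoted above.
item_xml = (
    "<item><title>libphidget22-1.15.20230526-1.fc39</title>"
    "<description>&lt;h1&gt;FEDORA-2023-ffb20eb9af&lt;/h1&gt;</description></item>"
)

item = ET.fromstring(item_xml)
description = item.find("description")

# The escaped markup did not create any child elements...
print(len(description))  # 0
# ...it is plain character data that happens to contain < and >.
print(description.text)  # <h1>FEDORA-2023-ffb20eb9af</h1>
```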

Hookshot's PR decided to handle these buggy feeds as their authors intended, but as a result it is now broken on correct feeds. For example, if a feed contained this:

<description>I want the &lt;blink&gt; tag back</description>

then RSS clients are expected to display it like this:

I want the <blink> tag back

but after that PR, Hookshot will display it like this:

I want the tag back

and that is incorrect.

Therefore, I won't change Limnoria's behavior to accommodate buggy feeds while breaking correct feeds. The correct solution is to make the feeds' authors fix their feeds.
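The trade-off described above can be reproduced with a deliberately naive sketch. `naive_strip_tags` and `description_text` are hypothetical helpers written for this illustration; they are not Hookshot's or Limnoria's code, and the first sample is a shortened form of the Fedora description.

```python
import re
import xml.etree.ElementTree as ET


def naive_strip_tags(text):
    """Hypothetical helper: delete anything that looks like <...>."""
    return re.sub(r"<[^>]*>", "", text)


def description_text(xml_source):
    """Return the decoded character data of a <description> element."""
    return ET.fromstring(xml_source).text


# A feed that puts escaped HTML in its description.
escaped_html = description_text(
    "<description>&lt;p&gt;Automatic update&lt;/p&gt;</description>"
)
# A feed whose description is plain text that mentions a tag.
plain_text = description_text(
    "<description>I want the &lt;blink&gt; tag back</description>"
)

# Stripping helps the first feed...
print(naive_strip_tags(escaped_html))  # Automatic update
# ...but silently drops "<blink>" from the second one.
print(repr(naive_strip_tags(plain_text)))
```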

@progval progval closed this as completed May 29, 2023
@progval progval closed this as not planned May 29, 2023
@progval
Owner

progval commented May 29, 2023

Hmm, actually, it seems that feedparser (the library Limnoria uses to parse RSS and Atom feeds) has a heuristic to auto-fix such feeds: it detects whether a description contains &gt; and &lt; but not a single < or >.
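A literal reading of that description could look like the sketch below. This is only an illustration of the check as worded in this comment, not feedparser's actual implementation. The function name `looks_like_escaped_html` is hypothetical, and the sketch assumes the check runs on the description exactly as written in the feed source, before entity decoding.

```python
def looks_like_escaped_html(raw_description):
    """Sketch of the heuristic as described in the comment above.

    Assumption: raw_description is the description as written in the
    feed source, with entities such as &lt; not yet decoded.
    """
    has_escaped_brackets = "&lt;" in raw_description and "&gt;" in raw_description
    has_literal_brackets = "<" in raw_description or ">" in raw_description
    return has_escaped_brackets and not has_literal_brackets


print(looks_like_escaped_html("&lt;h1&gt;FEDORA-2023-ffb20eb9af&lt;/h1&gt;"))  # True
print(looks_like_escaped_html("Automatic update, no markup here"))  # False
print(looks_like_escaped_html("<b>&lt;not escaped consistently&gt;</b>"))  # False
```

Being a heuristic, a check like this cannot tell escaped markup apart from plain text that merely mentions a tag.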

What version of feedparser do you have installed?
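One way to check this from Python, as a sketch: the standard `importlib.metadata` module (Python 3.8 and later) reports the version of an installed distribution. `installed_version` is a hypothetical helper name, and the script should be run with the same interpreter that runs Limnoria, since other environments may have a different feedparser.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string of a distribution, or None."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the version string, or None if feedparser is not installed.
print(installed_version("feedparser"))
```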

@progval progval reopened this May 29, 2023
@Mikaela
Contributor Author

Mikaela commented May 29, 2023

Thank you. So far, I have reported this issue to GitHub.

My python3-feedparser appears to be 5.2.1-3 and it would be upgradable to 6.0.8-1~bpo11+1 from Debian Backports.

@jlu5 jlu5 closed this as completed May 29, 2023
@jlu5 jlu5 reopened this May 29, 2023