[FR]: Article Extracting #1290

barolo · 2024-02-01T05:18:34Z

Brief description of the feature request

This is a follow-up to #399. The script at https://github.com/martinrotter/rssguard/blob/master/resources/scripts/scrapers/scrape-full-articles.py has stopped working (the site it uses for extraction now returns 404), so I have been experimenting with a different solution.

IMHO, sending the article to an online service and back seems wholly unnecessary.

Would it be possible to integrate something like this?
It's a script that needs the axios, jsdom and @mozilla/readability npm modules as dependencies and takes the site URL as an argument. It prints the extracted HTML.

const { Readability } = require('@mozilla/readability');
const { JSDOM } = require('jsdom');


// Check if a URL is provided as a command line argument
if (process.argv.length < 3) {
  console.error('Please provide a URL as a command line argument.');
  process.exit(1);
}

const url = process.argv[2];

// Using dynamic import to import axios
(async () => {
  try {
    // Dynamically import axios
    const { default: axios } = await import('axios');

    // Fetch HTML content from the given URL
    const response = await axios.get(url);

    // Create a JSDOM instance with the fetched HTML content
    const doc = new JSDOM(response.data, { url: url });

    // Use Readability to parse the document and extract the article content
    const reader = new Readability(doc.window.document);
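    const article = reader.parse();

    // parse() returns null when no article could be extracted
    if (!article) {
      console.error('Could not extract an article from:', url);
      process.exit(1);
    }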

    // Print the article content
    console.log('Title:', article.title);
    console.log('Content:', article.content);
  } catch (error) {
    console.error('Error fetching or parsing the URL:', error.message);
    process.exitCode = 1;
  }
})();

Could it be loaded when an article is clicked, instead of extracting every URL unnecessarily? From what I've seen, a Node-based adblock is already implemented. Parsing everything would be fine too; would it be enough to just put the script into post-processing?
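
The on-click idea could be sketched roughly as follows. This is only an illustration and not RSS Guard's actual API: `createOnDemandExtractor` and the `extract` callback are hypothetical names, and `extract` stands in for whatever runs the Readability script for a single URL. Results are cached per URL so that reopening an article does not trigger a second fetch.

```javascript
// Hypothetical helper: wraps an async `extract(url)` function so that
// extraction runs only when an article is opened, at most once per URL.
function createOnDemandExtractor(extract) {
  const cache = new Map(); // url -> Promise resolving to the extracted HTML

  return function onArticleClicked(url) {
    if (!cache.has(url)) {
      // Cache the promise itself so repeated clicks share one request.
      const pending = (async () => extract(url))().catch((error) => {
        // Do not keep failures in the cache; a later click may succeed.
        cache.delete(url);
        throw error;
      });
      cache.set(url, pending);
    }
    return cache.get(url);
  };
}
```

The trade-off compared with extracting everything in post-processing is a short delay on first open in exchange for not fetching pages that are never read.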

martinrotter · 2024-02-26T09:19:59Z

RSS Guard already integrates readability via its "reader mode" feature.

martinrotter · 2024-02-26T09:20:51Z

barolo · 2024-02-26T12:05:52Z

This is something else. It's not a "reader mode": it extracts the article content without opening the whole page, even if the RSS feed contains only a headline or part of the article.
I use RSS mainly to avoid opening the full webpage.
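
To make the difference concrete, here is a minimal sketch of that behavior. The item shape (`title`, `url`, `contents`) and the function name `applyExtraction` are assumptions made for illustration, not RSS Guard's data model. The extracted HTML would come from something like the Readability script above; when extraction yields nothing, the item keeps whatever the feed provided.

```javascript
// Hypothetical sketch: replace a feed item's headline-only or partial
// content with the extracted article, falling back to the feed's content.
function applyExtraction(item, extractedHtml) {
  if (typeof extractedHtml === 'string' && extractedHtml.trim() !== '') {
    return { ...item, contents: extractedHtml };
  }
  return item; // extraction failed or was empty: keep the feed's content
}

// Usage with made-up data:
const headlineOnly = { title: 'Headline only', url: 'https://example.com/post', contents: '' };
console.log(applyExtraction(headlineOnly, '<p>Full article text</p>').contents);
```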

barolo · 2024-05-17T17:14:35Z

A recent release implements this feature, and it works great.

barolo added the Type-Enhancement label Feb 1, 2024

barolo assigned martinrotter Feb 1, 2024

barolo changed the title ~~[FR]: Article Extractors~~ [FR]: Article Extracting Feb 1, 2024

barolo closed this as completed May 17, 2024
