[FR]: Article Extracting #1290

Closed
barolo opened this issue Feb 1, 2024 · 4 comments
Labels
Type-Enhancement This is request for brand new feature.

Comments

@barolo

barolo commented Feb 1, 2024

Brief description of the feature request

This is a follow-up to #399. The script at https://github.com/martinrotter/rssguard/blob/master/resources/scripts/scrapers/scrape-full-articles.py has stopped working (the site it uses for extraction now returns a 404), so I have been experimenting with a different solution.

IMHO, sending the article to an online service and back seems wholly unnecessary.

Would it be possible to integrate something like this?
It is a Node.js script that needs the axios, jsdom and @mozilla/readability npm modules as dependencies, takes the site URL as an argument, and prints the extracted HTML.

const { Readability } = require('@mozilla/readability');
const { JSDOM } = require('jsdom');

// Check if a URL is provided as a command line argument
if (process.argv.length < 3) {
  console.error('Please provide a URL as a command line argument.');
  process.exit(1);
}

const url = process.argv[2];

// Using dynamic import to import axios
(async () => {
  try {
    // Dynamically import axios
    const { default: axios } = await import('axios');

    // Fetch HTML content from the given URL
    const response = await axios.get(url);

    // Create a JSDOM instance with the fetched HTML content
    const doc = new JSDOM(response.data, { url: url });

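    // Use Readability to parse the document and extract the article content
    const reader = new Readability(doc.window.document);
    const article = reader.parse();

    // parse() returns null when no article could be extracted
    if (!article) {
      console.error('Could not extract an article from:', url);
      process.exit(1);
    }

    // Print the article content
    console.log('Title:', article.title);
    console.log('Content:', article.content);
  } catch (error) {
    console.error('Error fetching or parsing the URL:', error.message);
    process.exitCode = 1;
  }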
})();

Could it run when an article is clicked, instead of extracting every URL up front? From what I've seen, there is already a Node-based adblock implemented. Parsing everything would be fine too; would it be enough to just put the script into post-processing?
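The "extract on click" idea could be sketched roughly as below. This is only an illustration, not how RSS Guard does it: `makeLazyExtractor`, `getArticle` and `fakeExtract` are hypothetical names, and the stand-in extractor would be replaced by something like the Readability script above. The wrapper runs extraction only when a URL is first requested and reuses the result afterwards.

```javascript
// Hypothetical helper: wraps any async extractor (url -> Promise<article>)
// so that extraction runs only when an article is first requested.
function makeLazyExtractor(extract) {
  const cache = new Map(); // url -> Promise of the extracted article

  return function getArticle(url) {
    if (!cache.has(url)) {
      // Cache the promise itself so repeated clicks share one extraction.
      const pending = (async () => extract(url))().catch((error) => {
        // Forget failed attempts so a later click can retry.
        cache.delete(url);
        throw error;
      });
      cache.set(url, pending);
    }
    return cache.get(url);
  };
}

// Stand-in extractor for demonstration only; a real one would fetch the
// page and run Readability on it.
const fakeExtract = async (url) => ({
  title: `Title for ${url}`,
  content: '<p>placeholder</p>',
});

const getArticle = makeLazyExtractor(fakeExtract);

// Simulate the user clicking the same article twice: one extraction, one result.
getArticle('https://example.com/post').then((article) => {
  console.log('Title:', article.title);
});
getArticle('https://example.com/post');
```

Caching the promise rather than the resolved value means two quick clicks on the same article do not trigger two fetches.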

@barolo barolo added the Type-Enhancement This is request for brand new feature. label Feb 1, 2024
@barolo barolo changed the title [FR]: Article Extractors [FR]: Article Extracting Feb 1, 2024
@martinrotter
Owner

RSS Guard already integrates readability via its "reader mode" feature.

@martinrotter
Owner

[two images]

@barolo
Author

barolo commented Feb 26, 2024

This is something else; it's not a "reader mode". It extracts the article content without opening the whole page, even if the RSS feed only contains a headline or part of the article.
I use RSS mainly to avoid opening the full web page.

@barolo
Author

barolo commented May 17, 2024

A recent release implements this feature, and it works great.

@barolo barolo closed this as completed May 17, 2024