Is it possible to scrape all posts in a specific subreddit? #13

Closed
chandrasg opened this issue May 19, 2020 · 5 comments
Labels
question Further information is requested

Comments

@chandrasg

This is not a bug report, but a question about functionality. Can this scraper be used to obtain all posts in a specific subreddit, beyond what the htrcnrs criteria return? It looks like the Reddit API now limits the number of posts we can pull.

@chandrasg chandrasg added the bug Something isn't working label May 19, 2020
@JosephLai241 JosephLai241 added question Further information is requested and removed bug Something isn't working labels May 21, 2020
@JosephLai241
Owner

It is currently not possible to do this with URS. PRAW supported this in the past through Subreddit.submissions(), but that function was removed in PRAW v6.0.0.

In the past, Reddit used the Cloudsearch API to search for posts based on UNIX timestamps. Reddit has since removed that API, which rendered the Subreddit.submissions() function useless, so PRAW dropped it in v6.0.0. PRAW's author wrote a Reddit post in r/redditdev detailing this change.

There is an alternative way to scrape Reddit, however. Pushshift.io looks like a good option, although I have not used it before.

I am considering a future code refactor that would use Pushshift's API instead of PRAW and allow for more versatile scraping. It seems like this is a feature many people would like to have in URS, and a growing number of social media websites are beginning to limit the versatility of their official APIs. I am not familiar with Pushshift, so I will have to do more research before making any changes to current functionality. It is likely that I will refactor URS if Pushshift's API seems more promising than PRAW. Stay tuned for updates!
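
For readers wondering what a Pushshift-based approach to "all posts in a subreddit" might look like, here is a minimal sketch. It is not how URS works. The endpoint URL, the `subreddit`, `before`, `size` and `sort` parameters, and the `data` key in the response are assumptions based on Pushshift's public search API as it was commonly used, and access to that API may be restricted or changed. The helper names `fetch_page` and `iter_submissions` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; not confirmed anywhere in this thread.
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"


def fetch_page(subreddit, before=None, size=100):
    """Fetch one page of submissions, newest first, older than `before`."""
    params = {"subreddit": subreddit, "size": size, "sort": "desc"}
    if before is not None:
        params["before"] = before  # UNIX timestamp used as a cursor
    url = PUSHSHIFT_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)["data"]


def iter_submissions(subreddit, fetch=fetch_page, max_pages=None):
    """Yield submissions by walking backwards through time, page by page."""
    before = None
    pages = 0
    while max_pages is None or pages < max_pages:
        page = fetch(subreddit, before=before)
        if not page:
            break  # an empty page means there is nothing older left
        yield from page
        # The oldest timestamp on this page becomes the next cursor.
        # Posts sharing that exact second could be skipped; a real
        # scraper would need to handle that edge case.
        before = min(post["created_utc"] for post in page)
        pages += 1
```

The `fetch` argument is injectable so the paging logic can be exercised without network access. A real run would also want rate limiting and retry handling, which this sketch omits.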

@filyp

filyp commented Aug 25, 2021

Did you have any success using pushshift? I'd like to try to use their API for scraping all the posts, but it's quite poorly documented.

@JosephLai241
Owner

I have had success using Pushshift and am in the process of integrating the API so that it may be accessed via command-line flags (spoiler: there are a lot of optional flags for granular scrape settings that are associated with the Pushshift scrapers).

I mentioned I would consider an entire refactor using Pushshift in my May 2020 comment. After some research, I realized integrating Pushshift alongside PRAW would be the better choice because each API provides unique features. Livestreaming comments or submissions that are submitted in a Subreddit is not possible with Pushshift, for example. The ability to use both/either API would make a powerful Reddit scraping tool.
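
As an illustration of the livestreaming side that only PRAW covers, the following is a rough sketch and not URS's actual implementation. It assumes `reddit` is an authenticated `praw.Reddit` instance and relies on PRAW's `subreddit(...).stream.submissions()` generator. The function name `stream_new_submissions` is hypothetical.

```python
from itertools import islice


def stream_new_submissions(reddit, subreddit_name, limit=None):
    """Yield (id, title) for submissions as they arrive in a subreddit.

    `reddit` is assumed to be an authenticated praw.Reddit instance.
    With limit=None the generator runs until interrupted.
    """
    stream = reddit.subreddit(subreddit_name).stream.submissions(
        skip_existing=True  # only report posts made after the stream starts
    )
    for submission in islice(stream, limit):
        yield submission.id, submission.title
```

A caller would typically build the client with `praw.Reddit(client_id=..., client_secret=..., user_agent=...)` using their own credentials, then loop over `stream_new_submissions(reddit, "some_subreddit")`. Historical backfill would still have to come from a source such as Pushshift, which is why the two APIs complement each other.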

As you mentioned, the Pushshift documentation is subpar. I would need to do a fair amount of testing and provide clear explanations of how each optional flag interacts with the API within this repository's README before the release.

I believe the integration is almost finished; however, I have stepped away from URS development for now to focus on other things (building a portfolio site and practicing LeetCode), since I am unfortunately still looking for a full-time job. I plan on releasing the Pushshift integration in the next minor version (v3.4.0), although it may take some time before that happens. Keep an eye out for updates!

@filyp

filyp commented Aug 26, 2021

That's good to hear :)

Good luck with the job search!

@Derpitron

Hey. Any updates on this issue request? I need to scrape some of my own posts.

4 participants