/scrape public domain info from social media sites (like privacy policy) #161

Closed
Gbillington1 opened this issue May 19, 2024 · 5 comments

Comments

@Gbillington1

I'm using Firecrawl in my application to scrape the privacy policies of many websites. It works great in most cases, but it fails with a 403 error on sites that Firecrawl considers "social media". I got this error when trying to scrape the privacy policies of Twitter (ironically, x.com works) and Instagram:

https://twitter.com/en/privacy
https://help.instagram.com/155833707900388

I'm assuming that these sites are being blocked by some blacklist on your side, but it would be awesome if I could scrape pages that don't relate to the data on the platforms themselves. I want the privacy policy text, which is public domain info unrelated to the data stored on the social platforms. If you could make it possible to scrape these types of pages, it would be greatly appreciated!
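
For illustration, here is a minimal sketch of one way a blocklist could exempt policy pages on otherwise blocked sites. The host list, the keyword list, and the `is_blocked` helper are all hypothetical; this is not Firecrawl's actual code, just the kind of rule the request has in mind:

```python
from urllib.parse import urlparse

# Hypothetical lists for illustration only; not Firecrawl's real blocklist.
BLOCKED_HOSTS = {"twitter.com", "instagram.com"}
ALLOWED_PATH_KEYWORDS = ("privacy", "terms", "policy", "legal")


def is_blocked(url: str) -> bool:
    """Return True if the URL is on a blocked site and its path is not exempt."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    # Match the blocked domain itself and any of its subdomains.
    on_blocked_site = any(
        host == blocked or host.endswith("." + blocked) for blocked in BLOCKED_HOSTS
    )
    if not on_blocked_site:
        return False

    # Exempt pages whose path mentions a policy-related keyword.
    path = parsed.path.lower()
    return not any(keyword in path for keyword in ALLOWED_PATH_KEYWORDS)


if __name__ == "__main__":
    for candidate in (
        "https://twitter.com/en/privacy",
        "https://help.instagram.com/155833707900388",
    ):
        print(candidate, "blocked" if is_blocked(candidate) else "allowed")
```

Under these assumptions the Twitter URL above would be allowed, because its path contains "privacy". The Instagram help URL would still be blocked, since its path is only a numeric ID, so a keyword rule like this would need an additional exemption (for example, for help subdomains) to cover it.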

@nickscamara
Member

Hey! Thanks for the feedback! Very good point, we will look into it on Monday!

@nickscamara
Member

CCing @rafaelsideguide

@Gbillington1
Author

Awesome, thanks!

@nickscamara
Member

Hey @Gbillington1, I just submitted a PR that might help with this for now. If you have a better idea, or other ways we could do this, let me know.

@nickscamara
Member

@Gbillington1 Merged!
