Cloudflare-protected site responds with 503 Service Temporarily Unavailable #205

Open
rmfkdehd opened this issue Oct 30, 2021 · 5 comments

Comments

@rmfkdehd

I installed grab-site on Ubuntu 20.04 using Nix.

The command I use is 'grab-site https://www.forexfactory.com/forums --concurrency=1'.

Crawls of example.com and other sites completed, but the crawl of 'https://www.forexfactory.com/' failed. I have also tried other URLs under that site.

Below is the log.

Imported /home/user/www.forexfactory.com-forums-2021-10-30-022ebe75/igsets
Imported /home/user/www.forexfactory.com-forums-2021-10-30-022ebe75/ignores
Connected to ws://127.0.0.1:29000
Imported /home/user/www.forexfactory.com-forums-2021-10-30-022ebe75/max_content_length
503 Service Temporarily Unavailable https://www.forexfactory.com/forums
Imported /home/user/www.forexfactory.com-forums-2021-10-30-022ebe75/delay
Imported /home/user/www.forexfactory.com-forums-2021-10-30-022ebe75/concurrency
/nix/store/12ip3ixhj0zbxy54pqqai0hssjrhgddg-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/protocol/http/client.py:185: UserWarning: HTTP session did not complete.
  warnings.warn(_('HTTP session did not complete.'))
200 OK https://www.forexfactory.com/robots.txt
503 Service Temporarily Unavailable https://www.forexfactory.com/sitemap.xml
503 Service Temporarily Unavailable https://www.forexfactory.com/forums
503 Service Temporarily Unavailable https://www.forexfactory.com/sitemap-index.xml
503 Service Temporarily Unavailable https://www.forexfactory.com/sitemap.xml
503 Service Temporarily Unavailable https://www.forexfactory.com/forums
503 Service Temporarily Unavailable https://www.forexfactory.com/sitemap-index.xml
503 Service Temporarily Unavailable https://www.forexfactory.com/sitemap.xml
503 Service Temporarily Unavailable https://www.forexfactory.com/sitemap-index.xml
Finished grab 022ebe7544a3c4163f989e95c54d3d54 https://www.forexfactory.com/forums with exit code 8
Output is in directory:
/home/user/www.forexfactory.com-forums-2021-10-30-022ebe75
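
For anyone triaging a similar log, a quick tally of the status lines shows at a glance whether every request is being refused. In the log above, robots.txt is the only 200 and the other nine responses are 503. The following is a minimal sketch, assuming the console output keeps the "status, reason, URL" line shape seen here; `tally_statuses` is a hypothetical helper, not part of grab-site.

```python
import re
from collections import Counter

# Matches lines such as "503 Service Temporarily Unavailable https://host/path".
# Assumption: the line shape is taken from the log in this issue only.
STATUS_LINE = re.compile(r"^(\d{3}) (.*?) (https?://\S+)$")


def tally_statuses(log_text):
    """Count responses per HTTP status code in grab-site console output."""
    counts = Counter()
    for line in log_text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            counts[int(match.group(1))] += 1
    return counts


# A few lines copied from the log above.
sample = """\
503 Service Temporarily Unavailable https://www.forexfactory.com/forums
200 OK https://www.forexfactory.com/robots.txt
503 Service Temporarily Unavailable https://www.forexfactory.com/sitemap.xml
Finished grab 022ebe7544a3c4163f989e95c54d3d54 https://www.forexfactory.com/forums with exit code 8
"""

counts = tally_statuses(sample)
for status, total in sorted(counts.items()):
    print(status, total)
```

Lines that do not start with a three-digit status, such as the "Imported" and "Finished grab" lines, are skipped.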
@TheTechRobo
Contributor

You sure that the site is up?

Also, are you sure that you aren't banned?

@rmfkdehd
Author

I can still open the site in regular Chrome with no problem at all.

@TheTechRobo
Contributor

Weird. Maybe the site requires JS and bans clients that don't run it?

otherwise idk

@ivan
Contributor

ivan commented Oct 31, 2021

@TheTechRobo please don't speculate like this in the issues, try to reproduce the issue yourself if you're interested in it.

Anyway, I see

DDoS protection by <a rel="noopener noreferrer" href="https://www.cloudflare.com/5xx-error-landing/" target="_blank">Cloudflare</a>

in the resulting WARC when trying to crawl this forum.

Cloudflare is known to block bots that send the wrong TLS fingerprint. It is probably picking up on grab-site's 'incorrect' TLS fingerprint, which does not match that of the browser it claims to be (Firefox). We might be able to fix that in ludios_wpull.
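
To illustrate what a TLS fingerprint mismatch means, here is a rough sketch of a JA3-style fingerprint, one publicly documented scheme that hashes fields of the TLS ClientHello. This thread does not establish which method Cloudflare actually uses, so JA3 is an assumption here, and the ClientHello values below are hypothetical rather than captured from Firefox or grab-site.

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build a JA3-style string: five comma-separated fields, values joined by '-'."""
    fields = [[version], ciphers, extensions, curves, point_formats]
    return ",".join("-".join(str(value) for value in field) for field in fields)


def ja3_digest(version, ciphers, extensions, curves, point_formats):
    """MD5 hex digest of the JA3-style string."""
    text = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()


# Hypothetical ClientHello values, for illustration only.
browser_like = (771, [4865, 4867, 4866], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
crawler_like = (771, [4866, 4867, 4865], [0, 11, 10, 23], [29, 23, 24], [0])

print(ja3_string(*browser_like))
print(ja3_digest(*browser_like) == ja3_digest(*crawler_like))
```

The point of the sketch is that the User-Agent header plays no part in the digest. A client that claims to be Firefox but offers a different set or ordering of ciphers and extensions produces a different fingerprint, and a server can notice the discrepancy.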

@ivan ivan changed the title Specific site crawling errors Cloudflare-protected site responds with 503 Service Temporarily Unavailable Oct 31, 2021
@TheTechRobo
Contributor

> @TheTechRobo please don't speculate like this in the issues, try to reproduce the issue yourself if you're interested in it.

@ivan Gotcha. 👍
