Ignore errors and keep crawling #193
grab-site does keep crawling on errors like a connection error. If it is exiting too early, it is probably because it has run through the entire queue (perhaps because it didn't discover any URLs on the same domain?)
Thanks. What does exit code 4 mean here?
grab-site effectively becomes a wpull process, so that would be https://wpull.readthedocs.io/en/master/api/errors.html#wpull.errors.ExitStatus.network_failure - meaning that some request had a network failure. It doesn't mean that it exited immediately because of it.
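A minimal sketch of acting on that exit code from a hypothetical wrapper script. Only the meaning of code 4 (wpull's `ExitStatus.network_failure`) is confirmed above, so everything else is handled generically; the `subprocess` usage at the bottom is an illustrative assumption, not part of grab-site itself.

```python
# Sketch: interpret the exit code of a finished grab-site/wpull crawl.
# Only code 4 (network_failure) is confirmed in this thread; other
# non-zero codes are reported generically instead of being guessed.
def describe_exit(code: int) -> str:
    if code == 0:
        return "crawl finished with no errors"
    if code == 4:
        return "crawl finished, but at least one request had a network failure"
    return f"crawl finished with exit code {code} (see wpull's ExitStatus docs)"

# Hypothetical usage (assumes grab-site is installed and on PATH):
# import subprocess
# result = subprocess.run(["grab-site", "https://example.com/"])
# print(describe_exit(result.returncode))
print(describe_exit(4))
```

The point is that a non-zero exit code here is a summary of errors seen during the crawl, not a signal that the crawl aborted partway through.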
Okay, thanks a lot for the explanation. I have another, unrelated question: I've been using grab-site on JavaScript-heavy websites, particularly Wix-powered ones, but grab-site doesn't render the JavaScript UI elements correctly. Is there a way to archive these sites properly? One possible solution I was considering: I've been trying pywb's website-recording functions. It seems like when I visit … Is there a similar way to visit/render sites with a browser, using grab-site?
Yeah, that's a known issue. IIRC grab-site doesn't extract links from JavaScript, so they won't be saved. The JS itself will be saved, as it is a page requisite, but not the URLs it actually contacts. You could use some sort of proxy with grab-site. Or you could use another tool, like https://github.com/internetarchive/brozzler.
@TheTechRobo thanks! I'm new to this, so I'm not sure what you mean by using some sort of proxy with grab-site, or how a proxy would solve this. Would you be so kind as to elaborate?
I mean a proxy that would parse and/or run JavaScript (and then add the links to the finished HTML, or put the links in a text file that can be used with
Just realised: adding links to the HTML is a no-go, since we probably want clean archives. But a text file with URLs would probably be fine. 😸
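The "text file with URLs" idea above could be sketched as a small post-processing step: scan the saved JavaScript for absolute URLs and write them out as a plain list to feed back into a crawl. This is a hypothetical helper, not part of grab-site, and a naive regex will miss URLs that are built up dynamically at runtime; a real proxy-based approach would capture those instead.

```python
import re

# Hypothetical helper: pull absolute URLs out of downloaded JavaScript
# text so they can be saved to a plain URL list for a follow-up crawl.
# A regex only catches literal URLs, not ones assembled at runtime.
URL_RE = re.compile(r"https?://[^\s'\"<>)]+")

def extract_urls(js_text: str) -> list[str]:
    seen: list[str] = []
    for url in URL_RE.findall(js_text):
        if url not in seen:  # de-duplicate while preserving order
            seen.append(url)
    return seen

js = 'fetch("https://example.com/api/items"); var img = "https://example.com/a.png";'
for url in extract_urls(js):
    print(url)
```

The resulting list could then be written one URL per line, which is the usual input format for crawl tools that accept a URL file.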
Hi there.
When I am archiving sites, sometimes grab-site will encounter a URL that it cannot connect to (i.e. connecting to the page times out). From my observation, whenever this happens, the crawl immediately errors out and quits, even though there are more URLs left to crawl.
For example, upon encountering a page that times out, this is printed.
Note that the exit code is 4, not 0, i.e. there was an error.
Is there a way to ignore errors, and keep crawling?