
Ignore errors and keep crawling #193

Open
TowardMyth opened this issue Jul 24, 2021 · 8 comments
@TowardMyth

Hi there.

When I am archiving sites, grab-site will sometimes encounter a URL that it cannot connect to (i.e. connecting to the page times out). From my observation, whenever this happens the crawl immediately errors out and quits, even though there are more URLs left to crawl.

For example, upon encountering a page that times out, this is printed:

ERROR Fetching ‘http://some-site.com:83/_somefolder/’ encountered an error: Connect timed out.
http://some-site.com:83/_somefolder/
ERROR Fetching ‘http://some-site.com:83/_somefolder/’ encountered an error: Connect timed out.
Finished grab some-hash https://some-site.com/ with exit code 4

Note that the exit code is 4, not 0, i.e. there was an error.

Is there a way to ignore errors, and keep crawling?

@ivan (Contributor) commented Jul 24, 2021

grab-site does keep crawling on errors like a connection error. If it is exiting too early, it is probably because it has run through the entire queue (perhaps because it didn't discover any URLs on the same domain?)

@TowardMyth (Author)

Thanks. What does exit code 4 mean here?

@ivan (Contributor) commented Jul 24, 2021

grab-site effectively becomes a wpull process, so that would be https://wpull.readthedocs.io/en/master/api/errors.html#wpull.errors.ExitStatus.network_failure - meaning that some request had a network failure. It doesn't mean that it exited immediately because of it.
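
For illustration, a minimal Python sketch of checking that exit code after a run (the URL is a placeholder; the code-to-meaning mapping follows wpull's documented ExitStatus values):

```python
import subprocess

# Run grab-site on a placeholder URL and inspect the exit code.
result = subprocess.run(["grab-site", "https://some-site.com/"])

if result.returncode == 0:
    print("Crawl finished with no errors.")
elif result.returncode == 4:
    # wpull's network_failure: at least one request failed (e.g. a connect
    # timeout), but the crawl still ran through the rest of the queue.
    print("Crawl finished, but some requests had network failures.")
else:
    print(f"Crawl exited with code {result.returncode}; see wpull's ExitStatus docs.")
```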

@TowardMyth (Author)

Okay thanks a lot for this explanation.

I have another, unrelated question: I've been using grab-site on JavaScript-heavy websites, particularly Wix-powered websites. However, grab-site doesn't render the JavaScript UI elements correctly.

Is there a way to archive these sites properly?

One possible solution I was considering: I've been trying pywb's website recording functions. It seems that when I visit http://localhost:8080/my-web-archive/record/http://example.com/ with my browser, the JavaScript elements are saved properly, but if I visit with wget/curl, they aren't.

Is there a similar way to visit/render sites with a browser, using grab-site?

@TheTechRobo (Contributor)

Yeah, that's a known issue. IIRC grab-site doesn't extract links from JavaScript, so they won't be saved. The JS itself will be saved, since it is a page requisite, but none of the URLs it actually contacts will be.

You could use some sort of proxy with grab-site. Or you could use another tool, like https://github.com/internetarchive/brozzler.

@TowardMyth (Author)

@TheTechRobo thanks! I'm new to this, so I'm not too sure what you mean by using some sort of proxy with grab-site, or how using a proxy would solve this. Would you be so kind as to elaborate?

@TheTechRobo (Contributor) commented Jul 24, 2021

I mean a proxy that would parse and/or run JavaScript (and then add the links to the finished HTML, or put the links in a text file that can be used with grab-site -i). I don't know of any, but if you can find one (or code one), it might work :D

@TheTechRobo (Contributor)

Just realised: adding links to the HTML is a no-go, since we probably want clean archives.

But a text file with URLs would probably be fine. 😸
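
As a rough sketch of that text-file idea (assuming Playwright as the headless browser; the target URL and output filename are placeholders), one could render the page, record every URL it requests, and then feed the list to grab-site -i:

```python
# Render a JavaScript-heavy page in a headless browser (Playwright), record
# every URL it requests, and write them to a text file for `grab-site -i urls.txt`.
# Assumes `pip install playwright` and `playwright install chromium` have been run;
# the target URL is a placeholder.
from playwright.sync_api import sync_playwright

requested_urls = set()

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Collect every URL the page requests while its JavaScript runs.
    page.on("request", lambda request: requested_urls.add(request.url))
    page.goto("https://example-wix-site.com/")
    page.wait_for_load_state("networkidle")
    browser.close()

with open("urls.txt", "w") as f:
    f.write("\n".join(sorted(requested_urls)) + "\n")
```

This keeps the archive itself untouched; only the list of extra URLs handed to grab-site changes.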
