
Ignore errors and keep crawling #193

Open
TowardMyth opened this issue Jul 24, 2021 · 8 comments
@TowardMyth

Hi there.

When I am archiving sites, grab-site will sometimes encounter a URL that it cannot connect to (i.e. connecting to the page times out). From my observation, whenever this happens the crawl immediately errors out and quits, even though there are more URLs left to crawl.

For example, upon encountering a page that times out, this is printed:

ERROR Fetching ‘http://some-site.com:83/_somefolder/’ encountered an error: Connect timed out.
http://some-site.com:83/_somefolder/
ERROR Fetching ‘http://some-site.com:83/_somefolder/’ encountered an error: Connect timed out.
Finished grab some-hash https://some-site.com/ with exit code 4

Note that the exit code is 4, not 0, i.e. there was an error.

Is there a way to ignore errors, and keep crawling?

@ivan (Contributor) commented Jul 24, 2021

grab-site does keep crawling on errors like a connection error. If it is exiting too early, it is probably because it has run through the entire queue (perhaps because it didn't discover any URLs on the same domain?)

@TowardMyth (Author)

Thanks. What does exit code 4 mean here?

@ivan (Contributor) commented Jul 24, 2021

grab-site effectively becomes a wpull process, so that would be https://wpull.readthedocs.io/en/master/api/errors.html#wpull.errors.ExitStatus.network_failure - meaning that some request had a network failure. It doesn't mean that it exited immediately because of it.
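
For illustration, a minimal Python sketch of checking that exit code after a run (the URL is a placeholder; the code-to-meaning mapping follows wpull's documented ExitStatus values):

```python
import subprocess

# Run grab-site on a placeholder URL and inspect the exit code.
result = subprocess.run(["grab-site", "https://some-site.com/"])

if result.returncode == 0:
    print("Crawl finished with no errors.")
elif result.returncode == 4:
    # wpull's network_failure: at least one request failed (e.g. a connect
    # timeout), but the crawl still ran through the rest of the queue.
    print("Crawl finished, but some requests had network failures.")
else:
    print(f"Crawl exited with code {result.returncode}; see wpull's ExitStatus docs.")
```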

@TowardMyth (Author)

Okay thanks a lot for this explanation.

I have another, unrelated question: I've been using grab-site on JavaScript-heavy websites, particularly Wix-powered websites. However, grab-site doesn't render the JavaScript UI elements correctly.

Is there a way to archive these sites properly?

One possible solution I was considering: I've been trying pywb's website recording functions. It seems that when I visit http://localhost:8080/my-web-archive/record/http://example.com/ with my browser, the JavaScript elements are saved properly, but if I visit with wget/curl, they aren't.

Is there a similar way to visit/render sites with a browser, using grab-site?

@TheTechRobo (Contributor)

Yeah, that's a known issue. IIRC grab-site doesn't extract links from JavaScript, so they won't be saved. The JS itself will be saved, since it is a page requisite, but none of the URLs it actually contacts will be.

You could use some sort of proxy with grab-site. Or you could use another tool, like https://github.com/internetarchive/brozzler.

@TowardMyth (Author)

@TheTechRobo thanks! I'm new to this, so I'm not too sure what you mean by using some sort of proxy with grab-site, or how using a proxy would solve this. Would you be so kind as to elaborate?

@TheTechRobo (Contributor) commented Jul 24, 2021

I mean a proxy that would parse and/or run JavaScript (and then add the links to the finished HTML, or put the links in a text file that can be used with grab-site -i). I don't know of any, but if you can find one (or code one), it might work :D

@TheTechRobo (Contributor)

Just realised: adding links to the HTML is a no-go, since we probably want clean archives.

But a text file with URLs would probably be fine. 😸
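
As a rough sketch of that text-file idea (assuming Playwright as the headless browser; the target URL and output filename are placeholders), one could render the page, record every URL it requests, and then feed the list to grab-site -i:

```python
# Render a JavaScript-heavy page in a headless browser (Playwright), record
# every URL it requests, and write them to a text file for `grab-site -i urls.txt`.
# Assumes `pip install playwright` and `playwright install chromium` have been run;
# the target URL is a placeholder.
from playwright.sync_api import sync_playwright

requested_urls = set()

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Collect every URL the page requests while its JavaScript runs.
    page.on("request", lambda request: requested_urls.add(request.url))
    page.goto("https://example-wix-site.com/")
    page.wait_for_load_state("networkidle")
    browser.close()

with open("urls.txt", "w") as f:
    f.write("\n".join(sorted(requested_urls)) + "\n")
```

This keeps the archive itself untouched; only the list of extra URLs handed to grab-site changes.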
