
infinite recursion on offsite links? #194

Open
TheTechRobo opened this issue Jul 27, 2021 · 3 comments

Comments

@TheTechRobo
Contributor

How would I go about enabling that?

@acrois

acrois commented Aug 23, 2021

How deep do you really want to go?

A middle ground ideally would be to support a configurable depth for crawls to avoid finding every page on the internet.

Unless that's your thing... You can try using it as is; going by what it says, it seems like it should do that. I haven't considered that a reasonable job for a single process, though, and haven't experimented with it much beyond basic/plaintext sites.

Personally, I always run with --no-offsite-links (avoid following links to a depth of 1 on other domains). It will crawl immediate prerequisite resources but not any links found past that. Then I set up a full crawl of the site and read the index for off-site URLs, take that list, and divide those sites up into separate crawls. You could call it a system.
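The "divide up those sites" step could be sketched roughly as below. This is only an illustration: the `urls` list is a hypothetical stand-in for whatever off-site URLs you pull out of the crawl's index, and `group_offsite_urls` is a made-up helper name, not part of grab-site.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_offsite_urls(urls, crawled_host):
    """Group URLs by hostname, skipping the host that was already crawled."""
    groups = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname
        # Skip unparseable entries and anything on the site already crawled.
        if host is None or host == crawled_host:
            continue
        groups[host].append(url)
    return dict(groups)


# Hypothetical URL list; in practice this would come from the crawl's index.
urls = [
    "https://example.com/page",
    "https://other.example/a",
    "https://other.example/b",
    "https://third.example/",
]

batches = group_offsite_urls(urls, "example.com")
for host, host_urls in batches.items():
    # Each host would then get its own separate crawl.
    print(host, len(host_urls))
```

Each key in `batches` is a candidate for its own crawl, which keeps any single process from being responsible for the whole list.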

What did you do?
What should happen?
What happened?

@TheTechRobo
Contributor Author

I never really found a solution. It isn't a much-needed feature for me, really; it would just be nice to have a configurable depth, including "inf" for infinite.
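To make the request concrete, here is a toy sketch of what a configurable depth with an "inf" setting could mean. It walks a small in-memory link graph rather than fetching anything, and neither `crawl` nor `max_depth` corresponds to an existing grab-site or wpull option.

```python
import math
from collections import deque


def crawl(start, links, max_depth=math.inf):
    """Breadth-first walk of a link graph, following links up to max_depth hops."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        # Pages at the depth limit are visited, but their links are not followed.
        if depth >= max_depth:
            continue
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order


# Hypothetical link graph: each key links to the pages in its list.
links = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["e"], "e": []}

print(crawl("a", links, max_depth=1))
print(crawl("a", links))  # default of math.inf means no depth limit
```

Using `math.inf` as the default means the unlimited case needs no special branch, since the comparison `depth >= max_depth` is simply never true.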

@JustAnotherArchivist
Contributor

The depth is infinite by default, but grab-site hardcodes the --span-hosts-allow wpull option, which prevents recursion on off-site pages. So you need to reset that to the default empty value. Maybe --wpull-args='--span-hosts --span-hosts-allow ""' would do the trick. Not sure if there are further reasons that would prevent the recursion though.
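The quoting in that suggestion is easy to get wrong, so here is a small sketch of how the option string breaks into arguments. It assumes the value of --wpull-args ends up being tokenized shell-style, which is not confirmed here, and the URL is a placeholder. Like the suggestion itself, the resulting command is untested.

```python
import shlex

# The option string suggested above.
wpull_args = '--span-hosts --span-hosts-allow ""'

# Shell-style tokenization keeps the empty quoted string as its own argument,
# which is what would reset --span-hosts-allow to an empty value.
tokens = shlex.split(wpull_args)
print(tokens)

# Hypothetical invocation; https://example.com/ is a placeholder URL.
command = ["grab-site", f"--wpull-args={wpull_args}", "https://example.com/"]

# shlex.join shows a shell-safe way to type the same command by hand.
print(shlex.join(command))
```

Passing the command as a list, for example to `subprocess.run`, avoids a second layer of shell quoting around the inner `""`.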
