
Fetcher: optionally slow down fetching from hosts with repeated exceptions #1106

Open
jnioche opened this issue Oct 14, 2023 · 2 comments


jnioche commented Oct 14, 2023

See NUTCH-2946

For every fetch queue, the fetcher holds a counter of the "exceptions" observed when fetching from the host (or domain or IP) bound to that queue.

To improve the politeness of the crawler, the counter value could be used to dynamically increase the fetch delay for hosts where requests repeatedly fail with exceptions or with HTTP status codes mapped to ProtocolStatus.EXCEPTION (HTTP 403 Forbidden, 429 Too Many Requests, 5xx server errors, etc.). This should of course be optional. The aim is to reduce the load on such hosts before the configured maximum number of exceptions (property fetcher.max.exceptions.per.queue) is reached.
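As a minimal sketch of the idea, an exponential backoff on the per-queue delay could look like the following. The class, method names, multiplier and cap are hypothetical illustrations, not existing fetcher code or configuration:

```java
/**
 * Sketch of an exception-aware delay calculation (hypothetical, for
 * illustration only): the delay grows with the number of exceptions
 * observed for a queue, up to a configurable cap.
 */
public class ExceptionBackoff {

    /**
     * @param baseDelayMs    the configured per-queue fetch delay
     * @param exceptionCount number of exceptions observed for this queue
     * @param multiplier     growth factor per exception, e.g. 1.5 or 2.0
     * @param maxDelayMs     upper bound so the delay never grows unbounded
     * @return the delay to wait before the next fetch from this queue
     */
    public static long nextDelay(long baseDelayMs, int exceptionCount,
            double multiplier, long maxDelayMs) {
        if (exceptionCount <= 0) {
            return baseDelayMs;
        }
        // exponential backoff: delay = base * multiplier^exceptions, capped
        double delay = baseDelayMs * Math.pow(multiplier, exceptionCount);
        return (long) Math.min(delay, maxDelayMs);
    }

    public static void main(String[] args) {
        // with a 5 s base delay and multiplier 2.0:
        // 0 exceptions -> 5 s, 1 -> 10 s, 2 -> 20 s, 3 -> 40 s, 4 -> 60 s (capped)
        for (int n = 0; n <= 4; n++) {
            System.out.println(n + " exceptions -> "
                    + nextDelay(5000, n, 2.0, 60000) + " ms");
        }
    }
}
```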


jnioche commented Oct 15, 2023

apache/nutch#728


jnioche commented Dec 21, 2023

Instead of delaying, which would increase latency, trigger timeouts and fail the tuples, it would be better to assume fetch errors for the URLs in the queue and push them straight to status.
An even better approach would be to have #867 and to send data at the queue level, so that URLs from that queue are held for a while. URLFrontier would be a good match for that.
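For illustration, a rough sketch of what "pushing them straight to status" could look like inside a StormCrawler fetcher bolt. Constants.StatusStreamName, Status.FETCH_ERROR and Metadata are existing StormCrawler classes; the helper class, its queuedTuples argument and the field names are assumptions meant only to show the routing, not actual fetcher code:

```java
import org.apache.storm.task.OutputCollector;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

import com.digitalpebble.stormcrawler.Constants;
import com.digitalpebble.stormcrawler.Metadata;
import com.digitalpebble.stormcrawler.persistence.Status;

/**
 * Hypothetical helper inside a fetcher bolt: instead of fetching from a
 * queue that keeps failing, mark its URLs as fetch errors and route them
 * to the status stream so they are rescheduled by the usual mechanism.
 */
class QueueDrainSketch {

    void drainQueueAsErrors(Iterable<Tuple> queuedTuples, OutputCollector collector) {
        for (Tuple t : queuedTuples) {
            String url = t.getStringByField("url");
            Metadata metadata = (Metadata) t.getValueByField("metadata");
            // emit on the status stream with FETCH_ERROR instead of fetching
            collector.emit(Constants.StatusStreamName, t,
                    new Values(url, metadata, Status.FETCH_ERROR));
            collector.ack(t);
        }
    }
}
```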
