-
Notifications
You must be signed in to change notification settings - Fork 460
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GCS, S3 timeouts are not retried #7980
Comments
This was referenced May 2, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Describe the bug
We had an incident which appeared as intermittent network flakiness for 1-2 hours. During the affected time, the compactor's cleanup loop had a number of small object fetch calls to GCS that errored with "context deadline exceeded" after one minute. Example:
I saw the mention of "retry failed" in the error, but I wondered why a retry would not paper over what looks like a transient network issue. It turns out that no retry actually occurred: The standard GCS client retries certain classes of idempotent errors automatically, but in this case when the first request takes up the entire deadline before failing, no retries will occur. This was also reported in Google's tracker some time ago:
A tenant's cleanup loop will fail if any of its GCS operation times out. In our configuration (cleanup every 15 mins, store gateway only loads bucket indexes <1hr old) a single tenant would have a partial query outage if four of their consecutive cleanup loops experience this kind of unluckiness. Tenants with more blob store operations to perform at cleanup time will be more impacted.
Store-gateway looks to be victim to this same issue. When a store-gateway -> GCS fetch stalls in the same way (say in the
Series
RPC), store-gateway will wait for the caller's context to expire (by default 2 minutes in querier). What ends up happening is querier's request context is canceled and querier cancels all fan-out requests on other store-gateways and returns an error to query-frontend, where a retry is finally performed. That's a lot of stuff riding on every GCS fetch not timing out.(I am assuming in this issue that exhausting a 1 minute timeout on a fetch of 6KiB is a case of network instability and that a retry would help. Cuz networks are unreliable.)
My cursory read of Thanos/minio's S3 interactions is that it has the same behavior: Certain HTTP errors are retried, but if the failure is the caller's context being canceled of course no retry will occur.
Expected behavior
In the face of transient network blips, I expect individual blob storage operations to be retried.
Environment
Tasks
The text was updated successfully, but these errors were encountered: