Remote-write: threshold to skip resharding should be higher #14044
I notice in a similar previous issue @csmarchbanks said #7124 (comment)
Which matches the situation in this case (volume was about 500 series, scraped every 5s).
More context: the machine was occasionally under heavy CPU load; I believe this generated a backlog on the send queue.
is this good @bboreham, or do we need to wait for review from others?
Best to comment on the PR within the PR itself.
okay @bboreham.
I would have suggested/assumed people would drop the batch size |
I don't think that helps. In my example, Prometheus scraped 509 series every 5 seconds; I wanted it to send those 509 series without waiting 5 seconds.
Personally I would still lower the batch size before the send deadline timeout, but even so I think guarding against excessive resharding checks is a valid change. Reviewing the PR again today.
I don't think I am understanding your point. What would you lower?
Something below
This is separate from the issue of the resharding check happening too often when the send deadline is < 5s, which I don't have any issue with merging a fix for.
OK, that case still shows the issue I am talking about, because two times 1 second is way less than the 10s interval it checks at.
That isn't what I asked for; I asked for:
and
I disagree, it matches what I wanted. Bryan
Bryan and I discussed on Slack; over text we'd misunderstood each other. His config changes were definitely valid given the low scrape load he had. Remote write has some gaps when it comes to handling timely sending of data in that kind of scenario; the hard-coding of the reshard check ticker is just one of those gaps. I'll be opening a few issues soon for some things we can try out. A number of people are interested in taking on smaller tasks in RW, and those could be good first issues for them.
I saw a lot of log lines like this:
Context was that we wanted to feed data in a timely manner, so BatchSendDeadline had been reduced to 100ms. The code that generates the message:
prometheus/storage/remote/queue_manager.go
Lines 1021 to 1026 in 94c81bb
is called every 10s (hard-coded), so if BatchSendDeadline is any less than 5s, we stand some chance that we didn't even try to send within that interval.

Proposal
I suggest the check should be within 2 * time.Duration(t.cfg.BatchSendDeadline) + shardUpdateDuration.