How does VM solve "incompatible bucket ranges" problem? #74

Open
zhumengzhu opened this issue May 23, 2024 · 3 comments

@zhumengzhu

Hi there,

I came across the blog post titled Improving Histogram Usability for Prometheus and Grafana where the author claims to have resolved "Issue #3: incompatible bucket ranges." I am curious about the specific approach taken to solve this problem. Could you please provide more details on how this issue was addressed? Additionally, if possible, point me to the relevant source code where this solution has been implemented.

Moreover, I have encountered a problem in our production environment related to this topic. When the range of buckets varies, the calculated percentile data seems to be completely inaccurate. Below, I am providing the relevant data for your reference.

When using the following expression to calculate the final percentile value, the result returned is 30000, which is incorrect:

(histogram_quantile(0.99, sum (rate(http_client_requests_seconds_bucket{env="staging",project_name="xx"}[2m])) by (le, uri, method, project_name)) * 1000)

Data

[{"metric":{"le":"+Inf","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"+Inf","instance":"10.90.42.35:8081"},"value":[1716366921.957,"9166"]},{"metric":{"le":"30.0","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"28.633115306","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"22.906492245","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"17.179869184","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"15.748213416","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"14.316557651","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"12.884901886","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"11.453246121","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"10.021590356","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"10.0","instance":"10.90.42.35:8081"},"value":[1716366921.957,"9166"]},{"metric":{"le":"8.589934591","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"7.158278826","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"6.0","instance":"10.90.42.35:8081"},"value":[1716366921.957,"9166"]},{"metric":{"le":"5.726623061","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"4.294967296","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"4.0","instance":"10.90.42.35:8081"},"value":[1716366921.957,"9166"]},{"metric":{"le":"3.937053352","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"3.579139411","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"3.22122547","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"3.0","instance":"10.90.42.35:8081"},"value":[1716366921.957,"9166"]},{"metric":{"le":"2.86331152
9","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"2.505397588","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"2.147483647","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"2.0","instance":"10.90.42.35:8081"},"value":[1716366921.957,"9166"]},{"metric":{"le":"1.789569706","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"1.5","instance":"10.90.42.35:8081"},"value":[1716366921.957,"9166"]},{"metric":{"le":"1.431655765","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"1.073741824","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"1.0","instance":"10.90.42.35:8081"},"value":[1716366921.957,"9166"]},{"metric":{"le":"0.984263336","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"0.894784851","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"0.805306366","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"0.768","instance":"10.90.42.35:8081"},"value":[1716366921.957,"9166"]},{"metric":{"le":"0.715827881","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"0.64","instance":"10.90.42.35:8081"},"value":[1716366921.957,"9166"]},{"metric":{"le":"0.626349396","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"0.536870911","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"0.512","instance":"10.90.42.35:8081"},"value":[1716366921.957,"9165"]},{"metric":{"le":"0.447392426","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"0.384","instance":"10.90.42.35:8081"},"value":[1716366921.957,"9165"]},{"metric":{"le":"0.357913941","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"0.268435456","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"0.256","instance":"10.90.42.35:8081"
},"value":[1716366921.957,"9164"]},{"metric":{"le":"0.246065832","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"0.223696211","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"0.20132659","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"0.192","instance":"10.90.42.35:8081"},"value":[1716366921.957,"9161"]},{"metric":{"le":"0.178956969","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"0.156587348","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"0.134217727","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"0.128","instance":"10.90.42.35:8081"},"value":[1716366921.957,"9157"]},{"metric":{"le":"0.111848106","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"0.089478485","instance":"10.90.39.2:8081"},"value":[1716366921.957,"26"]},{"metric":{"le":"0.067108864","instance":"10.90.39.2:8081"},"value":[1716366921.957,"25"]},{"metric":{"le":"0.064","instance":"10.90.42.35:8081"},"value":[1716366921.957,"9140"]},{"metric":{"le":"0.061516456","instance":"10.90.39.2:8081"},"value":[1716366921.957,"25"]},{"metric":{"le":"0.055924051","instance":"10.90.39.2:8081"},"value":[1716366921.957,"25"]},{"metric":{"le":"0.050331646","instance":"10.90.39.2:8081"},"value":[1716366921.957,"24"]},{"metric":{"le":"0.044739241","instance":"10.90.39.2:8081"},"value":[1716366921.957,"23"]},{"metric":{"le":"0.04","instance":"10.90.42.35:8081"},"value":[1716366921.957,"9126"]},{"metric":{"le":"0.039146836","instance":"10.90.39.2:8081"},"value":[1716366921.957,"23"]},{"metric":{"le":"0.033554431","instance":"10.90.39.2:8081"},"value":[1716366921.957,"23"]},{"metric":{"le":"0.032","instance":"10.90.42.35:8081"},"value":[1716366921.957,"9113"]},{"metric":{"le":"0.027962026","instance":"10.90.39.2:8081"},"value":[1716366921.957,"23"]},{"metric":{"le":"0.024","instance":"10.90.42.35:8081"},"value":[1716366
921.957,"9097"]},{"metric":{"le":"0.022369621","instance":"10.90.39.2:8081"},"value":[1716366921.957,"21"]},{"metric":{"le":"0.016777216","instance":"10.90.39.2:8081"},"value":[1716366921.957,"4"]},{"metric":{"le":"0.015379112","instance":"10.90.39.2:8081"},"value":[1716366921.957,"2"]},{"metric":{"le":"0.013981011","instance":"10.90.39.2:8081"},"value":[1716366921.957,"0"]},{"metric":{"le":"0.01258291","instance":"10.90.39.2:8081"},"value":[1716366921.957,"0"]},{"metric":{"le":"0.012","instance":"10.90.42.35:8081"},"value":[1716366921.957,"7339"]},{"metric":{"le":"0.011184809","instance":"10.90.39.2:8081"},"value":[1716366921.957,"0"]},{"metric":{"le":"0.009786708","instance":"10.90.39.2:8081"},"value":[1716366921.957,"0"]},{"metric":{"le":"0.008388607","instance":"10.90.39.2:8081"},"value":[1716366921.957,"0"]},{"metric":{"le":"0.008","instance":"10.90.42.35:8081"},"value":[1716366921.957,"1"]},{"metric":{"le":"0.006990506","instance":"10.90.39.2:8081"},"value":[1716366921.957,"0"]},{"metric":{"le":"0.006","instance":"10.90.42.35:8081"},"value":[1716366921.957,"0"]},{"metric":{"le":"0.005592405","instance":"10.90.39.2:8081"},"value":[1716366921.957,"0"]},{"metric":{"le":"0.004194304","instance":"10.90.39.2:8081"},"value":[1716366921.957,"0"]},{"metric":{"le":"0.004","instance":"10.90.42.35:8081"},"value":[1716366921.957,"0"]},{"metric":{"le":"0.003844776","instance":"10.90.39.2:8081"},"value":[1716366921.957,"0"]},{"metric":{"le":"0.003495251","instance":"10.90.39.2:8081"},"value":[1716366921.957,"0"]},{"metric":{"le":"0.003145726","instance":"10.90.39.2:8081"},"value":[1716366921.957,"0"]},{"metric":{"le":"0.003","instance":"10.90.42.35:8081"},"value":[1716366921.957,"0"]},{"metric":{"le":"0.002796201","instance":"10.90.39.2:8081"},"value":[1716366921.957,"0"]},{"metric":{"le":"0.002446676","instance":"10.90.39.2:8081"},"value":[1716366921.957,"0"]},{"metric":{"le":"0.002097151","instance":"10.90.39.2:8081"},"value":[1716366921.957,"0"]},{"metric":{"le":"0.002","
instance":"10.90.42.35:8081"},"value":[1716366921.957,"0"]},{"metric":{"le":"0.001747626","instance":"10.90.39.2:8081"},"value":[1716366921.957,"0"]},{"metric":{"le":"0.001398101","instance":"10.90.39.2:8081"},"value":[1716366921.957,"0"]},{"metric":{"le":"0.001048576","instance":"10.90.39.2:8081"},"value":[1716366921.957,"0"]},{"metric":{"le":"0.001","instance":"10.90.39.2:8081"},"value":[1716366921.957,"0"]},{"metric":{"le":"0.001","instance":"10.90.42.35:8081"},"value":[1716366921.957,"0"]}]
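The mismatch in the data above is easier to see once the `le` boundaries are grouped per instance. The following is a minimal sketch, not part of the original report: the helper name `bounds_by_instance` is hypothetical, and `samples` is a hand-picked subset of the series listed above.

```python
from collections import defaultdict


def bounds_by_instance(samples):
    """Collect the set of `le` boundaries that each instance exposes."""
    bounds = defaultdict(set)
    for sample in samples:
        metric = sample["metric"]
        bounds[metric["instance"]].add(metric["le"])
    return bounds


# A small subset of the series from the report (values copied verbatim).
samples = [
    {"metric": {"le": "+Inf", "instance": "10.90.39.2:8081"}, "value": [1716366921.957, "26"]},
    {"metric": {"le": "+Inf", "instance": "10.90.42.35:8081"}, "value": [1716366921.957, "9166"]},
    {"metric": {"le": "30.0", "instance": "10.90.39.2:8081"}, "value": [1716366921.957, "26"]},
    {"metric": {"le": "10.021590356", "instance": "10.90.39.2:8081"}, "value": [1716366921.957, "26"]},
    {"metric": {"le": "10.0", "instance": "10.90.42.35:8081"}, "value": [1716366921.957, "9166"]},
    {"metric": {"le": "0.001", "instance": "10.90.39.2:8081"}, "value": [1716366921.957, "0"]},
    {"metric": {"le": "0.001", "instance": "10.90.42.35:8081"}, "value": [1716366921.957, "0"]},
]

bounds = bounds_by_instance(samples)
shared = bounds["10.90.39.2:8081"] & bounds["10.90.42.35:8081"]
only_a = bounds["10.90.39.2:8081"] - shared
only_b = bounds["10.90.42.35:8081"] - shared

print("shared:", sorted(shared))
print("only 10.90.39.2:", sorted(only_a))
print("only 10.90.42.35:", sorted(only_b))
```

In this subset only `+Inf` and `0.001` are exposed by both instances, so `sum(...) by (le)` adds the two instances together at those boundaries only. Every other `le` series carries the count of a single instance.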

I would appreciate it if you could analyze this data and help identify the root cause of the issue.


zhumengzhu commented May 23, 2024

I just found two other issues that raise the same problem: VictoriaMetrics/VictoriaMetrics#3231 and VictoriaMetrics/VictoriaMetrics#2819. It looks like a known issue. Is there any plan to fix it? @valyala

@zhumengzhu

Based on the algorithm described in this issue, I figured out how the incorrect value of 30000 was calculated:

Data-Processor

import json
from collections import defaultdict

def read_json_file(file_path):
    with open(file_path, 'r') as file:
        data = json.load(file)
    return data

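def process_data(data):
    # Sum the cumulative counts of all instances per `le` boundary,
    # which is what `sum(...) by (le)` does in the query.
    sum_values = defaultdict(int)

    for item in data:
        metric_le = item['metric']['le']
        value = item['value'][1]
        sum_values[metric_le] += int(value)

    # Order buckets from the largest boundary (+Inf) down to the smallest.
    sorted_sum_values = dict(sorted(sum_values.items(), key=lambda x: float(x[0]), reverse=True))

    # Mimic the monotonicity fix-up: walking downward from +Inf, a bucket
    # may never hold more than the bucket above it, so larger values are
    # clamped to the last accepted value.
    previous_value = None
    for metric_le, value in sorted_sum_values.items():
        if previous_value is not None and value > previous_value:
            sorted_sum_values[metric_le] = previous_value
        else:
            previous_value = value

    return sorted_sum_values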


def print_result(result):
    print("le             value")
    for metric_le, value in result.items():
        print(f"{metric_le.ljust(15)}{value}")

if __name__ == "__main__":
    file_path = "metrics_1716192915.json"
    data = read_json_file(file_path)
    result = process_data(data)
    print_result(result)

Result

le             value
+Inf           9192
30.0           26
28.633115306   26
22.906492245   26
17.179869184   26
15.748213416   26
14.316557651   26
12.884901886   26
11.453246121   26
10.021590356   26
10.0           26
8.589934591    26
7.158278826    26
6.0            26
5.726623061    26
4.294967296    26
4.0            26
3.937053352    26
3.579139411    26
3.22122547     26
3.0            26
2.863311529    26
2.505397588    26
2.147483647    26
2.0            26
1.789569706    26
1.5            26
1.431655765    26
1.073741824    26
1.0            26
0.984263336    26
0.894784851    26
0.805306366    26
0.768          26
0.715827881    26
0.64           26
0.626349396    26
0.536870911    26
0.512          26
0.447392426    26
0.384          26
0.357913941    26
0.268435456    26
0.256          26
0.246065832    26
0.223696211    26
0.20132659     26
0.192          26
0.178956969    26
0.156587348    26
0.134217727    26
0.128          26
0.111848106    26
0.089478485    26
0.067108864    25
0.064          25
0.061516456    25
0.055924051    25
0.050331646    24
0.044739241    23
0.04           23
0.039146836    23
0.033554431    23
0.032          23
0.027962026    23
0.024          23
0.022369621    21
0.016777216    4
0.015379112    2
0.013981011    0
0.01258291     0
0.012          0
0.011184809    0
0.009786708    0
0.008388607    0
0.008          0
0.006990506    0
0.006          0
0.005592405    0
0.004194304    0
0.004          0
0.003844776    0
0.003495251    0
0.003145726    0
0.003          0
0.002796201    0
0.002446676    0
0.002097151    0
0.002          0
0.001747626    0
0.001398101    0
0.001048576    0
0.001          0

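Now calculate the 0.99 quantile: the target rank is 9192 * 0.99 = 9100.08. After the fix-up no finite bucket holds more than 26, so the rank falls into the bucket with le="+Inf". VM therefore returns the previous le, 30.0 seconds, which becomes 30000 after the `* 1000` conversion to milliseconds. That is the unexpected result.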
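The selection step can be reproduced with a short sketch. This is a simplified, Prometheus-style quantile lookup written for illustration, not VictoriaMetrics' actual code: the function name `quantile_from_buckets` is hypothetical, and the bucket list is a condensed subset of the result table above (values after clamping).

```python
import math


def quantile_from_buckets(phi, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by ascending upper bound and end with +Inf.
    """
    total = buckets[-1][1]
    rank = phi * total
    prev_bound, prev_count = 0.0, 0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # The rank lands in the +Inf bucket: fall back to the
                # upper bound of the highest finite bucket.
                return buckets[i - 1][0]
            if count == prev_count:
                # Empty bucket reached (only when rank is 0); avoid 0/0.
                return bound
            # Linear interpolation inside the matching bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return float("nan")


# Condensed rows from the result table (le, value after clamping).
fixed_buckets = [
    (0.013981011, 0),
    (0.016777216, 4),
    (0.022369621, 21),
    (0.067108864, 25),
    (0.089478485, 26),
    (30.0, 26),
    (math.inf, 9192),
]

p99_seconds = quantile_from_buckets(0.99, fixed_buckets)
print(p99_seconds * 1000)
```

With these inputs the rank of 9100.08 is only reached in the `+Inf` bucket, so the sketch returns the last finite boundary, 30.0, which matches the 30000 ms reported by the query.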

@zhumengzhu

And here is the source code of fixBrokenBuckets.
