
[BUG] Jobs using length, delay, and period at the job level and metric level results in inconsistent start time and end time for GetMetricData queries #1290

Closed
kgeckhart opened this issue Feb 2, 2024 · 4 comments · Fixed by #1412
Labels
bug Something isn't working

Comments

@kgeckhart
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

YACE version

No response

Config file

No response

Current Behavior

The way YACE handles length, period, and rounding period when a job has metrics with different configured values is inconsistent.

Length, period, and rounding period all factor into calculating the start and end time for the GetMetricData API call. Each GetMetricData API call contains up to 500 MetricDataQuerys, and each MetricDataQuery represents a single metric returned from ListMetrics with its own configured length and period. This creates a mismatch: a single call's start and end time would need to depend on the values of up to 500 different metrics included in the query. Instead of partitioning the requests around the different start and end times produced by differing length + period combinations, YACE makes a best guess at a good-enough value:

  1. The rounding period is calculated from the smallest period in a batch if the job does not have a rounding period configured
    • This can easily result in different batches having different rounding periods
    • The end result is varying start and end times across batches, depending on which metrics were included in each batch
  2. The length is set to the largest configured length of all metric configs in the job
    • A longer length for a metric which does not require it results in more individual data points being returned
    • The extra data points are thrown away, wasting I/O time, with an unclear impact on API response times (querying for more data likely takes longer)
  3. Delay is only respected at the job level even though it's configurable at the metric level

Expected Behavior

Each GetMetricData request should contain queries only for metrics that share the same start and end time, as determined by the metrics' configured rounding period, period, length, and delay. These requests should still be made in batches of 500.
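A minimal sketch of that partitioning, using illustrative types (`metricConfig` and `windowKey` are not YACE's real names): group metrics by their (period, length, delay) tuple so every batch shares one window, then split each group at the 500-query limit:

```go
package main

import "fmt"

// metricConfig holds the per-metric parameters that determine a query window.
// The names here are hypothetical, chosen for the sketch.
type metricConfig struct {
	Name   string
	Period int64 // seconds
	Length int64 // seconds
	Delay  int64 // seconds
}

// windowKey identifies one unique start/end time combination.
type windowKey struct{ Period, Length, Delay int64 }

const maxQueriesPerCall = 500 // GetMetricData's MetricDataQuery limit

// partition groups metrics so that every batch shares a single start/end
// time and respects the 500-query limit.
func partition(metrics []metricConfig) [][]metricConfig {
	byKey := map[windowKey][]metricConfig{}
	for _, m := range metrics {
		k := windowKey{m.Period, m.Length, m.Delay}
		byKey[k] = append(byKey[k], m)
	}
	var batches [][]metricConfig
	for _, group := range byKey {
		for len(group) > maxQueriesPerCall {
			batches = append(batches, group[:maxQueriesPerCall])
			group = group[maxQueriesPerCall:]
		}
		batches = append(batches, group)
	}
	return batches
}

func main() {
	metrics := []metricConfig{
		{"ActiveConnectionCount", 60, 60, 120},
		{"BytesInFromSource", 60, 60, 120},
		{"PacketsDropCount", 300, 600, 120},
	}
	// Two distinct parameter tuples, so two batches:
	fmt.Println(len(partition(metrics))) // 2
}
```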

Steps To Reproduce

No response

Anything else?

No response

@kgeckhart kgeckhart added the bug Something isn't working label Feb 2, 2024
@kgeckhart kgeckhart changed the title [BUG] Jobs using length, delay,, and period at the job level and metric level results in inconsistent start time and end time for GetMetricData queries [BUG] Jobs using length, delay, and period at the job level and metric level results in inconsistent start time and end time for GetMetricData queries Feb 2, 2024
@kgeckhart
Contributor Author

Just adding a note on something I realized:

  1. Delay is only respected at the job level even though it's configurable at the metric level

This is especially bad because fixing it will have unintended consequences. Metric validation adds a default 5-minute delay to every metric that has no defined delay when the job also has no defined delay. That default has largely gone unnoticed because all example configs set no delay at the job level. As soon as this is fixed, the default delay will take effect, potentially breaking the expected behavior for everyone. I'm still trying to think through how to approach this in a safe manner but felt it was necessary to write down.
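The precedence described above can be sketched as follows. `resolveDelay` and `defaultDelaySeconds` are hypothetical names for this illustration, not YACE's actual code; pointers distinguish "unset" from an explicit 0:

```go
package main

import "fmt"

// defaultDelaySeconds is the 5-minute default described in the comment above.
const defaultDelaySeconds = int64(300)

// resolveDelay sketches the precedence described above: a metric-level delay
// wins, then the job-level delay, then the 5-minute validation default.
func resolveDelay(jobDelay, metricDelay *int64) int64 {
	if metricDelay != nil {
		return *metricDelay
	}
	if jobDelay != nil {
		return *jobDelay
	}
	return defaultDelaySeconds
}

func main() {
	// Neither level configured: the silently-applied default kicks in.
	fmt.Println(resolveDelay(nil, nil)) // 300

	// A job-level delay is respected as before.
	job := int64(120)
	fmt.Println(resolveDelay(&job, nil)) // 120
}
```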

cristiangreco pushed a commit that referenced this issue Feb 27, 2024
This PR restructures some of the fields on the CloudwatchData struct to make the field usage much more obvious. This struct is used throughout the scraping process to carry around context and the result of finding metric data, and contains a lot of overlapping/overloaded fields. The net result should be a struct that is much easier to understand, making some of the complex mapping done in migrate.BuildMetrics a little more obvious, with fewer pointers where they are not necessary.

Related to: #1290
@rhys-evans
Copy link
Contributor

rhys-evans commented May 15, 2024

Hi, I need help understanding what the deprecation of delay means in https://github.com/nerdswords/yet-another-cloudwatch-exporter/blob/master/CHANGELOG.md#0590

At present I can only consistently get metrics with a configuration along the lines of the one below. Otherwise I more often than not get NaN values because the metric is not yet present in CloudWatch at scrape time. (My scrape interval is 60s and I am using v0.60.0)

  - type: AWS/NATGateway
    regions:
      - <region-1>
      - <region-2>
      - .....
    roles:
      - roleArn: arn:aws:iam::<accountid-1>:role/monitoring-role
      - roleArn: arn:aws:iam::<accountid-2>:role/monitoring-role
      - roleArn: arn:aws:iam::<...........>:role/monitoring-role
    period: 60
    length: 60
    delay: 120
    addCloudwatchTimestamp: true
    metrics:
      - name: ActiveConnectionCount
        statistics: [Maximum, Sum]
      - name: BytesInFromDestination
        statistics: [Maximum, Sum]
      - name: BytesInFromSource
        statistics: [Maximum, Sum]
      - name: BytesOutToDestination
        statistics: [Maximum, Sum]
      - name: BytesOutToSource
        statistics: [Maximum, Sum]
      - name: ConnectionAttemptCount
        statistics: [Maximum, Sum]
      - name: ConnectionEstablishedCount
        statistics: [Maximum, Sum]
      - name: ErrorPortAllocation
        statistics: [Maximum, Sum]
      - name: IdleTimeoutCount
        statistics: [Maximum, Sum]
      - name: PacketsDropCount
        statistics: [Maximum, Sum]
      - name: PacketsInFromDestination
        statistics: [Maximum, Sum]
      - name: PacketsInFromSource
        statistics: [Maximum, Sum]
      - name: PacketsOutToDestination
        statistics: [Maximum, Sum]
      - name: PacketsOutToSource
        statistics: [Maximum, Sum]

Are you able to help with my understanding? I understand my period and length are short, but if there is a 5-minute delay by default, why would I need to set the delay option?

Any help is appreciated

@kgeckhart
Contributor Author

kgeckhart commented May 22, 2024

Hey @rhys-evans, your usage of delay won't be impacted, as you're setting it at the job level. If you see warnings logged on startup when you run v0.60.0, please report back, as that shouldn't be happening.

The deprecation is at the metrics level where it's valid to set the value but it was never being used.

@rhys-evans
Contributor

Ah, thank you
