
[BUG] Jobs using length, delay, and period at the job level and metric level results in inconsistent start time and end time for GetMetricData queries #1290

Closed
kgeckhart opened this issue Feb 2, 2024 · 4 comments · Fixed by #1412
Labels
bug Something isn't working

Comments

@kgeckhart
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

YACE version

No response

Config file

No response

Current Behavior

The way YACE handles length, period, and rounding period when a job has metrics with different configured values is inconsistent.

Length, period, and rounding period all factor into calculating the start and end time for the GetMetricData API call. Each GetMetricData API call contains up to 500 MetricDataQuerys, and each MetricDataQuery represents a single metric returned from ListMetrics with its own configured length and period. This creates a mismatch: a single call's start and end time would need to depend on the values of up to 500 different metrics included in the query. Instead of partitioning the requests around the different start and end times produced by differing length + period combinations, YACE makes a best guess at a good-enough value:

  1. The rounding period is calculated from the smallest period in a batch if the job does not have a rounding period configured
    • This can easily result in different batches having different rounding periods
    • The end result is varying start and end times across batches, depending on which metrics were included in each batch
  2. The length is set to the largest configured length of all metric configs in the job
    • A longer length for a metric which does not require it results in more individual data points being returned
    • The extra data points are thrown away, wasting I/O time, with an unclear impact on API response times (querying for more data likely takes longer)
  3. Delay is only respected at the job level even though it's configurable at the metric level

Expected Behavior

Each GetMetricData request should contain queries only for metrics that share the same start and end time, as determined by the metrics' configured rounding period, period, length, and delay. These requests should still be made in batches of 500.
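A minimal sketch of that partitioning, using illustrative types (`metricConfig` and `windowKey` are not YACE's real names): group metrics by their (period, length, delay) tuple so every batch shares one window, then split each group at the 500-query limit:

```go
package main

import "fmt"

// metricConfig holds the per-metric parameters that determine a query window.
// The names here are hypothetical, chosen for the sketch.
type metricConfig struct {
	Name   string
	Period int64 // seconds
	Length int64 // seconds
	Delay  int64 // seconds
}

// windowKey identifies one unique start/end time combination.
type windowKey struct{ Period, Length, Delay int64 }

const maxQueriesPerCall = 500 // GetMetricData's MetricDataQuery limit

// partition groups metrics so that every batch shares a single start/end
// time and respects the 500-query limit.
func partition(metrics []metricConfig) [][]metricConfig {
	byKey := map[windowKey][]metricConfig{}
	for _, m := range metrics {
		k := windowKey{m.Period, m.Length, m.Delay}
		byKey[k] = append(byKey[k], m)
	}
	var batches [][]metricConfig
	for _, group := range byKey {
		for len(group) > maxQueriesPerCall {
			batches = append(batches, group[:maxQueriesPerCall])
			group = group[maxQueriesPerCall:]
		}
		batches = append(batches, group)
	}
	return batches
}

func main() {
	metrics := []metricConfig{
		{"ActiveConnectionCount", 60, 60, 120},
		{"BytesInFromSource", 60, 60, 120},
		{"PacketsDropCount", 300, 600, 120},
	}
	// Two distinct parameter tuples, so two batches:
	fmt.Println(len(partition(metrics))) // 2
}
```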

Steps To Reproduce

No response

Anything else?

No response

@kgeckhart kgeckhart added the bug Something isn't working label Feb 2, 2024
@kgeckhart kgeckhart changed the title [BUG] Jobs using length, delay,, and period at the job level and metric level results in inconsistent start time and end time for GetMetricData queries [BUG] Jobs using length, delay, and period at the job level and metric level results in inconsistent start time and end time for GetMetricData queries Feb 2, 2024
@kgeckhart
Contributor Author

Just adding a note on something I realized:

  1. Delay is only respected at the job level even though it's configurable at the metric level

This is especially bad because fixing it will have unintended consequences. Metric validation adds a default 5-minute delay to every metric that has no defined delay when the job also has no defined delay. That default has largely gone unnoticed because all example configs set no delay at the job level. As soon as this is fixed, the default delay will take effect, potentially breaking the expected behavior for everyone. I'm still trying to think through how to approach this in a safe manner but felt it was necessary to write down.
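The precedence described above can be sketched as follows. `resolveDelay` and `defaultDelaySeconds` are hypothetical names for this illustration, not YACE's actual code; pointers distinguish "unset" from an explicit 0:

```go
package main

import "fmt"

// defaultDelaySeconds is the 5-minute default described in the comment above.
const defaultDelaySeconds = int64(300)

// resolveDelay sketches the precedence described above: a metric-level delay
// wins, then the job-level delay, then the 5-minute validation default.
func resolveDelay(jobDelay, metricDelay *int64) int64 {
	if metricDelay != nil {
		return *metricDelay
	}
	if jobDelay != nil {
		return *jobDelay
	}
	return defaultDelaySeconds
}

func main() {
	// Neither level configured: the silently-applied default kicks in.
	fmt.Println(resolveDelay(nil, nil)) // 300

	// A job-level delay is respected as before.
	job := int64(120)
	fmt.Println(resolveDelay(&job, nil)) // 120
}
```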

cristiangreco pushed a commit that referenced this issue Feb 27, 2024
This PR restructures some of the fields on the CloudwatchData struct to make the field usage much more obvious. This struct is used throughout the scraping process to carry around context and the result of finding metric data, and contains a lot of overlapping/overloaded fields. The net result should be a struct that is much easier to understand, making some of the complex mapping done in migrate.BuildMetrics a little more obvious, with fewer pointers where they are not necessary.

Related to: #1290
@rhys-evans
Copy link
Contributor

rhys-evans commented May 15, 2024

Hi, I need help understanding what the deprecation of delay means in https://github.com/nerdswords/yet-another-cloudwatch-exporter/blob/master/CHANGELOG.md#0590

At present I can only consistently get metrics with a configuration along the lines of the one below. Otherwise I more often than not get NaN values because the metric is not yet present in CloudWatch at scrape time. (My scrape interval is 60s and I am using v0.60.0)

  - type: AWS/NATGateway
    regions:
      - <region-1>
      - <region-2>
      - .....
    roles:
      - roleArn: arn:aws:iam::<accountid-1>:role/monitoring-role
      - roleArn: arn:aws:iam::<accountid-2>:role/monitoring-role
      - roleArn: arn:aws:iam::<...........>:role/monitoring-role
    period: 60
    length: 60
    delay: 120
    addCloudwatchTimestamp: true
    metrics:
      - name: ActiveConnectionCount
        statistics: [Maximum, Sum]
      - name: BytesInFromDestination
        statistics: [Maximum, Sum]
      - name: BytesInFromSource
        statistics: [Maximum, Sum]
      - name: BytesOutToDestination
        statistics: [Maximum, Sum]
      - name: BytesOutToSource
        statistics: [Maximum, Sum]
      - name: ConnectionAttemptCount
        statistics: [Maximum, Sum]
      - name: ConnectionEstablishedCount
        statistics: [Maximum, Sum]
      - name: ErrorPortAllocation
        statistics: [Maximum, Sum]
      - name: IdleTimeoutCount
        statistics: [Maximum, Sum]
      - name: PacketsDropCount
        statistics: [Maximum, Sum]
      - name: PacketsInFromDestination
        statistics: [Maximum, Sum]
      - name: PacketsInFromSource
        statistics: [Maximum, Sum]
      - name: PacketsOutToDestination
        statistics: [Maximum, Sum]
      - name: PacketsOutToSource
        statistics: [Maximum, Sum]

Are you able to help with my understanding? I understand my period and length are short, but if there is a 5-minute delay by default, why would I need to set the delay option?

Any help is appreciated

@kgeckhart
Contributor Author

kgeckhart commented May 22, 2024

Hey @rhys-evans, your usage of delay won't be impacted, as you're setting it at the job level. If you see warnings logged on startup when you run v0.60.0, please report back, as that shouldn't be happening.

The deprecation is at the metrics level where it's valid to set the value but it was never being used.

@rhys-evans
Contributor

Ah, thank you
