Use CloudWatch dimensions #77

Open
ruurtjan opened this issue Jan 2, 2020 · 3 comments

@ruurtjan

ruurtjan commented Jan 2, 2020

We're using Remora for exporting consumer group lag to CloudWatch metrics. Thanks for open sourcing this!

The issue

Metrics are currently exported as follows:

[Screenshot from 2020-01-02 11-16-50]
[Screenshot from 2020-01-02 11-17-56]

This limits how they can be queried (for example, in Grafana). To create a single graph that shows the lag for all partitions in a given consumer group, you have to add a query for each partition individually, because you can't do wildcard searches on a metric name. Grafana allows up to 5 CloudWatch searches in a single panel, so at most 5 partitions can be plotted.

It is possible to do wildcard searches on dimensions though. This way, you would be able to do a single query that displays all partition offsets regardless of the number of partitions.
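To illustrate, here is a minimal sketch of what such a single query could look like as a CloudWatch search expression. The namespace `Remora`, the dimension names `Topic` and `Partition`, and the metric name `MyGroup.lag` are assumptions for illustration only; the helper `dimension_search` is hypothetical and only builds the expression string.

```python
def dimension_search(namespace, dimensions, metric_name, stat="Average", period=60):
    """Build a CloudWatch SEARCH expression over a namespace and dimension schema.

    Every metric that has exactly the listed dimensions and the given name
    matches, whatever the dimension values are.
    """
    schema = ",".join([namespace, *dimensions])
    return f"SEARCH('{{{schema}}} MetricName=\"{metric_name}\"', '{stat}', {period})"


# Hypothetical names: one expression would cover every partition of every topic.
expr = dimension_search("Remora", ["Topic", "Partition"], "MyGroup.lag")
print(expr)
```

The resulting string could be pasted into a Grafana CloudWatch expression field, so the number of partitions no longer affects the number of queries.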

Proposed solution

I propose we change how metrics are exported to CloudWatch:

  • Metric name: By consumer group.<Consumer group id>.<metric>, where <metric> is one of 'lag', 'logend' and 'offset'
  • Metric dimensions:
    • Topic (e.g. 'MyTopic')
    • Partition (e.g. '2')
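As a rough sketch of the proposed layout, the snippet below builds one datum in the shape that CloudWatch's PutMetricData accepts (`MetricName`, `Dimensions`, `Timestamp`, `Value`). It is written in Python for brevity and is not Remora's actual code; the function name `consumer_group_datum` and the sample values are hypothetical, and the metric name follows the proposal literally.

```python
from datetime import datetime, timezone

VALID_METRICS = {"lag", "logend", "offset"}


def consumer_group_datum(group_id, topic, partition, metric, value, timestamp=None):
    """Build one MetricDatum dict following the proposed naming scheme."""
    if metric not in VALID_METRICS:
        raise ValueError(f"unknown metric: {metric!r}")
    return {
        # Name taken literally from the proposal; the exact prefix is open to discussion.
        "MetricName": f"By consumer group.{group_id}.{metric}",
        "Dimensions": [
            {"Name": "Topic", "Value": topic},
            # CloudWatch dimension values are strings.
            {"Name": "Partition", "Value": str(partition)},
        ],
        "Timestamp": timestamp or datetime.now(timezone.utc),
        "Value": float(value),
    }


datum = consumer_group_datum("MyGroup", "MyTopic", 2, "lag", 42)
print(datum["MetricName"], datum["Dimensions"])
```

With boto3, a dict like this could be sent as an element of the `MetricData` list in `put_metric_data(Namespace=..., MetricData=[datum])`; the namespace is whatever Remora is configured to use.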

For internal metrics like KafkaClientActor.receiveCounter:

  • Metric name: Remora internals.<metric>, where <metric> is the same as what it is now
  • Metric dimensions:
    • metricType (e.g. 'gauge' or 'counterCount')

This would be a breaking change, so we'd have to change the version to 2.0.0.

More info on CloudWatch dimensions: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html

What do you think?

@ruurtjan
Author

ruurtjan commented Jan 2, 2020

Turns out there is a workaround for this issue. In Grafana, you can set a CloudWatch expression. The following expression plots each partition as its own line.

SEARCH(' {TheValueForCLOUDWATCH_NAME} gauge.TOPIC_NAME AND CONSUMER_GROUP_NAME.lag NOT TOPIC_NAME.CONSUMER_GROUP_NAME', 'Average', 60)
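For anyone templating this in dashboards, a small sketch that fills in the placeholders of the expression above. The helper name `legacy_lag_search` and the sample values are hypothetical; the expression structure is copied from the workaround as posted.

```python
def legacy_lag_search(namespace, topic, group, stat="Average", period=60):
    """Fill in the workaround SEARCH expression for one topic and consumer group."""
    # Match per-partition lag metrics, and exclude names containing "<topic>.<group>".
    term = f"gauge.{topic} AND {group}.lag NOT {topic}.{group}"
    return f"SEARCH(' {{{namespace}}} {term}', '{stat}', {period})"


# Hypothetical values for CLOUDWATCH_NAME, topic and consumer group.
workaround = legacy_lag_search("Remora", "MyTopic", "MyGroup")
print(workaround)
```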

It would still be beneficial to improve the way metrics are stored as suggested in my original post though.

@soceanainn
Contributor

@ruurtjan sorry for missing this until now.

I definitely see the value in the changes you suggested. However, I would suggest making this a configurable option rather than a breaking change.

@ruurtjan
Author

Yeah, having it configurable makes sense. Keep it turned off by default for 1.x.x and possibly flip it to on by default whenever Remora upgrades to 2.0.0.
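A minimal sketch of how such a switch could behave, in Python for brevity. The flag name `use_dimensions` is hypothetical, and the legacy name format `gauge.<topic>.<partition>.<group>.<metric>` is an assumption inferred from the workaround expression, not confirmed against Remora's source.

```python
def metric_identity(topic, partition, group, metric, use_dimensions=False):
    """Return (metric_name, dimensions) for one per-partition metric.

    use_dimensions=False keeps the current flat naming (assumed format);
    use_dimensions=True switches to the proposed name-plus-dimensions scheme.
    """
    if use_dimensions:
        name = f"By consumer group.{group}.{metric}"
        dimensions = [
            {"Name": "Topic", "Value": topic},
            {"Name": "Partition", "Value": str(partition)},
        ]
        return name, dimensions
    # Assumed legacy format: everything encoded in the name, no dimensions.
    return f"gauge.{topic}.{partition}.{group}.{metric}", []


legacy = metric_identity("MyTopic", 2, "MyGroup", "lag")
proposed = metric_identity("MyTopic", 2, "MyGroup", "lag", use_dimensions=True)
print(legacy)
print(proposed)
```

Defaulting the flag to off would keep existing dashboards working on 1.x.x, and flipping the default would be the only breaking part of a 2.0.0 release.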

I'm not part of the team that uses Remora anymore, so I don't think I'll be able to write an MR for this. CC'ing my old teammate @wgreven, so he's in the loop on this issue.
