Priority 1: Must (highest priority; a release cannot be made if this issue isn't resolved)
Type: Enhancement (signals that an issue enhances an already existing feature of the project)
After two separate instances of providing training and on-site support for performance in AF applications, I believe there are things we can do to improve the observability.
We should separate the observability into two different categories:
Capacity of system - for alerting and scaling
Timings - for investigation
The reason for this separation is that timings can be very hard to alert on. They can fluctuate heavily, which is why we usually express them as percentiles, e.g. 95% of requests are done within 200ms.
However, this value fluctuating is not a problem, as long as it doesn't cause the capacity to be reached. When you reach the capacity (or want to optimize) you start investigating.
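To illustrate why a percentile is useful for investigation but awkward for alerting, here is a minimal sketch that computes a nearest-rank percentile over a batch of latency samples. The class name `LatencyPercentile` and the sample values are made up for illustration and are not part of the framework.

```java
import java.util.Arrays;

// Hypothetical helper, not an existing framework class.
final class LatencyPercentile {

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }
}
```

With ten samples of which one is a 2000ms outlier, the median stays near the typical value while the 95th percentile jumps to the outlier, which is the kind of fluctuation that makes a fixed alert threshold on timings unreliable.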
Capacity
The current way to measure capacity for commands and events is the capacity metric. This is the number of threads that were busy (on average) over the last 10 minutes. There are a few problems with this metric:
If a message takes very long, the MonitorCallback is not called so the time taken is not registered
It is an average. You can still have spikes and thus capacity problems and not see them (this has been improved in 4.8.0 by setting the period to 1 second - but still)
It is not a relative metric. I have 10 threads active! Of how many? 100? 1000? We don't know! So we cannot scale or detect capacity problems
I want to propose to:
Remove/deprecate this somewhat misleading metric
Expose the thread pool itself to metrics via an abstraction. This will measure how many threads are busy and if there are tasks queuing
You can then alert on the queue! You have > 1 pending tasks? Alert/scale
Measure the time it takes for a message to be picked up. > 0? Alert/scale
Maybe: Measure the time it takes for a command to reach the localSegment bus.
This would mean adding timestamps to queries and events - might be controversial
But very useful. Also includes network and AS routing
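The thread pool part of this proposal could be sketched roughly as follows. This is only an illustration built on the JDK's `ThreadPoolExecutor`; the class `ObservedPool` and its method names are hypothetical, and the actual abstraction in the framework may look quite different.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical wrapper that exposes capacity-related readings of a fixed-size pool.
final class ObservedPool {

    private final ThreadPoolExecutor executor;
    private final AtomicLong maxPickupDelayNanos = new AtomicLong();

    ObservedPool(int threads) {
        this.executor = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    }

    // Wraps the task so we can record how long it waited before a thread picked it up.
    void submit(Runnable task) {
        long enqueuedAt = System.nanoTime();
        executor.execute(() -> {
            long waited = System.nanoTime() - enqueuedAt;
            maxPickupDelayNanos.accumulateAndGet(waited, Math::max);
            task.run();
        });
    }

    // Absolute reading: threads currently executing a task.
    int busyThreads() { return executor.getActiveCount(); }

    // The "of how many?" part that makes the reading relative.
    int maxThreads() { return executor.getMaximumPoolSize(); }

    // Tasks waiting for a free thread; a candidate signal for alerting or scaling.
    int queuedTasks() { return executor.getQueue().size(); }

    // Fraction of the pool in use, between 0.0 and 1.0.
    double utilization() { return (double) busyThreads() / maxThreads(); }

    // Longest time any task has waited in the queue so far.
    long maxPickupDelayMillis() {
        return TimeUnit.NANOSECONDS.toMillis(maxPickupDelayNanos.get());
    }

    void shutdown() { executor.shutdown(); }
}
```

A metrics registry could poll `queuedTasks()`, `utilization()` and `maxPickupDelayMillis()` as gauges. Unlike the averaged busy-thread count, these readings are relative to the pool size and show directly when work is waiting.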
Capacity monitoring for event processors is good (using eventprocessor latency). We lack any autoscaling capabilities though! And I would like monitoring on the PSEP thread pool, just like in the buses.
I have an idea for the autoscaling. Expect a blog soon.
Timings
The timings are already very good. There are some things we can improve there:
Measure the message response time (command/query) from the sending side
Measure time spent in gRPC calls (appendEvent / listAggregateEvents)
Measure time taken to load aggregate
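Measuring the response time from the sending side could look something like the sketch below. It assumes the send call returns a `CompletableFuture`; the helper `SendTimer` and the `recordNanos` callback are hypothetical names, and the callback stands in for whatever timer the metrics library provides.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.LongConsumer;
import java.util.function.Supplier;

// Hypothetical helper, not an existing framework class.
final class SendTimer {

    // Starts the clock before dispatching and reports the elapsed time once the
    // response (or failure) arrives, so the reading includes network and routing.
    static <T> CompletableFuture<T> timed(Supplier<CompletableFuture<T>> send,
                                          LongConsumer recordNanos) {
        long start = System.nanoTime();
        return send.get()
                   .whenComplete((result, error) -> recordNanos.accept(System.nanoTime() - start));
    }
}
```

The same wrapper shape would work around a gRPC call or an aggregate load, as long as the operation can be expressed as a future. Failures are timed too, since `whenComplete` runs on both outcomes.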