Metrics SDK improvements #1740

cijothomas · 2024-05-11T18:45:03Z

Opening a parent issue to track Metrics SDK improvements for Stable release.

Background
The primary function of the Metrics SDK is to accept a measurement (a number of type T) along with a slice of KeyValue pairs (&[KeyValue]), aggregate these measurements in memory, and export the aggregated values to Readers/Exporters as needed. Our main goals are correctness, thread-safety, memory efficiency, and high performance, particularly on the "hot path" where measurements are reported. Verifying correctness, thread-safety, and memory efficiency requires extensive unit and stress testing.
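To make the background concrete, here is a minimal single-threaded sketch of that flow for a sum aggregation. It is not the SDK's implementation: the `KeyValue` struct below is a simplified stand-in (string values only), and `SumAggregator`, `add`, and `collect` are hypothetical names chosen for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for the SDK's `KeyValue` type.
#[derive(Clone, Debug, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct KeyValue {
    pub key: String,
    pub value: String,
}

impl KeyValue {
    pub fn new(key: &str, value: &str) -> Self {
        KeyValue { key: key.to_string(), value: value.to_string() }
    }
}

/// Hypothetical single-threaded sum aggregator keyed by the sorted attributes.
#[derive(Default)]
pub struct SumAggregator {
    sums: HashMap<Vec<KeyValue>, u64>,
}

impl SumAggregator {
    /// Hot path: record one measurement.
    /// This naive version clones and sorts the attributes on every call.
    pub fn add(&mut self, value: u64, attributes: &[KeyValue]) {
        let mut attrs = attributes.to_vec();
        attrs.sort();
        *self.sums.entry(attrs).or_insert(0) += value;
    }

    /// Export path: copy out the aggregated points for a reader/exporter.
    pub fn collect(&self) -> Vec<(Vec<KeyValue>, u64)> {
        self.sums.iter().map(|(k, v)| (k.clone(), *v)).collect()
    }
}
```

Two calls to `add` with the same attributes in a different order land in the same aggregated point. The per-call clone and sort in `add` are the kind of hot-path cost this issue is about.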

Performance issues

1. Cloning and allocation on hot path - A significant portion of the overhead comes from cloning the incoming slice to prepare the AttributeSet. In many cases a thread-local Vec can avoid the allocation, though it still requires a copy. The copy can also be avoided by carefully designing temporary data structures that hold only references.
2. Sorting of the Keys - This is an "identity" requirement, so it cannot be avoided entirely. However, it can be avoided in the common case by storing both the sorted and the original order for quick lookups.
3. De-duplication of Attributes with the same Keys - This is another "identity" requirement, but an idea similar to the one above can be used to avoid it in the common case.
4. Contention - The use of a Mutex around the HashMap for aggregations leads to heavy contention. Replacing it with a RwLock and applying interior mutability could lessen this issue, though sharding may be necessary for further scalability improvements as demonstrated here.
5. Memory efficiency, mostly affecting delta - See "Metrics - Delta aggregation should not export unless new measurements are reported in current cycle" (#1598). This is relatively easy to fix.
6. Memory issue - See "Memory optimization for metric datapoints" (#1566). This is also relatively easy to fix.
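The ideas about sorting, de-duplication, and contention can be combined roughly as follows. This is only a sketch under assumptions, not the prototype's or the SDK's code: attributes are simplified to `(String, String)` pairs, `ValueMap` and `normalize` are hypothetical names, and "last value wins" for duplicate keys is an assumption made for the example. The map is indexed by both the as-provided order and the normalized order, so the common case takes only a read lock and updates an atomic.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};

/// Hypothetical simplification of an attribute set.
type Attrs = Vec<(String, String)>;

/// Sort by key and drop duplicate keys (assumption: the last provided value wins).
pub fn normalize(attrs: &[(String, String)]) -> Attrs {
    let mut v = attrs.to_vec();
    v.reverse(); // later-provided entries now come first
    v.sort_by(|a, b| a.0.cmp(&b.0)); // stable sort keeps that relative order
    v.dedup_by(|a, b| a.0 == b.0); // keeps the first entry of each run
    v
}

#[derive(Default)]
pub struct ValueMap {
    trackers: RwLock<HashMap<Attrs, Arc<AtomicU64>>>,
}

impl ValueMap {
    pub fn add(&self, value: u64, attrs: &[(String, String)]) {
        // Fast path: read lock only, lookup by the order the caller provided.
        {
            let map = self.trackers.read().unwrap();
            if let Some(tracker) = map.get(attrs) {
                tracker.fetch_add(value, Ordering::Relaxed);
                return;
            }
        }
        // Slow path: normalize, then insert under the write lock.
        let sorted = normalize(attrs);
        let mut map = self.trackers.write().unwrap();
        let tracker = map
            .entry(sorted)
            .or_insert_with(|| Arc::new(AtomicU64::new(0)))
            .clone();
        // Index the as-provided order too, so the next call hits the fast path.
        map.entry(attrs.to_vec()).or_insert_with(|| tracker.clone());
        tracker.fetch_add(value, Ordering::Relaxed);
    }

    /// Current sum for the given attributes, in any order already seen.
    pub fn get(&self, attrs: &[(String, String)]) -> Option<u64> {
        let map = self.trackers.read().unwrap();
        map.get(attrs).map(|t| t.load(Ordering::Relaxed))
    }
}
```

Because both orders point at the same `Arc<AtomicU64>`, a collect step would need to export each tracker only once (for example by iterating only the normalized keys); that part is omitted here. The sketch also still copies the attributes on the slow path, so it does not address the first item.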

Some issues, such as calculating the hash outside of the lock and special-casing zero attributes, have already been addressed. Many other ideas were discussed in the past (community meetings, PRs, issues), and I have attempted prototyping several of them here: https://github.com/cijothomas/metrics-mini/tree/main/metrics/src. Most of items 1, 2, and 3, and part of item 4, have been addressed in the prototype, giving huge performance improvements. I plan to incorporate them into this repo soon.
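For readers unfamiliar with these techniques, the following sketch illustrates hashing outside the lock, special-casing zero attributes, and sharding. It is an assumption-laden illustration, not code from this repo or the prototype: `ShardedSum`, `SHARDS`, and the `(String, String)` attribute representation are all hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Hypothetical shard count; a real implementation would tune this.
const SHARDS: usize = 8;

type Attrs = Vec<(String, String)>;

pub struct ShardedSum {
    /// Dedicated slot for measurements reported without attributes.
    no_attrs: AtomicU64,
    shards: Vec<Mutex<HashMap<Attrs, u64>>>,
}

impl ShardedSum {
    pub fn new() -> Self {
        ShardedSum {
            no_attrs: AtomicU64::new(0),
            shards: (0..SHARDS).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    pub fn add(&self, value: u64, attrs: &[(String, String)]) {
        // Zero attributes: no sorting, hashing, or locking needed.
        if attrs.is_empty() {
            self.no_attrs.fetch_add(value, Ordering::Relaxed);
            return;
        }
        let mut sorted = attrs.to_vec();
        sorted.sort();
        // The shard-selection hash is computed before any lock is taken.
        let mut hasher = DefaultHasher::new();
        sorted.hash(&mut hasher);
        let idx = (hasher.finish() as usize) % SHARDS;
        // Only one shard is locked, so threads touching other shards proceed.
        // Note: the HashMap still hashes the key again internally here.
        let mut shard = self.shards[idx].lock().unwrap();
        *shard.entry(sorted).or_insert(0) += value;
    }

    /// Sum over all attribute sets, including the zero-attribute slot.
    pub fn total(&self) -> u64 {
        let in_shards: u64 = self
            .shards
            .iter()
            .map(|s| s.lock().unwrap().values().sum::<u64>())
            .sum();
        in_shards + self.no_attrs.load(Ordering::Relaxed)
    }
}
```

Sharding trades a slightly more expensive collect (every shard must be visited) for less contention on the hot path.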

It is unlikely that we will fix all performance issues for 1.0, but the goal is to ensure that the fixes can continue after 1.0 without any breaking changes. This requires trimming unnecessary public APIs and avoiding exposing any internals to readers/exporters.

Correctness issues:

Lack of adequate testing - The existing test suite does not sufficiently confirm the accuracy of aggregations. Although a few tests have been introduced to demonstrate known issues (see this and this), much more thorough testing is required.
There are virtually no tests in multi-threaded setups. While the Rust compiler protects against some issues (such as data races), it cannot ensure that the aggregated results are correct; that requires carefully orchestrated tests.
Stress tests should also be leveraged to test memory efficiency.
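As an illustration of what a multi-threaded correctness test could look like, here is a minimal sketch. It runs against a toy aggregator rather than the SDK, and `LockedSum` and `stress` are hypothetical names: several threads report the same attributes in alternating order, and the final sum must equal the total number of measurements.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::thread;

/// Hypothetical simplification of an attribute set.
type StaticAttrs = Vec<(&'static str, &'static str)>;

/// Toy thread-safe sum aggregator used only to demonstrate the test shape.
#[derive(Default)]
pub struct LockedSum {
    points: Mutex<HashMap<StaticAttrs, u64>>,
}

impl LockedSum {
    pub fn add(&self, value: u64, attrs: &[(&'static str, &'static str)]) {
        let mut key = attrs.to_vec();
        key.sort();
        *self.points.lock().unwrap().entry(key).or_insert(0) += value;
    }

    pub fn snapshot(&self) -> HashMap<StaticAttrs, u64> {
        self.points.lock().unwrap().clone()
    }
}

/// Report `per_thread` measurements of 1 from each of `threads` threads,
/// alternating the attribute order, and return the aggregated result.
pub fn stress(threads: usize, per_thread: u64) -> HashMap<StaticAttrs, u64> {
    let sum = LockedSum::default();
    thread::scope(|s| {
        for t in 0..threads {
            let sum = &sum;
            s.spawn(move || {
                for i in 0..per_thread {
                    if (i + t as u64) % 2 == 0 {
                        sum.add(1, &[("k1", "v1"), ("k2", "v2")]);
                    } else {
                        sum.add(1, &[("k2", "v2"), ("k1", "v1")]);
                    }
                }
            });
        }
    }); // all spawned threads are joined when the scope ends
    sum.snapshot()
}
```

In a test, the result of `stress(8, 10_000)` should contain exactly one attribute set whose sum is 80,000. The same shape could be pointed at the real SDK through an in-memory exporter.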

Most of the correctness issues can be fixed via better test coverage. One thing to note is that the "Views" feature expands the testing matrix significantly, because it can alter aggregations/attributes and produce multiple metric streams from a single measurement. This is the main reason to remove "Views" from the scope of the first stable release.

cijothomas added this to the Metrics SDK Stable milestone May 11, 2024

cijothomas self-assigned this May 11, 2024

cijothomas mentioned this issue May 25, 2024

Metrics Aggregation - Improve throughput by 10x #1833

Merged
