some easy heuristic to suppress anomaly bit for noisy metrics #14993
andrewm4894
started this conversation in
Ideas
Replies: 1 comment 1 reply
-
What is a "model" in this case? Is it the latest KMeans model, or the entire history of models we are maintaining for each dimension (i.e. are we talking about suppressing models or dimensions)? If it's the latter, should there be any case where we are reactivating the model? (Or should we drop such models entirely when they become the second most recent model of a dimension?)
-
Sometimes we see individual metrics that, for whatever reason, "get stuck" in a bad state, with a bad model that just stays anomalous too consistently:
It's obvious that if you observe a really persistently high anomaly rate over an extended period of time for a dimension, then 9 times out of 10 it's just a symptom of a bad or noisy model for that dim.
Without getting too fancy in terms of the ML (which would introduce more complexity), I'm wondering if there could be some rules or heuristics we could introduce as config options in the [ml] section of
netdata.conf
such that bad metrics like this would just have their anomaly bit suppressed and/or be ignored by prediction until they naturally get retrained next. I'm thinking there could be some sort of silencing layer in the ML that turns off anomaly detection until the next training, once we have observed enough to say that the metric is just subject to a poor model.
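To make the idea concrete, something like the fragment below is what I have in mind. These option names are entirely hypothetical, just a sketch of what such config could look like; they are not existing netdata.conf options:

```ini
[ml]
    # hypothetical options, not part of current netdata.conf:
    # window over which to compute the per-dimension anomaly rate,
    # and the rate above which the anomaly bit gets suppressed
    # until the dimension is next retrained
    anomaly rate silencing window = 30m
    anomaly rate silencing threshold = 50%
```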
For example, a very simple rule would be: if the anomaly rate in the last 30 minutes is above 50%, then silence:
The idea above is a simple rule to just turn off obviously noisy dimensions (until the next training, where hopefully a better model may be trained, e.g. based on more data).
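A minimal sketch of that rule, assuming a per-dimension tracker that sees each anomaly bit as it is produced. The window size, threshold, and class/method names are all illustrative, not agent code:

```python
from collections import deque

class AnomalySilencer:
    """Sketch: if the anomaly rate over a trailing window exceeds a
    threshold, suppress the anomaly bit until the next retraining.
    Defaults (30 min at 1s granularity, 50%) are illustrative only."""

    def __init__(self, window_points=1800, threshold=0.5):
        self.bits = deque(maxlen=window_points)
        self.threshold = threshold
        self.silenced = False

    def observe(self, anomaly_bit):
        """Record one raw anomaly bit; return the (possibly suppressed) bit."""
        self.bits.append(1 if anomaly_bit else 0)
        # only judge the model once a full window has been observed
        if not self.silenced and len(self.bits) == self.bits.maxlen:
            rate = sum(self.bits) / len(self.bits)
            if rate > self.threshold:
                self.silenced = True  # suppress until next training
        return 0 if self.silenced else anomaly_bit

    def on_retrain(self):
        """A fresh model gets a clean slate."""
        self.silenced = False
        self.bits.clear()
```

So a dimension whose model fires constantly gets muted after one full window, and unmutes only when a (hopefully better) model replaces it.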
@vkalintiris @ktsaou FYI, as I think it might make sense to think about this - should we build a process into the agent to just turn off obviously bad models based on observing the anomaly rates themselves?
Ideally I'd like to start with something as simple as possible, so that it's easy to reason about and easy enough to implement too.
note: probably what we want is some notion of "firing rate", e.g. to control for whether the anomaly bit is just consistently going on/off (bad) vs clumping together (which could be valid and actually something we definitely would not want to suppress) - but maybe the AR itself is a good proxy for this if we use a big enough window.
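One cheap way to capture that "firing rate" idea, as a sketch (this helper is hypothetical, not part of the agent): count how often consecutive bits flip. Rapid on/off flapping gives a high transition rate (likely a noisy model), while a sustained clump of anomalous bits gives a low one even at the same overall anomaly rate:

```python
def transition_rate(bits):
    """Fraction of consecutive pairs of anomaly bits that differ.
    High -> rapid on/off flapping (suspect model noise);
    low with a high anomaly rate -> bits clump together
    (could be a real sustained anomaly we should not suppress)."""
    if len(bits) < 2:
        return 0.0
    flips = sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    return flips / (len(bits) - 1)
```

E.g. `[0,1,0,1,...]` and `[0,...,0,1,...,1]` can have the same 50% anomaly rate but very different transition rates, which is exactly the flapping-vs-clumping distinction above.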