Suggestions in Metrics & Monitoring section #92
Hello @bvvkrishna, thank you for sharing your input. Here is my take on your suggestions.
Also, we welcome any contribution from the community.
First of all, I would like to say a huge thanks to you for sharing SRE knowledge with the community. It is really useful, and it brings visibility to how important SREs are for a company and what is expected of the role.
I have looked at the Metrics and Monitoring section and I have some suggestions. Please take a look.
The statement "Monitoring is a process of collecting real-time performance metrics from a system" might not be correct for all use cases. Some ML or offline jobs are measured only once a day or once an hour, so their metrics cannot be called real-time performance metrics.
The statement "What gets measured, gets fixed" might not always be true. For instance, if an e-commerce system is experiencing huge traffic because of a large number of requests from a single IP (a DDoS attack), the operators will throttle or block the requests after a certain threshold. That does not fix the problem; I would say it mitigates it. Similarly, if an e-commerce system expects high traffic during a sale event, the operators might add hosts before the event (based on projections) to accommodate the traffic. That does not mean the problem is fixed; it is a way of handling it.
In the four golden signals of monitoring, I think we should also include Availability as one of the key metrics, since it helps us understand what percentage of the time the service is available.
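To make the availability suggestion concrete, here is a minimal sketch of one common way to express it: uptime as a percentage of a measurement window. The function name, the 30-day window, and the downtime figure are assumptions chosen for illustration; they do not come from the course material.

```python
def availability_percent(window_seconds: float, downtime_seconds: float) -> float:
    """Return the share of the window during which the service was up, in percent."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not 0 <= downtime_seconds <= window_seconds:
        raise ValueError("downtime_seconds must be between 0 and window_seconds")
    uptime_seconds = window_seconds - downtime_seconds
    return 100.0 * uptime_seconds / window_seconds


# Hypothetical example: a 30-day window with 2592 seconds (43.2 minutes) of downtime.
window = 30 * 24 * 3600   # 2,592,000 seconds
downtime = 2592           # 0.1% of the window
availability = availability_percent(window, downtime)
print(f"Availability: {availability:.3f}%")
```

Time-based availability is only one possible definition; some teams instead measure the ratio of successful requests to total requests, which behaves differently when traffic is uneven.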
In the basic terminology of monitoring, we should also explain what a percentile is. Percentiles are among the most frequently used measurements in monitoring, and engineers often get confused by them.
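As an illustration of why percentiles deserve an explanation, the sketch below computes them with the nearest-rank method, which is one of several accepted definitions (monitoring systems often use interpolation or histogram-based estimates instead). The function name and the latency samples are made up for this example.

```python
import math


def percentile_nearest_rank(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    # Smallest rank such that at least p percent of the samples are at or below it.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 14, 13]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile_nearest_rank(latencies_ms, 50)
p90 = percentile_nearest_rank(latencies_ms, 90)
p99 = percentile_nearest_rank(latencies_ms, 99)
print(f"mean={mean_ms} p50={p50} p90={p90} p99={p99}")
```

With these samples the single 250 ms request pulls the mean well above the median, while the high percentiles show how slow the worst requests are. That gap is the usual reason latency is reported as p50, p90, or p99 rather than as an average.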
In Command-line tools, we should also add the du command, which reports the disk usage of directories, since df only shows free space at the file system level. We should also add the ping, telnet, vmstat and lsof commands, as I see these commonly used in the operations world.
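The difference between the two views can be sketched in Python using only the standard library. This is an approximation written for illustration, not a reimplementation of either tool: `shutil.disk_usage` reports totals for the file system that contains a path (the df view), while the hypothetical helper `directory_size_bytes` sums apparent file sizes under one directory (closer to `du --apparent-size` than to plain du, which counts allocated blocks).

```python
import os
import shutil


def directory_size_bytes(root: str) -> int:
    """Sum the apparent sizes of regular files under root (a rough du-like view)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat avoids following symlinks, so link targets are not counted.
                total += os.lstat(path).st_size
            except OSError:
                # The file may have vanished or be unreadable; skip it.
                continue
    return total


def filesystem_free_bytes(path: str) -> int:
    """Return free space on the file system containing path (a df-like view)."""
    return shutil.disk_usage(path).free


if __name__ == "__main__":
    print("Free on file system:", filesystem_free_bytes("."), "bytes")
    print("Used by this directory:", directory_size_bytes("."), "bytes")
```

The file system view tells you whether a volume is filling up; the per-directory view tells you which directory is responsible.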
In Best Practices for Monitoring, we should call out that when a production problem happens, the first goal is to bring the system back to a stable state rather than to fix the underlying problem. Getting the service under control is more important than fixing the problem itself.
In Best Practices for Monitoring, we should also add "Never hesitate to escalate to the right team if needed". Every issue mitigation has its own SLA, so we should escalate to the right owner when needed rather than deep-diving and breaching the SLA, which could impact the customer.