Add Runbooks #343

Open
awcodify opened this issue Apr 4, 2023 · 2 comments

awcodify commented Apr 4, 2023

Hi Folks,

I think it would be nice to have runbooks that outline the procedures to follow when an alert is triggered in a monitoring system. A runbook provides step-by-step instructions for identifying the cause of the alert, assessing its impact, and implementing a solution to resolve the issue.

If that's okay, I would be very happy to contribute.

For example:


HostHighCpuLoad

Meaning

The "HostHighCpuLoad" alert is triggered when the CPU load on a host exceeds a defined threshold. This alert is designed to detect performance issues and potential system instability related to high CPU utilization.

Impact

If this alert is not properly addressed, it may result in degraded performance, system crashes, and potential service disruptions.

Diagnosis

Check the system load average using the following command:

uptime

The output will show the current system load average for the past 1, 5, and 15 minutes. If the load average is consistently higher than the number of CPU cores on the system, it indicates that the system is experiencing high CPU load.
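The comparison described above can be sketched as a small script. This is a minimal example that assumes a Linux host where `/proc/loadavg` and `nproc` are available; the `load_per_core` helper name is hypothetical and exists only for illustration.

```shell
#!/bin/sh
# load_per_core LOAD CORES: print LOAD divided by CORES, to two decimals.
# (Hypothetical helper name, used only for illustration.)
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Field 3 of /proc/loadavg is the 15-minute load average (Linux only).
load15=$(cut -d ' ' -f 3 /proc/loadavg)
cores=$(nproc)

ratio=$(load_per_core "$load15" "$cores")
echo "15-minute load per core: $ratio"

# A sustained ratio above 1.00 suggests more runnable work than cores.
if awk -v r="$ratio" 'BEGIN { exit !(r > 1) }'; then
    echo "load exceeds core count"
else
    echo "load is within core count"
fi
```

On Linux the load average also counts tasks in uninterruptible sleep (typically waiting on I/O), so a high ratio is a hint to investigate rather than proof of CPU saturation.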

Identify which processes are using the most CPU resources by running the following command:

ps -eo pid,ppid,cmd,%cpu --sort=-%cpu | head

The output will show the top CPU-consuming processes. Identify any processes that are consuming a significant amount of CPU resources and investigate further.
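To narrow the list to clear outliers, the same kind of `ps` output can be filtered by a CPU threshold. This is only a sketch: the `hot_processes` name and the 50% cutoff are assumptions chosen for illustration, and the `--sort` option assumes the procps version of `ps` found on most Linux systems.

```shell
#!/bin/sh
# hot_processes THRESHOLD: read "pid %cpu command" lines on stdin and
# print those whose %cpu is at or above THRESHOLD.
# (Hypothetical helper name; the threshold is up to the operator.)
hot_processes() {
    awk -v limit="$1" '$2 + 0 >= limit + 0'
}

# Empty headers ("pid=", "pcpu=", "comm=") suppress the header row,
# so every line passed to the filter is process data.
ps -e -o pid= -o pcpu= -o comm= --sort=-pcpu | hot_processes 50
```

Note that `ps` reports CPU time used over the lifetime of each process, so a short recent spike may not show up; a sampling tool such as `top` is a useful cross-check.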

Check for any system configuration issues or recent updates that may be causing high CPU load. Look for misconfigured services or applications running on the system that cause excessive CPU usage.

Check system logs for any error messages related to high CPU usage. Look for any system errors or warnings that may indicate a problem with the system's CPU usage.
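One possible way to do this on a systemd-based host is to filter the journal by priority and search it for CPU-related kernel messages. The `cpu_log_hints` helper and its pattern list are illustrative assumptions, not an exhaustive set of messages.

```shell
#!/bin/sh
# cpu_log_hints: read log lines on stdin and print those that match a
# few CPU-related phrases (case-insensitive). The phrase list is an
# illustrative assumption; extend it for your environment.
cpu_log_hints() {
    grep -i -E 'soft lockup|throttled|hung_task|rcu.*stall' || true
}

# Warnings and worse from the last hour, if journalctl is available.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -p warning --since "1 hour ago" --no-pager | cpu_log_hints
fi
```

On hosts without systemd, the same filter could be fed from `dmesg` or a syslog file instead.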

Mitigations

To mitigate this alert and address the performance and stability issues related to high CPU load, the following steps can be taken:

  • Identify and prioritize the processes or applications that are contributing to the high CPU load and take appropriate actions to reduce their resource usage.
  • Increase the available resources on the host, such as adding more CPU cores or increasing memory capacity.
  • Optimize the system and application configurations to better utilize available resources and reduce overhead.
  • Implement proactive monitoring and capacity planning to avoid future high CPU load incidents.
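
For the first mitigation, one low-risk option is to lower the scheduling priority of a busy process rather than stop it. The following is a sketch using the standard `renice` utility; the `deprioritize` wrapper name and the niceness value of 10 are assumptions for illustration.

```shell
#!/bin/sh
# deprioritize PID [NICENESS]: give PID a higher niceness so the
# scheduler favours other work. Depending on the renice implementation,
# -n is treated as an absolute value or as an increment.
# Unprivileged users can only increase niceness, never decrease it.
# (Hypothetical wrapper name, used only for illustration.)
deprioritize() {
    renice -n "${2:-10}" -p "$1" >/dev/null
}

# Demonstration on a throwaway background process.
sleep 30 &
pid=$!
deprioritize "$pid" 10
ps -o pid= -o ni= -o comm= -p "$pid"
kill "$pid"
```

Lowering priority only helps when other processes are competing for the CPU; it does not reduce the total work the process performs.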

samber (Owner) commented Apr 5, 2023

I fully agree 💯

How can I help you? Would you like to start writing runbooks? In the short term, I can adapt the site.

The example you provided looks good. I think we can write it in markdown.

awcodify (Author) commented Apr 6, 2023

Yes, I can help write the runbooks. In parallel, maybe you can start creating the page for them 👍
