Monday, June 03, 2019

Seeking Better Alerting

It seems an obvious thing, but the use of alerts has come up in recent interactions as key to getting things done effectively: how to alert, how often, and how to follow up. Alerts also interact with risk and with trust in the system and its resource components. What do mature alerts look like? The piece below is specifically about site reliability, but it is broadly useful. O'Reilly gives a good overview of the topic:

Reduce Toil through better Alerting

How SREs can use a hierarchy for mature alerts.
By Štěpán Davidovič, Betsy Beyer  

Check out "The Site Reliability Workbook" for real-world examples of how to put SRE principles and practices to work in your environment.

SRE best practices at Google advocate for building alerts based upon meaningful service-level objectives (SLOs) and service-level indicators (SLIs). In addition to an SRE book chapter, other site reliability engineers at Google have written on the topic of alerting philosophy. However, the nuances of how to structure well-reasoned alerting are varied and contentious. For example, traditional "wisdom" argues that cause-based alerts are bad, while symptom-based or SLO-based alerts are good.

Navigating the dichotomy of symptom-based and cause-based alerting adds undue toil to the process of writing alerts: rather than focusing on writing a meaningful alert that addresses a need for running the system, the dichotomy brings anxiety around deciding whether an alert condition falls on the “correct” side of this dichotomy.

Instead, consider approaching alerting as a hierarchy of the alerts available to you: reactive, symptom-based alerts—typically based on your SLOs—form the foundation of this hierarchy. As systems mature and achieve higher availability targets, other types of alerts can add to your system's overall reliability without adding excessive toil. Using this approach, you can identify value in different types of alerts, while aiming for a comprehensive alerting setup. .... 
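To make the foundation of that hierarchy concrete, here is a minimal Python sketch of a reactive, symptom-based check tied to an SLO. It is not taken from the article: the `Window` type, the function names, the burn-rate formulation, and the paging threshold of 10 are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one evaluation window (hypothetical type)."""
    total_requests: int
    failed_requests: int


def error_rate(window: Window) -> float:
    """SLI: fraction of requests that failed in the window."""
    if window.total_requests == 0:
        return 0.0  # no traffic, so no observed failures
    return window.failed_requests / window.total_requests


def burn_rate(window: Window, slo_target: float) -> float:
    """Observed error rate divided by the error budget implied by the SLO.

    A value of 1.0 means errors arrive exactly as fast as the budget allows.
    """
    error_budget = 1.0 - slo_target
    return error_rate(window) / error_budget


def should_page(window: Window, slo_target: float, threshold: float = 10.0) -> bool:
    """Symptom-based alert: fire only when users are visibly affected.

    The threshold is an assumed value for illustration, not a recommendation.
    """
    return burn_rate(window, slo_target) >= threshold


if __name__ == "__main__":
    quiet = Window(total_requests=1000, failed_requests=20)
    noisy = Window(total_requests=1000, failed_requests=150)
    print(should_page(quiet, slo_target=0.99))
    print(should_page(noisy, slo_target=0.99))
```

The check looks only at what users experience (failed requests against the SLO), not at any particular cause. In the hierarchy the authors describe, other alert types would be layered on top of a base like this as the system matures.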

As detailed below, by analyzing their existing alerts and organizing them according to a hierarchy, then iterating as appropriate, service owners can improve the reliability of their systems and reduce the toil and overhead associated with traditional cause-based and investigative alerts. ....
