On-Call and Incident Response: Lessons for Success
Many businesses continue to use on-call rotations and incident response procedures that leave engineers worried, apprehensive, and generally unhappy. Notably, many skilled engineers decline or quit positions explicitly because of this.
In the field: on-call policies
Our engineering teams are made up of software engineers, site reliability engineers (SREs), and engineering managers. Most teams are responsible for at least three services, and within the first two to three months of employment, every engineer and engineering manager in the business joins an on-call rotation.
First and foremost, we do this because it is vital: putting off a customer-impacting problem until the next day is not an option. While we have engineers located throughout the United States and Europe, most of our engineering teams are based in the same time zone. That means we can’t use Google-style “follow the sun” rotations, in which engineers in one part of the world hand off their on-call duties to colleagues in other parts of the world at the end of their workdays.
Best practice: Adopt and embrace DevOps principles.
Before organizations adopted DevOps as an application development approach, on-call responsibilities were usually handled by a small group of engineers and other IT staff, such as a centralized site reliability or operations team.
These employees, not the programmers who created the software, responded to incidents impacting the services they monitored. The site reliability team’s feedback, however, rarely made it back to the developers. And instead of encouraging their teams to pay down technical debt and make their products and services as reliable as possible, product owners frequently chose to move on to the next big feature.
One of the reasons for DevOps’ inception was to break down these organizational barriers. In a modern application architecture, services are grouped to build a vast, interconnected product platform that relies on a complicated system of cloud services, databases, and extensive networking layers, to name a few components. While a specific incident response may begin with one team, the details needed to understand the incident and handle any customer-facing consequences may require the assistance of teams that own services further down the stack.
DevOps promotes the idea that no team is an island and that teams must be able to collaborate and have clear, documented on-call protocols to keep these complex systems running smoothly. Furthermore, developers in a robust DevOps approach make better decisions about the services they design because they must also support them—they can’t just throw a service over the wall and expect someone else to take care of it.
Best practice: Maintain a healthy balance of autonomy and accountability.
The team’s membership, the services it administers, and the team’s cumulative understanding of those services all play a role in the success of the on-call process. Most organizations adopt a one-week on-call cycle, with one engineer serving as the primary and another as the secondary. So, if a team has six engineers, each engineer is the primary on-call person every six weeks.
Most teams use PagerDuty schedules to decide who is the primary and who is the secondary. PagerDuty escalation policies define who is paged first, when the secondary is paged, and what happens if neither the primary nor the secondary acknowledges the page.
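As a rough sketch of how a one-week primary/secondary rotation and an escalation policy fit together, the following Python example computes who is on call for a hypothetical six-person team. The roster, anchor date, and timeout values are illustrative assumptions, not our actual PagerDuty configuration.

```python
from datetime import datetime, timezone

# Hypothetical roster; in practice this lives in a PagerDuty schedule.
ROSTER = ["alice", "bob", "carol", "dave", "erin", "frank"]
ROTATION_START = datetime(2024, 1, 1, tzinfo=timezone.utc)  # assumed anchor date

def current_on_call(now: datetime) -> tuple[str, str]:
    """Return (primary, secondary) for a one-week rotation.

    The secondary is the next engineer in the roster, so each engineer
    is primary once every len(ROSTER) weeks.
    """
    weeks_elapsed = (now - ROTATION_START).days // 7
    primary = ROSTER[weeks_elapsed % len(ROSTER)]
    secondary = ROSTER[(weeks_elapsed + 1) % len(ROSTER)]
    return primary, secondary

# Hypothetical escalation policy: who is paged, and how long to wait for an
# acknowledgement before escalating to the next level.
ESCALATION_POLICY = [
    {"target": "primary", "ack_timeout_minutes": 10},
    {"target": "secondary", "ack_timeout_minutes": 10},
    {"target": "engineering-manager", "ack_timeout_minutes": 15},
]

if __name__ == "__main__":
    primary, secondary = current_on_call(datetime.now(timezone.utc))
    print(f"Primary: {primary}, secondary: {secondary}")
```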
Best practice: Track and measure on-call performance.
At the individual, team, and group levels, we track several on-call metrics:
- The total number of pages each engineer receives in a 24-hour period
- The number of off-hours pages (those received outside of regular business hours)

These indicators, and how you respond to them, are crucial to keeping a framework and organization in place that helps teams succeed in their on-call activities. We created a tool that extracts alerting data from PagerDuty and publishes it to the Telemetry Data Platform. Managers and executives can then build dashboards that show how many times a team was paged in a given timeframe and how many of those alerts occurred outside of regular business hours.
Keeping track of off-hours pages helps identify teams dealing with overwhelming on-call volumes. What counts as an unmanageable load? A team is deemed to have a high on-call load if it receives more than one off-hours page each week.
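As an illustration of how that threshold can be computed, here is a small Python sketch that classifies pages as off-hours and flags teams receiving more than roughly one off-hours page per week. The record format and business-hours window are assumptions; in practice the data would come from an export of PagerDuty alerts such as the one described above.

```python
from collections import Counter
from datetime import datetime, time

# Assumed business hours; adjust to your organization's definition.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(18, 0)

def is_off_hours(triggered_at: datetime) -> bool:
    """A page is off-hours if it arrives on a weekend or outside business hours."""
    if triggered_at.weekday() >= 5:  # Saturday or Sunday
        return True
    return not (BUSINESS_START <= triggered_at.time() < BUSINESS_END)

def flag_overloaded_teams(pages: list[dict], weeks: int) -> dict[str, int]:
    """Count off-hours pages per team and return the teams over ~1 per week.

    `pages` is a list of exported page records with hypothetical keys
    'team' and 'triggered_at' (a datetime).
    """
    off_hours = Counter(p["team"] for p in pages if is_off_hours(p["triggered_at"]))
    return {team: count for team, count in off_hours.items() if count > weeks}

# Example with fabricated records covering a one-week window:
pages = [
    {"team": "synthetics", "triggered_at": datetime(2024, 3, 2, 3, 15)},   # Saturday, 3:15 a.m.
    {"team": "synthetics", "triggered_at": datetime(2024, 3, 6, 23, 40)},  # Wednesday, late night
    {"team": "alerts", "triggered_at": datetime(2024, 3, 5, 11, 0)},       # Tuesday, business hours
]
print(flag_overloaded_teams(pages, weeks=1))  # {'synthetics': 2}
```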
If a team’s burden is too high, it is worth letting the team focus on paying down technical debt or automating away toil until its on-call load drops. You might also bring in senior site reliability engineers (SREs) to help the team improve the reliability of their services.
When it comes to picking an on-call model, there are a few things to think about.
An on-call model doesn’t have to be complicated, but it does need to ensure that a designated engineer is always ready to respond to a page and handle events that fall within their sphere of responsibility. An on-call model should be able to answer the following questions:
- For each on-call rotation, how will the model choose team members?
- How long does each rotation last?
- What happens if an on-call engineer doesn’t answer a page?
- What options do you have if an engineer isn’t able to handle an on-call page?
- At any given time, how many engineers will be on call?
- How will duties be divided among multiple on-call engineers?
- How will the team deal with unforeseen circumstances, such as last-minute changes to the rotation?
The degree of team autonomy will also be a factor in larger businesses with several teams. Organizations that practice DevOps advocate a high level of team autonomy, but some go further than others.
What happens when the pager goes off, and how do you respond?
One of the most critical aspects of an organization’s software quality and reliability practices is its on-call process. Closely related are its incident response practices.
Incident response encompasses a wide range of occurrences, from the commonplace to the scary; some are hard to detect without specialized monitoring systems, while others have the potential to affect millions of people and make national news.
We must act promptly and decisively. We must have a well-thought-out strategy in place and ready to implement.
Best practice: Find out about incidents before your customers do.
The objective of a successful incident response system is straightforward: Before customers are impacted, find out about the situation and, hopefully, fix it.
Our goal as a firm is to never learn about a problem from an upset customer who is tweeting about it—that is the worst-case situation. We’d also like to avoid having irate consumers phone customer service, as that’s not an ideal situation.
Engineering teams are free to use whatever technologies they want to build services, as long as those services are instrumented with monitoring and alerting. (Except in rare circumstances, we use our own products for this.)
Engineering teams, as previously mentioned, have on-call rotations for the services they administer. With proactive incident reporting and a sound monitoring system, an engineer will be paged when an issue is detected—preferably before a client sees it.
Best practice: Develop a mechanism to assess the severity of incidents.
Effective incident response starts with a system that ranks incidents by severity, typically defined in terms of customer impact. An internal incident-severity scale is a great place to start when developing your incident response process; ours runs from 1 to 5, with clearly defined criteria for each level:
- A Level 5 incident should never affect customers; it may be declared simply to raise awareness of a potentially risky service deployment.
- Minor problems or data lags that affect but do not block customers are classified as Level 4 incidents.
- Significant data lags or inaccessible functionality are Level 3 incidents.
- Level 1 and 2 issues are reserved for complete product outages or incidents that represent a direct threat to the business.
Each severity level has its own routine for engaging internal resources, managing the response, deciding whether or not to contact customers, and handling other tasks.
It’s critical to think about how an incident can affect customers and the customer experience, as well as the resources a response team will need to diagnose, contain, and fix the issue.
We also reassess the assigned severity level after an incident based on the actual customer impact. This highlights a crucial incident response principle: during an incident, we urge engineers to escalate quickly so that they get the help they need to tackle the issue. Once the incident is over, we examine the actual impact and lower the severity if the effect wasn’t as bad as we initially thought.
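One way to make a scale like this actionable is to encode each level’s response requirements so that tooling can apply them automatically. The sketch below is a hypothetical Python encoding based on the criteria described above; the exact role assignments and notification rules are assumptions, not a prescribed implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeverityPolicy:
    level: int                # 1 is most severe, 5 is least
    description: str
    notify_customers: bool    # proactively communicate with customers?
    assign_comms_lead: bool   # auto-assign a Communications Lead?
    page_executives: bool     # pull in executive support?

# Hypothetical encoding of the 1-5 scale described above.
SEVERITY_POLICIES = {
    5: SeverityPolicy(5, "No customer impact; awareness of a risky deployment", False, False, False),
    4: SeverityPolicy(4, "Minor problems or data lags that affect but do not block customers", False, False, False),
    3: SeverityPolicy(3, "Significant data lags or inaccessible functionality", True, True, False),
    2: SeverityPolicy(2, "Major outage or direct threat to the business", True, True, True),
    1: SeverityPolicy(1, "Complete product outage", True, True, True),
}

def response_plan(level: int) -> SeverityPolicy:
    """Look up the response requirements for a declared (or re-assessed) severity."""
    return SEVERITY_POLICIES[level]

print(response_plan(3).assign_comms_lead)  # True
```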
Best practice: Define and assign response team roles.
Many of these roles come into play only at certain severity levels, and the responsibilities assigned to a role may change based on the severity of an incident:
| Role | Description | Organization |
| --- | --- | --- |
| Incident Commander (IC) | Drives resolution of the incident. Keeps the CL informed of the incident’s impact and resolution status. Stays alert for new complications. The IC does not perform technical diagnosis of the incident. | Engineering |
| Tech Lead (TL) | Performs technical diagnosis and fix for the incident. Keeps the IC informed of technical progress. | Engineering |
| Communications Lead (CL) | Keeps the IC informed of customer impact reports during an incident. Keeps customers and the business informed about the incident. Decides which communication channels to use. | Support |
| Communications Manager (CM) | Coordinates emergency communication strategy across teams: customer success, marketing, legal, etc. | Support |
| Incident Liaison (IL) | For severity 1 incidents only. Keeps Support and the business informed so the IC can focus on resolution. | Engineering |
| Emergency Commander (EC) | Optional for severity 1 incidents. Acts as “IC of ICs” if multiple products are down. | Engineering |
| Engineering Manager (EM) | Manages the post-incident process for affected teams, depending on the root cause and outcome of the incident. | Engineering |
Best practice: Set up an incident response scenario.
Most businesses cannot fully recreate an actual incident response, especially a high-severity one. However, even modest simulations can give you an idea of what will happen during an incident, how to define priorities and escalation procedures, how to coordinate team roles, and other important insights.
Consider the following scenario involving a fictional incident, in which an on-call engineer is paged about a problem with one of her team’s services:
She starts by declaring an incident in our Slack channel, assisted by a bot dubbed Nrrdbot (a modified version of GitHub’s Hubot). She types 911 ic me because she’s decided to take on the role of Incident Commander. The bot changes the Slack channel topic and creates a new, open incident in Upboard (our in-house incident tracker); Nrrdbot then sends the engineer a direct message (DM) with her next actions.
The IC now needs to do three things:
- Set the severity level (how bad is it?).
- Set the incident’s title (a summary of what’s going wrong) and status (an overview).
- Find one or more Tech Leads to debug the problem. Because the IC does not perform technical diagnosis of the incident, if the IC is actually the right person to debug the issue, they will find someone else to take over the IC role.
The severity specified by the IC (or changed throughout the incident) determines who is brought in to assist with the response. For incidents with a severity rating of 3 or higher, a support team member is immediately assigned to the incident as the Communications Lead. The CL’s goal is to coordinate customer communication; they relay any incident-related customer complaints and engage with customers proactively based on what the engineers discover.
At this stage, the IC creates a shared coordination document that everyone involved in the response can see. She is in charge of coordinating communication among all parties participating in the response, enlisting help when necessary, posting status updates (every 10 minutes, or whenever Nrrdbot reminds her), and updating the severity as things improve or deteriorate.
If the problem is not resolved within 60-90 minutes, she’ll hand off the IC role to someone else, since it’s a tiring burden, especially at 3 a.m. when she’s been woken from a sound sleep.
Once the issue is resolved and all leads have signed off, the IC closes the incident by typing 911 in Slack, which brings the incident to a close.
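Stripped of the Slack and Upboard specifics, the workflow above is a small lifecycle: declare the incident, set severity and title, hand out roles, post periodic status updates, and close once every lead signs off. The following Python sketch models that lifecycle in miniature; it is illustrative only and is not the actual Nrrdbot or Upboard implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Incident:
    commander: str                       # the engineer who typed "911 ic me"
    severity: int | None = None          # 1 (worst) to 5, set by the IC
    title: str = ""
    tech_leads: list[str] = field(default_factory=list)
    comms_lead: str | None = None
    status_updates: list[str] = field(default_factory=list)
    open: bool = True

    def declare(self, severity: int, title: str) -> None:
        """Set severity and title; severity 3 or more severe (3, 2, or 1) gets a CL."""
        self.severity = severity
        self.title = title
        if severity <= 3 and self.comms_lead is None:
            self.comms_lead = "on-call-support"  # hypothetical auto-assignment

    def post_status(self, text: str) -> None:
        """The IC posts an update roughly every 10 minutes while the incident is open."""
        stamp = datetime.now(timezone.utc).strftime("%H:%M")
        self.status_updates.append(f"[{stamp}] {text}")

    def close(self, leads_signed_off: bool) -> None:
        """Close only once all leads have signed off."""
        if leads_signed_off:
            self.open = False

# The flow from the scenario above, in miniature (names are fabricated):
incident = Incident(commander="on-call-engineer")
incident.declare(severity=3, title="Synthetics checks delayed for some customers")
incident.tech_leads.append("tech-lead-1")
incident.post_status("TL investigating a backed-up job queue")
incident.close(leads_signed_off=True)
```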
Best practice: Hope for the best, but prepare for the worst.
Although the scenario above simulated a significant disaster, it never reached the level of an actual emergency. Emergencies are infrequent (or should be), but they pose a significantly higher risk to a company. In a worst-case scenario, an incident might spiral out of control and become an existential threat.
Experienced incident commanders are skilled at dealing with high-severity incidents, especially when coordinating several teams.
Executives join an incident response team to help with three essential tasks: informing senior leadership, coordinating with our legal, support, and security teams, and making difficult decisions.
Best practice: Use incidents to learn, improve, and grow.
In this example, as a first step toward capturing knowledge and learning from an incident, we would complete the following post-incident tasks:
- Gather final details for the coordination document, such as:
  - Duration of the incident
  - Impact on customers
  - Whether any emergency fixes need to be undone
  - Any significant concerns that arose as a result of the incident
  - Notes on who should participate in the post-incident review
- Decide who should attend the blameless retrospective.
- Select a team to own the incident (in this case, the Synthetics team) so that the team’s engineering manager can organize the post-incident retrospective.
In addition, we expect teams to hold a retrospective within one or two business days of an incident.
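If it helps, the post-incident checklist above translates naturally into a retrospective template that the owning team fills in before the review. The sketch below is a hypothetical Python example; the section names simply mirror the checklist above rather than any real tool.

```python
def retrospective_template(incident_title: str, owning_team: str) -> str:
    """Render a blameless-retrospective prompt from the post-incident checklist above."""
    sections = [
        f"# Post-incident review: {incident_title}",
        f"Owning team: {owning_team}",
        "## Duration of the incident",
        "## Impact on customers",
        "## Emergency fixes that need to be undone",
        "## Significant concerns raised by the incident",
        "## Retrospective attendees",
        "## Follow-up actions (Don't Repeat Incidents)",
    ]
    return "\n\n".join(sections)

print(retrospective_template("Synthetics checks delayed", "Synthetics"))
```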
Best practice: Implement a Don’t Repeat Incidents (DRI) policy.
It’s vital to remember that the goal isn’t simply to eradicate incidents, because that’s impossible; the goal of a DRI policy is to avoid repeating the same ones.
Now it’s your job to ask the questions that will help you prepare your incident response.
We recommend that you establish clear standards so that your teams know what to expect; identify and eliminate the biggest sources of friction in your incident response and resolution procedures; and decide how to structure your on-call and incident response processes.
The answers to the following questions will help you do these activities more quickly.
- Size: How big is your engineering organization? What’s the average team size? What kinds of rotations can your teams support?
- Growth: How quickly is the engineering organization expanding? What is the rate of employee turnover?
- Geographical Distribution: Is your company geographically centralized or dispersed? Do you have the size and distribution to implement “follow the sun” rotations, or do your engineers have to deal with pages that come in after hours?
- Structure: How is the engineering organization structured? Are development and operations siloed? Or do you have a modern DevOps culture where teams own the whole lifecycle from development to operations? Is there a centralized SRE group, or are SREs embedded on engineering teams throughout the organization?
- Complexity: How are your applications structured? Is your product a monolithic application maintained by multiple teams, or do your engineers own well-defined services that plug into a larger architecture? How many services does each team support? How reliable are those services?
- Dependencies: How many customers (internal or external) rely on your services? What is the blast radius if a service fails?
- Tooling: What level of sophistication do your incident response procedure and tools have? How well-maintained and up-to-date are your team’s runbooks and monitoring? When engineers respond to a page, do they have adequate tooling and organizational support? Do engineers receive automatic, actionable problem notifications?
- Expectations: Is it common in your engineering culture to be on call? Is it regarded as a necessary and valuable component of the job or an unnecessary burden?
- Culture: Is your company’s culture blameless, focusing on root causes and tackling systemic issues, or is it a “blame and shame” culture where people are penalized when things go wrong?
About Enteros
Enteros offers a patented database performance management SaaS platform. It proactively identifies root causes of complex business-impacting database scalability and performance issues across a growing number of RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.