9 incident response myths—busted!
For many DevOps, site reliability engineering (SRE), and operations teams, it still takes too much time to detect potential problems before they turn into incidents. Teams frequently respond to crises reactively, never taking the time to build systems that would let them spot issues before they cause disruptions.
Every minute spent responding to incidents negatively impacts service-level objectives (SLOs), company reputations, and individual teams’ bottom lines.
Gartner estimated the average cost of a minute of downtime at $5,600 back in 2014; for large enterprises at critical “moments of truth,” the impact today is likely far greater. Statistics like these highlight the need to respond promptly and efficiently to any event that disrupts your site’s availability or performance.
What is an “incident”?
Put simply, an incident occurs when a service is unavailable or fails to function as expected, as defined in a written service-level agreement (SLA). Network outages, application issues, hardware failures, and, increasingly in today’s sophisticated, multilayered infrastructures, configuration errors can all cause incidents.
Incident response refers to the collective processes that help detect, identify, troubleshoot, and resolve such situations. Incident response has grown over the years to encompass numerous frameworks and methodologies, heavily influenced by the British government’s IT Infrastructure Library (ITIL) in the 1980s. They all have the same goal: to provide stakeholders with the tools they need to get malfunctioning systems up and running as quickly as possible while also making those systems more robust and reliable.
Despite its lengthy history, incident response is still cloaked in myths and hampered by misunderstandings that prevent firms from resolving problems as quickly and effectively as possible, and, perhaps more crucially, from learning how to keep incidents from occurring in the first place.
That’s why we asked incident response professionals from New Relic and other companies to identify common incident response myths and mistakes, and to share their recommendations for better incident response.
Myth #1: It’s all about speed.
This is also known as the “any-fix-is-a-good-fix” myth. Obviously, resolving issues quickly is critical, especially for systems that interact directly with customers. But speed isn’t the only concern. In the name of speed, it can be harmful to deploy a faulty or partial fix, a temporary remedy, or a change that breaks something else downstream.
Kepner-Tregoe’s Christoph Goldenstern.
“A lot of lip service is paid to the importance of quality and customer satisfaction in incident response, but when you look at a lot of the metrics for measuring incident response success, they focus on efficiency: how quickly an issue is resolved,” says Christoph Goldenstern, vice president of innovation and service excellence at Kepner-Tregoe, an incident response training and consulting firm.
Instead, businesses should concentrate on the effectiveness of the outcome, not just its speed. “Are we ultimately providing a long-term solution to the customer?” Goldenstern asks. “Are we preventing something similar from happening in the future? Those are the questions to ask.”
He says that relying on “lagging indicators,” looking backward to assess how something was resolved, is ineffective. Instead, firms should concentrate on developing behaviors that yield better, longer-lasting results and build measurements around those behaviors.
One metric Kepner-Tregoe urges its customers to track is the time it takes to arrive at a good description of the problem at hand. “We know from our study that the quality of the problem statement is a direct driver of lower resolution time and increased customer satisfaction,” Goldenstern says. “Rather than merely putting a fix in place, train your team to develop clear, short, and specific problem statements as quickly as feasible.”
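Measuring “time to a good problem statement” implies capturing the statement in some consistent form. Here is a minimal, hypothetical sketch in Python of what that might look like; the what/where/when/extent fields loosely echo Kepner-Tregoe’s published problem-specification questions, but none of the names or structure below come from the article.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ProblemStatement:
    """Illustrative structure for a clear, specific problem statement."""
    what: str    # the object and its deviation, e.g. "checkout API returns 502s"
    where: str   # where it is observed, e.g. "us-east-1 web tier"
    when: str    # when it started and the pattern, e.g. "since 14:05 UTC, every request"
    extent: str  # how big the impact is, e.g. "roughly 40% of checkout traffic"
    incident_detected: datetime           # when the incident was first detected
    stated_at: Optional[datetime] = None  # when the team agreed on this statement

    def finalize(self) -> float:
        """Record that the statement is agreed on; return minutes since detection,
        i.e. a 'time to a good problem statement' metric."""
        self.stated_at = datetime.now(timezone.utc)
        return (self.stated_at - self.incident_detected).total_seconds() / 60.0


# Example: the statement is agreed on about 12 minutes after detection.
ps = ProblemStatement(
    what="checkout API returns 502s on POST /orders",
    where="us-east-1 web tier only",
    when="since 14:05 UTC, on every request",
    extent="roughly 40% of checkout traffic",
    incident_detected=datetime.now(timezone.utc) - timedelta(minutes=12),
)
print(f"Time to problem statement: {ps.finalize():.1f} minutes")
```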
Myth #2: You’re done once you’ve put out the fire.
This notion is slowly being debunked, which is a good thing. After resolving an incident, it’s now common practice to hold a postmortem or internal retrospective. The goal is to proactively learn from the incident so you can make your systems more robust and stable and avoid future problems. “Proactively learn” is the key phrase here.
xMatters’ Adam Serediuk.
“It’s critical to incentivize preventive actions rather than just resolving incidents reactively,” says Adam Serediuk, head of operations for xMatters, a maker of DevOps incident-management software. If you don’t stipulate that your incident lifecycle isn’t over until the postmortem is done and its findings are accepted or rejected, “you’re practically saying, ‘We’re not interested in preventing future occurrences,’” says Serediuk. He goes on to draw a distinction between reacting and responding. You could, for example, react to an issue by sending a few of your rock stars to the scene to patch it immediately. “However, you can’t simply replicate that approach, and it can’t scale,” he says.
Branimir Valentic, a Croatian ITIL and ISO 20000 specialist at Advisera.com, an international ITSM consultancy, says that incident response should be viewed as an end-to-end process that is monitored, iterative, repeatable, and scalable. “The objective of incident response is to go far deeper and learn, not just to resolve,” he explains.
One concern is that, over time, the postmortem becomes a routine exercise, just another box for jaded engineers to check. Don’t let the postmortem turn into a chore. Learning from incidents is valuable but also tricky, and it requires ongoing tuning and adaptation to figure out how to learn effectively.
Myth #3: To avoid making IT look bad, report only the severe issues that customers complain about.
Another common misconception is that you shouldn’t talk about your incidents too often. The logic is that if you report every event, IT will appear to be failing. Keep your head down and only acknowledge and share the significant incidents clients have noticed and reported.
That’s the premise, at least, but it’s a terrible plan. Customers and internal stakeholders alike want to know that you’re being truthful and upfront, and that they can rely on you to spot and report incidents that could affect them. Concealing incidents, even minor ones, can jeopardize that trust.
Don’t think of it as a bad mark against your IT department when things go wrong. Incidents are unavoidable in this game. The important thing is what you do about them.
Communicate well, both internally and with customers. Many businesses are wary of disclosing any information unless they are forced to, but this is a mistake. Be open and honest.
Myth #4: Only incidents that hurt customers matter.
Another common misconception is that only incidents that affect external customers are relevant. In some organizations, incidents are even defined exclusively as “customer-impacting disruptions.” Believing that myth, however, diminishes the effectiveness of your overall incident response. Again, the idea is that incident response should be treated as a learning opportunity, with proactive steps taken as a result of that learning.
“Internal misses and internal-only incidents can teach us a lot. They could even be some of your best learning opportunities because they allow you to fine-tune your response process and learn without feeling rushed,” says xMatters’ Serediuk. “When things are on fire, it’s difficult to implement meaningful organizational change.”
Let’s say your internal ticketing system or internal wiki goes down. What kind of oversight, or missing control, led to that? In relatively low-stakes internal incidents like these, “you may learn under less pressure and possibly avert production mishaps later on,” adds Serediuk. With the pressure off, you can concentrate more intently on why you ran into a particular problem and how to avoid it in the future.
Myth #5: When systems are in pain, they will always alert you.
People in charge of operations tend to keep an eye on what they consider significant. However, they aren’t always correct, and when they’re wrong, a system could be in jeopardy while your team remains blissfully unaware. Traditionally, operations teams measured disk usage, CPU usage, and network traffic. “But the actual question is, is the service healthy?” Serediuk asks.
This is where the distinction between micro and macro monitoring comes into play. Micro monitoring watches individual components such as CPU, memory, and disk. Macro monitoring looks at the larger picture: how the system’s users are affected.
“Here is where service level objectives [SLOs] and service level indicators [SLIs] come into play,” Serediuk explains. “You’re judging things based on how they make you feel.” If, for example, your web requests per second suddenly drop to zero, you know you have a problem. You might have missed it if you only did micro monitoring, such as keeping track of memory usage. “By looking at the metric that matters—whether people are connecting with my system,” he explains, “I detect something I wouldn’t have spotted otherwise.”
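To make that macro view concrete, here is a minimal sketch, not taken from the article, of how a team might compute a simple availability SLI from request counts and compare it against an SLO; the threshold, window, and numbers are purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts observed over some monitoring window (e.g., 5 minutes)."""
    total: int
    successful: int  # e.g., non-5xx responses


def availability_sli(window: RequestWindow) -> float:
    """SLI: fraction of requests served successfully in the window."""
    if window.total == 0:
        # No traffic at all is itself a macro-level signal worth alerting on:
        # healthy CPU and memory graphs won't reveal that requests dropped to zero.
        return 0.0
    return window.successful / window.total


SLO_TARGET = 0.995  # illustrative objective: 99.5% of requests succeed

window = RequestWindow(total=12_000, successful=11_991)
sli = availability_sli(window)
if sli < SLO_TARGET:
    print(f"ALERT: SLI {sli:.4f} is below SLO {SLO_TARGET}")
else:
    print(f"OK: SLI {sli:.4f} meets SLO {SLO_TARGET}")
```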
Myth #6: Your mean time to resolution (MTTR) tells you how well your incident management processes are working.
The mean (average) time it takes to resolve an incident is known as MTTR. However, there are numerous issues with using this measure as a barometer for incident response success. To begin with, not all incidents are created equal: you shouldn’t compare simple, easy-to-resolve situations to more challenging incidents.
Concurrency’s Randy Steinberg.
Randy Steinberg, a solutions architect at the IT consulting firm Concurrency, asks, “How do you compare an enterprise-wide email service being down with an application with only a handful of users that maybe suffers from one easily addressed issue every other month?” “Because incidents vary, it’s difficult to gauge how well you’re doing.”
Furthermore, measuring MTTR is an art, not a science. When does the clock begin to tick, for example? When an application starts to slow down? When you get your first alert? When a customer notices? It’s a hard metric to record consistently, because the boundaries of incidents in complex systems are so fuzzy. If your incident response times are so bad that you’re simply trying to get them down to a reasonable level, MTTR can be helpful; otherwise, it can be quite misleading.
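A small, made-up example shows why a single average can mislead: a couple of long, messy incidents drag the mean far away from what a typical incident looks like, which is why comparing it with the median and a high percentile is more informative.

```python
from statistics import mean, median, quantiles

# Hypothetical resolution times in minutes for one month of incidents.
# Most are quick fixes; two messy, multi-team outages dominate the total.
resolution_minutes = [12, 8, 15, 20, 9, 14, 11, 18, 30, 25, 480, 720]

mttr = mean(resolution_minutes)
med = median(resolution_minutes)
p90 = quantiles(resolution_minutes, n=10)[-1]  # 90th percentile

print(f"MTTR (mean): {mttr:.0f} min")  # dragged upward by the two long outages
print(f"Median:      {med:.0f} min")   # closer to the 'typical' incident
print(f"90th pct:    {p90:.0f} min")   # shows the tail without hiding it in an average
```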
Myth #7: We’re improving at incident management because we’re catching problems sooner.
Thanks to the increased effectiveness and granularity of automated monitoring and alerting systems like New Relic, businesses are getting far better at detecting incidents than was previously possible. That doesn’t mean we’re getting better at responding to them. Detecting an incident is only half the battle; the other half is resolving it.
Everbridge’s Vincent Geffray.
“What’s interesting is that, overall, we’re not becoming any better at responding to incidents,” says Vincent Geffray, senior director of product marketing at Everbridge, a critical-event management firm. Why? Because all of the gains made in the first part of the process, detecting incidents sooner, are lost in the second phase: finding the appropriate people to fix the problem. “It can take a few minutes to spot a problem and an hour to bring the relevant people together to start figuring out a solution,” he explains.
What is the solution? AI and machine learning can help by analyzing historical data from past incidents and proposing suitable responses based on similar cases. Beyond that, spend time studying the phases of your incident response process to improve their efficiency; that’s where the most progress has yet to be made.
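As a rough, hypothetical illustration of that idea (not a description of any particular product), matching a new incident against past ones can be as simple as comparing word overlap between incident descriptions and surfacing how the closest matches were resolved:

```python
from collections import Counter
from math import sqrt

# Hypothetical history of past incidents and how they were resolved.
history = [
    ("checkout api returning 502 errors after deploy", "rolled back release 4.12"),
    ("database connection pool exhausted on reporting cluster", "raised pool size and restarted workers"),
    ("login latency spike caused by expired tls certificate", "rotated certificate"),
]

def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two short descriptions."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

new_incident = "502 errors from checkout api after latest deploy"
best = max(history, key=lambda item: cosine_similarity(new_incident, item[0]))
print(f"Most similar past incident: {best[0]!r}")
print(f"Suggested starting point:   {best[1]!r}")
```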
“After a tool like New Relic has identified a problem with an application, what happens in real life is that a ticket is created in your ticketing system, and then you have to find the right people, get them together, and give them the information they need so they can start investigating,” Geffray says. Most of the time, it won’t be just one individual. “Studies suggest that the majority of IT situations require at least five personnel to resolve,” he says. “And, as you might expect, the more mission-critical apps there are, the larger and more dispersed the company is, the longer it takes.”
Myth #8: A “blameless culture” means no one is held accountable for incidents.
Given the (overwhelmingly favorable) push in the IT industry toward a blameless culture, this is a crucial misconception to debunk.
On the bright side, a blameless culture removes fear from the incident response equation: when people know they won’t be fired for making a mistake, they are much more inclined to be frank and transparent. That isn’t to say there is no accountability. You should still figure out what mistakes were made and how they happened, so you can learn from them.
The distinction between accountability and blame is significant. Blame often misreads the nature of complex systems, in which a single blunder is more likely a triggering event that sets off a chain reaction of latent failures. A blameless culture promotes true accountability because individuals and teams feel safe enough to be upfront about mistakes, so the organization can improve the whole system.
Myth #9: You need a dedicated incident management team.
While some businesses choose to have a specialized incident response team, others prefer to rotate the responsibility through their regular IT engineering roles. There are good reasons to spread incident response capabilities throughout your IT organization.
In a DevOps model, any engineer in any role should be able to respond to any incident, and day-to-day incident response should be distributed across the organization.
It is critical to equip every engineer with the information they need to make difficult decisions during an incident. Empower whoever responds to make tough calls with the confidence that they will do their best and choose the right option.
All of this, of course, requires extensive, ongoing training and repeatable, iterative processes. You want your best resources available to deal with the most severe incidents, which takes adequate planning and well-honed strategies. Every engineer on call should have enough training and experience to make good calls, and support should be available in case a call goes wrong.
About Enteros
IT organizations routinely spend days or even weeks troubleshooting production database performance issues across a multitude of critical business systems. By enabling fast and reliable resolution of database performance problems, Enteros helps businesses generate and save millions in direct revenue, minimize lost employee productivity, reduce the number of licenses, servers, and cloud resources they need, and maximize the productivity of application, database, and IT operations teams.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.