Why SLOs are essential for observability
In the legacy monitoring world, you probably tracked resource consumption (CPU, disk, and memory) because those metrics were easy to collect and correlated reasonably well with availability and performance. If you, like me, practiced continuous improvement, you quickly accumulated many granular resource measurements and now maintain a maze of correlation rules to make sense of them.
Now that you’re working with DevOps and observability, you’ll need more. Here’s why:
- Some types of resource usage can’t (or shouldn’t) be measured in a cloud context.
- Workloads can be highly variable, making thresholds difficult to set.
- You need more control over what you’re measuring and how you’re measuring it.
- You must communicate your status to many people quickly and succinctly.
Most importantly, as IT environments evolve (think microservices), you’ll need to streamline your data collection and make it more relevant. If you don’t, your metrics collection will swell to absurd proportions, increasing your administrative burden without delivering value.
Customers (internal and external) expect technology to respond quickly and accurately; they aren’t bothered with the nitty-gritty specifics. If you want to meet their expectations on their terms (and you should, because they hired you), concentrate on measuring response time (speed) and errors (quality). Better still, reduce everything to a single, simple figure that tells your Ops and engineering teams where to spend their effort and lets non-technical stakeholders understand the current status and make crucial business decisions intuitively.
This matters because the business cares more about the customer experience than about some process consuming 2% more CPU than before. That means you must track and communicate two critical metrics: speed and quality.
This is where a service-level indicator (SLI) and a service-level objective (SLO) come to the rescue. By defining an SLI metric that both technical and non-technical stakeholders understand, you’ll be able to meet the business’s expectations. Then you’ll set a target (an SLO), expressed as a simple percentage, to show how often you’re meeting your speed and quality commitments.
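To make the relationship between the two concrete, here is a minimal Python sketch of an SLI calculation and an SLO check; the record format and the 98 percent target are illustrative assumptions, not a specific product’s API:

```python
# Compute an SLI (the percentage of "good" events) and compare it to an SLO target.
# The transaction records and the 98% target are illustrative assumptions.

def sli_percentage(transactions):
    """SLI: share of error-free transactions, as a percentage."""
    if not transactions:
        return 100.0  # no traffic means no failures
    good = sum(1 for t in transactions if not t["error"])
    return 100.0 * good / len(transactions)

def meets_slo(transactions, target=98.0):
    """SLO check: did the SLI reach the target percentage?"""
    return sli_percentage(transactions) >= target

transactions = [{"error": False}] * 97 + [{"error": True}] * 3
print(sli_percentage(transactions))  # 97.0
print(meets_slo(transactions))       # False
```

The point of the shape here is that the SLI is a plain percentage anyone can read, and the SLO is nothing more than a pass/fail comparison against it.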
Here’s some data from the B2B-Gateway, one of my customer’s microservices. They’ve been tracking the number of errors over time:
The green line across the bottom indicates the customer’s alert threshold. As you can see, notifications are issued against this service regularly, so their Ops team no longer pays attention to them. It’s a textbook example of a low-quality alert: noise that gets ignored.
You could go the traditional route and spend a lot of time tuning a different threshold or experimenting with fancy anomaly-detection algorithms, but that still wouldn’t account for the two factors that matter most: speed and quality. Thankfully, we can get from here to there in two simple steps.
Step one: Choose an SLI as your metric for quality, such as the proportion of transactions that complete without errors.
SELECT percentage(count(*), WHERE error IS false) FROM Transaction WHERE appName = 'b2b-gateway' TIMESERIES MAX SINCE 1 week ago
The downward dips in the line show the most severe issues. The purple horizontal line shows the 98 percent success rate (your SLO). Anything below that level is a problem that warrants an alarm. When your error SLI falls below the 98 percent threshold (violating your SLO), you’ll receive four notifications over the course of the day, in contrast to one alert, asserted 24×7, that everyone ignores.
You’ve just accomplished an important goal with a straightforward process: You’ve produced a highly relevant, easy-to-understand business metric (“98 percent of our transactions are error-free”). Furthermore, you’ve transformed a bothersome signal into a useful one with real-world implications. It’s self-evident that a rise in errors will directly impact the user experience.
You can make this measurement more or less sensitive by adjusting the SLO target. For example, lower it to 97 percent if it’s still too noisy, or raise it to 99 percent if you need to push for improvement.
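If you aren’t using NRQL, the step-one calculation is easy to reproduce anywhere. Here is a small Python sketch, using made-up hourly counts, that evaluates the SLI per time window and flags only the windows where the error-free percentage drops below the 98 percent SLO:

```python
# Per-window SLI evaluation: flag only the windows that violate the SLO.
# The (total, errors) counts per window are made-up illustrative data.

SLO_TARGET = 98.0  # percent of transactions that must be error-free

windows = [
    ("09:00", 1000, 5),    # (label, total transactions, errored transactions)
    ("10:00", 1200, 60),   # a bad hour: 5% errors
    ("11:00", 900, 9),
    ("12:00", 1100, 4),
]

violations = []
for label, total, errors in windows:
    sli = 100.0 * (total - errors) / total
    if sli < SLO_TARGET:
        violations.append((label, round(sli, 1)))

print(violations)  # [('10:00', 95.0)]
```

Only the genuinely bad window fires, which is exactly the alert-noise reduction the article describes: a handful of meaningful notifications instead of one permanently asserted alarm.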
Step two: Since speed equals response time, we’ll add it to the equation.
Continuing with the example, the response time threshold for the B2B-Gateway is 100 milliseconds (0.1 seconds). (Watch for a blog entry on how to create SLO criteria shortly.)
To include response time in your query, edit it to look like this:
SELECT percentage(count(*), WHERE error IS false AND duration < 0.1) FROM Transaction WHERE appName = 'b2b-gateway' TIMESERIES MAX SINCE 1 week ago
The downward dips again indicate the most severe issues, and the purple line represents the 98 percent threshold. Once performance is taken into account, you can see that this service isn’t doing as well as you had hoped: quality is good, but the service is not meeting its speed commitments. That’s not something you’d learn from a collection of resource usage measurements. It’s a significant win because it bridges a fundamental gap between IT (whose resource consumption measurements look “absolutely normal”) and the business (who hears about a slow service).
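The composite check from step two can be sketched the same way. In this hypothetical Python snippet, a transaction counts as good only if it is both error-free and faster than the 100-millisecond threshold from the example; the record layout and field names are illustrative assumptions:

```python
# Composite SLI: a transaction is "good" only if it is error-free AND fast enough.
# The records are made up; the 0.1 s (100 ms) threshold follows the article's example.

MAX_DURATION = 0.1  # seconds

transactions = [
    {"error": False, "duration": 0.05},  # good
    {"error": False, "duration": 0.25},  # error-free but too slow
    {"error": True,  "duration": 0.04},  # fast but errored
    {"error": False, "duration": 0.08},  # good
]

good = sum(
    1 for t in transactions
    if not t["error"] and t["duration"] < MAX_DURATION
)
sli = 100.0 * good / len(transactions)
print(round(sli, 1))  # 50.0
```

Notice how the slow-but-successful transaction drags the SLI down: that is precisely the gap between “no errors” and “meeting speed commitments” that the combined metric exposes.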
Finally, consider the benefits of focusing on speed and quality rather than resource consumption:
- Strong business alignment: You have a simple metric that ties the business’s needs to the technology that meets them. It makes it easier to go to your business stakeholders and say, “Hey, we’re only fulfilling our performance obligations 94% of the time, so we need to make some changes.”
- Earlier detection: Your SLI surfaces problems with business impact as clear symptoms of concern. That means you can draw attention to severe issues sooner, giving you a head start on fixing them and reducing their extent and severity. Remember that this will also help prevent alert fatigue.
- Simplicity: Your SLIs and SLOs (response time and quality) are easy to understand and adjust.
“That’s great,” you might think, “but what should I do now?” Start by establishing SLI/SLO-based observability on a few services in your environment and comparing the findings to your present monitoring. After a few weeks, you’ll probably notice that your SLO is catching fewer, more serious issues, and it may even be highlighting problems that your existing monitoring was missing.
About Enteros
Enteros offers a patented database performance management SaaS platform. It proactively identifies root causes of complex business-impacting database scalability and performance issues across a growing number of RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.