Why SLIs and SLOs Are Essential for Observability
If you’re like me, you’ve spent most of your career working with IT operations teams. You’ve seen them put in enormous effort to meet the company’s expectations with only limited success. The business constantly criticizes IT for poor service, while IT tries to meet ambiguous expectations with limited resources. The main issue is a fundamental misalignment in how IT and the business judge performance.
IT is in charge of allocating limited resources (CPU, memory, and disk) among business operations, so it tracks usage. IT uses this data to determine when a resource is nearing exhaustion, with the aim of minimizing problems and keeping costs low. The business, on the other hand, needs fast, error-free services, so it gauges performance by speed and quality. The result is two teams with vastly different notions of success.
In practice, this means IT and the business are constantly at odds. Here’s a real-world example: one of our customers was continually criticized by their business because “the system is always slow.” They had gradually added tools to collect thousands of consumption metrics and tried to develop correlation rules that would indicate when the system was slow. They ended up with chaos: a massive data collection infrastructure that gathered measurements at sub-second intervals, alerts that fired 24 hours a day, and no straightforward way to figure out what was happening.
They weren’t getting anywhere because they weren’t measuring the right things: resource-based monitoring gave them an incomplete picture. If you want a simpler and more responsive observability strategy, better alignment with the business, and a faster path to improvement, focus on service-level measurements instead. I’ll introduce service-level indicators (SLIs) and service-level objectives (SLOs), and then show you how to set your SLOs.
Service-level indicators
The textbook definition of an SLI is “a carefully defined quantitative measure of some aspect of the level of service that is provided.” In other words, an SLI is a metric that measures one aspect of your IT service’s performance. To expand on that, I’d add that it must be relevant to the service being delivered and easy to understand. That is, if an SLI degrades, there must be a business consequence, such as a service outage or a bad user experience.
Remember that the business expects speed and quality, so select SLIs (metrics) that reflect these expectations, such as:
- Latency or response time (how long it takes for something to happen)
- Error rate or quality
- Availability or uptime
Here are some metrics that make poor SLIs because they don’t correspond directly to business impact:
- CPU, disk, and memory consumption
- Cache hit percentage
- Garbage collection time
The key distinction between a good and a bad SLI is how relevant the metric is to service delivery. A high error rate or a long response time directly hampers service delivery. High CPU utilization may affect service delivery too, but the link between CPU and service performance is harder to establish. This is why IT teams that measure resource consumption have such a hard time.
The goal is to choose SLIs that are clearly and unambiguously tied to service delivery and easy to explain to non-technical people. Doing so bridges the gap between IT and the business and makes things easier for everyone involved.
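As a rough illustration, here is a minimal Python sketch of how latency and error-rate SLIs might be computed from raw request records. The `Request` record, its fields, and the 500 ms threshold are assumptions made for this example; they are not taken from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one user-facing request."""
    duration_ms: float
    ok: bool  # True if the request completed without an error


def latency_sli(requests, threshold_ms):
    """Percentage of requests that finished within threshold_ms."""
    if not requests:
        return 100.0  # assumption: no traffic counts as no violations
    fast = sum(1 for r in requests if r.duration_ms <= threshold_ms)
    return 100.0 * fast / len(requests)


def error_free_sli(requests):
    """Percentage of requests that completed without an error."""
    if not requests:
        return 100.0
    good = sum(1 for r in requests if r.ok)
    return 100.0 * good / len(requests)


sample = [
    Request(120, True),
    Request(340, True),
    Request(910, True),   # successful but slow
    Request(200, False),  # fast but failed
]
print(latency_sli(sample, threshold_ms=500))  # 75.0
print(error_free_sli(sample))                 # 75.0
```

Both numbers answer a question a non-technical stakeholder can follow: what share of requests were fast, and what share succeeded.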
Service-level objectives
An SLO is nothing more than a target you set for your SLIs. You must first identify your SLIs. Then you establish your SLOs by specifying thresholds for each SLI.
Even non-technical stakeholders should be able to understand SLOs. Standalone resource consumption metrics, such as CPU utilization, cannot tell you whether something is working well; a subject matter expert must interpret them. Identifying business-impacting SLIs, setting SLOs, and presenting them appropriately means SLO consumers never have to wonder whether a figure is good or bad: the answer is simply “good” or “not good.” SLOs are also easy to use as a measure of progress.
Percentages are the best way to express SLOs that meet the conditions above (intuitive and easy to understand). Don’t use averages; they hide much of the information you need.
Another advantage of using percentages is that they implicitly account for statistical outliers and overall business impact. Slow transactions and failures will always occur, but you don’t want to raise an alert every time one does. You only want to raise an alert when there are enough of them to make a difference.
Here are some examples of well-chosen SLOs expressed as percentages:
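To see why averages can mislead, consider the following small Python sketch. The two windows of response times and the 500 ms target are invented for illustration. Both windows have the same average, yet only the percentage reveals that one of them leaves a tenth of users waiting several seconds.

```python
from statistics import mean

# Hypothetical response times in milliseconds for two measurement windows.
mostly_fast = [100] * 90 + [4050] * 10  # 10% of users wait about 4 seconds
uniformly_ok = [495] * 100              # everyone waits about half a second

THRESHOLD_MS = 500  # assumed latency target


def pct_within(durations, threshold_ms):
    """Percentage of transactions at or under the threshold."""
    return 100.0 * sum(d <= threshold_ms for d in durations) / len(durations)


for name, window in [("mostly_fast", mostly_fast), ("uniformly_ok", uniformly_ok)]:
    print(f"{name}: mean={mean(window):.0f} ms, "
          f"within target={pct_within(window, THRESHOLD_MS):.1f}%")
# mostly_fast: mean=495 ms, within target=90.0%
# uniformly_ok: mean=495 ms, within target=100.0%
```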
- 95% of transactions should have a response time of 500 milliseconds or less.
- 99% of transactions should complete without errors.
- The application should be up 99.9% of the time during business hours.
In contrast to:
- Transactions should take no more than 750 milliseconds to complete.
- The average number of errors per hour should be less than 100.
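The practical difference between the two styles shows up when a few outliers appear. The sketch below is a simplified illustration in Python: the sample durations are made up, and the function names are hypothetical. It compares an absolute rule (no transaction over 750 ms) with a percentage rule (95% within 500 ms).

```python
# Hypothetical durations (ms) for one evaluation window of 200 transactions.
durations = [180] * 197 + [900, 1200, 2500]


def violates_absolute(durations, limit_ms=750):
    """Absolute rule: any single transaction over the limit is a violation."""
    return any(d > limit_ms for d in durations)


def violates_percentage(durations, threshold_ms=500, target_pct=95.0):
    """Percentage rule: a violation only if too few transactions are fast."""
    if not durations:
        return False  # assumption: an idle window is not a violation
    within = sum(d <= threshold_ms for d in durations)
    return 100.0 * within / len(durations) < target_pct


print(violates_absolute(durations))    # True: three outliers trip the rule
print(violates_percentage(durations))  # False: 98.5% are within 500 ms
```

The absolute rule would alert on this window even though almost every user had a fast experience, which is how alert noise builds up.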
Best Practice: When possible, combine your SLIs into a single SLO. For instance, 99% of login operations should complete in less than two seconds and without errors.
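A combined SLO like the login example could be evaluated along these lines. This is only a sketch: the `LoginAttempt` record and the sample data are assumptions, and a real system would read these values from its own telemetry.

```python
from dataclasses import dataclass


@dataclass
class LoginAttempt:
    """Hypothetical record of one login operation."""
    duration_s: float
    error: bool


def combined_login_sli(attempts, max_duration_s=2.0):
    """Percentage of logins that were both fast enough and error-free."""
    if not attempts:
        return 100.0  # assumption: no traffic counts as no violations
    good = sum(
        1 for a in attempts
        if a.duration_s < max_duration_s and not a.error
    )
    return 100.0 * good / len(attempts)


attempts = (
    [LoginAttempt(0.4, False)] * 992   # fast and successful
    + [LoginAttempt(3.1, False)] * 5   # successful but too slow
    + [LoginAttempt(0.3, True)] * 3    # fast but failed
)
sli = combined_login_sli(attempts)
print(f"{sli:.1f}% good logins, SLO met: {sli >= 99.0}")
# 99.2% good logins, SLO met: True
```

A login counts as good only when it satisfies both conditions, so one number covers speed and quality together.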
Setting your SLOs
If the business or IT management has already established SLOs for you, follow them. If not, I recommend the following iterative approach:
- Determine the service for which you wish to create SLOs.
- Identify the service’s most important transactions. Many services have transactions, such as health checks, that should not count toward performance SLOs.
- Identify good SLIs for the service and its transactions.
- Create a baseline SLO for each SLI using the 95th percentile. (Avoid averages because they hide outliers and lead to noisy alerts.)
- Set up notifications for SLO violations.
- Review alert KPIs and service performance regularly to ensure that your SLOs remain relevant and drive improvement.
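The baseline step in the approach above might look like the following Python sketch. It assumes you can export recent transaction names and durations; the sample history, the `EXCLUDED` set, and the nearest-rank percentile method are illustrative choices, not a prescribed procedure.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical (transaction name, duration in ms) pairs from recent history.
history = [("checkout", d) for d in range(100, 300, 10)] + [
    ("health_check", 2),
    ("health_check", 3),
]

# Leave out transactions that should not count toward the SLO.
EXCLUDED = {"health_check"}
durations = [d for name, d in history if name not in EXCLUDED]

# Use the observed 95th percentile as the starting threshold.
baseline_ms = percentile(durations, 95)
print(f"Baseline SLO: 95% of transactions complete within {baseline_ms} ms")
# Baseline SLO: 95% of transactions complete within 280 ms
```

Starting from what the service already achieves gives you a threshold you can alert on immediately and then tighten during the regular reviews.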
Chapter 4 of the Google SRE book is a fantastic resource that covers setting SLOs in more depth. This article will get you started, but you should read the chapter when you have the opportunity.
Why Are SLIs and SLOs Important for Observability
Setting SLIs and SLOs will give you a simpler and more responsive observability practice, better alignment with the business, and a faster path to improvement. It’s quick and easy to get started; try it on a single service to see how well it works.