Why Are SLIs and SLOs Important for Observability?


Why SLIs and SLOs Are Essential for Observability

If you’re like me, you’ve spent most of your career working with IT operations teams. You’ve seen them work hard to meet the company’s expectations, yet with only limited success. The business constantly criticizes IT for poor service, while IT tries to meet ambiguous expectations with limited resources. The root problem is a fundamental misalignment in how IT and the business judge performance.
IT is in charge of allocating limited resources (CPU, memory, and disk) among business operations, so it tracks usage. IT uses this data to determine when a resource is nearing exhaustion, with the aim of minimizing problems and keeping costs low. The business, on the other hand, needs fast and error-free services, so it gauges performance by speed and quality. The result is two teams with vastly different notions of success.

In practice, this means that IT and the business are constantly at odds. Here’s a real-world example: one of our customers was continually criticized by the business because “the system is always slow.” They had gradually added tools to collect thousands of consumption metrics and tried to develop correlation rules that would indicate when the system was slow. They ended up in chaos: a massive data collection infrastructure gathering measurements at sub-second intervals, alerts going out 24 hours a day, and no straightforward way to figure out what was happening.

They weren’t getting anywhere because they weren’t measuring the right things: resource-based monitoring gave them an incomplete picture. If you want a simpler and more responsive observability strategy, better alignment with the business, and a faster path to improvement, focus on service-level measurements instead. I’ll introduce service-level indicators (SLIs) and service-level objectives (SLOs), and then show you how to set your SLOs.

Service-level indicators

The textbook definition of an SLI is “a carefully defined quantitative measure of some aspect of the level of service that is provided.” In other words, an SLI is a metric that measures one aspect of your IT service’s performance. I’d add that it must be relevant to the service delivered and easy to understand: if an SLI degrades, there must be a business consequence, such as a service outage or a poor user experience.

Remember that the business expects speed and quality, so select SLIs (metrics) that reflect these expectations, such as:

Latency or response time (how long it takes for something to happen)
Error rate/quality
Availability
Uptime

Here are some SLI candidates you should avoid because they don’t map directly to business impact:

CPU, disk, and memory consumption
Cache hit ratio
Garbage collection time

What separates a good SLI from a bad one is how relevant the metric is to service delivery. A high error rate or a long response time directly hampers service delivery. High CPU utilization may also affect service delivery, but the link between CPU and service performance is much harder to establish. This is why IT teams that measure resource consumption have such a hard time.

The goal is to choose an SLI that is clearly and unambiguously tied to service delivery and is easy to explain to non-technical people. Doing so bridges the gap between IT and the business and makes things easier for everyone involved.
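As a minimal sketch of what measuring speed and quality SLIs might look like, the Python below computes an error rate and the share of requests served within a latency threshold. The `Request` record, the function names, and the sample values are all hypothetical and chosen only for illustration; in a real system these numbers would come from your request logs or monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took to serve
    ok: bool            # True if the request completed without an error


def error_rate(requests):
    """Fraction of requests that failed (0.0 to 1.0)."""
    if not requests:
        return 0.0
    return sum(1 for r in requests if not r.ok) / len(requests)


def fast_fraction(requests, threshold_ms):
    """Fraction of requests served within threshold_ms."""
    if not requests:
        return 1.0
    fast = sum(1 for r in requests if r.duration_ms <= threshold_ms)
    return fast / len(requests)


# Hypothetical sample: four requests, one of them slow and failed.
sample = [
    Request(120, True),
    Request(480, True),
    Request(950, False),
    Request(300, True),
]

print(f"error rate: {error_rate(sample):.0%}")
print(f"served within 500 ms: {fast_fraction(sample, 500):.0%}")
```

Both numbers can be explained to a non-technical stakeholder in one sentence, which is the property the SLI is chosen for.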

Service-level objectives

An SLO is nothing more than a target you set for your SLIs. You must first identify your SLIs. Then you establish your SLOs by specifying thresholds for each SLI.

Even non-technical stakeholders should be able to understand SLOs. Standalone resource consumption measures, such as CPU use, cannot tell you whether something is working well; a subject matter expert must interpret them. Identifying business-impacting SLIs, setting SLOs, and presenting them appropriately means that SLO consumers no longer have to wonder whether a figure is good or bad. The answer is simply “good” or “not good.” SLOs are also easy to use as a measure of progress.

Percentages are the best way to express SLOs that meet the conditions above (simple and intuitive). Don’t use averages; they hide much of the information you need to know.

Another advantage of employing percentages is that they implicitly account for statistical outliers and overall business impact. Slow transactions and failures will always occur, but you don’t want to set off an alarm every time one occurs. You only want to set off an alert if there are enough to make a difference.
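To illustrate why averages can mislead, here is a small sketch with two made-up sets of response times. Both have the same average, but very different shares of transactions meet a 500-millisecond target. The data and the `share_within` helper are assumptions invented for this example.

```python
from statistics import mean

# Hypothetical response times in milliseconds for 20 transactions each.
steady = [590] * 20               # every transaction is moderately slow
spiky = [100] * 18 + [5000] * 2   # most are fast, two are extreme outliers


def share_within(durations_ms, threshold_ms):
    """Fraction of transactions that finished within threshold_ms."""
    within = sum(1 for d in durations_ms if d <= threshold_ms)
    return within / len(durations_ms)


for name, data in (("steady", steady), ("spiky", spiky)):
    print(
        f"{name}: average {mean(data):.0f} ms, "
        f"{share_within(data, 500):.0%} within 500 ms"
    )
```

The average is identical for both sets, so it cannot tell you that in one case no transaction met the target while in the other most of them did. The percentage makes that difference visible.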

Here are some examples of well-chosen SLOs expressed as percentages:

95% of transactions should have a response time of 500 milliseconds or less.
99% of transactions should complete without errors.
The application should be up 99.9% of the time during business hours.

In contrast to:

Transactions should take no more than 750 milliseconds to complete.
The average number of errors per hour should be less than 100.

Best Practice: When possible, combine your SLIs into a single SLO. For instance, 99 percent of login operations should be completed in less than two seconds and without errors.
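A combined SLO like the login example could be checked along the following lines. This is only a sketch: the `Login` record, the function names, and the sample data are hypothetical, and the two-second limit and 99 percent target are taken from the example above.

```python
from dataclasses import dataclass


@dataclass
class Login:
    duration_s: float  # time taken to complete the login
    ok: bool           # True if the login completed without an error


def slo_compliance(events, max_duration_s=2.0):
    """Share of events that were both fast enough and error-free."""
    if not events:
        raise ValueError("no events to evaluate")
    good = sum(1 for e in events if e.ok and e.duration_s < max_duration_s)
    return good / len(events)


def slo_met(events, target=0.99, max_duration_s=2.0):
    """True if the share of good events reaches the target."""
    return slo_compliance(events, max_duration_s) >= target


# Hypothetical window of 200 logins: 197 good, 2 too slow, 1 failed.
logins = (
    [Login(0.4, True)] * 197
    + [Login(3.1, True)] * 2
    + [Login(0.5, False)]
)

print(f"compliance: {slo_compliance(logins):.1%}, SLO met: {slo_met(logins)}")
```

Because a login counts as good only when it is both fast and error-free, a single figure answers the question of whether the service is meeting expectations.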

Setting your SLOs

If the company or IT management has already established SLOs for you, you should follow them. If not, I recommend the following iterative approach:

Determine the service for which you wish to create SLOs.
Identify the service’s most important transactions. Many services have transactions that should not count toward performance SLOs, such as health checks.
Identify SLIs for the service and its transactions.
Create a baseline SLO for each SLI using the 95th percentile. (Avoid using averages because they disguise outliers and result in noisy alerts.)
Set up alerts for SLO violations.
Review alert KPIs and service performance regularly to ensure that your SLOs remain relevant and drive improvement.
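The baseline and alerting steps above can be sketched as follows. This is one possible interpretation, not a definitive method: it takes the 95th percentile of historical response times (using a simple nearest-rank calculation) as the latency threshold, excludes health checks, and flags a later window when fewer than 95% of its transactions stay within that threshold. The record format, the `/health` and `/checkout` paths, and all function names are assumptions made for the example.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with pct% of data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def business_durations(records, excluded_paths=("/health",)):
    """Durations of transactions that should count toward the SLO."""
    return [d for path, d in records if path not in excluded_paths]


def baseline_threshold(records, pct=95):
    """Latency threshold (ms) taken from the historical percentile."""
    return percentile(business_durations(records), pct)


def violates_slo(records, threshold_ms, target=0.95):
    """True if too few transactions finished within threshold_ms."""
    durations = business_durations(records)
    within = sum(1 for d in durations if d <= threshold_ms) / len(durations)
    return within < target


# Hypothetical (path, duration_ms) history: 20 checkouts plus health checks.
history = [("/checkout", d) for d in range(100, 300, 10)] + [("/health", 5)] * 50
# Hypothetical recent window with two very slow checkouts.
recent = [("/checkout", 150)] * 18 + [("/checkout", 900)] * 2

threshold = baseline_threshold(history)
print(f"baseline threshold: {threshold} ms")
print(f"alert on recent window: {violates_slo(recent, threshold)}")
```

In practice the alert would go to whatever notification system you already use, and the threshold would be revisited during the regular reviews described in the last step.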

Chapter 4 of the Google SRE book is a fantastic resource that covers setting SLOs in more depth. This article will get you started, but you should read the chapter when you have the opportunity.

Why Are SLIs and SLOs Important for Observability

Setting SLIs and SLOs will give you a simpler and more responsive observability practice, better alignment with the business, and a faster path to improvement. Getting started is quick and straightforward; try it out on a single service to see how well it works.

About Enteros

IT organizations routinely spend days and weeks troubleshooting production database performance issues across multitudes of critical business systems. Fast and reliable resolution of database performance problems by Enteros enables businesses to generate and save millions of direct revenue, minimize waste of employees’ productivity, reduce the number of licenses, servers, and cloud resources and maximize the productivity of the application, database, and IT operations teams.

The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.
