It’s critical to look at the overall architecture of your application, not just individual parts of it, and to understand how the pieces fit together. The first priority is observability, followed by monitoring.
In this post, we’ll look at the database component of your architecture and show you how to track and improve database performance. Regardless of your stack, and whether you want to monitor and optimize PostgreSQL, MongoDB, or any other database, most of the following principles apply.
Why Monitor Database Performance?
Imagine a slow query: your app has to wait for the response, forcing your end-users, and a section of your app, to grind to a halt.
As with traffic congestion, one section of the road may come to a standstill because of something happening elsewhere. If many queries take too long to resolve, other queries may be getting in the way.
Before things start going wrong in other parts of your application, you need to figure out where the problem originates.
The next stage is to use monitoring to proactively optimize your architecture and discover where you can improve performance. But don’t get ahead of yourself; let’s start with the first step:
Step 1: Measure (AKA Instrument) Database Performance
Making sure you’re measuring the right things is the first step in database performance monitoring:
- Errors
- Query duration: the mean and the 90th percentile
- Total throughput, which gives you another opportunity to set up triggers/alerts when throughput becomes a concern
Fortunately, that data has most likely already been collected for you: your APM tool will usually let you configure what you want to instrument and measure. With AppSignal, data collection and aggregation happen automatically and without much effort. We’ve taken care of the heavy lifting!
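If your APM doesn’t handle this for you, here is a minimal sketch of what manual instrumentation could look like, assuming a Node.js app using the node-postgres (“pg”) client. The recordMetric helper and the metric names are hypothetical stand-ins for whatever your APM exposes.

```typescript
// A minimal sketch of manual query instrumentation, assuming the
// node-postgres ("pg") client. recordMetric and the metric names are
// hypothetical stand-ins for whatever your APM exposes.
import { Pool } from "pg";

const pool = new Pool();

// Placeholder: in a real app this would forward the measurement to your APM.
function recordMetric(name: string, value: number, tags: Record<string, string>): void {
  console.log(name, value, tags);
}

async function instrumentedQuery(sql: string, params: unknown[] = []) {
  const start = Date.now();
  try {
    return await pool.query(sql, params);
  } catch (err) {
    recordMetric("query.errors", 1, { query: sql }); // error count
    throw err;
  } finally {
    recordMetric("query.duration_ms", Date.now() - start, { query: sql });
  }
}
```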
Step 2: A Starting Point – Find High-impact Queries
Patterns emerge once your app is instrumented and real-world data flows in. Under ‘Improve’ in the AppSignal menu, we’ve added views for the slowest API requests and queries.
This page gives you an excellent place to start optimizing. By default, slow queries are sorted by impact: these are the queries your app runs frequently. Impact equals throughput multiplied by the mean query time.
Toggling between these sorts will show you which queries have the most volume, which are the slowest, and which have the most significant impact.
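To make the impact sort concrete, here is an illustrative snippet; the queries and numbers are made up, and only serve to show why a fast-but-frequent query can outrank a slow-but-rare one.

```typescript
// Illustrative only: ranking queries by impact = throughput × mean duration.
// The queries and numbers below are made up.
interface QueryStats {
  query: string;
  throughputPerMin: number; // executions per minute
  meanDurationMs: number;
}

const impact = (q: QueryStats) => q.throughputPerMin * q.meanDurationMs;

function byImpact(stats: QueryStats[]): QueryStats[] {
  return [...stats].sort((a, b) => impact(b) - impact(a));
}

const ranked = byImpact([
  { query: "SELECT ... FROM orders", throughputPerMin: 1200, meanDurationMs: 15 },
  { query: "SELECT ... FROM reports", throughputPerMin: 3, meanDurationMs: 900 },
]);
// 1200 × 15 = 18,000 beats 3 × 900 = 2,700: the fast-but-frequent query has
// the bigger overall impact, even though each individual call is quicker.
console.log(ranked[0].query); // "SELECT ... FROM orders"
```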
A Quick Side-step: Fix Some Individual Queries
You can now look for some issues to work on.
You can see all the details about a slow query if you click on it.
When you hover over a peak in the response time graph, you can zoom in by selecting the ‘what happened here’ option. You’ll see what errors occurred at that moment in that sample, and what was happening on the host.
We’ve highlighted one specific anti-pattern: N+1. The N+1 anti-pattern occurs when a query is run for each result of a previous query.
The total number of queries is N + 1, where N is the number of results returned by the initial query. If the first query yields only one result, that’s N + 1 = 2 queries. If there are 1,000 results, that’s 1,001 queries. Boom.
If your slow queries follow the N+1 anti-pattern, consider it a lucky find: the pattern is easy to spot and usually straightforward to fix.
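As a sketch, assuming a Node.js app using the node-postgres (“pg”) client and made-up posts/authors tables, this is what the anti-pattern and one possible fix look like:

```typescript
// Sketch of the N+1 anti-pattern and one possible fix, assuming the
// node-postgres ("pg") client and made-up posts/authors tables.
import { Pool } from "pg";

const pool = new Pool();

// N+1: one query for the posts, then one extra query per post for its author.
async function authorNamesNPlusOne(): Promise<string[]> {
  const { rows: posts } = await pool.query("SELECT id, author_id FROM posts");
  const names: string[] = [];
  for (const post of posts) {
    const { rows } = await pool.query("SELECT name FROM authors WHERE id = $1", [post.author_id]);
    names.push(rows[0].name);
  }
  return names; // 1,000 posts => 1,001 queries
}

// Fix: fetch everything in a single joined query (eager loading does the same).
async function authorNamesJoined(): Promise<string[]> {
  const { rows } = await pool.query(
    "SELECT authors.name FROM posts JOIN authors ON authors.id = posts.author_id"
  );
  return rows.map((row) => row.name); // always 1 query
}
```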
Let’s assume we fixed a few of these issues, and while doing so, noted when they occurred and what the worst response times were.
Step 3: Finding Patterns – Deciding What Is Wrong
When it comes to monitoring, it’s common to focus on when things go wrong. However, part of the complexity of monitoring is deciding what should count as ‘bad’ for your setup. And would you rather be notified about a problem that resolves itself, or miss one that doesn’t?
Let’s look at the average throughput and duration of your queries. You’ll be able to configure this in your APM application. If you’re monitoring a Node.js app with AppSignal, with PostgreSQL as your database, we’ll automatically generate dashboards with the most important metrics.
Here’s some more information if you’d like to dig into this in your app. When you switch between time ranges, you’ll notice brief peaks that fade away within an hour or so, and peaks that remain high even when you zoom out to a longer time frame. From looking at the slowest-queries screen earlier, we already know the main offenders and when they peaked. Short peaks in throughput may be acceptable if the host was busy with other operations that delayed a query.
You’ll want to zoom in on queries with longer durations or higher throughput. Comparing the 95th percentile to the mean shows whether a slow query affects everyone or only a small percentage of users, which can help you figure out what to improve.
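A toy example of why the two numbers tell different stories; the durations below are invented, and the nearest-rank percentile is just one common way to compute it.

```typescript
// Toy example (invented durations) of why the 95th percentile and the mean
// tell different stories about the same set of queries.
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Nearest-rank percentile: the value below which p% of measurements fall.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const index = Math.max(0, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[index];
}

// 90 fast queries (10 ms) and 10 slow ones (2,000 ms).
const durations = [...Array(90).fill(10), ...Array(10).fill(2000)];
console.log(mean(durations));           // 209 ms: "everything looks a bit slow"
console.log(percentile(durations, 95)); // 2000 ms: a small group is hit hard
```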
Step 4: What Should Get You Up at Night?
We usually advocate starting on the noisy side of the spectrum: instead of going straight to PagerDuty, set some triggers that warn you on a channel that doesn’t wake you up, such as email or Slack.
To get an alert on your app’s average query time, create a custom metrics trigger on the “event duration” metric and pick the mean field. To receive warnings about the slowest queries your app encounters, select the 90th or 95th percentile instead.
Then configure the “group” and “namespace” tags for the event you want to be alerted about and the namespace it belongs to. For example, for the group: “active record” (Rails), “postgres” (Node.js), or “ecto” (Elixir); for the namespace: “web” or “background.”
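This is not AppSignal’s trigger API; it’s just a generic sketch of what such a trigger boils down to, with placeholder thresholds and a stand-in notify helper you would replace with your real alerting channel.

```typescript
// Not AppSignal's trigger API: a generic sketch of what a duration trigger
// grouped by "group" and "namespace" tags boils down to. Thresholds and the
// notify() helper are placeholders you would tune for your own app.
interface DurationSample {
  durationMs: number;
  group: string;      // e.g. "active record", "postgres", "ecto"
  namespace: string;  // e.g. "web", "background"
}

function notify(channel: "slack" | "email", message: string): void {
  console.log(`[${channel}] ${message}`);
}

function checkMeanDurationTrigger(
  samples: DurationSample[],
  group: string,
  namespace: string,
  thresholdMs: number
): void {
  const matching = samples.filter((s) => s.group === group && s.namespace === namespace);
  if (matching.length === 0) return;
  const meanMs = matching.reduce((sum, s) => sum + s.durationMs, 0) / matching.length;
  if (meanMs > thresholdMs) {
    notify("slack", `Mean ${group} duration in ${namespace} is ${meanMs.toFixed(0)} ms (limit ${thresholdMs} ms)`);
  }
}
```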
After a week, you’ll begin to notice patterns and better understand which cases deserve your attention and which ones resolve themselves.
You can then adjust the triggers to be less noisy, and add channels for the specific warnings you do want to be woken up by.
Step 5: Done Setting Up Database Monitoring?
In principle, we should now have a good monitoring setup:
- The valuable data is coming in.
- We’ve set up triggers to warn us before the bits start flying.
- We’ve identified some significant culprits.
Aren’t we supposed to be done now?
Your app’s usage may grow, resulting in new bottlenecks. Code updates could affect your database, causing query durations to increase. Or a particular field with a large payload will trigger something. No matter how much experience you have, new issues and bottlenecks will arise.
However, you can rest easy knowing that you’ll be alerted. With the proper setup, you’ll learn to dig deep into your architecture to resolve these issues.
Perhaps you’ll find satisfaction in improving the large, “impactful” queries before any other bolts fall out. It’s our favorite food, along with stroopwafels.
About Enteros
IT organizations routinely spend days and weeks troubleshooting production database performance issues across a multitude of critical business systems. Fast and reliable resolution of database performance problems by Enteros enables businesses to generate and save millions in direct revenue, minimize wasted employee productivity, reduce the number of licenses, servers, and cloud resources, and maximize the productivity of the application, database, and IT operations teams.