If You’re Not Monitoring Your Resource Pools, You’re Doing It Wrong
Developing production-ready software today entails much more than just adding functionality. Building a “functionally complete” system is only half the battle. Systems must be built to considerably higher standards to compete in today’s market; gone are the days of deploying software as soon as it passes your QA team’s functional validation.
You must be ready to deal with third-party dependency failures and malicious users, scale your system as you add customers, and meet your reliability service-level objectives (SLOs), service-level indicators (SLIs), and service-level agreements (SLAs), among other things.
Monitoring is, of course, an essential part of reliability. If you don’t have visibility into the health of your system, you’ll only know something is wrong when customers call (or tweet) to complain, which is bad. And the only way you’ll figure out what’s wrong is to stumble around aimlessly, which is worse.
But when reliability experts advise you to monitor the health of your systems, how do you know what you need to monitor? Throughput? Response time? Latency? These are the most obvious options, and while they can often signal when you have a problem, they don’t tell you much about what’s causing it.
You need to take a look at your resource pools.
Any non-trivial software system has pools of resources ready to handle requests as they come in. Communicating with a database requires a pool of database connections. Processing tasks from a queue requires a pool of threads. The work queue itself is a pool, albeit one that fills rather than drains. (Keep in mind that a single “non-pooled” connection is practically a pool of size one.)
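To make that concrete, here’s a minimal sketch in Java of the kind of pools we’re talking about: a bounded work queue feeding a fixed set of worker threads. The class name and the sizes are purely illustrative, not taken from any particular service.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public final class WorkerPools {

    // The work queue is the pool that fills; the worker threads are the pool that drains.
    // Both the queue capacity and the thread count here are purely illustrative.
    public static ThreadPoolExecutor newWorkerPool() {
        final BlockingQueue<Runnable> workQueue = new ArrayBlockingQueue<>(10_000);
        return new ThreadPoolExecutor(16, 16, 0L, TimeUnit.MILLISECONDS, workQueue);
    }
}
```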
A collection of resource pools underpins every streaming system, which may be made up of any number of services. Even if your service, such as a simple windowing data aggregator, doesn’t interact with databases or make any external requests, reading from and writing to your message broker still requires several threads and buffers.
The same is true for HTTP services. In an ASP.NET application running on Microsoft Internet Information Services (IIS), for example, the request queue is a pool of requests waiting to be handled by a pool of request threads.
The sizes of resource pools are simple to measure, and this information is surprisingly useful: when something is wrong with your system, symptoms will inevitably appear in one or more of your resource pools.
Monitoring the agent state downsampler
The agent state downsampler is a basic Apache Kafka service that minimises the amount of data travelling from the language agents our customers have installed in their apps to our downstream consumers. It receives a large stream of agent metadata but sends out only one message per agent per hour. It uses Memcached to keep track of which agents it has already sent a message for in the last hour.
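The real service is more involved, but the core idea fits in a few lines. The sketch below is hypothetical rather than the actual implementation: MemcachedLike stands in for whatever Memcached client you use, and the key scheme is made up.

```java
import java.util.function.Consumer;

public final class AgentStateDownsampler {

    /** Hypothetical wrapper around a real Memcached client. */
    public interface MemcachedLike {
        /** Creates the key only if it is absent; returns true if this call created it. */
        boolean addIfAbsent(String key, int ttlSeconds);
    }

    private static final int ONE_HOUR_SECONDS = 3600;

    private final MemcachedLike memcached;
    private final Consumer<String> downstreamProducer;

    public AgentStateDownsampler(final MemcachedLike memcached, final Consumer<String> downstreamProducer) {
        this.memcached = memcached;
        this.downstreamProducer = downstreamProducer;
    }

    /** Forwards at most one message per agent per hour; everything else is dropped. */
    public void handle(final String agentId, final String message) {
        // Memcached acts as the "already sent this hour" set: the key expires after an hour.
        if (memcached.addIfAbsent("downsampler:" + agentId, ONE_HOUR_SECONDS)) {
            downstreamProducer.accept(message);
        }
    }
}
```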
So, how should we monitor this service? Let’s start with the obvious metrics: throughput, processing time, and lag.
Those look like helpful metrics. But what happens to those graphs if the downsampler starts to lag? Throughput drops, while processing time and lag climb. That’s useful for alerting, but on its own this data can’t tell us anything beyond “something’s wrong”; it doesn’t help us figure out what’s causing the issue. We need to dig a little deeper.
Now that we understand the service better, we can think about it more critically. “How full are our queues and buffers, and how busy are our thread pools?” should be the first question we ask whenever something goes wrong.
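Concretely, assuming a ThreadPoolExecutor backed by a bounded work queue, these are the ratios we want to watch (the helper names here are illustrative):

```java
import java.util.concurrent.ThreadPoolExecutor;

public final class PoolUtilisation {

    /** Fraction of the work queue that is occupied (0.0 = empty, 1.0 = full). */
    public static double queueFullness(final ThreadPoolExecutor pool, final int queueCapacity) {
        return (double) pool.getQueue().size() / queueCapacity;
    }

    /** Fraction of the pool's threads currently busy executing tasks. */
    public static double threadBusyness(final ThreadPoolExecutor pool) {
        return (double) pool.getActiveCount() / pool.getMaximumPoolSize();
    }
}
```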
Here is a small sample of problems we can diagnose almost immediately by monitoring our resource pools:
| Symptoms | Problem | Next Steps |
|---|---|---|
| Throughput is down and the Memcached thread pool is fully utilised | Memcached is down/slow | Investigate the health of the Memcached cluster |
| Throughput is down and the Kafka producer buffer is full | The destination Kafka brokers are down/slow | Investigate the health of the destination Kafka brokers |
| Throughput is down and the work queue is mostly empty | The source Kafka brokers are down/slow, and the consumer thread isn’t pulling messages fast enough | Investigate the health of the source Kafka cluster |
| Throughput is up and the Kafka producer buffer is full | An increase in traffic has caused us to hit a bottleneck in the producer | Address the bottleneck (tune the producer, possibly by increasing the buffer) or scale the service |
A tried-and-true method for keeping track of resource pools
The first step is to gather data about your resource pools. As previously stated, this is quite simple: create a background thread in your service whose sole purpose is to regularly record each resource pool’s size and fullness. For a thread pool, for example, ThreadPoolExecutor.getPoolSize() and ThreadPoolExecutor.getActiveCount() return the pool’s size and the number of active threads.
Here’s a basic example using Guava’s AbstractScheduledService and Apache’s HttpClient libraries:
```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableMap;
import com.google.common.io.CharStreams;
import com.google.common.util.concurrent.AbstractScheduledService;
import com.newrelic.api.agent.NewRelic;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.StringEntity;

public class ThreadPoolReporter extends AbstractScheduledService {

    private final ObjectMapper jsonObjectMapper = new ObjectMapper();
    private final ThreadPoolExecutor threadPoolToWatch;
    private final HttpClient httpClient;

    public ThreadPoolReporter(final ThreadPoolExecutor threadPoolToWatch, final HttpClient httpClient) {
        this.threadPoolToWatch = threadPoolToWatch;
        this.httpClient = httpClient;
    }

    @Override
    protected void runOneIteration() {
        try {
            // Sample the pool and ship the snapshot as a single JSON event.
            final int poolSize = threadPoolToWatch.getPoolSize();
            final int activeTaskCount = threadPoolToWatch.getActiveCount();
            final ImmutableMap<String, Object> attributes = ImmutableMap.of(
                    "eventType", "ServiceStatus",
                    "timestamp", System.currentTimeMillis(),
                    "poolSize", poolSize,
                    "activeTaskCount", activeTaskCount);
            final String json = jsonObjectMapper.writeValueAsString(ImmutableList.of(attributes));
            final HttpResponse response = sendRequest(json);
            handleResponse(response);
        } catch (final Exception e) {
            NewRelic.noticeError(e);
        }
    }

    private HttpResponse sendRequest(final String json) throws IOException {
        final HttpPost request = new HttpPost("http://example-api.net");
        request.setHeader("X-Insert-Key", "secret key value");
        request.setHeader("content-type", "application/json");
        request.setHeader("accept-encoding", "compress, gzip");
        request.setEntity(new StringEntity(json));
        return httpClient.execute(request);
    }

    private void handleResponse(final HttpResponse response) throws Exception {
        try (final InputStream responseStream = response.getEntity().getContent()) {
            final int statusCode = response.getStatusLine().getStatusCode();
            if (statusCode != 200) {
                final String responseBody = extractResponseBody(responseStream);
                throw new Exception(String.format(
                        "Received HTTP %s response from Insights API. Response body: %s",
                        statusCode, responseBody));
            }
        }
    }

    private String extractResponseBody(final InputStream responseStream) throws Exception {
        try (final InputStreamReader responseReader = new InputStreamReader(responseStream, Charset.defaultCharset())) {
            return CharStreams.toString(responseReader);
        }
    }

    @Override
    protected Scheduler scheduler() {
        // Run the first check after one second, then every second thereafter.
        return Scheduler.newFixedDelaySchedule(1, 1, TimeUnit.SECONDS);
    }
}
```
So that you have good data granularity, you should check the thread pool’s stats fairly frequently (I recommend once per second).

You can analyse the data as a line graph, but I like to display my resource pool utilisations as two-dimensional histograms (or heat maps) because problems are easier to spot. For example, this query plots a histogram of the active task count, faceted by host:

```
SELECT histogram(activeTaskCount, width: 300, buckets: 30) FROM ServiceStatus SINCE 1 minute ago FACET host LIMIT 100
```
Our thread pools are mostly idle during “normal” operations, which is exactly what we want: plenty of headroom for traffic bursts. If the dark squares on the heat map begin to migrate to the right, it’s a clear indication that something is wrong.
Add monitoring code like this for each of your resource pools. If you want to limit the number of events you save, consider combining the data from every pool into a single Insights event, as sketched below.
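Here’s roughly how that combined event might look. This is an illustrative sketch only; the pool references and attribute names are hypothetical, not taken from the real downsampler.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;

import com.google.common.collect.ImmutableMap;

public final class CombinedServiceStatus {

    // One ServiceStatus event carrying the stats for every pool at once,
    // instead of a separate event per pool. Attribute names are hypothetical.
    public static ImmutableMap<String, Object> snapshot(final BlockingQueue<Runnable> workQueue,
                                                        final ThreadPoolExecutor workerPool,
                                                        final ThreadPoolExecutor memcachedPool,
                                                        final long kafkaProducerBufferBytes) {
        return ImmutableMap.<String, Object>builder()
                .put("eventType", "ServiceStatus")
                .put("timestamp", System.currentTimeMillis())
                .put("workQueueSize", workQueue.size())
                .put("workerPoolActiveCount", workerPool.getActiveCount())
                .put("memcachedPoolActiveCount", memcachedPool.getActiveCount())
                .put("kafkaProducerBufferBytes", kafkaProducerBufferBytes)
                .build();
    }
}
```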
Finally, tie everything together with an Insights dashboard. Our entire agent state downsampler dashboard is shown below; a quick glance tells us whether anything is wrong with our service or its resource pools.
It’s all about taking charge!
Resource pool monitoring has helped every system I’ve worked on, but high-throughput streaming services have benefited the most. We’ve identified a slew of unpleasant problems in record time.
For example, we recently hit a devastating issue in one of our highest-throughput streaming systems that caused all processing to stop. It turned out to be a problem with the Kafka producer’s buffer space, which would have been extremely difficult to diagnose without this monitoring. Instead, we could open the service’s dashboards, look at the Kafka producer charts, and see that the buffer was full. Within minutes, we had reconfigured the producer with a larger buffer and were back in business.
Monitoring also lets you prevent problems before they occur. Look for historical trends in your dashboards, not just during incidents but on a regular schedule (once a week, for example). If you notice your thread pool utilisation slowly climbing, scale the service before it becomes sluggish and a potential incident occurs.
About Enteros
Enteros offers a patented database performance management SaaS platform. It proactively identifies root causes of complex business-impacting database scalability and performance issues across a growing number of RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.