Preamble
Now that we’ve covered the basics, let’s discuss the customization options in more detail. We’re still working with the default Docker installation here, so I’ll start by listing and explaining each component of the Docker image before moving on to more advanced topics.
The configuration database
The configuration database is a standard PostgreSQL database (installed inside the image via apt-get install postgresql) that stores information about which databases should be monitored and how. More specifically, the configuration consists of:
Metric definitions: metrics are named SQL queries that can return almost anything you need, and they can have different query texts for different PostgreSQL versions. The gatherer regularly connects to the target database, checks its version, and automatically chooses the right variant of the metric definition. When defining metrics, keep a few basic concepts in mind (a minimal example follows this list):
- If the “epoch_ns” (nanoseconds since epoch, the default InfluxDB timestamp precision) column is not present, the gathering server’s timestamp is used instead; you’ll lose a few milliseconds of accuracy (assuming intra-datacenter monitoring), since the exact metrics reading time is not recorded.
- Only text, integer, boolean, and floating point (also known as double precision) data can be returned by queries.
- Columns can optionally be prefixed with “tag_”; the data in such columns is “tagged” and indexed by InfluxDB, which has the following benefits:
- Sophisticated auto-discovery support for indexed keys/values when creating charts with Grafana.
- InfluxDB queries on those columns will be executed more quickly.
- Less disk space is used for repeating values, so it can make sense to prefix a column with “tag_” purely to save space when, for instance, you’re returning lengthy and repetitive status strings (usable with Singlestat or Table panels) that you’ll be looking up by some ID column.
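To make this concrete, here is a minimal sketch of what a metric definition could look like. The metric name (“db_size”) and the exact query are illustrative, not one of the bundled definitions:

-- Hypothetical "db_size" metric: epoch_ns records the exact reading time,
-- and the "tag_" prefix turns the database name into an indexed InfluxDB tag.
SELECT
  (extract(epoch from now()) * 1e9)::int8 AS epoch_ns,
  current_database()::text AS tag_datname,
  pg_database_size(current_database()) AS size_b;

Note that all returned columns stay within the supported data types: integer and text.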
Preset configurations: when you have a collection of databases with similar monitoring needs, you can reuse preset configurations, which are named sets of metrics and their intervals that can be applied to the databases under monitoring. The format is “metricX”: “gathering_intervalX”, “metricY”: “gathering_intervalY”, … (see the example after this list).
Custom configurations: for each database, one may define metric-interval pairs at their discretion.
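For example, a hypothetical “basic” preset covering two metrics could look like the following; the metric names and the intervals (here, in seconds) are illustrative:

{
  "db_size": 300,
  "cpu_load": 60
}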
Web UI
For quick configuration changes (like adding databases to monitoring, adding metrics, etc.), a small Python web application built on the CherryPy web framework is included. For mass changes, one could technically also log into the configuration database and change the tables in the “pgwatch2” schema directly. In addition to the configuration options, two helpful features are the “Status Statements Overview” page and the ability to see the logs of each component. For example, the “Status Statements Overview” page lets you find the query with the longest average runtime over a certain time period.
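As a sketch of such a mass change, suppose you wanted to pause monitoring for every database at once. Assuming a monitored-databases table with an enable flag (the table and column names below are hypothetical; check the actual “pgwatch2” schema in your installation), it could be as simple as:

-- Hypothetical table/column names; verify against your "pgwatch2" schema.
UPDATE pgwatch2.monitored_db SET md_is_enabled = false;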
The metrics gathering daemon
This component, a Go daemon that performs the “heavy lifting,” doesn’t expose any ports. After initializing a few variables and data exchange channels, it effectively works as a simple state machine (sketched in Go after the list) that does the following:
1) Fetch the list of databases that need to be monitored from the configuration database.
2) Make sure that each of the discovered DBs has one worker process (goroutine) per metric; each worker executes the metric definition once the interval has passed and then sleeps for the duration of the interval.
3) Remove the workers if any databases were removed from the configuration.
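A rough sketch of that loop in Go might look like the following. The type and function names are illustrative; this is not the actual pgwatch2 source, just a minimal model of the state machine described above:

package main

import "time"

// MetricJob describes one database/metric pair to be gathered (illustrative type).
type MetricJob struct {
    DB       string
    Metric   string
    Interval time.Duration
}

// worker runs one metric against one DB until told to stop.
func worker(job MetricJob, stop chan struct{}) {
    for {
        select {
        case <-stop:
            return
        default:
            // here the real daemon would run the metric's SQL and ship the result to InfluxDB
            time.Sleep(job.Interval)
        }
    }
}

func main() {
    workers := map[string]chan struct{}{} // key: "db:metric"
    for {
        var jobs []MetricJob // 1) fetched from the configuration database
        seen := map[string]bool{}
        for _, j := range jobs {
            key := j.DB + ":" + j.Metric
            seen[key] = true
            if _, ok := workers[key]; !ok { // 2) start workers for new DB/metric pairs
                stop := make(chan struct{})
                workers[key] = stop
                go worker(j, stop)
            }
        }
        for key, stop := range workers { // 3) stop workers for removed databases
            if !seen[key] {
                close(stop)
                delete(workers, key)
            }
        }
        time.Sleep(2 * time.Minute) // re-read the configuration periodically
    }
}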
The gatherer also includes some built-in fault tolerance. If the configuration database “goes away,” the last known configuration is used. If the storage backend (InfluxDB) becomes unavailable, the gathered metric values are kept in memory and storing is retried periodically. At some point, though, the program would run out of memory and any unsaved metrics data would be lost, so for more critical applications it is still a good idea to set up some kind of external monitoring for InfluxDB.
The Web UI on /logs/pgwatch2 can be used to view the gatherer’s error log. Connection failures due to pg_hba.conf rules, typos in host addresses, and mistyped users or passwords are the most common issues I can imagine.
InfluxDB
InfluxDB, a time-series database that comes in two flavors (community and enterprise), was chosen for bundling into Docker because of its high performance and excellent data compression. The downside of it being a NoSQL database is that it doesn’t speak SQL, but rather a simpler custom query language called InfluxQL. It covers much of what PostgreSQL can do, but falls short on more complex analytics; for example, it can’t directly answer “show me the top 10 slowest queries,” so in such cases you have to spot problems from the graphs. The recent 1.2 version added support for simple subqueries, which shows that the developers are aware of the problem.
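For a feel of the language, a typical InfluxQL query looks quite SQL-like. The measurement and field names below are illustrative; this one would compute the average DB size over the last hour in 5-minute buckets:

SELECT mean("size_b") FROM "db_size" WHERE time > now() - 1h GROUP BY time(5m)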
InfluxDB also ships with an “Admin interface” for ad hoc queries, which can be toggled on or off; we leave it enabled by default, though it might go away in a future release. It is mostly helpful for debugging and viewing metrics data directly in tabular format, but as said, it isn’t really needed, as most users should be covered by the functionality provided by Grafana.
Grafana
Grafana, currently the most popular dashboarding application, makes it easy to connect to many time-series databases, including InfluxDB. Take a look at its excellent documentation to see how simple it is to create your own custom dashboards.
Security aspects
By default, both the management Web UI and the primary user interface (Grafana) lack authentication; however, one must first log in before making any changes to Grafana dashboards. To add authentication to the Web UI, run the Docker image with the following environment variables set:
docker run -d -p 3000:3000 -p 8080:8080 --name pw2 -e PW2_WEBNOANONYMOUS=1 -e PW2_WEBUSER=admin -e PW2_WEBPASSWORD=secret cybertec/pgwatch2
Since HTTPS support is not yet built in, you must put the UI behind a load balancer if it needs to be accessed securely over the Internet. We do, however, plan built-in support for the next major version.
Metrics fetching helpers
By default, pgwatch2 is designed to track only metrics that are available in a typical PostgreSQL database (i.e., the pg_stat_* views), and it does not require superuser privileges. This makes it a great fit for developers, since it is easy to point it at live databases at any time. In most cases, however, you will eventually need more, such as system information like CPU or memory state. In these situations, the usual way to “break out” of Postgres and get at system information is to write a small wrapper function in an “unsafe” PL-language that returns the needed data as a standard SQL result.
There are currently two such “helpers”: a PL/Python sproc for reading the CPU load and a “security definer” wrapper around pg_stat_statements information, which enables monitoring of all user queries (by default, non-superusers can only see their own query statistics). You can easily create your own (for free memory, disk space, etc.), and maybe we’ll add some more as well.
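For illustration, a minimal sketch of such a CPU load helper could look like the following. The function name, output columns, and the “pgwatch2” monitoring role are assumptions, not necessarily what the bundled helper uses:

-- Requires the (untrusted) PL/Python language in the monitored DB.
CREATE EXTENSION IF NOT EXISTS plpythonu;

CREATE OR REPLACE FUNCTION get_load_average(
    OUT load_1min float8, OUT load_5min float8, OUT load_15min float8
) AS $$
    import os
    return os.getloadavg()  # 1/5/15-minute load averages from the OS
$$ LANGUAGE plpythonu VOLATILE;

-- Let the (hypothetical) monitoring role call it without superuser rights.
GRANT EXECUTE ON FUNCTION get_load_average() TO pgwatch2;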
Alerting
Just in time for our release, the latest major version of Grafana added support for basic alerting. Currently only graphs are supported, and it works by letting you set a hard limit (a constant number) on a value you’re graphing; for example, “if DB size growth for 1h is bigger than 1GB, send me a message on Slack together with an image of the graph.” In addition, there are email, webhook, PagerDuty, VictorOps, and OpsGenie notification options. With the exception of email (which needs server-side SMTP configuration), setting up the API-based services via the user interface is fairly simple.
Integrating with Kapacitor is another option for those with more complex alerting needs. Kapacitor works well with InfluxDB and makes it easier to find outliers, among other things.
Backups / updates
Moving to a new version of the Docker image (for example, to get the benefits of a new Grafana version) currently requires quite a few steps, unless you are using a custom setup where single components can easily be swapped out.
There are basically two options. The first is to go into Docker and update the component manually: download the latest Grafana .deb package and install it with “dpkg -i…”. This is actually the simplest approach. The other way is to “docker pull” the latest pgwatch2 image, which already includes the latest versions of all components, and then restore the data from a backup of the previous setup. Restoring requires stopping InfluxDB inside Docker once more, but that is a simple process.
To make the restore process easier, it makes sense to start the Docker image with “-v /pgwatch2_backups:/pgwatch2_backups”, mounting the backup folder from the host on the new container. Otherwise, SSH must be configured, or another transfer option such as S3 used.
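Putting that together with the earlier run command, a launch with the backup folder mounted (ports and container name as before) could look like this:

docker run -d -p 3000:3000 -p 8080:8080 --name pw2 -v /pgwatch2_backups:/pgwatch2_backups cybertec/pgwatch2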