Preamble
I recently encountered an intriguing support case that demonstrates how a problem’s root cause may occasionally be found in the most unlikely of places.
About table bloat
After an UPDATE or DELETE, PostgreSQL keeps old versions of a table row around. This way, sessions that want to read the row at the same time don’t have to wait. But eventually this “garbage” has to be cleaned up, and that is what the autovacuum daemon is supposed to do.
In most cases you don’t need to worry about that, but occasionally something goes wrong: the old row versions are not removed, and the table keeps growing. This is what is called table bloat. It not only wastes storage space, it also slows down scans of the table.
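A quick way to spot such a situation is to look at the dead-tuple counters that autovacuum itself relies on. A minimal sketch (the ordering and LIMIT are just for illustration):

-- List the user tables with the most dead row versions, together with
-- the last time autovacuum processed them.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;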
To get rid of existing bloat, you can use VACUUM (FULL) or tools like pg_squeeze. But it is important to find and fix the cause of the table bloat so that it does not reappear.
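As an illustration of the first option, getting rid of the bloat in a catalog table could look like the following sketch. Keep in mind that VACUUM (FULL) takes an ACCESS EXCLUSIVE lock on the table for the whole run, so it needs a maintenance window; pg_squeeze can achieve the same with much shorter locks.

-- Rewrites the table into a new file, discarding the dead row versions.
-- Blocks all concurrent access to the table while it runs.
VACUUM (FULL, VERBOSE) pg_catalog.pg_attribute;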
The problem
A client of mine called me after noticing table bloat in the pg_attribute system catalog, which holds the metadata for table columns.
This can happen if table columns are frequently added or dropped, or, more commonly, if tables themselves are created and dropped a lot. Most of the time those are temporary tables, which are removed automatically when a session or transaction ends.
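To illustrate why temporary tables churn pg_attribute, here is a small sketch (the table name tmp_demo is made up): every column of a new table, including the hidden system columns, gets a row in pg_attribute, and dropping the table turns those rows into dead tuples that autovacuum has to clean up later.

CREATE TEMPORARY TABLE tmp_demo (id integer, payload text);

-- One row per declared column, plus rows for system columns such as ctid.
SELECT count(*)
FROM pg_catalog.pg_attribute
WHERE attrelid = 'tmp_demo'::regclass;

-- Dropping the table leaves the catalog rows behind as dead tuples.
DROP TABLE tmp_demo;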
The client used temporary tables extensively. But the problem only appeared on a few of the many Linux machines on which they run databases.
Searching the cause
I looked over the list of typical reasons for table bloat:
- Database transactions that stay open in the state “idle in transaction”. These prevent autovacuum from removing row versions that became outdated after the transaction began.
- A data modification rate that is too fast for autovacuum to keep up with. The appropriate response here is to make autovacuum more aggressive (see the sketch after this list).
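The following sketch shows how one might check for the first cause and address the second; some_busy_table, the five-minute threshold and the parameter values are placeholders, not settings from the client’s system.

-- Find sessions that have been idle in an open transaction for a while.
SELECT pid, usename, xact_start, state
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND xact_start < current_timestamp - interval '5 minutes';

-- Make autovacuum more aggressive for one heavily modified table.
ALTER TABLE some_busy_table
    SET (autovacuum_vacuum_scale_factor = 0.01,
         autovacuum_vacuum_cost_delay = 2);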
Neither of these applied here; the second possibility in particular could be ruled out, because it would have caused bloat on all machines rather than only a few.
Next, I examined the usage statistics for the affected table:
dbname=> SELECT * FROM pg_stat_sys_tables
dbname->    WHERE relname = 'pg_attribute';

-[ RECORD 1 ]-------+-------------
relid               | 1249
schemaname          | pg_catalog
relname             | pg_attribute
seq_scan            | 167081
seq_tup_read        | 484738
idx_scan            | 1506941
idx_tup_fetch       | 4163837
n_tup_ins           | 0
n_tup_upd           | 0
n_tup_del           | 0
n_tup_hot_upd       | 0
n_live_tup          | 0
n_dead_tup          | 0
n_mod_since_analyze | 0
last_vacuum         |
last_autovacuum     |
last_analyze        |
last_autoanalyze    |
vacuum_count        | 0
autovacuum_count    | 0
analyze_count       | 0
autoanalyze_count   | 0
This shows that autovacuum has never run on the table. Even more interesting, it reveals why it never ran:
PostgreSQL never attempted to remove any dead tuples (row versions that could be cleaned up), because according to n_dead_tup it believed there weren’t any.
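To see the real state of the table independently of these (evidently stale) statistics, the pgstattuple extension can scan it directly. A sketch, assuming the extension is available on the server:

CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Physically scans pg_attribute and reports live and dead tuples
-- without relying on the statistics collector's counters.
SELECT table_len, tuple_count, dead_tuple_count, free_space
FROM pgstattuple('pg_catalog.pg_attribute');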
The statistics collector
My suspicion was confirmed when I found the following message in the server log:

using stale statistics instead of current ones because stats collector is not responding
The statistics collector is the PostgreSQL background process that gathers usage statistics. After each activity, PostgreSQL backends send statistics messages about that activity. These updates are delivered through a UDP socket on localhost that is created at PostgreSQL startup time. The statistics collector reads from the socket and aggregates the collected statistics.
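A rough way to check whether these statistics messages actually arrive is to generate some activity and watch the counters move. A sketch with a made-up probe table; note that the counters are only updated after a short delay, so the query may need to be repeated after a moment:

CREATE TABLE stats_probe (x integer);
INSERT INTO stats_probe SELECT generate_series(1, 1000);

-- With a working statistics collector, n_tup_ins soon reports the 1000
-- inserted rows; if the UDP messages never arrive, it stays at 0.
SELECT n_tup_ins, n_live_tup
FROM pg_stat_user_tables
WHERE relname = 'stats_probe';

DROP TABLE stats_probe;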
Closing in on the problem
The statistics collector process was running:
  918     1 /usr/pgsql-10/bin/postmaster -D /var/lib/pgsql/10/data/
  947   918 postgres: logger process
  964   918 postgres: checkpointer process
  965   918 postgres: writer process
  966   918 postgres: wal writer process
  967   918 postgres: autovacuum launcher process
  968   918 postgres: stats collector process
  969   918 postgres: bgworker: logical replication launcher
To see what the statistics collector was doing and whether anything was wrong, I traced its execution:
# strace -p 968
strace: Process 968 attached
epoll_pwait(3,
The statistics collector was waiting for incoming messages on the UDP socket, but no messages were arriving.
I examined the UDP socket:
# netstat -u -n -p
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address      Foreign Address    State        PID/Program name
udp6       0      0 ::1:59517          ::1:59517          ESTABLISHED  918/postmaster
Nothing unusual so far.
However, I got a different result when I tried the same thing on a machine that wasn’t having the issue:
# netstat -u -n -p
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address      Foreign Address    State        PID/Program name
udp        0      0 127.0.0.1:46031    127.0.0.1:46031    ESTABLISHED  9303/postmaster
It turned out that on all systems that experienced table bloat, the statistics collector socket had been created on the IPv6 address for localhost, while all working systems were using the IPv4 address!
However, IPv6 was turned off for the loopback interface on all machines:
# ifconfig lo
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 6897  bytes 2372420 (2.2 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 6897  bytes 2372420 (2.2 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
Nailing the bug to the wall
PostgreSQL uses the POSIX function getaddrinfo(3) to resolve localhost.
Because a working statistics collector is crucial, PostgreSQL loops through the addresses returned by that call, creates a UDP socket for each one and tests it, until it has one that works.
So we can conclude that IPv6 must still have been functional at the time PostgreSQL was started.
Further research revealed that IPv6 was indeed disabled during the boot process, but that there was a race condition: sometimes PostgreSQL started after IPv6 had already been disabled, and sometimes before. It was on the machines where PostgreSQL started too early that the statistics collector stopped working and the tables became bloated!
The problem was fixed by changing the boot order so that IPv6 is always turned off before PostgreSQL starts.
Conclusion
This shows once more that the root cause of a problem may not be where you would first expect it. It also shows how important solid operating system knowledge is for database administrators.
About Enteros
Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.