Preamble
A surprising number of the PostgreSQL support requests we have received recently are about autovacuum and, in particular, UPDATE. PostgreSQL handles updates and storage in general very differently from other databases such as Oracle, so switching from Oracle to PostgreSQL can catch people by surprise. It is therefore worth stepping back and looking at the bigger picture: here is our beginner’s guide to PostgreSQL autovacuum and UPDATE.
How PostgreSQL handles UPDATE
The most crucial thing for beginners to remember is that UPDATE will internally duplicate a row. The old row and the new row will both be in your table following an UPDATE. Why, though, is that the case? Here’s an illustration:
BEGIN;
UPDATE tab SET id = 17 WHERE id = 16;
SELECT …
ROLLBACK;
The concept is straightforward: ROLLBACK could not work if UPDATE destroyed the old version of the row. UPDATE therefore has to copy the row so that both versions exist and transactions behave correctly.
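If you want to see this duplication with your own eyes, you can look at PostgreSQL's system columns. The following is a minimal sketch assuming the single-row table tab from the example above; ctid, xmin and xmax are system columns that exist on every PostgreSQL table:

BEGIN;
SELECT ctid, xmin, xmax, id FROM tab;   -- one row version, e.g. at ctid (0,1)
UPDATE tab SET id = 17 WHERE id = 16;
SELECT ctid, xmin, xmax, id FROM tab;   -- a new version at a new ctid; the old one still exists on disk
ROLLBACK;
SELECT id FROM tab;                     -- still returns 16, because the old version was never destroyed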
Why is that important? Here is one more illustration. Suppose there is only one row in the table (id = 16):
| Connection 1 | Connection 2 |
| --- | --- |
| SELECT * FROM tab; | SELECT * FROM tab; |
| … returns 16 … | … returns 16 … |
| BEGIN; | |
| UPDATE tab SET id = 20 WHERE id = 16; | |
| SELECT * FROM tab; | SELECT * FROM tab; |
| … returns 20 … | … returns 16 … |
| COMMIT; | |
Note that the second SELECT in the second connection still sees the old row, because the UPDATE has not been committed yet. PostgreSQL therefore has to keep the old version around and check, for every transaction, which row version is visible to it.
The same holds true for DELETE: If your disk is full, deleting 100 million rows won’t immediately free up space in the filesystem because other concurrent transactions might still be using the data.
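A quick, hypothetical way to convince yourself of this is to compare the table's on-disk size before and after a large DELETE; pg_relation_size and pg_size_pretty are standard PostgreSQL functions, and tab is the example table from above:

SELECT pg_size_pretty(pg_relation_size('tab'));   -- size before the DELETE
DELETE FROM tab;
SELECT pg_size_pretty(pg_relation_size('tab'));   -- roughly unchanged: the rows are only marked as dead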
VACUUM: The art of cleaning dead rows
As previously mentioned, DELETE does not actually remove old rows. So who does? The answer is VACUUM.
Let’s look at an illustration:
| Connection 1 | Connection 2 | Connection 3 |
| --- | --- | --- |
| BEGIN; | | |
| DELETE FROM tab; | | |
| … running … | SELECT * FROM tab; | |
| … running … | … will return all the rows … | |
| … running … | | VACUUM; |
| … running … | | … does not delete anything … |
| COMMIT; | | |
| | | VACUUM; |
| | | … space can be reclaimed … |
The crucial point in this illustration is that there are still transactions that can see the rows which are about to be deleted, so the first VACUUM is not yet allowed to remove the dead rows. In other words, even if you run VACUUM constantly, cleanup will be delayed by long-running transactions.
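One way to spot such long-running transactions is to query pg_stat_activity; the five-minute threshold below is just an arbitrary example value:

SELECT pid, usename, state, xact_start, query
FROM pg_stat_activity
WHERE xact_start < now() - interval '5 minutes'
ORDER BY xact_start;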
Rule: long-running transactions can delay cleanup and bloat tables. It is also important to note that VACUUM merely marks the space as reusable so it can be filled again later; it does not necessarily shrink the table on disk. Careful monitoring of long-running transactions is therefore essential. The idle_in_transaction_session_timeout setting can be very helpful in this situation: it simply terminates sessions whose transactions have been sitting idle for longer than the configured amount of time.
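As a sketch, the timeout can be set globally or per session; the five-minute value is purely an example, not a recommendation:

ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';   -- requires superuser
SELECT pg_reload_conf();                                          -- make the new setting take effect
SET idle_in_transaction_session_timeout = '5min';                 -- or: only for the current session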
Autovacuum (VACUUM) in action
Vacuuming is so important that PostgreSQL does a lot of it automatically to make life as simple as possible. Roughly once a minute, the autovacuum daemon checks whether any tables need to be vacuumed; if so, a worker process is started to clean up the table.
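Roughly speaking, a table is considered in need of vacuuming once the number of dead rows exceeds a threshold derived from two settings; the values shown in the comments are the stock PostgreSQL defaults:

-- autovacuum kicks in (approximately) when:
--   n_dead_tup > autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples
SHOW autovacuum_naptime;              -- default: 1min
SHOW autovacuum_vacuum_threshold;     -- default: 50
SHOW autovacuum_vacuum_scale_factor;  -- default: 0.2 (20% of the table)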
Keep in mind that those automatic VACUUM processes are deliberately throttled so that they do not disturb concurrent activity. Running VACUUM on a table by hand will therefore “feel” faster than autovacuum, but this is only true with the default configuration.
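The throttling mentioned above is governed by cost-based vacuum delay settings; the defaults listed in the comments are those of recent PostgreSQL versions (older releases shipped with autovacuum_vacuum_cost_delay = 20ms), so check your own server:

SHOW autovacuum_vacuum_cost_delay;   -- default: 2ms; autovacuum sleeps after doing this much "work"
SHOW autovacuum_vacuum_cost_limit;   -- default: -1, i.e. it falls back to vacuum_cost_limit (200)
SHOW vacuum_cost_delay;              -- default: 0, which is why a manual VACUUM is not throttled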
Most of the time, the autovacuum works perfectly and cleans the tables just the way it should. However, there are some circumstances where things might end up being a problem.
Autovacuum: The dark sides of VACUUM
In most cases, autovacuum is exactly what you want. However, there are a few extremely unique situations that users should be aware of. For example, consider the following scenario:
- Table size: 50,000 rows
- 1,000 UPDATEs per second
- Default autovacuum settings
- Long-running reads on some other tables in the database
In this scenario, all the ingredients for disaster are present: the long-running reads prevent VACUUM from removing dead rows quickly, so the table grows far beyond its proper size. A larger table, of course, makes each VACUUM run take longer and longer. Once the situation spirals out of control, the table simply blows up. In recent years, we have seen tables containing only a few million rows grow to a shocking 4.5 TB in size. It goes without saying that such an explosion has a noticeable effect on performance. Keep in mind that running a more aggressive VACUUM on a table like this will not really help any more: it only marks the space as reusable, and in most cases the space is not returned to the filesystem.
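To catch this kind of situation early, it helps to keep an eye on dead-tuple counts and the most recent autovacuum runs; pg_stat_user_tables is a standard PostgreSQL statistics view:

SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;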
There are really only two ways to prevent this kind of catastrophe:
- Rebuild the table (VACUUM FULL, etc.)
- Use pg_squeeze or a comparable tool
While the table is being rebuilt, VACUUM FULL has to lock it, which essentially means downtime. With 50,000 transactions per second, downtime is clearly not an option. That is why we spent some time building a tool called pg_squeeze, which reorganizes a table in place without holding heavy locks for long.
In general, the default configuration is reasonably OK unless you run huge numbers of UPDATE statements. In that case, it can make sense to configure autovacuum to work more aggressively, as sketched below, and to combine it with pg_squeeze and similar techniques to keep tables small.
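Making autovacuum “work harder” is typically done per table via storage parameters; the numbers below are illustrative values for a heavily updated table, not universal recommendations:

ALTER TABLE tab SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- vacuum after roughly 1% dead rows instead of 20%
    autovacuum_vacuum_threshold    = 1000,
    autovacuum_vacuum_cost_delay   = 0      -- let autovacuum run unthrottled on this table
);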
Fixing the data structure to avoid UPDATE altogether
UPDATE statements are sometimes used in applications to modify a person’s or a vehicle’s current position. Consider that you want to track your pizza delivery and know exactly where it is right now. If there are thousands of pizza deliveries going on at the same time, each driver may send their GPS location every few seconds. Thousands of updates would be made to a relatively small amount of data as a result, which can cause the bloat problems I mentioned earlier.
One way to solve the issue is to use INSERT rather than UPDATE. That’s right: why not simply keep the entire history? The reasoning is this: since UPDATE makes a copy of the row anyway, there is not much difference between INSERT and UPDATE from a storage point of view; both create a new row version on disk. The benefit of using INSERT is that you also end up with the full history of the track, which can be useful (BI professionals adore this kind of time-series data). So how does cleanup work? The plan is to use a time-partitioned table and simply DROP the partitions that are no longer needed. Compared to DELETE / VACUUM and UPDATE / VACUUM, DROP TABLE has a really nice advantage: it quickly and simply removes the associated data files from disk.
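Here is a minimal sketch of this idea, assuming a hypothetical positions table partitioned by day (declarative partitioning has been available since PostgreSQL 10):

CREATE TABLE positions (
    vehicle_id  int         NOT NULL,
    recorded_at timestamptz NOT NULL,
    lat         double precision,
    lon         double precision
) PARTITION BY RANGE (recorded_at);

CREATE TABLE positions_2024_11_18 PARTITION OF positions
    FOR VALUES FROM ('2024-11-18') TO ('2024-11-19');

-- drivers simply INSERT new positions; there are no UPDATEs, hence no dead rows piling up
INSERT INTO positions (vehicle_id, recorded_at, lat, lon) VALUES (42, now(), 48.21, 16.37);

-- retiring old data is a cheap metadata operation that immediately removes the data files
DROP TABLE positions_2024_11_18;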
About Enteros
Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.
Are you interested in writing for Enteros’ Blog? Please send us a pitch!