Preamble
What is the relationship between PostgreSQL Full-Text Search and VACUUM? Many people may be surprised that there is a relationship worth talking about at all, yet the two are more closely connected than most users realize. Because the reason is hidden deep inside the code, the problem often goes unnoticed. This post sheds some light on what is actually happening, with the goal of helping users speed up full-text indexing (FTI) and make PostgreSQL work better for everyone.
Controlling VACUUM and autovacuum
Before getting to the heart of the matter, we need some test data, so I created a table for that purpose. Note that I disabled autovacuum to keep full control over every action, which makes it easier to demonstrate how PostgreSQL behaves.
test=# CREATE TABLE t_fti (payload tsvector) WITH (autovacuum_enabled = off);
CREATE TABLE
In the next step we generate 2 million random "texts". To keep things simple, I did not import a real data set; a set of MD5 hashes is more than good enough for the job:
test=# INSERT INTO t_fti
         SELECT to_tsvector('english', md5('dummy' || id))
         FROM generate_series(1, 2000000) AS id;
INSERT 0 2000000
Here is how our data appears:
test=# SELECT to_tsvector('english', md5('dummy' || id))
       FROM generate_series(1, 5) AS id;
             to_tsvector
--------------------------------------
 '8c2753548775b4161e531c323ea24c08':1
 'c0c40e7a94eea7e2c238b75273087710':1
 'ffdc12d8d601ae40f258acf3d6e7e1fb':1
 'abc5fc01b06bef661bbd671bde23aa39':1
 '20b70cebcb94b1c9ba30d17ab542a6dc':1
(5 rows)
For efficiency, I decided to store the tsvector data type directly in the table. The advantage is that a full-text index can be created directly on that column:
test=# CREATE INDEX idx_fti ON t_fti USING gin(payload);
CREATE INDEX
In PostgreSQL, “full text search” (FTS) is typically handled by a GIN index.
At the end we run VACUUM ANALYZE to set all those hint bits and tell PostgreSQL to compute optimizer statistics.
test=# VACUUM ANALYZE;
VACUUM
How GIN indexes work in PostgreSQL
To understand what VACUUM and Full-Text Search (FTS) have to do with each other, we first have to look at how GIN indexes actually work. Down to the word level, a GIN index is essentially a "normal tree", so a term can be found quickly with a binary search. Below the word level, however, GIN uses a "posting tree" instead of pointing to a single row as a btree does. As a result, each term appears only once in the index but refers to a large number of rows. This makes sense for full-text search: the number of distinct words in the real world is limited, while a single word may appear thousands of times.
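If you want to see this structure for yourself, the pageinspect extension can decode the GIN metapage. The following is a minimal sketch, assuming a PostgreSQL version recent enough to ship pageinspect's GIN functions; the index name idx_fti is simply the one created above:

-- a minimal sketch, assuming the pageinspect extension is available
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- page 0 of a GIN index is its metapage; the result shows, among other things,
-- the number of entry pages, posting-tree ("data") pages, and pending-list pages
SELECT * FROM gin_metapage_info(get_raw_page('idx_fti', 0));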
Let's examine the posting tree in more detail: it contains one entry for every pointer to the underlying table, and it is kept sorted for efficiency. The problem is that modifying the GIN index for every single inserted row is expensive, because changing the posting tree costs money: keeping it correctly sorted causes significant overhead.
Fortunately, there is a way around this: GIN's "pending list". When a row is inserted, it is not added straight to the main index; instead, it goes onto a "to do" list that VACUUM works through later. So after an insert, the index is not yet in its final state. Why does that matter? To scan the index correctly, PostgreSQL must scan the tree and, in addition, read every item in the pending list sequentially. In other words, a large pending list hurts performance. It can therefore pay off to vacuum a table used for full-text search more often than usual, since VACUUM processes every entry in the pending list.
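The pending-list behavior can also be tuned per index. The following is a sketch only; the limit value is an arbitrary illustration, not a recommendation, and should be chosen for your own workload:

-- disable the pending list entirely, so every INSERT updates the main GIN structure
ALTER INDEX idx_fti SET (fastupdate = off);

-- or keep fastupdate on, but lower the size at which the pending list is flushed
-- (gin_pending_list_limit is measured in kilobytes; the default is 4MB)
ALTER INDEX idx_fti SET (gin_pending_list_limit = 512);

Turning fastupdate off makes inserts slower but keeps search times predictable; a smaller gin_pending_list_limit is a middle ground between the two.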
Measuring the performance impact of VACUUM
Install pgstattuple to see what is happening in the background:
CREATE EXTENSION pgstattuple;
You can examine the internals of the index using pgstattuple:
test=# SELECT * FROM pgstatginindex('idx_fti');
 version | pending_pages | pending_tuples
---------+---------------+----------------
       2 |             0 |              0
(1 row)
The pending list is empty in this instance, and the index is also quite small:
test=# SELECT pg_relation_size('idx_fti');
 pg_relation_size
------------------
           188416
(1 row)
Remember: even with 2 million entries, the index is still practically nothing compared to the size of the table:
test=# SELECT pg_relation_size('t_fti');
 pg_relation_size
------------------
        154329088
(1 row)
Let's run a quick query looking for a word that does not exist; it completes in less than a millisecond:
test=# explain (analyze, buffers) SELECT * FROM t_fti
       WHERE payload @@ to_tsquery('whatever');
                                  QUERY PLAN
--------------------------------------------------------------------
 Bitmap Heap Scan on t_fti  (cost=20.77..294.37 rows=67 width=45) (actual time=0.030..0.030 rows=0 loops=1)
   Recheck Cond: (payload @@ to_tsquery('whatever'::text))
   Buffers: shared hit=5
   ->  Bitmap Index Scan on idx_fti  (cost=0.00..20.75 rows=67 width=0) (actual time=0.028..0.028 rows=0 loops=1)
         Index Cond: (payload @@ to_tsquery('whatever'::text))
         Buffers: shared hit=5
 Planning time: 0.148 ms
 Execution time: 0.066 ms
(8 rows)
I would also like to draw your attention to "shared hit=5": the query needed only five blocks of data to run. This is really good, because the query would still return within a reasonable time even if those blocks had to be read from disk.
Since autovacuum is turned off, there are no hidden operations going on, so let's add some more data:
test=# INSERT INTO t_fti
         SELECT to_tsvector('english', md5('dummy' || id))
         FROM generate_series(2000001, 3000000) AS id;
INSERT 0 1000000
The same query is now much slower despite previously performing very well:
test=# explain (analyze, buffers) SELECT * FROM t_fti
       WHERE payload @@ to_tsquery('whatever');
                                  QUERY PLAN
--------------------------------------------------------------------
 Bitmap Heap Scan on t_fti  (cost=1329.02..1737.43 rows=100 width=45) (actual time=9.377..9.377 rows=0 loops=1)
   Recheck Cond: (payload @@ to_tsquery('whatever'::text))
   Buffers: shared hit=331
   ->  Bitmap Index Scan on idx_fti  (cost=0.00..1329.00 rows=100 width=0) (actual time=9.374..9.374 rows=0 loops=1)
         Index Cond: (payload @@ to_tsquery('whatever'::text))
         Buffers: shared hit=331
 Planning time: 0.194 ms
 Execution time: 9.420 ms
(8 rows)
The query now takes more than 9 milliseconds to execute. The cause is the large number of tuples sitting in the pending list: the query had to access 331 pages, a lot more than before. The GIN pending list reveals the underlying issue:
test=# SELECT * FROM pgstatginindex('idx_fti');
 version | pending_pages | pending_tuples
---------+---------------+----------------
       2 |           326 |          50141
(1 row)
The pending list immediately explains the additional data pages: 5 pages + 326 pages = 331 pages.
Running VACUUM to speed up Full-Text-Search (FTS) in PostgreSQL
It is easy to add those pending entries to the real index; we just run VACUUM ANALYZE once more.
test=# VACUUM ANALYZE;
VACUUM
The pending list is now empty, as you can see:
test=# SELECT * FROM pgstatginindex('idx_fti');
 version | pending_pages | pending_tuples
---------+---------------+----------------
       2 |             0 |              0
(1 row)
What matters most is that the query is now significantly faster again, because far fewer blocks have to be read:
test=# explain (analyze, buffers) SELECT * FROM t_fti
       WHERE payload @@ to_tsquery('whatever');
                                  QUERY PLAN
-----------------------------------------------------------------
 Bitmap Heap Scan on t_fti  (cost=25.03 rows=100 width=45) (actual time=0.03 rows=0 loops=1)
   Recheck Cond: (payload @@ to_tsquery('whatever'::text))
   Buffers: shared hit=5
   ->  Bitmap Index Scan on idx_fti  (cost=0.00..25.00 rows=100 width=0) (actual time=0.030..0.030 rows=0 loops=1)
         Index Cond: (payload @@ to_tsquery('whatever'::text))
         Buffers: shared hit=5
 Planning time: 0.240 ms
 Execution time: 0.075 ms
(8 rows)
In my opinion, these examples demonstrate quite clearly that VACUUM has a significant impact on the performance of your full-text indexing. Of course, this only matters if a sizable portion of your data is changed frequently.
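If that describes your workload, two options are worth considering. The following is a sketch only, using the demo table from above; the scale factor is an illustrative value rather than a recommendation, and gin_clean_pending_list() requires PostgreSQL 9.6 or later:

-- re-enable autovacuum for the demo table and make it run more aggressively
ALTER TABLE t_fti SET (
    autovacuum_enabled = on,
    autovacuum_vacuum_scale_factor = 0.01  -- vacuum after roughly 1% of rows change
);

-- or flush only the GIN pending list into the main index, without a full VACUUM
SELECT gin_clean_pending_list('idx_fti');

The first option lets autovacuum keep the pending list short automatically; the second is handy when you want to move pending entries into the main index at a moment of your choosing.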
About Enteros
Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.