Preamble
It’s time to examine one of the most significant new features of the upcoming PostgreSQL 11 release: the ability to create indexes in parallel.
Various commercial database vendors have offered multi-core index creation for many years, and we are pleased that PostgreSQL has now joined this exclusive club. Parallel index creation will significantly enhance the usability of large PostgreSQL deployments in the future.
Creating large tables in PostgreSQL
PostgreSQL supports traditional “stored procedures” as of version 11. A procedure can run multiple transactions, which is fantastic if you want to generate a large amount of random data. When you call generate_series to create one million rows, PostgreSQL has to keep this data in memory. Therefore, generating hundreds of millions of random rows in multiple transactions can be really helpful to reduce the memory footprint. The process is as follows:
CREATE TABLE t_demo (data numeric);

CREATE OR REPLACE PROCEDURE insert_data(buckets integer)
LANGUAGE plpgsql
AS $$
   DECLARE
      i int;
   BEGIN
      i := 0;
      WHILE i < buckets
      LOOP
         INSERT INTO t_demo SELECT random()
            FROM generate_series(1, 1000000);
         i := i + 1;
         RAISE NOTICE 'inserted % buckets', i;
         COMMIT;
      END LOOP;
      RETURN;
   END;
$$;

CALL insert_data(500);
This tiny piece of code loads 500 million randomly generated numbers, which should be sufficient to show how CREATE INDEX has been enhanced in PostgreSQL 11. In our example, those 500 million rows amount to about 21 GB of data:
test=# \d+
                    List of relations
 Schema |  Name  | Type  | Owner | Size  | Description
--------+--------+-------+-------+-------+-------------
 public | t_demo | table | hs    | 21 GB |
(1 row)
I chose numeric because it has the highest overhead of all the number data types. Creating an index on a numeric column is a lot more costly than indexing, say, int4 or int8. The goal is to see how much CPU time can be saved by building a large index on a fairly expensive column.
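To get a feeling for that overhead, a quick (and admittedly unscientific) check is to compare the on-disk size of individual values. The query below is an illustration added here, not part of the original benchmark:

-- numeric values take more space per value than fixed-width integers
SELECT pg_column_size(random()::numeric) AS numeric_bytes,
       pg_column_size(random()::int8)    AS int8_bytes,
       pg_column_size(random()::int4)    AS int4_bytes;

On top of the larger storage footprint, comparing numeric values requires variable-length arithmetic, which is what really drives up the CPU cost during sorting.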
CREATE INDEX: Using just 1 CPU core
Parallel index creation is enabled by default in PostgreSQL 11. The parameter controlling this behavior is max_parallel_maintenance_workers, which can be set in postgresql.conf:
test=# SHOW max_parallel_maintenance_workers;
 max_parallel_maintenance_workers
----------------------------------
 2
(1 row)
The default setting instructs PostgreSQL to launch two workers to assist with index creation if the table is big enough. To contrast the new machinery with the “traditional” way of building an index, I’ve set max_parallel_maintenance_workers to 0. With this setting, no multicore indexing is available.
test=# SET max_parallel_maintenance_workers TO 0;
SET
Indexing will therefore take an eternity. The CREATE INDEX statement will use a significant amount of CPU and I/O. To make things worse, I left all memory parameters at their default settings. As a result, the index creation can only use 64 MB of memory, which is not much given the size of the table.
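You can verify this in your own session; on an unmodified PostgreSQL 11 configuration the output looks like this:

test=# SHOW maintenance_work_mem;
 maintenance_work_mem
----------------------
 64MB
(1 row)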
Here are the results on my “Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz”:
test=# CREATE INDEX idx1 ON t_demo (data);
CREATE INDEX
Time: 1031650.658 ms (17:11.651)
17 minutes is not too bad. Keep in mind that we are talking about 500 million rows of a really expensive data type.
Using more than just one core
Let us run the same type of indexing on two cores:
test=# SET max_parallel_maintenance_workers TO 2;
SET

test=# CREATE INDEX idx2 ON t_demo (data);
CREATE INDEX
Time: 660672.867 ms (11:00.673)
Wow, we are down to 11 minutes. Naturally, the speedup is not perfectly linear, because the partial results have to be merged at the end and so on. There is a catch, though: we set max_parallel_maintenance_workers to 2, and 2 workers is what we got, right? So what happens if we set the value to 4? Since there are 4 physical cores in my machine, using any higher value is pointless. You’ll notice that PostgreSQL still only utilizes two workers.
How can that be changed? The next listing contains the solution: ALTER TABLE … SET … allows us to lift this restriction and use more workers:
test=# ALTER TABLE t_demo SET (parallel_workers = 4);
ALTER TABLE

test=# SET max_parallel_maintenance_workers TO 4;
SET
In this case, both the table’s parallel_workers setting and max_parallel_maintenance_workers are set to 4. PostgreSQL will now employ five processes. Why is that? What you will see is one leader process and four worker processes that help with index creation. Even though it may not seem entirely obvious at first, it makes sense when you think about it.
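If you want to see those processes for yourself, one way (a sketch added here, assuming a second psql session is open while the index is being built) is to query pg_stat_activity:

-- run from a second session while CREATE INDEX is executing
SELECT pid, backend_type, query
FROM pg_stat_activity
WHERE backend_type = 'parallel worker'
   OR query LIKE 'CREATE INDEX%';

The leader shows up as a normal client backend, while the helpers are listed with backend_type = 'parallel worker'.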
Naturally, we cannot keep adding workers indefinitely and expect performance to increase linearly. At this point, we will also begin to experience performance limitations with our (single) SSD, and the two-fold increase will no longer be visible.
test=# CREATE INDEX idx3 ON t_demo (data);
CREATE INDEX
Time: 534775.040 ms (08:54.775)
We will see wild swings in our I/O curve, because all processes hammer the disk at essentially the same time, which makes the overall process somewhat slower and nonlinear. Still, we were able to cut the time it takes to build the index from 17 minutes to just under 9 minutes simply by adding more cores.
Using more memory for CREATE INDEX
The number of CPU cores is not the only constraint on index creation. Memory matters just as much. By default, maintenance_work_mem is set to a really low value (64 MB), which greatly limits the amount of data that can be sorted in memory. Therefore, the next logical step is to raise this parameter before creating the next index.
test=# SET maintenance_work_mem TO '4 GB';
SET
In my case, I decided to pump the value up to 4 GB. My server has 32 GB of memory, and we have to keep in mind that we might not be the only ones creating an index, so 4 GB x 5 processes might already be a really aggressive value in a real-world scenario.
This is exactly what happens as we create the index: during the initial stages we see a lot more parallelism, while towards the end CPU usage is quite low and PostgreSQL is mostly waiting for the disk to complete its work. Because the entire system is still configured with default values, writes have not yet been optimized and will be a problem.
However, we will still see a nice improvement.
test=# CREATE INDEX idx4 ON t_demo (data);
CREATE INDEX
Time: 448498.535 ms (07:28.499)
7 minutes and 28 seconds. That is already very nice. But let us see if we can do even better. What we have seen so far is that checkpoints and I/O have started to become a limiting factor. Therefore, we will try to improve on that by telling PostgreSQL to use larger checkpoint distances. I’ve chosen to modify postgresql.conf in this example to the following values:
max_wal_size = 50GB
min_wal_size = 80MB
checkpoint_timeout = 120min
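As a side note: if you prefer not to edit the file by hand, the same values can also be written with ALTER SYSTEM, which stores them in postgresql.auto.conf. This is an alternative shown here for convenience, not the method used in the test:

test=# ALTER SYSTEM SET max_wal_size = '50GB';
ALTER SYSTEM
test=# ALTER SYSTEM SET min_wal_size = '80MB';
ALTER SYSTEM
test=# ALTER SYSTEM SET checkpoint_timeout = '120min';
ALTER SYSTEM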
Those settings can be activated quickly by reloading the configuration file:
test=# SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)
Then a new index can be built using these larger checkpoint distances.
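The statement itself is unchanged from before; the index name idx5 is an assumption made here to keep the numbering consistent with idx4 and idx6 (the timing of this particular run is not reproduced):

-- index name idx5 is assumed, not taken from the original test
CREATE INDEX idx5 ON t_demo (data);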
When building the index, PostgreSQL spent a significant amount of time writing WAL to disk, as the process table shows. As long as everything runs on a single SSD, there isn’t much more we can do about it. What will happen, though, if we play our next trump card: additional hardware? What if we stored the temporary data on one SSD, built the index on a second one, and sent the WAL to the main disk? That way the required I/O would be spread across several devices, and we could monitor the results.
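To put a number on that WAL traffic, one rough approach (an illustration added here, not a measurement from the original test) is to compare WAL positions before and after the build:

-- note the current WAL position before CREATE INDEX
SELECT pg_current_wal_lsn();

-- ... build the index ...

-- then compute the amount of WAL written in between;
-- '16/B374D848' is a placeholder for the LSN noted above
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '16/B374D848') / 1024 / 1024 AS wal_mb;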
Using tablespaces in PostgreSQL to speed up indexing
As already mentioned, adding more hardware by using tablespaces might be a good idea. I’m well aware that this might be impossible in a contemporary cloud environment. However, my test server still offers this luxury: a couple of real, physical SSD drives.
Let’s try them out by creating two tablespaces that can hold the data. I’ll then instruct PostgreSQL to store the new index in one of those tablespaces and use the other one for sorting:
test=# CREATE TABLESPACE indexspace LOCATION '/ssd1/tabspace1';
CREATE TABLESPACE

test=# CREATE TABLESPACE sortspace LOCATION '/ssd2/tabspace2';
CREATE TABLESPACE
Then PostgreSQL can be told where to store temporary data:
test=# SET temp_tablespaces TO sortspace;
SET
In the next step, the index creation can start:
test=# CREATE INDEX idx6 ON t_demo (data) TABLESPACE indexspace;
CREATE INDEX
Time: 408508.976 ms (06:48.509)
Because multiple SSDs can work simultaneously, throughput during index creation peaks at higher values than before: instead of topping out at 500 MB/sec, it occasionally climbs to as much as 900 MB/sec. The overall speed has improved as well. It’s really nice that we are already under 7 minutes.
If you add more hardware to the box, it might be worthwhile to consider building one filesystem that stripes across all of the disks at once. I did not have time to test this option, but I believe it might produce results comparable to or even superior to the ones above.
TIP: Don’t undervalue the significance of the chosen data type. If we did the same test using normal integer values, we could create the index in 3 minutes and 51 seconds. In other words, the data type is of significant importance.
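For reference, the integer variant of the benchmark would look roughly like this (the table and index names are assumptions made here, not taken from the original test):

-- same volume as before, but with int4 instead of numeric;
-- loading can reuse the batched-procedure pattern from insert_data above
CREATE TABLE t_demo_int (data int4);

INSERT INTO t_demo_int
   SELECT (random() * 1000000000)::int4
   FROM generate_series(1, 1000000);   -- one batch; repeat as needed

CREATE INDEX idx_int ON t_demo_int (data);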
You’ve seen in this post that index creation can be sped up considerably. But remember that new indexes aren’t always beneficial; pointless indexes can even slow things down. To determine which indexes may not be required, consider reading a post by Laurenz Albe, who describes how to tackle this kind of problem.
About Enteros
Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.