Preamble
Sorting is a key component of PostgreSQL performance tuning, yet it is frequently misunderstood or simply ignored. This post shows how sorting can be tuned in PostgreSQL.
Creating sample data
To demonstrate how sorting works, I first created a few million rows of sample data:
test=# CREATE TABLE t_test (x numeric);
CREATE TABLE
test=# INSERT INTO t_test SELECT random() FROM generate_series(1, 5000000);
INSERT 0 5000000
test=# ANALYZE;
ANALYZE
The code creates a table and loads 5 million random values into it. Loading the data takes only a few seconds.
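If you are curious how much space the table occupies on disk, pg_size_pretty and pg_relation_size can tell you; the exact figure will vary slightly from run to run, so take this as a quick sanity check rather than part of the benchmark:

test=# SELECT pg_size_pretty(pg_relation_size('t_test'));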
Sorting data in PostgreSQL
Let’s try to sort the data. To keep things simple, I am using the most straightforward statement possible. As the plan shows, PostgreSQL cannot fit the data it has to sort into memory, so it must sort on disk. In this case a little more than 100 MB of data is written to disk:
test=# explain analyze SELECT * FROM t_test ORDER BY x;
                              QUERY PLAN
--------------------------------------------------------------------------
 Sort  (cost=804270.42..816770.42 rows=5000000 width=11) (actual time=4503.484..6475.222 rows=5000000 loops=1)
   Sort Key: x
   Sort Method: external merge  Disk: 102896kB
   ->  Seq Scan on t_test  (cost=0.00..77028.00 rows=5000000 width=11) (actual time=0.035..282.427 rows=5000000 loops=1)
 Planning time: 0.139 ms
 Execution time: 6606.637 ms
(6 rows)
Why doesn’t PostgreSQL perform an in-memory sort of all the data? This is due to the work_mem parameter, which has a default value of 4 MB:
test=# SHOW work_mem;
 work_mem
----------
 4MB
(1 row)
work_mem tells the server that each operation (every sort, grouping operation, and so on) may use at most 4 MB of memory. To sort a large amount of data, PostgreSQL therefore has to spill much of it to disk, which obviously takes time.
Thankfully, changing work_mem is simple and can even be done at the session level.
Speeding up sorts in PostgreSQL – using more work_mem
Let’s change work_mem for the duration of the current session and see what happens to the previous example:
test=# SET work_mem TO '1 GB';
SET
The easiest way to change work_mem on the fly is SET. In this case the parameter is set to 1 GB, so PostgreSQL now has enough RAM to do things in memory:
test=# explain analyze SELECT * FROM t_test ORDER BY x;
                              QUERY PLAN
---------------------------------------------------------------------------
 Sort  (cost=633365.42..645865.42 rows=5000000 width=11) (actual time=1794.953..2529.398 rows=5000000 loops=1)
   Sort Key: x
   Sort Method: quicksort  Memory: 430984kB
   ->  Seq Scan on t_test  (cost=0.00..77028.00 rows=5000000 width=11) (actual time=0.075..296.167 rows=5000000 loops=1)
 Planning time: 0.067 ms
 Execution time: 2686.635 ms
(6 rows)
The impact on performance is impressive: the runtime drops by roughly 60%, from 6.6 seconds to about 2.7 seconds. As you can see, the sort method changes from “external merge Disk” to “quicksort”. If you want to speed up and tune sorting in PostgreSQL, there is no way around work_mem; it is the single most important knob you have. The nice thing is that work_mem is not only used for sorting: aggregations and other operations benefit from it as well.
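As a rough sketch of that point (the exact plans depend on your PostgreSQL version and statistics), a grouping query on the same table is also governed by work_mem: with the default 4 MB the planner typically falls back to a disk-based sort feeding a GroupAggregate, while with more memory it can build the hash table for an in-memory HashAggregate:

test=# SET work_mem TO '4MB';
SET
test=# EXPLAIN ANALYZE SELECT x, count(*) FROM t_test GROUP BY x;
-- typically: external merge sort on disk feeding a GroupAggregate

test=# SET work_mem TO '1 GB';
SET
test=# EXPLAIN ANALYZE SELECT x, count(*) FROM t_test GROUP BY x;
-- with enough work_mem the planner can switch to an in-memory HashAggregate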
Taking care of partial sorts
There are three different types of sort algorithms in PostgreSQL as of version 10:
- external merge sort (on disk)
- quicksort
- top-N heapsort
If you only need a handful of sorted rows, for instance the top 10 or the bottom 10, PostgreSQL can use a “top-N heapsort”. This efficient method returns the desired data almost instantly:
test=# explain analyze SELECT * FROM t_test ORDER BY x LIMIT 10;
                                 QUERY PLAN
----------------------------------------------------------------------------------
 Limit  (cost=185076.20..185076.23 rows=10 width=11) (actual time=896.739..896.740 rows=10 loops=1)
   ->  Sort  (cost=185076.20..197576.20 rows=5000000 width=11) (actual time=896.737..896.738 rows=10 loops=1)
         Sort Key: x
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Seq Scan on t_test  (cost=0.00..77028.00 rows=5000000 width=11) (actual time=1.154..282.408 rows=5000000 loops=1)
 Planning time: 0.087 ms
 Execution time: 896.768 ms
(7 rows)
Wow, the query returns in less than one second.
Improving sorting: Consider indexing …
work_mem is a good way to speed up sorts, but in many cases it makes even more sense to avoid sorting in the first place. Indexes provide the database engine with “pre-sorted input”; in fact, a btree is fairly similar to a sorted list.
Creating an index (a btree) also requires sorting. When the CREATE INDEX command was first introduced, PostgreSQL used work_mem to decide how much memory an index build could use. This is no longer the case: in current versions, the maintenance_work_mem parameter tells DDL commands how much memory they may use.
Here’s an illustration:
test=# \timing
Timing is on.
test=# CREATE INDEX idx_x ON t_test (x);
CREATE INDEX
Time: 4648.530 ms (00:04.649)
The default setting for maintenance_work_mem is 64 MB, but this can of course be changed:
test=# SET maintenance_work_mem TO '1 GB';
SET
Time: 0.469 ms
With more memory, index creation will be much faster:
test=# CREATE INDEX idx_x2 ON t_test (x);
CREATE INDEX
Time: 3083.661 ms (00:03.084)
In this case, CREATE INDEX can use up to 1 GB of RAM to sort the data, which is obviously much faster than going to disk. This is particularly helpful when building large indexes.
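Keep in mind that SET only changes the parameter for the current session. If the higher limit should apply to all sessions, one option (a sketch, and it requires superuser rights) is to persist it with ALTER SYSTEM and reload the configuration:

test=# ALTER SYSTEM SET maintenance_work_mem = '1GB';
ALTER SYSTEM
test=# SELECT pg_reload_conf();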
With the proper indexes in place, queries run much faster. Here’s an example:
test=# explain analyze SELECT * FROM t_test ORDER BY x LIMIT 10;
                               QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=0.43..0.95 rows=10 width=11) (actual time=0.068..0.087 rows=10 loops=1)
   ->  Index Only Scan using idx_x2 on t_test  (cost=0.43..260132.21 rows=5000000 width=11) (actual time=0.066..0.084 rows=10 loops=1)
         Heap Fetches: 10
 Planning time: 0.130 ms
 Execution time: 0.119 ms
(5 rows)
In this example, the query finishes in well under a millisecond. If your database frequently sorts large amounts of data, consider adding better indexes rather than endlessly increasing work_mem.
Sorting in PostgreSQL and tablespaces
Many people nowadays use tablespaces to scale I/O. By default PostgreSQL uses only a single tablespace, which can easily become a bottleneck. Tablespaces allow you to put additional hardware to work for PostgreSQL.
Suppose you frequently sort a lot of data: the temp_tablespaces parameter lets administrators control where temporary files are written to disk. Putting temporary files on a separate tablespace can speed up sorting.
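Here is a minimal sketch of how that could look, assuming a fast drive mounted at /ssd/pg_temp; the directory and the tablespace name fast_temp are placeholders:

-- create a tablespace on the separate drive (run as superuser;
-- the directory must already exist and be owned by the postgres user)
test=# CREATE TABLESPACE fast_temp LOCATION '/ssd/pg_temp';
CREATE TABLESPACE

-- route temporary files (sort spill files, temporary tables) there for this session
test=# SET temp_tablespaces = 'fast_temp';
SET

The same parameter can of course be set in postgresql.conf so that it applies to all sessions.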
About Enteros
Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.