Preamble
Creating a large table
The first question is: how do we create a table with numerous columns? The simplest method is to use generate_series to generate the CREATE TABLE statement:
test=# SELECT 'CREATE TABLE t_broad ('
           || string_agg('t_' || x || ' varchar(10) DEFAULT ''a'' ', ', ')
           || ' )'
       FROM generate_series(1, 4) AS x;
                               ?column?
------------------------------------------------------------------------
 CREATE TABLE t_broad ( t_1 varchar(10) DEFAULT 'a' , t_2 varchar(10) DEFAULT 'a' , t_3 varchar(10) DEFAULT 'a' , t_4 varchar(10) DEFAULT 'a' )
(1 row)

test=# \gexec
CREATE TABLE
I’ve only used 4 columns here for simplicity’s sake. Once the command has been produced, the string we just compiled can be executed with \gexec. A highly useful feature of \gexec is that it treats the previous result as SQL input, which is exactly what we need in this case. We are left with a table that has four columns.
But let’s drop this table and build a much wider one instead.
test=# DROP TABLE t_broad;
DROP TABLE
Create an extremely wide table
The following statement produces a table with 1500 columns. Keep in mind that the maximum is 1600 columns per table:
test=# SELECT 'CREATE TABLE t_broad ('
           || string_agg('t_' || x || ' varchar(10) DEFAULT ''a'' ', ', ')
           || ' )'
       FROM generate_series(1, 1500) AS x;
Running the generated statement through \gexec again creates the table. Such a wide table is rarely useful in real life and shouldn’t normally be used to store data: it simply adds too much overhead, and most of the time the data model is poor to begin with.
Let’s fill the table with 1 million rows:
test=# \timing
Timing is on.
test=# INSERT INTO t_broad SELECT 'a' FROM generate_series(1, 1000000);
INSERT 0 1000000
Time: 67457,107 ms (01:07,457)
test=# VACUUM ANALYZE;
VACUUM
Time: 155935,761 ms (02:35,936)
Thanks to the DEFAULT values, the columns do in fact contain something. Finally, I ran VACUUM to make sure all hint bits and related housekeeping data are set.
The size of the newly created table can be determined quickly with the following query; it comes in at roughly 4 GB:
test=# SELECT pg_size_pretty(pg_total_relation_size('t_broad'));
 pg_size_pretty
----------------
 3907 MB
(1 row)
Accessing various columns
PostgreSQL stores data in rows. As you may be aware, data can be stored either row-oriented or column-oriented, and which strategy wins depends on your use case. For OLTP, a row-based approach is typically far more effective.
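As a quick illustration of the row-oriented layout, every row lives as one heap tuple at a physical position on a data page, and its ctid exposes that position. A small sketch against the table built above:

-- Each row is stored as a single heap tuple; ctid shows its physical
-- location as (block number, item number) within the table's data file.
SELECT ctid, t_1, t_2
FROM   t_broad
LIMIT  3;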
Let’s run a simple count(*) and see how long it takes:
test=# SELECT count(*) FROM t_broad;
  count
---------
 1000000
(1 row)

Time: 416,732 ms
The query completes in roughly 400 milliseconds, which is fine. As anticipated, the optimizer goes for a parallel sequential scan:
test=# explain SELECT count(*) FROM t_broad;
                             QUERY PLAN
--------------------------------------------------------------------
 Finalize Aggregate  (cost=506208.55..506208.56 rows=1 width=8)
   ->  Gather  (cost=506208.33..506208.54 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=505208.33..505208.34 rows=1 width=8)
               ->  Parallel Seq Scan on t_broad  (cost=0.00..504166.67 rows=416667 width=0)
 JIT:
   Functions: 4
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(8 rows)
Let’s compare this to a count on the first column. Performance will differ only slightly. The reason is that count(*) merely has to verify that a row exists, whereas count(column) has to check whether a NULL value was supplied to the aggregate. If the value is NULL, it must be ignored:
test=# SELECT count(t_1) FROM t_broad;
  count
---------
 1000000
(1 row)

Time: 432,803 ms
Let’s check out what happens if we access column 100, though. There will be a big difference in how long that takes.
test=# SELECT count(t_100) FROM t_broad;
  count
---------
 1000000
(1 row)

Time: 857,897 ms
Execution times have nearly doubled. If we run a count on column 1000, the performance is considerably worse:
test=# SELECT count(t_1000) FROM t_broad;
  count
---------
 1000000
(1 row)

Time: 8570,238 ms (00:08,570)
Wow, we’ve already slowed down 20 times compared to before. This is not a small difference, so it is important to understand what is going on.
Debunking PostgreSQL performance issues: column order
To see why the issue arises in the first place, consider how PostgreSQL stores data: after the tuple header, which is present in every row, come the columns themselves, in this case a number of varchar columns. We have used varchar here only to demonstrate the idea; the same problem occurs with other data types, but because varchar is internally more complex than, say, integer, the effect is more visible.
How does PostgreSQL access a column? It fetches the row and then dissects the tuple to determine where inside the row the requested column actually is. So if we want to access column #1000, we first have to work out how long the preceding 999 columns really are, and that can be fairly involved. For an integer we simply have to add 4, but for a varchar the operation becomes expensive. Let’s look at how PostgreSQL stores varchar to understand why it is so expensive (a short sizing sketch follows the list below):
- a single bit marking the string as short (up to 127 bytes) or long (more than 127 bytes)
- 7 bits or 31 bits of length information (depending on that first bit)
- “data” + 0 (to terminate the string)
- alignment (to make sure the next column starts at a multiple of the CPU word length)
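To get a feel for this per-value overhead, pg_column_size() can be used to inspect how many bytes a stored value occupies, header included. A minimal sketch against the table created above (exact numbers will vary):

-- pg_column_size() reports the storage size of a value; for variable-length
-- types such as varchar this includes the length header described above,
-- while fixed-width types such as integer need no such header at all.
SELECT pg_column_size(t_1)       AS one_varchar_value,
       pg_column_size(1::int)    AS one_integer_value,
       pg_column_size(t_broad.*) AS whole_row
FROM   t_broad
LIMIT  1;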
Now think about the implications of having to loop over 1000 columns this way: it adds up to significant overhead.
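Because the cost of deforming a tuple depends on a column’s position, it can be useful to check the physical column order, which simply follows the order of the CREATE TABLE statement. A hedged sketch using the system catalog pg_attribute:

-- Physical column order is recorded in pg_attribute.attnum; columns with a
-- higher attnum are more expensive to reach during tuple deforming.
SELECT attnum, attname, format_type(atttypid, atttypmod) AS data_type
FROM   pg_attribute
WHERE  attrelid = 't_broad'::regclass
  AND  attnum > 0
  AND  NOT attisdropped
ORDER  BY attnum
LIMIT  5;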
Finally …
About Enteros
Enteros offers a patented database performance management SaaS platform. It proactively identifies root causes of complex business-impacting database scalability and performance issues across a growing number of clouds, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.