Preamble
Creating a large table
The first question is: how do we create a table with numerous columns? The simplest method is to use generate_series to generate the CREATE TABLE statement:
test=# SELECT 'CREATE TABLE t_broad ('
           || string_agg('t_' || x || ' varchar(10) DEFAULT ''a'' ', ', ')
           || ' )'
       FROM generate_series(1, 4) AS x;
                               ?column?
------------------------------------------------------------------------
 CREATE TABLE t_broad ( t_1 varchar(10) DEFAULT 'a' , t_2 varchar(10) DEFAULT 'a' , t_3 varchar(10) DEFAULT 'a' , t_4 varchar(10) DEFAULT 'a' )
(1 row)

test=# \gexec
CREATE TABLE
I’ve only used 4 columns here for simplicity’s sake. Once the command has been produced, the string we just compiled can be executed with \gexec. A highly useful feature of \gexec is that it treats the previous result as SQL input, which is exactly what we need in this case. We are left with a table that has four columns.
But let’s drop this table and build a much wider one instead.
test=# DROP TABLE t_broad;
DROP TABLE
Create an extremely wide table
The following statement produces a table with 1500 columns. Keep in mind that the maximum is 1600 columns per table:
test=# SELECT 'CREATE TABLE t_broad ('
           || string_agg('t_' || x || ' varchar(10) DEFAULT ''a'' ', ', ')
           || ' )'
       FROM generate_series(1, 1500) AS x;
Running the generated statement through \gexec again creates the table. Such a wide table is rarely useful in real life and shouldn’t normally be used to store data: it simply adds too much overhead, and most of the time the data model is poor to begin with.
Let’s fill the table with 1 million rows:
test=# \timing
Timing is on.
test=# INSERT INTO t_broad SELECT 'a' FROM generate_series(1, 1000000);
INSERT 0 1000000
Time: 67457,107 ms (01:07,457)
test=# VACUUM ANALYZE;
VACUUM
Time: 155935,761 ms (02:35,936)
Thanks to the DEFAULT values, the columns do in fact contain something. Finally, I ran VACUUM to make sure all hint bits and related housekeeping data are set.
The size of the newly created table can be determined quickly with the following query; it comes in at roughly 4 GB:
test=# SELECT pg_size_pretty(pg_total_relation_size('t_broad'));
 pg_size_pretty
----------------
 3907 MB
(1 row)
Accessing various columns
PostgreSQL stores data in rows. As you may be aware, data can be stored either row-oriented or column-oriented, and which strategy wins depends on your use case. For OLTP, a row-based approach is typically far more effective.
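As a quick illustration of the row-oriented layout, every row lives as one heap tuple at a physical position on a data page, and its ctid exposes that position. A small sketch against the table built above:

-- Each row is stored as a single heap tuple; ctid shows its physical
-- location as (block number, item number) within the table's data file.
SELECT ctid, t_1, t_2
FROM   t_broad
LIMIT  3;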
Let’s run a simple count(*) and see how long it takes:
test=# SELECT count(*) FROM t_broad;
  count
---------
 1000000
(1 row)

Time: 416,732 ms
The query completes in roughly 400 milliseconds, which is fine. As anticipated, the optimizer goes for a parallel sequential scan:
test=# explain SELECT count(*) FROM t_broad;
                             QUERY PLAN
--------------------------------------------------------------------
 Finalize Aggregate  (cost=506208.55..506208.56 rows=1 width=8)
   ->  Gather  (cost=506208.33..506208.54 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=505208.33..505208.34 rows=1 width=8)
               ->  Parallel Seq Scan on t_broad  (cost=0.00..504166.67 rows=416667 width=0)
 JIT:
   Functions: 4
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(8 rows)
Let’s compare this to a count on the first column. Performance will differ only slightly. The reason is that count(*) merely has to verify that a row exists, whereas count(column) has to check whether a NULL value was supplied to the aggregate. If the value is NULL, it must be ignored:
test=# SELECT count(t_1) FROM t_broad;
  count
---------
 1000000
(1 row)

Time: 432,803 ms
Let’s check out what happens if we access column 100, though. There will be a big difference in how long that takes.
test=# SELECT count(t_100) FROM t_broad;
  count
---------
 1000000
(1 row)

Time: 857,897 ms
Execution times have nearly doubled. If we run a count on column 1000, the performance is considerably worse:
test=# SELECT count(t_1000) FROM t_broad;
  count
---------
 1000000
(1 row)

Time: 8570,238 ms (00:08,570)
Wow, we’ve already slowed down 20 times compared to before. This is not a small difference, so it is important to understand what is going on.
Debunking PostgreSQL performance issues: column order
To see why the issue arises in the first place, consider how PostgreSQL stores data: after the tuple header, which is present in every row, come the columns themselves, in this case a number of varchar columns. We have used varchar here only to demonstrate the idea; the same problem occurs with other data types, but because varchar is internally more complex than, say, integer, the effect is more visible.
How does PostgreSQL access a column? It fetches the row and then dissects the tuple to determine where inside the row the requested column actually is. So if we want to access column #1000, we first have to work out how long the preceding 999 columns really are, and that can be fairly involved. For an integer we simply have to add 4, but for a varchar the operation becomes expensive. Let’s look at how PostgreSQL stores varchar to understand why it is so expensive (a short sizing sketch follows the list below):
- a single bit marking the string as short (up to 127 bytes) or long (more than 127 bytes)
- 7 bits or 31 bits of length information (depending on that first bit)
- “data” + 0 (to terminate the string)
- alignment (to make sure the next column starts at a multiple of the CPU word length)
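To get a feel for this per-value overhead, pg_column_size() can be used to inspect how many bytes a stored value occupies, header included. A minimal sketch against the table created above (exact numbers will vary):

-- pg_column_size() reports the storage size of a value; for variable-length
-- types such as varchar this includes the length header described above,
-- while fixed-width types such as integer need no such header at all.
SELECT pg_column_size(t_1)       AS one_varchar_value,
       pg_column_size(1::int)    AS one_integer_value,
       pg_column_size(t_broad.*) AS whole_row
FROM   t_broad
LIMIT  1;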
Now think about the implications of having to loop over 1000 columns this way: it adds up to significant overhead.
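Because the cost of deforming a tuple depends on a column’s position, it can be useful to check the physical column order, which simply follows the order of the CREATE TABLE statement. A hedged sketch using the system catalog pg_attribute:

-- Physical column order is recorded in pg_attribute.attnum; columns with a
-- higher attnum are more expensive to reach during tuple deforming.
SELECT attnum, attname, format_type(atttypid, atttypmod) AS data_type
FROM   pg_attribute
WHERE  attrelid = 't_broad'::regclass
  AND  attnum > 0
  AND  NOT attisdropped
ORDER  BY attnum
LIMIT  5;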
Finally …
About Enteros
Enteros offers a patented database performance management SaaS platform. It proactively identifies root causes of complex business-impacting database scalability and performance issues across a growing number of clouds, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.