The quickest way to get a lot of data into a PostgreSQL database is bulk loading. Large-scale imports can be done in a variety of ways, and there are numerous ways to scale them as well. This article walks through some of these techniques and explains how fast importing works. The information can be used to improve data warehousing or any other workload that has to ingest a lot of data.
There are a number of things to keep in mind when you want to speed up the bulk loading of large amounts of data into PostgreSQL:
- COPY versus INSERT
- checkpoint tuning
- logged versus unlogged tables
- creating indexes before versus after the import
- enabling and disabling triggers
- optimizing column order and space consumption
Let’s examine these issues more thoroughly.
INSERT versus COPY
The first thing to remember is that COPY usually outperforms plain INSERT. The cause is the high overhead of INSERT. A common question is what kind of overhead that is: what makes COPY so much quicker than INSERT? There are a variety of reasons. Among other things, every INSERT statement has to check for locks, permissions, the existence of the table and its columns, and data types. With COPY, this is done only once, which is much faster. COPY is therefore typically the best option whenever you need to write a lot of data.
I have put together a brief example to demonstrate the kind of performance impact this change has. Let’s make a table and add some test data:
test=# CREATE TABLE t_sample (
           a    varchar(50),
           b    int,
           c    varchar(50),
           d    int
       );
CREATE TABLE
The sample table has four columns and a fairly straightforward design. The next step is to create a script containing 1,000,000 INSERT statements in a single transaction:
BEGIN;
INSERT INTO t_sample VALUES ('abcd', 1, 'abcd', 1);
INSERT INTO t_sample VALUES ('abcd', 1, 'abcd', 1);
INSERT INTO t_sample VALUES ('abcd', 1, 'abcd', 1);
...
COMMIT;
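Such a script does not have to be typed in by hand, of course. Here is a minimal sketch of one way to generate it with psql itself (\t and \pset suppress headers and padding, \o redirects the query output; the file name /tmp/sample.sql is simply the one used in the run below):

\t on
\pset format unaligned
\o /tmp/sample.sql
SELECT 'BEGIN;';
SELECT 'INSERT INTO t_sample VALUES (''abcd'', 1, ''abcd'', 1);'
FROM generate_series(1, 1000000);
SELECT 'COMMIT;';
\o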
The script can then be executed with psql:
iMac:~ hs$ time psql test < /tmp/sample.sql > /dev/null

real    1m20.883s
user    0m11.515s
sys     0m10.070s
This quick test takes about 81 seconds to complete, which is a lot of time. Clearly, individual INSERT statements are not the way to achieve quick imports and efficient bulk loading.
Since COPY is much more efficient, as I've already stated, we'll use the same data but feed it to COPY rather than INSERT:
COPY t_sample FROM stdin;
abcd    1    abcd    1
abcd    1    abcd    1
abcd    1    abcd    1
abcd    1    abcd    1
...
Once more, running the script is simple:
iMac:~ hs$ time psql test < /tmp/bulk.sql > /dev/null

real    0m2.646s
user    0m0.110s
sys     0m0.043s
The time has decreased from 81 to just 2.6 seconds. Wow.
Remember that I ran this test on a fairly old machine, using a completely untuned database. Much more can be achieved on modern hardware and operating systems than on my personal iMac desktop. Loading a million lines or more is not unusual in the real world; of course, these numbers depend on the length of a “record” and so on. Still, it is crucial to understand what is feasible and what isn't.
NOTE: Runtimes can differ, and for many reasons. One of them is certainly the hardware in use here; as we have seen, many SSDs deliver surprisingly erratic response times.
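By the way, COPY does not have to read from stdin. If the data is already sitting in a file, it can be loaded directly; here is a minimal sketch, assuming tab-separated data in a hypothetical file /tmp/t_sample.dat:

-- server-side COPY: the file is read by the PostgreSQL server process
-- (requires superuser or the pg_read_server_files role)
COPY t_sample FROM '/tmp/t_sample.dat';

-- client-side alternative in psql: the file is read by the client
\copy t_sample FROM '/tmp/t_sample.dat'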
Adjusting checkpoints for faster bulk loading
PostgreSQL configuration also affects bulk loading performance. Many configuration parameters influence how well databases work in general and how well they load in particular. However, I explicitly want to focus your attention on checkpoints and I/O performance. If you want to load billions of rows, I/O is king. There are several ways to attack the problem:
- Reduce the amount of data written
- Make tables smaller if at all possible (column order)
- Reduce the amount of WAL that is written
- Write data more quickly
- Use longer checkpoint distances
- Use better I/O scheduling
The following options are crucial:
- max_wal_size: the maximum amount of WAL that may accumulate between checkpoints (a soft limit)
- checkpoint_completion_target: controls how checkpoint writes are spread out over the checkpoint interval
In general, using fairly long checkpoint distances is a really good idea. For bulk-load-intensive workloads, increasing max_wal_size to 100 or 200 GB is entirely reasonable.
Remember that longer distances between checkpoints do not endanger your server; they merely change the way PostgreSQL writes data. Bear in mind, however, that more disk space will be needed and that recovery could take longer in the event of a crash.
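As a minimal sketch, the relevant settings could be adjusted as follows (the values are merely examples for a bulk-load-heavy server, not general recommendations):

-- allow up to 100 GB of WAL between checkpoints (soft limit)
ALTER SYSTEM SET max_wal_size = '100GB';

-- spread checkpoint writes over most of the checkpoint interval
ALTER SYSTEM SET checkpoint_completion_target = 0.9;

-- make the changes take effect without a restart
SELECT pg_reload_conf();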
If you want to learn more about checkpointing, check out our related article on the topic.
CREATE TABLE versus CREATE UNLOGGED TABLE
But what if there were a way to do away with WAL entirely? There is: it is called an “unlogged table.” What is the general idea? We often see the following sequence of events:
- Load large amounts of data into a PostgreSQL staging area
- Run a few aggregations
- Drop the initial import again
This is the ideal situation in which to use the WAL bypass offered by unlogged tables:
test=# DROP TABLE t_sample;
DROP TABLE
test=# CREATE UNLOGGED TABLE t_sample (
           a    varchar(50),
           b    int,
           c    varchar(50),
           d    int
       );
CREATE TABLE
test=# \d t_sample
 Column |         Type          | Collation | Nullable | Default
--------+-----------------------+-----------+----------+---------
 a      | character varying(50) |           |          |
 b      | integer               |           |          |
 c      | character varying(50) |           |          |
 d      | integer               |           |          |
Let’s reload the identical data:
iMac:~ hs$ time psql test < /tmp/sample.sql > /dev/null

real    0m59.296s
user    0m10.597s
sys     0m9.417s

iMac:~ hs$ time psql test < /tmp/bulk.sql > /dev/null

real    0m0.618s
user    0m0.107s
sys     0m0.038s
As you can see, everything is much faster: 81 versus 59 seconds for the INSERT script, and 2.6 versus 0.6 seconds for COPY. The difference is significant.
The reason is that an unlogged table writes no WAL, so the data only has to be written once. However, this comes at a cost:
- An unlogged table behaves exactly like a regular table during a normal shutdown.
- An unlogged table is guaranteed to be empty in the event of a crash.
- Unlogged tables do not have their content replicated.
These limitations mean that unlogged tables are not suitable for storing “normal” data; for staging areas and bulk loading, however, they are perfect.
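Here is a minimal sketch of such a staging workflow (t_staging, t_result, and the file path are made-up names for illustration):

-- unlogged staging table: fast to load, writes no WAL, but empty after a crash
CREATE UNLOGGED TABLE t_staging (LIKE t_sample);
CREATE TABLE t_result (a varchar(50), total bigint);

-- bulk load into the staging area
COPY t_staging FROM '/tmp/import.dat';

-- aggregate into the real (logged) target table
INSERT INTO t_result
SELECT a, sum(b) FROM t_staging GROUP BY a;

-- throw the initial import away again
DROP TABLE t_staging;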
A table can also be switched between logged and unlogged. Contrary to popular belief, this operation is not cheap. Let's take a look at what happens:
test=# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 5/3AC7CCD0
(1 row)

test=# ALTER TABLE t_sample SET LOGGED;
ALTER TABLE
test=# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 5/3F9048A8
(1 row)
I switched the table from UNLOGGED to LOGGED and measured the WAL position before and after. As we can see, a lot of WAL has been written:
test=# SELECT '5/3F9048A8'::pg_lsn - '5/3AC7CCD0'::pg_lsn;
 ?column?
-----------
 80247768
(1 row)

Time: 11.298 ms
Wow, we generated roughly 80 MB of WAL (assuming exactly one COPY into an empty table; the volume increases with every additional import). In the case of COPY + INSERT, the volume is significantly higher.
This leads us to the conclusion that switching a table from LOGGED to UNLOGGED, importing the data, and then switching it back to LOGGED may not be the best idea for efficient bulk loading: once the table is set back to LOGGED, its entire content has to be written to the WAL so that replicas can receive it.
Direct import versus index creation
Indexes are another important factor. To see their impact, let's first create a table that already has indexes and copy the data into it:
test=# SELECT count(*) FROM t_sample;
  count
---------
 1000001
(1 row)

test=# CREATE TABLE t_index (LIKE t_sample);
CREATE TABLE
test=# CREATE INDEX idx_a ON t_index (a);
CREATE INDEX
test=# CREATE INDEX idx_b ON t_index (b);
CREATE INDEX
test=# \timing
Timing is on.
test=# INSERT INTO t_index SELECT * FROM t_sample;
INSERT 0 1000001
Time: 8396.210 ms (00:08.396)
The data is copied over in about 8 seconds. Now let's try the same thing, but create the indexes after loading:
test=# CREATE TABLE t_noindex (LIKE t_sample);
CREATE TABLE
test=# INSERT INTO t_noindex SELECT * FROM t_sample;
INSERT 0 1000001
Time: 4789.017 ms (00:04.789)
test=# SET maintenance_work_mem TO '1 GB';
SET
Time: 13.059 ms
test=# CREATE INDEX idx_aa ON t_noindex (a);
CREATE INDEX
Time: 1151.521 ms (00:01.152)
test=# CREATE INDEX idx_bb ON t_noindex (b);
CREATE INDEX
Time: 1086.972 ms (00:01.087)
The copy process (= the INSERT) is clearly much faster than before, and even with the subsequent index creation the total time is lower. Keep in mind that I am using synthetic data on macOS, which is not particularly efficient; if the test is repeated with significantly more realistic data, the difference is considerably larger.
The bottom line is:
Create indexes after importing data if possible.
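For an existing table, the same idea can be applied by dropping the indexes, importing the data, and recreating the indexes afterwards. A minimal sketch, reusing the names from the example above (the file path is hypothetical):

-- remove the indexes so the import does not have to maintain them
DROP INDEX idx_a;
DROP INDEX idx_b;

-- bulk load without index maintenance
COPY t_index FROM '/tmp/import.dat';

-- more memory speeds up index creation (value is only an example)
SET maintenance_work_mem TO '1GB';
CREATE INDEX idx_a ON t_index (a);
CREATE INDEX idx_b ON t_index (b);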
Enabled triggers vs. disabled triggers
Triggers are another important factor. They could be considered “the natural enemy” of bulk loading performance. Let's look at the following example:
iMac:~ hs$ head -n 20 /tmp/bulk.sql
BEGIN;
CREATE FUNCTION dummy() RETURNS trigger AS
$$
BEGIN
    NEW.b := NEW.b + 1;
    NEW.d := NEW.d + 1;
    RETURN NEW;
END;
$$ LANGUAGE 'plpgsql';
CREATE TRIGGER mytrig
    BEFORE INSERT ON t_sample
    FOR EACH ROW EXECUTE PROCEDURE dummy();
COPY t_sample FROM stdin;
abcd    1    abcd    1
abcd    1    abcd    1
abcd    1    abcd    1
...
Our trigger is very straightforward: all it does is modify two fields of the incoming row. However, the extra function call the trigger adds for every single row really adds up.
In our case, the trigger variant is roughly three times slower. The real difference, however, depends heavily on the complexity of the trigger, the size of a row, and many other factors. There is no way to state that “a trigger slows things down by a factor of X”; one has to observe, case by case, what happens.
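If the trigger logic is not actually needed during the import, the trigger can simply be switched off for the duration of the load. A minimal sketch, using the trigger from the example above (the file path is hypothetical):

-- stop the trigger from firing during the bulk load
ALTER TABLE t_sample DISABLE TRIGGER mytrig;

COPY t_sample FROM '/tmp/import.dat';

-- switch the trigger back on afterwards
ALTER TABLE t_sample ENABLE TRIGGER mytrig;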
Column order optimization for bulk loading
Large data imports into PostgreSQL take more effort than one might first expect. So far we have optimized checkpoints, indexes, triggers, and more. But what about the order of the columns? Let's find out.
Column order does indeed matter in PostgreSQL. “Fixed length” columns should typically be placed at the front; in other words, the table should start with columns such as int8, int4, and timestamptz. Variable-length data types, such as text and varchar, should be at the end of the table. The reason is the CPU alignment of data on disk. Note that this applies to regular heap tables only, not to zheap.
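Here is a minimal sketch of how to see the effect for yourself (the tables and columns are made up for illustration, and only fixed-length integer columns are used so the alignment effect is easy to see; the actual saving depends entirely on the data types involved):

-- poor order: 4-byte and 8-byte columns interleaved, which causes alignment padding
CREATE TABLE t_bad (a int4, b int8, c int4, d int8);

-- better order: the 8-byte columns first, then the 4-byte ones
CREATE TABLE t_good (b int8, d int8, a int4, c int4);

INSERT INTO t_bad SELECT 1, 2, 3, 4 FROM generate_series(1, 1000000);
INSERT INTO t_good SELECT 2, 4, 1, 3 FROM generate_series(1, 1000000);

-- compare the on-disk size of both tables
SELECT pg_size_pretty(pg_relation_size('t_bad'))  AS bad_order,
       pg_size_pretty(pg_relation_size('t_good')) AS good_order;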
Since I/O is one of the main bottlenecks when bulk loading data, reducing the size of a table without changing its content can speed things up considerably. You can find out more in the article dedicated to this topic.
Tools for bulk loading
If what you have seen so far is still not enough, we can recommend some tools to improve bulk loading even further. The following tools are worth a look:
- PGLoader
- pg_bulkload
Both are widely used and well-known tools that can be used safely.
Please feel free to ask any additional questions about these tools in the comment section or by email.
The end…
We also advise looking into our consulting services if you’re interested in learning more about PostgreSQL performance. We assist you in optimizing your database and ensure that your servers are running smoothly.
About Enteros
Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.