Preamble
When an application is in production, its data structures occasionally need to change: columns are added, removed, or modified. Such changes should not be made carelessly, though; there are important considerations to keep in mind.
The main problem with DDLs is that they sometimes hold locks for a very long time, which can have serious repercussions if PostgreSQL is running critical production systems. I hope this blog helps readers manage DDLs more carefully.
Creating some demo data
To show how things work, some typical demo data is needed. The script below simply generates 10 million integer values:
test=# CREATE TABLE data (id int);
CREATE TABLE
test=# INSERT INTO data SELECT * FROM generate_series(1, 10000000);
INSERT 0 10000000
Be aware of locking: Indexing
Index creation is one of the most common sources of trouble with DDLs. When you create an index in PostgreSQL, writes are normally blocked while the index is being built. This can be very harmful when dealing with very large tables: a large index simply takes time to build, and concurrent writes suffer, as illustrated in the following example:
| Connection 1 | Connection 2 |
|---|---|
| BEGIN; | |
| CREATE UNIQUE INDEX idx_id ON data (id); | BEGIN; |
| -- running | INSERT INTO data VALUES (0); |
| -- running | -- waiting |
| COMMIT; | -- will proceed |
| | COMMIT; |
If the table is very large, the INSERT may have to wait an inordinate amount of time. This is especially problematic when there are many concurrent requests: your web server may experience connection problems or simply run out of connections. In other words, a fix is required.
Note that PostgreSQL only blocks writes while the index is being built; reads are still possible. This is crucial to keep in mind.
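If you are unsure whether a DDL is the culprit, you can check which sessions are blocked directly from SQL. The following is a small diagnostic sketch (not part of the demo) using pg_stat_activity and the pg_blocking_pids() function available since PostgreSQL 9.6:

-- list sessions that are currently blocked, together with the PIDs blocking them
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;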
CREATE INDEX CONCURRENTLY is the best option if you cannot afford to lock a table while creating an index. It takes longer than a normal CREATE INDEX, but it allows concurrent writes and avoids locking the table heavily. Note that CREATE INDEX CONCURRENTLY is not guaranteed to succeed; if it fails, you may end up with an invalid index that you have to drop by hand. However, if you are indexing a table of 1 TB or more while constantly under load, which is often the case, there is no way around concurrent indexing.
Here’s an illustration. As you can see, there are no conflicts and nobody has to wait an unreasonable amount of time:
| Connection 1 | Connection 2 |
|---|---|
| CREATE UNIQUE INDEX CONCURRENTLY idx_id2 ON data (id); | |
| -- running | INSERT INTO data VALUES (-1); |
| -- running | INSERT INTO data VALUES (-2); |
| -- done | INSERT INTO data VALUES (-3); |
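As mentioned above, a failed CREATE INDEX CONCURRENTLY can leave an invalid index behind. A minimal cleanup sketch, reusing the index name from the example above:

-- find indexes left invalid by a failed concurrent build
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;

-- drop the invalid index and retry the concurrent build
DROP INDEX CONCURRENTLY idx_id2;
CREATE UNIQUE INDEX CONCURRENTLY idx_id2 ON data (id);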
Adding columns to tables in PostgreSQL
Adding a new column to a table is probably the most frequent schema change in PostgreSQL. The crucial point is that there are actually two distinct cases:
- A column being added without a default value
- A column with a default value being added
Let's see what those two cases have in store for us:
| Connection 1 | Connection 2 |
|---|---|
| BEGIN; | |
| ALTER TABLE data ADD COLUMN x int; | |
| Time: 1.241 ms | INSERT INTO data VALUES (-4); |
| COMMIT; | -- waits a VERY short time |
| BEGIN; | |
| ALTER TABLE data ADD COLUMN y int DEFAULT 0; | |
| -- takes time | INSERT INTO data VALUES (-5); |
| Time: 11888.54 ms | -- has to wait for a while |
| COMMIT; | -- finally proceeds |
As you can see, without a default value the ALTER TABLE … ADD COLUMN command completes in essentially 1 millisecond, which is perfectly acceptable. The concurrent write does not have to wait long; everything is fine as long as the DDL can briefly obtain the table lock it requires. The reason is that adding a column without a default value does not require rewriting the data files in PostgreSQL; only the metadata in the catalog is changed. Be aware that not all other database engines behave this way, especially some pricey commercial ones.
If you require a default value, the story is slightly different: in this case the value has to be written to disk. We generally advise avoiding operations of this kind, because rewriting large tables is not very appealing.
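If a rewrite is a concern on your PostgreSQL version (recent releases can add a column with a constant default without rewriting the table), a common workaround is to split the work: add the column without a default, set the default for future rows only, and backfill existing rows in small batches. A rough sketch, using a hypothetical column z and batch boundaries chosen purely for illustration:

ALTER TABLE data ADD COLUMN z int;              -- instant: catalog change only
ALTER TABLE data ALTER COLUMN z SET DEFAULT 0;  -- affects future rows only

-- backfill existing rows in chunks to keep each transaction (and its locks) short
UPDATE data SET z = 0 WHERE id BETWEEN 1 AND 1000000 AND z IS NULL;
UPDATE data SET z = 0 WHERE id BETWEEN 1000001 AND 2000000 AND z IS NULL;
-- ... and so on for the remaining ranges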
Making use of tablespaces
The use of tablespaces is a further crucial factor to take into account. Tablespaces are generally a nice way to scale I/O, but moving tables between them does not come for free. Remember that ALTER TABLE … SET TABLESPACE locks the table and blocks writes.
As an illustration:
| Connection 1 | Connection 2 |
|---|---|
| BEGIN; | |
| ALTER TABLE data SET TABLESPACE some_storage; | |
| -- usually time consuming | INSERT INTO data VALUES (-6); |
| -- usually time consuming | -- waiting for a long time |
| COMMIT; | -- finally done |
The good news is that there is a feasible workaround. The tool pg_squeeze is designed to shrink a bloated table, but it is capable of much more, including being “abused” to move a table from one tablespace to another. With pg_squeeze you can elegantly avoid heavy locking in addition to reducing your storage footprint. Check it out.
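As a rough sketch of the idea, and assuming the pg_squeeze extension is installed: its squeeze.squeeze_table() function rebuilds a table while holding only short locks. The parameter list below follows the signature of older pg_squeeze releases and may differ in your version, so treat it as an assumption and check the extension's documentation; the tablespace name some_storage is taken from the example above.

CREATE EXTENSION pg_squeeze;

-- rebuild the table and place the result in another tablespace, holding only short locks
-- (assumed parameters: schema, table, clustering index, table tablespace, index tablespaces)
SELECT squeeze.squeeze_table('public', 'data', NULL, 'some_storage', NULL);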
Adding foreign keys
Foreign keys and constraints are another important factor. If you want to add a foreign key or a CHECK constraint, PostgreSQL will check the table's existing content to ensure the key is valid. Once again, those checks lock the table:
test=# ALTER TABLE data ADD CONSTRAINT constname FOREIGN KEY (id) REFERENCES data(id);
It is not possible to add such keys ex post without locking the table. It is therefore crucial to plan ahead and think about what you are doing.
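One common way to keep this lock short is to add the constraint as NOT VALID, which skips the full table scan, and validate it in a second step. VALIDATE CONSTRAINT takes a much weaker lock and allows concurrent writes while it checks the existing rows. A sketch using the same constraint as above:

ALTER TABLE data
    ADD CONSTRAINT constname FOREIGN KEY (id) REFERENCES data (id) NOT VALID;

ALTER TABLE data VALIDATE CONSTRAINT constname;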
Dropping columns in PostgreSQL
Let us move on to something more relaxing: dropping a column. DROP COLUMN in PostgreSQL does not rewrite the table; it merely makes the column invisible. VACUUM handles the cleanup asynchronously, so DROP COLUMN is extremely fast. Keep in mind, however, that DROP COLUMN can only start once there are no concurrent reads on the table:
| Connection 1 | Connection 2 |
|---|---|
| BEGIN; | |
| SELECT count(*) FROM data; | |
| -- takes a while | ALTER TABLE data DROP COLUMN y; |
| -- takes a while | -- has to wait |
| -- takes a while | -- has to wait |
| COMMIT; | -- finally proceeds |
Otherwise, DROP COLUMN returns immediately.
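Since any of the DDLs shown in this blog can end up queuing behind long-running transactions, and thereby block every statement that arrives after them, it can be worth protecting them with a lock timeout. A small sketch; the timeout value is arbitrary:

-- give up quickly if the lock cannot be obtained instead of queuing indefinitely
SET lock_timeout = '2s';
ALTER TABLE data DROP COLUMN y;
-- if this fails with a lock timeout, simply retry later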
There is always more…
There is always more to say. However, the examples provided in this blog cover some of the most common problems people encounter. Please feel free to add your thoughts and provide feedback.
About Enteros
Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.
Are you interested in writing for Enteros’ Blog? Please send us a pitch!