Preamble
Customers occasionally ask me which option is best for automatically generated primary keys. I’ll discuss the options and offer suggestions in this article.
Every table needs a primary key. In a relational database, it is crucial to be able to identify an individual table row. If you wonder why, search the internet for one of the countless questions asking for help with deleting duplicate entries from a table.
You would be wise to choose a primary key that is both unique and immutable for the lifetime of a table row. That is because primary keys are frequently referenced by foreign key constraints, and changing a primary key that is used elsewhere is difficult and time-consuming.
Occasionally a table has a natural primary key, such as a citizen’s social security number. In most cases, however, no such attribute exists, and you have to generate an artificial primary key. Some people even argue that you should use an artificial key when a natural one exists, but I won’t get into that “holy war.”
Techniques for auto-generated primary keys in PostgreSQL
There are two fundamental methods:
Key generation using a sequence
A sequence is a database object whose sole function is to generate unique numbers. It does this by increasing a built-in counter.
Sequences are designed to work well with multiple concurrent users, and they will never give out the same number twice. Still, fetching values from a sequence in many concurrent SQL statements could become a bottleneck, so there is the CACHE option, which makes the sequence hand out several values at once to database sessions.
Sequences don’t follow the usual transactional rules: when a transaction is rolled back, the sequence counter is not reset. That is not a bug, because it is necessary for good performance, but it means that sequences can leave gaps. If you need a gap-free series of numbers, a sequence is not the right tool; you will have to resort to less efficient and more complicated techniques.
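A quick way to see this behavior for yourself (the sequence name demo_seq is made up for the example):

```sql
CREATE SEQUENCE demo_seq;

BEGIN;
SELECT nextval('demo_seq');   -- returns 1
ROLLBACK;

-- the rollback did not reset the counter, so 1 is gone for good
SELECT nextval('demo_seq');   -- returns 2
```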
To fetch the next value from a sequence, you use the nextval function like this:

```sql
SELECT nextval('sequence_name');
```
For more information on how to manipulate sequences, consult the documentation.
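As a small sketch of such manipulation, continuing with the hypothetical demo_seq from above:

```sql
-- show the last value this session obtained (requires a prior nextval)
SELECT currval('demo_seq');

-- reposition the counter so that the next nextval returns 101
SELECT setval('demo_seq', 100);

-- change sequence options after the fact
ALTER SEQUENCE demo_seq CACHE 20;
```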
Generating UUIDs
UUID stands for “universally unique identifier”: a 128-bit number generated by an algorithm that guarantees uniqueness. Several such algorithms are in wide use. PostgreSQL has several functions that generate UUIDs:
- The uuid-ossp extension offers functions to generate UUIDs. Note that because of the hyphen in the name, you have to quote the name of the extension (CREATE EXTENSION "uuid-ossp";).
- From PostgreSQL v13 on, you can use the core function gen_random_uuid() to generate version-4 (random) UUIDs (see the sketch below).
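A minimal sketch of both options; uuid_generate_v4() is one of the functions provided by uuid-ossp:

```sql
-- with the extension (also works on versions before v13)
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
SELECT uuid_generate_v4();

-- with the core function (PostgreSQL v13 and later)
SELECT gen_random_uuid();
```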
Note that you should always use the PostgreSQL data type uuid for UUIDs. Don’t try to convert them to strings or numeric: you will waste space and lose performance.
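If you want to verify the space difference yourself, pg_column_size shows it directly (the UUID literal is arbitrary):

```sql
SELECT pg_column_size('a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11'::uuid) AS uuid_bytes,
       pg_column_size('a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11'::text) AS text_bytes;
-- 16 bytes as uuid versus 37 bytes as a string
```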
Defining auto-generated primary keys
A column with automatically generated values can be defined in one of four ways.
Using the DEFAULT clause

You can use this method with sequences and UUIDs. Here are some examples:
```sql
CREATE TABLE has_integer_pkey (
   id bigint DEFAULT nextval('integer_id_seq') PRIMARY KEY,
   ...
);

CREATE TABLE has_uuid_pkey (
   id uuid DEFAULT gen_random_uuid() PRIMARY KEY,
   ...
);
```
PostgreSQL uses the DEFAULT value whenever the INSERT statement doesn’t explicitly specify that column.
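To illustrate, here is a hypothetical complete table (the name items and its columns are made up) that shows the DEFAULT kicking in:

```sql
CREATE TABLE items (
   id   uuid DEFAULT gen_random_uuid() PRIMARY KEY,
   name text NOT NULL
);

-- "id" is not mentioned, so PostgreSQL fills it from the DEFAULT expression
INSERT INTO items (name) VALUES ('widget') RETURNING id;
```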
Using the serial and bigserial pseudo-types

This method is a shortcut for defining a sequence and setting a DEFAULT clause as above. With this technique, a table is defined as follows:
```sql
CREATE TABLE uses_serial (
   id bigserial PRIMARY KEY,
   ...
);
```
That amounts to the following:
```sql
CREATE TABLE uses_serial (
   id bigint PRIMARY KEY,
   ...
);

CREATE SEQUENCE uses_serial_id_seq OWNED BY uses_serial.id;

ALTER TABLE uses_serial
   ALTER id SET DEFAULT nextval('uses_serial_id_seq');
```
The “OWNED BY” clause adds a dependency between the column and the sequence, so that dropping the column automatically drops the sequence.
Using serial will create an integer column, while bigserial will create a bigint column.
Using identity columns
This is another way to use a sequence, because PostgreSQL creates identity columns with a sequence “behind the scenes.”
```sql
CREATE TABLE uses_identity (
   id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
   ...
);
```
There is also GENERATED BY DEFAULT AS IDENTITY, which is the same, except that you won’t get an error message if you try to explicitly insert a value for the column (much like with a DEFAULT clause). See below for more!
You can specify sequence options for identity columns:
```sql
CREATE TABLE uses_identity (
   id bigint GENERATED ALWAYS AS IDENTITY
      (MINVALUE 0 START WITH 0 CACHE 20) PRIMARY KEY,
   ...
);
```
Using BEFORE INSERT triggers

This is similar to DEFAULT values, but it allows you to unconditionally override a value inserted by the user with a generated value. The big disadvantage of a trigger is the performance impact.
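A minimal sketch of such a trigger, with made-up names (uses_trigger, trigger_id_seq, set_pkey):

```sql
CREATE SEQUENCE trigger_id_seq;

CREATE TABLE uses_trigger (
   id      bigint PRIMARY KEY,
   payload text
);

CREATE FUNCTION set_pkey() RETURNS trigger
   LANGUAGE plpgsql AS
$$BEGIN
   -- overwrite whatever value the user supplied for "id"
   NEW.id := nextval('trigger_id_seq');
   RETURN NEW;
END;$$;

-- on PostgreSQL v10 and older, use EXECUTE PROCEDURE instead
CREATE TRIGGER gen_pkey
   BEFORE INSERT ON uses_trigger
   FOR EACH ROW EXECUTE FUNCTION set_pkey();
```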
Should I use integer (serial) or bigint (bigserial) for my auto-generated primary key?
You should always use bigint.

True, an integer occupies four bytes, while a bigint needs eight. But:
- If you have a small table, where integer would suffice, the four wasted bytes won’t matter much. Also, not every table that you designed to be small will remain small!
- If you have a big table, you might exceed the maximum for integer, which is 2147483647. Note that this could happen even if your table contains fewer rows than that: you might delete rows, and some sequence values can get “lost” by transactions that are rolled back.
Now it is quite complicated to change the primary key column from integer to bigint in a big table inside an active database without causing excessive downtime, so you should save yourself that pain.

With bigint, you are certain to never exceed the maximum of 9223372036854775807: even if you insert 10000 rows per second without any pause, you have almost 30 million years before you reach the limit.
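You can check that arithmetic directly in SQL:

```sql
SELECT 9223372036854775807 / 10000.0   -- seconds until overflow at 10000 rows/s
       / 86400 / 365.25                -- per day, per year
       AS years_until_overflow;
-- roughly 29.2 million years
```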
Should I use bigserial or an identity column for my auto-generated primary key?
You should use an identity column, unless you have to support old PostgreSQL versions.
Identity columns were introduced in PostgreSQL v10, and they have two advantages over bigserial:
- They comply with the SQL standard, while bigserial is proprietary PostgreSQL syntax. This will make your code more portable.
- If you use GENERATED ALWAYS AS IDENTITY, you will get an error message if you try to override the generated value by explicitly inserting a number. This avoids the common problem that manually entered values will conflict with generated values later on, causing surprising application errors (see the sketch below).
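As a sketch, using the uses_identity table from above (the exact error wording may vary between versions):

```sql
INSERT INTO uses_identity (id) VALUES (42);
-- ERROR:  cannot insert a non-DEFAULT value into column "id"
-- HINT:  Use OVERRIDING SYSTEM VALUE to override.

-- if you really mean it, an explicit override is still possible:
INSERT INTO uses_identity (id) OVERRIDING SYSTEM VALUE VALUES (42);
```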
So unless you have to support PostgreSQL v9.6 or below, there is no reason to use bigserial.
Should I use bigint or uuid for an auto-generated primary key?
Use a sequence unless you use database sharding or have another reason to make primary keys in a “decentralized” way (outside of a single database).
Real differences
The advantages of bigint are clear:

- bigint uses only eight bytes, while uuid uses 16.
- It is less expensive to fetch a value from a sequence than to calculate a UUID.
One drawback of a sequence is that it is a single object in a single database. So you can’t use a sequence if you use sharding, that is, if you distribute your data across several databases. UUIDs are an obvious solution in this scenario. (You could use sequences defined with an INCREMENT greater than 1 and different START values, as in the sketch below, but that might lead to problems when you add additional shards.)
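A minimal sketch of that per-shard workaround for two shards (shard_seq is a made-up name):

```sql
-- on shard 0: yields 1, 3, 5, ...
CREATE SEQUENCE shard_seq INCREMENT 2 START 1;

-- on shard 1: yields 2, 4, 6, ...
CREATE SEQUENCE shard_seq INCREMENT 2 START 2;

-- adding a third shard later would force you to renumber everything,
-- which is exactly the problem mentioned above
```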
Of course, you will also want to use UUIDs if the primary key is generated by an application that is distributed across multiple application servers, rather than by the database itself.
Imaginary differences
Some argue that UUIDs are superior because they distribute writes across multiple pages of the primary key index. This is supposed to reduce contention and result in a more balanced, or less fragmented, index. The first claim is true, but it can actually work against you: it requires the whole index to be cached to perform well. The second claim is definitely wrong, since B-tree indexes are always balanced. Moreover, a change in PostgreSQL v11 made sure that monotonically increasing values fill an index more efficiently than random inserts ever could (although subsequent deletes will cause fragmentation). In short, any such advantages are either marginal or nonexistent, and they are more than outweighed by the fact that uuid uses twice as much storage, which makes the index bigger, causing more writes and occupying more of your cache.
Aspects of security
It has been said that sequence-generated primary keys can leak information, because they allow you to deduce the approximate order in which rows were added to a table (see, for example, the comments below). That is true, even though I have a hard time imagining a situation in which this would really be a security risk. If it worries you, use UUIDs and put your worries to rest.
Benchmarking bigint versus uuid
My co-worker Kaarel ran a small performance test a while ago and found that uuid was slower than bigint when it came to bigger joins.
I decided to run a small insert-only benchmark with these two tables:
```sql
CREATE UNLOGGED TABLE test_bigint (
   id bigint GENERATED ALWAYS AS IDENTITY (CACHE 200) PRIMARY KEY
);

CREATE UNLOGGED TABLE test_uuid (
   id uuid DEFAULT gen_random_uuid() PRIMARY KEY
);
```
I performed the benchmark on my laptop (SSD, 8 cores) with a pgbench custom script that had 6 concurrent clients repeatedly run transactions of 1000 prepared INSERT statements for five minutes:
```sql
INSERT INTO test_bigint /* or test_uuid */ DEFAULT VALUES;
```
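For reference, a plausible shape for such a custom script (the file name bench.sql is an assumption; the pgbench call could then be something like pgbench -n -M prepared -c 6 -T 300 -f bench.sql):

```sql
-- bench.sql: one transaction of 1000 inserts (abbreviated here)
BEGIN;
INSERT INTO test_bigint DEFAULT VALUES;
-- ... the INSERT repeated for a total of 1000 times ...
COMMIT;
```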
Performance comparison of bigint versus uuid auto-generated primary keys:

|                      | bigint     | uuid       |
|----------------------|------------|------------|
| inserts per second   | 107090     | 74947      |
| index growth per row | 30.5 bytes | 41.7 bytes |
Using bigint clearly wins, but the difference is not spectacular.
Conclusion
As auto-generated primary keys, you can use both UUIDs and numbers generated from a sequence. Use identity columns unless you have to generate primary keys outside a single database, and make sure all your primary key columns are of type bigint.
About Enteros
Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.