A Guide to Data Normalization in PostgreSQL
Introduction
Even though I installed Red Hat Linux 6.0 (not RHEL) back in 1999 to set up my first PostgreSQL database, I probably only used that 6.5 release to launch the server’s SysV init scripts. It was cool because, at the time, PostgreSQL was MySQL’s underdog. Back then, I knew nothing about SQL (except the name), databases, or data modeling.
Soon after, I enrolled at a large technical university, where, much to my surprise, the university’s IT department hired me to assist with the rollout of a brand-new student records management system that had been developed internally. Even though my knowledge and skills hadn’t changed at all, I suddenly became an Oracle DBA. They said I would learn on the job. Indeed, on the job I learned that Big Red is a thing (or rather, a beast).
I served my days as best I could, sitting between two huge ER diagrams (A0 size) that I had printed out and meticulously glued together from smaller sheets. I still wasn’t much smarter about data modeling, but since the system had been developed by the university’s most experienced academic staff, it was supposed to be designed correctly. And it was well designed. I was attempting to navigate the maze of tables, views, and foreign keys without any formal instruction or proper tools (there was no budget for anything beyond about 70 named user licenses; Windows SQL*Plus was all I had). I had to write reports and queries on the spot, fix mistakes made by operators, and so on. Over the years, the ancient knowledge of how to properly decompose and normalize data slowly crept into my mind, thoughts, and reflexes, as if by some magical force.
Since then, whenever I see a database schema, I can tell right away if it feels right or not. I only discovered the causes of those feelings much later.
And yes, Codd’s third normal form was used to design the database (for the most part).
What am I talking about, then?
The concept of…
normal forms
…is older than SQL and, for that matter, older than any usable relational database software. It was introduced by Edgar F. Codd of IBM in his influential 1970 paper — work that led to the creation of all the relational databases we use today, including PostgreSQL.
Sadly, these ideas are somewhat academic in nature and origin, despite the fact that they have many very practical applications; in order to discuss them, I need to introduce some theory and define a few terms.
Note: The few sentences that follow don’t apply only to PostgreSQL; you can use them with any relational database. Also, feel free to skip ahead as much as is convenient for you if you already know all this like the back of your hand.
It’s crucial to keep in mind that the goal of introducing the idea of normal forms was to prevent…
data anomalies
…by removing any and all instances of data duplication. That is ultimately what it comes down to:
- eliminate inconsistency (update anomaly)
- prevent information loss on deletion (delete anomaly)
- prevent incompleteness (insert anomaly)
What do anomalies look like in real life? Imagine that we are in the early 1970s and that our successful paperback romance bookstore uses a program that stores all of its data in a single set of records (for example, in an ISAM file):
Author | Origin | Title | Pages | Price |
Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
Angie Nadajoskova | Bratislava, SK | Irritated soul | 234 | $23.99 |
Rommy la Dequell | Altdorf, AT | Desire to suffer | 456 | $1.99 |
We also need to include a new author who joined our list but whose upcoming thriller-romance is still a mystery:
Michael Birg | Bludenz, AT | | | $4.99 |
Therefore, the following anomalies are visible. First, there is a contradiction between our best-selling author’s two books (records 1 and 3): is Altdorf in Austria or Hungary? Or has the author relocated? Second, when the book “Irritated Soul” went out of print and its record was removed, we also lost all the information about the author — information we might very well want to keep (record 2). Finally, the new record cannot be added at all, because it simply lacks the necessary data. As is, we are unable to use it.
Now that we are all used to PostgreSQL and other RDBMSes’ amazing capabilities, these issues may seem naive or even ridiculous. But keep in mind that this was 50 years ago, when all joins had to be programmed by hand. More importantly, keep in mind that these oddities can still happen today, though in a less obvious way.
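The three anomalies can be demonstrated concretely. Below is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for the single-file store (the table and column names are invented for the illustration; any relational engine, including PostgreSQL, would behave the same way on this denormalized design):

```python
import sqlite3

# Recreate the single-table bookstore "database" from the example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (author TEXT, origin TEXT, title TEXT, pages INTEGER, price TEXT)")
conn.executemany("INSERT INTO books VALUES (?, ?, ?, ?, ?)", [
    ("Rommy la Dequell", "Altdorf, HU", "Power of silence", 123, "$0.99"),
    ("Angie Nadajoskova", "Bratislava, SK", "Irritated soul", 234, "$23.99"),
    ("Rommy la Dequell", "Altdorf, AT", "Desire to suffer", 456, "$1.99"),
])

# Update anomaly: the same author is stored with two contradictory origins.
inconsistent = conn.execute(
    "SELECT author FROM books GROUP BY author HAVING COUNT(DISTINCT origin) > 1"
).fetchall()

# Delete anomaly: dropping the out-of-print book also erases its author.
conn.execute("DELETE FROM books WHERE title = 'Irritated soul'")
remaining_authors = {row[0] for row in conn.execute("SELECT DISTINCT author FROM books")}
```

After the delete, Angie Nadajoskova is gone from the data entirely, and the GROUP BY query flags Rommy la Dequell’s contradictory origins — exactly the update and delete anomalies described above.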
I don’t expect this knowledge to seep into everyone’s mind by osmosis, the way I described above, so there are a few…
basic concepts
…that I need to define first.
We have a set of named and typed fields (also known as attributes or columns) that make up a record, also known as a tuple or row. A table, also referred to as a relation, is a group of rows with the same type. There is no guarantee that the fields in a record or the records in a table will be in a specific order. So far, so good; these terms ought to be familiar to every user of a relational database.
What about…
keys
Making a key that combines all fields is a simple way to identify any row in a table; this is known as the trivial superkey.
Author | Origin | Title | Pages | Price |
Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
(superkey: all columns)
Most of the time, you can make superkeys that don’t include all fields but can still be used to identify rows.
Author | Origin | Title | Pages | Price |
Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
(superkey: Author, Origin, Title)
or
Author | Origin | Title | Pages | Price |
Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
(superkey: Author, Title, Pages, Price)
A column is a prime attribute (or key attribute) if it is a member of some candidate key; otherwise, it is a non-prime (non-key) attribute. This distinction is crucial when describing what depends on what. In the examples above, Pages and Price are non-prime: there is no functional dependency on them, and other books may well have the same page count or price.
We have reached a candidate key if it is no longer possible to remove a field from a superkey without losing its ability to identify a row; there may be more than one such set of fields.
Author | Origin | Title | Pages | Price |
Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
(candidate key: Author, Title)
or
Author | Origin | Title | Pages | Price |
Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
(candidate key: Origin, Title)
To access any row reliably and ensure it is unique, one of the candidate keys must be chosen as the primary key. The primary key consists of one or more columns (when there is more than one, it is called a compound key), and their combined contents must be both unique and non-null.
Author | Origin | Title | Pages | Price |
Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
(primary key: Author, Title)
A record’s keys can be natural — derived from the innate, real-world characteristics of the objects being described, using one or more columns that are already there (a social security number, for example) — or surrogate: an artificial value, stored in its own dedicated column, invented solely so that each record can be identified individually. Surrogate keys are useful because they can be generated quickly, stored cheaply, and sorted easily (bigint), and they can be guaranteed globally unique if necessary (UUID). This type of key is often used when a persistence framework or an ORM creates a schema automatically.
Surrogate keys do have one big problem, though: they don’t say anything about the real-world properties of the thing being described by the row. So, they can’t tell us anything about it, including whether or not it’s unique. As a result, a surrogate key can only ensure its own uniqueness. Because of this, duplicates—exactly what normalization aims to prevent—can happen in real life.
But in the real world, objects are usually recognizable by a group of features, not just one. Because of this, natural primary keys are usually compound, which makes them hard to use in practice. Also, using them over and over as foreign keys in other columns creates redundant data in the form of many rows and tables with the same (potentially long) values, which is just a waste of space.
Since it’s common to use a surrogate key instead of the primary key even before normalization starts, here’s an example:
Let’s make a table defining people, and let’s add enough distinct characteristics to represent each person’s individuality: name, surname, birthplace and date, height, eye color, net worth, favorite food, pet name, and so on. Let’s choose the combination of these characteristics that allows each person to be uniquely identified for our use case. This exact combination becomes our compound primary key, and the database will ensure there is only one record with these exact values in the chosen fields. But if we instead just assign each person a number and make it the primary key, nothing stops us from having multiple identical records that, for all intents and purposes, describe the same person and differ only by the meaningless value of that one artificially assigned number (rows 1 and 4). Even worse, such records may contain small differences caused by typos or mistakes (row 3). Keep in mind, though, that the mistake in row 2 would not have been prevented in any case, because it occurred in a column that would itself have been part of the primary key — the typo simply produces a new, formally distinct key value.
Id | Author | Origin | Title | Pages | Price |
1 | Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
2 | Rommy la Dequell | Altdorf, AT | Power of silence | 123 | $0.99 |
3 | Rommy la Dequel | Altdorf, HU | Power of silence | 123 | $0.99 |
4 | Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
Note: my rule of thumb is to make the primary key natural — compound if necessary — so that the database ensures genuine uniqueness, and to additionally add a surrogate key with a unique constraint for convenient references to the table’s rows. That way, the concern of ensuring row uniqueness (served by the primary key) is separated from the concern of providing an easy way to refer to a row from other tables (served by the column with the unique constraint). That’s adequate. For more information on specific constraints, visit the Cybertec blog.
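The difference between the two approaches can be sketched in a few lines. This is a minimal, hypothetical example using Python’s sqlite3 for self-containment — in PostgreSQL you would typically use bigint GENERATED ALWAYS AS IDENTITY for the surrogate column and the same UNIQUE constraint on the natural key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Surrogate key only: the id is unique, but the same book can sneak in twice.
conn.execute("CREATE TABLE books_v1 (id INTEGER PRIMARY KEY, author TEXT, title TEXT)")
conn.execute("INSERT INTO books_v1 (author, title) VALUES ('Rommy la Dequell', 'Power of silence')")
conn.execute("INSERT INTO books_v1 (author, title) VALUES ('Rommy la Dequell', 'Power of silence')")
duplicate_rows = conn.execute("SELECT COUNT(*) FROM books_v1").fetchone()[0]

# Surrogate key for convenient references, plus a unique constraint on the
# (assumed) natural key, so the database enforces real-world uniqueness too.
conn.execute("""
    CREATE TABLE books_v2 (
        id     INTEGER PRIMARY KEY,
        author TEXT NOT NULL,
        title  TEXT NOT NULL,
        UNIQUE (author, title)
    )
""")
conn.execute("INSERT INTO books_v2 (author, title) VALUES ('Rommy la Dequell', 'Power of silence')")
try:
    conn.execute("INSERT INTO books_v2 (author, title) VALUES ('Rommy la Dequell', 'Power of silence')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The first table happily stores two identical books; the second rejects the duplicate while still offering a cheap id to reference from other tables.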
The final concept I need to discuss is functional dependency between attributes. An attribute is functionally dependent on another if its value always appears together with the specific value of the attribute it depends on. The dependency is one-directional: it unambiguously converts one value into another, and you can think of it as a function or a mapping table.
One simple example would be two columns representing the same physical quantity — say, weight — one in metric units and the other in imperial units. Since there is a well-defined conversion between the two, the value of one fully determines the value of the other.
A less obvious example would be assigning a less precise property (such as a category) based on a more precise one (such as price or page count). Somewhere — usually in the application — there is an exact rule for deriving the first from the second. If it is applied consistently and without exception, we have a functional dependency. Such a dependency could also be expressed as a separate relation and computed in a view or per query. But keep in mind that if the business requirement is to allow exceptions (that is, to assign categories arbitrarily, even if only occasionally), then there is no strict functional dependency, and that must be reflected in the schema design.
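Such a derived, functionally dependent attribute can be sketched as a plain function. The category names and page thresholds below are invented for the illustration; in PostgreSQL the same rule would typically live in a view or a generated column rather than be stored redundantly:

```python
# Functional dependency as a function: the category is fully determined by
# the page count. Thresholds and names are made up for this sketch.
def category(pages: int) -> str:
    if pages < 150:
        return "novella"
    if pages < 400:
        return "novel"
    return "epic"

books = [("Power of silence", 123), ("Irritated soul", 234), ("Desire to suffer", 456)]
categorized = {title: category(pages) for title, pages in books}
```

As long as every category value is produced by this one rule, storing the category alongside the page count adds no information — it can always be recomputed.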
Returning to normal forms: some of them are mainly…
relevant
…to everyday practice, but the distinction between “relevant/not relevant” should be viewed with caution given the ongoing evolution of database software capabilities, particularly PostgreSQL.
Let’s discuss the normal forms in order since each one has a specific requirement that must be met and expands upon the requirement stated in the previous form:
- The only requirement for 0NF, also known as NNF (Not Normalized Form), is that a record has a primary key that uniquely identifies it.
- 1NF — the original definition was quite strict: it requires that every field value be atomic (that is, not compound). With modern RDBMSes, and especially PostgreSQL, that is a rather challenging requirement, so let’s stick with the «no table as a value» interpretation and leave the discussion of what atomic, compound, and opaque mean for later.
- non-compliant
Person | Born in | Born on | Born in Land | Mother’s name | Mother born in | Mother born on | Lived in |
Anna Gruber | Klagenfurt | 12.10.1967 | Kärnten | Angela Bauer | Salzburg | 30.02.1948 | Wien, Linz |
Betty Huber | Bregenz | 14.05.1989 | Vorarlberg | Regina Wagner | Graz | 22.11.1970 | Linz, Salzburg |
- compliant — the composite field values (Lived in) are moved to a separate table.
Person | Born in | Born on | Born in Land | Mother’s name | Mother born in | Mother born on |
Anna Gruber | Klagenfurt | 12.10.1967 | Kärnten | Angela Bauer | Salzburg | 30.02.1948 |
Betty Huber | Bregenz | 14.05.1989 | Vorarlberg | Regina Wagner | Graz | 22.11.1970 |
Person | Born in | Born on | Lived in |
Anna Gruber | Klagenfurt | 12.10.1967 | Wien |
Anna Gruber | Klagenfurt | 12.10.1967 | Linz |
Betty Huber | Bregenz | 14.05.1989 | Linz |
Betty Huber | Bregenz | 14.05.1989 | Salzburg |
- compliant — with a distinct surrogate key (Person ID) added
Person | Born in | Born on | Person ID | Born in Land | Mother’s name | Mother born in | Mother born on |
Anna Gruber | Klagenfurt | 12.10.1967 | 1 | Kärnten | Angela Bauer | Salzburg | 30.02.1948 |
Betty Huber | Bregenz | 14.05.1989 | 2 | Vorarlberg | Regina Wagner | Graz | 22.11.1970 |
Person ID | Lived in |
1 | Wien |
1 | Linz |
2 | Linz |
2 | Salzburg |
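The 1NF decomposition above can be exercised end to end. A minimal sketch with Python’s sqlite3 (table and column names chosen to mirror the example; the same schema works in PostgreSQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT, born_in TEXT, born_on TEXT)")
conn.execute("CREATE TABLE lived_in (person_id INTEGER REFERENCES person, place TEXT)")
conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?)", [
    (1, "Anna Gruber", "Klagenfurt", "12.10.1967"),
    (2, "Betty Huber", "Bregenz", "14.05.1989"),
])
conn.executemany("INSERT INTO lived_in VALUES (?, ?)", [
    (1, "Wien"), (1, "Linz"), (2, "Linz"), (2, "Salzburg"),
])

# The repeating group is gone; a join reassembles the places per person.
anna_places = [row[0] for row in conn.execute(
    "SELECT place FROM lived_in JOIN person USING (person_id) "
    "WHERE name = 'Anna Gruber' ORDER BY place"
)]
```

No person data is repeated per place lived, yet a single join recovers the full picture.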
- 2NF — requires that no non-key attribute depends on just a part of a (compound) primary key. If such a partial dependency is found, the table must be split: the dependent attribute(s) are moved to a separate table together with the portion of the primary key they depend on.
- non-compliant
Person | Born in | Born on | Person ID | Born in Land | Mother’s name | Mother born in | Mother born on |
Anna Gruber | Klagenfurt | 12.10.1967 | 1 | Kärnten | Angela Bauer | Salzburg | 30.02.1948 |
Betty Huber | Bregenz | 14.05.1989 | 2 | Vorarlberg | Regina Wagner | Graz | 22.11.1970 |
- compliant — the fields (Born in Land) that depend on only a portion of the primary key (Born in) are moved to a separate table.
Person | Born in | Born on | Person ID | Mother’s name | Mother born in | Mother born on |
Anna Gruber | Klagenfurt | 12.10.1967 | 1 | Angela Bauer | Salzburg | 30.02.1948 |
Betty Huber | Bregenz | 14.05.1989 | 2 | Regina Wagner | Graz | 22.11.1970 |
This assumes that the city name, and only the city name, is a primary key in the cities table below (which is rarely the case in practice).
City | Land |
Klagenfurt | Kärnten |
Bregenz | Vorarlberg |
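The 2NF split can be sketched the same way. Note that, as the text says, a bare city name as primary key is a simplification; it is kept here only to mirror the example (sqlite3 again stands in for PostgreSQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified: the city name alone acts as the key of the lookup table.
conn.execute("CREATE TABLE cities (city TEXT PRIMARY KEY, land TEXT)")
conn.execute("CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT, born_in TEXT REFERENCES cities)")
conn.executemany("INSERT INTO cities VALUES (?, ?)",
                 [("Klagenfurt", "Kärnten"), ("Bregenz", "Vorarlberg")])
conn.executemany("INSERT INTO person VALUES (?, ?, ?)",
                 [(1, "Anna Gruber", "Klagenfurt"), (2, "Betty Huber", "Bregenz")])

# The land is now stored once per city and recovered with a join.
lands = dict(conn.execute(
    "SELECT name, land FROM person JOIN cities ON person.born_in = cities.city"
))
```

Changing which Land a city belongs to now means updating a single row in cities, with no risk of contradictory copies.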
- 3NF — requires that there be no transitive dependency on the primary key (that is, a non-key field must not depend on the primary key only through another non-key field). If the values of two or more fields always appear together, they should be moved to a separate table with one of them as its key, in order to avoid repetition of information.
- non-compliant
Person | Born in | Born on | Person ID | Mother’s name | Mother born in | Mother born on |
Anna Gruber | Klagenfurt | 12.10.1967 | 1 | Angela Bauer | Salzburg | 30.02.1948 |
Betty Huber | Bregenz | 14.05.1989 | 2 | Regina Wagner | Graz | 22.11.1970 |
- compliant — since a “mother” is also a “person,” the fields (Mother born in and Mother born on) that depend transitively (through Mother’s name) on the primary key can be moved to separate rows of the same table.
Person | Born in | Born on | Person ID | Mother ID |
Anna Gruber | Klagenfurt | 12.10.1967 | 1 | 3 |
Betty Huber | Bregenz | 14.05.1989 | 2 | 4 |
Angela Bauer | Salzburg | 30.02.1948 | 3 | «null» |
Regina Wagner | Graz | 22.11.1970 | 4 | «null» |
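The self-referencing design above can be verified with a self-join. A minimal sqlite3 sketch mirroring the table (dates kept verbatim as text, as in the example; PostgreSQL would accept the same self-referencing foreign key):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name      TEXT,
        born_in   TEXT,
        born_on   TEXT,
        mother_id INTEGER REFERENCES person
    )
""")
conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?, ?)", [
    (1, "Anna Gruber", "Klagenfurt", "12.10.1967", 3),
    (2, "Betty Huber", "Bregenz", "14.05.1989", 4),
    (3, "Angela Bauer", "Salzburg", "30.02.1948", None),
    (4, "Regina Wagner", "Graz", "22.11.1970", None),
])

# A self-join recovers the mother's details that were previously duplicated
# in every child's row.
anna_mother = conn.execute(
    "SELECT m.name, m.born_in FROM person p "
    "JOIN person m ON p.mother_id = m.person_id WHERE p.name = 'Anna Gruber'"
).fetchone()
```

Each mother’s birth details now live in exactly one row, referenced by Mother ID.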
- BCNF — Boyce-Codd Normal Form aka 3.5NF is just a bit of a stronger version of 3NF; often, a table in 3NF is also compliant with BCNF.
It requires that no part of a primary key be functionally dependent on a non-key attribute. The difference is quite subtle — until now I have been talking about non-key attributes that depend on key attributes; now it is the other way round.
- non-compliant — Born in is part of the primary key and functionally depends on non-key attribute Birth Post Code
Person | Born in | Born on | Person ID | Mother ID | Birth Post Code |
Anna Gruber | Klagenfurt | 12.10.1967 | 1 | 3 | 9073 |
Betty Huber | Bregenz | 14.05.1989 | 2 | 4 | 6900 |
Angela Bauer | Salzburg | 30.02.1948 | 3 | «null» | 5020 |
Regina Wagner | Graz | 22.11.1970 | 4 | «null» | 8032 |
- compliant — attribute “Birth Post Code” replaces “Born in” in the primary key. Also, the structure of the Cities table is changed (and is much more realistic). Note that with this change, the data are correctly normalized, but less readable.
Person | Birth Post Code | Born on | Person ID | Mother ID |
Anna Gruber | 9073 | 12.10.1967 | 1 | 3 |
Betty Huber | 6900 | 14.05.1989 | 2 | 4 |
Angela Bauer | 5020 | 30.02.1948 | 3 | «null» |
Regina Wagner | 8032 | 22.11.1970 | 4 | «null» |
Post Code | City | Land |
9073 | Klagenfurt | Kärnten |
6900 | Bregenz | Vorarlberg |
5020 | Salzburg | Salzburg |
8032 | Graz | Steiermark |
This is usually where information about normal forms ends, particularly because the remaining normal forms are the ones any practitioner would usually not be able to quote from memory, and are generally considered…
not-so relevant
…for real-world uses. This isn’t entirely true, though, because some of them are already being used in design, even if it’s not on purpose.
These forms were developed later, building on Edgar F. Codd’s initial work. They describe further situations in which tables should be decomposed. I’ve provided some examples below:
- 4NF — requires that there be no multivalued dependencies on anything other than superkey attributes.
- non-compliant — this table has only one (all-column) key and no non-prime attributes, but assuming that each trainer delivers every one of his courses in every language he speaks, adding a new course requires inserting multiple rows. Note, however, that if this assumption is false (e.g., some trainers deliver certain courses only in particular languages), we cannot stop here and must go one step further, to 5NF.
Trainer | Course | Language |
Michal | PostgreSQL Professional | EN |
Michal | PostgreSQL Professional | PL |
Pavlo | PostgreSQL Professional | EN |
Pavlo | PostgreSQL Professional | SK |
Michal | Administration & Performance Tuning | EN |
Michal | Administration & Performance Tuning | PL |
Hans | Administration & Performance Tuning | EN |
Hans | Administration & Performance Tuning | DE |
Hans | PostgreSQL Professional | EN |
Hans | PostgreSQL Professional | DE |
- compliant — table is split in two
Trainer | Course |
Michal | PostgreSQL Professional |
Michal | Administration & Performance Tuning |
Pavlo | PostgreSQL Professional |
Hans | PostgreSQL Professional |
Hans | Administration & Performance Tuning |
Trainer | Language |
Michal | EN |
Michal | PL |
Pavlo | EN |
Pavlo | SK |
Hans | EN |
Hans | DE |
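That this split is lossless can be checked directly: joining the two tables back together on Trainer must reproduce the original rows exactly. A small sketch in plain Python sets (the set comprehension plays the role of a natural join):

```python
# The two tables of the compliant 4NF design.
trainer_course = {
    ("Michal", "PostgreSQL Professional"),
    ("Michal", "Administration & Performance Tuning"),
    ("Pavlo", "PostgreSQL Professional"),
    ("Hans", "PostgreSQL Professional"),
    ("Hans", "Administration & Performance Tuning"),
}
trainer_language = {
    ("Michal", "EN"), ("Michal", "PL"),
    ("Pavlo", "EN"), ("Pavlo", "SK"),
    ("Hans", "EN"), ("Hans", "DE"),
}

# Natural join on Trainer: every (trainer, course) pairs with every
# (trainer, language) of the same trainer.
recomposed = {
    (trainer, course, language)
    for trainer, course in trainer_course
    for trainer2, language in trainer_language
    if trainer == trainer2
}
```

The join yields exactly the ten rows of the non-compliant table — nothing is lost, and adding a course for a trainer is now a single insert.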
- 5NF — requires that there be no join dependencies other than those implied by the candidate keys. In other words, a table is in 5NF if it cannot be losslessly split (reduced) any further into smaller tables with different (smaller) keys without losing information.
- non-compliant — let’s use the table from the example above, but insert additional data: Pavlo started to offer an additional course, but does not have all materials in all languages yet
Trainer | Course | Language |
Michal | PostgreSQL Professional | EN |
Michal | PostgreSQL Professional | PL |
Pavlo | PostgreSQL Professional | EN |
Pavlo | PostgreSQL Professional | SK |
Pavlo | Administration & Performance Tuning | EN |
Michal | Administration & Performance Tuning | EN |
Michal | Administration & Performance Tuning | PL |
Hans | Administration & Performance Tuning | EN |
Hans | Administration & Performance Tuning | DE |
Hans | PostgreSQL Professional | EN |
Hans | PostgreSQL Professional | DE |
- compliant — one additional table must be introduced.
Trainer | Course |
Michal | PostgreSQL Professional |
Michal | Administration & Performance Tuning |
Pavlo | PostgreSQL Professional
Pavlo | Administration & Performance Tuning
Hans | PostgreSQL Professional |
Hans | Administration & Performance Tuning |
Trainer | Language |
Michal | EN |
Michal | PL |
Pavlo | EN |
Pavlo | SK |
Hans | EN |
Hans | DE |
Course | Language |
PostgreSQL Professional | EN |
PostgreSQL Professional | DE |
PostgreSQL Professional | PL |
PostgreSQL Professional | SK |
Administration & Performance Tuning | EN |
Administration & Performance Tuning | DE |
Administration & Performance Tuning | PL |
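Again, the losslessness of the three-way split can be checked mechanically. Note that for the join to work, the Trainer/Course table must also contain the (Pavlo, Administration & Performance Tuning) pair introduced by the new data. A sketch in plain Python sets:

```python
trainer_course = {
    ("Michal", "PostgreSQL Professional"),
    ("Michal", "Administration & Performance Tuning"),
    ("Pavlo", "PostgreSQL Professional"),
    ("Pavlo", "Administration & Performance Tuning"),
    ("Hans", "PostgreSQL Professional"),
    ("Hans", "Administration & Performance Tuning"),
}
trainer_language = {
    ("Michal", "EN"), ("Michal", "PL"),
    ("Pavlo", "EN"), ("Pavlo", "SK"),
    ("Hans", "EN"), ("Hans", "DE"),
}
course_language = {
    ("PostgreSQL Professional", "EN"), ("PostgreSQL Professional", "DE"),
    ("PostgreSQL Professional", "PL"), ("PostgreSQL Professional", "SK"),
    ("Administration & Performance Tuning", "EN"),
    ("Administration & Performance Tuning", "DE"),
    ("Administration & Performance Tuning", "PL"),
}

# Three-way join: a (trainer, course, language) row exists only if all
# three pairwise combinations exist. Pavlo teaches Administration &
# Performance Tuning in EN but not in SK, because that course/language
# pair is absent.
recomposed = {
    (t, c, l)
    for t, c in trainer_course
    for t2, l in trainer_language
    if t == t2 and (c, l) in course_language
}
```

The three-way join reproduces exactly the eleven rows of the non-compliant table, including the exception that a two-way split could not express.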
- DKNF — requires that there are no constraints other than those expressed by domain or key constraints. In other words, constraints must be precisely defined, there is no place for an arbitrary or fuzzy choice for constrained values.
- non-compliant — the table does not contain any precise indication about how to select values for the columns Length and Pace
Course | Hours | Days | Length | Pace |
PostgreSQL Professional | 15 | 3 | normal | standard |
Administration & Performance Tuning | 25 | 5 | long | standard |
High Availability & Patroni | 24 | 3 | normal | intense |
- compliant — the values for Length and Pace are precisely constrained; note that in order to determine the pace of a course, an additional calculation (hours per day) must be performed
Course | Hours | Days |
PostgreSQL Professional | 15 | 3 |
Administration & Performance Tuning | 25 | 5 |
High Availability & Patroni | 24 | 3 |
Length | min days | max days |
short | 1 | 1 |
normal | 2 | 3 |
long | 4 | 5 |
Pace | min hours per day | max hours per day |
introductory | 3 | 4 |
standard | 5 | 5 |
intense | 6 | 8 |
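The “additional calculation” mentioned above can be sketched as range lookups derived from the two constraint tables. This is a minimal illustration (integer division is used because hours divide evenly by days in the sample data):

```python
# Length and Pace are fully determined by Days and Hours through the two
# constraint tables above, expressed here as inclusive ranges.
LENGTH_RANGES = {"short": (1, 1), "normal": (2, 3), "long": (4, 5)}
PACE_RANGES = {"introductory": (3, 4), "standard": (5, 5), "intense": (6, 8)}

def classify(ranges: dict, value: int) -> str:
    for name, (low, high) in ranges.items():
        if low <= value <= high:
            return name
    raise ValueError(f"no range matches {value}")

courses = [
    ("PostgreSQL Professional", 15, 3),
    ("Administration & Performance Tuning", 25, 5),
    ("High Availability & Patroni", 24, 3),
]
derived = {
    name: (classify(LENGTH_RANGES, days), classify(PACE_RANGES, hours // days))
    for name, hours, days in courses
}
```

Because Length and Pace are now computed from precisely defined ranges, there is no room for an arbitrary or fuzzy choice — which is exactly what DKNF demands.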
- 6NF — requires that a row contain no more than one non-key attribute in addition to the primary key. As impractical as this normal form might seem, it has its applications in e.g. data warehouses and very sparse schemas, and it is most efficient in so-called columnar storage engines. For typical OLTP, 6NF is not very practical and is too close for comfort to the dreaded EAV antipattern.
About Enteros
Enteros offers a patented database performance management SaaS platform. It proactively identifies root causes of complex business-impacting database scalability and performance issues across a growing number of clouds, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.