A Guide to Data Normalization in PostgreSQL
Introduction
Even though I installed Red Hat Linux 6.0 (not RHEL) back in 1999 to set up my first PostgreSQL database, I probably only used that 6.5 release to launch the server’s SysV init scripts. It was cool because, at the time, PostgreSQL was MySQL’s underdog. Back then, I knew nothing about SQL (except the name), databases, or data modeling.
Soon after, I enrolled at a large technical university, where, much to my surprise, the university’s IT department hired me to assist with the rollout of a brand-new student records management system that had been developed internally. Even though my knowledge and skills hadn’t changed at all, I suddenly became an Oracle DBA. They said I would learn on the job. Indeed, on the job I learned that Big Red is a thing (or rather, a beast).
I served my days as best I could, sitting between two huge ER diagrams (A0 size) that I had printed out and meticulously glued together from smaller sheets. I still wasn’t much smarter about data modeling, but since the system had been developed by the university’s most experienced academic staff, it was supposed to be designed correctly. And it was well designed. I was attempting to navigate the maze of tables, views, and foreign keys without any formal instruction or proper tools (there was no budget for anything beyond about 70 named user licenses; Windows SQL*Plus was all I had). I had to write reports and queries on the spot, fix mistakes made by operators, and so on. Over the years, the ancient knowledge of how to properly decompose and normalize data slowly crept into my mind, thoughts, and reflexes, as if by some magical force.
Since then, whenever I see a database schema, I can tell right away if it feels right or not. I only discovered the causes of those feelings much later.
And yes, Codd’s third normal form was used to design the database (for the most part).
What am I talking about, then?
The concept of…
normal forms
…is older than SQL and, for that matter, older than any usable relational database software. It was introduced by Edgar F. Codd of IBM in his influential 1970 paper — work that led to the creation of all the relational databases we use today, including PostgreSQL.
Sadly, these ideas are somewhat academic in nature and origin, despite the fact that they have many very practical applications; in order to discuss them, I need to introduce some theory and define a few terms.
Note: The few sentences that follow don’t apply only to PostgreSQL; you can use them with any relational database. Also, feel free to skip ahead as much as is convenient for you if you already know all this like the back of your hand.
It’s crucial to keep in mind that the goal of introducing the idea of normal forms was to prevent…
data anomalies
…by removing any and all instances of data duplication. That is ultimately what it comes down to:
- eliminate inconsistency (update anomaly)
- prevent information loss on deletion (delete anomaly)
- prevent incompleteness (insert anomaly)
What do anomalies look like in real life? Imagine that we are in the early 1970s and that our successful paperback romance bookstore uses a program that stores all of its data in a single set of records (for example, in an ISAM file):
Author | Origin | Title | Pages | Price |
Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
Angie Nadajoskova | Bratislava, SK | Irritated soul | 234 | $23.99 |
Rommy la Dequell | Altdorf, AT | Desire to suffer | 456 | $1.99 |
We also need to include a new author who joined our list but whose upcoming thriller-romance is still a mystery:
Michael Birg | Bludenz, AT | | | $4.99 |
Therefore, the following anomalies are visible. First, there is a contradiction between our best-selling author’s two books (records 1 and 3): is Altdorf in Austria or Hungary? Or has the author relocated? Second, when the book “Irritated Soul” went out of print and its record was removed, we also lost all the information about the author — information we might very well want to keep (record 2). Finally, the new record cannot be added at all, because it simply lacks the necessary data. As is, we are unable to use it.
Now that we are all used to PostgreSQL and other RDBMSes’ amazing capabilities, these issues may seem naive or even ridiculous. But keep in mind that this was 50 years ago, when all joins had to be programmed by hand. More importantly, keep in mind that these oddities can still happen today, though in a less obvious way.
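The three anomalies can be demonstrated concretely. Below is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for the single-file store (the table and column names are invented for the illustration; any relational engine, including PostgreSQL, would behave the same way on this denormalized design):

```python
import sqlite3

# Recreate the single-table bookstore "database" from the example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (author TEXT, origin TEXT, title TEXT, pages INTEGER, price TEXT)")
conn.executemany("INSERT INTO books VALUES (?, ?, ?, ?, ?)", [
    ("Rommy la Dequell", "Altdorf, HU", "Power of silence", 123, "$0.99"),
    ("Angie Nadajoskova", "Bratislava, SK", "Irritated soul", 234, "$23.99"),
    ("Rommy la Dequell", "Altdorf, AT", "Desire to suffer", 456, "$1.99"),
])

# Update anomaly: the same author is stored with two contradictory origins.
inconsistent = conn.execute(
    "SELECT author FROM books GROUP BY author HAVING COUNT(DISTINCT origin) > 1"
).fetchall()

# Delete anomaly: dropping the out-of-print book also erases its author.
conn.execute("DELETE FROM books WHERE title = 'Irritated soul'")
remaining_authors = {row[0] for row in conn.execute("SELECT DISTINCT author FROM books")}
```

After the delete, Angie Nadajoskova is gone from the data entirely, and the GROUP BY query flags Rommy la Dequell’s contradictory origins — exactly the update and delete anomalies described above.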
I don’t expect this knowledge to seep into everyone’s mind by osmosis, the way I described above, so there are a few…
basic concepts
…that I need to define first.
We have a set of named and typed fields (also known as attributes or columns) that make up a record, also known as a tuple or row. A table, also referred to as a relation, is a group of rows with the same type. There is no guarantee that the fields in a record or the records in a table will be in a specific order. So far, so good; these terms ought to be familiar to every user of a relational database.
What about…
keys
Making a key that combines all fields is a simple way to identify any row in a table; this is known as the trivial superkey.
Author | Origin | Title | Pages | Price |
Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
(superkey: all columns)
Most of the time, you can make superkeys that don’t include all fields but can still be used to identify rows.
Author | Origin | Title | Pages | Price |
Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
(superkey: Author, Origin, Title)
or
Author | Origin | Title | Pages | Price |
Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
(superkey: Author, Title, Pages, Price)
A column is a prime attribute (or key attribute) if it is a member of some candidate key; otherwise, it is a non-prime (non-key) attribute. This distinction is crucial when describing what depends on what. In the examples above, Pages and Price are non-prime: there is no functional dependency on them, and other books may well have the same page count or price.
We have reached a candidate key if it is no longer possible to remove a field from a superkey without losing its ability to identify a row; there may be more than one such set of fields.
Author | Origin | Title | Pages | Price |
Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
(candidate key: Author, Title)
or
Author | Origin | Title | Pages | Price |
Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
(candidate key: Origin, Title)
To access any row reliably and ensure it is unique, one of the candidate keys must be chosen as the primary key. The primary key consists of one or more columns (when there is more than one, it is called a compound key), and their combined contents must be both unique and non-null.
Author | Origin | Title | Pages | Price |
Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
(primary key: Author, Title)
A record’s keys can be natural — derived from the innate, real-world characteristics of the objects being described, using one or more columns that are already there (a social security number, for example) — or surrogate: an artificial value, stored in its own dedicated column, invented solely so that each record can be identified individually. Surrogate keys are useful because they can be generated quickly, stored cheaply, and sorted easily (bigint), and they can be guaranteed globally unique if necessary (UUID). This type of key is often used when a persistence framework or an ORM creates a schema automatically.
Surrogate keys do have one big problem, though: they don’t say anything about the real-world properties of the thing being described by the row. So, they can’t tell us anything about it, including whether or not it’s unique. As a result, a surrogate key can only ensure its own uniqueness. Because of this, duplicates—exactly what normalization aims to prevent—can happen in real life.
But in the real world, objects are usually recognizable by a group of features, not just one. Because of this, natural primary keys are usually compound, which makes them hard to use in practice. Also, using them over and over as foreign keys in other columns creates redundant data in the form of many rows and tables with the same (potentially long) values, which is just a waste of space.
Since it’s common to use a surrogate key instead of the primary key even before normalization starts, here’s an example:
Let’s make a table defining people, and let’s add enough distinct characteristics to represent each person’s individuality: name, surname, birthplace and date, height, eye color, net worth, favorite food, pet name, and so on. Let’s choose the combination of these characteristics that allows each person to be uniquely identified for our use case. This exact combination becomes our compound primary key, and the database will ensure there is only one record with these exact values in the chosen fields. But if we instead just assign each person a number and make it the primary key, nothing stops us from having multiple identical records that, for all intents and purposes, describe the same person and differ only by the meaningless value of that one artificially assigned number (rows 1 and 4). Even worse, such records may contain small differences caused by typos or mistakes (row 3). Keep in mind, though, that the mistake in row 2 would not have been prevented in any case, because it occurred in a column that would itself have been part of the primary key — the typo simply produces a new, formally distinct key value.
Id | Author | Origin | Title | Pages | Price |
1 | Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
2 | Rommy la Dequell | Altdorf, AT | Power of silence | 123 | $0.99 |
3 | Rommy la Dequel | Altdorf, HU | Power of silence | 123 | $0.99 |
4 | Rommy la Dequell | Altdorf, HU | Power of silence | 123 | $0.99 |
Note: my rule of thumb is to make the primary key natural — compound if necessary — so that the database ensures genuine uniqueness, and to additionally add a surrogate key with a unique constraint for convenient references to the table’s rows. That way, the concern of ensuring row uniqueness (served by the primary key) is separated from the concern of providing an easy way to refer to a row from other tables (served by the column with the unique constraint). That’s adequate. For more information on specific constraints, visit the Cybertec blog.
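The difference between the two approaches can be sketched in a few lines. This is a minimal, hypothetical example using Python’s sqlite3 for self-containment — in PostgreSQL you would typically use bigint GENERATED ALWAYS AS IDENTITY for the surrogate column and the same UNIQUE constraint on the natural key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Surrogate key only: the id is unique, but the same book can sneak in twice.
conn.execute("CREATE TABLE books_v1 (id INTEGER PRIMARY KEY, author TEXT, title TEXT)")
conn.execute("INSERT INTO books_v1 (author, title) VALUES ('Rommy la Dequell', 'Power of silence')")
conn.execute("INSERT INTO books_v1 (author, title) VALUES ('Rommy la Dequell', 'Power of silence')")
duplicate_rows = conn.execute("SELECT COUNT(*) FROM books_v1").fetchone()[0]

# Surrogate key for convenient references, plus a unique constraint on the
# (assumed) natural key, so the database enforces real-world uniqueness too.
conn.execute("""
    CREATE TABLE books_v2 (
        id     INTEGER PRIMARY KEY,
        author TEXT NOT NULL,
        title  TEXT NOT NULL,
        UNIQUE (author, title)
    )
""")
conn.execute("INSERT INTO books_v2 (author, title) VALUES ('Rommy la Dequell', 'Power of silence')")
try:
    conn.execute("INSERT INTO books_v2 (author, title) VALUES ('Rommy la Dequell', 'Power of silence')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The first table happily stores two identical books; the second rejects the duplicate while still offering a cheap id to reference from other tables.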
The final concept I need to discuss is functional dependency between attributes. An attribute is functionally dependent on another if its value always appears together with the specific value of the attribute it depends on. The dependency is one-directional: it unambiguously converts one value into another, and you can think of it as a function or a mapping table.
One simple example would be two columns representing the same physical quantity — say, weight — one in metric units and the other in imperial units. Since there is a well-defined conversion between the two, the value of one fully determines the value of the other.
A less obvious example would be assigning a less precise property (such as a category) based on a more precise one (such as price or page count). Somewhere — usually in the application — there is an exact rule for deriving the first from the second. If it is applied consistently and without exception, we have a functional dependency. Such a dependency could also be expressed as a separate relation and computed in a view or per query. But keep in mind that if the business requirement is to allow exceptions (that is, to assign categories arbitrarily, even if only occasionally), then there is no strict functional dependency, and that must be reflected in the schema design.
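Such a derived, functionally dependent attribute can be sketched as a plain function. The category names and page thresholds below are invented for the illustration; in PostgreSQL the same rule would typically live in a view or a generated column rather than be stored redundantly:

```python
# Functional dependency as a function: the category is fully determined by
# the page count. Thresholds and names are made up for this sketch.
def category(pages: int) -> str:
    if pages < 150:
        return "novella"
    if pages < 400:
        return "novel"
    return "epic"

books = [("Power of silence", 123), ("Irritated soul", 234), ("Desire to suffer", 456)]
categorized = {title: category(pages) for title, pages in books}
```

As long as every category value is produced by this one rule, storing the category alongside the page count adds no information — it can always be recomputed.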
Returning to normal forms: some of them are mainly…
relevant
…to everyday practice, but the distinction between “relevant/not relevant” should be viewed with caution given the ongoing evolution of database software capabilities, particularly PostgreSQL.
Let’s discuss the normal forms in order since each one has a specific requirement that must be met and expands upon the requirement stated in the previous form:
- The only requirement for 0NF, also known as NNF (Not Normalized Form), is that a record has a primary key that uniquely identifies it.
- 1NF — the original definition was quite strict: it requires that every field value be atomic (that is, not compound). With modern RDBMSes, and especially PostgreSQL, that is a rather challenging requirement, so let’s stick with the «no table as a value» interpretation and leave the discussion of what atomic, compound, and opaque mean for later.
- non-compliant
Person | Born in | Born on | Born in Land | Mother’s name | Mother born in | Mother born on | Lived in |
Anna Gruber | Klagenfurt | 12.10.1967 | Kärnten | Angela Bauer | Salzburg | 30.02.1948 | Wien, Linz |
Betty Huber | Bregenz | 14.05.1989 | Vorarlberg | Regina Wagner | Graz | 22.11.1970 | Linz, Salzburg |
- compliant — the composite field values (Lived in) are moved to a separate table.
Person | Born in | Born on | Born in Land | Mother’s name | Mother born in | Mother born on |
Anna Gruber | Klagenfurt | 12.10.1967 | Kärnten | Angela Bauer | Salzburg | 30.02.1948 |
Betty Huber | Bregenz | 14.05.1989 | Vorarlberg | Regina Wagner | Graz | 22.11.1970 |
Person | Born in | Born on | Lived in |
Anna Gruber | Klagenfurt | 12.10.1967 | Wien |
Anna Gruber | Klagenfurt | 12.10.1967 | Linz |
Betty Huber | Bregenz | 14.05.1989 | Linz |
Betty Huber | Bregenz | 14.05.1989 | Salzburg |
- compliant — with a distinct surrogate key (Person ID) added
Person | Born in | Born on | Person ID | Born in Land | Mother’s name | Mother born in | Mother born on |
Anna Gruber | Klagenfurt | 12.10.1967 | 1 | Kärnten | Angela Bauer | Salzburg | 30.02.1948 |
Betty Huber | Bregenz | 14.05.1989 | 2 | Vorarlberg | Regina Wagner | Graz | 22.11.1970 |
Person ID | Lived in |
1 | Wien |
1 | Linz |
2 | Linz |
2 | Salzburg |
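The 1NF decomposition above can be exercised end to end. A minimal sketch with Python’s sqlite3 (table and column names chosen to mirror the example; the same schema works in PostgreSQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT, born_in TEXT, born_on TEXT)")
conn.execute("CREATE TABLE lived_in (person_id INTEGER REFERENCES person, place TEXT)")
conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?)", [
    (1, "Anna Gruber", "Klagenfurt", "12.10.1967"),
    (2, "Betty Huber", "Bregenz", "14.05.1989"),
])
conn.executemany("INSERT INTO lived_in VALUES (?, ?)", [
    (1, "Wien"), (1, "Linz"), (2, "Linz"), (2, "Salzburg"),
])

# The repeating group is gone; a join reassembles the places per person.
anna_places = [row[0] for row in conn.execute(
    "SELECT place FROM lived_in JOIN person USING (person_id) "
    "WHERE name = 'Anna Gruber' ORDER BY place"
)]
```

No person data is repeated per place lived, yet a single join recovers the full picture.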
- 2NF — requires that no non-key attribute depends on just a part of a (compound) primary key. If such a partial dependency is found, the table must be split: the dependent attribute(s) are moved to a separate table together with the portion of the primary key they depend on.
- non-compliant
Person | Born in | Born on | Person ID | Born in Land | Mother’s name | Mother born in | Mother born on |
Anna Gruber | Klagenfurt | 12.10.1967 | 1 | Kärnten | Angela Bauer | Salzburg | 30.02.1948 |
Betty Huber | Bregenz | 14.05.1989 | 2 | Vorarlberg | Regina Wagner | Graz | 22.11.1970 |
- compliant — the fields (Born in Land) that depend on only a portion of the primary key (Born in) are moved to a separate table.
Person | Born in | Born on | Person ID | Mother’s name | Mother born in | Mother born on |
Anna Gruber | Klagenfurt | 12.10.1967 | 1 | Angela Bauer | Salzburg | 30.02.1948 |
Betty Huber | Bregenz | 14.05.1989 | 2 | Regina Wagner | Graz | 22.11.1970 |
This assumes that the city name, and only the city name, is a primary key in the cities table below (which is rarely the case in practice).
City | Land |
Klagenfurt | Kärnten |
Bregenz | Vorarlberg |
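The 2NF split can be sketched the same way. Note that, as the text says, a bare city name as primary key is a simplification; it is kept here only to mirror the example (sqlite3 again stands in for PostgreSQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified: the city name alone acts as the key of the lookup table.
conn.execute("CREATE TABLE cities (city TEXT PRIMARY KEY, land TEXT)")
conn.execute("CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT, born_in TEXT REFERENCES cities)")
conn.executemany("INSERT INTO cities VALUES (?, ?)",
                 [("Klagenfurt", "Kärnten"), ("Bregenz", "Vorarlberg")])
conn.executemany("INSERT INTO person VALUES (?, ?, ?)",
                 [(1, "Anna Gruber", "Klagenfurt"), (2, "Betty Huber", "Bregenz")])

# The land is now stored once per city and recovered with a join.
lands = dict(conn.execute(
    "SELECT name, land FROM person JOIN cities ON person.born_in = cities.city"
))
```

Changing which Land a city belongs to now means updating a single row in cities, with no risk of contradictory copies.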
- 3NF — requires that there be no transitive dependency on the primary key (that is, a non-key field must not depend on the primary key only through another non-key field). If the values of two or more fields always appear together, they should be moved to a separate table with one of them as its key, in order to avoid repetition of information.
- non-compliant
Person | Born in | Born on | Person ID | Mother’s name | Mother born in | Mother born on |
Anna Gruber | Klagenfurt | 12.10.1967 | 1 | Angela Bauer | Salzburg | 30.02.1948 |
Betty Huber | Bregenz | 14.05.1989 | 2 | Regina Wagner | Graz | 22.11.1970 |
- compliant — since a “mother” is also a “person,” the fields (Mother born in and Mother born on) that depend transitively (through Mother’s name) on the primary key can be moved to separate rows of the same table.
Person | Born in | Born on | Person ID | Mother ID |
Anna Gruber | Klagenfurt | 12.10.1967 | 1 | 3 |
Betty Huber | Bregenz | 14.05.1989 | 2 | 4 |
Angela Bauer | Salzburg | 30.02.1948 | 3 | «null» |
Regina Wagner | Graz | 22.11.1970 | 4 | «null» |
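The self-referencing design above can be verified with a self-join. A minimal sqlite3 sketch mirroring the table (dates kept verbatim as text, as in the example; PostgreSQL would accept the same self-referencing foreign key):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name      TEXT,
        born_in   TEXT,
        born_on   TEXT,
        mother_id INTEGER REFERENCES person
    )
""")
conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?, ?)", [
    (1, "Anna Gruber", "Klagenfurt", "12.10.1967", 3),
    (2, "Betty Huber", "Bregenz", "14.05.1989", 4),
    (3, "Angela Bauer", "Salzburg", "30.02.1948", None),
    (4, "Regina Wagner", "Graz", "22.11.1970", None),
])

# A self-join recovers the mother's details that were previously duplicated
# in every child's row.
anna_mother = conn.execute(
    "SELECT m.name, m.born_in FROM person p "
    "JOIN person m ON p.mother_id = m.person_id WHERE p.name = 'Anna Gruber'"
).fetchone()
```

Each mother’s birth details now live in exactly one row, referenced by Mother ID.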
- BCNF — Boyce-Codd Normal Form aka 3.5NF is just a bit of a stronger version of 3NF; often, a table in 3NF is also compliant with BCNF.
It requires that no part of a primary key be functionally dependent on a non-key attribute. The difference is quite subtle — until now I have been talking about non-key attributes that depend on key attributes; now it is the other way round.
- non-compliant — Born in is part of the primary key and functionally depends on non-key attribute Birth Post Code
Person | Born in | Born on | Person ID | Mother ID | Birth Post Code |
Anna Gruber | Klagenfurt | 12.10.1967 | 1 | 3 | 9073 |
Betty Huber | Bregenz | 14.05.1989 | 2 | 4 | 6900 |
Angela Bauer | Salzburg | 30.02.1948 | 3 | «null» | 5020 |
Regina Wagner | Graz | 22.11.1970 | 4 | «null» | 8032 |
- compliant — attribute “Birth Post Code” replaces “Born in” in the primary key. Also, the structure of the Cities table is changed (and is much more realistic). Note that with this change, the data are correctly normalized, but less readable.
Person | Birth Post Code | Born on | Person ID | Mother ID |
Anna Gruber | 9073 | 12.10.1967 | 1 | 3 |
Betty Huber | 6900 | 14.05.1989 | 2 | 4 |
Angela Bauer | 5020 | 30.02.1948 | 3 | «null» |
Regina Wagner | 8032 | 22.11.1970 | 4 | «null» |
Post Code | City | Land |
9073 | Klagenfurt | Kärnten |
6900 | Bregenz | Vorarlberg |
5020 | Salzburg | Salzburg |
8032 | Graz | Steiermark |
This is usually where information about normal forms ends, particularly because the remaining normal forms are the ones any practitioner would usually not be able to quote from memory, and are generally considered…
not-so relevant
…for real-world uses. This isn’t entirely true, though, because some of them are already being used in design, even if it’s not on purpose.
These forms were developed later, building on Edgar F. Codd’s initial work. They describe further situations in which tables should be decomposed. I’ve provided some examples below:
- 4NF — requires that there be no multivalued dependencies on anything other than superkey attributes.
- non-compliant — this table has only one (all-column) key and no non-prime attributes, but assuming that each trainer delivers every one of his courses in every language he speaks, adding a new course requires inserting multiple rows. Note, however, that if this assumption is false (e.g., some trainers deliver certain courses only in particular languages), we cannot stop here and must go one step further, to 5NF.
Trainer | Course | Language |
Michal | PostgreSQL Professional | EN |
Michal | PostgreSQL Professional | PL |
Pavlo | PostgreSQL Professional | EN |
Pavlo | PostgreSQL Professional | SK |
Michal | Administration & Performance Tuning | EN |
Michal | Administration & Performance Tuning | PL |
Hans | Administration & Performance Tuning | EN |
Hans | Administration & Performance Tuning | DE |
Hans | PostgreSQL Professional | EN |
Hans | PostgreSQL Professional | DE |
- compliant — table is split in two
Trainer | Course |
Michal | PostgreSQL Professional |
Michal | Administration & Performance Tuning |
Pavlo | PostgreSQL Professional |
Hans | PostgreSQL Professional |
Hans | Administration & Performance Tuning |
Trainer | Language |
Michal | EN |
Michal | PL |
Pavlo | EN |
Pavlo | SK |
Hans | EN |
Hans | DE |
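That this split is lossless can be checked directly: joining the two tables back together on Trainer must reproduce the original rows exactly. A small sketch in plain Python sets (the set comprehension plays the role of a natural join):

```python
# The two tables of the compliant 4NF design.
trainer_course = {
    ("Michal", "PostgreSQL Professional"),
    ("Michal", "Administration & Performance Tuning"),
    ("Pavlo", "PostgreSQL Professional"),
    ("Hans", "PostgreSQL Professional"),
    ("Hans", "Administration & Performance Tuning"),
}
trainer_language = {
    ("Michal", "EN"), ("Michal", "PL"),
    ("Pavlo", "EN"), ("Pavlo", "SK"),
    ("Hans", "EN"), ("Hans", "DE"),
}

# Natural join on Trainer: every (trainer, course) pairs with every
# (trainer, language) of the same trainer.
recomposed = {
    (trainer, course, language)
    for trainer, course in trainer_course
    for trainer2, language in trainer_language
    if trainer == trainer2
}
```

The join yields exactly the ten rows of the non-compliant table — nothing is lost, and adding a course for a trainer is now a single insert.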
- 5NF — requires that there be no join dependencies other than those implied by the candidate keys. In other words, a table is in 5NF if it cannot be losslessly split (reduced) any further into smaller tables with different (smaller) keys without losing information.
- non-compliant — let’s use the table from the example above, but insert additional data: Pavlo started to offer an additional course, but does not have all materials in all languages yet
Trainer | Course | Language |
Michal | PostgreSQL Professional | EN |
Michal | PostgreSQL Professional | PL |
Pavlo | PostgreSQL Professional | EN |
Pavlo | PostgreSQL Professional | SK |
Pavlo | Administration & Performance Tuning | EN |
Michal | Administration & Performance Tuning | EN |
Michal | Administration & Performance Tuning | PL |
Hans | Administration & Performance Tuning | EN |
Hans | Administration & Performance Tuning | DE |
Hans | PostgreSQL Professional | EN |
Hans | PostgreSQL Professional | DE |
- compliant — one additional table must be introduced.
Trainer | Course |
Michal | PostgreSQL Professional |
Michal | Administration & Performance Tuning |
Pavlo | PostgreSQL Professional
Pavlo | Administration & Performance Tuning
Hans | PostgreSQL Professional |
Hans | Administration & Performance Tuning |
Trainer | Language |
Michal | EN |
Michal | PL |
Pavlo | EN |
Pavlo | SK |
Hans | EN |
Hans | DE |
Course | Language |
PostgreSQL Professional | EN |
PostgreSQL Professional | DE |
PostgreSQL Professional | PL |
PostgreSQL Professional | SK |
Administration & Performance Tuning | EN |
Administration & Performance Tuning | DE |
Administration & Performance Tuning | PL |
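Again, the losslessness of the three-way split can be checked mechanically. Note that for the join to work, the Trainer/Course table must also contain the (Pavlo, Administration & Performance Tuning) pair introduced by the new data. A sketch in plain Python sets:

```python
trainer_course = {
    ("Michal", "PostgreSQL Professional"),
    ("Michal", "Administration & Performance Tuning"),
    ("Pavlo", "PostgreSQL Professional"),
    ("Pavlo", "Administration & Performance Tuning"),
    ("Hans", "PostgreSQL Professional"),
    ("Hans", "Administration & Performance Tuning"),
}
trainer_language = {
    ("Michal", "EN"), ("Michal", "PL"),
    ("Pavlo", "EN"), ("Pavlo", "SK"),
    ("Hans", "EN"), ("Hans", "DE"),
}
course_language = {
    ("PostgreSQL Professional", "EN"), ("PostgreSQL Professional", "DE"),
    ("PostgreSQL Professional", "PL"), ("PostgreSQL Professional", "SK"),
    ("Administration & Performance Tuning", "EN"),
    ("Administration & Performance Tuning", "DE"),
    ("Administration & Performance Tuning", "PL"),
}

# Three-way join: a (trainer, course, language) row exists only if all
# three pairwise combinations exist. Pavlo teaches Administration &
# Performance Tuning in EN but not in SK, because that course/language
# pair is absent.
recomposed = {
    (t, c, l)
    for t, c in trainer_course
    for t2, l in trainer_language
    if t == t2 and (c, l) in course_language
}
```

The three-way join reproduces exactly the eleven rows of the non-compliant table, including the exception that a two-way split could not express.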
- DKNF — requires that there are no constraints other than those expressed by domain or key constraints. In other words, constraints must be precisely defined, there is no place for an arbitrary or fuzzy choice for constrained values.
- non-compliant — the table does not contain any precise indication about how to select values for the columns Length and Pace
Course | Hours | Days | Length | Pace |
PostgreSQL Professional | 15 | 3 | normal | standard |
Administration & Performance Tuning | 25 | 5 | long | standard |
High Availability & Patroni | 24 | 3 | normal | intense |
- compliant — the values for Length and Pace are precisely constrained; note that in order to determine the pace of a course, an additional calculation (hours per day) must be performed
Course | Hours | Days |
PostgreSQL Professional | 15 | 3 |
Administration & Performance Tuning | 25 | 5 |
High Availability & Patroni | 24 | 3 |
Length | min days | max days |
short | 1 | 1 |
normal | 2 | 3 |
long | 4 | 5 |
Pace | min hours per day | max hours per day |
introductory | 3 | 4 |
standard | 5 | 5 |
intense | 6 | 8 |
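The “additional calculation” mentioned above can be sketched as range lookups derived from the two constraint tables. This is a minimal illustration (integer division is used because hours divide evenly by days in the sample data):

```python
# Length and Pace are fully determined by Days and Hours through the two
# constraint tables above, expressed here as inclusive ranges.
LENGTH_RANGES = {"short": (1, 1), "normal": (2, 3), "long": (4, 5)}
PACE_RANGES = {"introductory": (3, 4), "standard": (5, 5), "intense": (6, 8)}

def classify(ranges: dict, value: int) -> str:
    for name, (low, high) in ranges.items():
        if low <= value <= high:
            return name
    raise ValueError(f"no range matches {value}")

courses = [
    ("PostgreSQL Professional", 15, 3),
    ("Administration & Performance Tuning", 25, 5),
    ("High Availability & Patroni", 24, 3),
]
derived = {
    name: (classify(LENGTH_RANGES, days), classify(PACE_RANGES, hours // days))
    for name, hours, days in courses
}
```

Because Length and Pace are now computed from precisely defined ranges, there is no room for an arbitrary or fuzzy choice — which is exactly what DKNF demands.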
- 6NF — requires that a row contain no more than one non-key attribute in addition to the primary key. As impractical as this normal form might seem, it has its applications in e.g. data warehouses and very sparse schemas, and it is most efficient in so-called columnar storage engines. For typical OLTP, 6NF is not very practical and is too close for comfort to the dreaded EAV antipattern.
About Enteros
Enteros offers a patented database performance management SaaS platform. It proactively identifies root causes of complex business-impacting database scalability and performance issues across a growing number of clouds, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.