Normalization by Osmosis
After the theoretical introduction in part 1, it is time to talk about some real-world examples. I’ll begin by discussing the…
type modeling
…of the data stored in rows, columns, etc. It is crucial to understand that, at the time Codd developed the base three normal forms (see part 1 for more information), the types available in widely used programming languages were a relatively primitive set, which influenced how people thought about the data types a database should support. Although the relational model does not by itself imply any type assumptions, the reality of using PostgreSQL today, with its amazingly wide range of available data types, is very different. Which of these types can actually be considered simple, particularly when we ask how many of them really are…
atomic structures
…and when and why it is appropriate to treat them as such?
As the most extreme example of such structures, consider simple textual data: the char, varchar, and text types. They can be viewed as plain arrays of bytes reachable by offset (hello, C); once we take multibyte encodings such as UTF-8 into account, they become arrays of characters. Treating text that way is not very practical, but it is quite possible thanks to PostgreSQL’s extensive set of operators.
A database provides the ability to perform equality/inequality comparisons and some other scalar operations, as well as the ability to use indices to speed those operations up.
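As a minimal sketch of that baseline (the note table and its columns are made up for illustration):

```sql
-- Text treated as a plain scalar: comparisons, ordering, and access by offset.
CREATE TABLE note (
    id   bigint PRIMARY KEY,
    body text
);
CREATE INDEX note_body_idx ON note (body);       -- ordinary btree index

SELECT * FROM note WHERE body = 'hello';         -- scalar comparison, can use the index
SELECT substring(body FROM 1 FOR 10) FROM note;  -- the "array of characters" reachable by offset
SELECT * FROM note ORDER BY body LIMIT 10;       -- ordering according to the column collation
```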
However, PostgreSQL goes further:
Pattern matching, including regular expressions, can be used to process and constrain textual data, and these operations can be sped up with GIN indices built on trigrams. Fuzzy matching (using trigrams or the Levenshtein distance) can answer questions in a more nuanced way than a plain true/false. Domains can be defined on top of a textual type, giving it much more internal structure. In other words, the database starts to understand its data better.
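A sketch of a few of these capabilities in one place (pg_trgm and fuzzystrmatch ship with PostgreSQL as contrib extensions; the customer table and the email domain are made-up examples):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;        -- trigram operators and index support
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;  -- levenshtein() and friends

CREATE TABLE customer (
    id   bigint PRIMARY KEY,
    name text
);

-- a trigram GIN index speeds up LIKE/ILIKE, regular expressions and similarity search
CREATE INDEX customer_name_trgm_idx ON customer USING gin (name gin_trgm_ops);

SELECT * FROM customer WHERE name ~* 'sm[iy]th';   -- regular expression matching

-- fuzzy matching: a similarity score instead of a plain true/false
SELECT name, similarity(name, 'Jon Smith') AS score
FROM customer
ORDER BY score DESC
LIMIT 5;

SELECT levenshtein('Jon Smith', 'John Smith');     -- edit distance: 1

-- a domain gives a plain text type more internal structure
CREATE DOMAIN email AS text
    CHECK (VALUE ~ '^[^@[:space:]]+@[^@[:space:]]+$');
```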
Further, a full-text search mechanism can be defined over textual data, converting the array of characters into information that has a language and some discernible data items (email addresses, phone numbers, prices, words), some of which have additional structure that can be processed: a base form that can be derived from them, or synonyms pointing to other base forms.
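For instance, a sketch of full-text search over a hypothetical document table (the generated column requires PostgreSQL 12 or later):

```sql
CREATE TABLE document (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    body text,
    -- the raw text turned into language-aware, searchable data
    tsv  tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED
);
CREATE INDEX document_tsv_idx ON document USING gin (tsv);

-- words are reduced to their base forms, so 'normalization' also matches 'normalize'
SELECT id
FROM document
WHERE tsv @@ to_tsquery('english', 'normalization & database');
```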
Other “base” types are enriched in the same way. For example, interval, date, and timestamp are just plain scalars, yet they come with so many operators that their complex and asymmetric internal structure can be processed naturally and accessed easily.
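A few illustrative operators that expose this internal structure:

```sql
SELECT
    timestamptz '2025-03-31 12:00+00'
      - timestamptz '2025-01-01 00:00+00'       AS elapsed,      -- yields an interval
    date_trunc('month', date '2025-03-31')      AS month_start,
    extract(isodow FROM date '2025-03-31')      AS day_of_week,  -- 1 = Monday
    date '2025-01-31' + interval '1 month'      AS end_of_feb;   -- 2025-02-28, not March 3rd
```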
And then there are the types that PostgreSQL openly defines as complex, blatantly violating 1NF: starting in the twilight zone with geometric types, moving into the realm of range types defined over scalar types, and then boldly moving on (in chronological order) to arrays, composite types, hstore, XML, and JSON, all of which come with extraordinarily rich sets of operators, operator classes, and index methods that let you process them quickly.
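A sketch of a few of them in action (the table names are invented; btree_gist ships with PostgreSQL and is needed here for the exclusion constraint):

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;

-- a range type used inside a perfectly relational design: the database itself
-- guarantees that bookings for the same room never overlap
CREATE TABLE room_booking (
    room_id int,
    during  tstzrange,
    EXCLUDE USING gist (room_id WITH =, during WITH &&)
);

-- arrays and JSON documents with operator classes and GIN index support
CREATE TABLE product (
    id    bigint PRIMARY KEY,
    tags  text[],
    attrs jsonb
);
CREATE INDEX product_tags_idx  ON product USING gin (tags);
CREATE INDEX product_attrs_idx ON product USING gin (attrs);

SELECT id FROM product WHERE tags  @> ARRAY['sale'];        -- array containment
SELECT id FROM product WHERE attrs @> '{"color": "red"}';   -- jsonb containment
```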
In the end, it doesn’t matter much whether data elements (fields) are truly atomic; what matters is whether they are…
opaque structures
…not reachable from the database. An opaque structure (or file) is one whose content the database kernel neither “understands” nor has any particular tools to process. There are two typical archetypes. First, a JPG image stored in a bytea field: to the database kernel it is nothing but a string of bytes. Second, a CSV file stored in a text field: the database can run a full-text search over it, access words inside the column using string processing functions, and sort the column according to text collation, but it has no “idea” that the field stores a collection of records with meaningful syntax.
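To make the distinction concrete, here is a sketch (the table and column names are illustrative) contrasting the same records stored as an opaque blob of text with a structure the server understands:

```sql
-- a whole CSV file in a text column: just a string to the database kernel
CREATE TABLE import_raw (
    id      bigint PRIMARY KEY,
    payload text
);

-- the same rows parsed into jsonb: a structure the kernel can query and index
CREATE TABLE import_parsed (
    id     bigint PRIMARY KEY,
    record jsonb
);
CREATE INDEX import_parsed_record_idx ON import_parsed USING gin (record);

SELECT id FROM import_raw    WHERE payload LIKE '%Smith%';              -- string search only
SELECT id FROM import_parsed WHERE record @> '{"last_name": "Smith"}';  -- query by meaning
```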
For PostgreSQL, types like XML and JSON are certainly not opaque structures, and by storing such data as plain text we use only a portion of the server’s capabilities, essentially wasting the opportunity to optimize performance and simplify application code. In any case, it is important to distinguish between two use cases:
- A complex type is used as an extension to the relational model and treated like an object accessed through a well-defined interface. This certainly covers the use of geometric and range types, and to some extent arrays, hstore, XML, and JSON, the latter on the condition that they are treated as documents with specific, distinguished properties. In this situation none of these types materially alters the relational data model, because they all fit neatly within the normalization rules.
- A complex type is used in place of a relational model, particularly when arrays, hstore, or JSON are used to build structures that could just as well be modeled as tables and columns.
Whatever the justification for the second approach (and there are undoubtedly valid reasons, which I will discuss in the third part), the decision must be a deliberate one, because it comes at a cost. In these situations the relational principles are broken, and 1NF obviously cannot apply:
- Taking advantage of the built-in optimizer becomes much more difficult, if not impossible.
- Manipulating the data is more laborious.
- The access language (such as XPath for XML or jsonpath for JSON) is very different from SQL, as the short sketch after this list illustrates.
- Enforcing a schema is very difficult.
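For instance, a small illustration of how different that access language looks (the values are made up):

```sql
-- jsonpath embedded in SQL: a second language with its own syntax and semantics
SELECT jsonb_path_query(
    '{"items": [{"name": "widget", "price": 150},
                {"name": "bolt",   "price": 2}]}'::jsonb,
    '$.items[*] ? (@.price > 100).name'
);
-- returns "widget"
```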
PostgreSQL makes it easy to build a hybrid data model that combines relational and non-relational approaches; I will cover hybrid data models in the next part of this series. Even where the formal approach cannot be applied directly, its goal still stands: normalization exists to reduce data duplication and unintentional dependencies, and thereby to minimize the possibility of anomalies (inconsistencies and errors).
To wrap up this discussion of real-world data normalization examples, I want to look at a few typical scenarios…
when following the book makes your life harder
…in straightforward situations, such as choosing how to model structures at the level of one (or a few) columns.
It is often useful to keep in mind that when trying to represent different real-life objects in a database…
People do not have primary keys
…and not just real people, either. From the perspective of modeling, some things that appear to have a structure that is easily recognizable and that everyone intuitively understands might not actually require that structure:
- It might not matter for that specific application.
- It might not be sufficiently general to address all actual and anticipated use cases in all situations.
And I’m not just talking about some requirements that might come up later; I’m specifically referring to requirements that the designer is already aware of but that could be obscured by assumptions and preconceptions.
Several issues arise when this occurs, including:
- excessively structuring or disassembling something into its component parts
- imposing excessively onerous or unnecessary restrictions, especially with regard to the data format
- limiting a valid set of values by imposing insufficiently flexible dictionaries or enumerations
All of them have a particularly bad habit of surfacing when the set of values the database has to accept turns out to be a little more diverse than the designer anticipated. Examples include:
- Keeping track of people’s names (and titles or salutations) frequently tempts the designer to decompose them into separate fields for the given name and family name, or worse, to add a separate field for a middle name (or initial) and a dictionary of allowed salutations. To add insult to injury, upper and lower length limits are imposed on all those fields, which are then constrained to an allowed set of characters (for example, excluding whitespace).
- What if someone has two given names and two surnames, as is common in Spain or Portugal?
- What if the order is reversed (family name first, given name last, as in Hungary)?
- What if, as in Iceland, the very idea of a surname is completely foreign to that particular person?
- and so on…
- Storing addresses is the same story, only a little more complicated, because it depends even more on cultural and national conventions than personal names do. Even when the model will most probably contain data from mostly one geographical location, that rarely implies exclusivity, and even within one location the assumptions about which components are required, which are allowed, and in what order can vary greatly.
- Keeping dictionaries of categories for people or places, such as titles, salutations, the type of place one lives or works in, or the proverbial gender assignment limited to M/F, touches on matters that can be very personal, specific to a relatively small cultural circle, or highly dependent on nationality or language.
All of these measures, which in theory should aid in discovering and modeling the data structures within a database, can in practice make the data less useful and more error-prone, impose a lot of work on ever more elaborate data validation and sanitization rules, and ultimately lead to a much less stable schema than anyone would prefer.
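One possible way out, sketched below under the assumption that the application only needs to display a name and keep a few optional, loosely structured extras (the table and column names are hypothetical):

```sql
CREATE TABLE person (
    id           bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    display_name text NOT NULL,                       -- whatever the person says their name is
    details      jsonb NOT NULL DEFAULT '{}'::jsonb   -- salutation, locale-specific parts, etc.
);
```

Whether something like this is appropriate depends entirely on the use case; the point is that any decomposition should follow an actual, stated need rather than habit.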
That said, there are cases where detailed modeling is genuinely required:
- legal or compliance requirements: when regulations dictate the way certain data must be stored;
- highly specific applications: when there is a genuine business need, openly expressed by the data’s users and uses, to decompose data in a particular way (such as storing data needed to provide medical services, or processing full legal names);
- compatibility (often backward compatibility): when the need to exchange data with third parties imposes the model structure (such as exchanging location information with a local cadaster or land registry, or with an airline, where gender information actually is mandatory and limited to “M” and “F”).
None of those, however, will make the data simpler or more natural to use.
The advantages of a well-normalized schema should therefore be weighed against the project’s practical considerations and the designer’s capacity to support that work; otherwise, it could easily become an exercise in…
how to make developers (and often users) hate you
…with unmaintainable, underperforming code giving everyone involved a negative overall experience. It is frequently a fruitless exercise, too: being able to compute the distribution of family names by the building floor people live on, or to determine how many users have Bettina as a second name, might be an interesting possibility, but it is rarely needed (unless the model is meant to support a national census application).
So, before applying data normalization to a model, I’d strongly advise the following:
- Remember that the answers to the questions “why?”, “what is the purpose of storing this data?”, and “how is it going to be processed?” usually belong to the owner of the actual use case, not to the modeler.
- Be wary of the modeler’s personal experience and preconceived notions, especially when modeling natural concepts and objects about which other people may have their own opinions. This is not to discount the value of years of practical experience, but rather to emphasize the importance of a fresh look and an open mind in each modeling task.
- Less is often more: concentrating on the core business need, naming an object and its properties truthfully, describing their true purposes, and resisting the urge to collect as much detail as possible results in more maintainable data of higher quality.
- Strive for simplicity and store as little data as is necessary to accomplish the business purpose of the application, especially since, should the need arise to store more or in greater detail, PostgreSQL allows you to store semi-structured, non-relational data that can be made relational later if necessary, as the sketch below shows.
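As an illustration of that last point, semi-structured data can be unpacked into rows and columns on demand (the values and column names are made up):

```sql
SELECT t.*
FROM jsonb_to_recordset(
    '[{"sku": "A-1", "qty": 3},
      {"sku": "B-7", "qty": 1}]'::jsonb
) AS t(sku text, qty int);
--  sku | qty
--  A-1 |   3
--  B-7 |   1
```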
Now that I’ve given you an overview of practical examples of data normalization in PostgreSQL, the next blog in this series will cover building a hybrid data model that combines relational and non-relational approaches, so keep an eye out for that. In PostgreSQL, hybrid data modeling is not only feasible but also well supported.
About Enteros
Enteros offers a patented database performance management SaaS platform. It proactively identifies root causes of complex business-impacting database scalability and performance issues across a growing number of clouds, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.