Type modeling
Even though the relational model itself makes no assumptions about data types, the reality of using PostgreSQL today, with its huge number of data types, is very different. With so many types available, which of them can truly be treated as simple…
atomic structures
…and when and why is it appropriate to do so?
Consider simple textual data types (char, varchar, and text) as the most extreme example of such structures. They can be viewed as plain arrays of bytes accessible by offset (hello, C) or, once multibyte encodings such as UTF-8 are taken into account, as arrays of characters. Treating them that way is not very practical, but thanks to PostgreSQL’s extensive set of operators it is still possible.
A database provides equality and inequality comparisons and some other scalar operations, as well as the ability to use indices to speed those operations up.
PostgreSQL goes further, though:
- Pattern matching (for example with regular expressions) can be used to process and constrain text data, and GIN indices built on trigrams can speed such operations up.
- Fuzzy matching (trigram similarity or Levenshtein distance) lets questions be answered in shades finer than plain true or false.
- Domains can be defined over an ordinary textual type, giving it much more internal structure; the database then begins to understand its data better.
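A minimal sketch of these capabilities; the customers table, its email column, and the email_address domain are made up purely for illustration:

```sql
-- Hypothetical table, used only to illustrate the operators above.
CREATE TABLE customers (
    id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    email text NOT NULL
);

-- Trigram-based GIN index (pg_trgm extension) accelerates LIKE/regex matching.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX customers_email_trgm_idx
    ON customers USING gin (email gin_trgm_ops);

-- Fuzzy matching: an answer in degrees of similarity, not just true/false.
SELECT email, similarity(email, 'jon.smith@example.com') AS score
FROM customers
WHERE email % 'jon.smith@example.com'   -- pg_trgm similarity operator
ORDER BY score DESC;

-- A domain over text gives the "plain" type internal structure the server enforces.
CREATE DOMAIN email_address AS text
    CHECK (VALUE ~ '^[^@\s]+@[^@\s]+$');
```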
Further, a full-text search mechanism can be defined over textual data, converting the array of characters into information that has a language and discernible data items (email addresses, phone numbers, prices, words), some of which carry additional structure of their own: a base form can be derived from them, or they can be mapped to synonyms of other base forms.
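For instance, a couple of generic queries, not tied to any schema and assuming an English text-search configuration:

```sql
-- The parser distinguishes token types: plain words, email addresses, numbers, ...
SELECT token, alias
FROM ts_debug('english', 'Mail sales@example.com, the price is 10.50');

-- Words are reduced to base forms (lexemes), so inflected forms still match.
SELECT to_tsvector('english', 'The quick brown foxes jumped')
       @@ to_tsquery('english', 'fox & jump');   -- true: 'foxes' and 'jumped' match their base forms
```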
Even though interval, date, and timestamp are just simple scalars, they come with so many operators that their complex and asymmetric internal structure can be processed naturally and accessed easily. Other “base” types offer similarly extended capabilities.
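A few illustrative expressions, not tied to any schema:

```sql
SELECT date '2024-02-29' + interval '1 year';               -- calendar-aware arithmetic, leap day handled
SELECT age(timestamp '2024-01-15', timestamp '1990-06-30'); -- a human-readable interval
SELECT extract(dow FROM date '2024-01-15');                 -- reach inside the scalar: day of week
SELECT date_trunc('month', now());                          -- truncate to the month boundary
```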
Then there are the types that PostgreSQL openly defines as complex, flagrantly violating 1NF: starting in the limbo of geometric types, through range types defined over scalar types, and then boldly (in chronological order) arrays, composite types, hstore, XML, and JSON, all of which come with incredibly rich sets of operators, operator classes, and index methods that let you process them efficiently.
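A brief sketch of that richness; the room_booking and event tables are invented for this example:

```sql
-- Range type: the server understands overlap, so an exclusion constraint
-- can enforce "no double booking" (btree_gist is needed for the = part).
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE TABLE room_booking (
    room_id int,
    during  tstzrange,
    EXCLUDE USING gist (room_id WITH =, during WITH &&)
);

-- jsonb: rich operators plus GIN indexing for containment queries.
CREATE TABLE event (
    id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    payload jsonb NOT NULL
);
CREATE INDEX event_payload_idx ON event USING gin (payload);

SELECT payload->>'user_id'
FROM event
WHERE payload @> '{"type": "login"}';   -- containment, served by the GIN index
```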
Whether data elements (fields) are truly atomic or not is ultimately immaterial; what really matters is whether they are at the same time
opaque structures
not reachable from within the database. A structure (or file) is opaque if the database kernel does not “understand” what is inside it and has no dedicated tools for working with it. Two typical archetypes: first, a JPG image stored in a bytea field, which to the database kernel is nothing more than a string of bytes; second, a CSV file stored in a text field, where the database can run a full-text search, use string functions to reach the words inside the column, and sort it according to text collation, yet has no “idea” that the field holds a collection of records with a meaningful syntax.
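A minimal sketch of the difference, with hypothetical table names:

```sql
-- Opaque: the server sees only characters; any structure lives in application code.
CREATE TABLE import_raw (line text);
SELECT split_part(line, ',', 2) FROM import_raw;   -- string surgery, no real schema

-- Not opaque: the server understands the structure and can reach inside it.
CREATE TABLE import_doc (doc jsonb);
SELECT doc->'customer'->>'name'
FROM import_doc
WHERE doc @> '{"status": "active"}';
```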
For PostgreSQL, there is no question that structures like XML and JSON are not opaque, and by storing them as plain text we use only a fraction of the server’s features, missing an opportunity to improve performance and simplify application code. In any case, it is critical to differentiate between two use cases:
- A complex type is used as an extension to the relational model and treated like an object accessed through a well-defined interface. This certainly covers the use of geometric and range types, and to some extent arrays, hstore, XML, and JSON, the latter on the condition that they are treated as documents with specific, distinguished properties. In these circumstances, all of those types are compatible with the normalization rules and do not materially change a relational data model.
- A complex type is used in place of a relational model, particularly when arrays, hstore, or JSON are used to build structures that could have been modeled as tables and columns. Whatever the justification for doing so (and there is no doubt there are valid reasons, which I am going to discuss in the third part), this decision must be made consciously, keeping in mind that there is a price to pay:
  - The relational principles are broken, and clearly 1NF cannot be applied.
  - Taking advantage of the built-in optimizer becomes much more difficult, if not impossible.
  - Manipulating the data is more laborious.
  - The access language (such as XPath for XML or jsonpath for JSON) is very different from SQL.
  - Schema enforcement is very challenging.
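A sketch of the two situations; the price_list and orders tables are assumptions made up for this example:

```sql
-- Use case 1: a complex type as an extension of the relational model.
-- The row stays relational; the range is reached through its own operators.
CREATE TABLE price_list (
    product_id bigint    NOT NULL,
    price      numeric   NOT NULL,
    valid      daterange NOT NULL
);
SELECT price
FROM price_list
WHERE product_id = 42 AND valid @> current_date;

-- Use case 2: a complex type instead of the relational model.
-- Orders and their lines could have been tables and columns; here they are one document.
CREATE TABLE orders (
    id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    doc jsonb NOT NULL            -- customer, lines, totals... all inside one value
);
-- Reaching the data now takes jsonpath rather than plain SQL columns:
SELECT jsonb_path_query(doc, '$.lines[*].sku') FROM orders;
```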
PostgreSQL lets you create a hybrid data model that uses both relational and non-relational approaches; I will talk about hybrid data models next. Even where the formal approach cannot be applied directly, the goal of normalization remains valid: cutting down on data duplication and unintended dependencies, and with them the likelihood of errors and other “anomalies”.
Before we wrap up our discussion of concrete examples of data normalization, I want to cover a few common scenarios…
when following the book makes your life harder
…in simple cases, like deciding how to model structures at the level of one (or a few) columns.
When trying to fit real-world objects into a database, it is often helpful to keep in mind that
People do not have primary keys
…and not just real people, either. From a modeling perspective, things that appear to have an easily recognizable structure that everyone intuitively understands might not actually need that structure at all:
- It might not matter for that specific application.
- It might not be sufficiently general to address all actual and anticipated use cases in all situations.
And I’m not just talking about some requirements that might come up later; I’m specifically referring to requirements that the designer is already aware of but that could be obscured by assumptions and preconceptions.
When this occurs, the following problems may arise:
- adding excessive structure, or breaking something down into more component parts than needed
- imposing onerous or unnecessary restrictions, especially on the data format
- restricting a set of acceptable values by implementing inflexible dictionaries or enumerations
All of these have a bad habit of showing up precisely when the set of values the database has to accept turns out to be a little more diverse than the designer anticipated.
a few instances
- Keeping track of people’s names (and titles or salutations) frequently tempts the designer to decompose them into separate fields for the given name and family name or, worse, to add a separate field for a middle name (or initial) and a dictionary of allowed salutations. To add insult to injury, upper and lower length limits are imposed on all those fields, and the fields are constrained to an allowed set of characters (such as excluding whitespace).
- What if someone has two given names and two surnames (as in Spain or Portugal)?
- What about a reversed order (family name first, as in Hungary)?
- What if, like in Iceland, the idea of a surname is completely foreign to that particular person?
- and so forth.
- Storing addresses is the same story, only a little more complicated, because it depends even more on cultural and national conventions than personal names do. Even when the model is likely to contain data mostly from one geographical location, that rarely implies exclusivity, and even within a single location the assumptions about which components are required, which are allowed, and in what order can vary greatly.
- Keeping dictionaries of categories for people or places, such as titles, salutations, the type of place one lives or works in, or the proverbial gender assignment limited to “M”/“F”, can be very personal, specific to a relatively small cultural circle, or highly dependent on nationality and language.
All of these measures, which in theory should help identify and model the database’s data structures, can actually make the data less useful, more error-prone, harder to clean up and validate, and ultimately result in a much less stable schema than anyone would like.
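To make the contrast concrete, here is a sketch of a rigid and a deliberately simple way to store names; both tables are made up for illustration:

```sql
-- Rigidly decomposed, with assumptions baked in: fixed salutations,
-- length limits, "no whitespace in a given name".
CREATE TABLE person_rigid (
    salutation  varchar(5)  CHECK (salutation IN ('Mr', 'Mrs', 'Ms', 'Dr')),
    first_name  varchar(30) NOT NULL CHECK (first_name !~ '\s'),
    middle_name varchar(30),
    last_name   varchar(40) NOT NULL
);

-- Simpler, and more honest about what the application actually needs.
CREATE TABLE person_simple (
    id           bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    display_name text NOT NULL,   -- how this person wants to be addressed
    sort_name    text             -- only if sorting really is a requirement
);
```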
Yes, there are circumstances in which precise modeling is required.
- legal or compliance requirements — when specific regulations place restrictions on the way data must be stored.
- extremely specific applications — when there is a genuine business need, openly expressed by data users and data uses, to decompose data in a specific way (such as storing data needed to provide medical services or processing full legal names).
- compatibility (often backward compatibility) — when the need to exchange data with third parties imposes the model structure (such as exchanging information regarding locations with a local cadaster or land registry, or with an airline, where the information regarding gender is actually mandatory and limited to “M,F”).
But none of those circumstances will make using the data simpler or more natural.
The advantages of a well-normalized schema should therefore be weighed against the project’s practical considerations and the designer’s capacity to support that work; otherwise, it could easily become an exercise in
how to make developers (and often users) hate you
…with unmaintainable, underperforming code, leaving everyone involved with a negative overall experience. Being able to compute the distribution of family names by the building floor their owners live on, or to determine how many users have Bettina as a second name, might be an interesting possibility, but it is usually a fruitless exercise (unless the model is intended to support a national census application).
So, before applying data normalization to a model, I would strongly advise:
- Ask “why”: what is the purpose of storing this data, and how is it going to be processed? The answer often belongs to the author of an actual use case, not to the modeler.
- Be wary of the modeler’s personal experience and preconceived notions, especially when modeling natural concepts and objects about which other people may have their own opinions. This is not to discount the value of years of practical experience, but to emphasize the importance of a fresh look and an open mind with each modeling task.
- Concentrate on the core business need, be truthful when naming an object and its properties, describe their true purposes, and resist the urge to collect as much detail as possible. Less is often more, and it results in more maintainable data of higher quality.
- Strive for simplicity and store as little data as is necessary to accomplish the business purpose of the application, especially since, should the need arise to store more or in more detail, PostgreSQL allows you to store semi-structured, non-relational data that can be made relational later if necessary.
Now that I’ve given you an overview of practical examples of data normalization in PostgreSQL…
Keep an eye out for the next blog post in this series, which will discuss creating a hybrid data model that combines relational and non-relational approaches. In PostgreSQL, hybrid data modeling is not only feasible but also well supported.