Preamble
One of PostgreSQL’s most cherished features is its extensive JSON support. It is widely used, especially by people who have more experience with JavaScript programming than with relational databases. But from my observations, the majority of people don’t use it properly, which leads to issues and unhappiness in the long run.
In this article, I’ll try to highlight both effective and ineffective ways to use JSON in PostgreSQL, and offer some best practices.
A bad example
This data model is a perfect example of all the mistakes you can make:
/* this table is fine */
CREATE TABLE people (
   id bigint PRIMARY KEY,
   name text NOT NULL
);

INSERT INTO people VALUES
   (1, 'laurenz'),
   (2, 'julian'),
   (3, 'ants'),
   (4, 'kaarel');

/* this table is ill-defined */
CREATE TABLE rooms (
   id bigint PRIMARY KEY,
   data jsonb
);

INSERT INTO rooms VALUES
   (1,
    '{ "name": "Room 1",
       "reservations": [
          { "who": 1, "from": "2021-06-01 09:00:00", "to": "2021-06-01 10:00:00" },
          { "who": 3, "from": "2021-06-01 10:00:00", "to": "2021-06-01 11:30:00" },
          { "who": 2, "from": "2021-06-01 13:15:00", "to": "2021-06-01 14:45:00" },
          { "who": 1, "from": "2021-06-01 15:00:00", "to": "2021-06-01 16:00:00" }
       ]
     }'),
   (2,
    '{ "name": "Room 2",
       "reservations": [
          { "who": 2, "from": "2021-06-01 09:30:00", "to": "2021-06-01 10:30:00" }
       ]
     }');
The first error is modeling regular columns as JSON attributes.
There is no justification for the room name not to be a regular column. All rooms will have names, after all, and we might want to impose requirements like uniqueness on the room name.
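A saner definition would pull the name out into a regular, constrained column and keep JSON only for genuinely irregular data. A minimal sketch:

/* "name" as a regular column with a uniqueness constraint */
CREATE TABLE rooms (
   id   bigint PRIMARY KEY,
   name text UNIQUE NOT NULL,
   data jsonb
);

For the rest of this article, though, we’ll stick with the ill-defined table above to see what other problems it causes.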
The second error is modeling tabular data as a JSON array.
The room reservations define a many-to-many relationship between the rooms and the people in a perfectly regular tabular format. The same data could easily have been modeled with a junction table:
CREATE TABLE reservations (
   people_id bigint REFERENCES people NOT NULL,
   room_id bigint REFERENCES rooms NOT NULL,
   reserved tsrange NOT NULL,
   PRIMARY KEY (room_id, people_id, reserved)
);

INSERT INTO reservations VALUES
   (1, 1, '[2021-06-01 09:00:00,2021-06-01 10:00:00)'),
   (3, 1, '[2021-06-01 10:00:00,2021-06-01 11:30:00)'),
   (2, 1, '[2021-06-01 13:15:00,2021-06-01 14:45:00)'),
   (1, 1, '[2021-06-01 15:00:00,2021-06-01 16:00:00)'),
   (2, 2, '[2021-06-01 09:30:00,2021-06-01 10:30:00)');
It seems that many people think that storing a few large rows rather than many small rows is more efficient. There is some truth to that statement given that PostgreSQL compresses large data and that each row has some overhead. But as we will see below, many small rows are much more efficient if you only need to retrieve or modify a portion of the data.
The third error is storing foreign keys in JSON.
The "who" attribute stores a foreign key reference to people. That is not a good idea, because the database cannot enforce such a constraint: I could just as well have inserted a reference to a non-existent person. With the junction table from above, defining a foreign key is easy.
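To illustrate (with a made-up person id 42 that does not exist in people): in the JSON model, nothing stops you from storing such a dangling reference, whereas the junction table rejects it outright:

/* person 42 does not exist, so this fails with a
   foreign key violation instead of corrupting the data */
INSERT INTO reservations VALUES
   (42, 2, '[2021-06-01 17:00:00,2021-06-01 18:00:00)');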
Furthermore, you frequently want to join on foreign keys. With JSON, that requires a cross join with the unnested JSON array:
SELECT rooms.data ->> 'name',
       people.name
FROM rooms
   CROSS JOIN LATERAL jsonb_array_elements(
                         rooms.data -> 'reservations'
                      ) AS res(j)
   JOIN people
      ON res.j ->> 'who' = people.id::text;
With the junction table, the same query is simply:
SELECT rooms.data ->> 'name',
       people.name
FROM rooms
   JOIN reservations AS r ON r.room_id = rooms.id
   JOIN people ON r.people_id = people.id;
Which of these two queries will return results more quickly is probably obvious.
Fourth error: modifying JSON data
If you want to add a new reservation, you have to run a statement like this (the out-of-bounds array index 100000, together with the final TRUE argument, makes jsonb_set append the new element at the end of the array):
UPDATE rooms
SET data = jsonb_set(
              data,
              '{reservations,100000}',
              '{"who": 3, "from": "2021-06-01 11:00:00", "to": "2021-06-01 12:00:00"}',
              TRUE
           )
WHERE id = 2;
This fetches the entire JSON object, constructs a new JSON object from it, and writes the result back to the table. The whole JSON object has to be read and written, which requires more I/O than you would like, particularly if the JSON object is large and stored out of line.
Compare that with the simplicity of the junction table:
INSERT INTO reservations VALUES
   (3, 2, '[2021-06-01 11:00:00,2021-06-01 12:00:00)');
This statement writes very little data.
Deleting a reservation from the JSON array is just as complicated and expensive, and is left as an exercise for the reader.
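For comparison, here is a minimal sketch of the junction-table side of that exercise; removing a reservation is a plain DELETE that touches a single small row:

/* delete the reservation we just added for person 3 in room 2 */
DELETE FROM reservations
WHERE people_id = 3
  AND room_id = 2
  AND reserved = '[2021-06-01 11:00:00,2021-06-01 12:00:00)';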
Fifth mistake: trying to enforce constraints on JSON
So far, our data model doesn’t guard against overlapping reservations, which we certainly want to prevent.
With JSON, we are essentially out of options here. The best that comes to mind is a constraint trigger, but that would require elaborate locking or the SERIALIZABLE transaction isolation level to be free from race conditions. Also, the code would be far from simple.
With the junction table, the exercise is straightforward: all that is required is an exclusion constraint using the overlap operator &&:
CREATE EXTENSION IF NOT EXISTS btree_gist;

ALTER TABLE reservations ADD EXCLUDE USING gist (
   reserved WITH &&,
   room_id WITH =
);
The btree_gist extension is required so that the bigint column room_id can be part of the GiST index.
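To see the constraint at work, a quick sketch with made-up values: the following reservation overlaps the existing 09:00–10:00 booking for room 1, so PostgreSQL should reject it with an exclusion constraint violation:

/* overlaps [2021-06-01 09:00:00,2021-06-01 10:00:00) in room 1,
   so the exclusion constraint rejects this row */
INSERT INTO reservations VALUES
   (4, 1, '[2021-06-01 09:30:00,2021-06-01 09:45:00)');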
Sixth mistake: complicated searches on JSON
Simple equality searches can be performed with the JSON containment operator @>, and such searches can be supported by a GIN index. But complicated searches are a pain.
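For reference, a minimal sketch of the case that works well, using a GIN index on the JSON column and a containment search:

/* a GIN index supports containment (@>) queries on jsonb */
CREATE INDEX ON rooms USING gin (data);

SELECT id
FROM rooms
WHERE data @> '{"name": "Room 2"}';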
Assume we want to find all rooms that are occupied at 2021-06-01 15:30:00. With JSON, the query would look something like this:
SELECT id
FROM rooms
   CROSS JOIN LATERAL jsonb_array_elements(
                         rooms.data -> 'reservations'
                      ) AS elem(j)
WHERE CAST(elem.j ->> 'from' AS timestamp) <= TIMESTAMP '2021-06-01 15:30:00'
  AND CAST(elem.j ->> 'to' AS timestamp) > TIMESTAMP '2021-06-01 15:30:00';
With our junction table, the query becomes:
SELECT room_id
FROM reservations
WHERE reserved @> TIMESTAMP '2021-06-01 15:30:00';
That query can use the GiST index created by the exclusion constraint above.
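If you want to check that for yourself, run the query through EXPLAIN (on a realistically sized table; with only a handful of rows, the planner will prefer a sequential scan):

EXPLAIN
SELECT room_id
FROM reservations
WHERE reserved @> TIMESTAMP '2021-06-01 15:30:00';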
If all the above is wrong, should we use JSON in PostgreSQL at all?
Don’t misunderstand me: PostgreSQL’s JSON support is a wonderful feature. It’s just that many people don’t know how to use it properly. For example, most questions about PostgreSQL and JSON on Stack Overflow concern problems that arise from using JSON where it would have been better avoided.
When thinking about using JSON in PostgreSQL, remember to:
- Don’t use JSON for data that can easily be stored in regular database tables.
- Avoid large JSON objects if you want to modify individual attributes.
- Don’t use JSON if you want to use attributes in complicated WHERE conditions.
- Avoid JSON if you want to join on some of the attributes.
- Don’t use JSON if you want constraints on the data.
- Don’t store references to other tables in JSON attributes.
Often, it may be a good idea to store some attributes in JSON and others in regular table columns. The less you need to process the data inside the database, the better it is to store it as JSON.
A good example
Let’s use the example of a store that sells a variety of mixed goods to demonstrate how JSON in the database can be used to good effect. All or most of the goods share certain attributes, such as price, weight, manufacturer, quantity, or package size. Other attributes may be rare (like the kind of power plug) or apply only to certain goods (like the pitch of screws or tuning forks).
Rather than defining a table with hundreds of columns for all the potential attributes, most of which will be NULL, we model the frequent attributes as regular table columns and use a JSON column for the rest:
CREATE TABLE wares (
   id bigint PRIMARY KEY,
   name text UNIQUE NOT NULL,
   price numeric(15,2) NOT NULL,
   weight_kg double precision NOT NULL,
   available integer NOT NULL CHECK (available >= 0),
   manufacturer text,
   package_size integer NOT NULL CHECK (package_size >= 0),
   attributes jsonb
);

/* to search in "attributes" */
CREATE INDEX ON wares USING gin (attributes);

/* for similarity search on "name" */
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX ON wares USING gin (name gin_trgm_ops);
This enables efficient queries like:
SELECT name, price, available
FROM wares
WHERE name LIKE '%tuning fork%'
  AND attributes @> '{"pitch": "a"}';
Conclusion
When using JSON in PostgreSQL, there are many things that can go wrong, especially if you are unfamiliar with relational databases. However, it can be a powerful tool when used properly.
About Enteros
Enteros offers a patented database performance management SaaS platform. It proactively identifies root causes of complex business-impacting database scalability and performance issues across a growing number of clouds, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.