AVOID "OR" FOR BETTER QUERY PERFORMANCE.

Home > Enteros’ Blog – Thoughts on Database Technology, Machine / Deep learning, and a Generative AI > Technology Business > Machine learning > AVOID “OR” FOR BETTER QUERY PERFORMANCE.

Preamble

Of course, there is a need for the OR operator in SQL, so if you can’t avoid it, use it. But you should be mindful of the effects on performance.

I’ll discuss “good” and “bad” ORs in this article, as well as what you can do to prevent the latter.

A little sample schema

We’ll illustrate using this simple setup:

 CREATE TABLE a(id integer NOT NULL, a_val text NOT NULL); INSERT INTO a SELECT i, md5(i::text) FROM generate_series(1, 100000) i; CREATE TABLE b(id integer NOT NULL, b_val text NOT NULL); INSERT INTO b SELECT i, md5(i::text) FROM generate_series(1, 100000) i; ALTER TABLE a ADD PRIMARY KEY (id); ALTER TABLE b ADD PRIMARY KEY (id); ALTER TABLE b ADD FOREIGN KEY (id) REFERENCES a; VACUUM (ANALYZE) a; VACUUM (ANALYZE) b;

If, for example, we want to run queries on the text columns with equality and LIKE conditions, we need some indexes:

 CREATE INDEX a_val_idx ON a(a_val text_pattern_ops); CREATE INDEX b_val_idx ON b(b_val text_pattern_ops);

If you don’t understand text pattern ops, take a look at the documentation.

The “good” OR

Most of the time, using an OR in a SQL query is OK because it has no detrimental effects on query performance as long as it is not used to filter out rows from the query result.

Therefore, don’t be concerned if your OR occurs in a CASE statement in the SELECT list.

Unfortunately, the WHERE clause is where you typically locate the OR and that hurts.

The “bad” OR

Here is an OR in a WHERE clause example that is still rather nice:

 EXPLAIN (COSTS off) SELECT id FROM a WHERE id = 42 OR a_val = 'value 42'; QUERY PLAN ----------------------------------------------------------- Bitmap Heap Scan on a Recheck Cond: ((id = 42) OR (a_val = 'value 42'::text)) -> BitmapOr -> Bitmap Index Scan on a_pkey Index Cond: (id = 42) -> Bitmap Index Scan on a_val_idx Index Cond: (a_val = 'value 42'::text) (7 rows)

Because PostgreSQL can use a “bitmap OR” to combine the bitmaps for both indexes, it can use an index scan to answer the query.

But keep in mind that since the bitmap needs to be built, a bitmap index scan costs more than a regular index scan. Additionally, each of these bitmaps might require up to a gigabyte of main memory, which means it consumes a lot more RAM.

There is no less expensive way to perform this query because a multi-column index on (id, val) won’t help at all.

IN is better than OR

Now for a more stupid variant of the above query:

 EXPLAIN (COSTS off) SELECT id FROM a WHERE id = 42 OR id = 4711; QUERY PLAN -------------------------------------------- Bitmap Heap Scan on a Recheck Cond: ((id = 42) OR (id = 4711)) -> BitmapOr -> Bitmap Index Scan on a_pkey Index Cond: (id = 42) -> Bitmap Index Scan on a_pkey Index Cond: (id = 4711) (7 rows)

Again, a bitmap index scan is used. But there is a simple method to rewrite that query without the pesky OR:

 EXPLAIN (COSTS off) SELECT id FROM a WHERE id IN (42, 4711); QUERY PLAN --------------------------------------------------- Index Only Scan using a_pkey on a Index Cond: (id = ANY ('{42,4711}'::integer[])) (2 rows)

You see? As soon as you get rid of the OR, an efficient index scan can be used!

You could argue that this is advantageous for equality circumstances, but how about the following issue:

 SELECT id FROM a WHERE a_val LIKE 'something%' OR a_val LIKE 'other%';

To improve that query, observe that the PostgreSQL optimizer rewrote the IN in the previous query to = ANY.

This is a case of the standard SQL “quantified comparison predicate”: <comparison operator> ANY is true if the comparison is TRUE for any of the values on the right-hand side (the standard only defines this for subqueries on the right-hand side, but PostgreSQL extends the syntax to arrays).

Now LIKE is a comparison operator as well, so we can write:

 EXPLAIN (COSTS off) SELECT id FROM a WHERE a_val LIKE ANY (ARRAY['something%', 'other%']); QUERY PLAN ---------------------------------------------------------- Seq Scan on a Filter: (a_val ~~ ANY ('{something%,other%}'::text[])) (2 rows)

Unfortunately, this situation precludes using the index.

pg_trgm to the rescue

Though not yet, we still have some more ideas! There are so many different indexes in PostgreSQL; let’s try one of them. For this, we need the pg_trgm extension:

 CREATE EXTENSION pg_trgm;

Then we can create a GIN trigram index on the column:

 CREATE INDEX a_val_trgm_idx ON a USING gin (a_val gin_trgm_ops);

Now things are looking better:

 EXPLAIN (COSTS off) SELECT id FROM a WHERE a_val LIKE ANY (ARRAY['something%', 'other%']); QUERY PLAN -------------------------------------------------------------------- Bitmap Heap Scan on a Recheck Cond: (a_val ~~ ANY ('{something%,other%}'::text[])) -> Bitmap Index Scan on a_val_trgm_idx Index Cond: (a_val ~~ ANY ('{something%,other%}'::text[])) (4 rows)

Trigram indices are extremely powerful!

Note 1: If the search pattern begins with, this index can also be used.%

2. The GIN index has the potential to grow significantly. You can also use a GiST index, which is much smaller but less effective for searching, to avoid that.

The “ugly” OR

Things become really bad if OR combines conditions from different tables:

 EXPLAIN (COSTS off) SELECT id, a.a_val, b.b_val FROM a JOIN b USING (id) WHERE a.id = 42 OR b.id = 42; QUERY PLAN --------------------------------------------- Merge Join Merge Cond: (a.id = b.id) Join Filter: ((a.id = 42) OR (b.id = 42)) -> Index Scan using a_pkey on a -> Index Scan using b_pkey on b (5 rows)

Here, we have to first do a full join between the two tables and then filter out any rows that don’t meet the criteria. In our example, that would mean adding up 100,000 rows and getting rid of the other 99,999 because they don’t meet the criteria.

Avoiding the ugly OR

Thankfully, there is an equivalent query that takes more time to write but costs a lot less to run:

 EXPLAIN (COSTS off) SELECT id, a.a_val, b.b_val FROM a JOIN b USING (id) WHERE a.id = 42 UNION SELECT id, a.a_val, b.b_val FROM a JOIN b USING (id) WHERE b.id = 42; QUERY PLAN ---------------------------------------------------------- Unique -> Sort Sort Key: a.id, a.a_val, b.b_val -> Append -> Nested Loop -> Index Scan using a_pkey on a Index Cond: (id = 42) -> Index Scan using b_pkey on b Index Cond: (id = 42) -> Nested Loop -> Index Scan using a_pkey on a a_1 Index Cond: (id = 42) -> Index Scan using b_pkey on b b_1 Index Cond: (id = 42) (14 rows)

Since the rows are identical, UNION will combine them into a single record. Both sections of the query can employ effective index scans to return one row.

If you are sure that both parts of the query will return different sets, you should use UNION ALL instead of UNION. This will save you from having to do extra work to get rid of duplicates.

When you use this method, you should be aware that it doesn’t always give you the same query because the UNION would get rid of any identical results that the original query could have given you. Since the primary keys were returned in the query result, we don’t need to be concerned in our instance. In my experience, this rarely becomes an issue in real life.

About Enteros

Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.

The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.

Are you interested in writing for Enteros’ Blog? Please send us a pitch!

Enhancing Identity and Access Management in Healthcare with Enteros

19 November 2024
Database Performance Management

In the fast-evolving world of finance, where banking and insurance sectors rely on massive data streams for real-time decisions, efficient anomaly man…

Maximizing Efficiency with Enteros: Revolutionizing Cost Allocation Through a Cloud Center of Excellence

Database Performance Management

In the fast-evolving world of finance, where banking and insurance sectors rely on massive data streams for real-time decisions, efficient anomaly man…

Driving Efficiency in the Transportation Sector: Enteros’ Cloud FinOps and Database Optimization Solutions

18 November 2024
Database Performance Management

In the fast-evolving world of finance, where banking and insurance sectors rely on massive data streams for real-time decisions, efficient anomaly man…

Empowering Nonprofits with Enteros: Optimizing Cloud Resources Through AIOps Platform

Database Performance Management

In the fast-evolving world of finance, where banking and insurance sectors rely on massive data streams for real-time decisions, efficient anomaly man…

Preamble

A little sample schema

The “good” OR

The “bad” OR

IN is better than OR

pg_trgm to the rescue

The “ugly” OR

Avoiding the ugly OR

About Enteros

RELATED POSTS

Enhancing Identity and Access Management in Healthcare with Enteros

Maximizing Efficiency with Enteros: Revolutionizing Cost Allocation Through a Cloud Center of Excellence

Driving Efficiency in the Transportation Sector: Enteros’ Cloud FinOps and Database Optimization Solutions

Empowering Nonprofits with Enteros: Optimizing Cloud Resources Through AIOps Platform