Preamble
Alternatives for case-insensitive search
There are three recognized techniques for case-insensitive search in PostgreSQL:
Explicit conversion with lower() or upper()
A query that uses this method would look as follows:
SELECT id, col FROM tab WHERE lower(col) = lower('search string');
This can be made fast with a B-tree index on lower(col) (a sketch of such an index follows the list below), but it has two disadvantages:
- The solution is implemented on the application side: every query has to be modified to perform a case-insensitive search.
- If a database column contains long values, the entire value must be converted to lower case even if only a few characters need to be compared, which has a negative impact on performance.
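As a minimal sketch of such an index (reusing the tab and col names from the query above; the index name is my own choice), it could be created like this:
-- B-tree index on the lower-cased expression, so that the
-- lower(col) = lower('search string') comparison can use an index scan
CREATE INDEX tab_col_lower_idx ON tab (lower(col));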
Using the citext extension
The extension citext provides the data type citext, which stands for “case-insensitive text”. The table is defined as
CREATE TABLE tab ( col citext, ... );
The query is straightforward:
SELECT id, col FROM tab WHERE col = 'search string';
That is convenient and easy, but it also has some drawbacks:
- there is no data type civarchar, so a length limit can only be enforced with a check constraint (see the sketch after this list)
- performance can also be bad for longer values, because citext internally calls lower(col COLLATE "default") before comparing the values
- regular expression matching is not case insensitive, so you have to use the case-insensitive operator ~* explicitly
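As a minimal sketch of the check-constraint workaround from the first point (the table layout and the 100-character limit are my own assumptions), it could look like this:
CREATE EXTENSION IF NOT EXISTS citext;

-- there is no civarchar(100), so emulate the length limit with a check constraint
CREATE TABLE tab (
   id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
   col citext CHECK (length(col) <= 100)
);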
Using case-insensitive ICU collations
If you are using PostgreSQL v12 or higher and PostgreSQL was built with --with-icu, you can define a case-insensitive collation like this:
CREATE COLLATION english_ci (
   PROVIDER = icu,
   -- 'en-US@colStrength=secondary' for old ICU versions
   LOCALE = 'en-US-u-ks-level2',
   DETERMINISTIC = FALSE
);
The table is defined as
CREATE TABLE tab ( col text COLLATE english_ci, ... );
And once more, the query is simple:
SELECT id, col FROM tab WHERE col = 'search string';
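As a quick sanity check (a minimal sketch assuming the english_ci collation created above), the comparison really does ignore case:
-- with a case-insensitive (nondeterministic) collation this returns TRUE
SELECT 'search string' = 'SEARCH STRING' COLLATE english_ci;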
Please read my article on ICU collations for more information.
The trouble with pattern matching and case-insensitive collations
Case-insensitive collations are the best solution to the problem. However, this approach has two issues:
- Even though you can use ICU collations as a database’s default collation, that does not work for case-insensitive (nondeterministic) collations, so in PostgreSQL v15 you still have to specify the collation explicitly in each column definition.
- Case-insensitive collations are not compatible with pattern matching.
The first issue is merely annoying; however, the second issue will require further consideration. Here are two instances that illustrate the issue:
SELECT id, long FROM perf_coll WHERE long LIKE 'abcd%';
ERROR:  nondeterministic collations are not supported for LIKE

SELECT id, long FROM perf_coll WHERE long ~ 'abcd';
ERROR:  nondeterministic collations are not supported for regular expressions
Why is that not supported?
German soccer’s difficult situation
The ICU documentation explains the challenges with accurate case-insensitive pattern matching. German’s “ß,” which traditionally lacks an upper-case equivalent, serves as a good illustration. Therefore, with good German collations (the GNU C library’s collation is not good in that regard), you will obtain the following result:
SELECT upper('Fußball' COLLATE "de-DE-x-icu");

  upper
══════════
 FUSSBALL
(1 row)
Now what would be the correct result for the following query in a case-insensitive collation?
SELECT 'Fußball' LIKE 'FUS%';
You could argue that it should be TRUE
because that’s what you’d get for upper('Fußball') LIKE 'FUS%'
. On the other hand,
SELECT lower('FUSSBALL' COLLATE "de-DE-x-icu");

  lower
══════════
 fussball
(1 row)
So you could just as well argue that the result should be FALSE
. The ICU library goes with the second solution for simplicity. Either solution would be hard to implement in PostgreSQL, so we have stopped thinking about it for now. Compare this quote from the mailing list:
ICU also provides regexp matching, but not collation-aware, since
character-based patterns don’t play well with the concept of collation.
About a potential collation-aware LIKE, it looks hard to implement,
since the algorithm currently used in like_match.c seems purely
character-based. AFAICS there’s no way to plug calls to usearch_*
functions into it, it would need a separate redesign from scratch.
A semiotic aside
There is no universally accepted authority for correct German. For example, the German-speaking Swiss did away with the ß in the twentieth century. The “Rat für deutsche Rechtschreibung” (Council for German Orthography), on the other hand, introduced an upper-case ß (ẞ) in 2017, but this attempt to simplify matters has been widely ignored by the German-speaking world. What a ridiculous idea, simplifying German!
The end result is even more confusing than before.
SELECT upper('ß' COLLATE "de-DE-x-icu"),
       lower('ẞ' COLLATE "de-DE-x-icu");

 upper │ lower
═══════╪═══════
 SS    │ ß
(1 row)
A solution for case-insensitive pattern matching
We need a workable solution to the problem. We could of course use lower() or cast to citext, but that would re-introduce the performance problem for long strings. So we sidestep the problem by explicitly using a different collation: for performance, and to avoid the brain-twist of German soccer, we cast to the binary collation "C", which compares strings character by character:
SELECT id, col FROM tab WHERE col COLLATE "C" ILIKE 'search%pattern';

SELECT id, col FROM tab WHERE col COLLATE "C" ~* '^search.*pattern';
Note that for this to work as expected, we have to use the case-insensitive versions of LIKE and of the regular expression matching operator ~.
Obviously, this solution isn’t as simple as we would like. Again, the application has to write queries that make it explicit that case-insensitive processing is desired. Another difficulty is performance: while case-sensitive pattern matching can be supported with B-tree indexes, case-insensitive pattern matching requires a trigram index:
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX tab_col_pattern_idx ON tab USING gin (col gin_trgm_ops);
Such an index can speed up both of the above statements, although it can grow large and is slower to update than a normal B-tree index.
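For contrast, the case-sensitive B-tree index mentioned above could be sketched like this (the index name and the choice of text_pattern_ops are my own; it helps left-anchored LIKE patterns, but not ILIKE or ~*):
-- B-tree index that supports case-sensitive, left-anchored patterns
-- such as WHERE col LIKE 'search%'
CREATE INDEX tab_col_btree_idx ON tab (col text_pattern_ops);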
A performance test for case-insensitive comparisons
For this test, I used tables with ten million rows and a column containing randomly generated strings of 320 characters (a sketch of a possible setup follows below). There was no index on the column. In the table perf_citext, the column is defined as citext. In perf_lower, the data type is text, and the comparison uses lower(). The table perf_coll uses text with the collation english_ci defined above. The database was PostgreSQL v15 with a US English default collation, and all tables were cached in shared buffers.
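The article does not show the test scripts; as a rough sketch of how one of the tables could have been set up (the id column and the random-string generator are my own assumptions), it might look like this:
-- test table with the case-insensitive collation; perf_citext and perf_lower are analogous
CREATE TABLE perf_coll (
   id   bigint GENERATED ALWAYS AS IDENTITY,
   long text COLLATE english_ci
);

-- ten million rows, each with a random 320-character string (10 x 32 hex characters)
INSERT INTO perf_coll (long)
SELECT string_agg(md5(random()::text), '')
FROM generate_series(1, 10000000) AS rows(n),
     generate_series(1, 10) AS parts(m)
GROUP BY n;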
Comparison of the effectiveness of case-insensitive search methods
|            | WHERE ... = 'abcd' | WHERE ... LIKE 'abcd%' | WHERE ... COLLATE "C" ILIKE 'abcd%' | WHERE ... COLLATE "C" ~* 'abcd%' (~ for lower()) |
|------------|--------------------|------------------------|-------------------------------------|--------------------------------------------------|
| citext     | 540 ms             | 536 ms                 | 1675 ms                             | 2500 ms                                          |
| lower()    | 9000 ms            | 9000 ms                | 3000 ms                             | 3800 ms                                          |
| english_ci | 830 ms             | ERROR                  | 2000 ms                             | 1940 ms                                          |
In this test, citext comes out ahead, but case-insensitive collations are a decent runner-up. Using lower() doesn’t perform well.
Conclusion
PostgreSQL does not support pattern matching with case-insensitive collations, and it is not obvious what the correct behavior would be. Explicitly using the case-insensitive pattern matching operators together with the binary collation gives us a good workaround, but the situation is far from ideal.
About Enteros
Enteros offers a patented database performance management SaaS platform. It proactively identifies root causes of complex business-impacting database scalability and performance issues across a growing number of clouds, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.