Preamble
Alternatives for case-insensitive search
There are three recognized techniques for case-insensitive search in PostgreSQL:
Explicit conversion with lower() or upper()
A query that uses this method would look as follows:
SELECT id, col FROM tab WHERE lower(col) = lower('search string');
This can be made fast with a B-tree index on lower(col) (a sketch of such an index follows the list below), but it has two disadvantages:
- The solution is implemented on the application side: every query has to be modified to perform a case-insensitive search.
- If a database column contains long values, the entire value must be converted to lower case even if only a few characters need to be compared, which has a negative impact on performance.
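As a minimal sketch of such an index (reusing the tab and col names from the query above; the index name is my own choice), it could be created like this:
-- B-tree index on the lower-cased expression, so that the
-- lower(col) = lower('search string') comparison can use an index scan
CREATE INDEX tab_col_lower_idx ON tab (lower(col));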
Using the citext extension
The extension citext provides the data type citext, which stands for “case-insensitive text”. The table is defined as
CREATE TABLE tab ( col citext, ... );
The query is straightforward:
SELECT id, col FROM tab WHERE col = 'search string';
That is convenient and easy, but it also has some drawbacks:
- there is no data type civarchar, so a length limit can only be enforced with a check constraint (see the sketch after this list)
- performance can also be bad for longer values, because citext internally calls lower(col COLLATE "default") before comparing the values
- regular expression matching is not case insensitive, so you have to use the case-insensitive operator ~* explicitly
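As a minimal sketch of the check-constraint workaround from the first point (the table layout and the 100-character limit are my own assumptions), it could look like this:
CREATE EXTENSION IF NOT EXISTS citext;

-- there is no civarchar(100), so emulate the length limit with a check constraint
CREATE TABLE tab (
   id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
   col citext CHECK (length(col) <= 100)
);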
Using case-insensitive ICU collations
If you are using PostgreSQL v12 or higher and PostgreSQL was built with --with-icu, you can define a case-insensitive collation like this:
CREATE COLLATION english_ci (
   PROVIDER = icu,
   -- 'en-US@colStrength=secondary' for old ICU versions
   LOCALE = 'en-US-u-ks-level2',
   DETERMINISTIC = FALSE
);
The table is defined as
CREATE TABLE tab ( col text COLLATE english_ci, ... );
And once more, the query is simple:
SELECT id, col FROM tab WHERE col = 'search string';
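As a quick sanity check (a minimal sketch assuming the english_ci collation created above), the comparison really does ignore case:
-- with a case-insensitive (nondeterministic) collation this returns TRUE
SELECT 'search string' = 'SEARCH STRING' COLLATE english_ci;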
Please read my article on ICU collations for more information.
The trouble with pattern matching and case-insensitive collations
Case-insensitive collations are the best solution to the problem. However, this approach has two issues:
- Even though you can use ICU collations as a database’s default collation, that does not work for case-insensitive (nondeterministic) collations, so in PostgreSQL v15 you still have to specify the collation explicitly in each column definition.
- Case-insensitive collations are not compatible with pattern matching.
The first issue is merely annoying; however, the second issue will require further consideration. Here are two instances that illustrate the issue:
SELECT id, long FROM perf_coll WHERE long LIKE 'abcd%';
ERROR:  nondeterministic collations are not supported for LIKE

SELECT id, long FROM perf_coll WHERE long ~ 'abcd';
ERROR:  nondeterministic collations are not supported for regular expressions
Why is that not supported?
German soccer’s difficult situation
The ICU documentation explains the challenges with accurate case-insensitive pattern matching. German’s “ß,” which traditionally lacks an upper-case equivalent, serves as a good illustration. Therefore, with good German collations (the GNU C library’s collation is not good in that regard), you will obtain the following result:
SELECT upper('Fußball' COLLATE "de-DE-x-icu");

  upper
══════════
 FUSSBALL
(1 row)
Now what would be the correct result for the following query in a case-insensitive collation?
SELECT 'Fußball' LIKE 'FUS%';
You could argue that it should be TRUE
because that’s what you’d get for upper('Fußball') LIKE 'FUS%'
. On the other hand,
SELECT lower('FUSSBALL' COLLATE "de-DE-x-icu");

  lower
══════════
 fussball
(1 row)
So you could just as well argue that the result should be FALSE
. The ICU library goes with the second solution for simplicity. Either solution would be hard to implement in PostgreSQL, so we have stopped thinking about it for now. Compare this quote from the mailing list:
ICU also provides regexp matching, but not collation-aware, since
character-based patterns don’t play well with the concept of collation.
About a potential collation-aware LIKE, it looks hard to implement,
since the algorithm currently used in like_match.c seems purely
character-based. AFAICS there’s no way to plug calls to usearch_*
functions into it, it would need a separate redesign from scratch.
A semiotic aside
There is no universally accepted authority for correct German. For example, the German-speaking Swiss did away with the ß in the twentieth century. The “Rat für deutsche Rechtschreibung” (Council for German Orthography), on the other hand, introduced an upper-case ß (ẞ) in 2017, but this attempt to simplify matters has been widely ignored by the German-speaking world. What a ridiculous idea, simplifying German!
The end result is even more confusing than before.
SELECT upper('ß' COLLATE "de-DE-x-icu"),
       lower('ẞ' COLLATE "de-DE-x-icu");

 upper │ lower
═══════╪═══════
 SS    │ ß
(1 row)
A solution for case-insensitive pattern matching
We need a workable solution to the problem. We could of course use lower() or cast to citext, but that would re-introduce the performance problem for long strings. So we sidestep the problem by explicitly using a different collation: for performance, and to avoid the brain-twist of German soccer, we cast to the binary collation "C", which compares strings character by character:
SELECT id, col FROM tab WHERE col COLLATE "C" ILIKE 'search%pattern';

SELECT id, col FROM tab WHERE col COLLATE "C" ~* '^search.*pattern';
Note that for this to work as expected, we have to use the case-insensitive versions of LIKE and of the regular expression matching operator ~.
Obviously, this solution isn’t as simple as we would like. Again, the application has to write queries that make it explicit that case-insensitive processing is desired. Another difficulty is performance: while case-sensitive pattern matching can be supported with B-tree indexes, case-insensitive pattern matching requires a trigram index:
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX tab_col_pattern_idx ON tab USING gin (col gin_trgm_ops);
Such an index can speed up both of the above statements, although it can grow large and is slower to update than a normal B-tree index.
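For contrast, the case-sensitive B-tree index mentioned above could be sketched like this (the index name and the choice of text_pattern_ops are my own; it helps left-anchored LIKE patterns, but not ILIKE or ~*):
-- B-tree index that supports case-sensitive, left-anchored patterns
-- such as WHERE col LIKE 'search%'
CREATE INDEX tab_col_btree_idx ON tab (col text_pattern_ops);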
A performance test for case-insensitive comparisons
For this test, I used tables with ten million rows and a column containing randomly generated strings of 320 characters (a sketch of a possible setup follows below). There was no index on the column. In the table perf_citext, the column is defined as citext. In perf_lower, the data type is text, and the comparison uses lower(). The table perf_coll uses text with the collation english_ci defined above. The database was PostgreSQL v15 with a US English default collation, and all tables were cached in shared buffers.
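The article does not show the test scripts; as a rough sketch of how one of the tables could have been set up (the id column and the random-string generator are my own assumptions), it might look like this:
-- test table with the case-insensitive collation; perf_citext and perf_lower are analogous
CREATE TABLE perf_coll (
   id   bigint GENERATED ALWAYS AS IDENTITY,
   long text COLLATE english_ci
);

-- ten million rows, each with a random 320-character string (10 x 32 hex characters)
INSERT INTO perf_coll (long)
SELECT string_agg(md5(random()::text), '')
FROM generate_series(1, 10000000) AS rows(n),
     generate_series(1, 10) AS parts(m)
GROUP BY n;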
Comparison of the effectiveness of case-insensitive search methods
|            | WHERE ... = 'abcd' | WHERE ... LIKE 'abcd%' | WHERE ... COLLATE "C" ILIKE 'abcd%' | WHERE ... COLLATE "C" ~* 'abcd%' (~ for lower()) |
|------------|--------------------|------------------------|-------------------------------------|--------------------------------------------------|
| citext     | 540 ms             | 536 ms                 | 1675 ms                             | 2500 ms                                          |
| lower()    | 9000 ms            | 9000 ms                | 3000 ms                             | 3800 ms                                          |
| english_ci | 830 ms             | ERROR                  | 2000 ms                             | 1940 ms                                          |
In this test, citext comes out ahead, but case-insensitive collations are a decent runner-up. Using lower() doesn’t perform well.
Conclusion
PostgreSQL does not support pattern matching with case-insensitive collations, and it is not obvious what the correct behavior would be. Explicitly using the case-insensitive pattern matching operators together with the binary collation gives us a good workaround, but the situation is far from ideal.
About Enteros
Enteros offers a patented database performance management SaaS platform. It proactively identifies root causes of complex business-impacting database scalability and performance issues across a growing number of clouds, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.