Preamble
Despite the fact that new index techniques appear relatively infrequently, a feature called Bloom indexing quietly arrived with the recent 9.6 release. Quietly, perhaps, because it ships as a contrib module (also called an extension) and partly served to show off the new generic WAL interface, which should make it easier to add new index methods in the future. I naturally glanced at the 9.6 release notes as soon as they were available, spent some time there, and thought, "A new index type, great!" But to be completely honest, the explanations were not very thorough, and I don't think I fully understood what was going on; my first impression was that its only use is to tell Postgres which search values "do not exist" in a table. Once I spent some time learning more about the feature, though, I was pleasantly surprised to find a fully working index method that can be very helpful for everyday problems.
Working principles and when to apply it
First, the intended use case for Bloom filters and indexes. When should we use one? The short answer: when queries use many columns in seemingly random combinations with the equality operator ("="), such as SELECT * FROM tbl WHERE col1 = 2 AND col9 = 'x' AND col17 = 865. And note the emphasis on "equality operator": in Bloom's case, hashes of values are stored and compared, so range operations cannot use the index.
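To make that concrete, here is a minimal sketch; the table name and column layout are hypothetical, chosen only to mirror the query above:

CREATE TABLE tbl (
    id    serial PRIMARY KEY,
    col1  int,
    col2  int,
    -- ... imagine col3 through col16 here as well ...
    col9  text,
    col17 int
);

-- Typical "seemingly random combination" equality lookups:
SELECT * FROM tbl WHERE col1 = 2 AND col9 = 'x' AND col17 = 865;
SELECT * FROM tbl WHERE col2 = 7 AND col17 = 123;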
In essence, Postgres computes a hash for each of your column values and then stores a few bits of each hash as part of the index entry, along with the row's physical location (as with every other index). The efficiency of this index type really shines when the bits for multiple column values are merged into a single index entry, creating what is called a signature in the Bloom context. In short, it can greatly help you save disk space: instead of 10 separate normal B-tree indexes, you can now have a single Bloom index. That index is lossy, meaning matched values must always be re-checked against the table, but from a probabilistic standpoint it is "good enough" to be useful, provided it is used and set up correctly, which, after reading this post, shouldn't be a problem. So, now that we have given our application some thought and have a hunch about a real use case for Bloom, how does it operate?
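A simple way to check the space-saving claim on your own data is to compare relation sizes directly. This is a hedged sketch, assuming the hypothetical tbl from above and the bloom extension already installed:

-- One lossy Bloom index covering all the columns of interest ...
CREATE INDEX tbl_bloom ON tbl USING bloom (col1, col2, col9, col17);

-- ... versus one B-tree per column (only two of the ten shown):
CREATE INDEX tbl_col1_btree ON tbl (col1);
CREATE INDEX tbl_col2_btree ON tbl (col2);

-- With many columns, the single signature-based Bloom index typically
-- ends up far smaller than the sum of the separate B-trees:
SELECT relname, pg_size_pretty(pg_relation_size(oid))
FROM pg_class
WHERE relname LIKE 'tbl\_%' AND relkind = 'i';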
Basic usage of Bloom
So, in its simplest form, here’s how it works:
CREATE EXTENSION bloom;  -- make sure "contrib" is installed

CREATE INDEX bloomidx ON tbloom USING bloom (i1, i2, i3)
    WITH (length = 80, col1 = 2, col2 = 2, col3 = 4);
So far this looks somewhat familiar from the GIN and GiST indexes: create the extension, specify that your index uses the bloom access method, and then start running queries, hoping that PostgreSQL decides to use the index. Also keep in mind that only equality comparisons are supported, and currently only for the integer and text data types. Since humans aren't very good at remembering or typing in floating-point numbers and millisecond-precision timestamps, this shouldn't be a deal breaker for human-input queries.
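Whether the planner actually picks the index is easy to verify with EXPLAIN. A quick sketch against the tbloom example above (the exact plan shape and row counts will of course depend on your data):

EXPLAIN (ANALYZE)
SELECT * FROM tbloom WHERE i1 = 2 AND i3 = 4;

-- For a Bloom index, expect a lossy bitmap plan roughly of this shape:
--   Bitmap Heap Scan on tbloom
--     Recheck Cond: ((i1 = 2) AND (i3 = 4))
--     Rows Removed by Index Recheck: ...   <-- false positives discarded
--     ->  Bitmap Index Scan on bloomidx
--           Index Cond: ((i1 = 2) AND (i3 = 4))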
But what on earth do the extra parameters in the WITH clause mean? The documentation explains:
The index is created with a signature length of 80 bits, with attributes i1 and i2 mapped to 2 bits, and attribute i3 mapped to 4 bits. We could have omitted the length, col1, and col2 specifications since those have the default values.
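Restating that quote as code: since 80 bits is the default signature length and 2 bits the default per column, the same index could have been declared more tersely.

-- Equivalent to the index created above (drop that one first, or
-- pick another name): length, col1 and col2 fall back to their
-- defaults of 80 bits and 2 bits per column.
CREATE INDEX bloomidx ON tbloom USING bloom (i1, i2, i3)
    WITH (col3 = 4);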
Here, things start to get hazy. How does it add up, exactly? Out of a total of 80 bits, only 2 + 2 + 4 = 8 are accounted for. I looked online for examples or explanations, but I couldn't find any, so I decided to go straight for the truth, which meant attacking the extension's source code.
Going deeper
In the end, this stuff is pretty clever, and after an hour of poking around in the code (gritting my teeth, with a lot of respect for the authors), I think I figured it out. In the process I also gave the algorithm a nickname: "a wicked semi-random bit-twisting modulo game." The "aha" moments in the code can be found in the signValue() function in blutils.c, but here is a brief excerpt that demonstrates which signature bits are set for a single column value.
/*
 * Init hash sequence to map our value into bits. The same values in
 * different columns will be mapped into different bits because of the
 * step above.
 */
hashVal = DatumGetInt32(FunctionCall1(&state->hashFn[attno], value));
mySrand(hashVal ^ myRand());

for (j = 0; j < state->opts.bitSize[attno]; j++)
{
    /* prevent multiple evaluation in SETBIT macro */
    nBit = myRand() % (state->opts.bloomLength * SIGNWORDBITS);
    SETBIT(sign, nBit);
}
So, what does it actually tell us? There is clearly some randomness going on, but luckily it answers my question about how 2 + 2 + 4 bits relate to the 80-bit signature: the code calculates a separate set of bit positions for each and every indexed column and column value (two bits per column by default, user-configurable in the WITH clause) and sets those bits in the shared signature. This basically ensures that all 80 bits available by default are used up evenly and efficiently. Additionally, the column index serves as part of the "seed" for the randomness. From a security perspective we normally wouldn't want a predictable seed, but here it makes perfect sense: index entries and index comparisons must be reproducible, which they wouldn't be with a truly random seed. In the end, a very sophisticated solution.
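To see the "bit-twisting modulo game" end to end, here is a self-contained toy mirror of the idea in plain C. It is emphatically not the extension's actual code: toy_hash() and the C library's srand()/rand() stand in for Postgres' hashFn, mySrand() and myRand(), purely to illustrate the pattern of seeding per column and setting a few bits modulo the signature length.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define SIG_BITS 80   /* "length = 80" from the example above */
#define SETBIT(sign, n) ((sign)[(n) / 8] |= (uint8_t) (1u << ((n) % 8)))

/* toy stand-in for the per-type hash function */
static uint32_t toy_hash(uint32_t v) { return v * 2654435761u; }

static void
sign_value(uint8_t *sign, uint32_t value, int attno, int bits_per_attr)
{
    /* Seed depends on the value's hash AND the column number, so the
     * same value in different columns lights up different bits. */
    srand(toy_hash(value) ^ (uint32_t) attno);

    for (int j = 0; j < bits_per_attr; j++)
    {
        int nbit = rand() % SIG_BITS;  /* prevent double evaluation in SETBIT */
        SETBIT(sign, nbit);            /* the "modulo game" */
    }
}

int
main(void)
{
    uint8_t sign[SIG_BITS / 8] = {0};

    sign_value(sign, 42, 0, 2);   /* column i1: 2 bits */
    sign_value(sign, 42, 1, 2);   /* column i2: 2 bits */
    sign_value(sign,  7, 2, 4);   /* column i3: 4 bits */

    for (int n = 0; n < SIG_BITS; n++)
        if (sign[n / 8] & (1u << (n % 8)))
            printf("signature bit %d is set\n", n);
    return 0;
}

Because the seed is deterministic, indexing a row and later probing the index with the same value reproduce exactly the same bit positions, which is the whole point noted above.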
About Enteros
Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.