Preamble
To begin, we first need a test case that is reasonably close to real life, so the problem is easier to understand and remember. As I said in the last post, Bloom needs a table with many columns that can be queried in arbitrary combinations. Since only integer and text data types and equality operators are supported, the options are limited. After looking around for a while and failing to find a suitable public test set, I stared at the ceiling a bit longer and came up with a simple table from an online car dealership domain, where Bloom should fit well and which is easy enough for our brains to process.
The online car dealership
The schema, a single table, is shown below, with the number of distinct values per column noted in the comments. Picking a domain with few distinct values makes for a good example, since B-tree indexes are less useful there and Bloom should help. Also note how heavily normalized the table is: because we deal only with numeric type codes, we can easily generate test data with random codes (brand codes 0..199, and so on). To keep disk access from skewing the tests, I generated 5 million rows of cars for sale, which may be a realistic figure for a larger country. This gave us a 365 MB table that fits nicely into shared buffers.
create table vehicle (
  id bigserial primary key,
  brand_code int not null,       -- n_distinct 200, imagine "brand_code"=0 => "Audi"
  model_code int not null,       -- n_distinct 25, "model_code"=0 => "Q7"
  color_code int not null,       -- n_distinct 25, "color_code"=0 => "silver" etc...
  year int not null,             -- n_distinct 26
  body_type_code int not null,   -- n_distinct 10
  doors int not null,            -- n_distinct 5
  seats int not null,            -- n_distinct 9
  gearbox_code int not null,     -- n_distinct 3
  fuel_type_code int not null,   -- n_distinct 10
  aircon_type_code int not null  -- n_distinct 4
);

insert into vehicle (brand_code, model_code, color_code, year, body_type_code,
                     doors, seats, gearbox_code, fuel_type_code, aircon_type_code)
select
  random()*200,        -- brand
  random()*25,         -- model
  random()*25,         -- color
  random()*26 + 1990,  -- year
  random()*10,         -- body_type
  random()*4 + 1,      -- doors
  random()*8 + 1,      -- seats
  random()*3,          -- gearbox
  random()*10,         -- fuel
  random()*4           -- aircon
from generate_series(1, 5*1e6); -- 5 mio rows

analyze vehicle;
Building the indexes
Again, the idea behind our "application" is that users send queries with equality filters across an arbitrary set of columns, which means all columns need to be indexed to keep queries fast. So we first index the columns individually with B-tree indexes (with brand and model combined into one index) and then add a single Bloom index covering all ten columns.
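One detail worth noting before the index definitions: the Bloom access method ships as the bloom contrib extension, so it has to be installed in the database first (assuming the contrib packages are available on your system):

CREATE EXTENSION IF NOT EXISTS bloom;  -- provides the "bloom" index access method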
CREATE INDEX ON vehicle (brand_code, model_code);
CREATE INDEX ON vehicle (color_code);
CREATE INDEX ON vehicle (year);
CREATE INDEX ON vehicle (body_type_code);
CREATE INDEX ON vehicle (doors);
CREATE INDEX ON vehicle (seats);
CREATE INDEX ON vehicle (gearbox_code);
CREATE INDEX ON vehicle (fuel_type_code);
CREATE INDEX ON vehicle (aircon_type_code);

CREATE INDEX bloom_80_bits ON vehicle USING bloom
  (brand_code, model_code, color_code, year, body_type_code,
   doors, seats, gearbox_code, fuel_type_code, aircon_type_code);
The sizes of the resulting indexes: the Bloom index takes 77 MB, while all the B-tree indexes together add up to 107 MB. Already pretty impressive, and remember that they should be able to answer the same kinds of queries.
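If you want to verify those numbers yourself, a simple catalog query does the job (a sketch; it just lists the on-disk size of every index on the vehicle table):

SELECT indexrelid::regclass AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM   pg_index
WHERE  indrelid = 'vehicle'::regclass
ORDER  BY pg_relation_size(indexrelid) DESC;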
Running a test query
What interests us now is whether queries will actually benefit from the Bloom index. For my hypothetical use case I'm imposing some behavior on our "users", since as a group they would otherwise be too picky to give the "lossy" Bloom index much of a chance: I won't use all 10 columns, and I'll leave brand and model out of the query. You could read this as our "users" being practical people who know which features they want in a car and care more about usefulness than a flashy brand name. Note that the chosen code values are arbitrary; changing the numbers doesn't affect the outcome, because our random distribution has few distinct values.
select * from vehicle
where color_code = 12
  and year > 2010
  and body_type_code = 5
  and doors = 4
  and gearbox_code = 2
  and fuel_type_code = 5
  and aircon_type_code = 2;
* Run 1.
Bitmap Heap Scan on vehicle  (cost=22376.30 rows=8 width=48) (actual time=130.326 rows=13 loops=1)
  Recheck Cond: ((color_code = 12) AND (body_type_code = 5) AND (fuel_type_code = 5))
  Filter: ((year > 2010) AND (doors = 4) AND (gearbox_code = 2) AND (aircon_type_code = 2))
  Rows Removed by Filter: 2113
  Heap Blocks: exact=2072
  ->  BitmapAnd  (cost=22376.30..22376.30 rows=1971 width=0) (actual time=129.832..129.832 rows=0 loops=1)
        ->  Bitmap Index Scan on vehicle_color_code_idx  (cost=0.00..3487.43 rows=188667 width=0) (actual time=50.499..50.499 rows=199791 loops=1)
              Index Cond: (color_code = 12)
        ->  Bitmap Index Scan on vehicle_body_type_code_idx  (cost=0.00..9244.93 rows=500333 width=0) (actual time=33.992..33.992 rows=499668 loops=1)
              Index Cond: (body_type_code = 5)
        ->  Bitmap Index Scan on vehicle_fuel_type_code_idx  (cost=0.00..9643.43 rows=522000 width=0) (actual time=33.622..33.622 rows=500561 loops=1)
              Index Cond: (fuel_type_code = 5)
Planning time: 0.642 ms
Execution time: 132.150 ms
Much to my surprise, the B-tree indexes were still used, and the average runtime over five runs was 130 ms. Hmm. On the plus side, we can see that Postgres is able to pick out the most useful indexes and do a bitmap merge on just three of them. Okay, it seems that to test Bloom we still have to disable or drop the standard indexes.
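One convenient way to take the B-tree indexes out of play without losing them is to drop them inside a transaction and roll back afterwards (a sketch, dropping just the three indexes the planner picked above; note that DROP INDEX holds an exclusive lock on the table until the rollback):

BEGIN;
DROP INDEX vehicle_color_code_idx, vehicle_body_type_code_idx, vehicle_fuel_type_code_idx;
-- ... run the EXPLAIN ANALYZE of the test query here ...
ROLLBACK;  -- the B-tree indexes come back untouched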
* Run 2.
Bitmap Heap Scan on vehicle  (cost=139220.00..139377.59 rows=8 width=48) (actual time=67.600..67.673 rows=13 loops=1)
  Recheck Cond: ((color_code = 12) AND (body_type_code = 5) AND (doors = 4) AND (gearbox_code = 2) AND (fuel_type_code = 5) AND (aircon_type_code = 2))
  Rows Removed by Index Recheck: 37
  Filter: (year > 2010)
  Rows Removed by Filter: 28
  Heap Blocks: exact=78
  ->  Bitmap Index Scan on bloom_80_bits  (cost=0.00..139220.00 rows=40 width=0) (actual time=67.576..67.576 rows=78 loops=1)
        Index Cond: ((color_code = 12) AND (body_type_code = 5) AND (doors = 4) AND (gearbox_code = 2) AND (fuel_type_code = 5) AND (aircon_type_code = 2))
Planning time: 0.270 ms
Execution time: 67.728 ms
After taking the B-tree indexes out of the picture: hurrah, we're making progress! Bloom looks like the better option for our use case, being noticeably more space-efficient (77 MB vs 107 MB) and about twice as fast (67 ms vs 130 ms). It's just a shame that Postgres can't yet estimate that correctly, so we have to point it out ourselves. Why is that? It comes down to the planner's cost estimation mechanism: the total cost (the "penalty points" calculated for each operation) of the Bloom plan is far higher (~139K vs ~22K for the B-tree plan), even though Bloom is faster in real time. Most of that cost comes from the Bloom bitmap index scan, which currently has to scan the entire index and compare all 5 million signatures. One could of course tweak parameters like cpu_tuple_cost to make the planner favor Bloom, but that's a separate discussion and could noticeably affect other, perfectly fine queries. By the way, a full table scan of our query (with enable_bitmapscan set to off) takes about 500 ms.
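For reference, that ~500 ms figure can be reproduced by disabling bitmap scans for the session (a sketch; whether the planner actually falls back to a sequential scan also depends on which other indexes are still present):

SET enable_bitmapscan = off;   -- steer the planner away from both bitmap plans
EXPLAIN ANALYZE
select * from vehicle
where color_code = 12 and year > 2010 and body_type_code = 5
  and doors = 4 and gearbox_code = 2 and fuel_type_code = 5
  and aircon_type_code = 2;
RESET enable_bitmapscan;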
Tuning Bloom signature length
Bloom's first results got me thinking about what would happen if I changed the signature length, so I tried it. A relatively low value of 32 bits and a higher value of 128 bits produced indexes of 48 MB and 106 MB respectively (for the same 5 million rows). Note that the Bloom index size does not scale linearly with the signature length, because each entry already carries a fixed amount of physical location information (6 bytes).
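The two extra indexes were presumably created just like bloom_80_bits, only with an explicit signature length (a sketch, using the names shown in the listing below):

CREATE INDEX bloom_32_bits ON vehicle USING bloom
  (brand_code, model_code, color_code, year, body_type_code,
   doors, seats, gearbox_code, fuel_type_code, aircon_type_code)
  WITH (length = 32);

CREATE INDEX bloom_128_bits ON vehicle USING bloom
  (brand_code, model_code, color_code, year, body_type_code,
   doors, seats, gearbox_code, fuel_type_code, aircon_type_code)
  WITH (length = 128);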
krl@postgres=# \di+ bloom_*
                              List of relations
 Schema │      Name      │ Type  │ Owner │  Table  │  Size  │ Description
────────┼────────────────┼───────┼───────┼─────────┼────────┼─────────────
 public │ bloom_128_bits │ index │ krl   │ vehicle │ 106 MB │
 public │ bloom_32_bits  │ index │ krl   │ vehicle │ 48 MB  │
 public │ bloom_80_bits  │ index │ krl   │ vehicle │ 77 MB  │
(3 rows)
Amusingly, when running the same EXPLAIN ANALYZE as above (output omitted for brevity), Postgres chose the less accurate 32-bit index even though a more precise one was available. To test my theory I dropped the 32-bit index, and sure enough the runtime was roughly cut in half, from 137 to 65 ms: fewer tuples to recheck and fewer heap pages to visit. Again, Postgres could be a bit shrewder here; it seems to simply pick the smallest index that covers the queried columns, much as it does with B-tree indexes.
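The re-test itself was as simple as (a sketch):

DROP INDEX bloom_32_bits;  -- leave only the more precise 80- and 128-bit indexes
-- ... re-run the EXPLAIN ANALYZE of the test query ...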
Summary
Our example use case was not exhaustive, but hopefully close enough to real life to give you an idea of when to consider Bloom, what to take into account, and what the pros and cons are. Things to remember:
- Bloom indexes are ideal for "checkbox" type applications with many columns of low-cardinality data (and many data rows); one could compare the use case to Oracle bitmap indexes.
- The index is lossy and may produce false positives: it stores signatures (linked to table rows via CTIDs, as usual) rather than the actual column values, so matches have to be rechecked against the table.
- Bloom requires some trial and error to find the best columns to include and the ideal signature length, but when used properly it can easily provide a 10x benefit in combined space and speed.
- Only equality operators and the int4 and text data types (varchar is treated the same) are supported at this time.
- The default signature length of 80 bits was sufficient for 5m rows with 10 columns (limited number of distinct values) and outperformed individual B-tree indexes while also being much smaller!
- Having both the B-tree and Bloom indexes at the moment is pointless because B-tree is preferred and the planner cost estimates for Bloom could use some improvement.
- What about unique indexes? Not supported: ERROR: access method "bloom" does not support unique indexes.
- What about NULLs? NULLs are not indexed, so Bloom indexes cannot be used to speed up queries like "WHERE col IS NULL".
- You can rely on Bloom because it is WAL logged and crash-safe (unlike Hash indexes).
- When one column has many distinct values and the others very few, increasing the number of bits generated for that column should in theory lead to fewer rows being discarded during recheck, but it is probably not worth the effort (see the sketch after this list).
- Bits are actually rounded up to full words (16 bits) when the index signature bit length is specified (… WITH (length=X)), so “length=33” bits actually becomes “length=48” bits.
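As a hypothetical illustration of the last two points (the index name and bit counts are invented for the example; col1, col2, … refer to the indexed columns in declaration order):

CREATE INDEX bloom_tuned ON vehicle USING bloom
  (brand_code, model_code, color_code, year, body_type_code,
   doors, seats, gearbox_code, fuel_type_code, aircon_type_code)
  WITH (length = 33,  -- silently rounded up to 48 bits
        col1 = 4,     -- more bits for brand_code (200 distinct values)
        col2 = 2);    -- the default of 2 bits per column applies to the rest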