After reading the information on the PG-Strom wiki page, I decided to try out joins to see how much speed can be gained by offloading some of PostgreSQL's work to the graphics card.
I started by producing some test data:
test=# CREATE TABLE t_test AS
        SELECT id, id % 10 AS ten, id % 20 AS twenty
        FROM generate_series(1, 25000000) AS id
        ORDER BY id;
SELECT 25000000
test=# CREATE TABLE t_join AS
        SELECT * FROM t_test
        ORDER BY random()
        LIMIT 1000000;
SELECT 1000000
test=# SELECT count(*) FROM t_join;
  count
---------
 1000000
(1 row)

test=# SELECT count(*) FROM t_test;
  count
----------
 25000000
(1 row)
1 million rows, all of which are taken from the original data, are supposed to be joined with the 25 million rows in the base table.
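As a sanity check, the expected join selectivity can be computed directly. This helper query is not part of the original benchmark, just a quick way to confirm that roughly 4% of the big table will find a join partner:

```sql
-- every id in t_join exists in t_test, so the selectivity
-- is simply 1 million / 25 million = 0.04
SELECT (SELECT count(*) FROM t_join)::numeric
     / (SELECT count(*) FROM t_test) AS selectivity;
```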
An initial test already reveals a lovely improvement. For reference, here are the first two queries executed on the CPU:
test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b
       WHERE a.id = b.id GROUP BY a.ten;
                                       QUERY PLAN
------------------------------------------------------------------------------------------
 HashAggregate  (cost=724919.18..724919.28 rows=10 width=4)
                (actual time=11283.769..11283.772 rows=10 loops=1)
   Group Key: a.ten
   ->  Hash Join  (cost=31813.00..719919.18 rows=1000000 width=4)
                  (actual time=340.218..11061.192 rows=1000000 loops=1)
         Hash Cond: (a.id = b.id)
         ->  Seq Scan on t_test a  (cost=0.00..385135.40 rows=24999940 width=8)
                                   (actual time=0.046..3590.336 rows=25000000 loops=1)
         ->  Hash  (cost=15406.00..15406.00 rows=1000000 width=4)
                   (actual time=338.516..338.516 rows=1000000 loops=1)
               Buckets: 131072  Batches: 16  Memory Usage: 3226kB
               ->  Seq Scan on t_join b  (cost=0.00..15406.00 rows=1000000 width=4)
                                         (actual time=0.031..142.085 rows=1000000 loops=1)
 Planning time: 0.587 ms
 Execution time: 11284.411 ms
(10 rows)

test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b
       WHERE a.id = b.id GROUP BY a.twenty;
                                       QUERY PLAN
------------------------------------------------------------------------------------------
 HashAggregate  (cost=724919.18..724919.38 rows=20 width=4)
                (actual time=10991.615..10991.619 rows=20 loops=1)
   Group Key: a.twenty
   ->  Hash Join  (cost=31813.00..719919.18 rows=1000000 width=4)
                  (actual time=317.089..10766.779 rows=1000000 loops=1)
         Hash Cond: (a.id = b.id)
         ->  Seq Scan on t_test a  (cost=0.00..385135.40 rows=24999940 width=8)
                                   (actual time=0.050..3636.188 rows=25000000 loops=1)
         ->  Hash  (cost=15406.00..15406.00 rows=1000000 width=4)
                   (actual time=316.321..316.321 rows=1000000 loops=1)
               Buckets: 131072  Batches: 16  Memory Usage: 3226kB
               ->  Seq Scan on t_join b  (cost=0.00..15406.00 rows=1000000 width=4)
                                         (actual time=0.026..141.551 rows=1000000 loops=1)
 Planning time: 0.240 ms
 Execution time: 10992.203 ms
(10 rows)
The CPU used in this case is an “AMD FX(tm)-8350 Eight-Core Processor” clocked at 4 GHz.
Let’s attempt the same on the GPU:
test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b
       WHERE a.id = b.id GROUP BY a.ten;
                                       QUERY PLAN
------------------------------------------------------------------------------------------
 HashAggregate  (cost=175445.25..175445.35 rows=10 width=4)
                (actual time=2556.159..2556.161 rows=10 loops=1)
   Group Key: a.ten
   ->  Custom Scan (GpuPreAgg)  (cost=27659.66..174601.72 rows=22 width=8)
                                (actual time=2538.558..2556.123 rows=30 loops=1)
         Bulkload: Off
         Reduction: Local + Global
         ->  Custom Scan (GpuJoin)  (cost=23659.66..170445.25 rows=1000000 width=4)
                                    (actual time=1392.571..2437.772 rows=1000000 loops=1)
               Bulkload: On (density: 100.00%)
               Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                        Nrows (in:25000000 out:1000000, 4.00% planned 4.00%),
                        KDS-Hash (size: 57.22MB planned 82.25MB, nbatches: 1 planned 1)
               Inner Buffer: (82.25MB), DMA nums: 13, size: 1069.32MB
               ->  Custom Scan (BulkScan) on t_test a  (cost=0.00..385137.60 rows=25000160 width=8)
                                                       (actual time=16.137..1333.313 rows=25000000 loops=1)
               ->  Seq Scan on t_join b  (cost=0.00..15406.00 rows=1000000 width=4)
                                         (actual time=0.018..109.725 rows=1000000 loops=1)
 Planning time: 1.720 ms
 Execution time: 3264.747 ms
(14 rows)

test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b
       WHERE a.id = b.id GROUP BY a.twenty;
                                       QUERY PLAN
------------------------------------------------------------------------------------------
 HashAggregate  (cost=175445.25..175445.45 rows=20 width=4)
                (actual time=2532.113..2532.121 rows=20 loops=1)
   Group Key: a.twenty
   ->  Custom Scan (GpuPreAgg)  (cost=27659.66..174601.94 rows=44 width=8)
                                (actual time=2508.512..2531.991 rows=120 loops=1)
         Bulkload: Off
         Reduction: Local + Global
         ->  Custom Scan (GpuJoin)  (cost=23659.66..170445.25 rows=1000000 width=4)
                                    (actual time=1373.296..2410.850 rows=1000000 loops=1)
               Bulkload: On (density: 100.00%)
               Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                        Nrows (in:25000000 out:1000000, 4.00% planned 4.00%),
                        KDS-Hash (size: 57.22MB planned 82.25MB, nbatches: 1 planned 1)
               Inner Buffer: (82.25MB), DMA nums: 11, size: 904.81MB
               ->  Custom Scan (BulkScan) on t_test a  (cost=0.00..385137.60 rows=25000160 width=8)
                                                       (actual time=12.961..1308.733 rows=25000000 loops=1)
               ->  Seq Scan on t_join b  (cost=0.00..15406.00 rows=1000000 width=4)
                                         (actual time=0.018..… rows=1000000 loops=1)
 Planning time: 0.539 ms
 Execution time: 3225.901 ms
(14 rows)
This is a nice development, in my opinion: the query executes considerably faster on the GPU.
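To reproduce this CPU-vs-GPU comparison within a single session, PG-Strom can be switched off and on per query via its `pg_strom.enabled` setting. This is a sketch assuming a PG-Strom build that exposes this GUC; the exact setting name may differ between versions:

```sql
-- force the plan onto the CPU only
SET pg_strom.enabled = off;
EXPLAIN ANALYZE SELECT count(*) FROM t_test AS a, t_join AS b
WHERE a.id = b.id GROUP BY a.ten;

-- hand the join back to the GPU and compare timings
SET pg_strom.enabled = on;
EXPLAIN ANALYZE SELECT count(*) FROM t_test AS a, t_join AS b
WHERE a.id = b.id GROUP BY a.ten;
```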
Utilizing indexes
Observing what happens when indexes are added to both sides of the join is interesting.
test=# CREATE INDEX idx_id ON t_test (id);
CREATE INDEX
test=# CREATE INDEX idx_id2 ON t_join (id);
CREATE INDEX
The standard query shown before does not show any difference, because the join still needs all the data. Therefore the test is run again with a reasonable filter:
test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b
       WHERE a.id = b.id AND a.id < 100000 GROUP BY a.ten;
                                       QUERY PLAN
------------------------------------------------------------------------------------------
 HashAggregate  (cost=32541.70..32541.71 rows=1 width=4)
                (actual time=45.845..45.846 rows=10 loops=1)
   Group Key: a.ten
   ->  Merge Join  (cost=1.49..32519.75 rows=4390 width=4)
                   (actual time=0.076..44.423 rows=4137 loops=1)
         Merge Cond: (a.id = b.id)
         ->  Index Scan using idx_id on t_test a  (cost=0.44..3722.05 rows=109749 width=8)
                                                  (actual time=0.025..27.556 rows=99999 loops=1)
               Index Cond: (id < 100000)
         ->  Index Only Scan using idx_id2 on t_join b  (cost=0.42..25980.42 rows=1000000 width=4)
                                                        (actual time=0.022..0.931 rows=4138 loops=1)
               Heap Fetches: 0
 Planning time: 0.449 ms
 Execution time: 45.946 ms
(10 rows)
The CPU version runs fairly quickly: PostgreSQL scans the indexes and uses the sorted input to perform a nice merge join. The aggregation is fast as well.
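To confirm that the merge join really is the winning strategy here, the stock PostgreSQL planner switches can be used to forbid it temporarily and compare timings. This is a quick experiment of my own, not part of the original test run:

```sql
-- forbid merge joins so the planner must pick an alternative (e.g. a hash join)
SET enable_mergejoin = off;
EXPLAIN ANALYZE SELECT count(*) FROM t_test AS a, t_join AS b
WHERE a.id = b.id AND a.id < 100000 GROUP BY a.ten;
-- restore the default behavior
RESET enable_mergejoin;
```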
With the GPU enabled, the plan appears less efficient in the default setting:
test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b
       WHERE a.id = b.id AND a.id < 100000 GROUP BY a.ten;
                                       QUERY PLAN
------------------------------------------------------------------------------------------
 HashAggregate  (cost=13404.21..13404.22 rows=1 width=4)
                (actual time=118.154..118.156 rows=10 loops=1)
   Group Key: a.ten
   ->  Custom Scan (GpuJoin)  (cost=9649.48..13383.34 rows=4174 width=4)
                              (actual time=70.993..117.099 rows=4137 loops=1)
         Bulkload: On (density: 100.00%)
         Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                  Nrows (in:1000000 out:4137, 0.41% planned 0.42%),
                  KDS-Hash (size: 5859.39KB planned 8.58MB, nbatches: 1 planned 1)
         Inner Buffer: (8.58MB), DMA nums: 1, size: 8.58MB
         ->  Custom Scan (BulkScan) on t_join b  (cost=0.00..15406.00 rows=1000000 width=4)
                                                 (actual time=7.164..24.709 rows=1000000 loops=1)
         ->  Index Scan using idx_id on t_test a  (cost=0.44..3542.55 rows=104349 width=8)
                                                  (actual time=0.018..21.684 rows=99999 loops=1)
               Index Cond: (id < 100000)
 Planning time: 0.858 ms
 Execution time: 553.543 ms
(12 rows)
Even though the query is slower in this case, it shouldn't be a big deal. If the cost models are properly adjusted and the planner is taught how to choose the best plan, the system should be able to figure out that the CPU version is the faster one. So the speed penalty seen here is not really a problem: once all the infrastructure is in place, balancing the cost model and making minor planner adjustments does not look like a deal-breaker (at least not in light of what has already been achieved).
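Such tuning could look like the session below. Note that the parameter name and value are assumptions based on PG-Strom's documented cost GUCs of that era and may differ between versions; this is only a sketch of the idea, not a verified recipe:

```sql
-- hypothetical tuning: raise the perceived cost of setting up GPU work
-- so that small, index-friendly joins stay on the CPU
SET pg_strom.gpu_setup_cost = 100000;  -- GUC name/value are assumptions
EXPLAIN SELECT count(*) FROM t_test AS a, t_join AS b
WHERE a.id = b.id AND a.id < 100000 GROUP BY a.ten;
```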
Joining more tables
Just to see what happens, let's simply add more tables to the join. To give the CPU and the GPU a fair chance, I dropped my two indexes again.
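Removing the two indexes created earlier is straightforward, using the names from the CREATE INDEX statements above:

```sql
DROP INDEX idx_id;   -- index on t_test (id)
DROP INDEX idx_id2;  -- index on t_join (id)
```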
test=# explain analyze SELECT count(*)
       FROM t_test AS a, t_join AS b, t_test AS c,
            t_join AS d, t_test AS e, t_join AS f
       WHERE a.id = b.id AND b.id = c.id AND c.id = d.id
         AND d.id = e.id AND e.id = f.id;
                                       QUERY PLAN
------------------------------------------------------------------------------------------
 Aggregate  (cost=1745125.48..1745125.49 rows=1 width=0)
            (actual time=34234.209..34234.209 rows=1 loops=1)
   ->  Hash Join  (cost=1266218.99..1745121.48 rows=1600 width=0)
                  (actual time=23054.940..34133.602 rows=1000000 loops=1)
         Hash Cond: (e.id = a.id)
         ->  Seq Scan on t_test e  (cost=0.00..385136.36 rows=25000036 width=4)
                                   (actual time=0.056..2728.388 rows=25000000 loops=1)
         ->  Hash  (cost=1266198.99..1266198.99 rows=1600 width=20)
                   (actual time=23054.441..23054.441 rows=1000000 loops=1)
               Buckets: 131072 (originally 2048)  Batches: 32 (originally 1)  Memory Usage: 4205kB
               ->  Hash Join  (cost=787296.49..1266198.99 rows=1600 width=20)
                              (actual time=12247.386..22803.559 rows=1000000 loops=1)
                     Hash Cond: (c.id = a.id)
                     ->  Seq Scan on t_test c  (cost=0.00..385136.36 rows=25000036 width=4)
                                               (actual time=0.004..2727.218 rows=25000000 loops=1)
                     ->  Hash  (cost=787276.49..787276.49 rows=1600 width=16)
                               (actual time=12246.958..12246.958 rows=1000000 loops=1)
                           Buckets: 131072 (originally 2048)  Batches: 16 (originally 1)  Memory Usage: 3960kB
                           ->  Hash Join  (cost=768104.49..787276.49 rows=1600 width=16)
                                          (actual time=11330.994..12044.482 rows=1000000 loops=1)
                                 Hash Cond: (f.id = a.id)
                                 ->  Seq Scan on t_join f  (cost=0.00..15406.00 rows=1000000 width=4)
                                                           (actual time=0.030..111.371 rows=1000000 loops=1)
                                 ->  Hash  (cost=767604.49..767604.49 rows=40000 width=12)
                                           (actual time=11330.687..11330.687 rows=1000000 loops=1)
                                       Buckets: 131072 (originally 65536)  Batches: 16 (originally 1)  Memory Usage: 3716kB
                                       ->  Hash Join  (cost=63626.00..767604.49 rows=40000 width=12)
                                                      (actual time=620.425..11124.981 rows=1000000 loops=1)
                                             Hash Cond: (a.id = d.id)
                                             ->  Hash Join  (cost=31813.00..719920.49 rows=1000000 width=8)
                                                            (actual time=306.078..10215.246 rows=1000000 loops=1)
                                                   Hash Cond: (a.id = b.id)
                                                   ->  Seq Scan on t_test a  (cost=0.00..385136.36 rows=25000036 width=4)
                                                                             (actual time=0.002..3430.792 rows=25000000 loops=1)
                                                   ->  Hash  (cost=15406.00..15406.00 rows=1000000 width=4)
                                                             (actual time=305.466..305.466 rows=1000000 loops=1)
                                                         Buckets: 131072  Batches: 16  Memory Usage: 3226kB
                                                         ->  Seq Scan on t_join b  (cost=0.00..15406.00 rows=1000000 width=4)
                                                                                   (actual time=0.002..137.925 rows=1000000 loops=1)
                                             ->  Hash  (cost=15406.00..15406.00 rows=1000000 width=4)
                                                       (actual time=314.051..314.051 rows=1000000 loops=1)
                                                   Buckets: 131072  Batches: 16  Memory Usage: 3226kB
                                                   ->  Seq Scan on t_join d  (cost=0.00..15406.00 rows=1000000 width=4)
                                                                             (actual time=0.006..137.796 rows=1000000 loops=1)
 Planning time: 4.833 ms
 Execution time: 34238.964 ms
(29 rows)
As anticipated, the CPU has trouble putting everything together. It's a nasty join, and in the end it takes 34 seconds.
This is the kind of workload where the GPU can really shine. After all, expensive joins are what this is all about:
test=# explain analyze SELECT count(*)
       FROM t_test AS a, t_join AS b, t_test AS c,
            t_join AS d, t_test AS e, t_join AS f
       WHERE a.id = b.id AND b.id = c.id AND c.id = d.id
         AND d.id = e.id AND e.id = f.id;
NOTICE:  GpuJoin(0x204508860) DataStoreNoSpace retry=1 [0.00%] src_nitems: 312465 max_space: 390581=>781162 nrooms: 101578=>390581 Nthreads: (312465, 499=>312465) Inners: (0, 1000000) Results: [312465, 312465]
NOTICE:  GpuJoin(0x204506060) DataStoreNoSpace retry=1 [0.00%] src_nitems: 312465 max_space: 390581=>781162 nrooms: 101578=>390581 Nthreads: (312465, 499=>312465) Inners: (0, 1000000) Results: [312465, 312465]
NOTICE:  GpuJoin(0x204505860) DataStoreNoSpace retry=1 [0.00%] src_nitems: 312465 max_space: 390581=>781162 nrooms: 101578=>390581 Nthreads: (312465, 499=>312465) Inners: (0, 1000000) Results: [312465, 312465]
                                       QUERY PLAN
------------------------------------------------------------------------------------------
 Aggregate  (cost=471554.15..471554.16 rows=1 width=0)
            (actual time=7270.748..7270.748 rows=1 loops=1)
   ->  Custom Scan (GpuJoin)  (cost=334846.07..471550.15 rows=1600 width=0)
                              (actual time=6212.393..7184.973 rows=1000000 loops=1)
         Bulkload: On (density: 100.00%)
         Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                  Nrows (in:25000000 out:1000000, 4.00% planned 0.01%),
                  KDS-Hash (size: 64.85MB planned 134.86KB, nbatches: 1 planned 1)
         Inner Buffer: (124.00MB), DMA nums: 9, size: 1116.01MB
         ->  Custom Scan (BulkScan) on t_test e  (cost=0.00..385136.36 rows=25000036 width=4)
                                                 (actual time=13.106..1103.447 rows=25000000 loops=1)
         ->  Custom Scan (GpuJoin)  (cost=192385.58..329089.66 rows=1600 width=20)
                                    (actual time=4240.682..5277.841 rows=1000000 loops=1)
               Bulkload: On (density: 100.00%)
               Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                        Nrows (in:25000000 out:1000000, 4.00% planned 0.01%),
                        KDS-Hash (size: 57.22MB planned 119.23KB, nbatches: 1 planned 1)
               Inner Buffer: (62.00MB), DMA nums: 15, size: 930.01MB
               ->  Custom Scan (BulkScan) on t_test c  (cost=0.00..385136.36 rows=25000036 width=4)
                                                       (actual time=13.160..1093.463 rows=25000000 loops=1)
               ->  Custom Scan (GpuJoin)  (cost=182921.05..186629.17 rows=1600 width=16)
                                          (actual time=3320.939..3444.434 rows=1000000 loops=1)
                     Bulkload: On (density: 100.00%)
                     Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                              Nrows (in:1000000 out:1000000, 100.00% planned 0.16%),
                              KDS-Hash (size: 57.22MB planned 2978.59KB, nbatches: 1 planned 1)
                     Inner Buffer: (62.00MB), DMA nums: 1, size: 62.00MB
                     ->  Custom Scan (BulkScan) on t_join  (cost=0.00..15406.00 rows=1000000 width=4)
                                                           (actual time=13.220..45.445 rows=1000000 loops=1)
                     ->  Custom Scan (GpuJoin)  (cost=41543.18..176974.98 rows=40000 width=12)
                                                (actual time=1736.740..2818.735 rows=1000000 loops=1)
                           Bulkload: On (density: 100.00%)
                           Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                                    Nrows (in:25000000 out:1000000, 4.00% planned 4.00%),
                                    KDS-Hash (size: 57.22MB planned 82.25MB, nbatches: 1 planned 1)
                           Depth 2: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                                    Nrows (in:1000000 out:1000000, 100.00% planned 4.00%),
                                    KDS-Hash (size: 57.22MB planned 82.25MB, nbatches: 1 planned 1)
                           Inner Buffer: (82.25MB)x(82.25MB), DMA nums: 10, size: 1645.10MB
                           ->  Custom Scan (BulkScan) on t_test a  (cost=0.00..385136.36 rows=25000036 width=4)
                                                                   (actual time=16.505..1599.395 rows=25000000 loops=1)
                           ->  Seq Scan on t_join b  (cost=0.00..15406.00 rows=1000000 width=4)
                                                     (actual time=0.022..111.602 rows=1000000 loops=1)
                           ->  Seq Scan on t_join d  (cost=0.00..15406.00 rows=1000000 width=4)
                                                     (actual time=0.005..95.175 rows=1000000 loops=1)
 Planning time: 11.106 ms
 Execution time: 8051.245 ms
(31 rows)
Given the number of rows involved, roughly 8 seconds is a really nice number to see; it demonstrates the real potential of a GPU.
You must, of course, keep in mind that I am executing arbitrary SQL statements, many of which are not very realistic. I have not yet tested PG-Strom against a real-world workload with more typical queries.
About Enteros
Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.