Preamble
Producing sample data
Before we begin, we must first create some example data:
test=# CREATE TABLE a (aid int); CREATE TABLE test=# CREATE TABLE b (bid int); CREATE TABLE test=# INSERT INTO a VALUES (1), (2), (3), (4); INSERT 0 4 test=# INSERT INTO b VALUES (2), (3), (4), (4), (5); INSERT 0 5
It has a rather straightforward structure. For our little demonstration, it comprises of two tables with a few rows each.
Explicit vs implicit joins
A common question is what distinguishes an implicit join from an explicit join. Syntactic sugar is the simple solution. An implicit join appears here.
test=# SELECT * FROM a, b WHERE aid = bid; aid | bid -----+----- 2 | 2 3 | 3 4 | 4 4 | 4 (4 rows)
In this case, the WHERE clause is where the join actually occurs. We just list the tables we want to join in the FROM clause. Use an explicit join as an alternative:
test=# SELECT * FROM a JOIN b ON (aid = bid); aid | bid -----+----- 2 | 2 3 | 3 4 | 4 4 | 4 (4 rows)
They are both identical in syntax. All that matters in this case is a matter of taste and fashion. The execution plans for both proposals are exactly the same if you look at them.
test=# explain SELECT * FROM a, b WHERE aid = bid; QUERY PLAN ----------------------------------------------------------------- Merge Join (cost=393.42..893.85 rows=32512 width=8) Merge Cond: (a.aid = b.bid) -> Sort (cost=179.78..186.16 rows=2550 width=4) Sort Key: a.aid -> Seq Scan on a (cost=0.00..35.50 rows=2550 width=4) -> Sort (cost=179.78..186.16 rows=2550 width=4) Sort Key: b.bid -> Seq Scan on b (cost=0.00..35.50 rows=2550 width=4) (8 rows)
Here is the second plan …
test=# explain SELECT * FROM a JOIN b ON (aid = bid); QUERY PLAN ----------------------------------------------------------------- Merge Join (cost=393.42..893.85 rows=32512 width=8) Merge Cond: (a.aid = b.bid) -> Sort (cost=179.78..186.16 rows=2550 width=4) Sort Key: a.aid -> Seq Scan on a (cost=0.00..35.50 rows=2550 width=4) -> Sort (cost=179.78..186.16 rows=2550 width=4) Sort Key: b.bid -> Seq Scan on b (cost=0.00..35.50 rows=2550 width=4) (8 rows)
For most queries, the optimizer basically treats both requests in the same way. There is no difference if you happen to run a normal application. However, there are a few minute variations that merit addressing. Let’s look at the configuration setting listed below, which is located in postgresql.conf:
test=# SHOW join_collapse_limit; join_collapse_limit --------------------- 8 (1 row)
This indicates that eight explicit joins are planned implicitly. PostgreSQL will be able to choose the join order for those joins automatically if you use conventional implicit joins. This was carried out to hasten the inquiry. In practice, this implies that PostgreSQL may reorder the order in which the data is joined if the optimizer determines that it is essential. If you are using explicit joins, this is only done for the first 8 joins; the remaining joins are processed in the order specified by the query. Although there is a distinction between implicit and explicit joins, most applications will not often be affected by this. Joining more than ten tables is typically the exception rather than the rule. Remember that combining numerous tables will already result in genetic query optimization (GEQO).
The choice between explicit and implicit joins is typically one of preference rather than performance. However, keep in mind that there are differences when looking at highly sophisticated queries.
What is an inner join? What is an outer join?
The following common query is: What distinguishes an inner join from an outer join? The correct response is that an inner join will search those tables for shared entries. Every single piece of information on one side will be used in an outer join to locate matches on the other side.
Here’s an illustration:
test=# SELECT * FROM a LEFT JOIN b ON (aid = bid); aid | bid -----+----- 1 | 2 | 2 3 | 3 4 | 4 4 | 4 (5 rows)
Here, we display every record on table “a” (sometimes known as the “left side”) and search for any rows that match on the right side. A RIGHT JOIN is the polar opposite of a LEFT JOIN. The following inquiry is the same as the previous one:
test=# SELECT * FROM b RIGHT JOIN a ON (aid = bid); bid | aid -----+----- | 1 2 | 2 3 | 3 4 | 4 4 | 4 (5 rows)
Full joins
A full join is a third choice, though. It finds matches by combining ALL of the data from both sides. If there is no match on either side of the join, NULL values will be filled in. Here’s an illustration:
test=# SELECT * FROM a FULL JOIN b ON (aid = bid); aid | bid -----+----- 1 | 2 | 2 3 | 3 4 | 4 4 | 4 | 5 (6 rows)
Even though this is easy to do, people often make mistakes because they don’t know what outer joins are or how they work.
Outer joins: Common mistakes and errors
What happens more often than not is that people add an AND clause to the join condition. What does it mean for the result:
test=# SELECT * FROM a LEFT JOIN b ON (aid = bid AND bid = 2); aid | bid -----+----- 1 | 2 | 2 3 | 4 | (4 rows)
Contrary to what most people believe, the result set will not be less than it was. The LEFT JOIN will still produce all the rows on the left; the ON condition only changes which row matches. The additional condition does not reduce the amount of data; it merely causes certain items to become NULL. So, a LEFT JOIN that doesn’t make sense from a semantic point of view is often reported to our PostgreSQL 24-7 help desk as a performance problem.
This can conveniently mask the underlying semantic issue, especially when aggregates are concerned:
test=# SELECT count(*), count(bid) FROM a LEFT JOIN b ON (aid = bid AND bid = 2); count | count -------+------- 4 | 1 (1 row)
People often see wrong results because they think the new criterion will get rid of some data.
IN: What is a semi-join?
But there is more. Often people ask: What is a semi-join? The answer is quite simple: Think of an IN statement:
test=# SELECT * FROM a WHERE aid IN (SELECT bid FROM b); aid ----- 2 3 4 (3 rows)
The implicit DISTINCT filter of the IN-statement eliminates duplicate entries. Therefore, the method used to handle duplication distinguishes between a join and a semi-join.
This duplication is eliminated, as the implementation plan makes quite evident. In this instance, a HashAggregate is used (which you will frequently see when a GROUP BY statement is used):
test=# explain SELECT * FROM a WHERE aid IN (SELECT bid FROM b); QUERY PLAN ----------------------------------------------------------------------- Hash Join (cost=46.38..102.75 rows=1275 width=4) Hash Cond: (a.aid = b.bid) -> Seq Scan on a (cost=0.00..35.50 rows=2550 width=4) -> Hash (cost=43.88..43.88 rows=200 width=4) -> HashAggregate (cost=41.88..43.88 rows=200 width=4) Group Key: b.bid -> Seq Scan on b (cost=0.00..35.50 rows=2550 width=4) (7 rows)
The opposite of an IN statement is NOT IN:
test=# SELECT * FROM a WHERE aid NOT IN (SELECT bid FROM b); aid ----- 1 (1 row)
In this case we remove all other rows from the result.
About Enteros
Enteros offers a patented database performance management SaaS platform. It proactively identifies root causes of complex business-impacting database scalability and performance issues across a growing number of clouds, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.
Are you interested in writing for Enteros’ Blog? Please send us a pitch!
RELATED POSTS
Optimizing Database Performance with Enteros and AWS Resource Groups: A RevOps Approach to Streamlined Efficiency
- 13 November 2024
- Database Performance Management
In the fast-evolving world of finance, where banking and insurance sectors rely on massive data streams for real-time decisions, efficient anomaly man…
Enhancing Healthcare Data Integrity: How Enteros, Logical Models, and Database Security Transform Healthcare Operations
In the fast-evolving world of finance, where banking and insurance sectors rely on massive data streams for real-time decisions, efficient anomaly man…
Optimizing Budgeting and Cost Allocation in the Finance Sector with Enteros: A Smarter Approach to Financial Efficiency
- 12 November 2024
- Database Performance Management
In the fast-evolving world of finance, where banking and insurance sectors rely on massive data streams for real-time decisions, efficient anomaly man…
Enteros and Cloud FinOps: Unleashing Big Data Potential for eCommerce Profitability
In the fast-evolving world of finance, where banking and insurance sectors rely on massive data streams for real-time decisions, efficient anomaly man…