Preamble
Very many programmers look at SQL like a terrible beast. SQL – one of the few declarative languages, and its behavior is fundamentally different from how imperative, object-oriented and even functional programming languages work. Although many consider SQL to be somewhat functional.
This lesson is dedicated to those who:
- Already working with SQL, but do not fully understand it.
- He knows SQL pretty well, but he never really thought about its structure.
- Wants to teach SQL to others.
In this manual we will focus only on SELECT operators.
Other DML operators will be discussed in detail next time.
SQL is declarative
SELECT f_name, l_name
FROM empls
WHERE salary_id > 100000
It’s clear. And you don’t care where those empl records come from. You just want the ones with a decent salary_id.
What do we learn from that?
If it’s that simple, what’s the problem? The problem is that most of us think intuitively within a framework of imperative programming. It’s like, “machine, do this, then that, but first check both.” So, you have to store time results in variables, make loops, call functions, etc., etc.
Forget about all this. Think about how to declare things. Not about how to tell a machine, how to calculate something.
SQL code is not ordered
Usually confusion occurs because SQL code elements are not in the order in which they are executed. The lexical order looks like this:
SELECT [ DISTINCT ]
FROM
WHERE
GROUP BY
HAVING
UNION
ORDER BY
For simplicity, not all SQL sentences are listed. This verbal order is fundamentally different from the logical order (which in turn may differ from the execution order depending on the choice of the optimizer):
- FROM
- WHERE
- GROUP BY
- HAVING
- SELECT
- DISTINCT
- UNION
- ORDER BY
Three points that should be noted:
1. The first sentence is FROM, not SELECT. First you load data from the disk into memory so that you can work with it.
2. SELECT is executed after most other sentences. The main thing is after FROM and GROUP BY. This is important to understand if you think you can refer to items that you declare in the SELECT sentence from the WHERE sentence. The next one is impossible:
SELECT A.x + A.y AS z
FROM A
WHERE z = 10 -- z is not available here!
There are two options to reuse z. Either repeat the expression:
SELECT A.x + A.y AS z
FROM A
WHERE (A.x + A.y) = 10
… or resort to derived tables, common table expressions or data views to avoid code repetition. See the examples below.
3. UNION is put before ORDER BY in lexical and logical order. Many people think that every UNION subquery can be ordered, but this is not the case in most SQL dialects. Although some dialects allow ordering subqueries or derived tables, there is no guarantee that this order will be preserved after a UNION operation is performed.
Please note that not all databases implement things in the same way. Rule number 2, for example, does not apply exactly as described above to MySQL, PostgreSQL, and SQLite.
What do we learn from this?
To avoid common errors, you should always keep in mind the lexical and logical orders of SQL sentences. If you have learned these differences, it becomes clear why some things work and others do not.
Of course, it would be nice if the language was designed in such a way, when the lexical sequence actually reflects the logical order, as it is implemented in Microsoft LINQ.
The whole point of SQL is to refer to tables
Because of the difference between lexical ordering and logical ordering, most beginners probably think that column values are first class objects in SQL. But, no. The most important ones are references to tables.
The SQL standard defines the FROM sentence in the following way:
<from clause> ::=
FROM <table reference>
[ { <comma> <table reference> }... ]
The FROM sentence output is a combined reference to a table of combined values of all references. Let’s try to digest it.
FROM a, b
1
FROM a, b
The above entry makes a combined reference to the table of values a and b. If a has 3 columns and b has 5 columns, then the “output table” will consist of 8 (3 + 5) columns.
The entries contained in this combined table reference are a cross / Cartesian product of a x b. In other words, each record a forms a pair with record b. If a has 3 records and b has 5 records, the above table reference produces 15 records (3 x 5).
This output is converted into a GROUP BY sentence (after filtering in the WHERE sentence), where it is converted into a new output. We will deal with this later.
If we look at these examples in light of relational algebra / set theory, the SQL table is a ratio or set of tuples. Each SQL sentence converts one or more relationships to create new ones.
What do we learn from this?
Always think in terms of table references to understand how data is transferred in SQL sentences.
SQL table links can be quite functional
The reference to the table is a powerful thing. An example of their strength is the keyword JOIN, which is not really an expression of SELECT, but part of a special reference to a table. A combined table, as defined in the SQL standard (simplified):
<table reference> ::=
<table reference>
| <derived table>
| <derived table>
Let’s go back to our example:
FROM a, b
1
FROM a, b
a can essentially be a combined table:
a1 JOIN a2 ON a1.id = a2.id
1
a1 JOIN a2 ON a1.id = a2.id
Developing the previous expression, we will get:
FROM a1 JOIN a2 ON a1.id = a2.id, b
1
FROM a1 JOIN a2 ON a1.id = a2.id, b
Although it is not recommended to combine the comma-separated table reference list syntax with the syntax of the combined tables, you can do so. As a result, the merged reference will contain values a1+a2+b.
Derived tables are even more powerful than merged tables. We’ll get to that point.
What do we learn from this?
You always have to think in categories of links to tables. This will help us not only to understand how data is passed in SQL sentences (see previous section), but also how complex table references are created.
And, importantly, to understand, JOIN is the keyword for building linked tables. Not part of the SELECT operator. Some databases allow using JOIN in INSERT, UPDATE, DELETE operators.
In SQL, you should use JOIN tables instead of separated by commas
We’ve seen this offer before:
FROM a, b
1
FROM a, b
Advanced SQL developers will probably say that you should not use a comma-separated list at all, and always display JOIN tables. This will improve the readability of the SQL instruction, and thus prevent errors.
One very common mistake is to forget the JOIN predicate somewhere. Think about the following:
FROM a, b, c, d, e, f, g, h
WHERE a.a1 = b.bx
AND a.a2 = c.c1
AND d.d1 = b.bc
-- etc...
Syntax of connected tables:
Safer because join predicates can be placed near linked tables, thus preventing errors.
More expressive because you can distinguish between OUTER JOIN, INNER JOIN, etc.
What do we learn from this?
Always use JOIN. Never use references to comma-separated tables in FROM sentences.
Different JOIN operations in SQL
JOIN operations consist mainly of the following five types:
- EQUI JOIN
- SEMI JOIN
- ANTI JOIN
- CROSS JOIN
- DIVISION
These terms are commonly used in relational algebra. SQL uses other terms for the above concepts, if they exist at all. Let’s take a closer look at them:
EQUI JOIN
This is the most common JOIN operation. Contains two subtypes:
- INNER JOIN (or simply JOIN).
- OUTER JOIN (further subdivided into LEFT, RIGHT, FULL OUTER JOIN)
The difference is better explained by example:
-- The link to the table contains the authors and their books.
-- There's one entry for each book and its author.
-Authors without books are NOT included.
author JOIN book ON author.id = book.author_id
-- The link to the table contains the authors and their books.
-- There's one entry for each book and its author.
--... Or there's an entry for "empty" (blank) for authors without books
-- ("empty" means all columns in the book are NULL)
author LEFT OUTER JOIN book ON author.id = book.author_id
SEMI JOIN
This relational concept in SQL can be expressed in two ways: Using the IN predicate or using the EXISTS predicate. “Semi” in Latin means “half”. This type of connection is used to combine only the “half” of a table reference. What does it mean? Let’s look again at the above mentioned merging of author and book. Let’s imagine that we don’t need the author-book combination, but only authors who have books. Then we can write:
-- Using IN
FROM author
WHERE author.id IN (SELECT book.author_id FROM book)
-- Using EXISTS
FROM author
WHERE EXISTS (SELECT 1 FROM book WHERE book.author_id = author.id)
Although there is no general rule, you should prefer IN or EXISTS, but you can say something like this:
IN predicates are easier to read than EXISTS predicates.
EXISTS predicates are more expressive than IN predicates (i.e. they are easier to express than complex SEMI JOIN predicates).
There is no noticeable difference in performance. But big difference in performance can be in some databases.
Since INNER JOIN also produces results for authors who have books, many beginners will think that you can remove duplicates with the DISTINCT keyword. They think that SEMI JOIN can be expressed in the following way:
-- Find only those authors who have books.
SELECT DISTINCT first_name, last_name
FROM author
JOIN book ON author.id = book.author_id
This is not recommended for two reasons:
Performance is slowed down because the database only needs to load a lot of data into memory in order to remove duplicates.
This solution is not quite right, although it does lead to correct results in this example. But if you combine more links to tables, removing duplicates becomes a real problem.
Detailed information about DISTINCT abuse can be found in this post.
ANTI JOIN
This relational concept is the exact opposite of SEMI JOIN. It can be formed by simply adding the keyword NOT in IN or EXISTS predicates. An example where we will select those authors who do not have books:
-- Using IN
FROM author
WHERE author.id NOT IN (SELECT book.author_id FROM book)
-- Using EXISTS
FROM author
WHERE NOT EXISTS (SELECT 1 FROM book WHERE book.author_id = author.id)
The same rules apply for performance, readability and expression. However, there is a slight nuance to NULL when using NOT IN, but this topic is out of the scope of our lesson.
CROSS JOIN
This operator creates a cross-production of two linked references to a table by combining each entry of the first reference with the entry of the second reference to a table. We have seen before that this can be achieved by linking to tables with commas separated in the FROM sentence. In the rare cases when this is really necessary, in most SQL dialects, a CROSS JOIN cross-connection can be written explicitly:
-- Combine each author with a book
author CROSS JOIN book
DIVISION
Relational division is what a real beast is. In short, if JOIN is multiplication, division is the opposite of JOIN operation. Relational divisions are very difficult to express in SQL. Since this is a lesson for beginners, we do not consider division. For the bravest, information can be found here, here and here.
What do we learn from it?
A lot of things. Let’s put it in our heads. There are references to tables everywhere in SQL. Connected tables are quite complex links to tables. But there’s a difference between relational expressions and SQL expressions. Not all relational connection operations are also formal SQL connection operations. With a little practice and knowledge in relational theory, you can always choose the right type of relational connection JOIN and correctly present it in SQL.
SQL derivative tables are similar to tabular variables
Previously we learned that SQL is a declarative language, and has no variables (although in some SQL dialects they do exist). But you can write something similar to variables. These beasts are called derived tables.
A derived table is nothing more than a nested query enclosed in parentheses.
-- Derived table
FROM (SELECT * FROM author)
Note that some SQL dialects require that derived tables have a correlation name (also called an alias).
-- Alias derivative table
FROM (SELECT * FROM author) a
Derived tables are great if you want to bypass the problems caused by the logical ordering of SQL sentences. For example, if you want to reuse a column expression in a SELECT and WHERE sentence, just write (Oracle dialect):
-- Get the name of the authors and their age in the days.
SELECT first_name, last_name, age
FROM (
SELECT first_name, last_name, current_date - date_of_birth age
FROM author
)
-- If the age is more than 10,000 days
WHERE age > 10000
Note that some databases and the SQL:1999 standard took the derivative tables to the next level by introducing generalized table expressions. This will allow you to reuse the same derived table several times in one SQL SELECT instruction. The above query will be translated to (almost) equivalent:
WITH a AS (
SELECT first_name, last_name, current_date - date_of_birth age
FROM author
)
SELECT *
FROM a
WHERE age > 10000
Obviously, “a” can also be displayed in a separate view for wider use of common SQL subqueries. Learn more about views here.
What do we learn from it?
Again and again and again. In SQL, everything is tied to links, not columns. How to use it. Don’t be afraid of writing derivative tables or other complex links to tables.
SQL GROUP BY converts previous table links to
Let’s discuss our FROM proposal again:
FROM a, b
1
FROM a, b
And now for the linked link (see above) we apply the GROUP BY proposal.
GROUP BY A.x, A.y, B.z
1
GROUP BY A.x, A.y, B.z
The proposal makes a new reference to the table with only three remaining columns (!). Let’s try to digest it. With GROUP BY, the number of available columns in all subsequent logical proposals, including SELECT, is reduced. To refer only to the columns from the GROUP BY proposal in the SELECT proposal is possible for syntactic reasons.
Note that other columns may still be available as arguments for aggregate functions:
SELECT A.x, A.y, SUM(A.z)
FROM A
GROUP BY A.x, A.y
It should be noted that MySQL, unfortunately, does not adhere to this standard, which only creates confusion. Do not get caught on tricks in MySQL. GROUP BY converts the links to the table So you can only refer to the columns, also mentioned in the proposal GROUP BY.
What do we learn from this?
GROUP BY, again, works with table links, transforming them into a new form.
SQL SELECT in relational algebra is called projection
I personally like the term “projection” because it is used in relational algebra. Once you have created a link to the table, filtered it and converted it, you can go to projection in another form. The SELECT offer is similar to a projector. Tabular function that uses an expression for the value of a string to convert each record from a previously created table reference to the final result.
In the SELECT proposal it is possible to work with columns by creating complex column expressions as part of the record/rows.
There are many special rules regarding the nature of available expressions, functions, etc. The main thing to keep in mind is the following:
You can only use references to columns derived from the table reference in “output”.
- If you have a GROUP BY offer, you can only reference the columns from that offer or aggregate functions.
- If you do not have a GROUP BY offer, you can use window functions instead of aggregate functions.
- If there is no GROUP BY offer, you cannot combine aggregate and non-aggregate functions.
- There are some rules concerning the transfer of regular functions to aggregate functions and vice versa.
- There are a lot of complicated rules. With which you can fill out another lesson.
For example, the reason why it is not possible to combine aggregate functions with non-aggregate functions in the SELECT instruction projection without GROUP BY (Rule No. 4) is as follows:
- It doesn’t make sense. It’s intuitive.
- If intuition doesn’t help (for example, for beginners in SQL), syntactic rules help. In SQL:1999 the GROUPING SETS operator is implemented, and in SQL:2003 the empty grouping sets operator: GROUP BY (). Whenever there is an aggregate function and there is no explicit offer of GROUP BY, an implicit empty GROUPING SET is used (rule #2). Consequently, the original rules on logical ordering are no longer correct, and projection (SELECT) affects the result of the logically preceding but lexically consistent sentence (GROUP BY).
Are we confused? Yeah. Me, too. Let’s go back to simpler things.
What do we learn from this?
The SELECT sentence can be one of the most complex sentences in SQL, even if it looks simple. All the other sentences just transfer references to tables from one to another. The SELECT sentence spoils the beauty of these links by completely transforming them by applying rules to them.
To understand SQL, you need to learn everything else before using a SELECT statement. Even if SELECT is the first sentence in lexical order, it must be the last.
SQL DISTINCT, UNION, ORDER BY and OFFSET is much simpler
After a complex SELECT we can return to simple truths:
- Set operations (DISTINCT and UNION)
- Order operations (ORDER BY, OFFSET … FETCH)
Set Operations
Set operations work with sets, which are actually tables. Almost. Theoretically, it’s not difficult to understand.
- DISTINCT removes duplicates after the projection.
- UNION combines two subqueries and removes duplicates
- The UNION ALL combines two subqueries, saving duplicates
- EXCEPT removes records from the first subquery which are also contained in the second subquery (and then removes duplicates)
- INTERSECT saves only the records contained in both subqueries (and then deletes duplicates)
Removing duplicates in all these cases makes no sense. UNION ALL should be used most often to combine subqueries.
Arrangement operations
Arrangement is not a relational function. It is a function intended for SQL only. It is applied at the very end of lexical ordering and logical ordering of an SQL instruction. Using ORDER BY and OFFSET. FETCH is the only way to ensure that index records can be accessed in a reliable way. All other ordering methods are always random and random, even if they may seem reproducible.
OFFSET … FETCH is just one variant of the syntax. Other options include LIMIT, OFFSET in MySQL and PostgreSQL … or TOP and START AT in SQL Server and Sybase. Also a good overview of the different ways to implement OFFSET… You can see FETCH here.
Let’s get to work
As in every language, it takes practice to master SQL. The 10 simple steps above will help you understand common SQL operations. On the other hand, it is also a good idea to learn from common mistakes.
About Enteros
Enteros offers a patented database performance management SaaS platform. It proactively identifies root causes of complex business-impacting database scalability and performance issues across a growing number of clouds, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.
Are you interested in writing for Enteros’ Blog? Please send us a pitch!
RELATED POSTS
Enhancing Identity and Access Management in Healthcare with Enteros
- 19 November 2024
- Database Performance Management
In the fast-evolving world of finance, where banking and insurance sectors rely on massive data streams for real-time decisions, efficient anomaly man…
Maximizing Efficiency with Enteros: Revolutionizing Cost Allocation Through a Cloud Center of Excellence
In the fast-evolving world of finance, where banking and insurance sectors rely on massive data streams for real-time decisions, efficient anomaly man…
Driving Efficiency in the Transportation Sector: Enteros’ Cloud FinOps and Database Optimization Solutions
- 18 November 2024
- Database Performance Management
In the fast-evolving world of finance, where banking and insurance sectors rely on massive data streams for real-time decisions, efficient anomaly man…
Empowering Nonprofits with Enteros: Optimizing Cloud Resources Through AIOps Platform
In the fast-evolving world of finance, where banking and insurance sectors rely on massive data streams for real-time decisions, efficient anomaly man…