A Beginner's Guide to the True Order of SQL Operations
Reading a blog post about the true order of SQL operations. I was having trouble with SQL earlier and found this article, so I am going to read it.
References
Notes
The lexical order of operations does not match the logical order of operations.
GROUP BY
contributes the most confusion:
-- Wrong
SELECT first_name, count(*)
FROM customer
WHERE count(*) > 1
GROUP BY first_name
-- Correct
SELECT first_name, count(*)
FROM customer
GROUP BY first_name
HAVING count(*) > 1
-- Correct
SELECT first_name, count(*)
FROM customer
GROUP BY first_name
ORDER BY count(*) DESC
-- Wrong
SELECT first_name, last_name, count(*)
FROM customer
GROUP BY first_name
-- Correct
SELECT first_name, MAX(last_name), count(*)
FROM customer
GROUP BY first_name
-- Wrong
SELECT first_name || ' ' || last_name, count(*)
FROM customer
GROUP BY first_name
-- Correct
SELECT first_name || ' ' || MAX(last_name), count(*)
FROM customer
GROUP BY first_name
-- Correct
SELECT MAX(first_name || ' ' || last_name), count(*)
FROM customer
GROUP BY first_name
Even though SQL is written in plain English, all the different SQL clauses have extremely complex dependencies. To find out all the interdependencies, read the SQL standard documents. The lexical (syntactical) order of SQL operations (clauses) does not correspond at all to the logical order of operations (although, sometimes they do coincidentally). Thanks to modern optimizers, the order also doesn't correspond to the actual order of operations, so we really have: syntactical logical actual order. The logical order of operations:
FROM
- This is actually the first thing that happens, logically. Before anything else, we're loading all the rows from all the tables and join them. This is what logically happens first, not actually (the optimizer will probably not do this operation first, but instead access some index based on the
WHERE
clause).JOIN
clauses are actually part of theFROM
clause.
- This is actually the first thing that happens, logically. Before anything else, we're loading all the rows from all the tables and join them. This is what logically happens first, not actually (the optimizer will probably not do this operation first, but instead access some index based on the
WHERE
- Once we have loaded all the rows from the tables above, we can now throw them away again using
WHERE
- Once we have loaded all the rows from the tables above, we can now throw them away again using
GROUP BY
- If you want, you can take the rows that remain after
WHERE
and put them in groups or buckets, where each group contains the same value for theGROUP BY
expression (and all the other rows are put in a list for that group).
- If you want, you can take the rows that remain after
- aggregations
- Aggregate functions are calculated right after
GROUP BY
logically. This explains why we cannot put an aggregate function in theWHERE
clause, because its value cannot be accessed yet.
- Aggregate functions are calculated right after
HAVING
- Now you can access aggregate function values. For instance, you can check that
COUNT(*)>1
in theHAVING
clause.
- Now you can access aggregate function values. For instance, you can check that
WINDOW
- If you’re using the awesome window function feature, this is the step where they’re all calculated. Note that PostgreSQL and Sybase SQL Anywhere have an actual
WINDOW
clause!
- If you’re using the awesome window function feature, this is the step where they’re all calculated. Note that PostgreSQL and Sybase SQL Anywhere have an actual
SELECT
- We now use all the rows that are produced from the above clauses and create new rows / tuples from them using
SELECT
.
- We now use all the rows that are produced from the above clauses and create new rows / tuples from them using
DISTINCT
DISTINCT
happens afterSELECT
UNION
,INTERSECT
,EXCEPT
- A
UNION
is an operator that connects two subqueries. Everything we've talked about thus far was a subquery.
- A
ORDER BY
- It makes sense to postpone the ordering of a result until the end.
OFFSET
- Don't use
OFFSET
- Don't use
LIMIT
,FETCH
,TOP
- Sane databases put the
LIMIT
(MySQL, PostgreSQL) orFETCH
(DB2, Oracle 12c, SQL Server 2012) clause at the very end, syntactically.
- Sane databases put the
If you ever want to do something that is not in the right order
, the simplest track is always to resort to a derived table:
-- Doesn't work, cannot put window functions in GROUP BY
SELECT ntile(4) ORDER BY (age) AS bucket, MIN(age), MAX(age)
FROM customer
GROUP BY ntile(4) ORDER BY (age)
-- Works:
SELECT bucket, MIN(age), MAX(age)
FROM (
SELECT age, ntile(4) ORDER BY (age) AS bucket
FROM customer
) c
GROUP BY bucket
Always think about the logical order of operations