A Beginner's Guide to the True Order of SQL Operations

Reading a blog post about the true order of SQL operations. I was having trouble with SQL earlier and found this article, so I am going to read it.

Date Created:
2 481

References



Notes


The lexical order of operations does not match the logical order of operations.

GROUP BY contributes the most confusion:

-- Wrong
SELECT first_name, count(*)
FROM customer
WHERE count(*) > 1
GROUP BY first_name

-- Correct
SELECT first_name, count(*)
FROM customer
GROUP BY first_name
HAVING count(*) > 1

-- Correct
SELECT first_name, count(*)
FROM customer
GROUP BY first_name
ORDER BY count(*) DESC

-- Wrong
SELECT first_name, last_name, count(*)
FROM customer
GROUP BY first_name

-- Correct
SELECT first_name, MAX(last_name), count(*)
FROM customer
GROUP BY first_name

-- Wrong
SELECT first_name || ' ' || last_name, count(*)
FROM customer
GROUP BY first_name

-- Correct
SELECT first_name || ' ' || MAX(last_name), count(*)
FROM customer
GROUP BY first_name

-- Correct
SELECT MAX(first_name || ' ' || last_name), count(*)
FROM customer
GROUP BY first_name

Even though SQL is written in plain English, all the different SQL clauses have extremely complex dependencies. To find out all the interdependencies, read the SQL standard documents. The lexical (syntactical) order of SQL operations (clauses) does not correspond at all to the logical order of operations (although, sometimes they do coincidentally). Thanks to modern optimizers, the order also doesn't correspond to the actual order of operations, so we really have: syntactical logical actual order. The logical order of operations:

  • FROM
    • This is actually the first thing that happens, logically. Before anything else, we're loading all the rows from all the tables and join them. This is what logically happens first, not actually (the optimizer will probably not do this operation first, but instead access some index based on the WHERE clause). JOIN clauses are actually part of the FROM clause.
  • WHERE
    • Once we have loaded all the rows from the tables above, we can now throw them away again using WHERE
  • GROUP BY
    • If you want, you can take the rows that remain after WHERE and put them in groups or buckets, where each group contains the same value for the GROUP BY expression (and all the other rows are put in a list for that group).
  • aggregations
    • Aggregate functions are calculated right after GROUP BY logically. This explains why we cannot put an aggregate function in the WHERE clause, because its value cannot be accessed yet.
  • HAVING
    • Now you can access aggregate function values. For instance, you can check that COUNT(*)>1 in the HAVING clause.
  • WINDOW
  • SELECT
    • We now use all the rows that are produced from the above clauses and create new rows / tuples from them using SELECT.
  • DISTINCT
    • DISTINCT happens after SELECT
  • UNION, INTERSECT, EXCEPT
    • A UNION is an operator that connects two subqueries. Everything we've talked about thus far was a subquery.
  • ORDER BY
    • It makes sense to postpone the ordering of a result until the end.
  • OFFSET
    • Don't use OFFSET
  • LIMIT, FETCH, TOP
    • Sane databases put the LIMIT (MySQL, PostgreSQL) or FETCH (DB2, Oracle 12c, SQL Server 2012) clause at the very end, syntactically.


If you ever want to do something that is not in the right order, the simplest track is always to resort to a derived table:

-- Doesn't work, cannot put window functions in GROUP BY
SELECT ntile(4) ORDER BY (age) AS bucket, MIN(age), MAX(age)
FROM customer
GROUP BY ntile(4) ORDER BY (age)

-- Works:
SELECT bucket, MIN(age), MAX(age)
FROM (
  SELECT age, ntile(4) ORDER BY (age) AS bucket
  FROM customer
) c
GROUP BY bucket

Always think about the logical order of operations


Further Reading


You can read more about how comments are sorted in this blog post.

User Comments