Understanding SQL: Order of Execution

Artificial Intelligence

Understanding SQL: Order of Execution

admin

April 13, 2023

A temporary guide on how databases interpret your SQL queries

Introduction

Writing efficient SQL queries is a vital skill for any data analyst working with large volumes of knowledge. I’m sure lots of us have endured the pain of getting developed a question that runs well on small scales, only to have it slowly grind out the outcomes when applied to a much larger dataset.

Oftentimes, query performance might be significantly improved by simply understanding how a database interprets the query. Not only will this make it easier to optimize queries for speed and performance, but it is going to also help debug and troubleshoot erroneous scripts.

So today, I’m going to step you thru the order during which a SQL query is executed and touch on some common errors that arise when constructing queries.

Declarative vs. Procedural Languages

First, it’s necessary to grasp that SQL is a declarative programming language. Because of this we define the result we would like but provide no instruction on how it’s achieved. This contrasts with imperative, or procedural, languages that require each step to supply the output to be explicitly defined. The implication of working with a declarative language like SQL is that, while SQL expects statements to be written in a specified order, the sequence during which the statements are evaluated differs.

To reveal, here is the list of the seven common clauses used when constructing SQL queries and the order they needs to be utilized in:

1. SELECT
2. FROM
3. WHERE
4. GROUP BY
5. HAVING
6. ORDER BY
7. LIMIT

Now compare this with the order of execution:

1. FROM
2. WHERE
3. GROUP BY
4. HAVING
5. SELECT
6. ORDER BY
7. LIMIT

As you’ll be able to see, the statements have been shuffled around a bit. For instance, note that while the SELECT clause is written first it is way lower down the list relating to execution. As we’ll see shortly, it’s the order of execution that matters most and is something an analyst should be keenly aware of.

The FROM Clause

Naturally, the database must know where data is coming from so this can be a logical first step. While simpler queries might only reference a single table, more often the knowledge you wish exists across several tables. Accordingly, the JOIN statement is used along with FROM to mix source tables. If a join is required then the very first thing the database will do is bring the whole lot together.

Because of this you must think in regards to the size of the source tables, the style of join used, and the variety of predicates utilized in the join. For instance, reducing the dimensions of source tables by choosing only the essential columns, filtering out unnecessary rows, and ensuring there are common identifiers to finish the join, will all improve efficiency. Moreover, INNER JOINs needs to be preferred over OUTER JOINs as the previous are generally faster.

Ultimately, you don’t wish to process data that isn’t needed so minimizing the set to work with needs to be a primary objective, as much as possible.

The WHERE Clause

This clause is used to filter a base table, or joined output, by retuning only those rows that meet a given condition. Any supported data type might be used to filter records. For instance, consider the table below which lists a small variety of Commonwealth cities, together with their populations:

A small example table called ‘cities’ (image by writer).

If we were to filter this table all the way down to only those cities in Recent Zealand we could write the next query, for instance:

SELECT 
city
,country
FROM 
cities
WHERE 
country = 'Recent Zealand';

which might return the rows containing Auckland, Christchurch, and Wellington. Alternatively, if we desired to return all cities which have a population larger than half 1,000,000 then the query would seem like this:

SELECT
city
,country
FROM
cities
WHERE 
population > 500000;

We could also mix these filters using the AND operator, which might return Auckland only:

SELECT
city
,country
FROM
cities
WHERE 
country = 'Recent Zealand' 
AND population > 500000;

One necessary thing to recollect in regards to the WHERE clause is that it cannot be used to filter aggregated columns. For instance, take a have a look at the modified query below:

SELECT 
country
,SUM(popualtion) 
FROM 
cities
WHERE 
SUM(popualtion) > 5000000
GROUP BY 
country;

The intent of the above query is to acquire all countries where the combined city population is larger than 5 million. Unfortunately, this question will fail because an aggregation function has been utilized in the WHERE statement. The difficulty is that aggregation functions require the GROUP BY clause which is executed after the WHERE clause. This implies the WHERE condition can’t be evaluated since the database is just not yet aware of any aggregated variables.

We’ll see find out how to treatment this shortly, but before we do let’s quickly touch on the GROUP BY clause.

The GROUP BY Clause

As you’ve probably recognized, this clause allows us to aggregate, or summarise, a quantity and is used together with functions comparable to COUNT(), SUM(), MIN(), MAX(), and the like. In effect, GROUP BY collapses the variable, or variables, and returns a single value for every distinct element, or combination of elements. For instance, if we desired to tally the town populations for every country, we will group them by country like so:

SELECT 
country
,SUM(popualtion) 
FROM 
cities
GROUP BY 
country;

The output will then return 4 rows — one for every country — together with the aggregated population for every country listed within the table.

The HAVING Clause

This clause solves the issue encountered earlier when attempting to filter using an aggregation function with the WHERE clause. The HAVING clause allows us to filter results using grouped and aggregated data since it is executed after the GROUP BY statement. The database is now aware of the aggregations meaning they might be utilized in all statements that follow. We will now amend the sooner query like so:

SELECT 
country
,SUM(population)
FROM 
cities
GROUP BY 
country
HAVING 
SUM(popualtion) > 5000000;

This may return just two countries: Australia and England.

The SELECT Clause

The SELECT clause is where we define the columns we would like in our table, together with any grouped and aggregated fields. This can be where we will apply column aliases using the AS operator. Now, while the select statement comes first when constructing our queries, it isn’t executed until the info have been sourced and filtered. This is significant to acknowledge because what this implies is that aggregated variables and aliases cannot be utilized in WHERE, GROUP BY, or HAVING statements.

For instance, consider the next query that creates a column alias total_pop which is then used with the HAVING clause. This question will throw an error since the alias has not yet been created. The HAVING clause precedes the SELECT clause so there may be nothing called total_pop to reference.

SELECT 
country
,SUM(population) AS total_pop
FROM 
cities
GROUP BY 
country
HAVING 
total_pop > 5000000;

I won’t go into detail on these, however the DISTINCT and UNION statements are executed after SELECT and before the ORDER BY clause, with DISTINCT executed before UNION.

The ORDER BY Clause

We’re now reaching the top and far of the heavy lifting has been done. We now have sourced (and possibly joined) tables, applied some filtering, grouped and aggregated some fields, and specified the columns we would like to be included in our final table.

At this point, you’re likely desirous about how you wish the info arranged within the goal table. For instance, you may have rows ordered chronologically, or perhaps ordered based on some rating value. That is precisely what the ORDER BY clause does.

The good about this statement is that, since it’s on the backend of the order, we will use aggregations and column aliases in our GROUP BY statements. For instance, suppose we desired to order country by total city population. We could write a question just like the following:

SELECT 
country
,SUM(population) AS total_pop
FROM 
cities
GROUP BY 
country
ORDER BY
total_pop;

Note here that we will use the column alias total_pop within the ORDER BY statement. By default, it will return records in ascending order (i.e., smallest to largest). To return the rows in descending order we will use the DESC operator like so:

SELECT 
country
,SUM(population) AS total_pop
FROM 
cities
GROUP BY 
country
ORDER BY
total_pop DESC;

The LIMIT Clause

When coping with large tables it’s often not optimal to have the query return all rows, particularly whenever you’re only developing and testing. The LIMIT clause is exceedingly useful here and allows us to define the variety of rows we would like back. It’s also used together with the ORDER BY clause to return the n-top or n-bottom records. For instance, suppose we wanted the highest three most populous cities within the table. We could use the ORDER BY and LIMIT clauses as follows:

SELECT 
city
,country
,population AS city_pop
FROM 
cities
ORDER BY
city_pop DESC
LIMIT
3;

Note that not all databases support the LIMIT statement, but they may have equivalents that perform an identical function.

Wrapping Up

The order during which statements are executed is a vital concept to know when constructing SQL queries, and we’ve touched on some common gotchas that may occur. Though I haven’t provided in-depth examples I hope this temporary little primer gets you desirous about find out how to improve your query performance, and in the event you’re just starting out with SQL, I hope this text helps you along your journey.