There are some SQL patterns that, once you know them, you start seeing them everywhere. The solutions to the puzzles I'll show you today are quite simple SQL queries, but understanding the concepts behind them will certainly unlock new solutions for the queries you write on a day-to-day basis.
These challenges are all based on real-world scenarios, as over the past few months I made a point of writing down every puzzle-like query that I had to construct. I also encourage you to try them for yourself, so you can challenge yourself first, which will improve your learning!
All queries to generate the datasets will be provided in a PostgreSQL- and DuckDB-friendly syntax, so you can easily copy and play with them. At the end, I will also provide a link to a GitHub repo containing all of the code, as well as the answer to the bonus challenge I'll leave for you!
I organized these puzzles in order of increasing difficulty, so, if you find the first ones too easy, at least take a look at the last one, which uses a technique that I truly believe you won't have seen before.
Okay, let’s start.
I love this puzzle because of how short and simple the final query is, even though it deals with many edge cases. The data for this challenge shows tickets moving between Kanban stages, and the objective is to find how long, on average, tickets stay in the Doing stage.
The data contains the ID of the ticket, the date the ticket was created, the date of the move, and the "from" and "to" stages of the move. The stages present are New, Doing, Review, and Done.
Some things you need to know (edge cases):
- Tickets can move backwards, meaning tickets can return to the Doing stage.
- You should not include tickets that are still stuck in the Doing stage, as there is no way to know how long they will stay there.
- Tickets are not always created in the New stage.
CREATE TABLE ticket_moves (
ticket_id INT NOT NULL,
create_date DATE NOT NULL,
move_date DATE NOT NULL,
from_stage TEXT NOT NULL,
to_stage TEXT NOT NULL
);
INSERT INTO ticket_moves (ticket_id, create_date, move_date, from_stage, to_stage)
VALUES
-- Ticket 1: Created in "New", then moves to Doing, Review, Done.
(1, '2024-09-01', '2024-09-03', 'New', 'Doing'),
(1, '2024-09-01', '2024-09-07', 'Doing', 'Review'),
(1, '2024-09-01', '2024-09-10', 'Review', 'Done'),
-- Ticket 2: Created in "New", then moves: New → Doing → Review → Doing again → Review.
(2, '2024-09-05', '2024-09-08', 'New', 'Doing'),
(2, '2024-09-05', '2024-09-12', 'Doing', 'Review'),
(2, '2024-09-05', '2024-09-15', 'Review', 'Doing'),
(2, '2024-09-05', '2024-09-20', 'Doing', 'Review'),
-- Ticket 3: Created in "New", then moves to Doing. (Edge case: no subsequent move from Doing.)
(3, '2024-09-10', '2024-09-16', 'New', 'Doing'),
-- Ticket 4: Created already in "Doing", then moves to Review.
(4, '2024-09-15', '2024-09-22', 'Doing', 'Review');
A summary of the data:
- Ticket 1: Created in the New stage, moves normally to Doing, then Review, and then Done.
- Ticket 2: Created in New, then moves: New → Doing → Review → Doing again → Review.
- Ticket 3: Created in New, moves to Doing, but it is still stuck there.
- Ticket 4: Created in the Doing stage, moves to Review afterwards.
It might be a good idea to stop for a bit and think about how you would deal with this. Can you figure out how long a ticket stays in a single stage?
Honestly, this sounds intimidating at first, and it seems like it will be a nightmare to deal with all the edge cases. Let me show you the full solution to the problem, and then I'll explain what is happening.
WITH stage_intervals AS (
SELECT
ticket_id,
from_stage,
move_date
- COALESCE(
LAG(move_date) OVER (
PARTITION BY ticket_id
ORDER BY move_date
),
create_date
) AS days_in_stage
FROM
ticket_moves
)
SELECT
-- Cast avoids integer division in PostgreSQL
SUM(days_in_stage)::numeric / COUNT(DISTINCT ticket_id) AS avg_days_in_doing
FROM
stage_intervals
WHERE
from_stage = 'Doing';

The first CTE uses the LAG function to find the previous move of each ticket, which marks the time the ticket entered its current stage. Calculating the duration is as simple as subtracting the previous date from the move date.
What you should notice is the use of COALESCE on the previous move date. If a ticket doesn't have a previous move, it falls back to the creation date of the ticket. This takes care of tickets created directly in the Doing stage, as the query will still properly calculate the time it took to leave that stage.
This is the result of the first CTE, showing the time spent in each stage. Notice how Ticket 2 has two entries, as it visited the Doing stage on two separate occasions.
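Worked out by hand from the sample inserts, that intermediate result is:

ticket_id | from_stage | days_in_stage
----------+------------+--------------
        1 | New        | 2
        1 | Doing      | 4
        1 | Review     | 3
        2 | New        | 3
        2 | Doing      | 4
        2 | Review     | 3
        2 | Doing      | 5
        3 | New        | 6
        4 | Doing      | 7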

With this done, it's just a matter of computing the average as the SUM of total days spent in Doing divided by the distinct number of tickets that ever left the stage. Doing it this way, instead of simply using AVG, makes sure the two rows for Ticket 2 are properly counted as a single ticket.
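With the sample data, that works out to (4 + 4 + 5 + 7) / 3 = 20 / 3 ≈ 6.67 days in Doing.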
Not so bad, right?
The goal of this second challenge is to find the most recent contract sequence of each employee. A break in a sequence happens when two contracts have a gap of more than one day between them.
In this dataset, there are no contract overlaps, meaning that a contract for the same employee either has a gap before it or ends the day before the new one starts.
CREATE TABLE contracts (
contract_id integer PRIMARY KEY,
employee_id integer NOT NULL,
start_date date NOT NULL,
end_date date NOT NULL
);
INSERT INTO contracts (contract_id, employee_id, start_date, end_date)
VALUES
-- Employee 1: Two continuous contracts
(1, 1, '2024-01-01', '2024-03-31'),
(2, 1, '2024-04-01', '2024-06-30'),
-- Employee 2: One contract, then a gap of 3 days, then two contracts
(3, 2, '2024-01-01', '2024-02-15'),
(4, 2, '2024-02-19', '2024-04-30'),
(5, 2, '2024-05-01', '2024-07-31'),
-- Employee 3: One contract
(6, 3, '2024-03-01', '2024-08-31');

As a summary of the data:
- Employee 1: Has two continuous contracts.
- Employee 2: One contract, then a gap of 3 days, then two contracts.
- Employee 3: One contract.
The expected result, given the dataset, is that all contracts should be included except for the first contract of Employee 2, which is the only one separated from its successor by a gap. Concretely, contracts 1, 2, 4, 5, and 6 should be returned, while contract 3 is left out.
Before explaining the logic behind the solution, I would like you to think about what operation can be used to join the contracts that belong to the same sequence. Focus only on the second row of data: what information do you need to know whether this contract was a break or not?
I hope it's clear that this is, once again, the perfect situation for window functions. They are incredibly useful for solving problems like this, and knowing when to use them helps a lot in finding clean solutions.
The first thing to do, then, is to get the end date of the previous contract for the same employee with the LAG function. With that, it's easy to compare both dates and check whether there was a break in the sequence.
WITH ordered_contracts AS (
SELECT
*,
LAG(end_date) OVER (PARTITION BY employee_id ORDER BY start_date) AS previous_end_date
FROM
contracts
),
gapped_contracts AS (
SELECT
*,
-- Deals with the case of the first contract, which won't have
-- a previous end date. In this case, it is still the start of a new
-- sequence.
CASE WHEN previous_end_date IS NULL
OR previous_end_date < start_date - INTERVAL '1 day' THEN
1
ELSE
0
END AS is_new_sequence
FROM
ordered_contracts
)
SELECT * FROM gapped_contracts ORDER BY employee_id ASC;

An intuitive way to continue the query is to number the sequences of each employee. For example, an employee with no gaps will always be on their first sequence, while an employee who had 5 breaks in contracts will be on their 5th sequence. Funnily enough, this is done with yet another window function.
--
-- Previous CTEs
--
sequences AS (
SELECT
*,
SUM(is_new_sequence) OVER (PARTITION BY employee_id ORDER BY start_date) AS sequence_id
FROM
gapped_contracts
)
SELECT * FROM sequences ORDER BY employee_id ASC;

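Worked out by hand from the sample data, the sequences CTE produces:

contract_id | employee_id | start_date | end_date   | previous_end_date | is_new_sequence | sequence_id
------------+-------------+------------+------------+-------------------+-----------------+------------
          1 |           1 | 2024-01-01 | 2024-03-31 | (null)            | 1               | 1
          2 |           1 | 2024-04-01 | 2024-06-30 | 2024-03-31        | 0               | 1
          3 |           2 | 2024-01-01 | 2024-02-15 | (null)            | 1               | 1
          4 |           2 | 2024-02-19 | 2024-04-30 | 2024-02-15        | 1               | 2
          5 |           2 | 2024-05-01 | 2024-07-31 | 2024-04-30        | 0               | 2
          6 |           3 | 2024-03-01 | 2024-08-31 | (null)            | 1               | 1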
Notice how Employee 2 starts sequence #2 right after the first gapped contract. To finish the query, I grouped the data by employee, got the value of their most recent sequence, and then did an inner join with the sequences to keep only the most recent one.
--
-- Previous CTEs
--
max_sequence AS (
SELECT
employee_id,
MAX(sequence_id) AS max_sequence_id
FROM
sequences
GROUP BY
employee_id
),
latest_contract_sequence AS (
SELECT
c.contract_id,
c.employee_id,
c.start_date,
c.end_date
FROM
sequences c
JOIN max_sequence m ON c.sequence_id = m.max_sequence_id
AND c.employee_id = m.employee_id
ORDER BY
c.employee_id,
c.start_date
)
SELECT
*
FROM
latest_contract_sequence;

As expected, our result is basically the starting data, just with the first contract of Employee 2 missing!
Finally, the last puzzle — I’m glad you made it this far.
For me, this is the most mind-blowing one, as when I first encountered the problem I thought of a completely different solution that would have been a mess to implement in SQL.
For this puzzle, I've changed the context from what I actually had to deal with at my job, as I think it will make the explanation easier.
Imagine you're a data analyst at an event venue, analyzing the talks scheduled for an upcoming event. You want to find the time of day when the highest number of talks will be happening at the same time.
This is what you should know about the schedules:
- Rooms are booked in 30-minute increments, e.g. from 9h to 10h30.
- The data is clean; there are no overbookings of meeting rooms.
- There can be back-to-back meetings in a single meeting room.

Meeting schedule visualized (this is the actual data).
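Rendered as text (derived from the data below), where each column is a 30-minute slot:

Slot start:   9:00   9:30   10:00  10:30  11:00  11:30
Room A         ##     ##     ##     ##     ##     ##
Room B                ##     ##     ##     ##
Room C         ##     ##                          ##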
CREATE TABLE meetings (
room TEXT NOT NULL,
start_time TIMESTAMP NOT NULL,
end_time TIMESTAMP NOT NULL
);
INSERT INTO meetings (room, start_time, end_time) VALUES
-- Room A meetings
('Room A', '2024-10-01 09:00', '2024-10-01 10:00'),
('Room A', '2024-10-01 10:00', '2024-10-01 11:00'),
('Room A', '2024-10-01 11:00', '2024-10-01 12:00'),
-- Room B meetings
('Room B', '2024-10-01 09:30', '2024-10-01 11:30'),
-- Room C meetings
('Room C', '2024-10-01 09:00', '2024-10-01 10:00'),
('Room C', '2024-10-01 11:30', '2024-10-01 12:00');

The way to solve this is by using what is called a Sweep Line Algorithm, also known as an event-based solution. This last name really helps in understanding what will be done: the idea is that instead of dealing with intervals, which is what we have in the original data, we deal with events.
To do this, we need to transform every row into two separate events. The first event will be the start of the meeting, and the second will be the end of the meeting.
WITH events AS (
-- Create an event for the start of each meeting (+1)
SELECT
start_time AS event_time,
1 AS delta
FROM meetings
UNION ALL
-- Create an event for the end of each meeting (-1)
SELECT
-- Small trick to handle back-to-back meetings (explained later)
end_time - INTERVAL '1 minute' AS event_time,
-1 AS delta
FROM meetings
)
SELECT * FROM events;

Take the time to understand what is happening here. To create two events from a single row of data, we're simply unioning the dataset with itself; the first half uses the start time as the timestamp, and the second half uses the end time.
You might already notice the delta column and see where this is going. When an event starts, we count it as +1; when it ends, we count it as -1. You might even already be thinking of another window function to solve this, and you're right!
But before that, let me explain the trick I used on the end dates. As I don't want back-to-back meetings to count as two concurrent meetings, I subtract a single minute from every end date. This way, if one meeting ends and another starts at 10h30, it won't be assumed that two meetings are happening at the same time at 10h30.
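For example, Room A's back-to-back meetings from 9h to 10h and from 10h to 11h become the events (9:00, +1), (9:59, -1), (10:00, +1), and (10:59, -1), so the running count for that room never reads two at the handoff.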
Okay, back to the query, and one more window function. This time, though, the function of choice is a rolling SUM.
--
-- Previous CTEs
--
ordered_events AS (
SELECT
event_time,
delta,
SUM(delta) OVER (ORDER BY event_time, delta DESC) AS concurrent_meetings
FROM events
)
SELECT * FROM ordered_events ORDER BY event_time DESC;

The rolling SUM on the delta column essentially walks down every record and counts how many events are active at that moment. For example, at 9 am sharp, it sees two events starting, so it marks the number of concurrent meetings as two!
When the third meeting starts, the count goes up to three. But when it gets to 9h59 (effectively 10h, because of the one-minute trick), two meetings end, bringing the counter back down to one. With this data, the only thing missing is to find when the highest number of concurrent meetings happens.
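Worked out by hand from the sample data (shown ascending here; the query above displays it descending), the ordered events are:

event_time       | delta | concurrent_meetings
-----------------+-------+--------------------
2024-10-01 09:00 |  +1   | 1
2024-10-01 09:00 |  +1   | 2
2024-10-01 09:30 |  +1   | 3
2024-10-01 09:59 |  -1   | 2
2024-10-01 09:59 |  -1   | 1
2024-10-01 10:00 |  +1   | 2
2024-10-01 10:59 |  -1   | 1
2024-10-01 11:00 |  +1   | 2
2024-10-01 11:29 |  -1   | 1
2024-10-01 11:30 |  +1   | 2
2024-10-01 11:59 |  -1   | 1
2024-10-01 11:59 |  -1   | 0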
--
-- Previous CTEs
--
max_events AS (
-- Find the utmost concurrent meetings value
SELECT
event_time,
concurrent_meetings,
RANK() OVER (ORDER BY concurrent_meetings DESC) AS rnk
FROM ordered_events
)
SELECT event_time, concurrent_meetings
FROM max_events
WHERE rnk = 1;

That's it! The interval of 9h30 to 10h is the one with the largest number of concurrent meetings, which checks out with the schedule visualization above!
This solution looks incredibly simple, in my opinion, and it works for so many situations. Whenever you are dealing with intervals now, consider whether the query wouldn't be easier if you thought of it from the perspective of events.
But before you move on, and to really nail down this concept, I want to leave you with a bonus challenge, which is also a common application of the Sweep Line Algorithm. I hope you give it a try!
Bonus challenge
The context for this one is still the same as the last puzzle, but now, instead of trying to find the period with the most concurrent meetings, the objective is to find bad scheduling. It seems that there are overlaps in the meeting rooms, which need to be listed so they can be fixed ASAP.
How would you find out whether the same meeting room has two or more meetings booked at the same time? Here are some tips on how to solve it:
- It's still the same algorithm.
- This means you will still do the UNION, but it will look slightly different.
- You should think from the perspective of each meeting room.
You can use this data for the challenge (a starting-point sketch follows it):
CREATE TABLE meetings_overlap (
room TEXT NOT NULL,
start_time TIMESTAMP NOT NULL,
end_time TIMESTAMP NOT NULL
);
INSERT INTO meetings_overlap (room, start_time, end_time) VALUES
-- Room A meetings
('Room A', '2024-10-01 09:00', '2024-10-01 10:00'),
('Room A', '2024-10-01 10:00', '2024-10-01 11:00'),
('Room A', '2024-10-01 11:00', '2024-10-01 12:00'),
-- Room B meetings
('Room B', '2024-10-01 09:30', '2024-10-01 11:30'),
-- Room C meetings
('Room C', '2024-10-01 09:00', '2024-10-01 10:00'),
-- Overlaps with previous meeting.
('Room C', '2024-10-01 09:30', '2024-10-01 12:00');
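If you want a nudge before checking the repo, here is one possible shape for the modified events CTE. Treat it as a sketch based on the tips above, not necessarily the repo's exact answer; the key change is keeping the room so the rolling SUM can later be partitioned by it:

WITH events AS (
-- Start of each meeting counts as +1, per room
SELECT room, start_time AS event_time, 1 AS delta
FROM meetings_overlap
UNION ALL
-- End of each meeting counts as -1; the one-minute trick still applies,
-- so back-to-back meetings don't register as overlaps
SELECT room, end_time - INTERVAL '1 minute' AS event_time, -1 AS delta
FROM meetings_overlap
)
-- From here, a rolling SUM with PARTITION BY room reveals any moment
-- where a single room has two or more active meetings.
SELECT * FROM events ORDER BY room, event_time;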
If you're interested in the solution to this puzzle, as well as the rest of the queries, check this GitHub repo.
The first takeaway from this blog post is that window functions are overpowered. Ever since I got more comfortable using them, I feel that my queries have become much simpler and easier to read, and I hope the same happens to you.
If you're interested in learning more about them, you'd probably enjoy reading this other blog post I've written, where I go over how you can understand and use them effectively.
The second takeaway is that the patterns used in these challenges really do occur in many other places. You might need to find sequences of subscriptions, measure customer retention, or detect overlapping tasks. There are many situations where you will need to use window functions in a very similar fashion to what was done in the puzzles.
The third thing I want you to remember is this approach of using events instead of dealing with intervals. I've looked back at some problems I solved a long time ago where I could have used this pattern to make my life easier, and unfortunately, I didn't know about it at the time.
I really do hope you enjoyed this post and gave the puzzles a shot yourself. I'm sure that if you made it this far, you either learned something new about SQL or strengthened your knowledge of window functions!
Thanks a lot for reading. If you have questions or just want to get in touch, don't hesitate to contact me at mtrentz.com.