Home Artificial Intelligence Why Do You Must Use SQL Grouping Sets for Aggregating Data?

Why Do You Must Use SQL Grouping Sets for Aggregating Data?

0
Why Do You Must Use SQL Grouping Sets for Aggregating Data?

Photo by Helena Lopes on Unsplash

Even though it’s called a question language, SQL is able to not only querying databases but additionally performing efficient data evaluation and manipulation. It isn’t a surprise that SQL is embraced by the information science community.

In this text, we are going to find out about a really handy SQL feature, which allows for writing cleaner and more efficient queries. This I-wish-I-knew-this-earlier feature is the GROUPING SETS, which might be regarded as an extension of the GROUP BY function.

We’ll learn the difference between them in addition to the advantage of using GROUPING SETS over the GROUP BY function but first, we’d like a dataset to work on.

I created a SQL table from the Melbourne housing dataset available on Kaggle with a public domain license. The primary 5 rows of the table looks as follows:

(image by creator)

The GROUP BY function

We are able to use the function to calculate aggregate values per group or distinct values in a column or multiple columns. As an example, the next query returns the typical price for every listing type.

SELECT 
type,
AVG(price) AS avg_price
FROM melb
GROUP BY type

The output of this question is:

(image by creator)

Multiple groupings

Let’s say you desire to see the typical price for every region within the northern area, which might be achieved by utilizing the GROUP BY function as follows:

SELECT 
regionname,
AVG(price) AS avg_price
FROM melb
WHERE regionname LIKE 'Northern%'
GROUP BY regionname

The output:

(image by creator)

Consider a case where you desire to see the typical price of various house types in these two regions in the identical table. You may achieve this by writing two groupings and mixing the outcomes with UNION ALL.

SELECT 
regionname,
'all' AS type,
AVG(price) AS average_price
FROM melb
WHERE regionname LIKE 'Eastern%'
GROUP BY regionname
UNION ALL
SELECT
regionname,
type,
AVG(price) AS average_price
FROM melb
WHERE regionname LIKE 'Eastern%'
GROUP BY regionname, type
ORDER BY regionname, type

What the query does is to calculate the typical price for every region first. Then, in a separate query, it groups the rows by each region name and kind and calculates the typical price for every group. The union combines the output of those two queries.

For the reason that first query doesn’t have the kind column, we create it manually with a worth of “all”. Finally, the combined results are ordered by the region name and the kind.

The output of this question:

(image by creator)

The primary row for every region shows the region average and the next rows show the typical price for various house types.

We had to put in writing two separate queries because we cannot have different queries in a GROUP BY statement unless we use GROUPING SETS.

GROUPING SETS

Let’s rewrite the previous query using GROUPING SETS.

SELECT 
regionname,
type,
AVG(price) as average_price
FROM melb
WHERE regionname LIKE 'Eastern%'
GROUP BY
GROUPING SETS (
(regionname),
(regionname, type)
)
ORDER BY regionname, type

The output:

(image by creator)

The output is similar apart from the null values in the kind column which might easily get replaced with “all”.

Using the GROUPING SETS has two major benefits:

  • It’s shorter and more intuitive which makes the code easier to debug and manage
  • It’s more efficient and performant than writing separate queries and mixing the outcomes because SQL scans the tables for every query.

Final thoughts

We regularly disregard query readability and efficiency. We’re joyful if the query returns the specified data.

Efficiency is something we at all times have to have in mind. The impact of writing bad queries could also be tolerated when querying a small database. Nonetheless, when the information size becomes large, bad queries may result in serious performance issues. In an effort to make ETL processes scalable and easy-to-manage, we’d like to adapt best practices. The GROUPING SETS is considered one of these best practices.

You may turn out to be a Medium member to unlock full access to my writing, plus the remainder of Medium. For those who already are, don’t forget to subscribe if you happen to’d prefer to get an email at any time when I publish a latest article.

Thanks for reading. Please let me know if you will have any feedback.

LEAVE A REPLY

Please enter your comment!
Please enter your name here