Clustering: A simple technique to group similar rows and prevent unnecessary data processing
In my previous article, I explained how to optimise SQL queries using partitioning.
Now, I’m writing the sequel! (Dad joke, anyone?)
This article will look at clustering: another powerful optimisation technique you can use in BigQuery. Like partitioning, clustering can help you write more performant queries that are quicker and cheaper to run. If you want to develop your SQL toolkit and build those higher-level Data Science skills, this is a great place to start.
In BigQuery, a clustered table is a table that keeps similar rows grouped together in physical “blocks”.
For instance, picture a table called user_signups that keeps track of all the people registering an account on a fictitious website. It’s got 4 columns:
registration_date: the date on which the user created an account
country: the country where the user is based
tier: the user’s plan (“Free” or “Paid”)
username: the user’s username
If we wanted, we could cluster the table by country so that users from the same country are stored near one another in the table:
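As a rough sketch, the DDL for such a table might look like the following. The dataset name my_dataset and the column types are assumptions made for illustration; the article only names the table and its four columns.

```sql
-- Sketch only: `my_dataset` is a hypothetical dataset name,
-- and the column types are assumed rather than taken from the article.
CREATE TABLE my_dataset.user_signups (
  registration_date DATE,    -- the date on which the user created an account
  country           STRING,  -- the country where the user is based
  tier              STRING,  -- the user's plan ("Free" or "Paid")
  username          STRING   -- the user's username
)
CLUSTER BY country;          -- keep rows with the same country in nearby blocks

-- A query that filters on the clustering column gives BigQuery the chance
-- to skip blocks that cannot contain matching rows.
-- 'Portugal' is just an illustrative value.
SELECT username, tier
FROM my_dataset.user_signups
WHERE country = 'Portugal';
```

The CLUSTER BY clause is the only line that differs from an ordinary CREATE TABLE statement; everything else about how you query the table stays the same.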