
How to Rewrite and Optimize Your SQL Queries to Pandas in 5 Easy Examples


Querying an entire table

We can dive right in with the classic SELECT ALL from a table.

Here’s the SQL:

SELECT * FROM df

And here's the Pandas code:

df
Pandas code output — Image by author

All you need to do is call the DataFrame in Pandas to return the whole table and all its columns.

You might also want to look at just a small subset of your table as a quick check before writing a more complicated query. In SQL, you'd use LIMIT 10 or something similar to get only a select number of rows. In Pandas, similarly, you can call df.head(10) or df.tail(10) to get the first or last 10 rows of the table.
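As a quick sketch of the LIMIT equivalents (assuming df is already loaded):

# SQL: SELECT * FROM df LIMIT 10
df.head(10)  # first 10 rows

# and for the last 10 rows
df.tail(10)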

Querying a table without null values

To add to our initial select query, in addition to just limiting the number of rows, you'd put conditions to filter the table inside a WHERE clause in SQL. For example, if you wanted all rows in the table without any null values in the Order_ID column, the SQL would look like this:

SELECT * FROM df WHERE Order_ID IS NOT NULL

In Pandas, you have two options:

# Option 1
df.dropna(subset="Order_ID")

# Option 2
df.loc[df["Order_ID"].notna()]

Pandas code output — Image by author

Now, the table we get back doesn't have any null values in the Order_ID column (which you can compare to the first output above). Both options will return a table without the null values, but they work slightly differently.

You can use the native dropna method in Pandas to return the DataFrame without any null rows, specifying in the subset parameter which columns you'd like to drop nulls from.
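As a small sketch, subset also accepts a list, so you could drop rows with nulls in more than one column (the Product column here is only used as an example):

# drop rows where either Order_ID or Product is null
df.dropna(subset=["Order_ID", "Product"])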

Alternatively, the loc method lets you pass a mask (a boolean Series) to filter the DataFrame. Here, we pass df["Order_ID"].notna(), which, if you called it on its own, would return a Series of True and False values that map to the original DataFrame rows, indicating whether the Order_ID is null. When we pass it to the loc method, it instead returns the DataFrame where df["Order_ID"].notna() evaluates to True (so all rows where the Order_ID column isn't null).
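To see what that mask looks like on its own, you can run the notna call by itself; the values below are just an illustration of the kind of output you'd get:

df["Order_ID"].notna()
# 0     True
# 1     True
# 2    False
# ...
# Name: Order_ID, dtype: bool  (illustrative values)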

Querying specific columns from a table

Next, instead of selecting all columns from the table, let's select just a few specific columns. In SQL, you'd write the column names in the SELECT part of the query like this:

SELECT Order_ID, Product, Quantity_Ordered FROM df

In Pandas, we’d write the code like this:

df[["Order_ID", "Product", "Quantity_Ordered"]]
Pandas code output — Image by author

To select a specific subset of columns, you can pass a list of the column names into the DataFrame in Pandas. You can also define the list separately like this for clarity:

target_cols = ["Order_ID", "Product", "Quantity_Ordered"]
df[target_cols]

Assigning a list of target columns that you can then pass into a DataFrame can make it a bit easier to work with a table over time, when you need to make changes in your code. For example, you could have a function return the columns you want as a list, or append and remove columns from the list as needed depending on what kind of output the user needs.
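As a rough sketch of that idea (the function name and flag here are made up for illustration), a helper could build the list based on what the user needs:

def get_target_cols(include_quantity=True):
    # base columns we always want to select
    cols = ["Order_ID", "Product"]
    if include_quantity:
        # optionally include the quantity column
        cols.append("Quantity_Ordered")
    return cols

df[get_target_cols()]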

The GROUP BY in SQL and Pandas

We can now move on to aggregating data. In SQL, we do this by passing a column we want to group on to the SELECT and GROUP BY clauses, and then wrapping a column in an aggregate measure like COUNT in the SELECT clause as well. For example, doing so will let us group all the individual Order_ID rows in the original table for each Product and count how many there are. The query can look like this:

SELECT
    Product,
    COUNT(Order_ID)
FROM df
WHERE Order_ID IS NOT NULL
GROUP BY Product

In Pandas, it would look like this:

df[df["Order_ID"].notna()].groupby(["Product"])["Order_ID"].count()
Pandas code output — Image by author

The output is a Pandas Series where the table is grouped by the products and there's a count of all the Order_ID values for each product. In addition to our previous query in Pandas where we included a filter, we now do three things:

  1. Add groupby and pass a column (or list of columns) that you want to group the DataFrame on;
  2. Pass the name of the column in square brackets on the raw grouped DataFrame;
  3. Call the count (or some other aggregate) method to perform the aggregation on the DataFrame for the target column (see the sketch after this list).
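As a small sketch of swapping the aggregate, you could, for example, count the distinct Order_ID values per product instead of all of them:

# same grouping, different aggregate: unique Order_ID values per product
df[df["Order_ID"].notna()].groupby("Product")["Order_ID"].nunique()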

For better readability, we can assign the condition to a variable (this will come in handy later) and format the query so it's easier to read.

condition = df["Order_ID"].notna()

grouped_df = (
    df.loc[condition]
    .groupby("Product")
    ["Order_ID"]  # select the column to count
    .count()
)
grouped_df

Now that we have most of the components of a complete SQL query, let's take a look at a more complicated one and see what it would look like in Pandas.

SELECT
    Product,
    COUNT(Order_ID)
FROM df
WHERE Order_ID IS NOT NULL
    AND Purchase_Address LIKE '%Los Angeles%'
    AND Quantity_Ordered = 1
GROUP BY Product
ORDER BY COUNT(Order_ID) DESC

Here, we add a bit to our previous query by including multiple filter conditions as well as an ORDER BY, so that the table returned by our query is sorted by the measure we're aggregating on. Since there are a few more components to this query, let's look step by step at how we'd implement it in Pandas.

First, instead of passing multiple conditions when we call the loc method, let's define a list of conditions and assign them to a variable FILTER_CONDITIONS.

FILTER_CONDITIONS = [
    df["Order_ID"].notna(),
    df["Purchase_Address"].str.contains("Los Angeles"),
    df["Quantity_Ordered"] == "1",
]

As before, a condition passed into loc should be a Pandas mask that evaluates to either True or False. It's possible to pass multiple conditions to loc, but the syntax needs to look like this:

df.loc[condition_1 & condition_2 & condition_3]

However, just passing a list of conditions like this won't work:

df.loc[FILTER_CONDITIONS]
# doesn't work -> you can't just pass a list of masks into loc

You'll get an error if you try the above, because each condition needs to be separated by the & operator for "and" conditions (or the | operator if you need "or" conditions). Instead, we can write some quick code to combine the conditions into the correct format. We'll make use of the functools.reduce function to put the conditions together.

If you want to see what this looks like in a notebook, and see what it looks like to combine some strings using the reduce function, try this:

from functools import reduce

reduce(lambda x, y: f"{x} & {y}", ["condition_1", "condition_2", "condition_3"])

This outputs a string like this:

>>> 'condition_1 & condition_2 & condition_3'

Going back to our actual Pandas conditions, we can write this instead (without the string formatting, just using our defined list of conditions in the FILTER_CONDITIONS variable):

reduce(lambda x, y: x & y, FILTER_CONDITIONS)

What reduce does is apply a function cumulatively to the elements of an iterable, or in our case run the lambda function over the items in our FILTER_CONDITIONS list, combining each of them with the & operator. This runs until there are no conditions left, or in this case, for all three conditions it would effectively return:

df["Order_ID"].notna() & df["Purchase_Address"].str.incorporates("Los Angeles") & df["Quantity_Ordered"] == "1"

Finally, let’s add the list of conditions to create a final group by query in Pandas:

final_df = (
    df
    .loc[reduce(lambda x, y: x & y, FILTER_CONDITIONS)]
    .groupby("Product")
    .size()
    .sort_values(ascending=False)
)

You’ll notice two additional differences from the previous query:

  1. Instead of specifying the specific column to count on, we can simply call the size method, which returns the number of rows in each group of the DataFrame (as before, every Order_ID value was unique and meant to represent one row when we counted on it); see the sketch after this list;
  2. There are a few different ways to do the ORDER BY in Pandas; one way is to simply call sort_values and pass ascending=False to sort in descending order.
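To make the difference between the two aggregations concrete, here is a quick sketch; they only diverge when a group still contains null Order_ID values, which we have already filtered out here:

# counts every row in each group, regardless of nulls
df.groupby("Product").size()

# counts only the non-null Order_ID values in each group
df.groupby("Product")["Order_ID"].count()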

If you wanted to use the previous syntax for aggregating the data, it would look like this:

final_df = (
    df
    .loc[reduce(lambda x, y: x & y, FILTER_CONDITIONS)]
    .groupby("Product")
    ["Order_ID"].count()
    .sort_values(ascending=False)
)
Pandas code output — Image by author

The output of both methods will be the same as before: a Series with the column you're grouping on and the counts for each product.

If instead you wanted to output a DataFrame, you can call the reset_index method on the Series to get the original column names back for the column you grouped on and the column you're aggregating on (in this case, we grouped on "Product" and are counting the "Order_ID").

final_df.reset_index()
Pandas code output — Image by author

And there we have it! All the components of a full SQL query, but written in Pandas. Some of the things we can do further to optimize this process for working with data over time include:

  • Putting the different lists of columns to SELECT or GROUP BY into their own variables or functions (so you or a user can modify them over time);
  • Moving the logic that combines the list of filter conditions into its own function, so the end user doesn't have to puzzle over what the reduce logic is doing;
  • After calling reset_index, renaming the output column (or columns, if we're aggregating on multiple) for clarity, for example to "Count_Order_ID" (as sketched below).
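That last renaming step could look something like this (assuming the version of final_df that counts Order_ID, with the new column name just as an example):

final_df.reset_index().rename(columns={"Order_ID": "Count_Order_ID"})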
