Home Artificial Intelligence Using DuckDB with Polars

Using DuckDB with Polars

1
Using DuckDB with Polars

Photo by Hans-Jurgen Mager on Unsplash

In my previous couple of articles on data analytics, I speak about two essential up-and-coming libraries which are currently gaining lots of tractions within the industry:

  • DuckDB — where you may query your dataset in-memory using SQL statements.
  • Polars — a way more efficient DataFrame library in comparison with the venerable Pandas library.

What about combining the facility of those two libraries?

The truth is, you may directly query a Polars dataframe through DuckDB, using SQL statements.

So what are the advantages of querying your Polars dataframe using SQL? Despite the benefit of use, manipulating Polars dataframes still require a little bit of practise and a comparatively steep learning curve. But since most developers are already accustomed to SQL, isn’t it more convenient to govern the dataframes directly using SQL? Using this approach, developers have one of the best of each worlds:

  • the flexibility to question Polars dataframes using all the assorted functions, or
  • use SQL for cases where it’s way more natural and easier to extract the info that they need

In this text, I gives you some examples of how you may make use of SQL through DuckDB to question your Polars dataframes.

For this text, I’m using Jupyter Notebook. Make sure that you might have installed Polars and DuckDB using the next commands:

!pip install polars
!pip install duckdb

To start, let’s create a Polars DataFrame by hand:

import polars as pl

df = pl.DataFrame(
{
'Model': ['iPhone X','iPhone XS','iPhone 12',
'iPhone 13','Samsung S11',
'Samsung S12','Mi A1','Mi A2'],
'Sales': [80,170,130,205,400,30,14,8],
'Company': ['Apple','Apple','Apple','Apple',
'Samsung','Samsung','Xiao Mi',
'Xiao Mi'],
})
df

Here’s how the dataframe looks:

All images by writer

Say, you now want to search out all phones from Apple which has sales of greater than 80. You should use the filter() function in Polars, like this:

df.filter(
(pl.col('Company') == 'Apple') &
(pl.col('Sales') > 80)
)

And the result looks like this:

Let’s now do the precise query that we did within the previous section, except that this time round we are going to use DuckDB with a SQL statement. But first, let’s select all of the rows within the dataframe:

import duckdb

result = duckdb.sql('SELECT * FROM df')
result

You possibly can directly reference the df dataframe out of your SQL statement.

Using DuckDB, you issue a SQL statement using the sql() function. Alternatively, the query() function also works:

result = duckdb.query('SELECT * FROM df')

The result variable is a duckdb.DuckDBPyRelation object. Using this object, you may perform quite a variety of different tasks, comparable to:

  • Getting the mean of the Sales column:
result.mean('Sales')
  • Describing the dataframe:
result.describe()
  • Applying a scaler function to the columns within the dataframe:
result.apply("max", 'Sales,Company')
  • Reordering the dataframe:
result.order('Sales DESC')

However the simplest way is to question the Polars DataFrame is to make use of SQL directly.

For instance, if you would like to get all of the rows with sales greater than 80, simply use the sql() function with the SQL statement below:

duckdb.sql('SELECT * FROM df WHERE Sales >80').pl()

The pl() function converts the duckdb.DuckDBPyRelation object to a Polars DataFrame. If you would like to convert it to a Pandas DataFrame as a substitute, use the df() function.

If you would like to get all of the rows whose model name starts with “iPhone”, then use the next SQL statement:

duckdb.sql("SELECT * FROM df WHERE Model LIKE 'iPhone%'").pl()

For those who want all devices from Apple and Xiao Mi, then use the next SQL statement:

duckdb.sql("SELECT * FROM df WHERE Company = 'Apple' OR Company ='Xiao Mi'").pl()

The true power of using DuckDB with Polars DataFrame is when you would like to query from multiple dataframes. Consider the next three CSV files from the 2015 Flights Delay dataset:

2015 Flights Delay datasethttps://www.kaggle.com/datasets/usdot/flight-delays. Licensing — CC0: Public Domain

  • flights.csv
  • airlines.csv
  • airports.csv

Let’s load them up using Polars:

import polars as pl

df_flights = pl.scan_csv('flights.csv')
df_airlines = pl.scan_csv('airlines.csv')
df_airports = pl.scan_csv('airports.csv')

display(df_flights.collect().head())
display(df_airlines.collect().head())
display(df_airports.collect().head())

The above statements use lazy evaluation to load up the three CSV files. This ensures that any queries on the dataframes will not be performed until all of the queries are optimized. The collect() function forces Polars to load the CSV files into dataframes.

Here is how the df_flights, df_airlines, and df_airports dataframes appear like:

Suppose you would like to count the variety of times an airline has a delay , and at the identical time display the name of every airline, here is the SQL statement that you would be able to use using the df_airlines and df_flights dataframes:

duckdb.sql('''
SELECT
count(df_airlines.AIRLINE) as Count,
df_airlines.AIRLINE
FROM df_flights, df_airlines
WHERE df_airlines.IATA_CODE = df_flights.AIRLINE AND df_flights.ARRIVAL_DELAY > 0
GROUP BY df_airlines.AIRLINE
ORDER BY COUNT DESC
''')

And here is the result:

If you would like to count the variety of airports in each state and kind the count in descending order, you need to use the next SQL statement:

duckdb.sql('''
SELECT STATE, Count(*) as AIRPORT_COUNT
FROM df_airports
GROUP BY STATE
ORDER BY AIRPORT_COUNT DESC
''')

Finally, suppose you would like to know which airline has the very best average delay. You should use the next SQL statement to calculate the assorted statistics, comparable to minimum arrival delay, maximum array delay, mean arrival delay, and standard deviation of arrival delay:

duckdb.sql('''
SELECT AIRLINE, MIN(ARRIVAL_DELAY), MAX(ARRIVAL_DELAY),
MEAN(ARRIVAL_DELAY), stddev(ARRIVAL_DELAY)
FROM df_flights
GROUP BY AIRLINE
ORDER BY MEAN(ARRIVAL_DELAY)
''')

Based on the mean arrival delay, we are able to see that the AS airline is the one with the shortest delay (as the worth is negative, this implies more often than not it arrives earlier!) and NK airline is the one with the longest delay. Need to know what’s the AS airline? Try it out using what you might have just learned! I’ll leave it as an exercise and the reply is at the top of this text.

For those who like reading my articles and that it helped your profession/study, please consider signing up as a Medium member. It’s $5 a month, and it gives you unlimited access to all of the articles (including mine) on Medium. For those who enroll using the next link, I’ll earn a small commission (at no additional cost to you). Your support implies that I’ll have the ability to devote more time on writing articles like this.

On this short article, I illustrated how DuckDB and Polars will be used together to question your dataframes. Utilizing each libraries gives you one of the best of each worlds — using a well-recognized querying language (which is SQL) to question an efficient dataframe. Go ahead and take a look at it out using your individual dataset and share with us the way it has helped your data analytics processes.

Answer to quiz:

duckdb.sql("SELECT AIRLINE from df_airlines WHERE IATA_CODE = 'AS'")

1 COMMENT

LEAVE A REPLY

Please enter your comment!
Please enter your name here