If you work with Python for data, you have probably experienced the frustration of waiting minutes for a Pandas operation to complete.
At first, everything seems fine, but as your dataset grows and your workflows become more complex, your laptop suddenly sounds like it's preparing for lift-off.
A few months ago, I worked on a project analyzing e-commerce transactions with over 3 million rows of data.
It was a fairly interesting experience, but more often than not, I watched simple groupby operations that normally ran in seconds suddenly stretch into minutes.
That was when I realized that Pandas is amazing, but it isn't always enough.
This article explores modern alternatives to Pandas, including Polars and DuckDB, and examines how they can simplify and speed up the handling of large datasets.
For clarity, let me be upfront about a few things before we start.
This article is not a deep dive into Rust memory management or a proclamation that Pandas is obsolete.
Instead, it's a practical, hands-on guide. You will see real examples, personal experiences, and actionable insights into workflows that can save you time and sanity.
Why Pandas Can Feel Slow
Back when I was on the e-commerce project, I remember working with CSV files over two gigabytes, where each filter or aggregation in Pandas often took several minutes to finish.
During that time, I'd stare at the screen, wishing I could just grab a coffee or binge a few episodes of a show while the code ran.
The main pain points I encountered were speed, memory, and workflow complexity.
We all know how large CSV files can eat enormous amounts of RAM; in my case, sometimes more than my laptop could comfortably handle. On top of that, chaining multiple transformations made the code harder to maintain and slower to execute.
Polars and DuckDB address these challenges in different ways.
Polars, built in Rust, uses multi-threaded execution to process large datasets efficiently.
DuckDB, on the other hand, is designed for analytics and executes SQL queries without needing to load everything into memory.
Basically, each of them has its own superpower: Polars is the speedster, and DuckDB is something like a memory magician.
And the best part? Both integrate seamlessly with Python, allowing you to improve your workflows without a complete rewrite.
Setting Up Your Environment
Before we start coding, make sure your environment is ready. For consistency, I used Pandas 2.2.0, Polars 0.20.0, and DuckDB 1.9.0.
Pinning versions can prevent headaches when following tutorials or sharing code.
pip install pandas==2.2.0 polars==0.20.0 duckdb==1.9.0
In Python, import the libraries:
import pandas as pd
import polars as pl
import duckdb
import warnings
warnings.filterwarnings("ignore")
For this example, I'll use an e-commerce sales dataset with columns such as order ID, product ID, region, country, revenue, and date. You can download similar datasets from Kaggle or generate synthetic data, as shown below.
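If you'd rather generate the data yourself, here is a minimal sketch that writes a synthetic sales.csv with the columns used throughout this tutorial. The specific column names, categories, and row count are my own assumptions, so adapt them to your case:
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_rows = 1_000_000  # increase this to stress-test each tool

# Build a fake e-commerce sales table and write it to disk
synthetic = pd.DataFrame({
    "order_id": np.arange(1, n_rows + 1),
    "product_id": rng.integers(1, 5_000, n_rows),
    "region": rng.choice(["Europe", "North America", "Asia"], n_rows),
    "country": rng.choice(["Germany", "France", "USA", "Japan", "Brazil"], n_rows),
    "revenue": rng.uniform(5, 500, n_rows).round(2),
    "date": pd.Timestamp("2024-01-01") + pd.to_timedelta(rng.integers(0, 365, n_rows), unit="D"),
})
synthetic.to_csv("sales.csv", index=False)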
Loading Data
Loading data efficiently sets the tone for the rest of your workflow. I remember a project where the CSV file had nearly 5 million rows.
Pandas handled it, but the load times were long, and the repeated reloads during testing were painful.
It was one of those moments where you wish your laptop had a "fast forward" button.
Switching to Polars and DuckDB changed everything: suddenly, I could access and manipulate the data almost immediately, which honestly made testing and iteration far more enjoyable.
With Pandas:
df_pd = pd.read_csv("sales.csv")
print(df_pd.head(3))
With Polars:
df_pl = pl.read_csv("sales.csv")
print(df_pl.head(3))
With DuckDB:
con = duckdb.connect()
df_duck = con.execute("SELECT * FROM 'sales.csv'").df()
print(df_duck.head(3))
DuckDB can query CSV files directly without loading the full dataset into memory, making it much easier to work with large files.
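As a quick illustration, you can run an aggregate query straight against the file and fetch only the result, never materializing the full table in Python. Treat this as a sketch on the same assumed sales.csv:
# Only the aggregated result comes back to Python, not the whole file
row_count, total_revenue = con.execute("""
    SELECT COUNT(*), SUM(revenue)
    FROM 'sales.csv'
""").fetchone()
print(row_count, total_revenue)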
Filtering Data
The issue here is that filtering in Pandas can be slow when dealing with millions of rows. I once needed to analyze European transactions in a huge sales dataset, and Pandas took minutes, which slowed down my analysis.
With Pandas:
filtered_pd = df_pd[df_pd.region == "Europe"]
Polars is faster and can evaluate multiple filter conditions efficiently:
filtered_pl = df_pl.filter(pl.col("region") == "Europe")
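For instance, combining several conditions in one filter call is straightforward. This sketch adds a hypothetical revenue threshold on top of the region filter:
# Both conditions are evaluated in a single pass over the data
filtered_pl_multi = df_pl.filter(
    (pl.col("region") == "Europe") & (pl.col("revenue") > 100)
)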
DuckDB uses SQL syntax:
filtered_duck = con.execute("""
SELECT *
FROM 'sales.csv'
WHERE region = 'Europe'
""").df()
Now you can filter through large datasets in seconds instead of minutes, leaving you more time to focus on the insights that actually matter.
Aggregating Large Datasets Quickly
Aggregation is commonly where Pandas starts to feel slow. Imagine calculating total revenue per country for a marketing report.
In Pandas:
agg_pd = df_pd.groupby("country")["revenue"].sum().reset_index()
In Polars:
agg_pl = df_pl.group_by("country").agg(pl.col("revenue").sum())
In DuckDB:
agg_duck = con.execute("""
SELECT country, SUM(revenue) AS total_revenue
FROM 'sales.csv'
GROUP BY country
""").df()
I remember running this aggregation on a ten-million-row dataset. In Pandas, it took nearly half an hour. Polars completed the same operation in under a minute.
The sense of relief was almost like finishing a marathon and realizing your legs still work.
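Your numbers will depend on your hardware and your data, but if you want to benchmark this yourself, a simple sketch with time.perf_counter is enough:
import time

# Time the Pandas aggregation
start = time.perf_counter()
df_pd.groupby("country")["revenue"].sum()
print(f"Pandas: {time.perf_counter() - start:.2f}s")

# Time the equivalent Polars aggregation
start = time.perf_counter()
df_pl.group_by("country").agg(pl.col("revenue").sum())
print(f"Polars: {time.perf_counter() - start:.2f}s")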
Joining Datasets at Scale
Joining datasets is one of those things that sounds easy until you're actually knee-deep in the data.
In real projects, your data often lives in multiple sources, so you have to combine them using shared columns like customer IDs.
I learned this the hard way while working on a project that required combining millions of customer orders with an equally large demographic dataset.
Each file was big enough on its own, but merging them felt like trying to force two puzzle pieces together while your laptop begged for mercy.
Pandas took so long that I started timing the joins the same way people time how long it takes their microwave popcorn to finish.
Spoiler: the popcorn won every time.
Polars and DuckDB gave me a way out.
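The snippets below assume a second file, pop.csv, that holds the demographic data and shares a country column with the sales data. Load it into each library first (the file name here is a placeholder for your own second dataset):
pop_df_pd = pd.read_csv("pop.csv")
pop_df_pl = pl.read_csv("pop.csv")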
With Pandas:
merged_pd = df_pd.merge(pop_df_pd, on="country", how="left")
Polars:
merged_pl = df_pl.join(pop_df_pl, on="country", how="left")
DuckDB:
merged_duck = con.execute("""
SELECT *
FROM 'sales.csv' s
LEFT JOIN 'pop.csv' p
USING (country)
""").df()
Joins on large datasets that used to freeze your workflow now run smoothly and efficiently.
Lazy Evaluation in Polars
One thing I didn't appreciate early in my data science journey was how much time gets wasted running transformations line by line.
Polars approaches this differently.
It uses a technique called lazy evaluation, which essentially waits until you have finished defining your transformations before executing any operations.
It examines the entire pipeline, determines the most efficient path, and executes everything in one go.
It's like having a friend who listens to your entire order before walking to the kitchen, instead of one who takes each instruction separately and keeps going back and forth.
This TDS article explains lazy evaluation in depth.
Here’s what the flow looks like:
Pandas:
df = df[df["amount"] > 100]
df = df.groupby("segment").agg({"amount": "mean"})
df = df.sort_values("amount")
Polars Lazy Mode:
import polars as pl
df_lazy = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("amount") > 100)
    .group_by("segment")
    .agg(pl.col("amount").mean())
    .sort("amount")
)
result = df_lazy.collect()
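If you are curious about what the optimizer decides, you can ask the LazyFrame to print its optimized plan before collecting. A quick sketch:
# Prints the optimized query plan, including any pushdown into the CSV scan
print(df_lazy.explain())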
The first time I used lazy mode, it felt strange not seeing immediate results. But when I ran the final .collect(), the speed difference was obvious.
Lazy evaluation won't magically solve every performance issue, but it brings a level of efficiency that Pandas wasn't designed for.
Conclusion and Takeaways
Working with large datasets doesn't have to feel like wrestling with your tools.
Using Polars and DuckDB showed me that the problem wasn't always the data. Sometimes, it was the tool I was using to handle it.
If there's one thing you take away from this tutorial, let it be this: you don't have to abandon Pandas, but you can reach for something better when your datasets start pushing their limits.
Polars gives you speed and smarter execution, while DuckDB lets you query huge files as if they were tiny. Together, they make working with large data feel more manageable and less tiring.
If you want to go deeper into the ideas explored in this tutorial, the official documentation for Polars and DuckDB is a good place to start.
