EDA in Public (Part 2): Product Deep Dive & Time-Series Evaluation in Pandas


Welcome back to the “EDA in Public” series! This is Part 2 of the series; if you haven’t seen Part 1 yet, read it here. Here’s a recap of what we covered.

In Part 1, we took a messy, half-million-row sales file and got it into shape. Specifically, we:

  • Sampled the data for faster processing.
  • Fixed missing values and standardised text entries (like ‘EIRE’ to ‘Ireland’).
  • Filtered out all of the noise — returns, cancellations, and transactions with zero price.
  • Engineered our most crucial feature: the Revenue column.
  • Achieved our first business insight: the Top 10 Revenue-Generating Countries.

We now have a clean, revenue-ready Pandas DataFrame ready to be mined for insights! By the end of this series, I would like to have mastered exploratory data analysis using Pandas. Feel free to follow along if you’re a data enthusiast.

Now we shift our focus to delivering truly impactful business intelligence for NovaShop. Our overall goal for Part 2 is to answer the fundamental questions: which products are the best performers, and when is the best time to sell them?

This part is all about analysis using powerful Pandas aggregation techniques (.groupby()) and feature engineering on the datetime column (the .dt accessor). We’ll break down our analysis into two key areas:

  • Product Deep Dive: Identifying the products that move the most units versus those that bring in the most money.
  • Sales Timing (Time-Series): Uncovering the seasonality of sales (monthly peaks) and the operational trends (busiest hour of the day).

Without further ado, let’s get to it.

Loading the clean data

To improve performance, let’s export our cleaned dataset and import it again for analysis:

df.to_csv('online_retail_clean.csv', index=False)

Now, I can create a brand-new project and start afresh.

When loading CSVs that contain date columns, I like to use the parse_dates argument in pd.read_csv(). This saves me the step of manually converting the column type later, ensuring the date features are correct from the start.

import pandas as pd
df = pd.read_csv('online_retail_clean.csv', parse_dates=['InvoiceDate'])
print(f"Successfully loaded clean data. Total rows: {len(df)}")

Output:

Successfully loaded clean data. Total rows: 52933

Product Deep Dive: Who Are the True Money Makers?

With the clean data loaded, let’s jump straight into product performance. Products can be considered “best-sellers” in two different ways: by volume (units sold) or by value (revenue generated).

Insight 2: Top 10 Bestsellers (by Quantity)

The quantity list tells us which products are popular and in high demand, even if their price is low. This is essential for warehouse stocking and inventory management.

We’ll group by the product Description, sum the Quantity, and sort it in descending order.

# Top 10 Bestsellers (by Quantity)
df.groupby('Description')['Quantity'].sum().sort_values(ascending=False).head(10)

Output:

Description
MEDIUM CERAMIC TOP STORAGE JAR       74698
WORLD WAR 2 GLIDERS ASSTD DESIGNS     9419
GIN + TONIC DIET METAL SIGN           5016
ASSORTED COLOURS SILK FAN             4684
JUMBO BAG RED RETROSPOT               4671
PACK OF 72 RETROSPOT CAKE CASES       3561
MINI PAINT SET VINTAGE                3285
ASSORTED COLOUR BIRD ORNAMENT         2783
BROCADE RING PURSE                    2525
POPCORN HOLDER                        2508
Name: Quantity, dtype: int64

Insight 3: Top 10 Revenue Drivers (The True Money-Makers)

This is what NovaShop really wants to know. These products drive profitability. They may not sell as many units as the items in the previous list, but their higher price or larger bulk orders make them far more valuable.

We follow the same process, but this time we aggregate on the Revenue column we engineered in Part 1.

# Top 10 Revenue Drivers (The True Money-Makers)
df.groupby('Description')['Revenue'].sum().sort_values(ascending=False).head(10)

Output:

Description
MEDIUM CERAMIC TOP STORAGE JAR    77785.74
DOTCOM POSTAGE                    18456.13
REGENCY CAKESTAND 3 TIER          15093.38
AMAZON FEE                        13541.33
PARTY BUNTING                      9491.60
GIN + TONIC DIET METAL SIGN        9480.51
JUMBO BAG RED RETROSPOT            8996.65
Manual                             8966.96
CHILLI LIGHTS                      8395.20
POSTAGE                            7194.51
Name: Revenue, dtype: float64

Discussion: What These Lists Tell NovaShop

This is where the power of EDA truly shines, because these two lists tell very different stories:

  • The All-Star Product: The “MEDIUM CERAMIC TOP STORAGE JAR” is clearly the winner, dominating both lists by a wide margin. It’s high volume and high value. This item should be NovaShop’s primary focus for inventory, marketing, and expansion.
  • A surprising revenue source: While exploring the data, I noticed entries like “DOTCOM POSTAGE”, “AMAZON FEE”, “Manual”, and “POSTAGE”. These aren’t actual products — they’re service fees and manual adjustments. Yet together, they bring in almost £48,000 in revenue. NovaShop should double-check whether these fees are being counted in their gross profit. It’s an easy thing to miss, but it could be a quiet and meaningful source of income hiding in plain sight.
  • The High-Value Physical Goods: Items like the “REGENCY CAKESTAND 3 TIER” and “CHILLI LIGHTS” make the revenue list but not the quantity list. This tells us they’re high-price items that contribute significantly to total cash flow, and they are worth targeted marketing efforts.

We now have a complete picture of what NovaShop sells best. Time to figure out when!
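To compare volume and value side by side, both measures can also be computed in one pass with .groupby().agg(). Here is a minimal sketch using a tiny made-up DataFrame (the product names and numbers are illustrative, not NovaShop’s data):

```python
import pandas as pd

# Tiny illustrative stand-in for the cleaned sales data
df = pd.DataFrame({
    'Description': ['JAR', 'JAR', 'CAKESTAND', 'GLIDERS', 'GLIDERS'],
    'Quantity':    [48,    26,    2,           500,       400],
    'Revenue':     [60.0,  32.5,  25.9,        90.0,      72.0],
})

# Named aggregation: total units and total revenue per product in one call
product_summary = (
    df.groupby('Description')
      .agg(TotalUnits=('Quantity', 'sum'), TotalRevenue=('Revenue', 'sum'))
      .sort_values('TotalRevenue', ascending=False)
)
print(product_summary)
```

A table like this makes it immediately obvious which products are high volume but low value (and vice versa), instead of eyeballing two separate top-10 lists.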

Preparing for Time-Series Evaluation: Feature Engineering

We know what sold well. Now, let’s figure out when sales peak. Our goal is to break down the InvoiceDate column into components that allow us to group and analyze sales by year, month, day, and even hour.

This is a classic feature engineering move. Since we already made sure InvoiceDate is a proper Pandas datetime object (thanks to parse_dates in our loading step!), we can use the extremely useful .dt accessor.

Extracting Time Components

To understand seasonality (monthly) and operational efficiency (hourly), we’ll create four new columns.

  • Year — For long-term comparisons.
  • Month — For identifying seasonal peaks (e.g., Q4 rush).
  • DayName — To search out the busiest day of the week.
  • Hour — To find out peak operational hours.

Let’s generate these columns:

print("\n--- Time Series Feature Extraction ---")

# Extract Year, Month, DayName, and Hour from the InvoiceDate column
df['Year'] = df['InvoiceDate'].dt.year
df['Month'] = df['InvoiceDate'].dt.month
df['DayName'] = df['InvoiceDate'].dt.day_name()
df['Hour'] = df['InvoiceDate'].dt.hour
print("New time features added: Year, Month, DayName, Hour.")
print(df[['InvoiceDate', 'Year', 'Month', 'DayName', 'Hour']].head())

Output:

--- Time Series Feature Extraction ---
New time features added: Year, Month, DayName, Hour.
          InvoiceDate  Year  Month    DayName  Hour
0 2011-06-01 12:05:00  2011      6  Wednesday    12
1 2011-05-27 17:14:00  2011      5     Friday    17
2 2011-04-21 17:05:00  2011      4   Thursday    17
3 2011-11-16 10:39:00  2011     11  Wednesday    10
4 2011-03-10 08:40:00  2011      3   Thursday     8

Perfect! We can move right into mapping the sales patterns in the next section.

Mapping Sales Patterns (Time-Series Insights)

With our time features ready, we can now ask and answer our questions about when sales occur. We’ll start big (monthly trends) and drill down to the small (hourly trends).

Insight 4: Sales Trends by Month (Seasonality)

Monthly aggregation is crucial for spotting seasonality — the predictable pattern of sales repeating yearly. This data is used for financial forecasting and for planning inventory levels ahead of the peak season.

We’ll group by the Month column and calculate the total revenue for each.

print("\n--- Insight 4: Sales Trends by Month (Seasonality) ---")
# Group by month and sum the revenue
monthly_revenue = df.groupby('Month')['Revenue'].sum().sort_values(ascending=False)
# Print the results, which should show the seasonal peaks
print(monthly_revenue)

Output:

--- Insight 4: Sales Trends by Month (Seasonality) ---
Month
11 143576.360
1 142013.420
12 133178.980
10 119533.540
9 103929.420
3 72968.270
8 71302.890
5 70424.510
6 68722.520
7 68713.831
4 51882.010
2 50178.730
Name: Revenue, dtype: float64

Interpretation: The Massive Q4 and January Surprise

The data clearly shows that NovaShop’s sales are heavily skewed toward the end of the year, confirming the typical retail Q4 (October, November, December) rush.

  • Peak Season: November is the peak month by a slight margin, with December also strong. This is the time to stock up on Medium Ceramic Top Storage Jars and run targeted ads.
  • The January Surge: The second-highest month is January! This is a valuable insight for NovaShop. It suggests customers may be using Christmas gift money or taking advantage of post-holiday sales. This period shouldn’t be treated as a slump but as a secondary high-volume sales opportunity.
  • Low Season: February and April are the lowest performers, which helps NovaShop plan for lower staffing and marketing budgets during those periods.
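For reporting, the numeric month index can be swapped for month names with the standard-library calendar module. A minimal sketch, using a few of the monthly totals from the output above:

```python
import calendar
import pandas as pd

# A few of the monthly totals from the output above, keyed by month number
monthly_revenue = pd.Series(
    {11: 143576.36, 1: 142013.42, 12: 133178.98, 2: 50178.73},
    name='Revenue',
)

# Map month numbers (1-12) to abbreviated month names for a friendlier index
named = monthly_revenue.rename(index=lambda m: calendar.month_abbr[m])
print(named)
```

This keeps the revenue-sorted order intact while making the index self-explanatory ("Nov" instead of 11).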

Insight 5: Sales Trends by Day of Week and Hour (Operational Planning)

While seasonality is about finance and inventory, daily and hourly trends are about operations. NovaShop can use this to schedule warehouse staff, optimize ad spend, and time email campaigns.

We’ll run two separate aggregations here: one for the day of the week and one for the hour of the day.

Revenue by Day of Week

# 1. Sales by Day of Week
daily_revenue = df.groupby('DayName')['Revenue'].sum()
print("\n--- Revenue by Day of Week ---")
print(daily_revenue)

Output:

--- Revenue by Day of Week ---
DayName
Friday 163861.320
Monday 171026.230
Sunday 83125.890
Thursday 219342.980
Tuesday 282796.741
Wednesday 176271.320
Name: Revenue, dtype: float64

Interpretation: Tuesday is the Power Day

The data reveals a clear and actionable pattern for the work week:

  • Peak Sales: Tuesday absolutely dominates the week, generating almost 30% more revenue than the next busiest day, Thursday. This is a prime day for launching new products or running high-impact flash sales.
  • Slump Days: Unsurprisingly, sales drop dramatically on Sunday. NovaShop might consider this a low-priority day for email marketing or customer support staffing.
  • Missing Day: The dataset contains no transactions for Saturday in our sample (or perhaps in the full set, which is common in B2B datasets).
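Note that the groupby output above lists days alphabetically, which hides the weekly shape. One way to restore calendar order is a simple reindex; a sketch using the figures from the output above:

```python
import pandas as pd

# Daily totals as produced above (groupby returns them in alphabetical order)
daily_revenue = pd.Series({
    'Friday': 163861.32, 'Monday': 171026.23, 'Sunday': 83125.89,
    'Thursday': 219342.98, 'Tuesday': 282796.741, 'Wednesday': 176271.32,
}, name='Revenue')

# Put the days back into calendar order (Saturday is absent in this data)
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Sunday']
ordered = daily_revenue.reindex(day_order)
print(ordered)
```

In calendar order, the mid-week hump (Tuesday and Thursday) and the Sunday slump become obvious at a glance.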

Revenue by Hour of Day

# 2. Sales by Hour of Day
hourly_revenue = df.groupby('Hour')['Revenue'].sum().sort_index()
print("\n--- Revenue by Hour of Day ---")
print(hourly_revenue)

Output:

--- Revenue by Hour of Day ---
Hour
7 2830.910
8 26597.800
9 76768.750
10 209809.971
11 115233.600
12 142474.460
13 130348.290
14 119241.520
15 145178.980
16 70620.550
17 36148.030
18 14022.790
19 4397.130
20 2751.700

Interpretation: The Late Morning Boom

The hourly data points directly to operational needs:

  • Massive Peak: Sales explode at 10 AM (Hour 10), generating almost double the revenue of the next few hours. This is the optimal time for NovaShop to launch email marketing and social media campaigns.
  • Sustained Activity: Sales remain high from 11 AM through 3 PM, suggesting a robust midday and post-lunch shopping window.
  • End of Day: Activity falls off sharply after 4 PM, with very little revenue generated in the evening (7 PM onward). This suggests NovaShop’s customer base is primarily desk-based, shopping during traditional working hours.
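Headline numbers like the peak hour can be pulled out programmatically rather than read off the printout, using idxmax. A sketch against a representative subset of the hourly figures above:

```python
import pandas as pd

# A subset of the hourly totals from the output above
hourly_revenue = pd.Series({
    9: 76768.75, 10: 209809.971, 11: 115233.60, 15: 145178.98, 19: 4397.13,
}, name='Revenue')

# idxmax returns the index label (the hour) with the largest value
peak_hour = hourly_revenue.idxmax()
# Share of revenue concentrated in that single hour, within this subset
peak_share = hourly_revenue.max() / hourly_revenue.sum()
print(f"Peak hour: {peak_hour}:00 ({peak_share:.0%} of revenue in this subset)")
```

Computing the peak this way keeps a report or dashboard correct even if next month’s data shifts the busiest hour.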

Conclusion: Setting the Stage for Customer Segmentation

We have officially finished the heavy-duty EDA! We know what sells best (the Ceramic Jar and the fees), where sales are strongest (the UK), and when the sales peaks occur (Q4, Tuesdays, and 10 AM).

However, we’re still missing the most valuable piece of the puzzle: the Who. The best companies don’t just know what sells; they know who their most valuable customers are and how to keep them.

In Part 3, we will dive into the most advanced analysis technique in this series: RFM (Recency, Frequency, Monetary) Customer Segmentation. This will allow us to classify customers into tiers like “Champions,” “Loyal Customers,” and “At-Risk,” giving NovaShop a true roadmap for personalised marketing.

I hope you found this article helpful. Feel free to say hi to me on any of these platforms. I really appreciate your feedback.

Medium

LinkedIn

Twitter

YouTube
