EDA in Public (Part 3): RFM Analysis for Customer Segmentation in Pandas


If you’ve been following along, we’ve come a long way. In Part 1, we did the “dirty work” of cleaning and prepping.

In Part 2, we zoomed out to a high-altitude view of NovaShop’s world — spotting the large storms (high-revenue countries) and the seasonal patterns (the large Q4 rush).

But here’s the thing: a business doesn’t actually sell to “months” or “countries.” It sells to human beings.

If you treat every customer exactly the same, you’re making two very expensive mistakes:

  • Over-discounting: Giving a “20% off” coupon to someone who was already reaching for their wallet.
  • Ignoring the “Quiet” Ones: Failing to notice when a formerly loyal customer stops visiting, until they’ve been gone for six months and it’s too late to win them back.

The Solution? Behavioural Segmentation.

Instead of guessing, we’re going to use the data to let the customers tell us who they are. We do that using the gold standard of retail analytics: RFM Analysis.

  • Recency (R): How recently did they buy? (Are they still engaged with us?)
  • Frequency (F): How often do they buy? (Are they loyal, or was it a one-off?)
  • Monetary (M): How much do they spend? (What’s their total business impact?)

By the end of this part, we’ll move beyond “Top 10 Products” and actually assign a specific, actionable label to each customer in NovaShop’s database.

Data Preparation: The “Missing ID” Pivot

Before we can start scoring, we have to revisit a decision we made back in Part 1.

If you remember our initial inspection, we noticed that about 25% of our rows were missing a CustomerID. At the time, we made a strategic business decision to keep those rows so we could still report accurate total revenue and see which products were popular overall.

For RFM analysis, the rules change. You can’t track behaviour without a consistent identity. We can’t know how “frequent” a customer is if we don’t know who they are!

So, our first step in Part 3 is to isolate our “Trackable Universe” by filtering for rows where a CustomerID exists.
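Here’s what that looks like in code — a minimal sketch, assuming the cleaned dataframe from Parts 1 and 2 is already loaded as df:

# Keep only the "trackable universe": rows tied to a known customer
df = df[df['CustomerID'].notna()].copy()

print(f"Trackable rows: {len(df):,}")

Everything from this point on operates on this filtered dataframe.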

Engineering the RFM Metrics

Now that we have a dataset where every row is linked to a specific person, we need to aggregate all their individual transactions into three summary numbers: Recency, Frequency, and Monetary.

Defining the Snapshot Date

Before calculating RFM, we need a reference point in time, commonly called the snapshot date.

Here, we take the most recent transaction date in the dataset and add one day. This snapshot date represents the moment at which we’re evaluating customer behaviour.

import datetime as dt

snapshot_date = df['InvoiceDate'].max() + dt.timedelta(days=1)

We added one day so that customers who bought on the most recent date still have a Recency value of 1 day, not 0. This keeps the metric intuitive and avoids edge-case problems.
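A tiny worked example makes the off-by-one logic concrete (the dates here are made up purely for illustration):

import datetime as dt
import pandas as pd

# Pretend the latest invoice in the data landed on 9 December
last_purchase = pd.Timestamp('2011-12-09')
snapshot_date = last_purchase + dt.timedelta(days=1)

# A customer who bought on that very last day gets Recency = 1, not 0
print((snapshot_date - last_purchase).days)  # 1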

Aggregating Transactions on the Customer Level

rfm = df.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
    'InvoiceNo': 'nunique',
    'Revenue': 'sum'
})

Each row in our dataset represents a single transaction. To calculate RFM, we want to collapse these transactions into one row per customer.

We do this by grouping the data by CustomerID and applying different aggregation functions:

  • Recency: For each customer, we find their most recent purchase date and calculate how many days have passed since then.
  • Frequency: We count the number of unique invoices associated with each customer. This tells us how often they’ve made purchases.
  • Monetary: We sum the total revenue generated by each customer across all transactions.

Renaming Columns for Clarity

rfm.rename(columns={
    'InvoiceDate': 'Recency',
    'InvoiceNo': 'Frequency',
    'Revenue': 'Monetary'
}, inplace=True)

The aggregation step keeps the original column names, which can be confusing. Renaming them makes the dataframe immediately readable and aligns it with standard RFM terminology.

Now each column clearly answers a business question:

  • Recency → How recently did the customer purchase?
  • Frequency → How often do they purchase?
  • Monetary → How much revenue do they generate?

Inspecting the Result

print(rfm.head())

The resulting rfm dataframe contains one row per customer, with three intuitive metrics summarizing their behavior.

Output:

Let’s walk through this the way we would with NovaShop in a real conversation.

“When was the last time this customer bought from us?”

That’s exactly what Recency answers.

Take Customer 12347:

  • Recency = 2
  • Translation:

They’re fresh. They remember the brand. They’re still engaged.

Now compare that to Customer 12346:

  • Recency = 326
  • Translation:

Even though this customer spent a lot in the past, they’re currently silent.

From NovaShop’s perspective: Recency tells us who’s still listening and who might need a nudge (or a wake-up call).

“Is this a one-time buyer or someone who keeps coming back?”

That’s where Frequency comes in.

Look again at Customer 12347:

  • Frequency = 7
  • They didn’t just buy once — they came back again and again.

Now look at several others:

  • Frequency = 1
  • One purchase, then gone.

From a business perspective, frequency separates casual shoppers from loyal customers.

“Who actually brings in the money?”

That’s the Monetary column.
And that is where things get interesting.

Customer 12346:

  • Monetary = £77,183.60
  • Frequency = 1
  • Recency = 326

This tells a really specific story:

A single, very large order… a long time ago… and nothing since.

Now compare that to Customer 12347:

  • Lower total spend
  • Multiple purchases
  • Very recent activity

Key insight for NovaShop: A “high-value” customer in the past isn’t necessarily a valuable customer today.

Why This View Changes the Conversation

If NovaShop only looked at total revenue, they might focus all their attention on customers like 12346.

But RFM shows us that:

  • Some customers spent a lot once and disappeared
  • Some spend less but stay loyal
  • Some are active and ready to be engaged

This output helps NovaShop stop guessing and start prioritizing:

  • Who should get retention emails?
  • Who needs reactivation campaigns?
  • Who’s already loyal and should be rewarded?

Right now, these are still raw numbers.

In the next step, we’ll rank and score these customers, so NovaShop doesn’t have to interpret rows manually. Instead, they’ll see clear segments like “Champions”, “Loyal Customers”, and “At-Risk”.

That’s where this becomes a real decision-making tool — not just a dataframe.

Turning RFM Numbers Into Meaningful Customer Segments

At this stage, NovaShop has a table filled with numbers. Useful — but not exactly decision-friendly.

A marketing team can’t realistically scan hundreds or thousands of rows and weigh up each customer by hand.

Our goal is to rank customers relative to one another and turn raw values into scores.

Step 1: Score Customers by Each RFM Metric

Instead of treating Recency, Frequency, and Monetary as absolute values, we look at where each customer stands compared to everyone else.

  • Customers with more recent purchases should score higher
  • Customers who buy more often should score higher
  • Customers who spend more should score higher

In practice, we do this by splitting each metric into quantiles (often 4 or 5 buckets).

However, there’s a small real-world wrinkle. This is something I came across while working on this project.

In transactional datasets, it’s common to see:

  • Many customers with the same Frequency (e.g. one-time buyers)
  • Highly skewed Monetary values
  • Small samples where quantile binning can fail

To keep things robust and readable, we’ll wrap the scoring logic in a small helper function.

def rfm_score(series, ascending=True, n_bins=5):
    # Rank the values to guarantee uniqueness
    ranked = series.rank(method='first', ascending=ascending)

    # Use pd.qcut on the ranks to assign bins
    return pd.qcut(
        ranked,
        q=n_bins,
        labels=range(1, n_bins + 1)
    ).astype(int)

To explain what’s happening here:

  • We’re creating a helper function that turns a raw numeric column into a clean RFM score using quantile-based binning.
  • First, the values are ranked. Instead of binning the raw values directly, we rank them first, which guarantees a unique ordering even when many customers share the same value (a common issue in RFM data).
  • The ascending flag lets us flip the logic depending on the metric — for example, lower Recency is better, while higher Frequency and Monetary values are better.
  • Next, we apply quantile-based binning. qcut splits the ranked values into n_bins equally sized groups, and each customer is assigned a score from 1 to 5 (by default) that represents their relative position within the distribution.
  • Finally, the results are converted to integers for easy use in analysis and segmentation.

In short, this function provides a robust and reusable way to score RFM metrics without running into duplicate-bin-edge errors, and without overcomplicating the logic.
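To see the wrinkle this avoids, here’s a toy illustration (not from the NovaShop data): with many one-time buyers, qcut on the raw Frequency values fails because several quantile edges collapse to the same number, while qcut on the ranks works fine.

import pandas as pd

# Toy Frequency column: mostly one-time buyers, a common real-world pattern
freq = pd.Series([1, 1, 1, 1, 1, 2, 3, 5, 8, 13])

# Binning the raw values fails: several quantile edges are all equal to 1
try:
    pd.qcut(freq, q=5, labels=range(1, 6))
except ValueError as err:
    print("qcut on raw values:", err)

# Binning the ranks works, because ranks are unique by construction
scores = pd.qcut(freq.rank(method='first'), q=5, labels=range(1, 6)).astype(int)
print(scores.tolist())  # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]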

Step 2: Applying the Scores

Now we can score each metric cleanly and consistently:

# Assign R, F, M scores
rfm['R_Score'] = rfm_score(rfm['Recency'], ascending=False)  # Recent purchases = high score
rfm['F_Score'] = rfm_score(rfm['Frequency'])                 # More frequent = high score
rfm['M_Score'] = rfm_score(rfm['Monetary'])                  # Higher spend = high score

The only special case here is Recency:

  • Lower values mean more recent activity
  • So we reverse the ranking with ascending=False
  • Everything else follows the natural “higher is better” rule.

What This Means for NovaShop

Instead of seeing this:

Recency = 326
Frequency = 1
Monetary = 77,183.60

NovaShop now sees something like:

R = 1, F = 1, M = 5

That’s immediately more interpretable:

  • Not recent
  • Not frequent
  • High spender (historically)

Step 3: Creating a Combined RFM Score

Now we combine these three scores into a single RFM code:

rfm['RFM_Score'] = (
    rfm['R_Score'].astype(str) +
    rfm['F_Score'].astype(str) +
    rfm['M_Score'].astype(str)
)

This produces values like:

  • 555 → Best customers
  • 155 → High spenders who haven’t returned
  • 111 → Customers who are likely gone

Each customer now carries a compact behavioral fingerprint. And we’re not done yet.
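If you’re curious how customers spread across these codes, a quick optional check looks like this:

# Count how many customers fall into each RFM code
print(rfm['RFM_Score'].value_counts().head(10))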

Translating RFM Scores Into Customer Segments

Raw scores are nice, but let’s be honest: no marketing manager wants to look at 555, 154, or 311 all day.

NovaShop needs labels that make sense at a glance. That’s where RFM segments come in.

Step 1: Defining Segments

Using RFM scores, we can classify customers into meaningful categories. Here’s a common approach:

  • Champions: Top Recency, top Frequency, top Monetary (555) — your best customers
  • Loyal Customers: Regular buyers, might not be spending the most, but keep coming back
  • Big Spenders: High Monetary, but not necessarily recent or frequent
  • At-Risk: Used to buy, but haven’t returned recently
  • Lost: Low scores in all three metrics — likely disengaged
  • Promising / New: New customers with lower frequency or monetary spend

This transforms abstract numbers into a narrative that marketing and management can act on.

Step 2: Mapping Scores to Segments

Here’s an example using simple conditional logic:

def rfm_segment(row):
    if row['R_Score'] >= 4 and row['F_Score'] >= 4 and row['M_Score'] >= 4:
        return 'Champions'
    elif row['F_Score'] >= 4:
        return 'Loyal Customers'
    elif row['M_Score'] >= 4:
        return 'Big Spenders'
    elif row['R_Score'] <= 2:
        return 'At-Risk'
    else:
        return 'Others'

rfm['Segment'] = rfm.apply(rfm_segment, axis=1)

Now each customer has a human-readable label, making it immediately actionable.

Let’s review our results using rfm.head()
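Since the dataframe now carries quite a few columns, a focused view of the scores and labels might look something like this:

# Metrics, scores, and segment label, one row per customer
print(rfm[['Recency', 'Frequency', 'Monetary',
           'R_Score', 'F_Score', 'M_Score',
           'RFM_Score', 'Segment']].head())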

Step 3: Turning Segments into Strategy

With labeled segments, NovaShop can:

  • Reward Champions → Exclusive deals, loyalty points
  • Re-engage Big Spenders & At-Risk customers → Personalized emails or discounts
  • Focus marketing spend wisely → Don’t waste effort on customers who are truly lost
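Before committing budget to any of these, a natural first check is how large each segment actually is. For example:

# Share of customers in each segment
print(rfm['Segment'].value_counts(normalize=True).round(3))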

This is the moment where data becomes strategy.

What NovaShop Should Do Next (Key Takeaways & Recommendations)

At the start of this analysis, NovaShop had a familiar problem:
Plenty of transactional data, but limited clarity on customer behaviour.

By applying the RFM framework, we’ve turned raw purchase history into a clear, structured view of who NovaShop’s customers are — and how they behave.

Now let’s talk about what to actually do with it.

1. Protect and Reward Your Best Customers

Champions and Loyal Customers are already doing what every business wants:

  • They buy recently
  • They buy often
  • They generate consistent revenue

These customers don’t need heavy discounts — they need recognition.

Recommended actions:

  • Early access to sales
  • Loyalty points or VIP tiers
  • Personalized thank-you emails

2. Re-Engage High-Value Customers Before They’re Lost

The most dangerous segment for NovaShop isn’t “Lost” customers.
It’s At-Risk and Big Spenders.

These customers:

  • Have shown clear value in the past
  • But haven’t purchased recently
  • Are one step away from churning completely

Recommended actions:

  • Targeted win-back campaigns
  • Personalized offers (not blanket discounts)
  • Reminder emails tied to past purchase behavior

3. Don’t Over-Invest in Truly Lost Customers

Some customers will inevitably churn. RFM helps NovaShop identify those customers early and avoid spending ad budget, discounts, and marketing effort on users who are unlikely to return. This isn’t about being cold — it’s about being efficient.

4. Use RFM as a Living Framework, Not a One-Off Analysis

The real power of RFM comes when it’s:

  • Recomputed monthly or quarterly
  • Integrated into dashboards
  • Used to track movement between segments over time

For NovaShop, this means asking questions like: which customers moved from “Champions” to “At-Risk” this quarter, and which segments are growing or shrinking?

RFM turns customer behaviour into something measurable and trackable.
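As a rough sketch of what “recomputed monthly or quarterly” could look like in practice, the scoring steps above can be bundled into one function and rerun whenever new transactions arrive. The sketch below reuses the rfm_score and rfm_segment helpers defined earlier; the compute_rfm name and the example date cut-off are illustrative, not part of the original pipeline.

import datetime as dt
import pandas as pd

def compute_rfm(transactions):
    """Recompute RFM metrics, scores, and segments from transaction-level data."""
    snapshot_date = transactions['InvoiceDate'].max() + dt.timedelta(days=1)

    out = transactions.groupby('CustomerID').agg({
        'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
        'InvoiceNo': 'nunique',
        'Revenue': 'sum'
    }).rename(columns={'InvoiceDate': 'Recency',
                       'InvoiceNo': 'Frequency',
                       'Revenue': 'Monetary'})

    # Reuses rfm_score() and rfm_segment() defined earlier in this post
    out['R_Score'] = rfm_score(out['Recency'], ascending=False)
    out['F_Score'] = rfm_score(out['Frequency'])
    out['M_Score'] = rfm_score(out['Monetary'])
    out['Segment'] = out.apply(rfm_segment, axis=1)
    return out

# Example: rebuild the table on an illustrative monthly cut-off and watch segment sizes shift
# rfm_snapshot = compute_rfm(df[df['InvoiceDate'] < '2011-12-01'])
# print(rfm_snapshot['Segment'].value_counts())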

Final Thoughts: Closing the EDA in Public Series

When I started this series, I wasn’t trying to build the perfect analysis or demonstrate advanced techniques. I wanted to slow down and share how I actually think when working with real data. Not the polished version, but the messy, iterative process that usually stays hidden.

This project began with a noisy CSV and a lot of open questions. Along the way, there were small issues that only surfaced when I paid closer attention — dates stored as strings, assumptions that didn’t quite hold up, metrics that needed context before they made sense. Working through those moments in public was uncomfortable at times, but also genuinely valuable. Each correction made the analysis stronger and more honest.

One thing this process reinforced for me is that most meaningful insights don’t come from complexity. They come from slowing down, structuring the data properly, and asking better questions. By the time I reached the RFM analysis, the value wasn’t in the formulas themselves — it was in what they forced me to confront. A customer who spent a lot once isn’t necessarily valuable today. Recency matters. Frequency matters. And none of these metrics mean much in isolation.

Ending the series with RFM felt deliberate. It sits at the point where technical work meets business thinking, where tables turn into conversations and numbers turn into decisions. It’s also where exploratory analysis stops being purely descriptive and starts becoming practical. At that stage, the goal is no longer just to understand the data, but to decide what to do next.

Doing this work in public changed how I approach analysis. Writing things out forced me to clarify my reasoning, question my assumptions, and be comfortable showing imperfect work. It reminded me that EDA isn’t a checklist you rush through — it’s a dialogue with the data. Sharing that dialogue makes you more thoughtful and more accountable.

This may be the final part of the series, but it doesn’t feel like an endpoint. Everything here could evolve into dashboards, automated pipelines, or deeper customer analysis.

And if you’re a founder, analyst, or team working with customer or sales data and trying to make sense of it, this kind of exploratory work is often where the biggest clarity comes from. These are exactly the kinds of problems I enjoy working through — slowly, thoughtfully, and with the business context in mind.

If you’re documenting your own analyses, I’d love to see how you approach it. And if you’re wrestling with similar questions in your own data and want to talk through them, feel free to reach out on any of the platforms below. Good data conversations often start there.

Thanks for following along!

Medium

LinkedIn

Twitter

YouTube
