Encoding Categorical Data, Explained: A Visual Guide with Code Example for Beginners


DATA PREPROCESSING

Six ways of matchmaking categories and numbers

Ah, categorical data, the colourful characters in our datasets that machines just can’t seem to grasp. This is where “red” becomes 1, “blue” becomes 2, and data scientists turn into language translators (or more like matchmakers?).

Now, I know what you’re thinking: “Encoding? Isn’t that just assigning numbers to categories?” Oh, if only it were that simple! We’re about to explore six different encoding methods, all on (again) a single, tiny dataset (with visuals, of course!). From simple labels to mind-bending cyclic transformations, you’ll see why choosing the right encoding can be as important as picking the right algorithm.

Cartoon illustration of two figures embracing, with letters ‘A’, ‘B’, ‘C’ and numbers ‘1’, ‘2’, ‘3’ floating around them. A pink heart hovers above, symbolizing affection. The background is a pixelated pattern of blue and green squares, representing data or encoding. This image metaphorically depicts the concept of encoding categorical data, where categories (ABC) are transformed into numerical representations (123).
All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.

Before we jump into our dataset and encoding methods, let’s take a moment to understand what categorical data is and why it needs special treatment in the world of machine learning.

What Is Categorical Data?

Categorical data is like the descriptive labels we use in everyday life. It represents characteristics or qualities that can be grouped into categories.

Why Does Categorical Data Need Encoding?

Here’s the catch: most machine learning algorithms are like picky eaters: they only digest numbers. They can’t directly understand that “sunny” is different from “rainy”. That’s where encoding comes in. It’s like translating these categories into a language that machines can understand and work with.

Kinds of Categorical Data

Not all categories are created equal. We generally have two types:

  1. Nominal: These are categories with no inherent order.
    Ex: “Outlook” (sunny, overcast, rainy) is nominal. There’s no natural ranking between these weather conditions.
  2. Ordinal: These categories have a meaningful order.
    Ex: “Temperature” (Very Low, Low, High, Very High) is ordinal. There’s a clear progression from coldest to hottest.
Two panels comparing nominal and ordinal data types. The nominal panel shows a cartoon figure with an umbrella in the rain, illustrating weather as a nominal variable with examples like sunny, rainy, or cloudy. The ordinal panel depicts a sweating figure eating ice cream, demonstrating temperature as an ordinal variable with examples ranging from warm to very hot. Each panel includes a table with example categories.

Why Care About Proper Encoding?

  1. It preserves important information in your data.
  2. It can significantly impact your model’s performance.
  3. Incorrect encoding can introduce unintended biases or relationships.

Imagine if we encoded “sunny” as 1 and “rainy” as 2. The model might think rainy days are “greater than” sunny days, which isn’t what we want!
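
To make that concern concrete, here is a minimal sketch. The two mappings and the helper `gap` are hypothetical, chosen only for illustration. It shows that the numeric “distance” between two nominal categories depends entirely on which arbitrary numbering we happened to pick.

```python
# Two equally arbitrary ways of numbering the same nominal categories.
# Both mappings are made up for illustration.
mapping_a = {'sunny': 1, 'rainy': 2, 'overcast': 3}
mapping_b = {'overcast': 1, 'sunny': 2, 'rainy': 3}


def gap(mapping, first, second):
    """Absolute difference between the integer codes of two categories."""
    return abs(mapping[first] - mapping[second])


# The same pair of categories ends up a different "distance" apart,
# even though nothing about the weather itself has changed.
gap_a = gap(mapping_a, 'sunny', 'overcast')
gap_b = gap(mapping_b, 'sunny', 'overcast')
print(gap_a, gap_b)
```

A model that treats the codes as magnitudes would see different relationships under each numbering, which is why nominal columns usually call for something other than plain integer codes.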

Now that we understand what categorical data is and why it needs encoding, let’s take a look at our dataset and see how we can tackle its categorical variables using six different encoding methods.

Let’s use a simple golf dataset to illustrate our encoding methods (it has mostly categorical columns). This dataset records various weather conditions and the resulting crowdedness at a golf course.

Weather dataset table spanning March 25 to April 5. Columns include date, day, month, temperature (Low/High/Extreme), humidity (Dry/Humid), wind (Yes/No), outlook (sunny/rainy/overcast), and a count. Icons above represent data types. The table shows varied weather conditions and corresponding visitor numbers across 12 days.
import pandas as pd
import numpy as np

data = {
'Date': ['03-25', '03-26', '03-27', '03-28', '03-29', '03-30', '03-31', '04-01', '04-02', '04-03', '04-04', '04-05'],
'Weekday': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
'Month': ['Mar', 'Mar', 'Mar', 'Mar', 'Mar', 'Mar', 'Mar', 'Apr', 'Apr', 'Apr', 'Apr', 'Apr'],
'Temperature': ['High', 'Low', 'High', 'Extreme', 'Low', 'High', 'High', 'Low', 'High', 'Extreme', 'High', 'Low'],
'Humidity': ['Dry', 'Humid', 'Dry', 'Dry', 'Humid', 'Humid', 'Dry', 'Humid', 'Dry', 'Dry', 'Humid', 'Dry'],
'Wind': ['No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes'],
'Outlook': ['sunny', 'rainy', 'overcast', 'sunny', 'rainy', 'overcast', 'sunny', 'rainy', 'sunny', 'overcast', 'sunny', 'rainy'],
'Crowdedness': [85, 30, 65, 45, 25, 90, 95, 35, 70, 50, 80, 45]
}
# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

As we can see, we have plenty of categorical variables. Our task is to encode these variables so that a machine learning model can use them to predict, say, the Crowdedness of the golf course.

Let’s get into it.

Label Encoding assigns a unique integer to every category in a categorical variable.

Common Use 👍: It’s often used for ordinal variables where there’s a clear order to the categories, such as education levels (e.g., primary, secondary, tertiary) or product ratings (e.g., 1 star, 2 stars, 3 stars).

In Our Case: We could use Label Encoding for the ‘Weekday’ column in our golf dataset. Each day of the week would be assigned a unique number (e.g., Monday = 0, Tuesday = 1, etc.). However, we need to be careful, as this might imply that Sunday (6) is “greater than” Saturday (5), which may not be meaningful for our analysis.

Two columns showing weekday encoding. Left column lists days from Monday to Friday with corresponding numbers 0–11. Right column shows encoded values 0–6 repeating, where 0 represents Monday and 6 Sunday. A calendar icon above indicates these relate to days of the week.
# 1. Label Encoding for Weekday
df['Weekday_label'] = pd.factorize(df['Weekday'])[0]
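
One detail worth knowing: `pd.factorize` numbers categories in order of first appearance, so Monday only becomes 0 above because our data happens to start on a Monday. The sketch below uses a made-up series starting on Wednesday and a hypothetical `weekday_order` dictionary to show an explicit mapping that does not depend on row order.

```python
import pandas as pd

# Hypothetical data that starts mid-week instead of on Monday.
weekdays = pd.Series(['Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Mon', 'Tue'])

# factorize assigns codes by first appearance, so 'Wed' becomes 0 here.
codes_by_appearance, uniques = pd.factorize(weekdays)

# An explicit mapping (an assumption for this sketch) keeps Monday = 0
# no matter which day the data starts on.
weekday_order = {'Mon': 0, 'Tue': 1, 'Wed': 2, 'Thu': 3,
                 'Fri': 4, 'Sat': 5, 'Sun': 6}
codes_by_calendar = weekdays.map(weekday_order)

print(list(codes_by_appearance))
print(codes_by_calendar.tolist())
```

If the numbering needs to be stable across datasets (say, between training and new data), the explicit mapping is the safer of the two.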

One-Hot Encoding creates a new binary column for every category in a categorical variable.

Common Use 👍: It’s typically used for nominal variables where there’s no inherent order to the categories. It’s particularly useful when dealing with variables that have a relatively small number of categories.

In Our Case: One-Hot Encoding would be ideal for our ‘Outlook’ column. We’d create three new columns: ‘Outlook_sunny’, ‘Outlook_overcast’, and ‘Outlook_rainy’. Each row would have a 1 in one of these columns and 0 in the others, representing the weather condition for that day.

Two columns showing weather encoding. Left column lists weather conditions (sunny, rainy, overcast) for 12 days. Right column shows one-hot encoded values: 3 sub-columns for sunny, overcast, and rainy, with 1 indicating the condition and 0 otherwise. Weather icons above represent the three conditions.
# 2. One-Hot Encoding for Outlook
df = pd.get_dummies(df, columns=['Outlook'], prefix='Outlook', dtype=int)
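
As a quick sanity check, the sketch below one-hot encodes a small stand-in series (not the full dataset) and confirms that every row has exactly one 1. It also shows the `drop_first=True` option of `pd.get_dummies`, which drops one of the columns. Whether you want that depends on the model; it is typically considered for linear models, where the full set of columns is redundant. The variable names here are made up for illustration.

```python
import pandas as pd

# A small stand-in for the 'Outlook' column.
outlook = pd.Series(['sunny', 'rainy', 'overcast', 'sunny'], name='Outlook')

# One column per category; columns come out in sorted category order.
one_hot = pd.get_dummies(outlook, prefix='Outlook', dtype=int)

# Each row should contain exactly one 1.
row_sums = one_hot.sum(axis=1)

# Optional variant: drop the first category's column. A row of all zeros
# then stands for the dropped category ('overcast' here).
reduced = pd.get_dummies(outlook, prefix='Outlook', drop_first=True, dtype=int)

print(one_hot)
print(reduced)
```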

Binary Encoding represents each category as a binary number (0 and 1).

Common Use 👍: It’s often used when there are only two categories, mostly in a yes-no situation.

In Our Case: Since our ‘Wind’ column only has two categories (Yes and No), we can use Binary Encoding to demonstrate the technique. It would result in a single binary column, where one category (e.g., No) is represented as 0 and the other (Yes) as 1.

Two columns showing binary encoding. Left column lists “Yes” or “No” values for 12 entries. Right column shows the encoded values: 1 for “Yes” and 0 for “No”. A wind icon above indicates this likely represents wind presence.
# 3. Binary Encoding for Wind
df['Wind_binary'] = (df['Wind'] == 'Yes').astype(int)
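
The comparison above quietly turns anything that is not exactly 'Yes' into 0. As a hedged alternative, the sketch below uses `Series.map` with an explicit dictionary, so an unexpected value (here a made-up lowercase 'yes') shows up as a missing value instead of being silently counted as “No”.

```python
import pandas as pd

# Hypothetical column with one inconsistently spelled entry.
wind = pd.Series(['No', 'Yes', 'Yes', 'yes'])

# Comparison-based encoding: the stray 'yes' silently becomes 0.
by_comparison = (wind == 'Yes').astype(int)

# Mapping-based encoding: values missing from the dictionary become NaN,
# which makes data-quality problems visible.
by_mapping = wind.map({'No': 0, 'Yes': 1})
unmapped = int(by_mapping.isna().sum())

print(by_comparison.tolist())
print(unmapped)
```

On a clean column like ours both approaches give the same result; the mapping version is simply easier to audit.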

Target Encoding replaces each category with the mean of the target variable for that category.

Common Use 👍: It’s used when there’s likely a relationship between the categorical variable and the target variable. It’s particularly useful for high-cardinality features in datasets with a reasonable number of rows.

In Our Case: We could apply Target Encoding to our ‘Humidity’ column, using ‘Crowdedness’ as the target. Each ‘Dry’ or ‘Humid’ in the ‘Humidity’ column would be replaced with the average crowdedness observed for dry and humid days respectively.

Image shows target encoding for humidity. Left column lists “Dry” or “Humid” with corresponding visitor numbers. Right column replaces “Dry” with 65 (average visitors on dry days) and “Humid” with 52 (average on humid days). Icons above indicate humidity and visitor count.
# 4. Target Encoding for Humidity
df['Humidity_target'] = df.groupby('Humidity')['Crowdedness'].transform('mean')
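
Because the encoded value is computed from the target itself, plain target encoding can leak information and overfit, especially for categories with few rows. One common mitigation is to pull each category mean towards the global mean. The sketch below is one way to do that, not the method used above; the `smoothing` weight of 5 is an arbitrary assumption, and in practice the statistics would be computed on training data only.

```python
import pandas as pd

# The Humidity and Crowdedness columns from the golf dataset.
df_demo = pd.DataFrame({
    'Humidity': ['Dry', 'Humid', 'Dry', 'Dry', 'Humid', 'Humid',
                 'Dry', 'Humid', 'Dry', 'Dry', 'Humid', 'Dry'],
    'Crowdedness': [85, 30, 65, 45, 25, 90, 95, 35, 70, 50, 80, 45],
})

smoothing = 5  # hypothetical weight given to the global mean
global_mean = df_demo['Crowdedness'].mean()

# Per-category mean and row count.
stats = df_demo.groupby('Humidity')['Crowdedness'].agg(['mean', 'count'])

# Weighted blend: categories with few rows lean more on the global mean.
smoothed = (
    (stats['count'] * stats['mean'] + smoothing * global_mean)
    / (stats['count'] + smoothing)
)

df_demo['Humidity_target_smoothed'] = df_demo['Humidity'].map(smoothed)
print(smoothed)
```

With a smoothing weight of 0 this reduces to the plain category means (65 for Dry, 52 for Humid); larger weights move both values closer to the overall average.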

Ordinal Encoding assigns ordered integers to ordinal categories based on their inherent order.

Common Use 👍: It’s used for ordinal variables where the order of categories is meaningful and you want to preserve this order information.

In Our Case: Ordinal Encoding is ideal for our ‘Temperature’ column. We could assign integers to represent the order: Low = 1, High = 2, Extreme = 3. This preserves the natural ordering of temperature categories.

Image shows ordinal encoding for temperature. Left column lists temperatures as “Low”, “High”, or “Extreme” for 12 entries. Right column shows encoded values: 1 for “Low”, 2 for “High”, and 3 for “Extreme”. A sun icon above indicates this represents temperature levels.
# 5. Ordinal Encoding for Temperature
temp_order = {'Low': 1, 'High': 2, 'Extreme': 3}
df['Temperature_ordinal'] = df['Temperature'].map(temp_order)
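
An alternative to a hand-written dictionary is pandas’ ordered categorical type, which stores the order alongside the data. The sketch below uses a small stand-in series; the `levels` list is our assumed ordering, and 1 is added so the codes match the Low = 1 convention used above.

```python
import pandas as pd

# A small stand-in for the 'Temperature' column.
temperature = pd.Series(['High', 'Low', 'Extreme', 'High'])

# Assumed ordering, from coldest to hottest.
levels = ['Low', 'High', 'Extreme']
as_ordered = pd.Categorical(temperature, categories=levels, ordered=True)

# .codes are 0-based positions in `levels`; shift by 1 to start at Low = 1.
# Values not listed in `levels` would get code -1 (so 0 after the shift).
codes = pd.Series(as_ordered.codes) + 1

print(codes.tolist())
```

The ordered type also allows comparisons such as checking which rows are hotter than 'Low', which a plain string column cannot do.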

Cyclic Encoding transforms a cyclical categorical variable into two numerical features that preserve the variable’s cyclical nature. It typically uses sine and cosine transformations to represent the cyclical pattern. For instance, for the column “Month” we’d make it numerical first (1–12), then create two new features:

  • Month_cos = cos(2π (m - 1) / 12)
  • Month_sin = sin(2π (m - 1) / 12)

where m is a number from 1 to 12 representing January to December.

Circular diagram representing cyclical encoding of time. A circle with 12 points labeled 1 to 12 clockwise, resembling a clock face. Point 3 is highlighted, with its coordinates (0.5, 0.866) calculated using cosine and sine functions. The formula (cos(2π(3–1)/12), sin(2π(3–1)/12)) is shown above, demonstrating how the position is derived from the hour number.
Think of the encoding as the (x, y) coordinate on this weird clock, with positions numbered 1–12. To preserve the cyclical order, we need to represent each position using two columns instead of one.

Common Use: It’s used for categorical variables that have a natural cyclical order, such as days of the week, months of the year, or hours of the day. Cyclic encoding is especially useful when the “distance” between categories matters and wraps around (e.g., the distance between December and January should be small, just like the distance between any other pair of consecutive months).

In Our Case: In our golf dataset, the best column for cyclic encoding would be the ‘Month’ column. Months have a clear cyclical pattern that repeats every year. This could be particularly useful for our golf dataset, as it would capture seasonal patterns in golfing activity that may repeat annually. Here’s how we could apply it:

Image shows cyclical encoding for months. Left column lists months (Mar, Apr) for 12 entries. Middle column assigns numbers (3 for Mar, 4 for Apr). Right columns show sin and cos values calculated using formulas sin(2π(m-1)/12) and cos(2π(m-1)/12), where m is the month number. This creates a cyclical representation of months, with March values at (0.866, 0.5) and April at (1, 0).
# 6. Cyclic Encoding for Month
month_order = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
df['Month_num'] = df['Month'].map(month_order)
df['Month_sin'] = np.sin(2 * np.pi * (df['Month_num']-1) / 12)
df['Month_cos'] = np.cos(2 * np.pi * (df['Month_num']-1) / 12)
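
To see why two columns are worth the trouble, the sketch below compares distances on the “clock”. The helper functions `month_to_xy` and `distance` are hypothetical, written only for this check. With plain month numbers, December (12) and January (1) are 11 apart; on the circle they are as close as any other pair of neighbouring months.

```python
import numpy as np


def month_to_xy(m):
    """Map a month number (1-12) to its (cos, sin) point on the unit circle."""
    angle = 2 * np.pi * (m - 1) / 12
    return np.array([np.cos(angle), np.sin(angle)])


def distance(month_a, month_b):
    """Straight-line distance between two months on the circle."""
    return np.linalg.norm(month_to_xy(month_a) - month_to_xy(month_b))


dec_jan = distance(12, 1)   # neighbours across the year boundary
jan_feb = distance(1, 2)    # ordinary neighbours
jan_jul = distance(1, 7)    # opposite sides of the year

print(dec_jan, jan_feb, jan_jul)
```

The first two distances come out equal, while January and July, half a year apart, sit at opposite ends of the circle.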

So, there you have it! Six different ways to encode categorical data, all applied to our golf course dataset. Now, all categories are transformed into numbers!

Let’s recap how each method tackled our data:

  1. Label Encoding: Turned our ‘Weekday’ into numbers, making Monday 0 and Sunday 6. Simple, but potentially misleading.
  2. One-Hot Encoding: Gave ‘Outlook’ its own columns, letting ‘sunny’, ‘overcast’, and ‘rainy’ stand independently.
  3. Binary Encoding: Compressed our ‘Wind’ column into a single 0/1 column without losing information.
  4. Target Encoding: Replaced ‘Humidity’ categories with average ‘Crowdedness’, capturing hidden relationships.
  5. Ordinal Encoding: Respected the natural order of ‘Temperature’, from ‘Low’ to ‘Extreme’.
  6. Cyclic Encoding: Transformed ‘Month’ into sine and cosine components, preserving its circular nature.

There’s no one-size-fits-all solution in categorical encoding. The best method depends on your specific data, the nature of your categories, and the requirements of your machine learning model.

Encoding categorical data might seem like a small step in the grand scheme of a machine learning project, but it’s often these seemingly minor details that can make or break a model’s performance.
