Categorical Features: What’s Wrong With Label Encoding?

Why we can’t arbitrarily encode categorical features

Clouds. Image by Author.

It’s well-known that many machine learning models can’t process categorical features natively. While there are some exceptions, it’s usually up to the practitioner to decide on a numeric representation for each categorical feature. There are many ways to accomplish this, but one strategy that is seldom advisable is label encoding.

Label encoding replaces each categorical value with an arbitrary number. For instance, if we have a feature containing letters of the alphabet, label encoding might assign the letter “A” a value of 0, the letter “B” a value of 1, and continue this pattern until “Z”, which is assigned 25. After this process, technically speaking, any algorithm should be able to handle the encoded feature.
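To make this concrete, here is a minimal sketch of the alphabet example using scikit-learn’s LabelEncoder; the details (building the letters with Python’s string module, the print statement) are my own illustration:

import string
from sklearn.preprocessing import LabelEncoder

letters = list(string.ascii_uppercase)  # "A" through "Z"
encoder = LabelEncoder()
encoded = encoder.fit_transform(letters)

print(encoded[:3], encoded[-1])  # [0 1 2] 25 -- classes are sorted alphabetically

Note that LabelEncoder assigns integers by sorting the unique values, so the mapping is arbitrary with respect to the target: nothing about “A” makes it meaningfully “less than” “Z”.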

But what’s the issue with this? Shouldn’t sophisticated machine learning models be able to handle this kind of encoding? Why do libraries like CatBoost and other encoding strategies exist to deal with high-cardinality categorical features?

This article will explore two examples demonstrating why label encoding can be problematic for machine learning models. These examples will help us appreciate why there are so many alternatives to label encoding, and they will deepen our understanding of the relationship between data complexity and model performance.

One of the best ways to gain intuition for a machine learning concept is to understand how it works in a low-dimensional space and try to extrapolate the result to higher dimensions. This mental extrapolation doesn’t always align with reality, but for our purposes, all we need is a single feature to see why we need better categorical encoding strategies.

A Feature With 25 Categories

Let’s start by looking at a basic toy dataset with one feature and a continuous target. Here are the dependencies we need:

import numpy as np
import polars as pl
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from…
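The snippet above is cut off in this excerpt, but based on the imports and the description (one feature with 25 categories and a continuous target), a toy dataset along these lines could look like the following sketch. The specific construction here — per-category means drawn at random, the noise scale, and the sample size — is my own assumption, not the article’s original code:

# Hypothetical construction of the toy dataset: 25 categories, each with
# its own underlying target mean, plus Gaussian noise on the target.
rng = np.random.default_rng(0)
categories = list("ABCDEFGHIJKLMNOPQRSTUVWXY")  # 25 letters

n_samples = 5_000
feature = rng.choice(categories, size=n_samples)

# Assign each category a random mean, then add noise per observation
category_means = {c: rng.normal(0, 10) for c in categories}
target = np.array([category_means[c] for c in feature]) + rng.normal(0, 1, size=n_samples)

df = pl.DataFrame({"feature": feature, "target": target})

With data like this, the target depends on the category itself, not on the arbitrary integer a label encoder would assign — which is exactly the mismatch the article sets out to demonstrate.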
