Although continuous variables in real-world datasets carry detailed information, they are not always the most effective form for modelling and interpretation. That is where variable discretization comes into play.
Understanding variable discretization is crucial for data science students building strong ML foundations and for AI engineers designing interpretable systems.
Early in my data science journey, I mainly focused on tuning hyperparameters, experimenting with different algorithms, and optimising performance metrics.
When I experimented with variable discretization methods, I noticed how certain ML models became more stable and interpretable. So, I decided to explain these methods in this article.
What is variable discretization?
Some ML models work better with discrete variables. For instance, if we want to train a decision tree model on a dataset with continuous variables, it is best to transform these variables into discrete ones to reduce the model training time.
Benefits of variable discretization
- Decision trees and naive Bayes models work better with discrete variables.
- Discrete features are easy to understand and interpret.
- Discretization can reduce the impact of skewed variables and outliers in data.
In summary, discretization simplifies data and allows models to train faster.
Disadvantages of variable discretization
The main drawback of variable discretization is the loss of information caused by the creation of bins. We want to find the minimum number of bins that avoids a significant loss of information. The algorithm cannot find this number itself; the user must input the number of bins as a model hyperparameter. Then, the algorithm will find the cut points to match that number of bins.
Supervised and unsupervised discretization
The main categories of discretization methods are supervised and unsupervised. Unsupervised methods determine the boundaries of the bins using the underlying distribution of the variable, while supervised methods use ground truth values to determine these boundaries.
Types of variable discretization
We will discuss the following types of variable discretization.
- Equal-width discretization
- Equal-frequency discretization
- Arbitrary-interval discretization
- K-means clustering-based discretization
- Decision tree-based discretization
Equal-width discretization
As the name suggests, this method creates bins of equal size. The width of a bin is calculated by dividing the range of values of a variable, X, by the number of bins, k.
Width = (Max(X) - Min(X)) / k
Here, k is a hyperparameter defined by the user.
For example, if the values of X range between 0 and 50 and k=5, the bin width is 10 and the bins are 0–10, 10–20, 20–30, 30–40 and 40–50. If k=2, the bin width is 25 and the bins are 0–25 and 25–50. So, the bin width changes based on the value of the hyperparameter. Equal-width discretization generally assigns a different number of data points to each bin; only the bin widths are the same.
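The width formula can be sketched in a few lines; a minimal example (the range 0 to 50 and k=5 are taken from the worked example above):

```python
import numpy as np

# Range of the variable and number of bins from the example above
x_min, x_max, k = 0, 50, 5

# Width = (Max(X) - Min(X)) / k
width = (x_max - x_min) / k
print(width)  # 10.0

# k equal-width bins need k + 1 edges: 0, 10, 20, 30, 40, 50
edges = np.linspace(x_min, x_max, k + 1)
print(edges)
```

Changing k to 2 yields a width of 25 and edges at 0, 25 and 50, matching the second example.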
Let's implement equal-width discretization using the Iris dataset. strategy='uniform' in KBinsDiscretizer creates bins of equal width.
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Select one feature
feature = 'sepal length (cm)'
X = df[[feature]]
# Initialize
equal_width = KBinsDiscretizer(
    n_bins=15,
    encode='ordinal',
    strategy='uniform'
)
bins_equal_width = equal_width.fit_transform(X)
plt.hist(bins_equal_width, bins=15)
plt.title("Equal Width Discretization")
plt.xlabel(feature)
plt.ylabel("Count")
plt.show()
The histogram shows bins of equal width.
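You can also verify the equal widths numerically by inspecting the fitted discretizer's bin_edges_ attribute; a small sketch that refits the same feature:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
X = df[['sepal length (cm)']]

equal_width = KBinsDiscretizer(
    n_bins=15,
    encode='ordinal',
    strategy='uniform'
)
equal_width.fit(X)

# bin_edges_ holds one array of edges per feature; consecutive
# differences are the bin widths, which should all be equal
edges = equal_width.bin_edges_[0]
widths = np.diff(edges)
print(widths)
```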
Equal-frequency discretization
This method allocates the values of the variable into bins that contain a similar number of data points. The bin widths are not the same; the bin boundaries are determined by quantiles of the data, which divide the data into equal-sized parts. Here also, the number of bins is defined by the user as a hyperparameter.
A drawback of equal-frequency discretization is that when the distribution is heavily skewed, with many repeated values, several quantile cut points can coincide, so some bins collapse or hold very few distinct values. This can lead to a significant loss of information.
Let's implement equal-frequency discretization using the Iris dataset. strategy='quantile' in KBinsDiscretizer creates balanced bins; each bin has (roughly) an equal number of data points.
# Import libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Select one feature
feature = 'sepal length (cm)'
X = df[[feature]]
# Initialize
equal_freq = KBinsDiscretizer(
    n_bins=3,
    encode='ordinal',
    strategy='quantile'
)
bins_equal_freq = equal_freq.fit_transform(X)
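To see that each bin receives roughly the same number of data points, count the samples per bin; a minimal check that refits the discretizer from the snippet above:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
X = df[['sepal length (cm)']]

equal_freq = KBinsDiscretizer(
    n_bins=3,
    encode='ordinal',
    strategy='quantile'
)
bins = equal_freq.fit_transform(X)

# With 150 samples and 3 quantile bins, each bin holds roughly 50
# points; ties at the quantile boundaries cause small deviations
counts = np.bincount(bins.ravel().astype(int))
print(counts)
```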
Arbitrary-interval discretization
In this method, the user allocates the data points of a variable into bins in a way that makes sense for the problem (hence "arbitrary"). For example, you may allocate the values of the variable into bins representing low, medium and high ranges. Priority is given to domain knowledge. There is no need to have the same bin width or an equal number of data points in each bin.
Here, we manually define bin boundaries based on domain knowledge.
# Import libraries
import pandas as pd
from sklearn.datasets import load_iris
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Select one feature
feature = 'sepal length (cm)'
X = df[[feature]]
# Define custom bins
custom_bins = [4, 5.5, 6.5, 8]
df['arbitrary'] = pd.cut(
    df[feature],
    bins=custom_bins,
    labels=[0, 1, 2]
)
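A quick way to check that the custom boundaries behave as intended is to count the samples per label; a small self-contained sketch using the same bins as above:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
feature = 'sepal length (cm)'

# Same hand-picked boundaries as above: (4, 5.5], (5.5, 6.5], (6.5, 8]
df['arbitrary'] = pd.cut(df[feature], bins=[4, 5.5, 6.5, 8], labels=[0, 1, 2])

# Count how many flowers fall into each hand-picked interval
print(df['arbitrary'].value_counts().sort_index())
```

Since sepal length ranges from 4.3 to 7.9 cm, every sample lands inside one of the three intervals.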
K-means clustering-based discretization
K-means clustering groups similar data points into clusters, and this can be used for variable discretization: the bins are the clusters identified by the k-means algorithm. Here also, we need to define the number of clusters, k, as a model hyperparameter. There are several methods to determine the optimal value of k.
Here, we use the k-means algorithm to create groups that act as discretized categories.
# Import libraries
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Select one feature
feature = 'sepal length (cm)'
X = df[[feature]]
kmeans = KMeans(n_clusters=3, random_state=42)
df['kmeans'] = kmeans.fit_predict(X)
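One caveat: KMeans numbers its clusters in an arbitrary order, so label 0 is not necessarily the lowest-value bin. A small sketch (an assumed post-processing step, not part of the original code) that reorders the labels by cluster center so the bins become ordinal:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
X = df[['sepal length (cm)']]

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)

# Rank cluster centers so that bin 0 = lowest values, bin 2 = highest
order = np.argsort(kmeans.cluster_centers_.ravel())
remap = {old: new for new, old in enumerate(order)}
df['kmeans_ordered'] = pd.Series(labels).map(remap)
print(df['kmeans_ordered'].value_counts().sort_index())
```

On a single variable, k-means produces contiguous intervals, so the reordered labels behave like ordinal bins.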
Decision tree-based discretization
The decision tree-based discretization process uses decision trees to find the boundaries of the bins. Unlike other methods, this one automatically finds the optimal cut points, so the user does not have to define them.
The discretization methods that we discussed so far are unsupervised methods. However, this method is a supervised method, meaning that we also use the target values, y, to determine the boundaries.
# Import libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Select one feature
feature = 'sepal length (cm)'
X = df[[feature]]
# Get the target values
y = iris.target
tree = DecisionTreeClassifier(
    max_leaf_nodes=3,
    random_state=42
)
tree.fit(X, y)
# Get leaf node for every sample
df['decision_tree'] = tree.apply(X)
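The cut points the tree has learned can be read from its internal structure; a minimal sketch (tree.tree_.threshold holds the split values, with the sentinel -2 marking leaf nodes):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
X = df[['sepal length (cm)']]
y = iris.target

tree = DecisionTreeClassifier(max_leaf_nodes=3, random_state=42)
tree.fit(X, y)

# Internal nodes carry the learned cut points; leaves are marked
# with the sentinel threshold value -2
thresholds = sorted(t for t in tree.tree_.threshold if t != -2)
print(thresholds)  # the bin boundaries found by the tree
```

With max_leaf_nodes=3, the tree makes two splits, so the feature is divided into three bins by two learned boundaries.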
This is an overview of variable discretization methods. Each method will be discussed in more detail in separate articles.
This is the end of today's article.
Please let me know if you have any questions or feedback.
How about an AI course?
See you in the next article. Happy learning to you!
Iris dataset info
- Citation: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
- Source: https://archive.ics.uci.edu/ml/datasets/iris
- License: Michael Marshall donated this dataset to the public under the Creative Commons CC0 license. You can learn more about different dataset license types here.
Designed and written by:
Rukshan Pramoditha
2025-03-04
