Latest Scikit-Learn is More Suitable for Data Analysis

Pandas Compatibility and More in Scikit-Learn Version ≥1.2.0

Some pretty cool updates in the latest Sklearn! (Source: Author's Notebook)

Back in December, Scikit-Learn released a major stable update (v. 1.2.0-1), and I finally got to try some of the highlighted new features. It is now more compatible with Pandas, and a few other features also help us in regression as well as classification tasks. Below, I go through some of the latest updates with examples of how to use them. Let's begin!

Compatibility with Pandas:

Applying data standardization before training an ML model such as a regression or a neural net is a standard technique to make sure that features with varying ranges get equal importance (if or when necessary) in predictions. Scikit-Learn provides various pre-processing APIs like StandardScaler, MaxAbsScaler, etc. With the newer version, these transformers can keep the DataFrame format intact after transformation; let's see below:

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
########################
X, y = load_wine(as_frame=True, return_X_y=True)
# as_frame is available from version >= 0.23
########################
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)
X_train.head(3)
A part of the Wine dataset in DataFrame format

The newer version includes an option to keep this DataFrame format even after standardization:


############
# v1.2.0
############

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().set_output(transform="pandas")
## change here

scaler.fit(X_train)
X_test_scaled = scaler.transform(X_test)
X_test_scaled.head(3)

DataFrame format is kept as it is even after standardization.

Before, it would have changed the format to a NumPy array:

###########
# v 0.24
###########

scaler = StandardScaler()  # no set_output API in the older version
scaler.fit(X_train)
X_test_scaled = scaler.transform(X_test)
print (type(X_test_scaled))

>>> <class 'numpy.ndarray'>

With the DataFrame format remaining intact, we don't have to keep tabs on the columns, as we needed to do with the NumPy array format. Analysis and plotting become easier:


import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 5))
fig.add_subplot(121)
plt.scatter(X_test['proline'], X_test['hue'],
            c=X_test['alcohol'], alpha=0.8, cmap='bwr')
clb = plt.colorbar()
plt.xlabel('Proline', fontsize=11)
plt.ylabel('Hue', fontsize=11)
fig.add_subplot(122)
plt.scatter(X_test_scaled['proline'], X_test_scaled['hue'],
            c=X_test_scaled['alcohol'], alpha=0.8, cmap='bwr')
# pretty easy now in the newer version to see the effect of scaling

plt.xlabel('Proline (Standardized)', fontsize=11)
plt.ylabel('Hue (Standardized)', fontsize=11)
clb = plt.colorbar()
clb.ax.set_title('Alcohol', fontsize=8)
plt.tight_layout()
plt.show()

Fig. 1: Dependence of features before and after standardization. (Source: Author's Notebook)

This even holds when we build a pipeline, as below:


from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

clf = make_pipeline(StandardScaler(), SVC())
clf.set_output(transform="pandas") # change here
svm_fit = clf.fit(X_train, y_train)

print (clf[:-1]) # StandardScaler
print ('check that set_output format indeed stays even when we build a pipeline:', '\n')
X_test_transformed = clf[:-1].transform(X_test)

X_test_transformed.head(3)

DataFrame format can be kept as it is even within a pipeline!

Fetching Datasets is Faster and More Efficient:

OpenML is an open platform for sharing datasets, and the dataset API in Sklearn offers the fetch_openml function to fetch data; with the updated Sklearn, this step is more efficient in both memory and time.


import time

from sklearn.datasets import fetch_openml

start_t = time.time()
X, y = fetch_openml("titanic", version=1, as_frame=True,
                    return_X_y=True, parser="pandas")
# parser="pandas" is the addition in version 1.2.0

X = X.select_dtypes(["number", "category"]).drop(columns=["body"])
print ('check types: ', type(X), '\n', X.head(3))
print ('check shapes: ', X.shape)
end_t = time.time()
print ('time taken: ', end_t - start_t)

Using parser='pandas' makes a drastic improvement in runtime and memory consumption. One can easily check the memory consumption of the current process using the psutil library as:

import psutil

# resident memory of the current Python process, in MB
print (psutil.Process().memory_info().rss / 1e6)
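To see the runtime difference directly, one can also time both parsers; this is a quick sketch of my own (note that fetch_openml caches the download after the first call, so repeated calls mostly measure parsing):

import time

from sklearn.datasets import fetch_openml

# compare the old LIAC-ARFF parser against the new pandas-based one
for parser in ["liac-arff", "pandas"]:
    t0 = time.time()
    fetch_openml("titanic", version=1, as_frame=True,
                 return_X_y=True, parser=parser)
    print (parser, 'parser took:', round(time.time() - t0, 2), 's')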

Partial Dependence Plots: Categorical Features

Partial dependence plots existed before too, but only for numerical features; now they have been extended to categorical features.

As described in the Sklearn documentation:

Partial dependence plots show the dependence between the target response and a set of input features of interest, marginalizing over the values of all other input features (the 'complement' features). Intuitively, we can interpret the partial dependence as the expected target response as a function of the input features of interest.

Using the 'titanic' dataset from above, we can easily plot the partial dependence of categorical features:
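Here is a minimal sketch of such a code block, reusing X and y from the fetch_openml step above. The model (a HistGradientBoostingClassifier behind an OrdinalEncoder) and the chosen features 'sex' and 'embarked' are my own assumptions; the categorical_features argument of PartialDependenceDisplay.from_estimator is the actual v1.2.0 addition:

import matplotlib.pyplot as plt

from sklearn.compose import make_column_transformer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

cat_features = ["sex", "embarked"]  # assumed choice of categorical columns

# ordinal-encode the category columns; pass the numeric ones through
preprocessor = make_column_transformer(
    (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1,
                    encoded_missing_value=-1), cat_features),
    remainder="passthrough",
)
model = make_pipeline(preprocessor,
                      HistGradientBoostingClassifier(random_state=0))
model.fit(X, y)

# categorical_features here is the v1.2.0 addition
fig, ax = plt.subplots(figsize=(8, 4))
PartialDependenceDisplay.from_estimator(model, X, features=cat_features,
                                        categorical_features=cat_features,
                                        ax=ax)
plt.show()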

With the code block above, we can get partial dependence plots like the ones below:

Fig. 2: Partial dependence plots of categorical variables. (Source: Author's Notebook)

With version 0.24, we would instead get a value error for categorical variables:

>>> ValueError: could not convert string to float: 'female'

Directly Plot Residuals (Regression Models):

For analyzing the performance of a classification model, plotting routines like PrecisionRecallDisplay and RocCurveDisplay already existed within the Sklearn metrics API in older versions (0.24); with the new update, it is possible to do something similar for regression models. Let's see an example below:
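Here is a minimal sketch of the new PredictionErrorDisplay API; the synthetic data from make_regression and the plain LinearRegression are my own assumptions (the author's exact setup is in [3]):

import matplotlib.pyplot as plt

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import PredictionErrorDisplay
from sklearn.model_selection import train_test_split

# toy regression problem to demonstrate the display
X_r, y_r = make_regression(n_samples=300, n_features=5, noise=20.0,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_r, y_r, random_state=0)
y_pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)

fig, axs = plt.subplots(1, 2, figsize=(9, 4))
# left: actual vs. predicted; right: residuals vs. predicted
PredictionErrorDisplay.from_predictions(y_te, y_pred,
                                        kind="actual_vs_predicted", ax=axs[0])
PredictionErrorDisplay.from_predictions(y_te, y_pred,
                                        kind="residual_vs_predicted",
                                        ax=axs[1])
plt.tight_layout()
plt.show()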

Linear model fit and the corresponding residuals can be directly plotted using Sklearn. (Source: Author's Notebook)

While it's always possible to plot the fitted line and residuals using matplotlib or seaborn, once we have settled on the best model, it's great to be able to quickly check the results directly within the Sklearn environment.

There are a few more improvements/additions available in the new Sklearn, but I found these four major improvements to be particularly useful for regular data analysis most of the time.

References:

[1] Sklearn Release Highlights: V 1.2.0

[2] Sklearn Release Highlights: Video

[3] All the plots and code: My GitHub
