## A step-by-step case study of how data scientists approach and execute a cluster analysis

Going back to my time as an undergraduate student in statistics, there is a day that stands out. It was the first day of the multivariate analysis module. The class was new at the time, and to our surprise, the professor decided to do something different. Instead of going through the agenda for the semester, he turned off the lights and announced that today we'd learn about the module in a different way: by watching the pilot episode of the series "Numb3rs".

The series focuses on FBI Special Agent Don Eppes and his brother, Charles Eppes, a brilliant mathematician who uses data science to catch the most elusive criminals. More specifically, in the pilot episode, Charles uses cluster analysis to find the criminal's point of origin. Needless to say, we were all hooked.

Fast forward to today: I have worked in different industries, from eCommerce retail to consultancy and gaming, and the one constant request from business stakeholders has been to segment (cluster) their customers or users. It seems that clustering piques the interest of both data scientists and business stakeholders (as well as movie audiences).

In this article, I have created a guide for aspiring data scientists who want to add cluster analysis to their arsenal and get one step closer to landing their first data science job. The guide is structured in two parts:

- Understanding the three building blocks of cluster analysis
- A case study: visualizing the data, processing it, computing the similarity matrix (distances), choosing the number of clusters, and describing the results

## 1.1. What is Clustering?

Clustering is used to group entities based on a defined set of characteristics. For example, we can use clustering to group customers (entities) based on their shopping behavior (characteristics).

## 1.2. How does Clustering Work?

In supervised learning (like linear regression), the algorithm is trained on a labeled dataset with the goal of maximizing prediction accuracy on new, unlabeled data. For example, an algorithm can be trained on a labeled dataset that includes characteristics of houses (square footage, number of bedrooms, etc.) and their corresponding label (sales price), with the goal of creating a model that predicts, with high accuracy, the sales price of a new house based on its respective characteristics.
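To make the supervised setup concrete, here is a minimal sketch in R (the toy `houses` data frame and its values are made up purely for illustration):

```r
# Toy labeled dataset: house characteristics plus the label (price)
houses <- data.frame(
  sqft     = c(1100, 1500, 1800, 2400, 3000),
  bedrooms = c(2, 3, 3, 4, 5),
  price    = c(180000, 240000, 290000, 380000, 470000)
)

# Train a linear regression on the labeled data
model <- lm(price ~ sqft + bedrooms, data = houses)

# Predict the sales price of a new, unlabeled house
predict(model, newdata = data.frame(sqft = 2000, bedrooms = 3))
```

The labeled column (`price`) is exactly what a clustering problem lacks.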

Clustering, on the other hand, is an unsupervised technique. The dataset has no label (hence the name unsupervised). Instead, the goal is to create a new label in the form of a cluster assignment. Every cluster analysis will be different, but in each case, there are three fundamental building blocks, or steps if you prefer:

- Create a dataset that is unique by row for the entity you want to cluster. So if you want to cluster customers, each row must represent a different customer. For each of these rows, you will have attributes (columns) that describe specific characteristics, like revenue in the past 12 months, favorite product, number of days since the last purchase, etc.
- Compute the similarity between each row and all others in your dataset (based on the available attributes). So we will have a similarity value between the first and the second row, the first and third row, the second and third row, and so on. The most frequently used similarity functions are distances. We'll go through these in more detail in the case study in the next section, as they're better explained using an example.
- Input the similarity matrix into a clustering algorithm. There are many available in R and Python, like k-means, hierarchical clustering, and PAM. Clustering algorithms have a simple purpose: assign observations to clusters so that observations within a cluster are as similar as possible and observations in different clusters are as dissimilar as possible. In the end, you will have a cluster assignment for each row (observation).

## 2. The Case Study

Although I couldn't find the dataset used in the pilot episode of "Numb3rs", I thought it would be fitting to use something similar for our case study, so I picked the US Arrests dataset, loaded from the `datasets` package (part of base R).

The dataset contains 50 rows and 4 attributes. Each row is a US state (entity). The 4 attributes describe the following characteristics of each state:

- Number of murder arrests per 100,000 residents in 1973
- Number of assault arrests per 100,000 residents in 1973
- Number of rape arrests per 100,000 residents in 1973
- Percent of the population living in urban areas in 1973

We also have the state name as row names. This will not be used as input for the clustering algorithm.

```r
##############################
# Loading libraries
##############################
library("dplyr")    # summarizing data
library("ggplot2")  # visualization
library("cluster")  # gower distance and PAM
library("alluvial") # alluvial plots

##############################
# Loading the data set
##############################
data("USArrests")

##############################
# Examine data set
##############################
head(USArrests)
```

## 2.1. Visualizing the Data

In the box plots below, all 4 attributes seem roughly symmetric (UrbanPop is slightly left-skewed, and Rape slightly right-skewed). If an attribute were heavily skewed, or if there were outliers, we could consider applying a transformation (such as a log transformation). This is not necessary with our dataset.

```r
##############################
# Box plot for each attribute
##############################
USArrests %>%
  select(c(Murder, Assault, Rape, UrbanPop)) %>%
  boxplot(.,
          boxwex = 0.7,
          alpha = 0.2,
          col = c("red3", "lightgreen", "lightblue", "purple"),
          horizontal = TRUE
  )
```

You can also see that the Assault attribute has much higher values than the other three. It's good practice in clustering to standardize all attributes to the same scale before computing the similarities. This is done so that attributes on a larger scale don't overcontribute (Assault is in the hundreds, whereas the other three attributes are in the tens). Fortunately, the similarity function we will use takes care of that, so we don't need to do any rescaling at this stage.
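To see the problem we are being saved from, here is a quick sketch: with plain Euclidean distance on the raw data, the Assault difference dominates, whereas standardizing with `scale()` puts all attributes on an equal footing (the pair of states chosen is arbitrary):

```r
# Raw Euclidean distance between the first two states (Alabama, Alaska):
# the squared Assault difference (236 vs 263) contributes most of it
raw <- USArrests[, c("Murder", "Assault", "Rape", "UrbanPop")]
dist(raw[1:2, ])

# After standardizing each attribute (mean 0, sd 1),
# no single attribute dominates the distance
dist(scale(raw)[1:2, ])
```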

## 2.2. Computing Similarity Matrix

For our case study, we will use the Gower distance, from the `daisy()` function (`cluster` package) in R, to compute the similarity matrix. To be more specific, Gower uses dissimilarity as a distance function, but the concept is the same (instead of maximizing similarity, we minimize dissimilarity).

Gower is one of the few distance functions that can work with mixed-type data (numerical and categorical attributes). Another advantage is that Gower scales all distances to be between 0 and 1, so attributes on larger scales don't overcontribute.
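As a quick illustration of the mixed-type point (the tiny `customers` data frame below is hypothetical, not part of the case study), `daisy()` happily accepts a factor column that Euclidean distance could not handle:

```r
library("cluster")

# Hypothetical customers with numeric and categorical attributes
customers <- data.frame(
  revenue          = c(120, 950, 400),
  favorite_product = factor(c("shoes", "shoes", "hats")),
  days_since_last  = c(3, 40, 12)
)

# Gower dissimilarities: each attribute is scaled to [0, 1] before averaging
daisy(customers, metric = "gower")
```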

So for our dataset with n = 50 rows (states), we will have a 50×50 matrix with 1225 unique values (n(n-1)/2). The diagonal is not needed (the distance of a state with itself), and the upper triangle is the same as the lower (the matrix is symmetric).
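A quick sanity check of that count in R:

```r
# Unique pairwise distances for n = 50 observations
n <- 50
n * (n - 1) / 2 # 1225
choose(n, 2)    # same thing: 1225
```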

```r
##############################
# Get distance
##############################
gower_dist <-
  USArrests %>%
  select(c(Murder, Assault, Rape, UrbanPop)) %>%
  daisy(., metric = "gower")

gower_dist
```

## 2.3. Choosing the Number of Clusters

For our case study, we will use the PAM (Partitioning Around Medoids) clustering algorithm and, more specifically, the `pam()` function (`cluster` package).

PAM (and all other k-medoid algorithms) is a robust alternative to k-means because it uses medoids as cluster centers instead of means. As a result, the algorithm is less sensitive to noise and outliers compared to k-means (but not immune!).
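A small sketch of that robustness claim on made-up one-dimensional data (not part of the case study): two tight groups plus one outlier. K-means drags a centroid toward the outlier, while PAM's medoids must be actual observations and stay near the group centers.

```r
library("cluster")
set.seed(1)

# Two tight groups around 0 and 10, plus one outlier at 30
x <- matrix(c(rnorm(10, mean = 0, sd = 0.5),
              rnorm(10, mean = 10, sd = 0.5),
              30),
            ncol = 1)

kmeans(x, centers = 2, nstart = 10)$centers # one centroid pulled toward 30
pam(x, k = 2)$medoids                       # medoids stay at typical points
```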

First, we need to select the number of clusters. The approach for determining the number of clusters is pretty straightforward. We'll run the PAM algorithm using the Gower similarity matrix we computed in the previous step, and in each run, we will select a different number of clusters (from 2 to 10). Then, we will compute the Average Silhouette Width (ASW) for each run. We'll do this using the function below.

```r
##############################
# Function to compute ASW
##############################
get_asw_using_pam <- function(distance, min_clusters, max_clusters) {
  average_sil_width <- c(NA)
  for (i in min_clusters:max_clusters) {
    pam_fit <-
      pam(
        distance,
        diss = TRUE,
        k = i
      )
    average_sil_width[i] <- pam_fit$silinfo$avg.width
  }
  return(average_sil_width)
}

##############################
# Get ASW from each run
##############################
sil_width <- get_asw_using_pam(gower_dist, 2, 10)
```

The ASW represents how well each observation fits in its current cluster compared to the closest neighboring cluster. It ranges from -1 to 1, with higher values (closer to 1) indicating better clustering results. We'll use the Silhouette Plot (ASW on the y-axis and number of clusters on the x-axis) to visually examine the promising candidates for our clustering.
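For intuition, the silhouette of observation i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is its average dissimilarity to its own cluster and b(i) to the nearest neighboring cluster. Assuming `gower_dist` from the previous step, we can inspect the per-observation values behind the ASW with `silhouette()`:

```r
library("cluster")

# Per-observation silhouette widths for the 2-cluster PAM fit
pam_fit <- pam(gower_dist, diss = TRUE, k = 2)
sil <- silhouette(pam_fit)
head(sil[, "sil_width"])

# Their mean is the ASW reported in pam_fit$silinfo$avg.width
mean(sil[, "sil_width"])
```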

```r
##############################
# Visualize Silhouette Plot
##############################
silhouette_plot <- sil_width %>% as.data.frame()
names(silhouette_plot)[names(silhouette_plot) == '.'] <- 'sil_width'

silhouette_plot %>%
  mutate(number = c(1:10)) %>%
  filter(number > 1) %>%
  ggplot(aes(x = number, y = sil_width)) +
  geom_line(color = "turquoise3", size = 1) +
  geom_point(color = "darkgrey", fill = "black", size = 3) +
  scale_x_continuous(breaks = seq(2, 10, 1)) +
  ylab("Average Silhouette Width") +
  xlab("No of Clusters") +
  theme_classic()
```

The solution with the highest ASW is 2 clusters; after a steep decrease, we have 3 and 4 clusters, after which the ASW drops abruptly again. Our possible candidates are these three options (2, 3, and 4 clusters). Let's save the clustering assignment from each run as three new attributes (cluster_2, cluster_3, and cluster_4) in the USArrests dataset.

```r
##############################
# Set the seed of R's random number generator,
# which is useful for reproducible results,
# as with cluster assignments
##############################
set.seed(5)

##############################
# Saving PAM results
##############################
pam_2 <- pam(gower_dist, diss = TRUE, k = 2)
pam_3 <- pam(gower_dist, diss = TRUE, k = 3)
pam_4 <- pam(gower_dist, diss = TRUE, k = 4)

##############################
# Adding assignments as columns
##############################
USArrests <-
  USArrests %>%
  mutate(cluster_2 = as.factor(pam_2$clustering)) %>%
  mutate(cluster_3 = as.factor(pam_3$clustering)) %>%
  mutate(cluster_4 = as.factor(pam_4$clustering))
```

## 2.4. Describing Cluster Solutions

Because clustering is an unsupervised technique, it is important to keep in mind that, in essence, it is exploratory data analysis (EDA) on steroids. So unlike with supervised techniques, we cannot evaluate the accuracy of our clustering solution (there is no label in the dataset to use for comparison). Instead, the "goodness" of your cluster analysis will be evaluated based on a different criterion:

Are the resulting clusters separated in a way that reflects the facets that the business stakeholders had in mind?

So the final decision is not simply based on the solution with the highest ASW. Instead, it is made by examining the characteristics of each clustering solution. You should always pick the clustering that is more actionable to the business over the one with the highest ASW.

First, we want to examine how the different clustering solutions are related. We'll use an alluvial plot to visualize the flow of observations (states) between the three runs.

```r
##############################
# Alluvial plot
##############################
aluv <-
  USArrests %>%
  group_by(cluster_2, cluster_3, cluster_4) %>%
  summarise(Freq = n())

alluvial(
  aluv[, 1:3],
  freq = aluv$Freq,
  border = "lightgrey",
  gap.width = 0.2,
  alpha = 0.8,
  col =
    ifelse(
      aluv$cluster_4 == 1, "red",
      ifelse(aluv$cluster_4 == 2, "lightskyblue",
             ifelse(aluv$cluster_4 == 3 & aluv$cluster_3 == 2, "lightskyblue4",
                    ifelse(aluv$cluster_4 == 4, "purple",
                           "orange")))),
  cex = 0.65
)
```

In the alluvial plot above, each column represents a different run, and the different colored ribbons represent fixed blocks of states as they split or get reassigned between the three cluster runs.

For example, the red ribbon represents the states of Alabama, Georgia, Louisiana, Mississippi, North Carolina, South Carolina, and Tennessee (see the code below). These states were assigned to the same cluster ("1") as the states from the blue ribbon in the runs with two and three clusters (cluster_2 and cluster_3), but were separated in the run with four clusters (cluster_4).

```r
##################################################
# Filter for the red ribbon (cluster_4 = 1)
##################################################
USArrests %>% filter(cluster_4 == "1")
```

Next, we want to better understand the characteristics of each cluster and how they differ between the three solutions.

```r
##################################################
# Summarizing (average) attributes, 2 clusters
##################################################
USArrests %>%
  group_by(cluster_2) %>%
  summarise(
    count = n(),
    mean_murder = mean(Murder),
    mean_assault = mean(Assault),
    mean_rape = mean(Rape),
    mean_urbanpop = mean(UrbanPop)
  )

##################################################
# Summarizing (average) attributes, 3 clusters
##################################################
USArrests %>%
  group_by(cluster_3) %>%
  summarise(
    count = n(),
    mean_murder = mean(Murder),
    mean_assault = mean(Assault),
    mean_rape = mean(Rape),
    mean_urbanpop = mean(UrbanPop)
  )

##################################################
# Summarizing (average) attributes, 4 clusters
##################################################
USArrests %>%
  group_by(cluster_4) %>%
  summarise(
    count = n(),
    mean_murder = mean(Murder),
    mean_assault = mean(Assault),
    mean_rape = mean(Rape),
    mean_urbanpop = mean(UrbanPop)
  )
```

Two clusters:

- Cluster "1" has higher average arrests across all crimes
- There is no observable difference in the average urban population %

Three clusters:

- The three clusters appear to be well separated
- Compared to the high (cluster "1") and low (cluster "2") arrests from the run with two clusters, we now have high (cluster "1"), medium (cluster "2"), and low (cluster "3") arrests
- Cluster "3" also has a considerably lower average urban population %

Four clusters:

- The separation is not as clear
- The medium (cluster "2") and low (cluster "3") clusters from the run with three clusters remained unchanged (only renamed to "3" and "4", respectively)
- The high cluster (cluster "1") from the run with three clusters was split into two. Cluster "1" has a considerably lower average urban population %, with lower arrests for rape and assault

Depending on the need behind the cluster analysis, we would select the appropriate clustering solution. Remember, not the one with the highest ASW, but the one that will be more impactful for the business stakeholders (probably a choice between 3 and 4 clusters).
