Who will be taking home Lord Stanley's Cup this year?
In Machine Learning (ML), being able to accurately classify a rare event is incredibly difficult for two reasons:
- The event being predicted doesn't occur often enough to accurately determine the relationships between the predictors and the response variable.
- Splitting the data into training and testing sets is difficult because of the imbalance between the positive and negative response values.
Predicting the winner of the NHL Stanley Cup Playoffs is a prime example of a rare event classification problem because every year only one team wins while 31 others lose. That's a ratio of 97% negative response values to 3% positive response values. To overcome this class imbalance there are two options:
- Randomly reduce the number of negative response values so it's closer to the number of positive response values.
- Replicate the positive response values equally so the class balance is closer.
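For contrast, option one (randomly undersampling the negative rows) could be sketched in base R like this, using a hypothetical data frame with a Winner column; option two, which I actually use below, replicates the winning rows instead:

```r
set.seed(1)

# Hypothetical example: 2 winners and 60 losers
toy <- data.frame(Winner = c(rep(1, 2), rep(0, 60)))

winners <- toy[toy$Winner == 1, , drop = FALSE]
losers  <- toy[toy$Winner == 0, , drop = FALSE]

# Option one: keep only a random subset of losers (here 10 per winner)
sampled_losers <- losers[sample(nrow(losers), 10 * nrow(winners)), , drop = FALSE]
balanced <- rbind(winners, sampled_losers)
table(balanced$Winner)
```

Undersampling throws away rows, which is why it only makes sense when there is data to spare.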
In this article I'll go with option two, as I don't have enough data to remove any rows. The data I'll use to predict the Stanley Cup Playoffs is from Natural Stat Trick, an amazing site full of endless NHL stats. I'll use 64 advanced NHL statistics from the 2007/2008 season through the 2021/2022 season to build the models and predict the Stanley Cup winner based on the stats from the 2022/2023 season. The advanced statistics' definitions can be found here:
Because we're using 64 predictors, not only is this a rare event classification model, it's also a high-dimensional dataset, which causes its own problems. In previous articles I discussed the benefits of using simple, explainable models to create accurate predictions that end users can understand. In this analysis I'll use Least Absolute Shrinkage and Selection Operator (LASSO) and Decision Tree (DT) models, as they can reduce the number of predictors and produce explainable and hopefully accurate models.
# Load Libraries #
library(dplyr)
library(tidyr)
library(glmnet)
library(caret)
library(rpart)

# Read in Data #
NHL_data <- read.csv("nhl_season_data.csv")
head(NHL_data)
dim(NHL_data)
# Replicate Winning Rows #
winners <- NHL_data %>%
filter(Winner >= 1) # Filter the winners
losers <- NHL_data %>%
filter(Winner <= 0) # Filter the losers
duplicated_winners <- winners[rep(seq_len(nrow(winners)), each = 29), ] # Replicate each winner 29 times
prepped_data <- rbind(duplicated_winners, losers)
# Remove Unnecessary Data #
df <- prepped_data[,-c(1,2,3,5,6,7,8,9)]
dim(df)
# Arrange Training and Testing Data #
set.seed(1) # Set seed so the same sample can be reproduced in the future
# Now selecting 67% of the data as a sample from the total 'n' rows
sample <- sample.int(n = nrow(df), size = floor(.67*nrow(df)), replace = F)

# Training data #
train_Data <- df[sample, ]
x.train<-as.matrix(train_Data[,-65])
y.train<-as.numeric(unlist(train_Data[,65]))
# Test data #
test_Data <- df[-sample, ]
x.test<-(test_Data[,-65])
y.test<-as.numeric(unlist(test_Data[,65]))
# LASSO #
# Perform k-fold cross-validation to find the optimal lambda value #
lambda_model <- cv.glmnet(x.train, y.train, alpha = 1, family = "binomial")

# Find the optimal lambda value that minimizes test MSE #
best_lambda <- lambda_model$lambda.min
# produce plot of test MSE by lambda value #
plot(lambda_model)
# find coefficients of best model #
LASSO_model <- glmnet(x.train, y.train, alpha = 1, lambda = best_lambda, family = "binomial")
coef(LASSO_model)
Of the 64 original predictors, LASSO "shrunk" 35 of them to zero, leaving only 29. For the LASSO prediction, any predicted probability of winning the Stanley Cup > 0.78 was considered a win and anything ≤ 0.78 was considered a loss. So how well does this model perform on the test data?:
# Predict on Test Data #
LASSO_model_predict <- predict(LASSO_model, as.matrix(x.test), type = "response")
lasso_accuracy <- cbind(LASSO_model_predict, y.test)
The LASSO model correctly predicted that all 144 Stanley Cup winners in the test data would win. Anytime you see 100% accuracy it should raise concerns, but the reason it occurred here is that we artificially inflated the number of Stanley Cup champions, which meant the winning rows in the test data had already been seen and trained on in the training data. The LASSO model correctly predicted that 127 of the 146 losers (87%) in the test data would lose, for an overall accuracy of 93%.
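The accuracy numbers above come from applying the 0.78 cutoff to the predicted probabilities and comparing against the true labels. A minimal, self-contained sketch with made-up probabilities (in the article the real probabilities live in LASSO_model_predict):

```r
# Hypothetical predicted probabilities and true labels
probs  <- c(0.95, 0.85, 0.60, 0.20, 0.81)
y.true <- c(1, 1, 0, 0, 0)

# Apply the 0.78 cutoff: anything above it is a predicted win
pred <- ifelse(probs > 0.78, 1, 0)

# Confusion table and overall accuracy
table(Predicted = pred, Actual = y.true)
accuracy <- mean(pred == y.true)
accuracy # 0.8 here: one loser was misclassified as a winner
```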
# Decision Tree #
# Model Tuning #
hyper_grid <- expand.grid(
minsplit = seq(5, 20, 1),
maxdepth = seq(8, 15, 1)
)

models <- list()
for (i in 1:nrow(hyper_grid)) {
# get minsplit, maxdepth values at row i
minsplit <- hyper_grid$minsplit[i]
maxdepth <- hyper_grid$maxdepth[i]
# train a model and store it in the list
models[[i]] <- rpart(
formula = Winner ~ .,
data = train_Data,
method = "class",
control = list(minsplit = minsplit, maxdepth = maxdepth)
)
}
# function to get optimal cp
get_cp <- function(x) {
min <- which.min(x$cptable[, "xerror"])
cp <- x$cptable[min, "CP"]
}
# function to get minimum error
get_min_error <- function(x) {
min <- which.min(x$cptable[, "xerror"])
xerror <- x$cptable[min, "xerror"]
}
hyper_grid %>%
mutate(
cp = purrr::map_dbl(models, get_cp),
error = purrr::map_dbl(models, get_min_error)
) %>%
arrange(error) %>%
top_n(-5, wt = error)
# Create Model #
tree_model <- rpart(
formula = Winner ~ .,
data = train_Data,
method = "class",
control = list(minsplit = 10, maxdepth = 12, cp = 0.01)
)

# Plot #
plot(tree_model, uniform = TRUE,
essential = "NHL Winner")
text(tree_model, use.n = TRUE, cex = .7)
Of the 64 original predictors, the DT removed 54 of them, leaving only 10. Just like with LASSO, any predicted probability of winning the Stanley Cup > 0.78 was considered a win and anything ≤ 0.78 was considered a loss. So how well does the DT model perform on the test data?:
Like the LASSO model, the DT correctly identified all 144 winners in the test data and 135 of the 146 losers (92%), for a total accuracy of 96%.
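The DT prediction code isn't shown above, but it would mirror the LASSO step: predict class probabilities on the held-out rows and apply the same 0.78 cutoff. Here's a sketch of that pattern using rpart's bundled kyphosis dataset, since the NHL data isn't reproduced here; the real call would use tree_model and test_Data:

```r
library(rpart)

# Fit a small classification tree on rpart's bundled kyphosis data
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# type = "prob" returns one row of class probabilities per observation
probs <- predict(fit, kyphosis, type = "prob")

# Apply a probability cutoff to the positive class, as in the article
pred <- ifelse(probs[, "present"] > 0.78, "present", "absent")
table(Predicted = pred, Actual = kyphosis$Kyphosis)
```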
Since the DT model was more accurate than the LASSO model, we'll use the DT to make the main prediction and the LASSO as the tie-breaker.
DT Full Prediction
# Decision Tree on Full data #
# Model Tuning #
hyper_grid <- expand.grid(
minsplit = seq(5, 20, 1),
maxdepth = seq(8, 15, 1)
)

models <- list()
for (i in 1:nrow(hyper_grid)) {
# get minsplit, maxdepth values at row i
minsplit <- hyper_grid$minsplit[i]
maxdepth <- hyper_grid$maxdepth[i]
# train a model and store it in the list
models[[i]] <- rpart(
formula = Winner ~ .,
data = df,
method = "class",
control = list(minsplit = minsplit, maxdepth = maxdepth)
)
}
# function to get optimal cp
get_cp <- function(x) {
min <- which.min(x$cptable[, "xerror"])
cp <- x$cptable[min, "CP"]
}
# function to get minimum error
get_min_error <- function(x) {
min <- which.min(x$cptable[, "xerror"])
xerror <- x$cptable[min, "xerror"]
}
hyper_grid %>%
mutate(
cp = purrr::map_dbl(models, get_cp),
error = purrr::map_dbl(models, get_min_error)
) %>%
arrange(error) %>%
top_n(-5, wt = error)
# Create Model #
tree_model <- rpart(
formula = Winner ~ .,
data = df,
method = "class",
control = list(minsplit = 15, maxdepth = 14, cp = 0.01)
)

# Plot #
plot(tree_model, uniform = TRUE,
essential = "NHL Winner")
text(tree_model, use.n = TRUE, cex = .7)
DT Model Explainability
The DT model on the full data only used 8 of the 64 potential predictors, fewer than were used during training and testing. With a DT model we start at the top of the plot and move through each question until we reach a terminal node, where 0 = loser and 1 = Stanley Cup winner.
Glossary for the DT Plot
- GF% — Percentage of total Goals in games that team played that are for that team. GF*100/(GF+GA)
- MDCF/60 — Rate of Medium Danger Scoring Chances for that team per 60 minutes of play. MDCF*60/TOI
- GF/60 — Rate of Goals for that team per 60 minutes of play. GF*60/TOI
- SCSV% — Percentage of Scoring Chance shots against that team that weren't Goals. 100-(SCGA*100/SCSA)
- LDGF% — Percentage of total Goals off of Low Danger Scoring Chances in games that team played that are for that team. LDGF*100/(LDGF+LDGA)
- LDGA/60 — Rate of Goals off of Low Danger Scoring Chances against that team per 60 minutes of play. LDGA*60/TOI
- SCGF/60 — Rate of Goals off of Scoring Chances for that team per 60 minutes of play. SCGF*60/TOI
- GA/60 — Rate of Goals against that team per 60 minutes of play. GA*60/TOI
Prediction
# Predict on 2022/2023 Data #
current_data <- read.csv("...202220223season.csv")
decision_tree_predict <- predict(tree_model, as.data.frame(current_data), type = "class") # predict.rpart takes type =, not method =
playoff_prediction <- cbind(decision_tree_predict, current_data)
The DT model predicted that 4 of the 32 NHL teams would win. One positive sign is that each of those 4 teams is actually in the playoffs, two from the East and two from the West. These will be our predicted winning teams, and for the tie-breaker we'll use the slightly less accurate LASSO model.
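Pulling those predicted winners out of the combined frame is a one-line filter. A sketch with a hypothetical frame shaped like playoff_prediction (the team names are placeholders, not the model's picks):

```r
# Hypothetical predicted classes joined to team rows, like playoff_prediction
playoff_prediction <- data.frame(
  pred_class = c("1", "0", "1", "0"),
  Team = c("Team A", "Team B", "Team C", "Team D")
)

# Keep only the rows the tree predicted as Stanley Cup winners
predicted_winners <- subset(playoff_prediction, pred_class == "1")
predicted_winners$Team # "Team A" "Team C"
```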
LASSO Full Prediction
# Isolate x and y values on Historical Data #
x <- as.matrix(df[,-65])
y <- as.numeric(unlist(df[,65]))

# LASSO on Full Data #
# Perform k-fold cross-validation to find the optimal lambda value #
lambda_model <- cv.glmnet(x, y, alpha = 1, family = "binomial")
# find optimal lambda value that minimizes test MSE #
best_lambda <- lambda_model$lambda.min
# produce plot of MSE by lambda value #
plot(lambda_model)
# find coefficients of best model #
LASSO_model <- glmnet(x, y, alpha = 1, lambda = best_lambda, family = "binomial")# Model Explainability #
coefficients(LASSO_model)
exp(coefficients(LASSO_model))
LASSO Model Explainability
The LASSO model on the full dataset (not training or testing) shrunk 24 of the 64 variables, leaving 40. To understand the LASSO model's predictions, for each variable in the list above we treat exp(coefficient) - 1 as how much that team's odds of winning the Stanley Cup increase or decrease as that variable increases by one. For example, as CF/60 (the third variable) increases by one, the odds of winning the Stanley Cup increase by 122% (2.22 - 1 = 1.22). If we look at the sixth variable from the top, FF/60, as it increases by 1 the odds of winning the Stanley Cup decrease by 92% (1 - 0.08 = 0.92). Finally, the eighth variable down, FF%, shows an exp(coefficient) of 1, which means a 0% effect on the odds of winning the Stanley Cup (1 - 1 = 0); in other words, this variable's coefficient was "shrunk" to zero.
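The odds arithmetic in that paragraph can be checked directly: exponentiate each coefficient and subtract one to get the percentage change in odds per one-unit increase. A sketch with the three exp(coefficient) values discussed above (taken from the text, so approximate):

```r
# exp(coefficient) values discussed in the text: 2.22, 0.08, and 1
exp_coefs <- c(`CF/60` = 2.22, `FF/60` = 0.08, `FF%` = 1.00)

# Percentage change in the odds of winning per one-unit increase
odds_change_pct <- (exp_coefs - 1) * 100
odds_change_pct
# CF/60: +122%, FF/60: -92%, FF%: 0% (coefficient shrunk to zero)
```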
# Predict on 2022/2023 Data #
x <- current_data[-c(1,2,3,5,6,7,8,9)] #Remove unneeded data
LASSO_model_predict <- predict(LASSO_model, as.matrix(x), type = "response")
lasso_23 <- cbind(LASSO_model_predict, current_data)
The LASSO model predicts the Boston Bruins, who had a historic season, would win, followed by the Dallas Stars, before a large drop in prediction confidence. So what does all of this mean?
If I were gambling, I'd place futures bets to win the Stanley Cup on:
- Dallas Stars (+1500)
- Colorado Avalanche (+650)
- New York Rangers (+1200)
- Tampa Bay Lightning (+1400)
In an NHL playoff pool you rank each team from 16 to 1, and as each team wins a series it earns the corresponding points. So, for example, if a team is ranked 16 you'll get 16 points per series win.
In Data Analytics and Data Science, business knowledge is as important as what the data says. Unfortunately, I have watched essentially no NHL this year and my "business knowledge" of the NHL is zilch. However, I wanted to join an NHL Pool and Bracket Challenge, so I needed a way to make data-centric picks based on science.
I used a DT model to predict the Stanley Cup winner, which picked the Dallas Stars, Colorado Avalanche, New York Rangers, and the Tampa Bay Lightning. For the tie-breaker I used the less accurate LASSO model, which selected the Dallas Stars. I then used the DT and LASSO models together to fill out the bracket and NHL pool picks.