Predicting the NBA Champion with Machine Learning


Every NBA season, 30 teams compete for something only one will achieve: the legacy of a championship. From power rankings to trade deadline chaos and injuries, fans and analysts alike speculate endlessly about who will raise the Larry O’Brien Trophy.

But what if we could go beyond the hot takes and predictions, and use data and machine learning to forecast the NBA Champion at the end of the regular season?

In this article, I’ll walk through this process, from gathering and preparing the data, to training and evaluating the model, and finally using it to make predictions for the upcoming 2024–25 Playoffs. Along the way, I’ll highlight some of the most interesting insights that emerged from the analysis.

All of the code and data used can be found on GitHub.


Understanding the problem

Before diving into model training, an important step in any machine learning project is understanding the problem:
What question are we trying to answer, and what data (and model) can help us get there?

In this case, the question is simple: Who is going to be the NBA Champion?

A natural first idea is to frame this as a classification problem: each team in each season is labeled as either champion or not champion.

But there’s a catch. There’s only one champion per year (obviously).

So if we pull data from the last 40 seasons, we’d have 40 positive examples… and hundreds of negative ones. That lack of positive samples makes it extremely hard for a model to learn meaningful patterns: winning an NBA title is such a rare event, and we simply don’t have enough historical data (we’re not working with 20,000 seasons) for any classification model to truly understand what separates champions from the rest.

We need a smarter way to frame the problem.

To help the model understand what makes a champion, it’s useful to also teach it what makes an almost-champion, and how that differs from a team that was knocked out in the first round. In other words, we want the model to learn degrees of success in the playoffs, rather than a simple yes/no outcome.

This led me to the concept of Champion Share: the proportion of playoff wins a team achieved out of the total needed to win the title.

From 2003 onward, it takes 16 wins to become an NBA Champion. However, between 1984 and 2002, the first round was a best-of-five series, so the total required during that period was 15 wins.

A team that loses in the first round might have 0 or 1 win (Champion Share = 1/16), while a team that reaches the Finals but loses might have 14 wins (Champion Share = 14/16). The Champion has a full share of 1.0.
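As a quick illustration (the function below is just a sketch, not the exact code from the project), the target can be computed directly from a team’s playoff win count and the season:

```python
def champion_share(playoff_wins: int, season_end_year: int) -> float:
    """Fraction of the wins needed for a title that a team actually achieved."""
    # Best-of-five first round through 2002: 15 wins needed for the title;
    # from 2003 onward, every round is best-of-seven: 16 wins needed.
    wins_needed = 15 if season_end_year <= 2002 else 16
    return playoff_wins / wins_needed

print(champion_share(1, 2021))   # first-round exit -> 0.0625
print(champion_share(14, 2021))  # Finals loss      -> 0.875
print(champion_share(16, 2021))  # NBA Champion     -> 1.0
```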

Example of a playoff bracket, from the 2021 Playoffs

This reframes the task as a regression problem, where the model predicts a continuous value between 0 and 1, representing how close each team came to winning it all.

In this setup, the team with the highest predicted value is our model’s pick for the NBA Champion.
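In code, that selection step is trivial. A minimal sketch with pandas, using made-up numbers and hypothetical column names (`season`, `team`, `pred_share`):

```python
import pandas as pd

# Hypothetical model output: one row per team per season
preds = pd.DataFrame({
    "season": [2024, 2024, 2024],
    "team": ["BOS", "DAL", "OKC"],
    "pred_share": [0.61, 0.34, 0.45],
})

# The predicted champion of each season is the team with the highest predicted value
picks = preds.loc[preds.groupby("season")["pred_share"].idxmax()]
print(picks)
```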

This is the same approach I used for the MVP prediction in my previous article.

Data

Basketball, and the NBA in particular, is one of the most exciting sports to work with in data science, thanks to the amount of freely available statistics. For this project, I gathered data from Basketball Reference using my Python package BRScraper, which allows easy access to player and team data. All data collection was done in accordance with the website’s guidelines and rate limits.

The data used includes team-level statistics and final regular season standings (e.g., win percentage, seeding), as well as player-level statistics for each team (limited to players who appeared in at least 30 games) and historical playoff performance indicators.

However, it’s important to be careful when working with raw, absolute values. For example, the average points per game (PPG) in the 2023–24 season was 114.2, while in 2000–01 it was 94.8, an increase of nearly 20%.

This is due to a number of factors, but the fact is that the game has changed significantly over the years, and so have the metrics derived from it.

Evolution of some per-game NBA statistics (Image by Author)

To account for this shift, the approach here avoids using absolute statistics directly, opting instead for normalized, relative metrics. For example:

  • Instead of a team’s PPG, you can use their rank in that season.
  • Instead of counting how many players average 20+ PPG, you can consider how many are in the top 10 in scoring, and so on.

This allows the model to capture relative dominance within each era, making comparisons across decades more meaningful and allowing older seasons to be included to enrich the dataset.
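As a minimal sketch of this idea with pandas (team names and numbers are illustrative), each raw statistic is replaced by its rank within the same season, so teams are only compared to their contemporaries:

```python
import pandas as pd

# Toy team-level data from two very different scoring eras
teams = pd.DataFrame({
    "season": [2001, 2001, 2001, 2024, 2024, 2024],
    "team":   ["LAL", "SAS", "PHI", "BOS", "OKC", "DEN"],
    "ppg":    [100.6, 96.2, 94.7, 120.6, 118.3, 114.9],
})

# Rank within each season (1 = highest-scoring team) instead of using the raw PPG
teams["ppg_rank"] = teams.groupby("season")["ppg"].rank(ascending=False, method="min")
print(teams)
```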

Data from the 1984 to 2024 seasons was used to train and test the model, totaling 40 seasons and 70 variables.

Before diving into the model itself, some interesting patterns emerge from an exploratory analysis comparing championship teams to all playoff teams as a whole:

Comparison of teams: Champions vs. rest of the playoff teams (Image by Author)

Unsurprisingly, champions tend to come from the top seeds and post higher winning percentages. The team with the worst regular season record to win it all in this period was the 1994–95 Houston Rockets, led by Hakeem Olajuwon, who finished 47–35 (.573) and entered the playoffs as only the 10th-best team overall (6th in the West).

Another notable trend is that champions tend to have a slightly higher average age, suggesting that experience plays an important role once the playoffs begin. The youngest championship team in the dataset, with an average age of 26.6 years, is the 1990–91 Chicago Bulls, and the oldest is the 1997–98 Chicago Bulls, at 31.2 years: the first and last titles of the Michael Jordan dynasty.

Similarly, teams whose coaches have been with the franchise longer also tend to find more success in the postseason.

Modeling

The model used was LightGBM, a tree-based algorithm widely regarded as one of the most effective methods for tabular data, alongside others like XGBoost. A grid search was performed to identify the best hyperparameters for this specific problem.
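A minimal sketch of that setup, assuming `X_train` and `y_train` hold the team features and the Champion Share target (the grid values below are illustrative, not the ones actually searched):

```python
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

# LightGBM regressor on the Champion Share target
base_model = lgb.LGBMRegressor(objective="regression", random_state=42)

# Illustrative hyperparameter grid
param_grid = {
    "num_leaves": [15, 31, 63],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [200, 500, 1000],
}

search = GridSearchCV(base_model, param_grid, scoring="neg_root_mean_squared_error", cv=5)
search.fit(X_train, y_train)
model = search.best_estimator_
```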

The model’s performance was evaluated using the root mean squared error (RMSE) and the coefficient of determination (R²).

You can find the formula and explanation of each metric in my previous MVP article.
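For completeness, these are the standard definitions, where $y_i$ are the true Champion Share values, $\hat{y}_i$ the predictions, and $\bar{y}$ the mean of the true values:

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$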

The seasons used for training and testing were randomly chosen, with the constraint of reserving the last three seasons for the test set in order to better assess the model’s performance on recent data. Importantly, all teams were included in the dataset, not just those that qualified for the playoffs, allowing the model to learn patterns without relying on prior knowledge of postseason qualification.
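A sketch of how such a split might look, assuming a DataFrame `df` with a `season` column (the number of extra randomly held-out seasons is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

seasons = sorted(df["season"].unique())
forced_test = set(seasons[-3:])                     # always reserve the last three seasons
earlier = [s for s in seasons if s not in forced_test]
random_test = set(rng.choice(earlier, size=5, replace=False))  # plus a few random earlier ones
test_seasons = forced_test | random_test

train = df[~df["season"].isin(test_seasons)]
test = df[df["season"].isin(test_seasons)]
```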

Results

Here we can see a comparison between the “distributions” of the predictions and the actual values. While it’s technically a histogram, since we’re dealing with a regression problem, it still works as a visual distribution because the target values range from 0 to 1. Additionally, we display the distribution of the residual error for each prediction.

(Image by Author)

As we can see, the predictions and the actual values follow a similar pattern, both concentrated near zero, as most teams don’t achieve high playoff success. This is further supported by the distribution of the residual errors, which is centered around zero and resembles a normal distribution. This indicates that the model is able to capture and reproduce the underlying patterns present in the data.
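For reference, a chart like the one above can be put together with a couple of histograms, continuing the hypothetical `model`, `test`, and `features` objects from the earlier sketches (`champ_share` is an assumed column name):

```python
import matplotlib.pyplot as plt

preds = model.predict(test[features])
residuals = test["champ_share"] - preds

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Predicted vs. actual Champion Share on the same 0-1 scale
axes[0].hist(test["champ_share"], bins=20, alpha=0.6, label="Actual")
axes[0].hist(preds, bins=20, alpha=0.6, label="Predicted")
axes[0].set_xlabel("Champion Share")
axes[0].legend()

# Residuals, expected to be centered around zero
axes[1].hist(residuals, bins=20)
axes[1].set_xlabel("Residual (actual - predicted)")

plt.tight_layout()
plt.show()
```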

In terms of performance metrics, the best model achieved an RMSE of 0.184 and an R² score of 0.537 on the test dataset.

An effective way to visualize the key variables influencing the model’s predictions is through SHAP values, a technique that provides a reasonable explanation of how each feature impacts the model’s output.

Again, a deeper explanation of SHAP and how to interpret its chart can be found in Predicting the NBA MVP with Machine Learning.
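A minimal sketch of how a chart like this is typically generated with the shap library, reusing the hypothetical `model`, `test`, and `features` from the sketches above:

```python
import shap

# TreeExplainer works natively with tree-based models such as LightGBM
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(test[features])

# Beeswarm summary plot: one dot per team-season, colored by feature value
shap.summary_plot(shap_values, test[features])
```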

SHAP chart (Image by Author)

From the SHAP chart, several important insights emerge:

  • Seed and W/L% rank among the top three most impactful features, highlighting the importance of regular-season performance.
  • Team-level stats such as Net Rating (NRtg), Opponent Points Per Game (PA/G), Margin of Victory (MOV) and Adjusted Offensive Rating (ORtg/A) also play a significant role in shaping playoff success.
  • On the player side, advanced metrics stand out: the number of players in the top 30 for Box Plus/Minus (BPM) and in the top 3 for Win Shares per 48 Minutes (WS/48) are among the most influential.

Interestingly, the model also captures broader trends: teams with a higher average age tend to perform better in the playoffs, and a strong showing in the previous postseason often correlates with future success. Both patterns point back to experience as a valuable asset in the pursuit of a championship.

Let’s now take a closer look at how the model performed in predicting the last three NBA champions:

Predictions for the last three years (Image by Author)

The model correctly predicted two of the last three NBA champions. The only miss was in 2023, when it favored the Milwaukee Bucks. That season, Milwaukee had the best regular-season record at 58–24 (.707), but an injury to Giannis Antetokounmpo hurt their playoff run. The Bucks were eliminated 4–1 in the first round by the Miami Heat, who went on to reach the Finals, a surprising and disappointing postseason exit for a team that had claimed the championship just two years earlier.

2025 Playoffs Predictions

For the upcoming 2025 playoffs, the model predicts the Boston Celtics to go back-to-back, with OKC and Cleveland close behind.

Given their strong regular season (61–21, 2nd seed in the East) and the fact that they are the reigning champions, I tend to agree. They combine current performance with recent playoff success.

Still, as we all know, anything can happen in sports, and we’ll only get the real answer by the end of June.

(Photo by Richard Burlton on Unsplash)

Conclusions

This project demonstrates how machine learning can be applied to complex, dynamic environments like sports. Using a dataset spanning four decades of basketball history, the model was able to uncover meaningful patterns in what drives playoff success. Beyond prediction, tools like SHAP allowed us to interpret the model’s decisions and better understand the factors that contribute to postseason success.

One of the biggest challenges in this problem is accounting for injuries. They can completely reshape the playoff landscape, particularly when they affect star players during the playoffs or late in the regular season. Ideally, we could incorporate injury histories and availability data to better account for this. Unfortunately, consistent and structured open data on this subject, especially at the granularity needed for modeling, is hard to come by. As a result, this remains one of the model’s blind spots: it treats all teams as if they were at full strength, which is often not the case.

While no model can perfectly predict the chaos and unpredictability of sports, this analysis shows that data-driven approaches can get close. As the 2025 playoffs unfold, it will be exciting to see how the predictions hold up, and what surprises the game still has in store.

(Photo by Tim Hart on Unsplash)

I’m always available on my channels (LinkedIn and GitHub).

Thanks for your attention! 👏

Gabriel Speranza Pastorello
