Every March, millions gather to fill out brackets and compete with family, friends, and coworkers in the madness of college playoff basketball. Even as a data scientist, I still relied on emotional ties to schools, specifically in the Carolinas. With limited knowledge of NCAA basketball, I might pick the higher seed, flip a coin, or sometimes pick a team based on my favorite mascot. Year after year, for as long as I can remember, this strategy has failed me: I have never won or even finished in the top 3…until now.
Finally deciding to put my data science and engineering skills to the test alongside my friend and coworker Tyler White, we launched into a journey that took us down a rabbit hole of feature engineering tricks to manipulate and prepare data from 2003 to the present, provided to us via Kaggle.
We’re going to take you down that journey as we show you how we were able to achieve ~73% accuracy and go into the 2023 Final Four games with 520 points and a possible 680 points, thanks to some help from the Huskies on Saturday. The Hex notebook can be found at the bottom of the article, and we have chosen a few code examples from it to show here.
Data Ingestion
Kaggle provides a public API that enabled us to quickly download the dataset we needed for this. After downloading the data, we could ingest the individual CSVs into Snowflake using the following code.
import zipfile

import pandas as pd

# Extract each CSV from the Kaggle archive and load it into Snowflake as its own table.
with zipfile.ZipFile("march-machine-learning-mania-2023.zip") as zf:
    for file in zf.filelist:
        if file.filename.endswith(".csv"):
            with zf.open(file.filename) as z:
                df = pd.read_csv(z, encoding="iso-8859-1")
                table_name = file.filename.split("/")[-1].replace(".csv", "").upper()
                df.columns = [col.upper() for col in df.columns]
                session.create_dataframe(df).write.save_as_table(
                    table_name=table_name, mode="overwrite"
                )
This solution grabs each CSV from within the compressed file and reads it using pandas. We applied a step to grab the filename stem, which becomes the table name, and uppercased each column to make working with the object identifiers easier in Snowflake.
Feature Engineering
Aggregations
After ingesting the data into Snowflake, it’s time to start cleaning. The data provided was on a game-by-game basis, so our first task was defining a function we could use to handle aggregations from a source table and write them to a target table. After running the season stats for each NCAA game played through our function, we have the average stats for each team since 2003, along with a separate table for tournament aggregates, as shown below.
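To give a sense of the shape of that function, here is a minimal sketch assuming the uppercased Kaggle table and column names (WSCORE, WFGM, etc.); it is not the exact function from our notebook, and it only averages a couple of stats for brevity.

from snowflake.snowpark import Session
import snowflake.snowpark.functions as F

def aggregate_season_stats(session: Session, source_table: str, target_table: str) -> str:
    df = session.table(source_table)
    # Each detailed-results row carries winner (W*) and loser (L*) stats,
    # so stack them into one row per team per game before averaging.
    winners = df.select(
        "SEASON", F.col("WTEAMID").alias("TEAMID"),
        F.col("WSCORE").alias("SCORE"), F.col("WFGM").alias("FGM"),
    )
    losers = df.select(
        "SEASON", F.col("LTEAMID").alias("TEAMID"),
        F.col("LSCORE").alias("SCORE"), F.col("LFGM").alias("FGM"),
    )
    per_team_games = winners.union_all(losers)
    per_team_games.group_by("SEASON", "TEAMID").agg(
        F.avg("SCORE").alias("AVG_SCORE"), F.avg("FGM").alias("AVG_FGM")
    ).write.save_as_table(target_table, mode="overwrite")
    return f"Successfully created {target_table}."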
Win Locations
Next, we calculated each team’s wins based on how often they won at home, away, or on a neutral court, as sketched below.
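A rough sketch of how those counts can be derived from the compact results, where WLOC marks whether the winner was at home (H), away (A), or on a neutral court (N); the RAW.MREGULARSEASONCOMPACTRESULTS table name is an assumption based on our ingestion step rather than a quote from our notebook.

win_locations = (
    session.table("RAW.MREGULARSEASONCOMPACTRESULTS")
    .group_by("SEASON", "WTEAMID")
    .agg(
        # Count wins at each location for every team and season.
        F.sum(F.when(F.col("WLOC") == "H", 1).otherwise(0)).alias("HOME_WINS"),
        F.sum(F.when(F.col("WLOC") == "A", 1).otherwise(0)).alias("AWAY_WINS"),
        F.sum(F.when(F.col("WLOC") == "N", 1).otherwise(0)).alias("NEUTRAL_WINS"),
    )
    .with_column_renamed("WTEAMID", "TEAMID")
)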
Region and Seed Values
With the way regions were presented in the dataset for tournaments, we had to flatten and identify a common region and then combine it with the seed value of each team in previous tournaments to get a consistent region across the years, as well as their seeds.
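The seed string itself encodes both pieces (e.g. "W01" is region W, seed 1), so the parsing step looks roughly like the sketch below; the table name and the join to the season-level region names are assumptions about the Kaggle schema, not our exact code.

import snowflake.snowpark.types as T

seeds = (
    session.table("RAW.MNCAATOURNEYSEEDS")
    # First character is the region letter, the next two are the numeric seed.
    .with_column("REGION_LETTER", F.substring(F.col("SEED"), 1, 1))
    .with_column("SEED_NUM", F.cast(F.substring(F.col("SEED"), 2, 2), T.IntegerType()))
)
# The region letter can then be joined to the named regions per season
# (e.g. the REGIONW/REGIONX/... columns in MSEASONS) for a consistent region.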
Coach Tenure
Another feature we thought might impact a team’s performance was coach tenure. One trick here was that some teams had multiple coaches in a single season. We leveraged window functions to account for this, and another to count up tenure, resulting in the following DataFrame.
import snowflake.snowpark
from snowflake.snowpark import Window
import snowflake.snowpark.functions as F

def extract_coaches_tenure(
    session: snowflake.snowpark.Session,
    source_table: str,
    target_table: str,
    exclude_seasons: list,
) -> str:
    df = session.table(source_table)
    if exclude_seasons:
        df = df.filter(~F.col("SEASON").isin(exclude_seasons))
    # Some teams have multiple coaches in a season. We use a window function
    # to keep only the coach on the bench at the end of the season, and
    # another to count up that coach's tenure with the team.
    coaches_tenure = (
        df.with_column(
            "ROW_NUM",
            F.row_number().over(
                Window.partition_by(["SEASON", "TEAMID"]).order_by(
                    F.col("LASTDAYNUM").desc()
                )
            ),
        )
        .filter(F.col("ROW_NUM") == 1)
        .select("SEASON", "TEAMID", "COACHNAME")
        .with_column(
            "COACH_TENURE",
            F.row_number().over(
                Window.partition_by(["TEAMID", "COACHNAME"]).order_by(F.col("SEASON"))
            ),
        )
    )
    coaches_tenure.write.save_as_table(target_table, mode="overwrite")
    return f"Successfully created {target_table}."
Best Tournament Finish
Since there may be a relationship between the season stats and how far a team has made it in that year’s tournament, we created a table that showcases how far each team made it in each year of the tournament. Teams who made it past the first round will have multiple values for a tournament in that year, therefore we used another window function to get the team’s maximum progress. This feature can be used as a way to tell whether a program has historically done well over the last 20 years.
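A rough sketch of that step, using the last day number on which a team appears in the tournament results as a proxy for how far it advanced; the table name is assumed from our ingestion step.

tourney = session.table("RAW.MNCAATOURNEYCOMPACTRESULTS")
# A team appears in the results as either the winner or the loser of each game it plays.
appearances = tourney.select(
    "SEASON", F.col("WTEAMID").alias("TEAMID"), "DAYNUM"
).union_all(
    tourney.select("SEASON", F.col("LTEAMID").alias("TEAMID"), "DAYNUM")
)
best_finish = (
    appearances.with_column(
        "MAX_DAYNUM", F.max("DAYNUM").over(Window.partition_by("SEASON", "TEAMID"))
    )
    .filter(F.col("DAYNUM") == F.col("MAX_DAYNUM"))
    .drop("DAYNUM")
    .distinct()
)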
Win/Loss Statistics
Another set of features we could engineer came from team records. First was their win percentage, which was not in the original dataset since that only listed each game played; therefore, we had to calculate how many games each team won out of the games they played and take the percentage won. Second was their winning point margin and losing point margin. These features told us, when a team won, how much they won by on average, and vice versa for losses.
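Here is a minimal sketch of those record features, again assuming the uppercased Kaggle compact-results table; our actual notebook splits this across several steps.

results = session.table("RAW.MREGULARSEASONCOMPACTRESULTS")
# Wins and average winning margin per team and season.
wins = (
    results.group_by("SEASON", "WTEAMID")
    .agg(
        F.count(F.lit(1)).alias("WINS"),
        F.avg(F.col("WSCORE") - F.col("LSCORE")).alias("AVG_WIN_MARGIN"),
    )
    .with_column_renamed("WTEAMID", "TEAMID")
)
# Losses and average losing margin per team and season.
losses = (
    results.group_by("SEASON", "LTEAMID")
    .agg(
        F.count(F.lit(1)).alias("LOSSES"),
        F.avg(F.col("WSCORE") - F.col("LSCORE")).alias("AVG_LOSS_MARGIN"),
    )
    .with_column_renamed("LTEAMID", "TEAMID")
)
records = (
    wins.join(losses, on=["SEASON", "TEAMID"], how="outer")
    .na.fill({"WINS": 0, "LOSSES": 0})
    .with_column("WIN_PCT", F.col("WINS") / (F.col("WINS") + F.col("LOSSES")))
)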
Now that we have engineered some extra features, we need to put them all together, which sounds easy, but this is where we had to be very careful in structuring our data. We wrote a helper function to make this easier by reducing how much we have to manually type, renaming all columns besides TEAMID and SEASON. These columns are prefixed with “R_AVG” to indicate the regular season averages from the function created earlier, with a similar prefix, “T_AVG”, applied for tournament games. After some column renaming, we joined up the win/loss data, roughly as sketched below.
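A minimal version of that helper might look like this; the FEATURES table names are placeholders, while the prefixes match the convention described above.

def prefix_columns(df, prefix, keep=("TEAMID", "SEASON")):
    # Rename every column except the join keys so the regular season ("R_AVG")
    # and tournament ("T_AVG") averages can sit side by side after a join.
    for c in df.columns:
        if c not in keep:
            df = df.with_column_renamed(c, f"{prefix}_{c}")
    return df

season_avgs = prefix_columns(session.table("FEATURES.SEASON_AVERAGES"), "R_AVG")
tourney_avgs = prefix_columns(session.table("FEATURES.TOURNEY_AVERAGES"), "T_AVG")
combined = season_avgs.join(tourney_avgs, on=["SEASON", "TEAMID"], how="left")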
With season and tourney stats together, we can now add features that have to do with the overall team, such as the seed info table we created, coach tenure, tournament progress, and conference data.
With our features table complete, we must prep it for our model. Since we will predict the probability of team one winning, we need to transform our categorical features using pandas get_dummies, as you can see here with conference and tournament finish.
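Something along these lines, where the joined features table and the categorical column names (CONFABBREV, BEST_FINISH) are assumptions about our schema:

import pandas as pd

# One-hot encode the categorical columns before training.
features_pd = session.table("FEATURES.JOINED_FEATURES").to_pandas()
features_pd = pd.get_dummies(features_pd, columns=["CONFABBREV", "BEST_FINISH"])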
We also have to create a new column indicating a win, which means we will have to flip everything as well so that our TEAM1 is not the winning team every time.
Finally! Time to train our model. We split our train and test data into games played before 2020 and after 2020. You’ll notice we skipped 2020 since the tournament was not played due to Covid, and we just ran a baseline XGBClassifier.
import xgboost as xgb
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

train_df = df_pd[df_pd["SEASON"] < 2020]
test_df = df_pd[df_pd["SEASON"] > 2020]
X_train = train_df.drop("WIN_INDICATOR", axis=1)
y_train = train_df["WIN_INDICATOR"]
X_test = test_df.drop("WIN_INDICATOR", axis=1)
y_test = test_df["WIN_INDICATOR"]
model = xgb.XGBClassifier(n_estimators=2000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
And there it is: 73% accuracy, leading to one of the best finishes in our bracket challenges thus far (for me at least…thanks UCONN).
After training in Hex and confirming everything runs smoothly, we can now train and store this model in Snowflake using just an X-Small Warehouse.
def tourney_predict(session: Session) -> str:
    df = session.table("FEATURES.JOINED_FEATURES_TO_TOURNEY_TEAMS")
    # Build a mirrored copy of the data with the W*/L* column prefixes swapped
    # so that TEAM1 is not always the winning team.
    new_cols = {}
    for c in df.columns:
        if c.startswith("W"):
            new_cols[c] = "L" + c[1:]
        elif c.startswith("L"):
            new_cols[c] = "W" + c[1:]
        else:
            new_cols[c] = c
    df_flipped = df.select(
        [F.col(c).alias(new_cols.get(c, c)) for c in df.columns]
    ).select(*[col for col in df.columns])
    df = df.with_column("WIN_INDICATOR", F.lit(1))
    df_flipped = df_flipped.with_column("WIN_INDICATOR", F.lit(0))
    df = df.union_all(df_flipped)
    df_pd = df.to_pandas()
    # to_pandas can return unsigned integer dtypes; convert them to signed.
    new_dtypes = {}
    for c in df_pd.columns:
        if str(df_pd[c].dtype).startswith("u"):
            new_dtypes[c] = str(df_pd[c].dtype).replace("u", "")
        else:
            new_dtypes[c] = str(df_pd[c].dtype)
    df_pd = df_pd.astype(new_dtypes)
    train_df = df_pd[df_pd["SEASON"] < 2020]
    test_df = df_pd[df_pd["SEASON"] > 2020]
    X_train = train_df.drop("WIN_INDICATOR", axis=1)
    y_train = train_df["WIN_INDICATOR"]
    X_test = test_df.drop("WIN_INDICATOR", axis=1)
    y_test = test_df["WIN_INDICATOR"]
    model = xgb.XGBClassifier(n_estimators=1000)
    model.fit(X_train, y_train)
    return "success"
session.sproc.register(func=tourney_predict,
name="COMMON.TOURNEY_PREDICT",
packages=['snowflake-snowpark-python','pandas','xgboost'],
is_permanent=True,
stage_location="@COMMON.PYTHON_CODE",
replace=True)
Clean & Prep 2023 Data
Now that we have our model, we need to feature engineer the 2023 data so we can prepare it for model inference.
from copy import copy

m_teams_2023 = (
    session.table("RAW.MTEAMCONFERENCES")
    .filter(F.col("SEASON") == 2023)
    .drop("CONFABBREV")
)
w_teams_2023 = (
    session.table("RAW.WTEAMCONFERENCES")
    .filter(F.col("SEASON") == 2023)
    .drop("CONFABBREV")
)
m_teams_2023 = m_teams_2023.join(
    copy(m_teams_2023), how="cross", lsuffix="_L", rsuffix="_R"
).filter(F.col("TEAMID_L") < F.col("TEAMID_R"))
w_teams_2023 = w_teams_2023.join(
    copy(w_teams_2023), how="cross", lsuffix="_L", rsuffix="_R"
).filter(F.col("TEAMID_L") < F.col("TEAMID_R"))
teams_2023 = m_teams_2023.union_all(w_teams_2023)
teams_2023 = teams_2023.with_column(
    "ID",
    F.concat(
        F.col("SEASON_L"), F.lit("_"), F.col("TEAMID_L"), F.lit("_"), F.col("TEAMID_R")
    ),
)
For games, we had to produce every possible game that could be played in the NCAA, leading us to 65,703 game possibilities. The following screenshot shows team 1181 (Duke) and some possible matchups.
Perform Inference for Winning Probabilities
Since our previous model only included data up to 2019 because we needed a test set, we took a bit of a gamble and trained on all data up to 2022 for our bracket predictions (which seems to have paid off).
train_df = df_pd[df_pd["SEASON"] <= 2022]

X_train = train_df.drop(["SEASON", "WIN_INDICATOR"], axis=1)
y_train = train_df["WIN_INDICATOR"]
model_2022 = xgb.XGBClassifier(n_estimators=1000)
model_2022.fit(X_train, y_train)
X_pred = JOINED_FEATURES_TO_TOURNEY_TEAMS_pd.drop(["ID"], axis=1)
y_pred = model_2022.predict_proba(X_pred)[:, 1]
submission_df = pd.DataFrame(JOINED_FEATURES_TO_TOURNEY_TEAMS_pd["ID"])
submission_df["Pred"] = y_pred
session.create_dataframe(submission_df).write.save_as_table(
    "FEATURES.M_SUBMISSION_2023", mode="overwrite"
)
split_pred = (
    session.table("FEATURES.M_SUBMISSION_2023")
    .with_column("TEAM1", F.cast(F.split(F.col("ID"), F.lit("_"))[1], T.IntegerType()))
    .with_column("TEAM2", F.cast(F.split(F.col("ID"), F.lit("_"))[2], T.IntegerType()))
    .select("ID", "TEAM1", "TEAM2", '"Pred"')
)

teams = session.table("RAW.MTEAMS")
t1_join = (
    split_pred.join(teams, on=split_pred.TEAM1 == teams.TEAMID)
    .with_column_renamed("TEAMNAME", "T1NAME")
    .drop("TEAMID", "FIRSTD1SEASON", "LASTD1SEASON")
)
t2_join = (
    t1_join.join(teams, on=split_pred.TEAM2 == teams.TEAMID)
    .with_column_renamed("TEAMNAME", "T2NAME")
    .drop("TEAMID", "FIRSTD1SEASON", "LASTD1SEASON")
)
t2_join = t2_join.with_column_renamed('"Pred"', "PREDICTION")
t2_join.write.save_as_table("FEATURES.M_SUBMISSION_TEAMNAMES_2023", mode="overwrite")
We now have a table in Snowflake showing every game that could be played in NCAA men’s basketball in 2023.
We then searched for the games played in the tournament and filled out our brackets.
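Pulling a single matchup’s predicted probability out of the final table looked something like this (the specific matchup here is just for illustration):

matchup = (
    session.table("FEATURES.M_SUBMISSION_TEAMNAMES_2023")
    .filter((F.col("T1NAME") == "Duke") & (F.col("T2NAME") == "Tennessee"))
    .select("T1NAME", "T2NAME", "PREDICTION")
)
matchup.show()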
The collaborative features provided by Hex allowed us to produce a model that beat most of our family, friends, and coworkers. The feature engineering techniques we used in Snowpark can be applied to many other use cases, and we hope this helps deepen your understanding of Snowpark and helps others win their brackets in 2024.
Our bracket (posted before the Final Four began)