My Stuff + Model: The Process, Results, and Reproducible Code

My model trains on 2019–2022 data, then makes predictions on the 2023 data. I initially trained on 2019–2022 data plus a slice of the 2023 data. The results were excellent, but to me that is a no-no in machine learning if it can be avoided, since it invites overfitting. Originally I was using a randomForest model. I soon switched to an xgboost model due to its quicker run time and slightly better performance. I used a straightforward grid search approach to tune the model. The hyperparameters of interest to me were ntree, depth, and alpha. The best values for ntree, depth, and alpha are extracted from the grid search and applied to the model. This approach greatly strengthened my model and saved me a lot of time tinkering with the model's settings.
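For readers who want to see the shape of that tuning step, here is a minimal R sketch of an xgboost grid search over those three hyperparameters. The data frame `train` and its `target` column are hypothetical stand-ins (the real feature set and target live in the linked repo), and the grid values are illustrative, not the ones I actually searched:

```r
library(xgboost)

# Hypothetical training frame: predictor columns plus a numeric `target`
train_x <- as.matrix(train[, setdiff(names(train), "target")])
train_y <- train$target

# Grid over the three hyperparameters of interest
grid <- expand.grid(
  nrounds   = c(100, 300, 500),  # "ntree" analogue in xgboost
  max_depth = c(3, 5, 7),        # tree depth
  alpha     = c(0, 0.5, 1)       # L1 regularization
)

# Cross-validated RMSE for every combination
grid$rmse <- NA_real_
for (i in seq_len(nrow(grid))) {
  cv <- xgb.cv(
    data    = train_x, label = train_y,
    nrounds = grid$nrounds[i],
    params  = list(max_depth = grid$max_depth[i],
                   alpha     = grid$alpha[i],
                   objective = "reg:squarederror"),
    nfold = 5, verbose = 0
  )
  grid$rmse[i] <- min(cv$evaluation_log$test_rmse_mean)
}

# Best ntree/depth/alpha combo, to be refit on the full training data
best <- grid[which.min(grid$rmse), ]
```

The appeal of the grid is that it is exhaustive over the chosen values, so the "best" row is the best of everything you tried rather than wherever manual tinkering happened to stop.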

I was fairly pleased with the accuracy of my final model. There was a correlation of 0.33, which corresponds to an R² of 0.11. The RMSE is 3.25, which is a bit high but not too bad. Considering that command, execution, sequencing, and the batter weren't included, the results seem solid to me.
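Those three numbers come straight from comparing the 2023 predictions to the observed values. As a quick R sketch (the vectors `actual` and `predicted` are hypothetical placeholders for the 2023 hold-out results):

```r
# Evaluation metrics on the 2023 hold-out set
r    <- cor(actual, predicted)               # correlation (reported ~0.33)
r2   <- r^2                                  # R-squared   (reported ~0.11)
rmse <- sqrt(mean((actual - predicted)^2))   # RMSE        (reported ~3.25)
```

Note that R² here is just the squared correlation, which is why 0.33 and 0.11 travel together.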

Now for the interesting part! Here are the leaderboards for each pitch type:

2023 Fastballs Top 10 Stuff+ (≥ 100 Pitches)… Wow does my model love 4S FBs!
2023 Sliders Top 10 Stuff+ (≥ 100 Pitches)
2023 Curveballs Top 10 Stuff+ (≥ 100 Pitches)
2023 Changeups Top 10 Stuff+ (≥ 100 Pitches)

Final Thoughts and Ideas Moving Forward

I believe this is a solid start to my initial Stuff+ model! It will likely undergo more iterations as I learn more advanced machine learning techniques or as new ideas come to mind. One aspect of the model I am still unsure about is how people typically scale their Stuff+ models. To me, it makes the most sense to scale 100 as average across all pitches. However, I can see how it is also useful to scale 100 as average within each pitch type, so pitchers know whether a particular offering is above or below average relative to the same pitch type. If you have a Stuff+ model, let me know what you did! Future iterations could prioritize tuning for RMSE instead of correlation and assess model performance between the two methods. I could also consider adding plate location to build a separate (but similar) model to this Stuff+ one.
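The two scaling choices mentioned above differ only in what "average" means. A sketch in R, assuming a hypothetical data frame `preds` with a positive-valued raw model score `raw` and a `pitch_type` column (if the raw score can be negative, a z-score rescaling centered at 100 is the more common choice):

```r
library(dplyr)

# (a) 100 = average across ALL pitches
preds$stuff_plus_all <- 100 * preds$raw / mean(preds$raw)

# (b) 100 = average WITHIN each pitch type
preds <- preds %>%
  group_by(pitch_type) %>%
  mutate(stuff_plus_by_type = 100 * raw / mean(raw)) %>%
  ungroup()
```

Under (a), a 110 changeup and a 110 fastball have equally good stuff in absolute terms; under (b), each 110 means "10% above the average pitch of that type."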

My main hope and purpose for this blog is that the code is completely reproducible, which I believe it is. Scraping the data off Savant comes first and is omitted here, as I did the scraping in separate files. Here is a great resource on how to scrape the data into R.

My code for this project may be found here: https://github.com/rileyfeltner/Final-Stuff-Plus
