MLB Pitcher Strikeout Predictor


Introduction

Strikeouts are one of the most defining statistics in modern baseball. They showcase a pitcher’s dominance, ability to miss bats, and overall effectiveness on the mound. But what if we could predict a pitcher’s strikeout percentage (K%) before the season even begins?

Using historical MLB data from FanGraphs, I developed a machine learning model to estimate each pitcher’s K% for the 2024 season. The model analyzes past performance metrics, plate discipline data, and advanced pitching statistics to generate predictions.
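
For reference, K% is the share of batters faced that a pitcher strikes out. The snippet below is a minimal sketch of that calculation. The function name and the season line are hypothetical and are not taken from the project's code or data.

```python
def strikeout_rate(strikeouts, batters_faced):
    """Return K% as a percentage: strikeouts per batter faced, times 100."""
    if batters_faced <= 0:
        raise ValueError("batters_faced must be positive")
    return 100.0 * strikeouts / batters_faced


# Hypothetical season line, for illustration only (not real player data).
example_k_pct = strikeout_rate(strikeouts=210, batters_faced=700)
print(f"K% = {example_k_pct:.1f}%")
```

Because K% is a rate rather than a count, it lets pitchers with very different workloads be compared on the same scale.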

After testing multiple approaches, Ridge Regression proved to be the most accurate model for this task, outperforming other techniques such as Random Forest and XGBoost. Below, I’ll walk through my methodology, the key features that influence strikeout rates, and my final predictions for the 2024 MLB season.

How Strikeouts Are Influenced by Pitching Metrics

Several factors contribute to a pitcher’s ability to generate swings and misses. Based on historical data and advanced sabermetrics, the features I identified as the strongest predictors of K% include fastball velocity, plate discipline metrics, and earned run average.

Building the Strikeout Prediction Model

After compiling the dataset, I tested four different machine learning models to determine which provided the most accurate predictions:

1. Linear Regression (Baseline Model)

2. Random Forest (Nonlinear Approach)

3. XGBoost (Gradient Boosting)

4. Ridge Regression (Best Model)
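
To show what separates Ridge Regression from the Linear Regression baseline, here is a minimal sketch with a single feature. It assumes swinging-strike rate as the predictor and uses made-up numbers. The function name, the data, and the alpha value are all hypothetical, and the actual project used many features and presumably a library implementation. With an unpenalized intercept, the ridge slope has a closed form in which the penalty alpha is added to the denominator, shrinking the slope toward zero.

```python
from statistics import mean


def fit_ridge_1d(x, y, alpha):
    """Fit y ~ intercept + slope * x with an L2 penalty on the slope only.

    Minimizes sum((y - intercept - slope * x)^2) + alpha * slope^2.
    Setting alpha = 0 recovers ordinary least squares.
    """
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / (sxx + alpha)
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical swinging-strike rates (%) and K% values, for illustration only.
swstr = [8.0, 10.0, 12.0, 14.0, 16.0]
k_pct = [17.0, 21.5, 24.0, 29.0, 33.5]

ols_intercept, ols_slope = fit_ridge_1d(swstr, k_pct, alpha=0.0)
ridge_intercept, ridge_slope = fit_ridge_1d(swstr, k_pct, alpha=10.0)

print(f"OLS slope:   {ols_slope:.3f}")
print(f"Ridge slope: {ridge_slope:.3f}")
```

The ridge slope comes out smaller than the least-squares slope. That shrinkage is the general mechanism by which ridge can stabilize coefficients when features are correlated, which is one plausible reason a regularized linear model could do well on this kind of data.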

Final Predictions for the 2024 MLB Season

Using the final Ridge Regression model, I generated predictions for the 2024 season. The table below highlights five pitchers projected to have the highest K% in 2024:

Rank | Pitcher         | Predicted K%
1    | Spencer Strider | 34.5%
2    | Jacob deGrom    | 32.8%
3    | Shohei Ohtani   | 30.7%
4    | Corbin Burnes   | 30.2%
5    | Dylan Cease     | 29.9%

Conclusion & Future Improvements

This project developed a model for predicting MLB pitcher strikeout rates from key performance indicators such as fastball velocity, plate discipline metrics, and earned run average.

With an R² of 0.699 (roughly 69.9% of the variance in K% explained) and a mean absolute error (MAE) of 2.28 percentage points, the Ridge Regression model was the most accurate of the four, outperforming nonlinear methods like Random Forest and XGBoost.
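
For readers less familiar with these two metrics, the sketch below shows how each is computed from actual and predicted K% values. The function names and the four data points are hypothetical and chosen only to keep the arithmetic easy to follow. They are not the project's evaluation data.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Share of variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical actual vs. predicted K% values, for illustration only.
actual_k = [20.0, 25.0, 30.0, 35.0]
predicted_k = [22.0, 24.0, 29.0, 33.0]

mae = mean_absolute_error(actual_k, predicted_k)
r2 = r_squared(actual_k, predicted_k)

print(f"MAE: {mae:.2f} percentage points")
print(f"R^2: {r2:.3f}")
```

MAE is expressed in the same units as K%, so an MAE of 2.28 means predictions miss by about 2.28 percentage points on average. R² is unitless and compares the model's squared error against simply predicting the mean K% for every pitcher.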

Potential Future Enhancements:

Overall, this study provides a data-driven approach to evaluating MLB pitcher strikeout potential, helping analysts, scouts, and fantasy baseball players make better-informed decisions.