Data Geek Challenge — Baseball Stats and Post-Season Predictions
I’m not an expert on any sport, but I have recently become a casual fan of baseball. With the excitement of the post-season this year, I decided to hunt down some baseball data to see if I could better understand professional baseball through popular team-level MLB statistics. I’m no expert in sports statistics either, so I kept things simple: I pulled the well-known Lahman baseball database and calculated some summary statistics to evaluate team performance. I used SAP Predictive Analysis (with SAP Lumira visualization components) to visualize the data and perform some predictive analytics for the 2013 post-season.
Metrics and Data
I pulled just a few metrics to summarize batting and fielding performance of each team. The metrics I pulled were:
- Put Outs and Errors per inning out for each of the main positions (1B, 2B, 3B, C, CF, RF, LF, SS)
- HR, H, R, 2B, 3B, SO, SB, CS, SF, SH per At Bat
I calculated these for all teams from 1981 to 2012, using regular-season data only.
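As a rough illustration of how rate metrics like these can be derived, here is a minimal Python sketch. It assumes season totals have already been summed per team; the numbers and the helper name per_unit_rates are made up for the example and are not taken from the Lahman database or from the SAP workflow used in this post.

```python
def per_unit_rates(counts, denominator):
    """Divide each raw count by a shared denominator (at bats or inning outs)."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return {name: value / denominator for name, value in counts.items()}


# Hypothetical season totals for one team (illustrative values only).
batting = {"HR": 150, "H": 1400, "SB": 100}
at_bats = 5500

# Hypothetical totals for one position, for example shortstop.
shortstop = {"PO": 240, "E": 18}
inning_outs = 4320  # 1440 innings * 3 outs per inning

batting_rates = per_unit_rates(batting, at_bats)        # e.g. HR per at bat
fielding_rates = per_unit_rates(shortstop, inning_outs)  # e.g. errors per inning out

print(batting_rates)
print(fielding_rates)
```

Dividing by at bats or inning outs puts teams with different amounts of playing time on the same scale, which is what makes the comparisons across teams and seasons below meaningful.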
Visualizing Changes Over Time
Not being familiar with the ins and outs (get it?) of baseball, I decided to look for trends over time—would these metrics be consistent, or have strategies changed over the years?
I can look at these trends over time by league (HR increased through the early 2000s and have since been decreasing):
By metric, it looks like the frequency of doubles (2B) has been increasing, while triples (3B) have been steadily decreasing. Interestingly, though, the number of runs per AB has been relatively steady since 1993.
Perhaps these decreasing trends in scoring are driven by an improvement in fielding or pitching? Steady decreases in errors per inning out for all infield positions and increases in strikeouts per at bat suggest this could be the case. Or, as many of my baseball-fan coworkers have pointed out, it could have something to do with steroids: use increased rapidly during the 90s, then decreased after harsher steroid policy penalties were implemented in 2005 and MLB opened its investigation into steroid use in 2006.
Stolen bases have always fascinated me in baseball, so I wanted to look at the prevalence of stealing over time. Interestingly, this is the one metric that showed significant differences between the leagues, with NL teams stealing much more frequently than AL teams through the early 90s. Over time the two leagues have tracked much more closely, and they now show similar trends.
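A league-by-year trend like the ones described above can be sketched in a few lines of Python. This is only an illustration: the rows are invented, the field names are assumptions, and pooling totals across teams (rather than averaging each team's rate) is one possible choice, not necessarily what the charts in this post used.

```python
from collections import defaultdict


def rate_by_league_and_year(rows, stat, denom="AB"):
    """Pool a stat and its denominator over all teams in each (league, year)."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        key = (row["league"], row["year"])
        totals[key][0] += row[stat]
        totals[key][1] += row[denom]
    return {key: s / d for key, (s, d) in totals.items() if d > 0}


# Hypothetical team-season rows (not real Lahman data).
rows = [
    {"league": "NL", "year": 1985, "SB": 160, "AB": 5500},
    {"league": "NL", "year": 1985, "SB": 140, "AB": 5500},
    {"league": "AL", "year": 1985, "SB": 100, "AB": 5600},
    {"league": "AL", "year": 1985, "SB": 96, "AB": 5600},
]

trend = rate_by_league_and_year(rows, "SB")
for (league, year), rate in sorted(trend.items()):
    print(league, year, round(rate, 4))
```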
Visualizing Differences by Team
Still on stolen bases, which teams are most effective at stealing? For the 2012 season, Milwaukee, Miami, and San Diego had the highest frequency of stealing, with Milwaukee, Miami, Oakland, Minnesota, and Kansas City stealing most effectively (fewest CS per SB). Pittsburgh, Arizona, and Baltimore were the least effective at stealing bases, with Pittsburgh successfully stealing only 58.4% of the time.
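The two effectiveness measures mentioned here, caught stealing per stolen base and success rate, are related but not identical. The short Python sketch below shows both. The counts are illustrative and were chosen only so that the arithmetic lands on a 58.4% success rate; they are not quoted from the database.

```python
def steal_success_rate(stolen_bases, caught_stealing):
    """Share of steal attempts that succeeded: SB / (SB + CS)."""
    attempts = stolen_bases + caught_stealing
    if attempts == 0:
        return None  # no attempts, so the rate is undefined
    return stolen_bases / attempts


def caught_per_steal(stolen_bases, caught_stealing):
    """Times caught for every successful steal: CS / SB (lower is better)."""
    if stolen_bases == 0:
        return None
    return caught_stealing / stolen_bases


# Illustrative counts only (assumption): 73 steals, 52 times caught.
sb, cs = 73, 52
success = steal_success_rate(sb, cs)
ratio = caught_per_steal(sb, cs)
print(f"success rate: {success:.1%}, CS per SB: {ratio:.3f}")
```

Both measures rank teams the same way, since a lower CS per SB always means a higher success rate.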
We can also visualize the base stealing geographically by state, with darker blue states stealing less than lighter blue states. Generally, base stealing seems to be less popular in the AL and the East.
Shifting over towards batting strategy, we can compare the frequency of sacrifice flies and sacrifice hits by team. Sacrifice hits seem much more common in NL teams, and even sacrifice flies are relatively uncommon in the AL except for a few teams (Minnesota, Texas, Toronto, Tampa, the Yankees, and Boston).
Let’s look at fielding performance for infielders by team in 2012. Interestingly, it appears that the NL teams have a lower frequency of errors across all positions, but especially at 3B.
Outfield fielding performance seems to differ less between leagues, but the NL still seems to have slightly better fielding. Maybe Kansas City should look at firing their center fielder (though it looks like their main center fielder was injured in April).
Predicting Post-Season Performance
Expanding into the SAP Predictive Analysis toolset, I built a modeling dataset that uses the regular-season team-level (season summary) statistics discussed above to predict the outcome of a post-season matchup. The model is trained on all post-season games from 1981-2012, and I have scored it on every possible (and impossible) matchup for the 2013 post-season. (2013 regular-season data was pulled from ESPN and a variety of other sources, since it is not yet included in the Lahman database.)
The post-season information in the Lahman database is at the series level, so I used it to create simulated records for each game in a series (e.g., if there was a 4-3 series between Team 1 and Team 2, there are 4 records with Team 1 winning and 3 with Team 2 winning). As a result, this model predicts the outcome of a series between teams. It also takes into account only position-level statistics, not the statistics of any particular player, so if a team’s lineup changes significantly in the post-season due to injury, the model will likely be less accurate. To predict the likelihood of winning, I used my Custom R Logistic Regression algorithm, but a similar analysis could also be done with a decision tree model.
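The series-to-game expansion described above can be sketched as follows. This is a minimal Python illustration of the idea, not the code used for this post; the function name and record layout are assumptions.

```python
def expand_series(team_1, team_2, wins_1, wins_2):
    """Turn one series result into one simulated record per game.

    Each record is written from team_1's point of view: won = 1 when
    team_1 won that game and won = 0 when team_2 did.
    """
    records = []
    records += [{"team": team_1, "opponent": team_2, "won": 1} for _ in range(wins_1)]
    records += [{"team": team_1, "opponent": team_2, "won": 0} for _ in range(wins_2)]
    return records


# A 4-3 series between two placeholder teams yields 7 records.
games = expand_series("Team 1", "Team 2", 4, 3)
print(len(games), sum(g["won"] for g in games))
```

One consequence of this design is that a 4-0 sweep contributes four identical winning records while a 4-3 series contributes a mixed set, so lopsided series pull the model harder toward the winner.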
The factors in the model that were most predictive in determining the outcome of a post-season matchup included:
- Catcher error rate
- Center Fielder error rate
- Runs per At Bat
- Hits per At Bat
- Strike Outs per At Bat
- Sacrifice Flies per At Bat
- The difference in put outs at 2B between teams
- The difference in the error rates between teams for 1B
- The difference in the error rates between teams for 2B
- The difference in the error rates between teams for SS
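The list above mixes a team's own rates with differences between the two teams in a matchup. A hedged Python sketch of how such features could be assembled and passed through a logistic function follows. The stat names, values, and coefficients are all invented for illustration; the fitted coefficients from the actual model are not given in this post.

```python
import math


def matchup_features(team, opponent):
    """Combine a team's own rates with team-minus-opponent differences."""
    features = dict(team)
    for name in team:
        features[f"diff_{name}"] = team[name] - opponent[name]
    return features


def win_probability(features, coefficients, intercept=0.0):
    """Logistic regression scoring: sigmoid of a weighted sum of features."""
    z = intercept + sum(coefficients.get(name, 0.0) * value
                        for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical regular-season rates for two teams (assumed values).
team_a = {"R_per_AB": 0.135, "E_per_out_SS": 0.0040}
team_b = {"R_per_AB": 0.125, "E_per_out_SS": 0.0050}

# Made-up coefficients: more runs help, more shortstop errors hurt.
coefficients = {"diff_R_per_AB": 40.0, "diff_E_per_out_SS": -200.0}

p_a = win_probability(matchup_features(team_a, team_b), coefficients)
p_b = win_probability(matchup_features(team_b, team_a), coefficients)
print(round(p_a, 3), round(p_b, 3))
```

With only difference features and no intercept, the two probabilities sum to 1, which is a convenient sanity check when scoring a matchup from both sides.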
The model has an Area Under the Curve (AUC) of 0.722 (where 0.5 corresponds to random guessing and values closer to 1 indicate better predictive performance), which indicates a reasonably good fit but not a highly predictive model. Additionally, due to the relatively low volume of data, 100% of the sample was used for both fitting and validation, so this AUC is an in-sample figure and likely overstates how the model would perform on new data.
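For readers unfamiliar with AUC, it can be read as the probability that a randomly chosen winning record gets a higher predicted score than a randomly chosen losing record. The Python sketch below computes it that way on toy data. It is an illustration of the metric only and does not reproduce the 0.722 figure.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels (1 = win) and predicted win probabilities, invented for the example.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]
toy_auc = auc(labels, scores)
print(toy_auc)
```

Here three of the four winner-loser pairs are ordered correctly, so the toy AUC is 0.75.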
With the model developed, I can simulate the result of every pairing of teams for every possible series in the 2013 post-season and, once the actual matchups have been decided, look up the model’s predicted victor. The chart below summarizes the predicted outcome for each matchup in the post-season and the actual outcome of the series or game.
Over the course of the post-season, the model was over 50% accurate, predicting 5 of 9 series correctly (about 56%), which isn’t bad for someone who knows nothing about baseball statistics or the results of the season so far. Interestingly, though Boston finished the regular season in first place overall, the model consistently predicted them performing poorly in the post-season. I’m excited to use the same model next year and see how it performs!
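Scoring every pairing ahead of time amounts to looping over all combinations of teams and storing the predicted winner for each. A minimal Python sketch follows. The team names, the single "strength" number, and the toy probability function are placeholders standing in for the real team statistics and the fitted model.

```python
from itertools import combinations


def predict_all_series(team_stats, win_probability):
    """Predict a winner for every unordered pairing of teams."""
    predictions = {}
    for a, b in combinations(sorted(team_stats), 2):
        p = win_probability(team_stats[a], team_stats[b])
        predictions[(a, b)] = a if p >= 0.5 else b
    return predictions


def toy_probability(strength_a, strength_b):
    """Placeholder model (assumption): the stronger team is favored."""
    return strength_a / (strength_a + strength_b)


# Hypothetical teams with a single made-up strength score each.
team_stats = {"Team A": 0.60, "Team B": 0.55, "Team C": 0.40}

predictions = predict_all_series(team_stats, toy_probability)
for pairing, winner in predictions.items():
    print(pairing, "->", winner)
```

Once the real bracket is known, the predicted victor for any series is a dictionary lookup on the pairing.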
See other blog posts on predictive analytics and data visualization using SAP Predictive Analysis and SAP Lumira under the Predictive Analytics topic at sapbiblog.com. Follow me on Twitter at @HillaryBlissDFT!