#ViztheMadness with the SAP DataGenius Team
Update April 12—Congratulations to the Virginia Cavaliers!
Finally, we are at the end of the 2019 tournament, and what an amazing ride it was. On the court, there were many surprises and close calls this year. Underdog Auburn upset its way through round after round, while Duke, the tournament favorite, lost earlier than expected. After a whirlwind of events, the winner of this year’s tournament was the Virginia Cavaliers from the University of Virginia!
Back at SAP, we on the #DataGenius Team also had plenty of things to celebrate and reflect on.
First, out of 63 games, we predicted 40 correctly, for an accuracy of about 63%, which was quite good overall. Given how slim the chances of a perfect bracket are, we managed to build a solid bracket without extensive knowledge of statistics or data science. One thing worth noting is that the team stuck with our model’s predictions and dropped Duke from our bracket even though Duke was the favorite for many!
We also identified plenty of areas where we could have improved our model’s design. Here are a few.
We could have consulted a basketball expert before we designed our model
In our previous post, we had our results reviewed by a basketball expert and learned that there were quite a few tweaks we could have made to our design. While we understood the basics of basketball, we didn’t realize how much more complex the game becomes when coupled with statistics. Had we consulted a basketball expert earlier, we might have better understood which metrics contribute most strongly to the prediction and known which variables to exclude.
After all, although SAP Analytics Cloud’s predictive analytics can take historical data and determine, based on past patterns, which variables contribute most to the result, it doesn’t know anything about basketball. The model did well, despite Michigan and Kentucky not advancing to the final game, but with some fine-tuning the results could have been much better.
Consider gathering individual player data
Although basketball is a team sport, it’s undeniable that the individual talents of the players add up to the strength of the group. We could have spent more time gathering data on, for instance, how many highly valued players each team had and how that affected the team’s chances.
Use Smart Predict and Smart Insights to more effectively include or exclude variables
Our model reviewed the data available from Ken Pomeroy and used the following variables:
- Offense/defense ratio
- Adjusted defensive efficiency
- Adjusted offensive efficiency
- Offensive 3-point percentage
- Offensive rebounds from the small forward
- Defensive rebounds from the center
- Region the game is played in
- Difference in seed
- Average points scored by the power forward
- Defensive turnover percentage
- Defensive rebounds from the point guard
- Defensive rebounds from the small forward
From a basketball perspective, some of these variables, such as the offense/defense ratio and luck, are great indicators of game results, especially given the randomness of these tournaments. On the other hand, some of the variables are less useful because they matter little to the outcome of a game. A great help here would have been using Smart Predict to manually exclude more variables than we did. That’s what’s so great about Smart Predict: while a data scientist could do this by manipulating the data itself, somebody well versed in the subject matter but with no technical expertise can use Smart Predict to pick out the variables that make sense to them.
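As a rough illustration of what that variable exclusion amounts to under the hood, here is a minimal sketch. The feature table and column names below are hypothetical stand-ins, not Smart Predict’s actual fields or the real Ken Pomeroy data:

```python
import pandas as pd

# Hypothetical feature table; the columns are illustrative stand-ins for
# the Ken Pomeroy-derived variables used in our model.
features = pd.DataFrame({
    "off_def_ratio": [1.2, 0.9],
    "pg_def_rebounds": [2.1, 3.4],  # a variable a basketball expert might cut
    "seed_diff": [-7, 5],
})

# Excluding a variable is conceptually just dropping its column before
# training, mirroring Smart Predict's exclude-variable option in the UI.
trimmed = features.drop(columns=["pg_def_rebounds"])
print(list(trimmed.columns))  # ['off_def_ratio', 'seed_diff']
```

The point is that a subject-matter expert only needs to decide *which* columns to cut; Smart Predict handles the mechanics.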
Create specialized models for each region or round
Randomness is such a large part of these games that it would have helped to narrow our model to a more specific set of games. That’s why creating a separate model for each region or round could be a great way to improve for next year. That way, the model would account for specific concerns like stadium locations or how well teams from a particular region usually do.
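To sketch what per-region specialization might look like, here is a toy example with made-up games, where a simple per-region base rate stands in for a separately trained classifier:

```python
import pandas as pd

# Made-up historical games; in a real version each row would also carry
# the stat-difference features described elsewhere in this post.
games = pd.DataFrame({
    "region": ["East", "East", "West", "West", "South", "Midwest"],
    "seed_diff": [-8, 3, -5, 1, -2, 4],
    "team1_won": [1, 0, 1, 0, 1, 0],
})

# One "model" per region -- here just the region's base rate of team-1
# wins, standing in for a classifier trained on that region's games only.
region_models = {
    region: grp["team1_won"].mean()
    for region, grp in games.groupby("region")
}
print(sorted(region_models))  # ['East', 'Midwest', 'South', 'West']
```

Splitting the training data this way trades sample size for specificity, which is exactly the trade-off a region-specific model would need to manage.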
Overall, this challenge was an amazing experience for us, the #DataGenius Team. We can’t wait to take the skills we’ve learned from this challenge and see where we can apply them next!
Try It Yourself!
Update April 3—Round of 16 & Quarter-finals
And the results are in! We’re finally down to our last four teams, fighting for a spot in the semi-finals on April 6th. It has been a hectic weekend for both the players and the #DataGenius Team at SAP, and it’s not going to stop there.
Before we move into the semi-final results with confidence, let’s look at our results after the Round of 16 & Quarter-finals below.
We were able to predict 4/8 games correctly in the Round of 16, and 1/4 games correctly in the Quarter-finals.
The most biting fact of all is that both of our picks for the Finals have been knocked out.
Kentucky surprised us the most by losing to Auburn, a team we never expected to make it past the 2nd round. This made us reflect on both our model and our data to investigate where we could have improved when building our bracket.
To get some feedback, we reached out to someone with a stronger acumen in the realm of basketball—Darren Wan.
So Darren, looking at the model’s predictions and what happened over the weekend, what do you think about our results?
Some of these variables are good, but they don’t cover the whole picture. For example, point guard rebounds: the point guard tends not to get many rebounds, since the focus of the position is on assists. Analyzing rebounds for a center would be more effective. While these statistics are important, you should be looking at team statistics. Basketball is a team sport, after all.
I see, I guess this is a case where having access to the right kind of data really could have made a difference. Your point about overall team statistics is really interesting. What are your thoughts on the Duke vs Michigan State game? Our model was predicting Duke vs LSU.
Duke vs Michigan State is a good example of how basketball is a team sport. Your model predicted that Duke would lose in that round. While it wasn’t LSU competing against Duke, Duke still lost, so you were right. Duke has very strong players, such as Zion Williamson, but they can’t carry the whole team by themselves. Michigan State worked very well together. Seeing this, you could consider adding more overall team statistics to your model to improve its predictions.
You also have to take into account each team’s playing style and the player match-ups. Before this game, Duke won their games, but with very small score gaps between them and their opponents. They had great players, but those players were definitely not playing at their best in these games. You can find out more at FiveThirtyEight in “Who Didn’t Expect This Final Four?”
It’s a shame what happened to LSU.
Take a look at FiveThirtyEight’s “Your Guide to the 2019 NCAA Men’s Tournament.” LSU was listed as a third seed, but the general consensus is that they were overvalued and should have been a fifth seed instead. There’s also a coaching scandal that isn’t helping their morale.
Our model definitely didn’t take that into account for LSU. It looks like predicting basketball takes a lot more information than what was available to us.
For sure! There are so many factors to take in, and some are just really hard to quantify in a model like this. For example, the mental state of the team can make a big difference, but how are you going to account for that in a model? That’s why analysts look at a team’s historical performance and whether they have entered the competition before to try to gauge how well they will do.
Taking into account the team’s performance over the course of the season also helps. A team entering the competition on a long win streak will be much more confident than a team that hasn’t been doing well.
Do you think luck was a big factor in these games?
The NCAA Basketball Championships are just hard to predict in general. Teams only play one game against each other, so it’s natural that there will be upsets, as teams have only one chance to prove themselves. In the NBA playoffs, each round is a best-of-seven series, which reduces the impact of one-off wins by a weaker team.
Wow! Sports really is just hard to predict then?
At the end of the day, sports really are hard to predict. If it were easy, the NCAA wouldn’t be tracking the perfect bracket. The closest anyone has ever gotten was 50 correct games out of 63, a run that was broken in the Purdue vs Tennessee game last week.
The odds of picking a perfect bracket are 1 in 9.2 quintillion, so it really is hard to predict, even for professional analysts. (See “Perfect NCAA Bracket Absurd Odds March Madness Dream.”)
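That quintillion figure comes straight from counting: 63 games, each with two possible winners, gives 2^63 possible brackets.

```python
# 63 games, 2 possible winners each: 2**63 distinct brackets.
brackets = 2 ** 63
print(brackets)                   # 9223372036854775808
print(round(brackets / 1e18, 1))  # 9.2 (quintillions)
```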
Looking Back at Lessons Learned
Overall, it looks like our model itself may not have been the issue. Instead, what we could have improved was the way we collected the data and the number of data sources we used.
As Darren mentioned, we could have gathered better stats, such as measures of team cohesion. We could also have scraped more individual player data to diversify our pool. However, collecting detailed information like this is time-consuming and difficult, another trade-off to consider for future predictions.
Furthermore, we noticed that our luck stat had a high variable contribution to our model, which may have thrown it off. In a playoff series, seven games are normally played, but here each team plays only one, making luck a much more volatile factor than usual.
This was definitely an amazing learning experience, and we are beginning to understand why the odds of picking the perfect bracket are 1 in 9.2 quintillion. Although it might be difficult, we were still able to predict 40 out of 63 games so far, with a total accuracy of 63%. Not bad for a team that’s new to the game!
Finally, here are our predictions below after re-running our model.
Try It Yourself!
- If you have predictions of your own you’d like to share, tweet @SAPAnalytics and use the hashtag #vizthemadness. We recommend gathering up your own data and using the embedded machine learning features in SAP Analytics Cloud, such as Smart Predict, to do your analysis.
- Check out the free 30-day trial and the fantastic augmented analytics learning resources.
Update March 27—Rounds One and Two
What an amazing, action-packed four days last week, filled with close calls, upsets, and good old-fashioned fervor! Here at SAP, our DataGenius team was watching with bated breath, praying that the games wouldn’t upset the bracket we thoughtfully pieced together.
Unfortunately, it turns out upsets aren’t that uncommon in March, especially in the 1st round.
Out of the 32 games in the first round, we predicted the correct winner in 22 of those games, leaving us with about 69% accuracy. Not bad for a group that knows almost nothing about basketball and with just enough data to be dangerous.
In our defense, there were quite a few upsets, with 12 lower-seeded teams besting their opponents. To be fair, however, seven of those “upsets” were in 8 vs. 9 or 7 vs. 10 match-ups where teams are traditionally regarded as equally matched, making it hard to consider these games true upsets.
The biggest heartbreaker for the DataGenius team in Round 1 was Villanova vs. St. Mary’s, where, based on previous data, we predicted an upset that never materialized. Instead, 6-seed Villanova came out on top by a mere 4 points. Our heartfelt congratulations to the person with the one remaining perfect bracket.
Luckily, our predictions (finally) came through for us in the 2nd round where we correctly picked 13 of the 16 teams remaining in the tournament, for a Round 2 accuracy of 81%. This is the highest number of correct picks in the history of our #vizthemadness program!
So far it seems that as we get closer to the top, the playing field will level off, with the stronger teams weeding out the weaker ones until the teams left are relatively evenly matched.
This could bode well for our model, as there will be fewer upsets, but it also might mean it will be harder for the model to predict results correctly, as the games involve more nuance that can’t be explicitly captured in our analysis.
Updated Bracket for the Round of 16
Below is our updated bracket for the Round of 16! We reran our model and highlighted the changes we made to our bracket. We honestly didn’t expect Oregon, Auburn, and North Carolina to get this far, but we are very excited to see where they go from here!
Let us know how your bracket is doing by tweeting @SAPAnalytics with #vizthemadness or leaving a comment below. And just because the tournament has already started doesn’t mean you can’t build a revisionist bracket using SAP Analytics Cloud. We dare you to try and beat our 13 out of 16 team standard.
It’s that time of the year again! An intense couple of weeks where basketball fans are pulled into a whirlwind of fun and excitement, with moments defined by gasps and bated breath as they wait to see who will take a step closer to the championship. In fact, this excitement prompts thousands of sports commentators, machine learning models, and professional statisticians to weigh in on the winners of each and every round.
Although we know predicting sports outcomes is a fool’s errand, here at SAP the DataGenius team is still joining in on the fun, but from the perspective of the average person.
The average person probably doesn’t know much about the nuances of basketball nor the complicated world of statistics. That holds true for us too! Our DataGenius #ViztheMadness team (consisting of only one person who knows anything about basketball) wanted to weigh in as that “average person.” By using SAP technology, we’re opening the closed-off world of statistics and demonstrating that anyone can formulate a knowledgeable opinion of who will win the tournament.
Using Smart Predict in SAP Analytics Cloud
As complete amateurs in both machine learning and the finer points of basketball, we decided to approach the problem using SAP Analytics Cloud and a tool we didn’t have access to last year: Smart Predict. Designed to make machine learning accessible to business users without the need for a data scientist, Smart Predict augments existing business intelligence capabilities by learning from historical data to create recommendations on the next best action.
Smart Predict allows for three different predictive scenarios. Reading over the example description of each scenario, we thought that picking Classification would be the most suitable as we were looking at who would win between two teams (Team 1 or Team 2). This is a binary result.
The idea behind our model was relatively simple: the better team will win, so which team is better? Let’s decide by seeing who has the better stats! Given that our goal was to figure out which team would win when pitted against another in the bracket, we decided to manipulate the data to look at the difference between the two teams’ statistics. Then, for our historical data, we used game statistics dating back to 2007 to build our model.
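The stat-difference setup can be sketched roughly like this; the team names, stat columns, and values below are all hypothetical, not the actual Ken Pomeroy data:

```python
import pandas as pd

# Hypothetical per-team season statistics (illustrative values only).
stats = pd.DataFrame({
    "team": ["Virginia", "Auburn", "Michigan", "Kentucky"],
    "adj_off_eff": [123.4, 119.2, 114.6, 117.3],
    "adj_def_eff": [89.2, 95.1, 85.6, 92.9],
}).set_index("team")

def matchup_features(team1: str, team2: str) -> pd.Series:
    """One model input row: the element-wise difference of the teams' stats."""
    return stats.loc[team1] - stats.loc[team2]

row = matchup_features("Virginia", "Auburn")
print(round(float(row["adj_off_eff"]), 1))  # 4.2
```

Each historical game becomes one such difference row, labeled with which team actually won, and that labeled table is what the classifier trains on.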
However, it is a given that, out of the endless list of statistical variables we collected, only a few would have a strong impact on the results. To identify these critical variables, we fed our data through Smart Predict, looking at the difference between the opposing teams’ statistics for each game since 2007 and noting the winners.
By training our model through this process, the embedded AI in SAP Analytics Cloud was able to pick out the trends over all the games and we were able to narrow the top contributing variables to the following.
The basketball guru on our team came over and commented that it seemed pretty obvious that the difference in wins and losses would affect who would win or lose. Therefore, we decided to try excluding those two variables:
Afterwards, we tested our model by predicting the 2018 tournament results and found that it got only 9 games wrong, for 86% accuracy! Furthermore, some of the incorrect picks were complete upsets that other statisticians could not foresee, reassuring us that we had built quite a reliable model. Not bad for a group of beginners who don’t know basketball or machine learning, even if we do have the benefit of hindsight.
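As a sanity check on that 86% figure, accuracy is simply correct picks divided by total games. Here is a minimal sketch with made-up pick lists arranged to reproduce the 9-miss result:

```python
# Made-up back-test labels: 63 games, 9 of which the (hypothetical)
# predictions got wrong -- matching the result described above.
actual    = ["A"] * 63
predicted = ["A"] * 54 + ["B"] * 9

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(f"{correct}/63 correct -> {accuracy:.0%}")  # 54/63 correct -> 86%
```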
SAP DataGenius #ViztheMadness Predictions
Now that the bracket has been set, here are the official SAP DataGenius predictions for the 2019 tournament.
We expect to see Michigan face-off against Kentucky in the finals with Michigan coming out in victory!
We will keep you posted on how the DataGenius model does as the tournament progresses so make sure to check back every week. We’ll also investigate how we can further improve the model as the 2019 results roll in.
Try It Yourself!
If you have predictions of your own you’d like to share, tweet @SAPAnalytics and use the hashtag #vizthemadness. We recommend gathering up your own data and using the embedded machine learning features in SAP Analytics Cloud, such as Smart Predict, to do your analysis.