Analyzing Flight Cancellations With SAP Lumira
There are two categories of business travelers: The first type sees flying as a nasty consequence of needing to be somewhere else to get the job done – one that includes trudging to and from the airport, waiting in long lines, and generally having your schedule tossed around because of traffic, flight delays, and even cancellations. I’m of the other type, who doesn’t enjoy these inconveniences either, but accepts them as part of the price to pay to go to new places, see new people, and make a difference to a customer or an internal meeting. My motto is NIHITO (Nothing Interesting Happens In The Office).
Flight delays and cancellations happen all the time, and whether we admit it or not, in most cases it is not actually the airline’s fault. A gate may not be available, a runway could be closed, or the weather could make it impossible to even take off. Nobody can change the weather, but we can make better decisions if we understand the patterns. Flight delays are usually sustainable – the worst thing that can happen is I miss my connecting flight and maybe need to stay overnight until the next flight out in the morning. However flight cancellations may mean I am stuck somewhere for much longer because all other flights are full.
Flying through Chicago is a great example – awesome airport but when Chicago has problems, it produces a domino effect throughout the continent. It’s reputation for flight problems is perhaps only matched by San Francisco. But is it really Chicago that is so bad, or is something else at play – such as the weather?
So I used SAP Lumira to see if there is a pattern between flight cancellations and weather. To do this, I needed two data sets: the weather for each airport I want to analyze (acquired from weatheroffice.com) and the flight statistics at those same airports (from faa.gov). The first step is simply to export these as Excel files and bring them into SAP Lumira. Here’s the weather information that contains the date, airport code, min/max temperatures, and the weather condition of that day:
Next, I brought in the flight information. For each flight, this included the date, airport origin, destination, whether it was cancelled, and other information like flight time and flight number:
Now that I both data sets are in, I needed to cleanse the data – some things were just not right. In the weather data, there was “Shiny Day”. Changing that to “Sunny” was easy by using Lumira’s data manipulation features:
While I was at it, I also made a few other changes:
- Changed the column “Events” to “Weather” to make more sense
- Changed aggregation for Max Temperature to be “Max” instead of Sum
- Changed aggregation for Min Temperature to be “Min” instead of Sum
- Replaced all weather events to be consistent (i.e. “Foggy Days” to be “Fog” etc)
In the flight data set, I also did the following to create a geo-hierarchy for both airport origin and airport destination:
- Split on “:” to separate the airport name from its city/state
- Split on “,” to separate the airport city from airport state
- Created geographic hierarchy using airport city and state
Then I did a multi-key join (using the airport code and date) to add the appropriate weather condition for each flight in my dataset:
Moving to the “Visualize” tab… I had a hunch that weather actually does make a difference, but the only way to know is by creating manually exploring through the data and creating visualizations to find some patterns – or is it? SAP Lumira has this handy “Light Bulb” icon that exposes a really cool feature called “What’s Influencing This Measure?”. It actually is an implementation of sophisticated algorithms from SAP’s Predictive Analysis that do a lot of whiz-bang-cool stuff for you automatically.
Since I’m interested in what affects flight cancellations, I simply clicked on the light bulb icon, select “Flights Cancelled” from measures and then hit “Run Analysis”. This produces a list of recommended visualizations based on Lumira’s statistical analysis of the data. Clicking on any recommendation auto-populated Lumira’s chart builder with the necessary elements to create the visualization. However the first recommendation in the list is very special – it is a popup chart that shows me what dimensions have the largest impact statistical impact on “Flights Cancelled”:
From here you can see that weather conditions affect flight cancellations more than anything. By selecting “Weather” from this chart and selecting “Explore Flights Cancelled by Weather”, I can see that cancellations happen most often when it is sunny followed very closely by when it is snowing. Odd right? Maybe there is more to it. What happens when I chart the top airports for flight cancellations and map weather on top?
Chicago unsurprisingly has the most cancellations – In fact, 14.8% of all flight cancellations in the US are attributable to Chicago. Snow is by far the most prevalent weather condition even though the overall data suggests it is more even. Hmmm… What happens when it isn’t actually snowing? It’s not rain or fog, so it must be sunny. When it is sunny in the wintertime, there are fewer clouds to hold the earth’s heat in – which means even when it is sunny, it can be cold.
So let’s take a look at cancellations by min/max temperature across all airport (To make my analysis easier, I’ve created a storyboard that includes a filter panel for city):
There does appear to be some correlation between minimum temperature and flight cancellations. What happens to Chicago when the weather changes?
Interesting! We can conclude that Chicago is strongly affected by low temperatures – and something to keep in mind when I am flying in the winter.
However I need to be very careful – an airport may have more flight cancellations because it also has more flights, so the ratio of flights operated vs cancelled is an important measure – one that is not in my dataset. This is easily solved by going back to the Prepare tab and creating a “Calcuated Measure” that gives me a ratio of cancelled flights versus total number of flights:
Whipping up a few Lumira “Numeric Point Charts” I find that Seattle’s “Cancel Ratio” is 4% whereas for Chicago it is only 3% (and they both blow away the USA average of 1%):
What else might affect flight cancellations in Chicago other weather and temperature? Likely the airline carrier – some might be better at ground operations than others because when there are flight cancellations it is not usually all airlines equally unless the airport itself closes down.
I can create a tree map filtered on Chicago to analyze carrier statistics (you will notice I am still using “Cancel Ratio” because I’m filtering on Chicago and drawing based on Carrier Name as a dimension):
The largest surface area of the tree map belongs to American Airlines which means it cancels more flights than anyone else, but Delta has a higher cancel ratio – so it means if I am on a Delta flight, I have a *higher chance* of having it cancelled even if Delta cancels less flights overall. Interestingly, Chicago is a United Airlines hub but they seem to be doing pretty good. Using the “Numeric Points Chart” again, Lumira calculates that American Airlines cancels 107% more flights than United, and Delta cancels 350% more of its own flights than United cancels theirs.
So what did I learn?
- Chicago is indeed the worst airport with 14.8% of all cancellations.
- Seattle cancels a larger percentage of its flights than any other city (including Chicago).
- United has fewer cancellations and a better cancellation ratio than the other big airlines at Chicago.
For someone I want to share this with that just wants “the answer”, I have created a Lumira Stoyboard with an InfoGraphic that walks the them through the analysis in a logical and visually appealing way. I have also added a number of additional pages to the Lumira Storyboard that allows anyone to do further analysis. SCN does not allow me to embed an iFrame here, so here’s a screenshot and a link to the public SAP Lumira Cloud Storyboard that you can explore further yourself.
Using my Lumira Story, flip through the additional boards and try to answer these questions on your own and see just how easy it is:
- Are Houston’s cancellations typically affected by temperature variances?
- What are the airlines to be careful of when flying out of Seattle?
- Which city in California has only one cancellation for the whole month?
We’ve also prepared a video that shows how this could use BusinessObjects Universes as a datasource here: Integrating SAP Lumira with SAP #BI41 – YouTube
This is of course only one month’s data. What would be really interesting is to get my hands on historical data for a number of years and use the full power of SAP Predictive Analysis to have a much better way of clustering, categorizing, and predicting problems on my travels.
Maybe an idea for my entry into next year’s Data Geek Contest? 🙂