After the initial publication of my entry for the Data Geek Challenge: Analyzing flight delays with SAP Lumira, I received lots of comments and questions, which is great! Several of them were related to the version of SAP Lumira I was using and that got outdated, so I upgraded and started all over. By doing so, I uncovered some mistakes I had done.
Use the comments section below to share with others, which questions would you like to see answered with this dataset. Also, if you give me your email addres, I can send you a link to the dataset and visualizations through SAP Lumira Cloud.
In this revised post, I won’t go into the details of how to create specific measures, since the process would be exactly the same. Let’s take a look at the new UI, which I find slicker.
Since the data is the same, most of the conclusions will be similar. For instance, we clearly see that the number of flights is slightly decreasing and the most frequent incidents are related to delays, as opposed to diversions or cancellations. Due to the volatility of flight numbers, we will use ratios instead and focus on delays only.
With a running average, we easily spot that flight delays had been slowly increasing from 2003 to 2008, but the trend reverted between 2008 and 2013. Let’s drill into the different types of flight delays.
Drilling-down into the different types of delays, we could think that most flight delays are caused by the National Aviation System, which is correct if we only look at number of incidents, but if we look at the time impact, we see that “Aircraft Arriving Late” is more important, which is why we will focus on it (that nuance was missed in the first part). Now, let’s figure out who is causing these delays.
If we narrow down on aircraft delays and rank the top 10 carriers, we can spot outliers like “Atlantic Southeast Airlines”. As it turns out, that specific airline was rebranded as ExpressJet in November 2011. Therefore, it would be more relevant to filter on data from 2013. Now, we can isolate Frontier Airline for more details. In the first experiment, American Airlines had been selected because the count on flight delays was used, as opposed to ratio of flight delays. Let’s see if this is temporary situation or if this has been an ongoing trend.
The above line charts clearly show that the trend in flight delays has been increasing recently to alarming proportions, especially for the category “Aircraft Arriving Late”. Since I have limited knowledge of the specifics of this airline company, my set of assumptions is pretty narrow. I believe that what makes Frontier Airline unique is their hub in Denver, as opposed to Atlanta, Chicago, Miami, etc. Let’s take a look at the geographic repartition of flight delays for Frontier Airlines in 2013.
Since Frontier Airlines is flying to all states with a similar delay ratio accros the board, showing all states was not very informative. Therefore, I have on purpose restricted the data to the top 10 locations to try and identify a pattern. Here, Nebraska comes on top and we see 2 pockets of regions that are most impacted by flight delays. The only reason I can see is maybe a weather correlation. If this is the case, then all airlines should be impacted. Let’s dive into that.
Just like in the first part, we see that there is definitely a seasonality in the weather-related flight delays, with summer conditions being more impactful that winter conditions. Perhaps snow and cold weather are more predictable than summer storms.
Talking about predictability, let’s see which airport and airline could have the biggest impact by improving their flight ponctuality.
Top Carriers in Newark Airport in 2013
Using scatter plots, we can first identify the best potential airport. To do so, I selected the top 20 airports for 2013 in terms of number of flights (solving all airport delays in Sitka, AL with about 100 flights / month would be far less impactful than solving 1% of LAX’s delays). Then, I spotted the airport with the highest aircraft delay ratio: Newark Liberty International.
In a new view, I use a similar process to select the top 10 airlines in Newark for 2013. I have now identified ExpressJet Airlines as the highest potential for improvement in flight delays. The question now becomes: considering the high volatility of flights and delays, how could they know they improved their performance? Let’s predict what it should be and use this as a target.
Based on the above chart, we can read that if all conditions were equal, about 36.41% of the flights operated by Frontier Airlines to Newark Liberty airport in December 2013 should be impacted by delays. Multiplying this with the actual number of flight for the same carrier, airport, and period will give management and employees a target to measures themselves against.
I really like the fact that SAP Lumira allows us to analyze past data. If I were using SAP HANA as a source instead of just flat files, I could be able to connect to the source website and regularly refresh the data as they get published. Also, the analysis could be easier by normalizing the data in the OLAP engine (specifically, having the delay type as an attribute instead of several measures).
Now that I have upgraded and cleaned up the data, I can already imagine more questions to answer: is there a geographic correlation for the weather-related flight delays, especially summer versus winter? What is the economic impact of flight delays for the passengers (by multiplying the number of minutes lost by some kind of national salary average)? Also, I haven’t yet explored the new “Compose” feature of SAP Lumira and this would probably generate new ideas.
In your own opinion, which questions would you like to see answered by this dataset or an extension of it? Don’t hesitate to send me your email address so I can share the dataset and visualizations with you through SAP Lumira Cloud.