Update: after several comments and questions, I have created a revised version.
Please find it here and enjoy:
According to the United States Bureau of Transportation Statistics, 643 million passengers travel over 9 million flights every year in the US only. A significant portion of these passengers are impacted by delays and cancellations. For this SAP Data Geek challenge, we will be analyzing flight data from 2003 to 2013 (more than 172,000 records) with SAP Lumira.
- RITA (Research and Innovative Technology Administration), Bureau of Transportation Statistics
- Localized Airport Codes
Importing data source
First, let’s import the data file into SAP Lumira / Visual Intelligence.
Now, we need to handle the time dimension.
Then, we will convert some columns from Attributes to Measures and rename them.
Once this is done for all measures, we need to group them for easier analysis. This is done via formula.
Now, we can start visualizing these data. First, we need to know if the number of flights, delays and cancelations has evolved over time.
The percentage of delayed or canceled flights is hard to analyze this way. Let’s create a new formula to calculate these ratios.
The result is more clear, but the trend is hard to analyze. Let’s add a running average.
We can now see that on average, the flight delays have increased over the last 10 years. With this information, let’s dig deeper and analyze which type of delay is the highest source of inconvenience for the passengers.
From the above chart, we can easily see that the Air Carriers are the main source of flights delays in the US from 2003 to 2013.
If we add the Carrier Name dimension to the chart and filter the data by only the Carrier Delay, we get the following chart.
Notice that American Airlines and American Eagles are both in the top 5 of the worst carriers. Since both are sharing the same hubs, their difficulties may be related to geography. However, the current dataset is missing exact location like latitude and longitude. To do this, let’s add and merge with a list of all airport codes and corresponding GPS location and create a geographical hierarchy.
Let’s now use this new information and visualize flight delays by geography, filtered by American Airlines and American Eagles.
However, when removing the filter and showing all airlines, it seems like the delays are more related to large airports like Chicago, Dallas or Atlanta.
This analysis doesn’t provide enough information, but brings the idea of analyzing the geography repartition of weather related delays.
Here again, the data at hand doesn’t provide us enough insight, but when weather comes in mind, seasonality is expected. Let’s redraw this chart using timelines.
Weather delays are indeed seasonal. However, surprisingly, weather delays cannot only be limited to snow and storms since May, June and July are also very impacted, probably due to thunderstorms and / or hurricane season.
As a conclusion, I want to try and estimate the chances for my American Airlines flight to SAP TechEd in Las Vegas on October 2013 of being delayed or canceled. Since the dataset end on June 2013, I will need to use Predictive Analytics to estimate this.
SAP Lumira’s forecast feature calculates that on a flight with American Airlines to Las Vegas in October, the probability of getting delayed is 0.27%. This is of course a pretty rough estimate. With more time and understanding of the dataset, we could improve this calculation by merging the dataset with other data like weather records and forecasts, event schedules that could impact travel (think Christmas, Thanksgiving, large conferences, etc.). Also, notice that the data at hand indicates the arrival airport, but is lacking the information about the departure airport, which probably has an impact on delays and cancellations.
As a conclusion, this short first hands-on test of SAP Lumira shows a very effective solution for the analyzing, visualization and forecast of large datasets. Go ahead and try it for yourself!