Former Member

Data Geek III – Analyzing Accidents Data

This blog is part of DataGeek III under House of Spirits – Caring for social good. The data behind our infographic and storyboard is an OData feed published by BigML on Windows Azure Marketplace, where it is available for free. We combined it with other CSV data sets to arrive at our results. Following is a brief account of what we have done.

We derived new features from the existing data set, such as month, population and population density, to find out how these variables affect fatalities. We cleaned the data and ran all the exploratory analyses in SAP Predictive Analysis (SAP PA) in conjunction with R.
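As a rough illustration of this kind of feature derivation, here is a minimal Python sketch. It is not the SAP PA workflow we used. The field names, the lookup table and all the numbers in it are hypothetical placeholders, not values from the actual data sets.

```python
from datetime import datetime

# Hypothetical accident records; field names are illustrative only.
accidents = [
    {"state": "TX", "date": "2011-07-04", "fatalities": 2},
    {"state": "TX", "date": "2011-12-24", "fatalities": 1},
    {"state": "VT", "date": "2011-07-15", "fatalities": 1},
]

# Hypothetical lookup standing in for a supplementary CSV file.
# Population and area are made-up round numbers, not real statistics.
state_info = {
    "TX": {"population": 25_000_000, "area_sq_mi": 260_000},
    "VT": {"population": 600_000, "area_sq_mi": 9_000},
}


def add_features(record, lookup):
    """Return a copy of the record with month, population and density added."""
    info = lookup[record["state"]]
    enriched = dict(record)
    # Month is derived from the accident date.
    enriched["month"] = datetime.strptime(record["date"], "%Y-%m-%d").month
    # Population comes from the lookup; density is population per unit area.
    enriched["population"] = info["population"]
    enriched["population_density"] = info["population"] / info["area_sq_mi"]
    return enriched


enriched_accidents = [add_features(r, state_info) for r in accidents]
```

Each enriched record then carries the extra columns alongside the original ones, ready for grouping by month or comparing against density.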

In the past 5 years, more than 30,000 people have fallen victim to road accidents each year in the USA alone. This is a huge loss, both in lives and in billions of dollars. So we ask some questions to understand the underlying relationships, with the aim of helping to prevent such man-made disasters and save lives.

We used this data to answer a few questions and uncover a few interesting patterns in accidents that occurred in the USA.

Questions that we answered:

  1. What are the major factors that influence road accidents?
  2. How much can human behavior like alcohol consumption and drug intake affect driving skills?
  3. Which roads and states have seen the most accidents?
  4. What can you do to reduce chances of an accident?

We used a few statistical tests and models, which helped us arrive at our conclusions:

Chi-squared test:

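The Chi-squared test was used to see whether there is a significant association between two categorical factors. Factors that are associated with the outcome help in predicting it more accurately. The test returns three values: the Chi-squared statistic, the P-value and the degrees of freedom. The statistic compares the observed frequencies with which the categories of the two variables occur together against the frequencies expected if the variables were independent. A high value, relative to the degrees of freedom, is evidence of an association; it does not by itself measure how strong the association is. The P-value is the probability of seeing a statistic at least this large if the null hypothesis (that the two factors are independent) were true. A P-value lower than 0.05 is the conventional threshold for rejecting the null hypothesis.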
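To make the three returned values concrete, here is a minimal Python sketch of Pearson's Chi-squared test of independence using only the standard library. It is an illustration, not the SAP PA or R routine we ran. The function names `chi_squared_test` and `chi2_sf` and the counts in the example table are made up for this sketch. It applies no continuity correction, so for 2x2 tables its result can differ slightly from tools that apply one by default.

```python
import math


def chi2_sf(x, dof):
    """Upper-tail probability P(X >= x) for a chi-squared variable."""
    if dof < 1:
        raise ValueError("degrees of freedom must be at least 1")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if dof % 2 == 0:
        # Even dof: exp(-x/2) * sum over i < dof/2 of (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, dof // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd dof: erfc(sqrt(x/2)) plus a finite sum of correction terms.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)  # (x/2)^(1/2) / Gamma(3/2)
    for i in range(1, (dof - 1) // 2 + 1):
        total += math.exp(-half) * term
        term *= half / (i + 0.5)
    return total


def chi_squared_test(table):
    """Return (statistic, p_value, dof) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two factors were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, chi2_sf(statistic, dof), dof


# Made-up counts: rows = alcohol involved (yes / no), columns = fatal (yes / no).
example_table = [[30, 10], [20, 40]]
statistic, p_value, dof = chi_squared_test(example_table)
```

For this invented table the P-value falls well below 0.05, so the null hypothesis of independence would be rejected.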

The R-CNR Decision Tree:

A decision tree model predicts an outcome by applying a sequence of simple tests to the input variables. Each test compares one variable against a value and sends the record down one branch or the other, until a leaf gives the prediction. Given a set of variables, we can predict the probability of an outcome (like a fatal crash). We built such a model, called the R-CNR Tree, in SAP Predictive Analysis, taking into consideration factors such as blood alcohol level, type of road and age of driver. The model lets us estimate the chances of a crash for a given set of conditions beforehand, which may help prevent an accident.
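To show what one such test looks like, here is a minimal Python sketch of how a single split can be chosen on one numeric variable. It uses Gini impurity, a common criterion in classification trees; we have not verified which criterion the R-CNR Tree applies internally, so treat that choice as an assumption. The function names and the toy data are invented for illustration and are not taken from our data set.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Find the threshold on one numeric feature with the lowest weighted Gini.

    Returns (threshold, weighted_impurity). The threshold is None when no
    split improves on the impurity of the unsplit data.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for k in range(1, n):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # equal values cannot be separated by a threshold
        # Candidate threshold: midpoint between two neighbouring values.
        threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [label for _, label in pairs[:k]]
        right = [label for _, label in pairs[k:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[1]:
            best = (threshold, impurity)
    return best


# Toy data invented for illustration: blood alcohol level and fatal flag.
blood_alcohol = [0.00, 0.00, 0.02, 0.05, 0.09, 0.12, 0.15, 0.20]
fatal = [0, 0, 0, 0, 1, 1, 0, 1]

threshold, impurity = best_split(blood_alcohol, fatal)
```

A full tree repeats this search over every variable, splits on the best one, and then recurses into each branch until the leaves are pure enough or a size limit is reached.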

Here are a few screenshots of our analysis:

A few interesting stats that we found about accidents:


More stats:


How does it stack across gender?


How does it stack across various cities?


Month-wise split of accidents:


More facts by month:


Who is more likely to face a fatal accident?


Major reasons causing accidents:


      1 Comment
      Former Member

      Great work. How did you do the Chi-squared test in Lumira? I could not find any options for hypothesis testing. Thanks