Here is my first submission for Data Geek Challenge 2…
The objective of my analysis is to see whether there is any correlation between food habits, exercise, and breast cancer in females aged 16-64 in the US.
Let's see if the data shows any correlation between smoking, obesity, eating habits, and breast cancer across US states. I will be using Predictive Analysis. I got the data points from the link below as CSV and did a little cleansing.
Here is the source I got some of my data from.
Now let's import the CSV data into Predictive Analysis, and then we will merge the data for the possibly related factors with it.
Once the data is acquired, we have to enrich the time hierarchy data and the geographic data, in this case the states.
Next I create a geographic hierarchy (Region) for the states so that I can use the dimension in the geo maps. I assign the unmapped states manually.
Now we acquire the other parameters and merge them in, based on the state dimension.
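For readers who want to reproduce the merge step outside the tool, here is a minimal sketch in plain Python. It is not what Predictive Analysis does internally. The column names (`state`, `incidence_rate`, `obesity_pct`, `smoking_pct`) and all the numbers are hypothetical placeholders, not values from my data set.

```python
# Hypothetical rows: names and numbers are placeholders, not the real data.
incidence = [
    {"state": "Alabama", "incidence_rate": 1.0},
    {"state": "Alaska", "incidence_rate": 2.0},
    {"state": "Arizona", "incidence_rate": 3.0},
]
factors = [
    {"state": "Alabama", "obesity_pct": 10.0, "smoking_pct": 20.0},
    {"state": "Arizona", "obesity_pct": 30.0, "smoking_pct": 40.0},
]


def merge_on_state(left, right, key="state"):
    """Left join: keep every row of `left`, add matching columns from `right`."""
    lookup = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = lookup.get(row[key], {})
        extra = {k: v for k, v in match.items() if k != key}
        merged.append({**row, **extra})
    return merged


merged = merge_on_state(incidence, factors)

# States that found no partner in the second file need a manual look,
# much like the unmapped states in the geographic hierarchy.
unmatched = [row["state"] for row in merged if "obesity_pct" not in row]

print(merged)
print("States without a match:", unmatched)
```

A left join keeps every state from the incidence file, so a state missing from the second file shows up in `unmatched` instead of silently disappearing.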
Now let's start by plotting just the % incidence numbers by state in a geo map. Here is the chart, which looks pretty much the same for all the states.
Now here comes the need for some basic predictive analysis. Having said that, let me be very clear: I am a beginner in predictive analysis, though I worked with R and HANA a while ago. We will normalize the data with a min-max scaling algorithm, with the new maximum and minimum set to 1 and 0.
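Min-max scaling maps each value x to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. Here is a minimal sketch of that formula in plain Python. The function name `min_max_scale` and the sample numbers are my own placeholders, and the tool's built-in component may handle edge cases differently.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale `values` linearly so they span [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no spread to scale (assumed convention).
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


# Placeholder numbers, not the real incidence rates.
rates = [110.0, 120.0, 130.0]
scaled = min_max_scale(rates)
print(scaled)  # [0.0, 0.5, 1.0]
```

Scaling does not change the ordering of the states. It only stretches small differences over the full 0 to 1 range, which is why the map becomes easier to read afterwards.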
Now plotting the scaled data shows that the data has been scaled by the algorithm, and it gives a better idea of the data.
That's a good start.
Now let's create a bubble chart of the Annual Incidence Rate per 100,000 population against the % of Obesity and % of Smoking for the top 10 incidence rates. For this we had already merged the data from another spreadsheet. Here is how it comes out.
OK, let's run a regression of the incidence rate against Obesity % and plot the output. This is open for interpretation.
Now plotting the linear regression with the Obesity numbers.
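To make the regression step a bit more concrete, here is a minimal sketch of a one-variable least-squares fit in plain Python. It assumes a simple linear model (incidence = slope * obesity + intercept), which may not be exactly what the tool's regression component fits. The names `fit_line` and `predict` and the sample numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predicted y for a given x on the fitted line."""
    return slope * x + intercept


# Placeholder numbers chosen to lie exactly on y = 2x + 1, not real data.
obesity_pct = [1.0, 2.0, 3.0, 4.0]
incidence = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(obesity_pct, incidence)
print(slope, intercept)
print(predict(5.0, slope, intercept))
```

With real state-level data the points will not sit on the line, so the fitted slope only summarizes a trend and still needs interpretation.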
Now, out of curiosity, let's see if we can find any relationship between Obesity and Smoking using the chart. It looks like there is a pretty good correlation just by looking at it. Now let's find out what the correlation coefficient is. Now let's do a regression.
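The correlation coefficient mentioned above is, I assume, the usual Pearson coefficient, which ranges from -1 (perfect downward trend) to +1 (perfect upward trend). Here is a minimal sketch of it in plain Python. The function name `pearson_r` and the sample numbers are placeholders, not my real data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Placeholder numbers, not real state percentages.
obesity_pct = [1.0, 2.0, 3.0, 4.0]
smoking_pct = [2.0, 4.0, 6.0, 8.0]   # rises with obesity_pct
fruit_pct = [8.0, 6.0, 4.0, 2.0]     # falls as obesity_pct rises

r_up = pearson_r(obesity_pct, smoking_pct)
r_down = pearson_r(obesity_pct, fruit_pct)
print(r_up, r_down)
```

A value close to +1 or -1 backs up what the chart suggests visually, while a value near 0 would mean the apparent pattern is weak.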
Now plotting the predicted numbers in a bar chart shows there is some correlation. So we can infer that if a person is a smoker, there is a higher chance that he is obese.
Now let's plot it against the % of people eating fruit 5 times a day and look for any correlation. It looks like people who are obese and smoke most likely eat less fruit and vegetables, as there is clearly a downward trend. So that's great; it just confirms what we normally think.
Now let's quickly look at any relationship with the Insured Percentage before any analysis. Again, it's quite obvious that people who are obese and eat less fruit and veg tend to be less insured as well.
As I mentioned earlier, this is my first post here. I will have some more posts once I have something more to share. Thanks for reading.