Technical Articles
Top Bottom Analysis using R Visualizations (part 2)
In part 1 we saw how to build with a single R visualization multiple top-n bar charts against a BI Model. Part 2 shows different top-n analyses from an individual stacked bar chart. It concludes with a top bottom analysis. For each example presented below there is no need to define a BI model because the dataset used is directly available within the R environment that SAP Analytics Cloud provides.
Here are the packages and data required for this second blog.
library(dplyr)
library(tidyr)
library(highcharter)
my_table <- ftable(UCBAdmissions,row.vars="Dept",col.vars=c("Gender","Admit"))
my_table
Most popular departments
From the UCB Admissions data, we want to show the number of applicants by department and gender.
We set the labels for the legend, and the colors.
seg1_name = "Male"
seg2_name = "Female"
color1 = "#2274c1"
color2 = "#f59000"
We turn the table object into a data frame object, reshape the data (aggregation, pivot) and assign generic column names as well as a chart title.
df <- as.data.frame(my_table) %>%
group_by(Dept, Gender) %>%
summarise(Frequency = sum(Freq)) %>%
ungroup %>%
mutate(Dept = paste("Dept",Dept)) %>%
spread(key = Gender, value = Frequency) %>%
select(Bar_name = Dept, Segment1 = Male, Segment2 = Female)
title_caption = "UC Berkeley applicants by department (1973)"
We plot the prepared data in a stacked bar chart.
highchart() %>%
hc_chart(type = "column") %>%
hc_title(text = title_caption,
margin = 20, align = "center",
style = list(color = "black", fontWeight = "bold")) %>%
hc_plotOptions(column = list(stacking = "normal",
dataLabels = list(enabled = TRUE))) %>%
hc_legend(align = "left", layout = "vertical", verticalAlign = "top",
x=0, y= 250, symbolRadius = 0 ) %>%
hc_xAxis(categories = df$Bar_name,
tickLength = 0, lineWidth = 3, lineColor = "black",
labels = list(style = list(fontSize = "14px", fontWeight = "bold"))) %>%
hc_yAxis(title = list(text = "Number of applicants"),
stackLabels = list(color = "black", fontWeight = "bold", enabled = TRUE)) %>%
hc_tooltip(enabled = FALSE) %>%
hc_add_series(name=seg1_name, data = df$Segment1, color = color1) %>%
hc_add_series(name=seg2_name, data = df$Segment2, color = color2)
Switching to percent stacking generates normalized bars and allows us to better compare departments.
title_caption = "Percentage of male versus female applicants by department"
highchart() %>%
hc_chart(type = "column") %>%
hc_title(text = title_caption ,
margin = 20, align = "center",
style = list(color = "black", fontWeight = "bold")) %>%
hc_plotOptions(column = list(stacking = "percent",
dataLabels = list(enabled = TRUE, format = "{percentage:,.1f}%"))) %>%
hc_legend(align = "left", layout = "vertical", verticalAlign = "top",
x=0, y= 250, symbolRadius = 0 ) %>%
hc_xAxis(categories = df$Bar_name,
tickLength = 0, lineWidth = 3, lineColor = "black",
labels = list(style = list(fontSize = "14px", fontWeight = "bold"))) %>%
hc_yAxis(title = list(text = "Percentage of applicants"),
labels = list(format = "{value}%")) %>%
hc_tooltip(headerFormat = "<b> {point.key}</b><br>",
pointFormat = "Number of {series.name} applicants: {point.y}<br> Total number: {point.total}<br>Percent of total: {point.percentage:,.2f}%") %>%
hc_add_series(name=seg1_name, data = df$Segment1, color = color1) %>%
hc_add_series(name=seg2_name, data = df$Segment2, color = color2)
From the stacked bar chart showing all the departments, we can derive a top-n chart.
We compute the total number of each bar and then apply the top_n R function.
top_limit = 2
df <- df %>%
mutate(Total = Segment1 + Segment2) %>%
top_n(top_limit, Total) %>%
arrange(desc(Total))
To draw a top-n chart with absolute values, we reuse the code snippet of our first chart.
title_caption = paste("The", top_limit, "most popular departments")
# Insert here the code snippet used for the 1st chart
If you need a chart with percentage values, reuse the code snippet of the second chart.
Easiest and hardest departments
How easy or not easy it is to get into a department?
Let’s answer that question using the admission data we have.
We set the labels for the legend of the two admission statuses and their colors.
seg1_name = "Admitted"
seg2_name = "Rejected"
color1 = "green"
color2 = "red"
We aggregate the data by admission status and sort it from highest rejection rate to lowest.
df <- as.data.frame(my_table) %>%
group_by(Dept, Admit) %>%
summarise(Frequency = sum(Freq)) %>%
ungroup %>%
mutate(Dept = paste("Dept",Dept)) %>%
spread(key = Admit, value = Frequency) %>%
select(Bar_name = Dept, Segment1 = Admitted, Segment2 = Rejected) %>%
mutate(Total = Segment1 + Segment2) %>%
mutate(Percent_Segment2 = Segment2 / Total) %>%
arrange(desc(Percent_Segment2))
We plot it.
title_caption = "Hardest to easiest departments to get into"
# Insert here the code snippet used for the 2nd chart
We now want to check the proportion of women applying to the easiest and the hardest departments.
The code below refers to the data frame we have defined for the previous chart. It identifies the hardest and the easiest departments and bring them together.
top_limit = 2
# top
tpn <- df %>%
top_n(top_limit, Percent_Segment2) %>%
mutate(Position = "Hardest") %>%
select(Position, Bar_name)
# bottom
btn <- df %>%
top_n(-top_limit, Percent_Segment2) %>%
mutate(Position = "Easiest") %>%
select(Position, Bar_name)
flt = union(tpn,btn)
We define a final data frame for the graph to be built.
df <- as.data.frame(my_table) %>%
group_by(Dept, Gender) %>%
summarise(Frequency = sum(Freq)) %>%
ungroup %>%
mutate(Dept = paste("Dept",Dept)) %>%
spread(key = Gender, value = Frequency) %>%
select(Bar_name = Dept, Segment1 = Male, Segment2 = Female) %>%
mutate(Total = Segment1 + Segment2) %>%
mutate(Bar_size = round(Segment2 / Total * 100 ,2)) %>%
inner_join(flt)
We group the departments by their category and set the color of the bars as well as the chart title.
grouping <- df %>%
group_by(name = Position) %>%
do(categories = .$Bar_name) %>%
list_parse()
color_bar = "#f59000"
title_caption = paste("Women applicants in the ",top_limit, "easiest and hardest departments")
We plot the prepared data.
hchart(df, "column", hcaes(x = Bar_name, y = Bar_size)) %>%
hc_title(text = title_caption, style = list(fontWeight = "bold")) %>%
hc_plotOptions(column = list(color = color_bar,
dataLabels = list(enabled = TRUE, format = "{y}%"))) %>%
hc_xAxis(categories = grouping, title=list(text = NULL),
tickLength = 0, lineWidth = 3, lineColor = "black",
labels = list(
style = list(fontSize = "14px", fontWeight = "bold"),
groupedOptions = list(list(style = list(fontSize = "18px"))) )) %>%
hc_yAxis(title = list(text = "Percent female applicants"),
labels = list(format = "{value}%"),
allowDecimals = FALSE) %>%
hc_tooltip(enabled = FALSE)
Continue to part 3