Top Bottom Analysis using R Visualizations (part 4...

marc_daniau · ‎09-19-2019

So far in this series, in part 1, 2 and 3, we have built bar charts on categorical data. One aspect of top bottom analysis we have not yet covered is the case of time series data. That will be the topic of this blog.

We load the R packages for data manipulation and data plotting.

library(dplyr)

library(lubridate)

library(highcharter)

We’ll work with the dataset “txhousing” because it has multiple time series.

data(txhousing, package = "ggplot2")

str(txhousing)

We remove cities with missing data.

incomplete_series <- filter(txhousing, is.na(median)) %>% distinct(city)

my_data <- txhousing %>%

  anti_join(incomplete_series)

We prepare the monthly series of each cities.

df <- my_data  %>%

  mutate(Price = median/1000) %>%

  mutate(Period = make_date(year, month, 1)) %>% 

  select(City = city, Year = year, Month = month, Period, Price)

We plot all cities in a line chart.

hchart(df, 'line', hcaes(x = 'Period', y = 'Price', group = "City")) %>% 

hc_title(text = "Monthly Housing Price in Texan Cities",

         margin = 20, align = "center",

         style = list(color = "black", fontWeight = "bold")) %>%

hc_yAxis(title = list(text = "Price (K$)")) %>%

hc_xAxis(title = list(text = NULL))

This is a busy chart with many lines and a lot of history. Keeping only the top and bottom cities is a possible way to bring clarity to the visualization. This can be achieved within a single line plot or using a first plot for top cities and a second one for bottom cities, each plot having its own Y axis.

We compute the average price by city to then keep the top and bottom cities

agg_city <- df %>% 

   group_by(City) %>%

   summarise(Indicator = mean(Price))  

   

top_limit = 8

tpn_cities <- agg_city %>% 

   top_n(top_limit, Indicator) 



btn_cities <- agg_city %>% 

   top_n(-top_limit, Indicator)

We define two separate data frames for the two plots to be built.

df1 <- df %>% 

     inner_join(tpn_cities) %>% 

     arrange(desc(Indicator), City, Period) %>%

     mutate(City = factor(City, unique(City)))  



df2 <- df %>% 

     inner_join(btn_cities) %>% 

     arrange(desc(Indicator), City, Period) %>%

     mutate(City = factor(City, unique(City)))

We prepare the line plots, p1 and p2.

title_caption = paste("Monthly Housing Price of Top", top_limit, "Cities")

p1 <- hchart(df1, 'line', hcaes(x = 'Period', y = 'Price', group = "City")) %>% 

hc_title(text = title_caption,

           margin = 20, align = "center",

           style = list(color = "black", fontWeight = "bold")) %>%

hc_yAxis(title = list(text = "Price (K$)")) %>%

hc_xAxis(title=list(text = NULL))

		   

title_caption = paste("Monthly Housing Price of Bottom", top_limit, "Cities")

p2 <- hchart(df2, 'line', hcaes(x = 'Period', y = 'Price', group = "City")) %>% 

hc_title(text = title_caption,

           margin = 20, align = "center",

           style = list(color = "black", fontWeight = "bold")) %>%

hc_yAxis(title = list(text = "Price (K$)")) %>%

hc_xAxis(title = list(text = NULL))

We bring them together in one visualization.

library(htmltools)



p <- hw_grid(p1, p2)

browsable(p)

Another way to provide clarity to our busy chart filled with so many lines is to define a seasonal plot.

We aggregate the price by year and month across all cities.

seasonal <- df %>% 

   group_by(Year, Month) %>%

   summarise(Price = mean(Price))

We plot one line per year.

hchart(seasonal, 'line', hcaes(x = 'Month', y = 'Price', group = "Year")) %>% 

hc_yAxis(title = list(text = "Average house price (K$)")) %>%

hc_xAxis(tickInterval = 1) %>%

hc_title(text = "Monthly Housing Price - Seasonal plot",

           margin = 20, align = "center",

           style = list(color = "black", fontWeight = "bold")) %>% 

hc_tooltip(headerFormat = "<b>Year: {series.name}</b><br><b>Month: {point.key}</b><br>",

           pointFormat = "Average house price (K$): {point.y:,.2f}")

We will conclude with a question. What are the top 3 months across all cities and years?

To answer it, we aggregate the price by month and select the months having the highest price.

top_limit = 3

df3 <- df %>% 

   group_by(Month) %>%

   summarise(Bar_size = mean(Price)) %>% 

   top_n(top_limit, Bar_size) %>%

   mutate(Month_name = month(make_date(2019, Month, 1), label=TRUE, abbr=FALSE))

The result is shown in a bar chart.

title_caption = paste("Housing Price - Top", top_limit, "Months")

hchart(df3, "column", hcaes(x = Month_name, y = Bar_size)) %>% 

  hc_xAxis(title=list(text = NULL), tickLength = 0, 

     lineWidth = 3, lineColor = "black",

     labels = list(style = list(fontSize = "14px", fontWeight = "bold"))) %>%

  hc_yAxis(title = list(text = "Average house price (K$)")) %>%

  hc_title(text = title_caption, 

           margin = 40, align = "center",

           style = list(fontWeight = "bold")) %>% 

  hc_tooltip(headerFormat = "<b>{point.key}</b><br>",

             pointFormat = "Average house price (K$): {point.y:,.2f}")

Continue to part 5

Top Bottom Analysis using R Visualizations (part 4)

Get Your SAP HANA Idea Incubator Badge Today!

SCN Mission - SAP HANA Quiz Challenge is now retired

Share your #HANAStory and Win