martinboeckling · ‎03-07-2022

Introduction

This blog post is the third part of the blog series covering the question analysis project. The first part covers the data gathering of the questions. The second part covers the e-mail automation for open questions. In this blog post we will examine how the questions can be clustered and how to integrate the results into an SAP Analytics Cloud (SAC) dashboard. It focuses solely on the clustering approach and its implementation in an SAC dashboard. The integration of the database into SAC and other fundamentals of SAC are not part of this blog post, as they are covered in other blog posts. Special thanks go to sarah.detzler, who came up with the idea of building a network to improve the initial clustering performance.

In the following section the clustering approach that I used for the project is discussed and presented.

Clustering approach

To cluster questions, we first need to define which part of a question forms the basis of the clustering. Looking at the data extracted from the SAP Community, we can differentiate between the following data fields in the question area.

SAP Community question structure

Name of user

Last update of question

Title of question

Content of question

Assigned Tags (user defined and predefined tags)

Similar questions (similar questions to posted question)

Of these data fields, the most informative one is the question content, where the author describes the question or problem they want to solve. The question content is therefore the basis of our clustering approach. During the project, the data showed that classical clustering approaches unfortunately do not produce good clusters. The reason is probably that we are looking at only a little more than 450 questions. For that reason, I shifted from a classical clustering approach to a graph-based clustering approach.

Graph based clustering approach

As mentioned above, the approach is graph-based, so we need to define the structure of our graph. In the first blog post of the series, we defined the topic of a question as the five most important words extracted from the question. A graph consists of two main components:

Nodes: the fundamental units that form the base of the graph

Edges: the connections between nodes

As the base of the graph, I selected the extracted keywords as the nodes. An edge between two keywords is created if the keywords occur in the same question. Whenever a keyword is shared by several questions, it connects the other keywords of those questions with each other. An example of how two questions are connected by their keywords can be found in the following image.

Connection of two questions by their keywords
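The connection described above can also be reproduced with a small sketch. The two questions and their keywords are made up for illustration, and I use three keywords per question instead of five to keep it short; only the igraph calls are real.

```r
library(igraph)

# Two hypothetical questions (made-up keywords, not project data);
# "hana" appears in both and therefore links the two keyword groups
q1 <- c("hana", "calculation view", "join")
q2 <- c("hana", "sac", "live connection")

# One edge per keyword pair within a question
edge_matrix <- rbind(t(combn(q1, 2)), t(combn(q2, 2)))
toy_graph <- graph_from_edgelist(edge_matrix, directed = FALSE)

# "join" and "sac" never share a question,
# but they are reachable from each other via "hana"
distances(toy_graph, v = "join", to = "sac")
```

The shared keyword is what turns separate groups of keywords into one connected network.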

Now that we have covered the theory of how we want to construct our graph, we will look in detail at how it can be implemented in code.

Constructing graph

As we ultimately want to visualize our graph in SAC, the graph needs to be constructed in a format that can be integrated into SAC. Therefore, the graph is created in R, as SAC offers the possibility to use R for visualizations. The complete creation of the graph is set up in an SAC dashboard.

Creation of nodes

The nodes build the base of every graph. For our use case, nodes are represented by keywords. The set of all nodes is the unique list of all keywords present in the data. The code to create the nodes is therefore relatively simple: the data is transformed into a data frame in which one column represents the node. As we only want unique nodes, we aggregate by keyword. The script also joins the cluster assignment (cluster_df), which is created in the clustering section further below. The following script is used:

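# load the packages used throughout this post
library(dplyr)
library(igraph)
library(visNetwork)

# create base of nodes: one row per unique keyword
# (data holds the extracted keywords; cluster_df is created in the clustering section)
nodes <- as.data.frame(as.vector(data))
nodes <- nodes %>%
  rename(label = `as.vector(data)`) %>%
  group_by(label) %>%
  # title is shown as tooltip: how often the keyword occurs
  summarise(title = paste('Questions:', n())) %>%
  mutate(id = label) %>%
  # attach the cluster id of each keyword
  left_join(cluster_df, by = "label") %>%
  rename(group = V1) %>%
  mutate(group = as.character(group)) %>%
  na.omit()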

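The data frame has the following columns:

label: the keyword, displayed as the name of the node

title: the number of occurrences of the keyword, displayed as tooltip

id: the identifier of the node, identical to the label

group: the cluster the keyword belongs to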

A network also needs edges that connect the nodes. In the following section the creation of the edges between the nodes is covered.

To create the edges, the keywords of each question are combined pairwise. A question with the keywords keyword1, keyword2, keyword3, keyword4 and keyword5 therefore yields the following ten edges, where each bracket represents one edge between two nodes:

(keyword1, keyword2)

(keyword1, keyword3)

(keyword1, keyword4)

(keyword1, keyword5)

(keyword2, keyword3)

(keyword2, keyword4)

(keyword2, keyword5)

(keyword3, keyword4)

(keyword3, keyword5)

(keyword4, keyword5)

To get the pairwise combinations of the keyword list of each question, the following script is needed.

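# one row of data = one question; combn() returns all keyword pairs of that row
# seq_len(nrow(data)) iterates over every question, including the last one
edges <- do.call(rbind.data.frame, lapply(seq_len(nrow(data)), function(i) t(combn(data[i,], 2))))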
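To make the edge creation easier to follow, here is a minimal sketch on a made-up keyword matrix with two questions. The names toy_data and toy_edges are hypothetical, and naming the columns from and to is my assumption here; it matches the column names visNetwork expects for an edge list.

```r
# Hypothetical keyword matrix: one row per question, five keywords per row
toy_data <- matrix(
  c("k1", "k2", "k3", "k4", "k5",
    "k1", "k6", "k7", "k8", "k9"),
  nrow = 2, byrow = TRUE
)

# All pairwise combinations per row, stacked into one data frame
toy_edges <- do.call(
  rbind.data.frame,
  lapply(seq_len(nrow(toy_data)), function(i) t(combn(toy_data[i, ], 2)))
)

# Assumption: name the two columns the way visNetwork expects them
colnames(toy_edges) <- c("from", "to")

# choose(5, 2) = 10 edges per question
nrow(toy_edges)
```

Both questions contain k1, so the resulting edge list already connects the keywords of the two questions through that node.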

The graph can now be constructed by using the following script:

graph <- graph_from_data_frame(edges, directed = FALSE)

After constructing the graph, we want to assign groups to the nodes and cluster them accordingly. In the next section the clustering approach is outlined.

Implementation of clustering approach

For the clustering we use the already imported igraph package to detect clusters of keywords within the graph. Before clustering, the graph has to be simplified, i.e. loops and multiple edges are removed, because the clustering algorithm used below only works on graphs without multiple edges.

graph <- igraph::simplify(graph)
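What simplify does can be seen on a small made-up edge list. This is only a sketch with hypothetical data, not part of the project code.

```r
library(igraph)

# Made-up edge list: the pair a-b occurs twice (two questions share it)
# and c-c is a self-loop
raw <- graph_from_data_frame(
  data.frame(from = c("a", "a", "b", "c"),
             to   = c("b", "b", "c", "c")),
  directed = FALSE
)

# Defaults: remove.multiple = TRUE, remove.loops = TRUE
clean <- igraph::simplify(raw)

ecount(raw)      # edges before simplification
ecount(clean)    # edges after simplification
is_simple(clean) # TRUE if neither loops nor multiple edges remain
```

Note that simplifying discards how often a keyword pair occurred together. If that count matters, one option would be to store it as an edge attribute before simplifying and to combine it via the edge.attr.comb argument of simplify.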

For the clustering of the network, I decided to use the fast greedy clustering algorithm, as it produced the most reasonable clusters for the underlying graph. A comparison of the different approaches implemented in igraph can be found in the following Stack Overflow post. The algorithm was originally described in the following scientific paper by A. Clauset, M. E. J. Newman and C. Moore.

The code below shows the clustering and the assignment of the cluster id to each node.

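# extract cluster from graph
cluster <- cluster_fast_greedy(graph)
# membership() maps each keyword to its cluster id; data.frame() turns the
# keywords into column names and replaces spaces with dots
cluster_df <- data.frame(as.list(membership(cluster)))
# transpose: one row per keyword, cluster id in column V1
cluster_df <- as.data.frame(t(cluster_df))
cluster_df <- cluster_df %>%
  mutate(label = rownames(cluster_df)) %>%
  # restore the spaces in multi-word keywords
  mutate_if(is.character,
            stringr::str_replace_all, pattern='\\.', replacement=' ')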
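The behavior of the fast greedy algorithm is easiest to see on a tiny made-up graph: two triangles joined by a single edge. The sketch also shows an alternative way to build the membership table directly from the named vector that membership() returns, which avoids the detour through column names. This is an assumption on my side and not the code used in the project.

```r
library(igraph)

# Made-up graph: two triangles (a-b-c and d-e-f) joined by the edge c-d
toy <- graph_from_data_frame(
  data.frame(
    from = c("a", "a", "b", "c", "d", "d", "e"),
    to   = c("b", "c", "c", "d", "e", "f", "f")
  ),
  directed = FALSE
)

toy_cluster <- cluster_fast_greedy(toy)

# membership() returns a named vector:
# the names are the keywords, the values are the cluster ids
m <- membership(toy_cluster)
toy_cluster_df <- data.frame(
  label = names(m),
  V1 = as.integer(m),
  stringsAsFactors = FALSE
)

toy_cluster_df
```

Each triangle ends up in its own cluster, because merging them across the single bridge edge would lower the modularity.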

In most cases, clustering assigns a consecutive number to each identified cluster. What the topic of a cluster is depends mostly on human interpretation. For this project I decided on a different approach to naming the identified clusters: I identify the most important node in a cluster and assign the name of that node to the whole cluster.

Identifying the most important node, or ranking all nodes in a graph, is covered by the theory of centrality measures. In this blog post I will not cover all centrality measures but will only briefly explain the one I selected for this use case, betweenness centrality. Betweenness centrality measures how often a node lies on the shortest paths between pairs of other nodes in a network and ranks the nodes accordingly.
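A small sketch with made-up keywords illustrates the measure. In a star-shaped cluster, every shortest path between two outer keywords runs through the center, so the center gets the highest score.

```r
library(igraph)

# Made-up star-shaped cluster: "hana" is linked to four other keywords,
# which are not linked to each other
star <- graph_from_data_frame(
  data.frame(from = "hana", to = c("join", "view", "sac", "odata")),
  directed = FALSE
)

scores <- betweenness(star)
scores

# The keyword with the highest score would give the cluster its name
names(which.max(scores))
```

With four outer keywords there are choose(4, 2) = 6 pairs, and all six shortest paths pass through the center, while no shortest path passes through an outer keyword.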

Below you can find the R script that identifies the most important node in each cluster and returns its name as the cluster name.

# name each cluster after its most central keyword
clusterNameCreation <- function(nodeCluster) {
  # keywords that belong to the given cluster
  subnodes <- nodes %>%
    filter(group == nodeCluster)
  vids <- as.character(unlist(subnodes$id))
  # subgraph that contains only the nodes of this cluster
  subgraph <- induced_subgraph(graph, vids)
  # betweenness centrality of every node within the cluster
  betweenessScore <- as.data.frame(betweenness(subgraph))
  betweenessScore <- betweenessScore %>%
    mutate(clusterName = rownames(betweenessScore), group=nodeCluster) %>%
    # keep the node with the highest score; slice(1) resolves ties
    filter(`betweenness(subgraph)`==max(`betweenness(subgraph)`)) %>%
    select(clusterName, group) %>%
    slice(1)
  return(betweenessScore)
}

# apply the function to every cluster id and stack the results
clusterOverview <- do.call(rbind.data.frame, lapply(as.vector(unique(nodes$group)), clusterNameCreation))
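The script above only builds the overview of cluster names. One possible way to bring the names back to the nodes, so that the group column carries the name instead of the number, is sketched below. This step is an assumption on my side; the data frames toy_nodes and toy_overview are made-up stand-ins for nodes and clusterOverview.

```r
library(dplyr)

# Made-up stand-ins for the nodes and clusterOverview data frames
toy_nodes <- data.frame(
  id = c("hana", "join", "sac", "story"),
  label = c("hana", "join", "sac", "story"),
  group = c("1", "1", "2", "2"),
  stringsAsFactors = FALSE
)
toy_overview <- data.frame(
  clusterName = c("hana", "sac"),
  group = c("1", "2"),
  stringsAsFactors = FALSE
)

# Replace the numeric cluster id with the name of the cluster
toy_named <- toy_nodes %>%
  left_join(toy_overview, by = "group") %>%
  mutate(group = clusterName) %>%
  select(-clusterName)

toy_named
```

Since the visualization script selects clusters by the group column, the drop-down menu would then list the cluster names.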

In total, 28 different clusters were identified in the graph, and a name was assigned to each cluster. The following section explains the visualization of the constructed graph and the created clusters.

Visualization

For the visualization, the first goal is to show the found cluster structure together with the nodes and edges. Second, we want to be able to select a cluster by its name from a drop-down menu. Third, we want to be able to select a node and have its direct neighbors highlighted. These options can be specified with the visNetwork package. A detailed tutorial of the package can be found here.

In the following you find the R script that was used to create the visualization.

# visualize network
visNetwork(nodes, edges, width = "100%") %>%
  visPhysics(solver = "forceAtlas2Based", stabilization = FALSE) %>%
  visNodes(
    shape = "dot",
    color = list(
      background = "#0085AF",
      border = "#013848",
      highlight = "#FF8000"
    )
  ) %>%
  visEdges(shadow = FALSE) %>%
  visOptions(highlightNearest = list(enabled = TRUE, degree = 1),
             selectedBy = "group") %>%
  visLayout(randomSeed = 11)

The result of the visualization is displayed in the following three images. All images show the interaction possibilities the user has with the created visualization.

Complete network visualization of all keywords

Network visualization with cluster selection by the drop-down menu

Network visualization with neighbors highlighted after node selection

SAC implementation

The previously outlined visualization can be implemented in SAP Analytics Cloud by adding an R visualization to a dashboard page. How this can be done is explained in detail in the blog post of yannick_schaper. The beauty of the created dashboard is that the visualization stays interactive, so the end user has all the interaction possibilities described above.

Conclusion

This blog post shows what can be achieved by using data extracted from the community page and visualizing it in SAC. It presents one aspect of the conducted project and is the last part of the blog series on the SAP Community question analysis. Although I implemented all the presented aspects on my own, this project wouldn't have been possible without all the amazing colleagues with whom I discussed the aspects of this project. Special thanks go to sarah.detzler, who brought up the idea of transforming the questions into a network structure. Furthermore, I want to mention my colleague thorsten.hapke, who brought up the idea of implementing an e-mail automation for open questions within the SAP Community. Special thanks also go to britta.jochum, daniel.ingenhaag and yannick_schaper, who constantly gave me feedback during the course of this project.

I also want to encourage you to provide feedback on the project that I conducted. I look forward to receiving your opinion on the project and the blog post series.

SAP Community question analysis (Question Clustering & Visualization)
