Data Discovery is Dead (as we know it) Part 2
Results from a Machine Learning Driven Data Discovery Approach
Now let’s contrast the manual, visualization-based approach from Part 1 with the results we get from SAP Analytics Cloud’s (SAC’s) machine learning (ML) approach to data discovery. SAC has a feature called Smart Predict, which is an ML-powered predictive automation capability. What’s great about SAC is that it combines BI, Planning, and Predictive in one solution. This means you can derive precise machine learning insights from your data, turn them into a tangible set of concrete actions, and track progress to plan in one single application.
First, let’s create a ‘Predictive Scenario’ in SAC:
Since our target question is ‘which employees left the company (versus stayed), and why?’, we select ‘Classification’, because the target values are categorical and unordered (like a ‘dimension’). In contrast, Regression analysis, which is closely related to Classification, applies when the target is a ‘measure’: a range of continuous or ordered values (for example, if we were asking ‘what is influencing Working Hours Per Week’):
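As a rough mental model of this choice (and not how SAC decides anything internally), the decision boils down to the data type of the target column. The sketch below assumes a pandas DataFrame holding the HR data; the column names are taken from this article.

```python
# Rough illustration of the classification-vs-regression choice, NOT SAC's internal logic:
# the model family follows from the target column's type.
import pandas as pd
from pandas.api.types import is_numeric_dtype

def choose_model_family(df: pd.DataFrame, target: str) -> str:
    """Categorical/unordered target (a 'dimension') -> classification;
    continuous/ordered target (a 'measure') -> regression."""
    col = df[target]
    # A numeric column with many distinct values behaves like a measure.
    if is_numeric_dtype(col) and col.nunique() > 20:
        return "regression"
    return "classification"

# Mirroring the article's two questions (column names assumed):
# choose_model_family(hr_df, "LEAVE_JOB")              -> "classification"
# choose_model_family(hr_df, "WORKING_HOURS_PER_WEEK") -> "regression"
```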
We will point SAC Smart Predict at the same data we used for the manual/visual data discovery and tell Smart Predict which column of data is our ‘Target’. The target is the column that flags the behavior we want to learn about; in this case, whether an employee leaves the company (versus stays). The arrows show what is needed to evaluate the data (a rough code sketch of these three steps follows the list):
Step 1: Input Dataset: ‘HR Churn data – Train’
Step 2: Choose the Target: ‘LEAVE_JOB’
Step 3: Click on the ‘Train’ button to kick off the automated ML driven data discovery process
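Smart Predict needs no code for any of this; purely as a point of comparison, here is roughly what the three steps would look like in an open-source Python stack. The file name mirrors the dataset name above, the target column is LEAVE_JOB (assumed here to be coded 1 = left, 0 = stayed), and the gradient-boosting classifier is my own assumption, not Smart Predict’s algorithm.

```python
# A hedged open-source analogue of Steps 1-3 (Smart Predict itself requires no code).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

# Step 1: input dataset
df = pd.read_csv("HR_Churn_data_Train.csv")    # file name assumed from 'HR Churn data - Train'

# Step 2: choose the target
target = "LEAVE_JOB"                           # assumed coded 1 = left, 0 = stayed
X = pd.get_dummies(df.drop(columns=[target]))  # simple one-hot encoding of the dimensions
y = df[target]

# Step 3: 'Train' - fit a classifier on a training split and keep a holdout for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)
```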
The Results
Within seconds (sometimes minutes, depending on the volume and width of the data) our predictive automation process has completed. The Smart Predict ‘Classification’ algorithm ‘trained’ itself on this data, with the Target in mind, and now we get to see what it learned, if anything.
The very first thing we want to look at is how well the data describes why employees are leaving:
The ‘Predictive Power’ represents how well the descriptive data tells the Employee Turnover story. The descriptive data is all the columns of data that are not the Target. In this case we see a Predictive Power of 87.15%. That’s really good! It means we can trust the results, because our Classification algorithm has found attributes and behaviors in the data that mostly happen when someone leaves the company.
Note: If ‘Predictive Power’ had been in the 15% range, that would tell us the descriptive columns don’t explain what is happening with Turnover. In that case we would look for other attributes, columns, or ‘features’ about the employee that could explain why, and then run the training again to see whether the accuracy improved. Even a low Predictive Power is good to know, as it saves a lot of time that would otherwise be spent building visuals that might be misinterpreted.
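SAC computes Predictive Power with its own formula, so the snippet below is only a rough proxy, continuing the earlier sketch: the Gini coefficient (2 x AUC - 1) on the holdout set, which is 0 for a random model and 1 for a perfect one.

```python
# Rough proxy for 'Predictive Power' on the holdout set (NOT SAC's exact formula).
from sklearn.metrics import roc_auc_score

val_probs = model.predict_proba(X_val)[:, 1]     # probability of the 'left' class
gini = 2 * roc_auc_score(y_val, val_probs) - 1
print(f"Predictive power (proxy): {gini:.2%}")   # a high value means the columns explain turnover well
```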
Now that we know we are looking at the right data, let’s look at why employees are leaving.
Influencer Contributions
Smart Predict opens the box and shows us a list of columns that contribute to employee turnover, ordered from most contributing to least contributing. This is called ‘explainability’, and not all ML tools provide it, but it is a very important part of the process: knowing ‘what’ the ML found and ‘why’ it contributes to our target. Also note that these influencers are another thing we can’t get out of manual data discovery. Smart Predict has found that ‘Working Hours Per Week’ is the biggest influencer on employee turnover, followed by ‘Promotion Interval by Month’, ‘Salary’, and then ‘Tenure’.
Now, remember the Story Board of findings we built from our manual visualization attempt? The ‘Influencer Contributions’ take this a step further and tell us how each of these employee characteristics influences Employee Turnover, in ranked order. It is a precise, mathematically validated list of the KPIs we should be paying attention to. No human can produce these results with a manual visualization tool.
Also notice how the top three contributors were not in our original data discovery because they were measures. Without doing anything extra, our automated ML included them in the analysis. Bonus!!
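Smart Predict derives these influencer contributions with its own technique; as a generic stand-in, permutation importance gives a similar ranked view: how much the model’s score drops when each column is shuffled. This continues the earlier sketch, and the column names in the comment are assumptions based on the article.

```python
# One generic way to rank influencer contributions (not Smart Predict's proprietary method):
# permutation importance on the holdout set.
import pandas as pd
from sklearn.inspection import permutation_importance

result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
ranking = (pd.Series(result.importances_mean, index=X_val.columns)
             .sort_values(ascending=False))
print(ranking.head(5))   # e.g. WORKING_HOURS_PER_WEEK, PROMOTION_INTERVAL, SALARY, TENURE, ...
```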
Going Deeper
SAP’s ML approach takes the ‘Influencer Contributions’ to the next level and tells us not only which KPIs to focus on, but also which ranges of values in each column influence the target behavior (Employee Turnover). For example:
Influencer #1: Working Hours Per Week – With ‘Working Hours Per Week’ we see that employees who work, on average, 43.8 to 48 hours every week are most at risk of leaving.
See below: everything to the right of 0.00 is an influence toward leaving, and everything to the left of 0.00 is an influence toward staying. From top to bottom, the top range is the ‘at risk’ behavior, and as you move toward the bottom you see the ‘buckets’ of hours that represent the employees most likely to stay. So we get both sides of the coin!
Note: What human would come up with ranges like 43.8 to 48 hours per week? Only an ML approach would see patterns in the data this precisely.
Influencer #2: Promotion Interval – We see below that employees who have not been promoted in 22 to 28 months are the most likely to leave, while those promoted within the last 0 to 6 months are the most likely to stay. Now, imagine you are working 43.8 to 48 hours every week AND have not been promoted in 22 to 28 months! That’s a double whammy! Our ML model now adds up all of these influencing factors and scores each employee individually on their likelihood of leaving, based on their unique situation at the company. This new column of information is added to our original data as a % probability of leaving. What’s really neat is that we can now sort our employees by who is most likely to leave or most likely to stay.
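To make both ideas concrete, here is a simple hand-rolled version, continuing the earlier sketch: bucket a continuous influencer and compare each bucket’s leave rate to the overall rate, then score every employee with a probability of leaving and rank them. Smart Predict derives its value ranges automatically; the five quantile buckets and the column names here are just illustrative assumptions.

```python
# 'Which value ranges push toward leaving?': bucket an influencer and compare leave rates.
import pandas as pd

overall_rate = df["LEAVE_JOB"].mean()
buckets = pd.qcut(df["WORKING_HOURS_PER_WEEK"], q=5)           # column name assumed
influence = df.groupby(buckets, observed=True)["LEAVE_JOB"].mean() - overall_rate
print(influence.sort_values(ascending=False))                   # >0 pushes toward leaving, <0 toward staying

# Score every employee with a probability of leaving, then rank the most at-risk first.
df["LEAVE_PROBABILITY"] = model.predict_proba(X)[:, 1]
at_risk = df.sort_values("LEAVE_PROBABILITY", ascending=False)
```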
Yet another thing a manual visualization approach can’t provide.
Results Summary
With the help of ML we now know:
- How well the data tells our Employee Turnover story.
- Out of the 10 columns we analyzed, which employee characteristics we should monitor first.
- Out of those top KPIs, when we should turn a KPI Red, Yellow, or Green based on risky behaviors.
- The precise probability of how likely each employee is to leave.
What’s next? Taking Precise Action
Now that we have our results leveraging predictive automation, how do we move forward with some tangible next steps and actions? We will use something called a ‘Profit Simulation’, which is built into SAC Smart Predict.
What we are going to do here is apply some monetary values, in dollars, to the employee turnover situation. If we know we avoid $10,000 in cost (rehiring and retraining fees) by investing $1,000 to retain an employee, how far down the list of employees (now ranked by how likely they are to leave) should we invest that $1,000? Let’s enter the values, hit the Maximize Profit button, and find out.
According to this, if we spend $1k on the top 19.7% of our employees most likely to leave (186 at-risk employees x $1,000 = $186,000), then we can expect to prevent 85.5% of our potential employee turnover (73 prevented departures x $10,000 = $730,000). Maximize Profit represents the most profitable outcome.
By investing just $186,000 in 186 at-risk employees, we can expect to save $730,000 in rehiring/retraining costs, which translates to a profit of $544,000!
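As a back-of-the-envelope version of the same idea, and continuing the earlier sketch, you can walk down the list of employees ranked by leave probability and find the cutoff that maximizes expected profit. This uses the article’s dollar figures and assumes, as the simulation does, that the $1,000 retention spend prevents each expected departure; it is not SAC’s actual Profit Simulation logic.

```python
# Hand-rolled profit curve: avoided rehiring cost minus retention spend, for every cutoff.
import numpy as np

RETENTION_COST = 1_000    # spend per targeted employee (from the article)
REHIRE_COST    = 10_000   # cost avoided per prevented departure (from the article)

ranked = at_risk["LEAVE_PROBABILITY"].to_numpy()
expected_leavers = np.cumsum(ranked)               # expected departures among the top-k targeted
n_targeted = np.arange(1, len(ranked) + 1)
profit = expected_leavers * REHIRE_COST - n_targeted * RETENTION_COST

best = profit.argmax()
print(f"Target the top {n_targeted[best]} employees "
      f"({n_targeted[best] / len(ranked):.1%}) for an expected profit of ${profit[best]:,.0f}")
```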
How easy would it be to come up with that precise of an ROI with a manual data discovery tool? Is it even possible? I’m not sure I could do that without Smart Predict.
Profit Simulation Summary
To conclude our Profit Simulation, we can now present our findings and ROI to upper management to get approval for the new employee retention program. Also, since we have BI, Planning, and Predictive in the same solution, we can spread that $186k of cost, and the expected benefit of avoiding $730k worth of rehiring costs, over the next 6 to 12 months and track our progress against actuals! All of this from a single web browser. Pretty cool, right?
Overall Key Takeaways
What do we want to remember from this exercise?
- Manual data discovery is no longer a trusted way to gain insights
- Predictive Automation / Automated Data Discovery powered by Machine Learning is a far superior way of getting to the insights:
- Faster and Easier
- More Precise
- More Scalable Across a Wider Range of Information
- ML does not have to be just for data scientists – with SAC Predictive Automation Smart Features, anyone can shift to this new way of thinking
Welcome to a new way of doing things!
So, we're replacing data analyst bias with algorithm (and its implementation) bias?
In this particular example, the top 2 influencers the ML found are fairly obvious reasons people leave; I sure hope that an experienced HR data analyst would see that as well.
I think data quality is still the main problem area in data analysis, and the human eye is still better at connecting the dots.
It's not just math 😉
Machine learning can validate things we already know or suspect, and that’s certainly better than having no validation. Also, knowing how influencers rank against other factors is important; manual visualization exploration won’t give you that perspective. It’s not just one or the other. I think it’s the combination of human knowledge of the business, ML, and data quality.
it's the same problem as with predicting stock market 😉
yes there are algorithms. humans still do it better.
Overall, ML/AI will replace the majority of data analysts. Welcome to the future.
No. I don’t think Data Analysts will be replaced by algorithms. Algorithms augment the data discovery process. We need more Human + Algorithm analysis. I also don’t think predictive automation is a replacement for a data scientist. Different problems have different complexity levels and data scientists still play a vital role in the more complex problems.
I don’t think predicting the stock market is a very accurate comparison for most business problems or questions, either. There is too much variability in the stock market. For example, who knows what Donald Trump is going to tweet next? Nobody but Donald Trump, and yet he can influence the outcomes of the stock market. It's unpredictable. Whether you are a human or a human using an algorithm, variability in a use case where no clear predictability exists is a reality that cannot be avoided in some cases. But business problems tend to have less variability and are more approachable and solvable with ML.
Yes, they will. If the "machine" can give a basic analysis approaching the normal level of a "human" analyst, there is no reason to keep the human. Remember, this is all about profit margins.
Of course, high-end analysts will still handle the complex analysis. The main driver for automation (and this is part of it) is cost/margins.
I agree, stock market analysis was an extreme example.
Thank you for your perspective, Denis!
Maybe we ought to set SAC Smart Predict loose on the POTUS tweets; we just need to tap into a database of influencing factors, i.e. # hours slept, # of cheeseburgers, # of CNN accusations, trending topics in evangelical tweets, trending topics in white supremacist groups, trending global political issues, prime Democratic messaging representatives, etc.
It might be able to come up with influencing factors on his next radical negative or positive event influencing the VIX.