Data Discovery is Dead (as we know it) Part 2
Results from a Machine Learning Driven Data Discovery Approach
Now let’s contrast the manual, visualization-based approach from Part 1 with the results we get from SAP Analytics Cloud’s (SAC’s) machine learning (ML) approach to data discovery. SAC has a feature called Smart Predict, which is an ML-powered predictive automation capability. What’s great about SAC is that it combines BI, Planning, and Predictive in one solution. This means you can derive precise machine learning insights from your data, turn them into a tangible set of concrete actions, and track progress to plan in one single application.
First, let’s create a ‘Predictive Scenario’ in SAC:
Since our target question is ‘which employees left the company (versus stayed), and why?’, we select ‘Classification’, because the target values are categorical and unordered (like a ‘dimension’). In contrast, Regression analysis, which is closely related to Classification, applies when the target is a ‘measure’: a range of continuous or ordered values (for example, if we were asking ‘what is influencing Working Hours Per Week’):
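As a rough mental model of this choice (and not how SAC decides anything internally), the decision boils down to the data type of the target column. The sketch below assumes a pandas DataFrame holding the HR data; the column names are taken from this article.

```python
# Rough illustration of the classification-vs-regression choice, NOT SAC's internal logic:
# the model family follows from the target column's type.
import pandas as pd
from pandas.api.types import is_numeric_dtype

def choose_model_family(df: pd.DataFrame, target: str) -> str:
    """Categorical/unordered target (a 'dimension') -> classification;
    continuous/ordered target (a 'measure') -> regression."""
    col = df[target]
    # A numeric column with many distinct values behaves like a measure.
    if is_numeric_dtype(col) and col.nunique() > 20:
        return "regression"
    return "classification"

# Mirroring the article's two questions (column names assumed):
# choose_model_family(hr_df, "LEAVE_JOB")              -> "classification"
# choose_model_family(hr_df, "WORKING_HOURS_PER_WEEK") -> "regression"
```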
We will point SAC Smart Predict at the same data we used for the manual/visual data discovery and tell Smart Predict which column of data is our ‘Target’. The target is the column that flags the behavior we want to learn about; in this case, whether an employee leaves the company (versus stays). The arrows show what is needed to evaluate the data (a rough code sketch of these three steps follows the list):
Step 1: Input Dataset: ‘HR Churn data – Train’
Step 2: Choose the Target: ‘LEAVE_JOB’
Step 3: Click on the ‘Train’ button to kick off the automated ML driven data discovery process
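Smart Predict needs no code for any of this; purely as a point of comparison, here is roughly what the three steps would look like in an open-source Python stack. The file name mirrors the dataset name above, the target column is LEAVE_JOB (assumed here to be coded 1 = left, 0 = stayed), and the gradient-boosting classifier is my own assumption, not Smart Predict’s algorithm.

```python
# A hedged open-source analogue of Steps 1-3 (Smart Predict itself requires no code).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

# Step 1: input dataset
df = pd.read_csv("HR_Churn_data_Train.csv")    # file name assumed from 'HR Churn data - Train'

# Step 2: choose the target
target = "LEAVE_JOB"                           # assumed coded 1 = left, 0 = stayed
X = pd.get_dummies(df.drop(columns=[target]))  # simple one-hot encoding of the dimensions
y = df[target]

# Step 3: 'Train' - fit a classifier on a training split and keep a holdout for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)
```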
The Results
Within seconds (sometimes minutes, depending on the volume and width of the data) our predictive automation process has completed. The Smart Predict ‘Classification’ algorithm ‘trained’ itself on this data, with the Target in mind, and now we get to see what it learned, if anything.
The very first thing we want to look at is how well the data describes why employees are leaving:
The ‘Predictive Power’ represents how well the descriptive data tells the Employee Turnover story. The descriptive data is all the columns of data that are not the Target. In this case we see a Predictive Power of 87.15%. That’s really good! It means we can trust the results, because our Classification algorithm has found attributes and behaviors in the data that mostly happen when someone leaves the company.
Note: If ‘Predictive Power’ had been in the 15% range, that would tell us the descriptive columns don’t explain what is happening with Turnover. In that case we would look for other attributes, columns, or ‘features’ about the employee that could explain why, and then run the training again to see whether the accuracy improved. Even a low Predictive Power is good to know, as it saves a lot of time that would otherwise be spent building visuals that might be misinterpreted.
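SAC computes Predictive Power with its own formula, so the snippet below is only a rough proxy, continuing the earlier sketch: the Gini coefficient (2 x AUC - 1) on the holdout set, which is 0 for a random model and 1 for a perfect one.

```python
# Rough proxy for 'Predictive Power' on the holdout set (NOT SAC's exact formula).
from sklearn.metrics import roc_auc_score

val_probs = model.predict_proba(X_val)[:, 1]     # probability of the 'left' class
gini = 2 * roc_auc_score(y_val, val_probs) - 1
print(f"Predictive power (proxy): {gini:.2%}")   # a high value means the columns explain turnover well
```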
Now that we know we are looking at the right data, let’s look at why employees are leaving.
Influencer Contributions
Smart Predict opens the box and shows us a list of columns that contribute to employee turnover, ordered from most contributing to least contributing. This is called ‘explainability’, and not all ML tools provide it, but it is a very important part of the process: knowing ‘what’ the ML found and ‘why’ it contributes to our target. Also note that these influencers are another thing we can’t get out of manual data discovery. Smart Predict has found that ‘Working Hours Per Week’ is the biggest influencer on employee turnover, followed by ‘Promotion Interval by Month’, ‘Salary’, and then ‘Tenure’.
Now, remember the Story Board of findings we built from our manual visualization attempt? The ‘Influencer Contributions’ take this a step further and tell us how each of these employee characteristics influences Employee Turnover, in ranked order. It is a precise, mathematically validated list of the KPIs we should be paying attention to. No human can produce these results with a manual visualization tool.
Also notice how the top three contributors were not in our original data discovery because they were measures. Without doing anything extra, our automated ML included them in the analysis. Bonus!!
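Smart Predict derives these influencer contributions with its own technique; as a generic stand-in, permutation importance gives a similar ranked view: how much the model’s score drops when each column is shuffled. This continues the earlier sketch, and the column names in the comment are assumptions based on the article.

```python
# One generic way to rank influencer contributions (not Smart Predict's proprietary method):
# permutation importance on the holdout set.
import pandas as pd
from sklearn.inspection import permutation_importance

result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
ranking = (pd.Series(result.importances_mean, index=X_val.columns)
             .sort_values(ascending=False))
print(ranking.head(5))   # e.g. WORKING_HOURS_PER_WEEK, PROMOTION_INTERVAL, SALARY, TENURE, ...
```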
Going Deeper
SAP’s ML approach takes the ‘Influencer Contributions’ to the next level and tells us not only which KPIs to focus on, but also which ranges of values in each column influence the target behavior (Employee Turnover). For example:
Influencer #1: Working Hours Per Week – With ‘Working Hours Per Week’ we see that employees who work, on average, 43.8 to 48 hours every week are most at risk of leaving.
See below: everything to the right of 0.00 is an influence toward leaving, and everything to the left of 0.00 is an influence toward staying. From top to bottom, the top range is the ‘at risk’ behavior, and as you move toward the bottom you see the ‘buckets’ of hours that represent the employees most likely to stay. So we get both sides of the coin!
Note: What human would come up with ranges like 43.8 to 48 hours per week? Only an ML approach would see patterns in the data this precisely.
Influencer #2: Promotion Interval – We see below that employees who have not been promoted in 22 to 28 months are the most likely to leave, while those promoted within the last 0 to 6 months are the most likely to stay. Now, imagine you are working 43.8 to 48 hours every week AND have not been promoted in 22 to 28 months! That’s a double whammy! Our ML model now adds up all of these influencing factors and scores each employee individually on their likelihood of leaving, based on their unique situation at the company. This new column of information is added to our original data as a % probability of leaving. What’s really neat is that we can now sort our employees by who is most likely to leave or most likely to stay.
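To make both ideas concrete, here is a simple hand-rolled version, continuing the earlier sketch: bucket a continuous influencer and compare each bucket’s leave rate to the overall rate, then score every employee with a probability of leaving and rank them. Smart Predict derives its value ranges automatically; the five quantile buckets and the column names here are just illustrative assumptions.

```python
# 'Which value ranges push toward leaving?': bucket an influencer and compare leave rates.
import pandas as pd

overall_rate = df["LEAVE_JOB"].mean()
buckets = pd.qcut(df["WORKING_HOURS_PER_WEEK"], q=5)           # column name assumed
influence = df.groupby(buckets, observed=True)["LEAVE_JOB"].mean() - overall_rate
print(influence.sort_values(ascending=False))                   # >0 pushes toward leaving, <0 toward staying

# Score every employee with a probability of leaving, then rank the most at-risk first.
df["LEAVE_PROBABILITY"] = model.predict_proba(X)[:, 1]
at_risk = df.sort_values("LEAVE_PROBABILITY", ascending=False)
```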
Yet another thing a manual visualization approach can’t provide.
Results Summary
With the help of ML we now know:
- How well the data tells our Employee Turnover story.
- Out of the 10 columns we analyzed, which employee characteristics we should monitor first.
- Out of those top KPIs, when we should turn a KPI Red, Yellow, or Green based on risky behaviors.
- The precise probability of how likely each employee is to leave.
What’s next? Taking Precise Action
Now that we have our results leveraging predictive automation, how do we move forward with some tangible next steps and actions? We will use something called a ‘Profit Simulation’, which is built into SAC Smart Predict.
What we are going to do here is apply some monetary values, in dollars, to the employee turnover situation. If we know we avoid $10,000 in cost (rehiring and retraining fees) by investing $1,000 to retain an employee, how far down the list of employees (now ranked by how likely they are to leave) should we invest that $1,000? Let’s enter the values, hit the Maximize Profit button, and find out.
According to this, if we spend $1k on the top 19.7% of our employees most likely to leave (186 at-risk employees x $1,000 = $186,000), then we can expect to prevent 85.5% of our potential employee turnover (73 prevented departures x $10,000 = $730,000). Maximize Profit represents the most profitable outcome.
By investing just $186,000 in 186 at-risk employees, we can expect to save $730,000 in rehiring/retraining costs, which translates to a profit of $544,000!
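As a back-of-the-envelope version of the same idea, and continuing the earlier sketch, you can walk down the list of employees ranked by leave probability and find the cutoff that maximizes expected profit. This uses the article’s dollar figures and assumes, as the simulation does, that the $1,000 retention spend prevents each expected departure; it is not SAC’s actual Profit Simulation logic.

```python
# Hand-rolled profit curve: avoided rehiring cost minus retention spend, for every cutoff.
import numpy as np

RETENTION_COST = 1_000    # spend per targeted employee (from the article)
REHIRE_COST    = 10_000   # cost avoided per prevented departure (from the article)

ranked = at_risk["LEAVE_PROBABILITY"].to_numpy()
expected_leavers = np.cumsum(ranked)               # expected departures among the top-k targeted
n_targeted = np.arange(1, len(ranked) + 1)
profit = expected_leavers * REHIRE_COST - n_targeted * RETENTION_COST

best = profit.argmax()
print(f"Target the top {n_targeted[best]} employees "
      f"({n_targeted[best] / len(ranked):.1%}) for an expected profit of ${profit[best]:,.0f}")
```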
How easy would it be to come up with that precise of an ROI with a manual data discovery tool? Is it even possible? I’m not sure I could do that without Smart Predict.
Profit Simulation Summary
To conclude our Profit Simulation, we can now present our findings and ROI to upper management to get approval for the new employee retention program. Also, since we have BI, Planning, and Predictive in the same solution, we can spread that $186k of cost, and the expected benefit of avoiding $730k worth of rehiring costs, over the next 6 to 12 months and track our progress against actuals! All of this from a single web browser. Pretty cool, right?
Overall Key Takeaways
What do we want to remember from this exercise?
- Manual data discovery is no longer a trusted way to gain insights
- Predictive Automation / Automated Data Discovery powered by Machine Learning is a far superior way of getting to the insights:
- Faster and Easier
- More Precise
- More Scalable Across a Wider Range of Information
- ML does not have to be just for data scientists – with SAC Predictive Automation Smart Features, anyone can shift to this new way of thinking
Welcome to a new way of doing things!
So, we're replacing data analyst bias with algorithm (and its implementation) bias?
In this particular example, the top 2 influencers the ML found are fairly obvious reasons people leave; I sure hope that an experienced HR data analyst would see that as well.
I think data quality is still the main problem area in data analysis, and the human eye is still better at connecting the dots.
It's not just math 😉
Machine learning can validate things we already know or suspect, and that’s certainly better than having no validation. Also, knowing how influencers rank against other factors is important; manual visualization exploration won’t give you that perspective. It’s not just one or the other. I think it’s the combination of human knowledge of the business, ML, and data quality.
it's the same problem as with predicting stock market 😉
yes there are algorithms. humans still do it better.
Overall, ML/AI will replace the majority of data analysts. Welcome to the future.
No. I don’t think Data Analysts will be replaced by algorithms. Algorithms augment the data discovery process. We need more Human + Algorithm analysis. I also don’t think predictive automation is a replacement for a data scientist. Different problems have different complexity levels and data scientists still play a vital role in the more complex problems.
I don’t think predicting the stock market is a very accurate comparison for most business problems or questions, either. There is too much variability in the stock market. For example, who knows what Donald Trump is going to tweet next? Nobody but Donald Trump, and yet he can influence the outcomes of the stock market. It's unpredictable. Whether you are a human or a human using an algorithm, variability in a use case where no clear predictability exists is a reality that cannot be avoided in some cases. But business problems tend to have less variability and are more approachable and solvable with ML.
Yes, they will. If the "machine" can give a basic analysis approaching the normal level of a "human" analyst, there is no reason to keep the human. Remember, this is all about profit margins.
Of course, high-end analysts will still handle the complex analysis. The main driver for automation (and this is part of it) is cost/margins.
I agree, stock market analysis was an extreme example.
Thank you for your perspective, Denis!
Maybe we ought to set SAC Smart Predict loose on the POTUS tweets; we just need to tap into a database of influencing factors, i.e. # hours slept, # of cheeseburgers, # of CNN accusations, trending topics in evangelical tweets, trending topics in white supremacist groups, trending global political issues, prime Democratic messaging representatives, etc.
It might be able to come up with influencing factors on his next radical negative or positive event influencing the VIX.