Dear Data Scientists: It’s Not All About You
Any time a (relatively) specialized or obscure topic gets subjected to worldwide hype before finally becoming part of the mainstream, an interesting phenomenon occurs between the “original believers” and the “newcomers”. I’m old enough to remember a time before the Internet and even Linux when e-mail had to be sent using a Unix command line program called (imaginatively enough) “mail” and could only be sent to people at other universities.
When the Internet became more and more available, some of us “computer geeks” were very proud that we used e-mail long before people even figured out if the word needed a hyphen or not. By the time residential cable and DSL modems became ubiquitous, it was a badge of honor to be running your own e-mail and Web servers at home because back then, you really had to know what you were doing (for the record, I run both in my own on-premises private cloud to this day. Why? Because I *can*).
Watching data scientists interact with regular business users reminds me of my own evolution from a “computer geek” into a “software engineer” (which was much cooler and definitely paid more). The tipping point, though, was when my background as a computer geek no longer qualified me to lord my knowledge over those less enlightened than myself. It was not that my experience became less valid; it simply became less relevant.
Technology has progressed far enough that making it easier to use and accessible to everyone no longer comes at the expense of cost or performance. In the case of e-mail servers, it’s actually now cheaper to outsource the whole thing to the cloud than to run your own. That started to make my experience seem really expensive, and in some cases, unnecessary.
Predictive Analytics Has Hit Primetime
Let’s face it, “Predictive Analytics” is the evolution of that much more boring-sounding topic of “Statistics”. However, now that predictive is “cool”, I’m seeing the same phenomenon: many of those who previously had a math or statistics background became “data scientists” because they have a deep understanding of what makes predictive tick. Today, data scientists are in an enviable (and financially lucrative) position: statistics will never get easier and the fundamentals of mathematics are unlikely to change in… well… ever.
The truth is that the job of a (good) data scientist is not that easy, and it can be infinitely boring. Significant time is spent cleaning, preparing, and deriving data before it even makes sense to start the predictive modelling process. Creating the models themselves requires a delicate understanding of the (now augmented) dataset and which algorithms are most applicable to it. This is a highly iterative process which requires an understanding of statistics to decide how to refine the models for the best possible accuracy. If this sounds repetitive, you are right (and if you don’t think this is repetitive, you are likely a data scientist yourself).
Predictive Automation to the Rescue
The good news (for us non-data scientists) is that technology is poised to blow the fortified towers of data science wide open just as easily as it devalued my computer science degree. The industry’s focus on predictive analytics has shifted from pure performance and efficiency to ease of use and accessibility for the larger business audience. Encoding self-tuning algorithms into an autonomous “data scientist in a box” application has always been the Holy Grail of predictive analytics, but how realistic is it?
Ironically, predictive analytics software can be made smarter specifically because the rules of mathematics cannot change. Many previous attempts tried to create an “uber-data scientist” application that can automatically pick the best algorithm for a given problem. This approach has some merits: there are a number of algorithms that apply specifically to classification problems, so you could simply run all of them on a target dataset and pick the best one (and in fact, this is what some products do).
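To make the “run them all and pick the best” idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the toy dataset, the three tiny classifiers, and names such as `CANDIDATES` and `pick_best` are hypothetical and do not come from any particular product.

```python
import random
from collections import Counter

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def make_data(n, seed):
    """Toy two-class data (assumed): class 1 points are shifted away from class 0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        point = (rng.gauss(2.0 * label, 1.0), rng.gauss(2.0 * label, 1.0))
        rows.append((point, label))
    return rows

def fit_majority(train):
    # Always predicts the most common training label.
    top = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda point: top

def fit_nearest_centroid(train):
    # Predicts the label whose class mean is closest to the point.
    centroids = {}
    for label in {lab for _, lab in train}:
        pts = [p for p, lab in train if lab == label]
        centroids[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return lambda point: min(centroids, key=lambda lab: dist2(point, centroids[lab]))

def fit_one_nn(train):
    # Predicts the label of the single closest training point.
    return lambda point: min(train, key=lambda row: dist2(point, row[0]))[1]

# Hypothetical candidate pool: name -> function that fits a model on training rows.
CANDIDATES = {
    "majority_class": fit_majority,
    "nearest_centroid": fit_nearest_centroid,
    "one_nearest_neighbor": fit_one_nn,
}

def accuracy(model, rows):
    return sum(model(p) == lab for p, lab in rows) / len(rows)

def pick_best(train, holdout):
    """Fit every candidate on train, score on holdout, return the top scorer."""
    scores = {name: accuracy(fit(train), holdout) for name, fit in CANDIDATES.items()}
    return max(scores, key=scores.get), scores

data = make_data(300, seed=7)
train, holdout = data[:200], data[200:]
best_name, scores = pick_best(train, holdout)
print(best_name, scores)
```

The selection logic is a single `max` over holdout scores, which is exactly why this approach is easy to build and, as the next paragraph explains, easy to destabilize.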
However, a data scientist will tell you that one of the dangers is that this can yield drastically different results between runs. The algorithm chosen this week may not be chosen as the best one next week, and therefore the results of the two weeks are not directly comparable. Sticking with one specific algorithm from week to week may not be the best idea either – it’s possible the data for the first week’s run would yield a different algorithmic decision than the next four weeks. In the end, you need a data scientist to tell you which of the algorithms you could stick with (defeating the purpose of a “data scientist-in-a-box” approach).
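A rough way to see this instability for yourself is to repeat the selection on several weekly samples and count how often each candidate wins. The sketch below assumes a made-up weekly extract with one numeric feature and three deliberately similar candidates; `weekly_sample`, `fit_threshold`, and `fit_knn` are hypothetical names, and whether the winner actually changes depends on the data you feed it.

```python
import random
from collections import Counter

def weekly_sample(week, n=120):
    """Hypothetical weekly extract: one noisy numeric feature, binary label."""
    rng = random.Random(week)  # a different random sample for each week
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        rows.append((rng.gauss(1.5 * label, 1.0), label))
    return rows[:80], rows[80:]  # train, holdout

def fit_threshold(train):
    # Cut halfway between the two class means; assumes class 1 sits higher.
    def mean_of(target):
        xs = [x for x, lab in train if lab == target]
        return sum(xs) / len(xs) if xs else 0.0
    cut = (mean_of(0) + mean_of(1)) / 2
    return lambda x: 1 if x > cut else 0

def fit_knn(k):
    # Returns a fit function for a k-nearest-neighbor majority vote.
    def fit(train):
        def predict(x):
            nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
            return Counter(lab for _, lab in nearest).most_common(1)[0][0]
        return predict
    return fit

CANDIDATES = {"threshold": fit_threshold, "1-nn": fit_knn(1), "5-nn": fit_knn(5)}

def accuracy(model, rows):
    return sum(model(x) == lab for x, lab in rows) / len(rows)

winners = Counter()
for week in range(1, 9):
    train, holdout = weekly_sample(week)
    scores = {name: accuracy(fit(train), holdout) for name, fit in CANDIDATES.items()}
    winners[max(scores, key=scores.get)] += 1  # ties go to the first candidate listed

print(dict(winners))
```

If more than one name shows up in the tally, the week-to-week results came from different algorithms and are not directly comparable, which is the problem described above.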
The solution to this is to have this “auto-selection” intelligence built into the algorithm itself rather than have it sit above it. That means the application can pick this “uber-algorithm” which will then have the intelligence to handle any type of data and always make the right modelling choices. In practice, this super-smart algorithm would do many of the same steps a data scientist would. It would use statistical analysis to determine the optimum analysis parameters and then iteratively create many candidate models before finally settling on the winner. The difference is that a computer can create hundreds or even thousands of models before choosing the best one.
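As a small-scale illustration of tuning that lives inside the algorithm, the sketch below keeps a single model family (k-nearest neighbors, chosen here only because it is short to write) and lets it build one candidate per parameter value, scoring each with cross-validation before keeping the winner. This is my own simplified stand-in, not how SAP Predictive Analytics or any other product works internally; `auto_fit`, `cv_accuracy`, and the toy data are assumptions.

```python
import random
from collections import Counter

def knn_predict(train, x, k):
    """Majority vote among the k training rows closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return Counter(lab for _, lab in nearest).most_common(1)[0][0]

def cv_accuracy(rows, k, folds=5):
    """Cross-validated accuracy: each row is scored once by a model that never saw it."""
    hits = 0
    for f in range(folds):
        test = rows[f::folds]
        train = [r for i, r in enumerate(rows) if i % folds != f]
        hits += sum(knn_predict(train, x, k) == lab for x, lab in test)
    return hits / len(rows)

def auto_fit(rows, k_grid=range(1, 32, 2)):
    """Build one candidate per k, score them all, and keep the best."""
    scores = {k: cv_accuracy(rows, k) for k in k_grid}
    best_k = max(scores, key=scores.get)  # ties go to the smallest k
    return best_k, scores

# Toy data (assumed): one numeric feature, class 1 shifted upward.
rng = random.Random(11)
rows = []
for _ in range(150):
    label = rng.randint(0, 1)
    rows.append((rng.gauss(1.5 * label, 1.0), label))

best_k, scores = auto_fit(rows)

def model(x):
    # Final model: the same algorithm, refit on all rows with the chosen k.
    return knn_predict(rows, x, best_k)

print(best_k, round(scores[best_k], 3), len(scores), "candidates tried")
```

Because the algorithm family never changes, results from different runs stay comparable even when the chosen parameter moves; a real automated tool would search a far larger space than the sixteen candidates tried here.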
SAP Predictive Analytics has an automated mode where you basically pick the type of problem to solve (classification, association, etc.) and the software handles the rest, in order to put predictive analytics in the hands of more users (for a more detailed discussion, see How does Automated Analytics do it? The magic behind creating predictive models automatically).
Predictive Analytics Is Not a “Zero” or “One”
I am continually surprised by the number of customers that think you either are or aren’t a data scientist. Digging into this a little bit further, I find many times a data scientist has created this binary distinction to differentiate their skills from the “regulars” (or in Harry Potter terms, “muggles”) – who readily agree they don’t want to be anywhere near a mathematical equation, much less an actual algorithm.
However, the field of data science is a spectrum that blends quickly into the analytics/business intelligence world. If you accept that predictive analytics “is an algorithmic analysis of past data to find patterns that can be applied to new data to improve a future outcome”, I would argue that business intelligence is “a visual and calculation oriented analysis of past data to make better decisions about the future.”
That means every business intelligence user can benefit from automated predictive analytics to do what they are currently doing – better.
The sooner you can get your organization out of the “data scientist or not” mentality, the quicker you can make everyone more effective at understanding and solving business problems. If you get stuck on this, look for the strongest opponents; there is likely a non-technical reason for their resistance to “opening the predictive gates”.
Don’t Worry Data Scientists, We Will Still Need You
Does predictive automation completely replace data scientists? Definitely not – a human can understand the semantics in the data, derive new data fields based on their domain knowledge, and create far more complicated models without so many iterations. However, for those that do not have the knowledge or skills, the automated way gets you pretty darn close – and a whole lot faster.
The massive influx of business users with access to predictive analytics reduces the burden on (typically) overloaded data scientists by freeing them up from some of the “simpler” problems that could be handled by users directly and letting them work on the more sophisticated predictive problems. Sounds like a win-win, doesn’t it?
Interestingly, many data scientists can also benefit from the use of automated predictive technology to better understand their data and create baseline models *before* they dive into creating their own models by hand. By having a computer do the initial analysis, a data scientist can save hours to days of data profiling before getting down to the core predictive modelling they are being paid those big bucks for. In some cases, the automatically generated models may be solid enough to solve a problem without requiring any manual modelling.
In The Future, It Won’t Matter
The field of predictive analytics is maturing at its fastest rate ever and the move towards more ease of use and simplicity will eventually reach a plateau, just as standing up a full Hadoop cluster in the cloud can be done in under ten minutes today. The focus will shift from “which is the best algorithm to use when?” to “how can I use these predictive results to improve the business?”.
The number of predictive models needed will explode as companies explore micro-segmentation, and the (potentially) small gain of a single manually crafted model will give way to the need to create hundreds of models per day.
So to you data scientists out there: We will always need your skills, your experience, and your wisdom.
But take it from someone who has been through this cycle and had his computer geekdom commoditized – automation opens up predictive analytics to everybody, so just remember: “It’s not all about you”. 😉
You can try SAP Predictive Analytics and all of its automated predictive goodness for free by downloading it at http://www.sap.com/trypredictive