Predictive Thursdays: Bringing Automated Predictive Techniques to Hadoop and Spark
New digital technologies allow companies to reimagine business models, respond to disruptive market entrants, and squeeze more productivity from fewer resources. In the past, companies became leaders in their industries by establishing an unbeatable brand or by having a supply chain that was more efficient than anyone else’s. This is still relevant in the digital economy, but now companies also have to think about how to put these digital technologies to work. How? By turning what is driving this digital economy, THE DATA, to their advantage.
As the explosion of data continues to accelerate, many enterprises wanting to become more Big Data-driven have started their Hadoop infrastructure journey and beefed up their staff with a ‘new’ key player, the data scientist. At SAP, we decided early on that we wanted to help enterprises storing masses of data in Hadoop to improve the accuracy of their predictive systems and to make sense of this wealth of data. We wanted to bring automated predictive techniques to Hadoop using Apache Spark, in the form of Native Spark Modeling.
Apache Spark is an open-source data processing platform that debuted in October 2012. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It includes dozens of high-level operators, APIs for Java and Python, support for in-memory processing, and good integration with the Hadoop ecosystem. For a good intro to Spark, I recommend Introduction to Apache Spark with Examples and Use Cases and the Spark FAQ.
Benefits of Native Spark Modeling
The benefits of coupling Hadoop-Spark with automated predictive techniques are potentially substantial. First, there is no data movement between the machine learning engine and the data source. You do data manipulation, model training, and retraining directly on Hadoop data using the Spark engine. By doing so, you can take advantage of the inherent benefits offered by Hadoop and Spark: faster response time, better use of CPUs with distributed processing, and higher scalability.
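To make the "no data movement" point a bit more concrete, here is a minimal sketch in plain Python. It is not Native Spark Modeling and does not use the Spark API. The `partitions` list, the sample values, and the helper names `local_stats`, `merge`, and `fit_line` are all hypothetical, chosen only for illustration. The idea it shows is that each partition is summarized where it lives, and only a handful of numbers per partition travel to the place where the model (here, a simple straight-line fit) is computed.

```python
from functools import reduce

# Hypothetical data: each inner list stands in for one data partition
# stored on a different cluster node, holding (x, y) pairs.
partitions = [
    [(1.0, 3.0), (2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
    [(5.0, 11.0)],
]

def local_stats(rows):
    """Summarize one partition in place: (n, sum x, sum y, sum x*x, sum x*y)."""
    n = len(rows)
    sx = sum(x for x, _ in rows)
    sy = sum(y for _, y in rows)
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    return (n, sx, sy, sxx, sxy)

def merge(a, b):
    """Combine two partition summaries by adding them element-wise."""
    return tuple(u + v for u, v in zip(a, b))

def fit_line(stats):
    """Least-squares slope and intercept computed from the merged summary."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Only five numbers per partition are moved and merged, never the raw rows.
stats = reduce(merge, map(local_stats, partitions))
slope, intercept = fit_line(stats)
print(slope, intercept)
```

In a real cluster, the per-partition step would run on the nodes that hold the data and the merge would be handled by the engine. The sketch only mirrors that shape on a single machine.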
Another benefit is that data scientists can be a lot more productive using automated techniques. By cutting down on the time it takes them to prepare the data, create and test their models, and make them available for deployment in business applications and processes, data scientists can refocus on high-value projects or more complex problems where their skill set shines.
Finally, coupling Hadoop-Spark with automated predictive techniques can also bring ‘new players’ to the predictive value chain: data and business analysts. These skilled individuals can take advantage of Big Data using Spark without having to code, which is a significant obstacle for many. They simply use the self-service, flexible workflow provided by SAP BusinessObjects Predictive Analytics while connected to Spark. That alone will make Big Data more available to the lines of business and fuel the demand for added-value predictive use cases.
Come See for Yourself
From September 27-29, SAP (booth #935) will be at Strata+Hadoop Summit in NYC. Join us there to discuss how you can take advantage of all the data stored using open source software on low-cost commodity hardware.
From September 26-27, SAP (booth #k7) will also be at the first O’Reilly Artificial Intelligence show. We look forward to seeing you.