Machine Learning in a Box (week 5) : Upload Machine Learning Datasets
In case you are catching the train while it's already running, here is the link to the introduction blog of the Machine Learning in a Box series, which allows you to pick up the series from the start. At the end of that introduction blog you will find the links to each element of the series.
Before we get started, a quick recap from last week
Last week, we looked at which SAP HANA flavor to pick (Server only), the hardware requirements you will need, and solutions if you don't have such a machine.
I hope you all managed to get your instance up and running and to connect to it using your favorite SQL query tool.
Following Craig Cmehil's blog, What's your setup? Care to share? #mydevsetup, feel free to share your setup this week (or later) to get opinions, recommendations, etc.
Welcome to week 5 of Machine Learning in a Box!
Upload Machine Learning Datasets
Now that you have a SAP HANA, express edition instance up and running, you can start loading data.
I'm not going to ask you to load a petabyte of data (even though I once uploaded about 50 GB of flat files during a hackathon using only 3 GB of RAM). Let's be realistic and keep such challenges for when we have gained more skills around SAP HANA.
The data you will upload is part of the SAP Predictive Analytics sample datasets.
I have used these datasets for the last 8 years to demonstrate not only how the product works, but also to explain how the algorithms work, the value of automation, etc.
Let's first properly introduce SAP Predictive Analytics, then we will have a look at the sample datasets.
SAP Predictive Analytics
SAP Predictive Analytics was born in 2014, if I remember right, about a year after SAP's acquisition of KXEN.
SAP had built a tool called SAP Predictive Analysis to address the needs of the data scientist persona.
At that time, SAP Predictive Analysis was already able to consume data from SAP HANA, leveraging the SAP HANA Predictive Analysis Library (PAL) and the SAP HANA R integration, or to consume data from more or less any database with a JDBC driver while still leveraging about 20 built-in algorithms in addition to a local R integration.
On the other side, KXEN brought InfiniteInsight and a series of automated algorithms, but also automated data preparation, the ability to export the scoring equation to almost 40 different programming languages or databases, and a module dedicated to deployment and monitoring (Factory).
The so-called KXEN algorithms are now SAP intellectual property, so you won't find fine details on their implementation. What you can find is that they follow the Structural Risk Minimization principle by Vladimir Vapnik and Alexey Chervonenkis.
For those who don't know them, Vladimir Vapnik and Alexey Chervonenkis invented the original Support Vector Machine algorithm in 1963.
The intent with SAP Predictive Analytics was to merge the Automated Analytics (formerly the KXEN components) and the Expert Analytics (the SAP Predictive Analysis) into one product.
One of the first tasks right after the KXEN acquisition was to bring the automated algorithms inside:
- SAP HANA, which led to the SAP HANA Automated Predictive Library (APL)
- the Expert Analytics side, with additional nodes for the offline and online modes
- every SAP application and solution (Hybris, SFSF, C4C, …)
There were also a multitude of initiatives where the automated analytics were embedded completely invisibly to the end user, like in Lumira or the Digital Boardroom.
Upload data in SAP HANA, express edition
As a data practitioner, you already know that there is no magic when you have to deal with uploading data: you either use a tool with a GUI and configure it, or you build a script.
Using a GUI
The GUI option is fine if you don't have many files to upload or if you will only do it once or twice. For that, you can use the SAP HANA Tools for Eclipse, where the Import feature is there for you.
I wrote the following tutorial to introduce how it works: Import CSV into SAP HANA, express edition using the SAP HANA Tools for Eclipse
The Import wizard from the SAP HANA Tools for Eclipse allows you to upload data that is local to wherever Eclipse is running. It also enables you to create the table if it doesn't exist.
Using a script
The scripting option leverages the IMPORT FROM SQL command.
I wrote the following tutorial to introduce how it works: Import CSV into SAP HANA, express edition using IMPORT FROM SQL command
The IMPORT FROM SQL command requires the data to be located in a specific location on the SAP HANA host (this can be reconfigured if needed), and the recipient table must exist before running the command. It supports a multitude of options, like date or time formats, field delimiters, etc.
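As an illustration, here is a minimal sketch of such an import; the file path, schema, table name, and options below are assumptions for this example, not the exact ones from the tutorial:

```sql
-- Hypothetical example: the file path, "ML_DATA" schema, and "CENSUS"
-- table are assumptions for illustration.
-- The target table must exist before running the import.
IMPORT FROM CSV FILE '/usr/sap/HXE/HDB90/work/data/census.csv'
INTO "ML_DATA"."CENSUS"
WITH
   RECORD DELIMITED BY '\n'
   FIELD DELIMITED BY ','
   OPTIONALLY ENCLOSED BY '"'
   SKIP FIRST 1 ROW   -- skip the CSV header line
   ERROR LOG '/usr/sap/HXE/HDB90/work/data/census.import.err';
```

The ERROR LOG option is handy while you iterate: rejected rows end up in the log file instead of silently disappearing.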
My preference goes to scripting as, I have to admit, I'm a lazy guy: if I can avoid some clicking, I will.
In addition, this option performs much better, especially when you start uploading larger files.
SAP Predictive Analytics Sample Dataset
SAP Predictive Analytics provides a series of sample datasets to help you get started using the tool itself.
With version 3.3, they were all made available as part of the online documentation: https://help.sap.com/pa
On the bottom right-hand side, you will see the Samples section.
You can then click on View All to access the full list of sample datasets.
I have prepared another tutorial to help you with this: Import SAP Predictive Analytics Datasets
It explains how to import the following datasets:
- Association Rules Dataset
- Census Dataset
- Geo localization Dataset
- Time Series Dataset
- Text Coding
- Social / Link Analysis
Apart from the Census and Geo localization datasets, each dataset can be used with its related algorithm category.
The Census dataset can be used for classification using the class variable as the target, for clustering, or for regression using one of the continuous variables, like age.
The Geo localization dataset can be used for classification in conjunction with SAP HANA spatial capabilities, or with the association and social algorithms.
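As a minimal sketch of preparing the Census dataset for a classification experiment, you could split it into training and testing subsets; the schema, table, and "id" column names below are assumptions:

```sql
-- Hypothetical sketch: the "ML_DATA" schema, "CENSUS" table, and "id"
-- column are assumptions for illustration.
-- Keep roughly 80% of the rows for training and 20% for testing.
CREATE VIEW "ML_DATA"."CENSUS_TRAINING" AS
  SELECT * FROM "ML_DATA"."CENSUS" WHERE MOD("id", 5) <> 0;

CREATE VIEW "ML_DATA"."CENSUS_TESTING" AS
  SELECT * FROM "ML_DATA"."CENSUS" WHERE MOD("id", 5) = 0;
```

Splitting on the key column keeps the split deterministic, so the same rows land in the same subset every time the views are queried.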
Now, you should have your HXE tenant ready with data loaded to run algorithms. Next week, we will continue with the environment preparation and look at the open source R integration.
For those who want to start with some algorithms, I recommend using the Census dataset and one of the PAL algorithms, but you will have to share your experiments!
(Remember sharing && giving feedback is caring!)
UPDATE: Here are the links to all the Machine Learning in a Box weekly blogs:
- Introducing “Project: Machine Learning in a Box”
- Machine Learning in a Box (part 2) : Project Methodologies
- Recap Machine Learning in a Box (part 2) : Project Methodologies
- Machine Learning in a Box (part 3) : Algorithms Learning Styles
- Machine Learning in a Box (part 4) : Get your environment up and running
- Machine Learning in a Box (part 5) : Upload Machine Learning Datasets
- Machine Learning in a Box (part 6) : SAP HANA R Integration
- Machine Learning in a Box (part 7) : Jupyter Notebook
- Machine Learning in a Box (part 8) : SAP HANA EML and TensorFlow Integration
- Machine Learning in a Box (part 9) : Build your first Machine Learning application
- Machine Learning in a Box (part 10) : JupyterLab