Two simple tips to boost the working efficiency of a Data Science project
How can we make our daily work more efficient? Is there a straightforward answer? For me, the answer is one word: experience.
Having participated in several Data Science projects over the last years, I was amazed how quickly you can confirm the saying that "almost 70–80% of a Data Science project is spent on data preparation". This blog post presents two simple tips regarding the data preparation process.
The first compares four different ways a data scientist can create random sample datasets from an initial dataset in SAP HANA, and discusses their potential usage. The second exposes the power of SAP HANA ML for creating and automating a set of new aggregated columns (max(), sum(), avg(), for example) from existing columns, without the need to write long and complex SQL queries (the feature engineering part).
Different ways to create random sample datasets
During a data science project, a Data Scientist can highlight several pain points.

The size of the dataset is the keyword behind all the previous pain points. If the size is small, then everything is fine. But what if we have hundreds of millions of records and hundreds of columns? How do we handle this demanding size? A straightforward answer is to take repeated samples from the initial dataset and then evaluate the results in order to take the appropriate decisions. But how easy and fast is it to take a sample from a huge dataset? We will compare several approaches below in order to understand which one is the most efficient.
Potential usage of sampling the initial datasets in a Data Science project
Let’s assume that our dataset has a couple of million rows and a couple of hundred columns, for example a typical production dataset of invoice records or IoT data. In order to conclude which model design is the most accurate, we will probably need to add many derived columns (the feature engineering process). Depending on the resulting accuracy, this has to be repeated several times before we have a first version of our training dataset. This is a highly repetitive, time-consuming process which can last for days.
The trick
The trick here is to create several sample subsets from our initial dataset and execute the process multiple times on them, until we are confident about the accuracy, in a cross-validation-like approach. Once we have concluded on the final data model, we need to run it on the full dataset only once. This reduces the time dramatically. Moreover, the repeated runs already extract initial insights regarding the level of our model accuracy.
Insights regarding the level of our model accuracy
For example, if we run a couple of sample subsets and their accuracy is more or less always between 60 and 80%, then we can be fairly confident that the final accuracy will fall in this range (I).
If the accuracy of the repeated subset executions is consistently low, then we really need to redesign our model (II).
If the accuracies of the sub-executions are far apart from each other, then it is easy to see that the final accuracy is highly driven by the data we will use (III).
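The three situations above can be sketched with a small helper. This is a hypothetical illustration, not part of the original workflow; the thresholds are illustrative assumptions that each project would tune.

```python
# Classify the accuracies of repeated sample runs into the three
# situations described above. low_threshold and max_spread are
# illustrative assumptions, not fixed rules.
def classify_runs(accuracies, low_threshold=0.4, max_spread=0.2):
    spread = max(accuracies) - min(accuracies)
    mean = sum(accuracies) / len(accuracies)
    if spread > max_spread:
        return "III: accuracy is highly driven by the sampled data"
    if mean < low_threshold:
        return "II: redesign the model"
    return "I: final accuracy likely in [%.2f, %.2f]" % (min(accuracies),
                                                         max(accuracies))

print(classify_runs([0.62, 0.71, 0.68, 0.75]))  # stable runs -> situation I
```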
Comparison of random sample creation
We have created the column table below with 49 M rows. The table contains randomly generated values in all four columns. We will test its sampling performance in four different ways.
CREATE COLUMN TABLE "CHECK_tbl" (
"RN" BIGINT,
"RN1" BIGINT,
"RN2" BIGINT,
"RN3" BIGINT);
DDL of table “CHECK_tbl”
CASE 1, USING RAND() IN ORDER BY
SELECT *
FROM "CHECK_tbl"
ORDER BY rand()
LIMIT 1000000
WITH HINT (IGNORE_PLAN_CACHE);
CASE 2, USING RAND() IN THE WHERE CLAUSE
SELECT *
FROM "CHECK_tbl"
WHERE rand() < 0.9
LIMIT 1000000
WITH HINT (IGNORE_PLAN_CACHE);
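As a side note on Case 2, filtering with rand() < p keeps each row independently with probability p, so the sample size is only approximate before the LIMIT caps it. A plain-Python simulation of that behavior (a hypothetical stand-in for the SQL, not hana_ml code):

```python
import random

def bernoulli_sample(rows, p, limit):
    """Keep each row with probability p (like WHERE rand() < p),
    then cap the result at `limit` rows (like LIMIT)."""
    kept = [r for r in rows if random.random() < p]
    return kept[:limit]

random.seed(0)
sample = bernoulli_sample(range(10_000), p=0.9, limit=1_000)
print(len(sample))  # at most 1000 rows survive the cap
```

With p = 0.9 roughly 90% of rows pass the filter, which is why the LIMIT is what actually controls the sample size in this case.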
CASE 3, USING TABLESAMPLE SYNTAX
SELECT *
FROM "CHECK_tbl"
TABLESAMPLE SYSTEM (90)
LIMIT 1000000
WITH HINT (IGNORE_PLAN_CACHE);
CASE 4, USING PAL PREPROCESSING ALGORITHMS (SAMPLING METHOD)
DROP TABLE #PAL_PARAMETER_TBL;
CREATE LOCAL TEMPORARY COLUMN TABLE #PAL_PARAMETER_TBL (
"PARAM_NAME" VARCHAR (256),
"INT_VALUE" INTEGER,
"DOUBLE_VALUE" DOUBLE,
"STRING_VALUE" VARCHAR (1000)
);
INSERT INTO #PAL_PARAMETER_TBL VALUES ('SAMPLING_METHOD', 5, NULL, NULL);
INSERT INTO #PAL_PARAMETER_TBL VALUES ('SAMPLING_SIZE', 1000000, NULL, NULL);
INSERT INTO #PAL_PARAMETER_TBL VALUES ('INTERVAL', 5, NULL, NULL);
INSERT INTO #PAL_PARAMETER_TBL VALUES ('COLUMN_CHOOSE', 2, NULL, NULL);
CALL "_SYS_AFL"."PAL_SAMPLING"("CHECK_tbl", #PAL_PARAMETER_TBL, "OUT_");
4 sampling cases
Metrics of executions
- As we can see from the results above, the most efficient execution in terms of time and memory is Case 3. More details on this approach can be found in the SAP Documentation link.
- Case 1 is the worst of all, since it first has to order the whole dataset randomly.
- By moving the rand() function into the where clause instead, Case 2 becomes the second most efficient of the four.
- Case 4 implements the "Sampling" preprocessing algorithm provided by the SAP HANA Predictive Analysis Library (PAL); more information in the SAP Documentation link.
Upscaling this exercise to a real-life scenario
A typical real dataset could have 20–30 times more rows and a couple of hundred more columns. This translates into a quickly calculated multiplier of at least 10 on the metrics above. Adding the repeated samples, for example 10 different executions, we can easily estimate the new execution time for each case, just to sample the data:
Case 1: 44 (secs) * 10 (multiplier) * 10 (repeated executions) ≈ 73 minutes of execution
Case 2: 0.8 (secs) * 10 (multiplier) * 10 (repeated executions) ≈ 1.3 minutes of execution
Case 3: 0.01 (secs) * 10 (multiplier) * 10 (repeated executions) ≈ 0.02 minutes of execution
Case 4: 6.5 (secs) * 10 (multiplier) * 10 (repeated executions) ≈ 11 minutes of execution
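The projections above are simple arithmetic and can be reproduced in a few lines of Python (the base timings are the measured ones from the metrics table):

```python
# Project each case's measured sampling time to the real-life scenario:
# base seconds * size multiplier * number of repeated executions.
base_secs = {"Case 1": 44, "Case 2": 0.8, "Case 3": 0.01, "Case 4": 6.5}
MULTIPLIER = 10    # dataset roughly 10x bigger
REPETITIONS = 10   # 10 repeated sample executions

for case, secs in base_secs.items():
    minutes = secs * MULTIPLIER * REPETITIONS / 60
    print(f"{case}: {minutes:.2f} minutes")
```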
This exercise demonstrates four different approaches to the same problem. The only thing that differentiates them is how we use the features of SAP HANA. Taking the needs of the project into consideration, it is up to us to choose the most appropriate one each time.
SAP HANA ML / easy creation of new aggregated columns
In many scenarios, the data that we receive is really massive, for example sensor/IoT data. In these cases, we not only have to understand the business meaning and benefit of each column (out of the hundreds available), but we also need to create new columns in order to increase the performance of our machine learning model. These derived columns are usually aggregations of the existing ones.
If the number of columns is low, we can easily write a few lines of SQL to create them. The real challenge comes when we have hundreds of columns of sensor data and we want to apply several aggregations to them, such as min/max/avg/stddev and more. For such tasks an automated approach is required. Let's see how hana_ml can be used to automate the creation of aggregated columns.
SAP HANA ML exercise
We created the table below, composed of an ID column followed by 10 measures (sensor readings, for example). For every measure (column) we want to create the aggregates over the whole dataset as well as the aggregates per ID. The aggregations to be created in both cases are count(), avg(), stddev(), min(), max() and median().
CREATE COLUMN TABLE "DEMO_USER"."DEMO_AGG" (
"ID" INTEGER NOT NULL ,
"Measure_1" DOUBLE NOT NULL ,
"Measure_2" DOUBLE NOT NULL ,
"Measure_3" DOUBLE NOT NULL ,
"Measure_4" DOUBLE NOT NULL ,
"Measure_5" DOUBLE NOT NULL ,
"Measure_6" DOUBLE NOT NULL ,
"Measure_7" DOUBLE NOT NULL ,
"Measure_8" DOUBLE NOT NULL ,
"Measure_9" DOUBLE NOT NULL ,
"Measure_10" DOUBLE NOT NULL );
SAP HANA table creation
# The HANA ML library
import hana_ml.dataframe as dataframe
conn = dataframe.ConnectionContext(server_no_port, port, uname, passw)
dset = conn.sql('''SELECT * FROM DEMO_USER.DEMO_AGG''')
Connecting to SAP HANA and defining our HANA ML DataFrame
Our HANA ML DataFrame
#############################################################
## ADD DERIVED AGGREGATED COLUMNS FROM THE WHOLE DATA SET  ##
#############################################################
corr_cols = list(map(lambda y: y[0], filter(lambda x: (x[1]=='DOUBLE' or x[1]=='DECIMAL'), dset.dtypes())))
corr_cols
# define the group-by column
groupby_col = 'ID'
# define the aggregation functions
agg = ['count', 'avg', 'stddev', 'min', 'max', 'median']
agg_sql = []
for operation in agg:
    agg_string = map(lambda x: tuple((operation, x, operation + '_' + x)), corr_cols)
    agg_sql.extend(agg_string)
dset1 = dset.agg(agg_sql, group_by=groupby_col)
# Run statement
# dset1.collect()
# Get the generated SQL
dset1.select_statement
SAP HANA ML code
This code implements the steps required to produce the result (the new aggregated columns) that we want. Initially, we keep in a list only the columns of type DOUBLE or DECIMAL. Then we define:
- The column of the table on which the grouping is to be applied (groupby_col = 'ID').
- The aggregation functions that we want to apply to these columns (agg = ['count', 'avg', 'stddev', 'min', 'max', 'median']).
According to the SAP HANA ML documentation, in order to use the agg() function we need to provide the following parameters:
agg_list: a list of tuples. Each tuple is a triplet (aggregate_operator, expression, name), where:
- aggregate_operator is one of ['max', 'min', 'count', 'avg']
- expression is a str: a column or column expression
- name is the name of this aggregate in the projection list.
group_by: str or list of str. The group-by column. Only a column is allowed, although expressions are allowed in SQL.
For every derived column, we build the corresponding name with the following syntax:
“{Aggregation}_{Initial column name}”
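This naming scheme is easy to check locally. The snippet below rebuilds the (operator, column, name) triplets in plain Python, outside of hana_ml, with a small sample column list standing in for the real corr_cols:

```python
# Rebuild the agg_list triplets locally: (aggregate_operator, column, name),
# where the name follows the "{Aggregation}_{Initial column name}" scheme.
corr_cols = ['Measure_1', 'Measure_2']          # sample column list
agg = ['count', 'avg', 'stddev', 'min', 'max', 'median']

agg_sql = [(operation, col, operation + '_' + col)
           for operation in agg
           for col in corr_cols]

print(agg_sql[0])    # ('count', 'Measure_1', 'count_Measure_1')
print(len(agg_sql))  # 6 aggregations * 2 columns = 12 triplets
```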
Then, by reading the select_statement attribute, we obtain the constructed SQL.
'SELECT "ID",
count("Measure_1") AS "count_Measure_1",
count("Measure_2") AS "count_Measure_2",
count("Measure_3") AS "count_Measure_3",
count("Measure_4") AS "count_Measure_4",
count("Measure_5") AS "count_Measure_5",
count("Measure_6") AS "count_Measure_6",
count("Measure_7") AS "count_Measure_7",
count("Measure_8") AS "count_Measure_8",
count("Measure_9") AS "count_Measure_9",
count("Measure_10") AS "count_Measure_10",
avg("Measure_1") AS "avg_Measure_1",
avg("Measure_2") AS "avg_Measure_2",
avg("Measure_3") AS "avg_Measure_3",
avg("Measure_4") AS "avg_Measure_4",
avg("Measure_5") AS "avg_Measure_5",
avg("Measure_6") AS "avg_Measure_6",
avg("Measure_7") AS "avg_Measure_7",
avg("Measure_8") AS "avg_Measure_8",
avg("Measure_9") AS "avg_Measure_9",
avg("Measure_10") AS "avg_Measure_10",
stddev("Measure_1") AS "stddev_Measure_1",
stddev("Measure_2") AS "stddev_Measure_2",
stddev("Measure_3") AS "stddev_Measure_3",
stddev("Measure_4") AS "stddev_Measure_4",
stddev("Measure_5") AS "stddev_Measure_5",
stddev("Measure_6") AS "stddev_Measure_6",
stddev("Measure_7") AS "stddev_Measure_7",
stddev("Measure_8") AS "stddev_Measure_8",
stddev("Measure_9") AS "stddev_Measure_9",
stddev("Measure_10") AS "stddev_Measure_10",
min("Measure_1") AS "min_Measure_1",
min("Measure_2") AS "min_Measure_2",
min("Measure_3") AS "min_Measure_3",
min("Measure_4") AS "min_Measure_4",
min("Measure_5") AS "min_Measure_5",
min("Measure_6") AS "min_Measure_6",
min("Measure_7") AS "min_Measure_7",
min("Measure_8") AS "min_Measure_8",
min("Measure_9") AS "min_Measure_9",
min("Measure_10") AS "min_Measure_10",
max("Measure_1") AS "max_Measure_1",
max("Measure_2") AS "max_Measure_2",
max("Measure_3") AS "max_Measure_3",
max("Measure_4") AS "max_Measure_4",
max("Measure_5") AS "max_Measure_5",
max("Measure_6") AS "max_Measure_6",
max("Measure_7") AS "max_Measure_7",
max("Measure_8") AS "max_Measure_8",
max("Measure_9") AS "max_Measure_9",
max("Measure_10") AS "max_Measure_10",
median("Measure_1") AS "median_Measure_1",
median("Measure_2") AS "median_Measure_2",
median("Measure_3") AS "median_Measure_3",
median("Measure_4") AS "median_Measure_4",
median("Measure_5") AS "median_Measure_5",
median("Measure_6") AS "median_Measure_6",
median("Measure_7") AS "median_Measure_7",
median("Measure_8") AS "median_Measure_8",
median("Measure_9") AS "median_Measure_9",
median("Measure_10") AS "median_Measure_10"
FROM (SELECT * FROM DEMO_USER.DEMO_AGG ) AS "DT_0" GROUP BY "ID"'
It is really easy to extend the current scenario with even more aggregations based on a specific column, for example the ID column in our case. All that is additionally needed is the SAP HANA ML join function.
Final code
#############################################################
## ADD DERIVED AGGREGATED COLUMNS FROM THE WHOLE DATA SET  ##
#############################################################
corr_cols = list(map(lambda y: y[0], filter(lambda x: (x[1]=='DOUBLE' or x[1]=='DECIMAL'), dset.dtypes())))
corr_cols
# define the group-by column
groupby_col = 'ID'
# define the aggregation functions
agg = ['count', 'avg', 'stddev', 'min', 'max', 'median']
agg_sql = []
for operation in agg:
    agg_string = map(lambda x: tuple((operation, x, operation + '_' + x)), corr_cols)
    agg_sql.extend(agg_string)
dset1 = dset.agg(agg_sql, group_by=groupby_col)
# Run statement
# dset1.collect()
# Get the generated SQL
# dset1.select_statement
#############################################################
#############################################################
## ADD DERIVED AGGREGATED COLUMNS PARTITIONED BY ID COLUMN ##
#############################################################
agg = ['count', 'avg', 'stddev', 'min', 'max', 'median']
agg_sql = []
for operation in agg:
    agg_string = map(lambda x: '%s("%s") over (partition by ID) AS "pct_%s_%s"' % (operation, x, operation, x), corr_cols)
    agg_sql.extend(agg_string)
agg_string = ", ".join(agg_sql)
df_window_fct_ren = conn.sql('select distinct * from (select "ID", ' + agg_string + ' from DEMO_USER.DEMO_AGG)')
result = dset1.set_index('ID').join(df_window_fct_ren.set_index('ID'))
# Run statement
# result.collect()
# Get the generated SQL - FINAL
result.select_statement
#############################################################
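The second half of the code builds window-function expressions as plain strings. That string-building step can also be checked locally, outside of hana_ml, again with a small sample column list standing in for the real corr_cols:

```python
# Rebuild the window-function select-list fragments locally: one
# '<op>("<col>") over (partition by ID)' expression per (operation, column)
# pair, aliased as "pct_<op>_<col>".
corr_cols = ['Measure_1', 'Measure_2']          # sample column list
agg = ['count', 'avg', 'stddev', 'min', 'max', 'median']

fragments = ['%s("%s") over (partition by ID) AS "pct_%s_%s"' % (op, col, op, col)
             for op in agg
             for col in corr_cols]
select_list = ", ".join(fragments)

print(fragments[0])
# count("Measure_1") over (partition by ID) AS "pct_count_Measure_1"
```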
Again, by reading the select_statement attribute, we obtain the newly constructed SQL.
SELECT T0."ID",
T0."count_Measure_1",
T0."count_Measure_2",
T0."count_Measure_3",
T0."count_Measure_4",
T0."count_Measure_5",
T0."count_Measure_6",
T0."count_Measure_7",
T0."count_Measure_8",
T0."count_Measure_9",
T0."count_Measure_10",
T0."avg_Measure_1",
T0."avg_Measure_2",
T0."avg_Measure_3",
T0."avg_Measure_4",
T0."avg_Measure_5",
T0."avg_Measure_6",
T0."avg_Measure_7",
T0."avg_Measure_8",
T0."avg_Measure_9",
T0."avg_Measure_10",
T0."stddev_Measure_1",
T0."stddev_Measure_2",
T0."stddev_Measure_3",
T0."stddev_Measure_4",
T0."stddev_Measure_5",
T0."stddev_Measure_6",
T0."stddev_Measure_7",
T0."stddev_Measure_8",
T0."stddev_Measure_9",
T0."stddev_Measure_10",
T0."min_Measure_1",
T0."min_Measure_2",
T0."min_Measure_3",
T0."min_Measure_4",
T0."min_Measure_5",
T0."min_Measure_6",
T0."min_Measure_7",
T0."min_Measure_8",
T0."min_Measure_9",
T0."min_Measure_10",
T0."max_Measure_1",
T0."max_Measure_2",
T0."max_Measure_3",
T0."max_Measure_4",
T0."max_Measure_5",
T0."max_Measure_6",
T0."max_Measure_7",
T0."max_Measure_8",
T0."max_Measure_9",
T0."max_Measure_10",
T0."median_Measure_1",
T0."median_Measure_2",
T0."median_Measure_3",
T0."median_Measure_4",
T0."median_Measure_5",
T0."median_Measure_6",
T0."median_Measure_7",
T0."median_Measure_8",
T0."median_Measure_9",
T0."median_Measure_10",
T1."pct_count_Measure_1",
T1."pct_count_Measure_2",
T1."pct_count_Measure_3",
T1."pct_count_Measure_4",
T1."pct_count_Measure_5",
T1."pct_count_Measure_6",
T1."pct_count_Measure_7",
T1."pct_count_Measure_8",
T1."pct_count_Measure_9",
T1."pct_count_Measure_10",
T1."pct_avg_Measure_1",
T1."pct_avg_Measure_2",
T1."pct_avg_Measure_3",
T1."pct_avg_Measure_4",
T1."pct_avg_Measure_5",
T1."pct_avg_Measure_6",
T1."pct_avg_Measure_7",
T1."pct_avg_Measure_8",
T1."pct_avg_Measure_9",
T1."pct_avg_Measure_10",
T1."pct_stddev_Measure_1",
T1."pct_stddev_Measure_2",
T1."pct_stddev_Measure_3",
T1."pct_stddev_Measure_4",
T1."pct_stddev_Measure_5",
T1."pct_stddev_Measure_6",
T1."pct_stddev_Measure_7",
T1."pct_stddev_Measure_8",
T1."pct_stddev_Measure_9",
T1."pct_stddev_Measure_10",
T1."pct_min_Measure_1",
T1."pct_min_Measure_2",
T1."pct_min_Measure_3",
T1."pct_min_Measure_4",
T1."pct_min_Measure_5",
T1."pct_min_Measure_6",
T1."pct_min_Measure_7",
T1."pct_min_Measure_8",
T1."pct_min_Measure_9",
T1."pct_min_Measure_10",
T1."pct_max_Measure_1",
T1."pct_max_Measure_2",
T1."pct_max_Measure_3",
T1."pct_max_Measure_4",
T1."pct_max_Measure_5",
T1."pct_max_Measure_6",
T1."pct_max_Measure_7",
T1."pct_max_Measure_8",
T1."pct_max_Measure_9",
T1."pct_max_Measure_10",
T1."pct_median_Measure_1",
T1."pct_median_Measure_2",
T1."pct_median_Measure_3",
T1."pct_median_Measure_4",
T1."pct_median_Measure_5",
T1."pct_median_Measure_6",
T1."pct_median_Measure_7",
T1."pct_median_Measure_8",
T1."pct_median_Measure_9",
T1."pct_median_Measure_10"
FROM (SELECT "ID",
count("Measure_1") AS "count_Measure_1",
count("Measure_2") AS "count_Measure_2",
count("Measure_3") AS "count_Measure_3",
count("Measure_4") AS "count_Measure_4",
count("Measure_5") AS "count_Measure_5",
count("Measure_6") AS "count_Measure_6",
count("Measure_7") AS "count_Measure_7",
count("Measure_8") AS "count_Measure_8",
count("Measure_9") AS "count_Measure_9",
count("Measure_10") AS "count_Measure_10",
avg("Measure_1") AS "avg_Measure_1",
avg("Measure_2") AS "avg_Measure_2",
avg("Measure_3") AS "avg_Measure_3",
avg("Measure_4") AS "avg_Measure_4",
avg("Measure_5") AS "avg_Measure_5",
avg("Measure_6") AS "avg_Measure_6",
avg("Measure_7") AS "avg_Measure_7",
avg("Measure_8") AS "avg_Measure_8",
avg("Measure_9") AS "avg_Measure_9",
avg("Measure_10") AS "avg_Measure_10",
stddev("Measure_1") AS "stddev_Measure_1",
stddev("Measure_2") AS "stddev_Measure_2",
stddev("Measure_3") AS "stddev_Measure_3",
stddev("Measure_4") AS "stddev_Measure_4",
stddev("Measure_5") AS "stddev_Measure_5",
stddev("Measure_6") AS "stddev_Measure_6",
stddev("Measure_7") AS "stddev_Measure_7",
stddev("Measure_8") AS "stddev_Measure_8",
stddev("Measure_9") AS "stddev_Measure_9",
stddev("Measure_10") AS "stddev_Measure_10",
min("Measure_1") AS "min_Measure_1",
min("Measure_2") AS "min_Measure_2",
min("Measure_3") AS "min_Measure_3",
min("Measure_4") AS "min_Measure_4",
min("Measure_5") AS "min_Measure_5",
min("Measure_6") AS "min_Measure_6",
min("Measure_7") AS "min_Measure_7",
min("Measure_8") AS "min_Measure_8",
min("Measure_9") AS "min_Measure_9",
min("Measure_10") AS "min_Measure_10",
max("Measure_1") AS "max_Measure_1",
max("Measure_2") AS "max_Measure_2",
max("Measure_3") AS "max_Measure_3",
max("Measure_4") AS "max_Measure_4",
max("Measure_5") AS "max_Measure_5",
max("Measure_6") AS "max_Measure_6",
max("Measure_7") AS "max_Measure_7",
max("Measure_8") AS "max_Measure_8",
max("Measure_9") AS "max_Measure_9",
max("Measure_10") AS "max_Measure_10",
median("Measure_1") AS "median_Measure_1",
median("Measure_2") AS "median_Measure_2",
median("Measure_3") AS "median_Measure_3",
median("Measure_4") AS "median_Measure_4",
median("Measure_5") AS "median_Measure_5",
median("Measure_6") AS "median_Measure_6",
median("Measure_7") AS "median_Measure_7",
median("Measure_8") AS "median_Measure_8",
median("Measure_9") AS "median_Measure_9",
median("Measure_10") AS "median_Measure_10"
FROM (SELECT * FROM DEMO_USER.DEMO_AGG) AS "DT_0" GROUP BY "ID") T0
INNER JOIN (select distinct * from (select "ID",
count("Measure_1") over (partition by ID) AS "pct_count_Measure_1",
count("Measure_2") over (partition by ID) AS "pct_count_Measure_2",
count("Measure_3") over (partition by ID) AS "pct_count_Measure_3",
count("Measure_4") over (partition by ID) AS "pct_count_Measure_4",
count("Measure_5") over (partition by ID) AS "pct_count_Measure_5",
count("Measure_6") over (partition by ID) AS "pct_count_Measure_6",
count("Measure_7") over (partition by ID) AS "pct_count_Measure_7",
count("Measure_8") over (partition by ID) AS "pct_count_Measure_8",
count("Measure_9") over (partition by ID) AS "pct_count_Measure_9",
count("Measure_10") over (partition by ID) AS "pct_count_Measure_10",
avg("Measure_1") over (partition by ID) AS "pct_avg_Measure_1",
avg("Measure_2") over (partition by ID) AS "pct_avg_Measure_2",
avg("Measure_3") over (partition by ID) AS "pct_avg_Measure_3",
avg("Measure_4") over (partition by ID) AS "pct_avg_Measure_4",
avg("Measure_5") over (partition by ID) AS "pct_avg_Measure_5",
avg("Measure_6") over (partition by ID) AS "pct_avg_Measure_6",
avg("Measure_7") over (partition by ID) AS "pct_avg_Measure_7",
avg("Measure_8") over (partition by ID) AS "pct_avg_Measure_8",
avg("Measure_9") over (partition by ID) AS "pct_avg_Measure_9",
avg("Measure_10") over (partition by ID) AS "pct_avg_Measure_10",
stddev("Measure_1") over (partition by ID) AS "pct_stddev_Measure_1",
stddev("Measure_2") over (partition by ID) AS "pct_stddev_Measure_2",
stddev("Measure_3") over (partition by ID) AS "pct_stddev_Measure_3",
stddev("Measure_4") over (partition by ID) AS "pct_stddev_Measure_4",
stddev("Measure_5") over (partition by ID) AS "pct_stddev_Measure_5",
stddev("Measure_6") over (partition by ID) AS "pct_stddev_Measure_6",
stddev("Measure_7") over (partition by ID) AS "pct_stddev_Measure_7",
stddev("Measure_8") over (partition by ID) AS "pct_stddev_Measure_8",
stddev("Measure_9") over (partition by ID) AS "pct_stddev_Measure_9",
stddev("Measure_10") over (partition by ID) AS "pct_stddev_Measure_10",
min("Measure_1") over (partition by ID) AS "pct_min_Measure_1",
min("Measure_2") over (partition by ID) AS "pct_min_Measure_2",
min("Measure_3") over (partition by ID) AS "pct_min_Measure_3",
min("Measure_4") over (partition by ID) AS "pct_min_Measure_4",
min("Measure_5") over (partition by ID) AS "pct_min_Measure_5",
min("Measure_6") over (partition by ID) AS "pct_min_Measure_6",
min("Measure_7") over (partition by ID) AS "pct_min_Measure_7",
min("Measure_8") over (partition by ID) AS "pct_min_Measure_8",
min("Measure_9") over (partition by ID) AS "pct_min_Measure_9",
min("Measure_10") over (partition by ID) AS "pct_min_Measure_10",
max("Measure_1") over (partition by ID) AS "pct_max_Measure_1",
max("Measure_2") over (partition by ID) AS "pct_max_Measure_2",
max("Measure_3") over (partition by ID) AS "pct_max_Measure_3",
max("Measure_4") over (partition by ID) AS "pct_max_Measure_4",
max("Measure_5") over (partition by ID) AS "pct_max_Measure_5",
max("Measure_6") over (partition by ID) AS "pct_max_Measure_6",
max("Measure_7") over (partition by ID) AS "pct_max_Measure_7",
max("Measure_8") over (partition by ID) AS "pct_max_Measure_8",
max("Measure_9") over (partition by ID) AS "pct_max_Measure_9",
max("Measure_10") over (partition by ID) AS "pct_max_Measure_10",
median("Measure_1") over (partition by ID) AS "pct_median_Measure_1",
median("Measure_2") over (partition by ID) AS "pct_median_Measure_2",
median("Measure_3") over (partition by ID) AS "pct_median_Measure_3",
median("Measure_4") over (partition by ID) AS "pct_median_Measure_4",
median("Measure_5") over (partition by ID) AS "pct_median_Measure_5",
median("Measure_6") over (partition by ID) AS "pct_median_Measure_6",
median("Measure_7") over (partition by ID) AS "pct_median_Measure_7",
median("Measure_8") over (partition by ID) AS "pct_median_Measure_8",
median("Measure_9") over (partition by ID) AS "pct_median_Measure_9",
median("Measure_10") over (partition by ID) AS "pct_median_Measure_10" from DEMO_USER.DEMO_AGG)) T1
ON T0."ID" = T1."ID"
This example is based on only 10 columns, just to expose the power of SAP HANA ML. On a recent project, we had to handle roughly 800 measures coming from sensor data. Writing the SQL for those transformations by hand would have been a nightmare, not only because of the time needed to produce the SQL but also because of how difficult any subsequent change would be. In a few words, the win with this trick is not limited to a quickly and easily created SQL script; we also gain a scalable and automated process at our disposal.
I would like to thank Stojan Maleschlijski and Dimitrios Lyras for their inspiration and support during my recent working journey on Data Science.