Anomaly Forecast of Sensor Data in Energy Intensive Industries – Part II: The Machine Learning Execution
This blog post provides an easy reference and sample code for some of the functionality typically required when dealing with time series data. For detailed background on some of the real-world problems in this area, refer to part 1 of this blog post by Rafael Pacheco.
Here is a high-level overview of parts of the SAP Business Technology Platform that will be discussed for the anomaly prediction of sensor data
The blog post covers the typical ML model life cycle in the context of this use case:
Dealing with Sensor Data
In today’s automated industrial production scenarios there are tens or hundreds of sensors constantly generating new data, which can provide vital information on both the current and the predicted operational metrics. This data can also be analyzed historically for retrospective root-cause analysis. In this blog post we focus on how to extract information from this data in order to predict anomalies in the near future, on the order of minutes, so that corrective action can be taken and the production process can continue with minimal disruption while maintaining quality.
Challenges with Sensor Data
Sensor data from a manufacturing process typically has a few characteristics which need to be dealt with before the data is fit for use in a machine learning algorithm. The data needs to be cleaned of duplicates and checked for consistency; we do not describe these steps here and assume that the available data has been persisted after such checks. The sensor data arrives at uneven time intervals across the many sensors, because the sensors are not necessarily synchronized and a value is recorded only when it changes.
The functionality covered here works for both Cloud and On-prem releases for SAP HANA and SAP Data Intelligence.
Turning Sensor Data into Time Series Data
Here is a sample of data coming from 40+ different sensors, which has been persisted in an SAP HANA database table called HISTORIAN.
To analyze the different signals recorded at uneven time points, the first step is to harmonize the data to a defined time interval. The choice of time interval depends on the data and on the frequency at which it is generated. SAP HANA Time Series provides easy functionality for data harmonization without having to deal with the inherent nuances of the time dimension.
For the above case, let's say we decide to analyze the data at equidistant 30-second intervals. This can be done with the SERIES_ROUND function:
SELECT SERIES_ROUND(DATETIME, 'INTERVAL 30 SECOND') AS TS ,* FROM HISTORIAN
This will create equi-spaced data:
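To make the rounding step concrete outside the database, here is a minimal Python sketch of the same idea. It assumes that timestamps are rounded to the nearest 30-second grid point (with ties rounding up); the helper `round_to_interval` and the sample readings are hypothetical and only illustrate the concept, not how SAP HANA implements SERIES_ROUND.

```python
from datetime import datetime, timedelta

def round_to_interval(ts: datetime, seconds: int = 30) -> datetime:
    """Round a timestamp to the nearest multiple of `seconds` (ties round up)."""
    epoch = datetime(1970, 1, 1)
    elapsed = (ts - epoch).total_seconds()
    # Adding half an interval before flooring rounds to the nearest grid point
    rounded = ((elapsed + seconds / 2) // seconds) * seconds
    return epoch + timedelta(seconds=rounded)

# Hypothetical raw readings arriving at uneven times: (timestamp, value)
readings = [
    (datetime(2022, 1, 1, 8, 0, 4), 5.1),
    (datetime(2022, 1, 1, 8, 0, 21), 5.3),
    (datetime(2022, 1, 1, 8, 0, 47), 5.2),
]

harmonized = [(round_to_interval(ts), value) for ts, value in readings]
for ts, value in harmonized:
    print(ts, value)
```

The three readings land on 08:00:00, 08:00:30 and 08:01:00. If several readings fall onto the same grid point, they still need to be aggregated, for example by averaging.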
Smoothing Sensor Data
The next step is to detect abnormal or anomalous events in the sensor data. In many cases, however, it is not desirable to detect every abnormal value from the sensors, but only consistently abnormal values, so that the analysis is not distorted by spikes which could come from data quality issues.
In our example the target signal whose anomalous behaviour we want to analyze is the SIGNAL P_STEAM_SUPPLY provided by TAG38 in the data above. The raw data looks like this:
Let's say we want to smooth the data by calculating a rolling average over a 5-minute window, which at 30-second spacing means the current row and the 10 preceding rows:
SELECT TS,
       WEIGHTED_AVG(AVG_PRESSURE) OVER (ORDER BY TS ROWS BETWEEN 10 PRECEDING AND CURRENT ROW) AS MOVING_AVG_5MINUTES
FROM (
    SELECT SERIES_ROUND(DATETIME, 'INTERVAL 30 SECOND') AS TS,
           AVG(VALUE) AS AVG_PRESSURE
    FROM HISTORIAN
    WHERE SIGNAL = 'P_STEAM_SUPPLY'
    GROUP BY SERIES_ROUND(DATETIME, 'INTERVAL 30 SECOND')
)
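The effect of such a trailing window on a single spurious spike can be sketched in a few lines of Python. Note that this sketch uses a plain (unweighted) mean for simplicity, whereas the SQL above uses WEIGHTED_AVG; the function `trailing_average` and the pressure values are hypothetical.

```python
def trailing_average(values, window=11):
    """Plain mean of each value and up to `window - 1` preceding values."""
    out = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Hypothetical pressure readings with one spurious spike at index 5
pressure = [5.0] * 5 + [9.0] + [5.0] * 5
smoothed = trailing_average(pressure)

print("raw peak:     ", max(pressure))
print("smoothed peak:", round(max(smoothed), 2))
```

A single outlier is strongly dampened, while a sustained shift in the signal would still show up in the smoothed series after a short delay.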
Pivoting Sensor Data
To create a predictive model which can predict whether or not the target signal drops below a certain value, we need to pivot the data so that the different signals are available as potential features for the ML model. For this we can use the pivoting functionality provided by the Python hana_ml package, detailed in the blog post Pivoting Data with SAP HANA.
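import hana_ml.dataframe as hd

# conn is an existing hana_ml ConnectionContext.
# AVG_VALUE is the per-interval average of each signal, so each pivot cell holds one value.
sql_cmd = """
SELECT SERIES_ROUND(DATETIME, 'INTERVAL 30 SECOND') AS TS,
       SIGNAL,
       AVG(VALUE) AS AVG_VALUE
FROM HISTORIAN
GROUP BY SERIES_ROUND(DATETIME, 'INTERVAL 30 SECOND'), SIGNAL
"""
ts_data = hd.DataFrame(conn, sql_cmd)
print(ts_data.head(5).collect())

# One row per timestamp, one column per signal
ts_pivot = ts_data.pivot_table(index='TS', columns='SIGNAL', values='AVG_VALUE')
print(ts_pivot.head(5).collect())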
The pivoted data will be like below:
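As a rough illustration of what this reshaping does, the following plain-Python sketch turns long-format rows into one record per timestamp with one entry per signal. The rows are made up, and the signal name T_BOILER is hypothetical; only P_STEAM_SUPPLY comes from the data described above.

```python
from collections import defaultdict

# Hypothetical long-format rows: (timestamp, signal name, averaged value)
long_rows = [
    ("08:00:00", "P_STEAM_SUPPLY", 5.1),
    ("08:00:00", "T_BOILER", 182.0),
    ("08:00:30", "P_STEAM_SUPPLY", 5.3),
    ("08:00:30", "T_BOILER", 181.5),
]

# Wide format: one dictionary per timestamp, keyed by signal name
wide = defaultdict(dict)
for ts, signal, value in long_rows:
    wide[ts][signal] = value

for ts in sorted(wide):
    print(ts, wide[ts])
```

Each signal becomes a column that can be used as a feature, with the timestamp acting as the row key.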
ML Model Development
Now that we have the data in the desired format, we can move on to building a predictive model using the native ML capabilities provided by SAP HANA.
The ML capabilities of SAP HANA cover a wide range of algorithms which are available via SQL or Python APIs. For this example we will use the AutoML capabilities provided by the Automated Predictive Library (APL), in particular its auto-classification model.
Let's say we have data over a historic period, collected prior to each event in which the pressure dropped below a threshold. Together with this data we also collect process data providing information on operating conditions, for example the number of batches, recipes and completion percentages across the different consumption lines. Each row of the dataset is labeled with whether or not there will be a pressure drop after 5, 10, 15 and 20 minutes, as described in the results section of part 1 of this blog post.
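The exact labeling rule is described in part 1 and not repeated here, so the following is only one possible reading, sketched in Python: a row gets label 1 if the pressure a fixed number of rows ahead is below the threshold. The function `lead_labels`, the threshold and the pressure values are all hypothetical.

```python
def lead_labels(values, threshold, lead_steps):
    """Label each row 1 if the value `lead_steps` rows ahead is below `threshold`.

    Rows too close to the end of the series get None because their future is unknown.
    """
    labels = []
    for i in range(len(values)):
        j = i + lead_steps
        if j >= len(values):
            labels.append(None)
        else:
            labels.append(1 if values[j] < threshold else 0)
    return labels

# Hypothetical smoothed pressure, one value per 30 seconds
pressure = [5.0, 5.0, 4.9, 4.6, 4.2, 3.8, 4.5, 5.0]

# A 1-minute lead is 2 rows here; a 20-minute lead would be 40 rows
target_lead_1 = lead_labels(pressure, threshold=4.0, lead_steps=2)
print(target_lead_1)
```

With 30-second spacing, a column such as TARGET_LEAD_20 would correspond to `lead_steps=40` under this reading.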
Step 1: Split the dataset into training and validation datasets
If the dataset is imbalanced, which is likely the case with anomalies, we can first upsample it with SMOTE using the native HANA ML functionality:
from hana_ml.algorithms.pal.preprocessing import SMOTE

smote = SMOTE()
new_df = smote.fit_transform(data=df, label='TARGET_LEAD_20', minority_class=1)
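For readers unfamiliar with SMOTE, its core idea is to create synthetic minority samples by interpolating between an existing minority sample and one of its nearest minority neighbors. The sketch below shows only that interpolation step with a neighbor chosen by hand; it is a simplification with hypothetical values, not the PAL implementation.

```python
import random

def synthetic_sample(point, neighbor, rng):
    """Create one synthetic point on the line segment between two minority samples."""
    gap = rng.random()  # random position along the segment, in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(42)

# Two hypothetical minority-class feature vectors (e.g. pressure, temperature)
a = [4.0, 180.0]
b = [3.8, 176.0]

new_point = synthetic_sample(a, b, rng)
print(new_point)
```

Every synthetic point lies between the two originals in each feature, which is why the upsampled minority class stays within the region already covered by real anomalies.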
The upsampled dataset new_df is balanced, and we can then split it into training and hold-out sets:
import hana_ml.algorithms.pal.partition as partition

# Stratified 70/30 split of the upsampled data on the target column
train, test, valid = partition.train_test_val_split(
    data=new_df,
    training_percentage=0.7,
    testing_percentage=0,
    validation_percentage=0.3,
    partition_method='stratified',
    stratified_column='TARGET_LEAD_20')

print("Training Set :", train.count())
print("Validation Set :", valid.count())
print("Percentage of rows with target = 1 in training set:",
      round(train.filter('TARGET_LEAD_20 = 1').count() / train.count() * 100, 2))
print("Percentage of rows with target = 1 in validation set:",
      round(valid.filter('TARGET_LEAD_20 = 1').count() / valid.count() * 100, 2))
We do not need to create a separate test set, as the APL library function does that on its own.
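What a stratified split guarantees can be illustrated with a small Python sketch: each label value is split separately, so both parts keep roughly the same share of positives. The helper names and the sample rows are hypothetical, and the PAL function may differ in its details.

```python
import random

def stratified_split(rows, label_of, train_fraction=0.7, seed=0):
    """Split rows so each label keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    train, valid = [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        valid.extend(group[cut:])
    return train, valid

def positive_share(part):
    """Fraction of rows whose label is 1."""
    return sum(label for _, label in part) / len(part)

# Hypothetical rows: (row id, TARGET_LEAD_20 label), with 20% positives
rows = [(i, 1 if i % 5 == 0 else 0) for i in range(100)]
train, valid = stratified_split(rows, label_of=lambda r: r[1])

print(len(train), len(valid), positive_share(train), positive_share(valid))
```

Both parts end up with the same 20% share of positives as the full dataset, which a purely random split would only achieve approximately.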
Step 2: Train the model
from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingBinaryClassifier

# features is the list of feature column names to train on
model = GradientBoostingBinaryClassifier()
model.fit(train, key='TIMESTAMP_STR', features=features, label='TARGET_LEAD_20')
Step 3: Check Model Quality
# Get performance metrics
model.get_performance_metrics()

# Get feature importance
model.get_feature_importances()['ExactSHAP']

# Check model on hold-out set
score = model.score(valid)
print("Model score:", score)
Step 4: Save the ML Model in HANA
from hana_ml.model_storage import ModelStorage

model_storage = ModelStorage(connection_context=conn)
model.name = 'MY_MODEL_NAME'
model_storage.save_model(model=model, if_exists='replace')
ML Model Deployment
In the current scenario we would typically want to keep updating the training data as more anomalies appear, which requires frequent re-training. The automation of the above model training can be achieved via SAP Data Intelligence.
SAP Data Intelligence provides a few different options to create an ML training pipeline. It comes with ready-to-use templates which can help create the training pipeline in a matter of minutes.
In the ML Scenario Manager of SAP Data Intelligence, create a pipeline of type HANA ML Training.
This creates a pipeline which has configurable operators. In our case we configure it for the steps above, for example the table which holds the training data, which field to use as the target, how to split the data between training and test, and which ML algorithm to use:
When the pipeline is executed it will save the model as an artifact which is later used for inference on new data. It also generates the accuracy score by default.
The template pipeline provides ready-to-use functionality to create a pipeline which can then be set to run on a scheduled basis.
ML Training Customizations
However, there could be cases where this template is not sufficient. For example, in our case we need to do some preprocessing and upsampling before the training step. In such cases you can extend the template with additional operators to handle the additional steps. SAP Data Intelligence comes with many built-in operators. In case these are not sufficient, custom operators can be built as described in this Custom Operator blog post.
Also, for classification algorithms it is often not sufficient to monitor only the accuracy; you would also want to track the F1 score. This can be done by configuring the operator to track additional metrics, as described in the SAP Data Intelligence Tips & Tricks blog post.
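Why accuracy alone can mislead on imbalanced anomaly data is easy to see with a small worked example. The sketch below uses hypothetical hold-out labels and a model that never predicts an anomaly; the function `accuracy_and_f1` is written here for illustration and is not part of SAP Data Intelligence.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1 score for the positive class 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)

# Hypothetical hold-out labels: 95 normal rows and 5 anomalies
y_true = [0] * 95 + [1] * 5
# A model that never predicts an anomaly
y_pred = [0] * 100

accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print("accuracy:", accuracy, "F1:", f1)
```

The model reaches 95% accuracy while missing every anomaly, and the F1 score of 0 makes that failure visible.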
ML Model Consumption
Now that we have a pipeline which creates ML models, we can use the generated model for inference on new data. This is the whole reason we came down this path: to predict in advance, as new data arrives, when the pressure will drop.
To enable this, SAP Data Intelligence again provides template pipelines which take a previously generated model and apply it to new data. For this we use either HANA ML Inference on Dataset or HANA ML Inference on API Requests, depending on how the new data becomes available.
This pipeline takes as input the model artifact id which is available from the training step above.
As with the training you then configure the SAP HANA Connection where the inference should take place. This could be different from where the training was done.
In case the inference needs to be done independently of SAP HANA, it is also possible to use the JavaScript version of the ML model, as described in the blog post by Stojan Maleschlijski, MLOps in practice: Applying and updating Machine Learning models in real-time and at scale.
As I come to the end of this blog post, I hope, as always, that it is of some help when you find yourself dealing with similar problems, which recur not only for ML under Beer Pressure but in many ML scenarios under time pressure.