PAi Series Blog 4: Approaching Modelling
Welcome to the fourth installment in the PAi and S/4HANA Blog Series. From the previous blogs in this series, you should be familiar with maintaining models within S/4HANA. An outstanding question may remain, however: how do I approach my own predictive scenario?
In this installment we look at answering this question. We will use the predictive scenario ‘Materials Overdue – Stock in Transit’ as a reference, providing examples and intuition for the various tasks. After reading, you will have the knowledge, ability and confidence to start creating and training models for your own predictive scenarios, entering the exciting world of predictive with PAi.
The life-cycle of a predictive scenario can be broken down into the steps below, with steps 2-5 iterated many times until a reasonable result is produced.
Figure 1: Predictive Scenario Life-cycle
For this installment we will focus on steps 1, 2 and 3. Step 4 is covered in an earlier installment, and step 5 will be covered in a later installment.
Predictive Scenario Definition
Figure 2: Predictive Scenario Definition
Once the predictive scenario is defined, you probably have a sense of whether it is a classification or regression problem. This can appear clear; for example, the stock in transit predictive scenario is to determine whether a delivery will be late or not, giving an initial intuition towards using a classification algorithm. It is important to be cautious with such intuitions, reserving the choice of algorithm and solution direction until we have a greater understanding of the data.
Figure 3: Data Exploration and Pre-processing
Although an intuition on how to address the predictive scenario may exist, the only way of knowing whether this intuition is reasonable is by exploring and understanding the underlying historical data related to the predictive scenario, a phase known as data exploration.
When working on a predictive scenario we need to understand the data. This means studying the underlying data structure, obtaining an understanding of the meaning of all the columns and how they relate to the predictive scenario. Depending on the size of the tables, how they are related, and how the information can be retrieved, this can take a considerable amount of time.
In fact, for a data scientist, this is where most time is spent. It is critical to have a strong understanding of the data that will be used for model training, ensuring it is of good quality. A saying you may often hear in this domain is ‘Garbage in, garbage out’.
It is worth noting that if the data represents several joined tables, it does not necessarily mean all columns from each table will be included for model training. Rather, a subset of the data will be selected for use. This subset selection can be accelerated through discussion with stakeholders, who, based on their domain knowledge, can highlight several potentially important and meaningful features. This is often a good starting point.
Furthermore, as our understanding develops, we may re-evaluate our initial sense of how the problem can best be addressed. Such re-evaluations can result in completely changing the target value to be predicted, or in switching from a classification to a regression algorithm. This is not uncommon and is why it is important to avoid biasing oneself early on.
In fact, the definition of the target is a critical part of a use case, deserving special attention. The target needs to be determined while keeping the underlying use case data in mind; one reason for this is to ensure the information contained in the historical data can best be utilized to explain the target. Defining the target variable will be the focus of an upcoming entry.
With an understanding of the data, the target variable identified, and the algorithm selected for the predictive scenario, we can now begin extracting additional information from the data. This increases the model’s ability to learn rules that produce reasonable predictions of the target, a process commonly referred to as feature engineering.
Figure 4: Feature Engineering
The success of a predictive scenario often hinges on the quality of the data fed into the algorithm, and this includes the engineered features: additional information extracted from the dataset that is useful in predicting the target.
Feature engineering involves two components: the first is an understanding of the properties of the task to be solved, and the second is experimental work, where you test expectations and figure out what works and what doesn’t. It is an iterative process and can improve one’s understanding of the problem and underlying data. This deeper understanding often leads to additional experiments, and further engineered features. We will return to this iterative process in our installment on improving the model.
Returning to the Stock in Transit predictive scenario, let us consider what features would be useful in producing reasonable target predictions. Please bear in mind that this is a contrived example aimed at explaining the process and benefits of feature engineering; it does not represent an exhaustive list of the features that could potentially be created.
From defining the target, the target variable was defined as the delivery delay in days – the number of days between the planned delivery date and the actual delivery date. In exploring and understanding the historical data, we identified that the following fields existed:
- Order Creation Date – the date the order was placed
- Planned Packaging Date – the date the order is planned to be packaged
- Actual Packaging Date – the actual date the order packaging occurred
- Planned Shipping Date – the date the order is planned to be shipped
- Actual Shipping Date – the actual date the order shipping occurred
- Planned Delivery Date – the date delivery is planned to occur on
- Actual Delivery Date – the actual date delivery occurred
The actual dates are discarded, as they will not be known when a new record is created. Furthermore, from a fictional discussion with stakeholders, the planned packaging date is identified as optional, and we therefore avoid using it for any feature engineering. With this information we create the following features:
- daysBetweenOrderCreationPlannedShipping := daysBetween(Order Creation Date,Planned Shipping Date)
- daysBetweenOrderCreationPlannedDelivery := daysBetween(Order Creation Date,Planned Delivery Date)
- daysBetweenPlannedShippingPlannedDelivery := daysBetween(Planned Shipping Date,Planned Delivery Date)
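As a minimal sketch, these engineered features might be computed as follows. The DataFrame and its column names are illustrative assumptions for this example, not the actual S/4HANA field names or extraction mechanism:

```python
import pandas as pd

# Illustrative records; column names are assumptions for this sketch,
# not the actual S/4HANA field names.
deliveries = pd.DataFrame({
    "OrderCreationDate":   pd.to_datetime(["2024-01-02", "2024-01-05"]),
    "PlannedShippingDate": pd.to_datetime(["2024-01-04", "2024-01-09"]),
    "PlannedDeliveryDate": pd.to_datetime(["2024-01-10", "2024-01-15"]),
})

def days_between(start, end):
    # Number of whole days from start to end.
    return (end - start).dt.days

deliveries["daysBetweenOrderCreationPlannedShipping"] = days_between(
    deliveries["OrderCreationDate"], deliveries["PlannedShippingDate"])
deliveries["daysBetweenOrderCreationPlannedDelivery"] = days_between(
    deliveries["OrderCreationDate"], deliveries["PlannedDeliveryDate"])
deliveries["daysBetweenPlannedShippingPlannedDelivery"] = days_between(
    deliveries["PlannedShippingDate"], deliveries["PlannedDeliveryDate"])
```

Note that all three features are derived only from planned dates, which will be available at prediction time for a newly created record.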
With these features engineered we have completed an initial iteration and are now ready to split the data into training, validation and test datasets for training and validating our model.
Train, Validate and Test Datasets
When building a machine learning model, the underlying data is split into three datasets: training, validation, and test. The training dataset is used to train the model. During training, the model predicts the responses of a second dataset, the validation dataset. When model training is complete, the learned model is applied against the test dataset – which was ‘held back’ and not used as part of model training – providing an unbiased evaluation of the final model and indicating how the model can be expected to perform on unseen data.
The use of a test dataset is an important part of understanding the quality of a model, providing insight into how the model will generalize and whether it will produce reasonable results against unseen data.
For APL algorithms, the creation of the validation dataset is managed for us, being extracted from the training dataset. This leaves us to focus on separating the data into two datasets: training and test.
A commonly followed strategy is to split the data into 80% training, 10% validation, and 10% test. As the creation of the validation dataset is taken care of by the automated algorithms in APL, we will split the data into two datasets: 90% (training and validation) and 10% (test).
Figure 5: Creating Training, Validation, Testing Datasets – Random Selection
Next we decide what approach to use for selecting the test data. A common recommendation is to randomly select 10% of the data for the test dataset, with the remainder used for the training/validation dataset. This is often used in the academic arena, though for a production environment a more stringent approach is needed.
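A random split of this kind can be sketched in a couple of lines. The dataset here is an illustrative stand-in, not the actual Stock in Transit data:

```python
import pandas as pd

# Illustrative dataset standing in for the historical Stock in Transit data.
data = pd.DataFrame({"order_id": range(100),
                     "delay_days": [i % 7 for i in range(100)]})

# Randomly hold back 10% as the test dataset; the remaining 90% is used
# for training (APL extracts the validation dataset from it for us).
test = data.sample(frac=0.10, random_state=42)
train = data.drop(test.index)
```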
For the Stock in Transit predictive scenario the approach is to set a cut-off date based on the Order Creation Date, splitting the data into train/test on a specific order creation date. Records where the Order Creation Date is on or before the selected date are assigned to the training/validation dataset, and records with an Order Creation Date after this date are assigned to the test dataset. The cut-off date is selected to ensure 10% of the data is assigned to the test dataset and 90% to the training/validation dataset.
Figure 6: Creating Training, Validation, Testing Datasets – Hard Cut based on Date
The intuition behind a hard cut based on a date is that events occurring in the future (beyond the selected cut-off date) may contain behaviour the model has never encountered or learned rules for. Hence, selecting a cut-off date provides a better indication of how the model will perform in a production environment. Randomly selecting 10% of the data may allow the model to learn rules for behaviours it would not normally have seen, giving a false representation of how well the model generalizes. The cut-off date approach gives a better indication of the model’s sensitivity to unseen behaviour, aiding our understanding of the robustness of the model and how it will behave in a production environment.
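A minimal sketch of the cut-off approach, again with illustrative data and column names rather than the actual S/4HANA fields. The cut-off is chosen as the order creation date at the 90th percentile of the sorted dates, so 90% of records fall on or before it:

```python
import pandas as pd

# Illustrative data: 100 orders created on consecutive days.
data = pd.DataFrame({
    "OrderCreationDate": pd.date_range("2024-01-01", periods=100, freq="D"),
    "delay_days": [i % 7 for i in range(100)],
})

# Pick the cut-off so that 90% of records fall on or before it.
dates = data["OrderCreationDate"].sort_values()
cutoff = dates.iloc[int(len(dates) * 0.9) - 1]

train = data[data["OrderCreationDate"] <= cutoff]  # training + validation
test = data[data["OrderCreationDate"] > cutoff]    # held-back test set
```

In practice the split will rarely be exactly 90/10, since many orders share a creation date; the cut-off is simply the date that gets closest to the desired proportions.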
For our use case, we select the Order Creation Date that results in the required 90/10 split of the data. We now have our train/test datasets and are ready to train our model – the approach for this can be seen in the ‘Training and Activating a Model’ series entry. In later installments we will look at how to evaluate our trained model and at improving upon our initial results.