Welcome to the sixth installment in the PAi and S/4Hana Blog Series. In this installment I’d like to talk about one of the most important stages of a use case – defining the target value to be predicted. It is critical that the target value is correctly defined, as it can have a significant impact on the output for a use case. In this entry, we will see how what appears to be a straightforward classification use case can be solved through two distinctly different targets, one requiring regression and the other classification.
Let us imagine we have a use case where the goal is determining if the delivery of an item will be “late” or “not late” (yes, it is the return of the stock in transit predictive scenario). Let us also assume initial discussions with the business stakeholders have occurred, and that we were informed that on-time delivery means delivery occurring either before or on the planned delivery date. Let us also assume the technical stakeholders have identified all tables/views related to item delivery, and made them available. Now the fun begins!
This initial description suggests the solution requires a classification algorithm. This is a good starting point, though we can only verify it once we have performed an initial data exploration. When we explore the data, our goal is to determine whether:
- the target attribute for the classification task can be materialized from the data
- classification is the best approach to meeting the use case goal, and no alternatives exist which are better supported by the data.
Exploring the Data and Determining the Target
The technical stakeholders inform us the data relating to deliveries exists across 5 tables, and none of these tables contains a column stating whether a delivery was late or not late. This means we need to devise a way to materialize our target from the data.
As shown in Figure 1, looking into the data and from discussions with the technical stakeholders, we uncover several date fields exist:
- Order Creation Date – the date the order was placed
- Planned Packaging Date – the date the order is planned to be packaged
- Actual Packaging Date – the actual date the order packaging occurred
- Planned Shipping Date – the date the order is planned to be shipped
- Actual Shipping Date – the actual date the order shipping occurred
- Planned Delivery Date – the date delivery is planned to occur on
- Actual Delivery Date – the actual date delivery occurred
Figure 1 – Identified Date Fields
From this, and as demonstrated in Figure 2, we identify it is possible to materialize the classification target through determining if the actual delivery date is after the planned delivery date, that is:
Figure 2 – Defining the Classification Target
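As a minimal sketch of this classification target, assuming the planned and actual delivery dates are available as plain Python dates (the function name `is_late` is hypothetical and not part of any PAi or S/4HANA API), the label could be derived like this:

```python
from datetime import date


def is_late(planned_delivery: date, actual_delivery: date) -> bool:
    """Classification target: True when delivery happened after the planned date.

    Delivery on or before the planned date counts as "not late",
    matching the on-time definition given by the business stakeholders.
    """
    return actual_delivery > planned_delivery


# Delivered two days after the plan -> late
print(is_late(date(2020, 3, 10), date(2020, 3, 12)))
# Delivered on the planned date -> not late
print(is_late(date(2020, 3, 10), date(2020, 3, 10)))
```

In practice the same comparison would be applied per delivery item across the joined tables; the sketch only shows the rule itself.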
Exploring the data further, we note an alternative approach is possible: we can materialize a different target that allows us to achieve the same result. Because several date fields are available, it is possible to predict the difference in days between the planned and actual delivery dates. If the predicted value is 1 or more, a late delivery is indicated.
Achieving this would require a different target and a regression algorithm. Before continuing, we decide more information on the date fields is required. We reach out to our business and technical stakeholder colleagues, and are informed all planned date fields are mandatory, with the actual delivery dates updated when the related action occurs.
This means our ideas are plausible, and as shown in Figure 3, we derive the new target:
Figure 3 – Defining the Regression Target
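A minimal sketch of this regression target, assuming it is the signed difference in whole days between the actual and planned delivery dates (the function name `delay_days` is hypothetical), might look as follows:

```python
from datetime import date


def delay_days(planned_delivery: date, actual_delivery: date) -> int:
    """Regression target: days between planned and actual delivery.

    Positive values mean the delivery was late, zero means it arrived
    on the planned date, negative values mean it arrived early.
    """
    return (actual_delivery - planned_delivery).days


print(delay_days(date(2020, 3, 10), date(2020, 3, 12)))  # 2 days late
print(delay_days(date(2020, 3, 10), date(2020, 3, 9)))   # 1 day early
```

Keeping the sign of the difference is a design choice in this sketch: it preserves the information about early deliveries instead of collapsing them to zero.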
Then as shown in Figure 4, using the prediction, we would derive a late delivery through:
Figure 4 – Utilizing Regression Prediction to derive Late Delivery Prediction
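The rule for turning the regression output into a late or not-late flag can be sketched as below. The threshold of 1 day comes from the text above; the function name `late_from_prediction` and the idea of exposing the threshold as a parameter are assumptions made for illustration.

```python
def late_from_prediction(predicted_delay_days: float, threshold: float = 1.0) -> bool:
    """Derive a late-delivery flag from a predicted delay in days.

    A predicted delay of `threshold` days or more is treated as late.
    Regression models usually return fractional values, so the
    comparison is done on floats rather than whole days.
    """
    return predicted_delay_days >= threshold


for prediction in (2.3, 1.0, 0.4, -1.5):
    print(prediction, late_from_prediction(prediction))
```

How fractional predictions just under 1 should be treated is a question for the business stakeholders; this sketch simply applies the stated rule as written.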
That is not the end, though. As shown in Figure 5, after a further review of this new prediction target, we realize it is possible to provide an additional piece of information:
Figure 5 – Calculated Delivery Date Prediction
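One plausible reading of this calculation, assuming the predicted delivery date is the planned delivery date shifted by the predicted difference in days, is sketched below. The function name `predicted_delivery_date` and the rounding to whole days are assumptions, not something stated in the original figure.

```python
from datetime import date, timedelta


def predicted_delivery_date(planned_delivery: date, predicted_delay_days: float) -> date:
    """Shift the planned delivery date by the predicted delay.

    The prediction is rounded to the nearest whole day (an assumption)
    because delivery dates are tracked at day granularity.
    """
    return planned_delivery + timedelta(days=round(predicted_delay_days))


# Planned for 10 March with a predicted delay of about 2 days
print(predicted_delivery_date(date(2020, 3, 10), 2.4))
# Planned for 10 March with a predicted early arrival of about 1 day
print(predicted_delivery_date(date(2020, 3, 10), -1.2))
```

A date derived this way is what could serve as a correction factor for planned delivery dates.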
We again reach out to our business stakeholders, inform them of this new potential piece of information, and ask whether it would be of value. The stakeholders are enthusiastic about using it as a correction factor for planned delivery dates! This is great news.
As you can see, we initially considered a classification solution, but after analyzing the data we uncovered a regression alternative and switched to it. This regression solution not only met the original use case goal, but also generated added value for the stakeholders.
The takeaway is that the definition of the target is a critical component of the use case. Although the use case description may give an initial intuition of the target, the true target definition can only be finalized once the data has been explored. Furthermore, a key part of this process was continuous discussion with our business and technical stakeholders.