Applying SAP AI Business Services to predict the Incident Resolution Time
Prolog
Your manager or your customer: "How long will it take until this incident is resolved?"
Wouldn't it be great to provide a reliable and precise answer? If you work as a Test Manager, Release Manager, Service Level Manager, or in many other roles, you will face such questions again and again. Unless the root cause has already been identified and the correction implemented and unit tested, you can only guess.
Idea
Why rack your brain searching for an answer? Artificial Intelligence, or more specifically Machine Learning, seems to solve almost any problem these days.
Let's give it a try. SAP AI Business Services [1] currently provides six services:
- Business Entity Recognition
- Data Attribute Recommendation
- Document Classification
- Document Information Extraction
- Invoice Object Recommendation
- Service Ticket Intelligence
From the individual service descriptions, Data Attribute Recommendation looks like the best fit here. Although our problem is originally a regression problem, it can easily be treated as a multi-class classification problem by using discrete values, e.g. the number of weeks it took to resolve the incident. Now we have the service, but where is the data? In our case, SAP Solution Manager is used for Test Management as well as for Incident Management, and all incidents can easily be extracted to CSV format with the Defect Management app.
Implementation
Fortunately, we do not have to reinvent the wheel. First, you will need an account in the SAP Cloud Foundry environment. Second, you can follow one of the tutorials that use the Postman application to call the service APIs.
Either follow the Initial Setup link on the service's help.sap.com page first and then this tutorial, or follow this tutorial directly.
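If you prefer scripting over Postman, the same calls can be made from any HTTP client. Below is a minimal Python sketch of the authentication step, assuming a Cloud Foundry service key with the usual url, uaa.url, uaa.clientid, and uaa.clientsecret entries; all placeholder values are illustrative.

import requests

# Placeholders; the real values come from the service key of your instance.
SERVICE_URL = "https://<your-instance>.cfapps.<region>.hana.ondemand.com"
UAA_URL = "https://<your-subaccount>.authentication.<region>.hana.ondemand.com"
CLIENT_ID = "<clientid from the service key>"
CLIENT_SECRET = "<clientsecret from the service key>"

def get_token() -> str:
    # OAuth2 client-credentials grant against the XSUAA token endpoint.
    resp = requests.post(
        f"{UAA_URL}/oauth/token",
        data={"grant_type": "client_credentials"},
        auth=(CLIENT_ID, CLIENT_SECRET),
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

headers = {"Authorization": f"Bearer {get_token()}"}

All further requests in this post can reuse these headers.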
After setting up the environment, you have to define your dataset schema. Here we distinguish between the features, which describe the entity, and the label, which is the outcome, i.e. the answer we want from the model. For our problem, the label is the duration between the creation of a defect or incident and the time when it is closed or confirmed. For the training data set, this can easily be calculated from the "Created On" and "Last Changed On" dates. To keep the range of label values small, we count the duration in weeks, as sketched below.
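The label can be derived directly from the CSV export, for example with pandas; the column names below are assumptions about the export of the Defect Management app, so adjust them to your file.

import pandas as pd

df = pd.read_csv("defects.csv")
# dayfirst=True is an assumption about the date format of the export.
created = pd.to_datetime(df["Created On"], dayfirst=True)
closed = pd.to_datetime(df["Last Changed On"], dayfirst=True)

# Full weeks between creation and last change, stored as string
# because the schema declares the label as a category.
df["duration_in_weeks"] = ((closed - created).dt.days // 7).astype(str)

df[["priority", "sm_category", "duration_in_weeks"]].to_csv(
    "training_data.csv", index=False)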
For the features, we assume that the attributes "Priority" and "Category" have the most impact on the resolution time.
Dataset schema 1
{
  "features": [
    {
      "label": "priority",
      "type": "category"
    },
    {
      "label": "sm_category",
      "type": "category"
    }
  ],
  "labels": [
    {
      "label": "duration_in_weeks",
      "type": "category"
    }
  ],
  "name": "DT Schema Defect Duration 01"
}
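In the tutorials, this schema is then posted to the service, a dataset is created for it, the CSV file is uploaded, and a training job is started. Here is a sketch of these steps, reusing the token headers from above; the v3 endpoint paths and request bodies follow the tutorials but may change between API versions, so treat them as illustrative.

schema = {
    "features": [
        {"label": "priority", "type": "category"},
        {"label": "sm_category", "type": "category"},
    ],
    "labels": [{"label": "duration_in_weeks", "type": "category"}],
    "name": "DT Schema Defect Duration 01",
}

# 1. Create the dataset schema shown above.
schema_resp = requests.post(
    f"{SERVICE_URL}/data-manager/api/v3/datasetSchemas",
    json=schema, headers=headers).json()

# 2. Create a dataset for the schema and upload the CSV file.
dataset = requests.post(
    f"{SERVICE_URL}/data-manager/api/v3/datasets",
    json={"name": "defect-durations", "datasetSchemaId": schema_resp["id"]},
    headers=headers).json()

with open("training_data.csv", "rb") as f:
    requests.post(
        f"{SERVICE_URL}/data-manager/api/v3/datasets/{dataset['id']}/data",
        data=f, headers={**headers, "Content-Type": "text/csv"})

# 3. Start a training job once the dataset validation has succeeded.
#    Depending on the API version, a modelTemplateId may be required as well.
requests.post(
    f"{SERVICE_URL}/model-manager/api/v3/jobs",
    json={"datasetId": dataset["id"], "modelName": "defect-duration-model"},
    headers=headers)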
The training of the model with 1,400 test defects delivered the following result.
{
  "createdAt": "2020-08-20T15:32:34+00:00",
  "name": "model_22530355533369018",
  "validationResult": {
    "accuracy": 0.3445378243923187,
    "f1Score": 0.28089013383131034,
    "precision": 0.25213358070500924,
    "recall": 0.3445378151260504
  }
}
We can assume that the provided data set was used not only for training but also for validation, with a common split rule such as 80:20. Which validation strategy was actually applied remains unclear.
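As an illustration of such a split (the service performs it internally; this sketch just clarifies the idea, reusing the data frame from the sketch above):

from sklearn.model_selection import train_test_split

# Hold out 20% of the records for validation, keeping the class
# distribution similar in both parts via stratification.
train_df, val_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["duration_in_weeks"])
print(len(train_df), len(val_df))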
Validation Terms
The metrics used are defined as follows.
| | Predicted: Class = Yes | Predicted: Class = No |
|---|---|---|
| Actual: Class = Yes | True Positive (TP) | False Negative (FN) |
| Actual: Class = No | False Positive (FP) | True Negative (TN) |
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = (2 * Precision * Recall) / (Precision + Recall)
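For a multi-class label like ours, these formulas are computed per class and then averaged over the classes. Notably, the reported recall is identical to the accuracy up to rounding, which is exactly what a frequency-weighted average of the per-class recalls yields. A small scikit-learn sketch with made-up predictions illustrates the computation:

from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)

# Made-up ground truth and predictions for a multi-class label,
# not the service's internal validation data.
y_true = ["0", "1", "2", "2", "0", "1", "3", "0"]
y_pred = ["0", "1", "1", "2", "0", "0", "3", "2"]

print("accuracy :", accuracy_score(y_true, y_pred))
# "weighted" averages the per-class values, weighted by class frequency;
# with this averaging, recall coincides with accuracy, as in the results above.
print("precision:", precision_score(y_true, y_pred, average="weighted", zero_division=0))
print("recall   :", recall_score(y_true, y_pred, average="weighted"))
print("f1 score :", f1_score(y_true, y_pred, average="weighted"))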
As our label, the resolution duration in weeks, has more than two values, the metrics precision and F1 score suit best. Unfortunately, the values of 25% and 28% are not satisfying.
Obviously, we have a high bias here. Strategies to address it can be found in my former post [2].
In the current version of the SAP AI Business Services API, hyperparameters can no longer be adjusted. We can, however, add additional features. Let's add the "Title" (description) of the test defect as an additional feature.
Dataset schema 2
{
  "features": [
    {
      "label": "description",
      "type": "text"
    },
    {
      "label": "priority",
      "type": "category"
    },
    {
      "label": "sm_category",
      "type": "category"
    }
  ],
  "labels": [
    {
      "label": "duration_in_weeks",
      "type": "category"
    }
  ],
  "name": "DT Schema Defect Duration 02"
}
The result after passing in the 1,400 test defect items is the following.
{
  "createdAt": "2020-08-20T17:49:07+00:00",
  "name": "model_5878353841224151",
  "validationResult": {
    "accuracy": 0.3361344635486603,
    "f1Score": 0.2665251859257635,
    "precision": 0.28266361089890507,
    "recall": 0.33613445378151263
  }
}
The precision slightly increases to 28%, but the F1 score slightly decreases to 27%.
Further useful attributes are not available in the data set for test defects.
Next, let us check the incidents in the system. With 3,300 items, they form a larger data set. Since categories have not been maintained for them, we will use the incident title (description) and the priority as features.
Dataset schema 3
{
  "features": [
    {
      "label": "description",
      "type": "text"
    },
    {
      "label": "priority",
      "type": "category"
    }
  ],
  "labels": [
    {
      "label": "duration_in_weeks",
      "type": "category"
    }
  ],
  "name": "DT Schema Defect Duration 03"
}
The result after training is the following.
{
  "createdAt": "2020-08-20T19:48:54+00:00",
  "name": "model_6947965492601496",
  "validationResult": {
    "accuracy": 0.2245989300031713,
    "f1Score": 0.10128734372454648,
    "precision": 0.07668312080076785,
    "recall": 0.22459893048128343
  }
}
The precision of 8% and the F1 score of 10% are worse than for the test defects.
When we check the incidents in our data set more closely, we notice that the resolution time ranges from less than one week to more than three months. Probably our label has too large a value range for the limited number of incidents.
We limit the number of values for our duration label as follows (a mapping function is sketched after the list).
- 0 = less than 1 week
- 1 = more than 1 week but less than 2 weeks
- 2 = more than 2 weeks but less than 1 month
- 3 = more than 1 month but less than 3 months
- 4 = more than 3 months
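A simple mapping from the resolution time in days to these classes could look as follows; the thresholds of 30 and 90 days for one and three months are our own approximation.

def duration_class(days: int) -> str:
    # 30 and 90 days approximate one and three months.
    if days < 7:
        return "0"  # less than 1 week
    if days < 14:
        return "1"  # 1 to 2 weeks
    if days < 30:
        return "2"  # 2 weeks to 1 month
    if days < 90:
        return "3"  # 1 to 3 months
    return "4"      # more than 3 months

print([duration_class(d) for d in (3, 10, 20, 45, 120)])  # ['0', '1', '2', '3', '4']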
Training a model with this data set of incidents provides the following result.
{
  "createdAt": "2020-08-20T21:39:51+00:00",
  "name": "model_9062497563787697",
  "validationResult": {
    "accuracy": 0.40106952126650885,
    "f1Score": 0.33715246691025236,
    "precision": 0.30967458753836463,
    "recall": 0.40106951871657753
  }
}
With a precision of 31% and an F1 score of 34%, we receive much better values.
But what do the numbers really mean?
Let us assume we have 100 incidents that actually get resolved in less than one week (class 0). This model will correctly classify about 40 of them (recall).
Let us assume the model classifies 100 incidents as resolved in less than one week (class 0). This will be true for only about 31 of them (precision).
These values may be better than random guessing, but they are not really useful.
As a Test or Release Manager, I need values significantly higher than 50%.
Epilog
We did not find a model that is good enough for us, neither for the test defects nor for the incidents. This is most likely caused by the data sets we used. If we take a closer look at the data, we notice constellations like the one in the table below: features with identical values but different labels (classes), which are evenly distributed. No learning algorithm will be able to classify such data records correctly with high accuracy.
| priority | sm_category | duration_in_weeks |
|----------|-------------|-------------------|
| 2: High | Level 1 > Level 1.2 | 0 |
| 2: High | Level 1 > Level 1.2 | 1 |
| 2: High | Level 1 > Level 1.2 | 2 |
| 2: High | Level 1 > Level 1.2 | 2 |
| 2: High | Level 1 > Level 1.2 | 0 |
| 2: High | Level 1 > Level 1.2 | 0 |
| 2: High | Level 1 > Level 1.2 | 0 |
| 2: High | Level 1 > Level 1.2 | 2 |
| 2: High | Level 1 > Level 1.2 | 1 |
| 2: High | Level 1 > Level 1.2 | 0 |
| 2: High | Level 1 > Level 1.2 | 1 |
| 2: High | Level 1 > Level 1.2 | 1 |
| 2: High | Level 1 > Level 1.2 | 2 |
| 2: High | Level 1 > Level 1.2 | 2 |
| 2: High | Level 1 > Level 1.2 | 1 |
| 2: High | Level 1 > Level 1.2 | 1 |
| 2: High | Level 1 > Level 1.2 | 0 |
| 2: High | Level 1 > Level 1.2 | 2 |
| 2: High | Level 1 > Level 1.2 | 1 |
| 2: High | Level 1 > Level 1.2 | 1 |
| 2: High | Level 1 > Level 1.2 | 0 |
| 2: High | Level 1 > Level 1.2 | 0 |
| 2: High | Level 1 > Level 1.2 | 2 |
| 2: High | Level 1 > Level 1.2 | 3 |
| 2: High | Level 1 > Level 1.2 | 0 |
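Such constellations can be detected directly in the training file, for example by grouping on the feature columns and counting the distinct labels per group (column names as assumed above):

import pandas as pd

df = pd.read_csv("training_data.csv")
# Feature combinations that map to more than one label can never all be
# classified correctly, whatever the learning algorithm.
conflicts = (df.groupby(["priority", "sm_category"])["duration_in_weeks"]
               .nunique()
               .reset_index(name="distinct_labels"))
print(conflicts[conflicts["distinct_labels"] > 1])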
More features are required to give the learning algorithm a chance to perform well. If the solution category (e.g. with values for authorization, coding, performance, etc.) had been maintained, it would have been a good additional feature.
If you want to solve the same problem, feel free to use your own data, as data sets differ among companies. And in case you have a different problem, don't hesitate to test the SAP AI Business Services.
References
[1] SAP AI Business Services, SAP Help Portal (help.sap.com)
[2] Training and Testing Perspective on SAP Leonardo Machine Learning Foundation