How machine learning helps predict the time-to-completion of a ticket
“When will my service ticket be completed?”
There is growing recognition that high-quality customer service is key in today’s competitive markets. Companies are seeking ways to meet customer demands and increase their satisfaction. When customers seek help via a service ticket, they expect quick, transparent and effective responses. On the other side of the transaction, a team of service agents is processing a high volume of diverse tickets. The agents’ goal is to provide a pleasant interaction with the customer while resolving their assigned tickets. In reality, however, some seemingly easy and frequently asked questions, such as ‘how long is this going to take?’, are often difficult to answer. Imagine a customer requesting an estimate of how long it will take to fix a wind turbine located in South America. The agent can spend quite a bit of time and effort figuring out a proper response. What if we could provide an estimated time to completion to the service agent as part of a set of extended ‘predicted’ properties of service tickets?
We propose a predictive model to estimate the time to complete a ticket by leveraging the hidden structure of historical records and the use of machine learning algorithms. The predictive models provide a customized solution based on individual customer datasets. After the models are trained with customer data, they are applied to new tickets at the time of creation. Thus, service agents can make use of an estimated time-to-completion in the early stages of the customer interaction.
Data Collection and Feature Engineering
Surprisingly, data collection and feature engineering are the hardest steps to accomplish when building predictive models in the enterprise domain. From GDPR constraints, to the understanding of variable interactions, to unforeseen changes in the distribution of model variables, understanding the essential flow behind customer service is key for the success of predictive models.
We propose a model that estimates the time required to complete a ticket. This prediction is provided at the moment the ticket is created, with the limited data available at that time. We might have access to data such as: who creates the ticket, who is initially assigned to handle the ticket, the category of the ticket, when the ticket was created, among other variables.
Who is processing the ticket – some service agents are more experienced than others in assigning tickets to the right teams, which helps reduce the ticket transition time between teams.
When the ticket is created – time of creation can be a critical variable for service organizations that work with a non-continuous schedule. For example, extra minutes can be added if a ticket is created during lunch time, or if a ticket is created on a Friday afternoon and has to wait until the next Monday to be processed. Holidays and vacation days can also affect the response time if service personnel are reduced.
Ticket category – specific ticket categories may be more complex and require more time to resolve. A ticket requesting information about a company policy could have a straightforward – and short – resolution. However, a ticket opened to handle a machine repair could take multiple weeks.
Ticket Priority and escalations – the higher the priority and/or escalation level, the more attention and resources are provided to a ticket. This could potentially mean a quicker resolution, but the fact that a ticket has been put in an escalation or set as high priority may imply more complexity and thus, a longer time to resolution.
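To make the feature engineering above concrete, here is a minimal sketch of how such creation-time features could be assembled with pandas. The column names, values and the `is_friday_pm` flag are illustrative assumptions, not the production schema.

```python
import pandas as pd

# Hypothetical ticket log; column names and values are illustrative only.
tickets = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2023-03-03 12:15", "2023-03-10 17:30", "2023-06-01 09:00"]),
    "agent_id": ["a17", "a02", "a17"],
    "category": ["policy_info", "machine_repair", "policy_info"],
    "priority": ["low", "high", "medium"],
})

features = pd.DataFrame({
    # Time-of-creation signals: weekday and hour capture the lunch-time
    # and Friday-afternoon effects described above.
    "weekday": tickets["created_at"].dt.dayofweek,
    "hour": tickets["created_at"].dt.hour,
    "is_friday_pm": ((tickets["created_at"].dt.dayofweek == 4)
                     & (tickets["created_at"].dt.hour >= 12)).astype(int),
})

# Categorical features (agent, category, priority) are one-hot encoded.
features = features.join(
    pd.get_dummies(tickets[["agent_id", "category", "priority"]]))
print(features.shape)
```

Only information available at ticket creation is used here; anything observed later (reassignments, escalation history) would leak the future into the model.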
Machine learning algorithms are generally categorized as: supervised learning, unsupervised learning, semi-supervised learning and reinforcement learning. The first two categories are possibly the most widely known and used.
Supervised learning bases its model on historical examples of data where a dependent variable (normally one) is explained with multiple independent variables. The dependent variable, often referred to as the target variable, can consist of discrete or continuous values. An example of a discrete target variable is whether a ticket is spam or not. An example of a continuous target variable is the price of a house. While supervised learning focuses on explaining a target variable, unsupervised learning focuses on describing the overall structure and patterns of the data.
Our time-to-completion model is derived from a supervised learning algorithm. While the target variable – the time to complete a ticket – is a continuous variable, we have experimented with a discrete form of the variable and found interesting results. Regression models were derived for the continuous version of the target variable. Multi-class classification models were derived for a discrete target variable representing time-to-completion ranges.
Ensemble methods apply the principle that unity is strength to machine learning algorithms. The idea behind ensemble algorithms is to generate a stronger learner based on several weaker learners. Our approach to multi-class classification uses the principle of ensemble methods. Specifically, we use a boosting technique where the errors of each tree-based model determine the examples that the next model will focus on.
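As a minimal sketch of such a boosted tree ensemble, the snippet below fits scikit-learn's `GradientBoostingClassifier` on synthetic data with three duration classes. The data, feature count and hyperparameter values are assumptions for illustration; the real model is trained on customer ticket features.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data: 300 "tickets", 5 numeric features,
# 3 duration classes (0 = short, 1 = medium, 2 = long).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int) + (X[:, 2] > 1).astype(int)

# Boosting: each shallow tree is fit to the residual errors of the
# ensemble built so far, so many weak learners combine into a
# stronger multi-class model.
model = GradientBoostingClassifier(
    n_estimators=100, max_depth=3, learning_rate=0.1)
model.fit(X, y)
print(model.score(X, y))   # training accuracy of the ensemble
```

Each individual depth-3 tree is a weak learner; the strength comes from the sequential combination of a hundred of them.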
Model Selection and Prediction Results
The question is: how do we select a model when we have multiple competing options? Often, model quality metrics such as accuracy, precision, recall, ROC, F1 score and others are used to evaluate model performance. Accuracy, for example, indicates the fraction of predictions that we got right. These metrics are key indicators from the pure model quality perspective. However, ‘what is right’ for the customer might be relative to their expectations. Model interpretability and the consumption of predicted values are just as important.
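For reference, these metrics are a few lines with scikit-learn. The toy labels below are invented; in the multi-class setting, precision, recall and F1 need an averaging strategy (here `macro`, i.e. an unweighted mean over the duration-bin classes).

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Toy multi-class labels; classes stand for duration bins 0, 1, 2.
y_true = [0, 0, 1, 1, 2, 2, 0, 1]
y_pred = [0, 1, 1, 1, 2, 0, 0, 1]

print(accuracy_score(y_true, y_pred))     # fraction predicted right
print(precision_score(y_true, y_pred, average="macro"))
print(recall_score(y_true, y_pred, average="macro"))
print(f1_score(y_true, y_pred, average="macro"))
```

Which metric matters most depends on the business question: for time-to-completion ranges, being roughly right on most tickets usually beats being exactly right on a few.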
In this use case (ticket time-to-completion), a predicted ‘time interval’ rather than an exact duration is what matters. From the user’s perspective, providing a timeframe estimation aligns better with their expectations than an exact number. Imagine a service agent on a phone call trying to answer a question about ticket duration; generally, customers would be happy with an answer such as “this might take about 3-5 days”. There is no added value, and in fact it might be awkward, to say, “your ticket is estimated to complete in 3 days and 13 hours”.
Fig. 1-A Long tail distribution
Fig. 1-B Distribution with fixed-width bins
Fig. 1-C Distribution with log scale transformation
Our model balances accuracy and interpretability by creating a multi-class classification ensemble that uses a discrete form of the ticket duration as the target variable. If we observe multiple cases of ticket duration, often the time to complete a ticket will display a distribution like the one illustrated in Fig. 1-A – a long tail distribution where most tickets are completed quickly rather than taking a long time. It is even more obvious if we plot the distribution with fixed-width bins – most of the data points are concentrated on the left-hand side of the graph. If we take that distribution and create bins to define ticket duration ranges, we can imagine that the narrower the bin, the less accurate our results will get. On the other hand, the wider the bin, the coarser the prediction range. There is a point where a bin gets so wide that the prediction stops being useful. Our approach to defining bins for the target variable is to apply a logarithmic transformation. More granular bins are provided for the head of the distribution and coarser bins are left to the tail. For example, using log scale with base 5 will generate bins [0,5], (5,25], (25,125], and so on. Fig. 1-C illustrates the latter.
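The base-5 binning just described can be sketched in a few lines of NumPy; the edge values follow the [0,5], (5,25], (25,125] scheme above, and the unit (hours) is an assumption for illustration.

```python
import numpy as np

# Log-scale binning of ticket duration (base 5; hours are a
# hypothetical unit): bin 0 = [0, 5], bin 1 = (5, 25],
# bin 2 = (25, 125], bin 3 = anything longer.
base, n_edges = 5, 3
edges = base ** np.arange(1, n_edges + 1)        # [5, 25, 125]

durations = np.array([2, 5, 6, 25, 80, 125, 600])
bins = np.digitize(durations, edges, right=True)
print(bins.tolist())   # [0, 0, 1, 1, 2, 2, 3]
```

The discrete `bins` array then becomes the target variable for the multi-class classifier: narrow classes where most tickets live, coarse classes in the long tail.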
“Does data volume matter?”
For our use case, before a minimum number of records is reached, the data volume will significantly influence model performance. Small data sets do not hold enough ‘information’ to be representative of the general cases. After a certain number of records, model performance is mostly influenced by other factors relative to data quality, such as the percentage of missing values, the diversity of values in categorical features, etc. Fig. 2 illustrates a general pattern we found across data sets – model performance increases with data volume until it stabilizes at an accuracy level dictated by factors other than data volume.
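This pattern can be reproduced with a simple learning-curve experiment: train the same model on growing subsets and score it on a fixed held-out set. The synthetic data and subset sizes below are assumptions chosen only to illustrate the rise-then-plateau shape.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with a learnable decision boundary.
rng = np.random.default_rng(1)
X = rng.normal(size=(3000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train on growing slices of the training data, score on the same test set.
scores = []
for n in [50, 200, 1000, len(X_train)]:
    model = GradientBoostingClassifier(random_state=0)
    model.fit(X_train[:n], y_train[:n])
    scores.append(model.score(X_test, y_test))
print(scores)  # accuracy typically rises with n, then flattens out
```

On real ticket data the plateau level is set by data-quality factors (missing values, label noise, feature diversity) rather than by volume.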
Fig. 2 One typical pattern between data volume and model performance
“Can I move some knobs to improve model performance?”
We use a machine learning algorithm that is based on a number of boosted decision trees and a discrete form of ticket duration that divides the continuous variable space into log-scale bins. If we ask data scientists how to improve the model performance, immediate suggestions will pop up: why don’t you increase the number of trees, or the width and depth of each tree? Can you change the learning rate, to balance convergence time without missing results close to the optimum? Can you modify the bin width and log base used to transform the target variable? And so on. In order to test many of these options, we have implemented an external loop that tries a combinatorial set of options with a hyperparameter tuning algorithm – the algorithm helps us to ‘move the knobs’ (the hyperparameters) that produce different model outcomes. After the tuning algorithm has iterated over several sets of hyperparameters, we keep the one with the best performance, while limiting the iterations to a specific timeframe and improvement rate.
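One common way to implement such an external tuning loop is randomized search with cross-validation; the sketch below uses scikit-learn's `RandomizedSearchCV` with an `n_iter` cap standing in for the iteration budget mentioned above. The grids, data and budget values are assumptions, not our production configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data for the ticket features and binned durations.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(int)

# The 'knobs': number of trees, depth of each tree, learning rate.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1, 0.2],
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions,
    n_iter=5,          # budget cap: try only 5 hyperparameter sets
    cv=3,              # 3-fold cross-validated score per set
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The best cross-validated combination is kept; a time-based stopping rule, as in our setup, would wrap this loop with a wall-clock or improvement-rate check.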
“Do I need to do anything over time to maintain my model?”
To maintain model performance, retraining is suggested when the data distribution drifts. This could happen when major changes are made to certain features. For example, the ticket priority level changes from three categories (low – medium – high) to five categories (very low – low – medium – high – very high), in which case we suggest immediate retraining. Another, more common scenario is when the system generates a significant amount of new data. Since this will happen naturally over time, we suggest triggering retraining at least once every 6 months. Also notice that it is unnecessary to retrain the model too frequently, since doing so will barely change the results.
Going beyond predictions
After hyperparameter tuning and feature engineering we have our best model. We provide predictions that will empower service agents with an estimation of the time to complete tickets. Our models have a reasonable level of accuracy, say around 60 or 70%, which sounds good in this business scenario. It seems that we are good to go.
One thing we want to point out here is that although model accuracy might look good, such a metric is a global model metric and does not dictate the actual error of an individual prediction. Again, quantitative model quality is important, but so are interpretability and end user expectations about predictive models. A large prediction error for a single case can give the wrong impression about the validity of the entire model and its ability to handle all other individual predictions. Fig. 3 illustrates the relation between predicted and actual values for a target variable. Most of the predicted values are close to the actual ones; however, there are a few outliers that are far from reality.
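Such trust-damaging outliers can be surfaced automatically by comparing each individual error against the typical error. The numbers and the median-based rule below are illustrative assumptions, not output from our models.

```python
import numpy as np

# Predicted vs. actual ticket durations (hours); values are made up.
actual    = np.array([4.0, 6.0, 20.0, 30.0,  5.0, 110.0])
predicted = np.array([5.0, 7.0, 18.0, 33.0, 60.0, 107.0])

errors = np.abs(predicted - actual)
# Flag predictions whose error dwarfs the typical (median) error: a
# single large miss like index 4 can undermine trust in an otherwise
# accurate model, so these cases deserve review before being shown.
outliers = errors > 3 * np.median(errors)
print(np.flatnonzero(outliers).tolist())   # [4]
```

In practice, flagged predictions could be suppressed or shown with a wider interval, so one bad estimate does not discredit the model as a whole.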
Fig. 3 Large error between prediction and reality
Special thanks to Veronica Gacitua Decar, Frank Higgins and Sunny Choi for improving this article.
I wonder if this would help estimate the time it takes to design, code, test a program. I imagine building the correct data would need to be part of it.
Interesting – it got me thinking of other things.
Thanks for the comment! There are certainly many potential use cases. But yes, having the "correct" data with good quality is always important when developing such models. And it is usually the hard part.
Is there any way I can get hold of the code to understand how it is being done? I am trying to replicate the same in my project.