Today, building a bot is an easy process. There are many tools, like SAP Conversational AI, that allow you to quickly build your own bot. But accurately measuring the effectiveness of your bot can quickly turn into a big mess. In this article, we’ll go over the process of measuring the performance of your bot training: introducing the relevant metrics, the benchmarking process used to calculate them, and how to read them on a confusion matrix.
This will help you understand the Training Analytics panel of SAP Conversational AI, but it can also be applied outside the platform.
Let’s start with the basics
Before we begin, there is one concept we need to introduce: true/false positives/negatives. Each data point, after classification, belongs to one of these four categories.
Positive/negative refers to the predicted class for the data point. When trying to detect a medical condition, for example, a patient is declared “positive” if the system predicts they have the condition. True/false refers to the success or failure of that prediction. For example, a true negative means that the data point you want to classify doesn’t belong to the class, and that is exactly what you predicted. Combining the two gives four terms to qualify data points: true positive, true negative, false positive and false negative.
This chart sums it up nicely:
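In code, the same bookkeeping can be sketched as follows (a toy example with a hypothetical medical test, where “sick” is the positive class):

```python
def outcome(true_label: str, predicted_label: str, positive: str) -> str:
    """Classify one prediction as TP, FP, TN or FN."""
    if predicted_label == positive:
        return "TP" if true_label == positive else "FP"
    return "FN" if true_label == positive else "TN"

# One (true, predicted) pair per category:
pairs = [("sick", "sick"), ("healthy", "sick"),
         ("healthy", "healthy"), ("sick", "healthy")]
print([outcome(t, p, positive="sick") for t, p in pairs])
# → ['TP', 'FP', 'TN', 'FN']
```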
The key metrics
To measure the performance of bots, we look at three different metrics: precision, recall and F1-score, calculated separately for each intent of the bot. These three metrics give different insights about the performance of each intent of the bot, as well as the bot as a whole. The calculation of these metrics is based upon the four categories of classification we saw above: true positive, true negative, false positive and false negative. Let’s try to intuitively understand the different types of information they give us:
Precision identifies the frequency of correct answers when the prediction is intent A. It can be thought of as the answer to the question “Out of all predictions of A, how many were correct?”
Recall identifies the frequency of detecting A out of all examples actually pertaining to A. In short, it answers the question “Out of all the examples in A, how many were detected?”
Finally, F1-Score calculates the harmonic mean of precision and recall. It helps you answer the question “What is the global performance of prediction with respect to class A?”
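In code, the three definitions boil down to a few lines (a minimal sketch; the function names are ours, not the platform’s):

```python
def precision(tp: int, fp: int) -> float:
    """Of all predictions of A, the fraction that were correct: TP / (TP + FP)."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    """Of all real examples of A, the fraction detected: TP / (TP + FN)."""
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    return 2 * p * r / (p + r) if p + r else 0.0

p, r = precision(tp=2, fp=1), recall(tp=2, fn=2)
print(round(f1_score(p, r), 2))  # → 0.57
```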
To obtain global scores over the performance of your bot, it is best to provide a weighted average of each of the metrics. This way, the global F1-score serves as a general ‘grade’ of the performance of your bot, while the precision and recall scores can be used to understand the best ways to go about improving the performance of your bot.
How do we calculate your bot metrics?
We have formulas, that’s great. But we have to fill them with real numbers! To do that, we run a benchmark on our training data.
When you run a benchmark on SAP Conversational AI, the expressions inside each intent are split into two parts: one part is used for training, and the other, usually much smaller, is used to evaluate the classification. The evaluation is simple: each evaluation sentence is classified against the model trained on the first part, to check whether the first intent returned is the right one. This process is repeated a number of times to smooth out the randomness of the splits. Once the evaluations are done, the results are averaged, taking into account the number of occurrences of each intent, resulting in our three metrics, each ranging between 0 and 1 for each intent.
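The platform runs this for you, but the split-and-evaluate loop can be sketched roughly as below. Everything here is an assumption for illustration: `split`, `evaluate` and the dummy predictor are ours, not SAP Conversational AI’s API.

```python
import random
from collections import Counter

def split(dataset, test_fraction=0.2, rng=random):
    """Randomly split (sentence, intent) pairs into a training part
    and a smaller evaluation part."""
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    cut = max(1, int(len(shuffled) * test_fraction))
    return shuffled[cut:], shuffled[:cut]  # train, evaluation

def evaluate(eval_set, predict):
    """Tally (true intent, detected intent) pairs over the evaluation part."""
    return Counter((true, predict(sentence)) for sentence, true in eval_set)

# One run with a dummy predictor that always answers "plan":
data = [(f"sentence {i}", "plan" if i % 2 else "billing") for i in range(10)]
train, evaluation = split(data, rng=random.Random(0))
print(len(train), len(evaluation))  # → 8 2
print(sum(evaluate(evaluation, lambda s: "plan").values()))  # → 2
```

Repeating this loop and averaging the tallies per intent is what produces the final scores.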
Applying those metrics on real data
Now that you understand the theory, let’s move on to real life! Let’s consider the following chart. In our example, we’re working on the customer support chatbot of a telecommunications company. The chatbot includes the intents plan, billing and trip:
- The intent plan refers to the phone plan of the customer. Sentences could be “I want to change my phone plan” or “Can I add this option to my plan?”
- The intent billing refers to any billing action. This can include “I want a copy of my invoice” or “what’s my billing frequency?”
- The intent trip refers to any international options. This intent usually includes sentences like “I’m traveling to India, do I need additional options for data?”
Here is an extract of the results of our benchmark on this training dataset. In the left column, we have the true intents, and in the right, the detected intents.
For each intent, let’s calculate the precision, recall and F1-Score!
Metrics for the Plan intent
Here, we have 2 true positives: 2 data points were correctly detected as belonging to the Plan intent. We also have one false positive at the bottom of the chart, where the intent Plan was detected instead of Trip. Finally, there are two false negatives: the two sentences detected as Billing when they actually belonged to the Plan intent.
This gives us 0.66 in precision, 0.5 in recall and an F1-Score of 0.57.
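We can double-check these numbers quickly (note that 2/3 rounds to 0.67; above we truncated it to 0.66):

```python
tp, fp, fn = 2, 1, 2  # Plan counts from the benchmark extract

precision = tp / (tp + fp)                           # 2/3
recall = tp / (tp + fn)                              # 2/4
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(round(precision, 2), round(recall, 2), round(f1, 2))
# → 0.67 0.5 0.57
```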
Now we can easily do the same for the two other intents!
Metrics for the Billing intent
Metrics for the bot as a whole
So, we now have the values of the precision, recall and F1-score per intent. But to get a better overview of the global performance of our bot, it is useful to calculate a weighted average per metric:
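The weighted average weights each intent’s score by its number of examples (its support). A minimal sketch, with made-up supports and per-intent scores (only the Plan F1 of 0.57 comes from the example above):

```python
def weighted_average(scores, supports):
    """Average per-intent scores, weighted by each intent's example count."""
    return sum(s * w for s, w in zip(scores, supports)) / sum(supports)

f1_per_intent = [0.57, 0.80, 0.75]  # plan, billing, trip (billing/trip made up)
supports = [4, 5, 3]                # example counts per intent (made up)
print(round(weighted_average(f1_per_intent, supports), 2))  # → 0.71
```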
Congrats! We now have a better idea of how our model is performing. But there’s one elephant in the room we haven’t addressed: Accuracy. Accuracy is often the go-to metric to measure performance. It is the fraction of all predictions that are correct.
However, accuracy can be misleading because it doesn’t tell the whole story. If 80% of the test dataset is intent A, always predicting A gives an accuracy of 0.8! But that doesn’t reflect reality. That’s why it is quite risky to rely on accuracy alone when deciding on the state of your bot and how to improve it. That said, accuracy can still be a valuable metric, especially when the intents of your bot are relatively balanced in terms of training examples.
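The 80% example is easy to reproduce: a degenerate classifier that always answers A scores 0.8 in accuracy while never detecting B at all.

```python
true_labels = ["A"] * 80 + ["B"] * 20
predictions = ["A"] * 100  # "always predict A" classifier

accuracy = sum(t == p for t, p in zip(true_labels, predictions)) / len(true_labels)
print(accuracy)  # → 0.8

# Recall for intent B exposes the problem: B is never detected.
recall_b = sum(t == p == "B" for t, p in zip(true_labels, predictions)) / true_labels.count("B")
print(recall_b)  # → 0.0
```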
Introducing the confusion matrix
The confusion matrix is a visual tool that helps you pinpoint the issues in your detection in more detail, based on the four categories of classification we introduced before. It also allows you to build a clear plan and define a strategy to improve your bot’s performance.
Reading a confusion matrix is simple: you work with rows and columns! The rows represent the true labels, while the columns represent the detected labels. The matrix above uses the same example as before. Out of 4 plan sentences in the dataset, 2 were correctly detected as plan, while 2 were detected as billing.
The perfect confusion matrix has a brightly colored main diagonal, shown in yellow above, with no scores in any of the other areas.
Unfortunately, most of the time your confusion matrix is not going to be perfect. When it isn’t, it can give you insights about the issues in your dataset. At first glance, the information expressed in the previously mentioned metrics is visually apparent: intents with low recall are spread out across their row, while intents with low precision are spread out across their column (think about it!). But there is much more in a confusion matrix! Are there two intents that are too close to each other and get confused frequently? An intent so large that it attracts entries from many other intents? An intent so poorly trained that its tests go in all directions? All of this can be seen in a confusion matrix!
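Building such a matrix from benchmark results is straightforward. A minimal sketch reusing the Plan row from our telecom example (the billing and trip rows here are illustrative, not real benchmark output):

```python
from collections import Counter

# (true intent, detected intent) pairs; rows = true, columns = detected.
pairs = [
    ("plan", "plan"), ("plan", "plan"), ("plan", "billing"), ("plan", "billing"),
    ("trip", "plan"),                          # the Plan false positive
    ("billing", "billing"), ("trip", "trip"),  # illustrative rows
]
labels = ["plan", "billing", "trip"]
counts = Counter(pairs)

print("true/detected", *labels)
for t in labels:
    print(t, *(counts[(t, d)] for d in labels))
```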
That’s why we’ve decided to make this entire process automatic on SAP Conversational AI. Our goal is always to make bot building as fast and as easy as possible.
To achieve that, we’ve created a powerful analytics tab in the Monitor section of SAP Conversational AI. You don’t need an army of data scientists to understand your bot training anymore! With training analytics, you’ll have all key metrics calculated for you automatically and graphically represented in charts and an interactive confusion matrix.
Originally posted at SAP Conversational AI Blog.