Evaluating recommender systems in the absence of labeled data
It is widely known that machine learning methods require large amounts of training data. More often than not, however, labeled training data for a given use case is scarce. In such settings, unsupervised machine learning methods are a popular (and often the only feasible) way to create end user value, as they don’t require any labels for training. Unsupervised models are especially useful when the goal is to understand and use the relationships between data samples rather than to predict a target variable.
Performance evaluation is hard without labeled data
Evaluating the performance of unsupervised models is much more complex than evaluating supervised ones. In supervised learning, the labeled data (ground truth) can be used directly as the target against which predictions are measured. It is much harder to quantify the performance of a method when such labeled data is scarce or not available at all.
Consider, for example, a large wine producer who wants to increase wine sales through targeted recommendations. Their goal would be to have a content-based recommender system that can reliably identify similar wines (i.e., wines with similar taste) for a given input wine. There are many ways to construct the preprocessing and recommendation parts and every option might return slightly different recommendations. At the same time, it is hard to evaluate which method works best for this specific problem if there is no information about which recommendations are actually “optimal”. Thus, the question emerges of how to evaluate models without having a precise ground truth.
By-the-book methods may not be sufficient
Imagine that the wine producer has a list of wine descriptions consisting of numerical, categorical and textual attributes similar to the ones in Figure 1. Arguably the most critical question is how to appropriately embed the textual features so that they can be used for recommendation in conjunction with the categorical and numerical features. To judge what “appropriately” means, we need a way to evaluate the performance of different embedding methods in the context of the recommender system.
The common way to assess the performance of a recommender system would be through standard metrics such as Accuracy, Precision or Recall [1,2]. However, these metrics require ground truth knowledge about which recommendations are correct, which is hard to obtain at a large scale in our specific problem setting. In the absence of sufficient amounts of ground truth data, alternative metrics need to be used.
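To make the dependence on ground truth concrete, here is a minimal sketch of precision and recall computed on the top k recommendations. The function name, the wine IDs and the set of relevant wines are assumptions made up for illustration; the point is only that the `relevant` argument is exactly the information that is missing in our setting.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Return (precision@k, recall@k) for one ranked recommendation list."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical wine IDs, made up for illustration.
recommended = ["w7", "w2", "w9", "w4", "w1"]  # ranked output of a recommender
relevant = {"w2", "w4", "w8"}                 # ground truth that is usually missing
precision, recall = precision_recall_at_k(recommended, relevant, k=5)
print(f"precision@5={precision:.2f}, recall@5={recall:.2f}")
```

Without a trustworthy `relevant` set for a large number of input wines, neither number can be computed.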
The machine learning domain of clustering offers a range of evaluation metrics, some of which do not require ground truth labels. These metrics typically assess and compare the degree of separation within and between individual clusters. As they focus only on structural properties in the vector space, it is still hard to judge whether clusters are actually meaningful. This is particularly challenging in the given setting, since the presence of textual data typically yields a very high-dimensional problem (originating from the text embeddings). Due to the high dimensionality, there typically exist not just one but many identifiable subspaces in which some data points lie close together and form clusters that are far apart from each other. Still, there is no guarantee that such a cluster is actually meaningful in the business domain (i.e., that the wines in that cluster do taste similar).
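One well-known label-free metric of this kind is the silhouette coefficient, which the text above does not name explicitly; it is used here only as an example. The sketch below is a plain-Python version under the assumptions of Euclidean distance and at least two clusters, with made-up two-dimensional points standing in for real embeddings.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient (Euclidean distance, needs >= 2 clusters)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Made-up 2D "embeddings": two tight groups that lie far apart.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good_score = silhouette_score(points, [0, 0, 1, 1])  # matches the structure
bad_score = silhouette_score(points, [0, 1, 0, 1])   # cuts across the groups
print(good_score, bad_score)
```

A high score only says that the clusters are compact and well separated in the vector space. It says nothing about whether the wines in a cluster taste alike.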
Domain-driven evaluation methods can be useful alternatives
To overcome these challenges, it can be crucial to leverage any available business domain knowledge to tell suitable models apart from less suitable ones. In the following, we therefore look at a range of domain-driven methods as possible alternatives to textbook evaluation metrics.
1. Supervised proxy problem
One approach is to use a supervised “proxy problem”: a related supervised learning algorithm is trained to predict one of the features in the data as a target variable, using other available features as the input. This evaluation method is commonly used to assess word embedding models by using the resulting embeddings as an input to a subsequent supervised task [3]. The performance on this proxy task is used to estimate how well the embedding vector represents the information contained in the data. Generalizing this approach, the overall recommender model making use of numerical, categorical and text embedding features can be understood as a comprehensive feature encoding method. As such, its performance can be assessed by means of a proxy problem.
In the context of wine recommendation, we may want to evaluate which text embedding method works best for this specific purpose. We can construct a supervised proxy problem by predicting the grape variety of a wine based on the embedding of its textual description features, i.e., the tasting notes and the dishes that it goes well with. This approach is also sketched in Figure 2. Our underlying assumption is that embeddings that allow the grape variety to be predicted well are also well suited for the overall recommendation task. Therefore, the proxy problem can be used as an indirect performance measure of the overall recommender.
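The proxy evaluation could be sketched as follows. The text does not prescribe a particular classifier, so this example assumes a leave-one-out 1-nearest-neighbour classifier with cosine similarity as a simple stand-in; the two embedding sets and the grape labels are made up. In practice, any classifier evaluated with cross-validation could take its place.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def proxy_accuracy(embeddings, targets):
    """Leave-one-out 1-nearest-neighbour accuracy for predicting `targets`."""
    correct = 0
    for i, query in enumerate(embeddings):
        # Nearest neighbour among all other wines, by cosine similarity.
        nearest = max(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: cosine_similarity(query, embeddings[j]),
        )
        correct += targets[nearest] == targets[i]
    return correct / len(embeddings)


# Hypothetical data: one grape label per wine, two candidate embedding methods.
grapes = ["riesling", "riesling", "merlot", "merlot"]
embeddings_a = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
embeddings_b = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]

accuracy_a = proxy_accuracy(embeddings_a, grapes)
accuracy_b = proxy_accuracy(embeddings_b, grapes)
print(accuracy_a, accuracy_b)
```

Under the assumption stated above, the embedding method with the higher proxy accuracy would be the preferred candidate for the recommender.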
2. Smoke tests
In the discipline of software engineering, smoke tests are basic sanity checks that aim to detect simple failures of a system. This notion can be transferred to machine learning models and, more specifically, the recommender setting by defining domain-driven heuristics that specify which properties a good recommendation must fulfill in relation to the input. If any of these properties is not met to a sufficiently high degree, this is a strong indicator of suboptimal performance of the recommender.
For the domain of our wine recommender example, we could consider the following smoke tests:
- We could measure the overlap between the input and output samples in certain key features. For example, recommended wines should predominantly feature the same wine type as the input wine (e.g., red wine for a red wine input). Also, grape variety, origin/region and tasting notes should have a rather high overlap between input and output. Figure 3 shows the example of measuring the overlap in tasting notes between input and recommendations. Which features to use for smoke tests is of course highly domain-dependent and needs to be defined jointly by domain experts and data scientists. Assuming that these features are chosen carefully, such a test allows for basic sanity checks to evaluate the correctness of the overall approach, as well as a rudimentary comparison between several methods, even in the absence of labeled training data.
- Alternatively, we could assess the stability of recommendations for a given input when replacing the values of some of its features with randomly selected values from other records of the dataset. For features that domain experts rate a priori as important for recommendations, a random value change should result in rather different recommendations. For example, consider a red wine as an input. If we keep all other features but change its type to ‘white wine’, we expect the recommended alternatives to change rather drastically compared to those for the original input. A high recommendation overlap, on the other hand, can be interpreted as an indicator of low feature importance, as the recommendations stay the same even when the input feature is changed randomly.
There are many more ways to construct smoke tests, with varying degrees of data alteration involved. Ultimately, however, the final design always depends on the specific use case, the available domain knowledge and the basic assumptions that can be made in the specific setting.
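Both smoke tests from the list above can be sketched in a few lines. Everything in this example is an assumption for illustration: the toy catalog, the field names, and in particular the `recommend` function, which is a trivial stand-in that ranks wines by the number of matching feature values and would be replaced by the real recommender.

```python
import random

# Hypothetical toy catalog; field names and values are made up.
CATALOG = [
    {"id": "w1", "type": "red", "grape": "merlot", "region": "bordeaux"},
    {"id": "w2", "type": "red", "grape": "merlot", "region": "tuscany"},
    {"id": "w3", "type": "red", "grape": "syrah", "region": "rhone"},
    {"id": "w4", "type": "white", "grape": "riesling", "region": "mosel"},
    {"id": "w5", "type": "white", "grape": "riesling", "region": "alsace"},
    {"id": "w6", "type": "white", "grape": "chardonnay", "region": "burgundy"},
]
FEATURES = ("type", "grape", "region")


def recommend(query, catalog, k=2):
    """Stand-in recommender: rank by number of matching feature values."""
    candidates = [w for w in catalog if w["id"] != query["id"]]
    candidates.sort(key=lambda w: -sum(w[f] == query[f] for f in FEATURES))
    return candidates[:k]


def feature_overlap(query, recommendations, feature):
    """Smoke test 1: share of recommendations that match the input on `feature`."""
    matches = sum(r[feature] == query[feature] for r in recommendations)
    return matches / len(recommendations)


def perturbation_overlap(query, catalog, feature, k=2, seed=0):
    """Smoke test 2: share of recommendations kept after randomly changing `feature`."""
    rng = random.Random(seed)
    # Draw the replacement value from other records of the dataset.
    other_values = [w[feature] for w in catalog if w[feature] != query[feature]]
    perturbed = {**query, feature: rng.choice(other_values)}
    before = {r["id"] for r in recommend(query, catalog, k)}
    after = {r["id"] for r in recommend(perturbed, catalog, k)}
    return len(before & after) / k


query = CATALOG[0]
type_overlap = feature_overlap(query, recommend(query, CATALOG), "type")
type_stability = perturbation_overlap(query, CATALOG, "type")
print(type_overlap, type_stability)
```

For a feature that experts consider important, the first number should be high and the second one low. Averaging both over many input wines and several random seeds would give a more robust picture than a single query.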
3. Ground truth tests
Even though large-scale labeled training data is not available, domain experts might still be able to provide small amounts of ground truth data. In our example setting, wine experts might be able to provide a few lists of 3-5 wines each which they believe are very similar. Such amounts of data are typically not enough to train a supervised model, but they allow us to test the unsupervised recommender system by using one of the data samples of each list as an input to our system and assessing the ranks of the other samples in the resulting list of recommendations. A well-working recommender system should rank other wines from the same list as the input wine reasonably high. If this is not the case, it is an indicator that the recommender is not yet working the way it should; exploring which wines are recommended instead provides helpful insights into how the recommender works and what could be done to improve it. The test alone might not be sufficient for extensive hyperparameter tuning, but in combination with other tests, such as the ones sketched above, it may still help to guide the selection of an appropriate unsupervised method.
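A minimal sketch of such a rank-based test is shown below. The expert lists, the wine IDs and the precomputed rankings are made up; `toy_rank_candidates` is a hypothetical placeholder for the real recommender and is assumed to return a ranking of the full catalog for a given input wine.

```python
def ranks_within_expert_list(expert_list, rank_candidates):
    """Use the first wine as input and return the 1-based ranks of the others."""
    query, *expected = expert_list
    ranking = rank_candidates(query)  # assumed to cover the whole catalog
    return {wine: ranking.index(wine) + 1 for wine in expected}


# Hypothetical rankings that a recommender might return for two input wines.
RANKINGS = {
    "w1": ["w2", "w5", "w3", "w4", "w6"],
    "w4": ["w6", "w5", "w1", "w2", "w3"],
}


def toy_rank_candidates(wine_id):
    return RANKINGS[wine_id]


# Lists of wines that (hypothetical) experts consider very similar.
expert_lists = [["w1", "w2", "w3"], ["w4", "w5", "w6"]]

all_ranks = [ranks_within_expert_list(lst, toy_rank_candidates) for lst in expert_lists]
flat_ranks = [rank for ranks in all_ranks for rank in ranks.values()]
mean_rank = sum(flat_ranks) / len(flat_ranks)
print(all_ranks, mean_rank)
```

A low mean rank suggests that the recommender agrees with the experts. Individual wines with a poor rank are good starting points for a closer look at what was recommended instead.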
Beyond unsupervised learning
Leveraging the knowledge of domain experts opens up a range of additional possibilities that go beyond the classical domain of unsupervised machine learning. One of these is to use an active learning approach, where the problem of missing labels is overcome by iteratively asking human experts to label certain instances [4]. An unsupervised method could still be used as a starting point, providing some initial value to users, while feedback from domain experts that is captured over time makes it possible to subsequently train models in a supervised fashion. Going further, the availability of auxiliary data could be explored. If, for example, the wine producer had some point-of-sale data available, this would again drastically shift the problem and open it up to algorithms that learn item similarity from customers’ buying behavior (e.g., by analyzing which items are frequently bought together). Well-known examples of such algorithms are collaborative filtering approaches [5], which are frequently employed for recommendation purposes. The key message, in any case, is that creativity pays off, and spending some time searching for auxiliary data may substantially improve model quality.
Summing up, we saw in this blog post how, even in the absence of a large amount of labeled data, it is possible to assess the performance of recommender systems by leveraging available domain know-how. While the proposed methods surely cannot replace labeled data in general, they may still be helpful tools for teams who work on recommender problems and aim to create a quick “first version” that already provides some end user value, while paving the way for a more sophisticated data collection that makes it possible to employ better-tuned machine learning approaches in later versions.
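To illustrate the idea of learning item similarity from buying behavior, here is a small co-occurrence sketch. It is not a full collaborative filtering algorithm, only a simple stand-in: the similarity of two wines is assumed to be their number of joint purchases, normalized by the geometric mean of their individual purchase counts. The baskets are made up.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_similarity(baskets):
    """Similarity of item pairs based on how often they are bought together."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates within one purchase
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return {
        pair: count / math.sqrt(item_counts[pair[0]] * item_counts[pair[1]])
        for pair, count in pair_counts.items()
    }


# Hypothetical point-of-sale data: each inner list is one purchase.
baskets = [["w1", "w2"], ["w1", "w2", "w3"], ["w3", "w4"], ["w1", "w2"]]
similarity = cooccurrence_similarity(baskets)
print(similarity)
```

Pairs that never appear in the same basket get no entry at all, which is one reason why approaches of this kind need a reasonable amount of transaction data before they become useful.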
[1] Isinkaye, F. O., Folajimi, Y. O., & Ojokoh, B. A. (2015). Recommendation systems: Principles, methods and evaluation. Egyptian Informatics Journal, 16(3), 261-273.
[2] Fayyaz, Z., Ebrahimian, M., Nawara, D., Ibrahim, A., & Kashef, R. (2020). Recommendation systems: Algorithms, challenges, metrics, and business opportunities. Applied Sciences, 10(21), 7748.
[3] Wang, B., Wang, A., Chen, F., Wang, Y., & Kuo, C. C. J. (2019). Evaluating word embedding models: Methods and experimental results. APSIPA Transactions on Signal and Information Processing, 8.
[4] Settles, B. (2009). Active learning literature survey. Computer Sciences Technical Report 1648.
[5] Su, X., & Khoshgoftaar, T. M. (2009). A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009.