The Implicit Knowledge of Loss Functions in Machine Learning Algorithms
How did you learn your lessons at school when you were young? How confident were you about your success? Most importantly, how highly did your teachers rate you?
There is a high probability that you were evaluated on a grade scale. Whether it was a letter from A to F or a number from 0 to 20, you were given a score that reflected an evaluation of your performance.
An interesting point is what happens when your teacher gives you a D grade. Do you look at the feedback and try to learn something from it, so that you won’t make the same kind of mistakes again in the future? The point is simple but pivotal: your whole learning method is not learning to reproduce what is good, but learning not to do what is wrong.
Machine Learning (ML) algorithms do exactly the same. You may have heard about loss functions, cost functions, or objective functions. They are mostly functions that we try to minimize. Just as our student learns not to repeat errors, our ML model only tries to learn not to make “errors.” That is, it learns not to increase the value of a loss function.
This brings us to an elegant idea: the definition of a loss function actually embeds the human knowledge that we intend to reproduce. Let’s take a concrete example: autoencoders. Autoencoders are a family of neural networks that aim to reconstruct their input. You may think that the task is easy for a neural network, and you would be right. But the task is made harder by forcing the information through a bottleneck in the network. The input is encoded into a denser representation called the code, which is then used to reconstruct the original representation. Why would we do that? The objective is to compress the information as much as possible so that only the core information that represents our sample is retained.
In this way, we can, for example, compare the code of one image with the code of another image. Since they’re compressed, the closest images are, by construction, the ones with the smallest differences in their code. Only the differences in the meaningful information are kept. I recommend a more technical and illustrated post about autoencoders by Jeremy Jordan.
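To make the comparison of codes concrete, here is a minimal sketch. It assumes the codes have already been produced by some encoder and that "difference" means Euclidean distance; both the `nearest_code` helper and the 4-dimensional example codes are hypothetical and not taken from the experiment.

```python
import numpy as np

def nearest_code(query_code, codes):
    """Return the index of the row of `codes` closest to `query_code`.

    Closeness is measured with the Euclidean distance (an assumption).
    """
    distances = np.linalg.norm(codes - query_code, axis=1)
    return int(np.argmin(distances))

# Hypothetical 4-dimensional codes for three images.
codes = np.array([
    [0.9, 0.1, 0.0, 0.3],
    [0.1, 0.8, 0.7, 0.0],
    [0.8, 0.2, 0.1, 0.4],
])

# Code of a new image, close to the first stored code.
query = np.array([0.88, 0.12, 0.0, 0.3])
closest = nearest_code(query, codes)
```

Because the codes are much smaller than the images, this kind of comparison is cheap, and it only reacts to what the encoder has learned to keep.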
What makes autoencoders interesting? They are a good example of the importance of the loss definition, so in this post we will run an intuitive experiment by autoencoding images. As the experiment is visual, we will literally be able to see what implicit knowledge we embed through the definition of the loss. In our case, the implicit knowledge is the meaning of image equality.
Let’s reconstruct images!
What interests us in this post is the notion of equality between two samples. The straightforward approach is to say that equality means strict equality between each feature of the samples. The error of reconstructing a sample is then the total shift between each of its features and the corresponding feature of the reconstruction.
In the case of images, the features of the inputs and outputs are the pixel components. This gives us the L1 loss, the basic loss of a simple autoencoder, where $x$ is the input, $\hat{x}$ is the reconstruction, and the sum runs over every feature $i$ of the sample:
$$L_1(x, \hat{x}) = \sum_i |x_i - \hat{x}_i|$$
Let’s try it. All the results that you see come from a fairly simple experiment with a convolutional autoencoder, an architecture well suited to compressing and reconstructing images. Our inputs are 3-dimensional: width, height, and 3 channels, one for each color (red, green, blue).
The loss above penalizes every shift in every color component of every pixel. Our autoencoder will therefore try to reproduce the exact same image.
Despite a small loss in quality, since little effort went into tuning the autoencoder, we can see that it tries to reproduce the exact same image.
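As a minimal sketch of this pixel-wise penalty, assuming images are stored as NumPy arrays of shape (height, width, 3) with values in [0, 1] (the `l1_loss` name and the tiny 2x2 image are illustrative, not the code used for the experiment):

```python
import numpy as np

def l1_loss(original, reconstruction):
    """Sum of absolute differences over every pixel component."""
    original = np.asarray(original, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    return float(np.abs(original - reconstruction).sum())

# Hypothetical 2x2 RGB image, all black.
image = np.zeros((2, 2, 3))

# Same image with the red component of a single pixel changed.
shifted = image.copy()
shifted[0, 0, 0] = 0.5

loss_same = l1_loss(image, image)        # identical images: no penalty
loss_shifted = l1_loss(image, shifted)   # one component off by 0.5
```

Any change to any component of any pixel raises the loss, which is exactly the "strict equality" notion described above.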
Who cares about colors?
Now, let’s have some fun. What if we consider that the notion of image equality only resides in the shapes and edges of what is represented? For whatever reason, you may consider that your ginger cat rendered in blue is still the same cat. We can change our loss so that gray values are compared instead of each color. In other words, we reduce our inputs and outputs to 2D images before comparing them. For this experiment, we convert colors to a luma value.
Here is what we get with this loss:
That looks great! Again, the short training used for this experiment gives mediocre quality, but we can see that detailed shapes and edges are being reconstructed. We can also see that the network learned to reconstruct images with purple-like colors. That was enough to match the original images once converted to gray.
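A grayscale comparison of this kind could be sketched as follows. The Rec. 601 luma weights used here are an assumption, since the post does not say which conversion was used, and the helper names are hypothetical.

```python
import numpy as np

# Rec. 601 luma weights for (red, green, blue).
# Assumption: the post does not state which weights were used.
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114])

def to_luma(image):
    """Collapse an (H, W, 3) RGB image into an (H, W) grayscale image."""
    return np.asarray(image, dtype=np.float64) @ LUMA_WEIGHTS

def luma_l1_loss(original, reconstruction):
    """L1 loss computed on luma values instead of on each color component."""
    return float(np.abs(to_luma(original) - to_luma(reconstruction)).sum())

# A mid-gray pixel and a purple-ish pixel built to have the same luma:
# red and blue are fixed at 0.8, green is solved so that the luma equals 0.5.
gray = np.full((1, 1, 3), 0.5)
green = (0.5 - (0.299 + 0.114) * 0.8) / 0.587
purple = np.array([[[0.8, green, 0.8]]])

gray_vs_purple = luma_l1_loss(gray, purple)          # close to zero
color_difference = float(np.abs(gray - purple).sum())  # clearly non-zero
```

Under this loss the two pixels count as "equal" even though their colors differ, which is why the network is free to pick whatever colors it likes as long as the gray values match.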
Who needs all these colors?
Now, what if your definition of equality in images is related to a specific color component? Remember, we are processing images made of 3 channels: red, green, and blue. We could say that two images are different when one specific channel differs.
With the red component loss:
With the green component loss:
With the blue component loss:
We can notice a common pattern across the three experiments. The reconstructed images consist of two colors: the one the network learns to reconstruct (the only one whose errors count in the loss function) and an arbitrary mix of the two other color components, which the loss never sees. That is how we get the following pairs respectively: red/darkish cyan, green/random purple, blue/random yellow.
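A single-channel loss can be sketched like this, again assuming (height, width, 3) NumPy arrays; `channel_l1_loss` is a hypothetical helper, not the code behind the figures. It shows why the two other components end up arbitrary: changing them does not move the loss at all.

```python
import numpy as np

def channel_l1_loss(original, reconstruction, channel):
    """L1 loss restricted to one color channel (0 = red, 1 = green, 2 = blue)."""
    original = np.asarray(original, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    diff = original[..., channel] - reconstruction[..., channel]
    return float(np.abs(diff).sum())

rng = np.random.default_rng(0)
image = rng.random((4, 4, 3))

# Keep the red channel intact, replace green and blue with noise.
scrambled = image.copy()
scrambled[..., 1:] = rng.random((4, 4, 2))

red_loss = channel_l1_loss(image, scrambled, channel=0)    # blind to the noise
green_loss = channel_l1_loss(image, scrambled, channel=1)  # sees the noise
```

With the red component loss, `scrambled` is a perfect reconstruction, so nothing pushes the network toward any particular green or blue values.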
These quick experiments support the point of view developed in this post. Our model (a neural network in particular) is represented by an architecture that defines the global output space, a space containing all possible outputs.
Given the weights and biases of each neural layer, a subset of the output space is available to the network. This subset is the space of everything that a specific instance of this neural network can output given all the possible inputs. When we define a loss and train the network, the result is a change in the instance output space. Data is crucial for obtaining a robust model, but we must not forget the implicit knowledge that is brought by the loss function.
Defining the latter may be a difficult task. Most of the neural networks that you may encounter are used for core tasks such as regression or classification. In these cases, well-known loss functions are used, such as cross-entropy or mean-squared error. Beyond mathematical constraints such as the differentiability of the loss function, the goal is to embody our task objective as a negative definition.
What is the error when you want to generate new images? If the image is of a human head, is the generated hair good? Is the drawn ear good?
What is an error when you want to retrieve pieces of documents? Is the chapter following the answer worth retrieving too? Is the answer related to the input question?
We may struggle to answer these questions ourselves, so imagine how difficult it is to embody them in loss functions.
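For reference, the two losses mentioned above can be sketched in a few lines. These are textbook definitions written with hypothetical helper names, with cross-entropy shown for a single sample whose true class is given as an index. Both are zero only when the prediction is perfect, and both say nothing about what a good prediction looks like beyond that, which is the sense in which they are negative definitions.

```python
import numpy as np

def mean_squared_error(target, prediction):
    """Average of the squared differences (regression)."""
    target = np.asarray(target, dtype=np.float64)
    prediction = np.asarray(prediction, dtype=np.float64)
    return float(np.mean((target - prediction) ** 2))

def cross_entropy(true_class, probabilities, eps=1e-12):
    """Negative log of the probability assigned to the true class (classification)."""
    p = np.clip(np.asarray(probabilities, dtype=np.float64), eps, 1.0)
    return float(-np.log(p[true_class]))

mse_perfect = mean_squared_error([1.0, 2.0], [1.0, 2.0])
mse_off = mean_squared_error([1.0, 2.0], [1.0, 4.0])   # (0 + 2**2) / 2

ce_confident = cross_entropy(0, [0.9, 0.1])   # true class gets probability 0.9
ce_wrong = cross_entropy(0, [0.1, 0.9])       # true class gets probability 0.1
```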
Yet, the loss function is the implicit but core knowledge that will guide all the learning phases.
The goal of the loss function is not to make the model good, but to ensure that it doesn’t go wrong.
The task of learning is not to be optimal but to be minimally viable.
The remaining question is to define what viability is.