Logistic Regression in Machine Learning
This blog will cover the following basic concepts of Logistic Regression:
- Binary and multi-class classification
- Sigmoid function
- Likelihood function
Let’s start with a question.
Is Logistic Regression a regression or a classification algorithm?
Before answering, recall Linear Regression, which we covered in https://blogs.sap.com/2019/07/04/linear-regression-in-machine-learning/. Linear Regression is used when the dependent variable is continuous and the regression line is linear.
Logistic Regression, despite its name, is a classification algorithm used to assign observations to a discrete set of classes, for example classifying an online transaction as fraud or not fraud. It transforms its output using the logistic Sigmoid function to return a probability value. We will understand this in detail.
What is the output of the logistic regression?
Logistic regression is a classification algorithm, and in such algorithms the output is a categorical variable instead of a numerical one.
Let's look at an example: a utilities company wants to know whether a customer is about to churn. Here the target variable takes the values churn and not-churn. Since only two values are possible, this is a binary classification problem.
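To make the probability-to-category conversion concrete, here is a minimal sketch in Python. The 0.5 threshold and the churn labels are illustrative assumptions, not part of any specific library:

```python
import math

def sigmoid(z):
    """Logistic (Sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_churn(score, threshold=0.5):
    """Convert a raw model score into a categorical label.

    The score would come from a fitted model; here it is just an input.
    """
    probability = sigmoid(score)
    return "churn" if probability >= threshold else "not-churn"
```

For example, a strongly positive score yields a probability near 1 and the label "churn", while a negative score yields "not-churn".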
What does Logistic Regression do?
Logistic Regression extracts patterns from the data and classifies each observation into one of the given categories.
Let’s take an example of multi-class classification problem, where the possible outcomes are more than two.
Consider software developed to predict and classify the severity of a newly raised bug as Low, Medium, or High using historical data. Since there are three categories, this is a multi-class classification problem.
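For multi-class problems like this one, logistic regression generalizes the Sigmoid to the softmax function, which turns one raw score per class into a probability distribution. A minimal sketch; the scores and the severity labels are hypothetical:

```python
import math

def softmax(scores):
    """Turn raw per-class scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # shift for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def classify_severity(scores, labels=("Low", "Medium", "High")):
    """Pick the class with the highest probability."""
    probs = softmax(scores)
    return labels[probs.index(max(probs))]
```

With one score per severity class, the predicted category is simply the one with the largest probability.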
What is a Sigmoid function?
Let’s learn this with an example.
Imagine a data set consisting of master and transaction data of a bank's credit card holders. We need to predict whether a customer will default on the payment. This data set can have many variables, such as age, salary, the portion (percentage) of payments made on time, whether the customer owns a property, and so on.
The decision will be based on the pattern formed by the important variables contributing to the target variable. To keep things simple, let's focus only on the percentage of payments made on time (one independent variable) and the result (the dependent variable).
Consider a data set of 11 records (analysis is never done on such a small data set; this is just for illustration).
Look closely at the data: customers defaulted where the percentage of payments made on time was lower, but there is one record that breaks the pattern, at 60%. The record at 50% started the pattern of "No", but the one at 60% broke it.
Let's encode Yes as 0 and No as 1 and plot the graph. Some points show customers who defaulted and did not complete the payment (plotted as 0), and others show customers who did not default and completed the payment (plotted as 1).
One way to decide for future data set is to have a decision boundary at 55%.
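The 55% cutoff can be sketched as a hard threshold rule. The boundary value comes from the text above; everything else is an illustrative assumption:

```python
def sharp_boundary_classifier(pct_on_time, boundary=55.0):
    """Predict default with a hard cutoff.

    pct_on_time: percentage of payments made on time (0-100).
    Returns True when the customer is predicted to default.
    """
    return pct_on_time < boundary
```

A customer with 40% of payments on time would be predicted to default, one with 70% would not; the rule admits no middle ground, which is exactly the weakness discussed next.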
However, there is a problem with this method: in this case we clearly misclassify two customers.
Is there a boundary that can help to have zero misclassifications?
With a real data set, no boundary gives zero misclassifications, and a sharp boundary can be very risky in domains like medicine.
Instead of focusing on exact values, and to avoid the problem of sharp boundaries, think in terms of probabilities. We want the probability of "receiving a complete payment" from a customer with "a lower percentage of payments made on time" to be low, and vice versa. For intermediate points, where the percentage is neither high nor low, the probability can be close to 0.5.
How do we keep misclassifications to a minimum?
Replace the sharp boundary with a curve: the Sigmoid curve.
Sigmoid curve equation: y (probability of default) = 1 / (1 + e^(−(β0 + β1x)))
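The curve can be written in Python directly from the equation; β0 and β1 are the coefficients the model learns from the data:

```python
import math

def sigmoid_curve(x, beta0, beta1):
    """y = 1 / (1 + e^(-(beta0 + beta1 * x)))

    x: the independent variable (here, percentage of payments on time).
    """
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
```

With a positive β1 the probability rises smoothly with x instead of jumping from 0 to 1 at a single cutoff, which is why the Sigmoid avoids the sharp-boundary problem.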
Look at the percentage of payments made on time and the probability of receiving the complete payment; you will see why the Sigmoid curve is better than a sharp cut-off boundary. You can get different Sigmoid curves by varying the values of β0 and β1.
Let the probabilities of the 11 points be P1, P2, …, P11.
How do we get the best β0 and β1?
The best-fitting combination of β0 and β1 is the one that maximizes the product P1 × P2 × … × P11, known as the likelihood function.
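A sketch of the likelihood computation, under the encoding used above (1 for "No, did not default", 0 for "Yes, defaulted"); the probabilities and labels passed in are hypothetical model outputs:

```python
import math

def likelihood(probabilities, labels):
    """Product of the probability the model assigns to each observed label.

    probabilities: predicted probability of label 1 for each point.
    labels: observed outcomes, 1 (no default) or 0 (defaulted).
    """
    product = 1.0
    for p, y in zip(probabilities, labels):
        product *= p if y == 1 else (1.0 - p)
    return product

def log_likelihood(probabilities, labels):
    """Sum of log-probabilities; numerically safer than the raw product."""
    return sum(math.log(p if y == 1 else 1.0 - p)
               for p, y in zip(probabilities, labels))
```

In practice the log-likelihood is maximized rather than the raw product, since multiplying many probabilities below 1 quickly underflows; both have the same maximizing β0 and β1.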
So, we covered that logistic regression is a classification algorithm. Its output comes as a probability via the Sigmoid function and is then converted to one of the available categories according to that probability. The best fit is found by minimizing or maximizing the cost function, as the formulation requires.
The next blog will cover building a logistic regression model in Python.