What is Logistic Regression? A Guide to the Formula & Equation
In this article
As an aspiring data analyst/data scientist, you would have heard of algorithms that help classify, predict & cluster information. Linear regression is one of the most common machine learning algorithms that is used in solving data problems, followed by the infamous logistic regression.
Why infamous? Because it disguises as a regression algorithm, just like linear regression. But don’t get confused, logistic regression is, in fact, a classification algorithm. It is used to estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variable(s). In simple words, logistic regression predicts the probability of occurrence of an event by fitting data to a logit function (hence the name LOGIsTic regression). Logistic regression predicts probability, hence its output values lie between 0 and 1.
Source: Towards Data Science
What is Logistic Regression: Base Behind The Logistic Regression Formula
Logistic regression is named for the function used at the core of the method, the logistic function. The logistic function or the sigmoid function is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.
1 / (1 + e^-value)
Where :
- ‘e’ is the base of natural logarithms
- ‘value’ is the actual numerical value that you want to transform
Did You Know?
Numbers within a certain range can be transformed into a 0 to 1 range using a logistic/sigmoid function. For instance, we applied the logistic function between a range of -10 to + 10, and this is what our graph looks like:
Now that we know what the logistic function is, let’s see how it is used in logistic regression.
What Does the Equation Look Like?
Logistic regression uses an equation as the representation which is very much like the equation for linear regression. In the equation, input values are combined linearly using weights or coefficient values to predict an output value. A key difference from linear regression is that the output value being modeled is a binary value (0 or 1) rather than a numeric value. Here is an example of a logistic regression equation:
y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
Where:
- x is the input value
- y is the predicted output
- b0 is the bias or intercept term
- b1 is the coefficient for the single input value (x)
In the equation, each column in your input data has an associated b coefficient (a constant real value) that must be learned from your training data.
Did You Know?
We can also transform this equation to:
ln(y / 1 – y) = b0 + b1 * X
Because ‘e’ from one side can be removed by adding a natural logarithm (ln) to the other. If you observe closely, it looks like the calculation of the output on the right is like linear regression, and the input on the left is a log of the probability of the default class.
Properties of the Logistic Regression Equation:
- The dependent variable in logistic regression follows Bernoulli distribution
- Estimation is done through maximum likelihood
- No R Square, Model fitness is calculated through a concordance, KS-Statistics
When Implementing the Logistic Regression Model
The coefficients (Beta values b) of the logistic regression algorithm must be estimated from your training data using maximum-likelihood estimation. The best Beta values would result in a model that would predict a value very close to 1 for the default class and value very close to 0.
Get To Know Other Data Science Students
Bret Marshall
Software Engineer at Growers Edge
Pizon Shetu
Data Scientist at Whiterock AI
Garrick Chu
Contract Data Engineer at Meta
Logistic Regression Assumptions
While logistic regression seems like a fairly simple algorithm to adopt & implement, there are a lot of restrictions around its use. For instance, it can only be applied to large datasets. Similarly, multiple assumptions need to be made in a dataset to be able to apply this machine learning algorithm.
- The dependent variable has to be binary in a binary logistic equation
- The factor level 1 of the dependent variable should represent the desired outcome
- Including non-meaningful variables may throw errors. Only include the variables that are necessary and may show a correlation
- The model should have little or no multicollinearity – the independent variables should be absolutely independent of each other
- The independent variables are linearly related to the log odds
With so many assumptions that need to be made, you may think that the equation is not versatile enough to be implemented across real-life problems but this equation has a lot of applications in the medical field, wildly popular among the data scientists and is helping people across the world with its superpower.
Where is this Equation Applied in Real Life?
In instances where the binary response is expected/implied, Logistic regression equation is commonly used. While it has found the best use case in the field of medicine, the applications are far-reaching.
- Determining the probability of having a heart attack – Medical researchers use data to understand the relationship between the predictor variables to estimate if an individual will have a heart attack or not. The results of the model tell the researchers exactly how changes in exercise and weight (predictor variables) affect the probability that a given individual has a heart attack. A fitted logistic regression model is used here. This offers us a clear picture of the significance of data science in the medical profession and the application of the equation.
- Understanding the possibility of getting admission into a University – College application aggregators like to determine how variables like CGPA, GMAT & TOEFL score help determine the probability of getting accepted to a particular university. For this, the aggregators perform a logistic regression to understand the relationship between the predictor variables and the probability of getting accepted.
- Gmail & other inboxes identifying ‘Spam Emails’ – One of the most clearly visible examples of how this works, is in the filtering that email inboxes do. Identifying if email communication is promotional/spam is done by understanding the predictor variables & applying a logistic regression algorithm to check for its authenticity.
Now that we have understood the basic math behind logistic regression and how the logit function behaves, along with the steps that we should keep in mind while approaching a dataset with logistic regression, as a next step, we will learn how we can implement this algorithm in Python, and how it can generate favourable outcomes for us. We will publish part two of this article in a couple of days.
You can learn this classification technique & many more with Springboard’s data analytics, data science and AI/machine learning career track programs. With 1:1 mentoring and project-based curriculum that comes with a job guarantee, you can kickstart your career in data centric world with these specially designed programs.
Since you’re here…Are you a future data scientist? Investigate with our free guide to what a data scientist actually does. When you’re ready to build a CV that will make hiring managers melt, join our Data Science Bootcamp that guarantees a job or your tuition back!