Scikit-Learn: A Complete Guide With a Logistic Regression Example

Sakshi Gupta | 3 minute read | August 28, 2020

Scikit-Learn is a machine learning library that includes many supervised and unsupervised learning algorithms. To date, Scikit-Learn is the first stop for most data scientists and machine learning engineers to build their first machine learning model or set a benchmark for further experiments. This is handy because you don’t always need complex and computationally expensive deep learning algorithms to model your data.

In this article, we will focus on logistic regression and its implementation on the MNIST dataset using Scikit-Learn, a free software machine learning library for Python.

What is Scikit-Learn logistic regression used for?

There are two primary problems in supervised machine learning: regression and classification (by some estimates, almost 70% of problems in data science are classification problems). Logistic regression is a classification algorithm; the name is a “fake friend” because, despite the word “regression,” it is not used for regression problems. It is applied to tasks such as determining whether a tumor is malignant or benign, or identifying vehicle types. It is essential for ML engineers and data scientists to have a clear understanding of logistic regression.

In simple terms, logistic regression is the process of finding the best possible plane (decision boundary, Figure 1) that separates the classes under consideration. Because this boundary is linear, the model works best when the classes are at least approximately linearly separable.

Figure 1: Sample decision plane in 2D (Source: jeremyjordan.me)

Since linear regression is a fundamental building block of machine learning, we’ll use this concept as a jumping-off point to explain the mathematics of logistic regression.

The main difference between linear regression and logistic regression is the output function. Linear regression uses a linear function whose output can be any continuous value, whereas logistic regression passes that output through a sigmoid function, which restricts it to the range between zero and one.

  • Sigmoid function or logistic function

Mathematically, the sigmoid function can be described as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

This limits the output to the range between zero and one, as shown in Figure 2.

Figure 2: Sigmoid function. (Source: Wikipedia)
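To make the shape of the curve concrete, here is a minimal sketch of the sigmoid function in plain Python. The function name `sigmoid` and the sample inputs are our own choices for illustration; the two-branch form is just one common way to avoid numeric overflow for large negative inputs.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(-z) can overflow, so use the equivalent
    # form e^z / (1 + e^z) instead.
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Large negative inputs map close to 0, large positive inputs close to 1,
# and an input of 0 maps to exactly 0.5.
for z in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.4f}")
```

Whatever value goes in, the result stays between zero and one, which is what lets us read the output as a probability.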

  • Hypothesis

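For linear regression, the hypothesis function can be written as:

$$h_\theta(x) = \theta^T x = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$$

This is a simple linear function (a straight line when there is a single feature). For logistic regression, the same linear output is passed through the sigmoid function $\sigma$:

$$h_\theta(x) = \sigma(\theta^T x)$$

Hence:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

Here $\theta$ is the vector of model parameters, $x$ is the feature vector (with $x_0 = 1$ for the intercept term), and $h_\theta(x)$ is interpreted as the predicted probability that $y = 1$.
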
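The hypothesis can be sketched in a few lines of Python. This is an illustration under our own assumptions, not library code: the helper names `hypothesis` and `predict`, the parameter values in `theta`, and the 0.5 threshold are all hypothetical, and the first feature is assumed to be a constant 1 for the intercept.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, split into two branches to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def hypothesis(theta, x):
    """Sigmoid of the weighted sum of features; x[0] is assumed to be 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)


def predict(theta, x, threshold=0.5):
    """Turn the predicted probability into a class label (0 or 1)."""
    return 1 if hypothesis(theta, x) >= threshold else 0


theta = [-1.0, 2.0]  # hypothetical parameters: intercept -1, weight 2

# This point lies exactly on the decision boundary: -1 + 2 * 0.5 = 0.
print(hypothesis(theta, [1.0, 0.5]))  # 0.5

# Points on either side of the boundary get different labels.
print(predict(theta, [1.0, 2.0]), predict(theta, [1.0, -2.0]))
```

Points where the weighted sum is zero get a probability of exactly 0.5, which is why the decision boundary of logistic regression is a line (or plane) in feature space.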
  • Cost function

In simple terms, the cost function measures the performance of any given machine learning model with respect to data under consideration. This cost function is used to optimize the parameters of the machine learning model after each iteration, during the training phase, to get more accurate predictions.

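The cost function for logistic regression is given by:

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$

where $h_\theta(x)$ is the predicted probability and $y$ is the true label (zero or one).

This can be further simplified to:

$$\mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$$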

This cost function is also known as negative log-likelihood loss or cross-entropy loss.

Figure 3: Cost function (Source: Researchgate)

Figure 3 depicts the cost function. When “y” is one and “h” approaches zero (blue line), the cost grows very large, severely penalizing the machine learning model. When “y” is one and “h” is also one (blue line), the cost is zero, meaning there is no penalty for a correct prediction. Similarly, when “y” is zero and “h” approaches one (red line), the penalty is high, whereas when “y” is zero and “h” is also zero, the penalty is zero.
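The penalty behavior described above is easy to check numerically. The sketch below is our own illustration: the function names `cross_entropy` and `mean_cost` and the clipping constant `eps` are assumptions, with the clipping added only so that the logarithm never receives exactly zero.

```python
import math


def cross_entropy(y: int, h: float, eps: float = 1e-15) -> float:
    """Cost for one example: -y*log(h) - (1 - y)*log(1 - h)."""
    # Clip h away from 0 and 1 so that log() is always defined.
    h = min(max(h, eps), 1.0 - eps)
    return -(y * math.log(h) + (1 - y) * math.log(1.0 - h))


def mean_cost(labels, probs):
    """Average cost over a set of examples."""
    return sum(cross_entropy(y, h) for y, h in zip(labels, probs)) / len(labels)


# Confident and correct predictions cost almost nothing;
# confident and wrong predictions are penalized heavily.
for y, h in [(1, 0.99), (1, 0.01), (0, 0.01), (0, 0.99)]:
    print(f"y={y}, h={h:.2f} -> cost={cross_entropy(y, h):.4f}")

print("mean cost:", mean_cost([1, 0, 1], [0.9, 0.2, 0.6]))
```

The cost rises steeply as a prediction becomes both confident and wrong, which is what pushes the parameters toward better values during training.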

Implementing logistic regression on the MNIST dataset

In this section, we will implement logistic regression on the MNIST dataset. The MNIST dataset is a well-known benchmark dataset in the machine learning community. This dataset consists of pictures of handwritten digits with labels. All images are squares sized 28 x 28 pixels. The label ranges from zero to nine. This is a multinomial logistic regression problem.

Scikit-learn takes care of the implementation details: its LogisticRegression estimator handles both binary and multiclass problems, depending on the number of distinct labels present in the dataset.

The code for implementing logistic regression with Scikit-learn on MNIST dataset can be found here. This includes a detailed implementation of the logistic regression model with Scikit-learn.
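As a rough, self-contained sketch of the workflow (not the linked code), the example below trains a logistic regression classifier with Scikit-learn. To keep it small and runnable offline, it assumes the 8 x 8 digits dataset bundled with Scikit-learn (`load_digits`) as a stand-in for the full 28 x 28 MNIST data; the split ratio, `random_state`, and `max_iter` values are arbitrary choices of ours.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for MNIST: 8 x 8 digit images flattened to 64 features each.
digits = load_digits()
X = digits.data / 16.0  # pixel values run from 0 to 16; scale to [0, 1]
y = digits.target       # labels zero to nine

# Hold out 20% of the images to evaluate the trained model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Ten distinct labels are present, so this is fit as a multiclass problem.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")

# predict_proba returns one probability per class for each image.
probs = clf.predict_proba(X_test[:1])
print("Predicted digit:", clf.predict(X_test[:1])[0])
```

The same steps should carry over to the full MNIST dataset, which can be downloaded with, for example, `fetch_openml("mnist_784")`; expect training to take noticeably longer on the larger images.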

About Sakshi Gupta

Sakshi is a Senior Associate Editor at Springboard. She is a technology enthusiast who loves to read and write about emerging tech. She is a content marketer and has experience working in the Indian and US markets.