Introduction to Logistic Regression

Logistic regression is used to predict a discrete outcome based on variables which may be discrete, continuous or mixed. Thus, when the dependent variable has two or more discrete outcomes, logistic regression is a commonly used technique. Given a set of independent variables, the outcome could take the form Yes / No, 1 / 0, True / False or High / Low.

Let’s first understand how logistic regression is used in the business world. Logistic regression has a wide array of applications; here are a few real-world examples.

Marketing: A marketing consultant wants to predict whether a subsidiary of his company will make a profit, make a loss or just break even, depending on the characteristics of the subsidiary's operations.

Human Resources: The HR manager of a company wants to predict the absenteeism pattern of employees based on their individual characteristics.

Finance: A bank wants to predict whether its customers will default based on their previous transactions and history.

Types of logistic regression

If the response variable is dichotomous (two categories), then it is called binary logistic regression. If you have more than two categories within the response variable, then there are two possible logistic regression models.

  1. If the response variable is nominal, you fit a nominal logistic regression model.
  2. If the response variable is ordinal, you fit an ordinal logistic regression model.

Logistic regression model

The plot shows a model of the relationship between a continuous predictor and the probability of an event or outcome. The linear model clearly does not fit if this is the true relationship between X and the probability. In order to model this relationship directly, you must use a nonlinear function. The plot displays one such function. The S-shape of the function is known as sigmoid.
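As a small illustration of the S-shaped curve described above, here is a base R sketch that evaluates the sigmoid (logistic) function on a few values. The function name `sigmoid` and the sample inputs are my own illustrative choices and are not taken from the original plot.

```r
# Sigmoid (logistic) function: maps any real number into the interval (0, 1).
# The name `sigmoid` is illustrative; base R offers the same curve as plogis().
sigmoid <- function(x) {
  1 / (1 + exp(-x))
}

# Evaluate on a small grid of illustrative predictor values.
x <- seq(-6, 6, by = 2)
p <- sigmoid(x)
print(round(p, 3))

# The values stay between 0 and 1 and pass through 0.5 at x = 0,
# which a straight line cannot do over an unbounded range of x.
# curve(sigmoid, from = -6, to = 6)  # uncomment to draw the S-shape
```

A straight line fitted to probabilities would eventually leave the 0 to 1 range, which is why a bounded function like this one is preferred.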

Logit transformation

A logistic regression model applies a logit transformation to the probabilities. The logit is the natural log of the odds: logit(P) = ln(P / (1 - P)), where

P is the probability of the event

ln is the natural log (to the base e)

P / (1 - P) is the odds of the event, so the logit is also written as ln(odds)

So, the final logistic regression model formula is
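In standard notation, the model can be sketched as follows. The symbols are assumptions introduced here for illustration: k predictors X_1, ..., X_k with coefficients beta_0, ..., beta_k. The first line states the model on the logit scale, and the second line is the same relationship solved for P, which gives the sigmoid curve.

```latex
\operatorname{logit}(P) = \ln\!\left(\frac{P}{1-P}\right)
  = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k
\qquad\Longleftrightarrow\qquad
P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k)}}
```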

Unlike linear regression, the outcome in logistic regression is not normally distributed and its variance is not constant. Therefore, logistic regression needs a more computationally intensive estimation method, maximum likelihood (ML), to estimate the parameters. ML finds the model coefficients that relate the predictors to the target: it starts from an initial estimate and repeatedly updates it until the log likelihood (LL) no longer changes significantly.
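To make the iterative estimation a little more concrete, here is a minimal sketch on simulated data (the seed, sample size and true coefficients are all made up for illustration). It fits a model with glm(), reports how many iterations the fit took, and recomputes the log likelihood by hand from the fitted probabilities.

```r
set.seed(1)                       # illustrative seed, not from the article
n <- 200
x <- rnorm(n)
# Simulate a 0/1 outcome whose true log-odds are -0.5 + 1.2 * x (made-up values).
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

fit <- glm(y ~ x, family = binomial)

# Log likelihood: sum over observations of y*log(p) + (1 - y)*log(1 - p).
p_hat <- fitted(fit)
ll_manual <- sum(y * log(p_hat) + (1 - y) * log(1 - p_hat))

print(coef(fit))    # estimated intercept and slope
print(fit$iter)     # number of iterations glm() used before LL stopped changing
print(ll_manual)    # agrees with as.numeric(logLik(fit))
```

The hand-computed value matching logLik(fit) is a quick sanity check that the quantity being maximised is the one described above.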

Using R

R makes it very easy to fit a logistic regression model. The function to be called is glm() and the fitting process is similar to the one used in linear regression. In this post, I will discuss binary logistic regression with an example. The overall workflow for multinomial logistic regression is much the same, although glm() with family=binomial covers only the binary case.

The data used is Bankloan. The dataset has 850 rows and 9 columns (age, education, employment, address, income, debtinc, creddebt, othdebt, default). The dependent variable is default (Defaulted and Not Defaulted).

Let’s first load the data and check its head.

Now, let's take a subset of the data with the first 700 rows.

mod_bankloan <- bankloan[1:700,]

Setting a seed of 1000, so that the random sample drawn below is reproducible.

set.seed(1000)

Let’s draw a random sample of 500 row numbers out of the 700 to use as training data.

train <- sample(1:700, 500, replace = FALSE)

Creating the training and testing data.

trainingdata <- mod_bankloan[train, ]
testingdata <- mod_bankloan[-train, ]

Now, let’s fit the model. Be sure to specify the parameter family=binomial in the glm() function.



The summary will also include the significance level of all the variables. If the p-value of a variable is less than 0.05, that variable is considered significant. We can also remove the insignificant variables to make our model simpler and potentially more accurate.

In our model, only age, employment, address and creddebt seem to be significant. So, let's build another model with only these variables.
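The two fits described above might look like the sketch below. Since the Bankloan file is not available here, the code builds a simulated stand-in with the same column names; the values and the rule generating default are made up, so the printed coefficients will not match the article's. The name model1 for the full model is hypothetical, while model12 is the name used in the prediction step.

```r
set.seed(1000)
n <- 700
# Simulated stand-in with the article's column names; values are NOT the real data.
mod_bankloan <- data.frame(
  age        = round(runif(n, 20, 56)),
  education  = sample(1:5, n, replace = TRUE),
  employment = round(runif(n, 0, 30)),
  address    = round(runif(n, 0, 30)),
  income     = round(runif(n, 15, 150)),
  debtinc    = runif(n, 1, 30),
  creddebt   = runif(n, 0, 10),
  othdebt    = runif(n, 0, 15)
)
# Arbitrary made-up rule so that the outcome depends on some predictors.
log_odds_default <- -1 - 0.15 * mod_bankloan$employment + 0.4 * mod_bankloan$creddebt
is_default <- rbinom(n, 1, plogis(log_odds_default)) == 1
mod_bankloan$default <- factor(ifelse(is_default, "Defaulted", "Not Defaulted"))

train <- sample(1:700, 500, replace = FALSE)
trainingdata <- mod_bankloan[train, ]
testingdata  <- mod_bankloan[-train, ]

# Full model (hypothetical name model1): all eight predictors.
model1 <- glm(default ~ ., data = trainingdata, family = binomial)
print(summary(model1)$coefficients)

# Reduced model with the four predictors the article reports as significant.
model12 <- glm(default ~ age + employment + address + creddebt,
               data = trainingdata, family = binomial)
print(summary(model12)$coefficients)

# levels(default) is c("Defaulted", "Not Defaulted"), so glm() models
# P(Not Defaulted); a fitted value below 0.5 therefore points to "Defaulted".
```

The last column of each coefficient table holds the p-values that are compared against 0.05.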


Let’s now use the model to generate predictions for the training data.

pred1 <- predict(model12, newdata = trainingdata, type = "response")

Now converting the predicted probabilities into classes, Defaulted or Not Defaulted, using a cutoff of 0.5. (glm() models the probability of the second level of default, which is Not Defaulted when the levels are in alphabetical order, so a probability below 0.5 means Defaulted.)

predicted_class <- ifelse(pred1 < 0.5, "Defaulted", "Not Defaulted")

Creating a table to see the same.

table(trainingdata$default, predicted_class)

This is also known as a confusion matrix. It is a tabular representation of actual vs predicted values, which helps us find the accuracy or error of the model. Comparing it between training and testing data helps detect overfitting.

There are 64 customers who actually defaulted and whom our model also predicted as defaulted. However, 72 customers defaulted but the model predicted them as Not Defaulted. Also, 36 customers who did not default were flagged by the model as defaulted. Let’s now find out the error rate.

err_rate <- 1 - sum(trainingdata$default == predicted_class) / 500
err_rate

Which is about 21.6%, that is (72 + 36) / 500.
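The arithmetic behind the error rate can be checked directly from the counts quoted above. In this sketch the fourth cell, 328 correctly predicted Not Defaulted customers, is derived as 500 - 64 - 72 - 36 and is not stated in the text. Treating Defaulted as the positive class is also my assumption.

```r
# Counts quoted in the text (rows = actual, columns = predicted).
# 328 is derived by subtraction from the 500 training rows, not quoted directly.
cm <- matrix(c(64,  72,
               36, 328),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("Defaulted", "Not Defaulted"),
                             predicted = c("Defaulted", "Not Defaulted")))

total       <- sum(cm)
accuracy    <- sum(diag(cm)) / total          # correct predictions / all rows
err_rate    <- 1 - accuracy                   # (72 + 36) / 500
sensitivity <- cm["Defaulted", "Defaulted"] / sum(cm["Defaulted", ])
specificity <- cm["Not Defaulted", "Not Defaulted"] / sum(cm["Not Defaulted", ])

print(round(c(accuracy = accuracy, err_rate = err_rate,
              sensitivity = sensitivity, specificity = specificity), 3))
```

Note that a comparison such as trainingdata$default == predicted_class only counts exact label matches. If the predicted label were misspelled (for example "Defaluted"), the 64 correct defaulter predictions would never match, and the computed error rate would rise to (64 + 72 + 36) / 500 = 34.4%.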

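Going ahead, let's test the model on the testing data.

pred2 <- predict(model12, newdata = testingdata, type = "response")
predicted_class2 <- ifelse(pred2 < 0.5, "Defaulted", "Not Defaulted")
table(testingdata$default, predicted_class2)
err_rate <- 1 - sum(testingdata$default == predicted_class2) / 200

With the label misspelled as "Defaluted", this calculation gave 31%, but that figure counts every correctly predicted defaulter as an error. With the spelling corrected, the test error rate can only be equal or lower.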

Now, we can plot this as a Receiver Operating Characteristic curve (commonly known as an ROC curve). In R, this can be done by installing a package called ROCR. An output of the plot is given below.

ROC traces the true positive rate achieved by a given logit model against its false positive rate as the prediction probability cutoff is lowered from 1 to 0. For a good model, as the cutoff is lowered, it should mark more of the actual 1’s as positives while marking few of the actual 0’s as 1’s. The area under the curve (AUC), also known as the index of accuracy, is a performance metric for the curve. The higher the area under the curve, the better the predictive power of the model.
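To show what lowering the cutoff does, here is a sketch that builds the curve by hand in base R instead of using ROCR, so the mechanics are visible. It runs on simulated data (seed, sample size and coefficients are made up), and the AUC is approximated with the trapezoidal rule over a grid of cutoffs.

```r
set.seed(1)                                   # illustrative seed
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.3 + 1.5 * x))     # simulated 0/1 outcome
fit <- glm(y ~ x, family = binomial)
score <- fitted(fit)                          # predicted P(y = 1)

# Lower the cutoff from 1 to 0 and record the two rates at each step.
cutoffs <- seq(1, 0, by = -0.01)
tpr <- sapply(cutoffs, function(k) mean(score[y == 1] >= k))  # true positive rate
fpr <- sapply(cutoffs, function(k) mean(score[y == 0] >= k))  # false positive rate

# Area under the curve, approximated by the trapezoidal rule.
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
print(round(auc, 3))

# plot(fpr, tpr, type = "l"); abline(0, 1, lty = 2)  # uncomment to draw the curve
```

An AUC of 0.5 corresponds to the diagonal line of a model that guesses at random, and values closer to 1 indicate better separation of the two classes.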


Logistic regression is a widely used supervised machine learning technique. It is one of the best tools used by statisticians, researchers and data scientists in predictive analytics. The assumptions for logistic regression are mostly similar to those of multiple regression, except that the dependent variable should be discrete. Many data science students struggle to learn this technique, which is why I am pleased to present a basic introduction to help you grasp the topic. As I always say, “the sky is the limit”, and the internet is your best friend, so go ahead and start your learning journey. All the best.
