Manish Saraswat

Manish sees hiring through the lens of systems thinking and design operations. Their structured yet poetic approach to writing helps readers rethink how they scale teams and workflows.

Insights & Stories by Manish Saraswat

From hiring pipelines to collaboration rituals, Manish Saraswat maps out ways to design intentional, high-performing organizations—one post at a time.

Feature Engineering + H2o Gradient Boosting (GBM) in R Scores 0.936

With less than 3 days to go, this script is meant to help beginners with ideas, a machine learning workflow, and motivation for the ongoing machine learning challenge. Here's a quick workflow of what I've done below:
  1. Load data and explore
  2. Data Pre-processing
  3. Dropped Features
  4. One Hot Encoding
  5. Feature Engineering
  6. Model Training
Good Luck! Note: For more feature engineering ideas, spend time exploring the data by the loan_status variable. For categorical vs. categorical data, create dodged bar plots. For categorical vs. continuous data, create density plots and use fill = as.factor(loan_status), as sketched below.
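For illustration, here's a minimal ggplot2 sketch of both plot types; the data frame train and the columns grade and annual_inc are hypothetical stand-ins for whatever features you're exploring:

library(ggplot2)

# Categorical vs. categorical: dodged bar plot (column names are placeholders)
ggplot(train, aes(x = grade, fill = as.factor(loan_status))) +
  geom_bar(position = "dodge")

# Categorical vs. continuous: density plot coloured by loan_status
ggplot(train, aes(x = annual_inc, fill = as.factor(loan_status))) +
  geom_density(alpha = 0.5)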

To help the community, feel free to contribute the equivalent Python / C++ script in the comments below.

Update: You can get python script for this solution from Jin Cong Ho's comment below.

Script (R)
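The embedded script isn't reproduced on this page. For orientation, the core model-training step with H2O GBM might look roughly like the sketch below; the frame name, column roles, and parameter values are placeholders and won't, on their own, reproduce the 0.936 score:

library(h2o)
h2o.init(nthreads = -1)

train_h2o <- as.h2o(train)           # `train` is a placeholder for the pre-processed data
y <- "loan_status"                   # target variable of the challenge
x <- setdiff(colnames(train_h2o), y)

gbm_model <- h2o.gbm(x = x, y = y,
                     training_frame = train_h2o,
                     ntrees = 500, max_depth = 5, learn_rate = 0.05,
                     nfolds = 5, seed = 1)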

Resources - Handy Algorithms for this Challenge

Simple Guide to Neural Networks and Deep Learning in Python

Step 2: Import required libraries.

import numpy
import keras
from keras.layers import Dense
from keras.models import Sequential
from keras.utils import np_utils
from sklearn.preprocessing import LabelEncoder

Step 3: Load data from the training set.

X = numpy.genfromtxt("Iris_Data.txt", delimiter=",", usecols=(0, 1, 2, 3))
t = numpy.genfromtxt("Iris_Data.txt", delimiter=",", usecols=(4), dtype=None)

Here, X is an array of input feature vectors and t is an array containing their corresponding target values. dtype=None tells NumPy to infer each column's data type individually instead of forcing the default (float).

Step 4: Since the target values in t are strings, they have to be assigned numerical labels.

Deep Learning & Parameter Tuning with MXnet, H2o Package in R

The data set used here comes from the UC Irvine ML repository. Let's start with H2O. This data set isn't the most ideal one to work with in neural networks. However, the motive of this hands-on section is to make you familiar with the model-building process.

H2O Package

The H2O package provides the h2o.deeplearning function for model building. It is built on Java. Primarily, this function is useful for building multilayer feedforward neural networks. It comes with several features such as the following:
  • Multi-threaded distributed parallel computation
  • Adaptive learning rate (or step size) for faster convergence
  • Regularization options such as L1 and L2 which help prevent overfitting
  • Automatic missing value imputation
  • Hyperparameter optimization using grid/random search
There are many more! For optimization, this package uses the Hogwild! method instead of plain stochastic gradient descent; Hogwild! is essentially a parallelized, lock-free version of SGD. Let's understand the parameters involved in model building with h2o. The two packages (h2o and mxnet) use different nomenclatures, so make sure you don't get confused. Since most of the parameters are easy to understand from their names, I'll mention only the important ones:
  1. hidden - It specifies the number of hidden layers and the number of neurons in each layer of the architecture.
  2. epochs - It specifies the number of iterations to be done on the data set.
  3. rate - It specifies the learning rate.
  4. activation - It specifies the type of activation function to use. In h2o, the major activation functions are Tanh, Rectifier, and Maxout.
Let's quickly load the data and get the routine data pre-processing steps out of the way:

Now, let's build a simple deep learning model. Generally, computing variable importance from a trained deep learning model is quite painstaking. But the h2o package provides an effortless function to compute variable importance from a deep learning model; a minimal sketch is shown below.
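Here's a hedged sketch of what that step might look like; the data frame train and the target column name target are placeholders rather than names from the original post:

library(h2o)
h2o.init(nthreads = -1)                    # start a local H2O cluster

train_h2o <- as.h2o(train)                 # `train` is a placeholder data frame
y <- "target"                              # placeholder target column
x <- setdiff(colnames(train_h2o), y)

dl_model <- h2o.deeplearning(x = x, y = y,
                             training_frame = train_h2o,
                             hidden = c(10, 10),   # two small hidden layers, illustrative
                             epochs = 10,
                             nfolds = 5,           # cross-validated accuracy
                             seed = 1)

h2o.varimp(dl_model)                       # variable importance table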



(Plot: deep learning variable importance)

Now, let's train a deep learning model with one hidden layer comprising five neurons. This time, instead of checking the cross-validation accuracy, we'll validate the model on the test data:
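Continuing with the placeholders from the previous sketch, and assuming a held-out data frame test:

test_h2o <- as.h2o(test)                   # `test` is a placeholder data frame

dl_model2 <- h2o.deeplearning(x = x, y = y,
                              training_frame = train_h2o,
                              validation_frame = test_h2o,
                              hidden = 5,          # one hidden layer with five neurons
                              epochs = 10,
                              seed = 1)

h2o.performance(dl_model2, newdata = test_h2o)    # evaluate on the test data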



For hyperparameter tuning, we'll perform a random grid search over the parameters and choose the model that returns the highest accuracy:
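A hedged sketch of such a search with h2o.grid; the hyperparameter ranges are purely illustrative, and the grid is ranked by AUC here because it is a convenient sortable metric for a binary target:

hyper_params <- list(hidden = list(5, c(10, 10), c(20, 20)),
                     l1 = c(0, 1e-4, 1e-3),
                     epochs = c(10, 30))

search_criteria <- list(strategy = "RandomDiscrete", max_models = 10, seed = 1)

dl_grid <- h2o.grid("deeplearning", grid_id = "dl_grid",
                    x = x, y = y,
                    training_frame = train_h2o,
                    validation_frame = test_h2o,
                    hyper_params = hyper_params,
                    search_criteria = search_criteria)

h2o.getGrid("dl_grid", sort_by = "auc", decreasing = TRUE)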

MXNetR Package

The mxnet package provides an incredible interface to build feedforward NNs, recurrent NNs, and convolutional neural networks (CNNs). CNNs are widely used for detecting objects in images. The team that created xgboost also created this package. Currently, mxnet is popular in Kaggle competitions for image classification problems.

This package can be easily connected with GPUs as well. The process of building model architecture is quite intuitive. It gives greater control to configure the neural network manually.

Let's get some hands-on experience using this package. Follow the commands below to install this package on your respective OS. For Windows and Linux users, installation commands are given below. For Mac users, here's the installation procedure.
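For reference, the commonly documented route at the time went through the dmlc drat repository, roughly as below; repository locations may have changed since, so treat this purely as a sketch and check the current mxnet installation guide first:

# Install mxnet for R (Windows/Linux) -- repository details may have moved
install.packages("drat", repos = "https://cran.rstudio.com")
drat::addRepo("dmlc")
install.packages("mxnet")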



In R, mxnet expects the target variable as numeric classes (starting at 0), not factors. Also, it expects the data as a matrix rather than a data frame. Now, we'll make the required changes:
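A sketch of those conversions, keeping the placeholder train / test data frames and target column from the H2O section:

library(mxnet)

train.x <- data.matrix(train[, setdiff(colnames(train), "target")])
train.y <- as.numeric(train$target) - 1    # factor levels 1..K become classes 0..K-1
test.x  <- data.matrix(test[, setdiff(colnames(test), "target")])
test.y  <- as.numeric(test$target) - 1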



Now, we'll train the multilayered perceptron model using the mx.mlp function.
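A hedged sketch of an mx.mlp call; the layer size and training settings are illustrative, not the original article's values:

mx.set.seed(0)
mlp_model <- mx.mlp(train.x, train.y,
                    hidden_node = 3,             # one hidden layer with 3 neurons
                    out_node = 2,                # two target classes
                    out_activation = "softmax",
                    num.round = 20,
                    array.batch.size = 32,
                    learning.rate = 0.05,
                    eval.metric = mx.metric.accuracy)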



The softmax function is used for binary and multi-class classification problems. Alternatively, you can also craft the model structure manually:
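One plausible way to write that manual configuration (a sketch matching the description that follows: a hidden layer of three neurons feeding a two-class softmax output):

data <- mx.symbol.Variable("data")
fc1  <- mx.symbol.FullyConnected(data, num_hidden = 3)   # hidden layer, 3 neurons
act1 <- mx.symbol.Activation(fc1, act_type = "tanh")     # illustrative activation choice
fc2  <- mx.symbol.FullyConnected(act1, num_hidden = 2)   # output layer, 2 classes
out  <- mx.symbol.SoftmaxOutput(fc2)                      # softmax output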



We have configured the network above with one hidden layer carrying three neurons, and we have chosen softmax as the output function. The network optimizes squared loss for regression and classification accuracy for classification. Now, we'll train the network:
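Training that configuration might look like the following sketch (round count, batch size, and learning rate are illustrative):

mx.set.seed(0)
model <- mx.model.FeedForward.create(out,
                                     X = train.x, y = train.y,
                                     array.layout = "rowmajor",   # rows are observations
                                     num.round = 20,
                                     array.batch.size = 32,
                                     learning.rate = 0.05,
                                     eval.metric = mx.metric.accuracy)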



Similarly, we can configure a more complex network with a larger hidden layer:
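A sketch of that configuration, matching the description below (a 10-neuron hidden layer with ReLU, then a 2-unit softmax output):

data <- mx.symbol.Variable("data")
fc1  <- mx.symbol.FullyConnected(data, num_hidden = 10)   # first hidden layer, 10 neurons
act1 <- mx.symbol.Activation(fc1, act_type = "relu")      # relu in place of sigmoid
fc2  <- mx.symbol.FullyConnected(act1, num_hidden = 2)    # output layer, 2 classes
out2 <- mx.symbol.SoftmaxOutput(fc2)                       # softmax output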



Understand it carefully: after the input data is fed in, the first hidden layer consists of 10 neurons. The output of each neuron passes through a ReLU (rectified linear unit) activation function, which we have used in place of sigmoid because ReLU converges faster than a sigmoid function. You can read more about ReLU here.

Then, the output is fed into the second layer, which is the output layer. Since our target variable has two classes, we've chosen num_hidden as 2 in the second layer. Finally, the output from the second layer is passed through the softmax output function.
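Training the deeper network follows the same pattern as before; this sketch simply re-trains model with the new symbol:

mx.set.seed(0)
model <- mx.model.FeedForward.create(out2,
                                     X = train.x, y = train.y,
                                     array.layout = "rowmajor",
                                     num.round = 20,
                                     array.batch.size = 32,
                                     learning.rate = 0.05,
                                     eval.metric = mx.metric.accuracy)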



As mentioned above, this trained model predicts output probabilities, which can easily be transformed into labels using a threshold value (say, 0.5). To make predictions on the test set, we do this:
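A sketch of the prediction step, assuming the trained model and the test matrix from the earlier sketches:

preds <- predict(model, test.x, array.layout = "rowmajor")
dim(preds)                            # one row per class, one column per test observation
pred_label <- max.col(t(preds)) - 1   # most probable class for each observation, back to 0/1
table(pred_label, test.y)             # quick accuracy check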



The predicted matrix has two rows and 16281 columns, each column carrying the class probabilities for one test observation. Using the max.col function, we can extract the most probable class for each observation. If you check the model's accuracy, you'll find that this network performs terribly on this data; in fact, it does no better than the training accuracy. On this data set, tuned xgboost gave 87% accuracy!

If you are familiar with the model-building process, I'd suggest you try working on the popular MNIST data set. You can find tons of tutorials on this data to get you going!

Summary

Deep learning is getting increasingly popular for solving the most complex problems such as image recognition, natural language processing, etc. If you are aspiring for a career in machine learning, this is the best time to get into this subject. The motive of this article was to introduce you to the fundamental concepts of deep learning. In this article, we learned the basics of deep learning (perceptrons, neural networks, and multilayered neural networks). We learned that deep learning as a technique is composed of several algorithms, such as backpropagation and gradient descent, that optimize the networks. In the end, we gained some hands-on experience in developing deep learning models. Do let me know if you have any feedback, suggestions, or thoughts on this article in the comments below!

Practical Guide to Clustering Algorithms and Evaluation in R

Introduction

Clustering algorithms are a part of unsupervised machine learning algorithms. As there is no target variable, the model is trained using input variables to discover intrinsic groups or clusters.

Because we don’t have labels for the data, these groups are formed based on similarity between data points. This tutorial covers clustering concepts, techniques, and applications across domains like healthcare, retail, and manufacturing.

We’ll also walk through examples in R, using real-world data from a water treatment plant to apply our knowledge practically.

Table of Contents

  1. Types of Clustering Techniques
  2. Distance Calculation for Clustering
  3. K-Means Clustering
  4. Choosing the Best K in K-Means
  5. Hierarchical Clustering
  6. Evaluation Methods in Cluster Analysis
  7. Clustering in R – Water Treatment Plants

Types of Clustering Techniques

Common clustering algorithms include K-Means, Fuzzy C-Means, and Hierarchical Clustering. Depending on the data type (numeric, categorical, mixed), the algorithm may vary. Clustering techniques can be classified as:

  • Soft Clustering – Observations are assigned to clusters with probabilities.
  • Hard Clustering – Observations belong to only one cluster.

We’ll focus on K-Means and Hierarchical Clustering in this guide.

Distance Calculation for Clustering

Distance metrics are used to measure similarity between data points. Common metrics include:

  • Euclidean Distance – Suitable for numeric variables.
  • Manhattan Distance – Measures horizontal and vertical distances.
  • Hamming Distance – Used for categorical variables.
  • Gower Distance – Handles mixed variable types.
  • Cosine Similarity – Common in text analysis.

K-Means Clustering

K-Means partitions data into k non-overlapping clusters. The process includes:

  1. Randomly assign k centroids.
  2. Assign observations to the nearest centroid.
  3. Recalculate centroids.
  4. Repeat until convergence.

Clustering minimizes within-cluster variation using squared Euclidean distance.

Choosing the Best K in K-Means

Methods for selecting k include:

  • Cross Validation
  • Elbow Method (a quick sketch follows this list)
  • Silhouette Method
  • X-Means Clustering
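As an illustration of the elbow method, here's a minimal R sketch; X stands for any scaled numeric matrix (for instance, the scaled_wd matrix prepared later in this article):

# Total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) kmeans(X, centers = k, nstart = 10, iter.max = 100)$tot.withinss)

plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
# Look for the "elbow" where the curve stops dropping sharply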

Hierarchical Clustering

This method creates a nested sequence of clusters using two approaches:

  • Agglomerative – Bottom-up merging.
  • Divisive – Top-down splitting.

Dendrograms visualize the clustering hierarchy. A horizontal cut across the dendrogram reveals the number of clusters.

Evaluation Methods

Clustering evaluation is divided into:

  • Internal Measures – Based on compactness and separation (e.g., SSE, Scatter Criteria).
  • External Measures – Based on known labels (e.g., Rand Index, Precision-Recall).

Clustering in R – Water Treatment Plants

The water treatment dataset from the UCI repository is used to demonstrate hierarchical and k-means clustering.

# Load and preprocess data
library(data.table)
library(ggplot2)
library(fpc)

water_data <- read.table("water-treatment.data.txt", sep = ",", header = F, na.strings = "?")
setDT(water_data)

# Impute missing values
for(i in colnames(water_data)[-1]) {
  set(water_data, which(is.na(water_data[[i]])), i, median(water_data[[i]], na.rm = TRUE))
}

# Scale numeric features
scaled_wd <- scale(water_data[,-1, with = FALSE])

Next, hierarchical clustering is performed using Euclidean distance and Ward's method. A dendrogram is plotted, and clusters are determined via horizontal cuts. PCA is used to visualize clusters.

# Hierarchical clustering
d <- dist(scaled_wd, method = "euclidean")
h_clust <- hclust(d, method = "ward.D2")
plot(h_clust, labels = water_data$V1)

# Cut dendrogram
rect.hclust(h_clust, k = 4)
groups <- cutree(h_clust, k = 4)

Principal components are used for cluster visualization:

# PCA for visualization
pcmp <- princomp(scaled_wd)
pred_pc <- predict(pcmp)[,1:2]
comp_dt <- cbind(as.data.table(pred_pc), cluster = as.factor(groups), Labels = water_data$V1)

ggplot(comp_dt, aes(Comp.1, Comp.2)) +
  geom_point(aes(color = cluster), size = 3)

Then, k-means clustering is applied and similarly visualized using PCA components. The clustering consistency is confirmed visually, and it can also be checked numerically, as sketched after the code below.

# K-means clustering
kclust <- kmeans(scaled_wd, centers = 4, iter.max = 100)

ggplot(comp_dt, aes(Comp.1, Comp.2)) +
  geom_point(aes(color = as.factor(kclust$cluster)), size = 3)
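Since the fpc package is already loaded, the agreement between the two clusterings can also be checked numerically. This is a hedged sketch using cluster.stats, which reports external measures such as the corrected Rand index mentioned earlier:

# Compare hierarchical and k-means assignments on the same distance matrix
comparison <- cluster.stats(d, clustering = groups, alt.clustering = kclust$cluster)
comparison$corrected.rand   # closer to 1 means the two clusterings agree more strongly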

How can R Users Learn Python for Data Science ?

Introduction

The best way to learn a new skill is by doing it!

This article is meant to help R users enhance their set of skills and learn Python for data science (from scratch). After all, R and Python are the most important programming languages a data scientist must know.

Python is a supremely powerful, multi-purpose programming language. It has grown phenomenally in the last few years. It is used for web development, game development, and now data analysis / machine learning. Data analysis and machine learning are relatively new branches of the Python ecosystem.

For a beginner in data science, learning python for data analysis can be really painful. Why?

You try Googling "learn python," and you'll get tons of tutorials only meant for learning python for web development. How can you find a way then?

In this tutorial, we'll explore the basics of Python for performing data manipulation tasks. Alongside, we'll also look at how you do it in R. This parallel comparison will help you relate the set of tasks you do in R to how you do them in Python! And in the end, we'll take up a data set and practice our newly acquired Python skills.

Note: This article is best suited for people who have a basic knowledge of R language.


Table of Contents

  1. Why learn Python (even if you already know R)
  2. Understanding Data Types and Structures in Python vs. R
  3. Writing Code in Python vs. R
  4. Practicing Python on a Data Set

Why learn Python (even if you already know R)

No doubt, R is tremendously great at what it does. In fact, it was originally designed for doing statistical computing and manipulations. Its incredible community support allows a beginner to learn R quickly.

But, python is catching up fast. Established companies and startups have embraced python at a much larger scale compared to R.

(Chart: machine learning job postings, R vs. Python)

According to indeed.com (from Jan 2016 to November 2016), the number of job postings seeking "machine learning python" increased much faster (approx. 123%) than "machine learning in R" jobs. Do you know why? It is because

  1. Python supports the entire spectrum of machine learning in a much better way.
  2. Python not only supports model building but also supports model deployment.
  3. The support of various powerful deep learning libraries such as keras, convnet, theano, and tensorflow is more for python than R.
  4. You don't need to juggle several packages to locate a function in Python, unlike in R. Python has relatively fewer libraries, each having all the functions a data scientist would need.

Understanding Data Types and Structures in Python vs. R

These programming languages understand the complexity of a data set based on its variables and data types. Yes! Let's say you have a data set with one million rows and 50 columns. How would these programming languages understand the data?

Basically, both R and Python have pre-defined data types. The dependent and independent variables get classified among these data types. And, based on the data type, the interpreter allots memory for use. Python supports the following data types:

  1. Numbers – It stores numeric values. These numeric values can be stored in 4 types: integer, long, float, and complex.
    • Integer – Whole numbers such as 10, 13, 91, 102. Same as R's integer type.
    • Long – Long integers, including octal and hexadecimal representations. R handles large integers via the bit64 package.
    • Float – Decimal values like 1.23, 9.89. Equivalent to R's numeric type.
    • Complex – Numbers like 2 + 3i, 5i. Rarely used in data analysis.
  2. Boolean – Stores two values (True and False). Equivalent to R's logical type; note the case difference: R uses TRUE/FALSE, Python uses True/False.
  3. Strings – Stores text like "elephant", "lotus". Same as R's character type.
  4. Lists – Like R’s list, stores multiple data types in one structure.
  5. Tuples – Similar to immutable vectors in R (though R has no direct equivalent).
  6. Dictionary – Key-value pair structure. Think of keys as column names, values as data entries.

Since R is a statistical computing language, all the functions for manipulating data and reading variables are available inherently. Python, on the other hand, draws all of its data analysis / manipulation / visualization functions from external libraries. Python has several libraries for data manipulation and machine learning. The most important ones are:

  1. Numpy – Used for numerical computing. Offers math functions and array support. Similar to R’s list or array.
  2. Scipy – Scientific computing in python.
  3. Matplotlib – For data visualization. R uses ggplot2.
  4. Pandas – Main tool for data manipulation. R uses dplyr, data.table.
  5. Scikit Learn – Core library for machine learning algorithms in python.

In a way, python for a data scientist is largely about mastering the libraries stated above. However, there are many more advanced libraries which people have started using. Therefore, for practical purposes you should remember the following things:

  1. Array – Similar to R's list, supports multidimensional data with coercion effect when data types differ.
  2. List – Equivalent to R’s list.
  3. Data Frame – Two-dimensional structure composed of lists. R uses data.frame; python uses DataFrame from pandas.
  4. Matrix – Multidimensional structure of same class data. In R: matrix(); in python: numpy.column_stack().

Until here, I hope you've understood the basics of data types and data structures in R and Python. Now, let's start working with them!

Writing Code in Python vs. R

Let's use the knowledge gained in the previous section and understand its practical implications. But before that, you should install Python using Anaconda's Jupyter Notebook. You can download it here. Alternatively, you can download other Python IDEs. I hope you already have RStudio installed.

1. Creating Lists

In R:

my_list <- list('monday','specter',24,TRUE)
typeof(my_list)
[1] "list"

In Python:

my_list = ['monday','specter',24,True]
type(my_list)
list

Using pandas Series:

import pandas as pd
pd_list = pd.Series(my_list)
pd_list
0     monday
1    specter
2         24
3       True
dtype: object

Python uses zero-based indexing; R uses one-based indexing.

2. Matrix

In R:

my_mat <- matrix(1:10, nrow = 5)
my_mat
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

# Select first row
my_mat[1,]

# Select second column
my_mat[,2]

In Python (using NumPy):

import numpy as np
a = np.array(range(10,15))
b = np.array(range(20,25))
c = np.array(range(30,35))
my_mat = np.column_stack([a, b, c])

# Select first row
my_mat[0,]

# Select second column
my_mat[:,1]

3. Data Frames

In R:

data_set <- data.frame(Name = c("Sam","Paul","Tracy","Peter"),
                       Hair_Colour = c("Brown","White","Black","Black"),
                       Score = c(45,89,34,39))

In Python:

data_set = pd.DataFrame({'Name': ["Sam","Paul","Tracy","Peter"],
                         'Hair_Colour': ["Brown","White","Black","Black"],
                         'Score': [45,89,34,39]})

Selecting columns:

In R:

data_set$Name
data_set[["Name"]]
data_set[1]

data_set[c('Name','Hair_Colour')]
data_set[,c('Name','Hair_Colour')]

In Python:

data_set['Name']
data_set.Name
data_set[['Name','Hair_Colour']]
data_set.loc[:,['Name','Hair_Colour']]

Practicing Python on a Data Set

import numpy as np
import pandas as pd
from sklearn.datasets import load_boston

boston = load_boston()

boston.keys()
['data', 'feature_names', 'DESCR', 'target']

print(boston['feature_names'])
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']

print(boston['DESCR'])
bos_data = pd.DataFrame(boston['data'])
bos_data.head()

bos_data.columns = boston['feature_names']
bos_data.head()

bos_data.describe()

# First 10 rows
bos_data.iloc[:10]

# First 5 columns
bos_data.loc[:, 'CRIM':'NOX']
bos_data.iloc[:, :5]

# Filter rows
bos_data.query("CRIM > 0.05 & CHAS == 0")

# Sample
bos_data.sample(n=10)

# Sort
bos_data.sort_values(['CRIM']).head()
bos_data.sort_values(['CRIM'], ascending=False).head()

# Rename column
bos_data.rename(columns={'CRIM': 'CRIM_NEW'})

# Column means
bos_data[['ZN','RM']].mean()

# Transform numeric to categorical
bos_data['ZN_Cat'] = pd.cut(bos_data['ZN'], bins=5, labels=['a','b','c','d','e'])

# Grouped sum
bos_data.groupby('ZN_Cat')['AGE'].sum()

# Pivot table
bos_data['NEW_AGE'] = pd.cut(bos_data['AGE'], bins=3, labels=['Young','Old','Very_Old'])
bos_data.pivot_table(values='DIS', index='ZN_Cat', columns='NEW_AGE', aggfunc='mean')

Summary

While coding in Python, I realized that there is not much difference in the amount of code you write, although some functions are shorter in R than in Python. However, R has some really awesome packages which handle big data quite conveniently. Do let me know if you wish to learn about them!

Overall, learning both languages will give you enough confidence to handle any type of data set. In fact, the best part about learning Python is its comprehensive documentation for the numpy, pandas, and scikit-learn libraries, which is sufficient to help you overcome all initial obstacles.

In this article, we just touched the basics of python. There's a long way to go. Next week, we'll learn about data manipulation in python in detail. After that, we'll look into data visualization, and the powerful machine learning library in python.

Do share your experience, suggestions, and questions below while practicing this tutorial!

Practical Guide to Logistic Regression Analysis in R

Introduction

Recruiters in the analytics/data science industry expect you to know at least two algorithms: Linear Regression and Logistic Regression. I believe you should have in-depth understanding of these algorithms. Let me tell you why.

Due to their ease of interpretation, consultancy firms use these algorithms extensively. Startups are also catching up fast. As a result, in an analytics interview, most of the questions come from linear and Logistic Regression.

In this article, you'll learn Logistic Regression in detail. Believe me, Logistic Regression isn't easy to master. It does follow some assumptions like Linear Regression. But its method of calculating model fit and evaluation metrics is entirely different from Linear/Multiple regression.

But, don't worry! After you finish this tutorial, you'll become confident enough to explain Logistic Regression to your friends and even colleagues. Alongside theory, you'll also learn to implement Logistic Regression on a data set. I'll use R Language. In addition, we'll also look at various types of Logistic Regression methods.

Note: You should know basic algebra (elementary level). Also, if you are new to regression, I suggest you read how Linear Regression works first.

Table of Contents

  1. What is Logistic Regression ?
  2. What are the types of Logistic Regression techniques ?
  3. How does Logistic Regression work ?
  4. How can you evaluate Logistic Regression's model fit and accuracy ?
  5. Practical - Who survived on the Titanic ?

What is Logistic Regression ?

Many a time, situations arise where the dependent variable isn't normally distributed; i.e., the assumption of normality is violated. For example, think of a problem when the dependent variable is binary (Male/Female). Will you still use Multiple Regression? Of course not! Why? We'll look at it below.

Let's take a peek into the history of data analysis.

So, until 1972, people didn't know how to analyze data which has a non-normal error distribution in the dependent variable. Then, in 1972, came a breakthrough by John Nelder and Robert Wedderburn in the form of Generalized Linear Models. I'm sure you would be familiar with the term. Now, let's understand it in detail.

Generalized Linear Models are an extension of the linear model framework that also accommodates dependent variables which are non-normal. In general, they possess three characteristics:

  1. These models comprise a linear combination of input features.
  2. The mean of the response variable is related to the linear combination of input features via a link function.
  3. The response variable is considered to have an underlying probability distribution belonging to the family of exponential distributions such as binomial distribution, Poisson distribution, or Gaussian distribution. Practically, binomial distribution is used when the response variable is binary. Poisson distribution is used when the response variable represents count. And, Gaussian distribution is used when the response variable is continuous.

Logistic Regression belongs to the family of generalized linear models. It is a binary classification algorithm used when the response variable is dichotomous (1 or 0). Inherently, it returns the probability of the target class, but we can also obtain response labels using a probability threshold value. Following are the assumptions made by Logistic Regression:

  1. The response variable must follow a binomial distribution.
  2. Logistic Regression assumes a linear relationship between the independent variables and the link function (logit).
  3. The dependent variable should have mutually exclusive and exhaustive categories.

In R, we use the glm() function to apply Logistic Regression. In Python, we can use the LogisticRegression class from sklearn.linear_model; a quick glm() sketch is shown after the note below.

Note: We don't use Linear Regression for binary classification because its linear function can produce probabilities outside the [0,1] interval, thereby making them invalid predictions.
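For reference, here's a minimal glm() sketch; the data frame train with a binary target y, and the test frame test, are placeholders rather than objects from this tutorial:

# Fit a logistic regression model (binomial family, logit link)
model <- glm(y ~ ., data = train, family = binomial(link = "logit"))
summary(model)                                   # coefficients, deviances, AIC

# Predicted probabilities, converted to labels with a 0.5 threshold
probs  <- predict(model, newdata = test, type = "response")
labels <- ifelse(probs > 0.5, 1, 0)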

What are the types of Logistic Regression techniques ?

Logistic Regression isn't just limited to solving binary classification problems. To solve problems that have multiple classes, we can use extensions of Logistic Regression, which includes Multinomial Logistic Regression and Ordinal Logistic Regression. Let's get their basic idea:

1. Multinomial Logistic Regression: Let's say our target variable has K = 4 classes. This technique handles the multi-class problem by fitting K-1 independent binary logistic classifier models. To do this, it chooses one target class as the reference class and fits K-1 regression models that compare each of the remaining classes to the reference class.

Due to its restrictive nature, it isn't used widely because it does not scale very well in the presence of a large number of target classes. In addition, since it builds K - 1 models, we would require a much larger data set to achieve reasonable accuracy.

2. Ordinal Logistic Regression: This technique is used when the target variable is ordinal in nature. Let's say we want to predict years of work experience (1, 2, 3, 4, 5, etc.), so there exists an order in the values, i.e., 5 > 4 > 3 > 2 > 1. Unlike the multinomial model, where we train K-1 models, Ordinal Logistic Regression builds a single model with multiple threshold values.

If we have K classes, the model will require K-1 threshold or cutoff points. Also, it makes the important assumption of proportional odds: on the logit (S-shaped) scale, all of the thresholds lie on a straight line.

Note: Logistic Regression is not a great choice for solving multi-class problems, but it's good to be aware of its types. In this tutorial, we'll focus on Logistic Regression for the binary classification task.

How does Logistic Regression work?

Now comes the interesting part!

As we know, Logistic Regression assumes that the dependent (or response) variable follows a binomial distribution. Now, you may wonder, what is binomial distribution? Binomial distribution can be identified by the following characteristics:

  1. There must be a fixed number of trials denoted by n, i.e. in the data set, there must be a fixed number of rows.
  2. Each trial can have only two outcomes; i.e., the response variable can have only two unique categories.
  3. The outcome of each trial must be independent of each other; i.e., the unique levels of the response variable must be independent of each other.
  4. The probability of success (p) and failure (q) should be the same for each trial.
Let's understand how Logistic Regression works. For Linear Regression, where the output is a linear combination of input feature(s), we write the equation as:

Y = β0 + β1X + ε

In Logistic Regression, we use the same equation but with some modifications made to Y. Let's reiterate a fact about Logistic Regression: we calculate probabilities. And, probabilities always lie between 0 and 1. In other words, we can say:

  1. The response value must be positive.
  2. It should be lower than 1.

First, we'll meet the above two criteria. We know the exponential of any value is always a positive number, and any number divided by that number plus 1 will always be less than 1. Combining these two findings gives

p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))

This is the logistic function.

Now we are convinced that the probability value will always lie between 0 and 1. To determine the link function, follow the algebraic calculations carefully. P(Y=1|X) can be read as "probability that Y =1 given some value for x." Y can take only two values, 1 or 0. For ease of calculation, let's rewrite P(Y=1|X) as p(X).

(Figure: logistic regression equation derivation)
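For readability, the standard derivation the figure shows can be written out as follows. Starting from the logistic function above and solving for the odds:

p(X) / (1 − p(X)) = e^(β0 + β1X)

log( p(X) / (1 − p(X)) ) = β0 + β1X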

As you might recognize, the right side of the equation above is the linear combination of independent variables. The left side is known as the log-odds or logit and is the link function for Logistic Regression. Its inverse is the sigmoid function (shown below), which limits the probabilities to the range between 0 and 1.

(Figure: sigmoid plot of the logistic function)

Until here, I hope you've understood how we derive the equation of Logistic Regression. But how is it interpreted?

We can interpret the above equation as follows: a unit increase in variable x multiplies the odds by e raised to the power β1. In other words, the regression coefficients explain the change in log(odds) of the response for a unit change in the predictor. However, since the relationship between p(X) and X is not a straight line, a unit change in an input feature doesn't affect the model output directly; it affects the odds.

This is contradictory to Linear Regression where, regardless of the value of input feature, the regression coefficient always represents a fixed increase/decrease in the model output per unit increase in the input feature.

In Multiple Regression, we use the Ordinary Least Squares (OLS) method to determine the best coefficients and attain a good model fit. In Logistic Regression, we use the maximum likelihood method to determine the best coefficients and eventually a good model fit.

Maximum likelihood works like this: it tries to find values of the coefficients (β0, β1) such that the predicted probabilities are as close to the observed outcomes as possible. In other words, for a binary classification (1/0), maximum likelihood tries to find values of β0 and β1 such that the resultant probabilities are closest to either 1 or 0. The likelihood function is written as
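In its standard form (reproduced here because the original equation image is missing), the likelihood for a binary response is

L(β0, β1) = ∏ p(xi) over observations with yi = 1  ×  ∏ (1 − p(xj)) over observations with yj = 0

and the coefficient estimates are the values that maximize this product (in practice, its logarithm).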

How can you evaluate Logistic Regression model fit and accuracy ?

In Linear Regression, we check adjusted R², F-statistics, MAE, and RMSE to evaluate model fit and accuracy. Logistic Regression, however, employs an entirely different set of metrics, since here we deal with probabilities and categorical values. Following are the evaluation metrics used for Logistic Regression:

1. Akaike Information Criteria (AIC)

You can look at AIC as the counterpart of adjusted R² in multiple regression. It's an important indicator of model fit and follows the rule: the smaller, the better. AIC penalizes a model for having more coefficients; in other words, adding variables that don't improve the fit will increase AIC, which helps avoid overfitting.

Looking at the AIC metric of one model wouldn't really help. It is more useful in comparing models (model selection). So, build 2 or 3 Logistic Regression models and compare their AIC. The model with the lowest AIC will be relatively better.

2. Null Deviance and Residual Deviance

The deviance of an observation is computed as -2 times the log-likelihood of that observation. The importance of deviance can be further understood through its two types: null deviance and residual deviance. Null deviance is calculated from the model with no features, i.e., only the intercept; the null model predicts the class via a constant probability.

Residual deviance is calculated from the model having all the features. In comparison with Linear Regression, think of residual deviance as the residual sum of squares (RSS) and null deviance as the total sum of squares (TSS). The larger the difference between null and residual deviance, the better the model.

Also, you can use these metrics to compare multiple models: a model whose residual deviance is much lower than its null deviance explains the data well and is the better model. Practically, AIC is usually given preference over deviance to evaluate model fit.

3. Confusion Matrix

The confusion matrix is the most crucial metric commonly used to evaluate classification models. It can be confusing at first, but make sure you understand it by heart. If you still don't understand anything, ask me in the comments. The skeleton of a confusion matrix looks like this:

(Figure: confusion matrix for logistic regression)

As you can see, the confusion matrix avoids "confusion" by laying out the actual and predicted values in a tabular format. In the table above, the positive class = 1 and the negative class = 0. Following are the metrics we can derive from a confusion matrix:

Accuracy - It determines the overall predictive accuracy of the model. It is calculated as Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives).

True Positive Rate (TPR) - It indicates how many positive values, out of all the positive values, have been correctly predicted. The formula is TP / (TP + FN). Also, TPR = 1 - False Negative Rate. It is also known as Sensitivity or Recall.

False Positive Rate (FPR) - It indicates how many negative values, out of all the negative values, have been incorrectly predicted. The formula is FP / (FP + TN). Also, FPR = 1 - True Negative Rate.

True Negative Rate (TNR) - It indicates how many negative values, out of all the negative values, have been correctly predicted. The formula is TN / (TN + FP). It is also known as Specificity.

False Negative Rate (FNR) - It indicates how many positive values, out of all the positive values, have been incorrectly predicted. The formula is FN / (FN + TP).

Precision - It indicates how many values, out of all the predicted positive values, are actually positive. It is formulated as TP / (TP + FP).

F Score - The F score is the harmonic mean of precision and recall. It lies between 0 and 1: the higher the value, the better the model. It is formulated as 2 * (precision * recall) / (precision + recall).
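To tie these formulas together, here's a minimal base-R sketch that computes them from a confusion matrix; the vectors actual and predicted are placeholders for 0/1 class labels:

# Build the confusion matrix from actual and predicted labels
cm <- table(Predicted = predicted, Actual = actual)
TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]

accuracy    <- (TP + TN) / sum(cm)
recall      <- TP / (TP + FN)               # true positive rate / sensitivity
specificity <- TN / (TN + FP)               # true negative rate
precision   <- TP / (TP + FP)
f_score     <- 2 * (precision * recall) / (precision + recall)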