Rashmi Jain

Rashmi began their journey in software development but found their voice in storytelling. Now, Rashmi simplifies complex tech concepts through engaging narratives that resonate with both engineers and hiring managers.

Insights & Stories by Rashmi Jain

Explore Rashmi Jain’s blogs for thoughtful breakdowns of tech hiring, development culture, and the softer skills that build stronger engineering teams.

Descriptive statistics with Python-NumPy

Is it going to rain today? Should I take my umbrella to the office or not? To answer such questions, we simply take out our phones and check the weather forecast. How is this done? Computer models use statistics to compare past weather conditions with current ones and predict future conditions. From studying how much fluoride is safe in our toothpaste to predicting future stock prices, everything requires statistics. Data is everything in statistics. Calculating the range, median, and mode of a data set is all part of descriptive statistics.

Data representation, manipulation, and visualization are key components of statistics. You can read about them here.

The next important step is analyzing the data, which can be done using both descriptive and inferential statistics. Most research studies conducted on groups of people use both to analyze results and draw conclusions.

Through this article, we will learn descriptive statistics using Python.


Introduction

Descriptive statistics describe the basic and important features of data. Descriptive statistics help simplify and summarize large amounts of data in a sensible manner. For instance, consider the Cumulative Grade Point Index (CGPI), which is used to describe the general performance of a student across a wide range of course experiences.

Descriptive statistics involve evaluating measures of center (centrality measures) and measures of dispersion (spread).


Centrality measures

Centrality measures give us an estimate of the center of a distribution. They give us a sense of the typical value we would expect to see. The three major measures of center are the mean, median, and mode.
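
As a quick illustration, here is how these measures (plus two simple measures of spread) can be computed with NumPy on a made-up sample; NumPy has no built-in mode function, so the sketch counts values with np.unique:

import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])   # made-up sample values

print(np.mean(data))      # mean: sum of the values divided by their count
print(np.median(data))    # median: middle value of the sorted data

# NumPy has no direct mode function, so count occurrences with np.unique
values, counts = np.unique(data, return_counts=True)
print(values[np.argmax(counts)])   # mode: the most frequent value

# Two simple measures of spread, for comparison
print(np.ptp(data))       # range: maximum minus minimum
print(np.std(data))       # standard deviation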

Machine Learning and Auto-Evaluation

Machine Learning

In very simple terms, Machine Learning is about training or teaching computers to make decisions or take actions without being explicitly programmed. For example, whenever you read a tweet or movie review, you can figure out whether the views expressed are positive or negative. But can you teach a computer to determine the sentiment of that text? This has many real-life applications. For instance, when Donald Trump makes a speech, Twitter responds with a range of sentiments, and his campaign team can assess the overall sentiment using machine learning.

Another example: Baidu predicted that Germany would win the 2014 World Cup even before the match was played.

Weather Problem

Consider this small dataset of favorable weather conditions for playing a game. The goal is to forecast whether one can play the game based on the given conditions.

Outlook Temperature Humidity Windy Play
Sunny Hot High False No
Rainy Mild High False Yes
Sunny Cool Normal False Yes

Definitions

Feature/Attribute: Outlook, Temperature, Humidity, and Windy are features or attributes that influence the outcome.

Outcome/Target: The result to be predicted, i.e., whether you can play or not.

Vector: A row in the dataset representing an ordered collection of features (e.g., Sunny, Hot, High, False).

ML Model: The algorithm or process generated from the learning process (e.g., Decision Trees, SVM, Naive Bayes).

Error Metric/Evaluation Metric: Used to assess the accuracy of an ML model’s predictions. Different types exist for different problems.

Supporting ML Problems on HackerEarth

HackerEarth’s ML platform supports a typical machine learning flow. A dataset is split into training and test sets. Users train their models on the training set and predict outcomes on the test set. The test set does not include the target variable.

Example Dataset

Outlook Temperature Humidity Windy Play
Sunny Hot High False No
Rainy Mild High False Yes
Sunny Cool Normal False Yes
Overcast Hot High False Yes
Rainy Mild High False Yes
Overcast Hot Normal False Yes
Sunny Mild Normal True Yes
Sunny Mild High False No
Overcast Cool Normal True Yes
Rainy Mild High True Yes

Train Dataset (train.csv)

Outlook Temperature Humidity Windy Play
Sunny Hot High False No
Rainy Mild High False Yes
Sunny Cool Normal False Yes
Overcast Hot High False Yes
Rainy Mild High False Yes
Overcast Hot Normal False Yes

Test Dataset (test.csv)

Id Outlook Temperature Humidity Windy
1 Sunny Mild Normal True
2 Sunny Mild High False
3 Overcast Cool Normal True
4 Rainy Mild High True

Notice the absence of the target variable in the test data.

User Prediction File (user_prediction.csv)

Id Play
1 Yes
2 Yes
3 No
4 No

Correct Prediction File (correct_prediction.csv)

Id Play
1 Yes
2 No
3 Yes
4 Yes

Evaluation Metric

During the contest, only 50% of the test dataset is used for evaluation to discourage overfitting. The evaluation metric is defined as:

Score = Number of correct predictions / Total rows

In this case, only ID 1 is predicted correctly out of the first two, so:

Score online = 1 / 2 = 0.5

After the contest, the model is evaluated on the full test dataset:

Score offline = 1 / 4 = 0.25

This demonstrates how overfitting can reduce real-world model performance. Online evaluations using partial data help encourage more generalizable solutions.
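
As a rough sketch of how such a score can be computed, assuming small comma-separated files shaped like the ones above (with Id and Play columns) and simply taking the first half of the rows as the online split, which is our own simplification:

import csv

def load_predictions(path):
    # Read an Id,Play CSV into a dict of {Id: Play}
    with open(path, newline='') as f:
        return {row['Id']: row['Play'] for row in csv.DictReader(f)}

user = load_predictions('user_prediction.csv')
correct = load_predictions('correct_prediction.csv')

def score(ids):
    # Fraction of rows where the user's prediction matches the answer key
    return sum(user[i] == correct[i] for i in ids) / len(ids)

online_ids = list(correct)[:len(correct) // 2]   # 50% of the test rows used during the contest
print('Online score :', score(online_ids))       # 1/2 = 0.5 for the data above
print('Offline score:', score(list(correct)))    # 1/4 = 0.25 for the data above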

Gradient descent algorithm for linear regression

You are probably using machine learning multiple times a day without realizing it. For instance, when checking your mailbox, a spam filter automatically filters out junk mail—thanks to Machine Learning (ML). ML is the science of training a machine to learn from past data, without being explicitly programmed.

There are two main types of ML techniques:

  • Supervised Machine Learning: The system learns from predefined training data to predict future outcomes.
  • Unsupervised Learning: The system identifies hidden patterns in data without prior labels—for example, finding close friend groups on Facebook.

Supervised Learning

Consider the following dataset showing house prices in Bengaluru, India:

Living area (sq ft) Price (USD)
820 30105
1050 58448
1550 85911
1200 87967
1600 73722
1117 54630
550 42441
1162 79596

To predict housing prices based on living area, we define a hypothesis function: hθ(x) = θ₀ + θ₁x. Here, θ₀ and θ₁ are parameters we aim to optimize. Our goal is to minimize the error between predicted and actual prices using a cost function:

J(θ₀, θ₁) = (1/2m) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

This cost function measures the squared error over m training examples. Our objective is to find θ₀ and θ₁ that minimize this cost.
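
To make the formula concrete, here is a tiny sketch with made-up numbers (not the housing data above) that evaluates hθ(x) and J(θ₀, θ₁) for one choice of parameters:

import numpy as np

# Made-up toy data, just to illustrate the formulas
x = np.array([1.0, 2.0, 3.0])    # feature values x(i)
y = np.array([2.0, 2.5, 3.5])    # target values y(i)

def h(theta0, theta1, x):
    # Hypothesis h_theta(x) = theta0 + theta1 * x
    return theta0 + theta1 * x

def J(theta0, theta1, x, y):
    # Cost J = (1/2m) * sum((h(x(i)) - y(i))^2)
    m = len(y)
    return np.sum((h(theta0, theta1, x) - y) ** 2) / (2 * m)

print(J(0.0, 1.0, x, y))   # the cost for theta0 = 0, theta1 = 1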

Gradient Descent

Gradient descent is an optimization algorithm used to minimize functions like our cost function. Starting with initial guesses for θ₀ and θ₁, we iteratively update them using:

θj := θj − α ∂/∂θj J(θ₀, θ₁)

Here, α is the learning rate, and both parameters are updated simultaneously in each iteration. The updates continue until convergence, that is, until the changes become negligible.

Applying Gradient Descent to Linear Regression

The partial derivatives of the cost function with respect to θ₀ and θ₁ are:

  • ∂J/∂θ₀ = (1/m) Σ(hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)
  • ∂J/∂θ₁ = (1/m) Σ(hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)·x⁽ⁱ⁾

Using these, we can apply gradient descent and iteratively update θ₀ and θ₁ to minimize the cost function.

Gradient Descent with Python

import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: y is roughly x + 2.5 plus Gaussian noise
x = np.random.uniform(-4, 4, 500)
y = x + np.random.standard_normal(500) + 2.5

def cost(X, Y, theta):
    # J(theta) = (1/2m) * sum((X.theta - Y)^2)
    residual = np.dot(X, theta) - Y
    return float(np.dot(residual.T, residual)) / (2 * len(Y))

alpha = 0.1                      # learning rate
num_iters = 1000
theta = np.zeros((2, 1))         # [theta0, theta1]
X = np.c_[np.ones(500), x]       # design matrix with an intercept column
Y = y.reshape(-1, 1)
cost_history = []
theta_history = []

plt.plot(x, y, 'o')              # scatter plot of the data

for i in range(num_iters):
    error = np.dot(X, theta) - Y
    grad0 = np.sum(error) / len(Y)               # dJ/d(theta0)
    grad1 = np.sum(error * X[:, [1]]) / len(Y)   # dJ/d(theta1)
    theta = theta - alpha * np.array([[grad0], [grad1]])   # simultaneous update
    cost_history.append(cost(X, Y, theta))
    theta_history.append(theta)
    if i in (1, 3, 7, 10, 14) or (i >= 20 and i % 10 == 0):
        # draw the intermediate fit line for this iteration
        plt.plot(x, theta[0, 0] + theta[1, 0] * x, alpha=0.2)

plt.title('Linear regression by gradient descent')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

print(theta)

Gradient Descent with R

# Synthetic data: y is roughly x + 2.5 plus Gaussian noise
x <- runif(500, -4, 4)
y <- x + rnorm(500) + 2.5

# J(theta) = (1/2m) * sum((X %*% theta - y)^2)
cost <- function(X, y, theta) {
  sum((X %*% theta - y)^2) / (2 * length(y))
}

alpha <- 0.1                      # learning rate
num_iters <- 1000
cost_history <- rep(0, num_iters)
theta_history <- vector("list", num_iters)
theta <- c(0, 0)                  # c(theta0, theta1)
X <- cbind(1, x)                  # design matrix with an intercept column

for (i in 1:num_iters) {
  error <- X %*% theta - y
  # gradients of J with respect to theta0 and theta1
  grad0 <- sum(error) / length(y)
  grad1 <- sum(error * X[, 2]) / length(y)
  theta <- theta - alpha * c(grad0, grad1)   # simultaneous update
  cost_history[i] <- cost(X, y, theta)
  theta_history[[i]] <- theta
}

print(theta)

# Plot the data, the intermediate fits, and the final fitted line
plot(x, y, col = rgb(0.2, 0.4, 0.6, 0.4), main = 'Linear regression by gradient descent')
for (i in c(1, 3, 6, 10, 14, seq(20, num_iters, by = 10))) {
  abline(coef = theta_history[[i]], col = rgb(0.8, 0, 0, 0.3))
}
abline(coef = theta, col = 'blue')

Linear regression via gradient descent is simple and intuitive. Although advanced learning algorithms may use more complex models and cost functions, the underlying principles remain the same.

Principal component analysis with linear algebra

Principal component analysis (PCA) is a powerful linear algebra-based statistical method used to reduce the dimensionality of datasets while retaining important information. It simplifies complex datasets, making them easier to analyze and visualize.

Suppose we have n individuals and measure m variables for each. Each individual’s measurements form an m-dimensional vector. For example, data collected from five individuals might look like this:

Name A B C D E
Age 24 50 17 35 65
Height (cm) 152 175 160 170 155
IQ 108 102 95 97 87

Here, n = 5 and m = 3. Each individual's data can be written as a vector, for instance: x₁ = [24, 152, 108]ᵗ.

PCA helps answer questions such as:

  1. Which variables are correlated?
  2. Can we visualize this high-dimensional data more easily?
  3. Which variables contribute most to the variation in the dataset?

Linear Transformations

Multiplying a matrix by a vector results in a linear transformation of that vector. This operation is key in PCA and is defined as: Av = w.
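
For example, with NumPy (the numbers are chosen purely for illustration):

import numpy as np

A = np.array([[2, 0],
              [1, 3]])
v = np.array([1, 2])
w = A @ v        # the linear transformation defined by A, applied to v
print(w)         # [2 7]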

Eigenvectors and Eigenvalues

An eigenvector v of a matrix A satisfies Av = λv, where λ is the eigenvalue. Transforming an eigenvector by A only scales it by λ; its direction does not change. In PCA, the eigenvectors of the covariance matrix indicate the directions along which the data varies.
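
A quick NumPy check of this defining property, on a small symmetric matrix chosen for illustration:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eig(A)

v = eigvecs[:, 0]                           # first eigenvector (a column of eigvecs)
print(np.allclose(A @ v, eigvals[0] * v))   # True: A v equals lambda v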

Spectral Theorem

For symmetric matrices, the spectral theorem ensures real eigenvalues and orthogonal eigenvectors. This property is fundamental to PCA since the covariance matrix is symmetric.

Covariance Matrix

The covariance matrix captures the variance and correlation of the dataset's variables. Its entries Sₖₗ represent the covariance between variables k and l. Diagonal entries are variances; off-diagonal entries are covariances.
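
With NumPy, the covariance matrix of the small age/height/IQ table above can be obtained directly (np.cov treats each row as a variable by default):

import numpy as np

# Rows are the variables (Age, Height, IQ); columns are the five individuals
X = np.array([[ 24,  50,  17,  35,  65],
              [152, 175, 160, 170, 155],
              [108, 102,  95,  97,  87]], dtype=float)

S = np.cov(X)
print(S)    # diagonal entries are variances; off-diagonal entries are covariances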

Steps in PCA

  1. Organize the dataset into an m × n matrix where each column is a sample.
  2. Subtract the mean of each variable from the dataset (mean centering).
  3. Compute the covariance matrix S = (1/(n−1)) · BBᵗ, where B is the mean-centered matrix.
  4. Apply the spectral theorem to get eigenvalues and eigenvectors.
  5. Select the top k eigenvectors based on the highest eigenvalues. These are the principal components.

Dimensionality Reduction

By projecting the data onto the first k principal components, we reduce the dimensions while retaining most of the dataset's variance. This simplifies analysis and visualization.
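
Putting the steps together, here is a minimal NumPy sketch on the same age/height/IQ example; the variable names (B, S, W, Z) and the choice k = 2 are ours, purely for illustration:

import numpy as np

# Step 1: organize the data as an m x n matrix (each column is one individual)
X = np.array([[ 24,  50,  17,  35,  65],    # Age
              [152, 175, 160, 170, 155],    # Height (cm)
              [108, 102,  95,  97,  87]],   # IQ
             dtype=float)
m, n = X.shape

# Step 2: mean-center each variable (each row)
B = X - X.mean(axis=1, keepdims=True)

# Step 3: covariance matrix S = (1/(n-1)) * B Bᵗ  (m x m and symmetric)
S = B @ B.T / (n - 1)

# Step 4: eigendecomposition; eigh is NumPy's routine for symmetric matrices
eigvals, eigvecs = np.linalg.eigh(S)

# Step 5: sort by decreasing eigenvalue and keep the top k principal components
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
W = eigvecs[:, :k]    # m x k matrix whose columns are the principal components
Z = W.T @ B           # k x n projection: the data expressed in the reduced space

print(eigvals / eigvals.sum())   # proportion of variance explained by each component
print(Z)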

Interpreting Eigenvalues

  • Each eigenvalue indicates the variance captured by its corresponding eigenvector.
  • The sum of all eigenvalues is the total variance of the dataset.
  • The ratio λᵢ / (λ₁ + λ₂ + ... + λₘ) shows the proportion of variance explained by the i-th component.

Applications of PCA

  • Data visualization
  • Noise reduction
  • Face recognition (eigenfaces)
  • Genomics and bioinformatics
  • Market segmentation

In face recognition, PCA can reduce high-dimensional image data to a small number of significant components (eigenfaces), allowing for efficient and accurate identification based on stored components.

In the next article, we will explore gradient descent, an optimization technique commonly used in machine learning.

Prerequisites of linear algebra for machine learning

Just about everyone has watched animated movies such as Frozen or Big Hero 6, or has at least heard about 3D computer games. It seems more fun to enjoy the movies and games than to read a linear algebra book. But it is because of linear algebra that we are able to watch a character move on the screen. Linear algebra is the foundation of our new digital world.

Through this article, we will learn matrix arithmetic and how to use NumPy to carry out these operations in Python.

Why We Need Linear Algebra for Machine Learning

Machine learning involves handling enormous datasets. An effective way to represent this data is in the form of 2D arrays or rectangular blocks, where each row represents a sample and each column represents a feature. It's natural to view this array as a matrix and each column as a vector.

Python and Linear Algebra

NumPy is a Python library used for scientific computing. It provides multidimensional arrays and tools to work with them.

Matrices

A matrix is a rectangular array of numbers arranged in rows and columns. For example:

[1 2 3]
[100 -3 1.15]

This is a 2×3 matrix. A general m×n matrix is denoted A = (aᵢⱼ), with i = 1, …, m and j = 1, …, n.

Creating Matrices with NumPy

import numpy as np
A = np.array([[1, 2, 2], [3, 2, 1]])
print(A)

Matrix Shape

A.shape  # Output: (2, 3)

Identity Matrix

A = np.eye(3)
print(A)

Matrix Operations

Addition

Matrix addition is entry-wise and only defined for matrices of the same dimension.

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
C = np.add(A, B)
print(C)

Transpose

A = np.array([[1, 2, 3], [4, 5, 6]])
T = A.T
print(T)

Multiplication

To multiply matrices A (m×n) and B (n×q), the number of columns of A must match the number of rows of B. The result is an m×q matrix.

A = np.array([[1, 2], [3, 4]])
B = np.array([[2, 0], [1, 2]])
C = np.dot(A, B)
print(C)

Inverse

A = np.array([[2, 3], [4, 5]])
A_inv = np.linalg.inv(A)
print(A_inv)

Vectors

A vector is a one-dimensional array or a matrix with a single column. Example of a 3-dimensional vector:

v = np.array([[1], [2], [3]])

Vector Addition

v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])
v_sum = np.add(v1, v2)
print(v_sum)

Dot Product

v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])
dot_product = np.dot(v1, v2)
print(dot_product)

Conclusion

Matrix arithmetic is a core component of linear algebra and is essential for many machine learning techniques. In the next article, we will explore how matrix operations are applied in Principal Component Analysis (PCA), a method for identifying patterns in data.