Rashmi Jain

Rashmi began their journey in software development but found their voice in storytelling. Now, Rashmi simplifies complex tech concepts through engaging narratives that resonate with both engineers and hiring managers.

Insights & Stories by Rashmi Jain

Explore Rashmi Jain’s blogs for thoughtful breakdowns of tech hiring, development culture, and the softer skills that build stronger engineering teams.

Descriptive statistics with Python-NumPy

Is it going to rain today? Should I take my umbrella to the office or not? To answer such questions, we simply take out our phones and check the weather forecast. How is this done? Computer models use statistics to compare past weather conditions with current ones and predict future conditions. From studying how much fluoride is safe in our toothpaste to predicting future stock prices, everything requires statistics. Data is everything in statistics. Calculating the range, median, and mode of a data set is all part of descriptive statistics.

Data representation, manipulation, and visualization are key components of statistics. You can read about them here.

The next important step is analyzing the data, which can be done using both descriptive and inferential statistics. Most research studies conducted on groups of people use both to analyze results and draw conclusions.

Through this article, we will learn descriptive statistics using Python.


Introduction

Descriptive statistics describe the basic and important features of data. Descriptive statistics help simplify and summarize large amounts of data in a sensible manner. For instance, consider the Cumulative Grade Point Index (CGPI), which is used to describe the general performance of a student across a wide range of course experiences.

Descriptive statistics involve evaluating measures of center (centrality measures) and measures of dispersion (spread).


Centrality measures

Centrality measures give us an estimate of the center of a distribution. They give us a sense of the typical value we would expect to see. The three major measures of center are the mean, median, and mode.
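
As a quick illustration, here is how these measures (plus two simple measures of spread) can be computed with NumPy on a made-up sample; NumPy has no built-in mode function, so the sketch counts values with np.unique:

import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])   # made-up sample values

print(np.mean(data))      # mean: sum of the values divided by their count
print(np.median(data))    # median: middle value of the sorted data

# NumPy has no direct mode function, so count occurrences with np.unique
values, counts = np.unique(data, return_counts=True)
print(values[np.argmax(counts)])   # mode: the most frequent value

# Two simple measures of spread, for comparison
print(np.ptp(data))       # range: maximum minus minimum
print(np.std(data))       # standard deviation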

Machine Learning and Auto-Evaluation

Machine Learning

In very simple terms, Machine Learning is about training or teaching computers to make decisions or take actions without being explicitly programmed. For example, whenever you read a tweet or movie review, you can figure out whether the views expressed are positive or negative. But can you teach a computer to determine the sentiment of that text? This has many real-life applications. For instance, when Donald Trump makes a speech, Twitter responds with a range of sentiments, and his campaign team can assess the overall sentiment using machine learning.

Another example: Baidu predicted that Germany would win the 2014 World Cup even before the match was played.

Weather Problem

Consider this small dataset of favorable weather conditions for playing a game. The goal is to forecast whether one can play the game based on the given conditions.

Outlook Temperature Humidity Windy Play
Sunny Hot High False No
Rainy Mild High False Yes
Sunny Cool Normal False Yes

Definitions

Feature/Attribute: Outlook, Temperature, Humidity, and Windy are features or attributes that influence the outcome.

Outcome/Target: The result to be predicted, i.e., whether you can play or not.

Vector: A row in the dataset representing an ordered collection of features (e.g., Sunny, Hot, High, False).

ML Model: The algorithm or process generated from the learning process (e.g., Decision Trees, SVM, Naive Bayes).

Error Metric/Evaluation Metric: Used to assess the accuracy of an ML model’s predictions. Different types exist for different problems.

Supporting ML Problems on HackerEarth

HackerEarth’s ML platform supports a typical machine learning flow. A dataset is split into training and test sets. Users train their models on the training set and predict outcomes on the test set. The test set does not include the target variable.

Example Dataset

Outlook Temperature Humidity Windy Play
Sunny Hot High False No
Rainy Mild High False Yes
Sunny Cool Normal False Yes
Overcast Hot High False Yes
Rainy Mild High False Yes
Overcast Hot Normal False Yes
Sunny Mild Normal True Yes
Sunny Mild High False No
Overcast Cool Normal True Yes
Rainy Mild High True Yes

Train Dataset (train.csv)

Outlook Temperature Humidity Windy Play
Sunny Hot High False No
Rainy Mild High False Yes
Sunny Cool Normal False Yes
Overcast Hot High False Yes
Rainy Mild High False Yes
Overcast Hot Normal False Yes

Test Dataset (test.csv)

Id Outlook Temperature Humidity Windy
1 Sunny Mild Normal True
2 Sunny Mild High False
3 Overcast Cool Normal True
4 Rainy Mild High True

Notice the absence of the target variable in the test data.

User Prediction File (user_prediction.csv)

Id Play
1 Yes
2 Yes
3 No
4 No

Correct Prediction File (correct_prediction.csv)

Id Play
1 Yes
2 No
3 Yes
4 Yes

Evaluation Metric

During the contest, only 50% of the test dataset is used for evaluation to discourage overfitting. The evaluation metric is defined as:

Score = Number of correct predictions / Total rows

In this case, only ID 1 is predicted correctly out of the first two, so:

Score online = 1 / 2 = 0.5

After the contest, the model is evaluated on the full test dataset:

Score offline = 1 / 4 = 0.25

This demonstrates how overfitting can reduce real-world model performance. Online evaluations using partial data help encourage more generalizable solutions.
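
As a rough sketch of how such a score can be computed, assuming small comma-separated files shaped like the ones above (with Id and Play columns) and simply taking the first half of the rows as the online split, which is our own simplification:

import csv

def load_predictions(path):
    # Read an Id,Play CSV into a dict of {Id: Play}
    with open(path, newline='') as f:
        return {row['Id']: row['Play'] for row in csv.DictReader(f)}

user = load_predictions('user_prediction.csv')
correct = load_predictions('correct_prediction.csv')

def score(ids):
    # Fraction of rows where the user's prediction matches the answer key
    return sum(user[i] == correct[i] for i in ids) / len(ids)

online_ids = list(correct)[:len(correct) // 2]   # 50% of the test rows used during the contest
print('Online score :', score(online_ids))       # 1/2 = 0.5 for the data above
print('Offline score:', score(list(correct)))    # 1/4 = 0.25 for the data above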

Gradient descent algorithm for linear regression

You are probably using machine learning multiple times a day without realizing it. For instance, when checking your mailbox, a spam filter automatically filters out junk mail—thanks to Machine Learning (ML). ML is the science of training a machine to learn from past data, without being explicitly programmed.

There are two main types of ML techniques:

  • Supervised Machine Learning: The system learns from predefined training data to predict future outcomes.
  • Unsupervised Learning: The system identifies hidden patterns in data without prior labels—for example, finding close friend groups on Facebook.

Supervised Learning

Consider the following dataset showing house prices in Bengaluru, India:

Living area (sq ft) Price (USD)
820 30105
1050 58448
1550 85911
1200 87967
1600 73722
1117 54630
550 42441
1162 79596

To predict housing prices based on living area, we define a hypothesis function: hθ(x) = θ₀ + θ₁x. Here, θ₀ and θ₁ are parameters we aim to optimize. Our goal is to minimize the error between predicted and actual prices using a cost function:

J(θ₀, θ₁) = (1/2m) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

This cost function measures the squared error over m training examples. Our objective is to find θ₀ and θ₁ that minimize this cost.
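
To make the formula concrete, here is a tiny sketch with made-up numbers (not the housing data above) that evaluates hθ(x) and J(θ₀, θ₁) for one choice of parameters:

import numpy as np

# Made-up toy data, just to illustrate the formulas
x = np.array([1.0, 2.0, 3.0])    # feature values x(i)
y = np.array([2.0, 2.5, 3.5])    # target values y(i)

def h(theta0, theta1, x):
    # Hypothesis h_theta(x) = theta0 + theta1 * x
    return theta0 + theta1 * x

def J(theta0, theta1, x, y):
    # Cost J = (1/2m) * sum((h(x(i)) - y(i))^2)
    m = len(y)
    return np.sum((h(theta0, theta1, x) - y) ** 2) / (2 * m)

print(J(0.0, 1.0, x, y))   # the cost for theta0 = 0, theta1 = 1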

Gradient Descent

Gradient descent is an optimization algorithm used to minimize functions like our cost function. Starting with initial guesses for θ₀ and θ₁, we iteratively update them using:

θj := θj − α ∂/∂θj J(θ₀, θ₁)

Here, α is the learning rate, and both parameters are updated simultaneously in each iteration. The updates continue until convergence, that is, until the changes become negligible.

Applying Gradient Descent to Linear Regression

The partial derivatives of the cost function with respect to θ₀ and θ₁ are:

  • ∂J/∂θ₀ = (1/m) Σ(hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)
  • ∂J/∂θ₁ = (1/m) Σ(hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)·x⁽ⁱ⁾

Using these, we can apply gradient descent and iteratively update θ₀ and θ₁ to minimize the cost function.

Gradient Descent with Python

import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: y is roughly x + 2.5 plus Gaussian noise
x = np.random.uniform(-4, 4, 500)
y = x + np.random.standard_normal(500) + 2.5

def cost(X, Y, theta):
    # J(theta) = (1/2m) * sum((X.theta - Y)^2)
    residual = np.dot(X, theta) - Y
    return float(np.dot(residual.T, residual)) / (2 * len(Y))

alpha = 0.1                      # learning rate
num_iters = 1000
theta = np.zeros((2, 1))         # [theta0, theta1]
X = np.c_[np.ones(500), x]       # design matrix with an intercept column
Y = y.reshape(-1, 1)
cost_history = []
theta_history = []

plt.plot(x, y, 'o')              # scatter plot of the data

for i in range(num_iters):
    error = np.dot(X, theta) - Y
    grad0 = np.sum(error) / len(Y)               # dJ/d(theta0)
    grad1 = np.sum(error * X[:, [1]]) / len(Y)   # dJ/d(theta1)
    theta = theta - alpha * np.array([[grad0], [grad1]])   # simultaneous update
    cost_history.append(cost(X, Y, theta))
    theta_history.append(theta)
    if i in (1, 3, 7, 10, 14) or (i >= 20 and i % 10 == 0):
        # draw the intermediate fit line for this iteration
        plt.plot(x, theta[0, 0] + theta[1, 0] * x, alpha=0.2)

plt.title('Linear regression by gradient descent')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

print(theta)

Gradient Descent with R

# Synthetic data: y is roughly x + 2.5 plus Gaussian noise
x <- runif(500, -4, 4)
y <- x + rnorm(500) + 2.5

# J(theta) = (1/2m) * sum((X %*% theta - y)^2)
cost <- function(X, y, theta) {
  sum((X %*% theta - y)^2) / (2 * length(y))
}

alpha <- 0.1                      # learning rate
num_iters <- 1000
cost_history <- rep(0, num_iters)
theta_history <- vector("list", num_iters)
theta <- c(0, 0)                  # c(theta0, theta1)
X <- cbind(1, x)                  # design matrix with an intercept column

for (i in 1:num_iters) {
  error <- X %*% theta - y
  # gradients of J with respect to theta0 and theta1
  grad0 <- sum(error) / length(y)
  grad1 <- sum(error * X[, 2]) / length(y)
  theta <- theta - alpha * c(grad0, grad1)   # simultaneous update
  cost_history[i] <- cost(X, y, theta)
  theta_history[[i]] <- theta
}

print(theta)

# Plot the data, the intermediate fits, and the final fitted line
plot(x, y, col = rgb(0.2, 0.4, 0.6, 0.4), main = 'Linear regression by gradient descent')
for (i in c(1, 3, 6, 10, 14, seq(20, num_iters, by = 10))) {
  abline(coef = theta_history[[i]], col = rgb(0.8, 0, 0, 0.3))
}
abline(coef = theta, col = 'blue')

Linear regression via gradient descent is simple and intuitive. Although advanced learning algorithms may use more complex models and cost functions, the underlying principles remain the same.

Principal component analysis with linear algebra

Principal component analysis (PCA) is a powerful linear algebra-based statistical method used to reduce the dimensionality of datasets while retaining important information. It simplifies complex datasets, making them easier to analyze and visualize.

Suppose we have n individuals and measure m variables for each. Each individual’s measurements form an m-dimensional vector. For example, data collected from five individuals might look like this:

Name A B C D E
Age 24 50 17 35 65
Height (cm) 152 175 160 170 155
IQ 108 102 95 97 87

Here, n = 5 and m = 3. Each individual's data can be written as a vector, for instance: x₁ = [24, 152, 108]ᵗ.

PCA helps answer questions such as:

  1. Which variables are correlated?
  2. Can we visualize this high-dimensional data more easily?
  3. Which variables contribute most to the variation in the dataset?

Linear Transformations

Multiplying a matrix by a vector results in a linear transformation of that vector. This operation is key in PCA and is defined as: Av = w.
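
For example, with NumPy (the numbers are chosen purely for illustration):

import numpy as np

A = np.array([[2, 0],
              [1, 3]])
v = np.array([1, 2])
w = A @ v        # the linear transformation defined by A, applied to v
print(w)         # [2 7]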

Eigenvectors and Eigenvalues

An eigenvector v of a matrix A satisfies Av = λv, where λ is the eigenvalue. Transforming an eigenvector by A only scales it by λ; its direction does not change. In PCA, the eigenvectors of the covariance matrix indicate the directions along which the data varies.
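
A quick NumPy check of this defining property, on a small symmetric matrix chosen for illustration:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eig(A)

v = eigvecs[:, 0]                           # first eigenvector (a column of eigvecs)
print(np.allclose(A @ v, eigvals[0] * v))   # True: A v equals lambda v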

Spectral Theorem

For symmetric matrices, the spectral theorem ensures real eigenvalues and orthogonal eigenvectors. This property is fundamental to PCA since the covariance matrix is symmetric.

Covariance Matrix

The covariance matrix captures the variance and correlation of the dataset's variables. Its entries Sₖₗ represent the covariance between variables k and l. Diagonal entries are variances; off-diagonal entries are covariances.
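
With NumPy, the covariance matrix of the small age/height/IQ table above can be obtained directly (np.cov treats each row as a variable by default):

import numpy as np

# Rows are the variables (Age, Height, IQ); columns are the five individuals
X = np.array([[ 24,  50,  17,  35,  65],
              [152, 175, 160, 170, 155],
              [108, 102,  95,  97,  87]], dtype=float)

S = np.cov(X)
print(S)    # diagonal entries are variances; off-diagonal entries are covariances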

Steps in PCA

  1. Organize the dataset into an m × n matrix where each column is a sample.
  2. Subtract the mean of each variable from the dataset (mean centering).
  3. Compute the covariance matrix S = (1/(n−1)) · BBᵗ, where B is the mean-centered matrix.
  4. Apply the spectral theorem to get eigenvalues and eigenvectors.
  5. Select the top k eigenvectors based on the highest eigenvalues. These are the principal components.

Dimensionality Reduction

By projecting the data onto the first k principal components, we reduce the dimensions while retaining most of the dataset's variance. This simplifies analysis and visualization.
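
Putting the steps together, here is a minimal NumPy sketch on the same age/height/IQ example; the variable names (B, S, W, Z) and the choice k = 2 are ours, purely for illustration:

import numpy as np

# Step 1: organize the data as an m x n matrix (each column is one individual)
X = np.array([[ 24,  50,  17,  35,  65],    # Age
              [152, 175, 160, 170, 155],    # Height (cm)
              [108, 102,  95,  97,  87]],   # IQ
             dtype=float)
m, n = X.shape

# Step 2: mean-center each variable (each row)
B = X - X.mean(axis=1, keepdims=True)

# Step 3: covariance matrix S = (1/(n-1)) * B Bᵗ  (m x m and symmetric)
S = B @ B.T / (n - 1)

# Step 4: eigendecomposition; eigh is NumPy's routine for symmetric matrices
eigvals, eigvecs = np.linalg.eigh(S)

# Step 5: sort by decreasing eigenvalue and keep the top k principal components
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
W = eigvecs[:, :k]    # m x k matrix whose columns are the principal components
Z = W.T @ B           # k x n projection: the data expressed in the reduced space

print(eigvals / eigvals.sum())   # proportion of variance explained by each component
print(Z)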

Interpreting Eigenvalues

  • Each eigenvalue indicates the variance captured by its corresponding eigenvector.
  • The sum of all eigenvalues is the total variance of the dataset.
  • The ratio λᵢ / (λ₁ + λ₂ + ... + λₘ) shows the proportion of variance explained by the i-th component.

Applications of PCA

  • Data visualization
  • Noise reduction
  • Face recognition (eigenfaces)
  • Genomics and bioinformatics
  • Market segmentation

In face recognition, PCA can reduce high-dimensional image data to a small number of significant components (eigenfaces), allowing for efficient and accurate identification based on stored components.

In the next article, we will explore gradient descent, an optimization technique commonly used in machine learning.

Prerequisites of linear algebra for machine learning

Just about everyone has watched animated movies such as Frozen or Big Hero 6, or has at least heard about 3D computer games. It seems more fun to enjoy the movies and games than to read a linear algebra book. But it is because of linear algebra that we are able to watch a character move on the screen. Linear algebra is the foundation of our new digital world.

Through this article, we will learn matrix arithmetic and how to use NumPy to carry out these operations in Python.

Why We Need Linear Algebra for Machine Learning

Machine learning involves handling enormous datasets. An effective way to represent this data is in the form of 2D arrays or rectangular blocks, where each row represents a sample and each column represents a feature. It's natural to view this array as a matrix and each column as a vector.

Python and Linear Algebra

NumPy is a Python library used for scientific computing. It provides multidimensional arrays and tools to work with them.

Matrices

A matrix is a rectangular array of numbers arranged in rows and columns. For example:

[1 2 3]
[100 -3 1.15]

This is a 2×3 matrix. A general m×n matrix is denoted A = (aᵢⱼ), with i = 1, …, m and j = 1, …, n.

Creating Matrices with NumPy

import numpy as np
A = np.array([[1, 2, 2], [3, 2, 1]])
print(A)

Matrix Shape

A.shape  # Output: (2, 3)

Identity Matrix

A = np.eye(3)
print(A)

Matrix Operations

Addition

Matrix addition is entry-wise and only defined for matrices of the same dimension.

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
C = np.add(A, B)
print(C)

Transpose

A = np.array([[1, 2, 3], [4, 5, 6]])
T = A.T
print(T)

Multiplication

To multiply matrices A (m×n) and B (n×q), the number of columns of A must match the number of rows of B. The result is an m×q matrix.

A = np.array([[1, 2], [3, 4]])
B = np.array([[2, 0], [1, 2]])
C = np.dot(A, B)
print(C)

Inverse

A = np.array([[2, 3], [4, 5]])
A_inv = np.linalg.inv(A)
print(A_inv)

Vectors

A vector is a one-dimensional array or a matrix with a single column. Example of a 3-dimensional vector:

v = np.array([[1], [2], [3]])

Vector Addition

v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])
v_sum = np.add(v1, v2)
print(v_sum)

Dot Product

v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])
dot_product = np.dot(v1, v2)
print(dot_product)

Conclusion

Matrix arithmetic is a core component of linear algebra and is essential for many machine learning techniques. In the next article, we will explore how matrix operations are applied in Principal Component Analysis (PCA), a method for identifying patterns in data.