Balancing Datasets in Binary Classification

Marcos Gois
4 min read · Feb 28, 2023


An Overview of Techniques and Best Practices

Imbalanced classes use sampling my friend

Balancing the number of samples in a binary classification problem is important because an imbalanced dataset can produce a biased model. An imbalanced dataset is one where one class has significantly more samples than the other; in a binary classification problem, the resulting model tends to predict the majority class accurately while performing poorly on the minority class.
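To see why this bias matters, consider a small numeric sketch (the numbers below are illustrative, not from any real dataset): a classifier that always predicts the majority class scores high overall accuracy while never detecting the minority class.

```python
import numpy as np

# Hypothetical imbalanced labels: 95 samples of class 0, 5 of class 1.
y_true = np.array([0] * 95 + [1] * 5)

# A degenerate "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

# Overall accuracy looks strong...
accuracy = float((y_pred == y_true).mean())          # 0.95

# ...but the minority class is never detected.
minority_recall = float((y_pred[y_true == 1] == 1).mean())  # 0.0
```

This is exactly the trap a balanced dataset is meant to avoid.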

Distribution of class

One common approach to balancing a dataset is through random under-sampling and random over-sampling. Random under-sampling involves reducing the number of samples in the majority class to match the number of samples in the minority class. On the other hand, random over-sampling involves increasing the number of samples in the minority class to match the number of samples in the majority class.
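Both strategies can be sketched in plain NumPy; the 90/10 split below is a made-up example:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)   # made-up labels: 90 majority, 10 minority

maj_idx = np.flatnonzero(y == 0)
min_idx = np.flatnonzero(y == 1)

# Random under-sampling: shrink the majority class (without replacement)
# down to the minority size -> 10 + 10 = 20 samples.
under_idx = np.concatenate([
    rng.choice(maj_idx, size=min_idx.size, replace=False),
    min_idx,
])

# Random over-sampling: grow the minority class (WITH replacement, since
# samples must repeat) up to the majority size -> 90 + 90 = 180 samples.
over_idx = np.concatenate([
    maj_idx,
    rng.choice(min_idx, size=maj_idx.size, replace=True),
])
```

Under-sampling discards majority samples; over-sampling duplicates minority samples, which can encourage overfitting on the repeated rows.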

In this article, we will focus on random under-sampling, which involves reducing the number of samples in the majority class. Random under-sampling can be done in a number of ways, including random selection, stratified sampling, and cluster-based sampling.

Random under-sampling

Random selection involves randomly selecting a subset of samples from the majority class to reduce its size to match the minority class. This approach is simple and straightforward, but it can result in a loss of information.

Stratified sampling involves dividing the majority class into smaller subgroups and selecting a random sample from each subgroup to reduce the size of the majority class. This approach helps to preserve the distribution of the majority class and prevent the loss of important information.
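A minimal pandas sketch of stratified under-sampling, assuming a hypothetical `region` column that defines the subgroups (both the column and the 60/40 split are invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical majority-class dataframe; "region" defines the strata.
majority = pd.DataFrame({
    "region": ["north"] * 60 + ["south"] * 40,
    "value": rng.normal(size=100),
})

target_size = 20                      # desired majority size after under-sampling
frac = target_size / len(majority)    # same fraction from every stratum

# groupby().sample() keeps the 60/40 proportions intact: 12 north, 8 south.
stratified = majority.groupby("region").sample(frac=frac, random_state=0)
```

Because every stratum is sampled at the same rate, the reduced majority class keeps the original subgroup distribution.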

Stratified sampling

Cluster-based sampling involves grouping the majority class into clusters and randomly selecting a subset of clusters to reduce the size of the majority class. This approach helps to preserve the relationships between samples in the majority class and prevent the loss of important information.
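A rough NumPy sketch, assuming the cluster labels have already been assigned (for example by k-means on the feature matrix); the labels below are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assume the majority class was already partitioned into 10 clusters
# (e.g. by k-means on the features); these labels are random stand-ins.
cluster_of = rng.integers(0, 10, size=200)

# Randomly keep 4 of the 10 clusters, and every sample inside them.
kept_clusters = rng.choice(10, size=4, replace=False)
kept_idx = np.flatnonzero(np.isin(cluster_of, kept_clusters))
```

Keeping whole clusters, rather than isolated rows, is what preserves the local structure of the majority class.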

Cluster-based sampling

In this article, we will demonstrate how to balance the number of samples in a binary classification problem using random under-sampling in Python.

First, we will import the necessary libraries:

import numpy as np
import pandas as pd

Next, we will load the breast cancer dataset and convert it into a pandas dataframe:

from sklearn.datasets import load_breast_cancer

# Assuming scikit-learn's built-in breast cancer dataset.
data = load_breast_cancer()
df_breast_cancer = pd.DataFrame(data.data, columns=data.feature_names)
# scikit-learn encodes 0 = malignant, 1 = benign; flip the target so that
# 1 = cancer is the minority class, matching the labels used below.
df_breast_cancer['cancer'] = 1 - data.target

Now, we will get the number of samples with class 1 (cancer) and the indexes of those samples:

num_cancer = df_breast_cancer.cancer.value_counts()[1]
indexs_cancer = np.array(df_breast_cancer[df_breast_cancer.cancer == 1].index)

Next, we will get the indexes of the samples with class 0 (not cancer):

indexs_not_cancer = np.array(df_breast_cancer[df_breast_cancer.cancer == 0].index)

Now, we will select the same number of random samples with class 0 (not cancer) as there are samples with class 1 (cancer):

indexs_not_cancer_balanced = np.random.choice(indexs_not_cancer,
                                              num_cancer,
                                              replace=False)

We will combine the indexes of both classes and create a new dataframe with equal number of samples from both classes:

indexs_cancer_balanced = np.concatenate([indexs_cancer,
                                         indexs_not_cancer_balanced],
                                        axis=None)
# use .loc because these are index labels, not positional offsets
df_breast_cancer = df_breast_cancer.loc[indexs_cancer_balanced, :]
Class balanced

Now that we have balanced the number of samples in our dataset, we can build a binary classification model and evaluate its performance. By balancing the number of samples, we can ensure that our model is not biased towards one class and is able to accurately predict both classes.
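As a next step, here is one possible end-to-end sketch; both the use of scikit-learn's built-in breast cancer dataset and the choice of logistic regression are my assumptions for illustration, not prescribed by the article:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load and balance, as above (1 = cancer is the minority class here).
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["cancer"] = 1 - data.target      # flip so 1 = malignant (cancer)

num_cancer = int((df.cancer == 1).sum())
idx_cancer = df[df.cancer == 1].index.to_numpy()
idx_not = np.random.default_rng(0).choice(
    df[df.cancer == 0].index.to_numpy(), num_cancer, replace=False)
balanced = df.loc[np.concatenate([idx_cancer, idx_not])]

# Train and evaluate a simple classifier on the balanced data.
X, y = balanced.drop(columns="cancer"), balanced["cancer"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)
score = LogisticRegression(max_iter=5000).fit(X_train, y_train).score(X_test, y_test)
```

With equal class counts, accuracy on the held-out split now reflects performance on both classes rather than just the majority.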

Balancing the number of samples in a binary classification problem is essential for building a fair and accurate model. Random under-sampling is one way to achieve this, and it can be done through random selection, stratified sampling, or cluster-based sampling, depending on how much of the majority class's structure you need to preserve.

See more applications

If you want to see an application using random under-sampling, stratified sampling, and model construction/training, click here.



Marcos Gois

Full stack developer | Data Science | Machine Learning | Python | C# | SQL