
Complete Python Guide To Deep Imbalanced Regression

Deep Imbalanced Regression, DIR, helps effectively perform regression tasks in deep learning models with imbalanced regression data



Imbalance in data is a common phenomenon in real-world applications. Classification problems with class imbalance are well studied, and there are quite a few approaches to handling them, such as reweighting, biased sampling and meta-learning. Non-uniformity and imbalance occur in regression problems as well, but the issues they cause are far less addressed. Moreover, deep learning is more heavily affected by imbalanced continuous targets (regression) than by imbalanced categorical targets (classification).

An ideally balanced classification problem has an equal number of examples for each class. Similarly, an ideally balanced regression problem has its target variable uniformly distributed across its range. In practice, however, target values are abundant in some regions and scarce in others. To address this issue, Yuzhe Yang, Kaiwen Zha, Ying-Cong Chen and Dina Katabi from the Massachusetts Institute of Technology, along with Hao Wang from Rutgers University, introduced Deep Imbalanced Regression (DIR) to effectively perform regression tasks in deep learning models with imbalanced regression data.

Imbalances in a continuous target variable (source)

How does Deep Imbalanced Regression work?

Deep Imbalanced Regression (DIR) enables a deep learning model to learn continuous targets from real-world imbalanced datasets. DIR is realized through two techniques:

  1. Label Distribution Smoothing
  2. Feature Distribution Smoothing

DIR incorporates a kernel distribution function that leverages the similarity between nearby target values to smooth the distribution of either the labels or the features, as sketched below.
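To make the label side concrete, here is a minimal NumPy/SciPy sketch of label distribution smoothing: the empirical label histogram is convolved with a Gaussian kernel, and the inverse of the smoothed density is then used to reweight samples. The bin count, kernel width and gamma-distributed toy labels are illustrative assumptions; the official implementation uses a windowed kernel controlled by the --lds_ks and --lds_sigma flags shown later.

import numpy as np
from scipy.ndimage import gaussian_filter1d

def lds_weights(labels, n_bins=100, sigma=2.0):
    # Empirical label distribution over equal-width bins
    hist, edges = np.histogram(labels, bins=n_bins)
    # Convolving with a symmetric Gaussian kernel gives the "effective" density
    smoothed = gaussian_filter1d(hist.astype(float), sigma=sigma)
    # Map each sample to its bin and weight it by the inverse effective density
    bin_idx = np.digitize(labels, edges[1:-1])
    weights = 1.0 / np.maximum(smoothed[bin_idx], 1e-8)
    return weights / weights.mean()  # normalize so the mean weight is 1

# Toy example: a heavily skewed continuous target (e.g., age)
ages = np.random.gamma(shape=2.0, scale=15.0, size=10_000)
sample_weights = lds_weights(ages)  # multiply per-sample losses by these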

Label Distribution Smoothing converts a discontinuous label distribution into a continuous label distribution (source)
Feature Distribution Smoothing computes the mean and covariance of features and smooths them with a kernel function (source)

The feature distribution smoothing approach is trainable and easy to incorporate. An encoder produces the latent feature representations, and the mean and covariance of these representations are computed for each target bin. An exponential moving average (EMA) tracks the feature statistics as training proceeds, and a kernel (k) smooths them across neighboring target bins according to the imbalance found in the target.

The EMA ensures that the shape of the original feature distribution is retained. The smoothed feature representations are fed back into the neural network, so training proceeds as usual except for an additional pass through the feature smoothing layer, sketched below.
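The following is a minimal sketch of the FDS calibration step, simplified to per-bin means and diagonal variances computed in one shot; in the official implementation FDS is a trainable layer whose statistics are tracked with a momentum-based EMA across epochs. The bin count, kernel width and the diagonal-covariance simplification are assumptions made for illustration.

import numpy as np
from scipy.ndimage import gaussian_filter1d

def fds_calibrate(features, bin_idx, n_bins, sigma=2.0, eps=1e-8):
    # Per-bin feature statistics (diagonal covariance for simplicity)
    d = features.shape[1]
    mu = np.zeros((n_bins, d))
    var = np.ones((n_bins, d))
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            mu[b] = features[mask].mean(axis=0)
            var[b] = features[mask].var(axis=0) + eps
    # Smooth the statistics across neighboring target bins with a kernel
    mu_s = gaussian_filter1d(mu, sigma=sigma, axis=0)
    var_s = gaussian_filter1d(var, sigma=sigma, axis=0)
    # Whiten each feature with its own bin's statistics,
    # then re-color it with the smoothed statistics
    f = (features - mu[bin_idx]) / np.sqrt(var[bin_idx])
    return f * np.sqrt(var_s[bin_idx]) + mu_s[bin_idx]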

Benchmarking of Deep Imbalanced Regression

The following five well-known datasets, drawn from computer vision, healthcare and natural language processing, are processed with DIR to curate benchmarks. All of them have continuous target variables that are highly imbalanced.

  1. IMDB-WIKI-DIR (curated from IMDB-WIKI dataset with age as the target)
  2. AgeDB-DIR (curated from AgeDB dataset with age as the target)
  3. STS-B-DIR (curated from Semantic Textual Similarity Benchmark with text similarity score as the target)
  4. NYUD2-DIR (curated from NYU Depth Dataset V2 with depth as the target)
  5. SHHS-DIR (curated from SHHS dataset with health condition score as the target)
Imbalances found in the target variables of the original datasets (source)

The above datasets are split into train, validation and test sets and are ready to use with suitable deep learning architectures. The Deep Imbalanced Regression benchmarks are built on top of them: a ResNet-50 baseline is used for the IMDB-WIKI-DIR and AgeDB-DIR datasets, a BiLSTM with GloVe word embeddings models the STS-B-DIR dataset, a ResNet-50-based encoder-decoder architecture models the NYUD2-DIR dataset, and a CNN-RNN architecture with a ResNet block models the SHHS-DIR dataset. Models trained with DIR clearly outperform the corresponding vanilla baselines.
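As a reference point, the kind of regression baseline used for IMDB-WIKI-DIR can be sketched by giving a ResNet-50 a single-output head. Note that this uses torchvision's stock ResNet-50 purely for illustration; the official repository ships its own resnet50 definition with the FDS module wired in.

import torch.nn as nn
from torchvision import models

# ResNet-50 backbone with a one-dimensional regression head (e.g., age)
model = models.resnet50(pretrained=False)
model.fc = nn.Linear(model.fc.in_features, 1)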

Deep Imbalanced Regression on IMDB-WIKI dataset

The requirements are PyTorch 1.6, tensorboard_logger, NumPy, pandas, SciPy, tqdm, matplotlib, PIL, and wget. The official training scripts also assume a CUDA-enabled runtime with four GPUs.

Install wget and tensorboard_logger using the following command.

!pip install wget tensorboard_logger


Download the source files that lead to dataset preparation, model building, end-to-end training and evaluation.

# Download the source code
!git clone https://github.com/YyzHarry/imbalanced-regression.git


Check the contents of the source file.

!ls -p imbalanced-regression/


Change the current directory to the imdb-wiki-dir directory to proceed further.

%cd /content/imbalanced-regression/imdb-wiki-dir/

Download and preprocess the original IMDB-WIKI data using the following command.

!python download_imdb_wiki.py


Train the vanilla architecture for Deep Imbalanced Regression without reweighting using the following command.

%%bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py \
--data_dir /content/imbalanced-regression/imdb-wiki-dir/data/ \
--reweight none

Inverse reweighting can be enabled during training with the following command.

%%bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py \
--data_dir /content/imbalanced-regression/imdb-wiki-dir/data/ \
--reweight inverse

Alternatively, square-root-inverse reweighting can be enabled with the following command.

%%bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py \
--data_dir /content/imbalanced-regression/imdb-wiki-dir/data/ \
--reweight sqrt_inv

Enable complete training of Deep Imbalanced Regression with Label Distribution Smoothing (LDS) and Feature Distribution Smoothing (FDS) using the following command.

%%bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py \
--data_dir /content/imbalanced-regression/imdb-wiki-dir/data/ \
--reweight sqrt_inv \
--lds \
--lds_kernel gaussian \
--lds_ks 5 \
--lds_sigma 2 \
--fds \
--fds_kernel gaussian \
--fds_ks 5 \
--fds_sigma 2

Users can also opt for pre-trained models; the official checkpoints are linked from the project's GitHub repository.
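Loading one of these checkpoints follows the usual PyTorch pattern. The filename below is hypothetical, and the 'state_dict' key is an assumption about how train.py packages its checkpoints.

import torch

# Hypothetical checkpoint filename; adjust to the downloaded file
ckpt = torch.load("imdb_wiki_best.pth.tar", map_location="cpu")
model.load_state_dict(ckpt["state_dict"])  # assumes weights under 'state_dict'
model.eval()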

Dataset preparation and training guidelines for each benchmark dataset are available in the corresponding subdirectories of the official repository. The benchmarking results are summarized below:

IMDB-WIKI-DIR benchmarking results with different loss and training strategies (source)
AgeDB-DIR benchmarking results with different loss and training strategies (source)
SHHS-DIR benchmarking results with different loss and training strategies (source)
STS-B-DIR benchmarking results with different loss and training strategies (source)
NYUD2-DIR benchmarking results with different loss and training strategies (source)

It is observed that label distribution smoothing and feature distribution smoothing yield the best results when applied together.

Wrapping Up

This article discussed the newly introduced Deep Imbalanced Regression (DIR), which handles datasets with highly imbalanced continuous target variables. We covered the label distribution smoothing and feature distribution smoothing approaches and the concepts behind them, along with the benchmark datasets and architectures that employ the two smoothing approaches. Finally, we explored the PyTorch implementation of dataset preparation and end-to-end training on the IMDB-WIKI-DIR dataset.
