Complete Python Guide To Deep Imbalanced Regression

Deep Imbalanced Regression (DIR) helps deep learning models perform regression effectively on data with imbalanced continuous targets.

Published on May 29, 2021

by Rajkumar Lakshmanamoorthy

Imbalance in data is a common phenomenon in real-world applications. Classification problems with class imbalance are well studied, and there are quite a few approaches to handle them, such as reweighting, biased sampling and meta-learning. Nonuniformity and imbalance also occur in regression problems, but the issues they cause have received less attention. Deep learning is more heavily affected by imbalanced continuous targets (regression) than by imbalanced categorical targets (classification).

An ideally balanced classification problem has an equal number of examples for each class. Similarly, an ideally balanced regression problem has a target variable that is uniformly distributed over its range. In practice, however, target values are abundant in some regions and scarce in others. To address this issue, Yuzhe Yang, Kaiwen Zha, Ying-Cong Chen and Dina Katabi from the Massachusetts Institute of Technology, and Hao Wang from Rutgers University, have introduced Deep Imbalanced Regression (DIR) to perform regression effectively in deep learning models trained on imbalanced data.

Imbalances in a continuous target variable (source)

How does Deep Imbalanced Regression work?

Deep Imbalanced Regression (DIR) lets a deep learning model learn continuous targets from imbalanced real-world datasets. DIR relies on two techniques:

Label Distribution Smoothing
Feature Distribution Smoothing

DIR uses a kernel function that exploits the similarity between neighbouring target values to smooth the distribution of either the labels or the features.

Label Distribution Smoothing converts a discontinuous label distribution into a continuous one (source)
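The idea behind label distribution smoothing can be illustrated with a few lines of NumPy. The following is a minimal sketch, not the authors' implementation: it assumes integer-valued targets that can be used directly as bin indices, and the helper names `gaussian_window` and `smoothed_label_density` are hypothetical.

```python
import numpy as np

def gaussian_window(ks=5, sigma=2.0):
    """Symmetric Gaussian kernel with an odd size ks, normalised to sum to 1."""
    half = (ks - 1) // 2
    offsets = np.arange(-half, half + 1)
    window = np.exp(-0.5 * (offsets / sigma) ** 2)
    return window / window.sum()

def smoothed_label_density(labels, num_bins, ks=5, sigma=2.0):
    """Return the raw per-bin label counts and their kernel-smoothed version."""
    counts = np.bincount(labels, minlength=num_bins).astype(float)
    # The kernel is symmetric, so convolution spreads each bin's count
    # evenly onto its neighbours on both sides.
    smoothed = np.convolve(counts, gaussian_window(ks, sigma), mode="same")
    return counts, smoothed

# Toy integer targets: value 3 is common, value 5 never appears.
labels = np.array([3] * 50 + [4] * 10 + [6] * 2)
counts, smoothed = smoothed_label_density(labels, num_bins=10)
print("raw counts:     ", counts)
print("smoothed counts:", np.round(smoothed, 2))
```

In this toy example the empty bin at 5 receives a non-zero smoothed density because its neighbours contain samples, which is the effect the kernel is meant to capture.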

Feature Distribution Smoothing computes the mean and covariance of the features and smooths them with a kernel function (source)

The feature distribution smoothing approach is trainable and easy to incorporate. An encoder produces the latent feature representations, and the mean and covariance of these features are then calculated for each target bin. An exponential moving average (EMA) tracks these feature statistics as training proceeds, and a kernel (k) smooths them across neighbouring target values according to the imbalance found in the target.

EMA ensures that the shape of the original distribution of features is retained. The smoothed feature representations are fed back into the neural network. Training otherwise proceeds as usual, apart from the additional pass through the feature smoothing layer.
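The steps described above can be sketched in NumPy. This is a simplified illustration under several assumptions, not the authors' code: it uses per-dimension variances instead of full covariances, a fixed three-tap kernel, reflect padding at the ends of the target range, and a momentum value of 0.9. All function names are hypothetical.

```python
import numpy as np

def ema_update(running, current, momentum=0.9):
    """Exponential moving average of a statistic across training epochs."""
    return momentum * running + (1.0 - momentum) * current

def kernel_smooth(stats, window):
    """Smooth per-bin statistics of shape (num_bins, dim) along the bin axis."""
    half = len(window) // 2
    padded = np.pad(stats, ((half, half), (0, 0)), mode="reflect")
    return np.stack([
        (padded[b:b + len(window)] * window[:, None]).sum(axis=0)
        for b in range(stats.shape[0])
    ])

def calibrate(features, bins, mean, var, smoothed_mean, smoothed_var, eps=1e-8):
    """Whiten with the raw bin statistics, then re-colour with the smoothed ones."""
    scale = np.sqrt((smoothed_var[bins] + eps) / (var[bins] + eps))
    return (features - mean[bins]) * scale + smoothed_mean[bins]

# Toy encoder outputs: six target bins with very different sample counts.
rng = np.random.default_rng(0)
num_bins, dim = 6, 4
bins = np.repeat(np.arange(num_bins), [40, 30, 20, 5, 3, 2])
features = rng.normal(loc=bins[:, None], scale=1.0, size=(len(bins), dim))

# Per-bin feature statistics.
mean = np.stack([features[bins == b].mean(axis=0) for b in range(num_bins)])
var = np.stack([features[bins == b].var(axis=0) for b in range(num_bins)])

# Smooth the statistics across neighbouring bins and calibrate the features.
window = np.array([0.25, 0.5, 0.25])
smoothed_mean = kernel_smooth(mean, window)
smoothed_var = kernel_smooth(var, window)
calibrated = calibrate(features, bins, mean, var, smoothed_mean, smoothed_var)
```

In a training loop, `ema_update` would be applied to `mean` and `var` once per epoch so that the running statistics change gradually rather than jumping with every new estimate.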

Benchmarking of Deep Imbalanced Regression

The following five well-known datasets are processed to curate the DIR benchmarks. They span computer vision, healthcare, and natural language processing, and each has a highly imbalanced continuous target variable.

IMDB-WIKI-DIR (curated from IMDB-WIKI dataset with age as the target)
AgeDB-DIR (curated from AgeDB dataset with age as the target)
STS-B-DIR (curated from Semantic Textual Similarity Benchmark with text similarity score as the target)
NYUD2-DIR (curated from NYU Depth Dataset V2 with depth as the target)
SHHS-DIR (curated from SHHS dataset with health condition score as the target)

Imbalances found in features of the original datasets (source)

The above datasets are split and provided as train, validation and test sets, ready to use with suitable deep learning architectures. The DIR benchmarks are built on these datasets. A ResNet50 baseline is used for the IMDB-WIKI-DIR and AgeDB-DIR datasets. A BiLSTM with GloVe word embeddings is the baseline for the STS-B-DIR dataset. A ResNet50-based encoder-decoder architecture is used for the NYUD2-DIR dataset. A CNN-RNN architecture with ResNet blocks is used for the SHHS-DIR dataset. According to the authors, models trained with DIR outperform the corresponding vanilla baselines on these datasets.

Deep Imbalanced Regression on IMDB-WIKI dataset

The requirements are PyTorch 1.6, tensorboard_logger, NumPy, pandas, scipy, tqdm, matplotlib, PIL, and wget. In addition, training requires a CUDA-enabled GPU runtime; the training commands below use four GPUs.
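Before installing anything, it can help to check which of the listed packages are already importable in the current runtime. The following is a small sketch using only the standard library. The import names in the list are assumed to match the packages named above (for example, PyTorch is imported as `torch`), and the check does not verify versions.

```python
from importlib.util import find_spec

# Import names assumed for the requirements listed above.
REQUIRED = [
    "torch", "tensorboard_logger", "numpy", "pandas", "scipy",
    "tqdm", "matplotlib", "PIL", "wget",
]

def missing_packages(names=REQUIRED):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if find_spec(name) is None]

print("Missing packages:", missing_packages() or "none")
```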

Install wget and tensorboard_logger using the following command.

!pip install wget tensorboard_logger

Download the source code for dataset preparation, model building, end-to-end training and evaluation.

 # Download source code
 !git clone https://github.com/YyzHarry/imbalanced-regression.git

Check the contents of the source directory.

!ls -p imbalanced-regression/

Change the current directory to the imdb-wiki-dir directory to proceed further.

%cd /content/imbalanced-regression/imdb-wiki-dir/

Download and preprocess the original IMDB-WIKI data using the following command.

!python download_imdb_wiki.py

Train the vanilla architecture for Deep Imbalanced Regression without reweighting using the following command.

%%bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py \
--data_dir /content/imbalanced-regression/imdb-wiki-dir/data/ \
--reweight none

Inverse reweighting can be enabled during training by running the following command.

%%bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py \
--data_dir /content/imbalanced-regression/imdb-wiki-dir/data/ \
--reweight inverse

Alternatively, square-root-inverse reweighting can be enabled during training by running the following command.

%%bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py \
--data_dir /content/imbalanced-regression/imdb-wiki-dir/data/ \
--reweight sqrt_inv
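To make the difference between the two reweighting options concrete, the following sketch computes per-sample loss weights from label frequencies. It is an illustration under assumptions, not the repository's code: targets are assumed to be integer bin indices, the function name `sample_weights` is hypothetical, and rescaling the weights to a mean of 1 is a design choice made here for readability.

```python
import numpy as np

def sample_weights(labels, num_bins, scheme="inverse"):
    """Per-sample weights that are larger for rarer target values."""
    counts = np.bincount(labels, minlength=num_bins).astype(float)
    if scheme == "sqrt_inv":
        counts = np.sqrt(counts)  # dampens the correction for rare values
    elif scheme != "inverse":
        raise ValueError(f"unknown scheme: {scheme}")
    per_sample = 1.0 / counts[labels]
    # Rescale so that the average weight is 1.
    return per_sample * len(per_sample) / per_sample.sum()

# Toy targets: bin 0 has 90 samples, bin 1 has 10.
labels = np.array([0] * 90 + [1] * 10)
w_inv = sample_weights(labels, num_bins=2, scheme="inverse")
w_sqrt = sample_weights(labels, num_bins=2, scheme="sqrt_inv")
print("inverse  rare/common weight ratio:", w_inv[-1] / w_inv[0])
print("sqrt_inv rare/common weight ratio:", w_sqrt[-1] / w_sqrt[0])
```

With a 90-to-10 split, inverse weighting makes a rare sample count nine times as much as a common one, while the square-root variant reduces that ratio to three.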

Enable complete training of Deep Imbalanced Regression with Label Distribution Smoothing (LDS) and Feature Distribution Smoothing (FDS) using the following command.

%%bash
python train.py \
--reweight sqrt_inv \
--lds \
--lds_kernel gaussian \
--lds_ks 5 \
--lds_sigma 2 \
--fds \
--fds_kernel gaussian \
--fds_ks 5 \
--fds_sigma 2
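As a rough guide to what the reweighting and LDS flags in this command control, the sketch below combines square-root-inverse reweighting with a Gaussian kernel. It rests on assumptions rather than on the repository's code: `ks` is read as the kernel size in bins, `sigma` as the kernel's standard deviation, and the square root is applied before smoothing. The function name is hypothetical.

```python
import numpy as np

def lds_sqrt_inv_weights(labels, num_bins, ks=5, sigma=2.0):
    """Sample weights from a square-rooted, kernel-smoothed label density."""
    counts = np.sqrt(np.bincount(labels, minlength=num_bins).astype(float))
    # Gaussian window of ks taps centred on zero, normalised to sum to 1.
    offsets = np.arange(ks) - (ks - 1) // 2
    window = np.exp(-0.5 * (offsets / sigma) ** 2)
    window /= window.sum()
    density = np.convolve(counts, window, mode="same")
    weights = 1.0 / density[labels]
    # Rescale so that the average weight is 1.
    return weights * len(weights) / weights.sum()

# Toy targets: bins 2 and 3 are populated neighbours, bin 7 is isolated and rare.
labels = np.array([2] * 80 + [3] * 15 + [7] * 5)
weights = lds_sqrt_inv_weights(labels, num_bins=10)
print("weight for a sample in bin 2:", round(weights[0], 3))
print("weight for a sample in bin 7:", round(weights[-1], 3))
```

A larger `ks` or `sigma` spreads each bin's mass over more neighbours, so the resulting weights vary more gradually along the target range.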

Users can also opt for pre-trained models. The pre-trained checkpoints are linked from the official repository.

Dataset preparation and training guidelines for each benchmark dataset are available in the corresponding subdirectory of the official repository.

IMDB-WIKI-DIR benchmarking results with different loss and training strategies (source)

STS-B-DIR benchmarking results with different loss and training strategies (source)

The benchmarks show that label distribution smoothing and feature distribution smoothing give the best results when applied together.

Wrapping Up

This article discussed the recently introduced Deep Imbalanced Regression (DIR), which handles datasets with highly imbalanced continuous target variables. We covered the label distribution smoothing and feature distribution smoothing approaches and the concepts behind them, along with the benchmark datasets and architectures that employ the two approaches. Finally, we explored the PyTorch implementation for dataset preparation and end-to-end training on the IMDB-WIKI-DIR dataset.