Imbalanced data is common in real-world applications. Class imbalance in classification problems is well studied, and there are several established approaches to handle it, such as reweighting, biased sampling and meta-learning. Non-uniformity and imbalance also occur in regression problems, but the issues they cause have received less attention. Deep learning is more heavily affected by imbalanced continuous targets (regression) than by imbalanced categorical targets (classification).
An ideally balanced classification problem has an equal number of examples for each class. Similarly, an ideally balanced regression problem has a target variable that is uniformly distributed over its range. In practice, however, target values are abundant in some regions and scarce in others. To address this issue, Yuzhe Yang, Kaiwen Zha, Ying-Cong Chen and Dina Katabi from the Massachusetts Institute of Technology, and Hao Wang from Rutgers University, introduced Deep Imbalanced Regression (DIR) to perform regression effectively with deep learning models on imbalanced data.
How does Deep Imbalanced Regression work?
Deep Imbalanced Regression (DIR) lets a deep learning model learn continuous targets from imbalanced real-world datasets. DIR relies on two techniques:
- Label Distribution Smoothing
- Feature Distribution Smoothing
DIR uses a kernel function that exploits the similarity between nearby target values to smooth the distribution of either the labels or the features. Label distribution smoothing convolves the empirical label density with the kernel to estimate an effective label density, which can then be used to reweight the loss.
Feature distribution smoothing is trainable and easy to incorporate into an existing network. An encoder produces the latent feature representations, and the mean and covariance of these features are computed for each target bin. An exponential moving average (EMA) tracks these feature statistics over the course of training, and a kernel (k) smooths them across neighbouring target bins according to the imbalance found in the target.
EMA ensures that the shape of the original distribution of features is retained. The smoothed feature representations are passed back to the neural network. Training proceeds as usual, except for an additional pass through the feature smoothing layer.
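The label smoothing idea can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the helper names `gaussian_window` and `lds_weights`, the assumption that labels are already integer bin indices, and the choice to normalise the kernel to sum to 1 are all made up for this example. It builds the empirical label histogram, convolves it with a Gaussian kernel to get a smoothed label density, and uses the inverse of that density as per-sample loss weights.

```python
import numpy as np


def gaussian_window(ks=5, sigma=2.0):
    """Symmetric Gaussian kernel of odd length ks, normalised to sum to 1."""
    half = (ks - 1) // 2
    offsets = np.arange(-half, half + 1)
    window = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    return window / window.sum()


def lds_weights(labels, num_bins, ks=5, sigma=2.0):
    """Hypothetical helper: per-sample weights from a smoothed label density.

    Assumes labels are integer bin indices in [0, num_bins) and num_bins >= ks.
    """
    labels = np.asarray(labels)
    # Empirical label distribution: number of samples in each bin.
    counts = np.bincount(labels, minlength=num_bins).astype(float)
    # Smoothed distribution: each bin borrows mass from its neighbours.
    smoothed = np.convolve(counts, gaussian_window(ks, sigma), mode="same")
    # Inverse-density weights, rescaled so that the mean weight is 1.
    weights = 1.0 / smoothed[labels]
    weights = weights * len(weights) / weights.sum()
    return weights, counts, smoothed


# Toy age labels: ages 20-22 are common, age 60 appears once.
ages = [20] * 50 + [21] * 40 + [22] * 5 + [60] * 1
weights, counts, smoothed = lds_weights(ages, num_bins=100)
```

In this toy example the single sample at age 60 receives a much larger weight than the samples at age 20, while empty bins next to populated ones (such as age 59) get a non-zero smoothed density.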
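The feature smoothing procedure described above can be sketched as follows. This is a simplified NumPy illustration under stated assumptions, not the official layer: all function names are hypothetical, only per-dimension variances are used instead of full covariances, bin edges are padded by repeating the edge value, and empty bins simply keep placeholder statistics (mean 0, variance 1). The sketch computes per-bin feature statistics, tracks them with an EMA, smooths them across neighbouring target bins with a kernel, and then re-centres and re-scales each feature with the smoothed statistics.

```python
import numpy as np


def gaussian_window(ks=5, sigma=2.0):
    """Symmetric Gaussian kernel of odd length ks, normalised to sum to 1."""
    half = (ks - 1) // 2
    offsets = np.arange(-half, half + 1)
    window = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    return window / window.sum()


def bin_statistics(features, bins, num_bins):
    """Per-bin mean and per-dimension variance of the encoder features."""
    dim = features.shape[1]
    mean = np.zeros((num_bins, dim))
    var = np.ones((num_bins, dim))  # placeholder for empty bins
    for b in range(num_bins):
        rows = features[bins == b]
        if len(rows) > 0:
            mean[b] = rows.mean(axis=0)
        if len(rows) > 1:
            var[b] = rows.var(axis=0)
    return mean, var


def ema_update(running, current, momentum=0.9):
    """Exponential moving average of the statistics across epochs."""
    return momentum * running + (1.0 - momentum) * current


def smooth_over_bins(stats, window):
    """Kernel-smooth every feature dimension along the target-bin axis."""
    half = len(window) // 2
    padded = np.pad(stats, ((half, half), (0, 0)), mode="edge")
    return np.stack([
        (padded[b:b + len(window)] * window[:, None]).sum(axis=0)
        for b in range(stats.shape[0])
    ])


def calibrate(features, bins, mean, var, smooth_mean, smooth_var, eps=1e-8):
    """Whiten with the bin's own statistics, re-colour with the smoothed ones."""
    scale = np.sqrt((smooth_var[bins] + eps) / (var[bins] + eps))
    return (features - mean[bins]) * scale + smooth_mean[bins]
```

After calibration, the features of each bin have the smoothed mean and variance of that bin, so bins with few samples borrow statistics from their better-populated neighbours.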
Benchmarking of Deep Imbalanced Regression
The following five well-known datasets are curated for DIR to build benchmarks. They span computer vision, healthcare, and natural language processing, and all have highly imbalanced continuous target variables.
- IMDB-WIKI-DIR (curated from IMDB-WIKI dataset with age as the target)
- AgeDB-DIR (curated from AgeDB dataset with age as the target)
- STS-B-DIR (curated from Semantic Textual Similarity Benchmark with text similarity score as the target)
- NYUD2-DIR (curated from NYU Depth Dataset V2 with depth as the target)
- SHHS-DIR (curated from SHHS dataset with health condition score as the target)
The above datasets are split and provided as train, validation and test sets, ready to use with suitable deep learning architectures. The Deep Imbalanced Regression benchmarks are developed with these datasets. A ResNet50 baseline is used to construct the benchmark architecture on the IMDB-WIKI-DIR and AgeDB-DIR datasets. A BiLSTM + GloVe word embedding baseline is used to model the STS-B-DIR dataset. A ResNet50-based encoder-decoder architecture is used to model the NYUD2-DIR dataset. A CNN-RNN architecture with ResNet blocks is used to model the SHHS-DIR dataset. According to the authors, models trained with DIR outperform the corresponding baseline models on these datasets.
Deep Imbalanced Regression on IMDB-WIKI dataset
The requirements are PyTorch 1.6, tensorboard_logger, NumPy, pandas, scipy, tqdm, matplotlib, PIL, and wget. In addition, training DIR requires a CUDA-enabled GPU runtime; the training commands below use four GPUs.
Install wget and tensorboard_logger using the following command.
!pip install wget tensorboard_logger
Download the source files for dataset preparation, model building, end-to-end training and evaluation.
# Download source code
!git clone https://github.com/YyzHarry/imbalanced-regression.git
List the contents of the cloned repository.
!ls -p imbalanced-regression/
Change the current directory to the imdb-wiki-dir directory to proceed further.
%cd /content/imbalanced-regression/imdb-wiki-dir/
Download and preprocess the original IMDB-WIKI data.
!python download_imdb_wiki.py
Train the vanilla architecture for Deep Imbalanced Regression without reweighting using the following command.
%%bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py \
  --data_dir /content/imbalanced-regression/imdb-wiki-dir/data/ \
  --reweight none
Inverse reweighting can be enabled during training with the following command.
%%bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py \
  --data_dir /content/imbalanced-regression/imdb-wiki-dir/data/ \
  --reweight inverse
Alternatively, square-root-inverse reweighting can be enabled during training with the following command.
%%bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py \
  --data_dir /content/imbalanced-regression/imdb-wiki-dir/data/ \
  --reweight sqrt_inv
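To make the difference between the three reweighting options concrete, here is a minimal pure-Python sketch. The function `reweight` is a hypothetical helper written for this article; it only mirrors the names of the `--reweight` choices and is not the code that `train.py` runs, which may apply additional steps such as clipping the counts. Each sample is weighted by the inverse (or inverse square root) of how often its label occurs, and the weights are rescaled so that they average to 1.

```python
from collections import Counter
from math import sqrt


def reweight(labels, scheme="none"):
    """Hypothetical helper: per-sample loss weights for a list of labels."""
    counts = Counter(labels)
    if scheme == "none":
        raw = [1.0 for _ in labels]
    elif scheme == "inverse":
        # Weight is inversely proportional to the label frequency.
        raw = [1.0 / counts[y] for y in labels]
    elif scheme == "sqrt_inv":
        # Softer variant: inverse of the square root of the frequency.
        raw = [1.0 / sqrt(counts[y]) for y in labels]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    # Rescale so that the average weight is 1.
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


# Toy example: age 30 appears 16 times, age 70 only once.
labels = [30] * 16 + [70]
inverse_weights = reweight(labels, "inverse")
sqrt_inv_weights = reweight(labels, "sqrt_inv")
```

With these toy labels, inverse reweighting makes the rare sample count 16 times as much as a common one, whereas square-root-inverse reweighting reduces that ratio to 4, which is a gentler correction.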
Enable complete training of Deep Imbalanced Regression with Label Distribution Smoothing (LDS) and Feature Distribution Smoothing (FDS) using the following command.
%%bash
python train.py \
  --reweight sqrt_inv \
  --lds \
  --lds_kernel gaussian \
  --lds_ks 5 \
  --lds_sigma 2 \
  --fds \
  --fds_kernel gaussian \
  --fds_ks 5 \
  --fds_sigma 2
Users can also opt for pre-trained models. Pre-trained checkpoints are provided in the official repository.
Dataset preparation and training guidelines for the benchmark datasets are available through the official repository.
The authors report that label distribution smoothing and feature distribution smoothing give the best results when applied together.
Wrapping Up
This article discussed the recently introduced Deep Imbalanced Regression (DIR), which handles datasets with highly imbalanced continuous target variables. We covered the label distribution smoothing and feature distribution smoothing approaches and the concepts behind them, along with the benchmark datasets and architectures that employ the two smoothing approaches. Finally, we walked through the PyTorch code for dataset preparation and end-to-end training on the IMDB-WIKI-DIR dataset.