ELECTRA achieves state-of-the-art performance in language representation learning, outperforming the previous leaders RoBERTa, ALBERT and XLNet. At the same time, ELECTRA requires considerably less compute than these earlier language representation learning methods.
Representation learning methods in language modeling such as BERT and its variants follow the Masked Language Modeling (MLM) pre-training method. In this approach, a subset of around 15% of the input tokens is masked before being fed into the model. In attention models such as XLNet, attention to those tokens is masked to hide their identity. Denoising-autoencoder-like networks are trained on these inputs to recover the original tokens. These approaches behave like generators whose objective is to reconstruct the original tokens from the noisy masked input. Because these models learn only from the roughly 15% of tokens that are masked, they require enormous computing power and time.
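The MLM-style corruption described above can be sketched in a few lines of Python. This is an illustrative simplification (real implementations work on subword IDs and also sometimes keep or randomly swap masked positions), assuming a hypothetical `mask_tokens` helper:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Randomly replace ~15% of tokens with a mask token (MLM-style corruption).

    The MLM objective trains the model to recover the original tokens ONLY at
    the masked positions, which is why most of each example goes unused.
    """
    masked = list(tokens)
    # choose positions to corrupt (at least one)
    positions = random.sample(range(len(tokens)), max(1, int(len(tokens) * mask_rate)))
    for i in positions:
        masked[i] = mask_token
    return masked, positions

tokens = ["the", "chef", "cooked", "the", "meal", "in", "the", "kitchen"]
masked, positions = mask_tokens(tokens)
```

Here, for an 8-token input, only a single position contributes to the training signal.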
The newly introduced ELECTRA follows a different approach from all of its predecessors in language representation learning. Rather than acting as a generator that reconstructs the original tokens, ELECTRA behaves like a discriminator. In this pre-training method, selected tokens are replaced with plausible tokens that a small masked language model synthetically generates. ELECTRA's objective is to identify the corrupted tokens among all of the input tokens. Thus ELECTRA behaves like a discriminator, while almost every other masked language model behaves like a generator. This model's key advantage is that it learns from all of the input tokens, requiring far less compute power and time.
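The difference in training targets can be made concrete with a small sketch. A hypothetical `replaced_token_labels` helper below builds the per-position labels that ELECTRA's discriminator is trained to predict; note that every position gets a label, not just the ~15% that were corrupted:

```python
def replaced_token_labels(original, corrupted):
    """ELECTRA-style replaced-token-detection targets: 1 where the generator's
    sample differs from the original token, 0 where it matches. Because the
    discriminator predicts a label for EVERY input position, all tokens
    contribute to the loss."""
    return [int(o != c) for o, c in zip(original, corrupted)]

original  = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate",    "the", "meal"]  # generator replaced "cooked"
replaced_token_labels(original, corrupted)  # → [0, 0, 1, 0, 0]
```

In contrast to MLM, where the loss is computed only at masked positions, here all five positions supply a training signal.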
ELECTRA is short for ‘Efficiently Learning an Encoder that Classifies Token Replacements Accurately’. It was introduced by Kevin Clark and Christopher D. Manning of Stanford University and Minh-Thang Luong and Quoc V. Le of Google Brain. Pre-training ELECTRA requires a generator that receives masked inputs, as in BERT, and generates replacement tokens. Once pre-training is complete, the generator is discarded and ELECTRA alone is employed in downstream natural language applications by fine-tuning it per task.
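During pre-training, the generator and discriminator are trained jointly: the generator minimizes the usual MLM loss while the discriminator minimizes the replaced-token-detection loss, and the two are combined with a weighting factor (the paper uses a weight of 50 on the discriminator loss). A minimal sketch of this combined objective, with the function name chosen for illustration:

```python
def electra_pretraining_loss(gen_mlm_loss, disc_loss, disc_weight=50.0):
    """Combined ELECTRA pre-training objective: the generator's MLM loss plus
    the discriminator's replaced-token-detection loss scaled by a weight
    (the paper uses 50). Both networks are trained jointly on this sum."""
    return gen_mlm_loss + disc_weight * disc_loss
```

After pre-training, only the discriminator survives to be fine-tuned, as described above.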
Python Implementation of ELECTRA
Step-1: Create Environment
Requirements are Python 3+, TensorFlow 1.15, NumPy, SciPy and Scikit-Learn. Pre-training and fine-tuning for downstream applications require a GPU runtime environment.
%%bash
pip install tensorflow==1.15
pip install numpy
pip install scipy
pip install scikit-learn
Step-2: Download Source Code
The following command downloads the ELECTRA source code to the local environment.
!git clone https://github.com/google-research/electra.git
Step-3: Create a data directory
Create a new directory to store vocabulary and other data. The commands below refer to its path as $DATA_DIR.
%%bash
cd electra/
mkdir DATA_DIR
To verify that the necessary files were downloaded and the new directory DATA_DIR was created, list the contents:
!ls electra -p
Step-4: Download Vocabulary
The vocabulary used by the ELECTRA model is hosted on Google Cloud Storage. Download the file to the data directory using the following commands.
%%bash
cd electra/
wget -O DATA_DIR/vocab.txt https://storage.googleapis.com/electra-data/vocab.txt
Step-5: Download Corpus for training
Download OpenWebTextCorpus (12 GB) to the data directory as a compressed tar archive.
%%bash
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1EA5V0oetDCOke7afsktL_JDQ-ETtNOvx' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1EA5V0oetDCOke7afsktL_JDQ-ETtNOvx" -O openwebtext.tar.xz && rm -rf /tmp/cookies.txt
mv openwebtext.tar.xz electra/DATA_DIR/
Then extract the downloaded archive using the following command.
%%bash
cd electra/DATA_DIR
tar xf openwebtext.tar.xz
Step-6: Preprocess the data and tokenize
The following command preprocesses the downloaded data, tokenizes it, converts it into TFRecords and saves them in a sub-directory named ‘pretrain_tfrecords’. Run it from inside the electra directory, with $DATA_DIR set to the path of the data directory.
!python3 build_openwebtext_pretraining_dataset.py --data-dir $DATA_DIR --num-processes 5
Step-7: Pre-training ELECTRA’s small model
Pre-training of the small model is performed on the downloaded data using the following command. Note that the process may take more than 4 days on a Tesla V100 GPU.
!python3 run_pretraining.py --data-dir $DATA_DIR --model-name electra_small_owt
Step-8: Evaluation of model
Evaluation is performed simply by rerunning the pre-training Python script with the options changed as shown in the following command.
!python3 run_pretraining.py --data-dir $DATA_DIR --model-name electra_small_owt --hparams '{"do_train": false, "do_eval": true}'
Step-9: Fine-tuning for downstream applications
Fine-tuning can be done on specific tasks. The data corresponding to the task must be downloaded to the data directory in the prescribed tokenized format. This script helps download all GLUE tasks to the data directory at once. The following command fine-tunes the model for the MNLI task. Users can opt for any available task by changing the task_names option.
!python3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_base --hparams '{"model_size": "base", "task_names": ["mnli"]}'
Performance of ELECTRA
The developers evaluated pre-trained ELECTRA on the GLUE benchmark (General Language Understanding Evaluation) and the SQuAD benchmark (Stanford Question Answering Dataset). A variety of tasks covering textual entailment, question-answer entailment, paraphrase, question paraphrase, textual similarity, sentiment and linguistic acceptability were performed with ELECTRA. Pre-training of the base ELECTRA generator-discriminator pair used the same data as BERT: 3.3 billion tokens from Wikipedia and BookCorpus. The large ELECTRA model was trained on the same data as XLNet, which extends BERT's dataset with ClueWeb, CommonCrawl and Gigaword. Different generator and discriminator sizes, as well as different training algorithms, were tried to arrive at the best version of ELECTRA.
On small downstream GLUE tasks with identical device configurations, the base ELECTRA model performs well even when trained on a single GPU, scoring 5 GLUE points higher than a comparable BERT model and even outscoring the much larger GPT and ELMo models.
On the large GLUE tasks, ELECTRA's large model outperforms RoBERTa, ALBERT and XLNet on most tasks, while consuming around a quarter of the compute those models used during their respective pre-training.
Similarly, ELECTRA's large model outperforms all of its predecessors on SQuAD versions 1.1 and 2.0. Increasing the number of training FLOPs improves ELECTRA's performance on both the GLUE and SQuAD benchmarks.
Wrapping Up
ELECTRA is the present state-of-the-art on the GLUE and SQuAD benchmarks. It is a self-supervised language representation learning model that can pre-train transformer networks using relatively little compute. It performs replaced-token detection with the help of a small masked language model acting as a generator. ELECTRA is compute-efficient and works even on a single GPU with limited memory. It yields strong performance in almost all downstream natural language applications. The generator and discriminator networks of the present ELECTRA can be fine-tuned further to arrive at better models in the future.
Note: Images and illustrations other than code outputs are obtained from the original research paper.