BERT (language model)
Getting back into looking at machine learning models. BERT and GPT are the two main types of LLMs, according to texts I have read, so I want to read more about each.
References
Related
- ELMo
- ELMo (embeddings from language model) is a word embedding method for representing a sequence of words as a corresponding sequence of vectors. First released in Feb. 2018. It is a bidirectional LSTM which takes character-level inputs and produces word-level embeddings, trained on a corpus of about 30 million sentences and 1 billion words.
- The architecture of ELMo accomplishes a contextual understanding of tokens. ELMo was historically important as a pioneer of self-supervised generative pretraining followed by fine-tuning, where a large model is trained to reproduce a large corpus, then the large model is augmented with additional task-specific weights and fine-tuned on supervised task data.
- BookCorpus
- BookCorpus (also sometimes referred to as the Toronto Book Corpus) is a dataset consisting of the text of around 7,000 self-published books scraped from the indie ebook distribution platform Smashwords. It was the main corpus used to train the initial GPT model by OpenAI, and has been used as training data for other early large language models, including Google's BERT.
- Embedding
- In mathematics, an embedding (or imbedding) is one instance of some mathematical structure contained within another instance, such as a group that is a subgroup. When some object $X$ is said to be embedded in another object $Y$, the embedding is given by some injective and structure-preserving map $f : X \rightarrow Y$.
- Attention
- Attention is a machine learning method that determines the relative importance of each component in a sequence relative to the other components in that sequence. In natural language processing, importance is represented by "soft" weights assigned to each word in a sentence. More generally, attention encodes vectors called token embeddings across a fixed-width sequence that can range from tens to millions of tokens in size.
- Unlike "hard" weights, which are computed during the backwards training pass, "soft" weights exist only in the forward pass and therefore change with every step of the input. Inspired by ideas about attention in humans, the attention mechanism was developed to address the weaknesses of leveraging information from the hidden layers of recurrent neural networks.
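As a concrete illustration of "soft" weights computed in the forward pass, here is a minimal scaled dot-product attention sketch in NumPy; the shapes and the shared Q/K/V matrices are illustrative assumptions, not BERT's exact implementation.

```python
# Minimal sketch of scaled dot-product attention "soft" weights.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Soft weights: each query attends to every key with a weight in [0, 1],
    # recomputed on every forward pass rather than stored as parameters.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores)
    return weights @ V, weights

Q = K = V = np.random.randn(5, 16)   # 5 tokens, 16-dimensional embeddings (assumed)
out, w = attention(Q, K, V)
print(w.sum(axis=-1))                # each row of soft weights sums to 1
```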
- byte pair encoding
- Byte pair encoding (also known as digram coding) is an algorithm, first described in 1994 by Philip Gage, for encoding strings of text into tabular form for use in downstream modeling. A modified version of the algorithm is notable as the basis of large language model tokenizers, with the ability to combine both tokens that encode single characters and tokens that encode whole words. In the first step, this modification takes all unique characters as the initial set of 1-character n-grams (i.e. the initial "tokens"). Then, successively, the most frequent pair of adjacent characters is merged into a new 2-character n-gram, and all instances of the pair are replaced by this new token. This is repeated until a vocabulary of the prescribed size is obtained.
Example:
Suppose the data to be encoded is
aaabdaaabac
The byte pair "aa" occurs most often, so it will be replaced by a byte that is not used in the data, such as "Z". Now there is the following data and replacement table:
ZabdZabac
Z=aa
Then the process is repeated with the byte pair "ab", replacing it with "Y":
ZYdZYac
Y=ab
Z=aa
The only literal byte pair left occurs only once, and the encoding might stop here. Alternatively, the process could continue with recursive byte pair encoding, replacing "ZY" with "X":
XdXac
X=ZY
Y=ab
Z=aa
To decompress the data, simply perform the replacements in reverse order.
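The merge loop can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer; ties between equally frequent pairs are broken by first occurrence here, so the intermediate merges may differ from the hand-worked example even though the idea is the same.

```python
def bpe_compress(data, replacement_symbols):
    table = {}
    for symbol in replacement_symbols:
        pairs = {}
        for a, b in zip(data, data[1:]):      # count adjacent pairs
            pairs[a + b] = pairs.get(a + b, 0) + 1
        best = max(pairs, key=pairs.get)      # most frequent pair
        if pairs[best] < 2:                   # nothing left worth merging
            break
        table[symbol] = best
        data = data.replace(best, symbol)     # replace all instances with the new token
    return data, table

print(bpe_compress("aaabdaaabac", "ZYX"))
# ('XdXac', {'Z': 'aa', 'Y': 'Za', 'X': 'Yb'})
# Same compressed string, but a different merge table than the worked example,
# because ties between equally frequent pairs are broken arbitrarily.
```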
- LayerNorm
- Layer Normalization (LayerNorm) is a popular alternative to BatchNorm. Unlike BatchNorm, which normalizes across the batch dimension for a given feature, LayerNorm normalizes across all the features within a single data sample, so its performance is not affected by batch size. It is a key component of transformer models.
- For a given data input and layer, LayerNorm computes the mean $\mu$ and the variance $\sigma^2$ over all the neurons in the layer. Similar to BatchNorm, learnable parameters $\gamma$ (scale) and $\beta$ (shift) are applied. It is defined by:
$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$
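A minimal NumPy sketch of this computation; the feature size, epsilon, and identity initialization of the scale/shift parameters are assumptions for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)      # mean over the features of one sample
    var = x.var(axis=-1, keepdims=True)      # variance over the same features
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta              # learnable scale and shift

x = np.random.randn(2, 8)                    # 2 samples, 8 features each (assumed sizes)
y = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=-1), y.var(axis=-1))       # approximately 0 and 1 per sample
```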
- GLUE
- GLUE Benchmark
- The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems:
- A benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset size, text genres, and degrees of difficulty
- A diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language, and
- A public leaderboard for tracking performance on the benchmark and a dashboard for visualizing the performance of models on the diagnostic set.
Notes
Bidirectional encoder representations from transformers (BERT) is a language model introduced in October 2018 by researchers at Google. It learns to represent text as a sequence of vectors using self-supervised learning. It uses the encoder-only transformer architecture. It is notable for its dramatic improvement over previous state-of-the-art models, and as an early example of a large language model. As of 2020, BERT is a ubiquitous baseline in natural language processing (NLP) experiments.
BERT is trained by masked token prediction and next sentence prediction. As a result of this training process, BERT learns contextual, latent representations of tokens in their context, similar to ELMo and GPT-2. BERT was originally implemented in two model sizes, and trained on the Toronto BookCorpus and English Wikipedia (2,500M words).
Architecture
BERT is an encoder-only transformer architecture. At a high level, BERT consists of four modules:
- Tokenizer: This module converts a piece of English text into a sequence of integers.
- The tokenizer of BERT is WordPiece, which is a sub-word strategy like byte pair encoding. Its vocabulary size is 30,000, and any token not appearing in its vocabulary is replaced by [UNK] ("unknown").
- Embedding: This module converts the sequence of tokens into an array of real-valued vectors representing the tokens. It represents the conversion of discrete token types into a lower-dimensional Euclidean space.
- The first layer is the embedding layer, which contains three components: token type embeddings, position embeddings, and segment type embeddings.
- Token Type: The token type is a standard embedding layer, translating a one-hot vector into a dense vector based on its token type.
- Position: The position embeddings are based on a token's position in the sequence. BERT uses learned absolute position embeddings, where each position in the sequence is mapped to a real-valued vector that is learned during training.
- Segment type: Using a vocabulary of just 0 or 1, this embedding layer produces a dense vector based on whether the token belongs to the first or second text segment in that input. In other words, type-1 tokens are all tokens that appear after the [SEP] special token. All prior tokens are type-0.
- The three embedding vectors are added together, representing the initial token representation as a function of these three pieces of information. After embedding, the vector representation is normalized using a LayerNorm operation, outputting a 768-dimensional vector for each input token. A minimal sketch of this embedding module appears after this list.
- Encoder: a stack of Transformer blocks with self-attention, but without causal masking.
- Task head: This module converts the final representation vectors into one-hot encoded tokens again by producing a predicted probability distribution over the token types. It can be viewed as a simple decoder, decoding the latent representation into token types, or as an un-embedding layer.
The task head is necessary for pre-training, but it is often unnecessary for so-called downstream tasks, such as question answering or sentiment classification. Instead, one removes the task head and replaces it with a newly initialized module suited to the task, and fine-tunes the new module. The latent vector representation of the model is directly fed into this new module, allowing for sample-efficient transfer learning.
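A minimal PyTorch sketch of the embedding module described in the list above; the vocabulary size, maximum sequence length, and 768-dimensional hidden size are illustrative values for the BASE configuration, and this is not the reference implementation.

```python
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    def __init__(self, vocab_size=30000, hidden=768, max_pos=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)   # token type embeddings
        self.position = nn.Embedding(max_pos, hidden)   # learned absolute position embeddings
        self.segment = nn.Embedding(2, hidden)          # segment 0 or 1
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # The three embedding vectors are summed, then LayerNorm is applied.
        x = self.token(token_ids) + self.position(positions) + self.segment(segment_ids)
        return self.norm(x)

emb = BertEmbeddings()
ids = torch.randint(0, 30000, (1, 8))        # batch of one, 8 tokens
segs = torch.zeros(1, 8, dtype=torch.long)   # all tokens in segment 0
print(emb(ids, segs).shape)                  # torch.Size([1, 8, 768])
```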
The encoder stack of BERT has two free parameters: $L$, the number of layers, and $H$, the hidden size. There are always $H/64$ self-attention heads, and the feed-forward/filter size is always $4H$.
For BERT:
- The feed-forward size and filter size are synonymous. Both of them denote the number of dimensions in the middle layer of the feed-forward network.
- The hidden size and embedding size are synonymous. Both of them denote the number of real numbers used to represent a token.
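A small illustration of how the derived quantities follow from the two free parameters, using the layer counts and hidden sizes of the two original model sizes:

```python
# L and H are the free parameters; heads and feed-forward size are derived.
for name, L, H in [("BERT-BASE", 12, 768), ("BERT-LARGE", 24, 1024)]:
    heads = H // 64      # number of self-attention heads
    ffn = 4 * H          # feed-forward/filter size
    print(f"{name}: L={L}, H={H}, heads={heads}, feed-forward={ffn}")
# BERT-BASE: L=12, H=768, heads=12, feed-forward=3072
# BERT-LARGE: L=24, H=1024, heads=16, feed-forward=4096
```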
Training
BERT was pre-trained simultaneously on two tasks:
- Masked Language Modeling
- In masked language modeling, 15% of tokens are randomly selected for the masked-prediction task, and the training objective is to predict the masked token given its context (a code sketch of this rule follows the task list below). In more detail, the selected token is:
- replaced with a [MASK] token with probability 80%
- replaced with a random word token with probability 10%
- not replaced with probability 10%
- The reason not all selected tokens are masked is to avoid the dataset shift problem. The dataset shift problem arises when the distribution of inputs seen during training differs significantly from the distribution encountered during inference.
- Next Sentence Prediction
- Given two spans of text, the model predicts whether these two spans appeared sequentially in the training corpus, outputting either [IsNext] or [NotNext]. The first span starts with a special token [CLS] (for "classify"). The two spans are separated by a special token [SEP] (for "separate"). After processing the two spans, the first output vector (the vector coding for [CLS]) is passed to a separate neural network for the binary classification into [IsNext] and [NotNext].
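A minimal sketch of the masked language modeling selection rule described above; the toy vocabulary, token strings, and 15% selection rate are illustrative, and real implementations operate on token IDs from the WordPiece vocabulary.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", select_rate=0.15):
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < select_rate:         # select ~15% of tokens
            targets[i] = tok                      # the model must predict the original
            r = random.random()
            if r < 0.8:
                masked[i] = mask_token            # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = random.choice(vocab)  # 10%: replace with a random token
            # remaining 10%: leave the token unchanged
    return masked, targets

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
print(mask_tokens(["the", "cat", "sat", "on", "the", "mat"], vocab))
```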
BERT is meant as a general pretrained model for various applications in natural language processing. That is, after pretraining, BERT can be fine-tuned with fewer resources on smaller datasets to optimize its performance on specific tasks such as natural language inference and text classification, and sequence-to-sequence-based language generation tasks such as question answering and conversational response generation.
The original BERT paper published results demonstrating that a small amount of fine-tuning allowed it to achieve state-of-the-art performance on a number of natural language understanding tasks (a fine-tuning sketch follows the list below):
- GLUE (General Language Understanding Evaluation)
- SQuAD (Stanford Question Answering Dataset)
- SWAG (Situations with Adversarial Generations)
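As a rough sketch of the fine-tuning workflow described above, using the Hugging Face transformers library (assumed installed); the checkpoint name, label count, and example text are placeholders, and this is not the procedure from the original paper.

```python
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# The pre-training task head is discarded; a newly initialized classification
# head is placed on top of the [CLS] representation.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("This movie was great!", return_tensors="pt")
labels = torch.tensor([1])                 # e.g., 1 = positive sentiment (assumed labeling)
outputs = model(**inputs, labels=labels)
outputs.loss.backward()                    # fine-tune by backpropagating the classification loss
```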
Cost
Training BERT-BASE on 4 cloud TPUs (16 TPU chips total) took 4 days, at an estimated cost of 500 USD. Training BERT-LARGE on 16 cloud TPUs (64 TPU chips total) took 4 days.
Implementation
The high performance of the BERT model could be attributed to the fact that it is bidirectionally trained. This means that BERT, based on the Transformer model architecture, applies its self-attention mechanism to learn information from text on both the left and the right side during training, and consequently gains a deep understanding of the context. However, because its encoder-only architecture lacks a decoder, BERT can't be prompted and can't generate text; bidirectional models in general do not work effectively without the right-side context, which makes them difficult to prompt.