BERT (language model)

Getting back into looking at machine learning models. BERT and GPT are the two main types of LLMs according to the texts I have read, so I want to read more about each.

Date Created:
2 454

Related


  • ELMo
    • ELMo (Embeddings from Language Model) is a word embedding method for representing a sequence of words as a corresponding sequence of vectors. First released in February 2018, it is a bidirectional LSTM that takes character-level tokens as input and produces word-level embeddings, trained on a corpus of about 30 million sentences and 1 billion words.
    • The architecture of ELMo accomplishes a contextual understanding of tokens. ELMo was historically important as a pioneer of self-supervised generative pretraining followed by fine-tuning, in which a large model is trained to reproduce a large corpus and is then augmented with additional task-specific weights and fine-tuned on supervised task data.
  • BookCorpus
    • BookCorpus (also sometimes referred to as the Toronto Book Corpus) is a dataset consisting of the text of around 7,000 self-published books scraped from the indie ebook distribution platform Smashwords. It was the main corpus used to train the initial GPT model by OpenAI, and has been used as training data for other early large language models, including Google's BERT.
  • Embedding
    • In mathematics, an embedding (or imbedding) is one instance of some mathematical structure contained within another instance, such as a group that is a subgroup. When some object X is said to be embedded in another object Y, the embedding is given by some injective and structure-preserving map f : X → Y.
  • Attention
    • Attention is a machine learning method that determines the relative importance of each component in a sequence relative to the other components in that sequence. In natural language processing, importance is represented by "soft" weights assigned to each word in a sentence. More generally, attention encodes vectors called token embeddings across a fixed-width sequence that can range from tens to millions of tokens in size.
    • Unlike "hard" weights, which are computed during the backwards training pass, soft weights exist only in the forward pass and therefore change with every step of the input. Inspired by ideas about attention in humans, the attention mechanism was developed to address the weaknesses of leveraging information from the hidden layers of recurrent neural networks.
  • byte pair encoding
    • Byte pair encoding (also known as digram coding) is an algorithm, first described in 1994 by Philip Gage, for encoding strings of text into tabular form for use in downstream modeling. Its modification is notable as the large language model tokenizer, with an ability to combine both tokens that encode single characters and those that encode whole words. This modification, in the first step, assumes all unique characters to be an initial set of 1-character-long n-grams (i.e. initial "tokens"). Then, successively, the most frequent pair of adjacent characters is merged into a new, 2-character-long n-gram and all instances of the pair are replaced by this new token. This is repeated until a vocabulary of the prescribed size is obtained.

Example:

Suppose the data to be encoded is:

aaabdaaabac

The byte pair aa occurs most often, so it will be replaced by a byte that is not used in the data, such as Z. Now there is the following data and replacement table:

ZabdZabac
Z=aa

Then the process is repeated with the byte pair ab, replacing it with Y:

ZYdZYac
Y=ab
Z=aa

The only literal byte pair left occurs only once, and the encoding might stop here. Alternatively, the process could continue with recursive byte pair encoding, replacing ZY with X:

XdXac
X=ZY
Y=ab
Z=aa

To decompress the data, simply perform the replacements in reverse order.
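
The merge loop above is simple enough to sketch directly. Below is a minimal Python sketch of it; the function name byte_pair_encode and the pool of replacement symbols are illustrative choices, not taken from any particular library, and ties between equally frequent pairs may be broken differently than in the worked example above (the compressed string comes out the same).

from collections import Counter

def byte_pair_encode(data: str, unused_symbols: str = "ZYXWV"):
    """Repeatedly replace the most frequent adjacent pair with an unused symbol."""
    table = {}
    for symbol in unused_symbols:
        # Count all adjacent character pairs in the current string.
        pairs = Counter(data[i:i + 2] for i in range(len(data) - 1))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # every remaining pair occurs only once, so stop merging
        data = data.replace(pair, symbol)
        table[symbol] = pair
    return data, table

encoded, table = byte_pair_encode("aaabdaaabac")
print(encoded)  # XdXac
print(table)    # likely {'Z': 'aa', 'Y': 'Za', 'X': 'Yb'} -- a different but equally valid
                # sequence of merges than the worked example, due to tie-breaking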

  • LayerNorm
    • Layer Normalization (LayerNorm) is a popular alternative to BatchNorm. Unlike BatchNorm, which normalizes across the batch dimension for a given feature, LayerNorm normalizes across all the features within a single data sample. LayerNorm's performance is not affected by batch size, unlike BatchNorm. It is a key component of transformer models.
    • For a given data input x and layer, LayerNorm computes the mean μ and the variance σ² over all the neurons in the layer. Similar to BatchNorm, learnable parameters γ (scale) and β (shift) are applied. It is defined by LayerNorm(x) = γ · (x − μ) / √(σ² + ε) + β, where ε is a small constant added for numerical stability. (A small numerical sketch appears after this list.)
  • GLUE
    • GLUE Benchmark
    • The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems:
      • A benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset size, text genres, and degrees of difficulty
      • A diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural langauge, and
      • A public leaderboard for tracking performance on the benchmark and a dashboard for visualizing the performance of models on the diagnostic set.
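
The LayerNorm formula referenced above is easy to verify numerically. Below is a minimal NumPy sketch; the function name layer_norm and the epsilon value are illustrative choices, not from any particular library.

import numpy as np

def layer_norm(x: np.ndarray, gamma: np.ndarray, beta: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each sample over its feature dimension, then apply scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)       # mean over the features of each sample
    var = x.var(axis=-1, keepdims=True)       # variance over the features of each sample
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalized activations
    return gamma * x_hat + beta               # learnable scale (gamma) and shift (beta)

# Example: a batch of 2 samples with 4 features each.
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 2.0, 2.0, 2.0]])
gamma, beta = np.ones(4), np.zeros(4)
print(layer_norm(x, gamma, beta))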


Notes


Bidirectional encoder representations from transformers (BERT) is a language model introduced in October 2018 by researchers at Google. It learns to represent text as a sequence of vectors using self-supervised learning, and it uses the encoder-only transformer architecture. It is notable for its dramatic improvement over previous state-of-the-art models, and as an early example of a large language model. As of 2020, BERT is a ubiquitous baseline in natural language processing (NLP) experiments.

BERT is trained by masked token prediction and next sentence prediction. As a result of this training process, BERT learns contextual, latent representations of tokens, similar to ELMo and GPT-2. BERT was originally implemented in two model sizes, BERT_BASE (110 million parameters) and BERT_LARGE (340 million parameters), and trained on the Toronto BookCorpus and English Wikipedia (2,500M words).

Architecture

BERT is an encoder-only transformer architecture. At a high level, BERT consists of 4 modules:

  • Tokenizer: This module converts a piece of English text into a sequence of integers.
    • The tokenizer of BERT is WordPiece, which is a sub-word strategy like byte pair encoding. Its vocabulary size is 30,000, and any token not appearing in its vocabulary is replaced by [UNK]. (A small tokenizer sketch follows after this list.)
  • Embedding: This module converts the sequence of tokens into an array of real-valued vectors representing the tokens. It represents the conversion of discrete token types into a lower-dimensional Euclidean space.
    • The first layer is the embedding layer, which contains three components: token type embeddings, position embeddings, and segment type embeddings.

Embedding Components

      • Token Type: The token type is a standard embedding layer, translating a one-hot vector into a dense vector based on its token type.
      • Position: The position embeddings are based on a token's position in the sequence. BERT uses learned absolute position embeddings, where each position in the sequence is mapped to its own real-valued vector that is learned during training (unlike the fixed sinusoidal position encodings of the original Transformer).
      • Segment type: Using a vocabulary of just 0 or 1, this embedding layer produces a dense vector based on whether the token belongs to the first or second text segment in that input. In other words, type-1 tokens are all tokens that appear after the [SEP] special token. All prior tokens are type-0.
    • The three embedding vectors are added together, representing the initial token representation as a function of these three pieces of information. After embedding, the vector representation is normalized using a LayerNorm operation, outputting a 768-dimensional vector for each input token (in BERT_BASE).
  • Encoder: a stack of Transformer blocks with self-attention, but without causal masking.
  • Task head: This module converts the final representation vectors into one-hot encoded tokens again by producing a predicted probability distribution over the token types. It can be viewed as a simple decoder, decoding the latent representation into token types, or as an un-embedding layer.
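
As a small, hedged illustration of the first module above, the following sketch runs BERT's WordPiece tokenizer. It assumes the Hugging Face transformers package is installed; that package is not mentioned in the original text, and the exact sub-word splits shown in the comments are indicative only.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# WordPiece splits words outside the vocabulary into sub-word pieces prefixed with "##".
print(tokenizer.tokenize("BERT uses WordPiece tokenization"))
# e.g. ['bert', 'uses', 'word', '##piece', 'token', '##ization']

# encode() adds the special [CLS] and [SEP] tokens and maps each token to an integer id.
ids = tokenizer.encode("BERT uses WordPiece tokenization")
print(tokenizer.convert_ids_to_tokens(ids))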

The task head is necessary for pre-training, but it is often unnecessary for so-called downstream tasks, such as question answering or sentiment classification. Instead, one removes the task head, replaces it with a newly initialized module suited to the task, and fine-tunes the new module. The latent vector representation of the model is directly fed into this new module, allowing for sample-efficient transfer learning.
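
As a hedged sketch of this head-swapping step, the following again assumes the Hugging Face transformers package (an assumption, not something the original text specifies). BertForSequenceClassification reuses the pre-trained encoder and attaches a freshly initialized classification head that would then be fine-tuned on task data.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# The pre-trained encoder weights are reused; the 2-class classification head is newly initialized.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("This movie was great!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 2); meaningless until the new head is fine-tuned
print(logits)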

The encoder stack of BERT has two free parameters: L, the number of layers, and H, the hidden size. There are always H/64 self-attention heads, and the feed-forward/filter size is always 4H.

For BERT:

  • The feed-forward size and filter size are synonymous. Both of them denote the number of dimensions in the middle layer of the feed-forward network.
  • The hidden size and embedding size are synonymous. Both of them denote the number of real numbers used to represent a token.
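
A quick sanity check of these relationships, using the published sizes of the two original models (BERT_BASE: L=12, H=768; BERT_LARGE: L=24, H=1024), in a short Python sketch:

# Derive the dependent sizes from the two free parameters L and H.
for name, (L, H) in {"BERT_BASE": (12, 768), "BERT_LARGE": (24, 1024)}.items():
    heads = H // 64        # number of self-attention heads
    filter_size = 4 * H    # feed-forward / filter size
    print(f"{name}: layers={L}, hidden={H}, heads={heads}, filter={filter_size}")

# BERT_BASE: layers=12, hidden=768, heads=12, filter=3072
# BERT_LARGE: layers=24, hidden=1024, heads=16, filter=4096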

Training

BERT was pre-trained simultaneously on two tasks:

  1. Masked Language Modeling

    1. In masked language modeling, 15% of tokens would be randomly selected for the masked-prediction task, and the training objective was to predict the masked token given its context. In more detail, the selected token is:
      1. replaced with a [MASK] token with probability 80%
      2. replaced with a random word token with probability 10%
      3. not replaced with probability 10%
    2. The reason not all selected tokens are masked is to avoid the dataset shift problem. The dataset shift problem arises when the distribution of inputs seen during training differs significantly from the distribution encountered during inference. (A short sketch of this masking procedure appears after this list.)
  2. Next Sentence Prediction

    1. Given two spans of text, the model predicts if these two spans appeared sequentially in the training corpus, outputting either [IsNext] or [NotNext]. The first span starts with a special token [CLS] (for classify). The two spans are separated by a special token [SEP] (for separate). After processing the two spans, the 1st output vector (the vector coding for [CLS]) is passed to a separate neural network for the binary classification into [IsNext] and [NotNext].
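
To make the 80%/10%/10% rule concrete, here is a small Python sketch of how one might corrupt a token sequence for the masked-prediction objective. The function name, toy vocabulary, and selection probabilities are illustrative; this is not BERT's actual preprocessing code.

import random

def mask_tokens(tokens, vocab, select_prob=0.15):
    """Corrupt a token sequence for masked language modeling; return (inputs, labels)."""
    inputs, labels = [], []
    for token in tokens:
        if random.random() < select_prob:           # select ~15% of tokens for prediction
            labels.append(token)                     # the model must recover the original token
            roll = random.random()
            if roll < 0.8:
                inputs.append("[MASK]")              # 80%: replace with the mask token
            elif roll < 0.9:
                inputs.append(random.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(token)                 # 10%: keep the original token
        else:
            inputs.append(token)
            labels.append(None)                      # not part of the prediction objective
    return inputs, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
print(mask_tokens(["the", "cat", "sat", "on", "the", "mat"], vocab))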

BERT is meant as a general pretrained model for various applications in natural language processing. That is, after pretraining, BERT can be fine-tuned with fewer resources on smaller datasets to optimize its performance on specific tasks such as natural language inference and text classification, and sequence-to-sequence-based language generation tasks such as question answering and conversational response generation.

The original BERT paper published results demonstrating that a small amount of fine-tuning allowed it to achieve state-of-the-art performance on a number of natural language understanding tasks, including the GLUE benchmark.

Cost
Training BERT_BASE on 4 cloud TPUs (16 TPU chips total) took 4 days, at an estimated cost of 500 USD. Training BERT_LARGE on 16 cloud TPUs (64 TPU chips total) took 4 days.

Implementation

The high performance of the BERT model could be attributed to the fact that it is bidirectionally trained. This means that BERT, based on the Transformer model architecture, applies its self-attention mechanism to learn information from text to both the left and the right of a token during training, and consequently gains a deep understanding of the context. However, because its encoder-only architecture lacks a decoder, BERT can't be prompted and can't generate text, while bidirectional models in general do not work effectively without the right side, thus being difficult to prompt.
