GPT (Generative Pre-Trained Transformer)

Getting back into looking at machine learning models; BERT and GPT are the two main types of LLMs according to what I have read, so I want to read more about each.

Date Created:
2 454

References



Related


  • Markov Property
    • In probability theory and statistics, the Markov property refers to the memoryless property of a stochastic process, which means that its future evolution is independent of its history. It is named after the Russian mathematician Andrey Markov.
  • Markov Model
    • In probability theory, a Markov model is a stochastic model used to model pseudo-randomly changing systems. It is assumed that future states depend only on the current state, not on the events that occurred before it (that is, it assumes the Markov property). Generally, this assumption enables reasoning and computation with the model that would otherwise be intractable. For this reason, in the fields of predictive modelling and probabilistic forecasting, it is desirable for a given model to exhibit the Markov property.
  • Markov Process
    • A Markov chain or Markov process is a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Informally, this may be thought of as: what happens next depends only on the state of affairs now. Markov chains have many applications as statistical models of real-world processes. They provide the basis for general stochastic simulation methods known as Markov chain Monte Carlo, which are used for simulating sampling from complex probability distributions, and have found many applications. (A minimal chain simulation is sketched in code after this list.)
  • Hidden Markov Model
    • A hidden Markov model (HMM) is a Markov model in which the observations are dependent on a latent (or hidden) Markov process, referred to as X. An HMM requires that there be an observable process Y whose outcomes depend on the outcomes of X in a known way. Since X cannot be observed directly, the goal is to learn about the state of X by observing Y. By definition of being a Markov model, an HMM has the additional requirement that the outcome of Y at time t = t0 must be influenced exclusively by the outcome of X at t = t0, and that the outcomes of X and Y at t < t0 must be conditionally independent of Y at t = t0 given X at time t = t0. Estimation of the parameters of an HMM can be performed using maximum likelihood estimation. (A small sampling sketch follows this list.)
  • Maximum Likelihood Estimation
    • In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become the dominant means of statistical inference. (A tiny numerical example follows this list.)
  • Transformer
    • A transformer is a deep learning architecture developed by researchers at Google and based on the multi-head attention mechanism, proposed in the 2017 paper Attention is All You Need. Text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished. (A single-head attention sketch follows this list.)
    • Transformers have the advantage of having no recurrent units, therefore requiring less training time than earlier recurrent neural architectures (RNNs) such as long short-term memory (LSTM).
  • Autoencoder
    • An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data (unsupervised learning). An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction, to generate lower-dimensional embeddings for subsequent use by other machine learning algorithms. (A minimal encoder/decoder sketch follows this list.)
  • Diffusion Models
    • In machine learning, diffusion models are a class of latent variable generative models. A diffusion model consists of three major components: the forward process, the reverse process, and the sampling procedure. The goal of diffusion models is to learn a diffusion process for a given dataset, such that the process can generate new elements that are distributed similarly to the original dataset. A diffusion model models data as generated by a diffusion process, whereby a new datum performs a random walk with drift through the space of all possible data. A trained diffusion model can be sampled in many ways, with different efficiency and quality. (A sketch of the forward noising process follows this list.)
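
To make the Markov property concrete, here is a minimal sketch of simulating a two-state Markov chain in Python. The states ("sunny"/"rainy") and the transition probabilities are made up for illustration; the only point is that each step depends on the current state alone.

```python
import random

# Hypothetical two-state weather chain; the transition probabilities are made up.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def step(state):
    """Sample the next state given only the current state (the Markov property)."""
    outcomes, probs = zip(*TRANSITIONS[state].items())
    return random.choices(outcomes, weights=probs, k=1)[0]

def simulate(start, n_steps):
    chain = [start]
    for _ in range(n_steps):
        chain.append(step(chain[-1]))  # everything before chain[-1] is irrelevant
    return chain

print(simulate("sunny", 10))
```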
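
Building on the same toy chain, a hidden Markov model adds an observation layer: the weather (X) is hidden, and only noisy observations (Y) are seen. The emission probabilities below are equally made up; this only illustrates the hidden/observed structure, not parameter estimation.

```python
import random

# Hidden weather chain plus made-up emission probabilities for what is observed.
HIDDEN_TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
EMISSIONS = {  # P(observation | hidden state)
    "sunny": {"walk": 0.7, "umbrella": 0.3},
    "rainy": {"walk": 0.2, "umbrella": 0.8},
}

def draw(dist):
    """Sample one outcome from a {outcome: probability} dict."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

def sample_hmm(start, n_steps):
    hidden, observed = [], []
    state = start
    for _ in range(n_steps):
        hidden.append(state)
        observed.append(draw(EMISSIONS[state]))   # Y depends on X in a known way
        state = draw(HIDDEN_TRANSITIONS[state])   # X evolves as a Markov chain
    return hidden, observed

print(sample_hmm("sunny", 5))
```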
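
For maximum likelihood estimation, a tiny worked example: estimating the bias of a coin from made-up flips. For k heads out of n flips, the likelihood p^k (1-p)^(n-k) is maximized at p = k/n, and the grid search below confirms this numerically.

```python
import math

# Made-up coin-flip observations: 1 = heads, 0 = tails.
data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

def log_likelihood(p):
    """Log-likelihood of the data under a Bernoulli(p) model."""
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in data)

# Closed-form maximum likelihood estimate: the sample mean.
p_hat = sum(data) / len(data)

# Numerical check: the log-likelihood peaks at the same value.
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=log_likelihood)

print(p_hat, round(p_grid, 3))  # both are 0.7
```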
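
The core of the transformer's attention mechanism can be sketched in a few lines of NumPy. This is a single attention head; the real architecture runs many heads in parallel and adds learned projections, feed-forward layers, residual connections, and normalization. Shapes and numbers here are illustrative only.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # query-key similarities
    if mask is not None:
        scores = np.where(mask, scores, -1e9)      # masked positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                             # weighted mix of value vectors

# Toy self-attention: 4 tokens with 8-dimensional embeddings, random numbers.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```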
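
A minimal autoencoder sketch in PyTorch, assuming flattened 784-dimensional inputs (e.g. 28x28 images); the layer sizes are arbitrary. It only shows the encode/decode structure and the reconstruction objective, not a full training loop.

```python
import torch
from torch import nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoding function: compress the input to a low-dimensional code.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoding function: reconstruct the input from the code.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.randn(16, 784)                    # fake batch of inputs
loss = nn.functional.mse_loss(model(x), x)  # reconstruction objective
loss.backward()
print(loss.item())
```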
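
For diffusion models, the forward (noising) process is the easy half to write down. The sketch below uses the common Gaussian formulation, where x_t is drawn from N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I), with a simple linear beta schedule on toy 1-D data; the reverse process would require a trained denoising network and is omitted here.

```python
import numpy as np

# Forward (noising) process of a diffusion model on toy 1-D data, using a
# linear beta schedule; all of the numbers here are illustrative.
rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)   # alpha_bar_t = product of (1 - beta_s)

x0 = rng.normal(loc=3.0, scale=0.5, size=1000)   # "dataset": samples near 3.0

def q_sample(x0, t):
    """Jump straight to step t: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alphas_cumprod[t]) * x0 + np.sqrt(1.0 - alphas_cumprod[t]) * noise

# Early steps still look like the data; by the last step it is essentially pure noise.
print(q_sample(x0, 0).mean(), q_sample(x0, T - 1).mean())
```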


Notes


A generative pre-trained transformer (GPT) is a type of large language model and a prominent framework for generative artificial intelligence. It is an artificial neural network used in natural language processing. It is based on the transformer deep learning architecture, pre-trained on large data sets of unlabeled text, and able to generate novel human-like content. As of 2023, most LLMs have these characteristics and are sometimes referred to broadly as GPTs.

The first GPT was introduced in 2018 by OpenAI. OpenAI has since released a numbered series of GPT models, each significantly more capable than the previous due to increased size (number of trainable parameters) and training.

History

Generative pretraining (GP) was a long-established concept in machine learning applications. It was originally used as a form of semi-supervised learning: a model is first trained on an unlabeled dataset (the pretraining step) by learning to generate datapoints in the dataset, and then it is trained to classify a labeled dataset. There were three main types of early GP. Hidden Markov models learn a generative model of sequences for downstream applications. Compressors learn to compress data such as images and textual sequences, and the compressed data serves as a good representation for downstream applications such as facial recognition. Autoencoders similarly learn a latent representation of data for later downstream applications such as speech recognition.

During the 2010s, the problem of machine translation was largely solved by recurrent neural networks, with an attention mechanism added. This was optimized into the transformer architecture, published by Google researchers in Attention is All You Need (2017). That development led to the emergence of large language models such as BERT, which was a pretrained transformer but was not designed to be generative.

The semi-supervised approach OpenAI employed to make a large-scale generative system - and was the first to do so with a transformer model - involved two stages: an unsupervised generative pretraining stage to set initial parameters using a language modeling objective, and a supervised discriminative fine-tuning stage to adapt these parameters to a target task.
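
Below is a very rough PyTorch sketch of this two-stage recipe, using a tiny generic transformer encoder with a causal mask as a stand-in for the actual GPT architecture. All sizes, the fake data, and the classification task are placeholders for illustration, not details of GPT-1.

```python
import torch
from torch import nn

vocab_size, d_model, n_classes, seq_len = 1000, 64, 2, 16

embed = nn.Embedding(vocab_size, d_model)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
)
lm_head = nn.Linear(d_model, vocab_size)   # stage 1: language modeling head
clf_head = nn.Linear(d_model, n_classes)   # stage 2: task-specific head

tokens = torch.randint(0, vocab_size, (8, seq_len))   # fake unlabeled text
labels = torch.randint(0, n_classes, (8,))            # fake task labels

# Stage 1: unsupervised generative pretraining (predict the next token),
# with a causal mask so each position only attends to earlier positions.
causal = nn.Transformer.generate_square_subsequent_mask(seq_len - 1)
h = backbone(embed(tokens[:, :-1]), mask=causal)
pretrain_loss = nn.functional.cross_entropy(
    lm_head(h).reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
)

# Stage 2: supervised discriminative fine-tuning on a labeled dataset,
# reusing the pretrained parameters and reading the final position's state.
h = backbone(embed(tokens), mask=nn.Transformer.generate_square_subsequent_mask(seq_len))
finetune_loss = nn.functional.cross_entropy(clf_head(h[:, -1]), labels)

print(pretrain_loss.item(), finetune_loss.item())
```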

Foundation Models

A foundation model is an AI model trained on broad data at scale such that it can be adapted to a wide range of downstream tasks. Foundation GPTs can also employ modalities other than text, for input and/or output. Regarding multimodal output, generative transformer-based models are used for text-to-image technologies such as diffusion and parallel decoding. Some of these models can serve as visual foundation models (VFMs) for developing downstream systems that can work with images.

Task-Specific Models

A foundation GPT model can be further adapted to produce more targeted systems directed to specific tasks and/or subject-matter domains. Methods for such adaptation can include additional fine-tuning (beyond that done for the foundation model) as well as certain forms of prompt engineering. An important example of this is fine-tuning models to follow instructions, which is a fairly broad task but more targeted than a foundation model. Chatbots are one kind of task-specific model. (A hypothetical example of instruction-tuning data is sketched below.)
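
As a hypothetical illustration of what instruction fine-tuning data can look like, the record and prompt template below are invented for this note; real instruction datasets and chat formats differ.

```python
# Hypothetical instruction-tuning record and a simple template for flattening
# it into a single training string; field names and template are illustrative.
record = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Transformers process all tokens in parallel using attention...",
    "output": "Transformers use attention to process whole sequences in parallel.",
}

def to_training_text(rec):
    return (
        f"### Instruction:\n{rec['instruction']}\n\n"
        f"### Input:\n{rec['input']}\n\n"
        f"### Response:\n{rec['output']}"
    )

print(to_training_text(record))
```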
