GPT (Generative Pre-Trained Transformer)
Getting back into looking at machine learning models - BERT and GPT are the two main types of LLMs according to texts I have read, so I want to read more about each.
References
Related
- Markov Property
- In probability theory and statistics, the Markov property refers to the memoryless property of a stochastic process, which means that its future evolution is independent of its history. It is named after the Russian mathematician Andrey Markov.
- Markov Model
- In probability theory, a Markov model is a stochastic model used to model pseudo-randomly changing systems. It is assumed that future states depend only on the current state, not on the events that occurred before it (that is, it assumes the Markov property). Generally, this assumption enables reasoning and computation with the model that would otherwise be intractable. For this reason, in the fields of predictive modelling and probabilistic forecasting, it is desirable for a given model to exhibit the Markov property.
- Markov Process
- A Markov chain or Markov process is a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Informally, this may be thought of as, "What happens next depends only on the state of affairs now." Markov chains have many applications as statistical models of real-world processes. They provide the basis for general stochastic simulation methods known as Markov chain Monte Carlo, which are used for simulating sampling from complex probability distributions, and have found many applications. A short simulation sketch appears after this list.
- Hidden Markov Model
- A hidden Markov model (HMM) is a Markov model in which the observations are dependent on a latent (or hidden) Markov process, referred to here as X. An HMM requires that there be an observable process Y whose outcomes depend on the outcomes of X in a known way. Since X cannot be observed directly, the goal is to learn about the state of X by observing Y. By definition of being a Markov model, an HMM has an additional requirement: the outcome of Y at time t must be influenced exclusively by the outcome of X at time t, and the outcomes of X and Y at times before t must be conditionally independent of Y at time t given X at time t. Estimation of the parameters of an HMM can be performed using maximum likelihood estimation.
- Maximum Likelihood Estimation
- In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become the dominant means of statistical inference.
- Transformer
- A transformer is a deep learning architecture developed by researchers at Google and based on the multi-head attention mechanism, proposed in the 2017 paper Attention is All You Need. Text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished (a minimal attention sketch appears after this list).
- Transformers have the advantage of having no recurrent units, therefore requiring less training time than earlier recurrent neural architectures (RNNs) such as long short-term memory (LSTM).
- Autoencoder
- An autoencoder is a type of artificial neural network used to learn efficient coding of unlabeled data (unsupervised learning). An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction, to generate lower-dimensional embeddings for subsequent use by other machine learning algorithms.
- Diffusion Models
- In machine learning, diffusion models are a class of latent variable generative models. A diffusion model consists of three major components: the forward process, the reverse process, and the sampling procedure. The goal of a diffusion model is to learn a diffusion process for a given dataset such that the process can generate new elements that are distributed similarly to the original dataset. A diffusion model models data as generated by a diffusion process, whereby a new datum performs a random walk with drift through the space of all possible data. A trained diffusion model can be sampled in many ways, with different efficiency and quality (a sketch of the forward noising process appears after this list).
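A few of the definitions above are easier to see in code. The sketches below are minimal Python/NumPy illustrations with made-up numbers and shapes; they are not implementations from any of the referenced sources.

First, a Markov chain can be simulated by repeatedly sampling the next state from a transition matrix, using nothing but the current state. The two-state "weather" chain and its probabilities are purely illustrative assumptions.

```python
import numpy as np

# Hypothetical two-state weather chain; transition probabilities are
# illustrative values, not taken from any dataset.
states = ["sunny", "rainy"]
P = np.array([
    [0.8, 0.2],   # P(next state | current = sunny)
    [0.4, 0.6],   # P(next state | current = rainy)
])

rng = np.random.default_rng(0)
state = 0                       # start in "sunny"
trajectory = [states[state]]
for _ in range(10):
    # The next state depends only on the current state (the Markov property).
    state = rng.choice(len(states), p=P[state])
    trajectory.append(states[state])

print(" -> ".join(trajectory))
```

Next, the heart of the transformer is scaled dot-product attention. This single-head sketch uses random vectors as stand-in token embeddings and an assumed causal mask; a real transformer stacks many multi-head attention layers with learned projections.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=True):
    """Single-head attention: weight each value by query-key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq, seq) similarity scores
    if causal:
        # Mask future positions so each token attends only to earlier ones.
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.normal(size=(seq_len, d_model))             # stand-in token embeddings
print(scaled_dot_product_attention(x, x, x).shape)  # (5, 8)
```

Finally, the forward (noising) half of a diffusion model can be written as blending data with Gaussian noise under a variance schedule. The linear schedule and the toy 1-D "dataset" are assumptions, and the learned reverse (denoising) process is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(loc=3.0, scale=0.5, size=1000)     # toy 1-D dataset

T = 100
betas = np.linspace(1e-4, 0.05, T)                 # assumed variance schedule
alphas_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t):
    """Sample x_t ~ N(sqrt(a_bar_t) * x0, (1 - a_bar_t) * I)."""
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise

for t in (0, T // 2, T - 1):
    xt = q_sample(x0, t)
    print(f"t={t:3d}  mean={xt.mean():+.2f}  std={xt.std():.2f}")
# As t grows the samples approach a standard normal; a trained reverse
# process learns to undo this walk and generate new data.
```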
Notes
A generative pre-trained transformer (GPT) is a type of large language model and a prominent framework for generative artificial intelligence. It is an artificial neural network that is used in natural language processing by machines. It is based on the transformer deep learning architecture, pre-trained on large data sets of unlabeled text, and able to generate novel human-like content. As of 2023, most LLMs had these characteristics and are sometimes referred to broadly as GPTs.
The first GPT was introduced in 2018 by OpenAI. Each subsequent model has been significantly more capable than the previous, due to increased size (number of trainable parameters) and training.
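As a rough illustration of the "generative" part, a GPT produces text autoregressively: it predicts a next token, appends it to the input, and repeats. The sketch below uses a hypothetical toy_logits stand-in for a trained transformer and a made-up vocabulary, so only the decoding loop is meaningful.

```python
import numpy as np

vocab = ["<bos>", "the", "cat", "sat", "on", "mat", "."]    # toy vocabulary

def toy_logits(token_ids):
    """Placeholder for a trained GPT: returns a score per vocabulary entry.
    A real model would run the token ids through transformer layers."""
    rng = np.random.default_rng(sum(token_ids))             # deterministic toy scores
    return rng.normal(size=len(vocab))

def generate(prompt_ids, max_new_tokens=5):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = int(np.argmax(toy_logits(ids)))           # greedy decoding
        ids.append(next_id)                                 # feed the prediction back in
    return ids

ids = generate([vocab.index("<bos>"), vocab.index("the")])
print(" ".join(vocab[i] for i in ids))
```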
History
Generative pretraining (GP) was a long-established concept in machine learning applications. It was originally used as a form of semi-supervised learning: the model is first trained on an unlabeled dataset (the pretraining step) by learning to generate datapoints in the dataset, and then trained to classify a labeled dataset. There were three main types of early GP. Hidden Markov models learn a generative model of sequences for downstream applications.
Compressors learn to compress data such as images and textual sequences, and the compressed data serves as a good representation for downstream applications such as facial recognition. Autoencoders similarly learn a latent representation of data for later downstream applications such as speech recognition. During the 2010s, the problem of machine translation was solved by recurrent neural networks, with an attention mechanism added. This was optimized into the transformer architecture, published by Google researchers in Attention is All You Need (2017). That development led to the emergence of large language models such as BERT, which was a pretrained transformer but not designed to be generative.
The semi-supervised approach OpenAI employed to make a large-scale generative system (and was the first to do so with a transformer model) involved two stages: an unsupervised generative pretraining stage to set initial parameters using a language modeling objective, and a supervised discriminative fine-tuning stage to adapt these parameters to a target task.
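A hedged sketch of those two stages is below, written in PyTorch around a deliberately tiny stand-in network; the sizes, random data, and single-embedding "backbone" are illustrative assumptions, not OpenAI's training setup.

```python
import torch
import torch.nn as nn

# Tiny stand-in for a GPT-style network: a shared backbone plus two heads.
vocab_size, d_model, num_classes = 100, 32, 2
backbone = nn.Embedding(vocab_size, d_model)    # placeholder for transformer layers
lm_head = nn.Linear(d_model, vocab_size)        # predicts the next token
clf_head = nn.Linear(d_model, num_classes)      # predicts a task label

# Stage 1: unsupervised generative pretraining with a language modeling objective.
opt = torch.optim.Adam(list(backbone.parameters()) + list(lm_head.parameters()), lr=1e-3)
tokens = torch.randint(0, vocab_size, (64, 16))             # fake unlabeled text
for _ in range(3):
    h = backbone(tokens[:, :-1])                            # contextual features
    logits = lm_head(h)                                     # next-token scores
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: supervised discriminative fine-tuning on a labeled target task.
opt = torch.optim.Adam(list(backbone.parameters()) + list(clf_head.parameters()), lr=1e-4)
labels = torch.randint(0, num_classes, (64,))               # fake task labels
for _ in range(3):
    h = backbone(tokens).mean(dim=1)                        # pooled sequence features
    loss = nn.functional.cross_entropy(clf_head(h), labels)
    opt.zero_grad(); loss.backward(); opt.step()
```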
Foundational Models
A foundational model is an AI model trained on broad data at scale such that it can be adapted to a wide range of downstream tasks. Foundational GPTs can also employ modalities other than text, for input and/or output. Regarding multimodal output, generative transformer-based models are used for text-to-image technologies such as diffusion and parallel decoding. Some of these models can serve as visual foundation models (VFMs) for developing downstream systems that can work with images.
Task Specific Models
A foundational GPT model can be further adapted to produce more targeted systems directed to specific tasks and/or subject-matter domains. Methods for such adaptation can include additional fine-tuning (beyond that done for the foundation model) as well as certain forms of prompt engineering. An important example of this is fine-tuning models to follow instructions, which is a fairly broad task but more targeted than a foundational model. Chatbots are one kind of task-specific model.
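As an illustration of the prompt-engineering route (as opposed to further fine-tuning), the snippet below steers a general-purpose model toward a task purely through its input text. It assumes the Hugging Face transformers library and the small public gpt2 checkpoint, neither of which is mentioned above; a practical system would use a far more capable, instruction-tuned model.

```python
# Assumes `pip install transformers torch`; gpt2 is used only as a small
# stand-in for a foundational GPT model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Task adaptation via the prompt alone: a few-shot pattern for sentiment labels.
prompt = (
    "Review: The food was wonderful. Sentiment: positive\n"
    "Review: The service was painfully slow. Sentiment: negative\n"
    "Review: I would happily come back again. Sentiment:"
)
out = generator(prompt, max_new_tokens=3, do_sample=False)
print(out[0]["generated_text"])
```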