Question Answering, Making Transformers Efficient, Dealing with Few Labels, and Future Directions

These final chapters of Natural Language Processing with Transformers cover question answering, making transformers efficient through techniques such as knowledge distillation, quantization, weight pruning, and ONNX graph optimization, methods for dealing with few labels, and a review of future directions for LLMs.

Question Answering

There are many flavors of question answering (QA), but the most common is extractive QA, which involves questions whose answers can be identified as a span of text in a document, where the document might be a web page, legal contract, or news article. This two-stage process of first retrieving relevant documents and then extracting answers from them is the basis for many modern QA systems, including semantic search engines, intelligent assistants, and automated information extractors. Community QA involves gathering question-answer pairs generated by users on forums like Stack Overflow and using semantic similarity search to find the closest matching answer to a new question. Long-form QA aims to generate complex, paragraph-length answers to open-ended questions like "Why is the sky blue?"

Building a Review-Based QA System

Closed-domain QA deals with questions about a narrow topic (e.g., a single product category), while open-domain QA deals with questions about almost anything (e.g., Amazon's whole product catalog).

!pip install datasets
from datasets import get_dataset_config_names

domains = get_dataset_config_names("subjqa")
domains
out[2]
from datasets import load_dataset

subjqa = load_dataset("subjqa", name="electronics")
out[3]
print(subjqa["train"]["answers"][1])
out[4]

{'text': ['Bass is weak as expected', 'Bass is weak as expected, even with EQ adjusted up'], 'answer_start': [1302, 1302], 'answer_subj_level': [1, 1], 'ans_subj_score': [0.5083333253860474, 0.5083333253860474], 'is_ans_subjective': [True, True]}

import pandas as pd

dfs = {split: dset.to_pandas() for split, dset in subjqa.flatten().items()}

for split, df in dfs.items():
  print(f"Number of questions in {split}: {df['id'].nunique()}")
out[5]

Number of questions in train: 1295
Number of questions in test: 358
Number of questions in validation: 255

qa_cols = ["title", "question", "answers.text",
           "answers.answer_start", "context"]
sample_df = dfs["train"][qa_cols].sample(2, random_state=7)
sample_df
out[6]

      title       question                        answers.text                answers.answer_start  context
791   B005DKZTMG  Does the keyboard lightweight?  [this keyboard is compact]  [215]                 I really like this keyboard. I give it 4 star...
1159  B00AAIPT76  How is the battery?             []                          []                    I bought this after the first spare gopro batt...

start_idx = sample_df["answers.answer_start"].iloc[0][0]
end_idx = start_idx + len(sample_df["answers.text"].iloc[0][0])
sample_df["context"].iloc[0][start_idx:end_idx]
out[7]

'this keyboard is compact'

import matplotlib.pyplot as plt

counts = {}
question_types = ["What", "How", "Is", "Does", "Do", "Was", "Where", "Why"]
for q in question_types:
  counts[q] = dfs["train"]["question"].str.startswith(q).value_counts()[True]
ax = pd.Series(counts).sort_values().plot.barh()
ax.set_title("Frequency of Question Types")
plt.show()
out[8]
[Horizontal bar chart: frequency of question types in the training set]

for question_type in ["How", "What", "Is"]:
  sample = (dfs["train"][dfs["train"].question.str.startswith(question_type)]
            .sample(n=3, random_state=42)["question"])
  for question in sample:
    print(question)
out[9]

How is the camera?
How do you like the control?
How fast is the charger?
What is direction?
What is the quality of the construction of the bag?
What is your impression of the product?
Is this how zoom works?
Is sound clear?
Is it a wireless keyboard?

Extracting Answers from Text

The first thing we'll need for our QA system is to find a way to identify a potential answer as a span of text in a customer review. To do this we'll need to understand how to:

  • Frame the supervised learning problem
  • Tokenize and encode text for QA tasks
  • Deal with long passages that exceed a model's maximum context size

Span Classification

The most common way to extract answers from text is by framing the problem as a span classification task, where the start and end tokens of an answer span act as the labels that a model needs to predict.

Span Classification QA Tasks

For extractive QA, we can actually start with a fine-tuned model since the structure of the labels remains the same across datasets.

from transformers import AutoTokenizer

model_ckpt = "deepset/minilm-uncased-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
out[11]


question = "How much music can this hold?"
context = """An MP3 is about 1 MB/minute, so about 6000 hours depending on \
file size."""
inputs = tokenizer(question, context, return_tensors="pt")
out[12]
print(tokenizer.decode(inputs["input_ids"][0]))
out[13]

[CLS] how much music can this hold? [SEP] an mp3 is about 1 mb / minute, so about 6000 hours depending on file size. [SEP]

import torch
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(model_ckpt)

with torch.no_grad():
  outputs = model(**inputs)
print(outputs)
out[14]


QuestionAnsweringModelOutput(loss=None, start_logits=tensor([[-0.9862, -4.7750, -5.4025, -5.2378, -5.2863, -5.5117, -4.9819, -6.1880,
-0.9862, 0.2596, -0.2144, -1.7136, 3.7806, 4.8561, -1.0546, -3.9097,
-1.7374, -4.5944, -1.4278, 3.9949, 5.0391, -0.2018, -3.0193, -4.8549,
-2.3107, -3.5110, -3.5713, -0.9862]]), end_logits=tensor([[-0.9623, -5.4733, -5.0326, -5.1639, -5.4278, -5.5151, -5.1749, -4.6233,
-0.9623, -3.7855, -0.8715, -3.7745, -3.0161, -1.1780, 0.1758, -2.7365,
4.8934, 0.3046, -3.1761, -3.2762, 0.8937, 5.6606, -0.3623, -4.9554,
-3.2531, -0.0914, 1.6211, -0.9623]]), hidden_states=None, attentions=None)

start_logits = outputs.start_logits
end_logits = outputs.end_logits
out[15]
print(f"Input IDs shape: {inputs.input_ids.size()}")
print(f"Start logits shape: {start_logits.size()}")
print(f"End logits shape: {end_logits.size()}")
out[16]

Input IDs shape: torch.Size([1, 28])
Start logits shape: torch.Size([1, 28])
End logits shape: torch.Size([1, 28])

import torch

start_idx = torch.argmax(start_logits)
end_idx = torch.argmax(end_logits) + 1
answer_span = inputs["input_ids"][0][start_idx:end_idx]
answer = tokenizer.decode(answer_span)
print(f"Question: {question}")
print(f"Answer: {answer}")
out[17]

Question: How much music can this hold?
Answer: 6000 hours

from transformers import pipeline

pipe = pipeline("question-answering", model=model, tokenizer=tokenizer)
pipe(question=question, context=context, top_k=3)
out[18]


[{'score': 0.2651616930961609, 'start': 38, 'end': 48, 'answer': '6000 hours'},
 {'score': 0.22082944214344025, 'start': 16, 'end': 48, 'answer': '1 MB/minute, so about 6000 hours'},
 {'score': 0.10253512114286423, 'start': 16, 'end': 27, 'answer': '1 MB/minute'}]

pipe(question="Why is there no data?", context=context,handle_impossible_answer=True)
out[19]

{'score': 0.9068413972854614, 'start': 0, 'end': 0, 'answer': ''}

Dealing with Long Passages

One subtlety faced by reading comprehension models is that the context often contains more tokens than the maximum sequence length of the model. The standard way to deal with this is to apply a sliding window across the inputs, where each window contains a passage of tokens that fit in the model's context.

Sliding Window

example = dfs["train"].iloc[0][["question", "context"]]
tokenized_example = tokenizer(example["question"], example["context"],return_overflowing_tokens=True, # enables sliding window
max_length=100,stride=25)
out[21]

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.

for idx, window in enumerate(tokenized_example["input_ids"]):
  print(f"Window #{idx} has {len(window)} tokens")
out[22]

Window #0 has 100 tokens
Window #1 has 88 tokens

Using Haystack to Build a QA Pipeline

Modern QA systems are typically based on the retriever-reader architecture, which has two main components:

  • Retriever
    • Responsible for retrieving relevant documents for a given query. Retrievers are usually categorized as sparse or dense. Sparse retrievers use word frequencies to represent each document and query as a sparse vector. The relevance of a query and a document is then determined by computing an inner product of the vectors. Dense retrievers use encoders like transformers to represent the query and document as contextualized embeddings (which are dense vectors). These embeddings encode semantic meaning and allow dense retrievers to improve search accuracy by understanding the content of the query. A minimal sparse-retrieval sketch follows this list.
  • Reader
    • Responsible for extracting an answer from the documents provided by the retriever. The reader is usually a reading comprehension model.
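
As a rough illustration of the sparse approach (not a production retriever), the sketch below scores the review contexts against a query with TF-IDF vectors and an inner product; scikit-learn, the query string, and the variable names are assumptions for the example.

# Minimal sparse-retrieval sketch using TF-IDF (assumes scikit-learn is installed)
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = dfs["train"]["context"].drop_duplicates().tolist()
query = "How is the battery life?"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(reviews)   # sparse document matrix
query_vector = vectorizer.transform([query])      # sparse query vector

# Relevance = inner product between query and document vectors
scores = (doc_vectors @ query_vector.T).toarray().ravel()
for idx in scores.argsort()[::-1][:3]:
  print(f"{scores[idx]:.3f}", reviews[idx][:80])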

Retriever-Reader Architecture

To build the QA system, we'll use the Haystack library developed by deepset, a German company focused on NLP. Haystack is based on the retriever-reader architecture, abstracts much of the complexity involved in building these systems, and integrates tightly with 🤗 Transformers. In addition to the retriever and reader, there are two more components involved when building a QA pipeline with Haystack (a minimal pipeline sketch follows this list):

  • Document Store: A document-oriented DB that stores documents and metadata which are provided to the retriever at query time
  • Pipeline: Combines all the components of a QA system to enable custom query flows, merging documents from multiple retrievers, and more.
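
The exact code depends heavily on the Haystack version; the sketch below assumes a late Haystack 1.x release (InMemoryDocumentStore with BM25 support, BM25Retriever, FARMReader, ExtractiveQAPipeline), so treat the imports and arguments as assumptions rather than a vetted recipe.

# Sketch of a Haystack extractive QA pipeline (Haystack 1.x-style API assumed)
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

# Document store: holds the reviews plus metadata for the retriever
document_store = InMemoryDocumentStore(use_bm25=True)
docs = [{"content": row["context"], "meta": {"item_id": row["title"]}}
        for _, row in dfs["train"].drop_duplicates(subset="context").iterrows()]
document_store.write_documents(docs)

# Retriever fetches candidate reviews, reader extracts answer spans from them
retriever = BM25Retriever(document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/minilm-uncased-squad2")

pipe = ExtractiveQAPipeline(reader=reader, retriever=retriever)
preds = pipe.run(query="Is it good for reading?",
                 params={"Retriever": {"top_k": 3}, "Reader": {"top_k": 3}})
print(preds["answers"][0].answer)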

Improving a QA Pipeline

Evaluating the Retriever

A common metric for evaluating retrievers is recall, which measures the fraction of all relevant documents that are retrieved. In this context, 'relevant' simply means whether the answer is present in a passage of text or not, so given a set of questions, we can compute recall by counting the number of times an answer appears in the top $k$ documents returned by the retriever. A complementary metric to recall is mean average precision (mAP), which rewards retrievers that place the correct answers higher up in the ranking.
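
As a toy illustration of recall at $k$ (not Haystack's built-in evaluation), the helper below counts how often a gold answer string appears in the top-k retrieved passages; the `retrieve_fn` callable and its signature are assumptions for the example.

# Toy recall@k: fraction of questions whose answer appears in the top-k passages
def recall_at_k(questions, gold_answers, retrieve_fn, k=3):
  hits = 0
  for question, answers in zip(questions, gold_answers):
    passages = retrieve_fn(question, top_k=k)   # hypothetical retrieval function
    if any(ans.lower() in p.lower() for ans in answers for p in passages):
      hits += 1
  return hits / len(questions)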

Dense Passage Retrieval

One promising alternative is to use dense embeddings to represent the question and document, and the current state of the art is an architecture known as Dense Passage Retrieval (DPR). The main idea behind DPR is to use two BERT models as encoders for the question and the passage.
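
Transformers ships pretrained DPR encoders; the sketch below embeds one question and one passage with the bi-encoder and scores them with a dot product. The facebook/dpr-* checkpoints are the publicly released ones, but the question/passage strings are illustrative.

# DPR bi-encoder sketch: separate BERT encoders for questions and passages
import torch
from transformers import (DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
                          DPRContextEncoder, DPRContextEncoderTokenizer)

q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
c_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
c_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

question_text = "How is the battery life?"
passage_text = "The battery lasts about ten hours on a single charge."

with torch.no_grad():
  q_emb = q_enc(**q_tok(question_text, return_tensors="pt")).pooler_output
  p_emb = c_enc(**c_tok(passage_text, return_tensors="pt")).pooler_output

# Relevance score = inner product of the dense embeddings
print(torch.matmul(q_emb, p_emb.T).item())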

DPR Bi-Encoder Architecture

Evaluating the Reader

In extractive QA, there are two main metrics that are used for evaluating readers:

  1. Exact Match: A binary metric that gives EM = 1 if the characters in the predicted and ground truth answers match exactly, and EM = 0 otherwise. If no answer is expected, the model gives EM = 0 if it predicts any text at all.
  2. $F_1$ score: Measures the harmonic mean of precision and recall.

Under the hood, these functions first normalize the prediction and label by removing punctuation, fixing whitespace, and converting to lowercase. The normalized strings are then tokenized as a bag-of-words, before finally computing the metric at the token level. From this simple example we can see that EM is a much stricter metric than the $F_1$ score: adding a single token to the prediction gives an EM of zero. On the other hand, the $F_1$ score can fail to catch truly incorrect answers. Relying on just the $F_1$ score is thus misleading, and tracking both metrics is a good strategy to balance the trade-off between underestimating (EM) and overestimating ($F_1$ score) model performance.
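
To make the difference concrete, here is a simplified sketch of token-level EM and F1 (real SQuAD-style metrics also strip punctuation and articles; only lowercasing and whitespace tokenization are applied here):

# Simplified EM and F1 over whitespace tokens
from collections import Counter

def normalize(text):
  return text.lower().split()

def exact_match(prediction, truth):
  return int(normalize(prediction) == normalize(truth))

def f1_score(prediction, truth):
  pred_tokens, true_tokens = normalize(prediction), normalize(truth)
  common = Counter(pred_tokens) & Counter(true_tokens)
  num_same = sum(common.values())
  if num_same == 0:
    return 0.0
  precision = num_same / len(pred_tokens)
  recall = num_same / len(true_tokens)
  return 2 * precision * recall / (precision + recall)

print(exact_match("about 6000 hours", "6000 hours"))  # 0: one extra token breaks EM
print(f1_score("about 6000 hours", "6000 hours"))     # 0.8: F1 is more forgiving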

In general, there are multiple valid answers per question, so these metrics are calculated for each question-answer pair in the evaluation set, and the best score is selected over all possible answers.

Going Beyond Extractive QA

One interesting alternative to extracting answers as spans of text in a document is to generate them with a pretrained language model. This approach is often referred to as abstractive or generative QA and has the potential to produce better-phrased answers that synthesize evidence across multiple passages.

Retrieval-augmented generation (RAG) extends the classic retriever-reader architecture that we've seen by swapping the reader for a generator and using DPR as the retriever. The generator is a pretrained sequence-to-sequence transformer like T5 or BART that receives latent vectors of documents from DPR and then iteratively generates an answer based on the query and these documents. Since DPR and the generator are differentiable, the whole process can be fine-tuned end-to-end:

RAG

There are two types of RAG models to choose from:

  1. RAG-Sequence: Uses the same retrieved document to generate the complete answer. In particular, the top $k$ documents from the retriever are fed to the generator, which produces an output sequence for each document, and the result is marginalized to obtain the best answer (a short sketch using this variant follows the list).
  2. RAG-Token: Can use a different document to generate each token in the answer. This allows the generator to synthesize evidence from multiple documents.
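
Transformers includes RAG classes; the sketch below loads the facebook/rag-sequence-nq checkpoint with the small dummy retrieval index (the full Wikipedia index is very large), so treat it as a quick illustration rather than a production setup.

# RAG-Sequence sketch with a dummy retrieval index (for illustration only)
from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration

rag_tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
rag_retriever = RagRetriever.from_pretrained("facebook/rag-sequence-nq",
                                             index_name="exact",
                                             use_dummy_dataset=True)
rag_model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq",
                                                     retriever=rag_retriever)

rag_inputs = rag_tokenizer("How much music can this hold?", return_tensors="pt")
generated = rag_model.generate(input_ids=rag_inputs["input_ids"])
print(rag_tokenizer.batch_decode(generated, skip_special_tokens=True)[0])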

Making Transformers Efficient in Production

In this chapter we will explore four complementary techniques that can be used to speed up predictions and reduce the memory footprint of your transformer models: knowledge distillation, quantization, pruning, and graph optimization with the Open Neural Network Exchange (ONNX) format and ONNX Runtime (ORT).

Creating a Benchmark

Like other machine learning models, deploying transformers in production environments involves a trade-off among several constraints, the most common being the following (a minimal measurement sketch follows the list):

  • Model Performance
  • Latency
  • Memory
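
As a rough way to quantify the latter two constraints, the sketch below times the extractive QA pipeline from the previous chapter and measures the model's size on disk; the helper names, run counts, and temporary file path are arbitrary choices for the example.

# Rough latency and on-disk size measurements (illustrative helpers)
import time
from pathlib import Path
import numpy as np
import torch

def measure_latency(pipe, query, context, runs=100):
  # Warm up, then time individual forward passes in milliseconds
  for _ in range(10):
    pipe(question=query, context=context)
  latencies = []
  for _ in range(runs):
    start = time.perf_counter()
    pipe(question=query, context=context)
    latencies.append(1000 * (time.perf_counter() - start))
  return np.mean(latencies), np.std(latencies)

def model_size_mb(model, path="tmp_model.pt"):
  torch.save(model.state_dict(), path)
  size = Path(path).stat().st_size / (1024 ** 2)
  Path(path).unlink()
  return size

mean_ms, std_ms = measure_latency(pipe, question, context)
print(f"Average latency: {mean_ms:.1f} +/- {std_ms:.1f} ms")
print(f"Model size: {model_size_mb(model):.1f} MB")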

Making Models Smaller via Knowledge Distillation

Knowledge distillation is a general-purpose method for training a smaller student model to mimic the behavior of a slower, larger, but better-performing teacher. Given the trend toward pretraining language models with ever-increasing parameter counts, knowledge distillation has also become a popular strategy to compress these huge models and make them more suitable for building practical applications.

Knowledge Distillation for Fine-Tuning

For supervised tasks like fine-tuning, the main idea is to augment the ground truth labels with a distribution of "soft probabilities" from the teacher, which provide complementary information for the student to learn from. By training the student to mimic the output probabilities of the teacher, the goal is to distill some of the "dark knowledge" that the teacher has learned - that is, knowledge that is not available from the labels alone.

Suppose we feed an input sequence $x$ to the teacher to generate a vector of logits $\mathbf{z}(x) = [z_1(x), \ldots, z_N(x)]$. We can convert these logits into probabilities by applying a softmax function:

$$\frac{\exp(z_i(x))}{\sum_j \exp(z_j(x))}$$

We want to soften the probabilities by scaling the logits with a temperature hyperparameter $T$ before applying the softmax, so that the student learns more than just the ground truth labels:

$$p_i(x) = \frac{\exp(z_i(x)/T)}{\sum_j \exp(z_j(x)/T)}$$

Since the student also produces softened probabilities $q_i(x)$ of its own, we can use the Kullback-Leibler (KL) divergence to measure the difference between the two probability distributions:

$$D_{KL}(p, q) = \sum_i p_i(x) \log \frac{p_i(x)}{q_i(x)}$$

With the KL divergence we can calculate how much is lost when we approximate the probability distribution of the teacher with the student. This allows us to define a knowledge distillation loss:

$$L_{KD} = T^2 D_{KL}$$

where $T^2$ is a normalization factor that accounts for the fact that the magnitude of the gradients produced by soft labels scales as $1/T^2$. For classification tasks, the student loss is then a weighted average of the distillation loss and the usual cross-entropy loss $L_{CE}$ of the ground truth labels:

$$L_{\text{student}} = \alpha L_{CE} + (1 - \alpha) L_{KD}$$

where $\alpha$ is a hyperparameter that controls the relative strength of each loss.
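
In PyTorch this loss is only a few lines; the book wires it into a custom Trainer subclass, whereas the standalone sketch below just computes it from raw logits (the tensor shapes and the alpha/T values are illustrative):

# Knowledge distillation loss sketch: soften teacher/student logits with
# temperature T, compare with KL divergence, and mix with cross-entropy
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
  # Soft targets: KL divergence between temperature-scaled distributions,
  # rescaled by T^2 to keep gradient magnitudes comparable
  kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                F.softmax(teacher_logits / T, dim=-1),
                reduction="batchmean") * T ** 2
  # Hard targets: ordinary cross-entropy against the ground truth labels
  ce = F.cross_entropy(student_logits, labels)
  return alpha * ce + (1 - alpha) * kd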

Knowledge Distillation Process

Knowledge Distillation for Pretraining

Knowledge distillation can also be used during pretraining to create a general-purpose student that can be subsequently fine-tuned on downstream tasks.

Choosing a Good Student Initialization

A good rule of thumb from the literature is that knowledge distillation works best when the teacher and student are of the same model type.

Making Models Faster with Quantization

Knowledge distillation can reduce the computational and memory cost of running inference by transferring the information from a teacher into a smaller student. Quantization takes a different approach: instead of reducing the number of computations, it makes them much more efficient by representing the weights and activations with low-precision data types like 8-bit integers instead of the usual 32-bit floating-point numbers. Reducing the number of bits means the resulting model requires less memory storage, and operations like matrix multiplication can be performed much faster with integer arithmetic. These performance gains can be realized with little to no loss in accuracy.

The basic idea behind quantization is that we can "discretize" the floating-point values $f$ in each tensor by mapping their range $[f_{max}, f_{min}]$ into a smaller one $[q_{max}, q_{min}]$ of fixed-point numbers $q$, and linearly distributing all values in between. Mathematically, this mapping is described by:

$$f = \left(\frac{f_{max} - f_{min}}{q_{max} - q_{min}}\right)(q - Z) = S(q - Z)$$

where the scale factor $S$ is a positive floating-point number and the constant $Z$ has the same type as $q$ and is called the zero point because it corresponds to the quantized value of the floating-point value $f = 0$. The map needs to be affine so that we get back floating-point numbers when we dequantize the fixed-point ones. One of the main reasons why transformers (and deep neural networks more generally) are prime candidates for quantization is that the weights and activations tend to take values in relatively small ranges.
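
To see the mapping in action, here is a manual sketch that quantizes a random tensor to signed 8-bit integers using the scale and zero point defined above (an illustration of the arithmetic, not how you would quantize a model in practice):

# Manual affine quantization of a tensor to signed 8-bit integers
import torch

weights = torch.randn(1000)                       # stand-in for a weight tensor
q_min, q_max = -128, 127                          # int8 range
f_min, f_max = weights.min().item(), weights.max().item()

scale = (f_max - f_min) / (q_max - q_min)         # S
zero_point = int(round(q_max - f_max / scale))    # Z, the quantized value of f = 0

quantized = torch.clamp(torch.round(weights / scale + zero_point),
                        q_min, q_max).to(torch.int8)
dequantized = scale * (quantized.float() - zero_point)
print((weights - dequantized).abs().max())        # small quantization error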

For deep neural networks, there are typically three main approaches to quantization:

  1. Dynamic Quantization: Nothing is changed during training; the weights are converted to a low-precision format ahead of inference and the activations are quantized on the fly (a short example follows this list).
  2. Static Quantization: The quantization scheme for the activations is precomputed ahead of time by observing their distribution on a representative sample of data, which avoids on-the-fly conversion at inference time.
  3. Quantization-Aware Training: The effect of quantization is simulated during training so the model can adapt to the reduced precision.
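
For example, dynamic quantization in PyTorch is essentially a one-liner that swaps the linear layers for int8 versions; the checkpoint below is just an illustrative stand-in.

# Dynamic quantization sketch: nn.Linear layers are replaced with int8 versions
import torch
from torch import nn
from transformers import AutoModelForSequenceClassification

clf_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
clf_quantized = torch.quantization.quantize_dynamic(clf_model, {nn.Linear},
                                                    dtype=torch.qint8)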

Optimizing Inference with ONNX and the ONNX Runtime

ONNX is an open standard that defines a common set of operators and a common file format to represent deep learning models in a wide variety of frameworks, including PyTorch and TensorFlow. When a model is exported to the ONNX format, these operators are used to construct a computational graph (often called an intermediate representation) that represents the flow of data through the neural network.

ONNX Graph Example

By exposing a graph with standardized operators and data types, ONNX makes it easy to switch between frameworks. Where ONNX really shines is when it is coupled with a dedicated accelerator like ONNX Runtime, or ORT for short. ORT provides tools to optimize the ONNX graph through techniques like operator fusion and constant folding, and defines an interface to execution providers that allow you to run the model on different types of hardware. This is a powerful abstraction.
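
One possible route is to export the model with torch.onnx and load the graph with ONNX Runtime; the sketch below reuses the QA model and tokenizer from earlier, and the export arguments (opset version, axis names, output path) are assumptions rather than a vetted recipe.

# Export the QA model to ONNX and run it with ONNX Runtime (illustrative settings)
import torch
from onnxruntime import GraphOptimizationLevel, InferenceSession, SessionOptions

model.config.return_dict = False                  # export plain tuples instead of ModelOutput
onnx_inputs = tokenizer(question, context, return_tensors="pt")
torch.onnx.export(model,
                  (onnx_inputs["input_ids"], onnx_inputs["attention_mask"]),
                  "model.onnx",
                  input_names=["input_ids", "attention_mask"],
                  output_names=["start_logits", "end_logits"],
                  dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                                "attention_mask": {0: "batch", 1: "seq"}},
                  opset_version=13)

# ORT applies graph optimizations such as operator fusion and constant folding
options = SessionOptions()
options.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL
session = InferenceSession("model.onnx", options, providers=["CPUExecutionProvider"])
ort_feed = {"input_ids": onnx_inputs["input_ids"].numpy(),
            "attention_mask": onnx_inputs["attention_mask"].numpy()}
print(session.run(None, ort_feed)[0].shape)       # start logits from the ONNX graph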

Architecture of the ONNX and ONNX Runtime Ecosystem

Making Models Sparser with Weight Pruning

In this section we look at how to shrink the number of parameters in our model by identifying and removing the least important weights in the network.

Sparsity in Deep Neural Networks

The main idea behind pruning is to gradually remove weight connections (and potentially neurons) during training such that the model becomes progressively sparser. The resulting pruned model has a smaller number of nonzero parameters, which can then be stored in a compact sparse matrix format. Pruning can also be combined with quantization to obtain further compression.

Weight Pruning

Weight Pruning Methods

Mathematically, the way most weight pruning methods work is to calculate a matrix $\mathbf{S}$ of importance scores and then select the top $k$ percent of weights by importance:

$$\text{Top}_k(\mathbf{S})_{ij} = \begin{cases} 1 & \text{if } S_{ij} \text{ in top } k \text{ percent} \\ 0 & \text{otherwise} \end{cases}$$

In effect, $k$ acts as a new hyperparameter that controls the amount of sparsity in the model - that is, the proportion of weights that are zero-valued. Lower values of $k$ correspond to sparser matrices. From these scores we can then define a mask matrix $\mathbf{M}$ that masks the weights $W_{ij}$ during the forward pass with some input $x_i$ and effectively creates a sparse network of activations $a_i$:

$$a_i = \sum_k W_{ik} M_{ik} x_k$$

Magnitude Pruning

Magnitude pruning calculates the scores according to the magnitude of the weights, $\mathbf{S} = (\lvert W_{ij} \rvert)_{1 \leq i, j \leq n}$, and then derives the masks from $\mathbf{M} = \text{Top}_k(\mathbf{S})$. In the literature it is common to apply magnitude pruning in an iterative fashion: first train the model to learn which connections are important, then prune the weights of least importance. The sparse model is then retrained and the process repeated until the desired sparsity is reached. One drawback of this approach is that it is computationally demanding. Another problem with magnitude pruning is that it is really designed for pure supervised learning, where the importance of each weight is directly related to the task at hand. By contrast, in transfer learning the importance of the weights is primarily determined by the pretraining phase, so magnitude pruning can remove connections that are important for the fine-tuning task.
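
A rough sketch of a single magnitude-pruning step on one weight matrix (purely illustrative; in practice utilities such as torch.nn.utils.prune handle the masking and reparameterization):

# One magnitude-pruning step: keep the top-k percent of weights by |W_ij|
import torch

def magnitude_prune(weight, k=0.3):
  scores = weight.abs()                                  # S = |W|
  threshold = torch.quantile(scores.flatten(), 1 - k)    # cutoff for the top k percent
  mask = (scores >= threshold).float()                   # M = Top_k(S)
  return weight * mask, mask                             # sparse weights plus mask

W = torch.randn(6, 6)
W_pruned, M = magnitude_prune(W, k=0.3)
print(f"Sparsity: {(M == 0).float().mean():.2f}")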

Movement Pruning

The basic idea behind movement pruning is to gradually remove weights during fine-tuning such that the model becomes progressively sparser. The key novelty is that both the weights and the scores are learned during fine-tuning. The intuition behind movement pruning is that the weights that move the most from zero are the most important ones to keep.

Chapter 9: Dealing with Few to No Labels

Absence of Large Amounts of Data

Implementing a Naive Bayes Baseline

Whenever you start a new NLP project, it's always a good idea to implement a set of strong baselines. There are two major reasons for this:

  1. A baseline based on regular expressions, handcrafted rules, or a very simple model might already work really well to solve the problem. In these cases, there is no reason to bring out big guns like transformers, which are generally more complex to deploy and maintain in production environments.
  2. The baselines provide quick checks as you explore more complex models.

Working with No Labeled Data

Zero-shot classification is suitable in settings where you have no labeled data at all. This is surprisingly common in industry, and might occur because there is no historical data with labels or because acquiring labels for the data is difficult. The goal of zero-shot classification is to make use of a pretrained model without any fine-tuning on your task-specific corpus.
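
The zero-shot classification pipeline in Transformers frames this as a natural language inference problem under the hood; a minimal sketch, where the input text and candidate label names are just examples:

# Zero-shot classification sketch: an NLI model scores each candidate label
from transformers import pipeline

zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
text = "The screen is hard to read in direct sunlight."
labels = ["display", "battery", "price", "shipping"]
print(zero_shot(text, candidate_labels=labels, multi_label=True))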

Working with a Few Labels

One simple but effective way to boost the performance of text classifiers on small datasets is to apply data augmentation to generate new examples from existing ones. In practice, there are two types of data augmentation that are commonly used (a small sketch follows this list):

  • Back Translation: Take a text in the source language, translate it into one or more target languages using machine translation, and then translate it back to the source language.
  • Token Permutation: Given a text from the training set, randomly choose and perform simple transformations like synonym replacement, word insertion, swap, or deletion.
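
As a toy illustration of token perturbations (libraries such as nlpaug offer richer, model-based augmentations), the helper below applies random adjacent swaps and deletions; the probabilities are arbitrary choices.

# Naive token-perturbation augmentation: random swap and deletion
import random

def augment(text, p_delete=0.1, n_swaps=1, seed=None):
  rng = random.Random(seed)
  tokens = text.split()
  # Randomly swap adjacent tokens
  for _ in range(n_swaps):
    if len(tokens) > 1:
      i = rng.randrange(len(tokens) - 1)
      tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
  # Randomly delete tokens (keep the original if everything gets dropped)
  tokens = [t for t in tokens if rng.random() > p_delete] or tokens
  return " ".join(tokens)

print(augment("the battery dies after a couple of hours", seed=0))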

Using Embeddings as a Lookup Table

Large language models such as GPT-3 have been shown to be excellent at solving tasks with limited data. The reason is that these models learn useful representations of text that encode information across many dimensions, such as sentiment, topic, text structure, and more. For this reason, the embeddings of large language models can be used to develop a semantic search engine, find similar documents or comments, or even classify text.
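
As a rough sketch of the idea (the book pairs transformer embeddings with a FAISS index; here a small sentence-transformers checkpoint, mean pooling, and plain cosine similarity stand in, so treat those choices as assumptions):

# Embedding lookup sketch: embed texts with mean pooling, then find nearest
# neighbors by cosine similarity
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

emb_ckpt = "sentence-transformers/all-MiniLM-L6-v2"   # example checkpoint
emb_tok = AutoTokenizer.from_pretrained(emb_ckpt)
emb_model = AutoModel.from_pretrained(emb_ckpt)

def embed(texts):
  batch = emb_tok(texts, padding=True, truncation=True, return_tensors="pt")
  with torch.no_grad():
    hidden = emb_model(**batch).last_hidden_state
  mask = batch["attention_mask"].unsqueeze(-1)
  return (hidden * mask).sum(1) / mask.sum(1)          # mean pooling over tokens

corpus = ["Sound is muffled at high volume", "Battery barely lasts a day",
          "Keys feel mushy when typing"]
sims = F.cosine_similarity(embed(["How long does the battery last?"]), embed(corpus))
print(corpus[int(sims.argmax())])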

Fine-Tuning a Vanilla Transformer

If we have access to labeled data, we can also try to do the obvious thing: simply fine-tune a pretrained transformer model.
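
A minimal fine-tuning sketch with the Trainer API; the checkpoint, the placeholder `dataset` object, and the hyperparameters are assumptions for illustration.

# Vanilla fine-tuning sketch with the Trainer API (placeholder dataset and labels)
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

ft_ckpt = "bert-base-uncased"
ft_tokenizer = AutoTokenizer.from_pretrained(ft_ckpt)
ft_model = AutoModelForSequenceClassification.from_pretrained(ft_ckpt, num_labels=2)

def tokenize(batch):
  return ft_tokenizer(batch["text"], truncation=True, padding=True)

# `dataset` is assumed to be a DatasetDict with "train"/"validation" splits
encoded = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="finetuned-baseline",
                         num_train_epochs=3,
                         per_device_train_batch_size=16,
                         learning_rate=2e-5)
trainer = Trainer(model=ft_model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"],
                  tokenizer=ft_tokenizer)
trainer.train()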

Chapter 11: Future Directions

Scaling Transformers

There are now signs that a similar lesson is at play with transformers; while many of the early BERT and GPT descendants focused on tweaking the architecture or pretraining objectives, the best-performing models in mid-2021, like GPT-3, are essentially basic scaled-up versions of the original models without many architectural modifications.

Scaling Laws

Scaling laws allow one to empirically quantify the "bigger is better" paradigm for language models by studying their behavior with varying compute budget $C$, dataset size $D$, and model size $N$. The basic idea is to chart the dependence of the cross-entropy loss $L$ on these three factors and determine whether a relationship emerges. For autoregressive models like those in the GPT family, the resulting loss curves are shown below, where each blue curve represents the training run of a single model.

Scaling Laws

From the loss curves we can draw conclusions about:

  • The relationship of performance and scale: The implication of scaling laws is that a more productive path toward better models is to focus on increasing $N$, $C$, and $D$ in tandem.
  • Smooth Power Laws: The loss follows a power-law relationship with each of $N$, $C$, and $D$ when it is not bottlenecked by the other two (the general form is sketched after this list).
  • Sample Efficiency: Large models are able to reach the same performance as smaller models with a smaller number of training steps.
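
For reference, the smooth power laws reported by Kaplan et al. take roughly the following form, where $X$ stands for $N$, $D$, or $C$, $X_0$ is a fitted constant, and $\alpha_X$ is an empirically fitted exponent; the exact constants depend on the setup, so treat this as the general shape rather than a precise result:

$$L(X) = \left(\frac{X_0}{X}\right)^{\alpha_X}$$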

Challenges with Scaling

  • Infrastructure: Provisioning and managing infrastructure that potentially spans hundreds of thousands of nodes with as many GPUs is not for the faint-hearted.
  • Cost
  • Dataset curation: A model is only as good as the data it is trained on. Training large models requires large, high-quality datasets.
  • Model evaluation
  • Deployment

Attention

In terms of time and memory complexity, the self-attention layer of the Transformer architecture naively scales like $O(n^2)$, where $n$ is the length of the sequence. Much of the recent research on transformers has focused on making self-attention more efficient.
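
To see where the quadratic cost comes from, here is a bare-bones scaled dot-product attention for a single head; the n x n score matrix is what dominates memory and compute for long sequences (shapes below are arbitrary examples):

# Naive self-attention: the scores tensor has shape [n, n], hence O(n^2)
import math
import torch

n, d = 512, 64                       # sequence length, head dimension
Q = torch.randn(n, d)
K = torch.randn(n, d)
V = torch.randn(n, d)

scores = Q @ K.T / math.sqrt(d)      # [n, n] attention scores
weights = torch.softmax(scores, dim=-1)
outputs = weights @ V                # [n, d] attended values
print(scores.shape)                  # torch.Size([512, 512])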

Attention Research

Sparse Attention

One way to reduce the number of computations that are performed in the self-attention layer is to simply limit the number of query-key pairs that are generated according to some predefined pattern.

Sparse Attention Patterns

In practice, most transformer models with sparse attention use a mix of the atomic sparsity patterns to generate the final attention matrix.

Linearized Attention

An alternative way to make self-attention more efficient is to change the order of operations that are involved in computing the attention scores.
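
A hedged sketch of the linearized idea, in the spirit of linear transformer variants: replacing the softmax with a kernel feature map lets us compute the key-value product first, which is linear in the sequence length. The feature map below (elu + 1) is one common choice from this family, not the only one, and the sketch omits the causal masking used in autoregressive models.

# Linearized attention sketch: compute phi(K)^T V first, so no n x n matrix appears
import torch
import torch.nn.functional as F

def feature_map(x):
  return F.elu(x) + 1                            # simple positive feature map

n, d = 512, 64
Q, K, V = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)

q, k = feature_map(Q), feature_map(K)
kv = k.T @ V                                     # [d, d]: cost is linear in n
normalizer = q @ k.sum(dim=0, keepdim=True).T    # [n, 1] normalization term
outputs = (q @ kv) / normalizer                  # [n, d] approximate attention
print(outputs.shape)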

Going Beyond Text

There are limits to LLMs:

  • Human reporting bias: the frequencies of events in text may not represent their true frequencies.
  • Common Sense: Common sense is a fundamental quality of human reasoning, but it is rarely written down.
  • Facts: A probabilistic language model cannot store facts in a reliable way and can produce text that is factually wrong.
  • Modality: Language models have no way to connect to other modalities that could address the previous points, such as audio or visual signals or tabular data.