Question Answering, Making Transformers Efficient, Dealing with Few Labels, and Future Directions
These last chapters of Natural Language Processing with Transformers cover question answering, making transformers efficient through techniques such as weight pruning and quantization, methods for dealing with few labels, and a review of future directions for LLMs.
Question Answering
There are many flavors of question answering (QA), but the most common is extractive QA, which involves questions whose answers can be identified as a span of text in a document, where the document might be a web page, legal contract, or news article. The two-stage process of first retrieving relevant documents and then extracting answers from them is also the basis for many modern QA systems, including semantic search engines, intelligent assistants, and automated information extractors. Community QA involves gathering question-answer pairs that are generated by users on forums like Stack Overflow and using semantic similarity search to find the closest matching answer to a new question. Long-form QA aims to generate complex, paragraph-length answers to open-ended questions like "Why is the sky blue?"
Building a Review-Based QA System
Closed-domain QA deals with questions about a narrow topic (e.g., a single product category), while open-domain QA deals with questions about almost anything (e.g., Amazon's whole product catalog).
!pip install datasets
from datasets import get_dataset_config_names
domains = get_dataset_config_names("subjqa")
domains
from datasets import load_dataset
subjqa = load_dataset("subjqa", name="electronics")
print(subjqa["train"]["answers"][1])
import pandas as pd
dfs = {split: dset.to_pandas() for split, dset in subjqa.flatten().items()}
for split, df in dfs.items():
    print(f"Number of questions in {split}: {df['id'].nunique()}")
qa_cols = ["title", "question", "answers.text",
"answers.answer_start", "context"]
sample_df = dfs["train"][qa_cols].sample(2, random_state=7)
sample_df
# Use answer_start and the answer text length to slice the answer span out of the context
start_idx = sample_df["answers.answer_start"].iloc[0][0]
end_idx = start_idx + len(sample_df["answers.text"].iloc[0][0])
sample_df["context"].iloc[0][start_idx:end_idx]
import matplotlib.pyplot as plt

# Count how many training questions start with each common question word
counts = {}
question_types = ["What", "How", "Is", "Does", "Do", "Was", "Where", "Why"]
for q in question_types:
    counts[q] = dfs["train"]["question"].str.startswith(q).value_counts()[True]
ax = pd.Series(counts).sort_values().plot.barh()
ax.set_title("Frequency of Question Types")
plt.show()
for question_type in ["How", "What", "Is"]:
for question in (dfs["train"][dfs["train"].question.str.startswith(question_type)].sample(n=3, random_state=42)['question']):
print(question)
Extracting Answers from Text
The first thing we'll need for our QA system is to find a way to identify a potential answer as a span of text in a customer review. To do this we'll need to understand how to:
- Frame the supervised learning problem
- Tokenize and encode text for QA tasks
- Deal with long passages that exceed a model's maximum context size
Span Classification
The most common way to extract answers from text is by framing the problem as a span classification task, where the start and end tokens of an answer span act as the labels that a model needs to predict.
For extractive QA, we can actually start with a fine-tuned model since the structure of the labels remains the same across datasets.
from transformers import AutoTokenizer
model_ckpt = "deepset/minilm-uncased-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
question = "How much music can this hold?"
context = """An MP3 is about 1 MB/minute, so about 6000 hours depending on \
file size."""
inputs = tokenizer(question, context, return_tensors="pt")
print(tokenizer.decode(inputs["input_ids"][0]))
import torch
from transformers import AutoModelForQuestionAnswering
model = AutoModelForQuestionAnswering.from_pretrained(model_ckpt)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs)
start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(f"Input IDs shape: {inputs.input_ids.size()}")
print(f"Start logits shape: {start_logits.size()}")
print(f"End logits shape: {end_logits.size()}")
# Take the most likely start and end positions to decode the answer span
start_idx = torch.argmax(start_logits)
end_idx = torch.argmax(end_logits) + 1
answer_span = inputs["input_ids"][0][start_idx:end_idx]
answer = tokenizer.decode(answer_span)
print(f"Question: {question}")
print(f"Answer: {answer}")
from transformers import pipeline
pipe = pipeline("question-answering", model=model, tokenizer=tokenizer)
pipe(question=question, context=context, topk=3)
pipe(question="Why is there no data?", context=context,handle_impossible_answer=True)
Dealing with Long Passages
One subtlety faced by reading comprehension models is that the context often contains more tokens than the maximum sequence length of the model. The standard way to deal with this is to apply a sliding window across the inputs, where each window contains a passage of tokens that fits in the model's context.
example = dfs["train"].iloc[0][["question", "context"]]
tokenized_example = tokenizer(example["question"], example["context"],return_overflowing_tokens=True, # enables sliding window
max_length=100,stride=25)
for idx, window in enumerate(tokenized_example["input_ids"]):
print(f"Window #{idx} has {len(window)} tokens")
Using Haystack to Build a QA Pipeline
Modern QA systems are typically based on the retriever-reader architecture, which has two main components:
- Retriever: Responsible for retrieving relevant documents for a given query. Retrievers are usually categorized as sparse or dense. Sparse retrievers use word frequencies to represent each document and query as a sparse vector; the relevance of a query and a document is then determined by computing an inner product of the vectors. Dense retrievers use encoders like transformers to represent the query and document as contextualized embeddings (which are dense vectors). These embeddings encode semantic meaning and allow dense retrievers to improve search accuracy by understanding the content of the query.
- Reader: Responsible for extracting an answer from the documents provided by the retriever. The reader is usually a reading comprehension model.
To build the QA system, we'll use the Haystack library developed by deepset, a German company focused on NLP. Haystack is based on the retriever-reader architecture, abstracts much of the complexity involved in building these systems, and integrates tightly with 🤗 Transformers. In addition to the retriever and reader, there are two more components involved when building a QA pipeline with Haystack:
- Document Store: A document-oriented DB that stores documents and metadata which are provided to the retriever at query time
- Pipeline: Combines all the components of a QA system to enable custom query flows, merging documents from multiple retrievers, and more.
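A minimal sketch of how these pieces fit together; the class names follow the Haystack 1.x API (earlier and 2.x releases differ), and the running Elasticsearch instance, toy document, and model checkpoint are assumptions for illustration:
# Retriever-reader pipeline sketch with Haystack 1.x (requires Elasticsearch running locally)
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import BM25Retriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

# Document store holds the review texts plus metadata for the retriever to search
document_store = ElasticsearchDocumentStore(host="localhost", index="document")
document_store.write_documents([{"content": "An MP3 is about 1 MB/minute, so about 6000 hours."}])

retriever = BM25Retriever(document_store=document_store)                 # sparse retriever
reader = FARMReader(model_name_or_path="deepset/minilm-uncased-squad2")  # extractive reader

pipe = ExtractiveQAPipeline(reader=reader, retriever=retriever)
preds = pipe.run(query="How much music can this hold?",
                 params={"Retriever": {"top_k": 3}, "Reader": {"top_k": 1}})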
Improving a QA Pipeline
Evaluating the Retriever
A common metric for evaluating retrievers is recall, which measures the fraction of all relevant documents that are retrieved. In this context, "relevant" simply means whether the answer is present in a passage of text or not, so given a set of questions, we can compute recall by counting the number of times an answer appears in the top k documents returned by the retriever. A complementary metric to recall is mean average precision (mAP), which rewards retrievers that can place the correct answers higher up in the ranking.
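A toy sketch of recall@k under this definition; the function and inputs are illustrative, not Haystack's evaluation API:
# recall@k: fraction of questions whose answer string appears in the top-k retrieved passages
def recall_at_k(answers, retrieved_passages, k=3):
    hits = 0
    for answer, passages in zip(answers, retrieved_passages):
        if any(answer.lower() in passage.lower() for passage in passages[:k]):
            hits += 1
    return hits / len(answers)

answers = ["about 6000 hours"]
retrieved = [["This player has 32 GB of storage.",
              "An MP3 is about 1 MB/minute, so about 6000 hours depending on file size."]]
print(recall_at_k(answers, retrieved, k=2))  # 1.0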
Dense Passage Retrieval
One promising alternative is to use dense embeddings to represent the question and document, and the current state of the art is an architecture known as Dense Passage Retrieval (DPR). The main idea behind DPR is to use two BERT models as encoders, one for the question and one for the passage.
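A minimal sketch of DPR-style scoring with the public NQ-trained encoders in 🤗 Transformers; the checkpoints and example texts are illustrative only:
# Two separate BERT encoders: one for the question, one for the passage
import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

question = "How much music can this hold?"
passage = "An MP3 is about 1 MB/minute, so about 6000 hours depending on file size."
with torch.no_grad():
    q_emb = q_encoder(**q_tokenizer(question, return_tensors="pt")).pooler_output
    p_emb = ctx_encoder(**ctx_tokenizer(passage, return_tensors="pt")).pooler_output
score = (q_emb @ p_emb.T).item()  # relevance = inner product of the dense embeddings
print(score)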
Evaluating the Reader
In extractive QA, there are two main metrics that are used for evaluating readers:
- Exact Match: A binary metric that gives EM = 1 if the characters in the predicted and ground truth answers match exactly, and EM = 0 otherwise. If no answer is expected, the model gives EM = 0 if it predicts any text at all.
- F1 score: Measures the harmonic mean of the precision and recall.
Under the hood, these functions first normalize the prediction and label by removing punctuation, fixing whitespace, and converting to lowercase. The normalized strings are then tokenized as a bag-of-words, before finally computing the metric at the token level. EM is a much stricter metric than the F1 score: adding a single extra token to the prediction gives an EM of zero. On the other hand, the F1 score can fail to catch truly incorrect answers. Relying on just the F1 score is thus misleading, and tracking both metrics is a good strategy to balance the trade-off between underestimating (EM) and overestimating (F1 score) model performance.
In general, there are multiple valid answers per question, so these metrics are calculated for each question-answer pair in the evaluation set, and the best score is selected over all possible answers.
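A simple sketch of these two metrics, following the normalization steps described above (a full evaluation script, e.g. the SQuAD one, also handles the no-answer case and the maximum over multiple gold answers):
import re, string
from collections import Counter

def normalize(text):
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub(r"\s+", " ", text).strip()

def exact_match(prediction, truth):
    # Binary: 1 only if the normalized strings are identical
    return int(normalize(prediction) == normalize(truth))

def f1_score(prediction, truth):
    # Token-level harmonic mean of precision and recall over the bag-of-words
    pred_tokens, truth_tokens = normalize(prediction).split(), normalize(truth).split()
    num_same = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("about 6000 hours", "6000 hours"))  # 0: EM is strict
print(f1_score("about 6000 hours", "6000 hours"))     # 0.8: partial credit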
Going Beyond Extractive QA
One interesting alternative to extracting answers as spans of text in a document is to generate them with a pretrained language model. This approach is often referred to as abstractive or generative QA and has the potential to produce better-phrased answers that synthesize evidence across multiple passages.
Retrieval-augmented generation (RAG) extends the classic retriever-reader architecture that we've seen by swapping the reader for a generator and using DPR as the retriever. The generator is a pretrained sequence-to-sequence transformer like T5 or BART that receives latent vectors of documents from DPR and then iteratively generates an answer based on the query and these documents. Since DPR and the generator are differentiable, the whole process can be fine-tuned end-to-end.
There are two types of RAG models to choose from:
- RAG-Sequence: Uses the same retrieved document to generate the complete answer. In particular, the top k documents from the retriever are fed to the generator, which produces an output sequence for each document, and the result is marginalized to obtain the best answer.
- RAG-Token: Can use a different document to generate each token in the answer. This allows the generator to synthesize evidence from multiple documents.
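A minimal RAG-Token sketch using the pretrained facebook/rag-token-nq checkpoint, following the Hugging Face documentation; it additionally requires the datasets and faiss packages, and the dummy index used here is only for illustration:
from transformers import RagRetriever, RagTokenForGeneration, RagTokenizer

tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
# The dummy dataset avoids downloading the full Wikipedia index
retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact",
                                         use_dummy_dataset=True)
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

inputs = tokenizer("Why is the sky blue?", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])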
Making Transformers Efficient in Production
In this chapter we will explore four complementary techniques that can be used to speed up the predictions and reduce the memory footprint of your transformer models: knowledge distillation, quantization, pruning, and graph optimization with the Open Neural Network Exchange (ONNX) format and ONNX Runtime (ORT).
Creating a Benchmark
Like other machine learning models, deploying transformers in production environments involves a trade-off among several constraints, the most common being:
- Model Performance
- Latency
- Memory
Making Models Smaller via Knowledge Distillation
Knowledge distillation is a general-purpose method for training a smaller student model to mimic the behavior of a slower, larger, but better-performing teacher. Given the trend toward pretraining language models with ever-increasing parameter counts, knowledge distillation has also become a popular strategy to compress these huge models and make them more suitable for building practical applications.
Knowledge Distillation for Fine-Tuning
For supervised tasks like fine-tuning, the main idea is to augment the ground truth labels with a distribution of "soft probabilities" from the teacher which provide complementary information for the student to learn from. By training the student to mimic the output probabilities of the teacher, the goal is to distill some of the "dark knowledge" that the teacher has learned - that is, knowledge that is not available from the labels alone.
Suppose we feed an input sequence x to the teacher to generate a vector of logits z(x) = [z_1(x), …, z_N(x)]. We can convert these logits into probabilities by applying a softmax function: p_i(x) = exp(z_i(x)) / Σ_j exp(z_j(x)).
We want to soften the probabilities by scaling the logits with a temperature hyperparameter T before applying the softmax, so that the student learns more than just the ground truth labels: p_i(x) = exp(z_i(x)/T) / Σ_j exp(z_j(x)/T). The higher the temperature, the softer the resulting distribution.
Since the student also produces softened probabilities q_i(x) of its own, we can use the Kullback-Leibler (KL) divergence to measure the difference between the two probability distributions: D_KL(p, q) = Σ_i p_i(x) log(p_i(x) / q_i(x)).
With the KL divergence we can calculate how much is lost when we approximate the probability distribution of the teacher with the student. This allows us to define a knowledge distillation loss: L_KD = T^2 D_KL(p, q),
where T^2 is a normalization factor that accounts for the fact that the magnitude of the gradients produced by soft labels scales as 1/T^2. For classification tasks, the student loss is then a weighted average of the distillation loss and the usual cross-entropy loss L_CE on the ground truth labels: L_student = α L_CE + (1 − α) L_KD,
where α is a hyperparameter that controls the relative strength of each loss.
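A minimal PyTorch sketch of this combined loss; the function name and the default values of T and α are illustrative choices:
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Softened distributions: teacher probabilities vs. student log-probabilities
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence, scaled by T^2 to keep gradient magnitudes comparable
    loss_kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * T ** 2
    # Usual cross-entropy loss on the ground truth labels
    loss_ce = F.cross_entropy(student_logits, labels)
    return alpha * loss_ce + (1 - alpha) * loss_kd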
Knowledge Distillation for Pretraining
Knowledge distillation can also be used during pretraining to create a general-purpose student that can be subsequently fine-tuned on downstream tasks.
Choosing a Good Student Initialization
A good rule of thumb from the literature is that knowledge distillation works best when the teacher and student are of the same model type.
Making Models Faster with Quantization
Knowledge distillation reduces the computational and memory cost of running inference by transferring the information from a teacher into a smaller student. Quantization takes a different approach; instead of reducing the number of computations, it makes them much more efficient by representing the weights and activations with low-precision data types like 8-bit integers instead of the usual 32-bit floating-point numbers. Reducing the number of bits means the resulting model requires less memory storage, and operations like matrix multiplication can be performed much faster with integer arithmetic. These performance gains can be realized with little to no loss in accuracy.
The basic idea behind quantization is that we can "discretize" the floating-point values f in each tensor by mapping their range [f_min, f_max] into a smaller range [q_min, q_max] of fixed-point numbers q, and linearly distributing all values in between. Mathematically, this mapping is described by f = S (q − Z),
where the scale factor S is a positive floating-point number and the constant Z has the same type as q and is called the zero point because it corresponds to the quantized value of the floating-point value f = 0. The map needs to be affine so that we get back floating-point numbers when we dequantize the fixed-point ones. One of the main reasons why transformers (and deep neural networks more generally) are prime candidates for quantization is that the weights and activations tend to take values in relatively small ranges.
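A toy illustration of this affine map with PyTorch's quantized tensors; the symmetric scheme (zero point Z = 0) and the random weights are arbitrary choices, not any framework's default:
import torch

weights = torch.randn(1000)
scale = weights.abs().max().item() / 127          # symmetric quantization: Z = 0
q = torch.quantize_per_tensor(weights, scale, 0, torch.qint8)
print(q.int_repr()[:5])    # the fixed-point values q
print(q.dequantize()[:5])  # S * (q - Z), approximately the original floats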
For deep neural networks, there are typically three main approaches to quantization:
- Dynamic quantization: Nothing is changed during training; the weights are converted to a lower-precision type and the activations are quantized on the fly, so the adaptations are only performed during inference.
- Static quantization: The quantization scheme is computed ahead of inference by observing the activations on a representative sample of data, which avoids the on-the-fly conversion.
- Quantization-aware training: The effect of quantization is simulated during training, so the model learns to be robust to the lower precision.
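As a sketch of the first approach, here is dynamic quantization applied to the QA checkpoint used earlier with PyTorch's quantize_dynamic; no calibration data is needed:
import torch
from torch.quantization import quantize_dynamic
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained("deepset/minilm-uncased-squad2")
# Replace the nn.Linear layers with dynamically quantized int8 versions
model_quantized = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)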
Optimizing Inference with ONNX and the ONNX Runtime
ONNX is an open standard that defines a common set of operators and a common file format to represent deep learning models in a wide variety of frameworks, including PyTorch and TensorFlow. When a model is exported to the ONNX format, these operators are used to construct a computational graph (often called an intermediate representation) that represents the flow of data through the neural network.
By exposing a graph with standardized operators and data types, ONNX makes it easy to switch between frameworks. Where ONNX really shines is when it is coupled with a dedicated accelerator like ONNX Runtime, or ORT for short. ORT provides tools to optimize the ONNX graph through techniques like operator fusion and constant folding, and defines an interface to execution providers that allow you to run the model on different types of hardware. This is a powerful abstraction.
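A minimal sketch of the export-and-run workflow using the 🤗 Optimum wrapper around ONNX Runtime; the ORTModelForQuestionAnswering class and the export=True flag follow recent Optimum releases and may differ in older versions:
from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer, pipeline

model_ckpt = "deepset/minilm-uncased-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
# Export the PyTorch model to ONNX and load it behind an ONNX Runtime session
ort_model = ORTModelForQuestionAnswering.from_pretrained(model_ckpt, export=True)

onnx_pipe = pipeline("question-answering", model=ort_model, tokenizer=tokenizer)
print(onnx_pipe(question="How much music can this hold?",
                context="An MP3 is about 1 MB/minute, so about 6000 hours."))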
Making Models Sparser with Weight Pruning
In this section we look at how to shrink the number of parameters in our model by identifying and removing the least important weights in the network.
Sparsity in Deep Neural Networks
The main idea behind pruning is to gradually remove weight connections (and potentially neurons) during training such that the model becomes progressively sparser. The resulting pruned model has a smaller number of nonzero parameters, which can then be stored in a compact sparse matrix format. Pruning can also be combined with quantization to obtain further compression.
Weight Pruning Methods
Mathematically, the way most weight pruning methods work is to calculate a matrix S of importance scores and then select the top k percent of weights by importance: Top_k(S)_ij = 1 if S_ij is in the top k%, and 0 otherwise.
In effect, k acts as a new hyperparameter that controls the amount of sparsity in the model, that is, the proportion of weights that are zero-valued; lower values of k correspond to sparser matrices. From these scores we can then define a mask matrix M that masks the weights W_ij during the forward pass with some input x_i and effectively creates a sparse network of activations a_i: a_i = Σ_k W_ik M_ik x_k.
Magnitude Pruning
Magnitude pruning calculates the scores according to the magnitude of the weights, S = (|W_ij|) for 1 ≤ i, j ≤ n, and then derives the masks from M = Top_k(S). In the literature it is common to apply magnitude pruning in an iterative fashion: first train the model to learn which connections are important, then prune the weights of least importance. The sparse model is then retrained and the process repeated until the desired sparsity is reached. One drawback of this approach is that it is computationally demanding. Another problem with magnitude pruning is that it is really designed for pure supervised learning, where the importance of each weight is directly related to the task at hand. By contrast, in transfer learning the importance of the weights is primarily determined by the pretraining phase, so magnitude pruning can remove connections that are important for the fine-tuning task.
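A minimal sketch of one-shot magnitude pruning using PyTorch's pruning utilities (not the iterative schedule described above); the layer size and sparsity level are arbitrary:
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(768, 768)
# Zero out the 30% of weights with the smallest absolute value (L1 magnitude)
prune.l1_unstructured(layer, name="weight", amount=0.3)
sparsity = (layer.weight == 0).float().mean().item()
print(f"Sparsity: {sparsity:.1%}")
# Make the pruning permanent by removing the (weight_orig, weight_mask) reparameterization
prune.remove(layer, "weight")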
Movement Pruning
The basic idea behind movement pruning is to gradually remove weights during fine-tuning such that the model becomes progressively sparser. The key novelty is that both the weights and the scores are learned during fine-tuning. The intuition behind movement pruning is that the weights that move the furthest away from zero are the most important ones to keep.
Chapter 9: Dealing with Few to No Labels
Implementing a Naive Baseline
Whenever you start a new NLP project, it's always a good idea to implement a set of strong baselines. There are two major reasons for this:
- A baseline based on regular expressions, handcrafted rules, or a very simple model might already work really well to solve the problem. In these cases, there is no reason to bring out big guns like transformers, which are generally more complex to deploy and maintain in production environments.
- The baselines provide quick checks as you explore more complex models.
Working with No Labeled Data
Zero-shot classification is suitable in settings where you have no labeled data at all. This is surprisingly common in industry, and might occur because there is no historical data with labels or because acquiring the labels for the data is difficult. The goal of zero-shot classification is to make use of a pretrained model without any fine-tuning on your task-specific corpus.
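A minimal sketch using the 🤗 Transformers zero-shot classification pipeline; the NLI checkpoint, example text, and candidate labels are assumptions for illustration:
from transformers import pipeline

zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = zero_shot("The battery life of this player is fantastic.",
                   candidate_labels=["battery", "sound quality", "price"])
print(result["labels"])  # candidate labels ranked by score
print(result["scores"])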
Working with a Few Labels
One simple but effective way to boost the performance of text classifiers on small datasets is to apply data augmentation to generate new examples from existing ones. In practice, there are two types of data augmentation that are commonly used:
- Back Translation: Take a text in the source language, translate it into one or more target languages using machine translation, and then translate it back to the source language.
- Token Permutation: Given a text from the training set, randomly choose and perform simple transformations like random synonym replacement, word insertion, swap, or deletion.
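A hand-rolled sketch of the token-permutation idea (libraries such as nlpaug provide more sophisticated versions); the probabilities and example sentence are arbitrary:
import random

def augment(text, p_delete=0.1, n_swaps=1, seed=42):
    random.seed(seed)
    tokens = text.split()
    # Randomly swap a pair of tokens, n_swaps times
    for _ in range(n_swaps):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    # Randomly delete tokens with probability p_delete
    tokens = [t for t in tokens if random.random() > p_delete]
    return " ".join(tokens)

print(augment("the sound quality is great for the price"))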
Using Embeddings as a Lookup Table
Large language models such as GPT-3 have been shown to be excellent at solving tasks with limited data. The reason is that these models learn useful representations of text that encode information across many dimensions, such as sentiment, topic, text structure, and more. For this reason, the embeddings of large language models can be used to develop a semantic search engine, find similar documents or comments, or even classify text.
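A minimal nearest-neighbor sketch of this idea: embed a handful of labeled examples, then assign a new text the label of its closest neighbor. The checkpoint, the mean-pooling choice, and the toy examples are assumptions for illustration:
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

ckpt = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)  # mean pooling over tokens

labeled_texts = ["battery dies too quickly", "amazing sound for the price"]
labels = ["battery", "sound quality"]
query_emb = embed(["the speakers sound tinny"])
sims = F.cosine_similarity(query_emb, embed(labeled_texts))
print(labels[int(sims.argmax())])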
Fine-Tuning a Vanilla Transformer
If we have access to labeled data, we can also try to do the obvious thing: simply fine-tune a pretrained transformer model.
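A minimal sketch of that workflow with the Trainer API; the tiny in-memory dataset, checkpoint, and hyperparameters are placeholder assumptions:
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)

# Tiny in-memory dataset standing in for the real labeled data
train_ds = Dataset.from_dict({"text": ["great battery life", "the sound is awful"],
                              "label": [1, 0]})
train_ds = train_ds.map(lambda batch: tokenizer(batch["text"], truncation=True),
                        batched=True)

args = TrainingArguments(output_dir="finetuned-baseline", num_train_epochs=3,
                         per_device_train_batch_size=16, report_to="none")
trainer = Trainer(model=model, args=args, train_dataset=train_ds, tokenizer=tokenizer)
trainer.train()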
Chapter 11: Future Directions
Scaling Transformers
There are now signs that a similar lesson, that scale beats clever architectural tweaks, is at play with transformers: while many of the early BERT and GPT descendants focused on tweaking the architecture or pretraining objectives, the best-performing models in mid-2021, like GPT-3, are essentially scaled-up versions of the original models without many architectural modifications.
Scaling Laws
Scaling laws allow one to empirically quantify the "bigger is better" paradigm for language models by studying their behavior with varying compute budget C, dataset size D, and model size N. The basic idea is to chart the dependence of the cross-entropy loss L on these three factors and determine whether a relationship emerges. For autoregressive models like those in the GPT family, plotting the loss curves of many training runs (one curve per model) reveals such a relationship.
From the loss curves we can draw conclusions about:
- The relationship of performance and scale: The implication of scaling laws is that a more productive path toward better models is to focus on increasing N , C , and D in tandem.
- Smooth power laws: The test loss scales as a power law with each of N, C, and D across several orders of magnitude, provided it is not bottlenecked by the other two factors.
- Sample Efficiency: Large models are able to reach the same performance as smaller models with a smaller number of training steps.
Challenges with Scaling
- Infrastructure: Provisioning and managing infrastructure that potentially spans hundreds of thousands of nodes with as many GPUs is not for the faint-hearted.
- Cost
- Dataset curation: A model is only as good as the data it is trained on. Training large models requires large, high-quality datasets.
- Model evaluation
- Deployment
Attention
In terms of time and memory complexity, the self-attention layer of the Transformer architecture naively scales like O(n^2), where n is the length of the sequence. Much of the recent research on transformers has focused on making self-attention more efficient.
Sparse Attention
One way to reduce the number of computations that are performed in the self-attention layer is to simply limit the number of query-key pairs that are generated according to some predefined pattern.
In practice, most transformer models with sparse attention use a mix of the atomic sparsity patterns to generate the final attention matrix.
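A toy sketch of one such pattern, a local window mask, in PyTorch; the sequence length and window size are arbitrary:
import torch

# Each token may only attend to neighbors within a fixed window instead of all n^2 pairs
n, window = 8, 2
idx = torch.arange(n)
mask = (idx[None, :] - idx[:, None]).abs() <= window  # True where attention is allowed
print(mask.int())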
Linearized Attention
An alternative way to make self-attention more efficient is to change the order of operations that are involved in computing the attention scores.
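A toy sketch of this reordering using the elu(x) + 1 feature map from the linear-transformer literature; the shapes and sizes are arbitrary:
import torch
import torch.nn.functional as F

n, d = 1000, 64
Q, K, V = torch.randn(3, n, d).unbind(0)
phi = lambda x: F.elu(x) + 1                    # positive feature map

# Compute phi(K)^T V first: a (d, d) matrix, so no n x n attention matrix is formed
kv = phi(K).transpose(0, 1) @ V
# Normalizer: phi(Q) times the column sums of phi(K), shape (n, 1)
z = phi(Q) @ phi(K).sum(dim=0, keepdim=True).transpose(0, 1)
attn_out = (phi(Q) @ kv) / z                    # linear in sequence length n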
Going Beyond Text
There are limits to LLMs:
- Human reporting bias: the frequencies of events in text may not represent their true frequencies.
- Common Sense: Common sense is a fundamental quality of human reasoning, but it is rarely written down.
- Facts: A probabilistic language model cannot store facts in a reliable way and can produce text that is factually wrong.
- Modality: Language models have no way to connect to other modalities that could address the previous points, such as audio or visual signals or tabular data.