Natural Language Processing w/ Transformers: Text Generation and Text Summarization
These two chapters of Natural Language Processing with Transformers cover text generation (how to search for the best next word during decoding) and text summarization (how to evaluate text generation and summarization models).
Chapter 5: Text Generation
One of the most uncanny features of transformer-based language models is their ability to generate text that is almost indistinguishable from text written by humans. By simply learning to predict the next word in the text of millions of web pages, GPT-2 and its more powerful descendants like GPT-3 are able to acquire a broad set of skills and pattern recognition abilities that can be activated with different kinds of input prompts.
The Challenge of Generating Coherent Text
Converting a model's probabilistic output to text requires a decoding method, which introduces a few challenges that are unique to text generation:
- The decoding is done iteratively and thus involves significantly more compute than simply passing inputs once through the forward pass of a model.
- The quality and diversity of the generated text depend on the choice of decoding method and associated hyperparameters.
Like other autoregressive or causal language models, GPT is pretrained to estimate the probability P(y ∣ x) of a sequence of tokens y = y1, y2, …, yt occurring in the text, given some initial prompt or context sequence x = x1, x2, …, xk. Since it is impractical to acquire enough training data to estimate P(y ∣ x) directly, it is common to use the chain rule of probability to factorize it as a product of conditional probabilities:
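$$P(y_1, \ldots, y_t \mid x) = \prod_{t=1}^{N} P(y_t \mid y_{<t}, x)$$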
where y<t is a shorthand notation for the sequence y1, …, yt−1. It is from these conditional probabilities that we pick up the intuition that autoregressive language modeling amounts to predicting each word given the preceding words in a sentence; this is exactly what the probability on the right-hand side of the preceding equation describes.
At the heart of this process lies a decoding method that determines which token is selected at each timestep. Since the language model head produces a logit zt,i per token in the vocabulary at each step, we can get the probability distribution over the next possible token wi by taking the softmax:
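$$P(y_t = w_i \mid y_{<t}, x) = \operatorname{softmax}(z_{t,i})$$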
The goal of most decoding methods is to search for the most likely overall sequence by picking a ŷ such that:
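$$\hat{y} = \underset{y}{\operatorname{argmax}} \; P(y \mid x)$$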
Finding y^ would involve evaluating every possible sequence with the language model. Since there does not exist an algorithm that can do this in a reasonable amount of time, we rely on approximations instead.
Greedy Search Decoding
The simplest decoding method to get discrete tokens from a model's continuous output is to greedily select the token with the highest probability at each timestep:
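$$\hat{y}_t = \underset{y_t}{\operatorname{argmax}} \; P(y_t \mid y_{<t}, x)$$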
"""
Loasing the 1.5b parameter version of GPT-2 with a language
mdoeling head:
"""
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
import pandas as pd
input_txt = "Transformers are the"
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
iterations = []
n_steps = 8
choices_per_step = 5
with torch.no_grad():
    for _ in range(n_steps):
        iteration = dict()
        iteration["Input"] = tokenizer.decode(input_ids[0])
        output = model(input_ids=input_ids)
        # Select logits of the first batch and the last token and apply softmax
        next_token_logits = output.logits[0, -1, :]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
        sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)
        # Store tokens with highest probabilities
        for choice_idx in range(choices_per_step):
            token_id = sorted_ids[choice_idx]
            token_prob = next_token_probs[token_id].cpu().numpy()
            token_choice = (
                f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f}%)"
            )
            iteration[f"Choice {choice_idx+1}"] = token_choice
        # Append predicted next token to input
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)
        iterations.append(iteration)
pd.DataFrame(iterations)
Unlike other tasks such as sequence classification where a single forward pass suffices to generate the predictions, with text generation we need to decode the output tokens one at a time.
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output = model.generate(input_ids, max_new_tokens=n_steps, do_sample=False)
print(tokenizer.decode(output[0]))
max_length = 128
input_txt = """In a shocking finding, scientist discovered \
a herd of unicorns living in a remote, previously unexplored \
valley, in the Andes Mountains. Even more surprising to the \
researchers was the fact that the unicorns spoke perfect English.\n\n
"""
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output_greedy = model.generate(input_ids, max_length=max_length,
do_sample=False)
print(tokenizer.decode(output_greedy[0]))
One of the main drawbacks of greedy search decoding is that it tends to produce repetitive output sequences, which is certainly undesirable in a news article. A popular alternative that addresses this is beam search decoding.
Beam Search Decoding
Instead of decoding the token with the highest probability at each step, beam search keeps track of the top-b most probable next tokens, where b is referred to as the number of beams or partial hypotheses. The next set of beams is chosen by considering all possible next-token extensions of the existing set and selecting the b most likely extensions. The process continues until we reach the maximum length or an EOS token, and the most likely sequence is selected by ranking the b beams according to their log probabilities.
We work with log probabilities because multiplying many individual probabilities, each of which is small, quickly produces a number too small for the computer to represent accurately (numerical underflow):
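$$\log P(y_1, \ldots, y_t \mid x) = \sum_{t=1}^{N} \log P(y_t \mid y_{<t}, x)$$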
The product of probabilities we saw earlier becomes a sum of log probabilities.
import torch.nn.functional as F

def log_probs_from_logits(logits, labels):
    """Per-token log probabilities of the label tokens."""
    logp = F.log_softmax(logits, dim=-1)
    logp_label = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)
    return logp_label

def sequence_logprob(model, labels, input_len=0):
    """Log probability of a full sequence, summed over the generated tokens."""
    with torch.no_grad():
        output = model(labels)
        log_probs = log_probs_from_logits(
            output.logits[:, :-1, :], labels[:, 1:])
        seq_log_prob = torch.sum(log_probs[:, input_len:])
    return seq_log_prob.cpu().numpy()
logp = sequence_logprob(model, output_greedy, input_len=len(input_ids[0]))
print(tokenizer.decode(output_greedy[0]))
print(f"\nlog-prob: {logp:.2f}")
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5,
do_sample=False)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5,
do_sample=False, no_repeat_ngram_size=2)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")
Beam search with n-gram penalty is a good way to find a trade-off between focusing on high-probability tokens (with beam search) while reducing repetitions (with n-gram penalty), and it’s commonly used in applications such as summarization or machine translation where factual correctness is important. When factual correctness is less important than the diversity of generated output, for instance in open-domain chitchat or story generation, another alternative to reduce repetitions while improving diversity is to use sampling.
Sampling Methods
The simplest sampling method is to randomly sample from the probability distribution of the model's outputs over the full vocabulary at each timestep:
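$$P(y_t = w_i \mid y_{<t}, x) = \operatorname{softmax}(z_{t,i}) = \frac{\exp(z_{t,i})}{\sum_{j=1}^{|V|} \exp(z_{t,j})}$$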
where ∣V∣ denotes the cardinality of the vocabulary. We can control the diversity of the output by adding a temperature parameter T that rescales the logits before taking the softmax:
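$$P(y_t = w_i \mid y_{<t}, x) = \frac{\exp(z_{t,i}/T)}{\sum_{j=1}^{|V|} \exp(z_{t,j}/T)}$$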
By tuning T we can control the shape of the probability distribution. When T ≪ 1, the distribution becomes sharply peaked and rare tokens are suppressed. On the other hand, when T ≫ 1, the distribution flattens out and each token becomes equally likely.
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True, temperature=2.0, top_k=0)
print(tokenizer.decode(output_temp[0]))
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True, temperature=0.5, top_k=0)
print(tokenizer.decode(output_temp[0]))
The main lesson we can draw from temperature is that it allows us to control the quality of the samples, but there is always a trade-off between coherence (low temperature) and diversity (high temperature) that has to be tuned to the use case at hand. Another way to adjust the trade-off between coherence and diversity is to truncate the distribution of the vocabulary. This allows us to adjust the diversity freely with the temperature, but in a more limited range that excludes words that would be too strange in the context.
Top-k and Nucleus Sampling
Top-k and nucleus (top-p) sampling are two popular alternatives or extensions to using temperature. In both cases, the basic idea is to restrict the number of possible tokens we can sample at each timestep. If we sample hundreds of times there is a significant chance of picking an unlikely token at some point - and picking such tokens when sampling can badly influence the quality of the generated text. For this reason, we generally want to avoid these very unlikely tokens. This is where top-k and top-p sampling come into play.
The idea behind top-k sampling is to avoid low-probability choices by only sampling from the k tokens with the highest probability.
output_topk = model.generate(input_ids, max_length=max_length, do_sample=True, top_k=50)
print(tokenizer.decode(output_topk[0]))
An alternative is to use a dynamic cutoff. With nucleus or top-p sampling, instead of choosing a fixed cutoff value, we set a condition for when to cut off: we stop once a certain probability mass in the selection is reached. If we set this to 95%, we order all tokens in descending order by probability and add one token after another from the top of the list until the sum of the probabilities of the selected tokens reaches 95%.
output_topp = model.generate(input_ids, max_length=max_length, do_sample=True, top_p=0.90)
print(tokenizer.decode(output_topp[0]))
Which Decoding Method is Best?
Which approach is best will depend on the nature of the task you are generating text for. If you want your model to perform a precise task like arithmetic or providing an answer to a specific question, then you should lower the temperature or use deterministic methods like greedy search in combination with beam search to guarantee getting the most likely answer. If you want the model to generate longer texts and even be a bit creative, then you should switch to sampling methods and increase the temperature or use a mix of top-k and nucleus sampling.
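As a brief illustration of that last suggestion (the specific parameter values here are illustrative, not from the book), top-k and nucleus sampling can be combined in a single generate() call:
output_mixed = model.generate(input_ids, max_length=max_length, do_sample=True,
                              temperature=0.8, top_k=50, top_p=0.90)
print(tokenizer.decode(output_mixed[0]))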
Chapter 6: Summarization
Summarization is a classic sequence-to-sequence task with an input text and a target text.
The CNN/DailyMail Dataset
Consists of around 300,000 pairs of news articles and their corresponding summaries, composed from the bullet points that CNN and the DailyMail attach to their articles. An important aspect of the dataset is that the summaries are abstractive and not extractive, which means that they consist of new sentences instead of simple excerpts.
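A minimal sketch of loading the dataset with the datasets library (assuming the "3.0.0" configuration); this defines the dataset object referenced in the code samples below:
from datasets import load_dataset
# "3.0.0" is the non-anonymized configuration of the dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")
print(dataset["train"].column_names)  # includes 'article', 'highlights', 'id'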
Text Summarization Pipeline
A convention in summarization is to separate the summary sentences by a newline. A common baseline for summarizing news articles is to simply take the first three sentences of the article; a sketch of this baseline follows the sentence-tokenization example below.
"""
Separate by newline
"""
import nltk
from nltk.tokenize import sent_tokenize
nltk.download("punkt")
string = "The U.S. are a country. The U.N. is an organization."
sent_tokenize(string)
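A minimal sketch of the three-sentence baseline mentioned above (the helper name three_sentence_summary is illustrative):
def three_sentence_summary(text):
    # Baseline: join the first three sentences of the article with newlines
    return "\n".join(sent_tokenize(text)[:3])

print(three_sentence_summary(dataset["train"][1]["article"]))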
Measuring the Quality of Generated Text
BLEU
The idea of BLEU is simple: instead of looking at how many of the tokens in the generated texts are perfectly aligned with the reference text tokens, we look at words or n-grams. BLEU is a precision-based metric, which means that when we compare the two texts we count the number of words in the generation that occur in the reference and divide it by the length of the generation. Assume we have one generated sentence, snt, that we want to compare against a reference sentence, snt'. We extract all possible n-grams of degree n and do the accounting to get the precision pn:
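$$p_n = \frac{\sum_{n\text{-gram} \in \text{snt}} \operatorname{Count}_{clip}(n\text{-gram})}{\sum_{n\text{-gram} \in \text{snt}} \operatorname{Count}(n\text{-gram})}$$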
In order to avoid rewarding repetitive generations, the count in the numerator is clipped. What this means is that the occurrence count of an n-gram is capped at how many times it appears in the reference sentence. In general we have more than one sample in the test set we want to evaluate, so we need to slightly extend the equation by summing over all samples in the corpus C:
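$$p_n = \frac{\sum_{\text{snt} \in C} \sum_{n\text{-gram} \in \text{snt}} \operatorname{Count}_{clip}(n\text{-gram})}{\sum_{\text{snt} \in C} \sum_{n\text{-gram} \in \text{snt}} \operatorname{Count}(n\text{-gram})}$$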
To compensate for the fact that the precision score favors short generations, the authors of BLEU introduced a brevity penalty:
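$$BR = \min\left(1,\; e^{1 - \ell_{ref}/\ell_{gen}}\right)$$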
By taking the minimum, we ensure that this penalty never exceeds 1, and the exponential term becomes exponentially small when the length of the generated text ℓgen is smaller than that of the reference text ℓref. It's preferable to look for high precision in translation and make sure the translation and reference have a similar length.
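Combining the brevity penalty with the n-gram precisions gives the BLEU-N score:
$$\text{BLEU-}N = BR \times \left(\prod_{n=1}^{N} p_n\right)^{1/N}$$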
The last term is the geometric mean of the modified precisions up to n-gram N. The BLEU metric has limitations: it does not take synonyms into account, and many steps in the derivation seem like ad hoc and rather fragile heuristics. In general, the field of text generation is still looking for better evaluation metrics, and finding ways to overcome the limits of metrics like BLEU is an active area of research. Another weakness of BLEU is that it expects the text to already be tokenized, which can lead to varying results if the exact same tokenization method is not used.
Loading metrics:
from datasets import load_metric
bleu_metric = load_metric("sacrebleu")
The bleu_metric is an instance of the Metric class and works like an aggregator: you can add single instances with add() or whole batches with add_batch(). Once you have added all the samples you need to evaluate, you call compute() and the metric is calculated. This returns a dictionary with several values, such as the precision for each n-gram, the length penalty, and the final BLEU score.
import pandas as pd
import numpy as np
bleu_metric.add(
prediction="the the the the the the", reference=["the cat is on the mat"])
results = bleu_metric.compute(smooth_method="floor", smooth_value=0)
results["precisions"] = [np.round(p, 2) for p in results["precisions"]]
pd.DataFrame.from_dict(results, orient="index", columns=["Value"])
The BLEU score is widely used for evaluating text, especially in machine translation, since precise translations are usually favored over translations that include all possible and appropriate words.
ROUGE
The ROUGE score was specifically developed for applications like summarization where high recall is more important than just precision. The approach is similar to the BLEU score in that we look at different n-grams and compare their occurrences in the generated text and the reference texts. The difference is that with ROUGE we check how many n-grams in the reference text also occur in the generated text. For BLEU we looked at how many n-grams in the generated text appear in the reference, so we can reuse the precision formula with the minor modification that we count the (unclipped) occurrences of n-grams in the reference text in the denominator:
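$$\text{ROUGE-}N = \frac{\sum_{\text{snt}' \in C} \sum_{n\text{-gram} \in \text{snt}'} \operatorname{Count}_{match}(n\text{-gram})}{\sum_{\text{snt}' \in C} \sum_{n\text{-gram} \in \text{snt}'} \operatorname{Count}(n\text{-gram})}$$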
This was the original proposal for ROUGE. Subsequently, researchers found that fully removing precision can have strong negative effects. There is a separate score in ROUGE to measure the longest common subsequence (LCS), called ROUGE-L. The LCS can be calculated for any pair of strings. To normalize for the lengths of the texts, the inventor of ROUGE came up with an F-score-like scheme where the LCS is normalized by the lengths of the reference and generated texts, and then the two normalized scores are mixed together:
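$$R_{LCS} = \frac{LCS(X, Y)}{m}, \qquad P_{LCS} = \frac{LCS(X, Y)}{n}, \qquad F_{LCS} = \frac{(1 + \beta^2)\, R_{LCS}\, P_{LCS}}{R_{LCS} + \beta^2 P_{LCS}}$$
where X is the reference text of length m, Y is the generated text of length n, and β weights recall against precision.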
# Assumes `dataset` (the CNN/DailyMail dataset loaded above) and `summaries`
# (a dict mapping model names to generated summaries built earlier in the chapter).
rouge_metric = load_metric("rouge")
reference = dataset["train"][1]["highlights"]
records = []
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
for model_name in summaries:
    rouge_metric.add(prediction=summaries[model_name], reference=reference)
    score = rouge_metric.compute()
    rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
    records.append(rouge_dict)
pd.DataFrame.from_records(records, index=summaries.keys())