Natural Language Processing w/ Transformers: Text Generation and Text Summarization
These two chapters of Natural Language Processing with Transformers cover text generation (how to search for the best next word during decoding) and text summarization (how to evaluate text generation and summarization models).
Chapter 5: Text Generation
One of the most uncanny features of transformer-based language models is their ability to generate text that is almost indistinguishable from text written by humans. By simply learning to predict the next word in the text of millions of web pages, GPT-2 and its more powerful descendants like GPT-3 are able to acquire a broad set of skills and pattern recognition abilities that can be activated with different kinds of input prompts.
The Challenge of Generating Coherent Text
Converting a model's probabilistic output to text requires a decoding method, which introduces a few challenges that are unique to text generation:
- The decoding is done iteratively and thus involves significantly more compute than simply passing inputs once through the forward pass of a model.
- The quality and diversity of the generated text depend on the choice of decoding method and associated hyperparameters.
Like other autoregressive or causal language models, GPT is pretrained to estimate the probability P(y ∣ x) of a sequence of tokens y = y1, y2, …, yt occurring in the text, given some initial prompt or context sequence x = x1, x2, …, xk. Since it is impractical to acquire enough training data to estimate P(y ∣ x) directly, it is common to use the chain rule of probability to factorize it as a product of conditional probabilities:
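$$P(y_1, \ldots, y_t \mid x) = \prod_{t=1}^{N} P(y_t \mid y_{<t}, x)$$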
where y<t is a shorthand notation for the sequence y1, …, yt−1. It is from these conditional probabilities that we pick up the intuition that autoregressive language modeling amounts to predicting each word given the preceding words in a sentence; this is exactly what the probability on the right-hand side of the preceding equation describes.
At the heart of this process lies a decoding method that determines which token is selected at each timestep. Since the language model head produces a logit zt,i per token in the vocabulary at each step, we can get the probability distribution over the next possible token wi by taking the softmax:
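$$P(y_t = w_i \mid y_{<t}, x) = \operatorname{softmax}(z_{t,i})$$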
The goal of most decoding methods is to search for the most likely overall sequence by picking a ŷ such that:
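$$\hat{y} = \underset{y}{\operatorname{argmax}} \; P(y \mid x)$$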
Finding y^ would involve evaluating every possible sequence with the language model. Since there does not exist an algorithm that can do this in a reasonable amount of time, we rely on approximations instead.
Greedy Search Decoding
The simplest decoding method to get discrete tokens from a model's continuous output is to greedily select the token with the highest probability at each timestep:
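$$\hat{y}_t = \underset{y_t}{\operatorname{argmax}} \; P(y_t \mid y_{<t}, x)$$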
"""
Loasing the 1.5b parameter version of GPT-2 with a language
mdoeling head:
"""
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
import pandas as pd
input_txt = "Transformers are the"
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
iterations = []
n_steps = 8
choices_per_step = 5
with torch.no_grad():
    for _ in range(n_steps):
        iteration = dict()
        iteration["Input"] = tokenizer.decode(input_ids[0])
        output = model(input_ids=input_ids)
        # Select logits of the first batch and the last token and apply softmax
        next_token_logits = output.logits[0, -1, :]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
        sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)
        # Store tokens with highest probabilities
        for choice_idx in range(choices_per_step):
            token_id = sorted_ids[choice_idx]
            token_prob = next_token_probs[token_id].cpu().numpy()
            token_choice = (
                f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f}%)"
            )
            iteration[f"Choice {choice_idx+1}"] = token_choice
        # Append predicted next token to input
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)
        iterations.append(iteration)
pd.DataFrame(iterations)
Unlike other tasks such as sequence classification where a single forward pass suffices to generate the predictions, with text generation we need to decode the output tokens one at a time.
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output = model.generate(input_ids, max_new_tokens=n_steps, do_sample=False)
print(tokenizer.decode(output[0]))
max_length = 128
input_txt = """In a shocking finding, scientist discovered \
a herd of unicorns living in a remote, previously unexplored \
valley, in the Andes Mountains. Even more surprising to the \
researchers was the fact that the unicorns spoke perfect English.\n\n
"""
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output_greedy = model.generate(input_ids, max_length=max_length,
do_sample=False)
print(tokenizer.decode(output_greedy[0]))
One of the main drawbacks of greedy search decoding is that it tends to produce repetitive output sequences, which is certainly undesirable in a news article. A popular alternative that addresses this is beam search decoding.
Beam Search Decoding
Instead of decoding the token with the highest probability at each step, beam search keeps track of the top-b most probable next tokens, where b is referred to as the number of beams or partial hypotheses. The next set of beams is chosen by considering all possible next-token extensions of the existing set and selecting the b most likely extensions. The process continues until we reach the maximum length or an EOS token, and the most likely sequence is selected by ranking the b beams according to their log probabilities.
We work with log probabilities because multiplying many individual probabilities, each of which is small, quickly produces a number too small for the computer to represent accurately (numerical underflow):
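$$\log P(y_1, \ldots, y_t \mid x) = \sum_{t=1}^{N} \log P(y_t \mid y_{<t}, x)$$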
The product of probabilities we saw earlier becomes a sum of log probabilities.
import torch.nn.functional as F

def log_probs_from_logits(logits, labels):
    """Per-token log probabilities of the label tokens."""
    logp = F.log_softmax(logits, dim=-1)
    logp_label = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)
    return logp_label

def sequence_logprob(model, labels, input_len=0):
    """Log probability of a full sequence, summed over the generated tokens."""
    with torch.no_grad():
        output = model(labels)
        log_probs = log_probs_from_logits(
            output.logits[:, :-1, :], labels[:, 1:])
        seq_log_prob = torch.sum(log_probs[:, input_len:])
    return seq_log_prob.cpu().numpy()
logp = sequence_logprob(model, output_greedy, input_len=len(input_ids[0]))
print(tokenizer.decode(output_greedy[0]))
print(f"\nlog-prob: {logp:.2f}")
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5,
do_sample=False)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5,
do_sample=False, no_repeat_ngram_size=2)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")
Beam search with n-gram penalty is a good way to find a trade-off between focusing on high-probability tokens (with beam search) while reducing repetitions (with n-gram penalty), and it’s commonly used in applications such as summarization or machine translation where factual correctness is important. When factual correctness is less important than the diversity of generated output, for instance in open-domain chitchat or story generation, another alternative to reduce repetitions while improving diversity is to use sampling.
Sampling Methods
The simplest sampling method is to randomly sample from the probability distribution of the model's outputs over the full vocabulary at each timestep:
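$$P(y_t = w_i \mid y_{<t}, x) = \operatorname{softmax}(z_{t,i}) = \frac{\exp(z_{t,i})}{\sum_{j=1}^{|V|} \exp(z_{t,j})}$$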
where ∣V∣ denotes the cardinality of the vocabulary. We can control the diversity of the output by adding a temperature parameter T that rescales the logits before taking the softmax:
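$$P(y_t = w_i \mid y_{<t}, x) = \frac{\exp(z_{t,i}/T)}{\sum_{j=1}^{|V|} \exp(z_{t,j}/T)}$$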
By tuning T we can control the shape of the probability distribution. When T ≪ 1, the distribution becomes sharply peaked and rare tokens are suppressed. On the other hand, when T ≫ 1, the distribution flattens out and each token becomes equally likely.
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True, temperature=2.0, top_k=0)
print(tokenizer.decode(output_temp[0]))
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True, temperature=0.5, top_k=0)
print(tokenizer.decode(output_temp[0]))
The main lesson we can draw from temperature is that it allows us to control the quality of the samples, but there is always a trade-off between coherence (low temperature) and diversity (high temperature) that has to be tuned to the use case at hand. Another way to adjust the trade-off between coherence and diversity is to truncate the distribution of the vocabulary. This allows us to adjust the diversity freely with the temperature, but in a more limited range that excludes words that would be too strange in the context.
Top-k and Nucleus Sampling
Top-k and nucleus (top-p) sampling are two popular alternatives or extensions to using temperature. In both cases, the basic idea is to restrict the number of possible tokens we can sample at each timestep. If we sample hundreds of times there is a significant chance of picking an unlikely token at some point - and picking such tokens when sampling can badly influence the quality of the generated text. For this reason, we generally want to avoid these very unlikely tokens. This is where top-k and top-p sampling come into play.
The idea behind top-k sampling is to avoid low-probability choices by only sampling from the k tokens with the highest probability.
output_topk = model.generate(input_ids, max_length=max_length, do_sample=True, top_k=50)
print(tokenizer.decode(output_topk[0]))
An alternative is to use a dynamic cutoff. With nucleus or top-p sampling, instead of choosing a fixed cutoff value, we set a condition for when to cut off: we stop once a certain probability mass in the selection is reached. If we set this to 95%, we order all tokens in descending order by probability and add one token after another from the top of the list until the sum of the probabilities of the selected tokens reaches 95%.
output_topp = model.generate(input_ids, max_length=max_length, do_sample=True, top_p=0.90)
print(tokenizer.decode(output_topp[0]))
Which Decoding Method is Best?
Which approach is best will depend on the nature of the task you are generating text for. If you want your model to perform a precise task like arithmetic or providing an answer to a specific question, then you should lower the temperature or use deterministic methods like greedy search in combination with beam search to guarantee getting the most likely answer. If you want the model to generate longer texts and even be a bit creative, then you should switch to sampling methods and increase the temperature or use a mix of top-k and nucleus sampling.
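As a brief illustration of that last suggestion (the specific parameter values here are illustrative, not from the book), top-k and nucleus sampling can be combined in a single generate() call:
output_mixed = model.generate(input_ids, max_length=max_length, do_sample=True,
                              temperature=0.8, top_k=50, top_p=0.90)
print(tokenizer.decode(output_mixed[0]))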
Chapter 6: Summarization
Summarization is a classic sequence-to-sequence task with an input text and a target text.
The CNN/DailyMail Dataset
Consists of around 300,000 pairs of news articles and their corresponding summaries, composed from the bullet points that CNN and the DailyMail attach to their articles. An important aspect of the dataset is that the summaries are abstractive and not extractive, which means that they consist of new sentences instead of simple excerpts.
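A minimal sketch of loading the dataset with the datasets library (assuming the "3.0.0" configuration); this defines the dataset object referenced in the code samples below:
from datasets import load_dataset
# "3.0.0" is the non-anonymized configuration of the dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")
print(dataset["train"].column_names)  # includes 'article', 'highlights', 'id'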
Text Summarization Pipeline
A convention in summarization is to separate the summary sentences by a newline. A common baseline for summarizing news articles is to simply take the first three sentences of the article; a sketch of this baseline follows the sentence-tokenization example below.
"""
Separate by newline
"""
import nltk
from nltk.tokenize import sent_tokenize
nltk.download("punkt")
string = "The U.S. are a country. The U.N. is an organization."
sent_tokenize(string)
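A minimal sketch of the three-sentence baseline mentioned above (the helper name three_sentence_summary is illustrative):
def three_sentence_summary(text):
    # Baseline: join the first three sentences of the article with newlines
    return "\n".join(sent_tokenize(text)[:3])

print(three_sentence_summary(dataset["train"][1]["article"]))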
Measuring the Quality of Generated Text
BLEU
The idea of BLEU is simple: instead of looking at how many of the tokens in the generated texts are perfectly aligned with the reference text tokens, we look at words or n-grams. BLEU is a precision-based metric, which means that when we compare the two texts we count the number of words in the generation that occur in the reference and divide it by the length of the generation. Assume we have one generated sentence, snt, that we want to compare against a reference sentence, snt'. We extract all possible n-grams of degree n and do the accounting to get the precision pn:
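$$p_n = \frac{\sum_{n\text{-gram} \in \text{snt}} \operatorname{Count}_{clip}(n\text{-gram})}{\sum_{n\text{-gram} \in \text{snt}} \operatorname{Count}(n\text{-gram})}$$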
In order to avoid rewarding repetitive generations, the count in the numerator is clipped. What this means is that the occurrence count of an n-gram is capped at how many times it appears in the reference sentence. In general we have more than one sample in the test set we want to evaluate, so we need to slightly extend the equation by summing over all samples in the corpus C:
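$$p_n = \frac{\sum_{\text{snt} \in C} \sum_{n\text{-gram} \in \text{snt}} \operatorname{Count}_{clip}(n\text{-gram})}{\sum_{\text{snt} \in C} \sum_{n\text{-gram} \in \text{snt}} \operatorname{Count}(n\text{-gram})}$$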
To compensate for the fact that the precision score favors short generations, the authors of BLEU introduced a brevity penalty:
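$$BR = \min\left(1,\; e^{1 - \ell_{ref}/\ell_{gen}}\right)$$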
By taking the minimum, we ensure that this penalty never exceeds 1, and the exponential term becomes exponentially small when the length of the generated text ℓgen is smaller than that of the reference text ℓref. It's preferable to look for high precision in translation and make sure the translation and reference have a similar length.
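Combining the brevity penalty with the n-gram precisions gives the BLEU-N score:
$$\text{BLEU-}N = BR \times \left(\prod_{n=1}^{N} p_n\right)^{1/N}$$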
The last term is the geometric mean of the modified precisions up to n-gram N. The BLEU metric has limitations: it does not take synonyms into account, and many steps in the derivation seem like ad hoc and rather fragile heuristics. In general, the field of text generation is still looking for better evaluation metrics, and finding ways to overcome the limits of metrics like BLEU is an active area of research. Another weakness of BLEU is that it expects the text to already be tokenized, which can lead to varying results if the exact same tokenization method is not used.
Loading metrics:
from datasets import load_metric
bleu_metric = load_metric("sacrebleu")
The bleu_metric is an instance of the Metric class and works like an aggregator: you can add single instances with add() or whole batches with add_batch(). Once you have added all the samples you need to evaluate, you call compute() and the metric is calculated. This returns a dictionary with several values, such as the precision for each n-gram, the length penalty, and the final BLEU score.
import pandas as pd
import numpy as np
bleu_metric.add(
prediction="the the the the the the", reference=["the cat is on the mat"])
results = bleu_metric.compute(smooth_method="floor", smooth_value=0)
results["precisions"] = [np.round(p, 2) for p in results["precisions"]]
pd.DataFrame.from_dict(results, orient="index", columns=["Value"])
The BLEU score is widely used for evaluating text, especially in machine translation, since precise translations are usually favored over translations that include all possible and appropriate words.
ROUGE
The ROUGE score was specifically developed for applications like summarization where high recall is more important than just precision. The approach is similar to the BLEU score in that we look at different n-grams and compare their occurrences in the generated text and the reference texts. The difference is that with ROUGE we check how many n-grams in the reference text also occur in the generated text. For BLEU we looked at how many n-grams in the generated text appear in the reference, so we can reuse the precision formula with the minor modification that we count the (unclipped) occurrences of n-grams in the reference text in the denominator:
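$$\text{ROUGE-}N = \frac{\sum_{\text{snt}' \in C} \sum_{n\text{-gram} \in \text{snt}'} \operatorname{Count}_{match}(n\text{-gram})}{\sum_{\text{snt}' \in C} \sum_{n\text{-gram} \in \text{snt}'} \operatorname{Count}(n\text{-gram})}$$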
This was the original proposal for ROUGE. Subsequently, researchers found that fully removing precision can have strong negative effects. There is a separate score in ROUGE to measure the longest common subsequence (LCS), called ROUGE-L. The LCS can be calculated for any pair of strings. To normalize for the lengths of the texts, the inventor of ROUGE came up with an F-score-like scheme where the LCS is normalized by the lengths of the reference and generated texts, and then the two normalized scores are mixed together:
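$$R_{LCS} = \frac{LCS(X, Y)}{m}, \qquad P_{LCS} = \frac{LCS(X, Y)}{n}, \qquad F_{LCS} = \frac{(1 + \beta^2)\, R_{LCS}\, P_{LCS}}{R_{LCS} + \beta^2 P_{LCS}}$$
where X is the reference text of length m, Y is the generated text of length n, and β weights recall against precision.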
# Assumes `dataset` (the CNN/DailyMail dataset loaded above) and `summaries`
# (a dict mapping model names to generated summaries built earlier in the chapter).
rouge_metric = load_metric("rouge")
reference = dataset["train"][1]["highlights"]
records = []
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
for model_name in summaries:
    rouge_metric.add(prediction=summaries[model_name], reference=reference)
    score = rouge_metric.compute()
    rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
    records.append(rouge_dict)
pd.DataFrame.from_records(records, index=summaries.keys())