Natural Language Processing w/ Transformers: Transformer Anatomy and Multilingual NER
Chapters 3 and 4 of Natural Language Processing with Transformers go into depth on the architecture of the Transformer and walk through a multilingual Named Entity Recognition task.
Chapter 3: Transformer Anatomy
This chapter explores the main building blocks of transformer models and how to implement them using PyTorch. While a deep technical understanding of the Transformer architecture is generally not necessary to use 🤗 Transformers and fine-tune models for your use case, it can be helpful for comprehending and navigating the limitations of transformers and using them in new domains.
The Transformer Architecture
The original Transformer is based on the encoder-decoder architecture that is widely used for tasks like machine translation. The architecture consists of two components:
- Encoder: Converts an input sequence of tokens into a sequence of embedding vectors, often called the hidden state or context
- Decoder: Uses the encoder's hidden state to iteratively generate an output sequence of tokens, one token at a time.
Things that characterize the Transformer architecture:
- The input text is tokenized and converted to token embeddings using the techniques from the previous chapter. Since the attention mechanism is not aware of the relative positions of the tokens, we need a way to inject information about token positions into the input to model the sequential nature of the text. The token embeddings are therefore combined with positional embeddings that contain positional information for each token.
- The encoder is composed of a stack of encoder layers or "blocks", which is analogous to stacking convolutional layers in computer vision. The same is true of the decoder, which has its own stack of decoder layers.
- The encoder's output is fed to each decoder layer, and the decoder then generates a prediction for the most probable next token in the sequence. The output of this step is then fed back into the decoder to generate the next token, and so on until a special end-of-sequence (EOS) token is reached.
The Transformer architecture was originally designed for sequence-to-sequence tasks like machine translation, but both the encoder and decoder blocks were soon adapted as standalone models. Although there are hundreds of different transformer models, most of them belong to one of three types:
- Encoder-only: These models convert an input sequence of text into a rich numerical representation that is well suited for tasks like text classification or named entity recognition. BERT and its variants belong to this class of architecture. The representation computed for a given token in this architecture depends both on the left (before the token) and the right (after the token) contexts. This is often called bidirectional attention.
- Decoder-only: These models autocomplete a sequence by iteratively predicting the most probable next word. The GPT family of models belongs to this class. The representation computed for a given token in this architecture depends only on the left context. This is often called causal or autoregressive attention.
- Encoder-decoder: These are used for modeling complex mappings from one sequence of text to another; they're suitable for machine translation and summarization tasks.
In reality, the distinction between encoder-only and decoder-only tasks is a bit blurry.
The Encoder
The encoder consists of many encoder layers stacked next to each other. Each encoder layer receives a sequence of embeddings and feeds them through the following sublayers:
- A multi-head self-attention layer
- A fully connected feed-forward layer that is applied to each input embedding
The output embeddings of each encoder layer have the same size as the inputs, and we'll soon see that the main role of the encoder stack is to "update" the input embeddings to produce representations that encode some contextual information in the sequence. Each of these sublayers also uses skip connections and layer normalization, which are standard tricks to train deep neural networks effectively.
Self-Attention
Attention is a mechanism that allows neural networks to assign a different amount of weight or "attention" to each element in a sequence. The "self" part of self-attention refers to the fact that these weights are computed for all hidden states in the same set. The main idea behind self-attention is that instead of using a fixed embedding for each token, we can use the whole sequence to compute a weighted average of each embedding. Given a sequence of embeddings x_1, …, x_n, self-attention produces a sequence of new embeddings x_1′, …, x_n′, where each x_i′ is a linear combination of all the x_j: x_i′ = ∑_j w_{ji} x_j.
The coefficients w_{ji} are called *attention weights* and are normalized so that ∑_j w_{ji} = 1. Embeddings that are generated using the words around them are called contextual embeddings and predate the invention of transformers.
Scaled dot-product attention
There are several ways to implement a self-attention layer, but the most common one is scaled dot-product attention. There are four main steps to implement this mechanism:
- Project each token embedding into three vectors called query, key, and value.
- Compute attention scores. Determine how much the query and key vectors relate to each other using a similarity function. The similarity function for scaled dot-product attention is the dot product, computed efficiently using matrix multiplication of the embeddings. Queries and keys that are similar will have a large dot product, while those that don't share much in common will have little to no overlap. The outputs of this step are called attention scores, and for a sequence with n input tokens, there is a corresponding n×n matrix of attention scores.
- Compute attention weights. Attention scores are first multiplied by a scaling factor to normalize their variance and then normalized with a softmax to ensure all the column values sum to 1. The resulting n×n matrix contains all the attention weights w_{ji}.
- Update the token embeddings. Once the attention weights are computed, we multiply them by the value vectors v_1, …, v_n to obtain an updated representation for each embedding: x_i′ = ∑_j w_{ji} v_j.
BertViz for Jupyter allows us to visualize how attention weights are calculated.
"""
Visualizing Attention Weights
"""
!pip install bertviz
from transformers import AutoTokenizer
from bertviz.transformers_neuron_view import BertModel
from bertviz.neuron_view import show
model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BertModel.from_pretrained(model_ckpt)
text = "time flies like an arrow"
show(model, "bert", tokenizer, text, display_mode="light", layer=0, head=8)
"""
Tokenize Text
"""
# add_special_tokens=False to exclude the [CLS] and [SEP] tokens
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
inputs.input_ids
Dense in this context means that each entry in the embedding contains a nonzero value.
from torch import nn
from transformers import AutoConfig
config = AutoConfig.from_pretrained(model_ckpt)
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)
token_emb
We use the AutoConfig class to load the config.json file associated with the bert-base-uncased checkpoint. In 🤗 Transformers, every checkpoint is assigned a configuration file that specifies various hyperparameters like vocab_size and hidden_size, which in our example shows us that each input ID will be mapped to one of the 30,522 embedding vectors stored in nn.Embedding, each with a size of 768. The AutoConfig class also stores some additional metadata, such as the label names, which are used to format the model's predictions. Token embeddings at this point are independent of their context. The role of the subsequent attention layers will be to mix these token embeddings to disambiguate and inform the representation of each token with the content of its context.
"""
Generate the embeddings by feeding in the Input IDs
"""
inputs_embeds = token_emb(inputs.input_ids)
# Produces a tensor of shape [batch_size, seq_len, hidden_dim]
inputs_embeds.size()
"""
Create query, key, and value vectors and calculate attention
scores using the dot product as the similarity function
"""
import torch
from math import sqrt
query = key = value = inputs_embeds
dim_k = key.size(-1)
# batch matrix-matrix product
scores = torch.bmm(query, key.transpose(1,2)) / sqrt(dim_k)
# Creates a 5 x 5 matrix of attention scores per sample in the batch
scores.size()
"""
Apply Softmax
"""
import torch.nn.functional as F
weights = F.softmax(scores, dim=-1)
weights.sum(dim=-1)
"""
Multiply Attention Weights by the Values
"""
attn_outputs = torch.bmm(weights, value)
attn_outputs.shape
"""
Wrap the self-attention steps into a function to reuse later
"""
def scaled_dot_product_attention(query, key, value):
"""
Applies self-attention. The whole process is just two
matrix multiplications and a softmax - self-attention is
just a fancy form of averaging
"""
dim_k = query.size(-1)
scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
weights = F.softmax(scores, dim=-1)
return torch.bmm(weights, value)
The current attention mechanism with equal query and key vectors will assign a very large score to identical words in the context, and in particular to the current word itself. In practice, the meaning of a word will be better informed by complementary words in the context than by identical words. Let's allow the model to create a different set of vectors for the query, key, and value of a token by using three different linear projections to project the initial token vector into three different spaces.
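Before moving on, a quick check of the first claim (a small sketch reusing the weights tensor computed above): with query equal to key, each token's largest attention weight indeed tends to land on its own position.
# With query == key, the attention matrix is dominated by its diagonal:
# each token attends mostly to its own position.
print(weights[0].argmax(dim=-1))   # most-attended position for each token
print(torch.diagonal(weights[0]))  # each token's attention weight on itself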
Multi-headed Attention
In practice, the self-attention layer applies three independent linear transformations to each embedding to generate the query, key, and value vectors. These transformations project the embeddings, and each projection carries its own set of learnable parameters, which allows the self-attention layer to focus on different semantic aspects of the sequence.
It turns out to be beneficial to have multiple sets of linear projections, each one representing a so-called attention head. The resulting multi-head attention layer is shown below. The softmax of one head tends to focus on one aspect of similarity. Having several heads allows the model to focus on several aspects at once.
class AttentionHead(nn.Module):
"""
Initialize three independent linear layers that apply matrix
multiplication to the embedding vectors to produce tensors of shape
[batch_size, seq_len, head_dim], where head_dim is the number of
dimensions we are projecting into. In practice, head_dim is chosen
so that embed_dim is a multiple of head_dim (typically embed_dim
divided by num_heads), which keeps the total computation across heads constant.
"""
def __init__(self,embed_dim,head_dim):
super().__init__()
self.q = nn.Linear(embed_dim, head_dim)
self.k = nn.Linear(embed_dim, head_dim)
self.v = nn.Linear(embed_dim, head_dim)
def forward(self, hidden_state):
attn_outputs = scaled_dot_product_attention(self.q(hidden_state), self.k(hidden_state), self.v(hidden_state))
return attn_outputs
class MultiHeadAttention(nn.Module):
"""
The concatenated output from the attention heads is also fed through
a final linear layer to produce an output tensor of shape [batch_size,
seq_len, hidden_dim] that is suitable for the feed-forward network
downstream.
"""
def __init__(self, config):
super().__init__()
embed_dim = config.hidden_size
num_heads = config.num_attention_heads
head_dim = embed_dim // num_heads
self.heads = nn.ModuleList(
[AttentionHead(embed_dim, head_dim) for _ in range(num_heads)]
)
self.output_linear = nn.Linear(embed_dim, embed_dim)
def forward(self, hidden_state):
x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)
x = self.output_linear(x)
return x
multihead_attn = MultiHeadAttention(config)
attn_output = multihead_attn(inputs_embeds)
attn_output.size()
"""
Visualizing multi-head attention
"""
from bertviz import head_view
from transformers import AutoModel
model = AutoModel.from_pretrained(model_ckpt, output_attentions=True)
sentence_a = "time flies like an arrow"
sentence_b = "fruit flies like a banana"
viz_inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt')
attention = model(**viz_inputs).attentions
sentence_b_start = (viz_inputs.token_type_ids == 0).sum(dim=1)
tokens = tokenizer.convert_ids_to_tokens(viz_inputs.input_ids[0])
head_view(attention, tokens, sentence_b_start, heads=[8])
The Feed-Forward Layer
The feed-forward sublayer in the encoder and decoder is just a simple two-layer fully connected neural network, but with a twist: instead of processing the whole sequence of embeddings as a single vector, it processes each embedding independently. For this reason, this layer is often referred to as a position-wise feed-forward layer. (It may also be referred to as a one-dimensional convolution with a kernel size of one). A rule of thumb from the literature is for the hidden size of the first layer to be four times the size of the embeddings, and a GELU activation function is most commonly used. This is where most of the capacity and memorization is hypothesized to happen, and it's the part that is most often scaled when scaling up the models.
class FeedForward(nn.Module):
def __init__(self, config):
super().__init__()
self.linear_1 = nn.Linear(config.hidden_size, config.intermediate_size)
self.linear_2 = nn.Linear(config.intermediate_size, config.hidden_size)
self.gelu = nn.GELU()
self.dropout = nn.Dropout(config.hidden_dropout_prob)
def forward(self, x):
x = self.linear_1(x)
x = self.gelu(x)
x = self.linear_2(x)
x = self.dropout(x)
return x
feed_forward = FeedForward(config)
ff_outputs = feed_forward(attn_outputs)
ff_outputs.size()
Adding Layer Normalization
The Transformer architecture makes use of layer normalization and skip connections. The former normalizes each input in the batch to have zero mean and unit variance. Skip connections pass a tensor to the next layer of the model without processing and add it to the processed tensor. There are two options for placing the layer normalization:
- Post layer normalization: This arrangement places layer normalization in between the skip connections. It is tricky to train from scratch, as the gradients can diverge.
- Pre layer normalization: This is the most common arrangement found in the literature; it places layer normalization within the span of the skip connections. It tends to be much more stable during training. (A sketch of the post-layer-normalization arrangement follows this list; the encoder layer implemented below uses pre-layer normalization.)
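Here is a minimal sketch of the post-layer-normalization arrangement (a hypothetical class that is not used later in the chapter; it reuses the MultiHeadAttention and FeedForward modules defined above):
class PostLNTransformerEncoderLayer(nn.Module):
    """
    Uses post-layer normalization: the layer norm is applied after each
    skip connection rather than before the sublayer.
    """
    def __init__(self, config):
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(config.hidden_size)
        self.layer_norm_2 = nn.LayerNorm(config.hidden_size)
        self.attention = MultiHeadAttention(config)
        self.feed_forward = FeedForward(config)
    def forward(self, x):
        # Attention with a skip connection, normalized afterwards
        x = self.layer_norm_1(x + self.attention(x))
        # Feed-forward with a skip connection, normalized afterwards
        x = self.layer_norm_2(x + self.feed_forward(x))
        return x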
class TransformerEncoderLayer(nn.Module):
"""
Uses Pre-layer normalization
"""
def __init__(self, config):
super().__init__()
self.layer_norm_1 = nn.LayerNorm(config.hidden_size)
self.layer_norm_2 = nn.LayerNorm(config.hidden_size)
self.attention = MultiHeadAttention(config)
self.feed_forward = FeedForward(config)
def forward(self, x):
# Apply layer normalization and then copy input into query, key, value
hidden_state = self.layer_norm_1(x)
# Apply attention with a skip connection
x = x + self.attention(hidden_state)
# Apply feed-forward layer with a skip connection
x = x + self.feed_forward(self.layer_norm_2(x))
return x
encoder_layer = TransformerEncoderLayer(config)
inputs_embeds.shape, encoder_layer(inputs_embeds).size()
Positional Embeddings
Positional embeddings are based on a simple, yet very effective idea: augment the token embeddings with a position-dependent pattern of values arranged in a vector. If the pattern is characteristic for each position, the attention heads and feed-forward layers in each stack can learn to incorporate positional information into their transformations.
There are several ways to achieve this, and one of the most popular approaches is to use a learnable pattern, especially when the pretraining dataset is sufficiently large. This works exactly the same way as the token embeddings, but using the position index instead of the token ID as input. With that approach, an efficient way of encoding the positions of tokens is learned during pretraining.
class Embeddings(nn.Module):
"""
Custom Embeddings module that combines a token embedding layer
that projects the input_ids to a dense hidden state together with the
positional embedding that does the same for position_ids. The resulting
embedding is simply the sum of both embeddings
"""
def __init__(self, config):
super().__init__()
self.token_embeddings = nn.Embedding(config.vocab_size,config.hidden_size)
self.position_embeddings = nn.Embedding(config.max_position_embeddings,config.hidden_size)
self.layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
self.dropout = nn.Dropout()
def forward(self, input_ids):
# Create position IDs for input sequence
seq_length = input_ids.size(1)
position_ids = torch.arange(seq_length, dtype=torch.long).unsqueeze(0)
# Create token and position embeddings
token_embeddings = self.token_embeddings(input_ids)
position_embeddings = self.position_embeddings(position_ids)
# Combine token and position embeddings
embeddings = token_embeddings + position_embeddings
embeddings = self.layer_norm(embeddings)
embeddings = self.dropout(embeddings)
return embeddings
embedding_layer = Embeddings(config)
embedding_layer(inputs.input_ids).size()
The embedding layer now creates a single, dense embedding for each token. Alternatives to position embeddings:
- Absolute positional representations: Transformer models can use static patterns consisting of modulated sine and cosine signals to encode the positions of the tokens (sketched after this list).
- Relative positional representations: While absolute positions are important, one can argue that when computing an embedding, the surrounding tokens are most important. Relative positional representations follow this intuition and encode the relative positions between tokens.
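A minimal sketch of the static sinusoidal pattern (illustrative only; the encoder built in this chapter keeps the learnable position embeddings defined above):
import math
def sinusoidal_position_encoding(seq_len, hidden_size):
    """
    Absolute positional encodings from the original Transformer paper:
    sine and cosine signals of different frequencies, one row per position.
    Assumes hidden_size is even.
    """
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, hidden_size, 2, dtype=torch.float)
                         * (-math.log(10000.0) / hidden_size))
    pe = torch.zeros(seq_len, hidden_size)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe
# Same shape as the learnable position embeddings for our 5-token input
sinusoidal_position_encoding(5, config.hidden_size).shape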
class TransformerEncoder(nn.Module):
"""
Full transformer encoder combining the embeddings with the encoder
layers.
"""
def __init__(self, config):
super().__init__()
self.embeddings = Embeddings(config)
self.layers = nn.ModuleList([TransformerEncoderLayer(config)
for _ in range(config.num_hidden_layers)])
def forward(self, x):
x = self.embeddings(x)
for layer in self.layers:
x = layer(x)
return x
encoder = TransformerEncoder(config)
encoder(inputs.input_ids).size()
Adding a Classification Head
Transformer models are usually divided into a task-independent body and a task-specific head.
class TransformerForSequenceClassification(nn.Module):
"""
The following class extends the existing encoder for sequence
classification.
"""
def __init__(self, config):
super().__init__()
self.encoder = TransformerEncoder(config)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.classifier = nn.Linear(config.hidden_size, config.num_labels)
def forward(self, x):
x = self.encoder(x)[:, 0, :] # select hidden state of [CLS] token
x = self.dropout(x)
x = self.classifier(x)
return x
config.num_labels = 3
encoder_classifier = TransformerForSequenceClassification(config)
encoder_classifier(inputs.input_ids).size()
The Decoder
The main difference between the decoder and encoder is that the decoder has two attention sublayers:
- Masked multi-head self-attention layer: Ensures that the tokens we generate at each timestep are only based on the past outputs and the current token being predicted.
- Encoder-decoder attention layer: Performs multi-head attention over the output key and value vectors of the encoder stack, with the intermediate representations of the decoder acting as the queries. This way the encoder-decoder attention layer learns how to relate tokens from two different sequences, such as two different languages.
The trick with masked self-attention is to introduce a mask matrix with ones on and below the diagonal and zeros above.
seq_len = inputs.input_ids.size(-1)
# Creates a lower triangular matrix
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)
print(mask[0])
scores.masked_fill(mask == 0, -float("inf"))
By setting the upper values to negative infinity, we guarantee that the corresponding attention weights are all zero once we take the softmax over the scores.
def scaled_dot_product_attention(query, key, value, mask=None):
"""
Scaled dot-product attention function including
the masking behavior
"""
dim_k = query.size(-1)
scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
if mask is not None:
scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(scores, dim=-1)
return weights.bmm(value)
Meet the Transformers
There are three main architectures for transformer models: encoders, decoders, and encoder-decoders.
The Encoder Branch
Encoder-only models still dominate research and industry on NLU tasks such as text classification, named entity recognition, and question answering.
The Decoder Branch
The progress on transformer decoder models has been spearheaded to a large extent by OpenAI. These models are exceptionally good at predicting the next word in a sequence and are thus mostly used for text generation tasks.
The Encoder-Decoder Branch
Although it has become more common to build models using a single encoder or decoder stack, there are several encoder-decoder variants of the Transformer architecture that have novel applications across both NLU and NLG domains.
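To tie the three branches together, here is an illustrative sketch that loads one well-known checkpoint per branch with the corresponding Auto class (the checkpoint names are just examples, and downloading the weights is assumed):
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM
# Encoder-only branch (BERT-like)
encoder_only = AutoModel.from_pretrained("bert-base-uncased")
# Decoder-only branch (GPT-like)
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")
# Encoder-decoder branch (T5-like)
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")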
Chapter 4: Multilingual Named Entity Recognition
By pretraining on huge corpora across many languages, multilingual transformers enable zero-shot cross-lingual transfer. This means that a model fine-tuned on one language can be applied to others without further training. NER is a common NLP task that identifies entities like people, organizations, or locations in text. Zero-shot transfer or zero-shot learning usually refers to the task of training a model on one set of labels and then evaluating it on a different set of labels.
The Dataset
We use a subset of the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark called WikiANN or PAN-X. This dataset consists of Wikipedia articles in many languages, each annotated with LOC (location), PER (person), and ORG (organization) tags in the inside-outside-beginning (IOB) format. In this format, a B- prefix indicates the beginning of an entity, and consecutive tokens belonging to the same entity are given an I- prefix. An O tag indicates the token does not belong to any entity.
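For example, a hypothetical sentence tagged in this format might look as follows (an illustration, not an actual example from the dataset):
toy_tokens = ["Jeff", "Dean", "works", "at", "Google", "in", "California"]
toy_tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC"]
for token, tag in zip(toy_tokens, toy_tags):
    print(f"{token:<12}{tag}")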
"""
Need to know which dataset configuration to pass to load_dataset
"""
!pip install datasets
from datasets import get_dataset_config_names
# Figure out which subsets are available
xtreme_subsets = get_dataset_config_names("xtreme")
print(f"XTREME has {len(xtreme_subsets)} configurations")
"""
Narrow search of subsets
"""
panx_subsets = [s for s in xtreme_subsets if s.startswith("PAN")]
panx_subsets[:3]
# Load the German corpus
from datasets import load_dataset
load_dataset("xtreme", name="PAN-X.de")
from collections import defaultdict
from datasets import DatasetDict
langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]
# Return a DatasetDict if a key doesn't exist
panx_ch = defaultdict(DatasetDict)
for lang, frac in zip(langs, fracs):
# Load monolingual corpus
ds = load_dataset("xtreme", name=f"PAN-X.{lang}")
# Shuffle and downsample each split according to spoken proportion
for split in ds:
panx_ch[lang][split] = (
ds[split]
.shuffle(seed=0)
.select(range(int(frac * ds[split].num_rows))))
import pandas as pd
pd.DataFrame({lang: [panx_ch[lang]["train"].num_rows] for lang in langs},index=["Number of training examples"])
element = panx_ch["de"]["train"][0]
for key, value in element.items():
print(f"{key}: {value}")
for key, value in panx_ch["de"]["train"].features.items():
print(f"{key}: {value}")
tags = panx_ch["de"]["train"].features["ner_tags"].feature
print(tags)
def create_tag_names(batch):
return {"ner_tags_str": [tags.int2str(idx) for idx in batch["ner_tags"]]}
panx_de = panx_ch["de"].map(create_tag_names)
de_example = panx_de["train"][0]
pd.DataFrame([de_example["tokens"], de_example["ner_tags_str"]],['Tokens', 'Tags'])
from collections import Counter
split2freqs = defaultdict(Counter)
for split, dataset in panx_de.items():
for row in dataset["ner_tags_str"]:
for tag in row:
if tag.startswith("B"):
tag_type = tag.split("-")[1]
split2freqs[split][tag_type] += 1
pd.DataFrame.from_dict(split2freqs, orient="index")
Multilingual Transformers
Multilingual transformers involve similar architectures and training procedures as their monolingual counterparts, except that the corpus used for pretraining consists of documents in many languages. A remarkable feature of this approach is that despite receiving no explicit information to differentiate among the languages, the resulting linguistic representations are able to generalize well across languages for a variety of downstream tasks. Multilingual transformer models are usually evaluated in three different ways:
- en: Fine-tune on the English training data and then evaluate on each language's test set
- each: Fine-tune and evaluate on monolingual test data to measure per-language performance
- all: Fine-tune on all the training data and evaluate on each language's test set
A Closer Look at Tokenization
XLM-R uses a tokenizer called SentencePiece that is trained on the raw text of all one hundred languages.
"""
A close look at tokenization
"""
from transformers import AutoTokenizer
bert_model_name = "bert-base-cased"
xlmr_model_name = "xlm-roberta-base"
bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
xlmr_tokenizer = AutoTokenizer.from_pretrained(xlmr_model_name)
"""
retrieve special tokens
"""
text = "Jack Sparrow loves New York!"
bert_tokens = bert_tokenizer(text).tokens()
xlmr_tokens = xlmr_tokenizer(text).tokens()
The Tokenizer Pipeline
Tokenization is actually a full processing pipeline that usually consists of four steps (the first two are sketched in code after this list):
- Normalization: This step corresponds to the set of operations you apply to a raw string to make it "cleaner". Common operations include stripping whitespace and removing accented characters. Unicode normalization is another common normalization operation applied by many tokenizers to deal with the fact that there often exist various ways to write the same character.
- Pretokenization: This step splits a text into smaller objects that give an upper bound to what your tokens will be at the end of training. A good way to think of this is that the pretokenizer will split your text into "words", and your final tokens will be parts of those words.
- Tokenizer Model: Once the input words are normalized and pretokenized, the tokenizer applies a subword splitting model on the words. This is the part of the pipeline that needs to be trained on your corpus (or that has been trained if you are using a pretrained tokenizer). The role of the model is to split the words into subwords to reduce the size of the vocabulary and try to reduce the number of out-of-vocabulary tokens. Several subword tokenization algorithms exist, including BPE, Unigram, and WordPiece.
- Postprocessing: This is the last step of the tokenization pipeline, in which some additional transformations can be applied to the list of tokens - for instance, adding special tokens at the beginning or end of the input sequence of token indices.
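The first two steps can be inspected directly on a fast (Rust-backed) tokenizer via its backend_tokenizer attribute; a small sketch using the BERT tokenizer loaded above (the example strings are arbitrary):
# Normalization (the cased BERT checkpoint applies only light normalization)
print(bert_tokenizer.backend_tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))
# Pretokenization: split the text into "words" along with character offsets
print(bert_tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("time flies like an arrow"))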
The SentencePiece Tokenizer
The SentencePiece tokenizer is based on a type of subword segmentation called Unigram and encodes each input text as a sequence of Unicode characters. This last feature is especially useful for multilingual corpora since it allows SentencePiece to be agnostic about accents, punctuation, and the fact that many languages do not have whitespace characters.
"".join(xlmr_tokens).replace(u"\u2581", " ")
Transformers for Named Entity Recognition
NER is often framed as a token classification task since each individual token is fed into the same fully connected layer to output the entity of the token.
The Anatomy of the Transformers Model Class
🤗 Transformers is organized around dedicated classes for each architecture and task. 🤗 Transformers is designed to enable you to easily extend existing models for your specific use case. You can load the weights from pretrained models, and you have access to task-specific helper functions. This lets you build custom models for specific objectives with very little overhead.
Bodies and Heads
🤗 Transformers splits the architecture into a body and a head. When we switch from the pretraining task to the downstream task, we need to replace the last layer of the model with one that is suitable for the task. This last layer is called the model head; it's the part that is task-specific. The rest of the model is called the body; it includes the token embeddings and transformer layers that are task-agnostic.
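A small sketch of the body/head split using the Auto classes (reusing the bert-base-cased checkpoint and tokenizer loaded earlier; num_labels=7 is an arbitrary choice for illustration, and the token classification head starts out randomly initialized):
from transformers import AutoModel, AutoModelForTokenClassification
body_only = AutoModel.from_pretrained(bert_model_name)
with_head = AutoModelForTokenClassification.from_pretrained(bert_model_name, num_labels=7)
enc = bert_tokenizer("Jack Sparrow loves New York!", return_tensors="pt")
# The body returns task-agnostic hidden states ...
print(body_only(**enc).last_hidden_state.shape)  # [1, seq_len, hidden_size]
# ... while the head maps them to per-token class logits
print(with_head(**enc).logits.shape)             # [1, seq_len, num_labels]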
Creating a Custom Model for Token Classification
"""
Building a Custom Token Classification head for XLM-R
"""
import torch.nn as nn
from transformers import XLMRobertaConfig
from transformers.modeling_outputs import TokenClassifierOutput
from transformers.models.roberta.modeling_roberta import RobertaModel
from transformers.models.roberta.modeling_roberta import RobertaPreTrainedModel
class XLMRobertaForTokenClassification(RobertaPreTrainedModel):
# Ensures that the standard XLM-R settings are used when we
# initialize a new model.
config_class = XLMRobertaConfig
def __init__(self, config):
# Call the initialization function of the RobertaPreTrainedModel class
super().__init__(config)
self.num_labels = config.num_labels
# Load model body
self.roberta = RobertaModel(config, add_pooling_layer=False)
# Set up token classification head
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.classifier = nn.Linear(config.hidden_size, config.num_labels)
# Load and initialize weights
self.init_weights()
def forward(self, input_ids=None, attention_mask=None, token_type_ids=None,labels=None, **kwargs):
# Use model body to get encoder representations
outputs = self.roberta(input_ids, attention_mask=attention_mask,token_type_ids=token_type_ids, **kwargs)
# Apply classifier to encoder representation
sequence_output = self.dropout(outputs[0])
logits = self.classifier(sequence_output)
# Calculate losses
loss = None
if labels is not None:
loss_fct = nn.CrossEntropyLoss()
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
# Return model output object
return TokenClassifierOutput(loss=loss, logits=logits, hidden_states=outputs.hidden_states,attentions=outputs.attentions)
Loading a Custom Model
index2tag = {idx: tag for idx, tag in enumerate(tags.names)}
tag2index = {tag: idx for idx, tag in enumerate(tags.names)}
from transformers import AutoConfig
xlmr_config = AutoConfig.from_pretrained(xlmr_model_name,num_labels=tags.num_classes,id2label=index2tag, label2id=tag2index)
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
xlmr_model = (XLMRobertaForTokenClassification.from_pretrained(xlmr_model_name, config=xlmr_config).to(device))
input_ids = xlmr_tokenizer.encode(text, return_tensors="pt")
pd.DataFrame([xlmr_tokens, input_ids[0].numpy()], index=["Tokens", "Input IDs"])
outputs = xlmr_model(input_ids.to(device)).logits
predictions = torch.argmax(outputs, dim=-1)
print(f"Number of tokens in sequence: {len(xlmr_tokens)}")
print(f"Shape of outputs: {outputs.shape}")
preds = [tags.names[p] for p in predictions[0].cpu().numpy()]
pd.DataFrame([xlmr_tokens, preds], index=["Tokens", "Tags"])
def tag_text(text, tags, model, tokenizer):
# Get tokens with special characters
tokens = tokenizer(text).tokens()
# Encode the sequence into IDs
input_ids = xlmr_tokenizer(text, return_tensors="pt").input_ids.to(device)
# Get predictions as distribution over 7 possible classes
outputs = model(input_ids)[0]
# Take argmax to get most likely class per token
predictions = torch.argmax(outputs, dim=2)
# Convert to DataFrame
preds = [tags.names[p] for p in predictions[0].cpu().numpy()]
return pd.DataFrame([tokens, preds], index=["Tokens", "Tags"])
Tokenizing Texts for NER
🤗 Datasets provides a fast way to tokenize a Dataset object with the map() operation.
words, labels = de_example["tokens"], de_example["ner_tags"]
tokenized_input = xlmr_tokenizer(de_example["tokens"], is_split_into_words=True)
tokens = xlmr_tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
pd.DataFrame([tokens], index=["Tokens"])
word_ids = tokenized_input.word_ids()
pd.DataFrame([tokens, word_ids], index=["Tokens", "Word IDs"])
previous_word_idx = None
label_ids = []
for word_idx in word_ids:
if word_idx is None or word_idx == previous_word_idx:
label_ids.append(-100)
elif word_idx != previous_word_idx:
label_ids.append(labels[word_idx])
previous_word_idx = word_idx
labels = [index2tag[l] if l != -100 else "IGN" for l in label_ids]
index = ["Tokens", "Word IDs", "Label IDs", "Labels"]
pd.DataFrame([tokens, word_ids, label_ids, labels], index=index)
def tokenize_and_align_labels(examples):
tokenized_inputs = xlmr_tokenizer(examples["tokens"], truncation=True,
is_split_into_words=True)
labels = []
for idx, label in enumerate(examples["ner_tags"]):
word_ids = tokenized_inputs.word_ids(batch_index=idx)
previous_word_idx = None
label_ids = []
for word_idx in word_ids:
if word_idx is None or word_idx == previous_word_idx:
label_ids.append(-100)
else:
label_ids.append(label[word_idx])
previous_word_idx = word_idx
labels.append(label_ids)
tokenized_inputs["labels"] = labels
return tokenized_inputs
def encode_panx_dataset(corpus):
return corpus.map(tokenize_and_align_labels, batched=True, remove_columns=['langs', 'ner_tags', 'tokens'])
panx_de_encoded = encode_panx_dataset(panx_ch["de"])
Performance Measures
Evaluating a NER model is similar to evaluating a text classification model, and it is common to report results for precision, recall, and F1-score. The only subtlety is that all words of an entity need to be predicted correctly in order for a prediction to be counted as correct. There is a library called seqeval that is designed for these kinds of tasks.
!pip install seqeval
from seqeval.metrics import classification_report
y_true = [["O", "O", "O", "B-MISC", "I-MISC", "I-MISC", "O"],
["B-PER", "I-PER", "O"]]
y_pred = [["O", "O", "B-MISC", "I-MISC", "I-MISC", "I-MISC", "O"],
["B-PER", "I-PER", "O"]]
print(classification_report(y_true, y_pred))
seqeval expects the predictions and labels as lists of lists, with each list corresponding to a single example in our validation or test sets.
import numpy as np
def align_predictions(predictions, label_ids):
preds = np.argmax(predictions, axis=2)
batch_size, seq_len = preds.shape
labels_list, preds_list = [], []
for batch_idx in range(batch_size):
example_labels, example_preds = [], []
for seq_idx in range(seq_len):
# Ignore label IDs = -100
if label_ids[batch_idx, seq_idx] != -100:
example_labels.append(index2tag[label_ids[batch_idx][seq_idx]])
example_preds.append(index2tag[preds[batch_idx][seq_idx]])
labels_list.append(example_labels)
preds_list.append(example_preds)
return preds_list, labels_list
Fine-Tuning XLM-RoBERTa
Our first strategy will be to fine-tune the base model on the German subset of PAN-X and then evaluate its zero-shot cross-lingual performance on French, Italian, and English.
"""
Here we evaluate the model's predictions on the validation set
at the end of every epoch, tweak the weight decay, and set save_steps to a large number to disable checkpointing and speed up training
"""
from transformers import TrainingArguments
num_epochs = 3
batch_size = 24
logging_steps = len(panx_de_encoded["train"]) // batch_size
model_name = f"{xlmr_model_name}-finetuned-panx-de"
training_args = TrainingArguments(output_dir=model_name, log_level="error", num_train_epochs=num_epochs, per_device_train_batch_size=batch_size, per_device_eval_batch_size=batch_size, evaluation_strategy="epoch", save_steps=1e6, weight_decay=0.01, disable_tqdm=False, logging_steps=logging_steps, push_to_hub=False)
from seqeval.metrics import f1_score
def compute_metrics(eval_pred):
y_pred, y_true = align_predictions(eval_pred.predictions, eval_pred.label_ids)
return {"f1": f1_score(y_true, y_pred)}
"""
The final step is to define a data collator so we can pad each input sequence
to the largest sequence in a batch. Padding the labels is necessary because the
labels are also sequences.
"""
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(xlmr_tokenizer)
def model_init():
"""
Method loads an untrained model and is called at the beginning of
the train() call
"""
return (XLMRobertaForTokenClassification
.from_pretrained(xlmr_model_name, config=xlmr_config)
.to(device))
from transformers import Trainer
trainer = Trainer(model_init=model_init, args=training_args,data_collator=data_collator, compute_metrics=compute_metrics,train_dataset=panx_de_encoded["train"],eval_dataset=panx_de_encoded["validation"],tokenizer=xlmr_tokenizer)
trainer.train()
text_de = "Jeff Dean ist ein Informatiker bei Google in Kalifornien"
tag_text(text_de, tags, trainer.model, xlmr_tokenizer)
Error Analysis
A thorough error analysis of your model is one of the most important aspects when training and debugging transformers. There are several failure modes where it might look like the model is performing well, while in practice it has some serious flaws. Examples where training can fail:
- We might accidentally mask too many tokens and also mask some of our labels to get a really promising loss drop
- The compute_metrics() function might have a bug that overestimates the true performance
- We might include the zero class or O entity in NER as a normal class, which will heavily skew the accuracy and F1-score since it is the majority class by a large margin.
When the model performs much worse than expected, looking at the errors can yield useful insights and reveal bugs that would be hard to spot just by looking at the code. Error analysis is still useful when the model performs well. For this error analysis, we will use one of the most powerful tools: looking at the validation examples with the highest loss.
from torch.nn.functional import cross_entropy
def forward_pass_with_label(batch):
# Convert dict of lists to list of dicts suitable for data collator
features = [dict(zip(batch, t)) for t in zip(*batch.values())]
# Pad inputs and labels and put all tensors on device
batch = data_collator(features)
input_ids = batch["input_ids"].to(device)
attention_mask = batch["attention_mask"].to(device)
labels = batch["labels"].to(device)
with torch.no_grad():
# Pass data through model
output = trainer.model(input_ids, attention_mask)
# logit.size: [batch_size, sequence_length, classes]
# Predict class with largest logit value on classes axis
predicted_label = torch.argmax(output.logits, axis=-1).cpu().numpy()
# Calculate loss per token after flattening batch dimension with view
loss = cross_entropy(output.logits.view(-1, 7),
labels.view(-1), reduction="none")
# Unflatten batch dimension and convert to numpy array
loss = loss.view(len(input_ids), -1).cpu().numpy()
return {"loss":loss, "predicted_label": predicted_label}
valid_set = panx_de_encoded["validation"]
valid_set = valid_set.map(forward_pass_with_label, batched=True, batch_size=32)
df = valid_set.to_pandas()
index2tag[-100] = "IGN"
df["input_tokens"] = df["input_ids"].apply(
lambda x: xlmr_tokenizer.convert_ids_to_tokens(x))
df["predicted_label"] = df["predicted_label"].apply(
lambda x: [index2tag[i] for i in x])
df["labels"] = df["labels"].apply(
lambda x: [index2tag[i] for i in x])
df['loss'] = df.apply(
lambda x: x['loss'][:len(x['input_ids'])], axis=1)
df['predicted_label'] = df.apply(
lambda x: x['predicted_label'][:len(x['input_ids'])], axis=1)
# print(df.head(1))
df_tokens = df.apply(pd.Series.explode)
df_tokens = df_tokens.query("labels != 'IGN'")
df_tokens["loss"] = df_tokens["loss"].astype(float).round(2)
# print(df_tokens.head(7))
# print(
# df_tokens.groupby("input_tokens")[["loss"]]
# .agg(["count", "mean", "sum"])
# .droplevel(level=0, axis=1) # Get rid of multi-level columns
# .sort_values(by="sum", ascending=False)
# .reset_index()
# .round(2)
# .head(10)
# .T
# )
# print(
# df_tokens.groupby("labels")[["loss"]]
# .agg(["count", "mean", "sum"])
# .droplevel(level=0, axis=1)
# .sort_values(by="mean", ascending=False)
# .reset_index()
# .round(2)
# .T
# )
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
import matplotlib.pyplot as plt
def plot_confusion_matrix(y_preds, y_true, labels):
cm = confusion_matrix(y_true, y_preds, normalize="true")
fig, ax = plt.subplots(figsize=(6, 6))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot(cmap="Blues", values_format=".2f", ax=ax, colorbar=False)
plt.title("Normalized confusion matrix")
plt.show()
plot_confusion_matrix(df_tokens["labels"], df_tokens["predicted_label"],tags.names)
def get_samples(df):
"""
Write a function that helps us display the token sequences with the labels
and the losses
"""
for _, row in df.iterrows():
labels, preds, tokens, losses = [], [], [], []
for i, mask in enumerate(row["attention_mask"]):
if i not in {0, len(row["attention_mask"])}:
labels.append(row["labels"][i])
preds.append(row["predicted_label"][i])
tokens.append(row["input_tokens"][i])
losses.append(f"{row['loss'][i]:.2f}")
df_tmp = pd.DataFrame({"tokens": tokens, "labels": labels,"preds": preds, "losses": losses}).T
yield df_tmp
df["total_loss"] = df["loss"].apply(sum)
df_tmp = df.sort_values(by="total_loss", ascending=False).head(3)
# for sample in get_samples(df_tmp):
# display(sample)
df_tmp = df.loc[df["input_tokens"].apply(lambda x: u"\u2581(" in x)].head(2)
# for sample in get_samples(df_tmp):
# display(sample)
Cross-Lingual Transfer
Now that we have fine-tuned XLM-R on German, we can evaluate its ability to transfer to other languages via the predict() method of the Trainer.
def get_f1_score(trainer, dataset):
"""
Function to help us evaluate other languages
"""
return trainer.predict(dataset).metrics["test_f1"]
f1_scores = defaultdict(dict)
f1_scores["de"]["de"] = get_f1_score(trainer, panx_de_encoded["test"])
print(f"F1-score of [de] model on [de] dataset: {f1_scores['de']['de']:.3f}")
text_fr = "Jeff Dean est informaticien chez Google en Californie"
tag_text(text_fr, tags, trainer.model, xlmr_tokenizer)
def evaluate_lang_performance(lang, trainer):
panx_ds = encode_panx_dataset(panx_ch[lang])
return get_f1_score(trainer, panx_ds["test"])
f1_scores["de"]["fr"] = evaluate_lang_performance("fr", trainer)
print(f"F1-score of [de] model on [fr] dataset: {f1_scores['de']['fr']:.3f}")
f1_scores["de"]["it"] = evaluate_lang_performance("it", trainer)
print(f"F1-score of [de] model on [it] dataset: {f1_scores['de']['it']:.3f}")
f1_scores["de"]["en"] = evaluate_lang_performance("en", trainer)
print(f"F1-score of [de] model on [en] dataset: {f1_scores['de']['en']:.3f}")
Why Does Zero-Shot Transfer Make Sense?