Natural Language Processing w/ Transformers: Transformer Anatomy and Multilingual NER

Chapters 3 and 4 of Natural Language Processing with Transformers go into depth on the architecture of the Transformer and walk through a multilingual named entity recognition (NER) task.

Chapter 3: Transformer Anatomy

This chapter explores the main building blocks of transformer models and how to implement them using PyTorch. While a deep technical understanding of the Transformer architecture is generally not necessary to use 🤗 Transformers and fine-tune models for your use case, it can be helpful for comprehending and navigating the limitations of transformers and using them in new domains.

The Transformer Architecture

The original Transformer is based on the encoder-decoder architecture that is widely used for tasks like machine translation. The architecture consists of two components:

  1. Encoder: Converts an input sequence of tokens into a sequence of embedding vectors, often called the hidden state or context.
  2. Decoder: Uses the encoder's hidden state to iteratively generate an output sequence of tokens, one token at a time.

Encoder-Decoder Architecture of the Transformer

Things that characterize the Transformer architecture:

  • The input text is tokenized and converted to token embeddings using the techniques covered in the last chapter. Since the attention mechanism is not aware of the relative positions of the tokens, we need a way to inject some information about token positions into the input to model the sequential nature of the text. The token embeddings are thus combined with positional embeddings that contain positional information for each token.
  • The encoder is composed of a stack of encoder layers or "blocks", which is analogous to stacking convolutional layers in computer vision. The same is true of the decoder, which has its own stack of decoder layers.
  • The encoder's output is fed to each decoder layer, and the decoder then generates a prediction for the most probable next token in the sequence. The output of this step is then fed back into the decoder to generate the next token, and so on until a special end-of-sequence (EOS) token is reached.

The Transformer architecture was originally designed for sequence-to-sequence tasks like machine translation, but both the encoder and decoder blocks were soon adapted as standalone models. Although there are hundreds of different transformer models, most of them belong to one of three types:

  • Encoder-only: These models convert an input sequence of text into a rich numerical representation that is well suited for tasks like text classification or named entity recognition. BERT and its variants belong to this class of architecture. The representation computed for a given token in this architecture depends both on the left (before the token) and the right (after the token) contexts. This is often called bidirectional attention.
  • Decoder-only: These models will autocomplete a sequence by iteratively predicting the most probable next word. The GPT family of models belongs to this class. The representation computed for a given token in this architecture depends only on the left context. This is often called causal or autoregressive attention.
  • Encoder-decoder: These are used for modeling complex mappings from one sequence of text to another; they're suitable for machine translation and summarization tasks.

In reality, the distinction between encoder-only and decoder-only tasks is a bit blurry.

The Encoder

The encoder consists of many encoder layers stacked next to each other. Each encoder layer receives a sequence of embeddings and feeds them through the following sublayers:

  • A multi-head self-attention layer
  • A fully connected feed-forward layer that is applied to each input embedding

The output embeddings of each encoder layer have the same size as the inputs, and we'll soon see that the main role of the encoder stack is to "update" the input embeddings to produce representations that encode some contextual information in the sequence. Each of these sublayers also uses skip connections and layer normalization, which are standard tricks to train deep neural networks effectively.

Encoding Layer

Self-Attention

Attention is a mechanism that allows neural networks to assign a different amount of weight or "attention" to each element in a sequence. The "self" part of self-attention refers to the fact that these weights are computed for all hidden states in the same set. The main idea behind self-attention is that instead of using a fixed embedding for each token, we can use the whole sequence to compute a weighted average of each embedding. Given a sequence of embeddings $x_1, \ldots, x_n$, self-attention produces a sequence of new embeddings $x_1', \ldots, x_n'$ where each $x_i'$ is a linear combination of all the $x_j$:

$x_i' = \sum_{j=1}^n w_{ji} x_j$

The coefficients $w_{ji}$ are called attention weights and are normalized so that $\sum_j w_{ji} = 1$. Embeddings that are generated using the words around them are called contextual embeddings and predate the invention of transformers.
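
As a tiny concrete example of this weighted average (made-up numbers, not from the book):

import torch

# Three token embeddings with two dimensions each (made-up values)
x = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# Attention weights for token i = 0; they sum to 1
w = torch.tensor([0.7, 0.2, 0.1])
# The new embedding x_0' is a linear combination of all the x_j
x0_new = (w.unsqueeze(1) * x).sum(dim=0)
x0_new  # tensor([0.8000, 0.3000])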

Self Attention producing Contextualized Embeddings

Scaled dot-product attention

There are several ways to implement a self-attention layer, but the most common one is scaled dot-product attention. The four main steps to implement this mechanism are:

  1. Project each token embedding into three vectors called query, key, and value.
  2. Compute attention scores. Determine how much the query and key vectors relate to each other using a similarity function. The similarity function for scaled dot-product attention is the dot product, computed efficiently using matrix multiplication of the embeddings. Queries and keys that are similar will have a large dot product, while those that don't share much in common will have little to no overlap. The outputs of this step are called attention scores, and for a sequence with $n$ input tokens, there is a corresponding $n \times n$ matrix of attention scores.
  3. Compute attention weights. Attention scores are first multiplied by a scaling factor to normalize their variance and then normalized with a softmax to ensure all the column values sum to 1. The resulting $n \times n$ matrix contains all the attention weights $w_{ji}$.
  4. Update the token embeddings. Once the attention weights are computed, we multiply them by the value vectors $v_1, \ldots, v_n$ to obtain an updated representation for embedding $x_i' = \sum_j w_{ji} v_j$.

BertViz for Jupyter allows us to visualize how attention weights are calculated.

"""
Visualizing Attention Weights
"""
!pip install bertviz
from transformers import AutoTokenizer
from bertviz.transformers_neuron_view import BertModel
from bertviz.neuron_view import show

model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BertModel.from_pretrained(model_ckpt)
text = "time flies like an arrow"
show(model, "bert", tokenizer, text, display_mode="light", layer=0, head=8)
out[2]

(BertViz renders an interactive neuron view of the query-key attention computation for layer 0, head 8.)

Scaled Dot-Product Attention

"""
Tokenize Text
"""
# add_special_tokens=False to exclude the [CLS] and [SEP] tokens
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
inputs.input_ids
out[4]

tensor([[ 2051, 10029, 2066, 2019, 8612]])

Dense in this context means that each entry in the embedding contains a nonzero value.

from torch import nn
from transformers import AutoConfig

config = AutoConfig.from_pretrained(model_ckpt)
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)
token_emb
out[6]

Embedding(30522, 768)

We use the AutoConfig class to load the config.json file associated with the bert-base-uncased checkpoint. In 🤗 Transformers, every checkpoint is assigned a configuration file that specifies various hyperparameters like vocab_size and hidden_size, which in our example shows us that each input ID will be mapped to one of the 30,522 embedding vectors stored in nn.Embedding, each with a size of 768. The AutoConfig class also stores additional metadata, such as the label names, which are used to format the model's predictions. Token embeddings at this point are independent of their context. The role of the subsequent attention layers will be to mix these token embeddings to disambiguate and inform the representation of each token with the content of its context.

"""
Generate the embeddings by feeding in the Input IDs
"""
inputs_embeds = token_emb(inputs.input_ids)
# Produces a tensor of shape [batch_size, seq_len, hidden_dim]
inputs_embeds.size()
out[8]

torch.Size([1, 5, 768])

"""
Create query, key, value vectors and calculate attention
scores using the dot product as the similarity function
"""
import torch
from math import sqrt

query = key = value = inputs_embeds
dim_k = key.size(-1)
# batch matrix-matrix product
scores = torch.bmm(query, key.transpose(1,2)) / sqrt(dim_k)
# Creates a 5 x 5 matrix of attention scores per sample in the batch
scores.size()

"""
Apply Softmax
"""
import torch.nn.functional as F

weights = F.softmax(scores, dim=-1)
weights.sum(dim=-1)

"""
Multiply Attention Weights by the Values
"""
attn_outputs = torch.bmm(weights, value)
attn_outputs.shape

"""
Wrap the self-attention steps into a function to use
later
"""
def scaled_dot_product_attention(query, key, value):
  """
  Applies self-attention. The whole process is just two
  matrix multiplications and a softmax - self-attention is
  just a fancy form of averaging
  """
  dim_k = query.size(-1)
  scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
  weights = F.softmax(scores, dim=-1)
  return torch.bmm(weights, value)
out[9]

The current attention mechanism with equal query and key vectors will assign a very large score to identical words in the context, and in particular to the current word itself. In practice, the meaning of a word will be better informed by complementary words in the context than by identical words. Let's allow the model to create a different set of vectors for the query, key, and value of a token by using three different linear projections to project the initial token vector into three different spaces.

Multi-headed Attention

In practice, the self-attention layer applies three independent linear transformations to each embedding to generate the query, key, and value vectors. These transformations project the embeddings, and each projection carries its own set of learnable parameters, which allows the self-attention layer to focus on different semantic aspects of the sequence.

It turns out to be beneficial to have multiple sets of linear projections, each one representing a so-called attention head. The resulting multi-head attention layer is shown below. The softmax of one head tends to focus on one aspect of similarity. Having several heads allows the model to focus on several aspects at once.

Multi-head Attention

class AttentionHead(nn.Module):
  """
  Initialize three independent linear layers that apply matrix
  multiplication to the embedding vectors to produce tensors of shape
  [batch_size, seq_len, head_dim], where head_dim is the number of
  dimensions we are projecting into. In practice, embed_dim is a
  multiple of head_dim (head_dim = embed_dim divided by the number of
  heads) so that the computation across each head is constant.
  """
  def __init__(self,embed_dim,head_dim):
    super().__init__()
    self.q = nn.Linear(embed_dim, head_dim)
    self.k = nn.Linear(embed_dim, head_dim)
    self.v = nn.Linear(embed_dim, head_dim)

  def forward(self, hidden_state):
    attn_outputs = scaled_dot_product_attention(self.q(hidden_state), self.k(hidden_state), self.v(hidden_state))
    return attn_outputs

class MultiHeadAttention(nn.Module):
  """
  The concatenated output from the attention heads is also fed through
  a final linear layer to produce an output tensor of shape [batch_size,
  seq_len, hidden_dim] that is suitable for the feed-forward network
  downstream.
  """
  def __init__(self, config):
    super().__init__()
    embed_dim = config.hidden_size
    num_heads = config.num_attention_heads
    head_dim = embed_dim // num_heads
    self.heads = nn.ModuleList(
        [AttentionHead(embed_dim, head_dim) for _ in range(num_heads)]
    )
    self.output_linear = nn.Linear(embed_dim, embed_dim)

  def forward(self, hidden_state):
    x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)
    x = self.output_linear(x)
    return x

multihead_attn = MultiHeadAttention(config)
attn_output = multihead_attn(inputs_embeds)
attn_output.size()

"""
Visualizing multi-head attention
"""
from bertviz import head_view
from transformers import AutoModel

model = AutoModel.from_pretrained(model_ckpt, output_attentions=True)

sentence_a = "time flies like an arrow"
sentence_b = "fruit flies like a banana"

viz_inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt')
attention = model(**viz_inputs).attentions
sentence_b_start = (viz_inputs.token_type_ids == 0).sum(dim=1)
tokens = tokenizer.convert_ids_to_tokens(viz_inputs.input_ids[0])

head_view(attention, tokens, sentence_b_start, heads=[8])
out[11]

(BertViz renders an interactive head view of the attention weights for the pair of sentences.)

The Feed-Forward Layer

The feed-forward sublayer in the encoder and decoder is just a simple two-layer fully connected neural network, but with a twist: instead of processing the whole sequence of embeddings as a single vector, it processes each embedding independently. For this reason, this layer is often referred to as a position-wise feed-forward layer. (It may also be referred to as a one-dimensional convolution with a kernel size of one). A rule of thumb from the literature is for the hidden size of the first layer to be four times the size of the embeddings, and a GELU activation function is most commonly used. This is where most of the capacity and memorization is hypothesized to happen, and it's the part that is most often scaled when scaling up the models.

class FeedForward(nn.Module):
  def __init__(self, config):
    super().__init__()
    self.linear_1 = nn.Linear(config.hidden_size, config.intermediate_size)
    self.linear_2 = nn.Linear(config.intermediate_size, config.hidden_size)
    self.gelu = nn.GELU()
    self.dropout = nn.Dropout(config.hidden_dropout_prob)

  def forward(self, x):
    x = self.linear_1(x)
    x = self.gelu(x)
    x = self.linear_2(x)
    x = self.dropout(x)
    return x

feed_forward = FeedForward(config)
ff_outputs = feed_forward(attn_outputs)
ff_outputs.size()
out[13]

torch.Size([1, 5, 768])

Adding Layer Normalization

The Transformer architecture makes use of layer normalization and skip connections. The former normalizes each input in the batch to have zero mean and unit variance. Skip connections pass a tensor to the next layer of the model without processing and add it to the processed tensor. There are two options for placing the layer normalization:

  1. Post layer normalization: Places layer normalization between the skip connections. This arrangement is tricky to train from scratch, as the gradients can diverge (a sketch of this variant follows the encoder layer code below).
  2. Pre layer normalization: The most common arrangement found in the literature; it places layer normalization within the span of the skip connections. It tends to be much more stable during training.

Layer Normalization

class TransformerEncoderLayer(nn.Module):
  """
  Uses Pre-layer normalization
  """
  def __init__(self, config):
    super().__init__()
    self.layer_norm_1 = nn.LayerNorm(config.hidden_size)
    self.layer_norm_2 = nn.LayerNorm(config.hidden_size)
    self.attention = MultiHeadAttention(config)
    self.feed_forward = FeedForward(config)

  def forward(self, x):
    # Apply layer normalization and then copy input into query, key, value
    hidden_state = self.layer_norm_1(x)
    # Apply attention with a skip connection
    x = x + self.attention(hidden_state)
    # Apply feed-forward layer with a skip connection
    x = x + self.feed_forward(self.layer_norm_2(x))
    return x

encoder_layer = TransformerEncoderLayer(config)
inputs_embeds.shape, encoder_layer(inputs_embeds).size()
out[15]

(torch.Size([1, 5, 768]), torch.Size([1, 5, 768]))
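
For comparison, here is a minimal sketch (not from the book) of the post layer normalization arrangement mentioned above, reusing the MultiHeadAttention and FeedForward modules defined earlier; the normalization is applied after each skip connection rather than before the sublayers:

class TransformerEncoderLayerPostLN(nn.Module):
  """
  Hypothetical post-layer-normalization variant, shown only to contrast
  with the pre-layer-normalization encoder layer above.
  """
  def __init__(self, config):
    super().__init__()
    self.layer_norm_1 = nn.LayerNorm(config.hidden_size)
    self.layer_norm_2 = nn.LayerNorm(config.hidden_size)
    self.attention = MultiHeadAttention(config)
    self.feed_forward = FeedForward(config)

  def forward(self, x):
    # Normalization is applied after each skip connection
    x = self.layer_norm_1(x + self.attention(x))
    x = self.layer_norm_2(x + self.feed_forward(x))
    return x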

Positional Embeddings

Positional embeddings are based on a simple, yet very effective idea: augment the token embeddings with a position-dependent pattern of values arranged in a vector. If the pattern is characteristic for each position, the attention heads and feed-forward layers in each stack can learn to incorporate positional information into their transformations.

There are several ways to achieve this, and one of the most popular approaches is to use a learnable pattern, especially when the pretraining dataset is sufficiently large. This works exactly the same way as the token embeddings, but using the position index instead of the token ID as input. With that approach, an efficient way of encoding the positions of tokens is learned during pretraining.

class Embeddings(nn.Module):
  """
  Custom Embeddings module that combines a token embedding layer
  that projects the input_ids to a dense hidden state together with the
  positional embedding that does the same for position_ids. The resulting
  embedding is simply the sum of both embeddings
  """
  def __init__(self, config):
    super().__init__()
    self.token_embeddings = nn.Embedding(config.vocab_size,config.hidden_size)
    self.position_embeddings = nn.Embedding(config.max_position_embeddings,config.hidden_size)
    self.layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
    self.dropout = nn.Dropout()

  def forward(self, input_ids):
    # Create position IDs for input sequence
    seq_length = input_ids.size(1)
    position_ids = torch.arange(seq_length, dtype=torch.long).unsqueeze(0)
    # Create token and position embeddings
    token_embeddings = self.token_embeddings(input_ids)
    position_embeddings = self.position_embeddings(position_ids)
    # Combine token and position embeddings
    embeddings = token_embeddings + position_embeddings
    embeddings = self.layer_norm(embeddings)
    embeddings = self.dropout(embeddings)
    return embeddings

embedding_layer = Embeddings(config)
embedding_layer(inputs.input_ids).size()
out[17]

torch.Size([1, 5, 768])

The embedding layer now creates a single, dense embedding for each token. Alternatives to position embeddings:

  • Absolute positional representations: Transformer models can use static patterns consisting of modulated sine and cosine signals to encode the positions of the tokens (a minimal sketch follows this list).
  • Relative positional representations: While absolute positions are important, one can argue that when computing an embedding, the surrounding tokens matter most. Relative positional representations follow this intuition and encode the relative positions between tokens.
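
A minimal sketch of the first alternative, following the sine/cosine formulation of the original Transformer paper and assuming an even hidden size (this helper is not part of the book's code):

import math

def sinusoidal_position_embeddings(seq_len, hidden_size):
  # Each position is encoded by sine and cosine signals of different frequencies
  position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
  div_term = torch.exp(torch.arange(0, hidden_size, 2, dtype=torch.float)
                       * (-math.log(10000.0) / hidden_size))
  pe = torch.zeros(seq_len, hidden_size)
  pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
  pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
  return pe  # shape [seq_len, hidden_size]

sinusoidal_position_embeddings(5, config.hidden_size).size()
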
class TransformerEncoder(nn.Module):
  """
  Full transformer encoder combining the embeddings with the encoder
  layers.
  """
  def __init__(self, config):
    super().__init__()
    self.embeddings = Embeddings(config)
    self.layers = nn.ModuleList([TransformerEncoderLayer(config)
                                  for _ in range(config.num_hidden_layers)])

  def forward(self, x):
    x = self.embeddings(x)
    for layer in self.layers:
        x = layer(x)
    return x

encoder = TransformerEncoder(config)
encoder(inputs.input_ids).size()
out[19]

torch.Size([1, 5, 768])

Adding a Classification Head

Transformer models are usually divided into a task-independent body and a task-specific head.

class TransformerForSequenceClassification(nn.Module):
  """
  The following class extends the existing encoder for sequence
  classification.
  """
  def __init__(self, config):
    super().__init__()
    self.encoder = TransformerEncoder(config)
    self.dropout = nn.Dropout(config.hidden_dropout_prob)
    self.classifier = nn.Linear(config.hidden_size, config.num_labels)

  def forward(self, x):
    x = self.encoder(x)[:, 0, :] # select hidden state of [CLS] token
    x = self.dropout(x)
    x = self.classifier(x)
    return x

config.num_labels = 3
encoder_classifier = TransformerForSequenceClassification(config)
encoder_classifier(inputs.input_ids).size()
out[21]

torch.Size([1, 3])

The Decoder

The main difference between the decoder and encoder is that the decoder has two attention sublayers:

  • Masked multi-head self-attention layer: Ensures that the tokens we generate at each timestep are only based on the past outputs and the current token being predicted.
  • Encoder-decoder attention layer: Performs multi-head attention over the output key and value vectors of the encoder stack, with the intermediate representations of the decoder acting as the queries. This way the encoder-decoder attention layer learns how to relate tokens from two different sequences, such as two different languages (a sketch follows the masked attention function below).

The trick with masked self-attention is to introduce a mask matrix with ones on and below the diagonal and zeros above.

seq_len = inputs.input_ids.size(-1)
# Creates a lower triangular matrix
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)
print(mask[0])
scores.masked_fill(mask == 0, -float("inf"))
out[23]

tensor([[1., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1.]])

tensor([[[28.3440,    -inf,    -inf,    -inf,    -inf],
         [-1.0478, 28.8185,    -inf,    -inf,    -inf],
         [-0.1825, -2.4190, 27.8747,    -inf,    -inf],
         [ 0.5176,  2.7035,  0.4991, 28.0570,    -inf],
         [ 0.3036,  0.5276,  0.2197, -0.7923, 27.1878]]],
       grad_fn=<MaskedFillBackward0>)

Transformer Decoder

By setting the upper values to negative infinity, we guarantee that the attention weights are all zero once we take the softmax over the scores.

def scaled_dot_product_attention(query, key, value, mask=None):
  """
  Scaled dot-product attention function including
  the masking behavior
  """
  dim_k = query.size(-1)
  scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
  if mask is not None:
      scores = scores.masked_fill(mask == 0, float("-inf"))
  weights = F.softmax(scores, dim=-1)
  return weights.bmm(value)
out[25]
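
The encoder-decoder attention sublayer described above can reuse this same function; only the origin of the queries, keys, and values changes. Below is a rough sketch (not from the book) with random stand-in tensors:

# Stand-ins for real hidden states: the decoder's intermediate representations
# act as the queries, while the encoder's output supplies the keys and values
encoder_output = torch.randn(1, 5, 768)  # [batch_size, src_len, hidden_dim]
decoder_state = torch.randn(1, 3, 768)   # [batch_size, tgt_len, hidden_dim]
# No causal mask is needed here: the decoder may attend to every input token
cross_attn = scaled_dot_product_attention(decoder_state, encoder_output, encoder_output)
cross_attn.shape  # torch.Size([1, 3, 768])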

Meet the Transformers

There are three main architectures for transformer models: encoders, decoders, and encoder-decoders.

Prominent Transformer Architectures

The Encoder Branch

Encoder-only models still dominate research and industry on NLU tasks such as text classification, named entity recognition, and question answering.

The Decoder Branch

The progress on transformer decoder models has been spearheaded to a large extent by OpenAI. These models are exceptionally good at predicting the next word in a sequence and are thus mostly used for text generation tasks.

The Encoder-Decoder Branch

Although it has become more common to build models using a single encoder or decoder stack, there are several encoder-decoder variants of the Transformer architecture that have novel applications across both NLU and NLG domains.

Chapter 4: Multilingual Named Entity Recognition

By pretraining on huge corpora across many languages, multilingual transformers enable zero-shot cross-lingual transfer. This means that a model fine-tuned on one language can be applied to others without further training. NER is a common NLP task that identifies entities like people, organizations, or locations in text. Zero-shot transfer or zero-shot learning usually refers to the task of training a model on one set of labels and then evaluating it on a different set of labels.

The Dataset

We use a subset of the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark called WikiANN or PAN-X. This dataset consists of Wikipedia articles in many languages. Each article is annotated with LOC (location), PER (person), and ORG (organization) tags in the inside-outside-beginning (IOB) format. In this format, a B- prefix indicates the beginning of an entity, and consecutive tokens belonging to the same entity are given an I- prefix. An O tag indicates the token does not belong to any entity.
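
As a quick illustration of the tagging scheme (a made-up example sentence, not taken from the dataset):

tokens = ["Jeff", "Dean", "works", "at", "Google", "in", "California"]
# "Jeff Dean" is a person, "Google" an organization, "California" a location
iob_tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC"]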

"""
Need to know which dataset configuration to pass to load_dataset
"""
!pip install datasets
from datasets import get_dataset_config_names
# Figure out which subsets are available
xtreme_subsets = get_dataset_config_names("xtreme")
print(f"XTREME has {len(xtreme_subsets)} configurations")
out[28]

XTREME has 183 configurations

"""
Narrow search of subsets
"""
panx_subsets = [s for s in xtreme_subsets if s.startswith("PAN")]
panx_subsets[:3]
out[29]

['PAN-X.af', 'PAN-X.ar', 'PAN-X.bg']

# Load the German corpus
from datasets import load_dataset

load_dataset("xtreme", name="PAN-X.de")
out[30]

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 20000
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 10000
    })
})

from collections import defaultdict
from datasets import DatasetDict

langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]
# Return a DatasetDict if a key doesn't exist
panx_ch = defaultdict(DatasetDict)

for lang, frac in zip(langs, fracs):
  # Load monolingual corpus
  ds = load_dataset("xtreme", name=f"PAN-X.{lang}")
  # Shuffle and downsample each split according to spoken proportion
  for split in ds:
    panx_ch[lang][split] = (
        ds[split]
        .shuffle(seed=0)
        .select(range(int(frac * ds[split].num_rows))))

import pandas as pd

pd.DataFrame({lang: [panx_ch[lang]["train"].num_rows] for lang in langs},index=["Number of training examples"])
out[31]

                             de     fr    it    en
Number of training examples  12580  4580  1680  1180

element = panx_ch["de"]["train"][0]
for key, value in element.items():
  print(f"{key}: {value}")
out[32]

tokens: ['2.000', 'Einwohnern', 'an', 'der', 'Danziger', 'Bucht', 'in', 'der', 'polnischen', 'Woiwodschaft', 'Pommern', '.']
ner_tags: [0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0]
langs: ['de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de']

for key, value in panx_ch["de"]["train"].features.items():
  print(f"{key}: {value}")
out[33]

tokens: Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)
ner_tags: Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], id=None), length=-1, id=None)
langs: Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)

tags = panx_ch["de"]["train"].features["ner_tags"].feature
print(tags)
out[34]

ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], id=None)

def create_tag_names(batch):
  return {"ner_tags_str": [tags.int2str(idx) for idx in batch["ner_tags"]]}
panx_de = panx_ch["de"].map(create_tag_names)
out[35]

de_example = panx_de["train"][0]
pd.DataFrame([de_example["tokens"], de_example["ner_tags_str"]],['Tokens', 'Tags'])
out[36]

        0      1           2   3    4         5      6   7    8           9             10       11
Tokens  2.000  Einwohnern  an  der  Danziger  Bucht  in  der  polnischen  Woiwodschaft  Pommern  .
Tags    O      O           O   O    B-LOC     I-LOC  O   O    B-LOC       B-LOC         I-LOC    O

from collections import Counter
split2freqs = defaultdict(Counter)
for split, dataset in panx_de.items():
  for row in dataset["ner_tags_str"]:
    for tag in row:
      if tag.startswith("B"):
        tag_type = tag.split("-")[1]
        split2freqs[split][tag_type] += 1
pd.DataFrame.from_dict(split2freqs, orient="index")
out[37]

            LOC   ORG   PER
train       6186  5366  5810
validation  3172  2683  2893
test        3180  2573  3071

Multilingual Transformers

Multilingual transformers involve similar architectures and training procedures as their monolingual counterparts, except that the corpus used for pretraining consists of documents in many languages. A remarkable feature of this approach is that, despite receiving no explicit information to differentiate among the languages, the resulting linguistic representations are able to generalize well across languages for a variety of downstream tasks. Multilingual transformer models are usually evaluated in three different ways:

  1. en: Fine-tune on the English training data and then evaluate on each language's test set
  2. each: Fine-tune and evaluate on monolingual test data to measure per-language performance
  3. all: Fine-tune on all the training data and evaluate on each language's test set.

A Closer Look at Tokenization

XLM-R uses a tokenizer called SentencePiece that is trained on the raw text of all one hundred languages.

"""
A close look at tokenization
"""
from transformers import AutoTokenizer

bert_model_name = "bert-base-cased"
xlmr_model_name = "xlm-roberta-base"
bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
xlmr_tokenizer = AutoTokenizer.from_pretrained(xlmr_model_name)

"""
retrieve special tokens
"""
text = "Jack Sparrow loves New York!"
bert_tokens = bert_tokenizer(text).tokens()
xlmr_tokens = xlmr_tokenizer(text).tokens()
out[39]

The Tokenizer Pipeline

Tokenization is actually a full processing pipeline that usually consists of four steps:

Tokenization Pipeline

  1. Normalization: Steps corresponding to the set of operations you apply to a raw string to make it "cleaner". Common operations include stripping whitespace and removing accented characters. Unicode normalization is another common normalization operation applied by many tokenizers to deal with the fact that there often exist various ways to write the same character.
  2. Pretokenization: This step splits the text into smaller objects that give an upper bound to what your tokens will be at the end of training. A good way to think of this is that the pretokenizer will split your text into "words", and your final tokens will be parts of those words. (A short sketch of these first two steps follows the list.)
  3. Tokenizer model: Once the input words are normalized and pretokenized, the tokenizer applies a subword splitting model on the words. This is the part of the pipeline that needs to be trained on your corpus (or that has been trained if you are using a pretrained tokenizer). The role of the model is to split the words into subwords to reduce the size of the vocabulary and the number of out-of-vocabulary tokens. Several subword tokenization algorithms exist, including BPE, Unigram, and WordPiece.
  4. Postprocessing: This is the last step of the tokenization pipeline, in which some additional transformations can be applied to the list of tokens - for instance, adding special tokens at the beginning or end of the input sequence of token indices.
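
As a small sketch of the first two steps, fast tokenizers expose the underlying pipeline through their backend_tokenizer attribute (the exact output depends on how the checkpoint's tokenizer is configured):

# Inspect the normalization and pretokenization steps of the BERT tokenizer
# loaded earlier (bert-base-cased)
backend = bert_tokenizer.backend_tokenizer
print(backend.normalizer.normalize_str("Héllo, hôw are ü?"))
print(backend.pre_tokenizer.pre_tokenize_str("Jack Sparrow loves New York!"))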

The SentencePiece Tokenizer

The SentencePiece tokenizer is based on a type of subword segmentation called Unigram and encodes each input text as a sequence of Unicode characters. This last feature is especially useful for multilingual corpora since it allows SentencePiece to be agnostic about accents, punctuation, and the fact that many languages do not have whitespace characters.

"".join(xlmr_tokens).replace(u"\u2581", " ")
out[41]

'<s> Jack Sparrow loves New York!</s>'

Transformers for Named Entity Recognition

NER is often framed as a token classification task since each individual token is fed into the same fully connected layer to output the entity of the token.

Fine-tuning an Encoder-Based Transformer for NER

The Anatomy of the Transformers Model Class

🤗 Transformers is organized around dedicated classes for each architecture and task. 🤗 Transformers is designed to enable you to easily extend existing models for your specific use case. You can load the weights from pretrained models, and you have access to task-specific helper functions. This lets you build custom models for specific objectives with very little overhead.

Bodies and Heads

🤗 Transformers splits the architecture into a body and a head. When we switch from the pretraining task to the downstream task, we need to replace the last layer of the model with one that is suitable for the task. This last layer is called the model head; it's the part that is task-specific. The rest of the model is called the body; it includes the token embeddings and transformer layers, which are task-agnostic.

Creating a Custom Model for Token Classification

"""
Building a Custom Token Classification head for XLM-R
"""
import torch.nn as nn
from transformers import XLMRobertaConfig
from transformers.modeling_outputs import TokenClassifierOutput
from transformers.models.roberta.modeling_roberta import RobertaModel
from transformers.models.roberta.modeling_roberta import RobertaPreTrainedModel

class XLMRobertaForTokenClassification(RobertaPreTrainedModel):
  # Ensures that the standard XLM-R settings are used when we
  # initialize a new model.
  config_class = XLMRobertaConfig

  def __init__(self, config):
    # Call the initialization function of the RobertaPreTrainedModel class
    super().__init__(config)
    self.num_labels = config.num_labels
    # Load model body
    self.roberta = RobertaModel(config, add_pooling_layer=False)
    # Set up token classification head
    self.dropout = nn.Dropout(config.hidden_dropout_prob)
    self.classifier = nn.Linear(config.hidden_size, config.num_labels)
    # Load and initialize weights
    self.init_weights()

  def forward(self, input_ids=None, attention_mask=None, token_type_ids=None,labels=None, **kwargs):
    # Use model body to get encoder representations
    outputs = self.roberta(input_ids, attention_mask=attention_mask,token_type_ids=token_type_ids, **kwargs)
    # Apply classifier to encoder representation
    sequence_output = self.dropout(outputs[0])
    logits = self.classifier(sequence_output)
    # Calculate losses
    loss = None
    if labels is not None:
      loss_fct = nn.CrossEntropyLoss()
      loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
    # Return model output object
    return TokenClassifierOutput(loss=loss, logits=logits, hidden_states=outputs.hidden_states,attentions=outputs.attentions)
out[43]

Loading a Custom Model

index2tag = {idx: tag for idx, tag in enumerate(tags.names)}
tag2index = {tag: idx for idx, tag in enumerate(tags.names)}
out[45]
from transformers import AutoConfig

xlmr_config = AutoConfig.from_pretrained(xlmr_model_name,num_labels=tags.num_classes,id2label=index2tag, label2id=tag2index)
out[46]
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
xlmr_model = (XLMRobertaForTokenClassification.from_pretrained(xlmr_model_name, config=xlmr_config).to(device))

input_ids = xlmr_tokenizer.encode(text, return_tensors="pt")
pd.DataFrame([xlmr_tokens, input_ids[0].numpy()], index=["Tokens", "Input IDs"])
out[47]

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

           0    1      2      3      4      5  6     7      8   9
Tokens     <s>  ▁Jack  ▁Spar  row    ▁love  s  ▁New  ▁York  !   </s>
Input IDs  0    21763  37456  15555  5161   7  2356  5753   38  2

outputs = xlmr_model(input_ids.to(device)).logits
predictions = torch.argmax(outputs, dim=-1)
print(f"Number of tokens in sequence: {len(xlmr_tokens)}")
print(f"Shape of outputs: {outputs.shape}")
out[48]

Number of tokens in sequence: 10
Shape of outputs: torch.Size([1, 10, 7])

preds = [tags.names[p] for p in predictions[0].cpu().numpy()]
pd.DataFrame([xlmr_tokens, preds], index=["Tokens", "Tags"])
out[49]

        0      1      2      3      4      5      6      7      8      9
Tokens  <s>    ▁Jack  ▁Spar  row    ▁love  s      ▁New   ▁York  !      </s>
Tags    I-PER  B-LOC  B-LOC  B-LOC  B-LOC  B-LOC  B-LOC  B-LOC  B-LOC  I-PER

def tag_text(text, tags, model, tokenizer):
  # Get tokens with special characters
  tokens = tokenizer(text).tokens()
  # Encode the sequence into IDs
  input_ids = xlmr_tokenizer(text, return_tensors="pt").input_ids.to(device)
  # Get predictions as distribution over 7 possible classes
  outputs = model(input_ids)[0]
  # Take argmax to get most likely class per token
  predictions = torch.argmax(outputs, dim=2)
  # Convert to DataFrame
  preds = [tags.names[p] for p in predictions[0].cpu().numpy()]
  return pd.DataFrame([tokens, preds], index=["Tokens", "Tags"])
out[50]

Tokenizing Texts for NER

🤗 Datasets provides a fast way to tokenize a Dataset object with the map() operation.

words, labels = de_example["tokens"], de_example["ner_tags"]

tokenized_input = xlmr_tokenizer(de_example["tokens"], is_split_into_words=True)
tokens = xlmr_tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
pd.DataFrame([tokens], index=["Tokens"])
out[52]

        0    1       2           3  4    5     6     7   8    9      ...  15   16  17   18      19   20    21  22  23  24
Tokens  <s>  ▁2.000  ▁Einwohner  n  ▁an  ▁der  ▁Dan  zi  ger  ▁Buch  ...  ▁Wo  i   wod  schaft  ▁Po  mmer  n   ▁   .   </s>

[1 rows x 25 columns]

word_ids = tokenized_input.word_ids()
pd.DataFrame([tokens, word_ids], index=["Tokens", "Word IDs"])
out[53]

          0     1       2           3  4    5     6     7   8    9      ...  15   16  17   18      19   20    21  22  23  24
Tokens    <s>   ▁2.000  ▁Einwohner  n  ▁an  ▁der  ▁Dan  zi  ger  ▁Buch  ...  ▁Wo  i   wod  schaft  ▁Po  mmer  n   ▁   .   </s>
Word IDs  None  0       1           1  2    3     4     4   4    5      ...  9    9   9    9       10   10    10  11  11  None

[2 rows x 25 columns]

previous_word_idx = None
label_ids = []

for word_idx in word_ids:
    if word_idx is None or word_idx == previous_word_idx:
        label_ids.append(-100)
    elif word_idx != previous_word_idx:
        label_ids.append(labels[word_idx])
    previous_word_idx = word_idx

labels = [index2tag[l] if l != -100 else "IGN" for l in label_ids]
index = ["Tokens", "Word IDs", "Label IDs", "Labels"]

pd.DataFrame([tokens, word_ids, label_ids, labels], index=index)
out[54]

           0     1       2           3     4    5     6      7     8     9      ...  15     16    17    18      19     20    21    22  23    24
Tokens     <s>   ▁2.000  ▁Einwohner  n     ▁an  ▁der  ▁Dan   zi    ger   ▁Buch  ...  ▁Wo    i     wod   schaft  ▁Po    mmer  n     ▁   .     </s>
Word IDs   None  0       1           1     2    3     4      4     4     5      ...  9      9     9     9       10     10    10    11  11    None
Label IDs  -100  0       0           -100  0    0     5      -100  -100  6      ...  5      -100  -100  -100    6      -100  -100  0   -100  -100
Labels     IGN   O       O           IGN   O    O     B-LOC  IGN   IGN   I-LOC  ...  B-LOC  IGN   IGN   IGN     I-LOC  IGN   IGN   O   IGN   IGN

[4 rows x 25 columns]

def tokenize_and_align_labels(examples):
  tokenized_inputs = xlmr_tokenizer(examples["tokens"], truncation=True,
                                    is_split_into_words=True)
  labels = []
  for idx, label in enumerate(examples["ner_tags"]):
    word_ids = tokenized_inputs.word_ids(batch_index=idx)
    previous_word_idx = None
    label_ids = []
    for word_idx in word_ids:
      if word_idx is None or word_idx == previous_word_idx:
        label_ids.append(-100)
      else:
        label_ids.append(label[word_idx])
      previous_word_idx = word_idx
    labels.append(label_ids)
  tokenized_inputs["labels"] = labels
  return tokenized_inputs

def encode_panx_dataset(corpus):
  return corpus.map(tokenize_and_align_labels, batched=True, remove_columns=['langs', 'ner_tags', 'tokens'])

panx_de_encoded = encode_panx_dataset(panx_ch["de"])
out[55]

Performance Measures

Evaluating a NER model is similar to evaluating a text classification model, and it is common to report results for precision, recall, and $F_1$-score. The only subtlety is that all words of an entity need to be predicted correctly in order for a prediction to be counted as correct. There is a library called seqeval that is designed for these kinds of tasks.

!pip install seqeval
from seqeval.metrics import classification_report

y_true = [["O", "O", "O", "B-MISC", "I-MISC", "I-MISC", "O"],
          ["B-PER", "I-PER", "O"]]
y_pred = [["O", "O", "B-MISC", "I-MISC", "I-MISC", "I-MISC", "O"],
          ["B-PER", "I-PER", "O"]]
print(classification_report(y_true, y_pred))
out[57]

              precision    recall  f1-score   support

        MISC       0.00      0.00      0.00         1
         PER       1.00      1.00      1.00         1

   micro avg       0.50      0.50      0.50         2
   macro avg       0.50      0.50      0.50         2
weighted avg       0.50      0.50      0.50         2

seqeval expects the predictions and labels as lists of lists, with each list corresponding to a single example in our validation or test sets.

import numpy as np

def align_predictions(predictions, label_ids):
  preds = np.argmax(predictions, axis=2)
  batch_size, seq_len = preds.shape
  labels_list, preds_list = [], []

  for batch_idx in range(batch_size):
    example_labels, example_preds = [], []
    for seq_idx in range(seq_len):
      # Ignore label IDs = -100
      if label_ids[batch_idx, seq_idx] != -100:
        example_labels.append(index2tag[label_ids[batch_idx][seq_idx]])
        example_preds.append(index2tag[preds[batch_idx][seq_idx]])

    labels_list.append(example_labels)
    preds_list.append(example_preds)

  return preds_list, labels_list
out[59]
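
As a quick sanity check (the toy input below is made up, not from the book), we can call align_predictions() on a fake logits array with the shape Trainer.predict() returns, [batch_size, sequence_length, num_classes], and confirm that the -100 positions are dropped and the remaining IDs are mapped back to tag names:

# Hypothetical batch of one sequence with four positions and seven classes
fake_logits = np.zeros((1, 4, 7))
fake_logits[0, 1, 5] = 1.0  # highest logit for class 5 (B-LOC) at position 1
fake_logits[0, 2, 6] = 1.0  # highest logit for class 6 (I-LOC) at position 2
fake_label_ids = np.array([[-100, 5, 6, -100]])  # special tokens masked with -100

toy_preds, toy_labels = align_predictions(fake_logits, fake_label_ids)
print(toy_preds, toy_labels)  # [['B-LOC', 'I-LOC']] [['B-LOC', 'I-LOC']]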

Fine-Tuning XLM-RoBERTa

Our first strategy will be to fine-tune the base model on the German subset of PAN-X and then evaluate its zero-shot cross-lingual performance on French, Italian, and English.

"""
Here we evaluate the model's predictions on the validation set
at the end of every epoch, tweak the weight decay, and set save_steps to a large number to disable checkpointsg and speed up training
"""
from transformers import TrainingArguments

num_epochs = 3
batch_size = 24
logging_steps = len(panx_de_encoded["train"]) // batch_size
model_name = f"{xlmr_model_name}-finetuned-panx-de"
training_args = TrainingArguments(
    output_dir=model_name,
    log_level="error",
    num_train_epochs=num_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    evaluation_strategy="epoch",
    save_steps=1e6,
    weight_decay=0.01,
    disable_tqdm=False,
    logging_steps=logging_steps,
    push_to_hub=False)

from seqeval.metrics import f1_score

def compute_metrics(eval_pred):
  y_pred, y_true = align_predictions(eval_pred.predictions, eval_pred.label_ids)
  return {"f1": f1_score(y_true, y_pred)}

"""
The final step is to define a data collator so we can pad each input sequence
to the longest sequence in a batch. Padding the labels is necessary because the
labels are also sequences.
"""

from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(xlmr_tokenizer)
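
To make the padding behaviour concrete, here is a tiny sketch with made-up token IDs (the collator call itself is the standard 🤗 Transformers API): by default, DataCollatorForTokenClassification pads the label sequences with -100, so the padded positions are ignored by the loss just like the subword continuations:

# Hypothetical mini-batch with sequences of different lengths
features = [{"input_ids": [0, 23, 2], "labels": [-100, 5, -100]},
            {"input_ids": [0, 23, 48, 91, 2], "labels": [-100, 5, 6, 0, -100]}]
batch = data_collator(features)
print(batch["labels"])
# The first example's labels are padded on the right with -100
# (the collator's default label_pad_token_id) to match the longest sequence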

def model_init():
  """
  Method loads an untrained model and is called at the beginning of
  the train() call
  """
  return (XLMRobertaForTokenClassification
          .from_pretrained(xlmr_model_name, config=xlmr_config)
          .to(device))

from transformers import Trainer

trainer = Trainer(model_init=model_init, args=training_args,
                  data_collator=data_collator, compute_metrics=compute_metrics,
                  train_dataset=panx_de_encoded["train"],
                  eval_dataset=panx_de_encoded["validation"],
                  tokenizer=xlmr_tokenizer)

trainer.train()
out[61]

/usr/local/lib/python3.10/dist-packages/transformers/training_args.py:1525: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
warnings.warn(

TrainOutput(global_step=1575, training_loss=0.15328798036726693, metrics={'train_runtime': 529.4819, 'train_samples_per_second': 71.277, 'train_steps_per_second': 2.975, 'total_flos': 862324400720376.0, 'train_loss': 0.15328798036726693, 'epoch': 3.0})

text_de = "Jeff Dean ist ein Informatiker bei Google in Kalifornien"
tag_text(text_de, tags, trainer.model, xlmr_tokenizer)
out[62]

        0    1      2      3      4     5     6           7    8     9        10   11          12     13
Tokens  <s>  ▁Jeff  ▁De    an     ▁ist  ▁ein  ▁Informati  ker  ▁bei  ▁Google  ▁in  ▁Kaliforni  en     </s>
Tags    O    B-PER  I-PER  I-PER  O     O     O           O    O     B-ORG    O    B-LOC       I-LOC  O

Error Analysis

A thorough error analysis of your model is one of the most important aspects of training and debugging transformers. There are several failure modes where it might look like the model is performing well, while in practice it has some serious flaws. Examples of ways training can fail:

  • We might accidentally mask too many tokens, and also mask some of our labels, and get a really promising drop in the loss
  • The compute_metrics() function might have a bug that overestimates the true performance
  • We might include the O class, or "outside" entity, in NER as a normal class, which will heavily skew the accuracy and F1-score since it is the majority class by a large margin (see the short sketch below)

When the model performs much worse than expected, looking at the errors can yield useful insights and reveal bugs that would be hard to spot just by looking at the code. Error analysis is still useful when the model performs well. For this error analysis, we will use one of the most powerful tools: looking at the validation examples with the highest loss.
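
To see the third failure mode in numbers, here is a rough sketch with made-up tags (not the model's predictions) showing how a degenerate tagger that only ever predicts O can still score well on token-level accuracy simply because O dominates the data:

import numpy as np
from sklearn.metrics import accuracy_score

true_tags = np.array(["O", "O", "B-PER", "I-PER", "O", "O", "O", "O"])
pred_tags = np.array(["O"] * 8)  # a model that never predicts an entity

# Accuracy over all tokens is flattered by the majority O class
print(accuracy_score(true_tags, pred_tags))  # 0.75
# Restricting to the true entity tokens reveals it gets none of them right
entity_mask = true_tags != "O"
print(accuracy_score(true_tags[entity_mask], pred_tags[entity_mask]))  # 0.0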

from torch.nn.functional import cross_entropy

def forward_pass_with_label(batch):
  # Convert dict of lists to list of dicts suitable for data collator
  features = [dict(zip(batch, t)) for t in zip(*batch.values())]
  # Pad inputs and labels and put all tensors on device
  batch = data_collator(features)
  input_ids = batch["input_ids"].to(device)
  attention_mask = batch["attention_mask"].to(device)
  labels = batch["labels"].to(device)
  with torch.no_grad():
    # Pass data through model
    output = trainer.model(input_ids, attention_mask)
    # logit.size: [batch_size, sequence_length, classes]
    # Predict class with largest logit value on classes axis
    predicted_label = torch.argmax(output.logits, axis=-1).cpu().numpy()
  # Calculate loss per token after flattening batch dimension with view
  loss = cross_entropy(output.logits.view(-1, 7),
                        labels.view(-1), reduction="none")
  # Unflatten batch dimension and convert to numpy array
  loss = loss.view(len(input_ids), -1).cpu().numpy()

  return {"loss":loss, "predicted_label": predicted_label}

valid_set = panx_de_encoded["validation"]
valid_set = valid_set.map(forward_pass_with_label, batched=True, batch_size=32)
df = valid_set.to_pandas()

index2tag[-100] = "IGN"
df["input_tokens"] = df["input_ids"].apply(
    lambda x: xlmr_tokenizer.convert_ids_to_tokens(x))
df["predicted_label"] = df["predicted_label"].apply(
    lambda x: [index2tag[i] for i in x])
df["labels"] = df["labels"].apply(
    lambda x: [index2tag[i] for i in x])
df['loss'] = df.apply(
    lambda x: x['loss'][:len(x['input_ids'])], axis=1)
df['predicted_label'] = df.apply(
    lambda x: x['predicted_label'][:len(x['input_ids'])], axis=1)
# print(df.head(1))

df_tokens = df.apply(pd.Series.explode)
df_tokens = df_tokens.query("labels != 'IGN'")
df_tokens["loss"] = df_tokens["loss"].astype(float).round(2)
# print(df_tokens.head(7))

# print(
#   df_tokens.groupby("input_tokens")[["loss"]]
#   .agg(["count", "mean", "sum"])
#   .droplevel(level=0, axis=1)  # Get rid of multi-level columns
#   .sort_values(by="sum", ascending=False)
#   .reset_index()
#   .round(2)
#   .head(10)
#   .T
# )

# print(
#     df_tokens.groupby("labels")[["loss"]]
#     .agg(["count", "mean", "sum"])
#     .droplevel(level=0, axis=1)
#     .sort_values(by="mean", ascending=False)
#     .reset_index()
#     .round(2)
#     .T
# )

from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
import matplotlib.pyplot as plt
def plot_confusion_matrix(y_preds, y_true, labels):
  cm = confusion_matrix(y_true, y_preds, normalize="true")
  fig, ax = plt.subplots(figsize=(6, 6))
  disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
  disp.plot(cmap="Blues", values_format=".2f", ax=ax, colorbar=False)
  plt.title("Normalized confusion matrix")
  plt.show()


plot_confusion_matrix(df_tokens["labels"], df_tokens["predicted_label"], tags.names)
out[64]

Normalized confusion matrix of true vs. predicted NER tags

def get_samples(df):
  """
  Helper that displays the token sequences together with their labels,
  predictions, and per-token losses
  """
  for _, row in df.iterrows():
    labels, preds, tokens, losses = [], [], [], []
    for i, mask in enumerate(row["attention_mask"]):
      # Skip the <s> token at the start of the sequence
      if i not in {0, len(row["attention_mask"])}:
        labels.append(row["labels"][i])
        preds.append(row["predicted_label"][i])
        tokens.append(row["input_tokens"][i])
        losses.append(f"{row['loss'][i]:.2f}")
    df_tmp = pd.DataFrame({"tokens": tokens, "labels": labels,
                           "preds": preds, "losses": losses}).T
    yield df_tmp

df["total_loss"] = df["loss"].apply(sum)
df_tmp = df.sort_values(by="total_loss", ascending=False).head(3)

# for sample in get_samples(df_tmp):
#   display(sample)
out[65]
df_tmp = df.loc[df["input_tokens"].apply(lambda x: u"\u2581(" in x)].head(2)
# for sample in get_samples(df_tmp):
#   display(sample)
out[66]

Cross-Lingual Transfer

Now that we have fine-tuned XLM-R on German, we can evaluate its ability to transfer to other languages via the predict() method of the Trainer.

def get_f1_score(trainer, dataset):
  """
  Function to help us evaluate other languages
  """
  return trainer.predict(dataset).metrics["test_f1"]
out[68]
from collections import defaultdict

f1_scores = defaultdict(dict)
f1_scores["de"]["de"] = get_f1_score(trainer, panx_de_encoded["test"])
print(f"F1-score of [de] model on [de] dataset: {f1_scores['de']['de']:.3f}")
out[69]

F1-score of [de] model on [de] dataset: 0.866

text_fr = "Jeff Dean est informaticien chez Google en Californie"
tag_text(text_fr, tags, trainer.model, xlmr_tokenizer)
out[70]

        0    1      2      3      4     5            6    7      8        9    10     11     12     13
Tokens  <s>  ▁Jeff  ▁De    an     ▁est  ▁informatic  ien  ▁chez  ▁Google  ▁en  ▁Cali  for    nie    </s>
Tags    O    B-PER  I-PER  I-PER  O     O            O    O      B-ORG    O    B-LOC  I-LOC  I-LOC  O

def evaluate_lang_performance(lang, trainer):
  panx_ds = encode_panx_dataset(panx_ch[lang])
  return get_f1_score(trainer, panx_ds["test"])

f1_scores["de"]["fr"] = evaluate_lang_performance("fr", trainer)
print(f"F1-score of [de] model on [fr] dataset: {f1_scores['de']['fr']:.3f}")

f1_scores["de"]["it"] = evaluate_lang_performance("it", trainer)
print(f"F1-score of [de] model on [it] dataset: {f1_scores['de']['it']:.3f}")

f1_scores["de"]["en"] = evaluate_lang_performance("en", trainer)
print(f"F1-score of [de] model on [en] dataset: {f1_scores['de']['en']:.3f}")
out[71]

F1-score of [de] model on [fr] dataset: 0.702

F1-score of [de] model on [it] dataset: 0.688

F1-score of [de] model on [en] dataset: 0.583

Why Does Zero-Shot Transfer Make Sense?

out[73]