Hugging Face NLP Course Chapters 1-2
I am taking the Hugging Face NLP Course because it is considered a prerequisite for "Natural Language Processing with Transformers". These first two chapters cover the basics of the Hugging Face APIs used for natural language processing.
Transformer Models
Natural Language Processing
NLP is a field of linguistics and machine learning focused on understanding everything related to human language. The aim of NLP tasks is not only to understand single words individually, but to be able to understand the context of those words.
Transformers, what can they do?
The Hugging Face Transformers Library provides the functionality to create and use shared transformer models. The most basic object in the Hugging Face Transformers library is the pipeline() function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input text and get an intelligible answer:
from transformers import pipeline
access_token = "<your_hf_access_token>"  # only needed for gated or private models
"""
Pipeline selects a pretrained model that has been fine tuned for sentiment
analysis in English. The model is downloaded and cached when you create the
classifier object.
"""
classifier = pipeline("sentiment-analysis",model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("I've been waiting for a HuggingFace course my whole life."))
# Passing In Several Sentences
print(classifier(
["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
))
There are three main steps involved when you pass some text to a pipeline:
- The text is processed into a format the model can understand.
- The preprocessed inputs are passed to the model.
- The predictions of the model are post-processed, so you can make sense of them.
Currently available pipelines:
- feature-extraction (get the vector representation of text)
- fill-mask
- The idea of this task is to fill in the blanks in a given text.
- The top_k argument controls how many possibilities you want to be displayed. Note that here the model fills in the special mask word, which is often referred to as a mask token. Other mask-filling models might have different mask tokens, so it's always good to verify the proper mask word before exploring other models.
- ner (named entity recognition)
- Named entity recognition (NER) is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations.
- The grouped_entities parameter in the pipeline creation function tells the pipeline whether or not to regroup together the parts of the sentence that correspond to the same entity.
- question-answering
- The question-answering pipeline answers questions using information from a given context. Note that this pipeline works by extracting information from the provided context; it does not generate the answer.
- sentiment-analysis
- summarization
- The task of reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text.
- Like with text generation, you can specify a max_length or a min_length for the result.
- text-generation
- The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text.
- You can choose a particular model from the Hub to use in a pipeline
for a specific task.
- translation
- For translation, you can use a default model if you provide a language pair in the task name (such as "translation_en_to_fr"), but the easiest way is to pick the model you want to use on the Model Hub (see the example after the summarization snippet below)
- Like with text generation and summarization, you can specify a max_length or a min_length for the result
- zero-shot-classification
- We want to classify texts that haven't been labelled. This is a common scenario in real-world projects because annotating text is time-consuming and requires domain expertise.
- The zero-shot-classification pipeline is very powerful for this: it allows you to specify which labels to use for the classification, so you don't have to rely on the labels of the pretrained model. The pipeline is called zero-shot because you don't need to fine-tune the model on your data to use it. It can directly return probability scores for any list of labels you want.
"""
Zero Shot Classification
"""
classifier = pipeline("zero-shot-classification")
classifier(
"This is a course about the Transformers library",
candidate_labels=["education", "politics", "business"],
)
"""
Text generation
"""
generator = pipeline("text-generation")
generator("In this course, we will teach you how to")
generator = pipeline("text-generation", model="distilgpt2")
generator(
"In this course, we will teach you how to",
max_length=30,
num_return_sequences=2,
)
"""
Mask Filling
"""
from transformers import pipeline
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)
"""
Named Entity Recognition
"""
from transformers import pipeline
ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
"""
Question Answering
"""
from transformers import pipeline
question_answerer = pipeline("question-answering")
question_answerer(
question="Where do I work?",
context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)
from transformers import pipeline
summarizer = pipeline("summarization")
summarizer(
"""
America has changed dramatically during recent years. Not only has the number of
graduates in traditional engineering disciplines such as mechanical, civil,
electrical, chemical, and aeronautical engineering declined, but in most of
the premier American universities engineering curricula now concentrate on
and encourage largely the study of engineering science. As a result, there
are declining offerings in engineering subjects dealing with infrastructure,
the environment, and related issues, and greater concentration on high
technology subjects, largely supporting increasingly complex scientific
developments. While the latter is important, it should not be at the expense
of more traditional engineering.
Rapidly developing economies such as China and India, as well as other
industrial countries in Europe and Asia, continue to encourage and advance
the teaching of engineering. Both China and India, respectively, graduate
six and eight times as many traditional engineers as does the United States.
Other industrial countries at minimum maintain their output, while America
suffers an increasingly serious decline in the number of engineering graduates
and a lack of well-educated engineers.
"""
)
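Translation isn't covered by the snippets above, so here is a minimal sketch, assuming the Helsinki-NLP/opus-mt-en-fr checkpoint from the Hub (a MarianMT English-to-French model) rather than the task's default model.
"""
Translation
"""
from transformers import pipeline
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
print(translator("This course is produced by Hugging Face."))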
How do Transformers work?
The Transformer Architecture was introduced in June 2017. The focus of the original research was translation tasks. This was followed by the introduction of several influential models, including:
- June 2018: GPT, the first pretrained Transformer model, used for fine-tuning on various NLP tasks and obtained state-of-the-art results
- October 2018: BERT, another large pretrained model, this one designed to produce better summaries of sentences
- February 2019: GPT-2, an improved (and bigger) version of GPT that was not immediately publicly released due to ethical concerns
- October 2019: DistilBERT, a distilled version of BERT that is 60% faster, 40% lighter in memory, and still retains 97% of BERT's performance
- October 2019: BART and T5, two large pretrained models using the same architecture as the original Transformer model
- May 2020: GPT-3, an even bigger version of GPT-2 that is able to perform well on a variety of tasks without the need for fine-tuning (called zero-shot learning)
Broadly, the kinds of Transformer models can be grouped into three categories:
- GPT-like (also called auto-regressive Transformer models)
- BERT-like (also called auto-encoding Transformer models)
- BART/T5-like (also called sequence-to-sequence Transformer models)
All the Transformer models mentioned above have been trained as language models - they have been trained on large amounts of raw text in a self-supervised fashion. Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model. This type of model develops a statistical understanding of the language it has been trained on, but it's not very useful for specific practical tasks. Because of this, the general pretrained model then goes through a process called transfer learning. During this process, the model is fine-tuned in a supervised way - that is, using human-annotated labels - on a specific task.
Encoder Models
Encoder models use only the encoder of a Transformer model. At each stage, the attention layers can access all the words in the initial sentence. These models are often characterized as having "bi-directional" attention, and are often called auto-encoding models. The pretraining of these models revolves around somehow corrupting a given sentence - for instance, by masking some of the words in it - and tasking the model with finding or reconstructing the initial sentence. Encoder models are best suited for tasks requiring an understanding of the full sentence, such as sentence classification, named entity recognition (word classification), and extractive question answering.
Decoder Models
Decoder models use only the decoder of a Transformer model. At each stage, for a given word, the attention layers can only access the words positioned before it in the sentence. These models are often called auto-regressive models. The pretraining of decoder models usually revolves around predicting the next word in the sentence. These models are best suited for tasks involving text generation.
Sequence-to-Sequence Models
Encoder-decoder models (also called sequence-to-sequence models) use both parts of the Transformer architecture. At each stage, the attention layers of the encoder can access all the words in the initial sentence, whereas the attention layers of the decoder can only access the words positioned before a given word in the input. The pretraining of these models can be done using the objectives of encoder or decoder models, but usually involves something a bit more complex. For instance, T5 is pretrained by replacing random spans of text with a single mask special word, and the objective is then to predict the text that this mask word replaces. These models are best suited for tasks revolving around generating new sentences depending on a given input, such as summarization, translation, or generative question answering.
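To make this concrete, here is a minimal sketch of a sequence-to-sequence model in use, assuming the t5-small checkpoint and the text2text-generation pipeline; T5 is steered by a task prefix in the prompt.
from transformers import pipeline
# t5-small is a small encoder-decoder checkpoint; the "translate English
# to French:" prefix tells it which task to perform.
text2text = pipeline("text2text-generation", model="t5-small")
print(text2text("translate English to French: This course is great."))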
Bias and Limitations
Pretrained and fine-tuned models come with limitations. The biggest of these is that, to enable pretraining on large amounts of data, researchers often scrape all the content they can find, taking the best as well as the worst of what is available on the internet. When you use these tools, you therefore need to keep in the back of your mind that the original model you are using could very easily generate sexist, racist, or homophobic content. Fine-tuning the model on your data won't make this intrinsic bias disappear.
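The course illustrates this with a fill-mask example; below is a minimal sketch along those lines, assuming the bert-base-uncased checkpoint and its [MASK] token.
from transformers import pipeline
# Compare the top completions for two sentences that differ only in the
# gendered word; the suggested occupations tend to differ noticeably.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print([r["token_str"] for r in unmasker("This man works as a [MASK].")])
print([r["token_str"] for r in unmasker("This woman works as a [MASK].")])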
Summary
Model | Examples | Tasks
---|---|---
Encoder | ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa | Sentence classification, named entity recognition, extractive question answering
Decoder | CTRL, GPT, GPT-2, Transformer XL | Text generation
Encoder-decoder | BART, T5, Marian, mBART | Summarization, translation, generative question answering
Using 🤗 Transformers
Transformer models are usually very large. With millions to tens of billions of parameters, training and deploying these models is a complicated undertaking. Furthermore, with new models being released on a near-daily basis and each having its own implementation, trying them all out is no easy task.
The 🤗 Transformers library was created to solve this problem. Its goal is to provide a single API through which any Transformer model can be loaded, trained, and saved. At their core, all models are simple PyTorch nn.Module or TensorFlow tf.keras.Model classes and can be handled like any other model in their respective machine learning (ML) frameworks.
Behind the Pipeline
Trying to figure out what happens behind the scenes when executing the following code:
from transformers import pipeline
classifier = pipeline("sentiment-analysis",model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")
classifier(
[
"I've been waiting for a HuggingFace course my whole life.",
"I hate this so much!",
]
)
The pipeline groups together three steps: preprocessing, passing the inputs through the model, and postprocessing:
Preprocessing with a Tokenizer
Like other neural networks, Transformer models can't process raw text directly, so the first step of our pipeline is to convert the text inputs into numbers that the model can make sense of. To do this, we use a tokenizer, which will be responsible for:
- Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
- Mapping each token to an integer
- Adding additional inputs that may be useful to the model
All this preprocessing needs to be done in exactly the same way as when the model was pretrained, so we first need to download that information from the Model Hub. To do this, we use the AutoTokenizer class and its from_pretrained() method. Using the checkpoint name of our model, it will automatically fetch the data associated with the model's tokenizer and cache it (so it's only downloaded the first time you run the code below).
Since the default checkpoint of the sentiment-analysis pipeline is distilbert-base-uncased-finetuned-sst-2-english, we run the following:
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
Once we have the tokenizer, we can directly pass our sentences to it and we'll get back a dictionary to feed our model. The only thing that is left to do is to convert the list of input IDs to tensors. Transformers only accept tensors as input. To specify the type of tensors we want to get back (PyTorch, TensorFlow, or plain NumPy), we use the return_tensors argument:
raw_inputs = [
"I've been waiting for a HuggingFace course my whole life.",
"I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
The output itself is a dictionary containing two keys, input_ids and attention_mask. input_ids contains two rows of integers (one for each sentence) that are the unique identifiers of the tokens in each sentence.
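To see what those IDs stand for, you can decode them back; a minimal sketch using the tokenizer defined above (expect to see the special [CLS] and [SEP] tokens the tokenizer added).
# Round-trip the first sentence's IDs back to text.
print(tokenizer.decode(inputs["input_ids"][0]))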
Going through the Model
We can download the pretrained model the same way we did with the tokenizer. 🤗 Transformers provides a TFAutoModel (TensorFlow) / AutoModel (PyTorch) class, which also has a from_pretrained() method:
from transformers import AutoModel
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
In this code snippet, we have downloaded the same checkpoint we used in our pipeline before and instantiated a model with it. This architecture contains only the base Transformer module: given some inputs, it outputs what we'll call hidden states, also known as features. For each model input, we'll receive a high-dimensional vector representing the contextual understanding of that input by the Transformer model. These hidden states are usually inputs to another part of the model, known as the head.
A High Dimensional Vector
The vector output by the Transformer model is usually large. It generally has three dimensions:
- Batch Size: The number of sequences processed at a time
- Sequence Length: The length of the numerical representation of the sequence
- Hidden Size: The vector dimension of each model input
Note that the outputs of 🤗 Transformer models behave like namedtuples or dictionaries. You can access the elements by key or index.
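For example, with the model and inputs defined above, the following accesses all return the same hidden-states tensor (a minimal sketch).
outputs = model(**inputs)
# Attribute access, key access, and index access are equivalent.
print(outputs.last_hidden_state.shape)
print(outputs["last_hidden_state"].shape)
print(outputs[0].shape)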
Model Heads: Making Sense Out of Numbers
The model heads take the high-dimensional vector of hidden states as input and project it onto a different dimension. They are usually composed of one or a few linear layers:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
The output of the Transformer model is sent directly to the model head to be processed. The model can be thought of as its embeddings layer plus the subsequent layers: the embeddings layer converts each input ID in the tokenized input into a vector that represents the associated token, and the subsequent layers manipulate those vectors using the attention mechanism to produce the final representation of the sentences.
from transformers import AutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
"""
Now if we look at the shape of our outputs,
the dimensionality will be much lower: the
model head takes as input the high-dimensional
vectors we saw before, and outputs vectors containing two
values (one per label).
Since we have just two sentences and two labels, the result we get from our model is of shape 2 x 2.
"""
print(outputs.logits.shape)
Postprocessing the Output
The outputs for each sentence are not probabilities but logits, the raw, unnormalized scores output by the last layer of the model. To be converted to probabilities, they need to go through a SoftMax layer. All 🤗 Transformers models output the logits, as the loss function for training will generally fuse the last activation function, such as SoftMax, with the actual loss function, such as cross entropy.
"""
The values we get as output from the model
don't necessarily make sense by themselves.
"""
print(outputs.logits)
import torch
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
To get the labels corresponding to each prediction, we can inspect the id2label attribute of the model config.
model.config.id2label
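Putting the pieces together, a minimal sketch that maps each prediction to its label name, using the predictions tensor and model from above.
# Pick the highest-probability class for each sentence and look up its name.
for probs in predictions:
    label_id = int(probs.argmax())
    print(model.config.id2label[label_id], float(probs[label_id]))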
Models
The AutoModel class is handy when you want to instantiate any model from a checkpoint. The AutoModel class and all of its relatives are actually simple wrappers over the wide variety of models available in the library. It's a clever wrapper, as it can automatically guess the appropriate model architecture for your checkpoint and then instantiate a model with this architecture.
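As a small sketch of that behavior, assuming the bert-base-cased checkpoint: the Auto* class resolves the concrete architecture for you.
from transformers import AutoModel
# AutoModel reads the checkpoint's configuration and instantiates the
# matching architecture class (a BERT model here) without us naming it.
model = AutoModel.from_pretrained("bert-base-cased")
print(type(model).__name__)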
"""
Creating a Transformer
The first thing we'll need to do to initialize a BERT
model is load a configuration object
"""
from transformers import BertConfig, BertModel
# Building the config
config = BertConfig()
# Building the model from the config
model = BertModel(config)
# The configuration contains many attributes that are used to build
# the model:
print(config)
Different Loading Methods
"""
Creating a model from the default configuration initializes
it with random values
"""
from transformers import BertConfig, BertModel
config = BertConfig()
model = BertModel(config)
# Model is randomly initialized!
"""
The model can be used in this state, but it will output gibberish;
it needs to be trained first. We could train the model from scratch on the task
at hand, but as you saw in Chapter 1, this would require a long time and a lot
of data, and it would have a non-negligible environmental impact. To avoid
unnecessary and duplicated effort, it's imperative to be able to share and
reuse models that have already been trained. Loading a Transformer model is
that simple - we can do this using the from_pretrained() method
"""
model = BertModel.from_pretrained("bert-base-cased")
If your code works for one checkpoint, it should work seamlessly with another.
This applies even if the architecture is different, as long as the checkpoint
was trained for a similar task (for example, a sentiment analysis task).
This model is now initialized with all the weights of the checkpoint. It can
be used directly for inference on the tasks it was trained on, and it can also
be fine-tuned on a new task. By training with pretrained weights rather than
from scratch, we can quickly achieve good results.
The weights have been downloaded and cached (so future calls to the from_pretrained() method won't re-download them) in the cache folder, which defaults to ~/.cache/huggingface/transformers. You can customize your cache folder by setting the HF_HOME environment variable.
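If you prefer to set it from Python rather than your shell, a minimal sketch (the environment variable has to be set before 🤗 Transformers is imported for it to take effect; the path is just a placeholder).
import os
# Must run before any transformers / huggingface_hub import.
os.environ["HF_HOME"] = "/path/to/my/hf-cache"
from transformers import BertModel
model = BertModel.from_pretrained("bert-base-cased")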
Saving Methods
Saving a model is as easy as loading one - we use the save_pretrained() method.
model.save_pretrained("directory_on_my_computer")
This saves two files to your disk:
ls directory_on_my_computer
config.json pytorch_model.bin
The config.json file contains the attributes necessary to build the model architecture. This file also contains some metadata, such as where the checkpoint originated and what 🤗 Transformers version you were using when you last saved the checkpoint. The pytorch_model.bin file is known as the state dictionary; it contains all your model's weights. The two files go hand in hand - the configuration is necessary to know your model's architecture, and the model weights are your model's parameters.
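Loading the model back from that directory works the same way as loading from the Hub; a minimal sketch reusing the directory name from above.
from transformers import BertModel
# from_pretrained() accepts a local path as well as a Hub checkpoint name:
# it reads config.json to rebuild the architecture, then loads the weights.
model = BertModel.from_pretrained("directory_on_my_computer")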
Using a Transformer Model for Inference
# Let's say we have a couple of sequences
sequences = ["Hello!", "Cool.", "Nice!"]
# The tokenizer converts these to vocabulary indices which are
# typically called input IDs. Each sequence is now a list of numbers:
encoded_sequences = [
[101, 7592, 999, 102],
[101, 4658, 1012, 102],
[101, 3835, 999, 102],
]
# This is a list of encoded sequences: a list of lists. Tensors only
# accept rectangular shapes. Converting the above to a tensor:
import torch
model_inputs = torch.tensor(encoded_sequences)
"""
Using the tensors as inputs to the model
While the model accepts a lot of different arguments, only
the input IDs are necessary.
"""
output = model(model_inputs)
Tokenizers
Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that can be processed by the model. There are many different ways to do this, and the goal is to find the most meaningful, smallest representation.
Word-based
The first type of tokenizer that comes to mind is word-based. It's generally very easy to set up and use with only a few rules, and it often yields decent results. There are also variations of word tokenizers that have extra rules for punctuation. With this kind of tokenizer, we can end up with some pretty large "vocabularies", where a vocabulary is defined by the total number of independent tokens that we have in our corpus. Each word gets assigned an ID, starting at 0 and going up to the size of the vocabulary. The model uses these IDs to identify each word.
We need a custom token to represent words that are not in our vocabulary. This is known as the "unknown" token, often represented as "[UNK]". The goal when crafting the vocabulary is to do it in such a way that the tokenizer tokenizes as few words as possible into the unknown token. One way to reduce the number of unknown tokens is to go one level deeper, using a character-based tokenizer.
Character-based
Character-based tokenizers split the text into characters, rather than words. This has two primary benefits:
- The vocabulary is much smaller
- There are far fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters
One could argue that character-based representations are less meaningful, since each character does not mean a lot on its own. Another thing to consider is that we will end up with a very large number of tokens to be processed by our model. To get the best of both worlds, we can use subword tokenization.
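A minimal sketch of the character-based idea, mirroring the word-based example below.
# Character-based split of a sample sentence: a tiny vocabulary,
# but many more tokens per sequence.
tokenized_text = list("Jim Henson was a puppeteer")
print(tokenized_text)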
Subword Tokenization
Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords.
"""
Word-based Tokenizer
"""
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)
Loading and Saving
Loading and saving tokenizers is as simple as it is with models. It's based on two methods: from_pretrained() and save_pretrained(). These methods will load or save the algorithm used by the tokenizer (a bit like the architecture of the model) as well as its vocabulary (a bit like the weights of the model).
"""
Loading the BERT tokenizer trained with the same checkpoint
as BERT is done the same way as loading the model, except we
use the BertTokenizer class
"""
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
"""
Similar to AutoModel, the AutoTokenizer class will grab the proper tokenizer
class in the library based on the checkpoint name, and can be used directly
with any checkpoint.
"""
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer("Using a Transformer network is simple")
# Saving a tokenizer is identical to saving a model
tokenizer.save_pretrained("directory_on_my_computer")
Encoding
Translating text to numbers is known as encoding. Encoding is done in a two-step process: the tokenization, followed by the conversion to input IDs. The first step is to split the text into tokens. The second step is to convert those tokens into numbers, so we can build a tensor out of them and feed them to the model. To do this, the tokenizer has a vocabulary, which is the part we download when we instantiate it with the from_pretrained() method.
Tokenization
The tokenization process is done by the tokenize() method of the tokenizer. The tokenizer below is a subword tokenizer.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)
# The output of this method is a list of strings, or tokens
print(tokens)
The conversion to input IDs is handled by the convert_tokens_to_ids() tokenizer method:
ids = tokenizer.convert_tokens_to_ids(tokens)
# These outputs, once converted to the appropriate framework
# tensor, can then be used as inputs to a model as seen in this
# chapter
print(ids)
Decoding
Decoding is going the other way around: from vocabulary indices, we want to get a string. This can be done with the decode() method. The decode method not only converts the indices back to tokens, but also groups together the tokens that were part of the same words to produce a readable sentence.
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)
Handling Multiple Sequences
🤗 Transformers models expect multiple sentences by default. Batching is the act of sending multiple sentences through the model, all at once. Batching allows the model to work when you feed it multiple sentences, and using multiple sequences is just as simple as building a batch with a single sequence. The sequences must be of the same length, though, so we usually pad the inputs so that they have the same length. Padding makes sure all the sentences have the same length by adding a special word called the padding token to the sentences with fewer values. The padding token ID can be found in tokenizer.pad_token_id.

The key feature of Transformer models is attention layers that contextualize each token. These will take into account the padding tokens, since they attend to all of the tokens in a sequence. To get the same result when passing individual sentences of different lengths through the model or when passing a batch with the same sentences and padding applied, we need to tell those attention layers to ignore the padding tokens. This is done by using an attention mask. Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequence = "I've been waiting for a HuggingFace course my whole life."
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)
output = model(input_ids)
print("Logits:", output.logits)
batched_ids = [
[200, 200, 200],
[200, 200, tokenizer.pad_token_id],
]
attention_mask = [
[1, 1, 1],
[1, 1, 0],
]
outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)
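To see why the mask matters, a minimal sketch that runs the same padded batch without it: the logits for the padded row no longer match what you get from running that sequence on its own.
# Without the attention mask the model also attends to the padding token,
# so the second row's logits differ from the masked result above.
print(model(torch.tensor(batched_ids)).logits)
# The second sequence by itself, for comparison.
print(model(torch.tensor([[200, 200]])).logits)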
With Transformer models, there is a limit to the lengths of the sequences we can pass to the models. Most models handle sequences of up to 512 or 1024 tokens, and will crash when asked to process longer sequences. There are two solutions to this problem:
- Use a model with a longer supported sequence length
- Truncate your sequences
Models have different supported sequence lengths, and some specialize in handling very long sequences. If you're working on a task that requires very long sequences, we recommend you take a look at those models. It is recommended to truncate your sequences by specifying the max_sequence_length parameter:
sequence = sequence[:max_sequence_length]
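You can also let the tokenizer handle truncation for you; a minimal sketch reusing the tokenizer loaded earlier (max_length=16 is an arbitrary value for illustration, not a model limit).
# truncation=True caps each sequence at max_length tokens
# (or at the model's maximum length if max_length isn't given).
truncated = tokenizer(sequence, truncation=True, max_length=16, return_tensors="pt")
print(truncated["input_ids"].shape)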