Natural Language Processing with PyTorch, Chapters 1-4
I decided to read this textbook to become more familiar with PyTorch. It has mainly ended up being a review of machine learning / deep learning concepts (I don't like the way the code is presented in the book). These first 4 chapters go through some basic PyTorch/NLP/ML/DL concepts.
Introduction
Natural Language Processing (NLP) refers to a set of techniques involving the application of statistical methods, with or without insights from linguistics, to understand text for the sake of solving real-world tasks. This "understanding" of text is mainly derived by transforming texts to usable computational representations, which are discrete or continuous combinatorial structures such as vectors or tensors, graphs, and trees.
Deep learning enables one to efficiently learn representations from data using an abstraction called the computational graph and numerical optimization techniques. PyTorch is an increasingly popular Python-based computational graph framework to implement deep learning algorithms.
Supervised Learning Paradigm
Supervised learning refers to cases where the ground truth for targets (what's being predicted) is available for the observations.
Observations are items about which we want to predict something. We denote observations using x. We sometimes refer to the observations as inputs. A loss function is a function that compares how far off a prediction is from its target for observations in the training data. Given a target and its prediction, the loss function assigns a scalar real value called the loss. The lower the value of the loss, the better the model is at predicting the target. We use L to denote the loss function.
Formalized Supervised Learning
Consider a dataset D = {X_i, y_i}, i = 1, ..., n, with n examples. Given this dataset, we want to learn a function (a model) f parameterized by weights w. That is, we make an assumption about the structure of f, and given that structure, the learned values of the weights w will fully characterize the model. For a given input X, the model predicts y^ as the target: y^ = f(X, w).
In supervised learning, for training examples, we know the true target y for an observation. The loss for this instance will then be L(y, y^). Supervised learning then becomes a process of finding the optimal parameters/weights w that minimize the cumulative loss over all the n examples.
Training Using (Stochastic) Gradient Descent
The goal of supervised learning is to pick values of the parameters that minimize the cost function for a given dataset (equivalent to finding the roots of an equation), and gradient descent is a common technique for finding those roots. In traditional gradient descent, we guess some initial values for the roots (parameters) and update the parameters iteratively until the objective function (loss function) evaluates to a value below an acceptable threshold (the convergence criterion). Due to memory constraints, an approximation of gradient descent called stochastic gradient descent (SGD), in which data points are picked at random and the gradient is computed for that subset, is used. When a subset of more than one data point is used, we call it minibatch SGD. The process of iteratively updating the parameters is called backpropagation. Each step of backpropagation consists of a forward pass and a backward pass. The forward pass evaluates the inputs with the current values of the parameters and computes the loss function. The backward pass updates the parameters using the gradient of the loss.
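To make the forward/backward pass concrete, here is a minimal minibatch-SGD sketch for a toy linear model; the data, batch size, learning rate, and epoch count are illustrative assumptions of mine, not from the book.
import torch
torch.manual_seed(0)
# toy data: noisy targets from a known line y = 3x + 2
x = torch.randn(100, 1)
y_true = 3 * x + 2 + 0.1 * torch.randn(100, 1)
# parameters we want to learn
w = torch.randn(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([w, b], lr=0.1)
for epoch in range(100):
    # minibatch SGD: sample a random subset of the data each step
    idx = torch.randint(0, 100, (16,))
    x_batch, y_batch = x[idx], y_true[idx]
    optimizer.zero_grad()                      # clear old gradients
    y_pred = w * x_batch + b                   # forward pass
    loss = ((y_pred - y_batch) ** 2).mean()    # MSE loss
    loss.backward()                            # backward pass: compute gradients
    optimizer.step()                           # update parameters
print(w.item(), b.item())  # should approach 3 and 2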
Observation and Target Encoding
We need to represent the observations (text) numerically to use them in conjunction with machine learning algorithms. A simple way to represent text is as a numerical vector.
One-Hot Representation
The one-hot representation starts with a zero vector, and sets to 1 the corresponding entry in the vector if the word is present in the sentence or document.
TF Representation
The Term Frequency (TF) of a phrase, sentence, or document is simply the sum of the one-hot representations of its constituent words. We denote the TF of a word w by TF(w).
"""
Generating a "collapsed" one-hot or binary represnetation using scikit-learn
"""
from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns
import matplotlib.pyplot as plt
corpus = ['Time flies flies like an arrow.',
'Fruit flies like a banana.']
vocab = ["an","arrow","banana","flies","fruit","like","time"]
one_hot_vectorizer = CountVectorizer(binary=True)
one_hot = one_hot_vectorizer.fit_transform(corpus).toarray()
sns.heatmap(one_hot, annot=True,
cbar=False, xticklabels=vocab,
yticklabels=["Sentence 1",'Sentence 2'],cmap="binary")
fig = plt.gcf()
ax = fig.gca()
ax.set_title("Collased One-Hot Representation")
plt.show()
TF-IDF Representation
The TF weights words proportionally to their frequency. However, common words are likely to add little to our understanding of a document, while rare words are likely to add a lot, so we want those rare words to be given more weight. The Inverse Document Frequency (IDF) is a heuristic that does exactly that: it penalizes common tokens and rewards rare tokens in the vector representation. The IDF(w) of a token w is defined with respect to a corpus as IDF(w) = log(N / n_w), where n_w is the number of documents containing the word w and N is the total number of documents. The TF-IDF score is simply the product TF(w) ⋅ IDF(w). In deep learning, it is rare to see inputs encoded using heuristic representations like TF-IDF because the goal is to learn a representation. Often, we start with a one-hot encoding using integer indices and a special "embedding lookup" layer to construct inputs to the neural network.
"""
Generating a TF-IDF Representation using scikit-learn
"""
from sklearn.feature_extraction.text import TfidfVectorizer
import seaborn as sns
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(corpus).toarray()
sns.heatmap(tfidf, annot=True, cbar=False, xticklabels=vocab,
yticklabels= ['Sentence 1', 'Sentence 2'],cmap="binary")
fig = plt.gcf()
ax = fig.gca()
ax.set_title("TF-IDF Representation")
plt.show()
Target Encoding
Target encoding affects performance dramatically; see the paper "Supervised and Unsupervised Discretization of Continuous Features" to get a sense of the scope of the problem.
Computational Graphs
A computational graph is an abstraction that models mathematical expressions. In the context of deep learning, implementations of the computational graph (such as Theano, TensorFlow, and PyTorch) do additional bookkeeping to implement the automatic differentiation needed to obtain gradients of parameters during training in the supervised learning paradigm.
As an example, a computational graph can model the expression y = wx + b. This can be written as two subexpressions, z = wx and y = z + b. We can then represent the original expression using a directed acyclic graph (DAG) in which the nodes are the mathematical operations, like multiplication and addition.
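As a quick sketch of my own (the values are arbitrary), PyTorch builds this graph implicitly when the expression is evaluated, and autograd traverses it backward to compute gradients:
import torch
w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)
b = torch.tensor(1.0, requires_grad=True)
z = w * x        # first node: multiplication
y = z + b        # second node: addition
y.backward()     # traverse the graph backward to get gradients
print(w.grad)    # dy/dw = x = 3.0
print(b.grad)    # dy/db = 1.0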
PyTorch Basics
PyTorch is an open-source, community-driven deep learning framework. Unlike Theano, Caffe, and TensorFlow, PyTorch implements a tape-based automatic differentiation method that allows us to define and execute computational graphs dynamically. PyTorch is an optimized tensor manipulation library that offers an array of packages for deep learning.
"""
Creating Tensors
"""
def describe(x):
"""
    Summarize various properties of the tensor x
"""
print("Type: {}".format(x.type()))
print("Shape/size: {}".format(x.shape))
print("Values: {}".format(x))
import torch
describe(torch.Tensor(2,3))
"""
Creating a Randomly Initialized Tensor
"""
describe(torch.rand(2,3)) # uniform random
describe(torch.randn(2,3)) # random normal
describe(torch.zeros(2,3))
x = torch.ones(2,3)
describe(x)
# The `fill_()` method will fill the tensor with specific values in-place
x.fill_(5)
describe(x)
# Any PyTorch method with an underscore refers to an in-place operation: that
# is, it modifies the content in place without creating a new object
"""
Creating a Tensor from lists
"""
x = torch.Tensor([[1,2,3],[4,5,6]])
describe(x)
"""
Creating and initializing a Tensor from NumPy
"""
import numpy as np
import torch
npy = np.random.randn(2,3)
describe(torch.from_numpy(npy))
Tensor Types and Sizes
Each tensor has an associated type and size. The default tensor type when you use the torch.Tensor constructor is torch.FloatTensor. However, you can convert a tensor to a different type (float,long,double,etc.) by specifying it at initialization or later using one of the typecasting methods.
"""
Tensor Properties
"""
x = torch.FloatTensor([[1, 2, 3],
[4, 5, 6]])
describe(x)
x = x.long()
describe(x)
x = torch.tensor([[1, 2, 3],
[4, 5, 6]], dtype=torch.int64)
describe(x)
x = x.float()
describe(x)
Tensor Operations
After you have created your tensors, you can operate on them as you would in traditional programming languages, with operators like +, -, *, and /. There are also operations that you can apply to a specific dimension of a tensor. Often we need to do more complex operations that involve a combination of indexing, slicing, joining, and mutations. Like NumPy and other numeric libraries, PyTorch has built-in functions to make such tensor manipulations very simple. The PyTorch tensor class encapsulates the data (the tensor itself) and a range of operations. When the requires_grad Boolean flag is set to True on a tensor, bookkeeping operations are enabled that can track the gradient at the tensor as well as the gradient function, both of which are needed to facilitate gradient-based learning.
When you create a tensor with requires_grad=True, you are requiring PyTorch to manage bookkeeping information that computes gradients. PyTorch will keep track of the values of the forward pass. Then, at the end of computations, a single scalar is used to compute the backward pass. The backward pass is initiated by calling the backward() method on a tensor resulting from the evaluation of a loss function. The backward pass computes a gradient value for each tensor object that participated in the forward pass.
In general, the gradient is a value that represents the slope of a function's output with respect to the function's input. In the computational graph setting, gradients exist for each parameter in the model and can be thought of as the parameter's contribution to the error signal. In PyTorch, you can access the gradients for the nodes in the computational graph by using the .grad member variable.
"""
Tensor Operations: Addition
"""
import torch
x = torch.randn(2, 3)
describe(x)
describe(torch.add(x, x))
describe(x + x)
"""
Dimension Based Tensor Operations
"""
import torch
x = torch.arange(6)
describe(x)
x = x.view(2, 3)
describe(x)
describe(torch.sum(x, dim=0))
describe(torch.sum(x, dim=1))
describe(torch.transpose(x, 0, 1))
"""
Slicing and Indexing a Tensor
"""
x = torch.arange(6).view(2, 3)
describe(x)
describe(x[:1, :2])
describe(x[0, 1])
"""
Complex Indexing: noncontiguous indexing of a tensor
Note that the indices are a LongTensor; this is a requirement for indexing
using PyTorch functions.
"""
indices = torch.LongTensor([0, 2])
describe(torch.index_select(x, dim=1, index=indices))
indices = torch.LongTensor([0, 0])
describe(torch.index_select(x, dim=0, index=indices))
row_indices = torch.arange(2).long()
col_indices = torch.LongTensor([0, 1])
describe(x[row_indices, col_indices])
"""
Concatenating Tensors
"""
import torch
x = torch.arange(6).view(2,3)
describe(x)
describe(torch.cat([x, x], dim=0))
describe(torch.cat([x, x], dim=1))
describe(torch.stack([x, x]))
"""
Linear Algebra on Tensors: Multiplication
"""
x1 = torch.arange(6).view(2, 3)
describe(x1)
x2 = torch.ones(3, 2)
x2[:, 1] += 1
describe(x2)
describe(torch.mm(x1, x2.long()))
"""
Creating Tensors for Gradient Bookkeeping
"""
import torch
x = torch.ones(2, 2, requires_grad=True)
describe(x)
print(x.grad is None)
y = (x + 2) * (x + 5) + 3
describe(y)
print(x.grad is None)
z = y.mean()
describe(z)
z.backward()
print(x.grad is None)
CUDA Tensors
You may want to use a GPU to do linear algebra if available. To use a GPU, you need to first allocate the tensor on the GPU's memory. Access to the GPUs is via a specialized API called CUDA. The CUDA API was created by NVIDIA and is limited to use only on NVIDIA GPUs.
"""
Creating CUDA Tensors
"""
import torch
print (torch.cuda.is_available())
# preferred method: device agnostic tensor instantiation
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
x = torch.rand(3, 3).to(device)
describe(x)
Exercises
import torch
# 1. Create a 2D tensor and then add a dimension of size 1 inserted at dimension 0.
x = torch.randn((2,2))
print(x.shape)
x.unsqueeze_(0)
print(x.shape)
# 2. Remove the extra dimension you just added to the previous tensor.
x.squeeze_()
print(x.shape)
# 3. Create a random tensor of shape 5x3 in the interval [3,7)
y = 3 + torch.rand(5, 3) * 4  # uniform floats in [3, 7); torch.randint(3, 7, (5, 3)) would give integers instead
print(y)
# 4. Create a tensor with values from a normal distribution.
z = torch.randn((10000,10000))
print("Mean:",z.mean(),"Std:",z.std())
# 5. Retrieve the indexes of all the nonzero elements in the tensor torch.Tensor([1, 1, 1, 0, 1]).
tensor = torch.tensor([1, 1, 1, 0, 1])
indices = torch.argwhere(tensor)
indices = indices.squeeze_().numpy()
print(indices)
# 6. Create a random tensor of size (3,1) and then horizontally stack four copies together.
rand_tens = torch.randn((3,1))
print(rand_tens.shape)
stack = torch.cat((rand_tens,rand_tens,rand_tens,rand_tens),dim=1)
print(stack.shape)
# 7. Return the batch matrix matrix product of two three dimensional matrices: (a=torch.rand(3,4,5), b=torch.rand(3,5,4)).
a=torch.rand(3,4,5)
b=torch.rand(3,5,4)
out = torch.bmm(a,b)
# 8. Return the batch matrix-matrix product of a 3D matrix and a 2D matrix: (a=torch.rand(3,4,5), b=torch.rand(5,4))
a=torch.rand(3,4,5)
b=torch.rand(5,4)
out = torch.bmm(a, b.unsqueeze(0).expand(a.size(0), *b.size()))
A Quick Tour of Traditional NLP
Natural language processing (NLP) and computational linguistics (CL) are two areas of computational study of human language. NLP aims to develop methods for solving practical problems involving language, such as information extraction, automatic speech recognition, machine translation, sentiment analysis, question answering, and summarization. CL employs computational methods to understand properties of human language. Lessons from CL can be used to inform priors in NLP, and statistical and machine learning methods from NLP can be applied to answer questions that CL seeks to answer.
Corpora, Tokens, and Types
All NLP methods, be they classic or modern, begin with a text dataset, also called a corpus (plural corpora). The raw text is a sequence of characters (bytes), but most times it is useful to group those characters into contiguous units called tokens. In English, tokens correspond to words and numeric sequences separated by white-space characters or punctuation.
The metadata could be any auxiliary piece of information associated with the text, like identifiers, labels, and timestamps. In machine learning parlance, the text along with its metadata is called an instance or data point. The corpus, a collection of instances, is also known as a dataset.
The process of breaking text down into tokens is called tokenization. Tokenizing based on whitespace may not always be appropriate. Tokenization decisions tend to be arbitrary - but those decisions can significantly affect accuracy in practice, more than is usually acknowledged. Tokenization is considered the grunt work of preprocessing. nltk and spaCy are two commonly used packages for text processing.
"""
Tokenization
"""
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Mary, don't slap the green witch"
print([str(token) for token in nlp(text.lower())])
"""
Tokenization
"""
from nltk.tokenize import TweetTokenizer
tweet="Snow White and the Seven Degrees\n#MakeAMovieCold@midnight:)"
tokenizer = TweetTokenizer()
print(tokenizer.tokenize(tweet.lower()))
Types are unique tokens present in a corpus. The set of all types in a corpus is its vocabulary or lexicon. Words can be distinguished as content words and stop words. Stopwords such as articles and prepositions serve mostly a grammatical purpose.
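As a tiny illustration of my own (not a book example), the vocabulary of a corpus is just the set of unique tokens; tokenization here is naive whitespace splitting for brevity.
corpus = ["time flies like an arrow", "fruit flies like a banana"]
tokens = [token for sentence in corpus for token in sentence.split()]
types = sorted(set(tokens))   # the vocabulary of this tiny corpus
print(types)                  # ['a', 'an', 'arrow', 'banana', 'flies', 'fruit', 'like', 'time']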
The process of understanding the linguistics of a language and applying it to solve NLP problems is called feature engineering. When building and deploying real-world production systems, feature engineering is indispensable, despite recent claims to the contrary.
Unigrams, Bigrams, Trigrams, ..., N-grams
N-grams are fixed-length (n) consecutive token sequences occurring in the text. A bigram has two tokens, a unigram one. Generating n-grams from a text is simple enough with the libraries mentioned above. For some situations in which the subword information itself carries useful information, one might want to generate character n-grams.
"""
Text / tokens to n-gram
"""
def n_grams(text,n):
"""
Takes tokens or text, returns a list of n-grams
"""
return [text[i:i+n] for i in range(len(text) - n + 1)]
cleaned = ['mary', ',', "n't", 'slap', 'green', 'witch', '.']
print(n_grams(cleaned,3))
Lemmas and Stems
Lemmas are root forms of words. Consider the verb fly. It can be inflected into many different words - flew, flies, flown, flying, and so on - and fly is the lemma for all of these seemingly different words. Sometimes, it may be useful to reduce tokens to their lemmas to keep the dimensionality of the vector representation low. This reduction is called lemmatization. spaCy uses a predefined dictionary, called WordNet, for extracting lemmas, but lemmatization can also be framed as a machine learning problem requiring an understanding of the morphology of the language. Stemming is the poor man's lemmatization. It involves the use of handcrafted rules to strip the endings of words to reduce them to a common form called stems. Popular stemmers often implemented in open source packages include the Porter and Snowball stemmers.
"""
Lemmatization
"""
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"he was running late")
for token in doc:
print('{} -> {}'.format(token,token.lemma_))
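To complement the lemmatization example, here is a small stemming sketch using NLTK's PorterStemmer (my addition, not from the book); note how the handcrafted rules can produce stems that are not real words.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
for word in ["running", "flies", "flown", "easily"]:
    # the Porter rules strip suffixes, e.g. "easily" -> "easili"
    print('{} -> {}'.format(word, stemmer.stem(word)))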
Categorizing Sentences and Documents
Categorizing or classifying documents is probably one of the earliest applications of NLP. Problems such as assigning topic labels, predicting sentiment of reviews, filtering spam emails, language identification, and email triaging can be framed as supervised document classification problems.
Categorizing Words
We can extend the concept of labeling from documents to individual words or tokens - a common example of categorizing words is part-of-speech tagging.
"""
Categorizing Words: POS Tagging
"""
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Mark slapped the green which.")
for token in doc:
print('{} -> {}'.format(token,token.pos_))
Categorizing Spans: Chunking and Named Entity Recognition
Often, we need to label a span of text; that is, a continuous multitoken boundary. We might want to consider the noun phrases (NP) and verb phrases (VP) in a sentence like: "Mary slapped the green witch":
[NP Mary] [VP slapped] [the green witch].
This is called chunking or shallow parsing. Shallow parsing aims to derive higher-order units composed of grammatical atoms, like nouns, verbs, adjectives, and so on. For English and the most extensively spoken languages, such data and pretrained models exist.
"""
Noun Phrase (NP) Shallow Parsing (Chunking)
"""
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Mary slapped the green witch.")
for chunk in doc.noun_chunks:
print('{} {}'.format(chunk, chunk.label_))
Another type of span that's useful is the named entity. A named entity is a string mention of a real world concept like a person, location, organization, drug name, and so on.
Structure of Sentences
The task of identifying the relationships between the phrasal units produced by shallow parsing is called parsing. Parse trees indicate how different grammatical units in a sentence are related hierarchically; this kind of tree is called a constituent parse.
Another useful way to show relationships is a dependency parse, in which each word is attached to the word it modifies (its head).
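A quick dependency-parse sketch with spaCy (my addition): token.dep_ gives the dependency label and token.head the word each token attaches to.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Mary slapped the green witch.")
for token in doc:
    # print each token, its dependency relation, and its head word
    print('{} <--{}-- {}'.format(token.text, token.dep_, token.head.text))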
Word Senses and Semantics
Words have meanings, and often more than one. The different meanings of a word are called its senses. WordNet, a long-running lexical resource project from Princeton University, aims to catalog the senses of most words in the English language, along with their lexical relationships.
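As a small sketch (assuming NLTK and its WordNet corpus, fetched via nltk.download('wordnet'), are available), you can inspect the senses of a word like "plane":
from nltk.corpus import wordnet as wn
# list the noun senses of "plane" with their glosses
for synset in wn.synsets("plane", pos="n"):
    print("{}: {}".format(synset.name(), synset.definition()))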
Foundational Components of Neural Networks
The Perceptron: The Simplest Neural Network
The simplest neural network unit is a perceptron. Each perceptron unit has an input (x), an output (y), and three "knobs": a set of weights (w), a bias (b), and an activation function (f). The weights and the bias are learned from the data, and the activation function is handpicked depending on the network designer's intuition about the network and its target outputs. Mathematically, we can express this as y = f(wx + b), where w and x are vectors (wx is their dot product) and b is a scalar. The activation function, denoted f, is typically a nonlinear function. Essentially, a perceptron is made up of a linear function (wx + b) and a nonlinear activation function. The linear expression wx + b is also known as an affine transform. Below is a perceptron implemented in PyTorch.
import torch
import torch.nn as nn
class Perceptron(nn.Module):
"""
A perceptron is one linear layer
"""
def __init__(self,input_dim):
"""
Args:
input_dim (int): size of the input features
"""
super(Perceptron,self).__init__()
# The Linear class does the bookkeeping needed for the weights and biases
        # and does the needed affine transform.
self.fc1 = nn.Linear(input_dim,1)
def forward(self,x_in):
"""
The forward pass of the perceptron
Args:
            x_in (torch.Tensor): an input data tensor; x_in.shape should be
(batch, num_features)
"""
# The activation function is the sigmoid function
return torch.sigmoid(self.fc1(x_in)).squeeze()
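A quick usage check of the Perceptron above (the batch size and feature dimension are arbitrary choices of mine):
import torch
torch.manual_seed(1)
perceptron = Perceptron(input_dim=4)
x_in = torch.randn(3, 4)      # a batch of 3 examples with 4 features each
y_hat = perceptron(x_in)      # sigmoid outputs in (0, 1)
print(y_hat.shape)            # torch.Size([3])
print(y_hat)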
Activation Functions
Activation functions are nonlinearities introduced in neural networks to capture complex relationships in data.
Sigmoid
The sigmoid is one of the earliest activation functions used in neural network history. It takes any real value and squashes it into the range between 0 and 1: f(x) = 1 / (1 + e^(-x)).
As you can see from the Activation Functions plot, the sigmoid function saturates (produces extreme-valued outputs) very quickly and for the majority of inputs. This can lead the gradients to either become zero or diverge to an overflowing floating-point value - the vanishing gradient and exploding gradient problems, respectively.
Tanh
The tanh activation function is a cosmetically different variant of the sigmoid: f(x) = tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)).
The tanh function, like the sigmoid, is a squashing function except that it maps the set of real values to the range [-1, 1].
ReLU
ReLU (pronounced ray-luh) stands for rectified linear unit: f(x) = max(0, x). It is arguably the most important of the activation functions; some would say that many of the recent innovations in deep learning would have been impossible without it.
The clipping effect of ReLU that helps with the vanishing gradients problem can also become an issue, where over time certain outputs in the network can simply become zero and never revive again. This is called the "dying ReLU" problem, and many variants (Parametric ReLU - PReLU) have been proposed to fix this problem.
Softmax
Like the sigmoid function, the softmax function squashes the output of each unit to between 0 and 1. However, the softmax operation also divides each output by the sum of all the outputs, which gives us a discrete probability distribution over k possible classes: softmax(x_i) = e^(x_i) / Σ_{j=1}^{k} e^(x_j).
The probabilities in the resulting distribution all sum up to one. This is useful for interpreting outputs for classification tasks, and so this transformation is usually paired with a probabilistic training objective, such as categorical cross entropy.
import matplotlib.pyplot as plt
import torch
fig, ax = plt.subplots(1,1,layout="constrained")
x = torch.arange(-5.,5.,0.1)
# Sigmoid
y_sigmoid = torch.sigmoid(x)
ax.plot(x,y_sigmoid,'b',label="Sigmoid")
# Tanh
y_tanh = torch.tanh(x)
ax.plot(x,y_tanh,'r',label="Tanh")
# ReLU
relu = torch.nn.ReLU()
y_relu = relu(x)
ax.plot(x,y_relu,'c',label="ReLU",linewidth=3)
# PReLU
prelu = torch.nn.PReLU(num_parameters=1)
y_prelu = prelu(x)
ax.plot(x,y_prelu.detach(),'k--',label="PReLU",linewidth=2)
ax.axis((-5,5,-1.2,1.2))
ax.legend()
ax.set_title("Activation Functions")
plt.show()
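The plot above covers sigmoid, tanh, ReLU, and PReLU; as a small companion sketch, here is the softmax squashing a row of scores into a probability distribution that sums to 1 (the input values are random).
import torch
import torch.nn as nn
softmax = nn.Softmax(dim=1)
x_input = torch.randn(1, 3)
y_output = softmax(x_input)
print(x_input)
print(y_output)
print(torch.sum(y_output, dim=1))   # tensor([1.])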
Loss Functions
A loss function takes a truth (y) and a prediction ( y^ ) as an input and produces a real-valued score. The higher this score, the worse the model's prediction is.
Mean Squared Error Loss
For regression problems in which the network's output (y^) and the target (y) are continuous values, one common loss function is the mean squared error (MSE): L_MSE(y, y^) = (1/n) Σ_{i=1}^{n} (y_i - y^_i)^2.
Other loss functions that you might want to use for regression problems include mean absolute error (MAE) and root mean squared error (RMSE), but they all involve computing a real-valued distance between the output and the target.
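A short sketch of PyTorch's MSE loss, mirroring the pattern of the other loss examples in this chapter (the shapes and values are arbitrary):
import torch
import torch.nn as nn
mse_loss = nn.MSELoss()
# random values to simulate network outputs and continuous targets
outputs = torch.randn(3, 5, requires_grad=True)
targets = torch.randn(3, 5)
loss = mse_loss(outputs, targets)
print(loss)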
Categorical Cross Entropy Loss
The categorical cross-entropy loss is typically used in a multiclass classification setting in which the outputs are interpreted as predictions of class membership probabilities. The target (y) is a vector of n elements that represents the true multinomial distribution over all the classes. If only one class is correct, the vector is a one-hot vector. The network's output (y^) is also a vector of n elements, but it represents the network's prediction of the multinomial distribution.
Cross entropy is a method to compute how different two distributions are. We want the probability of the correct class to be close to 1, whereas the other classes have a probability close to 0.
There are four pieces of information that determine the nuanced relationship between network output and loss function.
- There is a limit to how small or large a number can be.
- If input to the exponential function used in the softmax formula is a negative number, the resultant is an exponentially small number, and if it's a positive number, then the resultant is an exponentially large number.
- The network's output is assumed to be the vector just prior to applying the softmax function.
- The log function is the inverse of the exponential function, and log(exp(x)) is just equal to x.
Stemming from these four pieces of information, mathematical simplifications are made that combine the exponential function at the core of softmax with the log function used in the cross-entropy computation, in order to be more numerically stable and avoid really small or really large numbers.
The consequences of these simplifications are that the network output without the use of a softmax function can be used in conjunction with PyTorch's CrossEntropyLoss() to optimize the probability distribution.
"""
Categorical Cross Entropy
"""
import torch
import torch.nn as nn
ce_loss = nn.CrossEntropyLoss()
# Random values to simulate network output
outputs = torch.randn(3, 5, requires_grad=True)
# targets is created as a vector of integers because
# PyTorch's implementation of CrossEntropyLoss() assumes
# that each input has one particular class, and each class
# has a unique index.
targets = torch.tensor([1, 0, 3], dtype=torch.int64)
loss = ce_loss(outputs, targets)
print(loss)
Binary Cross Entropy Loss
The categorical cross-entropy loss is useful when discriminating against multiple classes. When the task involves discriminating between two classes, then the task is called binary classification. For such situations, it is efficient to use the binary cross-entropy (BCE) loss.
"""
Binary Cross Entropy
"""
bce_loss = nn.BCELoss()
sigmoid = nn.Sigmoid()
# Binary output vector
probabilities = sigmoid(torch.randn(4, 1, requires_grad=True))
# Ground truth
targets = torch.tensor([1, 0, 1, 0], dtype=torch.float32).view(4, 1)
# Compute binary cross-entropy loss using
# the binary probability vector and the
# ground truth vector.
loss = bce_loss(probabilities, targets)
print(probabilities)
print(loss)
Diving Deep into Supervised Training
Supervised learning is the problem of learning how to map observations to specified targets given labeled examples. A supervised algorithm requires the following:
- model: computes predictions from the observations
- loss function: measures the error of predictions as compared to the targets
- training data: pairs of observations and targets
- optimization algorithm: adjusts the model's parameters so that the losses are as low as possible
Learning begins with computing the loss; that is, how far off the model's predictions are from the target. The gradient of the loss function, in turn, becomes a signal for "how much" the parameters should change. The PyTorch loss object has a method named backward() that iteratively propagates the loss backward through the computational graph and notifies each parameter of its gradient. The optimizer then uses those gradients to update the parameters' values via its step() function.
The entire training dataset is partitioned into batches. Each iteration of the gradient step is performed on a batch of data. After a number of batches (typically, the number of batches in a finite-sized dataset), the training loop has completed an epoch. An epoch is a complete training iteration over the dataset.
# each epoch is a complete pass over the training data
for epoch_i in range(n_epochs):
# the inner loop is over the batches in the dataset
for batch_i in range(n_batches):
# Step 0: Get the data
x_data, y_target = get_toy_data(batch_size)
# Step 1: Clear the gradients
perceptron.zero_grad()
# Step 2: Compute the forward pass of the model
        y_pred = perceptron(x_data)  # the Perceptron above already applies the sigmoid
# Step 3: Compute the loss value that we wish to optimize
loss = bce_loss(y_pred, y_target)
# Step 4: Propagate the loss signal backward
loss.backward()
# Step 5: Trigger the optimizer to perform one update
optimizer.step()
Auxiliary Training Concepts
Evaluation metrics: what the models are evaluated with. In NLP, there are multiple such metrics; the most common is accuracy - the fraction of predictions that were correct on a dataset unseen during training. A model is said to generalize better than another model if it not only reduces the error on samples seen in the training data, but also on samples from the unseen distribution. To accomplish the goal of generalization, it is standard practice to either split a dataset into three randomly sampled partitions (training, validation, and test datasets) or do k-fold cross-validation. You should take precautions to make sure the distribution of classes remains the same between each of the three splits. A common training/validation/test split is 70%/15%/15%.
Use the training set for updating model parameters, the validation data for measuring model performance at the end of every epoch, and the test data only once, after all modeling choices have been explored and the final results need to be reported. k-fold evaluation is computationally expensive but extremely necessary for smaller datasets.
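As a sketch of how such a split might be produced (my addition; the texts and labels are made up), scikit-learn's train_test_split can stratify on the labels so the class distribution is preserved across partitions:
from sklearn.model_selection import train_test_split
texts = ["great food", "terrible service", "loved it", "never again"] * 50
labels = [1, 0, 1, 0] * 50
# first carve off the 70% training split, stratifying on the labels
x_train, x_rest, y_train, y_rest = train_test_split(
    texts, labels, train_size=0.7, stratify=labels, random_state=0)
# then split the remaining 30% evenly into validation and test
x_val, x_test, y_val, y_test = train_test_split(
    x_rest, y_rest, train_size=0.5, stratify=y_rest, random_state=0)
print(len(x_train), len(x_val), len(x_test))  # 140 30 30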
Knowing When to Stop Training
The most common method [to use to know when to stop training] is to use a heuristic called early stopping. Early stopping works by keeping track of the performance on the validation dataset from epoch to epoch and noticing when the performance no longer improves. Then, if the performance continues to not improve, the training is terminated. The number of epochs to wait before terminating the training is referred to as the patience. In general, the point at which a model stops improving on some dataset is said to be when the model has converged.
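A minimal, self-contained early-stopping sketch of my own (the toy model, random data, and patience value are illustrative assumptions): validation loss is tracked each epoch and training stops once it fails to improve for `patience` epochs.
import torch
import torch.nn as nn
torch.manual_seed(0)
model = nn.Linear(5, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
x_train, y_train = torch.randn(64, 5), torch.randn(64, 1)
x_val, y_val = torch.randn(32, 5), torch.randn(32, 1)
best_val_loss = float("inf")
patience = 5                     # epochs to wait for an improvement
epochs_without_improvement = 0
for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    optimizer.step()
    # measure validation performance at the end of the epoch
    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val).item()
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print("Stopping early at epoch", epoch)
            break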
Finding the Right Hyperparameters
A hyperparameter is any model setting that affects the number of parameters in the model and values taken by the parameters.
Regularization
The most common form of regularization for deep learning is L2 regularization. In PyTorch, you control the amount of L2 regularization by setting the weight_decay parameter in the optimizer: the higher the value, the more regularization is applied. Dropout is another form of regularization.
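For example (the model, learning rate, and decay value here are arbitrary choices, not the book's settings):
import torch
import torch.nn as nn
model = nn.Linear(10, 1)
# weight_decay adds L2 regularization on the model's parameters
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)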
Example: Classifying Sentiment of Restaurant Reviews
After understanding the dataset, you will see a pattern of three assisting classes that is repeated throughout this book and is used to transform text data into a vectorized form: the Vocabulary, the Vectorizer, and PyTorch's DataLoader. The Vocabulary coordinates the token-to-integer mappings; we use it both for mapping the text tokens to integers and for mapping the class labels to integers. The Vectorizer encapsulates the vocabularies and is responsible for ingesting string data and converting it to numerical vectors that will be used in the training routine. The DataLoader is used to group and collate the individual vectorized data points into minibatches.
You should use the training partition of a dataset to derive model parameters, the validation partition of the dataset for selecting among hyperparameters (making model decisions), and the testing partition of the dataset for final evaluation and reporting.
PyTorch provides an abstraction for the dataset via the Dataset class. The Dataset class is an abstract iterator. When using PyTorch with a new dataset, you must first subclass the Dataset class and implement the __getitem__() and __len__() methods.
"""
PyTorch Class for the Yelp Review Dataset
"""
import pandas as pd
from torch.utils.data import Dataset
class ReviewDataset(Dataset):
def __init__(self, review_df, vectorizer):
"""
        Assumes that the dataset has been cleaned and split into three
        partitions, that it can split reviews based on whitespace in order
        to get the list of tokens in a review, and that the data has an
        annotation for the split it belongs to.
Args:
review_df (pandas.DataFrame): the dataset
vectorizer (ReviewVectorizer): vectorizer instantiated from dataset
"""
self.review_df = review_df
self._vectorizer = vectorizer
self.train_df = self.review_df[self.review_df.split=='train']
self.train_size = len(self.train_df)
self.val_df = self.review_df[self.review_df.split=='val']
self.validation_size = len(self.val_df)
self.test_df = self.review_df[self.review_df.split=='test']
self.test_size = len(self.test_df)
self._lookup_dict = {'train': (self.train_df, self.train_size),
'val': (self.val_df, self.validation_size),
'test': (self.test_df, self.test_size)}
self.set_split('train')
# "@" classmethod indicates the entrypoint method for this dataset
@classmethod
def load_dataset_and_make_vectorizer(cls, review_csv):
"""Load dataset and make a new vectorizer from scratch
Args:
review_csv (str): location of the dataset
Returns:
an instance of ReviewDataset
"""
review_df = pd.read_csv(review_csv)
return cls(review_df, ReviewVectorizer.from_dataframe(review_df))
def get_vectorizer(self):
""" returns the vectorizer """
return self._vectorizer
def set_split(self, split="train"):
""" selects the splits in the dataset using a column in the dataframe
Args:
split (str): one of "train", "val", or "test"
"""
self._target_split = split
self._target_df, self._target_size = self._lookup_dict[split]
def __len__(self):
return self._target_size
def __getitem__(self, index):
"""the primary entry point method for PyTorch datasets
Args:
index (int): the index to the data point
Returns:
a dict of the data point's features (x_data) and label (y_target)
"""
row = self._target_df.iloc[index]
review_vector = \
self._vectorizer.vectorize(row.review)
rating_index = \
self._vectorizer.rating_vocab.lookup_token(row.rating)
return {'x_data': review_vector,
'y_target': rating_index}
def get_num_batches(self, batch_size):
"""Given a batch size, return the number of batches in the dataset
Args:
batch_size (int)
Returns:
number of batches in the dataset
"""
return len(self) // batch_size
The Vocabulary, the Vectorizer, and the DataLoader
The Vocabulary, the Vectorizer and DataLoader are three classes used in nearly every example to perform a crucial pipeline: converting text inputs to vectorized minibatches. The pipeline starts with preprocessed text.
Vocabulary
The first stage in going from text to a vectorized minibatch is to map each token to a numerical version of itself. The standard methodology is to have a bijection - a mapping that can be reversed - between the tokens and integers. In Python, this is simply two dictionaries. This bijection is encapsulated in the Vocabulary class. UNK is a special token that stands for "unknown".
class Vocabulary(object):
"""Class to process text and extract vocabulary for mapping"""
def __init__(self, token_to_idx=None, add_unk=True, unk_token="<UNK>"):
"""
Args:
token_to_idx (dict): a pre-existing map of tokens to indices
add_unk (bool): a flag that indicates whether to add the UNK token
unk_token (str): the UNK token to add into the Vocabulary
"""
if token_to_idx is None:
token_to_idx = {}
self._token_to_idx = token_to_idx
self._idx_to_token = {idx: token
for token, idx in self._token_to_idx.items()}
self._add_unk = add_unk
self._unk_token = unk_token
self.unk_index = -1
if add_unk:
self.unk_index = self.add_token(unk_token)
def to_serializable(self):
""" returns a dictionary that can be serialized """
return {'token_to_idx': self._token_to_idx,
'add_unk': self._add_unk,
'unk_token': self._unk_token}
@classmethod
def from_serializable(cls, contents):
""" instantiates the Vocabulary from a serialized dictionary """
return cls(**contents)
def add_token(self, token):
"""Update mapping dicts based on the token.
Args:
token (str): the item to add into the Vocabulary
Returns:
index (int): the integer corresponding to the token
"""
if token in self._token_to_idx:
index = self._token_to_idx[token]
else:
index = len(self._token_to_idx)
self._token_to_idx[token] = index
self._idx_to_token[index] = token
return index
def add_many(self, tokens):
"""Add a list of tokens into the Vocabulary
Args:
tokens (list): a list of string tokens
Returns:
indices (list): a list of indices corresponding to the tokens
"""
return [self.add_token(token) for token in tokens]
def lookup_token(self, token):
"""Retrieve the index associated with the token
or the UNK index if token isn't present.
Args:
token (str): the token to look up
Returns:
index (int): the index corresponding to the token
Notes:
`unk_index` needs to be >=0 (having been added into the Vocabulary)
for the UNK functionality
"""
if self.unk_index >= 0:
return self._token_to_idx.get(token, self.unk_index)
else:
return self._token_to_idx[token]
def lookup_index(self, index):
"""Return the token associated with the index
Args:
index (int): the index to look up
Returns:
token (str): the token corresponding to the index
Raises:
KeyError: if the index is not in the Vocabulary
"""
if index not in self._idx_to_token:
raise KeyError("the index (%d) is not in the Vocabulary" % index)
return self._idx_to_token[index]
def __str__(self):
return "<Vocabulary(size=%d)>" % len(self)
def __len__(self):
return len(self._token_to_idx)
Vectorizer
The second stage of going from a text dataset to a vectorized minibatch is to iterate through the tokens of an input data point and convert each token to its integer form. The result of this iteration should be a vector. Because this vector will be combined with vectors from other data points, every vector produced by the Vectorizer must be the same length. The Vectorizer below produces a one-hot encoded, sparse vector representation of reviews, and it discards the order in which the words appeared in the review (the "bag of words" approach).
import string
from collections import Counter
import numpy as np
class ReviewVectorizer(object):
""" The Vectorizer which coordinates the Vocabularies and puts them to use"""
def __init__(self, review_vocab, rating_vocab):
"""
Args:
review_vocab (Vocabulary): maps words to integers
rating_vocab (Vocabulary): maps class labels to integers
"""
self.review_vocab = review_vocab
self.rating_vocab = rating_vocab
def vectorize(self, review):
"""Create a collapsed one-hot vector for the review
Args:
review (str): the review
Returns:
one_hot (np.ndarray): the collapsed one-hot encoding
"""
one_hot = np.zeros(len(self.review_vocab), dtype=np.float32)
for token in review.split(" "):
if token not in string.punctuation:
one_hot[self.review_vocab.lookup_token(token)] = 1
return one_hot
@classmethod
def from_dataframe(cls, review_df, cutoff=25):
"""Instantiate the vectorizer from the dataset dataframe
Args:
review_df (pandas.DataFrame): the review dataset
cutoff (int): the parameter for frequency-based filtering
Returns:
an instance of the ReviewVectorizer
"""
review_vocab = Vocabulary(add_unk=True)
rating_vocab = Vocabulary(add_unk=False)
# Add ratings
for rating in sorted(set(review_df.rating)):
rating_vocab.add_token(rating)
# Add top words if count > provided count
word_counts = Counter()
for review in review_df.review:
for word in review.split(" "):
if word not in string.punctuation:
word_counts[word] += 1
for word, count in word_counts.items():
if count > cutoff:
review_vocab.add_token(word)
return cls(review_vocab, rating_vocab)
@classmethod
def from_serializable(cls, contents):
"""Instantiate a ReviewVectorizer from a serializable dictionary
Args:
contents (dict): the serializable dictionary
Returns:
an instance of the ReviewVectorizer class
"""
review_vocab = Vocabulary.from_serializable(contents['review_vocab'])
rating_vocab = Vocabulary.from_serializable(contents['rating_vocab'])
return cls(review_vocab=review_vocab, rating_vocab=rating_vocab)
def to_serializable(self):
"""Create the serializable dictionary for caching
Returns:
contents (dict): the serializable dictionary
"""
return {'review_vocab': self.review_vocab.to_serializable(),
'rating_vocab': self.rating_vocab.to_serializable()}
DataLoader
The final stage of the text-to-vectorized-minibatch pipeline is to actually group the vectorized data points. Because grouping into minibatches is so important, PyTorch provides a built-in class, DataLoader, to coordinate the process. The DataLoader class is instantiated by providing a PyTorch Dataset, a batch_size, and a handful of other keyword arguments. The resulting object is a Python iterator that groups and collates the data points provided in the Dataset.
"""
Generating minibatches from a dataset
"""
from torch.utils.data import DataLoader
def generate_batches(dataset, batch_size, shuffle=True,
drop_last=True, device="cpu"):
"""
A generator function which wraps the PyTorch DataLoader. It will
    ensure each tensor is on the right device location.
"""
dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
shuffle=shuffle, drop_last=drop_last)
for data_dict in dataloader:
out_data_dict = {}
for name, tensor in data_dict.items():
out_data_dict[name] = data_dict[name].to(device)
yield out_data_dict
A Perceptron Classifier
In a binary classification task, the binary cross-entropy loss (torch.nn.BCELoss()) is the most appropriate loss function, as it is mathematically formulated for binary probabilities. However, there are numerical stability issues with applying a sigmoid and then using this loss function.
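One common remedy, added here for completeness, is nn.BCEWithLogitsLoss, which fuses the sigmoid and the binary cross-entropy into a single, numerically stabler operation so the model can output raw logits instead of probabilities:
import torch
import torch.nn as nn
bce_logits_loss = nn.BCEWithLogitsLoss()
logits = torch.randn(4, 1, requires_grad=True)   # raw (pre-sigmoid) model outputs
targets = torch.tensor([1, 0, 1, 0], dtype=torch.float32).view(4, 1)
loss = bce_logits_loss(logits, targets)
print(loss)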
The Training Routine
At its core, the training routine is responsible for instantiating the model, iterating over the dataset, computing the output of the model when given the data as input, computing the loss (how wrong the model is), and updating the model proportional to the loss.
Feed-Forward Networks for Natural Language Processing
One of the historic downfalls of the perceptron is that it cannot learn modestly nontrivial patterns present in data. The classic example is the XOR situation, in which the decision boundary cannot be a single straight line (that is, the data is not linearly separable). In this case, the perceptron fails.
This chapter explores a family of neural network models traditionally called feed-forward networks. We focus on two kinds of feed-forward networks: the multilayer perceptron (MLP) and the convolutional neural network (CNN). The multilayer perceptron structurally extends the simpler perceptron that we saw in the last chapter by grouping many perceptrons in a single layer and stacking multiple layers together. The convolutional neural network is deeply inspired by windowed filters from digital signal processing. Through this windowing property, CNNs are able to learn localized patterns in their inputs, which has not only made them the workhorse of computer vision but also an ideal candidate for detecting substructures in sequential data, such as words and sentences.
Feed-forward neural networks stand in contrast to a different family of neural networks, recurrent neural networks (RNNs), which allow for feedback (or cycles) such that each computation is informed by the previous computation.
The Multilayer Perceptron
The multilayer perceptron (MLP) is considered one of the most basic neural network building blocks. The simplest MLP is an extension of the perceptron. The perceptron takes a data vector as input and computes a single output value. In an MLP, many perceptrons are grouped so that the output of a single layer is a new vector instead of a single output value. In PyTorch, this is done with the Linear layer by setting the number of output features. An additional aspect of an MLP is that it combines multiple layers with a nonlinearity between each layer.
Learning intermediate representations that have specific properties, like being linearly separable for a classification task, is one of the most profound consequences of using neural networks and is quintessential to their modeling capabilities. You must make sure that the input size of one Linear layer is equal to the output size of the previous Linear layer. Using a nonlinearity between two Linear layers is essential because without it, two Linear layers in sequence are mathematically equivalent to a single Linear layer and thus unable to model complex patterns.
Reading Inputs and Outputs of PyTorch models. The output of the MLP model is a tensor that has two rows and four columns. The rows in this tensor correspond to the batch dimension, which is the number of data points in the minibatch. The columns are the final feature vectors for each data point. In some cases, such as in a classification setting, the feature vector is a prediction vector (probability distribution). If you want to turn the prediction vector into probabilities, you must use the softmax activation function, which is used to transform a vector of values into probabilities.
Dropout, a form of regularization, probabilistically drops connections between units belonging to two adjacent layers during training. "Dropout, simply described, is the concept that if you can learn how to do a task repeatedly whilst drunk, you should be able to do the task even better when sober."
import torch.nn as nn
import torch.nn.functional as F
"""
Multilayer Perceptron Using PyTorch
"""
class MultilayerPerceptron(nn.Module):
"""
"""
def __init__(self, input_size, hidden_size=2, output_size=3,
num_hidden_layers=1, hidden_activation=nn.Sigmoid):
"""Initialize weights.
Args:
input_size (int): size of the input
hidden_size (int): size of the hidden layers
output_size (int): size of the output
num_hidden_layers (int): number of hidden layers
hidden_activation (torch.nn.*): the activation class
"""
super(MultilayerPerceptron, self).__init__()
self.module_list = nn.ModuleList()
interim_input_size = input_size
interim_output_size = hidden_size
for _ in range(num_hidden_layers):
self.module_list.append(nn.Linear(interim_input_size, interim_output_size))
self.module_list.append(hidden_activation())
interim_input_size = interim_output_size
# It is common to name Linear layers fc_x
# to stand for "Fully Connected"
self.fc_final = nn.Linear(interim_input_size, output_size)
self.last_forward_cache = []
def forward(self, x, apply_softmax=False):
"""The forward pass of the MLP
Args:
x_in (torch.Tensor): an input data tensor.
x_in.shape should be (batch, input_dim)
apply_softmax (bool): a flag for the softmax activation
should be false if used with the Cross Entropy losses
Returns:
the resulting tensor. tensor.shape should be (batch, output_dim)
"""
self.last_forward_cache = []
self.last_forward_cache.append(x.to("cpu").numpy())
for module in self.module_list:
x = module(x)
self.last_forward_cache.append(x.to("cpu").data.numpy())
output = self.fc_final(x)
self.last_forward_cache.append(output.to("cpu").data.numpy())
if apply_softmax:
output = F.softmax(output, dim=1)
return output
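A quick shape check of the MLP above (my addition; it uses the constructor defaults of hidden_size=2 and output_size=3 with a made-up batch of two 3-dimensional inputs):
import torch
torch.manual_seed(0)
mlp = MultilayerPerceptron(input_size=3)
x_input = torch.rand(2, 3)
y_output = mlp(x_input, apply_softmax=True)
print(y_output.shape)          # torch.Size([2, 3]): (batch, output_dim)
print(y_output.sum(dim=1))     # each row sums to 1 after the softmax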
Convolutional Neural Networks
The convolutional neural network (CNN) is a type of neural network that is well suited to detecting spatial substructure (and creating meaningful spatial substructure as a consequence). CNNs accomplish this by having a small number of weights that they use to scan the input data tensors. From this scanning, they produce output tensors that represent the detection (or not) of substructures.
CNN Hyperparameters
A "kernel" is a small square matrix that is applied at different positions in the input matrix in a systematic way.
CNNs are designed by specifying hyperparameters that control the behavior of the CNN and then using gradient descent to find the best parameters for a given dataset. The two primary hyperparameters control the shape of the convolution (called the kernel_size) and the positions the convolution will multiply in the input data tensor (called the stride). There are additional hyperparameters that control how much the input data tensor is padded with 0s (called padding) and how far apart the multiplications should be when applied to the input data tensor (called dilation).
Dimension of the Convolution Operation
PyTorch convolutions can be one-dimensional, two-dimensional, or three-dimensional and are implemented by the Conv1d, Conv2d, and Conv3d modules, respectively. The one-dimensional convolutions are useful for time series in which each time step has a feature vector. In this situation, we can learn patterns on the sequence dimension. Most convolution operations in NLP are one-dimensional convolutions. A two-dimensional convolution tries to capture spatio-temporal patterns along two directions in the data - for example, in images along the height and width dimensions, which is why two-dimensional convolutions are popular for image processing. Similarly, in three-dimensional convolutions the patterns are captured along three dimensions in the data.
Channels
Channels refers to the feature dimension along each point in the input. For example, in images there are three channels for each pixel, corresponding to the RGB components. A similar concept can be carried over to text data when using convolutions. Conceptually, if the "pixels" in a text document are words, the number of channels is the size of the vocabulary. In PyTorch's convolution implementation, the number of channels in the input is the in_channels argument. The convolution operation can produce more than one channel in the output (out_channels). You can think of this as the convolution operator "mapping" the input feature dimension to an output feature dimension.
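A small Conv1d sketch of this channel mapping (the dimensions are arbitrary choices of mine): an input feature dimension of size 10 is mapped to 16 output channels.
import torch
import torch.nn as nn
batch_size, vocab_size, seq_len = 2, 10, 7
data = torch.randn(batch_size, vocab_size, seq_len)   # (batch, in_channels, sequence)
conv1 = nn.Conv1d(in_channels=vocab_size, out_channels=16, kernel_size=3)
output = conv1(data)
print(output.shape)   # torch.Size([2, 16, 5]): 16 channels, length 7 - 3 + 1 = 5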
It's difficult to immediately know how many output channels are appropriate for the problem at hand. A common design pattern is not to shrink the number of channels by more than a factor of two from one convolutional layer to the next.
Kernel Size
The width of the kernel matrix is called the kernel size (kernel_size in PyTorch). The intuition you should develop is that convolutions combine spatially (or temporally) local information in the input, and the amount of local information per convolution is controlled by the kernel size.
You can think of the behavior of kernel size in NLP applications as being similar to the behavior of n-grams, which capture patterns by looking at groups of words. With smaller kernel sizes, smaller, more frequent patterns are captured, whereas larger kernel sizes lead to larger patterns, which might be more meaningful but occur less frequently. Small kernel sizes lead to fine-grained features in the output, whereas large kernel sizes lead to coarse-grained features.
Stride
Stride controls the step size between convolutions. If the stride is the same size as the kernel, the kernel computations do not overlap. If the stride is 1, the kernels are maximally overlapping. The output tensor can be deliberately shrunk to summarize information by increasing the stride.
Padding
stride and kernel_size can shrink the total size of the feature map. To counteract this, the input data tensor is artificially made larger in length (1D, 2D, or 3D), height (2D or 3D), and depth (3D) by appending and prepending 0s to each respective dimension. This means that the CNN will perform more convolutions, but the output shape can be controlled without compromising the desired kernel size, stride, or dilation.
Dilation
Dilation controls how the convolutional kernel is applied to the input matrix. Increasing the dilation from 1 to 2 means that the elements of the kernel are two spaces away from each other when applied to the input matrix. Another way to think about this is striding the kernel itself - there is a step size between the elements in the kernel, i.e., the kernel is applied with "holes". This can be useful for summarizing larger regions of the input space without an increase in the number of parameters. Dilated convolutions have proven very useful when convolution layers are stacked.
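To see how these hyperparameters interact, here is a small sketch of my own (arbitrary values) showing how kernel_size, stride, padding, and dilation change the output length of a 1D convolution:
import torch
import torch.nn as nn
x = torch.randn(1, 8, 20)     # (batch, channels, sequence length 20)
print(nn.Conv1d(8, 8, kernel_size=3)(x).shape)              # length 18
print(nn.Conv1d(8, 8, kernel_size=3, stride=2)(x).shape)    # length 9
print(nn.Conv1d(8, 8, kernel_size=3, padding=1)(x).shape)   # length 20 (preserved)
print(nn.Conv1d(8, 8, kernel_size=3, dilation=2)(x).shape)  # length 16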
Miscellaneous Topics in CNNs
Pooling
Pooling is an operation to summarize a higher-dimensional feature map into a lower-dimensional feature map. The output of a convolution is a feature map, and the values in the feature map summarize some region of the input. Due to the overlapping nature of the convolution computation, many of the computed features can be redundant. Pooling is a way to summarize a high-dimensional, and possibly redundant, feature map into a lower-dimensional one. Pooling is an arithmetic operator like sum, mean, or max applied over a local region of a feature map in a systematic way, and the resulting operations are known as sum pooling, average pooling, and max pooling, respectively. Pooling can also function as a way to condense a larger but weaker feature map into a smaller but stronger one, improving its statistical strength.
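A brief max-pooling sketch (my addition; the feature map values are made up): a window of size 2 summarizes each pair of neighboring positions by taking the maximum.
import torch
import torch.nn as nn
feature_map = torch.tensor([[[1., 3., 2., 5., 4., 0.]]])   # (batch, channels, length)
max_pool = nn.MaxPool1d(kernel_size=2)                     # stride defaults to the kernel size
print(max_pool(feature_map))                               # tensor([[[3., 5., 4.]]])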
Batch Normalization (BatchNorm)
Batch normalization, or BatchNorm, is an often-used tool in designing CNNs. BatchNorm applies a transformation to the output of a CNN by scaling the activations to have zero mean and unit variance. BatchNorm allows models to be less sensitive to the initialization of the parameters and simplifies the tuning of learning rates.
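A short sketch of BatchNorm1d placed after a Conv1d layer (the channel counts are arbitrary); the num_features argument must match the convolution's out_channels.
import torch
import torch.nn as nn
conv = nn.Conv1d(in_channels=4, out_channels=8, kernel_size=3)
bn = nn.BatchNorm1d(num_features=8)    # normalizes per channel over the batch
x = torch.randn(2, 4, 10)
out = bn(conv(x))
print(out.shape)                       # torch.Size([2, 8, 8])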
Network-in-Network Connections (1 x 1 Convolutions)
Network-in-Network (NiN) connections are convolutional kernels with kernel_size=1 and have a few interesting properties. A 1x1 convolution acts like a fully connected linear layer across the channels, which is useful for mapping from feature maps with many channels to shallower feature maps. NiN connections, or 1x1 convolutions, provide an inexpensive way to incorporate additional nonlinearity with few parameters.
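A sketch of a 1x1 convolution reducing 64 channels to 16 (arbitrary sizes), acting as a cheap per-position linear map across the channel dimension:
import torch
import torch.nn as nn
x = torch.randn(2, 64, 30)             # (batch, channels, length)
nin = nn.Conv1d(in_channels=64, out_channels=16, kernel_size=1)
print(nin(x).shape)                    # torch.Size([2, 16, 30]): length unchanged, channels reduced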
Residual Connections / Residual Block
One of the most significant trends in CNNs that has enabled really deep networks (more than 100 layers) is the residual connection, also called the skip connection. The output of a residual block is output = conv(input) + input. For the input to be added to the output of the convolution, they must have the same shape, so padding is applied before the convolution.
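A minimal residual-block sketch of my own (simplified relative to real architectures): padding of (kernel_size - 1) // 2 keeps the convolution output the same shape as its input so the two can be added.
import torch
import torch.nn as nn
class ResidualBlock1d(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super(ResidualBlock1d, self).__init__()
        # this padding preserves the sequence length so the shapes match
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=(kernel_size - 1) // 2)
    def forward(self, x):
        return self.conv(x) + x        # output = conv(input) + input
block = ResidualBlock1d(channels=8)
x = torch.randn(2, 8, 20)
print(block(x).shape)                  # torch.Size([2, 8, 20]): same shape as the input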