PyTorch Review

One of my goals for 2025 was to complete a machine learning / deep learning project every day. I missed the first three days because I was a little wary about getting back into Python after not having used it much for data analytics in a few months (one of the reasons I created this goal was to keep my Python / data analytics skills sharp). I am creating this Jupyter Notebook to review PyTorch before getting back into doing deep learning problems consistently.


References

Learn the Basics

Most machine learning workflows involve working with data, creating models, optimizing model parameters, and saving the trained models. This tutorial introduces you to a complete ML workflow implemented in PyTorch, with links to learn more about each of these concepts. We will use the FashionMNIST dataset here.

Quickstart

PyTorch has two primitives to work with data: torch.utils.data.DataLoader and torch.utils.data.Dataset. Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset. PyTorch offers domain-specific libraries such as TorchText, TorchVision, and TorchAudio, all of which include datasets. Every TorchVision Dataset includes two arguments: transform and target_transform to modify the samples and labels respectively.

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor
out[2]
# Download training data from open datasets.
training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor(),
)

# Download test data from open datasets.
test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor(),
)
out[3]

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to data/FashionMNIST/raw/train-images-idx3-ubyte.gz

100%|██████████| 26.4M/26.4M [00:01<00:00, 16.8MB/s]

Extracting data/FashionMNIST/raw/train-images-idx3-ubyte.gz to data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to data/FashionMNIST/raw/train-labels-idx1-ubyte.gz

100%|██████████| 29.5k/29.5k [00:00<00:00, 270kB/s]

Extracting data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz

100%|██████████| 4.42M/4.42M [00:00<00:00, 5.02MB/s]

Extracting data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz

100%|██████████| 5.15k/5.15k [00:00<00:00, 6.94MB/s]

Extracting data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to data/FashionMNIST/raw


We pass the Dataset as an argument to DataLoader. This wraps an iterable over our dataset, and supports automatic batching, sampling, shuffling, and multiprocess data loading.

batch_size = 64

# Create data loaders.
train_dataloader = DataLoader(training_data, batch_size=batch_size)
test_dataloader = DataLoader(test_data, batch_size=batch_size)

for X, y in test_dataloader:
    print(f"Shape of X [N, C, H, W]: {X.shape}")
    print(f"Shape of y: {y.shape} {y.dtype}")
    break
out[5]

Shape of X [N, C, H, W]: torch.Size([64, 1, 28, 28])
Shape of y: torch.Size([64]) torch.int64
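
The loaders above use DataLoader's defaults. As a hedged illustration of the shuffling and multiprocess loading mentioned earlier (the shuffle and num_workers values here are arbitrary choices, not part of the tutorial code):

# Illustrative only: reshuffle the training data every epoch and load batches
# using two worker processes.
shuffled_train_dataloader = DataLoader(
    training_data,
    batch_size=batch_size,
    shuffle=True,
    num_workers=2,
)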

Creating Models

To define a neural network in PyTorch, we create a class that inherits from nn.Module. We define the layers of the network in the __init__ function and specify how data will pass through the network in the forward function. To accelerate operations in the neural network, we move it to the GPU or MPS if available.

# Get cpu, gpu or mps device for training.
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")

# Define model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork().to(device)
print(model)
out[7]

Using cpu device
NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)

Optimizing the Model Parameters

To train the model, we need a loss function and an optimizer.

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
out[9]

In a single training loop, the model makes predictions on the training dataset (fed to it in batches) and backpropagates the prediction error to adjust the model's parameters.

def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")
out[11]

We also check the model's performance against the test dataset to ensure it is learning.

def test(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")
out[13]

The training process is conducted over several iterations (epochs). During each epoch, the model learns parameters to make better predictions. We print the model's accuracy and loss at each epoch; we'd like to see the accuracy increase and the loss decrease with every epoch.

epochs = 5
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train(train_dataloader, model, loss_fn, optimizer)
    test(test_dataloader, model, loss_fn)
print("Done!")
out[15]

Epoch 1
-------------------------------
loss: 2.312286 [ 64/60000]
loss: 2.292726 [ 6464/60000]
loss: 2.278312 [12864/60000]
loss: 2.263963 [19264/60000]
loss: 2.260134 [25664/60000]
loss: 2.233418 [32064/60000]
loss: 2.235806 [38464/60000]
loss: 2.208580 [44864/60000]
loss: 2.203783 [51264/60000]
loss: 2.176208 [57664/60000]
Test Error:
Accuracy: 47.5%, Avg loss: 2.168565

Epoch 2
-------------------------------
loss: 2.184954 [ 64/60000]
loss: 2.168498 [ 6464/60000]
loss: 2.118544 [12864/60000]
loss: 2.124995 [19264/60000]
loss: 2.090115 [25664/60000]
loss: 2.038443 [32064/60000]
loss: 2.060489 [38464/60000]
loss: 1.991642 [44864/60000]
loss: 1.995058 [51264/60000]
loss: 1.919833 [57664/60000]
Test Error:
Accuracy: 57.4%, Avg loss: 1.920518

Epoch 3
-------------------------------
loss: 1.964385 [ 64/60000]
loss: 1.927278 [ 6464/60000]
loss: 1.817406 [12864/60000]
loss: 1.841750 [19264/60000]
loss: 1.746342 [25664/60000]
loss: 1.699706 [32064/60000]
loss: 1.714502 [38464/60000]
loss: 1.620638 [44864/60000]
loss: 1.642586 [51264/60000]
loss: 1.523147 [57664/60000]
Test Error:
Accuracy: 59.1%, Avg loss: 1.546563

Epoch 4
-------------------------------
loss: 1.627865 [ 64/60000]
loss: 1.581894 [ 6464/60000]
loss: 1.436008 [12864/60000]
loss: 1.490509 [19264/60000]
loss: 1.374736 [25664/60000]
loss: 1.373549 [32064/60000]
loss: 1.378864 [38464/60000]
loss: 1.309162 [44864/60000]
loss: 1.343640 [51264/60000]
loss: 1.234531 [57664/60000]
Test Error:
Accuracy: 63.2%, Avg loss: 1.264600

Epoch 5
-------------------------------
loss: 1.356792 [ 64/60000]
loss: 1.327374 [ 6464/60000]
loss: 1.167462 [12864/60000]
loss: 1.260016 [19264/60000]
loss: 1.135780 [25664/60000]
loss: 1.163786 [32064/60000]
loss: 1.177685 [38464/60000]
loss: 1.119889 [44864/60000]
loss: 1.160879 [51264/60000]
loss: 1.076022 [57664/60000]
Test Error:
Accuracy: 64.7%, Avg loss: 1.095589

Done!

A common way to save a model is to serialize the internal state dictionary (containing the model parameters). The process of loading a model includes re-creating the model structure and loading the state dictionary into it.

torch.save(model.state_dict(), "model.pth")
print("Saved PyTorch Model State to model.pth")

model = NeuralNetwork().to(device)
model.load_state_dict(torch.load("model.pth", weights_only=True))
out[17]

Saved PyTorch Model State to model.pth

<All keys matched successfully>

The model can now be used to make predictions.

classes = [
    "T-shirt/top",
    "Trouser",
    "Pullover",
    "Dress",
    "Coat",
    "Sandal",
    "Shirt",
    "Sneaker",
    "Bag",
    "Ankle boot",
]

model.eval()
x, y = test_data[0][0], test_data[0][1]
with torch.no_grad():
    x = x.to(device)
    pred = model(x)
    predicted, actual = classes[pred[0].argmax(0)], classes[y]
    print(f'Predicted: "{predicted}", Actual: "{actual}"')
out[19]

Predicted: "Ankle boot", Actual: "Ankle boot"

Tensors

Tensors are a specialized data structure that are very similar to arrays and matrices. In PyTorch, we use tensors to encode the inputs and outputs of a model, as well as the model's parameters. Tensors are similar to NumPy's ndarrays, except that tensors can run on GPUs or other hardware accelerators. In fact, tensors and NumPy arrays can often share the same underlying memory, eliminating the need to copy data. Tensors are also optimized for automatic differentiation. If you're familiar with ndarrays, you'll recognize the Tensor API.

import numpy as np
# Tensors can be initialized in various ways

# Directly from data
data = [[1, 2],[3, 4]]
x_data = torch.tensor(data)

# From a NumPy array
np_array = np.array(data)
x_np = torch.from_numpy(np_array)

# From another tensor.
x_ones = torch.ones_like(x_data) # retains the properties of x_data
print(f"Ones Tensor: \n {x_ones} \n")

x_rand = torch.rand_like(x_data, dtype=torch.float) # overrides the datatype of x_data
print(f"Random Tensor: \n {x_rand} \n")
out[21]

Ones Tensor:
tensor([[1, 1],
[1, 1]])

Random Tensor:
tensor([[0.8045, 0.3469],
[0.0646, 0.9795]])

shape is a tuple of tensor dimensions. In the functions below, it determines the dimensionality of the output tensor.

shape = (2,3,)
rand_tensor = torch.rand(shape)
ones_tensor = torch.ones(shape)
zeros_tensor = torch.zeros(shape)

print(f"Random Tensor: \n {rand_tensor} \n")
print(f"Ones Tensor: \n {ones_tensor} \n")
print(f"Zeros Tensor: \n {zeros_tensor}")
out[23]

Random Tensor:
tensor([[0.9373, 0.4768, 0.8225],
[0.4201, 0.5240, 0.7535]])

Ones Tensor:
tensor([[1., 1., 1.],
[1., 1., 1.]])

Zeros Tensor:
tensor([[0., 0., 0.],
[0., 0., 0.]])

Tensor attributes describe the shape, datatype, and the device on which they are stored.

tensor = torch.rand(3,4)

print(f"Shape of tensor: {tensor.shape}")
print(f"Datatype of tensor: {tensor.dtype}")
print(f"Device tensor is stored on: {tensor.device}")
out[25]

Shape of tensor: torch.Size([3, 4])
Datatype of tensor: torch.float32
Device tensor is stored on: cpu

Over 100 tensor operations, including arithmetic, linear algebra, matrix multiplication, sampling, and more, are comprehensively described on this page. Each of these operations can be run on the GPU. By default, tensors are created on the CPU; we need to explicitly move tensors to the GPU using the .to method.

# We move our tensor to the GPU if available
if torch.cuda.is_available():
    tensor = tensor.to("cuda")
out[27]
tensor = torch.ones(4, 4)
print(f"First row: {tensor[0]}")
print(f"First column: {tensor[:, 0]}")
print(f"Last column: {tensor[..., -1]}")
tensor[:,1] = 0
print(tensor)
out[28]

First row: tensor([1., 1., 1., 1.])
First column: tensor([1., 1., 1., 1.])
Last column: tensor([1., 1., 1., 1.])
tensor([[1., 0., 1., 1.],
[1., 0., 1., 1.],
[1., 0., 1., 1.],
[1., 0., 1., 1.]])

t1 = torch.cat([tensor, tensor, tensor], dim=1)
print(t1)
out[29]

tensor([[1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1.],
[1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1.],
[1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1.],
[1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1.]])

# This computes the matrix multiplication between two tensors. y1, y2, y3 will have the same value
# ``tensor.T`` returns the transpose of a tensor
y1 = tensor @ tensor.T
y2 = tensor.matmul(tensor.T)

y3 = torch.rand_like(y1)
torch.matmul(tensor, tensor.T, out=y3)


# This computes the element-wise product. z1, z2, z3 will have the same value
z1 = tensor * tensor
z2 = tensor.mul(tensor)

z3 = torch.rand_like(tensor)
torch.mul(tensor, tensor, out=z3)
out[30]

tensor([[1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.]])

agg = tensor.sum()
agg_item = agg.item()
print(agg_item, type(agg_item))
out[31]

12.0 <class 'float'>

# In-place operations
print(f"{tensor} \n")
tensor.add_(5)
print(tensor)
out[32]

tensor([[1., 0., 1., 1.],
[1., 0., 1., 1.],
[1., 0., 1., 1.],
[1., 0., 1., 1.]])

tensor([[6., 5., 6., 6.],
[6., 5., 6., 6.],
[6., 5., 6., 6.],
[6., 5., 6., 6.]])

# Tensor to numpy
t = torch.ones(5)
print(f"t: {t}")
n = t.numpy()
print(f"n: {n}")
out[33]

t: tensor([1., 1., 1., 1., 1.])
n: [1. 1. 1. 1. 1.]
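
As mentioned in the Tensors section, a CPU tensor and its NumPy counterpart can share the same underlying memory. A quick check using the t and n defined above (the expected values are shown in comments, not captured notebook output):

# An in-place change to the tensor is reflected in the NumPy array.
t.add_(1)
print(f"t: {t}")  # t: tensor([2., 2., 2., 2., 2.])
print(f"n: {n}")  # n: [2. 2. 2. 2. 2.]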

Datasets and DataLoaders

Code for processing data samples can get messy and hard to maintain; we ideally want our dataset code to be decoupled from our model training code for better readability and modularity. PyTorch provides two data primitives: torch.utils.data.DataLoader and torch.utils.data.Dataset that allow you to use pre-loaded datasets as well as your own data. Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around Dataset to enable easy access to the samples. PyTorch includes a number of datasets to prototype and benchmark your model.

The example below shows how to load the Fashion-MNIST dataset from TorchVision.

import torch
from torch.utils.data import Dataset
from torchvision import datasets
from torchvision.transforms import ToTensor
import matplotlib.pyplot as plt


training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)
out[35]

We can index Datasets manually like a list: training_data[index]. We use matplotlib to visualize some samples in our training data.

labels_map = {
    0: "T-Shirt",
    1: "Trouser",
    2: "Pullover",
    3: "Dress",
    4: "Coat",
    5: "Sandal",
    6: "Shirt",
    7: "Sneaker",
    8: "Bag",
    9: "Ankle Boot",
}
figure = plt.figure(figsize=(8, 8))
cols, rows = 3, 3
for i in range(1, cols * rows + 1):
    sample_idx = torch.randint(len(training_data), size=(1,)).item()
    img, label = training_data[sample_idx]
    figure.add_subplot(rows, cols, i)
    plt.title(labels_map[label])
    plt.axis("off")
    plt.imshow(img.squeeze(), cmap="gray")
plt.show()
out[37]
[Figure: 3x3 grid of randomly sampled FashionMNIST training images with their labels (800x800, 9 axes)]

A custom Dataset class must implement three functions: __init__, __len__, and __getitem__. Take a look at this implementation; the FashionMNIST images are stored in a directory img_dir, and their labels are stored separately in a CSV file annotations_file.

import os
import pandas as pd
from torchvision.io import read_image

class CustomImageDataset(Dataset):
    def __init__(self, annotations_file, img_dir, transform=None, target_transform=None):
      """
      The __init__ function is run once when instantiating the Dataset object. We initialize the directory containing the images, the annotations file, and both transforms.
      """
      self.img_labels = pd.read_csv(annotations_file)
      self.img_dir = img_dir
      self.transform = transform
      self.target_transform = target_transform

    def __len__(self):
      """
      The __len__ function returns the number of samples in the dataset.
      """
      return len(self.img_labels)

    def __getitem__(self, idx):
      """
      The __getitem__ function loads and returns a sample from the dataset at the given index idx. Based on the index, it identifies the image's location on disk, converts that to a tensor using `read_image`, retrieves the corresponding label from the CSV data in self.img_labels, calls the transform functions on them (if applicable), and returns the tensor image and corresponding label in a tuple.
      """
      img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
      image = read_image(img_path)
      label = self.img_labels.iloc[idx, 1]
      if self.transform:
          image = self.transform(image)
      if self.target_transform:
          label = self.target_transform(label)
      return image, label
out[39]

The Dataset retrieves our dataset's features and labels one sample at a time. While training a model, we typically want to pass samples in "minibatches", reshuffle the data at every epoch to reduce model overfitting, and use Python's multiprocessing to speed up data retrieval. DataLoader is an iterable that abstracts this complexity for us in an easy API.

from torch.utils.data import DataLoader

train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=64, shuffle=True)
out[41]
# Display image and label.
train_features, train_labels = next(iter(train_dataloader))
print(f"Feature batch shape: {train_features.size()}")
print(f"Labels batch shape: {train_labels.size()}")
img = train_features[0].squeeze()
label = train_labels[0]
plt.imshow(img, cmap="gray")
plt.show()
print(f"Label: {label}")
out[42]

Feature batch shape: torch.Size([64, 1, 28, 28])
Labels batch shape: torch.Size([64])

[Figure: a single grayscale FashionMNIST image from the first training batch (640x480, 1 axes)]

Label: 0

Transforms

Data does not always come in its final processed form that is required for training machine learning algorithms. We use transforms to perform some manipulation of the data and make it suitable for training. All TorchVision datasets have two parameters - transform to modify the features and target_transform to modify the labels - that accept callables containing the transformation logic. The FashionMNIST features are in PIL Image format, and the labels are integers. For training, we need the features as normalized tensors, and the labels as one-hot encoded tensors. To make these transformations, we use ToTensor and Lambda.

import torch
from torchvision import datasets
from torchvision.transforms import ToTensor, Lambda

ds = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor(),
    target_transform=Lambda(lambda y: torch.zeros(10, dtype=torch.float).scatter_(0, torch.tensor(y), value=1))
)
out[44]

ToTensor() converts a PIL image or NumPy ndarray into a FloatTensor, and scales the image's pixel intensity values in the range [0,1].
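
As a quick sanity check (a hedged sketch using the ds defined above, with expected values in comments rather than captured output), the transformed sample should be a float tensor scaled to [0, 1], and the transformed label should be a one-hot float tensor:

# Inspect one transformed sample from `ds` defined above.
img, label = ds[0]
print(img.dtype, img.min().item(), img.max().item())  # torch.float32 0.0 1.0
print(label)  # a length-10 one-hot vector, e.g. tensor([0., 0., ..., 1.])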

Build the Neural Network

Neural networks comprise layers/modules that perform operations on data. The torch.nn namespace provides all the building blocks you need to build your own neural network. Every module in PyTorch subclasses nn.Module. A neural network is a module itself that consists of other modules (layers). This nested structure allows for building and managing complex architectures easily.

import os
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")
class NeuralNetwork(nn.Module):
  """
  We define our neural network by subclassing nn.Module, and initialize the neural network layers in __init__. Every `nn.Module` subclass implements operations on input data in the `forward` method.
  """
  def __init__(self):
    super().__init__()
    self.flatten = nn.Flatten()
    self.linear_relu_stack = nn.Sequential(
        nn.Linear(28*28, 512),
        nn.ReLU(),
        nn.Linear(512, 512),
        nn.ReLU(),
        nn.Linear(512, 10),
    )

  def forward(self, x):
    x = self.flatten(x)
    logits = self.linear_relu_stack(x)
    return logits
# Create an instance of NeuralNetwork and move it to the device, then print its structure
model = NeuralNetwork().to(device)
print(model)
out[47]

Using cpu device
NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)

To use the model, we pass it the input data. This executes the model's forward, along with some background operations. Do not call model.forward() directly. Calling the model on the input returns a 2-dimensional tensor with dim=0 corresponding to each output of 10 raw predicted values for each class, and dim=1 corresponding to the individual values of each output. We get the prediction probabilities by passing it through an instance of the nn.Softmax module.

X = torch.rand(1, 28, 28, device=device)
logits = model(X)
pred_probab = nn.Softmax(dim=1)(logits)
y_pred = pred_probab.argmax(1)
print(f"Predicted class: {y_pred}")
out[49]

Predicted class: tensor([5])

input_image = torch.rand(3,28,28)
print(input_image.size())
# Flatten image matrix into array
flatten = nn.Flatten()
flat_image = flatten(input_image)
print(flat_image.size())
# Apply a linear transformation on the input using its stored weights and values
layer1 = nn.Linear(in_features=28*28, out_features=20)
hidden1 = layer1(flat_image)
print(hidden1.size())
"""
Non-linear activations are what create the complex mappings between the model’s inputs and outputs. They are applied after linear transformations to introduce nonlinearity, helping neural networks learn a wide variety of phenomena.
In this model, we use nn.ReLU between our linear layers, but there are other activations to introduce non-linearity in your model.
"""
print(f"Before ReLU: {hidden1}\n\n")
hidden1 = nn.ReLU()(hidden1)
print(f"After ReLU: {hidden1}")
"""
nn.Sequential is an ordered container of modules. The data is passed through all the modules in the same order as defined. You can use sequential containers to put together a quick network like seq_modules.
"""
seq_modules = nn.Sequential(
    flatten,
    layer1,
    nn.ReLU(),
    nn.Linear(20, 10)
)
input_image = torch.rand(3,28,28)
logits = seq_modules(input_image)
"""
The last linear layer of the neural network returns logits - raw values in [-∞, ∞] - which are passed to the nn.Softmax module. The logits are scaled to values in [0, 1] representing the model's predicted probabilities for each class. The dim parameter indicates the dimension along which the values must sum to 1.
"""
softmax = nn.Softmax(dim=1)
pred_probab = softmax(logits)
"""
Many layers inside a neural network are parameterized, i.e. have associated weights and biases that are optimized during training. Subclassing nn.Module automatically tracks all fields defined inside your model object, and makes all parameters accessible using your model’s parameters() or named_parameters() methods.

In this example, we iterate over each parameter, and print its size and a preview of its values.
"""
print(f"Model structure: {model}\n\n")

for name, param in model.named_parameters():
    print(f"Layer: {name} | Size: {param.size()} | Values : {param[:2]} \n")
out[51]

torch.Size([3, 28, 28])
torch.Size([3, 784])
torch.Size([3, 20])
Before ReLU: tensor([[-0.1463, -0.3910, 0.2380, 0.3751, -0.2928, 0.5330, 0.2122, 0.3659,
-0.0212, -0.2090, -0.1804, 0.1798, -0.0859, -0.0236, -0.0777, 0.0555,
-0.3779, -0.3800, 0.2164, 0.0720],
[-0.3523, -0.0162, 0.3517, 0.0242, -0.5485, 0.3751, 0.0027, 0.5842,
-0.1824, 0.1364, 0.2567, -0.0426, -0.3081, 0.3706, -0.3464, -0.0870,
-0.5391, -0.3459, 0.3385, 0.4627],
[-0.0434, -0.1014, 0.5089, 0.0564, -0.5022, 0.1144, 0.1409, 0.2206,
-0.1875, -0.1411, 0.0171, 0.0064, -0.3736, 0.2107, -0.2088, 0.0419,
-0.2619, -0.6528, 0.1259, 0.3006]], grad_fn=<AddmmBackward0>)


After ReLU: tensor([[0.0000, 0.0000, 0.2380, 0.3751, 0.0000, 0.5330, 0.2122, 0.3659, 0.0000,
0.0000, 0.0000, 0.1798, 0.0000, 0.0000, 0.0000, 0.0555, 0.0000, 0.0000,
0.2164, 0.0720],
[0.0000, 0.0000, 0.3517, 0.0242, 0.0000, 0.3751, 0.0027, 0.5842, 0.0000,
0.1364, 0.2567, 0.0000, 0.0000, 0.3706, 0.0000, 0.0000, 0.0000, 0.0000,
0.3385, 0.4627],
[0.0000, 0.0000, 0.5089, 0.0564, 0.0000, 0.1144, 0.1409, 0.2206, 0.0000,
0.0000, 0.0171, 0.0064, 0.0000, 0.2107, 0.0000, 0.0419, 0.0000, 0.0000,
0.1259, 0.3006]], grad_fn=<ReluBackward0>)
Model structure: NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)


Layer: linear_relu_stack.0.weight | Size: torch.Size([512, 784]) | Values : tensor([[ 0.0159, -0.0231, -0.0300, ..., 0.0195, -0.0020, 0.0293],
[ 0.0242, 0.0306, -0.0102, ..., 0.0233, -0.0321, -0.0277]],
grad_fn=<SliceBackward0>)

Layer: linear_relu_stack.0.bias | Size: torch.Size([512]) | Values : tensor([ 0.0039, -0.0241], grad_fn=<SliceBackward0>)

Layer: linear_relu_stack.2.weight | Size: torch.Size([512, 512]) | Values : tensor([[ 0.0343, 0.0408, -0.0081, ..., 0.0387, 0.0198, 0.0165],
[-0.0092, -0.0233, 0.0114, ..., -0.0124, 0.0429, 0.0356]],
grad_fn=<SliceBackward0>)

Layer: linear_relu_stack.2.bias | Size: torch.Size([512]) | Values : tensor([-0.0290, -0.0237], grad_fn=<SliceBackward0>)

Layer: linear_relu_stack.4.weight | Size: torch.Size([10, 512]) | Values : tensor([[ 0.0307, 0.0005, -0.0151, ..., 0.0251, -0.0210, 0.0110],
[-0.0273, -0.0115, 0.0098, ..., -0.0208, 0.0245, 0.0059]],
grad_fn=<SliceBackward0>)

Layer: linear_relu_stack.4.bias | Size: torch.Size([10]) | Values : tensor([0.0252, 0.0085], grad_fn=<SliceBackward0>)

Automatic Differentiation with torch.autograd

When training neural networks, the most frequently used algorithm is backpropagation. In this algorithm, parameters (model weights) are adjusted according to the gradient of the loss function with respect to the given parameter. To compute the gradients, PyTorch has a built-in differentiation engine called torch.autograd. It supports automatic computation of gradients for any computational graph.

"""
Simple Neural Network
"""

x = torch.ones(5)  # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w)+b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
out[53]

Computation Graph of Code Above

In this network, w and b are parameters, which we need to optimize. Thus, we need to be able to compute the gradients of loss function with respect to those variables. In order to do that, we set the requires_grad property of those tensors.

A function that we apply to tensors to construct computational graph is in fact an object of class Function. This object knows how to compute the function in the forward direction, and also how to compute its derivative during the backward propagation step. A reference to the backward propagation function is stored in grad_fn property of a tensor.

print(f"Gradient function for z = {z.grad_fn}")
print(f"Gradient function for loss = {loss.grad_fn}")
out[55]

Gradient function for z = <AddBackward0 object at 0x7fe12baf3ca0>
Gradient function for loss = <BinaryCrossEntropyWithLogitsBackward0 object at 0x7fe12baf2470>

To optimize weights of parameters in the neural network, we need to compute the derivatives of our loss function with respect to the parameters, namely, we need $\frac{\partial \text{Loss}}{\partial w}$ and $\frac{\partial \text{Loss}}{\partial b}$ under some fixed values of x and y. To compute those derivatives, we call loss.backward(), and then retrieve the values from w.grad and b.grad.

loss.backward()
print(w.grad)
print(b.grad)
out[57]

tensor([[0.0273, 0.2098, 0.0029],
[0.0273, 0.2098, 0.0029],
[0.0273, 0.2098, 0.0029],
[0.0273, 0.2098, 0.0029],
[0.0273, 0.2098, 0.0029]])
tensor([0.0273, 0.2098, 0.0029])

By default, all tensors with requires_grad=True are tracking their computational history and support gradient computation. However, there are some cases when we do not need to do that, for example, when we have trained the model and just want to apply it to some input data, i.e. we only want to do forward computations through the network. We can stop tracking computations by surrounding our computation code with torch.no_grad() block:

z = torch.matmul(x, w)+b
print(z.requires_grad)

with torch.no_grad():
    z = torch.matmul(x, w)+b
print(z.requires_grad)

z = torch.matmul(x, w)+b
z_det = z.detach()
print(z_det.requires_grad)
out[59]

True
False
False

There are reasons you might want to disable gradient tracking:

  • To mark some parameters in your neural network as frozen parameters (see the sketch after this list).

  • To speed up computations when you are only doing forward pass, because computations on tensors that do not track gradients would be more efficient.
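
A minimal sketch of the frozen-parameters case, assuming the NeuralNetwork class defined earlier (the choice to freeze the first linear layer, and the names frozen_model and frozen_optimizer, are illustrative):

# Freeze the first Linear layer of the stack so its weight and bias stop
# receiving gradients; only the remaining parameters are given to the optimizer.
frozen_model = NeuralNetwork().to(device)
for param in frozen_model.linear_relu_stack[0].parameters():
    param.requires_grad_(False)

trainable_params = [p for p in frozen_model.parameters() if p.requires_grad]
frozen_optimizer = torch.optim.SGD(trainable_params, lr=1e-3)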

Conceptually, autograd keeps a record of data (tensors) and all executed operations (along with the resulting new tensors) in a directed acyclic graph (DAG) of Function objects. In this DAG, leaves are the input tensors, roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

In a forward pass, autograd does two things simultaneously:

  • run the requested operation to compute a resulting tensor
  • maintain the operation's gradient function in the DAG

The backward pass kicks off when .backward() is called on the DAG root. autograd then:

  • computes the gradients from each .grad_fn
  • accumulates them in the respective tensor's .grad attribute (see the sketch below)
  • using the chain rule, propagates all the way to the leaf tensors
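
To see the accumulation behavior concretely, here is a small sketch (the tensors are throwaway examples, not part of the tutorial): calling backward() on a freshly built graph a second time adds to .grad rather than overwriting it, which is why training loops call optimizer.zero_grad().

# Each forward expression builds a new graph; backward() accumulates into .grad.
w_demo = torch.randn(5, 3, requires_grad=True)
x_demo = torch.ones(5)
(x_demo @ w_demo).sum().backward()
first_grad = w_demo.grad.clone()
(x_demo @ w_demo).sum().backward()
print(torch.allclose(w_demo.grad, 2 * first_grad))  # True: gradients accumulated
w_demo.grad.zero_()  # reset before the next optimization step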

Optimizing Model Parameters

Now that we have a model and data, it's time to train, validate, and test our model by optimizing its parameters on our data. Training a model is an iterative process; in each iteration the model makes a guess about the output, calculates the error in its guess (loss), collects the derivatives of the error with respect to its parameters, and optimizes these parameters using gradient descent.

Hyperparameters are adjustable parameters that let you control the model optimization process. Different hyperparameter values can impact model training and convergence rates.

The following hyperparameters are defined:

  • Number of Epochs - the number of times to iterate over the dataset
  • Batch Size - the number of data samples propagated through the network before the parameters are updated
  • Learning Rate - how much to update model parameters at each batch/epoch. Smaller values yield a slower learning speed, while larger values may result in unpredictable behavior during training.

Once we set our hyperparameters, we can then train and optimize our model with an optimization loop. Each iteration of the optimization loop is called an epoch.

Each Epoch consists of two main parts:

  • The Train Loop: iterate over the training dataset and try to converge to optimal parameters
  • The Validation/Test Loop: iterate over the test dataset to check if model performance is improving

When presented with some training data, our untrained network is likely not to give the correct answer. The loss function measures the degree of dissimilarity between the obtained result and the target value, and it is the loss function that we want to minimize during training. To calculate the loss we make a prediction using the inputs of our given data sample and compare it against the true data label value.

Common loss functions include nn.MSELoss (Mean Square Error) for regression tasks, and nn.NLLLoss (Negative Log Likelihood) for classification. nn.CrossEntropyLoss combines nn.LogSoftmax and nn.NLLLoss.
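
To illustrate that last point, a small sketch with dummy logits and labels (not from the tutorial) showing that nn.CrossEntropyLoss on raw logits matches nn.NLLLoss applied after nn.LogSoftmax:

# Dummy batch of 3 samples with 10 classes; targets are class indices.
demo_logits = torch.randn(3, 10)
demo_targets = torch.tensor([1, 0, 4])
ce = nn.CrossEntropyLoss()(demo_logits, demo_targets)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(demo_logits), demo_targets)
print(torch.isclose(ce, nll))  # tensor(True)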

Optimization is the process of adjusting model parameters to reduce model error in each training step. Optimization algorithms define how this process is performed (in this example we use Stochastic Gradient Descent). All optimization logic is encapsulated in the optimizer object. Here, we use the SGD optimizer; additionally, there are many different optimizers available in PyTorch such as ADAM and RMSProp, that work better for different kinds of models and data.

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

train_dataloader = DataLoader(training_data, batch_size=64)
test_dataloader = DataLoader(test_data, batch_size=64)

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork()
# Hyperparameters
learning_rate = 1e-3
batch_size = 64
epochs = 5

# Initialize the loss function
loss_fn = nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    # Set the model to training mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * batch_size + len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")


def test_loop(dataloader, model, loss_fn):
    # Set the model to evaluation mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices
    model.eval()
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    # Evaluating the model with torch.no_grad() ensures that no gradients are computed during test mode
    # also serves to reduce unnecessary gradient computations and memory usage for tensors with requires_grad=True
    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

epochs = 10
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model, loss_fn, optimizer)
    test_loop(test_dataloader, model, loss_fn)
print("Done!")
out[62]

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to data/FashionMNIST/raw/train-images-idx3-ubyte.gz

100%|██████████| 26.4M/26.4M [00:01<00:00, 20.8MB/s]

Extracting data/FashionMNIST/raw/train-images-idx3-ubyte.gz to data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to data/FashionMNIST/raw/train-labels-idx1-ubyte.gz

100%|██████████| 29.5k/29.5k [00:00<00:00, 350kB/s]

Extracting data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz

100%|██████████| 4.42M/4.42M [00:00<00:00, 6.24MB/s]

Extracting data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz

100%|██████████| 5.15k/5.15k [00:00<00:00, 5.04MB/s]

Extracting data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to data/FashionMNIST/raw

Epoch 1
-------------------------------
loss: 2.296004 [ 64/60000]
loss: 2.277663 [ 6464/60000]
loss: 2.266967 [12864/60000]
loss: 2.263412 [19264/60000]
loss: 2.231681 [25664/60000]
loss: 2.204355 [32064/60000]
loss: 2.215584 [38464/60000]
loss: 2.174187 [44864/60000]
loss: 2.171807 [51264/60000]
loss: 2.150819 [57664/60000]
Test Error:
Accuracy: 47.6%, Avg loss: 2.136066

Epoch 2
-------------------------------
loss: 2.136690 [ 64/60000]
loss: 2.123915 [ 6464/60000]
loss: 2.071374 [12864/60000]
loss: 2.094200 [19264/60000]
loss: 2.028069 [25664/60000]
loss: 1.966176 [32064/60000]
loss: 2.006924 [38464/60000]
loss: 1.916888 [44864/60000]
loss: 1.915738 [51264/60000]
loss: 1.859104 [57664/60000]
Test Error:
Accuracy: 55.3%, Avg loss: 1.847969

Epoch 3
-------------------------------
loss: 1.864607 [ 64/60000]
loss: 1.837882 [ 6464/60000]
loss: 1.724658 [12864/60000]
loss: 1.778657 [19264/60000]
loss: 1.656243 [25664/60000]
loss: 1.606572 [32064/60000]
loss: 1.644772 [38464/60000]
loss: 1.539186 [44864/60000]
loss: 1.563592 [51264/60000]
loss: 1.469504 [57664/60000]
Test Error:
Accuracy: 62.6%, Avg loss: 1.480930

Epoch 4
-------------------------------
loss: 1.531439 [ 64/60000]
loss: 1.504406 [ 6464/60000]
loss: 1.361066 [12864/60000]
loss: 1.441815 [19264/60000]
loss: 1.320272 [25664/60000]
loss: 1.311362 [32064/60000]
loss: 1.338059 [38464/60000]
loss: 1.258208 [44864/60000]
loss: 1.292112 [51264/60000]
loss: 1.199740 [57664/60000]
Test Error:
Accuracy: 64.1%, Avg loss: 1.221259

Epoch 5
-------------------------------
loss: 1.285256 [ 64/60000]
loss: 1.272447 [ 6464/60000]
loss: 1.114443 [12864/60000]
loss: 1.224015 [19264/60000]
loss: 1.104168 [25664/60000]
loss: 1.119272 [32064/60000]
loss: 1.151128 [38464/60000]
loss: 1.083799 [44864/60000]
loss: 1.120555 [51264/60000]
loss: 1.043983 [57664/60000]
Test Error:
Accuracy: 65.0%, Avg loss: 1.060881

Epoch 6
-------------------------------
loss: 1.120127 [ 64/60000]
loss: 1.126718 [ 6464/60000]
loss: 0.953173 [12864/60000]
loss: 1.088465 [19264/60000]
loss: 0.970379 [25664/60000]
loss: 0.990189 [32064/60000]
loss: 1.035908 [38464/60000]
loss: 0.974668 [44864/60000]
loss: 1.009216 [51264/60000]
loss: 0.947818 [57664/60000]
Test Error:
Accuracy: 66.1%, Avg loss: 0.958650

Epoch 7
-------------------------------
loss: 1.006112 [ 64/60000]
loss: 1.033259 [ 6464/60000]
loss: 0.844976 [12864/60000]
loss: 0.999262 [19264/60000]
loss: 0.886032 [25664/60000]
loss: 0.900262 [32064/60000]
loss: 0.960658 [38464/60000]
loss: 0.905108 [44864/60000]
loss: 0.933553 [51264/60000]
loss: 0.884176 [57664/60000]
Test Error:
Accuracy: 67.1%, Avg loss: 0.890012

Epoch 8
-------------------------------
loss: 0.923357 [ 64/60000]
loss: 0.969192 [ 6464/60000]
loss: 0.769062 [12864/60000]
loss: 0.937313 [19264/60000]
loss: 0.829622 [25664/60000]
loss: 0.835161 [32064/60000]
loss: 0.907506 [38464/60000]
loss: 0.859212 [44864/60000]
loss: 0.879564 [51264/60000]
loss: 0.838839 [57664/60000]
Test Error:
Accuracy: 68.5%, Avg loss: 0.841160

Epoch 9
-------------------------------
loss: 0.860379 [ 64/60000]
loss: 0.921618 [ 6464/60000]
loss: 0.712957 [12864/60000]
loss: 0.892126 [19264/60000]
loss: 0.789554 [25664/60000]
loss: 0.786586 [32064/60000]
loss: 0.867247 [38464/60000]
loss: 0.827443 [44864/60000]
loss: 0.839287 [51264/60000]
loss: 0.804475 [57664/60000]
Test Error:
Accuracy: 69.6%, Avg loss: 0.804354

Epoch 10
-------------------------------
loss: 0.810273 [ 64/60000]
loss: 0.883598 [ 6464/60000]
loss: 0.669499 [12864/60000]
loss: 0.857720 [19264/60000]
loss: 0.759117 [25664/60000]
loss: 0.749155 [32064/60000]
loss: 0.834741 [38464/60000]
loss: 0.803910 [44864/60000]
loss: 0.807586 [51264/60000]
loss: 0.776884 [57664/60000]
Test Error:
Accuracy: 71.0%, Avg loss: 0.774985

Done!
