Deep Learning with Python - Chapters 5 and 6
The "Fundamentals of Machine Learning" and "Universal Workflow of Machine Learning" chapters cover many things to keep in mind during a machine learning project, both when building the model and when preparing data and deploying the model.
Fundamentals of Machine Learning
The central problem of machine learning: overfitting. This chapter will formalize some of your new intuition about machine learning into a solid conceptual framework, highlighting the importance of accurate model evaluation and the balance between training and generalization.
Generalization: The Goal of Machine Learning
The fundamental issue in machine learning is the tension between optimization and generalization. Optimization refers to the process of adjusting a model to get the best performance possible on the training data (the learning in machine learning) whereas generalization refers to how well the trained model performs on data it has never seen before.
At the beginning of training, optimization and generalization are correlated: the lower the loss on training data, the lower the loss on test data. While this is happening, the model is said to be underfit: there is still progress to be made; the network hasn't yet modeled all relevant patterns in the training data. After some iterations on the training data, generalization stops improving, and validation metrics stall and then begin to degrade: the model is starting to overfit. Overfitting is particularly likely to occur when your data is noisy, when it involves uncertainty, or when it includes rare features. If a model goes out of its way to incorporate severe outliers, its generalization performance will degrade. A model could also overfit probabilistic data by being too confident about ambiguous regions of the feature space.
"""
Adding White Noise Channels or All-Zeros Channels to MNIST
"""
from tensorflow.keras.datasets import mnist
import numpy as np
(train_images, train_labels), _ = mnist.load_data()
train_images = train_images.reshape((60_000, 28*28))
train_images = train_images.astype('float32') / 255
train_images_with_noise_channels = np.concatenate(
[train_images, np.random.random((len(train_images), 784))],
axis=1
)
train_images_with_zeros_channels = np.concatenate(
[train_images, np.zeros((len(train_images), 784))],
axis=1
)
from tensorflow import keras
from tensorflow.keras import layers
def get_model():
    model = keras.Sequential([
        layers.Dense(512, activation="relu"),
        layers.Dense(10, activation="softmax")
    ])
    model.compile(optimizer="rmsprop",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
model = get_model()
history_noise = model.fit(
train_images_with_noise_channels,
train_labels,
epochs=10,
batch_size=128,
validation_split=0.2
)
model = get_model()
history_zeros = model.fit(
train_images_with_zeros_channels,
train_labels,
epochs=10,
batch_size=128,
validation_split=0.2
)
"""
Compare how validation accuracy of each model evolves over time
"""
import matplotlib.pyplot as plt
val_acc_noise = history_noise.history["val_accuracy"]
val_acc_zeros = history_zeros.history["val_accuracy"]
epochs = range(1, 11)
plt.plot(epochs, val_acc_noise, "b-",
label="Validation accuracy with noise channels")
plt.plot(epochs, val_acc_zeros, "b--",
label="Validation accuracy with zeros channels")
plt.title("Effect of noise channels on validation accuracy")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()
As seen in the graph above, despite the data holding the same information in both cases, the validation accuracy of the model trained with noise channels ends up about one percentage point lower, purely through the influence of spurious correlations. Noise features inevitably lead to overfitting. As such, in cases where you aren't sure whether the features you have are informative or distracting, it's common to do feature selection before training.
The typical way to do feature selection is to compute some usefulness score for each feature available—a measure of how informative the feature is with respect to the task, such as the mutual information between the feature and the labels—and only keep features that are above some threshold.
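As a rough illustration of the scoring-and-thresholding idea above, here is a minimal sketch for discrete features. The function names (`mutual_information`, `select_features`), the toy data, and the threshold of 0.1 are all assumptions made for this example, not something from the book.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    feature_counts = Counter(feature)
    label_counts = Counter(labels)
    mi = 0.0
    for (f, y), count in joint.items():
        p_joint = count / n
        p_f = feature_counts[f] / n
        p_y = label_counts[y] / n
        mi += p_joint * math.log(p_joint / (p_f * p_y))
    return mi


def select_features(columns, labels, threshold):
    """Score every column and keep the names whose score exceeds the threshold."""
    scores = {name: mutual_information(values, labels)
              for name, values in columns.items()}
    kept = [name for name, score in scores.items() if score > threshold]
    return kept, scores


# Toy data: "informative" copies the label, "noise" is unrelated to it.
toy_labels = [0, 1, 0, 1, 0, 1, 0, 1]
toy_columns = {
    "informative": [0, 1, 0, 1, 0, 1, 0, 1],
    "noise": [0, 0, 1, 1, 0, 0, 1, 1],
}
kept, scores = select_features(toy_columns, toy_labels, threshold=0.1)
print(kept)  # ['informative']
```

Continuous features (like pixel intensities) would need to be binned first, or scored with an estimator designed for continuous data.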
A "manifold" is a lower-dimensional subspace of some parent space that is locally similar to a linear (Euclidean) space. For instance, a smooth curve in the plane is a 1D manifold within 2D space, because for every point of the curve, you can draw a tangent (the curve can be approximated by a line at every point). The manifold hypothesis posits that all natural data lies on a low-dimensional manifold within the high-dimensional space where it is encoded. That's a strong statement about the structure of information, and as far as we know, it's accurate and the reason why deep learning works.
The manifold hypothesis implies that:
- Machine learning models only have to fit relatively simple, low-dimensional, highly structured subspaces within their potential input space (latent manifolds)
- Within one of these manifolds, it's always possible to interpolate between two inputs, that is to say, morph one into another via a continuous path along which all points fall on the manifold.
The ability to interpolate between samples is the key to understanding generalization in deep learning. While deep learning achieves generalization via interpolation on a learned approximation of the data manifold, it would be a mistake to assume that interpolation is all there is to generalization. It is the tip of the iceberg. Interpolation helps you make sense of things that are very close to what you've seen before: it enables local generalization. Humans are capable of extreme generalization, which is enabled by cognitive mechanisms other than interpolation: abstraction, symbolic models of the world, reasoning, logic, common sense, innate priors about the world - what we call reason, as opposed to intuition and pattern recognition. The latter are largely interpolative in nature, but the former isn't. Both are essential to intelligence.
Properties of DL models that make them well-suited to learning latent manifolds:
- Deep learning models implement a smooth, continuous mapping from their inputs to their outputs. It has to be smooth and continuous because it must be differentiable, by necessity. This smoothness helps approximate latent manifolds, which have the same properties.
- Deep learning models tend to be structured in a way that mirrors the "shape" of the information in their training data. This is particularly the case for image-processing models and sequence-processing models. More generally, deep neural networks structure their learned representations in a hierarchical and modular way, which echoes the way natural data is organized.
Data curation and feature engineering are essential to generalization. Because deep learning is curve fitting, for a model to perform well it needs to be trained on a dense sampling of its input space.
You should always keep in mind that the best way to improve a deep learning model is to train it on more data or better data. A denser coverage of the input data manifold will yield a model that generalizes better. You should never expect a deep learning model to perform anything more than crude interpolation between its training samples, and thus you should do everything to make interpolation as easy as possible. The only thing you will find in a deep learning model is what you put into it: the priors encoded in its architecture and the data it was trained on. The process of fighting overfitting by only focusing on more prominent (regular) patterns is called regularization.
Evaluating Machine Learning Models
Evaluating a model always boils down to splitting the available data into three sets: training, validation, and test. The reason for evaluating the model on the validation set is that developing a model always involves tuning its configuration: for example, choosing the number of layers or the size of the layers (these are called the hyperparameters of the model, to distinguish them from the parameters, which are the network's weights). You do this tuning by using the model's performance on the validation data as a feedback signal. This tuning is a form of learning: a search for a good configuration in some parameter space. As a result, tuning the configuration based on validation performance can result in overfitting to the validation set, even though the model is never directly trained on it. Central to this phenomenon is the notion of information leaks: every time you tune a hyperparameter based on the model's performance on the validation set, some information about the validation data leaks into the model.
Methods for splitting your data into training, validation, and test sets:
- Simple Holdout Validation - Set apart some fraction of your data as the validation and test sets. Train on the remaining data, tune hyperparameters on the validation set, and evaluate on the test set. If little data is available, your validation and test sets may contain too few samples to be statistically representative of the data at hand.
- K-Fold Validation - Split your data into K partitions of equal size. For each partition i, train a model on the remaining K-1 partitions and evaluate it on partition i. The final score is then the average of the K scores obtained.
- Iterated K-Fold Validation with Shuffling - This is for situations where little data is available and you need to evaluate your model as precisely as possible. Chollet said it has been extremely helpful in Kaggle competitions. It consists of applying K-fold validation multiple times, shuffling the data every time before splitting it K ways. The final score is the average of the scores obtained at each run of K-fold validation.
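The K-fold and iterated K-fold procedures above can be sketched in plain Python. This is only an outline under some assumptions: `train_and_score` is a hypothetical callable standing in for "build a fresh model, fit it on the training fold, and return its validation score", and the toy `mean_predictor_score` exists only so the example runs without a real model.

```python
import random


def k_fold_indices(num_samples, k):
    """Yield (train_indices, val_indices) for each of the k folds."""
    indices = list(range(num_samples))
    fold_size = num_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any leftover samples.
        end = num_samples if fold == k - 1 else start + fold_size
        yield indices[:start] + indices[end:], indices[start:end]


def k_fold_score(data, k, train_and_score):
    """Average the validation score over the k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(data), k):
        train = [data[i] for i in train_idx]
        val = [data[i] for i in val_idx]
        scores.append(train_and_score(train, val))
    return sum(scores) / len(scores)


def iterated_k_fold_score(data, k, iterations, train_and_score, seed=0):
    """Repeat K-fold several times, reshuffling the data before each run."""
    rng = random.Random(seed)
    data = list(data)
    run_scores = []
    for _ in range(iterations):
        rng.shuffle(data)
        run_scores.append(k_fold_score(data, k, train_and_score))
    return sum(run_scores) / len(run_scores)


def mean_predictor_score(train, val):
    """Toy stand-in model: predict the training mean, score by negative MAE."""
    prediction = sum(train) / len(train)
    return -sum(abs(v - prediction) for v in val) / len(val)


toy_data = [float(i) for i in range(20)]
print(k_fold_score(toy_data, k=4, train_and_score=mean_predictor_score))
print(iterated_k_fold_score(toy_data, k=4, iterations=3,
                            train_and_score=mean_predictor_score))
```

Note that a new model has to be trained from scratch for every fold, so iterated K-fold with P iterations costs P * K training runs.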
Before you start working on a dataset, you should try and pick a trivial baseline that you will try to beat. If you cross that threshold, you'll know you're doing something right: your model is actually using the information in the input data to make predictions that generalize, and you can keep going.
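For a classification task, one simple baseline is to always predict the most frequent class. The sketch below assumes that choice of baseline; the function name and the toy labels are made up for illustration.

```python
from collections import Counter


def majority_class_baseline(train_labels, eval_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in eval_labels if label == majority_label)
    return majority_label, correct / len(eval_labels)


toy_train_labels = [0, 0, 0, 1, 0, 1, 0, 0]
toy_val_labels = [0, 1, 0, 0]
label, baseline_accuracy = majority_class_baseline(toy_train_labels, toy_val_labels)
print(label, baseline_accuracy)  # 0 0.75
```

A model that cannot beat this number on the validation set is not yet extracting anything useful from its inputs.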
Things to keep in mind about model evaluation:
- Data representativeness: you want both the training set and the test set to be representative of the data at hand, with class frequencies matching those seen in the whole dataset. You should usually randomly shuffle your data before splitting it into training and test sets.
- The arrow of time: if you are predicting time series data (predicting the future given the past), you should not shuffle your data before splitting, because this will create a temporal leak. Make sure all data in the test set is posterior to the data in the training set.
- Redundancy in your data: Make sure that your training set and validation set are disjoint sets.
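The three points above translate into three small helpers. This is a sketch with hypothetical function names; it assumes samples are hashable and, for the chronological split, already sorted by time.

```python
import random


def deduplicate(samples):
    """Drop repeated samples so the same point cannot land in both sets."""
    seen = set()
    unique = []
    for sample in samples:
        if sample not in seen:
            seen.add(sample)
            unique.append(sample)
    return unique


def shuffled_split(samples, val_fraction, seed=0):
    """Shuffle first, then hold out the final val_fraction of the samples."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    split = len(shuffled) - int(len(shuffled) * val_fraction)
    return shuffled[:split], shuffled[split:]


def chronological_split(samples, val_fraction):
    """Keep time order: train on the past, validate on the most recent data."""
    split = len(samples) - int(len(samples) * val_fraction)
    return list(samples[:split]), list(samples[split:])


records = deduplicate(["a", "b", "a", "c", "d", "b", "e", "f"])
train, val = shuffled_split(records, val_fraction=0.5)
print(set(train).isdisjoint(val))  # True

series = list(range(10))  # stand-in for time-ordered observations
past, future = chronological_split(series, val_fraction=0.2)
print(past, future)  # [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]
```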
Improving Model Fit
- When training doesn't get started, your model stalls too early, or your loss is stuck, there is a problem with the configuration of the gradient descent process: your choice of optimizer, the distribution of initial values in the weights of your model, your learning rate, or your batch size. All these parameters are interdependent, and as such it is usually sufficient to tune the learning rate and the batch size while keeping the rest of the parameters constant.
- You have a model that fits, but for some reason, your validation metrics aren't improving at all. This indicates that something is fundamentally wrong with your approach. Using a model that makes the right assumptions about the problem is essential to achieve generalization: you should leverage the right architecture priors.
- Remember that it should always be possible to overfit. If you can't seem to overfit, it's likely a problem with the representational power of your model: you're going to need a bigger model, one with more capacity, that is to say, one able to store more information. You can increase representational power by adding more layers, using bigger layers (layers with more parameters), or using kinds of layers that are more appropriate for the problem at hand.
Improving Generalization
Once you are able to overfit, it is time to focus on maximizing generalization. Spending more effort and money on data collection almost always yields a much greater return on investment than spending the same on developing a better model. A particularly important way to improve the generalization potential of your data is feature engineering. Feature engineering is the process of using your own knowledge about the data and about the machine learning algorithm at hand to make the algorithm work better by applying hardcoded transformations to the data before it goes into the model. Even though modern deep learning removes the need for most feature engineering, you still have to think about feature engineering when using deep neural networks:
- Good features still allow you to solve problems more elegantly while using fewer resources.
- Good features let you solve a problem with far less data.
In deep learning, we always use models that are vastly overparameterized: they have way more degrees of freedom than the minimum necessary to fit to the latent manifold of the data. This overparameterization is not an issue, because you never fully fit a deep learning model.
Finding the exact point during training where you've reached the most generalizable fit - the exact boundary between an underfit curve and an overfit curve - is one of the most effective things you can do to improve generalization. This can be done with early stopping.
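The stopping rule itself is simple enough to sketch in plain Python. The function below is a hypothetical helper that replays a list of per-epoch validation losses. In Keras, the `EarlyStopping` callback in `keras.callbacks` plays this role during `fit()`, and the `patience` and `min_delta` names here are borrowed from it.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Ran out of epochs without triggering the rule.
    return len(val_losses), best_epoch


# Made-up validation losses: improving until epoch 4, then degrading.
toy_val_losses = [0.60, 0.45, 0.38, 0.35, 0.36, 0.39, 0.43]
stop_epoch, best_epoch = early_stopping_epoch(toy_val_losses, patience=2)
print(stop_epoch, best_epoch)  # 6 4
```

Because training stops a few epochs after the best one, you would normally keep the weights saved at `best_epoch` rather than the final ones.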
Regularization techniques are a set of best practices that actively impede the model's ability to fit perfectly to the training data, with the goal of making the model perform better during validation. The simplest one is to reduce the size of the model: use few enough parameters to avoid overfitting, but enough parameters to prevent underfitting. There is a compromise to be found between too much capacity and not enough capacity. You'll know your model is too large if it starts overfitting right away and if its validation loss curve looks choppy, with high variance.
from tensorflow.keras.datasets import imdb
(train_data, train_labels), _ = imdb.load_data(num_words=10_000)
def vectorize_sequences(sequences, dimension=10_000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results
train_data = vectorize_sequences(train_data)
model = keras.Sequential([
layers.Dense(16,activation="relu"),
layers.Dense(16,activation="relu"),
layers.Dense(1, activation="sigmoid")
])
model.compile(optimizer="rmsprop",loss="binary_crossentropy",metrics=["accuracy"])
history_original = model.fit(train_data, train_labels, epochs=20, batch_size=512, validation_split=0.4)
"""
Version of the Model with low Capacity
"""
model = keras.Sequential([
layers.Dense(4, activation="relu"),
layers.Dense(4, activation="relu"),
layers.Dense(1, activation="sigmoid")
])
model.compile(optimizer="rmsprop",
loss="binary_crossentropy",
metrics=["accuracy"])
history_smaller_model = model.fit(
train_data, train_labels,
epochs=20, batch_size=512, validation_split=0.4)
"""
Version of the model with Higher Capacity
"""
model = keras.Sequential([
layers.Dense(512, activation="relu"),
layers.Dense(512, activation="relu"),
layers.Dense(1, activation="sigmoid")
])
model.compile(optimizer="rmsprop",
loss="binary_crossentropy",
metrics=["accuracy"])
history_larger_model = model.fit(
train_data, train_labels,
epochs=20, batch_size=512, validation_split=0.4)
import matplotlib.pyplot as plt
def plot_loss_curves(ax, history, title):
    # Plot training and validation loss per epoch on the given axes.
    val_loss = history.history["val_loss"]
    train_loss = history.history["loss"]
    epochs = range(1, len(val_loss) + 1)
    ax.plot(epochs, train_loss, "b-", label="Train Loss")
    ax.plot(epochs, val_loss, "b--", label="Validation Loss")
    ax.set_xlabel("Epochs")
    ax.set_ylabel("Loss")
    ax.set_title(title)
    ax.legend()
    return ax

fig, axes = plt.subplots(1, 3, layout="constrained", figsize=(12, 4))
runs = [
    (history_original, "Original Model"),
    (history_smaller_model, "Model with Low Capacity"),
    (history_larger_model, "Model with High Capacity"),
]
for ax, (history, title) in zip(axes, runs):
    plot_loss_curves(ax, history, title)
plt.show()
As can be seen in the graphs above, the model with the smaller capacity starts to overfit later than the original model, and the model with too much capacity starts overfitting right away and its validation curve looks choppy with high variance - which is characteristic of models with too much capacity. (The choppy loss curve could also be caused by an unreliable validation process).
A common way to mitigate overfitting is to put constraints on the complexity of a model by forcing its weights to take only small values, which makes the distribution of weight values more regular. This is called weight regularization, and it's done by adding to the loss function of the model a cost associated with having large weights:
- L1 Regularization: The cost added is proportional to the absolute value of the weight coefficients (the L1 norm of the weights).
- L2 Regularization: The cost added is proportional to the square of the value of the weight coefficients (the squared L2 norm of the weights). L2 regularization is also called weight decay in the context of neural networks.
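To make the two penalties concrete, here is a small NumPy sketch of the extra cost that gets added to the loss. The weight matrix and the `data_loss` value are made up; the factors play the same role as the number passed to a Keras regularizer.

```python
import numpy as np


def l1_penalty(weights, factor):
    """factor * sum(|w|): the L1 cost added to the loss."""
    return factor * np.sum(np.abs(weights))


def l2_penalty(weights, factor):
    """factor * sum(w^2): the L2 cost added to the loss."""
    return factor * np.sum(np.square(weights))


toy_weights = np.array([[0.5, -1.0], [2.0, 0.0]])
data_loss = 0.30  # hypothetical value of the unregularized loss

total_with_l1 = data_loss + l1_penalty(toy_weights, 0.001)  # adds 0.001 * 3.5
total_with_l2 = data_loss + l2_penalty(toy_weights, 0.002)  # adds 0.002 * 5.25
print(total_with_l1, total_with_l2)
```

Because the penalty is only added during training, the loss of a regularized model will be higher at training time than at test time.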
In Keras, weight regularization is added by passing weight regularizer instances to layers as keyword arguments.
from tensorflow.keras import regularizers
model = keras.Sequential([
layers.Dense(16, kernel_regularizer=regularizers.l2(0.002), activation="relu"),
layers.Dense(16, kernel_regularizer=regularizers.l2(0.002), activation="relu"),
layers.Dense(1, activation="sigmoid")
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
history_l2_reg = model.fit(train_data, train_labels,epochs=20, batch_size=512, validation_split=0.4)
fig, ax = plt.subplots(layout="constrained")
plot_loss_curves(ax,history_l2_reg,"Model with L2 Regularization")
plt.show()
Keras has other regularizers besides l2:
from tensorflow.keras import regularizers
regularizers.l1(0.001)
regularizers.l1_l2(l1=0.001, l2=0.001)
Dropout is one of the most effective and most commonly used regularization techniques for neural networks. Dropout, applied to a layer, consists of randomly dropping out (setting to zero) a number of output features of the layer during training. The dropout rate is the fraction of the features that are zeroed out; it is usually set between 0.2 and 0.5. The core idea is that introducing noise in the output values of a layer can break up happenstance patterns that aren't significant, which the model will start memorizing if no noise is present.
"""
Adding Dropout to the IMDB Model
"""
model = keras.Sequential([
layers.Dense(16, activation="relu"),
layers.Dropout(0.5),
layers.Dense(16, activation="relu"),
layers.Dropout(0.5),
layers.Dense(1, activation="sigmoid")
])
model.compile(optimizer="rmsprop",
loss="binary_crossentropy",
metrics=["accuracy"])
history_dropout = model.fit(
train_data, train_labels,
epochs=20, batch_size=512, validation_split=0.4)
fig, ax = plt.subplots(1,layout="constrained")
plot_loss_curves(ax,history_dropout,"Model with Dropout")
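What a dropout layer does to its input can be sketched in a few lines of NumPy. This is a hand-rolled illustration, not the Keras implementation. It assumes the common "inverted dropout" convention, where the surviving activations are scaled up by 1 / (1 - rate) during training so that nothing needs to change at inference time.

```python
import numpy as np


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of entries, rescale the rest."""
    if not training or rate == 0.0:
        return activations  # dropout is a no-op at inference time
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)


rng = np.random.default_rng(0)
layer_output = np.ones((1000, 16))  # stand-in for a Dense layer's output
dropped = dropout(layer_output, rate=0.5, rng=rng)

print((dropped == 0).mean())  # close to 0.5
print(dropped.mean())         # close to 1.0, thanks to the rescaling
```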
The Universal Workflow of Machine Learning
The universal workflow of machine learning is broadly structured in three parts:
- Define the task - understand the problem domain and the business logic underlying what the customer asked for. Collect a dataset, understand what the data represents, and choose how you will measure success on the task.
- Develop a model - Prepare your data so that it can be processed by a machine learning model, select a model evaluation protocol and a simple baseline to beat, train a first model that has generalization power and that can overfit, and then regularize and tune your model until you achieve the best possible generalization performance.
- Deploy the model - Present work, ship to web server, and monitor the model's performance in the wild.
Most of the time, collecting more and better-quality data is a better use of time than trying to improve the model. It's critical that the data used for training be representative of the production data. Concept drift occurs when the properties of production data change over time, causing model accuracy to gradually decay. Dealing with fast concept drift requires constant data collection, annotation, and model retraining. Sampling bias occurs when your data collection process interacts with what you are trying to predict, resulting in biased measurements.
For balanced classification problems, where every class is equally likely, accuracy and the area under a receiver operating characteristic (ROC) curve are common metrics. For class-imbalanced problems, you can use precision and recall. The hardest things in machine learning are framing problems and collecting, annotating, and cleaning data.
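A quick worked example shows why accuracy alone can mislead on imbalanced data. The labels below are made up, and `precision_recall` is a hypothetical helper written for this illustration.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Imbalanced toy labels: only 2 positives out of 10.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_model = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
y_always_negative = [0] * 10

print(accuracy(y_true, y_model), precision_recall(y_true, y_model))
# 0.8 (0.5, 0.5)
print(accuracy(y_true, y_always_negative), precision_recall(y_true, y_always_negative))
# 0.8 (0.0, 0.0)
```

Both predictors reach 80% accuracy, but only precision and recall reveal that the second one never finds a positive.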
All inputs and targets in a neural network must typically be tensors of floating-point data (or, in specific cases, tensors of integers and strings). Whatever data you need to process - sound, images, text - you must turn into tensors, a step called data vectorization. To make learning easier for your network, your data should have the following characteristics:
- Take small values: most values should be in the 0-1 range
- Be homogeneous: All features should take values in roughly the same range.
The following stricter normalization process is common and can help, although it's not always necessary:
- Normalize each feature independently to have a mean of 0
- Normalize each feature independently to have a standard deviation of 1
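Feature-wise normalization is a couple of lines in NumPy. The data below is randomly generated for illustration. The one design choice worth noting is that the mean and standard deviation are computed on the training data only and then reused for the test data, so that no test information leaks into preprocessing.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 3 features living on very different scales.
x_train = rng.normal(loc=[0.0, 50.0, -3.0], scale=[1.0, 20.0, 0.1], size=(100, 3))
x_test = rng.normal(loc=[0.0, 50.0, -3.0], scale=[1.0, 20.0, 0.1], size=(20, 3))

# Per-feature statistics, computed along the sample axis of the training set.
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)

x_train_norm = (x_train - mean) / std
x_test_norm = (x_test - mean) / std  # reuse the training statistics

print(x_train_norm.mean(axis=0))  # approximately 0 for every feature
print(x_train_norm.std(axis=0))   # approximately 1 for every feature
```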
Handling missing values:
- If the feature is categorical, it's safe to create a new category that means "the value is missing". The model will automatically learn what this implies wrt the targets.
- If the feature is numerical, impute the missing value with the median or average of that feature, or use a model-based imputer such as scikit-learn's KNNImputer.
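Both strategies can be sketched briefly. The helper names and toy data below are assumptions for illustration; missing numerical values are assumed to be encoded as NaN and missing categorical values as `None`. The KNN-based variant is not shown.

```python
import numpy as np


def fill_missing_category(values, token="missing"):
    """Categorical feature: treat 'missing' as its own category."""
    return [token if v is None else v for v in values]


def impute_median(x_train, x_other):
    """Numerical features: replace NaNs with the per-feature training median."""
    medians = np.nanmedian(x_train, axis=0)

    def fill(x):
        filled = x.copy()
        rows, cols = np.where(np.isnan(filled))
        filled[rows, cols] = medians[cols]
        return filled

    return fill(x_train), fill(x_other)


colors = fill_missing_category(["red", None, "blue"])
print(colors)  # ['red', 'missing', 'blue']

toy_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [5.0, 40.0]])
toy_test = np.array([[np.nan, np.nan]])
train_filled, test_filled = impute_median(toy_train, toy_test)
print(test_filled)  # [[ 3. 20.]]
```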
The most common way to turn a model into a product is to install TensorFlow on a server or cloud instance, and query the model's predictions via a REST API. Optimizing a model for inference matters when deploying in an environment with strict constraints on available power and memory, or for applications with latency requirements:
- Weight pruning: not every coefficient in a weight tensor contributes equally to the predictions. It's possible to considerably lower the number of parameters in your model by only keeping the most significant ones.
- Weight quantization: Deep learning models are trained with single-precision floating point (float32) weights. However, it's possible to quantize weights to 8-bit signed integers (int8) to get an inference-only model that's a quarter of the size.
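The two ideas above can be illustrated on a single weight matrix with NumPy. This is a deliberately simplified sketch: it assumes magnitude-based pruning and symmetric per-tensor quantization, both chosen here for illustration, and the function names are hypothetical. Real deployment toolchains handle many details this ignores.

```python
import numpy as np


def prune_by_magnitude(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    num_pruned = int(weights.size * sparsity)
    if num_pruned == 0:
        return weights.copy()
    threshold = np.sort(np.abs(weights), axis=None)[num_pruned - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights).astype(weights.dtype)


def quantize_int8(weights):
    """Symmetric linear quantization of float32 weights to int8."""
    scale = np.max(np.abs(weights)) / 127.0
    if scale == 0:
        scale = 1.0  # all-zero tensor: any scale works
    quantized = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 values back to approximate float32 weights."""
    return quantized.astype(np.float32) * scale


rng = np.random.default_rng(0)
toy_weights = rng.normal(size=(256, 128)).astype(np.float32)

pruned = prune_by_magnitude(toy_weights, sparsity=0.5)
print(np.mean(pruned == 0))  # about half of the weights are now zero

quantized, scale = quantize_int8(toy_weights)
print(toy_weights.nbytes // quantized.nbytes)  # 4
print(np.max(np.abs(dequantize(quantized, scale) - toy_weights)))  # rounding error
```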