Hands On Machine Learning Chapter 8 - Dimensionality Reduction
I am going to re-read Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow because I don't feel I got a good grasp of machine learning the first time through, and I skipped neural networks entirely on that first read.
Dimensionality Reduction
Many machine learning problems involve thousands or even millions of features for each training instance. Not only does this make training extremely slow, it can also make it much harder to find a good solution. This problem is often referred to as the curse of dimensionality. Fortunately, in real-world problems, it is often possible to reduce the number of features considerably, turning an intractable problem into a tractable one. Reducing dimensionality does lose some information, so even though it will speed up training, it may also make your system perform slightly worse. It also makes your pipelines more complex and difficult to maintain. In some cases, however, reducing the dimensionality of the training data may filter out some noise and unnecessary details and thus result in higher performance. Apart from speeding up training, dimensionality reduction is also extremely useful for data visualization (DataViz). Reducing the number of dimensions down to two or three makes it possible to plot a condensed view of a high-dimensional training set on a graph and often gain some important insights by visually detecting patterns, such as clusters.
The Curse of Dimensionality
Dimensions higher than 3D are hard for us to imagine.
In high-dimensional space, it is very likely that a random point will be "extreme" along at least one dimension (close to the border), whereas in low-dimensional space this is much less likely. High-dimensional datasets are also at risk of being very sparse: most training instances are likely to be very far away from each other. This means that a new instance will likely be far away from any training instance, making predictions much less reliable than in lower dimensions, since they will be based on much larger extrapolations. In short, the more dimensions the training set has, the greater the risk of overfitting it.
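A quick sketch (my own, not from the book) that illustrates the sparsity claim: the average distance between two random points in a unit hypercube grows as the number of dimensions grows.
import numpy as np

rng = np.random.default_rng(42)
for n_dims in (2, 3, 100, 1_000, 10_000):
    a = rng.random((1000, n_dims))               # 1,000 random points
    b = rng.random((1000, n_dims))               # 1,000 more random points
    dists = np.linalg.norm(a - b, axis=1)        # pairwise distances
    print(f"{n_dims:>6} dims: mean distance ~= {dists.mean():.2f}")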
Main Approaches to Dimensionality Reduction
Projection
In most real-world problems, training instances are not spread out uniformly across all dimensions. Many features are almost constant, while others are highly correlated. As a result, all training instances actually lie within (or close to) a much lower-dimensional subspace of the high-dimensional space.
In the image above, you can see that the training instances lie close to a plane: this is a lower-dimensional (2D) subspace of the high-dimensional (3D) space. Now, if we project every training instance perpendicularly onto this subspace, we get the new 2D dataset shown below:
Projection is not always the best approach to dimensionality reduction. In many cases, the subspace may twist and turn, such as in the famous Swiss roll toy dataset:
Simply projecting onto a plane (by dropping $x_3$) would squash different layers of the Swiss roll together. What you really want is to unroll the Swiss roll to obtain the 2D dataset on the right of the image below.
Manifold Learning
The Swiss roll is an example of a 2D manifold. A 2D manifold is a 2D shape that can be bent and twisted in a higher-dimensional space. More generally, a d-dimensional manifold is a part of an n-dimensional space (where d<n) that locally resembles a d-dimensional hyperplane. In the case of the Swiss roll, d=2 and n=3: it locally resembles a 2D plane, but it is rolled in the third dimension.
Many dimensionality reduction algorithms work by modeling the manifold on which the training instances lie; this is called Manifold Learning. It relies on the manifold assumption, also called the manifold hypothesis, which holds that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold. This assumption is very often empirically observed. The manifold assumption is often accompanied by another implicit assumption: that the task at hand (e.g., classification or regression) will be simpler if expressed in the lower-dimensional space of the manifold. However, this implicit assumption does not always hold. In the image below, the decision boundary looks very simple in the original 3D space (a vertical plane), but it looks more complex in the unrolled manifold (a collection of four independent line segments). In short, if you reduce the dimensionality of your training set before training a model, it will usually speed up training, but it may not always lead to a better or simpler solution; it all depends on the dataset.
PCA
Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm. First, it identifies the hyperplane that lies closest to the data, and then it projects the data onto it.
Preserving the Variance
Before you can project the training set onto a lower-dimensional hyperplane, you first need to choose the right hyperplane. The image below shows an example of projecting a 2D dataset onto three different axes. The projection onto the solid line preserves the maximum amount of variance. It seems reasonable to select the axis that preserves the maximum amount of variance, as it will most likely lose less information than the other projections. Another way to justify this choice is that it is the axis that minimizes the mean squared distance between the original dataset and its projection onto that axis. This is the simple idea behind PCA.
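A small toy sketch (my own, not from the book): project a centered 2D dataset onto many candidate unit vectors and check which axis preserves the most variance. The names and data here are purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.5], [0.0, 0.3]])  # correlated toy data
X_toy -= X_toy.mean(axis=0)                                             # center the data

angles = np.linspace(0, np.pi, 180, endpoint=False)
axes = np.column_stack([np.cos(angles), np.sin(angles)])   # candidate unit vectors
variances = ((X_toy @ axes.T) ** 2).mean(axis=0)           # variance of each 1D projection
best_axis = axes[np.argmax(variances)]                     # ~ the first principal component
print(best_axis)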
Principal Components
PCA identifies the axis that accounts for the largest amount of variance in the training set. It also finds a second axis, orthogonal to the first one, that accounts for the largest amount of the remaining variance. In higher dimensions, PCA would also find a third axis, orthogonal to both previous axes, then a fourth, a fifth, and so on - as many axes as the number of dimensions in the dataset. The unit vector that defines the ith axis is called the ith principal component (PC). The direction of the principal components is not stable: if the training data changes slightly, some PCs may point in the opposite direction of the original PCs. They will generally still lie on the same axes, though. How can you find the principal components? There is a standard matrix factorization technique called Singular Value Decomposition (SVD) that decomposes the training set matrix $\textbf{X}$ into the matrix multiplication of three matrices $\textbf{U} \boldsymbol{\Sigma} \textbf{V}^T$, where $\textbf{V}$ contains all the principal components that we are looking for. You can use NumPy's svd() function to obtain all the principal components of the training set. Note that PCA assumes the dataset is centered around the origin.
import numpy as np

X_centered = X - X.mean(axis=0)       # PCA assumes the dataset is centered
U, s, Vt = np.linalg.svd(X_centered)  # rows of Vt are the principal components
c1 = Vt.T[:, 0]                       # first principal component
c2 = Vt.T[:, 1]                       # second principal component
Projecting Down to d Dimensions
Once you have identified all the principal components, you can reduce the dimensionality of the dataset down to d dimensions by projecting it onto the hyperplane defined by the first d principal components. Selecting this hyperplane ensures that the projection will preserve as much variance as possible. To project the training set onto the hyperplane, simply compute the matrix multiplication of the training set matrix $\textbf{X}$ by the matrix $\textbf{W}_d$, defined as the matrix containing the first d principal components (the matrix composed of the first d columns of $\textbf{V}$):
Projecting the Training Set Down to d Dimensions: $\textbf{X}_{d\text{-proj}} = \textbf{X} \textbf{W}_d$
The following Python code projects the training set onto the plane defined by the first two principal components:
W2 = Vt.T[:, :2]          # matrix of the first two principal components
X2D = X_centered.dot(W2)  # project onto the 2D hyperplane
Using Scikit-Learn
Scikit-Learn's PCA class implements PCA using SVD decomposition, just as we did before. The code below applies PCA to reduce the dimensionality of the dataset down to two dimensions (note that it automatically takes care of centering the data):
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X2D = pca.fit_transform(X)
After fitting the PCA transformer to the dataset, you can access the principal components using the components_ variable.
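For example, assuming the pca object fitted above, components_ holds one principal component per row, so the first PC is:
c1 = pca.components_.T[:, 0]   # unit vector of the first principal component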
Explained Variance Ratio
Another useful piece of information is the explained variance ratio of each principal component, available via the explained_variance_ratio_ variable. It indicates the proportion of the dataset's variance that lies along the axis of each principal component. Example:
pca.explained_variance_ratio_
array([0.84248607, 0.14631839])
This tells you that 84.2% of the variance lies along the first axis and 14.6% along the second axis. This leads you to believe that the third axis (the remaining ~1.2%) probably carries little information.
Choosing the Right Number of Dimensions
It is generally preferable to choose the number of dimensions that add up to a sufficiently large portion of the variance (e.g., 95%), unless you are reducing dimensionality for data visualization - in that case you generally want to reduce down to 2 or 3 dimensions. The code below computes PCA without reducing dimensionality, then computes the minimum number of dimensions required to preserve 95% of the training set's variance:
pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)  # cumulative explained variance
d = np.argmax(cumsum >= 0.95) + 1                   # smallest d reaching the 95% threshold
You could then set n_components=d and run PCA again. However, there is a much better option: you can set n_components to a float between 0.0 and 1.0, indicating the ratio of variance you wish to preserve:
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)
Yet another option is to plot the explained variance as a function of the number of dimensions (see image below). There will usually be an elbow in the curve, where the explained variance stops growing so fast. You can think of this as the intrinsic dimensionality of the dataset.
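A minimal plotting sketch, assuming the cumsum array from the previous snippet and matplotlib:
import matplotlib.pyplot as plt

plt.plot(np.arange(1, len(cumsum) + 1), cumsum)   # cumulative explained variance curve
plt.xlabel("Dimensions")
plt.ylabel("Cumulative explained variance")
plt.show()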
PCA for Compression
Applying PCA to the MNIST dataset can reduce it to less than 20% of its original size. While it is possible to compress the data, it is also possible to decompress it by applying the inverse transformation of the PCA projection. This won't give you back exactly the original data, but it will likely be close. The mean squared distance between the original data and the reconstructed data is called the reconstruction error. The following code shows compression and decompression, and the image after displays digits after compression and decompression (some quality loss, but the images are still mostly intact).
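A sketch of the compression/decompression step described above, assuming X_train holds the MNIST training images (784 features) and that 154 components preserve roughly 95% of the variance:
pca = PCA(n_components=154)
X_reduced = pca.fit_transform(X_train)           # compress down to 154 dimensions
X_recovered = pca.inverse_transform(X_reduced)   # decompress back to 784 dimensions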
PCA Inverse Transformation, Back to the Original Number of Dimensions: $\textbf{X}_{\text{recovered}} = \textbf{X}_{d\text{-proj}} \textbf{W}_d^T$
Randomized PCA
If you set the svd_solver hyperparameter to "randomized", Scikit-Learn uses a stochastic algorithm called Randomized PCA that quickly finds an approximation of the first d principal components. It is dramatically faster than full SVD when d is much smaller than n.
rnd_pca = PCA(n_components=154, svd_solver="randomized")
X_reduced = rnd_pca.fit_transform(X_train)
svd_solver is set to "auto" by default, and Scikit-Learn automatically chooses between "full" and "randomized" based on the values of m, n, and d.
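Conversely, you can force full SVD by setting svd_solver to "full", for example:
full_pca = PCA(n_components=154, svd_solver="full")
X_reduced = full_pca.fit_transform(X_train)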
Incremental PCA
The preceding implementations of PCA require the whole training set to fit in memory. Fortunately, Incremental PCA (IPCA) algorithms have been developed: you can split the training set into mini-batches and feed an IPCA algorithm one mini-batch at a time. The following code splits the MNIST training set into 100 mini-batches (using NumPy's array_split() function) and feeds them to Scikit-Learn's IncrementalPCA class to reduce the dimensionality down to 154 dimensions (just like before).
from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)          # fit one mini-batch at a time
X_reduced = inc_pca.transform(X_train)
Kernel PCA
We discussed the kernel trick in Chapter 5: a mathematical technique that implicitly maps instances into a very high-dimensional space (called the feature space), enabling nonlinear classification and regression with Support Vector Machines. Recall that a linear decision boundary in the high-dimensional feature space corresponds to a complex nonlinear decision boundary in the original space. The same trick can be applied to PCA: Kernel PCA (kPCA) makes it possible to perform complex nonlinear projections for dimensionality reduction. It is often good at preserving clusters of instances after projection, and it can sometimes unroll datasets that lie close to a twisted manifold. The following code uses Scikit-Learn's KernelPCA class to perform kPCA with an RBF kernel:
from sklearn.decomposition import KernelPCA
rbf_pca = KernelPCA(n_components = 2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)
The image below shows the Swiss roll, reduced to two dimensions using a linear kernel, an RBF kernel, and a sigmoid kernel.
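For reference, a sketch of the three variants being compared (the gamma and coef0 values here are just illustrative choices, not tuned):
lin_pca = KernelPCA(n_components=2, kernel="linear")
rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.0433)
sig_pca = KernelPCA(n_components=2, kernel="sigmoid", gamma=0.001, coef0=1)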
Selecting and Tuning Hyperparameters
Since kPCA is an unsupervised learning algorithm, there is no obvious performance measure to help you select the best kernel and hyperparameter values. However, since dimensionality reduction is often a preparation step for a supervised learning task, you can simply use grid search to select the kernel and hyperparameters that lead to the best performance on that task. The code below is an example. You can get the best kernel and hyperparameter values through the best_params_ variable:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression())
])
param_grid = [{
    "kpca__gamma": np.linspace(0.03, 0.05, 10),
    "kpca__kernel": ["rbf", "sigmoid"]
}]
grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)
>> {'kpca__gamma': 0.043333333333333335, 'kpca__kernel': 'rbf'}
Another approach is to select the kernel and hyperparameters that yield the lowest reconstruction error. However, reconstruction is not as easy as with linear PCA, because the reconstructed point lives in the high-dimensional feature space rather than in the original space. Fortunately, it is possible to find a point in the original space that would map close to the reconstructed point; this point is called the reconstruction pre-image. Once you have this pre-image, you can measure its squared distance to the original instance, and then select the kernel and hyperparameters that minimize this reconstruction pre-image error. The code below shows how to perform this reconstruction and then compute the reconstruction pre-image error:
# Reconstruction
rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.0433,
                    fit_inverse_transform=True)
X_reduced = rbf_pca.fit_transform(X)
X_preimage = rbf_pca.inverse_transform(X_reduced)

# Reconstruction pre-image error
from sklearn.metrics import mean_squared_error
print(mean_squared_error(X, X_preimage))  # 32.786308795766132
LLE
Locally Linear Embedding (LLE) is another very powerful nonlinear dimensionality reduction (NLDR) technique. It is a Manifold Learning technique that does not rely on projections like the previous algorithms. In a nutshell, LLE works by first measuring how each training instance linearly relates to its closest neighbors, and then looking for a low-dimensional representation of the training set where these local relationships are best preserved. This makes it particularly good at unrolling twisted manifolds, especially when there is not too much noise.
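Roughly, the two steps solve the following optimization problems (my paraphrase of the book's equations, with $w_{i,j}$ the learned local weights and $\textbf{z}^{(i)}$ the low-dimensional image of $\textbf{x}^{(i)}$):

$$\hat{\textbf{W}} = \underset{\textbf{W}}{\operatorname{argmin}} \sum_{i=1}^{m} \Big\| \textbf{x}^{(i)} - \sum_{j=1}^{m} w_{i,j}\,\textbf{x}^{(j)} \Big\|^2 \quad \text{subject to } w_{i,j}=0 \text{ if } \textbf{x}^{(j)} \text{ is not a neighbor of } \textbf{x}^{(i)}, \text{ and } \sum_{j=1}^{m} w_{i,j} = 1$$

$$\hat{\textbf{Z}} = \underset{\textbf{Z}}{\operatorname{argmin}} \sum_{i=1}^{m} \Big\| \textbf{z}^{(i)} - \sum_{j=1}^{m} \hat{w}_{i,j}\,\textbf{z}^{(j)} \Big\|^2$$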
The code below uses Scikit-Learn's LocallyLinearEmbedding class to unroll the Swiss roll, resulting in the 2D dataset shown in the image below. The distances between instances are locally well preserved in the 2D mapping, but they are not preserved at larger scales. Note that Scikit-Learn's implementation of LLE does not scale well to very large datasets.
from sklearn.manifold import LocallyLinearEmbedding
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_reduced = lle.fit_transform(X)
Other Dimensionality Reduction Techniques
Here are some of the most popular dimensionality reduction techniques (a short usage sketch follows the list):
- Multidimensional Scaling (MDS) reduces dimensionality while trying to preserve the distances between the instances.
- Isomap creates a graph by connecting each instance to its nearest neighbors, then reduces dimensionality while trying to preserve the geodesic distances between the instances.
- t-Distributed Stochastic Neighbor Embedding (t-SNE) reduces dimensionality while trying to keep similar instances close and dissimilar instances apart. It is mostly used for visualization, in particular to visualize clusters of instances in high-dimensional space.
- Linear Discriminant Analysis (LDA) is actually a classification algorithm, but during training it learns the most discriminative axes between the classes, and these axes can then be used to define a hyperplane onto which to project the data. The benefit is that the projection will keep classes as far apart as possible, so LDA is a good technique to reduce dimensionality before running another classification algorithm such as an SVM classifier.
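A minimal usage sketch of these transformers (parameters are illustrative; X is the Swiss roll from before, while X_mnist and y_mnist are placeholder names for any labeled dataset, since LDA is supervised):
from sklearn.manifold import MDS, Isomap, TSNE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_mds = MDS(n_components=2).fit_transform(X)        # preserve pairwise distances
X_iso = Isomap(n_components=2).fit_transform(X)     # preserve geodesic distances
X_tsne = TSNE(n_components=2).fit_transform(X)      # keep similar instances close
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_mnist, y_mnist)  # supervised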