Dimensionality Reduction Exercises Answers

This chapter goes over dimensionality reduction, both for training models and for visualization.


Question 1

What are the main motivations for reducing a dataset’s dimensionality? What are the main drawbacks?

The main motivations for reducing a dataset's dimensionality are to speed up training, to sometimes improve the model by filtering out noise and redundant features, and to make data visualization possible. The main drawbacks are that some information is lost during the reduction, and that data preparation pipelines become more complicated and harder to maintain.

Question 2

What is the curse of dimensionality?

The curse of dimensionality refers to the problems that arise in datasets with many features. One of these problems is the sparsity of data in high dimensions (instances are increasingly likely to be far away from each other as the number of dimensions grows):

[I]f you pick a random point in a unit square (a 1 × 1 square), it will have only about a 0.4% chance of being located less than 0.001 from a border (in other words, it is very unlikely that a random point will be “extreme” along any dimension). But in a 10,000-dimensional unit hypercube (a 1 × 1 × ⋯ × 1 cube, with ten thousand 1s), this probability is greater than 99.999999%. Most points in a high-dimensional hypercube are very close to the border.

This sparsity makes predictions much less reliable than in lower dimensions, since a new instance will likely be far away from any training instance.
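
The quoted probabilities are easy to verify: a uniform random point in a d-dimensional unit hypercube avoids all borders by more than 0.001 with probability (1 − 0.002)^d, since each coordinate independently must land in the interior interval of width 0.998. A minimal sketch:

# Probability that a random point in a d-dimensional unit hypercube
# lies within 0.001 of at least one border.
for d in (2, 10_000):
    p_near_border = 1 - (1 - 0.002) ** d
    print("d = {:>6}: {:.8%}".format(d, p_near_border))
# d = 2 gives ~0.4%; d = 10,000 gives > 99.999999%, matching the quote.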

Question 3

Once a dataset’s dimensionality has been reduced, is it possible to reverse the operation? If so, how? If not, why?

It is usually not possible to perfectly reverse a dimensionality reduction operation, because some information is lost during the reduction; the discarded variance simply cannot be recovered. That said, you can often approximately reconstruct the original data (for example, PCA's inverse_transform() maps the reduced data back to the original space), and the reconstruction will be close to the original if not much variance was lost.

The mean squared distance between the original data and the reconstructed data is called the reconstruction error.
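
A minimal sketch of computing the reconstruction error with scikit-learn's PCA and its inverse_transform() (the synthetic data and component count are purely illustrative):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 50))  # synthetic data, for illustration only

pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)                # compress to 10 dimensions
X_recovered = pca.inverse_transform(X_reduced)  # decompress back to 50

# Mean squared distance between original and reconstructed instances
reconstruction_error = np.mean(np.sum((X - X_recovered) ** 2, axis=1))
print("Reconstruction error:", reconstruction_error)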

Question 4

Can PCA be used to reduce the dimensionality of a highly nonlinear dataset?

PCA (Principal Component Analysis) is not ideal for reducing the dimensionality of a highly nonlinear dataset, because it is a linear technique: it projects the data onto a linear subspace, so it can only capture linear relationships within the data. For nonlinear datasets, you should use alternatives such as Kernel PCA or manifold learning techniques like Isomap or Locally Linear Embedding (LLE).
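
As a minimal sketch, here is Kernel PCA applied to the Swiss roll, a classic nonlinear dataset (the rbf kernel and gamma=0.04 are illustrative choices, not tuned values):

from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

X, t = make_swiss_roll(n_samples=1000, noise=0.1, random_state=42)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_reduced = kpca.fit_transform(X)
print(X_reduced.shape)  # (1000, 2)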

Question 5

Suppose you perform PCA on a 1,000-dimensional dataset, setting the explained variance ratio to 95%. How many dimensions will the resulting dataset have?

About 200 dimensions, though the exact number depends entirely on the dataset. (Applying PCA to the MNIST dataset while preserving 95% of its variance leaves roughly 150 of the original 784 features, about 20% of the initial size; scaling that ratio to 1,000 dimensions gives the estimate of about 200.)
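
Rather than estimating, you can ask scikit-learn directly: passing a float between 0 and 1 as n_components makes PCA keep the smallest number of components whose cumulative explained variance exceeds that ratio. A minimal sketch, assuming X is the 1,000-dimensional training matrix:

from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)   # keep 95% of the variance
X_reduced = pca.fit_transform(X)
print(pca.n_components_)       # the number of dimensions actually kept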

Question 6

In what cases would you use vanilla PCA, Incremental PCA, Randomized PCA, or Kernel PCA?

  • Regular (vanilla) PCA is the default for linear datasets that fit in memory, and for cases where an approximation of the principal components is not good enough.
  • Randomized PCA is preferable when the target dimensionality d is much smaller than the original dimensionality n, because it quickly finds an approximation of the first d principal components. Its computational complexity is O(m × d²) + O(d³), instead of O(m × n²) + O(n³) for the full SVD approach.
  • Incremental PCA (IPCA) is useful when the whole training set cannot fit into memory, or when you are doing online learning. With this approach, you split the training set into mini-batches and feed the IPCA algorithm one mini-batch at a time (see the sketch after this list).
  • Kernel PCA (kPCA) is good at preserving clusters of instances after projection, and can sometimes even unroll datasets that lie close to a twisted manifold; use it for nonlinear or clustered datasets.
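
A minimal IncrementalPCA sketch, assuming X_train is the MNIST training matrix loaded in Question 9 below (the 100 batches and n_components=154 are illustrative; with data that truly cannot fit in memory you would stream batches from disk, e.g. via numpy.memmap):

import numpy as np
from sklearn.decomposition import IncrementalPCA

inc_pca = IncrementalPCA(n_components=154)
# Feed the training set one mini-batch at a time instead of all at once.
for X_batch in np.array_split(X_train, 100):
    inc_pca.partial_fit(X_batch)

X_train_reduced = inc_pca.transform(X_train)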

Question 7

How can you evaluate the performance of a dimensionality reduction algorithm on your dataset?

To evaluate a dimensionality reduction algorithm on your dataset, assess how well it preserves important information while shrinking the number of dimensions. Three practical checks: measure the reconstruction error after applying the reverse transformation, visualize the reduced data to look for meaningful patterns, and compare the performance of a downstream machine learning model trained on the reduced data against one trained on the original data. If the downstream model performs about as well, the reduction retained enough information for accurate predictions.
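
A minimal sketch of the downstream-model check, assuming X_train and y_train as loaded in Question 9 below (the logistic regression classifier and 3-fold CV are illustrative choices):

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

baseline = LogisticRegression(max_iter=1000)
reduced = make_pipeline(PCA(n_components=0.95), LogisticRegression(max_iter=1000))

# If the two scores are close, the reduction preserved enough information.
print(cross_val_score(baseline, X_train, y_train, cv=3).mean())
print(cross_val_score(reduced, X_train, y_train, cv=3).mean())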

Question 8

Does it make any sense to chain two different dimensionality reduction algorithms?

Yes, chaining two different dimensionality reduction algorithms can make sense, particularly when the first algorithm can quickly remove large numbers of mostly useless dimensions and the second, slower algorithm can then extract more subtle (often nonlinear) structure from the reduced data. A common example is using PCA to quickly get rid of most of the useless dimensions, then applying a much slower method such as Locally Linear Embedding (LLE); this two-step approach will likely perform about as well as running LLE alone, in a fraction of the time.
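
A minimal sketch of the PCA → LLE chain (the n_components values are illustrative, and only a 5,000-instance slice of the MNIST training set from Question 9 is used, because LLE scales poorly with dataset size):

from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.pipeline import make_pipeline

# PCA quickly strips the mostly useless dimensions; LLE then does the
# expensive nonlinear embedding on the smaller representation.
pca_lle = make_pipeline(
    PCA(n_components=0.95),
    LocallyLinearEmbedding(n_components=2),
)
X_2d = pca_lle.fit_transform(X_train[:5000])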

Question 9

Load the MNIST dataset (introduced in Chapter 3) and split it into a training set and a test set (take the first 60,000 instances for training, and the remaining 10,000 for testing). Train a Random Forest classifier on the dataset and time how long it takes, then evaluate the resulting model on the test set. Next, use PCA to reduce the dataset’s dimensionality, with an explained variance ratio of 95%. Train a new Random Forest classifier on the reduced dataset and see how long it takes. Was training much faster? Next evaluate the classifier on the test set: how does it compare to the previous classifier?

from sklearn.datasets import fetch_openml

# Fetch MNIST from OpenML; the data and target come back as pandas
# objects here, hence the .to_numpy() conversions below.
mnist = fetch_openml('mnist_784', version=1)
data, target, description = mnist["data"], mnist["target"], mnist["DESCR"]
print(description)
data = data.to_numpy()
target = target.to_numpy()
# The first 60,000 instances are the official training set, the rest the test set.
X_train, X_test, y_train, y_test = data[:60000], data[60000:], target[:60000], target[60000:]
print("X_train Shape: {}".format(X_train.shape))
print("X_test Shape: {}".format(X_test.shape))
print("y_train Shape: {}".format(y_train.shape))
print("y_test Shape: {}".format(y_test.shape))
out[11]

**Author**: Yann LeCun, Corinna Cortes, Christopher J.C. Burges
**Source**: [MNIST Website](http://yann.lecun.com/exdb/mnist/) - Date unknown
**Please cite**:

The MNIST database of handwritten digits with 784 features, raw data available at: http://yann.lecun.com/exdb/mnist/. It can be split in a training set of the first 60,000 examples, and a test set of 10,000 examples

It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting. The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field.

With some classification methods (particularly template-based methods, such as SVM and K-nearest neighbors), the error rate improves when the digits are centered by bounding box rather than center of mass. If you do this kind of pre-processing, you should report it in your publications. The MNIST database was constructed from NIST's Special Database 3 (SD-3) and Special Database 1 (SD-1). NIST originally designated SD-3 as their training set and SD-1 as their test set. However, SD-3 is much cleaner and easier to recognize than SD-1. The reason for this can be found in the fact that SD-3 was collected among Census Bureau employees, while SD-1 was collected among high-school students. Drawing sensible conclusions from learning experiments requires that the result be independent of the choice of training set and test set among the complete set of samples. Therefore it was necessary to build a new database by mixing NIST's datasets.

The MNIST training set is composed of 30,000 patterns from SD-3 and 30,000 patterns from SD-1. Our test set was composed of 5,000 patterns from SD-3 and 5,000 patterns from SD-1. The 60,000 pattern training set contained examples from approximately 250 writers. We made sure that the sets of writers of the training set and test set were disjoint. SD-1 contains 58,527 digit images written by 500 different writers. In contrast to SD-3, where blocks of data from each writer appeared in sequence, the data in SD-1 is scrambled. Writer identities for SD-1 is available and we used this information to unscramble the writers. We then split SD-1 in two: characters written by the first 250 writers went into our new training set. The remaining 250 writers were placed in our test set. Thus we had two sets with nearly 30,000 examples each. The new training set was completed with enough examples from SD-3, starting at pattern # 0, to make a full set of 60,000 training patterns. Similarly, the new test set was completed with SD-3 examples starting at pattern # 35,000 to make a full set with 60,000 test patterns. Only a subset of 10,000 test images (5,000 from SD-1 and 5,000 from SD-3) is available on this site. The full 60,000 sample training set is available.

Downloaded from openml.org.
X_train Shape: (60000, 784)
X_test Shape: (10000, 784)
y_train Shape: (60000,)
y_test Shape: (10000,)

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import numpy as np
import time

# Baseline: train on the full 784-dimensional data and time it.
clf_1 = RandomForestClassifier(n_estimators=300, criterion="gini", n_jobs=-1)
start_time = time.time()
clf_1.fit(X_train, y_train)
end_time = time.time()
print("Without dimensionality reduction, RandomForestClassifier takes: {} seconds.".format(np.floor(end_time - start_time)))
clf_1_pred = clf_1.predict(X_test)
print("Accuracy Score (No Dimensionality Reduction): {}".format(accuracy_score(y_test, clf_1_pred)))
out[12]

Without dimensionality reduction, RandomForestClassifier takes: 14.0 seconds.
Accuracy Score (No Dimensionality Reduction): 0.9705

from sklearn.decomposition import PCA 
from sklearn.pipeline import Pipeline

clf_2 = Pipeline(
    steps=[
        # If 0 < n_components < 1 and svd_solver == 'full', PCA selects the
        # number of components such that the amount of variance explained
        # is greater than the percentage specified by n_components.
        ('reduce_dim', PCA(n_components=0.95, svd_solver="full")),
        ('fit', RandomForestClassifier(n_estimators=300, criterion="gini", n_jobs=-1))
    ]
)
start_time = time.time()
clf_2.fit(X_train, y_train)
end_time = time.time()
print("With dimensionality reduction, RandomForestClassifier takes: {} seconds.".format(np.floor(end_time - start_time)))
clf_2_pred = clf_2.predict(X_test)
print("Accuracy Score (With Dimensionality Reduction): {}".format(accuracy_score(y_test, clf_2_pred)))
out[13]

With dimensionality reduction, RandomForestClassifier takes: 48.0 seconds.
Accuracy Score (With Dimensionality Reduction): 0.9517

So training was not faster at all: it was roughly three times slower, and test accuracy dropped from about 0.97 to about 0.95. Dimensionality reduction does not always speed up training; it depends on the dataset, the model, and the training algorithm. Random forests in particular split on individual features, and the dense, rotated features produced by PCA are apparently harder for the trees to exploit than the original sparse pixel values.

Question 10

Use t-SNE to reduce the MNIST dataset down to two dimensions and plot the result using Matplotlib. You can use a scatterplot using 10 different colors to represent each image’s target class. Alternatively, you can write colored digits at the location of each instance, or even plot scaled-down versions of the digit images themselves (if you plot all digits, the visualization will be too cluttered, so you should either draw a random sample or plot an instance only if no other instance has already been plotted at a close distance). You should get a nice visualization with well-separated clusters of digits. Try using other dimensionality reduction algorithms such as PCA, LLE, or MDS and compare the resulting visualizations.

from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1)
data, target, description = mnist["data"], mnist["target"], mnist["DESCR"]
print(description)
data = data.to_numpy()
target = target.to_numpy()
X_train, X_test, y_train, y_test = data[:60000], data[60000:], target[:60000], target[60000:]
print("X_train Shape: {}".format(X_train.shape))
print("X_test Shape: {}".format(X_test.shape))
print("y_train Shape: {}".format(y_train.shape))
print("y_test Shape: {}".format(y_test.shape))

print("Fetched openml")

import time
from sklearn.manifold import TSNE, LocallyLinearEmbedding, MDS, Isomap, SpectralEmbedding
from sklearn.model_selection import StratifiedShuffleSplit
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA 

start_time = time.time()
pca = PCA(n_components=2)
pca_out = pca.fit_transform(X_train)
end_time=time.time()
print("Performed PCA: {}".format(end_time-start_time))

start_time = time.time()
tsne = TSNE(n_components=2)
tsne_out = tsne.fit_transform(X_train)
end_time=time.time()
print("Performed t-SNE: {}".format(end_time-start_time))

# start_time = time.time()
# lle = LocallyLinearEmbedding(n_components=2)
# lle_out = lle.fit_transform(X_train)
# end_time=time.time()
# print("Performed Locally Linear Embedding: {}".format(end_time-start_time))

# start_time = time.time()
# mds = MDS(n_components=2)
# mds_out = mds.fit_transform(X_train)
# end_time=time.time()
# print("Performed Multidimensional Scaling: {}".format(end_time-start_time))

# start_time = time.time()
# iso = Isomap(n_components=2)
# iso_out = iso.fit_transform(X_train)
# end_time=time.time()
# print("Performed Isomap Embedding: {}".format(end_time-start_time))

# start_time = time.time()
# spectral = SpectralEmbedding(n_components=2)
# spectral_out = spectral.fit_transform(X_train)
# end_time=time.time()
# print("Performed Spectral Embedding: {}".format(end_time-start_time))
out[15]

(The full MNIST description prints again here; it is identical to the output shown in Question 9, ending with "Downloaded from openml.org.")
X_train Shape: (60000, 784)
X_test Shape: (10000, 784)
y_train Shape: (60000,)
y_test Shape: (10000,)
Fetched openml
Performed PCA: 1.1047589778900146
Performed t-SNE: 198.3906238079071

from matplotlib import colors
from matplotlib import cm
from sklearn.model_selection import StratifiedShuffleSplit
import numpy as np
# Plot only a 2% stratified sample so the scatterplots stay readable.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.02, train_size=0.02)
train_index_use, _ = next(sss.split(X_train, y_train))

fig, ax = plt.subplots(1,2,layout="constrained",figsize=(12,6))
fig.suptitle("Plotting Dimensionality Reduction of MNIST with Different Techniques")
arr = [
    {"title": "PCA", },
    {"title": "t-SNE", },
]

c = ["b","g","r","c","m","y","k","#964B00", "#964B00","#800080"]

cmap = (colors.ListedColormap(c))
bounds = [0,1,2,3,4,5,6,7,8,9]
norm = colors.BoundaryNorm(bounds, cmap.N)


for i in range(2):
    Ax = ax[i]
    Ax.set_title(arr[i]["title"])
    if i == 0:
        plotting_data = pca_out[train_index_use, :]
    else:
        plotting_data = tsne_out[train_index_use, :]
    plotting_classes = y_train[train_index_use]
    for k in range(10):
        # The targets are strings ('0'..'9'), so compare against str(k).
        indices = np.nonzero(plotting_classes == str(k))
        data_for_number = plotting_data[indices[0], :]
        Ax.scatter(data_for_number[:, 0], data_for_number[:, 1], c=c[k], alpha=0.7, label=str(k))
    Ax.legend()
plt.show()
out[16]
(Figure: two side-by-side scatterplots of the 2% MNIST sample colored by digit class, PCA on the left and t-SNE on the right.)