Dimensionality Reduction Exercises Answers
This chapter covers dimensionality reduction, both for training models and for data visualization.
Question 1
What are the main motivations for reducing a dataset’s dimensionality? What are the main drawbacks?
Some of the main motivations for reducing a dataset's dimensionality include producing a better model, speeding up training, and making data visualization possible. The main drawbacks are that some information is lost, and that it makes data preparation pipelines more complicated and harder to maintain.
Question 2
What is the curse of dimensionality?
The curse of dimensionality refers to the problems that arise in datasets with a lot of features. One of these problems is the sparseness of data in high dimensions (instances are more likely to be far away from each other as the number of dimensions grows):
[I]f you pick a random point in a unit square (a 1 × 1 square), it will have only about a 0.4% chance of being located less than 0.001 from a border (in other words, it is very unlikely that a random point will be “extreme” along any dimension). But in a 10,000-dimensional unit hypercube (a 1 × 1 × ⋯ × 1 cube, with ten thousand 1s), this probability is greater than 99.999999%. Most points in a high-dimensional hypercube are very close to the border.
The sparseness of data means that predictions are much less reliable than in lower dimensions, since new instances are likely to be farther away from training instances than in lower dimensions.
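As a quick sanity check on the quoted figures, here is a minimal sketch (my own illustration, not part of the book's answer) that estimates the probability of a random point lying within 0.001 of a border, both by simulation in 2D and with the closed-form expression 1 - (1 - 2*eps)**d:
import numpy as np

rng = np.random.default_rng(42)
eps = 0.001

# Simulation in 2D: fraction of random points within eps of any border.
points = rng.random((100_000, 2))
near_border = ((points < eps) | (points > 1 - eps)).any(axis=1)
print(near_border.mean())                    # roughly 0.004, i.e. about 0.4%

# Closed form: P(within eps of a border) = 1 - (1 - 2*eps)**d
for d in (2, 10_000):
    print(d, 1 - (1 - 2 * eps) ** d)         # ~0.004 for d=2, >99.999999% for d=10,000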
Question 3
Once a dataset’s dimensionality has been reduced, is it possible to reverse the operation? If so, how? If not, why?
While it is technically possible to "reverse" a dimensionality reduction operation to some extent, it is usually not possible to perfectly reconstruct the original data, because some information is lost during the reduction. The discarded variance cannot be recovered, so the reversal is imperfect in most cases.
The mean squared distance between the original data and the reconstructed data is called the reconstruction error.
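A minimal sketch of how this reversal and its error can be computed with scikit-learn's PCA (the data here is a random placeholder, purely for illustration):
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1_000, 50)                   # placeholder data for illustration

pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)                # compress to 10 dimensions
X_recovered = pca.inverse_transform(X_reduced)  # decompress back to 50 dimensions

# Reconstruction error: mean squared distance between original and recovered data.
reconstruction_error = np.mean(np.sum((X - X_recovered) ** 2, axis=1))
print(reconstruction_error)                     # > 0, since the dropped variance is lost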
Question 4
Can PCA be used to reduce the dimensionality of a highly nonlinear dataset?
No, PCA (Principal Component Analysis) is not ideal for reducing the dimensionality of a highly nonlinear dataset. PCA is a linear dimensionality reduction technique, so it can only effectively capture linear relationships within the data. For nonlinear datasets, you should use alternative methods such as Kernel PCA or manifold learning techniques like Isomap or Locally Linear Embedding (LLE).
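As a rough sketch (my own illustration, with arbitrary hyperparameters), here is how linear PCA, Kernel PCA, and LLE would each be applied to a classic nonlinear dataset, the Swiss roll:
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import LocallyLinearEmbedding

X, t = make_swiss_roll(n_samples=1_000, noise=0.1, random_state=42)

X_pca = PCA(n_components=2).fit_transform(X)                                     # linear projection: squashes the roll
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04).fit_transform(X)    # nonlinear kernel trick
X_lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10).fit_transform(X)  # manifold learning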
Question 5
Suppose you perform PCA on a 1,000-dimensional dataset, setting the explained variance ratio to 95%. How many dimensions will the resulting dataset have?
About 200 dimensions. (Applying PCA to the MNIST dataset while preserving 95% of its variance produces roughly 150 features, instead of the original 784, which is about 20% of its initial size; scaling that ratio to 1,000 dimensions gives my estimate of about 200. The exact number depends on how the variance is spread across the dataset's features, so this is only a rough estimate.)
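For reference, this is the kind of check the estimate is based on. A minimal sketch, assuming a training matrix X_train such as the MNIST one loaded in Question 9:
import numpy as np
from sklearn.decomposition import PCA

# Option 1: inspect the cumulative explained variance and count components.
pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1               # smallest d preserving 95% of the variance
print(d)

# Option 2: let scikit-learn choose the number of components directly.
pca_95 = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca_95.fit_transform(X_train)
print(pca_95.n_components_)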
Question 6
In what cases would you use vanilla PCA, Incremental PCA, Randomized PCA, or Kernel PCA?
- Regular (vanilla) PCA should be used in linear cases where Randomized PCA is not called for, or where an approximation of the principal components is not good enough.
- You would want to use Randomized PCA in cases where the number of dimensions you are reducing the dataset down to (d) is much smaller than the original number of dimensions (n). This is because Randomized PCA quickly finds an approximation of the first d principal components. Its computational complexity is O(m × d²) + O(d³), instead of O(m × n²) + O(n³) for the full SVD approach.
- You should use Incremental PCA (IPCA) in cases where the whole training set cannot fit into memory or in cases where you are doing online learning. With this approach, you can split the training set into mini-batches and feed an IPCA algorithm one mini-batch at a time.
- Kernel PCA (kPCA) is good for preserving clusters of instances after projection, and can sometimes even unroll datasets that lie close to a twisted manifold. Kernel PCA should be used with nonlinear or clustered datasets. (See the sketch after this list for how each variant is instantiated.)
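A minimal sketch of how each variant is instantiated in scikit-learn (the component counts, batch count, and kernel parameters below are illustrative assumptions, not recommendations):
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA, KernelPCA

X = np.random.rand(2_000, 100)                  # placeholder data for illustration

vanilla_pca = PCA(n_components=20, svd_solver="full")            # exact components via full SVD
randomized_pca = PCA(n_components=20, svd_solver="randomized")   # fast approximation when d << n

inc_pca = IncrementalPCA(n_components=20)
for batch in np.array_split(X, 10):             # feed one mini-batch at a time
    inc_pca.partial_fit(batch)

kernel_pca = KernelPCA(n_components=20, kernel="rbf", gamma=0.04)  # nonlinear projection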
Question 7
How can you evaluate the performance of a dimensionality reduction algorithm on your dataset?
To evaluate a dimensionality reduction algorithm on your dataset, you can assess how well it reduces the number of dimensions while preserving the important information: measure the reconstruction error after applying the reverse transformation, visualize the reduced data to check for meaningful patterns, and compare the performance of a downstream machine learning model trained on the reduced data against one trained on the original data. Essentially, you are checking whether the reduced data retains enough information for accurate predictions.
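A minimal sketch of the downstream-model check described above, using a hypothetical pipeline of PCA and logistic regression on placeholder data (the reconstruction-error check is illustrated under Question 3):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X = np.random.rand(500, 30)                     # placeholder features for illustration
y = np.random.randint(0, 2, size=500)           # placeholder binary labels

baseline = LogisticRegression(max_iter=1000)
with_reduction = make_pipeline(PCA(n_components=0.95, svd_solver="full"),
                               LogisticRegression(max_iter=1000))

print(cross_val_score(baseline, X, y, cv=5).mean())
print(cross_val_score(with_reduction, X, y, cv=5).mean())   # compare against the baseline score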
Question 8
Does it make any sense to chain two different dimensionality reduction algorithms?
Yes, chaining two different dimensionality reduction algorithms can make sense, particularly when one algorithm is good at removing large amounts of irrelevant information quickly, while the second can then extract more subtle patterns from the reduced data. This is especially useful when the data has complex nonlinear relationships; a common example is using PCA to quickly reduce the number of dimensions, followed by a more sophisticated method such as Locally Linear Embedding (LLE) for further analysis.
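A minimal sketch of that PCA-then-LLE chain as a scikit-learn pipeline (placeholder data; the hyperparameters are arbitrary assumptions):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.pipeline import make_pipeline

X = np.random.rand(2_000, 100)                  # placeholder data for illustration

pca_then_lle = make_pipeline(
    PCA(n_components=0.95, svd_solver="full"),               # cheap linear pre-reduction
    LocallyLinearEmbedding(n_components=2, n_neighbors=10),  # slower nonlinear refinement
)
X_2d = pca_then_lle.fit_transform(X)
print(X_2d.shape)                               # (2000, 2)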
Question 9
Load the MNIST dataset (introduced in Chapter 3) and split it into a training set and a test set (take the first 60,000 instances for training, and the remaining 10,000 for testing). Train a Random Forest classifier on the dataset and time how long it takes, then evaluate the resulting model on the test set. Next, use PCA to reduce the dataset’s dimensionality, with an explained variance ratio of 95%. Train a new Random Forest classifier on the reduced dataset and see how long it takes. Was training much faster? Next evaluate the classifier on the test set: how does it compare to the previous classifier?
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)
data, target, description = mnist["data"], mnist["target"], mnist["DESCR"]
print(description)
data = data.to_numpy()
target = target.to_numpy()
X_train, X_test, y_train, y_test = data[:60000], data[60000:], target[:60000], target[60000:]
print("X_train Shape: {}".format(X_train.shape))
print("X_test Shape: {}".format(X_test.shape))
print("y_train Shape: {}".format(y_train.shape))
print("y_test Shape: {}".format(y_test.shape))
from sklearn.ensemble import RandomForestClassifier
import time
from sklearn.metrics import accuracy_score
import numpy as np
clf_1 = RandomForestClassifier(n_estimators=300,criterion="gini",n_jobs=-1)
start_time = time.time()
clf_1.fit(X_train,y_train)
end_time = time.time()
print("Without dimensionality reduction, RandomForestClassifier takes: {} seconds.".format(np.floor(end_time - start_time)))
clf_1_pred = clf_1.predict(X_test)
print("Accuracy Score (No Dimensionality Reduction): {}".format(accuracy_score(y_test, clf_1_pred)))
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
clf_2 = Pipeline(
steps=[
# If 0 < n_components < 1 and svd_solver == 'full', select the number
# of components such that the amount of variance that needs
# to be explained is greater than the percentage specified by n_components.
('reduce_dim',PCA(n_components=0.95,svd_solver="full")),
('fit',RandomForestClassifier(n_estimators=300,criterion="gini",n_jobs=-1))
]
)
start_time = time.time()
clf_2.fit(X_train,y_train)
end_time = time.time()
print("With dimensionality reduction, RandomForestClassifier takes: {} seconds.".format(np.floor(end_time - start_time)))
clf_2_pred = clf_2.predict(X_test)
print("Accuracy Score (With Dimensionality Reduction): {}".format(accuracy_score(y_test, clf_2_pred)))
Question 10
Use t-SNE to reduce the MNIST dataset down to two dimensions and plot the result using Matplotlib. You can use a scatterplot using 10 different colors to represent each image's target class. Alternatively, you can write colored digits at the location of each instance, or even plot scaled-down versions of the digit images themselves (if you plot all digits, the visualization will be too cluttered, so you should either draw a random sample or plot an instance only if no other instance has already been plotted at a close distance). You should get a nice visualization with well-separated clusters of digits. Try using other dimensionality reduction algorithms such as PCA, LLE, or MDS and compare the resulting visualizations.
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)
data, target, description = mnist["data"], mnist["target"], mnist["DESCR"]
print(description)
data = data.to_numpy()
target = target.to_numpy()
X_train, X_test, y_train, y_test = data[:60000], data[60000:], target[:60000], target[60000:]
print("X_train Shape: {}".format(X_train.shape))
print("X_test Shape: {}".format(X_test.shape))
print("y_train Shape: {}".format(y_train.shape))
print("y_test Shape: {}".format(y_test.shape))
print("Fetched openml")
import time
from sklearn.manifold import TSNE, LocallyLinearEmbedding, MDS, Isomap, SpectralEmbedding
from sklearn.model_selection import StratifiedShuffleSplit
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
start_time = time.time()
pca = PCA(n_components=2)
pca_out = pca.fit_transform(X_train)
end_time=time.time()
print("Performed PCA: {}".format(end_time-start_time))
start_time = time.time()
# Note: fitting t-SNE on the full 60,000-image training set can take a long time;
# a stratified subsample (like the one used for plotting below) would be much faster.
tsne = TSNE(n_components=2)
tsne_out = tsne.fit_transform(X_train)
end_time=time.time()
print("Performed t-SNE: {}".format(end_time-start_time))
# start_time = time.time()
# lle = LocallyLinearEmbedding(n_components=2)
# lle_out = lle.fit_transform(X_train)
# end_time=time.time()
# print("Performed Locally Linear Embedding: {}".format(end_time-start_time))
# start_time = time.time()
# mds = MDS(n_components=2)
# mds_out = mds.fit_transform(X_train)
# end_time=time.time()
# print("Performed Multidimensional Scaling: {}".format(end_time-start_time))
# start_time = time.time()
# iso = Isomap(n_components=2)
# iso_out = iso.fit_transform(X_train)
# end_time=time.time()
# print("Performed Isomap Embedding: {}".format(end_time-start_time))
# start_time = time.time()
# spectral = SpectralEmbedding(n_components=2)
# spectral_out = spectral.fit_transform(X_train)
# end_time=time.time()
# print("Performed Spectral Embedding: {}".format(end_time-start_time))
from matplotlib import colors
from matplotlib import cm
from sklearn.model_selection import StratifiedShuffleSplit
import numpy as np
sss = StratifiedShuffleSplit(n_splits=1,test_size=0.02,train_size=0.02)
train_index_use = None
for i, (train_index, test_index) in enumerate(sss.split(X_train,y_train)):
train_index_use = train_index
fig, ax = plt.subplots(1,2,layout="constrained",figsize=(12,6))
fig.suptitle("Plotting Dimensionality Reduction of MNIST with Different Techniques")
arr = [
{"title": "PCA", },
{"title": "t-SNE", },
]
c = ["b", "g", "r", "c", "m", "y", "k", "#964B00", "#FFA500", "#800080"]  # one distinct color per digit class (0-9)
cmap = (colors.ListedColormap(c))
bounds = list(range(11))  # 10 bins, one per digit class
norm = colors.BoundaryNorm(bounds, cmap.N)
for i in range(2):
Ax = ax[i]
Ax.set_title(arr[i]["title"])
if i==0:
plotting_data = pca_out[train_index_use,:]
plotting_classes = y_train[train_index_use]
else:
plotting_data = tsne_out[train_index_use,:]
plotting_classes = y_train[train_index_use]
for k in range(10):
indices = np.nonzero(plotting_classes==str(k))
data_for_number = plotting_data[indices[0],:]
color = c[k]
Ax.scatter(data_for_number[:,0],data_for_number[:,1],c=color,alpha=0.7,label=str(k))
Ax.legend()
plt.show()