Hands On Machine Learning Chapter 9 - Unsupervised Learning Techniques
I am going to re-read Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow because I don't feel that I got a good grasp of machine learning the first time through, and I skipped the neural network chapters entirely on that first read.
Unsupervised Learning Techniques
Although most of the applications of Machine Learning today are based on supervised learning, the vast majority of the available data is actually unlabeled: we have the input features X, but we do not have the labels y. Last chapter we looked at the most common unsupervised learning task: dimensionality reduction. In this chapter, we look at a few more unsupervised learning tasks:
- Clustering: the goal is to group similar instances together into clusters. This is a great tool for data analysis, customer segmentation, recommender systems, search engines, image segmentation, semi-supervised learning, dimensionality reduction, and more
- Anomaly Detection: the objective is to learn about what "normal" data looks like, and use this to detect abnormal instances, such as defective items on a production line or a new trend in a time series
- Density Estimation: this is the task of estimating the probability density function (PDF) of the random process that generated the dataset. This is commonly used for anomaly detection: instances located in very low-density regions are likely to be anomalies. It is also useful for data analysis and visualization.
Clustering
Clustering is the task of identifying similar instances and assigning them to clusters, i.e., groups of similar instances. As in classification, each instance gets assigned to a group, but unlike classification, clustering is an unsupervised task: there are no predefined labels.
Clustering is used in a wide variety of applications, including:
- customer segmentation - segmenting users based on their behavior. This can be useful in recommender systems to suggest content that other users in the same cluster enjoyed.
- data analysis
- as a dimensionality reduction technique - once a dataset has been clustered, it is usually possible to measure each instance's affinity with each cluster (affinity is a measure of how well an instance fits into a cluster), and each instance's feature vector can then be replaced with the vector of its cluster affinities.
- For anomaly detection (also called outlier detection): any instance that has a low affinity to all the clusters is likely to be an anomaly. Anomaly detection is particularly useful in detecting defects in manufacturing, or for fraud detection
- For semi-supervised learning: if you only have a few labels, you could perform clustering and propagate the labels to all the instances in the same cluster.
- For search engines: image similarity search
- To segment an image: reducing the number of different colors in an image considerably
There is no universal definition of what a cluster is: it really depends on the context, and different algorithms will capture different kinds of clusters. For example, some algorithms look for instances centered around a particular point, called a centroid. Others look for continuous regions of densely packed instances: these clusters can take on any shape. Some algorithms are hierarchical, looking for clusters of clusters. And the list goes on.
K-Means
The K-Means algorithm is a simple algorithm capable of clustering this kind of dataset (the data in the image below) very quickly and efficiently, often in just a few iterations. It was proposed by Stuart Lloyd at Bell Labs in 1957 and only published in 1982, in the paper "Least Squares Quantization in PCM". The code below shows training a K-Means clusterer on the data in the image below.
You have to specify how many clusters to find. In the context of clustering, an instance's label is the index of the cluster that the instance gets assigned to by the algorithm: this is not to be confused with class labels in classification. The KMeans instance preserves a copy of the labels of the instances it was trained on, available via the labels_ attribute. Plotting the clusters' decision boundaries gives you a Voronoi tessellation. In the image below, you can see that the vast majority of instances were clearly assigned to the appropriate cluster, but a few instances were probably mislabeled. The K-Means algorithm does not behave well when the blobs have very different diameters, since all it cares about when assigning an instance to a cluster is the distance to the centroid. Instead of assigning each instance to a single cluster, called hard clustering, it can be useful to give each instance a score per cluster: this is called soft clustering. The transform() method of the KMeans class measures the distance from each instance to every centroid. If you have a high-dimensional dataset and you transform it this way, you end up with a k-dimensional dataset: this can be a very efficient non-linear dimensionality reduction technique.
The K-Means Algorithm
First place the centroids randomly (e.g., pick k instances at random and use their locations as centroids). Then label the instances (assign each instance to the closest centroid), update the centroids (move each centroid to the mean of the instances assigned to it), label the instances again, update the centroids again, and so on until the centroids stop moving. The algorithm is guaranteed to converge in a finite number of steps (usually quite small); it will not oscillate forever.
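As a rough sketch of this loop (not Scikit-Learn's actual implementation, and with empty-cluster handling and the convergence test omitted), the algorithm can be written in a few lines of NumPy:
import numpy as np

def kmeans_sketch(X, k, n_iters=10, seed=42):
    rng = np.random.default_rng(seed)
    # Place the centroids randomly: pick k distinct instances as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Label the instances: assign each instance to its closest centroid
        distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
        labels = distances.argmin(axis=1)
        # Update the centroids: move each centroid to the mean of its assigned instances
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # A real implementation would stop as soon as the centroids stop moving
    return centroids, labels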
The computational complexity of the algorithm is generally linear with regard to the number of instances m, the number of clusters k, and the number of dimensions n. However, this is only true when the data has a clustering structure; in the worst case, the complexity can increase exponentially with the number of instances, although this rarely happens in practice.
Unfortunately, even though the algorithm is guaranteed to converge, it may not converge to the right solution (might converge to a local optimum): this depends on centroid initialization.
Centroid Initialization Methods
If you happen to know approximately where the centroids should be, then you can set the init hyperparameter to a NumPy array containing the list of centroids, and set n_init to 1:
good_init = np.array([[-3, 3], [-3, 2], [-3, 1], [-1, 2], [0, 2]])
kmeans = KMeans(n_clusters=5, init=good_init, n_init=1)
Another solution is to run the algorithm with different random initializations and keep the best solution. This is controlled by the n_init hyperparameter: by default, it is equal to 10, which means that the whole algorithm described earlier actually runs 10 times when you call fit(), and Scikit-Learn keeps the best solution. A model's inertia tells you which solution is best: this is the sum of the squared distances between each instance and its closest centroid. The KMeans class runs the algorithm n_init times and keeps the model with the lowest inertia. A model's inertia is accessible via the inertia_ attribute.
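As a quick check (assuming the kmeans model and the dataset X from the K-Means code later in this section), the inertia can be inspected directly; note that score() returns the negative inertia, following Scikit-Learn's "greater is better" convention:
print(kmeans.inertia_)  # sum of squared distances to the closest centroid
print(kmeans.score(X))  # equal to -kmeans.inertia_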
An important improvement to the K-Means algorithm, called K-Means++, was proposed in a 2006 paper by David Arthur and Sergei Vassilvitskii: they introduced a smarter initialization step that tends to select centroids that are distant from one another, and this makes the K-Means algorithm much less likely to converge to a sub-optimal solution.
Accelerated K-Means and Mini-Batch K-Means
Another important improvement to the K-Means algorithm was proposed in a 2003 paper by Charles Elkan. It considerably accelerates the algorithm by avoiding many unnecessary distance calculations: this is achieved by exploiting the triangle inequality and by keeping track of lower and upper bounds for distances between instances and centroids. Another important variant of the K-Means algorithm was proposed in a 2010 paper by David Sculley. Instead of using the full dataset at each iteration, the algorithm is capable of using mini-batches, moving the centroids just slightly at each iteration. This speeds up the algorithm by a factor of 3 or 4 and makes it possible to cluster huge datasets that do not fit in memory. Scikit-Learn implements this algorithm with the MiniBatchKMeans class (see code below). The Mini-Batch K-Means algorithm generally has slightly worse inertia than regular K-Means. The image below shows that Mini-Batch K-Means is much faster than regular K-Means, and this difference increases with k.
from sklearn.cluster import MiniBatchKMeans
minibatch_kmeans = MiniBatchKMeans(n_clusters=5)
minibatch_kmeans.fit(X)
Finding the Optimal Number of Clusters
The result might be quite bad if you set the number of clusters to the wrong value.
The inertia is not a good performance metric when trying to choose k, since it keeps getting lower as we increase k: the more clusters there are, the closer each instance will be to its closest centroid, and therefore the lower the inertia will be. The image below shows the inertia as a function of k, the number of clusters.
The "elbow" at k=4 would be a good choice if we did not know any better. A more precise approach (also more computationally expensive) is to use the silhouette score, which is the mean silhouette coefficient over all the instances. An instance;s silhouette coefficient is equal to max(a,b)b−1 where a is the mean distance to the other instances in the same cluster (mean intra-cluster distance), and b is the mean nearest-cluster distance, that is the mean distance to the instances of the next closest cluster. The silhouette coefficient can vary between -1 and +1: a coefficient close to +1 means that the instance is well inside its own cluster and far from other clusters, while a coefficient close to 0 means that it is close to a cluster boundary, and finally a cluster close to -1 means that the instance may have been assigned to the wrong cluster, To compute the silhouette score, you can use Scikit-Learn's silhouette_score() function, giving it all the instances in the dataset, and the labels they were assigned:
from sklearn.metrics import silhouette_score
print(silhouette_score(X, kmeans.labels_)) # 0.655517642572828
The image below confirms that k=4 is a very good choice for the number of clusters.
An informative visualization is obtained when you plot every instance's silhouette coefficient (see image below), sorted by the cluster they are assigned to and by the value of the coefficient. This is called a silhouette diagram. The vertical dashed lines represent the silhouette score for each number of clusters. When most of the instances in a cluster have a lower coefficient than this score (i.e., if many of the instances stop short of the dashed line, ending to the left of it), then the cluster is rather bad since this means its instances are much too close to other clusters.
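A minimal sketch of how such a diagram can be built with silhouette_samples() (assuming X is the blob dataset from the K-Means code later in this section; the plotting details are simplified compared to the book's figure):
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

k = 4
kmeans_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
coeffs = silhouette_samples(X, kmeans_k.labels_)

pos = 0
for j in range(k):
    # Sort each cluster's coefficients and stack them into one "knife" shape
    cluster_coeffs = np.sort(coeffs[kmeans_k.labels_ == j])
    plt.fill_betweenx(np.arange(pos, pos + len(cluster_coeffs)), 0, cluster_coeffs)
    pos += len(cluster_coeffs)
plt.axvline(silhouette_score(X, kmeans_k.labels_), linestyle="--")  # overall silhouette score
plt.xlabel("Silhouette coefficient")
plt.show()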
Limits of K-Means
- It is necessary to run the algorithm several times to avoid sub-optimal solutions.
- K-Means does not behave very well when the clusters have varying sizes, different densities, or non-spherical shapes (see image below - ellipsoidal clusters).
It is important to scale the input features before you run K-Means, or else the clusters may be very stretched, and K-Means will perform poorly. Scaling the features does not guarantee that all the clusters will be nice and spherical, but it generally improves things.
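A minimal sketch of scaling before K-Means using a pipeline (assuming the same dataset X; StandardScaler and make_pipeline are standard Scikit-Learn utilities):
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

scaled_kmeans = make_pipeline(StandardScaler(),
                              KMeans(n_clusters=5, n_init=10, random_state=42))
scaled_kmeans.fit(X)  # the scaler is fitted on X, then K-Means runs on the scaled data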
Using Clustering For Image Segmentation
Image segmentation is the task of partitioning an image into multiple segments. In semantic segmentation, all pixels that are part of the same object type get assigned to the same segment. In instance segmentation, all pixels that are part of the same individual object are assigned to the same segment. The state of the art in semantic or instance segmentation today is achieved using complex architectures based on convolutional neural networks. Here we do something much simpler: color segmentation - assign pixels to the same segment if they have a similar color. The ladybug image - loaded in the code below - is represented by a 3D array: the first dimension is the height, the second the width, and the third the number of color channels. For each pixel, there is a 3D vector containing the intensities of red, green, and blue. The image below shows clustering the ladybug image with various numbers of clusters. The disappearance of the ladybug in the bottom right shows that K-Means prefers clusters of similar sizes.
Clustering for Preprocessing
Clustering can be an efficient approach to dimensionality reduction, in particular as a preprocessing step before a supervised learning algorithm. [See the code for an example of using clustering as preprocessing for the MNIST dataset.] When clustering is used as a preprocessing step for a supervised learning task, finding a good value for k is much simpler: simply choose the value of k that results in the best classification performance during cross-validation.
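A minimal sketch of this idea on the digits dataset (a small MNIST-like dataset bundled with Scikit-Learn); the number of clusters and the resulting score are illustrative, not the book's exact figures:
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits, random_state=42)

# Replace each image with its distances to 50 cluster centroids, then classify
pipeline = make_pipeline(
    KMeans(n_clusters=50, n_init=10, random_state=42),
    LogisticRegression(max_iter=10_000),
)
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))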
Clustering for Semi-Supervised Learning
Label propagation - after clustering the training set, label a few representative instances (e.g., the instance closest to each cluster's centroid) and propagate those labels to all the other instances in the same cluster. A rough sketch is shown below.
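This sketch runs on the digits training set from the preprocessing example above; in practice a human would hand-label only the 50 representative images, but here the known labels stand in for that step:
from sklearn.cluster import KMeans

k = 50
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
X_digits_dist = kmeans.fit_transform(X_train)      # distance of each instance to each centroid
representative_idx = X_digits_dist.argmin(axis=0)  # index of the instance closest to each centroid
y_representative = y_train[representative_idx]     # pretend these k labels came from a human expert

# Propagate each representative's label to every instance in its cluster
y_train_propagated = y_representative[kmeans.labels_]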
Active Learning
To continue improving your model and your training set, you could do a few rounds of active learning. Active learning is when a human expert interacts with the learning algorithm, providing labels for specific instances when the algorithm requests them. One common strategy is uncertainty sampling (see the sketch after this list):
- The trained model makes predictions on unlabeled instances
- The instances for which the model is most uncertain (i.e., where its estimated probability is lowest) are given to an expert to be labeled.
- Iterate this process until the performance improvement stops being worth the labeling effort.
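A minimal sketch of the uncertainty-sampling step, assuming a fitted probabilistic classifier clf and a pool of unlabeled instances X_unlabeled (both hypothetical names):
import numpy as np

# Probability of the most likely class for each unlabeled instance
proba = clf.predict_proba(X_unlabeled)
uncertainty = 1 - proba.max(axis=1)

# Pick the instances the model is least sure about and send them to a human expert
n_queries = 20
query_idx = np.argsort(uncertainty)[-n_queries:]
X_to_label = X_unlabeled[query_idx]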
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import numpy as np
# extra code – the exact arguments of make_blobs() are not important
blob_centers = np.array([[ 0.2, 2.3], [-1.5 , 2.3], [-2.8, 1.8],
[-2.8, 2.8], [-2.8, 1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
X, y = make_blobs(n_samples=2000, centers=blob_centers, cluster_std=blob_std,random_state=7)
k = 5
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42) # You have to specify how many clusters the algorithm must find
y_pred = kmeans.fit_predict(X)
print("prediction =",y_pred)
print(y_pred is kmeans.labels_)
print("5 centroids that the algorithm found =",kmeans.cluster_centers_)
X_new = np.array([[0, 2], [3, 2], [-3, 3], [-3, 2.5]])
print("Predictions of new instances =",kmeans.predict(X_new))
print("Soft Clustering (Distance from centroid) =",kmeans.transform(X_new))
import os
from matplotlib.image import imread
image = imread(os.path.join("images","ladybug.png"))
print("Image Shape =",image.shape)
X = image.reshape(-1, 3)
kmeans = KMeans(n_clusters=8).fit(X)
segmented_img = kmeans.cluster_centers_[kmeans.labels_]
segmented_img = segmented_img.reshape(image.shape)
DBSCAN
DBSCAN identifies clusters of arbitrary shape. It defines clusters as continuous regions of high density:
- For each instance, the algorithm counts how many instances are located within a small distance ϵ from it. This region is called the instance's ϵ-neighborhood.
- If an instance has at least min_samples in its ϵ -neighborhood, then it is considered a core instance. Core instances are those that are located in dense regions.
- All instances in the neighborhood of a core instance belong to the same cluster. This may include other core instances; therefore, a long sequence of neighboring core instances forms a single cluster.
- Any instance that is not a core instance and does not have one in its neighborhood is considered an anomaly.
This algorithm works well if all the clusters are dense enough and they are well separated by low-density regions. The DBSCAN class in Scikit-Learn is simple to use. You can access the labels of all the instances in the labels_ instance variable. As seen in the image below, increasing ϵ can decrease the number of clusters.
The DBSCAN class does not have a predict() method. It cannot predict which cluster a new instance belongs to. The rationale here is that several classification algorithms could make sense here, and it is easy enough to train one. See the code below. The decision boundary can be seen in the image below.
In short, DBSCAN is a very simple yet powerful algorithm, capable of identifying any number of clusters, of any shape; it is robust to outliers, and it has just two hyperparameters (eps and min_samples). However, if the density varies significantly across the clusters, it can be impossible for it to capture all the clusters properly. Moreover, its computational complexity is roughly O(m log m), making it pretty close to linear with regard to the number of instances. However, Scikit-Learn's implementation can require up to O(m²) memory if eps is large.
Other Clustering Algorithms
- Agglomerative clustering
- BIRCH
- Mean-shift
- Affinity propagation
- Spectral clustering
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
import numpy as np
X, y = make_moons(n_samples=1000, noise=0.05)
dbscan = DBSCAN(eps=0.05, min_samples=5)
dbscan.fit(X)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(dbscan.components_, dbscan.labels_[dbscan.core_sample_indices_])
X_new = np.array([[-0.5, 0], [0, 0.5], [1, -0.1], [2, 1]])
knn.predict(X_new)
knn.predict_proba(X_new)
Gaussian Mixtures
A Gaussian mixture model (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown. All the instances generated from a single Gaussian distribution form a cluster that typically looks like an ellipsoid. Each cluster can have a different ellipsoidal shape, size, density, and orientation. When you observe an instance, you know it was generated from one of the Gaussian distributions, but you are not told which one, and you do not know what the parameters of these distributions are. There are several GMM variants: in the simplest variant, implemented in the GaussianMixture class, you must know in advance the number k of Gaussian distributions. The dataset X is assumed to have been generated through the following stochastic process:
- For each instance, a cluster is picked randomly among k clusters. The probability of choosing the jth cluster is defined by the cluster's weight ϕ(j). The index of the cluster chosen for the ith instance is noted z(i).
- If z(i)=j, meaning that the ith instance has been randomly assigned to the jth cluster, the location x(i) of this instance is sampled randomly from the Gaussian distribution with mean μ(j) and covariance matrix Σ(j). This is noted x(i)∼N(μ(j),Σ(j)).
This generative process can be represented as a graphical model (see image below). The graph represents the structure of the conditional dependencies between random variables. The circles represent random variables. The squares represent fixed values (the parameters of the model). The large rectangles are called plates: they indicate that their content is repeated several times. The number at the bottom right-hand side of each plate indicates how many times its content is repeated, so there are m random variables z(i) and m random variables x(i), and k means μ(j) and k covariance matrices Σ(j), but just one weight vector ϕ (containing the weights ϕ(1) to ϕ(k)). Each variable z(i) is drawn from the categorical distribution with weights ϕ. Each variable x(i) is drawn from the normal distribution with the mean and covariance matrix defined by its cluster z(i). The solid arrows represent conditional dependencies. For example, the probability distribution for each random variable z(i) depends on the weight vector ϕ. Note that when an arrow crosses a plate boundary, it applies to all repetitions of that plate, so for example the weight vector ϕ conditions the probability distributions of all the random variables z(1) to z(m). The squiggly arrow from z(i) to x(i) represents a switch: depending on the value of z(i), the instance x(i) will be sampled from a different Gaussian distribution. Shaded nodes indicate that the value is known, so in this case only the random variables x(i) have known values: they are called observed variables. The unknown random variables z(i) are called latent variables.
The GaussianMixture class relies on the Expectation-Maximization (EM) algorithm, which has many similarities to the K-Means algorithm: it also initializes the cluster parameters randomly, then it repeats two steps until convergence, first assigning instances to clusters (this is called the expectation step), then updating the clusters (this is called the maximization step). You can think of EM as a generalization of K-Means, except EM finds not only the cluster centers, but also their size, shape, orientation, and relative weights. EM uses soft cluster assignments: for each instance during the expectation step, the algorithm estimates the probability that it belongs to each cluster (based on the current cluster parameters). Then, during the maximization step, each cluster is updated using all the instances in the dataset, with each instance weighted by the estimated probability that it belongs to that cluster. These probabilities are called the responsibilities of the clusters for the instances. EM can end up converging to poor solutions, so it needs to be run several times. The model can assign each instance to a cluster or estimate the probability that an instance belongs to a particular cluster using the predict() and predict_proba() methods, respectively. A Gaussian mixture model is a generative model, meaning that you can sample new instances from it (gm.sample()). You can also estimate the density of the model at any given location: the score_samples() method gives you an estimate of the log of the probability density function (PDF) at that location. The image below shows the cluster means, the decision boundaries (dashed lines), and the density contours of this GM model:
When there are many dimensions, or many clusters, or few instances, EM can struggle to converge to the optimal solution. You might need to limit the number of parameters that the algorithm has to learn: one way to do this is to limit the range of shapes and orientations that the clusters can have. This can be achieved by imposing constraints on the covariance matrices via the covariance_type hyperparameter. Setting it to "spherical" means that all clusters must be spherical (but they can have different diameters), "diag" means that the ellipsoidal clusters' axes must be parallel to the coordinate axes, and "tied" means that all clusters must have the same ellipsoidal shape, size, and orientation; the default, "full", places no constraint on the shapes.
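For example, the constraint is selected like this (these are the standard covariance_type values accepted by Scikit-Learn's GaussianMixture):
from sklearn.mixture import GaussianMixture

gm_full = GaussianMixture(n_components=3, n_init=10, covariance_type="full")            # default: any ellipsoid
gm_tied = GaussianMixture(n_components=3, n_init=10, covariance_type="tied")            # one shared ellipsoid
gm_spherical = GaussianMixture(n_components=3, n_init=10, covariance_type="spherical")  # spherical clusters
gm_diag = GaussianMixture(n_components=3, n_init=10, covariance_type="diag")            # axis-aligned ellipsoids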
Anomaly Detection using Gaussian Mixtures
Anomaly detection (also called outlier detection) is the task of detecting instances that deviate strongly from the norm. These instances are called anomalies or outliers, while the normal instances are called inliers. Anomaly detection is useful in fraud detection, manufacturing, or for removing outliers from a dataset. Using a Gaussian mixture model for anomaly detection is simple: any instance located in a low-density region can be considered an anomaly. You must define what density threshold you want to use. If you notice that you get too many false positives (i.e., perfectly good products that are flagged as defective), you can lower the threshold. Conversely, if you have too many false negatives (i.e., defective products that the system does not flag as defective), you can increase the threshold. This is the usual precision/recall tradeoff. See anomaly detection in the code below; the image below shows the anomalies in red.
A closely related task is novelty detection. It differs from anomaly detection in that the algorithm is assumed to be trained on a "clean" dataset, uncontaminated by outliers, whereas anomaly detection does not make this assumption. Outlier detection is often precisely used to clean up a dataset.
Selecting the Number of Clusters
With Gaussian mixture models, metrics like inertia or the silhouette score are not reliable (the clusters may not be spherical or of similar sizes), so instead you try to find the model that minimizes a theoretical information criterion such as the Bayesian information criterion (BIC) or the Akaike information criterion (AIC). Both of these criteria penalize models that have more parameters to learn (e.g., more clusters) and reward models that fit the data well. The two criteria often pick the same model; when they differ, the model selected by BIC tends to be simpler (fewer parameters) than the one selected by AIC, but it tends not to fit the data quite as well. The image below shows the BIC and AIC for different numbers of clusters k.
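GaussianMixture exposes both criteria through its bic() and aic() methods; a sketch of the model-selection loop (assuming the dataset X used in the Gaussian mixture code below):
import numpy as np
from sklearn.mixture import GaussianMixture

bics, aics = [], []
for k in range(1, 11):
    gm_k = GaussianMixture(n_components=k, n_init=10, random_state=42).fit(X)
    bics.append(gm_k.bic(X))
    aics.append(gm_k.aic(X))
best_k = int(np.argmin(bics)) + 1  # number of clusters with the lowest BIC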
Bayesian Information Criterion (BIC) and Akaike Information Criterion (AIC)
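The two criteria are defined as follows, where m is the number of instances, p is the number of parameters learned by the model, and L̂ is the maximized value of the likelihood function of the model:
BIC = log(m) p − 2 log(L̂)
AIC = 2 p − 2 log(L̂)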
Likelihood Function
Given a statistical model with some parameters θ, the word "probability" is used to describe how plausible a future outcome x is (knowing the parameter values θ), while the word "likelihood" is used to describe how plausible a particular set of parameter values θ is, after the outcome x is known. The probability density function is a function of x (with θ fixed), while the likelihood function is a function of θ (with x fixed). If you integrate the PDF over all possible values of x, you always get 1, but if you integrate the likelihood function over all possible values of θ, the result can be any positive value.
Bayesian Gaussian Mixture Models
Rather than manually searching for the optimal number of clusters, it is possible to use the BayesianGaussianMixture class instead, which is capable of giving weights equal (or close) to zero to unnecessary clusters. Just set the number of clusters n_components to a value that you have good reason to believe is greater than the optimal number of clusters, and the algorithm will eliminate the unnecessary clusters automatically. In this model, the cluster parameters (weights, means, and covariance matrices) are not treated as fixed model parameters anymore, but as latent variables, like the cluster assignments. So now z includes both the cluster parameters and the cluster assignments - see the image below.
>>> from sklearn.mixture import BayesianGaussianMixture
>>> bgm = BayesianGaussianMixture(n_components=10, n_init=10, random_state=42)
>>> bgm.fit(X)
>>> np.round(bgm.weights_, 2)
array([0.4 , 0.21, 0.4 , 0. , 0. , 0. , 0. , 0. , 0. , 0. ])
Prior knowledge about the latent variables z can be encoded in a probability distribution p(z) called the prior. For example, the weight_concentration_prior hyperparameter controls how concentrated the cluster weights are expected to be, and different values can result in very different clusterings - see the image below.
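A small sketch of how the prior can be adjusted (assuming the same dataset X; weight_concentration_prior is a real hyperparameter of BayesianGaussianMixture):
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# A low concentration prior favors fewer clusters, a high one favors more clusters
bgm_low = BayesianGaussianMixture(n_components=10, weight_concentration_prior=0.01,
                                  n_init=10, random_state=42).fit(X)
bgm_high = BayesianGaussianMixture(n_components=10, weight_concentration_prior=10_000,
                                   n_init=10, random_state=42).fit(X)
print(np.round(bgm_low.weights_, 2))
print(np.round(bgm_high.weights_, 2))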
Bayes' theorem tells us how to update the probability density function over the latent variables after we observe some data X . It computes the posterior distribution p(z∣X) , which is the conditional probability of z given X .
Bayes' Theorem
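In this notation, Bayes' theorem computes the posterior from the likelihood, the prior, and the evidence:
p(z∣X) = p(X∣z) p(z) / p(X)
where p(z∣X) is the posterior, p(X∣z) the likelihood, p(z) the prior, and p(X) the evidence.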
Unfortunately, in a Gaussian mixture model, the denominator p(X) is intractable, as it requires integrating over all possible values of z. This means considering all possible combinations of cluster parameters and cluster assignments.
The Evidence p(X) is Often Intractable
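Written out, the evidence is the integral of the likelihood times the prior over every possible value of the latent variables, which is what makes it intractable here:
p(X) = ∫ p(X∣z) p(z) dz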
This is one of the central problems in Bayesian statistics, and there are several approaches to solving it (which I won't go into here). If you want to dive deeper into Bayesian statistics, check out the Bayesian Data Analysis book. Gaussian mixture models work great on clusters with ellipsoidal shapes, but if you try to fit a dataset with differently shaped clusters, the results may be poor (see image below).
Other Anomaly Detection and Novelty Detection Algorithms
- Fast-MCD (minimum covariance determinant): this algorithm is useful for outlier detection, in particular to clean up a dataset.
- Isolation forest: efficient algorithm for outlier detection
- Local outlier factor (LOF): this algorithm is also good for outlier detection
- One-class SVM: better suited for novelty detection
from sklearn.mixture import GaussianMixture
import os
os.environ["OMP_NUM_THREADS"] = "4"
gm = GaussianMixture(n_components=3, n_init=10)
gm.fit(X)
print("Weights =",gm.weights_)
print("Means =",gm.means_)
print("Covariances =",gm.covariances_)
print("Algorithm Converged =",gm.converged_)
print("Iterations to Convergence =",gm.n_iter_)
# print("Prediction of Clusters =",gm.predict(X))
# print("Cluster Probabilities =",gm.predict_proba(X))
X_new, y_new = gm.sample(6)
print("GaussianMixtures is a generative model =>",y_new)
# print("Log of the Probability Density Function at Instance Location =",gm.score_samples(X))
# Anomaly Detection
densities = gm.score_samples(X)
density_threshold = np.percentile(densities, 4)
anomalies = X[densities < density_threshold]