Using the Triangle Inequality to Accelerate k-Means

Looking for more research papers to read, I went back through my Hands-On Machine Learning notes and collected the many papers referenced there; they are mainly on machine learning and deep learning topics. This is one of those papers.

Reference: Using the Triangle Inequality to Accelerate k-Means (Charles Elkan, ICML 2003)

Introduction

The k-means algorithm is by far the most widely used method for discovering clusters in data. The paper shows how to accelerate it dramatically while still computing exactly the same result as the standard algorithm. The accelerated algorithm avoids unnecessary distance calculations by applying the triangle inequality in two different ways, and by keeping track of lower and upper bounds for distances between points and centers. Experiments show that the new algorithm is effective for datasets with up to 1,000 dimensions, and becomes more and more effective as the number of clusters k increases.
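For reference, the two uses of the triangle inequality can be stated as follows, for a point x and centers b and c; the notation here is mine rather than copied verbatim from the paper:

```latex
% If c is at least twice as far from b as x is, then c cannot be
% closer to x than b, so d(x, c) never needs to be computed:
d(b, c) \ge 2\,d(x, b) \;\Longrightarrow\; d(x, c) \ge d(x, b)

% A known d(x, b) and the center-to-center distance d(b, c) give a
% lower bound on d(x, c) without computing it:
d(x, c) \ge \max\{0,\; d(x, b) - d(b, c)\}
```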

k-means is considered a fast method because it is not based on computing distances between all pairs of data points. However, the algorithm is still slow in practice for large datasets. The number of distance computations is nke where n is the number of data points, k is the number of clusters to be found, and e is the number of iterations required. e grows sublinearly with k, n, and the dimensionality d of the data.
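As a concrete baseline, here is a minimal sketch of the standard assignment step (my own illustration, not code from the paper); it makes explicit why the cost is n·k distance computations per iteration, i.e. nke overall:

```python
import numpy as np

def assign_standard(X, centers):
    """Standard k-means assignment step: every point-center distance is computed.

    With n points and k centers this is n*k distance computations per pass,
    hence n*k*e over e iterations.
    """
    # Squared Euclidean distances between all points and all centers, shape (n, k).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Example: n = 2,000 points, k = 20 centers, d = 50 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 50))
centers = X[rng.choice(len(X), size=20, replace=False)]
labels = assign_standard(X, centers)  # 2,000 * 20 = 40,000 distances in this one pass
```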

The main contribution of this paper is an optimized version of the standard k-means method, with which the number of distance computations is in practice closer to n than to nke. The optimized algorithm is based on the fact that most distance calculations in standard k-means are redundant. If a point is far away from a center, it is not necessary to calculate the exact distance between the point and that center in order to know that the point should not be assigned to it. Conversely, if a point is much closer to one center than to any other, calculating exact distances is not necessary to know that the point should be assigned to the first center.
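The following sketch (a simplification with my own naming, not the full algorithm from the paper) shows how a per-point upper bound and per-point-per-center lower bounds let the assignment step skip exact distance computations:

```python
import numpy as np

def assign_with_bounds(X, centers, labels, upper, lower, dist=None):
    """One pruned assignment pass over the data.

    upper[i]    : upper bound on d(X[i], centers[labels[i]])
    lower[i, c] : lower bound on d(X[i], centers[c])
    An exact distance to center c is computed only when the bounds cannot
    prove that c is no closer than the currently assigned center.
    """
    if dist is None:
        dist = lambda a, b: float(np.linalg.norm(a - b))  # any black-box metric works

    n_computed = 0
    for i, x in enumerate(X):
        d_cur = None  # exact distance to the current center, computed lazily at most once
        for c in range(len(centers)):
            if c == labels[i] or lower[i, c] >= upper[i]:
                continue  # the bounds already rule c out
            if d_cur is None:
                d_cur = dist(x, centers[labels[i]])  # tighten the upper bound once
                n_computed += 1
                upper[i] = lower[i, labels[i]] = d_cur
                if lower[i, c] >= upper[i]:
                    continue  # the tightened bound now rules c out as well
            d_new = dist(x, centers[c])
            n_computed += 1
            lower[i, c] = d_new
            if d_new < d_cur:
                labels[i] = c
                upper[i] = d_cur = d_new
    return labels, n_computed
```

In the full algorithm the bounds are also kept valid between iterations: each lower bound is decreased by the distance its center moved, and each upper bound is increased by the distance the assigned center moved, which is what lets most of these checks succeed without computing any exact distance.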

The optimized algorithm satisfies the following requirements:

  • It can start with any initial centers
  • Given the initial centers, it always produces the same final centers as the standard algorithm
  • It can use any black-box distance metric, so it does not rely on, for example, optimizations specific to Euclidean distance (a short usage sketch follows this list)
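To make the last point concrete, the pruning sketch above touches the metric only through a single callable, so swapping in, say, Manhattan distance (an illustrative choice, not something prescribed by the paper) changes nothing else. Running the same pass twice with fixed centers also shows the bounds at work, since the second pass needs no exact distances at all:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 16))
centers = X[rng.choice(len(X), size=8, replace=False)]
labels = np.zeros(len(X), dtype=int)

# Trivially valid starting bounds: 0 is a lower bound on any distance,
# +inf an upper bound, so the first pass computes whatever it needs.
upper = np.full(len(X), np.inf)
lower = np.zeros((len(X), len(centers)))

manhattan = lambda a, b: float(np.abs(a - b).sum())  # any black-box metric

labels, first = assign_with_bounds(X, centers, labels, upper, lower, dist=manhattan)
labels, second = assign_with_bounds(X, centers, labels, upper, lower, dist=manhattan)
print(first, second)  # roughly n*k exact distances on the first pass, 0 on the second
```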
