(Guest post by Abhishek Murthy)
This talk was about clustering of data streams. A common formulation of the
-median problem is as follows: Find
centers in a set of
points so as to minimize the sum of distances of data points to their closest cluster centers. This paper proposes an
time algorithm which uses
space in the data stream model of computation. The line of attack used by the authors has two main aspects.
Firstly the paper proves that it is possible to obtain a constant factor approximation to the clustering problem in space
but time
using a multilevel recursive call to an
approximation clustering routine which uses linear programming. This routine is called on pieces of data streams. Clearly the length of each piece determines the number of calls and hence the depth of the recursion tree.
Thus the main idea is to compress the information of a fixed number of points and always maintain a constant number of
medians. As the algorithm proceeds through all the levels and reaches the penultimate level, we have
medians to cluster.
Next to drive down the running time of the algorithm to around linear, the sampling approach proposed in Indyk’s paper is used. Hence rather than clustering all the points of each piece in the above mentioned process, a random sample is chosen such that the sampled points are “good” representatives of the points in the corresponding pieces. Then the same
approximation algorithm is run on the smaller sample to obtain the cluster centers. This drives down the running time of the complete algorithm to
.
Further the authors reduce the clustering problem to the graph
-Partition problem which has a lower bound of
queries to be made on the adjacency matrix of the graph to get the
partitions. Thus the authors prove that their algorithm provides a lower bound for the clustering problem’s running time.
The main tool used to prove the first phase was to use triangle inequality provided to us by metric spaces. Using this we can obtain a recursive equation on the approximation obtained by the multilevel algorithm and hence bound the approximation of the overall problem to a constant factor.
The sampling phase of each piece again uses triangle inequality to bind the approximation factor. The individual terms in the inequality are separately bounded using Chebyshev.
Thus the two main drive home points from this paper were the piece-wise processing concept and the sampling approach to solve the clustering problem on each of these pieces.
The main drawbacks identified were the inability to handle weighted points and find non-spherical arbitrary shaped clusters. We also discussed some applications of clustering on data streams and recent trends in the research of this problem.