Posted by: atri | April 12, 2008

April 7th presentation

(Guest post by Demian Lessa)

I presented the main results and algorithms of the SODA08 paper “Declaring Independence via the Sketching of Sketches,” by Indyk and McGregor.

The authors consider the problem of computing the correlation- that is, the degree of independence, of data streams. In particular, if (i,j) \in [n]^2 are pairs appearing in a data stream, the frequencies of such pairs define a joint distribution (X,Y), and the goal is to compute the correlation between (X,Y) and the product of the marginals. In the centralized model, the coordinates i and j appear together in the stream, while in the distributed model each coordinate may appear separately. Three measures of closeness are used to approximate the correlation- the \ell_1 norm, the \ell_2 norm, and the mutual information between X and Y. All positive results in the paper are obtained in the centralized model.

In order to obtain a \widetilde{O}(\epsilon^{-2}\log\delta^{-1})-space, (1+\epsilon, \delta)-approximation for the \ell_2 distance between the joint distribution and the product of the marginals, the authors explore the techniques in [AMS] for the computation of F_2. In particular, they utilize two 4-wise independent vectors \mathbf{x} and \mathbf{y} (of size n), constructed using the parity check matrix of BCH codes, to compute vector \mathbf{z} (of size n^2) defined as the outer product of \mathbf{x} and \mathbf{y}. They show that, even though \mathbf{z} is not 4-wise independent, the variance can still be bounded as necessary by observing the geometric relationship on the indices of \mathbf{z}. The algorithm is defined naturally by composing sketches and producing an estimate for the square of the \ell_2 distance. A set of weak estimates followed by an application of the “median trick” yields the claimed approximation and space bounds.

In order to obtain a \widetilde{O}(\log\delta^{-1})-space, (O(\log n), \delta)-approximation for the \ell_1 distance between the joint distribution and the product of the marginals, the authors explore the techniques in [I06] for the computation of \ell_1. In particular, they utilize vectors drawn from the Cauchy and T-Truncated Cauchy distributions to produce an estimate for \ell_1. Differently from the algorithm for \ell_2, the estimate for \ell_1 is produced by the median of O(1) estimators. By repeating the process O(\log\delta^{-1}) times, the claimed approximation and space bounds follow.

The remaining results for the centralized model consist of extensions to the \ell_1 algorithm, and an approximation of the distance by the mutual information between X and Y. Finally, negative results were presented for the distributed model.

Leave a response

Your response:

Categories