(Guest post by Demian Lessa)
I presented the main results and algorithms of the SODA08 paper “Declaring Independence via the Sketching of Sketches,” by Indyk and McGregor.
The authors consider the problem of computing the correlation- that is, the degree of independence, of data streams. In particular, if are pairs appearing in a data stream, the frequencies of such pairs define a joint distribution
, and the goal is to compute the correlation between
and the product of the marginals. In the centralized model, the coordinates
and
appear together in the stream, while in the distributed model each coordinate may appear separately. Three measures of closeness are used to approximate the correlation- the
norm, the
norm, and the mutual information between
and
. All positive results in the paper are obtained in the centralized model.
In order to obtain a -space,
-approximation for the
distance between the joint distribution and the product of the marginals, the authors explore the techniques in [AMS] for the computation of
. In particular, they utilize two
-wise independent vectors
and
(of size
), constructed using the parity check matrix of BCH codes, to compute vector
(of size
) defined as the outer product of
and
. They show that, even though
is not
-wise independent, the variance can still be bounded as necessary by observing the geometric relationship on the indices of
. The algorithm is defined naturally by composing sketches and producing an estimate for the square of the
distance. A set of weak estimates followed by an application of the “median trick” yields the claimed approximation and space bounds.
In order to obtain a -space,
-approximation for the
distance between the joint distribution and the product of the marginals, the authors explore the techniques in [I06] for the computation of
. In particular, they utilize vectors drawn from the Cauchy and
-Truncated Cauchy distributions to produce an estimate for
. Differently from the algorithm for
, the estimate for
is produced by the median of
estimators. By repeating the process
times, the claimed approximation and space bounds follow.
The remaining results for the centralized model consist of extensions to the algorithm, and an approximation of the distance by the mutual information between
and
. Finally, negative results were presented for the distributed model.
Posted in presentation