Posted by: NQH | February 16, 2008

Lecture 8: another algorithm for computing frequent items

In today’s lecture I presented another algorithm for estimating frequent items, from [CCF02]. Hopefully the main line of thought is clear. In Atri’s earlier lecture and in two of my lectures, we made use of a widely useful concept: pairwise independent hash functions. Since we will need this concept and its generalization, k-wise independent hash functions, in a later lecture, let me briefly describe what they are.

Let K be a set of m “keys” and V=\{0,1,\dots,n-1\}. A family \mathcal H of functions from K to V is called a k-universal family if, for any set of k distinct keys x_1,\dots,x_k, we have

Prob \left[h(x_1)=h(x_2)= \cdots = h(x_k)\right] \leq \frac{1}{n^{k-1}}

where the probability is taken over a uniformly random choice of h from \mathcal H. The family is strongly k-universal (that is, k-wise independent) if for any set of k distinct keys x_1,\dots,x_k and any values y_1,\dots,y_k \in \{0,1,\dots,n-1\} we have

Prob \left[h(x_1)=y_1, \dots, h(x_k)=y_k \right] = \frac{1}{n^k}
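
As a quick check (a short derivation added here, not something spelled out in the lecture), the strong condition implies the weaker one. The events h(x_1)=\cdots=h(x_k)=y for different values y are disjoint, so summing over y gives

Prob \left[h(x_1)=h(x_2)= \cdots = h(x_k)\right] = \sum_{y=0}^{n-1} Prob \left[h(x_1)=y, \dots, h(x_k)=y \right] = n \cdot \frac{1}{n^k} = \frac{1}{n^{k-1}}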

In an earlier lecture we used a 2-universal family. The family of all functions from K to V certainly fits the bill, but picking a random function from this huge family requires \Omega(m\log n) bits, which is too many for our purpose. The family we used needs only O(\log n) random bits.
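
To make the O(\log n) claim concrete, here is a minimal Python sketch of one standard strongly 2-universal family, h_{a,b}(x) = (ax+b) mod p with p prime and K = V = \{0,\dots,p-1\}. This is an assumption for illustration: the post does not say which family was used in the lecture, and the names make_hash, random_hash and pair_probability are made up here. Choosing h costs two random numbers below p, i.e. about 2\log_2 p bits, and the brute-force check computes the probability of each pair of values exactly.

```python
import random
from fractions import Fraction
from itertools import product


def make_hash(p, a, b):
    """Return h_{a,b}(x) = (a*x + b) mod p, a map from Z_p to Z_p (p prime)."""
    return lambda x: (a * x + b) % p


def random_hash(p, rng=random):
    """Draw h uniformly from the family: two random values in [0, p)."""
    return make_hash(p, rng.randrange(p), rng.randrange(p))


def pair_probability(p, x1, x2, y1, y2):
    """Exact Prob[h(x1)=y1 and h(x2)=y2] over all p*p members of the family.

    Assumes x1 != x2; enumerates every (a, b), so only sensible for small p.
    """
    hits = 0
    for a, b in product(range(p), repeat=2):
        h = make_hash(p, a, b)
        if h(x1) == y1 and h(x2) == y2:
            hits += 1
    return Fraction(hits, p * p)
```

For example, pair_probability(7, 2, 5, 3, 3) evaluates to Fraction(1, 49), which is 1/n^k with n=7 and k=2. When the range n is smaller than p, the common variant ((ax+b) mod p) mod n is only approximately pairwise independent, so the exact equality above should not be expected there.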

In the next two weeks, I will present several papers on estimating F_0 and some statistics on (multi)graphs. Our starting points will be the following two papers:

  • Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In Proc. RANDOM 2002.
  • G. Cormode and S. Muthukrishnan. Space efficient mining of multigraph streams. In Proc. PODS 2005.
