Posted by: atri | April 22, 2008

April 21st talk

(Guest post by Xi Zhang)

I presented [JKV07] on how to compute aggregates aggr, where aggr\in\{ \mathrm{SUM, COUNT, AVG, MIN, MAX}\}, over a probabilistic data stream. Only those five aggregates are considered as they are the major concerns in databases.

First, I introduced the probabilistic data stream model, where in contrast to the classical (deterministic) stream model, we have pdfs instead of elements from some domain. In a probabilistic data stream \mathcal{P}=\vartheta_1, \vartheta_2, \ldots, \vartheta_n, each pdf \vartheta_i is over the base domain \{1,\ldots,R\} \cup \{ \bot \}, where the special symbol \bot indicates no element is produced. Any deterministic stream which is a possible outcome of \mathcal{P} is a possible stream of \mathcal{P}. And the semantics of aggr over \mathcal{P} is the expected value of aggr over all the possible streams of \mathcal{P}.

Then, I went on to talk about how to evaluate those five aggregates over a probabilistic stream one by one.

For \mathrm{SUM} and \mathrm{COUNT}, [BDJ05] shows exact algorithms to compute those two aggregates in one-pass with constant space and update time.

\mathrm{AVG} is an interesting case. The paper presents algorithms from the generating function point of view.
It shows a generating function h_{\textnormal{AVG}}(x) such that \mathrm{AVG}=\int^1_{0}h_{\mathrm{AVG}}(x)dx. h_{\mathrm{AVG}}(x) is in the form of a high-degree polynomial. The paper exhibits a data stream algorithm for estimating this definite integral to a relative error (1+\varepsilon) in O(\log n) passes over the data with O(\frac{1}{\varepsilon}\log^2 n) space and O(\frac{1}{\varepsilon}\log n) update time per pdf. The approximation here focuses on the approximation of a definite integral, where rectangle approximation is used.

The paper proves the lower bound of any one-pass exact algorithm for computing \mathrm{AVG} is \Omega(n). However, complementing this result, this paper also presents an exact algorithm for computing \mathrm{AVG}, which runs in O(n\log^2 n) time and O(n) space. It is still better than the previously best-known exact algorithm which uses the same amount of space but runs in O(n^3) time.

For \mathrm{MIN} and \mathrm{MAX}, the paper presents a one-pass data stream algorithm with relative accuracy (1+\varepsilon), using O(\frac{1}{\varepsilon}\lg R) space and constant update time per pdf (in fact, it is O(\ell\lg \ell), where \ell is the size of maximum support of all pdfs. Whether it is really a “constant” is arguable). The approximation technique used here is “binning”.

Leave a response

Your response:

Categories