(Guest post by Xi Zhang)
I presented [JKV07] on how to compute aggregates , where
, over a probabilistic data stream. Only those five aggregates are considered as they are the major concerns in databases.
First, I introduced the probabilistic data stream model, where in contrast to the classical (deterministic) stream model, we have pdfs instead of elements from some domain. In a probabilistic data stream , each pdf
is over the base domain
, where the special symbol
indicates no element is produced. Any deterministic stream which is a possible outcome of
is a possible stream of
. And the semantics of
over
is the expected value of
over all the possible streams of
.
Then, I went on to talk about how to evaluate those five aggregates over a probabilistic stream one by one.
For and
, [BDJ05] shows exact algorithms to compute those two aggregates in one-pass with constant space and update time.
is an interesting case. The paper presents algorithms from the generating function point of view.
It shows a generating function such that
.
is in the form of a high-degree polynomial. The paper exhibits a data stream algorithm for estimating this definite integral to a relative error
in
passes over the data with
space and
update time per pdf. The approximation here focuses on the approximation of a definite integral, where rectangle approximation is used.
The paper proves the lower bound of any one-pass exact algorithm for computing is
. However, complementing this result, this paper also presents an exact algorithm for computing
, which runs in
time and
space. It is still better than the previously best-known exact algorithm which uses the same amount of space but runs in
time.
For and
, the paper presents a one-pass data stream algorithm with relative accuracy
, using
space and constant update time per pdf (in fact, it is
, where
is the size of maximum support of all pdfs. Whether it is really a “constant” is arguable). The approximation technique used here is “binning”.
Posted in presentation