Posted by: atri | May 10, 2008

## April 14th talk

(Guest post by Omar Mukhtar)

I presented the paper Efficient Computation of Frequent and Top-k Elements in Data Streams.

I presented the SpaceSaving data structure and algorithms for finding the Top-$k$ and Frequent elements in an insert-only setting. The key contribution of the paper is a unified approach to the very common Top-$k$ and Frequent elements problems that also guarantees the results. The main hurdle to a unified approach is that preprocessing for a Top-$k$ query does not help answer a Frequent elements query and vice versa, so the authors use a common data structure, SpaceSaving, to maintain stream statistics and then answer both kinds of queries. Thus, only one pass is made over the data to answer the Top-$k$ and Frequent elements queries.

The main idea is to maintain a set of counters that track the frequencies of individual elements. Each counter also keeps track of the over-estimation error in its element's frequency. The SpaceSaving data structure has two parts: a set of counters, each with a field for the monitored element and its over-estimation error $\epsilon$; and a sorted doubly linked list of buckets, where each element is attached to the bucket representing its frequency. Once the pass over the data stream is made and the SpaceSaving data structure is constructed, a pass over the counters is made to find the Top-$k$ and the Frequent elements. The over-estimation field is used to guarantee the accuracy of the Top-$k$ and Frequent elements queries.
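To make the counter-update rule concrete, here is a minimal Python sketch of the SpaceSaving update (deliberately simplified: a plain dictionary stands in for the paper's linked-bucket Stream-Summary structure, and the minimum counter is found by scanning):

```python
def space_saving(stream, m):
    """Track approximate frequencies with at most m counters.

    Returns {element: [count, over_estimation_error]}.
    """
    counters = {}  # element -> [count, epsilon]
    for e in stream:
        if e in counters:
            counters[e][0] += 1
        elif len(counters) < m:
            counters[e] = [1, 0]  # a free counter: exact so far, error 0
        else:
            # Evict the element with the minimum count; the new element
            # inherits that count as its over-estimation error epsilon.
            victim = min(counters, key=lambda k: counters[k][0])
            min_count = counters[victim][0]
            del counters[victim]
            counters[e] = [min_count + 1, min_count]
    return counters
```

For example, on the stream a,a,a,a,a,b,b,b,c,d with $m=3$ counters, c is evicted when d arrives, and d inherits c's count of 1 as its over-estimation error.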

The space bounds of the algorithm are lower than those of other algorithms, which solve the Top-$k$ and Frequent elements queries separately. Experimental results with synthetic as well as Zipfian data showed that the algorithm in the paper provides a very efficient way to track frequencies in small space with guaranteed results.

Posted by: atri | May 9, 2008

## April 28th talk

(Guest Post by Karthik Dwarakanath)

I presented the paper Streaming Algorithms for Robust, Real-Time Detection of DDoS Attacks. I presented two data structures and algorithms for the efficient tracking of potentially malicious connections. The key element of this approach is a novel data-streaming algorithm for efficiently tracking, in guaranteed small space and time, destination IP addresses that are large with respect to the number of distinct source IP addresses that access them.

The main idea in the basic estimator is to employ the distinct-count sketch synopsis to build an (appropriately sized) distinct sample of the observed active source-destination pairs. Once such a distinct sample is available, the top-$k$ estimation process is fairly straightforward: the destinations with the $k$ highest occurrence frequencies in the distinct sample are identified, and are returned with their frequencies appropriately scaled by the distinct-sampling rate. This algorithm can incur a high overhead for producing the top-$k$ frequency estimates and corresponding destination addresses. The update time for this algorithm is $O(r \log m)$ and the query time is $O(r s \log^2{m})$, where $r$ and $s$ are the numbers of "first level" and "second level" hash tables, respectively.
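As a toy illustration of the distinct-sampling idea (a deliberately simplified stand-in, not the paper's distinct-count sketch; the single-level sampling, helper names, and fixed sampling rate are assumptions for illustration):

```python
import hashlib

def h01(pair):
    # Hash a (src, dst) pair to a pseudo-uniform value in [0, 1).
    raw = hashlib.sha256(repr(pair).encode()).digest()
    return int.from_bytes(raw[:8], 'big') / 2**64

def topk_distinct(stream, k, rate=0.5):
    """Keep each distinct (s, d) pair with probability `rate` (the hash makes
    the decision consistent across repeats), then return the k destinations
    with the most sampled distinct sources, counts scaled by 1/rate."""
    sample = set()
    for s, d in stream:
        if h01((s, d)) < rate:
            sample.add((s, d))  # the set keeps the sample distinct
    counts = {}
    for s, d in sample:
        counts[d] = counts.get(d, 0) + 1
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])[:k]
    return [(d, c / rate) for d, c in ranked]
```

With `rate=1.0` this degenerates to exact distinct counting, which is handy for sanity checks; smaller rates trade accuracy for space.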

They provide a second algorithm, the "Tracking-DCS-based top-$k$ estimation algorithm", which offers guaranteed poly-logarithmic update and query times, while increasing the overall storage space by only a small constant factor over the baseline distinct-count sketch synopsis. The update time for this algorithm is $O(r\log^2{m})$ and the query time is $O(k \log m)$. Experimental results with synthetic data showed that the tracking version of the algorithm in the paper provides a very efficient way to track large distinct frequencies in small space and guaranteed small update/query time.

Posted by: atri | May 6, 2008

## April 25th talk

(Guest post by Denis Mindolin)

In my presentation, I showed the results from the following paper: Stabbing the Sky: Efficient Skyline Computation over Sliding Windows, ICDE ’05.

The paper deals with the problem of computing skylines over data streams. Given a data set $S$ of elements with $d$ attributes $(A_1,...,A_d)$, the skyline of $S$ is the set of all undominated elements of $S$. An element $X$ dominates another element $Y$ if for all $i$, $X_{A_i} \leq Y_{A_i}$, and for at least one $i$, $X_{A_i} < Y_{A_i}$. This problem has been a topic of active research for the past thirty years. However, the paper I presented is one of the first to consider skylines in the data stream framework.
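The dominance test and a naive skyline computation can be written directly from the definition (a quadratic-time reference implementation for illustration, not the paper's algorithm):

```python
def dominates(x, y):
    """x dominates y: x <= y coordinate-wise and strictly less somewhere."""
    return all(a <= b for a, b in zip(x, y)) and \
           any(a < b for a, b in zip(x, y))

def skyline(points):
    """Naive O(n^2) skyline: keep every point not dominated by another."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```

For example, among the 2-d points $(1,3), (2,2), (3,1), (2,3), (3,3)$, the point $(2,3)$ is dominated by $(2,2)$, so the skyline is $\{(1,3), (2,2), (3,1)\}$.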

The stream model considered in the paper is append only. A stream element is a vector of $d$ numeric components. It is also assumed that there is a sliding window of the most recent $N$ elements, and the queries issued are

1. Compute the skyline of the most recent $n$ elements ($n \leq N$), and
2. Compute the skyline of the elements between the most recent $n_1$ and $n_2$ ($n_1 \leq n_2 \leq N$).

To answer queries of the first type ($n$-of-$N$), the authors propose to keep in main memory a set of nonredundant elements $R_N$. The size of $R_N$ is $O(\log^d N)$ for some restricted streams. In order to avoid the quadratic size of the dominance graph over $R_N$, it is proposed to keep at most two dominance relationships per element in $R_N$, which are enough to answer $n$-of-$N$ queries. This makes the dominance graph size linear in the size of $R_N$. To answer skyline queries in $O(\log |R_N|)$ time, the authors propose to store the graph as a set of intervals ($O(|R_N|)$ space), which reduces the skyline problem to the already solved problem of answering stabbing queries.

The second query model ( $(n_1,n_2)$-of-$N$ ) is handled in a similar way with an exception that here the entire set of the most recent $N$ elements should be kept in main memory. As a result, the space requirement here is higher than in the first approach.

Posted by: atri | May 5, 2008

## Reminders

• The report is due today by midnight.
• If you haven’t submitted the blog entry for your presentation, please do so by tomorrow.
Posted by: atri | April 28, 2008

## Lunch

I would like to take you guys out for lunch this Wednesday or Thursday at noon. Use the comments section to let us know if you can make it (and if you prefer one day over the other).

Update (04/29) Based upon the comments Thursday looks like the better option. So let’s meet at noon on Thursday, May 1 in my office (Bell 123).

Posted by: atri | April 25, 2008

## April 18th talk

(Guest post by Yang Wang)

In my presentation, I showed results from the following three papers.

The three papers were developed in sequence. In the first paper, the authors proposed using multiple passes with the ability to write onto an intermediate stream and also to sort that stream (since for some difficult problems, such as graph problems, the basic data stream model is too weak). Thus this improves the power of the "basic" data stream model. They define the class $\mathrm{polylog\ StrSort}$ to be the set of problems that can be solved in polylog passes and polylog (internal memory) space. They then showed that graph connectivity, maximum spanning tree, and many other problems are in this class. However, there do exist problems that are not in this class. In particular, they exhibited a problem for which any algorithm using $p$ passes and $m$ space must satisfy $pm = \Omega(n^{1/3})$, where $n$ is the input length.

In the second paper, the authors consider the case when we can write to intermediate streams but cannot sort them. This new model is called the $\mathrm{W}$-stream. They use ideas from parallel algorithms for the transitive closure problem to attack the connected components and single-source shortest paths problems. Later they extended the idea to more general cases in the third paper. They found the following relationship between parallel algorithms and the $\mathrm{W}$-stream model: every $n$-processor, $m$-memory, $T$-time PRAM algorithm can be simulated in the $\mathrm{W}$-stream model with $O(\frac{Tn\log{m}}{s})$ passes using space $s$. They then observed that, though the $\mathrm{W}$-stream model can be used to simulate PRAM, it is in a sense more powerful than PRAM: in one pass of the $\mathrm{W}$-stream model every memory cell in the entire stream can be accessed, while in one PRAM step only a constant number of cells can be accessed per processor. Thus the strategy is to convert a PRAM algorithm to an RPRAM algorithm and then to the $\mathrm{W}$-stream model. In this way, the authors solved many problems such as maximum independent set and biconnected components.

Posted by: atri | April 22, 2008

## April 21st talk

(Guest post by Xi Zhang)

I presented [JKV07] on how to compute aggregates $aggr$, where $aggr\in\{ \mathrm{SUM, COUNT, AVG, MIN, MAX}\}$, over a probabilistic data stream. Only those five aggregates are considered, as they are the major concerns in databases.

First, I introduced the probabilistic data stream model, where, in contrast to the classical (deterministic) stream model, we have pdfs instead of elements from some domain. In a probabilistic data stream $\mathcal{P}=\vartheta_1, \vartheta_2, \ldots, \vartheta_n$, each pdf $\vartheta_i$ is over the base domain $\{1,\ldots,R\} \cup \{ \bot \}$, where the special symbol $\bot$ indicates no element is produced. Any deterministic stream that is a possible outcome of $\mathcal{P}$ is called a possible stream of $\mathcal{P}$. The semantics of $aggr$ over $\mathcal{P}$ is the expected value of $aggr$ over all the possible streams of $\mathcal{P}$.

Then, I went on to talk about how to evaluate those five aggregates over a probabilistic stream one by one.

For $\mathrm{SUM}$ and $\mathrm{COUNT}$, [BDJ05] shows exact algorithms to compute those two aggregates in one-pass with constant space and update time.
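By linearity of expectation, these two aggregates reduce to running per-pdf sums, which is why constant space and update time suffice. A minimal sketch (my own illustration, representing each pdf as a dict from values to probabilities, with the missing probability mass standing for $\bot$):

```python
def expected_sum_count(stream):
    """One-pass exact E[SUM] and E[COUNT] over a probabilistic stream.

    By linearity of expectation:
      E[COUNT] = sum_i Pr[item i is produced]
      E[SUM]   = sum_i E[value contributed by item i]
    """
    e_sum = e_count = 0.0
    for pdf in stream:
        e_count += sum(pdf.values())             # Pr[not bottom]
        e_sum += sum(v * p for v, p in pdf.items())
    return e_sum, e_count
```

For instance, the stream of two pdfs $\{1\!:\!0.5,\,2\!:\!0.5\}$ and $\{3\!:\!0.5\}$ (the second produces $\bot$ with probability $0.5$) has $E[\mathrm{COUNT}]=1.5$ and $E[\mathrm{SUM}]=3.0$.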

$\mathrm{AVG}$ is an interesting case. The paper presents algorithms from the generating function point of view.
It shows a generating function $h_{\textnormal{AVG}}(x)$ such that $\mathrm{AVG}=\int^1_{0}h_{\mathrm{AVG}}(x)dx$. $h_{\mathrm{AVG}}(x)$ is in the form of a high-degree polynomial. The paper exhibits a data stream algorithm for estimating this definite integral to a relative error $(1+\varepsilon)$ in $O(\log n)$ passes over the data with $O(\frac{1}{\varepsilon}\log^2 n)$ space and $O(\frac{1}{\varepsilon}\log n)$ update time per pdf. The approximation here focuses on the approximation of a definite integral, where rectangle approximation is used.
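The numerical core here is ordinary rectangle approximation of a definite integral. A toy midpoint-rule version over $[0,1]$ (the paper's streaming algorithm is far more involved, since it must bound the relative error using limited passes and space, but this is the underlying idea):

```python
def integrate_rect(f, n):
    """Midpoint rectangle rule for the definite integral of f over [0, 1]
    with n equal-width rectangles."""
    w = 1.0 / n
    return sum(f((i + 0.5) * w) for i in range(n)) * w
```

For example, `integrate_rect(lambda x: 3 * x * x, 1000)` approximates $\int_0^1 3x^2\,dx = 1$ to within about $10^{-6}$.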

The paper proves that any one-pass exact algorithm for computing $\mathrm{AVG}$ requires $\Omega(n)$ space. Complementing this result, the paper also presents an exact algorithm for computing $\mathrm{AVG}$ which runs in $O(n\log^2 n)$ time and $O(n)$ space. This is still better than the previously best-known exact algorithm, which uses the same amount of space but runs in $O(n^3)$ time.

For $\mathrm{MIN}$ and $\mathrm{MAX}$, the paper presents a one-pass data stream algorithm with relative accuracy $(1+\varepsilon)$, using $O(\frac{1}{\varepsilon}\lg R)$ space and constant update time per pdf (in fact, the update time is $O(\ell\lg \ell)$, where $\ell$ is the maximum support size over all pdfs; whether this really counts as "constant" is arguable). The approximation technique used here is "binning".

Posted by: atri | April 14, 2008

## April 11th talk

(Guest post by Steve Uurtamo)

I presented a few algorithms for the detection of superspreaders, where superspreaders are defined to be sources in a networking data stream that contact many unique destinations. The paper was New algorithms for fast detection of superspreaders, by S. Venkataraman, D. Song, P. Gibbons, and A. Blum.

The basic algorithm is to store a uniformly chosen fraction of the $(s,d)$ pairs, using a hash function to ensure that no $(s,d)$ pair is counted more than once. Assuming uniformity of the hash function, to detect $k$-superspreaders it suffices to hold onto $O(\frac{n\log(\frac{1}{\delta})}{k})$ elements of a datastream of $n$ elements; if superspreaders are as rare and as chatty as expected, the storage will be modest.
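A minimal sketch of this hash-based distinct sampling (the hash helper and the simple reporting threshold are my own illustrative assumptions, not the paper's exact procedure or guarantees):

```python
import hashlib

def h01(pair):
    # Hash a (src, dst) pair to a pseudo-uniform value in [0, 1).
    raw = hashlib.sha256(repr(pair).encode()).digest()
    return int.from_bytes(raw[:8], 'big') / 2**64

def superspreaders(stream, k, rate):
    """Report sources contacting roughly >= k distinct destinations.

    Hashing each (s, d) pair makes the sampling decision consistent across
    repeats, so a repeated pair is counted at most once."""
    seen = set()
    for s, d in stream:
        if h01((s, d)) < rate:
            seen.add((s, d))
    counts = {}
    for s, d in seen:
        counts[s] = counts.get(s, 0) + 1
    # A source sampled at rate `rate` should retain about rate * (#distinct
    # destinations) pairs, so compare against the scaled threshold.
    return {s for s, c in counts.items() if c >= rate * k}
```

With `rate=1.0` the sampling is exact; in practice the rate is chosen from $k$ and the failure probability $\delta$.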

Modifications of this algorithm were presented for the sliding window model, where we are continuously monitoring a datastream over a fixed length window, and for parallel monitoring, where we need to combine the information in several different datastreams to detect superspreaders.

Experimental results with synthetic data showed that the algorithms in the paper were much more space-efficient than simply combining existing approximate counting and distinct counting stream algorithms together.

Posted by: atri | April 12, 2008

## Report

A 2-3 page report is due on May 5 (midnight). When you are done, email it to both Hung and me. We basically want you to summarize your thoughts on some problem related to the seminar that you thought about. If you have any questions and/or comments, please use the comments section of this post.

Update (4/14) When you are suggesting an open problem, please also summarize what is already known about it.

Posted by: atri | April 12, 2008

## April 7th presentation

(Guest post by Demian Lessa)

I presented the main results and algorithms of the SODA08 paper “Declaring Independence via the Sketching of Sketches,” by Indyk and McGregor.

The authors consider the problem of computing the correlation, that is, the degree of independence, of data streams. In particular, if pairs $(i,j) \in [n]^2$ appear in a data stream, their frequencies define a joint distribution $(X,Y)$, and the goal is to measure how far the joint distribution is from the product of the marginals. In the centralized model, the coordinates $i$ and $j$ appear together in the stream, while in the distributed model each coordinate may appear separately. Three measures of closeness are used: the $\ell_1$ norm, the $\ell_2$ norm, and the mutual information between $X$ and $Y$. All positive results in the paper are obtained in the centralized model.

In order to obtain a $\widetilde{O}(\epsilon^{-2}\log\delta^{-1})$-space, $(1+\epsilon, \delta)$-approximation for the $\ell_2$ distance between the joint distribution and the product of the marginals, the authors explore the techniques in [AMS] for the computation of $F_2$. In particular, they utilize two $4$-wise independent vectors $\mathbf{x}$ and $\mathbf{y}$ (of size $n$), constructed using the parity check matrix of BCH codes, to compute vector $\mathbf{z}$ (of size $n^2$) defined as the outer product of $\mathbf{x}$ and $\mathbf{y}$. They show that, even though $\mathbf{z}$ is not $4$-wise independent, the variance can still be bounded as necessary by observing the geometric relationship on the indices of $\mathbf{z}$. The algorithm is defined naturally by composing sketches and producing an estimate for the square of the $\ell_2$ distance. A set of weak estimates followed by an application of the “median trick” yields the claimed approximation and space bounds.
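The basic $\ell_2$ estimator can be illustrated with a toy Monte Carlo version that swaps the 4-wise independent BCH-based vectors for fully random $\pm 1$ vectors and the median trick for plain averaging (both substitutions change the guarantees, so this only sketches the expectation argument, not the paper's algorithm):

```python
import random

def l2_independence_estimate(stream, trials=200, seed=0):
    """Estimate ||joint - product of marginals||_2^2 for a stream of (i, j)
    pairs. Each trial draws random sign vectors x, y and computes
      ( avg x[i]y[j]  -  (avg x[i]) * (avg y[j]) )^2,
    whose expectation (with fully random signs) is the squared l2 distance.
    """
    rng = random.Random(seed)
    m = len(stream)
    idom = sorted({i for i, _ in stream})
    jdom = sorted({j for _, j in stream})
    acc = 0.0
    for _ in range(trials):
        x = {i: rng.choice((-1, 1)) for i in idom}
        y = {j: rng.choice((-1, 1)) for j in jdom}
        joint = sum(x[i] * y[j] for i, j in stream) / m   # sketch of joint
        mx = sum(x[i] for i, _ in stream) / m             # sketch of X marginal
        my = sum(y[j] for _, j in stream) / m             # sketch of Y marginal
        acc += (joint - mx * my) ** 2
    return acc / trials
```

For a stream whose empirical joint distribution is exactly the product of its marginals (e.g. all four pairs over $\{0,1\}^2$ once each), every trial evaluates to zero, while a perfectly correlated stream gives a strictly positive estimate.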

In order to obtain a $\widetilde{O}(\log\delta^{-1})$-space, $(O(\log n), \delta)$-approximation for the $\ell_1$ distance between the joint distribution and the product of the marginals, the authors explore the techniques in [I06] for the computation of $\ell_1$. In particular, they utilize vectors drawn from the Cauchy and $T$-truncated Cauchy distributions to produce an estimate for $\ell_1$. Unlike the algorithm for $\ell_2$, the estimate for $\ell_1$ is produced as the median of $O(1)$ estimators. By repeating the process $O(\log\delta^{-1})$ times, the claimed approximation and space bounds follow.

The remaining results for the centralized model consist of extensions to the $\ell_1$ algorithm, and an approximation of the distance by the mutual information between $X$ and $Y$. Finally, negative results were presented for the distributed model.