Posted by: atri | May 10, 2008

April 14th talk

(Guest post by Omar Mukhtar)

I presented the paper Efficient Computation of Frequent and Top-k Elements in Data
Streams
.

I presented the SpaceSaving data structure and algorithms for finding the Top-k and Frequent elements in an insert-only setting. The key element of the paper is a unified approach to solving the very common Top-k and Frequent element problem as well as guaranteeing the results. The main hurdle to a unified approach is that a Top-k query can’t preprocess a Frequent element query and vice versa, so the authors use a common data structure, SpaceSaving, that is used to maintain stream statistics, and then answer queries. Thus, only one-pass is made over data to answer the Top-k and Frequent element query.

The main idea is to have a set of counters which keep an exact frequency of individual elements. Each counter also keeps track of the over-estimation error in the elements frequency. The SpaceSaving data structure has two parts, a counter which has a field representing the element, and the over estimation error, \epsilon. The second part is a sorted doubly linked of buckets, each element is attached to a bucket which represents that elements frequency. Once the pass over the Data Stream is made and the SpaceSaving data structure constructed, a pass over the counters is made to find the Top-k and the Frequent elements. The over estimation field is used to guarentee the accuracy of the Top-k and Frequent element query.

The space bounds of the algorithm are lower than other algorithms which solve the Top-k and Frequent element query separately. Experimental results with synthetic as well as Zipfian data showed that the algorithm in the paper provides a very efficient way to track frequencies in small space with guaranteed results.

Leave a response

Your response:

Categories