(Guest post by Omar Mukhtar)
I presented the paper Efficient Computation of Frequent and Top-k Elements in Data
I presented the SpaceSaving data structure and algorithms for finding the Top- and Frequent elements in an insert-only setting. The key element of the paper is a unified approach to solving the very common Top- and Frequent element problem as well as guaranteeing the results. The main hurdle to a unified approach is that a Top- query can’t preprocess a Frequent element query and vice versa, so the authors use a common data structure, SpaceSaving, that is used to maintain stream statistics, and then answer queries. Thus, only one-pass is made over data to answer the Top- and Frequent element query.
The main idea is to have a set of counters which keep an exact frequency of individual elements. Each counter also keeps track of the over-estimation error in the elements frequency. The SpaceSaving data structure has two parts, a counter which has a field representing the element, and the over estimation error, . The second part is a sorted doubly linked of buckets, each element is attached to a bucket which represents that elements frequency. Once the pass over the Data Stream is made and the SpaceSaving data structure constructed, a pass over the counters is made to find the Top- and the Frequent elements. The over estimation field is used to guarentee the accuracy of the Top- and Frequent element query.
The space bounds of the algorithm are lower than other algorithms which solve the Top- and Frequent element query separately. Experimental results with synthetic as well as Zipfian data showed that the algorithm in the paper provides a very efficient way to track frequencies in small space with guaranteed results.