Posted by: atri | April 12, 2008

April 4th presentation

(Guest post by Anh Le)

I presented the sliding window model which computes aggregates and statistics over data streams of the last N data elements seen so far. There are a wide range of applications that require statistics over the most recently observed data elements, such as intrusion detection in network monitoring, stock price prediction, weblog mining, user click stream for personalization etc. First, we considered a basic counting problem whose solution is a prerequisite for efficient maintenance of a variety of more complex statistical aggregates. We showed a solution for basic counting which use O(\frac{1}{\epsilon}log^2N) bits of memory to estimate the number of ones to within a factor of 1+\epsilon. Then, we showed the lower bound of \Omega(\frac{1}{\epsilon}log^2N) memory bits for any deterministic or randomized algorithms. We then extended the scheme to maintain the sum of the last N positive integers. Finally, we briefly discussed the scheme to adapt several other problems in the sliding window model. However, maintaining other statistics is more complex than those simple aggregates.

Posted by: atri | April 12, 2008

March 17th presentation

(Guest post by Abhishek Murthy)

This talk was about clustering of data streams. A common formulation of the k-median problem is as follows: Find k centers in a set of n points so as to minimize the sum of distances of data points to their closest cluster centers. This paper proposes an O(nk) time algorithm which uses O(n^{\epsilon}) space in the data stream model of computation. The line of attack used by the authors has two main aspects.

Firstly the paper proves that it is possible to obtain a constant factor approximation to the clustering problem in space O(n^{\epsilon}) but time O(n^{1+\epsilon}) using a multilevel recursive call to an (a,b) approximation clustering routine which uses linear programming. This routine is called on pieces of data streams. Clearly the length of each piece determines the number of calls and hence the depth of the recursion tree.

Thus the main idea is to compress the information of a fixed number of points and always maintain a constant number of O(k) medians. As the algorithm proceeds through all the levels and reaches the penultimate level, we have O(k) medians to cluster.

Next to drive down the running time of the algorithm to around linear, the sampling approach proposed in Indyk’s paper is used. Hence rather than clustering all the points of each piece in the above mentioned process, a random sample is chosen such that the sampled points are “good” representatives of the points in the corresponding pieces. Then the same (a,b) approximation algorithm is run on the smaller sample to obtain the cluster centers. This drives down the running time of the complete algorithm to O(nk).

Further the authors reduce the clustering problem to the graph k-Partition problem which has a lower bound of \Omega(nk) queries to be made on the adjacency matrix of the graph to get the k partitions. Thus the authors prove that their algorithm provides a lower bound for the clustering problem’s running time.

The main tool used to prove the first phase was to use triangle inequality provided to us by metric spaces. Using this we can obtain a recursive equation on the approximation obtained by the multilevel algorithm and hence bound the approximation of the overall problem to a constant factor.

The sampling phase of each piece again uses triangle inequality to bind the approximation factor. The individual terms in the inequality are separately bounded using Chebyshev.

Thus the two main drive home points from this paper were the piece-wise processing concept and the sampling approach to solve the clustering problem on each of these pieces.

The main drawbacks identified were the inability to handle weighted points and find non-spherical arbitrary shaped clusters. We also discussed some applications of clustering on data streams and recent trends in the research of this problem.

Posted by: atri | March 31, 2008

Last theory seminar reminder

A gentle reminder that the last theory seminar talk will be take place tomorrow at 10:45am in Bell 242 (note the unusual time and place).

From Friday, we will resume with student presentations.

Posted by: atri | March 25, 2008

Rest of the theory seminar talks

Ashwin Lall will talk about Estimating the Entropy of a Stream at 11am in Bell 224 on Friday, March 28. This talk is on data stream algorithms with real network data, so it should be of great interest to all of you.

Prasad Raghavendra will talk about Optimal Algorithms and Inapproximability Results for every CSP? at 10:45 am in Bell 242 on Monday, March 31. Note that both the time and location are somewhat “non-standard.” Prasad’s talk will be related to Ryan’s talk (and will hopefully shed more light on some aspects that Ryan did not have the time to talk about).

Both the speaker are grad students: as a grad student, I always found it nice/encouraging to attend talks given by grad students, so see you all there for both the talks.

Posted by: atri | March 24, 2008

Ryan O’Donnell talk Monday

Ryan O’Donnell will be talking about 3-Query Dictator Testing in the theory seminar tomorrow (Monday) at 11am in Bell 224.

Posted by: atri | March 20, 2008

Talk tomorrow

A gentle reminder that Bobby’s talk is at 11am tomorrow at the usual place.

Posted by: atri | March 17, 2008

Ashwin Lall talk

Ashwin Lall will be talking about Estimating the entropy of a stream in the theory seminar on Friday, March 28th. This talk is perhaps the closest to our seminar, so you should attend the talk.

Also take a look at theory seminar webpage again: we now have the abstract up for all the confirmed talks.

Posted by: atri | March 15, 2008

Bobby Kleinberg talk

From next Friday onwards, all our meetings in March will be devoted to theory seminar talks. We have have some very exciting talks coming up, so please do attend. Next Friday, March 21, Bobby Kleinberg is going to speak on Multi-armed bandit problems on metric spaces. Bobby is a good speaker so his talk should be fun.

Posted by: atri | March 14, 2008

Presentation tips

I just wanted to write an entry to broadly explain our expectations and hopefully help you give a better presentation.

Our “expectations” are pretty simple:

  • You will use up one meeting time for your presentation. You can use slides or the whiteboard– your choice. Personally, I have found slides to be better for the introduction/motivation part of the talk and the whiteboard for presenting the proofs (it also slows you down which is generally good). Please plan to finish by 11:55am– in the last five minutes Hung & I will give you quick feedback on your talk (both on the content and the style of the presentation).
  • After your presentation, send us a one/two paragraph summary of what you presented in the class. We will post that on the blog and you’ll be the guest blogger for that day. To get an idea of what to write in the entry, look at any of entries for a lecture posted by Hung or me.

Here are some general tips for planning/preparing your presentation. (I just compiled up the list more or less on the fly: I might add some more things later on.) These are meant to be guidelines– use your own judgment on which (possibly empty) subset to follow. You probably are aware of all these points but it never hurts to repeat.

  • Your talk should start with a motivation of why your audience should listen to you, i.e., why is the stuff that you are going to present interesting? Presumably you chose your paper because you found it exciting: so tell us why. At the very least your audience should go away from your talk knowing what was the problem you talked about and why it was interesting. Note that the following are all valid reasons for liking a paper (of course the list is not exhaustive):
    • The problem tackled by the paper is interesting because either the problem arises in practice (and needs a data stream algorithm) or it is a problem that a lot of people have tried to solve but could not solve well or the problem being solved is just cute (this is sort of vague but you can generally tell when this is the case) or the techniques/proofs in the paper are nice and elegant.
  • Remember you are the only one who has studied the paper in detail, so slow down. Even though things seem obvious to you, it most probably will not be obvious to those who are listening. So for example, when you are going through a proof, state each step precisely and slowly. Stop and ask questions. Stop and do a jig. Whatever suits your style– just remember a talk given at a rapid pace is less useful than not giving the talk at all.
  • As a side effect of the above point, you will probably have only time to present one (or maybe two) results from the paper. So choose one or two results in the paper that you like and concentrate on them. (However, it never hurts to have a bit of extra material in case you still have time on your hands.)
  • Do a top-down presentation, especially for proofs. That is, start with some high level pieces such that assuming these pieces are correct, stitching them together to prove your main result is relatively straightforward. Then go into the details of the pieces (which themselves can be presented in its own top down hierarchy). This way you can also prioritize different chunks of the proof: if you are running out of time you can leave out the less important chunks.
  • Since this is a presentation in a seminar, try to connect what you are presenting to what has already been covered earlier in the class. Giving a quick recap of the background material will help. Hung and I are flattered if you think we taught everything so well that everyone remembers everything we spoke about ;-) but in reality most of your audience will only have a fleeting memory of the previous material. Refreshing memories will help.
  • Giving a break in the middle is not mandatory but it’ll help drifting minds to re-focus.
  • This one is hard to implement. Before presenting the correct proof, presenting an obvious but (partially) incorrect proof helps a lot in understanding the proof. A related strategy is to present the proof from the paper in your “own words.”
  • One advice, which I got early in my career (from this guy) is to use figures to illustrate your proof if you can. At first it stuck me as odd ( as a theory newbie I was all enamored by equations and such like). As time went on I realized how helpful the advice was. A related advice was not to spend too much time on calculations (especially when you have only one lecture to finish your talk). It is better to prove a weaker result as long as the general idea is being conveyed correctly. The only reason to be very precise about a calculation is if you want to highlight some cool trick/technique that is being used.
Posted by: atri | March 13, 2008

Presentation Schedule

Here is the schedule for student presentations (formed by a random process, except for Abhishek was “volunteered” to go first):

  • Mar 17 : Abhishek
  • Apr 4 : Anh
  • Apr 7 : Demian
  • Apr 11 : Steve
  • Apr 14 : Omar
  • Apr 18 : Yang
  • Apr 21 : Xi
  • Apr 25 : Dennis
  • Apr 28 : Karthik

The “missing” dates (Mar 21, 24, 28, 31) will be used up by the theory seminar (if you have not checked the website this week, you should take a look– we have one new speaker and some new abstracts).

Please use the comments section if you have any questions/concerns.

Update: If for some (valid) reason, your alloted slot does not work for you, feel free to talk with some other student to swap your slots. Assuming both parties are OK with it, we’ll switch the slots (and the swap will be reflected in the post above). Feel free to use the comments section to do this.

« Newer Posts - Older Posts »

Categories