WebDiarios de Motocicleta

Presburger Award

2012-05-06T20:43:00.000-04:00

Hurah!!I've got some excellent news: I was co-awarded the 2012 Presburger Award, to be shared with Venkat Guruswami ,the young superstar of error correcting codes.
In short, the Presburger Award is the European "young-scientist award" in Theory.
It is given anually by the European Association for Theoretical Computer Science (EATCS)
to a scientist under 35 "for outstanding contributions in
Theoretical Computer Sciene as documented by a series of published papers."

Not surprisingly, I got the award for my work on data-structure lower bounds.The citation has a lot of nice things to say. Go read it :)

Happy Easter

2012-04-15T10:11:00.000-04:00

Happy Easter!

...to all my readers cellebrating today in the Eastern Orthodox tradition.
According to persistent rumours:
Χριστός ἀνέστη
Christos a înviat!

Happpy recovery from the FOCS deadline:with a funny picture :)

2012-04-05T13:01:00.000-04:00

Happpy recovery from the FOCS deadline everyone:)

With this kind of support from NY fashion stores, it'sodd that we have recruiting problems i n our field )

New York Theory Day

2011-11-06T23:05:00.000-05:00

The Fall 2011 New York Theory Day is happening this Friday, November 11th, at NYU (Courant), and features yours truly as a speaker.

I hope to see many of you there!

Follow-up: Sampling a discrete distribution

2011-09-19T16:03:00.000-04:00

This is a follow-up to my last post, on a puzzle about "Sampling a discrete distribution" (see my comment there for the solution I originally thought of).

As an anonymous commenter (Rex?) points out, a nice elementary solution that has been known for a long time. I admit to not knowing this solution ahead of time, and use the fact that it's not really my field as an excuse. Here I want to summarize this solution for educational purposes.

Imagine each sample in the distribution is a vase containing liquid proportional to the probability of the outcome.

We classify these vases into 3 types:

LO — probability less than 1/n

GOOD – probability precisely 1/n

HI – probability greater then 1/n

In the ideal case when every vase is GOOD, we are done (uniform sampling). If not, we move towards this ideal case by inductively applying the following subroutine:

** Pick a LO vase (say, with x amount of liquid) and a HI vase (with y amount of liquid).

Grab a new vase and fill it up to the 1/n mark: pour all the liquid from the LO vase, and (1/n)-x liquid from the HI vase (remember that a HI vase has liquid y>1/n, so this is certainly possible). Note that the HI vase may now become LO, GOOD, or stay HI.

/end of **

Observe that the process terminates in n steps, since at each step we increment the number of GOOD vases. If we maintain 3 linked lists with samples of each type, each step can be done in O(1) time, so O(n) time total.

At the end all vases are GOOD so we use uniform sampling. If we get a random x in [0,1], we use ⌊x·n⌋ to indicate that vase and x·n mod 1 to pick a molecule of water inside that vase. If we trace back the original vase from where this molecule came, we have sampled according to the initial distribution.

Now we describe how to "undo" the water pouring at query time, i.e. tracing the origin of each water molecule. Observe that operation (**) is never applied to a GOOD vase. So when a vase becomes GOOD, it has reached final form. But the water in a GOOD vase can only come from two distinct vases. So we can essentially draw a water-mark on the vase, to separate the LO water from HI water and store pointers to the original vases whose water got poured into the GOOD vase. Then the sampling algorithm only need one comparison, i.e. it compares x·n mod 1 to the mark on the vase ⌊x·n⌋, and follows a pointer from this vase to the input vases from which water was poured. Note that only one pointer is followed, since a GOOD vase is never touched again by operation (**). Constant time!

Sampling a discrete distribution

2011-09-16T13:55:00.002-04:00

The following cute question came up at lunch today (all credit goes to Aaron Archer).

Informal challenge: You are given a discrete distribution with n possible outputs (and you know the probability p_i of each output), and you want to sample from the distribution. You can use a primitive that returns a random real number in [0,1].

This was the way the problem came to me, and the goal was to find a very fast practical solution. Think of n as around 10⁶and use a regular machine as your machine model (real number means a "double" or something like that).

Formal challenge: To formalize this completely for a bounded-precision computer, we can say that we work on a w-bit machine and probabilities have w-bit precision. The distribution is given by n integers x_i, and these values sum to exactly 2^w. Output "i" should be produced with probability exactly x_i / 2^w. You are given a primitive that produces a uniformly distributed w-bit number.

The challenge is to get O(1) worst-case time. (NB: some fancy data structures required.)

International Olympiad, Day 2

2011-07-28T10:42:00.002-04:00

I suspect I got a lot of bad karma this day (I'm the evil mind behind "Crocodile" and "Elephant".) Congratulations to the winners! See here for final results.

Problem Crocodile. You have a weighted undirected graph with a start node and k designated target nodes. Two players play a full-information game, taking turns:

Player 1 moves on the nodes of the graph, starting at the designated start node, and aiming to reach a target node. In each turn, she can traverse any edge out of her current node [except the blocked edge / see below], incurring a cost equal to the weight of the edge.
Player 2 can block one edge at every turn. When an edge is blocked, Player 1 cannot traverse it in the next turn. An edge can be blocked multiple times, and any past edge is "unblocked" when a new edge is blocked.

Find the minimum budget B such that Player 1 can reach a target node with cost ≤ B, regardless of what Play 2 does. Running time: O~(n+m).

Problem Elephants. You have n elephants on the line at given coordinates. The elephants make n moves: in each unit of time, some elephant moves to another position. You have cameras that can photograph any interval of length L of the line. After each move, output the minimum number of cameras needed to phtograph all elephants in the current configuration.

Running time: subquadratic in n.

Problem Parrots. You are given a message of M bytes (0..255) to transmit over a weird channel. The channel can carry elements between 1 and R, but elements don't arrive in order (they are permuted adversarially). Send the message, minimizing the number L of elements sent across the channel.

Running time: polynomial. (Yes, this is an easy problem for a trained theorist.)

International Olympiad, Day 1

2011-07-25T13:14:00.002-04:00

The problems for Day 1 of the IOI have been posted (in many languages). Exciting! Here are the executive summaries. (Though I am on the International Scientific Committee (ISC), I do not know all the problems, so my appologies if I'm misrepresenting some of them. In particular, the running times may easily be wrong, since I'm trying to solve the problems as I write this post.)

The usual rules apply: if you're the first to post a solution in the comments, you get eternal glory. (If you post annonymously, claim your glory some time during the next eternity.)

Problem Fountain. Consider an undirected graph with costs (all costs are distinct) and the following traversal procedure. If we arrived at node v on edge e, continue on the cheapest edge different from e. If the node has degree 1, go back on e. A walk following these rules is called a "valid walk."

You also have a prescribed target node, and an integer k. Count the number of valid walks of exactly k edges which end at the prescribed target node. Running time O~(n+m) ["O~" means "drop logs"]

Problem Race. You have a tree with integer weights on the edges. Find a path with as few edges as possible and total length exaclty X. Running time O(nlg n) [log^2 may be easier.]

Problem RiceHub. N rice fields are placed at integer coordinates on the line. The rice must be gathered at some hub, which must also be at an integer coordinate on the line (to be determined). Each field produces one truck-load of rice, and driving the truck a distance d costs exactly d. You have a hard budget constraint of B. Find the best location for the hub, maximizing the amount of rice that can be gathered in the budget constraint.

Running time: O~(N).

Balkan Olympiad (Day 2)

2011-07-22T13:29:00.003-04:00

Now is time for the problems on day 2 (of 2). See here for day 1. Feel free to post solutions in the comments (you get eternal glory if you're the first to post one) and discuss anything related to the problems.

Day2–Problem TimeIsMoney. You have a graph with two costs (A and B) on each edge. Find a spanning tree that minimizes (its total A-cost) * (its total B-cost) [that's "times" in the middle]. Desired running time: polynomial. Sharper bounds are possible (I think I get O(n²m log)) but this is hard enough. To be entirely fair, the contestants just need to find an algorithm, not to prove it runs in poly-time, which may be easier (but I'm writing for a theory audience so consider yourself challenged).

Day2–Problem Trapezoid. Consider two horizontal lines, and n trapezoids stretching from one line to the other. The proverbial picture worth 1000 words:

Find the maximal set of nonintersecting trapezoids. Running time: O(nlg n).

Day 2–Problem Compare. Alice gets a number a, and Bob gets a number b. Both are integers in {0, ..., 4095}. Bob's goal is to compare b to a (and output "greater than", "less than" or "equal"). Charlie is helping them. Alice can send Charlie a set A ⊂ {1, ..., 10240} (intuitively, think of Alice marking some bits in an array of 10Kbits). Bob can ask Charlie whether some x is in A or not (think of Bob as querying some bits of the bit vector). The goal is to minimize (for worst-case a and b)

T=|A| + the number of queries made by Bob

Desired solution: We know how to get T=9. In the Olympiad, T=10 was enough for full score, but we left the problem open-ended, so T=9 would've earned you 109 points.

Somewhat unusually for computer olympiads, in this problem we could test contestant solutions exhaustively: we ran their code for all 4096² choices for (a,b) and really computed the worst-case T.

Balkan Olympiad

2011-07-12T17:45:00.002-04:00

The Balkan Olympiad is over, and I am slowly recovering from the experience of chairing the Scientific Commitee. For your enjoyment, I will post summaries of the competition problems. Perhaps this post should be titled "Are you smarter than a high school student?"

Day 1 – Problem 2Circles. You are given a convex polygon. Find the largest R such that two circles of radius R can be placed inside the polygon without overlapping. Target running time: O~(N). I think this problem illustrates the rather significant difference between coming up with an algorithm on paper and actually implementing it (only 4 contestants scored nonzero; the committee says ``Sorry!'').

Day 1– Problem Decryption. We define the following pseudorandom number generator:

initialize R[0], R[1], R[2] with random values in {0,..., 255}
let R[n] = R[n-3] XOR R[n-2].

We also define the following cypher:

let π be a random permutation of {0,...,255}
to encrypt the i-th byte of the message, output π(Message[i] XOR R[i])

You have to implement a chosen plaintext attack. You get a device implementing this cypher. You may feed it a message of at most 320 bytes (you give the device one byte at a time, and observe the encyption immediately). Your goal is to recover the secret keys (R[0..2], and π).

Day 1–Problem Medians. Given an array A[1..n] of integers, we define the prefix medians M[0..(n-1)/2] as M[k] = median(A[1..2k+1]).

The problem is: given M, recover A. O~(n) running time.

PS: This was by far the best prepared Committee that I've been on, and I think we conclusively proved that, no matter how much work you do before the Olympiad, there's still a lot left to do during the Olympiad itself. Many thanks to everybody who volunteered their time. I wish I could do more to express my gratitude, but for lack of other ideas, it is my pleasure to acknowledge the committee here:

Marius Andrei - Senior Software Engineer, Google Inc.
Radu Ștefan - Researcher, Technische Universiteit Eindhoven
Dana Lica - "I.L. Caragiale" High School, Ploiești
Cosmin Negrușeri - Senior Software Engineer, Google Inc.
Mugurel Ionuț Andreica - PhD, Assist., Polytechnic University, Bucharest
Cosmin Tutunaru- Student, Babes-Bolyai University, Cluj-Napoca
Mihai Damaschin - Student, Bucharest University
Gheorghe Cosmin - Student, MIT
Bogdan Tătăroiu - Student, Cambridge University,
Filip Buruiană - Student, Polytechnic University, Bucharest
Robert Hasna - Student, Bucharest University
Marius Dragus - Student, Colgate University NY
Aleksandar and Andreja Ilic -- University of Nis, Serbia; external problem submitters, who were not intimidated by the fact that I only posted the problem call in Romanian :)

Please submit...

2011-04-22T11:40:00.002-04:00

Papers to MASSIVE 2011. This will be a SoCG satellite workshop (think Paris in June...). Publication here does not hinder publication anywhere else.
Algorithmic problems for the Balkan Olympiad (BOI 2011). I am chairing the Scientific Committee, and we need your help in putting together an interesting set of problems. If your problem is selected for the competition, you will be invited to be a member of the Scientific Committee, and you get a free trip to the Olympiad if you accept (think Bistrița in July...). Anybody can submit problems, but I think the free trip only applies if you're starting somewhere in Romania and you are a Romanian/EU^(?) citizen. Hence the call for problems is in Romanian; sorry.
Comments to other blogs. This blog is now on full moderation after I had to delete a batch of comments that did not score particularly high on the sanity scale. Don't worry, the main point of this is not censorship; nasty but coherent comments will continue to be accepted.

Complexity Theory

2010-11-19T10:00:00.002-05:00

If you're a student thinking of going into complexity theory, take a good look around and ask yourself: "Do I really want to be in a field with this amount of groupthink?" [1,2,3,4,5, and last but certainly not least 6]

Back in the day, I was myself saved from a career in complexity by Omer Reingold's logspace connectivity.

FOCS 2010

2010-10-27T14:03:00.002-04:00

I am back from FOCS 2010, where I gave two talks:

a tutorial on data structure lower bounds: PPSX, PDF
a regular conference talk on distance oracles: PPSX, PDF (for this paper coauthored with Liam Roditty).

This paper caused some amount of fun in certain "inner circles", which I think I should share with a wider audience. If you read the paper, the algorithms repeatedly grow balls (aka shortest path trees) around vertices of the graph. After obsessing about growing these balls for more than a year, I found it natural to name the paper "How to Grow Your Balls". At least it allowed me to begin various talks by telling the audience that, "This is a topic of great economic importance; I receive email about it almost every day."

The committee took its role as Cerberus at the gates of Theory very seriously, and refused to accept the paper under the original title. In the process, many jokes were made, at least some of which transpired outside the PC room. I won't post them here, as the PC probably intends to keep some decorum. (But if you get a chance, ask your friend on the PC about that proposed proceedings cover.)

So the title changed, but I stretched the joke a bit further by titling the tutorial "How to Grow Your Lower Bounds".

This episode confirmed two things that I already knew. First, this community insists on taking itself way too seriously. For comparison, the title certainly triggered some PR alarms inside AT&T. But it made it through, perhaps because I told people that it's in the interest of AT&T Labs to prove publicly that it's a real research lab: it lets researchers be researchers, which includes being silly at times. Given the reputation of PR departments in American telecoms, it's a bit sad to find that the theory community has a higher bar on form.

Secondly, I should turn it down a notch with this blog. Several people said they had expected me to withdraw the paper rather than change the title. I found it quite amazing that people had such an image of me. Perhaps it is time to emphasize that I'm doing my research because I think it's important and cool. The horsing around is, well, not quite the main message.

Problem solving versus new techniques

2010-09-29T16:47:00.002-04:00

This is a guest post by Mikkel Thorup:

--------------------------

I think there is nothing more inhibiting for problem solving than referees looking for new general techniques.

When I go to STOC/FOCS, I hope to see some nice solutions to important problems and some new general techniques. I am not interested in semi-new techniques for semi-important problems. A paper winning both categories is a wonderful but rare event.

Thus I propose a max-evaluation rather than a sum. If the strength of a paper is that it solves an important problem, then speculations on the generality of the approach are of secondary importance. Conversely, if the strength of the paper is some new general techniques, then I can forgive that it doesn't solve anything new and important.

One of the nice things about TCS is that we have problems that are important, not just as internal technical challenges, but because of their relation to computing. At the end of the day, we hope that our techniques will end up solving important problems.

Important problems should be solved whatever way comes natural. It may be deep problem specific understanding, and it may build on previous techniques. Why would we be disappointed if an old problem got solved by a surprising reuse of an old technique?

Elegant reuse of old techniques is not a bad thing. After all, when we develop new techniques, we are hoping they will be reused, and from a teaching perspective, it is great to discover that the same general technique applies to more diverse problems. The last thing we want is to encourage authors to hide reuse. Of course, reuse should be highly non-obvious for a strong conference, but for established open problems, non-obviousness will normally be self-evident.

Retrieval-Only Dictionaries

2010-09-27T14:17:00.004-04:00

We saw two cool applications of dictionaries without membership; now it's time to construct them. Remember that we are given a set S, where each element x∈S has some associated data[x], a k-bit value. We want a data structure of O(nk) bits which retrieves data[x] for any x∈S and may return garbage when queried for x∉S.

A conceptually simple solution is the "Bloomier filter" of [Chazelle, Kilian, Rubinfeld, Tal SODA'04]. This is based on the power of two choices, so you should first go back and review my old post giving an encoding analysis of cuckoo hashing.

Standard cuckoo hashing has two arrays A[1..2n], B[1..2n] storing keys, and places a key either at A[h(x)] or B[g(x)]. Instead of this, our arrays A and B will store k-bit values (O(nk) bits in total), and the query retrieve-data(x) will return A[h(x)] xor B[g(x)].

The question is whether we can set up the values in A and B such that any query x∈S returns data[x] correctly. This is a question about the feasibility of a linear system with n equations (one per key) and 4n variables (one per array entry).

Consider a connected component in the bipartite graph induced by cuckoo hashing. If this component is acyclic, we can fix A and B easily. Take an arbitrary node and make it "zero"; then explore the tree by DFS (or BFS). Each new node (an entry in A or B) has a forced value, since the edge advancing to it must return some data[x] and the parent node has been fixed already. As the component is acyclic, there is only one constraint on every new node, so there are no conflicts.

On the other hand, if a component has a cycle, we are out of luck. Remark that if we xor all cycle nodes by some Δ, the answers are unchanged, since the Δ's cancel out on each edge. So a cycle of length k must output k independent data values, but has only k-1 degrees of freedom.

Fortunately, one can prove the following about the cuckoo hashing graph:

the graph is acyclic with some constant probability. Thus, the construction algorithm can rehash until it finds an acyclic graph, taking O(n) time in expectation.
the total length of all cycles is O(lg n) with high probability. Thus we can make the graph acyclic by storing O(lg n) special elements in a stash. This gives construction time O(n) w.h.p., but the query algorithm is slightly more complicated (for instance, it can handle the stash by a small hash table on the side).

These statements fall out naturally from the encoding analysis of cuckoo hashing. A cycle of length k allows a saving of roughly k bits in the encoding: we can write the k keys on the cycle (klg n bits) plus the k hash codes (klg(2n) bits) instead of 2k hash codes (2klg(2n) bits).

Further remarks. Above, I ignored the space to store the hash functions h and g. You have to believe me that there exist families of hash functions representable in O(n^ε) space, which can be evaluated in constant time, and make cuckoo hashing work.

A very interesting goal is to obtain retrieval dictionaries with close to kn bits. As far as I know, the state of the art is given by [Pagh-Dietzfelbinger ICALP'08] and [Porat].

Static 1D Range Reporting

2010-09-21T13:20:00.004-04:00

Method 4 for implementing van Emde Boas with linear space, described in my last post, is due to [Alstrup, Brodal, Rauhe: STOC'01]. They worked on static range reporting in 1 dimension: preprocess a set of integers S, and answer query(a,b) = report all points in S ∩ [a,b]. This is easier than predecessor search: you can first find the predecessor of a and then output points in order until you exceed b. Using van Emde Boas, we would achieve a linear-space data structure with query time O(lglg u + k), where k is the number of points to be reported.

Alstrup, Brodal, and Rauhe showed the following surprising result:

Static 1D range reporting can be solved with O(n) space and O(1+k) query time.

I like this theorem a lot, since it is so easy to describe to anybody with minimal background in Computer Science, yet the result is not obvious. I have used it many times to answer questions like, "Tell me a surprising recent result from data structures."

The solution. We need a way to find some (arbitrary) key from S ∩ [a,b] in constant time. Once we have that, we can walk left and right in an ordered list until we go outside the interval.

Let's first see how to do this with O(n lg u) space; this was described by [Miltersen, Nisan, Safra, Wigderson: STOC'95]. Of course, we build the trie representing the set. Given the query [a,b] let us look at the lowest common ancestor (LCA) of a and b. Note that LCA(a,b) is a simple mathematical function of the integers a and b, and can be computed in constant time. (The height of the LCA is the most significant set bit in a xor b.)

if LCA(a,b) is a branching node, look at the two descendant branching nodes. If the interval [a,b] is nonempty, it must contain either the max in the tree of the left child, or the min in the tree of the right child.
if LCA(a,b) is an active node, go to its lowest branching ancestor, and do something like the the above.
if LCA(a,b) is not an active node, the interval [a,b] is certainly empty!

Thus, it suffices to find the lowest branching ancestor of LCA(a,b) assuming that LCA(a,b) is active. This is significantly easier than predecessor search, which needs the lowest branching ancestor of an arbitrary node.

The proposal of [Miltersen et al.] is to store all O(n lg u) active nodes in a hash table, with pointers to their lowest branching ancestors.

As in my last post, the technique of [Alstrup et al.] to achieve O(n) space is: store only O(n √lg u) active nodes, and store them in a retrieval-only dictionary with O(lglg u) bits per node. We store the following active nodes:

active nodes at depth i·√lg u ;
active nodes less than √lg u levels below a branching node.

We first look up LCA(a,b) in the dictionary. If the lowest branching ancestor is less than √lg u levels above, LCA(a,b) is in the dictionary and we find the ancestor. If not, we truncate the depth of the LCA to a multiple of √lg u , and look up the ancestor at that depth. If [a,b] is nonempty, that ancestor must be an active node and it will point us to a branching ancestor.

vEB Space: Method 4

2010-09-21T11:11:00.003-04:00

In the previous post I described 3 ways of making the "van Emde Boas data structure" take linear space. I use quotes since there is no unique vEB structure, but rather a family of data structures inspired by the FOCS'75 paper of van Emde Boas. By the way, if you're curious who van Emde Boas is, here is a portrait found on his webpage.

In this post, I will describe a 4th method. You'd be excused for asking what the point is, so let me quickly mention that this technique has a great application (1D range reporting, which I will discuss in another post) and it introduces a nice tools you should know.

Here is a particularly simple variant of vEB, introduced by Willard as the "y-fast tree". Remember from the last post that the trie representing the set has n-1 branching nodes connected by 2n-1 "active" paths; if we know the lowest branching ancestor of the query, we can find the predecessor in constant time. Willard's approach is to store a hash table with all O(n lg u) active nodes in the trie; for each node, we store a pointer to its lowest branching ancestor. Then, we can binary search for the height of the lowest active ancestor of the query, and follow a pointer to the lowest branching node above. As the trie height is O(lg u), this search takes O(lglg u) look-ups in the hash table.

Of course, we can reduce the space from O(n lg u) to O(n) by bucketing. But let's try something else. We could break the binary search into two phases:

Find v, the lowest active ancestor of the query at some depth of the form i·√lg u (binary search on i). Say v is on the path u→w (where u, w are branching nodes). If w is not an ancestor of the query, return u.
Otherwise, the lowest branching ancestor of the query is found at some depth in [ i·√lg u , (i+1)√lg u ]. Binary search to find the lowest active ancestor in this range, and follow a pointer to the lowest active ancestor.

With this modification, we only need to store O(n √lg u ) active nodes in the hash table! To support step 1., we need active nodes at depths i·√lg u. To support step 2., we need active nodes whose lowest branching ancestor is only ≤ √lg u levels above. All other active nodes can be ignored.

You could bring the space down to O(n lg^εu) by breaking the search into more segments. But to bring the space down to linear, we use heavier machinery:

Retrieval-only dictionaries. Say we want a dictionary ("hash table") that stores a set of n keys from the universe [u], where each key has k bits of associated data. The dictionary supports two operations:

membership: is x in the set?
retrieval: assuming x is in the set, return data[x].

If we want to support both operations, the smallest space we can hope for is log(u choose n) + nk ≈ n(lg u + k) bits: the data structure needs to encode the set itself, and the data.

Somewhat counterintuitively, dictionaries that only support retrieval (without membership queries) are in fact useful. (The definition of such a dictionary is that retrieve(x) may return garbage if x is not in the set.)

Retrieval-only dictionaries can be implemented using only O(nk) bits. I will describe this in the next post, but I hope it is believable enough.

When is a retrieval-only dictionary helpful? When we can verify answers in some other way. Remember the data structure with space O(n √lg u ) from above. We will store branching nodes in a real hash table (there are only n-1 of them). But observe the following about the O(n √lg u ) active nodes that we store:

We only need k=O(lglg u) bits of associated data. Instead of storing a pointer to the lowest branching ancestor, we can just store the height difference (a number between 1 and lg u). This is effectively a pointer: we can compute the branching ancestor by zeroing out so many bits of the node.
We only need to store them in a retrieval-only dictionary. Say we query some node v and find a height difference δ to the lowest branching ancestor. We can verify whether v was real by looking up the δ-levels-up ancestor of v in the hash table of branching nodes, and checking that v lies on one of the two paths descending from this branching node.

Therefore, the dictionary of active nodes only requires O(n √lg u · lglg u) bits, which is o(n) words of space! This superlinear number of nodes take negligible space compared to the branching nodes.

Van Emde Boas and its space complexity

2010-09-19T14:34:00.002-04:00

In this post, I want to describe 3 neat and very different ways of making the space of the van Emde Boas (vEB) data structure linear. While this is not hard, it is subtle enough to confuse even seasoned researchers at times. In particular, it is the first bug I ever encountered in a class: Erik Demaine was teaching Advanced Data Structures at MIT in spring of 2003 (the first grad course I ever took!), and his solution for getting linear space was flawed.

Erik is the perfect example of how you can get astronomically high teaching grades while occasionally having bugs in your lectures. In fact, I sometimes suspected him of doing it on purpose: deliberately letting a bug slip by to make the course more interactive. Perhaps there is a lesson to be learned here.

***

Here is a quick review of vEB if you don't know it. Experienced readers can skip ahead.

The predecessor problem is to support a set S of |S|=n integers from the universe {1, ..., u} and answer:

predecessor(q) = max { x ∈ S | x ≤ q }

The vEB data structure can answer queries in O(lglg u) time, which is significantly faster than binary search for moderate universes.

The first idea is to divide the universe into √u segments of size √u. Let hi(x) = ⌊x/√u⌋ be the segment containing x, and lo(x) = x mod √u be the location of x within its segment. The data structure has the following components:

a hash table H storing hi(x) for all x ∈ S.
a top structure solving predecessor search among { hi(x) | x ∈ S }. This is the same as the original data structure, i.e. use recursion.
for each element α∈H, a recursive bottom structure solving predecessor search inside the α segment, i.e. among the keys { lo(x) | x ∈ S and hi(x)=α }.

The query algorithm first checks if hi(q) ∈ H. If so, all the action is in q's segment, so you recurse in the appropriate bottom structure. (You either find its predecessor there, or in the special case when q is less than the minimum in that segment, find the successor and follow a pointer in a doubly linked list.)

If q's segment is empty, all the action is at the segment level, and q's predecessor is the max in the preceding non-empty segment. So you recurse in the top structure.

In one step, the universe shrinks from u to √u, i.e. lg u shrinks to ½ lg u. Thus, in O(lglg u) steps the problem is solved.

***

So what is the space of this data structure? As described above, each key appears in the hash table, and in 2 recursive data structures. So the space per key obeys the recursion S(u) = 1 + 2 S(√u). Taking logs: S'(lg u) = 1 + 2 S'(½ lg u), so the space is O(lg u) per key.

How can we reduce this to space O(n)? Here are 3 very different ways:

Brutal bucketing. Group elements into buckets of O(lg u) consecutive elements. From each bucket, we insert the min into a vEB data structure. Once we find a predecessor in the vEB structure, we know the bucket where we must search for the real predecessor. We can use binary search inside the bucket, taking time O(lglg u). The space is (n/lg u) ·lg u = O(n).

Better analysis. In fact, the data structure from above does take O(n) space if you analyze it better! For each segment, we need to remember the max inside the segment in the hash table, since a query in the top structure must translate the segment number into the real predecessor. But then there's no point in putting the max in the bottom structure: once the query accesses the hash table, it can simply compare with the max in O(1) time. (If the query is higher than the max in its segment, the max is the predecessor.)

In other words, every key is stored recursively in just one data structure: either the top structure (for each segment max) or the bottom structure (for all other keys). This means there are O(lglg u) copies of each element, so space O(n lglg u).

But note that copies get geometrically cheaper! At the first level, keys are lg u bits. At the second level, they are only ½ lg u bits; etc. Thus, the cost per key, in bits, is a geometric series, which is bounded by O(lg u). In other words, the cost is only O(1) words per key. (You may ask: even if the cost of keys halves every time, what about the cost of pointers, counters, etc? The cost of a pointer is O(lg n) bits, and n ≤ u in any recursive data structure.)

Be slick. Here's a trickier variation due to Rasmus Pagh. Consider the trie representing the set of keys (a trie is a perfect binary tree of depth lg u in which each key is a root-to-leaf path). The subtree induced by the keys has n-1 branching nodes, connected by 2n-1 unbranching paths. It suffices to find the lowest branching node above the query. (If each branching node stores a pointer to his children, and the min and max values in its subtree, we can find the predecessor with constant work after knowing the lowest branching node.)

We can afford space O(1) per path. The data structure stores:

a top structure, with all paths that begin and end above height ½ lg u.
a hash table H with the nodes at depth ½ lg u of every path crossing this depth.
for each α∈H, a bottom structure with all paths starting below depth ½ lg u which have α as prefix.

Observe that each path is stored in exactly one place, so the space is linear. But why can we query for the lowest branching node above some key? As the query proceeds, we keep a pointer p to the lowest branching node found so far. Initially p is the root. Here is the query algorithm:

if p is below depth ½ lg u, recurse in the appropriate bottom structure. (We have no work to do on this level.)
look in H for the node above the query at depth ½ lg u. If not found, recurse in the top structure. If found, let p be the bottom node of the path crossing depth ½ lg u which we just found in the hash table. Recurse to the appropriate bottom structure.

The main point is that a path is only relevant once, at the highest level of the recursion where the path crosses the middle line. At lower levels the path cannot be queried, since if you're on the path you already have a pointer to the node at the bottom of the path!

IOI Wrap-up

2010-09-07T14:23:00.002-04:00

In the past 2 years, I have been a member of the Host Scientific Committee (HSC) of the IOI. This is the body that comes up with problems and test data. While it consists primarily of people from the host country (Bulgaria in 2009, Canada in 2010), typically the host will have a call-for-problems and invite the authors of problems they intend to use.

This year, I was elected member of the International Scientific Committee (ISC). This committee works together with the HSC on the scientific aspects, the hope being that a perenial body will maintain similar standards of quality from one year to another. There are 3 elected members in the ISC, each serving 3-year terms (one position is open each year).

I anticipate this will be a lot of fun, and you will probably hear more about the IOI during this time. When a call for problems comes up (will be advertised here), do consider submitting!

I will end with an unusual problem from this IOI:

Consider the largest 50 languages on Wikipedia. We picked 200 random articles in each language, and extracted an excerpt of 100 consecutive characters from each. You will receive these 10000 texts one at a time in random order, and for each you have to guess its language. After each guess, your algorithm learns the correct answer. The score is the percentage of correct guesses.

To discourage students from coding a lot of special rules, a random permutation is applied to the Unicode alphabet, and the language IDs are random values. So, essentially, you start with zero knowledge.

Considering the tiny amount of training data and the real-time nature of guessing, one might not expect too good solutions. However, it turns out that one can get around 90% accuracy with relatively simple ideas.

My own approach was to define Score(text, language) as the minimal number of substrings seen previously in this language that compose the text. This can be computed efficiently by maintaining a suffix tree for each language, and using it to answer longest common prefix queries.

Barriers 2

2010-08-29T12:20:00.004-04:00

Later today, I'll be giving a talk at the 2nd Barriers Workshop in Princeton.

Here's my attempt to survey data structure lower bounds in 70 minutes: PDF, PPSX.

Update: Based on comments, I'm publishing the slides as PPSX (which can be viewed with a free viewer) and PDF. I will try to convert my other talks to these formats when I have time.

IOI: A Medium Problem

2010-08-26T23:26:00.002-04:00

Here is another, medium-level problem from the IOI. (Parental advisory: this is not quite as easy as it may sound!)

I think of a number between 1 and N. You want to guess the secret number by asking repeatedly about values in 1 to N. After your second guess, I will always reply "hotter" or "colder", indicating whether your recent guess was closer or farther from the secret compared to the previous one.

You have lg N + O(1) questions.

***

The solution to the first problem I mentioned can be found in the comments. Bill Hesse solved the open question that I had posed. He has a neat example showing that the space should be N^1.5log₂3 bits, up to lower order terms. It is very nice to know the answer.

A very elegant solution to the second problem was posted in the comments by a user named Radu (do you want to disclose your last name?). Indeed, this is simpler than the one I had. However, the one I had worked even when the numbers in the array were arbitrary (i.e. you could not afford to sort them in linear time). I plan to post it soon if commenters don't find it.

IOI: Another Hard Problem

2010-08-22T19:54:00.002-04:00

You are given a matrix A[1..N][1..M] that contains a permutation of the numbers {1, ..., NM}. You are also given W≤N and H≤M. The goal is to find that rectangle A[i ... i+W][j ... j+H] which has the lowest possible median.

Running time: O(NM).

***

This could have been another hard problem at the IOI, but it was decided to give maximum score to solutions running in O(NM lg(NM)). This is considerably easier to obtain (but still interesting).

In the style of what Terry Tao tried for the IMO, I am opening this blog post as a discussion thread to try to solve the problem collectively ("poly-CS" ?). You are encouraged to post any suggestions, half-baked ideas, trivial observations, etc – with the goal to jointly reach a solution. If you think about the problem individually and find the solution, please don't post, as it will ruin the fun of the collective effort. Following this rule, I will not engage in the discussion myself.

I did not encourage a discussion for the first problem, since it was the kind of problem that only required one critical observation to solve. This second problem requires several ideas, and in fact I can see two very different solutions.

IOI: The Hard Problem

2010-08-20T08:09:00.003-04:00

The International Olympiad in Informatics (IOI 2010) is taking part this week at the University of Waterloo, Canada.

The olympiad often features a hard problem, which is intended to be solved by a handful of contestants. This year, the problem was solved by about 6 people. Read the problem below and give it a shot! :)

I will describe the problem in both TCS and IOI fashion.

Asymptotic version. You are given an unweighted, undirected graph on N vertices. Some sqrt(N) vertices are designated as "hubs". You have to encode the pairwise distances between all hubs and all vertices in O(N^1.5) bits of space.

The encoder and decoder may run in polynomial time. Of course, the decoder does not see the original graph; it receives the output of the encoder and must output the explicit distances between any hub and any other vertex. (This list of explicit distances takes O(N^1.5lg N) bits.)

Non-asymptotic version. You are given a graph on 1000 nodes and 36 designated hubs. You have to encode the distances between all hubs and all vertices in 70,000 bits of space.

The non-asymptotic version is a bit harder, since you have to pay more attention to some details.

The research version. Prove or disprove that the distances can be encoded using (1+o(1)) N^1.5bits of space. I don't know the answer to this question (but I find the question intriguing.)

A taxonomy of range query problems

2010-08-04T13:12:00.004-04:00

In this post, I will try to enumerate the range query problems that I know about. Let me know if I'm missing anything.

The query. Say you have n points in the plane, and you query for the points in an axis-parallel rectangle. What could we mean by "query"?

existential range queries: Is there any point in the rectangle?
counting queries: How many points are there in the rectangle?
reporting queries: List the points in the rectangle. Unlike the previous cases, the query time is now broken into two components: it is usually given as f(n) + k*g(n), where k is the number of output points.

Now let's assume the points have some number associated to them (a weight or a priority). Then one could ask the following natural queries:

weighted counting: What is the total weight of the points inside?
range min (max) query
range median query. (Possible generalizations: selection or quantiles.)
top-k reporting: Report just the top k guys, by priority (for k given). One may demand the output to be sorted. More stringently, one may ask the query algorithm to enumerate points sorted by priority, in time g(n) per point, until the user says "stop."

The number associated to a point can also be a "color". For instance, points could represent Irish pubs / Belgian bars / etc, and we may only want one point of each type in our output. Then the queries become:

colored counting: How many distinct colors are in the rectangle?
colored reporting: List the distinct colors in the rectangle (possibly with one example point from each color).
top-k colored reporting: If the colors are sorted by priorities (e.g. I prefer points of color 1 over points of color 2), one can then ask for the top-k distinct colors inside the rectangle.

Dynamism. The problem could be:

static: Preprocess the point set and then answer queries.
dynamic: Insert and delete from the point set.
incremental / decremental: We only insert or delete.
offline: The sequence of operations is known in advance. This is enough for many applications to algorithms.
parametric / kinetic. I confess ignorance with respect to these.

Orthogonal range queries. The setting from above works in any number of dimensions d≥1: the data set consists of n points in d-dimensional space and the query is a box [a₁, b₁]×···×[a_d, b_d]. This setup is usually called "orthogonal range queries".

We can consider the following important restrictions on the query:

dominance queries: the box is [0, b₁]×···×[0, b_d]. In other words, we are asking for the points dominated, coordinate-wise, by a point (b₁, ..., b_d).
k-sided queries: exactly 2d-k values in (a₁, a₂, ..., a_d) are zero. For instance, a 3-sided query in 2D is a rectangle with one side on the x axis. Dominance queries are the special case of d-sided queries.

The universe. The coordinates of the points and queries can come from the following sets:

general universe. In the standard RAM model, we assume that the coordinates are integers that fit in a machine word.
rank space: the coordinates are from {1, 2, ..., n}. One can reduce any static problem to rank space by running 2d predecessor queries. Most problems can be shown to be at least as hard as predecessor search, so their complexity is precisely: "optimal cost of predecessor search" + "optimal cost for the problem in rank space". In other words, for most problems it is sufficient to solve them in rank space.
dense universe: the points are exactly the points of the grid [1, n₁]×···×[1, n_d] where n₁n₂···n_d= n. In 1D this is the same as rank space, but in 2 or more dimensions the problems are very different. (To my knowledgeable readers: Is there a standard name for this case? For counting queries people call this "the partial sums problem", but how about e.g. min queries?)

For dynamic problems, the "rank space" changes when a new coordinate value is inserted. Thus, a rank-space solution must support a "insert value" operation that increments all coordinate values after a given one, creating space for a newly inserted point. (I have heard the term "list space" for this. Should we just use "rank space"?)

Stabbing. So far, our data set consisted of points and the queries asked for points in a given rectangle. Conversely, one can consider a data set of rectangles; the query is a point and asks about the rectangles containing that point ("stabbed" by it). This problem is important, among others, in routing: we can have rules for packets coming from some range of IP addresses and going to some other range of IP addresses.

The notion of rank space, and all query combinations still make sense. For instance, interval max stabbing is the following problem: given a set of interval (in 1D) with priorities, return the interval of highest priority stabbed by a query point.

Note that the rectangles in the input can intersect! If we ask that the rectangles be disjoint, or more stringently, be a partition of the space, we obtain the point location problem.

Rectangle-rectangle queries. So far we looked at containment relations between rectangles and points. More generally, the data set can consist of rectangles, and the query can also be a rectangle. Then one can ask:

intersection queries: analyze the set of input rectangles that intersect the query rectangle.
containment queries: analyze the set of rectangles that contain / are-contained-by the query.

Two important special cases arise when the rectangles degenerate to line segments:

orthogonal segment intersection: Given a set of horizontal segments, find the ones that intersect a vertical query segment.
orthogonal ray shooting: Given a set of horizontal segments, find the lowest one immediately above a query point. In other words, consider a vertical ray leaving from the point and find the first segment it intersects. (This is the min segment intersection query, where the priority of each horizontal segment is its y coordinate.)

More complex geometry. Of course, our ranges need not be orthogonal. One can consider:

balls
arbitrary lines
half spaces
simplices (e.g 2D triangles).

In non-orthogonal geometry, the concept of rank space disappears. However, most other combinations are possible. For instance, one could ask about the points in a query range; the ranges stabbed by a query point; the ranges intersecting a query range; the first rangeintersected by a ray; etc. We can ask existential, counting, or reporting questions, on ranges that can have weights or priorities.

SODA

2010-07-22T09:13:00.003-04:00

Being on the SODA PC is excruciating. Too many submissions are at the level of a hard exercise – things that David Karger would assign in Advanced Algorithms or Randomized Algorithms as homework. And since I'm a problem solver by nature, I cannot resist solving the problems before (in lieu of) reading the papers...

I am left wondering how life on the ICS PC is like. The fact that the PC members can solve the problem in half an hour is promised to not be a negative – so is there any point in engaging the technical side of your brain during the review process?

Which brings me to graph algorithms, a field I have engaged in sporadically. This field strikes me as problem solving par excellence. You think about the problem, you pull a rabbit out of the hat, and run around naked screaming "Εὕρηκα!" Yet most reviewers (which, I will assume, write graph-theory papers in the same way) cannot help commenting on the lack of new techniques in the solution. I interpret this as a code for "We didn't care to think about the problem, we just read over your solution and remarked that it solved the problem by looking at some edges in some order, then at some trees and some vertices."

(I should say I have no sour grapes about this... Graph algorithms in not my main field and all my papers on the topic are in STOC/FOCS. Yet I am amazed by how much the field sabotages itself with this "techniques thing.")

By contrast, I once came across a reviewer that actually wore his problem-solver hat while reading my paper. The review read "I spent days to figure out the interplay and indispensability of the data structures and techniques used. But after fully understanding the entire picture and the interplay of all the data structures, I felt very satisfied. It is indeed a very beautiful piece of algorithmic art. I congratulate the authors for such a novel algorithm." This is possibly the best review I ever got – complemented, of course, by some "no new techniques but important result" reviews.

Which brings me to the main trouble with TCS, in my opinion. If only we could stop inventing a ton of artificial problems every year, and we actually cared about solving some age-old clear-cut problems – then it would be no surprise to find reviewers that have actually thought carefully about your hard problem.