The post DRY Function Pointers in C appeared first on Alex Bowe.
The most recent post about C I have read was this one by Dennis Kubes. I haven’t hung out with C for a while (I more frequently visit its octopussier cousin, onto which a lambda-shaped leg has been nailed), so the post triggered some fond memories.
This post is just a leg nailed to his post, so go read it now if you would like to know about function pointers in C.
Kubes presents an example where he defines a domath function that accepts a pointer to a function that performs an operation on two integers:
int domath(int (*mathop)(int, int), int x, int y) {
    return (*mathop)(x, y);
}
For a more realistic example, let’s say you implemented fold instead, and hey, why not left-fold also?
int fold_right(int (*f)(int, int), int init, const int * first, const int * last) {
    int result = init;
    while (first != last) {
        result = f(result, *first++);
    }
    return result;
}
int fold_left(int (*f)(int, int), int init, const int * first, const int * last) {
    int result = init;
    while (first != last) {
        result = f(*--last, result);  /* pre-decrement: last starts one past the end */
    }
    return result;
}
Now imagine a whole family of higher order functions like these… the two function signatures here are bad enough already. The syntax doesn’t communicate well – you actually have to work out wtf int (*f)(int, int) means, unless you write C in your sleep. Even then, there is a chance the next person to maintain your code will cry.
Let’s look out for our fellow coder and DRY those tears. C has syntax for typedefing function pointers. The result is much friendlier:
typedef int (*binary_int_op) (int, int);
int fold_right(binary_int_op f, int init, const int * first, const int * last) {
    int result = init;
    while (first != last) {
        result = f(result, *first++);
    }
    return result;
}
int fold_left(binary_int_op f, int init, const int * first, const int * last) {
    int result = init;
    while (first != last) {
        result = f(*--last, result);  /* pre-decrement: last starts one past the end */
    }
    return result;
}
That is still some weird ass syntax at first, but you write that once (per type/intent of function pointer) and then live a happier life. And if you choose the name intelligently, you can communicate the intent as fast as the reader can read English.
Or you could ditch C and use C++’s functional programming tools.
The post Succinct de Bruijn Graphs appeared first on Alex Bowe.
Using our new structure, we have squeezed a graph for a human genome (which took around 300 GB of memory if using previous representations) down into 2.5 GB. In addition, the construction method allows much of the work to be done on disk. Your computer might not have 300 GB of RAM, but you might have 2.5 GB of RAM and a hard disk.
I have given a talk about this a few times, so I’ve been itching to write it up as a blog post (if only to shamelessly plug my blog at conferences). However, since it is a lowly blog post, I won’t give attention to the gory details, nor provide any experimental results. But this post is merely to communicate the approach. Feel free to check out our conference paper, and stay tuned for the journal paper.
In this blog post, I will first give an introduction to de Bruijn graphs and how they are used in DNA assembly. Then I will briefly explain some previous implementations before reaching the main topic of this post: our new succinct representation. Along the way I cover some preliminaries, such as how we were inspired by the Burrows Wheeler Transform, and what rank and select are, which are needed to understand the construction method and the traversal interface respectively.
For those following along at home, I have implemented a demo version in Python. It doesn’t use efficient implementations of rank and select, nor provide any compression – it is merely meant to demonstrate the key ideas with (hopefully) readable high level code. An optimised version will be made available at some point.
Okay, here we go…
De Bruijn graphs are beautifully simple, yet useful, combinatorial objects which I challenge you not to lose sleep over.
Since their discovery around 1946, they have been used in a variety of applications, such as encryption, psychology, chess and even card tricks. Quite recently they have become a popular data structure for DNA assembly of short read data.
They are defined as a directed graph, where each node $u$ represents a fixed-length string (say, of length $k$), and an edge exists from $u$ to $v$ iff they overlap by $k–1$ symbols. That is, $u[2..k] = v[1..k–1]$.
Let’s make this concrete with a diagram:
Doesn’t this just feel good to look at? Now tilt your head and look at it this way: it is essentially a Finite State Machine with additional bounded memory of the last $k$ visited states, if you added a few more states to allow for incomplete input at the start. For example, if X represents blank input, from a starting state XXX we might have the transition chain: XXX -> XX1 -> X10 -> 101 (which is already a node in the graph). Put this way, it is easy to see that the $k$ previous edges define the current node. This perspective will make things easier to understand later, I promise.
Still with me? Good (: Then let’s also consider that if one node is defined by a $k$-length string, then a pair of nodes (i.e. an edge) can be identified by a $k+1$ length string, since they overlap and differ by 1 symbol. This will also be important later.
And of course, this can be extended to larger alphabet sizes than binary (say, 4…).
First suggested in 2001 by Pevzner et al.[1], we can use de Bruijn graphs to represent a network of overlapping short read data.
The long and short of it (heh heh) is that a DNA molecule is currently too difficult to sequence (that is, read it into a computer) in its entirety. Special methods must be used so we can sequence parts of the molecule, and hand off the putting-back-together process (assembly) to an algorithm.
One current popular sequencing method is shotgun sequencing, which clones the genome a bunch of times, then randomly breaks each clone into short segments. If we can sequence the short segments, then the fact that we randomly cut up the clones should lead us to have overlapping reads. Of course it is a bit more complicated than this in reality, but this is the essence of it.
We then move a sliding window over each read, outputting overlapping $k$-mers (the $k$-length strings), which we use to create our de Bruijn graph.
About now the mathematicians among you might raise your hands to tell me “that may not technically yield a de Bruijn graph”. That’s correct – in Bioinformatics the term “de Bruijn graph” is overloaded to mean a subgraph. Even though genomes are long strings, most genomes won’t have every single k-mer present, and there are usually repeated regions. This means our data will be sparse.
Consider the following contrived example. Take the sequence TACGACGTCGACT. If we set \(k=3\), our k-mers will be TAC, ACG, CGA, and so on.
We would end up with this de Bruijn graph:
After we construct the graph from the reads, assembly becomes finding the “best” contiguous regions. The jury is still out on what the best method is (or what “best” even means); the point of this post isn’t about assembly, but our implementation of the data structure and how it provides all required navigation options to implement any traversal method. I recommend reading this primer from Nature if you want to get deeper into this.
Even though this is only a de Bruijn subgraph, these things still grow pretty big. It is worthwhile considering how to handle this scalability issue, if only to reduce hardware requirements of sequencing (thus proliferating personal genomics), and potentially improve traversal speed (due to better memory locality). Increased efficiency might also enable richer multiple-genomic analysis.
One of the first approaches to this was to scale “horizontally”. Simpson et al.[2] introduced ABySS in 2009. Their graph for reads from a human genome (HapMap: NA18507), stored in a distributed hash table, reached 336 GB.
In 2011, Conway and Bromage[3] instead approached this problem from a “vertical” scaling perspective (that is, scaling to make better use of a single system’s resources), by using a sparse bitvector (by Okanohara and Sadakane[4]) to represent the $(k + 1)$-mers (the edges), and used rank and select (to be described shortly) to traverse it. As a result, their representation took 32 GB for the same data set.
Minia, by Chikhi and Rizk (2012)[5], proposed yet another approach by using a Bloom filter (with additional structure to avoid false positive edges that would affect the assembly). They traverse by generating possible edges and testing for them in the Bloom filter. Using this approach, the graph was reduced to 5.7 GB.
As stated, we were able to represent the same graph in 2.5 GB (after some further compression techniques, which I will save for a future post).
The key insight is that the edges define overlapping node labels. This is similar to Conway and Bromage’s representation, although theirs has some redundancy, since some nodes are represented more times than necessary.
We further exploit the mutual information of edges by taking inspiration from the Burrows Wheeler Transform[6].
The Burrows Wheeler Transform[6] is a reversible string permutation that can be searched directly and has the admirable quality of having long strings of repeated characters (great for compression). The easiest way to calculate the BWT of a string is to sort each symbol by its prefix in colex order (that is, alphabetic order of the reverse of the string, not reverse alphabetic!). More information can be found on Wikipedia and this Dr. Dobbs article.
The XBW is a generalisation of the BWT that applies to rooted, labeled trees[7]. The idea is that instead of taking all suffixes, we sort all paths from the root to each node, and support tree navigation (since it isn’t a linearly shaped string) with auxiliary bit vectors indicating which edges are leaves, and which are the last edges (of their siblings) of internal nodes.
I won’t go into detail, but in the next section you should be able to see glimpses of these two ideas.
The simplest construction method[8] is to take every <node, edge> pair and sort them based on the reverse of the node label (colex order), removing duplicates[9]. We also add padding to ensure every node has an incoming and an outgoing edge. This maintains the fact that a node is defined by its previous k edges. An example can be seen below:
You may have spotted that we have flagged some edges with a minus symbol. This is to disambiguate identically labelled incoming edges – edges that exit separate nodes, but have the same symbol, and thus enter the same node. In the example below, the nodes ACG and CGA both have two incoming edges.
Notice that the outgoing edges of each node are stored contiguously? We include a bit vector to represent whether an edge is the last edge exiting a node. This means that each node will have a sequence of zero-or-more 0-bits, followed by a single 1-bit.
Since a 1 in the $L$ vector identifies a unique node, we can use this vector (and select, explained shortly) to index nodes, whereas standard array indexing points to edges.
Finally, instead of storing the node labels we just need to store the final column of the node labels. Since the node labels are sorted, it is equivalent to store an array of first positions:
In total we have a bitvector $L$, an array of flagged edge labels $W$, and a position array $F$ of size $\sigma$ (the alphabet size). Respectively, these take $m$ bits, $m \log{(2\sigma)} = 3m$ bits (for DNA), and $\sigma \log{m} = o(m)$ bits[10], given $m$ edges – a bit over 4 bits per edge. Using appropriate structures (not detailed) we can compress this further, to around 3 bits per edge.
Rank and select are the bread and butter of succinct data structures, because so many operations can be implemented using them alone. $rank_c(i)$ returns the number of occurrences of symbol $c$ on the closed range $[0, i]$, whereas $select_c(i)$ returns the position of the $i^{th}$ occurrence of symbol $c$.
They are kind of like inverse functions, although $rank()$ is not injective, so cannot have a true inverse. For this reason, if you want to find the position of the left-nearest $c$, you would have to use $select()$ in concert with $rank()$ (a common pattern).
Speaking of patterns of use, it may also help to keep this in mind: rank is for counting, and select is for searching, or can be thought of as an indirect addressing technique (such as addressing a node using the $L$ array). Two rank queries can count over a range, whereas two select queries can find a range. A rank and a select query can find a range where either a start point or end point are fixed.
Rank, select and standard array access can all be done in $\mathcal{O}(1)$ time when $\sigma = polylog(N)$[11], if we represent a bitvector using the structure described by Raman, Raman and Rao in 2007[12] (which I explained in an earlier blog post), and for larger alphabets use the index described by Ferragina, Manzini, Mäkinen, and Navarro in 2006[13]. In our implementation, we use modified versions to get it down to 3 bits per edge.
While it might not be obvious, these three arrays provide support for a full suite of navigation operations. An overview is given in these tables, which link to the implementation details that follow.
First, to navigate each of the edges, we define two internal functions (using rank and select calls):
Operation | Description | Complexity |
---|---|---|
\(forward(i)\) | Return index of the last edge of the node pointed to by edge \(i\). | \(\mathcal{O}(1)\) |
\(backward(i)\) | Return index of the first edge that points to the node that the edge at \(i\) exits. | \(\mathcal{O}(1)\) |
Using these two functions, we can implement the less confusing public interface below, which operates on node indexes:
Operation | Description | Complexity |
---|---|---|
\(outdegree(v)\) | Return number of outgoing edges from node \(v\). | \(\mathcal{O}(1)\) |
\(outgoing(v,c)\) | From node \(v\), follow the edge labeled by symbol \(c\). | \(\mathcal{O}(1)\) |
\(label(v)\) | Return (string) label of node \(v\). | \(\mathcal{O}(k)\) |
\(indegree(v)\) | Return number of incoming edges to node \(v\). | \(\mathcal{O}(1)\) |
\(incoming(v,c)\) | Return predecessor node starting with symbol \(c\), that has an edge to node \(v\). | \(\mathcal{O}(k \log \sigma) \) |
The details of the above functions are given in the following sections.
In order to support the public interface, we create for ourselves a simpler way to work with edges: the complementary forward and backward functions.
Recall that all node labels are defined by predecessor edges; thus we have represented each edge in two different places: the \(F\) array (which is equivalent to the last column of the “Node” array), and the edge array \(W\). It follows that, since everything is sorted, the node labels maintain the same relative order as the edge labels. This can be seen in the following figure:
Note that the number of Cs in the last column of Node is different from the number of Cs in \(W\), because the first C in \(W\) points to two edges. For this reason, we ignore the first edge from node GAC (although it doesn’t affect the relative order). In fact, we ignore any edge that doesn’t have L[i] == 1.
Then, following an edge is simply finding the corresponding relatively positioned node! All it takes is some creative counting, using rank and select, as pictured below:
First we access W[i] to find the edge label, then calculate \(rank_C\) up to that row, which gives us the relative ordering of our edges with this label. Let’s call this relative index \(r\). In our example we are following the 2nd C-labeled edge.
To find the 2nd occurrence of C, first we need to know where the first occurrence is. We can use F to find that. Then we can select to the 2nd one, using the last array. Because the last array is binary only, this requires us to count how many 1s there are before the run of Cs (using rank), then adding 2 to land us at the 2nd C.
The W access, rank over W, rank and select over L, accessing F, and the addition are all done in O(1) time, so forward also takes O(1) time.
Backward is very similar to forward, but calculated in a different order: we find the relative index of the node label first, and use that to find the corresponding edge (which may not be the only edge to point to this node, but we define it to point to the first one, that is, the one that isn’t flagged with a minus).
We can find our relative index of the node label by issuing two rank queries instead, and using select on W to find the first incoming edge.
For similar reasons to Forward this is O(1).
This is an easy one. This function accepts a node (not edge!) index v, and returns the number of outgoing edges from that node. Why did I say this was easy?
Well, remember that all our outgoing edges from the same node are contiguous in our data structure. See the diagrams below, paying attention to the node ACG.
By definition (since nodes are unique and sorted), this will always be the case. We also defined our so-called “last” vector $L$ to have a 0 for every edge from a given node, except the last edge. All we need to do is count how many 0s there are, then add 1. Put another way, we need to measure the distance between 1s (since the previous 1 will belong to the previous node). Since we know the node index, we can use select for that!
In the above example, we are querying the outdegree of node 6 (the 7th node due to zero-basing). First we select to find the position of the 7th 1, which gives us the last edge of that node. Then we simply subtract the position of the previous node (node 5, the 6th node): \(select(7) – select(6) = 8 – 6 = 2\). Boom.
Select queries can be answered in O(1) time, so outdegree is also O(1).
Outgoing(v,c) returns the target node after traversing edge c from node v (this edge might not exist). The hard part is finding the correct edge index to follow; after that we can conveniently use the forward() function we defined earlier.
To ease the explanation, consider a simple bit vector 00110100. To count how many 1s there are up to and including the 7th position, we would use rank.
The answer is 3, but at this stage we still don’t know the position of the 3rd 1 (we can’t see the bitvector). In general, it may or may not be in the 7th position.
We could scan backwards, or we could just use select(3) to find the position (since this returns the first position i that has rank(i) = 3).
So essentially we can count how many of those edges there are before the node we are interested in, then use select to find the position. If the position is inside this node’s range (of contiguous edges), then we follow it.
We complicated things a bit by separating our edges into flagged and non-flagged, so we may have to issue two of these queries (for the minus flags). The flagging is useful later in Indegree.
In the next example, our non-flagged edge doesn’t fall in our node’s range, so we make a second query. It is possible that this one will return a positive result, but in the example it doesn’t. By that stage though, we can respond by returning –1 to signal that the edge doesn’t exist.
forward() is defined to move us to the last edge of the resulting node, so the value in last will be 1. Hence, we can use rank to convert our edge index into a node index before returning it.
This is a constant number of calls to O(1) functions, so outgoing is also O(1).
At some point (e.g. during traversal) we are probably going to want to print the node labels out. Let’s work out how to do that (:
Remember, we aren’t storing the node labels explicitly. The F array will come in handy: we can use the position of our node (found using select) as a reverse lookup into F. This can be done in constant time with a sparse bit vector, or in logarithmic time using binary search, or we can use a linear scan if our alphabet is small. In any case, let’s assume it is O(1) time, although a linear scan might be faster in practice (fewer function calls may yield a lower constant coefficient).
In the above example, we select to node 6 (the 7th 1 in last), which gives us the last edge index (any edge will do, but the last one is the easiest to find). This happens to be row 8, so from F we know that all the last symbols on the half-open range [7,10) are G.
Then we just use backward() on the current edge to find an edge that pointed to this node, then rinse and repeat k times.
In a similar manner to outdegree, all we need to do is count the edges that point to the current node label. Take for example the graph:
We can easily find the first incoming edge by using backward(). To count the remaining edges our minus flags come in handy: in the W array, the next G (non-flagged) belongs to a different node, because we defined W to have minus flags if the source nodes have the same $k–1$ suffix (or equivalently, if same-labeled edges share the same target node).
From here it is simple enough to do a linear scan (or even use select) until the next non-flagged edge; the maximum distance it could be is $\sigma^2$. For larger alphabets, a more efficient method is to use rank instead:
First we find the position of the next non-flagged $G$, which gives us the end point of our range (we already have the start point from the first rank). Then we use rank to calculate how many $G-$ there are to the end position, and subtract the initial rank value from this, giving us how many $G-$ occur within the range.
This is once again a constant number of $O(1)$ function calls, which means $indegree()$ is also $O(1)$.
Incoming, which returns the predecessor node that begins with the provided symbol, is probably the most difficult operation to implement. However, it does use approaches similar to the previous functions.
Consider this: from indegree(), we already know how to count each of the predecessor nodes. We can access these nodes if we instead use select to iterate over the predecessor nodes, rather than using rank to simply count. Then, to disambiguate them by first character, we can use label().
A linear scan over the predecessors in this fashion would work, but for large alphabets we can use binary search (with a select call before each array access) to support this in $O(k \log \sigma)$ time; $\log \sigma$ for the binary search, where each access is a $O(1)$ select, followed by $O(k)$ to compute the label.
This is demonstrated in the example below:
In conclusion, by using memory more efficiently, hopefully the cost of genome sequencing can be reduced, both proliferating the technology and giving way to more advanced population analysis. To that end, we have described a novel approach to representing de Bruijn graphs efficiently, while supporting a full suite of navigation operations quickly. Much of the (BWT-inspired) construction can be done efficiently on disk, but we intend to improve this soon to compete with Minia.
The total space is a theoretical $m(2 + \log{\sigma} + o(1))$ bits in general, or $4m + o(m)$ bits for DNA, given $m$ edges. Using specially modified indexes we can lower this to around 3 bits per edge.
I apologize that an efficient implementation isn’t available, nor have I provided experimental results. But if you found this post interesting you can get your hands dirty with the Python implementation I have provided. The results and efficient implementation are on their way (:
And as always, follow me on Twitter, and feel free to send me questions, criticisms, and especially pull requests!
Pevzner, P. A., Tang, H., and Waterman, M. S. (2001). An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences, 98(17):9748–9753. ↩
Simpson, J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones, S. J. M., and Birol, I. (2009). ABySS: A parallel assembler for short read sequence data. Genome Research, 19(6):1117–1123. ↩
Conway, T. C. and Bromage, A. J. (2011). Succinct data structures for assembling large genomes. Bioinformatics, 27(4):479–486. ↩
Okanohara, D. and Sadakane, K. (2006). Practical Entropy-Compressed Rank/Select Dictionary. ↩
Chikhi, R. and Rizk, G. (2012). Space-efficient and exact de Bruijn graph representation based on a Bloom filter. WABI. ↩
M. Burrows and D. J. Wheeler. A Block-sorting Lossless Data Compression Algorithm. Technical Report 124, Digital SRC Research Report, 1994. ↩
P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Compressing and indexing labeled trees, with applications. Journal of the ACM, 57(1):4:1–4:33, 2009. ↩
The paper also describes an online construction method (where we can update by appending), and an iterative method that builds the k-dimensional de Bruijn graph from an existing (k–1)-dimensional de Bruijn graph. ↩
We currently use an external merge sort, but intend to optimise as this is where Minia beats us time-wise. ↩
This is “little o” notation, which may be unfamiliar to some people. Intuitively it means “grows much slower than”, and is stricter than big O. A formal definition can be found on Wikipedia. ↩
http://stackoverflow.com/questions/1801135/what-is-the-meaning-of-o-polylogn-in-particular-how-is-polylogn-defined ↩
R. Raman, V. Raman, and S. R. Satti. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms, 3(4), November 2007. ↩
P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro. Compressed Representations of Sequences and Full-Text Indexes. ACM Transactions on Algorithms, 3(2):No. 20, 2006. ↩
The post Failing at Google Interviews appeared first on Alex Bowe.
During my interviews I didn’t sign an NDA, but I do respect the effort that interviewers put into preparing their questions so I’m not going to discuss them. That doesn’t matter though, because you probably won’t get the same questions anyway, and the algorithm stuff is far from the whole story.
This post is mainly about the rituals I perform during preparation for the interviews, and the lessons I have learned from them. I am of the strong opinion that everyone should apply for a job at Google.
Not everyone wants to work for Google, but there are valuable side effects to a Google interview. Even if you don’t think you want a job there, or think that you are under-qualified, it is a great idea to just try for one. The absolute worst thing that could happen is that you have fun and learn something.
A couple of the things I learned are algorithms for (weighted) random sampling, queueing, vector calculus, and some cool applications of bloom filters.
The people you will talk to are smart, and it’s a fun experience to be able to solve problems with smart and passionate people. One of my interviews was just a discussion about the good and bad parts (in our opinions) of a bunch of programming languages (Scheme, Python, C, C++, Java, Erlang). We discussed SICP and the current state of education, and he recommended some research papers for me to read. All the intriguing questions and back-and-forth made me feel like I was being taught by a modern Socrates (perhaps Google should consider offering a Computer Science degree taught entirely with interviews :P).
Sadly, a subsequent interview stumped me because I didn’t understand the requirements. Even the stumping interviews have given me a great chance to realise some gaps in my knowledge and refine my approach. I knew that it was important to get the requirements right, but this really drove it home.
I hope I’ve got you curious about what you could learn from a Google interview. If you are worried about the possible rejection, treat it as a win in a game of Rejection Therapy. You can re-apply as many times as you like, so you could also think of it as TDD for your skills, and you like TDD, right?
When you are accepted for a phone interview, Google sends you an email giving you tips on how to prepare. Interestingly, this has been a different list each time. I’ll discuss the one I liked the most. They only give advice on the technical side. I will also discuss what I think are some other important aspects to be mindful of.
First of all, you are going to want to practice. Even if you have been coding every day for years, you might not be used to the short question style. Project Euler is the bomb for this. You will learn some maths too, which will come in handy, and it builds confidence. Do at least one of these every day until your interview.
You will also want some reading material. Google recommended this post by Steve Yegge, which does a good job of calming you. They also recommended another post by Steve Yegge where he covers some styles of questions that are likely to be asked. Yegge recommends a particular book very highly – The Algorithm Design Manual:
More than any other book it helped me understand just how astonishingly commonplace (and important) graph problems are – they should be part of every working programmer’s toolkit. The book also covers basic data structures and sorting algorithms, which is a nice bonus. But the gold mine is the second half of the book, which is a sort of encyclopedia of 1-pagers on zillions of useful problems and various ways to solve them, without too much detail. Almost every 1-pager has a simple picture, making it easy to remember. This is a great way to learn how to identify hundreds of problem types.
I haven’t read the whole thing, but what I have read of it is eye and mind opening. This wasn’t recommended to me directly by Google recruiting staff, but one of my interviewers emailed me a bunch of links after, including a link to the page for this book. There was a recent review of this book featured on Hacker News. It is very good. The author, Steve Skiena, also offers his lecture videos and slides – kick back and watch them with a beer after work/uni.
If the size of The Algorithm Design Manual is daunting and you want a short book to conquer quickly (for morale reasons), give Programming Pearls a read. Answer as many questions in it as you can.
Additionally, Interview Cake offers a new approach, which systematises your technical preparation so you can know exactly what to focus on while avoiding becoming overwhelmed. It is a paid service, but they also have a free mailing list with weekly questions to keep you sharp (great for your long-game).
The phone interviews usually are accompanied by a Google doc for you to program into. I usually nominate Python as my preferred language, but usually they make me use C or C++ (they often say I can use Java too). I was rusty with my C++ syntax at the time, but they didn’t seem to mind. I just explained things like using templates, even though I can never remember the syntax for the cool metaprogramming tricks.
Speaking of tricks, you get style points for using features of the language that are less well known. I had an interviewer say he was impressed because I used Python’s pattern matching (simple example: (a, b) = (b, a)). List comprehensions, map/reduce, generators, lambdas, and decorators could all help make you look cool, too. Only use them if they are useful though!
There will also be a few non-technical questions. When I did my first one, a friend recommended that I have answers ready for cookie-cutter questions like “Where do you see yourself in ten years?” and “Why do you want to work for Google?”. Don’t bother with that! Do you really think one of the biggest companies in the world will waste their time asking questions like that? Every candidate would say the same answer, something about leading a team and how Google would let you contribute to society, or whatever (great, but everyone wants that).
They will ask you about your previous work and education, though, and pretty much always ask about a technical challenge you overcame. I like to talk about a fun incremental A* search I did at my first job (and why we needed it to be iterative). You can probably think of something, don’t stress, but better to think of it before the interview.
And have a question ready for when they let you have your turn. Don’t search for “good questions to ask in technical interviews”, because if it isn’t your question, you might be uninterested if the interviewer talks about it for a long time. Think of something that you could have a discussion about, something you are opinionated about. Think of something you hated at a previous job (but don’t come across as bitter), how you would improve that, and then ask them if they do that. For me, I was interested in the code review process at Google, and what sort of project they would assign to a beginner.
I know someone who asked questions from The Joel Test. The interviewer might recognise these questions and either congratulate you on reading blogs about your field, or quietly yawn to themselves. It's up to you if you want to take that risk (well, it's not a big risk). I definitely think it's better to ask about something that has the potential to annoy you on a personal level if they don't give you the answer you want ;) it's subtle, but people can detect your healthy arrogance and passion.
If you have a tech blog, refer to it. I’ve had interviewers discuss my posts with me (which they found from my resume). Blogs aren’t hard to write, and even a few posts on an otherwise barren blog will make you look more thoughtful.
Finally, the absolute best way to prepare for a Google interview is to do more Google interviews, so if you fail, good for you! ;)
Here are a few things that help me handle the pressure before an interview.
One time I was walking to an interview in the city (not a Google interview) and I was really nervous, even though I didn’t care either way if I got the job. I thought about how the nerves wouldn’t be an issue after the interview, because I’d have already done the scary thing by then. I couldn’t time travel, but I instead wondered if there is a way to use up the nerves on something else.
There was a girl walking next to me, so I turned to her and said she was dressed nicely. She said a timid “thank you” and picked up pace to get away from me. I laughed at my failure, but suddenly I didn’t feel so scared about the interview. I think this is a great example of why Rejection Therapy is worth experimenting with.
So yeah, talk to a stranger. If you are waiting at home for a phone call, though, another thing I do is jumping jacks, dancing, or jogging on the spot – just to make myself forget the other reason my heart is pounding so fast.
If you are doing a phone interview, answer it standing up (you can sit down after) and pace around a little bit. Smile as you talk, as well. You should also write down their name on paper, ready to use a few times casually. These are tricks from the infamous How to Win Friends and Influence People. Maybe these alone won't make you likeable, but I think it causes you to think about the other person and stop being so self-conscious, which helps you to relax. You'll be one charming motherfucking pig.
Take some time to think before answering, and especially to seek clarification on the questions. Ask what the data representation is. I’ve found that they tend to say “whatever you want”. In a graph question, I said “Okay, then it’s an adjacency matrix”, which made the question over and done with in ten seconds. The interviewer seemed to like that, so don’t be afraid to be a (humble) smart ass.
You might recognise the adjacency matrix as potentially being a very poor choice, depending on the nature of the graph. I did discuss when this might not be a good option. In fact, for every question, I start off by describing a naive approach, and then refine it. This helps to verify the question requirements, and gives you an easy starting point. Maybe you could introspectively comment on agile methodology (Google practises Scrum).
One last thing! Google schedules the interview to be from 45 minutes to an hour. I have had awkward moments at the end of interviews where the interviewer mentions that our time is nearly up, and then asks another question, or asks if I have any questions. It made me feel like he was in a rush, so I didn’t feel like expanding on things much. Now, I recommend taking as much time as they will give you. Keep talking until they hang up on you if you have to :) although it might help to say “I don’t mind if we go over, as long as I’m not keeping you from something” when the interviewer mentions the time.
Steve Yegge says there are lots of smart Googlers who didn’t get in until their third attempt (I still haven’t gotten in after my fourth, and I don’t think I’m stupid). As I mentioned, I’m writing this post because I found the process of doing a Google interview at all to be very rewarding.
It is important to reflect afterwards in order to reap the full benefits of interviewing at Google. If you did well, why? But more importantly, if you feel you did poorly, why? Google won’t give feedback, which can be a bit depressing at times. After each interview write notes about what you felt went well and what didn’t – this way you can look back if you don’t get the job, and decide what you need to work on. This post is the culmination of my reflections and the notes – if you decide to write a blog post, I’d enjoy reading it and will link it here.
If you want more blog posts to read about how to get better at Computer Science, I recently found this post by Matt Might to be a good target to aim for. Check out Ten Things Every Computer Science Major Should Learn by Macneil Shonle as well, and my previous post Advice to CS Undergrads (the links at the end in particular).
And as always, please read the comments below and add your own thoughts to the discussion. In particular, Sumit Arora gave some important advice that I didn’t cover.
Have you really read this far? Consider adding me to Twitter and telling me what you thought :)
The post Failing at Google Interviews appeared first on Alex Bowe.
]]>The post FM-Indexes and Backwards Search appeared first on Alex Bowe.
]]>Last time (way back in June! I have got to start blogging consistently again) I discussed a gorgeous data structure called the Wavelet Tree. When a Wavelet Tree is stored using RRR sequences, it can answer rank and select operations in $\mathcal{O}(\log{A})$ time, where A is the size of the alphabet. If the size of the alphabet is $2$, we could just use RRR by itself, which answers rank and select in $\mathcal{O}(1)$ time for binary strings. RRR also compresses the binary strings, and hence compresses a Wavelet Tree which is stored using RRR.
So far so good, but I suspect rank and select queries seem to be of limited use right now (although once you are familiar with the available structures, applications show up often). One of the neatest uses of rank that I’ve seen is in substring search, which is certainly a wide reaching problem (for a very recent application to genome assembly, see Jared Simpson’s paper from 2010 called Efficient construction of an assembly string graph using the FM-index).
Note that arrays and strings are one-based (not zero-based).
There is a variety of Suffix Array construction algorithms, including some $\mathcal{O}(N)$ ones (Puglisi et al. 2007). However, I will explain it from the most common (and intuitive) angle.
In its simplest form, a suffix array for a string $S[1..N]$ can be constructed by generating all $N$ suffixes, sorting them lexicographically, and recording the starting position of each suffix in that sorted order.
For example, the sorting of the string 'mississippi' with terminating character $\$$ would look like this:
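As a toy sketch (sorting all the suffixes naively, rather than using one of the linear-time algorithms; the function name and the one-based convention are my own), the construction can be written in Python:

```python
def suffix_array(s):
    """Naive suffix array construction: sort all suffixes of s
    lexicographically and record their one-based starting positions."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])
```

For 'mississippi$' this yields [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3] – the $\$$ suffix first, then the $i$-suffixes, and so on.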
The Burrows-Wheeler Transform (BWT) was developed by Burrows and Wheeler to reversibly permute a string in such a way that characters from repeated substrings would be clustered together. This makes it useful for compression schemes such as run-length encoding.
It is not the point of this blog to explain how it works, but it is closely linked to Suffix Arrays: $BWT[i] = S[SA[i] - 1]$ when $SA[i] > 1$, and $BWT[i] = \$ $ when $SA[i] = 1$ (it wraps around), for the original string $S$, Suffix Array $SA$, and Burrows-Wheeler Transform string $BWT$. In other words, the $i^{th}$ symbol of the BWT is the symbol just before the $i^{th}$ suffix. See the image below:
In particular, $BWT[1] = S[SA[1] - 1] = S[12 - 1] = S[11] = i$ (or the $11^{th}$ symbol from the original string 'mississippi').
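That relationship can be sketched directly (a toy helper of my own; `sa` is one-based, as above):

```python
def bwt_from_sa(s, sa):
    """BWT[i] = S[SA[i] - 1]; when SA[i] = 1 the transform wraps around
    to the final symbol (the terminator)."""
    return ''.join(s[i - 2] if i > 1 else s[-1] for i in sa)
```

For 'mississippi$' this produces 'ipssm$pissii' – note the clustered runs of s's and i's, which is exactly what makes the BWT friendly to run-length encoding.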
Ferragina and Manzini (Ferragina et al. 2000) recommended that a BWT be paired with a Suffix Array, creating the so-called FM-Index, which enables backward search. The BWT also lets us reconstruct the original string $S$ (not covered in this blog), allowing us to discard the original document – indexes with this property are known as self indexes.
This is where rank comes in. If it is hard to follow (it is certainly not easy to explain) then hang in there until the example, which should clear things up.
Since any pattern $P$ in $S$ (the original string) is a prefix of a suffix (our Suffix Array stores suffixes), and because the suffixes are lexicographically ordered, all occurrences of a search pattern $P$ lie in a contiguous portion of the Suffix Array. One way to hone in on our search term is to use successive binary searches. Storing the BWT lets us use a cooler way, though…
Backward search instead utilises the BWT in a series of paired rank queries (which can be answered with a Wavelet Tree, for example), improving the query performance considerably.
Backward search issues $p$ pairs of rank queries, where $p$ denotes the length of the pattern $P$. The paired rank queries are:
$$
\begin{align}
s^\prime &= C[P[i]] + rank(s-1, P[i]) + 1 \\
e^\prime &= C[P[i]] + rank(e, P[i])
\end{align}
$$
Where $s$ denotes the start of the range and $e$ is the end of the range. Initially $s = 1$ and $e = N$. If at any stage $e \lt s$, then $P$ doesn’t exist in $S$.
As for $C$… $C$ is a lookup table containing the count of all symbols in our alphabet which sort lexicographically before $P[i]$. What does this mean? Well, $C$ would look like this:
Which means that there aren’t any characters in $S$ that sort before $\$$, one that sorts before $i$ (the $\$$), five that sort before $m$ (the $\$$ and the $i$s) and so on. In the example I store it in a less compact way as the column $F$ (which contains the first symbol for each suffix – essentially the same information, since each suffix is sorted), so it might be easier to follow (wishful thinking).
Why is this called backwards search? Well, our index variable $i$ actually starts at $|P|$ (the last character of our search pattern), and decreases to $1$. This maintains the invariant that $SA[s..e]$ contains all the suffixes of which $P[i..|P|]$ is a prefix, and hence all locations of $P[i..|P|]$ in $S$.
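The loop just described can be sketched in Python. This is only a sketch: the naive `rank` closure here scans the BWT in linear time, where a real FM-Index would answer it with a Wavelet Tree or RRR, and $C$ is the lookup table from the section above:

```python
def backward_search(bwt, C, rank, P):
    """Return the suffix array range (s, e) containing all occurrences
    of pattern P, or None if P does not occur. Indices are one-based."""
    s, e = 1, len(bwt)
    for c in reversed(P):              # process the pattern backwards
        s = C[c] + rank(s - 1, c) + 1
        e = C[c] + rank(e, c)
        if e < s:                      # the range became empty: no match
            return None
    return s, e

# Naive helpers for 'mississippi$' (C as described in the text).
bwt = "ipssm$pissii"
C = {'$': 0, 'i': 1, 'm': 5, 'p': 6, 's': 8}
rank = lambda i, c: bwt[:i].count(c)   # count of c in BWT[1..i]
```

Running backward_search(bwt, C, rank, 'iss') yields (4, 5) – the same $SA[4..5]$ range the worked example arrives at.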
Let’s practice this magic spell…
Let our search pattern $P$ be 'iss', and our string $S$ be 'mississippi'. Starting with $i = 3$, $c = P[i] = $ 's'. The working for each rank query is shown below each figure. I'm representing the current symbol as $c$ to avoid confusion between 's' and $s$ and $s^\prime$.
Here (above) we are at the first stage of backwards search for 'iss' on the string 'mississippi' – before any rank queries have been made.
Note: we do not store the document anymore – the gray text – and we don’t store $F$, but instead store $C$ – see section on Backward Search.
Starting from $s = 1$ and $e = 12$ (as above) and $c = P[i] = $ 's' where $i = 3$, we make our first two rank queries:
$$
\begin{align}
s^\prime &= C[c] + rank(0, c) + 1 = 8 + 0 + 1 = 9 \\
e^\prime &= C[c] + rank(12, c) = 8 + 4 = 12
\end{align}
$$
After the above, we are now at the second stage of backwards search for 'iss' on 'mississippi'. All the occurrences of 's' lie in $SA[9..12]$.
From $s = 9$ and $e = 12$, and $c = P[i] = $ 's' where $i = 2$, our next two rank queries are:
$$
\begin{align}
s^{\prime\prime} &= C[c] + rank(8, c) + 1 = 8 + 2 + 1 = 11 \\
e^{\prime\prime} &= C[c] + rank(12, c) = 8 + 4 = 12
\end{align}
$$
We are now at the third stage of backwards search for 'iss' on 'mississippi'. All the occurrences of 'ss' lie in $SA[11..12]$.
From $s = 11$ and $e = 12$, and $c = P[i] = $ 'i' where $i = 1$, our final two rank queries are:
$$
\begin{align}
s^{\prime\prime\prime} &= C[c] + rank(10, c) + 1 = 1 + 2 + 1 = 4 \\
e^{\prime\prime\prime} &= C[c] + rank(12, c) = 1 + 4 = 5
\end{align}
$$
This is the fourth and final stage of our backwards search for 'iss' in the string 'mississippi'. All the occurrences of 'iss' lie in $SA[4..5]$.
It impresses me every time…
No doubt you want to get your hands dirty. I have played around with libdivsufsort before, although I think you may have to implement backward search yourself (it’d be a good exercise), since it doesn’t appear to come with fast rank query providers. For rank structures for your BWT you might want to check out libcds. In fact there are heaps out there, but I haven’t used them yet.
Also, please comment here if you develop something cool with it :) and as always, if you have journeyed this far, consider following me on Twitter: @alexbowe.
Ferragina, P. and Manzini, G. (2000). Opportunistic data structures with applications. Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science, pages 390–398.
S. J. Puglisi, W. F. Smyth, and A. Turpin. A taxonomy of suffix array construction algorithms. ACM Computing Surveys, 39(2):1–31, 2007.
Simpson, J. T. and Durbin, R. (2010). Efficient construction of an assembly string graph using the FM-index. Bioinformatics, 26(12):i367–i373.
The post FM-Indexes and Backwards Search appeared first on Alex Bowe.
]]>The post Wavelet Trees – an Introduction appeared first on Alex Bowe.
]]>Today I will talk about an elegant way of answering rank queries on sequences over larger alphabets – a structure called the Wavelet Tree. In my last post I introduced a data structure called RRR, which is used to quickly answer rank queries on binary sequences, and provide implicit compression.
A Wavelet Tree organises a string into a hierarchy of bit vectors. A rank query has $\mathcal{O}(\log_2{A})$ time complexity, where $A$ is the size of the alphabet. It was introduced by Grossi, Gupta and Vitter in their 2003 paper High-order entropy-compressed text indexes [4] (see the Further Reading section for more papers). It has since been featured in many papers [1, 2, 3, 5, 6].
If you store the bit vectors in RRR sequences, it may take less space than the original sequence. Alternatively, you could store the bit vectors in the rank indexes proposed by Sadakane and Okanohara [7], which take a different approach to compression. I will talk about it another time ;) – fortunately, I will be studying under Sadakane-sensei at a later date (update: now I'm doing my Ph.D. under him in Tokyo).
In a different future post, I will show how Suffix Arrays can be used to find arbitrary patterns of length $P$, by issuing $2P$ rank queries. If using a Wavelet Tree, this means a pattern search has $\mathcal{O}(P \log_2{A})$ time complexity – that is, the size of the 'haystack' doesn't matter; it instead depends on the size of the 'needle' and the size of the alphabet.
A Wavelet Tree converts a string into a balanced binary-tree of bit vectors, where a $0$ replaces half of the symbols, and a $1$ replaces the other half. This creates ambiguity, but at each level this alphabet is filtered and re-encoded, so the ambiguity lessens, until there is no ambiguity at all.
The tree is defined recursively as follows:
For the string "Peter Piper picked a peck of pickled peppers" (spaces and a string terminator have been represented as $\_$ and $\$$ respectively, due to convention in the literature) the Wavelet Tree would look like this:
note: the strings aren’t actually stored, but are shown here for convenience
It has the alphabet $\{ \$, P, \_, a, c, d, e, f, i, k, l, o, p, r, s, t \}$, which would be mapped to $\{ 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 \}$. So, for example, $\$$ would map to $0$, and $r$ would map to $1$.
The left subtree is created by taking just the 0-encoded symbols $\{ \$, P, \_, a, c, d, e, f \}$ and then re-encoding them by dividing this new alphabet: $\{ 0, 0, 0, 0, 1, 1, 1, 1 \}$. Note that on the first level an $e$ would be encoded as a $0$, but now it is encoded as a $1$ (it becomes a $0$ again at a leaf node).
We can store the bit vectors in RRR structures for fast binary rank queries (which are needed, as described below), and compression :) In fact, since it is a balanced tree, we can concatenate each of the levels and store it as one single bit vector.
Recall from my last post that a rank query is the count of $1$-bits up to a specified position. Rank queries over larger alphabets are analogous – instead of a $1$, it may be any other symbol:
After the tree is constructed, a rank query can be done with $\log_2{A}$ ($A$ is the alphabet size) binary rank queries on the bit vectors – $\mathcal{O}(1)$ each if you store them in RRR or another binary rank index. The encoding at each internal node may be ambiguous, but of course it isn't useless – we use the ambiguous encoding to guide us to the appropriate sub-tree, and keep doing so until we have our answer.
For example, if we wanted to know $rank(5, e)$, we use the following procedure which is illustrated below. We know that $e$ is encoded as $0$ at this level, so we take the binary rank query of $0$ at position $5$:
Which is $4$, which we then use to indicate where to rank in the $0$-child: the $4^{th}$ bit (or the bit at position $3$, due to $0$-basing). We know to query the $0$-child, since that is what $e$ was encoded as at the parent level. We then repeat this recursively:
At a leaf node we have our answer. I would love to explain why this works, but it is fun and rewarding to think about it yourself ;)
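To make the recursion concrete, here is a toy Python sketch (the names and the nested-dict layout are my own; a real implementation would concatenate the levels into one bit vector and use RRR for the binary ranks, rather than linear counting):

```python
def build_wavelet(s, alphabet=None):
    """Build a toy Wavelet Tree: each node holds its bit vector and the
    set of symbols encoded as 0 at that node."""
    if alphabet is None:
        alphabet = sorted(set(s))
    if len(alphabet) == 1:
        return {'leaf': alphabet[0]}
    half = (len(alphabet) + 1) // 2
    left_syms = set(alphabet[:half])
    return {
        'left_syms': left_syms,
        'bits': [0 if c in left_syms else 1 for c in s],
        'children': (
            build_wavelet(''.join(c for c in s if c in left_syms), alphabet[:half]),
            build_wavelet(''.join(c for c in s if c not in left_syms), alphabet[half:]),
        ),
    }

def wt_rank(node, i, c):
    """rank(i, c): occurrences of c in positions 0..i (zero-based here),
    using one binary rank query per level of the tree."""
    if i < 0:
        return 0
    if 'leaf' in node:
        return i + 1
    b = 0 if c in node['left_syms'] else 1
    r = node['bits'][:i + 1].count(b)   # binary rank; RRR would make this O(1)
    return wt_rank(node['children'][b], r - 1, c)
```

For example, wt_rank(build_wavelet('mississippi'), 5, 's') gives 3 – three s's in 'missis'. The zero-based indexing here is an implementation choice.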
There are also ways to provide fast select queries, but once again I will leave that up to you to research. The curious among you might also be interested in the Huffman-Shaped Wavelet Tree described by Mäkinen and Navarro [5].
Feel free to implement this yourself, but if you want to get your hands dirty right away, all-around-clever-guy Francisco Claude has made an implementation available in his Compressed Data Structure Library (libcds). If you create something neat with it be sure to report back ;)
Update: Terence Siganakis wrote a blog post about Wavelet Trees that made it to the front page of Hacker News, encouraging an interesting discussion. The discussion is here.
And if you read this far, consider following me on Twitter: @alexbowe.
I didn’t want to saturate this blog post with proofs and other details, since it was meant to be a light introduction. If you want to dive deeper into this beautiful structure, check out the following papers:
[1] F. Claude and G. Navarro. Practical rank/select queries over arbitrary sequences. In Proceedings of the 15th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS 5280, pages 176–187. Springer, 2008.
[2] P. Ferragina, R. Giancarlo, and G. Manzini. The myriad virtues of wavelet trees. Information and Computation, 207(8):849–866, 2009.
[3] P. Ferragina, G. Manzini, V. M ̈akinen, and G. Navarro. Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms, 3(2):20, 2007.
[4] R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. In Proceedings of the 14th annual ACM-SIAM symposium on Dis- crete algorithms, pages 841–850. Society for Industrial and Applied Mathematics, 2003.
[5] V. Mäkinen and G. Navarro. Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing, 12(1):40–66, 2005.
[6] V. Mäkinen and G. Navarro. Implicit compression boosting with applications to self-indexing. In Proceedings of the 14th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS 4726, pages 214–226. Springer, 2007.
[7] D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. Arxiv Computing Research Repository, abs/cs/0610001, 2006.
The post Wavelet Trees – an Introduction appeared first on Alex Bowe.
]]>The post RRR – A Succinct Rank/Select Index for Bit Vectors appeared first on Alex Bowe.
]]>This blog post will give an overview of a static bitsequence data structure known as RRR, which answers arbitrary length rank queries in $\mathcal{O}(1)$ time, and provides implicit compression.
As my blog is informal, I give an introduction to this structure from a birds eye view. If you want, read my thesis for a version with better markup, and follow the citations for proofs by people smarter than myself :)
My intended future posts will cover the other aspects of my thesis, including generalising RRR (for sequences over small alphabets), Wavelet Trees (which answer rank queries over bigger alphabets), and Suffix Arrays (a text index which – when combined with the above structures – can answer queries in $\mathcal{O}(P \log_2 A)$ time, where $P$ is the length of the search pattern, and $A$ is the alphabet size).
Update: I have now posted about Wavelet Trees! Check it out here.
Cracking the Oyster, the first column of Programming Pearls, opens with a programmer asking for advice when sorting around ten million unique seven-digit integers – phone numbers.
After some discussion, the author concludes that a bitmap should be used. If we wanted to store ten million integers, we could use an array of $32$-bit integers, consuming $38$ MB, or we could represent our numbers as positions on a number line.
All of these phone numbers will be within the range $[0000000, 9999999]$. To represent the presence of these numbers, we only need a bitmap $10^7$ bits long, about $1$ MB, which would represent our number line. Then, for a bitmap $M$, if we want to store phone number $p$, we set the bit $M[p]$ to $1$. Sorting would involve setting the numbers that are present to $1$, then iterating over the bitmap, printing the positions of the $1$-bits – $\mathcal{O}(N)$ time.
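The bitmap sort can be sketched in a few lines of Python (a toy version; the name is mine, and a Python int stands in for the bitmap $M$):

```python
def bitmap_sort(numbers):
    """Sort distinct non-negative integers by marking each one as a set
    bit on a 'number line' bitmap, then reading the set positions back
    in order."""
    m = 0
    for p in numbers:
        m |= 1 << p                  # set bit M[p]
    out, pos = [], 0
    while m:
        if m & 1:
            out.append(pos)          # bit at position pos is set
        m >>= 1
        pos += 1
    return out
```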
In the following sections, I will detail operations that can be done on bitmaps, named rank and select, and explain how to answer rank queries in $\mathcal{O}(1)$ time and implicitly compress the bitmap. Using rank and select, a compressed bitmap can be a very powerful way to store sets – and not just sets of numbers, but all sorts of things, such as tree or graph nodes.
Allow me to extend the problem. I want to query our simple phone number database to see how many phone numbers are allocated within the range $[0005000, 0080000]$. I could iterate over that range and update a counter whenever I encounter a $1$-bit. Actually, this operation is what is known as a rank operation.
The operation $rank(i)$ is defined as the number of set bits ($1$s) in the range $[0, i]$ (or $[0, i)$ in some papers). In the bitstring above, the answer to $rank(5)$ is $3$… This is a generalisation of the popcount operation which counts all set bits, which I have discussed before (here and here). $rank(i)$ can be implemented by left-shifting $L - i - 1$ bits (where $L$ is the length of the datatype you are using, int, long, etc) to remove the unwanted bits, then calling $popcount$ on the resulting value. This could be done iteratively over an array if you want, but I will discuss a much faster way below.
Then, the above question can be answered as: $rank(0080000) – rank(0005000 – 1)$. This will give us just the number of $1$s between $0005000$ and $0080000$.
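Here is a sketch of rank via popcount on a Python integer bitmap (bit 0 taken as the least significant bit; Python's arbitrary-precision ints stand in for an array of machine words, and masking stands in for the shift described above):

```python
def rank1(m, i):
    """rank(i): number of 1-bits of bitmap m in positions [0, i]."""
    mask = (1 << (i + 1)) - 1        # keep only bits 0..i
    return bin(m & mask).count('1')  # popcount of the masked value
```

With bits 3, 5 and 7 set, rank1(m, 7) - rank1(m, 4) counts the set bits in the range $[5, 7]$, giving 2.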
This isn’t the only place we would use a popcounts; it happens that popcounts are common enough that we want to optimise them. Check out this blog post at valuedlessons.com for a discussion and empirical comparison of several fast approaches.
As it happens, we can build a data structure for static bitmaps that answers rank queries in $\mathcal{O}(1)$ time, and provides implicit compression. It is what is known as a succinct data structure, which means that even though it is compressed, we don't need to decompress the whole thing to operate on it efficiently. Sadakane (a respected researcher in succinct data structures) gives a nice analogy in his introduction of the field, likening it to forcing dehydrated noodles apart with your chopsticks (decompression) as you are rehydrating them, but before the whole thing is fully cooked and separated. This allows you to keep some of the noodles compressed while you eat the decompressed fragment.
Since it is static, it isn't well suited to a bitmap which you want to update (although work has been done toward this) – but it is still really cool :)
The structure I'm referring to is named RRR. It sounds like a radio station, but it is named after its creators: Raman, Raman, Rao, from their 2002 paper Succinct indexable dictionaries with applications to encoding $k$-ary trees and multisets. It's a data structure I had to become intimately involved with for my honours thesis, where I extended it for sequences of larger (but still small) alphabets. If you want to answer rank queries on large alphabets, a wavelet tree might be what you are after, but that will be covered in a different blog post (or you could read my thesis!).
In my last post (Generating Binary Permutations in Popcount Order) I discussed how to compress a bitstring by replacing blocks of a certain blocksize with their corresponding pop number, and (variable length) offset into a lookup table. I briefly mentioned building an index over it to improve lookup as well.
To construct a RRR sequence we divide our bitmap into blocks, as I mentioned in my previous blog post. These are grouped in superblocks, too, which allows us to construct an index to enable $\mathcal{O}(1)$ rank queries. In the following image, I have fragmented the bitmap using a blocksize of $b = 5$, and grouped them with a superblock factor of $f = 3$ – so each superblock is three blocks.
First we replace the blocks with a pair of values, a class value $C$ and offset value $O$, which are used together as a lookup key into a table of precomputed (small – for each possible block only) ranks – this is demonstrated in the figure below. This is the same as the previous blog post, although in that I called the “class” $P$. This is because the class of a block is defined as the popcount – the number of set bits – in the block: $class(B) = popcount(B)$ for block $B$.
The table is shared among all your RRR sequences, and is in fact a table of tables, where $C$ points to the first element for the ranks of a given popcount:
For this table (let's call it $G$), for a given class $C$, the sub-table at $G[C]$ has $b \choose C$ entries, which correspond to all possible permutations that have a popcount of $C$. This means that while our $C$ values will always take $\log{(b + 1)}$ bits (enough to represent the values $0, 1, 2, \ldots, b$ – all possible popcount values for the blocksize), our $O$ values will vary in size, requiring $\log{b \choose C}$ bits (oh yeah, and of course I'm using $\log_2$ here :)). During a query, we can use our $C$ values to work out how many bits will follow for the $O$ values.
Using this approach alone we get the compression, but not $\mathcal{O}(1)$ ranks. $C$ is fixed width, the compression comes from $O$ being varied width.
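The $(C, O)$ pair for a single block can also be computed combinatorially rather than by searching the table. This is a sketch, under the assumption that $O$ numbers the same-class blocks in lexicographic order (the function name is mine):

```python
from math import comb

def class_offset(block, b):
    """Encode a b-bit block as (class, offset): class is the popcount,
    offset is the block's index among all b-bit values with that
    popcount, taken in lexicographic order."""
    c = bin(block).count('1')
    offset, ones_left = 0, c
    for pos in range(b - 1, -1, -1):        # walk from the most significant bit
        if block >> pos & 1:
            # every same-class block with a 0 here (and the same prefix)
            # sorts before this one
            offset += comb(pos, ones_left)
            ones_left -= 1
    return c, offset
```

For $b = 5$ and class $3$, this gives 00111 → (3, 0) and 11100 → (3, 9), matching a lexicographic table of the $\binom{5}{3} = 10$ possible blocks.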
In order to get the $\mathcal{O}(1)$ ranks we use a method discussed by Munro in Tables, 1996. This is where the superblocks come in to play:
For each superblock boundary we store the global rank up to that position. We also store a prefix sum of the bits, which gives us the address to the first block in the next superblock (since it is variable length!). This allows us to not require iterating over the whole RRR sequence, but instead going straight to the required superblock. We will only need to iterate over the blocks within a superblock, so it is now bound by whatever your superblock factor is.
To calculate $rank(i)$: first jump to the superblock containing position $i$, taking its stored global rank as a running total, and follow its prefix-sum pointer to the superblock's first block. Then scan the blocks up to (but not including) the block containing $i$, adding each block's class $C$ (its popcount) to the total. Finally, for the block containing $i$, use the $(C, O)$ pair to look up the partial rank within the block from the precomputed table, and add it to the total.
Select is the inverse operation to rank; it answers the question “at which position is the $i^{th}$ set bit?”. To tie this in with the phone numbers example, maybe we want to find out the fiftieth phone number in the set (excluding unassigned numbers). This is a way we can index just the present elements of a bitmap. It turns out select can be answered in $\mathcal{O}(1)$ time as well. I won’t cover select here, as my future posts (and thesis) will mainly use rank. You can read about it in the RRR paper.
Feel free to implement this (somewhat complicated) data structure yourself, or you can use a pre-rolled one by my friend Francisco Claude in his LIBCDS – Compressed Data Structure Library.
If you read this far, consider adding me to twitter :) or you may enjoy reading my post on Wavelet Trees.
The post RRR – A Succinct Rank/Select Index for Bit Vectors appeared first on Alex Bowe.
]]>The post Generating Binary Permutations in Popcount Order appeared first on Alex Bowe.
]]>I’ve been keeping an eye on the search terms that land people at my site, and although I get the occasional “alex bowe: fact or fiction” and “alex bowe bad ass phd student” queries (the frequency strangely increased when I mentioned this on Twitter) I also get some queries that relate to the actual content.
One query I received recently was "generating integers in popcount order", I guess because I mentioned popcounts (the number of 1-bits in a binary string) in a previous post, but that post wasn't able to answer the visitor's question.
What would this be used for? Among other applications, I have used it for generating a table of numbers ordered by popcount, which I used in a compression algorithm: by breaking a bitstring into fixed-length chunks (of B bits) and replacing them with a (P, O) pair, where P is the block’s popcount which can be used to point to the table where each entry has the popcount P, and O is the offset in that subtable. Then P can be stored with log2(B + 1) bits – we need to represent all possible P values from 0 to B – and O can be stored with log2(binomial(B, P)) bits.
Note that the bit-length of P varies; binomial represents the binomial coefficient, which can be seen in Pascal’s triangle expanded to row 5:
So binomial(5, x) for x = 0, 1, … , 5 yields the sequence 1, 5, 10, 10, 5, 1 – some things take more bits than others. Once you know the P value, you will know how many bits the O value is, so you can read it that way. This means access is O(N) (since each values position relies on the previous P values), but you can build an index on top of that to allow O(1) lookup. But all this is a story for another time ;) [1].
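As a quick sanity check of those bit widths (plain Python; math.comb provides the binomial coefficient):

```python
from math import ceil, log2, comb

B = 5
# P must represent the values 0..B, so it needs ceil(log2(B + 1)) bits.
p_bits = ceil(log2(B + 1))
# O needs ceil(log2(binomial(B, P))) bits for each popcount P; popcounts
# with only one permutation are unambiguous and need no offset bits.
o_bits = [ceil(log2(comb(B, p))) if comb(B, p) > 1 else 0
          for p in range(B + 1)]
```

This gives p_bits = 3 and o_bits = [0, 3, 4, 4, 3, 0] – the popcounts with 10 permutations each need 4 bits of offset, while 00000 and 11111 need none.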
Here is the code:
def next_perm(v):
    """
    Generates the next permutation with a given amount of set bits,
    given the previous lexicographical value.
    Taken from http://graphics.stanford.edu/~seander/bithacks.html
    """
    t = (v | (v - 1)) + 1
    w = t | ((((t & -t) // (v & -v)) >> 1) - 1)  # // so this also works in Python 3
    return w
This will take a number with a certain popcount, and generate the next number with the same popcount. For example, if you feed it 7, which is 111 in binary, you will get 1011 back – or 11 – the next number with the same popcount (lexicographically speaking).
To find the first number of a given popcount, you can use this:
def element_0(c):
"""Generates first permutation with a given amount
of set bits, which is used to generate the rest."""
return (1 << c) - 1
I should clear up what lexicographically means in the context of numbers. Well, it’s actually the same as any other symbols (such as an alphabet), 0 is the symbol that comes before 1 (if it helps, you can picture 0 as a and 1 as b):
00111 aabbb
01011 ababb
01101 abbab
01110 abbba
10011 baabb
10101 babab
10110 babba
11001 bbaab
11010 bbaba
11100 bbbaa
Looking at the above pattern, here is some loose pseudocode that may help us understand how the above bithacks work:
1. Find the rightmost contiguous run of 1s.
2. Zero out that run.
3. Set the bit immediately to the left of where the run was.
4. Take the number of 1s that were in the run, minus one.
5. Move that many 1s to the rightmost (least significant) positions.
Understanding element_0() is pretty easy. 1 << c is the same as moving a 1 to position c; then subtracting 1 sets all the bits from 0 to c – 1, giving c set bits:
c = 4
1 << c = 10000
10000 - 1 = 01111
next_perm() is a bit more complicated. The v | (v - 1) in t = (v | (v - 1)) + 1 right-propagates the rightmost bit. Allow me to show an example: 01110 | 01101 = 01111. In the case of 0 this isn't quite correct: 00000 | 11111 = 11111. But it's okay, because we proceed to add 1 to this value (which returns it to zero). This increment, combined with the right-propagation, will do step 2, 3, and part of step 5 above. For example, 10100 -> 10111 -> 11000.
We are on our way to generating integers by popcount in lexicographical order.
Now let's break down the next line, w = t | ((((t & -t) / (v & -v)) >> 1) - 1). Bitwise expressions of the form (x & -x) isolate the rightmost set bit:
01110 & 10010 (two's complement) = 00010
If you take the two's complement, you invert a number's bits and then add 1. If you think about it, after inverting there are 0s where there were 1s, and 1s where there were 0s; adding 1 then carries through the trailing 1s, so the only position that remains set in both numbers is the rightmost 1-bit of the original.
So let R(x) denote the isolated rightmost bit of x. Then for x = 01110 we calculate t = 10000, R(x) = 00010 and R(t) = 10000.
Following the calculation of w, we need to divide them: R(t) / R(x) = 10000 / 00010 = 01000
Shift to the right by 1: 00100
Subtract 1: 00011
Then we bitwise-or them to stick them together: 10000 | 00011 = 10011. This corresponds to our table above :)
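To tie those steps together, here is a trace of the intermediate values in Python (my throwaway code, not from the bithacks page; note the integer division):

```python
def trace_next_perm(v):
    """Trace the intermediate values of next_perm for a 5-bit input."""
    b = lambda x: format(x, '05b')
    t = (v | (v - 1)) + 1        # right-propagate the rightmost bit, then add 1
    r_v = v & -v                 # R(v): isolated rightmost set bit of v
    r_t = t & -t                 # R(t): isolated rightmost set bit of t
    w = t | (((r_t // r_v) >> 1) - 1)
    print('t =', b(t), 'R(v) =', b(r_v), 'R(t) =', b(r_t), 'w =', b(w))
    return w

trace_next_perm(0b01110)  # t = 10000 R(v) = 00010 R(t) = 10000 w = 10011
```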
So w = t | ((((t & -t) / (v & -v)) >> 1) - 1) corresponds to the rest of step 5 (the moving part; t was the zeroing part of moving the sub-range) in our pseudocode. Well, kind of, anyway: there are a few steps happening in parallel, but the pseudocode was only loosely explaining what was happening :)
def gen_blocks(p, b):
    """
    Generates all blocks of a given popcount and blocksize
    """
    v = initial = element_0(p)
    block_mask = element_0(b)
    while (v >= initial):
        yield v
        v = next_perm(v) & block_mask
>>> for x in gen_blocks(3, 5): print bin(x, 5)
...
00111
01011
01101
01110
10011
10101
10110
11001
11010
11100
Note: bin is just a function I found online for printing binary numbers and isn't important to this post, but you can find it here if you need one.
Then of course you can loop through all values of P from 0 to B to build the complete table.
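That last loop might look something like this (my sketch, restating the earlier functions with // so it also runs under Python 3; note that P = 0 needs special-casing, because next_perm divides by v & -v and so can't start from 0):

```python
def element_0(c):
    return (1 << c) - 1

def next_perm(v):
    t = (v | (v - 1)) + 1
    return t | ((((t & -t) // (v & -v)) >> 1) - 1)

def gen_blocks(p, b):
    v = initial = element_0(p)
    block_mask = element_0(b)
    while v >= initial:
        yield v
        v = next_perm(v) & block_mask

B = 5
table = {0: [0]}  # popcount 0 is just the all-zero block
for p in range(1, B + 1):
    table[p] = list(gen_blocks(p, B))

# Every 5-bit block lands in exactly one popcount class:
total = sum(len(blocks) for blocks in table.values())
print(total)  # 32
```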
Questions? Comments? Flames? I wanna hear em :)
[1] – Check out Tables by Munro (1996), and Succinct indexable dictionaries with applications to encoding k-ary trees and multisets by Raman et al. (2002) for a first step into this stuff. Also check out my honours thesis (2010) for a recent look at succinct data structures. I will write a blog post about this stuff sooner or later :P
The post Generating Binary Permutations in Popcount Order appeared first on Alex Bowe.
The post Some Lazy Fun with Streams appeared first on Alex Bowe.
Update: fellow algorithms researcher Francisco Claude just posted a great article about using lazy evaluation to solve Tic Tac Toe games in Common Lisp. Niki (my brother) also wrote a post using generators with asynchronous prefetching to hide IO latency. Worth a read I say!
I’ve recently been obsessing over this programming idea called streams (also known as infinite lists or generators), which I learned about from the Structure and Interpretation of Computer Programs book. It is kind of like an iterator that creates its own data as you go along, and it can lead to performance increases and wonderfully readable code when you utilise them with higher order functions such as map and reduce (which many things can be rewritten in). It also allows you to express infinitely large data structures.
When regular lists are processed with higher order functions, you need to compute the entire list at each stage; if you have 100 elements, and you map a function to them, then filter them, then partially reduce them, you may be doing up to 300 operations, but what if you only want to take the first 5 elements of the result? That would be a waste, hence streams are sometimes a better choice.
Although SICP details how to do it in Scheme, in this blog post I will show some languages that have it built in – Haskell and Python – and how to implement streams yourself if you ever find yourself in a language without it [1].
Haskell is a lazy language. It didn't earn this reputation from not doing the dishes when you ask it to (although that is another reason it is lazy). What it means in the context of programming languages is that evaluation is postponed until absolutely necessary (here is a cute (illustrated) blog post describing this lazy evaluation stuff). Take this code for example:
Prelude> let x = [1..10]
At this stage you might be tempted to say that x is the list of numbers from 1 to 10. Actually it only represents a promise that when you need that list, x is your guy. The above code that creates a list from 1 to 10 still hasn’t been executed until I finally ask it to be (by referring to x):
Prelude> x
[1,2,3,4,5,6,7,8,9,10]
It is kind of like telling your mum you’ll do the dishes, but waiting until she shouts your name out again before you put down your DS. Actually, it is sliiiiightly different – if I instead wrote:
Prelude> let x = [1..10]
Prelude> let y = x ++ [11..20]
I have referred to x again when I declared y, but x still hasn't been evaluated. Only after I shout y's name will y shout x's name and give me back my whole list:
Prelude> y
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
Here you ask your robot to wash half the dishes, but he is too busy playing DS too (stupid robot). Finally when your mum shouts, you shout at the robot, and he does his set of dishes, and you do yours. But what is the benefit here? It isn’t that I can get more DS time in…
Take for example a list of positive integers from 1. Yes, all of them. In other languages it might be hard to express this, but in Haskell it is as simple as [1..]. This means we have an infinite list of integers, but we will only calculate as far as we need:
Prelude> let z = [1..]
Prelude> head z
1
Prelude> take 5 z
[1,2,3,4,5]
The syntax here is amazingly terse, and it may make your code more efficient. But even if we don’t have the syntax for it in another language, we can provide it ourselves very easily.
Python has a similar concept called generators, which are made by using the yield keyword in place of return, more than once (or in a loop) in a function:
def integers_from(N):
    while(1):
        yield N
        N += 1
>>> z = integers_from(1)
>>> z.next()
1
>>> z.next()
2
Note: Python generators are stateful and are hence slightly different to an infinite list in Haskell. For example, z.next() returns different values in two places, and thus is time sensitive – we cannot get z to 'rewind' like we could in Haskell, where z is stateless. Statelessness can lead to easier-to-understand code, among other benefits.
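To see that statefulness concretely (same generator as above, using Python 3's next(z) in place of z.next()):

```python
def integers_from(N):
    while 1:
        yield N
        N += 1

z = integers_from(1)
first = next(z)    # 1
second = next(z)   # 2 -- the generator has advanced; there is no way to ask it for 1 again
print(first, second)
```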
Let’s reinvent this wheel in Python (but in a stateless manner), so if we ever find ourselves craving infinite lists we can easily roll our own in pretty much any language with Lambdas.
I have chosen Python to implement this, even though it already supports infinite lists through generators, simply because its syntax is more accessible. Indeed, the below can already be done with Python's built-in functions (although with state). It is probably not a great idea to do it this way in Python, as it doesn't have tail call optimisation (unless you use this hack using decorators and exceptions).
First we’ll look at adding lazy evaluation, however the syntax requires it to be explicit:
>>> x = lambda: 5
>>> y = lambda: 2 + x()
Here, x is not 5, and y is not 7; they are both functions that will evaluate to those values when we finally run them. The expression inside the lambda won't be evaluated until we do so explicitly:
>>> x()
5
>>> y()
7
And that’s pretty much all the heavy lifting. To make an infinite list, we basically make a linked list where we generate each node as we need it:
def integers_from(N): return (N, lambda: integers_from(N+1))
def head((H, _)): return H   # note: tuple parameter unpacking is Python 2 only
def tail((_, T)): return T()
And there is our infinite list. To access it use head() and tail() (recursively if necessary):
>>> z = integers_from(1)
>>> head(z)
1
>>> head(tail(z))
2
First we should make a way for us to look at our streams:
def to_array(stream):
    return reduce(lambda a, x: a + [x], [], stream)
Which is a reduce operation that puts each head element into an array (which is carried along as a parameter to reduce()). Here is reduce() (map() can be found in this gist):
null_stream = (None, None)

def reduce(f, result, stream):
    if stream is null_stream: return result
    return reduce(f, f(result, head(stream)), tail(stream))
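One caveat: as mentioned above, Python has no tail call optimisation, so this recursive reduce() will hit the recursion limit on long streams. A loop-based variant with the same contract avoids that (my sketch; the accessors are restated in Python 3 syntax, since tuple parameters are Python 2 only):

```python
null_stream = (None, None)

# Python 3-friendly stream accessors
def head(pair): return pair[0]
def tail(pair): return pair[1]()

def reduce_iter(f, result, stream):
    # same contract as the recursive reduce(), but constant stack depth
    while stream is not null_stream:
        result = f(result, head(stream))
        stream = tail(stream)
    return result

# a small hand-built finite stream: 1, 2, 3
stream3 = (1, lambda: (2, lambda: (3, lambda: null_stream)))
result = reduce_iter(lambda a, x: a + [x], [], stream3)
print(result)  # [1, 2, 3]
```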
We needed some way to tell if we had reached the end of a stream – not all streams are infinitely long. Meet our next function, which will help us terminate a stream:
def take(N, stream):
    if N <= 0 or stream is null_stream: return null_stream
    return (head(stream), lambda: take(N-1, tail(stream)))
This will take the first N elements from the specified stream. So now we can inspect the first N elements:
>>> to_array(take(10, integers_from(1)))
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
For our upcoming example, we also need a filter() method, which keeps only the elements that satisfy a provided predicate:
def filter(pred, stream):
    # note: assumes the stream is infinite, or at least has a matching element left
    if pred(head(stream)):
        return (head(stream), lambda: filter(pred, tail(stream)))
    return filter(pred, tail(stream))
Now onto our example :)
Here is the standard example to demonstrate the terseness of streams:
def sieve(stream):
    h = head(stream)
    return (h, lambda: sieve(
        filter(lambda x: x % h != 0, tail(stream))))
Here is a function which recursively filters anything which is divisible by any number we have previously seen in our stream. Math aficionados will notice that this is the Sieve of Eratosthenes algorithm.
>>> primes = sieve(integers_from(2))
>>> to_array(take(10, primes))
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
Recursively defined data, and only as much of it as we want – pretty neat.
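For contrast, the same sieve over Python's built-in generators (the stateful flavour from the start of the post) is just as terse. A sketch, named sieve_gen so it doesn't clash with the stream version:

```python
from itertools import count, islice

def sieve_gen(stream):
    h = next(stream)
    yield h
    # recursively sieve the rest of the (stateful) stream
    yield from sieve_gen(x for x in stream if x % h != 0)

first_ten = list(islice(sieve_gen(count(2)), 10))
print(first_ten)  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```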
When I first saw this, I wondered what application there might be to have a stream of functions. Here I have defined a stream which recursively applies a function to itself:
def rec_stream(f):
    return (f, lambda: rec_stream(lambda x: f(f(x))))
When might this be useful? It might yield speed improvements if you commonly want to recursively apply a function a certain number of times, but have costly branching (so the condition check at each level is slow). It could also be used as an abstraction for recursive iteration [2], which gives you back the function 'already recursed', so to speak (although lazily).
One such recursive process I can think of is Newton's method for approximating roots, defined recursively as:

x_{n+1} = x_n - f(x_n) / f'(x_n)

which converges on a solution of f(x) = 0.
The more iterations you do, the more accurate the solution becomes. One use of Newton's method is to apply it until you have reached a certain error tolerance. Another way, which I learned about recently when reading about the fast inverse square root algorithm, is to use just one step of Newton's method as a cheap way to improve an (already pretty good) initial guess. There is a really great article here which explains it very well.
After reading that, I wondered about a stream that would consist of functions of increasing accuracy of Newton’s method.
def newton(f, fdash):
    return lambda x: x - f(x)/float(fdash(x))
The newton() function accepts f(x) and f'(x), and returns a function that accepts a first guess.
def newton_solver(iters, f, fdash):
    def solve(v):
        n = newton(lambda x: f(x) - v, fdash)
        stream = rec_stream(n)
        return to_array(take(iters, stream))[-1]
    return solve
This one is a little more complicated. In order to have it solve for a value other than zero, I needed to fold that value into f(x), since Newton's method drives f(x) to zero, but I didn't want the user to have to iterate over the stream each time they wanted to compute the square root of a different number, say. To allow it to return a function that solves square roots in the general case, I made the internal function solve(), which binds the value the caller specifies, hence solving f(x) = v for x. Hopefully this becomes clearer with an example:
>>> sqrt = newton_solver(1, lambda x: x**2, lambda x: 2*x) # 1 iter
>>> sqrt(64)(4) # Sqrt of 64 with initial guess of 4
10.0
>>> sqrt = newton_solver(3, lambda x: x**2, lambda x: 2*x) # 3 iters
>>> sqrt(64)(4)
8.000000371689179
Now we can pass around this square root function and it will always do the same fixed amount of work. (Note that because rec_stream doubles the composed function at each element, taking 3 elements of the stream actually applies the Newton step 4 times.)
This may not be practical unless compilers can optimise the resulting function (or if there is a way to do the reduction myself easily), but it was fun to do :) As always comments and suggestions are appreciated. If anyone who reads this is good with compilers, advice would be great :D
What you can do now is read SICP for more cool things like streams and functional programming, or check out Sean Anderson’s bit hacks page for more cool hacks like the fast inverse square root. Or refactor your code to use map, reduce and streams :)
[1] – The reason I have chosen Python for this exercise is for reasons of accessibility. Here is a post about implementing streams in Ruby, and here is one for Erlang :) but of course it's all pretty much the same deal.
[2] – If a compiler could optimise this, simplifying the reapplied function, but keeping the generality, that'd be really cool :) I don't think many compilers would/could do that for lambdas though. Any information would be great.
The post Design Pattern Flash Cards appeared first on Alex Bowe.
Last year I studied a subject which required me to memorise design patterns. I tried online flash card web sites, but I was irritated that I didn't own the data I put up (they had no export option). So I wrote something in Python to generate flash cards for me using LaTeX and the Cheetah templating library. The repository is hosted here, although it could do with a refactor.
If you don’t want to generate your own, you can download the pre-generated design pattern intent flash cards here which contains the 23 original design patterns from the Gang Of Four.
To generate your own flash cards, create an input text file with this structure:
Front text (such as pattern name):
Definition line 1.
Definition line 2.
For example:
Abstract Factory:
Provides an interface for creating families of related or
dependent objects without specifying their concrete classes.
Currently the front text is single-line only. The regex could be updated of course (if you do, feel free to send a pull request!).
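For the curious, a rough idea of the parsing in plain Python (a hypothetical sketch, not the actual cardgen.py code, which uses a regex and Cheetah templates):

```python
def parse_cards(text):
    """Split 'Front:' plus definition lines into (front, back) pairs;
    cards are separated by blank lines."""
    cards = []
    for chunk in text.strip().split('\n\n'):
        lines = chunk.strip().split('\n')
        front = lines[0].rstrip(':')
        back = ' '.join(line.strip() for line in lines[1:])
        cards.append((front, back))
    return cards

example = """Abstract Factory:
Provides an interface for creating families of related or
dependent objects without specifying their concrete classes."""

print(parse_cards(example))
```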
To compile this:
./cardgen.py -i inputfile -o outputfile
pdflatex outputfile
Then just print it out on a double-sided printer (or glue the two sheets together). I carried these around with me all the time during the lead-up to the exam, and I was scary-fast when it came to recalling which design pattern did what. Just flick through them (shuffle first) in forward or reverse order when you are on the train next :)
The post Metaprogramming Erlang the Easy Way appeared first on Alex Bowe.
I've recently taken Erlang back up [1], and I wanted to use this blog post to talk about something cool I learned over the weekend.
I am implementing a data structure. Reimplementing actually, as it is the structure from my thesis – a succinct text index (I will post a blog on this soon).
Why am I reimplementing it in Erlang? The structure involves many bit-level operations, and I wanted to try out Erlang's primitive Binary type, which seems to allow efficient splitting and concatenation (which I require). Erlang's approach to concurrency will hopefully assist me to experiment with distributing the structure, too. As a bonus, the functional approach has lent itself well to this math-centric data structure, and my code is much, MUCH cleaner because of it (I love reduce and other higher order functions).
One function I needed to implement was popcount (the number of set bits in a bitvector). For example, if b = 1010 then pop(b) is 2.
There are many methods listed on this blog. One of them is the table method, which precomputes all popcounts for 16 bit integers (or any length you have space for):
POPCOUNT_TABLE16 = [0] * 2**16
for index in xrange(len(POPCOUNT_TABLE16)):
    POPCOUNT_TABLE16[index] = (index & 1) + POPCOUNT_TABLE16[index >> 1]
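To see the table in use before porting it, here is how lookups compose for wider words (a Python 3 sketch of mine, not the post's code):

```python
# build the table: popcount(i) = lowest bit of i + popcount(i >> 1)
POPCOUNT_TABLE16 = [0] * 2**16
for index in range(len(POPCOUNT_TABLE16)):
    POPCOUNT_TABLE16[index] = (index & 1) + POPCOUNT_TABLE16[index >> 1]

def popcount32(v):
    # two table lookups cover a 32-bit word
    return POPCOUNT_TABLE16[v & 0xffff] + POPCOUNT_TABLE16[(v >> 16) & 0xffff]

print(popcount32(0b1010))    # 2
print(popcount32(0b101011))  # 4
```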
I translated this to Erlang:
View the code on Gist.
So now gen_table(Bits) will generate a tuple for me with all popcounts from 0 to $2^{Bits}$. However, we may gain performance [2] from knowing how big we want the table to be at compile time. If we know the amount of bits we want our popcount table to work for, we could type the table directly into the source. But that would impede our flexibility, and make our code ugly.
Enter ct_expand. We can use ct_expand:term( <code to execute> ) to run gen_table() for us at compile time! Now we only run gen_table() whenever we compile the module – after that the table is embedded in the binary.
First, check out ct_expand (git clone https://github.com/esl/parse_trans.git), and compile it with make [3]. Then just move the .beam files into your project directory. Now we're ready for some metaprogramming :)
I created a new module for generating our table at compile time:
View the code on Gist.
The reason we must be in a new module is because ct_expand:term() requires its parameter to be something it will know about at compile time. This could be an inline fun (which I tried, but it wasn't pretty), or it can be something you compile before ct_expand:term() is executed (see line 3). Note on line 2 that we also need to compile parse_transform and ct_expand with our module.
Let’s test it out:
1> c(popcount).
{ok,popcount}
2> popcount:popcount16(2#1010).
2
3> popcount:popcount16(2#101011).
4
Nice! Thanks to Ulf Wiger for making that so easy :)
( Image stolen from http://improvisazn.wordpress.com/2010/02/22/meta/ )
erlang-crypto package; on Mac OS X: sudo port install erlang +ssl. (I had this issue)
]]>