When you learned about genome assembly algorithms, you might have heard a story that goes something like this:
In the overlap-layout paradigm, solving the assembly problem requires solving the Hamiltonian cycle problem in the overlap graph. This is difficult, because the Hamiltonian cycle program is NP-hard. On the other hand, if we break our reads up into k-mers, we can build the de Bruijn graph. Then, the assembly problem becomes the problem of finding an Eulerian cycle in the de Bruijn graph, which is easily solvable in linear time. Thus, by formulating the assembly problem in terms of the de Bruijn graph, we can solve the much easier Eulerian cycle problem and not have to solve the NP-hard Hamiltonian cycle problem.
In this post, we explain that while de Bruijn graphs have indeed been very useful, the reason has nothing to do with the complexity of the Hamiltonian and Eulerian cycle problems.
Every Eulerian cycle in a de Bruijn graph or a Hamiltonian cycle in an overlap graph corresponds to a single genome reconstruction where all the repeats are completely resolved. For example, Figure 1 shows two different Eulerian cycles in the same graph (a similar example could be constructed for Hamiltonian cycles in an overlap graph). Each cycle corresponds to a different arrangement of segments between the repeats. The presence of multiple Eulerian or Hamiltonian cycles implies that the genome structure is ambiguous given the data available. In other words, using the same set of reads one can reconstruct different genomes, each of which is fully consistent with the data (Figure 1 gives an example). Choosing one of these reconstructions arbitrarily would be foolhardy since only one of them is the original genome. No sane assembly algorithm would do this, and that is one of the major reasons why an algorithm for finding Eulerian or Hamiltonian cycles is not part of any assembly algorithm used in practice.
Figure 1: A worked out example for a set of reads R = {TATTA, TAATA} and k=3. Here, the set of all k-mers is S = sp^{k}(R) = {TAT, ATT, TTA, TAA, AAT, ATA}. Panel A shows G_{1} = dBG^{k}(S) and one possible Eulerian cycle of G_{1} (in blue). Panel B show the only other Eulerian cycle in G_{1} (in orange). The genome reconstruction corresponding to the blue cycle is ATTAATAT and to the orange cycle is ATTATAAT (note that because the genome is circular, the last two characters of each string are equal to the first two characters).
Instead, assemblers output contigs—long, contiguous segments which can unambiguously be inferred to be part of the genome. Finding such segments is a very different computational problem than finding a single Eulerian or Hamiltonian cycle^{1} . In fact, it was shown that finding all possible contigs can be done in polynomial time, regardless of whether the genome reconstruction is modeled as a Hamiltonian or Eulerian cycle (Tomescu and Medvedev, 2016). The algorithm used in practice (the unitig algorithm) is linear and nearly identical in the two graph models.
Perhaps you are not convinced by the above reasoning? Fine. For the sake of argument, let’s imagine that we really are interested in finding a single, arbitrary, genome reconstruction. But even in this case, the distinction between Eulerian and Hamiltonian cycles is misleading. We make our point with this Theorem, which we first state informally (a formal statement and proof will come below):
Main Theorem (informal): The following problems are equivalent and solvable in linear time:
The first part of the theorem should not be surprising. It states one half of the story we started with, namely that we can solve the assembly problem in linear time by finding an Eulerian cycle in a de Bruijn graph. The second part of the theorem, though, adds a twist. It is about finding a Hamiltonian cycle, but it differs from the initial story in two ways. First, it is a Hamiltonian cycle in a de Bruijn graph, not in an overlap graph. This might seem strange, but there is no special connection between overlap graphs and the Hamiltonian cycle problem — one is free to find a Hamiltonian cycle in any graph they wish. Second, the problem is solvable in linear time in this case, even though it is NP-hard in general. This might also seem strange, but in fact it is common for NP-hard problems to have polynomial-time solutions for a restricted class of inputs^{2}.
What the theorem states, then, is that one can solve the assembly problem in linear time by finding a Hamiltonian cycle within an appropriately defined de Bruijn graph. The fact that the Hamiltonian cycle problem is NP-hard in general graphs is not directly relevant. What is important is the underlying structure of the de Bruijn graph which makes the Hamiltonian cycle problem easy to solve^{3}. Hence, the initial story was right in the sense that using de Bruijn graphs is a good idea but wrong to imply that the complexity of the Hamiltonian cycle problem is a reason. All of this is of course assuming we are, for some reason, interested in an arbitrary genome reconstruction, which, as we argued earlier, we typically are not.
So why are de Bruijn graphs so popular for short read assembly, if not for the difference in the complexity of finding Eulerian or Hamiltonian cycles? The answer is complex, which might explain why the initial simple story was appealing. It may have to do with the simplicity of their implementation, the appeal of the k-mer abstraction, the ease of error correction, or with something else. In fact, the difference between using de Bruijn graphs and overlap graphs is poorly understood and is a fascinating open research problem. But, the Eulerian and Hamiltonian cycle dichotomy is not really relevant to assembly or to the popularity of de Bruijn graphs.
PM would like to thank Rayan Chikhi for feedback on the post and Alexandru Tomescu and Michael Brudno for many helpful discussions on this topic (and Michael Brudno specifically for introducing him to the problem a long time ago).
Let’s prove the main theorem, which follows almost directly from definitions. It may have been observed in previous papers, though we are not sure if it has been explicitly stated.
Let t be a string, k be a positive integer, and S be a set of k-mers. Let R be a set of reads, i.e. strings of length k. We define pre_{i }(t) and suf_{i }(t) as the prefix and suffix, respectively, of length i of t. The k-spectrum of R, denoted by sp^{k} (R), is the set of all k-mer substrings of the strings of R. The de Bruijn graph of order k of S, denoted as dBG^{k}(S), is defined as follows^{4}. The vertex set is sp^{k-1} (S) , and for every k-mer x ∈S, we add an edge from pre_{k-1 }(x) to suf_{k-1 }(x) . Figure 1 shows an example of a de Bruijn graph. We define the closure of S, denoted closure(S), to be the set of all (k+1)-mers y such that pre_{k}(y) ∈ S and suf_{k}(y) ∈ S. Informally, it is the set of all (k+1)-mers that can be constructed from S.
Main Theorem (formal): Let R be a set of strings whose smallest length is L. Let k be a positive integer less than L. Then, there is a one-to-one correspondence between Eulerian cycles in dBG^{k}(sp^{k}(R) and Hamiltonian cycles in dbG^{k+1}(closure(sp^{k}(R))). Moreover, an Eulerian or Hamiltonian cycle can be found in its respective graph in O(|sp^{k}(R)|) time.
Proof: Let S=sp^{k}(R). Let G_{1} = dBG^{k}(S) and G_{2} = dBG^{k+1} (closure(S)). First, we show that the vertex set of G_{2} is S, i.e. sp^{k}(closure(S)) = S. Clearly, sp^{k}(closure(S)) ⊆ S, since no new k-mers are created during the closure process. Now, let x be a k-mer in S. It must appear in some read r, and since the length of r is greater than k, x must be a prefix or suffix of some (k+1)-mer y in R. Moreover, y must be in closure(S), since its prefix and suffix are both in r and hence in S. Therefore, x ∈ sp^{k}(closure(S), completing our proof that S ⊆ sp^{k}(closure(S)) and the vertex set of G_{2} is S.
Observe that a sequence of k-mers C_{1} = x_{0}, …, x_{n-1} is a sequence of edges defining an Eulerian cycle in G_{1} if and only if the set of k-mers of C_{1} is exactly S (without any repetitions) and, for all i, suf_{k-1} (x_{i}) = pre_{k-1} (x_{i + 1 mod n}). Also, observe that a sequence of k-mers C_{2} = x_{0}, …, x_{n-1} is a sequence of vertices defining a Hamiltonian cycle in G_{2} if and only if the exact same criteria holds, i.e. the set of k-mers of C_{2} is exactly S (without repetitions) and, for all i, suf_{k-1} (x_{i}) = pre_{k-1} (x_{i+1 mod n}). Thus, there is a one-to-one correspondence between Eulerian cycles in dBG^{k}(S) and Hamiltonian cycles in dBG^{k+1}(closure(S)). Figure 2 shows an example.
Figure 2: Using the same example as in Figure 1, this figure shows G_{2} = dBG^{k+1}(closure(S)) and the two possible Hamiltonian cycles in G_{2}. Notice that the k-mer sequence corresponding to the blue cycle in Panel A is the same as in Fig 1A; similarly, the k-mer sequence for the orange cycle in Panel B is the same as in Fig 1B.
For the running time, an Eulerian cycle can be found in time linear in the number of edges using a classical algorithm, e.g., Hierholzer’s Algorithm. Giving the one-to-one correspondence above, the vertex labels of a Hamiltonian cycle in G_{2} can be found by outputting the edge labels of an Eulerian cycle in G_{1}. Hence, the running times for the two problems are equivalent.
∎
]]>If you know me personally, then you know how often I complain about the gap between theory and practice in bioinformatics. Recently, I came across a research area that seeks to address this gap, though not in bioinformatics specifically. It is called experimental algorithms, or algorithm engineering. Briefly, “Experimental Algorithmics studies algorithms and data structures by joining experimental studies with the traditional theoretical analyses” (Moret, 2000). I had heard about this before in passing, but I finally took the time to look up details about what this is. I wish I had done this years earlier.
There is a lot that has been written about it, so I don’t think I have anything to add for now. But, I decided to make a post that collects all the resources I found and a few questions I had. This is partially because it will help me organize my thoughts, but also might be a good starting point for someone who, like me, wants to learn more about it. Please note that this list is definitely not comprehensive or representative. Please leave a comment to add things that I missed or if you have any other thoughts.
What is algorithm engineering / experimental algorithms and what problems is it intended to address?
What is the methodology for experimentally evaluating an algorithm?
Some classes that teach algorithm engineering:
What conferences / journals publish work in experimental algorithms / algorithm engineering?
What about bioinformatics?
Questions that I am left with:
]]>
The potential options for what to (re)tweet are large. I was initially overwhelmed by having to make a decision every time I wondered whether to tweet something or not. To overcome this, I came up with a set of guidelines for this decision making process. These guidelines greatly reduced the activation energy for writing a tweet. I wanted to share them here in case 1) anyone was curious why I do or do not (re)tweet certain things, 2) my experience can help someone have an easier time with Twitter, and 3) other people are willing to share their own guidelines. Please keep in mind that I do not suggest these guidelines as general rules for Twitter usage. They work to help me achieve my personal goals but, to the extent your goals are different, will not necessarily work for you.
Here are my guidelines:
These guidelines serve me as a baseline, but I feel free to break them if needed. My thoughts about this will continue to evolve based on my experience, feedback from others, and the evolution of Twitter and its alternatives.
Let me also add that there are many alternate approaches to using Twitter that violate the guidelines above. For example, one might choose to engage with controversial issues, or one might be negative as a way to effect positive change. It just depends on what works for you and what you want to achieve. So, I don’t proselytize any of the above, except perhaps avoiding the drive-by criticism. I am not sure if I see a justification for that.
]]>As a PC member, I sometimes find it frustrating to see a paper that potentially has a great contribution be rejected because of the way it was written. I wish that I had the opportunity to tell the authors — hey, you forgot to do this really important thing, without which its hard to accept the paper, but if you could go back and fix it, you might have a great paper for the conference. In our conference format, this type of back-and-forth is usually not possible. This motivated writing this post, so that newcomers to the field have a chance to know in advance what a potential reviewer might look for in an algorithmic bioinformatics conference paper.
What do I mean by algorithmic bioinformatics conference paper? I am thinking of the subset of papers submitted to RECOMB/WABI/ISMB that take an algorithm-based approach to solving a bioinformatics problem. This is largely intended to contrast against papers more rooted in statistical methodology, where the standards are a bit different. I also focus on conference reviews, where the process is a bit different than for a journal. When reviewing a paper for a bioinformatics journal like Oxford Bioinformatics, there is of course an opportunity for the authors to address any limitations in a revision.
I want to also add a disclaimer that this is not in any way an official statement about what PC members would look for in a review. As far as I know, there is no such official policy, and the things that each PC member looks for do not completely overlap. There is a diversity of standards and that is why each paper has multiple PC members reviewing it. I cannot speak for others, though I hope that people can add their comments and feedback.
When I review, the first thing I try to identify is: what is the main novel contribution of the paper? Is it an idea, a theorem, an algorithm, or a tool (i.e. software) people can use? Sometimes a paper has all these components, but not all of them contribute to the novelty of the paper. Here are some examples:
1) The paper contains an algorithm and a tool that implements the algorithm. The algorithm itself may be a simple modification of what is previously known, but the algorithm is implemented in a novel software tool for an important biological problem. If the tool performance is an improvement over previous tools, then the tool is the main contribution.
2) In another example, the main novelty is in the algorithm or in its analysis, and this is what the reader is intended to take away from the paper. The paper may have implemented a tool, but the intention of the tool is to only be a prototype to test the feasibility of the idea. The tool is not the main contribution.
3) Sometimes, the main contribution of the paper is novel biological findings, without any methodological (either algorithmic or software) novelty. This is not really within the scope of the RECOMB/ISMB†/WABI conferences, which has to be methodological. Certainly, having novel biological findings can serve to demonstrate the strength of the methodological contribution. But if you discover a cure for cancer by applying existing software, then it is probably outside the scope of RECOMB/ISMB†/WABI.
It is up to the authors to make the main contribution of the paper crystal clear to the reader. As a reviewer, I will then base my evaluation on what the authors claim. If the authors’ claim is not clearly stated, then I will do my best to guess what it is. But if I make a mistake, then I may end up evaluating the paper from a completely incorrect angle.
Here are some common issues I find with papers. This is not intended to be an exhaustive list and only includes issues that are both basic and that I’ve seen multiple times. In a competitive venue, a paper is usually accepted based on its strengths rather than a lack of weaknesses; however, in my experience, the weaknesses below typically ruin a paper’s chance of acceptance.
Context within prior algorithmic work is not given: A common scenario where this happens is when the authors developed a method for a particular biological dataset, and there are no other tools designed specifically for this kind of dataset or problem. However, the problem and/or solution might be very similar to what has been previously studied. For instance, many problems come down to clustering of some data points (e.g. genes in a network or reads from a sequencing experiment) or to some version of sequence alignment. The algorithmic context of such a paper is, at least in part, clustering or, respectively, alignment algorithms. Sometimes the authors provide the biological context (e.g. what is the relationship to previous approaches to finding genes in a network) but leave out the algorithmic one (e.g. what is the relationship to previous clustering algorithms). Why is this particular problem or dataset different enough so that standard clustering or alignment techniques do not apply? If the authors present a clustering algorithm for the problem but do not answer this question in the intro, then their contribution is not placed in the algorithmic context — which makes it hard to evaluate its novelty.
Unclear writing: Some papers will contain many spelling and grammatical mistakes, or ambiguous notation and terminology. I try to do the best I can to understand the contribution of the paper, and often I do understand it in spite of these problems. In such cases, it does not greatly influence my overall decision about the paper, and I generally trust the authors to clean up the paper before publication (if it is accepted). In other cases, I cannot understand the paper after a reasonable amount of time trying. In these cases, I simply cannot evaluate the paper’s contribution.
The paper is written in the style of a biology journal: In biology journals, the methods section is often written as a step-by-step manual necessary to reproduce the results (i.e. a pipeline of processing steps on the data). This type of presentation focuses on implementation details and reproducibility rather than highlighting the novelty of the algorithm. Even if the method is novel, when it is written in this style it is hard for the reader to identify and understand the novel parts. Another aspect of this is that for a biology journal, the results section comes before the methods section. Doing this for an algorithmic bioinformatics paper is not in it of itself a problem, but it usually correlates with not enough focus being given to the method.
Claims in the intro that are not supported by the rest of the paper: For example, the authors claim that their tool is the fastest to-date for a problem, but the results section only contains a comparison against one other tool or only on a narrow type of data. In such cases, I simply ask the authors to tone down their claims. However, sometimes the claims are central to the claimed importance of the paper, in which case this feels a bit disingenuous. Another example is the bait-and-switch, when the intro claims that the paper presents an algorithm for some interesting problem. But, what ends up being evaluated in the results is an algorithm for a slightly different problem.
There is neither a strong theoretical contribution nor an experimental evaluation: Some contributions are theoretical — a powerful idea, a way of thinking about a problem, or a theorem which can be applied by other algorithm developers. These papers require a lot of work on the modeling or theoretical side, and it can be justifiable if experimental results are either not included or limited. However, in most other cases, experimental evaluation is essential to a paper. If this is missing or is inappropriate to the problem, it can make it impossible to evaluate the strength of the contribution.
There is no comparison against other work: The authors sometimes find it obvious that their method should work much better than anything else out there. They may be right, but it is important to demonstrate this in the paper by finding the most compelling alternative approach and comparing against it.
Software: If the main contribution of the paper is a tool, then the tool should be usable. At the very least, I should be able to download the software, install it, and run it on a toy input that is provided in the download. If I can see that the tool already has some users (e.g. through GitHub activity), then this is enough to demonstrate its usability and I may not bother to try it out myself. On the other hand, if the paper contains a tool that is only a prototype and is not the main contribution, then the usability of the software is not something I consider very important.
Correctness: Sometimes the authors present an algorithm or data structure for which they prove the correctness, or it is obvious through the construction. For example, it could be a data structure to represent and query some data. However, when they evaluate their tool for its e.g. runtime, it is still essential that the correctness of the algorithm is explicitly verified in the experiments. This can be a simple one line that says: e.g. we verified that the new data structure gives the same answers to queries as the previous one on all the evaluated datasets. However, without this check, how does the reader know that the algorithm is not twice as fast as the competition just because it has a bug?
No analysis of running time or memory usage: In most cases, it is important for an algorithmic bioinformatics paper to present the running time and memory usage of the algorithm, either through experimental evaluation and/or theoretical analysis. This is a very natural thing to do for computer scientists, but I sometimes find that researchers with a different background forget to include this. In other cases, the authors do not include any memory or time analysis because they know that it is tiny and besides the main point, but it may not be at all obvious to the reader. In such cases, a simple statement to the effect that the memory usage or running time is negligible would suffice.
† ISMB is very diverse, with many different types of tracks and presentations. This blog only refers to those papers focusing on methods. Certain tracks may also feature papers focusing on the biology.
[ Subscribe to new posts via RSS feed. ]
]]>