Medvedev Group

Can researcher identity be captured with a single number?

paulmedv — Tue, 15 Nov 2022 12:36:20 +0000

Of course not; the title is just clickbait. But I was thinking of a characterization that could be informative.

We have two little voices in our head, driving our self-esteem. On one hand, we seek the approval of others. We need that paper in Nature, we want that big award, we want that big recognition. We yearn for it as it forms our opinion of ourselves: if I get a paper in Nature, then I am a great researcher. If not, I’m so-so. And we all want to be great researchers.

On the other hand, we have an internal barometer of our self-worth as a researcher. We have standards for what a great paper is: maybe it’s a discovery or a tool that gets used by others, or it greatly improves something impactful. It is fair in its treatment of previous work, it is thorough in its presentation, etc. The point is that these are our OWN standards, and when we write papers that meet those standards we think of ourselves as great researchers, regardless of where the paper is published or what the reviewers say about it.

These two little voices pull us in different directions, and perhaps we can characterize ourselves by what percentage of the pull goes to each voice (be honest!). After that, we can also think of others this way, and it can help us understand them. It can be helpful when mentoring too, since if you’re trying to support someone, it helps to know what drives their self-esteem.

From idea to paper: discretizing and optimizing the initial stages of the process

paulmedv — Wed, 17 Aug 2022 14:16:20 +0000

This is a post about managing the research process when juggling many projects and other commitments.

My idea-to-output pipeline:

I often have “a flash of potential insight.” This might be a thought about how there is a better way of solving a computational problem or a connection between unrelated concepts that might open new avenues for research exploration. These flashes happen while reading a paper, hearing a talk, walking down the street, or talking to a colleague. To say that these are half-baked ideas is an exaggeration and maybe “quarter-baked” is a better term. Let’s think of such a flash as the beginning of a pipeline. At the end of a pipeline is a research output: a paper, a lecture, a blog post, etc. It might take years to get there, e.g. for a research paper.

What about the points in the middle of the pipeline? I used to think that it’s a continuous process; but with time I realized that there are discrete checkpoints, at least in the way I do it. Here is what I came up with:

1. A flash of potential insight

2. Converting the cloudy flash into a more concrete thought, often through writing it down.

3. Doing a dump of all accompanying thoughts onto paper (i.e. a brainstorm).

4. Evaluating which of this is novel and relevant — usually doing a literature review or looking at existing teaching materials.

5. Narrowing the initial brainstorm into a concrete project idea.

6-15: Execute the project idea: evaluate practical feasibility with preliminary results, maybe write a grant, match a student’s interest with the project, guide the student through the project, etc…

This post is not about steps 6 through 15, so I didn’t elaborate on them here. I’ve found that over my career, I’ve been discretizing the continuous process into these steps, starting from the end. That is, as a starting student the whole process was a continuous blur; then I recognized that there is a discrete stage 15, then stage 14, and so on. The initial 5 steps were all a continuous blur to me until somewhat recently, when I first noticed the existence of checkpoint 5, then later checkpoint 4, and so on. Only last week did I recognize the existence/importance of stage 3.

The problem:

At the moment I have a flash, I usually do not have time to follow up on it with the first five steps. I am usually busy and my mind is immersed in other projects. Maybe potential projects get lost along the way? I used to think that if a flash is truly promising, it will occur to me multiple times, and eventually I will follow up on it. But now this seems like a baseless assumption whose purpose was only to justify my status quo. Over time, I’ve applied a lot of scrutiny to the later stages of the pipeline (6-15) and tried to optimize the efficiency of those stages. Now, I want to do the same for steps 1-5. How many “flashes” eventually become outputs? I don’t know, since many flashes are not followed up on and their existence is forgotten. How many of these flashes turn out to be nonsense when scrutinized? How many make it to at least stage 5? I don’t know. How many flashes that could have led to outputs are lost because I forget about them? I don’t know.

Another problem is that the whole pipeline potentially stretches over many years and, since I am generally overwhelmed with the number of projects going on simultaneously, I don’t remember things well. When I don’t take good notes, I often find myself repeating steps that I have already done! For example, at the early stages, I sometimes browse through literature to get a sense of what has been done. If I don’t take good notes, then when I return to this in a month, I have to start over. Another problem is that the coolness and excitement that I have at the initial stages are sometimes lost by the time of writing the paper! Sure, sometimes it’s just that the original excitement was naive and didn’t account for things I learned later. But sometimes, I suspect that it’s just that after being down in the trenches of a project for a long time, I forget its original beauty.

What I also realized recently is that “taking good notes” is stage-dependent. For example, taking good notes at the end of stage 3 is just an unpolished, “natural flow” kind of text. Trying to do something more actually destroys the value of these notes. If I have the natural flow, I can capture the original excitement, and then a year later I can look at these notes and remember precisely why I was so excited initially. Moreover, if I require polished notes at the end of this stage then it makes it harder to find time to get through the stage and increases the chance that the flash will get forgotten. On the other hand, “taking good notes” during a literature search means being very precise, so that these notes can serve as a basis for precise published statements later.

The solution:

I want to take a more systematic approach in the future. I want to refine/improve/make-more-precise the pipeline steps, and make precise what kind of notes are most effective for each stage. I also don’t want ideas to unintentionally drop out of the pipeline. It should take very little time to bring an idea to the end of stage 3 (e.g. 30 minutes in a coffee shop). Now that I’ve broken down the initial stages into smaller, more manageable tasks, I think this is possible. By being systematic about this, I can iterate and refine both the stages of the pipeline and the specs for each deliverable.

Also, it is fine if ideas get stuck in the pipeline because I can’t find time for them (e.g. I never get to stage 4). I just don’t want them to fall out. If they are stuck, then I know about it and I can always return to them. Moreover, I can analyze why they get stuck and fix the problem, e.g. I need to allocate more time to idea development.

Feedback:

This post stemmed from a “flash of potential insight” I had yesterday. Since I am on vacation, I had the opportunity to immerse myself into the pipeline of turning it into an output. I decided that a blog post would be a good output, for which the turnaround is really quick. But really these are all still half-baked thoughts in my head. I am really curious to hear about how other people approach the early stages of the pipeline.

The challenges of writing a logical argument in the Introduction

paulmedv — Wed, 10 Nov 2021 19:15:19 +0000

Writing a good intro to a paper is challenging, for many reasons. I have a longer lecture about it here. In this post, I wanted to point out one reason which occurred to me only after many years and after giving the lecture. The Intro often contains a multi-step logical argument. Sometimes this is hard for the reader to follow because the argument has logical gaps. The thing is, even when there are no logical gaps, the reader might still find it difficult to follow, for the following reason.

In the writer’s head, an argument is usually represented as a directed acyclic graph. Each node is a statement and an edge from x to y means that x logically implies y. For example:

The rightmost node (node 9) is the final point the writer wants to make, e.g. the specific challenge their paper addresses. One mistake writers make is to include nodes that do not lie on a path to 9. In this example, that’s node 3. Node 3 might be a very interesting observation and the writer might be tempted to keep it. But that’s usually a mistake and they must trim 3 from their graph before putting it in the intro:

The next challenge is that an introduction is necessarily linear. Each sentence has exactly one successor and one predecessor. The introduction is not good at representing a branching structure of arguments. So the writer must linearize their graph and create what in Computer Science we call a “topological ordering”:

Each of those arcs that jump over other nodes, e.g. from 4 to 7, requires the writer to make explicit connections with previous parts of the text. This makes things more difficult for the reader, and the fewer of these jumping arcs, the better. For example, putting 7 right after 4 would have been a better linearization:

Now there are only two jumping arcs instead of three. The writer can further try to eliminate arcs from the graph completely by checking if they are absolutely necessary. For example, 5 and 6 might be two examples that support point 8. But are two examples really necessary? If not, then get rid of 5:

This is now much easier for the reader to follow. Happy writing!
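The linearization steps above can also be checked mechanically. Here is a minimal Python sketch. The edge list is a hypothetical graph made up for illustration (it is not the graph from the figures), and the function names are illustrative as well. It checks whether an ordering is a valid topological ordering and lists the arcs that jump over other nodes.

```python
# Hypothetical argument graph (an assumption for illustration, not the graph
# from the figures): an edge (x, y) means statement x logically implies y.
EDGES = [(1, 2), (2, 4), (4, 7), (7, 8), (5, 8), (6, 8), (8, 9)]


def is_topological(order, edges):
    """True if every edge points forward in the given linear order."""
    pos = {node: i for i, node in enumerate(order)}
    return all(pos[x] < pos[y] for x, y in edges)


def jumping_arcs(order, edges):
    """Edges whose endpoints are not adjacent in the linear order."""
    pos = {node: i for i, node in enumerate(order)}
    return [(x, y) for x, y in edges if pos[y] - pos[x] > 1]


first_try = [1, 2, 4, 5, 6, 7, 8, 9]
better = [1, 2, 4, 7, 5, 6, 8, 9]  # node 7 moved right after node 4

for order in (first_try, better):
    assert is_topological(order, EDGES)
    print(order, "jumping arcs:", jumping_arcs(order, EDGES))

# Dropping node 5 (and its edge) removes one more jumping arc.
trimmed_edges = [(x, y) for x, y in EDGES if 5 not in (x, y)]
trimmed = [n for n in better if n != 5]
print(trimmed, "jumping arcs:", jumping_arcs(trimmed, trimmed_edges))
```

In this made-up graph, the first ordering has three jumping arcs, the second has two, and trimming node 5 leaves one.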

UPDATE 11/10/21: I want to add a prequel to all this. What is often in the writer’s mind at the start is not even a graph but some kind of personal, cerebral, often pictorial representation. Getting that into a graph is its own challenge. If a writer tries to go from that straight into a linearized argument, all hell breaks loose.

What do Eulerian and Hamiltonian cycles have to do with genome assembly?

paulmedv — Fri, 21 Aug 2020 16:40:05 +0000

(UPDATE: A slightly updated version of this blog is now published here in PLoS Computational Biology)

(written by Paul Medvedev and Mihai Pop)

When you learned about genome assembly algorithms, you might have heard a story that goes something like this:

In the overlap-layout paradigm, solving the assembly problem requires solving the Hamiltonian cycle problem in the overlap graph. This is difficult, because the Hamiltonian cycle problem is NP-hard. On the other hand, if we break our reads up into k-mers, we can build the de Bruijn graph. Then, the assembly problem becomes the problem of finding an Eulerian cycle in the de Bruijn graph, which is easily solvable in linear time. Thus, by formulating the assembly problem in terms of the de Bruijn graph, we can solve the much easier Eulerian cycle problem and not have to solve the NP-hard Hamiltonian cycle problem.

In this post, we explain that while de Bruijn graphs have indeed been very useful, the reason has nothing to do with the complexity of the Hamiltonian and Eulerian cycle problems.

Every Eulerian cycle in a de Bruijn graph or a Hamiltonian cycle in an overlap graph corresponds to a single genome reconstruction where all the repeats are completely resolved. For example, Figure 1 shows two different Eulerian cycles in the same graph (a similar example could be constructed for Hamiltonian cycles in an overlap graph). Each cycle corresponds to a different arrangement of segments between the repeats. The presence of multiple Eulerian or Hamiltonian cycles implies that the genome structure is ambiguous given the data available. In other words, using the same set of reads one can reconstruct different genomes, each of which is fully consistent with the data (Figure 1 gives an example). Choosing one of these reconstructions arbitrarily would be foolhardy since only one of them is the original genome. No sane assembly algorithm would do this, and that is one of the major reasons why an algorithm for finding Eulerian or Hamiltonian cycles is not part of any assembly algorithm used in practice.

Figure 1: A worked-out example for a set of reads R = {TATTA, TAATA} and k=3. Here, the set of all k-mers is S = sp^k(R) = {TAT, ATT, TTA, TAA, AAT, ATA}. Panel A shows G₁ = dBG^k(S) and one possible Eulerian cycle of G₁ (in blue). Panel B shows the only other Eulerian cycle in G₁ (in orange). The genome reconstruction corresponding to the blue cycle is ATTAATAT and to the orange cycle is ATTATAAT (note that because the genome is circular, the last two characters of each string are equal to the first two characters).
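The example in Figure 1 is small enough to reproduce by brute force. The following is a minimal Python sketch, with illustrative function names; it uses exhaustive backtracking, which is fine for a toy input but is not how one would find an Eulerian cycle efficiently. It enumerates the Eulerian cycles that begin with the k-mer ATT and spells out the corresponding reconstructions.

```python
def spectrum(reads, k):
    """The k-spectrum: all k-mer substrings of the reads."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def eulerian_cycles(kmers, start):
    """All Eulerian cycles of the de Bruijn graph whose edges are `kmers`,
    given as k-mer sequences that begin with `start` (this fixes the rotation).
    Brute-force backtracking, intended for toy inputs only."""
    cycles = []

    def extend(path, unused):
        if not unused:
            # Every edge is used; keep the path only if it closes into a cycle.
            if path[-1][1:] == path[0][:-1]:
                cycles.append(path)
            return
        for x in sorted(unused):
            if path[-1][1:] == x[:-1]:  # consecutive edges share a (k-1)-mer
                extend(path + [x], unused - {x})

    extend([start], set(kmers) - {start})
    return cycles


def spell(cycle):
    """First k-mer followed by the last character of each later k-mer."""
    return cycle[0] + "".join(x[-1] for x in cycle[1:])


S = spectrum(["TATTA", "TAATA"], 3)
genomes = [spell(c) for c in eulerian_cycles(S, "ATT")]
print(genomes)  # ['ATTAATAT', 'ATTATAAT']
```

The two strings printed are the blue and orange reconstructions from the caption above.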

Instead, assemblers output contigs: long, contiguous segments which can unambiguously be inferred to be part of the genome. Finding such segments is a very different computational problem than finding a single Eulerian or Hamiltonian cycle¹. In fact, it was shown that finding all possible contigs can be done in polynomial time, regardless of whether the genome reconstruction is modeled as a Hamiltonian or Eulerian cycle (Tomescu and Medvedev, 2016). The algorithm used in practice (the unitig algorithm) is linear and nearly identical in the two graph models.
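To give a feel for what unitig construction looks like, here is a minimal Python sketch. It assumes the usual definition of a unitig as a maximal path whose internal vertices all have exactly one incoming and one outgoing edge. It ignores reverse complements, sequencing errors, and isolated cycles, so it is an illustration under those assumptions rather than what any particular assembler does.

```python
from collections import defaultdict


def unitigs(kmers):
    """Spell the maximal non-branching paths of the de Bruijn graph whose
    edges are `kmers`. Simplified sketch: no reverse complements, and a
    component that is a single isolated cycle is not reported."""
    out_edges = defaultdict(list)  # (k-1)-mer vertex -> outgoing k-mer edges
    indeg = defaultdict(int)       # (k-1)-mer vertex -> number of incoming edges
    for x in sorted(kmers):
        out_edges[x[:-1]].append(x)
        indeg[x[1:]] += 1

    def passthrough(v):
        return indeg[v] == 1 and len(out_edges[v]) == 1

    result = []
    for v in list(out_edges):      # copy keys: lookups below may add new ones
        if passthrough(v):
            continue               # unitigs start only at branching vertices
        for first in out_edges[v]:
            edge, path = first, first
            while passthrough(edge[1:]):
                edge = out_edges[edge[1:]][0]
                path += edge[-1]
            result.append(path)
    return sorted(result)


S = {"TAT", "ATT", "TTA", "TAA", "AAT", "ATA"}  # k-mers from Figure 1
print(unitigs(S))  # ['ATA', 'ATTA', 'TAAT', 'TAT']
```

On the Figure 1 k-mers, the vertices AT and TA are branching, so the graph breaks into four unitigs. Each of them appears in both of the possible reconstructions.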

Perhaps you are not convinced by the above reasoning? Fine. For the sake of argument, let’s imagine that we really are interested in finding a single, arbitrary, genome reconstruction. But even in this case, the distinction between Eulerian and Hamiltonian cycles is misleading. We make our point with this Theorem, which we first state informally (a formal statement and proof will come below):

Main Theorem (informal): The following problems are equivalent and solvable in linear time:

Find an Eulerian cycle in the de Bruijn graph where the edges correspond to k-mers in the reads.
Find a Hamiltonian cycle in the de Bruijn graph where the edges correspond to all the possible (k+1)-mers that can be obtained from the reads’ k-mers.

The first part of the theorem should not be surprising. It states one half of the story we started with, namely that we can solve the assembly problem in linear time by finding an Eulerian cycle in a de Bruijn graph. The second part of the theorem, though, adds a twist. It is about finding a Hamiltonian cycle, but it differs from the initial story in two ways. First, it is a Hamiltonian cycle in a de Bruijn graph, not in an overlap graph. This might seem strange, but there is no special connection between overlap graphs and the Hamiltonian cycle problem — one is free to find a Hamiltonian cycle in any graph they wish. Second, the problem is solvable in linear time in this case, even though it is NP-hard in general. This might also seem strange, but in fact it is common for NP-hard problems to have polynomial-time solutions for a restricted class of inputs².

What the theorem states, then, is that one can solve the assembly problem in linear time by finding a Hamiltonian cycle within an appropriately defined de Bruijn graph. The fact that the Hamiltonian cycle problem is NP-hard in general graphs is not directly relevant. What is important is the underlying structure of the de Bruijn graph which makes the Hamiltonian cycle problem easy to solve³. Hence, the initial story was right in the sense that using de Bruijn graphs is a good idea but wrong to imply that the complexity of the Hamiltonian cycle problem is a reason. All of this is of course assuming we are, for some reason, interested in an arbitrary genome reconstruction, which, as we argued earlier, we typically are not.

So why are de Bruijn graphs so popular for short read assembly, if not for the difference in the complexity of finding Eulerian or Hamiltonian cycles? The answer is complex, which might explain why the initial simple story was appealing. It may have to do with the simplicity of their implementation, the appeal of the k-mer abstraction, the ease of error correction, or with something else. In fact, the difference between using de Bruijn graphs and overlap graphs is poorly understood and is a fascinating open research problem. But, the Eulerian and Hamiltonian cycle dichotomy is not really relevant to assembly or to the popularity of de Bruijn graphs.

Acknowledgements

PM would like to thank Rayan Chikhi for feedback on the post and Alexandru Tomescu and Michael Brudno for many helpful discussions on this topic (and Michael Brudno specifically for introducing him to the problem a long time ago).

References

Bresler, Guy, Ma’ayan Bresler, and David Tse. “Optimal assembly for high throughput shotgun sequencing.” In BMC bioinformatics, vol. 14, no. S5, p. S18. BioMed Central, 2013.
Carl Kingsford, Michael C. Schatz, and Mihai Pop. “Assembly complexity of prokaryotic genomes using short reads.” In BMC bioinformatics 11(1): 21, 2010.
Medvedev, Paul, Konstantinos Georgiou, Gene Myers, and Michael Brudno. “Computability of models for sequence assembly.” In International Workshop on Algorithms in Bioinformatics, pp. 289-301. Springer, Berlin, Heidelberg, 2007.
Medvedev, Paul. “Modeling biological problems in computer science: a case study in genome assembly.” Briefings in bioinformatics 20, no. 4 (2019): 1376-1383.
Nagarajan, Niranjan, and Mihai Pop. “Parametric complexity of sequence assembly: theory and applications to next generation sequencing.” Journal of computational biology 16, no. 7 (2009): 897-908.
Tomescu, Alexandru I., and Paul Medvedev. “Safe and complete contig assembly through omnitigs.” Journal of Computational Biology 24, no. 6 (2017): 590-602.

Proof of theorem

Let’s prove the main theorem, which follows almost directly from definitions. It may have been observed in previous papers, though we are not sure if it has been explicitly stated.

Let t be a string, k be a positive integer, and S be a set of k-mers. Let R be a set of reads, i.e. strings of length at least k. We define pre_i(t) and suf_i(t) as the prefix and suffix, respectively, of length i of t. The k-spectrum of R, denoted by sp^k(R), is the set of all k-mer substrings of the strings of R. The de Bruijn graph of order k of S, denoted as dBG^k(S), is defined as follows⁴. The vertex set is sp^{k-1}(S), and for every k-mer x ∈ S, we add an edge from pre_{k-1}(x) to suf_{k-1}(x). Figure 1 shows an example of a de Bruijn graph. We define the closure of S, denoted closure(S), to be the set of all (k+1)-mers y such that pre_k(y) ∈ S and suf_k(y) ∈ S. Informally, it is the set of all (k+1)-mers that can be constructed from S.

Main Theorem (formal): Let R be a set of strings whose smallest length is L. Let k be a positive integer less than L. Then, there is a one-to-one correspondence between Eulerian cycles in dBG^k(sp^k(R)) and Hamiltonian cycles in dBG^{k+1}(closure(sp^k(R))). Moreover, an Eulerian or Hamiltonian cycle can be found in its respective graph in O(|sp^k(R)|) time.

Proof: Let S = sp^k(R). Let G₁ = dBG^k(S) and G₂ = dBG^{k+1}(closure(S)). First, we show that the vertex set of G₂ is S, i.e. sp^k(closure(S)) = S. Clearly, sp^k(closure(S)) ⊆ S, since no new k-mers are created during the closure process. Now, let x be a k-mer in S. It must appear in some read r, and since the length of r is greater than k, x must be a prefix or suffix of some (k+1)-mer y in r. Moreover, y must be in closure(S), since its prefix and suffix of length k are both in r and hence in S. Therefore, x ∈ sp^k(closure(S)), completing our proof that S ⊆ sp^k(closure(S)) and the vertex set of G₂ is S.

Observe that a sequence of k-mers C₁ = x₀, …, x_{n-1} is a sequence of edges defining an Eulerian cycle in G₁ if and only if the set of k-mers of C₁ is exactly S (without any repetitions) and, for all i, suf_{k-1}(x_i) = pre_{k-1}(x_{(i+1) mod n}). Also, observe that a sequence of k-mers C₂ = x₀, …, x_{n-1} is a sequence of vertices defining a Hamiltonian cycle in G₂ if and only if exactly the same criteria hold, i.e. the set of k-mers of C₂ is exactly S (without repetitions) and, for all i, suf_{k-1}(x_i) = pre_{k-1}(x_{(i+1) mod n}). Thus, there is a one-to-one correspondence between Eulerian cycles in dBG^k(S) and Hamiltonian cycles in dBG^{k+1}(closure(S)). Figure 2 shows an example.

Figure 2: Using the same example as in Figure 1, this figure shows G₂ = dBG^{k+1}(closure(S)) and the two possible Hamiltonian cycles in G₂. Notice that the k-mer sequence corresponding to the blue cycle in Panel A is the same as in Fig 1A; similarly, the k-mer sequence for the orange cycle in Panel B is the same as in Fig 1B.
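The correspondence can be sanity-checked on the running example by enumerating both kinds of cycles independently and comparing them. The following is a minimal Python sketch with illustrative names. Both enumerations are brute-force backtracking, so this is a check on a toy input and not a substitute for the proof.

```python
from collections import defaultdict


def closure(S):
    """(k+1)-mers whose length-k prefix and suffix are both in S."""
    return {x + z[-1] for x in S for z in S if x[1:] == z[:-1]}


def eulerian_cycles(S):
    """Edge sequences of Eulerian cycles in G1 = dBG^k(S), each written
    starting from the smallest k-mer so that rotations are not repeated."""
    out = defaultdict(list)            # (k-1)-mer vertex -> outgoing k-mer edges
    for x in S:
        out[x[:-1]].append(x)
    first, found = min(S), set()

    def walk(path, used):
        if len(path) == len(S):
            if path[-1][1:] == first[:-1]:
                found.add(tuple(path))
            return
        for x in out[path[-1][1:]]:
            if x not in used:
                walk(path + [x], used | {x})

    walk([first], {first})
    return found


def hamiltonian_cycles(S):
    """Vertex sequences of Hamiltonian cycles in G2 = dBG^{k+1}(closure(S))."""
    succ = defaultdict(set)            # k-mer vertex -> successor k-mer vertices
    for y in closure(S):
        succ[y[:-1]].add(y[1:])
    vertices = set(succ) | {v for vs in succ.values() for v in vs}
    first, found = min(vertices), set()

    def walk(path, seen):
        if len(path) == len(vertices):
            if first in succ[path[-1]]:
                found.add(tuple(path))
            return
        for v in succ[path[-1]]:
            if v not in seen:
                walk(path + [v], seen | {v})

    walk([first], {first})
    return found


S = {"TAT", "ATT", "TTA", "TAA", "AAT", "ATA"}   # k-mers from Figure 1
print(eulerian_cycles(S) == hamiltonian_cycles(S))  # True
```

Both functions return the same two k-mer sequences, which are rotations of the blue and orange cycles in the figures.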

For the running time, an Eulerian cycle can be found in time linear in the number of edges using a classical algorithm, e.g., Hierholzer’s Algorithm. Given the one-to-one correspondence above, the vertex labels of a Hamiltonian cycle in G₂ can be found by outputting the edge labels of an Eulerian cycle in G₁. Hence, the running times for the two problems are equivalent.
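For concreteness, here is a minimal Python sketch of an iterative version of Hierholzer's algorithm on a de Bruijn graph given by its k-mer edges. The function name and the choice of start vertex are illustrative. The sketch assumes the graph actually has an Eulerian cycle and only performs a final length check.

```python
from collections import defaultdict


def hierholzer(kmers):
    """Return the k-mers in the order of one Eulerian cycle of the de Bruijn
    graph whose edges are `kmers`. Each edge is pushed and popped once, so the
    traversal is linear in the number of edges (the initial sort is only for
    deterministic output)."""
    out = defaultdict(list)            # (k-1)-mer vertex -> unused outgoing edges
    for x in sorted(kmers):
        out[x[:-1]].append(x)

    start = min(kmers)[:-1]
    stack = [(start, None)]            # (vertex, edge used to reach it)
    cycle = []
    while stack:
        v, edge_in = stack[-1]
        if out[v]:
            x = out[v].pop()           # follow any unused edge out of v
            stack.append((x[1:], x))
        else:
            stack.pop()                # v is exhausted: emit the edge into v
            if edge_in is not None:
                cycle.append(edge_in)
    cycle.reverse()

    if len(cycle) != len(set(kmers)):
        raise ValueError("graph has no Eulerian cycle from this start vertex")
    return cycle


S = {"TAT", "ATT", "TTA", "TAA", "AAT", "ATA"}   # k-mers from Figure 1
cycle = hierholzer(S)
print(cycle)
```

By the correspondence above, the same list of k-mers, read as vertex labels, is a Hamiltonian cycle in G₂.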

∎

Algorithm Engineering / Experimental Algorithms

paulmedv — Wed, 01 Apr 2020 01:58:54 +0000

If you know me personally, then you know how often I complain about the gap between theory and practice in bioinformatics. Recently, I came across a research area that seeks to address this gap, though not in bioinformatics specifically. It is called experimental algorithms, or algorithm engineering. Briefly, “Experimental Algorithmics studies algorithms and data structures by joining experimental studies with the traditional theoretical analyses” (Moret, 2000). I had heard about this before in passing, but I finally took the time to look up details about what this is. I wish I had done this years earlier.

There is a lot that has been written about it, so I don’t think I have anything to add for now. But, I decided to make a post that collects all the resources I found and a few questions I had. This is partially because it will help me organize my thoughts, but also might be a good starting point for someone who, like me, wants to learn more about it. Please note that this list is definitely not comprehensive or representative. Please leave a comment to add things that I missed or if you have any other thoughts.

What is algorithm engineering / experimental algorithms and what problems is it intended to address?

Sanders, Peter. “Algorithm engineering–an attempt at a definition using sorting as an example.” ALENEX 2010.
Moret, Bernard ME. “Towards a discipline of experimental algorithmics.” Data Structures, Near Neighbor Searches, and Methodology: Fifth and Sixth DIMACS Implementation Challenges 59 (2002): 197-213.
On experimental algorithmics: an interview with Catherine McGeoch and Bernard Moret
Chapter 1 from Müller-Hannemann, Matthias, and Stefan Schirra. Algorithm Engineering: Bridging the Gap Between Algorithm Theory and Practice. Springer, 2001.

What is the methodology for experimentally evaluating an algorithm?

A whole book about it: Müller-Hannemann, Matthias, and Stefan Schirra. Algorithm Engineering: Bridging the Gap Between Algorithm Theory and Practice. Springer, 2001.
  - Chapter 4.8 struck me as particularly interesting. It talked about how to try to extrapolate asymptotic behavior from experiments.
Another book about it: A Guide to Experimental Algorithmics by Catherine C. McGeoch
  - (this book is behind a paywall that even Penn State and scihub are not able to penetrate…but there is a PDF floating out there if you google it)
A recent article: Angriman, Eugenio, et al. “Guidelines for experimental algorithmics: A case study in network analysis.” Algorithms 12.7 (2019): 127.
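As a small illustration of what extrapolating asymptotic behavior from experiments can look like, here is a minimal Python sketch of a doubling experiment. It is a generic textbook-style technique and is not taken from any of the resources above. To keep it deterministic, it counts element shifts in insertion sort on reversed input instead of measuring wall-clock time; the function names are illustrative.

```python
import math


def insertion_sort_shifts(n):
    """Number of element shifts insertion sort makes on a reversed list of size n."""
    a = list(range(n, 0, -1))
    shifts = 0
    for i in range(1, n):
        key, j = a[i], i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
            shifts += 1
        a[j + 1] = key
    return shifts


def growth_exponents(cost, sizes):
    """Estimate b in cost(n) ~ c * n^b from consecutive pairs of input sizes,
    using the slope between the two points on a log-log scale."""
    points = [(n, cost(n)) for n in sizes]
    return [math.log(c2 / c1) / math.log(n2 / n1)
            for (n1, c1), (n2, c2) in zip(points, points[1:])]


exponents = growth_exponents(insertion_sort_shifts, [100, 200, 400, 800])
print([round(e, 3) for e in exponents])
```

The estimated exponents come out close to 2, consistent with the quadratic worst case of insertion sort. With real timings instead of operation counts, one would also need repeated runs and some care with noise.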

Some classes that teach algorithm engineering:

What conferences / journals publish work in experimental algorithms / algorithm engineering?

SIAM Symposium on Algorithm Engineering and Experiments (ALENEX)
European Symposium on Algorithms (ESA): Engineering and Application Track
Symposium on Experimental Algorithms (SEA)
ACM Journal of Experimental Algorithmics (JEA)

What about bioinformatics?

Here is an example of an older paper that presents itself in the mold of algorithm engineering:
  - Moret, Bernard ME, David A. Bader, and Tandy Warnow. “High-performance algorithm engineering for computational phylogenetics.” The Journal of Supercomputing 22.1 (2002): 99-111.
Here are two recent k-mer data structure bioinformatics papers:
  - Limasset, Antoine, et al. “Fast and scalable minimal perfect hashing for massive key sets.” 16th International Symposium on Experimental Algorithms. Vol. 11. 2017.
  - Zentgraf, Jens, Timm, Henning, and Rahmann, Sven. “Cost-optimal assignment of elements in genome-scale multi-way bucketed Cuckoo hash tables.” ALENEX 2020.

Questions that I am left with:

What does a paper in algorithm engineering / experimental algorithms look like? How is it different from other papers?
  - I found this question difficult to answer. I looked through the recent ALENEX proceedings to get an idea. Compared to SODA papers, the difference was clear: there was always an experimental analysis, often extensive. But beyond that? It seemed that many bioinformatics papers published in a journal like Oxford Bioinformatics or in a conference like WABI and RECOMB could have been ALENEX papers. At least, the papers without much biology. I’m thinking for example of my papers on the compaction of de Bruijn graphs or of other papers on data structures for k-mers. Is it the case that for a paper to belong to the algorithm engineering community it should study an algorithm for a problem of BROAD interest, i.e. one that is relevant to more than just one application domain like bioinformatics? I don’t know.
What is the relationship of algorithm engineering / experimental algorithms to Data Science?
  - It seems like there is an intersection, e.g. the McGeoch book has a chapter on “Data Analysis” that would today be part of a Data Science course. Both fields differentiate themselves from Statistics and Computer Science by focusing on empirical performance, rather than theoretical properties of the algorithm or the data.
What is the relationship of algorithm engineering / experimental algorithms to Computational Science?
What are the big open questions in the field of algorithm engineering / experimental algorithms?
  - Bioinformaticians often get asked this question, and it’s not an easy one to answer for us. I’m not sure if it is even a fair question: should a field necessarily have major open questions? It’s not a natural science, and maybe a field that is engineering rather than science does not need to have such questions. Either way, are there such questions in algorithm engineering / experimental algorithms?

My Twitter guidelines

paulmedv — Tue, 25 Feb 2020 16:44:29 +0000

I was recently asked about my attitude towards Twitter, which inspired me to write a blog post about it. It took me a while to start using Twitter, and then only as a passive observer. On the one hand, the format is not conducive to constructive discussions; on the other, Twitter is an amazing way to connect and share information with other researchers. On balance, I decided that until something better comes along, I will use Twitter.

The potential options for what to (re)tweet are large. I was initially overwhelmed by having to make a decision every time I wondered whether to tweet something or not. To overcome this, I came up with a set of guidelines for this decision making process. These guidelines greatly reduced the activation energy for writing a tweet. I wanted to share them here in case 1) anyone was curious why I do or do not (re)tweet certain things, 2) my experience can help someone have an easier time with Twitter, and 3) other people are willing to share their own guidelines. Please keep in mind that I do not suggest these guidelines as general rules for Twitter usage. They work to help me achieve my personal goals but, to the extent your goals are different, will not necessarily work for you.

Here are my guidelines:

Avoid using Twitter for discussion. I want to use Twitter for sharing information, but not for engaging in discussion. I would love to engage in constructive discussion in an open forum, but I don’t find Twitter is a good platform for that.
Limit tweets to research-related content. A narrow and cohesive scope means that my tweets will tend to be of interest to many of my followers. I don’t expect that people that care about my research are also interested in my views on politics or in pictures of my cat (though he is sooooo cute).
Avoid tweeting on controversial issues. The reason is that such tweets can potentially generate vitriol and aggression, and even the anticipation of this leads to anxiety for me. This is a somewhat selfish rule, as it prioritizes my mental health over the need to speak out on important issues related to academia. Such issues may include research ethics, publication biases, or university politics. These are really important, so I feel some guilt about not commenting on them on Twitter; however, I can effect positive change through other means not involving Twitter (e.g. through my actions when I’m in a position to do something). Since in today’s climate it is not always possible to predict what may be controversial, my rule of thumb is to stick to research-related content as a safe bet (it also matches rule 2).
My tweets are not endorsements. If I retweet a link to a paper, it does not mean it’s a good paper. It just means it is research that someone has done on a topic that is of interest to me or to my followers. If I were to retweet only papers that I endorse, it means I would have to read the paper first. As a result, I would not retweet anything in a timely manner. Adopting this rule was crucial for me to be able to start to actively engage with Twitter.
Try to promote research rather than people. Twitter is a good venue to promote research, like tweeting about a useful paper. But I try to make a distinction between promoting people’s research vs. the people themselves. For example, if a tweet has a link to a resource (e.g. a paper or software), then it is promoting the research. If a tweet just congratulates somebody on an award, then it is promoting the person. I think promoting people is dangerous if one does not know them well (for example, what if somebody does great research but turns out to be a jerk).
Be positive. If I enjoy reading a paper or see a great talk or come across a great resource, Twitter is a great way to let the authors know their work is appreciated. What if I see something for which I want to make a more critical comment? Most of the time, I can do that privately.
Avoid drive-by research commentary, especially the critical kind. For example, “Thanks for the paper but your analysis sucks.” A paper is often the result of years of work by a student and the least one can do before trashing it is to read it completely and give a thought-out response, including a balanced focus on the strengths and weaknesses (some good advice in this tweet). This is what we strive for when writing reviews, and I don’t think we should drop this standard because of Twitter. Drive-by criticism is often done by people who have not even read the paper carefully. Drive-by positive statements are more acceptable, but only if I have read the paper enough to endorse it.
Try to be honest about my intentions. I see a lot of tweets of the form: “I am so humbled that our paper X won the greatest paper of all time award.” If you were truly humbled, then you wouldn’t be tweeting about it. In truth, you just want to show off your accomplishment. There is nothing wrong with that. I would just phrase it more honestly, for example: “Our paper X won the greatest paper of all time award.”

These guidelines serve me as a baseline, but I feel free to break them if needed. My thoughts about this will continue to evolve based on my experience, feedback from others, and the evolution of Twitter and its alternatives.

Let me also add that there are many alternate approaches to using Twitter that violate the guidelines above. For example, one might choose to engage with controversial issues, or one might be negative as a way to effect positive change. It just depends on what works for you and what you want to achieve. So, I don’t proselytize any of the above, except perhaps avoiding drive-by criticism. I am not sure I see a justification for drive-by criticism.

What are some common issues I find when reviewing algorithmic bioinformatics conference papers?

paulmedv — Sun, 11 Aug 2019 17:56:12 +0000

UPDATE: This post, in a slightly modified and improved form, is now published at https://doi.org/10.1371/journal.pcbi.1007742 . Please see there for the latest version.

As a PC member, I sometimes find it frustrating to see a paper that potentially has a great contribution be rejected because of the way it was written. I wish that I had the opportunity to tell the authors: hey, you forgot to do this really important thing, without which it’s hard to accept the paper, but if you could go back and fix it, you might have a great paper for the conference. In our conference format, this type of back-and-forth is usually not possible. This motivated me to write this post, so that newcomers to the field have a chance to know in advance what a potential reviewer might look for in an algorithmic bioinformatics conference paper.

What do I mean by an algorithmic bioinformatics conference paper? I am thinking of the subset of papers submitted to RECOMB/WABI/ISMB that take an algorithm-based approach to solving a bioinformatics problem. This is largely intended to contrast against papers more rooted in statistical methodology, where the standards are a bit different. I also focus on conference reviews, where the process is a bit different than for a journal. When reviewing a paper for a bioinformatics journal like Oxford Bioinformatics, there is of course an opportunity for the authors to address any limitations in a revision.

I want to also add a disclaimer that this is not in any way an official statement about what PC members would look for in a review. As far as I know, there is no such official policy, and the things that each PC member looks for do not completely overlap. There is a diversity of standards and that is why each paper has multiple PC members reviewing it. I cannot speak for others, though I hope that people can add their comments and feedback.

When I review, the first thing I try to identify is: what is the main novel contribution of the paper? Is it an idea, a theorem, an algorithm, or a tool (i.e. software) people can use? Sometimes a paper has all these components, but not all of them contribute to the novelty of the paper. Here are some examples:

1) The paper contains an algorithm and a tool that implements the algorithm. The algorithm itself may be a simple modification of what is previously known, but the algorithm is implemented in a novel software tool for an important biological problem. If the tool performance is an improvement over previous tools, then the tool is the main contribution.

2) In another example, the main novelty is in the algorithm or in its analysis, and this is what the reader is intended to take away from the paper. The paper may have implemented a tool, but the intention of the tool is to only be a prototype to test the feasibility of the idea. The tool is not the main contribution.

3) Sometimes, the main contribution of the paper is novel biological findings, without any methodological (either algorithmic or software) novelty. This is not really within the scope of the RECOMB/ISMB†/WABI conferences, where the contribution has to be methodological. Certainly, having novel biological findings can serve to demonstrate the strength of the methodological contribution. But if you discover a cure for cancer by applying existing software, then it is probably outside the scope of RECOMB/ISMB†/WABI.

It is up to the authors to make the main contribution of the paper crystal clear to the reader. As a reviewer, I will then base my evaluation on what the authors claim. If the authors’ claim is not clearly stated, then I will do my best to guess what it is. But if I make a mistake, then I may end up evaluating the paper from a completely incorrect angle.

Here are some common issues I find with papers. This is not intended to be an exhaustive list and only includes issues that are both basic and that I’ve seen multiple times. In a competitive venue, a paper is usually accepted based on its strengths rather than a lack of weaknesses; however, in my experience, the weaknesses below typically ruin a paper’s chance of acceptance.

Context within prior algorithmic work is not given: A common scenario where this happens is when the authors developed a method for a particular biological dataset, and there are no other tools designed specifically for this kind of dataset or problem. However, the problem and/or solution might be very similar to what has been previously studied. For instance, many problems come down to clustering of some data points (e.g. genes in a network or reads from a sequencing experiment) or to some version of sequence alignment. The algorithmic context of such a paper is, at least in part, clustering or, respectively, alignment algorithms. Sometimes the authors provide the biological context (e.g. what is the relationship to previous approaches to finding genes in a network) but leave out the algorithmic one (e.g. what is the relationship to previous clustering algorithms). Why is this particular problem or dataset different enough so that standard clustering or alignment techniques do not apply? If the authors present a clustering algorithm for the problem but do not answer this question in the intro, then their contribution is not placed in the algorithmic context — which makes it hard to evaluate its novelty.

Unclear writing: Some papers will contain many spelling and grammatical mistakes, or ambiguous notation and terminology. I try to do the best I can to understand the contribution of the paper, and often I do understand it in spite of these problems. In such cases, it does not greatly influence my overall decision about the paper, and I generally trust the authors to clean up the paper before publication (if it is accepted). In other cases, I cannot understand the paper after a reasonable amount of time trying. In these cases, I simply cannot evaluate the paper’s contribution.

The paper is written in the style of a biology journal: In biology journals, the methods section is often written as a step-by-step manual necessary to reproduce the results (i.e. a pipeline of processing steps on the data). This type of presentation focuses on implementation details and reproducibility rather than highlighting the novelty of the algorithm. Even if the method is novel, when it is written in this style it is hard for the reader to identify and understand the novel parts. Another aspect of this is that for a biology journal, the results section comes before the methods section. Doing this for an algorithmic bioinformatics paper is not in and of itself a problem, but it usually correlates with not enough focus being given to the method.

Claims in the intro that are not supported by the rest of the paper: For example, the authors claim that their tool is the fastest to date for a problem, but the results section only contains a comparison against one other tool or only on a narrow type of data. In such cases, I simply ask the authors to tone down their claims. However, sometimes the unsupported claims are central to the stated importance of the paper, in which case this feels a bit disingenuous. Another example is the bait-and-switch, when the intro claims that the paper presents an algorithm for some interesting problem, but what ends up being evaluated in the results is an algorithm for a slightly different problem.

There is neither a strong theoretical contribution nor an experimental evaluation: Some contributions are theoretical — a powerful idea, a way of thinking about a problem, or a theorem which can be applied by other algorithm developers. These papers require a lot of work on the modeling or theoretical side, and it can be justifiable if experimental results are either not included or limited. However, in most other cases, experimental evaluation is essential to a paper. If this is missing or is inappropriate to the problem, it can make it impossible to evaluate the strength of the contribution.

There is no comparison against other work: The authors sometimes find it obvious that their method should work much better than anything else out there. They may be right, but it is important to demonstrate this in the paper by finding the most compelling alternative approach and comparing against it.

Software: If the main contribution of the paper is a tool, then the tool should be usable. At the very least, I should be able to download the software, install it, and run it on a toy input that is provided in the download. If I can see that the tool already has some users (e.g. through GitHub activity), then this is enough to demonstrate its usability and I may not bother to try it out myself. On the other hand, if the paper contains a tool that is only a prototype and is not the main contribution, then the usability of the software is not something I consider very important.

Correctness: Sometimes the authors present an algorithm or data structure for which they prove the correctness, or for which correctness is obvious from the construction. For example, it could be a data structure to represent and query some data. However, when they evaluate their tool (e.g. for its runtime), it is still essential that the correctness of the implementation is explicitly verified in the experiments. This can be a simple one-line statement, e.g. we verified that the new data structure gives the same answers to queries as the previous one on all the evaluated datasets. Without this check, how does the reader know that the algorithm is not twice as fast as the competition just because it has a bug?
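
To make the kind of check I have in mind concrete, here is a minimal Python sketch. Everything in it is hypothetical and only for illustration: SortedKmerIndex stands in for a "new" data structure, baseline_contains stands in for the previous method, and the random k-mers stand in for the evaluated datasets. The only point is that both are asked the same queries and the number of disagreements is reported.

```python
import bisect
import random


def baseline_contains(kmers, query):
    """Reference answer (stand-in for the previous tool): a plain linear scan."""
    return any(k == query for k in kmers)


class SortedKmerIndex:
    """Hypothetical 'new' data structure: a sorted list queried by binary search."""

    def __init__(self, kmers):
        self._sorted = sorted(set(kmers))

    def contains(self, query):
        i = bisect.bisect_left(self._sorted, query)
        return i < len(self._sorted) and self._sorted[i] == query


def random_kmer(rng, k):
    """Draw one random DNA string of length k."""
    return "".join(rng.choice("ACGT") for _ in range(k))


def count_disagreements(kmers, queries):
    """Number of queries on which the new index and the baseline differ."""
    index = SortedKmerIndex(kmers)
    return sum(index.contains(q) != baseline_contains(kmers, q) for q in queries)


rng = random.Random(0)  # fixed seed so the check is reproducible
kmers = [random_kmer(rng, 5) for _ in range(200)]
# Mix queries that are certainly present with random ones that may be absent.
queries = kmers[:50] + [random_kmer(rng, 5) for _ in range(200)]

mismatches = count_disagreements(kmers, queries)
print(f"{mismatches} disagreements over {len(queries)} queries")
```

In a real paper, the stored data and the queries would be the actual evaluated datasets, and the result would be reported as the one-line statement that the two methods agreed on all of them.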

No analysis of running time or memory usage: In most cases, it is important for an algorithmic bioinformatics paper to present the running time and memory usage of the algorithm, through experimental evaluation and/or theoretical analysis. This is a very natural thing to do for computer scientists, but I sometimes find that researchers with a different background forget to include this. In other cases, the authors do not include any memory or time analysis because they know that it is tiny and beside the main point, but this may not be at all obvious to the reader. In such cases, a simple statement to the effect that the memory usage or running time is negligible would suffice.
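
For authors unsure how to obtain such numbers, here is a rough Python sketch using only the standard library. The measure helper and the count_kmers workload are hypothetical names made up for this illustration. Note that tracemalloc only sees memory allocated through Python, so for a compiled tool one would instead rely on an external measurement of wall-clock time and peak memory.

```python
import time
import tracemalloc


def measure(func, *args):
    """Return (result, elapsed_seconds, peak_bytes) for func(*args).

    Time and memory are measured in two separate runs, because memory
    tracing slows the code down and would distort the timing.
    """
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start

    tracemalloc.start()
    func(*args)
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    return result, elapsed, peak


def count_kmers(sequence, k):
    """Toy workload: count the occurrences of every k-mer in a string."""
    counts = {}
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts


sequence = "ACGT" * 25_000  # synthetic input of 100,000 characters
counts, seconds, peak_bytes = measure(count_kmers, sequence, 8)
print(f"time: {seconds:.3f} s, peak memory: {peak_bytes / 1e6:.2f} MB")
```

Reporting numbers like these for each evaluated dataset, next to the same numbers for the competing tools, is usually all that is needed.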

† ISMB is very diverse, with many different types of tracks and presentations. This blog only refers to those papers focusing on methods. Certain tracks may also feature papers focusing on the biology.
