Inundata

A quick introduction to ggplot2

Karthik Ram — Wed, 10 Apr 2013 20:14:33 +0000

My friend Jonah asked me to guest lecture in his R seminar aimed at grad students and postdocs in Integrative Biology. I gave Jonah a bunch of topic options ranging from reproducible research with R to data manipulation. The consensus was data visualization, so I put together a two-hour talk and hands-on session for ggplot2 beginners. Here are my slides and code in case anyone else might benefit from them.

What worked

People got a good sense of what’s possible with ggplot2 even if they couldn’t keep up with the examples.
The code worked for most attendees except for a few glitches (see below).

What didn’t work

Several people still had very old copies of R or outdated copies of ggplot2 dependencies, which made upgrading less straightforward than it should have been.
Sourcing code/data from GitHub using source_url from devtools doesn’t seem to work on Windows machines. I can’t replicate the problem without access to one, but I might have to work out a better way to share data.

Slides are on Speakerdeck and the full repository is on GitHub. Please feel free to reuse, remix, or contribute to this presentation.

Version control for science

Karthik Ram — Thu, 28 Feb 2013 21:24:43 +0000

I’ve been thinking a lot about the importance of version control in science of late. This is not just because of my involvement with multiple collaborative efforts that would be a nightmare to move forward without a structured workflow. I fortuitously got involved in a collaboration between GitHub, BioMed Central, and a handful of bioinformatics scientists to explore the importance of version control in a scholarly communication context.

I’m happy to announce that the first outcome of the project, a paper (open access) outlining various use-cases for Git in science just went in press today. Here is the official announcement from BMC and also a blog post that I co-wrote with C. Titus Brown.

Ram, K. (2013). git can facilitate greater reproducibility and increased transparency in science. Source Code Biol Med, 8, 7.

Altmetrics as a discovery tool

Karthik Ram — Thu, 24 Jan 2013 00:10:46 +0000

Altmetrics is all the rage these days in the scientometrics world. One rationale for developing these metrics has been to quantify the entire range of academic output beyond publications to include everything from datasets and code to presentations. The idea is that these metrics would one day be used in tenure committees (and tenure track applications?) to get a more complete picture of a researcher’s contributions. As much as I love the idea (since I do so much more than write papers) and the people behind these efforts, I honestly don’t think these metrics will have any impact on folks currently in the academic trenches (pre-tenure faculty and postdocs on the job market). But I’d love to be proven wrong here.

However, I do think altmetrics are a terrific discovery tool. I was recently approached with an idea for a collaboration (yes, yes I know my plate currently overfloweth but this would be several months down the line) on an emerging research topic. I was given a few hot-off-the-press articles as starting points to get my feet wet. When getting into new topics, I usually pull up some highly cited articles, look at reverse citations on Web of Science and go from there. This is somewhat harder to do for research that is really, really new. In this case, without a second thought (probably because I hang around the altmetrics community a good bit and also develop some tools on that front), I popped the articles I had into ImpactStory and Altmetric and boom! pay dirt.

Both altmetrics providers led me to some really insightful blogs (not currently on my reading list) that gave me a lot more context about how the topic emerged and where it is likely headed. I also found Tweeps (in this case scientists on Twitter) who are working on this topic. A quick look through their lab pages and recent pubs and I have a pretty good sense of what this is all about. All over the course of a few hours. Doing the same thing a few years ago would have been impossible.

Make your research a little more open this year

Karthik Ram — Thu, 03 Jan 2013 19:44:53 +0000

After a winter break of working at half-speed, I’m finding it a little daunting to face the overwhelming number of projects that need my attention in the new year. As I sort through and prioritize the ones where the biggest fires are raging, I also like to use this time to reevaluate how I go about these activities and identify areas for improvement. If you’re in the same mental state right now, I’d like you to consider making your science a little more open as a challenge to take on in 2013.

More and more research published these days is difficult to replicate, validate, or build upon without all the critical components such as the underlying data and the code that was used to analyze it. Although support for open data and open science is steadily growing in the research community, putting this into practice requires some upfront investment. Since there are often no immediate incentives or payoffs, activities such as documenting code and metadata, and making both available in permanent repositories with appropriate licences, end up taking the back seat.

At rOpenSci, my fantastic colleagues and I have been building various R packages that make it easy to retrieve and reuse existing data and also share your research output through persistent repositories. If you’ve come across some of these before and found them useful, but hesitated because of the learning curve associated with using them for a real world project, you’re in luck! We’re offering our time and expertise to help you make your efforts (however small) to reuse data (or share your own research output) a reality. Our current suite of packages gets you access to a rich variety of data, from phylogenetic databases, taxonomic databases, and fisheries time series to the full text of any PLOS article and various scientometrics datasets. Perhaps the tutorials might inspire you. Even if you’re only working with data you collected, you can use rOpenSci tools to programmatically clean and submit your data, code, and/or manuscript pre-prints to figshare, a free science repository that will give you a permanent location (with a doi) to share with colleagues. If you have additional data sources in mind that don’t have an associated R package, drop us a line. We might be able to put together something fairly quickly for you to use (or we’ll add it to our existing todo list).

Learn more about the Open Science challenge and get in touch.

Formatting tables in markdown

Karthik Ram — Tue, 25 Dec 2012 22:48:55 +0000

Since someone asked about tables in markdown in the comments section of an earlier post, I thought I’d elaborate a little more. Since the appeal of markdown is its minimalism, options for formatting tables are also fairly limited. LaTeX is a much better tool if one needs to work with complicated tables (like cells that span multiple columns).

Pandoc flavored tables

Since I’m still discussing Pandoc as the markdown parser, I’ll stick with the table formatting options that it can correctly parse. The pandoc user guide has several examples for formatting tables. One can create simple, multi-line (although not multi-cell), and gridded (with borders) tables. Although it is possible to format tables manually (in an editor with a fixed-width font), it’s much easier to do it programmatically with R and one of several packages (ascii, pander). If the data were entered into a spreadsheet, they can easily be exported as csv files. Alternatively, if the data to be used in the tables were generated programmatically, perhaps as the output of an analysis in R, they can also be saved as csv files or directly formatted into markdown-flavored tables with R and knitr.

Load the data

If the data are already in a spreadsheet, simply read them into R. For the sake of this post, I’ll create a really simple table to use in a markdown document.

> foo <- data.frame(x = 1:3, y = rnorm(3))
> foo
  x          y
1 1 -1.3665947
2 2 -0.9967103
3 3 -0.6870180

Using the pandoc.table function in the pander package, this data.frame could easily be formatted with:


> library(pander)
> pandoc.table(foo)

resulting in

-----------
 x     y   
--- -------
 1  -1.3666

 2  -0.9967

 3  -0.6870
-----------

with a caption


> pandoc.table(foo, caption = "This is the table caption")

-----------
 x     y   
--- -------
 1  -1.3666

 2  -0.9967

 3  -0.6870
-----------

Table: This is the table caption

as a gridded table


> pandoc.table(foo, caption = "This is the table caption", 
style = "grid")


+-----+---------+
|  x  |    y    |
+=====+=========+
|  1  | -1.3666 |
+-----+---------+
|  2  | -0.9967 |
+-----+---------+
|  3  | -0.6870 |
+-----+---------+

Table: This is the table caption

One could also use multiline as a style option for tables containing lots of text. By default, cell text is split at 30 characters, but a different width can be set with split.cells. Wide tables are also split (at 80 characters by default), which can be changed with split.tables.

Table headers can be justified (left, right, or center) with the justify option. Not so complicated, right?

Thoughts on a preprint server

Karthik Ram — Thu, 06 Dec 2012 14:45:25 +0000

In my last post I sang praises for markdown as a way to write and collaborate on manuscripts and other scientific documents. As easy as it is to use, the one command line step is enough of a barrier for most academics. This brought back an old idea that I batted around with a few folks right after ESA. With all the tools and web technologies that currently exist, it would be possible to create a pre-print server that runs on markdown and is version controlled with git. Here’s a rough vision for how all the pieces might fit together.

Creating a new manuscript

An author would begin by logging in with their GitHub (or similar) account. When a new manuscript is created, it automatically initializes a new git repository for the paper. Collaborators can be added using their GitHub accounts which automatically gives them write access. Documents are publicly readable by default (just like any GitHub repo) although one could change it to private if need be (which turns the repo private at the GitHub end).

Choosing a reference library

Next, authors choose a library to use with this manuscript. This could be either Mendeley or Zotero (or any other alternative) since both services have APIs and mechanisms for collaboration. With Mendeley, for example, it is trivial to read an up-to-date list of documents from a shared library using existing API methods. Both services also have bookmarklets, which make it easy to add missing references without ever leaving the browser (and all these additions would show up in the desktop library at the next sync). An author could also manually upload a bib file. As authors type out citations in the editor window, the engine behind the web app autocompletes them by reading from the JSON response (when the library comes from an API call).

Adding in tables and figures

Although it would be ideal to embed R/Python/other code directly into the MS and have knitr add in the results and figures, I’ll leave that out of version 1 of this hypothetical server. Instead, authors can just use an uploader and add in tables (as csv files) and figures (as images).

As the author builds up the manuscript, snapshots (git commits) can be saved at any time with a human-friendly commit message. Even if an author doesn’t commit often, the document (and associated files) remains autosaved. At any time the manuscript can be previewed and exported into any format (with pandoc or another document conversion tool powering this part of the engine).

Authors familiar with git can skip the web app entirely and simply clone a copy, work locally, and commit back to the repo. This will appear seamlessly on the web version and remain transparent to co-authors.

Submitting the manuscript

When authors are ready to submit, it could be as simple as forking the repo over to the journal (although it’s a little too soon for something this efficient). For now, one could export the final PDF, or, if the journal has a write API, submit directly via an API call and quickly fill out author info with a form. Ideally there would also be a link to the full repository so reviewers can see everything.

Some existing pieces

There are several pieces that could be hacked together to make a first draft of this work.

markdown preview – There are several implementations of live rendering a markdown preview to html. Here’s a particularly elegant one. Pandoc (or even Asciidoc) could also run behind the scenes and quickly parse the document.
Git bindings – Abstracting git from the user (avoiding issues like merging and merge conflicts for the time being) could be done using GitHub’s existing API.
Citations – This is already possible with the current version of Mendeley/Zotero API.
Stats – With all the rapid development on Shiny, executable papers with embedded R code aren’t far off. Here’s a neat prototype of a live, in-browser markdown file with embedded R code being parsed by knitr.
Comments – The issues feature on GitHub could be repurposed to serve as a feedback mechanism. Reviewers could refer to specific blocks of text using the line highlight feature.

Note: Although I mention a few services in the post (GitHub, Mendeley), the system is not dependent on these specific providers. Git repositories can be hosted anywhere (or even in multiple locations) and almost every reference manager can export citations as bibtex. GitHub just has the advantage of being a popular service (so a large user base), and it already hosts the largest number of academic papers, software, and code used in data analyses.

How to ditch Word

Karthik Ram — Tue, 04 Dec 2012 20:48:36 +0000

I spent an hour this morning polishing up a proposal. This mostly involved running spell-checks, cleaning up tables, and making sure I added in all the right references. That’s when I realized something. I haven’t used Microsoft Word to write anything in over 6 months. How fantastic!

Like everyone else I’ve been complaining about MS Word since the last ice age but never had a better alternative. When Markdown came around I was smitten but there were several things missing. Tables were hard to format and I still had to get the final text back into Word to insert citations. I’m happy to report that I’ve found great solutions to both these issues. So here’s a quick how-to (following up from my earlier post) for switching your writing workflow away from Word. There is a small learning curve but the payoff is wonderful.

Software you’ll need

Pandoc – It’s like the swiss army knife of document conversion. Although it’s command-line only, pandoc is easy to use and quickly converts any document into whatever format you desire.
Mendeley – A free reference manager. Mendeley is great for two reasons. First, it allows you to collaborate via shared libraries (especially when writing with multiple authors). Second, Mendeley can automatically export those libraries to a bib file anytime you make changes to them. (update: You are free to use any reference manager. Just export a bibtex file to the folder containing your writing).
A markdown editor (optional) – Technically you don’t need any special software to write in markdown. Any text editor will do. However, there are several tools and helpers that make the process easier and more fun. Marked, for example, renders a live preview in one of several styles (or custom ones). If you’re on a Mac, here is a complete roundup of Markdown editors. My favorites are iA Writer, and Sublime Text with the SmartMarkdown package. Mou is great for beginners.
knitr (optional) – If you plan to insert data tables, either from a spreadsheet or if you need to incorporate summary statistics, knitr will run the code for you in R and insert the output in pandoc friendly format (with the help of the pander package). This step isn’t necessary if you don’t require tables. I’ll describe this process in more detail in my next post.

That’s it as far as set up goes.

Writing your document

The markdown syntax is super easy to learn: it takes all of 5 minutes, and the documents are easily readable even when unparsed (unlike LaTeX). Here’s a quick guide to markdown syntax. Here’s what a simple markdown document looks like:

# Title
some text. 
some more text.
## a sub-heading
More text. A [link](http://google.com/). 
A figure
![Figure 1: caption](figure.png)

This screenshot shows you unparsed and parsed markdown side by side.

Adding in citations

Now if you need to cite anything, first add documents to your Mendeley folder or group and have it automatically export to a bib file into the same folder as your document (see Mendeley desktop’s settings). To cite any document, look at the details pane for a citation key.

To cite this reference, add it in like so:

some statement [@Costello2009]. statement with multiple citations [@Costello2009; @Costello2010].
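
If you'd rather not click through the details pane, the keys can also be pulled straight out of the exported bib file. The snippet below is a minimal sketch and not part of Mendeley or pandoc: list_citation_keys is a hypothetical helper name, and it assumes every entry starts at the beginning of a line in the usual @type{key, form.

```shell
# List the citation keys defined in a BibTeX file, one per line.
# list_citation_keys is a hypothetical helper written for this sketch.
# Assumption: each entry begins a line, e.g. "@article{Costello2009,".
# Note: non-reference entries such as @comment{...} would be listed as well.
list_citation_keys() {
  grep -E '^@[A-Za-z]+\{' "$1" \
    | sed -E 's/^@[A-Za-z]+\{[[:space:]]*([^,[:space:]]+).*/\1/'
}
```

Calling list_citation_keys citations.bib prints the keys; wrap any of them as [@key] in the markdown text to cite that entry.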

Generating a pdf

Now you can use Pandoc to turn this markdown file into any format you like: Word (docx), rtf, pdf, html, LaTeX, or plain text. Just change the pdf extension of the output file to the format you need.

The simple way:

pandoc document.md -o document.pdf

With citations:

pandoc document.md -o document.pdf --bibliography citations.bib

Formatting for a journal? Grab the matching citation style from here and drop it into your folder. Then specify that style during document generation:

pandoc document.md -o document.pdf --bibliography citations.bib --csl style.csl
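
Since only the output extension changes between formats, producing several at once is a short loop. This is a rough sketch rather than anything pandoc ships: render_all, DRY_RUN, BIB and CSL are hypothetical names used only for this example, and document.md is the same placeholder as in the commands above. Two caveats: pdf output needs a LaTeX installation, and pandoc 2.11 and later only process citations when --citeproc is also passed, so add that flag on a recent install.

```shell
# Render one markdown file into several formats by swapping the output extension.
# render_all, DRY_RUN, BIB and CSL are hypothetical names used only in this sketch.
render_all() (
  src="$1"
  shift
  base="${src%.*}"   # strip the extension: document.md -> document
  for ext in "$@"; do
    out="$base.$ext"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Print the command instead of running it.
      echo "pandoc $src -o $out${BIB:+ --bibliography $BIB}${CSL:+ --csl $CSL}"
    else
      # BIB and CSL are optional; on pandoc 2.11+ also pass --citeproc when BIB is set.
      pandoc "$src" -o "$out" ${BIB:+--bibliography "$BIB"} ${CSL:+--csl "$CSL"}
    fi
  done
)
```

For example, BIB=citations.bib CSL=style.csl render_all document.md pdf docx html writes the three files next to the source. Setting DRY_RUN=1 first is a cheap way to check the commands before running them.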

You can create a Makefile for each project and run that instead of typing the pandoc call into your terminal (although the call is super easy to remember once you use it a few times). That’s really it. You can do a lot more, like adding in results, tables, figures, and equations using mathjax, but I’ll save the more advanced stuff for a future post.

Workflow

When starting any new writing project, I create a new folder with two files (my markdown document and a small script). If this folder doesn’t already sit inside a git repository, I initialize one so my writing is version controlled (to avoid this) from the very beginning. Version control makes it really easy to return the document to any earlier stage, back it up remotely on GitHub (and/or other locations), and edit asynchronously with multiple coauthors (all of which are impossible with Word). When I need the formatted version, I run the script, which:
* Copies in the most current version of the bib file from Mendeley
* Parses my markdown with pandoc using the settings I need (citations, equations, margins) and outputs a pdf (for viewing) and a Word file (for collaborators who still prefer this format).
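
The two steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual script: build, ensure_git_repo, BIB_SOURCE, DOC and PANDOC are hypothetical names, the default bib path is a placeholder for wherever Mendeley exports the file, and the margin is set through pandoc's geometry variable, which only affects LaTeX/pdf output.

```shell
# build.sh: a sketch of the two steps listed above.
# build, BIB_SOURCE, DOC and PANDOC are hypothetical names; the bib path is a placeholder.
build() (
  bib_source="${BIB_SOURCE:-$HOME/mendeley-export/library.bib}"
  doc="${DOC:-document.md}"
  pandoc_bin="${PANDOC:-pandoc}"

  # Step 1: copy in the most current bib file exported by Mendeley.
  cp "$bib_source" citations.bib || exit 1

  # Step 2: parse the markdown once per output format
  # (pdf for viewing, docx for collaborators who prefer Word).
  for ext in pdf docx; do
    "$pandoc_bin" "$doc" -o "${doc%.*}.$ext" \
      --bibliography citations.bib \
      -V geometry:margin=1in || exit 1
  done
)

# Start version control if this folder is not already inside a git repository.
ensure_git_repo() {
  git rev-parse --is-inside-work-tree >/dev/null 2>&1 || git init
}
```

Calling ensure_git_repo once when the folder is created and build whenever a formatted copy is needed covers the workflow. Keeping the paths in variables means the same script can be dropped into each new project unchanged.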

Update: Here is a real world example of how I do this. Just click the zip icon to grab a copy and test this out for yourself.

PLOS Altmetrics workshop

Karthik Ram — Thu, 08 Nov 2012 23:48:59 +0000

I was fortunate enough to be invited to the PLOS altmetrics workshop held last week in Fort Mason as part of the rOpenSci team. For those of you that haven’t heard of the term altmetrics, it refers to alternative measures of scholarly impact beyond just citations which can take a very long time before being useful and may still not be such a good indicator of real impact. A recent news piece in Science as well as the original manifesto written by Jason, Paul, and Dario is also worth reading. Pedro Beltrao also posted a summary of the meeting.

We discussed a lot of challenges and approaches to using altmetrics, gaining wider adoption, and dealing with issues such as gaming, sentiment analysis, and context. Even though I am a strong supporter of the idea, I still struggle with these issues (as an academic) so I was happy to see some of the smartest people in this field tackle these ideas.

After two whole days of breakout groups and idea development, we spent day three at the really cool PLOS HQ hacking together several of these ideas. I had a great time working with several others in the Alt Viz group (developing visualizations for article level metrics), where we brainstormed and implemented a few ideas for capturing metrics for single articles over time and for building snapshots of multiple articles. You can see some of our efforts here and here.

As far as rOpenSci’s contribution to ALMs, we briefly demo’ed our 3 altmetric packages: raltmet, rAltmetric (which incidentally became available on CRAN today), and rImpactStory. You can see slides from the demo here.

Markdown and the future of collaborative manuscript writing

Karthik Ram — Fri, 01 Jun 2012 16:25:43 +0000

When I first started using markdown a couple of years ago, I expected its popularity to be somewhat short lived and mostly in a blogging/note taking context. The greatest appeal of markdown is that the learning curve is non-existent, unparsed documents are easily readable (LaTeX, on the other hand, is not), and content can easily be parsed into a variety of formats with minimal effort. Little did I envision that it might one day revolutionize the world of collaborative academic writing.

I believe a few factors have made this possible:

In the last few years, GitHub has skyrocketed in popularity among academics as a way to collaborate on statistical analyses. GitHub is extremely markdown friendly (and has its own flavored version), allowing people to effortlessly document code.
Although document generation tools have existed in R (currently one of the most widely used statistical software tools in the academic community) for quite some time, recent efforts such as Yihui’s knitr package have made it much easier for people to weave results and figures not only into traditional document formats such as LaTeX but also into markdown. The clutter-free readability of markdown makes it easy to write and edit text alongside results, and it has a lower barrier to entry than LaTeX. Combine this with the free, cross-platform document generator Pandoc, and one can easily embed citations and journal styles to programmatically generate a final document in any desired format.
GitHub’s powerful issue tracker provides a quick and easy way to solicit feedback from collaborators, track milestones, and more importantly leverage Git’s version control capabilities (no more Word document clutter).

Although only a handful of people are currently writing manuscripts in markdown, I’m really excited at the prospect of making this my primary workflow for all future (especially collaborative) manuscripts. All the results and figures can be generated by knitr, citations embedded using Pandoc, and the final document converted on the fly into one of many formats (LaTeX, Word, rtf, markdown) while the entire workflow (code, analyses, manuscript) remains synced with all collaborators via GitHub.

I’m planning to write a series of detailed posts describing my workflow that involves Github + Knitr + Pandoc. Stay tuned.

Imposter week

Karthik Ram — Mon, 30 Apr 2012 15:04:57 +0000

I’ll freely admit that even as a postdoc I suffer from quite a bit of impostor syndrome, more so than when I was a grad student. Although this feeling is widespread among academics, it is not impossible to beat. Looks like everyone has decided to speak out about it this week on the academic blogosphere. It started out last week with a great post by fellow blogger and tweep Jacqueline Gill on how she overcame her impostor syndrome. There is also this really comprehensive post (with a bucket load of links) at Neurotic Physiology.
If you’re a postdoc reading this, this post (Some days, I just want to crawl under my desk and cry) best describes how I feel some days.

PS: It gets better.