Getting Genetics Done

Create a free Llama 3.1 405B-powered chatbot on any GitHub repo in 1 minute (cross-posted from Paired Ends)

2024-09-11T03:20:00.002-05:00

This blog has moved. This is reposted from Paired Ends:

https://blog.stephenturner.us/p/create-a-free-llama-405b-llm-chatbot-github-repo-huggingface

Llama 3.1 405B is the first open-source LLM on par with frontier models GPT-4o and Claude 3.5 Sonnet. I’ve been running the 70B model locally for a while now using Ollama + Open WebUI, but you’re not going to run the 405B model on your MacBook.

Here I demonstrate how to create and deploy a Llama 3.1 405B-powered chatbot on any GitHub repo in <1 minute on HuggingFace Assistants, using an R package as an example.

Create and deploy a HuggingFace Assistant

I’m going to use the tfboot R package as an example here (paper, GitHub). I wrote the tfboot package to provide methods for bootstrapping transcription factor binding site disruption to statistically quantify the impact across gene sets of interest compared to an empirical null distribution. The package is meant to integrate with Bioconductor data structures and workflows on the front end, and Tidyverse-friendly tools on the back end. You can read more about the package in the paper.

The 42-second video below demonstrates how to create & deploy your chatbot.

First, go to HuggingFace Assistants (https://huggingface.co/chat/assistants) and click Create new assistant.
Fill in some details. Give your chatbot a name and description, and a system prompt (“You are a chatbot that answers questions about the tfboot codebase”).
Select meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 as your model.
Fill in some example prompts, like “what does this package do?” or “how do I do X, Y, or Z with this tool?”
Now, the important part. Under internet access, select “Specific Links” and provide the URL to the GitHub Repo.
Hit create, then activate. You’re done.

Demo with the tfboot R package

Once you create and activate your model, you’ll see an interface that will look familiar if you’ve ever used ChatGPT or similar LLMs.

Landing page for the new chatbot I made for the tfboot GitHub repo.

From here you can click one of the example prompts, or type your own prompt. Let’s give it a try. First, a softball pitch. What does this package do? This should be fairly obvious from the README.

Prompt: What’s this package do?

Next, let’s get a little basic info on usage. The chatbot looks through the package’s RMarkdown vignettes and pulls out a high-level protocol on how to run the analysis. There wasn’t much context on what motifbreakR is or what you have to do upstream of running tfboot, but further prompting can help with this.

Prompt: How do I assess the statistical significance of transcription factor binding sites in gene sets of interest?

Finally, let’s see what it can tell us about the statistical underpinnings of what the package is doing. Of note, this isn’t simply regurgitation of what’s in the package documentation or vignettes. It’s combining the code and documentation with general information about bootstrapping, null hypothesis significance testing, and transcription factor binding site disruption analysis.

Prompt: Can you explain the theory that underpins what this package does?

Keep in mind that any assistant you create will be public. You can play around with the tfboot chatbot here. Also, know that the 405B model is extremely resource intensive. A few times the bot would time out and I’d have to retry the prompt. This happens far less often with the 70B model, and the response times are faster. You might experiment and see for yourself where the speed/accuracy sweet spot is for your specific needs.

PLANES: Plausibility Analysis of Epidemiological Signals

2024-09-03T08:24:00.003-05:00

This blog has moved. This is reposted from Paired Ends:

https://blog.stephenturner.us/p/planes-plausibility-analysis-of-epidemiological-signals-rplanes-r-package

PLANES provides a set of methods for evaluating the plausibility of epidemiological signals and forecasts. The PLANES methods are available in the rplanes R package and Shiny app.

Motivation

Since early 2023, we’ve been developing methods to conduct what we are calling plausibility analysis of epidemiological signals (PLANES). The motivation for PLANES stems from our work on infectious disease forecasting projects, including COVID-19, influenza-like illness (ILI), and influenza hospitalizations. Near-term forecasts of disease patterns can help guide public health decision-making. We’ve been fortunate enough to participate in several consortia efforts that openly solicit contributions from modelers using different methods. Those methods are often tailored to use training data that comes from epidemiological reporting systems. The openness of these forecasting “hubs” is invaluable, and can lead to ensemble forecasts that as a whole are greater than the sum of their parts.

But after all we’re trying to predict the future, so there are plenty of challenges! To name a few:

The data for training and “gold-standard” truth assessments are maintained in complex reporting systems that are often distributed across jurisdictions.
With the number of forecasters, forecast dates, targets (diseases), and locations, there are LOTS of incoming forecasts. And many of the hubs are designed on a weekly submission cadence, so that scale can grow even more dramatically as the season progresses.
Hubs require that submissions include some representation of uncertainty such that they can communicate prediction intervals around each forecast. Depending on the modeling methods used, that estimate of uncertainty may be very different from forecast to forecast.

These particular challenges underscore a need for review of both the incoming surveillance data and forecasts that are generated. But that review requires human intuition — a “gut feeling” — and it can be hard to scale.

Enter PLANES. We developed this approach to reduce the burden of human review for epidemiological signals. The aim here is to mirror human intuition, but not replace it. In other words, by flagging implausibility we are not claiming impossibility. Instead, the approach provides a screening mechanism so that humans can take a closer look at the signal data. PLANES is agnostic, such that it can be used across either forecasted or observed signals, for different diseases, and for varying temporal and geographic resolutions. The PLANES methods are described in detail in a manuscript we published in August 2024:

Nagraj VP, Benefield AE, Williams D, & Turner SD. (2024). PLANES: Plausibility Analysis of Epidemiological Signals. medRxiv, 10.1101/2024.08.22.24312449.

The concept behind the algorithm is demonstrated in Supplemental Figure A1 from the manuscript (below), which illustrates notional components and impacts.

Figure A1 from Nagraj et al 2024: Conceptual motivation for developing the PLANES approach. Examples of plausibility components and their impacts are described. Each component is illustrated to show relationship between observed data and evaluated signal.

Once we established the PLANES methods, we moved on to the rplanes R package to implement this approach.

PLANES components

A human can look at a time series of epidemiological data (e.g., incident influenza hospitalizations) and a probabilistic near-term forecast, and get a good sense of whether the forecast “looks weird” or not. PLANES attempts to systematize and formalize this assessment such that it can be automated and scaled. We created multiple components, each of which is a binary (yes/no) assessment of plausibility that maps to some feature in the data. All evaluated components are then combined into an ordinal score. By default, each component is equally weighted in the overall PLANES score. When delivered in the rplanes R package, the user can optionally weight components higher or lower in the scoring scheme.
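To make the scoring idea concrete, here is a minimal sketch in base R of how binary component flags could be combined into a single weighted score. This is not the rplanes implementation: the `score_flags()` helper, the component names, and the ratio-style score are assumptions made purely for illustration.

```r
# Hypothetical helper (not part of rplanes): combine binary flags into one score.
# `flags` is a named logical vector (TRUE = flag raised).
# `weights` is a named numeric vector; by default every component has weight 1.
score_flags <- function(flags, weights = NULL) {
  if (is.null(weights)) {
    weights <- setNames(rep(1, length(flags)), names(flags))
  }
  # Align the weights to the order of the flags
  weights <- weights[names(flags)]
  stopifnot(!anyNA(weights))
  n_flagged <- sum(weights[flags])   # weighted count of raised flags
  n_possible <- sum(weights)         # weighted count of evaluated components
  list(n_flagged = n_flagged,
       n_possible = n_possible,
       score = n_flagged / n_possible)
}

# Two of four components raise a flag
flags <- c(diff = TRUE, cover = FALSE, taper = FALSE, repeats = TRUE)

# Equal weighting
equal <- score_flags(flags)

# Weight the difference component twice as heavily as the others
weighted <- score_flags(flags,
                        weights = c(diff = 2, cover = 1, taper = 1, repeats = 1))
```

With equal weights the two raised flags count the same. Up-weighting one of them pushes the score higher for the same set of flags.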

1. Difference

The difference component checks the magnitude of point-to-point differences for all time steps of the evaluated data. This component can be used on either forecasts or observed signals. If an evaluated signal departs from the prior observation more dramatically than has been seen previously in the time series, then it is flagged as implausible. The function internally computes the maximum observed difference (based on absolute value) to set a threshold, which if exceeded will trigger a flag to be raised by the algorithm. While large and unexpected point-to-point changes may naturally occur in epidemiological signals, this component provides a means to draw attention to the most extreme cases.

Depiction of a flag raised with the difference component. The point-to-point difference between the forecasted values and the most recent observed value exceeds the maximum difference seen in the observed data.
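The difference check can be sketched in a few lines of base R. This is a simplified illustration, not the rplanes code: `flag_difference()` is a hypothetical helper, and it assumes a single location with plain numeric vectors for the observed and evaluated values.

```r
# Hypothetical helper (not part of rplanes): flag a signal whose step-to-step
# change exceeds the largest absolute change seen in the observed series.
flag_difference <- function(observed, evaluated) {
  # Threshold: maximum absolute point-to-point difference in the observed data
  max_diff <- max(abs(diff(observed)))
  # Prepend the last observed value so the first evaluated point is compared
  # against the most recent observation
  eval_diffs <- abs(diff(c(tail(observed, 1), evaluated)))
  list(threshold = max_diff, flagged = any(eval_diffs > max_diff))
}

observed <- c(10, 12, 15, 13, 18)   # largest observed step is 5 (13 -> 18)

small_steps <- flag_difference(observed, c(19, 21, 22))  # steps of 1, 2, 1
big_jump    <- flag_difference(observed, c(30, 31, 33))  # first step is 12
```

Here `small_steps$flagged` is `FALSE` and `big_jump$flagged` is `TRUE`, because only the second signal departs from the last observation by more than anything seen before.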

2. Coverage

The coverage component compares the prediction interval for the first horizon of the evaluated signal to the most recent value in the seed. If the interval does not cover the most recent data point, then the signal is flagged as implausible. The width of the interval used for this evaluation can be customized. The narrower the prediction interval, the more sensitive this component will be.

Depiction of a flag raised with the coverage component. The coverage component compares the prediction interval for the first horizon of the evaluated signal to the most recent observed value.
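A minimal sketch of the coverage check, assuming you already have the lower and upper bounds of the prediction interval at the first horizon. `flag_coverage()` is a hypothetical helper for illustration, not an rplanes function.

```r
# Hypothetical helper (not part of rplanes): flag a forecast whose first-horizon
# prediction interval does not contain the most recent observed value.
flag_coverage <- function(last_observed, lower, upper) {
  !(last_observed >= lower && last_observed <= upper)
}

last_observed <- 100

covered     <- flag_coverage(last_observed, lower = 90,  upper = 120)
not_covered <- flag_coverage(last_observed, lower = 110, upper = 140)

# A narrower interval around the same point forecast is more likely to miss
narrow      <- flag_coverage(last_observed, lower = 102, upper = 108)
```

Only the first interval contains the last observed value of 100, so only `covered` is `FALSE`.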

3. Taper

The taper component checks whether the prediction interval for the evaluated signal decreases in width (i.e., certainty increases) as horizons progress. The width of the prediction interval at each horizon is compared with the width at the previous horizon, and a flag is raised if any interval is narrower than the one before it. One would expect more variability in signals forecasted further out in time, and therefore wider prediction intervals at later horizons.

Depiction of a flag raised with the taper component. The taper component checks whether the prediction interval for the evaluated signal decreases in width (i.e., certainty increases) as horizons progress.
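The taper check boils down to comparing interval widths across horizons. Here is a small base R sketch under the assumption that the interval bounds are supplied as numeric vectors ordered by horizon. `flag_taper()` is a hypothetical name, not the rplanes function.

```r
# Hypothetical helper (not part of rplanes): flag a forecast whose prediction
# interval gets narrower from one horizon to the next.
flag_taper <- function(lower, upper) {
  widths <- upper - lower
  # diff() < 0 means the interval at a horizon is narrower than the previous one
  any(diff(widths) < 0)
}

# Widths 20, 35, 50: uncertainty grows with the horizon, as expected
widening  <- flag_taper(lower = c(90, 85, 80), upper = c(110, 120, 130))

# Widths 20, 13, 5: the interval tapers, which looks implausible
narrowing <- flag_taper(lower = c(90, 95, 98), upper = c(110, 108, 103))
```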

4. Repeat

The repeat component checks whether consecutive values in an observed or forecasted signal are repeated more than the tolerated number of times (k). The seed stores the maximum number of consecutive repeats for each location, and this is used as the default value for k. If the number of consecutive repeats in the evaluated data exceeds k, then the signal is considered implausible and a flag is raised.

Depiction of a flag raised with the repeat component. The repeat component checks whether consecutive values in an observed or forecasted signal are repeated.
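Run-length encoding makes the repeat check easy to sketch in base R. The helpers below are hypothetical and simplified: the choice to prepend the last k observed values (so that a run spanning the observed/evaluated boundary is counted) is an assumption for this example.

```r
# Longest run of identical consecutive values in a vector
max_run <- function(x) max(rle(x)$lengths)

# Hypothetical helper (not part of rplanes): flag a signal that repeats a value
# more than k times in a row.
flag_repeat <- function(observed, evaluated, k = NULL) {
  # Default tolerance: the longest run seen in the observed data
  if (is.null(k)) k <- max_run(observed)
  # Assumption: include trailing observed values so runs that straddle the
  # boundary between observed and evaluated data are detected
  combined <- c(tail(observed, k), evaluated)
  max_run(combined) > k
}

observed <- c(5, 7, 7, 9, 11)   # longest observed run is 2 (the two 7s)

three_in_a_row <- flag_repeat(observed, c(12, 12, 12))
two_in_a_row   <- flag_repeat(observed, c(12, 13, 13))
across_boundary <- flag_repeat(observed, c(11, 11, 14))  # 11, 11, 11 overall
```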

5. Trend

The trend component assesses if there is a significant change in the magnitude or direction of the slope for the evaluated signal compared to the most recent data in the seed. Each “change point” in the signal is identified using a hierarchical divisive algorithm originally implemented in the ecp R package.

Depiction of a flag raised with the trend component. The trend component assesses whether there is a significant change in the magnitude or direction of the slope for the evaluated signal compared to the most recent data in the seed.

6. Shape

While the trend component scans the time series for an inflection point, the shape component assesses the time series for unusual shapes across multiple points. To arrive at the shape assessment, the algorithm first divides the observed seed data into sliding windows to form trajectories. The trajectories are summarized as a set of shapes against which the forecasted trajectory is compared. If the shape of the forecasted trajectory does not match any shapes in the seed data, then the forecast is considered implausible per this component. The core intuition underlying this component is that the shape of future data is more likely to reflect patterns that have previously been observed and less likely to be a novel trajectory. Therefore, it may be useful to flag any novel shapes for review. The PLANES paper describes the two methods we used (dynamic time warping, and differences of consecutive observations) to summarize the shapes of signal trajectories.

Depiction of a flag raised with the shape component. The shape component evaluates the shape of the trajectory of the forecast signal and compares that shape to existing shapes in the observed seed data. If the shape is identified as novel, a flag is raised, and the signal is considered implausible.
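To give a feel for the sliding-window idea, here is a deliberately simplified sketch in base R. It summarizes each trajectory only by the signs of its consecutive differences, which is a much cruder summary than the methods described in the paper. `shape_of()` and `flag_shape()` are hypothetical helpers invented for this example.

```r
# Summarize a trajectory as the signs of its consecutive differences,
# e.g. c(1, 3, 2) becomes "+-"
shape_of <- function(x) {
  paste(c("-", "0", "+")[sign(diff(x)) + 2], collapse = "")
}

# Hypothetical helper (not part of rplanes): flag a forecast whose shape was
# never seen in any same-length window of the observed data.
flag_shape <- function(observed, forecast) {
  w <- length(forecast)
  stopifnot(length(observed) >= w)
  # Slide a window of width w across the observed series
  starts <- seq_len(length(observed) - w + 1)
  seen <- unique(vapply(starts,
                        function(i) shape_of(observed[i:(i + w - 1)]),
                        character(1)))
  !(shape_of(forecast) %in% seen)
}

observed <- c(10, 12, 15, 14, 13, 16, 18)

familiar <- flag_shape(observed, c(18, 20, 23))  # "++" appears in the seed
novel    <- flag_shape(observed, c(18, 18, 18))  # "00" never appears
```

A flat forecast is flagged here only because the observed series never stays flat. With a longer seed the set of known shapes grows, and fewer forecasts would be considered novel.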

7. Zero

The zero component was designed to check if there are any “sudden” zeros in the evaluated signal. Whether it is a broken surveillance instrument or miscalibrated forecast, we expect it would be unlikely to observe a zero if it has never been reported in the signal data.

Depiction of a flag raised with the zero component. This component checks for the presence of any unexpected zeros in the evaluated signal.
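The zero check is the simplest of the lot. A minimal sketch, assuming plain numeric vectors for one location (`flag_zero()` is a hypothetical helper, not an rplanes function):

```r
# Hypothetical helper (not part of rplanes): flag zeros in the evaluated signal
# only when the observed data has never contained a zero.
flag_zero <- function(observed, evaluated) {
  !any(observed == 0) && any(evaluated == 0)
}

sudden_zero   <- flag_zero(observed = c(3, 5, 4), evaluated = c(4, 0, 2))
zero_seen     <- flag_zero(observed = c(3, 0, 4), evaluated = c(4, 0, 2))
no_zero       <- flag_zero(observed = c(3, 5, 4), evaluated = c(4, 1, 2))
```

Only `sudden_zero` is `TRUE`: a zero in the evaluated signal is not surprising if zeros have already been reported in the observed data.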

Now that we’ve covered all of the individual PLANES components, let’s take a look at the rplanes R package, which implements functions and a Shiny app to evaluate epidemiological signals and forecasts for these components.

R package: rplanes

rplanes is released under an open-source license, with code, documentation and issue tracking on GitHub: https://github.com/signaturescience/rplanes.

The package is also released on CRAN: https://cran.r-project.org/package=rplanes.

The package website includes function documentation, reproducible examples, and narrative vignettes describing how to get started as well as advanced usage of the tool: https://signaturescience.github.io/rplanes/.

We tried to develop rplanes to be as intuitive as possible. Before assessing the individual PLANES components that we defined, we needed to ensure that user data was formatted consistently. We also needed a structure to store location-specific baseline characteristics against which we could compare the evaluated signal. With this in mind, we created an object-oriented structure using S3 classes in R. The figure below walks through the workflow for preparing data and analyzing data in rplanes.

Workflow for the rplanes API. The diagram depicts the process for preparing and analyzing data with the available functions in the package. Users begin by preparing data to evaluate as well as data to seed background characteristics. These datasets begin as data frames and are coerced to ‘signal’ objects using rplanes helper functions. If the signal to be evaluated is observed data, then the same object used for the seed can be used in downstream analysis. If the signal is a forecast, then the user must prepare both an observed and forecast signal object for eventual scoring. The package also includes a function to build the seed using an observed signal. Given a seed and a signal to evaluate, the user can assess all components independently across all locations in the signal using a built-in wrapper function.

The rplanes package has several vignettes illustrating basic usage, detailed descriptions of the PLANES components, interpreting plausibility scores, and the rplanes explorer Shiny app.

rplanes explorer

In addition to a programmatic interface, we included a point-and-click version of the tool. This is developed as a Shiny app that is built into rplanes. We’ve translated all of the package functionality to widgets in the app. To launch the app simply run rplanes_explorer().

The rplanes explorer Shiny app. Launch using rplanes_explorer(). For details, see the rplanes explorer vignette.

biorecap: an R package for summarizing bioRxiv preprints with a local LLM

2024-08-24T08:40:00.000-05:00

This is re-posted from my newsletter, where I'll be posting from now on:

https://blog.stephenturner.us/p/biorecap-r-package-for-summarizing-biorxiv-preprints-local-llm

---

TL;DR

I wrote an R package that summarizes recent bioRxiv preprints using a locally running LLM via Ollama+ollamar, and produces a summary HTML report from a parameterized RMarkdown template. The package is on GitHub and you can install it with devtools: https://github.com/stephenturner/biorecap.
I published a paper about the package on arXiv: Turner, S. D. (2024). biorecap: an R package for summarizing bioRxiv preprints with a local LLM. arXiv, 2408.11707. https://doi.org/10.48550/arXiv.2408.11707.
I wrote both the package and the paper with assistance from LLMs: GitHub Copilot for documentation, llama3.1:70b for tests, llama3.1:405b via HuggingFace Assistants for drafting, GPT-4o for editing.

The biorecap package

I recently started to explore prompting a local LLM (e.g. Llama3.1) from R via Ollama. Last week I wrote about how to do this, with a few contrived examples: (1) trying to figure out what’s interesting about a set of genes, and (2) summarizing a set of preprints retrieved from bioRxiv’s RSS feed. The gene set analysis really was contrived — as I mentioned in the footnote to that post, you’d never actually want to do a gene set analysis this way when there are plenty of well-understood first-principles approaches to gene set analysis.

The second example wasn’t so contrived. I subscribe to bioRxiv’s RSS feeds, along with many other journals and blogs in genetics, bioinformatics, data science, synthetic biology, and others. The fusillade of new preprints and peer-reviewed papers relevant to my work is relentless. Late last year bioRxiv started adding AI summaries to newly published preprints, but this required multiple clicks to get out of my RSS feed onto the paper’s landing page, another to click into the AI summary. I wanted a way to give me a quick TL;DR on all the recent papers published in particular subject areas (e.g., bioinformatics, genomics, or synthetic biology).

Shortly after putting together that one-off demo, on a sultry Sunday afternoon in Virginia too hot to do anything outside, I took the code from that post, generalized it a bit, and created the biorecap package: https://github.com/stephenturner/biorecap.

You can install it with devtools/remotes:

remotes::install_github("stephenturner/biorecap", 
                        dependencies=TRUE)

Create a report with 2-sentence summaries of recent papers published in the bioinformatics, genomics, and synthetic biology sections, using llama3.1:8b running locally on your laptop:

my_subjects <- c("bioinformatics", "genomics", "synthetic_biology")

biorecap_report(output_dir=".", 
                subject=my_subjects, 
                model="llama3.1")

This report will look different each time you run it, depending on what’s been published that day in the sections you’re interested in.

Figure 1 from Turner 2024 arXiv: Example biorecap report for bioinformatics, genomics, and synthetic biology from August 6, 2024.

The package documentation provides further instructions on usage.

I mentioned I used LLMs to help write this package. I started out using Positron for package development, but quickly fell back to RStudio because I wanted GitHub Copilot integration. Among other things, Copilot is great for quickly writing Roxygen documentation for relatively simple functions. I also used llama3.1:70b running locally via Open WebUI to help me write some of the unit tests using testthat. Starting from the code I worked out in the previous post, it took about 2 hours to get the package working, and another hour or so to write documentation and tests and set up GitHub Actions for pkgdown deployment and R CMD checks on PRs to main.

The biorecap paper

I published a short paper about the package on arXiv: Turner, S. D. (2024). biorecap: an R package for summarizing bioRxiv preprints with a local LLM. arXiv, 2408.11707. https://doi.org/10.48550/arXiv.2408.11707.

The biorecap preprint published on arXiv

I used two LLMs to assist me in writing the package, which itself uses an LLM to summarize research published on bioRxiv. It only makes sense to close the circle and use an LLM to help me write a preprint to publish on arXiv describing the software.

I’ll write a post soon about how to set up a llama3.1:405b-powered chatbot on HuggingFace connected to a GitHub repository to be able to ask questions about the codebase in that repo. It’s free and takes about 60 seconds or less. I started out doing this, asking for help crafting a narrative outline and introduction section for a paper based on the code in the repo. I ended up scrapping this and writing most of the first draft text myself, then using GPT-4o to help with editing, tone, and length reduction. It’s hard to exactly put my finger on it, but what I ended up with still had that feeling of “sounds like it was written by ChatGPT.” I did some of the final editing on my own, and used Mike Mahoney’s arXiv template Quarto plugin to write and typeset the final document.

Use R to prompt a local LLM with ollamar

2024-08-14T05:13:00.000-05:00

This is reposted from the original article:

https://blog.stephenturner.us/p/use-r-to-prompt-a-local-llm-with

Use R to prompt a local LLM with ollamar: Using R to prompt llama3.1:70b running on my laptop with Ollama + ollamar to tell me what's interesting about a set of genes, and to summarize recent bioRxiv preprints

Subscribe at https://blog.stephenturner.us/ to get future posts like this delivered to your e-mail.

--------------------

I’ve been using the llama3.1:70b model just released by Meta using Ollama running on my MacBook Pro. Ollama makes it easy to talk to a locally running LLM in the terminal (ollama run llama3.1:70b) or via a familiar GUI with the open-webui Docker container.

Here I’ll demonstrate using the ollamar package on CRAN to talk to an LLM running locally on my Mac. I’ll ask llama3.1:70b what it thinks about a set of genes, then use the smaller and faster 8B model to summarize recent preprints published on bioRxiv’s Scientific Communication and Education channel.

Tell me what’s interesting about these genes

Genes involved in the G2/M checkpoint

First I’ll use the msigdbr R package to pull the gene symbols for all the genes involved in the G2/M checkpoint in the cell cycle progression from MSigDb.

library(msigdbr)
hm <- msigdbr(species="human", category="H")
gs <- unique(hm[hm$gs_name=="HALLMARK_G2M_CHECKPOINT",]$gene_symbol)

Next, I’ll load the ollamar package, and test the connection to the Ollama server.

library(ollamar)
test_connection()

If all goes well you should see something like the following:

Ollama local server running
<httr2_response>
GET http://localhost:11434/
Status: 200 OK
Content-Type: text/plain
Body: In memory (17 bytes)

The ollamar README has good documentation on how to pull models, list which models you have available, etc.

Next, I’ll collapse the gene symbol vector I created above to a string (“CDK1, SMC2, POLA2, …”), and create a prompt tacking on this list of gene symbols to the end. Finally, I’ll use the generate() function with this prompt I just created to prompt the llama3.1:70b model, and print the results.

# Collapse the symbol vector to a string ("GENE1, GENE2, GENE3")
gs_string <- paste(gs, collapse=", ")

# Create a prompt
prompt <- paste("I'm going to give you some gene symbols.
                Anything interesting about this gene set? ",
                gs_string)

# Prompt the model (output="df" returns a data frame with a response column)
res <- generate("llama3.1:70b", prompt=prompt, output="df")

# Print the results
cat(res$response)

The model returns Markdown text, which looks nice after rendering with RMarkdown/Quarto:

What a massive list of gene symbols!
After analyzing the list, I noticed several interesting aspects:
Cell cycle and DNA replication: Many genes are involved in cell cycle regulation (e.g., CDK1, CDK4, CDC20, CDC25A, CDC45), DNA replication (e.g., ORC5, ORC6, MCM2, MCM3, MCM5), and chromosome segregation (e.g., BUB1, BUB3, CENPA, CENPE).
Apoptosis and cell death: Several genes are associated with apoptosis (programmed cell death) or regulation of cell death pathways, including CASP8AP2, BIRC5, and PML.
Cancer-related genes: A significant number of genes have been implicated in cancer development or progression, such as MYC, CCND1, CCNE1, CDC25A, and PLK1. Some genes are tumor suppressors (e.g., BRCA2, TP53), while others may promote tumorigenesis.
Chromatin modification and epigenetics: Genes like EZH2, CUL4A, and KMT5A are involved in chromatin modification, histone regulation, or other epigenetic processes that influence gene expression.
Transcriptional regulation: Many genes encode transcription factors (e.g., E2F1, MYC, HIF1A) or co-factors that regulate gene expression, including some with roles in cell cycle progression and apoptosis.
Nuclear structure and function: Genes like LBR, LIG3, and NUP50 are involved in nuclear envelope organization, pore complex formation, or nucleocytoplasmic transport.
These observations suggest that this gene set is enriched for genes involved in fundamental cellular processes such as cell cycle regulation, DNA replication, apoptosis, chromatin modification, and transcriptional control. This might indicate a specific biological context or experimental condition under which these genes are co-expressed.
Would you like to know more about any of these aspects or explore further?

Random genes

For kicks I did this with a random selection of 50 genes from the human genome:

allgenesets <- msigdbr(species="Homo sapiens")
allgenes <- unique(allgenesets$gene_symbol)
set.seed(42)
gs_random <- sample(allgenes, 50)

Llama3.1 is doing a good job HARKing about what these genes could be doing:

After analyzing the list, I noticed a few interesting things:
High proportion of pseudogenes: About 20-25% of the genes on your list are labeled as "pseudogenes" (e.g., RN7SL677P, RPL27AP3, HYAL6P). Pseudogenes are inactive copies of functional genes that have lost their ability to code for proteins. They can still be transcribed and play roles in regulating gene expression.
Ribosomal protein pseudogenes: Many of the pseudogenes on your list (e.g., RPL27AP3, RN7SL677P) appear to be related to ribosomal proteins, which are essential for protein synthesis.
MicroRNAs and small nuclear RNAs: You have several microRNA genes (MIRs) and small nuclear RNA (snRNA) genes on your list (e.g., MIR4444-2, RNU7-160P). These types of non-coding RNAs play crucial roles in regulating gene expression.
Tyrosine phosphatase and kinases: The presence of PTPN7 (a tyrosine phosphatase) and TYK2 (a Janus kinase) on your list suggests a possible connection to signaling pathways involved in cell growth, differentiation, or immune responses.
Vascular endothelial growth factor A (VEGFA): This gene is crucial for angiogenesis (the formation of new blood vessels). Its presence on your list might indicate a role in vascular development or disease.
These observations are just a starting point, and further analysis would be needed to determine the significance of these genes in a specific biological context. Do you have any additional information about how this gene set was obtained or what type of study it's related to?

Summarize the latest papers on bioRxiv

Above I created a single prompt with a single call to generate(). Here I’ll create many prompts based on data in columns of a data frame, and purrr::map() the generate() function over all of those prompts to create a new column of responses in my data frame.

I’ll demonstrate this by asking for a two-sentence summary of the most recent bioRxiv preprints. I’ll start by pulling the latest titles and abstracts from the Scientific Communication and Education channel on bioRxiv. I use the tidyRSS package to pull this information directly from the bioRxiv RSS feed. This returns a data frame with title, abstract, and other information about the most recent preprints in the feed.

library(tidyRSS)
library(tidyverse)

# Parse the feed
feed <- tidyfeed("https://connect.biorxiv.org/biorxiv_xml.php?subject=scientific_communication_and_education")

# Show a few titles
head(feed$item_title, 3)

# Show a few abstracts
head(feed$item_description, 3)

Here are the latest three titles (as of August 3, 2024). The abstracts that go with them (not shown) are in the feed$item_description column.

[1] "\nBiological changes, political ideology, and scientific communication shape human perceptions of pollen seasons \n" 
[2] "\nAn updated and expanded characterization of the biological sciences academic job market \n"                        
[3] "\n\"I'd like to think I'd be able to spot one if I saw one\": How science journalists navigate predatory journals \n"

Next, I’ll take the first 20 rows of the feed, pull out the title and abstract with the select statement, remove leading and trailing whitespace with the first mutate, construct a prompt with the second mutate, and generate a response from llama3.1 in the last mutate using purrr::map_chr(). Here I’m using the smaller/faster llama3.1:8b model (the default unless you specify :70b).

summarized <-
  feed |>
  head(20) |>
  select(title=item_title, abstract=item_description) |>
  mutate(across(everything(), trimws)) |>
  mutate(prompt=paste(
    "\n\nI'm going to give you a paper's title and abstract.",
    "Can you summarize this paper in 2 sentences?",
    "\n\nTitle: ", title, "\n\nAbstract: ", abstract)) |>
  mutate(response=purrr::map_chr(prompt, \(x) 
                                 generate("llama3.1",
                                          prompt=x,
                                          output="df")$response))

This is an example prompt, which will get run for every title and abstract in the feed:

I'm going to give you a paper's title and abstract. Can you summarize this paper in 2 sentences? 

Title:  An updated and expanded characterization of the biological sciences academic job market 

Abstract:  In the biological sciences, many areas of uncertainty exist regarding the factors that contribute to success within the faculty job market. Earlier work from our group reported that beyond certain thresholds, academic and career metrics like the number of publications, fellowships or career transition awards, and years of experience did not separate applicants who received job offers from those who did not. Questions still exist regarding how academic and professional achievements influence job offers and if candidate demographics differentially influence outcomes. To continue addressing these gaps, we initiated surveys collecting data from faculty applicants in the biological sciences field for three hiring cycles in North America (Fall 2019 to the end of May 2022), a total of 449 respondents were included in our analysis. These responses highlight the interplay between various scholarly metrics, extensive demographic information, and hiring outcomes, and for the first time, allowed us to look at persons historically excluded due to ethnicity or race (PEER) status in the context of the faculty job market. Between 2019 and 2022, we found that the number of applications submitted, position seniority, and identifying as a women or transgender were positively correlated with a faculty job offer. Applicant age, residence, first generation status, and number of postdocs, however, were negatively correlated with receiving a faculty job offer. Our data are consistent with other surveys that also highlight the influence of achievements and other factors in hiring processes. Providing baseline comparative data for job seekers can support their informed decision-making in the market and is a first step towards demystifying the faculty job market.

Now the response column in my new table has all the responses from the LLM. I can now put the title and summary in a table and render it with pandoc.

summarized |> 
  select(title, response) |> 
  mutate(response=gsub("\n", " ", response)) |> 
  knitr::kable()

Moving to blog.stephenturner.us (Paired Ends)

2024-07-30T09:32:00.001-05:00

My new blog/newsletter ("Paired Ends") is now at blog.stephenturner.us. I'll be posting semi-regular updates and literature highlights in bioinformatics, computational biology, and data science, along with the occasional post on programming. Head over to blog.stephenturner.us to subscribe by email, or add the RSS feed to your favorite reader app.

Staying Current in Bioinformatics & Genomics: 2017 Edition

2017-02-01T13:35:00.001-06:00

A while back I wrote this post about how I stay current in bioinformatics & genomics. That was nearly five years ago. A lot has changed since then. A few links are dead. Some of the blogs or Twitter accounts I mentioned have shifted focus or haven’t been updated in years (guilty as charged). The way we consume media has evolved — Google thought they could kill off RSS (long live RSS!), there are many new literature alert services, preprints have really taken off in this field, and many more scientists are engaging via social media than before.

People still frequently ask me how I stay current and keep a finger on the pulse of the field. I’m not claiming to be able to do this well — that’s a near-impossible task for anyone. Five years later and I still run our bioinformatics core, and I’m still mostly focused on applied methodology and study design rather than any particular phenotype, model system, disease, or specific method. It helps me to know that transcript-level estimates improve gene-level inferences from RNA-seq data, and that there’s software to help me do this, but the details underlying kmer shredding vs pseudoalignment to a transcriptome de Bruijn graph aren’t as important to me as knowing that there’s a software implementation that’s well documented, actively supported, and performs well in fair benchmarks. As such, most of what I pay attention to is applied/methods-focused.

What follows is a scattershot, noncomprehensive guide to the people, blogs, news outlets, journals, and aggregators that I lean on in an attempt to stay on top of things. I’ve inevitably omitted some key resources, so please don’t be offended if you don’t see your name/blog/Twitter/etc. listed here (drop a link in the comments!). Whatever I write here now will be out of date in no time, so I’ll try to write an update post every year instead of every five.

Twitter

In the 2012 post I ended with Twitter, but I have to lead with it this time. Twitter is probably my most valuable resource for learning about the bleeding-edge developments in genomics & bioinformatics. It’s great for learning what’s new and contributing to the dialogue in your field, but only when used effectively.

I aggressively prune the list of people I follow to keep what I see relevant and engaging. I can tolerate an occasional digression into politics, posting pictures of you drinking with colleagues at a conference, or self-congratulatory announcements. But once these off-topic Tweets become the norm, I unfollow. I also rely on the built-in list feature. I follow a few hundred people, but I only add a select few dozen to a “notjunk” list that I look at when I’m short on time. Folks in this list don’t Tweet too often and have a high signal-to-noise ratio (as far as what I’m interested in reading). If I don’t get a chance to catch up on my entire timeline, I can at least breeze through recent Tweets from folks on this list.

I’m also wary of following extremely prolific users. For example — if someone’s been on Twitter less than a year, already has 20,000 Tweets, but only 100 followers, it tells me they’ve got a lot to say but nobody cares. I let the hive mind work for me in this case, using this Tweet-to-follower ratio as sort of a proxy for signal-to-noise.

I mostly follow individuals and aggregators, but I also follow a few organization accounts. These can be a mixed bag. Only a few organization accounts do this well, delivering interesting and applicable content to a targeted audience, while many more are poor attempts at marketing and self-promotion while not offering any substantive value or interesting content.

Individuals: In no particular order, here’s an incomplete list of people who Tweet content that I find consistently on-topic and interesting.

Aaron Quinlan (aaronquinlan)
Adam Phillippy (aphillippy)
Andrew Severin (isugif)
Casey Greene (GreeneScientist)
Clive Brown (Clive_G_Brown)
Dan MacArthur (dgmacarthur)
David Robinson (drob)
Elisabeth Bik (MicrobiomDigest)
Frank Harrell (f2harrell)
Hadley Wickham (hadleywickham)
Heng Li (lh3lh3)
James Hadfield (coregenomics)
Jared Simpson (jaredtsimpson)
Jeff Leek (jtleek)
Jenny Bryan (JennyBryan)
Julia Silge (juliasilge)
Krista Ternus (KristaTernus)
Lex Nederbragt (lexnederbragt)
Lior Pachter (lpachter)
Mick Watson (biomickwatson)
Mike Love (mikelove)
Nick Loman (pathogenomenick)
Nicolas Robine (notSoJunkDNA)
Phil Ashton (flashton2003)
RNA-seq Blog (rnaseqblog)
Rob Patro (nomad421)
Roger Peng (rdpeng)
Sam Minot (sminot)
Sean Davis (seandavis12)
Titus Brown (ctitusbrown)
Torsten Seemann (torstenseemann)
Tuuli Lappalainen (tuuliel)
Vince Buffalo (vsbuffalo)
Willem van Schaik (WvSchaik)
Zamin Iqbal (ZaminIqbal)
Many more I’m failing to specifically mention…

Others: Besides individual accounts, there are also a number of aggregators and organizations that I keep on a high signal-to-noise list.

bioRxiv (biorxivpreprint)
bioRxiv Bioinfo (biorxiv_bioinfo)
bioRxiv Genomics (biorxiv_genomic)
Metagenomics Papers (metagenomic_lit)
InformaticsGW (UduakGW)
Hacker News 300 (newsyc300)
CompBiolPapers (compbiolpapers)
RNA-seq paper aggregator (RNA_seq)
Bioconductor (Bioconductor)
RStudio Tips (rstudiotips)

Blogs

I follow these and other blogs using RSS. I’ve been happy with the free version of Feedly ever since Google Reader was killed. The web interface and iOS app have everything I need, and they both integrate nicely with other services like Evernote, Instapaper, Buffer, Twitter, etc. If you can’t find a direct link to the blog’s RSS feed, you can usually type the name of the blog into Feedly’s search bar and it’ll find it for you. Similar to my “notjunk” list in Twitter, I have a Favorites category in Feedly where I include only the feeds I absolutely wouldn’t want to miss.

These are some of the few that I try to read whenever something new is posted, and Feedly helps me keep those organized, either by “starring” something I want to come back to, or saving it for later with Instapaper. They’re in no particular order, and I’m sure I’ve forgotten something.

Variance Explained: David Robinson’s blog (Data Scientist at Stack Overflow, works in R and Python).
Global Biodefense: News on pathogens, outbreaks, and preparedness, with periodic posts on genomics and bioinformatics-related developments and funding opportunities.
In between lines of code: Lex Nederbragt’s blog on biology, sequencing, bioinformatics, …
Simply Statistics: A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek.
Bits of DNA: Reviews and commentary on computational biology by Lior Pachter (fair warning: dialogue here can get a bit heated!).
Blue Collar Bioinformatics: articles related to tool validation and the open source bioinformatics community.
Microbiome Digest - Bik’s Picks: A daily digest of scientific microbiome papers, by Elisabeth Bik, Science Editor at uBiome.
Living in an Ivory Basement: Titus Brown’s blog on metagenomics, open science, testing, reproducibility, and programming.
Enseqlopedia: James Hadfield’s blog on all things NGS.
Epistasis Blog: Jason Moore’s computational biology blog.
RStudio Blog: announcements about new RStudio functionality, updates about the tidyverse, and more.
nextgenseek.com: Next-Gen Sequencing Blog covering new developments in NGS data & analysis.
RNA-Seq Blog: Transcriptome Research & Industry News.
The Allium: We all need a little humor in our lives. Like The Onion, but for science.

Others

I’m unsure how to categorize the rest. These are things like aggregators, Q&A sites/forums, and others.

Nuzzel is something I’ve only been using for a few months but it works very well. It’s meant to solve the Twitter / social media overload problem. If you’re following a few hundred people, you could easily have thousands of Tweets per day to read through (or miss). Nuzzel emails you a daily newsletter of the most relevant content in your Twitter feed. I’m guessing it does this by analyzing how many people you follow share, retweet, or favorite the same links. I try to read everything in my RSS feeds but I could never do this with Twitter (nor should you worry about trying). Nuzzel helps you catch up on things that are trending among the people you follow. It’s not a substitute for following the right people (see the Twitter section above).
RWeekly: weekly updates from the entire R community. Offers an RSS feed but I subscribe to the weekly email. Each email sends out about 50 links with one-sentence descriptions to things being done in the R community that week.
R Bloggers aggregates RSS feeds from hundreds of blogs about R. Much more comprehensive than RWeekly, but lots to sort through.
GenomeWeb still provides high-quality original content as well as summaries of what’s going on in the field. Create an account, log in, view your profile page, and subscribe to some of their regular emails. I subscribe to their daily news, the scan, informatics, sequencing, and infectious diseases bulletins. Pro tip: Much of their content is only available for premium subscribers. If you sign up with a .edu address, you can access all this content for free.
F1000’s Smart Search is one of the few literature recommendation services that I find useful, relevant, and current. My RNA-seq and metagenomics alerts consistently deliver relevant and fresh content.
BioStars: This is a Stack Exchange-style Q&A site focused on bioinformatics, computational genomics, and biological data analysis. You can go to the homepage and sort by topic, views, answers, etc., and the platform offers several granular ways to subscribe via RSS.
Bioconductor Support: This is a Q&A site much like BioStars that replaced the Bioconductor mailing list. You can do things like limit to a certain time period and sort by views, for example, if you only want to log in occasionally to see what’s being talked about.
SEQanswers: I subscribe to all new threads in the SEQanswers bioinformatics forum, and regularly browse post titles. When something sparks my interest, I’ll click into that post and subscribe to future updates on that post via email.
Google Scholar lets you search and create email alerts.
PubMed Alerts: You can save, automate, and have search results emailed to you through your MyNCBI account. Surprisingly, these seem to be more relevant than the Google Scholar searches for the terms that I use.
PubMed Trending - I have no idea how PubMed ranks these. It seemed to be more useful in the past, but now it seems that the top “trending” articles alternate between CRISPR/Cas9, and old kinesiology / sports medicine articles.
IFTTT: If This Then That is a service that connects many different web services together in an endless number of ways. At home I might connect Facebook and Dropbox, so that whenever someone tags me in a photo, that photo is automatically downloaded to my Dropbox. At work I can connect an RSS feed to an Evernote note or Google Doc. It’s useful in so many ways, both for personal and for work-related tasks. I mostly use it here as a last safeguard so that things I really shouldn’t miss don’t slip through the cracks. I have recipes that do things like email me if certain low-volume Twitter accounts post a new Tweet, and others that automatically save things like starred articles in Feedly to Instapaper. I also use this to keep a close eye on a few accounts on GitHub. I have connections set up for a few users on GitHub so that whenever one of these users creates a new public repository, I get an email. I’ve also used IFTTT to archive Tweets coming out of various hashtags: you can create a recipe where if a new Tweet contains certain keywords or hashtags, then save that Tweet to Evernote, a shared Google Doc spreadsheet, etc. Zapier is a similar service that I’ve heard provides more granular control, but I haven’t tried it.
Podcasts: I listen to every episode of Roger Peng and Hilary Parker’s Not So Standard Deviations data science podcast, and most episodes of Roger Peng and Elizabeth Matsui’s The Effort Report (this one’s more about life in academia in general). I use the Overcast iOS app to listen to these and other podcasts on ~1.75X speed. (When I met Hilary at the RStudio Conference I heard her speak for the first time at regular 1X speed. Odd experience.) Finally, I just learned about the R podcast. I haven’t listened to much yet, but I’ve added it to my long Overcast queue.

Preprints!

Preprints in life sciences were nearly unheard of when I wrote the 2012 post. Now everybody’s doing it. There are still a few people using the arXiv Quantitative biology channel, and I’ll occasionally find something in PeerJ Preprints that grabs my attention.

bioRxiv is the biggest player here, hands down. The Alerts/RSS page lets you sign up for email alerts on particular topics, or subscribe to RSS feeds coming from particular categories that interest you. I subscribe to the Genomics and Bioinformatics feeds. I also follow several of bioRxiv’s top-level and category Twitter feeds (@biorxivpreprint, @biorxiv_bioinfo, and @biorxiv_genomic).

F1000 Research deserves some special attention here. It’s somewhere in-between a preprint server and a peer-reviewed publication. You can upload manuscripts (or other research outputs like posters or slides), and they’re immediately and permanently published, and given a DOI. Then one or more rounds of open peer review as well as public comment take place, and authors can update the published paper for further review. Check out the transcript estimates / gene inference paper I mentioned earlier. You’ll see it’s “version 2,” and was approved by two referees. If you look at the right-hand panel, you can actually go back and see the version prior to revision, as well as see who reviewed it, what the reviewer wrote, and how the authors responded to those reviews. It’s an innovative platform where peer review is open and transparent, and is independent of publication, since papers are published before they are reviewed, and remain regardless of the outcome of the review. F1000 Research has a number of channels that are externally curated by different organizations, societies, conferences, etc. I subscribe to and get alerts about the R package and Bioconductor channels. Whenever a new preprint is dropped into one of these channels, I’ll get an email and an RSS item.

I only recently discovered PrePubMed, which looks very useful. PrePubMed indexes preprints from arXiv q-bio, PeerJ Preprints, bioRxiv, F1000Research, preprints.org, The Winnower, Nature Precedings, and Wellcome Open Research. In the tools box on the homepage, you can enter a search string and get back an RSS feed with results from that search. It looks like PrePubMed is maintained by a single person, but he’s made the entire thing open source, so you could presumably set this up and mirror it on your own, should you check back in 2021 and the link be dead.

Journals

I started with Journals in my 2012 post, but they’re last (and probably least) here. I still subscribe to a few journals’ RSS feeds, but in most cases, by the time I see a new Table of Contents hit my RSS reader, I probably saw the publications making the rounds on Twitter, blogs, or other channels mentioned above. It’s also no longer unusual to see a “publication” land where I read the preprint on bioRxiv months ago, and perhaps even a blog post before that! What “publication” means is changing rapidly, and I’m sure the lines between a blog post, preprint, and journal article will be even blurrier in the year 2022 post.

How do you have the time to do this?

How do you not? It’s not as bad as it seems. I probably spend an hour each weekday scanning all the resources mentioned here, and I find the time well spent. I can breeze through my Twitter and RSS feeds on my bus ride into work, saving the things I actually want to look at later with a bookmark, star, favorite, Instapaper, etc.

I should have prefaced this whole article with the note that I hardly ever actually fully read any of the papers or blog posts I see here. If I see, for example, a new WGS variant caller published, I’ll glance at the figures benchmarking it against GATK and FreeBayes, and skim through the documentation on the GitHub README or Bioconductor vignette. If either of these is missing or falls short, that’s usually enough for me to ignore the publication completely (don’t underestimate the importance of good documentation!).

It’s taken me a decade to compile and continually hone this list of resources to the things that I find useful and relevant. This is what works for me, now, in 2017. It’s not a one-size-fits-all, and the 2018-me will probably have a somewhat different list, but I hope you’ll find it useful. If your interests are similar to what I’ve discussed here, how do you stay current? What have I left out? Let me know in the comments!

RStudio Conference 2017 Recap

2017-01-14T15:48:00.000-06:00

The first ever RStudio conference was held January 11-14, 2017 in Orlando, FL. For anyone else like me who spends hours each working day staring into an RStudio session, the conference was truly excellent. The speaker lineup was diverse and covered lots of areas related to development in R, including the tidyverse, the RStudio IDE, Shiny, htmlwidgets, and authoring with RMarkdown.

This is not a complete list by any means; with split sessions I could only go to half the talks at most. Here are some noncomprehensive notes and links to slides and resources for some of the awesome things people are doing with R and RStudio that I learned about at the RStudio Conference.

Hadley Wickham kicked off the meeting with a keynote on doing data science in R. The talk focused on the tidyverse, and the notion of splitting functions into commands that do something, as compared to queries that calculate something, and how it’s generally a good idea to keep these different functionalities contained in their own separate functions. (Contrast this with things like lm, which both computes values and does things, like printing those values to the screen, making the results difficult to capture; see broom.)
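To make the command/query split concrete, here is a minimal base R sketch. The function names fit_summary and print_summary are hypothetical (they are not from the keynote or from broom), and the mtcars model is just a stand-in: the query returns a plain data frame and does nothing else, while the command only prints and hands its input back invisibly.

```r
# Query: computes and returns a value, with no side effects.
# (fit_summary is a hypothetical helper, not part of any package.)
fit_summary <- function(data) {
  fit <- lm(mpg ~ wt, data = data)
  data.frame(
    term = names(coef(fit)),
    estimate = unname(coef(fit)),
    stringsAsFactors = FALSE
  )
}

# Command: performs a side effect (printing) and returns its input invisibly,
# so it can sit in the middle of a pipeline without changing the data.
print_summary <- function(x) {
  print(x)
  invisible(x)
}

coefs <- fit_summary(mtcars)  # nothing is printed; the result is easy to capture
print_summary(coefs)          # printing happens only when explicitly requested
```

Because the query returns an ordinary data frame, the result can be filtered, joined, or plotted like any other table, which is roughly the convenience that broom provides for model objects.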

I asked Hadley after his talk about strategies to reduce issues getting Bioconductor data structures to play nicely with tidyverse tools. Within minutes David Robinson released a new feature in the fuzzyjoin package that leverages IRanges within this tidyverse-friendly package for efficiently doing things like joining on genomic intervals.

Another #rstudioconf-inspired addition to fuzzyjoin:

genome_join, for overlapping intervals on the same chromosome @genetics_blog #rstats pic.twitter.com/oUctyNYc09
— David Robinson (@drob) January 13, 2017

Charlotte Wickham’s 2-hour purrr tutorial was awesome. Here’s a link to a shared dropbox folder with code, challenges, slides, data, etc. The purrr package is a core package in the tidyverse, and I’ll be replacing many of the base ?apply and plyr ??ply functions that I still use here and there. The map_* functions are integral to working with nested list-columns in dplyr, and I think I’m finally starting to grok how to work with these.
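As a rough illustration of what swapping an apply-family call for a purrr map function looks like, here is a small sketch using mtcars. It is not taken from the tutorial materials; the object names (by_cyl, mean_mpg, fits, slopes) are made up for the example.

```r
library(purrr)

# Split mtcars into a named list of data frames, one per cylinder count.
by_cyl <- split(mtcars, mtcars$cyl)

# Base R: sapply() guesses the return type from whatever the function returns.
base_means <- sapply(by_cyl, function(df) mean(df$mpg))

# purrr: map_dbl() always returns a double vector, or errors if it cannot.
mean_mpg <- map_dbl(by_cyl, function(df) mean(df$mpg))

# map() always returns a list, which is why it pairs well with list-columns:
# here each element is a fitted model, and a second map pulls out the slope.
fits <- map(by_cyl, function(df) lm(mpg ~ wt, data = df))
slopes <- map_dbl(fits, function(fit) coef(fit)[["wt"]])
```

The appeal is mostly predictability: the suffix on the map function states the output type up front, so a surprising input fails loudly instead of silently changing shape.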

Jenny Bryan gave a great talk on list columns. You can see her slides here. Jenny also put together this excellent tutorial with lots of worked examples and code snippets. And if you need some example list data structures for more practice or for teaching that aren’t foo/bar/iris/mtcars-level boring, see her repurrrsive package. Related to this, for more on list columns and purrr map functions, start reading at the “Many Models” section of Hadley’s R for Data Science book.

Julia Silge, data scientist at Stack Overflow, gave a great introduction to tidy text mining with R. You can read Julia and David’s Tidy Text Mining with R book here online (the book was authored in Rmarkdown using bookdown!).

Andrew Flowers, data journalist and former writer at FiveThirtyEight, gave the second day’s keynote address on finding and telling stories using R. He gave a series of examples illustrating six motivating features that make data stories worth telling, along with the potential danger inherent to each one:

Novelty (potential danger: triviality)
Outlier (spurious result; see also, p-hacking)
Archetype (oversimplification)
Trend (variance)
Debunking (confirmation bias)
Forecast (overfitting)

Yihui Xie led a two-hour tutorial on advanced RMarkdown. You can see his slides here. The rticles package has LaTeX Journal Article Templates for R Markdown for various journals. The tufte package now supports both PDF and HTML output. See an example here. Yihui’s xaringan package ports the remark.js library for slideshows into R. Careful. Yihui warns that you may not sleep after learning about how cool remark.js is. Yihui showed an early version of the in-development blogdown package that can build blog-aware static websites using the blazing-fast and well-documented Hugo static site generator. Finally, the bookdown package is just awesome. It takes multiple RMarkdown documents as input and renders into multiple output formats (screen-readable ebook, PDF, epub, etc.). It looks great for writing books and technical documentation with pushbutton publishing to multiple output formats with some nice built-in styles out of the box. Some examples:

bookdown.org/yihui/bookdown — The bookdown book, written in RMarkdown with bookdown. (whoa, meta)
r4ds.had.co.nz — Garrett Grolemund and Hadley Wickham’s R for Data Science book.
tidytextmining.com — Julia and David’s book on text mining
moderndive.com — an open-source introductory statistics class textbook

Finally, a few gems from other talks that I jotted down:

Chester Ismay gave a great talk on teaching introductory statistics using R, with the open-source course textbook written in RMarkdown using bookdown.
Bob Rudis talked about using pipes (%>%), and pipes within pipes, and best piping practices. See his slides here.
Hilary Parker talked about the idea of analysis development (and analysis developers), drawing similarities to software development/developers. Hilary discussed this once before on the excellent podcast that she and Roger Peng host, and you can probably find it in their Conversations On Data Science ebook, which summarizes and transcribes these conversations.
Simon Jackson introduced the corrr package for exploring and manipulating correlations and correlation matrices in a tidy way.
Gordon Shotwell introduced the easymake package that generates Makefiles from a data frame using R.
Karthik Ram quickly introduced several of the (many) rOpenSci packages related to data publication, data access, scientific literature access, scalable & reproducible computing, databases, visualization, taxonomy, geospatial analysis, and many utility tools for data analysis and manipulation.

With split sessions I missed more than half the talks. Lots of people here are active on Twitter, and you can catch many more notes and tidbits on the #rstudioconf hashtag. The meeting was superbly organized, I learned a ton, and I enjoyed meeting in person many of the folks I follow on Twitter and elsewhere online. A few days of 80-degree weather in mid-January didn’t hurt either. I’ll definitely be coming again next year. Kudos to the rstudio::conf organizers and speakers!

All the talks were recorded and will supposedly find their way to rstudio.com at some point soon. I’ll update this post with a link when that happens.

Update Feb 16, 2017: All the talks have now been posted online here under the rstudio::conf2017 heading.

Primers in computational biology

2016-09-19T09:20:00.001-05:00

I recently stumbled across this collection of computational biology primers in Nature Biotechnology. Many of these are old, but they're still great resources to get a fundamental understanding of the topic. Here they are in no particular order.

...

How does multiple testing correction work?
http://www.nature.com/nbt/journal/v27/n12/full/nbt1209-1135.html

What is principal component analysis?
http://www.nature.com/nbt/journal/v26/n3/full/nbt0308-303.html

SNP imputation in association studies
http://www.nature.com/nbt/journal/v27/n4/full/nbt0409-349.html

How does gene expression clustering work?
http://www.nature.com/nbt/journal/v23/n12/full/nbt1205-1499.html

What is a hidden Markov model?
http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html

What is a support vector machine?
http://www.nature.com/nbt/journal/v24/n12/full/nbt1206-1565.html

What is the expectation maximization algorithm?
http://www.nature.com/nbt/journal/v26/n8/full/nbt1406.html

Syntax Highlight Code in Keynote or Powerpoint

2016-06-30T13:22:00.001-05:00

I came across this awesome gist explaining how to syntax highlight code in Keynote. The same trick works for Powerpoint. Mac only.

Install homebrew if you don’t have it already and brew install highlight.
Run highlight -O rtf myfile.ext | pbcopy to convert the code to syntax-highlighted RTF and copy the result to the system clipboard.
Paste into Keynote or Powerpoint.

If I’ve got some code in a file called eset_pca.R:

I can simply highlight -O rtf eset_pca.R | pbcopy and then paste it right into Keynote or Powerpoint.

Covcalc: Shiny App for Calculating Coverage Depth or Read Counts for Sequencing Experiments

2016-06-01T13:48:00.000-05:00

How many reads do I need? What's my sequencing depth? These are questions I get all the time. How much sequence data you need to hit a target depth of coverage and, inversely, what coverage depth you get from a set amount of sequencing are both easy to calculate with some basic algebra. Given one or the other, plus the genome size and read length/configuration, you can calculate either. This app was inspired by a similar calculator written by James Hadfield, and was an opportunity for me to create my first Shiny app.

Check out the app here:
http://apps.bioconnector.virginia.edu/covcalc/

And the source code on GitHub:
https://github.com/stephenturner/covcalc

Give it your read length, whether you're using single- or paired-end sequencing, select a genome or enter your own. Then, select whether you want to calculate (a) the number of reads you need to hit a target depth of coverage, or (b) the coverage depth you'll hit given a set number of sequencing reads. Once you make the selection, use the slider to adjust either the desired coverage or number of reads sequenced, and the output text below is automatically updated.
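The algebra behind both calculations fits in a few lines of R. This is a sketch rather than the app's actual source: the function names are hypothetical, and it assumes that the read count means sequenced fragments, so a paired-end fragment contributes two reads' worth of bases.

```r
# Bases contributed by one sequenced fragment.
# Assumption: for paired-end data, one fragment yields two reads.
bases_per_fragment <- function(read_length, paired = FALSE) {
  if (paired) 2 * read_length else read_length
}

# (b) Coverage depth from a set number of fragments:
#     depth = fragments * bases per fragment / genome size
coverage_depth <- function(n_reads, read_length, genome_size, paired = FALSE) {
  n_reads * bases_per_fragment(read_length, paired) / genome_size
}

# (a) Fragments needed to hit a target depth (rounded up to a whole fragment):
#     fragments = depth * genome size / bases per fragment
reads_needed <- function(target_depth, read_length, genome_size, paired = FALSE) {
  ceiling(target_depth * genome_size / bases_per_fragment(read_length, paired))
}

# Example: a 5 Mb genome sequenced with 2x100 paired-end reads.
depth <- coverage_depth(n_reads = 1e6, read_length = 100,
                        genome_size = 5e6, paired = TRUE)
needed <- reads_needed(target_depth = 30, read_length = 100,
                       genome_size = 5e6, paired = TRUE)
```

In this example one million 2x100 fragments give 40X coverage of the 5 Mb genome, and a 30X target needs 750,000 fragments. These are expected averages only; real coverage varies along the genome.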

Shiny App: Coverage / Read Count Calculator

Shiny Developer Conference 2016 Recap

2016-02-05T11:17:00.000-06:00

This is a guest post from VP Nagraj, a data scientist embedded within UVA’s Health Sciences Library, who runs our Data Analysis Support Hub (DASH) service.

Last weekend I was fortunate enough to be able to participate in the first ever Shiny Developer Conference hosted by RStudio at Stanford University. I’ve built a handful of apps, and have taught an introductory workshop on Shiny. In spite of that, almost all of the presentations de-mystified at least one aspect of the how, why or so what of the framework. Here’s a recap of what resonated with me, as well as some code and links out to my attempts to put what I digested into practice.

tl;dr

reactivity is a beast
javascript isn’t cheating
there are already a ton of shiny features … and more on the way

reactivity

For me, understanding reactivity has been one of the biggest challenges to using Shiny … or at least to using Shiny well. But after > 3 hours of an extensive (and really clear) presentation by Joe Cheng, I think I’m finally starting to see what I’ve been missing. Here’s something in particular that stuck out to me:

output$plot = renderPlot() is not an imperative to the browser to do a what … it’s a recipe for how the browser should do something.

Shiny ‘render’ functions (e.g. renderPlot(), renderText(), etc) inherently depend on reactivity. What the point above emphasizes is that assignments to a reactive expression are not the same as assignments made in “regular” R programming. Reactive outputs depend on inputs, and subsequently change as those inputs are manipulated.

If you want to watch how those changes happen in your own app, try adding options(shiny.reactlog=TRUE) to the top of your server script. When you run the app in a browser and press COMMAND + F3 (or CTRL + F3 on Windows) you’ll see a force directed network that outlines the connections between inputs and outputs.

Another way to implement reactivity is with the reactive() function.
For my apps, one of the pitfalls has been re-running the same code multiple times. That’s a perfect use-case for reactivity outside of the render functions.

Here’s a trivial example:

library(shiny)

ui = fluidPage(
    numericInput("threshold", "mpg threshold", value = 20),
    plotOutput("size"),
    textOutput("names")
)

server = function(input, output) {

    output$size = renderPlot({

        dat = subset(mtcars, mpg > input$threshold)
        hist(dat$wt)

    })

    output$names = renderText({

        dat = subset(mtcars, mpg > input$threshold)
        rownames(dat)

    })
}

shinyApp(ui = ui, server = server)

The code above works … but it’s redundant. There’s no need to calculate the “dat” object separately in each render function.

The code below does the same thing but stores “dat” in a reactive that is only calculated once.

library(shiny)

ui = fluidPage(
    numericInput("threshold", "mpg threshold", value = 20),
    plotOutput("size"),
    textOutput("names")
)

server = function(input, output) {

    dat = reactive({

        subset(mtcars, mpg > input$threshold)

    })

    output$size = renderPlot({

        hist(dat()$wt)

    })

    output$names = renderText({

        rownames(dat())

    })
}

shinyApp(ui = ui, server = server)

javascript

For whatever reason I’ve been stuck on the idea that using JavaScript inside a Shiny app would be “cheating”. But Shiny is actually well equipped for extensions with JavaScript libraries. Several of the speakers leaned in on this idea. Yihui Xie presented on the DT package, which is an interface to use features like client-side filtering from the DataTables library. And Dean Attali demonstrated shinyjs, a package that makes it really easy to incorporate JavaScript operations.

Below is code for a masterpiece that does some hide() and show():

# https://apps.bioconnector.virginia.edu/game
library(shiny)
library(shinyjs)
shinyApp(

  ui = fluidPage( 
        titlePanel(actionButton("start", "start the game")),
        useShinyjs(),
        hidden(actionButton("restart", "restart the game")),
        tags$h3(hidden(textOutput("game_over")))
  ),

  server = function(input, output) {

        output$game_over =
            renderText({
                "game over, man ... game over"
            })  

       observeEvent(input$start, {

            show("game_over", anim = TRUE, animType = "fade")
            hide("start")
            show("restart")
        })

       observeEvent(input$restart, {
            hide("game_over")
            hide("restart")
            show("start")
        })

  }
)

everything else

brushing

http://shiny.rstudio.com/articles/plot-interaction.html

Adding a brush argument to plotOutput() lets you click and drag to select points on a plot. You can use this for “zooming in” on something like a time series plot. Here’s the code for an app I wrote based on data from the babynames package. In this case the brush lets you zoom to see name frequency over a specific range of years.

# http://apps.bioconnector.virginia.edu/names/
library(shiny)
library(ggplot2)
library(ggthemes)
library(babynames)
library(scales)

options(scipen=999)

ui = fluidPage(titlePanel(title = "names (1880-2012)"),
                textInput("name", "enter a name"),
                actionButton("go", "search"),
                plotOutput("plot1", brush = "plot_brush"),
                plotOutput("plot2"),
                htmlOutput("info")

)

server = function(input, output) {

    dat = eventReactive(input$go, {

        subset(babynames, tolower(name) == tolower(input$name))

    })

    output$plot1 = renderPlot({

        ggplot(dat(), aes(year, prop, col=sex)) + 
            geom_line() + 
            xlim(1880,2012) +
            theme_minimal() +
            # format labels with percent function from scales package
            scale_y_continuous(labels = percent) +
            # pass labels as named arguments; recent ggplot2 versions no longer accept a list here
            labs(title = "% of individuals born with name by year and gender",
                 x = "\n click-and-drag over the plot to 'zoom'",
                 y = "")

    })

    output$plot2 = renderPlot({

        # need latest version of shiny to use req() function
        req(input$plot_brush)
        brushed = brushedPoints(dat(), input$plot_brush)

        ggplot(brushed, aes(year, prop, col=sex)) + 
            geom_line() +
            theme_minimal() +
            # format labels with percent function from scales package
            scale_y_continuous(labels = percent) +
            # pass labels as named arguments; recent ggplot2 versions no longer accept a list here
            labs(title = "% of individuals born with name by year and gender",
                 x = "",
                 y = "")

    })

    output$info = renderText({

        "data source: social security administration names from babynames package

"

    })

}

shinyApp(ui, server)

gadgets

http://shiny.rstudio.com/articles/gadgets.html

Gadgets are a relatively easy way to leverage Shiny reactivity for visual inspection and interaction with data within RStudio. The main difference here is that you’re using an abbreviated (or ‘mini’) ui. The advantage of this workflow is that you can include it in your script to make your analysis interactive. I modified the example in the documentation and wrote a basic brushing gadget that removes outliers:

library(shiny)
library(miniUI)
library(ggplot2)

outlier_rm = function(data, xvar, yvar) {

    ui = miniPage(
        gadgetTitleBar("Drag to select points"),
        miniContentPanel(
            # The brush="brush" argument means we can listen for
            # brush events on the plot using input$brush.
            plotOutput("plot", height = "100%", brush = "brush")
            )
        )

    server = function(input, output, session) {

        # Render the plot
        output$plot = renderPlot({
            # Plot the data with x/y vars indicated by the caller.
            ggplot(data, aes_string(xvar, yvar)) + geom_point()
        })

        # Handle the Done button being pressed.
        observeEvent(input$done, {

            # create id for data
            data$id = 1:nrow(data)

            # Return the brushed points. See ?shiny::brushedPoints.
            p = brushedPoints(data, input$brush)

            # flag the rows of the original data that were NOT brushed
            # (matching on id also works when nothing was brushed)
            keep = !(data$id %in% p$id)

            # return a subset of the original data without brushed points
            stopApp(data[keep, ])
        })
    }

    runGadget(ui, server)
}

# run to open plot viewer
# click and drag to brush
# press Done to return a subset of the original data without brushed points
library(gapminder)
outlier_rm(gapminder, "lifeExp", "gdpPercap")

# you can also pass the output of the gadget straight into a dplyr pipeline
# without the selected points, what is the mean life expectancy by country?
library(dplyr)
outlier_rm(gapminder, "lifeExp", "gdpPercap") %>%
    group_by(country) %>%
    summarise(mean(lifeExp))
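
The row-removal step is the part of the gadget most worth checking, and it can be exercised without opening the viewer. The sketch below is not part of the gadget itself: it uses base R only, the built-in mtcars data, and a hand-picked set of row ids standing in for what brushedPoints() would return. The helper name drop_selected is hypothetical.

```r
# Hypothetical helper: drop the rows of `data` whose row id appears in
# `selected_ids`, using a logical mask instead of negative indexing.
drop_selected = function(data, selected_ids) {
    ids = seq_len(nrow(data))
    data[!(ids %in% selected_ids), , drop = FALSE]
}

cars = mtcars[1:6, c("mpg", "wt")]

# Pretend rows 2 and 5 were brushed in the gadget.
kept = drop_selected(cars, selected_ids = c(2, 5))
rownames(kept)

# An empty selection leaves the data untouched, which negative indexing
# with an empty vector (data[-integer(0), ]) would not do.
untouched = drop_selected(cars, selected_ids = integer(0))
nrow(untouched)
```

Filtering with a logical mask rather than negative positions is the design choice here: it behaves sensibly when nothing is selected.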

req()

http://shiny.rstudio.com/reference/shiny/latest/req.html

This solves the issue of requiring an input - I’m definitely going to use this so I don’t have to do the return(NULL) workaround:

# no need to do this any more
# 
# inFile = input$file1
# 
#         if (is.null(inFile))
#             return(NULL)

# use req() instead
req(input$file1)
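
req() stops on more than NULL: it checks whether a value is “truthy”, using the same rules as shiny’s exported isTruthy() function. Here is a small sketch of what that means for a few common input values (the example values are made up for illustration):

```r
library(shiny)

# req() halts the enclosing reactive or render function when any argument
# is not "truthy"; isTruthy() exposes that same check directly.
isTruthy(NULL)        # e.g. a fileInput before anything is uploaded
isTruthy("")          # e.g. an empty textInput
isTruthy(FALSE)       # e.g. an unchecked checkboxInput
isTruthy("data.csv")  # a non-empty string
```

The first three calls return FALSE, so req() on any of those values would quietly stop the output from rendering; only the last one passes.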

profvis

http://rpubs.com/wch/123888

Super helpful method for digging into the call stack of your R code to see how you might optimize it.

One or two seconds of processing can make a big difference, particularly for a Shiny app …

rstudio connect

https://www.rstudio.com/rstudio-connect-beta/

Jeff Allen from RStudio gave a talk on deployment options for Shiny applications and mentioned this product, which is a “coming soon” platform for hosting apps alongside RMarkdown documents and plots. It’s not available as a full release yet, but there is a beta version for testing.

Repel overlapping text labels in ggplot2

2016-01-08T09:50:00.003-06:00

A while back I showed you how to make volcano plots in base R for visualizing gene expression results. This is just one of many genome-scale plots where you might want to show all individual results but highlight or call out important results by labeling them, for example, with a gene name.

But if you want to annotate lots of points, the annotations usually get so crowded that they overlap one another and become illegible. There are ways around this - reducing the font size, or adjusting the position or angle of the text, but these usually don’t completely solve the problem, and can even make the visualization worse. Here’s the plot again, reading the results directly from GitHub, and drawing the plot with ggplot2 and geom_text out of the box.

What a mess. It’s difficult to see what any of those downregulated genes are on the left. Enter the ggrepel package, a new extension of ggplot2 that repels text labels away from one another. Just sub in geom_text_repel() in place of geom_text() and the extension is smart enough to try to figure out how to label the points such that the labels don’t interfere with each other. Here it is in action.
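
A minimal sketch of the swap, using simulated data rather than the results file from the post (the data frame, its column names, and the labeling thresholds are all assumptions for illustration):

```r
library(ggplot2)
library(ggrepel)

# Simulated stand-in for a differential expression results table.
set.seed(42)
res = data.frame(
    gene = paste0("gene", 1:200),
    log2FoldChange = rnorm(200, sd = 2),
    padj = runif(200)^3
)

# Label only the points that pass both (arbitrary) thresholds.
to_label = subset(res, padj < 0.05 & abs(log2FoldChange) > 2)

p = ggplot(res, aes(log2FoldChange, -log10(padj))) +
    geom_point(colour = "grey60") +
    geom_point(data = to_label, colour = "red") +
    # geom_text_repel() takes the place of geom_text()
    geom_text_repel(data = to_label, aes(label = gene))
p
```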

And the result (much better!):

See the ggrepel package vignette for more.

GRUPO: Shiny App For Benchmarking Pubmed Publication Output

2015-12-14T08:29:00.000-06:00

This is a guest post from VP Nagraj, a data scientist embedded within UVA’s Health Sciences Library, who runs our Data Analysis Support Hub (DASH) service.

The What

GRUPO (Gauging Research University Publication Output) is a Shiny app that provides side-by-side benchmarking of American research university publication activity.

The How

The code behind the app is written in R, and leverages the NCBI Eutils API via the rentrez package interface.

The methodology is fairly simple:

Build the search query in Pubmed syntax based on user input parameters.
Extract total number of articles from results.
Output a visualization of the total counts for both selected institutions.
Extract unique article identifiers from results.
Output the number of article identifiers that match (i.e. “collaborations”) between the two selected institutions.

Build Query

The syntax for searching Pubmed relies on MEDLINE tags and boolean operators. You can peek into how to use the keywords and build these kinds of queries with the Pubmed Advanced Search Builder.

GRUPO builds its queries based on two fields in particular: “Affiliation” and “Date.” Because this search term will have to be built multiple times (at least twice to compare results for two institutions) I wrote a helper function called build_query():

# use %Y/%m/%d (e.g. 1999/02/14) date format for startDate and endDate arguments

build_query = function(institution, startDate, endDate) {

    if (grepl("-", institution)==TRUE) {                
        split_name = strsplit(institution, split="-")
        search_term = paste(split_name[[1]][1], '[Affiliation]',
                             ' AND ',
                             split_name[[1]][2],
                             '[Affiliation]',
                             ' AND ',
                             startDate,
                             '[PDAT] : ',
                             endDate,
                             '[PDAT]',
                             sep='')
        search_term = gsub("-","/",search_term)
    } else {
        search_term = paste(institution, 
                             '[Affiliation]',
                             ' AND ',
                             startDate,
                             '[PDAT] : ',
                             endDate,
                             '[PDAT]',
                             sep='')
        search_term = gsub("-","/",search_term)
    }

    return(search_term)
}

The if/else logic in there accommodates cases like “University of North Carolina-Chapel Hill”, which otherwise wouldn’t search properly in the affiliation field. This method does depend on the institution name having its specific locale separated by a - symbol. In other words, if you passed in “University of Colorado/Boulder” you’d be stuck.

So by using this function for the University of Virginia from January 1, 2014 to January 1, 2015 you’d get the following term:

University of Virginia[Affiliation] AND 2014/01/01[PDAT] : 2015/01/01[PDAT]

And for University of Texas-Austin over the same dates you get the following term:

University of Texas[Affiliation] AND Austin[Affiliation] AND 2014/01/01[PDAT] : 2015/01/01[PDAT]
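
To make the two cases concrete, here is a compact, self-contained sketch that reproduces both terms above. It is a condensed restatement of the same logic rather than the app’s code: the name build_query_sketch is hypothetical, and unlike build_query() it turns every hyphen-separated piece of the name into its own affiliation clause instead of only the first two.

```r
# Hypothetical condensed version of the query builder, base R only.
build_query_sketch = function(institution, startDate, endDate) {
    # split "Name-Locale" into pieces; a name without "-" stays as one piece
    parts = strsplit(institution, split = "-", fixed = TRUE)[[1]]
    affil = paste0(parts, "[Affiliation]", collapse = " AND ")
    dates = paste0(startDate, "[PDAT] : ", endDate, "[PDAT]")
    # Date values print as 2014-01-01; Pubmed expects 2014/01/01
    gsub("-", "/", paste(affil, dates, sep = " AND "), fixed = TRUE)
}

start = as.Date("2014-01-01")
end = as.Date("2015-01-01")

uva = build_query_sketch("University of Virginia", start, end)
ut = build_query_sketch("University of Texas-Austin", start, end)
uva
ut
```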

The advantage of using this function in a Shiny app is that you can pass the institution names and dates dynamically. Users enter the input parameters for which date range and institutions to search via the widgets in the ui.R script.

For the app to work, there has to be one date picker widget and two text inputs (one for each of the two institutions) in the ui.R script. The corresponding server.R script would have a reactive element wrapped around the following:

search_term = build_query(institution = input$institution1, startDate = input$dates[1], endDate = input$dates[2])
search_term2 = build_query(institution = input$institution2, startDate = input$dates[1], endDate = input$dates[2])

Run Query

With the query built, you can run the search in Pubmed. The entrez_search() function from the rentrez package lets us get the information we want. This function returns four elements:

ids (unique Pubmed identifiers for each article in the result list)
count (total number of results)
retmax (maximum number of results that could have been returned)
file (the actual XML record containing the values above)

The following code returns total articles for each of two different searches:

affiliation_search = entrez_search("pubmed", search_term, retmax = 99999)
affiliation_search2 = entrez_search("pubmed", search_term2, retmax = 99999)

total_articles = as.numeric(affiliation_search$count)
total_articles2 = as.numeric(affiliation_search2$count)

Plot Results

The code above lives in the server.R script and is the functional workhorse for the app. But to adequately represent the benchmarking, GRUPO needed some kind of plot.

We can combine the total articles for each institution with the institution names, which we used to build the search terms. The result is a tiny (2 x 2) data frame of “Institution” and “Total.Articles” variables. Nothing fancy. But it does the trick.

With a data frame in hand, we can load it into ggplot2 and do some very simple barplotting:
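
A minimal sketch of that step, assuming the column names described above. The institution names and counts here are placeholders, not real Pubmed results; in the app they would come from the text inputs and from the search counts.

```r
library(ggplot2)

# Placeholder institution names and counts, not real Pubmed results.
benchmark = data.frame(
    Institution = c("Institution A", "Institution B"),
    Total.Articles = c(1200, 950)
)

# One bar per institution, with bar height equal to the article count.
p = ggplot(benchmark, aes(x = Institution, y = Total.Articles)) +
    geom_bar(stat = "identity") +
    labs(x = "", y = "Total articles")
p
```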

Output Collaborations

Although the primary function of GRUPO is side-by-side benchmarking, it does have at least one other feature so far.

The inclusion of the “ids” object in the query result makes it possible to do something else. You can compare how many of the article identifiers match between two queries. That should represent the number of “collaborations” (i.e. how many of the publications share authorship) between individuals at the two institutions.

To get the total number of collaborations, we can do a simple calculation of length on the vector of intersections between the two search results:

collaboration_count = length(intersect(affiliation_search$ids, affiliation_search2$ids))
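
The same calculation on toy input, for illustration. The identifiers below are made up and simply stand in for the “ids” element of two search results:

```r
# Made-up article identifiers standing in for two sets of search results.
ids_a = c("1001", "1002", "1003", "1004")
ids_b = c("1003", "1005", "1001")

# Identifiers present in both result sets count as collaborations.
shared = intersect(ids_a, ids_b)
collaboration_count = length(shared)
collaboration_count
```

Two identifiers appear in both vectors, so the count here is 2.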

By placing the search call inside a reactive element within Shiny, GRUPO can store the results (“count” and “ids”) rather than repeating the query for each purpose.

NB: This approach to assessing collaboration counts is unreliable when considering articles published before October 2013, which was when the National Library of Medicine (NLM) began including affiliation tags for all authors.

The Next Steps

What’s next? There are a number of potential new features for GRUPO. It’s worth pointing out that a discussion of these possibilities will likely highlight some of the limitations of the app as it exists now.

For example, it would be advantageous to include other “research output” data sources. GRUPO currently only accounts for publications indexed in Pubmed. That’s a fairly one-dimensional representation of scholarly activities. Information about publications indexed elsewhere, funding awarded or altmetric indicators isn’t accounted for.

And neither is any information about the institutions. While all of them are considered to have very high research activity one could argue that some are “apples” and some are “oranges” based on discrepancies in budgets, number of faculty members, student body size, etc. A more thorough benchmarking tool might model research universities based on additional administrative data, and restrict comparisons to “similar” institutions.

So GRUPO is still a work in progress. But it’s a solid example of a Shiny app that effectively leverages an API as its primary data source. Feel free to post a comment if you have any feedback or questions.

Grupo Shiny App: http://apps.bioconnector.virginia.edu/grupo/

Grupo Source Code: https://github.com/vpnagraj/grupo

Tutorial: RNA-seq differential expression & pathway analysis with Sailfish, DESeq2, GAGE, and Pathview

2015-12-04T11:40:00.000-06:00

Background

This tutorial shows an example of RNA-seq data analysis with DESeq2, followed by KEGG pathway analysis using GAGE. It uses data from GSE37704, with processed data available on Figshare (DOI: 10.6084/m9.figshare.1601975). This dataset has six samples from GSE37704, where expression was quantified by either: (A) mapping to GRCh38 using STAR then counting reads mapped to genes with featureCounts under the union-intersection model, or (B) alignment-free quantification using Sailfish, summarized at the gene level using the GRCh38 GTF file. Both datasets are restricted to protein-coding genes only. Here I’ll use the Sailfish gene-level estimated counts.

Differential expression analysis

First, import the countdata and metadata directly from the web. Set up the DESeqDataSet, run the DESeq2 pipeline.

# Note importing BioC pkgs after dplyr requires explicitly using dplyr::select()
library(dplyr)
library(DESeq2)

# Which data do you want to use? Let's use the sailfish counts.
# browseURL("http://dx.doi.org/10.6084/m9.figshare.1601975")
# countDataURL = "http://files.figshare.com/2439061/GSE37704_featurecounts.csv"
countDataURL = "http://files.figshare.com/2600373/GSE37704_sailfish_genecounts.csv"

# Import countdata
countData = read.csv(countDataURL, row.names=1) %>% 
  dplyr::select(-length) %>% 
  as.matrix()

# Filter data where you only have 0 or 1 read count across all samples.
countData = countData[rowSums(countData)>1, ]
head(countData)

##                 SRR493366 SRR493367 SRR493368 SRR493369 SRR493370
## ENSG00000198888     17528     23007     30241     24418     29152
## ENSG00000198763     21264     26720     35550     28878     32416
## ENSG00000198804    130975    151207    195514    178130    196727
## ENSG00000198712     49769     61906     78608     66478     69758
## ENSG00000228253      9304     11160     12830     12608     13041
## ENSG00000198899     45401     51260     66851     63433     66123
##                 SRR493371
## ENSG00000198888     34416
## ENSG00000198763     38422
## ENSG00000198804    244670
## ENSG00000198712     86808
## ENSG00000228253     16063
## ENSG00000198899     79215

# Import metadata
colData = read.csv("http://files.figshare.com/2439060/GSE37704_metadata.csv", row.names=1)
colData

##               condition
## SRR493366 control_sirna
## SRR493367 control_sirna
## SRR493368 control_sirna
## SRR493369      hoxa1_kd
## SRR493370      hoxa1_kd
## SRR493371      hoxa1_kd

# Set up the DESeqDataSet Object and run the DESeq pipeline
dds = DESeqDataSetFromMatrix(countData=countData,
                              colData=colData,
                              design=~condition)
dds = DESeq(dds)
dds

## class: DESeqDataSet 
## dim: 16755 6 
## metadata(0):
## assays(3): counts mu cooks
## rownames(16755): ENSG00000198888 ENSG00000198763 ...
##   ENSG00000267795 ENSG00000165795
## rowRanges metadata column names(27): baseMean baseVar ... deviance
##   maxCooks
## colnames(6): SRR493366 SRR493367 ... SRR493370 SRR493371
## colData names(2): condition sizeFactor

Next, get results for the HoxA1 knockdown versus control siRNA, and reorder them by p-value. Call summary on the results object to get a sense of how many genes are up or down-regulated at FDR 0.1.

res = results(dds, contrast=c("condition", "hoxa1_kd", "control_sirna"))
res = res[order(res$pvalue),]
summary(res)

## 
## out of 16755 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)     : 4193, 25% 
## LFC < 0 (down)   : 4286, 26% 
## outliers [1]     : 22, 0.13% 
## low counts [2]   : 1299, 7.8% 
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Since we mapped and counted against the Ensembl annotation, our results only have information about Ensembl gene IDs. But, our pathway analysis downstream will use KEGG pathways, and genes in KEGG pathways are annotated with Entrez gene IDs. I wrote an R package for doing this offline the dplyr way (https://github.com/stephenturner/annotables), but the canonical Bioconductor way to do it is with the AnnotationDbi and organism annotation packages. Here we’re using the organism package (“org”) for Homo sapiens (“Hs”), organized as an AnnotationDbi database package (“db”) using Entrez Gene IDs (“eg”) as primary keys. To see what all the keys are, use the columns function.

library("AnnotationDbi")
library("org.Hs.eg.db")
columns(org.Hs.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
## [17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [25] "UNIGENE"      "UNIPROT"

Let’s use the mapIds function to add more columns to the results. The row.names of our results table has the Ensembl gene ID (our key), so we need to specify keytype=ENSEMBL. The column argument tells the mapIds function which information we want, and the multiVals argument tells the function what to do if there are multiple possible values for a single input value. Here we ask to just give us back the first one that occurs in the database. Let’s get the Entrez IDs, gene symbols, and full gene names.

res$symbol = mapIds(org.Hs.eg.db,
                     keys=row.names(res), 
                     column="SYMBOL",
                     keytype="ENSEMBL",
                     multiVals="first")
res$entrez = mapIds(org.Hs.eg.db,
                     keys=row.names(res), 
                     column="ENTREZID",
                     keytype="ENSEMBL",
                     multiVals="first")
res$name =   mapIds(org.Hs.eg.db,
                     keys=row.names(res), 
                     column="GENENAME",
                     keytype="ENSEMBL",
                     multiVals="first")

head(res, 10)

## log2 fold change (MAP): condition hoxa1_kd vs control_sirna 
## Wald test p-value: condition hoxa1_kd vs control_sirna 
## DataFrame with 10 rows and 9 columns
##                  baseMean log2FoldChange      lfcSE      stat    pvalue
##                           
## ENSG00000148773  1885.344      -3.172502 0.07868572 -40.31865         0
## ENSG00000138623  2939.936      -2.418238 0.05889229 -41.06205         0
## ENSG00000104368 13601.963       2.016802 0.05249643  38.41789         0
## ENSG00000124766  2692.200       2.379545 0.06193654  38.41908         0
## ENSG00000122861 35889.413       2.224779 0.05258658  42.30697         0
## ENSG00000116016  4558.157      -1.885339 0.04258766 -44.26961         0
## ENSG00000164251  2404.103       3.325196 0.07021236  47.35912         0
## ENSG00000125257  6187.386       1.943762 0.04259189  45.63692         0
## ENSG00000104321  9334.555       3.186856 0.06227530  51.17367         0
## ENSG00000183508  2110.345       3.190612 0.07488305  42.60794         0
##                      padj      symbol      entrez
##                   
## ENSG00000148773         0       MKI67        4288
## ENSG00000138623         0      SEMA7A        8482
## ENSG00000104368         0        PLAT        5327
## ENSG00000124766         0        SOX4        6659
## ENSG00000122861         0        PLAU        5328
## ENSG00000116016         0       EPAS1        2034
## ENSG00000164251         0       F2RL1        2150
## ENSG00000125257         0       ABCC4       10257
## ENSG00000104321         0       TRPA1        8989
## ENSG00000183508         0      FAM46C       54855
##                                                                               name
##                                                                        
## ENSG00000148773                                      marker of proliferation Ki-67
## ENSG00000138623 semaphorin 7A, GPI membrane anchor (John Milton Hagen blood group)
## ENSG00000104368                                      plasminogen activator, tissue
## ENSG00000124766                               SRY (sex determining region Y)-box 4
## ENSG00000122861                                   plasminogen activator, urokinase
## ENSG00000116016                                   endothelial PAS domain protein 1
## ENSG00000164251                   coagulation factor II (thrombin) receptor-like 1
## ENSG00000125257            ATP-binding cassette, sub-family C (CFTR/MRP), member 4
## ENSG00000104321 transient receptor potential cation channel, subfamily A, member 1
## ENSG00000183508                       family with sequence similarity 46, member C

Pathway analysis

We’re going to use the gage package (Generally Applicable Gene-set Enrichment for Pathway Analysis) for pathway analysis. See also the gage package workflow vignette for RNA-seq pathway analysis. Once we have a list of enriched pathways, we’re going to use the pathview package to draw pathway diagrams, shading the molecules in the pathway by their degree of up/down-regulation.

KEGG pathways

The gageData package has pre-compiled databases mapping genes to KEGG pathways and GO terms for common organisms. kegg.sets.hs is a named list of 229 elements. Each element is a character vector of member gene Entrez IDs for a single KEGG pathway. (See also go.sets.hs.) sigmet.idx.hs is an index of the signaling and metabolic pathways in kegg.sets.hs. In other words, KEGG pathways include other types of pathway definitions, like “Global Map” and “Human Diseases”, which may be undesirable in pathway analysis. Therefore, kegg.sets.hs[sigmet.idx.hs] gives you the “cleaner” gene sets of signaling and metabolic pathways only.

library(pathview)
library(gage)
library(gageData)
data(kegg.sets.hs)
data(sigmet.idx.hs)
kegg.sets.hs = kegg.sets.hs[sigmet.idx.hs]
head(kegg.sets.hs, 3)

## $`hsa00232 Caffeine metabolism`
## [1] "10"   "1544" "1548" "1549" "1553" "7498" "9"   
## 
## $`hsa00983 Drug metabolism - other enzymes`
##  [1] "10"     "1066"   "10720"  "10941"  "151531" "1548"   "1549"  
##  [8] "1551"   "1553"   "1576"   "1577"   "1806"   "1807"   "1890"  
## [15] "221223" "2990"   "3251"   "3614"   "3615"   "3704"   "51733" 
## [22] "54490"  "54575"  "54576"  "54577"  "54578"  "54579"  "54600" 
## [29] "54657"  "54658"  "54659"  "54963"  "574537" "64816"  "7083"  
## [36] "7084"   "7172"   "7363"   "7364"   "7365"   "7366"   "7367"  
## [43] "7371"   "7372"   "7378"   "7498"   "79799"  "83549"  "8824"  
## [50] "8833"   "9"      "978"   
## 
## $`hsa00230 Purine metabolism`
##   [1] "100"    "10201"  "10606"  "10621"  "10622"  "10623"  "107"   
##   [8] "10714"  "108"    "10846"  "109"    "111"    "11128"  "11164" 
##  [15] "112"    "113"    "114"    "115"    "122481" "122622" "124583"
##  [22] "132"    "158"    "159"    "1633"   "171568" "1716"   "196883"
##  [29] "203"    "204"    "205"    "221823" "2272"   "22978"  "23649" 
##  [36] "246721" "25885"  "2618"   "26289"  "270"    "271"    "27115" 
##  [43] "272"    "2766"   "2977"   "2982"   "2983"   "2984"   "2986"  
##  [50] "2987"   "29922"  "3000"   "30833"  "30834"  "318"    "3251"  
##  [57] "353"    "3614"   "3615"   "3704"   "377841" "471"    "4830"  
##  [64] "4831"   "4832"   "4833"   "4860"   "4881"   "4882"   "4907"  
##  [71] "50484"  "50940"  "51082"  "51251"  "51292"  "5136"   "5137"  
##  [78] "5138"   "5139"   "5140"   "5141"   "5142"   "5143"   "5144"  
##  [85] "5145"   "5146"   "5147"   "5148"   "5149"   "5150"   "5151"  
##  [92] "5152"   "5153"   "5158"   "5167"   "5169"   "51728"  "5198"  
##  [99] "5236"   "5313"   "5315"   "53343"  "54107"  "5422"   "5424"  
## [106] "5425"   "5426"   "5427"   "5430"   "5431"   "5432"   "5433"  
## [113] "5434"   "5435"   "5436"   "5437"   "5438"   "5439"   "5440"  
## [120] "5441"   "5471"   "548644" "55276"  "5557"   "5558"   "55703" 
## [127] "55811"  "55821"  "5631"   "5634"   "56655"  "56953"  "56985" 
## [134] "57804"  "58497"  "6240"   "6241"   "64425"  "646625" "654364"
## [141] "661"    "7498"   "8382"   "84172"  "84265"  "84284"  "84618" 
## [148] "8622"   "8654"   "87178"  "8833"   "9060"   "9061"   "93034" 
## [155] "953"    "9533"   "954"    "955"    "956"    "957"    "9583"  
## [162] "9615"

The gage() function requires a named vector of fold changes, where the names of the values are the Entrez gene IDs.

foldchanges = res$log2FoldChange
names(foldchanges) = res$entrez
head(foldchanges)

##      4288      8482      5327      6659      5328      2034 
## -3.172502 -2.418238  2.016802  2.379545  2.224779 -1.885339

Now, let’s run the pathway analysis. See help on the gage function with ?gage. Specifically, you might want to try changing the value of same.dir. This value determines whether to test for changes in a gene set toward a single direction (all genes up or down regulated) or changes towards both directions simultaneously (any genes in the pathway dysregulated).

For experimentally derived gene sets, GO term groups, etc, coregulation is commonly the case, hence same.dir = TRUE (default); In KEGG, BioCarta pathways, genes frequently are not coregulated, hence it could be informative to let same.dir = FALSE. Although same.dir = TRUE could also be interesting for pathways.

Here, we’re using same.dir = TRUE, which will give us separate lists for pathways that are upregulated versus pathways that are downregulated. Let’s look at the first few results from each.

# Get the results
keggres = gage(foldchanges, gsets=kegg.sets.hs, same.dir=TRUE)

# Look at up (greater), down (less), and statistics.
lapply(keggres, head)

## $greater
##                                          p.geomean stat.mean        p.val
## hsa04142 Lysosome                     0.0002630657  3.517890 0.0002630657
## hsa04640 Hematopoietic cell lineage   0.0017919390  2.976432 0.0017919390
## hsa04630 Jak-STAT signaling pathway   0.0048980977  2.604390 0.0048980977
## hsa00140 Steroid hormone biosynthesis 0.0051115493  2.636206 0.0051115493
## hsa04062 Chemokine signaling pathway  0.0125582961  2.250765 0.0125582961
## hsa00511 Other glycan degradation     0.0223819919  2.104311 0.0223819919
##                                            q.val set.size         exp1
## hsa04142 Lysosome                     0.04261664      116 0.0002630657
## hsa04640 Hematopoietic cell lineage   0.14514706       61 0.0017919390
## hsa04630 Jak-STAT signaling pathway   0.20701775      119 0.0048980977
## hsa00140 Steroid hormone biosynthesis 0.20701775       39 0.0051115493
## hsa04062 Chemokine signaling pathway  0.40688879      156 0.0125582961
## hsa00511 Other glycan degradation     0.49956506       15 0.0223819919
## 
## $less
##                                      p.geomean stat.mean        p.val
## hsa04110 Cell cycle               2.165725e-06 -4.722301 2.165725e-06
## hsa03030 DNA replication          3.807440e-06 -4.835336 3.807440e-06
## hsa04114 Oocyte meiosis           1.109869e-04 -3.767561 1.109869e-04
## hsa03013 RNA transport            1.181787e-03 -3.071947 1.181787e-03
## hsa03440 Homologous recombination 1.197124e-03 -3.190747 1.197124e-03
## hsa00240 Pyrimidine metabolism    1.570318e-03 -2.992059 1.570318e-03
##                                          q.val set.size         exp1
## hsa04110 Cell cycle               0.0003084027      121 2.165725e-06
## hsa03030 DNA replication          0.0003084027       36 3.807440e-06
## hsa04114 Oocyte meiosis           0.0059932916      101 1.109869e-04
## hsa03013 RNA transport            0.0387868193      145 1.181787e-03
## hsa03440 Homologous recombination 0.0387868193       28 1.197124e-03
## hsa00240 Pyrimidine metabolism    0.0423985796       96 1.570318e-03
## 
## $stats
##                                       stat.mean     exp1
## hsa04142 Lysosome                      3.517890 3.517890
## hsa04640 Hematopoietic cell lineage    2.976432 2.976432
## hsa04630 Jak-STAT signaling pathway    2.604390 2.604390
## hsa00140 Steroid hormone biosynthesis  2.636206 2.636206
## hsa04062 Chemokine signaling pathway   2.250765 2.250765
## hsa00511 Other glycan degradation      2.104311 2.104311

Now, let’s process the results to pull out the top 5 upregulated pathways, then further process that just to get the IDs. We’ll use these KEGG pathway IDs downstream for plotting.

# Get the pathways
keggrespathways = data.frame(id=rownames(keggres$greater), keggres$greater) %>% 
  tbl_df() %>% 
  filter(row_number()<=5) %>% 
  .$id %>% 
  as.character()
keggrespathways

## [1] "hsa04142 Lysosome"                    
## [2] "hsa04640 Hematopoietic cell lineage"  
## [3] "hsa04630 Jak-STAT signaling pathway"  
## [4] "hsa00140 Steroid hormone biosynthesis"
## [5] "hsa04062 Chemokine signaling pathway"

# Get the IDs.
keggresids = substr(keggrespathways, start=1, stop=8)
keggresids

## [1] "hsa04142" "hsa04640" "hsa04630" "hsa00140" "hsa04062"

Finally, the pathview() function in the pathview package makes the plots. Let’s write a function so we can loop through and draw plots for the top 5 pathways we created above.

# Define plotting function for applying later
plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="hsa", new.signature=FALSE)

# plot multiple pathways (plots saved to disk and returns a throwaway list object)
tmp = sapply(keggresids, plot_pathway)

Here are the plots:

Gene Ontology (GO)

We can also do a similar procedure with gene ontology. Similar to above, go.sets.hs has all GO terms. go.subs.hs is a named list containing indexes for the BP, CC, and MF ontologies. Let’s only do Biological Process.

data(go.sets.hs)
data(go.subs.hs)
gobpsets = go.sets.hs[go.subs.hs$BP]

gobpres = gage(foldchanges, gsets=gobpsets, same.dir=TRUE)

lapply(gobpres, head)

## $greater
##                                                             p.geomean
## GO:0007156 homophilic cell adhesion                      3.914568e-05
## GO:0008285 negative regulation of cell proliferation     2.907332e-04
## GO:0016339 calcium-dependent cell-cell adhesion          4.218753e-04
## GO:0016337 cell-cell adhesion                            6.170551e-04
## GO:0048729 tissue morphogenesis                          6.581460e-04
## GO:1901617 organic hydroxy compound biosynthetic process 8.876161e-04
##                                                          stat.mean
## GO:0007156 homophilic cell adhesion                       4.017207
## GO:0008285 negative regulation of cell proliferation      3.453345
## GO:0016339 calcium-dependent cell-cell adhesion           3.543891
## GO:0016337 cell-cell adhesion                             3.244296
## GO:0048729 tissue morphogenesis                           3.223979
## GO:1901617 organic hydroxy compound biosynthetic process  3.157421
##                                                                 p.val
## GO:0007156 homophilic cell adhesion                      3.914568e-05
## GO:0008285 negative regulation of cell proliferation     2.907332e-04
## GO:0016339 calcium-dependent cell-cell adhesion          4.218753e-04
## GO:0016337 cell-cell adhesion                            6.170551e-04
## GO:0048729 tissue morphogenesis                          6.581460e-04
## GO:1901617 organic hydroxy compound biosynthetic process 8.876161e-04
##                                                              q.val
## GO:0007156 homophilic cell adhesion                      0.1613977
## GO:0008285 negative regulation of cell proliferation     0.4720349
## GO:0016339 calcium-dependent cell-cell adhesion          0.4720349
## GO:0016337 cell-cell adhesion                            0.4720349
## GO:0048729 tissue morphogenesis                          0.4720349
## GO:1901617 organic hydroxy compound biosynthetic process 0.4720349
##                                                          set.size
## GO:0007156 homophilic cell adhesion                           124
## GO:0008285 negative regulation of cell proliferation          458
## GO:0016339 calcium-dependent cell-cell adhesion                27
## GO:0016337 cell-cell adhesion                                 355
## GO:0048729 tissue morphogenesis                               429
## GO:1901617 organic hydroxy compound biosynthetic process      141
##                                                                  exp1
## GO:0007156 homophilic cell adhesion                      3.914568e-05
## GO:0008285 negative regulation of cell proliferation     2.907332e-04
## GO:0016339 calcium-dependent cell-cell adhesion          4.218753e-04
## GO:0016337 cell-cell adhesion                            6.170551e-04
## GO:0048729 tissue morphogenesis                          6.581460e-04
## GO:1901617 organic hydroxy compound biosynthetic process 8.876161e-04
## 
## $less
##                                             p.geomean stat.mean
## GO:0048285 organelle fission             4.411540e-18 -8.850004
## GO:0000280 nuclear division              7.459684e-18 -8.805564
## GO:0007067 mitosis                       7.459684e-18 -8.805564
## GO:0000087 M phase of mitotic cell cycle 2.286444e-17 -8.655644
## GO:0007059 chromosome segregation        1.872901e-13 -7.686883
## GO:0051301 cell division                 5.841375e-12 -6.887763
##                                                 p.val        q.val
## GO:0048285 organelle fission             4.411540e-18 1.025209e-14
## GO:0000280 nuclear division              7.459684e-18 1.025209e-14
## GO:0007067 mitosis                       7.459684e-18 1.025209e-14
## GO:0000087 M phase of mitotic cell cycle 2.286444e-17 2.356752e-14
## GO:0007059 chromosome segregation        1.872901e-13 1.544394e-10
## GO:0051301 cell division                 5.841375e-12 4.013998e-09
##                                          set.size         exp1
## GO:0048285 organelle fission                  376 4.411540e-18
## GO:0000280 nuclear division                   352 7.459684e-18
## GO:0007067 mitosis                            352 7.459684e-18
## GO:0000087 M phase of mitotic cell cycle      362 2.286444e-17
## GO:0007059 chromosome segregation             141 1.872901e-13
## GO:0051301 cell division                      462 5.841375e-12
## 
## $stats
##                                                          stat.mean
## GO:0007156 homophilic cell adhesion                       4.017207
## GO:0008285 negative regulation of cell proliferation      3.453345
## GO:0016339 calcium-dependent cell-cell adhesion           3.543891
## GO:0016337 cell-cell adhesion                             3.244296
## GO:0048729 tissue morphogenesis                           3.223979
## GO:1901617 organic hydroxy compound biosynthetic process  3.157421
##                                                              exp1
## GO:0007156 homophilic cell adhesion                      4.017207
## GO:0008285 negative regulation of cell proliferation     3.453345
## GO:0016339 calcium-dependent cell-cell adhesion          3.543891
## GO:0016337 cell-cell adhesion                            3.244296
## GO:0048729 tissue morphogenesis                          3.223979
## GO:1901617 organic hydroxy compound biosynthetic process 3.157421
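
The same post-processing used for the KEGG results can be applied here. Below is a rough sketch that pulls out the significantly downregulated GO terms and their IDs. It uses a toy stand-in for gobpres$less built from the q-values printed above, and the 0.05 cutoff is my assumption rather than anything gage prescribes.

```r
# Toy stand-in for gobpres$less, using values printed above
less <- data.frame(
  q.val    = c(1.025209e-14, 1.025209e-14, 1.025209e-14,
               2.356752e-14, 1.544394e-10, 4.013998e-09),
  set.size = c(376, 352, 352, 362, 141, 462),
  row.names = c("GO:0048285 organelle fission",
                "GO:0000280 nuclear division",
                "GO:0007067 mitosis",
                "GO:0000087 M phase of mitotic cell cycle",
                "GO:0007059 chromosome segregation",
                "GO:0051301 cell division")
)

# Keep terms passing an (assumed) q-value threshold of 0.05
sig <- less[!is.na(less$q.val) & less$q.val < 0.05, , drop = FALSE]

# GO IDs are "GO:" plus seven digits, i.e. the first 10 characters
go_ids <- substr(rownames(sig), start = 1, stop = 10)
go_ids
```

With the real results you would start from data.frame(gobpres$less) instead of the toy table.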

Annotables: R data package for annotating/converting Gene IDs

2015-11-13T09:54:00.000-06:00

I work with gene lists on a nearly daily basis. Lists of genes near ChIP-seq peaks, lists of genes closest to a GWAS hit, lists of differentially expressed genes or transcripts from an RNA-seq experiment, lists of genes involved in certain pathways, etc. And lots of times I’ll need to convert these gene IDs from one identifier to another. There’s no shortage of tools to do this. I use Ensembl Biomart. But I do this so often that I got tired of hammering Ensembl’s servers whenever I wanted to convert from Ensembl to Entrez gene IDs for pathway mapping, get the chromosomal location for some BEDTools-y kinds of genomic arithmetic, or get the gene symbol and full description for reporting. So I used Biomart to retrieve the data that I use most often, cleaned up the column names, and saved this data as an R data package called annotables.

This package has basic annotation information from Ensembl release 82 for:

Human (grch38)
Mouse (grcm38)
Rat (rnor6)
Chicken (galgal4)
Worm (wbcel235)
Fly (bdgp6)

Where each table contains:

ensgene: Ensembl gene ID
entrez: Entrez gene ID
symbol: Gene symbol
chr: Chromosome
start: Start
end: End
strand: Strand
biotype: Protein coding, pseudogene, mitochondrial tRNA, etc.
description: Full gene name/description.

Additionally, there are tables for human and mouse (grch38_gt and grcm38_gt, respectively) that link Ensembl gene IDs to Ensembl transcript IDs.

Usage

The package isn’t on CRAN, so you’ll need devtools to install it.

# If you haven't already installed devtools...
install.packages("devtools")

# Use devtools to install the package
devtools::install_github("stephenturner/annotables")

It isn’t necessary to load dplyr, but the tables are tbl_df and will print nicely if you have dplyr loaded.

library(dplyr)
library(annotables)

Look at the human genes table (note the description column gets cut off because the table becomes too wide to print nicely):

grch38

## Source: local data frame [66,531 x 9]
## 
##            ensgene entrez  symbol   chr start   end strand        biotype
##              (chr)  (int)   (chr) (chr) (int) (int)  (int)          (chr)
## 1  ENSG00000210049     NA   MT-TF    MT   577   647      1        Mt_tRNA
## 2  ENSG00000211459     NA MT-RNR1    MT   648  1601      1        Mt_rRNA
## 3  ENSG00000210077     NA   MT-TV    MT  1602  1670      1        Mt_tRNA
## 4  ENSG00000210082     NA MT-RNR2    MT  1671  3229      1        Mt_rRNA
## 5  ENSG00000209082     NA  MT-TL1    MT  3230  3304      1        Mt_tRNA
## 6  ENSG00000198888   4535  MT-ND1    MT  3307  4262      1 protein_coding
## 7  ENSG00000210100     NA   MT-TI    MT  4263  4331      1        Mt_tRNA
## 8  ENSG00000210107     NA   MT-TQ    MT  4329  4400     -1        Mt_tRNA
## 9  ENSG00000210112     NA   MT-TM    MT  4402  4469      1        Mt_tRNA
## 10 ENSG00000198763   4536  MT-ND2    MT  4470  5511      1 protein_coding
## ..             ...    ...     ...   ...   ...   ...    ...            ...
## Variables not shown: description (chr)

Look at the human genes-to-transcripts table:

grch38_gt

## Source: local data frame [216,133 x 2]
## 
##            ensgene          enstxp
##              (chr)           (chr)
## 1  ENSG00000210049 ENST00000387314
## 2  ENSG00000211459 ENST00000389680
## 3  ENSG00000210077 ENST00000387342
## 4  ENSG00000210082 ENST00000387347
## 5  ENSG00000209082 ENST00000386347
## 6  ENSG00000198888 ENST00000361390
## 7  ENSG00000210100 ENST00000387365
## 8  ENSG00000210107 ENST00000387372
## 9  ENSG00000210112 ENST00000387377
## 10 ENSG00000198763 ENST00000361453
## ..             ...             ...

Tables are tbl_df, pipe-able with dplyr:

grch38 %>% 
  filter(biotype=="protein_coding" & chr=="1") %>% 
  select(ensgene, symbol, chr, start, end, description) %>% 
  head %>% 
  pander::pandoc.table(split.table=100, justify="llllll", style="rmarkdown")

ensgene	symbol	chr	start	end
ENSG00000158014	SLC30A2	1	26037252	26046133
ENSG00000173673	HES3	1	6244192	6245578
ENSG00000243749	ZMYM6NB	1	34981535	34985353
ENSG00000189410	SH2D5	1	20719732	20732837
ENSG00000116863	ADPRHL2	1	36088875	36093932
ENSG00000188643	S100A16	1	153606886	153613145

Table: Table continues below

description
solute carrier family 30 (zinc transporter), member 2 [Source:HGNC Symbol;Acc:HGNC:11013]
hes family bHLH transcription factor 3 [Source:HGNC Symbol;Acc:HGNC:26226]
ZMYM6 neighbor [Source:HGNC Symbol;Acc:HGNC:40021]
SH2 domain containing 5 [Source:HGNC Symbol;Acc:HGNC:28819]
ADP-ribosylhydrolase like 2 [Source:HGNC Symbol;Acc:HGNC:21304]
S100 calcium binding protein A16 [Source:HGNC Symbol;Acc:HGNC:20441]
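
The ID conversion that motivated the package can be sketched in base R too. The example below uses a three-row toy table with the same column names as grch38 (rows copied from the printout above) so it runs without the package installed. The my_ids vector, including its deliberately unmatched last entry, is hypothetical.

```r
# Toy table with the same column names as grch38
genes <- data.frame(
  ensgene = c("ENSG00000210049", "ENSG00000198888", "ENSG00000198763"),
  entrez  = c(NA, 4535L, 4536L),
  symbol  = c("MT-TF", "MT-ND1", "MT-ND2"),
  stringsAsFactors = FALSE
)

# Hypothetical gene list to convert; the last ID is made up and has no match
my_ids <- c("ENSG00000198763", "ENSG00000198888", "ENSG00000000000")

# match() returns the row of each query ID, or NA when it is absent
idx <- match(my_ids, genes$ensgene)
converted <- data.frame(
  ensgene = my_ids,
  entrez  = genes$entrez[idx],
  symbol  = genes$symbol[idx],
  stringsAsFactors = FALSE
)
converted
```

Unlike an inner join, this keeps unmatched IDs in the result with NA annotations, which makes it easy to see what failed to convert.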

Example with RNA-seq data

Here’s an example with RNA-seq data. Specifically, DESeq2 results from the airway package, made tidy with biobroom:

# Load libraries (install with Bioconductor if you don't have them)
library(DESeq2)
library(airway)

# Load the data and do the RNA-seq data analysis
data(airway)
airway = DESeqDataSet(airway, design = ~cell + dex)
airway = DESeq(airway)
res = results(airway)

# tidy results with biobroom
library(biobroom)
res_tidy = tidy.DESeqResults(res)
head(res_tidy)

## Source: local data frame [6 x 7]
## 
##              gene    baseMean    estimate   stderror  statistic
##             (chr)       (dbl)       (dbl)      (dbl)      (dbl)
## 1 ENSG00000000003 708.6021697  0.37424998 0.09873107  3.7906000
## 2 ENSG00000000005   0.0000000          NA         NA         NA
## 3 ENSG00000000419 520.2979006 -0.20215550 0.10929899 -1.8495642
## 4 ENSG00000000457 237.1630368 -0.03624826 0.13684258 -0.2648902
## 5 ENSG00000000460  57.9326331  0.08523370 0.24654400  0.3457140
## 6 ENSG00000000938   0.3180984  0.11555962 0.14630523  0.7898530
## Variables not shown: p.value (dbl), p.adjusted (dbl)

Now, make a table with the results (unfortunately, it’ll be split in this display, but you can write this to file to see all the columns in a single row):

res_tidy %>% 
  arrange(p.adjusted) %>% 
  head(20) %>% 
  inner_join(grch38, by=c("gene"="ensgene")) %>% 
  select(gene, estimate, p.adjusted, symbol, description) %>% 
  pander::pandoc.table(split.table=100, justify="lrrll", style="rmarkdown")

gene	estimate	p.adjusted	symbol
ENSG00000152583	-4.316	4.753e-134	SPARCL1
ENSG00000165995	-3.189	1.44e-133	CACNB2
ENSG00000101347	-3.618	6.619e-125	SAMHD1
ENSG00000120129	-2.871	6.619e-125	DUSP1
ENSG00000189221	-3.231	9.468e-119	MAOA
ENSG00000211445	-3.553	3.94e-107	GPX3
ENSG00000157214	-1.949	8.74e-102	STEAP2
ENSG00000162614	-2.003	3.052e-98	NEXN
ENSG00000125148	-2.167	1.783e-92	MT2A
ENSG00000154734	-2.286	4.522e-86	ADAMTS1
ENSG00000139132	-2.181	2.501e-83	FGD4
ENSG00000162493	-1.858	4.215e-83	PDPN
ENSG00000162692	3.453	3.563e-82	VCAM1
ENSG00000179094	-3.044	1.199e-81	PER1
ENSG00000134243	-2.149	2.73e-81	SORT1
ENSG00000163884	-4.079	1.073e-80	KLF15
ENSG00000178695	2.446	6.275e-75	KCTD12
ENSG00000146250	2.64	1.143e-69	PRSS35
ENSG00000198624	-2.784	1.707e-69	CCDC69
ENSG00000148848	1.783	1.762e-69	ADAM12

Table: Table continues below

description
SPARC-like 1 (hevin) [Source:HGNC Symbol;Acc:HGNC:11220]
calcium channel, voltage-dependent, beta 2 subunit [Source:HGNC Symbol;Acc:HGNC:1402]
SAM domain and HD domain 1 [Source:HGNC Symbol;Acc:HGNC:15925]
dual specificity phosphatase 1 [Source:HGNC Symbol;Acc:HGNC:3064]
monoamine oxidase A [Source:HGNC Symbol;Acc:HGNC:6833]
glutathione peroxidase 3 [Source:HGNC Symbol;Acc:HGNC:4555]
STEAP family member 2, metalloreductase [Source:HGNC Symbol;Acc:HGNC:17885]
nexilin (F actin binding protein) [Source:HGNC Symbol;Acc:HGNC:29557]
metallothionein 2A [Source:HGNC Symbol;Acc:HGNC:7406]
ADAM metallopeptidase with thrombospondin type 1 motif, 1 [Source:HGNC Symbol;Acc:HGNC:217]
FYVE, RhoGEF and PH domain containing 4 [Source:HGNC Symbol;Acc:HGNC:19125]
podoplanin [Source:HGNC Symbol;Acc:HGNC:29602]
vascular cell adhesion molecule 1 [Source:HGNC Symbol;Acc:HGNC:12663]
period circadian clock 1 [Source:HGNC Symbol;Acc:HGNC:8845]
sortilin 1 [Source:HGNC Symbol;Acc:HGNC:11186]
Kruppel-like factor 15 [Source:HGNC Symbol;Acc:HGNC:14536]
potassium channel tetramerization domain containing 12 [Source:HGNC Symbol;Acc:HGNC:14678]
protease, serine, 35 [Source:HGNC Symbol;Acc:HGNC:21387]
coiled-coil domain containing 69 [Source:HGNC Symbol;Acc:HGNC:24487]
ADAM metallopeptidase domain 12 [Source:HGNC Symbol;Acc:HGNC:190]
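
Writing the joined table to a file avoids the split display entirely. Here is a small sketch, assuming a CSV in a temporary location is acceptable. The top_hits data frame is a two-row toy copy of the table above, not the full result.

```r
# Two rows copied from the annotated results above
top_hits <- data.frame(
  gene        = c("ENSG00000152583", "ENSG00000165995"),
  estimate    = c(-4.316, -3.189),
  symbol      = c("SPARCL1", "CACNB2"),
  description = c("SPARC-like 1 (hevin) [Source:HGNC Symbol;Acc:HGNC:11220]",
                  "calcium channel, voltage-dependent, beta 2 subunit [Source:HGNC Symbol;Acc:HGNC:1402]"),
  stringsAsFactors = FALSE
)

# write.csv() quotes strings, so commas inside descriptions are safe
outfile <- tempfile(fileext = ".csv")
write.csv(top_hits, outfile, row.names = FALSE)

# Read it back to confirm every column landed in a single row per gene
roundtrip <- read.csv(outfile, stringsAsFactors = FALSE)
roundtrip
```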

Explore!

This data can also be used for toying around with dplyr verbs and generally getting a sense of what’s in here. First, get some help.

ls("package:annotables")
?grch38

Let’s join the transcript table to the gene table.

gt = grch38_gt %>% 
  inner_join(grch38, by="ensgene")

Now, let’s chain several dplyr verbs: filter to keep only protein-coding genes, group by the Ensembl gene ID, and summarize to count how many transcripts are in each gene. Then inner join that result back to the original gene list so we can select only the gene, number of transcripts, symbol, and description; mutate the description column so that it isn’t so wide that it’ll break the display; arrange the returned data descending by the number of transcripts per gene; and head to get the top 10 results. Optionally, pipe that to further utilities to output a nice HTML table.

gt %>% 
  filter(biotype=="protein_coding") %>% 
  group_by(ensgene) %>% 
  summarize(ntxps=n_distinct(enstxp)) %>% 
  inner_join(grch38, by="ensgene") %>% 
  select(ensgene, ntxps, symbol, description) %>% 
  mutate(description=substr(description, 1, 20)) %>% 
  arrange(desc(ntxps)) %>% 
  head(10) %>% 
  pander::pandoc.table(split.table=100, justify="lrll", style="rmarkdown")

ensgene	ntxps	symbol	description
ENSG00000165795	77	NDRG2	NDRG family member 2
ENSG00000205336	77	ADGRG1	adhesion G protein-c
ENSG00000196628	75	TCF4	transcription factor
ENSG00000161249	68	DMKN	dermokine [Source:HG
ENSG00000154556	64	SORBS2	sorbin and SH3 domai
ENSG00000166444	62	ST5	suppression of tumor
ENSG00000204580	58	DDR1	discoidin domain rec
ENSG00000087460	57	GNAS	GNAS complex locus [
ENSG00000169398	57	PTK2	protein tyrosine kin
ENSG00000104529	56	EEF1D	eukaryotic translati

Let’s look up DMKN (dermokine) in Ensembl. Search Ensembl for ENSG00000161249, or use this direct link. You can browse the table or graphic to see the splicing complexity in this gene.

Or, let’s do something different. Let’s group the data by what type of gene it is (e.g., protein coding, pseudogene, etc), get the number of genes in each category, and plot the top 20.

library(ggplot2)
grch38 %>% 
  group_by(biotype) %>% 
  summarize(n=n_distinct(ensgene)) %>% 
  arrange(desc(n)) %>% 
  head(20) %>% 
  ggplot(aes(reorder(biotype, n), n)) + 
  geom_bar(stat="identity") + 
  xlab("Type") + 
  theme_bw() + 
  coord_flip()

Software from CSHL Genome Informatics 2015

2015-11-02T08:50:00.001-06:00

I just returned from the Genome Informatics meeting at Cold Spring Harbor. This was, hands down, the best scientific conference I've been to in years. The quality of the talks and posters was excellent, and it was great meeting in person many of the scientists and developers whose tools and software I use on a daily basis. To get a sense of what the meeting was about, 140 characters at a time, you can access all the Tweets sent Oct 28-31 2015 tagged #gi2015 at this link.

Below is a very short list of software that was presented at GI2015. This is only a tiny slice of the tools and methods that were presented at the meeting, and the list is highly biased toward tools that I personally find interesting or useful to my own work (please don't be offended if I omitted your stuff, and feel free to mention it in the comments).

Monocle: Software for analyzing single-cell RNA-seq data
Paper: http://www.nature.com/nbt/journal/v32/n4/full/nbt.2859.html
Software: http://cole-trapnell-lab.github.io/monocle-release/

Kallisto: very fast RNA-seq transcript abundance estimation using pseudoalignment.
Preprint: http://arxiv.org/abs/1505.02710
Software: http://pachterlab.github.io/kallisto/about.html

Sleuth: R package for analyzing & reporting differential expression analysis from transcript abundances estimated with Kallisto.
Preprint: coming soon?
Software: http://pachterlab.github.io/sleuth/about.html
See also: The bear's lair (http://lair.berkeley.edu/): reanalysis of published RNA-seq studies using kallisto+sleuth.

QoRTs: Quality of RNA-Seq Toolset. Toolkit for QC, gene/junction counting, and other miscellaneous downstream processing from RNA-seq alignments.

Software: https://github.com/hartleys/QoRTs

Paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4506620/

JunctionSeq: R package for testing differential junction usage with RNA-seq data.

Software: https://github.com/hartleys/JunctionSeq

Vignette: http://hartleys.github.io/JunctionSeq/doc/JunctionSeq.pdf

HISAT2: RNA-seq alignment against populations of genomes (aligns DNA also).
Software: http://ccb.jhu.edu/software/hisat2/index.shtml

Rail: software for aligning many-sample RNA-seq data, producing alignments, genome coverage bigWigs, and splice junction BED files.
Software: http://rail.bio
Preprint: http://biorxiv.org/content/early/2015/08/11/019067

LobSTR: genotype short tandem repeats from NGS data.
Software: http://melissagymrek.com/lobstr-code/
Paper: http://www.ncbi.nlm.nih.gov/pubmed/22522390

Basset: convolutional neural networks for learning functional/regulatory features of DNA sequence.
Software: https://github.com/davek44/Basset
Preprint: http://biorxiv.org/content/early/2015/10/05/028399

Genotype Query Tools (GQT): fast/efficient individual-level queries of large-scale variation data.
Software: https://github.com/ryanlayer/gqt
Preprint: http://biorxiv.org/content/early/2015/06/05/018259

Centrifuge: a metagenomics classifier.
Software: https://github.com/infphilo/centrifuge
Poster: http://www.ccb.jhu.edu/people/infphilo/data/Centrifuge-poster.pdf

Mash: MinHash-based method for rapidly estimating pairwise distances between genomes or metagenomes.
Software: https://github.com/marbl/Mash
Docs: http://mash.readthedocs.org/en/latest/
Preprint: http://biorxiv.org/content/early/2015/10/26/029827

VCFanno: ultrafast large-sample VCF annotation
Software: https://github.com/brentp/vcfanno

Ginkgo: Interactive analysis and assessment of single-cell copy-number variations
Paper: http://www.nature.com/nmeth/journal/v12/n11/full/nmeth.3578.html
Software: https://github.com/robertaboukhalil/ginkgo

StringTie: RNA-seq transcript assembly+quantification, with or without a reference. See paper for comparison to existing tools.
Software: http://ccb.jhu.edu/software/stringtie/
Source: https://github.com/gpertea/stringtie
Poster: http://ccb.jhu.edu/software/stringtie/cshl2015.pdf
Paper: http://www.nature.com/nbt/journal/v33/n3/full/nbt.3122.html

Compiling RMarkdown from a Helper R Script

2015-08-06T11:17:00.000-05:00

The problem

I was looking for a way to compile an RMarkdown document and have the filename of the resulting PDF or HTML document contain the name of the input data that it processed. That is, if I compiled the analysis.Rmd file, where in that file it did some analysis and reporting on data001.txt, I’d want the resulting filename to look something like data001.txt.analysis.html. Or even better, to stick in a timestamp with the date, so if the analysis was compiled today, August 6 2015, the resulting filename would be data001.txt.2015-08-06.html. I also wanted to implement the entire solution in R, not relying on fiddly makefiles or scripts that may behave differently depending on the OS/environment.

I found a near-solution as described on this SO post and detailed on this follow-up blog post, but neither really addressed my problem.

The solution

The simplest solution I could come up with involved creating two files:

A .Rmd file that would actually do all the analysis and generate the compiled report.
A second .R script to be used as a config file. Here you’d specify the input data (and potentially other analysis parameters).

When you call rmarkdown::render() from an R script, the environment in which the code chunks are evaluated during knitting defaults to parent.frame(), so anything you define in the .R config file will get passed on to the .Rmd that is to be compiled.
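
To see why this works, here is a toy illustration of the parent.frame() default in base R. The render_sketch() function is a hypothetical stand-in I made up for this purpose. It is not part of rmarkdown, and it only evaluates a single expression rather than knitting anything.

```r
# Hypothetical stand-in for render(): evaluates an expression in the
# caller's environment unless another environment is supplied
render_sketch <- function(expr, envir = parent.frame()) {
  force(envir)  # resolve the default before doing anything else
  eval(expr, envir = envir)
}

# Defined in the "config script", visible to the evaluated code by default
infile <- "data001.txt"
seen_by_default <- render_sketch(quote(infile))

# Supplying an explicit environment isolates the evaluated code instead
isolated <- new.env()
assign("infile", "data002.txt", envir = isolated)
seen_isolated <- render_sketch(quote(infile), envir = isolated)

c(seen_by_default, seen_isolated)
```

rmarkdown::render() exposes the same choice through its envir argument, so passing envir = new.env() is one way to keep a report from picking up stray objects in your workspace.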

Here’s what it looks like in practice.

First, the analysis.Rmd file that actually runs the analysis:

 ---
 title: "Analysis Markdown document"
 author: "Stephen Turner"
 date: "August 6, 2015"
 output: html_document
 ---

 This is the Rmarkdown document that runs the analysis.
 Some narrative text goes here. 
 Maybe we'll do some analysis here. The `infile` variable is passed 
 in from the config script. You could pass in other variables too.

 ```{r}
 # check that you defined infile from the config and that 
 # the file actually exists in the current directory
 stopifnot(exists("infile"))

 stopifnot(file.exists(infile))

 # read in the data
 x = read.table(infile)

 # do some stuff, make a plot, etc.
 result = mean(x$value)
 hist(x$value)
 ```

 Here is some conclusion narrative text. Maybe show some notes:

 - Input file used for this report: `r infile`
 - This report was compiled: `r Sys.Date()`
 - The mean of the `value` column is: `r result`

 Also, never forget to show your...

 ```{r}
 sessionInfo()
 ```

And the config.R helper script:

#-------- define the input filename --------#
infile = "data001.txt"
#----- Now just hit the source button! -----#

# check that the input file actually exists!
stopifnot(file.exists(infile))

# create the output filename
outfile = paste(infile, Sys.Date(), "analysis.html", sep=".")

# compile the document
rmarkdown::render(input="analysis.Rmd", output_file=outfile)

All I’d need to do now is open up the config.R script, edit the infile variable, and hit the source button in RStudio. This runs the analysis.Rmd as shown above for the input (data001.txt in this example) and saves the resulting compiled report as data001.txt.2015-08-06.analysis.html.
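
If you end up doing this for many inputs, the filename logic could be pulled into a small helper. This is only a sketch: make_outfile() and its arguments are names I invented, and the use of basename() (so that inputs living in subdirectories still give a flat output name) is a design choice the original script does not make.

```r
# Hypothetical helper: build "<input>.<date>.<suffix>" from an input path
make_outfile <- function(infile, date = Sys.Date(), suffix = "analysis.html") {
  paste(basename(infile), format(date, "%Y-%m-%d"), suffix, sep = ".")
}

# Fixed date so the result is reproducible
make_outfile("data001.txt", as.Date("2015-08-06"))

# An input in a subdirectory (hypothetical path) keeps only its file name
make_outfile("data/run2/data002.txt", as.Date("2015-08-06"))
```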

(Crosspost at RPubs).

R: single plot with two different y-axes

2015-04-21T08:23:00.000-05:00

I forgot where I originally found the code to do this, but I recently had to dig it out again to remind myself how to draw two different y axes on the same plot to show the values of two different features of the data. This is somewhat distinct from the typical use case of aesthetic mappings in ggplot2 where I want to have different lines/points/colors/etc. for the same feature across multiple subsets of data.

For example, I was recently poking around with some data examining enrichment of a particular set of genes using a hypergeometric test as I was fiddling around with other parameters that included more genes in the selection (i.e., in the classic example, the number of balls drawn from some hypothetical urn). I wanted to show the -log10(p-value) on one axis and some other value (e.g., “n”) on the same plot, using a different axis on the right side of the plot.
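
For context, the kind of p-value I'm describing can be sketched with phyper() from base R. The urn below is entirely hypothetical (10 genes, 5 of them in the set of interest), and it assumes every selected gene happens to fall in the set, which is the most extreme case.

```r
# Hypothetical urn: 5 genes of interest among 10 genes total
m <- 5     # "white balls": genes in the set of interest
n <- 5     # "black balls": all other genes
k <- 1:5   # number of genes selected (balls drawn)
q <- k     # assume every selected gene is in the set of interest

# Enrichment p-value is the upper tail P(X >= q);
# with lower.tail = FALSE, phyper() returns P(X > q - 1)
p <- phyper(q - 1, m, n, k, lower.tail = FALSE)
logp <- -log10(p)

data.frame(k, p, logp)
```

Subtracting 1 from q matters: without it you get P(X > q), which excludes the observed overlap and understates the p-value.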

Here’s how to do it. First, generate some data:

set.seed(2015-04-13)

d = data.frame(x =seq(1,10),
           n = c(0,0,1,2,3,4,4,5,6,6),
           logp = signif(-log10(runif(10)), 2))

x	n	logp
1	0	1.400
2	0	0.590
3	1	1.200
4	2	1.500
5	3	0.028
6	4	0.380
7	4	2.500
8	5	0.067
9	6	0.041
10	6	0.360
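
One detail about the seed above that is easy to miss: R reads 2015-04-13 as arithmetic, not as a date, so the value handed to set.seed() is just a number. A two-line check (the seed variable name is mine):

```r
# R parses this as 2015 minus 4 minus 13
seed <- 2015-04-13
seed
```

It still makes the example reproducible, but writing the seed as a plain integer would make the intent clearer.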

The strategy here is to first draw one of the plots, then draw another plot on top of the first one, and manually add in an axis. So let’s draw the first plot, but leave some room on the right hand side to draw an axis later on. I’m drawing a red line plot showing the p-value as it changes over values of x.

par(mar = c(5,5,2,5))
with(d, plot(x, logp, type="l", col="red3", 
             ylab=expression(-log[10](italic(p))),
             ylim=c(0,3)))

Now, draw the second plot on top of the first using the par(new=T) call. Draw the plot, but don’t include an axis yet. Put the axis on the right side (axis(...)), and add text to the margin (mtext...). Finally, add a legend.

par(new = T)
with(d, plot(x, n, pch=16, axes=F, xlab=NA, ylab=NA, cex=1.2))
axis(side = 4)
mtext(side = 4, line = 3, 'Number genes selected')
legend("topleft",
       legend=c(expression(-log[10](italic(p))), "N genes"),
       lty=c(1,0), pch=c(NA, 16), col=c("red3", "black"))
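
If you need this more than once, the steps above can be wrapped in a function. The sketch below is one way to do it. plot_two_axes() is a name I made up, the data frame is retyped from the table above, and writing to a temporary PDF is only there so the example runs without an interactive graphics device.

```r
# Hypothetical wrapper around the two-step plotting strategy
plot_two_axes <- function(d) {
  # Widen the right margin, and restore the old setting when done
  old <- par(mar = c(5, 5, 2, 5))
  on.exit(par(old), add = TRUE)

  # First plot: -log10(p) as a red line with the usual left axis
  with(d, plot(x, logp, type = "l", col = "red3",
               ylab = expression(-log[10](italic(p))), ylim = c(0, 3)))

  # Second plot drawn over the first, with its own axis on the right
  par(new = TRUE)
  with(d, plot(x, n, pch = 16, axes = FALSE, xlab = NA, ylab = NA, cex = 1.2))
  axis(side = 4)
  mtext(side = 4, line = 3, "Number genes selected")
  invisible(NULL)
}

# Data retyped from the table above
d <- data.frame(x = 1:10,
                n = c(0, 0, 1, 2, 3, 4, 4, 5, 6, 6),
                logp = c(1.4, 0.59, 1.2, 1.5, 0.028, 0.38, 2.5, 0.067, 0.041, 0.36))

outfile <- tempfile(fileext = ".pdf")
pdf(outfile)
plot_two_axes(d)
invisible(dev.off())
```

The on.exit() call is the main addition: it keeps the wide right margin from leaking into whatever you plot next.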

Translational Bioinformatics Year In Review

2015-04-10T15:47:00.000-05:00

Per tradition, Russ Altman gave his "Translational Bioinformatics: The Year in Review" presentation at the close of the AMIA Joint Summit on Translational Bioinformatics in San Francisco on March 26th. This year, papers came from six key areas (and a final Odds and Ends category). His full slide deck is available here.

I always enjoy this talk because it routinely points me to new collections of data and new software tools that are useful for a variety of analyses; as such, I thought I would highlight these resources from his talk this year.

GRASP: analysis of genotype-phenotype results from 1390 genome-wide association studies and corresponding open access database
Some of you may have accessed the Johnson and O'Donnell catalog of GWAS results published in 2009. This data set was a more extensive collection of GWAS findings than the popular NHGRI GWAS catalog, as it did not impose a genome-wide significance threshold for reported associations. The GRASP database is a similar effort, reporting numerous attributes of each study.
A zip archive of the full data set (a flat file) is available here.

Effective diagnosis of genetic disease by computational phenotype analysis of the disease associated genome
This paper tackles the enormously complex task of diagnosing rare genetic diseases using a combination of genetic variants (from a VCF file), a list of phenotype characteristics (fed from the Human Phenotype Ontology), and a few other aspects of the disease.
The online tool called PhenIX is available here.

A network based method for analysis of lncRNA disease associations and prediction of lncRNAs implicated in diseases
Here, Yang et al. examine relationships between known long non-coding RNAs and disease using graph propagation. Their underlying database, however, was generated using PubMed mining along with some manual curation.
Their lncRNA-Disease database is available here.

SNPsea: an algorithm to identify cell types, tissues and pathways affected by risk loci
This tool is a type of SNP set enrichment, designed to specifically look at functional enrichment in the context of specific tissues and cell types. The tool is a C++ executable, available for download here.
The data sources underlying the SNPsea algorithm are available here.

Human symptoms-disease network
Here Zhou et al. systematically extract a symptom-to-disease network by exploiting MeSH annotations. They compiled a list of 322 symptoms and 4,442 diseases from the MeSH vocabulary, and document their occurrence within PubMed. Using this disease-symptom network, the authors explore the biological underpinnings of certain symptoms by looking at shared genomic elements between diseases with similar symptoms.
The full list of ~130,000 edges in their disease-symptom network is available here.

A circadian gene expression atlas in mammals: implications for biology and medicine
This fascinating paper explores the temporal impact on gene expression traits from 12 mouse organs. By systematically collecting transcriptome data from these tissues at two-hour intervals, the authors construct a temporal atlas of gene expression, and show that 43% of protein-coding genes have a circadian expression profile.
The accompanying CircaDB database is available online here.

dRiskKB: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text
The authors of dRiskKB use text mining across MEDLINE citations using a controlled disease vocabulary, in this case the Human Disease Ontology, to generate pairs of diseases that co-occur with specific patterns in abstract text. These pairs are ranked with a scoring algorithm and provide a new resource for disease co-morbidity relationships.
The flat file data driving dRiskKB can be found online here.

A tissue-based map of the human proteome
In this major effort, a group of investigators have published the most detailed atlas of human protein expression to date. The transcriptome has been extensively studied across human tissues, but it remains unclear to what extent transcriptional activity reflects translation into protein. But most importantly, the data are searchable via a beautiful website.
The underlying data from the Human Protein Atlas is available here.

R User Group Recap: Heatmaps and Using the caret Package

2015-04-10T10:01:00.002-05:00

At our most recent R user group meeting we were delighted to have presentations from Mark Lawson and Steve Hoang, both bioinformaticians at Hemoshear. All of the code used in both demos is in our Meetup’s GitHub repo.

Making heatmaps in R

Steve started with an overview of making heatmaps in R. Using the iris dataset, Steve demonstrated making heatmaps of the continuous iris data using the heatmap.2 function from the gplots package, the aheatmap function from NMF, and the hard way using ggplot2. The “best in class” method used aheatmap to draw an annotated heatmap plotting z-scores of columns and annotated rows instead of raw values, using the Pearson correlation instead of Euclidean distance as the distance metric.

library(dplyr)
library(NMF)
library(RColorBrewer)
iris2 = iris # prep iris data for plotting
rownames(iris2) = make.names(iris2$Species, unique = T)
iris2 = iris2 %>% select(-Species) %>% as.matrix()
aheatmap(iris2, color = "-RdBu:50", scale = "col", breaks = 0,
         annRow = iris["Species"], annColors = "Set2", 
         distfun = "pearson", treeheight=c(200, 50), 
         fontsize=13, cexCol=.7, 
         filename="heatmap.png", width=8, height=16)
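
To make the "z-scores plus Pearson distance" idea concrete, here is a base R sketch of the two ingredients. It is my approximation of the concept, not necessarily the exact computation aheatmap performs internally, and the variable names are mine.

```r
# Numeric part of the iris data as a matrix
x <- as.matrix(iris[, 1:4])

# Column z-scores: each measurement centered and scaled across flowers
z <- scale(x)

# Correlation-based distance between rows: 1 - Pearson correlation
row_dist <- as.dist(1 - cor(t(z)))

# Hierarchical clustering of the flowers on that distance
row_tree <- hclust(row_dist)
row_tree
```

With a correlation distance, two flowers are close when their profiles across the four measurements have the same shape, even if one is larger overall, which Euclidean distance would penalize.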

Classification and regression using caret

Mark wrapped up with a gentle introduction to the caret package for classification and regression training. This demonstration used the caret package to split data into training and testing sets, and run repeated cross-validation to train random forest and penalized logistic regression models for classifying Fisher’s iris data.

First, get a look at the data with the featurePlot function in the caret package:

library(caret)
set.seed(42)
data(iris)
featurePlot(x = iris[, 1:4],
            y = iris$Species,
            plot = "pairs",
            auto.key = list(columns = 3))

Next, after splitting the data into training and testing sets and using the caret package to automate training and testing both random forest and partial least squares models using repeated 10-fold cross-validation (see the code), it turns out random forest outperforms PLS in this case, and performs fairly well overall:

	setosa	versicolor	virginica
Sensitivity	1.00	1.00	0.00
Specificity	1.00	0.50	1.00
Pos Pred Value	1.00	0.50	NaN
Neg Pred Value	1.00	1.00	0.67
Prevalence	0.33	0.33	0.33
Detection Rate	0.33	0.33	0.00
Detection Prevalence	0.33	0.67	0.00
Balanced Accuracy	1.00	0.75	0.50
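
The rows of that table are all simple functions of a confusion matrix. The sketch below recomputes several of them in base R from a made-up 3 x 3 confusion matrix. The 10 test flowers per species are my assumption, not a number from the demo, and the counts were chosen so the rounded results agree with the table above. The per_class_stats() helper is hypothetical and not part of caret, although caret's confusionMatrix() reports the same kinds of statistics.

```r
classes <- c("setosa", "versicolor", "virginica")

# Hypothetical counts: rows are predicted classes, columns are true classes
cm <- matrix(c(10,  0,  0,
                0, 10, 10,
                0,  0,  0),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = classes, actual = classes))

per_class_stats <- function(cm) {
  n  <- sum(cm)
  tp <- diag(cm)            # predicted as the class and truly that class
  fp <- rowSums(cm) - tp    # predicted as the class but truly another
  fn <- colSums(cm) - tp    # truly the class but predicted as another
  tn <- n - tp - fp - fn
  sens <- tp / (tp + fn)
  spec <- tn / (tn + fp)
  rbind(
    "Sensitivity"       = sens,
    "Specificity"       = spec,
    "Pos Pred Value"    = tp / (tp + fp),  # NaN when a class is never predicted
    "Neg Pred Value"    = tn / (tn + fn),
    "Balanced Accuracy" = (sens + spec) / 2
  )
}

stats <- per_class_stats(cm)
round(stats, 2)
```

Balanced accuracy is just the average of sensitivity and specificity, so for versicolor it is (1.00 + 0.50) / 2 = 0.75.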

A big thanks to Mark and Steve at Hemoshear for putting this together!

Using and Abusing Data Visualization: Anscombe's Quartet and Cheating Bonferroni

2015-02-26T13:30:00.002-06:00

Anscombe’s quartet comprises four datasets that have nearly identical simple statistical properties, yet appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties.

Let’s load and view the data. There’s a built-in dataset, but I munged the data into a tidy format and included it in an R package that I wrote primarily for myself.

# If you don't have Tmisc installed, first install devtools, then install
# from github: install.packages('devtools')
# devtools::install_github('stephenturner/Tmisc')
library(Tmisc)
data(quartet)
str(quartet)

## 'data.frame':    44 obs. of  3 variables:
##  $ set: Factor w/ 4 levels "I","II","III",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ x  : int  10 8 13 9 11 14 6 4 12 7 ...
##  $ y  : num  8.04 6.95 7.58 8.81 8.33 ...

set	x	y
I	10	8.04
I	8	6.95
I	13	7.58
…	…	…
II	10	9.14
II	8	8.14
II	13	8.74
…	…	…
III	10	7.46
III	8	6.77
III	13	12.74
…	…	…
IV	8	6.58
IV	8	5.76
IV	8	7.71
…	…	…
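
If you’d rather not install Tmisc, something very close to quartet can be assembled from R’s built-in anscombe data frame, which stores the four sets side by side in columns x1 to x4 and y1 to y4. This is a sketch of one way to do the reshaping in base R, not the code used to build the packaged dataset. The quartet2 name is just for illustration, and column types may differ slightly from the str() output above.

```r
sets <- c("I", "II", "III", "IV")

# Stack the four (x, y) column pairs into one long data frame
quartet2 <- do.call(rbind, lapply(seq_along(sets), function(i) {
  data.frame(set = sets[i],
             x   = anscombe[[paste0("x", i)]],
             y   = anscombe[[paste0("y", i)]])
}))

# Keep the sets in I, II, III, IV order
quartet2$set <- factor(quartet2$set, levels = sets)

str(quartet2)
head(quartet2, 3)
```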

Now, let’s compute the mean and standard deviation of both x and y, and the correlation coefficient between x and y for each dataset.

library(dplyr)
quartet %>%
  group_by(set) %>%
  summarize(mean(x), sd(x), mean(y), sd(y), cor(x,y))

## Source: local data frame [4 x 6]
##
##   set mean(x) sd(x) mean(y) sd(y) cor(x, y)
## 1   I       9  3.32     7.5  2.03     0.816
## 2  II       9  3.32     7.5  2.03     0.816
## 3 III       9  3.32     7.5  2.03     0.816
## 4  IV       9  3.32     7.5  2.03     0.817

Looks like each dataset has essentially the same mean, standard deviation, and correlation coefficient between x and y.

Now, let’s plot y versus x for each set with a linear regression trendline displayed on each plot:

library(ggplot2)
p = ggplot(quartet, aes(x, y)) + geom_point()
p = p + geom_smooth(method = lm, se = FALSE)
p = p + facet_wrap(~set)
p
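
The fitted lines are nearly identical too. As a quick check that doesn’t need Tmisc, this sketch fits y ~ x separately for each set using R’s built-in anscombe data frame (columns x1 to x4 and y1 to y4) and collects the coefficients. The fits object is just an illustrative name.

```r
# Fit a separate simple linear regression for each of the four sets
fits <- sapply(1:4, function(i) {
  d <- data.frame(x = anscombe[[paste0("x", i)]],
                  y = anscombe[[paste0("y", i)]])
  coef(lm(y ~ x, data = d))
})
colnames(fits) <- c("I", "II", "III", "IV")

# Rows are the intercept and the slope; columns are the four sets
round(fits, 2)
```

All four sets give an intercept of roughly 3 and a slope of roughly 0.5, even though only the first set looks like a textbook linear relationship.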

This classic example really illustrates the importance of looking at your data, not just the summary statistics and model parameters you compute from it.

With that said, you can’t use data visualization to “cheat” your way into statistical significance. I recently had a collaborator who wanted some help automating a data visualization task so that she could decide which correlations to test. This is a terrible idea, and it’s going to get you in serious type I error trouble. To see what I mean, consider an experiment where you have a single outcome and lots of potential predictors to test individually. For example, some outcome and a bunch of SNPs or gene expression measurements. You can’t just visually inspect all those relationships then cherry-pick the ones you want to evaluate with a statistical hypothesis test, thinking that you’ve outsmarted your way around a painful multiple-testing correction.

Here’s a simple simulation showing why that doesn’t fly. In this example, I’m simulating 100 samples with a single outcome variable y and 64 different predictor variables, x. I might be interested in which x variable is associated with my y (e.g., which of my many gene expression measurements is associated with measured liver toxicity). But in this case, both x and y are random numbers. That is, I know for a fact the null hypothesis is true, because that’s what I’ve simulated. Now we can make a scatterplot for each predictor variable against our outcome, and look at that plot.

library(dplyr)
set.seed(42)
ndset = 64
n = 100
d = tibble(  # tibble() replaces the now-deprecated data_frame()
  set = factor(rep(1:ndset, each = n)),
  x = rnorm(n * ndset),
  y = rep(rnorm(n), ndset))
d

## Source: local data frame [6,400 x 3]
##
##    set       x       y
## 1    1  1.3710  1.2546
## 2    1 -0.5647  0.0936
## 3    1  0.3631 -0.0678
## 4    1  0.6329  0.2846
## 5    1  0.4043  1.0350
## 6    1 -0.1061 -2.1364
## 7    1  1.5115 -1.5967
## 8    1 -0.0947  0.7663
## 9    1  2.0184  1.8043
## 10   1 -0.0627 -0.1122
## .. ...     ...     ...

ggplot(d, aes(x, y)) + geom_point() + geom_smooth(method = lm) + facet_wrap(~set)

Now, if I were to go through this data and compute the p-value for the linear regression of y on each x, I’d get a roughly uniform distribution of p-values, my type I error would be where it should be, and my FDR- and Bonferroni-corrected p-values would be far from significant. This is what we expect: remember, the null hypothesis is true.

library(dplyr)
results = d %>%
  group_by(set) %>%
  do(mod = lm(y ~ x, data = .)) %>%
  summarize(set = set, p = anova(mod)$"Pr(>F)"[1]) %>%
  mutate(bon = p.adjust(p, method = "bonferroni")) %>%
  mutate(fdr = p.adjust(p, method = "fdr"))
results

## Source: local data frame [64 x 4]
##
##    set      p   bon   fdr
## 1    1 0.2738 1.000 0.749
## 2    2 0.2125 1.000 0.749
## 3    3 0.7650 1.000 0.900
## 4    4 0.2094 1.000 0.749
## 5    5 0.8073 1.000 0.900
## 6    6 0.0132 0.844 0.749
## 7    7 0.4277 1.000 0.820
## 8    8 0.7323 1.000 0.900
## 9    9 0.9323 1.000 0.932
## 10  10 0.1600 1.000 0.749
## .. ...    ...   ...   ...

library(qqman)
qq(results$p)
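
For reference, here is roughly what the two p.adjust() calls are doing, written out by hand on 64 simulated null p-values (drawn with runif(), which is my stand-in for the regression p-values above). Bonferroni multiplies each p-value by the number of tests and caps the result at 1. The "fdr" method is Benjamini-Hochberg, which scales each p-value by n over its rank and then enforces monotonicity.

```r
set.seed(1)
n <- 64
p <- runif(n)  # stand-in for 64 p-values under a true null

# Bonferroni: multiply by the number of tests, cap at 1
bon_manual <- pmin(1, p * n)

# Benjamini-Hochberg: walk from the largest p-value down,
# scale by n / rank, and take a running minimum
o  <- order(p, decreasing = TRUE)
ro <- order(o)  # permutation that restores the original order
fdr_manual <- pmin(1, cummin(n / (n:1) * p[o]))[ro]

# Both hand-rolled versions should agree with p.adjust()
all.equal(bon_manual, p.adjust(p, method = "bonferroni"))
all.equal(fdr_manual, p.adjust(p, method = "fdr"))
```

This is why set 6 above, with p = 0.0132, ends up with a Bonferroni-corrected value of about 0.84: it is simply the raw p-value times 64.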

BUT, if I were to look at those plots above and cherry-pick out which hypotheses to test based on how strong the correlation looks, my type I error will skyrocket. Looking at the plot above, it looks like the x variables 6, 28, 41, and 49 have a particularly strong correlation with my outcome, y. What happens if I try to do the statistical test on only those variables?

results %>% filter(set %in% c(6, 28, 41, 49))

## Source: local data frame [4 x 4]
##
##   set      p   bon   fdr
## 1   6 0.0132 0.844 0.749
## 2  28 0.0338 1.000 0.749
## 3  41 0.0624 1.000 0.749
## 4  49 0.0898 1.000 0.749

When I do that, my p-values for those four tests are all below 0.1, with two below 0.05 (and I'll say it again, the null hypothesis is true in this experiment, because I've simulated random data). In other words, my type I error is now completely out of control, with half of the cherry-picked tests coming up as false positives at the p<0.05 level. You'll notice that the Bonferroni and FDR-corrected p-values (correcting for all 64 tests) are still not significant.
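
To see how badly selection inflates the error rate, the sketch below repeats the whole null experiment many times. In each repeat it tests all 64 random predictors and then keeps only the smallest p-value, which is a crude stand-in for eyeballing the scatterplots and picking the strongest-looking panel. The 200 repeats and the picking rule are my assumptions, and cor.test() is used because its p-value matches the simple linear regression p-value for a single predictor.

```r
set.seed(42)
n     <- 100  # samples per experiment
ndset <- 64   # null predictors per experiment
nrep  <- 200  # number of simulated experiments (arbitrary choice)

# For each experiment, record the smallest of the 64 null p-values
min_p <- replicate(nrep, {
  y <- rnorm(n)
  p <- replicate(ndset, cor.test(rnorm(n), y)$p.value)
  min(p)
})

# How often the cherry-picked test looks "significant" without correction
mean(min_p < 0.05)

# Same test after a Bonferroni correction for all 64 comparisons
mean(min_p * ndset < 0.05)

# Chance of at least one p < 0.05 among 64 independent null tests
1 - 0.95^ndset
```

With 64 independent null tests, the chance that at least one falls below 0.05 is about 96%, so the best-looking panel will almost always pass an uncorrected test.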

The moral of the story here is to always look at your data, but don't "cheat" by choosing which statistical tests to perform based solely on that visualization exercise.

Microbial Genomics: the State of the Art in 2015

2015-02-04T11:19:00.000-06:00

Current Opinion in Microbiology recently published a special issue on genomics. In an excellent editorial overview, “Genomics: The era of genomically-enabled microbiology”, Neil Hall and Jay Hinton survey the state of the field in microbial genomics, summarize recent contributions, and give a great synopsis of each of the reviews in this issue. Hall and Hinton’s editorial overview goes into a little more depth, but here’s a rundown of the reviews in this special issue. There’s a lot of good stuff here!

Quantitative bacterial transcriptomics with RNA-seq (James Creecy and Tyrrell Conway) discusses RNA-seq in bacteria and how transcriptome analysis adds a wealth of annotation information to the genome.

One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly (Sergey Koren and Adam Phillippy) describes newer long-read sequencing technologies and their characteristics, discusses how microbial genomes can be easily and automatically finished using these methods for under $1,000, and discusses challenges for microbial and metagenome assembly.

Using comparative genomics to drive new discoveries in microbiology (Daniel Haft) describes progress using comparative genomics to make new discoveries, and takes the reader on a “bioinformatics journey” to describe a code-breaking exercise in comparative genomics that starts with weak hypotheses and uses genomics to fill in the biological picture.

Taking the pseudo out of pseudogenes (Ian Goodhead and Alistair Darby) reviews how pseudogenes are surprisingly prevalent, and discusses how problems with genome annotation can be addressed by combining multiple “omics” data.

Ten years of pan-genome analyses (George Vernikos et al.) describes how pan-genome analyses provide a framework for predicting and modeling genomic diversity, where the “core genome” of many bacterial species constitutes only a minority of genes.

Lateral gene transfers and the origins of the eukaryote proteome: a view from microbial parasites (Robert Hirt et al.) reviews the dynamic nature of lateral gene transfer, its role in microbial diversity, how it contributes to eukaryotic genomes, and how once again integrating different “omics” methodologies is needed to recognize the extent to which LGT affects eukaryotes.

The application of genomics to tracing bacterial pathogen transmission (Nicholas Croucher and Xavier Didelot) reviews how bacterial whole-genome sequencing gives you the ultimate resolution for investigating direct pathogen transmission, distinguishing transmission chains, and defining outbreaks. If you haven’t kept up with this quickly growing body of literature, this review is a great place to start catching up.

The impact of genomics on population genetics of parasitic diseases (Daniel Hupalo et al.) describes the influence of genomics on parasite population genetics and how burgeoning genomic data has enabled new types of investigations, and focuses on Plasmodium population genomics as a foundation for studies of neglected parasites.

R + ggplot2 Graph Catalog

2015-02-03T07:33:00.000-06:00

Joanna Zhao’s and Jenny Bryan’s R graph catalog is meant to be a complement to the physical book, Creating More Effective Graphs, but it’s a really nice gallery in its own right. The catalog shows a series of different data visualizations, all made with R and ggplot2. Click on any of the plots and you get the R code necessary to generate the data and produce the plot.

You can use the panel on the left to filter by plot type, graphical elements, or the chapter of the book if you’re actually using it. All of the code and data used for this website is open-source, in this GitHub repository. Here's an example for plotting population demographic data by county that uses faceting to create small multiples:

library(ggplot2)
library(reshape2)
library(grid)

this_base = "fig08-15_population-data-by-county"

my_data = data.frame(
  Race = c("White", "Latino", "Black", "Asian American", "All Others"),
  Bronx = c(194000, 645000, 415000, 38000, 40000),
  Kings = c(855000, 488000, 845000, 184000, 93000),
  New.York = c(703000, 418000, 233000, 143000, 39000),
  Queens = c(733000, 556000, 420000, 392000, 128000),
  Richmond = c(317000, 54000, 40000, 24000, 9000),
  Nassau = c(986000, 133000, 129000, 62000, 24000),
  Suffolk = c(1118000, 149000, 92000, 34000, 26000),
  Westchester = c(592000, 145000, 123000, 41000, 23000),
  Rockland = c(205000, 29000, 30000, 16000, 6000),
  Bergen = c(638000, 91000, 43000, 94000, 18000),
  Hudson = c(215000, 242000, 73000, 57000, 22000),
  Passiac = c(252000, 147000, 60000, 18000, 12000))

my_data_long = melt(my_data, id = "Race",
                     variable.name = "county", value.name = "population")

my_data_long$county = factor(
  my_data_long$county, c("New.York", "Queens", "Kings", "Bronx", "Nassau",
                         "Suffolk", "Hudson", "Bergen", "Westchester",
                         "Rockland", "Richmond", "Passiac"))

my_data_long$Race =
  factor(my_data_long$Race,
         rev(c("White", "Latino", "Black", "Asian American", "All Others")))

p = ggplot(my_data_long, aes(x = population / 1000, y = Race)) +
  geom_point() +
  facet_wrap(~ county, ncol = 3) +
  scale_x_continuous(breaks = seq(0, 1000, 200),
                     labels = c(0, "", 400, "", 800, "")) +
  labs(x = "Population (thousands)", y = NULL) +
  ggtitle("Fig 8.15 Population Data by County") +
  theme_bw() +
  theme(panel.grid.major.y = element_line(colour = "grey60"),
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        panel.spacing = unit(0, "lines"),  # panel.margin was renamed in ggplot2 2.2.0
        plot.title = element_text(size = rel(1.1), face = "bold", vjust = 2),
        strip.background = element_rect(fill = "grey80"),
        axis.ticks.y = element_blank())

p

ggsave(paste0(this_base, ".png"),
       p, width = 6, height = 8)

Keep in mind not all of these visualizations are recommended. You’ll find pie charts, ugly grouped bar charts, and other plots for which I can’t think of any sensible name. Just because you can use the add_cat() function from Hilary Parker’s cats package to fetch a random cat picture from the internet and create an annotation_raster layer to add to your ggplot2 plot, doesn’t necessarily mean you should do such a thing for a publication-quality figure. But if you ever needed to know how, this R graph catalog can help you out.

library(ggplot2)

this_base = "0002_add-background-with-cats-package"

## devtools::install_github("hilaryparker/cats")
library(cats)
## library(help = "cats")

p = ggplot(mpg, aes(cty, hwy)) +
  add_cat() +
  geom_point()
p

ggsave(paste0(this_base, ".png"), p, width = 6, height = 5)

R graph catalog (via Laura Wiley)

Microbiome Digest Blog

2015-01-20T14:55:00.000-06:00

I have a noteworthy blogs tag on this blog that I sort of forgot about, and haven't used in years. But I started reading one recently that definitely qualifies for the distinction.

The Microbiome Digest is written by Elisabeth Bik, a scientist studying the microbiome at Stanford. It's a near-daily compilation of papers and popular press articles mostly relating to microbiome research, split up into categories like the human microbiome, the non-human microbiome (soil, animal, plants, other environments), metagenomics and bioinformatics methods, reviews, news articles, and other general science or career advice articles.

I imagine Elisabeth spends hours each week culling the huge onslaught of literature into these highly relevant digests. I wish someone else would do the same for other areas I care about so I don't have to. I subscribe to the RSS feed and the email list so I never miss a post. If you're at all interested in metagenomics or microbiome research, I suggest you do the same!

Microbiome Digest

Using the microbenchmark package to compare the execution time of R expressions

2015-01-14T07:56:00.001-06:00

I recently learned about the microbenchmark package while browsing through Hadley’s advanced R programming book. I’ve done some quick benchmarking using system.time() in a for loop and taking the average, but the microbenchmark function in the microbenchmark package makes this much easier. Hadley gives the example of taking the square root of a vector using the built-in sqrt function versus the mathematical equivalent of raising the vector to the power of 0.5.

library(microbenchmark)
x = runif(100)
microbenchmark(
  sqrt(x),
  x ^ 0.5
)

By default, microbenchmark runs each argument 100 times to get an average look at how long each evaluation takes. Results:

Unit: nanoseconds
    expr  min     lq    mean median     uq   max neval
 sqrt(x)  825  860.5 1212.79  892.5  938.5 12905   100
   x^0.5 3015 3059.5 3776.81 3101.5 3208.0 15215   100

On average sqrt(x) takes about 1213 nanoseconds, compared to about 3777 for x^0.5. That is, the built-in sqrt function is about 3 times faster. (This was surprising to me. Anyone care to comment on why this is the case?)
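
For comparison, the manual approach I mentioned at the top (system.time() in a for loop, then averaging) might look something like the sketch below. The time_expr() helper and the repeat counts are made up for illustration. Because a single call is far too fast for system.time() to resolve, each timing wraps many calls and divides afterwards, which is exactly the kind of bookkeeping microbenchmark handles for you.

```r
x <- runif(100)

# Hypothetical helper: average seconds per call of f(), estimated by timing
# `inner` calls at a time and repeating that measurement `reps` times
time_expr <- function(f, reps = 10, inner = 10000) {
  elapsed <- numeric(reps)
  for (i in seq_len(reps)) {
    elapsed[i] <- system.time(
      for (j in seq_len(inner)) f()
    )[["elapsed"]]
  }
  mean(elapsed) / inner
}

t_sqrt <- time_expr(function() sqrt(x))
t_pow  <- time_expr(function() x ^ 0.5)

# Convert seconds per call to nanoseconds per call
c(sqrt = t_sqrt, pow = t_pow) * 1e9

# Sanity check: the two expressions give the same answer
all.equal(sqrt(x), x ^ 0.5)
```

The numbers from this loop include the overhead of calling the wrapper function, so they won't line up exactly with the microbenchmark output.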

Now, let’s try it on something just a little bigger. This is similar to a real-life application I faced where I wanted to compute summary statistics of some value grouping by levels of some other factor. In the example below we’ll use the nycflights13 package, which is a data package that has info on 336,776 outbound flights from NYC in 2013. I’m going to go ahead and load the dplyr package so things print nicely.

library(dplyr)
library(nycflights13)
flights

Source: local data frame [336,776 x 16]

year month day dep_time dep_delay arr_time arr_delay carrier tailnum
1  2013     1   1      517         2      830        11      UA  N14228
2  2013     1   1      533         4      850        20      UA  N24211
3  2013     1   1      542         2      923        33      AA  N619AA
4  2013     1   1      544        -1     1004       -18      B6  N804JB
5  2013     1   1      554        -6      812       -25      DL  N668DN
6  2013     1   1      554        -4      740        12      UA  N39463
7  2013     1   1      555        -5      913        19      B6  N516JB
8  2013     1   1      557        -3      709       -14      EV  N829AS
9  2013     1   1      557        -3      838        -8      B6  N593JB
10 2013     1   1      558        -2      753         8      AA  N3ALAA
..  ...   ... ...      ...       ...      ...       ...     ...     ...
Variables not shown: flight (int), origin (chr), dest (chr), air_time
(dbl), distance (dbl), hour (dbl), minute (dbl)

Let’s say we want to know the average arrival delay (arr_delay) broken down by each airline (carrier). There’s more than one way to do this.

Years ago I would have used the built-in aggregate function.

aggregate(flights$arr_delay, by=list(flights$carrier), mean, na.rm=TRUE)

This gives me the results I’m looking for:

   Group.1          x
1       9E  7.3796692
2       AA  0.3642909
3       AS -9.9308886
4       B6  9.4579733
5       DL  1.6443409
6       EV 15.7964311
7       F9 21.9207048
8       FL 20.1159055
9       HA -6.9152047
10      MQ 10.7747334
11      OO 11.9310345
12      UA  3.5580111
13      US  2.1295951
14      VX  1.7644644
15      WN  9.6491199
16      YV 15.5569853

Alternatively, you can use the sqldf package, which feels natural if you’re used to writing SQL queries.

library(sqldf)
sqldf("SELECT carrier, avg(arr_delay) FROM flights GROUP BY carrier")

Not long ago I learned about the data.table package, which is good at doing these kinds of operations extremely fast.

library(data.table)
flightsDT = data.table(flights)
flightsDT[ , mean(arr_delay, na.rm=TRUE), carrier]

Finally, there’s my new favorite, the dplyr package, which I covered recently.

library(dplyr)
flights %>% group_by(carrier) %>% summarize(mean(arr_delay, na.rm=TRUE))

Each of these will give you the same result, but which one is faster? That’s where the microbenchmark package becomes handy. Here, I’m passing all four evaluations to the microbenchmark function, and I’m naming those “base”, “sqldf”, “datatable”, and “dplyr” so the output is easier to read.

library(microbenchmark)
mbm = microbenchmark(
  base = aggregate(flights$arr_delay, by=list(flights$carrier), mean, na.rm=TRUE),
  sqldf = sqldf("SELECT carrier, avg(arr_delay) FROM flights GROUP BY carrier"),
  datatable = flightsDT[ , mean(arr_delay, na.rm=TRUE), carrier],
  dplyr = flights %>% group_by(carrier) %>% summarize(mean(arr_delay, na.rm=TRUE)),
  times=50
)
mbm

Here’s the output:

Unit: milliseconds
      expr     min      lq    mean  median      uq     max neval
      base 1487.39 1521.12 1544.73 1539.96 1554.55 1676.25    50
     sqldf  867.14  880.34  892.24  887.88  897.28  982.91    50
 datatable    4.12    4.57    5.29    4.89    5.43   18.69    50
     dplyr   14.49   15.53   16.59   15.86   16.58   25.04    50

In this example, data.table was clearly the fastest. Comparing median times, dplyr took ~3 times longer, sqldf took ~180 times longer, and the base aggregate function took over 300 times longer. Let’s visualize those results using ggplot2 (microbenchmark has an autoplot method available, and note the log scale):

library(ggplot2)
autoplot(mbm)
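
The relative speeds quoted above line up with the median column of the benchmark output. Here is that arithmetic as a small sketch, with the medians typed in by hand from the table. On your own machine you could pull them from summary(mbm)$median instead, and the numbers will differ.

```r
# Median run times in milliseconds, copied from the output above
median_ms <- c(base = 1539.96, sqldf = 887.88, datatable = 4.89, dplyr = 15.86)

# Each method's median relative to data.table's median
ratios <- median_ms / median_ms[["datatable"]]
round(ratios, 1)
```

That gives roughly 315x for base aggregate, 182x for sqldf, and 3.2x for dplyr, relative to data.table.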

In this example data.table and dplyr were both relatively fast, with data.table being just a few milliseconds faster. Sometimes this will matter, other times it won’t. This is a matter of personal preference, but I personally find the data.table incantation not the least bit intuitive compared to dplyr. The way we pronounce flights %>% group_by(carrier) %>% summarize(mean(arr_delay, na.rm=TRUE)) is: “take flights then group that data by the carrier variable then summarize the data taking the mean of arr_delay.” The dplyr syntax, for me, is much easier to use and extend to much more complex data management and analysis tasks, so I’ll sacrifice those few milliseconds of program run time for the minutes or hours of programmer debugging time. But if you’re planning on running a piece of code on, for instance, millions or more simulations, then those few milliseconds might be important to you. The microbenchmark package makes benchmarking easy for small pieces of code like this.

The code used for this analysis is consolidated here on GitHub