Blog for Sean Davis

Publicly Available Human Genome Variant Databases

2014-10-21T04:00:00Z

Back in the day, there was one variant database, and only one: dbSNP. Then came HapMap and then 1000 Genomes. Now there are many such resources, each of which can be useful for annotating and filtering variants found in our own data. Here, I have pulled together a few such resources along with a little relevant information about each.

The download links were valid as of this writing and I will try to update them over time, as many of these resources are still growing. When possible, I have pointed the download links to VCF file locations.

If there are other resources of interest to the community, please drop me a comment and I will try to add them as they come in.
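
As a rough illustration of what "filtering against a database" means in practice, here is a minimal Python sketch that splits a sample's variants into known and novel by comparing them with a reference VCF. The function names and the toy records are made up for the example, and matching is done on exact (chromosome, position, ref, alt) keys. Real pipelines also need to reconcile chromosome naming and indel representation, which this sketch ignores.

```python
def parse_vcf_line(line):
    """Pull the first five fixed VCF columns out of one tab-delimited data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt = fields[:5]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alts": alt.split(","),  # ALT may list several alleles
    }


def variant_keys(lines):
    """Yield one (chrom, pos, ref, alt) key per alternate allele, skipping headers."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        record = parse_vcf_line(line)
        for alt in record["alts"]:
            yield (record["chrom"], record["pos"], record["ref"], alt)


def split_known_novel(sample_lines, database_lines):
    """Partition sample variants by whether the database contains the same key."""
    known_keys = set(variant_keys(database_lines))
    known, novel = [], []
    for key in variant_keys(sample_lines):
        (known if key in known_keys else novel).append(key)
    return known, novel


# Toy records, invented for illustration only.
DATABASE = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\tknown1\tA\tG\t.\t.\t.",
    "1\t2000\tknown2\tC\tT,G\t.\t.\t.",
]
SAMPLE = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tG\t50\tPASS\t.",
    "1\t2000\t.\tC\tG\t50\tPASS\t.",
    "1\t3000\t.\tT\tA\t50\tPASS\t.",
]

known, novel = split_known_novel(SAMPLE, DATABASE)
print("known:", known)
print("novel:", novel)
```

For the full-size files behind the download links, a dedicated VCF library or an indexed lookup is a better fit than holding every key in memory.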

dbSNP

website: http://www.ncbi.nlm.nih.gov/SNP/
downloads: ftp

Exome Aggregation Consortium

website: http://exac.broadinstitute.org/
downloads: ftp
samples: 65,000

1000 Genomes

website: http://www.1000genomes.org
downloads:
- NCBI: ftp or Aspera
- EBI: ftp or Aspera
- Globus endpoint: “ebi#1000genomes” described here
samples: about 1100

ClinVar

website: http://www.ncbi.nlm.nih.gov/clinvar/
downloads: ftp

Exome Sequencing Project

website: http://evs.gs.washington.edu/EVS/
downloads: http
samples: about 6500

Genome of the Netherlands

website: http://www.nlgenome.nl/
downloads: http
samples: about 800

UK10K

website: http://www.uk10k.org/
downloads: ftp
samples: about 2000

GEUVADIS: Genetic European Variation in Health and Disease

website: http://www.geuvadis.org/web/geuvadis/home
downloads: ftp
samples: about 900

COSMIC: Catalogue of Somatic Mutations in Cancer

website: http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/
downloads: http

Convert from Sweave to R markdown vignettes

2014-09-21T04:00:00Z

I recently converted my GEOquery vignette from Sweave to R markdown and created a few notes on the process. Why, you ask? Because it is now easy to do this (and because I like to try new things). Until relatively recently, using something other than Sweave for vignettes was technically pretty challenging, requiring nasty hacks, Makefiles, and the like. As things stand, functionality to build HTML vignettes using knitr and rmarkdown is included out-of-the-box with R.

The DESCRIPTION file needs a couple of minor adjustments:

Suggests: knitr, rmarkdown, ...
VignetteBuilder: knitr

Vignette files now go in the directory PACKAGEROOT/vignettes. A standard .Rmd file in that directory will become our vignette. In addition, a couple of legacy tags (from Sweave days) are required to tell R what to do with the vignette. The YAML header and the tags (wrapped in an HTML comment) look like this for my GEOquery package:

---
title: "Using the GEOquery Package"
author: "Sean Davis"
date: September 21, 2014
output:
  html_document:
    toc: true
    number_sections: true
    theme: united
    highlight: pygments
---
<!--
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{Using GEOquery}
-->

A standard R CMD build on the package will now build an HTML version of the vignette and index it appropriately.
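
Forgetting one of the two legacy tags is an easy mistake to make, so a quick check before building can help. The following is a small Python sketch, not part of R or knitr, that scans the text of an .Rmd file for the tags. The function names are my own, and the list of required tags is limited to the two mentioned above.

```python
import re

# The two tags discussed above; other Vignette* tags are reported but not required.
REQUIRED_TAGS = ("VignetteEngine", "VignetteIndexEntry")

# Matches lines such as:  %\VignetteEngine{knitr::rmarkdown}
TAG_PATTERN = re.compile(r"^\s*%\s*\\(Vignette\w+)\{([^}]*)\}", re.MULTILINE)


def vignette_tags(rmd_text):
    """Return a dict mapping each Vignette tag found in the text to its value."""
    return dict(TAG_PATTERN.findall(rmd_text))


def missing_tags(rmd_text):
    """Return the required tags that do not appear in the text."""
    found = vignette_tags(rmd_text)
    return [tag for tag in REQUIRED_TAGS if tag not in found]


EXAMPLE = r"""---
title: "Using the GEOquery Package"
output: html_document
---
<!--
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{Using GEOquery}
-->
"""

print(vignette_tags(EXAMPLE))
print("missing:", missing_tags(EXAMPLE))
```

To check a real file, read it with open(path).read() and pass the text to missing_tags.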

Software cataloging made simple--the R DESCRIPTION file

2014-08-30T04:00:00Z

The NIH has just released a Request for Information: Input on Information Resources for Data-Related Standards Widely Used in Biomedical Science. One goal (I hope) is to make data and software more discoverable and searchable. While the R package system is no panacea, it is a successful system by many measures and represents a practical approach to metadata management associated with software. At the core of this system is the R DESCRIPTION file that is computable, human consumable, and extensible. Most importantly, it is simple.

Bioconductor is perhaps the largest open-source software project dedicated to biological data analysis with an active user base numbering several thousand (based on mailing list subscription and download statistics). Bioconductor is built on the R statistical software platform that enjoys an even larger user base. The success of R and Bioconductor can be traced back to the simple extensibility of R and, critically, the availability of well-maintained repositories of these extension packages. The number of such packages likely now exceeds several thousand, but a simple, required DESCRIPTION file, shipped with every package, enables both computable and human consumable metadata maintenance and searching. The DESCRIPTION file is a simple tag-value format that includes minimal information about each software project but is easily extensible and can include bug reporting mechanisms, support, tagging, free text description, licensing, versioning, and even dependencies. An example DESCRIPTION file looks like:

Package: GEOquery
Type: Package
Title: Get data from NCBI Gene Expression Omnibus (GEO)
Version: 2.31.1
Date: 2014-07-10
Author: Sean Davis <sdavis2@mail.nih.gov>
Maintainer: Sean Davis <sdavis2@mail.nih.gov>
BugReports: https://github.com/seandavi/GEOquery/issues/new
Depends: methods, Biobase
Imports: XML, RCurl
Suggests: limma
URL: https://github.com/seandavi/GEOquery
biocViews: Microarray, DataImport, OneChannel, TwoChannel, SAGE
Description: The NCBI Gene Expression Omnibus (GEO) is a public repository of microarray data.  Given the rich and varied nature of this resource, it is only natural to want to apply BioConductor tools to these data.  GEOquery is the bridge between GEO and BioConductor.
License: GPL-2
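
To give a sense of how little machinery "computable" requires here, the following Python sketch reads the tag-value format into a dictionary. It assumes the usual layout of one "Field: value" pair per line, with continuation lines indented by whitespace. The function names are illustrative and this is not R's own parser.

```python
def parse_description(text):
    """Parse 'Field: value' lines into a dict, joining indented continuation lines."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and current is not None:
            # Continuation of the previous field's value.
            fields[current] += " " + line.strip()
        else:
            # Split on the first colon only, so URLs in values survive intact.
            key, _, value = line.partition(":")
            current = key.strip()
            fields[current] = value.strip()
    return fields


def split_list(value):
    """Split a comma-separated field such as Depends or biocViews into items."""
    return [item.strip() for item in value.split(",") if item.strip()]


EXAMPLE = """\
Package: GEOquery
Version: 2.31.1
Depends: methods, Biobase
Imports: XML, RCurl
biocViews: Microarray, DataImport, OneChannel,
    TwoChannel, SAGE
URL: https://github.com/seandavi/GEOquery
License: GPL-2
"""

info = parse_description(EXAMPLE)
print(info["Package"], info["Version"])
print("depends on:", split_list(info["Depends"]))
print("tags:", split_list(info["biocViews"]))
```

Version constraints such as "R (>= 3.0)" come back from split_list as plain strings, so a catalog that wants to resolve dependencies would need one more parsing step.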

There is a real temptation to think of large problems like software cataloging and discoverability as requiring complex solutions. In some cases, simple suffices. Similar software metadata files exist for other packaging systems and languages, such as RubyGems (Ruby), PyPI (Python), and CPAN (Perl). In developing standards for software cataloging, it will be important to be aware of existing approaches, to work with developer communities to enhance these existing solutions, and, where possible, to keep it simple.

Globus Connect Multiuser Setup Notes

2014-08-24T04:00:00Z

Globus Connect enables your system to use the Globus file transfer and sharing service. It makes it simple to create a Globus endpoint on practically any system, from a personal laptop to a national supercomputer. Globus Connect is free to install and use for users at non-profit research and education institutions.

Here are very simple/sketchy notes on setup of globus-connect-multiuser, a server appropriate for hosting an endpoint on a multiuser machine such as a NAS or other file store. These notes are derived from this support forum entry.

rpm -hUv http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
yum install globus-connect-multiuser

Edit /etc/globus-connect-multiuser.conf:

Change username
Change password
Change public to True

Then do:

globus-connect-multiuser-setup

Any changes to the config file will require another:

globus-connect-multiuser-setup

(which will simply restart servers, etc.)

Make sure that appropriate ports are open in any firewall.
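
One quick way to test the firewall from another machine is to attempt a TCP connection to each port. The Python sketch below does that. The port numbers are an assumption on my part (2811 is the usual GridFTP control port and 7512 the usual MyProxy port), so check them against the Globus documentation for your install. The data-channel port range is not covered.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or name lookup failed.
        return False


# Assumed defaults, not taken from the setup notes above; adjust as needed.
ASSUMED_PORTS = {"GridFTP control": 2811, "MyProxy": 7512}


def check_endpoint(host, ports=ASSUMED_PORTS):
    """Map each service name to whether its port accepted a connection."""
    return {name: port_is_open(host, port) for name, port in ports.items()}


# Example (hypothetical host name), run from outside the firewall:
#   for name, ok in check_endpoint("endpoint.example.org").items():
#       print(name, "open" if ok else "blocked or closed")
```

A successful connection only shows that the port is reachable from where the script runs, so it is worth running from a host outside the firewall.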

Scientific collaboration tools

2014-08-22T04:00:00Z

Reproducible research in a team environment implies that team members work collaboratively to produce knowledge. Doing so in a productive and reproducible way has led to a number of potentially useful tools and environments. This post includes a sampling of such tools. Please comment with additional suggestions about what you use and how you use it, and I will try to add them.

I have broken the list down into a few organizational categories realizing that there is plenty of overlap.

Collaborative writing
Brainstorming
Data and file sharing
Collaborative coding
General communication and project management
Online lists of collaboration software

Collaborative writing

Google docs
writeLaTeX is a commercial solution for collaborative, online editing of LaTeX documents.
shareLaTeX is another commercial approach for collaborative, online editing of LaTeX documents.
StackEdit is a Markdown-based online editor that can open shared documents on Google Drive or Dropbox.
Authorea is self-described as “the collaborative platform for research. Write and manage your technical documents in one place.” It supports markdown and LaTeX.
EtherPad is an online editor, similar to Google Docs, but not requiring any login. It exports to multiple file formats, though not Markdown as of this writing. There are multiple public instances, and the software is open source and runs on node.js, so it can also be installed locally.

Brainstorming

whiteboardfox allows users to share a virtual whiteboard in real-time using any modern web browser. No login required.
skype is an old favorite.
Google Hangouts is Google’s integrated collaboration site including chatting, video calling, file sharing, etc. It works relatively seamlessly on multiple devices including android and iOS.

Data and file sharing

globus/gridFTP is a suite of tools and technologies that enables high-performance, robust file transfers. The software supporting this capability is installed locally (even on a laptop). For a small fee (<$10 per month), a user can share files or directories with other globus users. Unlike many other file sharing approaches mentioned here, this one will work for very large files.
Dropbox – you have probably heard of this one.
Google Docs/Google Drive – no explanation required.
ownCloud provides access to your data through a web interface or WebDAV while providing a platform to easily view, sync and share across devices—all under your control. ownCloud’s open architecture is extensible via a simple but powerful API for applications and plugins and works with any storage. This software runs on local hardware (requires installation and configuration).
figshare
Sage Synapse
transfer.sh allows you to upload and download files without complexity from your shell or browser. Upload a file by dropping it onto the transfer.sh page, or by sending it to their server with curl or any other command that can make a PUT request. The service returns a unique (obscure) shareable URL, which will expire within 2 weeks.
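
Since the transfer.sh upload is just an HTTP PUT, it is easy to script. Here is a minimal Python sketch using only the standard library. The URL layout (the file name appended to https://transfer.sh/) is my assumption based on the service's curl usage, and the function names are illustrative.

```python
import urllib.parse
import urllib.request


def build_upload_request(filename, payload, base_url="https://transfer.sh"):
    """Build (but do not send) a PUT request carrying payload as the file body."""
    # Assumed URL layout: <base_url>/<file name>, with the name percent-encoded.
    url = f"{base_url}/{urllib.parse.quote(filename)}"
    return urllib.request.Request(url, data=payload, method="PUT")


def upload(filename, payload, timeout=30):
    """Send the request and return the response body (expected to be the share URL)."""
    request = build_upload_request(filename, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


# Example (performs a real network request, so it is left commented out):
#   with open("notes.txt", "rb") as handle:
#       print(upload("notes.txt", handle.read()))
```

Building the request separately from sending it makes it possible to inspect the method and URL without touching the network.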

Collaborative coding

git is the mother of all collaborative coding tools. It forms the basis for multiple online sites for social coding.
github is the most popular code sharing and collaborative coding site on the planet. Social coding features such as pull requests, issues, comments, “star”, and “watch” allow even non-coders to follow and contribute to code projects.
bitbucket is very similar to github, though not as popular. It has free private repositories, though, so it is great for early or private development efforts that can later be exposed to the public.
coderscrowd describes itself as “a platform to discuss your problems and share your skills”. It is kind of like a stackoverflow, but with live code.

General communication and project management

slack (via @thatdnaguy) is a platform for team communication: everything in one place, instantly searchable, available wherever you go. It has integration capabilities with dozens of services including twitter, github, etc. And it is free!

Online lists of collaboration software

There are many other catalogs of collaboration software available online. Some are more focused on project management, etc., but I’ll drop those below.

The 20 best tools for online collaboration