Digital Notebook - Vince Buffalo

Why am I growing worried about reproducibility?

2012-03-25T00:00:00-07:00

Why am I growing worried about reproducibility? For three reasons:

Reproducibility is a fringe movement in the sciences now.
Reproducibility is hard.

Right now, I could walk to the hardware store, buy some pea seeds, and replicate genetics research that’s 150 years old. I could also replicate just Mendel’s analysis with his raw data. However, I can’t

Using Bioconductor to Analyze your 23andme Data

2012-03-12T00:00:00-07:00

This is a working draft; I may add to this in the future.


Bioconductor is one of the open source projects of which I am most fond. The documentation is excellent, the community wonderful, the development fast-paced, and the software very well written. I’ll introduce the power of Bioconductor while exploring some raw SNP data from 23andme.com.

I’ll be using a new package in the development branch (due to be released with Bioconductor version 2.10 very soon) called gwascat. gwascat is a package that serves as an interface to the NHGRI’s database of genome-wide association studies. I’ll also be using the development version of AnnotationDbi, which you’ll need to grab if you want to follow along.

If you’re reading this after the release of 2.10, use biocLite to download; otherwise install the source package via R CMD INSTALL. Loading the package with library(gwascat) creates a GRanges instance of SNPs and their diseases. GRanges is a fundamental data structure in Bioconductor (specifically the GenomicRanges package) that is designed to hold ranges on genomes efficiently, as well as metadata about the ranges. In this case, the object gwrngs holds SNP ranges (well, locations) and metadata provided by the GWA studies in NHGRI’s database.

While 23andme provides a great interface to one’s genotyping information and GWA research, Bioconductor offers a different way to explore and learn from your genotyping data.

23andme Raw Data

When I was considering 23andme, I was ultimately persuaded by the fact that they release their raw genotype calls to users. Unfortunately they do so without SNP call confidence data, but in personal correspondence a 23andme representative stated:

Data reproducibility of our genotyping platforms is estimated at about 99.9%. Average call rate is about 99%. When samples do not meet sufficient call rate thresholds, we repeat the analysis, and/or request a new sample. We do not return data to customers that does not meet our quality thresholds.

Now, 99.9% sounds like a lot, but considering there are 960,545 SNPs being called, it’s not that high.
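To put a rough number on that: if the 99.9% figure is read as a per-SNP concordance rate (my assumption; the statement above doesn’t define it precisely), the expected number of calls that could differ on a repeat run is close to a thousand. A quick back-of-the-envelope check from the shell:

```shell
# Expected discordant calls, ASSUMING 99.9% is a per-SNP concordance
# rate (an assumption, not 23andme's stated definition)
n_snps=960545
expected_discordant=$(awk -v n="$n_snps" 'BEGIN { printf "%.0f", n * (1 - 0.999) }')
echo "expected discordant calls: $expected_discordant"
```

That is small relative to the whole chip, but large if one of those calls happens to sit on a SNP you care about.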

To retrieve raw data, simply click the “Account” link at the top of the page (after you’ve signed in) and click “Browse Raw Data”. There should be a download link. If you’ve never used GPG to encrypt a file, now is the time to learn; keep your SNP data encrypted.

The file 23andme provides has four columns: rs ID, chromosome, position, and genotype.
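Before loading anything into R, it can be worth a quick sanity check of the file from the shell. The sketch below builds a tiny stand-in file (the name toy_genome.txt is hypothetical, and the three rows are SNPs that appear later in this post) and assumes the layout is tab-separated with comment lines starting with #:

```shell
# Toy stand-in for the 23andme download; file name is hypothetical
workdir=$(mktemp -d)
raw="$workdir/toy_genome.txt"
printf '# rsid\tchromosome\tposition\tgenotype\n' > "$raw"
printf 'rs2315504\t17\t36300407\tAC\n' >> "$raw"
printf 'rs440446\t19\t50101007\tCG\n' >> "$raw"
printf 'rs769449\t19\t50101842\tGG\n' >> "$raw"

# Count SNPs per chromosome, skipping '#' comment lines
per_chrom=$(grep -v '^#' "$raw" | cut -f 2 | sort | uniq -c | awk '{ print $2 ":" $1 }')
echo "$per_chrom"

# Count data rows that do not have exactly four tab-separated fields
bad_rows=$(grep -v '^#' "$raw" | awk -F '\t' 'NF != 4' | wc -l | tr -d ' ')
echo "rows with unexpected field count: $bad_rows"
```

On a real download, point raw at your own file instead of creating the toy one.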

Loading Raw Data into R

Use read.table to load this data in R. It’s a lot of data, so providing this function with information about the type of data can speed this up quite a bit. Here is the code I used:

d <- read.table("data/genome_Vince_Buffalo_Full_20120313162059.txt", sep="\t", 
               colClasses=c("character", "character", "numeric", "character"),
               col.names=c("rsid", "chrom", "position", "genotype"),
               header=FALSE)

You may notice that chromosome has the class “character” - this is because there are chromosomes X, Y, and MT (for mitochondrial). For later plotting purposes, it’s good to make this an ordered factor:

tmp <- d$chrom
d$chrom = ordered(d$chrom, levels=c(seq(1, 22), "X", "Y", "MT"))

## It's never a bad idea to check your work
stopifnot(all(as.character(tmp) == as.character(d$chrom)))

Where are the SNPs 23andme Genotypes?

Using Hadley Wickham’s excellent ggplot2 package, we can look at the distribution of SNPs by chromosome:

library(ggplot2)
ggplot(d) + geom_bar(aes(chrom))

This isn’t providing information on SNP density as much as it is chromosome length (except X). We’ll take a more detailed look a bit later.

Another really wonderful aspect of Bioconductor is that the project isn’t just a repository of code: it also stores annotation, full genomes, and experimental data. Such packaged data is the foundation of reproducible bioinformatics, as you no longer have to worry about keeping track of data versions and storing downloaded data yourself. If you need to work with cutting edge data from Ensembl or UCSC tracks, the packages biomaRt and rtracklayer work well.

A Quick Demonstration of GenomicRanges and Bioconductor Annotation Packages

Suppose I want to see if any of my SNPs fall in the APOE gene region. For this, I’ll need transcript annotation data, as this tells me the range and chromosome of this gene. If I wished to create a fresh database of exon, gene, transcript, and splicing data, I could with the GenomicFeatures package. This package has methods for building TranscriptDb objects from the Known Gene track from UCSC, as well as Ensembl databases. However, I’ll just use a pre-packaged version, TxDb.Hsapiens.UCSC.hg18.knownGene. I use hg18 rather than hg19 because this is the build that 23andme’s coordinates reference.

library(TxDb.Hsapiens.UCSC.hg18.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg18.knownGene
class(txdb) ## do some digging around!

TranscriptDb objects have nice accessor functions for their components. Behind the scenes, everything is in SQLite and very efficient (are you seeing why I love Bioconductor?).

If we look at the transcripts with the transcripts accessor function, we see it’s a GRanges object:

transcripts(txdb)
GRanges with 66803 ranges and 2 elementMetadata values:
          seqnames               ranges strand   |     tx_id     tx_name
             <Rle>            <IRanges>  <Rle>   | <integer> <character>
      [1]     chr1     [  1116,   4121]      +   |         1  uc001aaa.2
      [2]     chr1     [  1116,   4272]      +   |         2  uc009vip.1
      [3]     chr1     [ 19418,  20957]      +   |        26  uc009vjg.1
      [4]     chr1     [ 55425,  59692]      +   |        28  uc009vjh.1
      [5]     chr1     [ 58954,  59871]      +   |        29  uc001aal.1
      [6]     chr1     [310947, 310977]      +   |        33  uc001aaq.1
      [7]     chr1     [311009, 311086]      +   |        34  uc001aar.1
      [8]     chr1     [314323, 314353]      +   |        35  uc001aas.1
      [9]     chr1     [314354, 314385]      +   |        36  uc001aat.1
      ...      ...                  ...    ... ...       ...         ...
  [66795]     chrY [25318610, 25368905]      -   |     33721  uc004fwl.1
  [66796]     chrY [25318610, 25368905]      -   |     33722  uc010nxm.1
  [66797]     chrY [25586438, 25607639]      -   |     33731  uc004fws.1
  [66798]     chrY [25739178, 25740308]      -   |     33732  uc004fwt.1
  [66799]     chrY [25949151, 25949179]      -   |     33733  uc004fwu.1
  [66800]     chrY [26012854, 26012887]      -   |     33734  uc004fww.1
  [66801]     chrY [26015033, 26015066]      -   |     33735  uc004fwx.1
  [66802]     chrY [26015782, 26015809]      -   |     33737  uc004fwy.1
  [66803]     chrY [26016792, 26016820]      -   |     33738  uc004fwz.1

To interact with the wealth of data behind a TranscriptDb object, we often group individual ranges (by gene, for example), leaving us with a GRangesList.

tx.by.gene <- transcriptsBy(txdb, "gene")
tx.by.gene
GRangesList of length 20121:
$1 
GRanges with 2 ranges and 2 elementMetadata values:
      seqnames               ranges strand |     tx_id     tx_name
         <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]    chr19 [63549984, 63556677]      - |     61027  uc002qsd.2
  [2]    chr19 [63551644, 63565932]      - |     61033  uc002qsf.1

$10 
GRanges with 2 ranges and 2 elementMetadata values:
      seqnames               ranges strand | tx_id    tx_name
  [1]     chr8 [18293035, 18303003]      + | 26503 uc003wyw.1
  [2]     chr8 [18301794, 18302666]      + | 26504 uc010lte.1

$100 
GRanges with 2 ranges and 2 elementMetadata values:
      seqnames               ranges strand | tx_id    tx_name
  [1]    chr20 [42681577, 42713790]      - | 62142 uc002xmj.1
  [2]    chr20 [42681577, 42713790]      - | 62143 uc010ggt.1

...
<20118 more elements>

Holy GRangesList, Batman! These are the transcripts grouped by gene. There are other methods for grouping by CDS and exons (cdsBy and exonsBy).

The names of the list elements are Entrez gene IDs. We can look up specific genes with another Bioconductor annotation package, org.Hs.eg.db. There are org.* annotation packages for many organisms. You can forge your own and interact with them with the AnnotationDbi package. I’m using a development version of this package that has a new slick SQL-like interface; it will be widely available with the upcoming 2.10 release.

Suppose I want to convert the Entrez Gene IDs to gene names. The “eg” in org.Hs.eg.db refers to Entrez Gene IDs. Printing the org.Hs.eg.db object gives a nice list of information. Let’s look for the APOE gene’s Entrez Gene ID.

library(org.Hs.eg.db)
cols(org.Hs.eg.db)
  [1] "ENTREZID"     "ACCNUM"       "ALIAS"        "CHR"          "ENZYME"      
  [6] "GENENAME"     "MAP"          "OMIM"         "PATH"         "PMID"        
 [11] "REFSEQ"       "SYMBOL"       "UNIGENE"      "CHRLOC"       "CHRLOCEND"   
 [16] "PFAM"         "PROSITE"      "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [21] "UNIPROT"      "UCSCKG"       "GO"

These are the columns we can query. The columns that can serve as lookup keys are listed by keytypes(). Putting it all together, we can extract the Entrez Gene ID:

select(org.Hs.eg.db, keys="APOE", cols=c("ENTREZID", "SYMBOL", "GENENAME"), 
       keytype="SYMBOL")
SYMBOL ENTREZID         GENENAME
APOE      348 apolipoprotein E

Now, we can look for this in our tx.by.gene GRangesList. A word of caution: Entrez Gene IDs are names and thus they need to be quoted when working with GRangesList objects from transcript databases.

tx.by.gene["348"]
GRangesList of length 1:
$348 
GRanges with 1 range and 2 elementMetadata values:
      seqnames               ranges strand |     tx_id     tx_name
         <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]    chr19 [50100879, 50104490]      + |     59642  uc002pab.1

If I had used tx.by.gene[348] the 348th element of the list would have been returned, not the transcript data for the APOE gene (which has Entrez Gene ID “348”).

Now, do any SNPs fall in this region? Let’s build a GRanges object from my genotyping data, and look for overlaps. Before I do, it’s worth mentioning another gotcha about working with bioinformatics data: chromosome naming schemes. Different databases use all sorts of schemes, and you should always check them. 23andme returns just numbers, X, Y, and MT. Let’s change it to use the same as the Bioconductor annotation.

# CAREFUL: use levels() to check that you're making new factor names
# that correspond to the old ones!
levels(d$chrom) <- paste("chr", c(1:22, "X", "Y", "M"), sep="")
my.snps <- with(d, GRanges(seqnames=chrom, 
                   IRanges(start=position, width=1), 
                   rsid=rsid, genotype=genotype)) # this goes into metadata

Now, let’s find overlaps using, well, findOverlaps:

apoe.i <- findOverlaps(tx.by.gene["348"], my.snps)

apoe.i is an object of class RangesMatching. Note that had we not matched chromosome names, Bioconductor would have given us a nice warning that sequence names don’t match. We could look at the slots of apoe.i, but the hits are easier to see with matchMatrix:

hits <- matchMatrix(apoe.i)[, "subject"]
hits
 [1] 873650 873651 873652 873653 873654 873655 873656 873657 873658 873659
[11] 873660 873661 873662 873663 873664 873665 873666 873667 873668 873669
[21] 873670 873671 873672 873673 873674 873675 873676

So in our subject, we have 27 hits. Let’s dig them up in our SNP GRanges object:

my.snps[hits]
GRanges with 27 ranges and 2 elementMetadata values:
       seqnames               ranges strand   |        rsid    genotype
          <Rle>            <IRanges>  <Rle>   | <character> <character>
   [1]    chr19 [50101007, 50101007]      *   |    rs440446          CG
   [2]    chr19 [50101842, 50101842]      *   |    rs769449          GG
   [3]    chr19 [50102284, 50102284]      *   |    rs769450          AG
   [4]    chr19 [50102751, 50102751]      *   |    rs769451          TT
   [5]    chr19 [50102874, 50102874]      *   |    i5000209          GG
   [6]    chr19 [50102904, 50102904]      *   |    i5000208          GG
   [7]    chr19 [50102940, 50102940]      *   |    i5000201          CC
   [8]    chr19 [50102991, 50102991]      *   |  rs28931576          AA
   [9]    chr19 [50103697, 50103697]      *   |  rs11542040          CC
   ...      ...                  ...    ... ...         ...         ...
  [19]    chr19 [50104077, 50104077]      *   |    i5000212          GG
  [20]    chr19 [50104118, 50104118]      *   |    i5000210          GG
  [21]    chr19 [50104129, 50104129]      *   |    i5000213          CC
  [22]    chr19 [50104154, 50104154]      *   |    i5000207          TT
  [23]    chr19 [50104177, 50104177]      *   |    i5000219          GG
  [24]    chr19 [50104180, 50104180]      *   |    i5000218          GG
  [25]    chr19 [50104198, 50104198]      *   |    i5000206          CC
  [26]    chr19 [50104268, 50104268]      *   |    i5000204          GG
  [27]    chr19 [50104333, 50104333]      *   |  rs28931579          AA

Now, we can verify that these SNPs are in the APOE gene using the UCSC Genome Browser (we could actually pull open a browser to this spot from R using rtracklayer, but I’ll save that for another time). Be sure to use hg18/build 36! Note that my genotype information is there.

The ApoE4 allele is rs429358(C) + rs7412(C). The most common allele (ApoE3, or e3/e3) is rs429358(T) + rs7412(C) which is what I have (that’s a relief). There’s a lot of established research that shows homozygous ApoE4 (that is rs429358(C/C) + rs7412(C/C)) leads to substantially higher risk of Alzheimer’s. According to SNPedia, James Watson requested he not learn his genotype at this locus, and Steven Pinker requested his ApoE data be removed from his PGP10 data.

Looking for Risk Variants using gwascat

We can use the metadata provided by gwascat to further look for interesting variants in our 23andme data. I would recommend interpreting this data with caution, as summarizing these findings in a single element metadata data frame is hard: there’s definitely lost information.

The gwrngs GRanges object has lots of metadata you should scan through with elementMetadata(gwrngs). The Strongest.SNP.Risk.Allele is useful for seeing what you’re at risk for. First, using the rs ID as a key, let’s join our SNP data with the gwrngs metadata:

gwrngs.emd <- as.data.frame(elementMetadata(gwrngs))
dm <- merge(d, gwrngs.emd, by.x="rsid", by.y="SNPs")

We can search for the risk allele in the 23andme genotype data with R and attach a vector of i.have.risk to the dm data frame:

risk.alleles <- gsub("[^\\-]*-([ATCG?])", "\\1", dm$Strongest.SNP.Risk.Allele)
i.have.risk <- mapply(function(risk, mine) {
  risk %in% unlist(strsplit(mine, ""))
}, risk.alleles, dm$genotype)
dm$i.have.risk <- i.have.risk
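To see what that regular expression and membership test do to a single record, here’s a rough shell equivalent. The rs0000000-A entry and the AC genotype are made-up values for illustration, not rows from gwrngs or from my data:

```shell
# Hypothetical Strongest.SNP.Risk.Allele entry and genotype (illustration only)
entry='rs0000000-A'
genotype='AC'

# Strip the "rsID-" prefix, leaving only the risk allele (or "?")
risk_allele=$(printf '%s\n' "$entry" | sed -E 's/^[^-]*-([ACGT?])$/\1/')

# Does either allele of the genotype match the risk allele?
case "$genotype" in
  *"$risk_allele"*) i_have_risk=TRUE ;;
  *)                i_have_risk=FALSE ;;
esac
echo "$risk_allele $i_have_risk"
```

Note that an entry with an unknown risk allele ("?") comes out as FALSE here, which is one of the ways information gets lost in this kind of summary.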

Now that you have this data frame, you can mine it endlessly. You may want to sort by Risk.Allele.Frequency and whether you have the risk. Because there are quite a few columns in the element metadata, it’s nice to define a quick-summary subset:

my.risk <- dm[dm$i.have.risk, ]

# define some relevant columns
rel.cols <- c(colnames(d), "Disease.Trait", "Risk.Allele.Frequency",
              "p.Value", "i.have.risk", "X95..CI..text.")

head(my.risk[order(my.risk$Risk.Allele.Frequency), rel.cols], 1)
          rsid chrom position genotype Disease.Trait Risk.Allele.Frequency
2553 rs2315504 chr17 36300407       AC        Height                  0.01
     p.Value i.have.risk   X95..CI..text.
2553   8e-06        TRUE [NR] cm increase

This is a rare variant, but the most important next question is, rare in whom?

dm[which(dm$rsid == "rs2315504"), "Initial.Sample.Size"]
[1] 8,842 Korean individuals

So this clearly doesn’t mean much to me since I am of European ancestry. We can use grep to find studies that mention “European”:

head(my.risk[grep("European", my.risk$Initial.Sample.Size), rel.cols], 30)

One interesting rs ID that popped up in this list of my data is rs10166942, which is lightly linked to migraines (from which I suffer).

Git Notes

2012-03-11T00:00:00-08:00


These are updated by me periodically. I have tried my best to illustrate common use cases, and the motivation for doing things the “Git” way.

Example Set Up

I’ll use this setup scenario frequently. In a suitable scratch directory (i.e. git-sandbox), make a fake remote:

mkdir fake-remote
cd fake-remote
git init --bare
cd ..

Now, clone it, pretending you are two developers:

git clone fake-remote jerry-repo
git clone fake-remote kramer-repo

Let’s assume you’re Jerry and Kramer is another programmer in your group. As Jerry, let’s make some changes:

cd jerry-repo
echo "an example file" > file.txt
git add file.txt
git commit -am "initial import"
git push origin master
cd ..

Now, let’s pretend we’re Kramer and grab that recent commit:

cd kramer-repo
git pull origin master
cd ..

Git Remote Tracking Branches

Git remote tracking branches are similar to local branches (i.e. the kind you create with git checkout -b branch-name and see with git branch). However, you don’t work on the remote branch directly; you work on a local branch that’s tracking this remote branch. For example, the most common workflow is to track a remote branch, then push your commits to it or pull commits down from it. Even though it’s a “remote” tracking branch, the branch is stored locally (this branch doesn’t disappear if you can’t connect to the remote).

Git remote tracking branches always have the format remote-repo/remote-branch. You can set a local branch to track one with the -u option of git push, e.g. git push -u origin master. From now on, you can just use git push when on this branch; it is tracking origin’s master branch.
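Here’s a minimal sketch of checking what -u actually recorded. It rebuilds the fake-remote sandbox in a temporary directory; the user.name and user.email values are placeholders I’ve added so the commit works on a fresh machine:

```shell
# Rebuild the sandbox in a throwaway directory
sandbox=$(mktemp -d)
cd "$sandbox"
git init --quiet --bare fake-remote
git clone --quiet fake-remote jerry-repo 2>/dev/null
cd jerry-repo
git symbolic-ref HEAD refs/heads/master    # make sure the branch is named master
git config user.name "Jerry"               # placeholder identity
git config user.email "jerry@example.com"
echo "an example file" > file.txt
git add file.txt
git commit --quiet -m "initial import"

# -u records origin/master as the upstream of the local master branch
git push --quiet -u origin master

# Ask Git which remote tracking branch the current branch follows
upstream=$(git rev-parse --abbrev-ref --symbolic-full-name '@{u}')
echo "$upstream"
```

git branch -vv shows the same information for every local branch at once.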

git branch shows local branches; to see remote branches use git branch -r, and to see all branches, use git branch -a.

Remote tracking branches are also what determine what is pulled/pushed when using git pull and git push without an explicit remote repository and refspec (i.e. plain git push rather than git push origin master).

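If you create a branch from a remote tracking branch (e.g. git checkout -b new-feature origin/new-feature), the new branch tracks it automatically unless --no-track is added. Note that with Git’s default settings this does not extend to branches checked out from the local new-feature: they get no upstream unless branch.autoSetupMerge is set to always, in which case they track the local branch they were created from.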
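The effect of --no-track is easy to check in the sandbox. This sketch creates two branches from the same remote tracking branch, one with and one without --no-track, and then asks Git for each branch’s upstream. The branch names tracked-topic and untracked-topic are made up, and I pin branch.autoSetupMerge to its documented default so local configuration doesn’t change the outcome:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
git init --quiet --bare fake-remote
git clone --quiet fake-remote jerry-repo 2>/dev/null
cd jerry-repo
git symbolic-ref HEAD refs/heads/master
git config user.name "Jerry"               # placeholder identity
git config user.email "jerry@example.com"
git config branch.autoSetupMerge true      # pin Git's documented default
echo "an example file" > file.txt
git add file.txt
git commit --quiet -m "initial import"
git push --quiet -u origin master

# Started from a remote tracking branch: upstream is set automatically
git branch --quiet tracked-topic origin/master
# Same start point, but --no-track leaves the branch without an upstream
git branch --quiet --no-track untracked-topic origin/master

tracked_upstream=$(git for-each-ref --format='%(upstream:short)' refs/heads/tracked-topic)
untracked_upstream=$(git for-each-ref --format='%(upstream:short)' refs/heads/untracked-topic)
echo "tracked-topic   -> ${tracked_upstream:-none}"
echo "untracked-topic -> ${untracked_upstream:-none}"
```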

Git Fetch and Merge vs. Git Pull

Recall the above point that remote tracking branches are local, so (unlike Subversion) they function even when you’re not able to connect to the remote. This gives an example of how elegant Git is: being so similar to regular branches, Git remote tracking branches can be merged into local branches. This is precisely what goes on behind the scenes with git pull. Here’s an example. First, let’s set it up such that a developer in your group, Kramer, made some changes, committed them, then pushed them to the remote.

# assuming you're in the right directory
cd kramer-repo
echo "kramer adding gibberish" >> file.txt
git commit -am "I added some gibberish"
git push
cd ..

Now, imagine you (Jerry) have made some commits but want to see what the status of the remote looks like. git remote show can be used to see if the local remote tracking branch is out of date.

vinceb@poisson$ git remote show origin
* remote origin
  Fetch URL: /Users/vinceb/Desktop/git-sandbox/fake-remote
  Push  URL: /Users/vinceb/Desktop/git-sandbox/fake-remote
  HEAD branch: master
  Remote branch:
    master tracked
  Local branch configured for 'git pull':
    master merges with remote master
  Local ref configured for 'git push':
    master pushes to master (local out of date)

So we are out of date! We can see these commits before merging them in with git fetch. git fetch updates your remote tracking branch with the new changes, allowing you to diff branches just as you would with two regular branches.

vinceb@poisson$ git diff master origin/master
diff --git a/file.txt b/file.txt
index 4e850ce..34ceb34 100644
--- a/file.txt
+++ b/file.txt
@@ -1 +1,2 @@
 an example file
+kramer adding gibberish

git log origin/master master also works. git log origin/master ^master shows us just the new commits. We could really explore these commits by checking out the remote tracking branch (but consider this to be “taking a visit”; don’t commit anything). Suppose we did, and we decide we want to merge them with our current branch. For this, we just use git merge: remember remote tracking branches are just branches!

vinceb@poisson$ git branch ## always check what branch you're on!
* master
vinceb@poisson$ git merge origin/master
Updating ee922a9..8c8c240
Fast-forward
 file.txt |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

Note that git pull basically is git fetch && git merge.
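The two-step version can be replayed end to end in a throwaway sandbox. This is a sketch assuming default configuration (no pull.rebase and so on); the identities are placeholders, and --ff-only is my addition so the merge refuses to do anything other than a fast-forward:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
git init --quiet --bare fake-remote
git clone --quiet fake-remote jerry-repo 2>/dev/null
( cd jerry-repo
  git symbolic-ref HEAD refs/heads/master
  git config user.name "Jerry"; git config user.email "jerry@example.com"
  echo "an example file" > file.txt
  git add file.txt
  git commit --quiet -m "initial import"
  git push --quiet -u origin master )
git clone --quiet --branch master fake-remote kramer-repo
( cd kramer-repo
  git config user.name "Kramer"; git config user.email "kramer@example.com"
  echo "kramer adding gibberish" >> file.txt
  git commit --quiet -am "I added some gibberish"
  git push --quiet origin master )

cd jerry-repo
before=$(git rev-parse HEAD)
git fetch --quiet origin                              # step 1: only origin/master moves
incoming=$(git log --format=%s master..origin/master) # same set as: origin/master ^master
git merge --quiet --ff-only origin/master             # step 2: bring master up to date
after=$(git rev-parse HEAD)
echo "incoming: $incoming"
```

Doing the two steps separately is what gives you the chance to inspect the incoming commits before they touch your branch.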


The Beauty of Bioconductor

2012-03-08T00:00:00-08:00


Introduction

In talking with bioinformaticians, biologists, and other researchers, I’ve seen some worrying trends in computation in the sciences. I plan on writing about these extensively in the future, as I believe computation in the sciences will not scale well to face the huge wealth of data coming experiments will provide. This is not due to algorithmic or hardware limitations, but rather to the fact that scientific programmers simply do not have the same standards and practices that the software industry does.

How do we prevent a big genomic data breaking point?

Three events are simultaneously occurring that could endanger the validity of scientific conclusions in the future. First, new technology is providing the average scientist with more data than ever before. Genomics is the prime example of this: the average biologist can now sequence multiple samples simultaneously whereas this would be prohibitively expensive just a few years earlier. As metabolomic and proteomic data are increasingly incorporated into research alongside genomic data, the work done by bioinformaticians will increase significantly. More lines of code to write and more data to process under deadlines will doubtlessly lead to mistakes.

The second contributing factor is that more researchers are writing code and analyzing their own data rather than hiring a bioinformatician or statistician. It’s an awesome and commendable occurrence, but sadly academic institutions don’t adequately prepare researchers to code to high standards. Also, in many cases these researchers learn to program by analyzing their own experimental data, rather than example or “toy” data. This makes “silent” mistakes (i.e. those that don’t prompt an error, but lead to incorrect results) impossible to discover as the actual results are not known.

The last contributing factor is that there’s not a strong expectation that coding standards and software engineering practices be upheld in the sciences. There’s a strong cowboy coding culture in scientific programming. In this mindset, the coding is done when the data is processed, not when the data is processed, the code documented, the unit tests passed, the code checked into a repository, etc. The scientific community needs to embrace the idea that proper data analysis takes time: perhaps as long or longer than gathering experimental data.

In future essays I’ll talk more about these issues in depth. This stuff honestly keeps me (and other people I know) awake at night. I worry humanity may face a Thalidomide-like event in the future due to an error in scientific programming.

However, here I want to commend a project that I feel is underutilized in the biology and bioinformatics communities: Bioconductor. It’s worthy of praise as both an example of, and a tool to aid, excellence in bioinformatics programming. I’ll focus primarily on its capacity for handling high throughput sequencing data (even though it handles data from other assays very well too).

Where is Bioinformatics?

Currently, many bioinformatics analyses go something like this: experimental data is received, and then a bioinformatician downloads vast amounts of other data relating to the experiment from web resources such as the UCSC Genome Browser, Ensembl, Phytozome, etc. Often this includes genome assemblies, transcript sequences, and annotation data. Then, application code (alignment software, assemblers, SNP finding tools, etc.) is downloaded and compiled. These tools are used alongside custom code that combines the downloaded data with the experimental data, and this produces results that are interpreted. Intermediate results may be fed into other online tools and databases like DAVID or Reactome.

However, this is a bad model if one wants the analysis to be reproducible. The common weakness is that web resources can be unstable. It’s then necessary for the researcher to record software and data versions manually. Even if the researcher dutifully complies, outside databases and code repositories may disappear and leave the project unable to be reproduced. Researchers truly invested in conducting reproducible research then have to store data and software versions themselves, which given the scale of genomic data is quite a burden.

Thus, three things currently complicate reproducible research in bioinformatics: the scale of both experimental and other required data prevents easy self-archival, the fast-paced development of bioinformatics tools can lead to differing results across versions, and the overwhelmingly prevalent web-based data resources and applications are not easily reproducible.

The Bioconductor Solution

Bioconductor has, in my opinion, the best solution to these problems. First, Bioconductor stores past versions of its packages back to the earliest releases. Past experiments can be replicated using the exact version of software that was used for the actual analysis.

Second, Bioconductor stores data as packages. Pre-packaged versioned data is a cornerstone of reproducible research. For example, suppose I am working with human RNA-seq data. This requires transcript annotation data, which could be downloaded from an online resource. To be reproducible, this requires that:

The webhost be up indefinitely.
The URL remain stable and point to the exact same resource.
The user provide not only a URL but the version of data/software downloaded.
The external resource provider (i.e. database or application developer) actually update their versions accordingly.

For absolute best practice, it’s also necessary to MD5 checksum the data and record this value, to verify that any data gathered later from the same source is exactly the same.
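As a sketch of what that manual bookkeeping looks like, here is the record-then-verify cycle with md5sum (GNU coreutils; other systems may need a different tool). The file name annotation.tsv is hypothetical and its single row is just the hg18 APOE transcript range used as stand-in content:

```shell
workdir=$(mktemp -d)
cd "$workdir"
# Stand-in for a downloaded annotation file (hypothetical name and contents)
printf 'chr19\t50100879\t50104490\tAPOE\n' > annotation.tsv

# Record the checksum next to the data at download time
md5sum annotation.tsv > annotation.tsv.md5

# Later (or on another machine): verify the file is byte-for-byte unchanged
if md5sum -c annotation.tsv.md5 >/dev/null 2>&1; then
  status="unchanged"
else
  status="CHANGED"
fi
echo "annotation.tsv is $status"
```

This has to be repeated, and the .md5 files kept, for every external file in the analysis.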

In contrast, consider how I would load human transcript data into R with Bioconductor:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)

The versioning here is done explicitly in the package name: hg19. I could also easily record the state of all my Bioconductor packages and my session with sessionInfo():

> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] annotate_1.32.2       AnnotationDbi_1.17.23 Biobase_2.14.0       
 [4] BiocGenerics_0.1.12   DBI_0.2-5             DESeq_1.6.1          
 [7] genefilter_1.36.0     geneplotter_1.32.1    grid_2.14.1          
[10] IRanges_1.12.6        RColorBrewer_1.0-5    RSQLite_0.11.1       
[13] splines_2.14.1        survival_2.36-12      xtable_1.7-0

Entire genomes are also packaged via the BSgenome package (BS refers here to Biostrings). If the data in packages is not sufficiently recent, the GenomicFeatures package provides a programmatic way of downloading, packaging, and using data from BioMart and UCSC Genome Browser tracks, and provides functions for saving and loading TranscriptDb objects from such resources. Recently Duncan Temple Lang and I were speaking about reproducible research, and he said “people adopt best practices when they’re right in front of their face”. Bioconductor’s tools do just that. Furthermore, Bioconductor has strict coding and documentation standards (much stricter than CRAN actually), which ensures user-contributed packages are high quality.

Information leakage and statistics at every level

When discussing R and Bioconductor with other researchers, it’s easy to convince them to adopt both for analyzing statistical data - the data that comes in the very final stages of a bioinformatics analysis. It’s usually much more difficult to convince them to consider working with high throughput sequencing data in Bioconductor. Folks complain that it’s (1) not worth it to process sequencing data with Bioconductor tools or (2) it’s not fast enough. I’ll address the second point in a bit; more importantly I want to emphasize that it’s absolutely worth it to process sequencing data in Bioconductor.

In analyzing genomic data, we take very, very, very high dimension data and try to condense it into biologically meaningful conclusions without being misleading or getting something wrong. Every step is about taking dense data and making it understandable: we take sequence reads and try to assemble them into larger contigs and scaffolds, we take cDNA reads and try to map them back to genomes to understand expression, etc. At each step, our tools make heuristic or statistical choices for us. Pipelines woefully ignore these choices because in most cases, after a step is completed, a script jumps to the next step.

When I think about these steps, I try to assess what I think of as “informational leakage” in bioinformatics processing. Each step summarizes something, hopefully in a way without bias or too much noise. Informational leakage is the information that’s lost between steps. Catastrophic information leakage occurs when we lose information that could have indicated whether the data is biased or incorrect. We can hedge the risk of information leakage when we use summary statistics between steps that try to capture this leaked information.

Consider processing RNA-seq reads. The first step is usually quality control, i.e. removing adapter sequences and trimming off poor quality bases. Failing to gather summary statistics before and after each of these steps leads to potentially catastrophic information leakage. Suppose an experiment has control and treatment groups sequenced on two different lanes (bad experiment design!). If one lane has systematically lower 3’-end quality than the other, quality trimming software will trim these bases off and lead one experimental group to have much shorter sequences than the other. The mapping rates will differ significantly, as shorter reads may map less uniquely. In the end, our data is completely confounded not only by the lane (and bad experimental design), but by our own tools! If these tools are being run in a pipeline without intermediate summary statistics being gathered, this catastrophic information leakage will go unnoticed.
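Even a very crude statistic gathered between steps would flag this. As a minimal sketch, independent of any particular trimming tool, here is a mean read length computed before and after trimming. The two FASTQ files are tiny made-up examples with hypothetical names, and the awk one-liner assumes strict four-line FASTQ records:

```shell
workdir=$(mktemp -d)
cd "$workdir"
# Toy four-line-per-record FASTQ files (made-up reads, hypothetical names)
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n@r2\nACGTACGT\n+\nIIIIIIII\n' > before.fastq
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\nIIII\n' > after.fastq

# Mean read length: the sequence is line 2 of every 4-line record
mean_len() {
  awk 'NR % 4 == 2 { n++; total += length($0) } END { printf "%.1f", total / n }' "$1"
}

mean_before=$(mean_len before.fastq)
mean_after=$(mean_len after.fastq)
echo "mean length before trimming: $mean_before"
echo "mean length after trimming:  $mean_after"
```

Comparing these numbers per lane (or per experimental group) is what would reveal that trimming hit one group much harder than the other.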

Loading sequencing data into R and using Bioconductor’s tools earlier allows summary statistics to be gathered earlier and more easily (R is, after all, great for statistics and visualization), which I strongly believe will decrease the risk of catastrophic information leakage in genomics data analysis. This is why I wrote qrqc, which can provide quick summary statistics on sequencing read quality. Used before and after the application of quality tools, qrqc can provide information not only on the state of the data, but also the effect of the tools.

With existing Bioconductor packages, many useful statistics can be gathered on whole reads, BAM mapping results, VCF files, etc.

Massive Power

The complaint that R is slow and couldn’t possibly be used with sequencing/mapping-level data is unwarranted. In reality, some of Bioconductor’s core packages for working with high throughput sequencing data, such as Biostrings and IRanges (the foundation of GenomicRanges), are astoundingly fast because most of their backends are written in C. Biostrings actually uses external pointers to C structures and bit patterns to encode biological string data efficiently.

In addition to being fast, they’re also clever. Biostrings implements an abstraction called a “view” on an XString object, which efficiently represents multiple sections of the same string object (such as subsequences of interest). While I wouldn’t write a short read aligner or assembler entirely in R, many bioinformatic tasks are more than sufficiently fast with Bioconductor tools.
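
The idea behind a view can be sketched outside of R. The Python below is only an analogy, not how Biostrings is implemented: it uses `memoryview` slices over one shared byte string, so each "view" is a window onto the same data rather than a copy. The sequence and the ranges are made up for illustration.

```python
# One shared subject sequence (made up for illustration).
primer = b"AAGCAGTGGTATCAACGCAGAGT"
subject = b"ACGTTGCA" + primer + b"ACGT"

# Slicing a memoryview does not copy the underlying bytes.
buf = memoryview(subject)

# Hypothetical 0-based, half-open ranges of interest on the subject.
ranges = [(8, 8 + len(primer)), (0, 4), (19, 25)]
views = [buf[start:end] for start, end in ranges]

for (start, end), view in zip(ranges, views):
    # Bytes are only materialized here, when we ask to print them.
    print(start, end, view.tobytes().decode())
```

Storing many subsequences this way costs a pair of offsets each, not a copy of the sequence.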

Conclusion

In evangelizing Bioconductor, I have two goals. First, I want to spread awareness that it’s the best way to do reproducible bioinformatics that I think exists today. I want more people to use it just because I deeply care about the state of science and reproducibility. Second, I want to build excitement about this project so that more people will contribute. I believe that far too many high quality bioinformatics tools are written outside of Bioconductor. Packaging bioinformatics tools in Bioconductor forces the developer to adopt strict standards, write clear documentation, and open up a program to a large, active user base. Any results from a package’s methods can then easily be evaluated using R, CRAN, or Bioconductor’s existing tools.

I also believe that large programs (like BLAST, and maybe assemblers) should provide better interfaces to R, to prevent information leakage in analysis. I’m willing to bet that a large majority of bioinformatics tools could be outputting more statistics than they currently do, statistics that could be valuable in assessing how well they are working. R interfaces to these bioinformatics tools would make it drastically easier for biologists and bioinformaticians to prevent information leakage.

Thoughts on Julia and R

2012-03-07T00:00:00-08:00

Hello, Julia

Julia is an exciting new technical computing language. It’s still in its infancy, but it’s fast (see below), and already does a lot.

There’s been some excitement on Twitter about Julia. Excitement combined with open source often yields development, which then leads to further excitement, until a mature open source project arises. One of Julia’s explicit goals is to challenge other statistical computing environments, including R.

What’s wrong with R?

R is, without a doubt, changing the world. It’s being used by industry giants like Facebook and Google, while also providing academic researchers in statistics, biology, psychology, and countless other fields with not only a free and open source statistical environment, but a huge number of user-contributed packages through CRAN. Methods papers in many fields are now often accompanied by CRAN or Bioconductor packages. It’s also a brilliant platform for reproducible, open research, as Bioconductor beautifully illustrates with packaged and version-controlled genomes, microarray probesets, etc.

However, R is suffering from growing pains. For example, there are now 64-bit versions of R; however, vector indexing is still limited by R_len_t (see definition in src/include/Rinternals.h):

/* type for length of vectors etc */
typedef int R_len_t; /* will be long later, LONG64 or ssize_t on Win64 */
#define R_LEN_T_MAX INT_MAX

It appears that one can simply change this to a long and recompile to increase the longest possible addressable vector, but no. Take a look at R_euclidean in library/stats/src/distance.c for an example of why: almost all variables for iterating over elements in vectors are defined as integers and don’t use this type. One would have to read through every function, and every line of code, to fix this.
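
To put numbers on the limit, here is a small Python sketch using `ctypes` to mimic a C `int`. It assumes 8-byte doubles and the usual platforms where a C `int` is 32 bits. The wraparound shown is what `ctypes` does; in C itself, signed overflow is undefined behavior, which is arguably worse.

```python
import ctypes

# Largest value a C int can hold on this platform (R_LEN_T_MAX is INT_MAX).
int_bits = 8 * ctypes.sizeof(ctypes.c_int)
INT_MAX = 2 ** (int_bits - 1) - 1
print(INT_MAX)

# Memory for a numeric (double) vector of maximal length, in GiB.
bytes_per_double = 8  # assumption: IEEE 754 doubles
print(INT_MAX * bytes_per_double / 2**30)

# An int loop counter pushed one past INT_MAX does not keep counting.
wrapped = ctypes.c_int(INT_MAX + 1).value
print(wrapped)
```

With a 32-bit `int`, the longest vector is a little over two billion elements, or just under 16 GiB of doubles. That is why changing the typedef alone is not enough: every `int` loop counter has the same ceiling.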

R_len_t is just one example. Another issue is that R has been slow to adopt new compiler technologies (e.g. JIT compilation, optional type indications, etc). R almost always gains speed from pushing stuff to C (the recent bytecode compiler is an exception). This isn’t a problem in itself, but it’s a huge limitation to require developers to know not only R, but also C, and also how to interface the two. More modern languages (Java, Python, and Julia come to mind) spend more time tracking compiler technology developments and implementing them than R core does (again, Luke Tierney and the bytecode compiler are exceptions). It’s still sometimes efficient to use C with these languages (consider Cython), but developers in these languages aren’t cracking open Kernighan and Ritchie every time they need to have a for loop do something quickly.

Another gripe I have is that R language development is somewhat closed. Despite a quickly expanding user base, the number of R core contributors is not increasing. I find it hard to believe this is due to lack of interest. It seems much more likely this is due to institutional reasons that need to be changed. The nice thing about language development is that it’s really hard, so opening up R to more contributors won’t likely flood the existing core with bad ideas and patches. Personally, I would dedicate much more time to profiling, reading the source, and working on the R language if it were more open.

The last gripe I have is that R is fragmented. Consider Python:

import re
re.search(r'R-(\d+)\.(\d+)', "R-2.15").groups()

Now, consider R:

gsub("R-([0-9]+)\\.([0-9]+)", "\\1", "R-2.15")

# or

library(stringr)
str_match("R-2.15", "R-([0-9]+)\\.([0-9]+)")

Now, Python also has PyPI’s re2, but most developers are using re. The motivation behind stringr is that R’s current family of string processing functions is horribly inconsistent:

# (my ... to avoid writing all parameters)
grep(pattern, x, ...)
regexpr(pattern, text, ...)
gsub(pattern, replacement, x, ...)
strsplit(x, split, ...)

But rather than deprecate these and move forward, we now have two sets of string processing functions. Both are being used. I’m not saying Hadley Wickham is to blame here; quite the contrary, he’s trying to fix a very annoying problem in the language and should be commended. I think the community needs to be more open; for example, before writing a package that processes strings, let’s discuss an implementation plan, deprecating old functions, etc. If not, in the future R will be highly fragmented, and end up with five different object orientation systems… oh, wait.

What would it take to “challenge” R?

Contributors to Julia are optimistic they can challenge R based on a solid foundation of JIT compiling, parallelism, and nice language semantics. I salute this optimism, but I think we need to realistically consider what it would take to “challenge” R.

First, we would need to build an equal statistical computing environment. Consider moving all of stats, MASS, graphics, grid, etc to Julia. Is Julia sufficiently faster than R will be in the time it takes to port these base packages? Remember, R is a moving target; despite my few earlier gripes, R will evolve and get faster. Now, consider adding the extremely popular CRAN packages like ggplot2 and lattice to Julia. In the time it takes to port these, is Julia still sufficiently faster than R will be?

Suppose it is still faster than R. What about after we port the rest of CRAN, and all of Bioconductor, to Julia? My point isn’t to say that it’s unimaginable that Julia will surpass R. It’s that developers should really dissect what makes a successful language successful before they try to challenge it. I don’t have a horse in this race; I would love to see Julia surpass R. But if all developers want is a fast environment to analyze large data sets using a wealth of methods and libraries, it may be a lot easier to make R faster than to develop a new fast language and hope/wait/beg the community to move over.

Some Ramblings on Machine Learning in Science

2012-03-03T00:00:00-08:00

The Unbelievable Debate: Some Ramblings on Machine Learning in Science

In between refactoring some qrqc code this morning and looking at RNA-seq data, I grabbed some cold brew coffee and caught up on some missed tweets. Admittedly, my brain glosses over most tweets, but this tweet from Drew Conway had the right mix of keywords to actually make me click and read the link:

The data science debate: domain expertise or machine learning? by @medriscoll http://bit.ly/zr17Z2

I don’t mean for this title to be inflammatory, but I do believe this debate is a bit unbelievable. Machine learning is magical; I imagine that everyone that has studied it goes through a hype cycle-like set of epiphanies. This is my hype cycle story, and why I believe machine learners need to calm down, collaborate with domain experts, and together tackle hard problems.

Social Sciences and Machine Learning Caution

MONIAC: social science machine learning?

Biologists, it’s true: I’m not one of you. I’m a transplant from the social sciences. Specifically, from political science and economics (with statistics too), where my interests lie in methodology and comparative politics.

In the social sciences, the dimensionality of most problems is small enough that data mining is (at least in my experience) frowned upon. A lot of political data is collected by hand, often by undergraduates toiling away for meager pay as they try to understand some cryptic event coding protocol. There are some very large p data sets: The Political Instability Task Force’s data set is something I’ll keep mentioning. Mining this data with algorithms looking for interesting relationships was exactly how I was taught not to do political science.

I recall one story of a candidate giving a job talk who mentioned he used forward step-wise regression to find interesting variables (in a presumably small p data set), and three people immediately stood up and left. I was proud to be knowledgeable about statistical learning techniques, but to avoid them. Political science had flirted with neural networks to understand massive state failure data sets, but my endless gripe was that these were predictive, not causal models. The latter required some a priori testable theory, often derived from an intimate knowledge of political crises in a variety of countries. Just as I thought biologists knew C. elegans or S. cerevisiae well enough to form interesting experiment ideas, political scientists knew many political crises well enough to form theories and test them on a larger set of data in a quantitatively rigorous fashion. I also believed that predictive models of state failure may predict recorded (even out-of-sample!) state failures well, but a model backed by a good theory that fits existing data slightly less well could predict unseen cases even better (Bruce Bueno de Mesquita has an entire wonderful book about game theory being such a model).

The Machine Learning Awakening

When I made the jump to analyzing gene expression data, I was initially astounded at how many algorithms were thrown at it. I had this vision of the hard sciences having randomization and experimentation at their disposal to lead to the purest causal findings. Looking for any differences in 30,000 genes’ expression values and then forming hypotheses afterwards seemed backwards. Microarrays shocked biology with what they revealed about cancer and the cell, but they probably shocked the methods of experimental biology more. If your average biologist had a tenuous knowledge of p-values to begin with, now microarray analysts were throwing around false discovery rates, empirical Bayesian techniques, Storey’s q-value, etc.

However, as I analyzed more and more sets of data, the initial reluctance I had about employing machine learning algorithms disappeared. In hype cycle terms, I was climbing the peak of inflated expectations. A quote from Michael E. Driscoll’s article captures this excitement:

Claudia Perlich, a three-time winner of the KDD Nuggets competition, stood up and shared how she had won contests in domains as varied as “breast cancer, movie prediction, and sales performance - and I can tell you I knew next to nothing about those things when I started.”

This optimism is abundant, and not entirely without justification. Coming from a non-biological background yet thoroughly understanding machine learning provides an immensely satisfying feeling of understanding of the cell. Employing all sorts of machine learning techniques, I could find “biologically interesting” genes in data sets and help biologists understand the cell.

A Few Epiphanies and a Dip of Disillusionment

The Hype Cycle’s lowest stage is the “trough of disillusionment”. Machine learning in biology certainly hasn’t had its trough (and I don’t think it will), but it is priming up to have its “slope of enlightenment” and “plateau of productivity”. There will be future machine learning hype cycles in biology, especially as multiple heterogeneous data sets need to be simultaneously mined to understand the cell with the systems approach.

My personal dip didn’t happen because machine learning left me with a particularly terrible result. It occurred because (1) I had an interaction with an experimental biologist and (2) I realized how wonderfully complex the cell is.

Let’s Put That in This

The first interaction I had was with a graduate student friend of mine. We were discussing an interesting finding the Korf Lab made: that some introns lead to increased expression (paper here). Introns traditionally haven’t had the same attention as promoters or enhancers in regulating gene expression. A member of the Korf lab had previously mentioned intron-mediated expression in passing to me, and I immediately started imagining ways I could look for such an effect in silico. As I understood it, in silico was how it was first discovered, further increasing my admiration of algorithms applied to biology. When my friend mentioned it again, the first thing she said was, “well, we just need to take that intron and put it in something”.

I immediately agreed, but I realized something: I hadn’t thought of that simple step the first time I thought about intron-mediated expression. Machine learning can bring so much wealth in finding interesting relationships that my mind had glossed over the most important question in science: whether these relationships were spurious or causal. This is why my training in the social sciences was rigidly anti-machine learning: it’s far too easy to let our thought processes about understanding a complex system be biased by some spurious relationships machine learning and predictive models can quickly give us.

The Complexity of the Cell

The epiphany was gradual (and still occurring): the cell is wonderfully complex, or as my mind puts it “fucking awesomely complex”. Machine learning applied to gene expression data gives valuable insights into a complex system, but it’s really a messy snapshot. I think we’ll look at current pristine RNA-seq experiments in twenty years and we’ll realize they’re giving us an image into cellular activity that is akin to a photograph from a cheap Soviet-era camera.

Measuring gene expression from many cells glosses over interesting variation in each cell; this is certainly not a new complaint. However, even a single cell image is messy: mRNAs that make their way into gene expression values may have never been exported from the nucleus, they could have been degraded by the cell, silenced, undergone post translational modification, etc. What’s astounding is that these systems are not just complex, but are amazingly accurate. Cellular data is messy, but the cell certainly isn’t. Development is a prime example of how tightly regulated these processes are. It’s up to us to understand these tightly regulated systems with the messy images scientific data gives us. Machine learning is a necessary, but not sufficient tool to help us understand the cell.

As an example, genes interact in groups, and many algorithms can gloss over this detail. If an algorithm tries to find a sparse set of genes that are biologically interesting to the problem at hand, it may be indifferent to which it includes from a set of co-expressed genes (consider the lasso against the elastic net here). If a biologist reviews these findings, they could easily miss something vastly important based on machine learning’s indifference.

Let’s Use Both.

These epiphanies are now what guide my path through biology and machine learning. I still love and am infatuated with machine learning (although I much prefer the phrase statistical learning). However, if we wish to understand a complex system, we need to take the approach that modern biology does: leverage machine learning with a priori biological expert knowledge to bootstrap findings. We need to design experiments that also harness the power of machine learning to help us understand, and not just predict, the behavior of complex systems. Applied machine learners need to realize the power of experimental data. Chances are if you’re finding everything you think is out there with just machine learning, you’re making a mistake or your problem is too simple.

Elucidating k-mer Contamination with Kullback-Leibler Divergence

2012-03-01T00:00:00-08:00

Severe Read Contamination

Recently a coworker showed me a FASTQ file from an Illumina HiSeq run (which will be packaged in the new release of my Bioconductor package qrqc) that was severely contaminated. Below is the file in less with a string highlighted:

Holy contamination, Batman! There are a few approaches to handling this level of contamination. The program tagdust will match contaminated reads and remove them. My program Scythe is being changed so that it can match adapter contaminants further in the read using its Bayesian model. Both programs require a priori knowledge of possible contaminant sequences. What if this is a novel sequence contaminant? In this case, AAGCAGTGGTATCAACGCAGAGT appears to be a PCR primer related to DSN normalization that may not have made it into our adapter files.

k-mer Entropy Approaches

k-mer approaches are a nice way of searching for such contaminants. I am currently adding this feature to qrqc; you can follow development on the kmer branch on Github but this branch may merge into master and disappear.

The C functions I’ve written use Heng Li’s khash.h library to quickly hash k-mer sequences with their positions in the read. The end result is a data frame in R of k-mer sequence, position, and counts in that position.
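
The shape of that result can be sketched in a few lines of Python. This is not the qrqc C code and does not use khash.h; it is a plain dictionary-based stand-in, with toy reads and a hypothetical function name (`kmer_position_counts`), that produces the same kind of (k-mer, position, count) rows.

```python
from collections import Counter

def kmer_position_counts(reads, k):
    """Count each k-mer at each start position across all reads."""
    counts = Counter()
    for read in reads:
        for pos in range(len(read) - k + 1):
            kmer = read[pos:pos + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[(kmer, pos)] += 1
    return counts

# Toy reads, made up for illustration.
reads = ["TATCAACGCA", "TATCAACGTT", "GGTCAACGCA"]
counts = kmer_position_counts(reads, k=6)

# Rows of (k-mer, position, count), like the data frame described above.
rows = sorted((kmer, pos, n) for (kmer, pos), n in counts.items())
for row in rows:
    print(row)
```

Keying on the (k-mer, position) pair is what lets the later analyses ask about enrichment at a particular position, not just overall.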

Looking at raw k-mer counts is somewhat useful, but I’ve been exploring some information theoretical approaches to analyzing this data. One useful graphic is entropy of k-mers by position:

These are 6-mers, so there are 4,096 possible k-mers (excluding N). If the k-mer distribution were uniform, 12 bits would be needed to encode each k-mer. This graph illustrates that even at the most random 3’-end of the read, only about 6.5 bits are needed. In the first 20 bases, the distribution of k-mers is so skewed that the Shannon entropy is less than four bits.
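
The 12-bit figure is easy to check. Below is a minimal Python sketch of Shannon entropy computed from counts; the function name and the skewed counts are made up for illustration, and this is not the qrqc implementation.

```python
from math import log2

def shannon_entropy(counts):
    """Shannon entropy in bits of the distribution implied by counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# A uniform distribution over all 4**6 possible 6-mers.
uniform = [1] * 4**6
print(shannon_entropy(uniform))  # 12.0

# A skewed toy distribution: one k-mer accounts for half of the counts.
skewed = [8, 4, 2, 1, 1]
print(shannon_entropy(skewed))   # 1.875
```

Applied per position to k-mer counts, this gives one entropy value per position. The more a few k-mers dominate a position, the lower the value.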

Kullback-Leibler Divergence Approach

It makes sense biologically that the k-mers don’t have uniform frequency. In the case of read contaminants, the enrichment by position against an empirical k-mer distribution may be as interesting as total k-mer enrichment against a random distribution model.

To assess this, some beta qrqc code pools k-mer counts across position to find an empirical k-mer distribution. Then, the k-mer distribution per position is compared to the pooled distribution using the Kullback-Leibler divergence. K-L divergence is only defined when both distributions sum to 1, the sample spaces are the same, and q(i) > 0 for every i such that p(i) > 0.

In essence, the K-L divergence measures the average number of extra bits needed to encode data from P with a code based on the distribution of Q rather than on P itself. In the k-mer case, Q is the empirical distribution of k-mers irrespective of k-mer position and P is the position-specific distribution of k-mers. Thus, an enrichment of k-mers at a particular position would lead to more divergence.
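
Here is a rough Python sketch of that comparison, not the beta qrqc code. The counts and the names `normalize` and `kl_divergence` are made up for illustration. Each term of the sum is p(i) log2(p(i) / q(i)), with P a position's k-mer distribution and Q the pooled one.

```python
from collections import Counter
from math import log2

def normalize(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def kl_divergence(p, q):
    """D(P || Q) in bits; needs q[kmer] > 0 wherever p[kmer] > 0."""
    return sum(p_k * log2(p_k / q[kmer]) for kmer, p_k in p.items() if p_k > 0)

# Hypothetical k-mer counts at three read positions.
by_position = {
    0: Counter({"TATCAA": 6, "ACGTAC": 1, "GGCATT": 1}),
    1: Counter({"TATCAA": 2, "ACGTAC": 3, "GGCATT": 3}),
    2: Counter({"TATCAA": 4, "ACGTAC": 2, "GGCATT": 2}),
}

# Pool counts across positions to get the empirical distribution Q.
pooled = Counter()
for counts in by_position.values():
    pooled.update(counts)
q = normalize(pooled)

divergence = {pos: kl_divergence(normalize(counts), q)
              for pos, counts in by_position.items()}
for pos, d in divergence.items():
    print(pos, round(d, 4))
```

Because Q is pooled from the same counts, any k-mer seen at a position also has non-zero pooled probability, so the q(i) > 0 condition holds automatically. Position 2 matches the pooled distribution exactly and has zero divergence; positions 0 and 1 do not.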

A nice feature of ggplot2 is the stacking of the “bar” geom. Since K-L divergences are sums, stacking and setting fill color by k-mer (the terms of the sum) gives us a sense of the total divergence and each k-mer’s effect on the total. There are too many k-mers to plot, so I have some procedures that find a nice subset. Because this is a subset, the K-L total (indicated by bar height) is wrong, but the graphical interpretation is easier. Now, the enrichment by position is clear:

This messy dataset has repeat primer contamination. Note that because we’re plotting a subset of k-mers, the plotted total can be negative at some positions (not mathematically possible for a full K-L divergence) because we’re leaving out terms in the sum, but the meaning still comes through. Also note that there is k-mer nesting: the first wide peak begins with k-mer TATCAA, then ATCAAC, then TCAACG, etc. This indicates that we could adjust k and find the entire repeated k-mer.
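
The negative totals are easy to reproduce. In this small Python sketch (toy distributions and a hypothetical `kl_terms` helper, not the plotting code), the individual terms of the sum can be negative even though the full sum never is, so dropping an enriched k-mer from the subset can leave a negative partial total.

```python
from math import log2

def kl_terms(p, q):
    """Per-k-mer terms p(i) * log2(p(i) / q(i)); they sum to D(P || Q)."""
    return {kmer: p_k * log2(p_k / q[kmer]) for kmer, p_k in p.items() if p_k > 0}

# Toy pooled (Q) and position-specific (P) distributions.
q = {"TATCAA": 0.25, "ACGTAC": 0.25, "GGCATT": 0.25, "CCTGAA": 0.25}
p = {"TATCAA": 0.5, "ACGTAC": 0.25, "GGCATT": 0.125, "CCTGAA": 0.125}

terms = kl_terms(p, q)
full = sum(terms.values())
subset = sum(v for kmer, v in terms.items() if kmer != "TATCAA")

print(terms["TATCAA"])  # 0.5, the enriched k-mer contributes positively
print(full)             # 0.25
print(subset)           # -0.25, the enriched term was left out
```

Depleted k-mers contribute negative terms and enriched ones positive terms, which is also why coloring the stacked bars by k-mer is informative.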

Update

I’ve added faceting of multiple SequenceSummary objects’ KL/k-mer diagnostics. Combined with a random data file, this really illustrates contamination:

This is still in development; follow the code on Github and feel free to contact me and make suggestions.

Articles of Note, by Topic

2012-02-23T00:00:00-08:00

Gene prediction

How does eukaryotic gene prediction work?, Brent, 2007.

RNA-Seq

Normalization

Removing technical variability in RNA-seq data using conditional quantile normalization, Hansen, 2011.

Biases

Improving RNA-Seq expression estimates by correcting for fragment bias, Roberts et al., 2011.

Biological Networks

Network Modeling Reveals Prevalent Negative Regulatory Relationships between Signaling Sectors in Arabidopsis Immune Signaling, Sato et al., 2010.

Please developers, don't be dicks.

2012-02-21T00:00:00-08:00

As the author of a few open source tools, I’ve had my fair share of users seeking help. Emails range from the very useful (bug reports, patches, etc) to the annoying (“can you help guide me through this process”). But never once (that I can remember) have I been a dick (and yes, I’ve wanted to be). It will be tricky to write this without sounding self-righteous, but I hope to make the case that open source developers should never be dicks.

We’ve All Been There (WABT)

The first reason to never be a dick is that We’ve All Been There (I’m going to give this the acronym WABT). Even the most voracious and diligent manual readers can suffer from the XY problem. A user comes asking how to do Y, which they think is the solution to X. However, it’s a bad solution to X and they don’t know this. These situations will always lead to frustration: users waste time explaining Y, and helpers waste time explaining how to do Y only to realize the user wanted X. But this is not the user’s fault; it just takes programming practice to realize Y is not the correct way to do X.

We’ve all had these problems in our early stages as developers. Being a dick in these cases will not help the user grok anything. They’re already frustrated - that’s why they’re asking for your help. Being a dick will cause them to get more frustrated and really not grasp anything. They’re not going to have an “ah ha!” moment when they’re too busy trying to come up with a witty response to your burn on IRC.

PCTM has the same number of letters as RTFM

Please Check The Manual (PCTM) has the same number of letters as Read The Fucking Manual (RTFM). I strongly believe it takes more energy for a developer to be a dick than to be nice. We’ve all had dumb questions that disrupt our workflow, make us angry, etc. But being a dick back does not discourage this behavior. Write some boilerplate text for responding to users’ questions. Make this a FAQ. Then respond, PCTM (Please Check the Manual) and send them the link. If they get needy, tell them open source software doesn’t come with a warranty.

People remember dicks

Someone was once quite rude to me via email (I’ll name him Tom). I had voiced some frustrations with software Tom wrote and he attacked me for these public comments. Now, as an aside, there’s a lot of shitty software out there, and signals about software quality (even noisy signals) are very valuable. Tom on one hand attacked me for saying something negative about his software, and on the other hand asked me to help fix it, emphasizing it was open source software. I agree with this sentiment 100%; however, the email was clearly very angry.

I told another developer who I’ll call Jerry about the encounter, and he laughed. Apparently, Tom nagged Jerry about portability issues of Jerry’s software years ago. This is evidence of my first point, WABT. It also shows that developers remember interactions with other developers really well. Since then, I’ve also heard other programmers complaining about interactions with Tom. This is all too bad, as Tom is probably very nice in person and certainly a good programmer.

If you’re a dick, you’re hurting OSS

OSS has seen an explosion in recent years. Biologists, ecologists, and social scientists that never thought they’d write code are using R to analyze data. Folks frustrated by Windows are installing Ubuntu and asking for help. In the early days of OSS, Usenet, and IRC, it was an acceptable norm to be a dick. Now, it’s not.

OSS benefits from a large user base, but it will have growing pains. Being a dick does not alleviate these pains, it makes them worse. Let’s go back to my story about Tom.

In the second half of Tom’s email (after attacking me), he asked me to help him fix his software. Now, collaboration can be difficult; code style clashes, merges fail, frustration is common. In a small project, you’re really in bed with your collaborators. Now that Tom has sent me the signal he’s nasty in correspondences, do you think I’ll work on this project with him? Hell no. I’d rather fork, fix the problem and encourage others to use my software. Of course this is bad for OSS; consider this passage from Eric S. Raymond’s Jargon File:

Forking is considered a Bad Thing - not merely because it implies a lot of wasted effort in the future, but because forks tend to be accompanied by a great deal of strife and acrimony between the successor groups over issues of legitimacy, succession, and design direction. There is serious social pressure against forking.

Tom’s actions guarantee I will avoid working on his projects at all costs. The two other developers, and anyone else we’ve told will too. In the end, the software loses.

Idolize programmers, not their dickishness

Some abrasive programmers are really gifted. Erik Naggum is regarded as the first Usenet flamer. Theo de Raadt forked NetBSD into what became OpenBSD partially because of issues with other developers. Richard Stallman gave an AMA on reddit a year ago and the most popular question (since deleted) was about a young GNU-lover that was nervous about asking RMS a question and accidentally referred to it as “Linux”, not “GNU/Linux” and RMS ripped him in half.

Now, all of these developers have been dickish and are well-known because they are gifted visionaries. I’m not sure why, but other developers admire this dickishness. But don’t idolize their dickishness, idolize their skill. There are also overwhelmingly nice programmers: John McCarthy, Donald Knuth, and Alan Turing to name a few. Admire their skill and their personality.

Being a dick hurts science

There’s been an explosion of open source software utilization in the sciences. My field, bioinformatics, provides an interesting case study. There are bioinformaticians like myself that write software. Users are divided into other programmer types (other bioinformaticians) and biologists (on average, less knowledgeable of programming). All else equal, biologists and bioinformaticians prefer free, open source software to costly proprietary software.

For these reasons, being helpful and nice to scientific users is really important. For biologists, choosing tools is about getting analysis done quickly and easily. Rude bioinformaticians will quickly increase the cost of using OSS tools, which is already high for many biologists who aren’t experienced with Unix tools and programming. Consequently, science could become less open, something neither group wants.

Arduino Notes

2012-02-21T00:00:00-08:00

The following are some notes/links I’ve come across when working with my Arduino Uno.

Arduino Development

Ultimately I want to develop through Emacs, not the Arduino IDE. Doing this well involves (1) a good mode for Emacs, (2) interfacing with avr-gcc, and (3) a makefile or script that uploads the binary to the Arduino.

GPS

GPS buying guide (via Sparkfun)

OpenLog

Hello, static blogging

2012-02-20T00:00:00-08:00

Why static blogging?

Simplicity

I spend hours each week in Emacs, and I’ve spent ages customizing it. With Jekyll, Markdown, and Blueprint I quickly built this site in my favorite editor. I can spend time writing content, not fighting a browser WYSIWYG editor.

Mentality

Also, I want this site to function as a notebook. Most blogging systems lead to a post-and-forget mentality. Revision control makes, well, revisions easier.

Plaintext

Plaintext is powerful (see org-mode if you don’t believe me). If I want to search all my posts using regular expressions, I can with grep. How could this be done if I were using WordPress or any blogging system that uses a database backend?

Code-friendly

Jekyll plays well with Pygments, a Python syntax highlighter. This is important because I want to share code.

    # Jekyll is code friendly
    x <- seq(-1, 3, 0.1)
    y <- sin(x) + rnorm(length(x), 0, 0.3)
    plot(x, y)

Formulae

This was my final worry: org-mode is a bit “heavy” for publishing quick notes (although entirely necessary for other things), but it has excellent LaTeX integration. I would be very disappointed if I couldn’t easily incorporate LaTeX into Jekyll. Luckily, I can! The default Maruku Markdown interpreter just needs to have LaTeX support turned on (see below)… and voilà! I can include math:

My configuration

There’s a myriad of Ruby tools and plugins for Jekyll - it’s all a bit intimidating. Here are a few notes on my configuration. My _config.yml file is definitely important.

Maruku and LaTeX

I am using the default Markdown parser, Maruku. Maruku has formulae support, but it is experimental and not on by default. Formulae are very important to my notebook, so getting them to work is paramount. Here’s how I did it:

1. Add the Maruku configurations below. Note the trailing slashes - these are important (and a sign that LaTeX support is still a bit rough.)

maruku:
  use_tex: true
  use_divs: true
  png_engine: blahtex
  png_dir: images/latex/
  png_url: /images/latex/

2. Install Xerces-C. On a Mac with MacPorts, use sudo port install xercesc. This is a requirement for blahtexml, which Maruku uses.

3. Install blahtexml. Get the source at http://gva.noekeon.org/blahtexml and on a Mac, use make blahtex-mac and put this binary in /usr/local/bin/ or some other place in your $PATH.

Pygments and Blueprint CSS

Blueprint CSS has a class highlight in screen.css that clashes with Pygments (giving code a hideous yellow background). I just renamed the class highlight-alt in screen.css and everything looks great now.

Build errors with Github

Github’s Jekyll engine apparently runs with jekyll --pygments --safe which disables plugins. LaTeX to PNG rendering is built into Maruku, not a plugin, yet builds still fail on Github (even though they are fine with these options locally). I also host this site at http://vincebuffalo.org, so there’s not a huge problem here.

Looking for Risk Variants using `gwascat`