<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>LIBD rstats club</title>
    <link>http://LieberInstitute.github.io/rstatsclub/</link>
      <atom:link href="http://LieberInstitute.github.io/rstatsclub/index.xml" rel="self" type="application/rss+xml" />
    <description>LIBD rstats club</description>
    <generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>© 2018-2024, under CC BY-NC-SA 4.0. All thoughts and opinions here are our own. The icon is a modified version of the [R logo](https://www.r-project.org/logo/).</copyright><lastBuildDate>Wed, 01 Nov 2023 00:00:00 +0000</lastBuildDate>
    <image>
      <url>http://LieberInstitute.github.io/rstatsclub/images/logo_huedd053945c8b535337cc8577c9427bba_3096_300x300_fit_lanczos_3.png</url>
      <title>LIBD rstats club</title>
      <link>http://LieberInstitute.github.io/rstatsclub/</link>
    </image>
    
    <item>
      <title>L. Collado-Torres</title>
      <link>http://LieberInstitute.github.io/rstatsclub/about/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>http://LieberInstitute.github.io/rstatsclub/about/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Our first adventure with Visium Spatial Proteogenomics</title>
      <link>http://LieberInstitute.github.io/rstatsclub/2023/11/01/our-first-adventure-with-visium-spatial-proteogenomics/</link>
      <pubDate>Wed, 01 Nov 2023 00:00:00 +0000</pubDate>
      <guid>http://LieberInstitute.github.io/rstatsclub/2023/11/01/our-first-adventure-with-visium-spatial-proteogenomics/</guid>
      <description>&lt;p&gt;&lt;em&gt;By &lt;a href=&#34;https://twitter.com/sanghokwon17&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Sang Ho Kwon&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Recent advancements in spatially-resolved transcriptomics (SRT) technologies have ushered in a new era of possibilities for biological research. These technologies offer the unique ability to map biomolecular information within the native tissue architecture. Preserving the spatial resolution of genome-wide gene expression allows researchers to obtain a more holistic view of the tissue microenvironment, particularly the underlying molecular and cellular dynamics in a spatial-anatomical context, which is useful to understand the composition, states, and function of individual cell types, as well as their interactions with one another in a defined microenvironment. Visium Spatial Gene Expression from 10x Genomics is a widely used and validated next generation sequencing (NGS)-based SRT platform.&lt;/p&gt;
&lt;p&gt;In 2021, 10x Genomics extended the capabilities of the Visium Spatial platform, creating the &lt;a href=&#34;https://www.10xgenomics.com/products/spatial-gene-and-protein-expression&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Visium Spatial Proteogenomics (Visium-SPG)&lt;/a&gt; platform by introducing immunofluorescence protein staining into the workflow. This integration of spatial transcriptomics and immunofluorescence-based protein identification significantly enhanced the power of spatial -omics. This capability provides a more comprehensive breadth of biomolecular information, bridging genome-wide gene expression with the specific protein expression of interest within undissociated tissue sections at a high level of spatial resolution.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://doi.org/10.1089/genbio.2023.0019&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;&lt;img src=&#34;images/GenBiotech_Image_SpatialAD.png&#34; alt=&#34;&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Our study serves as a proof-of-concept for the power of spatial proteogenomic profiling in characterizing human brain pathology. Leveraging the Visium-SPG platform, we first mapped local brain microenvironments bearing Alzheimer’s disease pathology by identifying the presence of amyloid-beta and phosphorylated tau. We then investigated the transcriptional signatures surrounding amyloid-beta pathology in postmortem human brain tissue from donors diagnosed with Alzheimer&amp;rsquo;s disease. We further conducted a comprehensive computational analysis that allowed us to deconvolute Visium data at the level of individual expression spots, enabling us to predict the relative enrichment of astrocyte and microglia populations surrounding amyloid plaques in comparison to neurons. Additionally, we employed an orthogonal RNA detection technology, RNAscope single molecule Fluorescence In Situ Hybridization (smFISH), to finely resolve gene expression changes of a selected subset of differentially expressed genes (DEGs) identified with Visium-SPG at cellular resolution.&lt;/p&gt;
&lt;p&gt;Overall, our study provides a roadmap for a comprehensive data analysis workflow that encompasses various experimental platforms, such as Visium-SPG and RNAscope smFISH, along with a diverse range of computational software tools, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;VistoSeg&lt;/code&gt; for image-based preprocessing/segmentation,&lt;/li&gt;
&lt;li&gt;&lt;code&gt;spatialLIBD&lt;/code&gt;, &lt;code&gt;scran&lt;/code&gt;, &lt;code&gt;limma&lt;/code&gt;, and other Bioconductor packages,&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Harmony&lt;/code&gt; for batch correction,&lt;/li&gt;
&lt;li&gt;&lt;code&gt;BayesSpace&lt;/code&gt; for unsupervised clustering,&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Cell2location&lt;/code&gt; for spot deconvolution,&lt;/li&gt;
&lt;li&gt;&lt;code&gt;MAGMA&lt;/code&gt; for genetic risk enrichment analysis.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This integrated approach empowered us to delve into spatial proteogenomic analysis and examine the local tissue microenvironment harboring neuropathological lesions enriched with amyloid-beta pathology at the genome-wide gene expression level. We anticipate that our work will lay the groundwork for the next frontiers of spatial multi-omics and contribute to a more comprehensive understanding of complex human brain biology and pathology.&lt;/p&gt;
&lt;p&gt;For a more in-depth exploration of our recent work, we share a link to our paper officially published in a special issue of &lt;em&gt;GEN Biotechnology&lt;/em&gt; focusing on spatial -omics: &lt;a href=&#34;https://doi.org/10.1089/genbio.2023.0019&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;https://doi.org/10.1089/genbio.2023.0019&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;After 10+ years of research, my FIRST 1st-author paper &amp;amp; &lt;a href=&#34;https://twitter.com/biorxivpreprint?ref_src=twsrc%5Etfw&#34;&gt;@biorxivpreprint&lt;/a&gt; is finally OUT🥹I delved deep into the human inferior temporal cortex🧠in &lt;a href=&#34;https://twitter.com/hashtag/Alzheimersdisease?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#Alzheimersdisease&lt;/a&gt;, using &lt;a href=&#34;https://twitter.com/hashtag/VisiumSPG?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#VisiumSPG&lt;/a&gt; &lt;a href=&#34;https://twitter.com/hashtag/VectraPolaris?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#VectraPolaris&lt;/a&gt; &lt;a href=&#34;https://twitter.com/hashtag/RNAscope?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#RNAscope&lt;/a&gt; &amp;amp; &lt;a href=&#34;https://twitter.com/hashtag/HALO?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#HALO&lt;/a&gt; &lt;a href=&#34;https://twitter.com/Indica_Labs?ref_src=twsrc%5Etfw&#34;&gt;@Indica_Labs&lt;/a&gt;. Check out our📜at &lt;a href=&#34;https://t.co/QQEy1ZWc1V&#34;&gt;https://t.co/QQEy1ZWc1V&lt;/a&gt;🔥 &lt;a href=&#34;https://t.co/dBZNzL1S96&#34;&gt;pic.twitter.com/dBZNzL1S96&lt;/a&gt;&lt;/p&gt;&amp;mdash; Sang Ho (Sangho) Kwon (@sanghokwon17) &lt;a href=&#34;https://twitter.com/sanghokwon17/status/1650589385379962881?ref_src=twsrc%5Etfw&#34;&gt;April 24, 2023&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;


&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;The inaugural SPECIAL ISSUE on &lt;a href=&#34;https://twitter.com/hashtag/SpatialOmics?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#SpatialOmics&lt;/a&gt; is here!&lt;br&gt;-Profiling Alzheimer’s using &lt;a href=&#34;https://twitter.com/10xGenomics?ref_src=twsrc%5Etfw&#34;&gt;@10xGenomics&lt;/a&gt; &lt;a href=&#34;https://twitter.com/lcolladotor?ref_src=twsrc%5Etfw&#34;&gt;@lcolladotor&lt;/a&gt;&lt;br&gt;-Spatial proteome of head and neck tumors &lt;a href=&#34;https://twitter.com/AkoyaBio?ref_src=twsrc%5Etfw&#34;&gt;@AkoyaBio&lt;/a&gt;&lt;br&gt;-Hyperspectral imaging &lt;a href=&#34;https://twitter.com/yaojj02?ref_src=twsrc%5Etfw&#34;&gt;@yaojj02&lt;/a&gt;&lt;br&gt;-Spatial omics and &lt;a href=&#34;https://twitter.com/hashtag/organoids?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#organoids&lt;/a&gt; &lt;a href=&#34;https://twitter.com/ahmetfcoskun?ref_src=twsrc%5Etfw&#34;&gt;@ahmetfcoskun&lt;/a&gt;&lt;br&gt;...and more!&lt;a href=&#34;https://t.co/ERpXTlih08&#34;&gt;https://t.co/ERpXTlih08&lt;/a&gt; &lt;a href=&#34;https://t.co/Moar3K3T3Q&#34;&gt;pic.twitter.com/Moar3K3T3Q&lt;/a&gt;&lt;/p&gt;&amp;mdash; GEN Biotechnology (@GENBiotechJrnl) &lt;a href=&#34;https://twitter.com/GENBiotechJrnl/status/1714304038320349394?ref_src=twsrc%5Etfw&#34;&gt;October 17, 2023&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;


&lt;p&gt;&lt;em&gt;Original draft by &lt;a href=&#34;https://twitter.com/sanghokwon17&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Sang Ho Kwon&lt;/a&gt; M.S. Reviewed and edited by &lt;a href=&#34;https://twitter.com/martinowk&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Keri Martinowich&lt;/a&gt; Ph.D and &lt;a href=&#34;https://twitter.com/lcolladotor&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Leonardo Collado-Torres&lt;/a&gt; Ph.D.&lt;/em&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Introduction to BiocMAP</title>
      <link>http://LieberInstitute.github.io/rstatsclub/2023/09/27/introduction-to-biocmap/</link>
      <pubDate>Wed, 27 Sep 2023 00:00:00 +0000</pubDate>
      <guid>http://LieberInstitute.github.io/rstatsclub/2023/09/27/introduction-to-biocmap/</guid>
      <description>&lt;p&gt;By &lt;a href=&#34;https://nick-eagles.github.io/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Nick Eagles&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Over the past few years, I&amp;rsquo;ve had the opportunity to work with a lot of whole-genome bisulfite-sequencing (WGBS) datasets. They provide a powerful way to look at DNA methylation on a genome-wide scale, in contrast to microarrays, which target a narrower set of important CpG sites across the genome. But for this same reason, the data is often unwieldy and can feel difficult to tackle even with access to powerful computational resources. At LIBD, we were excited by the opportunity to better characterize the role of methylation in development and psychiatric disorders like schizophrenia, and we&amp;rsquo;ve performed WGBS on thousands of samples in just a few years.&lt;/p&gt;
&lt;h2 id=&#34;the-challenges-of-wgbs&#34;&gt;The Challenges of WGBS&lt;/h2&gt;
&lt;p&gt;Despite the massive research opportunity, we had a huge computational challenge in the way. How could we turn thousands of raw sequencing files into methylation proportions for each gene? It&amp;rsquo;s not like the basic logistics of this preprocessing task are unsolved&amp;ndash; in fact, some great tools like &lt;a href=&#34;https://github.com/nf-core/methylseq&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;nf-core/methylseq&lt;/a&gt; exist to chain together the various steps (alignment to a reference genome, counting methylated and unmethylated reads of each gene, etc.) into a fairly easy-to-use workflow. Could we just use something like &lt;code&gt;nf-core/methylseq&lt;/code&gt;?&lt;/p&gt;
&lt;p&gt;At the scale of our datasets, existing pipeline tools could take &lt;em&gt;years&lt;/em&gt; of (wall clock!) computational time, even with access to a high-performance computing cluster. We also noticed that many existing solutions would simply run out of memory, even when allocated gigantic (hundreds of GBs) amounts of RAM. We knew that our situation was unique&amp;ndash; and we&amp;rsquo;d need to carefully implement a workflow that was optimized for speed and efficient memory use.&lt;/p&gt;
&lt;h2 id=&#34;our-solution&#34;&gt;Our Solution&lt;/h2&gt;
&lt;p&gt;We developed &lt;a href=&#34;https://doi.org/10.1186/s12859-023-05461-3&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;BiocMAP&lt;/a&gt; after refining our internal preprocessing workflow. Much of the speed gain came simply from using &lt;a href=&#34;https://doi.org/10.1093/bioinformatics/bty167&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Arioc&lt;/a&gt;, a GPU-based tool for alignment to the reference genome, at a time when the standard in the field was to use &lt;a href=&#34;https://doi.org/10.1093/bioinformatics/btr167&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Bismark&lt;/a&gt; or other CPU-based tools. We limited memory usage with tricks like splitting data by chromosome and using disk-based backends where possible, details we describe in the manuscript.&lt;/p&gt;
&lt;p&gt;The &amp;ldquo;Bioc&amp;rdquo; in BiocMAP stands for &lt;em&gt;Bioconductor-friendly&lt;/em&gt;&amp;ndash; BiocMAP collects all the methylation counts and proportions into &lt;code&gt;SummarizedExperiment&lt;/code&gt;-based objects in R, since these objects are how the Bioconductor community likes to represent experimental data of all kinds. A whole ecosystem of R packages is built around performing statistical analyses on &lt;code&gt;SummarizedExperiment&lt;/code&gt;-based objects.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2023-09-27-introduction-to-biocmap/SummarizedExperiment_structure.svg&#34; alt=&#34;Image credit: Morgan et al, retrieved from https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html&#34;&gt;&lt;/p&gt;
&lt;p&gt;So I&amp;rsquo;m excited that we&amp;rsquo;ve now published the paper and that the software is ready to share with the world!&lt;/p&gt;
&lt;h2 id=&#34;using-biocmap&#34;&gt;Using BiocMAP&lt;/h2&gt;
&lt;p&gt;We aimed to make BiocMAP simple to install and use on a variety of computing environments, while allowing a good deal of customization for interested users. I&amp;rsquo;ll show examples of running BiocMAP on a SLURM-managed cluster, though running on an SGE-managed cluster or just a single machine is possible too.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;#   Install BiocMAP with singularity
git clone git@github.com:LieberInstitute/BiocMAP.git
cd BiocMAP
bash install_software.sh singularity

sbatch run_first_half_slurm.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That &lt;code&gt;install_software.sh&lt;/code&gt; script allows you to use Docker or Singularity to set BiocMAP up, and then we have shell scripts and configuration files that make it easy to run on computing clusters that have job schedulers like SLURM or SGE.&lt;/p&gt;
&lt;p&gt;We split the BiocMAP pipeline into two pieces because our experience with processing WGBS data involved collaboration and the use of more than one computing cluster. Since GPUs are still a relatively recent addition to many computing environments, some clusters have more impressive GPU resources, while others have more CPUs or overall memory. We found it useful to allow the flexibility of running the GPU-intensive alignment in a different location than the remaining analysis steps. Nothing&amp;rsquo;s stopping you from running everything on one machine or cluster though:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;sbatch run_first_half_slurm.sh

#   Once the first module finishes:
sbatch run_second_half_slurm.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Do you have a lot of WGBS data and access to GPUs? BiocMAP may be helpful in powering through the preprocessing so you can focus on the interesting part of your research&amp;ndash; the statistical analysis.&lt;/p&gt;
&lt;p&gt;Check out our &lt;a href=&#34;https://doi.org/10.1186/s12859-023-05461-3&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;manuscript&lt;/a&gt;, &lt;a href=&#34;https://github.com/LieberInstitute/BiocMAP&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;code&lt;/a&gt;, and &lt;a href=&#34;http://research.libd.org/BiocMAP/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;documentation&lt;/a&gt;!&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Lessons Learned Applying Tangram on Visium Data</title>
      <link>http://LieberInstitute.github.io/rstatsclub/2021/03/09/lessons-learned-applying-tangram-on-visium-data/</link>
      <pubDate>Tue, 09 Mar 2021 00:00:00 +0000</pubDate>
      <guid>http://LieberInstitute.github.io/rstatsclub/2021/03/09/lessons-learned-applying-tangram-on-visium-data/</guid>
      <description>
&lt;script src=&#34;http://LieberInstitute.github.io/rstatsclub/rstatsclub/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;By &lt;a href=&#34;https://github.com/Nick-Eagles&#34;&gt;Nick Eagles&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;We’ve recently been interested in exploring the (largely python-based) tools others have published to process spatial transcriptomics data for various end goals. A common goal is to integrate data from platforms like &lt;a href=&#34;https://www.10xgenomics.com/products/spatial-gene-expression&#34;&gt;Visium&lt;/a&gt;, which provides some information about how gene expression is spatially organized, with other approaches with potentially better spatial resolution or gene throughput. In particular, we came across a &lt;a href=&#34;https://www.biorxiv.org/content/10.1101/2020.08.29.272831v3&#34;&gt;paper&lt;/a&gt; by Biancalani, Scalia et al. presenting a tool called &lt;a href=&#34;https://github.com/broadinstitute/Tangram&#34;&gt;Tangram&lt;/a&gt;, and were particularly interested in a component of the tool which could map individual cells from single cell gene expression data onto the spatial voxels probed by Visium. I encourage you to check out the paper for a more detailed description of their approach, as well as the other capabilities of their software which I won’t be covering.&lt;/p&gt;
&lt;p&gt;There’s a lot to talk about around these topics– integrating spatial gene expression data with other forms of data, installing and running external software, and much more– but this blog post will focus on data science and machine learning lessons I learned while trying to apply Tangram on some private data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; I will regularly refer to &lt;a href=&#34;https://www.biorxiv.org/content/10.1101/2020.08.29.272831v3&#34;&gt;the Tangram manuscript&lt;/a&gt;, and several conceptual points I make (especially the role of data sparsity, trusting scores of well-selected training genes, and descriptions of the mapping learned by Tangram) are inspired by or are paraphrased from the manuscript. I intend this blog post in part to discuss these ideas, for which the authors deserve credit.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2021-03-09-lessons-learned-applying-tangram-on-visium-data_files/tangram_puzzle.jpg&#34; width=&#34;400&#34; alt=&#34;&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Image credit: By Nevit Dilmen - Own work, CC BY-SA 3.0, &lt;a href=&#34;https://commons.wikimedia.org/w/index.php?curid=1798693&#34; class=&#34;uri&#34;&gt;https://commons.wikimedia.org/w/index.php?curid=1798693&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-initial-plan&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The Initial Plan&lt;/h2&gt;
&lt;p&gt;OK, so I had read through the paper and felt I had a basic understanding of what Tangram was doing and some sense of how it worked. Admittedly, I had to skip over some of the more technical and advanced biological details, but I felt I had enough context to get started. We could take single cell expression data and learn where to “place” individual cells onto the voxels containing Visium measurements. Deep learning, I assumed, was used to infer this mapping.&lt;/p&gt;
&lt;p&gt;My coworkers and I got started with &lt;a href=&#34;https://github.com/broadinstitute/Tangram/blob/master/example/1_tutorial_tangram.ipynb&#34;&gt;Tangram’s main tutorial&lt;/a&gt;, where we saw that Tangram trained its mapping using a subset of genes, and that the proper selection of training genes was crucial for a robust mapping. I had prepared &lt;code&gt;AnnData&lt;/code&gt; objects for some private data, containing single cell and Visium data, as required by the tutorial, and was ready to follow along with the code. Then, I realized that none of the marker genes selected for use in the example tutorial were present in our own data.&lt;/p&gt;
&lt;p&gt;Somewhat naively, I was ready to experiment with different gene selection approaches, and rank them by average similarity score achieved in the test genes. Other metrics might be preferable to evaluate performance (the paper makes use of “spatial correlation”), but cosine similarity by gene was readily available as output from the tangram function &lt;code&gt;tg.compare_spatial_geneexp&lt;/code&gt;, to compare mapped and actual expression. For each gene selection method, I took the spatially-mapped single cell expression object &lt;code&gt;ad_ge&lt;/code&gt;, and the Visium expression object &lt;code&gt;ad_sp&lt;/code&gt;, and computed the following metrics:&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;df_all_genes = tg.compare_spatial_geneexp(ad_ge, ad_sp)

#  Compute average cosine similarity for test genes
np.mean(df_all_genes.score[np.logical_not(df_all_genes.is_training)])

#  Compute average cosine similarity for training genes
np.mean(df_all_genes.score[df_all_genes.is_training])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Strangely, regardless of the selection approach used, I achieved training and test scores around 0.9 and 0.16, respectively, of course with some variation. Even for the example tutorial’s data and gene set, I achieved scores with a similarly large performance gap. What was going on here?&lt;/p&gt;
&lt;p&gt;I more carefully reviewed the paper, and thought some more.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-importance-of-understanding-your-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The Importance of Understanding your Data&lt;/h2&gt;
&lt;p&gt;As was well described in the manuscript, sparsity of gene expression, especially in the spatial data, could negatively impact similarity scores. In a particularly extreme hypothetical case, we might have a gene expressed somewhat in many cells, but whose expression in the spatial data is nonzero in only a few voxels. No matter the arrangement of cells onto voxels, it will receive a poor score owing to a relative lack of data needed to demonstrate the “true expression profile”. Figure 4f in the manuscript provides a visual display of the impact of sparsity on model performance by gene.&lt;/p&gt;
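&lt;p&gt;To make this concrete, here’s a toy sketch (with made-up numbers, not data from the manuscript) of how sparsity alone can depress a gene’s cosine similarity score, even when every nonzero measurement agrees with the mapped expression:&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;import numpy as np

def cosine_similarity(a, b):
    #  Cosine similarity between two expression vectors over spatial voxels
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

#  Toy expression for one gene measured over 8 spatial voxels
dense_truth = np.array([2., 3., 1., 4., 2., 3., 1., 2.])   # detected in every voxel
sparse_truth = np.array([0., 0., 0., 4., 0., 0., 0., 2.])  # same gene, mostly dropped out
predicted = np.array([2., 3., 1., 4., 2., 3., 1., 2.])     # a mapping recovering the dense profile

print(cosine_similarity(predicted, dense_truth))   # perfect score (1.0)
print(cosine_similarity(predicted, sparse_truth))  # roughly 0.65, despite a perfect arrangement&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The mapping is no worse in the second case; the score drops purely because the spatial measurement is sparse.&lt;/p&gt;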
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2021-03-09-lessons-learned-applying-tangram-on-visium-data_files/fig4f.jpg&#34; width=&#34;400&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.biorxiv.org/content/biorxiv/early/2020/09/24/2020.08.29.272831/F4.large.jpg&#34;&gt;Image source&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In my case, I was disproportionately selecting genes with large and diverse expression levels as training genes, and not considering expression at all when selecting test genes. We were dealing with data that &lt;strong&gt;fundamentally exhibited sparsity&lt;/strong&gt;, a feature significantly influencing the model’s performance, and by ignoring it, I had created imbalanced training and test sets.&lt;/p&gt;
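&lt;p&gt;In hindsight, one simple guard against this imbalance (a sketch of what I’d try now, not a procedure from the Tangram tutorial) is to stratify the train/test split by per-gene sparsity, so that both sets carry comparable sparsity profiles:&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)

#  Hypothetical per-gene sparsity: the fraction of spatial voxels with zero counts
n_genes = 100
gene_ids = np.arange(n_genes)
sparsity = rng.uniform(0.0, 1.0, size=n_genes)

#  Bin genes by sparsity, then split each bin roughly in half, so that
#  training and test genes follow similar sparsity distributions
bins = np.digitize(sparsity, [0.25, 0.5, 0.75])
train, test = [], []
for b in np.unique(bins):
    members = rng.permutation(gene_ids[bins == b])
    half = len(members) // 2
    train.extend(members[:half])
    test.extend(members[half:])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With a split like this, a training/test score gap is more likely to reflect the mapping itself rather than a difference in how sparse the two gene sets are.&lt;/p&gt;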
&lt;/div&gt;
&lt;div id=&#34;the-importance-of-understanding-your-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The Importance of Understanding your Model&lt;/h2&gt;
&lt;p&gt;My first instinct when observing the training/test performance gap was that serious overfitting was occurring. I was under the impression that a deep neural network was used as the model that was to learn the mapping from cell to spatial voxel. However, the more I thought about it, the less it made sense that a basic arrangement task would require or even benefit from a deep neural network; this felt almost like a regression problem. What features could a neural network even learn to solve a problem like this?&lt;/p&gt;
&lt;p&gt;I dug deeper into the methods of the paper, and discovered the model at hand was simply a matrix, assigning probabilities for each cell to each spatial voxel. These probabilities were “directly” optimized by gradient descent to maximize a similarity score (cosine similarity) between assigned and observed gene expression levels. I had made an assumption, perhaps based on the title of the manuscript, but the reality was that a very “shallow”, relatively simple model was being used. In this sense, we should already be less worried about the potential for overfitting.&lt;/p&gt;
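&lt;p&gt;Here’s a deliberately tiny numpy caricature of that idea, my own reconstruction for intuition rather than Tangram’s actual implementation (which relies on automatic differentiation, not the finite differences below): the entire model is a single matrix of logits, softmaxed across voxels, nudged uphill on mean cosine similarity.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(1)
n_cells, n_voxels, n_genes = 6, 4, 5

#  Single-cell expression (cells x genes) and observed spatial expression (voxels x genes);
#  a small offset keeps every gene norm strictly positive
S = rng.poisson(3.0, size=(n_cells, n_genes)).astype(float) + 0.1
G = rng.poisson(3.0, size=(n_voxels, n_genes)).astype(float) + 0.1

#  The whole model: one logit per (cell, voxel) pair
M = rng.normal(size=(n_cells, n_voxels))

def objective(M):
    #  Softmax across voxels turns each row of M into placement probabilities
    P = np.exp(M) / np.exp(M).sum(axis=1, keepdims=True)
    predicted = P.T @ S  # predicted spatial expression (voxels x genes)
    #  Mean cosine similarity across genes between predicted and observed expression
    num = (predicted * G).sum(axis=0)
    den = np.linalg.norm(predicted, axis=0) * np.linalg.norm(G, axis=0)
    return float((num / den).mean())

#  Crude gradient ascent via finite differences, feasible only at this toy scale
score_before = objective(M)
eps, lr = 1e-5, 0.5
for _ in range(50):
    base = objective(M)
    grad = np.zeros_like(M)
    for idx in np.ndindex(M.shape):
        M_step = M.copy()
        M_step[idx] += eps
        grad[idx] = (objective(M_step) - base) / eps
    M += lr * grad
score_after = objective(M)
print(score_before, score_after)  # the score should increase&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There’s no hidden network in there at all, which is why, with well-selected training genes, a training/test gap doesn’t have to signal classic overfitting.&lt;/p&gt;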
&lt;p&gt;Thinking more about the task Tangram’s model was learning to perform, I found it more helpful to consider the single cell–spatial mapping to be a complete jigsaw puzzle, and cells to be pieces (yes, I know the whole purpose of the name “Tangram” is as a similar metaphor). Well, imagine these pieces could hypothetically fit together in any arrangement so long as the complete picture’s shape was fixed. In this metaphor, using carefully selected training genes would be like erasing small, uninformative parts of the image on each puzzle piece. When we complete the puzzle, we will see the underlying big picture fairly well. Then, we wouldn’t particularly be worried that the erased segments might contradict what we already have in place, since we already have solid visual evidence our arrangement is good. Analogously, provided our training genes are well-selected, good training scores can give us confidence a robust mapping was found, in which case we can trust test genes to be well-placed.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;takeaways&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Takeaways&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Consider the nature of your data when interpreting results, such as the performance of a model on subsets of that data&lt;/li&gt;
&lt;li&gt;Take time to understand your model, so that you know how to interpret its performance&lt;/li&gt;
&lt;li&gt;Some papers need more than a brief skim :)&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Using tidymodels to Predict Health Insurance Cost</title>
      <link>http://LieberInstitute.github.io/rstatsclub/2021/02/15/using-tidymodels-to-predict-health-insurance-cost/</link>
      <pubDate>Mon, 15 Feb 2021 00:00:00 +0000</pubDate>
      <guid>http://LieberInstitute.github.io/rstatsclub/2021/02/15/using-tidymodels-to-predict-health-insurance-cost/</guid>
      <description>
&lt;script src=&#34;http://LieberInstitute.github.io/rstatsclub/rstatsclub/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;p&gt;By &lt;a href=&#34;https://twitter.com/artaseyedian&#34;&gt;Arta Seyedian&lt;/a&gt;&lt;/p&gt;
&lt;div id=&#34;medical-cost-personal-datasets&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Medical Cost Personal Datasets&lt;/h1&gt;
&lt;div id=&#34;insurance-forecast-by-using-linear-regression&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Insurance Forecast by using Linear Regression&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://www.kaggle.com/mirichoi0218/insurance&#34;&gt;Link to Kaggle Page&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/stedy/Machine-Learning-with-R-datasets/blob/master/insurance.csv&#34;&gt;Link to GitHub Source&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Around the end of October 2020, I attended the Open Data Science Conference primarily for the workshops and training sessions that were offered. The first workshop I attended was a demonstration by &lt;a href=&#34;https://www.jaredlander.com/&#34;&gt;Jared Lander&lt;/a&gt; on how to implement machine learning methods in R using a new package named &lt;em&gt;tidymodels&lt;/em&gt;. I went into that training knowing almost nothing about machine learning, and have since then drawn exclusively from free online materials to understand how to analyze data using this “meta-package.”&lt;/p&gt;
&lt;p&gt;As a brief introduction, tidymodels is, like tidyverse, not a single package but rather a collection of data science packages designed according to &lt;a href=&#34;https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html&#34;&gt;tidyverse principles&lt;/a&gt;. Many of the packages present in tidymodels are also present in tidyverse. What makes tidymodels different from tidyverse, however, is that many of these packages are meant for predictive modeling and provide a universal standard interface for all of the different machine learning methods available in R.&lt;/p&gt;
&lt;p&gt;Today, we are using a data set of health insurance information from ~1300 customers of a health insurance company. This data set is sourced from a book titled &lt;em&gt;Machine Learning with R&lt;/em&gt; by Brett Lantz.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(tidymodels)
library(data.table)

download.file(&amp;quot;https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv&amp;quot;, 
              &amp;quot;insurance.csv&amp;quot;)

insur_dt &amp;lt;- fread(&amp;quot;insurance.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;insur_dt %&amp;gt;% colnames()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;age&amp;quot;      &amp;quot;sex&amp;quot;      &amp;quot;bmi&amp;quot;      &amp;quot;children&amp;quot; &amp;quot;smoker&amp;quot;   &amp;quot;region&amp;quot;   &amp;quot;charges&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;insur_dt$age %&amp;gt;% summary()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   27.00   39.00   39.21   51.00   64.00&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;insur_dt$sex %&amp;gt;% table()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## .
## female   male 
##    662    676&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;insur_dt$bmi %&amp;gt;% summary()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.96   26.30   30.40   30.66   34.69   53.13&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;insur_dt$smoker %&amp;gt;% table()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## .
##   no  yes 
## 1064  274&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;insur_dt$charges %&amp;gt;% summary()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1122    4740    9382   13270   16640   63770&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Above, you’ll notice I loaded &lt;code&gt;tidymodels&lt;/code&gt;, a meta-package that bundles packages such as &lt;code&gt;parsnip&lt;/code&gt; and &lt;code&gt;recipes&lt;/code&gt; for modeling and statistical analysis. You can learn more about it &lt;a href=&#34;https://www.tidymodels.org/&#34;&gt;here&lt;/a&gt;. Usually, you can simply call &lt;code&gt;library(tidymodels)&lt;/code&gt;, but Kaggle R notebooks seem unable to install and/or load it for the time being, which is fine.&lt;/p&gt;
&lt;p&gt;As you can see, there are 7 relatively self-explanatory variables in this data set, some of which are presumably used by the benevolent private health insurance company in question to determine how much a given individual is ultimately charged. &lt;code&gt;age&lt;/code&gt;, &lt;code&gt;sex&lt;/code&gt; and &lt;code&gt;region&lt;/code&gt; appear to be demographics: age runs from 18 to 64 with a mean of about 39, and the two factor levels in &lt;code&gt;sex&lt;/code&gt; are roughly balanced in count.&lt;/p&gt;
&lt;p&gt;Assuming that the variable &lt;code&gt;bmi&lt;/code&gt; corresponds to Body Mass Index, according to the &lt;a href=&#34;https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html&#34;&gt;CDC&lt;/a&gt;, a BMI of 30 or above is considered clinically obese. In our present data set, the average is just over the cusp of obese.&lt;/p&gt;
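&lt;p&gt;For reference, BMI is weight in kilograms divided by height in meters squared. A quick illustration in R (the helper and values below are made up for illustration, not part of the data set):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# hypothetical helper, for illustration only
bmi &amp;lt;- function(weight_kg, height_m) weight_kg / height_m^2
bmi(95, 1.75)  # about 31, just over the obesity cutoff of 30&lt;/code&gt;&lt;/pre&gt;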
&lt;p&gt;Next we have the number of smokers vs non-smokers. As someone who has filled out at least one form in my life, I can definitely tell you that &lt;code&gt;smoker&lt;/code&gt; is going to be important going forward in determining the &lt;code&gt;charges&lt;/code&gt; of each given health insurance customer.&lt;/p&gt;
&lt;p&gt;Lastly, we have &lt;code&gt;charges&lt;/code&gt;. The average annual charge for health insurance here is a modest $13,270.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;exploratory-data-analysis&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Exploratory Data Analysis&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;skimr::skim(insur_dt)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;caption&gt;&lt;span id=&#34;tab:unnamed-chunk-3&#34;&gt;Table 1: &lt;/span&gt;Data summary&lt;/caption&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Name&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;insur_dt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Number of rows&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1338&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Number of columns&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;_______________________&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Column type frequency:&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;character&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;numeric&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;________________________&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Group variables&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Variable type: character&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;skim_variable&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;n_missing&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;complete_rate&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;min&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;max&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;empty&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;n_unique&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;whitespace&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;sex&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;smoker&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;region&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Variable type: numeric&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;skim_variable&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;n_missing&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;complete_rate&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mean&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;sd&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;p0&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;p25&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;p50&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;p75&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;p100&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;hist&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;age&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;39.21&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;14.05&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;18.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;27.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;39.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;51.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;64.00&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▇▅▅▆▆&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;bmi&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;30.66&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6.10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;15.96&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;26.30&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;30.40&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;34.69&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;53.13&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▂▇▇▂▁&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;children&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.09&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.21&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5.00&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▇▂▂▁▁&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;charges&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;13270.42&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;12110.01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1121.87&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4740.29&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9382.03&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;16639.91&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;63770.43&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▇▂▁▁▁&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(insur_dt$sex)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## female   male 
##    662    676&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I want to note that this data set is pretty clean; you will probably never encounter a data set like this in the wild. There are no &lt;code&gt;NA&lt;/code&gt;s and, as I mentioned before, no class imbalance along &lt;code&gt;sex&lt;/code&gt;. Let’s look at the distribution of children:&lt;/p&gt;
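&lt;p&gt;You can verify the absence of missing values with a one-line check (a quick sketch):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sum(is.na(insur_dt))  # returns 0 when no values are missing&lt;/code&gt;&lt;/pre&gt;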
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(insur_dt$children)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##   0   1   2   3   4   5 
## 574 324 240 157  25  18&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pretty standard: the plurality of people in this data set have no children, and the counts fall off steadily as the number of children increases.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;options(repr.plot.width=15, repr.plot.height = 10)

insur_dt %&amp;gt;%
    select(age, bmi, children, smoker, region, charges) %&amp;gt;%
    GGally::ggpairs(mapping = aes(color = region))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/rstatsclub/post/2021-02-15-Using-tidymodels-to-predict-medical-insurance-costs_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;GGally&lt;/code&gt; is a package that facilitates exploratory data analysis by automatically generating &lt;code&gt;ggplot2&lt;/code&gt; plots of the variables present in the input data frame, to help you get a better understanding of the relationships that might exist between them. Most of these plots are just noise, but there are a few interesting ones, such as the two on the bottom left assessing &lt;code&gt;charges&lt;/code&gt; vs &lt;code&gt;age&lt;/code&gt; and &lt;code&gt;charges&lt;/code&gt; vs &lt;code&gt;bmi&lt;/code&gt;. Further to the right, there is also &lt;code&gt;charges&lt;/code&gt; vs &lt;code&gt;smoker&lt;/code&gt;. Let’s take a closer look at some of these relationships:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;insur_dt %&amp;gt;% ggplot(aes(color = region)) + facet_wrap(~ region)+
  geom_point(mapping = aes(x = bmi, y = charges))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/rstatsclub/post/2021-02-15-Using-tidymodels-to-predict-medical-insurance-costs_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I wanted to see if any regions are somehow charged at a different rate than the others, but these plots all look basically the same. If you look closely, there are roughly two different blobs projecting from (0, 0) toward the center of each plot. We’ll get back to that later.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;insur_dt %&amp;gt;% ggplot(aes(color = region)) + facet_wrap(~ region)+
  geom_point(mapping = aes(x = age, y = charges))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/rstatsclub/post/2021-02-15-Using-tidymodels-to-predict-medical-insurance-costs_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Here, I wanted to see if there was any noticeable relationship between &lt;code&gt;age&lt;/code&gt; and &lt;code&gt;charges&lt;/code&gt;. Across the four &lt;code&gt;region&lt;/code&gt;s, most points lie along a baseline near the X-axis that increases modestly with &lt;code&gt;age&lt;/code&gt;. There is, however, a pattern that appears as two levels rising off of that baseline. Since we don’t have a variable for the type of health insurance plan these people are on, we should probably hold off on any judgments about what this could be for now.&lt;/p&gt;
&lt;p&gt;Let’s move onto what is undoubtedly the pièce de résistance of health insurance coverage: smokers.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;insur_dt %&amp;gt;%
    select(smoker, bmi, charges) %&amp;gt;%
    ggplot(aes(color = smoker)) +
    geom_point(mapping = aes(x = bmi, y = charges))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/rstatsclub/post/2021-02-15-Using-tidymodels-to-predict-medical-insurance-costs_files/figure-html/unnamed-chunk-8-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Wow. What a stark difference. Here, you can see that &lt;code&gt;smoker&lt;/code&gt; almost creates a whole new blob of points separate from non-smokers… and that blob sharply rises after &lt;code&gt;bmi = 30&lt;/code&gt;. Say, what was the CDC official cutoff for obesity again?&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;insur_dt$age_bins &amp;lt;- cut(insur_dt$age,
                breaks = c(18,20,30,40,50,60,70,80,90),
                include.lowest = TRUE,
                right = TRUE)

insur_dt %&amp;gt;%
    select(bmi, charges, sex, age_bins) %&amp;gt;%
    ggplot(aes(color = age_bins)) +
    geom_point(mapping = aes(x = bmi, y = charges))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/rstatsclub/post/2021-02-15-Using-tidymodels-to-predict-medical-insurance-costs_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;You can see that &lt;code&gt;age&lt;/code&gt; does play a role in &lt;code&gt;charges&lt;/code&gt;, but the effect is stratified within the three-ish clusters of points: even among the high-&lt;code&gt;bmi&lt;/code&gt; smokers, younger people consistently pay less than older people, which makes sense. However, &lt;code&gt;age&lt;/code&gt; does not appear to interact with &lt;code&gt;bmi&lt;/code&gt; or &lt;code&gt;smoker&lt;/code&gt;, meaning that it affects &lt;code&gt;charges&lt;/code&gt; independently.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;insur_dt %&amp;gt;%
    select(children, charges, sex) %&amp;gt;%
    ggplot(aes(x = children, y = charges, group = children)) +
    geom_boxplot(outlier.alpha = 0.5, aes(fill = children)) +
    theme(legend.position = &amp;quot;none&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/rstatsclub/post/2021-02-15-Using-tidymodels-to-predict-medical-insurance-costs_files/figure-html/unnamed-chunk-10-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Finally, &lt;code&gt;children&lt;/code&gt; does not appear to affect &lt;code&gt;charges&lt;/code&gt; significantly.&lt;/p&gt;
&lt;p&gt;I think we’ve done enough exploratory analysis to establish that &lt;code&gt;bmi&lt;/code&gt; and &lt;code&gt;smoker&lt;/code&gt; together have a synergistic effect on &lt;code&gt;charges&lt;/code&gt;, and that &lt;code&gt;age&lt;/code&gt; influences &lt;code&gt;charges&lt;/code&gt; as well.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;build-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Build Model&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)

insur_split &amp;lt;- initial_split(insur_dt, strata = smoker)

insur_train &amp;lt;- training(insur_split)
insur_test &amp;lt;- testing(insur_split)

# we are going to do data processing and feature engineering with recipes

# below, we are going to predict charges using everything else(&amp;quot;.&amp;quot;)
insur_rec &amp;lt;- recipe(charges ~ bmi + age + smoker, data = insur_train) %&amp;gt;%
    step_dummy(all_nominal()) %&amp;gt;%
    step_normalize(all_numeric(), -all_outcomes()) %&amp;gt;%
    step_interact(terms = ~ bmi:smoker_yes)

test_proc &amp;lt;- insur_rec %&amp;gt;% prep() %&amp;gt;% bake(new_data = insur_test)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We first split our data into training and testing sets. We stratify the sampling by &lt;code&gt;smoker&lt;/code&gt; status because smokers are a minority class and we want them represented proportionally in both the training and testing sets; this is accomplished by conducting the random sampling within each class.&lt;/p&gt;
&lt;p&gt;An explanation of the &lt;code&gt;recipe&lt;/code&gt;:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;We are going to model the effect of &lt;code&gt;bmi&lt;/code&gt;, &lt;code&gt;age&lt;/code&gt; and &lt;code&gt;smoker&lt;/code&gt; on &lt;code&gt;charges&lt;/code&gt;. We do not specify interactions in this step because &lt;code&gt;recipe&lt;/code&gt; handles interactions as a step.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We create dummy variables (&lt;code&gt;step_dummy&lt;/code&gt;) for all nominal predictors, so &lt;code&gt;smoker&lt;/code&gt; becomes &lt;code&gt;smoker_yes&lt;/code&gt;, and &lt;code&gt;smoker_no&lt;/code&gt; is “implied” through omission (a row with &lt;code&gt;smoker_yes == 0&lt;/code&gt; is a non-smoker), because some models cannot have every dummy level present as a column. To keep all dummy variables, you can use &lt;code&gt;one_hot = TRUE&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We then normalize all numeric predictors &lt;strong&gt;except&lt;/strong&gt; our outcome variable (&lt;code&gt;step_normalize(all_numeric(), -all_outcomes())&lt;/code&gt;), because you generally want to avoid transforming the outcome while training and developing a model, lest a new data set inconsistent with the current one comes along and breaks your model. It’s best to do transformations on the outcome before creating a &lt;code&gt;recipe&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We set an interaction term: &lt;code&gt;bmi&lt;/code&gt; and &lt;code&gt;smoker_yes&lt;/code&gt; (the dummy variable for &lt;code&gt;smoker&lt;/code&gt;) interact with each other in affecting the outcome. Earlier, we noticed that older patients are charged more, that patients with a higher &lt;code&gt;bmi&lt;/code&gt; are charged more still, and that high-&lt;code&gt;bmi&lt;/code&gt; patients who smoke are charged the most out of anyone in our data set. We observed this visually in the plots, so we are also going to test it in the model we develop.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
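&lt;p&gt;As a side note, here is what the one-hot variant mentioned above would look like (a hypothetical sketch, not the recipe we actually use):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# keeps both smoker_yes and smoker_no instead of dropping one level
insur_rec_onehot &amp;lt;- recipe(charges ~ bmi + age + smoker, data = insur_train) %&amp;gt;%
    step_dummy(all_nominal(), one_hot = TRUE) %&amp;gt;%
    step_normalize(all_numeric(), -all_outcomes())&lt;/code&gt;&lt;/pre&gt;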
&lt;p&gt;Let’s actually specify the model. We are going to work with a k-nearest neighbors model, which we will later compare with another model. The KNN model is &lt;a href=&#34;https://bookdown.org/tpinto_home/Regression-and-Classification/k-nearest-neighbours-regression.html&#34;&gt;defined as follows&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;KNN regression is a non-parametric method that, in an intuitive manner, approximates the association between independent variables and the continuous outcome by averaging the observations in the same neighbourhood. The size of the neighbourhood needs to be set by the analyst or can be chosen using cross-validation (we will see this later) to select the size that minimises the mean-squared error.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To keep things simple, we are not going to use cross-validation to find the optimal &lt;code&gt;k&lt;/code&gt;. Instead, we are just going to say &lt;code&gt;k = 10&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knn_spec &amp;lt;- nearest_neighbor(neighbors = 10) %&amp;gt;%
    set_engine(&amp;quot;kknn&amp;quot;) %&amp;gt;%
    set_mode(&amp;quot;regression&amp;quot;)

knn_fit &amp;lt;- knn_spec %&amp;gt;%
    fit(charges ~ age + bmi + smoker_yes + bmi_x_smoker_yes,
        data = juice(insur_rec %&amp;gt;% prep()))&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;insur_wf &amp;lt;- workflow() %&amp;gt;%
    add_recipe(insur_rec) %&amp;gt;%
    add_model(knn_spec)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We specified the model &lt;code&gt;knn_spec&lt;/code&gt; by calling the model itself from &lt;code&gt;parsnip&lt;/code&gt;, then we &lt;code&gt;set_engine&lt;/code&gt; and set the mode to regression. Note the &lt;code&gt;neighbors&lt;/code&gt; parameter in &lt;code&gt;nearest_neighbor&lt;/code&gt;. That corresponds to the &lt;code&gt;k&lt;/code&gt; in &lt;code&gt;knn&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We then fit the model using the model specification to our data. Because we already computed columns for the &lt;code&gt;bmi&lt;/code&gt; and &lt;code&gt;smoker_yes&lt;/code&gt; interaction, we do not need to represent the interaction formulaically again.&lt;/p&gt;
&lt;p&gt;Let’s evaluate this model to see how it does.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;insur_cv &amp;lt;- vfold_cv(insur_train, prop = 0.9)

insur_rsmpl &amp;lt;- fit_resamples(insur_wf,
                           insur_cv,
                           control = control_resamples(save_pred = TRUE))&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;insur_rsmpl %&amp;gt;% collect_metrics()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 2 x 6
##   .metric .estimator     mean     n  std_err .config             
##   &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;         &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;    &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;               
## 1 rmse    standard   4916.       10 274.     Preprocessor1_Model1
## 2 rsq     standard      0.827    10   0.0194 Preprocessor1_Model1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(insur_dt$charges)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1122    4740    9382   13270   16640   63770&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We used &lt;code&gt;vfold_cv&lt;/code&gt;, the cross-validation most people are familiar with: the training data is split into V folds, the model is trained on V - 1 folds and used to make predictions on the held-out fold, and this is repeated so that every fold serves once as the assessment fold. With 10 folds, each iteration trains on 9 folds and assesses on 1 (all within our training data).&lt;/p&gt;
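&lt;p&gt;If you want to control the number of folds explicitly, &lt;code&gt;vfold_cv&lt;/code&gt; exposes this through its &lt;code&gt;v&lt;/code&gt; argument (10 is the default); a minimal sketch:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# the same 10-fold scheme, stated explicitly
insur_cv &amp;lt;- vfold_cv(insur_train, v = 10)&lt;/code&gt;&lt;/pre&gt;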
&lt;p&gt;We then finally run the cross validation by using &lt;code&gt;fit_resamples&lt;/code&gt;. As you can see, we used our workflow object as our input.&lt;/p&gt;
&lt;p&gt;Finally, we call &lt;code&gt;collect_metrics&lt;/code&gt; to examine the model’s effectiveness. We end up with an &lt;code&gt;rmse&lt;/code&gt; of about 4,916 and an &lt;code&gt;rsq&lt;/code&gt; of about &lt;code&gt;0.83&lt;/code&gt;. The RMSE suggests that, on average, our predictions deviated from the observed values by an absolute measure of about 4,916, in this case dollars in &lt;code&gt;charges&lt;/code&gt;. The R^2 suggests the regression explains roughly 83% of the variance in &lt;code&gt;charges&lt;/code&gt;, although a high R^2 doesn’t always mean the model fits well, and a low R^2 doesn’t always mean it fits poorly.&lt;/p&gt;
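&lt;p&gt;For intuition, both metrics can be computed by hand from a vector of observed values and predictions; a toy sketch (the numbers below are made up, not the model output):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;obs  &amp;lt;- c(1000, 5000, 20000)
pred &amp;lt;- c(1500, 4500, 18000)
sqrt(mean((obs - pred)^2))                           # RMSE
1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)   # R-squared&lt;/code&gt;&lt;/pre&gt;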
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;insur_rsmpl %&amp;gt;%
    unnest(.predictions) %&amp;gt;%
    ggplot(aes(charges, .pred, color = id)) + 
    geom_abline(lty = 2, color = &amp;quot;gray80&amp;quot;, size = 1.5) + 
    geom_point(alpha = 0.5) + 
    theme(legend.position = &amp;quot;none&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/rstatsclub/post/2021-02-15-Using-tidymodels-to-predict-medical-insurance-costs_files/figure-html/unnamed-chunk-14-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Above is a plot of our regression fit against the ideal line. There is a large cluster of values that our model simply does not capture, and we could dig into those points further, but instead we are going to move on to applying our model to the test data, which we defined much earlier in this project.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;insur_test_res &amp;lt;- predict(knn_fit, new_data = test_proc %&amp;gt;% select(-charges))&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;insur_test_res &amp;lt;- bind_cols(insur_test_res, insur_test %&amp;gt;% select(charges))

insur_test_res&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 334 x 2
##     .pred charges
##     &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
##  1  4339.   3757.
##  2 27038.  27809.
##  3  2231.   1837.
##  4  6500.   6204.
##  5  2794.   4688.
##  6  6057.   6314.
##  7 14335.  12630.
##  8  1663.   2211.
##  9  5655.   3580.
## 10 39401.  37743.
## # … with 324 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We’ve now applied our model to &lt;code&gt;test_proc&lt;/code&gt;, which is the test set after applying the &lt;code&gt;recipes&lt;/code&gt; preprocessing steps to transform it in the same way we transformed our training data. We bind the resulting predictions with the actual &lt;code&gt;charges&lt;/code&gt; found in the test data to create a two-column table of our predictions and the corresponding real values we attempted to predict.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(insur_test_res, aes(x = charges, y = .pred)) +
  # Create a diagonal line:
  geom_abline(lty = 2) +
  geom_point(alpha = 0.5) +
  labs(y = &amp;quot;Predicted Charges&amp;quot;, x = &amp;quot;Charges&amp;quot;) +
  # Scale and size the x- and y-axis uniformly:
  coord_obs_pred()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/rstatsclub/post/2021-02-15-Using-tidymodels-to-predict-medical-insurance-costs_files/figure-html/unnamed-chunk-16-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rmse(insur_test_res, truth = charges, estimate = .pred)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
## 1 rmse    standard       4985.&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;insur_rsmpl %&amp;gt;% 
    collect_metrics()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 2 x 6
##   .metric .estimator     mean     n  std_err .config             
##   &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;         &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;    &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;               
## 1 rmse    standard   4916.       10 274.     Preprocessor1_Model1
## 2 rsq     standard      0.827    10   0.0194 Preprocessor1_Model1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Nice! The RMSE on the test data (about 4,985) is very close to the one generated by cross-validation (about 4,916), which means our model reproduces predictions on new data with approximately the same level of error.&lt;/p&gt;
&lt;p&gt;Another great thing about &lt;code&gt;tidymodels&lt;/code&gt; is that it streamlines the process of comparing predictive performance between two different models. Allow me to demonstrate.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;linear-regression&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Linear Regression&lt;/h2&gt;
&lt;p&gt;We already have the recipe. All we need now is to specify a linear model and cross-validate the fit to test it on the testing data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lm_spec &amp;lt;- linear_reg() %&amp;gt;% 
    set_engine(&amp;quot;lm&amp;quot;)

lm_fit &amp;lt;- lm_spec %&amp;gt;%
    fit(charges ~ age + bmi + smoker_yes + bmi_x_smoker_yes,
        data = juice(insur_rec %&amp;gt;% prep()))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: partial match of &amp;#39;object&amp;#39; to &amp;#39;objects&amp;#39;

## Warning: partial match of &amp;#39;object&amp;#39; to &amp;#39;objects&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;insur_lm_wf &amp;lt;- workflow() %&amp;gt;%
    add_recipe(insur_rec) %&amp;gt;%
    add_model(lm_spec)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We just repeat &lt;em&gt;some&lt;/em&gt; of the same steps that we did for KNN but for the linear model. We can even cross-validate by using (almost) the same command:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;insur_lm_rsmpl &amp;lt;- fit_resamples(insur_lm_wf,
                           insur_cv,
                           control = control_resamples(save_pred = TRUE))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ! Fold01: preprocessor 1/1: partial match of &amp;#39;object&amp;#39; to &amp;#39;objects&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ! Fold01: preprocessor 1/1, model 1/1 (predictions): partial match of &amp;#39;object&amp;#39; to ...&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ! Fold02: preprocessor 1/1: partial match of &amp;#39;object&amp;#39; to &amp;#39;objects&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ! Fold02: preprocessor 1/1, model 1/1 (predictions): partial match of &amp;#39;object&amp;#39; to ...&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ! Fold03: preprocessor 1/1: partial match of &amp;#39;object&amp;#39; to &amp;#39;objects&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ! Fold03: preprocessor 1/1, model 1/1 (predictions): partial match of &amp;#39;object&amp;#39; to ...&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ! Fold04: preprocessor 1/1: partial match of &amp;#39;object&amp;#39; to &amp;#39;objects&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ! Fold04: preprocessor 1/1, model 1/1 (predictions): partial match of &amp;#39;object&amp;#39; to ...&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ! Fold05: preprocessor 1/1: partial match of &amp;#39;object&amp;#39; to &amp;#39;objects&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ! Fold05: preprocessor 1/1, model 1/1 (predictions): partial match of &amp;#39;object&amp;#39; to ...&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ! Fold06: preprocessor 1/1: partial match of &amp;#39;object&amp;#39; to &amp;#39;objects&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ! Fold06: preprocessor 1/1, model 1/1 (predictions): partial match of &amp;#39;object&amp;#39; to ...&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ! Fold07: preprocessor 1/1: partial match of &amp;#39;object&amp;#39; to &amp;#39;objects&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ! Fold07: preprocessor 1/1, model 1/1 (predictions): partial match of &amp;#39;object&amp;#39; to ...&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ! Fold08: preprocessor 1/1: partial match of &amp;#39;object&amp;#39; to &amp;#39;objects&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ! Fold08: preprocessor 1/1, model 1/1 (predictions): partial match of &amp;#39;object&amp;#39; to ...&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ! Fold09: preprocessor 1/1: partial match of &amp;#39;object&amp;#39; to &amp;#39;objects&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ! Fold09: preprocessor 1/1, model 1/1 (predictions): partial match of &amp;#39;object&amp;#39; to ...&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ! Fold10: preprocessor 1/1: partial match of &amp;#39;object&amp;#39; to &amp;#39;objects&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ! Fold10: preprocessor 1/1, model 1/1 (predictions): partial match of &amp;#39;object&amp;#39; to ...&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;insur_lm_rsmpl %&amp;gt;% 
    collect_metrics()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 2 x 6
##   .metric .estimator     mean     n  std_err .config             
##   &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;         &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;    &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;               
## 1 rmse    standard   4866.       10 251.     Preprocessor1_Model1
## 2 rsq     standard      0.832    10   0.0162 Preprocessor1_Model1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;insur_rsmpl %&amp;gt;% 
    collect_metrics()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 2 x 6
##   .metric .estimator     mean     n  std_err .config             
##   &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;         &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;    &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;               
## 1 rmse    standard   4916.       10 274.     Preprocessor1_Model1
## 2 rsq     standard      0.827    10   0.0194 Preprocessor1_Model1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Fascinating! It appears that the good ol’ fashioned linear model beat k-nearest neighbors in terms of both RMSE and R&lt;sup&gt;2&lt;/sup&gt; across 10 cross-validation folds.&lt;/p&gt;
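&lt;p&gt;To compare the two sets of resampling metrics side by side, one option is a small sketch like the following, where the &lt;code&gt;model&lt;/code&gt; column is a label we add ourselves:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Label each model&amp;#39;s resampling metrics and stack them for comparison
bind_rows(
    collect_metrics(insur_rsmpl) %&amp;gt;% mutate(model = &amp;quot;knn&amp;quot;),
    collect_metrics(insur_lm_rsmpl) %&amp;gt;% mutate(model = &amp;quot;lm&amp;quot;)
) %&amp;gt;%
    arrange(.metric, model)&lt;/code&gt;&lt;/pre&gt;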
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;insur_test_lm_res &amp;lt;- predict(lm_fit, new_data = test_proc %&amp;gt;% select(-charges))

insur_test_lm_res &amp;lt;- bind_cols(insur_test_lm_res, insur_test %&amp;gt;% select(charges))

insur_test_lm_res&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 334 x 2
##     .pred charges
##     &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
##  1  6335.   3757.
##  2 31938.  27809.
##  3  3171.   1837.
##  4  7878.   6204.
##  5  3081.   4688.
##  6  7815.   6314.
##  7 14070.  12630.
##  8  2656.   2211.
##  9  3498.   3580.
## 10 36293.  37743.
## # … with 324 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that we have our predictions, let’s look at how well the linear model fared:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(insur_test_lm_res, aes(x = charges, y = .pred)) +
  # Create a diagonal line:
  geom_abline(lty = 2) +
  geom_point(alpha = 0.5) +
  labs(y = &amp;quot;Predicted Charges&amp;quot;, x = &amp;quot;Charges&amp;quot;) +
  # Scale and size the x- and y-axis uniformly:
  coord_obs_pred()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/rstatsclub/post/2021-02-15-Using-tidymodels-to-predict-medical-insurance-costs_files/figure-html/unnamed-chunk-21-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;It seems the bottom-left corner had the greatest concentration of charges, which explains most of the &lt;code&gt;lm&lt;/code&gt; fit. Looking at both of these plots makes me wonder whether there was a better model we could have used, but our model was sufficient given our purposes and level of accuracy.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;combined_dt &amp;lt;- insur_test_lm_res %&amp;gt;%
    rename(lm_pred = .pred) %&amp;gt;%
    add_column(knn_pred = insur_test_res$.pred)

ggplot(combined_dt, aes(x = charges)) +
    geom_line(aes(y = knn_pred, color = &amp;quot;kNN Fit&amp;quot;), size = 1) +
    geom_line(aes(y = lm_pred, color = &amp;quot;lm Fit&amp;quot;), size = 1) +
    geom_point(aes(y = knn_pred, alpha = 0.5), color = &amp;quot;#F99E9E&amp;quot;) +
    geom_point(aes(y = lm_pred, alpha = 0.5), color = &amp;quot;#809BF4&amp;quot;) +
    geom_abline(size = 0.5, linetype = &amp;quot;dashed&amp;quot;) +
    xlab(&amp;#39;Charges&amp;#39;) +
    ylab(&amp;#39;Predicted Charges&amp;#39;) +
    guides(alpha = FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/rstatsclub/post/2021-02-15-Using-tidymodels-to-predict-medical-insurance-costs_files/figure-html/unnamed-chunk-22-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Above is a comparison of the two methods’ predictions, with the dashed line representing the “correct” values. Here the two models were not different enough for their differences to be readily observed when plotted against each other. There will be cases, however, where your two models differ substantially, and this sort of plot will bolster your case for using one model over another.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Here, we were able to build a KNN model with our training data and use it to predict values in our testing data. To do this, we:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;performed EDA&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;preprocessed our data using &lt;code&gt;recipes&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;specified our model to be KNN&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;fit it to our training data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ran cross-validation to produce accurate error statistics&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;predicted values in our test set&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;compared observed test set values with our predictions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;specified another model, lm&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;performed a cross-validation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;discovered lm to be the better model&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I’m very excited to continue using tidymodels in R as a way to apply machine learning methods. If you’re interested, I recommend checking out &lt;a href=&#34;https://www.tmwr.org/&#34;&gt;Tidy Modeling with R by Max Kuhn and Julia Silge&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Using VisiumExperiment at spatialLIBD package</title>
      <link>http://LieberInstitute.github.io/rstatsclub/2020/11/06/using-visiumexperiment-at-spatiallibd-package/</link>
      <pubDate>Fri, 06 Nov 2020 00:00:00 +0000</pubDate>
      <guid>http://LieberInstitute.github.io/rstatsclub/2020/11/06/using-visiumexperiment-at-spatiallibd-package/</guid>
      <description>
&lt;script src=&#34;http://LieberInstitute.github.io/rstatsclub/rstatsclub/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;
&lt;link href=&#34;http://LieberInstitute.github.io/rstatsclub/rstatsclub/rmarkdown-libs/anchor-sections/anchor-sections.css&#34; rel=&#34;stylesheet&#34; /&gt;
&lt;script src=&#34;http://LieberInstitute.github.io/rstatsclub/rstatsclub/rmarkdown-libs/anchor-sections/anchor-sections.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;By &lt;a href=&#34;https://github.com/bpardo99&#34;&gt;Brenda Pardo&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A month ago, I started an enriching adventure by joining Leonardo Collado-Torres’ team at the Lieber Institute for Brain Development. Since then, I have been working on modifying &lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.12/spatialLIBD&#34;&gt;spatialLIBD&lt;/a&gt;&lt;/em&gt;, a package to interactively visualize the LIBD human dorsolateral pre-frontal cortex (DLPFC) spatial transcriptomics data &lt;a id=&#39;cite-Maynard_2020&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://www.biorxiv.org/content/10.1101/2020.02.28.969931v1&#39;&gt;Maynard, Collado-Torres, Weber, Uytingco, et al., 2020&lt;/a&gt;). These modifications allow &lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.12/spatialLIBD&#34;&gt;spatialLIBD&lt;/a&gt;&lt;/em&gt; to use objects of the &lt;code&gt;VisiumExperiment&lt;/code&gt; class, which is designed specifically to store spatial transcriptomics data &lt;a id=&#39;cite-Righelli_2020&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;#bib-Righelli_2020&#39;&gt;Righelli and Risso, 2020&lt;/a&gt;). In this blog post, I describe the changes we carried out to the package and happily share a piece of my journey through my research internship at LIBD.&lt;/p&gt;
&lt;div id=&#34;starting-internship-at-lieber-institute&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Starting internship at Lieber Institute&lt;/h2&gt;
&lt;p&gt;As part of the Genomic Sciences undergraduate program at Universidad Nacional Autónoma de México (UNAM), I attended a single-cell data analysis course taught by Leo Collado. During the sessions, I found programming in R quite fun and useful and decided I wanted to go deeper into this programming language. My interest grew when Leo highlighted at the CDSB Workshop 2020 that we could be not just R users but developers, and generate helpful tools for biological data analysis. With this motivation, I reached out to Leo, and that’s how the adventure started: I joined Leonardo Collado’s team at LIBD, and my research internship was inaugurated with this tweet.&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;I&amp;#39;m super happy that I&amp;#39;m starting a research internship mentored by &lt;a href=&#34;https://twitter.com/fellgernon?ref_src=twsrc%5Etfw&#34;&gt;@fellgernon&lt;/a&gt;  at &lt;a href=&#34;https://twitter.com/LieberInstitute?ref_src=twsrc%5Etfw&#34;&gt;@LieberInstitute&lt;/a&gt; and &lt;a href=&#34;https://twitter.com/jhubiostat?ref_src=twsrc%5Etfw&#34;&gt;@jhubiostat&lt;/a&gt; 😆.&lt;a href=&#34;https://twitter.com/hashtag/CDSB2020?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#CDSB2020&lt;/a&gt; and the course at LIIGH definitely impacted me by increasing my interest in R. Attending was such an encouraging and enriching experience. &lt;a href=&#34;https://t.co/1iwqKwolTb&#34;&gt;https://t.co/1iwqKwolTb&lt;/a&gt;&lt;/p&gt;&amp;mdash; Brenda Pardo (@PardoBree) &lt;a href=&#34;https://twitter.com/PardoBree/status/1302095163116802048?ref_src=twsrc%5Etfw&#34;&gt;September 5, 2020&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;


&lt;p&gt;Since then, I have been working with Leo on adapting the package &lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.12/spatialLIBD&#34;&gt;spatialLIBD&lt;/a&gt;&lt;/em&gt; to use R objects structured specifically to store spatial transcriptomics data. This is work that I’ve been doing part-time while also attending the third &lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; year classes at LCG-UNAM-EJ.&lt;/p&gt;
&lt;iframe src=&#34;https://giphy.com/embed/lluj1cauAlO2vQEm8A&#34; width=&#34;480&#34; height=&#34;270&#34; frameBorder=&#34;0&#34; class=&#34;giphy-embed&#34; allowFullScreen&gt;
&lt;/iframe&gt;
&lt;p&gt;
&lt;a href=&#34;https://giphy.com/gifs/facethetruthtv-lluj1cauAlO2vQEm8A&#34;&gt;via GIPHY&lt;/a&gt;
&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;what-is-spatiallibd&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What is spatialLIBD?&lt;/h2&gt;
&lt;p&gt;Spatial transcriptomics makes it possible to measure the transcriptome of a small group of cells in a tissue sample and to map the exact location of the cells with that expression profile. This technology has generated the need for tools to visualize the data it produces. &lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.12/spatialLIBD&#34;&gt;spatialLIBD&lt;/a&gt;&lt;/em&gt; is a Bioconductor package to interactively inspect the DLPFC spatial transcriptomics 10x Genomics Visium data from Maynard, Collado-Torres et al., 2020, analyzed by LIBD researchers and collaborators.&lt;/p&gt;
&lt;p&gt;It contains functions for:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;accessing the spatial transcriptomics data,&lt;/li&gt;
&lt;li&gt;visualizing the spot-level spatial gene expression data and clusters, and&lt;/li&gt;
&lt;li&gt;inspecting the data interactively either on the user’s computer or through a &lt;a href=&#34;http://spatial.libd.org/spatialLIBD/&#34;&gt;web application&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The &lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.12/spatialLIBD&#34;&gt;spatialLIBD&lt;/a&gt;&lt;/em&gt; package used to employ R objects from the &lt;code&gt;SingleCellExperiment&lt;/code&gt; class to store the data; however, Righelli et al. created a class better suited to this data: &lt;code&gt;VisiumExperiment&lt;/code&gt;. This class lives in the &lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.12/SpatialExperiment&#34;&gt;SpatialExperiment&lt;/a&gt;&lt;/em&gt; package.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-visiumexperiment-class&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The VisiumExperiment class&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;VisiumExperiment&lt;/code&gt; class inherits from the &lt;code&gt;SingleCellExperiment&lt;/code&gt; class, but it includes new attributes and methods that allow the spatial transcriptomics data to be stored optimally. It contains specific slots to store:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The spatial coordinates for each small group of cells (contained in a spot).&lt;/li&gt;
&lt;li&gt;The paths to the tissue images.&lt;/li&gt;
&lt;li&gt;Information about which spots are covered by tissue.&lt;/li&gt;
&lt;li&gt;The scale factors to know the location of the spots in pixels.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;and methods to set and retrieve the information contained in these slots.&lt;/p&gt;
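&lt;p&gt;For example, given a &lt;code&gt;VisiumExperiment&lt;/code&gt; object &lt;code&gt;ve&lt;/code&gt;, these accessors look roughly like this (a sketch; the accessor names are assumptions based on the &lt;em&gt;SpatialExperiment&lt;/em&gt; documentation of that release):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Sketch of the slot accessors (names assumed from the SpatialExperiment docs)
SpatialExperiment::spatialCoords(ve) ## spot-level spatial coordinates
SpatialExperiment::scaleFactors(ve)  ## spot-to-pixel scale factors
SpatialExperiment::imagePaths(ve)    ## paths to the tissue images
SpatialExperiment::isInTissue(ve)    ## which spots are covered by tissue&lt;/code&gt;&lt;/pre&gt;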
&lt;p&gt;Now that I have introduced the context, I can describe my task: adapting the set of functions that make up &lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.12/spatialLIBD&#34;&gt;spatialLIBD&lt;/a&gt;&lt;/em&gt; so that they can work with &lt;code&gt;VisiumExperiment&lt;/code&gt; objects. Let me now walk you through the updates.&lt;/p&gt;
&lt;p&gt;Before starting, remember to install &lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.12/spatialLIBD&#34;&gt;spatialLIBD&lt;/a&gt;&lt;/em&gt; by using the following commands in your R session.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;if (!requireNamespace(&amp;quot;BiocManager&amp;quot;, quietly = TRUE)) {
      install.packages(&amp;quot;BiocManager&amp;quot;)
  }

BiocManager::install(&amp;quot;spatialLIBD&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, please load the package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;spatialLIBD&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;downloading-dlpfc-data-contained-in-a-visiumexperiment-object&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Downloading DLPFC data contained in a VisiumExperiment object&lt;/h2&gt;
&lt;p&gt;Let’s start with the function &lt;a href=&#34;http://research.libd.org/spatialLIBD/reference/fetch_data.html&#34;&gt;&lt;code&gt;fetch_data()&lt;/code&gt;&lt;/a&gt;, which was previously designed to retrieve a &lt;code&gt;SingleCellExperiment&lt;/code&gt; object called &lt;code&gt;sce&lt;/code&gt; containing DLPFC spatial transcriptomics data. With our updates, the function can return &lt;code&gt;ve&lt;/code&gt;, an object of the &lt;code&gt;VisiumExperiment&lt;/code&gt; class, by calling a new function, &lt;a href=&#34;http://research.libd.org/spatialLIBD/reference/sce_to_ve.html&#34;&gt;&lt;code&gt;sce_to_ve()&lt;/code&gt;&lt;/a&gt;, which takes data from &lt;code&gt;sce&lt;/code&gt; and rearranges it into the structure of the &lt;code&gt;VisiumExperiment&lt;/code&gt; class (defined by Righelli et al.).&lt;/p&gt;
&lt;p&gt;Below, we obtain &lt;code&gt;ve&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Download ve object
ve &amp;lt;- fetch_data(type = &amp;quot;ve&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## snapshotDate(): 2020-10-02&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 2020-11-09 14:59:36 loading file /Users/lcollado/Library/Caches/BiocFileCache/c4f432e69d6_Human_DLPFC_Visium_processedData_sce_scran_spatialLIBD.Rdata%3Fdl%3D1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once we have downloaded the object, we can explore it:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ve&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## class: VisiumExperiment 
## dim: 33538 47681 
## metadata(0):
## assays(2): counts logcounts
## rownames(33538): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
##   ENSG00000268674
## rowData names(9): source type ... gene_search is_top_hvg
## colnames(47681): AAACAACGAATAGTTC-1 AAACAAGTATCTCCCA-1 ...
##   TTGTTTCCATACAACT-1 TTGTTTGTGTAAATTC-1
## colData names(66): Cluster height ... pseudobulk_UMAP_spatial
##   markers_UMAP_spatial
## reducedDimNames(6): PCA TSNE_perplexity50 ... TSNE_perplexity80
##   UMAP_neighbors15
## altExpNames(0):
## spatialCoordinates(7): Cell_ID sample_name ... pxl_row_in_fullres
##   pxl_col_in_fullres
## inTissue(1): 47681
## imagePaths(12):
##   /Users/lcollado/Library/Caches/BiocFileCache/c4f3c2dc99_151507_tissue_lowres_image.png
##   /Users/lcollado/Library/Caches/BiocFileCache/c4f6e20c2bc_151508_tissue_lowres_image.png
##   ...
##   /Users/lcollado/Library/Caches/BiocFileCache/c4f2196b8e6_151675_tissue_lowres_image.png
##   /Users/lcollado/Library/Caches/BiocFileCache/c4f2e451544_151676_tissue_lowres_image.png&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Observe that our &lt;code&gt;ve&lt;/code&gt; object is a bit more complex than a regular &lt;code&gt;VisiumExperiment&lt;/code&gt; one because it contains multiple samples and, as a consequence, multiple images. This had an impact on how we arranged part of the data.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;storing-and-retrieving-scale-factors-from-a-visiumexperiment-object&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Storing and retrieving scale factors from a VisiumExperiment object&lt;/h2&gt;
&lt;p&gt;As previously mentioned, the &lt;code&gt;VisiumExperiment&lt;/code&gt; class has a new slot named &lt;code&gt;scaleFactors&lt;/code&gt;, created to store the values needed to convert spot coordinates into pixels. This slot has a storage limitation: it has to contain a list with the exact names of the four scale factors for a given sample (&lt;code&gt;tissue_lowres_scalef&lt;/code&gt;, &lt;code&gt;fiducial_diameter_fullres&lt;/code&gt;, &lt;code&gt;tissue_hires_scalef&lt;/code&gt;, &lt;code&gt;spot_diameter_fullres&lt;/code&gt;). Given that in the DLPFC data we have not one but multiple samples, we decided to create a list with the four required scale factor names, plus a fifth element holding the full list of scale factors for all our samples. It looks like this:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Get scale factors
facs &amp;lt;- SpatialExperiment::scaleFactors(ve)

## &amp;quot;current&amp;quot; scale factors
facs[1:4]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## $spot_diameter_fullres
## [1] 96.37511
## 
## $tissue_hires_scalef
## [1] 0.150015
## 
## $fiducial_diameter_fullres
## [1] 144.5627
## 
## $tissue_lowres_scalef
## [1] 0.0450045&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Data for the rest of the 12 images
class(facs[5])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;list&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;length(facs[[5]])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 12&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In addition, we created a function called &lt;a href=&#34;http://research.libd.org/spatialLIBD/reference/update_scaleFactors.html&#34;&gt;&lt;code&gt;update_scaleFactors()&lt;/code&gt;&lt;/a&gt; that generates a &lt;code&gt;VisiumExperiment&lt;/code&gt; object with updated scale factors for a given input sample ID, in case the user wants only the scale factors for a single sample.&lt;/p&gt;
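&lt;p&gt;A minimal sketch of how that could look (the argument names here are an assumption based on the function description; see the linked reference page for the actual signature):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Assumed signature -- check the update_scaleFactors() reference for details
ve &amp;lt;- update_scaleFactors(ve, sample_id = &amp;quot;151507&amp;quot;)&lt;/code&gt;&lt;/pre&gt;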
&lt;/div&gt;
&lt;div id=&#34;visualizing-the-histology-image-from-a-visiumexperiment-object&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Visualizing the histology image from a VisiumExperiment object&lt;/h2&gt;
&lt;p&gt;Now let’s talk about the new location of the histology images in &lt;code&gt;ve&lt;/code&gt; and how to display them. The object &lt;code&gt;sce&lt;/code&gt; has a tibble in its metadata slot that contains a grob for each sample image. In contrast, &lt;code&gt;ve&lt;/code&gt; has a list of image paths contained in the &lt;code&gt;imagePaths&lt;/code&gt; slot.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;downloading-the-histology-images&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Downloading the histology images&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;VisiumExperiment&lt;/code&gt; class validity code checks that the image files, whose paths are in &lt;code&gt;imagePaths&lt;/code&gt;, exist locally rather than being available remotely through a URL. Thus, we decided to download the images at the moment the &lt;code&gt;ve&lt;/code&gt; object is created; this process happens in our function &lt;code&gt;sce_to_ve()&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-geom_spatial-function&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The geom_spatial() function&lt;/h2&gt;
&lt;p&gt;The function &lt;a href=&#34;http://research.libd.org/spatialLIBD/reference/geom_spatial.html&#34;&gt;&lt;code&gt;geom_spatial()&lt;/code&gt;&lt;/a&gt; was previously created to visualize the Visium histology image. It does so by defining a &lt;code&gt;ggplot2::layer()&lt;/code&gt; using the information from the metadata tibble of the &lt;code&gt;sce&lt;/code&gt; object. To make &lt;a href=&#34;http://research.libd.org/spatialLIBD/reference/geom_spatial.html&#34;&gt;&lt;code&gt;geom_spatial()&lt;/code&gt;&lt;/a&gt; accept &lt;code&gt;ve&lt;/code&gt; as an input, we created a new function, &lt;a href=&#34;http://research.libd.org/spatialLIBD/reference/read_image.html&#34;&gt;&lt;code&gt;read_image()&lt;/code&gt;&lt;/a&gt;, that takes the image paths of the desired samples, creates the grobs, and puts them in a tibble.&lt;/p&gt;
&lt;p&gt;Here is an example of how to use these functions.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Extract data from a sample (with ID 151507)
sample_id &amp;lt;- &amp;quot;151507&amp;quot;
ve_sub &amp;lt;- ve[, SpatialExperiment::spatialCoords(ve)$sample_name == sample_id]
sample_df &amp;lt;- as.data.frame(SpatialExperiment::spatialCoords(ve_sub))

## Plot with geom_spatial
ggplot2::ggplot(
    sample_df,
    ggplot2::aes(
        x = imagecol,
        y = imagerow,
    )
) +
geom_spatial(
    data = read_image(ve_sub, sample_id),
    ggplot2::aes(grob = grob),
    x = 0.5,
    y = 0.5
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2020-11-06-using-visiumexperiment-at-spatiallibd-package_files/figure-html/geom_spatial-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;making-other-functions-compatible-with-visiumexperiment-class&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Making other functions compatible with VisiumExperiment class&lt;/h2&gt;
&lt;p&gt;Other functions, such as &lt;a href=&#34;http://research.libd.org/spatialLIBD/reference/sce_image_gene.html&#34;&gt;&lt;code&gt;sce_image_gene()&lt;/code&gt;&lt;/a&gt;, &lt;a href=&#34;http://research.libd.org/spatialLIBD/reference/sce_image_clus.html&#34;&gt;&lt;code&gt;sce_image_clus()&lt;/code&gt;&lt;/a&gt;, &lt;a href=&#34;http://research.libd.org/spatialLIBD/reference/sce_image_grid.html&#34;&gt;&lt;code&gt;sce_image_grid()&lt;/code&gt;&lt;/a&gt; and &lt;a href=&#34;http://research.libd.org/spatialLIBD/reference/sce_image_grid_gene.html&#34;&gt;&lt;code&gt;sce_image_grid_gene()&lt;/code&gt;&lt;/a&gt;, access the column containing the sample IDs, which lives in different slots depending on the object class: for &lt;code&gt;sce&lt;/code&gt;, &lt;code&gt;sample_name&lt;/code&gt; is in the &lt;code&gt;colData&lt;/code&gt; slot, while for &lt;code&gt;ve&lt;/code&gt; it is in the &lt;code&gt;spatialCoords&lt;/code&gt; slot. Hence, conditionals evaluating the object class were added in order to access the information correctly.&lt;/p&gt;
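&lt;p&gt;The pattern behind those conditionals looks roughly like this (a simplified sketch, not the actual package code):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Simplified sketch: locate the sample IDs based on the object class
sample_names &amp;lt;- if (methods::is(sce, &amp;quot;VisiumExperiment&amp;quot;)) {
    SpatialExperiment::spatialCoords(sce)$sample_name
} else {
    colData(sce)$sample_name
}&lt;/code&gt;&lt;/pre&gt;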
&lt;p&gt;Finally, the &lt;a href=&#34;http://research.libd.org/spatialLIBD/reference/sce_image_clus_p.html&#34;&gt;&lt;code&gt;sce_image_clus_p()&lt;/code&gt;&lt;/a&gt; and &lt;a href=&#34;http://research.libd.org/spatialLIBD/reference/sce_image_gene_p.html&#34;&gt;&lt;code&gt;sce_image_gene_p()&lt;/code&gt;&lt;/a&gt; functions also had important modifications. They visualize the gene expression (or any continuous variable) and the clusters for one given sample at the spot level, using the histology information as the background. Both functions receive a data frame with information residing in the &lt;code&gt;colData&lt;/code&gt; of the object &lt;code&gt;sce&lt;/code&gt;, including the spatial coordinates. Given that the &lt;code&gt;VisiumExperiment&lt;/code&gt; object &lt;code&gt;ve&lt;/code&gt; stores the spatial coordinates in the &lt;code&gt;spatialCoords&lt;/code&gt; slot, we created a function called &lt;a href=&#34;http://research.libd.org/spatialLIBD/reference/ve_image_colData.html&#34;&gt;&lt;code&gt;ve_image_colData()&lt;/code&gt;&lt;/a&gt; that generates the input data frame for &lt;code&gt;sce_image_clus_p()&lt;/code&gt; and &lt;code&gt;sce_image_gene_p()&lt;/code&gt; by joining the required columns.&lt;/p&gt;
&lt;p&gt;An example of how these functions can accept the &lt;code&gt;VisiumExperiment&lt;/code&gt; object is shown here:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Use the data previously extracted for a sample 
## Prepare the data for the plotting function
df &amp;lt;- colData(ve_sub)
df$COUNT &amp;lt;- df$expr_chrM_ratio

sce_image_gene_p(
    sce = ve_sub,
    d = df,
    sampleid = sample_id,
    title = &amp;quot;151507 chrM expr ratio&amp;quot;,
    spatial = FALSE
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2020-11-06-using-visiumexperiment-at-spatiallibd-package_files/figure-html/sce_image_gene_p-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;That said, the goal is that the end user won’t need to use these lower-level functions and can just run code like this with both our original &lt;code&gt;SingleCellExperiment&lt;/code&gt; objects and the new &lt;code&gt;VisiumExperiment&lt;/code&gt; objects:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sce_image_gene(
    sce = ve_sub,
    sampleid = sample_id,
    spatial = TRUE
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2020-11-06-using-visiumexperiment-at-spatiallibd-package_files/figure-html/sce_image_gene-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;wrapping-up&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Wrapping up&lt;/h2&gt;
&lt;p&gt;And that’s it: those are the main modifications the package went through to accept our &lt;code&gt;VisiumExperiment&lt;/code&gt; objects. These changes are part of &lt;a href=&#34;http://bioconductor.org/news/bioc_3_12_release/&#34;&gt;Bioconductor 3.12&lt;/a&gt;, which was just released, so you can try them out!&lt;/p&gt;
&lt;p&gt;I could make a long list of things that I have learned and gained during the process, but I will summarize it briefly. Firstly, I had so much fun coding in R; I became very familiar with the language and started feeling like a fish in water. Secondly, I learned how to write R packages and found them quite useful for organizing and documenting code. And lastly, I acquired a set of skills and good habits for organizing my code and projects in R. I want to emphasize that an element of great importance to this learning and growth process is having a mentor who is enthusiastic about sharing knowledge and always attentive and patient with the student. His teaching has driven and inspired me to program and to enjoy doing it. Undoubtedly, the road continues and there is much left to learn!&lt;/p&gt;
&lt;iframe src=&#34;https://giphy.com/embed/l3dj09hpsfuYkijDi&#34; width=&#34;480&#34; height=&#34;270&#34; frameBorder=&#34;0&#34; class=&#34;giphy-embed&#34; allowFullScreen&gt;
&lt;/iframe&gt;
&lt;p&gt;
&lt;a href=&#34;https://giphy.com/gifs/thegoldbergs--1990-the-goldbergs-l3dj09hpsfuYkijDi&#34;&gt;via GIPHY&lt;/a&gt;
&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;future-plans&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Future Plans?&lt;/h2&gt;
&lt;p&gt;I’m looking forward to learning more about data analysis while working with 10x Genomics Visium data alongside Leo and colleagues at LIBD, all while potentially working on more R packages as I keep learning new skills that will help me get into grad school programs. In the meantime, I’m going to apply to present these updates at EuroBioC2020 and would love to meet you there!&lt;/p&gt;
&lt;p&gt;Thanks for reading this post, and you are welcome to keep exploring &lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.12/spatialLIBD&#34;&gt;spatialLIBD&lt;/a&gt;&lt;/em&gt; and the DLPFC data.&lt;/p&gt;
&lt;iframe src=&#34;https://giphy.com/embed/6wmz6Qo40eTDf4tW3Z&#34; width=&#34;480&#34; height=&#34;270&#34; frameBorder=&#34;0&#34; class=&#34;giphy-embed&#34; allowFullScreen&gt;
&lt;/iframe&gt;
&lt;p&gt;
&lt;a href=&#34;https://giphy.com/gifs/6wmz6Qo40eTDf4tW3Z&#34;&gt;via GIPHY&lt;/a&gt;
&lt;/p&gt;
&lt;div id=&#34;acknowledgements&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Acknowledgements&lt;/h3&gt;
&lt;p&gt;This blog post was made possible thanks to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Xie_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/rstudio/blogdown&#39;&gt;Xie, Hill, and Thomas, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;knitcitations&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Boettiger_2020&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/cboettig/knitcitations&#39;&gt;Boettiger, 2020&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=sessioninfo&#34;&gt;sessioninfo&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Csardi_2018&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=sessioninfo&#39;&gt;Csárdi, core, Wickham, Chang, et al., 2018&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.12/BiocStyle&#34;&gt;BiocStyle&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Oles_2020&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/Bioconductor/BiocStyle&#39;&gt;Oleś, Morgan, and Huber, 2020&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Boettiger_2020&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Boettiger_2020&#34;&gt;[1]&lt;/a&gt;&lt;cite&gt;
C. Boettiger.
&lt;em&gt;knitcitations: Citations for ‘Knitr’ Markdown Files&lt;/em&gt;.
R package version 1.0.10.
2020.
URL: &lt;a href=&#34;https://github.com/cboettig/knitcitations&#34;&gt;https://github.com/cboettig/knitcitations&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Csardi_2018&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Csardi_2018&#34;&gt;[2]&lt;/a&gt;&lt;cite&gt;
G. Csárdi, R. core, H. Wickham, W. Chang, et al.
&lt;em&gt;sessioninfo: R Session Information&lt;/em&gt;.
R package version 1.1.1.
2018.
URL: &lt;a href=&#34;https://CRAN.R-project.org/package=sessioninfo&#34;&gt;https://CRAN.R-project.org/package=sessioninfo&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Maynard_2020&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Maynard_2020&#34;&gt;[3]&lt;/a&gt;&lt;cite&gt;
K. R. Maynard, L. Collado-Torres, L. M. Weber, C. Uytingco, et al.
“Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex”.
In: &lt;em&gt;bioRxiv&lt;/em&gt; (2020).
DOI: &lt;a href=&#34;https://doi.org/10.1101/2020.02.28.969931&#34;&gt;10.1101/2020.02.28.969931&lt;/a&gt;.
URL: &lt;a href=&#34;https://www.biorxiv.org/content/10.1101/2020.02.28.969931v1&#34;&gt;https://www.biorxiv.org/content/10.1101/2020.02.28.969931v1&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Oles_2020&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Oles_2020&#34;&gt;[4]&lt;/a&gt;&lt;cite&gt;
A. Oleś, M. Morgan, and W. Huber.
&lt;em&gt;BiocStyle: Standard styles for vignettes and other Bioconductor documents&lt;/em&gt;.
R package version 2.18.0.
2020.
URL: &lt;a href=&#34;https://github.com/Bioconductor/BiocStyle&#34;&gt;https://github.com/Bioconductor/BiocStyle&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Righelli_2020&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Righelli_2020&#34;&gt;[5]&lt;/a&gt;&lt;cite&gt;
D. Righelli and D. Risso.
&lt;em&gt;SpatialExperiment: S4 Class for Spatial Experiments handling&lt;/em&gt;.
R package version 1.0.0.
2020.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Xie_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Xie_2017&#34;&gt;[6]&lt;/a&gt;&lt;cite&gt;
Y. Xie, A. P. Hill, and A. Thomas.
&lt;em&gt;blogdown: Creating Websites with R Markdown&lt;/em&gt;.
ISBN 978-0815363729.
Boca Raton, Florida: Chapman and Hall/CRC, 2017.
URL: &lt;a href=&#34;https://github.com/rstudio/blogdown&#34;&gt;https://github.com/rstudio/blogdown&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.3 (2020-10-10)
##  os       macOS Catalina 10.15.7      
##  system   x86_64, darwin17.0          
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       America/New_York            
##  date     2020-11-09                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package                * version  date       lib source                                 
##  AnnotationDbi            1.52.0   2020-10-27 [1] Bioconductor                           
##  AnnotationHub            2.22.0   2020-10-27 [1] Bioconductor                           
##  assertthat               0.2.1    2019-03-21 [1] CRAN (R 4.0.2)                         
##  attempt                  0.3.1    2020-05-03 [1] CRAN (R 4.0.2)                         
##  backports                1.2.0    2020-11-02 [1] CRAN (R 4.0.3)                         
##  beachmat                 2.6.0    2020-10-27 [1] Bioconductor                           
##  beeswarm                 0.2.3    2016-04-25 [1] CRAN (R 4.0.2)                         
##  benchmarkme              1.0.4    2020-05-09 [1] CRAN (R 4.0.2)                         
##  benchmarkmeData          1.0.4    2020-04-23 [1] CRAN (R 4.0.2)                         
##  Biobase                * 2.50.0   2020-10-27 [1] Bioconductor                           
##  BiocFileCache            1.14.0   2020-10-27 [1] Bioconductor                           
##  BiocGenerics           * 0.36.0   2020-10-27 [1] Bioconductor                           
##  BiocManager              1.30.10  2019-11-16 [1] CRAN (R 4.0.2)                         
##  BiocNeighbors            1.8.0    2020-10-27 [1] Bioconductor                           
##  BiocParallel             1.24.0   2020-10-27 [1] Bioconductor                           
##  BiocSingular             1.6.0    2020-10-27 [1] Bioconductor                           
##  BiocStyle              * 2.18.0   2020-10-27 [1] Bioconductor                           
##  BiocVersion              3.12.0   2020-05-14 [1] Bioconductor                           
##  bit                      4.0.4    2020-08-04 [1] CRAN (R 4.0.2)                         
##  bit64                    4.0.5    2020-08-30 [1] CRAN (R 4.0.2)                         
##  bitops                   1.0-6    2013-08-17 [1] CRAN (R 4.0.2)                         
##  blob                     1.2.1    2020-01-20 [1] CRAN (R 4.0.2)                         
##  blogdown               * 0.21     2020-10-11 [1] CRAN (R 4.0.3)                         
##  bmp                      0.3      2017-09-11 [1] CRAN (R 4.0.2)                         
##  bookdown                 0.21     2020-10-13 [1] CRAN (R 4.0.3)                         
##  cli                      2.1.0    2020-10-12 [1] CRAN (R 4.0.2)                         
##  codetools                0.2-16   2018-12-24 [1] CRAN (R 4.0.3)                         
##  colorout                 1.2-2    2020-11-03 [1] Github (jalvesaq/colorout@726d681)     
##  colorspace               1.4-1    2019-03-18 [1] CRAN (R 4.0.2)                         
##  config                   0.3      2018-03-27 [1] CRAN (R 4.0.2)                         
##  cowplot                  1.1.0    2020-09-08 [1] CRAN (R 4.0.2)                         
##  crayon                   1.3.4    2017-09-16 [1] CRAN (R 4.0.2)                         
##  curl                     4.3      2019-12-02 [1] CRAN (R 4.0.1)                         
##  data.table               1.13.2   2020-10-19 [1] CRAN (R 4.0.2)                         
##  DBI                      1.1.0    2019-12-15 [1] CRAN (R 4.0.2)                         
##  dbplyr                   2.0.0    2020-11-03 [1] CRAN (R 4.0.3)                         
##  DelayedArray             0.16.0   2020-10-27 [1] Bioconductor                           
##  DelayedMatrixStats       1.12.0   2020-10-27 [1] Bioconductor                           
##  desc                     1.2.0    2018-05-01 [1] CRAN (R 4.0.2)                         
##  digest                   0.6.27   2020-10-24 [1] CRAN (R 4.0.2)                         
##  dockerfiler              0.1.3    2019-03-19 [1] CRAN (R 4.0.2)                         
##  doParallel               1.0.16   2020-10-16 [1] CRAN (R 4.0.2)                         
##  dotCall64                1.0-0    2018-07-30 [1] CRAN (R 4.0.2)                         
##  dplyr                    1.0.2    2020-08-18 [1] CRAN (R 4.0.2)                         
##  DT                       0.16     2020-10-13 [1] CRAN (R 4.0.2)                         
##  ellipsis                 0.3.1    2020-05-15 [1] CRAN (R 4.0.2)                         
##  evaluate                 0.14     2019-05-28 [1] CRAN (R 4.0.1)                         
##  ExperimentHub            1.16.0   2020-10-27 [1] Bioconductor                           
##  fansi                    0.4.1    2020-01-08 [1] CRAN (R 4.0.2)                         
##  farver                   2.0.3    2020-01-16 [1] CRAN (R 4.0.2)                         
##  fastmap                  1.0.1    2019-10-08 [1] CRAN (R 4.0.2)                         
##  fields                   11.6     2020-10-09 [1] CRAN (R 4.0.2)                         
##  foreach                  1.5.1    2020-10-15 [1] CRAN (R 4.0.2)                         
##  fs                       1.5.0    2020-07-31 [1] CRAN (R 4.0.2)                         
##  generics                 0.1.0    2020-10-31 [1] CRAN (R 4.0.2)                         
##  GenomeInfoDb           * 1.26.0   2020-10-27 [1] Bioconductor                           
##  GenomeInfoDbData         1.2.4    2020-11-03 [1] Bioconductor                           
##  GenomicRanges          * 1.42.0   2020-10-27 [1] Bioconductor                           
##  ggbeeswarm               0.6.0    2017-08-07 [1] CRAN (R 4.0.2)                         
##  ggplot2                  3.3.2    2020-06-19 [1] CRAN (R 4.0.2)                         
##  glue                     1.4.2    2020-08-27 [1] CRAN (R 4.0.2)                         
##  golem                    0.2.1    2020-03-05 [1] CRAN (R 4.0.2)                         
##  gridExtra                2.3      2017-09-09 [1] CRAN (R 4.0.2)                         
##  gtable                   0.3.0    2019-03-25 [1] CRAN (R 4.0.2)                         
##  htmltools                0.5.0    2020-06-16 [1] CRAN (R 4.0.2)                         
##  htmlwidgets              1.5.2    2020-10-03 [1] CRAN (R 4.0.2)                         
##  httpuv                   1.5.4    2020-06-06 [1] CRAN (R 4.0.2)                         
##  httr                     1.4.2    2020-07-20 [1] CRAN (R 4.0.2)                         
##  interactiveDisplayBase   1.28.0   2020-10-27 [1] Bioconductor                           
##  IRanges                * 2.24.0   2020-10-27 [1] Bioconductor                           
##  irlba                    2.3.3    2019-02-05 [1] CRAN (R 4.0.2)                         
##  iterators                1.0.13   2020-10-15 [1] CRAN (R 4.0.2)                         
##  jpeg                     0.1-8.1  2019-10-24 [1] CRAN (R 4.0.2)                         
##  jsonlite                 1.7.1    2020-09-07 [1] CRAN (R 4.0.2)                         
##  knitcitations          * 1.0.10   2020-11-03 [1] Github (cboettig/knitcitations@ea5d202)
##  knitr                    1.30     2020-09-22 [1] CRAN (R 4.0.2)                         
##  labeling                 0.4.2    2020-10-20 [1] CRAN (R 4.0.2)                         
##  later                    1.1.0.1  2020-06-05 [1] CRAN (R 4.0.2)                         
##  lattice                  0.20-41  2020-04-02 [1] CRAN (R 4.0.3)                         
##  lazyeval                 0.2.2    2019-03-15 [1] CRAN (R 4.0.2)                         
##  lifecycle                0.2.0    2020-03-06 [1] CRAN (R 4.0.2)                         
##  lubridate                1.7.9    2020-06-08 [1] CRAN (R 4.0.2)                         
##  magrittr                 1.5      2014-11-22 [1] CRAN (R 4.0.2)                         
##  maps                     3.3.0    2018-04-03 [1] CRAN (R 4.0.2)                         
##  Matrix                   1.2-18   2019-11-27 [1] CRAN (R 4.0.3)                         
##  MatrixGenerics         * 1.2.0    2020-10-27 [1] Bioconductor                           
##  matrixStats            * 0.57.0   2020-09-25 [1] CRAN (R 4.0.2)                         
##  memoise                  1.1.0    2017-04-21 [1] CRAN (R 4.0.2)                         
##  mime                     0.9      2020-02-04 [1] CRAN (R 4.0.2)                         
##  munsell                  0.5.0    2018-06-12 [1] CRAN (R 4.0.2)                         
##  pillar                   1.4.6    2020-07-10 [1] CRAN (R 4.0.2)                         
##  pkgconfig                2.0.3    2019-09-22 [1] CRAN (R 4.0.2)                         
##  pkgload                  1.1.0    2020-05-29 [1] CRAN (R 4.0.2)                         
##  plotly                   4.9.2.1  2020-04-04 [1] CRAN (R 4.0.2)                         
##  plyr                     1.8.6    2020-03-03 [1] CRAN (R 4.0.2)                         
##  png                      0.1-7    2013-12-03 [1] CRAN (R 4.0.2)                         
##  Polychrome               1.2.5    2020-03-29 [1] CRAN (R 4.0.2)                         
##  promises                 1.1.1    2020-06-09 [1] CRAN (R 4.0.2)                         
##  purrr                    0.3.4    2020-04-17 [1] CRAN (R 4.0.2)                         
##  R6                       2.5.0    2020-10-28 [1] CRAN (R 4.0.2)                         
##  rappdirs                 0.3.1    2016-03-28 [1] CRAN (R 4.0.2)                         
##  RColorBrewer             1.1-2    2014-12-07 [1] CRAN (R 4.0.2)                         
##  Rcpp                     1.0.5    2020-07-06 [1] CRAN (R 4.0.2)                         
##  RCurl                    1.98-1.2 2020-04-18 [1] CRAN (R 4.0.2)                         
##  readbitmap               0.1.5    2018-06-27 [1] CRAN (R 4.0.2)                         
##  RefManageR               1.3.0    2020-11-03 [1] Github (ropensci/RefManageR@ab8fe60)   
##  remotes                  2.2.0    2020-07-21 [1] CRAN (R 4.0.2)                         
##  rlang                    0.4.8    2020-10-08 [1] CRAN (R 4.0.2)                         
##  rmarkdown                2.5      2020-10-21 [1] CRAN (R 4.0.3)                         
##  roxygen2                 7.1.1    2020-06-27 [1] CRAN (R 4.0.2)                         
##  rprojroot                1.3-2    2018-01-03 [1] CRAN (R 4.0.2)                         
##  RSQLite                  2.2.1    2020-09-30 [1] CRAN (R 4.0.2)                         
##  rstudioapi               0.11     2020-02-07 [1] CRAN (R 4.0.2)                         
##  rsvd                     1.0.3    2020-02-17 [1] CRAN (R 4.0.2)                         
##  S4Vectors              * 0.28.0   2020-10-27 [1] Bioconductor                           
##  scales                   1.1.1    2020-05-11 [1] CRAN (R 4.0.2)                         
##  scater                   1.18.0   2020-10-27 [1] Bioconductor                           
##  scatterplot3d            0.3-41   2018-03-14 [1] CRAN (R 4.0.2)                         
##  scuttle                  1.0.0    2020-10-27 [1] Bioconductor                           
##  sessioninfo            * 1.1.1    2018-11-05 [1] CRAN (R 4.0.2)                         
##  shiny                    1.5.0    2020-06-23 [1] CRAN (R 4.0.2)                         
##  shinyWidgets             0.5.4    2020-10-06 [1] CRAN (R 4.0.2)                         
##  SingleCellExperiment   * 1.12.0   2020-10-27 [1] Bioconductor                           
##  spam                     2.5-1    2019-12-12 [1] CRAN (R 4.0.2)                         
##  sparseMatrixStats        1.2.0    2020-10-27 [1] Bioconductor                           
##  SpatialExperiment        1.0.0    2020-10-27 [1] Bioconductor                           
##  spatialLIBD            * 1.2.0    2020-10-29 [1] Bioconductor                           
##  stringi                  1.5.3    2020-09-09 [1] CRAN (R 4.0.2)                         
##  stringr                  1.4.0    2019-02-10 [1] CRAN (R 4.0.2)                         
##  SummarizedExperiment   * 1.20.0   2020-10-27 [1] Bioconductor                           
##  testthat                 3.0.0    2020-10-31 [1] CRAN (R 4.0.2)                         
##  tibble                   3.0.4    2020-10-12 [1] CRAN (R 4.0.2)                         
##  tidyr                    1.1.2    2020-08-27 [1] CRAN (R 4.0.2)                         
##  tidyselect               1.1.0    2020-05-11 [1] CRAN (R 4.0.2)                         
##  tiff                     0.1-5    2013-09-04 [1] CRAN (R 4.0.2)                         
##  usethis                  1.6.3    2020-09-17 [1] CRAN (R 4.0.2)                         
##  vctrs                    0.3.4    2020-08-29 [1] CRAN (R 4.0.2)                         
##  vipor                    0.4.5    2017-03-22 [1] CRAN (R 4.0.2)                         
##  viridis                  0.5.1    2018-03-29 [1] CRAN (R 4.0.2)                         
##  viridisLite              0.3.0    2018-02-01 [1] CRAN (R 4.0.1)                         
##  withr                    2.3.0    2020-09-22 [1] CRAN (R 4.0.2)                         
##  xfun                     0.19     2020-10-30 [1] CRAN (R 4.0.2)                         
##  xml2                     1.3.2    2020-04-23 [1] CRAN (R 4.0.2)                         
##  xtable                   1.8-4    2019-04-21 [1] CRAN (R 4.0.2)                         
##  XVector                  0.30.0   2020-10-28 [1] Bioconductor                           
##  yaml                     2.2.1    2020-02-01 [1] CRAN (R 4.0.2)                         
##  zlibbioc                 1.36.0   2020-10-28 [1] Bioconductor                           
## 
## [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;I believe that it’s the junior year in the US.&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Using Space Ranger at JHPCE</title>
      <link>http://LieberInstitute.github.io/rstatsclub/2020/10/20/using-space-ranger-at-jhpce/</link>
      <pubDate>Tue, 20 Oct 2020 00:00:00 +0000</pubDate>
      <guid>http://LieberInstitute.github.io/rstatsclub/2020/10/20/using-space-ranger-at-jhpce/</guid>
      <description>
&lt;link href=&#34;http://LieberInstitute.github.io/rstatsclub/rstatsclub/rmarkdown-libs/anchor-sections/anchor-sections.css&#34; rel=&#34;stylesheet&#34; /&gt;
&lt;script src=&#34;http://LieberInstitute.github.io/rstatsclub/rstatsclub/rmarkdown-libs/anchor-sections/anchor-sections.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;By &lt;a href=&#34;https://github.com/Nick-Eagles&#34;&gt;Nick Eagles&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;As part of recent LIBD work with spatial gene expression, I was recommended the tool &lt;a href=&#34;https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/what-is-space-ranger&#34;&gt;&lt;strong&gt;Space Ranger&lt;/strong&gt;&lt;/a&gt;, which provides software pipelines that walk Visium spatial RNA-seq samples through the steps we ultimately need to explore gene expression coupled with spatial information. In this blog post, I’ll explain how to start using Space Ranger at &lt;a href=&#34;https://jhpce.jhu.edu/&#34;&gt;&lt;strong&gt;JHPCE&lt;/strong&gt;&lt;/a&gt;, focusing heavily on the set-up details relevant to this cluster in particular.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2020-10-20-using-space-ranger-at-jhpce_files/space_ranger_icon.png&#34; width=&#34;250&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.teepublic.com/en-au/sticker/4448011-space-ranger&#34;&gt;Image source&lt;/a&gt;&lt;/p&gt;
&lt;div id=&#34;what-is-space-ranger&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What is Space Ranger&lt;/h2&gt;
&lt;p&gt;In practice, there are a fairly large number of computational steps we’d need to perform to produce spatial information about gene expression for a multiple-sample experiment, given just microscope images and Visium RNA-seq output. To start, we’d want our data in FASTQ format; then we’d have to worry about aligning reads to a reference genome, producing gene counts, normalizing data, and so on. Thankfully, Space Ranger bundles together these steps into three simple utilities. We won’t focus too much on how to use these individual utilities or the various features of Space Ranger, documented in detail &lt;a href=&#34;https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/what-is-space-ranger&#34;&gt;here&lt;/a&gt;; rather, this blog post will describe how to get Space Ranger up and running at the JHPCE cluster.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;using-the-spaceranger-module-at-jhpce&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using the &lt;code&gt;spaceranger&lt;/code&gt; module at JHPCE&lt;/h2&gt;
&lt;p&gt;We make regular use of &lt;a href=&#34;https://jhpce.jhu.edu/knowledge-base/environment-modules/&#34;&gt;lmod environment modules at JHPCE&lt;/a&gt;, as a means of loading and running software without worrying about user set-up differences, manually modifying your PATH, or other nasty considerations. While some sets of modules are available system-wide (for any user), others are not accessible unless you specifically “use” them. To make LIBD-specific modules like &lt;code&gt;spaceranger&lt;/code&gt; available, you must “use” the set of modules explicitly:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;module use /jhpce/shared/jhpce/modulefiles/libd&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you want to avoid typing this every time you want to use an LIBD module, consider the &lt;code&gt;.bashrc&lt;/code&gt; trick described &lt;a href=&#34;https://github.com/LieberInstitute/jhpce_module_config#recurrent-usage&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
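&lt;p&gt;A minimal sketch of that trick (assuming &lt;code&gt;bash&lt;/code&gt; is your login shell at JHPCE) is to append the &lt;code&gt;module use&lt;/code&gt; line to your &lt;code&gt;~/.bashrc&lt;/code&gt; once, so LIBD modules are available in every new session:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;#  Run once: make LIBD modules available automatically at login
echo &amp;quot;module use /jhpce/shared/jhpce/modulefiles/libd&amp;quot; &amp;gt;&amp;gt; ~/.bashrc&lt;/code&gt;&lt;/pre&gt;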
&lt;p&gt;Next, let’s load the &lt;code&gt;spaceranger&lt;/code&gt; module in particular.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;module load spaceranger&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: the above code loads the default version of the &lt;code&gt;spaceranger&lt;/code&gt; module currently available. You can see which versions are available with:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;module avail spaceranger

# Example output may look like this: 
##-------------------------- /jhpce/shared/jhpce/modulefiles/libd ---------------------------
##   spaceranger/1.1.0
##

# You may also load a specific version of the module:
module load spaceranger/1.1.0&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;first-script&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;First script&lt;/h2&gt;
&lt;p&gt;Next, let’s run a test of the Space Ranger software on example data they provide. We will write a bash script to load the &lt;code&gt;spaceranger&lt;/code&gt; module as above, and call the executable. We could easily have &lt;code&gt;qrsh&lt;/code&gt;’d into a compute node and run the few lines of code interactively, but I recommend writing a bash script, which we will &lt;code&gt;qsub&lt;/code&gt;, for a few reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A script documents the code you have run, allowing others to see and reproduce the work you’ve done.&lt;/li&gt;
&lt;li&gt;When we &lt;code&gt;qsub&lt;/code&gt; the script, we include arguments regarding memory and other hardware resources, which you otherwise would have to remember or estimate each time you interactively run this or similar code.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;qsub&lt;/code&gt; allows long-running code to continue without having to worry about keeping your session running and network-connected. This example won’t take long to run, but Space Ranger on real experiments likely will.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let’s start by writing the “skeleton” of our script, including only the basic required code before worrying about memory, logging, or other more complicated issues. Note that this will create a directory called “tiny” with the example outputs in the current working directory. I’m opening a new file I’ll call &lt;code&gt;spaceranger_test.sh&lt;/code&gt;, and the contents should look something like this:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;#  Make LIBD modules available, and load the &amp;quot;spaceranger&amp;quot; module
module use /jhpce/shared/jhpce/modulefiles/libd
module load spaceranger

#  Test Space Ranger on already-installed example data
spaceranger testrun --id=tiny&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you &lt;code&gt;qsub&lt;/code&gt; this script as-is, it will produce two log files in your home directory, containing verbose and somewhat cryptic errors. We’d prefer a single clearly-named log file written to the same directory as our bash script, and of course to fix the source of the Space Ranger error. In this case, we simply need to provide more memory to fix the main error.&lt;/p&gt;
&lt;p&gt;Below, we flesh out &lt;code&gt;spaceranger_test.sh&lt;/code&gt; with arguments to &lt;code&gt;qsub&lt;/code&gt; which will improve logging and provide sufficient memory. These arguments are indicated by lines beginning with &lt;code&gt;#$&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;#  Specify memory and other details below. In order:
#    &amp;quot;-cwd&amp;quot;: write the log file to the current working directory
#    &amp;quot;-o&amp;quot; and &amp;quot;-e&amp;quot;: point &amp;#39;STDOUT&amp;#39; and &amp;#39;STDERR&amp;#39; to the same file, combining all messages into one log
#    &amp;quot;-l mem_free=20G,h_vmem=20G&amp;quot;: request a node with 20G of memory free, and cap usage at 20G

#$ -cwd
#$ -o spaceranger_test.txt
#$ -e spaceranger_test.txt
#$ -l mem_free=20G,h_vmem=20G

#  Make LIBD modules available, and load the &amp;quot;spaceranger&amp;quot; module
module use /jhpce/shared/jhpce/modulefiles/libd
module load spaceranger

#  Test Space Ranger on already-installed example data
spaceranger testrun --id=tiny&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, we can actually submit the script and wait for the job to complete.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;qsub spaceranger_test.sh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you open &lt;code&gt;spaceranger_test.txt&lt;/code&gt; after the job completes, you should see that the test was successful. However, there is a worrying warning suggesting that Space Ranger is not properly made aware of the memory to which it should have access:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;Martian Runtime - v4.0.0
2020-10-19 15:48:59 [jobmngr] WARNING: configured to use 453GB of local memory, but only 331.3GB is currently available.
2020-10-19 15:48:59 [jobmngr] WARNING: The current virtual address space size
                              limit is too low.
    Limiting virtual address space size interferes with the operation of many
    common libraries and programs, and is not recommended.
    Contact your system administrator to remove this limit.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Rather than using 20GB of memory, Space Ranger believes it has a whopping 453GB of memory to work with, though only ~331GB are actually free. In the next section we will communicate memory and even CPU constraints to Space Ranger with arguments to the &lt;code&gt;spaceranger&lt;/code&gt; command.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;exploring-memory-and-parallelization-options&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Exploring memory and parallelization options&lt;/h2&gt;
&lt;p&gt;Below, we will construct another bash script to submit with &lt;code&gt;qsub&lt;/code&gt;, demonstrating how to properly specify memory and number of CPUs for a hypothetical dataset. Suppose we have an experiment with multiple FASTQ files and a microscope slide image. We would like to call the &lt;code&gt;spaceranger count&lt;/code&gt; command on this input data, making use of parallelization for speed. Let’s use 5 CPU cores and a total of 60GB of memory. Following the documentation &lt;a href=&#34;https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/advanced/job-submission-mode&#34;&gt;here&lt;/a&gt;, we can create the template script we’ll call &lt;code&gt;SR_count_example.sh&lt;/code&gt;, appropriate for running at JHPCE:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;# Specify memory and other details. Note that &amp;#39;mem_free&amp;#39; and &amp;#39;h_vmem&amp;#39; specify
# per-core memory (12G * 5 cores = 60GB total, as we want), as indicated here:
# https://jhpce.jhu.edu/knowledge-base/how-to/#multicore

#$ -cwd
#$ -o SR_count_example.txt
#$ -e SR_count_example.txt
#$ -l mem_free=12G,h_vmem=12G
#$ -pe local 5

#  Make LIBD modules available, and load the &amp;quot;spaceranger&amp;quot; module
module use /jhpce/shared/jhpce/modulefiles/libd
module load spaceranger

#  The main Space Ranger command. Note the meaning of the last three flags:
#    &amp;quot;--jobmode=local&amp;quot;: we will use one &amp;quot;node&amp;quot; of the cluster, which has many cores available
#    &amp;quot;--localcores=5&amp;quot;: we requested 5 cores at the top
#    &amp;quot;--localmem=54&amp;quot;: 60GB * 0.9 = 54GB; using 90% of total memory requested is recommended
spaceranger count \
    --id=&amp;lt;SOME RUN ID HERE&amp;gt; \
    --fastqs &amp;lt;LIST OF FASTQ PATHS HERE&amp;gt; \
    --image &amp;lt;IMAGE PATH HERE&amp;gt; \
    --jobmode=local \
    --localcores=5 \
    --localmem=54&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In practice, you’d specify an &lt;code&gt;--id&lt;/code&gt;, the FASTQ paths for &lt;code&gt;--fastqs&lt;/code&gt;, and the microscope image for &lt;code&gt;--image&lt;/code&gt; in the above script for your experiment. Then simply submit the script as a job!&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;qsub SR_count_example.sh&lt;/code&gt;&lt;/pre&gt;
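&lt;p&gt;As a sanity check on the memory arithmetic above (per-core memory times cores for the total, then roughly 90% of the total for &lt;code&gt;--localmem&lt;/code&gt;), here is a minimal R sketch; the 60GB and 5-core figures are the ones used in the script:&lt;/p&gt;

```r
# Sanity-check the job memory arithmetic used in SR_count_example.sh
total_gb = 60                        # total memory we want for the job
cores = 5                            # number of CPU cores requested via -pe local
per_core_gb = total_gb / cores       # value for mem_free / h_vmem: 12 (i.e. 12G)
localmem_gb = floor(total_gb * 0.9)  # ~90% of the total, for --localmem: 54
c(per_core_gb, localmem_gb)
# [1] 12 54
```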
&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: you might also be interested in &lt;em&gt;&lt;a href=&#34;https://github.com/LieberInstitute/sgejobs&#34;&gt;sgejobs&lt;/a&gt;&lt;/em&gt;, which we explored in a LIBD rstats club session; you can use it to create SGE &lt;code&gt;bash&lt;/code&gt; scripts.&lt;/p&gt;
&lt;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/mw5aQFX12wQ&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&#34; allowfullscreen&gt;
&lt;/iframe&gt;
&lt;div id=&#34;acknowledgments&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Acknowledgments&lt;/h3&gt;
&lt;p&gt;This blog post was made possible thanks to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Xie_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/rstudio/blogdown&#39;&gt;Xie, Hill, and Thomas, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;knitcitations&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Boettiger_2019&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=knitcitations&#39;&gt;Boettiger, 2019&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=sessioninfo&#34;&gt;sessioninfo&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Csardi_2018&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=sessioninfo&#39;&gt;Csárdi, core, Wickham, Chang, et al., 2018&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Boettiger_2019&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Boettiger_2019&#34;&gt;[1]&lt;/a&gt;&lt;cite&gt;
C. Boettiger.
&lt;em&gt;knitcitations: Citations for ‘Knitr’ Markdown Files&lt;/em&gt;.
R package version 1.0.10.
2019.
URL: &lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;https://CRAN.R-project.org/package=knitcitations&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Csardi_2018&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Csardi_2018&#34;&gt;[2]&lt;/a&gt;&lt;cite&gt;
G. Csárdi, R. core, H. Wickham, W. Chang, et al.
&lt;em&gt;sessioninfo: R Session Information&lt;/em&gt;.
R package version 1.1.1.
2018.
URL: &lt;a href=&#34;https://CRAN.R-project.org/package=sessioninfo&#34;&gt;https://CRAN.R-project.org/package=sessioninfo&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Xie_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Xie_2017&#34;&gt;[3]&lt;/a&gt;&lt;cite&gt;
Y. Xie, A. P. Hill, and A. Thomas.
&lt;em&gt;blogdown: Creating Websites with R Markdown&lt;/em&gt;.
ISBN 978-0815363729.
Boca Raton, Florida: Chapman and Hall/CRC, 2017.
URL: &lt;a href=&#34;https://github.com/rstudio/blogdown&#34;&gt;https://github.com/rstudio/blogdown&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.2 (2020-06-22)
##  os       macOS Catalina 10.15.7      
##  system   x86_64, darwin17.0          
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       America/New_York            
##  date     2020-10-21                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package       * version date       lib source                            
##  assertthat      0.2.1   2019-03-21 [1] CRAN (R 4.0.0)                    
##  bibtex          0.4.2.3 2020-09-19 [1] CRAN (R 4.0.2)                    
##  BiocManager     1.30.10 2019-11-16 [1] CRAN (R 4.0.0)                    
##  BiocStyle     * 2.17.1  2020-09-24 [1] Bioconductor                      
##  blogdown        0.21.19 2020-10-21 [1] Github (rstudio/blogdown@1a7ad52) 
##  bookdown        0.21    2020-10-13 [1] CRAN (R 4.0.2)                    
##  cli             2.1.0   2020-10-12 [1] CRAN (R 4.0.2)                    
##  colorout      * 1.2-2   2020-05-18 [1] Github (jalvesaq/colorout@726d681)
##  crayon          1.3.4   2017-09-16 [1] CRAN (R 4.0.0)                    
##  digest          0.6.26  2020-10-17 [1] CRAN (R 4.0.2)                    
##  evaluate        0.14    2019-05-28 [1] CRAN (R 4.0.0)                    
##  fansi           0.4.1   2020-01-08 [1] CRAN (R 4.0.0)                    
##  generics        0.0.2   2018-11-29 [1] CRAN (R 4.0.0)                    
##  glue            1.4.2   2020-08-27 [1] CRAN (R 4.0.2)                    
##  htmltools       0.5.0   2020-06-16 [1] CRAN (R 4.0.2)                    
##  httr            1.4.2   2020-07-20 [1] CRAN (R 4.0.2)                    
##  jsonlite        1.7.1   2020-09-07 [1] CRAN (R 4.0.2)                    
##  knitcitations * 1.0.10  2019-09-15 [1] CRAN (R 4.0.0)                    
##  knitr           1.30    2020-09-22 [1] CRAN (R 4.0.2)                    
##  lubridate       1.7.9   2020-06-08 [1] CRAN (R 4.0.2)                    
##  magrittr        1.5     2014-11-22 [1] CRAN (R 4.0.0)                    
##  plyr            1.8.6   2020-03-03 [1] CRAN (R 4.0.0)                    
##  R6              2.4.1   2019-11-12 [1] CRAN (R 4.0.0)                    
##  Rcpp            1.0.5   2020-07-06 [1] CRAN (R 4.0.2)                    
##  RefManageR      1.2.12  2019-04-03 [1] CRAN (R 4.0.0)                    
##  rlang           0.4.8   2020-10-08 [1] CRAN (R 4.0.2)                    
##  rmarkdown       2.5     2020-10-21 [1] CRAN (R 4.0.2)                    
##  sessioninfo   * 1.1.1   2018-11-05 [1] CRAN (R 4.0.0)                    
##  stringi         1.5.3   2020-09-09 [1] CRAN (R 4.0.2)                    
##  stringr         1.4.0   2019-02-10 [1] CRAN (R 4.0.0)                    
##  withr           2.3.0   2020-09-22 [1] CRAN (R 4.0.2)                    
##  xfun            0.18    2020-09-29 [1] CRAN (R 4.0.2)                    
##  xml2            1.3.2   2020-04-23 [1] CRAN (R 4.0.0)                    
##  yaml            2.2.1   2020-02-01 [1] CRAN (R 4.0.0)                    
## 
## [1] /Library/Frameworks/R.framework/Versions/4.0branch/Resources/library&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>R 101</title>
      <link>http://LieberInstitute.github.io/rstatsclub/2018/12/24/r_101/</link>
      <pubDate>Mon, 24 Dec 2018 00:00:00 +0000</pubDate>
      <guid>http://LieberInstitute.github.io/rstatsclub/2018/12/24/r_101/</guid>
      <description>


&lt;div id=&#34;happy-holidays&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;HAPPY HOLIDAYS!!!🎉⛄🎆🍾❄&lt;/h3&gt;
&lt;p&gt;In the spirit of the coming new year and new beginnings, we created a tutorial for getting started or restarted with R. If you are new to R or have dabbled in R but haven’t used it much recently, then this post is for you. We will focus on data classes and types, as well as data wrangling, and we will provide basic statistics and basic plotting examples using real data. Enjoy!&lt;/p&gt;
&lt;p&gt;By &lt;a href=&#34;https://carriewright11.github.io&#34;&gt;C.Wright&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;As with most programming tutorials, let’s start with a good ol’ “Hello World”.&lt;/p&gt;
&lt;div id=&#34;first-command&#34; class=&#34;section level6&#34;&gt;
&lt;h6&gt;1) First Command&lt;/h6&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;print(&amp;quot;Hello World&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;Hello World&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;install-and-load-packages-and-data&#34; class=&#34;section level6&#34;&gt;
&lt;h6&gt;2) Install and Load Packages and Data&lt;/h6&gt;
&lt;p&gt;Now we need some data. Packages are collections of functions and/or data. There are published packages from the community that you can use, such as the two below, or you can make your own package for private use.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;install.packages(&amp;quot;babynames&amp;quot;)  
install.packages(&amp;quot;titanic&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that we have installed the packages, we need to load them.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;babynames&amp;quot;)
library(&amp;quot;titanic&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each installation of R comes with quite a bit of data! Now we want to load the “quakes” dataset - there are lots of other options.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data(&amp;quot;quakes&amp;quot;)
data() #this will list all of the datasets available&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;assigning-objects&#34; class=&#34;section level6&#34;&gt;
&lt;h6&gt;3) Assigning Objects&lt;/h6&gt;
&lt;p&gt;Objects can be many different things, ranging from a simple number to a giant matrix; in general, they are the things that you can manipulate in R.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;myString &amp;lt;- &amp;quot;Hello World&amp;quot; #notice how we need &amp;quot;&amp;quot; around words, aka strings
myString #take a look at myString&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;Hello World&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;A &amp;lt;- 14 #now we do not need &amp;quot;&amp;quot; around numbers
A #take a look at A&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 14&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;A = 5 #can also use the equal sign to assign objects
A #notice how A has changed&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 5&lt;/code&gt;&lt;/pre&gt;
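&lt;p&gt;Once assigned, objects can be used directly in computations. A small sketch, reusing the value of &lt;code&gt;A&lt;/code&gt; from above:&lt;/p&gt;

```r
A = 5      # the value we assigned above
B = A * 2  # objects can be combined in expressions to make new objects
B
# [1] 10
```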
&lt;/div&gt;
&lt;div id=&#34;assigning-objects-with-multiple-elements&#34; class=&#34;section level6&#34;&gt;
&lt;h6&gt;4) Assigning Objects with Multiple Elements&lt;/h6&gt;
&lt;p&gt;Now let’s assign a more complex object.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;height &amp;lt;- c(5.5, 4.5, 5, 5.6, 5.8, 5.2, 6, 6.2, 5.9, 5.8, 6, 5.9) #this is called a vector
colors_to_use &amp;lt;- c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)# a vector of strings&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;classes&#34; class=&#34;section level6&#34;&gt;
&lt;h6&gt;5) Classes&lt;/h6&gt;
&lt;p&gt;There are a variety of object classes. We can use the function class() to tell us what class an object belongs to.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;class(height) #this is a numeric vector&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;numeric&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;class(colors_to_use) #this is a character vector&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;character&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heightdf&amp;lt;-data.frame(height, gender =c(&amp;quot;F&amp;quot;, &amp;quot;F&amp;quot;, &amp;quot;F&amp;quot;, &amp;quot;F&amp;quot;, &amp;quot;F&amp;quot;, &amp;quot;F&amp;quot;, &amp;quot;M&amp;quot;, &amp;quot;M&amp;quot;, &amp;quot;M&amp;quot;, &amp;quot;M&amp;quot;, &amp;quot;M&amp;quot;, &amp;quot;M&amp;quot;))
heightdf #take a look at the dataframe&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    height gender
## 1     5.5      F
## 2     4.5      F
## 3     5.0      F
## 4     5.6      F
## 5     5.8      F
## 6     5.2      F
## 7     6.0      M
## 8     6.2      M
## 9     5.9      M
## 10    5.8      M
## 11    6.0      M
## 12    5.9      M&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;class(heightdf) #check the class&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;data.frame&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heightdf$height # we can refer to individual columns based on the column name&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] 5.5 4.5 5.0 5.6 5.8 5.2 6.0 6.2 5.9 5.8 6.0 5.9&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;class(heightdf$gender) #here we see a factor (a categorical variable, stored in R with integer levels)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;factor&amp;quot;&lt;/code&gt;&lt;/pre&gt;
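&lt;p&gt;To see what “stored with integer levels” means, here is a small standalone sketch of a factor:&lt;/p&gt;

```r
# A factor stores the unique categories (levels) plus integer codes
gender = factor(c("F", "F", "M", "M"))
levels(gender)      # the unique categories
# [1] "F" "M"
as.integer(gender)  # the underlying integer codes
# [1] 1 1 2 2
```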
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;logical_variable&amp;lt;-height == heightdf$height #this shows that all the elements in the height column of the heightdf dataframe are equivalent to those of the height vector
class(logical_variable)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;logical&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;matrix_variable &amp;lt;- matrix(height, nrow = 2, ncol = 3)#now we will make a matrix
matrix_variable #take a look at the matrix&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##      [,1] [,2] [,3]
## [1,]  5.5  5.0  5.8
## [2,]  4.5  5.6  5.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;class(matrix_variable)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;matrix&amp;quot;&lt;/code&gt;&lt;/pre&gt;
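&lt;p&gt;Objects can also be converted (coerced) from one class to another; a quick sketch:&lt;/p&gt;

```r
x = c("1", "2", "3")  # a character vector
class(x)
# [1] "character"
y = as.numeric(x)     # coerce the strings to numbers
class(y)
# [1] "numeric"
sum(y)                # now we can do math with them
# [1] 6
```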
&lt;/div&gt;
&lt;div id=&#34;subsetting-data&#34; class=&#34;section level6&#34;&gt;
&lt;h6&gt;6) Subsetting Data&lt;/h6&gt;
&lt;p&gt;Now that we can assign or instantiate objects, let’s try to look at or manipulate specific parts of more complex objects.&lt;/p&gt;
&lt;p&gt;Let’s create an object of male heights by grabbing rows from heightdf.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;maleIndex&amp;lt;-which(heightdf$gender == &amp;quot;M&amp;quot;) #lets try subsetting just the male data out of the heightdf - first we need to determine which rows of the dataframe are male
maleIndex # this is a vector of the matching row indices&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1]  7  8  9 10 11 12&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heightmale&amp;lt;-heightdf[maleIndex,] #now we will use the brackets to grab these rows - we use the comma to indicate that we want rows not columns
heightmale # now this is just the males&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    height gender
## 7     6.0      M
## 8     6.2      M
## 9     5.9      M
## 10    5.8      M
## 11    6.0      M
## 12    5.9      M&lt;/code&gt;&lt;/pre&gt;
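&lt;p&gt;The same subset can also be grabbed in one step with a logical index, skipping &lt;code&gt;which()&lt;/code&gt;; a self-contained sketch that rebuilds the example data frame:&lt;/p&gt;

```r
# Rebuild the example data frame, then subset with a logical vector directly
height = c(5.5, 4.5, 5, 5.6, 5.8, 5.2, 6, 6.2, 5.9, 5.8, 6, 5.9)
heightdf = data.frame(height, gender = c(rep("F", 6), rep("M", 6)))
heightmale = heightdf[heightdf$gender == "M", ]  # rows where the test is TRUE are kept
nrow(heightmale)
# [1] 6
```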
&lt;p&gt;Here is another way using a package called dplyr:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;install.packages(&amp;quot;dplyr&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here we are creating an object of height data for males 6 feet or taller.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr) #load a useful package for subsetting data&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Attaching package: &amp;#39;dplyr&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## The following objects are masked from &amp;#39;package:stats&amp;#39;:
## 
##     filter, lag&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## The following objects are masked from &amp;#39;package:base&amp;#39;:
## 
##     intersect, setdiff, setequal, union&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#heightmale_over6feet &amp;lt;- dplyr::filter(heightdf, gender == &amp;quot;M&amp;quot; &amp;amp; height &amp;gt;= 6) #the dplyr version
heightmale_over6feet &amp;lt;- subset(heightdf, gender == &amp;quot;M&amp;quot; &amp;amp; height &amp;gt;= 6) #base R works too - we use column names to describe what we want to pull out of our data

heightmale_over6feet#now we just have the males 6 feet or over&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    height gender
## 7     6.0      M
## 8     6.2      M
## 11    6.0      M&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s create an object by grabbing part of an object based on its columns.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gender1&amp;lt;-heightdf[2]#notice how here we use the brackets but no comma
gender1&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    gender
## 1       F
## 2       F
## 3       F
## 4       F
## 5       F
## 6       F
## 7       M
## 8       M
## 9       M
## 10      M
## 11      M
## 12      M&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gender2 &amp;lt;- heightdf$gender #this grabs the same column, but notice that this way we lose the data structure - it is no longer a dataframe
gender2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] F F F F F F M M M M M M
## Levels: F M&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gender2&amp;lt;-data.frame(gender =heightdf$gender)# this however stays as a dataframe
gender2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    gender
## 1       F
## 2       F
## 3       F
## 4       F
## 5       F
## 6       F
## 7       M
## 8       M
## 9       M
## 10      M
## 11      M
## 12      M&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;genderindex &amp;lt;- which(colnames(heightdf) == &amp;quot;gender&amp;quot;) #now we will use which() to find the index of the column named gender
genderindex&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gender3 &amp;lt;-heightdf[genderindex]#now we will use the brackets to grab just this column
gender3&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    gender
## 1       F
## 2       F
## 3       F
## 4       F
## 5       F
## 6       F
## 7       M
## 8       M
## 9       M
## 10      M
## 11      M
## 12      M&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;identical(gender1, gender2) #let&amp;#39;s see if they are identical - this is a helpful function, but it can only compare two variables at a time&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gender1==gender2 # are they the same? should say true if they are&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       gender
##  [1,]   TRUE
##  [2,]   TRUE
##  [3,]   TRUE
##  [4,]   TRUE
##  [5,]   TRUE
##  [6,]   TRUE
##  [7,]   TRUE
##  [8,]   TRUE
##  [9,]   TRUE
## [10,]   TRUE
## [11,]   TRUE
## [12,]   TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gender2==gender3 # are they the same? should say true if they are&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       gender
##  [1,]   TRUE
##  [2,]   TRUE
##  [3,]   TRUE
##  [4,]   TRUE
##  [5,]   TRUE
##  [6,]   TRUE
##  [7,]   TRUE
##  [8,]   TRUE
##  [9,]   TRUE
## [10,]   TRUE
## [11,]   TRUE
## [12,]   TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s try to look at/grab specific values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;height2 &amp;lt;- c(6, 5.5, 6, 6, 6, 6, 4.3) #6 and 5.5 are in our original height vector but not 4.3
which(height %in% height2) # which of our original heights are also found in height2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1]  1  7 11&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heightdf[which(height %in% height2),] # here we skipped making another variable for the index&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    height gender
## 1     5.5      F
## 7     6.0      M
## 11    6.0      M&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#we can also use a function called grep
wanted_heights_index&amp;lt;-grep(5.9, heightdf$height)
heightdf[wanted_heights_index,] #now we just have the samples who are 5.9&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    height gender
## 9     5.9      M
## 12    5.9      M&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#say we want to know the value of an element at a particular location
heightdf$height[2] #second value in the height column&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 4.5&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heightdf$height[1:3] # first three values in the height column&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 5.5 4.5 5.0&lt;/code&gt;&lt;/pre&gt;
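&lt;p&gt;Negative indices do the opposite - they drop elements; a tiny sketch:&lt;/p&gt;

```r
h = c(5.5, 4.5, 5.0)
h[-2]  # a negative index drops that element, keeping the rest
# [1] 5.5 5.0
```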
&lt;p&gt;You can also grab random data points.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sample(heightdf$height, 2) #takes a random sample of the specified number of elements from a vector&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 5.5 5.9&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sample.int(1000999, 2) #takes a random sample of integers from 1 to the first number specified. The number of random values to output is given by the second number.&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 161093 430020&lt;/code&gt;&lt;/pre&gt;
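&lt;p&gt;Random draws change every run; base R’s &lt;code&gt;set.seed()&lt;/code&gt; makes them reproducible. A quick sketch:&lt;/p&gt;

```r
set.seed(20181224)         # fix the random number generator state
draw1 = sample.int(100, 2)
set.seed(20181224)         # setting the same seed again...
draw2 = sample.int(100, 2)
identical(draw1, draw2)    # ...gives the same draw
# [1] TRUE
```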
&lt;/div&gt;
&lt;div id=&#34;plotting-data&#34; class=&#34;section level6&#34;&gt;
&lt;h6&gt;7) Plotting Data&lt;/h6&gt;
&lt;p&gt;Now let’s try plotting some data and perform some statistical tests.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;boxplot(heightdf$height~heightdf$gender)#simple boxplot&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-12-24-R_101_files/figure-html/unnamed-chunk-14-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#lets make a fancy boxplot
boxplot(heightdf$height~heightdf$gender, col = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;), ylab = &amp;quot;Height&amp;quot;, xlab = &amp;quot;Gender&amp;quot;, main = &amp;quot;Relationship of gender and height&amp;quot;, cex.lab =2, cex.main = 2, cex.axis = 1.3, par(mar=c(5, 5, 5, 5)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-12-24-R_101_files/figure-html/unnamed-chunk-14-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;hist(heightdf$height)#histogram&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-12-24-R_101_files/figure-html/unnamed-chunk-14-3.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heightdf$age &amp;lt;- c(20, 30, 15, 20, 40, 14, 35, 40, 17, 16, 25, 16) #adding another variable to a dataframe
plot(heightdf$height)#one variable&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-12-24-R_101_files/figure-html/unnamed-chunk-14-4.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(y=heightdf$height, x=heightdf$age)#scatterplot of 2 variables
height_age&amp;lt;-lm(heightdf$height~heightdf$age)#perform a regression on the data - evaluate height and age relationship
summary(height_age)#shows the stats results from the regression&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## lm(formula = heightdf$height ~ heightdf$age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1999 -0.1154  0.1348  0.3634  0.3943 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(&amp;gt;|t|)    
## (Intercept)   5.28384    0.39367  13.422 1.01e-07 ***
## heightdf$age  0.01387    0.01527   0.908    0.385    
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## Residual standard error: 0.4973 on 10 degrees of freedom
## Multiple R-squared:  0.07616,    Adjusted R-squared:  -0.01622 
## F-statistic: 0.8244 on 1 and 10 DF,  p-value: 0.3853&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;abline(height_age)#add regression line to plot&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-12-24-R_101_files/figure-html/unnamed-chunk-14-5.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cor.test(y=heightdf$height, x=heightdf$age)#shows the same p value when performing a correlation test&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##  Pearson&amp;#39;s product-moment correlation
## 
## data:  heightdf$age and heightdf$height
## t = 0.90797, df = 10, p-value = 0.3853
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3539944  0.7336745
## sample estimates:
##       cor 
## 0.2759734&lt;/code&gt;&lt;/pre&gt;
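&lt;p&gt;The numbers that &lt;code&gt;summary()&lt;/code&gt; and &lt;code&gt;cor.test()&lt;/code&gt; print can also be extracted programmatically, which is handy in scripts. A self-contained sketch refitting the same regression with the data above:&lt;/p&gt;

```r
# Rebuild the example data and refit the height ~ age regression
height = c(5.5, 4.5, 5, 5.6, 5.8, 5.2, 6, 6.2, 5.9, 5.8, 6, 5.9)
age = c(20, 30, 15, 20, 40, 14, 35, 40, 17, 16, 25, 16)
fit = lm(height ~ age)
coef(fit)    # intercept and slope, the same estimates summary() prints
ct = cor.test(height, age)
ct$p.value   # the p-value as a number, instead of printed output
```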
&lt;/div&gt;
&lt;div id=&#34;more-statistical-tests&#34; class=&#34;section level6&#34;&gt;
&lt;h6&gt;8) More Statistical Tests&lt;/h6&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;t.test(heightdf$height~heightdf$gender)#try a t test between male height and female height - this is significant!&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##  Welch Two Sample t-test
## 
## data:  heightdf$height by heightdf$gender
## t = -3.4903, df = 5.8325, p-value = 0.01359
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.194177 -0.205823
## sample estimates:
## mean in group F mean in group M 
##        5.266667        5.966667&lt;/code&gt;&lt;/pre&gt;
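&lt;p&gt;You can double-check the group means reported by the t-test with &lt;code&gt;tapply()&lt;/code&gt;; a quick self-contained sketch:&lt;/p&gt;

```r
# Compute the mean height within each gender group
height = c(5.5, 4.5, 5, 5.6, 5.8, 5.2, 6, 6.2, 5.9, 5.8, 6, 5.9)
gender = c(rep("F", 6), rep("M", 6))
means = tapply(height, gender, mean)
round(means, 6)  # matches the group means in the t-test output
#        F        M 
# 5.266667 5.966667
```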
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#if p&amp;lt;0.05 it is generally considered significant
fit &amp;lt;-aov(heightdf$height~heightdf$gender + heightdf$age)#now lets perform an anova or multiple regression
summary(fit)# here are the results&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                 Df Sum Sq Mean Sq F value  Pr(&amp;gt;F)   
## heightdf$gender  1 1.4700  1.4700  12.167 0.00685 **
## heightdf$age     1 0.1193  0.1193   0.987 0.34638   
## Residuals        9 1.0874  0.1208                   
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;anova(fit)# same results&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Analysis of Variance Table
## 
## Response: heightdf$height
##                 Df  Sum Sq Mean Sq F value   Pr(&amp;gt;F)   
## heightdf$gender  1 1.47000 1.47000 12.1668 0.006851 **
## heightdf$age     1 0.11928 0.11928  0.9872 0.346383   
## Residuals        9 1.08739 0.12082                    
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit &amp;lt;-lm(heightdf$height~heightdf$gender + heightdf$age)# performing as multiple regression
summary(fit) #gives the same result as above - this is an anova but the results are presented differently&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## lm(formula = heightdf$height ~ heightdf$gender + heightdf$age)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83944 -0.07318  0.02918  0.12062  0.36706 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(&amp;gt;|t|)    
## (Intercept)       5.01995    0.28600  17.552 2.86e-08 ***
## heightdf$genderM  0.68225    0.20148   3.386  0.00805 ** 
## heightdf$age      0.01065    0.01072   0.994  0.34638    
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## Residual standard error: 0.3476 on 9 degrees of freedom
## Multiple R-squared:  0.5938, Adjusted R-squared:  0.5035 
## F-statistic: 6.577 on 2 and 9 DF,  p-value: 0.01736&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;anova(fit)#also gives the same result&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Analysis of Variance Table
## 
## Response: heightdf$height
##                 Df  Sum Sq Mean Sq F value   Pr(&amp;gt;F)   
## heightdf$gender  1 1.47000 1.47000 12.1668 0.006851 **
## heightdf$age     1 0.11928 0.11928  0.9872 0.346383   
## Residuals        9 1.08739 0.12082                    
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s do a more classic anova - using a categorical variable with more than two categories.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heightdf$country &amp;lt;-c(&amp;quot;British&amp;quot;, &amp;quot;French&amp;quot;, &amp;quot;British&amp;quot;, &amp;quot;Dutch&amp;quot;, &amp;quot;Dutch&amp;quot;, &amp;quot;French&amp;quot;, &amp;quot;Dutch&amp;quot;, &amp;quot;Dutch&amp;quot;, &amp;quot;British&amp;quot;, &amp;quot;French&amp;quot;, &amp;quot;British&amp;quot;, &amp;quot;French&amp;quot;)
fit &amp;lt;-aov(heightdf$height~heightdf$gender + heightdf$age + heightdf$country)
summary(fit)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                  Df Sum Sq Mean Sq F value  Pr(&amp;gt;F)   
## heightdf$gender   1 1.4700  1.4700  19.157 0.00325 **
## heightdf$age      1 0.1193  0.1193   1.554 0.25258   
## heightdf$country  2 0.5503  0.2751   3.586 0.08471 . 
## Residuals         7 0.5371  0.0767                   
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit &amp;lt;-aov(heightdf$height~ heightdf$country)
summary(fit)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                  Df Sum Sq Mean Sq F value Pr(&amp;gt;F)
## heightdf$country  2 0.6067  0.3033   1.319  0.315
## Residuals         9 2.0700  0.2300&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;anova(fit)# we see the overall effect of country but not comparisons between individual countries&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Analysis of Variance Table
## 
## Response: heightdf$height
##                  Df  Sum Sq Mean Sq F value Pr(&amp;gt;F)
## heightdf$country  2 0.60667 0.30333  1.3188 0.3146
## Residuals         9 2.07000 0.23000&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;TukeyHSD(fit)# this is how we get the pairwise country comparisons - none are significant&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = heightdf$height ~ heightdf$country)
## 
## $`heightdf$country`
##                 diff        lwr       upr     p adj
## Dutch-British   0.30 -0.6468152 1.2468152 0.6627841
## French-British -0.25 -1.1968152 0.6968152 0.7484769
## French-Dutch   -0.55 -1.4968152 0.3968152 0.2860337&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit &amp;lt;-lm(heightdf$height~heightdf$gender + heightdf$age + heightdf$country)
summary(fit) #gives the same result as above - this is an anova just different output&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## lm(formula = heightdf$height ~ heightdf$gender + heightdf$age + 
##     heightdf$country)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.36474 -0.13454  0.03405  0.15333  0.33097 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(&amp;gt;|t|)    
## (Intercept)             5.46051    0.28225  19.346 2.46e-07 ***
## heightdf$genderM        0.71905    0.16131   4.458  0.00294 ** 
## heightdf$age           -0.01143    0.01263  -0.905  0.39547    
## heightdf$countryDutch   0.46574    0.26813   1.737  0.12596    
## heightdf$countryFrench -0.25286    0.19590  -1.291  0.23778    
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## Residual standard error: 0.277 on 7 degrees of freedom
## Multiple R-squared:  0.7993, Adjusted R-squared:  0.6847 
## F-statistic: 6.971 on 4 and 7 DF,  p-value: 0.01375&lt;/code&gt;&lt;/pre&gt;
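&lt;p&gt;The link between &lt;code&gt;aov()&lt;/code&gt; and &lt;code&gt;lm()&lt;/code&gt; can be checked directly: calling &lt;code&gt;anova()&lt;/code&gt; on an &lt;code&gt;lm()&lt;/code&gt; fit reproduces the sequential ANOVA table that &lt;code&gt;summary()&lt;/code&gt; reports for the matching &lt;code&gt;aov()&lt;/code&gt; fit. Here is a minimal sketch with made-up data (the variables below are hypothetical, not from &lt;code&gt;heightdf&lt;/code&gt;):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
d &amp;lt;- data.frame(y = rnorm(12), group = rep(c(&amp;quot;A&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;C&amp;quot;), each = 4))
anova(lm(y ~ group, data = d))    # sequential ANOVA table
summary(aov(y ~ group, data = d)) # same sums of squares, F value, and p-value&lt;/code&gt;&lt;/pre&gt;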
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;lets-use-some-real-data&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Let’s use some &lt;strong&gt;real&lt;/strong&gt; data!&lt;/h3&gt;
&lt;div id=&#34;baby-name-data&#34; class=&#34;section level6&#34;&gt;
&lt;h6&gt;Baby Name Data&lt;/h6&gt;
&lt;p&gt;This is a very fun package to check out. If you have ever wondered about the popularity of your name or of someone you know, you will find this very interesting. I even have some friends who have used it to help name their child.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#recall that we installed and loaded this data earlier
head(babynames)#this is a special data type called a tibble - it is basically a fancy dataframe&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 5
##    year sex   name          n   prop
##   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;     &amp;lt;int&amp;gt;  &amp;lt;dbl&amp;gt;
## 1 1880. F     Mary       7065 0.0724
## 2 1880. F     Anna       2604 0.0267
## 3 1880. F     Emma       2003 0.0205
## 4 1880. F     Elizabeth  1939 0.0199
## 5 1880. F     Minnie     1746 0.0179
## 6 1880. F     Margaret   1578 0.0162&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tail(babynames)# we can see the data goes up to 2015&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 5
##    year sex   name       n       prop
##   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;  &amp;lt;int&amp;gt;      &amp;lt;dbl&amp;gt;
## 1 2015. M     Zyah       5 0.00000247
## 2 2015. M     Zykell     5 0.00000247
## 3 2015. M     Zyking     5 0.00000247
## 4 2015. M     Zykir      5 0.00000247
## 5 2015. M     Zyrus      5 0.00000247
## 6 2015. M     Zyus       5 0.00000247&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#how many unique baby names are there?
length(unique(babynames$name))# that&amp;#39;s a lot of baby names!&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 95025&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# check to see if your name is included
grep(&amp;quot;Bob&amp;quot;, unique(babynames$name)) # looks like Bob is in there&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1]  1148  2502  4948  6510  6573  6999  9443 10761 13598 13794 18059
## [12] 18701 19278 19812 20116 20921 22002 23289 26242 27453 30231 34262
## [23] 35357 37057 37702 38171 41808 42382 43247 44135 44778 46568 50097&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Let&amp;#39;s look at the values
babynames$name [grep(&amp;quot;Bob&amp;quot;, unique(babynames$name))] # this is a vector so we don&amp;#39;t need to specify rows with a comma&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;Scott&amp;quot;     &amp;quot;Tessie&amp;quot;    &amp;quot;Sadye&amp;quot;     &amp;quot;Una&amp;quot;       &amp;quot;Philomena&amp;quot;
##  [6] &amp;quot;Belva&amp;quot;     &amp;quot;Rufus&amp;quot;     &amp;quot;Dovie&amp;quot;     &amp;quot;Janette&amp;quot;   &amp;quot;Mammie&amp;quot;   
## [11] &amp;quot;Melinda&amp;quot;   &amp;quot;Honor&amp;quot;     &amp;quot;Arch&amp;quot;      &amp;quot;Denis&amp;quot;     &amp;quot;Orrie&amp;quot;    
## [16] &amp;quot;Floyd&amp;quot;     &amp;quot;Al&amp;quot;        &amp;quot;Selina&amp;quot;    &amp;quot;Clora&amp;quot;     &amp;quot;Elvin&amp;quot;    
## [21] &amp;quot;Lafayette&amp;quot; &amp;quot;Lovie&amp;quot;     &amp;quot;Armilda&amp;quot;   &amp;quot;Nola&amp;quot;      &amp;quot;Icy&amp;quot;      
## [26] &amp;quot;Mahalia&amp;quot;   &amp;quot;Gordon&amp;quot;    &amp;quot;Seth&amp;quot;      &amp;quot;Claudia&amp;quot;   &amp;quot;Glada&amp;quot;    
## [31] &amp;quot;Floyd&amp;quot;     &amp;quot;Theodora&amp;quot;  &amp;quot;Vella&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# oops! that didn&amp;#39;t work. why? the indices come from unique(babynames$name), so we need to subset that same vector
unique(babynames$name)[grep(&amp;quot;Bob&amp;quot;, unique(babynames$name))] # here we go&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;Bob&amp;quot;        &amp;quot;Bobbie&amp;quot;     &amp;quot;Bobby&amp;quot;      &amp;quot;Bobie&amp;quot;      &amp;quot;Bobbye&amp;quot;    
##  [6] &amp;quot;Bobbe&amp;quot;      &amp;quot;Bobette&amp;quot;    &amp;quot;Bobetta&amp;quot;    &amp;quot;Boby&amp;quot;       &amp;quot;Bobbette&amp;quot;  
## [11] &amp;quot;Bobbi&amp;quot;      &amp;quot;Bobo&amp;quot;       &amp;quot;Bobra&amp;quot;      &amp;quot;Bobi&amp;quot;       &amp;quot;Bobbee&amp;quot;    
## [16] &amp;quot;Bobb&amp;quot;       &amp;quot;Bobbetta&amp;quot;   &amp;quot;Bobbyetta&amp;quot;  &amp;quot;Bobbijo&amp;quot;    &amp;quot;Bobbiejo&amp;quot;  
## [21] &amp;quot;Bobbyjo&amp;quot;    &amp;quot;Bobbiejean&amp;quot; &amp;quot;Bobbilynn&amp;quot;  &amp;quot;Boban&amp;quot;      &amp;quot;Bobijo&amp;quot;    
## [26] &amp;quot;Bobbyjoe&amp;quot;   &amp;quot;Bobak&amp;quot;      &amp;quot;Bobbilee&amp;quot;   &amp;quot;Bobbisue&amp;quot;   &amp;quot;Bobbiesue&amp;quot; 
## [31] &amp;quot;Boback&amp;quot;     &amp;quot;Bobbylee&amp;quot;   &amp;quot;Bobbielee&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# now we can see all the variations of Bob in the data
Bob &amp;lt;- subset(babynames, babynames$name == &amp;quot;Bob&amp;quot;)
# let&amp;#39;s see how much the name has been used in the past
plot(Bob$n ~ Bob$year) # Bob was popular but it isn&amp;#39;t anymore&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-12-24-R_101_files/figure-html/unnamed-chunk-17-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#what is the line of samples at the bottom?
plot(Bob$n~Bob$year, col= c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)[as.factor(Bob$sex)]) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-12-24-R_101_files/figure-html/unnamed-chunk-17-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# looks like most people named Bob were male; the line of points at the bottom represents females
# let&amp;#39;s try another name
Lori &amp;lt;- subset(babynames, babynames$name == &amp;quot;Lori&amp;quot;)
plot(Lori$n ~ Lori$year, col = c(&amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;)[as.factor(Lori$sex)])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-12-24-R_101_files/figure-html/unnamed-chunk-17-3.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Lori_M &amp;lt;- subset(babynames, name == &amp;quot;Lori&amp;quot; &amp;amp; sex == &amp;quot;M&amp;quot;) # let&amp;#39;s see exactly when some males were named Lori
head(Lori_M)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 5
##    year sex   name      n       prop
##   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt;      &amp;lt;dbl&amp;gt;
## 1 1954. M     Lori      5 0.00000242
## 2 1955. M     Lori      5 0.00000239
## 3 1956. M     Lori     14 0.00000653
## 4 1957. M     Lori     20 0.00000914
## 5 1958. M     Lori     25 0.0000116 
## 6 1959. M     Lori     29 0.0000134&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# let&amp;#39;s see how many names are recorded for each year in the data
table(babynames$year) # so there are 2000 names recorded for 1880 and 1935 for 1881&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##  1880  1881  1882  1883  1884  1885  1886  1887  1888  1889  1890  1891 
##  2000  1935  2127  2084  2297  2294  2392  2373  2651  2590  2695  2660 
##  1892  1893  1894  1895  1896  1897  1898  1899  1900  1901  1902  1903 
##  2921  2831  2941  3049  3091  3028  3264  3042  3731  3153  3362  3389 
##  1904  1905  1906  1907  1908  1909  1910  1911  1912  1913  1914  1915 
##  3561  3655  3633  3948  4018  4227  4629  4867  6351  6967  7964  9358 
##  1916  1917  1918  1919  1920  1921  1922  1923  1924  1925  1926  1927 
##  9696  9915 10401 10368 10756 10856 10757 10641 10869 10641 10460 10405 
##  1928  1929  1930  1931  1932  1933  1934  1935  1936  1937  1938  1939 
## 10159  9816  9788  9293  9383  9011  9181  9035  8893  8945  9030  8919 
##  1940  1941  1942  1943  1944  1945  1946  1947  1948  1949  1950  1951 
##  8960  9087  9424  9405  9153  9026  9702 10370 10237 10264 10309 10460 
##  1952  1953  1954  1955  1956  1957  1958  1959  1960  1961  1962  1963 
## 10654 10831 10963 11114 11339 11564 11521 11771 11924 12178 12206 12278 
##  1964  1965  1966  1967  1968  1969  1970  1971  1972  1973  1974  1975 
## 12394 11953 12148 12400 12930 13746 14782 15291 15414 15676 16243 16934 
##  1976  1977  1978  1979  1980  1981  1982  1983  1984  1985  1986  1987 
## 17395 18171 18224 19032 19439 19470 19680 19398 19501 20076 20642 21399 
##  1988  1989  1990  1991  1992  1993  1994  1995  1996  1997  1998  1999 
## 22360 23769 24715 25104 25421 25959 25998 26080 26420 26966 27894 28546 
##  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011 
## 29764 30264 30559 31179 32040 32539 34073 34941 35051 34689 34050 33880 
##  2012  2013  2014  2015 
## 33697 33229 33176 32952&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# OK, so more names were recorded in the more recent years
hist(babynames$year)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-12-24-R_101_files/figure-html/unnamed-chunk-17-4.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
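&lt;p&gt;As a quick follow-up, &lt;code&gt;which.max()&lt;/code&gt; makes it easy to find the year a name peaked. A small sketch using the &lt;code&gt;Bob&lt;/code&gt; subset created above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# the year with the largest count of babies named Bob
Bob$year[which.max(Bob$n)]
# and how many Bobs were born that year
max(Bob$n)&lt;/code&gt;&lt;/pre&gt;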
&lt;p&gt;Let’s look at some other data…&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;titanic-data&#34; class=&#34;section level6&#34;&gt;
&lt;h6&gt;Titanic Data&lt;/h6&gt;
&lt;p&gt;This package contains real data about the passengers who were aboard the Titanic. Note that base R also ships a built-in summary table named &lt;code&gt;Titanic&lt;/code&gt;, which we explore first.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;head(Titanic) # we can see that this may be an unusual data type&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1]  0  0 35  0  0  0&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;class(Titanic) # indeed this appears to be a table&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;table&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dim(Titanic)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 4 2 2 2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dimnames(Titanic)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## $Class
## [1] &amp;quot;1st&amp;quot;  &amp;quot;2nd&amp;quot;  &amp;quot;3rd&amp;quot;  &amp;quot;Crew&amp;quot;
## 
## $Sex
## [1] &amp;quot;Male&amp;quot;   &amp;quot;Female&amp;quot;
## 
## $Age
## [1] &amp;quot;Child&amp;quot; &amp;quot;Adult&amp;quot;
## 
## $Survived
## [1] &amp;quot;No&amp;quot;  &amp;quot;Yes&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;str(Titanic) # shows the structure of the data&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  table [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
##  - attr(*, &amp;quot;dimnames&amp;quot;)=List of 4
##   ..$ Class   : chr [1:4] &amp;quot;1st&amp;quot; &amp;quot;2nd&amp;quot; &amp;quot;3rd&amp;quot; &amp;quot;Crew&amp;quot;
##   ..$ Sex     : chr [1:2] &amp;quot;Male&amp;quot; &amp;quot;Female&amp;quot;
##   ..$ Age     : chr [1:2] &amp;quot;Child&amp;quot; &amp;quot;Adult&amp;quot;
##   ..$ Survived: chr [1:2] &amp;quot;No&amp;quot; &amp;quot;Yes&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;help(&amp;quot;titanic_test&amp;quot;) # this shows more information about the data
help(&amp;quot;titanic_train&amp;quot;) # from the help page, Survived is coded as 1 for passengers who survived
# let&amp;#39;s see whether more males or females survived
boxplot(titanic_train$Survived ~ titanic_train$Sex)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-12-24-R_101_files/figure-html/unnamed-chunk-18-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(titanic_train$Survived, titanic_train$Sex) # it looks like males largely did not survive&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    
##     female male
##   0     81  468
##   1    233  109&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# this might be a better way to view the data - the width of each column represents the number of passengers, so there are more males overall
mosaicplot(Sex ~ Survived, data = titanic_train)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-12-24-R_101_files/figure-html/unnamed-chunk-18-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
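&lt;p&gt;To put numbers on what the mosaic plot shows, &lt;code&gt;prop.table()&lt;/code&gt; converts the counts into proportions. With &lt;code&gt;margin = 1&lt;/code&gt; each row sums to 1, giving the survival rate within each sex (a sketch building on the table above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# row-wise proportions: survival rate within each sex
prop.table(table(titanic_train$Sex, titanic_train$Survived), margin = 1)&lt;/code&gt;&lt;/pre&gt;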
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# even though there were many more males, the female passengers were much more likely to survive&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;How about some more data…&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;earthquake-data&#34; class=&#34;section level6&#34;&gt;
&lt;h6&gt;Earthquake Data&lt;/h6&gt;
&lt;p&gt;Here we will scrape data (that is, programmatically extract data from a website) from the USGS website about earthquake rates in different US states. See our previous &lt;a href=&#34;http://research.libd.org/rstatsclub/2018/03/19/introduction-to-scraping-and-wranging-tables-from-research-articles/#.W_MStJNKhR4&#34;&gt;post&lt;/a&gt; from S. Semick on scraping data from research articles for more information on how to do this.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;install.packages(&amp;quot;htmltab&amp;quot;)
install.packages(&amp;quot;reshape2&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(htmltab)
library(reshape2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;reshape2&amp;#39; was built under R version 3.4.3&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;url &amp;lt;- &amp;quot;https://earthquake.usgs.gov/earthquakes/browse/stats.php&amp;quot;
eq &amp;lt;- htmltab(doc = url, which = 5) # grab the fifth table on the page&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## No encoding supplied: defaulting to UTF-8.&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rownames(eq) &amp;lt;- eq$States
eq &amp;lt;- eq[-1] # drop the States column now that it is stored in the row names
head(eq)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##            2010 2011 2012 2013 2014 2015
## Alabama       1    1    0    0    2    6
## Alaska     2245 1409 1166 1329 1296 1575
## Arizona       6    7    4    3   31   10
## Arkansas     15   44    4    4    1    0
## California  546  195  243  240  191  130
## Colorado      4   23    7    2   13    7&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;eq2 &amp;lt;- as.data.frame(sapply(eq, function(x) as.numeric(as.character(x)))) # convert each column to numeric
head(eq2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   2010 2011 2012 2013 2014 2015
## 1    1    1    0    0    2    6
## 2 2245 1409 1166 1329 1296 1575
## 3    6    7    4    3   31   10
## 4   15   44    4    4    1    0
## 5  546  195  243  240  191  130
## 6    4   23    7    2   13    7&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;convertoChar &amp;lt;- function(x) as.numeric(as.character(x)) # or you could create a function to use multiple times
factor_to_fix &amp;lt;- as.factor(c(1, 2))
class(factor_to_fix)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;factor&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;class(trying_function&amp;lt;-convertoChar(x=factor_to_fix))# now the class is numeric&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;numeric&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;class(trying_function2&amp;lt;-convertoChar(factor_to_fix))# now the class is numeric&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;numeric&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rownames(eq2)&amp;lt;-rownames(eq)
head(eq2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##            2010 2011 2012 2013 2014 2015
## Alabama       1    1    0    0    2    6
## Alaska     2245 1409 1166 1329 1296 1575
## Arizona       6    7    4    3   31   10
## Arkansas     15   44    4    4    1    0
## California  546  195  243  240  191  130
## Colorado      4   23    7    2   13    7&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;colSums(eq2)#look at the col sums&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 2010 2011 2012 2013 2014 2015 
## 3026 1955 1603 1899 2628 3225&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;colMeans(eq2)# look at the col means&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  2010  2011  2012  2013  2014  2015 
## 60.52 39.10 32.06 37.98 52.56 64.50&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rowMeans(eq2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##        Alabama         Alaska        Arizona       Arkansas     California 
##      1.6666667   1503.3333333     10.1666667     11.3333333    257.5000000 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##      9.3333333      0.1666667      0.0000000      0.0000000      0.0000000 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##     33.3333333     15.8333333      0.8333333      0.6666667      0.0000000 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##     17.3333333      0.3333333      0.1666667      0.5000000      0.1666667 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##      0.0000000      0.3333333      0.1666667      0.5000000      2.1666667 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##     14.8333333      1.0000000     85.5000000      0.1666667      0.0000000 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##      6.3333333      0.5000000      0.1666667      0.1666667      1.0000000 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##    285.6666667      2.8333333      0.0000000      0.0000000      0.5000000 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##      1.0000000      1.3333333     13.8333333     11.5000000      0.0000000 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##      2.1666667     10.0000000      0.3333333      0.0000000     84.6666667&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rowSums(eq2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##        Alabama         Alaska        Arizona       Arkansas     California 
##             10           9020             61             68           1545 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##             56              1              0              0              0 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##            200             95              5              4              0 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##            104              2              1              3              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              0              2              1              3             13 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##             89              6            513              1              0 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##             38              3              1              1              6 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##           1714             17              0              0              3 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              6              8             83             69              0 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##             13             60              2              0            508&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;max(eq2)# maximum value&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 2245&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;boxplot(eq2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-12-24-R_101_files/figure-html/unnamed-chunk-20-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;boxplot(eq2, ylim = c(0,40))# change the limit of the plot&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-12-24-R_101_files/figure-html/unnamed-chunk-20-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;boxplot(t(eq2), ylim = c(0, 2000)) # transpose the data with t() to group by state instead of year&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-12-24-R_101_files/figure-html/unnamed-chunk-20-3.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;eq3 &amp;lt;- melt(eq2) # this reshapes the data into long form&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## No id variables; using all as measure variables&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;names(eq3)&amp;lt;-c(&amp;quot;year&amp;quot;, &amp;quot;quakes&amp;quot;)
head(eq3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   year quakes
## 1 2010      1
## 2 2010   2245
## 3 2010      6
## 4 2010     15
## 5 2010    546
## 6 2010      4&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit &amp;lt;- aov(eq3$quakes ~ eq3$year)
summary(fit) # no significant difference in earthquake counts across years&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##              Df   Sum Sq Mean Sq F value Pr(&amp;gt;F)
## eq3$year      5    44161    8832   0.169  0.974
## Residuals   294 15405834   52401&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;in-addition-these-are-functions-that-the-members-of-libd-rstats-use-often&#34; class=&#34;section level6&#34;&gt;
&lt;h6&gt;In addition, these are functions that the members of &lt;a href=&#34;http://research.libd.org/rstatsclub/#.W_MMz5NKhR4&#34;&gt;LIBD Rstats&lt;/a&gt; use often:&lt;/h6&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rdocumentation.org/packages/utils/versions/3.5.1/topics/head&#34;&gt;head()&lt;/a&gt; / &lt;a href=&#34;https://www.rdocumentation.org/packages/utils/versions/3.5.1/topics/head&#34;&gt;tail()&lt;/a&gt; – see the head and the tail - also check out the corner function of the &lt;a href=&#34;https://github.com/LieberInstitute/jaffelab&#34;&gt;jaffelab package&lt;/a&gt; created by LIBD Rstats founding member E. Burke&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/row%2Bcolnames&#34;&gt;colnames()&lt;/a&gt; / &lt;a href=&#34;https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/row%2Bcolnames&#34;&gt;rownames()&lt;/a&gt; – see and rename columns or row names&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rdocumentation.org/packages/gmatrix/versions/0.3/topics/colMeans&#34;&gt;colMeans()&lt;/a&gt; / &lt;a href=&#34;https://www.rdocumentation.org/packages/fame/versions/1.03/topics/rowMeans&#34;&gt;rowMeans()&lt;/a&gt; / &lt;a href=&#34;https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/colSums&#34;&gt;colSums()&lt;/a&gt; / &lt;a href=&#34;https://www.rdocumentation.org/packages/raster/versions/2.7-15/topics/rowSums&#34;&gt;rowSums()&lt;/a&gt; – get means and sums of columns and rows&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/dim&#34;&gt;dim()&lt;/a&gt; and &lt;a href=&#34;https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/length&#34;&gt;length()&lt;/a&gt; – determine the dimensions/size of a data set – need to use length() when evaluating a vector&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rdocumentation.org/packages/hyperSpec/versions/0.98-20140523/topics/ncol&#34;&gt;ncol()&lt;/a&gt; / &lt;a href=&#34;https://www.rdocumentation.org/packages/hyperSpec/versions/0.98-20140523/topics/ncol&#34;&gt;nrow()&lt;/a&gt; – number of columns and rows&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rdocumentation.org/packages/utils/versions/3.5.1/topics/str&#34;&gt;str()&lt;/a&gt; – displays the structure of an object - this is very useful with complex data structures&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/unique&#34;&gt;unique()&lt;/a&gt;/&lt;a href=&#34;https://www.rdocumentation.org/packages/data.table/versions/1.11.8/topics/duplicated&#34;&gt;duplicated()&lt;/a&gt; – find unique and duplicated values&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/order&#34;&gt;order()&lt;/a&gt;/&lt;a href=&#34;https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/sort&#34;&gt;sort()&lt;/a&gt;– order and sort your data&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/grep&#34;&gt;gsub()&lt;/a&gt; – replace values&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/table&#34;&gt;table()&lt;/a&gt; – summarize your data in table format&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rdocumentation.org/packages/stats/versions/3.5.1/topics/t.test&#34;&gt;t.test()&lt;/a&gt; – perform a t test&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rdocumentation.org/packages/stats/versions/3.5.1/topics/cor.test&#34;&gt;cor.test()&lt;/a&gt; – perform a correlation test&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rdocumentation.org/packages/stats/versions/3.5.1/topics/lm&#34;&gt;lm()&lt;/a&gt; – make a linear model&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/summary&#34;&gt;summary()&lt;/a&gt; – if you use the lm() output – this will give you the results&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/Random&#34;&gt;set.seed()&lt;/a&gt; – makes random permutations or randomly generated data come out the same every time you run your code&lt;/li&gt;
&lt;/ul&gt;
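&lt;p&gt;To see a few of these in action together, here is a small sketch using the built-in &lt;code&gt;mtcars&lt;/code&gt; data set:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dim(mtcars)                          # rows and columns
head(mtcars, 3)                      # first few rows
table(mtcars$cyl)                    # counts per number of cylinders
sort(unique(mtcars$cyl))             # the distinct cylinder values, in order
colMeans(mtcars[, c(&amp;quot;mpg&amp;quot;, &amp;quot;hp&amp;quot;)])  # column means
summary(lm(mpg ~ wt, data = mtcars)) # a simple linear model&lt;/code&gt;&lt;/pre&gt;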
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;for-additional-help-take-a-look-at-these-links&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;For additional help, take a look at these links:&lt;/h3&gt;
&lt;div id=&#34;free-courses-and-tutorials&#34; class=&#34;section level6&#34;&gt;
&lt;h6&gt;Free courses and tutorials&lt;/h6&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.coursera.org/courses?query=r&#34; class=&#34;uri&#34;&gt;https://www.coursera.org/courses?query=r&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.pluralsight.com/courses/r-programming-fundamentals&#34; class=&#34;uri&#34;&gt;https://www.pluralsight.com/courses/r-programming-fundamentals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rstudio.com/online-learning/&#34; class=&#34;uri&#34;&gt;https://www.rstudio.com/online-learning/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;community-support&#34; class=&#34;section level6&#34;&gt;
&lt;h6&gt;Community support&lt;/h6&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://community.rstudio.com/&#34; class=&#34;uri&#34;&gt;https://community.rstudio.com/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://support.bioconductor.org/&#34; class=&#34;uri&#34;&gt;https://support.bioconductor.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://resources.rstudio.com/webinars/help-me-help-you-creating-reproducible-examples-jenny-bryan&#34; class=&#34;uri&#34;&gt;https://resources.rstudio.com/webinars/help-me-help-you-creating-reproducible-examples-jenny-bryan&lt;/a&gt; – this webinar helps you create reproducible examples of the code errors that you need help with&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;tips-for-help&#34; class=&#34;section level6&#34;&gt;
&lt;h6&gt;Tips for help&lt;/h6&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.r-project.org/help.html&#34; class=&#34;uri&#34;&gt;https://www.r-project.org/help.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Google! – Stack Overflow, Biostars&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rstudio.com/resources/cheatsheets/&#34; class=&#34;uri&#34;&gt;https://www.rstudio.com/resources/cheatsheets/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;also-follow-our-blog-for-more-helpful-posts.&#34; class=&#34;section level6&#34;&gt;
&lt;h6&gt;Also follow our blog for more helpful posts.&lt;/h6&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;thanks-for-reading-and-have-fun-getting-to-know-r&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Thanks for reading and have fun getting to know R!&lt;/h2&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-12-24-R_101_files/startnew.jpg&#34; width=&#34;500&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;This image came from: &lt;a href=&#34;https://www.pinterest.com/pin/89790586304535333/&#34; class=&#34;uri&#34;&gt;https://www.pinterest.com/pin/89790586304535333/&lt;/a&gt;&lt;/p&gt;
&lt;div id=&#34;acknowledgments&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Acknowledgments&lt;/h3&gt;
&lt;p&gt;This blog post was made possible thanks to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;http://bioconductor.org/packages/BiocStyle&#34;&gt;BiocStyle&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Oles_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/Bioconductor/BiocStyle&#39;&gt;Oleś, Morgan, and Huber, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Xie_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/rstudio/blogdown&#39;&gt;Xie, Hill, and Thomas, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;knitcitations&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Boettiger_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=knitcitations&#39;&gt;Boettiger, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=sessioninfo&#34;&gt;sessioninfo&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Csardi_2018&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=sessioninfo&#39;&gt;Csárdi, core, Wickham, Chang, et al., 2018&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Boettiger_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Boettiger_2017&#34;&gt;[1]&lt;/a&gt;&lt;cite&gt; C. Boettiger. &lt;em&gt;knitcitations: Citations for ‘Knitr’ Markdown Files&lt;/em&gt;. R package version 1.0.8. 2017. URL: &lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;https://CRAN.R-project.org/package=knitcitations&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Csardi_2018&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Csardi_2018&#34;&gt;[2]&lt;/a&gt;&lt;cite&gt; G. Csárdi, R. core, H. Wickham, W. Chang, et al. &lt;em&gt;sessioninfo: R Session Information&lt;/em&gt;. R package version 1.1.1. 2018. URL: &lt;a href=&#34;https://CRAN.R-project.org/package=sessioninfo&#34;&gt;https://CRAN.R-project.org/package=sessioninfo&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Oles_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Oles_2017&#34;&gt;[3]&lt;/a&gt;&lt;cite&gt; A. Oleś, M. Morgan and W. Huber. &lt;em&gt;BiocStyle: Standard styles for vignettes and other Bioconductor documents&lt;/em&gt;. R package version 2.6.1. 2017. URL: &lt;a href=&#34;https://github.com/Bioconductor/BiocStyle&#34;&gt;https://github.com/Bioconductor/BiocStyle&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Xie_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Xie_2017&#34;&gt;[4]&lt;/a&gt;&lt;cite&gt; Y. Xie, A. P. Hill and A. Thomas. &lt;em&gt;blogdown: Creating Websites with R Markdown&lt;/em&gt;. ISBN 978-0815363729. Boca Raton, Florida: Chapman and Hall/CRC, 2017. URL: &lt;a href=&#34;https://github.com/rstudio/blogdown&#34;&gt;https://github.com/rstudio/blogdown&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.4.0 (2017-04-21)
##  os       macOS Sierra 10.12.6        
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       America/New_York            
##  date     2018-11-19                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package       * version   date       lib source                           
##  assertthat      0.2.0     2017-04-11 [1] CRAN (R 3.4.0)                   
##  babynames     * 0.3.0     2017-04-14 [1] CRAN (R 3.4.0)                   
##  backports       1.1.2     2018-04-18 [1] Github (r-lib/backports@cee9348) 
##  bibtex          0.4.2     2017-06-30 [1] CRAN (R 3.4.1)                   
##  bindr           0.1       2016-11-13 [1] CRAN (R 3.4.0)                   
##  bindrcpp        0.2       2017-06-17 [1] CRAN (R 3.4.0)                   
##  BiocStyle     * 2.6.1     2017-11-30 [1] Bioconductor                     
##  blogdown        0.5.9     2018-03-08 [1] Github (rstudio/blogdown@dc1f41c)
##  bookdown        0.7       2018-02-18 [1] CRAN (R 3.4.3)                   
##  cli             1.0.0     2017-11-05 [1] CRAN (R 3.4.2)                   
##  crayon          1.3.4     2017-09-16 [1] CRAN (R 3.4.1)                   
##  curl            3.2       2018-03-28 [1] CRAN (R 3.4.4)                   
##  digest          0.6.15    2018-01-28 [1] CRAN (R 3.4.3)                   
##  dplyr         * 0.7.4     2017-09-28 [1] CRAN (R 3.4.2)                   
##  evaluate        0.11      2018-07-17 [1] CRAN (R 3.4.4)                   
##  glue            1.3.0     2018-07-17 [1] CRAN (R 3.4.4)                   
##  htmltab       * 0.7.1     2016-12-29 [1] CRAN (R 3.4.0)                   
##  htmltools       0.3.6     2017-04-28 [1] CRAN (R 3.4.0)                   
##  httr            1.3.1     2017-08-20 [1] CRAN (R 3.4.1)                   
##  jsonlite        1.5       2017-06-01 [1] CRAN (R 3.4.0)                   
##  knitcitations * 1.0.8     2017-07-04 [1] CRAN (R 3.4.1)                   
##  knitr           1.20      2018-02-20 [1] CRAN (R 3.4.3)                   
##  lubridate       1.7.4     2018-04-11 [1] CRAN (R 3.4.4)                   
##  magrittr        1.5       2014-11-22 [1] CRAN (R 3.4.0)                   
##  pillar          1.2.1     2018-02-27 [1] CRAN (R 3.4.3)                   
##  pkgconfig       2.0.1     2017-03-21 [1] CRAN (R 3.4.0)                   
##  plyr            1.8.4     2016-06-08 [1] CRAN (R 3.4.0)                   
##  R6              2.2.2     2017-06-17 [1] CRAN (R 3.4.0)                   
##  Rcpp            0.12.16   2018-03-13 [1] CRAN (R 3.4.4)                   
##  RefManageR      1.2.0     2018-04-25 [1] CRAN (R 3.4.4)                   
##  reshape2      * 1.4.3     2017-12-11 [1] CRAN (R 3.4.3)                   
##  rlang           0.2.0     2018-02-20 [1] CRAN (R 3.4.3)                   
##  rmarkdown       1.10      2018-06-11 [1] CRAN (R 3.4.4)                   
##  rprojroot       1.3-2     2018-01-03 [1] CRAN (R 3.4.3)                   
##  sessioninfo   * 1.1.1     2018-11-05 [1] CRAN (R 3.4.4)                   
##  stringi         1.2.4     2018-07-20 [1] CRAN (R 3.4.4)                   
##  stringr         1.3.1     2018-05-10 [1] CRAN (R 3.4.4)                   
##  tibble          1.4.2     2018-01-22 [1] CRAN (R 3.4.3)                   
##  titanic       * 0.1.0     2015-08-31 [1] CRAN (R 3.4.0)                   
##  utf8            1.1.3     2018-01-03 [1] CRAN (R 3.4.3)                   
##  withr           2.1.2     2018-03-15 [1] CRAN (R 3.4.4)                   
##  xfun            0.3       2018-07-06 [1] CRAN (R 3.4.4)                   
##  XML             3.98-1.10 2018-02-19 [1] CRAN (R 3.4.3)                   
##  xml2            1.2.0     2018-01-24 [1] CRAN (R 3.4.3)                   
##  yaml            2.2.0     2018-07-25 [1] CRAN (R 3.4.4)                   
## 
## [1] /Library/Frameworks/R.framework/Versions/3.4/Resources/library&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Quality Surrogate Variable Analysis </title>
      <link>http://LieberInstitute.github.io/rstatsclub/2018/12/11/quality-surrogate-variable-analysis/</link>
      <pubDate>Tue, 11 Dec 2018 00:00:00 +0000</pubDate>
      <guid>http://LieberInstitute.github.io/rstatsclub/2018/12/11/quality-surrogate-variable-analysis/</guid>
      <description>


&lt;p&gt;By &lt;a href=&#34;https://amy-peterson.github.io/&#34;&gt;Amy Peterson&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Studying genetic differential expression using postmortem human brain tissue requires an understanding of the effect brain tissue degradation has on gene expression, particularly when degradation confounds&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; the differences in gene expression levels between subject groups. This confounding necessitates measures from a control dataset of postmortem tissue from individuals who do not have the outcome of interest. Such a dataset provides a comparative measure of the impact of tissue degradation on expression, which can then be used in a case-control study examining the impact of the outcome of interest on gene expression. Incorporating these degradation measurements from control brains into the differential expression analysis of brains with the outcome of interest leads to more accurate results and reduces the number of genes falsely identified as differentially expressed between cases and controls.&lt;/p&gt;
&lt;div id=&#34;sva-background&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;SVA background&lt;/h3&gt;
&lt;p&gt;RNA-sequencing (RNA-seq) is a high-throughput method for quantifying gene expression levels that requires high-quality RNA. The effect of RNA quality on accurately detecting differential expression was previously addressed with surrogate variable analysis (SVA), which estimates surrogate variables for unmodeled sources of heterogeneity in expression studies, such as batch effects &lt;a id=&#39;cite-Leek_2007&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://doi.org/10.1371/journal.pgen.0030161&#39;&gt;Leek and Storey, 2007&lt;/a&gt;). The problem of confounding, however, requires a more robust approach to identifying genes that are differentially expressed.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;qsva&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;qSVA&lt;/h3&gt;
&lt;p&gt;The quality surrogate variable analysis (qSVA) algorithmic framework, an extension of SVA, was developed by Andrew Jaffe and colleagues &lt;a id=&#39;cite-Jaffe_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://doi.org/10.1073/pnas.1617384114&#39;&gt;Jaffe, Tao, Norris, Kealhofer, et al., 2017&lt;/a&gt;) to address confounding by brain tissue degradation. The qSVA framework reduces the number of false positives: without it, genes can be flagged as differentially expressed simply because RNA quality confounding is not adequately controlled for. This conservative approach uses stricter criteria: well-established processing methods, expression cutoffs, avoidance of potential batch effects, and adjustment for RNA quality degradation confounding using qSVA.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;datasets&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Datasets&lt;/h3&gt;
&lt;p&gt;The qSVA algorithm requires the use of two datasets. Here, the dataset of interest is part of BrainSeq, A Human Brain Genomics Consortium, which was initiated with the goal of generating a public database of gene expression in postmortem brain tissue to enhance the understanding of psychiatric disorders through neurogenomic data &lt;a id=&#39;cite-bsc2015&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://doi.org/10.1016/j.neuron.2015.10.047&#39;&gt;Schubert, O’Donnell, Quan, Wendland, et al., 2015&lt;/a&gt;). The other dataset is a control dataset, which can also be referred to as the degradation dataset, since it is the measure of the impact of degradation on gene expression in postmortem tissue for individuals who do not have the outcome of interest. The degradation dataset is a much smaller dataset and helps determine the genomic regions most associated with brain degradation. This addresses the concern of an association between the outcome of interest and genetic expression, and helps better understand metrics that demonstrate RNA quality through experimental approaches. Using these two datasets, and by extending qSVA to more than one brain region, we are able to examine the issue of RNA quality confounding using RNA-seq data from multiple brain regions in a case-control study comparing degradation of tissue in patients with schizophrenia to non-psychiatric controls using BrainSeq consortium data &lt;a id=&#39;cite-colladotorres2018&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://doi.org/10.1101/426213&#39;&gt;Collado-Torres, Burke, Peterson, Shin, et al., 2018&lt;/a&gt;). We focused on the hippocampus (HIPPO) and dorsolateral prefrontal cortex (DLPFC), two brain regions that have been identified as functionally-altered in schizophrenia &lt;a id=&#39;cite-Rasetti_2014&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://doi.org/10.1001/jamapsychiatry.2013.3911&#39;&gt;Rasetti, Mattay, White, Sambataro, et al., 2014&lt;/a&gt;).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;algorithm-and-workflow&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Algorithm and Workflow&lt;/h3&gt;
&lt;p&gt;The algorithm has terms that account for measured covariates, including diagnosis, age, sex, mitochondrial rate, rRNA rate, gene assignment rate, RNA integrity number (RIN), ethnicity principal components (PCs)&lt;a href=&#34;#fn2&#34; class=&#34;footnote-ref&#34; id=&#34;fnref2&#34;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;, and the region-specific quality surrogate variables, or qSVs, identified using the degradation dataset.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-12-11-quality-surrogate-variable-analysis_files/Picture1.png&#34; width=&#34;600&#34; /&gt;&lt;/p&gt;
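&lt;p&gt;As a rough sketch of how such a model can be fit (this is illustrative code only, not the actual analysis code; all object names, dimensions, and the reduced covariate list below are made up for the example), the adjustment amounts to appending the qSVs as extra columns of the design matrix:&lt;/p&gt;

```r
## Hedged sketch of a qSVA-style adjustment with limma; every object,
## dimension, and covariate here is simulated for illustration only.
library(limma)
set.seed(20181211)

n = 40                                         # samples
expr = matrix(rnorm(200 * n), ncol = n)        # gene x sample log2 expression
degMat = matrix(rpois(100 * n, 50), ncol = n)  # coverage, degradation regions
pd = data.frame(
  Dx  = factor(rep(c("Control", "SCZD"), each = n / 2)),
  Age = runif(n, 18, 80),
  RIN = runif(n, 5, 9)
)

## qSVs: top principal components of (log2) coverage on the
## degradation-susceptible regions identified in the degradation dataset.
k = 2  # number of qSVs; chosen more carefully in practice
qSVs = prcomp(t(log2(degMat + 1)))$x[, seq_len(k)]

## Fit: diagnosis plus measured covariates plus the qSVs.
mod = model.matrix(~ Dx + Age + RIN, data = pd)
fit = eBayes(lmFit(expr, cbind(mod, qSVs)))
topTable(fit, coef = "DxSCZD")
```

&lt;p&gt;In the real analysis the full covariate list from the paragraph above is used, and the qSVs are computed per brain region from the degradation dataset.&lt;/p&gt;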
&lt;/div&gt;
&lt;div id=&#34;results&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Results&lt;/h3&gt;
&lt;p&gt;After using qSVA to adjust for the confounding effect of RNA quality, differential expression quality (DEqual) plots are used to assess the effectiveness of the statistical correction. These plots compare the differential expression statistics from the degradation experiments on the y axis against the statistics for the outcome from the dataset of interest on the x axis. The plots shown here are for the HIPPO samples, looking at the log-fold change in expression per minute of degradation, with each point representing a gene. The goal is to assess the correlation between these two sets of statistics, and how that correlation changes after including the quality surrogate variables (qSVs) in the model. There should be no correlation between the degradation dataset and the schizophrenia case-control BrainSeq dataset, labeled as Dx on the axis for diagnosis, since they are independent datasets and the degradation dataset serves as a control. Model 1 is a naïve model that includes diagnosis only. Model 2 adds measures of RNA quality and demographic covariates. Model 3 includes all of the terms from the previous models plus the qSVs. The number of genes identified as differentially expressed is shown in parentheses next to each model, and it drops drastically from over 6,000 in model 1, to 63 in model 2, to 48 in model 3.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-12-11-quality-surrogate-variable-analysis_files/Picture2.png&#34; width=&#34;600&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-12-11-quality-surrogate-variable-analysis_files/Picture3.png&#34; width=&#34;600&#34; /&gt;&lt;/p&gt;
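&lt;p&gt;Conceptually, each DEqual plot is just a scatter of two vectors of differential expression statistics plus their correlation. A minimal base-R sketch, where both vectors are simulated stand-ins rather than real degradation or case-control results:&lt;/p&gt;

```r
## DEqual-style comparison sketch; both statistic vectors are simulated
## stand-ins for illustration only.
set.seed(20181211)
t_degradation = rnorm(1000)               # degradation experiment statistics
t_dx = 0.6 * t_degradation + rnorm(1000)  # case-control (Dx) statistics

## After a successful qSVA correction this correlation should be near zero.
cor(t_dx, t_degradation)
plot(t_dx, t_degradation,
     xlab = "Dx statistic", ylab = "Degradation statistic")
```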
&lt;/div&gt;
&lt;div id=&#34;conclusions&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Conclusions&lt;/h3&gt;
&lt;p&gt;Once we are confident that confounding has been removed from the samples of interest, we can assess differential expression between cases and controls. Using the 48 genes identified by model 3, we can then perform gene ontology (biological process) enrichment analysis to determine which biological processes are enriched, gaining clearer insights into which genes are most affected in the brain tissue of individuals with schizophrenia. For more information, please see the freely available pre-print describing the BrainSeq Phase II project (&lt;a href=&#39;https://doi.org/10.1101/426213&#39;&gt;Collado-Torres, Burke, Peterson, Shin, et al., 2018&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://amy-peterson.github.io/&#34;&gt;Amy Peterson&lt;/a&gt; extended the qSVA statistical framework from one to multiple brain regions as part of her &lt;a href=&#34;https://www.jhsph.edu/&#34;&gt;JHBSPH&lt;/a&gt; &lt;a href=&#34;https://www.jhsph.edu/academics/degree-programs/master-of-public-health/curriculum/capstone.html&#34;&gt;MPH Capstone&lt;/a&gt; project, which she carried out with &lt;a href=&#34;http://aejaffe.com/&#34;&gt;Andrew E. Jaffe&lt;/a&gt; and &lt;a href=&#34;http://lcolladotor.github.io/&#34;&gt;Leonardo Collado-Torres&lt;/a&gt; at the &lt;a href=&#34;https://www.libd.org/&#34;&gt;Lieber Institute for Brain Development&lt;/a&gt;. The &lt;code&gt;R&lt;/code&gt; and &lt;code&gt;bash&lt;/code&gt; code Amy Peterson wrote is available online via GitHub at &lt;a href=&#34;https://github.com/LieberInstitute/qsva_brain&#34;&gt;LieberInstitute/qsva_brain&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;A few days late, but here&amp;#39;s Amy &lt;a href=&#34;https://t.co/A0OSUVD9Nw&#34;&gt;https://t.co/A0OSUVD9Nw&lt;/a&gt; after successfully presenting her &lt;a href=&#34;https://twitter.com/JohnsHopkinsSPH?ref_src=twsrc%5Etfw&#34;&gt;@JohnsHopkinsSPH&lt;/a&gt; MPH capstone project. It was a great experience for me to mentor her along with &lt;a href=&#34;https://twitter.com/AndrewJaffe?ref_src=twsrc%5Etfw&#34;&gt;@andrewjaffe&lt;/a&gt; at &lt;a href=&#34;https://twitter.com/LieberInstitute?ref_src=twsrc%5Etfw&#34;&gt;@lieberinstitute&lt;/a&gt; I look forward to seeing where her career takes her 🙌🏾 ^^ &lt;a href=&#34;https://t.co/hbUiHQOVq3&#34;&gt;pic.twitter.com/hbUiHQOVq3&lt;/a&gt;&lt;/p&gt;&amp;mdash; 🇲🇽 Leonardo Collado-Torres (@lcolladotor) &lt;a href=&#34;https://twitter.com/lcolladotor/status/993683427131092993?ref_src=twsrc%5Etfw&#34;&gt;May 8, 2018&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;


&lt;/div&gt;
&lt;div id=&#34;acknowledgments&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Acknowledgments&lt;/h3&gt;
&lt;p&gt;This blog post was made possible thanks to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.8/BiocStyle&#34;&gt;BiocStyle&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Oles_2018&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/Bioconductor/BiocStyle&#39;&gt;Oleś, Morgan, and Huber, 2018&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Xie_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/rstudio/blogdown&#39;&gt;Xie, Hill, and Thomas, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;knitcitations&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Boettiger_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=knitcitations&#39;&gt;Boettiger, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=sessioninfo&#34;&gt;sessioninfo&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Csardi_2018&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=sessioninfo&#39;&gt;Csárdi, core, Wickham, Chang, et al., 2018&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Boettiger_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Boettiger_2017&#34;&gt;[1]&lt;/a&gt;&lt;cite&gt;
C. Boettiger.
&lt;em&gt;knitcitations: Citations for ‘Knitr’ Markdown Files&lt;/em&gt;.
R package version 1.0.8.
2017.
URL: &lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;https://CRAN.R-project.org/package=knitcitations&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-colladotorres2018&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-colladotorres2018&#34;&gt;[2]&lt;/a&gt;&lt;cite&gt;
L. Collado-Torres, E. E. Burke, A. Peterson, J. H. Shin, et al.
“Regional heterogeneity in gene expression, regulation and coherence in hippocampus and dorsolateral prefrontal cortex across development and in schizophrenia”.
In: &lt;em&gt;bioRxiv&lt;/em&gt; (2018).
DOI: &lt;a href=&#34;https://doi.org/10.1101/426213&#34;&gt;10.1101/426213&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Csardi_2018&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Csardi_2018&#34;&gt;[3]&lt;/a&gt;&lt;cite&gt;
G. Csárdi, R. core, H. Wickham, W. Chang, et al.
&lt;em&gt;sessioninfo: R Session Information&lt;/em&gt;.
R package version 1.1.1.
2018.
URL: &lt;a href=&#34;https://CRAN.R-project.org/package=sessioninfo&#34;&gt;https://CRAN.R-project.org/package=sessioninfo&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Jaffe_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Jaffe_2017&#34;&gt;[4]&lt;/a&gt;&lt;cite&gt;
A. E. Jaffe, R. Tao, A. L. Norris, M. Kealhofer, et al.
“qSVA framework for RNA quality correction in differential expression analysis”.
In: &lt;em&gt;Proceedings of the National Academy of Sciences&lt;/em&gt; 114.27 (Jun. 2017), pp. 7130–7135.
DOI: &lt;a href=&#34;https://doi.org/10.1073/pnas.1617384114&#34;&gt;10.1073/pnas.1617384114&lt;/a&gt;.
URL: &lt;a href=&#34;https://doi.org/10.1073/pnas.1617384114&#34;&gt;https://doi.org/10.1073/pnas.1617384114&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Leek_2007&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Leek_2007&#34;&gt;[5]&lt;/a&gt;&lt;cite&gt;
J. T. Leek and J. D. Storey.
“Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis”.
In: &lt;em&gt;PLoS Genetics&lt;/em&gt; 3.9 (2007), p. e161.
DOI: &lt;a href=&#34;https://doi.org/10.1371/journal.pgen.0030161&#34;&gt;10.1371/journal.pgen.0030161&lt;/a&gt;.
URL: &lt;a href=&#34;https://doi.org/10.1371/journal.pgen.0030161&#34;&gt;https://doi.org/10.1371/journal.pgen.0030161&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Oles_2018&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Oles_2018&#34;&gt;[6]&lt;/a&gt;&lt;cite&gt;
A. Oleś, M. Morgan and W. Huber.
&lt;em&gt;BiocStyle: Standard styles for vignettes and other Bioconductor documents&lt;/em&gt;.
R package version 2.10.0.
2018.
URL: &lt;a href=&#34;https://github.com/Bioconductor/BiocStyle&#34;&gt;https://github.com/Bioconductor/BiocStyle&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Rasetti_2014&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Rasetti_2014&#34;&gt;[7]&lt;/a&gt;&lt;cite&gt;
R. Rasetti, V. S. Mattay, M. G. White, F. Sambataro, et al.
“Altered Hippocampal-Parahippocampal Function During Stimulus Encoding”.
In: &lt;em&gt;JAMA Psychiatry&lt;/em&gt; 71.3 (Mar. 2014), p. 236.
DOI: &lt;a href=&#34;https://doi.org/10.1001/jamapsychiatry.2013.3911&#34;&gt;10.1001/jamapsychiatry.2013.3911&lt;/a&gt;.
URL: &lt;a href=&#34;https://doi.org/10.1001/jamapsychiatry.2013.3911&#34;&gt;https://doi.org/10.1001/jamapsychiatry.2013.3911&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-bsc2015&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-bsc2015&#34;&gt;[8]&lt;/a&gt;&lt;cite&gt;
Schubert, O’Donnell, Quan, Wendland, et al.
“BrainSeq: Neurogenomics to Drive Novel Target Discovery for Neuropsychiatric Disorders”.
In: &lt;em&gt;Neuron&lt;/em&gt; (2015).
DOI: &lt;a href=&#34;https://doi.org/10.1016/j.neuron.2015.10.047&#34;&gt;10.1016/j.neuron.2015.10.047&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Xie_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Xie_2017&#34;&gt;[9]&lt;/a&gt;&lt;cite&gt;
Y. Xie, A. P. Hill and A. Thomas.
&lt;em&gt;blogdown: Creating Websites with R Markdown&lt;/em&gt;.
ISBN 978-0815363729.
Boca Raton, Florida: Chapman and Hall/CRC, 2017.
URL: &lt;a href=&#34;https://github.com/rstudio/blogdown&#34;&gt;https://github.com/rstudio/blogdown&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.5.1 (2018-07-02)
##  os       macOS Mojave 10.14.1        
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       America/New_York            
##  date     2018-12-11                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package       * version date       lib source                            
##  assertthat      0.2.0   2017-04-11 [1] CRAN (R 3.5.0)                    
##  backports       1.1.2   2017-12-13 [1] CRAN (R 3.5.0)                    
##  bibtex          0.4.2   2017-06-30 [1] CRAN (R 3.5.0)                    
##  BiocManager     1.30.4  2018-11-13 [1] CRAN (R 3.5.0)                    
##  BiocStyle     * 2.10.0  2018-10-30 [1] Bioconductor                      
##  blogdown        0.9     2018-10-23 [1] CRAN (R 3.5.0)                    
##  bookdown        0.8     2018-12-03 [1] CRAN (R 3.5.0)                    
##  cli             1.0.1   2018-09-25 [1] CRAN (R 3.5.0)                    
##  colorout      * 1.2-0   2018-05-03 [1] Github (jalvesaq/colorout@c42088d)
##  crayon          1.3.4   2017-09-16 [1] CRAN (R 3.5.0)                    
##  curl            3.2     2018-03-28 [1] CRAN (R 3.5.0)                    
##  digest          0.6.18  2018-10-10 [1] CRAN (R 3.5.0)                    
##  evaluate        0.12    2018-10-09 [1] CRAN (R 3.5.0)                    
##  htmltools       0.3.6   2017-04-28 [1] CRAN (R 3.5.0)                    
##  httr            1.3.1   2017-08-20 [1] CRAN (R 3.5.0)                    
##  jsonlite        1.5     2017-06-01 [1] CRAN (R 3.5.0)                    
##  knitcitations * 1.0.8   2017-07-04 [1] CRAN (R 3.5.0)                    
##  knitr           1.20    2018-02-20 [1] CRAN (R 3.5.0)                    
##  lubridate       1.7.4   2018-04-11 [1] CRAN (R 3.5.0)                    
##  magrittr        1.5     2014-11-22 [1] CRAN (R 3.5.0)                    
##  plyr            1.8.4   2016-06-08 [1] CRAN (R 3.5.0)                    
##  R6              2.3.0   2018-10-04 [1] CRAN (R 3.5.0)                    
##  Rcpp            1.0.0   2018-11-07 [1] CRAN (R 3.5.0)                    
##  RefManageR      1.2.0   2018-04-25 [1] CRAN (R 3.5.0)                    
##  rmarkdown       1.10    2018-06-11 [1] CRAN (R 3.5.0)                    
##  rprojroot       1.3-2   2018-01-03 [1] CRAN (R 3.5.0)                    
##  sessioninfo   * 1.1.1   2018-11-05 [1] CRAN (R 3.5.0)                    
##  stringi         1.2.4   2018-07-20 [1] CRAN (R 3.5.0)                    
##  stringr         1.3.1   2018-05-10 [1] CRAN (R 3.5.0)                    
##  withr           2.1.2   2018-03-15 [1] CRAN (R 3.5.0)                    
##  xfun            0.4     2018-10-23 [1] CRAN (R 3.5.0)                    
##  xml2            1.2.0   2018-01-24 [1] CRAN (R 3.5.0)                    
##  yaml            2.2.0   2018-07-25 [1] CRAN (R 3.5.0)                    
## 
## [1] /Library/Frameworks/R.framework/Versions/3.5devel/Resources/library&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;As defined in &lt;a href=&#34;https://en.wikipedia.org/wiki/Confounding&#34;&gt;Wikipedia&lt;/a&gt;, confounding is: “In statistics, a confounder (also confounding variable, confounding factor or lurking variable) is a variable that influences both the dependent variable and independent variable causing a spurious association. Confounding is a causal concept, and as such, cannot be described in terms of correlations or associations.”&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn2&#34;&gt;&lt;p&gt;These are PCs computed on the genotype information from the individuals in this study. We use them to adjust for ethnicity in a more rigorous form than a categorical &lt;em&gt;race&lt;/em&gt; variable would be able to do.&lt;a href=&#34;#fnref2&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Quick overview on the new Bioconductor 3.8 release</title>
      <link>http://LieberInstitute.github.io/rstatsclub/2018/11/02/quick-overview-on-the-new-bioconductor-3-8-release/</link>
      <pubDate>Fri, 02 Nov 2018 00:00:00 +0000</pubDate>
      <guid>http://LieberInstitute.github.io/rstatsclub/2018/11/02/quick-overview-on-the-new-bioconductor-3-8-release/</guid>
      <description>


&lt;p&gt;Every six months the Bioconductor project releases its new version of packages. This gives developers a time window to try out new methods and test them rigorously before releasing them to the community at large. It also means that this is an exciting time 🎉. With every release there are dozens&lt;a href=&#34;#fn1&#34; class=&#34;footnoteRef&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; of new software packages. Bioconductor version 3.8 was &lt;a href=&#34;https://www.bioconductor.org/news/bioc_3_8_release/&#34;&gt;just released&lt;/a&gt; on Halloween: October 31st, 2018. Thus, this is the perfect time to browse through the package descriptions and find out what’s new that could be of use to your research.&lt;/p&gt;
&lt;p&gt;That’s exactly what our post today is about. We looked at the list of new packages as well as updates to find those that we think could be useful for us. That is, packages that we might want to explore further.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-11-02-quick-overview-on-the-new-bioconductor-3-8-release_files/Screen%20Shot%202018-11-02%20at%2010.48.18%20AM.png&#34; width=&#34;400&#34; /&gt;

&lt;/div&gt;
&lt;div id=&#34;affixcan&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;AffiXcan&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.8/AffiXcan&#34;&gt;AffiXcan&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Impute a GReX (Genetically Regulated Expression) for a set of genes in a sample of individuals, using a method based on the Total Binding Affinity (TBA). Statistical models to impute GReX can be trained with a training dataset where the real total expression values are known.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Looks interesting from the name but the description is too vague.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;biocpkgtools&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;BiocPkgTools&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.8/BiocPkgTools&#34;&gt;BiocPkgTools&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Bioconductor has a rich ecosystem of metadata around packages, usage, and build status. This package is a simple collection of functions to access that metadata from R. The goal is to expose metadata for data mining and value-added functionality such as package searching, text mining, and analytics on packages.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As Bioconductor developers, we think this package sounds useful. Maybe it can be used to see whether any of your packages are broken (errors, warnings) in BioC release or BioC devel.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;brainimager&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;brainImageR&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.8/brainImageR&#34;&gt;brainImageR&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;BrainImageR is a package that provides the user with information of where in the human brain their gene set corresponds to. This is provided both as a continuous variable and as an easily-interpretable image. BrainImageR has additional functionality of identifying approximately when in developmental time that a gene expression dataset corresponds to. Both the spatial gene set enrichment and the developmental time point prediction are assessed in comparison to the Allen Brain Atlas reference data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Sounds interesting since we work with brain data ourselves. We are curious about where the data comes from!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;buscorrect&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;BUScorrect&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.8/BUScorrect&#34;&gt;BUScorrect&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;High-throughput experimental data are accumulating exponentially in public databases. However, mining valid scientific discoveries from these abundant resources is hampered by technical artifacts and inherent biological heterogeneity. The former are usually termed “batch effects,” and the latter is often modelled by “subtypes.” The R package BUScorrect fits a Bayesian hierarchical model, the Batch-effects-correction-with-Unknown-Subtypes model (BUS), to correct batch effects in the presence of unknown subtypes. BUS is capable of (a) correcting batch effects explicitly, (b) grouping samples that share similar characteristics into subtypes, (c) identifying features that distinguish subtypes, and (d) enjoying a linear-order computation complexity.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Hm… maybe this can be used with data from &lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.8/recount&#34;&gt;recount&lt;/a&gt;&lt;/em&gt;. We also have to work sometimes with data from multiple labs.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;consensusde&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;consensusDE&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.8/consensusDE&#34;&gt;consensusDE&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This package allows users to perform DE analysis using multiple algorithms. It seeks consensus from multiple methods. Currently it supports “Voom”, “EdgeR” and “DESeq”, but can be easily extended. It uses RUV-seq (optional) to remove batch effects.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Depending on how flexible this new package is, it could be useful for saving time. If it’s not flexible, we won’t really use it.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;enhancedvolcano&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;EnhancedVolcano&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.8/EnhancedVolcano&#34;&gt;EnhancedVolcano&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Volcano plots represent a useful way to visualise the results of differential expression analyses. Here, we present a highly-configurable function that produces publication-ready volcano plots. EnhancedVolcano will attempt to fit as many transcript names in the plot window as possible, thus avoiding ‘clogging’ up the plot with labels that could not otherwise have been read.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Enhanced volcano plots? Cool! We use volcano plots all the time, so the name alone sold us. Plus, who doesn’t want &lt;em&gt;publication ready plots&lt;/em&gt;?&lt;a href=&#34;#fn2&#34; class=&#34;footnoteRef&#34; id=&#34;fnref2&#34;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
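&lt;p&gt;Based on the package’s documentation, a minimal call might look like the sketch below. Here &lt;code&gt;res&lt;/code&gt; is our own hypothetical placeholder for a differential expression results table (for example, from DESeq2):&lt;/p&gt;

```r
# Sketch based on the EnhancedVolcano documentation; 'res' is a
# hypothetical differential expression results table with columns
# 'log2FoldChange' and 'pvalue'.
library(EnhancedVolcano)

EnhancedVolcano(res,
    lab = rownames(res),   # transcript/gene labels to fit into the plot
    x = 'log2FoldChange',  # column with the log2 fold changes
    y = 'pvalue',          # column with the p-values
    pCutoff = 1e-5,        # significance threshold line
    FCcutoff = 1           # fold change threshold line
)
```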
&lt;/div&gt;
&lt;div id=&#34;excluster&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;ExCluster&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.8/ExCluster&#34;&gt;ExCluster&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;ExCluster flattens Ensembl and GENCODE GTF files into GFF files, which are used to count reads per non-overlapping exon bin from BAM files. This read counting is done using the function featureCounts from the package Rsubread. Library sizes are normalized across all biological replicates, and ExCluster then compares two different conditions to detect significantly differentially spliced genes. This process requires at least two independent biological replicates per condition, and ExCluster accepts only exactly two conditions at a time. ExCluster ultimately produces false discovery rates (FDRs) per gene, which are used to detect significance. Exon log2 fold change (log2FC) means and variances may be plotted for each significantly differentially spliced gene, which helps scientists develop hypotheses and target differential splicing events for RT-qPCR validation in the wet lab.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Hm… this one could be useful for some future work related to &lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.8/recount&#34;&gt;recount&lt;/a&gt;&lt;/em&gt;. Especially the part about simplifying a GTF file.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;maser&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;maser&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.8/maser&#34;&gt;maser&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This package provides functionalities for analysis, annotation and visualization of alternative splicing events.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Visualization of splicing events is something that can be useful. But the description is too vague and will require more investigation.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;mirsm&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;miRSM&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.8/miRSM&#34;&gt;miRSM&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The package aims to identify miRNA sponge modules by integrating expression data and miRNA-target binding information. It provides several functions to study miRNA sponge modules, including popular methods for inferring gene modules (candidate miRNA sponge modules), and a function to identify miRNA sponge modules, as well as a function to conduct functional analysis of miRNA sponge modules.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;I’m still reading the description&lt;/em&gt;, hold on! We’ll look into this more!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;outrider&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;OUTRIDER&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.8/OUTRIDER&#34;&gt;OUTRIDER&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Identification of aberrant gene expression in RNA-seq data. Read count expectations are modeled by an autoencoder to control for confounders in the data. Given these expectations, the RNA-seq read counts are assumed to follow a negative binomial distribution with a gene-specific dispersion. Outliers are then identified as read counts that significantly deviate from this distribution. Furthermore, OUTRIDER provides useful plotting functions to analyze and visualize the results.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We’d be interested in testing the performance of OUTRIDER in our already analyzed datasets.&lt;/p&gt;
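&lt;p&gt;From skimming the package materials, the basic workflow seems to go along these lines; &lt;code&gt;cts&lt;/code&gt; is a hypothetical matrix of raw RNA-seq counts (genes by samples) that we made up for illustration:&lt;/p&gt;

```r
# Sketch of the OUTRIDER workflow as described in its vignette;
# 'cts' is a hypothetical raw count matrix (genes x samples).
library(OUTRIDER)

ods = OutriderDataSet(countData = cts) # build the dataset object
ods = filterExpression(ods)            # drop genes with too few counts
ods = OUTRIDER(ods)                    # fit autoencoder + negative binomial
res = results(ods)                     # significant outlier gene/sample pairs
```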
&lt;/div&gt;
&lt;div id=&#34;primirtss&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;primirTSS&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.8/primirTSS&#34;&gt;primirTSS&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A fast, convenient tool to identify the TSSs of miRNAs by integrating the data of H3K4me3 and Pol II as well as combining the conservation level and sequence feature, provided within both command-line and graphical interfaces, which achieves a better performance than the previous non-cell-specific methods on miRNA TSSs.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We don’t have this kind of data (Pol II ChIP-seq) but it looks useful if you do have it.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;timeseriesexperiment&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;TimeSeriesExperiment&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.8/TimeSeriesExperiment&#34;&gt;TimeSeriesExperiment&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Visualization and analysis toolbox for short time course data which includes dimensionality reduction, clustering, two-sample differential expression testing and gene ranking techniques. The package also provides methods for retrieving enriched pathways.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We have time series data and could maybe try this package out for clustering and pathway analyses. We were not sure why it matters that the time course is &lt;em&gt;short&lt;/em&gt;: what does this really mean? Could it mean that the time course is complete (no dropouts), unlike a longitudinal time course project? From the vignette, it seems they mean a small number of time points; that is, the package provides statistical methods suited to that scenario.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;trna&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;tRNA&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.8/tRNA&#34;&gt;tRNA&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The tRNA package allows tRNA sequences and structures to be accessed and used for subsetting. In addition, it provides visualization tools to compare feature parameters of multiple tRNA sets and correlate them to additional data. The tRNA package uses GRanges objects as inputs, requiring only a few additional data columns.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Wow! tRNAs!? Let’s look into this!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;tximeta&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;tximeta&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.8/tximeta&#34;&gt;tximeta&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Transcript quantification import from Salmon with automatic population of metadata and transcript ranges. Filtered, combined, or de novo transcriptomes can be linked to the appropriate sources with linkedTxomes and shared for reproducible analyses.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Just look at this great video tweet by Michael Love!&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;.&lt;a href=&#34;https://twitter.com/Bioconductor?ref_src=twsrc%5Etfw&#34;&gt;@Bioconductor&lt;/a&gt; 3.8 is released, which means so is tximeta! this idea came up more than 2 years ago, to auto-populate metadata for Salmon quant directories. the goal is no more guessing for the data you quantified earlier in a project, or from public archive. here&amp;#39;s a demo &lt;a href=&#34;https://t.co/6r1yoNcIyj&#34;&gt;pic.twitter.com/6r1yoNcIyj&lt;/a&gt;&lt;/p&gt;&amp;mdash; Michael Love (@mikelove) &lt;a href=&#34;https://twitter.com/mikelove/status/1057948391261511680?ref_src=twsrc%5Etfw&#34;&gt;November 1, 2018&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;
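&lt;p&gt;From the demo, the gist is that you only hand &lt;code&gt;tximeta&lt;/code&gt; a table of sample names and Salmon quantification paths, and it populates the transcriptome metadata for you. A rough sketch (the sample names, file paths, and conditions below are made up):&lt;/p&gt;

```r
# Sketch based on the tximeta vignette; sample names and quant.sf
# paths are hypothetical.
library(tximeta)

coldata = data.frame(
    names = c('sample1', 'sample2'),
    files = c('quants/sample1/quant.sf', 'quants/sample2/quant.sf'),
    condition = c('control', 'treated'),
    stringsAsFactors = FALSE
)

se = tximeta(coldata)     # SummarizedExperiment with ranges and metadata
gse = summarizeToGene(se) # optionally summarize to the gene level
```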


&lt;/div&gt;
&lt;div id=&#34;ularcirc&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Ularcirc&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.8/Ularcirc&#34;&gt;Ularcirc&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Ularcirc reads in STAR aligned splice junction files and provides visualisation and analysis tools for splicing analysis. Users can assess backsplice junctions and forward canonical junctions.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We are interested in exploring its visualization tools and in finding out whether they are specific to STAR output files or whether this package works with a wider set of aligners.&lt;/p&gt;
&lt;p&gt;We also appreciate the name of this package!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;wrapping-up&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Wrapping up&lt;/h3&gt;
&lt;p&gt;We liked how many of the new software packages emphasized visualization! The package with the best name was &lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.8/COCOA&#34;&gt;COCOA&lt;/a&gt;&lt;/em&gt; hehe.&lt;/p&gt;
&lt;p&gt;We hope that you are as excited as we are about trying out the new &lt;a href=&#34;https://www.bioconductor.org/news/bioc_3_8_release/&#34;&gt;Bioconductor 3.8 packages&lt;/a&gt;! If we implement any of these packages into our analysis routine we &lt;em&gt;want&lt;/em&gt; to come back and write a blog post about them&lt;a href=&#34;#fn3&#34; class=&#34;footnoteRef&#34; id=&#34;fnref3&#34;&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;acknowledgments&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Acknowledgments&lt;/h3&gt;
&lt;p&gt;This blog post was made possible thanks to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://bioconductor.org/packages/3.8/BiocStyle&#34;&gt;BiocStyle&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Oles_2018&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/Bioconductor/BiocStyle&#39;&gt;Oleś, Morgan, and Huber, 2018&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Xie_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/rstudio/blogdown&#39;&gt;Xie, Hill, and Thomas, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;knitcitations&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Boettiger_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=knitcitations&#39;&gt;Boettiger, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=sessioninfo&#34;&gt;sessioninfo&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Csardi_2018&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/r-lib/sessioninfo#readme&#39;&gt;Csárdi, core, Wickham, Chang, et al., 2018&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;and all the Bioconductor package developers and maintainers!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Boettiger_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Boettiger_2017&#34;&gt;[1]&lt;/a&gt;&lt;cite&gt; C. Boettiger. &lt;em&gt;knitcitations: Citations for ‘Knitr’ Markdown Files&lt;/em&gt;. R package version 1.0.8. 2017. URL: &lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;https://CRAN.R-project.org/package=knitcitations&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Csardi_2018&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Csardi_2018&#34;&gt;[2]&lt;/a&gt;&lt;cite&gt; G. Csárdi, R. core, H. Wickham, W. Chang, et al. &lt;em&gt;sessioninfo: R Session Information&lt;/em&gt;. R package version 1.1.0.9000. 2018. URL: &lt;a href=&#34;https://github.com/r-lib/sessioninfo#readme&#34;&gt;https://github.com/r-lib/sessioninfo#readme&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Oles_2018&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Oles_2018&#34;&gt;[3]&lt;/a&gt;&lt;cite&gt; A. Oleś, M. Morgan and W. Huber. &lt;em&gt;BiocStyle: Standard styles for vignettes and other Bioconductor documents&lt;/em&gt;. R package version 2.10.0. 2018. URL: &lt;a href=&#34;https://github.com/Bioconductor/BiocStyle&#34;&gt;https://github.com/Bioconductor/BiocStyle&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Xie_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Xie_2017&#34;&gt;[4]&lt;/a&gt;&lt;cite&gt; Y. Xie, A. P. Hill and A. Thomas. &lt;em&gt;blogdown: Creating Websites with R Markdown&lt;/em&gt;. ISBN 978-0815363729. Boca Raton, Florida: Chapman and Hall/CRC, 2017. URL: &lt;a href=&#34;https://github.com/rstudio/blogdown&#34;&gt;https://github.com/rstudio/blogdown&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value                                      
##  version  R version 3.5.1 Patched (2018-10-14 r75439)
##  os       macOS Mojave 10.14.1                       
##  system   x86_64, darwin15.6.0                       
##  ui       X11                                        
##  language (EN)                                       
##  collate  en_US.UTF-8                                
##  ctype    en_US.UTF-8                                
##  tz       America/New_York                           
##  date     2018-11-02                                 
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package       * version    date       lib source                            
##  assertthat      0.2.0      2017-04-11 [1] CRAN (R 3.5.0)                    
##  backports       1.1.2      2017-12-13 [1] CRAN (R 3.5.0)                    
##  bibtex          0.4.2      2017-06-30 [1] CRAN (R 3.5.0)                    
##  BiocManager     1.30.3     2018-10-10 [1] CRAN (R 3.5.0)                    
##  BiocStyle     * 2.10.0     2018-10-30 [1] Bioconductor                      
##  blogdown        0.9        2018-10-23 [1] CRAN (R 3.5.0)                    
##  bookdown        0.7        2018-02-18 [1] CRAN (R 3.5.0)                    
##  cli             1.0.1      2018-09-25 [1] CRAN (R 3.5.0)                    
##  colorout      * 1.2-0      2018-05-03 [1] Github (jalvesaq/colorout@c42088d)
##  crayon          1.3.4      2017-09-16 [1] CRAN (R 3.5.0)                    
##  digest          0.6.18     2018-10-10 [1] CRAN (R 3.5.0)                    
##  evaluate        0.12       2018-10-09 [1] CRAN (R 3.5.0)                    
##  htmltools       0.3.6      2017-04-28 [1] CRAN (R 3.5.0)                    
##  httr            1.3.1      2017-08-20 [1] CRAN (R 3.5.0)                    
##  jsonlite        1.5        2017-06-01 [1] CRAN (R 3.5.0)                    
##  knitcitations * 1.0.8      2017-07-04 [1] CRAN (R 3.5.0)                    
##  knitr           1.20       2018-02-20 [1] CRAN (R 3.5.0)                    
##  lubridate       1.7.4      2018-04-11 [1] CRAN (R 3.5.0)                    
##  magrittr        1.5        2014-11-22 [1] CRAN (R 3.5.0)                    
##  plyr            1.8.4      2016-06-08 [1] CRAN (R 3.5.0)                    
##  R6              2.3.0      2018-10-04 [1] CRAN (R 3.5.0)                    
##  Rcpp            0.12.19    2018-10-01 [1] CRAN (R 3.5.1)                    
##  RefManageR      1.2.0      2018-04-25 [1] CRAN (R 3.5.0)                    
##  rmarkdown       1.10       2018-06-11 [1] CRAN (R 3.5.0)                    
##  rprojroot       1.3-2      2018-01-03 [1] CRAN (R 3.5.0)                    
##  sessioninfo   * 1.1.0.9000 2018-10-02 [1] Github (r-lib/sessioninfo@4f91fad)
##  stringi         1.2.4      2018-07-20 [1] CRAN (R 3.5.0)                    
##  stringr         1.3.1      2018-05-10 [1] CRAN (R 3.5.0)                    
##  withr           2.1.2      2018-03-15 [1] CRAN (R 3.5.0)                    
##  xfun            0.4        2018-10-23 [1] CRAN (R 3.5.0)                    
##  xml2            1.2.0      2018-01-24 [1] CRAN (R 3.5.0)                    
##  yaml            2.2.0      2018-07-25 [1] CRAN (R 3.5.0)                    
## 
## [1] /Library/Frameworks/R.framework/Versions/3.5devel/Resources/library&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;Soon it’ll be in the hundreds! Wow!&lt;a href=&#34;#fnref1&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn2&#34;&gt;&lt;p&gt;We’ll see if that’s true hehe.&lt;a href=&#34;#fnref2&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn3&#34;&gt;&lt;p&gt;Time permitting :P&lt;a href=&#34;#fnref3&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>“Demystifying Data Science” remote notes</title>
      <link>http://LieberInstitute.github.io/rstatsclub/2018/10/24/demystifying-data-science-remote-notes/</link>
      <pubDate>Wed, 24 Oct 2018 00:00:00 +0000</pubDate>
      <guid>http://LieberInstitute.github.io/rstatsclub/2018/10/24/demystifying-data-science-remote-notes/</guid>
      <description>


&lt;p&gt;To carry on our momentum from a few weeks ago from our &lt;a href=&#34;http://research.libd.org/rstatsclub/2018/07/13/libd-rstats-club-remote-user-2018-notes&#34;&gt;useR!2018 remote notes blog post&lt;/a&gt;, this time we will be summarizing the &lt;a href=&#34;https://www.thisismetis.com/demystifying-data-science&#34;&gt;Demystifying Data Science 2018&lt;/a&gt; conference, which you can register for free. We are just following &lt;a href=&#34;https://twitter.com/drob&#34;&gt;David Robinson’s&lt;/a&gt; advice to blog all the time!&lt;/p&gt;

&lt;div id=&#34;conference-overview&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Conference overview&lt;/h3&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-10-24-demystifying-data-science-remote-notes_files/2018-10-20%2018.40.37.jpg&#34; width=&#34;600&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;We got interested in this conference&lt;a href=&#34;#fn1&#34; class=&#34;footnoteRef&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; thanks to tweets like these, which highlight that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;data scientists are young!&lt;/li&gt;
&lt;li&gt;specialists are more in demand!&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Hopefully you find these tweets interesting as well. We can find more about the conference on Twitter using the &lt;a href=&#34;https://twitter.com/search?q=%23DemystifyDS&amp;amp;src=typd&#34;&gt;DemystifyDS&lt;/a&gt; hashtag, which also covers previous conferences. We see that the official event account &lt;a href=&#34;https://twitter.com/thisismetis&#34;&gt;thisismetis&lt;/a&gt; really went all out with branded summary tweets! You can find recordings of all the talks, and there were many interesting titles. So we decided to spend two sessions on them and watched four full talks.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://youtu.be/NGMFq5MQcOI&#34;&gt;Navigating the Maze of the Data Science Job Hunt&lt;/a&gt; by &lt;a href=&#34;https://www.linkedin.com/in/markmeloon/&#34;&gt;Mark Meloon&lt;/a&gt;, Data Scientist, ServiceNow.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://youtu.be/M_dc-XzApGA&#34;&gt;How to Get a Foothold in the Field of Data Science&lt;/a&gt; by &lt;a href=&#34;https://www.linkedin.com/in/brohrer/&#34;&gt;Brandon Rohrer&lt;/a&gt;, Data Scientist, Facebook.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://youtu.be/KI-1GA5Zotc&#34;&gt;Data Visualization: How to Overcome Common Challenges&lt;/a&gt; by &lt;a href=&#34;https://www.linkedin.com/in/kate-strachnyi-data/&#34;&gt;Kate Strachnyi&lt;/a&gt;, Manager and Data Visualization Specialist, Deloitte.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://youtu.be/nF315csYx3U&#34;&gt;The Art &amp;amp; Science of Creating an Actionable Data Story&lt;/a&gt; by &lt;a href=&#34;https://www.linkedin.com/in/micoyuk/&#34;&gt;Mico Yuk&lt;/a&gt;, Chief Executive Officer, BI-Brainz Group | Author, Data Visualization for Dummies &amp;amp; More.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;talks-summaries&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Talks summaries&lt;/h3&gt;
&lt;div id=&#34;mark-meloon-markmeloon&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Mark Meloon &lt;a href=&#34;https://twitter.com/MarkMeloon&#34;&gt;&lt;code&gt;MarkMeloon&lt;/code&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;In the &lt;a href=&#34;https://youtu.be/NGMFq5MQcOI&#34;&gt;first talk&lt;/a&gt;, by &lt;a href=&#34;https://www.linkedin.com/in/markmeloon/&#34;&gt;Mark Meloon&lt;/a&gt;, we learned about the power of &lt;a href=&#34;https://www.linkedin.com/&#34;&gt;LinkedIn&lt;/a&gt; for networking and finding your next job. He suggested posting regularly on LinkedIn, as your feed will show up more on others’, allowing you to connect with more people. If you write about content shared by someone you especially admire or hope to work for, you are more likely to catch their attention. It’s best not to ask people directly for a job, but to contact them first to discuss their work or to ask for advice. He also suggested adding key data analysis techniques to your profile, and describing those techniques with specificity rather than using more vague terms.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;brandon-rohrer-_brohrer_&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Brandon Rohrer &lt;a href=&#34;https://twitter.com/_brohrer_&#34;&gt;&lt;code&gt;_brohrer_&lt;/code&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;In the &lt;a href=&#34;https://youtu.be/M_dc-XzApGA&#34;&gt;second talk&lt;/a&gt;, by &lt;a href=&#34;https://www.linkedin.com/in/brohrer/&#34;&gt;Brandon Rohrer&lt;/a&gt;, we learned about the different data science careers that are possible.&lt;/p&gt;
&lt;p&gt;The major fields are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data Analysis - statistics and interpretation&lt;/li&gt;
&lt;li&gt;Data Modeling - machine learning, prediction&lt;/li&gt;
&lt;li&gt;Data Engineering - automation, databases, programming&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The major roles/archetypes are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Generalist - decent at all three fields&lt;/li&gt;
&lt;li&gt;Detective - master of analysis&lt;/li&gt;
&lt;li&gt;Oracle - master of modeling&lt;/li&gt;
&lt;li&gt;Maker - master of engineering&lt;/li&gt;
&lt;li&gt;Unicorn - master of all!!!&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;He ended by mentioning that job postings using the term “data science” often vary widely. He recommends ignoring the posted job titles, de-emphasizing the specific tools listed, and instead focusing on the &lt;em&gt;skills&lt;/em&gt; being asked for to get a real sense of the job and how you would perform in it.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;kate-strachnyi-storybydata&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Kate Strachnyi &lt;a href=&#34;https://twitter.com/StorybyData&#34;&gt;&lt;code&gt;StorybyData&lt;/code&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;In the &lt;a href=&#34;https://youtu.be/KI-1GA5Zotc&#34;&gt;third talk&lt;/a&gt; by &lt;a href=&#34;https://www.linkedin.com/in/kate-strachnyi-data/&#34;&gt;Kate Strachnyi&lt;/a&gt; we learned about how to overcome challenges in data visualization. She described data visualizations as “Information Maps” that should ideally be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Informative&lt;/li&gt;
&lt;li&gt;Efficient&lt;/li&gt;
&lt;li&gt;Appealing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Common issues were:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Wrong chart choice - some charts will be much more effective&lt;/li&gt;
&lt;li&gt;Improper use of color - use to tell a story in a useful way - not just to be decorative&lt;/li&gt;
&lt;li&gt;Information overload - don’t try to do too much at once - loses impact&lt;/li&gt;
&lt;li&gt;Clutter - leave out the nonessential&lt;/li&gt;
&lt;li&gt;Not speaking the same language - know your audience (jargon/lingo)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;She also noted that we should be careful about color schemes, and pointed out that there are websites for checking how your figures would appear to people with colorblindness.&lt;/p&gt;
&lt;p&gt;She mostly uses Tableau in her work and suggested that it makes a nice free option for data visualization.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;mico-yuk-micoyuk&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Mico Yuk &lt;a href=&#34;https://twitter.com/micoyuk&#34;&gt;&lt;code&gt;micoyuk&lt;/code&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;In the &lt;a href=&#34;https://youtu.be/nF315csYx3U&#34;&gt;fourth and last talk&lt;/a&gt; by &lt;a href=&#34;https://www.linkedin.com/in/micoyuk/&#34;&gt;Mico Yuk&lt;/a&gt; we learned about storyboards and about remembering that our data analyses are always trying to tell a story about the data. She pointed out that the human mind is wired visually: we retain about 80% of what we see, 20% of what we read, and 10% of what we hear. She suggested that we create &lt;a href=&#34;https://en.wikipedia.org/wiki/SMART_criteria&#34;&gt;SMART goals&lt;/a&gt; (she credits Peter Drucker) to make sure that our work is driven efficiently in the correct direction. She suggested that communicating our work in a SMART goal-based framework would concisely and clearly convey the purpose and results of that work.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;our-impressions&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Our impressions&lt;/h3&gt;
&lt;p&gt;Given how varied our reactions were, we thought it would be most useful to share our individual impressions. Without further ado, here they are.&lt;/p&gt;
&lt;div id=&#34;p1&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;P1&lt;/h4&gt;
&lt;p&gt;I found &lt;a href=&#34;https://www.linkedin.com/in/markmeloon/&#34;&gt;Mark Meloon&lt;/a&gt;’s talk very useful. I have actually started posting more regularly on my own LinkedIn account and it has indeed captured more attention from others. In fact, I have even received emails from companies interested in hiring someone with my expertise. &lt;a href=&#34;https://www.linkedin.com/in/brohrer/&#34;&gt;Brandon Rohrer&lt;/a&gt; clarified some trends that I had noticed about data science. I identify with the “Detective” role and I see that while I may aspire at times (unsuccessfully) to be a “Generalist - or someday a Unicorn”, my experience as a Detective is very worthwhile as well. I love data visualization and I loved &lt;a href=&#34;https://www.linkedin.com/in/kate-strachnyi-data/&#34;&gt;Kate Strachnyi&lt;/a&gt;’s talk. I found her tips to be very clear reminders for how to continue with my own visualizations. The talk by &lt;a href=&#34;https://www.linkedin.com/in/micoyuk/&#34;&gt;Mico Yuk&lt;/a&gt; was a good reminder to keep overall goals in mind as you work and to regularly take a step back and assess if your work is really proceeding in the direction and at the rate that you planned.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;p2&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;P2&lt;/h4&gt;
&lt;p&gt;&lt;a href=&#34;https://www.linkedin.com/in/markmeloon/&#34;&gt;Mark Meloon’s&lt;/a&gt; talk emphasized the use of LinkedIn for networking and job hunting. He interviews job candidates for his company so his viewpoint was a direct reflection of someone who uses the website to find and/or assess job applicants. I liked that he gave both good and bad examples of actual profiles and messages he’s seen on LinkedIn. He also noted that, to get a foot in the door of a job posting, you don’t need to directly know the hiring manager, but reaching out to anyone you know in the company, even if it’s a second- or third-level connection (i.e. friend of a friend), is better than nothing, as long as you do it right. I do wish he had spoken about other social media platforms, such as Twitter, and how they compare to LinkedIn for networking.&lt;/p&gt;
&lt;p&gt;I found the breakdown of skills and job types by &lt;a href=&#34;https://www.linkedin.com/in/brohrer/&#34;&gt;Brandon Rohrer&lt;/a&gt; to be really instructive. It made me reflect on my own interests and skills in a broken down way, and I think it will help to have this framework for both future job hunts and interviews. I particularly like that he emphasized it’s okay/normal to not be great at &lt;em&gt;everything&lt;/em&gt; related to data science - it’s a broad field - and that people with a narrower set of expertise are still needed and valuable for specific jobs. His talk also gave me some ideas of skills I may be able to work on and add to my portfolio to round out my skill-set. I would recommend this talk to anyone in the data science or analysis fields that is looking for clarity or definition in their current job or career path!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;p3&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;P3&lt;/h4&gt;
&lt;p&gt;&lt;a href=&#34;https://www.linkedin.com/in/kate-strachnyi-data/&#34;&gt;Kate Strachnyi&lt;/a&gt;’s talk was a great reminder of the importance of keeping your audience in mind when presenting information and making sure that visualizations are not just accurate but also easily understood. Her list of common issues was a helpful summary of guidelines I’ve heard before, and I appreciated the examples she used. In particular, I think I often run into the challenge of “information overload” when I present informally to others – I need to remember that it’s not enough for the information to be there, it also needs to be arranged in a way that lets people understand it quickly.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.linkedin.com/in/micoyuk/&#34;&gt;Mico Yuk&lt;/a&gt;’s talk was probably more applicable to someone working in a corporate field rather than an academic one, but the main idea of framing data as a story and keeping the goal in mind was still relevant to me. Some of the suggestions, like asking the “right questions” of your user, could easily be reworked for research (even if the user is just me). I haven’t worked with a storyboard before, but it would be interesting to see if that approach could also apply to planning out analyses for a research paper – the goal might be the question we’re asking, the KPIs the metrics we’re using to answer that question, the trends the conclusions we can draw, and the actions the next direction of analysis. The translation from business to academic research probably needs some tweaking, but I might try this approach on a future project to help with organization and keeping the bigger story in mind.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;p4&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;P4&lt;/h4&gt;
&lt;p&gt;&lt;a href=&#34;https://www.linkedin.com/in/markmeloon/&#34;&gt;Mark Meloon&lt;/a&gt;’s talk reminded me that many use &lt;a href=&#34;https://www.linkedin.com/&#34;&gt;LinkedIn&lt;/a&gt; for networking, which hasn’t been that common in my experience in academia. This is something I would need to keep in mind for advising students in the future that are either unsure of staying in academia or want to go to industry. I do brush up my profile once in a while, and parts of Mark’s advice apply also to CVs (writing them and sending them via email): basically, be genuine and respectful of others.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.linkedin.com/in/brohrer/&#34;&gt;Brandon Rohrer&lt;/a&gt; verbalized distinctions between data science roles that I had either heard of before or had some intuition about, but hadn’t actually seen as clearly defined as Brandon laid them out. I was also quite curious about everyone’s reaction to his talk and how each of us labelled ourselves. For example, maybe &lt;em&gt;X&lt;/em&gt; thought &lt;em&gt;Z&lt;/em&gt; was a &lt;em&gt;unicorn&lt;/em&gt;, but &lt;em&gt;Z&lt;/em&gt; perceived themselves as a &lt;em&gt;beginner&lt;/em&gt;. In my case, I think it’s probably too ambitious to aim for the unicorn level. I’m simply aiming to get to (or am at) a level where I can understand most of the terms and conversations, and then go back and research a bit more if needed as preparation for a follow-up meeting. I guess that makes me a generalist.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.linkedin.com/in/kate-strachnyi-data/&#34;&gt;Kate Strachnyi&lt;/a&gt;’s key points cover topics that I’ve heard before and loosely follow. I think that her audience is different from mine, as she seems to create visualizations that are used in many company presentations. I’m frequently under pressure to get a simple version of a plot done where we can see the trend in the data, and I only work on polishing a few selected plots that get highlighted in a research paper. Still, I could probably spend a bit more time thinking about plot design and colors before I make the next one. For that, I would like to learn more about the &lt;code&gt;paletteer&lt;/code&gt; R package:&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;ICYMI, 🎨 With more palettes than a tweet could possibly contain…&lt;br&gt;&amp;quot;paletteer: Collection of most color palettes in a single R 📦&amp;quot; 👨‍🎨 &lt;a href=&#34;https://twitter.com/Emil_Hvitfeldt?ref_src=twsrc%5Etfw&#34;&gt;@Emil_Hvitfeldt&lt;/a&gt; &lt;a href=&#34;https://t.co/7kKSyohQN4&#34;&gt;https://t.co/7kKSyohQN4&lt;/a&gt; &lt;a href=&#34;https://twitter.com/hashtag/rstats?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#rstats&lt;/a&gt; &lt;a href=&#34;https://t.co/zibFhW03EU&#34;&gt;pic.twitter.com/zibFhW03EU&lt;/a&gt;&lt;/p&gt;&amp;mdash; Mara Averick (@dataandme) &lt;a href=&#34;https://twitter.com/dataandme/status/1021828886160654336?ref_src=twsrc%5Etfw&#34;&gt;July 24, 2018&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;
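&lt;p&gt;As a quick sketch of what that might look like (the palette name here is just an illustrative choice, and the exact &lt;code&gt;paletteer&lt;/code&gt; interface may differ across package versions):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Sketch only: discrete palettes are addressed as &amp;quot;package::palette&amp;quot;
library(&amp;#39;paletteer&amp;#39;)
paletteer_d(&amp;quot;RColorBrewer::Set1&amp;quot;)&lt;/code&gt;&lt;/pre&gt;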


&lt;p&gt;&lt;a href=&#34;https://www.linkedin.com/in/micoyuk/&#34;&gt;Mico Yuk&lt;/a&gt; talked about SMART goals. Hmm… I don’t remember what that stood for, so I clearly would need to re-watch her talk. After skimming through it again, I guess I can only say that it was hard for me to relate to her talk because I haven’t been on a project that involved all the planning steps she talked about. While it wasn’t for me, it might be useful to you, so give it a try!&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;wrapping-up&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Wrapping up&lt;/h3&gt;
&lt;p&gt;Thanks for getting this far. We are curious to hear what your own impressions were of these and other talks from &lt;a href=&#34;https://www.thisismetis.com/demystifying-data-science&#34;&gt;Demystifying Data Science 2018&lt;/a&gt;: they have &lt;a href=&#34;https://www.youtube.com/channel/UCpbU53RP134D9qy9GzKBAXA/videos&#34;&gt;28 recorded talks&lt;/a&gt; in total! We also hope that you enjoyed reading about our different perspectives.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;acknowledgments&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Acknowledgments&lt;/h3&gt;
&lt;p&gt;We are grateful to everyone who tweeted about the conference and shared their materials online! We are also happy that &lt;em&gt;Metis&lt;/em&gt; got interested in our summary blog post.&lt;/p&gt;
&lt;p&gt;This blog post was made possible thanks to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Xie_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/rstudio/blogdown&#39;&gt;Xie, Hill, and Thomas, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;knitcitations&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Boettiger_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=knitcitations&#39;&gt;Boettiger, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=sessioninfo&#34;&gt;sessioninfo&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Csardi_2018&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/r-lib/sessioninfo#readme&#39;&gt;Csárdi, core, Wickham, Chang, et al., 2018&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Boettiger_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Boettiger_2017&#34;&gt;[1]&lt;/a&gt;&lt;cite&gt; C. Boettiger. &lt;em&gt;knitcitations: Citations for ‘Knitr’ Markdown Files&lt;/em&gt;. R package version 1.0.8. 2017. URL: &lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;https://CRAN.R-project.org/package=knitcitations&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Csardi_2018&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Csardi_2018&#34;&gt;[2]&lt;/a&gt;&lt;cite&gt; G. Csárdi, R. core, H. Wickham, W. Chang, et al. &lt;em&gt;sessioninfo: R Session Information&lt;/em&gt;. R package version 1.1.0.9000. 2018. URL: &lt;a href=&#34;https://github.com/r-lib/sessioninfo#readme&#34;&gt;https://github.com/r-lib/sessioninfo#readme&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Xie_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Xie_2017&#34;&gt;[3]&lt;/a&gt;&lt;cite&gt; Y. Xie, A. P. Hill and A. Thomas. &lt;em&gt;blogdown: Creating Websites with R Markdown&lt;/em&gt;. ISBN 978-0815363729. Boca Raton, Florida: Chapman and Hall/CRC, 2017. URL: &lt;a href=&#34;https://github.com/rstudio/blogdown&#34;&gt;https://github.com/rstudio/blogdown&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value                                      
##  version  R version 3.5.1 Patched (2018-10-14 r75439)
##  os       macOS High Sierra 10.13.6                  
##  system   x86_64, darwin15.6.0                       
##  ui       X11                                        
##  language (EN)                                       
##  collate  en_US.UTF-8                                
##  ctype    en_US.UTF-8                                
##  tz       America/New_York                           
##  date     2018-10-24                                 
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package       * version    date       lib source                            
##  assertthat      0.2.0      2017-04-11 [1] CRAN (R 3.5.0)                    
##  backports       1.1.2      2017-12-13 [1] CRAN (R 3.5.0)                    
##  bibtex          0.4.2      2017-06-30 [1] CRAN (R 3.5.0)                    
##  BiocStyle     * 2.8.2      2018-05-30 [1] Bioconductor                      
##  blogdown        0.8        2018-07-15 [1] CRAN (R 3.5.0)                    
##  bookdown        0.7        2018-02-18 [1] CRAN (R 3.5.0)                    
##  cli             1.0.1      2018-09-25 [1] CRAN (R 3.5.0)                    
##  colorout      * 1.2-0      2018-05-03 [1] Github (jalvesaq/colorout@c42088d)
##  crayon          1.3.4      2017-09-16 [1] CRAN (R 3.5.0)                    
##  digest          0.6.18     2018-10-10 [1] CRAN (R 3.5.0)                    
##  evaluate        0.12       2018-10-09 [1] CRAN (R 3.5.0)                    
##  htmltools       0.3.6      2017-04-28 [1] CRAN (R 3.5.0)                    
##  httr            1.3.1      2017-08-20 [1] CRAN (R 3.5.0)                    
##  jsonlite        1.5        2017-06-01 [1] CRAN (R 3.5.0)                    
##  knitcitations * 1.0.8      2017-07-04 [1] CRAN (R 3.5.0)                    
##  knitr           1.20       2018-02-20 [1] CRAN (R 3.5.0)                    
##  lubridate       1.7.4      2018-04-11 [1] CRAN (R 3.5.0)                    
##  magrittr        1.5        2014-11-22 [1] CRAN (R 3.5.0)                    
##  plyr            1.8.4      2016-06-08 [1] CRAN (R 3.5.0)                    
##  R6              2.3.0      2018-10-04 [1] CRAN (R 3.5.0)                    
##  Rcpp            0.12.19    2018-10-01 [1] CRAN (R 3.5.1)                    
##  RefManageR      1.2.0      2018-04-25 [1] CRAN (R 3.5.0)                    
##  rmarkdown       1.10       2018-06-11 [1] CRAN (R 3.5.0)                    
##  rprojroot       1.3-2      2018-01-03 [1] CRAN (R 3.5.0)                    
##  sessioninfo   * 1.1.0.9000 2018-10-02 [1] Github (r-lib/sessioninfo@4f91fad)
##  stringi         1.2.4      2018-07-20 [1] CRAN (R 3.5.0)                    
##  stringr         1.3.1      2018-05-10 [1] CRAN (R 3.5.0)                    
##  withr           2.1.2      2018-03-15 [1] CRAN (R 3.5.0)                    
##  xfun            0.3        2018-07-06 [1] CRAN (R 3.5.0)                    
##  xml2            1.2.0      2018-01-24 [1] CRAN (R 3.5.0)                    
##  yaml            2.2.0      2018-07-25 [1] CRAN (R 3.5.0)                    
## 
## [1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;This conference covered a large spectrum of data science topics, hence the picture for the post!&lt;a href=&#34;#fnref1&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Hacking our way through UpSetR</title>
      <link>http://LieberInstitute.github.io/rstatsclub/2018/07/27/hacking-our-way-through-upsetr/</link>
      <pubDate>Fri, 27 Jul 2018 00:00:00 +0000</pubDate>
      <guid>http://LieberInstitute.github.io/rstatsclub/2018/07/27/hacking-our-way-through-upsetr/</guid>
      <description>


&lt;p&gt;For our club meeting today we were going to summarize the &lt;a href=&#34;https://www.thisismetis.com/demystifying-data-science&#34;&gt;Demystifying Data Science&lt;/a&gt; conference, but we forgot that the videos had not been released yet.&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;Oops, we&amp;#39;ll have to postpone our blog post. We didn&amp;#39;t read the fine print that talk recordings will be available sometime next week. Sorry about that!&lt;/p&gt;&amp;mdash; LIBD rstats club (@LIBDrstats) &lt;a href=&#34;https://twitter.com/LIBDrstats/status/1022862435869450240?ref_src=twsrc%5Etfw&#34;&gt;July 27, 2018&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;


&lt;p&gt;So we adjusted plans and decided to continue our work on the &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=UpSetR&#34;&gt;UpSetR&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Gehlenborg_2016&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;http://github.com/hms-dbmi/UpSetR&#39;&gt;Gehlenborg, 2016&lt;/a&gt;) package by &lt;a href=&#34;https://twitter.com/ngehlenborg&#34;&gt;Nils Gehlenborg&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;Yesterday we discussed various options for visualizing large numbers of overlapping groups. We explored uses of the &lt;a href=&#34;https://twitter.com/hashtag/venneuler?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#venneuler&lt;/a&gt; package from Lee Wilkinson and Simon Urbanek and the &lt;a href=&#34;https://twitter.com/hashtag/UpSetR?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#UpSetR&lt;/a&gt; package from Jake Conway, Alexander Lex, and &lt;a href=&#34;https://twitter.com/ngehlenborg?ref_src=twsrc%5Etfw&#34;&gt;@ngehlenborg&lt;/a&gt;. &lt;a href=&#34;https://twitter.com/hashtag/rstats?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#rstats&lt;/a&gt; &lt;a href=&#34;https://t.co/k55YfihmiP&#34;&gt;pic.twitter.com/k55YfihmiP&lt;/a&gt;&lt;/p&gt;&amp;mdash; LIBD rstats club (@LIBDrstats) &lt;a href=&#34;https://twitter.com/LIBDrstats/status/1007992479159812101?ref_src=twsrc%5Etfw&#34;&gt;June 16, 2018&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;what-you-can-currently-do&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;What you can currently do&lt;/h3&gt;
&lt;p&gt;First, let’s install the version we used for this post:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;devtools::install_github(&amp;#39;hms-dbmi/UpSetR@fe2812c&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Our ultimate goal is to submit a pull request that enables &lt;code&gt;UpSetR&lt;/code&gt; users to specify a color per row for the dots, instead of coloring the rows themselves. We had already identified an example that we could work with.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;#39;UpSetR&amp;#39;)
movies &amp;lt;- read.csv( system.file(&amp;quot;extdata&amp;quot;, &amp;quot;movies.csv&amp;quot;, package = &amp;quot;UpSetR&amp;quot;), 
                    header=T, sep=&amp;quot;;&amp;quot; )

require(ggplot2); require(plyr); require(gridExtra); require(grid);&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Loading required package: ggplot2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Loading required package: plyr&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Loading required package: gridExtra&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Loading required package: grid&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;upset(movies, 
      sets = c(&amp;quot;Action&amp;quot;, &amp;quot;Comedy&amp;quot;, &amp;quot;Drama&amp;quot;), 
      order.by=&amp;quot;degree&amp;quot;, matrix.color=&amp;quot;blue&amp;quot;, point.size=5,
      sets.bar.color=c(&amp;quot;maroon&amp;quot;,&amp;quot;blue&amp;quot;,&amp;quot;orange&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-07-27-hacking-our-way-through-upsetr_files/figure-html/unnamed-chunk-2-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We also explored the &lt;a href=&#34;https://cran.rstudio.com/web/packages/UpSetR/vignettes/set.metadata.plots.html&#34;&gt;set metadata vignette&lt;/a&gt; that includes examples such as the following one.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(20180727)

## Create the metadata object first
sets &amp;lt;- names(movies[3:19])
avgRottenTomatoesScore &amp;lt;- round(runif(17, min = 0, max = 90))
metadata &amp;lt;- as.data.frame(cbind(sets, avgRottenTomatoesScore))
names(metadata) &amp;lt;- c(&amp;quot;sets&amp;quot;, &amp;quot;avgRottenTomatoesScore&amp;quot;)
metadata$avgRottenTomatoesScore &amp;lt;- as.numeric(as.character(metadata$avgRottenTomatoesScore))
Cities &amp;lt;- sample(c(&amp;quot;Boston&amp;quot;, &amp;quot;NYC&amp;quot;, &amp;quot;LA&amp;quot;), 17, replace = T)
metadata &amp;lt;- cbind(metadata, Cities)
metadata$Cities &amp;lt;- as.character(metadata$Cities)
metadata[which(metadata$sets %in% c(&amp;quot;Drama&amp;quot;, &amp;quot;Comedy&amp;quot;, &amp;quot;Action&amp;quot;, &amp;quot;Thriller&amp;quot;, 
    &amp;quot;Romance&amp;quot;)), ]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##        sets avgRottenTomatoesScore Cities
## 1    Action                     68 Boston
## 4    Comedy                     40    NYC
## 7     Drama                     48     LA
## 13  Romance                     77 Boston
## 15 Thriller                     19    NYC&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;accepted &amp;lt;- round(runif(17, min = 0, max = 1))
metadata &amp;lt;- cbind(metadata, accepted)
metadata[which(metadata$sets %in% c(&amp;quot;Drama&amp;quot;, &amp;quot;Comedy&amp;quot;, &amp;quot;Action&amp;quot;, &amp;quot;Thriller&amp;quot;, 
    &amp;quot;Romance&amp;quot;)), ]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##        sets avgRottenTomatoesScore Cities accepted
## 1    Action                     68 Boston        0
## 4    Comedy                     40    NYC        1
## 7     Drama                     48     LA        0
## 13  Romance                     77 Boston        1
## 15 Thriller                     19    NYC        0&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Now make the plot
upset(movies, set.metadata = list(data = metadata, plots = list(list(type = &amp;quot;hist&amp;quot;, 
    column = &amp;quot;avgRottenTomatoesScore&amp;quot;, assign = 20), list(type = &amp;quot;matrix_rows&amp;quot;, 
    column = &amp;quot;Cities&amp;quot;, colors = c(Boston = &amp;quot;green&amp;quot;, NYC = &amp;quot;navy&amp;quot;, LA = &amp;quot;purple&amp;quot;), 
    alpha = 0.5))))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-07-27-hacking-our-way-through-upsetr_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;hacking-our-way&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Hacking our way&lt;/h3&gt;
&lt;p&gt;Using the &lt;code&gt;metadata&lt;/code&gt; looked complicated to us, and hopefully it was not necessary for what we were trying to accomplish. That is, we really wanted to change the colors of the circles in each row, not the rows themselves. So we found the GitHub repo with &lt;a href=&#34;https://github.com/hms-dbmi/UpSetR&#34;&gt;the code&lt;/a&gt;, plugged a laptop into a TV, and started exploring as a group. We went down the rabbit hole to see how the &lt;code&gt;matrix.color&lt;/code&gt; argument got used. To actually hack our way through, we downloaded the latest version of the code using &lt;code&gt;git&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git clone git@github.com:hms-dbmi/UpSetR.git
cd UpSetR&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we created the objects that match the default arguments of &lt;code&gt;upset()&lt;/code&gt; by finding and replacing commas with semicolons. Well, not all of the commas. Also, for arguments that specified a vector (mostly of 2 options), we chose the first one to match the default R behavior. This way we could execute them and have them in our session.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Default upset() arguments
nsets = 5; nintersects = 40; sets = NULL; keep.order = F; set.metadata = NULL; intersections = NULL;
matrix.color = &amp;quot;gray23&amp;quot;; main.bar.color = &amp;quot;gray23&amp;quot;; mainbar.y.label = &amp;quot;Intersection Size&amp;quot;; mainbar.y.max = NULL;
sets.bar.color = &amp;quot;gray23&amp;quot;; sets.x.label = &amp;quot;Set Size&amp;quot;; point.size = 2.2; line.size = 0.7;
mb.ratio = c(0.70,0.30); expression = NULL; att.pos = NULL; att.color = main.bar.color; order.by = &amp;#39;freq&amp;#39;;
decreasing = T; show.numbers = &amp;quot;yes&amp;quot;; number.angles = 0; group.by = &amp;quot;degree&amp;quot;;cutoff = NULL;
queries = NULL; query.legend = &amp;quot;none&amp;quot;; shade.color = &amp;quot;gray88&amp;quot;; shade.alpha = 0.25; matrix.dot.alpha =0.5;
empty.intersections = NULL; color.pal = 1; boxplot.summary = NULL; attribute.plots = NULL; scale.intersections = &amp;quot;identity&amp;quot;;
scale.sets = &amp;quot;identity&amp;quot;; text.scale = 1; set_size.angles = 0 ; set_size.show = FALSE &lt;/code&gt;&lt;/pre&gt;
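&lt;p&gt;As a side note, an alternative we didn’t try (just a sketch, and it needs care because some defaults like &lt;code&gt;att.color = main.bar.color&lt;/code&gt; are stored as unevaluated expressions) is to pull the defaults programmatically with &lt;code&gt;formals()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Sketch: inspect upset()&amp;#39;s argument defaults without manual editing
defaults &amp;lt;- formals(UpSetR::upset)
names(defaults)        ## all the argument names
defaults$matrix.color  ## &amp;quot;gray23&amp;quot;&lt;/code&gt;&lt;/pre&gt;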
&lt;p&gt;Next, we did the same (commas to semicolons) for the inputs of the first example.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Initial inputs on the first example
movies &amp;lt;- read.csv( system.file(&amp;quot;extdata&amp;quot;, &amp;quot;movies.csv&amp;quot;, package = &amp;quot;UpSetR&amp;quot;), 
                    header=T, sep=&amp;quot;;&amp;quot; )

## comma -&amp;gt; semicolon
data = movies; sets = c(&amp;quot;Action&amp;quot;, &amp;quot;Comedy&amp;quot;, &amp;quot;Drama&amp;quot;); 
      order.by=&amp;quot;degree&amp;quot;; matrix.color=&amp;quot;blue&amp;quot;; point.size=5;
      sets.bar.color=c(&amp;quot;maroon&amp;quot;,&amp;quot;blue&amp;quot;,&amp;quot;orange&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we were ready to start modifying some of the internal &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=UpSetR&#34;&gt;UpSetR&lt;/a&gt;&lt;/em&gt; (&lt;a href=&#39;http://github.com/hms-dbmi/UpSetR&#39;&gt;Gehlenborg, 2016&lt;/a&gt;) code.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;hacking-internals&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Hacking internals&lt;/h3&gt;
&lt;p&gt;The function &lt;code&gt;upset()&lt;/code&gt; is pretty long and uses many un-exported functions from the package itself. In order to test things quickly, we added &lt;code&gt;UpSetR:::&lt;/code&gt; calls before the un-exported functions. Here’s our modified version, where we added a piece of code to modify the &lt;code&gt;Matrix_layout&lt;/code&gt; object and add some colors.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Piece of code we introduced
for(i in 1:3) {
      j &amp;lt;- which(Matrix_layout$y == i &amp;amp; Matrix_layout$value == 1)
      if(length(j) &amp;gt; 0) Matrix_layout$color[j] &amp;lt;- c(&amp;quot;maroon&amp;quot;,&amp;quot;blue&amp;quot;,&amp;quot;orange&amp;quot;)[i]
  }&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Ok, here’s the full modified &lt;code&gt;upset()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Modified internal upset() code

startend &amp;lt;- UpSetR:::FindStartEnd(data)
  first.col &amp;lt;- startend[1]
  last.col &amp;lt;- startend[2]

  if(color.pal == 1){
    palette &amp;lt;- c(&amp;quot;#1F77B4&amp;quot;, &amp;quot;#FF7F0E&amp;quot;, &amp;quot;#2CA02C&amp;quot;, &amp;quot;#D62728&amp;quot;, &amp;quot;#9467BD&amp;quot;, &amp;quot;#8C564B&amp;quot;, &amp;quot;#E377C2&amp;quot;,
                 &amp;quot;#7F7F7F&amp;quot;, &amp;quot;#BCBD22&amp;quot;, &amp;quot;#17BECF&amp;quot;)
  } else{
    palette &amp;lt;- c(&amp;quot;#E69F00&amp;quot;, &amp;quot;#56B4E9&amp;quot;, &amp;quot;#009E73&amp;quot;, &amp;quot;#F0E442&amp;quot;, &amp;quot;#0072B2&amp;quot;, &amp;quot;#D55E00&amp;quot;,
                 &amp;quot;#CC79A7&amp;quot;)
  }

  if(is.null(intersections) == F){
    Set_names &amp;lt;- unique((unlist(intersections)))
    Sets_to_remove &amp;lt;- UpSetR:::Remove(data, first.col, last.col, Set_names)
    New_data &amp;lt;- UpSetR:::Wanted(data, Sets_to_remove)
    Num_of_set &amp;lt;- UpSetR:::Number_of_sets(Set_names)
    if(keep.order == F){
      Set_names &amp;lt;- UpSetR:::order_sets(New_data, Set_names)
    }
    All_Freqs &amp;lt;- UpSetR:::specific_intersections(data, first.col, last.col, intersections, order.by, group.by, decreasing,
                                        cutoff, main.bar.color, Set_names)
  } else if(is.null(intersections) == T){
    Set_names &amp;lt;- sets
    if(is.null(Set_names) == T || length(Set_names) == 0 ){
      Set_names &amp;lt;- UpSetR:::FindMostFreq(data, first.col, last.col, nsets)
    }
    Sets_to_remove &amp;lt;- UpSetR:::Remove(data, first.col, last.col, Set_names)
    New_data &amp;lt;- UpSetR:::Wanted(data, Sets_to_remove)
    Num_of_set &amp;lt;- UpSetR:::Number_of_sets(Set_names)
    if(keep.order == F){
    Set_names &amp;lt;- UpSetR:::order_sets(New_data, Set_names)
    }
    All_Freqs &amp;lt;- UpSetR:::Counter(New_data, Num_of_set, first.col, Set_names, nintersects, main.bar.color,
                         order.by, group.by, cutoff, empty.intersections, decreasing)
  }
  Matrix_setup &amp;lt;- UpSetR:::Create_matrix(All_Freqs)
  labels &amp;lt;- UpSetR:::Make_labels(Matrix_setup)
  #Chose NA to represent NULL case as result of NA being inserted when at least one contained both x and y
  #i.e. if one custom plot had both x and y, and others had only x, the y&amp;#39;s for the other plots were NA
  #if I decided to make the NULL case (all x and no y, or vice versa), there would have been alot more if/else statements
  #NA can be indexed so that we still get the non NA y aesthetics on correct plot. NULL cant be indexed.
  att.x &amp;lt;- c(); att.y &amp;lt;- c();
  if(is.null(attribute.plots) == F){
    for(i in seq_along(attribute.plots$plots)){
      if(length(attribute.plots$plots[[i]]$x) != 0){
        att.x[i] &amp;lt;- attribute.plots$plots[[i]]$x
      }
      else if(length(attribute.plots$plots[[i]]$x) == 0){
        att.x[i] &amp;lt;- NA
      }
      if(length(attribute.plots$plots[[i]]$y) != 0){
        att.y[i] &amp;lt;- attribute.plots$plots[[i]]$y
      }
      else if(length(attribute.plots$plots[[i]]$y) == 0){
        att.y[i] &amp;lt;- NA
      }
    }
  }

  BoxPlots &amp;lt;- NULL
  if(is.null(boxplot.summary) == F){
    BoxData &amp;lt;- UpSetR:::IntersectionBoxPlot(All_Freqs, New_data, first.col, Set_names)
    BoxPlots &amp;lt;- list()
    for(i in seq_along(boxplot.summary)){
      BoxPlots[[i]] &amp;lt;- UpSetR:::BoxPlotsPlot(BoxData, boxplot.summary[i], att.color)
    }
  }

  customAttDat &amp;lt;- NULL
  customQBar &amp;lt;- NULL
  Intersection &amp;lt;- NULL
  Element &amp;lt;- NULL
  legend &amp;lt;- NULL
  EBar_data &amp;lt;- NULL
  if(is.null(queries) == F){
    custom.queries &amp;lt;- UpSetR:::SeperateQueries(queries, 2, palette)
    customDat &amp;lt;- UpSetR:::customQueries(New_data, custom.queries, Set_names)
    legend &amp;lt;- UpSetR:::GuideGenerator(queries, palette)
    legend &amp;lt;- UpSetR:::Make_legend(legend)
    if(is.null(att.x) == F &amp;amp;&amp;amp; is.null(customDat) == F){
      customAttDat &amp;lt;- UpSetR:::CustomAttData(customDat, Set_names)
    }
    customQBar &amp;lt;- UpSetR:::customQueriesBar(customDat, Set_names, All_Freqs, custom.queries)
  }
  if(is.null(queries) == F){
    Intersection &amp;lt;- UpSetR:::SeperateQueries(queries, 1, palette)
    Matrix_col &amp;lt;- UpSetR:::intersects(QuerieInterData, Intersection, New_data, first.col, Num_of_set,
                             All_Freqs, expression, Set_names, palette)
    Element &amp;lt;- UpSetR:::SeperateQueries(queries, 1, palette)
    EBar_data &amp;lt;-UpSetR:::ElemBarDat(Element, New_data, first.col, expression, Set_names,palette, All_Freqs)
  } else{
    Matrix_col &amp;lt;- NULL
  }
  
  Matrix_layout &amp;lt;- UpSetR:::Create_layout(Matrix_setup, matrix.color, Matrix_col, matrix.dot.alpha)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As a little pause in &lt;code&gt;upset()&lt;/code&gt;, let’s check what &lt;code&gt;Matrix_layout&lt;/code&gt; actually looks like.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Matrix_layout&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    y x value  color alpha Intersection
## 1  1 1     1   blue   1.0         1yes
## 2  2 1     1   blue   1.0         1yes
## 3  3 1     1   blue   1.0         1yes
## 4  1 2     0 gray83   0.5          4No
## 5  2 2     1   blue   1.0         2yes
## 6  3 2     1   blue   1.0         2yes
## 7  1 3     1   blue   1.0         3yes
## 8  2 3     0 gray83   0.5          8No
## 9  3 3     1   blue   1.0         3yes
## 10 1 4     1   blue   1.0         4yes
## 11 2 4     1   blue   1.0         4yes
## 12 3 4     0 gray83   0.5         12No
## 13 1 5     0 gray83   0.5         13No
## 14 2 5     0 gray83   0.5         14No
## 15 3 5     1   blue   1.0         5yes
## 16 1 6     0 gray83   0.5         16No
## 17 2 6     1   blue   1.0         6yes
## 18 3 6     0 gray83   0.5         18No
## 19 1 7     1   blue   1.0         7yes
## 20 2 7     0 gray83   0.5         20No
## 21 3 7     0 gray83   0.5         21No&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We figured out that we had to change the colors of only the rows with &lt;code&gt;value = 1&lt;/code&gt;, and that &lt;code&gt;y&lt;/code&gt; was the row grouping variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## our modification
  for(i in 1:3) {
      j &amp;lt;- which(Matrix_layout$y == i &amp;amp; Matrix_layout$value == 1)
      if(length(j) &amp;gt; 0) Matrix_layout$color[j] &amp;lt;- c(&amp;quot;maroon&amp;quot;,&amp;quot;blue&amp;quot;,&amp;quot;orange&amp;quot;)[i]
  }&lt;/code&gt;&lt;/pre&gt;
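&lt;p&gt;For the eventual pull request, this could generalize into a small helper that takes one color per row. The following is only a sketch with our own hypothetical name, not an existing &lt;code&gt;UpSetR&lt;/code&gt; function:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Hypothetical helper: recolor the filled dots of each matrix row
matrix_row_colors &amp;lt;- function(Matrix_layout, colors) {
    for (i in seq_along(colors)) {
        j &amp;lt;- which(Matrix_layout$y == i &amp;amp; Matrix_layout$value == 1)
        if (length(j) &amp;gt; 0) Matrix_layout$color[j] &amp;lt;- colors[i]
    }
    Matrix_layout
}
## equivalent to our manual edit:
## Matrix_layout &amp;lt;- matrix_row_colors(Matrix_layout, c(&amp;quot;maroon&amp;quot;, &amp;quot;blue&amp;quot;, &amp;quot;orange&amp;quot;))&lt;/code&gt;&lt;/pre&gt;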
&lt;p&gt;Here’s our modified &lt;code&gt;Matrix_layout&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Matrix_layout&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    y x value  color alpha Intersection
## 1  1 1     1 maroon   1.0         1yes
## 2  2 1     1   blue   1.0         1yes
## 3  3 1     1 orange   1.0         1yes
## 4  1 2     0 gray83   0.5          4No
## 5  2 2     1   blue   1.0         2yes
## 6  3 2     1 orange   1.0         2yes
## 7  1 3     1 maroon   1.0         3yes
## 8  2 3     0 gray83   0.5          8No
## 9  3 3     1 orange   1.0         3yes
## 10 1 4     1 maroon   1.0         4yes
## 11 2 4     1   blue   1.0         4yes
## 12 3 4     0 gray83   0.5         12No
## 13 1 5     0 gray83   0.5         13No
## 14 2 5     0 gray83   0.5         14No
## 15 3 5     1 orange   1.0         5yes
## 16 1 6     0 gray83   0.5         16No
## 17 2 6     1   blue   1.0         6yes
## 18 3 6     0 gray83   0.5         18No
## 19 1 7     1 maroon   1.0         7yes
## 20 2 7     0 gray83   0.5         20No
## 21 3 7     0 gray83   0.5         21No&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Ok, let’s continue with the rest of &lt;code&gt;upset()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## continuing with upset()
  
  Set_sizes &amp;lt;- UpSetR:::FindSetFreqs(New_data, first.col, Num_of_set, Set_names, keep.order)
  Bar_Q &amp;lt;- NULL
  if(is.null(queries) == F){
    Bar_Q &amp;lt;- UpSetR:::intersects(QuerieInterBar, Intersection, New_data, first.col, Num_of_set, All_Freqs, expression, Set_names, palette)
  }
  QInter_att_data &amp;lt;- NULL
  QElem_att_data &amp;lt;- NULL
  if((is.null(queries) == F) &amp;amp; (is.null(att.x) == F)){
    QInter_att_data &amp;lt;- UpSetR:::intersects(QuerieInterAtt, Intersection, New_data, first.col, Num_of_set, att.x, att.y,
                                  expression, Set_names, palette)
    QElem_att_data &amp;lt;- UpSetR:::elements(QuerieElemAtt, Element, New_data, first.col, expression, Set_names, att.x, att.y,
                               palette)
  }
  AllQueryData &amp;lt;- UpSetR:::combineQueriesData(QInter_att_data, QElem_att_data, customAttDat, att.x, att.y)

  ShadingData &amp;lt;- NULL

  if(is.null(set.metadata) == F){
    ShadingData &amp;lt;- get_shade_groups(set.metadata, Set_names, Matrix_layout, shade.alpha)
    output &amp;lt;- Make_set_metadata_plot(set.metadata, Set_names)
    set.metadata.plots &amp;lt;- output[[1]]
    set.metadata &amp;lt;- output[[2]]

    if(is.null(ShadingData) == FALSE){
    shade.alpha &amp;lt;- unique(ShadingData$alpha)
    }
  } else {
    set.metadata.plots &amp;lt;- NULL
  }
  if(is.null(ShadingData) == TRUE){
  ShadingData &amp;lt;- UpSetR:::MakeShading(Matrix_layout, shade.color)
  }
  Main_bar &amp;lt;- suppressMessages(UpSetR:::Make_main_bar(All_Freqs, Bar_Q, show.numbers, mb.ratio, customQBar, number.angles, EBar_data, mainbar.y.label,
                            mainbar.y.max, scale.intersections, text.scale, attribute.plots))
  Matrix &amp;lt;- UpSetR:::Make_matrix_plot(Matrix_layout, Set_sizes, All_Freqs, point.size, line.size,
                             text.scale, labels, ShadingData, shade.alpha)
  Sizes &amp;lt;- UpSetR:::Make_size_plot(Set_sizes, sets.bar.color, mb.ratio, sets.x.label, scale.sets, text.scale, set_size.angles,set_size.show)

  # Make_base_plot(Main_bar, Matrix, Sizes, labels, mb.ratio, att.x, att.y, New_data,
  #                expression, att.pos, first.col, att.color, AllQueryData, attribute.plots,
  #                legend, query.legend, BoxPlots, Set_names, set.metadata, set.metadata.plots)

  structure(class = &amp;quot;upset&amp;quot;,
    .Data=list(
      Main_bar = Main_bar,
      Matrix = Matrix,
      Sizes = Sizes,
      labels = labels,
      mb.ratio = mb.ratio,
      att.x = att.x,
      att.y = att.y,
      New_data = New_data,
      expression = expression,
      att.pos = att.pos,
      first.col = first.col,
      att.color = att.color,
      AllQueryData = AllQueryData,
      attribute.plots = attribute.plots,
      legend = legend,
      query.legend = query.legend,
      BoxPlots = BoxPlots,
      Set_names = Set_names,
      set.metadata = set.metadata,
      set.metadata.plots = set.metadata.plots)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-07-27-hacking-our-way-through-upsetr_files/figure-html/unnamed-chunk-11-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;line-colors&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Line colors&lt;/h3&gt;
&lt;p&gt;Ok, that’s great, but now we have a problem with the lines: their color is no longer black. So we went deeper down the rabbit hole and found that the internal &lt;code&gt;Make_matrix_plot()&lt;/code&gt; function is where the lines are drawn. We made some edits but got a plot where the lines were on top of the circles, as shown in this screenshot.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-07-27-hacking-our-way-through-upsetr_files/Screen%20Shot%202018-07-27%20at%2012.17.58%20PM.png&#34; width=&#34;500&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;Our club session ran out of time, so we decided to continue our project another day and ask for help on Twitter. And yay, we got help super fast!&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;Thank you and &lt;a href=&#34;https://twitter.com/thatdnaguy?ref_src=twsrc%5Etfw&#34;&gt;@thatdnaguy&lt;/a&gt;, that did it! &lt;a href=&#34;https://t.co/tzQvhKFXgR&#34;&gt;pic.twitter.com/tzQvhKFXgR&lt;/a&gt;&lt;/p&gt;&amp;mdash; LIBD rstats club (@LIBDrstats) &lt;a href=&#34;https://twitter.com/LIBDrstats/status/1022903971416088577?ref_src=twsrc%5Etfw&#34;&gt;July 27, 2018&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;


&lt;p&gt;So here’s our modified version of &lt;code&gt;Make_matrix_plot()&lt;/code&gt; that keeps the lines black.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Make_matrix_plot &amp;lt;- function(Mat_data,Set_size_data, Main_bar_data, point_size, line_size, text_scale, labels,
                             shading_data, shade_alpha){

  if(length(text_scale) == 1){
    name_size_scale &amp;lt;- text_scale
  }
  if(length(text_scale) &amp;gt; 1 &amp;amp;&amp;amp; length(text_scale) &amp;lt;= 6){
    name_size_scale &amp;lt;- text_scale[5]
  }
  
  Mat_data$line_col &amp;lt;- &amp;#39;black&amp;#39;

  Matrix_plot &amp;lt;- (ggplot()
                  + theme(panel.background = element_rect(fill = &amp;quot;white&amp;quot;),
                          plot.margin=unit(c(-0.2,0.5,0.5,0.5), &amp;quot;lines&amp;quot;),
                          axis.text.x = element_blank(),
                          axis.ticks.x = element_blank(),
                          axis.ticks.y = element_blank(),
                          axis.text.y = element_text(colour = &amp;quot;gray0&amp;quot;,
                                                     size = 7*name_size_scale, hjust = 0.4),
                          panel.grid.major = element_blank(),
                          panel.grid.minor = element_blank())
                  + xlab(NULL) + ylab(&amp;quot;   &amp;quot;)
                  + scale_y_continuous(breaks = c(1:nrow(Set_size_data)),
                                       limits = c(0.5,(nrow(Set_size_data) +0.5)),
                                       labels = labels, expand = c(0,0))
                  + scale_x_continuous(limits = c(0,(nrow(Main_bar_data)+1 )), expand = c(0,0))
                  + geom_rect(data = shading_data, aes_string(xmin = &amp;quot;min&amp;quot;, xmax = &amp;quot;max&amp;quot;,
                                                              ymin = &amp;quot;y_min&amp;quot;, ymax = &amp;quot;y_max&amp;quot;),
                              fill = shading_data$shade_color, alpha = shade_alpha)
                  + geom_line(data= Mat_data, aes_string(group = &amp;quot;Intersection&amp;quot;, x=&amp;quot;x&amp;quot;, y=&amp;quot;y&amp;quot;,
                                                         colour = &amp;quot;line_col&amp;quot;), size = line_size)
                 + geom_point(data= Mat_data, aes_string(x= &amp;quot;x&amp;quot;, y= &amp;quot;y&amp;quot;), colour = Mat_data$color,
                     size= point_size, alpha = Mat_data$alpha, shape=16)
                  + scale_color_identity())
  Matrix_plot &amp;lt;- ggplot_gtable(ggplot_build(Matrix_plot))
  return(Matrix_plot)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using that modified version, we can then run the code again (note that this time we are not prefixing &lt;code&gt;Make_matrix_plot&lt;/code&gt; with &lt;code&gt;UpSetR:::&lt;/code&gt;) and get the plot we wanted.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Matrix &amp;lt;- Make_matrix_plot(Matrix_layout, Set_sizes, All_Freqs, point.size, line.size,
                             text.scale, labels, ShadingData, shade.alpha)
  Sizes &amp;lt;- UpSetR:::Make_size_plot(Set_sizes, sets.bar.color, mb.ratio, sets.x.label, scale.sets, text.scale, set_size.angles,set_size.show)

  # Make_base_plot(Main_bar, Matrix, Sizes, labels, mb.ratio, att.x, att.y, New_data,
  #                expression, att.pos, first.col, att.color, AllQueryData, attribute.plots,
  #                legend, query.legend, BoxPlots, Set_names, set.metadata, set.metadata.plots)

  structure(class = &amp;quot;upset&amp;quot;,
    .Data=list(
      Main_bar = Main_bar,
      Matrix = Matrix,
      Sizes = Sizes,
      labels = labels,
      mb.ratio = mb.ratio,
      att.x = att.x,
      att.y = att.y,
      New_data = New_data,
      expression = expression,
      att.pos = att.pos,
      first.col = first.col,
      att.color = att.color,
      AllQueryData = AllQueryData,
      attribute.plots = attribute.plots,
      legend = legend,
      query.legend = query.legend,
      BoxPlots = BoxPlots,
      Set_names = Set_names,
      set.metadata = set.metadata,
      set.metadata.plots = set.metadata.plots)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-07-27-hacking-our-way-through-upsetr_files/figure-html/unnamed-chunk-13-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We have quite a bit more to do in order to complete our pull request. We are also curious whether you would have used a different approach to hack your way through &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=UpSetR&#34;&gt;UpSetR&lt;/a&gt;&lt;/em&gt; (&lt;a href=&#39;http://github.com/hms-dbmi/UpSetR&#39;&gt;Gehlenborg, 2016&lt;/a&gt;). For example, maybe some functions from &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=devtools&#34;&gt;devtools&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Wickham_2018&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=devtools&#39;&gt;Wickham, Hester, and Chang, 2018&lt;/a&gt;) would have enabled us to do this equally fast without having to introduce &lt;code&gt;UpSetR:::&lt;/code&gt; calls.&lt;/p&gt;
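&lt;p&gt;One such alternative (a sketch we have not actually tried here) would be to patch the internal function in place with &lt;code&gt;utils::assignInNamespace()&lt;/code&gt;, so that the exported &lt;code&gt;upset()&lt;/code&gt; picks up the edited copy without us copying its whole body:&lt;/p&gt;

```r
## Sketch (untested here): replace UpSetR's internal Make_matrix_plot() inside
## the package namespace so that plain upset() calls use the patched version.
## requireNamespace() keeps this safe to source when UpSetR is not installed.
if (requireNamespace("UpSetR", quietly = TRUE)) {
  my_matrix_plot <- UpSetR:::Make_matrix_plot  # start from the internal original
  ## ... edit the body of my_matrix_plot here, e.g. forcing the line color to
  ## "black" as in our modified version above ...
  utils::assignInNamespace("Make_matrix_plot", my_matrix_plot, ns = "UpSetR")
}
```

&lt;p&gt;The trade-off is that &lt;code&gt;assignInNamespace()&lt;/code&gt; modifies the loaded package for the rest of the session, so it is best kept to interactive hacking sessions like this one.&lt;/p&gt;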
&lt;p&gt;Thanks for reading!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;acknowledgments&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Acknowledgments&lt;/h3&gt;
&lt;p&gt;This blog post was made possible thanks to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;http://bioconductor.org/packages/BiocStyle&#34;&gt;BiocStyle&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Oles_2018&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/Bioconductor/BiocStyle&#39;&gt;Oleś, Morgan, and Huber, 2018&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Xie_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/rstudio/blogdown&#39;&gt;Xie, Hill, and Thomas, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=devtools&#34;&gt;devtools&lt;/a&gt;&lt;/em&gt; (&lt;a href=&#39;https://CRAN.R-project.org/package=devtools&#39;&gt;Wickham, Hester, and Chang, 2018&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;knitcitations&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Boettiger_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=knitcitations&#39;&gt;Boettiger, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Boettiger_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Boettiger_2017&#34;&gt;[1]&lt;/a&gt;&lt;cite&gt; C. Boettiger. &lt;em&gt;knitcitations: Citations for ‘Knitr’ Markdown Files&lt;/em&gt;. R package version 1.0.8. 2017. URL: &lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;https://CRAN.R-project.org/package=knitcitations&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Gehlenborg_2016&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Gehlenborg_2016&#34;&gt;[2]&lt;/a&gt;&lt;cite&gt; N. Gehlenborg. &lt;em&gt;UpSetR: A More Scalable Alternative to Venn and Euler Diagrams for Visualizing Intersecting Sets&lt;/em&gt;. R package version 1.4.0. 2016. URL: &lt;a href=&#34;http://github.com/hms-dbmi/UpSetR&#34;&gt;http://github.com/hms-dbmi/UpSetR&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Oles_2018&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Oles_2018&#34;&gt;[3]&lt;/a&gt;&lt;cite&gt; A. Oleś, M. Morgan and W. Huber. &lt;em&gt;BiocStyle: Standard styles for vignettes and other Bioconductor documents&lt;/em&gt;. R package version 2.8.2. 2018. URL: &lt;a href=&#34;https://github.com/Bioconductor/BiocStyle&#34;&gt;https://github.com/Bioconductor/BiocStyle&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Wickham_2018&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Wickham_2018&#34;&gt;[4]&lt;/a&gt;&lt;cite&gt; H. Wickham, J. Hester and W. Chang. &lt;em&gt;devtools: Tools to Make Developing R Packages Easier&lt;/em&gt;. R package version 1.13.6. 2018. URL: &lt;a href=&#34;https://CRAN.R-project.org/package=devtools&#34;&gt;https://CRAN.R-project.org/package=devtools&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Xie_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Xie_2017&#34;&gt;[5]&lt;/a&gt;&lt;cite&gt; Y. Xie, A. P. Hill and A. Thomas. &lt;em&gt;blogdown: Creating Websites with R Markdown&lt;/em&gt;. ISBN 978-0815363729. Boca Raton, Florida: Chapman and Hall/CRC, 2017. URL: &lt;a href=&#34;https://github.com/rstudio/blogdown&#34;&gt;https://github.com/rstudio/blogdown&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;## Session info ----------------------------------------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  setting  value                       
##  version  R version 3.5.1 (2018-07-02)
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/New_York            
##  date     2018-07-27&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Packages --------------------------------------------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  package       * version date       source                            
##  assertthat      0.2.0   2017-04-11 cran (@0.2.0)                     
##  backports       1.1.2   2017-12-13 cran (@1.1.2)                     
##  base          * 3.5.1   2018-07-05 local                             
##  bibtex          0.4.2   2017-06-30 CRAN (R 3.5.0)                    
##  bindr           0.1.1   2018-03-13 cran (@0.1.1)                     
##  bindrcpp        0.2.2   2018-03-29 cran (@0.2.2)                     
##  BiocStyle     * 2.8.2   2018-05-30 Bioconductor                      
##  blogdown        0.8     2018-07-15 CRAN (R 3.5.0)                    
##  bookdown        0.7     2018-02-18 CRAN (R 3.5.0)                    
##  colorout      * 1.2-0   2018-05-03 Github (jalvesaq/colorout@c42088d)
##  colorspace      1.3-2   2016-12-14 cran (@1.3-2)                     
##  compiler        3.5.1   2018-07-05 local                             
##  crayon          1.3.4   2017-09-16 cran (@1.3.4)                     
##  datasets      * 3.5.1   2018-07-05 local                             
##  devtools      * 1.13.6  2018-06-27 cran (@1.13.6)                    
##  digest          0.6.15  2018-01-28 CRAN (R 3.5.0)                    
##  dplyr           0.7.6   2018-06-29 CRAN (R 3.5.1)                    
##  evaluate        0.11    2018-07-17 CRAN (R 3.5.0)                    
##  ggplot2       * 3.0.0   2018-07-03 CRAN (R 3.5.0)                    
##  glue            1.3.0   2018-07-17 CRAN (R 3.5.0)                    
##  graphics      * 3.5.1   2018-07-05 local                             
##  grDevices     * 3.5.1   2018-07-05 local                             
##  grid          * 3.5.1   2018-07-05 local                             
##  gridExtra     * 2.3     2017-09-09 CRAN (R 3.5.0)                    
##  gtable          0.2.0   2016-02-26 CRAN (R 3.5.0)                    
##  htmltools       0.3.6   2017-04-28 cran (@0.3.6)                     
##  httr            1.3.1   2017-08-20 CRAN (R 3.5.0)                    
##  jsonlite        1.5     2017-06-01 CRAN (R 3.5.0)                    
##  knitcitations * 1.0.8   2017-07-04 CRAN (R 3.5.0)                    
##  knitr           1.20    2018-02-20 cran (@1.20)                      
##  labeling        0.3     2014-08-23 cran (@0.3)                       
##  lazyeval        0.2.1   2017-10-29 CRAN (R 3.5.0)                    
##  lubridate       1.7.4   2018-04-11 CRAN (R 3.5.0)                    
##  magrittr        1.5     2014-11-22 cran (@1.5)                       
##  memoise         1.1.0   2017-04-21 CRAN (R 3.5.0)                    
##  methods       * 3.5.1   2018-07-05 local                             
##  munsell         0.5.0   2018-06-12 CRAN (R 3.5.0)                    
##  pillar          1.3.0   2018-07-14 CRAN (R 3.5.0)                    
##  pkgconfig       2.0.1   2017-03-21 cran (@2.0.1)                     
##  plyr          * 1.8.4   2016-06-08 cran (@1.8.4)                     
##  purrr           0.2.5   2018-05-29 cran (@0.2.5)                     
##  R6              2.2.2   2017-06-17 CRAN (R 3.5.0)                    
##  Rcpp            0.12.18 2018-07-23 CRAN (R 3.5.1)                    
##  RefManageR      1.2.0   2018-04-25 CRAN (R 3.5.0)                    
##  rlang           0.2.1   2018-05-30 cran (@0.2.1)                     
##  rmarkdown       1.10    2018-06-11 CRAN (R 3.5.0)                    
##  rprojroot       1.3-2   2018-01-03 cran (@1.3-2)                     
##  scales          0.5.0   2017-08-24 cran (@0.5.0)                     
##  stats         * 3.5.1   2018-07-05 local                             
##  stringi         1.2.4   2018-07-20 CRAN (R 3.5.0)                    
##  stringr         1.3.1   2018-05-10 CRAN (R 3.5.0)                    
##  tibble          1.4.2   2018-01-22 cran (@1.4.2)                     
##  tidyselect      0.2.4   2018-02-26 cran (@0.2.4)                     
##  tools           3.5.1   2018-07-05 local                             
##  UpSetR        * 1.4.0   2018-07-27 Github (hms-dbmi/UpSetR@fe2812c)  
##  utils         * 3.5.1   2018-07-05 local                             
##  withr           2.1.2   2018-03-15 CRAN (R 3.5.0)                    
##  xfun            0.3     2018-07-06 CRAN (R 3.5.0)                    
##  xml2            1.2.0   2018-01-24 CRAN (R 3.5.0)                    
##  yaml            2.1.19  2018-05-01 CRAN (R 3.5.0)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>LIBD rstats club remote useR!2018 notes</title>
      <link>http://LieberInstitute.github.io/rstatsclub/2018/07/13/libd-rstats-club-remote-user-2018-notes/</link>
      <pubDate>Fri, 13 Jul 2018 00:00:00 +0000</pubDate>
      <guid>http://LieberInstitute.github.io/rstatsclub/2018/07/13/libd-rstats-club-remote-user-2018-notes/</guid>
      <description>


&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-07-13-libd-rstats-club-remote-user-2018-notes_files/Screen%20Shot%202018-07-13%20at%2011.51.37%20AM.png&#34; width=&#34;600&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;For our July 13th 2018 LIBD rstats club meeting we decided to check out as much of the &lt;a href=&#34;https://user2018.r-project.org/&#34;&gt;useR!2018&lt;/a&gt; conference as we could. Here’s what we were able to figure out about it in about an hour. Hopefully our quick notes will help other &lt;a href=&#34;https://twitter.com/search?q=%23rstats&#34;&gt;rstats&lt;/a&gt; enthusiasts, users and developers get a glimpse of the conference, although there are bound to be more videos and materials about it coming out in the following days.&lt;/p&gt;
&lt;div id=&#34;main-links&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Main links:&lt;/h3&gt;
&lt;p&gt;First of all, you can search the full Twitter history for tweets related to the conference by checking the &lt;a href=&#34;https://twitter.com/search?q=%23user2018&#34;&gt;user2018&lt;/a&gt; hashtag.&lt;/p&gt;
&lt;p&gt;Next, check out the videos of the talks. There are more videos than we can watch right now, but we hope to come back later and catch more of them.&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;All of the &lt;a href=&#34;https://twitter.com/hashtag/useR2018?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#useR2018&lt;/a&gt; presentations (unless specifically requested not), including tutorials are being recorded. These will be available at some point after the meeting, we think at this channel &lt;a href=&#34;https://t.co/lq6E2XnXP9&#34;&gt;https://t.co/lq6E2XnXP9&lt;/a&gt;&lt;br&gt;&lt;br&gt;Live streaming is a challenge, hope to attempt one keynote.&lt;/p&gt;&amp;mdash; useR!2018 (@useR2018_conf) &lt;a href=&#34;https://twitter.com/useR2018_conf/status/1016816371533938688?ref_src=twsrc%5Etfw&#34;&gt;July 10, 2018&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;


&lt;/div&gt;
&lt;div id=&#34;talks&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Talks&lt;/h3&gt;
&lt;p&gt;From checking Twitter, we can say that there were lots of great talks and tutorials. Here are some of the main ones we found during our hour.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://twitter.com/rdpeng&#34;&gt;Roger Peng&lt;/a&gt; talking about &lt;em&gt;Teaching R to New Users&lt;/em&gt; got lots of attention. Here are some tweets about it:&lt;/p&gt;
&lt;p&gt;&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;.&lt;a href=&#34;https://twitter.com/rdpeng?ref_src=twsrc%5Etfw&#34;&gt;@rdpeng&lt;/a&gt; doing a better job of describing the &lt;a href=&#34;https://twitter.com/hashtag/tidyverse?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#tidyverse&lt;/a&gt; design philosophy than I ever have! &lt;a href=&#34;https://t.co/o3KunXe6qq&#34;&gt;https://t.co/o3KunXe6qq&lt;/a&gt;&lt;/p&gt;&amp;mdash; Hadley Wickham (@hadleywickham) &lt;a href=&#34;https://twitter.com/hadleywickham/status/1017553911782076416?ref_src=twsrc%5Etfw&#34;&gt;July 12, 2018&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;

 &lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;Last day of &lt;a href=&#34;https://twitter.com/hashtag/useR2018?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#useR2018&lt;/a&gt; kicking off with &lt;a href=&#34;https://twitter.com/rdpeng?ref_src=twsrc%5Etfw&#34;&gt;@rdpeng&lt;/a&gt; &amp;quot;Teaching R to new users&amp;quot; &lt;a href=&#34;https://t.co/yIJfiU7s8I&#34;&gt;pic.twitter.com/yIJfiU7s8I&lt;/a&gt;&lt;/p&gt;&amp;mdash; Luke Zappia (@_lazappi_) &lt;a href=&#34;https://twitter.com/_lazappi_/status/1017559295355846657?ref_src=twsrc%5Etfw&#34;&gt;July 13, 2018&lt;/a&gt;&lt;/blockquote&gt;
 &lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;
 
 &lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;Here is the narrative from my &lt;a href=&#34;https://twitter.com/hashtag/useR2018?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#useR2018&lt;/a&gt; keynote &lt;a href=&#34;https://t.co/SbrlShNaDL&#34;&gt;https://t.co/SbrlShNaDL&lt;/a&gt;&lt;/p&gt;&amp;mdash; Roger D. Peng (@rdpeng) &lt;a href=&#34;https://twitter.com/rdpeng/status/1017608009789259778?ref_src=twsrc%5Etfw&#34;&gt;July 13, 2018&lt;/a&gt;&lt;/blockquote&gt;
 &lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;
 
 &lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;Roger Peng’s &lt;a href=&#34;https://twitter.com/hashtag/useR2018?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#useR2018&lt;/a&gt; keynote this morning resonates with me, as another long time user/developer/instructor. Useful, opinionated take on where we are now in &lt;a href=&#34;https://twitter.com/hashtag/rstats?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#rstats&lt;/a&gt; and how we got here. &lt;a href=&#34;https://twitter.com/rdpeng?ref_src=twsrc%5Etfw&#34;&gt;@rdpeng&lt;/a&gt; &lt;a href=&#34;https://t.co/bOLSoaFupd&#34;&gt;https://t.co/bOLSoaFupd&lt;/a&gt; &lt;a href=&#34;https://t.co/ejc9yFYGVA&#34;&gt;pic.twitter.com/ejc9yFYGVA&lt;/a&gt;&lt;/p&gt;&amp;mdash; Jenny Bryan (@JennyBryan) &lt;a href=&#34;https://twitter.com/JennyBryan/status/1017579549435912194?ref_src=twsrc%5Etfw&#34;&gt;July 13, 2018&lt;/a&gt;&lt;/blockquote&gt;
 &lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;
 
&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://twitter.com/JennyBryan&#34;&gt;Jenny Bryan&lt;/a&gt; talked about &lt;em&gt;Code Smells and Feels&lt;/em&gt; which was one of the major highlights. We wish we could have been there. Here are some tweets about it:&lt;/p&gt;
&lt;p&gt;&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;Code Smells and Feels&lt;br&gt;^^ my keynote talk at &lt;a href=&#34;https://twitter.com/hashtag/useR2018?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#useR2018&lt;/a&gt;&lt;br&gt;Materials at: &lt;a href=&#34;https://t.co/e7dZRMZuSL&#34;&gt;https://t.co/e7dZRMZuSL&lt;/a&gt;&lt;br&gt;It was a great honour to speak and the Brisbane crew upheld the fine tradition of fun and informative useR! meetings 🎉 &lt;a href=&#34;https://t.co/2XkJ64NgsM&#34;&gt;pic.twitter.com/2XkJ64NgsM&lt;/a&gt;&lt;/p&gt;&amp;mdash; Jenny Bryan (@JennyBryan) &lt;a href=&#34;https://twitter.com/JennyBryan/status/1017697356479729665?ref_src=twsrc%5Etfw&#34;&gt;July 13, 2018&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;

&lt;/p&gt;
&lt;p&gt;Check out her presentation materials on &lt;a href=&#34;https://github.com/jennybc/code-smells-and-feels&#34;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The talk centered on the idea of writing good code. Using senses such as smell and feel as an extended metaphor, Bryan explains that a sense for code is developed through experience. Taking a very supportive tone, she provides pro-tips for writing efficient and effective code, such as writing simple conditions and functions instead of relying on complex ones, and “Tip #1: Do not comment and uncomment sections of code to alter behavior.”&lt;/p&gt;
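&lt;p&gt;As a tiny illustration of that tip (our own sketch, not code from the talk): instead of commenting lines in and out to switch behavior, you can expose the choice as a function argument validated with &lt;code&gt;match.arg()&lt;/code&gt;:&lt;/p&gt;

```r
## Instead of toggling behavior by commenting/uncommenting hist() vs boxplot(),
## make the choice an explicit argument; match.arg() validates it and uses the
## first option as the default.
plot_data <- function(dat, type = c("histogram", "boxplot")) {
  type <- match.arg(type)
  if (type == "histogram") hist(dat) else boxplot(dat)
}

plot_data(rnorm(100))                    # histogram by default
plot_data(rnorm(100), type = "boxplot")  # switch without editing the function
```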
&lt;p&gt;&lt;a href=&#34;https://twitter.com/thomasp85&#34;&gt;Thomas Lin Pedersen&lt;/a&gt; talked about the &lt;code&gt;gganimate&lt;/code&gt; package, and his talk fittingly seems to have included gifs.&lt;/p&gt;
&lt;p&gt;&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;First keynote for the second day of &lt;a href=&#34;https://twitter.com/hashtag/useR2018?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#useR2018&lt;/a&gt;. &lt;a href=&#34;https://twitter.com/thomasp85?ref_src=twsrc%5Etfw&#34;&gt;@thomasp85&lt;/a&gt; &amp;quot;The Grammar of Animation&amp;quot; &lt;a href=&#34;https://twitter.com/hashtag/sketchnotes?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#sketchnotes&lt;/a&gt; &lt;a href=&#34;https://t.co/tvNjvbr4ag&#34;&gt;pic.twitter.com/tvNjvbr4ag&lt;/a&gt;&lt;/p&gt;&amp;mdash; Luke Zappia (@_lazappi_) &lt;a href=&#34;https://twitter.com/_lazappi_/status/1017198068360347648?ref_src=twsrc%5Etfw&#34;&gt;July 12, 2018&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;

 &lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;😬 I said I wasn&amp;#39;t gonna gif it, but I also don&amp;#39;t want you to miss it…&lt;br&gt;&amp;quot;The Grammar of Animation&amp;quot; 👨‍🎨 &lt;a href=&#34;https://twitter.com/thomasp85?ref_src=twsrc%5Etfw&#34;&gt;@thomasp85&lt;/a&gt; &lt;a href=&#34;https://t.co/t2HYRTtHwO&#34;&gt;https://t.co/t2HYRTtHwO&lt;/a&gt; &lt;a href=&#34;https://twitter.com/hashtag/rstats?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#rstats&lt;/a&gt; &lt;a href=&#34;https://twitter.com/hashtag/useR2018?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#useR2018&lt;/a&gt; &lt;a href=&#34;https://twitter.com/hashtag/gganimate?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#gganimate&lt;/a&gt; &lt;a href=&#34;https://t.co/YOyuNn5p1g&#34;&gt;pic.twitter.com/YOyuNn5p1g&lt;/a&gt;&lt;/p&gt;&amp;mdash; Mara Averick (@dataandme) &lt;a href=&#34;https://twitter.com/dataandme/status/1017393683379949568?ref_src=twsrc%5Etfw&#34;&gt;July 12, 2018&lt;/a&gt;&lt;/blockquote&gt;
 &lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;
 
&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://twitter.com/StephdeSilva&#34;&gt;Steph de Silva&lt;/a&gt; started the useR!2018 keynotes with her &lt;em&gt;Beyond syntax, towards culture&lt;/em&gt; talk, which covered different R communities and how we all interact.&lt;/p&gt;
&lt;p&gt;&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;Kicking off the &lt;a href=&#34;https://twitter.com/hashtag/useR2018?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#useR2018&lt;/a&gt; talks with @StephdeSilva&amp;#39;s keynote &amp;quot;Beyond syntax, towards culture&amp;quot; &lt;a href=&#34;https://twitter.com/hashtag/sketchnotes?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#sketchnotes&lt;/a&gt; &lt;a href=&#34;https://t.co/vgBsfOIFJU&#34;&gt;pic.twitter.com/vgBsfOIFJU&lt;/a&gt;&lt;/p&gt;&amp;mdash; Luke Zappia (@_lazappi_) &lt;a href=&#34;https://twitter.com/_lazappi_/status/1016904394103668736?ref_src=twsrc%5Etfw&#34;&gt;July 11, 2018&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;

 &lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;Late to the party, I was a little busy: my slides for my talk &lt;a href=&#34;https://twitter.com/hashtag/useR2018?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#useR2018&lt;/a&gt;&lt;a href=&#34;https://t.co/OzqcSTEx2v&#34;&gt;https://t.co/OzqcSTEx2v&lt;/a&gt; &lt;a href=&#34;https://t.co/RNnCm6K2ym&#34;&gt;pic.twitter.com/RNnCm6K2ym&lt;/a&gt;&lt;/p&gt;&amp;mdash; Steph Stammel (@StephStammel) &lt;a href=&#34;https://twitter.com/StephStammel/status/1017757332254515201?ref_src=twsrc%5Etfw&#34;&gt;July 13, 2018&lt;/a&gt;&lt;/blockquote&gt;
 &lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;
 
&lt;/p&gt;
&lt;p&gt;The slides and video for the &lt;code&gt;workflowr&lt;/code&gt; talk by &lt;a href=&#34;https://twitter.com/jdblischak&#34;&gt;John Blischak&lt;/a&gt; are already online too, and the talk got a big thumbs up from &lt;a href=&#34;https://twitter.com/PeteHaitch&#34;&gt;Peter Hickey&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;Here are the slides for my &lt;a href=&#34;https://twitter.com/hashtag/useR2018?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#useR2018&lt;/a&gt; presentation on my &lt;a href=&#34;https://twitter.com/hashtag/rstats?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#rstats&lt;/a&gt; package &lt;a href=&#34;https://twitter.com/hashtag/workflowr?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#workflowr&lt;/a&gt; for reproducible research &lt;a href=&#34;https://t.co/O2yZ7RemN6&#34;&gt;https://t.co/O2yZ7RemN6&lt;/a&gt;&lt;/p&gt;&amp;mdash; John Blischak (@jdblischak) &lt;a href=&#34;https://twitter.com/jdblischak/status/1016912611529510913?ref_src=twsrc%5Etfw&#34;&gt;July 11, 2018&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;

 &lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;I feel like &lt;a href=&#34;https://twitter.com/jdblischak?ref_src=twsrc%5Etfw&#34;&gt;@jdblischak&lt;/a&gt; has read my mind with workflowr. It&amp;#39;s like my current workflow but, like, actually good and reproducible! Will be especially great for collaborative and consulting type projects &lt;a href=&#34;https://twitter.com/hashtag/useR2018?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#useR2018&lt;/a&gt; &lt;a href=&#34;https://t.co/tZtqyH3sc2&#34;&gt;https://t.co/tZtqyH3sc2&lt;/a&gt;&lt;/p&gt;&amp;mdash; Peter Hickey (@PeteHaitch) &lt;a href=&#34;https://twitter.com/PeteHaitch/status/1016920426792943616?ref_src=twsrc%5Etfw&#34;&gt;July 11, 2018&lt;/a&gt;&lt;/blockquote&gt;
 &lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;
 
&lt;/p&gt;
&lt;p&gt;If you are starting out with the tidyverse, this tutorial about &lt;em&gt;Wrangling with the Tidyverse&lt;/em&gt; by &lt;a href=&#34;https://twitter.com/drsimonj&#34;&gt;Simon Jackson&lt;/a&gt; seems interesting!&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;Great to meet everyone today who attended my &lt;a href=&#34;https://twitter.com/hashtag/useR2018?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#useR2018&lt;/a&gt; &lt;a href=&#34;https://twitter.com/hashtag/rstats?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#rstats&lt;/a&gt; tutorial, &amp;quot;Wrangling with the Tidyverse!&amp;quot;&lt;br&gt;&lt;br&gt;Missed out or forgot anything? Get all the material at &lt;a href=&#34;https://t.co/YfYZBlkwMs&#34;&gt;https://t.co/YfYZBlkwMs&lt;/a&gt;&lt;br&gt;&lt;br&gt;Special thanks to &lt;a href=&#34;https://twitter.com/Rhydwyn?ref_src=twsrc%5Etfw&#34;&gt;@Rhydwyn&lt;/a&gt; &lt;a href=&#34;https://twitter.com/orchid00?ref_src=twsrc%5Etfw&#34;&gt;@orchid00&lt;/a&gt; and &lt;a href=&#34;https://twitter.com/SayaniGupta5?ref_src=twsrc%5Etfw&#34;&gt;@SayaniGupta5&lt;/a&gt; for their support too &lt;a href=&#34;https://t.co/5MnXLSQxq9&#34;&gt;pic.twitter.com/5MnXLSQxq9&lt;/a&gt;&lt;/p&gt;&amp;mdash; Simon Jackson (@drsimonj) &lt;a href=&#34;https://twitter.com/drsimonj/status/1016592786865287169?ref_src=twsrc%5Etfw&#34;&gt;July 10, 2018&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;


&lt;p&gt;Did anyone else think about the Diablo game with the &lt;code&gt;deckard&lt;/code&gt; package? This new package by &lt;a href=&#34;https://twitter.com/VergeLabsAI&#34;&gt;Verge Labs&lt;/a&gt; could be very useful when working with large datasets.&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;Introducing deckard for large scale visualisations in &lt;a href=&#34;https://twitter.com/hashtag/rstats?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#rstats&lt;/a&gt;! If you want to hear more about it come catch us present this Thursday at 4:30 at &lt;a href=&#34;https://twitter.com/hashtag/user2018?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#user2018&lt;/a&gt;. &lt;a href=&#34;https://t.co/sIcd3ztqVl&#34;&gt;https://t.co/sIcd3ztqVl&lt;/a&gt; &lt;a href=&#34;https://t.co/ggu0N7JMWH&#34;&gt;pic.twitter.com/ggu0N7JMWH&lt;/a&gt;&lt;/p&gt;&amp;mdash; Verge Labs (@VergeLabsAI) &lt;a href=&#34;https://twitter.com/VergeLabsAI/status/1016793618051301376?ref_src=twsrc%5Etfw&#34;&gt;July 10, 2018&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;


&lt;p&gt;&lt;a href=&#34;https://twitter.com/jimhester_&#34;&gt;Jim Hester&lt;/a&gt;’s talk about the &lt;code&gt;glue&lt;/code&gt; package was highly recommended by Jenny Bryan. And more likely than not, you are using R packages that Jim has contributed to in one way or another.&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;Slides for my talk at &lt;a href=&#34;https://t.co/7JO8G1nQup&#34;&gt;https://t.co/7JO8G1nQup&lt;/a&gt;&lt;/p&gt;&amp;mdash; Jim Hester (@jimhester_) &lt;a href=&#34;https://twitter.com/jimhester_/status/1017226381082558464?ref_src=twsrc%5Etfw&#34;&gt;July 12, 2018&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;
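For those who have not tried it, here is a minimal sketch of what `glue` does (assuming the `glue` package is installed; the variable names are illustrative):

```r
library(glue)

# Expressions inside {} are evaluated in the calling environment
# and spliced into the string.
pkg <- "glue"
result <- glue("The {pkg} package evaluates 2 + 2 = {2 + 2} inline")
print(result)
```

`glue()` is often a more readable alternative to `paste0()` when a string is built from many pieces.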


&lt;p&gt;&lt;a href=&#34;https://twitter.com/tslumley&#34;&gt;Thomas Lumley&lt;/a&gt; talked about &lt;code&gt;fasteR&lt;/code&gt;: ways to speed up R code; check the video of his talk at &lt;a href=&#34;https://www.youtube.com/watch?v=P2MDIzflp9k&#34;&gt;YouTube&lt;/a&gt;.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;https://imgs.xkcd.com/comics/is_it_worth_the_time.png&#34; width=&#34;600&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;Major takeaways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you repeat a task frequently, it is worth taking the time to optimize it for speed. (See xkcd cartoon!)&lt;/li&gt;
&lt;li&gt;Tools are available to measure how “efficient” your code is, in time and/or memory: &lt;code&gt;microbenchmark()&lt;/code&gt; (from the &lt;em&gt;microbenchmark&lt;/em&gt; package), &lt;code&gt;Rprof()&lt;/code&gt;, and &lt;code&gt;system.time()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Common reasons your code may be slower than necessary, and what to do about them:
&lt;ul&gt;
&lt;li&gt;Dataframes are slower than matrices, data.tables, tbls, and lists&lt;/li&gt;
&lt;li&gt;Vectorize your code whenever possible&lt;/li&gt;
&lt;li&gt;Preallocate objects at their final size, rather than “growing” them.&lt;/li&gt;
&lt;li&gt;Linear algebra / matrix algebra functions can be much faster than alternatives because they are coded in C. E.g. for a large matrix, &lt;code&gt;crossprod(scale(x))&lt;/code&gt; &lt;em&gt;if you know there is no missing data or NAs&lt;/em&gt; is many times faster than &lt;code&gt;cor(x)&lt;/code&gt;. If you know the linear algebra, use matrix operations when possible.&lt;/li&gt;
&lt;li&gt;Packages exist for modeling large data. Example: &lt;code&gt;biglm&lt;/code&gt; for linear models.&lt;/li&gt;
&lt;li&gt;Thomas Lumley is a Rosalind Franklin fan :)&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
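The preallocation and vectorization points above can be sketched with base R alone; the functions below are illustrative, and the exact timings will vary by machine:

```r
n <- 2e4

# Growing an object inside a loop: R copies the vector on each append.
grow <- function(n) {
  x <- numeric(0)
  for (i in seq_len(n)) x <- c(x, i^2)
  x
}

# Preallocating the result avoids the repeated copies.
prealloc <- function(n) {
  x <- numeric(n)
  for (i in seq_len(n)) x[i] <- i^2
  x
}

# Vectorized version: no explicit loop at all.
vec <- function(n) seq_len(n)^2

# All three return identical results...
stopifnot(identical(grow(n), prealloc(n)), identical(prealloc(n), vec(n)))

# ...but system.time() typically shows the growing version is slowest
# and the vectorized version is fastest.
system.time(grow(n))
system.time(prealloc(n))
system.time(vec(n))
```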
&lt;/div&gt;
&lt;div id=&#34;miscellaneous&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Miscellaneous&lt;/h3&gt;
&lt;p&gt;An awesome hex wall was assembled from the hex stickers of packages represented at useR!2018. Check it out!&lt;/p&gt;
&lt;p&gt;&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;The &lt;a href=&#34;https://twitter.com/hashtag/useR2018?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#useR2018&lt;/a&gt; &lt;a href=&#34;https://twitter.com/hashtag/hexwall?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#hexwall&lt;/a&gt; has been revealed! Read about how it was created in &lt;a href=&#34;https://twitter.com/hashtag/rstats?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#rstats&lt;/a&gt; on this blog post: &lt;a href=&#34;https://t.co/krYYOQ3N84&#34;&gt;https://t.co/krYYOQ3N84&lt;/a&gt;&lt;br&gt;&lt;br&gt;A huge thanks to everyone who has submitted stickers and provided feedback. I hope you enjoy the end result as much as I have had creating it! 🎉 &lt;a href=&#34;https://t.co/GnG9m2cZme&#34;&gt;pic.twitter.com/GnG9m2cZme&lt;/a&gt;&lt;/p&gt;&amp;mdash; Mitchell O&amp;#39;Hara-Wild (@mitchoharawild) &lt;a href=&#34;https://twitter.com/mitchoharawild/status/1016867974597074944?ref_src=twsrc%5Etfw&#34;&gt;July 11, 2018&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;

 &lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;Controversial choice of wearing a &lt;a href=&#34;https://twitter.com/hashtag/python?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#python&lt;/a&gt; in front of an &lt;a href=&#34;https://twitter.com/hashtag/hexwall?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#hexwall&lt;/a&gt; of &lt;a href=&#34;https://twitter.com/hashtag/R?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#R&lt;/a&gt; &lt;a href=&#34;https://twitter.com/hashtag/packages?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#packages&lt;/a&gt; but hey, we&amp;#39;re all friends! &lt;a href=&#34;https://twitter.com/hashtag/useR2018?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#useR2018&lt;/a&gt; &lt;a href=&#34;https://t.co/rSG1EsH4Wi&#34;&gt;pic.twitter.com/rSG1EsH4Wi&lt;/a&gt;&lt;/p&gt;&amp;mdash; Anna Quaglieri (@annaquagli) &lt;a href=&#34;https://twitter.com/annaquagli/status/1016920889101709312?ref_src=twsrc%5Etfw&#34;&gt;July 11, 2018&lt;/a&gt;&lt;/blockquote&gt;
 &lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;
 
&lt;/p&gt;
&lt;p&gt;It’s awesome to see the &lt;a href=&#34;https://twitter.com/RLadiesGlobal&#34;&gt;RLadies&lt;/a&gt; community thriving! A few of us didn’t know about &lt;a href=&#34;https://r-posts.com/introducing-r-ladies-remote-chapter/&#34;&gt;RLadies Remote&lt;/a&gt; which everyone can join.&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;Here are amazing &lt;a href=&#34;https://twitter.com/hashtag/RLadies?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#RLadies&lt;/a&gt; from around the world having dinner after an excellent day at &lt;a href=&#34;https://twitter.com/useR2018_conf?ref_src=twsrc%5Etfw&#34;&gt;@useR2018_conf&lt;/a&gt; &lt;a href=&#34;https://twitter.com/RLadiesGlobal?ref_src=twsrc%5Etfw&#34;&gt;@RLadiesGlobal&lt;/a&gt; &lt;a href=&#34;https://twitter.com/RLadiesBrisbane?ref_src=twsrc%5Etfw&#34;&gt;@RLadiesBrisbane&lt;/a&gt; &lt;a href=&#34;https://twitter.com/RLadiesSydney?ref_src=twsrc%5Etfw&#34;&gt;@RLadiesSydney&lt;/a&gt; &lt;a href=&#34;https://twitter.com/RLadiesIstanbul?ref_src=twsrc%5Etfw&#34;&gt;@RLadiesIstanbul&lt;/a&gt; &lt;a href=&#34;https://twitter.com/RLadiesIzmir?ref_src=twsrc%5Etfw&#34;&gt;@RLadiesIzmir&lt;/a&gt; &lt;a href=&#34;https://twitter.com/RLadiesAKL?ref_src=twsrc%5Etfw&#34;&gt;@RLadiesAKL&lt;/a&gt; &lt;a href=&#34;https://twitter.com/RLadiesDC?ref_src=twsrc%5Etfw&#34;&gt;@RLadiesDC&lt;/a&gt; &lt;a href=&#34;https://twitter.com/RLadiesRemote?ref_src=twsrc%5Etfw&#34;&gt;@RLadiesRemote&lt;/a&gt; &lt;a href=&#34;https://twitter.com/RLadiesMVD?ref_src=twsrc%5Etfw&#34;&gt;@RLadiesMVD&lt;/a&gt; &lt;a href=&#34;https://twitter.com/hashtag/rladies?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#rladies&lt;/a&gt; &lt;a href=&#34;https://twitter.com/hashtag/useR2018?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#useR2018&lt;/a&gt; &lt;a href=&#34;https://twitter.com/hashtag/rstat?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#rstat&lt;/a&gt; &lt;a href=&#34;https://t.co/aZoSuAU0Gi&#34;&gt;pic.twitter.com/aZoSuAU0Gi&lt;/a&gt;&lt;/p&gt;&amp;mdash; R-Ladies Melbourne Inc (@RLadiesMelb) &lt;a href=&#34;https://twitter.com/RLadiesMelb/status/1016604921154584576?ref_src=twsrc%5Etfw&#34;&gt;July 10, 2018&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;


&lt;p&gt;These are some awesome socks!&lt;/p&gt;
&lt;p&gt;And who doesn’t love this image of &lt;a href=&#34;https://twitter.com/hadleywickham&#34;&gt;Hadley Wickham&lt;/a&gt; being mobbed by deer? He even meme-fied it himself in this tweet.&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;People often ask why dplyr &amp;amp; tibble don&amp;#39;t support row names. I&amp;#39;ve (finally) written up my reasons at &lt;a href=&#34;https://t.co/UmZjaSk7UX&#34;&gt;https://t.co/UmZjaSk7UX&lt;/a&gt; &lt;a href=&#34;https://twitter.com/hashtag/rstats?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#rstats&lt;/a&gt; (photo credit: &lt;a href=&#34;https://twitter.com/hspter?ref_src=twsrc%5Etfw&#34;&gt;@hspter&lt;/a&gt;) &lt;a href=&#34;https://t.co/IVbaVmKhYp&#34;&gt;pic.twitter.com/IVbaVmKhYp&lt;/a&gt;&lt;/p&gt;&amp;mdash; Hadley Wickham (@hadleywickham) &lt;a href=&#34;https://twitter.com/hadleywickham/status/1017562721456275456?ref_src=twsrc%5Etfw&#34;&gt;July 13, 2018&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;


&lt;p&gt;useR!2019 is already lined up, &lt;a href=&#34;http://www.user2019.fr/&#34;&gt;check it out&lt;/a&gt;. It’ll be in Toulouse, France! Follow them on Twitter at &lt;a href=&#34;https://twitter.com/UseR2019_Conf&#34; class=&#34;uri&#34;&gt;https://twitter.com/UseR2019_Conf&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;acknowledgments&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Acknowledgments&lt;/h3&gt;
&lt;p&gt;We are grateful to everyone who tweeted about the conference and shared their materials online!&lt;/p&gt;
&lt;p&gt;This blog post was made possible thanks to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Xie_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/rstudio/blogdown&#39;&gt;Xie, Hill, and Thomas, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=devtools&#34;&gt;devtools&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Wickham_2018&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=devtools&#39;&gt;Wickham, Hester, and Chang, 2018&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;knitcitations&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Boettiger_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=knitcitations&#39;&gt;Boettiger, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Boettiger_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Boettiger_2017&#34;&gt;[1]&lt;/a&gt;&lt;cite&gt; C. Boettiger. &lt;em&gt;knitcitations: Citations for ‘Knitr’ Markdown Files&lt;/em&gt;. R package version 1.0.8. 2017. URL: &lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;https://CRAN.R-project.org/package=knitcitations&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Wickham_2018&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Wickham_2018&#34;&gt;[2]&lt;/a&gt;&lt;cite&gt; H. Wickham, J. Hester and W. Chang. &lt;em&gt;devtools: Tools to Make Developing R Packages Easier&lt;/em&gt;. R package version 1.13.6. 2018. URL: &lt;a href=&#34;https://CRAN.R-project.org/package=devtools&#34;&gt;https://CRAN.R-project.org/package=devtools&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Xie_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Xie_2017&#34;&gt;[3]&lt;/a&gt;&lt;cite&gt; Y. Xie, A. P. Hill and A. Thomas. &lt;em&gt;blogdown: Creating Websites with R Markdown&lt;/em&gt;. ISBN 978-0815363729. Boca Raton, Florida: Chapman and Hall/CRC, 2017. URL: &lt;a href=&#34;https://github.com/rstudio/blogdown&#34;&gt;https://github.com/rstudio/blogdown&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;## Session info ----------------------------------------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  setting  value                       
##  version  R version 3.5.1 (2018-07-02)
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/New_York            
##  date     2018-07-13&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Packages --------------------------------------------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  package       * version date       source                            
##  backports       1.1.2   2017-12-13 cran (@1.1.2)                     
##  base          * 3.5.1   2018-07-05 local                             
##  bibtex          0.4.2   2017-06-30 CRAN (R 3.5.0)                    
##  BiocStyle     * 2.8.2   2018-05-30 Bioconductor                      
##  blogdown        0.7     2018-07-07 CRAN (R 3.5.0)                    
##  bookdown        0.7     2018-02-18 CRAN (R 3.5.0)                    
##  colorout      * 1.2-0   2018-05-03 Github (jalvesaq/colorout@c42088d)
##  compiler        3.5.1   2018-07-05 local                             
##  datasets      * 3.5.1   2018-07-05 local                             
##  devtools      * 1.13.6  2018-06-27 cran (@1.13.6)                    
##  digest          0.6.15  2018-01-28 CRAN (R 3.5.0)                    
##  evaluate        0.10.1  2017-06-24 cran (@0.10.1)                    
##  graphics      * 3.5.1   2018-07-05 local                             
##  grDevices     * 3.5.1   2018-07-05 local                             
##  htmltools       0.3.6   2017-04-28 cran (@0.3.6)                     
##  httr            1.3.1   2017-08-20 CRAN (R 3.5.0)                    
##  jsonlite        1.5     2017-06-01 CRAN (R 3.5.0)                    
##  knitcitations * 1.0.8   2017-07-04 CRAN (R 3.5.0)                    
##  knitr           1.20    2018-02-20 cran (@1.20)                      
##  lubridate       1.7.4   2018-04-11 CRAN (R 3.5.0)                    
##  magrittr        1.5     2014-11-22 cran (@1.5)                       
##  memoise         1.1.0   2017-04-21 CRAN (R 3.5.0)                    
##  methods       * 3.5.1   2018-07-05 local                             
##  plyr            1.8.4   2016-06-08 cran (@1.8.4)                     
##  R6              2.2.2   2017-06-17 CRAN (R 3.5.0)                    
##  Rcpp            0.12.17 2018-05-18 cran (@0.12.17)                   
##  RefManageR      1.2.0   2018-04-25 CRAN (R 3.5.0)                    
##  rmarkdown       1.10    2018-06-11 CRAN (R 3.5.0)                    
##  rprojroot       1.3-2   2018-01-03 cran (@1.3-2)                     
##  stats         * 3.5.1   2018-07-05 local                             
##  stringi         1.2.3   2018-06-12 CRAN (R 3.5.0)                    
##  stringr         1.3.1   2018-05-10 CRAN (R 3.5.0)                    
##  tools           3.5.1   2018-07-05 local                             
##  utils         * 3.5.1   2018-07-05 local                             
##  withr           2.1.2   2018-03-15 CRAN (R 3.5.0)                    
##  xfun            0.3     2018-07-06 CRAN (R 3.5.0)                    
##  xml2            1.2.0   2018-01-24 CRAN (R 3.5.0)                    
##  yaml            2.1.19  2018-05-01 CRAN (R 3.5.0)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>git to know git: an 8 minute introduction</title>
      <link>http://LieberInstitute.github.io/rstatsclub/2018/04/17/git-to-know-git/</link>
      <pubDate>Tue, 17 Apr 2018 00:00:00 +0000</pubDate>
      <guid>http://LieberInstitute.github.io/rstatsclub/2018/04/17/git-to-know-git/</guid>
      <description>


&lt;p&gt;By &lt;a href=&#34;http://amy-peterson.github.io&#34;&gt;Amy Peterson&lt;/a&gt;&lt;/p&gt;
&lt;div id=&#34;using-git&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Using Git&lt;/h3&gt;
&lt;p&gt;Git is a version control system that allows you to track changes made to files while working on a project, either independently or in collaboration with others. It provides a way to save the many components of a project in progress: not only the source code, but also the figures and data that the code produces. The importance of understanding and using Git lies in its ability to maintain an organized record of a project, also referred to as a &lt;strong&gt;repository&lt;/strong&gt; or &lt;strong&gt;repo&lt;/strong&gt;, as it evolves. While setting up and learning to use Git may seem intimidating, the majority of the work is in the initial setup.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;github&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;GitHub&lt;/h3&gt;
&lt;p&gt;GitHub is one of the hosting services that provides an interface for using Git, and can be thought of as Dropbox for version control projects. GitHub is one of the ways to store &lt;strong&gt;repositories&lt;/strong&gt; using Git, and is an easy way to routinely back-up your work as you make progress on a project. It is also helpful for tracking changes, demonstrating who contributed to which projects, when they contributed, and what their contributions were.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;why-i-use-git&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Why I Use Git&lt;/h3&gt;
&lt;p&gt;When I started as an Intern at the &lt;a href=&#34;http://www.libd.org&#34;&gt;LIBD&lt;/a&gt;, I noticed how frequently GitHub was used. As I familiarized myself with some of the projects I would be working on, it became clear how much easier it was to use a system that could document project changes made throughout time in a way that was widely accessible to contributors. Using GitHub also made it easier to re-visit certain scripts or documents to determine what changes were made, when, and why they were needed. Having a detailed history of various project components is an easy way to ensure that contributors have information organized in the same way.&lt;/p&gt;
&lt;p&gt;Beyond working on projects with collaborators, using GitHub is equally rewarding for individual projects. If, for example, you work on a project on one computer at work and need those changes on a different computer at home, GitHub is a quick and easy way to keep the project synchronized so you are always working with the latest version.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;terms&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Terms&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;commit&lt;/strong&gt;: saves changes, either adding a new file to GitHub, or updating the existing version of that file&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;issue&lt;/strong&gt;: option on GitHub that creates a list of action items for a repository, similar to a to-do list; tasks can be assigned to particular contributors; you can also comment on issues and reference them in a commit message by including # followed by the issue number&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;push&lt;/strong&gt;: sends the commits made locally to the repository on GitHub&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;pull&lt;/strong&gt;: downloads modified or newly added files, so the local directory matches the current repository on GitHub&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;public-v.-private-repositories&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Public v. Private Repositories&lt;/h2&gt;
&lt;p&gt;Repositories can be public or private. Public repositories are readable by everyone, but permissions are still required to make edits by pushing commits. Private repositories are inaccessible and unreadable without permission, and the repository owner controls who can read, edit, or hold admin access.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;features&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Features&lt;/h2&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-04-17-git-to-know-git_files/git%20features.png&#34; height=&#34;60&#34; /&gt; &lt;strong&gt;Watch&lt;/strong&gt;: Provides a way to receive notifications regarding all updates on a particular repository of interest.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Star&lt;/strong&gt;: Marks a specific repository of interest, making it easier to refer back to it later. Differs from &lt;strong&gt;watch&lt;/strong&gt; in that you do not receive notifications for repository updates.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fork&lt;/strong&gt;: Creates a copy of a repository under your own GitHub account. The fork exists separately from the original repo, and reflects the repository as it was at the time you forked it.&lt;/p&gt;
&lt;div id=&#34;initial-set-up&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Initial Set-Up&lt;/h3&gt;
&lt;p&gt;Make an account on &lt;a href=&#34;https://github.com/join&#34;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;mac&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Mac&lt;/h2&gt;
&lt;p&gt;On newer Macs this should already be set up, but checking is easy!&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Open the Terminal application
which git # shows the path to Git if it is installed
git --version # lists the currently installed Git version

## If Git is not installed, this prompts macOS to install the
## Xcode Command Line Tools, which include Git
xcode-select --install&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;windows&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Windows&lt;/h2&gt;
&lt;p&gt;Install &lt;a href=&#34;https://gitforwindows.org&#34;&gt;Git for Windows&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;after-git-installation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;After Git Installation&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Open Terminal (Mac) or Git Bash (Windows)
## Enter the name and email associated with your GitHub account
git config --global user.name &amp;quot;Amy Peterson&amp;quot;
git config --global user.email &amp;quot;amy.peterson@jhu.edu&amp;quot;
git config --global --list # Lists global configuration options &lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;setting-up-a-repository&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Setting Up a Repository&lt;/h3&gt;
&lt;p&gt;Identify a repository you want to contribute to, or create your own! Repositories can be created from the front page using “Start a project”, or from your profile page by clicking “Repositories” and then the green “New” button.&lt;/p&gt;
&lt;p&gt;Next, take the following steps:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Open Terminal (Mac) or Git Bash (Windows)
# Change directories so you are in the directory where you want to set up the repository
pwd # prints the name of the current directory
cd ~/Desktop # changes the current directory to Desktop
ls # lists folders you can cd into
# On the repository page on GitHub, click the green &amp;quot;Clone or download&amp;quot; button
git clone git@GitHub.com:SampleLink.git # Paste link from GitHub to download the repository locally&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;saving-your-work&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Saving Your Work&lt;/h3&gt;
&lt;p&gt;The process of updating GitHub is as follows: &lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-04-17-git-to-know-git_files/git%20process.png&#34; height=&#34;80&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Open Terminal (Mac) or Git Bash (Windows)
git add File1.R # stages the file, here File1.R, to be committed
git commit -m &amp;quot;Example message&amp;quot; # records the staged files with the message in quotes
git push # sends your commits to GitHub

# Once updates are pushed, other repository members need to do the following
git pull # updates local directory to reflect the changes made to GitHub&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-04-17-git-to-know-git_files/git%20status.png&#34; height=&#34;50&#34; /&gt; Useful at any time throughout the process of updating a repository, &lt;code&gt;git status&lt;/code&gt; provides information regarding how your local directory differs from the repository on GitHub, and separates those differences into which files have had changes made, and which files are entirely new. In the example below, &lt;strong&gt;File1.R&lt;/strong&gt; and &lt;strong&gt;File2.pdf&lt;/strong&gt; have been modified from what exists on GitHub, while &lt;strong&gt;File3.R&lt;/strong&gt; and &lt;strong&gt;File4.pdf&lt;/strong&gt; are &lt;code&gt;untracked&lt;/code&gt;, or entirely new to the repository.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-04-17-git-to-know-git_files/git%20status%20result.png&#34; width=&#34;500&#34; /&gt;

&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;committing-folders&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Committing Folders&lt;/h2&gt;
&lt;p&gt;Folders associated with a project can also be committed to a repository on GitHub. Folders that are currently untracked will be listed in response to &lt;code&gt;git status&lt;/code&gt;, and committing a folder to a repository will simultaneously commit all of its contents. This is particularly useful and efficient when creating a repository for an existing project.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;making-multiple-commits&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Making Multiple Commits&lt;/h2&gt;
&lt;p&gt;Multiple commits can be made to group files before pushing them to GitHub. Each set of files you have added using &lt;code&gt;git add&lt;/code&gt; will be grouped together as a single commit once you type &lt;code&gt;git commit&lt;/code&gt; and enter the commit message you want associated with the files. Then, once all the commits you are ready to make are finished, use &lt;code&gt;git push&lt;/code&gt; to save the commits to GitHub.&lt;/p&gt;
&lt;div id=&#34;starting-a-repository-for-an-existing-project&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Starting a Repository for an Existing Project&lt;/h3&gt;
&lt;p&gt;There are only a few differences when setting up a repository for an existing project, compared to the steps previously described.&lt;/p&gt;
&lt;p&gt;Most importantly, after setting up a new repository on GitHub, the next screen will list a number of options. If you are setting up a repository for an existing project and hoping to commit locally saved files, you will first need to &lt;code&gt;cd&lt;/code&gt; into the locally existing project folder. Then use the instructions that appear on the GitHub website under the header “create a new repository on the command line”. In the screenshot from the example below, the repository I created is called “test”.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-04-17-git-to-know-git_files/Existing%20Project.png&#34; width=&#34;500&#34; /&gt;

&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;git-ignore&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Git Ignore&lt;/h2&gt;
&lt;p&gt;Git ignore files are important for both new and old project repositories. A &lt;code&gt;.gitignore&lt;/code&gt; file is a plain text file that specifies which files or file types Git should ignore, meaning they will not appear in the list &lt;code&gt;git status&lt;/code&gt; provides of local files that are not currently saved to GitHub. This is especially important when creating a repository for an existing project, since there will be some existing local files that you will not want to include in the repository, for example, large files that are unnecessary to keep on the repository long-term. With a new project repository, you do not need to start with an extensive &lt;code&gt;.gitignore&lt;/code&gt; file; you can edit it as the project evolves, since it will become clearer over time which file types you do not want to include.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Open Terminal (Mac) or Git Bash (Windows)
touch .gitignore # Creates git ignore file
# Open the file to edit, then commit the file to your GitHub repository&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An example of a git ignore file is below. As demonstrated, an asterisk can be used to designate entire file types to ignore. For example, adding &lt;code&gt;*.zip&lt;/code&gt; would ignore any zip files that are saved locally when using &lt;code&gt;git status&lt;/code&gt; to determine the differences between local files and the repository on GitHub. &lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-04-17-git-to-know-git_files/git%20ignore.png&#34; width=&#34;350&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;summary&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;The general steps for saving files from your local directory to GitHub are&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git add -&amp;gt; git commit -&amp;gt; git push&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Use &lt;code&gt;git pull&lt;/code&gt; to download changes from GitHub so that your local directory matches the repository.&lt;/p&gt;
&lt;p&gt;This post was written as a brief introduction to Git and GitHub for individuals interested in incorporating Git into their work; it is by no means comprehensive. For more detailed information on Git and GitHub, &lt;a href=&#34;http://happygitwithr.com&#34;&gt;Happy Git and GitHub for the useR&lt;/a&gt; is a great resource.&lt;/p&gt;
&lt;p&gt;Hopefully this post helped familiarize you with some of the basic concepts behind Git and GitHub. Feel free to leave questions or share your story in the comments!&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Introduction to Scraping and Wrangling Tables from Research Articles</title>
      <link>http://LieberInstitute.github.io/rstatsclub/2018/03/19/introduction-to-scraping-and-wranging-tables-from-research-articles/</link>
      <pubDate>Mon, 19 Mar 2018 00:00:00 +0000</pubDate>
      <guid>http://LieberInstitute.github.io/rstatsclub/2018/03/19/introduction-to-scraping-and-wranging-tables-from-research-articles/</guid>
      <description>


&lt;p&gt;By &lt;a href=&#34;https://www.libd.org/team/Stephen-Semick/&#34;&gt;Steve Semick&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;What do you do when you want to use results from the literature to anchor your own analysis? When these results are in the form of an easily accessible table, such as a .csv or .xlsx file, it is simple enough to download them and incorporate them into your research. Often, however, published findings are not so easy to handle. Today, we’ll go through a practical scenario: scraping an HTML table from a &lt;a href=&#34;https://www.nature.com/articles/ng.2802/&#34;&gt;Nature Genetics article&lt;/a&gt; into R and wrangling the data into a useful format. This is what the online table we want to scrape looks like:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-19-introduction-to-scraping-and-wranging-scientific-research-articles_files/blogpost01_table2_ng_screenshot.png&#34; width=&#34;800&#34; /&gt;

&lt;/div&gt;
&lt;div id=&#34;example-1a-scraping-a-html-table-from-a-webpage&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Example 1A: Scraping an HTML table from a webpage&lt;/h3&gt;
&lt;p&gt;Sometimes a table is online as part of a research article but can’t be easily coerced into a useful format. You can’t copy and paste the table into Excel and it’s not stored elsewhere. In these situations you can use the handy R package &lt;code&gt;rvest&lt;/code&gt; to scrape it into R from the webpage.&lt;/p&gt;
&lt;p&gt;First load the &lt;code&gt;rvest&lt;/code&gt; package to scrape the table.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(&amp;quot;rvest&amp;quot;)
library(&amp;quot;knitr&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, get the url of the webpage where the table is stored.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;url &amp;lt;- &amp;quot;https://www.nature.com/articles/ng.2802/tables/2&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, the trickiest part of the process: we need to find where the table lives on this webpage. An excellent guide on how to do this can be found in &lt;a href=&#34;http://blog.corynissen.com/2015/01/using-rvest-to-scrape-html-table.html&#34;&gt;Cory Nissen’s blogpost&lt;/a&gt; – this is also where I learned about using &lt;code&gt;rvest&lt;/code&gt; to scrape HTML tables. Once you have copied the table location, the rest is easy!&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table_path &amp;lt;- &amp;#39;//*[@id=&amp;quot;content&amp;quot;]/div/div/figure/div[1]/div/div[1]/table&amp;#39;
nature_genetics_table2 &amp;lt;- url %&amp;gt;%
  read_html() %&amp;gt;%
  html_nodes(xpath = table_path) %&amp;gt;%
  html_table(fill = TRUE)
nature_genetics_table2 &amp;lt;- nature_genetics_table2[[1]]&lt;/code&gt;&lt;/pre&gt;
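&lt;p&gt;If hunting down the exact XPath proves difficult, it can also help to grab every table node on the page and inspect the candidates by position. This is just a sketch (the variable name &lt;code&gt;all_tables&lt;/code&gt; is mine), assuming the table we want is among those returned:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Grab every table node on the page as a list of data.frames
all_tables &amp;lt;- url %&amp;gt;%
  read_html() %&amp;gt;%
  html_nodes(&amp;quot;table&amp;quot;) %&amp;gt;%
  html_table(fill = TRUE)
length(all_tables) ## how many tables did we find?&lt;/code&gt;&lt;/pre&gt;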
&lt;p&gt;This is what the first few lines of our scraped product look like:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;kable(nature_genetics_table2[1:4,])&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;SNPa&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Chr.&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Positionb&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Closest genec&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Major/minor alleles&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;MAFd&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Stage 1&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Stage 1&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Stage 2&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Stage 2&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Overall&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Overall&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Overall&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;SNPa&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Chr.&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Positionb&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Closest genec&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Major/minor alleles&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;MAFd&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;OR (95% CI)e&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Meta P value&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;OR (95% CI)e&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Meta P value&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;OR (95% CI)e&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Meta P value&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;I2 (%), P valuef&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Known GWAS-defined associated genes&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Known GWAS-defined associated genes&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Known GWAS-defined associated genes&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Known GWAS-defined associated genes&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Known GWAS-defined associated genes&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Known GWAS-defined associated genes&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Known GWAS-defined associated genes&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Known GWAS-defined associated genes&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Known GWAS-defined associated genes&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Known GWAS-defined associated genes&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Known GWAS-defined associated genes&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Known GWAS-defined associated genes&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Known GWAS-defined associated genes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;rs6656401&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;207692049&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;CR1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;G/A&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.197&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.17 (1.12–1.22)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;7.7 × 10−15&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.21 (1.14–1.28)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;7.9 × 10−11&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.18 (1.14–1.22)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;5.7 × 10−24&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0, 7.8 × 10−1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;rs6733839&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;127892810&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;BIN1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;C/T&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.409&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.21 (1.17–1.25)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.7 × 10−26&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.24 (1.18–1.29)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;3.4 × 10−19&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.22 (1.18–1.25)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;6.9 × 10−44&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;28, 6.1 × 10−2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;While this table has the information we want, it is clearly still a mess. Which brings us to…&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;example-1b-making-messy-data-useful&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Example 1B: Making messy data useful&lt;/h3&gt;
&lt;p&gt;Fortunately, the table has the right number of columns, but there are lines of text breaking it up, each stretching across an entire row. One of these is shown above and there are two others (not shown), so getting these rows into a better data format will be our first task.&lt;/p&gt;
&lt;div id=&#34;cleaning-up-the-rows&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Cleaning up the rows&lt;/h4&gt;
&lt;p&gt;We could look at the table and see these lines are on rows 2, 12, and 18. But let’s do this instead using some R code. The trick here is to note that all the elements of these rows contain the exact same text.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;v &amp;lt;- which(apply(nature_genetics_table2, 1, function(x) length(unique(unlist(x)))) == 1)&lt;/code&gt;&lt;/pre&gt;
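&lt;p&gt;To see why this trick works, consider a toy two-row example (the data here are made up for illustration): only the row whose cells all repeat the same text has exactly one unique value.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;toy &amp;lt;- data.frame(a = c(&amp;quot;rs123&amp;quot;, &amp;quot;some description&amp;quot;),
                 b = c(&amp;quot;1.17&amp;quot;, &amp;quot;some description&amp;quot;))
apply(toy, 1, function(x) length(unique(unlist(x))))
## [1] 2 1&lt;/code&gt;&lt;/pre&gt;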
&lt;p&gt;Great, now let’s get rid of these rows but retain the information. We are going to do this in three steps. First, we will split the data.frame into a list based on the location of these descriptions (rows 2, 12, and 18). Then, we will clean this list up by keeping only the list elements with data. We will move the text taking up entire rows to an additional column titled “Description”. Lastly, we will concatenate this cleaned list back into a data.frame.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nature_genetics_table2_list &amp;lt;- split(nature_genetics_table2, cumsum(1:nrow(nature_genetics_table2) %in% v))
nature_genetics_table2_list &amp;lt;- lapply(nature_genetics_table2_list[2:4], function(y) {
  y$Description &amp;lt;- unique(as.character(y[1, ]))
  y[-1, ]
})

nature_genetics_table2_clean &amp;lt;- do.call(&amp;quot;rbind&amp;quot;, nature_genetics_table2_list)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s look at our data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;kable(nature_genetics_table2_clean[1:3,])&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;SNPa&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Chr.&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Positionb&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Closest genec&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Major/minor alleles&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;MAFd&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Stage 1&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Stage 1&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Stage 2&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Stage 2&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Overall&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Overall&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Overall&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;1.3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;rs6656401&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;207692049&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;CR1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;G/A&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.197&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.17 (1.12–1.22)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;7.7 × 10−15&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.21 (1.14–1.28)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;7.9 × 10−11&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.18 (1.14–1.22)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;5.7 × 10−24&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0, 7.8 × 10−1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Known GWAS-defined associated genes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td&gt;1.4&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;rs6733839&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;127892810&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;BIN1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;C/T&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.409&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.21 (1.17–1.25)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.7 × 10−26&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.24 (1.18–1.29)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;3.4 × 10−19&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.22 (1.18–1.25)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;6.9 × 10−44&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;28, 6.1 × 10−2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Known GWAS-defined associated genes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;1.5&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;rs10948363&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;47487762&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;CD2AP&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;A/G&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.266&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.10 (1.07–1.14)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;3.1 × 10−8&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.09 (1.04–1.15)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;4.1 × 10−4&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.10 (1.07–1.13)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;5.2 × 10−11&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0, 9 × 10−1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Known GWAS-defined associated genes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;fixing-column-names&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Fixing column names&lt;/h4&gt;
&lt;p&gt;It’s getting better but is still messy. Let’s clean up those column names. This part we will do by hand.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;colnames(nature_genetics_table2_clean) &amp;lt;- c(&amp;quot;SNP&amp;quot;, &amp;quot;Chr&amp;quot;, &amp;quot;Position&amp;quot;, &amp;quot;Closest gene&amp;quot;, &amp;quot;Major/minor alleles&amp;quot;, &amp;quot;MAF&amp;quot;, &amp;quot;Stage1_OR&amp;quot;, &amp;quot;Stage1_MetaP&amp;quot;, &amp;quot;Stage2_OR&amp;quot;, &amp;quot;Stage2_MetaP&amp;quot;, &amp;quot;Overall_OR&amp;quot;, &amp;quot;Overall_MetaP&amp;quot;, &amp;quot;I2_Percent/P&amp;quot;, &amp;quot;Description&amp;quot;)
colnames(nature_genetics_table2_clean)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;SNP&amp;quot;                 &amp;quot;Chr&amp;quot;                 &amp;quot;Position&amp;quot;           
##  [4] &amp;quot;Closest gene&amp;quot;        &amp;quot;Major/minor alleles&amp;quot; &amp;quot;MAF&amp;quot;                
##  [7] &amp;quot;Stage1_OR&amp;quot;           &amp;quot;Stage1_MetaP&amp;quot;        &amp;quot;Stage2_OR&amp;quot;          
## [10] &amp;quot;Stage2_MetaP&amp;quot;        &amp;quot;Overall_OR&amp;quot;          &amp;quot;Overall_MetaP&amp;quot;      
## [13] &amp;quot;I2_Percent/P&amp;quot;        &amp;quot;Description&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;making-a-character-variable-into-a-numeric-variable&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Making a character variable into a numeric variable&lt;/h4&gt;
&lt;p&gt;It’s coming along. Next, we need to make the numbers into, well, numbers. A numeric class will be more useful in R than a character class of data. To do this, try using the &lt;code&gt;as.numeric&lt;/code&gt; function as shown.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;as.numeric(nature_genetics_table2_clean$Stage1_MetaP)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: NAs introduced by coercion&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It doesn’t work because there are weird symbols that R doesn’t understand. Get rid of them with the &lt;code&gt;gsub&lt;/code&gt; command and replace them with an E (scientific notation).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nature_genetics_table2_clean$Stage1_MetaP = gsub(&amp;quot; × 10&amp;quot;,&amp;quot;E&amp;quot;,nature_genetics_table2_clean$Stage1_MetaP)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now try converting to a numeric.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;as.numeric(nature_genetics_table2_clean$Stage1_MetaP)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: NAs introduced by coercion&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It still doesn’t work! Take a second, closer look at the data. Can you discern why the code failed?&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nature_genetics_table2_clean$Stage1_MetaP&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;7.7E-15&amp;quot; &amp;quot;1.7E-26&amp;quot; &amp;quot;3.1E-8&amp;quot;  &amp;quot;8.8E-10&amp;quot; &amp;quot;9.6E-17&amp;quot; &amp;quot;2.8E-11&amp;quot; &amp;quot;6.5E-16&amp;quot;
##  [8] &amp;quot;1.7E-9&amp;quot;  &amp;quot;5.1E-8&amp;quot;  &amp;quot;1.6E-8&amp;quot;  &amp;quot;3.3E-9&amp;quot;  &amp;quot;5.0E-11&amp;quot; &amp;quot;1.5E-7&amp;quot;  &amp;quot;4.6E-8&amp;quot; 
## [15] &amp;quot;9.6E-5&amp;quot;  &amp;quot;2.5E-6&amp;quot;  &amp;quot;1.3E-5&amp;quot;  &amp;quot;7.4E-6&amp;quot;  &amp;quot;6.7E-6&amp;quot;  &amp;quot;1.0E-5&amp;quot;  &amp;quot;1.6E-6&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Yep, that’s right: the &lt;code&gt;−&lt;/code&gt; symbol is not in fact the same as a minus symbol &lt;code&gt;-&lt;/code&gt;! We need to replace it. We’ll use the fact that this symbol always appears immediately after a capital E to our advantage.&lt;/p&gt;
&lt;p&gt;Split the string on the E using &lt;code&gt;strsplit&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;s = strsplit(nature_genetics_table2_clean$Stage1_MetaP, &amp;quot;E&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Get the first and second half of each string.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;firstStr &amp;lt;- lapply(s, `[[`, 1)
secondStr &amp;lt;- lapply(s, `[[`, 2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Knock off that first character, which is the symbol we don’t want, and slap a minus sign back on.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;secondStr &amp;lt;- lapply(secondStr, function(x) paste0(&amp;quot;E-&amp;quot;, substring(x, 2)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, stitch the two parts of the string back together now that the minus sign has been corrected and convert it to numeric.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mapply(function(firstStr, secondStr) {
  as.numeric(paste0(firstStr, secondStr))
}, firstStr, secondStr)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] 7.7e-15 1.7e-26 3.1e-08 8.8e-10 9.6e-17 2.8e-11 6.5e-16 1.7e-09
##  [9] 5.1e-08 1.6e-08 3.3e-09 5.0e-11 1.5e-07 4.6e-08 9.6e-05 2.5e-06
## [17] 1.3e-05 7.4e-06 6.7e-06 1.0e-05 1.6e-06&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It works! Make sure to replace the column in the data.frame.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nature_genetics_table2_clean$Stage1_MetaP &amp;lt;- mapply(function(firstStr, secondStr) {
  as.numeric(paste0(firstStr, secondStr))
}, firstStr, secondStr)&lt;/code&gt;&lt;/pre&gt;
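&lt;p&gt;As an aside, once you know the culprit, the whole conversion can also be done in one step with a second &lt;code&gt;gsub&lt;/code&gt; that swaps the stray symbol for a regular minus sign (shown here on a single value for illustration):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;as.numeric(gsub(&amp;quot;−&amp;quot;, &amp;quot;-&amp;quot;, gsub(&amp;quot; × 10&amp;quot;, &amp;quot;E&amp;quot;, &amp;quot;7.7 × 10−15&amp;quot;)))
## [1] 7.7e-15&lt;/code&gt;&lt;/pre&gt;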
&lt;p&gt;See how it appears in the table now as a numeric? Try wrangling the rest of these columns into a useful format on your own and let me know how it goes.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;kable(nature_genetics_table2_clean[1:3,])&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;SNP&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Chr&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Position&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Closest gene&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Major/minor alleles&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;MAF&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Stage1_OR&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Stage1_MetaP&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Stage2_OR&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Stage2_MetaP&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Overall_OR&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Overall_MetaP&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;I2_Percent/P&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;1.3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;rs6656401&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;207692049&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;CR1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;G/A&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.197&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.17 (1.12–1.22)&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.21 (1.14–1.28)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;7.9 × 10−11&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.18 (1.14–1.22)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;5.7 × 10−24&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0, 7.8 × 10−1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Known GWAS-defined associated genes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td&gt;1.4&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;rs6733839&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;127892810&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;BIN1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;C/T&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.409&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.21 (1.17–1.25)&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.24 (1.18–1.29)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;3.4 × 10−19&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.22 (1.18–1.25)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;6.9 × 10−44&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;28, 6.1 × 10−2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Known GWAS-defined associated genes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;1.5&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;rs10948363&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;47487762&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;CD2AP&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;A/G&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.266&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.10 (1.07–1.14)&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.09 (1.04–1.15)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;4.1 × 10−4&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.10 (1.07–1.13)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;5.2 × 10−11&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0, 9 × 10−1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Known GWAS-defined associated genes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
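&lt;p&gt;To get you started on that exercise, here is one way to bundle the steps above into a small reusable helper. This is just a sketch: the function name &lt;code&gt;fix_meta_p&lt;/code&gt; is mine, and it assumes, as in this table, that every exponent is negative.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Convert a character column like &amp;quot;7.9 × 10−11&amp;quot; into numeric values
fix_meta_p &amp;lt;- function(p) {
  p &amp;lt;- gsub(&amp;quot; × 10&amp;quot;, &amp;quot;E&amp;quot;, p)
  s &amp;lt;- strsplit(p, &amp;quot;E&amp;quot;)
  ## Drop the stray symbol after the E and restore a real minus sign
  sapply(s, function(x) as.numeric(paste0(x[[1]], &amp;quot;E-&amp;quot;, substring(x[[2]], 2))))
}
nature_genetics_table2_clean$Stage2_MetaP &amp;lt;- fix_meta_p(nature_genetics_table2_clean$Stage2_MetaP)&lt;/code&gt;&lt;/pre&gt;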
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusions&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Conclusions&lt;/h3&gt;
&lt;p&gt;Today we went through getting data directly into R from an HTML table (a table on a webpage) and demonstrated how to get the data into a useful format. There are a couple of advantages to doing this work in R instead of using Excel – if that’s even an option. First, R is more reproducible. Second, once you have code for wrangling one table, wrangling the next one will be much faster. At the end of the day though, it is always easiest when the table is shared as a .csv or Excel file; something to keep in mind when preparing your own papers.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;## Session info ----------------------------------------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  setting  value                       
##  version  R version 3.4.4 (2018-03-15)
##  system   x86_64, mingw32             
##  ui       RTerm                       
##  language (EN)                        
##  collate  English_United States.1252  
##  tz       America/New_York            
##  date     2018-04-20&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Packages --------------------------------------------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  package   * version date       source                           
##  backports   1.1.2   2017-12-13 CRAN (R 3.4.3)                   
##  base      * 3.4.4   2018-03-15 local                            
##  blogdown    0.5.12  2018-03-23 Github (rstudio/blogdown@21f14af)
##  bookdown    0.7     2018-02-18 CRAN (R 3.4.3)                   
##  compiler    3.4.4   2018-03-15 local                            
##  curl        3.1     2017-12-12 CRAN (R 3.4.3)                   
##  datasets  * 3.4.4   2018-03-15 local                            
##  devtools  * 1.13.5  2018-02-18 CRAN (R 3.4.3)                   
##  digest      0.6.15  2018-01-28 CRAN (R 3.4.3)                   
##  evaluate    0.10.1  2017-06-24 CRAN (R 3.4.3)                   
##  graphics  * 3.4.4   2018-03-15 local                            
##  grDevices * 3.4.4   2018-03-15 local                            
##  highr       0.6     2016-05-09 CRAN (R 3.4.3)                   
##  htmltools   0.3.6   2017-04-28 CRAN (R 3.4.3)                   
##  httr        1.3.1   2017-08-20 CRAN (R 3.4.3)                   
##  knitr     * 1.20    2018-02-20 CRAN (R 3.4.3)                   
##  magrittr    1.5     2014-11-22 CRAN (R 3.4.3)                   
##  memoise     1.1.0   2017-04-21 CRAN (R 3.4.3)                   
##  methods   * 3.4.4   2018-03-15 local                            
##  R6          2.2.2   2017-06-17 CRAN (R 3.4.3)                   
##  Rcpp        0.12.16 2018-03-13 CRAN (R 3.4.4)                   
##  rmarkdown   1.9     2018-03-01 CRAN (R 3.4.3)                   
##  rprojroot   1.3-2   2018-01-03 CRAN (R 3.4.3)                   
##  rvest     * 0.3.2   2016-06-17 CRAN (R 3.4.3)                   
##  selectr     0.3-2   2018-03-05 CRAN (R 3.4.3)                   
##  stats     * 3.4.4   2018-03-15 local                            
##  stringi     1.1.7   2018-03-12 CRAN (R 3.4.4)                   
##  stringr     1.3.0   2018-02-19 CRAN (R 3.4.3)                   
##  tools       3.4.4   2018-03-15 local                            
##  utils     * 3.4.4   2018-03-15 local                            
##  withr       2.1.2   2018-03-15 CRAN (R 3.4.4)                   
##  xfun        0.1     2018-01-22 CRAN (R 3.4.3)                   
##  xml2      * 1.2.0   2018-01-24 CRAN (R 3.4.3)                   
##  yaml        2.1.18  2018-03-08 CRAN (R 3.4.3)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Edit your bashrc file for a nicer terminal experience</title>
      <link>http://LieberInstitute.github.io/rstatsclub/2018/03/11/edit-your-bashrc-file-for-a-nicer-terminal-experience/</link>
      <pubDate>Sun, 11 Mar 2018 00:00:00 +0000</pubDate>
      <guid>http://LieberInstitute.github.io/rstatsclub/2018/03/11/edit-your-bashrc-file-for-a-nicer-terminal-experience/</guid>
      <description>


&lt;p&gt;By &lt;a href=&#34;http://lcolladotor.github.io&#34;&gt;L. Collado-Torres&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you are working at LIBD or with large data, it’s very likely that it won’t fit in your laptop and that you’ll be using the terminal to interact with a high performance computing cluster (like &lt;a href=&#34;http://www.jhpce.jhu.edu/&#34;&gt;JHPCE&lt;/a&gt;) or server. Some small edits to your bash configuration file can make your terminal experience much more enjoyable and hopefully boost your productivity. The edits described below work for any OS. On Windows, I’m assuming that you are using &lt;code&gt;git bash&lt;/code&gt; or a similar terminal program.&lt;/p&gt;
&lt;p&gt;The way we can control our terminal appearance and some behavior is through the &lt;code&gt;.bashrc&lt;/code&gt; file. That file typically gets read once when loading a new terminal window and that is where we can save some shortcuts we like to use, alter the colors of our terminal, change the behavior of the up and down arrow keys, etc.&lt;/p&gt;
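&lt;p&gt;Note that after editing &lt;code&gt;~/.bashrc&lt;/code&gt; you don’t have to open a new terminal window to try out your changes; you can re-read the file in your current session:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;## Re-read the configuration file in the current session
source ~/.bashrc&lt;/code&gt;&lt;/pre&gt;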
&lt;div id=&#34;bashrc-file&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;&lt;code&gt;.bashrc&lt;/code&gt; file&lt;/h3&gt;
&lt;p&gt;First, we need to learn where to locate this file. On all OS (Mac, Windows, Linux) machines/servers, the &lt;code&gt;.bashrc&lt;/code&gt; file typically lives at &lt;code&gt;~/.bashrc&lt;/code&gt;. For my Mac, for example, that is &lt;code&gt;/home/lcollado/.bashrc&lt;/code&gt;. For my Windows machine, that’s &lt;code&gt;/c/Users/Leonardo/.bashrc&lt;/code&gt;. Now, the dot before the file name makes it a &lt;em&gt;hidden file&lt;/em&gt;. &lt;a href=&#34;http://lmgtfy.com/?q=show+hidden+files&#34;&gt;A quick search can help you&lt;/a&gt; find the options for your computer that let you see these hidden files. From a terminal window, I typically use this bash command to show all the files, including hidden ones (that’s the &lt;code&gt;a&lt;/code&gt; option).&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;## List files in human readable format including hidden files
ls -lha&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;initial-files&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Initial files&lt;/h3&gt;
&lt;p&gt;You might have a &lt;code&gt;~/.bashrc&lt;/code&gt; and a &lt;code&gt;~/.bash_profile&lt;/code&gt; already. If not, let’s create simple ones. You can use the &lt;code&gt;touch&lt;/code&gt; bash command to make a new file (&lt;code&gt;touch ~/.bashrc&lt;/code&gt;), or you could use this R code:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;file.create(&amp;#39;~/.bashrc&amp;#39;)
file.create(&amp;#39;~/.bash_profile&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next open them with your text editor (say Notepad++, TextMate 2, RStudio, among others) and paste the following contents.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;~/.bash_profile&lt;/code&gt; contents&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Get the aliases and functions
if [ -f ~/.bashrc ]; then
        . ~/.bashrc
fi

# In my Mac one I also have this:
if [ -f ~/.profile ]; then
        . ~/.profile
fi&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Minimal &lt;code&gt;~/.bashrc&lt;/code&gt; contents&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;control-your-bash-history&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Control your bash history&lt;/h3&gt;
&lt;p&gt;Let’s start adding features to our terminal experience by editing the &lt;code&gt;~/.bashrc&lt;/code&gt; file. I typically include comments &lt;code&gt;#&lt;/code&gt; describing what the code is doing and where I learned how to do &lt;span class=&#34;math inline&#34;&gt;\(X\)&lt;/span&gt;. The first part is controlling your bash history. I want to have a longer history than what is included by default, with duplicates deleted.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;# http://www.biostat.jhsph.edu/~afisher/ComputingClub/webfiles/KasperHansenPres/IntermediateUnix.pdf
# https://unix.stackexchange.com/questions/48713/how-can-i-remove-duplicates-in-my-bash-history-preserving-order
export HISTCONTROL=ignoreboth:erasedups
export HISTSIZE=10000
shopt -s histappend
shopt -s cmdhist&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;change-the-up-and-down-arrows&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Change the up and down arrows&lt;/h3&gt;
&lt;p&gt;The next change will save you a lot of time! Plus it goes nicely with the bash history changes we just made. Normally, the up and down arrow let you select previous commands from your bash history (up) or select one of your latest commands (down, after having used up). The following changes make it so that the up arrow searches only commands that start with exactly the letters you had already typed.&lt;/p&gt;
&lt;p&gt;Let’s say that you just requested a compute node with &lt;code&gt;qrsh&lt;/code&gt; and you have an empty line.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-11-edit-your-bashrc-file-for-a-nicer-terminal-experience_files/Screen Shot 2018-03-11 at 7.38.34 PM.png&#34; alt=&#34;qrsh&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;p&gt;If you use the &lt;code&gt;up&lt;/code&gt; arrow, you can navigate your command history. So far, this is the same as the default &lt;code&gt;up&lt;/code&gt; arrow behavior.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-11-edit-your-bashrc-file-for-a-nicer-terminal-experience_files/Screen Shot 2018-03-11 at 7.38.48 PM.png&#34; alt=&#34;empty up&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;p&gt;Let’s say that I want to change directory to one of my recent projects, so I type &lt;code&gt;cd /&lt;/code&gt; in the terminal window (without hitting enter).&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-11-edit-your-bashrc-file-for-a-nicer-terminal-experience_files/Screen Shot 2018-03-11 at 7.39.16 PM.png&#34; alt=&#34;empty cd&#34; width=&#34;300&#34;/&gt;&lt;/p&gt;
&lt;p&gt;Next I use the &lt;code&gt;up&lt;/code&gt; arrow, and it only finds commands that start with &lt;code&gt;cd /&lt;/code&gt;, including this long one.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-11-edit-your-bashrc-file-for-a-nicer-terminal-experience_files/Screen Shot 2018-03-11 at 7.39.52 PM.png&#34; alt=&#34;cd and up&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;p&gt;Did you like this? Well, add the following code to your &lt;code&gt;~/.bashrc&lt;/code&gt; file:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;# Auto-complete command from history
# http://lindesk.com/2009/04/customize-terminal-configuration-setting-bash-cli-power-user/
export INPUTRC=~/.inputrc&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;where &lt;code&gt;~/.inputrc&lt;/code&gt; file has the following contents:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;# Up/down arrow keys
&amp;quot;\e[B&amp;quot;: history-search-forward
&amp;quot;\e[A&amp;quot;: history-search-backward

$include /etc/inputrc&lt;/code&gt;&lt;/pre&gt;
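&lt;p&gt;If you prefer to create the &lt;code&gt;~/.inputrc&lt;/code&gt; file from the command line, here is one sketch: it writes the contents to a temporary file first, so you can review it before copying it over your real &lt;code&gt;~/.inputrc&lt;/code&gt;.&lt;/p&gt;

```shell
# Write the inputrc contents to a temporary file for review; copy it to
# ~/.inputrc once you are happy with it. printf '%s\n' keeps the
# backslashes literal, which is what readline expects.
inputrc=$(mktemp)
printf '%s\n' \
    '"\e[B": history-search-forward' \
    '"\e[A": history-search-backward' \
    '' \
    '$include /etc/inputrc' | tee "$inputrc"
```

&lt;p&gt;Then &lt;code&gt;cp&lt;/code&gt; the temporary file to &lt;code&gt;~/.inputrc&lt;/code&gt;.&lt;/p&gt;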
&lt;p&gt;As an added benefit, the up and down arrows will also have this improved behavior when you run &lt;code&gt;R&lt;/code&gt; inside a terminal, although it’s limited to your current R session’s history. You could likely configure your &lt;code&gt;.Rprofile&lt;/code&gt; to load your previous R history.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;interactive-deleting-of-files&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Interactive deleting of files&lt;/h3&gt;
&lt;p&gt;In a terminal you normally delete files with &lt;code&gt;rm&lt;/code&gt;, but you can make an alias (shortcut) so that deleting files with &lt;code&gt;rmi&lt;/code&gt; asks you to confirm each deletion. This is useful when you use a pattern to find the files you are trying to delete and want to make sure the pattern didn’t catch other files you want to keep.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;# http://superuser.com/questions/384769/alias-rm-rm-i-considered-harmful
alias rmi=&amp;#39;rm -i&amp;#39;&lt;/code&gt;&lt;/pre&gt;
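&lt;p&gt;Here is a quick demo you can paste in a terminal to see the confirmation prompt in action (the temporary file name is just an example):&lt;/p&gt;

```shell
# rmi expands to rm -i, which asks before each removal; answering "n"
# leaves the file alone. Here we feed "n" through a pipe instead of
# typing it.
tmpdir=$(mktemp -d)
touch "$tmpdir/keep.txt"
echo n | rm -i "$tmpdir/keep.txt"
if test -f "$tmpdir/keep.txt"; then
    survived=yes
else
    survived=no
fi
echo "survived: $survived"
rm -r "$tmpdir"
```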
&lt;/div&gt;
&lt;div id=&#34;change-the-command-prompt&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Change the command prompt&lt;/h3&gt;
&lt;p&gt;You can also control the command prompt, that is, the parts shown before you start typing in your terminal. I like keeping it short, so mine only shows the current directory’s name instead of the full path, plus the time (hh:mm) on a 24-hour clock. This is sometimes useful if I run some commands and later want a quick idea of whether any of them took a while to run (especially if I was not looking at the terminal).&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;# Change command prompt
# http://www.cyberciti.biz/tips/howto-linux-unix-bash-shell-setup-prompt.html
# http://www.cyberciti.biz/faq/bash-shell-change-the-color-of-my-shell-prompt-under-linux-or-unix/
# https://bbs.archlinux.org/viewtopic.php?id=48910
# previous in enigma2: &amp;quot;[\u@\h \W]\$ &amp;quot;
# previously in mac: &amp;quot;\h:\W \u\$ &amp;quot;
export PS1=&amp;quot;\[\e[0;33m\]\A \W \$ \[\e[m\]&amp;quot;&lt;/code&gt;&lt;/pre&gt;
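&lt;p&gt;For reference, here is what each piece of that prompt string does (written with single quotes here so the backslash escapes survive quoting; the &lt;code&gt;\[ \]&lt;/code&gt; pairs mark non-printing characters so bash can compute the prompt width correctly):&lt;/p&gt;

```shell
# Same prompt as above, annotated:
#   \[\e[0;33m\]  start yellow text
#   \A            current time as HH:MM (24-hour clock)
#   \W            basename of the current working directory
#   \$            "$" for a regular user, "#" for root
#   \[\e[m\]      reset the colors back to normal
export PS1='\[\e[0;33m\]\A \W \$ \[\e[m\]'
```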
&lt;/div&gt;
&lt;div id=&#34;colors&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Colors&lt;/h3&gt;
&lt;p&gt;You can change the colors of your terminal. For example, you might want directories shown in blue and/or bold font while executable files are shown in red. This goes hand in hand with the &lt;code&gt;ls --color=auto&lt;/code&gt; shortcut to make sure that the colors are used (Mac: you might need &lt;code&gt;brew install coreutils&lt;/code&gt; as described &lt;a href=&#34;https://superuser.com/questions/183876/how-do-i-get-ls-color-auto-to-work-on-mac-os-x&#34;&gt;in this blog post&lt;/a&gt;). The following lines of my &lt;code&gt;~/.bashrc&lt;/code&gt; file preserve some history of the colors and options I used to have.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;# colors
# http://norbauer.com/notebooks/code/notes/ls-colors-and-terminal-app
# used BSD pattern ExGxFxDxBxEgEdxbxgxhxd on http://geoff.greer.fm/lscolors/
# that tool does not specify the colors, which I did by looking manually at
# http://blog.twistedcode.org/2008/04/lscolors-explained.html
# and the norbauer.com site previously mentioned
alias ls=&amp;quot;ls --color=auto&amp;quot;
#export LS_COLORS=&amp;quot;di=1;34;40:ln=1;36;40:so=1;35;40:pi=1;93;40:ex=1;31;40:bd=1;34;46:cd=1;34;43:su=0;41:sg=0;46:tw=0;47:ow=0;43&amp;quot;
## After switching to RStudio:
# https://askubuntu.com/questions/466198/how-do-i-change-the-color-for-directories-with-ls-in-the-console
export LS_COLORS=&amp;quot;di=0;32:ln=0;36:so=0;35:pi=0;93:ex=0;31:bd=0;34;46:cd=0;34;43:su=0;41:sg=0;46:tw=0;47:ow=0;43:fi=0;33&amp;quot;&lt;/code&gt;&lt;/pre&gt;
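&lt;p&gt;In case you want to build your own value, &lt;code&gt;LS_COLORS&lt;/code&gt; is a colon-separated list of &lt;code&gt;type=style;color&lt;/code&gt; pairs. Here is a trimmed-down sketch with a few entries from the value above, decoded:&lt;/p&gt;

```shell
# A minimal LS_COLORS with the main entries decoded:
#   di=0;32   directories: green
#   ln=0;36   symbolic links: cyan
#   ex=0;31   executable files: red
#   fi=0;33   regular files: yellow
# (the leading 0 means normal weight; 1 would make the text bold)
export LS_COLORS="di=0;32:ln=0;36:ex=0;31:fi=0;33"
```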
&lt;p&gt;Mac extra lines:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;# Uncomment below for Mac and comment the two previous commands
#export CLICOLOR=1
#export LSCOLORS=&amp;quot;ExGxFxDxBxEgEdxbxgxhxd&amp;quot;
## Actually from https://superuser.com/questions/183876/how-do-i-get-ls-color-auto-to-work-on-mac-os-x
# brew install coreutils
# then change the alias to use gls instead of ls
# that way I can use the same config file =)

alias ls=&amp;quot;gls --color=auto&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I use the same &lt;code&gt;LS_COLORS&lt;/code&gt; now on my Mac too, but you don’t need to.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;UPDATE&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We got this note from Mark Miller, admin of &lt;a href=&#34;http://www.jhpce.jhu.edu/&#34;&gt;JHPCE&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;One quick note on your page. You mention setting colors for the ls output, which is great. One thing we (and others) have found is that, for a directory on a Lustre filesystem (/dcl01 or /dcl02), using “ls --color=auto” or “ls -al” on a directory with lots (thousands+) of files in it can be super slow. With these options, the ls command needs to iterate through each file in the directory, and query the lustre server for each and every file to retrieve information about the file in order to determine what color to display. So, if you’re regularly using directories on Lustre that have lots of files in them, and your “ls” command is taking too long, we recommend using “ls --color=none”. &lt;a href=&#34;https://wikis.nyu.edu/display/NYUHPC/Lustre+FAQ&#34; class=&#34;uri&#34;&gt;https://wikis.nyu.edu/display/NYUHPC/Lustre+FAQ&lt;/a&gt; &lt;a href=&#34;https://groups.google.com/forum/#!topic/lustre-discuss-list/3afjd4j2Q-g&#34; class=&#34;uri&#34;&gt;https://groups.google.com/forum/#!topic/lustre-discuss-list/3afjd4j2Q-g&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;shortcuts-for-main-project-directories&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Shortcuts for main project directories&lt;/h3&gt;
&lt;p&gt;We’ve seen several aliases (shortcuts) already, such as the one for &lt;code&gt;ls --color=auto&lt;/code&gt;, which is the one I use the most. But I also use aliases for changing to the root directories that I use &lt;em&gt;the&lt;/em&gt; most.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;alias labold=&amp;quot;cd /dcl01/lieber/ajaffe/lab&amp;quot;
alias lab=&amp;quot;cd /dcl01/ajaffe/data/lab&amp;quot;&lt;/code&gt;&lt;/pre&gt;
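&lt;p&gt;A small variant of those shortcuts is to use a function that warns you when the path has moved (the directory below is the same example as above; substitute your own):&lt;/p&gt;

```shell
# Like the "lab" alias, but complains instead of failing silently when
# the directory no longer exists.
lab() {
    dir=/dcl01/ajaffe/data/lab
    if [ -d "$dir" ]; then
        cd "$dir"
    else
        echo "not found: $dir"
    fi
}
```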
&lt;p&gt;Actually, we were supposed to just use the new disk here, and I should probably have chosen better names to differentiate the two.&lt;/p&gt;
&lt;p&gt;The next terminal window you open after editing the &lt;code&gt;~/.bashrc&lt;/code&gt; file will have all your new features enabled.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://media.giphy.com/media/10UeedrT5MIfPG/giphy.gif&#34;/&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://giphy.com/reactions/search/yay&#34;&gt;Source&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;extra&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Extra&lt;/h3&gt;
&lt;p&gt;Sometimes you might need to export other environment variables, such as &lt;code&gt;RMATE_PORT&lt;/code&gt; described in the &lt;code&gt;rmate&lt;/code&gt; setup &lt;a href=&#34;http://research.libd.org/rstatsclub/2018/03/11/textmate-setup-mac-only/&#34;&gt;post&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;repeat-this-process&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Repeat this process&lt;/h3&gt;
&lt;p&gt;You can/should repeat this process for other &lt;code&gt;~/.bashrc&lt;/code&gt; files you interact with. In my case, that would be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;~/.bashrc&lt;/code&gt; in my Mac laptop&lt;/li&gt;
&lt;li&gt;&lt;code&gt;~/.bashrc&lt;/code&gt; at my JHPCE home&lt;/li&gt;
&lt;li&gt;&lt;code&gt;~/.bashrc&lt;/code&gt; in my Windows laptop&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;acknowledgments&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Acknowledgments&lt;/h3&gt;
&lt;p&gt;This blog post was made possible thanks to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;http://bioconductor.org/packages/BiocStyle&#34;&gt;BiocStyle&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Oles_2018&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/Bioconductor/BiocStyle&#39;&gt;Oleś, Morgan, and Huber, 2018&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Xie_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/rstudio/blogdown&#39;&gt;Xie, Hill, and Thomas, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=devtools&#34;&gt;devtools&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Wickham_2018&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=devtools&#39;&gt;Wickham, Hester, and Chang, 2018&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;knitcitations&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Boettiger_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=knitcitations&#39;&gt;Boettiger, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Boettiger_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Boettiger_2017&#34;&gt;[1]&lt;/a&gt;&lt;cite&gt; C. Boettiger. &lt;em&gt;knitcitations: Citations for ‘Knitr’ Markdown Files&lt;/em&gt;. R package version 1.0.8. 2017. URL: &lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;https://CRAN.R-project.org/package=knitcitations&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Oles_2018&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Oles_2018&#34;&gt;[2]&lt;/a&gt;&lt;cite&gt; A. Oleś, M. Morgan and W. Huber. &lt;em&gt;BiocStyle: Standard styles for vignettes and other Bioconductor documents&lt;/em&gt;. R package version 2.8.2. 2018. URL: &lt;a href=&#34;https://github.com/Bioconductor/BiocStyle&#34;&gt;https://github.com/Bioconductor/BiocStyle&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Wickham_2018&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Wickham_2018&#34;&gt;[3]&lt;/a&gt;&lt;cite&gt; H. Wickham, J. Hester and W. Chang. &lt;em&gt;devtools: Tools to Make Developing R Packages Easier&lt;/em&gt;. R package version 1.13.6. 2018. URL: &lt;a href=&#34;https://CRAN.R-project.org/package=devtools&#34;&gt;https://CRAN.R-project.org/package=devtools&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Xie_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Xie_2017&#34;&gt;[4]&lt;/a&gt;&lt;cite&gt; Y. Xie, A. P. Hill and A. Thomas. &lt;em&gt;blogdown: Creating Websites with R Markdown&lt;/em&gt;. ISBN 978-0815363729. Boca Raton, Florida: Chapman and Hall/CRC, 2017. URL: &lt;a href=&#34;https://github.com/rstudio/blogdown&#34;&gt;https://github.com/rstudio/blogdown&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;## Session info ----------------------------------------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  setting  value                                      
##  version  R version 3.5.1 Patched (2018-10-14 r75439)
##  system   x86_64, darwin15.6.0                       
##  ui       X11                                        
##  language (EN)                                       
##  collate  en_US.UTF-8                                
##  tz       America/New_York                           
##  date     2018-10-26&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Packages --------------------------------------------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  package       * version date       source                            
##  backports       1.1.2   2017-12-13 cran (@1.1.2)                     
##  base          * 3.5.1   2018-10-15 local                             
##  bibtex          0.4.2   2017-06-30 CRAN (R 3.5.0)                    
##  BiocStyle     * 2.8.2   2018-05-30 Bioconductor                      
##  blogdown        0.8     2018-07-15 CRAN (R 3.5.0)                    
##  bookdown        0.7     2018-02-18 CRAN (R 3.5.0)                    
##  colorout      * 1.2-0   2018-05-03 Github (jalvesaq/colorout@c42088d)
##  compiler        3.5.1   2018-10-15 local                             
##  datasets      * 3.5.1   2018-10-15 local                             
##  devtools      * 1.13.6  2018-06-27 cran (@1.13.6)                    
##  digest          0.6.18  2018-10-10 CRAN (R 3.5.0)                    
##  evaluate        0.12    2018-10-09 CRAN (R 3.5.0)                    
##  graphics      * 3.5.1   2018-10-15 local                             
##  grDevices     * 3.5.1   2018-10-15 local                             
##  htmltools       0.3.6   2017-04-28 cran (@0.3.6)                     
##  httr            1.3.1   2017-08-20 CRAN (R 3.5.0)                    
##  jsonlite        1.5     2017-06-01 CRAN (R 3.5.0)                    
##  knitcitations * 1.0.8   2017-07-04 CRAN (R 3.5.0)                    
##  knitr           1.20    2018-02-20 cran (@1.20)                      
##  lubridate       1.7.4   2018-04-11 CRAN (R 3.5.0)                    
##  magrittr        1.5     2014-11-22 cran (@1.5)                       
##  memoise         1.1.0   2017-04-21 CRAN (R 3.5.0)                    
##  methods       * 3.5.1   2018-10-15 local                             
##  plyr            1.8.4   2016-06-08 cran (@1.8.4)                     
##  R6              2.3.0   2018-10-04 CRAN (R 3.5.0)                    
##  Rcpp            0.12.19 2018-10-01 CRAN (R 3.5.1)                    
##  RefManageR      1.2.0   2018-04-25 CRAN (R 3.5.0)                    
##  rmarkdown       1.10    2018-06-11 CRAN (R 3.5.0)                    
##  rprojroot       1.3-2   2018-01-03 cran (@1.3-2)                     
##  stats         * 3.5.1   2018-10-15 local                             
##  stringi         1.2.4   2018-07-20 CRAN (R 3.5.0)                    
##  stringr         1.3.1   2018-05-10 CRAN (R 3.5.0)                    
##  tools           3.5.1   2018-10-15 local                             
##  utils         * 3.5.1   2018-10-15 local                             
##  withr           2.1.2   2018-03-15 CRAN (R 3.5.0)                    
##  xfun            0.3     2018-07-06 CRAN (R 3.5.0)                    
##  xml2            1.2.0   2018-01-24 CRAN (R 3.5.0)                    
##  yaml            2.2.0   2018-07-25 CRAN (R 3.5.0)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Textmate setup (Mac only)</title>
      <link>http://LieberInstitute.github.io/rstatsclub/2018/03/11/textmate-setup-mac-only/</link>
      <pubDate>Sun, 11 Mar 2018 00:00:00 +0000</pubDate>
      <guid>http://LieberInstitute.github.io/rstatsclub/2018/03/11/textmate-setup-mac-only/</guid>
      <description>


&lt;p&gt;By &lt;a href=&#34;http://lcolladotor.github.io&#34;&gt;L. Collado-Torres&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For the past 6-7 years I have been using &lt;em&gt;TextMate 2&lt;/em&gt; as my text editor, which I’ve found useful for R code, bash, Markdown, etc. You could also look into &lt;em&gt;Sublime Text&lt;/em&gt; or use &lt;em&gt;RStudio&lt;/em&gt; (post about this setup coming soon).&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-11-textmate-setup-mac-only_files/Screen Shot 2018-03-11 at 1.37.28 PM.png&#34; alt=&#34;textmate look&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;p&gt;Sometimes students are interested in this setup, which is what I’ll document here. Though I want to highlight that you can get a very similar setup using other tools. Note that this setup only works for Mac computers.&lt;/p&gt;
&lt;div id=&#34;setup&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Setup&lt;/h3&gt;
&lt;p&gt;First, install &lt;a href=&#34;http://macromates.com/download&#34;&gt;TextMate 2&lt;/a&gt; for free. &lt;em&gt;TextMate&lt;/em&gt; allows users to contribute &lt;em&gt;bundles&lt;/em&gt; which are a set of tools that enhance the editor. For example, if you want to quickly insert an R code chunk in a &lt;code&gt;.Rmd&lt;/code&gt; file you can add a &lt;em&gt;command&lt;/em&gt; for it inside a bundle. You can also use a bundle to get the editor to recognize R code inside an R markdown code chunk. Probably the easiest way to get jump-started is to copy my exact setup.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;change-some-preferences&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Change some preferences&lt;/h3&gt;
&lt;p&gt;So next, go to the &lt;em&gt;preferences&lt;/em&gt; menu&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-11-textmate-setup-mac-only_files/Screen Shot 2018-03-11 at 1.41.05 PM.png&#34; alt=&#34;preferences&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;p&gt;and under bundle, choose the R bundle as shown below.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-11-textmate-setup-mac-only_files/Screen Shot 2018-03-11 at 1.40.54 PM.png&#34; alt=&#34;bundle&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;p&gt;As you can see, it hasn’t been updated in a while. I have made a few edits myself here and there which I’ll describe in the next section.&lt;/p&gt;
&lt;p&gt;You should also enable running &lt;em&gt;TextMate&lt;/em&gt; from the terminal.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-11-textmate-setup-mac-only_files/Screen Shot 2018-03-11 at 2.15.06 PM.png&#34; alt=&#34;enable terminal&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;p&gt;Finally, here are my main file preferences: I want my files to be &lt;code&gt;.R&lt;/code&gt; files by default and to use Linux line endings to avoid issues later on.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-11-textmate-setup-mac-only_files/Screen Shot 2018-03-11 at 2.39.09 PM.png&#34; alt=&#34;main prefs&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;adding-bundles-from-git&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Adding bundles from git&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;TextMate&lt;/em&gt; allows you to install bundles by adding the bundle files in a specific folder. The bundle files are most likely in a GitHub repository, so you just need to clone (download) them to where &lt;em&gt;TextMate&lt;/em&gt; expects them to be.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/lcolladotor/r.tmbundle&#34; class=&#34;uri&#34;&gt;https://github.com/lcolladotor/r.tmbundle&lt;/a&gt; for R and sending code to be evaluated in an iTerm2 terminal (setup explained later)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/noniq/Merge-Markers.tmbundle&#34; class=&#34;uri&#34;&gt;https://github.com/noniq/Merge-Markers.tmbundle&lt;/a&gt; for git merging&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/elia/base16-themes.tmbundle&#34; class=&#34;uri&#34;&gt;https://github.com/elia/base16-themes.tmbundle&lt;/a&gt; for theme colors&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/lcolladotor/criticmarkup.tmbundle&#34; class=&#34;uri&#34;&gt;https://github.com/lcolladotor/criticmarkup.tmbundle&lt;/a&gt; for CriticMarkup in Markdown files&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/lcolladotor/knitr.tmbundle&#34; class=&#34;uri&#34;&gt;https://github.com/lcolladotor/knitr.tmbundle&lt;/a&gt; for &lt;code&gt;knitr::knit()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/lcolladotor/markdown-redcarpet.tmbundle&#34; class=&#34;uri&#34;&gt;https://github.com/lcolladotor/markdown-redcarpet.tmbundle&lt;/a&gt; for basically running &lt;code&gt;rmarkdown::render()&lt;/code&gt; on the document at hand and previewing it live (if it’s an html doc). It also makes it so that R code inside code chunks will be recognized as such, enabling all the R code shortcuts.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;## Go to main bundle directory
cd ~/Library/Application\ Support/TextMate/

## Download Leonardo&amp;#39;s bundles (he uses the leo branch)
## For R, sending code to iTerm2
git clone https://github.com/lcolladotor/r.tmbundle.git

## For merging
git clone https://github.com/noniq/Merge-Markers.tmbundle.git

## For more color themes
git clone https://github.com/elia/base16-themes.tmbundle.git

## For commenting Markdown files
git clone https://github.com/lcolladotor/criticmarkup.tmbundle.git

## For knitr::knit()
git clone https://github.com/lcolladotor/knitr.tmbundle.git

## For rmarkdown::render()
git clone https://github.com/lcolladotor/markdown-redcarpet.tmbundle.git&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can see, these bundles help adapt &lt;em&gt;TextMate 2&lt;/em&gt; for working with R files of different flavors. But it’s not beginner friendly, hence the upcoming blog post about using RStudio.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;some-feature-highlights&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Some feature highlights&lt;/h3&gt;
&lt;p&gt;One of the features that I really like from &lt;em&gt;TextMate&lt;/em&gt; is searching/replacing (you can use regex) across all the files and sub-directories of a given directory. It’s very useful when trying to fix a common typo across different files or finding all the places where a function/argument was used. The latter is handy when you are looking at someone else’s code. It’s basically like searching inside a GitHub repository: for example, &lt;a href=&#34;https://github.com/rstudio/blogdown/search?utf8=%E2%9C%93&amp;amp;q=baseurl&amp;amp;type=&#34;&gt;search &lt;code&gt;baseurl&lt;/code&gt;&lt;/a&gt; in all of &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; for finding the code that reads it from a config file, which I did for this &lt;a href=&#34;https://github.com/rstudio/blogdown/pull/275&#34;&gt;PR&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-11-textmate-setup-mac-only_files/Screen Shot 2018-03-11 at 2.29.33 PM.png&#34; alt=&#34;search in dir&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;p&gt;You can also comment all the lines of code you have selected fairly easily using the &lt;code&gt;Source&lt;/code&gt; bundle.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-11-textmate-setup-mac-only_files/Screen Shot 2018-03-11 at 2.41.31 PM.png&#34; alt=&#34;comment all lines&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;p&gt;I’ve also used the &lt;code&gt;Text&lt;/code&gt;, &lt;code&gt;LaTeX&lt;/code&gt; and &lt;code&gt;Gist&lt;/code&gt; bundles, though not as frequently. Also, &lt;em&gt;TextMate&lt;/em&gt; automatically spell checks for you and knows to ignore R markdown code chunks.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;evaluating-code-in-r-console-or-iterm2&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Evaluating code in R console or iTerm2&lt;/h3&gt;
&lt;p&gt;If you download and install the &lt;a href=&#34;https://www.iterm2.com/&#34;&gt;iTerm2&lt;/a&gt; terminal, you can configure &lt;em&gt;TextMate&lt;/em&gt; to evaluate the code in that terminal. The particular code I have for doing this is in the &lt;code&gt;leo&lt;/code&gt; branch of the following repo: &lt;a href=&#34;https://github.com/lcolladotor/r.tmbundle/commits/leo&#34; class=&#34;uri&#34;&gt;https://github.com/lcolladotor/r.tmbundle/commits/leo&lt;/a&gt;. In total I use 3 different keyboard shortcuts depending on where I want to evaluate the code:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;in an R console window;&lt;/li&gt;
&lt;li&gt;in an R console window after automatically running &lt;code&gt;setwd()&lt;/code&gt; to the directory that contains the files I’m looking at;&lt;/li&gt;
&lt;li&gt;to the &lt;em&gt;iTerm2&lt;/em&gt; terminal, which is useful when working with a computing cluster.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;getwd() ## run in terminal with cmd + enter shortcut

getwd() ## run in R console using backtick (`) shortcut

getwd() ## run in R console using cmd + R, runs setwd() first&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-11-textmate-setup-mac-only_files/Screen Shot 2018-03-11 at 2.13.32 PM.png&#34; alt=&#34;evaluating code&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;running-rmarkdownrender&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Running &lt;code&gt;rmarkdown::render()&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;If I’m working with an R Markdown file (&lt;code&gt;.Rmd&lt;/code&gt; extension), I frequently use the &lt;code&gt;alt (option) + t&lt;/code&gt; shortcut for running &lt;code&gt;rmarkdown::render()&lt;/code&gt; and viewing the output file.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-11-textmate-setup-mac-only_files/Screen Shot 2018-03-11 at 2.19.30 PM.png&#34; alt=&#34;knit with rmarkdown&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;p&gt;For example, if I’m working with the &lt;code&gt;recount-brain/index.Rmd&lt;/code&gt; file (you can get it &lt;a href=&#34;https://github.com/LieberInstitute/recount-brain/blob/master/index.Rmd&#34;&gt;here&lt;/a&gt;), my setup allows me to run all the R shortcuts. That’s because it recognizes the R code chunk syntax and uses the &lt;code&gt;source.r&lt;/code&gt; &lt;em&gt;scope&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-11-textmate-setup-mac-only_files/Screen Shot 2018-03-11 at 2.23.57 PM.png&#34; alt=&#34;rmd scope&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;p&gt;Anyway, after using &lt;code&gt;alt (option) + t&lt;/code&gt; &lt;em&gt;TextMate&lt;/em&gt; shows me the final html.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-11-textmate-setup-mac-only_files/Screen Shot 2018-03-11 at 2.21.37 PM.png&#34; alt=&#34;rendered html&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;p&gt;A lot of the bundles in TextMate are from the days when we ran &lt;code&gt;Sweave()&lt;/code&gt;, so they work well with &lt;code&gt;.Rnw&lt;/code&gt; files and the like. I did modify one of them to use &lt;code&gt;knitr::knit()&lt;/code&gt; instead of &lt;code&gt;Sweave()&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;mate-and-rmate&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;&lt;code&gt;mate&lt;/code&gt; and &lt;code&gt;rmate&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;If you enable the terminal preferences you can now use the &lt;code&gt;mate&lt;/code&gt; command in any directory on your laptop. &lt;em&gt;TextMate&lt;/em&gt; will open and show you all the tabs of files you had last opened in that same directory. This behavior is also a part of the &lt;code&gt;.Rproj&lt;/code&gt; files with RStudio.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-11-textmate-setup-mac-only_files/Screen Shot 2018-03-11 at 2.46.51 PM.png&#34; alt=&#34;run mate&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;p&gt;The command I &lt;em&gt;really&lt;/em&gt; like is &lt;code&gt;rmate&lt;/code&gt; because it enables me to remotely open a file from the cluster in &lt;em&gt;TextMate&lt;/em&gt;, which combined with the &lt;em&gt;evaluate in iTerm2&lt;/em&gt; command makes it easy to work with files on the cluster. Basically, I power up an &lt;em&gt;iTerm2&lt;/em&gt; terminal, log into the cluster, navigate to the directory that contains the files I’m working with, and then open them remotely with &lt;code&gt;rmate&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-11-textmate-setup-mac-only_files/Screen Shot 2018-03-11 at 2.52.21 PM.png&#34; alt=&#34;rmate&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;p&gt;Setting up &lt;code&gt;rmate&lt;/code&gt; takes a bit of work but it’s definitely worth it.&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;In the cluster, install &lt;code&gt;rmate&lt;/code&gt; following the instructions at &lt;a href=&#34;https://github.com/textmate/rmate&#34; class=&#34;uri&#34;&gt;https://github.com/textmate/rmate&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Find a port that works for doing the forwarding. The default one will likely be taken already by another user. Find more about this &lt;a href=&#34;https://www.ssh.com/ssh/tunneling/example&#34;&gt;here&lt;/a&gt;. There they use &lt;code&gt;ssh -R 8080:localhost:80 public.example.com&lt;/code&gt; for testing. Sadly, I don’t know of a quick and easy way to find a port for you to use :/&lt;/li&gt;
&lt;li&gt;Edit your cluster’s &lt;code&gt;~/.bashrc&lt;/code&gt; file with the port information. Mine includes these lines where &lt;code&gt;someSecretPortNumber&lt;/code&gt; is replaced by my port number.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;## rmate port
# https://github.com/textmate/rmate
export RMATE_PORT=&amp;quot;someSecretPortNumber&amp;quot;&lt;/code&gt;&lt;/pre&gt;
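&lt;p&gt;For step 2, one rough way to check whether a candidate port number already seems taken on the login node is the sketch below. It assumes &lt;code&gt;netstat&lt;/code&gt; is available, and the port number shown is just an example (it is &lt;code&gt;rmate&lt;/code&gt;’s default, which is exactly the one most likely to be taken by another user):&lt;/p&gt;

```shell
# Prints whether anything already seems to be listening on a given port.
# The grep pattern matches the local-address column of netstat output on
# both Linux (":PORT ") and BSD/Mac (".PORT ").
check_port() {
    if netstat -an | grep LISTEN | grep -q "[.:]${1} "; then
        echo "port ${1} is already in use, try another"
    else
        echo "port ${1} looks free"
    fi
}
check_port 52698   # rmate's default port; replace with your own pick
```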
&lt;ol start=&#34;4&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Edit your laptop’s &lt;code&gt;~/.ssh/config&lt;/code&gt; file so you don’t have to specify the port every time you &lt;code&gt;ssh&lt;/code&gt; into the &lt;code&gt;JHPCE&lt;/code&gt; cluster:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;## Will work later as (aka, less typing):
## ssh j
## ssh cluster
Host j cluster
    User yourUsernameGoesHere
    Hostname jhpce01.jhsph.edu
    RemoteForward someSecretPortNumber localhost:someSecretPortNumber
    ForwardX11 yes
    ForwardX11Trusted yes&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;5&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Edit your cluster’s &lt;code&gt;~/.ssh/config&lt;/code&gt; file so the port gets forwarded also when you access a compute node with &lt;code&gt;qrsh&lt;/code&gt;. All of JHPCE’s compute nodes are named &lt;code&gt;compute&lt;/code&gt;something, so we can take advantage of that in the config file.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;# For rmate
Host compute*
    RemoteForward someSecretPortNumber localhost:someSecretPortNumber&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once you do these steps, &lt;code&gt;rmate&lt;/code&gt; should work on a new terminal window.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://media.giphy.com/media/3oeSAz6FqXCKuNFX6o/giphy.gif&#34;/&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://giphy.com/reactions/featured/good-luck&#34;&gt;Source&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;textmate-variables&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;&lt;em&gt;TextMate&lt;/em&gt; variables&lt;/h3&gt;
&lt;p&gt;I don’t remember right now if I manually edited the &lt;em&gt;TextMate&lt;/em&gt; variables, but just in case, here’s the info.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-11-textmate-setup-mac-only_files/Screen Shot 2018-03-11 at 2.34.38 PM.png&#34; alt=&#34;textmate vars&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;acknowledgments&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Acknowledgments&lt;/h3&gt;
&lt;p&gt;This blog post was made possible thanks to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;http://bioconductor.org/packages/BiocStyle&#34;&gt;BiocStyle&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Oles_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/Bioconductor/BiocStyle&#39;&gt;Oleś, Morgan, and Huber, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Xie_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/rstudio/blogdown&#39;&gt;Xie, Hill, and Thomas, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=devtools&#34;&gt;devtools&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Wickham_2018&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=devtools&#39;&gt;Wickham, Hester, and Chang, 2018&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;knitcitations&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Boettiger_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=knitcitations&#39;&gt;Boettiger, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Boettiger_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Boettiger_2017&#34;&gt;[1]&lt;/a&gt;&lt;cite&gt; C. Boettiger. &lt;em&gt;knitcitations: Citations for ‘Knitr’ Markdown Files&lt;/em&gt;. R package version 1.0.8. 2017. URL: &lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;https://CRAN.R-project.org/package=knitcitations&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Oles_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Oles_2017&#34;&gt;[2]&lt;/a&gt;&lt;cite&gt; A. Oleś, M. Morgan and W. Huber. &lt;em&gt;BiocStyle: Standard styles for vignettes and other Bioconductor documents&lt;/em&gt;. R package version 2.6.1. 2017. URL: &lt;a href=&#34;https://github.com/Bioconductor/BiocStyle&#34;&gt;https://github.com/Bioconductor/BiocStyle&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Wickham_2018&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Wickham_2018&#34;&gt;[3]&lt;/a&gt;&lt;cite&gt; H. Wickham, J. Hester and W. Chang. &lt;em&gt;devtools: Tools to Make Developing R Packages Easier&lt;/em&gt;. R package version 1.13.5. 2018. URL: &lt;a href=&#34;https://CRAN.R-project.org/package=devtools&#34;&gt;https://CRAN.R-project.org/package=devtools&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Xie_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Xie_2017&#34;&gt;[4]&lt;/a&gt;&lt;cite&gt; Y. Xie, A. P. Hill and A. Thomas. &lt;em&gt;blogdown: Creating Websites with R Markdown&lt;/em&gt;. ISBN 978-0815363729. Boca Raton, Florida: Chapman and Hall/CRC, 2017. URL: &lt;a href=&#34;https://github.com/rstudio/blogdown&#34;&gt;https://github.com/rstudio/blogdown&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;## Session info ----------------------------------------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  setting  value                       
##  version  R version 3.4.3 (2017-11-30)
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/New_York            
##  date     2018-03-11&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Packages --------------------------------------------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  package       * version date       source                               
##  backports       1.1.2   2017-12-13 CRAN (R 3.4.3)                       
##  base          * 3.4.3   2017-12-07 local                                
##  bibtex          0.4.2   2017-06-30 CRAN (R 3.4.1)                       
##  BiocStyle     * 2.6.1   2017-11-30 Bioconductor                         
##  blogdown        0.5.10  2018-03-10 Github (lcolladotor/blogdown@471b086)
##  bookdown        0.7     2018-02-18 cran (@0.7)                          
##  colorout      * 1.2-0   2018-02-19 Github (jalvesaq/colorout@2f01173)   
##  compiler        3.4.3   2017-12-07 local                                
##  datasets      * 3.4.3   2017-12-07 local                                
##  devtools      * 1.13.5  2018-02-18 CRAN (R 3.4.3)                       
##  digest          0.6.15  2018-01-28 CRAN (R 3.4.3)                       
##  evaluate        0.10.1  2017-06-24 CRAN (R 3.4.1)                       
##  graphics      * 3.4.3   2017-12-07 local                                
##  grDevices     * 3.4.3   2017-12-07 local                                
##  htmltools       0.3.6   2017-04-28 CRAN (R 3.4.0)                       
##  httr            1.3.1   2017-08-20 CRAN (R 3.4.1)                       
##  jsonlite        1.5     2017-06-01 CRAN (R 3.4.0)                       
##  knitcitations * 1.0.8   2017-07-04 CRAN (R 3.4.1)                       
##  knitr           1.20    2018-02-20 cran (@1.20)                         
##  lubridate       1.7.3   2018-02-27 CRAN (R 3.4.3)                       
##  magrittr        1.5     2014-11-22 CRAN (R 3.4.0)                       
##  memoise         1.1.0   2017-04-21 CRAN (R 3.4.0)                       
##  methods       * 3.4.3   2017-12-07 local                                
##  plyr            1.8.4   2016-06-08 CRAN (R 3.4.0)                       
##  R6              2.2.2   2017-06-17 CRAN (R 3.4.0)                       
##  Rcpp            0.12.15 2018-01-20 CRAN (R 3.4.3)                       
##  RefManageR      0.14.20 2017-08-17 CRAN (R 3.4.1)                       
##  rmarkdown       1.9     2018-03-01 cran (@1.9)                          
##  rprojroot       1.3-2   2018-01-03 CRAN (R 3.4.3)                       
##  stats         * 3.4.3   2017-12-07 local                                
##  stringi         1.1.6   2017-11-17 CRAN (R 3.4.2)                       
##  stringr         1.3.0   2018-02-19 cran (@1.3.0)                        
##  tools           3.4.3   2017-12-07 local                                
##  utils         * 3.4.3   2017-12-07 local                                
##  withr           2.1.1   2017-12-19 CRAN (R 3.4.3)                       
##  xfun            0.1     2018-01-22 CRAN (R 3.4.3)                       
##  xml2            1.2.0   2018-01-24 CRAN (R 3.4.3)                       
##  yaml            2.1.18  2018-03-08 cran (@2.1.18)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Contributing to the LIBD rstats club</title>
      <link>http://LieberInstitute.github.io/rstatsclub/2018/03/09/contributing-to-the-libd-rstats-club/</link>
      <pubDate>Fri, 09 Mar 2018 00:00:00 +0000</pubDate>
      <guid>http://LieberInstitute.github.io/rstatsclub/2018/03/09/contributing-to-the-libd-rstats-club/</guid>
      <description>


&lt;p&gt;In this blog post &lt;a href=&#34;https://www.libd.org/team/Leonardo-Collado-Torres/&#34;&gt;Leonardo Collado-Torres&lt;/a&gt; explains how to contribute posts to the &lt;strong&gt;LIBD rstats club&lt;/strong&gt;. While some parameters are specific to this blog, you could also use this for creating your own community blog.&lt;/p&gt;
&lt;div id=&#34;install-necessary-tools&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Install necessary tools&lt;/h3&gt;
&lt;p&gt;We first need to get the appropriate tools installed in our computer.&lt;/p&gt;
&lt;div id=&#34;r&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;1. R&lt;/h4&gt;
&lt;p&gt;Let’s start by installing the latest version of &lt;a href=&#34;https://cran.r-project.org/&#34;&gt;R&lt;/a&gt;. At the time of writing this post, that would be R 3.4.3, but in reality this should work with any R 3.4.x version. It might even work with earlier versions, but it’d be a bummer to find out that we have an R version problem later on.&lt;/p&gt;
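If you are not sure which version you already have, you can ask R itself; this is plain base R and works the same on any platform:

```r
## Print the full version string of the R installation you are running
R.version.string

## Compare against a minimum version; TRUE means you are at 3.4.0 or newer
getRversion() >= "3.4.0"
```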
&lt;/div&gt;
&lt;div id=&#34;rstudio&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;2. RStudio&lt;/h4&gt;
&lt;p&gt;Once we have an up-to-date R version, let’s install &lt;a href=&#34;https://www.rstudio.com/products/rstudio/download/&#34;&gt;RStudio&lt;/a&gt;. By using the latest version we will have access to &lt;em&gt;RStudio addin&lt;/em&gt; menus, which will make our life much easier. Since we will be using RStudio quite a bit, it’s best to work from your laptop/computer rather than from a server or cluster where you might not have the flexibility to install/update RStudio and R.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;blogdown&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;3. &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt;&lt;/h4&gt;
&lt;p&gt;We next need to install the main package that we will be using for creating our blog posts, that is &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Xie_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/rstudio/blogdown&#39;&gt;Xie, Hill, and Thomas, 2017&lt;/a&gt;). At the time of writing this post, the version that we need to use is only available in the development branch. So we need to install &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; via &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=devtools&#34;&gt;devtools&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Wickham_2018&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=devtools&#39;&gt;Wickham, Hester, and Chang, 2018&lt;/a&gt;).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## If needed install devtools first:
install.packages(&amp;#39;devtools&amp;#39;)

## Install development version
devtools::install_github(&amp;#39;rstudio/blogdown&amp;#39;)

## If you are reading this in the future, you might only need
install.packages(&amp;#39;blogdown&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At some point we might need two extra packages that &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; uses. It will show you a message when that happens, so you can install them when you need them, or you can go ahead and get them now anyway.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;install.packages(c(&amp;#39;later&amp;#39;, &amp;#39;processx&amp;#39;))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;hugo&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;4. &lt;code&gt;hugo&lt;/code&gt;&lt;/h4&gt;
&lt;p&gt;We are almost there! &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; uses &lt;a href=&#34;https://gohugo.io/&#34;&gt;hugo&lt;/a&gt;, which they claim is &lt;em&gt;the world’s fastest framework for building websites&lt;/em&gt;. &lt;code&gt;hugo&lt;/code&gt; is a bit complicated, but it makes maintaining a complex website such as a blog super easy. Basically, we will be working in the &lt;code&gt;rstatsclubsource&lt;/code&gt; directory and &lt;code&gt;hugo&lt;/code&gt; will create the final version we can share around in the directory &lt;code&gt;rstatsclubsource/public&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Ok, let’s go ahead and install it with&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;blogdown::install_hugo()&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;git&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;5. &lt;code&gt;git&lt;/code&gt;&lt;/h4&gt;
&lt;p&gt;To get the files and interact with other &lt;strong&gt;LIBD rstats club&lt;/strong&gt; members, you will need to use &lt;code&gt;git&lt;/code&gt;, which you can install following &lt;a href=&#34;http://happygitwithr.com/install-git.html&#34;&gt;these instructions&lt;/a&gt; from the &lt;a href=&#34;http://happygitwithr.com/&#34;&gt;happygitwithr&lt;/a&gt; website. A note for Windows users: &lt;a href=&#34;https://gitforwindows.org/&#34;&gt;get git for windows&lt;/a&gt; because it includes &lt;code&gt;git bash&lt;/code&gt; and will enable you to run some commands that we will need later on. For more advanced Windows users: when installing &lt;code&gt;git bash&lt;/code&gt;, choose the &lt;em&gt;use git and optional Unix tools from the Windows Command Prompt&lt;/em&gt; option, as described &lt;a href=&#34;https://github.com/rstudio/rstudio/issues/2224#issuecomment-368260312&#34;&gt;in detail here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Also get a &lt;a href=&#34;https://github.com/&#34;&gt;GitHub&lt;/a&gt; account if you don’t have one already. Optionally &lt;a href=&#34;https://help.github.com/articles/connecting-to-github-with-ssh/&#34;&gt;set up ssh keys&lt;/a&gt; for password-less login, though it’s not needed.&lt;/p&gt;
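If you do go the ssh-key route, the key generation itself is just two commands. Here’s a minimal sketch; the email and the key location are placeholders (in practice you would use &lt;code&gt;~/.ssh/id_ed25519&lt;/code&gt; and your own address), and GitHub’s linked guide covers pasting the key into your account settings:

```shell
# Generate an SSH key pair (placeholder email and path for illustration;
# normally you would write the key to ~/.ssh/id_ed25519).
KEYFILE="$(mktemp -d)/id_ed25519"
ssh-keygen -t ed25519 -C "you@example.com" -f "$KEYFILE" -N "" -q

# Print the public half; this is what you paste into
# GitHub -> Settings -> SSH and GPG keys -> New SSH key.
cat "${KEYFILE}.pub"
```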
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;access-blog-source-files&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Access blog source files&lt;/h3&gt;
&lt;p&gt;The file structure of our blog involves a total of 3 GitHub repositories that are related to each other as shown below. However, you will only need to interact with one of them.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;rstatsclubsource
    ## From https://github.com/LieberInstitute/rstatsclubsource
    content/
        post/
        ...
    static/
        post/
        img/
        ...
    themes/
        hugo-academic/
        ## From https://github.com/gcushen/hugo-academic
        ## linked as a git submodule
    public/
        ## https://github.com/LieberInstitute/rstatsclub
        ## Hosts the files that make up the website
        ## http://LieberInstitute.github.io/rstatsclub/&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The main directory for the blog is &lt;code&gt;rstatsclubsource&lt;/code&gt; and you can access it at &lt;a href=&#34;https://github.com/LieberInstitute/rstatsclubsource&#34;&gt;github.com/LieberInstitute/rstatsclubsource&lt;/a&gt; if you are a member of the &lt;strong&gt;LIBD rstats club&lt;/strong&gt; and have logged in to your GitHub account. This directory contains many files that &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; understands and that you don’t really need to change. The main sub-directory that you will be interacting with is &lt;code&gt;rstatsclubsource/content/post&lt;/code&gt; where your new post files will get saved by &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt;. If you insert images, &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; will automatically create the directory &lt;code&gt;rstatsclubsource/static/post/2018-03-09_postTitle_files&lt;/code&gt; (&lt;a href=&#34;https://github.com/LieberInstitute/rstatsclubsource/tree/master/static/post/2018-03-06-test-post-for-checking-website_files&#34;&gt;example&lt;/a&gt;), but don’t worry about it.&lt;/p&gt;
&lt;p&gt;Ok, let’s get the files. Open your terminal or &lt;code&gt;git bash&lt;/code&gt; (Windows) and navigate to the directory where you will store your copy of the source files.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;## Works in Mac and Windows (with git bash)
cd ~/Desktop

## Windows example command if you have more than 1 disk
cd /f/Desktop&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next clone &lt;a href=&#34;https://github.com/LieberInstitute/rstatsclubsource&#34;&gt;github.com/LieberInstitute/rstatsclubsource&lt;/a&gt;. Here we use https since that doesn’t require extra setup, but cloning by ssh also works.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;git clone https://github.com/LieberInstitute/rstatsclubsource.git&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-09-contributing-to-the-libd-rstats-club_files/Screen Shot 2018-03-08 at 8.47.18 PM.png&#34; alt=&#34;clone source repo&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;p&gt;After a successful clone, we should have created the directory &lt;code&gt;~/Desktop/rstatsclubsource&lt;/code&gt;. Let’s change to this new directory.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;cd rstatsclubsource&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Our cloning process also created a placeholder for our hugo theme (&lt;a href=&#34;https://github.com/gcushen/hugo-academic&#34;&gt;hugo-academic&lt;/a&gt;), but its contents are empty.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ ls -lh themes/hugo-academic/
total 0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s fill it up by running the following git command.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;git submodule update --init --recursive&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-09-contributing-to-the-libd-rstats-club_files/Screen Shot 2018-03-08 at 9.05.56 PM.png&#34; alt=&#34;git submodule&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;p&gt;If we now check the contents of the &lt;code&gt;themes/hugo-academic&lt;/code&gt; directory, we should see several files (here’s a screenshot from Windows using &lt;code&gt;git bash&lt;/code&gt;).&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;ls -lh themes/hugo-academic&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-09-contributing-to-the-libd-rstats-club_files/theme files.PNG&#34; alt=&#34;windows theme files&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;preview-website&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Preview website&lt;/h3&gt;
&lt;p&gt;Our setup is now complete! We can now start writing blog posts using &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt;. Open the &lt;code&gt;~/Desktop/rstatsclubsource/rstatsclubsource.Rproj&lt;/code&gt; file, which should launch a new RStudio window automatically. Find the &lt;em&gt;Addins&lt;/em&gt; menu at the top of your window, and select the &lt;code&gt;Serve Site&lt;/code&gt; option in the drop-down menu.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-09-contributing-to-the-libd-rstats-club_files/Screen Shot 2018-03-08 at 9.35.34 PM.png&#34; alt=&#34;serve site&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;p&gt;This addin will create a local version of the website you can preview in the RStudio Viewer pane.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-09-contributing-to-the-libd-rstats-club_files/Screen Shot 2018-03-08 at 9.19.02 PM.png&#34; alt=&#34;RStudio viewer pane&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;p&gt;If you click on the &lt;em&gt;show in new window&lt;/em&gt; symbol as shown below&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-09-contributing-to-the-libd-rstats-club_files/Screen%20Shot%202018-03-08%20at%209.20.28%20PM.png&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;you then can preview the website in your main browser:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-09-contributing-to-the-libd-rstats-club_files/Screen Shot 2018-03-08 at 9.21.44 PM.png&#34; alt=&#34;preview browser&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;p&gt;This will be super useful when you are writing a new blog post because any changes you make will automatically refresh the local preview version after a second or two (only after you save the file you are editing). Sometimes you might have to manually refresh your browser (like when you make too many updates in a row). The preview website works as long as you have the &lt;em&gt;Serve Site&lt;/em&gt; addin running in the background.&lt;/p&gt;
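For reference, the addin is a thin wrapper around regular &lt;em&gt;blogdown&lt;/em&gt; functions, so you can also start and stop the preview from the R console (function names as of current &lt;em&gt;blogdown&lt;/em&gt; versions):

```r
## Start the local preview; equivalent to the "Serve Site" addin.
## It keeps rebuilding the site in the background as you save files.
blogdown::serve_site()

## Stop the background preview server when you are done.
blogdown::stop_server()
```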
&lt;div id=&#34;all-our-files&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;All our files&lt;/h4&gt;
&lt;p&gt;By running &lt;em&gt;Serve Site&lt;/em&gt;, &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; automatically populated our &lt;code&gt;rstatsclubsource/public&lt;/code&gt; directory. So our full list of files looks somewhat like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## Found out about tree from
## https://stackoverflow.com/questions/3455625/linux-command-to-print-directory-structure-in-the-form-of-a-tree
## Install on a Mac with Homebrew:
## brew install tree
$ tree -d rstatsclubsource
.
├── archetypes
├── content
│   ├── home
│   └── post
├── data
│   ├── fonts
│   └── themes
├── layouts
│   └── partials
├── public
│   ├── 2018
│   │   └── 03
│   │       ├── 06
│   │       │   └── test-post-for-checking-website
│   │       └── 09
│   │           └── welcome-to-the-libd-rstats-club
│   ├── categories
│   │   ├── page
│   │   │   └── 1
│   │   ├── rstats
│   │   │   └── page
│   │   │       └── 1
│   │   └── setup
│   │       └── page
│   │           └── 1
│   ├── home
│   ├── img
│   ├── js
│   ├── post
│   │   ├── 2018-03-06-test-post-for-checking-website_files
│   │   │   └── figure-html
│   │   └── page
│   │       └── 1
│   ├── publication_types
│   │   └── page
│   │       └── 1
│   └── tags
│       ├── blog
│       │   └── page
│       │       └── 1
│       └── page
│           └── 1
├── static
│   ├── img
│   └── post
│       └── 2018-03-06-test-post-for-checking-website_files
│           └── figure-html
└── themes
    └── hugo-academic
        ├── archetypes
        ├── data
        ├── exampleSite
        ├── i18n
        ├── images
        ├── layouts
        └── static&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Some of the structure looks redundant, right? Well, that’s because &lt;code&gt;hugo&lt;/code&gt; and &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; try to keep everything super organized and re-use some of the file structure for the final version of the website (the one inside &lt;code&gt;rstatsclubsource/public&lt;/code&gt;).&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;write-a-blog-post&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Write a blog post&lt;/h3&gt;
&lt;p&gt;We now have a working full preview version of the website on our computer. It’s finally time to write a blog post. The good thing is that we don’t need to do all the setup steps again!&lt;/p&gt;
&lt;div id=&#34;update-if-necessary&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Update if necessary&lt;/h4&gt;
&lt;p&gt;If you paused midway for a few days, it’s likely that your files are not the latest ones. So please &lt;code&gt;git pull&lt;/code&gt; the latest files from &lt;a href=&#34;https://github.com/LieberInstitute/rstatsclubsource&#34;&gt;github.com/LieberInstitute/rstatsclubsource&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;cd ~/Desktop/rstatsclubsource
git pull&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;start-a-blog-post&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Start a blog post&lt;/h4&gt;
&lt;p&gt;To start a new blog post, use the &lt;em&gt;New Post&lt;/em&gt; &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; addin.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-09-contributing-to-the-libd-rstats-club_files/Screen Shot 2018-03-08 at 9.18.21 PM.png&#34; alt=&#34;new post&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;p&gt;This will open up a window where you can specify the blog post title. The title will automatically fill out the &lt;em&gt;slug&lt;/em&gt;, which is used in the final URL of the post. It also lets you choose a date, which is when the blog post will be made publicly visible.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-09-contributing-to-the-libd-rstats-club_files/Screen Shot 2018-03-08 at 6.41.49 PM.png&#34; alt=&#34;New post addin&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;author-and-extension&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Author and extension&lt;/h4&gt;
&lt;p&gt;Let’s start by entering the author information (leave it blank if you want to be anonymous) and selecting the post format. We highly recommend you use the &lt;code&gt;.Rmd&lt;/code&gt; format for your blog posts, to the point that you should just make it your default option. To do so, edit or create an &lt;code&gt;.Rprofile&lt;/code&gt; file in your home directory, that is &lt;code&gt;~/.Rprofile&lt;/code&gt;, and set these options (adjusting the author information to your preference).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# https://bookdown.org/yihui/blogdown/global-options.html
options(blogdown.author = &amp;#39;L. Collado-Torres&amp;#39;)
options(blogdown.ext = &amp;#39;.Rmd&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The next time you write a blog post, &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; will know what options you prefer.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-09-contributing-to-the-libd-rstats-club_files/Screen Shot 2018-03-08 at 6.42.04 PM.png&#34; alt=&#34;Author and extension&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;use-a-blog-template&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Use a blog template&lt;/h4&gt;
&lt;p&gt;Next, let’s choose the blog archetype (template) for posts, that is &lt;code&gt;post&lt;/code&gt; under &lt;code&gt;Archetype&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-09-contributing-to-the-libd-rstats-club_files/Screen%20Shot%202018-03-08%20at%206.42.23%20PM.png&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;The reasons why a blog template is useful are described in full detail &lt;a href=&#34;http://lcolladotor.github.io/2018/03/08/blogdown-archetype-template/#.WqH1WJPwY0o&#34;&gt;in this blog post by L. Collado-Torres&lt;/a&gt;. The quick version is that it will pre-populate your new blog post with helpful information and reminders on how to do &lt;span class=&#34;math inline&#34;&gt;\(X\)&lt;/span&gt; or &lt;span class=&#34;math inline&#34;&gt;\(Y\)&lt;/span&gt;. For example, how to &lt;a href=&#34;http://lcolladotor.github.io/2018/03/07/blogdown-insert-image-addin/#.WqH1bJPwY0o&#34;&gt;insert external images&lt;/a&gt;, how to control the figure height and width for images generated by R, how to cite packages, and how to include the R reproducibility information by default.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;use-appropriate-post-categories&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Use appropriate post categories&lt;/h4&gt;
&lt;p&gt;Most of our blog posts will likely be about R. For those blog posts, select the &lt;code&gt;rstats&lt;/code&gt; category as shown below.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-09-contributing-to-the-libd-rstats-club_files/Screen%20Shot%202018-03-08%20at%206.42.33%20PM.png&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;If your blog post covers a topic that might be useful to new LIBD members use the &lt;code&gt;setup&lt;/code&gt; category. If it involves using the JHPCE cluster, use &lt;code&gt;jhpce&lt;/code&gt; as a category, etc. The &lt;em&gt;New Post&lt;/em&gt; &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; addin lets you see which categories and tags have been used before to classify posts. In general, we should aim to have a handful of categories while having as many tags as necessary. These will be useful for filtering our blog posts.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;hit-done-well-check-before&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Hit done! (well, check before)&lt;/h4&gt;
&lt;p&gt;We are now basically set to start our new blog post. Just double check:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the categories,&lt;/li&gt;
&lt;li&gt;that the format is &lt;code&gt;.Rmd&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;that the archetype is &lt;code&gt;post&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;post title,&lt;/li&gt;
&lt;li&gt;post date,&lt;/li&gt;
&lt;li&gt;post author.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/post/2018-03-09-contributing-to-the-libd-rstats-club_files/Screen Shot 2018-03-08 at 6.42.53 PM.png&#34; alt=&#34;&#34; width=&#34;500&#34;/&gt;&lt;/p&gt;
&lt;p&gt;Aaaaand hit &lt;strong&gt;done&lt;/strong&gt;!&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;edit-your-blog-post&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Edit your blog post&lt;/h3&gt;
&lt;p&gt;You can now write your blog post using the &lt;a href=&#34;https://rmarkdown.rstudio.com/lesson-2.html&#34;&gt;R Markdown syntax&lt;/a&gt;. Note that you won’t be using the &lt;code&gt;knit&lt;/code&gt; button from RStudio since the &lt;em&gt;Serve Site&lt;/em&gt; &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; addin should be doing all the work for you and updating your local website preview.&lt;/p&gt;
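As a minimal sketch of that syntax (the chunk label and chunk options below are just illustrative knitr defaults, not anything required by this blog):

````markdown
## A section heading

Some prose with `inline code`, **bold text**, and a
[link](https://rmarkdown.rstudio.com/).

```{r example-plot, fig.width = 8, fig.height = 6}
## An R chunk: fig.width and fig.height are knitr options that
## control the size of the figure embedded in the post.
plot(1:10)
```
````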
&lt;p&gt;When writing your blog post, keep in mind our &lt;a href=&#34;http://research.libd.org/rstatsclub/#coc&#34;&gt;Code of Conduct&lt;/a&gt; and remember that our blog posts have to comply with the confidentiality agreement we signed at LIBD.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;share-your-blog-post&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Share your blog post&lt;/h3&gt;
&lt;p&gt;Once you are happy with the final version of your blog post and have used the spell check from RStudio, add your blog post files to &lt;a href=&#34;https://github.com/LieberInstitute/rstatsclubsource&#34;&gt;github.com/LieberInstitute/rstatsclubsource&lt;/a&gt;. First, remember to &lt;code&gt;git pull&lt;/code&gt; in case there are new files in the repository that you don’t have. Next, check which files you modified by running &lt;code&gt;git status&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;cd ~/Desktop/rstatsclubsource
git status&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Typically, you should see that you have untracked files under &lt;code&gt;rstatsclubsource/content/post&lt;/code&gt; and maybe &lt;code&gt;rstatsclubsource/static/post/&lt;/code&gt;. If so, add them to the GitHub repository with the following sequence of commands.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;git add content/post/*
git add static/post/*
git commit -m &amp;quot;Short description of your new blog post&amp;quot;
git push&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You could also do this with a git GUI such as the git tools in RStudio, &lt;a href=&#34;https://www.sourcetreeapp.com/&#34;&gt;SourceTree&lt;/a&gt; (works for both Mac and Windows), the &lt;code&gt;gitk&lt;/code&gt; command, etc.&lt;/p&gt;
&lt;p&gt;After a quick review we will update the files at &lt;a href=&#34;https://github.com/LieberInstitute/rstatsclub&#34;&gt;github.com/LieberInstitute/rstatsclub&lt;/a&gt; and deploy the changes to GitHub.&lt;/p&gt;
&lt;p&gt;Good luck with your first of many &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; posts!!!&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://media.giphy.com/media/3ornk7TgUdhjhTYgta/giphy.gif&#34;/&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://giphy.com/reactions/featured/good-luck&#34;&gt;Source&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;details&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Details&lt;/h3&gt;
&lt;p&gt;We are keeping both &lt;a href=&#34;https://github.com/LieberInstitute/rstatsclubsource&#34;&gt;github.com/LieberInstitute/rstatsclubsource&lt;/a&gt; and &lt;a href=&#34;https://github.com/LieberInstitute/rstatsclub&#34;&gt;github.com/LieberInstitute/rstatsclub&lt;/a&gt; as private repositories to enable contributors to write posts anonymously.&lt;/p&gt;
&lt;p&gt;The default license for our blog posts is (CC) BY-NC-SA 4.0, which you can read more about &lt;a href=&#34;https://creativecommons.org/licenses/by-nc-sa/4.0/&#34;&gt;here&lt;/a&gt;. If you write a blog post under a different license, please say so explicitly. Also, please attribute the sources of any material you use.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;further-reading&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Further reading&lt;/h3&gt;
&lt;p&gt;If you want to learn even more, check the &lt;a href=&#34;https://bookdown.org/yihui/blogdown/&#34;&gt;blogdown book&lt;/a&gt; and the &lt;a href=&#34;https://github.com/gcushen/hugo-academic&#34;&gt;hugo-academic theme&lt;/a&gt;. Maybe you can help with &lt;a href=&#34;https://github.com/gcushen/hugo-academic/issues/467&#34;&gt;this currently unresolved issue&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you want to check a public version equivalent to the one described here, check:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/lcolladotor/hugoblog&#34;&gt;github.com/lcolladotor/hugoblog&lt;/a&gt; which is the equivalent of &lt;a href=&#34;https://github.com/LieberInstitute/rstatsclubsource&#34;&gt;github.com/LieberInstitute/rstatsclubsource&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/lcolladotor/lcolladotor.github.com&#34;&gt;github.com/lcolladotor/lcolladotor.github.com&lt;/a&gt; which is the equivalent of &lt;a href=&#34;https://github.com/LieberInstitute/rstatsclub&#34;&gt;github.com/LieberInstitute/rstatsclub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;hugoblog
    ## From https://github.com/lcolladotor/hugoblog
    content/
        post/
        ...
    static/
        post/
        img/
        ...
    themes/
        hugo-academic/
        ## From https://github.com/gcushen/hugo-academic
        ## linked as a git submodule
    public/
        ## https://github.com/lcolladotor/lcolladotor.github.com
        ## Hosts the files that make up the website
        ## http://lcolladotor.github.io/&lt;/code&gt;&lt;/pre&gt;
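Since the theme is linked as a git submodule, a plain git clone leaves themes/hugo-academic/ empty. Below is a self-contained sketch of the two common ways to populate it, using throwaway local stand-in repositories instead of the real GitHub ones so it runs without network access:

```shell
# Demonstrate submodule cloning with throwaway local repos (no network needed).
# `protocol.file.allow=always` is required by recent git for local-path submodules.
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Stand-in for the theme repository:
git init -q theme
git -C theme -c user.email=a@b.c -c user.name=a commit -q --allow-empty -m "theme"

# Stand-in for the site source, with the theme linked as a submodule:
git init -q site
git -C site -c protocol.file.allow=always \
    submodule --quiet add "$tmp/theme" themes/hugo-academic
git -C site -c user.email=a@b.c -c user.name=a commit -q -m "add theme submodule"

# Option 1: clone recursively so the submodule comes along.
git -c protocol.file.allow=always clone -q --recursive "$tmp/site" clone1

# Option 2: after a plain clone, fetch the submodule explicitly.
git -c protocol.file.allow=always clone -q "$tmp/site" clone2
git -C clone2 -c protocol.file.allow=always submodule --quiet update --init
```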
&lt;/div&gt;
&lt;div id=&#34;acknowledgments&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Acknowledgments&lt;/h3&gt;
&lt;p&gt;This blog post was made possible thanks to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;http://bioconductor.org/packages/BiocStyle&#34;&gt;BiocStyle&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Oles_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/Bioconductor/BiocStyle&#39;&gt;Oleś, Morgan, and Huber, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; (&lt;a href=&#39;https://github.com/rstudio/blogdown&#39;&gt;Xie, Hill, and Thomas, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=devtools&#34;&gt;devtools&lt;/a&gt;&lt;/em&gt; (&lt;a href=&#39;https://CRAN.R-project.org/package=devtools&#39;&gt;Wickham, Hester, and Chang, 2018&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;knitcitations&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Boettiger_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=knitcitations&#39;&gt;Boettiger, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Boettiger_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Boettiger_2017&#34;&gt;[1]&lt;/a&gt;&lt;cite&gt; C. Boettiger. &lt;em&gt;knitcitations: Citations for ‘Knitr’ Markdown Files&lt;/em&gt;. R package version 1.0.8. 2017. URL: &lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;https://CRAN.R-project.org/package=knitcitations&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Oles_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Oles_2017&#34;&gt;[2]&lt;/a&gt;&lt;cite&gt; A. Oleś, M. Morgan and W. Huber. &lt;em&gt;BiocStyle: Standard styles for vignettes and other Bioconductor documents&lt;/em&gt;. R package version 2.6.1. 2017. URL: &lt;a href=&#34;https://github.com/Bioconductor/BiocStyle&#34;&gt;https://github.com/Bioconductor/BiocStyle&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Wickham_2018&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Wickham_2018&#34;&gt;[3]&lt;/a&gt;&lt;cite&gt; H. Wickham, J. Hester and W. Chang. &lt;em&gt;devtools: Tools to Make Developing R Packages Easier&lt;/em&gt;. R package version 1.13.5. 2018. URL: &lt;a href=&#34;https://CRAN.R-project.org/package=devtools&#34;&gt;https://CRAN.R-project.org/package=devtools&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Xie_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Xie_2017&#34;&gt;[4]&lt;/a&gt;&lt;cite&gt; Y. Xie, A. P. Hill and A. Thomas. &lt;em&gt;blogdown: Creating Websites with R Markdown&lt;/em&gt;. ISBN 978-0815363729. Boca Raton, Florida: Chapman and Hall/CRC, 2017. URL: &lt;a href=&#34;https://github.com/rstudio/blogdown&#34;&gt;https://github.com/rstudio/blogdown&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;## Session info ----------------------------------------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  setting  value                       
##  version  R version 3.4.3 (2017-11-30)
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/New_York            
##  date     2018-03-10&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Packages --------------------------------------------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  package       * version date       source                            
##  backports       1.1.2   2017-12-13 CRAN (R 3.4.3)                    
##  base          * 3.4.3   2017-12-07 local                             
##  bibtex          0.4.2   2017-06-30 CRAN (R 3.4.1)                    
##  BiocStyle     * 2.6.1   2017-11-30 Bioconductor                      
##  blogdown        0.5.9   2018-03-08 Github (rstudio/blogdown@dc1f41c) 
##  bookdown        0.7     2018-02-18 cran (@0.7)                       
##  colorout      * 1.2-0   2018-02-19 Github (jalvesaq/colorout@2f01173)
##  compiler        3.4.3   2017-12-07 local                             
##  datasets      * 3.4.3   2017-12-07 local                             
##  devtools      * 1.13.5  2018-02-18 CRAN (R 3.4.3)                    
##  digest          0.6.15  2018-01-28 CRAN (R 3.4.3)                    
##  evaluate        0.10.1  2017-06-24 CRAN (R 3.4.1)                    
##  graphics      * 3.4.3   2017-12-07 local                             
##  grDevices     * 3.4.3   2017-12-07 local                             
##  htmltools       0.3.6   2017-04-28 CRAN (R 3.4.0)                    
##  httr            1.3.1   2017-08-20 CRAN (R 3.4.1)                    
##  jsonlite        1.5     2017-06-01 CRAN (R 3.4.0)                    
##  knitcitations * 1.0.8   2017-07-04 CRAN (R 3.4.1)                    
##  knitr           1.20    2018-02-20 cran (@1.20)                      
##  lubridate       1.7.3   2018-02-27 CRAN (R 3.4.3)                    
##  magrittr        1.5     2014-11-22 CRAN (R 3.4.0)                    
##  memoise         1.1.0   2017-04-21 CRAN (R 3.4.0)                    
##  methods       * 3.4.3   2017-12-07 local                             
##  plyr            1.8.4   2016-06-08 CRAN (R 3.4.0)                    
##  R6              2.2.2   2017-06-17 CRAN (R 3.4.0)                    
##  Rcpp            0.12.15 2018-01-20 CRAN (R 3.4.3)                    
##  RefManageR      0.14.20 2017-08-17 CRAN (R 3.4.1)                    
##  rmarkdown       1.9     2018-03-01 cran (@1.9)                       
##  rprojroot       1.3-2   2018-01-03 CRAN (R 3.4.3)                    
##  stats         * 3.4.3   2017-12-07 local                             
##  stringi         1.1.6   2017-11-17 CRAN (R 3.4.2)                    
##  stringr         1.3.0   2018-02-19 cran (@1.3.0)                     
##  tools           3.4.3   2017-12-07 local                             
##  utils         * 3.4.3   2017-12-07 local                             
##  withr           2.1.1   2017-12-19 CRAN (R 3.4.3)                    
##  xfun            0.1     2018-01-22 CRAN (R 3.4.3)                    
##  xml2            1.2.0   2018-01-24 CRAN (R 3.4.3)                    
##  yaml            2.1.17  2018-02-27 cran (@2.1.17)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Welcome to the LIBD rstats club!</title>
      <link>http://LieberInstitute.github.io/rstatsclub/2018/03/09/welcome-to-the-libd-rstats-club/</link>
      <pubDate>Fri, 09 Mar 2018 00:00:00 +0000</pubDate>
      <guid>http://LieberInstitute.github.io/rstatsclub/2018/03/09/welcome-to-the-libd-rstats-club/</guid>
      <description>


&lt;p&gt;Welcome to the &lt;strong&gt;LIBD rstats club&lt;/strong&gt;!&lt;/p&gt;
&lt;p&gt;A few months ago &lt;a href=&#34;https://www.libd.org/team/Carrie-Wright/&#34;&gt;Carrie Wright&lt;/a&gt; and &lt;a href=&#34;https://www.libd.org/team/Leonardo-Collado-Torres/&#34;&gt;Leonardo Collado-Torres&lt;/a&gt; started an &lt;em&gt;R + Journal&lt;/em&gt; club where we have been meeting every other week to talk about &lt;a href=&#34;https://cran.r-project.org/&#34;&gt;R&lt;/a&gt; packages and discuss journal articles in our field. Some examples of what we covered are &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=tidyverse&#34;&gt;tidyverse&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Wickham_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=tidyverse&#39;&gt;Wickham, 2017&lt;/a&gt;), &lt;em&gt;&lt;a href=&#34;http://bioconductor.org/packages/BiocStyle&#34;&gt;BiocStyle&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Oles_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/Bioconductor/BiocStyle&#39;&gt;Oleś, Morgan, and Huber, 2017&lt;/a&gt;) and &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=rmarkdown&#34;&gt;rmarkdown&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Allaire_2018&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=rmarkdown&#39;&gt;Allaire, Xie, McPherson, Luraschi, et al., 2018&lt;/a&gt;) presentations. We are now taking the R portion of the club to the next level.&lt;/p&gt;
&lt;p&gt;As we say in the about section, we are researchers at &lt;a href=&#34;https://www.libd.org/&#34;&gt;LIBD&lt;/a&gt; who frequently use &lt;a href=&#34;https://cran.r-project.org/&#34;&gt;R&lt;/a&gt; and other tools. The &lt;a href=&#34;https://twitter.com/search?q=%23rstats&#34;&gt;R community&lt;/a&gt; has been &lt;a href=&#34;https://stackoverflow.blog/2017/10/10/impressive-growth-r/&#34;&gt;growing a lot in recent years&lt;/a&gt; and it’s not easy to keep up with all the recent developments. We also work in a dynamic environment where new people join LIBD frequently as students, post-docs and staff.&lt;/p&gt;
&lt;p&gt;In the &lt;strong&gt;LIBD rstats club&lt;/strong&gt; we plan to write blog posts about R packages we are interested in or are just learning how to use, &lt;em&gt;how to do&lt;/em&gt; guides, and occasionally our &lt;a href=&#34;http://github.com/LieberInstitute&#34;&gt;own open-source software&lt;/a&gt;. For our &lt;em&gt;how to do&lt;/em&gt; guides, the idea is that if two people ask us how to do &lt;span class=&#34;math inline&#34;&gt;\(X\)&lt;/span&gt;, then it’s probably a good time to write a blog post. This is similar to &lt;a href=&#34;https://twitter.com/drob&#34;&gt;David Robinson&lt;/a&gt;’s advice:&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34; data-lang=&#34;en&#34;&gt;
&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;
When you’ve written the same code 3 times, write a function&lt;br&gt;&lt;br&gt;When you’ve given the same in-person advice 3 times, write a blog post
&lt;/p&gt;
— David Robinson (&lt;span class=&#34;citation&#34;&gt;@drob&lt;/span&gt;) &lt;a href=&#34;https://twitter.com/drob/status/928447584712253440?ref_src=twsrc%5Etfw&#34;&gt;November 9, 2017&lt;/a&gt;
&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;
&lt;p&gt;We are doing this to help each other at &lt;a href=&#34;https://www.libd.org/&#34;&gt;LIBD&lt;/a&gt;, to help future colleagues and students, and to help the community at large. We also hope to receive feedback from the community (in compliance with our &lt;a href=&#34;http://LieberInstitute.github.io/rstatsclub/#coc&#34;&gt;code of conduct&lt;/a&gt;) that will benefit everyone reading the posts. Maybe you know of an alternative way to do the same task we describe.&lt;/p&gt;
&lt;p&gt;Thank you for tuning in! Follow us via &lt;a href=&#34;http://feeds.feedburner.com/LIBDrstatsclub&#34;&gt;RSS&lt;/a&gt; (allows email subscriptions) or &lt;a href=&#34;https://twitter.com/LIBDrstats&#34;&gt;Twitter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Founding members: &lt;a href=&#34;https://www.libd.org/team/Leonardo-Collado-Torres/&#34;&gt;Leonardo Collado-Torres&lt;/a&gt;, &lt;a href=&#34;https://www.libd.org/team/Carrie-Wright/&#34;&gt;Carrie Wright&lt;/a&gt; and &lt;a href=&#34;https://www.libd.org/team/emily-e-burke/&#34;&gt;Emily E. Burke&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;PS: This is not a course or boot camp site for getting started with R; other resources are available for that.&lt;/p&gt;
&lt;div id=&#34;details&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Details&lt;/h3&gt;
&lt;p&gt;Anyone at LIBD is welcome to participate and contribute blog posts: we will add you to the &lt;a href=&#34;http://LieberInstitute.github.io/rstatsclub/#members&#34;&gt;members&lt;/a&gt; section. Just remember that your blog posts have to comply with the confidentiality agreement we signed. We also welcome anonymous posts, though signing them &lt;em&gt;can&lt;/em&gt; be &lt;a href=&#34;http://varianceexplained.org/r/start-blog/&#34;&gt;useful for exposure and CV building&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;acknowledgments&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Acknowledgments&lt;/h3&gt;
&lt;p&gt;This blog post was made possible thanks to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;http://bioconductor.org/packages/BiocStyle&#34;&gt;BiocStyle&lt;/a&gt;&lt;/em&gt; (&lt;a href=&#39;https://github.com/Bioconductor/BiocStyle&#39;&gt;Oleś, Morgan, and Huber, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Xie_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/rstudio/blogdown&#39;&gt;Xie, Hill, and Thomas, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=devtools&#34;&gt;devtools&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Wickham_2018&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=devtools&#39;&gt;Wickham, Hester, and Chang, 2018&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;knitcitations&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Boettiger_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=knitcitations&#39;&gt;Boettiger, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Allaire_2018&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Allaire_2018&#34;&gt;[1]&lt;/a&gt;&lt;cite&gt; J. Allaire, Y. Xie, J. McPherson, J. Luraschi, et al. &lt;em&gt;rmarkdown: Dynamic Documents for R&lt;/em&gt;. R package version 1.9. 2018. URL: &lt;a href=&#34;https://CRAN.R-project.org/package=rmarkdown&#34;&gt;https://CRAN.R-project.org/package=rmarkdown&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Boettiger_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Boettiger_2017&#34;&gt;[2]&lt;/a&gt;&lt;cite&gt; C. Boettiger. &lt;em&gt;knitcitations: Citations for ‘Knitr’ Markdown Files&lt;/em&gt;. R package version 1.0.8. 2017. URL: &lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;https://CRAN.R-project.org/package=knitcitations&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Oles_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Oles_2017&#34;&gt;[3]&lt;/a&gt;&lt;cite&gt; A. Oleś, M. Morgan and W. Huber. &lt;em&gt;BiocStyle: Standard styles for vignettes and other Bioconductor documents&lt;/em&gt;. R package version 2.6.1. 2017. URL: &lt;a href=&#34;https://github.com/Bioconductor/BiocStyle&#34;&gt;https://github.com/Bioconductor/BiocStyle&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Wickham_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Wickham_2017&#34;&gt;[4]&lt;/a&gt;&lt;cite&gt; H. Wickham. &lt;em&gt;tidyverse: Easily Install and Load the ‘Tidyverse’&lt;/em&gt;. R package version 1.2.1. 2017. URL: &lt;a href=&#34;https://CRAN.R-project.org/package=tidyverse&#34;&gt;https://CRAN.R-project.org/package=tidyverse&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Wickham_2018&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Wickham_2018&#34;&gt;[5]&lt;/a&gt;&lt;cite&gt; H. Wickham, J. Hester and W. Chang. &lt;em&gt;devtools: Tools to Make Developing R Packages Easier&lt;/em&gt;. R package version 1.13.5. 2018. URL: &lt;a href=&#34;https://CRAN.R-project.org/package=devtools&#34;&gt;https://CRAN.R-project.org/package=devtools&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Xie_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Xie_2017&#34;&gt;[6]&lt;/a&gt;&lt;cite&gt; Y. Xie, A. P. Hill and A. Thomas. &lt;em&gt;blogdown: Creating Websites with R Markdown&lt;/em&gt;. ISBN 978-0815363729. Boca Raton, Florida: Chapman and Hall/CRC, 2017. URL: &lt;a href=&#34;https://github.com/rstudio/blogdown&#34;&gt;https://github.com/rstudio/blogdown&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;## Session info ----------------------------------------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  setting  value                       
##  version  R version 3.4.3 (2017-11-30)
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/New_York            
##  date     2018-03-08&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Packages --------------------------------------------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  package       * version date       source                            
##  backports       1.1.2   2017-12-13 CRAN (R 3.4.3)                    
##  base          * 3.4.3   2017-12-07 local                             
##  bibtex          0.4.2   2017-06-30 CRAN (R 3.4.1)                    
##  BiocStyle     * 2.6.1   2017-11-30 Bioconductor                      
##  blogdown        0.5.9   2018-03-08 Github (rstudio/blogdown@dc1f41c) 
##  bookdown        0.7     2018-02-18 cran (@0.7)                       
##  colorout      * 1.2-0   2018-02-19 Github (jalvesaq/colorout@2f01173)
##  compiler        3.4.3   2017-12-07 local                             
##  datasets      * 3.4.3   2017-12-07 local                             
##  devtools      * 1.13.5  2018-02-18 CRAN (R 3.4.3)                    
##  digest          0.6.15  2018-01-28 CRAN (R 3.4.3)                    
##  evaluate        0.10.1  2017-06-24 CRAN (R 3.4.1)                    
##  graphics      * 3.4.3   2017-12-07 local                             
##  grDevices     * 3.4.3   2017-12-07 local                             
##  htmltools       0.3.6   2017-04-28 CRAN (R 3.4.0)                    
##  httr            1.3.1   2017-08-20 CRAN (R 3.4.1)                    
##  jsonlite        1.5     2017-06-01 CRAN (R 3.4.0)                    
##  knitcitations * 1.0.8   2017-07-04 CRAN (R 3.4.1)                    
##  knitr           1.20    2018-02-20 cran (@1.20)                      
##  lubridate       1.7.3   2018-02-27 CRAN (R 3.4.3)                    
##  magrittr        1.5     2014-11-22 CRAN (R 3.4.0)                    
##  memoise         1.1.0   2017-04-21 CRAN (R 3.4.0)                    
##  methods       * 3.4.3   2017-12-07 local                             
##  plyr            1.8.4   2016-06-08 CRAN (R 3.4.0)                    
##  R6              2.2.2   2017-06-17 CRAN (R 3.4.0)                    
##  Rcpp            0.12.15 2018-01-20 CRAN (R 3.4.3)                    
##  RefManageR      0.14.20 2017-08-17 CRAN (R 3.4.1)                    
##  rmarkdown       1.9     2018-03-01 cran (@1.9)                       
##  rprojroot       1.3-2   2018-01-03 CRAN (R 3.4.3)                    
##  stats         * 3.4.3   2017-12-07 local                             
##  stringi         1.1.6   2017-11-17 CRAN (R 3.4.2)                    
##  stringr         1.3.0   2018-02-19 cran (@1.3.0)                     
##  tools           3.4.3   2017-12-07 local                             
##  utils         * 3.4.3   2017-12-07 local                             
##  withr           2.1.1   2017-12-19 CRAN (R 3.4.3)                    
##  xfun            0.1     2018-01-22 CRAN (R 3.4.3)                    
##  xml2            1.2.0   2018-01-24 CRAN (R 3.4.3)                    
##  yaml            2.1.17  2018-02-27 cran (@2.1.17)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Test post for checking website</title>
      <link>http://LieberInstitute.github.io/rstatsclub/2018/03/06/test-post-for-checking-website/</link>
      <pubDate>Tue, 06 Mar 2018 00:00:00 +0000</pubDate>
      <guid>http://LieberInstitute.github.io/rstatsclub/2018/03/06/test-post-for-checking-website/</guid>
      <description>


&lt;p&gt;This is a test post for checking the formatting of the website. You can basically ignore the rest. It’s showing the contents of the &lt;em&gt;post.md&lt;/em&gt; archetype (blog post template).&lt;/p&gt;
&lt;p&gt;Useful links for editing the theme:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://sourcethemes.com/academic/docs/get-started/&#34; class=&#34;uri&#34;&gt;https://sourcethemes.com/academic/docs/get-started/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://sourcethemes.com/academic/docs/customization/&#34; class=&#34;uri&#34;&gt;https://sourcethemes.com/academic/docs/customization/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/gcushen/hugo-academic/tree/master/data/fonts&#34; class=&#34;uri&#34;&gt;https://github.com/gcushen/hugo-academic/tree/master/data/fonts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/gcushen/hugo-academic/tree/master/data/themes&#34; class=&#34;uri&#34;&gt;https://github.com/gcushen/hugo-academic/tree/master/data/themes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.libd.org/&#34; class=&#34;uri&#34;&gt;https://www.libd.org/&lt;/a&gt; (for getting colors under inspect mode)&lt;/li&gt;
&lt;/ul&gt;
&lt;div id=&#34;post-content&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Post content&lt;/h3&gt;
&lt;p&gt;This is the typical place to start editing, since the bibliography chunk is hidden. Make sure that you selected &lt;code&gt;R Markdown (.Rmd)&lt;/code&gt; as the &lt;em&gt;format&lt;/em&gt; option of the post when using the &lt;code&gt;New Post&lt;/code&gt; &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; addin.&lt;/p&gt;
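As a sketch, the same choice can be made from the R console instead of the addin. This assumes a recent blogdown where new_post() accepts an ext argument; the title below is a placeholder:

```r
## Create a new post as R Markdown; ext = ".Rmd" corresponds to selecting
## "R Markdown (.Rmd)" as the format in the New Post addin.
## (Argument name per recent blogdown versions.)
blogdown::new_post("My new post", ext = ".Rmd")
```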
&lt;/div&gt;
&lt;div id=&#34;r-image&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;R image&lt;/h3&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## This image will be saved in the /post/*_files/ directory
## Use echo = FALSE if you want to hide the code for making the plot
plot(1:10, 10:1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://LieberInstitute.github.io/rstatsclub/rstatsclub/post/2018-03-06-test-post-for-checking-website_files/figure-html/plot-1.png&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If you prefer not to use the &lt;code&gt;fig.width&lt;/code&gt; and &lt;code&gt;fig.height&lt;/code&gt; options in every plot chunk, edit the YAML (the part at the top of the post) with:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;output:
  blogdown::html_page:
    toc: no
    fig_width: 5
    fig_height: 5&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;custom-image&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Custom image&lt;/h3&gt;
&lt;p&gt;Use the &lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; &lt;em&gt;Insert Image&lt;/em&gt; addin to add an external image. You need version 0.5.7 or newer to have access to this addin. At the time of writing this post, that version is only available via GitHub. That is, install it with:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;devtools::install_github(&amp;#39;rstudio/blogdown&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;acknowledgments&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Acknowledgments&lt;/h3&gt;
&lt;p&gt;This blog post was made possible thanks to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;http://bioconductor.org/packages/BiocStyle&#34;&gt;BiocStyle&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Oles_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/Bioconductor/BiocStyle&#39;&gt;Oles, Morgan, and Huber, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=blogdown&#34;&gt;blogdown&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Xie_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://github.com/rstudio/blogdown&#39;&gt;Xie, Hill, and Thomas, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=devtools&#34;&gt;devtools&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Wickham_2018&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=devtools&#39;&gt;Wickham, Hester, and Chang, 2018&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;knitcitations&lt;/a&gt;&lt;/em&gt; &lt;a id=&#39;cite-Boettiger_2017&#39;&gt;&lt;/a&gt;(&lt;a href=&#39;https://CRAN.R-project.org/package=knitcitations&#39;&gt;Boettiger, 2017&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Boettiger_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Boettiger_2017&#34;&gt;[1]&lt;/a&gt;&lt;cite&gt; C. Boettiger. &lt;em&gt;knitcitations: Citations for ‘Knitr’ Markdown Files&lt;/em&gt;. R package version 1.0.8. 2017. URL: &lt;a href=&#34;https://CRAN.R-project.org/package=knitcitations&#34;&gt;https://CRAN.R-project.org/package=knitcitations&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Oles_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Oles_2017&#34;&gt;[2]&lt;/a&gt;&lt;cite&gt; A. Oles, M. Morgan and W. Huber. &lt;em&gt;BiocStyle: Standard styles for vignettes and other Bioconductor documents&lt;/em&gt;. R package version 2.6.1. 2017. URL: &lt;a href=&#34;https://github.com/Bioconductor/BiocStyle&#34;&gt;https://github.com/Bioconductor/BiocStyle&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Wickham_2018&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Wickham_2018&#34;&gt;[3]&lt;/a&gt;&lt;cite&gt; H. Wickham, J. Hester and W. Chang. &lt;em&gt;devtools: Tools to Make Developing R Packages Easier&lt;/em&gt;. R package version 1.13.5. 2018. URL: &lt;a href=&#34;https://CRAN.R-project.org/package=devtools&#34;&gt;https://CRAN.R-project.org/package=devtools&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a id=&#39;bib-Xie_2017&#39;&gt;&lt;/a&gt;&lt;a href=&#34;#cite-Xie_2017&#34;&gt;[4]&lt;/a&gt;&lt;cite&gt; Y. Xie, A. P. Hill and A. Thomas. &lt;em&gt;blogdown: Creating Websites with R Markdown&lt;/em&gt;. ISBN 978-0815363729. Boca Raton, Florida: Chapman and Hall/CRC, 2017. URL: &lt;a href=&#34;https://github.com/rstudio/blogdown&#34;&gt;https://github.com/rstudio/blogdown&lt;/a&gt;.&lt;/cite&gt;
&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reproducibility&lt;/h3&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Reproducibility info
options(width = 120)
session_info()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Session info ----------------------------------------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  setting  value                       
##  version  R version 3.4.2 (2017-09-28)
##  system   x86_64, mingw32             
##  ui       RTerm                       
##  language (EN)                        
##  collate  English_United States.1252  
##  tz       America/New_York            
##  date     2018-03-07&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Packages --------------------------------------------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  package       * version date       source                               
##  backports       1.1.2   2017-12-13 CRAN (R 3.4.3)                       
##  base          * 3.4.2   2017-09-28 local                                
##  bibtex          0.4.2   2017-06-30 CRAN (R 3.4.2)                       
##  BiocStyle     * 2.6.1   2017-11-30 Bioconductor                         
##  blogdown        0.5.7   2018-03-07 Github (lcolladotor/blogdown@7b8761b)
##  bookdown        0.7     2018-02-18 CRAN (R 3.4.3)                       
##  compiler        3.4.2   2017-09-28 local                                
##  datasets      * 3.4.2   2017-09-28 local                                
##  devtools      * 1.13.5  2018-02-18 CRAN (R 3.4.3)                       
##  digest          0.6.15  2018-01-28 CRAN (R 3.4.3)                       
##  evaluate        0.10.1  2017-06-24 CRAN (R 3.4.2)                       
##  graphics      * 3.4.2   2017-09-28 local                                
##  grDevices     * 3.4.2   2017-09-28 local                                
##  htmltools       0.3.6   2017-04-28 CRAN (R 3.4.2)                       
##  httr            1.3.1   2017-08-20 CRAN (R 3.4.2)                       
##  jsonlite        1.5     2017-06-01 CRAN (R 3.4.2)                       
##  knitcitations * 1.0.8   2017-07-04 CRAN (R 3.4.2)                       
##  knitr           1.20    2018-02-20 CRAN (R 3.4.3)                       
##  lubridate       1.7.2   2018-02-06 CRAN (R 3.4.3)                       
##  magrittr        1.5     2014-11-22 CRAN (R 3.4.2)                       
##  memoise         1.1.0   2017-04-21 CRAN (R 3.4.2)                       
##  methods       * 3.4.2   2017-09-28 local                                
##  plyr            1.8.4   2016-06-08 CRAN (R 3.4.2)                       
##  R6              2.2.2   2017-06-17 CRAN (R 3.4.2)                       
##  Rcpp            0.12.15 2018-01-20 CRAN (R 3.4.3)                       
##  RefManageR      0.14.20 2017-08-17 CRAN (R 3.4.2)                       
##  rmarkdown       1.9     2018-03-01 CRAN (R 3.4.3)                       
##  rprojroot       1.3-2   2018-01-03 CRAN (R 3.4.3)                       
##  stats         * 3.4.2   2017-09-28 local                                
##  stringi         1.1.6   2017-11-17 CRAN (R 3.4.2)                       
##  stringr         1.3.0   2018-02-19 CRAN (R 3.4.3)                       
##  tools           3.4.2   2017-09-28 local                                
##  utils         * 3.4.2   2017-09-28 local                                
##  withr           2.1.1   2017-12-19 CRAN (R 3.4.3)                       
##  xfun            0.1     2018-01-22 CRAN (R 3.4.3)                       
##  xml2            1.2.0   2018-01-24 CRAN (R 3.4.3)                       
##  yaml            2.1.17  2018-02-27 CRAN (R 3.4.3)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
