rOpenSci

rOpenSci is a runner-up in the Mendeley Binary Battle!

2011-11-30T18:49:29Z

We just got word that rOpenSci was a runner-up in the first Binary Battle! Thank you for all the support so far!

We entered two of our packages for integrating with PLoS Journals (rplos) and Mendeley (RMendeley) in the Mendeley-PLoS Binary Battle. Get them at GitHub (rplos; RMendeley).

These two packages allow users to search and retrieve data from PLoS journals (including their altmetrics data), and from Mendeley. You could surely mash up data from both PLoS and Mendeley. That’s what’s cool about rOpenSci – we provide the tools, and leave it up to users’ vast creativity to do awesome things.

We won third place!! This gives us a $1,000 prize, plus a Parrot AR Drone helicopter.

The grand prize went to openSNP, and the runner-up to PaperCritic. Check them out.

Use case: combining taxize and rgbif

2011-11-29T21:26:04Z

This is just the sort of thing for which rOpenSci is being built.

A colleague of mine recently saw our packages in development and thought, “Hey, that could totally make my life easier.” What was made easier, you ask? This was his situation:

He had a list of ca. 1200 species of birds and wanted to first obtain the most current species names before seeking location data for occurrences of all the species.

So what tools do we need for this? We need the packages taxize and rgbif:

taxize: The taxize package allows you to search taxonomic information across the Universal Biological Indexer and Organizer (uBio), Integrated Taxonomic Information Service (ITIS), Encyclopedia of Life (EOL), the Taxonomic Name Resolution Service (TNRS), and Phylomatic.
rgbif: The rgbif package allows you to search for and retrieve data from the Global Biodiversity Information Facility.

If you want to run this code, the entire workflow is here, as a GitHub Gist.

First step: check names

Note that we are using a subset of the data in my friend’s actual dataset for brevity here. So 1200 species down to 10 species for our purposes.

Let’s just wrap up all the dirty work into one function called checkname. This function uses a few taxize functions, including get_tsn and getacceptname.

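checkname <- function(name) {
  # name: scientific name
  # get taxonomic serial number (TSN); fall back to "no_results" if the query fails
  tsn <- try(get_tsn(name, "sciname", by_ = "name"), silent = TRUE)
  if (inherits(tsn, "try-error")) {
    tsn <- "no_results"
  }
  # check accepted name
  out <- getacceptname(tsn)
  if (out[[1]] == "no_results") {
    # no match in ITIS: flag the name for a spelling check
    list("check_spelling", name, "check_spelling", out)
  } else if (length(out) == 2) {
    # ITIS returned an accepted name and its TSN: the submitted name is outdated
    list("new_name", name, as.character(out)[[1]], as.character(out)[[2]])
  } else if (is.numeric(as.numeric(out))) {
    # only a TSN came back: the submitted name is already the accepted one
    list("good_name", name, name, out)
  }
}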
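To see how the three branches of checkname map to the three statuses without querying ITIS, here is a minimal offline sketch. Note that fake_accept is a hypothetical stand-in for getacceptname, and the return shapes it uses ("no_results", a name plus TSN pair, or a TSN alone) are assumptions inferred from the branches in checkname, not documented taxize behaviour.

```r
# Hypothetical stand-in for getacceptname(); the return shapes are assumptions
# inferred from the branches in checkname(), not the real taxize API.
fake_accept <- function(name) {
  lookup <- list(
    "Catharacta_skua" = c("Stercorarius skua", "660059"), # outdated name
    "Cathartes_aura"  = "175265"                          # already accepted
  )
  hit <- lookup[[name]]
  if (is.null(hit)) "no_results" else hit
}

# Same three-way decision as in checkname(), returning only the status
classify <- function(out) {
  if (out[[1]] == "no_results") {
    "check_spelling"
  } else if (length(out) == 2) {
    "new_name"
  } else {
    "good_name"
  }
}

demo_names <- c("Agapornis_roseicapillis", "Catharacta_skua", "Cathartes_aura")
statuses <- vapply(demo_names, function(n) classify(fake_accept(n)), character(1))
print(statuses)
```

The misspelled name falls through to "check_spelling", the outdated name is flagged as "new_name", and the current name comes back as "good_name".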

Nice. Now let’s run our species list through the function checkname using the llply function from the plyr package.


ournames <- c("Agapornis_roseicapillis", "Catharacta_maccormicki",
  "Catharacta_skua", "Cathartes_aura", "Catharus_bicknelli",
  "Catharus_fuscescens", "Catharus_guttatus", "Catharus_minimus",
  "Catharus_ustulatus", "Ceratogymna_brevis")

itisout <- llply(ournames, checkname, .progress = "text") # query ITIS
  |======================================================================================| 100%

dfnames <- ldply(itisout, function(x) { # make a data frame of results
    out_ <- as.data.frame(x)
    names(out_) <- c("status", "name_old", "name_new", "TSN")
    out_})

dfnames
           status                name_old                 name_new        TSN
1  check_spelling Agapornis_roseicapillis           check_spelling no_results
2        new_name  Catharacta_maccormicki Stercorarius maccormicki     660062
3        new_name         Catharacta_skua        Stercorarius skua     660059
4       good_name          Cathartes_aura           Cathartes_aura     175265
5       good_name      Catharus_bicknelli       Catharus_bicknelli     554148
6       good_name     Catharus_fuscescens      Catharus_fuscescens     179796
7       good_name       Catharus_guttatus        Catharus_guttatus     179779
8       good_name        Catharus_minimus         Catharus_minimus     179793
9       good_name      Catharus_ustulatus       Catharus_ustulatus     179788
10       new_name      Ceratogymna_brevis        Bycanistes brevis     707796

It looks like we have one name spelled wrong (“check_spelling”), three name replacements (“new_name”), and the remainder checked out just fine with ITIS.
Now we need to remove that one species with the spelling problem for now (although you would fix it, of course, if it were your project). Then we feed the new species list into queries to GBIF.

p.s. The output from above spits out TSNs too, which you can use to query for more taxonomic information for species through the taxize package.
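With the full list of ca. 1200 species, the misspelled names will not all sit in row 1, so it may be safer to drop them by status rather than by position. The following is a small base R sketch of that idea; dfnames_demo is a hypothetical three-row stand-in for the real dfnames, using values from the table above.

```r
# Hypothetical miniature of dfnames, built from rows shown in the table above
dfnames_demo <- data.frame(
  status   = c("check_spelling", "new_name", "good_name"),
  name_old = c("Agapornis_roseicapillis", "Catharacta_skua", "Cathartes_aura"),
  name_new = c("check_spelling", "Stercorarius skua", "Cathartes_aura"),
  stringsAsFactors = FALSE
)

# Keep every row whose name resolved, wherever it appears in the table
resolved <- dfnames_demo[dfnames_demo$status != "check_spelling", ]

# Swap underscores for spaces so the names are ready for a GBIF search
resolved$gbifname <- gsub("_", " ", resolved$name_new)
print(resolved)
```

The rows that were filtered out can be kept in a separate data frame for a later spelling check.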

Second step: get lat/long data

dfnames$gbifname <- gsub("_", " ", dfnames[,3]) # create new name column

dfnames # we now have a column of names without the underscore for GBIF search
           status                name_old                 name_new        TSN                 gbifname
1  check_spelling Agapornis_roseicapillis           check_spelling no_results           check spelling
2        new_name  Catharacta_maccormicki Stercorarius maccormicki     660062 Stercorarius maccormicki
3        new_name         Catharacta_skua        Stercorarius skua     660059        Stercorarius skua
4       good_name          Cathartes_aura           Cathartes_aura     175265           Cathartes aura
5       good_name      Catharus_bicknelli       Catharus_bicknelli     554148       Catharus bicknelli
6       good_name     Catharus_fuscescens      Catharus_fuscescens     179796      Catharus fuscescens
7       good_name       Catharus_guttatus        Catharus_guttatus     179779        Catharus guttatus
8       good_name        Catharus_minimus         Catharus_minimus     179793         Catharus minimus
9       good_name      Catharus_ustulatus       Catharus_ustulatus     179788       Catharus ustulatus
10       new_name      Ceratogymna_brevis        Bycanistes brevis     707796        Bycanistes brevis

dfnames <- dfnames[-1,] # remove row 1

gbiftestout <- llply(as.list(dfnames[,5]), function(x) occurrencelist(x, coordinatestatus = TRUE, maxresults = 10, latlongdf = TRUE))

gbiftestout[[1]] # here's the data frame of results from one species
                    sciname latitude longitude
1  Stercorarius maccormicki 36.65685 -121.9187
2  Stercorarius maccormicki 36.85800 -122.0910
3  Stercorarius maccormicki 46.89017 -125.0051
4  Stercorarius maccormicki 36.85800 -122.0910
5  Stercorarius maccormicki 36.65685 -121.9187
6  Stercorarius maccormicki 40.76234 -124.2363
7  Stercorarius maccormicki 36.85800 -122.0910
8  Stercorarius maccormicki 36.85800 -122.0910
9  Stercorarius maccormicki 36.85800 -122.0910
10 Stercorarius maccormicki 40.76234 -124.2363

gbiftestout_df <- ldply(gbiftestout, identity) # make a data frame of all results

rbind(head(gbiftestout_df), tail(gbiftestout_df)) # look at first and last 6 rows
                    sciname latitude longitude
1  Stercorarius maccormicki 36.65685 -121.9187
2  Stercorarius maccormicki 36.85800 -122.0910
3  Stercorarius maccormicki 46.89017 -125.0051
4  Stercorarius maccormicki 36.85800 -122.0910
5  Stercorarius maccormicki 36.65685 -121.9187
6  Stercorarius maccormicki 40.76234 -124.2363
85        Bycanistes brevis -0.16700   37.3170
86        Bycanistes brevis  0.31700   32.5830
87        Bycanistes brevis -0.16700   37.3170
88        Bycanistes brevis -0.16700   37.3170
89        Bycanistes brevis  0.05000   37.6500
90        Bycanistes brevis  0.05000   37.6500

Beauty! That just saved a lot of time I reckon.

Of course there are many more options within the functions to grab data from GBIF – I only show retrieval of latitude and longitude data for species here.
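The ten records shown for Stercorarius maccormicki repeat a few coordinate pairs, so it can be worth collapsing duplicates before mapping or modelling. Here is a minimal base R sketch, with the data frame typed in by hand from the output above rather than fetched from GBIF; the column n is a helper added for counting.

```r
# The ten Stercorarius maccormicki records from the output above, typed in by hand
occ <- data.frame(
  sciname   = "Stercorarius maccormicki",
  latitude  = c(36.65685, 36.85800, 46.89017, 36.85800, 36.65685,
                40.76234, 36.85800, 36.85800, 36.85800, 40.76234),
  longitude = c(-121.9187, -122.0910, -125.0051, -122.0910, -121.9187,
                -124.2363, -122.0910, -122.0910, -122.0910, -124.2363),
  stringsAsFactors = FALSE
)

# Distinct localities only
occ_unique <- unique(occ)

# Number of records per locality (n is a helper column for counting)
occ$n <- 1
site_counts <- aggregate(n ~ sciname + latitude + longitude, data = occ, FUN = sum)
print(site_counts)
```

Whether repeated coordinates are true duplicates or separate observations at the same site depends on the source records, so check before discarding them.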

Third step: make some maps

install.packages("maps") # only needed once
library(ggplot2)
library(maps) # map_data() draws its outlines from this package

world <- map_data("world")
mexico <- subset(world, region == "Mexico")
# Make a plot for Stercorarius maccormicki
ggplot(world, aes(long, lat)) +
  geom_polygon(aes(group = group), fill = "white", color = "gray40", size = .2) +
  geom_jitter(data = gbiftestout[[1]],
    aes(longitude, latitude), alpha = 0.6, size = 4, color = "blue") +
  ggtitle("Stercorarius maccormicki") # ggtitle() replaces the older opts(title = ...)

# Make a plot for Catharus guttatus, just in Mexico though
ggplot(mexico, aes(long, lat)) +
  geom_polygon(aes(group = group), fill = "white", color = "gray40", size = .2) +
  geom_jitter(data = gbiftestout[[6]],
    aes(longitude, latitude), alpha = 0.6, size = 4, color = "blue") +
  ggtitle("Catharus guttatus")

Here are the two maps, first for Stercorarius maccormicki, and then for Catharus guttatus.
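One thing to watch with the Mexico map: the point layer still holds every record returned for the species, so records outside Mexico will stretch the axes beyond the country outline. A rough fix is to keep only the points inside a bounding box first. In the sketch below, both the bounding box values and the three coordinates are made up for illustration; they are not GBIF data or official boundaries.

```r
# Approximate bounding box for Mexico (assumed values, for illustration only)
bbox <- list(lon = c(-118.5, -86.5), lat = c(14.5, 32.8))

# Made-up example points, not GBIF records: two inside the box, one well north of it
pts <- data.frame(
  latitude  = c(19.4, 45.0, 25.7),
  longitude = c(-99.1, -93.0, -100.3)
)

# Logical index of points that fall inside the box on both axes
in_box <- pts$longitude >= bbox$lon[1] & pts$longitude <= bbox$lon[2] &
  pts$latitude >= bbox$lat[1] & pts$latitude <= bbox$lat[2]

pts_mx <- pts[in_box, ]
print(pts_mx)
```

The filtered data frame could then be passed to geom_jitter in place of gbiftestout[[6]]. A bounding box is only a crude filter, since it also keeps points in neighbouring countries and at sea.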

Fourth step: smile and get back to us

Wasn’t that easy? So much better than checking names one by one manually, then retrieving data from GBIF manually, both through web interfaces.

Please tell us here, or on Twitter, what other use cases you can think of!

Again, if you want to run this code, the entire workflow is here, as a GitHub Gist.

Acknowledgements: Thanks to Owen Jones for pointing out a mistake in the code! Thanks to Kay for pointing out broken URLs.

Two rOpenSci R packages on CRAN: treebase and rfishbase

2011-12-24T15:02:03Z

Carl Boettiger, a graduate student at UC Davis, and one of the rOpenSci developers, just put two packages on CRAN. One is treebase, which handshakes with the Treebase API. The other is rfishbase, which connects with Fishbase by scraping XML content. See code development on GitHub for treebase here, and for rfishbase here. Carl has some tutorials on treebase and rfishbase at his website here, and we have an official rOpenSci tutorial for treebase here.

Basically, these two R packages let you search and pull down data from Treebase and Fishbase – pretty awesome. This improves workflow, and puts your data search and acquisition component into your code, instead of being a bunch of mouse clicks in a browser.

These two packages are part of the rOpenSci suite of packages.

Bug reports and feature requests welcome.

Guest Post: Scientist Variability & Consequences for Data Sharing

2011-10-14T04:04:29Z

Scientist variability is high. I know it’s risky to use statistical terms in non-statistical contexts on this blog, but it’s a critical component of attitudes towards data sharing. This variability is manifested in many ways: what they study (ecologists, herpetologists, geneticists, microbiologists), how they study it (modelers, experimentalists, observationalists), their level of expertise (graduate students; postdocs; and BS, MS, and PhD-level researchers), and their professional affiliations (academic, government, non-governmental organization, museum).

Given this variability, it’s not surprising that there exist diverse points of view about topics surrounding data. This is especially true for attitudes about data sharing. There are plenty of articles that tout the benefits of data sharing ([1], [2], LeClere 2010), but it is unclear whether these articles are being read by the “scientist general public” (i.e. NOT librarians, information scientists or data managers). In addition, attitudes about data sharing are strongly influenced by scientific field, the types of data collected, how hard it was to collect those data, and past experiences with sharing and collaboration [3].

I am currently ruminating on the disparate attitudes towards data sharing and its causes, primarily because of the interactions I’ve recently had with scientists for my current project. I am collecting information that will inform an Excel add-in designed to promote data sharing and archiving in science (learn more about the Digital Curation for Excel project (DCXL)). To collect this information, I’ve attended a few conferences and interviewed a lot of scientists. Part of this interview scheme is to ask them about their attitudes towards data sharing. Here I report on two groups of scientists with quite different perspectives.

In August, I attended the Ecological Society of America Meeting in Austin. Around 2,000 ecologists took over the city, and I managed to corner a few and badger them with questions. Here are my general observations about ecologists and data sharing:

They know that they should share their data,
They think that it is very important to share data publicly, and
They did not list a litany of caveats to the two statements above.

Sadly, although sharing data is important, it is not currently a priority for most ecologists. This is because of the time and effort involved in preparing data for others to use, as well as lack of knowledge about data centers and repositories they could utilize.

A few weeks later, I was in Seattle for the American Fisheries Society Meeting. About 4,500 fisheries scientists were in attendance, and I spoke with a good cross-section of them about data sharing. Here are my general observations for this group:

They don’t think they should have to share their data,
They think it is generally important for scientists to share their data, but are unwilling to share their own data, and
If they were to share their data, they would want to place a number of restrictions on its use and accessibility.

Of course, not all of the ecologists or fisheries scientists are described by the statements above. These are, however, the general trends.

Although I could list some hypotheses about why these different points of view arise, I’m more interested in their consequences. Are ecologists embracing meta-analysis and re-use of data more so than fisheries scientists? Are they collaborating more? Considering the current state of global fisheries ([3],[4]), do the attitudes of fisheries scientists hinder our ability to model populations?

How should these attitudes and their consequences be addressed? In the case of ecologists, it seems simple enough: enable tools that make data sharing easy, foster good data management practices, and increase education about data centers and how they can be used. For fisheries scientists, there is the more challenging task of changing attitudes. Perhaps someone should conduct a fisheries-centered meta-analysis of the field’s advancement due to open data practices, or we should focus on creating the ability to set a diversity of use policies for fisheries data placed in repositories.

Not all scientists think alike. We have different motivations, needs, and desires that influence our attitudes and decisions. This is the challenge for data centers, librarians, and funders: to find ways to address the needs of many different types of scientists and facilitate data sharing. Regardless of their differences, all fields of science benefit when data sharing is the norm and the progress of scientific discovery is accelerated through data re-use.

About the author: Dr. Carly Strasser is a Marine Ecologist by training, but now focuses on helping scientists with data management, organization, sharing, archiving, and re-use. She is currently working on the DCXL (Digital Curation for Excel) project at California Digital Library, funded by the Gordon and Betty Moore Foundation and Microsoft Research. Learn more about the project.

References

C. Parr and M. Cummings, "Data sharing in ecology and evolution", Trends in Ecology & Evolution, vol. 20, 2005, pp. 362-363.
H.A. Piwowar, R.S. Day, and D.B. Fridsma, "Sharing Detailed Research Data Is Associated with Increased Citation Rate", PLoS ONE, vol. 2, 2007, pp. e308.
C. Tenopir, S. Allard, K. Douglass, A.U. Aydinoglu, L. Wu, E. Read, M. Manoff, and M. Frame, "Data Sharing by Scientists: Practices and Perceptions", PLoS ONE, vol. 6, 2011, pp. e21101.
B. Worm, R. Hilborn, J.K. Baum, T.A. Branch, J.S. Collie, C. Costello, M.J. Fogarty, E.A. Fulton, J.A. Hutchings, S. Jennings, O.P. Jensen, H.K. Lotze, P.M. Mace, T.R. McClanahan, C. Minto, S.R. Palumbi, A.M. Parma, D. Ricard, A.A. Rosenberg, R. Watson, and D. Zeller, "Rebuilding Global Fisheries", Science, vol. 325, 2009, pp. 578-585.

A preview of upcoming developments

2011-09-11T18:36:50Z

We are happy to report on developments since we formally launched our efforts a little over a month ago. We are quite pleased with the feedback, encouragement, and response from practitioners, open science advocates, and the community at large. Here is a brief preview of what to expect over the next couple of months.

The rOpenSci suite currently features 8 packages (3 literature and 5 database) under active development. Although none are at the stable release stage yet, several of the packages are functional and can be put to use right away. We now have a tutorial for TreeBASE and FishBASE. Soon we will start rolling out tutorials and example use cases for some of the other packages that are further along in development. For those of you not actively watching projects on GitHub, we plan to launch an activity page within a few days.

Guest posts
We are also happy to report that a handful of prominent open science advocates, bloggers, and scientists have agreed to write opinion pieces for the site. Keep a look out for our first review coming up shortly.

Use case ideas
As we move forward, we would like to formally solicit use case ideas that may benefit from the tools being developed here. Since our entire team is composed of ecologists & evolutionary biologists, ideas for use cases are motivated by our own research interests. However, packages such as RMendeley and rplos can be useful to scientists across disciplines. So in the spirit of the Mendeley/PLoS idea solicitation, we are also soliciting use case ideas for questions that could be addressed in the R environment. We would love to hear from you in the comments section or via the contact form.

Welcome to rOpenSci

2011-08-10T03:06:55Z

rOpenSci is a collaborative effort to develop R-based tools for facilitating Open Science.

So what is Open Science?

Open science is the practice of making various elements of scientific research (data & methods, code & software, and results & publications) readily accessible to anyone. While this has great potential for advancing research as a whole (in addition to education, public policy, & commercial innovation), there are both technical and social challenges preventing this practice from being more widespread. Social challenges stem largely from the dichotomy between what is best for an individual researcher and what is best for the community. Technical challenges arise largely from issues of scale: putting free print copies of DNA sequencing data in a box in front of your office doesn’t scale as well as depositing those sequences in repositories like GenBank.

Why does open science need tools?

Our goal is to provide open-source tools to help address both these challenges. These are interesting times. The technology to facilitate the access and utilization of this data has never been better, yet it is only beginning to be employed. The internet — firmly in its second generation, the read-write Web 2.0 culture in which users generate content as readily as they consume it — has led to the explosion of mechanisms for sharing. Yet these tools are not widely leveraged in scientific communities [1].

Why R for open science?

R is an open-source statistical environment that can be used not only for statistics, but also for data acquisition, data manipulation, and modeling, among other uses. R is increasingly being used by scientists across all disciplines and has overtaken popular scientific programming tools. Part of the reason behind R’s explosive growth is the ease with which the ever-growing user base can add new functionality, a fact evidenced by the 3,000+ R packages currently available. The R framework is ideal for open science because:

The software is free,
There is an extensive user community from which help is very quickly given, and
The scientific workflow can be documented for fully reproducible research.

Open Access Literature

Published literature is still the most common repository of scientific information. While literature continues to expand exponentially, the amount any individual researcher can consume appears to be constant. Whether discovering what to read, identifying research trends, or summarizing existing work, scalable solutions require computational approaches. Mendeley, a citation tool, literature repository, and social network which has just cataloged its 100 millionth paper, has released a public API through which it challenges the scientific community to leverage this data to facilitate novel applications. PLoS, the Public Library of Science, has also joined in with its own public API, allowing access to the metadata and full text of all its publications through an accessible RESTful interface.

Open Data

Access to scientific data is rapidly emerging as a central theme in the future of research [2]. In our field of Ecology and Evolution, new requirements for data management plans from NSF, data deposition requirements of prominent journals, and the emergence of well-developed repositories like GenBank, Dryad, TreeBASE, DataONE (KNB, NBII, GBIF) have opened a wealth of potential [3].

Building Bridges between Open Access Literature and Open Data

Despite these dramatic shifts, the bulk of scientific research has been slower to benefit from the transformation than to comply with the new requirements. To address these challenges, we have set about building bridges between the repositories and the open source R statistical environment. We are creating packages that allow access to these data repositories through a statistical programming environment that is already a familiar part of the workflow of many scientists. We hope that these bridges will not only facilitate drawing data into an environment where it can readily be manipulated, but also one in which those analyses and methods can be easily shared, replicated, and extended by other researchers.

Keep up with updates on the rOpenSci project here and on Twitter.

References

K.R. Coombes, J. Wang, and K.A. Baggerly, "Microarrays: retracing steps", Nature Medicine, vol. 13, 2007, pp. 1276-1277.
"Challenges and Opportunities", Science, vol. 331, 2011, pp. 692-693.
O.J. Reichman, M.B. Jones, and M.P. Schildhauer, "Challenges and Opportunities of Open Data in Ecology", Science, vol. 331, 2011, pp. 703-705.