Data Pub

Data Citation Developments

johnkratz — Fri, 11 Oct 2013 14:09:03 +0000

Citation is a defining feature of scholarly publication and if we want to say that a dataset has been published, we have to be able to cite it. The purposes of traditional paper citations– to recognize the work of others and allow readers to judge the basis of the author’s assertions– align with the purposes of data citations. Check out previous posts on the topic here.

Although in the past, datasets and databases have usually been mentioned haphazardly, if at all, in the body of a paper and left out of the list of references, this no longer has to be the case.

Last month, there was quite a bit of activity on the data citation front:

The third version of the DataCite metadata schema was released at the DataCite Summer Meeting in Washington, DC.
The Research Data Alliance (RDA) Data Citation Working Group met at the RDA Second Plenary Meeting, also in Washington.
The Committee on Data for Science and Technology (CODATA) released a thorough report on data citation.
A synthesis set of data citation principles combining principles from Future of Research Communication and e-Scholarship (FORCE11), the Digital Curation Center, CODATA, and DataCite was released. The principles are:

Importance: Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.

Credit and Attribution: Data citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the data, recognizing that a single style or mechanism of attribution may not be applicable to all data.

Evidence: Where a specific claim rests upon data, the corresponding data citation should be provided.

Unique Identifiers: A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community.

Access: Data citations should facilitate access to the data themselves and to such associated metadata, documentation, and other materials, as are necessary for both humans and machines to make informed use of the referenced data.

Persistence: Metadata describing the data, and unique identifiers should persist, even beyond the lifespan of the data they describe.

Versioning and Granularity: Data citations should facilitate identification and access to different versions and/or subsets of data. Citations should include sufficient detail to verifiably link the citing work to the portion and version of data cited.

Interoperability and Flexibility: Data citation methods should be sufficiently flexible to accommodate the variant practices among communities but should not differ so much that they compromise interoperability of data citation practices across communities.

In the simplest case– when a researcher wants to cite the entirety of a static dataset– there seems to be a consensus set of core elements between DataCite, CODATA and others. There is less agreement with respect to more complicated cases, so let’s tackle the easy stuff first.

(Nearly) Universal Core Elements

Creator(s): Essential, of course, to publicly credit the researchers who did the work. One complication here is that a dataset can have a very large number of authors (into the hundreds), in which case an organizational name might be used instead.
Date: The year of publication or, occasionally, when the dataset was finalized.
Title: As is the case with articles, the title of a dataset should help the reader decide whether your dataset is potentially of interest. The title might contain the name of the organization responsible, or information such as the date range covered.
Publisher: Many standards split the publisher into separate producer and distributor fields. Sometimes the physical location (City, State) of the organization is included.
Identifier: A Digital Object Identifier (DOI), Archival Resource Key (ARK), or other unique and unambiguous label for the dataset.
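
As a rough illustration, the five core elements above can be assembled into a citation string. This is a minimal sketch, not a prescribed format: the ordering loosely follows the "Creator (Year): Title. Publisher. Identifier" pattern, and the class name, field names, and the dataset itself (including the DOI) are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetCitation:
    """Core elements of a dataset citation (field names are illustrative)."""
    creators: List[str]
    year: int
    title: str
    publisher: str
    identifier: str

    def format(self) -> str:
        # Join creators with semicolons; an organizational name is simply
        # a single-entry list.
        names = "; ".join(self.creators)
        return f"{names} ({self.year}): {self.title}. {self.publisher}. {self.identifier}"


# Hypothetical dataset; the DOI below is a made-up placeholder.
citation = DatasetCitation(
    creators=["Doe, J.", "Roe, R."],
    year=2013,
    title="Example sensor readings, 2010-2012",
    publisher="Example Data Repository",
    identifier="doi:10.1234/example",
)
print(citation.format())
```

Different communities order and punctuate these elements differently, so the `format` method is the part most likely to change from one style guide to another.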

Common Additional Elements

Location: A web address from which the dataset can be accessed. DOIs and ARKs can be used to locate the resource cited, so this field is often redundant.
Version: May be necessary for getting the correct dataset when revisions have been made.
Access Date: The date the data was accessed for this particular publication.
Feature Name: May be a formal feature from a controlled vocabulary, or some other description of the subset of the dataset used.
Verifier: Information that can be used to make sure you have the right dataset.

Complications

Datasets are different from journal articles in ways that can make them more difficult to cite. The first issue is deep citation or granularity, and the second is dynamic data.

Deep Citation

Traditional journal articles are cited as a whole and it is left to the reader to sort through the article to find the relevant information. When citing a dataset, more precision is sometimes necessary. If an analysis is done on part of a dataset, it can only be repeated by extracting exactly that subset of the data. Consequently, there is a desire for mechanisms allowing precise citation of data subsets. A number of solutions have been put forward:

Most common and least useful is to describe how you extracted the subset in the text of the article.
For some applications, such as time series, you may be able to specify a date or geographic range, or a limited number of variables within the citation.
Another approach is to mint a new identifier that refers to only the subset used, and refer back to the source dataset in the metadata of the subset. The DataCite DOI metadata scheme includes a flexible mechanism to specify relationships between objects, including that one is part of another.
The citation can include a Universal Numeric Fingerprint (UNF) as a verifier for the subset. A UNF can be used to test whether two datasets are identical, even if they are stored in different file formats. This won’t help you to find the subset you want, but it will tell you whether you’ve succeeded.
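
To make the verifier idea concrete, here is a simplified sketch of a format-independent fingerprint. It is not the actual UNF algorithm; it only borrows the general idea of normalizing values (here, rounding numbers to a fixed number of significant digits) before hashing, so that the same data stored as CSV or JSON yields the same result. The function names and the sample data are hypothetical.

```python
import csv
import hashlib
import io
import json


def normalize(value, digits=7):
    """Render a value canonically so the storage format does not matter."""
    try:
        # Numbers are rendered with a fixed number of significant digits.
        return format(float(value), f".{digits - 1}e")
    except (TypeError, ValueError):
        # Anything that is not a number is treated as text.
        return str(value)


def fingerprint(rows):
    """Hash the normalized cell values of a table, row by row."""
    digest = hashlib.sha256()
    for row in rows:
        for cell in row:
            digest.update(normalize(cell).encode("utf-8"))
            digest.update(b"\x00")  # cell separator
        digest.update(b"\n")  # row separator
    return digest.hexdigest()


# The same hypothetical subset stored as CSV and as JSON.
csv_text = "2012-01-01,3.50\n2012-01-02,4.25\n"
json_text = '[["2012-01-01", 3.5], ["2012-01-02", 4.25]]'

from_csv = list(csv.reader(io.StringIO(csv_text)))
from_json = json.loads(json_text)

print(fingerprint(from_csv) == fingerprint(from_json))  # True
```

A reader who extracts the subset independently can compute the same fingerprint and compare it with the one in the citation. A mismatch means the extraction differs from what the author used.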

Dynamic Data

When a journal article is published, it’s set in stone. Corrections and retractions are rare occurrences, and small errors like typos are allowed to stand. In contrast, some datasets can be expected to change over time. There is no consensus as to whether or how much change is permitted before an object must be issued a new identifier. DataCite recommends but does not require that DOIs point to a static object.

Broadly, dynamic datasets can be split into two categories:

Appendable datasets get new data over time, but the existing data is never changed. If timestamps are applied to each entry, inclusion of an access date or a date range in the citation may allow a user to confidently reconstruct the state of the dataset. The Federation of Earth Science Information Partners (ESIP), for instance, specifies that an add-on dataset be issued a DOI only once, and a time range specified in the citation. On the other hand, the Dataverse standard and DCC guidelines require new DOIs for any change. If the dataset is impractically large, the new DOI may cover a “time slice” containing only the new data. For instance, each year of data from a sensor could be issued its own DOI.
Data in revisable datasets may be inserted, altered, or deleted. Citations to revisable datasets are likely to include version numbers or access dates. In this case ESIP specifies that a new DOI should be minted for each “major” but not “minor” version. If a new DOI is required for each version, a “snapshot” of the dataset can be frozen from time to time and issued its own DOI.
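
For the appendable case, the role of timestamps can be shown with a small sketch. Assuming every entry records the date it was added and existing entries are never changed, an access date in the citation is enough to rebuild what the citing author saw, and a "time slice" is just a filter on the year. The records and function names below are hypothetical.

```python
from datetime import date

# Hypothetical append-only dataset: (date added, reading) pairs.
records = [
    (date(2012, 1, 15), 3.5),
    (date(2012, 6, 1), 4.25),
    (date(2013, 2, 10), 5.0),
]


def as_of(records, access_date):
    """Return the entries that already existed on the given access date."""
    return [r for r in records if r[0] <= access_date]


def time_slice(records, year):
    """Return only the entries added during one calendar year."""
    return [r for r in records if r[0].year == year]


print(len(as_of(records, date(2012, 12, 31))))  # 2
print(len(time_slice(records, 2013)))           # 1
```

Neither function works for revisable datasets, where earlier entries may have been altered or deleted. That is why those cases lean on version numbers or frozen snapshots instead.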

RDA Meeting Part 2: The Meeting in DC

carlystrasser — Tue, 24 Sep 2013 19:07:48 +0000

In last week’s post, I outlined the basic structure of the Research Data Alliance, a group intent on enabling international data sharing and collaboration. I attended the RDA 2nd Plenary in Washington, DC last week, and will share a few insights below.

The Good Stuff

The RDA has some seriously admirable ambitions, and they have many important people involved in the organization, working towards their goals. Summed up, the four great features about RDA are:

International involvement: Australia, European countries, and the US are all involved in supporting the RDA, and as such, it has the potential to influence international standards and interoperability.

Combining efforts: We are all aware of at least a couple of projects focused on data interoperability, infrastructure development, or other aspects of the brave new data world. The RDA is a place for those involved in these many disparate projects to meet up, discuss, and ensure the wheel doesn’t get reinvented.

Important people: There’s no doubt that everyone who’s anyone in the bureaucratic circles of the data world is at RDA. Having these important people, who are often heads of the many projects mentioned above, in the same room and talking about big-picture stuff is really important.

Flexible and community-driven: throughout the meeting in DC, I heard folks asking “What is our task?” or “How should we proceed?”. The answer was invariably that “it’s up to the working groups”. This means outputs won’t be tainted by secret agendas.

The Challenges

Working group woes: I’m a bit concerned about the membership of the working groups. It appears to be quite fluid, and varies from meeting to meeting. I sat in on a few working group meetings and neither committed to working on anything, nor shared my contact information. I’m guessing I wasn’t the only one in the room who left without doing either, which is potentially problematic. How will continuity of working group members be maintained? Who will be held accountable for the work? I am guessing the co-chairs of the working groups will be held accountable, but how likely are they to succeed when there is not a clear membership policy? I listed “flexible and community-driven” above as a good thing, but it has its limits. And finally, the working group names are not always well-suited for the goals of the group; this led to quite a bit of confusion at the meeting.

Diversity of attendees: If you followed the tweet stream from the meeting, you might have noticed the commentary on a lack of meeting attendee diversity. The gender balance in the audience was pretty good, but the speakers and panelists… not so much. More concerning, perhaps, was the complete lack of community members who actually produce and/or use data. It’s true that the focus was on technical issues and these would not be of interest to the average data producer, but it’s important to include them in the conversation since RDA outputs will affect them. And then there’s the age balance… the average age of attendees was probably around 50, with very few attendees under 40. A lack of early-career attendees suggests that uptake of what the RDA produces might not be as easy as they think.

According to Wikipedia, “RDA” might stand for Richard Dean Anderson, aka MacGyver. Rad. (From Flickr by trainman74)

What’s missing: The working groups focus on fairly specific, technical topics that align with the goals of interoperable data. Although this is important, I was concerned by the lack of discussion about the cultural shift that will be required to encourage data sharing, and how it is best addressed. For example, the working groups on data citation zeroed in on the issues surrounding granularity of identifiers when citing a particular dataset. What about the promotion of data citation as a cultural norm? This corresponds to my concerns about a lack of practitioners contributing to the working groups.

All in all, I’m looking forward to seeing what comes out of the RDA. Perhaps I will see some of you at a future Plenary Meeting? The 3rd Plenary is in Dublin in March, followed by the Netherlands in Fall 2014. Stay tuned!

Other blog posts about the RDA 2nd Plenary:

“Research Data Sharing Without Barriers… Get Involved?” blog post from the JISC Digital Infrastructure Team
“RDA – the Long-Tail – Adding the Institutional Perspective to the Mix.” blog post from the Confederation of Open Access Repositories

RDA Meeting Part 1: The RDA Organization

carlystrasser — Fri, 20 Sep 2013 15:52:16 +0000

Did you know that the National Academy of Sciences was founded in 1863, at the height of the Civil War? From Wikimedia Commons.

This week nearly 400 data nerds flooded the National Academy of Sciences in Washington, DC, for the second Plenary Meeting of the Research Data Alliance. I was among those nerds, and I’ll review some highlights of the #RDAplenary in my next blog post. First, however, I want to provide an overview of this thing called RDA.

The organization is funded via Australian, European Union, and US government agencies. Work started around August 2012 and focuses on “research data sharing without barriers”. The National Science Foundation awarded $2.5 million to Rensselaer Polytechnic Institute to participate in the RDA (read more in the NSF press release), which suggests that the NSF is very interested in the mission of the RDA. From the RDA website:

The Research Data Alliance aims to accelerate and facilitate research data sharing and exchange. The work of the Research Data Alliance is primarily undertaken through its working groups. Participation in working groups and interest groups, starting new working groups, and attendance at the twice-yearly plenary meetings is open to all.

An important note is that the RDA is NOT a funding body. It doesn’t fund participants at meetings, nor does it pay for infrastructure development or implementation. Think of the RDA as a means for folks interested in common subjects to get together twice yearly and try to ensure that

no one is reinventing the wheel,
standards, ontologies, and solutions are as universal as possible, and
careful consideration is being given to all aspects of developing services, tools, and standards for data sharing.

The working groups at the RDA are where the rubber meets the road. According to the website,

Working Groups conduct short-lived, 12-18 month efforts that implement specific tools, code, best practices, standards, etc. at multiple institutions.

There are currently 8 working groups listed on the website; if I’m not mistaken a few more were born this week. Working group members are expected to make a commitment to ensure the working group goals are met in the allotted time. They are essentially volunteers, who contribute their time and travel budgets to participate in the RDA. In some cases, RDA members join forces to write funding proposals to various agencies (NSF and international counterparts).

Anyone interested in helping to meet the goals of the RDA and its working groups is invited to join. I expect that the list of members includes bureaucrats, administrators, computer scientists, librarians, and many others. The RDA website has a few bios of some of the RDA’s leaders here. If you are interested in participating, check out the RDA website on how to get involved. In next week’s post I’ll share my impressions of the RDA meeting I attended.

Hello Data Publication World

johnkratz — Thu, 05 Sep 2013 00:12:46 +0000

from foodbeast.com

Greetings, all. I’m a new postdoc at the CDL and I’m very excited to be spending the next couple of years thinking about data publication. Carly has discussed data publication several times before, but briefly, the goal is to improve dataset reproduction and reuse by publishing datasets as “first class” scholarly objects akin to journal articles, with the attendant opportunities for preservation, citation, and award of credit.

I spent most of grad school tickling worms with an eyebrow hair glued to a toothpick (this is true), but now I’m moving from lab to library as a CLIR/DLF Postdoctoral Fellow in Data Curation for the Sciences and Social Sciences. The Sloan Foundation funds these fellowships to, as Josh Greenberg puts it, train “professionals with one foot in research and one foot in data curation”.

Partly for my own edification, I’m starting with a thorough survey of the data publication landscape. I’ll be looking at current practices and proposals for data publication, citation, and peer-review. I’m interested in questions like: How can the quality of a dataset be evaluated? How does the creator of a dataset get credit for it? How do datasets remain findable, accessible, and useable in the future? Does it even make sense to apply the terms “publication” or “peer-review” to data at all?

Where things go from there depends on how the survey turns out, so that’s much more up in the air. One possibility is to put a workflow for data publication together from existing tools. Another is to identify a need not met by existing tools that the CDL could address.

If you have ideas you’d like to share, please comment here or email me.

Shameless Plug: Applications for 2014 CLIR/DLF Fellowships are opening soon!

UC Open Access: How to Comply

carlystrasser — Tue, 20 Aug 2013 15:30:24 +0000

Free access to UC research is almost as good as free hugs! From Flickr by mhauri

My last two blog posts have been about the new open access policy that applies to the entire University of California system. For big open science nerds like myself, this is exciting progress and deserves much ado. For the on-the-ground researcher at a UC, knee-deep in grants and lecture preparation, the ado could probably be skipped in favor of a straightforward explanation of how to comply with the policy. So here goes.

Who & When:

1 November 2013: Faculty at UC Irvine, UCLA, and UCSF
1 November 2014: Faculty at UC Berkeley, UC Merced, UC Santa Cruz, UC Santa Barbara, UC Davis, UC San Diego, UC Riverside

Note: The policy applies only to ladder-rank faculty members. Of course, graduate students and postdocs should strongly consider participating as well.

To comply, faculty members have two options:

Option 1: Out-of-the-box open access. There are two ways to do this:

Publishing in an open access-only journal (see examples here). Some have fees and others do not.
Publishing with a more traditional publisher, but paying a fee to ensure the manuscript is publicly available. These are article-processing charges (APCs) and vary widely depending on the journal. For example, Elsevier’s Ecological Informatics charges $2,500, while Nature charges $5,200.

Learn more about different journals’ fees and policies: Directory of Open Access Journals: www.doaj.org

Option 2: Deposit your final manuscript in an open access repository.

In this scenario, you can publish in whatever journal you prefer – regardless of its openness. Once the manuscript is published, you take action to make a version of the article freely and openly available.

As UC faculty (or any UC researcher, including grad students and postdocs), you can comply via Option 2 above by depositing your publications in UC’s eScholarship open access repository. The CDL Access & Publishing Group is currently perfecting a user-friendly, efficient workflow for managing article deposits into eScholarship. The new workflow will be available as of November 1st. Learn more.

Does this still sound like too much work? Good news! The Publishing Group is also working on a harvesting tool that will automate deposit into eScholarship. Stay tuned – the estimated release of this tool is June 2014.

An Addendum: Are you not a UC affiliate? Don’t fret! You can find your own version of eScholarship (i.e., an open access repository) by going to OpenDOAR. Also see my full blog post about making your publications open access.

Why?

Academic libraries must pay exorbitant fees to provide their patrons (researchers) with access to scholarly publications. The very patrons who need these publications are the ones who provide the content in the form of research articles. Essentially, the researchers are paying for their own work, by proxy via their institution’s library.

What if you don’t have access? Individuals without institutional affiliations (e.g., between jobs), or who are affiliated with institutions that have no/a poorly funded library (e.g., in 2nd or 3rd world countries), depend on open access articles for keeping up with the scholarly literature. The need for OA isn’t limited to jobless or international folks, though. For proof, one only has to notice that the Twitter community has developed a hashtag around this, #Icanhazpdf (Hat tip to the Lolcats phenomenon). Basically, you tweet the name of the article you can’t access and add the hashtag in hopes that someone out in the Twittersphere can help you out and send it to you.

Special thanks to Catherine Mitchell from the CDL Publishing & Access Group for help on this post.

A Closer Look at the New UC Open Access Policy

carlystrasser — Wed, 07 Aug 2013 16:24:09 +0000

The UC is opening up their research locker. From Flickr by sam.d

Last week, the University of California announced a new Open Access Policy. Here I will explore the policy in a bit more detail. The gist of the policy is this: research articles authored by UC faculty will be made available to the public at no charge.

I’m sure most of this blog’s readers are familiar with paywalls and the nuances of scholarly publishing, but for those that aren’t – if you don’t have a license to get content from particular journals (via your institution’s library, for example) then you may pay upwards of $100 per article. For example, if I publish an amazing article in Nature (and don’t pay the $5,200 fee to make my article open access), my mom can’t get a copy of the article to hang on her fridge without either (1) getting a copy from someone with access, or (2) paying a big fee. Considering that my mom pays taxes that fund the NSF which funded my work, this is rather strange.

The UC policy is trying to change that. The idea is that faculty at the UC will grant a license to the UC prior to any contractual arrangement with publishers. The faculty member then has the right to make their research widely and publicly available, re-use it for various purposes, or modify it for future research publications – regardless of the publisher’s wishes for locking down the work.

Faculty will continue to publish their work in the most appropriate journal (open access or not). The big change is that now they can also place a copy of the publication in UC’s open access repository, eScholarship, which is freely accessible to anyone. To re-emphasize: This policy does NOT require that faculty publish in particular journals or pay “Article Processing Charges” to ensure their article is open access.

From the policy’s FAQ page:

Faculty are strongly encouraged to continue to publish as normal, in the most appropriate and prestigious journals. Faculty are not required to pay to publish articles or pay to deposit them in an open-access repository under this policy, unless they choose to do so.

How faculty can comply (from the FAQ page):

By passing the policy on July 24, 2013, UC faculty members have committed themselves to making their scholarly articles available to the public by granting a license to UC and depositing a copy of their publications in eScholarship, UC’s open access repository. The policy automatically grants UC a license to make any scholarly articles available in an open access repository. UC will not do so, however, until an author takes the action of depositing an article in UC’s eScholarship repository or confirms the availability of the article in another open access venue – i.e., a repository (such as PubMed Central, ArXiv or SSRN) or an open access journal.

The California Digital Library and the campus libraries will assist faculty by providing a streamlined deposit system into eScholarship and an automated ‘harvesting’ tool in order to ease the process of depositing articles; the harvesting tool is expected to be in place by June 2014.

And now, the downside. Michael Eisen, co-founder of the open access publisher PLOS, explains the problem with the new policy in his blog post:

This policy has a major, major hole – an optional faculty opt-out. This is there because enough faculty wanted the right to publish their works in ways that were incompatible with the policy that the policy would not have passed without the provision. Unfortunately, this means that the policy is completely toothless.

Eisen goes on to say

…because of the opt out, this is a largely symbolic gesture – a minor event in the history of open access, not the watershed event that some people are making it out to be.

Although I agree with Eisen that the opt-out clause significantly weakens the strength of this policy, I still believe this move on the UC’s part represents a major step forward in the battle to reclaim our scholarly work from some publishers. Perhaps it isn’t “watershed” but it’s certainly exciting, and it’s stimulating conversations about open science and accessibility to research.

UC Faculty Senate Passes #OA Policy

carlystrasser — Fri, 02 Aug 2013 18:42:23 +0000

Big news! I just got this email regarding the new Open Access Policy for the University of California System. I’ll write a full blog post next week but wanted to share this as soon as possible. (emphasis is mine)

The Academic Senate of the University of California has passed an Open Access Policy, ensuring that future research articles authored by faculty at all 10 campuses of UC will be made available to the public at no charge. “The Academic Council’s adoption of this policy on July 24, 2013, came after a six-year process culminating in two years of formal review and revision,” said Robert Powell, chair of the Academic Council. “Council’s intent is to make these articles widely—and freely— available in order to advance research everywhere.” Articles will be available to the public without charge via eScholarship (UC’s open access repository) in tandem with their publication in scholarly journals. Open access benefits researchers, educational institutions, businesses, research funders and the public by accelerating the pace of research, discovery and innovation and contributing to the mission of advancing knowledge and encouraging new ideas and services.

Chris Kelty, Associate Professor of Information Studies, UCLA, and chair of the UC University Committee on Library and Scholarly Communication (UCOLASC), explains, “This policy will cover more faculty and more research than ever before, and it sends a powerful message that faculty want open access and they want it on terms that benefit the public and the future of research.”

The policy covers more than 8,000 UC faculty at all 10 campuses of the University of California, and as many as 40,000 publications a year.

It follows more than 175 other universities who have adopted similar so-called “green” open access policies. By granting a license to the University of California prior to any contractual arrangement with publishers, faculty members can now make their research widely and publicly available, re-use it for various purposes, or modify it for future research publications. Previously, publishers had sole control of the distribution of these articles. All research publications covered by the policy will continue to be subjected to rigorous peer review; they will still appear in the most prestigious journals across all fields; and they will continue to meet UC’s standards of high quality. Learn more about the policy and its implementation here: http://osc.universityofcalifornia.edu/openaccesspolicy/

UC is the largest public research university in the world and its faculty members receive roughly 8% of all research funding in the U.S.

With this policy UC Faculty make a commitment to the public accessibility of research, especially, but not only, research paid for with public funding by the people of California and the United States. This initiative is in line with the recently announced White House Office of Science and Technology Policy (OSTP) directive requiring “each Federal Agency with over $100 million in annual conduct of research and development expenditures to develop a plan to support increased public access to results of the research funded by the Federal Government.” The new UC Policy also follows a similar policy passed in 2012 by the Academic Senate at the University of California, San Francisco, which is a health sciences campus.

“The UC Systemwide adoption of an Open Access (OA) Policy represents a major leap forward for the global OA movement and a well-deserved return to taxpayers who will now finally be able to see first-hand the published byproducts of their deeply appreciated investments in research” said Richard A. Schneider, Professor, Department of Orthopaedic Surgery and chair of the Committee on Library and Scholarly Communication at UCSF. “The ten UC campuses generate around 2-3% of all the peer-reviewed articles published in the world every year, and this policy will make many of those articles freely available to anyone who is interested anywhere, whether they are colleagues, students, or members of the general public”

The adoption of this policy across the UC system also signals to scholarly publishers that open access, in terms defined by faculty and not by publishers, must be part of any future scholarly publishing system. The faculty remains committed to working with publishers to transform the publishing landscape in ways that are sustainable and beneficial to both the University and the public.

More information: http://osc.universityofcalifornia.edu/openaccesspolicy/

Contact:

Professor Christopher Kelty, UCLA / (310) 880-2433; ckelty@ucla.edu
Professor Richard Schneider, UC San Francisco / 415-305-7992; rich.schneider@ucsf.edu
Professor Robert Powell, Chair, Academic Council / 510-987-0711; Robert.powell@ucop.edu

University of California, Berkeley campus, 1901. Contributed to Calisphere by the Berkeley Public Library.

The Data Lineup for #ESA2013

carlystrasser — Mon, 29 Jul 2013 18:48:48 +0000

Why am I excited about Minneapolis? Potential Prince sightings, of course! From http://www.emusic.com

In less than a week, the Ecological Society of America’s 2013 Meeting will commence in Minneapolis, MN. There will be zillions of talks and posters on topics ranging from microbes to biomes, along with special sessions on education, outreach, and citizen science. So why am I going?

For starters, I’m a marine ecologist by training, and this is an excuse to meet up with old friends. But of course the bigger draw is to educate my ecological colleagues about all things data: data management planning, open data, data stewardship, archiving and sharing data, et cetera et cetera. Here I provide a rundown of must-see talks, sessions, and workshops related to data. Many of these are tied to the DataONE group and the rOpenSci folks; see DataONE’s activities and rOpenSci’s activities. Follow the full ESA meeting on Twitter at #ESA2013. See you in Minneapolis!

Sunday August 4th

0800-1130 / WK8: Managing Ecological Data for Effective Use and Re-use: A Workshop for Early Career Scientists

For this 3.5 hour workshop, I’ll be part of a DataONE team that includes Amber Budden (DataONE Community Engagement Director), Bill Michener (DataONE PI), Viv Hutchison (USGS), and Tammy Beaty (ORNL). This will be a hands-on workshop for researchers interested in learning about how to better plan for, collect, describe, and preserve their datasets.

1200-1700 / WK15: Conducting Open Science Using R and DataONE: A Hands-on Primer (Open Format)

Matt Jones from NCEAS/DataONE will be assisted by Karthik Ram (UC Berkeley & rOpenSci), Carl Boettiger (UC Davis & rOpenSci), and Mark Schildhauer (NCEAS) to highlight the use of open software tools for conducting open science in ecology, focusing on the interplay between R and DataONE.

Monday August 5th

1015-1130 / SS2: Creating Effective Data Management Plans for Ecological Research

Amber, Bill and I join forces again to talk about how to create data management plans (like those now required by the NSF) using the free online DMPTool. This session is only 1.25 hours long, but we will allow ample time for questions and testing out the tool.

1130-1315 / WK27: Tools for Creating Ecological Metadata: Introduction to Morpho and DataUp

Matt Jones and I will be introducing two free, open-source software tools that can help ecologists describe their datasets with standard metadata. The Morpho tool can be used to locally manage data and upload it to data repositories. The DataUp tool helps researchers not only create metadata, but check for potential problems in their dataset that might inhibit reuse, and upload data to the ONEShare repository.

Tuesday August 6th

0800-1000 / IGN2: Sharing Makes Science Better

This two-hour session, organized by Sandra Chung of NEON, is composed of 5-minute-long “ignite” talks, which guarantees you won’t nod off. The topics look pretty great, and the crackerjack list of presenters includes Ethan White, Ben Morris, Amber Budden, Matt Jones, Ed Hart, Scott Chamberlain, and Chris Lortie.

1330-1700 / COS41: Education: Research And Assessment

In my presentation at 1410, “The fractured lab notebook: Undergraduates are not learning ecological data management at top US institutions”, I’ll give a brief talk on results from my recent open-access publication with Stephanie Hampton on data management education.

2000-2200 / SS19: Open Science and Ecology

Karthik Ram and I are getting together with Scott Chamberlain (Simon Fraser University & rOpenSci), Carl Boettiger, and Russell Neches (UC Davis) to lead a discussion about open science. Topics will include open data, open workflows and notebooks, open source software, and open hardware.

2000-2200 / SS15: DataNet: Demonstrations of Data Discovery, Access, and Sharing Tools

Amber Budden will demo and discuss DataONE alongside folks from other DataNet projects like the Data Conservancy, SEAD, and Terra Populus.

It’s Time for Better Project Metrics

carlystrasser — Wed, 17 Jul 2013 20:24:59 +0000

I’m involved in lots of projects, based at many institutions, with multiple funders and oodles of people involved. Each of these projects has requirements for reporting metrics that are used to prove the project is successful. Here, I want to argue that many of these metrics are arbitrary, and in some cases misleading. I’m not sure what the solution is – but I am anxious for a discussion to start about reporting requirements for funders and institutions, metrics for success, and how we measure a project’s impact.

What are the current requirements for projects to assess success? The most common request is for text-based reports – which are reminiscent of junior high book reports. My colleague here at the CDL, John Kunze, has been working for the UC in some capacity for a long time. If anyone is familiar with the bureaucratic frustrations of metrics, it’s John. Recently he brought me a sticky-note with an acronym he’s hoping will catch on:

SNωωRF: Stuff nobody wants to write, read, or fund

The two lower-case omegas, which translate to “w” for the acronym, represent the letter “O” to facilitate pronunciation, i.e., “snorf”. He was prompted to invent this catchy acronym after writing up a report for a collaborative project we work on, based in Europe. After writing the report, he was told it “needed to be longer by two or three pages”. The necessary content was there in the short version – but it wasn’t long enough to look thorough. Clearly brevity is not something that’s rewarded in project reporting.

Which orange dot is bigger? Overall impressions differ from what the measurements say. Project metrics don’t always reflect success. From donomic10.edublogs.org

Outside of text-based reports, there are other reports and metrics that higher-ups like: number of website hits, number of collaborations, number of conferences attended, number of partners/institutions involved, et cetera. A really successful project can look weak in all these ways. Similarly, a crap project can look quite successful based on the metrics listed. So if there is not a clear correlation between metrics used for project success, and actual project success, why do we measure them?

So what’s the alternative? The simplest alternative – not measuring/reporting metrics – is probably not going to fly with funders, institutions, or organizations. In fact, metrics play an important role. They allow for comparisons among projects, provide targets to strive for, and allow project members to assess progress. Perhaps rather than defaulting to the standard reporting requirements, funders and institutions could instead take some time to consider what success means for a particular project, and customize the metrics based on that.

In the space I operate (data sharing, data management, open science, scholarly publishing etc.) project success is best assessed by whether the project has (1) resulted in new conversations, debates and dialogue, and/or (2) changed the way science is done. Examples of successful projects based on this definition: figshare, ImpactStory, PeerJ, IPython Notebook, and basically anything funded by the Alfred P. Sloan Foundation. Many of these would also pass the success test based on more traditional metrics, but not necessarily. I will avoid making enemies by listing projects that I deem unsuccessful, despite their passing the test based on traditional metrics.

The altmetrics movement is focused on reviewing researcher and research impact in new, interesting ways (see my blog posts on the topic here and here). What would this altmetrics movement look like in terms of projects? I’m not sure, but I know that its time has come.

Software Carpentry and Data Management

carlystrasser — Fri, 28 Jun 2013 15:25:48 +0000

About a year ago, I started hearing about Software Carpentry. I wasn’t sure exactly what it was, but I envisioned tech-types showing up at your house with routers, hard drives, and wireless mice to repair whatever software was damaged by careless fumblings. Of course, this is completely wrong. I now know that it is actually an ambitious and awesome project that was recently adopted by Mozilla, and recently got a boost from the Alfred P. Sloan Foundation (how is it that they always seem to be involved in the interesting stuff?).

From their website:

Software Carpentry helps researchers be more productive by teaching them basic computing skills. We run boot camps at dozens of sites around the world, and also provide open access material online for self-paced instruction.

SWC got its start in the 1990s, when its founder, Greg Wilson, realized that many of the scientists who were trying to use supercomputers didn’t actually know how to build and troubleshoot their code, much less use things like version control. More specifically, most had never been shown how to do four basic tasks that are fundamentally important to any science involving computation (which is increasingly all science):

growing a program from 10 to 100 to 1000 lines without creating a mess
automating repetitive tasks
basic quality assurance
managing and sharing data and code
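
To make a couple of these tasks concrete, here is a minimal Python sketch of my own (it is not taken from the Software Carpentry materials). It automates one repetitive task, computing the same summary for several datasets, and adds a basic quality-assurance check. The site names, the `biomass` column, and the `column_mean` helper are all hypothetical, made up for illustration.

```python
import csv
import io
from statistics import mean


def column_mean(csv_text, column):
    """Return the mean of one numeric column in a CSV-formatted string."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    values = [float(row[column]) for row in rows]
    if not values:
        # Basic quality assurance: fail loudly instead of returning nonsense.
        raise ValueError("no data rows found")
    return mean(values)


# Hypothetical field data: one small CSV per site (invented for this sketch).
datasets = {
    "site_a": "plot,biomass\n1,2.0\n2,4.0\n",
    "site_b": "plot,biomass\n1,1.5\n2,2.5\n3,3.5\n",
}

# Automating a repetitive task: the same summary for every site, in one pass.
summary = {site: column_mean(text, "biomass") for site, text in datasets.items()}

# A tiny test against a value worked out by hand.
assert summary["site_a"] == 3.0
```

Keeping the calculation in one small function means that adding a tenth or hundredth site is a change to the data, not to the code, which is the spirit of the first task on the list.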

Software Carpentry is too cool for a reference to the Carpenters. From marshallmatlock.com (click for more).

Greg started teaching these topics (and others) at Los Alamos National Laboratory in 1998. After a bit of stop and start, he left a faculty position at the University of Toronto in April 2010 to devote himself to it full-time. Fast forward to January 2012, and Software Carpentry became the first project of what is now the Mozilla Science Lab, supported by funding from the Alfred P. Sloan Foundation.

This new incarnation of Software Carpentry has focused on offering intensive, two-day workshops aimed at grad students and postdocs. These workshops (which they call “boot camps”) are usually small – typically 40 learners – with low student-teacher ratios, ensuring that those in attendance get the attention and help they need.

Other than Greg himself, whose role is increasingly to train new trainers, Software Carpentry is a volunteer organization. More than 50 people are currently qualified to instruct, and the number is growing steadily. The basic framework for a boot camp is this:

Someone decides to host a Software Carpentry workshop for a particular group (e.g., a flock of macroecologists, or a herd of new graduate students at a particular university). This can be fellow researchers, department chairs, librarians, advisors — you name it.
Organizers round up funds to pay for travel expenses for the instructors and any other anticipated workshop expenses.
Software Carpentry matches them with instructors according to the needs of their group; together, they and the organizers choose dates and open up enrolment.
The boot camp itself runs eight hours a day for two consecutive days (though there are occasionally variations). Learning is hands-on: people work on their own laptops, and see how to use the tools listed below to solve realistic problems.

That’s it! They have a great webpage on how to run a bootcamp, which includes checklists and thorough instructions on how to ensure your boot camp is a success. About 2300 people have gone through a SWC bootcamp, and the organization hopes to double that number by mid-2014.

The core curriculum for the two-day boot camp is usually:

0.5 days on the Unix shell
0.5 days on version control (focusing on Git)
0.5 days on Python or R programming, depending on the crowd
0.5 days on databases and SQL

Software Carpentry also offers over a hundred short video lessons online, all of which are CC-BY licensed (go to the SWC webpage for a hyperlinked list):

Version Control
The Shell
Python
Testing
Sets and Dictionaries
Regular Expressions
Databases
Using Access
Data
Object-Oriented Programming
Program Design
Make
Systems Programming
Spreadsheets
Matrix Programming
MATLAB
Multimedia Programming
Software Engineering

Why focus on grad students and postdocs? Professors are often too busy with teaching, committees, and proposal writing to improve their software skills, while undergrads have less incentive to learn since they don’t have a longer-term project in mind yet. Software Carpentry is also playing a long game: today’s grad students are tomorrow’s professors, and the day after that, they will be the ones setting parameters for funding programs, editing journals, and shaping science in other ways. Teaching them these skills now is one way – maybe the only way – to make computational competence a “normal” part of scientific practice.

So why am I blogging about this? When Greg started thinking about training researchers to understand the basics of good computing practice and coding, he couldn’t have predicted the huge explosion in the availability of data, the number of software programs to analyze those datasets, and the shortage of training that researchers receive in dealing with this new era. I believe that part of the reason funders stepped up to help the mission of Software Carpentry is that now, more than ever, researchers need these skills to successfully do science. Reproducibility and accountability are in more demand, and data sharing mandates will likely morph into workflow sharing mandates. Ensuring reproducibility in analysis is next to impossible without the skills Software Carpentry’s volunteers teach.

My secret motive for talking about SWC? I want UC librarians to start organizing bootcamps for groups of researchers on their campuses!