Marcio von Muhlen

We need a GitHub of Science

2011-04-21T17:05:00-07:00

Summary

Publishing is central to Academia, but its publishing system is outclassed by what Open Source software developers have in GitHub
GitHub’s success is not just about openness, but also a prestige economy that rewards valuable content producers with credit and attention
Open Science efforts like arXiv and PLoS ONE should follow GitHub’s lead and embrace the social web

Publish or Perish

I am a postdoctoral fellow, and my academic department is currently running a junior faculty search. We are interviewing four candidates, each of whom will present a job talk attended by the entire department. Before each talk, I’ll receive the candidate’s application packet, and my eyes will scan the “publications” section of the resume. The presence of a first-author article in the ultra-prestigious academic journals Science or Nature would all but guarantee an offer. Multiple publications in top-tier journals would indicate a strong application. If those are missing, meaning the publication history is weak, I’ll wonder how that person got an interview in the first place. Cultural fit, letters of reference and other credentials certainly matter, but beyond publications, everything is secondary.

To anyone involved in academia, this overwhelming focus on publications is a given. Publishing is so central to scientists that their academic value can be measured by adding the relative worth of their publications. After many years, citations preferentially accumulate towards publications of significance, and by extension, their authors. Ranking importance by citations received is a powerful concept, and incidentally is the basis for Google’s search algorithm. But at the beginning of an academic’s career, before citations accumulate, reputation rests largely on what journals they have published in.

Getting a paper accepted into an academic journal requires passage through the often opaque process of peer-review. Scientists make a big deal of peer-review, because it is supposed to be the filter that separates mere opinions from trusted, citable sources. However, the peer-review process in science has close analogs in any “old-media” field, such as TV or radio. Like academic journals, these are media of limited capacity, and there are always more submissions (or ideas for submissions) than there are openings. The field’s establishment selects the content it deems worthy of distribution, which effectively silences what it doesn’t choose. This is especially true of peer-review as practiced in prestigious journals, defined as the ones that get their contributors faculty jobs.

Having editorial decisions made by established experts makes sense, since they draw on judgement born from years of experience. But this exposes the system to vulnerabilities common to any decision by committee – especially semi-secret committee – such as lack of agility, an aversion to disruptive innovation, and the tendency of committee members (and their friends) to be more equal in their own eyes than anyone else. Because publishing affects scientists so deeply, the strengths and weaknesses of this system inevitably affect the makeup and character of science as a whole. Which makes one wonder, is there a better way?

GitHubbing

My training has spanned biology, engineering, and computer science. My latest project, Instant Q&A for Physician Communities, relies heavily on open source code and led me to GitHub and Git. The Linux community developed Git, a distributed version control system, to coordinate work on the Linux code repository among thousands of programmers. Git is itself open source, and has become widely adopted for many software projects (open source and not). GitHub is a cloud service that hosts over 1 million Git repositories. Since its launch in 2008, GitHub has quickly become the de facto platform for publishing open source code, whose popularity is changing the world. If you’ve ever been astonished at how quickly the web world seems to move, the primary reasons are 1) it’s not dominated by Microsoft, so we have competition instead of a monopoly, and 2) open source code, widely shared through a multitude of email lists in the past and now centralized at GitHub [1]. How has GitHub become so successful?

GitHub is a social network of code, the first platform for sharing validated knowledge native to the social web [2]. This is a big deal. I believe it represents a way of distributing validated knowledge that is demonstrably superior to academic publishing. How are these even related? Software developers rarely write applications from scratch. Instead, they often start with various modular bundles of open source code. Within Ruby (the programming language underlying the popular web application framework Ruby on Rails [3]) these bundles are called gems. My current project employs 34 gems. Each one is responsible for a specific task, such as logging in users, interfacing with cloud storage, or making fancy-looking buttons. Science operates in a similar way. Scientists never begin a research project from an intellectual vacuum. They stand on the shoulders of giants, building on the knowledge contained in previous publications to form a new, coherent finding. For example, the article in which I published the bulk of my PhD thesis cites 38 others.

Gems are typically developed, distributed, and promoted through GitHub, and therein lies the connection. GitHub has evolved to solve the same general problem that scientific publishing does: making modular, validated units of knowledge easily usable by a global community, with mechanisms that efficiently allocate prestige to proven contributors. GitHub has the advantage of doing this with 21st century technology, the social web, while academic publishing is based on the printing press. This suggests an opportunity for the scientific community to evolve its publishing practices by assimilating mechanisms proven to work for GitHub.

Published Versus Prestigious

The existing peer-review process arose from the limited carrying capacity of physical journals. Prioritization had to happen before publication, because journals were limited in size to what could be economically printed and shipped. If you were born before 1990, you may recall the prestige formerly associated with being “a published author”. Today, however, distributing media is basically free. Anyone can start a blog and deliver content worldwide in minutes. Clay Shirky has made a career of deftly explaining how this has fundamentally changed the media equation, with such unexpected consequences as YouTube videos that get more views than Super Bowl commercials.

Individuals still have a limited capacity for consuming and evaluating content, so prioritization and authentication remain necessary, but look different. These functions are now disconnected from publication. Google prioritizes web pages by analyzing utility after publication, by tracking citations in the form of inbound links. Similarly, anyone can publish a gem to GitHub, and published gems are prioritized by the number of developers “watching” for updates or “forking” new development lines. This is the social web at work, where the audience gets to decide what and whom to pay attention to all by itself, without requiring assistance from all-powerful editorial committees. One can complain that lowering barriers to publication leads to content that on average is of lower quality. But the abundance of non-significant projects in GitHub does not detract from its usability, because those projects are never brought to anyone’s attention [4].

Prestige is really about having an engaged audience that follows and recognizes your activities. This formerly required publication through established venues, but that’s no longer needed since your audience can use the social web to recognize and engage with you directly.

The Market for Prestige

Gems on GitHub are not just code. They also have authors whose relative contributions are automatically catalogued by Git, as shown in this impact graph for the popular and open source jQuery project. If you’ve visited a web application recently, chances are you’ve benefitted from jQuery, which makes it easy for a web engineer to turn static web sites into responsive web applications (think interactions with buttons instead of navigation through links). This impact graph can let you know precisely which developers are responsible for this awesomeness. In this way, GitHub acts as an efficient, incorruptible “central bank” of the prestige supply. Furthermore, unlike in Google, great contributions in GitHub bring prestige to their creators, not their domain names. If you wanted to hire a contractor to work on a web application, GitHub can let you know who has publicly demonstrated the skills you’d need. It’s thus not surprising that GitHub profiles are supplanting traditional resume items, such as a CS degree, for discerning employers looking to hire top talent.

By contrast, current Open Science efforts that ask scientists to “share all your data” have not become mainstream, because they do not appropriately reward knowledge producers. They are all free-distribution and no prestige, solving a different half of the problem than traditional journals but not the whole enchilada. Put another way, when anything can be published, there is no prestige associated with being published, so prestige must be introduced in other ways. Evangelists for Open Science should focus on promoting new, post-publication prestige metrics that will properly incentivize scientists to focus on the utility of their work, which will allow them to start worrying less about publishing in the right journals.

The biomedical world is increasingly permeated by code and data [5], which should be very amenable to GitHub-style metrics since they are by nature tied to networked computers. Scientists in fields like genomics and biomedical informatics are being held to the same publication expectations as their peers, but this makes little sense. An article describing a genomic database is nowhere near as useful as an open API for accessing it. We need trusted ways to quantify just how useful that API and associated code are to the scientific community, which can be listed on a scientist’s profile and utilized by committees making hiring and funding decisions [6].

Challenges and Current Efforts

Of course, there are fundamental differences between publishing software code and publishing science. Copying code results in an exact replica and does not affect the original. By contrast, duplicating a research finding may require significant expenses just to recreate experimental conditions. Code is structured by the strict syntax of programming languages, while most scientific research is not. For this reason and others, academic articles and journals are not going to disappear, but they should not be the only way for a scientist to accumulate prestige.

Unfortunately, energy that could be spent developing these new solutions is instead tied up with the older struggle of open-access. Universities still pay outrageous sums to journal publishers to allow them access to the knowledge they just produced, reviewed, and edited on their own dime [7]. Broadly speaking, traditional journals are being reduced to rent-takers on brand names with reputational inertia. arXiv, which provides open access to pre-prints in many quantitative disciplines, is a notable and long-running example of the scientific community’s workaround to this problem. The arXiv is amazing, but why remain dependent on a system it could be replacing [8]?

PLoS is at the cutting edge of both open-access and rethinking the functions of a journal. PLoS ONE comes closest to what I am describing, in that their peer-review process screens only for scientific rigour, not perceived impact, meaning they will publish content considered unsexy and let future citations determine importance. But they have not yet embraced the social web, as the lack of scientist profiles (with associated prestige metrics) on their website demonstrates. In programmer jargon, PLoS ONE needs to become a web application, not a website that hosts content. One problem might be that they still consider themselves a journal first, and journals have editorial boards, while the social web is all about not having editors. There is no editorial board at GitHub.

When I discuss this with current faculty, a typical reaction is that I’m pining for a social network of scientists. That seems reasonable, and it is being tried, but may not be bold enough. GitHub did not succeed by being a social network of programmers. It succeeded by being a social network of code. We need a social network of science, meaning scientific bundles of knowledge must be structured and accessible by API, with the connections among those bundles and appropriate utility metrics being what connects and prioritizes scientists.

APIs for science already exist, and some are incredibly useful, but they have ignored authorship and prestige implications, which has prevented them from achieving their potential. For example, biophysicists have the RCSB Protein Data Bank, which stores experimentally determined protein structures. This database is a tremendous asset to the field, but it could represent much more, as a story from my younger days illustrates. In 2004, as an undergrad, I spent a summer writing Python code to download and analyze all existing RCSB structures. That program built a database of “real” structures to train a scoring algorithm, which subsequently scored computationally generated structures to see how “real” they seemed [9]. Unfortunately, my results were not compelling enough to be published in a prestigious academic journal, and therefore not interesting to my research adviser. Open-sourcing and publishing that code might have saved someone’s time, spurred new thinking, or at the very least marked a tangible reward for my work. But the incentives to my adviser weren’t there, so he did not suggest it. That idea did not even occur to me, because I was not a good enough programmer to know about SourceForge, a less-social precursor to GitHub, so the code went nowhere.

Hey Mr. Gates

It may be that the activation energy required to initiate changes won’t arise within the system. In that case, an outside push might do the job, and the best place for this push to come from may be a nimble funding agency. For example, a request for proposals could specify that phase II funding decisions would be based on the impact of online resources developed in phase I, as measured by specific metrics developed with community feedback. Nothing makes a scientist contemplate change faster than a new source of grant money, and the only thing better than a faculty applicant with a paper in Science may be one bringing in a multimillion dollar grant.

Further reading:

Collective knowledge systems: Where the Social Web meets the Semantic Web (Tom Gruber)

Peer Review in Academic Promotion and Publishing: Its meaning, locus and future (Diane Harley and Sophia Krzys Acord)

Mechanisms for (Mis)Allocating Scientific Credit (Jon Kleinberg and Sigal Oren)

The Life Scientists room on Friendfeed

Notes

1 “The combination of the Internet and open source transformed the functionality in modern programming tools, increasing developer productivity 10 fold” - Ben Horowitz, formerly of Netscape.

2 “Native” in the sense eloquently explained by USV (the VCs who funded Twitter): “Native opportunities are the ones that make use of unique capabilities of [new] platforms”. The social web is the new platform.

3 For example, Twitter, Groupon, and GitHub itself run on Ruby on Rails.

4 I speculate that many gems are also discovered through technical blogs (found through Google) or the programmer Q&A site StackOverflow.

5 Biomedical research is also huge - funding has been squeezed lately, but is still on the order of ~$100B annually. Therefore the potential market is large enough to be worthwhile to build for.

6 The unique requirements of the scientific community probably mean GitHub itself can’t do the job.

7 Or more accurately, on the federal government or philanthropic organizations that fund them. Journals do not compensate their editors or peer reviewers.

8 The peer-review process has been hacked via arXiv before, by Grigori Perelman. But to appreciate how unusual Grigori’s motivations are, consider that he also refused to accept both the Fields Medal and the $1M Millennium Prize.

9 Computing protein structure from amino acid sequence is known as “the protein folding problem” and is one of the holy grails of science.

Thanks to Sean Ahrens, Sean Carroll, Manuel Cebrian, Wendy Chapman, Lawrence David, Lucila Ohno-Machado, Carlos von Muhlen, Denise von Muhlen, and Ryan Weald for reading drafts and helpful discussions.

hackers-wanted-1000-job-posts-to-course-vi-at

2011-02-21T17:58:00-08:00

Tried to hire a hacker lately?

Hacker talent is highly non-commoditized, and the ROI at the top end of the market more closely resembles professional sporting leagues than traditional engineering career fields. [1,2] I have a theory that Paul Graham is the Scott Boras of this marketplace, which I’ll write about in the future. The ability to scout and hire great hackers before other market participants (GOOG = Yankees?) can be extremely valuable.

The mailing list for Course VI at MIT is an interesting place to see what people who are (presumably) looking to hire great hackers are saying in “hackers wanted” posts. [3] Many years ago, my friend Fergus at PicBounce tipped me off that it’s a great resource to capture trends and vocabulary in the tech entrepreneurship world. [4] It’s also good for a few laughs. [5] To brush up on my Python skills, I performed basic word count analyses on last semester’s posts; I figured the results might be interesting to others so I’m sharing them here.

Updated 01-18-11: The administrator of this list contacted me and asked me to clarify details of who maintains the list. It’s not the EECS Department’s list, but rather a personal list maintained by Anne Hunter, who can be reached at anneh@eecs at MIT’s domain name. Thanks Anne for keeping this very useful list going, and my apology for not crediting you when I first posted this.

Methods

I extracted 944 messages sent to the announcement list from July to December 2010 to a text file using Automator on Snow Leopard. Using Python, I parsed messages into message objects containing date, subject, and body. I ignored date and subject and looked only at body text. I then stripped punctuation (except # and +), used split() to tokenize strings, and built a wordDict with key/value pairs of {word: wordCount}. To filter non-job posts like course announcements, I ignored messages that contained any of a list of words usually used by the department. [6] I then merged all the individual wordDicts into a globalDict. I dropped words commonly used in English. [7] I did something similar for two-word phrases, to capture terms like “social networking” (data used for figures B and C). I’m reporting average word count in the globalDict – total word count divided by number of posts. The number of posts analyzed was 905 (thus 39 were scored as being department announcements).
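For readers who want to try something similar, here is a minimal sketch of the steps described above. It is not my original script: the function names, the abbreviated word lists, and the whole-word matching used to spot department posts are all assumptions made for illustration.

```python
import string
from collections import Counter

# Abbreviated, illustrative lists; the full ones are in notes 6 and 7.
DEPARTMENT_TERMS = ["ta", "lecture", "course description", "grad school",
                    "correction", "announcement", "websis"]
STOP_WORDS = {"a", "about", "and", "for", "in", "is", "of", "the", "to",
              "we", "with", "you"}

# Delete all ASCII punctuation except '#' and '+', so "c#" and "c++" survive.
STRIP_TABLE = str.maketrans(
    "", "", "".join(ch for ch in string.punctuation if ch not in "#+"))


def tokenize(body):
    """Lowercase the body, strip punctuation, and split on whitespace."""
    return body.lower().translate(STRIP_TABLE).split()


def is_department_post(tokens):
    """Assumption: a post is filtered if any term appears as whole word(s)."""
    text = " " + " ".join(tokens) + " "
    return any(" " + term + " " in text for term in DEPARTMENT_TERMS)


def count_terms(tokens):
    """Per-post counts of single words (minus stop words) and two-word phrases."""
    words = Counter(t for t in tokens if t not in STOP_WORDS)
    phrases = Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return words, phrases


def average_counts(bodies):
    """Merge per-post counts and divide by the number of posts kept."""
    global_words, global_phrases = Counter(), Counter()
    n_posts = 0
    for body in bodies:
        tokens = tokenize(body)
        if is_department_post(tokens):
            continue
        words, phrases = count_terms(tokens)
        global_words.update(words)
        global_phrases.update(phrases)
        n_posts += 1
    if n_posts == 0:
        return {}, {}, 0
    avg_words = {w: c / n_posts for w, c in global_words.items()}
    avg_phrases = {p: c / n_posts for p, c in global_phrases.items()}
    return avg_words, avg_phrases, n_posts
```

Calling average_counts() on a list of message bodies returns the per-post averages for words and for two-word phrases, plus the number of posts that survived the department filter. Note that deleting punctuation also turns “full-time” into “fulltime”, which may or may not be what you want.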

Results

The most common word was “experience”. Lots of generic technology words show up in the top-50 (figure A), as expected. I figured a more interesting plot might be the occurrence of words that I thought a priori were interesting, which I call “programmer vocabulary” (figure B). This is subject to my own biases of what I think is important, so sorry if I didn’t include your favorite. The winner there was “web”, followed closely by “mobile”. The prevalence of buzzwords like “rockstar”, “ninja” and “guru” was smaller than I expected. Finally I looked at locations (figure C). The sum of the SF Bay Area terms was 0.067, which I didn’t place in the figure because it would have been the only multi-term aggregate.

Any thoughts on how I can improve this analysis? I’ll do this again in 6 months if enough people find this interesting (email me and I’ll send you the update when it’s ready).

Disclaimers: This is a relatively small sample size subject to outlier effects (e.g. a single message that contains “mobile” twenty times). I used an arbitrary exclusion list to filter department announcements and it might be incomplete (i.e. these results may include department announcements that were not job posts).
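To make the outlier caveat concrete, here is a small sketch that compares the average count I reported with the fraction of posts that mention a word at least once. The second metric is not something I computed for the figures; the function names and the toy posts are made up for illustration.

```python
from collections import Counter


def average_count(posts, word):
    """Total occurrences of `word` divided by the number of posts."""
    return sum(Counter(p.lower().split())[word] for p in posts) / len(posts)


def post_fraction(posts, word):
    """Fraction of posts that mention `word` at least once."""
    return sum(word in p.lower().split() for p in posts) / len(posts)


# Toy data: one post repeats "mobile" twenty times, three posts mention "web" once.
posts = ["mobile " * 20 + "startup",
         "web developer wanted",
         "web intern",
         "web hacker"]

print(average_count(posts, "mobile"))  # 5.0
print(average_count(posts, "web"))     # 0.75
print(post_fraction(posts, "mobile"))  # 0.25
print(post_fraction(posts, "web"))     # 0.75
```

By average count, “mobile” looks far more common than “web” here, even though only one post in four mentions it. Counting posts rather than occurrences would be one way to dampen that effect.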

Notes

1 Hacker as in “person who builds things with computer code”, as opposed to the more common definition of “16 year old who does bad things with computers”.

2 Just look at the size of signing bonuses. Paul Buchheit’s was up there with LeBron James’s when he was acqui-hired by Facebook via Friendfeed (there were additional engineers in this trade deal).

3 Course VI is MIT-speak for Department of Electrical Engineering and Computer Science.

4 A great complement to Hacker News.

5 A real email from Fall 2010, identities redacted to protect the guilty:

Subject: Coder Needed for Social Network Website

Greetings,

My name is [redacted] and I am looking for a very experienced coder. My team and I are looking to build a social networking site, very similar to facebook, that has extreme potential. We plan on 95 percent of all college students becoming active users within 4-8 months from the launch date. My partners [redacted], [redacted], and I are marketing and promotions specialists with a great outreach to the college market. We are searching for an unbelievably talented coder, who is ready and willing to PARTNER UP on this college based website project to make history. We are very serious about our business venture so we are looking for someone who really feels confident that they are capable of handling a project like this.

Please email all your contact info as well as a resume/portfolio with sample projects.

I look forward to hearing from anyone who’s up for this challenge. Lets Make History.

Best,

[redacted]

[redacted]@gmail.com

6 Words that appear in department posts: ‘ta’, ‘lecture’, ‘course description’, ‘grad school’, ‘correction’, ‘announcement’, ‘websis’.

7 Common words in English: ‘a’, ‘able’, ‘about’, ‘across’, ‘after’, ‘all’, ‘almost’, ‘also’, ‘am’, ‘among’, ‘an’, ‘and’, ‘any’, ‘are’, ‘as’, ‘at’, ‘be’, ‘because’, ‘been’, ‘but’, ‘by’, ‘can’, ‘cannot’, ‘could’, ‘dear’, ‘did’, ‘do’, ‘does’, ‘either’, ‘else’, ‘ever’, ‘every’, ‘for’, ‘from’, ‘get’, ‘got’, ‘had’, ‘has’, ‘have’, ‘he’, ‘her’, ‘hers’, ‘him’, ‘his’, ‘how’, ‘however’, ‘i’, ‘if’, ‘in’, ‘into’, ‘is’, ‘it’, ‘its’, ‘just’, ‘least’, ‘let’, ‘like’, ‘likely’, ‘may’, ‘me’, ‘might’, ‘most’, ‘must’, ‘my’, ‘neither’, ‘no’, ‘nor’, ‘not’, ‘of’, ‘off’, ‘often’, ‘on’, ‘only’, ‘or’, ‘other’, ‘our’, ‘own’, ‘rather’, ‘said’, ‘say’, ‘says’, ‘she’, ‘should’, ‘since’, ‘so’, ‘some’, ‘than’, ‘that’, ‘the’, ‘their’, ‘them’, ‘then’, ‘there’, ‘these’, ‘they’, ‘this’, ‘tis’, ‘to’, ‘too’, ‘twas’, ‘us’, ‘wants’, ‘was’, ‘we’, ‘were’, ‘what’, ‘when’, ‘where’, ‘which’, ‘while’, ‘who’, ‘whom’, ‘why’, ‘will’, ‘with’, ‘would’, ‘yet’, ‘you’, ‘your’.