<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom">

  <title><![CDATA[business|bytes|genes|molecules]]></title>
  
  <link href="http://blog.deepaksingh.net/" />
  <updated>2013-02-09T09:32:18-08:00</updated>
  <id>http://blog.deepaksingh.net/</id>
  <author>
    <name><![CDATA[Deepak Singh]]></name>
    
  </author>
  <generator uri="http://octopress.org/">Octopress</generator>

  
  <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.feedburner.com/bbgm" /><feedburner:info xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" uri="bbgm" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><entry>
    <title type="html"><![CDATA[Repo of the week - Feb 9, 2013]]></title>
    <link href="http://blog.deepaksingh.net/repo-of-the-week-feb-9" />
    <updated>2013-02-09T09:14:00-08:00</updated>
    <id>http://blog.deepaksingh.net/repo-of-the-week-feb-9</id>
    <content type="html"><![CDATA[<p>So the <a href="http://blog.deepaksingh.net/categories/repooftheweek/"><em>Repo of the Week</em></a> didn&#8217;t quite pan out weekly, but I am going to keep the category going.</p>

<p>This week&#8217;s repo (well, pair of repos) comes to you courtesy of the <a href="http://dunnlab.org/">Dunn lab</a>.  The two repos are <a href="https://bitbucket.org/caseywdunn/biolite">biolite</a> and <a href="https://github.com/caseywdunn/agalma">agalma</a>.  What are these?</p>

<blockquote><p>BioLite is a bioinformatics framework written in Python/C++ that automates the collection and reporting of diagnostics, tracks provenance, and provides lightweight tools for building out customized analysis pipelines. It is distributed with Agalma, but can be used independently of Agalma.</p>

<p>Agalma is a de novo transcriptome assembly and annotation pipeline for Illumina data. Agalma is built on top of the BioLite framework. If you have downloaded Agalma+BioLite, the files that are specific to the Agalma pipeline are located in the agalma/ subdirectory.</p></blockquote>

<p>The authors have also made an <a href="http://aws.amazon.com/ec2">Amazon EC2</a> image available with Agalma and all its dependencies.  There is a <a href="https://github.com/caseywdunn/agalma/blob/master/TUTORIAL">tutorial</a> to get things working on EC2.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[More on GATK]]></title>
    <link href="http://blog.deepaksingh.net/more-on-gatk" />
    <updated>2013-01-28T06:47:00-08:00</updated>
    <id>http://blog.deepaksingh.net/more-on-gatk</id>
    <content type="html"><![CDATA[<p>I <a href="http://blog.deepaksingh.net/the-return/">returned to blogging</a> because of a need to rant about <a href="http://blog.deepaksingh.net/the-gatk-license/">newly announced GATK licensing</a>.  Well, this time I am going to let others rant since things have only taken a turn for the worse.</p>

<p>I noticed a <a href="https://twitter.com/BioMickWatson/status/293833882728534017">tweet</a> from <a href="https://twitter.com/biomickwatson">Mick Watson</a>, which led me to <a href="http://gatkforums.broadinstitute.org/discussion/2091/upcoming-changes-to-the-license-the-retirement-of-gatk-lite-by-v-2-4">this discussion</a> on GATK licensing.</p>

<p>You can read my original post, the discussion, or Mick Watson&#8217;s <a href="http://biomickwatson.wordpress.com/2013/01/28/gatk-why-it-matters/">blog post</a>. Having worked on the commercial side of scientific software for a good chunk of my career, I understand the commercial side and potential driving factors, but my complete distaste for academic/non-commercial use licensing is well known, and the GATK folks aren&#8217;t exactly handling this well.</p>

<p>I will add one thing.  There are some whom I respect, who point out that commercial entities add pretty GUIs and don&#8217;t add much value.  To that I say, that&#8217;s pretty much why commercial informatics software is hard.  Any company that isn&#8217;t really adding value is not going to succeed in the long run.  Let the market decide.  Your job as GATK is to create high-quality open source software that benefits science.  If companies create no value or minimize the value, it means the following in most cases:</p>

<ul>
<li>In time the company will go under because no one is deriving any value from it.  This is the usual case and hardly something to get concerned about</li>
<li>If the company is providing value then it&#8217;s a good thing.  In most cases, this will happen only if GATK is part of a much more comprehensive package or service that makes it easier for people to get stuff done</li>
<li>The onus is on the GATK devs and funders to figure out how to compete if they feel their work is being &#8220;trivialized&#8221;.  Competition is a good thing, even in pure open source code.  The problem seems to be that the Broad considers this their code as opposed to a community resource with a rich developer community.  Get the latter behind you and any trivialization by people building pretty GUIs goes out of the window because your community is going to do that for you if there is demand</li>
</ul>


<p>To cut a long story short, the Broad is not taking the right steps, but I don&#8217;t blame them per se.  Scientific software funding needs to evolve and the idea of community and broad developer outreach needs to evolve.  So as much as anything, I blame the system.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[On reproducibility]]></title>
    <link href="http://blog.deepaksingh.net/on-reproducibility" />
    <updated>2012-11-17T19:25:00-08:00</updated>
    <id>http://blog.deepaksingh.net/on-reproducibility</id>
    <content type="html"><![CDATA[<p>There is an interesting discussion on Titus&#8217; blog on <a href="http://ivory.idyll.org/blog/vms-considered-harmful.html">VMs and reproducibility</a>, including some great comments.  I&#8217;ve always considered VMs, especially those that can be deployed in the cloud, a convenience.  In other words, they make it easy for people to try and reproduce your work because you give it to them in a turnkey-type way. However, I&#8217;ve never felt that VMs were the optimal solution for doing science.  If you think about it, what do you need for good science?</p>

<ul>
<li>Access to the raw data and any other data sets associated with the science.</li>
<li>A description of the methods used in the research.  Ideally you should be able to use these methods and the data sets above to come up with the same results.</li>
<li>The code used to implement the methods above.</li>
<li>A list of dependencies and the execution environment.</li>
</ul>


<p>Is this a complete list?  I am sure if I think about it again the list may evolve, but it seems about right to me.  In the end you want to do three things: (1) see if you can replicate the work; (2) have enough information to reproduce it using your own code, in case you don&#8217;t like the actual implementation; and (3) evolve the science using existing work as a starting point.</p>

<p>What enables all this?  It&#8217;s open data, it&#8217;s open source and it&#8217;s programmability.  If you think of your infrastructure and your overall system programmatically, it&#8217;s a lot more elegant than a VM.  It&#8217;s not easy, but if you can use recipes and configure a system on the fly then you aren&#8217;t limited to a VM, but can dynamically generate the environment required, with the appropriate data sets and dependencies.  I&#8217;ve always said that data is a royal garden, but compute is a fungible commodity, and dynamic environments are super powerful tools that can enable really good science.  Unfortunately, they also require a level of skill that many scientists don&#8217;t have.</p>

<p>These are topics that <a href="http://www.greenisgood.co.uk/">Matt Wood</a> and I talk about a lot (see the two decks below for some ideas)</p>

<script async class="speakerdeck-embed" data-id="4f414190156b230022011477" data-ratio="1.33333333333333" src="http://blog.deepaksingh.net//speakerdeck.com/assets/embed.js"></script>




<script async class="speakerdeck-embed" data-id="506a2eec9cc368000202b6f6" data-ratio="1.33333333333333" src="http://blog.deepaksingh.net//speakerdeck.com/assets/embed.js"></script>


<p>Yes, it&#8217;s a very cloud-centric view of the world, but there is a reason we work where we do.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[My chem coach carnival]]></title>
    <link href="http://blog.deepaksingh.net/my-chem-coach-carnival" />
    <updated>2012-10-26T18:46:00-07:00</updated>
    <id>http://blog.deepaksingh.net/my-chem-coach-carnival</id>
    <content type="html"><![CDATA[<p><a href="https://plus.google.com/u/0/112933623429779400915/posts">Susan Baxter</a> <a href="https://plus.google.com/108883794732535350016/posts/QRCyNTAXTCn">blackmailed</a> me into writing this post, but it is actually an interesting one to write, since I am probably not the most likely person to write one for the <a href="http://justlikecooking.blogspot.com/2012/10/announcing-chem-coach-carnival.html">Chem Coach Carnival</a>.</p>

<p>I am a chemist by training.  Every degree (B.Sc., M.Sc. and Ph.D.) is in chemistry, but I am not a practicing chemist any more, and haven&#8217;t been for a very long time.  However, I do not have any regrets about the path I have taken.  In fact, I think my background in chemistry has helped me quite a bit.</p>

<p>Today, I am a <a href="http://www.linkedin.com/in/dsingh">Principal Product Manager</a> at Amazon Web Services.  There I work on Amazon EC2 instance platforms.  In other words, I spend a lot of my time on the server platform that powers EC2.  What does this have to do with chemistry?  Not much.  So why do I think Chemistry has a role to play in this?</p>

<p>After my B.Sc. in chemistry, I spent most of my Master&#8217;s and Ph.D. as a physical chemist/theoretical chemist.  That pretty much means that you have to be analytical, learn to work with others (who are often doing bench chemistry), and learn your way around computers.  A lot of what I have done in my professional career has been around software, computers and analytical thinking.  Your training as a chemist allows you to think about the fundamentals of a problem, about how to break problems down into their constituent parts, and best of all teaches you how to set up experiments.  I am not formally trained in software development, web services, data management or product management, so I definitely believe that my training as a chemist has helped me transition into all these non-chemistry roles over the years.</p>

<p>Moral of the story: Your career can take many paths, but your training as a chemist is going to stand you in good stead along those paths, and stories about lab explosions always come in handy at parties.</p>

<p>Oh, and happy chemistry week.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Titus makes my life easy]]></title>
    <link href="http://blog.deepaksingh.net/titus-makes-my-life-easy" />
    <updated>2012-10-18T19:20:00-07:00</updated>
    <id>http://blog.deepaksingh.net/titus-makes-my-life-easy</id>
    <content type="html"><![CDATA[<p>This is the second post in the short existence of this blog that starts with &#8220;Titus&#8221;.  Well there is a good reason.  In a <a href="http://ivory.idyll.org/blog/automated-testing-and-research-software.html">wonderful blog post</a> Titus pretty much nails my opinion on the matter of research software.  He writes</p>

<blockquote><p>I think this notion that research software is something special and deserving of some accomodation is so wrong that it&#8217;s hard to even address it intelligently. What, you think people at Google aren&#8217;t doing exploratory programming where they don&#8217;t know the answer already? You think Amazon customers don&#8217;t behave in unexpected ways? You think Facebook social network data mining is easy? The difference there is that companies have a direct economic incentive to solve these problems, and you don&#8217;t.</p></blockquote>

<p>And I completely agree with him on the excuses.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Scientific software and being customer centric]]></title>
    <link href="http://blog.deepaksingh.net/scientific-software-and-being-customer-centric" />
    <updated>2012-09-22T22:01:00-07:00</updated>
    <id>http://blog.deepaksingh.net/scientific-software-and-being-customer-centric</id>
    <content type="html"><![CDATA[<p>A lot of scientific software, and this is especially true in bioinformatics, is &#8220;open source&#8221; in some way or another.  That it seems the community doesn&#8217;t quite understand the value of open source is another matter and another post, so for the sake of this post, let&#8217;s assume it is.  Perhaps more importantly, a good chunk of the software used is developed by academia.  In my mind, this increases the bar on code quality and software stewardship.  Most importantly, developers of academic software need to think about their applications differently and funding agencies need to think about how they fund software development differently.</p>

<p>Under the assumption that the majority of code used to do scientific discovery originates in academia, the question to ask is, what responsibility does a scientific software developer have?  Should they think of their potential users as customers from the beginning or is that something that becomes important later in the process?  While in some open source academic projects, especially ones that have been developed from the ground up, a customer-centric approach seems to exist, in general it appears that much code is developed to get published or to get something out there to solve a particular problem.  Given the realities of scientific problems, I don&#8217;t believe you can assume on day 1 that your applications are going to find use in the broader community, but it is a safe assumption to make that for many applications that is the end goal.  The reality is that at first you might be the only one who ever uses the code, especially if it is being developed to solve a specific problem; then it might be your team, then other labs and collaborators, and ultimately a wider community.  This means that not only should scientific software developers take a step back and think about the potential scope of their project as it evolves, it means that funding agencies need to rethink how they fund software.</p>

<p>First, publishing software as papers needs to go away.  Algorithms should get published, novel architectures should get published.  Software should only be published as a note to aid discovery.  Funding agencies also need to recognize that funding new software projects for 3-5 years and expecting the developer to know the outcome at the beginning is short sighted.  Software evolves; features and scope evolve along the way. Three years is an eternity for a software project; five... I don&#8217;t have a word for how long that is.  Funders also need to recognize that there is a greater need for funding as a piece of software grows and is recognized by the community.  In a way that could be looked at as a return on investment.  The broader the reach and impact to science, the more successful the initial funding, but you need the concept of angel funding as well to get a project off the ground and see how it will evolve.  We also need to raise the bar.  Should new proposals be funded or should developers be encouraged to contribute to existing projects?  Since there doesn&#8217;t seem to be much emphasis on the latter, you see new applications being developed as opposed to getting funding to contribute to existing applications.</p>

<p>The problem with scientific software is more cultural than anything else.  As <a href="http://twitter.com/smbaxtersd">Susan Baxter</a> <a href="https://twitter.com/smbaxtersd/status/248163072999567360">tweeted</a></p>

<blockquote><p>bioinformatician still = PI mentality, not team-based or community</p></blockquote>

<p>Software development is different, it works at different time scales and it requires a different approach. Note that I am not talking about research code, but code that&#8217;s meant to be used over a period of time, at the least by multiple generations in your research group.  The change has to start within the community, but they aren&#8217;t going anywhere without funding agencies changing the incentives.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[repo of the week - Sept 8, 2012]]></title>
    <link href="http://blog.deepaksingh.net/repo-of-the-week-sept-8" />
    <updated>2012-09-08T15:58:00-07:00</updated>
    <id>http://blog.deepaksingh.net/repo-of-the-week-sept-8</id>
    <content type="html"><![CDATA[<p>I have been on a soapbox lately around programming and bioinformatics.  So I am going to try to find a random repo every week that I like and put it up here.  They will mostly be from Github, but that&#8217;s not a requirement.</p>

<p>Today&#8217;s <a href="https://github.com/fls-bioinformatics-core/genomics">repo</a> comes to you via the <a href="http://www.ls.manchester.ac.uk/">Faculty of Life Sciences</a> at the University of Manchester.  The repo consists of &#8220;Scripts, utilities and programs for genomic bioinformatics&#8221;, and contains scripts for a variety of genome informatics tasks.</p>

<p>This is the kind of repo that&#8217;s super useful.  For now there seems to be one person pushing code, so hopefully there will be more.  There are at least 2 forks, and reasonable <a href="https://github.com/fls-bioinformatics-core/genomics/graphs">activity</a>.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[RAID doesn't make you resilient]]></title>
    <link href="http://blog.deepaksingh.net/raid-doesnt-make-you-resilient" />
    <updated>2012-09-08T12:04:00-07:00</updated>
    <id>http://blog.deepaksingh.net/raid-doesnt-make-you-resilient</id>
    <content type="html"><![CDATA[<p>I&#8217;ve now heard at least a couple of people managing large life science repositories talk about resiliency and durability and mention that they have durability cause they use RAID.  That&#8217;s just a cringeworthy thing to hear.  I would hope that people managing core repositories know better.  To the best of my knowledge they do, but it is troubling.  I was reminded of the apparently lack of understanding in managing data by a tweet from <a href="https://twitter.com/adamkraut">Adam Kraut</a> earlier where he linked to a <a href="http://queue.acm.org/detail.cfm?id=2367378">paper</a> that talks about the challenges of maintaining file integrity.  In general, I recommend anyone in the world of informatics building large scale storage (or even small scale storage) check out James Hamilton&#8217;s <a href="http://perspectives.mvdirona.com/2009/10/17/JeffDeanDesignLessonsAndAdviceFromBuildingLargeScaleDistributedSystems.aspx">blog post</a> covering a talk by Jeff Dean on building large distributed systems <a href="http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf">pdf</a>.  The key is failure happens.  Between 1-5% of your disks are going to fail over the course of a year and 2-4% of your storage servers.  There are any number of reasons these could happen and they all have different failure rates.  Google has published work on their analysis of disk failure rates <a href="http://research.google.com/archive/disk_failures.pdf">pdf</a> <a href="http://storagemojo.com/2007/02/19/googles-disk-failure-experience/">analysis on Storage Mojo</a>.</p>

<p>Where am I going with this?  As the size of our storage systems in informatics increases, as we keep data around for longer, we need to take a deeper look at how we are managing our data, and not make naive assumptions.  Think about the tradeoffs you need to make between performance, availability and durability (and think through what durability means).  There are simple and creative ways of getting there (e.g. keeping a copy of a disk array in a friend&#8217;s lab in a different building), and a number of solutions (including some from my day job), but let&#8217;s not assume that RAID = durability.  In the end managing your data is less about the hardware and more about the operational processes and software sitting on top of the hardware.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Titus has a point]]></title>
    <link href="http://blog.deepaksingh.net/titus-has-a-point" />
    <updated>2012-09-03T16:19:00-07:00</updated>
    <id>http://blog.deepaksingh.net/titus-has-a-point</id>
    <content type="html"><![CDATA[<p>There seems to be a bit of a debate brewing in the bioinformatics community around code.  There have been a number of posts recently, including <a href="http://blog.deepaksingh.net/research-code/">my own</a>.  A recent entrant is a wonderful post by <a href="http://ivory.idyll.org/blog/anecdotal-science.html">Titus Brown</a>.  The concern that Titus raises, and that I see in many comments and discussions, is that a lot of computational science, at least in the life sciences, is very anecdotal and suffers from a lack of computational rigor, and there is an opaqueness that makes science difficult to reproduce (or replicate as Titus prefers).  I&#8217;ll let you read Titus&#8217; post for his reasoning and thought process.  My concern is where I think computational science is right now.  Maybe I am being too negative, but here&#8217;s what I think:</p>

<ul>
<li>We are accepting mediocrity and a non-open culture.  I crave a world of science full of gists and code thrown up on github.  Who knows how it might end up being useful, or end up fostering interesting collaboration.  But for whatever reason we aren&#8217;t ready to do that.</li>
<li>Actually I think I do know why.  The bioinformatics community is all too aware that the quality of our code is very substandard.  Even today, we don&#8217;t consider programming skills and computational literacy an essential requirement for biological research.  So we have way too many people writing poor code, even if it is code never meant to see the light of day.  My biggest concern is that this is driving shoddy science that we can&#8217;t trust.  There is a difference in the skillset required for an algorithm developer, and someone using computational techniques to analyze data.  The code bar for the latter should be a lot lot higher.</li>
<li>We have a cultural problem cause good hacking skills are not exactly the route to scientific success.</li>
</ul>


<p>A recent example was a case where I was encouraging someone to cite an application by pointing to its source, but others insisted on a paper (which was not even about that particular piece of code).  That&#8217;s just wrong.  We have to do better.  I am getting a little tired of excuses about time and a lack of funding. Yes, funding is important, and funding agencies need to realize that we need to encourage the right skill sets.  But we have to be responsible for the quality of science and the quality of our work.  Perhaps all that work hidden in our machines is good, but right now I don&#8217;t believe it.</p>

<p>Note that none of this is about software engineering.  There are software products, e.g. repositories, deployment infrastructure, visualization systems, that are different and have an even higher bar.  I am exclusively talking about the code we use to actually do exploratory research (good frameworks will make exploration a lot more effective, but that&#8217;s another post).</p>

<p><em>Update:</em> <a href="http://third-bit.com/">Greg Wilson</a> <a href="http://software-carpentry.org/2012/09/not-really-disjoint/">adds</a> to this discussion as well.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Research Code]]></title>
    <link href="http://blog.deepaksingh.net/research-code" />
    <updated>2012-08-26T18:56:00-07:00</updated>
    <id>http://blog.deepaksingh.net/research-code</id>
    <content type="html"><![CDATA[<p>Iddo Friedberg has an interesting post on <a href="http://bytesizebio.net/index.php/2012/08/24/can-we-make-research-software-accountable/">making research software accountable</a>.  While I have never been in academia since I left grad school, I have been around it through friends and my wife, and I am not sure I completely agree with the post.</p>

<p>He writes</p>

<blockquote><p>This practices of code writing for day-to-day lab research are therefore completely unlike anything software engineers are taught.</p></blockquote>

<p>and</p>

<blockquote><p>Research coding is not done with the purpose of being robust, or reusable, or long-lived in development and versioning repositories.</p></blockquote>

<p>Just reading these lines and seeing other issues I&#8217;ve seen with academic code makes me think of a few things.  (1) Scientific programmers are either poor programmers or lazy programmers.  That means that a lot of the reason scientific code is not robust or maintainable is that they don&#8217;t know how to write robust code.  (2) There seems to be an assumption that all code inside a software company is written with lots of time on hand and is user facing.  There is a lot of code that is created to generate metrics or analyze data, and is written in &#8220;can I get that answer in the next 4 hours&#8221; mode.</p>

<p>Perhaps more than anything what it brought to mind was &#8220;<a href="http://en.wikipedia.org/wiki/Technical_debt">technical debt</a>&#8221;.  For anyone that&#8217;s been around software, technical debt is a reality and there is always a tension between speed and debt.  The fact remains that debt catches up with you.  And then you are faced with all kinds of issues.  In the scientific world, I&#8217;ll call out specific examples of the impact of technical debt:</p>

<ul>
<li>You hack something together to get some preliminary data.  You are short on time, so you hard code some parameters, and along the way you forget that you did.  Guess what, that could result in scientific errors down the line cause you have bad parameters or you made a mistake in some algorithm that you fat fingered in your hurry.</li>
<li>Your code is lying around and gets picked up by someone else.  They make assumptions, the wrong ones.</li>
<li>You essentially have to reinvent the wheel often cause you don&#8217;t have quality reusable code, which also means that your research is going to take even longer.</li>
</ul>


<p>The fact is that every field has slice and dice code.  The better the quality of your programming the better the slicing and dicing.  The better your documentation, the more it goes from being something one person knows, to being part of the toolchest of an entire group.  I wonder if people would take such shortcuts with their lab protocols?</p>

<p>In the end, no amount of enforcement or procedure is going to help.  While there will always be a need to hack something up, and often, scientists need to become better programmers, and realize that code has impact on the quality of the science.  A few things that I do think will help.  Make programming more of a first class citizen.  Right now, it&#8217;s still thought of as this other thing.  The successful groups have proper software engineers doing the hard stuff, but the majority of scientists can barely script, forget thinking through smart ways of building pipelines or even hacking. The concept of &#8220;publishing&#8221; scientific code also needs to change.  It should be less about publishing papers and more about publishing code.  Just throw it up on github, even if no one else is ever going to use it.  If you are using a version control system, and there is no excuse not to use one, then pushing that out to Github or similar is trivial.</p>

<p>Let&#8217;s just stop using the &#8220;we don&#8217;t have time&#8221; excuse.  I don&#8217;t know too many graduate students who have more time pressure than an engineer or data scientist at a startup, where every minute counts and costs and people are wearing 10 hats.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Open Data begets cool]]></title>
    <link href="http://blog.deepaksingh.net/open-data-begets-cool" />
    <updated>2012-08-22T05:09:00-07:00</updated>
    <id>http://blog.deepaksingh.net/open-data-begets-cool</id>
    <content type="html"><![CDATA[<p>I am a huge fan of <a href="http://commoncrawl.org/">Common Crawl</a>. For those who don&#8217;t know, Common Crawl is a non-profit whose goal is to build and maintain an open crawl of the web.  Their hope is that with the availability of an open, high quality crawl, cool things will happen, e.g. Michael Nielsen&#8217;s how-to on <a href="http://www.michaelnielsen.org/ddi/how-to-crawl-a-quarter-billion-webpages-in-40-hours/">crawling 250 million web pages quickly and inexpensively</a>.  The thing that makes Common Crawl work is not just quality raw data.  They also provide <a href="https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set">JSON crawl metadata</a> in an S3 bucket, and an Amazon Machine Image to help users get up and running quickly.  The image includes a copy of the Common Crawl User Library, examples, and launch scripts that show users how to analyze the Common Crawl corpus using their own Hadoop cluster or Amazon Elastic MapReduce.</p>

<p>It is this complete picture, data + tools, and the easy availability of infrastructure to do so that make a project like Common Crawl so compelling.  When you have the infrastructure in place, the friction to do something interesting gets reduced sufficiently that there are enough smart people using the data that interesting things are inevitable.  With people like Michael and <a href="http://commoncrawl.org/twelve-steps-to-running-your-ruby-code-across-five-billion-web-pages/">Pete Warden</a> publishing great getting started posts, the barriers to entry for Common Crawl are essentially the cost of running a small cluster for a few hours.</p>

<p>I can think of a few life science data sets that would benefit from such an approach, e.g. data sets related to disease outbreaks, expression profiles, etc.  Data that can be analyzed and mashed up with other sources with minimal friction.  That would be awesome.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Data flow]]></title>
    <link href="http://blog.deepaksingh.net/data-flow" />
    <updated>2012-08-18T20:25:00-07:00</updated>
    <id>http://blog.deepaksingh.net/data-flow</id>
    <content type="html"><![CDATA[<p><a href="http://datasyndrome.com/">Russell Jurney</a> has a great blog post on the Hortonworks blog entitled <a href="http://hortonworks.com/blog/pig-as-connector-part-one-pig-mongodb-and-node-js/"><em>Pig as Hadoop Connector &#8230;</em></a>.  I&#8217;ve long been a fan of data flow-style approaches and <a href="http://pig.apache.org/">Pig</a> fits my mental model better than something like <a href="http://hive.apache.org/">Hive</a>.  The post does a great job explaining how you can move data through Hadoop, into MongoDB, and eventually turn the data into a web service (in this case via Node.js).  Such a workflow is a particularly nice fit for modern bioinformatics, especially in a high scale next-gen world.  MongoDB, with its document-based model and rich query syntax, is quite popular with the next-gen sequencing crowd, and I&#8217;ve started to see a lot more Hadoop, especially in commercial services that need to scale cost-effectively.</p>

<p>Biological data is a great fit for Mongo-style key-value stores.  In practice, I wonder how many people are using such pipelines, where they may use something like Hadoop to aggregate a large number of &#8220;events&#8221;.  In this case an event could be the output from a single experiment or pipeline.  Essentially you could just stream the output from all your pipeline runs into one or more Hadoop clusters that would do your aggregation and sorting, and then feed that into MongoDB or a similar K-V store.  From there, publishing the data as a service is a relatively simple step, and you can even quickly make it look pretty with something like <a href="http://twitter.github.com/bootstrap/">Bootstrap</a>.</p>
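
<p>As a rough sketch of that aggregation step, here is what grouping events into one document per key might look like.  Everything in it is an assumption for illustration: the field names, the sample values, and the choice of gene as the key are made up, and plain Python stands in for what would really be a Hadoop or Pig job.</p>

```python
from collections import defaultdict

# Hypothetical pipeline "events": one record per experiment or pipeline run.
# Field names and values are invented for illustration only.
events = [
    {"sample": "S1", "gene": "BRCA1", "reads": 120},
    {"sample": "S2", "gene": "BRCA1", "reads": 80},
    {"sample": "S1", "gene": "TP53", "reads": 45},
]


def aggregate(events):
    """Group events by gene, then reduce each group to one document."""
    # Grouping step: collect every event that shares the same key.
    groups = defaultdict(list)
    for event in events:
        groups[event["gene"]].append(event)

    # Reduce step: emit one document per key, in sorted key order.
    documents = []
    for gene in sorted(groups):
        runs = groups[gene]
        documents.append({
            "_id": gene,
            "total_reads": sum(run["reads"] for run in runs),
            "samples": sorted(run["sample"] for run in runs),
        })
    return documents


documents = aggregate(events)
```

<p>Each resulting dictionary is shaped like a document that could then be loaded into MongoDB or a similar store and served from there.</p>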

<p>The key message here is that we have unprecedented access to the kinds of tools that allow us to work with data flexibly at various scales, and, even better still, make results available to a broader set of users and developers via web services.  Sometimes it feels like there are too many tools to keep track of and learn, and to some extent that is true (pretty much the story of my life), but it&#8217;s a fun time to be a developer.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Platforms for citizen science]]></title>
    <link href="http://blog.deepaksingh.net/platforms-for-citizen-science" />
    <updated>2012-08-04T12:59:00-07:00</updated>
    <id>http://blog.deepaksingh.net/platforms-for-citizen-science</id>
    <content type="html"><![CDATA[<p>Nice article in the NY Times on <a href="http://www.nytimes.com/2010/12/28/science/28citizen.html?pagewanted=all">citizen science</a> (<a href="https://www.evernote.com/shard/s4/sh/5f54e39c-4177-44c7-a3ac-fd876c53a6ab/39c04c88d16f0f74c832ea946bfcbe47">or here</a>).  It presents a balanced view of citizen science, a topic I care about deeply.</p>

<script async class="speakerdeck-embed" data-id="4fa8b9b42edc72002200e5d7" data-ratio="1.3333333333333333" src="http://blog.deepaksingh.net//speakerdeck.com/assets/embed.js"></script>


<p>In the end, citizen science is many things.  It is a way to stimulate public interest and to collect data that would be difficult to gather without engaging the community, but perhaps most importantly, it allows the broader public to be engaged in science.  Of the many people participating in data collection, perhaps a few will actually do some analysis, and even fewer will end up pursuing science as more than a hobby.  That&#8217;s OK and that&#8217;s how it should be.</p>

<p>The key in my mind is to make sure we are developing and nurturing the frameworks that enable participation.  The <a href="http://zooniverse.org">Zooniverse</a> is a great example of making participation easy and fun.  <a href="http://fold.it">Foldit</a> is another model that makes participation fun and rewarding.  The current reach of the web makes such platforms very viable and very powerful.  Do all models and efforts need to work?  No, that is very difficult.  Is it OK to leave the hard science to the &#8220;experts&#8221;?  To an extent, that is a good model, but you never know who the experts really are, and assuming they sit in some laboratory is both limiting and naive. Holding back in areas where the broader community can do the work in chunks, just because we worry about quality, is only going to hold science back. The key once again is to make sure that the underlying platforms make participation easy, and also allow quality to be managed and filtered.  In the biological sciences, we haven&#8217;t quite seen a project like the Zooniverse, at least not to my knowledge.  Initial success has come from efforts that involve the broader scientific community and some hobbyists.  Over time, hopefully we will achieve the scale that the web enables and reach a wider set of people, not just scientists.  I am pretty sure folks like <a href="http://sulab.org/andrew-i-su-ph-d/">Andrew Su</a> are thinking of how to do exactly this.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[The GATK license]]></title>
    <link href="http://blog.deepaksingh.net/the-gatk-license" />
    <updated>2012-07-28T19:18:00-07:00</updated>
    <id>http://blog.deepaksingh.net/the-gatk-license</id>
    <content type="html"><![CDATA[<p>One of the catalysts for restarting the blog was the new <a href="http://www.broadinstitute.org/gatk/">GATK</a> license.  GATK is a great tool for the genomics community and has historically had an open (MIT) license.  However, with GATK 2.0, the license is moving to a <a href="http://gatk.vanillaforums.com/discussion/17/gatk-2-0-announcement">hybrid licensing model</a>.  Per the announcement:</p>

<blockquote><p>The complete GATK 2.0 suite will be distributed as a binary only, without source code for the newest tools. We plan to release the source code for these tools, but its unclear the timeframe for this. The GATK engine and programming libraries will remain open-sourced under the MIT license, as they currently are for GATK 1.0. The current GATK 1.0 tool chain, now called GATK-lite, will remain open-source under the MIT license and distributed as a companion binary to the full GATK binary. GATK-lite includes the original base quality score recalibrator (BQSR), indel realigner, unified genotyper v1, and VQSR v2.</p></blockquote>

<p>&#8230;</p>

<blockquote><p>GATK 2.0 is being released under a software license that permits non-commercial research use only. Until the beta ends and the full GATK 2.0 suite is officially launched, commercial activities should use the unrestricted GATK-lite version. In the fall we intend to release the full version of GATK 2.0. The full version will be free-to-use version for non-commercial entities, just like the beta. A commercial license will be required for commercial entities. This commercial version will include commercial-grade support for installation, configuration, and documentation, as well as long-term support for each commercial release.</p></blockquote>

<p>This is the wrong direction.  Mixed licensing has been the bane of chemistry codes for years, but seeing it in the genomics world, especially for something that started with a more permissive license, is a real step backward.  Others have commented on the potential reasons (commercialization, concern about use by dodgy DTC genomics sites), but all of those reasons are quite weak.</p>

<p>So why is this a mistake?  First, it shuts out those who may not be academics, but want to (a) do good science, and (b) contribute to good science.  What if I were a smart developer, perhaps at a small company, or working for myself?  Suddenly, not only is the code no longer available without a license, but my ability to contribute to improving the code is severely diminished.  Second, it betrays a lack of understanding of what open source means.  Yes, there are plenty of open core models, but GATK is not a company or commercial service. If it plans to be one, its developers should say so more clearly and spin off a company that does the work of developing products around an open source core.  This is neither here nor there, and all it does is get in the way of doing good science and writing good software.</p>

<p>In the end, this sets a terrible precedent.  The world of open source has lots of good models for monetizing software.  If that is the goal, it would be best to follow those models, or focus on providing quality services, but the non-commercial-entities-only model is a huge backward step.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[The Return]]></title>
    <link href="http://blog.deepaksingh.net/the-return" />
    <updated>2012-07-28T12:57:00-07:00</updated>
    <id>http://blog.deepaksingh.net/the-return</id>
    <content type="html"><![CDATA[<p>It&#8217;s been a while. When the original bbgm went down, I thought it would take a few days to bring it back online.  Days became weeks, weeks became months.  It&#8217;s been over a year since I last wrote a post, and strangely enough for a while there it felt good not to think about writing.  Life has been incredibly busy, especially once I moved into my <a href="http://www.linkedin.com/in/dsingh">current role</a>.  The little spare time available has been spent with family and indulging hobbies <a href="http://www.dualnatureofmatter.net">old</a> and <a href="http://photos.deepaksingh.net">new</a>.</p>

<p>For now, I have given up any illusions of trying to resurrect the original bbgm, but loss brings new opportunities, and this blog is that opportunity.  As always I will write about things I care about, especially science, which is a smaller part of my life than it has been in years.  As always, there will be limited writing about my day job, but there&#8217;s enough to write about in the world of science, technology, and product development.</p>

<p>The original bbgm ran on WordPress.  For a long time, I&#8217;ve wanted to switch to more static sites.  <a href="http://deepaksingh.net">deepaksingh.net</a> uses <a href="http://jekyllrb.com/">Jekyll</a> and <a href="http://www.dualnatureofmatter.net">dualnatureofmatter.net</a> uses <a href="http://nanoc.stoneship.org/">nanoc</a>.  This site uses <a href="http://octopress.org/">Octopress</a>, which is a blogging system on top of Jekyll, and is hosted on <a href="http://docs.amazonwebservices.com/AmazonS3/latest/dev/WebsiteHosting.html">Amazon S3</a>.  Oh, and here&#8217;s the new bbgm <a href="http://feeds.feedburner.com/bbgm">RSS feed</a>.</p>

<p>So yes, this is a reboot of bbgm. Whether it has any legs remains to be seen.</p>
]]></content>
  </entry>
  
</feed>
