<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>Donnie Berkholz's Story of Data</title>
	
	<link>http://redmonk.com/dberkholz</link>
	<description>Making sense out of information</description>
	<lastBuildDate>Thu, 16 May 2013 14:50:30 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/thestoryofdata" /><feedburner:info uri="thestoryofdata" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
		<title>DVCS doesn’t disenfranchise enterprise IT — it empowers it</title>
		<link>http://feedproxy.google.com/~r/thestoryofdata/~3/EXAO5DddgMw/</link>
		<comments>http://redmonk.com/dberkholz/2013/05/16/dvcs-doesnt-disenfranchise-enterprise-it-it-empowers-it/#comments</comments>
		<pubDate>Thu, 16 May 2013 13:57:54 +0000</pubDate>
		<dc:creator>dberkholz</dc:creator>
				<category><![CDATA[distributed-development]]></category>
		<category><![CDATA[social]]></category>

		<guid isPermaLink="false">http://redmonk.com/dberkholz/?p=1749</guid>
		<description><![CDATA[I was talking to Atlassian recently about its new release of Stash (a tool for internal corporate git forges), which just added forking and personal repos to its capabilities, and something counterintuitive occurred to me. Despite the common belief that distributed version control (DVCS) steals power away from corporate IT, I assert that the reality [...]]]></description>
				<content:encoded><![CDATA[<p>I was talking to Atlassian recently about its new release of Stash (a tool for internal corporate git forges), which just added forking and personal repos to its capabilities, and something counterintuitive occurred to me. <strong>Despite the common belief that distributed version control (DVCS) steals power away from corporate IT, I assert that the reality is in fact the opposite: <span style="text-decoration: underline;">DVCS returns visibility and control to central IT.</span></strong></p>
<p>For developers using centralized version-control tools like CVS and Subversion, their workstations are often a complete mess of source code and checkouts scattered all over the place like pigeon droppings. They&#8217;ve got multiple checkouts of different upstream branches, a bunch of separate checkouts for projects based on the same branch, and even multiple copies of files within each checkout (with nicely dated suffixes or .bak.1, .bak.2-style naming). Frankly, it&#8217;s a mess. And it&#8217;s got two huge problems:</p>
<ol>
<li><strong><span style="font-size: 13px;">This workflow sucks for developers. It&#8217;s easy to lose work and hard to track progress.</span></strong></li>
<li><strong><span style="font-size: 13px;">Central IT and project management have no idea what&#8217;s going on. They can&#8217;t track or control anything and they can&#8217;t promote better practices.</span></strong></li>
</ol>
<p>Although some leading-edge developers will be using a tool like git that interfaces with CVS and SVN so they can work in a distributed, offline fashion and commit locally, most won&#8217;t. And even those who do work like that still suffer from a number of downsides since the whole team and the main repo aren&#8217;t doing the same.</p>
<p>The problem is that management is simply scared by the word &#8220;forking.&#8221; They envision fragmentation, disappearing commits, and invisible work, when in fact <strong>all of those things are already happening</strong>. Using a tool like Stash or GitHub Enterprise internally will counterintuitively increase transparency into and control over an organization&#8217;s development practices, because people will push on a regular basis and use trackable forks within the context of the tool. In addition, there are the clear benefits of smaller commits (because they&#8217;re fast and easy), feature branching (because it&#8217;s fast and easy), and a history that enables better investigation of bugs (smaller commits make bisection easier).</p>
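<p>To make the bisection point concrete, here&#8217;s a minimal sketch of the idea behind <code>git bisect</code>: a binary search over history for the first bad commit. This is not git&#8217;s actual implementation. The function name, the 64-commit history, and the location of the bug are all made up for illustration, and the sketch assumes a linear history where everything before the culprit is good and everything from the culprit onward is bad.</p>
<pre>
```python
def first_bad_commit(commits, is_bad):
    """Return (first bad commit, number of checks) for a linear history.

    Assumes every commit before the culprit is good, the culprit and
    everything after it is bad, and the last commit is known to be bad.
    """
    low, high = 0, len(commits) - 1  # the culprit is always within [low, high]
    checks = 0
    while low != high:
        mid = (low + high) // 2
        checks += 1
        if is_bad(commits[mid]):
            high = mid       # culprit is mid or earlier
        else:
            low = mid + 1    # culprit is after mid
    return commits[low], checks


# Hypothetical history: 64 commits numbered 1..64, bug introduced in commit 42.
history = list(range(1, 65))
broken = set(range(42, 65))
culprit, checks = first_bad_commit(history, lambda commit: commit in broken)
print(culprit, checks)
```
</pre>
<p>Six checks are enough to search 64 commits, and the smaller the culprit commit is, the less code there is to read once the search lands on it.</p>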
<p><strong>In other words, using DVCS internally for corporate development improves the experience for both enterprise IT and individual software developers. </strong>The problem is that nobody seems to be telling that story to corporate IT.</p>
<p><strong style="color: #999999; font-size: 13px;">Disclosure</strong><span style="color: #999999; font-size: 13px;">: Atlassian is a client, and GitHub has been one.</span></p>
<div class="acc_license"><a href="http://creativecommons.org/licenses/by-nc-sa/3.0/"><img src="http://i.creativecommons.org/l/by-nc-sa/3.0/88x31.png" alt="by-nc-sa" /></a></div><!--<rdf:RDF xmlns="http://creativecommons.org/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><Work rdf:about=""><license rdf:resource="http://creativecommons.org/licenses/by-nc-sa/3.0/" /></Work><License rdf:about="http://creativecommons.org/licenses/by-nc-sa/3.0/"><requires rdf:resource="http://creativecommons.org/ns#Attribution" /><permits rdf:resource="http://creativecommons.org/ns#Reproduction" /><permits rdf:resource="http://creativecommons.org/ns#Distribution" /><permits rdf:resource="http://creativecommons.org/ns#DerivativeWorks" /><requires rdf:resource="http://creativecommons.org/ns#ShareAlike" /><prohibits rdf:resource="http://creativecommons.org/ns#CommercialUse" /><requires rdf:resource="http://creativecommons.org/ns#Notice" /></License></rdf:RDF>--><img src="http://feeds.feedburner.com/~r/thestoryofdata/~4/EXAO5DddgMw" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://redmonk.com/dberkholz/2013/05/16/dvcs-doesnt-disenfranchise-enterprise-it-it-empowers-it/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://redmonk.com/dberkholz/2013/05/16/dvcs-doesnt-disenfranchise-enterprise-it-it-empowers-it/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=dvcs-doesnt-disenfranchise-enterprise-it-it-empowers-it</feedburner:origLink></item>
		<item>
		<title>Gonzo video with PhoneGap’s Andre Charland and Brian LeRoux</title>
		<link>http://feedproxy.google.com/~r/thestoryofdata/~3/iNQ89RI7LvI/</link>
		<comments>http://redmonk.com/dberkholz/2013/05/14/gonzo-video-with-phonegaps-andre-charland-and-brian-leroux/#comments</comments>
		<pubDate>Tue, 14 May 2013 12:12:49 +0000</pubDate>
		<dc:creator>dberkholz</dc:creator>
				<category><![CDATA[adoption]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[mobile]]></category>

		<guid isPermaLink="false">http://redmonk.com/dberkholz/?p=1755</guid>
		<description><![CDATA[Last week I was at Adobe&#8217;s MAX conference in Los Angeles, where I grabbed some time with two of the key people behind PhoneGap, the incredibly popular framework for developing hybrid HTML5/native mobile apps. Over beers at Yard House (which has an outstandingly enthusiastic manager who loves craft and good beer), Andre Charland, Brian LeRoux, and [...]]]></description>
				<content:encoded><![CDATA[<p><iframe src="https://www.youtube-nocookie.com/embed/xkO2EUb0CuQ" height="315" width="560" allowfullscreen="" frameborder="0"></iframe></p>
<p>Last week I was at Adobe&#8217;s MAX conference in Los Angeles, where I grabbed some time with two of the key people behind PhoneGap, the incredibly popular framework for developing hybrid HTML5/native mobile apps. Over beers at Yard House (which has an outstandingly enthusiastic manager who loves craft and good beer), <strong>Andre Charland, Brian LeRoux, and I discuss PhoneGap, Adobe, craft and beer, the connection between designers and developers, and more.</strong></p>
<p>I only had my phone on me at the bar, so this whole thing was recorded on a Nexus 4.</p>
<p>Some language may be NSFW, so watch at your peril.</p>
<p><strong style="color: #888888;">Disclosure</strong><span style="color: #888888;">: Adobe is a client and paid for my travel and hotel at Adobe MAX.</span></p>
<div class="acc_license"><a href="http://creativecommons.org/licenses/by-nc-sa/3.0/"><img src="http://i.creativecommons.org/l/by-nc-sa/3.0/88x31.png" alt="by-nc-sa" /></a></div><!--<rdf:RDF xmlns="http://creativecommons.org/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><Work rdf:about=""><license rdf:resource="http://creativecommons.org/licenses/by-nc-sa/3.0/" /></Work><License rdf:about="http://creativecommons.org/licenses/by-nc-sa/3.0/"><requires rdf:resource="http://creativecommons.org/ns#Attribution" /><permits rdf:resource="http://creativecommons.org/ns#Reproduction" /><permits rdf:resource="http://creativecommons.org/ns#Distribution" /><permits rdf:resource="http://creativecommons.org/ns#DerivativeWorks" /><requires rdf:resource="http://creativecommons.org/ns#ShareAlike" /><prohibits rdf:resource="http://creativecommons.org/ns#CommercialUse" /><requires rdf:resource="http://creativecommons.org/ns#Notice" /></License></rdf:RDF>--><img src="http://feeds.feedburner.com/~r/thestoryofdata/~4/iNQ89RI7LvI" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://redmonk.com/dberkholz/2013/05/14/gonzo-video-with-phonegaps-andre-charland-and-brian-leroux/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://redmonk.com/dberkholz/2013/05/14/gonzo-video-with-phonegaps-andre-charland-and-brian-leroux/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=gonzo-video-with-phonegaps-andre-charland-and-brian-leroux</feedburner:origLink></item>
		<item>
		<title>DevOps and cloud: A view from outside the Bay Area bubble</title>
		<link>http://feedproxy.google.com/~r/thestoryofdata/~3/v2xGkddpNR4/</link>
		<comments>http://redmonk.com/dberkholz/2013/05/03/devops-and-cloud-a-view-from-outside-the-bay-area-bubble/#comments</comments>
		<pubDate>Fri, 03 May 2013 17:39:34 +0000</pubDate>
		<dc:creator>dberkholz</dc:creator>
				<category><![CDATA[cloud]]></category>
		<category><![CDATA[devops]]></category>
		<category><![CDATA[open-source]]></category>

		<guid isPermaLink="false">http://redmonk.com/dberkholz/?p=1750</guid>
		<description><![CDATA[I saw two starkly different worlds of IT almost side-by-side last week, thanks to the absurdities of airline pricing, and it illustrated very clearly the contrast between how we perceive the world in our Bay-Area–centric bubble and how the world really is. First, I spent some time at Amazon&#8217;s AWS Summit in San Francisco, where [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://dberkholz-media.redmonk.com/dberkholz/files/2013/05/devops_cloud_venn_diagram.png"><img class="wp-image-1752 aligncenter" alt="devops_cloud_venn_diagram" src="http://dberkholz-media.redmonk.com/dberkholz/files/2013/05/devops_cloud_venn_diagram.png" width="431" height="269" /></a></p>
<p>I saw two starkly different worlds of IT almost side-by-side last week, thanks to the absurdities of airline pricing, and it illustrated very clearly<strong> the contrast between <span style="text-decoration: underline;">how we perceive the world</span> in our Bay-Area–centric bubble and <span style="text-decoration: underline;">how the world really is</span></strong>.</p>
<p>First, I spent some time at Amazon&#8217;s <a href="http://aws.amazon.com/aws-summit-2013/san-francisco/">AWS Summit</a> in San Francisco, where Amazon was pushing best practices at the bleeding edge of tech to one of the most technically sophisticated communities on the planet. Following that, I spent a day at <a href="http://devopsdays.org/events/2013-austin/">DevOpsDays</a> in Austin, Texas, en route to my home in Minnesota. (For some reason this was hundreds of dollars cheaper than a direct flight.)</p>
<p>In the Bay Area, I saw the same thing that&#8217;s endemic to the area. <strong>There&#8217;s a clear best way to do things, pretty much everyone is aware of it, and that&#8217;s what everyone does.</strong> Thanks to the heavy startup presence, there&#8217;s much less inertia in terms of existing cultures or infrastructure, so changes are easier. When you&#8217;ve got a next-door neighbor doing something amazing, it&#8217;s very hard to resist the peer pressure and the local culture, so everyone&#8217;s doing The Right Thing™. Very similar things hold true in the open-source world, where neighbors may be virtual but they&#8217;re still highly visible.</p>
<p>In Austin, it was an entirely different story. I saw yet another example of how the rest of the IT world, at least in this country, lives. I&#8217;ve seen it in places like Minnesota, Maine, and Oregon. It&#8217;s a world where trendy software vendors and startups don&#8217;t represent any meaningful part of the tech community, where businesses mostly don&#8217;t yet realize that <a href="http://online.wsj.com/article/SB10001424053111903480904576512250915629460.html">software is eating the world</a>. It&#8217;s a world where <strong>inertia rules the day, </strong>where<strong> business is king </strong>and<strong> sysadmins have little to no say</strong> in major changes. And it&#8217;s a world where even <strong>experimentation is difficult</strong> and must be done on the smallest of scales.</p>
<p>What happens in places like this? Let&#8217;s call it Everytown, USA. In Everytown, IT departments can&#8217;t afford to build a new infrastructure from scratch using Puppet or Chef in the cloud. They don&#8217;t have the freedom to do it externally or the resources to implement a private cloud internally.</p>
<p>Even at a conference like DevOpsDays Austin, if you <strong>ask people what they&#8217;re actually doing today</strong>, most of the time it has little to no resemblance to how a new Bay Area startup would set up its infrastructure. Don&#8217;t ask them about their plans; those are often so ambitious as to be unusable. Maybe they&#8217;re maintaining cloud instances by hand in AWS, or maybe they&#8217;re slowly migrating a large datacenter full of <a href="http://www.gregarnette.com/blog/2012/05/cloud-servers-are-not-our-pets/">pets</a> to configuration management, which they&#8217;ve been working on for the past five years. If they&#8217;re open-source fans, chances are they&#8217;re running Nagios and have a huge collection of Nagios-related infrastructure that would need serious, dedicated effort to shift to anything different.</p>
<p>More modern shops could have migrated most or all of their servers to tools like Puppet or Chef, so everything&#8217;s at least under configuration management and thus documented and reproducible. But in many cases, this is for datacenter use only, either true on-prem or in a colo. Gaining the capacity, budget, and permission to even migrate to private cloud is impossible for many companies, and it could be that way for a while.</p>
<p>You can see the same thing at conferences for larger enterprise vendors like IBM. When I talked to attendees at IBM&#8217;s <a href="http://www-01.ibm.com/software/tivoli/pulse/">Pulse</a> conference this spring, most of them were in exactly the situation I&#8217;ve described. IBM&#8217;s jump into both cloud and DevOps will make a significant difference to the adoptability of both in many places; it&#8217;s like a stamp of approval that these things are really ready for the enterprise.</p>
<p>&#8220;Shadow IT&#8221; developers outside the purview of IT-controlled infrastructure, on the other hand, often don&#8217;t have or don&#8217;t want to develop the expertise to learn DevOps philosophies and approaches.<strong> Developers may well be working in the cloud</strong>, but chances are they aren&#8217;t running tools like Puppet or Chef, and they don&#8217;t have any monitoring set up. They may hack things around by hand and hope everything doesn&#8217;t break too often, or they may outsource the infrastructure to somewhere external and run in a PaaS.</p>
<p>IT shops like this may be aware that better ways exist and they may have ambitions of going there, someday. The Bay Area view of the right infrastructure is always going to be years away for the rest of us; we even put William Gibson&#8217;s quote regarding this on our website:</p>
<blockquote><p>The future is already here, it&#8217;s just unevenly distributed.</p></blockquote>
<p>&nbsp;</p>
<p><span style="color: #ff0000;"><strong>Update (5/5/13)</strong>: Of course this is a generalization of reality, which is always more complex than a single answer at either end of the spectrum. I&#8217;ve just simplified it to communicate the overall points, which remain true regardless of the details. Reality looks like a distribution on both ends — but <strong>the distribution is shifted</strong>. I&#8217;m just talking about the most common cases within those distributions. There are clearly going to be some Bay Area companies with plenty of inertia, and some Everytown companies overflowing with cloud- and DevOps-based approaches. Even within a single company, there&#8217;s a distribution of approaches, with some areas more modern and others more legacy (heard of systems of engagement and systems of record?).</span></p>
<p><span style="color: #999999;"><strong>Disclosure</strong>: Amazon (AWS) and IBM are clients. Puppet Labs has been a client. Opscode and Nagios are not.</span></p>
<div class="acc_license"><a href="http://creativecommons.org/licenses/by-nc-sa/3.0/"><img src="http://i.creativecommons.org/l/by-nc-sa/3.0/88x31.png" alt="by-nc-sa" /></a></div><!--<rdf:RDF xmlns="http://creativecommons.org/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><Work rdf:about=""><license rdf:resource="http://creativecommons.org/licenses/by-nc-sa/3.0/" /></Work><License rdf:about="http://creativecommons.org/licenses/by-nc-sa/3.0/"><requires rdf:resource="http://creativecommons.org/ns#Attribution" /><permits rdf:resource="http://creativecommons.org/ns#Reproduction" /><permits rdf:resource="http://creativecommons.org/ns#Distribution" /><permits rdf:resource="http://creativecommons.org/ns#DerivativeWorks" /><requires rdf:resource="http://creativecommons.org/ns#ShareAlike" /><prohibits rdf:resource="http://creativecommons.org/ns#CommercialUse" /><requires rdf:resource="http://creativecommons.org/ns#Notice" /></License></rdf:RDF>--><img src="http://feeds.feedburner.com/~r/thestoryofdata/~4/v2xGkddpNR4" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://redmonk.com/dberkholz/2013/05/03/devops-and-cloud-a-view-from-outside-the-bay-area-bubble/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		<feedburner:origLink>http://redmonk.com/dberkholz/2013/05/03/devops-and-cloud-a-view-from-outside-the-bay-area-bubble/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=devops-and-cloud-a-view-from-outside-the-bay-area-bubble</feedburner:origLink></item>
		<item>
		<title>Musical chairs with open-source business models: Opscode and Tokutek</title>
		<link>http://feedproxy.google.com/~r/thestoryofdata/~3/alSYRijGpGI/</link>
		<comments>http://redmonk.com/dberkholz/2013/04/23/musical-chairs-with-open-source-business-models-opscode-and-tokutek/#comments</comments>
		<pubDate>Tue, 23 Apr 2013 16:59:32 +0000</pubDate>
		<dc:creator>dberkholz</dc:creator>
				<category><![CDATA[adoption]]></category>
		<category><![CDATA[devops]]></category>
		<category><![CDATA[open-source]]></category>

		<guid isPermaLink="false">http://redmonk.com/dberkholz/?p=1748</guid>
		<description><![CDATA[While everyone else is talking about API-related acquisitions (Mashery by Intel, Layer 7 by CA, now ProgrammableWeb by MuleSoft), I&#8217;m going to avoid the pack in this post and focus on some other underrated but interesting news that you should know about. A couple of changes in direction regarding open source came out in the [...]]]></description>
				<content:encoded><![CDATA[<p>While everyone else is talking about API-related acquisitions (<a href="http://readwrite.com/2013/04/17/intel-acquires-mashery">Mashery</a> by Intel, <a href="http://techcrunch.com/2013/04/22/ca-acquires-layer-7-technologies-to-connect-cloud-mobile-and-internet-of-things-as-api-market-starts-to-consolidate/">Layer 7</a> by CA, now <a href="http://gigaom.com/2013/04/23/api-turf-war-heats-up-as-mulesoft-buys-programmableweb/">ProgrammableWeb</a> by MuleSoft), I&#8217;m going to avoid the pack in this post and focus on some other underrated but interesting news that you should know about.</p>
<p>A couple of changes in direction regarding open source came out in the past few days, and they&#8217;ve gotten little coverage thus far, despite their fairly significant implications.</p>
<h2>Gambling on traction with open source</h2>
<p><a href="http://blogs.the451group.com/information_management/2011/04/06/what-we-talk-about-when-we-talk-about-newsql/">NewSQL</a> database provider Tokutek just went <a href="http://www.tokutek.com/2013/04/announcing-tokudb-v7-open-source-and-more/">open source</a> with its TokuDB v7 release yesterday. TokuDB is a MySQL/MariaDB storage engine based around an indexing data structure called <a href="http://www.tokutek.com/2012/12/fractal-tree-indexing-overview/">fractal trees</a>. What makes this move interesting?</p>
<p>For one, open-source NewSQL options are hard to come by. This is one market where open source isn&#8217;t yet table stakes, unlike NoSQL, so it does make companies stand out. VoltDB is <a href="http://voltdb.com/community/source-code.php">one</a> of very few OSS options, falling under the strongly copyleft <a href="http://en.wikipedia.org/wiki/Affero_General_Public_License">AGPLv3</a>. Tokutek went with GPLv2 for its engine (the same as MySQL), a slightly more permissive license in that you don&#8217;t need to provide source if it&#8217;s only available within a hosted service. Usefully, they also provided a <a href="https://github.com/Tokutek/ft-index/blob/master/README-TOKUDB">patent license</a> since that isn&#8217;t GPLv2&#8217;s strong point. This makes TokuDB newly interesting to service providers who want to incorporate an open-source NewSQL option into their products.</p>
<p>Secondly, it&#8217;s always interesting to look at the particular approach companies take to an OSS-centric model. In this case, <strong>it&#8217;s a combination of the classic models of support and proprietary add-ons</strong> (here, tools for backup and recovery), according to <a href="http://siliconangle.com/blog/2013/04/23/another-open-source-win-with-tokuteks-mysql-storage-engine/">SiliconAngle</a>. As going open source with your core product isn&#8217;t a transition that&#8217;s easy to step away from, it can be useful to take a piecemeal approach, as you determine where your customers find the real value.</p>
<h2>Maximizing the innovation window</h2>
<p>Opscode, on the other hand, is moving in a <a href="http://www.opscode.com/blog/2013/04/22/reflections-on-5-years-as-an-open-source-company/">more proprietary direction</a>. As Adam Jacob, Chief Customer Officer, wrote in a post on the past five years:</p>
<blockquote><p>One shift here is in the order of operations: before we wrote Chef, there was no Chef. We shipped the primitive first, then we built value (Hosted Chef and Private Chef) on top. As we move forward, <strong>we’ll shift to open sourcing new primitives after we build something cool on top of them</strong> that shows their power. [emphasis mine]</p></blockquote>
<p>This shift to an &#8220;open-source the infrastructure&#8221; approach after you&#8217;ve already built a beautiful facade on top is a significant change to a model that&#8217;s entirely about differentiating on top (a la GitHub, Facebook, Twitter, LinkedIn) rather than being what I would call a true open-source company. <strong>It gives Opscode a new monopoly</strong> on the time window between when they create a new piece of infrastructure and when they release the proprietary frosting on top. <strong>It also has a detrimental effect</strong> on a leading subset of users who prefer a more composable infrastructure, as we&#8217;re seeing now in the #monitoringsucks/#monitoringlove movement, and who will now be forced to wait for the core components until Opscode finishes building something on top of them. That said, <strong>much like the value of example code in SDKs, Adam is entirely right</strong> that building useful products on top of a core component will very clearly illustrate its value and some of its use cases.</p>
<p>So, two transitions: one shifting toward open, another shifting more closed. I&#8217;m looking forward to seeing what comes of both.</p>
<p><span style="color: #999999;"><em><strong>Disclosure</strong>: VoltDB is a client. GitHub has been a client. Opscode, Tokutek, Oracle (MySQL), MariaDB, Twitter, Facebook, and LinkedIn are not clients.</em></span></p>
<div class="acc_license"><a href="http://creativecommons.org/licenses/by-nc-sa/3.0/"><img src="http://i.creativecommons.org/l/by-nc-sa/3.0/88x31.png" alt="by-nc-sa" /></a></div><!--<rdf:RDF xmlns="http://creativecommons.org/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><Work rdf:about=""><license rdf:resource="http://creativecommons.org/licenses/by-nc-sa/3.0/" /></Work><License rdf:about="http://creativecommons.org/licenses/by-nc-sa/3.0/"><requires rdf:resource="http://creativecommons.org/ns#Attribution" /><permits rdf:resource="http://creativecommons.org/ns#Reproduction" /><permits rdf:resource="http://creativecommons.org/ns#Distribution" /><permits rdf:resource="http://creativecommons.org/ns#DerivativeWorks" /><requires rdf:resource="http://creativecommons.org/ns#ShareAlike" /><prohibits rdf:resource="http://creativecommons.org/ns#CommercialUse" /><requires rdf:resource="http://creativecommons.org/ns#Notice" /></License></rdf:RDF>--><img src="http://feeds.feedburner.com/~r/thestoryofdata/~4/alSYRijGpGI" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://redmonk.com/dberkholz/2013/04/23/musical-chairs-with-open-source-business-models-opscode-and-tokutek/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://redmonk.com/dberkholz/2013/04/23/musical-chairs-with-open-source-business-models-opscode-and-tokutek/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=musical-chairs-with-open-source-business-models-opscode-and-tokutek</feedburner:origLink></item>
		<item>
		<title>The size of open-source communities and its impact upon activity, licensing, and hosting</title>
		<link>http://feedproxy.google.com/~r/thestoryofdata/~3/bm44K32_IWw/</link>
		<comments>http://redmonk.com/dberkholz/2013/04/22/the-size-of-open-source-communities-and-its-impact-upon-activity-licensing-and-hosting/#comments</comments>
		<pubDate>Mon, 22 Apr 2013 17:31:40 +0000</pubDate>
		<dc:creator>dberkholz</dc:creator>
				<category><![CDATA[adoption]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[data-science]]></category>
		<category><![CDATA[licensing]]></category>
		<category><![CDATA[open-source]]></category>

		<guid isPermaLink="false">http://redmonk.com/dberkholz/?p=1729</guid>
		<description><![CDATA[Common (mis?)conception states that development practices, standards, and cultures vary broadly depending on the size of an open-source community. In general, we expect that many solo projects may lack the same level of QA and rigor as those with multiple developers due to constraints on time, varying experience levels, etc., and they may not even [...]]]></description>
				<content:encoded><![CDATA[<p><strong>Common (mis?)conception states that development practices, standards, and cultures vary broadly depending on the size of an open-source community.</strong> In general, we expect that many solo projects may lack the same level of QA and rigor as those with multiple developers due to constraints on time, varying experience levels, etc., and they may not even be intended for consumption by others. As communities grow slightly larger, projects that successfully recruited multiple contributors would likely tend to be higher-quality, on average, than those that failed to do so. In the largest open-source projects with tens or hundreds of contributors, we generally expect a fairly high level of quality, attention to detail, documentation, and so on.</p>
<p>Here, I&#8217;m going to dig into data from <a href="http://www.ohloh.net/">Ohloh</a>, which tracks a vast set of open-source software projects, to investigate some of the effects related to community size. I&#8217;ll look at a number of potentially connected variables centered around development activity, licensing choices, and hosting providers (GitHub, etc.).</p>
<p>As always, the caveats:</p>
<ul>
<li><span style="line-height: 13px;"><strong>This is only useful for active projects with active communities</strong>, because it contains only projects with commits during a 1-year period and members of the community must opt in to subscribe the project to Ohloh. This equates to 50,000+ projects, so it&#8217;s still a good-sized set.</span></li>
<li><strong>It is subject to any imperfections in Ohloh&#8217;s measurements</strong>, which is particularly relevant for license detection where it simply looks for strings in source files. It will miss any indirect references to licenses by name or URL. It also seems to miss some more obvious ones, which will set a lower bound on license discovery (but it should be independent of community size).</li>
<li><strong>In most cases, I&#8217;m ignoring the largest ~100 projects</strong> on Ohloh, which have ~150 or more committers per year, so these conclusions may not be generalizable to them. These tend to include well-known names like GNOME, Chrome, KDE, the Linux kernel, Mozilla, etc. There simply aren&#8217;t enough samples of projects of similar size to aggregate data for general, non-project-specific conclusions.</li>
</ul>
<p>To make these posts more easily readable, I&#8217;m going to try something new. All the methodology is now in the figure captions, so <strong>skip captions if you just want to read the <span style="text-decoration: underline;">what</span> and ignore the <span style="text-decoration: underline;">how</span>.</strong></p>
<h2>The size distribution of open-source communities</h2>
<p>Before looking at the impact of size, I first wanted to gain an understanding of how big free/open-source software (FOSS) communities were, and how many project communities there were at each size. Plotting the community size against the number of FOSS projects at each size produced the plot shown below:</p>
<div id="attachment_1730" class="wp-caption aligncenter" style="width: 357px"><a href="http://dberkholz-media.redmonk.com/dberkholz/files/2013/04/committer_histogram.png"><img class="wp-image-1730 " alt="committer_histogram" src="http://dberkholz-media.redmonk.com/dberkholz/files/2013/04/committer_histogram.png" width="347" height="329" /></a><p class="wp-caption-text">Ohloh data for projects active in the past year as of July 2012. Monthly data from the 30 days immediately prior to the Ohloh dump. The LOWESS fit shows a locally smoothed line in the noisier regions, using 1/8 of observations for smoothing each point. Not shown are the 92 projects with &gt;150 committers per year.</p></div>
<h4>Global features</h4>
<p>What I find interesting about the shape of the above graph is two things: the helpfully linear behavior on this type of plot, and the gap between monthly and annual contributors. When this type of plot appears linear, it indicates behavior supporting a set of statistical distributions including the <strong><a href="https://en.wikipedia.org/wiki/Power_law">power law</a> (the famous <a href="https://en.wikipedia.org/wiki/Pareto_principle">80-20 principle</a> stating that 80% of the effect comes from 20% of the causes)</strong>. In the below section on specific effects, I&#8217;ll show some numbers indicating the kind of behavior we see as a result of this.</p>
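<p>As a minimal sketch of why a straight line on log-log axes points toward a power law: if the number of projects with k committers follows c * k^(-alpha), then log10(count) = log10(c) - alpha * log10(k), which is linear in log10(k). The Python below uses synthetic counts (not the Ohloh data) and a hypothetical helper, <code>fit_loglog_slope</code>, to recover the exponent with an ordinary least-squares fit.</p>

```python
import math

def fit_loglog_slope(sizes, counts):
    """Least-squares fit of log10(count) against log10(size).

    Returns (slope, intercept). For a pure power law
    count = c * size**(-alpha), the slope equals -alpha.
    """
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic example: counts generated from an exact power law with alpha = 2.
sizes = [1, 2, 4, 8, 16, 32]
counts = [10000 / s ** 2 for s in sizes]
slope, intercept = fit_loglog_slope(sizes, counts)
print(round(slope, 3), round(10 ** intercept))
```

<p>On real, noisy histograms a fit like this is only suggestive, since several heavy-tailed distributions look roughly linear on the same axes.</p>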
<p><strong>One interesting question I wanted to answer here was the relationship between monthly committers and &#8220;expected&#8221; monthly committers based on the year-long figures</strong> — we can get this by dividing annual committers by 12. However, appearances can be deceiving. This graph actually can&#8217;t tell us anything about that because there&#8217;s zero connection between projects with 80 annual contributors and 80 monthly contributors. Instead, what we can do to get at this information is directly correlate monthly and annual contributors at the level of individual projects, which I&#8217;ll show later.</p>
<p>What can that gap tell us, then? As it turns out, it&#8217;s not equally sized throughout. There&#8217;s a very real and linear increase in that difference as community size increases from 1 to 35 committers (it&#8217;s too noisy above that), with the logarithmic difference increasing from 0.57 to 1.29 (each unit of 1.0 indicates a 10x increase). The higher slope for the monthly committers indicates a set of values with a tighter overall distribution that are (unsurprisingly) biased away from high numbers of committers, which are much more easily attained for a project in a year than a month. <strong>Put simply, it&#8217;s an expected result — you get more unique committers in a year than a month.</strong></p>
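<p>To put the logarithmic differences quoted above into plain multiples, each value can be raised as a power of 10. The helper name below is hypothetical; the two inputs are the 0.57 and 1.29 figures from the text.</p>

```python
def log_gap_to_ratio(log10_gap):
    """Convert a difference in log10 units into a multiplicative ratio."""
    return 10 ** log10_gap

small = log_gap_to_ratio(0.57)  # gap near 1 committer
large = log_gap_to_ratio(1.29)  # gap near 35 committers
print(f"{small:.1f}x to {large:.1f}x")  # 3.7x to 19.5x
```

<p>In other words, the two curves differ by a factor of roughly 4 at the small end and roughly 20 at 35 committers.</p>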
<h4>Specific effects</h4>
<p>The vast majority of projects are tiny, having just 1 or a few contributors. This is even more dramatic than it first appears if you look at the Y-axis, which is logarithmic rather than linear. From the full spreadsheet (embedded below), we can draw some more quantitative conclusions. On an annual level, <strong>just over half of active projects (51%) have only 1 contributor</strong>, while 19% have 2, 9% have 3, 5% have 4, and 3% have 5 (see the PDF column below). <strong>Overall, 87% of projects have 5 or fewer committers per year</strong> (see the CDF column). Looking from the opposite perspective, <strong>merely 1% of projects have 50 or more committers per year, and a scant 0.1% have 200 or more</strong> (see the Rev. CDF column).</p>
<div style="text-align: center;"><iframe src="https://docs.google.com/spreadsheet/pub?key=0AoKvP9o3WWBxdGs4eHY3Yk43UGlUR3IzakVRaUVRZWc&amp;single=true&amp;gid=0&amp;output=html&amp;widget=true" height="300" width="500" frameborder="0" align="center"></iframe></div>
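<p>For readers unfamiliar with the PDF, CDF, and Rev. CDF columns, here is a rough sketch of how they can be derived from a histogram of project counts. The function name is hypothetical, and the input is illustrative: sizes 1 through 5 reuse the rounded percentages quoted above, while the last bucket is a placeholder that lumps all larger projects together.</p>

```python
def distribution_columns(counts_by_size):
    """Return rows of (size, pdf, cdf, rev_cdf) from a {size: project_count} map.

    pdf:     share of projects with exactly `size` committers
    cdf:     share with `size` or fewer committers
    rev_cdf: share with `size` or more committers
    """
    total = sum(counts_by_size.values())
    rows = []
    running = 0
    for size in sorted(counts_by_size):
        count = counts_by_size[size]
        rev_cdf = (total - running) / total  # everything not yet counted
        running += count
        rows.append((size, count / total, running / total, rev_cdf))
    return rows

# Illustrative input: sizes 1-5 use the rounded percentages quoted above;
# the final bucket is a placeholder that lumps all larger projects together.
example = {1: 51, 2: 19, 3: 9, 4: 5, 5: 3, 6: 13}
for size, pdf, cdf, rev in distribution_columns(example):
    print(size, f"{pdf:.0%}", f"{cdf:.0%}", f"{rev:.0%}")
```

<p>The row for size 5 reproduces the 87% cumulative figure, which is simply 51 + 19 + 9 + 5 + 3.</p>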
<p>&nbsp;</p>
<h2>Contribution regularity is independent of community size</h2>
<p>To directly compare monthly and annual committers, we need to pull the numbers at the level of individual projects and create a plot based on them, rather than looking at two independent histograms on the same graph as we did above. If we do that and aggregate it into 25-project bins to ease visualization, then fit lines to them, we can produce a plot much like the below:</p>
<div id="attachment_1745" class="wp-caption aligncenter" style="width: 363px"><a href="http://dberkholz-media.redmonk.com/dberkholz/files/2013/04/monthly_vs_annual_contributors_custom2.png"><img class=" wp-image-1745  " alt="monthly_vs_annual_contributors_custom" src="http://dberkholz-media.redmonk.com/dberkholz/files/2013/04/monthly_vs_annual_contributors_custom2.png" width="353" height="332" /></a><p class="wp-caption-text">Data points were created for each 25 projects, and percentiles were calculated for each data point. The lines indicate an observation-weighted cubic-spline fit with a smoothing factor of 1/1000.</p></div>
<p>This is a variation on a <a href="http://en.wikipedia.org/wiki/Box_plot">box plot</a>, showing the median in thick black in addition to a number of percentiles to indicate the size of the distribution cores (25%–75%) and more extreme, non-outlier values (10%–90%). To interpret it, consider that medians represent the central values, while the thicker colored lines represent the &#8220;middle half&#8221; for each number of committers, and the thin colored lines represent nearly everything (the central 80%).</p>
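<p>The binning described in the caption can be sketched roughly as follows. This is an assumption-laden illustration, not the exact procedure behind the plot: the sort key, the interpolated percentile method, and both function names are my own choices, and the input pairs are randomly generated rather than taken from Ohloh.</p>

```python
import random

def percentile(sorted_values, pct):
    """Linearly interpolated percentile (0-100) of an already sorted list."""
    if len(sorted_values) == 1:
        return sorted_values[0]
    pos = (len(sorted_values) - 1) * pct / 100
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac

def binned_percentiles(projects, bin_size=25, pcts=(10, 25, 50, 75, 90)):
    """Sort (monthly, annual) pairs by monthly committers, then summarize
    the annual committers of each consecutive group of `bin_size` projects."""
    ordered = sorted(projects)
    rows = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        monthly = sorted(m for m, _ in chunk)
        annual = sorted(a for _, a in chunk)
        rows.append({
            "median_monthly": percentile(monthly, 50),
            "annual": {p: percentile(annual, p) for p in pcts},
        })
    return rows

# Synthetic (monthly, annual) committer pairs, for illustration only.
random.seed(0)
projects = []
for _ in range(100):
    monthly = random.randint(1, 20)
    projects.append((monthly, monthly + random.randint(0, 3 * monthly)))

for row in binned_percentiles(projects):
    print(row["median_monthly"], row["annual"])
```

<p>Each printed row corresponds to one data point on a plot of this kind: the median sets the thick line, the 25th and 75th percentiles bound the middle half, and the 10th and 90th bound the central 80%.</p>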
<p><strong>This plot shows a very clear typical range of annual committers, given a monthly number. Conversely, it could also be read the other way to suggest likely numbers of unique monthly committers, given an annual value.</strong></p>
<p>I next wanted to look more specifically at <strong>the relationship between a simplistic prediction of monthly committers and the actual monthly values.</strong> It&#8217;s based purely on dividing the annual committers by 12 months, which means that a ratio of 1 would equate to each contributor making commits during only 1 month each year.</p>
<div id="attachment_1741" class="wp-caption aligncenter" style="width: 342px"><a href="http://dberkholz-media.redmonk.com/dberkholz/files/2013/04/expected_monthly_contributors_custom.png"><img class=" wp-image-1741  " alt="expected_monthly_contributors_custom" src="http://dberkholz-media.redmonk.com/dberkholz/files/2013/04/expected_monthly_contributors_custom.png" width="332" height="332" /></a><p class="wp-caption-text">The ratio of expected monthly committers was generated from dividing 1/12 of annual committers by the monthly committers. Each semitransparent circle represents the median committer values of 25 data points, and darker colors indicate multiple overlapping circles.</p></div>
<p>Interestingly, while the data points distribute much more widely at lower committer counts (potentially due simply to larger populations), <strong>the trend remains near-linear and horizontal, going from a ratio of ~0.20 to ~0.25 as a function of community size</strong>. Values below 1 mean committers are making contributions during more than 1 month each year. In particular, if you multiply the ratio by 12 months, you get the average periodicity of someone&#8217;s contributions in months, so 0.25 * 12 = committing every 3 months, for a total of <strong>4 months of contributions each year from each committer in large projects, on average, and 5 months from small projects using the same math</strong>. While many developers will contribute more, enough will also contribute less to make the final numbers come out around 4–5 months in a remarkably consistent fashion.</p>
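<p>The arithmetic behind those figures can be written out as a short sketch. The function names and the example project (48 annual committers, 16 monthly) are hypothetical, chosen so that the ratio comes out to 0.25.</p>

```python
def expected_ratio(annual_committers, monthly_committers):
    """Ratio of the naive monthly estimate (annual / 12) to the observed value."""
    return (annual_committers / 12) / monthly_committers

def active_months_per_year(ratio):
    """A ratio of 1 means 1 active month per year, so months = 1 / ratio."""
    return 1 / ratio

# Hypothetical project: 48 unique committers in a year, 16 in a typical month.
ratio = expected_ratio(48, 16)
print(ratio, ratio * 12, active_months_per_year(ratio))  # 0.25 3.0 4.0
```

<p>Applying the same function to a ratio of 0.20 gives 5 active months per year, matching the small-project figure.</p>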
<p>An important take-home from this result is that <strong>smaller projects are proportionately nearly as likely as larger ones to receive drive-by commits or have relatively inactive developers on a monthly basis.</strong></p>
<h2>Larger communities tend to get more engagement</h2>
<p>To look for size effects at a finer-grained level than committers alone by looking at commits themselves, I took the ratio of monthly commits per committer and plotted it as a function of community size in the graph below. <strong>As the size of a project increases from 1 to ~10 developers, the median gradually doubles from ~5 to ~10 commits per committer, where it then holds steady as community size grows</strong> (beyond 20, the data become too noisy due to too few projects of that size).</p>
<div id="attachment_1743" class="wp-caption aligncenter" style="width: 350px"><a href="http://dberkholz-media.redmonk.com/dberkholz/files/2013/04/commits_per_committer.png"><img class="wp-image-1743 " alt="commits_per_committer" src="http://dberkholz-media.redmonk.com/dberkholz/files/2013/04/commits_per_committer.png" width="340" height="332" /></a><p class="wp-caption-text">Data points were created for each 25 projects, and percentiles were calculated for each data point. The lines indicate an observation-weighted cubic-spline fit with a smoothing factor of 1/1000.</p></div>
<p>A number of factors could explain this trend — for example:</p>
<ul>
<li>They receive or accept proportionately more drive-by patches that are credited to a committer rather than the patch contributor;</li>
<li>Everyone is more active due to an effect of community interactions or peer pressure;</li>
<li>They have a higher proportion of active committers, such as professional contributors who make more frequent commits;</li>
<li>Most smaller projects will never gain the traction to grow larger, but the larger a project is, the more likely it is to have gained or be in the process of gaining developer traction.</li>
</ul>
<p>However, the generally horizontal line at the 90th percentile (the peak around 7-8 appears to be an outlier due to some large projects with abnormally low committer levels that month) indicates that a subset of small communities do behave similarly to the larger ones. This suggests that it may be the last of these explanations.</p>
<h2>&#8220;Post-OSS&#8221; licensing practices are a big issue in smaller communities</h2>
<p>My eminent colleague James <a href="https://twitter.com/monkchips/status/247584170967175169">posted</a> this succinct and bluntly honest tweet last fall:</p>
<blockquote><p>younger devs today are about <strong>POSS</strong> &#8211; Post open source software. fuck the license and governance, just commit to github.</p></blockquote>
<p>Luis Villa, open-source lawyer and Friend of RedMonk, wrote an <a href="http://tieguy.org/blog/2013/01/27/taking-post-open-source-seriously-as-a-statement-about-copyright-law/">excellent post</a> following up on the topic, postulating that <strong>POSS behavior was an explicit rejection of permission-based culture</strong>. It&#8217;s easy to simply accept that this is happening, but as a scientist by training, I prefer to see whether there&#8217;s data to support or deny the assertion that licenses as a whole are growing less popular.</p>
<p>Ohloh is quite useful for licensing data because it goes beyond simply looking at COPYING, LICENSE or README files to directly examine the contents of each source file for strings found in licenses. While it&#8217;s undoubtedly imperfect because it looks directly for license <a href="https://github.com/blackducksw/ohcount/blob/master/src/licenses.c">strings</a> so may miss poorly worded or obscure references to licenses, that will simply set a baseline for detection. Any <strong>changes</strong> relative to that baseline will still be valuable.</p>
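<p>To make the limitation concrete, here is a deliberately naive sketch of string-based license detection. It is not Ohloh&#8217;s implementation; the three patterns and the function name are assumptions chosen purely for illustration.</p>

```python
import re

# Hypothetical patterns for illustration; a real detector needs many more.
LICENSE_PATTERNS = {
    "GPL": re.compile(r"GNU General Public License", re.IGNORECASE),
    "Apache-2.0": re.compile(r"Apache License,? Version 2\.0", re.IGNORECASE),
    "MIT": re.compile(r"Permission is hereby granted, free of charge", re.IGNORECASE),
}

def detect_licenses(source_text):
    """Return the sorted names of all licenses whose pattern appears in the text."""
    return sorted(name for name, pattern in LICENSE_PATTERNS.items()
                  if pattern.search(source_text))

header = '''/* This program is free software; you can redistribute it under
 * the terms of the GNU General Public License as published by ... */'''
print(detect_licenses(header))                      # ['GPL']
print(detect_licenses("See http://example.org/l"))  # [] since a bare URL is missed
```

<p>A file that only points to its license indirectly comes back empty, which is the kind of miss that sets the detection baseline discussed here.</p>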
<p>If we look at the percentage of active projects (1 commit in the past year) without licenses detected by Ohloh, it baselines around 20% for large projects, which one would hope embody best practices in open source. This is likely a combination of two factors, Ohloh&#8217;s detection ability and actual missing licenses (likely dominated by the former).</p>
<p>But once we start looking at the trends, that&#8217;s when things get interesting. Take a look at the graphs below:</p>
<div id="attachment_1746" class="wp-caption aligncenter" style="width: 477px"><a href="http://dberkholz-media.redmonk.com/dberkholz/files/2013/04/license_by_community_size_composite.png"><img class="wp-image-1746 " alt="Coloration indicates the number of projects for a given data point. Based on 56,090 Ohloh projects with available data on origination date, committer counts, and license. The observation-weighted cubic spline fit is used as implemented in gnuplot with a scale factor of 1/1000." src="http://dberkholz-media.redmonk.com/dberkholz/files/2013/04/license_by_community_size_composite.png" width="467" height="426" /></a><p class="wp-caption-text">Data points were created for each 25 projects, and percentiles were calculated for each data point. The lines indicate an observation-weighted cubic-spline fit with a smoothing factor of 1/1000.</p></div>
<p>I&#8217;ve classified project licensing into one of four categories: None, Copyleft, Permissive, or Limited (a.k.a. weak copyleft). Let&#8217;s walk through them in order.</p>
<p><strong>First, unlicensed projects would qualify for the POSS designation and are shown in the top left.</strong> This is the largest trend among all of the license types in terms of absolute license share, indicating the importance of thinking about it. <strong>When looking at monthly contributors, this trend flattens out around 15 committers at 20% of all active projects and stays flat well beyond the right edge of this graph, to at least 70 committers per month</strong> (after that point it&#8217;s too noisy). Regardless of whether this trend is due to a true rejection of permission-based culture, as Luis Villa suggests, or whether it&#8217;s a function of lack of licensing education, <strong>the shift from 50% unlicensed single-developer projects to below 25% unlicensed projects with 15 or more contributors cannot be ignored</strong>. My interpretation is that essentially no projects with ≥10 monthly contributors have licensing problems, while ~1/3 of one-developer projects do. The transition occurs in the middle. In other words, as projects grow, they tend to sort out any licensing issues, likely because they get corporate users, professional developers, etc.</p>
<p>Second, let&#8217;s look at copyleft licensing, the next-most-popular type.<strong> As a counterpart to the POSS trend, the use of copyleft licenses increases from ~20% to 35–40% around 15–20 monthly committers </strong>before the data get too noisy to draw any further conclusions. However, four of the five largest data points (25-project aggregates) hover around 45–50% copyleft, suggesting a potential upper limit that&#8217;s driven in part by the Linux kernel and Linux distributions, some of the largest collaborative projects around.</p>
<p>The lower two plots, <strong>permissive and limited (weak copyleft) licensing, show mild upward trends on an absolute scale</strong>. Permissive shows a small increase from ~20% to ~25%. Limited licenses, on the other hand, show a small increase from ~7–8% to ~11–12%. While small on an absolute scale, this modest-seeming trend indicates that limited licenses are roughly 50% more popular in larger communities than small ones.</p>
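<p>The &#8220;roughly 50%&#8221; figure is a relative change, which can be checked with a quick calculation. The helper name is hypothetical, and the inputs are the midpoints of the approximate ranges quoted above.</p>

```python
def relative_change(before, after):
    """Fractional change from `before` to `after`."""
    return (after - before) / before

# Midpoints of the ranges quoted above: ~7-8% and ~11-12%.
limited = relative_change(7.5, 11.5)
permissive = relative_change(20, 25)
print(f"limited: {limited:.0%}, permissive: {permissive:.0%}")
# limited: 53%, permissive: 25%
```

<p>So although both categories gain about 4 to 5 percentage points, the relative gain for limited licenses is about twice that for permissive ones.</p>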
<h2>Hosting providers generally do not support large communities well</h2>
<p>The other interesting data point I have is which code forge each active project (1 commit in the past year) is hosted at, so let&#8217;s examine the connection between code forges and community sizes. My expectations going in were that:</p>
<ul>
<li>Small communities would bias heavily toward GitHub, because it&#8217;s basically the center of open development today; and</li>
<li>Larger communities would likely tend to host independently, because they have more complex needs in terms of service heterogeneity and scale.</li>
</ul>
<div id="attachment_1735" class="wp-caption aligncenter" style="width: 484px"><a href="http://dberkholz-media.redmonk.com/dberkholz/files/2013/04/forge_by_community_size_composite1.png"><img class="wp-image-1735 " alt="forge_by_community_size_composite" src="http://dberkholz-media.redmonk.com/dberkholz/files/2013/04/forge_by_community_size_composite1-659x1024.png" width="474" height="737" /></a><p class="wp-caption-text">Dots indicate the integer medians of each 25 data points in order of committer size, semitransparent so that darker dots indicate multiple overlapping points. Otherwise, data source and spline fit as described previously.</p></div>
<h3>Small projects</h3>
<p>On an overall level, we can see a strong bias for small projects toward GitHub (~50%), while just under 20% opt for each of SourceForge and Google Code. The remaining ~10% largely choose to self-host, with the last few percent going to Launchpad and Bitbucket.</p>
<p>Two points worthy of note are that GitHub and Launchpad both show global peaks at a committer count higher than 1, indicating a break with the global trend in the first graph that the most common situation is a single-developer project. <strong>This could support the importance of a low barrier to entry for collaboration</strong>. Getting those first few developers beyond the founder tends to be incredibly difficult, and anything that makes that easier is a huge deal.</p>
<h4>Large projects</h4>
<p>The downhill trends are clear for SourceForge and Google Code, while Launchpad and Bitbucket appear to remain roughly flat. GitHub seems to have a slight downhill trend. Interestingly, <strong>scaling to the needs of larger projects turns out to be a major issue for the older forges (SourceForge and Google Code), but GitHub seems to have largely defeated it</strong>.</p>
<div>While it&#8217;s clear that the vast majority of the increase in self-hosting comes at the cost of share for SourceForge and Google Code,  it&#8217;s hard to attribute precise causes to it. Some of the likeliest possibilities are a lack of desired communication methods, a difficulty with the usability of the platform or collaboration on it, and a failure of the forge to scale effectively.</div>
<h2>Conclusions</h2>
<p>Once a project reaches 15–20 monthly contributors, it seems to behave much differently, on average, than smaller projects in a number of ways. In larger projects, committers tend to be more active as a whole, licensing tends to be better-determined, and they&#8217;re much more likely to be self-hosted. Very small communities make up the vast majority of the open-source world, however, so we need to pay close attention to what&#8217;s happening even on solo projects.</p>
<p><span style="color: #999999;"><em><strong>Disclosure</strong>: Black Duck Software (which runs Ohloh) and Atlassian (which runs Bitbucket) are clients. GitHub and Canonical (which runs Launchpad) have been clients. Dice (which runs SourceForge) and Google are not clients.</em></span></p>
<div class="acc_license"><a href="http://creativecommons.org/licenses/by-nc-sa/3.0/"><img src="http://i.creativecommons.org/l/by-nc-sa/3.0/88x31.png" alt="by-nc-sa" /></a></div><!--<rdf:RDF xmlns="http://creativecommons.org/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><Work rdf:about=""><license rdf:resource="http://creativecommons.org/licenses/by-nc-sa/3.0/" /></Work><License rdf:about="http://creativecommons.org/licenses/by-nc-sa/3.0/"><requires rdf:resource="http://creativecommons.org/ns#Attribution" /><permits rdf:resource="http://creativecommons.org/ns#Reproduction" /><permits rdf:resource="http://creativecommons.org/ns#Distribution" /><permits rdf:resource="http://creativecommons.org/ns#DerivativeWorks" /><requires rdf:resource="http://creativecommons.org/ns#ShareAlike" /><prohibits rdf:resource="http://creativecommons.org/ns#CommercialUse" /><requires rdf:resource="http://creativecommons.org/ns#Notice" /></License></rdf:RDF>--><img src="http://feeds.feedburner.com/~r/thestoryofdata/~4/bm44K32_IWw" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://redmonk.com/dberkholz/2013/04/22/the-size-of-open-source-communities-and-its-impact-upon-activity-licensing-and-hosting/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://redmonk.com/dberkholz/2013/04/22/the-size-of-open-source-communities-and-its-impact-upon-activity-licensing-and-hosting/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=the-size-of-open-source-communities-and-its-impact-upon-activity-licensing-and-hosting</feedburner:origLink></item>
		<item>
		<title>Quantifying the shift toward permissive licensing</title>
		<link>http://feedproxy.google.com/~r/thestoryofdata/~3/wUm1G_DXiYE/</link>
		<comments>http://redmonk.com/dberkholz/2013/04/02/quantifying-the-shift-toward-permissive-licensing/#comments</comments>
		<pubDate>Tue, 02 Apr 2013 17:51:14 +0000</pubDate>
		<dc:creator>dberkholz</dc:creator>
				<category><![CDATA[adoption]]></category>
		<category><![CDATA[data-science]]></category>
		<category><![CDATA[licensing]]></category>
		<category><![CDATA[open-source]]></category>

		<guid isPermaLink="false">http://redmonk.com/dberkholz/?p=1722</guid>
		<description><![CDATA[The team at Ohloh worked with me to organize a data hackfest at OSCON 2012, and we pulled together a great dataset that included licensing data for all open-source projects in Ohloh that had any commits in the past year. After working with Ohloh data for my recent post on language expressiveness, I wanted to [...]]]></description>
				<content:encoded><![CDATA[<p>The team at <a href="http://ohloh.net">Ohloh</a> worked with me to organize a data hackfest at OSCON 2012, and we pulled together a great dataset that included licensing data for all open-source projects in Ohloh that had any commits in the past year. After working with Ohloh data for my recent post on <a href="http://redmonk.com/dberkholz/2013/03/25/programming-languages-ranked-by-expressiveness/">language expressiveness</a>, I wanted to explore it in some different ways to see what else might emerge, and licensing seemed like one worth examining more deeply.</p>
<p><strong>My colleague Steve has posted about permissive vs copyleft licensing a <a href="http://redmonk.com/sogrady/2013/02/26/forking-permissive-licenses/">number</a> <a href="http://redmonk.com/sogrady/2012/02/15/decline-of-the-gpl/">of</a> <a href="http://redmonk.com/sogrady/2008/03/16/open-source-licensing-obsolete-or-of-importance/">times</a>, but we&#8217;ve never done quantitative research into licensing choice</strong> to prove the extent to which any shifts are happening, the time frames involved, and the potential variations within different programming-language communities.</p>
<h2>Approach: Classification, history, and languages</h2>
<p>Using the Ohloh data for 57,930 active projects as of July 2012, <strong>I classified the top 30 open-source licenses into one of three categories: permissive (e.g. BSD, Apache), limited (e.g. LGPL, MPL, EPL), or copyleft (e.g. GPL, AGPL).</strong> This three-category classification accounts for 90+% of all projects with specified licenses, which means it should be representative. The total number of classified projects was 17,549, because a vast number of projects either have no license or Ohloh was unable to detect it. Limited licensing is quite rare, hovering around 2%–3% of projects with licenses, so for the purposes of this post, we will focus on permissive and copyleft licensing.</p>
<p>To attempt to identify historical shifts, I separated projects into buckets based on the date of their first commit. Since license changes between permissive and copyleft are quite rare, this should be a reasonable approach to examining trends over time.</p>
<p>Since I hypothesized that programming language might also play a role, I further split each year&#8217;s bucket by language. Here, I&#8217;m going to focus on the 11 most popular languages according to our <a href="http://redmonk.com/sogrady/2013/02/28/language-rankings-1-13/">rankings</a>, as well as the total across all languages regardless of popularity. Any data points with 5 or fewer projects between permissive and copyleft are not shown, to remove noise.</p>
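<p>As a rough sketch of the approach described above (classify, bucket by first-commit year, then take a ratio), the following Python shows one way it could be done. The classification table contains only the example licenses named in the text, the function name and sample data are made up, and the cutoff mirrors the &#8220;5 or fewer projects&#8221; rule.</p>

```python
from collections import defaultdict

# Only the example licenses named in the text; the full top-30 mapping
# is not given in the post, so this table is deliberately incomplete.
LICENSE_CLASS = {
    "BSD": "permissive", "Apache": "permissive",
    "LGPL": "limited", "MPL": "limited", "EPL": "limited",
    "GPL": "copyleft", "AGPL": "copyleft",
}

def permissive_to_copyleft_ratio(projects, min_projects=6):
    """Map first-commit year to the permissive/copyleft ratio.

    `projects` is an iterable of (first_commit_year, license_name).
    Years with fewer than `min_projects` permissive-plus-copyleft projects,
    or with no copyleft projects at all, are skipped.
    """
    tallies = defaultdict(lambda: {"permissive": 0, "copyleft": 0})
    for year, license_name in projects:
        category = LICENSE_CLASS.get(license_name)
        if category in ("permissive", "copyleft"):
            tallies[year][category] += 1
    ratios = {}
    for year, tally in sorted(tallies.items()):
        total = tally["permissive"] + tally["copyleft"]
        if total >= min_projects and tally["copyleft"] != 0:
            ratios[year] = tally["permissive"] / tally["copyleft"]
    return ratios

# Made-up sample, not Ohloh data.
sample = ([(2005, "GPL")] * 6 + [(2005, "BSD")] * 3
          + [(2012, "Apache")] * 8 + [(2012, "GPL")] * 4
          + [(2009, "GPL")] * 2 + [(2009, "LGPL")] * 5)
print(permissive_to_copyleft_ratio(sample))  # {2005: 0.5, 2012: 2.0}
```

<p>In this toy sample, 2009 drops out because only 2 of its projects are permissive or copyleft. Splitting further by language would just mean keying the tallies on (year, language) instead of year.</p>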
<h2>Results: A clear trend toward permissiveness</h2>
<p>I&#8217;m showing the data as a ratio between permissive and copyleft licensing to account for changes in absolute numbers of projects over time. <strong>Any number above 1 indicates a bias toward permissive licensing, while any number below one indicates a bias toward copyleft.</strong></p>
<p style="text-align: center;"><a href="http://dberkholz-media.redmonk.com/dberkholz/files/2013/04/sort_license_class_by_year.png"><img class="wp-image-1723 aligncenter" alt="sort_license_class_by_year" src="http://dberkholz-media.redmonk.com/dberkholz/files/2013/04/sort_license_class_by_year-1024x964.png" width="442" height="416" /></a></p>
<p>&nbsp;</p>
<p><strong>Remarkably, every single language shows an upward trend, starting either in favor of copyleft or near equilibrium and shifting upward in a more permissive direction.</strong> The overall total, shown as a thick black line, further supports and clarifies this trend since the individual languages can be rather noisy.</p>
<p>Two languages of particular note are the two extremes: Ruby on the permissive side and Perl on the copyleft side. <strong>While most languages cluster relatively tightly, Ruby rises far above them with a very clear and strengthening shift toward permissive licensing: 2x in favor of permissive in 2010, 6x in 2011, and 11x in 2012.</strong> At the other extreme, <strong>Perl shows a roughly 2x–3x bias in favor of copyleft</strong>, which is distinctly below the nearest neighbor, C++, but not nearly as large a divergence from the primary cluster as Ruby shows.</p>
<p>Other than that, at the level of individual languages, it&#8217;s difficult to draw any strong conclusions based on their relative positions because they are much less distinct. More recent web-development languages (Ruby, JavaScript, Python) may bias toward permissiveness, as do languages that tend to be used on closed platforms (Obj-C, C#). The difference between Java, C, and C++ is likely cultural as well, with C and C++ being common in the copyleft community while Java is less so due to inertia from its OSS-unfriendly past.</p>
<h2>Conclusions</h2>
<p>The shift toward permissive open-source licensing is dramatic over the past decade. <strong>Since 2010, this trend has reached a point where permissive is more likely than copyleft for a new open-source project.</strong> Although there are language-specific effects, especially in the case of Ruby, the overall movement is clear. Outside the extremes, new projects in even the most copyleft-biased language (C++) in 2012 were given copyleft licenses less than 60% of the time.</p>
<p><span style="color: #999999;"><em><strong>Disclosure</strong>: Black Duck Software (which owns Ohloh) is a client.</em></span></p>
<div class="acc_license"><a href="http://creativecommons.org/licenses/by-nc-sa/3.0/"><img src="http://i.creativecommons.org/l/by-nc-sa/3.0/88x31.png" alt="by-nc-sa" /></a></div><!--<rdf:RDF xmlns="http://creativecommons.org/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><Work rdf:about=""><license rdf:resource="http://creativecommons.org/licenses/by-nc-sa/3.0/" /></Work><License rdf:about="http://creativecommons.org/licenses/by-nc-sa/3.0/"><requires rdf:resource="http://creativecommons.org/ns#Attribution" /><permits rdf:resource="http://creativecommons.org/ns#Reproduction" /><permits rdf:resource="http://creativecommons.org/ns#Distribution" /><permits rdf:resource="http://creativecommons.org/ns#DerivativeWorks" /><requires rdf:resource="http://creativecommons.org/ns#ShareAlike" /><prohibits rdf:resource="http://creativecommons.org/ns#CommercialUse" /><requires rdf:resource="http://creativecommons.org/ns#Notice" /></License></rdf:RDF>--><img src="http://feeds.feedburner.com/~r/thestoryofdata/~4/wUm1G_DXiYE" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://redmonk.com/dberkholz/2013/04/02/quantifying-the-shift-toward-permissive-licensing/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		<feedburner:origLink>http://redmonk.com/dberkholz/2013/04/02/quantifying-the-shift-toward-permissive-licensing/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=quantifying-the-shift-toward-permissive-licensing</feedburner:origLink></item>
		<item>
		<title>Coastal Africa: an up-and-coming force in software</title>
		<link>http://feedproxy.google.com/~r/thestoryofdata/~3/0SLnho9ADnA/</link>
		<comments>http://redmonk.com/dberkholz/2013/03/29/coastal-africa-an-up-and-coming-force-in-software/#comments</comments>
		<pubDate>Fri, 29 Mar 2013 14:56:58 +0000</pubDate>
		<dc:creator>dberkholz</dc:creator>
				<category><![CDATA[data-science]]></category>
		<category><![CDATA[employment]]></category>

		<guid isPermaLink="false">http://redmonk.com/dberkholz/?p=1716</guid>
		<description><![CDATA[As I was digging through Google Trends to check on some geographic trends related to my post ranking expressive languages, I came across intriguing data about Africa. It turns out that the eastern and western African coasts appear extremely interested in development, according to Google Trends. This is particularly true for Nigeria and Kenya in 2011–2012, as [...]]]></description>
				<content:encoded><![CDATA[<p>As I was digging through Google Trends to check on some geographic trends related to my post ranking <a href="http://redmonk.com/dberkholz/2013/03/25/programming-languages-ranked-by-expressiveness/">expressive languages</a>, I came across intriguing data about Africa. It turns out that <strong>the eastern and western African coasts appear extremely interested in development</strong>, according to <a href="http://www.google.com/trends/">Google Trends</a>. This is particularly true for <strong>Nigeria</strong> and <strong>Kenya</strong> in 2011–2012, as shown below for 2012.</p>
<p>If you look at a longer-term view of the full history of Google Trends from 2004 to present, other nearby countries show up as well, although lower-ranked: <strong>Uganda, Ethiopia, Zimbabwe, and Ghana</strong>. It&#8217;s likely no surprise to anyone that, outside of East and West Africa, <strong>South Africa</strong> (the country) also makes a strong showing, and <strong>Egypt</strong> appears to a lesser extent, visible on some of the maps but not on the top 10 of the lists. Here are the results for the search terms I used, &#8220;software engineering,&#8221; &#8220;programming languages,&#8221; &#8220;computer programming,&#8221; and &#8220;software development&#8221;:</p>
<p style="text-align: center;"><a href="http://dberkholz-media.redmonk.com/dberkholz/files/2013/03/africa2.png"><img class="size-full wp-image-1720 aligncenter" alt="africa" src="http://dberkholz-media.redmonk.com/dberkholz/files/2013/03/africa2.png" width="441" height="1010" /></a></p>
<p>&nbsp;</p>
<p>It&#8217;s <strong>further supported</strong> by Web-traffic data from <strong>Alexa</strong> showing programming popularity in parts of Africa, with GitHub being a <a href="http://www.alexa.com/siteinfo/github.com">popular site</a> in both <strong>South Africa</strong> and <strong>Nigeria</strong> (<a href="http://www.alexa.com/siteinfo/stackoverflow.com">the same</a> goes for Stack Overflow).</p>
<p>As another proxy for interest in software development and its future directions, we can look at<strong> website traffic to RedMonk.com.</strong> It shows the top 10 highest-traffic countries in Africa since 2009 as:</p>
<ol>
<li><strong>South Africa</strong></li>
<li><strong>Egypt</strong></li>
<li><strong>Kenya</strong></li>
<li><strong>Nigeria</strong></li>
<li>Morocco</li>
<li>Tunisia</li>
<li>Ghana</li>
<li>Algeria</li>
<li>Mauritius</li>
<li>Uganda</li>
</ol>
<p><strong>South Africa and Egypt alone account for more than half of the traffic</strong>, however, so the rest appear behind in this respect. Comparing 2012 with 2010, we&#8217;ve seen a <strong>13% increase</strong> in the proportion of our traffic coming from Africa, and none of that increase comes from Northern Africa (i.e. Egypt, Tunisia, Algeria) — it&#8217;s spread across the remaining regions. However, African traffic remains a quite small proportion of our overall traffic, hovering around 1% compared with our top 3 continents since 2009 at 50%, 32%, and 13%, so they won&#8217;t be taking over anytime soon. It&#8217;s roughly equivalent to the traffic from the #10-ranked US state.</p>
<p>If we look at another metric, that of <strong>LinkedIn members</strong> in any of the above African countries who match the job titles &#8220;software developer&#8221; or &#8220;software engineer,&#8221; we see very similar results (showing all countries with ≥50 results):</p>
<ol>
<li><span style="line-height: 13px;"><strong>Egypt</strong>: 4080</span></li>
<li><strong>South Africa</strong>: 2714</li>
<li><strong>Kenya</strong>: 688</li>
<li><strong>Nigeria</strong>: 597</li>
<li>Tunisia: 370</li>
<li>Mauritius: 231</li>
<li>Ghana: 176</li>
<li>Morocco: 167</li>
<li>Uganda: 153</li>
<li>Ethiopia: 144</li>
<li>Tanzania: 81</li>
<li>Zimbabwe: 80</li>
<li>Sudan: 70</li>
</ol>
<p>While those numbers will be smaller than the true developer population, an estimate in IEEE Spectrum <a href="http://spectrum.ieee.org/computing/it/the-african-hacker">suggested</a> roughly 200 full-time programmers in Ghana in 2005 compared to 176 on LinkedIn today, which suggests it&#8217;s not a completely unreasonable number. The correlation with hits to RedMonk.com suggests that these numbers, while perhaps not correct on an absolute scale, do reflect relative differences across Africa.</p>
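<p>One way to put a number on that agreement is a Spearman rank correlation between the two lists. The following is a minimal sketch, not the analysis behind this post: it uses only the nine countries that appear in both lists above, the helper name <code>spearman_rho</code> is made up for illustration, and the simple formula assumes there are no tied ranks.</p>

```python
# RedMonk.com traffic order, taken from the first list above.
traffic_order = [
    "South Africa", "Egypt", "Kenya", "Nigeria", "Morocco",
    "Tunisia", "Ghana", "Algeria", "Mauritius", "Uganda",
]

# LinkedIn member counts, taken from the second list above.
linkedin_counts = {
    "Egypt": 4080, "South Africa": 2714, "Kenya": 688, "Nigeria": 597,
    "Tunisia": 370, "Mauritius": 231, "Morocco": 167, "Ghana": 176,
    "Uganda": 153, "Ethiopia": 144, "Tanzania": 81, "Zimbabwe": 80,
    "Sudan": 70,
}


def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two rank dicts with the same keys, no ties."""
    n = len(rank_a)
    d_squared = sum((rank_a[name] - rank_b[name]) ** 2 for name in rank_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Keep only countries present in both lists, preserving traffic order.
shared = [name for name in traffic_order if name in linkedin_counts]
traffic_rank = {name: i + 1 for i, name in enumerate(shared)}

# Rank the same countries by LinkedIn count, largest first.
by_count = sorted(shared, key=lambda name: linkedin_counts[name], reverse=True)
linkedin_rank = {name: i + 1 for i, name in enumerate(by_count)}

rho = spearman_rho(traffic_rank, linkedin_rank)
print(len(shared), round(rho, 3))  # 9 0.867
```

<p>A value near 1 means the two orderings largely agree. With only nine countries this is a rough sanity check rather than a statistical result.</p>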
<p>From these two lists, we can see that <strong>African software development extends somewhat beyond eastern and western Africa to include a broader group of generally stable, coastal African countries</strong>, be it north, south, east, or west.</p>
<h2>What are they writing?</h2>
<p><strong>Language-specific searches</strong> of all <a href="http://redmonk.com/sogrady/2013/02/28/language-rankings-1-13/">RedMonk&#8217;s tier 1 and tier 2 languages</a> on Google Trends with the pattern &#8220;$LANGUAGE programming&#8221; showed that <strong>Java and C/C++ were the primary languages in use.</strong> In fact, they were the only ones to show any meaningful population on searches. C/C++ shows up in Kenya and South Africa, while Java shows up strongly in Kenya and Nigeria, more weakly in South Africa, and finally weakest in Egypt.</p>
<p>The country-level popularity above shows an interesting correlation with entries to a <a href="http://www.howwemadeitinafrica.com/african-software-developers-making-their-mark/7415/"><strong>World Bank</strong> software contest</a> on <strong>global development</strong>, a domain in which many Africans have a keen interest because it&#8217;s directly relevant to their lives (unlike many apps popular in San Francisco). The top submissions, in order, came from Uganda, Nigeria, Kenya, Ghana, South Africa, Niger, and Rwanda. Most interestingly, <strong>Africa had more submissions than any other continent.</strong></p>
<p>I would also expect that as the living standards and cost of living in places like China and India continue to increase, <strong>we may see more outsourcing move to Africa</strong>.</p>
<h2>Conclusion</h2>
<p>Minnesota, Colorado, and Virginia are peers to Africa on the basis of RedMonk.com traffic, and most software companies don&#8217;t ignore them. <strong>If you aren&#8217;t thinking about Africa, it&#8217;s time to start. It&#8217;s already as significant as a top-10 US state, and it&#8217;s just going to get bigger from here.</strong></p>
<p><span style="color: #ff0000;"><strong>Update (4/1/13)</strong>: <a href="https://github.com/kanaka">Joel Martin</a> pointed out that this data also shows a reasonable correlation with <a href="http://en.wikipedia.org/wiki/Internet_in_Africa"><span style="color: #ff0000;">Internet users in Africa</span></a>.</span></p>
<p><span style="color: #999999;"><em><strong>Disclosure</strong>: World Bank is not a client.</em></span></p>
<div class="acc_license"><a href="http://creativecommons.org/licenses/by-nc-sa/3.0/"><img src="http://i.creativecommons.org/l/by-nc-sa/3.0/88x31.png" alt="by-nc-sa" /></a></div><!--<rdf:RDF xmlns="http://creativecommons.org/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><Work rdf:about=""><license rdf:resource="http://creativecommons.org/licenses/by-nc-sa/3.0/" /></Work><License rdf:about="http://creativecommons.org/licenses/by-nc-sa/3.0/"><requires rdf:resource="http://creativecommons.org/ns#Attribution" /><permits rdf:resource="http://creativecommons.org/ns#Reproduction" /><permits rdf:resource="http://creativecommons.org/ns#Distribution" /><permits rdf:resource="http://creativecommons.org/ns#DerivativeWorks" /><requires rdf:resource="http://creativecommons.org/ns#ShareAlike" /><prohibits rdf:resource="http://creativecommons.org/ns#CommercialUse" /><requires rdf:resource="http://creativecommons.org/ns#Notice" /></License></rdf:RDF>--><img src="http://feeds.feedburner.com/~r/thestoryofdata/~4/0SLnho9ADnA" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://redmonk.com/dberkholz/2013/03/29/coastal-africa-an-up-and-coming-force-in-software/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://redmonk.com/dberkholz/2013/03/29/coastal-africa-an-up-and-coming-force-in-software/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=coastal-africa-an-up-and-coming-force-in-software</feedburner:origLink></item>
		<item>
		<title>Some external validation on expressive languages</title>
		<link>http://feedproxy.google.com/~r/thestoryofdata/~3/ZCdZ71OBHMQ/</link>
		<comments>http://redmonk.com/dberkholz/2013/03/26/some-external-validation-on-expressive-languages/#comments</comments>
		<pubDate>Tue, 26 Mar 2013 21:55:35 +0000</pubDate>
		<dc:creator>dberkholz</dc:creator>
				<category><![CDATA[adoption]]></category>
		<category><![CDATA[data-science]]></category>
		<category><![CDATA[employment]]></category>

		<guid isPermaLink="false">http://redmonk.com/dberkholz/?p=1714</guid>
		<description><![CDATA[I just got pointed to a really interesting and relevant data source by Ben Racine and wanted to post a short update to note the correlation of my post with a new piece of external information. The information? The input from ~2,500 developers over on Hammer Principle on the statement, &#8220;This language is expressive.&#8221; I [...]]]></description>
				<content:encoded><![CDATA[<p>I just got pointed to a really interesting and relevant data source by <a href="https://twitter.com/i3enhamin">Ben Racine</a> and wanted to post a short update to note <strong>the correlation of <a href="http://redmonk.com/dberkholz/2013/03/25/programming-languages-ranked-by-expressiveness/">my post</a> with a new piece of external information</strong>.</p>
<p>The information? The input from ~2,500 developers over on Hammer Principle on the statement, &#8220;<a href="http://hammerprinciple.com/therighttool/statements/this-language-is-expressive">This language is expressive.</a>&#8221; I mapped the top 10 and bottom 10 languages to my own median-based ranking, showing only the top two popularity tiers for simplicity, and got this:</p>
<p><a href="http://dberkholz-media.redmonk.com/dberkholz/files/2013/03/expressiveness_weighted_top_tiers.png"><img class="alignnone  wp-image-1715" alt="expressiveness_weighted_top_tiers" src="http://dberkholz-media.redmonk.com/dberkholz/files/2013/03/expressiveness_weighted_top_tiers-1024x450.png" width="553" height="243" /></a></p>
<p>&nbsp;</p>
<p>Interestingly, it&#8217;s a very clear correlation — all expressive at one end, all poorly expressive at the other end, and a mix in the middle (indicating a bit of noise).</p>
<p>What do you think?</p>
<div class="acc_license"><a href="http://creativecommons.org/licenses/by-nc-sa/3.0/"><img src="http://i.creativecommons.org/l/by-nc-sa/3.0/88x31.png" alt="by-nc-sa" /></a></div><!--<rdf:RDF xmlns="http://creativecommons.org/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><Work rdf:about=""><license rdf:resource="http://creativecommons.org/licenses/by-nc-sa/3.0/" /></Work><License rdf:about="http://creativecommons.org/licenses/by-nc-sa/3.0/"><requires rdf:resource="http://creativecommons.org/ns#Attribution" /><permits rdf:resource="http://creativecommons.org/ns#Reproduction" /><permits rdf:resource="http://creativecommons.org/ns#Distribution" /><permits rdf:resource="http://creativecommons.org/ns#DerivativeWorks" /><requires rdf:resource="http://creativecommons.org/ns#ShareAlike" /><prohibits rdf:resource="http://creativecommons.org/ns#CommercialUse" /><requires rdf:resource="http://creativecommons.org/ns#Notice" /></License></rdf:RDF>--><img src="http://feeds.feedburner.com/~r/thestoryofdata/~4/ZCdZ71OBHMQ" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://redmonk.com/dberkholz/2013/03/26/some-external-validation-on-expressive-languages/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://redmonk.com/dberkholz/2013/03/26/some-external-validation-on-expressive-languages/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=some-external-validation-on-expressive-languages</feedburner:origLink></item>
		<item>
		<title>What does “expressiveness” via LOC per commit measure in practice?</title>
		<link>http://feedproxy.google.com/~r/thestoryofdata/~3/ozeAQnx1EZE/</link>
		<comments>http://redmonk.com/dberkholz/2013/03/26/what-does-expressiveness-via-loc-per-commit-measure-in-practice/#comments</comments>
		<pubDate>Tue, 26 Mar 2013 18:48:37 +0000</pubDate>
		<dc:creator>dberkholz</dc:creator>
				<category><![CDATA[adoption]]></category>
		<category><![CDATA[data-science]]></category>
		<category><![CDATA[employment]]></category>

		<guid isPermaLink="false">http://redmonk.com/dberkholz/?p=1708</guid>
		<description><![CDATA[Yesterday&#8217;s post ranking the &#8220;expressiveness&#8221; of programming languages was quite popular. It got more than 30,000 readers in the first 24 hours; it&#8217;s at 31,302 as I write this. For this blog, that qualifies as a great audience. After a day&#8217;s worth of feedback, thought, and discussion on Twitter, Hacker News, and the post&#8217;s comments, [...]]]></description>
				<content:encoded><![CDATA[<p>Yesterday&#8217;s post ranking the <a href="http://redmonk.com/dberkholz/2013/03/25/programming-languages-ranked-by-expressiveness/">&#8220;expressiveness&#8221; of programming languages</a> was quite popular. It got more than 30,000 readers in the first 24 hours; it&#8217;s at 31,302 as I write this. For this blog, that qualifies as a great audience. After a day&#8217;s worth of feedback, thought, and discussion on <a href="https://twitter.com/search/realtime?q=http%3A%2F%2Fredmonk.com%2Fdberkholz%2F2013%2F03%2F25%2Fprogramming-languages-ranked-by-expressiveness%2F&amp;src=typd">Twitter</a>, <a href="https://news.ycombinator.com/item?id=5438755">Hacker News</a>, and the <a href="http://redmonk.com/dberkholz/2013/03/25/programming-languages-ranked-by-expressiveness/">post&#8217;s comments</a>, I wanted to sum up some of my thoughts, others&#8217; contributions, and things I left out of the initial post.</p>
<h2>What are we really measuring here?</h2>
<p>As I mentioned as a major caveat in the initial post, lines of code (LOC) per commit is an imperfect metric as a window into expressiveness. It&#8217;s measuring <strong>something</strong>, but what does it mean? My take on these results is that <strong>it&#8217;s a useful metric when painting with broad strokes, and the results seem to generally bear that out</strong>. It&#8217;s more helpful in comparing large-scale trends than arguing over whether Ruby should be #27 or #22, which is likely below the noise level. I think the reason some placements seem so weird is that <strong>it&#8217;s measuring expressiveness in practice rather than in theory</strong>. That brings in factors like:</p>
<ul>
<li><strong>The standard library and library ecosystem.</strong> Is there a weak standard library? Is there a small or nonexistent community of add-on library developers? In both cases, constructing a commit-worthy chunk of code could require additional lines.</li>
<li><strong>The development culture and its norms.</strong> Is copy-and-pasting common for this language? Are imported libraries often committed to the project repository (JavaScript is a prime candidate here)? Are autogenerated files committed (e.g., minified JavaScript, autotools configure scripts)?</li>
<li><strong>The developer population using it.</strong> Especially for <strong>third-tier languages</strong>, the number of developers is small enough that these results could reflect those developers more than the properties of the language itself. Some of the least-popular third-tier languages have fewer than 10 developers committing during a given month. I would generally disregard anything but the largest differences between third-tier languages, and treat even those with skepticism. Some languages are also more popular for <strong>beginning programmers</strong>, which could influence the results if the beginners make up a significant chunk of the language&#8217;s total userbase.</li>
<li><strong>The time frame of its initial popularity.</strong>  This can result in time-based influences upon tools and methodologies in use. For example, newer languages popularized in the <strong>agile</strong> and <strong>GitHub</strong> eras may tend to bias toward smaller, more frequent commits. Languages that grew up alongside <strong>waterfall</strong> development and slower, <strong>centralized version control</strong> may be biased more toward larger, monolithic commits. It even carries as far as things like <strong>line length</strong> — today, wide-screen monitors are common, and many developers no longer restrict their column width to 80 or less. This could have a language-specific impact, where older languages with a great deal of inertia change more slowly to a new &#8220;standard&#8221; of development. For example, perhaps fixed-format Fortran wasn&#8217;t typically maintained in version control at all, and full files were just committed wholesale? That could explain its similarity to JavaScript.</li>
<li><strong>Differences in project types by language.</strong> If a language is more likely to be used in <strong>larger,</strong> <strong>enterprise</strong> projects, this could influence the types of commits it receives. For example, it could get more small bugfixes than new features because it&#8217;s a long-lived codebase and requires additional stability. It could also see a different level of refactoring.</li>
</ul>
<h2>So &#8230; what should you get out of the results, then?</h2>
<p>Frankly, given all the possible variables involved, <strong>the biggest surprise here is that the results look as reasonable as they do</strong>, at the level of broad, multi-language or cross-tier trends. Here&#8217;s what I would tend to believe, and what I would be skeptical about.</p>
<ul>
<li><strong>Believe</strong>: multi-language trends</li>
<li><strong>Believe</strong>: cross-tier trends</li>
<li><strong>Believe</strong>: large differences between individual languages, but <strong>investigate</strong> why</li>
<li><strong>Believe</strong>: highly-ranked languages</li>
<li><strong>Be skeptical</strong>: anything involving third-tier languages</li>
<li><strong>Be skeptical</strong>: small differences between individual languages</li>
<li><strong>Be skeptical</strong>: individual languages that don&#8217;t fit into a group of similar ones</li>
<li><strong>Be skeptical</strong>: low-ranked languages, until <strong>investigated</strong></li>
</ul>
<p>Why do I suggest believing high ranks but not low ones? It&#8217;s the Anna Karenina principle, as Tolstoy wrote:</p>
<blockquote><p><i>Happy families are all alike; every unhappy family is unhappy in its own way.</i></p></blockquote>
<p><strong>While there are a large number of ways to have a high median or high IQR, it seems to me that low values of both would indicate a number of good development practices in addition to a good language.</strong></p>
<p>To wrap things up, I think this is measuring, with a fair amount of noise, a form of expressiveness in practice rather than in theory — a form that includes all the ways code is incorporated into a repository. That makes it an interesting window into a number of potential problems with how specific languages as well as language classes are typically used.</p>
<div class="acc_license"><a href="http://creativecommons.org/licenses/by-nc-sa/3.0/"><img src="http://i.creativecommons.org/l/by-nc-sa/3.0/88x31.png" alt="by-nc-sa" /></a></div><!--<rdf:RDF xmlns="http://creativecommons.org/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><Work rdf:about=""><license rdf:resource="http://creativecommons.org/licenses/by-nc-sa/3.0/" /></Work><License rdf:about="http://creativecommons.org/licenses/by-nc-sa/3.0/"><requires rdf:resource="http://creativecommons.org/ns#Attribution" /><permits rdf:resource="http://creativecommons.org/ns#Reproduction" /><permits rdf:resource="http://creativecommons.org/ns#Distribution" /><permits rdf:resource="http://creativecommons.org/ns#DerivativeWorks" /><requires rdf:resource="http://creativecommons.org/ns#ShareAlike" /><prohibits rdf:resource="http://creativecommons.org/ns#CommercialUse" /><requires rdf:resource="http://creativecommons.org/ns#Notice" /></License></rdf:RDF>--><img src="http://feeds.feedburner.com/~r/thestoryofdata/~4/ozeAQnx1EZE" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://redmonk.com/dberkholz/2013/03/26/what-does-expressiveness-via-loc-per-commit-measure-in-practice/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		<feedburner:origLink>http://redmonk.com/dberkholz/2013/03/26/what-does-expressiveness-via-loc-per-commit-measure-in-practice/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=what-does-expressiveness-via-loc-per-commit-measure-in-practice</feedburner:origLink></item>
		<item>
		<title>Programming languages ranked by expressiveness</title>
		<link>http://feedproxy.google.com/~r/thestoryofdata/~3/9nyYJ8Uo8TE/</link>
		<comments>http://redmonk.com/dberkholz/2013/03/25/programming-languages-ranked-by-expressiveness/#comments</comments>
		<pubDate>Mon, 25 Mar 2013 18:35:16 +0000</pubDate>
		<dc:creator>dberkholz</dc:creator>
				<category><![CDATA[adoption]]></category>
		<category><![CDATA[data-science]]></category>
		<category><![CDATA[employment]]></category>

		<guid isPermaLink="false">http://redmonk.com/dberkholz/?p=1695</guid>
		<description><![CDATA[Is it possible to rank programming languages by their efficiency, or expressiveness? In other words, can you compare how simply you can express a concept in them? One proxy for this is how many lines of code change in each commit. This would provide a view into how expressive each language enables you to be in [...]]]></description>
				<content:encoded><![CDATA[<p><strong>Is it possible to rank programming languages by their efficiency, or expressiveness?</strong> In other words, can you compare how simply you can express a concept in them? One proxy for this is <strong>how many lines of code change in each commit</strong>. <strong>This would provide a view into how expressive each language enables you to be in the same amount of space. </strong>Because the number of bugs in code is proportional to the number of source lines, not the number of ideas expressed, a more expressive language is always worth considering for that reason alone (e.g., see <a href="http://en.wikipedia.org/wiki/Halstead_complexity_measures">Halstead&#8217;s complexity measures</a>).</p>
<p>I recently got a hold of a great set of data from <a href="http://ohloh.net/">Ohloh</a>, which tracks open-source code repositories, on the use of programming languages over time across all of the codebases they track. After validating the data against Ohloh&#8217;s own graphs, one of the first things I did was try out my idea on expressiveness of programming languages. Sure enough, it gave me results that made sense and were surprisingly reasonable.</p>
<p>Some caveats to this approach:</p>
<ul>
<li><strong>This assumes that commits are generally used to add a single conceptual piece regardless of which language it&#8217;s programmed in.</strong></li>
<li>It won&#8217;t tell you how readable the resulting code is (Hello, lambda functions) or how long it takes to write it (<a href="http://en.wikipedia.org/wiki/APL_(programming_language)">APL</a> anyone?), so <strong>it&#8217;s not a measure of maintainability or productivity.</strong></li>
<li>Ohloh relies on opt-in subscription from open-source projects rather than crawling forges itself. That said, it&#8217;s a vast data set covering some 7.5 million project-months.</li>
</ul>
<p>Time to let the results speak for themselves. Enough words, here&#8217;s the data (enlarge by clicking):</p>
<p><a href="http://dberkholz-media.redmonk.com/dberkholz/files/2013/03/expressiveness_weighted2.png"><img class="alignnone size-large wp-image-1712" alt="expressiveness_weighted" src="http://dberkholz-media.redmonk.com/dberkholz/files/2013/03/expressiveness_weighted2-1024x347.png" width="547" height="185" /></a></p>
<p>It&#8217;s visualized in the form of <a href="http://en.wikipedia.org/wiki/Box_plot">box-and-whisker plots</a>, which are effective for showing a distribution of numbers relatively simply. <strong>What numbers are we showing? It&#8217;s a distribution of lines of code per commit every month for around 20 years, weighted by the number of commits in any given month.</strong> The black line in the middle of each box is the median (the 50th percentile) for that language, and languages are ranked by median. The bottom and top of the box are the 25th and 75th percentiles, while the &#8220;whiskers&#8221; extend to the 10th and 90th percentiles. The &#8220;Total&#8221; box indicates the median of each value across all languages (median of all 25th percentiles, median of all 75th percentiles, etc.) to show a &#8220;typical&#8221; language.</p>
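<p>To make the weighting concrete, here is a minimal sketch of a commit-weighted percentile. The post doesn&#8217;t spell out the exact method used for the plots, so this is one simple definition (the smallest value whose cumulative weight reaches the requested fraction), and the monthly figures are made up for illustration.</p>

```python
def weighted_percentile(values, weights, q):
    """Return the smallest value whose cumulative weight reaches fraction q of the total."""
    pairs = sorted(zip(values, weights))
    threshold = q * sum(weights)
    running = 0
    for value, weight in pairs:
        running += weight
        if running >= threshold:
            return value
    # Only reached if floating-point rounding leaves the threshold just above the total.
    return pairs[-1][0]


# Hypothetical data: LOC per commit in four months, and commits in each month.
loc_per_commit = [100, 200, 300, 400]
commits = [1, 1, 1, 7]

median = weighted_percentile(loc_per_commit, commits, 0.50)
q25 = weighted_percentile(loc_per_commit, commits, 0.25)
q75 = weighted_percentile(loc_per_commit, commits, 0.75)
print(median, q25, q75)  # 400 300 400
```

<p>An unweighted median of these four months would be 250. Weighting by commits pulls it to 400 because the busiest month dominates, which is the point of the weighting: quiet months with a handful of commits don&#8217;t get the same say as busy ones.</p>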
<p>I&#8217;ve also colored them according to our most recent <a href="http://redmonk.com/sogrady/2013/02/28/language-rankings-1-13/">RedMonk programming language rankings</a> (<strong><span style="color: #ff0000;">red</span></strong> is the most popular cluster, and <strong><span style="color: #0000ff;">blue</span></strong> is the second-tier cluster, while <strong>black</strong> is everything else), and <strong>restricted languages here to the ones popular enough to be included in that set of rankings</strong>.</p>
<p>What conclusions can we draw from this?</p>
<h2>Global effects</h2>
<p><strong>The trends generally make sense.</strong> If we focus purely on the tier-one languages shown in red, high-level languages (Python [#27], Ruby [#34]) lean toward better expressiveness while lower-level languages (C [#50], C++ [#45], Java [#44]) tend toward wordiness. Similarly in tier two, Fortran [#39/#52] and assembly [#49] are wordy, and &#8220;middle-aged&#8221; functional languages are intermediate while newer functional languages are best.</p>
<p><strong>Expressiveness ranges broadly across languages. </strong>The medians go from lows of <strong>48</strong> for Augeas (#1) and <strong>52</strong> for Puppet (#2) to a high of <strong>1629</strong> for <a href="http://en.wikipedia.org/wiki/Fortran#Fixed_layout">fixed-format Fortran</a> (#52), which is a surprisingly large variation of roughly <strong>34x</strong>.</p>
<p><strong>Less expressive languages tend to show a much wider variability. </strong>There&#8217;s a clear, but not strong, correlation between the medians (black lines) and the IQRs (box heights). Languages with the largest IQRs also tend to have greater medians, and <strong>consistently</strong> expressive languages tend to also be <strong>more</strong> expressive.</p>
<p><strong>First-tier languages are a mix of poor and moderate expressiveness. </strong>Of the 11 tier-one languages, 5 are moderately expressive and the remaining 6 are poor. The tier-one languages range from LOC/Commit ratios of 309–1485, which equates to 6x–30x lower expressiveness than the top languages. Perl (#26), the best tier-one language, <span style="color: #000000;">is 5x more expressive than the worst, JavaScript (#51), and 3.5x more expressive than the classic C. That&#8217;s certainly respectable but falls well short of the 20x or greater improvement one could gain with one of the top languages.</span></p>
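<p>The multiples above are just ratios of median LOC/commit. Here is a quick sketch of the arithmetic, assuming (as the text implies but doesn&#8217;t state outright) that 309 is Perl&#8217;s median and 1485 is JavaScript&#8217;s; the function name is made up for illustration.</p>

```python
# Median LOC/commit values quoted in this post. Perl and JavaScript are
# inferred from the tier-one range of 309-1485 (best and worst tier-one).
medians = {"Augeas": 48, "Perl": 309, "JavaScript": 1485}


def times_wordier(language, baseline):
    """How many times more lines per commit `language` needs than `baseline`."""
    return medians[language] / medians[baseline]


print(round(times_wordier("Perl", "Augeas"), 1))        # 6.4
print(round(times_wordier("JavaScript", "Augeas"), 1))  # 30.9
print(round(times_wordier("JavaScript", "Perl"), 1))    # 4.8
```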
<p><strong>Second-tier languages are well-distributed and reach into highly expressive languages. </strong>With 52 total languages on this list, the top ~17 constitute the highly expressive languages. Although none of those are first-tier languages, 9 of those 17 are second-tier — mostly functional with the exceptions of Groovy (#16), Prolog (#13), Puppet (#2), and CoffeeScript (#6).</p>
<p><strong>Third-tier languages are heavily biased toward high expressiveness.</strong> Of the 15 third-tier languages on this list, 8 are in the top 1/3 of languages, leaving only 7 in the remaining 2/3. Although these data do not directly show any correlation between age and expressiveness, it seems reasonable that newer, more expressive languages would start out less popular and may grow later.</p>
<h2>Effects of language class/type</h2>
<p><strong>Functional languages tend to be highly expressive.</strong> On this list are Haskell (#10), Erlang (#22), F# (#21), Lisp variants (including Clojure [#7], Emacs Lisp [#14], Dylan [#12], Common Lisp [#23], Scheme [#31], and Racket [#11]), OCaml (#20), R (#17), and Scala (#18). Of those, only two fall below #30 out of the 52 languages included here.</p>
<p><strong>Domain-specific languages are biased toward high expressiveness.</strong> Augeas (#1), Puppet (#2), R (#17), and Scilab (#19) are good examples of this, while VHDL (#38) serves as an outlier on the low end.</p>
<p><strong>Compilation does not imply lower expressiveness.</strong> I was halfway expecting highly expressive languages to exclude all compiled languages but was proven wrong. Compiled languages in the top 17 include CoffeeScript (#6), Vala (#9), Haskell (#10), and Dylan (#12).</p>
<p><strong>Interactive modes correlate with intermediate expressiveness.</strong> Languages with an interactive shell tend to be mid-range in expressiveness, with a few outliers on either side. For example: Lisp (#23), Erlang (#22), F# (#21), OCaml (#20), Perl (#26), Python (#27), R (#17), Ruby (#34), Scala (#18), Scheme (#31).</p>
<h2>Specific language effects</h2>
<p><strong><span style="text-decoration: underline;">CoffeeScript</span> (#6) appears dramatically more expressive than <span style="text-decoration: underline;">JavaScript</span> (#51), in fact among the best of all languages. </strong>Although the general trend is not particularly surprising because that&#8217;s the whole point of CoffeeScript, the magnitude of the difference seems unusual. I suspect JavaScript&#8217;s low placement could be at least partially due to wholesale copying of template JavaScript files rather than reflecting development in JavaScript itself.</p>
<p><strong><span style="text-decoration: underline;">Clojure</span> (#7) is the most expressive of Lisp variants.</strong> There are a large number of Lisp variants that generally ranked quite well, described in more detail above in the functional-language section. In this context, it&#8217;s worth noting that the top one was the fairly popular Clojure, with a median LOC/commit value of 101, followed by Racket (#11) at 136 and Dylan (#12) at 143.</p>
<p><strong>Among data-analysis languages, <span style="text-decoration: underline;">R</span> (#17) and <span style="text-decoration: underline;">Scilab</span> (#19) are most expressive. </strong> With a median of 193 LOC/commit for R, it&#8217;s a clear top performer. R is followed by Scilab and Matlab (#35) with medians of 225 and 445, respectively.</p>
<p><strong>Although <span style="text-decoration: underline;">Go</span> (#24) is getting increasingly hot, it&#8217;s not outstandingly expressive.</strong> We keep hearing about new use of Go across a variety of startups, but it&#8217;s little better than Perl (#26) or Python (#27) by this measure. Despite that, it does trump all the tier-one languages, so someone who only had experience with them could certainly see an improvement when trying Go.</p>
<h2>What if we sort by <span style="text-decoration: underline;">consistency</span> of expressiveness, instead of the median?</h2>
<p>Ideally a language should be:</p>
<ul>
<li>Easy enough to learn that the vast majority of developers using it can be highly productive; and</li>
<li>Equally expressive across nearly its entire domain of usefulness.</li>
</ul>
<p>To measure that, let&#8217;s take a look at the interquartile range (IQR; the distance between the 25th and 75th percentiles) as a proxy for these two criteria, and rank languages by that instead (enlarge by clicking):</p>
<p><a href="http://dberkholz-media.redmonk.com/dberkholz/files/2013/03/expressiveness_by_iqr_weighted2.png"><img class="alignnone size-large wp-image-1713" alt="expressiveness_by_iqr_weighted" src="http://dberkholz-media.redmonk.com/dberkholz/files/2013/03/expressiveness_by_iqr_weighted2-1024x347.png" width="553" height="187" /></a></p>
<p>What you&#8217;re looking for here is the height of the boxes. It starts small on the left side, with CoffeeScript doing best at <strong>23</strong> lines and increases to the right side, ending with fixed-format Fortran at <strong>1854</strong> lines.</p>
<p>A few new insights specific to this plot before we move on to considering them both together:</p>
<ul>
<li>As alluded to earlier but illustrated differently here, <strong>inconsistency and wordiness are correlated</strong>, as are consistency and expressiveness.</li>
<li><strong>Tier-one languages put in a much stronger showing</strong> here, with four in the top 1/3 of languages (Python at #11, Objective-C at #13, Perl at #15, and C# at #17). Shell nearly makes the cut at #19. Those IQRs vary from 90–167 LOC/commit, a fairly large difference even among the best performers.</li>
<li>Consequently, <strong>tier-three languages make a poorer showing here</strong>, although they performed unusually well on expressiveness. Their showing is roughly proportionate to their numbers, with 5 of 15 in the top third and the remainder evenly distributed across the moderate- and low-consistency groups.</li>
<li><strong>Java turns in the strongest performance of &#8220;enterprisey&#8221; languages (C, C++, Java)</strong> when considering both metrics. Java comes in with nearly identical expressiveness to C++ (both at 823 LOC/commit) but vastly greater consistency (IQR of 277 vs 476).</li>
<li>CoffeeScript is #1 for consistency, with an IQR spread of only 23 LOC/commit compared to even #4 Clojure at 51 LOC/commit. By the time we&#8217;ve gotten to #8 Groovy, we&#8217;ve dropped to an IQR of 68 LOC/commit. In other words, <strong>CoffeeScript is incredibly consistent across domains and developers in its expressiveness.</strong></li>
<li><strong>The outliers are particularly interesting</strong>: the ones with unusually high or low medians compared to nearby languages. If the median is higher than its neighbors&#8217;, then it&#8217;s an unusually consistent yet less expressive language. Conversely, if the median is lower than its neighbors&#8217;, then the language is unusually inconsistent (i.e., shifted to the right on this graph from the rough correlation between consistency and median expressiveness).
<ul>
<li><img class="size-medium wp-image-1711 alignright" style="font-weight: normal;" alt="expressiveness_by_iqr_weighted_tier_one" src="http://dberkholz-media.redmonk.com/dberkholz/files/2013/03/expressiveness_by_iqr_weighted_tier_one-300x101.png" width="300" height="101" /><strong>Tier-one languages tend to be remarkably consistent, regardless of their expressiveness.</strong> In nearly all cases, their medians are higher than their neighbors&#8217;, showing a general shift to the left from the expected placement. <span style="text-decoration: underline;"><strong>This suggests that a primary characteristic of a tier-one language is its predictability, even more so than its productivity.</strong></span></li>
<li>Conversely, in most cases where languages appear shifted to the right, they&#8217;re third-tier languages. The lack of predictability has often held them back from even reaching the second tier.</li>
</ul>
</li>
</ul>
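<p>To make the two metrics concrete, here is a minimal Python sketch of how a median and an IQR could be computed from per-commit line counts. The function name <code>summarize</code>, the sample values, and the choice of the inclusive quartile method are all assumptions for illustration, not the method or data behind the graphs.</p>

```python
from statistics import median, quantiles


def summarize(loc_per_commit):
    """Return (median, IQR) for a list of lines-of-code-per-commit values."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    # The "inclusive" method is an assumption; other methods give slightly
    # different quartiles on small samples.
    q1, _, q3 = quantiles(loc_per_commit, n=4, method="inclusive")
    return median(loc_per_commit), q3 - q1


# Made-up commit sizes for illustration only; not data from the study.
sample = [10, 20, 30, 40, 50, 60, 70, 80, 90]
sample_median, sample_iqr = summarize(sample)
print(sample_median, sample_iqr)
```

<p>By these metrics, a lower median means fewer lines per commit (more expressive), and a lower IQR means less spread between commits (more consistent).</p>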
<h2>So, what are the best languages by these metrics?</h2>
<p>If you pick the top 10 based on ranking by median and by IQR, then take the intersection of them, here&#8217;s what&#8217;s left. The median and IQR are listed immediately after the names:</p>
<ul>
<li><span style="line-height: 13px;"><a href="http://en.wikipedia.org/wiki/Augeas_(software)">Augeas</a> (48, 28): A domain-specific language for configuration files</span></li>
<li><a href="http://en.wikipedia.org/wiki/Puppet_(software)#Puppet_language">Puppet</a> (52, 65): Another DSL for configuration</li>
<li><a href="http://en.wikipedia.org/wiki/REBOL">REBOL</a> (57, 47): A language designed for distributed computing</li>
<li><a href="http://www.ecere.com/technologies.html#eC">eC</a> (75, 75): Ecere C, a C derivative with object orientation</li>
<li><a href="http://en.wikipedia.org/wiki/CoffeeScript">CoffeeScript</a> (100, 23): A higher-level language that transcompiles to JavaScript</li>
<li><a href="http://en.wikipedia.org/wiki/Clojure">Clojure</a> (101, 51): A Lisp dialect for functional, concurrent programming</li>
<li><a href="http://en.wikipedia.org/wiki/Vala_(programming_language)">Vala</a> (123, 61): An object-oriented language used by GNOME</li>
<li><a href="http://en.wikipedia.org/wiki/Haskell_(programming_language)">Haskell</a> (127, 71): A purely functional, compiled language with strong static typing</li>
</ul>
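<p>The selection step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the <code>stats</code> table holds just the ten languages whose median and IQR are quoted in this post rather than the full data set, and the helper names <code>top_n</code> and <code>best_on_both</code> are hypothetical.</p>

```python
# (median, IQR) in LOC/commit, as quoted in this post. This is a small
# subset of the full data set, so cutoffs here are not the real top 10.
stats = {
    "Augeas": (48, 28),
    "Puppet": (52, 65),
    "REBOL": (57, 47),
    "eC": (75, 75),
    "CoffeeScript": (100, 23),
    "Clojure": (101, 51),
    "Vala": (123, 61),
    "Haskell": (127, 71),
    "Java": (823, 277),
    "C++": (823, 476),
}


def top_n(stats, metric_index, n):
    """Names of the n languages with the lowest value for one metric."""
    ranked = sorted(stats, key=lambda name: stats[name][metric_index])
    return set(ranked[:n])


def best_on_both(stats, n):
    """Languages in the top n by median that are also in the top n by IQR."""
    both = top_n(stats, 0, n) & top_n(stats, 1, n)
    # Order the survivors by median, most expressive first.
    return sorted(both, key=lambda name: stats[name][0])


print(best_on_both(stats, n=5))
```

<p>On this reduced table, a cutoff of 5 leaves Augeas, REBOL, and CoffeeScript, while a cutoff of 8 returns the same eight languages listed above. Lower is better on both metrics, which is why the sketch sorts in ascending order.</p>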
<p>Looking at the box plots again, I would tend to <strong>rule out eC</strong> based on the poor performance of its upward-reaching whisker at the 90th percentile, which indicates a real lack of consistency as often as a quarter of the time (the 75th percentile is quite good, so the trouble lies above it). I would also rule out <strong>Puppet</strong> and <strong>Augeas</strong> because they are DSLs.</p>
<p>Combining those with our RedMonk <a href="http://redmonk.com/sogrady/2013/02/28/language-rankings-1-13/">programming language rankings on popularity</a>, <strong>the only highly expressive, general-purpose languages within the top two popularity tiers are</strong>:</p>
<ul>
<li><strong><span style="line-height: 13px;">Clojure</span></strong></li>
<li><strong>CoffeeScript</strong></li>
<li><strong>Haskell</strong></li>
</ul>
<p>If you&#8217;re considering learning a new language, it would make a lot of sense to put <strong>Clojure</strong>, <strong>CoffeeScript</strong>, and <strong>Haskell</strong> on your list, based on expressiveness and current use in communities we&#8217;ve found to be predictive.</p>
<p><strong>No tier-one languages fall in the top 25 on both metrics, although 5 make the cut on consistency alone.</strong> Of the tier-one languages, lower-level ones tend to be both inconsistent and overly wordy, while higher-level ones have intermediate wordiness and very strong consistency. The most consistent tier-one languages are Python, Objective-C, Perl, C#, and shell, with <strong>the presence of Perl and shell supporting the initial assertion that expressiveness has little to do with readability or maintainability.</strong> Ruby is an interesting case, in that it violates the &#8220;rules&#8221; of expressiveness and consistency seen in the other higher-level languages. This could be an instance of a framework (Rails) truly <a href="http://redmonk.com/sogrady/2011/04/27/frameworks-lead-adoption/">popularizing a language</a> that otherwise would never have taken off.</p>
<p><strong>For projects that require an expressive language where it&#8217;s relatively easy to hire developers, <span style="text-decoration: underline;">Python</span> is worth serious consideration.</strong> Of tier-one languages, Python, Perl, shell, and Objective-C are the best overall performers, and I consider Python the strongest of those for general-purpose applications. In my opinion, it makes a lot of sense to take a <a href="http://redmonk.com/jgovernor/2011/05/12/typesafe-the-polyglot-revolution-continues-apace/">polyglot</a> approach to projects, writing as high-level as performance requirements allow. Fortunately, many high-level languages like Python allow for modules written in more performant languages such as C. That means it&#8217;s entirely possible to write the vast majority of a project in a more productive, more expressive language while falling back to high-performance languages where needed.</p>
<p><span style="color: #ff0000;"><strong>Update (3/26/13):</strong> I somehow missed Haskell on the final recommendations for second-tier languages, although it was on the initial list. Thanks to Chad Scherrer for pointing it out in the comments.</span></p>
<p><span style="color: #ff0000;"><strong>Update (3/26/13):</strong> I just wrote a <a href="http://redmonk.com/dberkholz/2013/03/26/what-does-expressiveness-via-loc-per-commit-measure-in-practice/"><span style="color: #ff0000;">post</span></a> on the last day&#8217;s discussion and commentary about what this kind of metric means and what you can get out of it.</span></p>
<p><span style="color: #ff0000;"><strong>Update (3/26/13):</strong> I wrote a new post showing <a href="http://redmonk.com/dberkholz/2013/03/26/some-external-validation-on-expressive-languages/"><span style="color: #ff0000;">correlation</span></a> of my data with external survey data on what languages developers think are expressive.</span></p>
<p><em style="color: #999999;"><strong>Disclosure</strong>: Black Duck Software (which runs Ohloh) is a client.</em></p>
<div class="acc_license"><a href="http://creativecommons.org/licenses/by-nc-sa/3.0/"><img src="http://i.creativecommons.org/l/by-nc-sa/3.0/88x31.png" alt="by-nc-sa" /></a></div><!--<rdf:RDF xmlns="http://creativecommons.org/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><Work rdf:about=""><license rdf:resource="http://creativecommons.org/licenses/by-nc-sa/3.0/" /></Work><License rdf:about="http://creativecommons.org/licenses/by-nc-sa/3.0/"><requires rdf:resource="http://creativecommons.org/ns#Attribution" /><permits rdf:resource="http://creativecommons.org/ns#Reproduction" /><permits rdf:resource="http://creativecommons.org/ns#Distribution" /><permits rdf:resource="http://creativecommons.org/ns#DerivativeWorks" /><requires rdf:resource="http://creativecommons.org/ns#ShareAlike" /><prohibits rdf:resource="http://creativecommons.org/ns#CommercialUse" /><requires rdf:resource="http://creativecommons.org/ns#Notice" /></License></rdf:RDF>--><img src="http://feeds.feedburner.com/~r/thestoryofdata/~4/9nyYJ8Uo8TE" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://redmonk.com/dberkholz/2013/03/25/programming-languages-ranked-by-expressiveness/feed/</wfw:commentRss>
		<slash:comments>71</slash:comments>
		<feedburner:origLink>http://redmonk.com/dberkholz/2013/03/25/programming-languages-ranked-by-expressiveness/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=programming-languages-ranked-by-expressiveness</feedburner:origLink></item>
	</channel>
</rss>
