<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>Dataspora Blog</title>
	
	<link>http://dataspora.com/blog</link>
	<description>Big Data, open source analytics, and data visualization</description>
	<pubDate>Mon, 08 Mar 2010 18:49:25 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5</generator>
	<language>en</language>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/data-evolution" /><feedburner:info uri="data-evolution" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
		<title>The Data Singularity is Here</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/sxoeIHK3Byc/</link>
		<comments>http://dataspora.com/blog/the-data-singularity-is-here/#comments</comments>
		<pubDate>Mon, 08 Mar 2010 08:36:22 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[analytics]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=104</guid>
		<description><![CDATA[In the next two blog posts I&#8217;ll attempt to sketch the forces behind what I&#8217;m calling, somewhat sensationally, the Data Singularity, and then (in a following post) discuss what I see as its consequences.
In a nutshell, the Data Singularity is this: humans are being spliced out of the data-driven processes around us, and frequently we [...]]]></description>
			<content:encoded><![CDATA[<p><a href='http://dataspora.com/blog/wp-content/uploads/2010/03/thematrix.jpg'><img src="http://dataspora.com/blog/wp-content/uploads/2010/03/thematrix.jpg" alt="" title="thematrix" width="150" height="113" class="alignleft size-full wp-image-108" /></a>In the next two blog posts I&#8217;ll attempt to sketch the forces behind what I&#8217;m calling, somewhat sensationally, the Data Singularity, and then (in a following post) discuss what I see as its consequences.</p>
<p>In a nutshell, the Data Singularity is this: humans are being spliced out of the data-driven processes around us, and frequently we aren&#8217;t even at the terminal node of action.  International cargo shipments, high-frequency stock trades, and genetic diagnoses are all made without us.</p>
<p>Absent humans, these data and decision loops have far less friction; they become constrained only by the costs of bandwidth, computation, and storage &#8212; all of which are dropping exponentially.</p>
<p>The result is an explosion of data thrown off from these machine-mediated pipelines, along with data about those flows (and data about that data, and so on).  The machines all around us &#8212; our smart phones, smart cars, and fee-happy bank accounts &#8212; are talking, and increasingly we&#8217;re being left out of the conversation.</p>
<p>So whether or not the Singularity is Near, the Data Singularity is here, and its consequences are being felt.</p>
<p>But before I discuss these consequences, I&#8217;d like to expand on the premise.  The world wasn&#8217;t always drowning in this data deluge, so how did we get here?</p>
<p><strong>I.  Data at the Speed of Speech</strong></p>
<p>For most of human history, information traveled no faster than the sound of the human voice.  The origin of human language was the original singularity:  it marked the birth of a non-biological information channel,  distinct from our DNA.</p>
<p>But despite this achievement, the production of information &#8212; whether farmers&#8217; almanacs or merchants&#8217; ledgers &#8212; was still constrained by the costs of ink and parchment and the write-speed of the human hand.</p>
<p>All 70,000 volumes of the Library of Alexandria, the collected body of human knowledge in antiquity, could fit on two thumb drives today.</p>
<p>Thus the transmission and production of data, when it was done at all, was painstaking in form, small in scale, and occurred between people.</p>
<p><code>  People --> People </code></p>
<p><strong>II.  Data at the Speed of Light</strong></p>
<p>With the telegraph, for the first time, data flowed at the speed of light.</p>
<p>In the late 18th century, the first substantive telegraph line connected Paris to a suburb 210 kilometers to its north, using optical semaphores rather than electrical currents to communicate.  Yet while data hopped between stations at light speed, it had to be routed by human operators at each station.</p>
<p>Centuries earlier, the printing press dramatically reduced the production costs of information.  Still, human authors transmitted their hand-drafted manuscripts to typesetters, who set type with fonts optimally designed for human eyes.</p>
<p><strong>III. Programmable Looms and Reading Machines</strong></p>
<p>Punch cards represented the movement of data away from human-readable, anthropocentric substrates, onto a medium designed principally for consumption by machines.</p>
<p>Punch cards were developed in the early 18th century <a href="http://en.wikipedia.org/wiki/Basile_Bouchon"> to control industrial looms </a>, in France.</p>
<p>Now, machines were the final terminus of data transmission.  This act of communicating with our machines, <em>programming</em> them, was at the heart of Charles Babbage&#8217;s Analytical Engine, which came more than a century later.</p>
<p><code>  People --> Machines</code></p>
<p><strong>IV.  Phonographs and Recording Machines </strong></p>
<p>Developing on the other side of the communication spectrum were machines that excelled at writing and storing data.</p>
<p>The <a href="http://www-03.ibm.com/ibm/history/exhibits/storage/storage_350.html"> modern rotating disk drive </a> feels inspired less by punch cards than by Thomas Edison&#8217;s cylinder machines, better known as phonographs.</p>
<p>The human voice was a natural data format, and if early pioneers had a vision for the modern human-machine interface, I imagine it would have been to program machines by voice.  It&#8217;s a vision that still eludes us.</p>
<p>By the middle of the 20th century, a slew of semiconductor technologies emerged to close the loop of data generation: we had machines that produced digital data, and machines that continuously consumed it, without human intervention.</p>
<p><code>  Machines --> Machines</code></p>
<p>These technologies also sparked the beginning of a less-celebrated, but equally important exponential curve: the falling cost of data storage. </p>
<p><a href='http://dataspora.com/blog/wp-content/uploads/2010/03/cost_of_data_storage_360.png'><img src="http://dataspora.com/blog/wp-content/uploads/2010/03/cost_of_data_storage_360.png" alt="" title="cost_of_data_storage_360" width="360" height="360" class="alignnone size-full wp-image-106" /></a><br clear=all /></p>
<p> <strong>V.  Listening to the Pulse of the Planet</strong></p>
<p>The exponential drop in data storage costs has meant that logging historical data about a process, or billions of processes, is economically feasible.</p>
<p>I conjecture that the largest share of data on the planet sits in log files; these are the EKGs of the server farms that manage our cell phones, our e-mail accounts, and every other facet of our online existence &#8212; and which consume 3% of the <a href="http://arstechnica.com/old/content/2007/08/epa-power-usage-in-data-centers-could-double-by-2011.ars">US energy budget </a>.</p>
<p>Ubiquitous networking and cheap bandwidth have meant these pools of storage are no longer isolated on individual sensors, phones, or servers, but form the tributaries feeding an ocean of data in the Cloud.</p>
<p>And yet, funneling these massive volumes of data creates enormous technological pressures, against which companies struggle.  So why keep the data?</p>
<p>Because inside these log files, amidst the myriad conversations recorded between machines, lies the pulse of their customers.</p>
<p>Collectively, these logs reveal the pulse of the planet &#8212; flight delays, package shipments, job losses, and human sentiments.</p>
<p>And as I&#8217;ll discuss in my next post, those who can extract a meaningful signal from this thunderous cacophony &#8212; the analysts, statisticians, and data scientists &#8212; are uniquely positioned to change the world.</p>
<img src="http://feeds.feedburner.com/~r/data-evolution/~4/sxoeIHK3Byc" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/the-data-singularity-is-here/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/the-data-singularity-is-here/</feedburner:origLink></item>
		<item>
		<title>SQL is Dead.  Long Live SQL!</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/S9OepTJtXGo/</link>
		<comments>http://dataspora.com/blog/sql-is-dead-long-live-sql/#comments</comments>
		<pubDate>Wed, 25 Nov 2009 10:58:14 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[analytics]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=97</guid>
		<description><![CDATA[&#8220;The adoption of a relational model of data, as described above, permits the development of a universal data sub-language.&#8221;– E.F. Codd, 1969
&#8220;Database research has produced a number of good results, but the relational database is not one of them.&#8221; – Henry Baker, 1991
 Outside of programming language flame wars, few questions raise the hackles of [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p>&#8220;The adoption of a relational model of data, as described above, permits the development of a universal data sub-language.&#8221;– <a href="http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf">E.F. Codd, 1969</a></p></blockquote>
<blockquote><p>&#8220;Database research has produced a number of good results, but the relational database is not one of them.&#8221; – <a href="http://home.pipeline.com/~hbaker1/letters/CACM-RelationalDatabases.html">Henry Baker, 1991</a></p></blockquote>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/11/relational_theory.png"><img class="alignleft size-thumbnail wp-image-102" title="relational_theory" src="http://dataspora.com/blog/wp-content/uploads/2009/11/relational_theory-150x150.png" alt="" width="150" height="150" /></a> Outside of programming language flame wars, few questions raise the hackles of hackers more than: &#8220;how should I store my data?&#8221;</p>
<p>I will argue here that, as in many such debates, the answer is:  it depends on what you&#8217;re doing.</p>
<p>While the rise of non-relational data stores serves a much-needed niche, the death of SQL and relational databases <a href="http://www.readwriteweb.com/enterprise/2009/02/is-the-relational-database-doomed.php">has been much exaggerated</a>.  E.F. Codd may be dead, but SQL is alive and well as a simple yet powerful data query language.</p>
<p><strong>3NF Crusaders vs NoSQL Rebels</strong></p>
<p>While the current critique of relational databases shares features of earlier debates (such as in the 1990s, when object-oriented databases were heralded as the next big thing), it has some new twists.  So, to review the players and their positions:</p>
<p>On our right are the relational curmudgeons, the kind of folks who <a href="http://www.thethirdmanifesto.com/"> pen manifestos and crusade against NULL values</a>.  They have converted nearly all of big business to their ministry, and have billions of dollars in their coffers to show for it.  They insist that data should be stored in terms of its relations, to protect its integrity and facilitate its analysis.  Ideally that means third-normal form, but <a href="http://www.amazon.com/exec/obidos/ASIN/0471200247"> more liberal branches of the church </a> exist.</p>
<p><span id="more-97"></span>On our left are the folks from the misnomered NoSQL movement, <a href="http://blog.oskarsson.nu/2009/06/nosql-debrief.html">shaggy kids</a> from <a href="http://gigaom.com/2009/08/15/how-yahoo-facebook-amazon-and-google-think-about-big-data/"> the likes of Facebook and Twitter </a>.  They&#8217;ve rebelled against the shackles of relational tables (and bear the scars of MySQL scaling struggles).  They believe that data should be persisted as it&#8217;s programmed: in objects.  And they&#8217;ve spawned a constellation of colorfully named open-source projects – Cassandra, Voldemort, CouchDB, MongoDB, and <a href="http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html">Dynamo</a> – to consummate their cause.</p>
<p><strong>A Three-Pronged Attack on SQL:  Syntax, Schemas, and Performance</strong></p>
<p>At the heart of the NoSQL movement are three distinct critiques:</p>
<ol>
<li>A dislike for SQL&#8217;s syntax, which is ill-fitted to programming patterns.  It&#8217;s painful to write select statements to grab the data spread out across many tables, when all you want is a record.  Within web frameworks, the interface problem has been solved to a large degree by object-relational-mappers, such as Ruby&#8217;s ActiveRecord.</li>
<li>A rejection of the strong typing of relational schemas, which make it painfully difficult to alter one&#8217;s data model.  It also makes <a href="http://codemonkeyism.com/essential-storage-tradeoff-simple-reads-simple-writes/">writing to the data store a complex process</a>.</li>
<li>A critique of performance, which in turn relates to how concurrency and partitioning of computation is handled.  Most relational databases maintain a shared state, which strives for perfect concurrency, but complicates distributed computation over many nodes.  NoSQL architectures are built on languages and tools, like Erlang and Hadoop, that favor distributed processes which (to use two favorite catch phrases) &#8220;share nothing&#8221; but are &#8220;<a href="http://www.allthingsdistributed.com/2008/12/eventually_consistent.html">eventually consistent</a>.&#8221;  The NoSQL philosophy also weighs heavily against joins.</li>
</ol>
<p>These critical threads are mirrored in the movement and its associated projects.  On the one hand, you have developers who prefer the programmatic ease of interacting with NoSQL data stores, such as Cassandra and CouchDB.  They also don&#8217;t suffer the performance penalties of scale:  unlike with relational tables, the performance of look-ups does not degrade as the stored number of objects rises.</p>
<p>On the other, you have Big Data analysts (like myself), who love Hadoop because it allows easy distributed computation over massive, loosely typed data sets.</p>
<p><strong>Analytics:  MapReduce for Munging, SQL for Set Operations</strong></p>
<p>With regard to analytics, the Hadoop ecosystem makes it easy to dump several billion records of varying formats into a data store and process them – without having to conform them to a common data model.   Thus a NoSQL framework is great for massive data munging.</p>
<p>But if I had to access an already structured massive data set, I prefer SQL&#8217;s declarative syntax to MapReduce constructs.</p>
<p>I recently sat down at an SQL terminal with several hundred billion call records behind it.  With a simple SQL query, I determined how many distinct people the average American telephones more than once in a given month (answer: five).  In a few hundred seconds, I&#8217;d generated a report on the global state of the customer calling network.</p>
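<p>A query of that shape might look like the sketch below.  The <code>calls</code> table, its columns, and the sample rows are assumptions made for illustration (the real schema isn&#8217;t shown here), and an in-memory SQLite database stands in for the warehouse.</p>

```python
import sqlite3

# Hypothetical schema: one row per call (caller, callee, month).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (caller TEXT, callee TEXT, month TEXT)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        ("ann", "bob", "2009-10"), ("ann", "bob", "2009-10"),
        ("ann", "cat", "2009-10"), ("ann", "cat", "2009-10"),
        ("ann", "dan", "2009-10"),
        ("bob", "ann", "2009-10"), ("bob", "ann", "2009-10"),
    ],
)

# Innermost query: (caller, callee) pairs with more than one call in the month.
# Middle query: number of such repeat contacts per caller.
# Outer query: the average over callers that have at least one repeat contact.
QUERY = """
SELECT AVG(repeat_contacts) FROM (
    SELECT caller, COUNT(*) AS repeat_contacts FROM (
        SELECT caller, callee
        FROM calls
        WHERE month = '2009-10'
        GROUP BY caller, callee
        HAVING COUNT(*) > 1
    )
    GROUP BY caller
)
"""
average_repeat_contacts = conn.execute(QUERY).fetchone()[0]
print(average_repeat_contacts)
```

<p>Note that this sketch leaves out callers with no repeat contacts at all; whether they belong in the average is a modeling decision, not something SQL settles for you.</p>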
<p>Contrary to what the NoSQL camp may claim, it&#8217;s not that relational databases can&#8217;t scale – in fact, they can scale to petabytes, as <a href="http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/"> those who know Fortune 500 enterprise computing can attest </a>.  The problem is that relational databases require lots of ETL cruft to munge fluid blobs of data into strongly typed tables.</p>
<p>I can&#8217;t imagine the programmer pain and suffering that went into building one, unified, global database.  But once it&#8217;s there, I&#8217;d much prefer to access it with SQL statements than MapReduce code.</p>
<p>And I&#8217;m not alone in feeling this way:  Jeff Hammerbacher of Cloudera recently told me that, for an enterprise deployment, usage jumped 10x when an SQL interface – HIVE (which I mention below) – was placed on the cluster.</p>
<p><strong>NoSQL is a Misnomer: SQL is Innocent!</strong></p>
<p>Which brings me to my defense of SQL.  I agree with two of the three critiques above that embody the NoSQL philosophy, namely the need for schema-less storage and distributed architectures.  But when they go after SQL, and name the movement in opposition to it, they&#8217;ve named the wrong villain.  (Your honor,) <a href="http://cacm.acm.org/blogs/blog-cacm/50678-the-nosql-discussion-has-nothing-to-do-with-sql/fulltext">SQL is just an innocent query language!</a></p>
<p>As evidence of innocence, look no further than <a href="http://code.google.com/appengine/docs/python/datastore/gqlreference.html">Google&#8217;s GQL</a> and <a href="http://wiki.apache.org/hadoop/Hive"> Hadoop&#8217;s HIVE</a>, two SQL-style query languages for NoSQL data stores.</p>
<p>Why SQL in a NoSQL data store?   For one, it&#8217;s a language that both business analysts and developers already know; so the zero-th order adoption step is shorter.</p>
<p>But SQL lives on for a deeper reason: it is a simple yet powerful language for set operations.  SQL captures the essential patterns of data manipulation, such as:</p>
<ol>
<li>intersections (JOINs)</li>
<li> filters (WHEREs)</li>
<li>reductions or aggregations (GROUP BYs)</li>
</ol>
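<p>All three patterns fit in a single statement.  Here is a minimal sketch using Python&#8217;s built-in <code>sqlite3</code> module; the <code>stores</code> and <code>sales</code> tables and their contents are made up for the example.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical retail data, invented for this illustration.
conn.executescript("""
CREATE TABLE stores (store_id INTEGER, city TEXT);
CREATE TABLE sales  (store_id INTEGER, item TEXT, price REAL);
INSERT INTO stores VALUES (1, 'Paris'), (2, 'Lyon');
INSERT INTO sales VALUES
    (1, 'lamp', 40.0), (1, 'desk', 120.0),
    (2, 'lamp', 35.0), (2, 'sofa', 300.0), (2, 'rug', 15.0);
""")

rows = conn.execute("""
SELECT stores.city, COUNT(*) AS n_items, SUM(sales.price) AS revenue
FROM sales
JOIN stores ON stores.store_id = sales.store_id   -- intersection
WHERE sales.price >= 30                            -- filter
GROUP BY stores.city                               -- aggregation
ORDER BY stores.city
""").fetchall()
print(rows)  # [('Lyon', 2, 335.0), ('Paris', 2, 160.0)]
```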
<p>I suspect that many developers who profess a disdain for SQL have been deceived by its simplicity.  One of my favorite packages in R is <a href="http://code.google.com/p/sqldf/">sqldf</a>, which allows SQL queries on R data frames.  SQL&#8217;s declarative expressions are frequently more readable and compact than their R programmatic equivalents.</p>
<p><strong>MapReduce is Possible in SQL</strong></p>
<p>Until very recently one of the more difficult operations to perform in SQL was a top-K query, for example, finding the five highest-priced items for every store in a retail database.  But so-called window functions, which make such queries easy to express, have become part of the SQL standard and are now natively supported in Postgres.</p>
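<p>As a rough sketch of such a top-K query, here it is against a toy <code>items</code> table (the table and its rows are hypothetical).  It runs on SQLite through Python rather than Postgres, and assumes a SQLite build of 3.25 or newer, which is when window functions arrived there.</p>

```python
import sqlite3

# Window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (store TEXT, item TEXT, price REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        ("north", "sofa", 300.0), ("north", "desk", 120.0), ("north", "lamp", 40.0),
        ("south", "bed", 500.0), ("south", "rug", 15.0), ("south", "chair", 60.0),
    ],
)

K = 2  # the post's example uses five; two keeps the toy output short
top_k = conn.execute(
    """
    SELECT store, item, price FROM (
        SELECT store, item, price,
               ROW_NUMBER() OVER (PARTITION BY store ORDER BY price DESC) AS rnk
        FROM items
    )
    WHERE rnk BETWEEN 1 AND ?
    ORDER BY store, rnk
    """,
    (K,),
).fetchall()
print(top_k)
```

<p>The inner query ranks items within each store; the outer one keeps the first K ranks per store.</p>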
<p>Window functions are powerful because they provide a &#8220;split-apply&#8221; functionality, otherwise known as a map function.  Combine these with SQL&#8217;s GROUP BY operation, which acts as a reduce function, and you have achieved – voila! – map-reduce in SQL.  And as with all map functions, window operations are massively parallelizable (something that has not gone unnoticed by <a href="http://www.greenplum.com">some commercial vendors</a>).</p>
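<p>To make the analogy concrete, here is a loose sketch (again on a made-up <code>items</code> table, with SQLite 3.25+ assumed): a window function annotates every row using its own partition, and a GROUP BY then collapses each partition to a single row.  This illustrates the pattern only; it is not a distributed implementation.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (store TEXT, item TEXT, price REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        ("north", "sofa", 300.0), ("north", "desk", 100.0),
        ("south", "bed", 250.0), ("south", "rug", 250.0),
    ],
)

shares = conn.execute("""
SELECT store, MAX(share) AS largest_share
FROM (
    -- map-like step: each row gets its share of its store's total
    SELECT store, price / SUM(price) OVER (PARTITION BY store) AS share
    FROM items
)
GROUP BY store   -- reduce step: one row per store
ORDER BY store
""").fetchall()
print(shares)  # [('north', 0.75), ('south', 0.5)]
```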
<p><strong>Verdict:  Don&#8217;t Use a Chainsaw to Cut Butter (Use the Right Tool)</strong></p>
<p>Both NoSQL and SQL have their place in an analytics ecosystem.   In the <a href="http://dataspora.com/blog/sexy-data-geeks/">Big Data workflow</a> that I&#8217;ve advocated in the past, I view SQL as a pipe feeding data into more sophisticated modeling and visualization tools, such as R.  But it is an easy-to-use pipe, and it allows analysts to quickly pull out a subset of data &#8212; and start asking questions of that data.</p>
<p>The verdict in the great NoSQL debate is:  know your tools and know your goals.  In the Big Data space today, there can be an undue focus on formats or mechanics, but these are just a means to one end:  products.  Remember, Paul Graham and his team wrote Viaweb in Lisp, and it just worked.</p>
<img src="http://feeds.feedburner.com/~r/data-evolution/~4/S9OepTJtXGo" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/sql-is-dead-long-live-sql/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/sql-is-dead-long-live-sql/</feedburner:origLink></item>
		<item>
		<title>How XML Threatens Big Data</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/sgrANTdWFkU/</link>
		<comments>http://dataspora.com/blog/xml-and-big-data/#comments</comments>
		<pubDate>Sun, 23 Aug 2009 06:25:02 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[analytics]]></category>

		<category><![CDATA[computing]]></category>

		<category><![CDATA[data]]></category>

		<category><![CDATA[bigdata]]></category>

		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=91</guid>
		<description><![CDATA[Confessions from a Massive, Nightmarish Data Project
Back in 2000, I went to France to build a genomics platform.  A biotech hired me to combine their in-house genome data with that of public repositories like Genbank.  The problem was the repositories, all with millions of records, each had their own format.  It sounded [...]]]></description>
			<content:encoded><![CDATA[<p><strong><a href="http://dataspora.com/blog/wp-content/uploads/2009/08/elephant.jpg"><img class="alignleft size-thumbnail wp-image-93" title="elephant" src="http://dataspora.com/blog/wp-content/uploads/2009/08/elephant-150x150.jpg" alt="Credit:  http://www.flickr.com/photos/digitalart/2101765353" width="150" height="150" /></a>Confessions from a Massive, Nightmarish Data Project</strong></p>
<p>Back in 2000, I went to France to build a genomics platform.  A biotech hired me to combine their in-house genome data with that of public repositories like Genbank.  The problem was the repositories, all with millions of records, each had their own format.  It sounded like a massive, nightmarish data interoperability project.  And an ideal fit for <a href="http://www.nytimes.com/2000/06/07/business/the-next-big-leap-it-s-called-xml.html"> a hot new technology </a>:  XML.</p>
<p>So I dove in, spending my days designing DTDs, writing parsers, tweaking tags (&#8220;taxon&#8221; or &#8220;species&#8221;?  attribute or element?).  At night I dreamt in ontologies.  <a href="http://labs.dataspora.com/pubseq/docs/overview/records2xml.gif">It was perfect.</a></p>
<p>Then reality struck.  The pipeline was slow:  Oracle loaded XML at a crawl.  And it was a memory hog, since XSLT required putting full document trees in RAM.</p>
<p>We had a deadline to meet (and, mon dieu, a 35 hour work-week).  So we changed course.  We hacked our Perl scripts to emit a flat tab-delimited format &#8212; &#8220;TabML&#8221; &#8212; which was bulk loaded into Oracle.  It wasn&#8217;t elegant, but it was fast and it worked.</p>
<p>Yet looking back, I realize that XML was the wrong format from the start.  And as I&#8217;ll argue here, our unhealthy obsession with XML formats threatens to slow or impede many open data projects, including  initiatives like <a href="http://www.data.gov">Data.gov</a>.</p>
<p>In the next sections, I discuss how XML fails for Big Data because of its unnatural form, bulk, and complexity.  Finally, I generalize to three rules that advocate a more liberal approach to data.</p>
<p><span id="more-91"></span></p>
<h3>Three Reasons Why XML Fails for Big Data</h3>
<p><strong>I. XML Spawns Data Bureaucracy </strong></p>
<p>In its natural habitat, data lives in relational databases or as data structures in programs.  The common import and export formats of these environments do not resemble XML, so much effort is dedicated to making XML fit.  When more time is spent on inter-converting data &#8212; serializing, parsing, translating &#8212; than on using it, you&#8217;ve created a data bureaucracy.</p>
<p>Indeed, it was what Doug Crockford called <a href="http://www.json.org/fatfree.html">&#8220;impedance mismatch inefficiencies&#8221;</a> that spurred him to create JSON, standardizing JavaScript&#8217;s object notation as a portable data container.</p>
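<p>A small sketch of what a low-impedance format looks like in practice: a program&#8217;s native data structure round-trips through JSON with no mapping layer in between.  The record and its field names are hypothetical.</p>

```python
import json

# A hypothetical sequence record, held as an ordinary program data structure.
record = {
    "accession": "X00001",
    "taxon": "Homo sapiens",
    "length": 1542,
    "tags": ["mRNA", "partial"],
}

text = json.dumps(record)      # serialize: no schema or tag design required
restored = json.loads(text)    # parse: types (str, int, list) come back intact

print(restored == record)  # True
```

<p>An XML version of the same record would first need decisions about elements versus attributes, and every value would come back as a string.</p>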
<p><strong>II. Yes, Size Matters for Data</strong></p>
<p>Size matters for data in a way it does not for documents.  Documents are intended for human consumption and have human-sized upper bounds (a lifetime&#8217;s worth of reading fits on a thumb drive).  Data designed for machine consumption is bounded only by bandwidth and storage.</p>
<p>XML&#8217;s expansiveness &#8212; for even when compressed, the genie must be let out of the bottle at some point &#8212; imposes memory, storage, and CPU costs.</p>
<p><strong>III. Complexity Carries a Cost</strong></p>
<p>I never fail to sigh when I open a data file and discover an army of tags, several ranks deep, surrounding the data I need.  XML&#8217;s complexity imposes costs without commensurate benefits, specifically:</p>
<ul>
<li>In-line, element-by-element tagging is redundant.  Far preferable is stating the data model separately, and using a lightweight delimiter (such as a comma or a tab).</li>
<li> Text tags are purported to be self-documenting, but textual meaning is a slippery thing: it&#8217;s rare that one can be sure of a tag&#8217;s data type without consulting its DTD (in a separate document).</li>
<li> End-tags support nested structures (such as an aside (within (an aside))).  But to facilitate data exchange, flattened-out structures are preferable, and arbitrary levels of nesting are best used sparingly.</li>
</ul>
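<p>The first point above can be sketched in a few lines: state the data model once, apart from the data, and let a plain delimiter do the rest.  The schema and sample rows here are hypothetical.</p>

```python
import csv
import io

# Hypothetical data model, stated once and kept apart from the data.
SCHEMA = [("accession", str), ("taxon", str), ("length", int)]

# Tab-delimited rows with no per-field tags.
data = "X00001\tHomo sapiens\t1542\nX00002\tMus musculus\t987\n"

def read_records(stream, schema):
    """Yield one dict per tab-delimited row, typed according to the schema."""
    for row in csv.reader(stream, delimiter="\t"):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}

records = list(read_records(io.StringIO(data), SCHEMA))
print(records[0])
```

<p>Here the types live in the schema rather than in anyone&#8217;s reading of a tag name.</p>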
<p>XML&#8217;s complexity inflicts misery on both sides of the data divide: on the publishing side, developers struggle to comply with the latest edicts of a fussy standards group, while on the consuming side, data suitors labor to <a href="http://www.crummy.com/software/BeautifulSoup/">quickly unravel</a> that XML format into something they can use.</p>
<h3>Three Rules for XML Rebels</h3>
<p><strong>I.  Stop Inventing New Formats</strong> <a href="http://www.tbray.org/ongoing/When/200x/2006/01/08/No-New-XML-Languages">(as Tim Bray said in 2006)</a></p>
<p>Before you call for &#8220;an XML format for X&#8221;, let me tell you a story about LaTeX and MathML.  (And while these are document formats, there&#8217;s a lesson here for data).</p>
<p>The LaTeX typesetting system is the lingua franca for composing scientific documents.  As the one-million plus LaTeX-formatted articles on arXiv.org attest, it is spoken by scientists worldwide.</p>
<p>MathML, on the other hand, is a markup language for mathematics recommended by the W3C.  If you&#8217;re a scientist looking to use MathML, you have two choices: (i) find a program to convert LaTeX, which you already know, to MathML 3.0 or (ii) familiarize yourself with this <a href="http://www.w3.org/TR/2009/WD-MathML3-20090604/"> handy 354-page spec</a> and code it yourself.</p>
<p>Two years ago, Mike Adams thought of a third way: why not just let people use LaTeX directly in WordPress?  So he wrote a plug-in that did it.  <a href="http://en.blog.wordpress.com/2007/02/17/math-for-the-masses/">The applause was deafening</a>.</p>
<p>Spoken languages are strengthened by usage, not by imperial fiat, and data formats are no different.  Far better to evolve and adapt the standards we already have (as JSON and SQLite&#8217;s file format do), than to fabricate new ones from whole cloth.  <a href="http://blog.jonudell.net/2009/07/31/polymath-equals-user-innovatio/">As Jon Udell says</a>, &#8220;good-enough solutions [that are] here now, and familiar to people, often trump great solutions that aren’t here and wouldn’t be familiar if they were.&#8221;</p>
<p><strong>II.  Obey The Fifteen Minute Rule</strong></p>
<p><a href="http://www.ddj.com/184404686">Interviewed several years ago</a>, James Clark stated &#8220;If a technology is too complicated, no matter how wonderful it is and how easy it makes a user&#8217;s life, it won&#8217;t be adopted on a wide scale.&#8221;</p>
<p>Accordingly, if you absolutely must develop a new API, language, or format, it should satisfy a simple rule: a person of reasonable ability should be able to get from zero to &#8216;Hello World&#8217; in fifteen minutes.  (This does not preclude complex languages or formats, per se:  it does require that additional complexity not be sui generis, but built on some existing foundation, <a href="http://people.mandriva.com/~prigaux/language-study/diagram-light.png">for example.</a>) </p>
<p>Despite <a href="http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/"> a noble vision for the semantic web </a>, the barriers for adopting the W3C&#8217;s proposals for linked data are too high.  The beauty of the original HTML standard was that it was dead simple.  The flaw of RDF is that it is too hard.</p>
<p><strong>III.  Embrace Lazy Data Modeling</strong></p>
<p>To keep data bureaucracy to a minimum, <a href="http://my.safaribooksonline.com/9780596801656/information_platforms_as_dataspaces">several Big Data thinkers </a> have advocated a more <a href="http://en.wiktionary.org/wiki/catholic">catholic</a> approach to data:  building data stores that accommodate <a href="http://infochimps.org/">a broad range of data types and formats</a>.</p>
<p>Lazy data modeling is similar to lazy evaluation.  The right schema for data depends on future use cases, in as-yet-undeveloped applications.  Instead of trying to guess the future, we can store the data &#8220;as-is&#8221; &#8212; and deal with its transformation when (and if) a necessary use case arises.  As <a href="http://www.eecs.berkeley.edu/~franklin/Papers/dataspaceSR.pdf">Michael Franklin and colleagues note</a>: &#8220;the most scarce resource available for semantic integration is human attention.&#8221;</p>
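<p>One way to picture this is the sketch below, under heavy assumptions: the two raw formats and the field names are invented.  Records are stored exactly as they arrived, and a schema is applied only by the function that serves a particular use case.</p>

```python
import json

# Raw records kept exactly as received; both formats are hypothetical.
raw_store = [
    '{"species": "Homo sapiens", "len": 1542}',
    "Mus musculus\t987",
]

def as_length_table(raw_lines):
    """Apply a (species, length) schema only when this use case needs it."""
    for line in raw_lines:
        if line.startswith("{"):
            obj = json.loads(line)
            yield obj["species"], int(obj["len"])
        else:
            species, length = line.split("\t")
            yield species, int(length)

# Nothing is parsed until a consumer actually asks for the table.
table = list(as_length_table(raw_store))
print(table)  # [('Homo sapiens', 1542), ('Mus musculus', 987)]
```

<p>A later use case that needs different fields can write its own reader over the same untouched store.</p>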
<p>This liberal view also reduces barriers for data sharing, barriers which threaten initiatives like <a href="http://www.data.gov/">Data.gov</a>.  The US Census Bureau shouldn&#8217;t expend resources to publish in XML if they have a good-enough format available right now.</p>
<p>For the data geeks in the trenches, who are building the next generation of data services, the laws of economics hold fast: there are unlimited opportunities in the face of one limited resource, time. (Which also explains why <a href="http://blog.i2pi.com/">data geeks </a> <a href="http://www.datawrangling.com/">seem to </a> <a href="http://twitter.com/dpatil">get </a> <a href="http://anyall.org/blog/">no sleep</a>).</p>
<p>XML&#8217;s unfulfilled promise for data testifies that formats can create friction.  The easier it is for data to be shared and consumed, the more quickly we&#8217;ll realize our visions for smarter businesses and <a href="http://www.readwriteweb.com/archives/how_tim_oreilly_aims_to_change_government.php">better governments.</a></p>
<p><strong>(25-Aug-2009 Update:  <a href="http://groups.google.com/group/sunlightlabs/browse_thread/thread/da9118b9fe566c">  Read a response from open gov advocates at Sunlight Labs</a>).</strong></p>
<img src="http://feeds.feedburner.com/~r/data-evolution/~4/sgrANTdWFkU" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/xml-and-big-data/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/xml-and-big-data/</feedburner:origLink></item>
		<item>
		<title>The Rise of the Data Web</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/trYsY0hnfNQ/</link>
		<comments>http://dataspora.com/blog/the-rise-of-the-data-web/#comments</comments>
		<pubDate>Fri, 21 Aug 2009 01:51:33 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[analytics]]></category>

		<category><![CDATA[computing]]></category>

		<category><![CDATA[data]]></category>

		<category><![CDATA[data bigdata xml]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=86</guid>
		<description><![CDATA[The future of the web is data, not documents.  The web has evolved from Tim Berners-Lee&#8217;s original vision of &#8220;some big, virtual documentation system in the sky&#8221; into a vibrant ecosystem of data where documents &#8212; and human actors &#8212; will play an ever smaller role.
As others have noted, we&#8217;ve reached a tipping point [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/08/stream.jpg"><img class="alignleft size-medium wp-image-88" title="stream" src="http://dataspora.com/blog/wp-content/uploads/2009/08/stream-188x300.jpg" alt="" width="188" height="300" /></a>The future of the web is data, not documents.  The web has evolved from Tim Berners-Lee&#8217;s original vision of <a href="http://www.ted.com/index.php/talks/tim_berners_lee_on_the_next_web.html">&#8220;some big, virtual documentation system in the sky&#8221;</a> into a vibrant ecosystem of data where documents &#8212; and human actors &#8212; will play an ever smaller role.</p>
<p><a href="http://gigaom.com/2008/11/09/mapreduce-leads-the-way-for-parallel programming/">As others have noted</a>, we&#8217;ve reached a tipping point in history: more data is being manufactured by machines &#8212; servers, cell phones, GPS-enabled cars &#8212; than by people.  The early, document-centric web was populated by hand-coded hypertext files; today, a hand-coded web page is as rare as hand-woven clothing.</p>
<p>Through web frameworks, wikis, and blogs, we have industrialized the creation of hypertext.  Similarly, we&#8217;ve industrialized the collection of data, and spliced out the human steps in many data flows, such that data entry clerks may soon be as rare as typesetters.</p>
<p>The web we experience will continue to be dominated by documents &#8212; e-mail, blogs, and news.  And while many sites are data-centric &#8212; Google Maps, Weather.com, and Yahoo Finance &#8212; it&#8217;s the web that we can&#8217;t see that is surging with data.  It&#8217;s not about us, it&#8217;s about servers in the cloud mediating <a href="http://radar.oreilly.com/archives/2007/02/pipes-and-filte.html">entire pipelines of data</a>, only occasionally surfacing in a browser.</p>
<p>But the web&#8217;s data architecture is fractious and in flux: many competing standards exist for serializing, parsing, and describing data.  As we build out the data web, we ought to embrace standards that mirror data&#8217;s form in its natural habitats &#8212; as programmatic data structures, relational tables, or key-value pairs &#8212; while taking advantage of data&#8217;s stream-like nature.  Mark-up languages like HTML and XML are ideal for documents, but they are poor containers for data, especially Big Data.</p>
<p><span id="more-86"></span></p>
<p><strong>Sacred &#8220;Words &amp; Enthusiasm&#8221; vs Meaningless Utterances</strong></p>
<p>Documents and data are different.  The table below reflects my thin grasp of the fissure lines, as a step towards arguing why we ought to design around them.</p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/08/documents_vs_data.png"><img class="alignnone size-full wp-image-90" title="documents_vs_data" src="http://dataspora.com/blog/wp-content/uploads/2009/08/documents_vs_data.png" alt="" width="499" height="356" /></a></p>
<p>Documents are made of <a href="http://www.ted.com/talks/view/id/161">&#8220;words and enthusiasm&#8221;</a>: sonnets, cake recipes, blog posts, Supreme Court rulings, and dictionary definitions.  Their core stuffing is text.  Their structure is unpredictable and irregular &#8212; even <a href="http://seanmcgrath.blogspot.com/2004_05_23_seanmcgrath_archive.html"> fractal</a>.</p>
<p>Data are not created but collected (<a href="http://www.archives.nd.edu/cgi-bin/lookit.pl?latin=datum">something given</a>, not something made): city temperatures, stock prices, web visitors, and home runs. They are observations in time and space, with periodic and predictable structure.  Data are reorderable and divisible: you can relay city temperatures in any order, but you can&#8217;t rearrange a Shakespearian sonnet without muddling its meaning.  Some documents are so meaningful as to be considered <a href="http://www.ietf.org/rfc/rfc1.txt">sacred</a>.</p>
<p>Data are, in this regard, meaningless on their own; they do not signify, they simply are.  These data are the <a href="http://plato.stanford.edu/entries/assertion/">utterances </a>of the <a href="http://boingboing.net/images/blobjects.htm">spimes </a> that surround us.</p>
<p><strong>Documents as Trees, Data as Streams</strong></p>
<p>The argument for shifting away from markup languages as data formats is not just practical, it&#8217;s philosophical: it&#8217;s about pivoting our conception away from the dominant metaphor of documents &#8212; trees &#8212; towards one far more suitable for data &#8212; streams.</p>
<p>Trees are rooted and finite: you can&#8217;t chop up a tree and easily put it back together again (while XML has made concessions to <a href="http://www.w3.org/TR/xml-fragment">document fragments</a>, it is not a natural fit).</p>
<p>Streams can be split, sampled, and filtered.  The divisibility of data streams lends itself to parallelism in a way that document trees do not.  The stream paradigm conceives of data as extending infinitely forward in time.  The Twitter data stream has no end: it ought to have no end tag.</p>
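<p>As a rough illustration of splitting, sampling, and filtering, here is a minimal Python sketch built on generators.  The stream of city temperatures, the city names, and every function name are hypothetical, invented for this example; the point is only that an endless stream can be consumed lazily, a few readings at a time.</p>

```python
import itertools
import random

def temperature_stream(seed=0):
    """Hypothetical endless stream of (city, temperature) readings."""
    rng = random.Random(seed)
    cities = ["Oakland", "Boston", "London"]
    while True:  # no end tag: the stream never closes
        yield rng.choice(cities), round(rng.uniform(40.0, 100.0), 1)

def filter_city(stream, city):
    """Filter: keep only the readings for one city."""
    return (reading for reading in stream if reading[0] == city)

def sample_every(stream, n):
    """Sample: keep every n-th reading, starting with the first."""
    return itertools.islice(stream, 0, None, n)

def split(stream, ways=2):
    """Split: fan one stream out into several independent iterators."""
    return itertools.tee(stream, ways)

left, right = split(temperature_stream())
oakland = filter_city(left, "Oakland")
sampled = sample_every(right, 10)

# Only a finite prefix is ever materialized.
first_oakland = list(itertools.islice(oakland, 3))
first_sampled = list(itertools.islice(sampled, 3))
print(first_oakland)
print(first_sampled)
```

<p>Each branch pulls readings only as it needs them, so the two consumers could just as well run on separate machines.  A tree-shaped document, by contrast, has to be closed before a parser can call it well-formed.</p>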
<p>Conceiving of data as streams moves us out of the realm of static objects and into the <a href="http://mitpress.mit.edu/sicp/full-text/book/book-Z-H-24.html#%_sec_3.5">realm of signal processing</a>.  This is the domain of the living: where the web is not an archive but an organism, <a href="http://radar.oreilly.com/2009/08/big-data-and-real-time-structured-data-analytics.html">reacting in real-time</a>.</p>
<p><strong>XML Considered Harmful for Data</strong></p>
<p>XML is a poor language for data because it solves the wrong problems &#8212; those of documents &#8212; while leaving many of data&#8217;s unique issues unaddressed.   But many promising alternatives exist &#8212; microformats like <a href="http://www.json.org/fatfree.html">JSON</a>, <a href="http://developers.facebook.com/thrift/thrift-20070401.pdf">Thrift</a>, and even <a href="http://www.sqlite.org/fileformat.html">SQLite&#8217;s file format</a> &#8211; as I will detail in <a href="http://dataspora.com/blog/xml-and-big-data/">my next post.</a></p>
]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/the-rise-of-the-data-web/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/the-rise-of-the-data-web/</feedburner:origLink></item>
		<item>
		<title>The Three Sexy Skills of Data Geeks</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/Wf9Z8ufjH2o/</link>
		<comments>http://dataspora.com/blog/sexy-data-geeks/#comments</comments>
		<pubDate>Wed, 27 May 2009 10:02:05 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[analytics]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=85</guid>
		<description><![CDATA[Hal Varian, Google&#8217;s Chief Economist, was interviewed a few months ago, and said the following in the McKinsey Quarterly:
&#8220;The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/05/marilyn_scatter.png"><img class="alignnone size-medium wp-image-84" title="marilyn_scatter" src="http://dataspora.com/blog/wp-content/uploads/2009/05/marilyn_scatter-300x300.png" alt="Marilyn Monroe Scatterplot Mashup" width="300" height="300" /></a>Hal Varian, Google&#8217;s Chief Economist, was interviewed a few months ago, and said the following in <a href="http://www.mckinseyquarterly.com/Hal_Varian_on_how_the_Web_challenges_managers_2286">the McKinsey Quarterly</a>:<br />
<em>&#8220;The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill.” </em></p>
<p>In prepping for tonight&#8217;s talk at the <a href="http://www.youtube.com/watch?v=hcl3qmawY_0">Google IO Ignite</a> event, this quote inspired me to muse about how sex appeal and statistics might go together:  so I chose to mash up a few scatter plots with Andy Warhol&#8217;s Marilyn Monroe.</p>
<p>Statisticians&#8217; sex appeal has little to do with their lascivious leanings (ahem, <a href="http://www.bedposted.com">BedPost</a>), and more with the scarcity of their skills.  I believe that the folks to whom Hal Varian is referring are not statisticians in the narrow sense, but rather people who possess skills in three key, yet independent areas:  statistics, data munging, and data visualization.  (In parentheses next to each, I&#8217;ve put the salient character trait needed to acquire it).</p>
<p><strong>Skill #1: Statistics (Studying).</strong> Statistics is perhaps the most important skill and the hardest to learn. <span id="more-85"></span>It&#8217;s a deep and rigorous discipline, and one that is actively progressing (the widely used method of Least Angle Regression was only <a href="http://arxiv.org/abs/math/0406456">recently developed in 2004</a>).  I expect to be on its learning curve my entire life.  This being the case, people who possess a solid grasp of modern statistics are rare.   And yet problems that require its application continue to multiply.  The text that I was exposed to in graduate school and find to be an unparalleled survey is Hastie, Tibshirani, and Friedman&#8217;s <a href="http://www.amazon.com/Elements-Statistical-Learning-T-Hastie/dp/0387952845">Elements of Statistical Learning</a>.</p>
<p><strong>Skill #2: Data Munging (Suffering).</strong> The second critical skill mentioned above is  &#8220;data munging.&#8221;  Among data geek circles (you can find us with a <a href="http://search.twitter.com/search?q=%23rstats">Twitter search for #rstats</a>), this refers to the painful process of cleaning, parsing, and proofing one&#8217;s data before it&#8217;s suitable for analysis.  Real world data is messy.  At best it&#8217;s inconsistently delimited or packed into an unnecessarily complex XML schema.  At worst, it&#8217;s a series of scraped HTML pages or a thoroughly undocumented fixed-width format.</p>
<p>A good data munger excels at turning coffee into regular expressions and parsers, implemented in a high-level scripting language of choice (often Perl, Python, even JavaScript).  This is problem solving with programming, and quite different from statistics.  An aspiration towards elegance &#8212; in the form of a perfect XSLT filter, for example &#8212; is rarely rewarded, and often punished.  A decade ago, I thought that the world&#8217;s data would soon be well-structured, and my talent for syntactical incantations of regular expressions would be a moot skill.   I was wrong.  (Perhaps there&#8217;s an analogy with the paper industry:  the growing volume of data means we&#8217;ll likely need more regular expressions before we need fewer).</p>
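<p>To make the munging concrete, here is a small Python sketch of the kind of regular expression this work involves.  The input records, the field layout, and the names <code>DELIMITER</code> and <code>parse_line</code> are all made up for illustration; real files tend to be messier still.</p>

```python
import re

# Hypothetical messy input: the same three fields, delimited four different ways.
raw_lines = [
    "Oakland, 2009-05-05 , 71.5",
    "Boston;2009-05-05;58",
    "London\t2009-05-05\t61.0",
    "San Francisco | 2009-05-05 | 64",
]

# One delimiter pattern: a comma, semicolon, pipe, or tab, with optional padding.
DELIMITER = re.compile(r"\s*[,;|\t]\s*")

def parse_line(line):
    """Split one record into (city, date, temperature as a float)."""
    city, date, temp = DELIMITER.split(line.strip())
    return city, date, float(temp)

records = [parse_line(line) for line in raw_lines]
for record in records:
    print(record)
```

<p>A bare space is deliberately not treated as a delimiter, so &#8220;San Francisco&#8221; survives as one field.  A record with a missing or extra field raises an error rather than passing through silently, which is usually what you want while proofing data.</p>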
<p>Related to munging but certainly far less painful is the ability to retrieve, slice, and dice well-structured data from persistent data stores, using a combination of SQL, scripting languages (especially Python and its SciPy and NumPy libraries), and even several oldie-but-goodie Unix utilities (cut, join).</p>
<p>And when data sets grow too large to manage on a single desktop, the samurai of data geeks are capable of parallelizing storage and computation with tools like <a href="http://databeta.wordpress.com/2009/05/14/bigdata-node-density/">96 nodes of Postgres</a>, <a href="http://cran.r-project.org/web/views/HighPerformanceComputing.html">snow and Rmpi</a>, Hadoop and MapReduce, and <a href="http://www.datawrangling.com/amazon-elastic-mapreduce-a-web-service-api-for-hadoop">on Amazon EC2 to boot.</a></p>
<p><strong>Skill #3: Visualization (Storytelling).</strong> This third and last skill that Professor Varian refers to is the easiest to believe one has.  Most of us have had exposure to basic chart-making widgets of Excel (and to date myself, tools like Harvard Graphics).   But a little knowledge is a dangerous thing:  these software tools are often insufficient when faced with the visualization of large, multivariate data sets.</p>
<p>Here it&#8217;s worth making a distinction between two breeds of data visualizations, which differ in their audience and their goals.  The first are exploratory data visualizations (as named by John Tukey), intended to facilitate a data analyst&#8217;s understanding of the data.   These may consist of <a href="http://dsarkar.fhcrc.org/lattice/book/images/Figure_05_17_stdBW.png">scatter plot matrices</a> and histograms, where labels and colors are minimally set by default.   Their goal is to help develop a hypothesis about the data, and their audience typically numbers one or a small team.</p>
<p>The second kind of data visualization is intended to communicate to a wider audience; its goal is to visually advocate for a hypothesis.  While most data geeks are facile with exploratory graphics, the ability to create this second kind of visualization, these visual narratives, is again a separate skill &#8212; with separate tools.  (R is excellent for static visualizations, but cannot compete with the kinds of rich interactive visualizations that tools like <a href="http://processing.org/">Processing </a>and <a href="http://flare.prefuse.org/">Flare</a> make possible).  Luckily, successful collaboration often occurs <a href="http://blog.jonudell.net/2009/05/26/a-conversation-with-eric-rodenbeck-about-usefully-cool-design-and-engineering/">between data analysts and designers</a>, the <a href="http://flowingdata.com/2009/04/22/narrow-minded-data-visualization/">occasional fracas</a> notwithstanding.</p>
<p>The ability to visualize and communicate data is critical, because even with good data and rigorous statistical techniques, if the results of an analysis are poorly visualized, they will not convince:  whether it&#8217;s an academic discovery or a business proposal.</p>
<p><strong>Put All Three Skills Together:  Sexy. </strong>Thus with the Age of Data upon us, those who can model, munge, and visually communicate data &#8212; call us statisticians or data geeks &#8212; are a hot commodity.  I grew up before the age of geek chic, when the computer whizzes were social pariahs, and feature-length movies were dedicated to <a href="http://www.imdb.com/title/tt0088000/">nerds seeking revenge</a>.  But in the last decade, Steve Jobs became an icon, the Internet became cool, and an entire generation of tech kids grew up well adjusted.  They even built the social web to prove it.   I believe the same could happen to statistics and data geeks too.</p>
<p><strong>(Update Aug-2009:  If you liked this post, consider </strong><a href="http://panelpicker.sxsw.com/ideas/view/4287"><strong>voting for it at the 2010 SXSW Conference</strong></a><strong>).</strong><a href="http://panelpicker.sxsw.com/ideas/view/4287"><img src="http://sxsw.com/files/SXSWPanelPicker-sm.png" alt="Vote for my PanelPicker idea at SXSW" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/sexy-data-geeks/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/sexy-data-geeks/</feedburner:origLink></item>
		<item>
		<title>Dataviz Salon SF #2:  Maps, Grammars, &amp; Models</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/YIddP1eXxpc/</link>
		<comments>http://dataspora.com/blog/dataviz-sf-salon-no/#comments</comments>
		<pubDate>Fri, 08 May 2009 10:11:35 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[analytics]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=75</guid>
		<description><![CDATA[A few nights ago the talented folks at Stamen Design hosted us at their studios for our second dataviz salon in San Francisco.  (Special thanks to Tom Carden and Michal Migurski for inviting us).  Four talks were given, which I&#8217;ll review in turn.

Stamen:  Reaching through Maps
Protovis: A Declarative, Open Source Graphical Toolkit
A Mathematician&#8217;s View:  A Visualization is a [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/05/dataviz_salon_poster_5may20.png"><img class="alignleft size-thumbnail wp-image-76" title="dataviz_salon_poster_5may20" src="http://dataspora.com/blog/wp-content/uploads/2009/05/dataviz_salon_poster_5may20-150x150.png" alt="" width="150" height="150" /></a>A few nights ago the talented folks at <a href="http://www.stamen.com">Stamen Design</a> hosted us at their studios for our second dataviz salon in San Francisco.  (Special thanks to <a href="http://www.tom-carden.co.uk">Tom Carden</a> and <a href="http://mike.teczno.com/">Michal Migurski</a> for inviting us).  Four talks were given, which I&#8217;ll review in turn.</p>
<ul>
<li><a href="http://dataspora.com/blog/dataviz-sf-salon-no/#stamen">Stamen:  Reaching through Maps</a></li>
<li><a href="http://dataspora.com/blog/dataviz-sf-salon-no/#protovis">Protovis: A Declarative, Open Source Graphical Toolkit</a></li>
<li><a href="http://dataspora.com/blog/dataviz-sf-salon-no/#morton">A Mathematician&#8217;s View:  A Visualization is a Hypothesis</a></li>
<li><a href="http://dataspora.com/blog/dataviz-sf-salon-no/#uuorld">UUorld:  Multidimensional Extrusion Maps</a></li>
</ul>
<h3 id="stamen">Stamen:  Reaching through Maps</h3>
<p>Eric Rodenbeck (Stamen) started by highlighting several mapping visualizations that Stamen has been hacking on recently and in the past, including <a href="http://www.cabspotting.org">Cabspotting in San Francisco</a>, <a href="http://oakland.crimespotting.org/">Crimespotting in Oakland</a>, and <a href="http://www.london2012.com/in-your-area/map/index.php">Olympic Stadium spotting in London</a>.</p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/05/stamen_cabspotting.png"><img class="alignleft size-thumbnail wp-image-79" title="stamen_cabspotting" src="http://dataspora.com/blog/wp-content/uploads/2009/05/stamen_cabspotting-150x150.png" alt="" width="150" height="150" /></a>Eric showed how Stamen has attempted to move away from what <a href="http://mappinghacks.com/2006/04/07/web-map-api-roundup/">Schuyler Erle has dubbed &#8220;red dot fever&#8221;</a>, whereby the overlaid data can overwhelm our visual attention, and toward allowing various data layers to &#8220;reach through&#8221; the maps.</p>
<p>For example, the London Olympic maps provide a mixture of schematic, satellite, and webcam images.  These various drill-downs of detail are not all exposed, but rather collaged.  Even more interesting was a movable &#8216;lens&#8217; that, as it is moved over regions of a map, reveals another layer (reminiscent of a <a href="http://www.flickr.com/photos/cdevers/2896777351/"> polarized-light-based mural</a> at Boston&#8217;s Museum of Science).  In these ways, additional layers of data are only selectively brought into focus (echoing a design pattern in Japanese gardening, <a href="http://www.amazon.com/Visual-Spatial-Structure-Landscapes/dp/0262580942">mie gakure</a>, meaning &#8220;seen and unseen&#8221;).<br />
<span id="more-75"></span><br />
One practical gem that Mike Migurski shared regarding the Oakland Crimespotting site was, &#8220;the design of a comments section is a huge part of how it&#8217;s perceived and used.&#8221;  Nota bene, social web developers.</p>
<h3 id="protovis">Protovis: A Declarative, Open Source Graphical Toolkit</h3>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/05/burtin_yeast_mic.png"><img class="alignnone size-thumbnail wp-image-77" title="burtin_yeast_mic" src="http://dataspora.com/blog/wp-content/uploads/2009/05/burtin_yeast_mic-150x150.png" alt="" width="150" height="150" /></a>Mike Bostock (Stanford CS) introduced <a href="http://vis.stanford.edu/protovis/">Protovis</a>, an extensible visualization toolkit implemented in JavaScript on top of the canvas element.  Protovis draws inspiration from Leland Wilkinson&#8217;s <a href="http://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448">Grammar of Graphics</a>, which argues for moving away from the prevailing method of building visualizations, where data are simply poured into one of several chart types &#8212; pie, stacked bar, or scatter.</p>
<p>Wilkinson argues that visualizations should not be cast from chart typologies, but rather composed of graphical primitives.  In Protovis, these primitives include dots, areas, lines, and labels (called &#8220;marks&#8221;).</p>
<p>Among Protovis&#8217;s strengths are:</p>
<dl>
<dt><strong> A More Declarative Syntax for Creating Graphics </strong></dt>
<dd> One disadvantage of using the canvas element directly from JavaScript is its imperative style.  To draw a diagonal line, the code must manipulate and move a pen using x,y coordinates.  With Protovis, however, the code declares (roughly) &#8220;add a bar to this graph&#8221; (<a href="http://vis.stanford.edu/protovis/ex/weather.html">example</a>).  Thus Protovis provides a grammar for statements about graphical marks, rather than statements about graphical mechanics. </dd>
<dt><strong> Visible Open Source </strong></dt>
<dd> With Protovis, the source code is not just open and available, it&#8217;s viewable from within the browser.  I have an admittedly personal bias for <a href="http://dataspora.com/blog/open-source-dataviz/">open source data visualization</a>, but lowering the barriers to sharing source code ultimately drives faster adoption and iteration of visualization techniques. </dd>
</dl>
<p>Mike has used Protovis to recreate classic data visualizations by Will Burtin, Florence Nightingale, William Playfair, and others.  You can find these at the <a href="http://vis.stanford.edu/protovis">Protovis site</a> and in their <a href="http://vis.stanford.edu/protovis/protovis.pdf">InfoVis &#8217;09 paper</a>.</p>
<p>(For those interested in a Wilkinson-inspired approach for graphics in R, check out <a href="http://had.co.nz/ggplot2/">Hadley Wickham&#8217;s ggplot2</a>).</p>
<h3 id="morton">A Mathematician&#8217;s View:  A Visualization is a Hypothesis</h3>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/05/dataspora_wordle.png"><img class="alignleft size-thumbnail wp-image-78" title="dataspora_wordle" src="http://dataspora.com/blog/wp-content/uploads/2009/05/dataspora_wordle-150x150.png" alt="" width="150" height="150" /></a>Jason Morton (Stanford Mathematics) made the argument that a data visualization is not merely a descriptive vessel, it is a predictive model.</p>
<p>A visualization is a model because, especially with large data sets, not every dimension of every observation can be shown.  Quite simply, a (compressed) 100 KB data visualization cannot losslessly describe a (compressed) 10 MB data set: information must be discarded. What remains is a <em>model</em> of the original data, albeit a visual model.</p>
<p>Moreover, a data visualization&#8217;s model is predictive: it presents a hypothesis about how observable data points were generated, and implies predictions about future, as-yet-unobserved data.</p>
<p>Seen from this perspective, Stamen&#8217;s Crimespotting maps are powerful precisely because they make compelling hypotheses about when and where crime occurs in Oakland.  Their London Olympic maps, which integrate time series photographs of the stadium site, take a position about the pace of construction and how it is impacting the landscape.</p>
<p><strong>&#8220;Form Ever Follows Function&#8221;</strong></p>
<p>And if the function of a data visualization is to make hypotheses, then its form should follow this function. The arbitrary use of color, position, shape, and ornament only adds noise.</p>
<p>The ever popular <a href="http://www.wordle.net/"> Wordle </a> provides a visual model for word distribution in a text: more frequent words are larger.  However, a word&#8217;s color, position, and font are arbitrarily chosen - they carry no meaning, and model nothing. Indeed, the &#8220;randomize&#8221; button is an admission of as much (for it does not randomize size).</p>
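<p>The one dimension that does carry meaning, size as a function of frequency, can be sketched in a few lines of Python.  This is not Wordle&#8217;s actual algorithm, which I have not seen; the linear scaling, the point sizes, and the function name <code>word_sizes</code> are assumptions made purely for illustration.</p>

```python
import re
from collections import Counter

def word_sizes(text, min_pt=10, max_pt=48, top=5):
    """Map the most frequent words to font sizes, scaled linearly by count."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common(top)
    highest = counts[0][1]
    lowest = counts[-1][1]
    spread = max(highest - lowest, 1)  # avoid dividing by zero when all counts tie
    return {
        word: min_pt + (max_pt - min_pt) * (count - lowest) / spread
        for word, count in counts
    }

sample = "data data data streams streams documents"
sizes = word_sizes(sample)
print(sizes)
```

<p>Nothing in this mapping says anything about color, position, or font.  Those remain free parameters, which is exactly why a tool can randomize them without changing the model.</p>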
<p>Adding arbitrary marks or dimensions to a visualization carries two related risks: first, it can obscure the model that the visualization is trying to convey (what do same-colored words have in common?); second, this added complexity, beyond polluting the information channel, has a cost: the visualization is larger.  <a href="http://www.swivel.com/graphs/image/28893777/default/600/337/5/absolute/HorizontalBarGraph/ASC/all+time/daily/ignore?s=1241769339">Bar graphs with iPhone ads</a> in the background cannot be succinctly rendered.</p>
<p>The parallels to the modernist movement in architecture are obvious. Adolf Loos wrote in 1908 that &#8220;the evolution of culture marches with the elimination of ornament from useful objects.&#8221;  The American modernist Louis Sullivan proclaimed that &#8220;form ever follows function.&#8221;</p>
<p>But the truth is that stripping visualizations down to their bare models can be counterproductive.  Call it noise or ornamentation, but even visual marks that do not advance a hypothesis can act to support it,  by guiding the eye, providing context, or otherwise speeding the absorption of a pattern by the human brain.  At the very least, this functionalist perspective can help data visualizers use ornamentation intentionally, not inadvertently.</p>
<h3 id="uuorld">UUorld:  Multidimensional Extrusion Maps</h3>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/05/uuorld_stlouis.png"><img class="alignleft size-thumbnail wp-image-80" title="uuorld_stlouis" src="http://dataspora.com/blog/wp-content/uploads/2009/05/uuorld_stlouis-150x150.png" alt="" width="150" height="150" /></a>Zach Wilson (UUorld) showcased his <a href="http://www.uuorld.com">company&#8217;s</a> software that simplifies creating and exploring extrusion maps.  Among the several interesting applications of his software, Zach showed off a temporal visualization <a href="http://vimeo.com/4480815"> of the spread of swine flu in the United States</a> over the past several weeks.</p>
<p>In response to the critique that layering data dimensions on two-dimensional maps could be done more effectively by using other indicators such as color &#8212; instead of the simulation of a third dimension of height &#8212; Zach indicated that research has shown that physical dimensions (or their simulation) possess greater visual saliency to the human eye.</p>
<p>Zach also mentioned UUorld&#8217;s <a href="http://www.uuorld.com/portal">data portal</a>, which contains thousands of downloadable statistics from a variety of public sources, some of which have been used to generate UUorld visualizations.</p>
]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/dataviz-sf-salon-no/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/dataviz-sf-salon-no/</feedburner:origLink></item>
		<item>
		<title>Color:  The Cinderella of dataviz</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/EIEl54Ti7Bg/</link>
		<comments>http://dataspora.com/blog/how-to-color-multivariate-data/#comments</comments>
		<pubDate>Sat, 14 Mar 2009 00:14:42 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[R]]></category>

		<category><![CDATA[analytics]]></category>

		<category><![CDATA[color theory]]></category>

		<category><![CDATA[computing]]></category>

		<category><![CDATA[data]]></category>

		<category><![CDATA[dataviz]]></category>

		<category><![CDATA[sabermetrics]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=58</guid>
		<description><![CDATA[&#8220;Avoiding catastrophe becomes the first principle in bringing color to information: Above all, do no harm.&#8221;  &#8212; Envisioning Information, Edward Tufte, Graphics Press, 1990   
Color is one of the most abused and neglected tools in data visualization.  It is abused when we make poor color choices; it is neglected when we rely on poor software [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p>&#8220;Avoiding catastrophe becomes the first principle in bringing color to information: Above all, do no harm.&#8221;  &#8212; <em>Envisioning Information</em>, Edward Tufte, Graphics Press, 1990   </p></blockquote>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/03/stripcolor2d_4001.png"><img class="alignnone size-full wp-image-73" title="stripcolor2d_4001" src="http://dataspora.com/blog/wp-content/uploads/2009/03/stripcolor2d_4001.png" alt="multivariate color strip plot " width="400" height="185" /></a>Color is one of the most abused and neglected tools in data visualization.  It is abused when we make poor color choices; it is neglected when we rely on poor software defaults.  Yet despite its historically poor treatment at the hands of engineers and end-users alike, if used wisely, color is unrivaled as a visualization tool.</p>
<p>Most of us think twice before walking outside in fluorescent red underoos.  If only we were as cautious in choosing colors for infographics.  The difference is that few of us design our own clothes.  But until good palettes (like <a href="http://www.colorbrewer.org">ColorBrewer</a>) are commonplace, to get colors that fit our purposes, we must be our own tailors.</p>
<p>While obsessing about how to implement color on the <a href="http://labs.dataspora.com/gameday">Dataspora Labs&#8217; PitchFX viewer</a>, I began with a basic motivating question:<span id="more-58"></span></p>
<h3>Why use color in data graphics?</h3>
<p>If our data are simple, a single color is sufficient, even preferable.  For example, below is a scatter plot of 287 pitches thrown by the major league pitcher Oscar Villarreal in 2008.  With just two dimensions of data to describe &#8212; the x and y location in the strike zone &#8212; black and white is sufficient.  In fact, this scatter plot is a perfectly lossless representation of the data set (assuming no data points perfectly overlap).</p>
<p><strong>Fig 1. Location of Pitches </strong><strong>(Villarreal, HOU, 2008)</strong></p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/03/bwxy_250.png"><img class="alignnone size-full wp-image-59" title="bwxy_250" src="http://dataspora.com/blog/wp-content/uploads/2009/03/bwxy_250.png" alt="Simple black and white scatter plot" width="250" height="250" /></a></p>
<p>But what if we&#8217;d like to know more: for instance, what kinds of pitches (curveballs, fastballs) landed where?  Or their speed?  Visualizations live in two dimensions, but the world they describe is rarely so confined.</p>
<p><strong>The defining challenge of data visualization is projecting high dimensional data onto a low dimensional canvas.</strong> (As a rule, one should never do the reverse: visualize more dimensions than already exist in the data).</p>
<p>Getting back to our pitching example, if we want to layer another dimension of data &#8212; pitch type &#8212; into our plot, we have several methods at our disposal:</p>
<ol>
<li><strong>plotting symbols</strong> - vary the glyphs that we use (circles, triangles, etc.)</li>
<li><strong>small multiples</strong> - vary extra dimensions in space, creating a series of smaller plots</li>
<li><strong>color</strong> - color our data, encoding extra dimensions inside a color space</li>
</ol>
<p>Which technique you employ depends on the nature of the data and the medium of your canvas.  I will describe all three by way of example.</p>
<h3>Multivariate Method I:  Vary Your Plotting Symbols</h3>
<p><strong>Fig 2. Location and Pitch Type (Villarreal, HOU, 2008)</strong></p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/03/glyphs_300.png"><img class="alignnone size-full wp-image-60" title="glyphs_300" src="http://dataspora.com/blog/wp-content/uploads/2009/03/glyphs_300.png" alt="Scatterplot with varied plotting symbols." width="300" height="300" /></a></p>
<p>In this plot, I&#8217;ve layered the categorical dimension of pitch type into our plot by using four different plotting symbols.</p>
<p>I consider this visualization an abject failure.  In fact, the prize for my most despised graphs in graduate school goes to <a href="http://www.rbej.com/content/figures/1477-7827-4-23-10-l.jpg">bacterial growth curves rendered this way</a>.  These graphs make our heads hurt for two reasons: (i) distinguishing glyphs demands extra attention (versus what academics call &#8216;<a href="http://www.csc.ncsu.edu/faculty/healey/PP/index.html">pre-attentively processed</a>&#8217; cues like color), and (ii) even after we visually decode the symbols, we face yet another step: mapping symbols to their semantic categories.  (Admittedly this can be improved with <a href="http://eagereyes.org/VisCrit/ChernoffFaces.html">Chernoff faces</a> or other iconic symbols, where the categorical mapping is self-evident.)</p>
<h3>Multivariate Method II:  Small Multiples on a Canvas</h3>
<p>Folding additional dimensions into a partitioned canvas has a distinguished pedigree in information graphics.  It has been employed everywhere from <a href="http://hsci.ou.edu/images/jpg-100dpi-5in/17thCentury/Galileo/1613/Galileo-1613-Pt3-27.jpg">Galileo&#8217;s sunspot illustrations</a> to William Cleveland&#8217;s trellis plots.  And as Scott McCloud&#8217;s unexpected <a href="http://www.amazon.com/Understanding-Comics-Invisible-Scott-Mccloud/dp/006097625X">tour de force on comics</a> makes clear, panels of pictures possess a narrative power that a single, undivided canvas lacks.</p>
<p>In this plot below, the four types of pitches that Oscar throws are splintered horizontally.   By reducing our plot sizes, we&#8217;ve given up some resolution in positional information. But in return, patterns that were invisible in our first plot, and obscured in our second (by varied symbols) are now made clear (Oscar throws his fastballs low, but his sliders high).</p>
<p><strong>Fig 3:  Location and Pitch Type (Villarreal, HOU, 2008)</strong></p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/03/strip_4002.png"><img class="alignnone size-full wp-image-70" title="strip_4002" src="http://dataspora.com/blog/wp-content/uploads/2009/03/strip_4002.png" alt="black and white strip plot" width="400" height="185" /></a></p>
<p>Multiplying plots in space works especially well on printed media, which can hold more than ten times as many dots per square inch as a screen.  Both columns and rows can be used to lattice over additional dimensions, the result being a <a href="http://dsarkar.fhcrc.org/lattice/book/images/Figure_06_07_stdBW.png">matrix of scatter plots</a> (in R, see the &#8216;<a href="http://finzi.psych.upenn.edu/R/library/lattice/html/splom.html">splom</a>&#8217; function).</p>
<h3>Multivariate Method III: Color Your Data</h3>
<p><strong>So why bother with color?</strong></p>
<p>First, as compared to most print media, computer displays have fewer units of space, but a broader color gamut.  So color is a compensatory strength.</p>
<p>For multi-dimensional data, color can convey additional dimensions inside a unit of space &#8212; and can do so instantly.  Color differences can be detected within 200 ms, before you&#8217;re even conscious of paying attention (the &#8216;pre-attentive&#8217; concept I mentioned earlier).</p>
<p>But the most important reason to use color in multivariate graphics is that <strong>color is itself multidimensional</strong>.  Our perceptual color space &#8212; <a href="http://en.wikipedia.org/wiki/Opponent_process"> however </a><a href="http://en.wikipedia.org/wiki/RGB_color_model"> you </a><a href="http://en.wikipedia.org/wiki/HSL_and_HSV"> slice </a><a href="http://en.wikipedia.org/wiki/Lab_color_space"> it </a> &#8212; is three-dimensional.</p>
<p>In the example below, I&#8217;ve used color as a means of encoding a fourth dimension of our pitching data: the speed of pitches thrown. The palette I&#8217;ve chosen is a divergent palette that moves along one dimension (think of it as the &#8216;redness-blueness&#8217; dimension) in the <a href="http://en.wikipedia.org/wiki/CIELUV_color_space">CIELUV</a> color space, while maintaining a constant level of luminosity.</p>
<p><strong>Fig 4. Location, Pitch Type, and Velocity (Villarreal, HOU, 2008)</strong></p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/03/keycolor1d_3001.png"><img class="alignnone size-full wp-image-69" title="keycolor1d_3001" src="http://dataspora.com/blog/wp-content/uploads/2009/03/keycolor1d_3001.png" alt="isoluminant, diverging color ramp" width="300" height="150" /></a></p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/03/stripcolor1d_4002.png"><img class="alignnone size-full wp-image-71" title="stripcolor1d_4002" src="http://dataspora.com/blog/wp-content/uploads/2009/03/stripcolor1d_4002.png" alt="color strip plot" width="396" height="187" /></a></p>
<p>Holding luminosity constant is important, because luminosity (similar to brightness) determines a color&#8217;s visual impact. Bright colors pop, and dark colors recede.  A color ramp that varies luminosity along with hue will highlight data points as an artifact of color choice.</p>
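<p>A minimal sketch of how such a ramp might be built, written in Python rather than the R used for the figures.  Everything specific here is an assumption for illustration, not the palette behind the plots: the D65 white point, the lightness of 60, the maximum chroma of 60, and the hue angles of 260 (blue) and 12 (red) degrees.  The ramp walks from the blue end through neutral gray to the red end while holding the CIELUV lightness fixed.</p>
<pre>
```python
import math

# D65 reference white (an assumption; it is the usual choice for sRGB displays).
XN, YN, ZN = 0.95047, 1.0, 1.08883
UN = 4 * XN / (XN + 15 * YN + 3 * ZN)
VN = 9 * YN / (XN + 15 * YN + 3 * ZN)


def luv_to_hex(L, u, v):
    """Convert a CIELUV color to an sRGB hex string, clipping to the gamut."""
    if L == 0:
        return "#000000"
    # CIELUV to CIE XYZ.
    Y = YN * ((L + 16) / 116) ** 3 if L > 8 else YN * L * (3 / 29) ** 3
    up = u / (13 * L) + UN
    vp = v / (13 * L) + VN
    X = Y * 9 * up / (4 * vp)
    Z = Y * (12 - 3 * up - 20 * vp) / (4 * vp)
    # XYZ to linear sRGB, then gamma encoding.
    linear = (
        3.2406 * X - 1.5372 * Y - 0.4986 * Z,
        -0.9689 * X + 1.8758 * Y + 0.0415 * Z,
        0.0557 * X - 0.2040 * Y + 1.0570 * Z,
    )
    channels = []
    for c in linear:
        c = min(max(c, 0.0), 1.0)
        c = 1.055 * c ** (1 / 2.4) - 0.055 if c > 0.0031308 else 12.92 * c
        channels.append(round(255 * c))
    return "#{:02x}{:02x}{:02x}".format(*channels)


def diverging_palette(n=7, lightness=60, max_chroma=60, blue_hue=260, red_hue=12):
    """Blue-to-gray-to-red ramp with constant CIELUV lightness (hypothetical defaults)."""
    colors = []
    for i in range(n):
        t = 2 * i / (n - 1) - 1          # runs from -1 (blue end) to +1 (red end)
        hue = math.radians(red_hue if t >= 0 else blue_hue)
        chroma = max_chroma * abs(t)     # chroma shrinks to zero at the gray midpoint
        colors.append(luv_to_hex(lightness, chroma * math.cos(hue), chroma * math.sin(hue)))
    return colors


palette = diverging_palette()
print(palette)
```
</pre>
<p>Because only hue and chroma change along the ramp, no step should pop or recede relative to its neighbors.  Colors that fall outside the sRGB gamut are simply clipped here, which would break the constant-lightness property, so the chroma is kept modest.</p>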
<p>I chose only seven gradations of color, so I&#8217;m downsampling (in a lossy way) our speed data - but further segmentation of our color ramp is not likely to be perceptible.</p>
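<p>The downsampling step amounts to cutting the speed range into equal-width bins.  The sketch below is in Python rather than R, and the speeds are made-up values for illustration, not Villarreal&#8217;s pitches.</p>
<pre>
```python
def bin_index(value, low, high, n_bins=7):
    """Map a value in [low, high] to a bin index from 0 to n_bins - 1."""
    if high == low:
        return 0
    frac = (value - low) / (high - low)
    # The top of the range would land in bin n_bins, so clamp it into the last bin.
    return min(max(int(frac * n_bins), 0), n_bins - 1)


speeds = [78.5, 84.0, 91.2, 88.7, 95.3]   # hypothetical pitch speeds in mph
low, high = min(speeds), max(speeds)
bins = [bin_index(s, low, high) for s in speeds]
print(bins)
```
</pre>
<p>Each index then selects one of the seven colors on the ramp; the speed differences inside a bin are the information given up.</p>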
<p>I&#8217;ve also chosen to use filled circles as my plotting symbol, as opposed to the open circles in all my previous plots.  This is done to improve the perception of each pitch&#8217;s speed via its color: small patches of color are less perceptible.  But a consequence of this choice &#8212; compounded by our choice to work with a series of smaller plots &#8212; is that more points overlap.  We&#8217;ve further degraded some of our positional information.  However, in our last step, we attempt to recover some of this.</p>
<p>Now I&#8217;ve finally brought color to bear on this visualization, but I&#8217;ve only encoded a single dimension &#8212; speed.  Which leads to another question:</p>
<h3>If color is three-dimensional, can I encode three dimensions with it?</h3>
<p>In theory, yes.  <a href="http://dataspora.com/blog/wp-content/uploads/2009/03/ware_infoviz_p142.jpg">Colin Ware researched this exact question</a>.  In practice, it&#8217;s difficult.  It turns out that asking observers to assess the amount of &#8216;redness&#8217;, &#8216;blueness&#8217;, and &#8216;greenness&#8217; of points is possible, but not intuitive (I suspect it&#8217;s somewhat like parsing symbols).</p>
<p>Another complicating factor is that a nontrivial fraction of the population has some form of color blindness.  This effectively reduces their color perception to two dimensions.</p>
<p>And finally, the truth is that our sensation of color is not equal along all dimensions; it&#8217;s thought the closely related &#8216;red&#8217; and &#8216;green&#8217; receptors emerged via duplication of the single long wavelength receptor (useful for detecting ripe from unripe fruits, according to one just-so story).</p>
<p>Because of the high level of dichromacy in the population, and because of the challenge of encoding three dimensions in color, I feel color is best used to encode no more than two dimensions of data.</p>
<p>So, for my last example of our pitching plot data, I will introduce luminosity as a means of encoding the local density of points (using a kernel density estimator).  This allows us to recover some of the data lost by increasing the sizes of our plotting symbols.</p>
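<p>As a rough illustration of the idea, here is a sketch in Python (not the R code used for the figures) of a Gaussian kernel density estimate evaluated at each point and rescaled to a lightness value.  The bandwidth, the lightness range of 85 down to 35, and the sample points are all assumptions chosen for the example.</p>
<pre>
```python
import math


def local_density(points, bandwidth):
    """Gaussian kernel density estimate evaluated at each point of a 2-D sample."""
    n = len(points)
    norm = 1.0 / (n * 2 * math.pi * bandwidth ** 2)
    densities = []
    for (x0, y0) in points:
        total = 0.0
        for (x, y) in points:
            d2 = (x - x0) ** 2 + (y - y0) ** 2
            total += math.exp(-d2 / (2 * bandwidth ** 2))
        densities.append(norm * total)
    return densities


def density_to_lightness(densities, lightest=85, darkest=35):
    """Rescale densities so the sparsest point is lightest and the densest is darkest."""
    lo, hi = min(densities), max(densities)
    span = hi - lo
    return [
        lightest - (lightest - darkest) * ((d - lo) / span if span else 0.0)
        for d in densities
    ]


# Hypothetical pitch locations: a tight cluster of four and one stray pitch.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (2.0, 2.0)]
lightness = density_to_lightness(local_density(points, bandwidth=0.3))
print(lightness)
```
</pre>
<p>The double loop costs time proportional to the square of the number of points, which is fine for a few hundred pitches; a gridded estimate would be the usual choice for larger data.</p>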
<p><strong>Fig 5. Location, Pitch Type, Velocity, and Density (Villarreal, HOU, 2008)</strong></p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/03/keycolor2d_3001.png"><img class="alignnone size-full wp-image-72" title="keycolor2d_3001" src="http://dataspora.com/blog/wp-content/uploads/2009/03/keycolor2d_3001.png" alt="two-dimensional color palette" width="291" height="278" /></a></p>
<p><span style="text-decoration: underline;"><a href="http://dataspora.com/blog/wp-content/uploads/2009/03/stripcolor2d_4001.png"><img class="alignnone size-full wp-image-73" title="stripcolor2d_4001" src="http://dataspora.com/blog/wp-content/uploads/2009/03/stripcolor2d_4001.png" alt="multivariate color strip plot " width="400" height="185" /></a><br />
</span></p>
<p>Here we have effectively employed a two-dimensional color palette, with blueness-redness varying along one axis for speed, and luminosity varying in the other to denote local density.</p>
<p>One final point about using luminosity.  Observing colors in a data visualization involves overloading, in the programming sense.  We rely on cognitive functions that were developed for one purpose (perceiving lions) and use them for another (perceiving lines).</p>
<p>Since we can overload color any way we want, whenever possible,  we should choose mappings that are natural.  Mapping pitch density to luminosity feels right because the darker shadows in our pitch plots imply depth.  Likewise, when sampling from the color space, we might as well choose colors found in nature.  These are the palettes our eyes were gazing at for the millions of years before #FF0000 showed up.</p>
<p>Color, used thoughtfully and responsibly, can be an incredibly valuable tool in visualizing high dimensional data.</p>
<h3>FutureMan Asks:  What about Animation?</h3>
<p>This discussion has focused on using static graphics in general, and color in particular, as a means of visualizing multivariate data.  I&#8217;ve purposely neglected one very powerful tool: motion.  The ability to animate graphics multiplies by several orders of magnitude the amount of information that can be packed into a visualization.  But packing information into a time-varying data structure has to be done by someone (you or me), and from my view this remains a significant challenge.  Canonical forms of animated visualizations (equivalent to the histograms, box plots, and scatterplots of the static world) are still a ways off, but frameworks like <a href="http://processing.org">Processing</a> and <a href="http://prefuse.org/">Prefuse</a> are a promising start towards their development.</p>
<h3>Methods</h3>
<p>The final product of these five-dimensional pitch plots &#8212; for all available data for the 2008 season &#8212; can be explored via the <a href="http://labs.dataspora.com/gameday">PitchFX</a> Django-driven web tool at Dataspora Labs.</p>
<p>All of the visualizations here were developed using R and the Lattice graphics package.  (Of note, Hadley Wickham is developing <a href="http://had.co.nz/ggplot2/">ggplot2</a>, a bold re-write of the R graphics system based on a grammar of graphics).</p>
<h3>References for Further Reading</h3>
<ul>
<li>Ross Ihaka - <a href="http://www.stat.auckland.ac.nz/~ihaka/120/lectures.html">Lectures on Information Visualization</a>, Lectures 12-14</li>
<li>Colin Ware - <a href="http://www.amazon.com/Information-Visualization-Second-Interactive-Technologies/dp/1558608192">Information Visualization</a>, Ch. 4</li>
<li>Edward Tufte - <a href="http://www.amazon.com/Envisioning-Information-Edward-R-Tufte/dp/0961392118">Envisioning Information</a>, Ch. 4</li>
<li>Deepayan Sarkar - <a href="http://lmdvr.r-forge.r-project.org">Lattice: Multivariate Data Visualization with R</a> (web site with code)</li>
<li>Maureen Stone - <a href="http://www.stonesc.com/">StoneSoup Consulting</a> (color consultant to Tableau Software)</li>
<li>Stephen Few - <a href="http://www.amazon.com/Information-Dashboard-Design-Effective-Communication/dp/0596100167">Information Dashboard Design</a>, Ch. 4</li>
</ul>
<img src="http://feeds.feedburner.com/~r/data-evolution/~4/EIEl54Ti7Bg" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/how-to-color-multivariate-data/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/how-to-color-multivariate-data/</feedburner:origLink></item>
		<item>
		<title>People who love scatter plots &amp; connecting dots</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/_ps1Q8A3iHQ/</link>
		<comments>http://dataspora.com/blog/dataviz-sf/#comments</comments>
		<pubDate>Fri, 20 Feb 2009 06:02:34 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[R]]></category>

		<category><![CDATA[analytics]]></category>

		<category><![CDATA[dataviz]]></category>

		<category><![CDATA[sabermetrics]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=54</guid>
		<description><![CDATA[
We hosted the first Dataviz Salon SF on Tuesday night, with lightning talks by boredom cop  Shane Booth, dataviz wiz  Lee Byron , computational journalist Brad Stenger, data wrangler  Pete Skomoroch , and any/all data enthusiast  Brendan O&#8217;Connor .
I was going to blog all about it &#8212; but Tom Carden of [...]]]></description>
			<content:encoded><![CDATA[<p><img title="dataviz-sf" src="http://dataspora.com/blog/wp-content/uploads/2009/02/dataviz_salon_poster_smal.jpg" alt="" /><br />
We hosted the first Dataviz Salon SF on Tuesday night, with lightning talks by boredom cop <a href="http://criminalizeboring.tumblr.com/"> Shane Booth</a>, dataviz wiz <a href="http://www.leebyron.com"> Lee Byron </a>, computational journalist <a href="http://nbagraphs.tumblr.com">Brad Stenger</a>, data wrangler <a href="http://www.datawrangling.com"> Pete Skomoroch </a>, and any/all data enthusiast <a href="http://www.anyall.org/blog"> Brendan O&#8217;Connor </a>.</p>
<p>I was going to blog all about it &#8212; but <a href="http://www.tom-carden.co.uk/2009/02/18/dataviz-salon-sf-1/">Tom Carden of Stamen Design already has a great write-up</a>.</p>
<blockquote><p>&#8230; Dataspora invited a few people to a Dataviz Salon yesterday evening. Mike and I went along and huddled in a brick-built basement in SoMa to listen to <a href="http://www.tom-carden.co.uk/2009/02/18/dataviz-salon-sf-1/">the following</a>:</p></blockquote>
<img src="http://feeds.feedburner.com/~r/data-evolution/~4/_ps1Q8A3iHQ" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/dataviz-sf/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/dataviz-sf/</feedburner:origLink></item>
		<item>
		<title>How Google and Facebook are using R</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/GeD2DzlYIYs/</link>
		<comments>http://dataspora.com/blog/predictive-analytics-using-r/#comments</comments>
		<pubDate>Fri, 20 Feb 2009 03:11:03 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[R]]></category>

		<category><![CDATA[analytics]]></category>

		<category><![CDATA[data]]></category>

		<category><![CDATA[prediction]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=49</guid>
		<description><![CDATA[
(March 26th Update:  Video now available)   Last night, I moderated our Bay Area R Users Group kick-off event with a panel discussion entitled &#8220;The R and Science of Predictive Analytics&#8221;, co-located with the  Predictive Analytics World  conference here in SF.
The panel comprised four recognized R users from industry:

 Bo [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/02/decision-tree.png"><img class="alignleft size-thumbnail wp-image-53" title="decision-tree" src="http://dataspora.com/blog/wp-content/uploads/2009/02/decision-tree-150x150.png" alt="" width="150" height="150" /></a><br />
<strong><a href="http://www.lecturemaker.com/2009/02/r-kickoff-video/">(March 26th Update:  Video now available)</a></strong>  <br /> Last night, I moderated our <a href="http://www.meetup.com/R-Users">Bay Area R Users Group</a> kick-off event with a panel discussion entitled &#8220;The R and Science of Predictive Analytics&#8221;, co-located with the <a href="http://www.predictiveanalyticsworld.com"> Predictive Analytics World </a> conference here in SF.</p>
<p>The panel comprised four recognized R users from industry:</p>
<ul>
<li> Bo Cowgill, Google</li>
<li> Itamar Rosenn, Facebook</li>
<li> David Smith, Revolution Computing</li>
<li> Jim Porzak, The Generations Network (and Co-Chair of our R Users Group)</li>
</ul>
<p>The panelists were asked to explain how they use R for predictive analytics within their firms, to describe its strengths and weaknesses as a tool, and to provide a case study.  What follows is my summary with comments.</p>
<p><span id="more-49"></span></p>
<p><em> Panel Introduction </em></p>
<p>I began by describing R as a programming language with strengths in three areas: (i) data manipulation, (ii) statistics, and (iii) data visualization.</p>
<p>What sets it apart from other data analysis tools?  It was developed by statisticians, it&#8217;s free software, and it is extensible via user-developed packages &#8212; there are nearly 2000 of them as of today at the <a href="http://cran.r-project.org"> Comprehensive R Archive Network </a> or CRAN.</p>
<p>Many of these packages can be used for predictive analytics.  Jim highlighted Max Kuhn&#8217;s <a href="http://caret.r-forge.r-project.org"> caret package </a>, which provides a wrapper for accessing dozens of classification and regression models, from neural networks to naive Bayes.</p>
<p><em> Bo Cowgill, Google </em></p>
<p>R is the most popular statistical package at Google, according to Bo Cowgill, and indeed Google is a donor to the R Foundation.  He remarked that &#8220;The best thing about R is that it was developed by statisticians.  The worst thing about R is that&#8230; it was developed by statisticians.&#8221;  Nonetheless, he&#8217;s encouraged to see that, as the R developer community has expanded, R&#8217;s documentation and performance have both improved.</p>
<p>One theme that Bo first brought up, but which was echoed by others, was that while Google uses R for data exploration and model prototyping, it is not typically used in production: in Bo&#8217;s group, R is typically run in a desktop environment.</p>
<p>The typical workflow that Bo thus described for using R was: (i) pulling data with some external tool, (ii) loading it into R, (iii) performing analysis and modeling within R, (iv) implementing a resulting model in Python or C++ for a production environment.</p>
<p><em> Itamar Rosenn, Facebook </em></p>
<p>Itamar conveyed how Facebook&#8217;s Data Team used R in 2007 to answer two questions about new users: (i) which data points predict whether a user will stay? and  (ii) if they stay, which data points predict how active they&#8217;ll be after three months?</p>
<p>For the first question, Itamar&#8217;s team used recursive partitioning (via the <a href="http://cran.r-project.org/web/packages/rpart">rpart</a> package) to infer that just two data points are significantly predictive of whether a user remains on Facebook:  (i) having more than one session as a new user, and (ii) entering basic profile information.</p>
<p>For the second question, they fit the data to a logistic model using a least angle regression approach (via the <a href="http://cran.r-project.org/web/packages/lars"> lars </a> package), and found that activity at three months was predicted by variables related to three classes of behavior: (i)  how often a user was reached out to by others, (ii) frequency of third party application use, and (iii) what Itamar termed &#8220;receptiveness&#8221; &#8212; related to how forthcoming a user was on the site.</p>
<p><em> David Smith, Revolution Computing </em></p>
<p>David&#8217;s firm, Revolution Computing, not only uses R, but R is their core business.  David said that &#8220;we are to R what Red Hat is to Linux&#8221;.  His firm addresses some of the pain points of using R, such as (i) supporting older versions of the software and (ii)  providing parallel computing in R through their ParallelR suite.</p>
<p>David showcased how one of their life sciences clients used R to classify genomic data through use of the <a href="http://cran.r-project.org/web/packages/randomForest"> randomForest </a> package, and how the analysis of classification trees could be easily parallelized using their &#8216;foreach&#8217; package.</p>
<p>He also mentioned that several firms they have worked with do use R in production environments, whereby a particular script is exposed on a server, and a client calls it with some data to return a result (several ways exist to set up R in a client-server manner, such as <a href="http://cran.r-project.org/web/packages/Rserve">Rserve</a>, <a href="http://biostat.mc.vanderbilt.edu/rapache/">rapache</a>, and <a href="http://biocep-distrib.r-forge.r-project.org/">Biocep</a>).</p>
<p>David evangelizes and educates about R at the <a href="http://blog.revolution-computing.com"> Revolutions blog </a>.</p>
<p><em> Jim Porzak, The Generations Network </em></p>
<p>Jim (who also co-chairs the R Users Group) gave a brief overview of his <a href="http://www.predictiveanalyticsworld.com/agenda.php#sun">PAW talk</a> on using R for marketing analytics.  In particular, Jim has used the <a href="http://cran.r-project.org/web/packages/flexclust">flexclust</a> package to cluster customer survey data for Sun Microsystems, and applied the resulting profiles to identify high-value sales leads.</p>
<p>During the Q &amp; A session, the panelists were asked several questions.</p>
<p><em><strong>How do you work around R&#8217;s memory limitations?</strong> (R workspaces are stored in RAM, and thus their size is limited)</em></p>
<p>Three responses were given (including one from the audience):</p>
<p>(i) use R&#8217;s database connectivity (e.g. <a href="http://cran.r-project.org/web/packages/RMySQL">RMySQL</a>), and pull in only slices of your data, (ii) downsample your data (do you really need a billion data points to test your model?), or (iii) run your scripts on a RAM-obsessed colleague&#8217;s machine or fire up a <a href="http://blog.infochimps.org/2009/02/06/start-hacking-machetec2-released/">virtual server on Amazon&#8217;s compute cloud</a> &#8212; for up to 15 gigs.</p>
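<p>The panel did not say how to downsample.  One common approach, sketched here in Python rather than R, is reservoir sampling, which draws a uniform random sample of fixed size from a stream without ever holding the full data set in memory.  The function name and the stand-in data are hypothetical.</p>
<pre>
```python
import random


def reservoir_sample(rows, k, seed=None):
    """Draw k rows uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if k > i:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)    # inclusive on both ends
            if k > j:
                sample[j] = row
    return sample


# Stand-in for a large table read one row at a time.
sample = reservoir_sample(range(100000), 1000, seed=42)
print(len(sample))
```
</pre>
<p>The same loop works over a database cursor or a file handle, since it only needs to see each row once.</p>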
<p><em><strong>What&#8217;s the general ramp-up process for groups wanting to use R?</strong></em></p>
<p>Itamar and Bo both indicated that within their groups, almost everyone arrived having learned R in their university studies.  Jim Porzak led an R tutorial within his last firm using an internal slide deck.</p>
<p><em><strong>How easy is it for developers who are not statisticians to learn R?</strong></em></p>
<p>The consensus seemed to be that R is a difficult language to achieve competency in, vis-a-vis Python, Perl, or other high-level scripting languages.  Jim emphasized, however, that he is not a statistician, nor were any of our panelists.  (As a non-statistician R user myself, I will say this &#8212; a consequence of learning R is an improved grasp of statistics.  Knowing statistics is a necessary prerequisite for understanding R&#8217;s features, from its data types to its modeling syntax.)</p>
<p><em><strong>How well does R interface with other tools and languages?</strong></em></p>
<p>There are several packages on CRAN for importing and exporting data to and from Matlab (<a href="http://cran.r-project.org/web/packages/R.matlab/">R.matlab</a>), Splus, SAS, Excel and other tools.  In addition, there are interfaces for running R within Python (<a href="http://rpy.sourceforge.net/">RPy</a>) and Java (<a href="http://www.rforge.net/rJava/">rJava</a>).</p>
<p>The panelists mentioned that they typically run R within a GUI, either <a href="http://en.wikipedia.org/wiki/R_Commander">RCommander</a> or <a href="http://rattle.togaware.com">Rattle</a>.  (Aside: I run R exclusively in emacs using <a href="http://ess.r-project.org/">ESS</a> &#8212; incidentally, one of its authors was panelist David Smith.)</p>
<p><a href="http://www.lecturemaker.com/2009/02/r-kickoff-video/">A video of the event is now available</a> courtesy of <a href="http://www.lecturemaker.com"> Ron Fredericks</a> and LectureMaker.</p>
<img src="http://feeds.feedburner.com/~r/data-evolution/~4/GeD2DzlYIYs" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/predictive-analytics-using-r/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/predictive-analytics-using-r/</feedburner:origLink></item>
		<item>
		<title>Is Big Data at a tipping point?</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/AEiXJbTfbU8/</link>
		<comments>http://dataspora.com/blog/tipping-points-and-big-data/#comments</comments>
		<pubDate>Fri, 09 Jan 2009 07:01:03 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[analytics]]></category>

		<category><![CDATA[data]]></category>

		<category><![CDATA[bigdata]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=46</guid>
		<description><![CDATA[
(5/18/09 update - included an overdue reference to linked data!) 
Stuart Kauffman, in one of his books about complexity, discusses tipping points in networks &#8212; what he calls phase transitions &#8212; by way of buttons. Suppose you’re sitting on a floor strewn with 400 buttons, and you begin tying them together with pieces of string [...]]]></description>
			<content:encoded><![CDATA[<p class="MsoNormal"><a href="http://dataspora.com/blog/wp-content/uploads/2009/01/buttons_sketch.png"><img class="alignleft size-medium wp-image-45" style="float: left;" title="buttons_sketch" src="http://dataspora.com/blog/wp-content/uploads/2009/01/buttons_sketch.png" alt="" width="250" height="166" /></a></p>
<p class="MsoNormal"><em><span style="color: #808080;">(5/18/09 update - included an overdue reference to linked data!) </span></em></p>
<p class="MsoNormal">Stuart Kauffman, in <a href="http://books.google.com/books?id=FxvENHL0qzYC">one of his books about complexity</a>, discusses tipping points in networks &#8212; what he calls phase transitions &#8212; by way of buttons. Suppose you’re sitting on a floor strewn with 400 buttons, and you begin tying them together with pieces of string at random. At first, you have just pairs of buttons.  Then, you have clusters of threes, which in turn get tied into ever larger clumps. The question is: How long until picking any button off the floor pulls them all off together, in one connected mass?</p>
<p class="MsoNormal">It turns out that this supercluster of buttons doesn’t build gradually as we tie more threads, it emerges suddenly.  This rapid phase transition, from relatively unconnected to mostly connected, occurs right around where we have about half as many threads as buttons (see figure).  This is the tipping point of the system:  where a few threads make a big difference.</p>
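<p class="MsoNormal">The button experiment is easy to simulate.  The sketch below, in Python, ties random pairs of buttons together and reports what fraction of all buttons sits in the largest cluster; the union-find bookkeeping and the particular thread counts tried are choices made for this illustration, not something taken from Kauffman&#8217;s book.</p>
<pre>
```python
import random


def largest_cluster_fraction(n_buttons, n_threads, rng):
    """Tie random pairs of buttons together; return the largest cluster's share."""
    parent = list(range(n_buttons))
    size = [1] * n_buttons

    def find(i):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for _ in range(n_threads):
        a, b = rng.sample(range(n_buttons), 2)   # two distinct buttons
        ra, rb = find(a), find(b)
        if ra != rb:
            if size[rb] > size[ra]:
                ra, rb = rb, ra
            parent[rb] = ra                      # merge the smaller cluster into the larger
            size[ra] += size[rb]
    # Root entries hold true cluster sizes; stale entries are never larger.
    return max(size) / n_buttons


rng = random.Random(0)
for ratio in (0.25, 0.5, 1.0, 2.0):
    threads = int(ratio * 400)
    print(ratio, largest_cluster_fraction(400, threads, rng))
```
</pre>
<p class="MsoNormal">Sweeping the thread count from well below to well above half the number of buttons should show the largest cluster jumping from a small clump to most of the floor.</p>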
<p class="MsoNormal"><a href="http://dataspora.com/blog/wp-content/uploads/2009/01/phase_transition_kauffman.png"><img class="alignleft alignnone size-full wp-image-44" style="float: left;" title="phase_transition_kauffman" src="http://dataspora.com/blog/wp-content/uploads/2009/01/phase_transition_kauffman.png" alt="" width="300" height="174" /></a>A similar phase transition has already occurred with regard to data inside business ecosystems. For the past several decades, an increasing number of business processes, from sales to customer service to shipping, have come online, along with the data they throw off.  As these individual databases are linked, via common formats or labels, a tipping point is reached: suddenly, every part of the company organism is connected to the data center.  And every action &#8212; sales lead, mouse click, and shipping update &#8212; is stored.  The result:  organizations are overwhelmed by what feels like a tsunami of data.</p>
<p class="MsoNormal">The same trend is occurring in the larger universe of data that these organizations inhabit.  <a href="http://www.nature.com/nature/journal/v455/n7209/full/455001a.html">Big Data</a> is being unleashed by the <a href="http://gigaom.com/2008/11/09/mapreduce-leads-the-way-for-parallel-programming/">“Industrial Revolution of Data”</a>, whether from public agencies, non-profit institutes, or forward-thinking private firms.</p>
<p class="MsoNormal">At present, much of the world’s Big Data is iceberg-like: frozen and mostly underwater. It&#8217;s frozen because incompatible formats and missing metadata standards make it hard for data to flow from one place to another:  comparing the SEC&#8217;s financial data with Europe&#8217;s requires common formats and labels (ahem, <a href="http://blogmaverick.com/2008/12/16/the-sec-madoff-and-xbrl/">XBRL</a>) that don&#8217;t yet exist. Data is “underwater” when, whether for reasons of competitiveness, privacy, or sheer incompetence, it’s not shared: US medical records may contain a wealth of data, but much of it is on paper and offline (not so in Europe, enabling <a href="http://content.nejm.org/cgi/content/full/359/20/2105">studies with huge cohorts</a>).</p>
<p class="MsoNormal">Yet there&#8217;s a slow thaw underway, as evidenced by a number of initiatives:  <a href="http://theinfo.org">Aaron Swartz’s theinfo.org</a>, <a href="http://infochimps.org">Flip Kromer’s infochimps</a>, <a href="http://bulk.resource.org">Carl Malamud’s bulk.resource.org</a>, the <a href="http://www.linkedata.org">Tim Berners-Lee-inspired LinkedData.org</a>, as well as <a href="http://www.numbrary.com">Numbrary</a>, <a href="http://www.swivel.com">Swivel</a>, <a href="http://www.freebase.com">Freebase</a>, and Amazon’s <a href="http://aws.amazon.com/publicdatasets/">public data sets</a>.  These are all ambitious projects, but the challenge of weaving these data sets together is still greater.</p>
<p class="MsoNormal">How far are we from the tipping point of Big Data? When will the world’s icebergs of data melt into one sea? More importantly, when it happens, will we be ready to do something useful with it all?</p>
<img src="http://feeds.feedburner.com/~r/data-evolution/~4/AEiXJbTfbU8" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/tipping-points-and-big-data/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/tipping-points-and-big-data/</feedburner:origLink></item>
	</channel>
</rss>
