<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>Dataspora Blog</title>
	
	<link>http://dataspora.com/blog</link>
	<description>Big Data, open source analytics, and data visualization</description>
	<pubDate>Tue, 31 Aug 2010 08:37:59 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5</generator>
	<language>en</language>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/data-evolution" /><feedburner:info uri="data-evolution" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
		<title>The Seven Secrets of Successful Data Scientists</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/8QInTfxdkZI/</link>
		<comments>http://dataspora.com/blog/the-seven-secrets-of-successful-data-scientists/#comments</comments>
		<pubDate>Fri, 27 Aug 2010 13:12:04 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[analytics]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=114</guid>
		<description><![CDATA[At O&#8217;Reilly&#8217;s &#8220;Making Data Work&#8221; seminar earlier this summer, I teamed up with a few other folks (data diva Hilary Mason, R extraordinaire Joe Adler, and visualization guru Ben Fry) to talk about data.
What follows is a blog-ified and amended version of that talk, originally entitled &#8220;Secrets of Successful Data Scientists.&#8221;
 1. Choose The Right-Sized [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://dataspora.com/images/ds200.jpg"><img class="alignleft size-thumbnail wp-image-100" title="phoenix" src="http://dataspora.com/images/ds200.jpg" alt="" width="200"/></a>At O&#8217;Reilly&#8217;s &#8220;Making Data Work&#8221; seminar earlier this summer, I teamed up with a few other folks (data diva Hilary Mason, R extraordinaire Joe Adler, and visualization guru Ben Fry) to talk about data.</p>
<p>What follows is a blog-ified and amended version of that talk, originally entitled &#8220;Secrets of Successful Data Scientists.&#8221;</p>
<p><strong> 1. Choose The Right-Sized Tool </strong></p>
<p>Or, as I like to say, you don&#8217;t need a chainsaw to cut butter.</p>
<p>If you&#8217;ve got 600 lines of CSV data that you need to work with on a one-time basis, paste it into Excel or Emacs and just do it (yes, curse the Flying Spaghetti Monster, I&#8217;ve just endorsed that dull knife called Excel).</p>
<p>In fact, Excel&#8217;s and Emacs&#8217; program-by-example keyboard macros can be <a href="http://emacs-fu.blogspot.com/2010/07/keyboard-macros.html"> fantastic tools for quick and dirty data clean-up. </a><br />
<span id="more-114"></span></p>
<p>Alternatively, if you&#8217;ve got 600 million lines of data and you need something simple, piping together several Unix tools (cut, uniq, sort) with a dash of <a href="http://www.ibm.com/developerworks/linux/library/l-p102.html#1">Perl one-liner foo </a> may get you there.</p>
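<p>When a shell isn&#8217;t handy, the same <code>cut | sort | uniq -c</code> idea can be sketched in a few lines of Python. This is only an illustration: the function name <code>count_column</code> and the sample rows are invented here, and it assumes the counts for one column fit in memory even when the file does not.</p>

```python
import csv
import io
from collections import Counter


def count_column(lines, column, delimiter=","):
    """Tally the values in one column of delimited text, one row at a time.

    Roughly `cut -d, -f1 | sort | uniq -c`; only the tallies are kept in memory.
    `count_column` is a hypothetical helper, not part of any library.
    """
    counts = Counter()
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) > column:  # skip short or blank rows
            counts[row[column]] += 1
    return counts


# Made-up sample rows: state, weekday
sample = io.StringIO("ca,mon\nak,fri\nca,fri\n")
counts = count_column(sample, 0)
print(counts.most_common())
```

<p>Because it reads row by row, the same function can be pointed at an open file handle rather than an in-memory string.</p>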
<p>But don&#8217;t confuse this kind of data exploration, where the goal is to size up the data, with building proper data plumbing, where you want robustness and maintainability.  Perl and bash scripts are nice for the former, but can be a nightmare for building data pipelines.</p>
<p>When your data gets very large, so big it can&#8217;t fit reasonably on your laptop (in 2010, that&#8217;s north of a terabyte), then you&#8217;re in Hadoop, <a href="http://www.greenplum.com"> parallelized database </a>, or <a href="http://oracle.com"> overpriced Big Iron </a> territory.</p>
<p>So, when it comes to choosing tools: scale them up as you need, and focus on getting results first.</p>
<p><strong> 2. Compress Everything </strong></p>
<p>We live in an IO-bound world, where the dominant bottlenecks to data flow are disk read-speed and network bandwidth.</p>
<p>As I was writing this, I was downloading an uncompressed CSV file via a web API.  Uncompressed, it was 257MB; ZIP-compressed, 9MB.</p>
<p>Compression typically gives you a 6-8x bump out of the gate, and repetitive text like that CSV can shrink far more (about 28x here).  When moving or crunching data of a certain heft, compress everything, always: it will save you time and money.</p>
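<p>To get a feel for the ratio on your own data, here is a minimal sketch using Python&#8217;s standard <code>gzip</code> module. The helper name <code>gzip_ratio</code> and the synthetic rows are assumptions made for illustration; real files will compress more or less depending on how repetitive they are.</p>

```python
import gzip
import io


def gzip_ratio(text):
    """Return uncompressed size divided by gzip-compressed size.

    `gzip_ratio` is a hypothetical helper for this illustration.
    """
    raw = text.encode("utf-8")
    packed = gzip.compress(raw)
    return len(raw) / len(packed)


# Synthetic, highly repetitive CSV rows (made up for the example).
rows = "".join(f"{i},california,2010-08-27\n" for i in range(10000))
ratio = gzip_ratio(rows)
print(f"compression ratio: {ratio:.1f}x")

# Compressed data can still be read line by line without unpacking it first.
packed = io.BytesIO(gzip.compress(rows.encode("utf-8")))
with gzip.open(packed, "rt", encoding="utf-8") as handle:
    first = handle.readline()
print(first.strip())
```

<p>Streaming straight from the compressed file, as in the last few lines, keeps the data small on disk and on the wire.</p>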
<p>That said, because compression can render data difficult to introspect, I don&#8217;t recommend compressing TBs of data into a single tarball, but rather splitting it up, as I discuss next.</p>
<p><strong> 3. Split Up Your Data </strong></p>
<p>&#8220;Monolithic&#8221; is a bad word in software development.</p>
<p>It&#8217;s also, in my experience, a bad word when it comes to data.</p>
<p>The real world is partitioned – whether as zip codes, states, hours, or top-level web domains – and your data should be too. Respect the grain of your data, because eventually you&#8217;ll need to use it to shard your database or distribute it across your file system.</p>
<p>Even more, it&#8217;s this splitting up of data that enables the parallel execution in Hadoop and commercial data platforms (such as Greenplum, Aster, and Netezza).</p>
<p>Splitting is part of a larger design pattern succinctly identified in a paper by Hadley Wickham as:     <strong> <a href="http://had.co.nz/plyr/plyr-intro-090510.pdf"> split, apply, combine </a></strong>.</p>
<p>This is, in my mind, a more lucid formulation of &#8220;map, reduce&#8221;, one that includes key selection (&#8220;split&#8221;) as a distinct step before any map/apply.</p>
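<p>A bare-bones sketch of the pattern in Python may make the three steps concrete. The function name <code>split_apply_combine</code> and the visit counts below are invented for illustration; this is not plyr&#8217;s API, just the same idea in miniature.</p>

```python
from collections import defaultdict


def split_apply_combine(records, key, apply):
    """Group records by key (split), run `apply` on each group, return a dict (combine).

    A hypothetical helper for illustration, not plyr's interface.
    """
    groups = defaultdict(list)
    for record in records:  # split: choose a key for every record
        groups[key(record)].append(record)
    # apply each function per group, then combine the results into one dict
    return {group_key: apply(rows) for group_key, rows in groups.items()}


# Made-up (state, visits) records.
visits = [("CA", 10), ("AK", 2), ("CA", 30), ("AK", 4)]
mean_by_state = split_apply_combine(
    visits,
    key=lambda record: record[0],
    apply=lambda rows: sum(count for _, count in rows) / len(rows),
)
print(mean_by_state)
```

<p>Since each group is processed independently, the apply step is the part that can be farmed out to many machines.</p>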
<p><strong> 4.  Sample Your Data </strong></p>
<p>Let&#8217;s say hypothetically you&#8217;ve got 200 GBs of data from your <a href="http://en.wikipedia.org/wiki/Portmanteau">portmanteau</a> of a start-up, FaceLink.  Someone wants to know if more people visit on Mondays or Fridays.  What do you do?</p>
<p>Before you wonder &#8220;if only I had 64 GB of RAM on my MacBook Pro&#8221;, or fire up a Hadoop streaming job, try this: look at a 10k sample of data.</p>
<p>It&#8217;s easy to visually inspect, or pull into R and plot.</p>
<p>Sampling allows you to quickly iterate your approach, and work around edge cases (say, pesky unescaped line terminators), before running a many-hour job on the full monty.</p>
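<p>One way to pull a fixed-size random sample from a file too big to load is reservoir sampling, which makes a single pass and keeps only the sample in memory. This technique is my substitution here, not necessarily how any given sample was drawn; the function name <code>reservoir_sample</code> and the stand-in data are assumptions for illustration.</p>

```python
import random


def reservoir_sample(lines, k, seed=None):
    """Draw k items uniformly at random from an iterable of unknown length.

    Single pass, O(k) memory (the classic "Algorithm R").
    `reservoir_sample` is a hypothetical helper for this illustration.
    """
    rng = random.Random(seed)
    sample = []
    for index, line in enumerate(lines):
        if k > index:
            sample.append(line)  # fill the reservoir with the first k items
        else:
            slot = rng.randint(0, index)  # inclusive on both ends
            if k > slot:
                sample[slot] = line  # replace with probability k / (index + 1)
    return sample


# Stand-in for iterating over the lines of a very large log file.
picked = reservoir_sample(range(100000), 10000, seed=1)
print(len(picked))
```

<p>In practice you would pass an open file handle instead of <code>range(...)</code>, then inspect or plot the result.</p>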
<p>That said, sampling can bite you if you&#8217;re not careful: when data is skewed, which it always is, it can be hard to estimate joint-distributions – comparing the means of California vs Alaska, for example, if your sample is dominated by Californians (an issue that statistics, that sexy skill, can address).</p>
<p><strong> 5. Smart Borrows, But Genius Uses Open Source </strong></p>
<p>Before you create something new out of whole cloth, pause and consider that someone else may have already seen it, solved it, and open-sourced it.</p>
<p>A Google Code Search may turn up a regular expression for that obscure data format.</p>
<p>The open source community allows you, if not to stand on the shoulders of giants, to at least rely on the gruntwork of fellow geeks.</p>
<p><strong> 6. Keep Your Head in the Cloud </strong></p>
<p>This past week, an engineer friend was just thinking about buying a dream desktop: a high RAM, multi-core box to run machine learning code over TBs of data.</p>
<p>I told him it was a terrible idea.</p>
<p>Why?  Because the data he wants to work on isn&#8217;t local, it&#8217;s on an Amazon EC2 cluster.  It&#8217;d take hours to download those TBs over a cable connection.</p>
<p>If you want to compute locally, pull down a sample.  But if your data is in the cloud, that&#8217;s where your tools and code should be.</p>
<p><strong> 7. Don&#8217;t Be Clever </strong></p>
<p>I once heard Brewster Kahle discuss managing the Internet Archive&#8217;s many-petabyte data platform: &#8220;every time one of our engineers comes to me with a new, ingenious and clever idea for managing our data, I have a response: &#8216;You&#8217;re fired.&#8217;&#8221;</p>
<p>Hyperbole aside, his point is well-taken: cleverness doesn&#8217;t scale.</p>
<p>When dealing with big data, embrace standards and use commonly available tools.  Most of all, keep it simple, because simplicity scales.</p>
<p>I know of a firm that, several years ago, decided to fork one part of Hadoop because they had a more clever approach.  Today, they are several versions behind the latest release, and devoting time &amp; energy to back-porting changes.</p>
<p>Cleverness rarely pays off.  Focus your precious programmer-hours on the problems that are unsolved, not simply unoptimized.</p>
]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/the-seven-secrets-of-successful-data-scientists/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/the-seven-secrets-of-successful-data-scientists/</feedburner:origLink></item>
		<item>
		<title>The Data Singularity, Part II:  Human-Sizing Big Data</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/TVg4UfOwk2E/</link>
		<comments>http://dataspora.com/blog/new-tools-for-big-data/#comments</comments>
		<pubDate>Thu, 27 May 2010 14:10:39 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[analytics]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=110</guid>
		<description><![CDATA[
&#8220;There are no more promising or important targets for basic scientific research than understanding how human minds&#8230; solve problems and make decisions effectively.&#8221; - Herbert Simon


In my  previous post , I discussed the forces behind what I&#8217;m calling The Data Singularity. My basic thesis is that as information generating processes become more frictionless &#8212; [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p>
&#8220;There are no more promising or important targets for basic scientific research than understanding how human minds&#8230; solve problems and make decisions effectively.&#8221; - <a href="http://dieoff.org/page163.htm">Herbert Simon</a>
</p></blockquote>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2010/05/cern_supercollider.jpg"><img class="alignleft size-thumbnail wp-image-109" title="cern_supercollider" src="http://dataspora.com/blog/wp-content/uploads/2010/05/cern_supercollider-150x150.jpg" alt="" width="150" height="150" /></a></p>
<p>In my <a href="http://dataspora.com/blog/the-data-singularity-is-here/"> previous post </a>, I discussed the forces behind what I&#8217;m calling The Data Singularity. My basic thesis is that as information generating processes become more frictionless &#8212; as humans have been excised from information read-write loops &#8212; the velocity and volume of data in the world is increasing, and at an exponential rate.</p>
<p>But where do we go from here? What are the consequences of living in an age where every datum is stored? Where are the bottlenecks, pain points, and opportunities? Which technologies are addressing these?</p>
<p>The upshot is this: a new class of tools is evolving for Big Data because traditional approaches can&#8217;t scale up.  But these tools share a common goal: scaling down data, and making it human-sized.  That&#8217;s the &#8220;reduce&#8221; part of MapReduce, the single statistic from analysis, or the hundred pixel line from one hundred million events.</p>
<p>What&#8217;s happening today isn&#8217;t entirely new, though. There were echoes of it decades ago, when surveillance satellites first began scanning the globe.</p>
<p><strong>VI. How Satellite Data Paralyzed the CIA </strong></p>
<p>In the early 1970s the CIA began relying more on global satellite reconnaissance imagery for its intelligence operations. But according to <a href="http://www.amazon.com/exec/obidos/ASIN/140004684X">one history</a>, this massive, rich data didn&#8217;t accelerate the pace of US intelligence: it slowed it down.</p>
<p>Why? Because confronted with this firehose, CIA leaders attempted to analyze every image, chase every half-formed hypothesis, simply because it was possible. The few good leads were washed out by the many mediocre. The CIA didn&#8217;t adjust their decision-making to this new scale, and they were drowned by it.</p>
<p>Many organizations are at a similar inflection point now, with access to massive, rich data about their customers or products. And, like the CIA in the 1970s, they find themselves paralyzed by the possibilities.</p>
<p><strong>VII. People Still Pull the Big Levers </strong></p>
<p>That Big Data paralyzes human decision-makers matters, because humans still make the big decisions. When someone praises a company as being &#8220;data-driven&#8221;, I&#8217;d like to imagine that this is literally true: that the company is nothing more than a few server racks blinking &amp; humming away, slinging bits and earning money.</p>
<p>But no such company exists. What &#8220;data-driven&#8221; really means is that the executives &amp; employees use data as inputs for making decisions. Companies may be data-fueled, but they&#8217;re people-driven.</p>
<p><strong>VIII. Human-sizing Big Data: Filter &amp; Crunch </strong><br />
<span id="more-110"></span><br />
All of the analytics in the world won&#8217;t matter if it remains inaccessible to the people driving an organization &#8212; the human decision-makers.</p>
<p>We have processes all around us acting as data amplifiers, recording events at a pace &amp; scale that we can&#8217;t comprehend. But this has created a disequilibrium: our capacity to create data is vastly outstripping our ability to consume it.  Analytics is the act of taking Big Data streams and human-sizing them for our small data brains.  </p>
<p>We can reduce data by either filtering it, which sifts through but does not alter data, or by crunching it, reducing many data points to a few.</p>
<p><strong> Google and Facebook are Filters </strong>.  Many consumer web technologies might be viewed as powerful filters.  Google is a relevance filter for 20 billion web pages.  Facebook is a social filter for baby photos.  FourSquare is a geo-social filter for hipster bars.  Amazon is a filter for retail products, combining search with a powerful recommendation engine.</p>
<p><strong> Wikipedia is a Natural Language Cruncher </strong>.  Crunching data is harder than filtering it.  Perhaps the toughest nut to crack involves processing natural language:  if you read a thousand web pages about the Gutenberg Bible, how would you describe it <a href="http://en.wikipedia.org/wiki/Gutenberg_Bible">in a few paragraphs</a>?  Wikipedia is a human-powered natural language cruncher, powered by <a href="http://www.aaronsw.com/2002/whowriteswikipedia/">its army of mechanical turks</a>, whose collective actions <a href="http://trendingtopics.org">even reveal news trends.</a></p>
<p><strong> Crunch the Past to Predict the Future </strong>.  Crunching of quantitative data is at the heart of many prediction tasks: the National Weather Service aggregates weather station measurements into forecasts, Fair Isaac calculates a score of credit-worthiness by examining your credit history, and a sports contest might be construed as an algorithm &#8212; operating on a sequence of individually played points &#8212; to predict the best team or athlete.</p>
<p>Number crunching has its more banal forms, as well, in the kind of sums and averages found in your phone or utility bill. These are necessary, but predictive algorithms &#8212; the kind involved in weather forecasting &#8212; will continue to grow in importance. For at a certain scale of data, exact reporting becomes an insurmountable task: we can only hope to have probabilistic answers.</p>
<p><strong>IX. Business Intelligence is Dead: New Tools for a New Era </strong></p>
<p>That our traditional tools don&#8217;t operate at scale was highlighted by Tim O&#8217;Reilly recently, when he declared <a href="http://www.slideshare.net/timoreilly/the-future-of-business-intelligence">&#8220;Business intelligence as we knew it is dead.&#8221;</a></p>
<p>A new class of tools is emerging along the Big Data stack, in three areas: (1) storage &amp; computation, (2) analytics, and (3) dashboards &amp; visualization.</p>
<p>These tools will disrupt and attack many of the traditional Business Intelligence firms, ranging from tool-makers like SAS and SPSS, to relational database vendors like Oracle, to custom hardware providers.</p>
<ul>
<li><strong>1. Storage &amp; Computation:  Mixed Platforms, not Monolithic Databases </strong>. At the lowest level of storage &amp; computation, Big Data is driving the success of cloud computing platforms like Amazon&#8217;s Elastic Compute Cloud &#8212; a massive, virtualized commodity-hardware grid &#8212; as an alternative to the Big Iron sold by hardware makers. Big Data has also catalyzed widespread adoption of the distributed, fault-tolerant Hadoop platform &#8212; an open-source implementation of Google&#8217;s MapReduce and Google File System designs that was developed largely at Yahoo, and is now commercially supported by Cloudera.
<p>A bit further up the stack, relational databases are suffering: newer commercial entrants in this space &#8212; such as Greenplum, Aster Data, Vertica, and Netezza &#8212; offer parallelized relational systems that operate at greater scale and lower cost than Oracle and Teradata. Many open-source, non-relational data stores &#8212; with a colorful constellation of names such as HBase, MongoDB, CouchDB, Cassandra, and Voldemort &#8212; have gained traction for high-traffic, content-driven web sites.</p>
<p><strong> SQL &amp; NoSQL are Complementary, Not Antagonistic</strong>.  While some may view storage technologies as antagonistic, either-or choices, the truth is that most Big Data-driven companies use a mixture of tools in complementary ways.  Hadoop is often used for batch-processing and transformation of log data that is fed to more structured data stores, such as a distributed RDBMS, in backend systems. Non-relational data stores are in turn ideal for front-facing, high-performance web applications, where queries return a bolus of data related to a single key &#8212; often a product, user, or page identifier.  All of these pieces working together form <a href="http://my.safaribooksonline.com/9780596801656/information_platforms_and_the_rise_of_th">an information platform</a>: an ecosystem of APIs working together. </li>
</ul>
<ul>
<li><strong>2. Analytics: There Are No Turnkey Solutions</strong>. Imagine if any piece of data you ever wanted was within a query&#8217;s reach:  what would you do with it?  We&#8217;re fast approaching this scenario, and making data meaningful is the bottleneck.  But unlike storing data &#8212; where use cases &#038; technologies are common and becoming commoditized &#8212; the ways that firms filter and crunch their data vary widely.
<p>This reflects the range of analytics needs that firms have: for example, a financial firm may need low-latency, continuous analysis of data streams, while an online retailer or pharmaceutical firm can tolerate 24-hour delays for analysis.</p>
<p><strong>Scaling Up Analytics is Hard</strong>.  R, my favorite analytics tool, is fantastic for modeling either aggregated data sets or samples of data that can fit in memory, but methods for deploying R in a large-scale data environment are still nascent. One promising approach is <a href="http://www.stat.purdue.edu/~sguha/rhipe/">Saptarshi Guha&#8217;s RHIPE </a>, which combines R with Hadoop (see the <a href="http://files.meetup.com/1225993/RHIPE%20-%20Saptarshi%20Guha.pdf">slides </a> from his March presentation at the <a href="http://www.meetup.com/R-Users">Bay Area R Users Group </a>).  Another MapReduce-based framework for large-scale data analysis is the <a href="http://mahout.apache.org/">Apache Mahout project</a>.</p>
<p><strong>Learn, Then Apply:  But Stay Close to the Data</strong>.  In general, there are two pieces in any analytics pipeline: (i) learning, or the training of a model with historical data, and (ii) prediction, or the application of a model to new data.  On the learning side, it&#8217;s been said that <a href="http://anand.typepad.com/datawocky/2008/03/more-data-usual.html">more data beats better algorithms </a>, and this is certainly true for many classification problems. In general, training a model is a computationally intensive task, and the development of methods that can train on massive data sets is <a href="http://research.google.com/pubs/pub36296.html">an area of active research. </a></p>
<p>On the application/prediction side of modeling, the challenges often revolve around deployment: how do we get the model to the data? (The reverse, pushing data to the model, is more expensive.) To address the desire to port models across different environments, <a href="http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language">PMML (Predictive Model Markup Language) </a>has been developed, which is supported by a range of database vendors.</p>
<p>The meme of &#8220;in-database analytics&#8221; is resonating because given data&#8217;s increasing heft, efficient analytics will follow the pattern of having the training &amp; execution of models stay close to where the data lives.</p>
<p>As it will be several years before either open-source or commercial analytics tools are mature here, the most successful Big Data modelers will be those data scientists who can build and glue together their own methods, tailored for individual environments and needs.</li>
</ul>
<ul>
<li><strong>3. Dashboards &amp; Visualization:  Why &#8220;I See&#8221; is a Synonym for &#8220;I Understand&#8221; </strong>. The most visible way in which Big Data is disrupting old tools is by changing the way we look at data.  The ultimate end-point for most data analysis is a human decision-maker, whose highest bandwidth channel is his or her eyeballs.  To take optimal advantage of the human visual system, dashboards and data visualization must be well-designed, and until recently, tools that achieved even a minimal standard of competence were rare.
<p><strong>Visual Literacy is on the Rise</strong>.  But a new set of visualization tools and packages, as well as growing popular interest in data visualization (catalyzed by the books of Edward Tufte, blogs like <a href="http://www.flowingdata.com">Nathan Yau&#8217;s FlowingData </a>and <a href="http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html">talks at T.E.D. conferences </a>), are changing this. As I&#8217;ve written about before, there are two distinct kinds of data visualization pathways: (i) exploratory, a highly interactive path whereby a data scientist may permute through dozens or even hundreds of views of a data set to understand its shape or fit to a hypothesized model, and (ii) narrative, a more constrained path whereby only one or several views of the data are presented.</p>
<p><strong>Exploring Data Requires Fast, Frequent Feedback </strong>.  For the exploratory path, desktop tools are ideal. The open-source language R has <a href="http://www.slideshare.net/dataspora/a-survey-of-r-graphics">several outstanding visualization packages</a>, including <a href="http://had.co.nz/ggplot2/">ggplot2 </a>and <a href="http://lmdvr.r-forge.r-project.org/">lattice </a>(based on William Cleveland&#8217;s trellis).  Two solid commercial products for exploratory visualization are <a href="http://spotfire.tibco.com/">SpotFire </a>and <a href="http://www.tableausoftware.com/">Tableau </a>(the latter of which has <a href="http://www.perceptualedge.com/blog/?p=191">been praised </a>by the hard-to-please Stephen Few).</p>
<p><strong>Sharing Visualizations:  Web Dashboards Are Ideal</strong>.  Ultimately, however, visualizations need to be shared beyond a single user, to an audience. Web-driven dashboards are an ideal form for sharing narrative visualizations, by allowing navigation along defined axes of the data. The challenge is moving visualizations from the desktop to the web. Tableau has this capacity, but with R the process is less straightforward. One promising route is via <a href="http://rapache.net/">Jeff Horner&#8217;s RApache tool </a>, which embeds R inside an Apache server (which I&#8217;ve used for my <a href="http://labs.dataspora.com/gameday/">MLB Pitch F/X tool</a>, and which Jeroen Ooms uses to power his <a href="http://www.stat.ucla.edu/~jeroen/ggplot2.html">ggplot2 web app </a>).</p>
<p>The major limitation of R-driven web graphics is that achieving some interactivity within the graphic itself is difficult, as R&#8217;s graphics model is focused on static graphics.  There are, however, several routes for achieving highly interactive, web-based data visualizations, whether by using Javascript, HTML5&#8217;s Canvas, or Flash. Two in particular are:  (i) Ben Fry&#8217;s <strong><a href="http://processing.org">Processing </a></strong>, an expressive language for vector animation, which recently added <a href="http://www.processing.js">Javascript </a>as one of its implementations, and (ii) the <strong><a href="http://vis.stanford.edu/protovis/">Protovis </a></strong> framework out of Stanford: a Javascript graphing toolkit whose conceptual integrity and expressive flexibility were inspired (like ggplot2) by Wilkinson&#8217;s grammar of graphics.</li>
</ul>
<p><strong>X.  Collaborating with Big Data: Analytics is a Social Process </strong></p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2010/05/greenplum_chorus1.png"><img class="alignleft size-thumbnail wp-image-112" title="greenplum_chorus1" src="http://dataspora.com/blog/wp-content/uploads/2010/05/greenplum_chorus1-150x150.png" alt="" width="150" height="150" /></a>In the same talk in which Tim O&#8217;Reilly proclaimed the death of BI &#8220;as we knew it&#8221;, he also highlighted a new initiative by Greenplum called <a href="http://www.greenplum.com/products/chorus/">Chorus </a>(Greenplum is a Dataspora client, but I confess I&#8217;ve only seen a limited preview).</p>
<p>The animating spirit of Chorus is that analytics is not only about data, models, and visualizations &#8212; it&#8217;s also about the people who work on these various pieces.  One of the reasons I love Box.net is the layer of social information that&#8217;s overlaid onto my files: appended notes, access statistics from collaborators, automatic notifications when a change is made.</p>
<p>Chorus is a vision to do this with Big Data; it allows, for instance, an analyst to link a data visualization to an underlying data source, include the R code that created the visualization, and append a note about a recent change to it.</p>
<p>As the Big Data stack matures, tools that help manage the workflow from data to analytics to visualizations, and ultimately to decisions, will be critical.  Someday, creating and sharing a data analysis through a web dashboard should be as easy as writing a blog post.  Until that day, there&#8217;s plenty of work to keep us data scientists well-employed.</p>
<p><em> If crunching terabytes of data is the kind of thing you&#8217;d like to do for breakfast, please send me a note at med @ dataspora.com.  I&#8217;m looking to hire technologists &amp; analytics experts for a new venture&#8230; more on that soon. </em></p>
]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/new-tools-for-big-data/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/new-tools-for-big-data/</feedburner:origLink></item>
		<item>
		<title>The Data Singularity is Here</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/sxoeIHK3Byc/</link>
		<comments>http://dataspora.com/blog/the-data-singularity-is-here/#comments</comments>
		<pubDate>Mon, 08 Mar 2010 08:36:22 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[analytics]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=104</guid>
		<description><![CDATA[In this blog post I&#8217;ll attempt to sketch the forces behind what I&#8217;m calling, somewhat sensationally, the Data Singularity, and then (in a following post) discuss what I see as its consequences.
In a nutshell, the Data Singularity is this: humans are being spliced out of the data-driven processes around us, and frequently we aren&#8217;t even [...]]]></description>
			<content:encoded><![CDATA[<p><a href='http://dataspora.com/blog/wp-content/uploads/2010/03/thematrix.jpg'><img src="http://dataspora.com/blog/wp-content/uploads/2010/03/thematrix.jpg" alt="" title="thematrix" width="150" height="113" class="alignleft size-full wp-image-108" /></a>In this blog post I&#8217;ll attempt to sketch the forces behind what I&#8217;m calling, somewhat sensationally, the Data Singularity, and then (in a <a href="http://dataspora.com/blog/new-tools-for-big-data/">following post</a>) discuss what I see as its consequences.</p>
<p>In a nutshell, the Data Singularity is this: humans are being spliced out of the data-driven processes around us, and frequently we aren&#8217;t even at the terminal node of action.  International cargo shipments, high-frequency stock trades, and genetic diagnoses are all made without us.</p>
<p>Absent humans, these data and decision loops have far less friction; they become constrained only by the costs of bandwidth, computation, and storage, all of which are dropping exponentially.</p>
<p>The result is an explosion of data thrown off from these machine-mediated pipelines, along with data about those flows (and data about that data, and so on).  The machines all around us &#8212; our smart phones, smart cars, and fee-happy bank accounts &#8212; are talking, and increasingly we&#8217;re being left out of the conversation.</p>
<p>So whether or not the Singularity is Near, the Data Singularity is here, and its consequences are being felt.</p>
<p>But before I discuss these consequences, I&#8217;d like to expand on the premise.  The world wasn&#8217;t always drowning in this data deluge, so how did we get here?</p>
<p><strong>I.  Data at the Speed of Speech</strong><br />
<span id="more-104"></span><br />
For most of human history, information traveled no faster than the sound of the human voice.  The origin of human language was the original singularity:  it marked the birth of a non-biological information channel,  distinct from our DNA.</p>
<p>But despite this achievement, the production of information &#8212; whether farmers&#8217; almanacs or merchants&#8217; ledgers &#8212; was still constrained by the costs of ink and parchment and the write-speed of the human hand.</p>
<p>All 70,000 volumes of the Library of Alexandria, the collected body of human knowledge in antiquity, could fit on two thumb drives today.</p>
<p>Thus the transmission and production of data, when it was done at all, was painstaking in form, small in scale, and occurred between people.</p>
<p><code>  People --> People </code></p>
<p><strong>II.  Data at the Speed of Light</strong></p>
<p>With the telegraph, for the first time, data flowed at the speed of light.</p>
<p>In the late 18th century, the first substantive telegraph line connected Paris to a suburb 210 kilometers to its north, using optical semaphores rather than electrical currents to communicate.  Yet while data hopped between stations at light speed, it had to be routed by human operators at each station.</p>
<p>Centuries earlier, the printing press dramatically reduced the production costs of information.  Still, human authors transmitted their hand-drafted manuscripts to typesetters, who set type with fonts optimally designed for human eyes.</p>
<p><strong>III. Programmable Looms and Reading Machines</strong></p>
<p>Punch cards represented the movement of data away from human-readable, anthropocentric substrates, onto a medium designed principally for consumption by machines.</p>
<p>Punch cards were developed in the early 18th century <a href="http://en.wikipedia.org/wiki/Basile_Bouchon"> to control industrial looms </a>, in France.</p>
<p>Now, machines were the final terminus of data transmission.  This act of communicating with our machines, <em>programming</em> them, was at the heart of Charles Babbage&#8217;s Analytical Engine, which came more than a century later.</p>
<p><code>  People --> Machines</code></p>
<p><strong>IV.  Phonographs and Recording Machines </strong></p>
<p>Developing on the other side of the communication spectrum were machines that excelled at writing and storing data.</p>
<p>The <a href="http://www-03.ibm.com/ibm/history/exhibits/storage/storage_350.html"> modern rotating disk drive </a> feels inspired less by punch cards than by Thomas Edison&#8217;s cylinder machines, better known as phonographs.</p>
<p>The human voice was a natural data format, and if early pioneers had a vision for the modern human-machine interface, I imagine it would have been to program machines by voice.  It&#8217;s a vision that still eludes us.</p>
<p>By the middle of the 20th century, a slew of semiconductor technologies emerged to close the loop of data generation: we had machines that produced digital data, and machines that continuously consumed it, without human intervention.</p>
<p><code>  Machines --> Machines</code></p>
<p>These technologies also sparked the beginning of a less-celebrated, but equally important exponential curve: the falling cost of data storage. </p>
<p><a href='http://dataspora.com/blog/wp-content/uploads/2010/03/cost_of_data_storage_360.png'><img src="http://dataspora.com/blog/wp-content/uploads/2010/03/cost_of_data_storage_360.png" alt="" title="cost_of_data_storage_360" width="360" height="360" class="alignnone size-full wp-image-106" /></a><br clear=all /></p>
<p> <strong>V.  Listening to the Pulse of the Planet</strong></p>
<p>The exponential drop in data storage costs has meant that logging historical data about a process, or billions of processes, is economically feasible.</p>
<p>I conjecture that the largest share of data on the planet sits in log files; these are the EKGs of the server farms that manage our cell phones, our e-mail accounts, and every other facet of our online existence &#8212; and which consume 3% of the <a href="http://arstechnica.com/old/content/2007/08/epa-power-usage-in-data-centers-could-double-by-2011.ars">US energy budget </a>.</p>
<p>Ubiquitous networking and cheap bandwidth have meant these pools of storage are no longer isolated on individual sensors, phones, or servers, but form the tributaries feeding an ocean of data in the Cloud.</p>
<p>And yet, funneling these massive volumes of data creates enormous technological pressures, against which companies struggle.  So why keep the data?</p>
<p>Because inside these log files, amidst the myriad conversations recorded between machines, lies the pulse of their customers.</p>
<p>Collectively, these logs reveal the pulse of the planet &#8212; flight delays, package shipments, job losses, and human sentiments.</p>
<p>And as I&#8217;ll discuss <a href="http://dataspora.com/blog/new-tools-for-big-data/">in my next post</a>, those who can extract a meaningful signal from this thunderous cacophony &#8212; the analysts, statisticians, and data scientists &#8212; are uniquely positioned to change the world.</p>
]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/the-data-singularity-is-here/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/the-data-singularity-is-here/</feedburner:origLink></item>
		<item>
		<title>SQL is Dead.  Long Live SQL!</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/S9OepTJtXGo/</link>
		<comments>http://dataspora.com/blog/sql-is-dead-long-live-sql/#comments</comments>
		<pubDate>Wed, 25 Nov 2009 10:58:14 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[analytics]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=97</guid>
		<description><![CDATA[&#8220;The adoption of a relational model of data, as described above, permits the development of a universal data sub-language.&#8221;– E.F. Codd, 1969
&#8220;Database research has produced a number of good results, but the relational database is not one of them.&#8221; – Henry Baker, 1991
 Outside of programming language flame wars, few questions raise the hackles of [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p>&#8220;The adoption of a relational model of data, as described above, permits the development of a universal data sub-language.&#8221;– <a href="http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf">E.F. Codd, 1969</a></p></blockquote>
<blockquote><p>&#8220;Database research has produced a number of good results, but the relational database is not one of them.&#8221; – <a href="http://home.pipeline.com/~hbaker1/letters/CACM-RelationalDatabases.html">Henry Baker, 1991</a></p></blockquote>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/11/relational_theory.png"><img class="alignleft size-thumbnail wp-image-102" title="relational_theory" src="http://dataspora.com/blog/wp-content/uploads/2009/11/relational_theory-150x150.png" alt="" width="150" height="150" /></a> Outside of programming language flame wars, few questions raise the hackles of hackers more than: &#8220;how should I store my data?&#8221;</p>
<p>I will argue here that, as in many such debates, the answer is:  it depends on what you&#8217;re doing.</p>
<p>While the rise of non-relational data stores serves a much-needed niche, the death of SQL and relational databases <a href="http://www.readwriteweb.com/enterprise/2009/02/is-the-relational-database-doomed.php">has been much exaggerated</a>.  E.F. Codd may be dead, but SQL is alive and well as a simple yet powerful data query language.</p>
<p><strong>3NF Crusaders vs NoSQL Rebels</strong></p>
<p>While the current critique of relational databases shares features of earlier debates (such as in the 1990s, when object-oriented databases were heralded as the next big thing), it has some new twists.  Thus to review the players and their positions:</p>
<p>On our right are the relational curmudgeons, the kind of folks who <a href="http://www.thethirdmanifesto.com/"> pen manifestos and crusade against NULL values</a>.  They have converted nearly all of big business to their ministry, and have billions of dollars in their coffers to show for it.  They insist that data should be stored in terms of its relations, to protect its integrity and facilitate its analysis.  Ideally that means third-normal form, but <a href="http://www.amazon.com/exec/obidos/ASIN/0471200247"> more liberal branches of the church </a> exist.</p>
<p><span id="more-97"></span>On our left are the folks from the misnomered NoSQL movement, <a href="http://blog.oskarsson.nu/2009/06/nosql-debrief.html">shaggy kids</a> from <a href="http://gigaom.com/2009/08/15/how-yahoo-facebook-amazon-and-google-think-about-big-data/"> the likes of Facebook and Twitter </a>.  They&#8217;ve rebelled against the shackles of relational tables (and bear the scars of MySQL scaling struggles).  They believe that data should be persisted as it&#8217;s programmed: in objects.  And they&#8217;ve spawned a constellation of colorfully named open-source projects – Cassandra, Voldemort, CouchDB, MongoDB, and <a href="http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html">Dynamo</a> – to consummate their cause.</p>
<p><strong>A Three-Pronged Attack on SQL:  Syntax, Schemas, and Performance</strong></p>
<p>At the heart of the NoSQL movement are three distinct critiques:</p>
<ol>
<li>A dislike for SQL&#8217;s syntax, which is ill-fitted to programming patterns.  It&#8217;s painful to write select statements to grab the data spread out across many tables, when all you want is a record.  Within web frameworks, the interface problem has been solved to a large degree by object-relational-mappers, such as Ruby&#8217;s ActiveRecord.</li>
<li>A rejection of the strong typing of relational schemas, which makes it painfully difficult to alter one&#8217;s data model.  It also makes <a href="http://codemonkeyism.com/essential-storage-tradeoff-simple-reads-simple-writes/">writing to the data store a complex process</a>.</li>
<li>A critique of performance, which in turn relates to how concurrency and partitioning of computation are handled.  Most relational databases maintain a shared state, which strives for perfect concurrency, but complicates distributed computation over many nodes.  NoSQL architectures are built on languages and tools, like Erlang and Hadoop, that favor distributed processes which (to use two favorite catch phrases) &#8220;share nothing&#8221; but are &#8220;<a href="http://www.allthingsdistributed.com/2008/12/eventually_consistent.html">eventually consistent</a>.&#8221;  The NoSQL philosophy also weighs heavily against joins.</li>
</ol>
<p>These critical threads are mirrored in the movement and their associated projects.  On the one hand you have developers who prefer the programmatic ease of interacting with NoSQL data stores, such as Cassandra and CouchDB.  They also don&#8217;t suffer the performance penalties of scale:  unlike with relational tables, the performance of look-ups does not degrade as the stored number of objects rises.</p>
<p>On the other, you have Big Data analysts (like myself), who love Hadoop because it allows easy distributed computation over massive, loosely typed data sets.</p>
<p><strong>Analytics:  MapReduce for Munging, SQL for Set Operations</strong></p>
<p>With regard to analytics, the Hadoop ecosystem makes it easy to dump several billion records of varying formats into a data store and process them – without having to conform them to a common data model.   Thus a NoSQL framework is great for massive data munging.</p>
<p>But if I had to access an already structured massive data set, I prefer SQL&#8217;s declarative syntax to MapReduce constructs.</p>
<p>I recently sat down at an SQL terminal with several hundred billion call records behind it.  With a simple SQL query, I determined how many distinct people the average American telephones more than once in a given month (answer: five).  In a few hundred seconds, I&#8217;d generated a report on the global state of the customer calling network.</p>
<p>Contrary to what the NoSQL camp may inveigh, it&#8217;s not that relational databases can&#8217;t scale – in fact, they can scale to petabytes, as <a href="http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/"> those who know Fortune 500 enterprise computing can attest </a>.  The problem is that relational databases require lots of ETL cruft to munge fluid blobs of data into strongly typed tables.</p>
<p>I can&#8217;t imagine the programmer pain and suffering that went into building one, unified, global database.  But once it&#8217;s there, I&#8217;d much prefer to access it with SQL statements than MapReduce code.</p>
<p>And I&#8217;m not alone in feeling this way:  Jeff Hammerbacher of Cloudera recently told me that, for an enterprise deployment, usage jumped 10x when an SQL interface – HIVE (which I mention below) – was placed on the cluster.</p>
<p><strong>NoSQL is a Misnomer: SQL is Innocent!</strong></p>
<p>Which brings me to my defense of SQL.  I agree with two of the three critiques above that embody the NoSQL philosophy, namely the need for schema-less storage and distributed architectures.  But when they go after SQL, and name the movement in opposition to it, they&#8217;ve named the wrong villain.  (Your honor,) <a href="http://cacm.acm.org/blogs/blog-cacm/50678-the-nosql-discussion-has-nothing-to-do-with-sql/fulltext">SQL is just an innocent query language!</a></p>
<p>As evidence of innocence, look no further than <a href="http://code.google.com/appengine/docs/python/datastore/gqlreference.html">Google&#8217;s GQL</a> and <a href="http://wiki.apache.org/hadoop/Hive"> Hadoop&#8217;s HIVE</a>, two SQL-style query languages for NoSQL data stores.</p>
<p>Why SQL in a NoSQL data store?   For one, it&#8217;s a language that both business analysts and developers already know; so the zero-th order adoption step is shorter.</p>
<p>But SQL lives on for a deeper reason: it is a simple yet powerful language for set operations.  SQL captures the essential patterns of data manipulation, such as:</p>
<ol>
<li>intersections (JOINs)</li>
<li> filters (WHEREs)</li>
<li>reductions or aggregations (GROUP BYs)</li>
</ol>
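<p>As a minimal sketch of those three patterns in one query, here is a toy example run through Python&#8217;s built-in sqlite3 module.  The stores and items tables, their columns, and the rows are hypothetical, invented only for illustration.</p>

```python
import sqlite3

# Hypothetical toy schema: stores and the items they sell.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stores (store_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE items  (store_id INTEGER, name TEXT, price REAL);
    INSERT INTO stores VALUES (1, 'Paris'), (2, 'Oakland');
    INSERT INTO items  VALUES
        (1, 'loom', 900.0), (1, 'ink', 4.0),
        (2, 'drive', 80.0), (2, 'card', 1.0), (2, 'phone', 300.0);
""")

query = """
    SELECT s.city, COUNT(*) AS n_items, SUM(i.price) AS total
    FROM items i
    JOIN stores s ON s.store_id = i.store_id   -- intersection (JOIN)
    WHERE i.price >= 4.0                        -- filter (WHERE)
    GROUP BY s.city                             -- aggregation (GROUP BY)
    ORDER BY s.city
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

<p>The query states what is wanted (per-city counts and totals over items priced at 4.0 or more) and leaves the how to the database, which is the declarative quality I am defending.</p>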
<p>I suspect that many developers who profess a disdain for SQL have been deceived by its simplicity.  One of my favorite packages in R is <a href="http://code.google.com/p/sqldf/">sqldf</a>, which allows SQL queries on R data frames.  SQL&#8217;s declarative expressions are frequently more readable and compact than their R programmatic equivalents.</p>
<p><strong>MapReduce is Possible in SQL</strong></p>
<p>Until very recently one of the more difficult operations to perform in SQL was a top-K query, for example, finding the five highest-priced items for every store in a retail database.  But so-called window functions, which make such queries easy to express, have become part of the SQL standard and are now natively supported in Postgres.</p>
<p>Window functions are powerful because they provide a &#8220;split-apply&#8221; functionality, otherwise known as a map function.  Combine these with SQL&#8217;s GROUP BY operation, which is a reduce function, and you have achieved – voila! – map-reduce in SQL.  And as with all map functions, window operations are massively parallelizable (something that has not gone unnoticed by <a href="http://www.greenplum.com">some commercial vendors.</a>)</p>
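<p>A top-K query with a window function might look like the sketch below.  It uses SQLite through Python&#8217;s sqlite3 module as a stand-in for Postgres, and assumes an SQLite build of version 3.25 or later (the first with window functions).  The items table and its rows are hypothetical.</p>

```python
import sqlite3

# Hypothetical retail data: three items in each of two stores.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (store TEXT, name TEXT, price REAL);
    INSERT INTO items VALUES
        ('A', 'a1', 10.0), ('A', 'a2', 30.0), ('A', 'a3', 20.0),
        ('B', 'b1', 5.0),  ('B', 'b2', 7.0),  ('B', 'b3', 6.0);
""")

TOP_K = 2
query = """
    SELECT store, name, price
    FROM (
        -- Split rows by store, then rank each store's rows by price.
        SELECT store, name, price,
               ROW_NUMBER() OVER (PARTITION BY store ORDER BY price DESC) AS rk
        FROM items
    )
    WHERE rk BETWEEN 1 AND ?   -- keep only the K highest-priced per store
    ORDER BY store, rk
"""
top_items = conn.execute(query, (TOP_K,)).fetchall()
for row in top_items:
    print(row)
```

<p>PARTITION BY does the splitting and ROW_NUMBER is applied within each partition; the outer filter then keeps the first K rows of every store.</p>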
<p><strong>Verdict:  Don&#8217;t Use a Chainsaw to Cut Butter (Use the Right Tool)</strong></p>
<p>Both NoSQL and SQL have their place in an analytics ecosystem.   In the <a href="http://dataspora.com/blog/sexy-data-geeks/">Big Data workflow</a> that I&#8217;ve advocated in the past, I view SQL as a pipe feeding data into more sophisticated modeling and visualization tools, such as R.  But it is an easy-to-use pipe, and it allows analysts to quickly pull out a subset of data &#8212; and start asking questions of that data.</p>
<p>The verdict in the great NoSQL debate is:  know your tools and know your goals.  In the Big Data space today, there can be an undue focus on formats or mechanics, but these are just a means to one end:  products.  Remember, Paul Graham and his team wrote Viaweb in Lisp, and it just worked.</p>
]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/sql-is-dead-long-live-sql/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/sql-is-dead-long-live-sql/</feedburner:origLink></item>
		<item>
		<title>How XML Threatens Big Data</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/sgrANTdWFkU/</link>
		<comments>http://dataspora.com/blog/xml-and-big-data/#comments</comments>
		<pubDate>Sun, 23 Aug 2009 06:25:02 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[analytics]]></category>

		<category><![CDATA[computing]]></category>

		<category><![CDATA[data]]></category>

		<category><![CDATA[bigdata]]></category>

		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=91</guid>
		<description><![CDATA[Confessions from a Massive, Nightmarish Data Project
Back in 2000, I went to France to build a genomics platform.  A biotech hired me to combine their in-house genome data with that of public repositories like Genbank.  The problem was the repositories, all with millions of records, each had their own format.  It sounded [...]]]></description>
			<content:encoded><![CDATA[<p><strong><a href="http://dataspora.com/blog/wp-content/uploads/2009/08/elephant.jpg"><img class="alignleft size-thumbnail wp-image-93" title="elephant" src="http://dataspora.com/blog/wp-content/uploads/2009/08/elephant-150x150.jpg" alt="Credit:  http://www.flickr.com/photos/digitalart/2101765353" width="150" height="150" /></a>Confessions from a Massive, Nightmarish Data Project</strong></p>
<p>Back in 2000, I went to France to build a genomics platform.  A biotech hired me to combine their in-house genome data with that of public repositories like Genbank.  The problem was the repositories, all with millions of records, each had their own format.  It sounded like a massive, nightmarish data interoperability project.  And an ideal fit for <a href="http://www.nytimes.com/2000/06/07/business/the-next-big-leap-it-s-called-xml.html"> a hot new technology </a>:  XML.</p>
<p>So I dove in, spending my days designing DTDs, writing parsers, tweaking tags (&#8220;taxon&#8221; or &#8220;species&#8221;?  attribute or element?).  At night I dreamt in ontologies.  <a href="http://labs.dataspora.com/pubseq/docs/overview/records2xml.gif">It was perfect.</a></p>
<p>Then reality struck.  The pipeline was slow:  Oracle loaded XML at a crawl.  And it was a memory hog, since XSLT required putting full document trees in RAM.</p>
<p>We had a deadline to meet (and, mon dieu, a 35 hour work-week).  So we changed course.  We hacked our Perl scripts to emit a flat tab-delimited format &#8212; &#8220;TabML&#8221; &#8212; which was bulk loaded into Oracle.  It wasn&#8217;t elegant, but it was fast and it worked.</p>
<p>Yet looking back, I realize that XML was the wrong format from the start.  And as I&#8217;ll argue here, our unhealthy obsession with XML formats threatens to slow or impede many open data projects, including  initiatives like <a href="http://www.data.gov">Data.gov</a>.</p>
<p>In the next sections, I discuss how XML fails for Big Data because of its unnatural form, bulk, and complexity.  Finally, I generalize to three rules that advocate a more liberal approach to data.</p>
<p><span id="more-91"></span></p>
<h3>Three Reasons Why XML Fails for Big Data</h3>
<p><strong>I. XML Spawns Data Bureaucracy </strong></p>
<p>In its natural habitat, data lives in relational databases or as data structures in programs.  The common import and export formats of these environments do not resemble XML, so much effort is dedicated to making XML fit.  When more time is spent on inter-converting data &#8212; serializing, parsing, translating &#8212; than in using it, you&#8217;ve created a data bureaucracy.</p>
<p>Indeed, it was what Doug Crockford called <a href="http://www.json.org/fatfree.html">&#8220;impedance mismatch inefficiencies&#8221;</a> that sparked him to create JSON, standardizing JavaScript&#8217;s object notation as a portable data container.</p>
<p><strong>II. Yes, Size Matters for Data</strong></p>
<p>Size matters for data in a way it does not for documents.  Documents are intended for human consumption and have human-sized upper bounds (a lifetime&#8217;s worth of reading fits on a thumb drive).  Data designed for machine consumption is bounded only by bandwidth and storage.</p>
<p>XML&#8217;s expansiveness &#8212; for even when compressed, the genie must be let out of the bottle at some point &#8212; imposes memory, storage, and CPU costs.</p>
<p><strong>III. Complexity Carries a Cost</strong></p>
<p>I never fail to sigh when I open a data file and discover an army of tags, several ranks deep, surrounding the data I need.  XML&#8217;s complexity imposes costs without commensurate benefits, specifically:</p>
<ul>
<li>In-line, element-by-element tagging is redundant.  Far preferable is stating the data model separately, and using a lightweight delimiter (such as a comma or a tab).</li>
<li> Text tags are purported to be self-documenting, but textual meaning is a slippery thing: it&#8217;s rare that one can be sure of a tag&#8217;s data type without consulting its DTD (in a separate document).</li>
<li> End-tags support nested structures (such as an aside (within (an aside))).  But to facilitate data exchange, flattened out structures are preferable, and arbitrary levels of nesting are best used sparingly.</li>
</ul>
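<p>To make the redundancy of element-by-element tagging concrete, here is a rough sketch that serializes the same records twice, once as XML and once as tab-delimited text with the field names stated a single time in a header row.  The field names and records are hypothetical.</p>

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical records; the field names are illustrative only.
FIELDS = ["accession", "species", "length"]
records = [
    ("AB000001", "Homo sapiens", 1532),
    ("AB000002", "Mus musculus", 987),
    ("AB000003", "Danio rerio", 2210),
]

def to_xml(rows):
    """Wrap every value of every record in its own start and end tag."""
    root = ET.Element("records")
    for row in rows:
        rec = ET.SubElement(root, "record")
        for field, value in zip(FIELDS, row):
            ET.SubElement(rec, field).text = str(value)
    return ET.tostring(root, encoding="unicode")

def to_tsv(rows):
    """State the data model once (header row), then emit delimited values."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(FIELDS)
    writer.writerows(rows)
    return buf.getvalue()

xml_text = to_xml(records)
tsv_text = to_tsv(records)
print(len(xml_text), len(tsv_text))
```

<p>In the XML version the cost of the tags grows with every record added, while the tab-delimited header is paid for once.</p>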
<p>XML&#8217;s complexity inflicts misery on both sides of the data divide: on the publishing side, developers struggle to comply with the latest edicts of a fussy standards group, while on the consuming side, data suitors labor to <a href="http://www.crummy.com/software/BeautifulSoup/">quickly unravel</a> that XML format into something they can use.</p>
<h3>Three Rules for XML Rebels</h3>
<p><strong>I.  Stop Inventing New Formats</strong> <a href="http://www.tbray.org/ongoing/When/200x/2006/01/08/No-New-XML-Languages">(as Tim Bray said in 2006)</a></p>
<p>Before you call for &#8220;an XML format for X&#8221;, let me tell you a story about LaTeX and MathML.  (And while these are document formats, there&#8217;s a lesson here for data).</p>
<p>The LaTeX typesetting system is the lingua franca for composing scientific documents.  As the one-million plus LaTeX-formatted articles on arXiv.org attest, it is spoken by scientists worldwide.</p>
<p>MathML, on the other hand, is a markup language for mathematics recommended by the W3C.  If you&#8217;re a scientist looking to use MathML, you have two choices: (i) find a program to convert LaTeX, which you already know, to MathML 3.0 or (ii) familiarize yourself with this <a href="http://www.w3.org/TR/2009/WD-MathML3-20090604/"> handy 354-page spec</a> and code it yourself.</p>
<p>Two years ago, Mike Adams thought of a third way: why not just let people use LaTeX directly in WordPress?  So he wrote a plug-in that did it.  <a href="http://en.blog.wordpress.com/2007/02/17/math-for-the-masses/">The applause was deafening</a>.</p>
<p>Spoken languages are strengthened by usage, not by imperial fiat, and data formats are no different.  Far better to evolve and adapt the standards we already have (as JSON and SQLite&#8217;s file format do), than to fabricate new ones from whole cloth.  <a href="http://blog.jonudell.net/2009/07/31/polymath-equals-user-innovatio/">As Jon Udell says</a>, &#8220;good-enough solutions [that are] here now, and familiar to people, often trump great solutions that aren’t here and wouldn’t be familiar if they were.&#8221;</p>
<p><strong>II.  Obey The Fifteen Minute Rule</strong></p>
<p><a href="http://www.ddj.com/184404686">Interviewed several years ago</a>, James Clark stated &#8220;If a technology is too complicated, no matter how wonderful it is and how easy it makes a user&#8217;s life, it won&#8217;t be adopted on a wide scale.&#8221;</p>
<p>Accordingly, if you absolutely must develop a new API, language, or format, it should satisfy a simple rule: a person of reasonable ability should be able to get from zero to &#8216;Hello World&#8217; in fifteen minutes.  (This does not preclude complex languages or formats, per se:  it does require that additional complexity not be sui generis, but built on some existing foundation, <a href="http://people.mandriva.com/~prigaux/language-study/diagram-light.png">for example.</a>) </p>
<p>Despite <a href="http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/"> a noble vision for the semantic web </a>, the barriers for adopting the W3C&#8217;s proposals for linked data are too high.  The beauty of the original HTML standard was that it was dead simple.  The flaw of RDF is that it is too hard.</p>
<p><strong>III.  Embrace Lazy Data Modeling</strong></p>
<p>To keep data bureaucracy to a minimum, <a href="http://my.safaribooksonline.com/9780596801656/information_platforms_as_dataspaces">several Big Data thinkers </a> have advocated a more <a href="http://en.wiktionary.org/wiki/catholic">catholic</a> approach to data:  building data stores that accommodate <a href="http://infochimps.org/">a broad range of data types and formats</a>.</p>
<p>Lazy data modeling is similar to lazy evaluation.  The right schema for data depends on future use cases, in as-yet-undeveloped applications.  Instead of trying to guess the future, we can store the data &#8220;as-is&#8221; &#8212; and deal with its transformation when (and if) a necessary use case arises.  As <a href="http://www.eecs.berkeley.edu/~franklin/Papers/dataspaceSR.pdf">Michael Franklin and colleagues note</a>: &#8220;the most scarce resource available for semantic integration is human attention.&#8221;</p>
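<p>A small sketch of the idea, under the assumption that records arrive in mixed formats and are stored untouched: the schema lives in the reader, and is applied only when a use case actually needs it.  The records and the read_temperature helper are hypothetical.</p>

```python
import json

# Raw records kept exactly as received, in mixed formats (hypothetical data).
raw_store = [
    '{"city": "Paris", "temp_c": 21.5}',
    "Oakland\t18.0",
    '{"city": "Lille", "temp_c": 17.0, "humidity": 0.6}',
]

def read_temperature(raw):
    """Apply a schema at read time only; return a (city, temp_c) pair."""
    if raw.startswith("{"):
        rec = json.loads(raw)
        # Extra fields such as humidity are ignored until someone needs them.
        return rec["city"], float(rec["temp_c"])
    city, temp = raw.split("\t")
    return city, float(temp)

temps = [read_temperature(raw) for raw in raw_store]
print(temps)
```

<p>Nothing was lost by skipping the up-front modeling: if a later application cares about humidity, a second reader can pull it from the same untouched records.</p>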
<p>This liberal view also reduces barriers for data sharing, barriers which threaten initiatives like <a href="http://www.data.gov/">Data.gov</a>.  The US Census Bureau shouldn&#8217;t expend resources to publish in XML if they have a good-enough format available right now.</p>
<p>For the data geeks in the trenches, who are building the next generation of data services, the laws of economics hold fast: there are unlimited opportunities in the face of one limited resource, time. (Which also explains why <a href="http://blog.i2pi.com/">data geeks </a> <a href="http://www.datawrangling.com/">seem to </a> <a href="http://twitter.com/dpatil">get </a> <a href="http://anyall.org/blog/">no sleep</a>).</p>
<p>XML&#8217;s unfulfilled promise for data testifies that formats can create friction.  The easier it is for data to be shared and consumed, the more quickly we&#8217;ll realize our visions for smarter businesses and <a href="http://www.readwriteweb.com/archives/how_tim_oreilly_aims_to_change_government.php">better governments.</a></p>
<p><strong>(25-Aug-2009 Update:  <a href="http://groups.google.com/group/sunlightlabs/browse_thread/thread/da9118b9fe566c">  Read a response from open gov advocates at Sunlight Labs</a>).</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/xml-and-big-data/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/xml-and-big-data/</feedburner:origLink></item>
		<item>
		<title>The Rise of the Data Web</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/trYsY0hnfNQ/</link>
		<comments>http://dataspora.com/blog/the-rise-of-the-data-web/#comments</comments>
		<pubDate>Fri, 21 Aug 2009 01:51:33 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[analytics]]></category>

		<category><![CDATA[computing]]></category>

		<category><![CDATA[data]]></category>

		<category><![CDATA[data bigdata xml]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=86</guid>
		<description><![CDATA[The future of the web is data, not documents.  The web has evolved from Tim Berners-Lee&#8217;s original vision of &#8220;some big, virtual documentation system in the sky&#8221; into a vibrant ecosystem of data where documents &#8212; and human actors &#8212; will play an ever smaller role.
As others have noted, we&#8217;ve reached a tipping point [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/08/stream.jpg"><img class="alignleft size-medium wp-image-88" title="stream" src="http://dataspora.com/blog/wp-content/uploads/2009/08/stream-188x300.jpg" alt="" width="188" height="300" /></a>The future of the web is data, not documents.  The web has evolved from Tim Berners-Lee&#8217;s original vision of <a href="http://www.ted.com/index.php/talks/tim_berners_lee_on_the_next_web.html">&#8220;some big, virtual documentation system in the sky&#8221;</a> into a vibrant ecosystem of data where documents &#8212; and human actors &#8212; will play an ever smaller role.</p>
<p><a href="http://gigaom.com/2008/11/09/mapreduce-leads-the-way-for-parallel-programming/">As others have noted</a>, we&#8217;ve reached a tipping point in history: more data is being manufactured by machines &#8212; servers, cell phones, GPS-enabled cars &#8212; than by people.  The early, document-centric web was populated by hand-coded hypertext files; today, a hand-coded web page is as rare as hand-woven clothing.</p>
<p>Through web frameworks, wikis, and blogs, we have industrialized the creation of hypertext.  Similarly, we&#8217;ve also industrialized the collection of data, and spliced out the human steps in many data flows, such that data entry clerks may soon be as rare as typesetters.</p>
<p>The web we experience will continue to be dominated by documents &#8212; e-mail, blogs, and news.  And while many sites are data-centric &#8212; Google maps, Weather.com, and Yahoo finance &#8212; it&#8217;s the web that we can&#8217;t see that is surging with data.  It&#8217;s not about us, it&#8217;s about servers in the cloud mediating <a href="http://radar.oreilly.com/archives/2007/02/pipes-and-filte.html">entire pipelines of data</a>, only occasionally surfacing in a browser.</p>
<p>But the web&#8217;s data architecture is fractious and in flux: many competing standards exist for serializing, parsing, and describing data.  As we build out the data web, we ought to embrace standards that mirror data&#8217;s form in its natural habitats &#8212; as programmatic data structures, relational tables, or key-value pairs &#8212; while taking advantage of data&#8217;s stream-like nature.  Mark-up languages like HTML and XML are ideal for documents, but they are poor containers for data, especially Big Data.</p>
<p><span id="more-86"></span></p>
<p><strong>Sacred &#8220;Words &amp; Enthusiasm&#8221; vs Meaningless Utterances</strong></p>
<p>Documents and data are different.  The table below reflects my thin grasp of the fissure lines, as a step towards arguing why we ought to design around them.</p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/08/documents_vs_data.png"><img class="alignnone size-full wp-image-90" title="documents_vs_data" src="http://dataspora.com/blog/wp-content/uploads/2009/08/documents_vs_data.png" alt="" width="499" height="356" /></a></p>
<p>Documents are made of <a href="http://www.ted.com/talks/view/id/161">&#8220;words and enthusiasm&#8221;</a>: sonnets, cake recipes, blog posts, Supreme Court rulings, and dictionary definitions.  Their core stuffing is text.  Their structure is unpredictable and irregular &#8212; even <a href="http://seanmcgrath.blogspot.com/2004_05_23_seanmcgrath_archive.html"> fractal</a>.</p>
<p>Data are not created but collected (<a href="http://www.archives.nd.edu/cgi-bin/lookit.pl?latin=datum">something given</a>, not something made): city temperatures, stock prices, web visitors, and home runs. They are observations in time and space, with periodic and predictable structure.  Data are reorderable and divisible: you can relay city temperatures in any order, but you can&#8217;t rearrange a Shakespearian sonnet without muddling its meaning.  Some documents are so meaningful as to be considered <a href="http://www.ietf.org/rfc/rfc1.txt">sacred</a>.</p>
<p>Data are, in this regard, meaningless on their own; they do not signify, they simply are.  These data are the <a href="http://plato.stanford.edu/entries/assertion/">utterances </a>of the <a href="http://boingboing.net/images/blobjects.htm">spimes </a> that surround us.</p>
<p><strong>Documents as Trees, Data as Streams</strong></p>
<p>The argument for shifting away from markup languages as data formats is not just practical, it&#8217;s philosophical: it&#8217;s about pivoting our conception away from the dominant metaphor of documents &#8212; trees &#8212; towards one far more suitable for data &#8212; streams.</p>
<p>Trees are rooted and finite: you can&#8217;t chop up a tree and easily put it back together again (while XML has made concessions to <a href="http://www.w3.org/TR/xml-fragment">document fragments</a>, it is not a natural fit).</p>
<p>Streams can be split, sampled, and filtered.  The divisibility of data streams lends itself to parallelism in a way that document trees do not.  The stream paradigm conceives of data as extending infinitely forward in time.  The Twitter data stream has no end: it ought to have no end tag.</p>
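<p>To make the stream metaphor concrete, here is a minimal Python sketch of splitting, sampling, and filtering an endless stream with generators.  The cities, the temperature range, and the helper names are all made up for illustration; nothing here is a real feed or API.</p>

```python
import itertools
import random


def temperature_stream(seed=0):
    """Endless stream of (city, temperature) observations (made-up data)."""
    rng = random.Random(seed)
    cities = ["Oakland", "Boston", "London"]
    while True:
        yield rng.choice(cities), round(rng.uniform(-5.0, 35.0), 1)


def only_city(stream, city):
    """Filter: keep the observations for one city."""
    return (obs for obs in stream if obs[0] == city)


def every_nth(stream, n):
    """Sample: keep every n-th observation."""
    return itertools.islice(stream, 0, None, n)


# Split one stream into two independent consumers.
left, right = itertools.tee(temperature_stream())
oakland = only_city(left, "Oakland")
sampled = every_nth(right, 10)

# The source never ends, so we only ever take a finite window of it.
first_oakland = list(itertools.islice(oakland, 5))
first_sampled = list(itertools.islice(sampled, 5))
```

<p>Note that nothing in the sketch ever asks for the end of the stream: each consumer pulls only as many observations as it needs.</p>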
<p>Conceiving of data as streams moves us out of the realm of static objects and into the <a href="http://mitpress.mit.edu/sicp/full-text/book/book-Z-H-24.html#%_sec_3.5">realm of signal processing</a>.  This is the domain of the living: where the web is not an archive but an organism, <a href="http://radar.oreilly.com/2009/08/big-data-and-real-time-structured-data-analytics.html">reacting in real-time</a>.</p>
<p><strong>XML Considered Harmful for Data</strong></p>
<p>XML is a poor language for data because it solves the wrong problems &#8212; those of documents &#8212; while leaving many of data&#8217;s unique issues unaddressed.  But many promising alternatives exist &#8212; lightweight formats like <a href="http://www.json.org/fatfree.html">JSON</a>, <a href="http://developers.facebook.com/thrift/thrift-20070401.pdf">Thrift</a>, and even <a href="http://www.sqlite.org/fileformat.html">SQLite&#8217;s file format</a> &#8211; as I will detail in <a href="http://dataspora.com/blog/xml-and-big-data/">my next post</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/the-rise-of-the-data-web/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/the-rise-of-the-data-web/</feedburner:origLink></item>
		<item>
		<title>The Three Sexy Skills of Data Geeks</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/Wf9Z8ufjH2o/</link>
		<comments>http://dataspora.com/blog/sexy-data-geeks/#comments</comments>
		<pubDate>Wed, 27 May 2009 10:02:05 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[analytics]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=85</guid>
		<description><![CDATA[Hal Varian, Google&#8217;s Chief Economist, was interviewed a few months ago, and said the following in the McKinsey Quarterly:
&#8220;The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/05/marilyn_scatter.png"><img class="alignnone size-medium wp-image-84" title="marilyn_scatter" src="http://dataspora.com/blog/wp-content/uploads/2009/05/marilyn_scatter-300x300.png" alt="Marilyn Monroe Scatterplot Mashup" width="300" height="300" /></a>Hal Varian, Google&#8217;s Chief Economist, was interviewed a few months ago, and said the following in <a href="http://www.mckinseyquarterly.com/Hal_Varian_on_how_the_Web_challenges_managers_2286">the McKinsey Quarterly</a>:<br />
<em>&#8220;The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill.” </em></p>
<p>While I was prepping for tonight&#8217;s talk at the <a href="http://www.youtube.com/watch?v=hcl3qmawY_0">Google I/O Ignite</a> event, this quote inspired me to muse about how sex appeal and statistics might go together, so I chose to mash up a few scatter plots with Andy Warhol&#8217;s Marilyn Monroe.</p>
<p>Statisticians&#8217; sex appeal has little to do with their lascivious leanings (ahem, <a href="http://www.bedposted.com">BedPost</a>), and more with the scarcity of their skills.  I believe that the folks to whom Hal Varian is referring are not statisticians in the narrow sense, but rather people who possess skills in three key, yet independent areas:  statistics, data munging, and data visualization.  (In parentheses next to each, I&#8217;ve put the salient character trait needed to acquire it).</p>
<p><strong>Skill #1: Statistics (Studying).</strong> Statistics is perhaps the most important skill and the hardest to learn. <span id="more-85"></span>It&#8217;s a deep and rigorous discipline, and one that is actively progressing (the widely used method of Least Angle Regression was only <a href="http://arxiv.org/abs/math/0406456">recently developed in 2004</a>).  I expect to be on its learning curve my entire life.  This being the case, people who possess a solid grasp of modern statistics are rare.   And yet problems that require its application continue to multiply.  The text that I was exposed to in graduate school and find to be an unparalleled survey is Hastie, Tibshirani, and Friedman&#8217;s <a href="http://www.amazon.com/Elements-Statistical-Learning-T-Hastie/dp/0387952845">Elements of Statistical Learning</a>.</p>
<p><strong>Skill #2: Data Munging (Suffering).</strong> The second critical skill mentioned above is  &#8220;data munging.&#8221;  Among data geek circles (you can find us with a <a href="http://search.twitter.com/search?q=%23rstats">Twitter search for #rstats</a>), this refers to the painful process of cleaning, parsing, and proofing one&#8217;s data before it&#8217;s suitable for analysis.  Real world data is messy.  At best it&#8217;s inconsistently delimited or packed into an unnecessarily complex XML schema.  At worst, it&#8217;s a series of scraped HTML pages or a thoroughly undocumented fixed-width format.</p>
<p>A good data munger excels at turning coffee into regular expressions and parsers, implemented in a high-level scripting language of choice (often Perl, Python, even JavaScript).  This is problem solving with programming, and quite different from statistics.  An aspiration towards elegance &#8212; in the form of a perfect XSLT filter, for example &#8212; is rarely rewarded, and often punished.  A decade ago, I thought that the world&#8217;s data would soon be well-structured, and my talent for syntactical incantations of regular expressions would be a moot skill.  I was wrong.  (Perhaps there&#8217;s an analogy with the paper industry:  the growing volume of data means we&#8217;ll likely need more regular expressions before we need fewer).</p>
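<p>As a small taste of what munging looks like in practice, here is a hedged Python sketch that normalizes inconsistently delimited records with one regular expression.  The sample lines and the field layout (date, city, temperature) are invented for this example.</p>

```python
import re

# Hypothetical raw lines: the same three fields, three different delimiters.
RAW_LINES = [
    "2009-05-27, Oakland ,  61.5",
    "2009-05-27|Boston|58",
    "2009-05-27\tLondon\t 54.25 ",
]

# A comma, pipe, or tab, with optional whitespace on either side.
DELIMITER = re.compile(r"\s*[,|\t]\s*")


def parse_line(line):
    """Split a messy line into a (date, city, temperature) tuple."""
    date, city, temp = DELIMITER.split(line.strip())
    return date, city, float(temp)


records = [parse_line(line) for line in RAW_LINES]
```

<p>Real files tend to need a growing pile of such patterns, one for each new way the data turns out to be broken, which is why elegance is so rarely rewarded.</p>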
<p>Related to munging but certainly far less painful is the ability to retrieve, slice, and dice well-structured data from persistent data stores, using a combination of SQL, scripting languages (especially Python and its SciPy and NumPy libraries), and even several oldie-but-goodie Unix utilities (cut, join).</p>
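<p>For a sense of how little ceremony this takes once the data is well-structured, here is a minimal sketch using Python&#8217;s built-in sqlite3 module.  The table name, columns, and rows are assumptions made up for the example.</p>

```python
import sqlite3

# An in-memory database stands in for a persistent data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (city TEXT, day TEXT, temp_f REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("Oakland", "2009-05-26", 60.0),
        ("Oakland", "2009-05-27", 62.0),
        ("Boston", "2009-05-27", 58.0),
    ],
)

# Slice by day range, dice by city.
rows = conn.execute(
    "SELECT city, COUNT(*), AVG(temp_f) FROM readings "
    "WHERE day BETWEEN ? AND ? "
    "GROUP BY city ORDER BY city",
    ("2009-05-26", "2009-05-27"),
).fetchall()
conn.close()
```

<p>Here <code>rows</code> holds one (city, count, average) tuple per city, ready to hand to whatever analysis comes next.</p>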
<p>And when data sets grow too large to manage on a single desktop, the samurai of data geeks are capable of parallelizing storage and computation with tools like <a href="http://databeta.wordpress.com/2009/05/14/bigdata-node-density/">96 nodes of Postgres</a>, <a href="http://cran.r-project.org/web/views/HighPerformanceComputing.html">snow and Rmpi</a>, Hadoop and MapReduce, and <a href="http://www.datawrangling.com/amazon-elastic-mapreduce-a-web-service-api-for-hadoop">on Amazon EC2 to boot</a>.</p>
<p><strong>Skill #3: Visualization (Storytelling).</strong> This third and last skill that Professor Varian refers to is the easiest to believe one has.  Most of us have had exposure to basic chart-making widgets of Excel (and to date myself, tools like Harvard Graphics).   But a little knowledge is a dangerous thing:  these software tools are often insufficient when faced with the visualization of large, multivariate data sets.</p>
<p>Here it&#8217;s worth making a distinction between two breeds of data visualizations, which differ in their audience and their goals.  The first are exploratory data visualizations (as named by John Tukey), intended to facilitate a data analyst&#8217;s understanding of the data.  These may consist of <a href="http://dsarkar.fhcrc.org/lattice/book/images/Figure_05_17_stdBW.png">scatter plot matrices</a> and histograms, where labels and colors are minimally set by default.  Their goal is to help develop a hypothesis about the data, and their audience typically numbers one or a small team.</p>
<p>The second kind are data visualizations intended to communicate to a wider audience; their goal is to visually advocate for a hypothesis.  While most data geeks are facile with exploratory graphics, the ability to create this second kind of visualization, these visual narratives, is again a separate skill &#8212; with separate tools.  (R is excellent for static visualizations, but cannot compete with the kinds of rich interactive visualizations that tools like <a href="http://processing.org/">Processing</a> and <a href="http://flare.prefuse.org/">Flare</a> make possible).  Luckily, successful collaboration often occurs <a href="http://blog.jonudell.net/2009/05/26/a-conversation-with-eric-rodenbeck-about-usefully-cool-design-and-engineering/">between data analysts and designers</a>, the <a href="http://flowingdata.com/2009/04/22/narrow-minded-data-visualization/">occasional fracas</a> notwithstanding.</p>
<p>The ability to visualize and communicate data is critical, because even with good data and rigorous statistical techniques, if the results of an analysis are poorly visualized, they will not convince:  whether it&#8217;s an academic discovery or a business proposal.</p>
<p><strong>Put All Three Skills Together:  Sexy. </strong>Thus with the Age of Data upon us, those who can model, munge, and visually communicate data &#8212; call us statisticians or data geeks &#8212; are a hot commodity.  I grew up before the age of geek chic, when the computer whizzes were social pariahs, and feature-length movies were dedicated to <a href="http://www.imdb.com/title/tt0088000/">nerds seeking revenge</a>.  But in the last decade, Steve Jobs became an icon, the Internet became cool, and an entire generation of tech kids grew up well adjusted.  They even built the social web to prove it.  I believe the same could happen to statistics and data geeks too.</p>
]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/sexy-data-geeks/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/sexy-data-geeks/</feedburner:origLink></item>
		<item>
		<title>Dataviz Salon SF #2:  Maps, Grammars, &amp; Models</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/YIddP1eXxpc/</link>
		<comments>http://dataspora.com/blog/dataviz-sf-salon-no/#comments</comments>
		<pubDate>Fri, 08 May 2009 10:11:35 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[analytics]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=75</guid>
		<description><![CDATA[A few nights ago the talented folks at Stamen Design hosted us at their studios for our second dataviz salon in San Francisco.  (Special thanks to Tom Carden and Michal Migurski for inviting us).  Four talks were given, which I&#8217;ll review in turn.

Stamen:  Reaching through Maps
Protovis: A Declarative, Open Source Graphical Toolkit
A Mathematician&#8217;s View:  A Visualization is a [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/05/dataviz_salon_poster_5may20.png"><img class="alignleft size-thumbnail wp-image-76" title="dataviz_salon_poster_5may20" src="http://dataspora.com/blog/wp-content/uploads/2009/05/dataviz_salon_poster_5may20-150x150.png" alt="" width="150" height="150" /></a>A few nights ago the talented folks at <a href="http://www.stamen.com">Stamen Design</a> hosted us at their studios for our second dataviz salon in San Francisco.  (Special thanks to <a href="http://www.tom-carden.co.uk">Tom Carden</a> and <a href="http://mike.teczno.com/">Michal Migurski</a> for inviting us).  Four talks were given, which I&#8217;ll review in turn.</p>
<ul>
<li><a href="http://dataspora.com/blog/dataviz-sf-salon-no/#stamen">Stamen:  Reaching through Maps</a></li>
<li><a href="http://dataspora.com/blog/dataviz-sf-salon-no/#protovis">Protovis: A Declarative, Open Source Graphical Toolkit</a></li>
<li><a href="http://dataspora.com/blog/dataviz-sf-salon-no/#morton">A Mathematician&#8217;s View:  A Visualization is a Hypothesis</a></li>
<li><a href="http://dataspora.com/blog/dataviz-sf-salon-no/#uuorld">UUorld:  Multidimensional Extrusion Maps</a></li>
</ul>
<h3 id="stamen">Stamen:  Reaching through Maps</h3>
<p>Eric Rodenbeck (Stamen) started by highlighting several mapping visualizations that Stamen has been hacking on recently and in the past, including <a href="http://oakland.crimespotting.org/map/#types=Va,Na,DP,Al,Pr&amp;dtend=2009-05-05T23:34:55-07:00&amp;dtstart=2009-04-22T23:47:51-07:00&amp;lon=-122.270&amp;zoom=14&amp;lat=37.806"> </a><a href="http://www.cabspotting.org"> Cabspotting in San Francisco </a>, <a href="http://oakland.crimespotting.org/">Crimespotting in Oakland</a>, and  <a href="http://www.london2012.com/in-your-area/map/index.php"> Olympic Stadium spotting in London</a>.</p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/05/stamen_cabspotting.png"><img class="alignleft size-thumbnail wp-image-79" title="stamen_cabspotting" src="http://dataspora.com/blog/wp-content/uploads/2009/05/stamen_cabspotting-150x150.png" alt="" width="150" height="150" /></a>Eric showed how Stamen has attempted to move away from what <a href="http://mappinghacks.com/2006/04/07/web-map-api-roundup/">Schuyler Erle has dubbed &#8220;red dot fever&#8221;</a>, whereby the overlaid data can overwhelm our visual attention, and toward allowing various data layers to &#8220;reach through&#8221; the maps.</p>
<p>For example, the London Olympic maps provide a mixture of schematic, satellite, and webcam images.  These various drill-downs of detail are not all exposed, but rather collaged.  Even more interesting was a movable &#8216;lens&#8217; that, as it is moved over regions of a map, reveals another layer (reminiscent of a <a href="http://www.flickr.com/photos/cdevers/2896777351/"> polarized-light based mural</a> at Boston&#8217;s MoS).  In these ways, additional layers of data are only selectively brought into focus (echoing a design pattern in Japanese gardening, <a href="http://www.amazon.com/Visual-Spatial-Structure-Landscapes/dp/0262580942">mie gakure</a>, meaning &#8220;seen and unseen&#8221;).<br />
<span id="more-75"></span><br />
One practical gem that Mike Migurski shared regarding the Oakland Crimespotting site was, &#8220;the design of a comments section is a huge part of how it&#8217;s perceived and used.&#8221;  Nota bene, social web developers.</p>
<h3 id="protovis">Protovis: A Declarative, Open Source Graphical Toolkit</h3>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/05/burtin_yeast_mic.png"><img class="alignnone size-thumbnail wp-image-77" title="burtin_yeast_mic" src="http://dataspora.com/blog/wp-content/uploads/2009/05/burtin_yeast_mic-150x150.png" alt="" width="150" height="150" /></a>Mike Bostock (Stanford CS) introduced <a href="http://vis.stanford.edu/protovis/">Protovis</a>, an extensible visualization toolkit implemented in JavaScript on top of the HTML canvas element.  Protovis draws inspiration from Leland Wilkinson&#8217;s <a href="http://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448">Grammar of Graphics</a>, which argues for moving away from the prevailing method of building visualizations, where data are simply poured into one of several chart types &#8212; pie, stacked bar, or scatter.</p>
<p>Wilkinson argues that visualizations should not be cast from chart typologies, but rather composed of graphical primitives.  In Protovis, these primitives include dots, areas, lines, and labels (called &#8220;marks&#8221;).</p>
<p>Among Protovis&#8217;s strengths are:</p>
<dl>
<dt><strong> A More Declarative Syntax for Creating Graphics </strong></dt>
<dd> One disadvantage of using the JavaScript canvas API directly is its imperative style.  To draw a diagonal line, the code must manipulate and move a pen using x,y coordinates.  With Protovis, however, the code declares (roughly) &#8220;add a bar to this graph&#8221; (<a href="http://vis.stanford.edu/protovis/ex/weather.html">example</a>).  Thus Protovis provides a grammar for statements about graphical marks, rather than statements about graphical mechanics. </dd>
<dt><strong> Visible Open Source </strong></dt>
<dd> With Protovis, the source code is not just open and available, it&#8217;s viewable from within the browser.  I have an admittedly personal bias for <a href="http://dataspora.com/blog/open-source-dataviz/">open source data visualization</a>, but lowering the barriers to sharing source code ultimately drives faster adoption and iteration of visualization techniques. </dd>
</dl>
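<p>The imperative-versus-declarative distinction is easier to see side by side.  The Python sketch below is not Protovis and does not reproduce its API; the <code>Bar</code> class and its property names are hypothetical stand-ins.  Both versions emit the same list of (x, y, width, height) rectangles: the first by stepping through the mechanics, the second by describing a mark whose properties are constants or functions of the data.</p>

```python
# Imperative style: the caller steps through the mechanics of each rectangle.
def bars_imperative(values, bar_width=20, scale=2):
    rects = []
    x = 0
    for value in values:
        rects.append((x, 0, bar_width, value * scale))  # (x, y, width, height)
        x += bar_width
    return rects


# Declarative style: describe the mark once. Each property is either a
# constant or a function of (datum, index). Hypothetical class, not Protovis.
class Bar:
    KEYS = ("left", "bottom", "width", "height")

    def __init__(self, data):
        self.data = data
        self.props = {"left": 0, "bottom": 0, "width": 1, "height": 1}

    def set(self, **props):
        self.props.update(props)
        return self  # returning self allows chained calls

    def render(self):
        def resolve(prop, datum, index):
            return prop(datum, index) if callable(prop) else prop

        return [
            tuple(resolve(self.props[key], datum, index) for key in self.KEYS)
            for index, datum in enumerate(self.data)
        ]


values = [1, 1.2, 1.7, 1.5, 0.7]
declared = Bar(values).set(
    left=lambda d, i: i * 20,
    width=20,
    height=lambda d, i: d * 2,
).render()
```

<p>The declarative version says what each bar is in terms of its datum; where the pen moves is left to <code>render</code>.</p>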
<p>Mike has used Protovis to recreate classic data visualizations by Will Burtin, Florence Nightingale, William Playfair, and others.  You can find these at the <a href="http://vis.stanford.edu/protovis">Protovis site</a> and in their <a href="http://vis.stanford.edu/protovis/protovis.pdf">InfoVis &#8217;09 paper</a>.</p>
<p>(For those interested in a Wilkinson-inspired approach for graphics in R, check out <a href="http://had.co.nz/ggplot2/">Hadley Wickham&#8217;s ggplot2</a>).</p>
<h3 id="morton">A Mathematician&#8217;s View:  A Visualization is a Hypothesis</h3>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/05/dataspora_wordle.png"><img class="alignleft size-thumbnail wp-image-78" title="dataspora_wordle" src="http://dataspora.com/blog/wp-content/uploads/2009/05/dataspora_wordle-150x150.png" alt="" width="150" height="150" /></a>Jason Morton (Stanford Mathematics) made the argument that a data visualization is not merely a descriptive vessel, it is a predictive model.</p>
<p>A visualization is a model because, especially with large data sets, not every dimension of every observation can be shown.  Quite simply, a (compressed) 100 KB data visualization cannot losslessly describe a (compressed) 10 MB data set: information must be discarded. What remains is a <em>model</em> of the original data, albeit a visual model.</p>
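<p>A histogram is perhaps the simplest illustration of this point.  In the Python sketch below (simulated data, arbitrary bin edges, and my own example rather than one from the talk), ten thousand observations are reduced to eight counts.  The counts are what gets drawn, and the original values cannot be recovered from them.</p>

```python
import random

# Simulated observations stand in for a large data set.
rng = random.Random(42)
observations = [rng.gauss(0.0, 1.0) for _ in range(10_000)]


def histogram(values, lo, hi, bins):
    """Count values per equal-width bin; out-of-range values are clamped."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for value in values:
        index = int((value - lo) / width)
        counts[min(max(index, 0), bins - 1)] += 1
    return counts


# 10,000 numbers in, 8 numbers out: the rest of the information is discarded.
counts = histogram(observations, -4.0, 4.0, 8)
```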
<p>Moreover, a data visualization&#8217;s model is predictive: it presents a hypothesis about how observable data points were generated, and implies predictions about future, as-yet-unobserved data.</p>
<p>Seen from this perspective, Stamen&#8217;s Crimespotting maps are powerful precisely because they make compelling hypotheses about when and where crime occurs in Oakland.  Their London Olympic maps, which integrate time series photographs of the stadium site, take a position about the pace of construction and how it is impacting the landscape.</p>
<p><strong>&#8220;Form Ever Follows Function&#8221;</strong></p>
<p>And if the function of a data visualization is to make hypotheses, then its form should follow this function. The arbitrary use of color, position, shape, and ornament only adds noise.</p>
<p>The ever popular <a href="http://www.wordle.net/"> Wordle </a> provides a visual model for word distribution in a text: more frequent words are larger.  However, a word&#8217;s color, position, and font are arbitrarily chosen - they carry no meaning, and model nothing. Indeed, the &#8220;randomize&#8221; button is an admission of as much (for it does not randomize size).</p>
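<p>The one dimension that a word cloud does model can be sketched in a few lines of Python.  I don&#8217;t know how Wordle actually scales its type, so the linear mapping from count to point size below, and the size range, are assumptions for illustration only.</p>

```python
import re
from collections import Counter


def word_sizes(text, min_pt=10, max_pt=48):
    """Map each word to a font size that grows linearly with its count."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    if not counts:
        return {}
    top = max(counts.values())
    span = max(top - 1, 1)  # avoid dividing by zero when every count is 1
    return {
        word: min_pt + (max_pt - min_pt) * (n - 1) / span
        for word, n in counts.items()
    }


sizes = word_sizes("data data data model model noise")
```

<p>Size is the only output here.  Color, position, and font would have to be chosen some other way, which is exactly the point above.</p>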
<p>Adding arbitrary marks or dimensions to a visualization carries two related risks: first, it can obscure the model being conveyed (what do same-colored words have in common?); second, this added complexity, beyond polluting the information channel, has a cost: the visualization is larger.  <a href="http://www.swivel.com/graphs/image/28893777/default/600/337/5/absolute/HorizontalBarGraph/ASC/all+time/daily/ignore?s=1241769339">Bar graphs with iPhone ads</a> in the background cannot be succinctly rendered.</p>
<p>The parallels to the modernist movement in architecture are obvious. Adolf Loos wrote in 1908 that &#8220;the evolution of culture marches with the elimination of ornament from useful objects.&#8221;  The American modernist Louis Sullivan proclaimed that &#8220;form ever follows function.&#8221;</p>
<p>But the truth is that stripping visualizations down to their bare models can be counterproductive.  Call it noise or ornamentation, but even visual marks that do not advance a hypothesis can act to support it,  by guiding the eye, providing context, or otherwise speeding the absorption of a pattern by the human brain.  At the very least, this functionalist perspective can help data visualizers use ornamentation intentionally, not inadvertently.</p>
<h3 id="uuorld">UUorld:  Multidimensional Extrusion Maps</h3>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/05/uuorld_stlouis.png"><img class="alignleft size-thumbnail wp-image-80" title="uuorld_stlouis" src="http://dataspora.com/blog/wp-content/uploads/2009/05/uuorld_stlouis-150x150.png" alt="" width="150" height="150" /></a>Zach Wilson (UUorld) showcased his <a href="http://www.uuorld.com">company&#8217;s</a> software that simplifies creating and exploring extrusion maps.  Among the several interesting applications of his software, Zach showed off a temporal visualization <a href="http://vimeo.com/4480815"> of the spread of swine flu in the United States</a> over the past several weeks.</p>
<p>In response to the critique that layering data dimensions on two-dimensional maps could be done more effectively by using other indicators such as color &#8212; instead of the simulation of a third dimension of height &#8212; Zach indicated that research has shown that physical dimensions (or their simulation) possess greater visual saliency to the human eye.</p>
<p>Zach also mentioned UUorld&#8217;s <a href="http://www.uuorld.com/portal">data portal</a>, which contains thousands of downloadable statistics from a variety of public sources, some of which have been used to generate UUorld visualizations.</p>
]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/dataviz-sf-salon-no/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/dataviz-sf-salon-no/</feedburner:origLink></item>
		<item>
		<title>Color:  The Cinderella of dataviz</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/EIEl54Ti7Bg/</link>
		<comments>http://dataspora.com/blog/how-to-color-multivariate-data/#comments</comments>
		<pubDate>Sat, 14 Mar 2009 00:14:42 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[R]]></category>

		<category><![CDATA[analytics]]></category>

		<category><![CDATA[color theory]]></category>

		<category><![CDATA[computing]]></category>

		<category><![CDATA[data]]></category>

		<category><![CDATA[dataviz]]></category>

		<category><![CDATA[sabermetrics]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=58</guid>
		<description><![CDATA[&#8220;Avoiding catastrophe becomes the first principle in bringing color to information: Above all, do no harm.&#8221;  &#8212; Envisioning Information, Edward Tufte, Graphics Press, 1990   
Color is one of the most abused and neglected tools in data visualization.  It is abused when we make poor color choices; it is neglected when we rely on poor software [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p>&#8220;Avoiding catastrophe becomes the first principle in bringing color to information: Above all, do no harm.&#8221;  &#8212; <em>Envisioning Information</em>, Edward Tufte, Graphics Press, 1990   </p></blockquote>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/03/stripcolor2d_4001.png"><img class="alignnone size-full wp-image-73" title="stripcolor2d_4001" src="http://dataspora.com/blog/wp-content/uploads/2009/03/stripcolor2d_4001.png" alt="multivariate color strip plot " width="400" height="185" /></a>Color is one of the most abused and neglected tools in data visualization.  It is abused when we make poor color choices; it is neglected when we rely on poor software defaults.  Yet despite its historically poor treatment at the hands of engineers and end-users alike, if used wisely, color is unrivaled as a visualization tool.</p>
<p>Most of us think twice before walking outside in fluorescent red underoos.  If only we were as cautious in choosing colors for infographics.  The difference is that few of us design our own clothes.  But until good palettes (like <a href="http://www.colorbrewer.org">ColorBrewer</a>) are commonplace, to get colors that fit our purposes, we must be our own tailors.</p>
<p>While obsessing about how to implement color on the <a href="http://labs.dataspora.com/gameday">Dataspora Labs&#8217; PitchFX viewer</a> I began with a basic motivating question:<span id="more-58"></span></p>
<h3>Why use color in data graphics?</h3>
<p>If our data are simple, a single color is sufficient, even preferable.  For example, below is a scatter plot of 287 pitches thrown by the major league pitcher Oscar Villarreal in 2008.  With just two dimensions of data to describe &#8212; the x and y location in the strike zone &#8212; black and white is sufficient.  In fact, this scatter plot is a perfectly lossless representation of the data set (assuming no data points perfectly overlap).</p>
<p><strong>Fig 1. Location of Pitches </strong><strong>(Villarreal, HOU, 2008)</strong></p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/03/bwxy_250.png"><img class="alignnone size-full wp-image-59" title="bwxy_250" src="http://dataspora.com/blog/wp-content/uploads/2009/03/bwxy_250.png" alt="Simple black and white scatter plot" width="250" height="250" /></a></p>
<p>But what if we&#8217;d like to know more: for instance, what kinds of pitches (curveballs, fastballs) landed where?  Or their speed?  Visualizations live in two dimensions, but the world they describe is rarely so confined.</p>
<p><strong>The defining challenge of data visualization is projecting high-dimensional data onto a low-dimensional canvas.</strong> (As a rule, one should never do the reverse: visualize more dimensions than already exist in the data).</p>
<p>Getting back to our pitching example, if we want to layer another dimension of data &#8212; pitch type &#8212; into our plot, we have several methods at our disposal:</p>
<ol>
<li><strong>plotting symbols</strong> - vary the glyphs that we use (circles, triangles, etc.)</li>
<li><strong>small multiples</strong> - vary extra dimensions in space, creating a series of smaller plots</li>
<li><strong>color</strong> - color our data, encoding extra dimensions inside a color space</li>
</ol>
<p>Which technique you employ depends on the nature of the data and the media of your canvas.  I will describe all three by way of example.</p>
<h3>Multivariate Method I:  Vary Your Plotting Symbols</h3>
<p><strong>Fig 2. Location and Pitch Type (Villarreal, HOU, 2008)</strong></p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/03/glyphs_300.png"><img class="alignnone size-full wp-image-60" title="glyphs_300" src="http://dataspora.com/blog/wp-content/uploads/2009/03/glyphs_300.png" alt="Scatterplot with varied plotting symbols." width="300" height="300" /></a></p>
<p>In this plot, I&#8217;ve layered the categorical dimension of pitch type into our plot by using four different plotting symbols.</p>
<p>I consider this visualization an abject failure.  In fact, the prize for my most despised graphs in graduate school goes to <a href="http://www.rbej.com/content/figures/1477-7827-4-23-10-l.jpg">bacterial growth curves rendered this way</a>.  These graphs make our heads hurt for two reasons: (i) distinguishing glyphs demands extra attention (versus what academics call &#8216;<a href="http://www.csc.ncsu.edu/faculty/healey/PP/index.html">pre-attentively processed</a>&#8217; cues like color), and (ii) even after we visually decode the symbols, we have yet another step: mapping symbols to their semantic categories.  (Admittedly this can be improved with <a href="http://eagereyes.org/VisCrit/ChernoffFaces.html">Chernoff faces</a> or other iconic symbols, where the categorical mapping is self-evident).</p>
<h3>Multivariate Method II:  Small Multiples on a Canvas</h3>
<p>Folding additional dimensions into a partitioned canvas has a distinguished pedigree in information graphics.  It has been employed everywhere from <a href="http://hsci.ou.edu/images/jpg-100dpi-5in/17thCentury/Galileo/1613/Galileo-1613-Pt3-27.jpg">Galileo&#8217;s sunspot illustrations</a> to William Cleveland&#8217;s trellis plots.  And as Scott McCloud&#8217;s unexpected <a href="http://www.amazon.com/Understanding-Comics-Invisible-Scott-Mccloud/dp/006097625X">tour de force on comics</a> makes clear, panels of pictures possess a narrative power that a single, undivided canvas lacks.</p>
<p>In this plot below, the four types of pitches that Oscar throws are splintered horizontally.   By reducing our plot sizes, we&#8217;ve given up some resolution in positional information. But in return, patterns that were invisible in our first plot, and obscured in our second (by varied symbols) are now made clear (Oscar throws his fastballs low, but his sliders high).</p>
<p><strong>Fig 3:  Location and Pitch Type (Villarreal, HOU, 2008)</strong></p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/03/strip_4002.png"><img class="alignnone size-full wp-image-70" title="strip_4002" src="http://dataspora.com/blog/wp-content/uploads/2009/03/strip_4002.png" alt="black and white strip plot" width="400" height="185" /></a></p>
<p>Multiplying plots in space works especially well on printed media, which can hold more than ten times as many dots per square inch as a screen.  Both columns and rows can be used to lattice over additional dimensions, the result being a <a href="http://dsarkar.fhcrc.org/lattice/book/images/Figure_06_07_stdBW.png">matrix of scatter plots</a> (in R, see the &#8216;<a href="http://finzi.psych.upenn.edu/R/library/lattice/html/splom.html">splom</a>&#8217; function).</p>
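<p>Before anything is drawn, small multiples come down to a partitioning step: one panel of points per category.  Here is a rough Python sketch of that step.  The coordinates are invented and are not the real Villarreal pitch data, and <code>facet</code> is a name I made up for the example.</p>

```python
from collections import defaultdict

# Hypothetical pitches as (x, y, pitch_type); not the real PitchFX data.
pitches = [
    (-0.4, 1.8, "fastball"),
    (0.1, 2.0, "fastball"),
    (0.5, 3.1, "slider"),
    (-0.2, 2.4, "curveball"),
    (0.3, 3.3, "slider"),
]


def facet(points):
    """Group (x, y, category) points into one list of (x, y) per category."""
    panels = defaultdict(list)
    for x, y, category in points:
        panels[category].append((x, y))
    return dict(panels)


# Each entry would be drawn as its own small plot, all on shared axes.
panels = facet(pitches)
```

<p>Keeping the axes identical across panels is what lets the eye compare positions from one panel to the next.</p>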
<h3>Multivariate Method III: Color Your Data</h3>
<p><strong>So why bother with color?</strong></p>
<p>First, as compared to most print media, computer displays have fewer units of space, but a broader color gamut.  So color is a compensatory strength.</p>
<p>For multi-dimensional data, color can convey additional dimensions inside a unit of space &#8212; and can do so instantly.  Color differences can be detected within 200 ms, before you&#8217;re even conscious of paying attention (the &#8216;pre-attentive&#8217; concept I mentioned earlier).</p>
<p>But the most important reason to use color in multivariate graphics is that<strong> color is itself multidimensional</strong>.  Our perceptual color space &#8212; <a href="http://en.wikipedia.org/wiki/Opponent_process"> however </a><a href="http://en.wikipedia.org/wiki/RGB_color_model"> you </a><a href="http://en.wikipedia.org/wiki/HSL_and_HSV"> slice </a><a href="http://en.wikipedia.org/wiki/Lab_color_space"> it </a> &#8212; is three-dimensional.</p>
<p>In the example below, I&#8217;ve used color as a means of encoding a fourth dimension of our pitching data: the speed of pitches thrown. The palette I&#8217;ve chosen is a diverging palette that moves along one dimension (think of it as the &#8216;redness-blueness&#8217; dimension) in the <a href="http://en.wikipedia.org/wiki/CIELUV_color_space">CIELUV</a> color space, while maintaining a constant level of luminosity.</p>
<p><strong>Fig 4. Location, Pitch Type, and Velocity (Villarreal, HOU, 2008)</strong></p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/03/keycolor1d_3001.png"><img class="alignnone size-full wp-image-69" title="keycolor1d_3001" src="http://dataspora.com/blog/wp-content/uploads/2009/03/keycolor1d_3001.png" alt="isoluminant, diverging color ramp" width="300" height="150" /></a></p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/03/stripcolor1d_4002.png"><img class="alignnone size-full wp-image-71" title="stripcolor1d_4002" src="http://dataspora.com/blog/wp-content/uploads/2009/03/stripcolor1d_4002.png" alt="color strip plot" width="396" height="187" /></a></p>
<p>Holding luminosity constant is important, because luminosity (similar to brightness) determines a color&#8217;s visual impact. Bright colors pop, and dark colors recede.  A color ramp that varies luminosity along with hue will highlight data points as an artifact of color choice.</p>
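<p>A minimal sketch of such a constant-luminosity ramp, in Python rather than the R used for the plots here. The lightness (60), maximum chroma (60), hue angles (260 for blue, 10 for red) and the D65 white point are illustrative assumptions, not the values behind Fig 4. The ramp holds L* fixed in CIELUV, fades chroma to zero at the midpoint, and converts each step to sRGB.</p>

```python
import math

# D65 reference white (an assumption; the post does not name a white point).
XN, YN, ZN = 0.95047, 1.0, 1.08883
_D = XN + 15 * YN + 3 * ZN
UN, VN = 4 * XN / _D, 9 * YN / _D


def luv_to_hex(L, u, v):
    """Convert a CIELUV color (L* > 0) to an sRGB hex string."""
    Y = YN * ((L + 16) / 116) ** 3 if L > 8 else YN * L * (3 / 29) ** 3
    up = u / (13 * L) + UN
    vp = v / (13 * L) + VN
    X = Y * 9 * up / (4 * vp)
    Z = Y * (12 - 3 * up - 20 * vp) / (4 * vp)
    linear = (
        3.2406 * X - 1.5372 * Y - 0.4986 * Z,
        -0.9689 * X + 1.8758 * Y + 0.0415 * Z,
        0.0557 * X - 0.2040 * Y + 1.0570 * Z,
    )
    channels = []
    for c in linear:
        c = min(max(c, 0.0), 1.0)  # clip anything outside the sRGB gamut
        c = 12.92 * c if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055
        channels.append(round(255 * c))
    return "#{:02x}{:02x}{:02x}".format(*channels)


def isoluminant_ramp(n=7, lightness=60.0, max_chroma=60.0,
                     hue_low=260.0, hue_high=10.0):
    """Diverging ramp: hue_low side, neutral middle, hue_high side, fixed L*."""
    ramp = []
    for i in range(n):
        t = 2 * i / (n - 1) - 1  # runs from -1 (low end) to +1 (high end)
        hue = math.radians(hue_low if t < 0 else hue_high)
        chroma = max_chroma * abs(t)
        ramp.append(luv_to_hex(lightness,
                               chroma * math.cos(hue),
                               chroma * math.sin(hue)))
    return ramp


def relative_luminance(hex_color):
    """Linear-light luminance (Y) of an sRGB hex color, for checking the ramp."""
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (1, 3, 5)]
    linear = [c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
              for c in channels]
    return 0.2126 * linear[0] + 0.7152 * linear[1] + 0.0722 * linear[2]


if __name__ == "__main__":
    for color in isoluminant_ramp():
        print(color, round(relative_luminance(color), 3))
```

<p>Printing the luminance next to each color is a quick check that no step of the ramp pops or recedes relative to its neighbors.</p>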
<p>I chose only seven gradations of color, so I&#8217;m downsampling (in a lossy way) our speed data, but further segmentation of our color ramp is not likely to be perceptible.</p>
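<p>The downsampling step amounts to equal-width binning. A small sketch in Python follows; the function name, the 70 to 100 mph range and the sample speeds are hypothetical and chosen only for illustration.</p>

```python
def speed_to_bin(speed, lo, hi, n_bins=7):
    """Map a speed to one of n_bins equal-width bins, clamping out-of-range values."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    fraction = (speed - lo) / (hi - lo)
    index = int(fraction * n_bins)
    # Values at or above hi fall in the top bin; values below lo fall in bin 0.
    return min(max(index, 0), n_bins - 1)


if __name__ == "__main__":
    # Hypothetical pitch speeds in mph.
    for mph in (72.5, 85.0, 93.1, 100.0):
        print(mph, speed_to_bin(mph, lo=70.0, hi=100.0))
```

<p>Each bin index then selects one color from a seven-step ramp; every speed inside a bin gets the same color, which is where the loss comes from.</p>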
<p>I&#8217;ve also chosen to use filled circles as my plotting symbol, as opposed to the open circles in all my previous plots.  This is done to improve the perception of each pitch&#8217;s speed via its color: small patches of color are less perceptible.  But a consequence of this choice &#8212; compounded by our choice to work with a series of smaller plots &#8212; is that more points overlap.  We&#8217;ve further degraded some of our positional information.  However, in our last step, we attempt to recover some of this.</p>
<p>Now I&#8217;ve finally brought color to bear on this visualization, but I&#8217;ve only encoded a single dimension &#8212; speed.  Which leads to another question:</p>
<h3>If color is three-dimensional, can I encode three dimensions with it?</h3>
<p>In theory, yes.  <a href="http://dataspora.com/blog/wp-content/uploads/2009/03/ware_infoviz_p142.jpg">Colin Ware researched this exact question</a>.  In practice, it&#8217;s difficult.  It turns out that asking observers to assess the amount of &#8216;redness&#8217;, &#8216;blueness&#8217;, and &#8216;greenness&#8217; of points is possible, but not intuitive (I suspect it&#8217;s somewhat like parsing symbols).</p>
<p>Another complicating factor is that a nontrivial fraction of the population has some form of color blindness.  This effectively reduces their color perception to two dimensions.</p>
<p>And finally, the truth is that our sensation of color is not equal along all dimensions; it&#8217;s thought that the closely related &#8216;red&#8217; and &#8216;green&#8217; receptors emerged via duplication of a single long-wavelength receptor (useful for distinguishing ripe from unripe fruits, according to one just-so story).</p>
<p>Because of the high level of dichromacy in the population, and because of the challenge of encoding three dimensions in color, I feel color is best used to encode no more than two dimensions of data.</p>
<p>So, for my last example of our pitching plot data, I will introduce luminosity as a means of encoding the local density of points (using a kernel density estimator).  This allows us to recover some of the data lost by increasing the sizes of our plotting symbols.</p>
<p><strong>Fig 5. Location, Pitch Type, Velocity, and Density (Villarreal, HOU, 2008)</strong></p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/03/keycolor2d_3001.png"><img class="alignnone size-full wp-image-72" title="keycolor2d_3001" src="http://dataspora.com/blog/wp-content/uploads/2009/03/keycolor2d_3001.png" alt="two-dimensional color palette" width="291" height="278" /></a></p>
<p><a href="http://dataspora.com/blog/wp-content/uploads/2009/03/stripcolor2d_4001.png"><img class="alignnone size-full wp-image-73" title="stripcolor2d_4001" src="http://dataspora.com/blog/wp-content/uploads/2009/03/stripcolor2d_4001.png" alt="multivariate color strip plot" width="400" height="185" /></a></p>
<p>Here we have effectively employed a two-dimensional color palette, with blueness-redness varying along one axis for speed, and luminosity varying in the other to denote local density.</p>
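<p>The density axis of this palette can be sketched as follows, again in Python. This is one plausible reading rather than the code behind Fig 5: the isotropic Gaussian kernel, the bandwidth, and the lightness range (85 for the sparsest point down to 35 for the densest) are all assumptions.</p>

```python
import math


def gaussian_kde_2d(points, bandwidth):
    """Estimate the density at each point with an isotropic Gaussian kernel."""
    norm = 1.0 / (2 * math.pi * bandwidth ** 2 * len(points))
    densities = []
    for x0, y0 in points:
        total = 0.0
        for x, y in points:
            d2 = (x - x0) ** 2 + (y - y0) ** 2
            total += math.exp(-d2 / (2 * bandwidth ** 2))
        densities.append(norm * total)
    return densities


def density_to_lightness(densities, light=85.0, dark=35.0):
    """Rescale densities linearly so the densest point is darkest."""
    lo, hi = min(densities), max(densities)
    span = hi - lo
    if span == 0:
        return [(light + dark) / 2 for _ in densities]
    return [light - (d - lo) / span * (light - dark) for d in densities]


if __name__ == "__main__":
    # Hypothetical pitch locations: a tight cluster plus one stray pitch.
    pitches = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
    lightness = density_to_lightness(gaussian_kde_2d(pitches, bandwidth=0.5))
    for point, value in zip(pitches, lightness):
        print(point, round(value, 1))
```

<p>Each point&#8217;s lightness would then be paired with its speed-derived hue and chroma to pick a cell from the two-dimensional palette.</p>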
<p>One final point about using luminosity.  Observing colors in a data visualization involves overloading, in the programming sense.  We rely on cognitive functions that were developed for one purpose (perceiving lions) and use them for another (perceiving lines).</p>
<p>Since we can overload color any way we want, whenever possible,  we should choose mappings that are natural.  Mapping pitch density to luminosity feels right because the darker shadows in our pitch plots imply depth.  Likewise, when sampling from the color space, we might as well choose colors found in nature.  These are the palettes our eyes were gazing at for the millions of years before #FF0000 showed up.</p>
<p>Color, used thoughtfully and responsibly, can be an incredibly valuable tool in visualizing high dimensional data.</p>
<h3>FutureMan Asks:  What about Animation?</h3>
<p>This discussion has focused on using static graphics in general, and color in particular, as a means of visualizing multivariate data.  I&#8217;ve purposely neglected one very powerful tool: motion. The ability to animate graphics multiplies by several orders of magnitude the amount of information that can be packed into a visualization.  But packing information into a time-varying data structure has to be done by someone (you or me), and in my view this remains a significant challenge.  Canonical forms of animated visualizations (equivalent to the histograms, box plots, and scatterplots of the static world) are still a ways off, but frameworks like <a href="http://processing.org">Processing</a> and <a href="http://prefuse.org/">Prefuse</a> are a promising start towards their development.</p>
<h3>Methods</h3>
<p>The final product of these five-dimensional pitch plots &#8212; for all available data for the 2008 season &#8212; can be explored via the <a href="http://labs.dataspora.com/gameday">PitchFX</a> Django-driven web tool at Dataspora labs.</p>
<p>All of the visualizations here were developed using R and the Lattice graphics package.  (Of note, Hadley Wickham is developing <a href="http://had.co.nz/ggplot2/">ggplot2</a>, a bold re-write of the R graphics system based on a grammar of graphics).</p>
<h3>References for Further Reading</h3>
<ul>
<li>Ross Ihaka - <a href="http://www.stat.auckland.ac.nz/~ihaka/120/lectures.html">Lectures on Information Visualization</a>, Lectures 12-14</li>
<li>Colin Ware - <a href="http://www.amazon.com/Information-Visualization-Second-Interactive-Technologies/dp/1558608192">Information Visualization</a>, Ch. 4</li>
<li>Edward Tufte - <a href="http://www.amazon.com/Envisioning-Information-Edward-R-Tufte/dp/0961392118">Envisioning Information</a>, Ch. 4</li>
<li>Deepayan Sarkar - <a href="http://lmdvr.r-forge.r-project.org">Lattice: Multivariate Data Visualization with R</a> (web site with code)</li>
<li>Maureen Stone - <a href="http://www.stonesc.com/">StoneSoup Consulting</a> (color consultant to Tableau Software)</li>
<li>Stephen Few - <a href="http://www.amazon.com/Information-Dashboard-Design-Effective-Communication/dp/0596100167">Information Dashboard Design</a>, Ch. 4</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/how-to-color-multivariate-data/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/how-to-color-multivariate-data/</feedburner:origLink></item>
		<item>
		<title>People who love scatter plots &amp; connecting dots</title>
		<link>http://feedproxy.google.com/~r/data-evolution/~3/_ps1Q8A3iHQ/</link>
		<comments>http://dataspora.com/blog/dataviz-sf/#comments</comments>
		<pubDate>Fri, 20 Feb 2009 06:02:34 +0000</pubDate>
		<dc:creator>Michael E. Driscoll</dc:creator>
		
		<category><![CDATA[R]]></category>

		<category><![CDATA[analytics]]></category>

		<category><![CDATA[dataviz]]></category>

		<category><![CDATA[sabermetrics]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=54</guid>
		<description><![CDATA[
We hosted the first Dataviz Salon SF on Tuesday night, with lightning talks by boredom cop  Shane Booth, dataviz wiz  Lee Byron , computational journalist Brad Stenger, data wrangler  Pete Skomoroch , and any/all data enthusiast  Brendan O&#8217;Connor .
I was going to blog all about it &#8212; but Tom Carden of [...]]]></description>
			<content:encoded><![CDATA[<p><img title="dataviz-sf" src="http://dataspora.com/blog/wp-content/uploads/2009/02/dataviz_salon_poster_smal.jpg" alt="" /><br />
We hosted the first Dataviz Salon SF on Tuesday night, with lightning talks by boredom cop <a href="http://criminalizeboring.tumblr.com/"> Shane Booth</a>, dataviz wiz <a href="http://www.leebyron.com"> Lee Byron </a>, computational journalist <a href="http://nbagraphs.tumblr.com">Brad Stenger</a>, data wrangler <a href="http://www.datawrangling.com"> Pete Skomoroch </a>, and any/all data enthusiast <a href="http://www.anyall.org/blog"> Brendan O&#8217;Connor </a>.</p>
<p>I was going to blog all about it &#8212; but <a href="http://www.tom-carden.co.uk/2009/02/18/dataviz-salon-sf-1/">Tom Carden of Stamen Design already has a great write-up</a>.</p>
<blockquote><p>&#8230; Dataspora invited a few people to a Dataviz Salon yesterday evening. Mike and I went along and huddled in a brick-built basement in SoMa to listen to <a href="http://www.tom-carden.co.uk/2009/02/18/dataviz-salon-sf-1/">the following</a>:</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://dataspora.com/blog/dataviz-sf/feed/</wfw:commentRss>
		<feedburner:origLink>http://dataspora.com/blog/dataviz-sf/</feedburner:origLink></item>
	</channel>
</rss>
