<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>www.BenStopford.com</title>
	
	<link>http://www.benstopford.com</link>
	<description>Gently flexing the grid</description>
	<lastBuildDate>Wed, 27 Mar 2013 12:58:15 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/benstopford" /><feedburner:info uri="benstopford" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
		<title>The Return of Big Iron? (Big Data 2013)</title>
		<link>http://feedproxy.google.com/~r/benstopford/~3/2Rx7WpAMgb4/</link>
		<comments>http://www.benstopford.com/2013/03/27/the-return-of-big-iron-big-data-2013/#comments</comments>
		<pubDate>Wed, 27 Mar 2013 12:57:51 +0000</pubDate>
		<dc:creator>ben</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://www.benstopford.com/?p=2616</guid>
		<description><![CDATA[ 
  The return of big iron?  from Ben Stopford 
]]></description>
			<content:encoded><![CDATA[<p><iframe src="http://www.slideshare.net/slideshow/embed_code/17756085" width="427" height="356" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC;border-width:1px 1px 0;margin-bottom:5px" allowfullscreen webkitallowfullscreen mozallowfullscreen> </iframe>
<div style="margin-bottom:5px"> <strong> <a href="http://www.slideshare.net/benstopford/the-return-of-big-iron" title="The return of big iron?" target="_blank">The return of big iron?</a> </strong> from <strong><a href="http://www.slideshare.net/benstopford" target="_blank">Ben Stopford</a></strong> </div>
<img src="http://feeds.feedburner.com/~r/benstopford/~4/2Rx7WpAMgb4" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.benstopford.com/2013/03/27/the-return-of-big-iron-big-data-2013/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.benstopford.com/2013/03/27/the-return-of-big-iron-big-data-2013/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>Slides from Advanced Databases Lecture 27/11/12</title>
		<link>http://feedproxy.google.com/~r/benstopford/~3/x8tAOsAicDQ/</link>
		<comments>http://www.benstopford.com/2012/11/28/slides-from-advanced-databases-lecture-271112/#comments</comments>
		<pubDate>Wed, 28 Nov 2012 07:26:39 +0000</pubDate>
		<dc:creator>ben</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://www.benstopford.com/?p=2606</guid>
		<description><![CDATA[The slides from yesterday&#8217;s guest lecture on NoSQL, NewSQL and Big Data can be found here.
]]></description>
			<content:encoded><![CDATA[<p>The slides from yesterday&#8217;s guest lecture on NoSQL, NewSQL and Big Data can be found <a href="http://www.slideshare.net/slideshow/embed_code/11942089">here</a>.</p>
<img src="http://feeds.feedburner.com/~r/benstopford/~4/x8tAOsAicDQ" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.benstopford.com/2012/11/28/slides-from-advanced-databases-lecture-271112/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.benstopford.com/2012/11/28/slides-from-advanced-databases-lecture-271112/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>Big Data &amp; the Enterprise</title>
		<link>http://feedproxy.google.com/~r/benstopford/~3/0MBehll7yXQ/</link>
		<comments>http://www.benstopford.com/2012/11/22/big-data-the-enterprise-2/#comments</comments>
		<pubDate>Thu, 22 Nov 2012 18:58:27 +0000</pubDate>
		<dc:creator>ben</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://www.benstopford.com/?p=2595</guid>
		<description><![CDATA[Slides from today&#8217;s European Trading Architecture Summit 2012 are here.
 
  Big Data &#38; the Enterprise  from Ben Stopford 
]]></description>
			<content:encoded><![CDATA[<p>Slides from today&#8217;s European Trading Architecture Summit 2012 are <a href="http://www.slideshare.net/slideshow/embed_code/15302384">here</a>.</p>
<p><iframe src="http://www.slideshare.net/slideshow/embed_code/15302384" width="550" height="430" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC;border-width:1px 1px 0;margin-bottom:5px" allowfullscreen webkitallowfullscreen mozallowfullscreen> </iframe>
<div style="margin-bottom:5px"> <strong> <a href="http://www.slideshare.net/benstopford/the-big-data-conundrum" title="Big Data &amp; the Enterprise" target="_blank">Big Data &amp; the Enterprise</a> </strong> from <strong><a href="http://www.slideshare.net/benstopford" target="_blank">Ben Stopford</a></strong> </div>
<img src="http://feeds.feedburner.com/~r/benstopford/~4/0MBehll7yXQ" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.benstopford.com/2012/11/22/big-data-the-enterprise-2/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://www.benstopford.com/2012/11/22/big-data-the-enterprise-2/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>The Big Data Conundrum</title>
		<link>http://feedproxy.google.com/~r/benstopford/~3/-43Ifiz_tUM/</link>
		<comments>http://www.benstopford.com/2012/11/10/the-big-data-conundrum/#comments</comments>
		<pubDate>Sat, 10 Nov 2012 12:31:31 +0000</pubDate>
		<dc:creator>ben</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Distributed Data Storage]]></category>
		<category><![CDATA[Top4]]></category>

		<guid isPermaLink="false">http://www.benstopford.com/?p=2561</guid>
		<description><![CDATA[Whilst Big Data contains the promise of fame and fortune, the signal-to-noise ratio is low. How do you make sense of the marketing blurb?]]></description>
			<content:encoded><![CDATA[<p>I attended an interesting talk at JAX earlier this year by a guy called Ian Polsker, from Riak, somewhat amusingly entitled &#8216;TheBigDataCon&#8217; (worth <a href="http://www.slideshare.net/jaxlondon2012/the-big-data-con-why-big-data-is-a-problem-not-a-solution">a look</a> by the way &#8211; the slides are good). Ian makes a little fun of all the current hype, joking that vendors seem to be the only people actually monetising Big Data. I think we can&#8217;t help but be a little cynical of anything that has this much hype.</p>
<p>On another level the term has become overloaded. It has many definitions; Oracle, for example, talks about Big Data in a very different way to, say, MapR. It seems to broadly boil down to two angles though:</p>
<ul>
<li>The promise of greater insight using the huge amounts of data we produce</li>
<li>A change in the technologies we use to crunch our way through the data we have (or expect to have)</li>
</ul>
<p>Like any other commodity, the harder it is to extract, the more it costs. The aspirational, needle-in-a-haystack concept that drives much of the marketing paraphernalia is certainly real and should not be ignored. However the hype around the &#8216;hidden insight&#8217; thing masks a more fundamental, and grounded point: the technology shift that facilitates all this.</p>
<p>There is a view that today&#8217;s data is &#8216;big&#8217; and that having big data means some form of MapReduce. Yet it is not size that really matters. Both relational and NoSQL camps can deal with the data volumes (and even, for the most part, the <a href="http://blogs.sas.com/content/sascom/2012/04/12/turning-big-data-volume-variety-and-velocity-into-value/">three Vs</a> in one way or another). eBay, for example, runs a 20PB+ database. Yahoo and Google both have larger MR clusters, but not that much larger. For most problems data volume alone is not enough to make a sensible technology choice (and I&#8217;d contest that any of the Vs were really enough either). As the academic world likes to keep reminding us (<a href="http://vldb.org/pvldb/vol5/p1712_avriliafloratou_vldb2012.pdf">here</a> and <a href="http://www.cse.nd.edu/~dthain/courses/cse598z/spring2010/benchmarks-sigmod09.pdf">here</a>) performance is not the reason to pick up a big data technology. The reason is that these new technologies embrace a very different approach to data analysis, particularly in the context of the whole &#8216;lifecycle&#8217; of our data analysis work. Big Data technologies decouple us from some of the shackles that make big data problems hard. However, there is no free lunch and they come with some shackles of their own.</p>
<p>A core difference is the ability to define a schema at runtime, rather than upfront. That alone is a powerful, game-changing idea. Dave Campbell put it well in his <a href="http://www.vldb.org/2011/files/slides/keynotes/campbell_keynote.pptx">VLDB keynote</a> when he said the ‘ability to model data is much more of a gating factor than raw size, particularly when considering new forms of data’. Modelling data, getting it into a form we can interpret and understand, can be a longwinded and painful task, and something we must do before we can do anything useful with it.</p>
<blockquote><p><em>Our ability to model data is much more of a gating factor than raw size</em></p></blockquote>
<p>Traditional databases push us to model our data before we store it. Big data solutions often leave their data in its natural form. A ‘virtual schema’ is bound at runtime. This concept of binding the schema ‘late’ is powerful. It allows the interpretation of the data to be changed at any time without having to change the physical format of the data on disk, something that becomes increasingly important as the size of the dataset increases. The downside of not imposing a schema at the point of ingestion is that keeping old forms of data &#8216;current&#8217; becomes an increasingly difficult task. That&#8217;s to say that the client is left with the problem of handling many data representations. Fine if the model is free text, tougher if the model has any real structure (explicit or implicit).</p>
<blockquote><p><em>The concept of binding the schema ‘late’, with the data held in its natural form, is powerful.</em></p></blockquote>
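<p>To make the late-binding idea concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the record layout, the <code>read_with_schema</code> helper and the <code>PAYMENTS_V1</code> schema are all hypothetical, not taken from any particular product. The raw lines are stored untouched and the schema is applied only when they are read.</p>

```python
import json

# Raw events kept exactly as they arrived; nothing is modelled at write time.
RAW_LINES = [
    '{"user": "alice", "amount": "12.50", "ts": "2012-11-10"}',
    '{"user": "bob", "amt": 7, "ts": "2012-11-11", "note": "refund"}',
]


def read_with_schema(lines, schema):
    """Bind a 'virtual schema' at read time.

    schema maps an output field to (candidate source keys, converter).
    """
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, (keys, convert) in schema.items():
            value = next((record[k] for k in keys if k in record), None)
            row[field] = convert(value) if value is not None else None
        yield row


# Hypothetical schema: changing it reinterprets the data without rewriting it.
PAYMENTS_V1 = {
    "user": (["user"], str),
    "amount": (["amount", "amt"], float),
}

rows = list(read_with_schema(RAW_LINES, PAYMENTS_V1))
```

<p>Note where the cost lands: the schema has to know that older records say <code>amount</code> and newer ones say <code>amt</code>. That is the &#8216;many data representations&#8217; problem being pushed onto the client.</p>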
<p>Big Data technologies offer very different performance profiles to relational analytics tools. The lack of indexing and overarching structure means inserts are fast, making them suitable for high velocity systems and batch processing. The imperative interface and the absence of a schema make diverse, ad hoc analytics hard though. Instead they work best for specific, well-defined data operations (I often use the data enabled grid analogy). It is likely this that has driven Big Data leaders, Google and more recently Hadoop, to add more database-like features to their products (Dremel/Megastore/Spanner providing SQL-like interfaces and ACID semantics). Yet it&#8217;s much harder to optimise in a late-bound world; no big data solution today comes close to the raw performance of the top end analytics engines.</p>
<blockquote><p><em>Most of today&#8217;s databases are hindered hugely by needing the schema to be defined upfront.</em></p></blockquote>
<p>The last thing to consider is the cost of change. For simple data sets it is less apparent. Start joining data sets together though and it becomes a different ball game. Whilst possible with Big Data technologies, it&#8217;s just going to cost more, and managing the complexity in the absence of a schema becomes an increasingly uphill struggle. In this case, better to stick in the relational world (for now at least).</p>
<p>However most of today&#8217;s databases have the HUGE disadvantage that the schema needs to be defined, and the data understood, upfront. Great for simple, well defined business data, but if you&#8217;re searching free text, machine generated data or simply a hugely diverse data population (like the data that gets thrown around most big organisations) it&#8217;s simply not practical or maybe not possible to understand, and model, the data upfront. By applying the schema later in the cycle the cost of change, the availability of insight and the inherent feedback cycles can all be improved.</p>
<p><img class="alignright" title="Merging" src="http://www.benstopford.com/wp-content/uploads/2012/06/Merging-300x265.png" alt="" width="300" height="265" /></p>
<p><strong>As for the future?</strong></p>
<p>You’ll probably have noticed that every database vendor worth their salt now has some form of Big Data offering, be it bought in, ‘tacked on’ or genuinely integrated. Likewise the Big Data vendors are looking more and more like their relational counterparts, sprouting query languages, loose schemas, columnar storage, indexing, even elements of transactionality. The two camps are converging.</p>
<p>Many of the new set of relational technologies look more like MapReduce than they do like System R (IBM’s original relational database). Yet the majority of the database community still seem to be lurking in the corner of the playground, wearing anoraks and murmuring (although these days the anoraks are made by Armani). They are a long way from penetrating the progressive Internet space. Joe Hellerstein’s <a href="http://db.cs.berkeley.edu/jmh/talks/hpts2001-we-lose.pdf">words</a> still ring true today.</p>
<blockquote><p><em>The new cool kids of the database world are making their mark with technologies of the moment, backed with a hefty dose of academic acumen.</em></p></blockquote>
<p>The future is likely to be one of convergence, and redirecting the database community is undoubtedly good. In fact possibly the most useful thing the NoSQL movement has done has been to give a well timed boot to the database world&#8217;s behind, reminding them that they need to listen to their consumers. They got stuck in a rut and the internet space wasn&#8217;t going to wait around. Convergence over some newly shared values that sit between the two camps is of course inevitable, and welcomed.</p>
<blockquote><p><em>The database world got stuck in its ways and the internet space wasn&#8217;t going to wait around.</em></p></blockquote>
<p>The evidence is quite plain already. There are a host of young (ish) upstart technologies hitting the space. The number of shared nothing analytics engines has significantly increased (Asterdata, Vertica, VoltDB, Exasol, ParAccel, Greenplum, Hana; the list goes on) and their benchmarks are extremely impressive. There are hybrid engines mixing MapReduce engines with smarter storage, routing and indexing strategies. <a href="http://hadapt.com/">Hadapt</a> and <a href="http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/">Impala</a> are good examples. The former particularly as it is the one that probably best personifies the blending of the two worlds.</p>
<blockquote><p><em>These new upstart database technologies redefine the current mainstream with, not in spite of, the lessons of the past.</em></p></blockquote>
<p>Finally there are some interesting one-stop-shop approaches: holistic solutions that span dynamic schema provisioning and data access, all the way to presentation, in a single package. Originating in the machine generated data space, <a href="http://www.splunk.com/">Splunk</a> (dominant) and <a href="http://logscape.com/">Logscape</a> (scalable) are the current leaders in this space and there is likely to be a lot more activity. For answering the what-if questions or assembling high level MI stacks these all-inclusive solutions get the closest to answering the more insightful questions we have today.</p>
<p>Whether this ever breaks the stranglehold clenched by the oligopoly of key database players remains to be seen. Michael Stonebraker still rains disdain on the NoSQL world even today [<a href="http://cacm.acm.org/blogs/blog-cacm/149074-possible-hadoop-trajectories/fulltext">see here</a>]. He may be outspoken, he may come across like a bit of **** at times, but it is unlikely that he is wrong. The solutions of the future will not be the pure (and relatively simplistic) MapReduce of today. They will be blends that protect our data, even at scale. For me the new technologies coming from both camps are exciting as they redefine the current mainstream thinking <em>with, not in spite of</em>, the lessons of the past.</p>
<p>Related posts:</p>
<ul>
<li><a title="&lt;p&gt;Joe Hellerstein, from Berkeley, did a fascinating talk at the ‘High Performance Transaction Systems Workshop’ (HPTS) way back in 2001 entitled “We Lose”. It’s a retrospective on the state of the database field just after the dot-com bubble focussing particularly on their lack of uptake with the young internet companies of that time. He observes (and I’m paraphrasing) that the [...]&lt;/p&gt; " href="http://www.benstopford.com/2012/07/25/thoughts-on-big-data-technologies-4-our-love-hate-relationship-with-the-relational-database/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">Thoughts on Big Data Technologies (4): Our Love-Hate relationship with the Relational Database </a>(2012)</li>
<li><a title="&lt;p&gt;Despite (or maybe because of) the huge amount of  hype in recent years MapReduce still has many vocal opponents. On one side its focus on local rather than global consistency, a lack of schemas, an architecture that embraces the unreliable network and natural support semi-structured or unstructured data have made us reconsider the use of incumbent [...]&lt;/p&gt; " href="http://www.benstopford.com/2012/07/19/thoughts-on-big-data-technologies-part-3-objections-worth-thinking-about/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">Thoughts on Big Data Technologies (3): Objections Worth Thinking About </a>(2012)</li>
<li><a title="&lt;p&gt;So size isn’t really the driving factor for Big Data technologies, it’s more about the form of the data itself, but size still causes us a lot of problems. Technologies inevitably hit bottlenecks in the presence of increasingly large data sets so it is worth quantifying what we really mean by ‘Big’ when we say [...]&lt;/p&gt; " href="http://www.benstopford.com/2012/07/14/thoughts-on-big-data-technologies-part-2-how-big-is-big/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">Thoughts on Big Data Technologies (2): How big is Big? </a>(2012)</li>
<li><a title="&lt;p&gt;It may not have been its intention, but the undercurrent of the NoSQL movement seems to be something of a two-finger salute to the apathy of the database community. A community that was once the height of technological innovation seems to have sat on its laurels in recent years, propped up by the lucrative support contracts of its corporate dependents. NoSQL and [...]&lt;/p&gt; " href="http://www.benstopford.com/2012/06/30/thoughts-on-big-data-technologies-part-1/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">Thoughts on Big Data Technologies (1) </a>(2012)</li>
</ul>
<img src="http://feeds.feedburner.com/~r/benstopford/~4/-43Ifiz_tUM" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.benstopford.com/2012/11/10/the-big-data-conundrum/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		<feedburner:origLink>http://www.benstopford.com/2012/11/10/the-big-data-conundrum/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>Problems with Feature Branches</title>
		<link>http://feedproxy.google.com/~r/benstopford/~3/QgLiHgxstK4/</link>
		<comments>http://www.benstopford.com/2012/11/10/problems-with-feature-branches/#comments</comments>
		<pubDate>Sat, 10 Nov 2012 09:41:17 +0000</pubDate>
		<dc:creator>ben</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://www.benstopford.com/?p=2521</guid>
		<description><![CDATA[Over the last few years we&#8217;ve had a fair few discussions around the various different ways to branch and how they fit into a world of Continuous Integration (and more recently Continuous Delivery). It&#8217;s so fundamental that it&#8217;s worth a post of its own!
Dave Farley (the man that literally wrote the book on it) penned the [...]]]></description>
			<content:encoded><![CDATA[<p>Over the last few years we&#8217;ve had a fair few discussions around the various different ways to branch and how they fit into a world of Continuous Integration (and more recently Continuous Delivery). It&#8217;s so fundamental that it&#8217;s worth a post of its own!</p>
<p>Dave Farley (the man that literally wrote the book on it) penned the best advice I&#8217;ve seen on the topic a while back. Worth a read, or even a reread (it gets better towards the end).</p>
<p><a href="http://www.davefarley.net/?p=160">http://www.davefarley.net/?p=160</a> (in case Dave&#8217;s somewhat flaky site is down again the article is republished <a href="http://www.siteminds.nl/index.php/2012/01/dont-feature-branch/">here</a>)</p>
<img src="http://feeds.feedburner.com/~r/benstopford/~4/QgLiHgxstK4" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.benstopford.com/2012/11/10/problems-with-feature-branches/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.benstopford.com/2012/11/10/problems-with-feature-branches/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>The Best of VLDB 2012 (Very Large Database Conference)</title>
		<link>http://feedproxy.google.com/~r/benstopford/~3/0mRl5FxACQA/</link>
		<comments>http://www.benstopford.com/2012/10/28/the-best-of-vldb-2012/#comments</comments>
		<pubDate>Sun, 28 Oct 2012 16:06:15 +0000</pubDate>
		<dc:creator>ben</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Links]]></category>

		<guid isPermaLink="false">http://www.benstopford.com/?p=2533</guid>
		<description><![CDATA[Here are some of the highlights of the 210 papers presented at VLDB earlier this year. You can find the full list here.
From Cooperative Scans to Predictive Buffer Management (here)
Intriguing paper from the Vectorwise guys for improving IO efficiency under load. LRU/MRU caching policies are known to break down under large, concurrent workloads. SQL Server [...]]]></description>
			<content:encoded><![CDATA[<p>Here are some of the highlights of the 210 papers presented at VLDB earlier this year. You can find the full list <a href="http://www.vldb.org/pvldb/vol5/">here</a>.</p>
<p><strong>From Cooperative Scans to Predictive Buffer Management </strong>(<a href="http://arxiv.org/pdf/1208.4170.pdf">here</a>)</p>
<p>Intriguing paper from the Vectorwise guys on improving IO efficiency under load. LRU/MRU caching policies are known to break down under large, concurrent workloads. SQL Server and DB2 both have mechanisms for sharing IO between queries (by attaching to an existing scan or throttling faster queries so that IO can be shared). The Cooperative Scans approach discussed here takes this a step further by incorporating an active buffer manager with which scans register their interest in data. The manager then adaptively chooses which pages to load and pass to the various concurrent requests.</p>
<p>There is another related paper at this conference: SharedDB: Killing One Thousand Queries With One Stone (<a href="http://www.vldb.org/pvldb/vol5/p526_georgiosgiannikis_vldb2012.pdf">here</a>).</p>
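<p>As a rough illustration of the &#8216;scans register interest, manager picks the page&#8217; idea, here is a toy sketch. The <code>ToyBufferManager</code> class and its greedy most-wanted-page policy are my own assumptions for illustration; the paper&#8217;s actual scheduling policy is more sophisticated than this.</p>

```python
from collections import Counter


class ToyBufferManager:
    """Toy model: scans register the pages they need, the manager picks what to load."""

    def __init__(self):
        self.interest = {}  # maps scan name to the set of page ids it still needs

    def register(self, scan, pages):
        self.interest[scan] = set(pages)

    def next_page(self):
        """Pick the page wanted by the most scans (ties go to the lowest id)."""
        demand = Counter(p for pages in self.interest.values() for p in pages)
        if not demand:
            return None
        return min(demand, key=lambda p: (-demand[p], p))

    def load_and_deliver(self):
        """Load one page and hand it to every scan that was waiting for it."""
        page = self.next_page()
        if page is None:
            return None
        served = sorted(s for s, pages in self.interest.items() if page in pages)
        for s in served:
            self.interest[s].discard(page)
        return page, served


manager = ToyBufferManager()
manager.register("q1", [1, 2, 3])
manager.register("q2", [2, 3, 4])
manager.register("q3", [3])

loads = []
while True:
    step = manager.load_and_deliver()
    if step is None:
        break
    loads.append(step)
```

<p>In this example the three scans ask for seven pages between them, but only four physical loads are needed because each loaded page is shared with everyone waiting on it.</p>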
<p><strong>Processing a Trillion Cells per Mouse Click (Google) </strong>(<a href="http://www.vldb.org/pvldb/vol5/p1436_alexanderhall_vldb2012.pdf">here</a>)</p>
<p>Interesting paper from Google suggesting an alternative to the approach to column orientation taken in Dremel. PowerDrill uses a double-dictionary encoded column store where the encodings live largely in memory. Further optimisations are made at load time to ensure minimal access to persistent storage. This makes it more akin to column stores like ParAccel or Vectorwise, applied to analytical workloads (aggregates, group bys etc).</p>
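<p>Dictionary encoding is easiest to see with a small example. The sketch below is my reading of the double-dictionary idea (one global dictionary for the column, plus a small dictionary per chunk that refers to global ids); the function names and chunk layout are assumptions for illustration rather than PowerDrill&#8217;s actual format.</p>

```python
def dictionary_encode(values):
    """Return (sorted dictionary, list of integer codes into that dictionary)."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]


def double_encode(values, chunk_size):
    """Global dictionary for the column, plus a per-chunk dictionary of global ids."""
    global_dict, global_codes = dictionary_encode(values)
    chunks = []
    for start in range(0, len(global_codes), chunk_size):
        chunk = global_codes[start:start + chunk_size]
        chunk_dict, local_codes = dictionary_encode(chunk)
        chunks.append((chunk_dict, local_codes))
    return global_dict, chunks


def decode(global_dict, chunks):
    """Follow local code, then chunk dictionary, then global dictionary."""
    return [global_dict[chunk_dict[c]] for chunk_dict, codes in chunks for c in codes]


countries = ["uk", "uk", "de", "uk", "fr", "fr", "de", "de"]
global_dict, chunks = double_encode(countries, chunk_size=4)
```

<p>The per-chunk dictionaries are small, so the local codes need very few bits, and a chunk whose dictionary does not contain the value being searched for can be skipped without touching its data.</p>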
<p><strong>Can the elephants handle the NoSQL onslaught </strong>(<a href="http://vldb.org/pvldb/vol5/p1712_avriliafloratou_vldb2012.pdf">here</a>)</p>
<p>Another paper comparing the performance of Hadoop with a relational database (in a similar vein to the SIGMOD 09 paper DeWitt published previously <a href="http://www.cse.nd.edu/~dthain/courses/cse598z/spring2010/benchmarks-sigmod09.pdf">here</a>). I sympathise with the message &#8211; databases outperform Hadoop on small to medium workloads &#8211; but I hope that most people know that already. This time the comparison is with Microsoft&#8217;s SQL Server PDW (Parallel Data Warehouse). The choice of data sizes between 250GB and 16TB means that the study has the same failing as the previous SIGMOD one; it&#8217;s not looking at large dataset performance.</p>
<p><strong>Interactive Query Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads </strong>(<a href="http://www.eecs.berkeley.edu/~alspaugh/papers/mapred_workloads_vldb_2012.pdf">here</a>)</p>
<p>Useful, empirically driven paper. Chen et al. studied Hadoop deployments at a number of companies, including Facebook, and provide detailed data sets from each. It hints at the current &#8216;elephant in the room&#8217; that is Hadoop&#8217;s focus on batch-time over real-time performance (roll on <a href="https://github.com/cloudera/impala">Impala</a>!). Having data of this level of granularity over a range of real time systems is quite valuable in itself. They note that 90% of jobs are small (resulting in MBs of data returned).</p>
<p><strong>High-Performance Concurrency Control Mechanisms for Main-Memory Databases</strong> (<a href="http://www.vldb.org/pvldb/vol5/p298_per-akelarson_vldb2012.pdf">here</a>)</p>
<p>Proposes an optimistic MVCC method for in memory concurrency control. The conclusion: single-version locking performs well only when transactions are short and contention is low; higher contention or workloads including some long transactions favor the multiversion methods, and the optimistic method performs better than the pessimistic one.</p>
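<p>For anyone unfamiliar with the optimistic multiversion approach, here is a heavily simplified sketch. It is a toy of my own (the <code>VersionedStore</code> class is hypothetical and is not the paper&#8217;s design): transactions read from a snapshot, buffer their writes, and are validated only at commit. It checks reads alone, so blind write-write conflicts are not detected.</p>

```python
class VersionedStore:
    """Toy multiversion store with optimistic, validate-at-commit transactions."""

    def __init__(self):
        self.versions = {}  # maps key to a list of (commit_ts, value), oldest first
        self.clock = 0

    def begin(self):
        return {"start": self.clock, "reads": set(), "writes": {}}

    def read(self, txn, key):
        txn["reads"].add(key)
        if key in txn["writes"]:
            return txn["writes"][key]
        # Newest version committed at or before the transaction's snapshot.
        visible = [v for ts, v in self.versions.get(key, []) if ts <= txn["start"]]
        return visible[-1] if visible else None

    def write(self, txn, key, value):
        txn["writes"][key] = value  # buffered until commit, no locks taken

    def commit(self, txn):
        # Validation: abort if anything we read has gained a newer version.
        for key in txn["reads"]:
            history = self.versions.get(key, [])
            if history and history[-1][0] > txn["start"]:
                return False
        self.clock += 1
        for key, value in txn["writes"].items():
            self.versions.setdefault(key, []).append((self.clock, value))
        return True


store = VersionedStore()
setup = store.begin()
store.write(setup, "balance", 100)
store.commit(setup)

t1, t2 = store.begin(), store.begin()
store.write(t1, "balance", store.read(t1, "balance") + 10)
store.write(t2, "balance", store.read(t2, "balance") - 30)
first_ok = store.commit(t1)   # commits: nothing it read has changed
second_ok = store.commit(t2)  # fails validation: "balance" changed underneath it
final_balance = store.read(store.begin(), "balance")
```

<p>Neither transaction ever blocks the other. The loser simply fails validation and would be retried, which is why the approach copes well when some transactions are long.</p>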
<p><strong>Blink and It’s Done: Interactive Queries on Very Large Data</strong> (<a href="http://www.vldb.org/pvldb/vol5/p1902_sameeragarwal_vldb2012.pdf">here</a>)</p>
<p>Blink is different to the mainstream database as it&#8217;s not designed to give you an exact answer. Instead you specify either error (confidence) or maximum time constraints on your query. The approach uses a number of sampling based strategies to achieve the required confidence level. There is a related paper: Model-based Integration of Past &amp; Future in TimeTravel (<a href="http://www.vldb.org/pvldb/vol5/p1974_mohamedekhalefa_vldb2012.pdf">here</a>)</p>
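<p>The flavour of &#8216;give me an answer within this error bound&#8217; can be sketched with plain uniform sampling. To be clear about assumptions: the function names are mine, the error bound is a simple normal-approximation interval, and Blink itself uses more elaborate sampling strategies than this.</p>

```python
import math
import random


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean from a uniform sample, with a rough 95% error margin."""
    if sample_size < 2:
        raise ValueError("need at least two sampled values to estimate variance")
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    mean = sum(sample) / sample_size
    variance = sum((x - mean) ** 2 for x in sample) / (sample_size - 1)
    margin = z * math.sqrt(variance / sample_size)
    return mean, margin


def mean_within(values, max_error, start=100):
    """Double the sample until the margin is tight enough, or the data runs out."""
    size = min(start, len(values))
    while True:
        mean, margin = approximate_mean(values, size)
        if margin <= max_error or size == len(values):
            return mean, margin, size
        size = min(size * 2, len(values))


latencies = [float(i % 50) for i in range(100_000)]
estimate, error, rows_read = mean_within(latencies, max_error=0.5)
```

<p>The point is the trade: the caller states how much error is acceptable and the engine reads only as many rows as it needs to get there.</p>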
<p><strong>Developing and Analyzing XSDs through BonXai</strong> (<a href="http://www.vldb.org/pvldb/vol5/p1994_wimmartens_vldb2012.pdf">here</a>)</p>
<p>This one struck a chord with me as I&#8217;m not the biggest fan of XSD. BonXai provides an expression-based rather than type-based approach to defining the data schema. More info <a href="http://doclib.uhasselt.be/dspace/bitstream/1942/1957/1/dbpl07.pdf">here</a> and <a href="http://ls1-www.cs.tu-dortmund.de/~niewerth/bonxai/bonxai-specification.pdf">here</a>.</p>
<p><strong>B+-tree Index Optimization by Exploiting Internal Parallelism of Flash-based Solid State Drives </strong>(<a href="http://www.vldb.org/pvldb/vol5/p286_hongchanroh_vldb2012.pdf">here</a>)</p>
<p>SSD performance increases (initially) with the number of concurrent executions (in stark contrast with magnetic drives). This paper looks into maximising this with the use of concurrent B-trees that utilise parallel IO. Useful research as flash is only going to get cheaper.</p>
<p><strong>SCOUT: Prefetching for Latent Structure Following Queries </strong>(<a href="http://infoscience.epfl.ch/record/176914/files/scout-cr.pdf">here</a>)</p>
<p>I quite like the ideas in this paper around prefetching data based on a known structure (probably because it&#8217;s similar to some of the stuff we do).</p>
<p><strong>Fast Updates on Read-Optimized Databases Using Multi-Core CPUs</strong> (<a href="http://www.vldb.org/pvldb/vol5/p061_jenskrueger_vldb2012.pdf">here</a>)</p>
<p>Addresses the problem some columnar architectures suffer where they accumulate writes in a separate partition, which must be periodically merged with the read-optimised main one.</p>
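<p>The main/delta split is simple to picture in code. This is a single-threaded toy under my own assumptions (the <code>ColumnWithDelta</code> class is hypothetical); the paper is concerned with making the merge step fast on multi-core hardware, which this sketch does not attempt.</p>

```python
import bisect


class ColumnWithDelta:
    """Sorted, read-optimised main partition plus an append-only write buffer."""

    def __init__(self, values=()):
        self.main = sorted(values)  # read-optimised: kept sorted
        self.delta = []             # write-optimised: plain appends

    def insert(self, value):
        self.delta.append(value)

    def count_at_most(self, limit):
        # Binary search on main, linear scan over the (hopefully small) delta.
        in_main = bisect.bisect_right(self.main, limit)
        in_delta = sum(1 for v in self.delta if v <= limit)
        return in_main + in_delta

    def merge(self):
        """The periodic merge: fold the delta into a new sorted main."""
        self.main = sorted(self.main + self.delta)
        self.delta = []


column = ColumnWithDelta([10, 20, 30])
column.insert(25)
column.insert(5)
before_merge = column.count_at_most(25)
column.merge()
after_merge = column.count_at_most(25)
```

<p>Queries give the same answer before and after the merge; what changes is that reads get slower as the delta grows, which is why the merge cost matters.</p>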
<p><strong>FDB: A Query Engine for Factorised Relational Databases</strong> (<a href="http://www.vldb.org/pvldb/vol5/p1232_nurzhanbakibayev_vldb2012.pdf">here</a>)</p>
<p>I hadn&#8217;t come across the idea of Factorised Databases before. An interesting concept. The paper demonstrates performance improvements over traditional methods for many-to-many join criteria.</p>
<p><strong>Only Aggressive Elephants are Fast Elephants</strong> (<a href="http://www.vldb.org/pvldb/vol5/p1591_jensdittrich_vldb2012.pdf">here</a>)</p>
<p>Interesting approach to indexing Hadoop that claims to improve both read and write performance. I couldn&#8217;t find the code though so couldn&#8217;t try it.</p>
<p><strong>The Vertica Analytic Database: C-Store 7 Years Later</strong> (<a href="http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf">here</a>)</p>
<p>A good summary of this mature shared-nothing, columnar database. They discuss their use of super projections over join indexes, due to the overheads associated with tuple construction and the verbosity of storing the associated rowids. There is a summary of the encoding types used as well as partitioning and locking strategies.</p>
<p><strong>Muppet: MapReduce-Style Processing of Fast Data</strong> (<a href="http://www.vldb.org/pvldb/vol5/p1814_wanglam_vldb2012.pdf">here</a>)</p>
<p>Whilst the majority of MapReduce commentary focuses on improving MR query performance, this paper looks at the problem of ingesting data quickly for high throughput, streaming workloads. The interesting approach focuses on data as streams (in and out) in association with a moving historical window (which they call a slate). To me there seems to be a lot of similarity between this approach and the one taken by products like <a href="http://www.streambase.com/">StreamBase</a> and <a href="http://www.cloudscale.com/">Cloudscale</a> but the authors differentiate themselves by being less schema oriented, more akin to the traditional MR style.</p>
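<p>A small sketch of the map-then-update style may help. It is my own simplification: I treat a slate as per-key running state that is kept between events, and the hashtag-counting example and function names are hypothetical rather than anything from the paper.</p>

```python
from collections import defaultdict


def map_event(event):
    """Map step: turn one incoming event into (key, value) pairs."""
    for word in event["text"].lower().split():
        if word.startswith("#"):
            yield word, 1


def update(slate, value):
    """Update step: fold one value into the running state (the slate) for a key."""
    slate["count"] = slate.get("count", 0) + value
    return slate


def process(stream):
    """Consume events one at a time, emitting the updated slate after each change."""
    slates = defaultdict(dict)  # one slate per key, kept between events
    for event in stream:
        for key, value in map_event(event):
            slates[key] = update(slates[key], value)
            yield key, dict(slates[key])


events = [
    {"text": "#bigdata is hype"},
    {"text": "More #BigData and #nosql"},
]
emitted = list(process(events))
```

<p>Unlike a batch MR job there is no final reduce: results flow out continuously as each event arrives, and nothing about the event format is declared upfront.</p>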
<p><strong>Serializable Snapshot Isolation in PostgreSQL</strong> (<a href="http://drkp.net/drkp/papers/ssi-vldb12.pdf">here</a>)</p>
<p>Interesting paper on the implementation of serializable isolation using the snapshot model.</p>
<p><strong>Other papers of note:</strong></p>
<ul>
<li>Minuet: A Scalable Distributed Multiversion B-Tree (<a href="http://www.vldb.org/pvldb/vol5/p884_benjaminsowell_vldb2012.pdf">here</a>)</li>
<li>A Statistical Approach Towards Robust Progress Estimation (<a href="http://research.microsoft.com/pubs/154936/Progress.pdf">here</a>)</li>
<li>Efficient Multi-way Theta-Join Processing Using MapReduce (<a href="http://vldb.org/pvldb/vol5/p1184_xiaofeizhang_vldb2012.pdf">here</a>)</li>
<li>Avatara: OLAP for Web-scale Analytics Products (OLAP cubes over a NoSQL @LinkedIn) (<a href="http://www.vldb.org/pvldb/vol5/p1874_liliwu_vldb2012.pdf">here</a>)</li>
<li>10 Year Best Paper Award: Approximate Frequency Counts over Data Streams (<a href="http://www.vldb.org/conf/2002/S10P03.pdf">here</a>)</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.benstopford.com/2012/10/28/the-best-of-vldb-2012/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		<feedburner:origLink>http://www.benstopford.com/2012/10/28/the-best-of-vldb-2012/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>Where does Big Data meet Big Database</title>
		<link>http://feedproxy.google.com/~r/benstopford/~3/tDZ2Fkb6ldA/</link>
		<comments>http://www.benstopford.com/2012/08/17/where-does-big-data-meet-big-database/#comments</comments>
		<pubDate>Fri, 17 Aug 2012 11:12:44 +0000</pubDate>
		<dc:creator>ben</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://www.benstopford.com/?p=2468</guid>
		<description><![CDATA[




InfoQ published the video for my Where does Big Data meet Big Database talk at QCon this year.
Thoughts appreciated.

]]></description>
			<content:encoded><![CDATA[<p><img class="size-full wp-image-2469 alignleft" style="color: #0000ee; text-align: center; border-image: initial; margin: 5px; border: 3px solid black;" title="Ben" src="http://www.benstopford.com/wp-content/uploads/2012/08/Ben.jpg" alt="" width="100" height="100" /></p>
<div>
<div style="text-align: center;"><span style="color: #0000ee;"><br />
</span></div>
<p style="text-align: center;">
<p>InfoQ published the video for my <a href="http://www.infoq.com/presentations/Big-Data-Big-Database">Where does Big Data meet Big Database</a> talk at QCon this year.</p>
<p>Thoughts appreciated.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.benstopford.com/2012/08/17/where-does-big-data-meet-big-database/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.benstopford.com/2012/08/17/where-does-big-data-meet-big-database/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>Thinking in Graphs: Neo4J</title>
		<link>http://feedproxy.google.com/~r/benstopford/~3/i2esLGOiXPA/</link>
		<comments>http://www.benstopford.com/2012/08/17/thinking-in-graphs-neo4j/#comments</comments>
		<pubDate>Fri, 17 Aug 2012 10:50:17 +0000</pubDate>
		<dc:creator>ben</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://www.benstopford.com/?p=2461</guid>
		<description><![CDATA[Ian Robinson kindly came to RBS yesterday to speak about Neo4J (slides are here Thinking in Graphs). The odd one out of the NoSQL pack, Neo4J is a fascinating alternative to your regular key value store. For me it’s about a different way of thinking about data simply because the relations between nodes are as much [...]]]></description>
			<content:encoded><![CDATA[<p>Ian Robinson kindly came to RBS yesterday to speak about Neo4J (slides are here <a href="http://www.benstopford.com/wp-content/uploads/2012/08/Thinking-in-Graphs.pdf#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">Thinking in Graphs</a>). The odd one out of the NoSQL pack, Neo4J is a fascinating alternative to your regular key value store. For me it’s about a different way of thinking about data simply because the relations between nodes are as much a part of the data model as the nodes are themselves. I am left wondering somewhat how one might apply this solution to the enterprise space, particularly finance. Multistep Monte Carlo springs to mind as it creates a large connected space but there is no real need to traverse that space retrospectively. There may be applications in other simulation techniques though. What follows is a paraphrased version of Ian’s words.</p>
<p>Today’s problems can be classified as a function of not only size but also connectedness and structure.</p>
<p style="text-align: center;"><em>F(size, connectedness, structure)</em></p>
<p>The Relational model struggles to deal with each of these three factors. The use of sparsely populated tables in our databases and null checks in client side code allude to the unsuitability of this model.</p>
<p>NoSQL offers a solution. The majority of this fledgling field relies on the concept of a Map (Dictionary) in some way. First came simple key-value stores like Dynamo. Next came column-oriented stores like Cassandra and BigTable. Finally Document Databases provide a more complex document model (for example JSON), with facilities for simple introspection.</p>
<p>Neo4J is quite different to its NoSQL siblings: A graph database that allows users to model data as a set of nodes and relationships. Once modelled the data can be examined based on its connectedness (i.e. how one node relates to others) rather than simply based on its attributes.</p>
<p>Neo4J uses a specific type of graph model termed a Property Graph: Each node has associated attributes that describe its specificities. These need not be homogeneous (as they would in a relational or object schema). Further, the relationships between nodes are both named and directed. As such they can be used in search criteria to find relationships between nodes.</p>
<p>The Property Graph model represents a pragmatic trade off between the purity of a traditional graph database and what you might see in a document database. This can be contrasted with the other graph database models: In ‘Triple Stores’ every attribute is broken out as a separate node (this is a bit like third normal form for a graph database). Another alternative is Hypergraphs, where an edge can connect more than two nodes (see Ian’s slide to get a better understanding of this). Triple stores suffer from their fine-grained nature (I’m thinking binary vs red-black trees). Hypergraphs can be hard to apply to real world modelling applications as the multiplicity of relationships can make them hard to comprehend. The Property Graph model avoids the verbosity of triple stores and the conceptual complexity of Hypergraphs. As such the model works well for complex, densely connected domains and ‘messy’ data.</p>
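<p>To make the Property Graph idea concrete, here is a minimal Python sketch of the data model only. It is not Neo4J's API: the class names, the <em>outgoing</em> helper and the sample data are all invented for illustration. It shows nodes carrying different property sets and relationships that are both named and directed.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node with free-form properties (no shared schema is required)."""
    node_id: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A named, directed edge that can carry properties of its own."""
    start: str
    rel_type: str
    end: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Hypothetical in-memory container, for illustration only."""

    def __init__(self):
        self.nodes = {}
        self.relationships = []

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)

    def relate(self, start, rel_type, end, **properties):
        self.relationships.append(Relationship(start, rel_type, end, properties))

    def outgoing(self, node_id, rel_type=None):
        """Ids of nodes reached by edges leaving node_id, optionally of one type."""
        return [r.end for r in self.relationships
                if r.start == node_id and (rel_type is None or r.rel_type == rel_type)]


graph = PropertyGraph()
graph.add_node("alice", name="Alice", role="developer")
graph.add_node("acme", sector="banking")  # a different set of property keys is fine
graph.relate("alice", "WORKS_AT", "acme", since=2010)
graph.relate("alice", "LIVES_IN", "london")
```

<p>Asking for <em>graph.outgoing("alice", "WORKS_AT")</em> uses the relationship name and direction as the search criterion, which is the point the talk makes about relationships being first class.</p>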
<p>The fundamental attribute of the graph database is that Relationships are first class elements. That is to say querying relationships in a graph database is as natural as querying the data the nodes contain.</p>
<p>Neo4J, like many NoSQL databases is schemaless. You simply create nodes and relate them to one another to form a graph. Graphs need not be connected and many sub-graphs can be supported.</p>
<p>A query is simply ‘parachuted’ into a point in the graph from where it explores the local areas looking for some search pattern. So for example you might search for the pattern A&#8211;&gt;B&#8211;&gt;C. The query itself can be executed either via a ‘traversal’ or using the Cypher graph language. The traversal method simply visits the graph based on some criteria. For example it might only traverse arcs of a particular type. Cypher is a more formal graph language that allows the identification of patterns within the graph.</p>
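<p>As a rough illustration of the traversal style, the Python sketch below starts at one node and follows only arcs of a given type. The edge list, the relationship names and the <em>traverse</em> function are assumptions made for this example; Neo4J's own traversal framework is a Java API and looks quite different.</p>

```python
from collections import deque


def traverse(edges, start, rel_type):
    """Breadth-first visit of nodes reachable from start via edges of rel_type only."""
    visited = [start]
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for source, edge_type, target in edges:
            if source == current and edge_type == rel_type and target not in visited:
                visited.append(target)
                queue.append(target)
    return visited


# Each edge is (source, relationship type, target); direction is source to target.
edges = [
    ("A", "KNOWS", "B"),
    ("B", "KNOWS", "C"),
    ("A", "WORKS_WITH", "D"),
]
print(traverse(edges, "A", "KNOWS"))  # ['A', 'B', 'C']
```

<p>Node D is never visited because the only arc leading to it has a different type.</p>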
<p>Imagine a simple graph of two anonymous nodes with an arc between them:</p>
<p style="text-align: center;">O&#8211;&gt;O</p>
<p>In Cypher this would be represented as A-[:connected_to]-&gt;B</p>
<p>Considering a more complex graph:</p>
<p style="text-align: center;">A&#8211;&gt;B&#8211;&gt;C, A&#8211;&gt;C or A&#8211;&gt;B&#8211;&gt;C&#8211;&gt;A</p>
<p>We can start to build up pattern matching logic over these graphs, for example A-[*]-&gt;B to represent that A is somehow connected to B (think regex for graphs). This allows the graph to be mined for patterns based on any combination of the properties, arc directions or name (type).</p>
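<p>The ‘somehow connected’ pattern can be mimicked in a few lines of Python. This is only a sketch of the idea, not how Cypher evaluates it: the <em>find_paths</em> function and the sample edges are invented here, and it simply enumerates every directed path of any length between two nodes without revisiting a node.</p>

```python
def find_paths(edges, start, end, seen=()):
    """Yield each simple directed path from start to end as a list of node names."""
    seen = seen + (start,)
    for source, target in edges:
        if source != start or target in seen:
            continue  # not an outgoing arc, or it would revisit a node
        if target == end:
            yield list(seen) + [target]
        else:
            yield from find_paths(edges, target, end, seen)


# The first of the more complex graphs above: A to B to C, plus A directly to C.
edges = [("A", "B"), ("B", "C"), ("A", "C")]
paths = list(find_paths(edges, "A", "C"))
print(paths)  # [['A', 'B', 'C'], ['A', 'C']]
```

<p>An empty result means there is no directed route from the first node to the second, for example from C back to A in this graph.</p>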
<p>There are further Cypher examples <a href="http://docs.neo4j.org/chunked/milestone/cypher-query-lang.html">here </a>including links to an online console where you can interactively experiment with the queries. Almost all of the query examples and diagrams are generated from the unit tests used to develop Cypher. This means that the manual is always an accurate reflection of the current feature set.</p>
<p>Physical Characteristics:</p>
<p>The product itself is JVM based (query language written in Scala). There is an HTTP interface too (RESTful). It is fully transactional (ACID) and it is possible to override the transaction manager should you need to coordinate with an external transaction manager (for example because you want to coordinate with an external store). An object cache is used to store the entities in memory with fall through to memory-mapped files if the dataset does not fit in RAM.</p>
<p>HA support uses a master-slave, replicated model (single master model). You can write to a slave (i.e. any node) and it will obtain a lock from the master. Lucene is the default index provider.</p>
<p>The team have several strategies for mitigating the impact of GC pauses, the most important being a GC resistant caching strategy. This assigns a certain amount of space in the JVM heap; it then purges objects whenever the maximum size is about to be reached, instead of relying on GC to make that decision. Because the cache has a fixed maximum heap usage, both its competition with other objects in the heap and GC pauses can be better controlled. Caching is described in more detail <a href="http://docs.neo4j.org/chunked/milestone/configuration-caches.html#_object_cache">here</a>.</p>
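<p>The purge-before-you-overflow idea can be sketched in Python as below. This is an illustration under stated assumptions rather than Neo4J's implementation: the <em>BoundedCache</em> class is invented, entry sizes are supplied by the caller, and least-recently-used eviction is my own choice of purge policy.</p>

```python
from collections import OrderedDict


class BoundedCache:
    """Cache that evicts least-recently-used entries to stay within a size budget."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> (value, size_in_bytes)

    def put(self, key, value, size):
        if key in self._entries:
            self.used_bytes -= self._entries.pop(key)[1]
        # Purge before inserting, so the budget is never exceeded.
        while self._entries and self.used_bytes + size > self.max_bytes:
            _, (_, evicted_size) = self._entries.popitem(last=False)
            self.used_bytes -= evicted_size
        if self.max_bytes >= size:  # entries larger than the whole budget are skipped
            self._entries[key] = (value, size)
            self.used_bytes += size

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as recently used
        return self._entries[key][0]


cache = BoundedCache(max_bytes=100)
cache.put("a", "node a", size=60)
cache.put("b", "node b", size=30)
cache.get("a")                      # "a" is now more recently used than "b"
cache.put("c", "node c", size=40)   # would overflow, so "b" is purged first
```

<p>The cache decides for itself what to drop and when, so its footprint stays predictable regardless of when the garbage collector runs.</p>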
<p>Ian mentioned a few applications too:</p>
<ul>
<li>Telcos: Managing the network graph: If something goes wrong they use the graph database to help predict where the problem likely comes from by simulating the network topology.</li>
<li>Logistics: parcel routing. This is a hierarchical problem. Neo4J helps by allowing them to model the various routes to get a parcel from its start to end locations. Routes change (and become unavailable).</li>
<li>Finally the social graph which is fairly self explanatory!</li>
</ul>
<p>All round an eye-opening approach to the modelling and inspection of connected data sets.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.benstopford.com/2012/08/17/thinking-in-graphs-neo4j/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.benstopford.com/2012/08/17/thinking-in-graphs-neo4j/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>A Brief Summary of the NoSQL World</title>
		<link>http://feedproxy.google.com/~r/benstopford/~3/0kT0Z2v6emY/</link>
		<comments>http://www.benstopford.com/2012/08/11/a-brief-summary-of-the-nosql-world/#comments</comments>
		<pubDate>Sat, 11 Aug 2012 09:09:29 +0000</pubDate>
		<dc:creator>ben</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://www.benstopford.com/?p=2441</guid>
		<description><![CDATA[James Phillips (co-founder of Couchbase) did a nice talk on NoSQL Databases at QCon:


Memcached &#8211; the simplest and original. Pure key value store. Memory focussed
Redis &#8211; Extends the simple map-like semantic with extensions that allow the manipulation of certain specific data structures, stored as values. So there are operations for manipulating values as lists, queues [...]]]></description>
			<content:encoded><![CDATA[<p>James Phillips (co-founder of Couchbase) did a nice <a href="http://www.infoq.com/presentations/NoSQL-Survey-Comparison">talk</a> on NoSQL Databases at QCon:</p>
<p style="text-align: center;"><a href="http://www.benstopford.com/wp-content/uploads/2012/08/Screen-Shot-2012-08-11-at-09.40.04.png#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed"><img class="aligncenter size-full wp-image-2442" title="Screen Shot 2012-08-11 at 09.40.04" src="http://www.benstopford.com/wp-content/uploads/2012/08/Screen-Shot-2012-08-11-at-09.40.04.png" alt="" width="348" height="263" /></a></p>
<p style="text-align: center;">
<p><strong>Memcached</strong> &#8211; the simplest and original. Pure key value store. Memory focussed</p>
<p><strong>Redis</strong> &#8211; Extends the simple map-like semantic with extensions that allow the manipulation of certain specific data structures, stored as values. So there are operations for manipulating values as lists, queues etc. Redis is primarily memory focussed.</p>
<p><strong>Membase</strong> &#8211; extends the memcached approach to include persistence, the ability to add nodes, and backups on other nodes.</p>
<p><strong>Couchbase</strong> &#8211; a cross between Membase and CouchDB. Membase on the front, CouchDB on the back. The addition of CouchDB means you can store and reflect on more complex documents (in JSON). To query Couchbase you need to write JavaScript mapping functions that effectively materialise the schema (think index) so that you can create a query model. Couchbase is CA not AP (i.e. not eventually consistent).</p>
<p><strong>MongoDB</strong> &#8211; Uses BSON (binary version of JSON which is open source but only really used by Mongo). Mongo is unlike Couchbase in that the query language is dynamic: Mongo doesn&#8217;t require the declaration of indexes. This makes it better at ad hoc analysis but slightly weaker from a production perspective.</p>
<p><strong>Cassandra</strong> &#8211; Column oriented, key value. The values are split into columns which are pre-indexed before the information can be retrieved. Eventually consistent (unlike Couchbase). This makes it better for highly distributed use cases or ones where the data is spread over unreliable networks.</p>
<p><strong>Neo4J</strong> &#8211; Graph oriented database. Much more niche. Not distributed.</p>
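<p>The Couchbase entry above mentions mapping functions that materialise an index. Real Couchbase views are written in JavaScript; the Python sketch below only illustrates the concept, and the <em>build_view</em> and <em>by_city</em> names and the sample documents are made up for the example.</p>

```python
def build_view(documents, map_fn):
    """Apply map_fn to each document and collect the emitted rows, sorted by key."""
    rows = []
    for doc_id, doc in documents.items():
        for key, value in map_fn(doc):
            rows.append((key, doc_id, value))
    return sorted(rows)


def by_city(doc):
    """Map function: emit (city, name) only for documents that have a city."""
    if "city" in doc:
        yield doc["city"], doc["name"]


documents = {
    "u1": {"name": "Ann", "city": "London"},
    "u2": {"name": "Raj", "city": "Delhi"},
    "u3": {"name": "Sam"},  # no city, so absent from the view
}
view = build_view(documents, by_city)
london = [value for key, _, value in view if key == "London"]
```

<p>The view is effectively a precomputed index over schemaless documents: queries run against the emitted keys rather than against the documents themselves.</p>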
<p style="text-align: center;"><a href="http://www.benstopford.com/wp-content/uploads/2012/08/Screen-Shot-2012-08-11-at-10.03.341.png#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed"><img class="aligncenter size-full wp-image-2444" title="Screen Shot 2012-08-11 at 10.03.34" src="http://www.benstopford.com/wp-content/uploads/2012/08/Screen-Shot-2012-08-11-at-10.03.341.png" alt="" width="482" height="330" /></a></p>
<p>There are obviously a few more that could have been covered (Voldemort, Dynamo etc.) but it is a good summary from James nonetheless.</p>
<p>Full slides/video can be found <a href="http://www.infoq.com/presentations/NoSQL-Survey-Comparison">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.benstopford.com/2012/08/11/a-brief-summary-of-the-nosql-world/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		<feedburner:origLink>http://www.benstopford.com/2012/08/11/a-brief-summary-of-the-nosql-world/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
		<item>
		<title>ODC</title>
		<link>http://feedproxy.google.com/~r/benstopford/~3/xV47A4iYl2g/</link>
		<comments>http://www.benstopford.com/2012/08/09/odc/#comments</comments>
		<pubDate>Thu, 09 Aug 2012 17:51:53 +0000</pubDate>
		<dc:creator>ben</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://www.benstopford.com/?p=2370</guid>
		<description><![CDATA[This article describes a little about ODC &#8211; primarily because we are hiring and we’d like candidates to know a little more about what we do here before they rock up &#8211; but it may also be of interest to those attempting to consolidate large amounts of data into a single, real-time, enterprise-wide store.
The Big [...]]]></description>
			<content:encoded><![CDATA[<p>This article describes a little about ODC &#8211; primarily because we are hiring and we’d like candidates to know a little more about what we do here before they rock up &#8211; but it may also be of interest to those attempting to consolidate large amounts of data into a single, real-time, enterprise-wide store.</p>
<h3><strong>The Big Idea</strong></h3>
<p>ODC Core is the data store that sits at the centre of the ODC project. It was designed to be the one datastore the bank needs; the single port of call for all our trades and valuations with the vision of one day blending processing and data in a collocated manner. In fairness it is not quite that yet, as such a mythical beast is hard to come by, but it has made significant inroads.</p>
<p>So why is one big datastore useful you may ask? In short we, like many organisations, have a lot of problems with data. Most of these problems have nothing to do with technology. They are about different people’s interpretation of their part of our domain. Hundreds of systems across the bank each implement these different interpretations. Data is forwarded from system to system and the problem compounds. Enterprise messaging can only do so much to solve this problem because it is inherently point-in-time (so the interpretation of the message is still left to each application and their own method of persistence). Joining up all the dots to get a global view of the bank’s activity can be a confusing, manual and painful process. So the concept is simple: one golden copy that holds the truth. Get it right in one place and then migrate applications to that one single model and the one single data source. Simple idea. Somewhat harder to make a reality.</p>
<p style="text-align: center;">
<p style="text-align: center;"><a href="http://www.benstopford.com/wp-content/uploads/2012/08/image0011.png#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed"><img class="aligncenter size-full wp-image-2428" title="image001" src="http://www.benstopford.com/wp-content/uploads/2012/08/image0011.png" alt="" width="467" height="294" /></a></p>
<h3>What is ODC Now</h3>
<p>ODC has been live for coming up to two years with development starting back in Jan 2010. The datastore is written inside Oracle Coherence, which provides a data-fabric in which we have built a distributed, normalised database. ODC Core (which is the data store itself) has some interesting qualities that differentiate it from your average database (or Coherence cluster). The three I cover in more detail below are messaging as a system of record, a dynamic data replication model to support efficient distributed joins and our dynamic object and SQL interfaces. There are some other quite neat features that I won&#8217;t go into here such as a distributed clock implementation that allows reliable and efficient snapshots of the datastore, the use of compression on large result sets (our own interpretation of dictionary encoding) and a sample-based query optimiser.</p>
<p><strong>Messaging as a System of Record</strong>: Unlike most databases ODC Core provides both query and subscription semantics. This actually falls out quite naturally as messaging sits at the very core of the product. In fact messaging is our system of record. So when data is written to the store that data is only &#8216;accepted&#8217; once it has been written synchronously to the event stream. Having an event stream as your system of record proves to be a powerful concept.</p>
<p style="text-align: center;"><a href="http://www.benstopford.com/wp-content/uploads/2012/08/image0031.png#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed"><img class="aligncenter size-full wp-image-2425" title="image003" src="http://www.benstopford.com/wp-content/uploads/2012/08/image0031.png" alt="" width="411" height="302" /></a></p>
<p>From a non-functional perspective this allows persistence to scale out linearly in a &#8216;load balanced&#8217; manner (we use topics rather than queues so there is global ordering and hence no need to share state across different servers in the messaging layer). Providing write scalability is only one advantage though. Having everything persisted through a single event stream means you can hook anything you like into it. If you are interested in a certain type of event you can just subscribe with a message selector. If you want to create a copy of the store in a relational database you can just hook into the same stream. If you want a disaster recovery instance … you get the picture I’m sure.</p>
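<p>A toy Python version of the idea may help. It is not ODC's code (ODC is built on Coherence and a messaging layer): the <em>EventStream</em> class, the <em>into</em> helper and the event shapes are all invented here. It shows a write being accepted by appending to the stream, with the queryable store, a selector-based subscriber and a late-joining copy all derived from that same stream.</p>

```python
class EventStream:
    """Append-only log; a write counts as accepted once it is in the log."""

    def __init__(self):
        self.log = []
        self.subscribers = []  # (selector, callback) pairs

    def subscribe(self, callback, selector=lambda event: True, replay=False):
        """Register callback for matching events, optionally replaying history first."""
        if replay:
            for event in self.log:
                if selector(event):
                    callback(event)
        self.subscribers.append((selector, callback))

    def publish(self, event):
        self.log.append(event)  # the log is the system of record
        for selector, callback in self.subscribers:
            if selector(event):
                callback(event)


def into(table):
    """Build a callback that upserts each event into the given dict."""
    def apply(event):
        table[event["key"]] = event["value"]
    return apply


stream = EventStream()
store = {}        # the queryable view, derived from the stream
trade_audit = []  # a second consumer that only cares about trades

stream.subscribe(into(store))
stream.subscribe(trade_audit.append, selector=lambda e: e["type"] == "trade")

stream.publish({"type": "trade", "key": "T1", "value": 100})
stream.publish({"type": "valuation", "key": "V1", "value": 42})

dr_copy = {}  # a copy added later catches up by replaying the stream
stream.subscribe(into(dr_copy), replay=True)
```

<p>Every consumer sees the same ordered sequence of events, which is why a new copy of the store can be added without touching the writers.</p>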
<p><strong>ODC Core efficiently joins normalised data</strong>: All distributed stores that support a degree of normalisation struggle if they need to join data elements that are not collocated with one another. They are forced to ship potentially large amounts of data across the network to compute the join. Sharding helps a little but you can only shard by a single key so there will always be elements that don’t end up collocated (because they have ‘crosscutting’ keys). We use a relatively novel approach to solving this problem. In short we replicate data that does not shard. However simply replicating data would cause the cluster to run out of memory as there would simply be too much replicated data on each node.</p>
<p style="text-align: center;"><a href="http://www.benstopford.com/wp-content/uploads/2012/08/image0051.png#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed"><img class="aligncenter size-full wp-image-2426" title="image005" src="http://www.benstopford.com/wp-content/uploads/2012/08/image0051.png" alt="" width="428" height="309" /></a></p>
<p>To get around this problem, when data is written to the store the system walks the object model, ensuring that all items that the data ‘connects to’ are replicated. So we start out by replicating nothing. As data is written to the cluster we walk the domain model to make sure the &#8216;dimensions&#8217; that data connects to are replicated. Most importantly, at any point in time data that is not ‘connected’ will not be replicated. This reduces the amount of replicated data by an order of magnitude so that replication can be used for efficient joins with ‘<a href="http://en.wikipedia.org/wiki/Dimension_table">Dimensions’</a>. If you’re interested in this pattern you can find out more about it <a href="http://www.benstopford.com/2011/01/27/beyond-the-data-grid-building-a-normalised-data-store-using-coherence/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">here</a> and <a href="http://www.benstopford.com/2011/09/22/achieving-fast-joins-in-distributed-data-stores-through-the-application-of-snowflake-schemas-and-the-connected-replication-pattern-2/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">here</a>.</p>
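<p>The write path described above can be sketched as follows. This is a simplified Python model, not the Coherence implementation: the <em>Cluster</em> class, the CRC-based partitioning and the sample trade and book data are assumptions for illustration. Facts are sharded, and a dimension is copied to every node only once some fact (directly or indirectly) connects to it.</p>

```python
import zlib


class Cluster:
    """Toy cluster: facts are partitioned, dimensions replicated only when referenced."""

    def __init__(self, node_count):
        self.partitions = [dict() for _ in range(node_count)]  # sharded facts
        self.replicas = [dict() for _ in range(node_count)]    # replicated dimensions
        self.dimensions = {}  # master copy of every dimension, keyed by id

    def put_dimension(self, dim_id, value, refs=()):
        # Stored, but not replicated until something connects to it.
        self.dimensions[dim_id] = {"value": value, "refs": list(refs)}

    def put_fact(self, fact_id, value, refs):
        node = zlib.crc32(fact_id.encode()) % len(self.partitions)
        self.partitions[node][fact_id] = {"value": value, "refs": list(refs)}
        for dim_id in refs:
            self._replicate(dim_id)

    def _replicate(self, dim_id):
        """Walk the model from dim_id, copying each connected dimension to every node."""
        if dim_id in self.replicas[0]:
            return  # already replicated, along with what it referenced at that time
        dimension = self.dimensions[dim_id]
        for replica in self.replicas:
            replica[dim_id] = dimension
        for ref in dimension["refs"]:
            self._replicate(ref)


cluster = Cluster(node_count=3)
cluster.put_dimension("region-emea", "EMEA")
cluster.put_dimension("book-1", "Rates Desk", refs=["region-emea"])
cluster.put_dimension("book-2", "Unused Book")
cluster.put_fact("trade-1", {"notional": 1000}, refs=["book-1"])
```

<p>After the write, every node can join trade-1 to its book and region locally, while book-2 consumes no replicated memory because nothing connects to it.</p>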
<p><strong>ODC supports Object and Relational models through a single interface: </strong>ODC is primarily an object database. This is important because it represents a 2D domain model (a representation of the bank’s Logical Domain Model &#8211; something we hold very dear). We have a simple object based query language which allows a user to query (filter, group etc) by any element of any object in the store (the API is derived reflectively from the domain model). The language is SQL-like but has all the benefits of intellisense in your IDE. That is to say you can filter, group, select etc on any getter, collection etc that any of our objects expose. You can define which joins you would like to make to bring more data back, add predicate logic etc.</p>
<p style="text-align: center;"><a href="http://www.benstopford.com/wp-content/uploads/2012/08/image0071.png#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed"><img class="aligncenter size-full wp-image-2412" title="image007" src="http://www.benstopford.com/wp-content/uploads/2012/08/image0071.png" alt="" width="638" height="187" /></a></p>
<p style="text-align: center;">
<p>In addition we support a basic JDBC driver which means users can get at our data in rows and columns if they wish. We’d prefer that they didn’t as rows and columns just don’t really work for a 2D domain model but we also understand that lots and lots of tools want to interact with their data in SQL. The SQL adapter actually works in exactly the same way as the Object based interface. That is to say that the information that is sent to the store is the same. We just have to do a little more work to present the data in a tabular form.</p>
<p><strong>ODC is continuously delivered</strong>: We’ve put a lot of work in to continuously deliver our application suite, or at least to do something as close to continuous delivery as we can. The challenge is that ODC is quite big. Each environment runs around 450 processes with 50 different process definitions and the database is around 2TB which means it takes a long time to migrate (see the Future section below).</p>
<p style="text-align: center;"><a href="http://www.benstopford.com/wp-content/uploads/2012/08/diad1.png#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed"><img class="aligncenter size-full wp-image-2427" title="diad" src="http://www.benstopford.com/wp-content/uploads/2012/08/diad1.png" alt="" width="405" height="291" /></a></p>
<p style="text-align: center;">
<p style="text-align: left;">So why bother with continuous delivery? It’s really about how long it takes to get feedback on a problem. With this system in place we get feedback on changes with a real data set in something that looks and runs identically to production. We get that feedback every day. The effort has gone into a series of increasingly comprehensive tests. 20k unit and functional tests run before you check in (takes just under 20 mins). The MiniMe build migrates the database should any database changes be checked in. It does this on a cut down dataset which means it can do pretty much any migration in twenty to thirty mins. If that passes, a full migration ensures that the code works with a fully populated data set. Finally, if all that passes we rip down the almost-prod env, release to it and start everything up again. If anything goes wrong we roll back using a database flashback. All in all a lot of pain but that’s the world of databases in the terabytes. The luxury of seeing a new bit of work in a production identical environment within a day is worth it though. The continuous delivery system is written in Gradle by <a href="http://greggigon.com/2011/06/09/introduction-to-gradle/">Greg Gigon</a>.</p>
<h3><strong>The Future</strong></h3>
<p>The future for ODC&#8217;s data store revolves around its ability to adapt to a changing world. Databases aren’t so good at that. When you have a database you need to understand your data before you store it. Part of moving fast is accepting that you can’t understand all your data at the get-go however much you may wish to. Understanding data just takes time (and you get it wrong). The plan is to avoid these problems by using late binding to wrap a schema onto original, unaltered facts at runtime. This concept of the late bound schema allows us to change our mind and map data late on in the delivery cycle because the unaltered facts always sit at its core. Doing this in a traditional schema oriented store (like a database) isn&#8217;t possible since you would have to back-populate any new additions. The schema is more like a view in a database, except that the view is over the data file as it was provided to the database, rather than some mapped version of it. Some big data technologies offer properties like this but none we&#8217;ve come across offer this in the context of a statically typed language that can version data, provide consistent views and join entities that have disparate lifecycles. We see this step as an important move towards becoming the one store that a large number of systems can rely on.</p>
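<p>A small Python sketch shows the shape of the late bound schema idea. It is purely illustrative and not ODC's design: the pipe-delimited payload format, the <em>bind</em> function and the two schema versions are assumptions. The raw facts are stored exactly as received and a schema is applied only when they are read.</p>

```python
# Facts are kept exactly as they arrived; nothing is parsed at write time.
raw_facts = [
    {"id": "T1", "payload": "GBP|1000|2012-08-09"},
    {"id": "T2", "payload": "USD|250|2012-08-10"},
]


def bind(facts, schema):
    """Apply schema (field name -> function of the raw payload) at read time."""
    return [{name: extract(fact["payload"]) for name, extract in schema.items()}
            for fact in facts]


schema_v1 = {
    "currency": lambda p: p.split("|")[0],
    "amount": lambda p: int(p.split("|")[1]),
}

# A later version adds a field without rewriting or back-populating stored facts.
schema_v2 = dict(schema_v1, trade_date=lambda p: p.split("|")[2])

rows_v1 = bind(raw_facts, schema_v1)
rows_v2 = bind(raw_facts, schema_v2)
```

<p>Both schema versions read the same stored facts, so changing your mind about the model is a change to the mapping rather than a data migration.</p>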
<p>The higher level vision (which is the vision of our CIO) is a data oriented architecture in which services are deployed and run in a cloud like environment that is &#8216;preloaded&#8217; with all the bank&#8217;s primary data. That is to say that services running in this environment utilise only centralised persistence for the bank&#8217;s core facts.</p>
<h3><strong>The Team</strong></h3>
<p>The team are split between London and India. There is a strong influence from the software industry and that goes for the work as well as the ethos. We don’t always agree (lots of strong characters) but we always get along. If you are interested some of us mapped out what we value most <a href="http://www.benstopford.com/2010/08/25/mapping-personal-practices/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed">here</a> a while back.</p>
<p>We practice something that is a little bit like agile. We work iteratively. We write lots of tests. We keep the build time down. But we’re aging slightly which means we don’t pair as often as we used to (but we do still pair). Iterations overhang a little too often but hopefully you can forgive us for that.</p>
<p>So if you’re looking for work because you want to pay the bills there are better teams out there. If you’ve chosen a life in software because it’s something that you find yourself musing about in idle moments and excited about when you wake in the morning then it could be for you.</p>
<p>If you&#8217;d like to find out more just email me: benjamin[dot]stopford[at]rbs.com</p>
]]></content:encoded>
			<wfw:commentRss>http://www.benstopford.com/2012/08/09/odc/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		<feedburner:origLink>http://www.benstopford.com/2012/08/09/odc/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</feedburner:origLink></item>
	</channel>
</rss>
