<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>DBMS2 -- DataBase Management System Services</title>
	
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Sat, 13 Mar 2010 22:47:06 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/dbms2/feed" /><feedburner:info uri="dbms2/feed" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
		<title>The Naming of the Foo</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/CByUwq0gWq4/</link>
		<comments>http://www.dbms2.com/2010/03/13/the-naming-of-the-foo/#comments</comments>
		<pubDate>Sat, 13 Mar 2010 22:47:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Mark Logic]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1703</guid>
		<description><![CDATA[Let&#8217;s start from some reasonable premises.

No technology category name is 	ever perfect.
It&#8217;s particularly hard to describe 	NoSQL (Not Only SQL) accurately, given the basic confusion as to 	what NoSQL is all about.
That said, it 	seems pretty clear that NoSQL is about making big websites (and 	perhaps other cloud-like installations) run and scale.
Dwight Merriman (founder/CEO of [...]]]></description>
			<content:encoded><![CDATA[<p>Let&#8217;s start from some reasonable premises.</p>
<ul>
<li><a href="http://www.strategicmessaging.com/monashs-first-law-of-commercial-semantics-explained/2009/01/09/" onclick="javascript:pageTracker._trackPageview('/www.strategicmessaging.com');">No technology category name is 	ever perfect</a>.</li>
<li>It&#8217;s particularly hard to describe 	NoSQL (Not Only SQL) accurately, given <a href="http://www.dbms2.com/2009/11/23/boston-big-data-summit-keynote-outline/" >the basic confusion as to 	what NoSQL is all about</a>.</li>
<li>That said, it 	seems pretty clear that NoSQL is about making big websites (and 	perhaps other cloud-like installations) run and scale.</li>
<li>Dwight Merriman (founder/CEO of 	MongoDB vendor 10gen) is heading in the right direction when he says 	that the unifying ideas of NoSQL are that you do away with 	transactions and joins. But if he&#8217;s ever said something like “NoSQL 	is Foo without joins and transactions,” I don&#8217;t know what Foo is.</li>
<li><span style="font-style: normal;">Actually, 	I do know what Foo is – Foo is what happens when lots of people 	want to get small amounts each of information in or out of a 	database at the same time. I just don&#8217;t know what Foo is called.</span></li>
<li>Obviously, Foo is a lot like OLTP 	(OnLine Transaction Processing). However, it would be pretty silly 	for Foo to actually be OLTP, given that one of the core points of 	NoSQL is that you don&#8217;t have transactions.</li>
<li>It&#8217;s not just the “T” part of OLTP that&#8217;s fried. Calling something “OnLine” only makes sense as long as offline is an option, and offline transaction processing has been obsolete for a very long time.*</li>
</ul>
<p style="margin-bottom: 0in;"><em>*Sure, if you strain you can talk yourself into exceptions. But the point stands.</em></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">So we need a name for Foo, where Foo is what happens when</span><span style="font-style: normal;"><strong> lots of people want to get small amounts each of information in or out of a database at the same time.</strong></span><span style="font-style: normal;"> Thus, three major subcategories of more-or-less disk-based Foo are:</span></p>
<ul>
<li><span style="font-style: normal;">No-compromises 	ACID-compliant relational OLTP</span></li>
<li><span style="font-style: normal;">Sharded 	MySQL</span></li>
<li>NoSQL</li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">There may be some more purely memory-centric versions too, but let&#8217;s put those aside for the moment. </span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">Absent a better idea, I can squeeze Foo into yet another four-letter acronym:</span></p>
<p style="margin-bottom: 0in;"><strong><span style="font-style: normal;">HVSP (High-Volume Simple Processing)</span></strong></p>
<p style="margin-bottom: 0in; font-style: normal;">That&#8217;s as imperfect as any other category name, and an awkward mouthful to boot. So I&#8217;d love to hear a better one; if you have such, please share it!  In the mean time, I think “HVSP” has merit because:</p>
<ul>
<li><span style="font-style: normal;">The 	“Processing” part should be noncontroversial.</span></li>
<li>“<span style="font-style: normal;">High-Volume” 	is inherent to the challenge. If RDBMS scale well enough for your 	use case, using something less powerful is probably silly.*  	Similarly, while Oracle shines at high-volume OLTP workloads, there 	are many cheaper DBMS that do a fine job of OLTP at lower volumes.</span></li>
<li>“<span style="font-style: normal;">Simple” 	is the core principle of NoSQL systems, which drop joins and 	transactions as being too much foofarah.  That only makes sense at 	all under the assumption that you have bone-simple queries and 	updates, so that programming around the lack of joins and 	transactions isn&#8217;t all that much of a burden.</span></li>
<li><span style="font-style: normal;">Something 	similar is true of sharded MySQL.</span></li>
<li><span style="font-style: normal;">Less 	obviously, “simple” is a core principle of relational OLTP as 	well. The point of the relational model is to cap the complexity of 	data operations, or more precisely to hide that complexity from 	programmers.</span></li>
<li><span style="font-style: normal;">And 	overloading the word “simple” a bit, it&#8217;s fair to say that if 	you&#8217;re reading or writing one record at a time, you&#8217;re doing 	something relatively simple, at least as opposed to what you do in 	analytic processing. The OLTP vs. OLAP distinction is preserved in 	this name change.</span></li>
<li><span style="font-style: normal;">The whole thing matches my definition above, namely &#8220;what happens when lots of people want to get small amounts each of information in or out of a database at the same time.&#8221;</span></li>
</ul>
<p style="margin-bottom: 0in;"><em>*Assuming, of course, that rows-and-tables are a good metaphor for your data structure in the first place.</em></p>
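<p>To make the &#8220;simple&#8221; part concrete, here is a minimal sketch of what a single-record workload looks like once data is spread across shards. Everything in it is an illustrative assumption rather than any vendor&#8217;s actual design: the <code>ShardedStore</code> class, the choice of MD5 for routing, and the plain dictionaries standing in for separate database servers.</p>

```python
import hashlib


class ShardedStore:
    """Toy key-value store that routes each record to one shard by key hash."""

    def __init__(self, shard_count):
        # Each shard is a plain dict standing in for a separate database server.
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # Use a stable hash so the same key always lands on the same shard.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, record):
        # A write touches exactly one shard: no join, no multi-record transaction.
        self._shard_for(key)[key] = record

    def get(self, key):
        # A read also touches exactly one shard.
        return self._shard_for(key).get(key)


store = ShardedStore(shard_count=4)
store.put("user:42", {"name": "Ada", "followers": 10})
print(store.get("user:42"))
```

<p>Because every operation names one key and touches one shard, there is little for joins or transactions to do, which is the assumption the HVSP label relies on.</p>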
<p style="margin-bottom: 0in; font-style: normal;">Systems I&#8217;m leaving out of the HVSP and hence also NoSQL categories include:</p>
<ul>
<li><span style="font-style: normal;"><strong>Hadoop 	and other batch-oriented MapReduce.</strong></span><span style="font-style: normal;"> Hadoop isn&#8217;t part of NoSQL. I&#8217;m pretty sure that </span><a href="http://twitter.com/mikeolson/status/10388695185" onclick="javascript:pageTracker._trackPageview('/twitter.com');">Cloudera 	CEO Mike Olson</a><span style="font-style: normal;"> agrees with me.</span></li>
<li><span style="font-style: normal;"><span style="font-weight: normal;">More 	generally, </span></span><span style="font-style: normal;"><strong>non-SQL 	data stores that don&#8217;t meet the HVSP criteria.</strong></span><span style="font-style: normal;"> Dave Kellogg stretches things when he claims that <a href="http://www.kellblog.com/2010/03/10/ieee-computer-society-article-on-nosql-an-executive-level-overview/" onclick="javascript:pageTracker._trackPageview('/www.kellblog.com');">MarkLogic 	is a NoSQL system</a>. (But then, that was in a post where he 	seemingly praised </span><a href="http://www.dbms2.com/2009/12/11/nosql-q-and-a/" >a train wreck of an article</a><span style="font-style: normal;">.)</span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">But hey – what good is a categorization if it doesn&#8217;t leave some things out?</span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/03/13/the-naming-of-the-foo/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2010/03/13/the-naming-of-the-foo/</feedburner:origLink></item>
		<item>
		<title>Some NoSQL links</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/qSr8cvs6VdA/</link>
		<comments>http://www.dbms2.com/2010/03/12/some-nosql-links/#comments</comments>
		<pubDate>Fri, 12 Mar 2010 23:51:42 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Amazon and its cloud]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Continuent]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Tokutek]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1692</guid>
		<description><![CDATA[I plan to post a few things soon about MongoDB, Cassandra, and NoSQL in general. So I&#8217;m poking around a bit reading stuff on the subjects. Here are some links I found.

A little over a year ago, Julian Browne put up a great post on Eric Brewer&#8217;s CAP conjecture/theorem, which provides much of the impetus [...]]]></description>
			<content:encoded><![CDATA[<p>I plan to post a few things soon about MongoDB, Cassandra, and NoSQL in general. So I&#8217;m poking around a bit reading stuff on the subjects. Here are some links I found.</p>
<ul>
<li>A little over a year ago, Julian Browne put up a great post on <a href="http://www.julianbrowne.com/article/viewer/brewers-cap-theorem" onclick="javascript:pageTracker._trackPageview('/www.julianbrowne.com');">Eric Brewer&#8217;s CAP conjecture/theorem</a>, which provides much of the impetus to relax the traditional requirement for atomicity/consistency.</li>
<li>Even more directly inspirational to NoSQL technology development were two seminal papers: Google&#8217;s on <a href="http://labs.google.com/papers/bigtable.html" onclick="javascript:pageTracker._trackPageview('/labs.google.com');">BigTable</a> and Amazon&#8217;s on <a href="http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf" onclick="javascript:pageTracker._trackPageview('/s3.amazonaws.com');">Dynamo</a>. (That said, I&#8217;m having trouble getting myself to actually read them from start to finish, especially since they&#8217;ve been superseded by subsequent technology development.)</li>
<li>10gen (the MongoDB guys) hosted a NoSQL conference yesterday. Much blogging has ensued. The best post I&#8217;ve seen so far was by <a href="http://blog.marcua.net/post/442594842/notes-from-nosql-live-boston-2010" onclick="javascript:pageTracker._trackPageview('/blog.marcua.net');">Adam Marcus</a>. I find the graph database notes near the bottom particularly interesting.</li>
<li>Mark Callaghan hit back against the <a href="http://mysqlha.blogspot.com/2010/03/plays-well-with-others.html" onclick="javascript:pageTracker._trackPageview('/mysqlha.blogspot.com');">NoSQL <span style="text-decoration: line-through;">movement</span> hype</a>, and in particular against the &#8216;<a href="http://www.dbms2.com/2010/03/02/cassandra-nosql-scalable-oltp/" >MySQL/memcached is passe</a>&#8217; meme. On the other hand, he also bemoaned many failings of MySQL. On the third hand, he praised or at least expressed hope for a variety of MySQL-related technologies, including <a href="http://www.dbms2.com/2009/04/16/introduction-to-tokutek/" >Tokutek&#8217;s TokuDB</a> and <a href="http://www.dbms2.com/2009/09/03/continuent-on-clustering/" >Continuent&#8217;s Tungsten</a>.</li>
<li>In connection with that debate, Mark Rendle offered a <a href="http://blog.markrendle.net/2010/03/do-you-need-relational-database.html" onclick="javascript:pageTracker._trackPageview('/blog.markrendle.net');">funny rant</a>, mainly pro-NoSQL, in the style of a Socratic dialogue.</li>
<li>John Quinn of Digg recently described <a href="http://www.stumbleupon.com/su/5099Ti/about.digg.com/node/564" onclick="javascript:pageTracker._trackPageview('/www.stumbleupon.com');">Digg&#8217;s move from MySQL to Cassandra</a>, and outlined a lot of features Digg was adding to Cassandra, all of which it is open-sourcing.</li>
<li>The NoSQL guys maintain their own long <a href="http://nosql-database.org/links.html" onclick="javascript:pageTracker._trackPageview('/nosql-database.org');">list of NoSQL-related links</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/03/12/some-nosql-links/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2010/03/12/some-nosql-links/</feedburner:origLink></item>
		<item>
		<title>Cassandra and the NoSQL scalable OLTP argument</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/U_i2wFZEqd8/</link>
		<comments>http://www.dbms2.com/2010/03/02/cassandra-nosql-scalable-oltp/#comments</comments>
		<pubDate>Tue, 02 Mar 2010 19:01:13 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1675</guid>
		<description><![CDATA[Todd Hoff put up a provocative post on High Scalability called MySQL and Memcached: End of an Era? The post itself focuses on observations like:

Facebook invented and is adopting Cassandra.
Twitter is adopting Cassandra.
Digg is adopting Cassandra.
LinkedIn invented and is adopting Voldemort.
Gee, it seems as if the super-scalable website biz has moved beyond MySQL/Memcached.

But in addition, he [...]]]></description>
			<content:encoded><![CDATA[<p>Todd Hoff put up a provocative post on High Scalability called <a href="http://highscalability.com/blog/2010/2/26/mysql-and-memcached-end-of-an-era.html" onclick="javascript:pageTracker._trackPageview('/highscalability.com');">MySQL and Memcached: End of an Era?</a> The post itself focuses on observations like:</p>
<ul>
<li>Facebook invented and is adopting Cassandra.</li>
<li>Twitter is adopting Cassandra.</li>
<li>Digg is adopting Cassandra.</li>
<li>LinkedIn invented and is adopting Voldemort.</li>
<li>Gee, it seems as if the super-scalable website biz has moved beyond MySQL/Memcached.</li>
</ul>
<p>But in addition, he provides a lot of useful links, which DBMS-oriented folks such as myself might have previously overlooked. <span id="more-1675"></span>Following those trails gets one to, among other things:</p>
<ul>
<li>A September, 2009 post outlining <a href="http://about.digg.com/blog/looking-future-cassandra" onclick="javascript:pageTracker._trackPageview('/about.digg.com');">Digg&#8217;s reasons for moving to Cassandra</a>. The core idea is that joining two tables is expensive; it&#8217;s cheaper to store the results prejoined on disk. Details are provided.</li>
<li>A February, 2010 post outlining <a href="http://nosql.mypopescu.com/post/407159447/cassandra-twitter-an-interview-with-ryan-king" onclick="javascript:pageTracker._trackPageview('/nosql.mypopescu.com');">Twitter&#8217;s reasons for moving to Cassandra</a>. They boil down to &#8220;sufficiently scalable, sufficiently simple, sufficiently robust, robustly open source.&#8221;</li>
<li>A <a href="http://www.niallkennedy.com/blog/uploads/flickr_php.pdf" onclick="javascript:pageTracker._trackPageview('/www.niallkennedy.com');">Flickr slide presentation</a> saying &#8220;normalization is for wimps&#8221;. They seemed to be staying with MySQL, but lusting after XPath.</li>
<li>A nice <a href="http://blog.evanweaver.com/articles/2009/07/06/up-and-running-with-cassandra/" onclick="javascript:pageTracker._trackPageview('/blog.evanweaver.com');">Cassandra technical overview</a> by Evan Weaver of Twitter.</li>
</ul>
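<p>Digg&#8217;s argument in the list above, that storing results prejoined is cheaper than joining at read time, can be sketched roughly as follows. This is a toy under stated assumptions, not Digg&#8217;s or Cassandra&#8217;s actual schema: the <code>users</code> and <code>diggs</code> data and every function name are hypothetical.</p>

```python
# Normalized layout: two "tables", joined at read time.
users = {1: "alice", 2: "bob"}
diggs = [
    {"user_id": 1, "story": "story-a"},
    {"user_id": 2, "story": "story-a"},
    {"user_id": 1, "story": "story-b"},
]


def diggers_by_join(story):
    # Read-time join: scan the diggs, then look up each user by id.
    return sorted(users[d["user_id"]] for d in diggs if d["story"] == story)


# Denormalized layout: the join result is maintained at write time.
prejoined = {}


def record_digg(user_id, story):
    # Each write does extra work so that a read becomes a single key lookup.
    prejoined.setdefault(story, []).append(users[user_id])


for d in diggs:
    record_digg(d["user_id"], d["story"])


def diggers_prejoined(story):
    # No join here: the answer is already stored under the story key.
    return sorted(prejoined.get(story, []))


print(diggers_by_join("story-a"), diggers_prejoined("story-a"))
```

<p>The tradeoff is that the prejoined copy has to be kept correct by application code, which is only tolerable when the writes are as simple as they are here.</p>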
<p>I also recall seeing something that said &#8220;We have 13X as many queries as updates, so of course we should optimize for reads,&#8221; but I can&#8217;t find that now. The classical OLTP answer to that would probably be &#8220;Yeah, but by the time you&#8217;re two-phase-committing and integrity-checking all the parts of that update, it turns out updates are still what you should optimize for.&#8221; Well, what if the update is so simple that that&#8217;s no longer a valid argument?</p>
<p>There certainly seem to be some non-obvious technical choices being made here, with options being conflated that perhaps shouldn&#8217;t be. In particular, I wonder whether things are being written to cheap disk in a really fast way when it might be better to keep them in more expensive RAM or, perhaps better yet, solid-state memory. Perhaps then the functionality/performance tradeoff wouldn&#8217;t be so painful.</p>
<p>On the other hand, the designers of the world&#8217;s most scalable websites &#8212; e-commerce sites perhaps excepted &#8212; seem pretty unanimous in thinking it&#8217;s best to bake some database/integrity management into the applications, rather than offload it all to an RDBMS. Why? Because the transactions are so simple that hand-coding all that isn&#8217;t prohibitive. And of course because of their extreme performance and scalability needs.</p>
<p>I&#8217;m not sure on what basis one could argue that they&#8217;re wrong.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/03/02/cassandra-nosql-scalable-oltp/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2010/03/02/cassandra-nosql-scalable-oltp/</feedburner:origLink></item>
		<item>
		<title>Data exploration vs. data visualization</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/2hjeK1MiqFE/</link>
		<comments>http://www.dbms2.com/2010/03/01/data-exploration-visualization/#comments</comments>
		<pubDate>Mon, 01 Mar 2010 09:29:47 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1666</guid>
		<description><![CDATA[I&#8217;ve tended to conflate data exploration and data visualization, and I&#8217;m far from alone in doing so. But a recent Economist article is a useful reminder that they aren&#8217;t exactly the same thing.
The article makes the same conflation, but while reading it I noticed something interesting. The concrete examples cited are of clever consultants who [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve tended to conflate <a href="http://www.dbms2.com/2010/01/31/trends-database-aanalytic-technology/" >data exploration and data visualization</a>, and I&#8217;m far from alone in doing so. But a recent <a href="http://www.economist.com/specialreports/displaystory.cfm?story_id=15557455" onclick="javascript:pageTracker._trackPageview('/www.economist.com');"><em>Economist</em></a> article is a useful reminder that they aren&#8217;t exactly the same thing.<span id="more-1666"></span></p>
<p>The article makes the same conflation, but while reading it I noticed something interesting. The concrete examples cited are of clever consultants who crafted innovative data visualizations on the fly, to make conclusions patently apparent to even mathematically-challenged decision-makers. That kind of thing is important, and has been going on <a href="http://tokyohanna.blogspot.com/2009/12/nightingale-x-healthcare-x-visualizing.html" onclick="javascript:pageTracker._trackPageview('/tokyohanna.blogspot.com');">for over 140 years</a>.*</p>
<p><em>*Yes, I&#8217;m trotting out the Florence Nightingale example again. I continue to be in awe of her.</em></p>
<p>What worries me is the article&#8217;s suggestion that <strong>the best data visualizations are done by visualization experts, as ways of making information apparent to other people.</strong> For as long as data visualization relies on hotshot visual-design experts doing one-off projects, its impact on enterprises overall will remain extremely limited. In other words, <strong>to the extent it is incorrect to conflate data visualization and data exploration, data visualization will remain a fringe technology</strong>.</p>
<p>To be fair, a primary decision support/business intelligence usage cycle has always been &#8212; where by &#8220;always&#8221; I mean &#8220;for at least the past 35 years&#8221; &#8212;</p>
<ul>
<li><strong>Data exploration</strong>. Power user uses technology to find something interesting.</li>
<li><strong>&#8220;Look what I found!&#8221; </strong>Power user then shows a report, chart, or other summary/representation to colleagues.</li>
</ul>
<p>So to the extent modern interactive data exploration/visualization technology fits that paradigm, great. But to the extent that visualization experts are somehow integral to the technology&#8217;s use, it will remain stuck on the analytic fringe.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/03/01/data-exploration-visualization/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2010/03/01/data-exploration-visualization/</feedburner:origLink></item>
		<item>
		<title>Another reason to expect number-crunching and big-data management to converge</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/kyvOm66BU5Y/</link>
		<comments>http://www.dbms2.com/2010/02/26/number-crunching-big-data-managementconverge/#comments</comments>
		<pubDate>Fri, 26 Feb 2010 06:03:12 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1660</guid>
		<description><![CDATA[Dan Olds argues that Oracle is likely to pursue commercially-substantive high performance computing (HPC), emphasis mine:
I just don’t see Oracle abandoning HPC entirely. I think it may call it by some other name or describe it differently, but it will be in the high throughput computing business for the foreseeable future.
There are some interesting angles [...]]]></description>
			<content:encoded><![CDATA[<p>Dan Olds argues that <a href="http://www.theregister.co.uk/2010/02/25/oracle_sun/" onclick="javascript:pageTracker._trackPageview('/www.theregister.co.uk');">Oracle is likely to pursue commercially-substantive high performance computing</a> (HPC), emphasis mine:<span id="more-1660"></span></p>
<blockquote><p>I just don’t see Oracle abandoning HPC entirely. I think it may call it by some other name or describe it differently, but it will be <strong>in the high throughput computing business for the foreseeable future.</strong></p>
<p>There are some interesting angles for it to pursue. <strong>Many of its best commercial customers have sizeable HPC or HPC-like workloads</strong> that Oracle can now (with the addition of Sun) compete for. I don’t see it passing up those opportunities.</p>
<p>Oracle can also look to specialize on certain subsets of the market and provide more of a solution rather than piece parts. I wouldn’t be surprised to hear of it offering<strong> an Exadata-like system that is optimized for, say, seismic or financial services.</strong> In fact, Exadata as it stands today is a decent fit for financial service analytic workloads.</p>
<p>HPC can be a profitable business and, in a lot of organizations, it’s growing faster than traditional business processing. From Oracle’s perspective, what’s not to like?</p></blockquote>
<p>Now, except for the Exadata-in-financial-services comment, that&#8217;s not directly an argument for the convergence of number crunching and data management.  However, I think <a href="http://www.dbms2.com/2010/02/22/netezza-twinfin/" >Netezza and Aster Data</a> are showing the way for that convergence. So, up to a point, is <a href="http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/" >the scientific-research community</a>. And of course the <a href="http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/" >Hadoop</a> guys think they have the best way to that convergent future.</p>
<p>But if Dan Olds is right that the best technologies for Oracle to pursue HPC and big-data processing with aren&#8217;t all that far apart, then the chances that Oracle will indeed pursue their convergence are pretty high. And that would amount to critical mass for the trend.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/02/26/number-crunching-big-data-managementconverge/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2010/02/26/number-crunching-big-data-managementconverge/</feedburner:origLink></item>
		<item>
		<title>Notes on Sybase Adaptive Server Enterprise</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/PyedGr5Rj1g/</link>
		<comments>http://www.dbms2.com/2010/02/25/sybase-adaptive-server-enterprise-as/#comments</comments>
		<pubDate>Thu, 25 Feb 2010 13:10:48 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cache]]></category>
		<category><![CDATA[In-memory DBMS]]></category>
		<category><![CDATA[Memory-centric data management]]></category>
		<category><![CDATA[Sybase]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1646</guid>
		<description><![CDATA[It had been a very long time since I was remotely up to speed on Sybase&#8217;s main OLTP DBMS, Adaptive Server Enterprise (ASE).  Raj Rathee, however, was kind enough to fill me in a few days ago. Highlights of our chat included:

One of the most confusing things about Sybase ASE is its version numbering. In [...]]]></description>
			<content:encoded><![CDATA[<p>It had been a very long time since I was remotely up to speed on Sybase&#8217;s main OLTP DBMS, Adaptive Server Enterprise (ASE).  Raj Rathee, however, was kind enough to fill me in a few days ago. Highlights of our chat included:<span id="more-1646"></span></p>
<ul>
<li>One of the most confusing things about Sybase ASE is its version numbering. In particular,
<ul>
<li>Sybase ASE 15.5 went GA in December, 2009. (But the clustered version is just coming out in March.)</li>
<li>The prior version of Sybase ASE was 15.0.3.</li>
<li>Sybase ASE 15.0 came out in September, 2005.</li>
<li>The version of Sybase ASE before that was 12.5.</li>
<li>And by the way, Sybase System 10 came out in 1994 or so.</li>
</ul>
</li>
<li><strong>Sybase ASE 15.0 was a major rewrite.</strong> In particular, Sybase ASE 15.0 had a “brand new” optimizer and query processing engine, based on the <strong>Volcano</strong> model. The main driver of the rewrite was to make Sybase ASE suitable for mixing OLTP and some level of decision-support workloads. (Not on the order of what Sybase IQ can handle, but at least operational reporting and so on.)</li>
<li>I haven&#8217;t looked up Volcano in more detail than to confirm that what I thought Raj said made sense, but as he characterized it, it&#8217;s a lot more modular than what Sybase had in ASE 12.5. For example, substantially the only join algorithm in Sybase ASE 12.5 was nested loop – no hash or sort/merge.</li>
<li>As you might imagine, a lot of things one might regard as core modern DBMS features were only added to Sybase ASE once 15.0 came out. Examples include:
<ul>
<li>Various forms of partitioning at the storage level.</li>
<li>User-defined functions (UDFs).</li>
<li>A clustering offering that competes with Oracle RAC. (100 or so customers are on that so far.) Absent clustering, Sybase ASE is limited to a single SMP (Symmetric Multiprocessing) box.</li>
<li>Shared disk. Amazingly, it seems that before 2008, every node in an SMP box running Sybase ASE had its own private partition (maybe not the right word) of data.</li>
</ul>
</li>
<li>In Sybase ASE, you have lots of databases managed by one database server. You can write SQL statements that span multiple databases, but they have to reference database names as well as table names.</li>
<li>There are several ways to get data from one place to another in Sybase&#8217;s technology and nomenclature, specifically including Replication Server, Incremental Data Transfer, and “proxy tables.” (Other than the fact that Replication Server is a separate, chargeable product, I don&#8217;t really have these straight.) In addition, there&#8217;s a hand-coded one in <a href="http://www.dbms2.com/2010/02/05/sybase-aleri-rap/" >Sybase RAP</a>, which will get a planned 5-6X performance improvement later this year when it is replaced by Incremental Data Transfer.</li>
</ul>
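<p>For readers who want the join-algorithm point above spelled out, here is a minimal sketch contrasting a nested-loop join with a hash join. It is a generic textbook illustration and says nothing about how Sybase ASE implements either one; the function names and sample rows are assumptions made up for the example.</p>

```python
def nested_loop_join(left, right, key):
    # Compare every left row with every right row:
    # len(left) * len(right) comparisons.
    return [(l, r) for l in left for r in right if l[key] == r[key]]


def hash_join(left, right, key):
    # Build phase: index the right-hand rows by join key.
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    # Probe phase: one dictionary lookup per left row.
    return [(l, r) for l in left for r in index.get(l[key], [])]


orders = [
    {"cust": 1, "item": "disk"},
    {"cust": 2, "item": "ram"},
    {"cust": 1, "item": "cpu"},
]
customers = [
    {"cust": 1, "name": "Acme"},
    {"cust": 2, "name": "Initech"},
]

# Both algorithms produce the same pairs; they differ in how much work they do.
same = nested_loop_join(orders, customers, "cust") == hash_join(orders, customers, "cust")
print(same)
```

<p>Nested loop is fine for the small lookups typical of OLTP, while the hash variant pays off on the larger scans that reporting workloads bring, which fits the stated motivation for the 15.0 rewrite.</p>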
<p>And in what basically sounds like a very cool approach, Sybase ASE has a lot of <strong>memory-centric</strong> aspects. That said, Sybase&#8217;s in-memory ASE story is still incomplete (wait until the next release) and confused (I think in part because of what&#8217;s missing in the current release).  Also, this is one area where the non-technical nature of the briefing got in my way. So here&#8217;s some of what I do and don&#8217;t know about Sybase&#8217;s memory-centric ASE strategy:</p>
<ul>
<li>Sybase lets you mix and match on-disk and in-memory databases under one instance of Sybase ASE. To a programmer, it all looks like ASE.</li>
<li>I don&#8217;t know exactly what the limitations are on what you can do with in-memory databases, how you can use them in tandem with on-disk databases, etc.</li>
<li>You can replicate data from disk to an in-memory Sybase ASE database today. (Hello caching, à la Oracle TimesTen or IBM DB2/solidDB.)</li>
<li>Replicating from memory to disk is a near-term future capability. (So Sybase does not yet have a hybrid memory-centric story à la <a href="http://www.dbms2.com/2007/06/22/in-memory-database-solid/" >solidDB Classic</a>.)</li>
<li>I have no clue as to what kinds of in-memory data structures Sybase ASE uses.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/02/25/sybase-adaptive-server-enterprise-as/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2010/02/25/sybase-adaptive-server-enterprise-as/</feedburner:origLink></item>
		<item>
		<title>Chris Bird’s blog is brilliant, and update-in-place is increasingly passe’</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/QExlC3V_gJE/</link>
		<comments>http://www.dbms2.com/2010/02/25/chris-bird-database-design-update-in-plac/#comments</comments>
		<pubDate>Thu, 25 Feb 2010 05:44:54 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1641</guid>
		<description><![CDATA[I wouldn&#8217;t say every post in Chris Bird&#8217;s occasionally-updated blog is brilliant. I wouldn&#8217;t even say every post is readable. But I&#8217;d still recommend his blog to just about anybody who reads here as, at a minimum, a consciousness-raiser.
One of the two posts inspiring me to mention this is a high-level one on &#8220;technical debt&#8221;, [...]]]></description>
			<content:encoded><![CDATA[<p>I wouldn&#8217;t say every post in Chris Bird&#8217;s occasionally-updated blog is brilliant. I wouldn&#8217;t even say every post is readable. But I&#8217;d still recommend his blog to just about anybody who reads here as, at a minimum, a consciousness-raiser.</p>
<p>One of the two posts inspiring me to mention this is a high-level one on &#8220;<a href="http://businessanditarchitecture.blogspot.com/2009/10/technical-debt.html" onclick="javascript:pageTracker._trackPageview('/businessanditarchitecture.blogspot.com');">technical debt</a>&#8221;, reminding us why things don&#8217;t always get done right the first time, and further reminding us that circling back to fix them sooner rather than later is usually wise. The other <a href="http://businessanditarchitecture.blogspot.com/2009/11/updates-harmful.html" onclick="javascript:pageTracker._trackPageview('/businessanditarchitecture.blogspot.com');">connects two observations</a> that individually have great merit (at least if you don&#8217;t take them to extremes):</p>
<ul>
<li>Update-in-place is passe&#8217;</li>
<li>So is elaborate up-front database design</li>
</ul>
<p>Specific points of interest here include:<span id="more-1641"></span></p>
<ul>
<li>Most data never gets changed after being written. Update-in-place doesn&#8217;t save all that much in storage hardware.</li>
<li>Update-in-place interferes with a lot of modern optimizations in analytic DBMS design.</li>
<li>Knowing what values data had in the past is interesting in and of itself.</li>
<li>So, potentially, is knowing what &#8220;dirty&#8221; data end-users &#8212; especially customers and prospects &#8212; decided to enter.</li>
<li>The &#8220;right&#8221; amount of data validation is application-dependent. For example, if data validation involves torturing your customers, maybe it&#8217;s not such a good idea. (Great observation by Chris.)</li>
<li>If you have the old data as well as the new, the harm of having &#8220;bad&#8221; updates is lessened. (Central connecting observation by Chris.)</li>
<li>People enter data inconsistently. MDM (Master Data Management) and data cleansing tools fix much (admittedly not all) of the harm. Computers are cheaper than people. You do the math.</li>
<li>Data is increasingly being managed in non-relational and/or non-persistent ways. Get used to it.</li>
<li>As the <a href="http://www.dbms2.com/2009/12/12/legit-nosql-key-value-store/" >NoSQL</a> guys point out, some of today&#8217;s most demanding applications have extremely simple schemas.</li>
</ul>
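<p>To make the &#8220;keep the old data as well as the new&#8221; point concrete, here is a minimal Python sketch of an append-only store. It is only an illustration under my own assumptions; the class name, method names, and sample keys are hypothetical, and neither Chris&#8217;s post nor any particular DBMS is being quoted.</p>

```python
from datetime import datetime, timezone


class AppendOnlyStore:
    """Hypothetical store that keeps every version instead of overwriting."""

    def __init__(self):
        # key -> list of (timestamp, value) pairs, oldest first
        self._versions = {}

    def write(self, key, value, when=None):
        # A "change" is just another appended version; nothing is overwritten.
        when = when or datetime.now(timezone.utc)
        self._versions.setdefault(key, []).append((when, value))

    def current(self, key):
        # The latest version plays the role of the updated value.
        return self._versions[key][-1][1]

    def history(self, key):
        # Earlier values stay available, including dirty or mistaken ones.
        return [value for _, value in self._versions[key]]


store = AppendOnlyStore()
store.write("customer:42:city", "Bostn")    # dirty entry, exactly as typed
store.write("customer:42:city", "Boston")   # later correction
print(store.current("customer:42:city"))
print(store.history("customer:42:city"))
```

<p>A bad &#8220;update&#8221; in this scheme is just one more version, so it can be inspected or ignored later rather than destroying what came before.</p>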
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/02/25/chris-bird-database-design-update-in-plac/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2010/02/25/chris-bird-database-design-update-in-plac/</feedburner:origLink></item>
		<item>
		<title>February 2010 data warehouse DBMS news roundup</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/NR9Z7N-VZhY/</link>
		<comments>http://www.dbms2.com/2010/02/22/data-warehouse-dbms-news-roundup/#comments</comments>
		<pubDate>Mon, 22 Feb 2010 08:30:23 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Vertica Systems]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1628</guid>
		<description><![CDATA[February is usually a busy month for data warehouse DBMS product releases, product announcements, and other real or contrived data warehouse DBMS news, and it can get pretty confusing trying to keep those categories of “news” apart.*  This year is no exception, although several vendors – including Teradata and Netezza – are taking “rolling thunder” [...]]]></description>
			<content:encoded><![CDATA[<p>February is usually a busy month for data warehouse DBMS product releases, product announcements, and other real or contrived data warehouse DBMS news, and it can get pretty confusing trying to keep those categories of “news” apart.*  This year is no exception, although several vendors – including Teradata and Netezza – are taking “rolling thunder” approaches, doing some of their announcements this month while holding others back for March or April.</p>
<p><em>*I probably have it worse than most people in that regard, because my clients run tentative feature lists and announcement schedules by me well in advance, which may get changed multiple times before the final dates roll around. I also occasionally miss some detail, if it wasn&#8217;t in a pre-briefing but gets added at the end.</em></p>
<p>Anyhow, the three big themes of this month&#8217;s announcements are probably:</p>
<ul>
<li><strong>Integrating different kinds of analytic processing into databases and DBMS. </strong></li>
<li><strong>Taking advantage of hardware advances.</strong></li>
<li><strong>Playing catchup</strong> in areas where small vendors&#8217; products weren&#8217;t mature yet.</li>
</ul>
<p><span id="more-1628"></span>For example, the three biggest data warehouse DBMS product announcements this month are probably:</p>
<ul>
<li><strong>Aster Data nCluster 4.5.</strong> Much like Aster&#8217;s prior release, <a href="../../../../../2009/10/30/aster-data-application-server-ncluster/">Aster Data nCluster 4.0</a>, <a href="http://www.dbms2.com/2010/02/22/aster-data-ncluster-4-5/" >Aster Data nCluster 4.5</a> has a major focus on integrating analytics and database processing. This time, the emphasis is on application development tools and pre-built analytic packages. In addition, Aster&#8217;s management tool GUIs have been upgraded, building on catch-up functionality in Aster Data nCluster 4.0.</li>
<li><strong>Netezza&#8217;s “i” add-on to its existing TwinFin products.</strong> With <a href="../../../../../2010/02/22/netezza-twinfin/">Netezza TwinFin(i)</a>, Netezza becomes the second MPP RDBMS vendor with a comprehensive “Big Data Analytic Platform” kind of strategy. (Netezza would surely argue that it was the first, but that depends on how seriously one took <a href="../../../../../2007/09/27/the-netezza-developer-network/">Netezza&#8217;s prior attempt</a>.) Many of the details are different from Aster&#8217;s, of course, but the general philosophy is similar. So far, Netezza has announced one interesting proprietary library of analytic packages (for linear/matrix algebra), plus the port of 4,000 or so functions in open source libraries.</li>
<li><strong>Vertica 4.0.</strong> Vertica has had a highly innovative columnar DBMS architecture from the get-go, but at the cost of some restrictions or awkwardness in the relationship between data layout and SQL processing. Vertica says that <a href="../../../../../2010/02/22/vertica-4/">Vertica 4.0</a> fixes all that. In addition, it has some analytic processing enhancements, especially in the time series area, where Vertica doesn&#8217;t vigorously dispute that Sybase IQ previously had an advantage.</li>
</ul>
<p>In addition,</p>
<ul>
<li><strong>Teradata is announcing its Data Warehouse Appliance 2580, the successor to the Teradata 2550.</strong> This is purely a hardware refresh; Teradata&#8217;s hardware and software upgrades are not generally synced. The Teradata 2580 upgrades CPUs from Harpertown to Nehalem, includes 3X the RAM of its predecessor, and offers an option for 1 TB disks (thus lowering the bottom price/TB a lot, to $31K list).</li>
<li>Aster, Vertica, and ParAccel have all called attention to the fact that, if solid-state drives have interfaces like those of disk drives, and if a DBMS supports disk drives, then that DBMS supports solid-state drives as well. At least Aster and ParAccel have signaled that they have at least one customer or prospect each interested in Fusion I/O&#8217;s solid-state technology, especially in the retail sector. This is basically a hardware matter as well, and a big deal only for those who were somehow unaware of <a href="../../../../../2010/01/31/flash-pcmsolid-state-memory-disk/">the impending dominance of solid-state memory technology</a>.</li>
<li>Sybase announced its <a href="../../../../../2010/02/05/sybase-aleri-rap/">Aleri</a> acquisition earlier this month.</li>
<li>Various vendors have bragged about various rankings, awards, or benchmarks, or – sometimes less tediously – about last year&#8217;s sales results.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/02/22/data-warehouse-dbms-news-roundup/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2010/02/22/data-warehouse-dbms-news-roundup/</feedburner:origLink></item>
		<item>
		<title>TwinFin(i) – Netezza’s version of a parallel analytic platform</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/ZkxPFvQRIl0/</link>
		<comments>http://www.dbms2.com/2010/02/22/netezza-twinfin/#comments</comments>
		<pubDate>Mon, 22 Feb 2010 08:21:13 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[SAS Institute]]></category>
		<category><![CDATA[Teradata]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1613</guid>
		<description><![CDATA[Much like Aster Data did in Aster 4.0 and now Aster 4.5, Netezza is announcing a general parallel big data analytic platform strategy. It is called Netezza TwinFin(i), it is a chargeable option for the Netezza TwinFin appliance, and many announced details are on the vague side, with Netezza promising more clarity at or before [...]]]></description>
			<content:encoded><![CDATA[<p>Much like Aster Data did in <a href="http://www.dbms2.com/2009/10/30/aster-data-application-server-ncluster/" >Aster 4.0</a> and now <a href="http://www.dbms2.com/2010/02/22/aster-data-ncluster-4-5/" >Aster 4.5</a>, Netezza is announcing a general parallel big data analytic platform strategy. It is called Netezza TwinFin(i), it is a chargeable option for the <a href="http://www.dbms2.com/2009/07/30/netezza-new-product-family/" >Netezza TwinFin</a> appliance, and many announced details are on the vague side, with Netezza promising more clarity at or before its Enzee Universe conference in June. At a high level, the Aster and Netezza approaches compare/contrast as follows:<span id="more-1613"></span></p>
<ul>
<li>Netezza&#8217;s software runs on well-designed proprietary hardware. Aster runs on hardware that&#8217;s more off-the-shelf.</li>
<li>Aster was first to ship, and will also be first to ship an IDE (Integrated Development Environment).</li>
<li>MapReduce is central to Aster&#8217;s approach. Netezza TwinFin(i) supports MapReduce too, specifically a Hadoop implementation, but I don&#8217;t get the sense that everything Netezza does is built on MapReduce underpinnings.</li>
<li>Both Aster and Netezza try to provide rich functionality for creating in-memory data structures that parallel analytic programs can use. Both seem to let you escape from the pure relational-table paradigm more easily than, say, Teradata&#8217;s new persistent memory capabilities do.</li>
<li>Aster and Netezza have made different choices about what kinds of prebuilt analytic packages to offer. Netezza could actually leapfrog Aster in this regard, but let&#8217;s see where each vendor is by, say, mid-year. If you care about the details of built-in analytic functions, you really should consider executing non-disclosure agreements with both those companies.</li>
<li>Both Aster and Netezza stress that you can run analytic functions out-of-process, greatly reducing the chance that they crash the database. Netezza, and I&#8217;m pretty sure Aster as well, retain the option of running in-process, which provides maximum performance. (In Netezza&#8217;s case C++ is the only in-process language supported, and I think Aster has a similar limitation.)</li>
<li>Like Aster, Netezza is integrating SQL queries and other analytic processing under the same workload management rubric.</li>
<li>Much like Aster, Netezza is tap-dancing by implying much richer forthcoming SAS support than anything currently announced. (The crunch-per-paragraph ratio in either vendor&#8217;s SAS-related press releases to date is distressingly low.)</li>
</ul>
<p>More specifically, here are some highlights of what I know, am guessing, and/or am allowed to say about Netezza TwinFin(i) at this time.</p>
<ul>
<li>The foundation for the analytic add-ons in Netezza TwinFin(i) is some sort of low-level “analytic executables.” Not understanding exactly what these are is my biggest area of confusion in the whole TwinFin(i) stack. Are they all C++, with everything translated into same? Is there Java all the way down as an alternative? (E.g., Hadoop is written in Java.) Anyhow, whatever it is, it&#8217;s surely a big improvement on <a href="../../../../../2007/09/27/the-netezza-developer-network/">Netezza&#8217;s prior Verilog-based generation of analytic extensibility technology</a>.</li>
<li>The announced list of languages supported in Netezza TwinFin(i) is Java, Python, Fortran, R, and C/C++. More are coming.</li>
<li>Netezza has named a lot of analytic functions it is adding, and is hinting at more to come. It has named <a href="http://cran.r-project.org/" onclick="javascript:pageTracker._trackPageview('/cran.r-project.org');">CRAN/R</a> and GNU libraries, saying those have 1900 or more functions each. Netezza has also built its own linear algebra library for TwinFin(i), called nzMatrix. And as previously noted, TwinFin(i) also boasts a Hadoop implementation.</li>
<li>I haven&#8217;t heard about much in the way of TwinFin(i)-specific IDE support.</li>
<li>I don&#8217;t really have details as to what kinds of in-memory data structures Netezza TwinFin(i) does or doesn&#8217;t support.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/02/22/netezza-twinfin/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2010/02/22/netezza-twinfin/</feedburner:origLink></item>
		<item>
		<title>Aster Data nCluster 4.5</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/fFFlgMm8hOk/</link>
		<comments>http://www.dbms2.com/2010/02/22/aster-data-ncluster-4-5/#comments</comments>
		<pubDate>Mon, 22 Feb 2010 08:20:13 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[SAS Institute]]></category>
		<category><![CDATA[Teradata]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1617</guid>
		<description><![CDATA[Like Vertica, Netezza, and Teradata, Aster is using this week to pre-announce a forthcoming product release, Aster Data nCluster 4.5. Aster is really hanging its identity on “Big Data Analytics” or some variant of that concept, and so the two major named parts of Aster nCluster 4.5 are:

Aster Data Analytic Foundation, a set of analytic [...]]]></description>
			<content:encoded><![CDATA[<p>Like <a href="http://www.dbms2.com/2010/02/22/vertica-4/" >Vertica</a>, <a href="http://www.dbms2.com/2010/02/22/netezza-twinfin/" >Netezza</a>, and Teradata, Aster is using this week to pre-announce a forthcoming product release, Aster Data nCluster 4.5. Aster is really hanging its identity on “Big Data Analytics” or some variant of that concept, and so the two major named parts of Aster nCluster 4.5 are:</p>
<ul>
<li><strong>Aster Data Analytic Foundation,</strong> a set of analytic packages prebuilt in <a href="../2009/06/09/aster-data-nclustersql-mapreduce/">Aster&#8217;s SQL-MapReduce</a></li>
<li><strong>Aster Data Developer Express,</strong> an Eclipse-based IDE (Integrated Development Environment) for developing and testing applications built on Aster nCluster, Aster SQL-MapReduce, and Aster Data Analytic Foundation</li>
</ul>
<p>And in other Aster news:</p>
<ul>
<li>Along with the development GUI in Aster nCluster 4.5, there is also a new administrative GUI.</li>
<li>Aster has certified that nCluster works with Fusion I/O boards, because at least one retail industry prospect cares. However, that in no way means that arm&#8217;s-length Fusion I/O certification is Aster&#8217;s ultimate <a href="../2010/01/31/flash-pcmsolid-state-memory-disk/">solid-state memory</a> strategy.</li>
<li>I had the wrong impression about how far Aster/SAS integration has gotten. So far, it&#8217;s just at the connector level.</li>
</ul>
<p>Aster Data Developer Express evidently does some cool stuff, like providing some sort of parallelism testing right on your desktop. It also generates lots of stub code, saving humans from the tedium of doing that. Useful, obviously.</p>
<p>But mainly, I want to write about the analytic packages.<span id="more-1617"></span> I&#8217;m not convinced that they&#8217;re a big deal in themselves yet, or that a whole lot of person-months have gone into their combined development. Still, I think they provide a great indication of one direction in which analytic functionality is going. And by the way, Aster promises to release a lot more of that kind of thing over the next 12 months.</p>
<p>Aster&#8217;s flagship analytic package is <a href="../2009/02/10/aster-data-npath/">nPath</a>, which is like a <strong>regular expression matcher,</strong> but <strong>for (time) series of data</strong> rather than for character strings. The main use for nPath is in pulling specific kinds of event sequences out of web or network event logs. However, one could imagine uses in other sectors that focus on temporal or sequential data (e.g., trading, intelligence, other sensor analysis), should existing SQL- and/or CEP-based technologies not prove sufficiently flexible. Aster 4.5 adds some new aggregation capabilities around nPath.</p>
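<p>To illustrate the &#8220;regular expression matcher for series of data&#8221; idea, here is a minimal Python sketch. It is not nPath and does not use Aster&#8217;s syntax; it simply encodes each event as a single character and runs an ordinary regular expression over the result. The event names and codes are hypothetical, and the events are assumed to be one user&#8217;s clicks already sorted by time.</p>

```python
import re

# Hypothetical one-letter codes for the event types in a clickstream.
EVENT_CODES = {"home": "H", "search": "S", "product": "P",
               "cart": "C", "checkout": "K"}


def find_paths(events, pattern):
    """Return the runs of events whose encoded form matches `pattern`."""
    # One character per event, so string offsets equal event positions.
    encoded = "".join(EVENT_CODES[e] for e in events)
    return [events[m.start():m.end()] for m in re.finditer(pattern, encoded)]


clicks = ["home", "search", "product", "product", "cart", "checkout", "home"]
# One search, then one or more product views, then a cart add.
matches = find_paths(clicks, r"SP+C")
print(matches)
```

<p>The one-character-per-event encoding is what lets a string matcher stand in for a sequence matcher here; a real implementation would also need to handle partitioning by user and much larger event vocabularies.</p>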
<p>Other not-wholly-new packages in the Aster Data Analytic Foundation announcement are for <strong>sessionization</strong> (of clickstream data and the like) and <strong>tokenization </strong>(of text/character string data). While sessionization can be done in SQL, Aster thinks its MapReduce-based version is faster, since it doesn&#8217;t require self-joins. Makes sense. Aster&#8217;s tokenization sounds lame, however – text analytics in MapReduce tends to reinvent simplistic wheels for no clear reason, and Aster doesn&#8217;t seem to be an exception. (Aster would argue, however, that anything it does in SQL-MapReduce is more flexible than pure SQL or pure MapReduce alternatives.)</p>
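<p>The no-self-join point about sessionization can be sketched in a few lines of Python. This is my own illustration, not Aster&#8217;s implementation: the 30-minute timeout, the function name, and the (user, timestamp) record layout are all assumptions.</p>

```python
from itertools import groupby
from operator import itemgetter

SESSION_GAP_SECONDS = 1800  # hypothetical 30-minute inactivity timeout


def sessionize(clicks):
    """Assign a per-user session number to (user, timestamp) clicks.

    One sort plus one linear pass: each click is compared only with the
    previous click of the same user, so no self-join is needed.
    """
    result = []
    ordered = sorted(clicks, key=itemgetter(0, 1))
    for user, user_clicks in groupby(ordered, key=itemgetter(0)):
        session, previous = 0, None
        for _, ts in user_clicks:
            # A long enough gap since the previous click starts a new session.
            if previous is not None and ts - previous > SESSION_GAP_SECONDS:
                session += 1
            result.append((user, ts, session))
            previous = ts
    return result


clicks = [("u1", 0), ("u1", 600), ("u1", 4000), ("u2", 100)]
sessions = sessionize(clicks)
print(sessions)
```

<p>In a partition-by-user, order-by-time setting, the same logic runs independently on each partition, which is presumably why it maps well onto a MapReduce-style function.</p>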
<p>Another example of better-living-without-self-joins is Aster&#8217;s new <strong>market basket</strong> package. This lets you look at a set of point-of-sale data, pick a small integer N, and pull out all the sets of N things that were bought by the same person at the same time. I haven&#8217;t probed the claim in detail, but Aster implies there&#8217;s less combinatorial explosion in its approach than there is in the self-join alternative.</p>
<p><em>Note: Gartner highlighted self joins as a performance challenge in its recent </em><a href="../2010/02/10/gartner-magic-quadrant-data-warehouse-2009-2010/">Data Warehouse Magic Quadrant</a><em>.</em></p>
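<p>For a sense of what &#8220;all the sets of N things bought together&#8221; means computationally, here is a minimal Python sketch. It is not Aster&#8217;s package; the function name and sample baskets are hypothetical, and each basket is assumed to be one person&#8217;s purchases at one time.</p>

```python
from collections import Counter
from itertools import combinations


def itemsets(baskets, n):
    """Count every set of n distinct items bought together in one basket."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each set one canonical ordering, so (bread, milk)
        # and (milk, bread) are counted as the same set.
        for combo in combinations(sorted(set(basket)), n):
            counts[combo] += 1
    return counts


baskets = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs"],
]
pair_counts = itemsets(baskets, 2)
print(pair_counts.most_common())
```

<p>Working one basket at a time generates each unordered set exactly once, whereas a naive N-way self-join on the basket ID would first produce ordered and duplicate tuples that then have to be filtered out.</p>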
<p>Aster is also releasing a few <strong>statistical and general analytic functions</strong> &#8212; specifically (and I quote a slide):</p>
<ul>
<li>exponential moving average</li>
<li>weighted moving average</li>
<li>simple moving average</li>
<li>volume-weighted average price</li>
<li>correlation</li>
<li>linear regression</li>
<li>logistic regression</li>
<li>approximate_percentile</li>
<li>approximate_count_distinct</li>
</ul>
<p>The point of the last two items on the list is that if you set a non-zero tolerance for error, you can count things or order them into bins very efficiently, especially in terms of RAM, while being guaranteed not to exceed your error tolerance.</p>
<p><em>Note: One obvious inference from this list &#8212; which Aster gladly confirms &#8212; is that Aster has high hopes of selling to the financial services industry. </em></p>
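<p>For readers who don&#8217;t live in financial services, three of the averages named on that slide can be sketched in plain Python as follows. These are the textbook definitions, not Aster&#8217;s implementations; the function names, the EMA seeding choice, and the sample numbers are my assumptions.</p>

```python
def simple_moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def exponential_moving_average(values, alpha):
    """EMA seeded with the first value; alpha is the smoothing factor."""
    ema = [values[0]]
    for v in values[1:]:
        # Blend the new value with the previous smoothed value.
        ema.append(alpha * v + (1 - alpha) * ema[-1])
    return ema


def vwap(trades):
    """Volume-weighted average price over (price, volume) pairs."""
    total_volume = sum(volume for _, volume in trades)
    return sum(price * volume for price, volume in trades) / total_volume


print(simple_moving_average([1, 2, 3, 4, 5], 3))
print(exponential_moving_average([10, 20], 0.5))
print(vwap([(10, 100), (20, 300)]))
```

<p>Each of these depends on the order of the rows, which is part of why they are awkward to express in plain SQL without window functions or self-joins.</p>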
<p>Finally, Aster is releasing its first pure <strong>graph-analytic</strong> function, for finding the shortest path between a given pair of nodes.</p>
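<p>I don&#8217;t know what algorithm Aster uses, or whether its function handles weighted edges. As a minimal sketch of the general idea, assuming an unweighted graph held as a hypothetical adjacency list, breadth-first search finds a shortest path between two nodes like this:</p>

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search over an unweighted adjacency-list graph.

    Returns the nodes on one shortest path, or None if goal is unreachable.
    """
    parents = {start: None}   # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
print(shortest_path(graph, "a", "e"))
```

<p>On a single machine this is trivial; the interesting part of doing it in a parallel DBMS is that each step of the search touches edges that may live on different nodes.</p>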
<p>While I had the Aster folks on the phone anyway, I also took the opportunity to ask about the Aster nCluster 4.0 capability to create fairly persistent non-relational in-memory data structures. Specifically, I asked whether different users could access the same in-memory structure, and was told that this is a little klugey but not too horrendous. That suggests Aster&#8217;s capability may be a strict superset of UDF-based (User-Defined Function) approaches to meeting the same need, at least from a functionality standpoint. However, ease of creating those in-memory structures may still be better in the more SQL/UDF-centric approach favored by Teradata.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/02/22/aster-data-ncluster-4-5/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2010/02/22/aster-data-ncluster-4-5/</feedburner:origLink></item>
	</channel>
</rss>
