<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>DBMS2 -- DataBase Management System Services</title>
	
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Sun, 08 Nov 2009 06:32:44 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.feedburner.com/dbms2/feed" type="application/rss+xml" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com" /><item>
		<title>Calpont’s InfiniDB</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/TzSICCtaJxw/</link>
		<comments>http://www.dbms2.com/2009/11/07/calponts-infinidb/#comments</comments>
		<pubDate>Sun, 08 Nov 2009 01:35:25 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Calpont]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Infobright]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Open source]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1207</guid>
		<description><![CDATA[Since its inception, Calpont has gone through multiple management teams, strategies, and investor groups. What it hadn&#8217;t done, ever, is actually ship a product. Last week, however, Calpont introduced a free/open source DBMS, InfiniDB, with technical details somewhat reminiscent of what Calpont was promising last April. Highlights include:

Like Infobright, Calpont&#8217;s 	InfiniDB is a columnar DBMS [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Since its inception, Calpont has gone through multiple management teams, strategies, and investor groups. What it hadn&#8217;t done, ever, is actually ship a product. Last week, however, Calpont introduced a free/open source DBMS, InfiniDB, with technical details somewhat reminiscent of <a href="../2009/04/20/calpont-update-you-read-it-here-first/">what Calpont was promising last April</a>. Highlights include:</p>
<ul>
<li>Like Infobright, Calpont&#8217;s 	InfiniDB is a columnar DBMS consisting of a MySQL front end and a 	columnar storage engine.</li>
<li>Community edition InfiniDB runs on 	a single server.</li>
<li>One of commercial/enterprise 	edition InfiniDB&#8217;s main claims to fame will be MPP support.</li>
<li>There&#8217;s no announced time frame 	for commercial edition InfiniDB.</li>
<li>InfiniDB&#8217;s current compression 	story is dictionary/token only, with decompression occurring  before 	joins are executed. Improvement is a roadmap item.</li>
<li>Indeed, InfiniDB has many roadmap 	items, a few of which can be found <a href="http://infinidb.org/resources/tech-articles/120-infinidb-community-edition-roadmap" onclick="javascript:pageTracker._trackPageview('/infinidb.org');">here</a>. 	Also, a great overview of InfiniDB&#8217;s current state and roadmap can 	be found in <a href="http://www.mysqlperformanceblog.com/2009/11/02/air-traffic-queries-in-infinidb-early-alpha/" onclick="javascript:pageTracker._trackPageview('/www.mysqlperformanceblog.com');">this 	MySQL Performance Blog</a> thread. (And follow the links there to 	find performance discussions of other free analytic DBMS.)</li>
<li>One thing InfiniDB already has 	that is still a roadmap item for Infobright is the ability to run a 	query across multiple cores at once.</li>
<li>One thing free InfiniDB has that 	Infobright only offers in its Enterprise Edition is ACID-compliant 	Insert/Update/Delete. <em>(Note: I wish people would stop saying that Infobright Enterprise Edition isn&#8217;t ACID-compliant, since that point was cleared up <a href="http://www.dbms2.com/2009/04/20/infobright-update-3/" >a while ago</a>.)</em></li>
<li>InfiniDB has no indexes or 	materialized views.</li>
<li>However, InfiniDB&#8217;s retrieval is 	expedited by something called “Extents,” which sounds a lot like 	Netezza&#8217;s zone maps.</li>
</ul>
<p><em>Being on vacation, I&#8217;ll stop there for now. (If it weren&#8217;t for Tropical Storm/Depression Ida, I might not even be posting this much until I get back.)</em></p>
<img src="http://feeds.feedburner.com/~r/dbms2/feed/~4/TzSICCtaJxw" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/11/07/calponts-infinidb/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2009/11/07/calponts-infinidb/</feedburner:origLink></item>
		<item>
		<title>Aster Data 4.0 and the evolution of “advanced analytic(s) servers”</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/nKpnfh1o6aM/</link>
		<comments>http://www.dbms2.com/2009/10/30/aster-data-application-server-ncluster/#comments</comments>
		<pubDate>Sat, 31 Oct 2009 01:56:55 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1198</guid>
		<description><![CDATA[Since Linda and I are leaving on vacation in a few hours, Aster Data graciously gave me permission to morph its “12:01 am Monday, November 2” embargo into “late Friday night.”
Aster Data is officially announcing the 4.0 release of nCluster. There are two big pieces to this announcement:

Aster is 	offering a slick vision for integrating [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;"><em>Since Linda and I are leaving on vacation in a few hours, Aster Data graciously gave me permission to morph its “12:01 am Monday, November 2” embargo into “late Friday night.”</em></p>
<p style="margin-bottom: 0in; font-style: normal;">Aster Data is officially announcing the 4.0 release of nCluster. There are two big pieces to this announcement:</p>
<ul>
<li>Aster is 	offering a slick vision for integrating big-database management and 	general analytic processing on the same MPP cluster, under the 	not-so-slick name “Data-Application Server.”</li>
<li>Aster is also 	offering a sophisticated vision for workload management.</li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;">In addition, Aster has matured nCluster in various ways, for example cleaning up a performance problem with single-row updates.</p>
<p style="margin-bottom: 0in; font-style: normal;">Highlights of the Aster “Data-Application Server” story include:<span id="more-1198"></span></p>
<ul>
<li>At its core, 	the Aster “Data-Application Server” is the Aster nCluster MPP 	analytic DBMS, enhanced with basic application server functionality 	(I didn&#8217;t ask for details of that part), running on the same 	nCluster worker nodes that answer SQL queries.</li>
<li>Thus, Aster is 	eliminating a lot of the data movement that plagues three-tier 	architectures and other less-integrated approaches.</li>
<li>The Aster 	“Data-Application Server” further offers integrated workload 	management for applications and queries; more on that below.</li>
<li>The Aster 	“Data-Application Server” requires applications to be 	parallelized and invoked via Aster&#8217;s <a href="../2009/10/15/mapreduce-webinar-slides/">SQL/MapReduce.</a></li>
<li>As befits a 	MapReduce-based system, the Aster “Data-Application Server” lets 	you write your applications in lots of different languages (the 	usual suspects, and it also does .NET).</li>
<li>The Aster 	“Data-Application Server” runs applications in their own process 	spaces, protecting the DBMS server from crashes and other damaging 	behavior.</li>
<li>The Aster 	“Data-Application Server” allows applications to manage memory 	themselves, persistently, and not just via relational constructs. 	Thus, if you want your application to maintain a graph, mini rules 	engine, and/or finite state machine, you can, without doing SQL 	contortions.</li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;">In a compelling proof point for the Aster Data-Application Server&#8217;s slickness, Aster has leapfrogged Teradata and Netezza in the extent to which SAS functionality is integrated into Aster&#8217;s DBMS. (Aster and SAS both say that you can do full SAS modeling in parallel on Aster, but even so I wouldn&#8217;t be surprised to discover there were some parts of SAS&#8217; system that turned out to be exceptions.) Of course, Aster is hardly the only analytic DBMS vendor to have the idea of explicitly enhancing general analytic processing; that&#8217;s why we see lots of MapReduce announcements, and it&#8217;s also why Teradata enhanced its UDFs (User-Defined Functions) to have some kind of persistent memory.* But I don&#8217;t know of anybody else whose approach is quite so elegant and general at this time.</p>
<p style="margin-bottom: 0in;"><em>*Unfortunately, I don&#8217;t yet know much about Teradata&#8217;s UDF enhancements. I neglected to drill down on Global Persistent Memory when it was mentioned a couple of times at Teradata Partners last week, and Teradata was unable to accommodate my request this week for a rapid follow-up briefing on the subject.</em></p>
<p style="margin-bottom: 0in; font-style: normal;">Aster&#8217;s approach to workload management is similarly stylish. The idea is:</p>
<ul>
<li>Lots of 	variables are available to be taken into account (e.g., user role, 	expected query duration, actual duration of a running query, etc.)</li>
<li>SQL statements 	can be written against any of these variables.</li>
<li>The SQL 	statements serve as rules to set query/task priorities.</li>
<li>There seem to 	be a few different ways to measure priority, including explicit 	allocation of CPU or I/O resources, as well as more conventional 	“This group of queries gets higher priority than that one” 	kinds of metrics.</li>
<li>The whole 	thing provides integrated workload management for queries, 	applications, load jobs, data redistribution, and so on.</li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;">Right now the interface is – well, you&#8217;re manipulating a SQL table. A more conventional workload management GUI is slated for the second quarter of 2010.</p>
<p style="margin-bottom: 0in; font-style: normal;">Discussing subjects such as mirroring and ILM (Information Lifecycle Management) with Aster can be tricky, as Aster uses the word “partition” in confusing ways. Anyhow, Aster has a few different levels of compression, and the ability to apply different levels of compression to different partitions, to change compression levels via ALTER TABLE, and to alter (presumably increase) compression on the fly when doing online backup. Aster is also part of a growing trend to eschew RAID, instead doing mirroring in its own software.  (Other examples of this strategy would be <span><a href="http://www.dbms2.com/2009/10/06/oracle-and-vertica-on-compression-and-other-physical-data-layout-features/" >Vertica</a>, <a href="http://www.dbms2.com/2008/09/28/oracle-database-machine-performance-and-compression/" >Oracle Exadata/ASM</a>, and <a href="http://www.dbms2.com/2009/10/25/teradata-hardware-strategy-and-tactics/" >Teradata Fallback</a>.) </span><span>Prior to nCluster 4.0, this caused a problem, in that the block sizes for mirroring were so large as to create a lag in transactional updating. But Aster says this problem is now solved, and indeed claims that nCluster 4.0 is superior to most rivals in transactional efficiency.</span></p>
<p style="margin-bottom: 0in;">And finally, while I was talking w/ Aster Data anyway, I checked up on cloud and MapReduce customer penetration. The answers were:</p>
<ul>
<li>Aster has two serious production 	cloud users, both of which have been disclosed for a while, namely:
<ul>
<li>ShareThis, which runs Aster 		nCluster on Amazon EC2</li>
<li>Didit, which runs Aster nCluster 		on AppNexus</li>
</ul>
</li>
<li>Outside of those two, Aster sees 	some cloud use for test, development, prototyping, etc.</li>
<li>Every single Aster customer uses 	<a href="../2009/10/15/mapreduce-webinar-slides/">SQL/MapReduce</a> &#8212; i.e., they invoke MapReduce via Aster nCluster SQL queries.</li>
<li>Some of those customers use MapReduce for ETL, some use it 	for actual analytics.</li>
</ul>
<img src="http://feeds.feedburner.com/~r/dbms2/feed/~4/nKpnfh1o6aM" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/30/aster-data-application-server-ncluster/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2009/10/30/aster-data-application-server-ncluster/</feedburner:origLink></item>
		<item>
		<title>A question on MDX performance</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/4KlJNPwLeDU/</link>
		<comments>http://www.dbms2.com/2009/10/30/a-question-on-mdx-performance/#comments</comments>
		<pubDate>Fri, 30 Oct 2009 05:11:20 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Benchmarks and POCs]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[MOLAP]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1186</guid>
		<description><![CDATA[An enterprise user wrote in with a question that boils down to:
What are reasonable MDX performance expectations?
MDX doesn&#8217;t come up in my life very much, and I don&#8217;t have much intuition about it. E.g., I don&#8217;t know whether one can slap an MDX-to-SQL converter on top of a fast analytic RDBMS and go to town. [...]]]></description>
			<content:encoded><![CDATA[<p>An enterprise user wrote in with a question that boils down to:</p>
<p><strong>What are reasonable MDX performance expectations?</strong></p>
<p>MDX doesn&#8217;t come up in my life very much, and I don&#8217;t have much intuition about it. E.g., I don&#8217;t know whether one can slap an MDX-to-SQL converter on top of a fast analytic RDBMS and go to town. What&#8217;s more, I&#8217;m heading off on vacation and don&#8217;t feel like researching the matter myself in the immediate future. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>So here&#8217;s the long form of the question. Any thoughts?</p>
<p style="padding-left: 30px;">I have a general question on assessing  the performance of an OLAP technology using a set of MDX queries. I would be  interested to know if there are any benchmark MDX performance tests/results  comparing different OLAP technologies (which may be based on different  underlying DBMS&#8217;s if appropriate) on similar hardware setup, or even comparisons  of complete appliance solutions. More generally, I want to determine what  performance limits I could reasonably expect on what I think are fairly standard servers.</p>
<p style="padding-left: 30px;">In my own work, I have set up a star  schema model centered on a Fact table of 100 million rows (approx 60 columns), with dimensions ranging in cardinality from 5 to 10,000. In ad hoc analytics, is  it expected that any query against such a dataset should return a result within  a minute or two (i.e. before a user gets impatient), regardless of whether that query returns 100 cells or 50,000 cells (without relying on any aggregate table  or caching mechanism)? Or is that level of performance only expected with a high  end massively parallel software/hardware solution? The server specs I&#8217;m testing  with are: 32-bit 4 core, 4GB RAM, 7.2k RPM SATA drive, running Windows Server 2003; 64-bit 8 core, 32GB RAM, 3 Gb/s  SAS drive, running Windows Server 2003 (x64).</p>
<p style="padding-left: 30px;">I realise that caching of query results  and pre-aggregation mechanisms can significantly improve performance, but I&#8217;m  coming from the viewpoint that in purely exploratory analytics, it is not  possible to have all combinations of dimensions calculated in advance, in  addition to being maintained.</p>
<img src="http://feeds.feedburner.com/~r/dbms2/feed/~4/4KlJNPwLeDU" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/30/a-question-on-mdx-performance/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2009/10/30/a-question-on-mdx-performance/</feedburner:origLink></item>
		<item>
		<title>Teradata’s nebulous cloud strategy</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/HWUcvGYleHM/</link>
		<comments>http://www.dbms2.com/2009/10/27/teradatas-nebulous-cloud-strategy/#comments</comments>
		<pubDate>Tue, 27 Oct 2009 19:41:47 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1180</guid>
		<description><![CDATA[As the pun goes, Teradata&#8217;s cloud strategy is – well, it&#8217;s somewhat nebulous. More precisely, for the foreseeable future, Teradata&#8217;s cloud strategy is a collection of rather disjointed parts, including:

What Teradata calls the Teradata 	 Agile Analytics Cloud, which is a combination of previously 	existing technology plus one new portlet called the Teradata 	Elastic Mart(s) [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">As the pun goes, Teradata&#8217;s cloud strategy is – well, it&#8217;s somewhat nebulous. More precisely, for the foreseeable future, Teradata&#8217;s cloud strategy is a collection of rather disjointed parts, including:</p>
<ul>
<li>What Teradata calls the <em>Teradata 	 Agile Analytics Cloud, </em>which is a combination of previously 	existing technology plus one new portlet called the <em>Teradata 	Elastic Mart(s) Builder.</em> (Teradata&#8217;s <em>Elastic Mart(s) Builder 	Viewpoint</em><span style="font-style: normal;"> portlet is avail</span>able 	for <span style="font-style: normal;">download from <a href="../2009/05/26/teradata-developer-exchange-devx-begins-to-emerge/">Teradata&#8217;s 	Developer Exchange</a>.)</span></li>
<li><em>Teradata Data Mover 2.0,</em> coming “Soon”, which will ease copying (ETL without any 	significant “T”) from one Teradata system to another.</li>
<li><em>Teradata Express</em> DBMS 	crippleware (1 terabyte only, no production use), now available on 	Amazon EC2 and VMware. (I don&#8217;t see where this has much connection to the rest of Teradata&#8217;s cloud strategy, except insofar as it serves to fill out a slide.)</li>
<li>Unannounced (and so far as I can 	tell largely undesigned) future products.</li>
</ul>
<p style="margin-bottom: 0in;">Teradata openly admits that its direction is heavily influenced by Oliver Ratzesberger at <a href="../2009/04/30/ebays-two-enormous-data-warehouses/">eBay</a>. Like Teradata, Oliver and eBay favor virtual data marts over physical ones. That is, Oliver and eBay believe that the ideal scenario is that every piece of data is only stored once, in an integrated Teradata warehouse. But eBay believes and Teradata increasingly agrees that users need a great deal of control over their use of this data, including the ability to import additional data into private sandboxes, and join it to the warehouse data already there.<span id="more-1180"></span></p>
<p style="margin-bottom: 0in;">The <em>Teradata Elastic Mart(s) Builder Viewpoint</em> portlet automates the inclusion of outside data. If you&#8217;re already an authorized Teradata data warehouse user, you can fill in a very short form (three or so fields) and add authorization to import outside data, e.g. from a .CSV file. No fuss, little bother. Trivial as that sounds, when you combine it with Teradata&#8217;s pre-existing robust workload management tools, it creates a pretty good <em>virtual data mart</em> story.</p>
<p style="margin-bottom: 0in;">Spinning out and maintaining consistency with physical data marts is a different matter. Teradata doesn&#8217;t seem too sure it believes in those. And while Teradata is obviously planning to increase its capability in that regard anyway, I didn&#8217;t get a lot of detail beyond the reference to Data Mover 2.0.</p>
<p style="margin-bottom: 0in;"><em><strong>Related links</strong></em></p>
<ul>
<li>My Greenplum-inspired post on <a href="../2009/06/08/the-future-of-data-marts/">the 	future of data marts</a>, outlining issues in “private cloud” 	data warehousing.</li>
<li>eBay&#8217;s “<a href="http://www.xlmpp.com/articles/16-articles/39-analytics-as-a-service" onclick="javascript:pageTracker._trackPageview('/www.xlmpp.com');">Analytics 	as a Service</a>” pitch (about 1 ½ years old)</li>
<li><a href="http://developer.teradata.com/database/articles/what-is-the-teradata-agile-analytics-cloud" onclick="javascript:pageTracker._trackPageview('/developer.teradata.com');">A 	post by Teradata&#8217;s Dan Graham</a> explaining the <em>Teradata Agile 	Analytics Cloud</em><span style="font-style: normal;"> and </span><em>Elastic 	Mart(s) Builder Viewpoint</em> portlet</li>
<li>Home page and complete screen shot 	for the <a href="http://developer.teradata.com/download/viewpoint/elastic-marts-builder" onclick="javascript:pageTracker._trackPageview('/developer.teradata.com');"><em>Teradata 	Elastic Mart(s) Builder Viewpoint</em> portlet</a></li>
</ul>
<img src="http://feeds.feedburner.com/~r/dbms2/feed/~4/HWUcvGYleHM" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/27/teradatas-nebulous-cloud-strategy/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2009/10/27/teradatas-nebulous-cloud-strategy/</feedburner:origLink></item>
		<item>
		<title>Teradata hardware strategy and tactics</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/FwhPTXGSTVg/</link>
		<comments>http://www.dbms2.com/2009/10/25/teradata-hardware-strategy-and-tactics/#comments</comments>
		<pubDate>Sun, 25 Oct 2009 04:12:09 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Storage]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1171</guid>
		<description><![CDATA[In my opinion, the most important takeaways about Teradata&#8217;s hardware strategy from the Teradata Partners conference last week are:

Teradata&#8217;s future lies in 	solid-state memory. That&#8217;s in 	line with what Carson 	Schmidt told me six months ago.
To Teradata&#8217;s surprise, the 	solid-state future is imminent. Teradata is 6-9 months further along with solid-state drives (SSD) 	than it [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">In my opinion, the most important takeaways about Teradata&#8217;s hardware strategy from <a href="http://www.dbms2.com/2009/10/19/teradata-partners-2009/" >the Teradata Partners conference</a> last week are:</p>
<ul>
<li><strong>Teradata&#8217;s future lies in 	solid-state memory.</strong><span> That&#8217;s in 	line with what <a href="../2009/04/28/data-warehouse-storage-options-cheap-expensive-or-solid-state-disk-drives/">Carson 	Schmidt</a> told me six months ago.</span></li>
<li><strong>To Teradata&#8217;s surprise, the 	solid-state future is imminent.</strong><span> Teradata is 6-9 months further along with solid-state drives (SSD) 	than it thought a year ago it would be at this point.</span></li>
<li><strong>Short-term, Teradata is going 	to increase the number of appliance kinds it sells. </strong><span>I 	didn&#8217;t actually get details on anything but the new SSD-based Blurr, 	but it seems there will be others as well.</span></li>
<li><strong>Teradata&#8217;s eventual future is 	to mix and match parts (especially different kinds of storage) in a 	more modular product line.</strong><span style="font-style: normal;"><span> <a href="../2008/10/14/teradata-virtual-storage/">Teradata 	Virtual Storage</a> is of </span></span><span>pretty 	limited value otherwise. I probably believe Teradata will go modular 	more emphatically than Teradata itself does, because I think <a href="http://www.dbms2.com/2009/10/25/data-warehouse-balanced-hardware-configuration/" >doing so will meet users&#8217; needs more effectively</a> than if Teradata relies strictly on fixed appliance configurations.<br />
</span></li>
</ul>
<p style="margin-bottom: 0in;">In addition, some non-SSD componentry tidbits from Carson Schmidt include:</p>
<ul>
<li>Teradata really likes Intel&#8217;s 	Nehalem CPUs, with special reference to multi-threading, QuickPath 	interconnect, and integrated memory controller. Obviously, 	Nehalem-based Teradata boxes should be expected in the not too 	distant future.</li>
<li>Teradata really likes Nehalem&#8217;s 	successor Westmere too, and expects to be pretty fast to market with 	it (faster than with Nehalem) because Nehalem and Westmere are 	plug-compatible in motherboards.</li>
<li>Teradata will go to 10-gigabit 	Ethernet for external connectivity on all its equipment, which 	should improve load performance.</li>
<li>Teradata will also go to 	10-gigabit Ethernet to play the Bynet role on appliances. Tests are 	indicating this improves query performance.</li>
<li>What&#8217;s more, Teradata believes 	there will be no practical scale-out limitations with 10-gigabit 	Ethernet.</li>
<li>Teradata hasn&#8217;t decided yet what 	to do about 2.5” SFF (Small Form Factor) disk drives, but is 	leaning favorably. Benefits would include lower power consumption 	and smaller cabinets.</li>
<li>Also on Carson&#8217;s list of 	“exciting” future technologies is SAS 2.0, which at 6 	gigabits/second doubles the I/O bandwidth of SAS 1.0.</li>
<li>Carson is even excited about 	removing uninterruptible power supplies from the cabinets, increasing 	space for other components.</li>
<li>Teradata picked Intel&#8217;s Host Bus 	Adapters for 10-gigabit Ethernet. The switch supplier hasn&#8217;t been 	determined yet.</li>
</ul>
<p style="margin-bottom: 0in;">Let&#8217;s get back now to SSDs, because over the next few years they&#8217;re the potential game-changer. <span id="more-1171"></span>The big news on SSDs is that after last year&#8217;s Teradata Partners conference, a stealth supplier* introduced itself and convinced Teradata it offers really great SSD technology. For example, not a single SSD it has provided Teradata has ever failed. (In hardware, that is. There have of course been firmware bugs, suitably squashed.) I think SSD performance is also exceeding Teradata&#8217;s expectations. This supplier is where the 6-9 month time-to-market gain comes from.</p>
<p style="margin-bottom: 0in;"><em>*Based on how often the concept of “stealth” and “name is NDAed” came up, I do not believe this is the SSD company another vendor told me about that is going around claiming it has a Teradata relationship.</em></p>
<p style="margin-bottom: 0in;">Teradata SSD highlights include:</p>
<ul>
<li>I/O speeds on “random medium 	blocks” are 520 megabytes/second, vs. 15 MB/second on their 	fastest disks. And that&#8217;s limited by SAS 1.0, load-balanced across 	two devices, not the hardware itself. (2 x 300+ MB/sec turns out to 	be 520 MB/sec in this case.) No wonder Carson is excited about SAS 	2.0.</li>
<li>Teradata is using SAS interfaces 	for its SSDs, and believes that&#8217;s unusual, in that other companies 	are using SATA or Fibre Channel.</li>
<li>Never having had a part fail, 	Teradata has no real basis to make MTTF (Mean Time To Failure) 	estimates for its SSDs.</li>
<li>Teradata&#8217;s SSD appliance design 	includes no array controllers. The biggest reason is that right now 	array controllers can&#8217;t keep up with the SSDs&#8217; speed.</li>
<li>In its SSD appliance, Teradata has 	abandoned RAID, doing mirroring instead via a DBMS feature called 	Fallback that&#8217;s been around since Teradata&#8217;s earliest days. 	(However, <a href="../2008/09/28/oracle-database-machine-performance-and-compression/">unlike 	Oracle in Exadata</a>, Teradata continues to use RAID for disks.)</li>
<li>Useful life for Teradata&#8217;s SSDs is 	estimated at 5-7 years.</li>
<li>Teradata&#8217;s SSDs are SLC 	(Single-Level Cell), as opposed to MLC (Multi-Level Cell).</li>
</ul>
<img src="http://feeds.feedburner.com/~r/dbms2/feed/~4/FwhPTXGSTVg" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/25/teradata-hardware-strategy-and-tactics/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2009/10/25/teradata-hardware-strategy-and-tactics/</feedburner:origLink></item>
		<item>
		<title>Reports of perfectly-balanced hardware configurations are greatly exaggerated</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/r5R3rk99FrM/</link>
		<comments>http://www.dbms2.com/2009/10/25/data-warehouse-balanced-hardware-configuration/#comments</comments>
		<pubDate>Sun, 25 Oct 2009 04:00:27 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Kickfire]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Teradata]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1165</guid>
		<description><![CDATA[Data warehouse appliance and software appliance vendors like to claim that they&#8217;ve worked out just the right hardware configuration(s), and that a single configuration is correct for a fairly broad range of workloads. But there are a lot of reasons to be dubious about that. Specific vendor evidence includes:

Teradata ascribes 	considerable importance to a Virtual [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Data warehouse appliance and software appliance vendors like to claim that they&#8217;ve worked out just the right hardware configuration(s), and that a single configuration is correct for a fairly broad range of workloads. But there are a lot of reasons to be dubious about that. Specific vendor evidence includes:</p>
<ul>
<li><strong>Teradata</strong> ascribes 	considerable importance to a <a href="../2008/10/14/teradata-virtual-storage/">Virtual 	Storage</a> technology whose main purpose is to allow mixing of 	heterogeneous storage devices in a single system. And the discussion 	rarely suggests that these parts will be in a rigid fixed 	relationship.</li>
<li><strong>Netezza</strong> &#8212; as Teradata 	keeps reminding me &#8212; often sells boxes with the expectation that 	they won&#8217;t be filled with data, so as to increase spindle count and hence performance.</li>
<li><strong>Oracle/Sun</strong> have dropped 	some comments about Exadata being more flexibly configured going 	forward.</li>
<li><strong>Kickfire&#8217;s</strong> <a href="../2009/10/18/kickfire-capacity-and-pricing/">new 	“high-end” appliance</a> lets you attach fairly arbitrary 	amounts of external storage.</li>
<li>And of course, <strong>software-only 	analytic DBMS vendors</strong> run their software in all sorts of 	hardware and storage environments.</li>
</ul>
<p style="margin-bottom: 0in;">What&#8217;s more, the claim never made a lot of sense anyway. With the rarest of exceptions, even a single data warehouse&#8217;s workload will contain different queries that strain different parts of the system in different ratios. Calculating the “ideal” hardware configuration for that single workload would be forbiddingly difficult. And even if one could calculate it, it almost surely would be different than another user&#8217;s “ideal” configuration. How a single hardware configuration can be “ideally balanced” for a broad class of use cases boggles the imagination.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/25/data-warehouse-balanced-hardware-configuration/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2009/10/25/data-warehouse-balanced-hardware-configuration/</feedburner:origLink></item>
		<item>
		<title>Greenplum Single-Node Edition — sometimes free is a real cool price</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/eEGULxcTAc0/</link>
		<comments>http://www.dbms2.com/2009/10/19/greenplum-free-single-node-edition/#comments</comments>
		<pubDate>Mon, 19 Oct 2009 13:25:41 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EnterpriseDB and Postgres Plus]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Infobright]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[Scientific research]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1158</guid>
		<description><![CDATA[Greenplum is announcing today that you can run Greenplum software on a single 8-core commodity server, free.  First and foremost, that&#8217;s a strong statement that Greenplum wants enterprises to pay it for Greenplum&#8217;s parallelization/“private cloud” capabilities. Second, it may be an attractive gift to a variety of folks who want to extract insight from [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Greenplum is announcing today that you can run Greenplum software on a single 8-core commodity server, free.  First and foremost, that&#8217;s a strong statement that Greenplum wants enterprises to pay it for Greenplum&#8217;s parallelization/“<a href="../2009/06/08/the-future-of-data-marts/">private cloud</a>” capabilities. Second, it may be an attractive gift to a variety of folks who want to extract insight from terabyte-scale databases of various kinds.</p>
<p style="margin-bottom: 0in;">Greenplum Single-Node Edition:</p>
<ul>
<li>Is free of charge, although you 	can buy support.</li>
<li>Has no restrictions on use, 	production or otherwise.</li>
<li>Has no restrictions on database 	size.</li>
<li>Is closed-source.</li>
</ul>
<p style="margin-bottom: 0in;">For those who want free, terabyte-scale data warehousing software, Greenplum Single-Node Edition may be quite appealing, considering that the main available alternatives are:</p>
<ul>
<li>General-purpose open-source DBMS, 	such as PostgreSQL and MySQL (lacking analytic DBMS performance and 	features)</li>
<li>Infobright Community Edition (the 	other best choice – <a href="../2009/10/14/infobright-notes/">Infobright&#8217;s 	commercial sales success</a> indicates the solidity of Infobright&#8217;s 	technology)</li>
<li>Rough research-project code and other questionable open source offerings</li>
<li>Crippleware from other commercial 	analytic DBMS vendors (e.g., <a href="../2009/10/19/teradata-partners-2009/">Teradata</a>)</li>
</ul>
<p style="margin-bottom: 0in;">For example, comparing PostgreSQL-based Greenplum with PostgreSQL itself, Greenplum offers:</p>
<ul>
<li>The ability to scale out queries 	across all cores in your box (and no, pgpool is not a serious 	alternative)</li>
<li>Storage alternatives such as 	columnar (I am told that EnterpriseDB recently stopped funding a 	project for a PostgreSQL columnar option)</li>
</ul>
<p style="margin-bottom: 0in;"><span id="more-1158"></span>Greenplum would surely also argue that its software is superior to PostgreSQL in parallel load, compression, MapReduce integration, and general fit-and-finish. I imagine that in some (perhaps not all) cases it would be right. PostgreSQL&#8217;s main technical advantages over Greenplum would probably lie in the area of datatype extensibility.</p>
<p style="margin-bottom: 0in;">The main target users for Greenplum&#8217;s Single-Node Edition are obviously <strong>individual enterprise power users or very small analytic teams.</strong> I.e., it&#8217;s people with a data mart need that a central data warehouse isn&#8217;t meeting. Potential benefits to Greenplum include:</p>
<ul>
<li>Adding value to its <a href="../2009/06/08/the-future-of-data-marts/">Enterprise 	Data Cloud</a> story</li>
<li>Seeding the market for future 	enterprise sales</li>
<li>Depriving competitors of revenue, 	perhaps at enterprises too small to ever be paying Greenplum 	customers</li>
</ul>
<p style="margin-bottom: 0in;">In addition, I see free Greenplum as a charity offering that could be appealing to <a href="http://" onclick="javascript:pageTracker._trackPageview('/');">scientists</a> who face PostgreSQL performance limitations.</p>
<p style="margin-bottom: 0in;"><em><strong>Related links</strong></em></p>
<ul>
<li><a href="http://www.greenplum.com/news/252/388/Greenplum-Introduces-Free-Greenplum-Database-Edition-for-Data-Analysts/d,press-releases/" onclick="javascript:pageTracker._trackPageview('/www.greenplum.com');">Greenplum 	Free Single-Node Edition press release</a> (I&#8217;m quoted)</li>
<li><a href="http://www.mysqlperformanceblog.com/2009/10/02/analyzing-air-traffic-performance-with-infobright-and-monetdb/" onclick="javascript:pageTracker._trackPageview('/www.mysqlperformanceblog.com');">MySQL 	Performance blog on MonetDB and Infobright community edition</a></li>
<li><a href="http://archives.postgresql.org/pgsql-general/2009-03/msg01227.php" onclick="javascript:pageTracker._trackPageview('/archives.postgresql.org');">PostgreSQL&#8217;s 	restriction to one core per query</a></li>
<li><a href="http://www.infobright.org/Forums/viewthread/1141/" onclick="javascript:pageTracker._trackPageview('/www.infobright.org');">Infobright&#8217;s 	restriction to one core per query</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/19/greenplum-free-single-node-edition/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2009/10/19/greenplum-free-single-node-edition/</feedburner:origLink></item>
		<item>
		<title>This week at the Teradata Partners user conference</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/uC192KO2-JM/</link>
		<comments>http://www.dbms2.com/2009/10/19/teradata-partners-2009/#comments</comments>
		<pubDate>Mon, 19 Oct 2009 13:07:31 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[Data types]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Storage]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1150</guid>
		<description><![CDATA[Teradata tells me that its press embargoes are ending at 9:00 this morning. Here are some highlights of what&#8217;s going on, although names, dates, and details will have to await conversations and press releases this week.

Teradata is productizing 	“private cloud,” under names including “Teradata 	Enterprise Analytics Cloud,” “Teradata Agile Analytics Cloud,” 	and “Teradata Elastic Mart [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Teradata tells me that its press embargoes are ending at 9:00 this morning. Here are some highlights of what&#8217;s going on, although names, dates, and details will have to await conversations and press releases this week.</p>
<ul>
<li><strong>Teradata is productizing 	“private cloud,”</strong> under names including “Teradata 	Enterprise Analytics Cloud,” “Teradata Agile Analytics Cloud,” 	and “Teradata Elastic Mart Builder.” I.e., Teradata hopes to 	leapfrog Greenplum in its “<a href="../2009/06/08/the-future-of-data-marts/">Enterprise 	Data Cloud</a>” strategy. This is only fair, in that Greenplum 	lifted the idea from Teradata and eBay in the first place. It also 	provides major support for what I think is an extremely sensible 	trend. Give or take issues of who announces and ships what a couple 	months before or after a competitor, my early thinking is that the 	main differences between Greenplum and Teradata in this regard will 	be:
<ul>
<li>Virtual as opposed to just 	physical data marts, based on robust workload management software. 	(Advantage: Teradata)</li>
<li>Pricing, deployment options. 	(Advantage: Greenplum)</li>
<li>Features that don&#8217;t directly 	relate to enterprise/private cloud. (Advantage: Either, often 	Teradata.)</li>
</ul>
</li>
<li><strong>Teradata is generally 	strengthening its data movement technology</strong>, e.g. for making 	various appliances work in sync. I&#8217;m not too clear yet on the 	details of that. I think this is what Teradata&#8217;s phrase “ecosystem 	management” refers to.</li>
<li><strong>Teradata is (pre-)announcing – 	at least as a statement of direction &#8212; an appliance based on 	solid-state drives (SSDs). </strong>I&#8217;ve thought for a while that 	Teradata was a leader in thinking through <a href="../2008/10/23/teradata-solid-state-drives-ssd/">the 	issues around solid-state memory in data warehousing</a>, so it 	makes sense that they&#8217;re among the leaders in actually coming to 	market as well. I plan to say more after meeting with, e.g., Carson 	Schmidt.</li>
<li><strong>Teradata has achieved a 300%ish 	speed-up in geospatial processing</strong>. I gather this is largely a 	byproduct of the parallel analytics work Teradata did around 	strengthening its SAS integration. However, there don&#8217;t seem to be a 	lot of Teradata geospatial users yet.</li>
<li><span>Teradata 	Express, </span><strong>Teradata&#8217;s free Windows-based crippleware, is being 	ported to Amazon EC2 and VMware</strong> as well. Presumably to avoid 	cannibalizing Teradata product sales, there are quite a few 	limitations on Teradata Express, including system capacity, database 	size, and “no production use.”</li>
<li><strong>Teradata continues to extend its optimizations to handle queries issued by business intelligence tools. </strong><span>Previously, the focus of what Teradata discussed in this regard was <a href="../2009/08/02/teradata-13-focuses-on-advanced-analytic-performance/">query rewrite</a>. But soon automatic recommendation and creation of Aggregate Join Indexes – i.e., materialized views – will be included as well.</span></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/19/teradata-partners-2009/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2009/10/19/teradata-partners-2009/</feedburner:origLink></item>
		<item>
		<title>Greenplum customer notes</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/Uns61kMxXHg/</link>
		<comments>http://www.dbms2.com/2009/10/18/greenplum-customer-notes/#comments</comments>
		<pubDate>Sun, 18 Oct 2009 16:44:18 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[Pricing]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1107</guid>
		<description><![CDATA[In a briefing about a forthcoming product announcement, Greenplum threw in a slide saying:

Greenplum is getting 12-15 new (paying) customers per quarter, all of whom it fondly refers to as &#8220;Tier 1&#8221; enterprises.
Greenplum will hit the 100+ customer mark this quarter (thus joining Vertica and Infobright).
&#60;10% of Greenplum business is now &#8220;influenced&#8221; by Sun hardware.

I [...]]]></description>
			<content:encoded><![CDATA[<p>In a briefing about a forthcoming product announcement, Greenplum threw in a slide saying:</p>
<ul>
<li>Greenplum is getting 12-15 new (paying) customers per quarter, all of whom it fondly refers to as &#8220;Tier 1&#8221; enterprises.</li>
<li>Greenplum will hit the 100+ customer mark this quarter (thus joining <a href="http://www.vertica.com/company/news/Vertica-fastest-growing-data-warehouse-vendor-100th-customer" onclick="javascript:pageTracker._trackPageview('/www.vertica.com');">Vertica</a> and <a href="http://www.dbms2.com/2009/10/14/infobright-notes/" >Infobright</a>).</li>
<li>&lt;10% of Greenplum business is now &#8220;influenced&#8221; by Sun hardware.</li>
</ul>
<p>I asked Ben Werther to unpack that last claim for me. He quickly noted that it wasn&#8217;t his slide, but rather had been put together by colleagues. That said:</p>
<ul>
<li>As of the past quarter or two, &lt;10% of Greenplum&#8217;s sales activity is on Sun, which works out to maybe one sale per quarter and at most a small number of sales cycles. (That&#8217;s down from <a href="http://www.dbms2.com/2008/08/25/greenplum-is-in-the-big-leagues/" >50%+ not that long ago</a>.)</li>
<li>Most Greenplum business is now on HP or Dell equipment.  Some is on IBM. There are some interesting sales cycles on Cisco&#8217;s new UCS (Unified Computing System) blades, but no closed deals yet. EMC seems to be part of the Cisco story.</li>
</ul>
<p>No doubt part of the reason for the move away from Sun equipment is the impending Oracle acquisition. Another may be that the Greenplum/Sun appliance is somewhat underpowered. E.g., without particularly high levels of compression, <a href="http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/" >eBay</a> puts over 60 terabytes of data on each Greenplum node, which probably isn&#8217;t ideal from the standpoint of query performance.</p>
<p>Greenplum also says that 50% or so of sales are <a href="http://www.dbms2.com/2009/06/05/greenplum-update-release-3-3/" >subscription-priced</a>, rather than perpetual-licensed. I don&#8217;t have a sense for how long that&#8217;s been going on. <em>(Edit: Ben Werther tells me this has been true for over a year.)</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/18/greenplum-customer-notes/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2009/10/18/greenplum-customer-notes/</feedburner:origLink></item>
		<item>
		<title>Three big myths about MapReduce</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/94q1ncQO1mk/</link>
		<comments>http://www.dbms2.com/2009/10/18/three-big-myths-about-mapreduce/#comments</comments>
		<pubDate>Sun, 18 Oct 2009 16:14:37 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Michael Stonebraker]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1135</guid>
		<description><![CDATA[Once again, I find myself writing and talking a lot about MapReduce.  But I suspect that MapReduce-related conversations would go better if we overcame three fairly common MapReduce myths:

MapReduce is something very new
MapReduce involves strict 	adherence to the Map-Reduce programming paradigm
MapReduce is a single technology

So let&#8217;s give it a try.
When Dave DeWitt and Mike [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Once again, I find myself writing and talking a lot about MapReduce.  But I suspect that MapReduce-related conversations would go better if we overcame three fairly common MapReduce myths:</p>
<ul>
<li>MapReduce is something very new</li>
<li>MapReduce involves strict 	adherence to the Map-Reduce programming paradigm</li>
<li>MapReduce is a single technology</li>
</ul>
<p style="margin-bottom: 0in;"><span id="more-1135"></span>So let&#8217;s give it a try.</p>
<p style="margin-bottom: 0in;">When Dave DeWitt and Mike Stone<span style="font-style: normal;">braker leveled <a href="../2008/01/18/the-great-mapreduce-debate/">their famous blast at MapReduce</a>, many people thought they overstated their case. But one part of their story – one that both Mike and Dave say was most central to their case – was never effectively refuted, n</span>amely the claim that these ideas aren&#8217;t particularly new. I haven&#8217;t actually read enough computer science literature to have an independent opi<span style="font-style: normal;">nion on that issue. But I&#8217;ll say this – claims from companies such as <a href="../2009/10/18/introduction-to-sensage/">SenSage</a>, <a href="../2009/10/06/oracle-mapreduce/">Oracle</a>, or <a href="../2009/10/18/technical-introduction-to-splunk/">Splunk</a> that “We&#8217;ve be</span>en doing MapReduce all along” seem pretty credible to me.</p>
<p style="margin-bottom: 0in;">True, what those companies were doing may not have looked exactly like the instant-classic MapReduce programming paradigm. But the same is true of many things almost everybody would agree count as MapReduce.  In particular, it is often not the case that you alternate Map and Reduce steps, each of whose outputs is a set of simple &lt;Key, Value&gt; pairs, with data redistributed based on Key at every step.</p>
<p style="margin-bottom: 0in;">Here are some examples of what I mean, drawn from <a href="http://www.asterdata.com/blog/index.php/2009/10/15/mastering-mapreduce/" onclick="javascript:pageTracker._trackPageview('/www.asterdata.com');">my recent MapReduce webinar</a>.</p>
<ul>
<li>If you do text indexing in 	MapReduce, your goal is to wind up with a text index. So at some 	point you Reduce to a pair &lt;WordName, {all the (DocumentID, 	offset) pairs for the whole corpus, suitably ordered}&gt;.  That&#8217;s a 	heckuva compound “Value”.</li>
<li>The goal of data mining is usually to estimate a rather small number of parameters based on a large overall data set, often – depending on algorithm – in the form of a single vector. When you do that in MapReduce, you partition data among nodes and calculate something on each node that is structured more or less like your final vector. So when it comes time for the Reduce, you just ship all of your vectors – one per node – to a single Reduce node, and do the appropriate math. Redistribution based on Key would be quite pointless.</li>
<li>When you sessionize clickstream 	logs in MapReduce, you may have just as many output records as input 	records. However, they now are reformatted, and might have a 	SessionID appended. In those cases, Reduce isn&#8217;t doing much by the 	way of reduction.</li>
<li>And as happens in some <a href="../2009/08/04/verticas-version-of-mapreduce-integration/">Vertica-Hadoop</a> use cases around mortgage trading, sometimes MapReduce can even make data s<span style="font-style: normal;">ets vastly larger.</span></li>
</ul>
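<p style="margin-bottom: 0in;">To make the text-indexing example concrete, here is a minimal single-process sketch in Python. It is only an illustration, not how any particular MapReduce implementation works: the function names (map_document, shuffle, reduce_postings, build_index), the two-document corpus, and the use of word position as the offset are all assumptions made for the example. What it shows is that the Reduce output for each word is one pair whose Value is an entire sorted posting list.</p>

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step (hypothetical helper): emit (word, (doc_id, offset)) per word.

    The offset is assumed to be the word's position within the document.
    """
    for offset, word in enumerate(text.lower().split()):
        yield word, (doc_id, offset)


def shuffle(pairs):
    """Group mapped pairs by key, standing in for redistribution based on Key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_postings(word, postings):
    """Reduce step: one output pair per word; its Value is the whole posting list."""
    return word, sorted(postings)


def build_index(corpus):
    """Run map, shuffle, and reduce in a single process over a dict of documents."""
    mapped = (
        pair
        for doc_id, text in corpus.items()
        for pair in map_document(doc_id, text)
    )
    return dict(reduce_postings(w, p) for w, p in shuffle(mapped).items())


# Illustrative corpus only; document IDs and text are made up.
corpus = {1: "map then reduce", 2: "reduce the map output"}
index = build_index(corpus)
```

<p style="margin-bottom: 0in;">Each entry of index maps a word to every (DocumentID, offset) pair in the corpus, which is the compound Value described in the first bullet.</p>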
<p style="margin-bottom: 0in; font-style: normal;">By no means do I think this is a weakness of the MapReduce programming paradigm. Rather, I think it&#8217;s a MapReduce strength. But it&#8217;s not quite the way MapReduce has been promoted and explained to the IT public.</p>
<p style="margin-bottom: 0in; font-style: normal;">Finally: MapReduce, as commonly conceived, spans two different – albeit closely related – technology domains:</p>
<ul>
<li>Parallel 	programming</li>
<li>Distributed 	data management</li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;">For example, I imagine Greenplum&#8217;s and Vertica&#8217;s MapReduce/SQL combined syntaxes are very similar to each other. But Vertica&#8217;s data management implementation of MapReduce, which relies on Hadoop, is very different from Greenplum&#8217;s, which is tied into the Greenplum DBMS. Similarly, non-DBMS MapReduce implementations are commonly associated with distributed file systems – notably HDFS (Hadoop Distributed File System) or Google&#8217;s internal GFS (Google File System). In those systems, the parallel language execution part should be aware of how the distributed file management part works – but perhaps that awareness can be pretty lightweight.</p>
<p style="margin-bottom: 0in; font-style: normal;">Right now, this is a distinction pretty much without a difference. If you choose an implementation of MapReduce &#8212; like pure Hadoop (say in the Cloudera distribution) or Hadoop-Vertica or Aster Data&#8217;s SQL/MapReduce – you&#8217;re basically picking an entire technology stack. But those stacks are going to do a whole lot of changing and maturing in the near future – and as they do, it&#8217;s likely that projects will interact or even combine in all sorts of interesting ways.</p>
<p style="margin-bottom: 0in; font-style: normal;"><strong>Bottom line: There are a lot of different ways to exploit MapReduce-related technology.</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/18/three-big-myths-about-mapreduce/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		<feedburner:origLink>http://www.dbms2.com/2009/10/18/three-big-myths-about-mapreduce/</feedburner:origLink></item>
	</channel>
</rss>
