<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>DBMS2 -- DataBase Management System Services</title>
	
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<pubDate>Sat, 11 Jul 2009 21:34:43 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
	<language>en</language>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.feedburner.com/dbms2/feed" type="application/rss+xml" /><item>
		<title>Groovy Corp</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/-xp08cqRJsY/</link>
		<comments>http://www.dbms2.com/2009/07/11/groovy-corp/#comments</comments>
		<pubDate>Sat, 11 Jul 2009 21:34:43 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
		
		<category><![CDATA[DBMS product categories]]></category>

		<category><![CDATA[In-memory DBMS]]></category>

		<category><![CDATA[Memory-centric data management]]></category>

		<category><![CDATA[OLTP]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=840</guid>
		<description><![CDATA[Groovy Corp sent over a press release and apparently suggested I write about the company&#8217;s wonderfulness immediately. This was without any kind of briefing. I don&#8217;t do that kind of thing.
However, a Twitter check revealed that Tony Bain is familiar with Groovy Corp and the Groovy SQL Switch (apparently they started out in Australia, where [...]]]></description>
			<content:encoded><![CDATA[<p>Groovy Corp sent over a press release and apparently suggested I write about the company&#8217;s wonderfulness immediately. This was without any kind of briefing. I don&#8217;t do that kind of thing.</p>
<p>However, a Twitter check revealed that Tony Bain is familiar with Groovy Corp and the Groovy SQL Switch (apparently they started out in Australia, where he lives and works, and he evidently knows the guys).  <a href="http://blog.tonybain.com/tony_bain/2009/07/groovy-baby-yeah.html" onclick="javascript:pageTracker._trackPageview('/outbound/article/blog.tonybain.com');">Tony&#8217;s take</a>, in summary, is (emphasis mine):</p>
<ul>
<blockquote>
<li>They are an <strong>in memory RDBMS</strong></li>
<li>They have worked with Intel to architect from the ground up for large multi processor concurrency</li>
<li>Initially they are launching as a <strong>multi-core appliance</strong></li>
<li>They claim <strong>200,000 sql operations per second from a single box</strong></li>
<li>They are proprietary (not built on MySQL or any other open source database) which means they have had a lot of control around their architecture</li>
<li>They are a pretty cool company with some interesting people</li>
</blockquote>
</ul>
<p>There&#8217;s a little more detail at the above link.</p>
<img src="http://feeds.feedburner.com/~r/dbms2/feed/~4/-xp08cqRJsY" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/07/11/groovy-corp/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.dbms2.com/2009/07/11/groovy-corp/</feedburner:origLink></item>
		<item>
		<title>Oracle cites Exadata wins</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/NZaofpinYhg/</link>
		<comments>http://www.dbms2.com/2009/07/09/oracle-cites-exadata-wins/#comments</comments>
		<pubDate>Fri, 10 Jul 2009 02:34:28 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
		
		<category><![CDATA[Data warehouse appliances]]></category>

		<category><![CDATA[Data warehousing]]></category>

		<category><![CDATA[Exadata]]></category>

		<category><![CDATA[Netezza]]></category>

		<category><![CDATA[Oracle]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=839</guid>
		<description><![CDATA[A couple of weeks ago, Oracle put out a press release about Exadata wins.  Highlights include:

20 names of actual customers.
One quote citing a competitive win (over Netezza)
One quote citing a ~50X speedup of one query &#8220;without manual tuning&#8221;
One quote citing consistent 10-72X query performance speedups
One quote citing a speedup from &#8220;days&#8221; to &#8220;minutes&#8221;

Unless I missed [...]]]></description>
			<content:encoded><![CDATA[<p>A couple of weeks ago, Oracle put out a press release about <a href="http://www.oracle.com/us/corporate/press/020542" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.oracle.com');">Exadata wins</a>.  Highlights include:</p>
<ul>
<li>20 names of actual customers.</li>
<li>One quote citing a competitive win (over Netezza)</li>
<li>One quote citing a ~50X speedup of one query &#8220;without manual tuning&#8221;</li>
<li>One quote citing consistent 10-72X query performance speedups</li>
<li>One quote citing a speedup from &#8220;days&#8221; to &#8220;minutes&#8221;</li>
</ul>
<p>Unless I missed it, none of the quotes implied Exadata was actually in production, and none compared hardware between the old/slow/production and Exadata/fast/test systems.</p>
<img src="http://feeds.feedburner.com/~r/dbms2/feed/~4/NZaofpinYhg" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/07/09/oracle-cites-exadata-wins/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.dbms2.com/2009/07/09/oracle-cites-exadata-wins/</feedburner:origLink></item>
		<item>
		<title>While I’m venting about benchmarks</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/0Zd55i1eFU8/</link>
		<comments>http://www.dbms2.com/2009/07/08/while-im-venting-about-benchmarks/#comments</comments>
		<pubDate>Wed, 08 Jul 2009 23:58:48 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
		
		<category><![CDATA[Benchmarks and POCs]]></category>

		<category><![CDATA[Columnar database management]]></category>

		<category><![CDATA[Data integration and middleware]]></category>

		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>

		<category><![CDATA[Vertica Systems]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=837</guid>
		<description><![CDATA[Late last year, Vertica made hoo-hah about what it called a world-record data warehouse load speed benchmark.  I wrote at the time that this showed Vertica wasn&#8217;t painfully slow at loading, always a concern with column stores. But otherwise I mocked the idea that there was something useful to be learned from the whole exercise.
Well, [...]]]></description>
			<content:encoded><![CDATA[<p>Late last year, Vertica made hoo-hah about what it called a <a href="http://www.dbms2.com/2008/12/02/data-warehouse-load-speeds-in-the-spotlight/" >world-record data warehouse load speed benchmark</a>.  I wrote at the time that this showed Vertica wasn&#8217;t painfully slow at loading, always a concern with column stores. But otherwise I mocked the idea that there was something useful to be learned from the whole exercise.</p>
<p>Well, guess what?  In a throwaway line in a comment on <a href="http://dbmsmusings.blogspot.com/2009/07/paraccel-and-their-puzzling-tpc-h.html" onclick="javascript:pageTracker._trackPageview('/outbound/article/dbmsmusings.blogspot.com');">Daniel Abadi&#8217;s blog</a>, Barry Zane of ParAccel pointed out</p>
<blockquote><p>we posted a load rate of almost 9TB/hour, which is, of course record breaking on its own</p></blockquote>
<p>Quite right.</p>
<p>I hope the nonsense stops there, but I&#8217;m not optimistic &#8230;</p>
<img src="http://feeds.feedburner.com/~r/dbms2/feed/~4/0Zd55i1eFU8" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/07/08/while-im-venting-about-benchmarks/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.dbms2.com/2009/07/08/while-im-venting-about-benchmarks/</feedburner:origLink></item>
		<item>
		<title>Progress in figuring out what ParAccel is doing</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/9zfqLJqca-8/</link>
		<comments>http://www.dbms2.com/2009/07/08/progress-in-figuring-out-what-paraccel-is-doing/#comments</comments>
		<pubDate>Wed, 08 Jul 2009 23:46:05 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
		
		<category><![CDATA[Columnar database management]]></category>

		<category><![CDATA[Data warehousing]]></category>

		<category><![CDATA[ParAccel]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=835</guid>
		<description><![CDATA[(Oops: Thought I&#8217;d posted this before I went out for the afternoon &#8230;)
Barry Zane of ParAccel has &#8212; finally! &#8212; started a blog.  Barry&#8217;s first post, probably in connection with ParAccel&#8217;s recent TPC-H submission and subsequent brouhaha, consisted mainly of metaphor + very elementary and well-known arguments for column stores. Barry&#8217;s second post, however, was [...]]]></description>
			<content:encoded><![CDATA[<p><em>(Oops: Thought I&#8217;d posted this before I went out for the afternoon &#8230;)</em></p>
<p>Barry Zane of ParAccel has &#8212; finally! &#8212; started a blog.  Barry&#8217;s <a href="http://paraccel.com/data_warehouse_blog/?p=34" onclick="javascript:pageTracker._trackPageview('/outbound/article/paraccel.com');">first post</a>, probably in connection with ParAccel&#8217;s recent TPC-H submission and subsequent brouhaha, consisted mainly of metaphor + very elementary and well-known arguments for column stores. Barry&#8217;s <a href="http://paraccel.com/data_warehouse_blog/?p=57" onclick="javascript:pageTracker._trackPageview('/outbound/article/paraccel.com');">second post</a>, however, was in direct response to <a href="http://www.dbms2.com/2009/07/07/daniel-abadi-has-a-theory-about-paraccel/" >Daniel Abadi&#8217;s speculation about ParAccel&#8217;s architecture</a>.  That post also promises a follow-up addressing the TPC-H in a more substantive way.</p>
<p>Barry&#8217;s points include:</p>
<ul>
<li><strong>ParAccel never used the row-oriented Postgres execution engine.</strong> This is contrary to Daniel&#8217;s speculation.</li>
<li><strong>ParAccel previously used an adaptation of the Postgres cost-based optimizer, but now has written a new one from scratch.</strong></li>
<li><strong>ParAccel has designed its optimizer to handle lots and lots of joins.</strong> One reason Barry offers is that ParAccel wants to run customers&#8217; old schemas unaltered, whether or not those are really optimal for the ParAccel DBMS.  That approach is somewhat in contrast to Vertica, which originally focused entirely on star schemas.   And it goes well with ParAccel&#8217;s interest in appealing to customers who at least think they want to run ParAccel in Oracle or SQL Server emulation mode.</li>
</ul>
<p>Also in the post, Barry:</p>
<ul>
<li>Makes an extremely silly marketing exaggeration by referring to &#8220;the only other vendor that was <em>able</em> to run the 30TB TPC-H&#8221; (emphasis mine).</li>
<li>Makes the more excusable marketing exaggeration &#8220;Publishing the benchmark with unmatched performance is simply one way to demonstrate robustness and flexibility.  Nothing more, nothing less.&#8221;</li>
<li>Makes the very clear marketing claim &#8220;For customers, the real test will be their own bake-offs, where our performance has <em>never</em> been beaten.&#8221; (Emphasis mine.) That last one directly contradicts what I&#8217;ve been told by at least two ParAccel competitors, so I&#8217;ll be curious to see what they come up with to substantiate their version of the story.</li>
</ul>
<p>Anyhow, it&#8217;s great to see ParAccel retreating from its obsessive secrecy, which in my opinion has been even worse than <a href="http://www.dbms2.com/2006/09/20/dealing-with-netezza-has-not-been-easy/" >Netezza&#8217;s</a> used to be.</p>
<img src="http://feeds.feedburner.com/~r/dbms2/feed/~4/9zfqLJqca-8" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/07/08/progress-in-figuring-out-what-paraccel-is-doing/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.dbms2.com/2009/07/08/progress-in-figuring-out-what-paraccel-is-doing/</feedburner:origLink></item>
		<item>
		<title>Infobright metrics</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/SdvXE1G8ttI/</link>
		<comments>http://www.dbms2.com/2009/07/08/infobright-metrics/#comments</comments>
		<pubDate>Wed, 08 Jul 2009 17:41:50 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
		
		<category><![CDATA[Data warehousing]]></category>

		<category><![CDATA[Infobright]]></category>

		<category><![CDATA[Market share]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=836</guid>
		<description><![CDATA[Merv Adrian posted about Infobright, and included some company-supplied metrics. Most looked familiar from a post I did in April, but Infobright&#8217;s latest figure for # of paying customers seems to be &#8220;&#62;60&#8221;, up from &#8220;&#62;50&#8221;. Pricing aside, that&#8217;s Vertica/Greenplum territory &#8212; behind Netezza, Teradata, and the big OLTP DBMS vendors, but ahead of everybody [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://mervadrian.wordpress.com/2009/07/07/infobright-bids-to-anchor-an-open-source-dw-ecosystem/" onclick="javascript:pageTracker._trackPageview('/outbound/article/mervadrian.wordpress.com');">Merv Adrian posted about Infobright</a>, and included some company-supplied metrics. Most looked familiar from <a href="http://www.dbms2.com/2009/04/20/infobright-update-3/" >a post I did in April</a>, but Infobright&#8217;s latest figure for # of paying customers seems to be &#8220;&gt;60&#8221;, up from &#8220;&gt;50&#8221;. Pricing aside, that&#8217;s Vertica/Greenplum territory &#8212; behind Netezza, Teradata, and the big OLTP DBMS vendors, but ahead of everybody else I think of as a modern analytic DBMS vendor.</p>
<img src="http://feeds.feedburner.com/~r/dbms2/feed/~4/SdvXE1G8ttI" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/07/08/infobright-metrics/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.dbms2.com/2009/07/08/infobright-metrics/</feedburner:origLink></item>
		<item>
		<title>Hasso Plattner calls for in-memory OLTP column stores</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/KtQCpfDdipk/</link>
		<comments>http://www.dbms2.com/2009/07/07/hasso-plattner-calls-for-in-memory-oltp-column-stores/#comments</comments>
		<pubDate>Wed, 08 Jul 2009 03:33:24 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
		
		<category><![CDATA[Analytic technologies]]></category>

		<category><![CDATA[Columnar database management]]></category>

		<category><![CDATA[DBMS product categories]]></category>

		<category><![CDATA[Data warehousing]]></category>

		<category><![CDATA[Database compression]]></category>

		<category><![CDATA[In-memory DBMS]]></category>

		<category><![CDATA[Memory-centric data management]]></category>

		<category><![CDATA[OLTP]]></category>

		<category><![CDATA[Parallelization]]></category>

		<category><![CDATA[SAP AG]]></category>

		<category><![CDATA[Software as a Service (SaaS)]]></category>

		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=834</guid>
		<description><![CDATA[Former SAP CEO Hasso Plattner has written a paper called A Common Database Approach for OLTP and OLAP Using an In-Memory Column Database, in association with a SIGMOD keynote address.* The approach Plattner advocates is an MPP in-memory column store, presumably somewhat akin to SAP&#8217;s frequently renamed Business Warehouse Accelerator/Business Intelligence Accelerator/BWA/BIA/Son-of-TREX technology. There also [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Former SAP CEO Hasso Plattner has written a paper called <a href="http://www.sigmod09.org/images/sigmod1ktp-plattner.pdf" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.sigmod09.org');">A Common Database Approach for OLTP and OLAP Using an In-Memory Column Database</a>, in association with a SIGMOD keynote address.* The approach Plattner advocates is an MPP in-memory column store, presumably somewhat akin to SAP&#8217;s frequently renamed <a href="../2006/09/20/saps-bi-accelerator/">Business Warehouse Accelerator/Business Intelligence Accelerator/BWA/BIA/Son-of-TREX</a> technology. There also are strong similarities to the MPP in-memory row store project <a href="http://www.dbms2.com/2008/02/18/mike-stonebraker-calls-for-the-complete-destruction-of-the-old-dbms-order/" >H-Store</a>/<a href="http://www.dbms2.com/2009/06/22/h-store-horizontica-voltdb/" >VoltDB</a>, although I don&#8217;t know whether Plattner would go so far as to adopt the H-Store view that <em>all</em> transactions should run in stored procedures. Unsurprisingly, SAP applications are used as the OLTP paradigm throughout.</p>
<p style="margin-bottom: 0in;"><em>*Thanks to <a href="http://marklogic.blogspot.com/2007/02/best-of-mark-logic-ceo-blog.html" onclick="javascript:pageTracker._trackPageview('/outbound/article/marklogic.blogspot.com');">Dave Kellogg</a> for tipping me off to Plattner&#8217;s paper.  I only went to two SIGMOD sessions, neither of which was Plattner&#8217;s. Nobody actually mentioned Plattner&#8217;s talk to me when I was down at SIGMOD.</em></p>
<p style="margin-bottom: 0in;">Perhaps the most interesting part is Plattner&#8217;s claim that <strong>what&#8217;s demanding about OLTP</strong> isn&#8217;t database updating <em>per se,</em> but rather <strong>maintaining aggregates</strong> for quick-response analytics. In his main example of that point, Plattner proposes a real-life &#8220;more than 18&#8221; table schema, of which 2 are base tables, and (most of?) the rest are materialized views that his proposed database architecture dispenses with (because analytic performance is sufficiently good without them).  Thus, Plattner&#8217;s core columnar argument seemingly is</p>
<p style="margin-bottom: 0in;"><strong><em>columnar &#8211;&gt; natively fast analytics &#8211;&gt; no need to maintain aggregates &#8211;&gt; much lower update burden.</em></strong></p>
<p style="margin-bottom: 0in;">That said &#8212; if Plattner&#8217;s paper contained a clear statement of how much more expensive it is to insert or update a single row in a columnar vs. row-based system, I overlooked it. Instead, Plattner seems to be arguing that the volume of base-table updates is low enough that &#8212; whatever it may be &#8212; column-store update overhead is an acceptable price to pay.  (At one point he claims that only 5% of the data inserted in a financial application ever gets changed.) That may actually be true in a financial accounting system, but seems more questionable in a sufficiently large application that gets its updates from automatic devices, or from the consumer web.</p>
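<p style="margin-bottom: 0in;">As a rough illustration of the update-burden argument above, here is a minimal sketch. Everything in it is hypothetical: the class names, the <code>amount</code> column, and above all the simplification that a column-store insert counts as one logical write. That last number is exactly the one the paragraph above says is missing from the paper, so treat the write counts as a way of framing the question, not as a measurement.</p>

```python
from collections import defaultdict


class RowStoreWithAggregates:
    """Base table plus materialized totals kept current on every insert (illustrative)."""

    def __init__(self, aggregate_keys):
        self.base = []
        self.aggregate_keys = list(aggregate_keys)
        self.aggregates = {k: defaultdict(float) for k in self.aggregate_keys}
        self.writes = 0  # logical writes: base row plus one per maintained aggregate

    def insert(self, row):
        self.base.append(row)
        self.writes += 1
        for key in self.aggregate_keys:
            self.aggregates[key][row[key]] += row["amount"]
            self.writes += 1

    def total_by(self, key):
        # Analytics read the precomputed totals.
        return dict(self.aggregates[key])


class ColumnStoreNoAggregates:
    """Columns only; totals are computed at query time by scanning two columns."""

    def __init__(self):
        self.columns = defaultdict(list)
        self.writes = 0  # ASSUMPTION: one logical write per row, whatever it really costs

    def insert(self, row):
        for name, value in row.items():
            self.columns[name].append(value)
        self.writes += 1

    def total_by(self, key):
        totals = defaultdict(float)
        for group, amount in zip(self.columns[key], self.columns["amount"]):
            totals[group] += amount
        return dict(totals)


rows = [
    {"account": "A", "region": "EU", "amount": 10.0},
    {"account": "B", "region": "EU", "amount": 5.0},
    {"account": "A", "region": "US", "amount": 2.5},
]
row_store = RowStoreWithAggregates(["account", "region"])
col_store = ColumnStoreNoAggregates()
for r in rows:
    row_store.insert(r)
    col_store.insert(r)

print(row_store.writes, col_store.writes)
print(row_store.total_by("region") == col_store.total_by("region"))
```

<p style="margin-bottom: 0in;">With only two maintained aggregates the row-store side already does three logical writes per posting; with the 16 or so derived tables in Plattner&#8217;s example the multiplier would be correspondingly larger, provided the scan in <code>total_by</code> is fast enough to replace them.</p>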
<p style="margin-bottom: 0in;">Other highlights include:<span id="more-834"></span></p>
<ul>
<li>Like most modern observers, Plattner believes <strong>Postgres-style timestamping</strong> beats update-in-place.</li>
<li>Plattner also offers a less common reason for liking timestamped inserts over updates-in-place &#8212; he thinks <strong>timestamps are helpful in planning-oriented applications.</strong> In particular, he wants timestamp-aware SQL extensions.</li>
<li>Plattner claims columnar designs have a 10:1 <strong>compression</strong> advantage over row stores &#8212; specifically 20X vs. 2X &#8212; at least using compression schemes that allow for updating at reasonable speed.  That seems exaggerated.</li>
<li>Plattner seemed to drop various references to memory-centric structures SAP already uses. (SAP has long done a lot in-memory, in both the OLTP and planning areas.  Years ago SAP told me of a customer that was buying &gt;1 TB of RAM just to run SAP&#8217;s planning software.  SAP also bragged that &gt;99% of transactions never hit disk, in some sense of &#8220;transaction&#8221;.)</li>
<li>There are lots of references to &#8220;tenants&#8221;, SaaS, and/or SAP&#8217;s SaaS product line.  So <strong>SaaS is evidently a design point.</strong> That makes sense. First, SaaS is one of SAP&#8217;s biggest vulnerabilities. Second, the toughest customization a SaaS customer might want is to add a few columns to standard tables, which might be easier to accommodate with a columnar approach.</li>
</ul>
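<p style="margin-bottom: 0in;">The timestamped-insert idea in the first two bullets can be sketched in a few lines. This is an illustration only, not anything taken from Plattner&#8217;s paper: <code>InsertOnlyTable</code>, <code>put</code> and <code>as_of</code> are hypothetical names, and integer timestamps stand in for real ones. A change never overwrites the old value; it appends a new version, so a planning application can ask what a value was at any earlier point.</p>

```python
class InsertOnlyTable:
    """Append-only store: every change is a new timestamped version (illustrative)."""

    def __init__(self):
        self.versions = []  # tuples of (key, valid_from, value), never modified in place

    def put(self, key, value, ts):
        # An "update" is just another insert carrying a later timestamp.
        self.versions.append((key, ts, value))

    def as_of(self, key, ts):
        """Return the value that was current for key at time ts, or None if there was none."""
        candidates = [
            (valid_from, value)
            for k, valid_from, value in self.versions
            if k == key and ts >= valid_from
        ]
        if not candidates:
            return None
        # The version with the latest valid_from at or before ts wins.
        return max(candidates, key=lambda c: c[0])[1]


table = InsertOnlyTable()
table.put("budget_2009", 100, ts=1)
table.put("budget_2009", 120, ts=5)  # revision; the ts=1 version stays readable

print(table.as_of("budget_2009", 3))
print(table.as_of("budget_2009", 5))
```

<p style="margin-bottom: 0in;">A timestamp-aware SQL extension of the kind Plattner asks for would presumably expose something like <code>as_of</code> in the query language itself; the linear scan here is only for clarity.</p>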
<img src="http://feeds.feedburner.com/~r/dbms2/feed/~4/KtQCpfDdipk" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/07/07/hasso-plattner-calls-for-in-memory-oltp-column-stores/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.dbms2.com/2009/07/07/hasso-plattner-calls-for-in-memory-oltp-column-stores/</feedburner:origLink></item>
		<item>
		<title>Daniel Abadi has a theory about ParAccel</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/kZr9X4inveU/</link>
		<comments>http://www.dbms2.com/2009/07/07/daniel-abadi-has-a-theory-about-paraccel/#comments</comments>
		<pubDate>Tue, 07 Jul 2009 22:46:19 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
		
		<category><![CDATA[Benchmarks and POCs]]></category>

		<category><![CDATA[Columnar database management]]></category>

		<category><![CDATA[Data warehousing]]></category>

		<category><![CDATA[ParAccel]]></category>

		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=833</guid>
		<description><![CDATA[When I was at SIGMOD last week, ParAccel and its SIGMOD talk were mentioned several times, always in puzzled and at least slightly unflattering terms.  (Typical comment: &#8220;Why did they present a paper about that? We were doing the same thing in our company years ago.&#8221;) That doesn&#8217;t prove much per se, since most of [...]]]></description>
			<content:encoded><![CDATA[<p>When I was at SIGMOD last week, ParAccel and its SIGMOD talk were mentioned several times, always in puzzled and at least slightly unflattering terms.  (Typical comment: &#8220;Why did they present a paper about that? We were doing the same thing in our company years ago.&#8221;) That doesn&#8217;t prove much <em>per se,</em> since most of the mentions were by competitors and/or Vertica-affiliated academics, and since my own <a href="http://www.dbms2.com/2009/06/22/the-tpc-h-benchmark-is-a-blight-upon-the-industry/" >unflattering ParAccel-related comments</a> were rather fresh at the time.</p>
<p>But now Daniel Abadi has done <a href="http://dbmsmusings.blogspot.com/2009/07/paraccel-and-their-puzzling-tpc-h.html" onclick="javascript:pageTracker._trackPageview('/outbound/article/dbmsmusings.blogspot.com');">a brilliant, detailed, speculative analysis of ParAccel&#8217;s publications</a>.  Here&#8217;s the meat, emphasis mine:<span id="more-833"></span></p>
<blockquote><p>(1) Why did they configure their TPC-H application with such a high amount of disk I/O throughput capabilty when they are a column-store? (Stonebraker&#8217;s question)<br />
(2) <strong>Why did queries spend seemingly 6X more time doing I/O than a column-store should have to do?</strong><br />
(3) Why are they worried about queries with thousands of joins?<br />
(4) <strong>Why do they think TPC-H/TPC-DS queries have 42 joins?</strong><br />
<strong><br />
And then a theory that answers all four questions at the same time came to me.</strong> Perhaps ParAccel directly followed my advice (see option 1) on &#8220;<a href="http://cs-www.cs.yale.edu/homes/dna/talks/abadi-nedbday.pdf" onclick="javascript:pageTracker._trackPageview('/outbound/article/cs-www.cs.yale.edu');">How to create a new column-store DBMS product in a week</a>&#8221;. <strong>They&#8217;re not a column-store. They&#8217;re a vertically partitioned row-store</strong> (this is how column-stores were built back in the 70s before we knew any better). Each column is stored in its own separate table inside the row-store (PostgreSQL in ParAccel&#8217;s case). Queries over the original schema are then automatically rewritten into queries over the vertically partitioned schema and the row-store&#8217;s regular query execution engine can be used unmodified. But now, <strong>every attribute accessed by the query now adds an additional join to the query plan</strong> (since the vertical partitions for each column in a table have to be joined together).</p>
<p>This immediately explains why they are worried about queries with hundreds to thousands of joins (questions 3 and 4). But it also explains why they seem to be doing much more I/O than a native column-store. <strong>Since each vertical partition is its own table, then each tuple in a vertical partition (which contains just one value) is preceded by the row-store&#8217;s tuple header.</strong> In PostgreSQL this tuple header is on the order of 27 bytes. So <strong>if the column width is 4 bytes, then there is a factor of 7 extra space used up for the tuple header relative to actual user data.</strong> And if the implementation is super naive, they also will need an additional 4 bytes to store a tuple identifier for joining vertical partitions from the same original table with each other. This answers questions 1 and 2, as <strong>the factor of 6 worse I/O efficiency is now obvious.</strong></p></blockquote>
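<p>The space arithmetic in the quoted theory is easy to reproduce. The sketch below only restates the figures Daniel gives (a tuple header on the order of 27 bytes, a 4-byte column, an optional 4-byte tuple identifier); the function names are mine, and nothing here is a measurement of PostgreSQL or of ParAccel.</p>

```python
def header_overhead_factor(column_width, header_bytes=27):
    """Bytes of tuple header per byte of user data in a one-column partition."""
    return header_bytes / column_width


def stored_bytes_per_value(column_width, header_bytes=27, tuple_id_bytes=0):
    """Total bytes stored for one value: header, optional tuple id, and the value itself."""
    return header_bytes + tuple_id_bytes + column_width


def blowup_vs_raw(column_width, header_bytes=27, tuple_id_bytes=0):
    """How many times larger the stored form is than the bare column value."""
    return stored_bytes_per_value(column_width, header_bytes, tuple_id_bytes) / column_width


print(header_overhead_factor(4))           # header alone, relative to a 4-byte value
print(blowup_vs_raw(4))                    # header plus value
print(blowup_vs_raw(4, tuple_id_bytes=4))  # the "super naive" case with a tuple id
```

<p>With the quoted numbers the header alone is 6.75 times the user data, which is the &#8220;factor of 7&#8221; in the quote; adding the 4-byte tuple identifier brings the stored size to 8.75 times the raw value. How that maps onto the observed factor of 6 in I/O depends on what a native column store would itself spend per value, which the quote does not spell out.</p>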
<p>It will be interesting to see whether ParAccel comments, but even if it does, I wouldn&#8217;t necessarily take ParAccel&#8217;s statements as dispositive.  For example &#8212; and illustrative of my view of ParAccel&#8217;s trustworthiness &#8212; I believe the ParAccel competitors who tell me that ParAccel&#8217;s claim to have won or at least tied all POCs on performance is flat-out untrue.</p>
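<p>For readers who want to see why vertical partitioning in a row store multiplies joins, here is a small sketch of the rewrite Daniel describes, using SQLite from Python as a stand-in row store. It is an illustration of the general technique only: the <code>table__column</code> naming, the <code>tid</code> and <code>val</code> columns, and the sample data are all hypothetical, and none of it describes ParAccel&#8217;s actual implementation.</p>

```python
import sqlite3


def partition_name(table, column):
    """Name of the one-column table holding a single attribute (hypothetical convention)."""
    return f"{table}__{column}"


def load_partitioned(conn, table, rows):
    """Store each column of rows in its own (tid, val) table."""
    for column in rows[0].keys():
        part = partition_name(table, column)
        conn.execute(f"CREATE TABLE {part} (tid INTEGER PRIMARY KEY, val)")
        conn.executemany(
            f"INSERT INTO {part} (tid, val) VALUES (?, ?)",
            [(tid, row[column]) for tid, row in enumerate(rows, start=1)],
        )


def rewrite_select(table, columns):
    """Rewrite SELECT columns FROM table into joins over the partitions.

    Returns the SQL text and the number of joins it needed.
    """
    first = partition_name(table, columns[0])
    select_list = ", ".join(f"{partition_name(table, c)}.val AS {c}" for c in columns)
    sql = f"SELECT {select_list} FROM {first}"
    for column in columns[1:]:
        part = partition_name(table, column)
        sql += f" JOIN {part} ON {part}.tid = {first}.tid"
    sql += f" ORDER BY {first}.tid"
    return sql, len(columns) - 1


conn = sqlite3.connect(":memory:")
load_partitioned(conn, "lineitem", [
    {"orderkey": 1, "quantity": 17, "price": 100.5},
    {"orderkey": 2, "quantity": 36, "price": 42.0},
])
sql, joins = rewrite_select("lineitem", ["orderkey", "quantity", "price"])
print(joins)
print(conn.execute(sql).fetchall())
```

<p>Reading three attributes of one logical table already costs two joins, so a query touching a few dozen attributes across several tables reaches the join counts in questions 3 and 4 quickly.</p>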
<img src="http://feeds.feedburner.com/~r/dbms2/feed/~4/kZr9X4inveU" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/07/07/daniel-abadi-has-a-theory-about-paraccel/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.dbms2.com/2009/07/07/daniel-abadi-has-a-theory-about-paraccel/</feedburner:origLink></item>
		<item>
		<title>Yahoo is up to 10 petabytes now?</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/kJoJcQO3T_E/</link>
		<comments>http://www.dbms2.com/2009/07/06/yahoo-is-up-to-10-petabytes-now/#comments</comments>
		<pubDate>Mon, 06 Jul 2009 06:03:54 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
		
		<category><![CDATA[Columnar database management]]></category>

		<category><![CDATA[Data warehousing]]></category>

		<category><![CDATA[Web analytics]]></category>

		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=832</guid>
		<description><![CDATA[According to somebody (I forget who) who attended Yahoo&#8217;s SIGMOD presentation last week, the big Yahoo database is now up to 10 petabytes in size, in line with Yahoo&#8217;s predictions last year.  Apparently, Yahoo also gave more details of how the technology works.
]]></description>
			<content:encoded><![CDATA[<p>According to somebody (I forget who) who attended Yahoo&#8217;s SIGMOD presentation last week, <a href="http://www.dbms2.com/2008/05/29/yahoo-scales-web-analytics-database-petabyte/" >the big Yahoo database</a> is now up to 10 petabytes in size, in line with Yahoo&#8217;s predictions last year.  Apparently, Yahoo also gave more details of how the technology works.</p>
<img src="http://feeds.feedburner.com/~r/dbms2/feed/~4/kJoJcQO3T_E" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/07/06/yahoo-is-up-to-10-petabytes-now/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.dbms2.com/2009/07/06/yahoo-is-up-to-10-petabytes-now/</feedburner:origLink></item>
		<item>
		<title>User data vs. raw disk space as a marketing metric</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/zoWWOpVWCRM/</link>
		<comments>http://www.dbms2.com/2009/07/02/daniel-abadi-user-data/#comments</comments>
		<pubDate>Thu, 02 Jul 2009 21:04:30 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
		
		<category><![CDATA[Data warehousing]]></category>

		<category><![CDATA[Parallelization]]></category>

		<category><![CDATA[Pricing]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=831</guid>
		<description><![CDATA[I tried to post a comment on Daniel Abadi&#8217;s blog, but doing so seems to require some sort of registration process, so I&#8217;m posting here instead.
In a comment to his post on node scalability, Daniel Abadi argued that disk space is a better metric to use in marketing than (presumably compressed) user data.  Well, I [...]]]></description>
			<content:encoded><![CDATA[<p><em>I tried to post a comment on Daniel Abadi&#8217;s blog, but doing so seems to require some sort of registration process, so I&#8217;m posting here instead.</em></p>
<p>In a comment to <a href="http://dbmsmusings.blogspot.com/2009/06/more-on-node-scalability.html" onclick="javascript:pageTracker._trackPageview('/outbound/article/dbmsmusings.blogspot.com');">his post on node scalability</a>, Daniel Abadi argued that disk space is a better metric to use in marketing than (presumably compressed) user data.  Well, I imagine he didn&#8217;t quite mean to say that, but that&#8217;s actually what he wound up saying, starting from the accurate observation that compression ratios vary wildly from one data set to another, even more than they vary from product to product on the same data.</p>
<p>Nonetheless, I favor user data as a metric because:</p>
<ul>
<li>That&#8217;s what users care about.</li>
<li>That&#8217;s how a number of analytic DBMS vendors, including Vertica, actually price.</li>
</ul>
<img src="http://feeds.feedburner.com/~r/dbms2/feed/~4/zoWWOpVWCRM" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/07/02/daniel-abadi-user-data/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.dbms2.com/2009/07/02/daniel-abadi-user-data/</feedburner:origLink></item>
		<item>
		<title>The TPC-H schema</title>
		<link>http://feedproxy.google.com/~r/dbms2/feed/~3/CHw3yjX_5eU/</link>
		<comments>http://www.dbms2.com/2009/07/02/the-tpc-h-schema/#comments</comments>
		<pubDate>Thu, 02 Jul 2009 18:59:23 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
		
		<category><![CDATA[Benchmarks and POCs]]></category>

		<category><![CDATA[Data warehousing]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=830</guid>
		<description><![CDATA[Would anybody recommend in real life running the TPC-H schema for that data? (I.e., fully normalized, no materialized views.) If so &#8212; why????
]]></description>
			<content:encoded><![CDATA[<p>Would anybody recommend in real life running the <a href="http://www.dbms2.com/2009/06/22/the-tpc-h-benchmark-is-a-blight-upon-the-industry/" >TPC-H</a> schema for that data? (I.e., fully normalized, no materialized views.) If so &#8212; why????</p>
<img src="http://feeds.feedburner.com/~r/dbms2/feed/~4/CHw3yjX_5eU" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/07/02/the-tpc-h-schema/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.dbms2.com/2009/07/02/the-tpc-h-schema/</feedburner:origLink></item>
	</channel>
</rss>
