<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="http://feeds.feedburner.com/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
    <title>The Database Column</title>
    <link rel="alternate" type="text/html" href="http://www.databasecolumn.com/" />
    
    <id>tag:www.databasecolumn.com,2007-08-30://1</id>
    <updated>2008-07-18T18:50:13Z</updated>
    <subtitle>A multi-author blog on database technology and innovation.</subtitle>
    <generator uri="http://www.sixapart.com/movabletype/">Movable Type Publishing Platform 4.0</generator>

<link rel="self" href="http://feeds.feedburner.com/dbcfeed" type="application/atom+xml" /><entry>
    <title>Debunking a Myth: Column-Stores vs. Indexes</title>
    <link rel="alternate" type="text/html" href="http://feeds.feedburner.com/~r/dbcfeed/~3/339239330/debunking-a-myth-columnstores.html" />
    <id>tag:www.databasecolumn.com,2008://1.41</id>

    <published>2008-07-18T17:53:21Z</published>
    <updated>2008-07-18T18:50:13Z</updated>

    <summary>Consider a traditional, row-oriented database.  Indexes are known to improve performance in database systems. They can greatly reduce I/O costs by avoiding the need to perform table scans since they directly contain the data you need to answer a query or contain pointers to such data. If you have a query that accesses only two out of thirty columns from a large table, and you have an index on these two columns, then you can use the indexes to avoid scanning all of the data in a table.</summary>
    <author>
        <name>Daniel Abadi</name>
        <uri>http://www.databasecolumn.com</uri>
    </author>
    
        <category term="Database architecture" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="Database miscellaneous" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="columnstores" label="column stores" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="dba" label="DBA" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="indexing" label="indexing" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="olap" label="OLAP" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="sigmod" label="SIGMOD" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="starschema" label="Star Schema" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="tpch" label="TPC-H" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="tuple" label="tuple" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[Consider a traditional, row-oriented database.&nbsp; Indexes are known to improve query performance. They can greatly reduce I/O costs by avoiding the need to perform table scans since they directly contain the data you need to answer a query or contain pointers to such data. If you have a query that accesses only two out of thirty columns from a large table, and you have an index on these two columns, then you can use the indexes to avoid scanning all of the data in the table.<br /><br />A challenge when using a traditional database is deciding what indexes to create on your tables.&nbsp; One either pays a DBA to carefully choose the right set of indexes to optimize a target workload, or you buy a database with an auto-tuning feature to create this set of indexes automatically (which might not be as good as a human DBA).<br /><br />Ideally, it would be possible to have an index on every column. Unfortunately, every index you create results in the materialization of another copy of the column data (in addition to having other space overheads for pointers and other parts of the index data structure). Thus, the size of your database would be enormous if you had an index on every column. Even if you had infinite storage space (so that this explosion in data storage was not an issue), index maintenance is very expensive. Updates and inserts need to be reflected in the raw data and all of the indexes. Hence, there is a fundamental trade off - indexes improve query performance, but cost you in storage and maintenance. This is why you need an expert to choose the right set of indexes. <br /><br />Now consider a column-store. By storing each column separately, the benefit appears similar to having an index on every column in a table in a row-store. If you have a query that accesses only two out of thirty columns from a large table, the column-store only reads&nbsp; those two columns and can avoid the enormous table scan (just like having an index). However, since it is the raw copy of the data that is stored in columns, no additional copies of the data need to be created, so the storage and update overheads associated with indexes is avoided.<br /><br />Thus, one might expect column-stores to perform similarly to a row-store with an index on every column without the corresponding negatives of creating many indices. In fact, this is a common argument we have often heard regarding column-stores and their expected performance relative to carefully designed row-stores -- both approaches provide good read performance, with the column store providing lower total cost of ownership (since you don't have to figure out what indexes to create anymore).<br /><br />Though this argument sounds reasonable, it is completely incorrect.&nbsp; It is also dangerous since it might cause you to end up choosing a row-store when what you really need is a column-store.<br /><br />Assume the following situation:<br />a) You already have a license for a commercial row-store<br />b) You have tons of extra storage space<br />c) You have a read-only workload (so index maintenance is not an issue)<br /><br />Using the above reasoning, in this situation you would not need to go out and buy a column-store. You would just create an index on every column on your row-store.<br /><br />In our <a href="http://cs-www.cs.yale.edu/homes/dna/papers/abadi-sigmod08.pdf">SIGMOD 2008 paper, "Column-Stores vs. Row-Stores: How Different Are They Really?"</a> which we presented last month in Vancouver, we explored this situation, running a commercial row-store (with no storage restrictions) on a read-only benchmark. The benchmark we used was the <a href="http://www.cs.umb.edu/%7Eponeil/StarSchemaB.PDF">Star Schema Benchmark</a>, a recently proposed benchmark designed to be more "typical" of data warehousing data and queries than TPC-H. We compared the performance of the commercial row-store (where we created an index on every column and forced the database to always use these indexes to access data instead of using a full table scan to access data) with the same row-store under a more normal configuration (optimized by a professional DBA) and with a column-store. The results are shown in the figure below:<br />
<form class="mt-enclosure mt-enclosure-image" contenteditable="false" mt:asset-id="21"><img class="mt-image-center" style="margin: 0pt auto 20px; display: block; text-align: center;" alt="DA_mythimg_1.jpg" src="http://www.databasecolumn.com/DA_mythimg_1.jpg" height="327" width="383" /></form><br />The fact that the column-store was almost a factor of six faster than the row-store was not surprising. After all, column-stores are supposed to outperform row-stores for data warehousing workloads. But if one views a column-store as similar to a row-store with an index on every column, one would have expected the row-store (all-indexes) approach to perform about as fast as the column-store. Instead, it performed over a factor of 50 slower, and almost an order of magnitude slower than the same commercial row-store that used full table scans to access data instead of index accesses!<br /><br />So what's going on here? It turns out that a column in a column-store is very different from an index. A column in a column-store stores attribute data in the same order that it appeared in the original table (or from a sorted projection of <a href="http://www.databasecolumn.com/2007/09/compression-follow-up.html">that table</a>). You can think of this as mapping tuple ID to column value. For example, as shown in Figures 2 - 4, if you want the value for the "customer city" attribute for the 6th tuple in a table (or projection), you can find this value by jumping to the 6th value in the "customer city" column. On the other hand, an index contains the exact opposite mapping. It maps a column value to tuple ID. If you want to find the tuple ID for all tuples whose "customer city" is "Denver", an index is great. But what if you want to find the "customer city" of the 6th tuple? You would have to scan the whole index, looking for tuple ID 6.<br /><br />&nbsp;<br />
<form class="mt-enclosure mt-enclosure-image" contenteditable="false" mt:asset-id="22"><img class="mt-image-center" style="margin: 0pt auto 20px; display: block; text-align: center;" alt="DA_mythimg_2.jpg" src="http://www.databasecolumn.com/DA_mythimg_2.jpg" height="251" width="502" /></form><br />&nbsp; <br />
<form class="mt-enclosure mt-enclosure-image" contenteditable="false" mt:asset-id="23"><img class="mt-image-center" style="margin: 0pt auto 20px; display: block; text-align: center;" alt="DA_mythimg_3.jpg" src="http://www.databasecolumn.com/DA_mythimg_3.jpg" height="304" width="676" /></form><br /><br /><br />&nbsp; <br />&nbsp;
<div>
<form class="mt-enclosure mt-enclosure-image" contenteditable="false" mt:asset-id="24"><img class="mt-image-center" style="margin: 0pt auto 20px; display: block; text-align: center;" alt="DA_mythimg_4.jpg" src="http://www.databasecolumn.com/DA_mythimg_4.jpg" height="304" width="666" /></form></div>
<div><br />So indexes are often useful in first part of query execution where predicate evaluation occurs (dealing with the "WHERE" part of a SQL statement), where you are looking for tuples with specific values (it turns out that even then, indexes are only useful for very selective predicates). But for the later part of the query plan, where the database is extracting values for attributes for specific tuple IDs (the "SELECT" and "GROUP BY" part of a SQL statement), you want a tuple ID to value mapping, and a column is better than an index. The reason why the "row-store all-indexes" approach went so slow in our experiments is that for each tuple ID produced by evaluating the predicates in the SQL "WHERE" clause, the database would have to search the index (using the wrong mapping) for each attribute that appeared in the "GROUP BY" and "SELECT" clauses. This can be thought of as adding one additional join to the query for each attribute that appears in the "GROUP BY" and "SELECT" clauses.<br /><br />Hence, an index and a column are quite different data structures. Of course, there are some situations where what you really want is an index and not a column. For example, if you had a query workload with a lot of "needle-in-the-haystack" queries (queries with very selective predicates), you need to use a lot of indexes. If you have the incorrect perception that a column-store is pretty much the same as a row-store with an index on every column, you might be tempted to use a column-store. In fact, what you really want is a heavily indexed database (either a row-store, an indexed column-store, or a column-store with multiple redundant sort orders).<br /><br />An astute reader might ask the question: what if the row-store was able to have indexes that mapped tuple-ID to value instead of the other way around? We studied that idea too, and although this significantly improves the performance of the all-index approach, it still does not approach the performance of the column-store. We will explain why this is the case in a future blog post.<br /><i><br />(Ed.&nbsp; This article was co-authored by <a href="http://www.databasecolumn.com/2007/09/madden.html">Sam Madden</a>)</i><br /></div>
<div><br /></div>]]>
        
    </content>
<feedburner:origLink>http://www.databasecolumn.com/2008/07/debunking-a-myth-columnstores.html</feedburner:origLink></entry>

<entry>
    <title>Understanding the Difference Between Column-Stores and OLAP Data Cubes</title>
    <link rel="alternate" type="text/html" href="http://feeds.feedburner.com/~r/dbcfeed/~3/329836202/understanding-the-difference-b.html" />
    <id>tag:www.databasecolumn.com,2008://1.39</id>

    <published>2008-07-08T01:47:39Z</published>
    <updated>2008-07-09T13:29:20Z</updated>

    <summary>Both column-stores and data cubes are designed to provide high performance on analytical database workloads (often referred to as Online Analytical Processing, or OLAP.)  These workloads are characterized by queries that select a subset of tuples, and then aggregate and group along one or more dimensions.  In this post, we study how column-stores and data cubes would evaluate a query on a sample database.</summary>
    <author>
        <name>Sam Madden</name>
        
    </author>
    
        <category term="Database architecture" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="columnstores" label="column stores" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="datacubes" label="data cubes" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="essbase" label="EssBase" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="holap" label="HOLAP" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="madden" label="Madden" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="molap" label="MOLAP" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="olap" label="OLAP" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[Both column-stores and data cubes are designed to provide high performance on analytical database workloads (often referred to as Online Analytical Processing, or OLAP.)&nbsp; These workloads are characterized by queries that select a subset of tuples, and then aggregate and group along one or more dimensions.&nbsp; For example, in a sales database, one might wish to find the sales of technology products by month and store--the SQL query to do this would look like:<br /><blockquote><blockquote><blockquote><blockquote>SELECT month, store, COUNT(*)<br />FROM sales, products<br />WHERE productType = 'technology'<br />AND products.id = sales.productID<br />GROUP BY month, store<br /></blockquote></blockquote></blockquote></blockquote><br />In this post, we study how column-stores and data cubes would evaluate this query on a sample database:<br /><br /><span class="mt-enclosure mt-enclosure-image"><img alt="cubes_1.jpg" src="http://www.databasecolumn.com/cubes_1.jpg" class="mt-image-center" style="margin: 0pt auto 20px; text-align: center; display: block;" height="299" width="630" /><b>Column Store Analysis</b><br /></span>In column-stores, this query would be answered by scanning the <i>productType</i> column of the <i>products</i> table to find the <i>ids</i> that have type <i>technology</i>.&nbsp; These <i>ids</i> would then be used to filter the <i>productID</i> column of the <i>sales</i> table to find positions of records with the appropriate product type.&nbsp; Finally, these positions would be used to select data from the <i>months</i> and <i>stores</i> columns for input into the GROUP BY operator.&nbsp; Unlike in a row-store, the column-store only has to read a few columns of the sales table (which, in most data warehouses, would contain tens of columns), making it significantly faster than most commercial relational databases that use row-based technology.<br /><br />Also, if the table is sorted on some combination of the attributes used in the query (or if a materialized view or projection of the table sorted on these attributes is available), then substantial performance gains can be obtained both from compression and the ability to directly offset to ranges of satisfying tuples.&nbsp; For example, notice that the <i>sales</i> table is sorted on <i>productID</i>, then month, then <i>storeID</i>.&nbsp;&nbsp; Here, all of the records for a given <i>productID</i> are co-located, so the extraction of matching <i>productIDs</i> can be done very quickly using binary search or a sparse index that gives the first record of each distinct <i>productID</i>.&nbsp; Furthermore, the <i>productID</i> column can be effectively run-length encoded to avoid storing repeated values, which will use much less storage space (see my previous post on <a href="http://www.databasecolumn.com/2007/09/data-compression.html">compression in column-stores</a>).&nbsp; Run-length encoding will also be effective on the <i>month</i> and <i>storeID</i> columns, since for a group of records representing a specific <i>productID</i>, month is sorted, and for a group of records representing a given (<i>productID,month</i>) pair, <i>storeID</i> is sorted.&nbsp; For example, if there are 1,000,000 sales records of about 1,000 products sold by 10 stores, with sales uniformly distributed across products, months and stores, then the <i>productID</i> column can be stored in 1,000 records (one entry per product), the month column can be stored in 1,000 x 12 = 12,000 records, and the storeID column can be stored in and 1,000 x 12 x 10 = 120,000 records.&nbsp; This compression means that less the amount of data read from disk is less than 5% of its uncompressed size.<br /><br /><b>Data Cube Analysis</b><br /><br /><span class="mt-enclosure mt-enclosure-image"><img alt="cubes_2.jpg" src="http://www.databasecolumn.com/cubes_2.jpg" class="mt-image-left" style="margin: 0pt 20px 20px 0pt; float: left;" height="282" width="477" /></span><br />Data cube-based solutions (sometimes referred to as MOLAP systems for "multidimensional online analytical processing"), are represented by commercial products such as EssBase.&nbsp; They&nbsp; store data in array-like structures, where the dimensions of the array represent columns of the underlying tables, and the values of the cells represent pre-computed aggregates over the data.&nbsp; A data cube on the product, store, and month attributes of the sales table, for example, would be stored in an array format as shown in the figure above.&nbsp; Here, the cube includes "roll-up" cells that summarize the values of the cells in the same row, column, or "stack" (x,y position.) If we want to use a cube to compute the values of the COUNT aggregate, as in the query above, the cells of this cube would look like:<br /><br /><span class="mt-enclosure mt-enclosure-image"><img alt="cubes_3.jpg" src="http://www.databasecolumn.com/cubes_3.jpg" class="mt-image-center" style="margin: 0pt auto 20px; text-align: center; display: block;" height="216" width="700" /></span>&nbsp;<br /><br />Here, each cell contains the count of the number of records with a given (<i>productID,month,storeID</i>) value.&nbsp; For example, there is one record with <i>storeID=1, productID=2</i>, and <i>month=April</i>.&nbsp; The "sum" fields indicate the values of the COUNT "rolled up" on specific dimensions; for example, looking at the lower left hand corner of the cube for Store 1, we can see that in <i>storeID 1, productID 1</i> was sold twice across all months.&nbsp; Thus, to answer the above query using a data cube, we first identify the subset of the cube that satisfies the WHERE clause (here, products 3, 4, and 5 are technology products--this is indicated by their dark shading in the above figure.)&nbsp; Then, the system reads the pre-aggregated values from <i>sum</i> fields for the unrestricted attributes (store and month), which gives the result that store 2 had 1 technology sale in Feburary and 1 in June, and that store 3 had 1 technology sale in February and 1 in October.<br /><br />The advantages of a data cube should be clear--it contains pre-computed aggregate values that make it a very compact and efficient way to retrieve answers for specific aggregate queries.&nbsp; It can be used to efficiently compute a hierarchy of aggregates--for example, the <i>sum</i> columns in the above cube make it is very fast to compute the number of sales in a given month across all stores, or the number of sales or a particular product across the entire year in a given store.&nbsp; Because the data is stored in an array-structure, and each element is the same size, direct offsetting to particular values may be possible. However, data cubes have several limitations:<br /><br /><ul><li><b>Sparsity:</b>&nbsp; Looking at the above cube, most of the cells are empty.&nbsp; This is not simply an artifact our sample data set being small--the number of cells in a cube is the product of the cardinalities of the dimensions in the cube.&nbsp; Our 3D cube with 10 stores and 1,000 products would have 120,000 cells, and adding a fourth dimension, such as <i>customerID</i> (with, say, 10,000 values), would cause the number of cells to balloon to 1.2 billion!&nbsp; Such high dimensionality cubes cannot be stored without compression.&nbsp; Unfortunately, compression can limit performance somewhat, as direct offsetting is no longer possible. For example, a common technique is to store them as a table with the values and positions of the non-empty cells, resulting in an implementation much like a row-oriented relational database!</li></ul><ul><li><b>Inflexible, Limited ad-hoc query support:&nbsp; </b>Data cubes work great when a cube aggregated on the dimensions of interest and using the desired aggregation functions is available.&nbsp; Consider, however, what happens in the above example if the user wants to compute the average sale price rather than the count of sales, or if the user wants to include aggregates on <i>customerID</i> in addition to the other attributes.&nbsp; If no cube is available, the user has no choice but to fall back to queries on an underlying relational system.&nbsp; Furthermore, if the user wants to drill down into the underlying data--asking, for example "who was the customer who bought a technology product at store 2 in February?"--the cube cannot be used (one could imagine storing entire tuples, or pointers to tuples, in the cells of a cube, but like sparse representations, this significantly complicates the representation of a cube and can lead to storage space explosions.)&nbsp; To deal with these limitations, some cube systems support what is called "HOLAP" or "hybrid online analytical processing", where they will automatically redirect queries that cannot be answered with cubes to a relational system, but such queries run as fast as whatever relational system executes them.</li></ul><ul><li><b>Long load times:&nbsp; </b>Computing a cube requires a complex aggregate query over all of the data in a warehouse (essentially, every record has to be read from the database.)&nbsp; Though it is possible to incrementally update cubes as new data arrives, it is impractical to dynamically create new cubes to answer ad-hoc queries.</li></ul><b>Summary and Discussion</b><br /><br />Data cubes work well in environments where the query workload is predictable, so that cubes needed to answer specific queries can be pre-computed.&nbsp; They are inappropriate for ad-hoc queries or in situations where complex relational expressions are needed. &nbsp;<br /><br />In contrast, column-stores provide very good performance across a much wider range of queries (all of SQL!) However, for low-dimensionality pre-computed aggregates, it is likely that a data-cube solution will outperform a column store. For many-dimensional aggregates, the tradeoff is less clear, as sparse cube representations are unlikely to perform any better than a column store. <br /><br />Finally, it is worth noting that there is no reason that cubes cannot be combined with column-stores, especially in a HOLAP-style configuration where queries not directly answerable from a cube are redirected to an underlying column-store system.&nbsp; That said, given that column-stores will typically get very good performance on simple aggregate queries (even if cubes are slightly faster), it is not clear if the incremental cost of maintaining and loading an additional cube system to compute aggregates is ever worthwhile in a column-store world.&nbsp; Furthermore, existing HOLAP products, which are based on row-stores, are likely to be an order of magnitude or more slower than column-stores on ad-hoc queries that cannot be answered by the MOLAP system, for the same reasons discussed elsewhere in this blog.<br /><div><br /></div><div><br /></div><div><br /></div>]]>
        
    </content>
<feedburner:origLink>http://www.databasecolumn.com/2008/07/understanding-the-difference-b.html</feedburner:origLink></entry>

<entry>
    <title>Designing Systems for the Grid:  The Problem with "Retrofitting," Part 1</title>
    <link rel="alternate" type="text/html" href="http://feeds.feedburner.com/~r/dbcfeed/~3/316559576/designing-systems-for-the-grid.html" />
    <id>tag:www.databasecolumn.com,2008://1.38</id>

    <published>2008-06-19T18:26:28Z</published>
    <updated>2008-06-21T00:09:14Z</updated>

    <summary>In a two-part post, we will illustrate how -- using a database query optimizer as an  example -- the strategy of retrofitting databases in a distributed, shared-nothing grid computing architecture can fail. This argument will require some understanding of how centralized query optimizers work. Therefore, we will divide this discussion into two parts. The first installment will provide a background on centralized query optimization; the second installment will show why retrofitting a centralized query optimizer to work on the grid can lead to poor performance when evaluating queries.</summary>
    <author>
        <name>Mitch Cherniack</name>
        
    </author>
    
    <category term="datawarehouse" label="data warehouse" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="databaseperformance" label="database performance" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="databasequeries" label="database queries" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="dbms" label="DBMS" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[One of the key features of new data warehouse databases such as Vertica is their ground-up support for distributed, shared-nothing grid computing architectures. Because of scalability and low costs, such architectures are becoming the norm in large enterprises, and because of their scalability requirements, data warehouses are a natural fit in this world.<br /><br />Some database vendors have attempted to "retrofit" their centralized DBMS designs to work in a distributed world. The basic idea of retrofitting is to reuse as much of the centralized code base as possible in producing a distributed design. The motivation for retrofitting is clear: In building a database system, it is often easier to reuse existing components and adapt them for new uses than it is to build a new system from scratch. This reduces the time-to-market of a product that can claim grid support. But the performance improvements of these retrofitted systems often do not live up to the raw increases in horsepower that grid architectures provide.&nbsp; There are two reasons for this:<br /><br /><blockquote><ol><li>Code that gets reused in a retrofitted system is often brittle as a result of many years of patchwork. To change this code introduces potential instabilities, and thus, there is a natural desire to instead treat such code as "black boxes."<br /><br /></li><li>Black box code was often designed with assumptions of a centralized architecture, and these assumptions may constrain performance when executed over a distributed system over which the assumptions do not hold.</li></ol></blockquote><br />We will illustrate this point using the database <i>query optimizer</i> as an illustrative example of how retrofitting strategies can fail. This argument will require some understanding of how centralized query optimizers work.&nbsp; Therefore, we will divide this discussion into two parts. The first installment will provide a background on centralized query optimization; the second installment will show why retrofitting a centralized query optimizer to work on the grid can lead to poor performance when evaluating queries.<br /><br /><br /><b>Part 1:&nbsp; A Primer on Centralized Query Optimization</b><br /><br />In this installment, we present a primer on centralized query optimization. &nbsp;<br /><br />The purpose of a query optimizer is to produce a cost-effective <i>evaluation plan</i> (or just plan) for any query submitted to the database. The basic strategy used to come up with this plan is largely the same for every query optimizer:<br /><br /><blockquote>a) It first formulates a set of <i>candidate plans</i> that could be used to evaluate the query <br /><br />b) It then applies a <i>cost model</i> to predict the execution time (cost) of each of the candidate plans. A cost model consists of a set of formulas that specify the sizes of intermediate query results, and the cost (e.g., time) required to produce them. For example, a simplistic cost model measures cost as the number of disk reads required to evaluate the query, with the idea that queries that perform the fewest disk reads will execute in the least time.<br /><br />c)&nbsp; Upon evaluating the cost of each candidate plan, the query optimizer then selects the plan with least cost.<br /></blockquote><br />One of the most crucial design decisions that affects the effectiveness of a query optimizer lies in how it limits the size of the space of candidate plans (the <i>search space</i>) that it must consider. Specifically, a query that includes multiple tables in its FROM clause can be evaluated using any of a number of plans that differ only by the order in which these tables are joined. Consider, for example, the SQL query fragment below:<br /><br /><blockquote><blockquote>&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;<b>SELECT </b>&nbsp;&nbsp; &nbsp;*<br />&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;<b>FROM</b> &nbsp;&nbsp;&nbsp; &nbsp;T1, T2, T3, T4<br />&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;<b>WHERE</b> &nbsp;&nbsp; &nbsp;...<br /></blockquote></blockquote><br />This query must join 4 tables: T1, T2, T3, and T4. The order in which pairs of tables are joined can vary. For example, the binary tree below (hereafter referred to as a <i>join plan</i>) shows one possible join ordering.<br /><br /><span class="mt-enclosure mt-enclosure-image"><img alt="img1.jpg" src="http://www.databasecolumn.com/img1.jpg" class="mt-image-center" style="margin: 0pt auto 20px; text-align: center; display: block;" height="205" width="228" /></span>The join plan above specifies that the first join to be performed is of T1 and T2 (the "bowtie" icon specifies a join), followed by the result being joined with T3, and the result of this in turn being joined with T4. This join plan has a structure which is <i>left-deep</i> (or equivalently, <i>right-shallow)</i> because the right branch of every join in the plan is a base table. &nbsp;<br /><br />For the 4-table query above, there are 4! = 24 different left-deep join plans that differ according to which tables correspond to which leaves of the tree (4! = the number of sequences of a set of 4 items). Aside from the left-deep plans, there are 4 other join plan structures that are possible for the query above as illustrated below.&nbsp; Note that each of these join plan structures also has 24 variations given a query with 4 tables (by permuting the tables at the leaves), so in all, there are 5 * 24 = 120 join plans for an optimizer to consider for this query. <br /><br /><span class="mt-enclosure mt-enclosure-image"><img alt="opt_img2.jpg" src="http://www.databasecolumn.com/opt_img2.jpg" class="mt-image-center" style="margin: 0pt auto 20px; text-align: center; display: block;" height="416" width="522" /></span><br />In general, given <i>n</i> tables to be joined, there are: &nbsp;<br /><br /><div align="center">(2(<i>n</i>-1))!<br />---------------<br />n! * (<i>n</i>-1)!<br /></div><br />possible join plan structures<a href="#note"><sup>(1)</sup></a>, and for each join plan structure, <i>n</i>! possible join plans, giving a total of:<br /><br /><div align="center">(2(<i>n-</i>1))!<br />---------------<br />(<i>n</i>-1)!<br /></div>&nbsp;<br />join plans that an optimizer could consider. As the number of tables in a query grows, the number of join plans to consider quickly becomes infeasible. For example, whereas a query with 4 tables requires consideration of 120 join plans, a query with 5 tables requires consideration of 151,200 join plans, a query with 6 tables requires consideration of 3,991,680 join plans, and so on.<br /><br />To cope with this enormous search space, all query optimizers must somehow limit the set of join plans considered. IBM's System R (from the late 1970s) first introduced the idea of limiting the search space to the set of join plans with a left-deep join plan structure because left-deep plans ensure that every binary join is performed with at least one participant table on disk, thereby ensuring that a join operator can produce output incrementally as its input data arrives (<i>pipelining</i>). The left-deep restriction reduces the number of join plans to consider for a query of <i>n</i> tables to <i>n!</i>, and dynamic programming techniques can be used to find the "best" query plan in this space in exponential time. In practice, this is a reasonable amount of time to process queries consisting of roughly 30 tables or fewer (YMMV), and thus, this heuristic is still used to narrow the search space of most commercial DBMS.&nbsp; <br /><br />Of course, there are many other challenges in designing an effective query optimizer aside from managing the search space, including proper choices of access methods and indexes, query unnesting, etc. But in the next installment of this blog, I will show how the typical retrofitted query optimizer determines its search space for plans that apply to the grid and how the resulting optimizer can fail to produce appropriate plans.<br />&nbsp;<div><a href="http://www.databasecolumn.com/blog/mt-static/html/editor-content.html?cs=utf-8" name="note"><sup>(1)</sup></a> This is the known as the <i>n</i>th Catalan number, which specifies (among other things) the number of binary tree "shapes" consisting of n leaves.<br /></div><div><br /><i>Part 2 will be available next week . . .</i><br /><br /></div>]]>
        
    </content>
<feedburner:origLink>http://www.databasecolumn.com/2008/06/designing-systems-for-the-grid.html</feedburner:origLink></entry>

<entry>
    <title>DBMS innovations that will make analytics in the cloud a reality</title>
    <link rel="alternate" type="text/html" href="http://feeds.feedburner.com/~r/dbcfeed/~3/304061026/dbms-innovations-and-cloud.html" />
    <id>tag:www.databasecolumn.com,2008://1.37</id>

    <published>2008-06-03T22:08:04Z</published>
    <updated>2008-06-03T22:30:55Z</updated>

    <summary>There will soon be a myriad of announcements of DBMS offerings in the cloud. Many of these will NOT be marriages made in heaven. However, the most innovative new DBMS software married to new cloud computing services are here today and truly take advantage of the cloud architecture in order to change the economics and the responsiveness of business analytics.</summary>
    <author>
        <name>Jerry Held</name>
        
    </author>
    
        <category term="Database architecture" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="Database innovation" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="cloudcomputing" label="cloud computing" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="columnstores" label="column stores" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="dbms" label="DBMS" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[There will soon be a myriad of announcements of DBMS offerings in the cloud. Many of these will NOT be marriages made in heaven. However, the most innovative new DBMS software combined with new cloud computing services are here today and truly take advantage of the cloud architecture in order to change the economics and the responsiveness of business analytics.<br /><br />In my <a href="http://www.databasecolumn.com/2008/05/cloud-and-bi.html">last article</a>, I described how I think cloud computing will change the economics of business intelligence (BI) and enable a variety of new analytic data management projects and business possibilities. It does so by making the hardware, networking, security, and software needed to create data marts and data warehouses available on demand with a pay-as-you-go approach to usage and licensing.<br /><br />A computing cloud, such as the Amazon Elastic Compute Cloud, is composed of thousands of commodity servers running multiple virtual machine instances (VMs) of the applications hosted in the cloud. As customer demand for those applications changes, new servers are added to the cloud or idled and new VMs are instantiated or terminated.<br /><br />Cloud computing infrastructure differs dramatically from the infrastructure underlying most in-house data warehouses and data marts. There are no high-end servers with dozens of CPU cores, SANs, replicated systems, or proprietary data warehousing appliances available in the cloud. Therefore, a new DBMS software architecture is required to enable large volumes of data to be analyzed quickly and reliably on the cloud's commodity hardware. Recent DBMS innovations make this a reality today, and the best cloud DBMS architectures will include: <br /><br /><ol><li><b>Shared-nothing, massively parallel processing (MPP) architecture.</b> In order to drive down the cost of creating a utility computing environment, the best cloud service providers use huge grids of identical (or similar) computing elements. Each node in the grid is typically a compute engine with its own attached storage. For a cloud database to successfully "scale out" in such an environment, it is essential that the database have a shared-nothing architecture utilizing the resources (CPU, memory, and disk) found in server nodes added to the cluster. Most databases popularly used in BI today have shared-everything or shared-storage architectures, which will limit their ability to scale in the cloud.<br /><br /></li><li><b>Automatic high availability.</b> Within a cloud-based analytic database cluster, node failures, node changes, and connection disruptions can occur. Given the vast number of processing elements within a cloud, these failures can be made transparent to the end user if the database has the proper built-in failover capabilities. The best cloud databases will replicate data automatically across the nodes in the cloud cluster, be able to continue running in the event of 1 or more node failures ("k-safety"), and be capable of restoring data on recovered nodes automatically -- without DBA assistance. Ideally, the replicated data will be made "active" in different sort orders for querying to increase performance.<br /><br /></li><li><b>Ultra-high performance.</b> One of the game-changing advantages of the cloud is the ability to get an analytic application up quickly (without waiting for hardware procurement). However, there can be some performance penalty due to Internet connectivity speeds and the virtualized cloud environment. If the analytic performance is disappointing, the advantage is lost. Fortunately, the latest shared-nothing columnar databases are designed specifically for analytic workloads, and they have demonstrated dramatic performance improvements over traditional, row-oriented databases (as verified by industry experts, such as <a href="http://www.vertica.com/gartner">Gartner</a> and <a href="http://www.vertica.com/forrester">Forrester</a>, and by <a href="http://www.vertica.com/benchmarks">customer benchmarks</a>). This software performance improvement, coupled with the hardware economies of scale provided by the cloud environment, results in a new economic model and competitive advantage for cloud analytics.<br /><br /></li><li><b>Aggressive compression.</b> Since cloud costs are typically driven by charges for processor and disk storage utilization, aggressive data compression will result in very large cost savings. Row-oriented databases can achieve compression factors of about 30% to 50%; however, the addition of necessary indexes and materialized views often swells databases to 2 to 5 times the size of the source data. But since the data in a column tends to be more similar and repetitive than attributes within rows, column databases often achieve much higher levels of compression. They also don't require indexes. The result is normally a 4x to 20x reduction in the amount of storage needed by columnar databases and a commensurate reduction in storage costs.<br /><b><br /></b></li><li><b>Standards-based connectivity. </b>While there are a number of special-purpose file systems that have been developed for the cloud environment that can provide high performance, they lack the standard connectivity needed to support general-purpose business analytics. The broad base of analytic users will use existing commercial ETL and reporting software that depend on SQL, JDBC, ODBC, and other DBMS connectivity standards to load and query cloud databases. Therefore, it's imperative for cloud databases to support these connection standards to enable widespread use of analytic applications. <br /></li></ol>In summary, cloud databases with the architectural characteristics described above will be able to not just run in the cloud, but thrive there by:<br /><br /><ul><li>"Scaling out," as the cloud itself does<br /><br /></li><li>Running fast without high-end or custom hardware<br /><br /></li><li>Providing high availability in a fluid computing environment<br /><br /></li><li>Minimizing data storage, transfer, and CPU utilization (to keep cloud computing fees low)</li></ul><br /><br /> ]]>
        
    </content>
<feedburner:origLink>http://www.databasecolumn.com/2008/06/dbms-innovations-and-cloud.html</feedburner:origLink></entry>

<entry>
    <title>Database Column contributor: Daniel Abadi</title>
    <link rel="alternate" type="text/html" href="http://feeds.feedburner.com/~r/dbcfeed/~3/338579437/database-column-contributor-da.html" />
    <id>tag:www.databasecolumn.com,2008://1.40</id>

    <published>2008-05-15T01:49:30Z</published>
    <updated>2008-07-18T01:52:42Z</updated>

    <summary>Daniel's research interests are in database system architecture and implementation, cloud computing, and the Semantic Web. He currently serves on the Yale computer science faculty as an Assistant Professor. At Yale heteaches both undergraduate and graduate level classes on database systems, and directs DR@Y, the database research group at Yale....</summary>
    <author>
        <name>Admin</name>
        
    </author>
    
        <category term="About Database Column" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="abadi" label="Abadi" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="contributor" label="contributor" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[Daniel's research interests are in database system architecture and implementation, cloud computing, and the Semantic Web. He currently serves on the Yale computer science faculty as an Assistant Professor. At Yale he<br />teaches both undergraduate and graduate level classes on database systems, and directs DR@Y, the database research group at Yale. Before joining Yale, he spent four years at the Massachusetts Institute of Technology<br />where he published numerous papers on column-store databases, lead the C-Store development effort, and wrote his Ph.D. dissertation on "Query Execution in Column-Oriented Database Systems". Daniel has been a recipient of a Churchill Scholarship, an NSF Graduate Research Fellowship, and a VLDB best paper award.<br /><br />For more, Daniel's Website can be found at: <a href="http://cs-www.cs.yale.edu/homes/dna/">http://cs-www.cs.yale.edu/homes/dna/</a> ]]>
        
    </content>
<feedburner:origLink>http://www.databasecolumn.com/2008/05/database-column-contributor-da.html</feedburner:origLink></entry>

<entry>
    <title>There's a bright cloud on the horizon ... and it will transform the economics of  BI</title>
    <link rel="alternate" type="text/html" href="http://feeds.feedburner.com/~r/dbcfeed/~3/289756883/cloud-and-bi.html" />
    <id>tag:www.databasecolumn.com,2008://1.36</id>

    <published>2008-05-13T22:07:09Z</published>
    <updated>2008-05-21T12:32:44Z</updated>

    <summary>Cloud computing is ushering in a new era of analytic data management for business intelligence by enabling organizations to analyze terabytes of data faster and more economically than ever before. The key change: It's delivered in an on-demand basis. This alternative to traditional, in-house data analytics infrastructure will transform the economics of BI and open up many new possibilities for organizations of all sizes.</summary>
    <author>
        <name>Jerry Held</name>
        
    </author>
    
    <category term="businessintelligence" label="business intelligence" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="cloudcomputing" label="cloud computing" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="saas" label="SaaS" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[Cloud computing is ushering in a new era of analytic data management for business intelligence (BI) by enabling organizations to analyze terabytes of data faster and more economically than ever before. The key change: It's delivered in an on-demand basis.<br /><br />Organizations no longer need to justify spending hundreds of thousands of capital expense budget dollars for upfront hardware and software purchases or spend weeks waiting for hardware delivery and installation. Instead, they can sign up to tap into a computing cloud, such as Amazon's Elastic Compute Cloud (<a href="http://www.amazon.com/gp/browse.html?node=201590011">Amazon EC2</a>), and have a dedicated, high-performance analytic database cluster provisioned and hosted for them. They can then use it on a pay-per-use basis, usually for a monthly fee. <br /><br />This shouldn't be confused with software as a service (SaaS) models. Cloud customers are, in effect, renting dedicated servers and the people needed to house, secure, and manage them. These cloud offerings are more secure than multi-tenant SaaS models in which data from one customer may co-exist with data from another customer within the same application. Cloud customers have full control over server and firewall settings to ensure security.<br />&nbsp;<br /><br /><b>Transforming BI</b><br /><br />This alternative to traditional, in-house data analytics infrastructure will transform the economics of BI and open up many new possibilities for organizations of all sizes. I expect cloud-based analytics to impact BI in the following ways:<br /><br /><ul><li><b>New BI technology adoption will accelerate.</b> The cloud will become the de facto platform for evaluating new software. The cloud enables software companies to make new technology available to many more evaluators on a self-service basis. Unlike free software downloads, evaluators are spared the time and expense of finding hardware and going through installation and setup and the other tasks required to get the software up and running. As a result, the adoption of new BI software technology should increase much faster than it has in the past.<br /><br /></li><li><b>Organizations will conduct more short-term ad-hoc analysis.</b> The need for data marts often arises suddenly, usually in response to new business conditions or events. The need may also last only a short time -- maybe just a few weeks or months. For example, a company might need to suddenly analyze manufacturing data in the wake of a quality or safety breakdown, or it may need a new price plan in response to a new competitor or market condition. The cloud gives companies a way to respond to these requests immediately -- get a mart created in a few hours or days, have business people slice and dice to their hearts' content for as long as they need to, then cancel the cloud cluster, and it goes away with no leftover hardware or software licenses. The cloud makes it economically feasible to conduct more of these short-lived projects.<br /><br /></li><li><b>Lines of business will have the flexibility to fund more data mart projects.</b> Because there are no long-term financial commitments required, lines of business can pay monthly cloud-based analytic database usage fees out of the operating expense budgets they directly control rather than going through a lengthy capital expenditure approval process. Companies can fund departmental, proof of concept, and ad-hoc analytic data projects on-demand, giving them the agility to respond to BI needs faster than their competitors and increase the quality of their strategy setting and execution.<br /><b><br /></b></li><li><b>Data warehousing will increase within medium-size businesses. </b>Despite their size, many midmarket companies have very large volumes of data they would like to analyze. Hedge fund companies with only a handful of IT people at their disposal need to analyze tens of terabytes of stock market history data to hone their trading strategies. Young bio-techs are in similar situations -- they have hundreds of gigabytes of genomic data to cull through. Cloud-based analytic databases will enable them to warehouse and analyze terabytes of data even though their BI budgets and staff are a small fraction of larger enterprises.<br /><br /></li><li><b>The analytic SaaS market will develop faster.</b> Companies that collect economic, market, advertising, scientific, and other data and then offer customers the ability to analyze it on line -- analytic SaaS -- will come to market faster and in greater numbers. They will be able to bring their solutions to market with much less risk and cost by basing them on the cloud during the early stages of growth. The companies can use the hundreds of thousands of dollars saved on in-house data center development to invest in customer acquisition, product development, and other market development activities. After the viability of the business model is proven, analytic data can be migrated to internal databases from the cloud if needed.<br /></li></ul><br />In order for these pioneering analytic cloud projects to succeed -- especially as data volumes grow -- they will require a database architecture that is designed to function efficiently in elastic, hosted computing environments like the cloud. At a minimum, such databases must include the following architectural features:<br /><br /><ul><li><b>"Scale-out" shared-nothing architecture</b> to handle changing analytic workloads as elastically as the cloud<br /><br /></li><li><b>Aggressive data compression</b> to keep storage costs low<br /><br /></li><li><b>Automatic grid replication and failover </b>to provide high availability in the cloud<br /></li></ul><br />I'll discuss these cloud analytic database architectural features and others in more detail in my next post.<br /><br /> ]]>
        
    </content>
<feedburner:origLink>http://www.databasecolumn.com/2008/05/cloud-and-bi.html</feedburner:origLink></entry>

<entry>
    <title>Supporting Column Store Performance Claims</title>
    <link rel="alternate" type="text/html" href="http://feeds.feedburner.com/~r/dbcfeed/~3/251459079/supporting-column-store-perfor.html" />
    <id>tag:www.databasecolumn.com,2008://1.35</id>

    <published>2008-03-14T14:58:21Z</published>
    <updated>2008-03-20T15:14:54Z</updated>

    <summary>In this post, Mike Stonebraker tackles two issues with regards to row- versus column-store databases. In the first issue, he looks at performance challenges given the demands of users. In the second issue, he discusses the availability of third-party connectivity as well as automatic database design tools.</summary>
    <author>
        <name>Michael Stonebraker</name>
        
    </author>
    
        <category term="Database architecture" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="columnstores" label="column stores" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="performance" label="performance" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="rowstores" label="row stores" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="stonebraker" label="Stonebraker" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[We commonly encounter questions related to column store performance from those considering moving away from their current DBMS solution. In this entry, I want to share my thoughts on this topic.<br /><br /><br /><b>Issue No. 1: Addressing the Performance Claim</b><br /><br />There is a well-known adage: "If it's not broken, don't fix it." Any client who is satisfied with his current data warehouse solution would be ill-advised to change it. However, Vertica sees enormous pain in the data warehouse market, due to combinations of the following factors:<br /><br /><ol><li><b>Increasing query complexity.</b> The size of data warehouses are going up faster than disks are getting cheaper. An increasing number of people are being trained and equipped to analyze information, and they desire to correlate more and more data. Since query complexity goes up more than linearly with warehouse size, this means that warehouse problems are getting harder over time - not easier.<br /><br />Many warehouse DBAs can predict with some precision when they will "hit the wall" with their current solution. The result of hitting the wall is an expensive guided tour through the enterprise wallet for more hardware, different software, or both.<br /><br /></li><li><b>The desire for real-time warehouses.</b> Most warehouses are loaded periodically and are out of date by ½ of the length of this periodicity. But many enterprises want more timely business intelligence. The obvious solution is to "trickle load" data in parallel with user queries. However, this is impossible in many current warehouse products.<br /><br /></li><li><b>The desire for timely answers.</b> In many current products, an ad-hoc query requires one to go out to lunch before the answer is returned. Sometimes response time is even worse than this. The result of delayed answers is lost human productivity and a move to "batch thinking" rather than "interactive thinking."</li></ol><br />If the user is in serious pain with his current warehouse solution, then the obvious answer is "find a better one."<br /><br />In summary, performance is either black or white. Either it is good enough or it isn't. And if performance is important--column databases have demonstrated orders better magnitude performance (50x in round numbers) than row-stores in customer benchmarks and TPC-H benchmarks. Industry experts, such as Gartner, have validated these results (click on <a href="http://www.vertica.com/elqNow/elqRedir.htm?ref=http://www.vertica.com/product/resourcelibrary/stonebrakergartner">this link</a> to launch a Vertica-Gartner podcast on this topic<a href="http://www.databasecolumn.com/blog/mt-static/html/www.vertica.com/gartner"></a>).<br /><br />We see column databases out-perform row stores by large margins in customer benchmark settings on a frequent basis. Here are some results a customer measured very recently:<br /><br /><span class="mt-enclosure mt-enclosure-image"><img alt="benchmark_table.jpg" src="http://www.databasecolumn.com/images/2008/benchmark_table.jpg" class="mt-image-center" style="margin: 0pt auto 20px; text-align: center; display: block;" height="245" width="575" /></span>And remember -- it doesn't have to be an either-or decision. As Don Feinberg of Gartner suggests in his podcast, using a column database in conjunction with an enterprise data warehouse (EDW) can provide users with better analytic performance and also to offload certain analyses from the EDW in order to improve its performance without costly upgrades or re-designs.<br /><br /><b><br />Issue No. 2: Of Connectivity and Automatic Design Tools</b><br /><br />My second point concerns the perceived connectivity advantages of row stores. Vertica (and other column-oriented databases) use ODBC/JDBC interfaces. As such, they get connectivity to all of the 3rd party tools that row stores utilize. Hence, "connectivity" is a wash between row stores and column stores. Both kinds of products connect to most -- if not all -- of the popular tools. &nbsp;<br /><br />Lastly, there is a perception that column database introduce additional complexity for DBAs. This is untrue. Vertica includes an automatic physical database designer that helps a DBA set all of the performance options in Vertica. Hence, there is no "complexity" factor; manual optimization by a human is a thing of the past. DB2 has a similar tool. The real question is, "How good is the automatic tool from any given vendor?" We are confident in Vertica's ability to automatically generate a good physical design; it would be interesting to conduct a comparative "out-of-the-box" performance benchmark that measured automatic tool effectiveness.<br /><br /> <div><br /></div><div><br /></div>]]>
        
    </content>
<feedburner:origLink>http://www.databasecolumn.com/2008/03/supporting-column-store-perfor.html</feedburner:origLink></entry>

<entry>
    <title>In response to Monash's post on the four categories of RDBMS</title>
    <link rel="alternate" type="text/html" href="http://feeds.feedburner.com/~r/dbcfeed/~3/237136223/responding-to-monash-2.html" />
    <id>tag:www.databasecolumn.com,2008://1.33</id>

    <published>2008-02-18T18:52:11Z</published>
    <updated>2008-03-14T15:36:09Z</updated>

    <summary>In this response to a Curt Monash post over at the DBMS2 blog, Mike Stonebraker offers his reactions. He sees two categories of relational analytic/data warehouse databases, row stores and column stores, and notes that they have very different characteristics and should not be lumped together. He also points out that if high performance is required, current high-end relational engines can be beaten by a factor of 80 or so on TPC-C.</summary>
    <author>
        <name>Michael Stonebraker</name>
        
    </author>
    
        <category term="Database architecture" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="Database innovation" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="columnstores" label="column stores" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="datawarehouse" label="data warehouse" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="databaseperformance" label="database performance" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="dbms" label="DBMS" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="oltp" label="OLTP" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="stonebraker" label="Stonebraker" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[As I did last week, I am using this post to respond to an article published by Curt Monash. You can read his full post, titled "Database management system choices - 4 categories of relational," <a href="http://www.dbms2.com/2008/02/15/relational-database-management-categories/">here</a>. In this post, I will discuss the issue I have with Curt's category characterization of DBMS systems.<br /><br />First, I see two categories of relational analytic/data warehouse databases, row stores and column stores. They have very different characteristics. I would not lump them together, as this post does. Moreover, I expect the overwhelming majority of analytic data management workloads to move to column stores over time as these products become more mature because of the overwhelming performance advantage they offer on most analytic workloads.<br /><br />I don't know what competitive challenge to current high-end OLTP vendors Curt has in mind; however, I will offer my own. If performance is not a big issue, then current open-source relational DBMSs work quite well. As a result, I expect the "low end" to go to open source systems.<br /><br />On the other hand, if high performance is required, then I have shown in a recent paper (<a href="http://www.vldb2007.org/">2007 VLDB proceedings</a>) that current high-end relational engines can be beaten by a factor of 80 or so on TPC-C. This new collection of ideas may be leveragable into ultra-fast future commercial products that will challenge the current vendors at the high end. I think it is likely that the current vendors will be "caught in the middle."<br /><br />Lastly, most customers that I talk to are upset with the "out-of-box" experience of the current offerings from the high-end vendors. The products are hard to install, hard to tune, hard to learn, and just generally hard to use. If the products don't get much easier to use, then data administration costs will go to 100% sooner or later -- relegating these products to niche markets.<br /><br /> ]]>
        
    </content>
<feedburner:origLink>http://www.databasecolumn.com/2008/02/responding-to-monash-2.html</feedburner:origLink></entry>

<entry>
    <title>Responding to Monash's recent post on diversity of database systems</title>
    <link rel="alternate" type="text/html" href="http://feeds.feedburner.com/~r/dbcfeed/~3/236171832/responding-to-monash-1.html" />
    <id>tag:www.databasecolumn.com,2008://1.31</id>

    <published>2008-02-16T18:11:22Z</published>
    <updated>2008-02-16T18:39:10Z</updated>

    <summary>In this post, Mike Stonebraker comments on a post over at DBMS2 titled "Database management system choices - overview." Mike makes two points. First, he offers his list of the different types of DBMSs that he sees as viable. Second, he discusses OLTP and the shared nothing architecture.</summary>
    <author>
        <name>Michael Stonebraker</name>
        
    </author>
    
        <category term="Database architecture" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="Database innovation" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="datawarehouse" label="data warehouse" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="dbms" label="DBMS" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="stonebraker" label="Stonebraker" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[This week, Curt Monash published a post titled "Database management system choices - overview" on the <a href="http://www.dbms2.com/">DBMS2 blog</a> that makes the argument that in the database world, one size does not fit all. In response, I have one comment and one quibble (read the post in its entirety <a href="http://www.dbms2.com/2008/02/15/database-management-system-choices-overview/">here</a>).<br /><br /><br /><b>The comment: Many different kinds of DBMSs</b><br /><br />Curt's post leads to the obvious question: Just how many different kinds of viable DBMSs can we expect to see? I can imagine the following:<br /><br /><ol><li><b>OLTP DBMSs</b> focused on fast, reliable transaction processing<br /><b><br /></b></li><li><b>Analytic/Data Warehouse DBMSs</b> focused on efficient load and ad-hoc query performance<br /><br /></li><li><b>Science DBMSs</b> -- after all MatLab does not scale to disk-sized arrays<br /><br /></li><li><b>RDF stores</b> focused on efficiently storing semi-structured data in this format<br /><br /></li><li><b>XML stores</b> focused on semi-structured data in this format<br /><br /></li><li><b>Search engines</b> -- the big players all use proprietary engines in this area<br /><br /></li><li><b>Stream Processing Engines</b> focused on real-time StreamSQL<br /><br /></li><li><b>"Lean and Mean," less-than-a-database engines</b> focused on doing a small number of things very well (embedded databases are probably in this category)<br /><br /></li><li><b>MapReduce and Hadoop</b> -- after all Google has enough "throw weight" to define a category</li></ol><br /><br />I expect all of these to be architected differently, with the possible exception of RDF stores, which are efficiently supported on top of column stores and focused on the warehouse market.<br /><br /><br /><b>The quibble: OLTP demands shared nothing</b><br /><br />Every high-end OLTP application is currently requiring 7 x 24 x 365 x 10 years of availability. That is, the database has only one state, which is "up."&nbsp; Hence, high availability -- in the face of crashes as well as disasters -- is a requirement. Disaster recovery requires replication over a wide area network; recovery from crashes, requires LAN-based replication. As such, every OLTP system is, in fact, deployed over a shared-nothing architecture, encompassing both LAN and WAN networking.<br /><br /> ]]>
        
    </content>
<feedburner:origLink>http://www.databasecolumn.com/2008/02/responding-to-monash-1.html</feedburner:origLink></entry>

<entry>
    <title>INSERT performance in column stores</title>
    <link rel="alternate" type="text/html" href="http://feeds.feedburner.com/~r/dbcfeed/~3/230350442/insert-performance.html" />
    <id>tag:www.databasecolumn.com,2008://1.30</id>

    <published>2008-02-06T14:50:51Z</published>
    <updated>2008-03-03T16:13:43Z</updated>

    <summary>In this post, Stan Zdonik examines the issue of INSERT performance in column stores. By implementing certain strategies, he notes that it is possible to have a column store with INSERT performance that is at least competitive in performance with that of the major row stores.</summary>
    <author>
        <name>Stan Zdonik</name>
        
    </author>
    
        <category term="Database architecture" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="columnstores" label="column stores" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="insert" label="INSERT" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="zdonik" label="Zdonik" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[The most common question I am asked about column stores is, "Isn't INSERT performance poor?" The rationale for this question stems from the fact that in a column store a new tuple must be 1) split into its component column values, and 2) each such value must then be written to a different place (file). This would seemingly result in writing a large number of different disk blocks for every insertion. Furthermore, if the physical representation of the column is sorted and compressed, preserving this will only add to the overhead of an INSERT. While there is some truth to this line of reasoning, the problem can be overcome with the proper implementation.<br /><br /><b><br />Overcoming the INSERT performance penalty</b><br /><br />One approach to significantly mitigating the performance problem is to batch INSERTs and to perform the sorts, compression, and disk writes in large groups. By doing this, existing data is kept on disk in its sorted and compressed form and new tuples are batched in a separate memory space or cache. This cache is maintained as columns in insertion order. Periodically, an asynchronous process runs and writes a batch of tuples, merging them into the disk-based storage system. In this way, the performance cost of sorting and compression is shared and amortized across many tuples. The expected number of disk writes per INSERT will also decrease as the batch size grows. Further, if the insertion-order cache is stored in main memory, this structure can be quickly scanned.<br /><br />A fair question to ask is whether this approach would mean that the answers to queries would be stale. The answer is no ... if the query evaluator looks in both places (the disk and the cache). To do so would require the query optimizer to generate two plans since the data is structured differently in each location, but the extra query planning work is worth the trouble because of the boost in INSERT performance.<br /><br />It should also be pointed out that column stores can partition tables across a collection of shared-nothing nodes in a cluster. If INSERTs are randomly distributed on the partitioning key, then the load introduced by high INSERT rates is distributed evenly across the cluster. If the INSERT rate grows, more nodes can be added to cope with the increase.<br /><br /><br /><b>A note about ACID</b><br /><br />Of course, in order to support <a href="http://en.wikipedia.org/wiki/ACID">ACID</a> transactions in this setting, there must be a safe way to allow committed data to reside in main memory. This can be accomplished by keeping redundant copies in multiple distributed main memories. In general, one can achieve k-safety, where k is the number of nodes that can fail without losing any work, by keeping data copies on k+1 different machines. All INSERTs will be sent to all k+1 relevant sites and stored in their main memory caches. Once all these copies are installed, the tuple is stable (subject to the k-safety constraints).<br /><br /><br /><b>INSERT Performance Benchmarks</b><br /><br />By implementing all these strategies, it is possible to have a column store with INSERT performance that is at least competitive in performance with that of the major row stores. In fact, in many cases, benchmarks have shown that load performance for a column store is typically better than that of a row store.<br /><br /> ]]>
        
    </content>
<feedburner:origLink>http://www.databasecolumn.com/2008/02/insert-performance.html</feedburner:origLink></entry>

<entry>
    <title>MapReduce II</title>
    <link rel="alternate" type="text/html" href="http://feeds.feedburner.com/~r/dbcfeed/~3/223143860/mapreduce-continued.html" />
    <id>tag:www.databasecolumn.com,2008://1.29</id>

    <published>2008-01-25T19:56:04Z</published>
    <updated>2008-02-18T19:38:54Z</updated>

    <summary>In this follow up post, David DeWitt and Michael Stonebraker discuss the feedback from their previous post on MapReduce. They focus on four criticisms of their first article: 1) that MapReduce is not a database system and should not be judged as one; 2) that MapReduce has excellent scalability, demonstrated by Google's use; 3) that MapReduce is cheap compared to high-end DBMS solutions; 4) and that their stance was the result of DBMS "gray beards" trying to defend their turf/legacy from the MapReduce "young turks."</summary>
    <author>
        <name>David DeWitt</name>
        
    </author>
    
        <category term="Database architecture" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="Database innovation" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="dbms" label="DBMS" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="dewitt" label="DeWitt" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="mapreduce" label="MapReduce" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="stonebraker" label="Stonebraker" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[<i>[Note: Although the system attributes this post to a single author, it was written by David J. DeWitt and Michael Stonebraker]</i><br /><br /><a href="http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html">Last week's MapReduce post</a> attracted tens of thousands of readers and generated many comments, almost all of them attacking our critique. Just to let you know, we don't hold a personal grudge against MapReduce. MapReduce didn't kill our dog, steal our car, or try and date our daughters. <br /><br />Our motivations for writing about MapReduce stem from MapReduce being increasingly seen as the most advanced and/or only way to analyze massive datasets. Advocates promote the tool without seemingly paying attention to years of academic and commercial database research and real world use. <br /><br />The point of our initial post was to say that there are striking similarities between MapReduce and a fairly primitive parallel database system. As such, MapReduce can be significantly improved by learning from the parallel database community.<br /><br />So, hold off on your comments for just a few minutes, as we will spend the rest of this post addressing four specific topics brought up repeatedly by those who commented on our previous blog: <br /><br /><ol><li>MapReduce is not a database system, so don't judge it as one<br /><br /></li><li>MapReduce has excellent scalability; the proof is Google's use<br /><br /></li><li>MapReduce is cheap and databases are expensive<br /><br /></li><li>We are the old guard trying to defend our turf/legacy from the young turks</li></ol><br /><br /><b>Feedback No. 1: MapReduce is not a database system, so don't judge it as one</b><br /><br />It's not that we don't understand this viewpoint. We are not claiming that MapReduce is a database system. What we are saying is that like a DBMS + SQL + analysis tools, MapReduce can be and is being used to analyze and perform computations on massive datasets. So we aren't judging apples and oranges. We are judging two approaches to analyzing massive amounts of information, even for less structured information. <br /><br />To illustrate our point, assume that you have two very large files of facts. The first file contains structured records of the form: <br /><br /><blockquote>Rankings (pageURL, pageRank)<br /></blockquote><br />Records in the second file have the form: <br /><br /><blockquote>UserVisits (sourceIPAddr, destinationURL, date, adRevenue)<br /></blockquote><br />Someone might ask, "What IP address generated the most ad revenue during the week of January 15th to the 22nd, and what was the average page rank of the pages visited?"<br /><br />This question is a little tricky to answer in MapReduce because it consumes two data sets rather than one, and it requires a "join" of the two datasets to find pairs of Ranking and UserVisit records that have matching values for pageURL and destinationURL. In fact, it appears to require three MapReduce phases, as noted below.<br /><br /><blockquote><b>Phase 1</b><br /><br />This phase filters UserVisits records that are outside the desired data range and then "joins" the qualifying records with records from the Rankings file. <br /><br /><ul><li><b>Map program:</b> The map program scans through UserVisits and Rankings records. Each UserVisit record is filtered on the date range specification. Qualifying records are emitted with composite keys of the form &lt;destinationURL, T1 &gt; where T1 indicates that it is a UserVisits record. Rankings records are emitted with composite keys of the form &lt;pageURL, T2 &gt;&nbsp; (T2 is a tag indicating it a Rankings record). Output records are repartitioned using a user-supplied partitioning function that only hashes on the URL portion of the composite key.<br /><br /></li><li><b>Reduce Program: </b>The input to the reduce program is a single sorted run of records in URL order. For each unique URL, the program splits the incoming records into two sets (one for Rankings records and one for UserVisits records) using the tag component of the composite key. To complete the join, reduce finds all matching pairs of records of the two sets. Output records are in the form of Temp1 (sourceIPAddr, pageURL, pageRank, adRevenue).&nbsp; </li></ul><br />The reduce program must be capable of handling the case in which one or both of these sets with the same URL are too large to fit into memory and must be materialized on disk. Since access to these sets is through an iterator, a straightforward implementation will result in what is termed a nested-loops join. This join algorithm is known to have very bad performance I/O characteristics as "inner" set is scanned once for each record of the "outer" set.<br /><br /><br /><b>Phase 2</b><br /><br />This phase computes the total ad revenue and average page rank for each Source IP Address.<br /><b><br /></b><ul><li><b>Map program:</b> Scan Temp1 using the identity function on sourceIPAddr.<br /><br /></li><li><b>Reduce program:</b> The reduce program makes a linear pass over the data. For each sourceIPAddr, it will sum the ad-revenue and compute the average page rank, retaining the one with the maximum total ad revenue. Each reduce worker then outputs a single record of the form Temp2 (sourceIPAddr,&nbsp; total_adRevenue, average_pageRank).</li></ul><br /><b>Phase 3</b><br /><br /><ul><li><b>Map program:</b> The program uses a single map worker that scans Temp2 and outputs the record with the maximum value for total_adRevenue. </li></ul></blockquote><br />We realize that portions of the processing steps described above are handled automatically by the MapReduce infrastructure (e.g., sorting and partitioning the records). Although we have not written this program, we estimate that the custom parts of the code (i.e., the map() and reduce() functions) would require substantially more code than the two fairly simple SQL statements to do the same:<br /><br /><blockquote><b>Q1</b><br /><br />Select as Temp&nbsp; sourceIPAddr, avg(pageRank) as avgPR, sum(adRevenue) as adTotal<br />From Rankings, UserVisits <br />where Rankings.pageURL = UserVisits.destinationURL and<br />date &gt; "Jan 14" and date &lt; "Jan 23" <br />Group by sourceIPAddr<br /><br /><br /><b>Q2</b><br /><br />Select sourceIPAddr, adTotal, avgPR<br />From Temp<br />Where adTotal = max (adTotal)<br /></blockquote><br />No matter what you think of SQL, eight lines of code is almost certainly easier to write and debug than the programming required for MapReduce. We believe that MapReduce advocates should consider the advantages that layering a high-level language like SQL could provide to users of MapReduce. Apparently we're not alone in this assessment, as efforts such as PigLatin and Sawzall appear to be promising steps in this direction. <br /><br />We also firmly believe that augmenting the input files with a schema would provide the basis for improving the overall performance of MapReduce applications by allowing B-trees to be created on the input data sets and techniques like hash partitioning to be applied. These are technologies in widespread practice in today's parallel DBMSs, of which there are quite a number on the market, including ones from IBM, Teradata, Netezza, Greenplum, Oracle, and Vertica. All of these should be able to execute this program with the same or better scalability and performance of MapReduce.<br /><br />Here's how these capabilities could benefit MapReduce:<br /><br /><blockquote><ol><li><b>Indexing.</b> The filter (date &gt; "Jan 14" and date &lt; "Jan 23") condition can be executed by using a B-tree index on the date attribute of the UserVisits table, avoiding a sequential scan of the entire table.<br /><br /> </li><li><b>Data movement.</b> When you load files into a distributed file system prior to running MapReduce, data items are typically assigned to blocks/partitions in sequential order. As records are loaded into a table in a parallel database system, it is standard practice to apply a hash function to an attribute value to determine which node the record should be stored on (the same basic idea as is used to determine which reduce worker should get an output record from a map instance). For example, records being loaded into the Rankings and UserVisits tables might be mapped to a node by hashing on the pageURL and destinationURL attributes, respectively. If loaded this way, the join of Rankings and UserVisits in Q1 above would be performed completely locally <i>with absolutely no data movement between nodes</i>. Furthermore, as result records from the join are materialized, they will be pipelined directly into a local aggregate computation without being written first to disk. This local aggregate operator will partially compute the two aggregates (sum and average) concurrently (what is called a combiner in MapReduce terminology). These partial aggregates are then repartitioned by hashing on this sourceIPAddr to produce the final results for Q1.<br /><br />It is certainly the case that you could do the same thing in MapReduce by using hashing to map records to chunks of the file and then modifying the MapReduce program to exploit the knowledge of how the data was loaded. But in a database, physical data independence happens automatically. When Q1 is "compiled," the query optimizer will extract partitioning information about the two tables from the schema.&nbsp; It will then generate the correct query plan based on this partitioning information (e.g., maybe Rankings is hash partitioned on pageURL but UserVisits is hash partitioned on sourceIPAddr). This happens transparently to any user (modulo changes in response time) who submits a query involving a join of the two tables. <br /><b><br /></b> </li><li><b>Column representation.</b> Many questions access only a subset of the fields of the input files. The others do not need to be read by a column store.<br /><br /> </li><li><b>Push, not pull.</b> MapReduce relies on the materialization of the output files from the map phase on disk for fault tolerance. Parallel database systems push the intermediate files directly to the receiving (i.e., reduce) nodes, avoiding writing the intermediate results and then reading them back as they are pulled by the reduce computation. This provides MapReduce far superior fault tolerance at the expense of additional I/Os.&nbsp; </li></ol></blockquote><br />In general, we expect these mechanisms to provide about a factor of 10 to 100 performance advantage, depending on the selectivity of the query, the width of the input records to the map computation, and the size of the output files from the map phase. As such, we believe that 10 to 100 parallel database nodes can do the work of 1,000 MapReduce nodes. <br /><br />To further illustrate out point, suppose you have a more general filter, F, a more general group_by function, G, and a more general Reduce function, R. PostgreSQL (an open source, free DBMS) allows the following SQL query over a table T:<br /><br /><blockquote>Select R (T)<br />From T<br />Group_by G (T)<br />Where F (T)<br /></blockquote><br />F, R, and G can be written in a general-purpose language like C or C++. A SQL engine, extended with user-defined functions and aggregates, has nearly -- if not all -- of the generality of MapReduce.&nbsp;&nbsp; <br /><br />As such, we claim that <i>most things that are possible in MapReduce are also possible in a SQL engine</i>. Hence, it is exactly appropriate to compare the two approaches. We are working on a more complete paper that demonstrates the relative performance and relative programming effort between the two approaches, so, stay tuned.&nbsp;&nbsp; <br /><br /><b><br />Feedback No. 2: MapReduce has excellent scalability; the proof is Google's use</b><br /><br />Many readers took offense at our comment about scaling and asserted that since Google runs MapReduce programs on 1,000s (perhaps 10s of 1,000s) of nodes it must scale well. Having started benchmarking database systems 25 years ago (yes, in 1983), we believe in a more scientific approach toward evaluating the scalability of any system for data intensive applications.<br /><br />Consider the following scenario. Assume that you have a 1 TB data set that has been partitioned across 100 nodes of a cluster (each node will have about 10 GB of data). Further assume that some MapReduce computation runs in 5 minutes if 100 nodes are used for both the map and reduce phases. Now scale the dataset to 10 TB, partition it over 1,000 nodes, and run the same MapReduce computation using those 1,000 nodes. If the performance of MapReduce scales linearly, it will execute the same computation on 10x the amount of data using 10x more hardware in the same 5 minutes. <i>Linear scaleup is the gold standard for measuring the scalability of data intensive applications</i>. As far as we are aware there are no published papers that study the scalability of MapReduce in a controlled scientific fashion. MapReduce may indeed scale linearly, but we have not seen published evidence of this.&nbsp;&nbsp;&nbsp; <br /><br /><b><br />Feedback No. 3: MapReduce is cheap and databases are expensive</b><br /><br />Every organization has a "build" versus "buy" decision, and we don't question the decision by Google to roll its own data analysis solution. We also don't intend to defend DBMS pricing by the commercial vendors. What we wanted to point out is that we believe it is possible to build a version of MapReduce with more functionality and better performance. Pig is an excellent step in this direction. <br /><br />Also, we want to mention that there are several open source (i.e., free) DBMSs, including PostgreSQL, MySQL, Ingres, and BerkeleyDB. Several of the aforementioned parallel DBMS companies have increased the scale of these open source systems by adding parallel computing extensions.<br /><br />A number of individuals also commented that SQL and the relational data model are too restrictive. Indeed, the relational data model might very well be the wrong data model for the types of datasets that MapReduce applications are targeting. However, there is considerable ground between the relational data model and no data model at all. The point we were trying to make is that developers writing business applications have benefited significantly from the notion of organizing data in the database according to a data model and accessing that data through a declarative query language. We don't care what that language or model is. Pig, for example, employs a nested relational model, which gives developers more flexibility that a traditional 1NF doesn't allow.<br /><br /><br /><b>Feedback No. 4: We are the old guard trying to defend our turf/legacy from the young turks</b><br /><br />Since both of us are among the "gray beards" and have been on this earth about 2 Giga-seconds, we have seen a lot of ideas come and go. We are constantly struck by the following two observations:<br /><br /><ul><li><b>How insular computer science is.</b> The propagation of ideas from sub-discipline to sub-discipline is very slow and sketchy. Most of us are content to do our own thing, rather than learn what other sub-disciplines have to offer.<br /><br /></li><li><b>How little knowledge is passed from generation to generation.</b> In a recent paper entitled "What goes around comes around," (M. Stonebraker/J. Hellerstein, Readings in Database Systems 4th edition, MIT Press, 2004) one of us noted that many current database ideas were tried a quarter of a century ago and discarded. However, such pragma does not seem to be passed down from the "gray beards" to the "young turks."&nbsp; The turks and gray beards aren't usually and shouldn't be adversaries. </li></ul><br />Thanks for stopping by the "pasture" and reading this post. We look forward to reading your feedback, comments and alternative viewpoints.<br /><br /> ]]>
        
    </content>
<feedburner:origLink>http://www.databasecolumn.com/2008/01/mapreduce-continued.html</feedburner:origLink></entry>

<entry>
    <title>MapReduce: A major step backwards</title>
    <link rel="alternate" type="text/html" href="http://feeds.feedburner.com/~r/dbcfeed/~3/218359834/mapreduce-a-major-step-back.html" />
    <id>tag:www.databasecolumn.com,2008://1.28</id>

    <published>2008-01-17T21:20:43Z</published>
    <updated>2008-02-18T03:36:25Z</updated>

    <summary>In this post, David DeWitt and Michael Stonebraker discuss MapReduce. While it may be a good idea for writing certain types of general-purpose computations, they believe it is a giant step backward in the programming paradigm for large-scale data intensive applications; a sub-optimal implementation, in that it uses brute force instead of indexing; not novel, as it represents a specific implementation of well known techniques developed nearly 25 years ago; missing most of the features that are routinely included in current DBMS; and incompatible with all of the tools DBMS users have come to depend on.</summary>
    <author>
        <name>David DeWitt</name>
        
    </author>
    
        <category term="Database architecture" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="Database history" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="Database innovation" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="databaseperformance" label="database performance" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="dewitt" label="DeWitt" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="mapreduce" label="MapReduce" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="stonebraker" label="Stonebraker" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[<i>[Note: Although the system attributes this post to a single author, it was written by David J. DeWitt and Michael Stonebraker]</i><br /><br />On January 8, a Database Column reader asked for our views on new distributed database research efforts, and we'll begin here with our views on <a href="http://en.wikipedia.org/wiki/MapReduce">MapReduce</a>. This is a good time to discuss it, since the recent trade press has been filled with news of the revolution of so-called "cloud computing." This paradigm entails harnessing large numbers of (low-end) processors working in parallel to solve a computing problem. In effect, this suggests constructing a data center by lining up a large number of "jelly beans" rather than utilizing a much smaller number of high-end servers.<br /><br />For example, IBM and Google have announced plans to make a 1,000 processor cluster available to a few select universities to teach students how to program such clusters using a software tool called MapReduce [1]. Berkeley has gone so far as to plan on teaching their freshman how to program using the MapReduce framework.<br /><br />As both educators and researchers, we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications. MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is:<br /><br /><ol><li>A giant step backward in the programming paradigm for large-scale data intensive applications<br /><br /></li><li>A sub-optimal implementation, in that it uses brute force instead of indexing<br /><br /></li><li>Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago<br /><br /></li><li>Missing most of the features that are routinely included in current DBMS<br /><br /></li><li>Incompatible with all of the tools DBMS users have come to depend on<br /></li></ol><br />First, we will briefly discuss what MapReduce is; then we will go into more detail about our five reactions listed above.<br /><br /><br /><b>What is MapReduce?</b><br /><br />The basic idea of MapReduce is straightforward. It consists of two programs that the user writes called <i>map</i> and <i>reduce</i> plus a framework for executing a possibly large number of instances of each program on a compute cluster.&nbsp;&nbsp; <br /><br />The map program reads a set of "records" from an input file, does any desired filtering and/or transformations, and then outputs a set of records of the form (key, data). As the map program produces output records, a "split" function partitions the records into <i>M</i> disjoint buckets by applying a function to the key of each output record.&nbsp;&nbsp; This split function is typically a hash function, though any deterministic function will suffice. When a bucket fills, it is written to disk. The map program terminates with <i>M</i> output files, one for each bucket.<br /><br />In general, there are multiple instances of the map program running on different nodes of a compute cluster. Each map instance is given a distinct portion of the input file by the MapReduce scheduler to process. If <i>N</i> nodes participate in the map phase, then there are <i>M</i> files on disk storage at each of <i>N</i> nodes, for a total of <i>N</i> * <i>M</i> files; <i>F<sub>i,j</sub></i>,&nbsp; 1 ≤ <i>i</i> ≤ <i>N</i>,&nbsp; 1 ≤ <i>j</i> ≤ <i>M</i>.<br /><br />The key thing to observe is that all map instances use the same hash function. Hence, all output records with the same hash value will be in corresponding output files.&nbsp; <br /><br />The second phase of a MapReduce job executes <i>M</i> instances of the reduce program, <i>R<sub>j</sub></i>, 1 ≤ <i>j</i> ≤ <i>M</i>.&nbsp; The input for each reduce instance <i>R<sub>j</sub></i> consists of the files <i>F<sub>i,j</sub></i>,&nbsp; 1 ≤ <i>i</i> ≤ <i>N</i>.&nbsp; Again notice that all output records from the map phase with the same hash value will be consumed by the same reduce instance -- no matter which map instance produced them. After being collected by the map-reduce framework, the input records to a reduce instance are grouped on their keys (by sorting or hashing) and feed to the reduce program. Like the map program, the reduce program is an arbitrary computation in a general-purpose language. Hence, it can do anything it wants with its records. For example, it might compute some additional function over other data fields in the record. Each reduce instance can write records to an output file, which forms part of the "answer" to a MapReduce computation.<br /><br />To draw an analogy to SQL, map is like the <i>group-by</i> clause of an aggregate query. Reduce is analogous to the <i>aggregate</i> function (e.g., average) that is computed over all the rows with the same group-by attribute.<br /><br />We now turn to the five concerns we have with this computing paradigm.<br /><br /><br /><b>1. MapReduce is a step backwards in database access</b><br /><br />As a data processing paradigm, MapReduce represents a giant step backwards. The database community has learned the following three lessons from the 40 years that have unfolded since IBM first released IMS in 1968.<br /><br /><ul><li>Schemas are good.<br /><br /></li><li>Separation of the schema from the application is good.<br /><br /></li><li>High-level access languages are good.<br /></li></ul><br />MapReduce has learned none of these lessons and represents a throw back to the 1960s, before modern DBMSs were invented.<br /><br />The DBMS community learned the importance of schemas, whereby the fields and their data types are recorded in storage. More importantly, the run-time system of the DBMS can ensure that input records obey this schema. This is the best way to keep an application from adding "garbage" to a data set. MapReduce has no such functionality, and there are no controls to keep garbage out of its data sets. A corrupted MapReduce dataset can actually silently break all the MapReduce applications that use that dataset.<br /><br />It is also crucial to separate the schema from the application program. If a programmer wants to write a new application against a data set, he or she must discover the record structure. In modern DBMSs, the schema is stored in a collection of system catalogs and can be queried (in SQL) by any user to uncover such structure. In contrast, when the schema does not exist or is buried in an application program, the programmer must discover the structure by an examination of the code. Not only is this a very tedious exercise, but also the programmer must find the source code for the application. This latter tedium is forced onto every MapReduce programmer, since there are no system catalogs recording the structure of records -- if any such structure exists.<br /><br />During the 1970s the DBMS community engaged in a "great debate" between the relational advocates and the Codasyl advocates. One of the key issues was whether a DBMS access program should be written:<br /><br /><ul><li>By stating what you want - rather than presenting an algorithm for how to get it (relational view)<br /><br /></li><li>By presenting an algorithm for data access (Codasyl view)<br /></li></ul><br />The result is now ancient history, but the entire world saw the value of high-level languages and relational systems prevailed. Programs in high-level languages are easier to write, easier to modify, and easier for a new person to understand. Codasyl was rightly criticized for being "the assembly language of DBMS access." A MapReduce programmer is analogous to a Codasyl programmer -- he or she is writing in a low-level language performing low-level record manipulation. Nobody advocates returning to assembly language; similarly nobody should be forced to program in MapReduce.<br /><br />MapReduce advocates might counter this argument by claiming that the datasets they are targeting have no schema. We dismiss this assertion. In extracting a key from the input data set, the map function is relying on the existence of at least one data field in each input record. The same holds for a reduce function that computes some value from the records it receives to process.&nbsp;&nbsp; <br /><br />Writing MapReduce applications on top of Google's BigTable (or Hadoop's HBase) does not really change the situation significantly. By using a self-describing tuple format (row key, column name, {values}) different tuples within the same table can actually have different schemas. In addition, BigTable and HBase do not provide logical independence, for example with a view mechanism. Views significantly simplify keeping applications running when the logical schema changes.<br /><br /><b><br />2. MapReduce is a poor implementation</b><br /><br />All modern DBMSs use hash or B-tree indexes to accelerate access to data. If one is looking for a subset of the records (e.g., those employees with a salary of 10,000 or those in the shoe department), then one can often use an index to advantage to cut down the scope of the search by one to two orders of magnitude. In addition, there is a query optimizer to decide whether to use an index or perform a brute-force sequential search.<br /><br />MapReduce has no indexes and therefore has only brute force as a processing option. It will be creamed whenever an index is the better access mechanism.<br /><br />One could argue that value of MapReduce is automatically providing parallel execution on a grid of computers. This feature was explored by the DBMS research community in the 1980s, and multiple prototypes were built including Gamma [2,3],&nbsp; Bubba [4], and Grace [5]. Commercialization of these ideas occurred in the late 1980s with systems such as Teradata.&nbsp; <br /><br />In summary to this first point, there have been high-performance, commercial, grid-oriented SQL engines (with schemas and indexing) for the past 20 years. MapReduce does not fare well when compared with such systems.&nbsp; <br /><br />There are also some lower-level implementation issues with MapReduce, specifically skew and data interchange.<br /><br />One factor that MapReduce advocates seem to have overlooked is the issue of skew. As described in "Parallel Database System: The Future of High Performance Database Systems," [6] skew is a huge impediment to achieving successful scale-up in parallel query systems. The problem occurs in the map phase when there is wide variance in the distribution of records with the same key. This variance, in turn, causes some reduce instances to take much longer to run than others, resulting in the execution time for the computation being the running time of the slowest reduce instance. The parallel database community has studied this problem extensively and has developed solutions that the MapReduce community might want to adopt.<br /><br />There is a second serious performance problem that gets glossed over by the MapReduce proponents. Recall that each of the <i>N</i> map instances produces <i>M</i> output files -- each destined for a different reduce instance. These files are written to a disk local to the computer used to run the map instance. If <i>N</i> is 1,000 and <i>M</i> is 500, the map phase produces 500,000 local files. When the reduce phase starts, each of the 500 reduce instances needs to read its 1,000 input files and must use a protocol like FTP to "pull" each of its input files from the nodes on which the map instances were run. With 100s of reduce instances running simultaneously, it is inevitable that two or more reduce instances will attempt to read their input files from the same map node simultaneously -- inducing large numbers of disk seeks and slowing the effective disk transfer rate by more than a factor of 20. This is why parallel database systems do not materialize their split files and use push (to sockets) instead of pull. Since much of the excellent fault-tolerance that MapReduce obtains depends on materializing its split files, it is not clear whether the MapReduce framework could be successfully modified to use the push paradigm instead.<br /><br />Given the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale. Moreover, the MapReduce implementers would do well to study the last 25 years of parallel DBMS research literature.<br /><br /><br /><b>3. MapReduce is not novel</b><br /><br />The MapReduce community seems to feel that they have discovered an entirely new paradigm for processing large data sets. In actuality, the techniques employed by MapReduce are more than 20 years old. The idea of partitioning a large data set into smaller partitions was first proposed <span>in "Application of Hash to Data Base Machine and Its Architecture" [11]</span> as the basis for a new type of join algorithm. In "Multiprocessor Hash-Based Join Algorithms," [7], Gerber demonstrated how Kitsuregawa's techniques could be extended to execute joins in parallel on a shared-nothing [8] cluster using a combination of partitioned tables, partitioned execution, and hash based splitting. DeWitt [2] showed how these techniques could be adopted to execute aggregates with and without group by clauses in parallel. DeWitt and Gray [6] described parallel database systems and how they process queries. Shatdal and Naughton [9] explored alternative strategies for executing aggregates in parallel.&nbsp;&nbsp; <br /><br />Teradata has been selling a commercial DBMS utilizing all of these techniques for more than 20 years; exactly the techniques that the MapReduce crowd claims to have invented.&nbsp;&nbsp; <br /><br />While MapReduce advocates will undoubtedly assert that being able to write MapReduce functions is what differentiates their software from a parallel SQL implementation, we would remind them that POSTGRES supported user-defined functions and user-defined aggregates in the mid 1980s. Essentially, all modern database systems have provided such functionality for quite a while, starting with the Illustra engine around 1995.&nbsp; <br /><br /><br /><b>4.&nbsp; MapReduce is missing features</b><br /><br />All of the following features are routinely provided by modern DBMSs, and all are missing from MapReduce:<br /><br /><ul><li><b>Bulk loader</b> -- to transform input data in files into a desired format and load it into a DBMS<br /><br /></li><li><b>Indexing</b> -- as noted above<br /><br /></li><li><b>Updates</b> -- to change the data in the data base<br /><br /></li><li><b>Transactions</b> -- to support parallel update and recovery from failures during update<br /><br /></li><li><b>Integrity constraints</b> -- to help keep garbage out of the data base<br /><br /></li><li><b>Referential integrity</b> -- again, to help keep garbage out of the data base<br /><br /></li><li><b>Views</b> -- so the schema can change without having to rewrite the application program<br /></li></ul><br />In summary, MapReduce provides only a sliver of the functionality found in modern DBMSs.<br /><br /><b><br />5.&nbsp; MapReduce is incompatible with the DBMS tools</b> <br /><br />A modern SQL DBMS has available all of the following classes of tools:<br /><br /><ul><li><b>Report writers</b> (e.g., Crystal reports) to prepare reports for human visualization<br /><br /></li><li><b>Business intelligence tools</b> (e.g., Business Objects or Cognos) to enable ad-hoc querying of large data warehouses<br /><br /></li><li><b>Data mining tools</b> (e.g., Oracle Data Mining or IBM DB2 Intelligent Miner) to allow a user to discover structure in large data sets<br /><br /></li><li><b>Replication tools</b> (e.g., Golden Gate) to allow a user to replicate data from on DBMS to another<br /><br /></li><li><b>Database design tools</b> (e.g., Embarcadero) to assist the user in constructing a data base.<br /></li></ul><br />MapReduce cannot use these tools and has none of its own. Until it becomes SQL-compatible or until someone writes all of these tools, MapReduce will remain very difficult to use in an end-to-end task.<br /><br /><br /><b>In Summary</b><br /><br />It is exciting to see a much larger community engaged in the design and implementation of scalable query processing techniques. We, however, assert that they should not overlook the lessons of more than 40 years of database technology -- in particular the many advantages that a data model, physical and logical data independence, and a declarative query language, such as SQL, bring to the design, implementation, and maintenance of application programs. Moreover, computer science communities tend to be insular and do not read the literature of other communities. We would encourage the wider community to examine the parallel DBMS literature of the last 25 years. Last, before MapReduce can measure up to modern DBMSs, there is a large collection of unmet features and required tools that must be added.<br /><br />We fully understand that database systems are not without their problems. The database community recognizes that database systems are too "hard" to use and is working to solve this problem. The database community can also learn something valuable from the excellent fault-tolerance that MapReduce provides its applications. Finally we note that some database researchers are beginning to explore using the MapReduce framework as the basis for building scalable database systems. The Pig[10] project at Yahoo! Research is one such effort.<br /><br /><br /><b><br />References</b> <br /><br />[1] "MapReduce:&nbsp; Simplified Data Processing on Large Clusters," Jeff Dean and Sanjay Ghemawat, Proceedings of the 2004 OSDI Conference, 2004.<br /><br />[2] "The Gamma Database Machine Project," DeWitt, et. al., IEEE Transactions on Knowledge and Data Engineering, Vol. 2, No. 1, March 1990.<br /><br />[4] "Gamma - A High Performance Dataflow Database Machine,"&nbsp; DeWitt, D, R. Gerber, G. Graefe,&nbsp; M. Heytens, K. Kumar, and M. Muralikrishna,&nbsp; Proceedings of the 1986 VLDB Conference,&nbsp; 1986.<br /><br />[5] "Prototyping Bubba, A Highly Parallel Database System," Boral, et. al., IEEE Transactions on Knowledge and Data Engineering,Vol. 2, No. 1, March 1990.<br /><br />[6] "Parallel Database System: The Future of High Performance Database Systems," David J. DeWitt and Jim Gray,&nbsp; CACM,&nbsp; Vol. 35, No. 6,&nbsp; June 1992.<br /><br />[7] "Multiprocessor Hash-Based Join Algorithms," David J. DeWitt and&nbsp; Robert H. Gerber,&nbsp; Proceedings of the 1985 VLDB Conference, 1985.<br /><br />[8] "The Case for Shared-Nothing," Michael Stonebraker,&nbsp; Data Engineering Bulletin, Vol. 9, No. 1, 1986.<br /><br />[9] "Adaptive Parallel Aggregation Algorithms," Ambuj Shatdal and Jeffrey F. Naughton,&nbsp;&nbsp; Proceedings of the 1995 SIGMOD Conference,&nbsp; 1995.<br /><br />[10] "Pig", Chris Olston, http://research.yahoo.com/project/90<br /><br /><span>[11] "Application of Hash to Data Base Machine and Its
Architecture," Masaru Kitsuregawa, Hidehiko Tanaka, Tohru Moto-Oka,
New Generation Comput. 1(1): 63-74 (1983)</span><br /><span></span><br />]]>
        
    </content>
<feedburner:origLink>http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html</feedburner:origLink></entry>

<entry>
    <title>Relational databases for storing and querying RDF</title>
    <link rel="alternate" type="text/html" href="http://feeds.feedburner.com/~r/dbcfeed/~3/214487853/databases-and-rdf.html" />
    <id>tag:www.databasecolumn.com,2008://1.27</id>

    <published>2008-01-09T22:29:21Z</published>
    <updated>2008-03-03T16:15:09Z</updated>

    <summary>The Resource Description Format (RDF) is a way to describe information about relationships between entities and objects. It was originally developed by the W3C as a way to describe information about resources on the Web. It is intended to be the data model used in the Semantic Web, where web pages contain not just text but also structured records describing the data they contain and the relationships in that data. In this post, Sam Madden and Daniel Abadi discuss RDF and database issues.</summary>
    <author>
        <name>Sam Madden</name>
        
    </author>
    
        <category term="Database miscellaneous" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="abadi" label="Abadi" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="madden" label="Madden" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="rdf" label="RDF" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[The Resource Description Format (RDF) is a way to describe information about relationships between entities and objects. It was originally developed by the W3C as a way to describe information about resources on the Web. It is intended to be the data model used in the <a href="http://www.w3.org/2001/sw/">Semantic Web</a>, where web pages contain not just text but also structured records describing the data they contain and the relationships in that data.<br /><br />RDF has seen widespread adoption in recent years. For example, the entire MIT library catalog is available in RDF format. More recently, a number of biology researchers have begun to publish their data in RDF, including the <a href="http://dev.isb-sib.ch/projects/uniprot-rdf/">UniProt</a> comprehensive catalog of protein sequence, function, and annotation data.<br /><br /><br /><b>Understanding RDF</b><br /><br />An RDF document consists of a collection of statements of the form subject-property-object. For example, a library database that stores data about authors and books might have statement triples like "User1 has-name 'Sam Madden'", "User1 is-an Author", "User1 wrote Book1", "Book1 is-a Book", "Book1 has-title 'Who ate my cheese?'", etc., as shown in the "Triples Representation" on the top of the figure below.<br /><br /><span class="mt-enclosure mt-enclosure-image"><img alt="rdf_table.jpg" src="http://www.databasecolumn.com/images/2008/rdf_table.jpg" class="mt-image-center" style="margin: 0pt auto 20px; text-align: center; display: block;" height="378" width="469" /></span>It should be clear that an RDF document, containing a collection of triples about a group of resources, is a structured database that users may want to browse, search, or query in a number of ways. Building tools that make it possible do this efficiently is one of the goals of our research. In particular, we are interested in the performance of different on-disk storage representations for a collection of triples.<br /><br /><br /><b>Designing tools to handle RDF efficiently</b><br /><br />Our first attempts to do this have focused on leveraging relational database technology. The obvious relational representation of an RDF document is as a table with three columns, which would conventionally be stored as a series of 3-tuples laid out on disk in a row-major format. This representation, however, performs quite poorly for many types of queries. Suppose, for example, we want to find all the authors of the book "Who ate my cheese".&nbsp; We will first have to find the triple "bookM has-title 'Who ate my cheese'". We will then have to perform a self join with the triples table to find all of the triples of the form "personN wrote bookM'. Finally, for each author, we will have to perform another self join to find triples of the form 'personN has-name 'Sam Madden'". <br /><br />Hence, we have been looking at alternative representations that eliminate these self joins (we still expose a logical model of a collection of triples that the user queries, but we transform user queries to apply to our modified physical representation.) For example, one possible representation is to store a table where the first column contains the subject, and each additional column corresponds to a particular property. This representation is sometimes called a "property representation", as shown on the bottom of the figure above.&nbsp; Though this representation can have many NULL values if there are a variety of subjects with diverse properties defined, it has the advantage that all of the properties of a given object are now stored together.<br /><br />Our work in this area, "<a href="http://www.vldb.org/conf/2007/papers/research/p411-abadi.pdf">Scalable Semantic Web Data Management Using Vertical Partitioning</a>," appeared in the VLDB Conference in Vienna in September. It showed that using a column-oriented database, along with this property representation, allows us to overcome the overhead of representing NULLs, while providing two orders of magnitude better performance than the naive triples representation. This is particularly true when processing queries that must access many triples during execution (e.g., computing the number of books grouped by subject area or institution.) Of course, there is a fair amount of subtlety to getting good performance out of such a representation. Have a look at our conference paper for the details!<br /><br /><br /><b>Caveats for column- and row-store databases</b><br /><br />As we've discussed elsewhere in this blog, column-stores can perform worse than row-stores for certain classes of queries. In particular, for lookups of a single record (e.g., all of the information about a particular author), a row-oriented database (using a property representation) may outperform a column-oriented system. This is because it only has to seek to one location on disk to read the data from this record, whereas a column store will have to seek to each column to reconstruct the entire record. <br /><br />There are other situations where neither a row- nor column-oriented property representation is ideal. Imagine, for example, a user browsing an RDF-based Web site containing our library database. During browsing, suppose the user navigates from books or articles, to authors, to related books and articles, and so on. Such browsing queries in a property representation will lead to (slow) self-joins on the property table, just as they did in the triples table. Hence, a more sensible representation for a browsing-oriented database would be to store a given record R near to records the user is likely to navigate to from R. This is the topic of our current research in this area.<br /><br /><i>* Editors note: While this post will show up in the blog as written by Sam Madden, it has two authors: Samuel Madden (MIT) and Daniel Abadi (Yale)</i><br /> <div><br /></div><div><br /></div>]]>
        
    </content>
<feedburner:origLink>http://www.databasecolumn.com/2008/01/databases-and-rdf.html</feedburner:origLink></entry>

<entry>
    <title>The Database Column in 2008: Building on initial success</title>
    <link rel="alternate" type="text/html" href="http://feeds.feedburner.com/~r/dbcfeed/~3/213387216/database-column-in-2008.html" />
    <id>tag:www.databasecolumn.com,2008://1.26</id>

    <published>2008-01-08T20:06:15Z</published>
    <updated>2008-01-08T20:13:09Z</updated>

    <summary>With 2007 now in the books, all of us affiliated with the Database Column blog want to thank you for your readership and thoughtful commentary. There are many topics in the publishing queue, but we want to make sure we are covering topics that matter to readers. We encourage you to send us your questions, comments, and ideas for new topics.</summary>
    <author>
        <name>Admin</name>
        
    </author>
    
        <category term="About Database Column" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="vertica" label="Vertica" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[With 2007 now in the books, all of us affiliated with the Database Column blog want to thank you for your readership and thoughtful commentary. We launched the blog late last year with the goal of generating discussion around cutting-edge database issues, with interest driven by the posts of many of movers and shakers in the database community. <br /><br />We knew going into this venture that we had to be cognizant about the Vertica and column store influence of the blog, and some readers have provided feedback about their perception of the blog. Rest assured that while all of the main contributors are affiliated with Vertica, we spend a lot of time trying to ensure that the blog does not become a marketing mouthpiece for the company.<br /><br /><br /><b>Readership: Keep checking back and spread the word</b><br /><br />The number of weekly readers for the Database Column blog continues to increase, and we hope that the original, thoughtful content provided by the experts at the blog will continue to attract more readers. More readers means more interactive discussion (feedback and subsequent posts), inspires the contributors to write more, and we hope elevates the overall level of community discussion about database technology. If you know of someone who might be interested in the content here, or you know of other good database-related sites, send them the URL or write a comment.<br /><br /><br /><b>Feedback: Thanks, and a call for more</b><br /><br />On the subject of feedback, we have been excited by the readership's interest in the posts and we appreciate those that have taken the extra time to publish comments. We encourage feedback -- even critical feedback -- and hope the trend continues. We try and respond to many of these comments, and, in fact, some of the comments have inspired full posts in the following weeks.<br /><br />One important change we have made is how we handle comments to the Database Column. We experimented with a few different mechanisms for allowing you to comment. Some of the methods required registration -- a process that was not always easy or not fully functioning. As a result of these early issues, from now on, we no longer require registration to post a comment. However, we will review each comment before it goes live. We receive hundreds of blog spam messages in a week, so this revised process should enable you to easily comment <i>and</i> avoid an avalanche of spam.<br /><br /><br /><b>Content: Upcoming posts and a call for topics</b><br /><br />There are many topics in the publishing queue, but we want to make sure we are covering topics that matter to readers. What can you expect from us? Upcoming posts will include (these are working titles): "Storing and querying RDF," "What column stores are not good for," and the first in a series of open conversations with Curt Monash on the viability of a one-size-fits-all database approach. That said, we encourage you to send us your questions, comments, and ideas for new topics. We scan all comments for possible topics and we will look at the comments for this post.<br /><br />Thanks again for your interest and contributions. Best wishes for a prosperous 2008!<br /><br />&nbsp;- The Database Column Editors &amp; Contributors<br /><br /> ]]>
        
    </content>
<feedburner:origLink>http://www.databasecolumn.com/2008/01/database-column-in-2008.html</feedburner:origLink></entry>

<entry>
    <title>To ETL or federate ... that is the question</title>
    <link rel="alternate" type="text/html" href="http://feeds.feedburner.com/~r/dbcfeed/~3/201898531/to-etl-or-federate.html" />
    <id>tag:www.databasecolumn.com,2007://1.25</id>

    <published>2007-12-17T22:58:20Z</published>
    <updated>2008-02-19T19:19:13Z</updated>

    <summary>Enterprises must integrate data in a number of operational systems. But how should they do it? There are two technical approaches: ETL or Federate. Michael Stonebraker discusses the pros and cons of each approach in regards to data element "heat," indexing, resource management, complexity of schema change, contention, timeliness, and mapping,  concluding that the ETL approach makes sense in most cases.</summary>
    <author>
        <name>Michael Stonebraker</name>
        
    </author>
    
        <category term="Database architecture" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="businessintelligence" label="business intelligence" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="etl" label="ETL" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="federation" label="federation" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="indexing" label="indexing" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="oltp" label="OLTP" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="schema" label="schema" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://www.databasecolumn.com/">
        <![CDATA[It's a common problem. Enterprises must integrate data in a number of operational systems. But how should they do it? There are two technical approaches:<br /><br /><ul><li><b>Extract, transform, and load (ETL).</b> In this approach, an enterprise sets up a centralized data warehouse and then constructs a global schema for the data of interest. For each operational system, they will employ some sort of ETL process to transform data instances into the global schema and then load them into the centralized warehouse.<br /><br /></li><li><b>Federate.</b> As an alternative, enterprises can construct a global schema as described above but leave the data where it resides. Instead of building a central warehouse, they can employ a data federator, such as MetaMatrix or Aqualogics. Queries (and perhaps updates) can be submitted to the federator. In turn, the federator figures out what queries or updates need to be run at each of the operational sites to construct the correct outcome to the submitted commands.<br /></li></ul><br />For the rest of this relatively short post, we will explain the pros and cons of each approach.<br /><br /><br /><b>Data element "heat": Hot data favors ETL</b><br /><br />In the ETL approach, data transformation occurs when a data element is extracted, while in the federation approach transformations occur at query time. If a data element is queried multiple times, it is obviously cheaper to perform the transformation once, thereby favoring the ETL approach. On the other hand, if a data element is never or only queried once, then the federation approach makes more sense. In summary, if a data element is rarely queried (i.e., it is cold) then federation is desirable. In contrast, hot data elements are better with ETL solutions.<br /><br /><br /><b>Indexing: Federation is harder to optimize</b><br /><br />The data indexing requirements of OLTP are typically quite different from those of data warehouses. Hence, in an ETL approach the warehouse workload can be optimized separately from the OLTP workload on different hardware. In the federation approach, a DBA must balance the needs of both workloads in a single database -- a task that will be much more complex than optimizing two separate workloads.<br /><br />In general, the federation approach will have significantly worse performance because the needs of the two environments must be optimized together, rather than separately.<br /><br /><b><br />Resource management: Faster BI query responses for ETL shops</b><br /><br />In a data warehouse, there is a dedicated machine with optimized indexing for BI users. In contrast, the BI user will typically be prioritized behind OLTP transactions in a data federation. This will lead to poor response time for BI queries (i.e., more recommendations to "go out for lunch" while waiting for the result of a query).<br /><br /><br /><b>Complexity of the schema change: ETL approach performs less joins</b><br /><br />Most data warehouses implement star or snowflake schemas. In contrast, most OLTP systems utilize non-snowflake schemas. As a result, the global schema is quite different from the various operational schemas. In this case, a single record in the global schema may come from several records in the operational schema. Therefore, a federator must perform this join on every query. In contrast, an ETL system will do the join once at load time. Again, the ETL approach should have much better performance when the schema mapping becomes complex.<br /><br /><br /><b>Contention (concurrency control): Federation contention challenges</b><br /><br />In an ETL system, data elements must be extracted from the operational systems periodically. Once loaded into a central warehouse, they become read-only. Hence, there is essentially no contention for locks in the ETL approach. In contrast, the federation approach will mix business intelligence queries and transactions in the operational systems. The result is lock contention, as well as contention for other resources.<br /><br /><br /><b>Timeliness: ETL approaches must deal with out-of-date data issues</b><br /><br />A data warehouse is fundamentally out of date by one-half of the periodicity of the load process. On the other hand, a federator gives up-to-the-second information.<br /><br />To alleviate this disadvantage, some newer warehouse systems, such as Vertica, allow data loading in parallel with querying, a process called "trickle loading."<br /><br /><br /><b>Mapping: Federations can't handle some transformations</b> &nbsp;<br /><br />A common situation is for the operational databases to have customer information, such as customer names. In an ETL approach, whenever a customer datum is encountered, it can be looked up in a steadily growing table containing the mapping from operational system names to global schema names. If a name is not present a new entry can be added. Name mapping is thereby a global operation, supported by a mapping table. In a data federation, name mapping is done on data access. It is difficult to guarantee that the same mapping is applied to each operational system, unless the same table -- discussed above -- is maintained. However, a federator has no facility to perform mappings that require state information. As such, there are some transformations that are very difficult to perform on the fly.<br /><br /><br /><b>Summary: The ETL approach makes sense in most cases</b><br /><br />In summary, virtually all enterprises use the ETL approach for data integration. The data federation market is, in contrast, quite small. The place where I see federations as most viable is when there are many, many data sources (e.g., more than 5,000 sources) and BI users utilize only a small number of them at any given time. In this extreme case, the average data element is accessed zero times before it is updated or deleted. In this instance, one is better off leaving the data where it originates. On the other -- more common -- hand, when most data elements get used several times, the ETL approach will continue to be preferred. <br /><br /> ]]>
        
    </content>
<feedburner:origLink>http://www.databasecolumn.com/2007/12/to-etl-or-federate.html</feedburner:origLink></entry>

</feed>
