<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/" xmlns:georss="http://www.georss.org/georss" version="2.0"><channel><atom:id>tag:blogger.com,1999:blog-3510987239348986069</atom:id><lastBuildDate>Tue, 30 Jun 2009 19:57:34 +0000</lastBuildDate><title>Queries From Hell</title><description>A blog about data warehousing, correlation databases, associative and incremental queries, value-based storage, metadata, on-the-fly indexing, automatic data-driven schemas, BI tools, data mining, visual mapping, pattern recognition, and the limitations of standard SQL in answering "queries from Hell." 
&lt;br&gt;Or, how to discover what you don't know you don't know.</description><link>http://ianalyze.net/</link><managingEditor>noreply@blogger.com (Joe)</managingEditor><generator>Blogger</generator><openSearch:totalResults>31</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.feedburner.com/QueriesFromHell" type="application/rss+xml" /><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-3663763501116976540</guid><pubDate>Tue, 16 Jun 2009 13:37:00 +0000</pubDate><atom:updated>2009-06-19T15:21:41.055-07:00</atom:updated><title>How many tables in an EDW?</title><description>I have heard stories of &lt;a href="http://www.illuminateinc.com/iLuminate-data-management.html"&gt;enterprise data warehouses&lt;/a&gt; that have thousands of tables; I talked to one organization that claimed that they had 500,000 tables in their EDW!  What on earth can require 500,000 or 50,000 or even 5,000 tables?&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://ianalyze.net/uploaded_images/database_comparison_chart-716305.gif" target="blank"&gt;&lt;img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer; width: 200px; height: 180px;" src="http://ianalyze.net/uploaded_images/database_comparison_chart-716303.gif" alt="" border="0" /&gt;&lt;/a&gt;I make the argument that it is just bad design forced by the technical constraints of relational database managers.&lt;br /&gt;&lt;br /&gt;If an EDW is a central repository of an organization’s information then it should be organized around the information entities of that organization.  In any organization, there are real objects like customer, product, policy, account, vendor, etc that represent master data.  In a large complex organization that is involved in multiple lines of business, there may be as many as 50 of this class of objects.  Then there are logical objects (relationships or transactions) like orders, shipments, payments, invoices, etc.  Again, in a complex organization there may be as many as 50 of these.&lt;br /&gt;&lt;br /&gt;If we make the organization extremely complex and double the number of information entities, then there could be 200.  Double it again and there are still only 400.  Where do the thousands come from?&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Normalization&lt;/li&gt;&lt;li&gt;Reporting tables&lt;/li&gt;&lt;li&gt;Summary tables&lt;/li&gt;&lt;li&gt;Application oriented tables&lt;/li&gt;&lt;li&gt;Cubes and other special structure&lt;/li&gt;&lt;li&gt;Duplication&lt;/li&gt;&lt;li&gt;Poor design&lt;/li&gt;&lt;/ul&gt;Each of these reasons is a good topic for a paper but the normalization issue is the most difficult.  All well trained DBAs have been thoroughly instructed that data must be normalized to optimize the EDW.  However, this is only true for record-based structures.  A column-based structure gets little advantage from strict third normal form design and a value based system like iLuminate gets no benefit at all.&lt;br /&gt;&lt;br /&gt;It is true that a table with 500 columns in an RDBMS used for analytics would be very difficult to work with.  Indexing would be a major problem (imagine managing 100 indices on one table for just 20% indexed access) and I/O loads would be extreme.  In a &lt;a href="http://www.illuminateinc.com/VBS-ad-hoc-query-performance.html"&gt;value-based structure&lt;/a&gt; there is no limit on the number of columns that will affect performance; indexing is always universal.  The only limit on the number of columns is the limit of a human’s ability to comprehend the entire set of information.  A table with 5,000 columns would probably be incomprehensible for people.&lt;br /&gt;&lt;br /&gt;Likewise, redundant values are not much of a problem for column-based systems and no problem at all for the value-based structure.  In the value structure, each value is stored once regardless of the number of uses or the location of use.  Empty or null values also cause no problem in either column or value structures.  Simply the absence of a value defines a null, so large, sparsely filled records are not a problem as they are in record-based systems.&lt;br /&gt;&lt;br /&gt;Based on this, I propose that in a column or value structure, adherence to third normal form is unnecessary and a 5,000 table EDW has a serious design problem.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-3663763501116976540?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2009/06/how-many-tables-in-edw.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-7322962842580013830</guid><pubDate>Tue, 26 May 2009 13:17:00 +0000</pubDate><atom:updated>2009-05-28T06:34:38.295-07:00</atom:updated><title>"Easy to Consume" BI</title><description>Recently Claudia Imhoff wrote in &lt;a href="http://www.b-eye-network.com/blogs/imhoff/archives/2009/02/easy_to_use_-_n.php"&gt;her blog&lt;/a&gt; that BI should be “easy to consume,” and I most certainly agree. As she accurately points out, there are many forms of BI that are “easy to use,” but &lt;span style="font-style: italic;"&gt;easy to consume&lt;/span&gt; is a different thing. We at illuminate have been talking about “time to analytics” and “time to answer.” Time to answer is defined as the time it takes from submitting the first question to the time the business issue is resolved. The ability to select and run the query that you want with one click is &lt;span style="font-style: italic;"&gt;easy to use&lt;/span&gt;. Working your way through a problem with the computer helping you as you go is &lt;span style="font-style: italic;"&gt;easy to consume&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://www.illuminateinc.com/incremental_database_queries.html"&gt;incremental query process&lt;/a&gt; is built around the idea that a person may not know the right question to ask until several learning questions have been answered. This “learn as you go” process is what we call an incremental query and what I would propose is exactly "easy to consume" BI.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-7322962842580013830?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2009/05/recently-claudia-imhoff-wrote-in-her.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-6180114191263669626</guid><pubDate>Mon, 18 May 2009 14:08:00 +0000</pubDate><atom:updated>2009-05-22T15:54:32.753-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Software Development</category><title>Chaos</title><description>If you never read the original &lt;a href="http://net.educause.edu/ir/library/pdf/NCP08083B.pdf"&gt;Chaos report from the Standish Group&lt;/a&gt; you should stop looking at this screen and immediately read it.  When you have finished, come back here and read on.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://ianalyze.net/uploaded_images/absolut_chaos-746579.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 150px; height: 200px;" src="http://ianalyze.net/uploaded_images/absolut_chaos-746578.jpg" alt="" border="0" /&gt;&lt;/a&gt;For those of you who read the report fifteen years ago and have forgotten it, this report, among other things, identifies reasons for failure of software projects.  The original report in 1994 showed 16% of software projects succeeded, 53% were completed but with cost and time overruns and reduced deliverables and 31% were complete failures.  In 2004 the Standish Group reported that this had improved to 29% succeeded and only 18% failed.&lt;br /&gt;&lt;br /&gt;Yikes!!!  In 10 years, as an industry, we have gone from 31% total failure to 18% total failure.  I am really happy that we are not building airplanes!&lt;br /&gt;&lt;br /&gt;One of the primary reasons for the limited improvement of the failure rate that we have been able to achieve is the reduction of project timelines.  The Chaos report states, “&lt;em&gt;Research at The Standish Group also indicates that smaller time frames… will increase the success rate&lt;/em&gt;.”   On my scale, a project with a timeline of more than a year has almost no chance of success, while a project with a timeline of a few weeks has a very good chance of success.&lt;br /&gt;&lt;br /&gt;What about a &lt;a href="http://www.illuminateinc.com/iLuminate-data-management.html"&gt;data warehouse&lt;/a&gt; project?  Usually these are measured in months, often exceeding a year. How is the success rate?  According to &lt;a href="http://www.inmoncif.com/home/"&gt;Bill Inmon&lt;/a&gt;’s estimate there is a failure rate of 70% – 80%.  Other studies have reported failure rates as high as 90%.  Given the overall failure rate of software projects from the Standish Group, it seems like data warehouse projects are the primary source of project failures in the IT industry.&lt;br /&gt;&lt;br /&gt;Of course, one can respond that building a data warehouse is a big job and requires a lot of time.  Where is most of the time spent?  Requirements definition, schema design and optimizing indexing strategy.&lt;br /&gt;&lt;br /&gt;Now, I would like to challenge the other database vendors to participate in benchmarks where these factors are evaluated rather than the essentially meaningless TPC-H tests.  I think most people would agree that a moderate success is better than a large failure and these are the factors that will determine whether your data warehouse project will need sixty days to succeed or sixty weeks to fail.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-6180114191263669626?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2009/05/chaos.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">4</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-2550989974618684350</guid><pubDate>Fri, 13 Mar 2009 19:54:00 +0000</pubDate><atom:updated>2009-03-13T13:15:49.826-07:00</atom:updated><title>Colin White Podcast on The Correlation Database</title><description>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://ianalyze.net/uploaded_images/colin-white-774057.jpg"&gt;&lt;img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer; width: 200px; height: 200px;" src="http://ianalyze.net/uploaded_images/colin-white-774046.jpg" alt="" border="0" /&gt;&lt;/a&gt;At the recent TDWI event in Las Vegas, Colin White, president and founder of &lt;a href="http://www.bi-research.com/index.html" target="blank"&gt;BI Research&lt;/a&gt; and &lt;a href="http://www.b-eye-network.com/blogs/business_integration/" target="blank"&gt;BeyeNETWORK blogger&lt;/a&gt;, sat down with Andy Fletcher of illuminate Solutions to talk about the &lt;a href="http://www.illuminateinc.com/iLuminate-data-management.html" target="blank"&gt;correlation database&lt;/a&gt; (CDBMS). A widely recognized expert in database technology and frequent speaker at industry conferences, Mr. White found the correlation database technology "fascinating and different" compared with relational and column databases.&lt;br /&gt;&lt;br /&gt;Among the high points of the discussion:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;There is much less effort prior to load with the CDBMS than with relational or column databases, as the need for physical and logical design is completely eliminated.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;A CDBMS data warehouse can be depoyed in a fraction of the time of other technologies.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Data is "all indexed, all the time, in all dimensions," enabling rapid performance even when doing train-of-thought analysis—asking questions unplanned at the outset of data exploration.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;A CDBMS is often deployed to supplement an existing enterprise data warehouse, to enable &lt;a href="http://www.illuminateinc.com/ad_hoc_analysis.html" target="blank"&gt;ad hoc analysis&lt;/a&gt; on a subset of data.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Business analysts are the key users, pursuing in-depth analysis rather than standard reporting.&lt;/li&gt;&lt;/ul&gt;Listen to the &lt;a href="http://www.illuminateinc.com/demos/Beye-Andy-Fletcher.mp3"&gt;complete podcast&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-2550989974618684350?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2009/03/colin-white-podcast-on-correlation.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">2</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-7823672757166238073</guid><pubDate>Wed, 04 Mar 2009 15:22:00 +0000</pubDate><atom:updated>2009-03-08T07:44:23.233-07:00</atom:updated><title>The Information Asset</title><description>“Information is a corporate asset” is a mantra that many IT people have been chanting for years.  Recognition of information as a real corporate asset would give the IT industry a new status they have longingly sought for decades.  However, what kind of asset should it be?&lt;br /&gt;&lt;br /&gt;A drill press is an asset and so is a pile of cash.  A drill press can be used to efficiently perform one specific type of a manufacturing process and a pile of cash can be used to accomplish almost any business objective in any line of business.  If information is rigidly structured, it looks a lot like the drill press. But if it can be accessed in any way, at any time for any purpose, then it is more like cash.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://ianalyze.net/uploaded_images/cash-pile-pic-768549.jpg"&gt;&lt;img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer; width: 200px; height: 177px;" src="http://ianalyze.net/uploaded_images/cash-pile-pic-768541.jpg" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Obviously, information should be an asset like cash, but in most organizations it is much more like the drill press.  An &lt;a href="http://www.illuminateinc.com/iLuminate-data-management.html"&gt;enterprise data wareh&lt;/a&gt;&lt;a href="http://www.illuminateinc.com/iLuminate-data-management.html"&gt;ouse&lt;/a&gt; that has been built on standard RDBMS structures is optimized for one set of functions and generally performs them well.  However, these structures handle different and unplanned information access functions about as badly as a drill press would perform injection molding.&lt;br /&gt;&lt;br /&gt;The value of the information asset is directly related to the flexibility of access the business has to that asset.    The more aspects of flexibility the organization has to using the information, the higher its value will be.  If a business person can see a fixed report that is produced each day and all other information comes from phone calls, the information asset has very limited value.  If that user can see a dashboard that indicates a problem or opportunity and then immediately begin analyzing the detailed information behind the dashboard, the information has much more value.&lt;br /&gt;&lt;br /&gt;However, if a new business opportunity comes up and the decision makers can find answers to all of the new questions associated with that opportunity in time to make an informed decision, then the information has almost unlimited value.  This means providing answers to questions that no one thought of just one day previous, much less at a requirements definition meeting eighteen months ago.  It means accessing data through paths that no one thought of in any design session.  And it means getting answers to these unplanned questions at the &lt;a href="http://www.illuminateinc.com/incremental_database_queries.html"&gt;speed of thought&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Only then will information  begin to look like cash.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-7823672757166238073?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2009/03/information-asset.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-1507307613925968513</guid><pubDate>Mon, 16 Feb 2009 12:56:00 +0000</pubDate><atom:updated>2009-02-20T12:40:35.394-08:00</atom:updated><title>Shared nothing, shared something, shared everything?</title><description>The debate over shared computing resources has been &lt;a href="http://www.informationweek.com/678/78iuarc.htm"&gt;going on for years&lt;/a&gt; and shows little sign of stopping.  Massive shared nothing vendors like Netezza insist that any sharing destroys performance in query processing. Meanwhile, Microsoft, Oracle and others insist that sharing some or all computing resources improves &lt;a href="http://www.illuminateinc.com/VBS-ad-hoc-query-performance.html"&gt;query performance&lt;/a&gt;.  Which is right?  It just doesn’t matter.&lt;br /&gt;&lt;br /&gt;How’s that for throwing gas onto the fire?&lt;br /&gt;&lt;br /&gt;A few years ago Netezza published a paper (&lt;em&gt;&lt;a href="http://www.information-management.com/white_papers/2227889-1.html"&gt;Netezza Performance Server® Appliance: An Architectu&lt;/a&gt;&lt;/em&gt;&lt;em&gt;&lt;a href="http://www.information-management.com/white_papers/2227889-1.html"&gt;ral Comparison&lt;/a&gt;)&lt;/em&gt; that very accurately summed up the problems with sharing resources.  In a single sentence, sharing resources is a problem when you need to move massive amounts of data constantly.  In that paper they say “Unlike OLTP, data warehousing is all about data shuffling: moving large quantities of data through the system’s analysis and processing engine as efficiently as possible, with a minimum of internal thrashing.”&lt;br /&gt;&lt;br /&gt;They go on to illustrate how a query will overload a shared system, “For example, a complex join or other complex query may require a number of processing steps.  Consider the sheer inefficiency of delivering multiple enormous tables (billions of rows) off disk, across the network and into memory for processing by the DBMS—all to perform one step. The partial results then have to be moved (“materialized”) back to disk in a temporary storage location prior to bringing in another huge bundle of data for the next step.”  It makes me tired just to think about that much work.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://ianalyze.net/uploaded_images/sharing-786801.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 200px; height: 134px;" src="http://ianalyze.net/uploaded_images/sharing-786798.jpg" alt="" border="0" /&gt;&lt;/a&gt;The problem that all parties in this debate share (even the shared nothing people are sharing this one) is that they start with the erroneous assumption that it is necessary to move lots of records.  Let’s start with just one, correct assumption—a human is asking a question and they expect to get back an amount of information that is within human scope of understanding.&lt;br /&gt;&lt;br /&gt;This means that the only required movement is a tiny amount of data into the system (the text of a query) and a small amount of data out (a screen or two of result data at most).  All of the additional data movement that occurs between those end points is a result of the design of the data structure and the query process.  And in any of the record-based structures, this will be a LOT of data movement.&lt;br /&gt;&lt;br /&gt;With this assumption in mind, consider how the correlation database works.  A query comes in and uses indexed access directly to the selection values.  From that it gets a set of internal instance IDs which it uses to project the result set.  When there are multiple complex joins, only instance IDs move across I/O resources as the intermediate steps are processed.  The only actual data movement that occurs is the transfer of the final data values to the client.&lt;br /&gt;&lt;br /&gt;In this environment, more processor power means more performance.  The amount of physical data movement is tiny compared to the processing power required.  Fortunately, the chip makers are doing an excellent job of building faster chips with multiple cores providing massive processing power in a small, inexpensive package.  Using a CDBMS, even low end, commodity servers have enough I/O capacity to meet the needs of a large &lt;a href="http://www.illuminateinc.com/iLuminate-data-management.html"&gt;enterprise data warehouse&lt;/a&gt; with many users.&lt;br /&gt;&lt;br /&gt;In the Netezza paper they state “The strength of SMP lies in its processing power; however, the architecture is limited in its ability to move large amounts of data as required in data warehousing”.   That sounds a lot like the definition of the physical requirements of the CDBMS.&lt;br /&gt;Their solution is to build an architecture that can move massive amounts of data quickly; the CDBMS solution is to stop moving massive amounts of data and that means lower cost and faster performance.&lt;br /&gt;&lt;br /&gt;An interesting side effect of this structure is much “greener” computing.  A really big commodity server with lots of disks might possibly use a few kilowatts of power and would easily manage a CDBMS data warehouse with ten terabytes of source data in it.  An array of shared nothing computers processing the same amount of data in an RDBMS would use multiple times that much electrical power.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-1507307613925968513?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2009/02/shared-nothing-shared-something-shared.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">1</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-3925014085455641925</guid><pubDate>Tue, 16 Dec 2008 20:04:00 +0000</pubDate><atom:updated>2008-12-16T15:00:25.471-08:00</atom:updated><title>Load Speed in an EDW</title><description>I have been seeing some amazing reports recently about load speeds from a number of database vendors. The most recent was a benchmark run by &lt;a href="http://www.illuminateinc.com/data_warehouse_platform_comparison.asp?p="&gt;Vertica&lt;/a&gt; that showed a load rate of more than 5 terabytes per hour. My response to this is a simple question—how fast is fast enough?&lt;br /&gt;&lt;br /&gt;To answer this question it is necessary to answer two other questions:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;What is being loaded?&lt;/li&gt;&lt;li&gt;How soon can the data be used?&lt;/li&gt;&lt;/ul&gt; &lt;span style="font-weight: bold;"&gt;What is being loaded?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The benchmark source data has 30,048,001,906 order line records. To put that figure into perspective, Wal-Mart has nearly 8,000 stores. If the benchmark were simulating all of Wal-Mart sales, that works out to something like 3,750,000 detail transactions per store—and this is just testing load speed, not showing the total content of the &lt;a href="http://www.illuminateinc.com/iLuminate-data-management.html"&gt;enterprise data warehouse&lt;/a&gt; (EDW). If this is the initial load of the EDW, then whether the data is loaded in two hours or two days has little effect on the total time to deployment. If this is a daily incremental load, then the holiday season has been truly phenomenal for Wal-Mart. To generate transactions at this daily rate would require about 100 transactions for every living person in the entire United States.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://ianalyze.net/uploaded_images/bike-car-speed-770815.jpg"&gt;&lt;img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer; width: 200px; height: 141px;" src="http://ianalyze.net/uploaded_images/bike-car-speed-770811.jpg" alt="" border="0" /&gt;&lt;/a&gt;The only realistic way that most organizations would need to get this much data loaded periodically would be by loading multiple copies of redundant data from stovepipe sources in poorly defined processes. Other than in a very small number of organizations, the need for this level of load speed is to continue support for poor designs and practices. What value does it bring an organization to continue to execute poorly conceived processes, only at higher speeds?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;How soon can the data be used?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Supremely fast loading is of no benefit to the organization if the analytical power isn't there once the data is loaded. The benchmark above included no &lt;a href="http://www.illuminateinc.com/VBS-ad-hoc-query-performance.html"&gt;query performance&lt;/a&gt; testing, just load speed testing. This is fairly typical of benchmarks where there is a completely different architecture and process for load testing and query testing. What would the query response speed have been in this case? There is no way to know, but it clearly would be significantly slower than if the data had been loaded in a query-optimized manner. Of course, the load speed would then be significantly slower.&lt;br /&gt;&lt;br /&gt;These kinds of benchmarks are really more a test of the hardware and the skill of the people configuring the system for testing. The test shows how well computer system X can move bytes off the disk, through the disk channels and the system bus to the processor and then back to another location on the disk system. The database manipulation portion is such a small factor that it barely has any effect on the total time, and this is because the heavy duty part of database processing was specifically excluded from the test.&lt;br /&gt;&lt;br /&gt;Why not develop a standard database benchmark that measures “time to analytics?” That is, how fast can the system load and structure new data AND make it available to the user? The query portion of the test should include a set of queries that demonstrate reporting speed, analytical power and &lt;a href="http://www.illuminateinc.com/ad_hoc_analysis.html"&gt;ad hoc analysis&lt;/a&gt; capability. If a single benchmark does not include all of the steps needed, from reading the first source record to delivering the last line of response to the queries, then it is pure marketing spin.&lt;br /&gt;&lt;br /&gt;So, how fast is fast enough? Obviously there is no single answer to that question. I had a conversation with a user of a column database and I asked if they had a problem with load times. He answered that there was no problem at all, as the system could load the entire 2 million new records (perhaps 250 megabytes) they added per day without interfering with query activity. In another conversation I was told that ten gigabytes per hour was too slow, that they would never finish incremental loads on time. In both cases, the concern was that the data would not be ready for the user when it was needed, not how many gigabytes per hour could be loaded.&lt;br /&gt;&lt;br /&gt;The real answer is based on your needs, and no benchmark will relate to that need in a meaningful way. If you are concerned about load speed, ask the vendor to run a test: see how much data can they load, structure and query in a limited-time test. Then you will have meaningful information.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-3925014085455641925?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2008/12/load-speed-in-edw.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">2</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-5103213441697565497</guid><pubDate>Wed, 12 Nov 2008 17:16:00 +0000</pubDate><atom:updated>2008-11-13T06:04:51.741-08:00</atom:updated><title>The Atomic Data Warehouse</title><description>No, this isn’t a post about a place to store information on nuclear weapons.  The long-running &lt;a href="http://www.b-eye-network.com/blogs/drewek/archives/2005/03/"&gt;debate between Bill Inmon and Ralph Kimball over the optimum data warehouse model&lt;/a&gt; has come up again in a discussion of our approach to data warehouse structure.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://ianalyze.net/uploaded_images/atomic-laptop-796842.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 200px; height: 182px;" src="http://ianalyze.net/uploaded_images/atomic-laptop-796835.jpg" alt="" border="0" /&gt;&lt;/a&gt;So, should it be a bus (Kimball) or a CIF (Inmon)?  Both of these models recognize the importance of having atomic-level data available and both recognize the limitations of the relational model in providing access to this data.  They propose two different approaches that are intended to provide the maximum benefit to the business within the constraints of the relational model.&lt;br /&gt;&lt;br /&gt;“Within the constraints of the relational model”!  What if there was an option that did not have those constraints?  I think it is time to speak in complete sentences.  Rather than saying, “Model x is the best approach”, one should say, “Model x is the best approach when restricting the options to relational databases.”&lt;br /&gt;&lt;br /&gt;OK, end of rant.&lt;br /&gt;&lt;br /&gt;There really are other viable options, even for large EDWs.  The correlation database model (CDBMS) provides a single level, centrally managed structure that still gives users the option to have their own data mart.  Using the &lt;a href="http://www.illuminateinc.com/incremental_database_queries.html"&gt;incremental query&lt;/a&gt; provided by illuminate, the first query step creates the logical data mart.  Once that is done, the user can analyze the results just as they would with a free-standing data mart.  However, they would be using managed, current and consistent data rather than the chaotic departmental data marts that are seen today in the RDBMS world.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-5103213441697565497?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2008/11/atomic-data-warehouse.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-6575379473841035441</guid><pubDate>Mon, 03 Nov 2008 15:04:00 +0000</pubDate><atom:updated>2008-11-13T05:49:28.098-08:00</atom:updated><title>Data Quality and Ad Hoc Analysis</title><description>Is the concept of “Good Enough” ever applicable to data quality measurements in a data warehouse environment? I think most people would agree with the concept seen in the graph below comparing resource cost to data quality level.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;img style="margin: 0px auto 10px; display: block; width: 410px; height: 247px; text-align: center;" alt="" src="http://ianalyze.net/uploaded_images/graph1-757879.jpg" border="0" /&gt; Data quality is expensive in both time and money. But when building a data warehouse, one must assume that every data element in the entire warehouse has a purpose and will be used. In that scenario, every data element must be as clean as possible.&lt;br /&gt;&lt;br /&gt;In an &lt;a href="http://www.illuminateinc.com/ad_hoc_analysis.html"&gt;ad hoc analysis&lt;/a&gt; environment though, I believe a valid case exists for accepting less than near perfect data quality. The concept of Ad Hoc means “for a specific purpose” rather than the common understanding of simply “unplanned”. When data is assembled for a specific analysis, data elements that do not affect the analysis do not need to be perfectly cleaned. If the required data elements are of acceptable quality then the analyses can commence without fear.&lt;br /&gt;&lt;br /&gt;Unfortunately, there is always an element of “unplanned” in enterprise ad hoc analytics so there is always a need to incorporate data that was not planned for the analysis. In this case, it is important to be able to quickly assess the quality of the newly required data elements and make limited emergency corrections.&lt;br /&gt;&lt;br /&gt;When both of these conditions are met, near perfect quality on known data elements and the ability to immediately assess the quality of new elements, then ad hoc analytics can proceed with confidence, even if overall data quality is not very good. Of course, this assumes that the person performing the analytics is aware of quality issues, knows which elements are trustworthy and has sufficient business knowledge to properly assess quality of new data elements.&lt;br /&gt;&lt;br /&gt;Note that I said "enough business knowledge to properly assess quality." If the user is preparing a financial analysis that must correlate with general ledger numbers, then the quality must be perfect. If the analysis is to determine the probability of success of a new promotion, then 90% accuracy of most data elements is probably acceptable.&lt;br /&gt;&lt;br /&gt;The point is that absolute statements of data quality that are critical for data warehouse management are often counterproductive in an ad hoc scenario.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-6575379473841035441?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2008/11/data-quality-and-ad-hoc-analysis.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">1</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-3271219739682449290</guid><pubDate>Mon, 06 Oct 2008 18:10:00 +0000</pubDate><atom:updated>2008-10-10T11:29:45.769-07:00</atom:updated><title>Database in the cloud - blue sky?</title><description>At the &lt;a href="http://www.vmworld.com/conferences/2008/"&gt;VMworld meeting&lt;/a&gt; in Las Vegas a couple of weeks ago, virtualization of everything seemed like the norm.  &lt;a href="http://www.infoworld.com/article/08/04/07/15FE-cloud-computing-reality_1.html"&gt;Cloud computing&lt;/a&gt; was the latest greatest thing and is seemed like the entire world was lining up to move into the cloud.  While there are many clear benefits to cloud computing, there were discussions about databases in the cloud that left me wondering if people really thought this through.  Certainly XML files containing datasets used in cloud computing functions are not only valuable, they are required for almost all cloud computing applications.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://ianalyze.net/uploaded_images/cloud-computing-illuminatei-755340.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer;" src="http://ianalyze.net/uploaded_images/cloud-computing-illuminatei-755333.jpg" alt="" border="0" /&gt;&lt;/a&gt;But is there a place for real enterprise data stores in the cloud?  Let’s first look at why you would want to put anything in the cloud.  Most of the reasons that I have heard relate to ease of managing distributed applications.  Ramping an application up from 50 servers to 350 servers in just a few days is possible in the cloud and impossible in the computer room.&lt;br /&gt;&lt;br /&gt;Clearly, one of the important benefits is the accessibility of infrastructure in real time.  Infrastructure, though, is a very broad term.  It might mean access to electrical power, communication lines and brick and mortar facilities for housing servers.  But it can also mean access to enterprise data stores.&lt;br /&gt;&lt;br /&gt;The question is does access to enterprise data stores in the cloud imply that the database is in the cloud or access to a managed central database from cloud based applications good enough?  Why would you want an enterprise database in the cloud?&lt;br /&gt;&lt;br /&gt;One reason might be to provide support for wildly variable demands for the computing power needed to meet database processing requirements.  However, managing a database in the cloud would seem to be impossible, especially with wildly varying loads.  How could storage be optimized and how could partitioning be managed?  It would seem that performance could never be tuned and a database in the cloud would provide universally poor performance.&lt;br /&gt;&lt;br /&gt;There are many other issues that would need to be addressed, like access control and data quality, but the DBA’s job would seem to be the most difficult obstacle to overcome.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-3271219739682449290?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2008/10/database-in-cloud-blue-sky.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-586291365551600430</guid><pubDate>Thu, 04 Sep 2008 12:52:00 +0000</pubDate><atom:updated>2008-09-04T06:23:34.206-07:00</atom:updated><title>Dirty Reads</title><description>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://ianalyze.net/uploaded_images/ditry-reads-784109.jpg"&gt;&lt;img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://ianalyze.net/uploaded_images/ditry-reads-784107.jpg" alt="" border="0" /&gt;&lt;/a&gt;First off, no, this isn’t a post about &lt;a href="http://www.mouthshut.com/review/Chances_-_Jackie_Collins-14619-1.html"&gt;Jackie Collins novels&lt;/a&gt;. “Dirty read” is a dirty word for database users.  However, there is a case where a dirty read (querying against a set of data that may change before the analytical process is completed) makes sense (in my opinion only as of today).  I am interested in the opinion others have so please read on and let me know how you feel about it.&lt;br /&gt;&lt;br /&gt;The Illuminate database has a feature called &lt;a href="http://www.i-lluminate.com/incremental_database_queries.html"&gt;incremental query&lt;/a&gt;.  This allows a user to ask a question, review the response and ask a follow-on question.  The follow on question is not simply a drill down; it can go into any related information in any related table or tables.  This process can continue with many more questions over a period of hours if the user is working their way through a difficult problem.&lt;br /&gt;&lt;br /&gt;Obviously, preventing the possibility of a dirty read through the entire incremental process would require locking out updates on virtually the whole database until the user finishes and if another user starts an incremental process before the first one finished, the locks would need to be maintained until that process was finished.  In an active, twenty four hour system, this would probably lock out all updates indefinitely.&lt;br /&gt;&lt;br /&gt;The question is, how much of a problem is it for an analytical user who is doing information discovery to encounter a dirty ready situation between steps of an incremental query process?  For example, assume a retail business analyst is examining potential product affinity issues.  They can select a product and see that there were 100 units sold.  They then find the customers who bought those products through a number of incremental query steps.  Then, through another series of steps, they find all of the purchases of those customers.&lt;br /&gt;&lt;br /&gt;Looking at the results they see among other items that 102 of the original items were sold.  The new number (102 vs. 100) may be a result of three units purchased and one returned or two purchased or some other combination of events since the incremental query process started.  Of course, the problem can be avoided if the user selects only sales during a defined time frame, but this requires them to think about technical query issues rather than focusing on the business problem.  Does this invalidate the result?  Does it significantly reduce the value of the result?&lt;br /&gt;&lt;br /&gt;Your thoughts on this would be greatly appreciated.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-586291365551600430?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2008/09/dirty-reads.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">6</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-243975069772636243</guid><pubDate>Thu, 21 Aug 2008 18:59:00 +0000</pubDate><atom:updated>2008-08-21T11:59:00.497-07:00</atom:updated><title>What the heck is Business Intelligence anyway?</title><description>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://ianalyze.net/uploaded_images/Karien-Verhagen-778475.jpg"&gt;&lt;img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://ianalyze.net/uploaded_images/Karien-Verhagen-778473.jpg" alt="" border="0" /&gt;&lt;/a&gt;In &lt;a href="http://www.b-eye-network.com/view/8319"&gt;The Business Intelligence Model of Conflicting Interests&lt;/a&gt; on the B-Eye Network, &lt;a href="http://www.b-eye-network.com/authors/index.php?a=591"&gt;Karien Verhagen&lt;/a&gt; states, “There are three parties involved in business intelligence, each of them having its own interpretation of good management information; however, their interests are quite different and even directly oppose one another.”&lt;br /&gt;&lt;br /&gt;She goes on to describe the painfully common scenario where operations folks use Excel, the IT folks guard their ivory tower data warehouse, and the executives find none of it particularly useful for either status reporting or strategic planning.  Meanwhile, massive amounts of money are being spent on &lt;a href="http://www.i-lluminate.com/business_intelligence_data_warehouse.html"&gt;business intelligence data warehouse&lt;/a&gt; activities. Her proposed solution is “a combined initiative of all parties involved”.&lt;br /&gt;&lt;br /&gt;If the solution is so simple, why is it that integrated solutions are the exception rather than the de facto standard in business intelligence?  Conflicting requirements are the universal obstacle that needs to be overcome to make this happen.&lt;br /&gt;&lt;br /&gt;While everyone agrees that a single source of data is a good thing, making that happen has been virtually impossible.  Operations people want access to last-minute data and they need answers to different questions every day.  IT people want a dependable, well-managed data store with a single source of metadata.  Executives want quick and simple answers to status questions and powerful discovery and analytical capabilities for strategic planning.&lt;br /&gt;&lt;br /&gt;The only realistic solution for the data management side of this problem is to have a data storage system that has solid administration features, a virtual physical structure, immediate access to newly loaded data and unrestricted flexibility in accessing the contents.   Of course, front end BI tools by themselves can do nothing to help with this problem.&lt;br /&gt;&lt;br /&gt;Relational databases do a reasonable job for the IT people but can’t hope to provide the level of functionality that is needed by most users.  Column databases get closer to meeting these requirements but they are still inadequate, especially in areas such as flexibility of access. In-memory databases meet most of the speed and access requirements but can’t realistically scale up to anything near an enterprise data store.&lt;br /&gt;&lt;br /&gt;The only solution left is to use something new and different like a &lt;a href="http://www.i-lluminate.com/iLuminate-data-management.html"&gt;correlation database&lt;/a&gt; for the storage system.   Today, &lt;a href="http://www.i-lluminate.com/information-management-infrastructure.html"&gt;iLuminate&lt;/a&gt; is the only correlation database available but, considering the tremendous advantages this structure provides, many companies will soon be working on their own version of the next-generation database.&lt;br /&gt;&lt;br /&gt;Eventually every organization with a database will be using a storage technology that eliminates the constraints of today’s relational, column-based and in-memory designs.  Many are likely to implement a correlation database model to provide a single integrated BI solution.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-243975069772636243?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2008/08/what-heck-is-business-intelligence.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">1</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-6732259080816310540</guid><pubDate>Wed, 23 Jul 2008 14:46:00 +0000</pubDate><atom:updated>2008-07-23T08:14:52.099-07:00</atom:updated><title>Misoneism - A hatred or fear of change or innovation</title><description>I subscribe to &lt;a href="http://www.wordsmith.org/"&gt;a.word.a.day&lt;/a&gt; and a few days ago the word was &lt;a href="http://wordsmith.org/words/misoneism.html"&gt;misoneism&lt;/a&gt;.&lt;span style=""&gt;   &lt;/span&gt;The definition (hatred of change or innovation) made me think of the time when I was an IT manager trying to get our business users to accept a new technology that would make their job easier. Today though, browsers, standardized application interfaces and the general ubiquity of computers have reduced this problem almost to insignificance.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://ianalyze.net/uploaded_images/misoneism-fear-of-technology-701657.jpg"&gt;&lt;img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://ianalyze.net/uploaded_images/misoneism-fear-of-technology-701576.jpg" alt="" border="0" /&gt;&lt;/a&gt;The interesting point is that it seems like some of the IT folks have now become the misoneists (my apologies to &lt;a href="http://en.wikipedia.org/wiki/Anu_Garg"&gt;Anu Garg&lt;/a&gt; if that is not a valid form of the word). Oracle people (users, not employees of Oracle) can’t imagine using SQL Server and the SQL Server folks can’t imagine using Oracle—and DB2 is an anathema for both groups. None of them would dream of looking at a database that was not relational regardless of the business needs and appropriateness of the solution. Of course, there are exceptions to every blanket statement and there certainly are a few visionaries who will look at the possibilities of a new technology.&lt;span style=""&gt;  &lt;/span&gt;But it seems there are only visionaries and hard core conservatives in the industry. The middle ground seems to be very empty.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.i-lluminate.com/data-warehousing-platform-comparison.php"&gt;Alternative database structures&lt;/a&gt; are available today that provide much better information access for analysis and BI applications than any RDBMS yet they have virtually no customers. Compared to the number of licenses of Oracle and/or SQL Server, the presence of all non-relational database installations combined is nearly invisible unless one counts every XML file as a database. Non-relational database companies with more than a few dozen customers consider themselves to be extremely successful.&lt;br /&gt;&lt;br /&gt;What happened to the pioneers in information technology? The industry is not so old that they are all retired or dead. Why do the younger members of the industry have their feet so firmly planted in the clay? If innovation in software is to keep pace with &lt;a href="http://www.answers.com/topic/moore-s-law"&gt;Moore’s law&lt;/a&gt;, we need many more innovators and fewer traditionalists.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-6732259080816310540?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2008/07/misoneism-hatred-or-fear-of-change-or.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-5010595267667731742</guid><pubDate>Wed, 25 Jun 2008 16:10:00 +0000</pubDate><atom:updated>2008-06-27T07:12:48.346-07:00</atom:updated><title>How expensive is free software?</title><description>I was just looking at a white paper from Greenplum describing how fast, easy and inexpensive a &lt;a href="http://www.i-lluminate.com/advanced_data_warehouse_design.html"&gt;data warehouse&lt;/a&gt; is because it runs on commodity hardware.  I have seen uncountable comments about how inexpensive the Microsoft data warehouse infrastructure is and even more comments about the low cost of using open source software.  It would be easy to believe that for the price of a few computers and a little software you could have a data warehouse running next week.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://ianalyze.net/uploaded_images/Larissa-Moss-735979.jpg"&gt;&lt;img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://ianalyze.net/uploaded_images/Larissa-Moss-735974.jpg" alt="" border="0" /&gt;&lt;/a&gt;At a recent &lt;a href="http://www.sigs-datacom.de/sd/tdwi/conferences/2008/index.htm"&gt;TDWI conference&lt;/a&gt; in Germany, &lt;a href="http://www.eiminstitute.org/meet-the-experts/larissa-moss/"&gt;Larissa Moss&lt;/a&gt; delivered a presentation defining the development steps and the required team members in a &lt;a href="http://www.i-lluminate.com/comprehensive_data_warehouse_services.html"&gt;data warehouse implementation&lt;/a&gt; project.  The presentation did a beautiful job of defining what has to be done and who has to do it.  It involves a minimum team of five or six full time people and a number of part time specialists.   The entire project is divided into sixteen steps with each step having multiple activities.&lt;br /&gt;&lt;br /&gt;While many organizations will not follow every step and activity of her plan, it is an excellent definition of what really should be done to ensure quality and minimize total cost of ownership.  So, using her project definition as a guideline, it is easy to see that the cost of the database software and the server hardware represents only a small portion of the total cost of a data warehouse.&lt;br /&gt;&lt;br /&gt;The use of commodity hardware and open source software can reduce the total cost of the project by a couple of percentage points at best.  Reducing the amount of human resource required can reduce the cost of the project by half.  Some of the tasks identified in her presentation won’t change with any technology but there are some large improvements possible that can result in cost reductions that substantially exceed the total cost of the hardware and software.&lt;br /&gt;&lt;br /&gt;For example, the Data Analysis Step includes six activities.  Each of these activities involves a number of people and takes a significant amount of time.  Using free (or nearly free) relational database software and/or inexpensive commodity hardware, all six steps are still required to have a reasonable prospect of a successful project.  If another type of database can reduce the time and expense of these steps by just ten percent, the entire cost of the software and hardware can be offset.&lt;br /&gt;&lt;br /&gt;Using the &lt;a href="http://smartenoughsystems.com/wp/tag/cdbms/"&gt;CDBMS&lt;/a&gt; structure, two of the activities, “Refine logical data model” and “Expand enterprise logical data model” are no longer needed.  The other activities, like “Analyze source data quality”, will remain essentially the same regardless of database structure.  Removing two of the six activities will reduce the overall time and cost of just one step by about one third.&lt;br /&gt;&lt;br /&gt;As another example, take a look at one of the activities in the first step, “Cost Justification”.  If the cost of a project is a few hundred dollars there is no justification needed other than someone saying it will help them with their job.  If the cost of a project is a few tens of millions, it will require extremely detailed and robust justification and approval at top executive level.  In between, the justification should have a relatively linear relationship to the expected cost.  If the Data Analysis step is reduced by one third, and other steps are reduced by a substantial amount, the cost of the entire project is reduced and the cost justification time and effort can also be reduced.&lt;br /&gt;&lt;br /&gt;Going through all of the steps, it is easy to see how the total effort required for a successful data warehouse project could be reduced by half with no compromise on the quality of the result.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-5010595267667731742?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2008/06/how-expensive-is-free-software.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">2</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-4059223069711755895</guid><pubDate>Tue, 03 Jun 2008 16:52:00 +0000</pubDate><atom:updated>2008-06-04T07:31:33.863-07:00</atom:updated><title>When is an appliance not an appliance?</title><description>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://ianalyze.net/uploaded_images/DW_Appliance-763254.jpg"&gt;&lt;img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://ianalyze.net/uploaded_images/DW_Appliance-763252.jpg" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;I recently had an interesting discussion with &lt;a href="http://www.linkedin.com/in/hackathorn"&gt;Richard Hackathorn&lt;/a&gt; about &lt;a href="http://www.i-lluminate.com/value_based_storage_VBS_model.html"&gt;data warehouse appliances&lt;/a&gt; and he convinced me that the illuminate database is an appliance in spite of the complete absence of hardware.  In &lt;a href="http://www.beyeresearch.com/study/4639"&gt;Data Warehouse Appliances: Evolution or Revolution?&lt;/a&gt;, he and &lt;a href="http://www.bi-research.com/aboutus.html"&gt;Colin White&lt;/a&gt; describe the attributes of a DW appliance and the illuminate database aligns well with all points.&lt;br /&gt;&lt;br /&gt;So is a software-only database really an appliance?  Like so many things in the IT world, the answer is “it depends.”  His article states “From the classical definition, an appliance is designed for a specific purpose.”  And he defines that purpose, in the case of the DW appliance, as &lt;a href="http://www.i-lluminate.com/correlation_database_engine.html"&gt;data management&lt;/a&gt;.  His diagram specifically excludes Data integration services, BI applications and tools and Information delivery from the DW appliance functionality.&lt;br /&gt;&lt;br /&gt;So when the illuminate profiling loader provides a report on incoming data quality, is that just data management or is it data integration services?  Or perhaps one could call it information delivery?&lt;br /&gt;&lt;br /&gt;A display of the extended metadata from the illuminate dictionary shows the range of values in a field, the total count of values and the count of unique values.  If the field is numeric, it also shows min, max, average and standard deviation.  An ordered display of all values and the associated count of occurrences is also available.  Is that information delivery or possibly a BI tool?&lt;br /&gt;&lt;br /&gt;The line between the appliance and the services is blurry.  The features described above, and others, were included in the illuminate database system to facilitate data management. However, they are frequently used to assist in discovery and analytical processes.  BI and data mining users can use these features to enhance their processes.&lt;br /&gt;&lt;br /&gt;I would still call the illuminate database an appliance; it just has a few extra features.  As the database becomes more aware of the context and meaning of the data it manages, these extra features will broaden until it will be nearly impossible to identify the end of the appliance and the start of related services.  Another discussion on this topic a year from now will be very interesting.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-4059223069711755895?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2008/06/when-is-appliance-not-appliance.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-3219084067121118052</guid><pubDate>Thu, 29 May 2008 11:11:00 +0000</pubDate><atom:updated>2008-06-02T12:05:04.780-07:00</atom:updated><title>Denormalizing your world</title><description>How many tables are in your data warehouse?  500?  1000?  5000?  More?  One data warehouse that I know of has approximately 500,000 tables!&lt;br /&gt;&lt;br /&gt;Is it any wonder that business people can not use a &lt;a href="http://www.i-lluminate.com/advanced_data_warehouse_design.html"&gt;data warehouse&lt;/a&gt;?  Even most data marts are more complex than the business activity they represent.  The paradox here is that if you try to list all of the real world entities that constitute a business information environment it is very hard to find more than fifty or sixty and a good case could be made for as few as twenty.&lt;br /&gt;&lt;br /&gt;Where then do all of these tables come from and what do they mean?  Most of them are the result of optimizing data structures to improve the performance of a computer.  Good computer performance is gained at the expense of good business performance—that seems like a poor tradeoff.&lt;br /&gt;&lt;br /&gt;Normalizing data structures and metadata rather than normalizing data is the first step in making information accessible to business users.  A data structure that never repeated a metadata value, including repeats in the form of contrived names to represent the same thing in different contexts and levels of aggregation, would be the most understandable structure for business information users.  This means one table for all customer data regardless of source or usage, providing a &lt;a href="http://www.i-lluminate.com/service_company_customer_profiling.html"&gt;single customer view&lt;/a&gt;.  The same goes for product, service, order, invoice, etc.&lt;br /&gt;&lt;br /&gt;In an environment where one table represents one entity the average business user could easily navigate through the tables and BI for the masses would become a more realistic concept.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-3219084067121118052?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2008/05/denormalizing-your-world.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-8529381097678514061</guid><pubDate>Tue, 20 May 2008 12:36:00 +0000</pubDate><atom:updated>2008-05-23T03:53:35.250-07:00</atom:updated><title>B-Eye Podcast from TDWI</title><description>During the whirlwind that was last week's &lt;a href="http://www.tdwi.org/"&gt;Data Warehouse Institute (TDWI)&lt;/a&gt; conference in Chicago, I had the opportunity to sit down with David Loshin of the &lt;a href="http://www.b-eye-network.com/listen/7440"&gt;Business Intelligence Network&lt;/a&gt; for a rapid-fire discussion on data warehouse design, implementation and use.&lt;br /&gt;&lt;br /&gt;In this podcast, David and I talk about how organizations can quickly deploy enterprise data warehouses, eliminate the business requirements definition phase and use all of their data for functions such as &lt;a href="http://www.i-lluminate.com/financial_data_analysis_dashboards.html"&gt;finance and performance management&lt;/a&gt;. We also discuss how a correlation data warehouse lowers the barrier to entry by reducing both up-front and ongoing maintenance costs, and enables business users to perform ad hoc analysis, discovery and exploration.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.b-eye-network.com/listen/7440"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://ianalyze.net/uploaded_images/tdwi_podcast-752536.jpg" alt="" border="0" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-8529381097678514061?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2008/05/b-eye-podcast-from-tdwi.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-8291193866592290975</guid><pubDate>Thu, 15 May 2008 13:04:00 +0000</pubDate><atom:updated>2008-05-21T15:06:52.956-07:00</atom:updated><title>What's in a Name?</title><description>Naming of data objects has always been an interesting issue. Long ago, programmers named files and fields arbitrarily, at their own whim. Later, most IT departments standardized naming rules. Now there are industry standard naming conventions like &lt;a href="http://metadata-standards.org/11179/"&gt;International Standard ISO/IEC 11179-5&lt;/a&gt; and complete naming systems like the &lt;a href="http://dublincore.org/"&gt;Dublin Core Metadata Initiative&lt;/a&gt;. Yet most business users today could not even begin to interpret the meaning of the names used for the hundreds or even thousands of tables and the multiple thousands of columns in an enterprise&lt;a href="http://www.i-lluminate.com/data_warehouse_analytics_tools.html"&gt; information management infrastructure&lt;/a&gt;. These names identify the information they use every day but the names are so unintelligible that no one could interpret them without a written standard.&lt;br /&gt;&lt;br /&gt;Why should objects that are used every day by the business user have names that are incomprehensible?&lt;br /&gt;&lt;br /&gt;The primary reason is that data objects are structured to optimize computer performance rather than to reflect the business structures. A common thing like a customer record quickly becomes split into five or ten or more tables in &lt;a href="http://www.i-lluminate.com/advanced_data_warehouse_design.html"&gt;third normal form&lt;/a&gt; and business information users can no longer find anything.&lt;br /&gt;&lt;br /&gt;The best solution to date has been to build supplementary data structures especially for user access and name all objects in those structures according to a centrally managed naming control system. The business users then only see a data mart with a small subset of all data objects and are usually able to learn the meaning of that subset of names.&lt;br /&gt;&lt;br /&gt;A better solution for naming is to keep just one data structure for each real world object and give the attributes of that structure names that align with the business names. In this environment, every business entity is represented in a single table and each attribute occurs once and has a meaningful name. Each customer has one customer record and each product has one product record. All information access systems use these tables and access the same data using the same names. The &lt;a href="http://www.i-lluminate.com/correlation_database_engine.html"&gt;data marts, star schemas and OLAP cubes&lt;/a&gt; of conventional systems are just ordinary queries in this environment.&lt;br /&gt;&lt;br /&gt;This concept, as simple as it may seem, has a couple of issues that keep it from being implemented. The first issue is computer performance. The massive redundancy created by this structure would be crippling in an RDBMS environment. The solution for this issue is simply to use the &lt;a href="http://www.i-lluminate.com/value_based_storage_VBS_model.html"&gt;Value Based Structure&lt;/a&gt; in the CDBMS. In this structure redundant data storage does not cause any problem and normalizing provides no gain.&lt;br /&gt;&lt;br /&gt;The second problem is application design. Applications are still designed to interact with data stores the same way they did thirty years ago. New data storage capabilities provide functions that didn’t exist thirty years ago and applications should be upgraded to use the best that is available. However, that is a discussion for another day.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-8291193866592290975?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2008/05/whats-in-name.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-5146910152991698197</guid><pubDate>Sat, 19 Apr 2008 14:51:00 +0000</pubDate><atom:updated>2008-04-19T05:48:03.055-07:00</atom:updated><title>When is data mining not data mining?</title><description>Product affinity is one of those processes that used to be relegated to the statisticians in the back room.  The process was so time consuming and expensive that it wasn’t done except for very high-profile situations.  Also, the results frequently proved to be of little value.  On one discovery project I found an amazing affinity between bananas and dog food.   When I told the client, he laughed and said, “Check for bananas and toilet paper.”  Sure enough there was a strong affinity there as well.  It seems that bananas have an affinity with almost everything in the store.&lt;br /&gt;&lt;br /&gt;With high costs, and results like "bananas and everything," affinity was one of the more obscure data mining processes.    Now though, we can do product affinity as a simple query.  To find the products that sell with a selected product, we select the product, ask for the baskets that contained the product and then select all of the products sold in those baskets.  Ordering the results by units sold gives us a list of the top items sold with our selected product.&lt;br /&gt;&lt;br /&gt;Using the &lt;a href="http://www.i-lluminate.com/incremental_database_queries.html"&gt;incremental query&lt;/a&gt; feature of the &lt;a href="http://www.i-lluminate.com/data_warehouse_exploration_tool.html"&gt;iCorrelate&lt;/a&gt; query screen, the real value of affinity can be extracted from tons of raw data.  For example, when we get the baskets that contain our selected product, we can select only those baskets from a specific time period or a store or region or only the baskets from weekends or mornings.  Whatever behavior characteristic we are interested can be used to get the affinity of a selected subset of all baskets.&lt;br /&gt;&lt;br /&gt;We can also extend the affinity beyond baskets if customer information is available.  When we have the desired baskets, we ask for the customers who purchased those baskets.  Then asking for the baskets related to those customers, then the items in those baskets, we get product affinity at the customer level.  Rather than market basket analysis, we are doing customer purchase analysis.  Selecting only the baskets from a promotion provides another analysis of promotion effectiveness.&lt;br /&gt;&lt;br /&gt;The old data mining process for product affinity had limited value and high cost.  The incremental query method however, has low cost and high value making it an excellent tool for product managers, promotion planners and other business people who need to analyze shopping behavior.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-5146910152991698197?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2008/04/when-is-data-mining-not-data-mining.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-1665699322364811227</guid><pubDate>Wed, 09 Apr 2008 15:24:00 +0000</pubDate><atom:updated>2008-04-09T09:44:16.356-07:00</atom:updated><title>An inexpensive data warehouse?</title><description>What makes a data warehouse system expensive or affordable? Let’s take a quick look at two examples; one based on an RDBMS that is free and another based on a CDBMS that sells for $100K. To keep it simple, I’ll use 100 gigabytes of raw data as the input to these systems and assume a small number of users directly on the warehouse and a larger number of users with OLAP screens, dashboards using star schemas, etc. The costs of the front end systems will be essentially the same for both systems so they are not included.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Design – RDBMS:&lt;/span&gt; Six calendar months, about $150,000 in labor cost.&lt;br /&gt;Get user requirements defined, design level one schema, design level two schema, design all required OLAP cubes, star schemas and other information access structures, prepare load, indexing and aggregation processes, prepare physical design.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Design – CDBMS:&lt;/span&gt; Three calendar days, about $750 in labor cost&lt;br /&gt;Define aggregations, prepare load and aggregation processes.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Startup – RDBMS:&lt;/span&gt; 20 calendar days, about $15,000 in labor cost&lt;br /&gt;Load data, index level one. Build and index level two. Extract data for access structures, build and index structures. Validate all levels for accuracy and completeness.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Startup – CDBMS:&lt;/span&gt; Four calendar days, about $1000 in labor cost&lt;br /&gt;Load data and validate for accuracy and completeness&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Operation – RDBMS:&lt;/span&gt; One full time person, about $75,000 per year&lt;br /&gt;Weekly load and update and rebuild of all access structures. Design and build of new access structures as needed. Periodic reorganization of level one.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Operation – CDBMS:&lt;/span&gt; Less than 1/10 of one full time person&lt;br /&gt;Weekly loads.&lt;br /&gt;&lt;br /&gt;It looks like even "free" gets expensive if you can’t reduce the time required to design, build and maintain the data warehouse. Of course, this comparison ignores the opportunity cost of not having the information available for six months. And this sample is a small data warehouse. The cost of implementing a data warehouse is roughly proportional to the amount of data and the number of users. A system with a terabyte of data and a thousand users would be much higher. The cost of creating the warehouse with both the RDBMS and the CDBMS would increase by at least a factor of ten.&lt;br /&gt;&lt;br /&gt;Hmmm, ten times $165,000 or ten times $1,750; that seems like an easy decision.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-1665699322364811227?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2008/04/inexpensive-data-warehouse.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-1956292387930708765</guid><pubDate>Wed, 26 Mar 2008 07:19:00 +0000</pubDate><atom:updated>2008-03-30T06:24:52.857-07:00</atom:updated><title>Airplanes and Data Warehouses</title><description>Recently I have been flying more than a pilot, and while idly passing time waiting until I could get my computer back out I thought about how similar the progression of flying is to the progression of data warehouses.  A hundred years ago, a pilot could use a plane to go places where there were no direct roads.  He could fly faster than a car could drive or a horse could run.  Fifty years ago a technician could run a program in a computer that would store and retrieve information faster and more accurately than any person.&lt;br /&gt;&lt;br /&gt;Airplanes got faster, better and cheaper.  Soon ordinary people like me routinely flew around the world.  Computers got faster, better and cheaper.  Soon every person in their business and personal life was using them every day.&lt;br /&gt;&lt;br /&gt;However, I had to travel from Barcelona, Spain to San Diego, California.  To accomplish this simple task, I flew from Barcelona to Amsterdam.  There I waited in an airport for a couple of hours before flying to Minneapolis.  After a few hours there, I flew on to San Diego.  The return to Barcelona was a similar trip.&lt;br /&gt;&lt;br /&gt;With the computer, if I want to see information about sales for a day, I have to start by accumulating all of the daily transactions.  Then I group them and sum the important statistics like volume and net prices.  Then I need to build my star schema.  Now I can use by BI tool to finally see the sales.  If I want to see all of the sales of non-food products to female customers that were made on weekday mornings, it would be more difficult than a trip from an Inuit village in Alaska to a jungle village deep in the Congo basin.&lt;br /&gt;&lt;br /&gt;The planes meet 80% of the travel needs well and the computers meet 80% of the information needs well.  And better, faster, cheaper technology will not improve that ratio much for either technology.  In fact, as people become more used to the technologies, the 80% portion is taken for granted and there is increased demand for the 20% part.  Soon the ratio changes to 70/30 or 60/40.&lt;br /&gt;&lt;br /&gt;New technologies like a personal transporter or a correlation database are needed to change the old cast-in-concrete methods.  I don’t know where to find the personal transporter, but with a &lt;a href="http://www.i-lluminate.com"&gt;correlation database&lt;/a&gt; you can ask any question directly from the raw data as soon as it is loaded.  You can change the direction of the questioning at any time.  You can move from one question to the next without starting over and tracing out a new path.  It gives you the ability to navigate through your information as you need to rather than through the “airports” that were built to designate paths for you.  You can arrive quickly at your desired destination even if you weren't sure of it when you started your journey.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-1956292387930708765?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2008/03/airplanes-and-data-warehouses.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-1881617820701173380</guid><pubDate>Tue, 18 Mar 2008 19:46:00 +0000</pubDate><atom:updated>2008-03-18T12:51:40.557-07:00</atom:updated><title>But how fast can you load? Part 2</title><description>The key consideration in load time for a data warehouse is &lt;span style="font-style: italic;"&gt;time to value&lt;/span&gt;: the elapsed time between having a raw data file ready to load into the data warehouse and being able to use that data for analysis.&lt;br /&gt;&lt;br /&gt;A correlation database, using the value-based storage model, provides faster time to value than record-based relational or column-oriented columnar database approaches because there is no predefinition of business requirements needed, no need to predefine database schema, and no post-design of data cubes. The correlation database creates its own optimized index structure during data loading, which includes all relationships within the data. And it &lt;span style="font-style: italic;"&gt;never&lt;/span&gt; needs restructuring.&lt;br /&gt;&lt;br /&gt;So how easy is to load data and have it ready for analysis? This easy:&lt;br /&gt;&lt;br /&gt;&lt;object width="425" height="355"&gt;&lt;param name="movie" value="http://www.youtube.com/v/Kv08rgaJc78&amp;hl=en"&gt;&lt;/param&gt;&lt;param name="wmode" value="transparent"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/Kv08rgaJc78&amp;hl=en" type="application/x-shockwave-flash" wmode="transparent" width="425" height="355"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-1881617820701173380?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2008/03/but-how-fast-can-you-load-part-2_18.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">1</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-6489701391111656404</guid><pubDate>Tue, 18 Mar 2008 16:47:00 +0000</pubDate><atom:updated>2008-03-18T12:56:10.246-07:00</atom:updated><title>An incremental approach</title><description>Here is a real query from hell in the making.  One of our customers wanted to prepare a list of customers for a promotion.  They wanted the usual selection criteria like geographic and demographic selections and they wanted at least 5,000 good names and not more than 6,000.&lt;br /&gt;&lt;br /&gt;With the incremental approach, you ask for one qualification at a time and you can review the results between each step.  The marketer who was preparing the promotion was able to select the customers from the area where the promo was going to take place.  A quick look showed that there were far more customers there than they wanted to contact so another selection on a demographic code narrowed the result but there were still too many names.&lt;br /&gt;&lt;br /&gt;Scanning through some of the results, it was quickly clear that there were many old customers who had not shopped recently.  Now the marketer had the option to go after the current customers and focus the promo on retention or go for the old ones to try to get them back.  Selecting the old ones resulted in a few more names than were really desired.  Another quick scan through the data showed a wide range in average spending.  An addition to the query selected the big spenders resulted in the right sized group and the promotion was launched.  Total time for the discovery process was less than one hour.&lt;br /&gt;&lt;br /&gt;Imagine attempting that with a typical DW! What was the original requirement?  It wasn’t really known.  If this had to be done as an IT project, it would never start since the requirements definition would take longer then the cycle between promotions.&lt;br /&gt;&lt;br /&gt;If the marketer had access and skill in SQL, then the same process could be attempted.  However, the probability of having all of the required fields indexed would be near zero.  If the exploration and business decisions had taken another direction and they went after active customers, a different set of fields would be required.  What are the chances that they would also be indexed?&lt;br /&gt;&lt;br /&gt;Without a doubt, there is a query from hell somewhere in this business process.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-6489701391111656404?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2008/03/incremental-approach.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-1794543217733357491</guid><pubDate>Wed, 12 Mar 2008 07:31:00 +0000</pubDate><atom:updated>2008-03-12T10:56:49.853-07:00</atom:updated><title>Value Based Storage</title><description>Just what is value based storage?  This is a common question I often need to answer.  It sounds like some kind of marketing jingle but, in fact, it is the underlying iLuminate storage structure technology that we are talking about.&lt;br /&gt;&lt;br /&gt;The iLuminate internal storage looks dramatically different from either row based storage (Oracle, SQL Server, etc) or column based storage (Sybase IQ, Vertica, etc).  Value Based Storage (VBS) stores the data values in sets (the "value pool") based on data type and each distinct value is stored only once.  The original record definition is maintained by an indexing system that links to values the value pool.  These two objects are enhanced by the extended metadata giving the iLuminate database its complete functionality.&lt;br /&gt;&lt;br /&gt;This storage structure provides several benefits.  The database is very small, in some cases even smaller than the raw data that went into it.  In addition to simply being small, it concentrates the most active elements of the data into very compact sets.  The value pool is often small enough to fit entirely in memory.  As a result, the slowest part of database processing, transfer of data from disk to memory, is dramatically reduced.&lt;br /&gt;&lt;br /&gt;The VBS also provides some data access features that can not be duplicated in any other structure.  A search can be executed starting with a data value in the pool.  By going from the value pool back to the index, it is possible to quickly locate every use of the value wherever is may be used in the logical record structures.&lt;br /&gt;&lt;br /&gt;This structure also enables our incremental query capability.  As the result of a query, the database returns a set of instance identifiers rather than a set of records.  This is because there are no records, only pointers and values.  With the response being a set of pointers, it is a simple matter to perform the next query step and then get the union or difference between the two sets of pointers for the result of the second query step.  This process can be continued indefinitely with the result set shrinking or growing as the new results are merged with the old.&lt;br /&gt;&lt;br /&gt;The extended metadata and the indexing structure and logic provide more features but that can be the subject of another discussion.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-1794543217733357491?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2008/03/value-based-storage.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3510987239348986069.post-910106813356191400</guid><pubDate>Wed, 05 Mar 2008 15:06:00 +0000</pubDate><atom:updated>2008-03-05T09:02:36.430-08:00</atom:updated><title>But how fast can you load?</title><description>When I talk to people about the Illuminate CDBMS, the first reaction I usually get from them is, “That sounds great, but how fast can you load data?”&lt;br /&gt;&lt;br /&gt;My usual answer is “How fast is fast enough?”&lt;br /&gt;&lt;br /&gt;This is not a flippant answer, but an honest question that is intended to start a discussion on the meaning of load speed.  The real measurement of load speed should be the time needed from starting the transfer of source data from the operational data stores, to the availability of the business information for the end user.  In a multi-layer information infrastructure, that will include time for:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Importing raw data&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Indexing the imported data&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Running all aggregation processes and building intermediate layer access structures as needed&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Further aggragations, rebuilding and re-indexing star schemas, OLAP cubes and any other customized, end use information delivery structures.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;In an RDBMS environment, steps 1 and 2 are very fast since there is very little indexing on the foundation data layer.  Depending on the hardware, communications methods and schema, this could be at a rate of perhaps 2 to 10 gigabytes per hour or even more in extreme cases with extreme hardware.&lt;br /&gt;&lt;br /&gt;Steps 3 and 4 are the most unpredictable since every organization will have differing amounts of work in each layer.  However, it is safe to say that this will be the longest part of the load process.  A factor of ten times the duration of steps 1 and 2 would not be unusual.  An overnight load process that loads for thirty minutes would then need five more hours to complete all of the remaining preparation steps.&lt;br /&gt;&lt;br /&gt;In the illuminate environment, the load is a single step with some aggregations needed in limited cases.  When data is loaded into &lt;a href="http://www.i-lluminate.com/illuminate_server.html"&gt;iLuminate&lt;/a&gt;, the load process indexes everything.  This means that step 2 in this case is zero.&lt;br /&gt;&lt;br /&gt;Because everything is indexed, there is no star schema or OLAP cube building required.  The foundation layer with 100% indexing already provides the full functionality of those structures.  Also, due to the indexing, aggregations will not be needed as soon as they are with an RDBMS.  The iLuminate database will minimize the I/O needed for the aggregation so the computational speed of the computer is the determining factor for aggregations. Small aggregations can be done as a part of the query process without impacting the user response time.&lt;br /&gt;&lt;br /&gt;While an exact comparison can not be made because of variability of hardware, schema design and the number of information layers, an example comparison can demonstrate the conceptual difference between loading with a conventional system and loading with iLuminate.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.ianalyze.net/uploaded_images/blog_table-797662.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://www.ianalyze.net/uploaded_images/blog_table-797646.jpg" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;So the answer to the original question is that iLuminate loads very fast.  More fast enough for almost any enterprise.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3510987239348986069-910106813356191400?l=ianalyze.net'/&gt;&lt;/div&gt;</description><link>http://ianalyze.net/2008/03/but-how-fast-can-you-load.html</link><author>noreply@blogger.com (Joe)</author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></item></channel></rss>
