<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:georss="http://www.georss.org/georss" xmlns:gd="http://schemas.google.com/g/2005" gd:etag="W/&quot;DkMCSX8-cCp7ImA9WxNUF0o.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672</id><updated>2009-11-09T08:47:48.158-05:00</updated><title>Star Schema Central</title><subtitle type="html">Information and Resources for Data Warehousing Professionals</subtitle><link rel="http://schemas.google.com/g/2005#feed" type="application/atom+xml" href="http://blog.oaktonsoftware.com/feeds/posts/default" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/" /><link rel="hub" href="http://pubsubhubbub.appspot.com/" /><link rel="next" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default?start-index=26&amp;max-results=25&amp;redirect=false&amp;orderby=published&amp;v=2" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email></author><generator version="7.00" uri="http://www.blogger.com">Blogger</generator><openSearch:totalResults>27</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><link rel="self" href="http://feeds.feedburner.com/StarSchemaCentral" type="application/atom+xml" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com" /><entry gd:etag="W/&quot;DEIMQnY4eyp7ImA9WxNWFEs.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-2056986628644950511</id><published>2009-10-13T09:58:00.008-04:00</published><updated>2009-10-13T16:43:03.833-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-10-13T16:43:03.833-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Aggregates" /><category scheme="http://www.blogger.com/atom/ns#" term="Q and A" /><title>Highly Summarized Dimensions</title><content type="html">From the inbox today, a question about aggregate design:&lt;br /&gt;
&lt;blockquote&gt;Q:&amp;nbsp; "If you roll up a dimension that is hierarchical, but you are now left with the highest level of the hierarchy (no further roll-up would be possible in that dimension) - would you maintain a separate dimension for this in the aggregated star?&lt;br /&gt;
&lt;br /&gt;
"Or would you join a number of these unrelated top hierarchy levels in a junk dimension?"&lt;br /&gt;
&lt;br /&gt;
- Dublin, Ireland&lt;br /&gt;
&lt;/blockquote&gt;The reader has a potential aggregate that will summarize more than one dimension at a very high level.&amp;nbsp; For example, an aggregate of sales that summarizes the customer dimension by country and summarizes the product dimension by category. &lt;br /&gt;
&lt;br /&gt;
As a general rule of thumb, try to set up aggregate dimensions that preserve the original separation of the base dimension tables.&amp;nbsp; So rather than have country and category combined as a sort of "junk" dimension, create a rollup of customer called "country" and a rollup of product called "category."&amp;nbsp; &lt;br /&gt;
&lt;br /&gt;
&lt;a href="http://2.bp.blogspot.com/_oaFlRokTed0/StSGZ_isHvI/AAAAAAAAAE4/obaok7of7Do/s1600-h/3634884155_43002641b4_m.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/_oaFlRokTed0/StSGZ_isHvI/AAAAAAAAAE4/obaok7of7Do/s320/3634884155_43002641b4_m.jpg" /&gt;&lt;/a&gt;This gives the aggregate star a kind of "symmetry" with the original star.&amp;nbsp; It tends to be the most workable solution with aggregate-aware query tools, databases, etc.&amp;nbsp; It also is the easiest to manage when no tools are aggregate-aware, as it minimizes the effort required to rewrite SQL.&lt;br /&gt;
&lt;br /&gt;
If the summary is very small, it may be workable to create a "big wide table" that stores the dimension values directly with the summary facts.&amp;nbsp; This works best if you are using a tool to implement the aggregate that is able to exploit the query rewrite function of a database, like a materialized view or materialized query table.&amp;nbsp; In this case, applications can still query the "base" star, but receive the benefit of the aggregate.&lt;br /&gt;
&lt;br /&gt;
But do you really need this aggregate?&amp;nbsp; The more highly summarized an aggregate is, the fewer queries or reports it is able to accelerate.&amp;nbsp;&amp;nbsp; It is usually possible to construct aggregates that are not quite so highly summarized, giving a performance boost to a wider range of queries.&amp;nbsp; For example, summarizing a customer dimension by region, rather than country, will allow the aggregate to work with more reports--including those that focus on country.&lt;br /&gt;
&lt;br /&gt;
- Chris&lt;br /&gt;
&lt;br /&gt;
&lt;div style="text-align: right;"&gt;Image credit:&amp;nbsp; &lt;a href="http://www.flickr.com/photos/zapthedingbat/3634884155/" target="_blank"&gt;ZapTheDingbat&lt;/a&gt;&lt;br /&gt;
&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-2056986628644950511?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=2056986628644950511&amp;isPopup=true" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/2056986628644950511?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/2056986628644950511?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2009/10/highly-summarized-dimensions.html" title="Highly Summarized Dimensions" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/_oaFlRokTed0/StSGZ_isHvI/AAAAAAAAAE4/obaok7of7Do/s72-c/3634884155_43002641b4_m.jpg" height="72" width="72" /><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></entry><entry gd:etag="W/&quot;CkMMR3o9fSp7ImA9WxNQF04.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-4808446127441953238</id><published>2009-09-08T09:20:00.022-04:00</published><updated>2009-09-23T14:28:06.465-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-09-23T14:28:06.465-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Q and A" /><category scheme="http://www.blogger.com/atom/ns#" term="Design Techniques" /><title>More on Surrogate Keys</title><content type="html">This post addresses a common follow-up question regarding surrogate keys:&lt;br /&gt;
&lt;blockquote&gt;"My source system already tracks changes.  In this case, do dimension tables really need their own unique identifiers?"&lt;br /&gt;
&lt;/blockquote&gt;Yes.  There may be exceptions, but they are far less common than you may think, and they limit your future options.&lt;br /&gt;
&lt;br /&gt;
&lt;a name='more'&gt;&lt;/a&gt;In a previous post, I laid out &lt;a href="http://blog.oaktonsoftware.com/2009/05/do-i-really-need-surrogate-keys.html"&gt;the case for surrogate keys&lt;/a&gt;.&amp;nbsp; In short:&lt;br /&gt;
&lt;ol&gt;&lt;li&gt;By not reusing unique identifiers from a source system, dimensions are free to respond to changed data in whatever manner makes most sense from an analytic perspective.  &lt;/li&gt;
&lt;li&gt;Surrogate keys are the best option, though a multi-part key is also possible.&lt;/li&gt;
&lt;/ol&gt;It is tempting to assume that if the source system logs all changes, there is no need to maintain a separate unique identifier for dimensions. I would seem that they can maintain all history, while reusing the natural keys from the source system. Usually, this is not the case.&lt;span style="font-weight: bold;"&gt;&amp;nbsp;&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;span style="font-weight: bold;"&gt;A Dimension is Not Equivalent to an Entity&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
Unfortunately, a dimension in a data warehouse &lt;span style="font-style: italic;"&gt;rarely &lt;/span&gt;corresponds to a single entity in an operational system.  Operational systems are more highly normalized; most dimensions correspond to more than one table in a source system.  A customer dimension, for example, may draw information from multiple source tables: customer, household, address, phone_number, demographics and so forth.&lt;br /&gt;
&lt;br /&gt;
A change log for the central entity corresponding to a dimension may not reflect changes to some of the related tables, especially when they represent dependent entities or are linked through intersect tables.  If this is the case, it will be impossible for the dimension to undergo a &lt;a href="http://blog.oaktonsoftware.com/2007/10/for-slowly-changing-dimensions-change.html"&gt;type 2 response&lt;/a&gt; when ancillary information changes.&lt;br /&gt;
&lt;br /&gt;
If &lt;span style="font-style: italic;"&gt;every &lt;/span&gt;source table that feeds a dimension is logged, it may be possible to design a multi-part unique identifier that is completely recycled.  (As observed previously, compound keys in dimensions are &lt;a href="http://blog.oaktonsoftware.com/2009/05/do-i-really-need-surrogate-keys.html"&gt;feasible but not optimal&lt;/a&gt;.)   This approach can lock you into type 2 processing for defining elements of the related tables, even when a type 1 response is desired.&lt;br /&gt;
&lt;br /&gt;
&lt;span style="font-weight: bold;"&gt;Surrogate Keys Will Limit Future Rework&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
Even in the rare case when a source system handles all changes exactly the way we want, reuse of the natural keys limits future options.&lt;br /&gt;
&lt;br /&gt;
If the source system is modified, the desired behavior may not be maintained.  It will either be necessary to engage in significant rework of the dimensional model (and all dependent reports), or simply live with the unsatisfactory change handling of the source system.&lt;br /&gt;
&lt;br /&gt;
Mergers and acquisitions may also bring additional source systems that correspond to an existing dimension.  These systems arrive complete with entirely different identifiers.  Incorporating them as sources for an existing dimension will always necessitate some rework of the dimension.  But if surrogate keys are in place, fact tables will be unaffected and existing reports will still work.  If, on the other hand, natural keys are the basis of unique identifiers, rework will cascade to related fact tables and also to existing queries and reports.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-4808446127441953238?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=4808446127441953238&amp;isPopup=true" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/4808446127441953238?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/4808446127441953238?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2009/09/more-on-surrogate-keys.html" title="More on Surrogate Keys" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></entry><entry gd:etag="W/&quot;DEcFRns4eCp7ImA9WxNQFkg.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-6755874805203079392</id><published>2009-07-27T11:48:00.016-04:00</published><updated>2009-09-22T17:46:57.530-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-09-22T17:46:57.530-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Strategy" /><category scheme="http://www.blogger.com/atom/ns#" term="Announcements" /><title>Recommended Books on the Data Warehouse Lifecycle</title><content type="html">&lt;span style="font-style: italic;"&gt;Recommended Reading:  A new book by Laura Reeves, and a revised edition of the classic Lifecycle Toolkit.&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
If you've been to any of my classes, you already know that I am a fan of Laura Reeves.  She has a pragmatic, get-things-done approach to data warehousing.&lt;br /&gt;
&lt;br /&gt;
You may also know her as co-author of the original edition of &lt;span style="font-style: italic;"&gt;The Data Warehouse Lifecycle Toolkit&lt;/span&gt;, a book she wrote with Ralph Kimball, Margy Ross and Warren Thornthwaite.  (For more on that book, see below.)&lt;br /&gt;
&lt;br /&gt;
&lt;a href="http://www.amazon.com/gp/product/0470176385?ie=UTF8&amp;amp;tag=oaktonsoftwar-20&amp;amp;linkCode=as2&amp;amp;camp=1789&amp;amp;creative=9325&amp;amp;creativeASIN=0470176385" target="blank"&gt;&lt;img alt="" border="0" id="BLOGGER_PHOTO_ID_5363177520301629154" src="http://2.bp.blogspot.com/_oaFlRokTed0/Sm3VihTwXuI/AAAAAAAAAEA/-uN_YR4i0H0/s200/0470176385.jpg" style="cursor: pointer; float: left; height: 200px; margin: 0pt 10px 10px 0pt; width: 159px;" /&gt;&lt;/a&gt;Laura has a new book out, which I highly recommend: &lt;a href="http://www.amazon.com/gp/product/0470176385?ie=UTF8&amp;amp;tag=oaktonsoftwar-20&amp;amp;linkCode=as2&amp;amp;camp=1789&amp;amp;creative=9325&amp;amp;creativeASIN=0470176385" target="blank"&gt;A Manager's Guide to Data Warehousing&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
In this book, she provides a practical guide to planning and executing data warehouse projects. It is written for managers (I.T. and business) who do not necessarily have a technical background in data warehousing.&lt;br /&gt;
&lt;br /&gt;
Laura touches on each phase of the data warehouse lifecycle, providing useful advice without over-burdensome methodology, detailed task lists or the like.  This makes it easy to fit her advice into your own organization's development style.&lt;br /&gt;
&lt;br /&gt;
Even if you already have a strong background in dimensional design, you will find this book to be quite useful.  You can get it at &lt;a href="http://www.amazon.com/gp/product/0470176385?ie=UTF8&amp;amp;tag=oaktonsoftwar-20&amp;amp;linkCode=as2&amp;amp;camp=1789&amp;amp;creative=9325&amp;amp;creativeASIN=0470176385" target="blank"&gt;Amazon.com&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;span style="font-weight: bold;"&gt;Also Recommended&lt;/span&gt;&lt;br /&gt;
If you have a dimensional data warehouse, I also urge you to check out &lt;a href="http://www.amazon.com/gp/product/0470149779?ie=UTF8&amp;amp;tag=oaktonsoftwar-20&amp;amp;linkCode=as2&amp;amp;camp=1789&amp;amp;creative=390957&amp;amp;creativeASIN=0470149779" target="blank"&gt;The Data Warehouse Lifecycle Toolkit, Second Edition&lt;/a&gt; by Ralph Kimball, Margy Ross, Warren Thornthwaite, Joy Mundy and Bob Becker.&lt;br /&gt;
&lt;br /&gt;
&lt;a href="http://www.amazon.com/gp/product/0470149779?ie=UTF8&amp;amp;tag=oaktonsoftwar-20&amp;amp;linkCode=as2&amp;amp;camp=1789&amp;amp;creative=390957&amp;amp;creativeASIN=0470149779" target="blank"&gt;&lt;img alt="" border="0" id="BLOGGER_PHOTO_ID_5363177695194996658" src="http://3.bp.blogspot.com/_oaFlRokTed0/Sm3Vss1lX7I/AAAAAAAAAEI/_a2gadbuVm4/s200/0470176385.jpg" style="cursor: pointer; float: left; height: 200px; margin: 0pt 10px 10px 0pt; width: 159px;" /&gt;&lt;/a&gt;This fully revised version of the classic book contains detailed tasks and deliverables to help you manage all phases of the data warehouse lifecycle.&lt;br /&gt;
&lt;br /&gt;
It is an excellent reference for data warehousing professionals. Read more about it at &lt;a href="http://www.amazon.com/gp/product/0470149779?ie=UTF8&amp;amp;tag=oaktonsoftwar-20&amp;amp;linkCode=as2&amp;amp;camp=1789&amp;amp;creative=390957&amp;amp;creativeASIN=0470149779" target="blank"&gt;Amazon.com&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
The original edition has been a long time recommendation on this blog, and the new edition carries on the standard.  (Apologies to Warren Thornthwaite, whose name was previously misspelled here.)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-6755874805203079392?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=6755874805203079392&amp;isPopup=true" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/6755874805203079392?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/6755874805203079392?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2009/07/recommended-books-on-data-warehouse.html" title="Recommended Books on the Data Warehouse Lifecycle" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/_oaFlRokTed0/Sm3VihTwXuI/AAAAAAAAAEA/-uN_YR4i0H0/s72-c/0470176385.jpg" height="72" width="72" /><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></entry><entry gd:etag="W/&quot;CUMEQHg4fip7ImA9WxJRGU8.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-2786463412159998632</id><published>2009-05-20T12:34:00.007-04:00</published><updated>2009-05-21T12:30:01.636-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-05-21T12:30:01.636-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Slow Changes" /><category scheme="http://www.blogger.com/atom/ns#" term="Basics" /><category scheme="http://www.blogger.com/atom/ns#" term="Q and A" /><title>Do I really need Surrogate Keys?</title><content type="html">Here is the #1 most frequently question that people ask of me:&lt;br /&gt;&lt;blockquote&gt;Q: Do I really need surrogate keys?&lt;/blockquote&gt;&lt;span style="font-style: italic;"&gt;A: You &lt;span style="font-weight: bold;"&gt;absolutely must &lt;/span&gt;have a unique identifier for each dimension table, one that does &lt;span style="font-weight: bold;"&gt;not&lt;/span&gt; come from a source system.  A surrogate key is the best way to handle this, but there are other possibilities.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The case for the surrogate key is entirely pragmatic.  Read on for a full explanation.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Dimensions Need their Own Unique Identifier&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;It is crucial that the dimensional schema be able to handle changes to source data in whatever manner makes most sense from an analytic perspective.  This may be different from how the change is handled in the source.&lt;br /&gt;&lt;br /&gt;For this reason, every dimension table needs its own unique identifier -- not one that comes from a source system.  (For more on handling changes, start with &lt;a href="http://blog.oaktonsoftware.com/2007/10/for-slowly-changing-dimensions-change.html"&gt;this post&lt;/a&gt;.)&lt;br /&gt;&lt;br /&gt;A surrogate key makes the best unique identifier. It is simply an integer value, holding no meaning for end users.   A single, compact column, it keeps fact table rows small and makes SQL easy to read and write.  It is simple to manage during the ETL process.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Compound Keys Work, But Why Take that Route?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&lt;/span&gt;An alternative is to supplement a unique identifier from the source system (also known as a natural key) with a sequence number.  This results in a compound  key, or multi-part key.&lt;br /&gt;&lt;br /&gt;This kind of compound key also allows the dimension to handle changes differently than the source does.  But there are several disadvantages:&lt;span style="font-style: italic;"&gt;&lt;span style="font-style: italic;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;ul&gt;&lt;li&gt;Fact table rows become larger, as they must include the multi-part key for each dimension&lt;/li&gt;&lt;li&gt;The ETL process must manage a sequence for each natural key value, rather than a single sequence for a surrogate key&lt;/li&gt;&lt;li&gt;SQL becomes more difficult to read, write or debug&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Multi-part dimension keys can disrupt star join optimizers&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;So:  a compound key takes more space, is not any easier, and may disrupt performance.   Why bother?&lt;br /&gt;&lt;br /&gt;(By the way, many source system identifiers are &lt;span style="font-style: italic;"&gt;already &lt;/span&gt;compound keys.  Adding a sequence number will make them even larger!)&lt;br /&gt;&lt;br /&gt;Sometimes it is suggested that the natrual key be supplemented with a date, rather than a sequence.  This may simplify ETL slightly, but the rest of the drawbacks remain.  Plus, dates are even bigger than a sequence number.  Worse,  the date will appear in the fact table.  That is sure to lead to trouble!&lt;br /&gt;&lt;br /&gt;(This is not to say that datestamps on dimension rows are a bad thing.  To the contrary, they &lt;a href="http://blog.oaktonsoftware.com/2008/03/responding-to-star-schema-detractors.html"&gt;can be quite useful&lt;/a&gt;.  They just don't work well as part of a compound key.)&lt;br /&gt;&lt;br /&gt;That's the basic answer.  I'll respond to some common follow up questions over the coming weeks.&lt;br /&gt;&lt;br /&gt;- Chris&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-2786463412159998632?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=2786463412159998632&amp;isPopup=true" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/2786463412159998632?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/2786463412159998632?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2009/05/do-i-really-need-surrogate-keys.html" title="Do I really need Surrogate Keys?" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></entry><entry gd:etag="W/&quot;C08HQn8yfSp7ImA9WxJTGEg.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-2703032116659126686</id><published>2009-04-27T12:22:00.005-04:00</published><updated>2009-04-27T13:03:53.195-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-04-27T13:03:53.195-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Q and A" /><category scheme="http://www.blogger.com/atom/ns#" term="Design Techniques" /><title>More on Distinct Counts</title><content type="html">Shortly before my &lt;a href="http://blog.oaktonsoftware.com/2009/03/q-is-unique-count-fact-or-dimension.html"&gt;recent post&lt;/a&gt; on unique counts, a reader sent in a related question that takes a slightly different perspective:&lt;br /&gt;&lt;blockquote&gt;As fact tables could easily have a hierarchy of keys, the counts using the lowest grain can be counted via a count(*), and anything above that needs a count(distinct &lt;span style="font-style: italic;"&gt;column_a&lt;/span&gt;).&lt;br /&gt;&lt;br /&gt;Do you see many applications accommodating count(*)'s…by that I mean, making separate higher grain fact tables so ad-hoc users do not have to use a count distinct?&lt;br /&gt;&lt;br /&gt;P. Petrini&lt;/blockquote&gt;As the reader points out, unique counts (or distinct counts) can be problematic for end users to put together.  This may be particularly if you are using a SQL-generating Business Intelligence (BI)  tool.&lt;br /&gt;&lt;br /&gt;For example, suppose an orders fact table has one row per order line.  Its major dimensions are customer, product, salesperson and day.  To count the number of distinct products ordered on a day, you would need to place &lt;span style="font-family:courier new;"&gt;count(distinct product_name&lt;/span&gt;)  in your SQL.&lt;br /&gt;&lt;br /&gt;That is something a developer can do in her sleep, but may be a bit much for a power-user.  And it may be difficult to get a BI tool to generate this kind of SQL.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&lt;/span&gt;The reader correctly points out that this would not be necessary if the grain of the fact table precisely matches whatever it is that needs to be counted.  All that would be necessary would be &lt;span style="font-family:courier new;"&gt;count(*) &lt;/span&gt;or something along those lines.  So a summary table &lt;span style="font-style: italic;"&gt;might &lt;/span&gt;help....&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The Issue&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The problem with this is that the summary table will simplify &lt;span&gt;one &lt;/span&gt;&lt;span style="font-style: italic; color: rgb(51, 51, 255);"&gt;and only one &lt;/span&gt;kind of count.&lt;br /&gt;&lt;br /&gt;There will probably be a &lt;span style="font-style: italic;"&gt;variety &lt;/span&gt;of &lt;span style="font-style: italic;"&gt;different&lt;/span&gt;, levels or time periods across which you want distinct counts,  and each would require its own aggregate.&lt;br /&gt;&lt;br /&gt;This could quickly become a lot of work -- simply to make it easier to formulate certain kinds of queries.  And while it may make a particular query easier to express, it introduces new complexity for the user, who must now choose precisely the right star for each query.&lt;br /&gt;&lt;br /&gt;For example, a summary fact table that has one row for each product for each day makes it easier to count the number of products that sold on a day.  But this aggregate must omit other dimensions like salesperson and customer to be useful in this regard.  If you also want to count other distinct things by day, each will require its own aggregate.  Also want distinct counts by month?  More aggregates.&lt;br /&gt;&lt;br /&gt;Not very efficient, unfortunately.  The SQL for counting distinct things is simplified, to be sure.  But now the ETL process is doing a lot more work, and users are faced with choosing the right aggregate for each query they build.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Alternatives&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;My suggestion is that, rather than build fact tables that eliminate the need for count(distinct), build and cache reports that do the work.  People interested in these counts can access the reports, instead of writing SQL.  If these counts are common, you might also find it easier to use an OLAP tool.&lt;br /&gt;&lt;br /&gt;Not perfect, I know, but nothing ever is.&lt;br /&gt;&lt;br /&gt;This is not to say that there is anything wrong with creating a series of summary tables for this purpose, or perhaps storing some pre-computed counts &lt;a href="http://blog.oaktonsoftware.com/2009/03/q-is-unique-count-fact-or-dimension.html"&gt;as discussed earlier&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;As people who attend my classes know, my motto is "be pragmatic, not dogmatic."  As long as everyone is aware of the pros and cons, and understands how it affects each aspect of the data warehouse implementation, then if there is a consensus that adding summary tables is the best way to make life easier, go ahead and do it.&lt;br /&gt;&lt;br /&gt;Chris&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Many thanks to P. Petrini, for consenting to have his question appear here. &lt;br /&gt;&lt;br /&gt;If you have a question, you can email it to me. Click my profile picture at the top of this page for the address. I try to answer all questions, and might even print yours here&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-2703032116659126686?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=2703032116659126686&amp;isPopup=true" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/2703032116659126686?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/2703032116659126686?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2009/04/more-on-distinct-counts.html" title="More on Distinct Counts" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></entry><entry gd:etag="W/&quot;A0QMSHw4fCp7ImA9WxJTEEU.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-3270530745697022944</id><published>2009-04-18T14:42:00.008-04:00</published><updated>2009-04-18T17:16:29.234-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-04-18T17:16:29.234-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Aggregates" /><category scheme="http://www.blogger.com/atom/ns#" term="Design Techniques" /><title>Dealing with Period-to-Date Measurements (MTD, YTD and so forth)</title><content type="html">A few weeks ago, I &lt;a href="http://blog.oaktonsoftware.com/2009/01/q-customers-with-no-sales-in-fact-table.html"&gt;touched on&lt;/a&gt; some problems you might encounter  when trying to store period-to-date measurements in a fact table.  Today, I want to address the issue more directly.&lt;br /&gt;&lt;br /&gt;Period-to-date measurements can summarize just about any fact over a period of time.  A measurement of month-to-date sales , for example, aggregates sales from the beginning of the month to the date in question.  Month-to-date, quarter-to-date and year-to-date facts are often used to do things like compute commissions, volume discounts, etc.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Period-to-date Facts and Additivity&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Most facts in a star schema can be summed up across any and all dimensions.  Sales dollars, for example, can be summed across days, customers, stores, or whatever dimension makes sense to the person doing the analysis.  We call this kind of fact &lt;span style="font-style: italic;"&gt;fully additive&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;The basic issue with storing period-to-date facts is this:  &lt;span style="font-style: italic;"&gt;they are never fully additive&lt;/span&gt;.   It does not make any sense to sum up period-to-date measurements that were taken at different times.  Month-to-date sales for a customer on January 14, 15 and 16, for example, cannot be meaningfully added together.&lt;br /&gt;&lt;br /&gt;Period-to-date measurements are &lt;span style="font-style: italic;"&gt;semi-additive&lt;/span&gt;; the can be summed up across some dimensions, but not time.&lt;br /&gt;&lt;br /&gt;It &lt;span style="font-style: italic;"&gt;might &lt;/span&gt;make sense to aggregate together several period-to-date measurements if they were taken &lt;span style="font-style: italic;"&gt;at the same time&lt;/span&gt;.  The year-to-date sales of all customers who made a purchase on a particular day, for example, might factor into some useful analysis.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Storing a period-to-date fact with transactions&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Since it makes no sense to sum up period-to-date measurements across different points in time, they have very limited use in a transaction-grained fact table.&lt;br /&gt;&lt;br /&gt;For example,  a transaction-grained fact table that stores a row for each order-line of each order might contain dimensions that capture the order_line number, date and time of the order, product sold, and so forth.  You might store a period-to-date measurement in such a fact table, but there there will be little or no opportunity to sum this fact across different order lines.  Only orders with the same date/time can be meaningfully summed.  And if there are multiple order-lines for a given order, even this may not make sense, since the period-to-date measurement would be stored redundantly.&lt;br /&gt;&lt;br /&gt;Stored in this kind of fact table, the period-to-date measurement has very limited use.  It can really only be used to study the most granular data in the fact table.  For example, it might be used to compute an applicable discount, or a commission percentage that escalates with sales volume.&lt;br /&gt;&lt;br /&gt;But is this needed in the fact table?  Such metrics are usually highly operational.  They may &lt;span style="font-style: italic;"&gt;influence &lt;/span&gt;other metrics, which have more analytic value and happen to be additive.  The commission paid or discount given, for example, may be computed using period-to-date information, but itself is fully additive. This will be useful to store in a fact table.&lt;br /&gt;&lt;br /&gt;This does not mean that period-to-date measurements have no place in a transaction-grained fact table.  If they are central to the business, or to analysis of the process, then place them there.  They may, however, cause excessive growth in the row size, since it is possible to enumerate multiple period-to-date versions of any given fact.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Storing a period-to-date fact with period aggregates or snapshots&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;The natural home for a period-to-date measurement is a fact table with a periodic-snapshot grain, or a fact table that aggregates across the time dimension.&lt;br /&gt;&lt;br /&gt;An aggregate that summarizes the time dimension sums up data from a transaction-based fact table over a particular period.  For each period summarized, the resulting rows all summarize &lt;span style="font-style: italic;"&gt;the same &lt;/span&gt;period.  This means that each can contain period-to-date measurements that &lt;span style="font-style: italic;"&gt;might &lt;/span&gt;be meaningfully summed, at least within the period.  (Note I said "might."  There is a caveat, which I will get to in a second.)&lt;br /&gt;&lt;br /&gt;An aggregate that records monthly sales totals, for example, is potentially a good place to include a quarter-to-date or year-to-date measurement.  Each row in the table will summarize the same period, so the period-to-date measurements may be meaningfully summarized.&lt;br /&gt;&lt;br /&gt;The same goes for a periodic snapshot fact table.  Like a periodic aggregate, each row in a snapshot summarizes a particular period.  The main difference is that a snapshot will contain some form of unique status measurement that cannot be found in the transactions, such as a balance or level.  Once again, a semi-additive period-to-date fact can be stored here.&lt;br /&gt;Snapshots record the status of things at fixed time intervals.  A monthly snapshot of bank accounts, for example, records the day-end balance of each account.  This is a logical place to store month-to-date or other such&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Caution:  Exactly what is being measured period-to-date?&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;Even within an aggregate or snapshot period, period-to-date facts may not be additive.  It is important to ask yourself: &lt;span style="font-style: italic;"&gt;exactly what is being measured period-to-date?&lt;/span&gt;  If it does not coincide with the grain of the table, it may be non-additive.&lt;br /&gt;&lt;br /&gt;For example, a monthly aggregate stores sales by product and customer.  If we want to store year-to-date customer sales in this table, the same fact will be repeated multiple times if a single customer purchased multiple products.  It is not additive across products; it is repeated if the customer bought more than one product.&lt;br /&gt;&lt;br /&gt;On the other hand, if the period-to-date measurement captures sales by customer and product, it coincides with the grain of the table.  It can be meaningfully aggregated across any dimension but time.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Reporting&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;The old stand-by is to compute a period-to-date measurement in a query or report.  It is calculated from transactions at run-time.  This allows access to any conceivable period-to-date metric, and may also be used as a &lt;a href="http://blog.oaktonsoftware.com/2009/01/q-customers-with-no-sales-in-fact-table.html"&gt;behavioral qualification&lt;/a&gt; rather than a fact.&lt;br /&gt;&lt;br /&gt;Such queries may be particularly nasty or time consuming, but they can be run during batch windows and cached to compensate.&lt;br /&gt;&lt;br /&gt;My next post will be another Q&amp;amp;A.  Please send in your questions!&lt;br /&gt;&lt;br /&gt;Chris&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-3270530745697022944?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=3270530745697022944&amp;isPopup=true" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/3270530745697022944?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/3270530745697022944?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2009/04/dealing-with-period-to-date.html" title="Dealing with Period-to-Date Measurements (MTD, YTD and so forth)" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></entry><entry gd:etag="W/&quot;C0IAQ3o9cCp7ImA9WxVbEUU.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-8786082441416344638</id><published>2009-03-27T13:34:00.001-04:00</published><updated>2009-03-27T15:19:02.468-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-03-27T15:19:02.468-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Aggregates" /><category scheme="http://www.blogger.com/atom/ns#" term="Design Techniques" /><title>Is a Snapshot an Aggregate?</title><content type="html">&lt;span style="font-style: italic;"&gt;&lt;/span&gt;&lt;span style="font-style: italic;"&gt;Students in my classes often ask me if a snapshot fact table is an aggregate.  My short answer is that it doesn't matter.  But here's the long answer.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This is really a matter of semantics more than anything else.  What it boils down to is how you define an aggregate table.  Is it simply a table that summarizes data from other tables?  Then yes, most of the time a snapshot will fit this definition -- but not always.   However, I like to reserve the word "aggregate" for a very specific kind of summary.  If you do as well, then the answer is no. Never.&lt;br /&gt;&lt;br /&gt;Before getting into that, though, here is some background.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Snapshot Fact Tables&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A snapshot, for the uninitiated, is a fact table that periodically samples a measurement.  A snapshot always includes at least one measurement of  status, such as a balance or level.  A transaction fact table, in contrast, captures measurements as they occur.&lt;br /&gt;&lt;br /&gt;The snapshot is useful for describing the effect of transactions, rather than transactions themselves.  The best example is your bank account.  Its all well and good if you have a record of each deposit, check, interest payment, fee, direct debit, and so forth.  But what is your balance?  Aggregating all your transactions is not the most expedient way to figure that out.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Sometimes Snapshots Summarize Transactions, but not Always&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;People sometimes think of snapshots as aggregates because they can be constructed from a table that contains transactions.  That is certainly true in the case of account balances.  If you have the transactions, you can compute the snapshots.&lt;br /&gt;&lt;br /&gt;In this sense, you might be justified in thinking of  a snapshot as an aggregate.  But I don't.  I'll explain why in just a second.&lt;br /&gt;&lt;br /&gt;First, though, it is important to understand that not all snapshots are constructed from  transactions.  Sometimes, the number of transactions is too numerous to keep around.  But in other cases,  &lt;span style="font-style: italic;"&gt;there simply aren't any&lt;/span&gt;.   For example, we can't make use of transactions to describe the ambient temperature on the factory floor, or the level of pressure inside an industrial device.  In these cases, the snapshot is clearly not a summary.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Snapshots Contain "new" Facts&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;So how about those other situations -- the ones where there &lt;span style="font-style: italic;"&gt;are &lt;/span&gt;transactions that can be used to create a snapshot?  Isn't a snapshot an aggregate in this case?&lt;br /&gt;&lt;br /&gt;It really depends on how you define an aggregate.&lt;br /&gt;&lt;br /&gt;The word aggregate is normally reserved for a summary table that does not transform the structure of its corresponding base table.  For example, suppose account transactions are recorded in a fact table, with a fact called &lt;span style="font-family:courier new;"&gt;transaction_amount.&lt;/span&gt; A monthly aggregate would have exactly the same fact.  Still called &lt;span style="font-family:courier new;"&gt;transaction_amount&lt;/span&gt;, is the same as the fact in the base table.  The only difference is in level of detail:  it represents a month's worth of transactions, rather than individual transactions.  It is the same measurement.&lt;br /&gt;&lt;br /&gt;Account balance (or other status measurements like levels) is a &lt;span style="font-style: italic;"&gt;not the same &lt;/span&gt;as what the base fact table records.  It is &lt;span style="font-style: italic;"&gt;not &lt;/span&gt;an aggregation of transaction that occurred during the snapshot period; it is something more.  It describes the &lt;span style="font-style: italic;"&gt;net effect &lt;/span&gt;of the account transactions.   It is a different measurement.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Conclusion&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I use the term &lt;span style="font-weight: bold;"&gt;derived table &lt;/span&gt;to describe any table in a dimensional design uses another as its source.  I save the word &lt;span style="font-weight: bold;"&gt;aggregate &lt;/span&gt;for non-transformative summaries.&lt;br /&gt;&lt;br /&gt;To sum things up, then:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Yes, a snapshot sometimes summarizes a transaction fact table.  &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;But if you accept these definitions for aggregate tables and derived tables, the answer is different:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;A snapshot is is sometimes a derived table, but never an aggregate.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This is just a matter of semantics, and not very important to me.  Call it whatever you want, and I will be fine with it.  The important thing is that you have a dimensional design that enables useful analytics.&lt;br /&gt;&lt;br /&gt;- Chris&lt;br /&gt;&lt;span style="font-style: italic;"&gt;&lt;span style="font-style: italic;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-8786082441416344638?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=8786082441416344638&amp;isPopup=true" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/8786082441416344638?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/8786082441416344638?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2009/03/is-snapshot-aggregate.html" title="Is a Snapshot an Aggregate?" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></entry><entry gd:etag="W/&quot;C0MMQXk8eCp7ImA9WxVUGU0.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-6327202582694668553</id><published>2009-03-24T09:22:00.003-04:00</published><updated>2009-03-24T09:31:20.770-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-03-24T09:31:20.770-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Announcements" /><title>Feed Updated, Now Contains Full Posts</title><content type="html">If you follow this blog in a newsreader, you will be happy to know that the &lt;a href="http://feeds.feedburner.com/StarSchemaCentral"&gt;RSS feed&lt;/a&gt; now contains full posts,  rather than the first couple of paragraphs.  That should make your reading experience a lot easier.&lt;br /&gt;&lt;br /&gt;The change seems to have been glitch-free, but if there are any issues please let me know.&lt;br /&gt;&lt;br /&gt;Chris&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-6327202582694668553?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=6327202582694668553&amp;isPopup=true" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/6327202582694668553?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/6327202582694668553?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2009/03/feed-updated-now-contains-full-posts.html" title="Feed Updated, Now Contains Full Posts" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></entry><entry gd:etag="W/&quot;CkIHRn44eSp7ImA9WxVUFEQ.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-7247367203714729383</id><published>2009-03-19T13:11:00.004-04:00</published><updated>2009-03-19T15:22:17.031-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-03-19T15:22:17.031-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Q and A" /><category scheme="http://www.blogger.com/atom/ns#" term="Design Techniques" /><title>Q&amp;A: Is a Unique Count a Fact or Dimension?</title><content type="html">&lt;span style="font-style: italic;"&gt; A reader recently sent in a question about unique counts.  Are they facts, or dimensions?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Here is a paraphrased version of their questions.  The business example has been changed.&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;span style="font-weight: bold;"&gt;Q:&lt;/span&gt; We report on the number of unique customers that visit a location. We do this at daily, monthly and annual levels, and also look at the same thing across all locations. We originally modeled the customer counts as facts. The problem is that if we compute unique customer counts by day, they do not correctly roll up to month. Does it make more sense to treat this as a dimension?&lt;/blockquote&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;A:&lt;/span&gt; First, let me say that customer counts are certainly facts. They are important metrics that are colored by the dimensions we use to define them.&lt;br /&gt;&lt;br /&gt;The difficulty with counts of distinct things, as you have discovered, is that they are&lt;span style="font-style: italic;"&gt; &lt;/span&gt;&lt;span&gt;&lt;span style="font-weight: bold;"&gt;non-additive&lt;/span&gt;&lt;/span&gt;&lt;span style="font-style: italic;"&gt; &lt;/span&gt;facts. As soon as you use a set of dimensions to create a distinct count -- such as day, location and customer -- you have an aggregation that cannot be further summed. If you need to go to the monthly level, or across all locations, you need to go back to the granular data.&lt;br /&gt;&lt;br /&gt;You have a few options.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Option 1&lt;/span&gt;:&lt;span style="font-weight: bold;"&gt;  Compute in reports&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The first is to compute unique counts in reports. This can be time consuming, and a major hassle if there are several counts that must be computed frequently. You might be able to schedule the reports to run in a batch window. This deals with the processing issues, but not the report or query complexity.&lt;br /&gt;&lt;br /&gt;In my view, however, it is the best option.  There are other possibilities, but each has significant drawbacks.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Option 2: Create snapshots or aggregates&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A second option is to create some snapshot fact tables that compute the various distinct counts at various periodic intervals. These will contain non-additive facts, and your developers will have to understand that they should not be aggregated. Depending on their design, these tables may more closely resemble aggregate tables.&lt;br /&gt;&lt;br /&gt;This approach may require quite a few fact tables, since each of your unique counts will have a different grain -- daily by location, monthly by location, annually by location, daily across all locations, monthly across all locations, etc.&lt;br /&gt;&lt;br /&gt;To make life a little easier for report developers, designers sometimes replicate the monthly and annual totals in the daily level snapshot. Developers find this appealing, because all the information to produce subtotals over time can be fetched from a single row, rather than three queries. The flipside is that the monthly and annual data is stored redundantly in a lot of places, and must be maintained consistently and used properly.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Option 3: Behavioral Dimensions&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;If the current period is the primary focus, a third option is to create behavioral dimensions. For example, in the store dimension, you can add attributes that represent counts of unique customers for the current month and year. I mentioned behavioral dimensions in a &lt;a href="http://blog.oaktonsoftware.com/2009/01/q-customers-with-no-sales-in-fact-table.html"&gt;previous Q&amp;amp;A post&lt;/a&gt;, and the concept is the same here.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Option 4: Creating Special Facts that can be Summed&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Lastly, some designers like to create special facts in the fact table that can be used to sum unique counts. If you have a fact table that gets a row each time a customer visits a location, you may be able to tweak it a bit. Add an additional fact called &lt;span style="font-family:courier new;"&gt;first_visit_of_month &lt;/span&gt;that only contains a 1 the first time the customer visits &lt;span style="font-style: italic;"&gt;any &lt;/span&gt;location during the month.  &lt;span style="font-family:courier new;"&gt;First_visit_of_year &lt;/span&gt;will only contain a 1 the first time the customer visits during the year. These facts can be summed to produce unique counts for various time frames, but you need to be careful not to aggregate them across the periods they represent.&lt;br /&gt;&lt;br /&gt;Another drawback is that, once again, numerous special facts will be required. To compute unique counts by location, you will need facts like first_visit_this_location_this_month. This fact should not be aggregated across months or locations. That's a lot of rules, and they are likely to be broken.&lt;br /&gt;&lt;br /&gt;As you can see, no option looks perfect. Don’t let that lead you to push back on requirements, however. Distinct counts are often very important metrics, particularly when studying customer service or processing efficiency.&lt;br /&gt;&lt;br /&gt;Thanks for the question!&lt;br /&gt;&lt;br /&gt;- Chris&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;If you have a question, you can email it to me. Click my profile picture at the top of this page for the address. I try to answer all questions, and might even print yours here.&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-7247367203714729383?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=7247367203714729383&amp;isPopup=true" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/7247367203714729383?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/7247367203714729383?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2009/03/q-is-unique-count-fact-or-dimension.html" title="Q&amp;A: Is a Unique Count a Fact or Dimension?" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></entry><entry gd:etag="W/&quot;C0MNQnk_cSp7ImA9WxVUGU0.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-7190557097448818542</id><published>2009-03-19T09:00:00.005-04:00</published><updated>2009-03-24T09:31:33.749-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-03-24T09:31:33.749-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Announcements" /><title>Privacy Policy</title><content type="html">Recent changes to the Google AdSense program prompted me to define a privacy policy for this blog.  Here it is:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-weight: bold;"&gt;Oakton Software does not gather any information about your visits to this blog, or the pages you view.&lt;/span&gt;&lt;/blockquote&gt;&lt;br /&gt;I wish the policy could be that simple, but you should also know the following:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Blogger:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The blog is hosted by Google's Blogger.com.&lt;br /&gt;&lt;br /&gt;If you have a Blogger account, Google may log information about your visits to this blog.  The Blogger privacy policy can be found by visiting the &lt;a href="http://www.google.com/privacy.html"&gt;Google Privacy Center&lt;/a&gt; and clicking on the link for Blogger.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Newsletter:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;If you subscribe to the newsletter, the information you provide is used to send occasional email updates.  These always include an opt-out link.&lt;br /&gt;&lt;br /&gt;The mail list is maintained and managed by an external service provider.  They follow a confirmed opt-in policy to ensure compliance with anti-spam legislation.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Ads By Google:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This blog displays advertising that is served by the Google AdSense program.&lt;br /&gt;&lt;br /&gt;You have probably seen these ads all over the internet.  Google uses a cookie to serve the ads.  The cookie allows Google to show you ads based on this and other sites you have visited that feature ads by Google.  It is not used to gather information like your name or email address.&lt;br /&gt;&lt;br /&gt;Google provides a way to opt out of this interest-based advertising program.  Visit their  &lt;a href="http://www.google.com/privacy_ads.html"&gt;Advertising and Privacy Page&lt;/a&gt; for an opt-out link,  or simply click on the Ads By Google header that appears over their ads.  You can also see what Google has decided are your interests by visiting their &lt;a href="http://www.google.com/ads/preferences/"&gt;Ads Preference Manager&lt;/a&gt; page.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-7190557097448818542?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=7190557097448818542&amp;isPopup=true" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/7190557097448818542?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/7190557097448818542?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2009/03/privacy-policy.html" title="Privacy Policy" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></entry><entry gd:etag="W/&quot;CkIHRn44eSp7ImA9WxVUFEQ.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-6815046542630841705</id><published>2009-02-16T16:21:00.003-05:00</published><updated>2009-03-19T15:22:17.031-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-03-19T15:22:17.031-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Design Techniques" /><title>Accounting Transactions Have Multiple Dates</title><content type="html">&lt;span style="font-style: italic;"&gt;Building a financial data mart? Don't let the multiple dates associated with accounting transactions confuse you.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A standard feature of financial systems often befuddles star schema designers.  When someone inputs a transaction into a financial system, the transaction has at least two associated dates.  One represents the date that the event happened, and the second represents the date the transaction was entered into the system.  The first is usually called the transaction date, while the second is called the date posted.  Sometimes, there are more dates as well.&lt;br /&gt;&lt;br /&gt;For example, if a check is written, the record in the accounting system will have two dates.  The &lt;span style="font-style: italic;"&gt;transaction date&lt;/span&gt; will be the date on the check.   The &lt;span style="font-style: italic;"&gt;date posted &lt;/span&gt;will be the day that someone in the accounting department actually entered the transaction into the system.  (Some financial systems may have other dates as well, such as an audit date or an edited date that applies to comments.)&lt;br /&gt;&lt;br /&gt;Don’t let these dates confuse you.  If you are designing a financial data mart, you do not need to choose between them; both dates will be useful, and both should appear in your model.  A fact table that logs general ledger activity, for example, will have one row per transaction.  That row can contain two references to a day dimension; one represents the transaction date and one represents the posting date.  Name the keys appropriately.  For example, you can call them &lt;span style="font-family:courier new;"&gt;day_key_transaction&lt;/span&gt;, &lt;span style="font-family:courier new;"&gt;day_key_posted&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;The &lt;span style="font-style: italic;"&gt;transaction date &lt;/span&gt;will be the one that is used to drive most reporting.  This is the date that properly associates the transaction with an accounting period.   You would use it to summarize transactions by day, month or fiscal period, for example.&lt;br /&gt;&lt;br /&gt;The &lt;span style="font-style: italic;"&gt;date posted&lt;/span&gt; will be of lesser use, but may also occasionally serve analytic purposes--especially when compared to the transaction date. For example, auditors may want to look for transactions posted one or more months after they actually took place. Managers might want to improve their efficiency by studying lags between transaction date and date posted.&lt;br /&gt;&lt;br /&gt;If you are also going to build a monthly snapshot by account, then the star for account transactions may serve as its source.  In that case, the date posted will help the ETL developers maintaining the snapshot.  During each processing window, they can simply look for any transactions that were posted prior to the last time the snapshot was updated, and process only those rows.&lt;br /&gt;&lt;br /&gt;- Chris&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-6815046542630841705?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=6815046542630841705&amp;isPopup=true" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/6815046542630841705?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/6815046542630841705?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2009/02/accounting-transactions-have-multiple.html" title="Accounting Transactions Have Multiple Dates" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></entry><entry gd:etag="W/&quot;Ak8HSX45eip7ImA9WxVXGEw.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-373440023359332376</id><published>2009-02-16T15:50:00.006-05:00</published><updated>2009-02-16T16:13:58.022-05:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-02-16T16:13:58.022-05:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Q and A" /><category scheme="http://www.blogger.com/atom/ns#" term="Design Techniques" /><title>Q&amp;A: Chosing the right time dimension for a Snapshot</title><content type="html">&lt;span style="font-style: italic;"&gt;In this Q&amp;amp;A, a reader asks how to designate a time dimension for a fact table that is a snapshot. &lt;/span&gt;&lt;br /&gt;&lt;blockquote&gt;Q:  We are designing a new star.  The fact table is a monthly snapshot.  It will capture several metrics each month for each open account.  My question is how to handle the time dimension.  We cannot agree whether it makes more sense to have a month dimension, or to use a month end date from a day dimension. What makes the most sense?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;- Name withheld&lt;/span&gt; &lt;span style="font-style: italic;"&gt;by request&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;First, I want to commend you for stating the grain of your fact table, and stating it clearly.  Grain confusion can cause many problems, and makes a question like yours even more difficult to answer.  Snapshots always have a time component to their grain; in your case it is monthly.  Your fact table will have one row per account per month.&lt;br /&gt;&lt;br /&gt;Now to your question:  I strongly prefer the use of a month dimension for a monthly snapshot.  This approach is non-ambiguous.  It is self-evident what each row represents; the chance for any confusion is minimal.&lt;br /&gt;&lt;br /&gt;When a monthly snapshot is designed with a day dimension, the designers are generally clear on what each row represents.  However, report developers or other users may misinterpret the data now or in the future.&lt;br /&gt;&lt;br /&gt;Designers who use a day dimension with a monthly snapshot generally choose the last day of the period represented, as you suggested.  The problem with this is that there are  many elements of a typical day dimension that simply do not apply to a monthly summary.  Tor example, a snapshot representing March 2009 would have the day 3/31/09 associated with it.  The day dimension has a lot of information about March 31, 2009 that does not apply to the snapshot, such as what day of the week it was, whether it was a holiday, and so forth.  It would be all too easy to misinterpret this information as being applicable to the facts.&lt;br /&gt;&lt;br /&gt;Even if everyone is clear on this at the start, a year or two down the road there are likely to be new members of your development team who may misinterpret the data.  The presence of a date can also lead to confusion about what a row represents.  If your monthly summary&lt;br /&gt;contains an average daily balance, for example, a user may confuse this for a balance as of the day in question.&lt;br /&gt;&lt;br /&gt;Confusion will also arise over what date to use for the current period.  During March 2009, you may be updating the snapshot rows for each account on a daily basis.  Using a day dimension, ETL developers may assign the particular date in March, which now serves double duty -- designating the snapshot period and the date loaded.&lt;br /&gt;&lt;br /&gt;Sometimes, there is a strong desire among technical team members to use a day dimension with a monthly snapshot.  If you cannot convince them otherwise, then you can avoid some confusion by creating a view for use with the monthly snapshot that hides the day-level attributes.  You may be able to do the same kind of “hiding” of non-applicable dimension attributes within your BI tool.  Use the month-end date for each fact table row.  If the ETL team wants to track the day each row was loaded, they can do this with a second relationship to the day dimension.  (This trick works nicely even if you do go with a month dimension.)&lt;br /&gt;&lt;br /&gt;Regardless of your approach, I also advise you to name the facts carefully.  If you want to include a balance as of the end of the period, name it &lt;span style="font-family:courier new;"&gt;period_end_balance&lt;/span&gt;.  An average daily balance for the period in question should be called &lt;span style="font-family:courier new;"&gt;average_daily_balance&lt;/span&gt;.  A column name like &lt;span style="font-family:courier new;"&gt;balance &lt;/span&gt;is simply too vague, and bound to be misused.&lt;br /&gt;&lt;br /&gt;Thanks for your question!&lt;br /&gt;&lt;br /&gt;- Chris&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;If you have a question, click my profile picture at the top of the page for my address.  I try to respond to all messages, and may even post an answer here.&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-373440023359332376?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=373440023359332376&amp;isPopup=true" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/373440023359332376?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/373440023359332376?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2009/02/q-chosing-right-time-dimension-for.html" title="Q&amp;A: Chosing the right time dimension for a Snapshot" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></entry><entry gd:etag="W/&quot;DE4BSH0-cSp7ImA9WxVRF0k.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-4037197849732621716</id><published>2009-01-23T15:34:00.006-05:00</published><updated>2009-01-23T16:42:39.359-05:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-01-23T16:42:39.359-05:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Q and A" /><category scheme="http://www.blogger.com/atom/ns#" term="Design Techniques" /><title>Q&amp;A: Customers with No Sales in Fact Table</title><content type="html">&lt;span style="font-style: italic;"&gt;In this post, Chris answer a reader's question about reporting on customers with no sales in the fact table.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;New Q&amp;amp;A Feature&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I get a lot of questions about dimensional modeling from readers of the blog, my books, and people who attend my classes.  I try to answer all of them.  I thought it would be interesting to post some of them here, for all to read.&lt;br /&gt;&lt;br /&gt;This week, I'm answering a question about reporting on customers with no sales.&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;In general what is the recommended approach for reporting customers that have no sales in the fact table for MTD or YTD? Or, for that matter, any dimension member where no fact exists for some time period?&lt;br /&gt;&lt;br /&gt;- J. Dolan&lt;/blockquote&gt;&lt;br /&gt;Usually, something like “customers with no sales for a period” is computed as part of the reporting process.  Developers often bristle at this kind of report, because it requires using a correlated subquery.  This makes it difficult to do with a SQL-generating BI tool, and can be slow.  And if you want to use the list of customers with no sales as a filter for another query, things can get really hairy.&lt;br /&gt;&lt;br /&gt;An alternative is to pre-compute this information, storing the results in a table that summarizes reporting periods.   This is likely to be a factless fact table, with foreign keys to dimension tables representing the period summarized and the dimension in question--customers in your case.   If you require monthly and yearly slices,  you will want to multiple summaries.  This approach may simplify reporting, but adds complexity to the ETL process.&lt;br /&gt;&lt;br /&gt;If current period is the only period of concern, as in your question, then it is also possible to pre-compute the information and store the result &lt;span style="font-weight: bold;"&gt;directly &lt;/span&gt;to the customer dimension table, as a flag.  The flag would specify whether the customer has placed an order during the current period.   In your case, it would be two flags -- one for month to date and one for year to date.&lt;br /&gt;&lt;br /&gt;This kind of attribute is called a &lt;span style="font-weight: bold;"&gt;behavioral dimension &lt;/span&gt;because it takes behavioral activity recorded in the fact table and transforms it into a dimension.   Behavioral dimensions are very powerful, because they can be used as simple filters for other queries.&lt;br /&gt;&lt;br /&gt;Again, this concept trades ETL complexity for reporting flexibility and performance.  But isn't that what data warehousing is all about?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Thanks to J. Dolan for agreeing to have his question posted.  If you have a question, send it in.  I try to answer all my email, though it sometimes takes a while!&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Chris&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-4037197849732621716?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=4037197849732621716&amp;isPopup=true" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/4037197849732621716?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/4037197849732621716?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2009/01/q-customers-with-no-sales-in-fact-table.html" title="Q&amp;A: Customers with No Sales in Fact Table" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></entry><entry gd:etag="W/&quot;AkcGRHY5eCp7ImA9WxRRGUg.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-5004912775186655834</id><published>2008-10-02T09:23:00.011-04:00</published><updated>2008-10-02T10:33:45.820-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-10-02T10:33:45.820-04:00</app:edited><title>Coming to TDWI New Orleans?  Vote First!</title><content type="html">&lt;span style="font-style: italic;"&gt;If you'll be joining me at the TDWI conference in New Orleans during election week, make sure you vote ahead of time.  Here's how. &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This election day, I'll be teaching &lt;span style="font-weight: bold;"&gt;Advanced Dimensional Modeling&lt;/span&gt; at &lt;a href="http://www.tdwi.org/neworleans/" target="_blank"&gt;The Data Warehouse Institute World Conference&lt;/a&gt; in New Orleans.  I was a bit reluctant to accept the invitation at first, out of fear that it would prevent my vote from being counted.  But after doing a little bit of research, I accepted with confidence.&lt;br /&gt;&lt;br /&gt;You can attend with confidence as well. All you need is an absentee ballot.&lt;br /&gt;&lt;a href="http://govoteabsentee.org" target="_blank"&gt; &lt;img style="float: left; padding: 20px 20px 10px 0px; border-color: #ffffff" src="http://govoteabsentee.org/images/widgetalt.gif"  alt="GoVoteAbsentee.org / learn how to vote absentee!"/&gt;&lt;/a&gt;Absentee voting procedures vary by locality.  To find out how it works where you live, you can go to &lt;a href="http://GoVoteAbsentee.org" target="_blank"&gt;GoVoteAbsentee.org&lt;/a&gt;.  &lt;br /&gt;&lt;br /&gt;Plug in your zip code, and they'll provide you with information on how to apply for an absentee ballot.  They'll even give you a link to download the ballot application. If you prefer, you can contact your county or city government for information.&lt;br /&gt;&lt;br /&gt;The process is simple, and you can be sure your vote will be heard.  Some localities even permit residents to cast absentee ballots in person.  In Fairfax county Virgina, where I live,  you can personally feed your ballot into the optical scanner that tallies the votes.  &lt;br /&gt;&lt;br /&gt;And as an added bonus, you won't have to stand in line on election day!&lt;br /&gt;&lt;br /&gt;I have already cast my absentee vote.  If you will be traveling to The Data Warehouse Institute Conference in New Orleans, I urge you to do the same.  Go to &lt;a href="http://GoVoteAbsentee.org" target="_blank"&gt;GoVoteAbsentee.org&lt;/a&gt; now, and find out how.&lt;br /&gt;&lt;br /&gt;- Chris&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-5004912775186655834?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=5004912775186655834&amp;isPopup=true" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/5004912775186655834?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/5004912775186655834?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2008/10/coming-to-tdwi-new-orleans-vote-first.html" title="Coming to TDWI New Orleans?  Vote First!" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></entry><entry gd:etag="W/&quot;C0UMRXk5cCp7ImA9WxVaEU4.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-283177250852723935</id><published>2008-06-10T11:39:00.007-04:00</published><updated>2009-04-07T15:08:04.728-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-04-07T15:08:04.728-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Aggregates" /><category scheme="http://www.blogger.com/atom/ns#" term="Design Techniques" /><title>Available Now:  Designing a Performance Layer</title><content type="html">My article &lt;span style="font-style: italic;"&gt;"Designing A Performance Layer:  Aggregates and Derived Tables" &lt;/span&gt;appears in the current issue of &lt;span style="font-weight: bold;"&gt;The Business Intelligence Journal&lt;/span&gt; (Vol 13, No. 1),  available to members of The Data Warehouse Institute.&lt;br /&gt;&lt;br /&gt;This article shows you how to create a &lt;span style="font-style: italic;"&gt;performance layer&lt;/span&gt; for your dimensional data warehouse -- a set of cubes and tables designed specifically for performance.&lt;br /&gt;&lt;br /&gt;The article covers two types of performance design:  &lt;span style="font-style: italic;"&gt;aggregate tables&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;derived tables&lt;/span&gt;.&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Aggregate Tables &lt;/span&gt;store partially summarized copies of existing data.  These tables (or cubes) improve performance for queries that do not require granular detail.  The article fully explains aggregates, including summarization of metrics, hierarchies, and implementation with or without aggregate navigators.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Derived Tables&lt;/span&gt; store restructured copies of existing data.  They do not necessarily summarize information, but they may.  Derived tables deliver performance benefits by streamlining the query and reporting processes.  In the article, I show how to supplement an existing schema with merge, pivot, and set operation fact tables (or cubes).&lt;/li&gt;&lt;/ul&gt;The full article appears in &lt;span style="font-weight: bold;"&gt;The Business Intelligence Journal&lt;/span&gt;, which is available in print and &lt;a href="http://www.tdwi.org/Publications/BIJournal/index.aspx" target="_blank"&gt;on line&lt;/a&gt; for TDWI members.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-283177250852723935?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=283177250852723935&amp;isPopup=true" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/283177250852723935?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/283177250852723935?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2008/06/available-now-designing-performance.html" title="Available Now:  Designing a Performance Layer" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></entry><entry gd:etag="W/&quot;D0YBRH07fyp7ImA9WxZWE04.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-7251522539905068687</id><published>2008-03-12T10:53:00.004-04:00</published><updated>2008-03-12T11:19:15.307-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-03-12T11:19:15.307-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Slow Changes" /><category scheme="http://www.blogger.com/atom/ns#" term="Design Techniques" /><title>Responding to Star Schema Detractors with Timestamps</title><content type="html">&lt;span style="font-style: italic;"&gt;Time-stamped dimensions can pack additional analytic punch, compared with standard slow change techniques.  There is additional cost in the load process, but it may be worth it, especially if you plan to load historic data.&lt;/span&gt;    &lt;p class="MsoNormal"&gt;&lt;o:p&gt;&lt;/o:p&gt;This is the fourth in a series of posts on Slowly Changing Dimensions.&lt;span style=""&gt;  &lt;/span&gt;If you are just joining in, you might want to review &lt;a href="http://blog.oaktonsoftware.com/2007/10/for-slowly-changing-dimensions-change.html"&gt;Part I&lt;/a&gt;, &lt;a href="http://blog.oaktonsoftware.com/2007/11/how-to-document-type-3-scd-pair-of-type.html"&gt;Part II&lt;/a&gt; and &lt;a href="http://blog.oaktonsoftware.com/2008/01/have-it-both-ways-with-hybrid-slow.html"&gt;Part III&lt;/a&gt;.&lt;/p&gt;        &lt;p class="MsoNormal"&gt;&lt;o:p&gt;&lt;/o:p&gt;Star schema detractors often suggest that that a dimensional model throws away valuable detail.&lt;span style=""&gt;  &lt;/span&gt;Proponents of dimensional design counter that one of their basic design tenets is to set fact table grain at the lowest level of detail possible.&lt;span style=""&gt;  &lt;/span&gt;Fair enough.&lt;span style=""&gt; &lt;/span&gt;&lt;o:p&gt; &lt;/o:p&gt;But this response does not address the concern fully.&lt;span style=""&gt;  &lt;/span&gt;What about the dimensions?&lt;span style=""&gt;  &lt;/span&gt;&lt;br /&gt;&lt;/p&gt;    &lt;p class="MsoNormal"&gt;Designers who eschew the dimensional model for their enterprise data warehouse point out that they can tell you what a particular customer looked like on a given day, regardless of whether there were any sales.&lt;span style=""&gt;  &lt;/span&gt;A dimensional design using standard Type 2 techniques does not offer this capability.&lt;span style=""&gt;  &lt;/span&gt;It writes a new row to the dimension table each time a characteristic changes.&lt;span style=""&gt;  &lt;/span&gt;&lt;span style="font-style: italic;"&gt;But &lt;/span&gt;we can only link these versions of the customer to a point in time &lt;span style="font-style: italic;"&gt;if&lt;/span&gt; there is a transaction recorded in a fact table.&lt;/p&gt;    &lt;p class="MsoNormal"&gt;In the over-hyped "&lt;a href="http://blog.oaktonsoftware.com/2007/03/three-data-warehouse-architectures-that.html"&gt;religious wars&lt;/a&gt;" of data warehousing, this is one criticism of the star schema that I have always felt had some validity.&lt;span style=""&gt;  &lt;/span&gt;However, it is an easy one to address.&lt;span style=""&gt;  &lt;/span&gt;&lt;/p&gt;    &lt;p style="font-weight: bold;" class="MsoNormal"&gt;Time Stamped Dimensions&lt;/p&gt;    &lt;p class="MsoNormal"&gt;&lt;o:p&gt;&lt;/o:p&gt;The solution is so simple that I have encountered it being employed numerous times by groups with no formal training in dimensional design.&lt;span style=""&gt;  &lt;/span&gt;The solution is to add a pair of dates to the dimension table:&lt;span style=""&gt;  &lt;/span&gt;effective_date and expiration_date.&lt;/p&gt;    &lt;p class="MsoNormal"&gt;&lt;o:p&gt;&lt;/o:p&gt;When a row is added to the dimension table, it is given an effective_date that indicates when the change became effective.&lt;span style=""&gt;  &lt;/span&gt;Later, when a slow change occurs, two things occur.&lt;span style=""&gt;  &lt;/span&gt;The new record is inserted with the new effective_date, similar to a standard Type 2 response.&lt;span style=""&gt;  &lt;/span&gt;In addition, the previous record is updated: its expiration_date is set to the date before the change occurred.&lt;/p&gt;    &lt;p style="font-weight: bold;" class="MsoNormal"&gt;More ETL Work, but More Analytic Possibilities&lt;/p&gt;    &lt;p class="MsoNormal"&gt;This extra step may be a bit of additional work for the ETL developers, but it provides some additional analytic capability.&lt;span style=""&gt;  &lt;/span&gt;Using these dates, it is possible to:&lt;/p&gt;    &lt;ol&gt;&lt;li&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;span style="font-style: italic;"&gt;List a series of changes to something in sequence&lt;/span&gt;: order by the effective_date.&lt;/li&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;Determine the status of something at a point in time&lt;/span&gt;:&lt;span style=""&gt;  &lt;/span&gt;qualify the date in question as greater than or equal to the effective_date and less than or equal to the expiration_date&lt;/li&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;Quickly find the current value&lt;/span&gt; for a given dimension, by qualifying on expiration_date&lt;/li&gt;&lt;/ol&gt;For all this to work, it is useful to initialize expiration dates of current records to an actual date.&lt;span style=""&gt;  &lt;/span&gt;This avoids the tortured query predicates that are brought on by the use of NULL values.&lt;span style=""&gt;  &lt;/span&gt;I prefer to use the highest date value supported by the RDBMS, but that’s a topic for another post.&lt;br /&gt;&lt;br /&gt;&lt;p class="MsoNormal"&gt;As described, this is more of a "Date Stamped" dimension.  But if the granularity of significant change is finer than the day, it can easily be extended to include a time component.&lt;br /&gt;&lt;/p&gt;    &lt;p class="MsoNormal"&gt;Because there is a bit of extra ETL processing, it is worth considering whether the additional analytic value is worth it.  But there is also another consideration to take into account, and here the ETL developers may actually save some work.&lt;br /&gt;&lt;/p&gt;&lt;p style="font-weight: bold;" class="MsoNormal"&gt;Another Benefit&lt;/p&gt;    &lt;p class="MsoNormal"&gt;Having the timestamps on our conformed dimensions will make it easier to load past history when we build new fact tables in the future. This is especially handy if we are building the warehouse one subject area at a time.&lt;/p&gt;    &lt;p class="MsoNormal"&gt;&lt;o:p&gt;&lt;/o:p&gt;For example, our Customer dimension may be implemented first as part of an orders schema.&lt;span style=""&gt;  &lt;/span&gt;Later, it may come into play again as part of a receivables schema. When it comes time to load the receivables history into a fact table, each transaction will need to be associated with the correct historic version of the customer.&lt;span style=""&gt;  &lt;/span&gt;Without timestamps, ETL developers will need to use the operational data to identify exactly what the customer looked like at the time of the transaction, then search for a match in the customer dimension.&lt;/p&gt;    &lt;p class="MsoNormal" style=""&gt;&lt;o:p&gt;&lt;/o:p&gt;With timestamps, all the ETL developers will need is the customer_id.&lt;span style=""&gt;  &lt;/span&gt;We can easily find the correct version of the customer in the dimension table by referring to the timestamps.&lt;/p&gt;  &lt;p class="MsoNormal" style=""&gt;&lt;o:p&gt;For this simple reason alone, I encourage designers to use time-stamps on the dimensions that are important across the enterprise--the &lt;span style="font-style: italic;"&gt;conformed dimensions&lt;/span&gt; that form the backbone of the data warehouse. &lt;/o:p&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-7251522539905068687?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=7251522539905068687&amp;isPopup=true" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/7251522539905068687?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/7251522539905068687?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2008/03/responding-to-star-schema-detractors.html" title="Responding to Star Schema Detractors with Timestamps" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></entry><entry gd:etag="W/&quot;CkcNQn86fCp7ImA9WB9aGEw.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-5681324636833723036</id><published>2008-01-08T11:05:00.000-05:00</published><updated>2008-01-08T11:34:53.114-05:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2008-01-08T11:34:53.114-05:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Slow Changes" /><category scheme="http://www.blogger.com/atom/ns#" term="Design Techniques" /><title>Have it Both Ways With Hybrid Slow Changes</title><content type="html">&lt;span style="font-style: italic;"&gt;This is the third in a series on Slowly Changing Dimensions.  It looks at an often misunderstood approach: the hybrid SCD.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;When the source data for a dimension table changes, the Slowly Changing Dimension characteristics of a dimensional model dictate how the warehouse schema should respond.  In my previous two posts, I examined the three most common techniques.  (See &lt;a href="http://blog.oaktonsoftware.com/2007/10/for-slowly-changing-dimensions-change.html"&gt;Part I&lt;/a&gt; &amp;amp; &lt;a href="http://blog.oaktonsoftware.com/2007/11/how-to-document-type-3-scd-pair-of-type.html"&gt;Part II&lt;/a&gt;.) When the source value for a dimension attribute value changes:&lt;ul&gt;&lt;li&gt;Type 1 response:  Update the existing value in the dimension table &lt;/li&gt;&lt;li&gt;Type 2 response:  Add a new row to the dimension table, with a new surrogate key&lt;/li&gt;&lt;li&gt;Type 3 response:  Maintain a single row in the dimension table, with separate columns for current and previous values of the dimension attribute in question.&lt;/li&gt;&lt;/ul&gt;Sometimes, one single technique does not fit the bill.  Requirements will come along that seem to call for both Type 1 and Type 2 treatment.  What to do?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;“We always want to do it this way.  But sometimes we want to do it that way.”  &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Consider a company building a typical orders star schema for business-to-business sales.  Customer companies are represented by a customer dimension table.  You are trying to determine what the correct slow change response should be when the name of a customer company changes.  In an interview, the first thing you are told is:&lt;br /&gt;&lt;blockquote&gt;“We always want to use the most current name for the company in our analysis…”&lt;/blockquote&gt;This sounds to you like a type 1 change, which looks reasonable.  After all, you might reason, one wouldn’t want this year vs. last year comparisons to be befuddled by different company names.  But no sooner have you noted this, the speaker continues:&lt;blockquote&gt;“...except for when we want to use the name that was in effect at the time of the transaction.”&lt;/blockquote&gt;Now things look more complicated.  To provide the company name as it existed at the time of an order would require a Type 2 response.  But that would conflict with the first requirement.&lt;br /&gt;&lt;br /&gt;What to do?  Don’t jump for Type 3.  That’s not the solution either.  A type 3 response will not tell you which value was in effect at the time of a given transaction.  (It also usually does not capture more than one previous value of the attribute – see the previous post.)&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Having it Both Ways&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The hybrid approach allows you to meet both requirements.  It requires that you model two dimension attributes that will capture the value in question – company name in this case.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;&lt;span style="font-family:courier new;"&gt;company_name_current&lt;/span&gt; &lt;/span&gt;will capture the current name of the company.  This attribute will exhibit Type 1 behavior.  When a company name changes, it will be updated.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;font-family:courier new;" &gt;company_name_historic&lt;/span&gt; will capture the detailed history of company names.  This attribute will exhibit Type 2 behavior.  When a company name changes, a new row will be created to capture the new value, and to associate with any subsequent transactions&lt;/li&gt;&lt;/ul&gt;&lt;div style="text-align: center;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_oaFlRokTed0/R4Oii5S6dfI/AAAAAAAAABw/GTTzpxm8qnU/s1600-h/hybrid_scd_1.jpg"&gt;&lt;img style="cursor: pointer;" src="http://3.bp.blogspot.com/_oaFlRokTed0/R4Oii5S6dfI/AAAAAAAAABw/GTTzpxm8qnU/s400/hybrid_scd_1.jpg" alt="" id="BLOGGER_PHOTO_ID_5153141119021381106" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;div style="text-align: left;"&gt;This means that when a hybrid change occurs, the ETL process must do at least two things:&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;ol&gt;&lt;li&gt;Update the &lt;span style="font-family:courier new;"&gt;_current&lt;/span&gt; value for all records that share the same natural key&lt;/li&gt;&lt;li&gt;Insert a new row, initializing both &lt;span style="font-family:courier new;"&gt;_current&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;_historic&lt;/span&gt; values to the new value&lt;/li&gt;&lt;/ol&gt;Years ago, it was popular to refer to this approach as a “Type 4 Change.” With the second edition of &lt;a target="_blank" href="http://www.amazon.com/gp/product/0471200247?ie=UTF8&amp;amp;tag=oaktonsoftwar-20&amp;amp;linkCode=as2&amp;amp;camp=1789&amp;amp;creative=9325&amp;amp;creativeASIN=0471200247"&gt;&lt;span style="font-style: italic;"&gt;The Data Warehouse Toolkit&lt;/span&gt;&lt;/a&gt;, Kimball and Ross suggest we call this a hybrid change.  And for good reason:  it is really a bit of Type 1 and a bit of Type 2.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Hybrid Change in Action&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Developers often have trouble with this type of design.  To fully understand it, look what happens when the name of a company changes.&lt;br /&gt;&lt;br /&gt;Suppose that one of the customers was Apple Computer.  You may recall that in 2007, the company changed its name from Apple Computer Inc. to Apple Inc.&lt;br /&gt;&lt;br /&gt;Initially, the dimension table contains a row that looks like this:&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_oaFlRokTed0/R4OijJS6dgI/AAAAAAAAAB4/rgbjjZbfckA/s1600-h/hybrid_scd_2.jpg"&gt;&lt;img style="cursor: pointer;" src="http://4.bp.blogspot.com/_oaFlRokTed0/R4OijJS6dgI/AAAAAAAAAB4/rgbjjZbfckA/s400/hybrid_scd_2.jpg" alt="" id="BLOGGER_PHOTO_ID_5153141123316348418" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;In 2007, Apple changed its name from “Apple Computer Inc.” to “Apple Inc.”  The next illustration shows the schema’s hybrid response in yellow:&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_oaFlRokTed0/R4OijJS6dhI/AAAAAAAAACA/PLB2UB8OX9w/s1600-h/hybrid_scd_3.jpg"&gt;&lt;img style="cursor: pointer;" src="http://4.bp.blogspot.com/_oaFlRokTed0/R4OijJS6dhI/AAAAAAAAACA/PLB2UB8OX9w/s400/hybrid_scd_3.jpg" alt="" id="BLOGGER_PHOTO_ID_5153141123316348434" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;The first row shows the existing record.  The &lt;span style="font-family:courier new;"&gt;_current&lt;/span&gt; value is updated with the new name, “Apple Inc.”  Its &lt;span style="font-family:courier new;"&gt;_historic&lt;/span&gt; value is not touched.  The next row shows the new record that has been inserted.  It shows the new name for both &lt;span style="font-family:courier new;"&gt;_current&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;_historic&lt;/span&gt; values.  Any facts subsequently recorded will be connected to this new row.&lt;br /&gt;&lt;br /&gt;To analyze all facts (before and after the name change) using the new name, Apple Inc., you use &lt;span style="font-family:courier new;"&gt;company_name_current &lt;/span&gt;to group query results.  To study all facts in the context of the historically accurate name, group query results by &lt;span style="font-family:courier new;"&gt;company_name_historic&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Repeatable&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This process is repeatable.  If Apple once again changed the corporate name, say to “Apple Electronics, Inc.”, the ETL process for the customer table would:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Change &lt;span style="font-family:courier new;"&gt;company_name_current&lt;/span&gt; to “Apple Electronics Inc.” for &lt;span style="font-style: italic;"&gt;both &lt;/span&gt;existing rows.&lt;/li&gt;&lt;li&gt;Add a third row (with a new surrogate key)  with &lt;span style="font-family:courier new;"&gt;_current&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;_historic&lt;/span&gt; values recorded as “Apple Electronics, Inc.”  Subsequent facts would be connected to this row.&lt;/li&gt;&lt;/ol&gt;&lt;span style="font-weight: bold;"&gt;The Importance of Good Documentation&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;It is important to document the hybrid change carefully.  ETL developers must understand how it is to be loaded, and report/query developers must understand how to use it.  Otherwise, the hybrid approach can lead to confusion.&lt;br /&gt;&lt;br /&gt;And be sure that it is worth the trouble.  You might ask why people want the historic name, if 99% of the time they only need the current name. If it is because they want to be able to reproduce an invoice, perhaps that would be better served by going back to the source system.  Then the schema can sport a more simple Type 1 approach.&lt;br /&gt;&lt;br /&gt;But if a real analytic need exists, the hybrid approach is the only way to go.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:78%;"&gt;Copyright (c) 2007 Chris Adamson&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-5681324636833723036?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=5681324636833723036&amp;isPopup=true" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/5681324636833723036?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/5681324636833723036?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2008/01/have-it-both-ways-with-hybrid-slow.html" title="Have it Both Ways With Hybrid Slow Changes" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://3.bp.blogspot.com/_oaFlRokTed0/R4Oii5S6dfI/AAAAAAAAABw/GTTzpxm8qnU/s72-c/hybrid_scd_1.jpg" height="72" width="72" /><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></entry><entry gd:etag="W/&quot;CU4DRnkyeSp7ImA9WB9WEUk.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-4249188506123785641</id><published>2007-11-15T09:20:00.000-05:00</published><updated>2007-11-15T11:26:17.791-05:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2007-11-15T11:26:17.791-05:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Slow Changes" /><category scheme="http://www.blogger.com/atom/ns#" term="Design Techniques" /><title>How To Document a Type 3 SCD:  A pair of Type 1 Responses</title><content type="html">&lt;span style="font-style: italic;"&gt;Many dimensionally-aware database modeling tools allow you to document the Slow Change characteristics of each dimension attribute.  But often, your only choices are Type 1 or Type 2.  This post describes what to do when your design calls for the less common Type 3 change. As it turns out, a Type 3 SCD response can be understood as a pair of Type 1 responses.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In a previous post, I pointed out that we &lt;span style="font-style: italic;"&gt;think &lt;/span&gt;about slow changes with respect to a source system, we usually &lt;span style="font-style: italic;"&gt;document &lt;/span&gt;them as part of the dimensional schema.  For example, if the customer's name or date of birth changes on the &lt;span style="font-style: italic;"&gt;source, &lt;/span&gt;what do we do in the star schema? A Type 1 response would be to overwrite a corresponding value in the star schema; a Type 2 response would be to insert a new row for the revised version of the customer.  We usually document this behavior by flagging each attribute in the star as "Type 1" or "Type 2", even though the "change" happens on the source.&lt;br /&gt;&lt;br /&gt;(Not sure Slowly Changing Dimensions are?  See &lt;a href="http://blog.oaktonsoftware.com/2007/10/for-slowly-changing-dimensions-change.html"&gt;For Slowly Changing Dimensions, Change Is Relative&lt;/a&gt; for an definition of the concept and explanations of the Type 1 and Type 2 techniques.)&lt;br /&gt;&lt;br /&gt;A Type 3 change calls for yet another kind of response to a change in source data.  This approach response maintains before and after versions of the value in separate dimension attributes.  Because the before and after value are stored in the same row, either version can be associated with activity (read: facts) that occurred before or after the change.&lt;br /&gt;&lt;br /&gt;Schema designers often throw up their arms in disgust when they discover that their modeling tool does not support the concept of a Type 3 change.  But this despair may not be warranted; it is possible to document a Type 3 change as a pair of Type 1 changes.&lt;br /&gt;&lt;br /&gt;Before we look at how to do this, here's a quick example of a Type 3 slow change.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Type 3 SCD: An Example&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Suppose that the dimensional model contains a dimension table called &lt;span style="font-family:courier new;"&gt;product&lt;/span&gt;, and that one of the dimension attributes is &lt;span style="font-family:courier new;"&gt;category&lt;/span&gt;. The &lt;span style="font-family:courier new;"&gt;category &lt;/span&gt;attribute carries a standard classification mechanism used to categorize products, and appears in a large number of reports.  An example of a category is "stationery."&lt;br /&gt;&lt;br /&gt;A wide variety of products have this category designation, including several types of envelope, note pad, and paper sheets. All these products have rows in the &lt;span style="font-family:courier new;"&gt;product &lt;/span&gt;dimension table with a &lt;span style="font-family:courier new;"&gt;category &lt;/span&gt;of "stationery."  Fact tables linked to the &lt;span style="font-family:courier new;"&gt;product &lt;/span&gt;dimension can be analyzed by &lt;span style="font-family:courier new;"&gt;category&lt;/span&gt;, grouping all these products under the value "stationery."&lt;br /&gt;&lt;br /&gt;Now lets assume that the business has expanded its product lines, and the old "stationery" category is a bit too broad.  Management plans to redesignate existing stationery products as "Personal Stationery" or "Business Stationery".  Going forward, they want to use the new values for all analysis, even when studying facts that were recorded prior to the changeover.  So far, sounds like a Type 1 change.&lt;br /&gt;&lt;br /&gt;The catch: some planning numbers are tied to the old designations.  For at least a while, management will need the ability to use the old value to study facts, whether they were recorded before or after the change.  (A type 2 response does not help here; it associates all facts recorded before the change with the old value, and all facts recorded after the change with the new value.)  And, now that they've thought about it some more, it might be nice to be able to do this every year.&lt;br /&gt;&lt;br /&gt;The solution is to maintain two dimensional attributes for the category.  The first is meant to carry the current category designation, and might be called &lt;span style="font-family:courier new;"&gt;category_current&lt;/span&gt;.  The second is meant to carry the old category designation, and might be called &lt;span style="font-family:courier new;"&gt;category_previous&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;When the category for a product changes, say from "stationery" to "business stationery", we update the &lt;span style="font-family:courier new;"&gt;category_current&lt;/span&gt; to contain the new value "business stationery" and place the old value, "stationery" in &lt;span style="font-family:courier new;"&gt;category_previous&lt;/span&gt;.  When management wants to study facts using the new designations, their reports should group facts using the &lt;span style="font-family:courier new;"&gt;category_current&lt;/span&gt; dimension attribute.  When they want to study facts using the old categorization scheme, they can use the &lt;span style="font-family:courier new;"&gt;category_previous&lt;/span&gt; attribute.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Documenting Type 3 Change&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;As discussed in the previous post, we often document slow change response on an attribute by attribute basis in the dimensional model.  For our product example, this might be done as shown in the following illustration.&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_oaFlRokTed0/RzxnQxMf8AI/AAAAAAAAABI/OlLt0ZyuohE/s1600-h/scd3-1.jpg"&gt;&lt;img style="cursor: pointer;" src="http://1.bp.blogspot.com/_oaFlRokTed0/RzxnQxMf8AI/AAAAAAAAABI/OlLt0ZyuohE/s320/scd3-1.jpg" alt="" id="BLOGGER_PHOTO_ID_5133091213076852738" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: left;"&gt;Many modeling tools allow us to document the SCD characteristic of a column, offering a pick list that allows us to choose what type of response is expected when the source of an attribute changes.  But what if your tool does not support Type 3?   Or what if you have a tool that automates the construction of an ETL routine, aggregate table or OLAP cube, and the tool does not offer the Type 3 option?&lt;br /&gt;&lt;br /&gt;There may be a solution.  Notice that when a type 3 change occurs, we update both versions of the attribute.  In our example, &lt;span style="font-family: courier new;"&gt;category_current&lt;/span&gt; is updated withe the new name of the product, and category_previous is updated with the prior value (which is the old value of category_current.)  We effectively have a pair of Type 1 attributes, and can document them as shown below.  I've added a highlight to illustrate the pair of attributes in question.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_oaFlRokTed0/RzxpCBMf8BI/AAAAAAAAABQ/Ba4j46LYX2s/s1600-h/scd3-2.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://2.bp.blogspot.com/_oaFlRokTed0/RzxpCBMf8BI/AAAAAAAAABQ/Ba4j46LYX2s/s320/scd3-2.jpg" alt="" id="BLOGGER_PHOTO_ID_5133093158697037842" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Of course, if you have tools that use this information to do something, such as generate a cube or ETL script, you need to be sure this method will not be disruptive.  You will probably need to provide sources for each of the attributes, and it may be challenging to define a source for &lt;span style="font-family:courier new;"&gt;category_previous&lt;/span&gt;, which will probably not be present in the source after the change has occurred.   The solution may be able to define its source as the the &lt;span style="font-family: courier new;"&gt;category_current&lt;/span&gt; value of the product dimension table itself.  This, in turn, may require that you join to a copy of the dimension when extracting source data.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Repeatable Process&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To round out this discussion of the Type 3 change, I should point out that this slow change process is repeatable.  Next year, if the designations change again, we can move &lt;span style="font-family: courier new;"&gt;category_current&lt;/span&gt; values into &lt;span style="font-family: courier new;"&gt;category_previous&lt;/span&gt;, and place the new values in &lt;span style="font-family: courier new;"&gt;category_current&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;Note that we always have access to two values:  the current value and the prior value.  Not enough?  Its also possible to maintain more than two columns for a type 3 response.  We might choose to maintain columns called &lt;span style="font-family: courier new;"&gt;category_current, category_last_year, category_2yrs_ago,&lt;/span&gt; etc.  Now, any or all of the values can be used to study all facts, regardless of when they were recorded.   (A scheme like &lt;span style="font-family: courier new;"&gt;category_2007, category_2006, category_2005&lt;/span&gt; might work as well, but might require you to redesign reports each year to use the correct attributes.)&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;More Changes on the Way&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;That's not quite the end of the story of SCD's.  In future posts, I will look at the popular "timestamped" and "hybrid" approaches, as well as run down a few advanced concepts and caveats.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:78%;"&gt;Copyright (c) 2007 Chris Adamson&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-4249188506123785641?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=4249188506123785641&amp;isPopup=true" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/4249188506123785641?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/4249188506123785641?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2007/11/how-to-document-type-3-scd-pair-of-type.html" title="How To Document a Type 3 SCD:  A pair of Type 1 Responses" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://1.bp.blogspot.com/_oaFlRokTed0/RzxnQxMf8AI/AAAAAAAAABI/OlLt0ZyuohE/s72-c/scd3-1.jpg" height="72" width="72" /><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></entry><entry gd:etag="W/&quot;CU4DRnkyeSp7ImA9WB9WEUk.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-6258722360380770040</id><published>2007-10-09T09:43:00.000-04:00</published><updated>2007-11-15T11:26:17.791-05:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2007-11-15T11:26:17.791-05:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Slow Changes" /><category scheme="http://www.blogger.com/atom/ns#" term="Basics" /><category scheme="http://www.blogger.com/atom/ns#" term="Design Techniques" /><title>For Slowly Changing Dimensions, "Change" is Relative</title><content type="html">&lt;span style="font-style: italic;"&gt;There's a difference between the way we think about Slowly Changing Dimensions and the way we document them.  In this post, I'll highlight this difference by examining the two most common Slow Change techniques.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The term "slowly changing dimension" originated with Ralph Kimball, who identified three techniques for dealing with changed data.  Commonly abbreviated as SCD's, these techniques are applied in any form of dimensional design, regardless of the &lt;a href="http://blog.oaktonsoftware.com/2007/03/three-data-warehouse-architectures-that.html"&gt;data warehouse architecture&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;In practice, there is a subtle but importance between the way we think about these changes and the way we describe them in a dimensional design.  This sometimes leads to confusion.&lt;br /&gt;&lt;br /&gt;Before I explain this important distinction, let me review the difference between surrogate and natural keys, and describe the two most common SCD techniques.  (Future posts will look at other slow change techniques.)&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Natural Keys and Surrogate Keys&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;We usually think of dimension tables in a star schema as corresponding to something in a source system. For example, each row in a customer dimension table relates to a single customer in a source system. Each column is loaded from one or more sources, based on a set of rules.  The link back to a source system is preserved in the form of a &lt;span style="font-style: italic;"&gt;natural key&lt;/span&gt;&lt;span style=""&gt;—&lt;/span&gt;usually a unique identifier in a source system, such as a &lt;span style="font-family: courier new;"&gt;customer_id.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;But the star schema design does not rely on this natural key, or &lt;span style="font-style: italic;"&gt;business key&lt;/span&gt;, to uniquely identify rows in dimension tables. Instead, a &lt;span style="font-style: italic;"&gt;surrogate key&lt;/span&gt; is introduced. This surrogate key gives the dimensional design flexibility to handle changes differently than they are handled in source systems, while preserving the ability to perform joins using a single column.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Type 1 and Type 2 Slow Changes&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Slowly changing dimension techniques determine how the dimensional model will respond to changes in the source system. If the customer with id 8472 changes, what do we do with that change? Alert readers may already be concerned about what I mean by "change" here, but let's first recap the two most common techniques.&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Type 1: Update&lt;/span&gt;   When the dimensional model responds to a change in source data by updating a column, Kimball calls this a &lt;span style="font-style: italic;"&gt;type 1&lt;/span&gt; change.  For example, if a customer's date of birth changes, it is probably appropriate to update the corresponding row for that customer in the dimension table.&lt;br /&gt;&lt;br /&gt;Under this scenario, any facts that were already associated with the dimension table row have effectively been revised as well.  A report of sales dollars by date of birth, for example, will provide different results immediately before and after the type 1 change is applied.  The type 1 change does not preserve history of the attribute value.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Type 2: New Row  &lt;/span&gt;A more common response to a changed data element is to insert a new row into the dimension table.  For example, when the address of customer 8472 changes, we create a new row for the customer in the dimension table.   This row has a different surrogate key, and the new address.  Customer 8472 now has two rows in the dimension, each with its own surrogate key.&lt;br /&gt;&lt;br /&gt;This preserves the history of the attribute, and does not revise any previously stored facts.  New facts will be associated with the new version of customer 8472;  old facts remain associated with the old version.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;For the most part, these two techniques form the basis of a dimensional model's response to change.  (Future posts will consider the less common type 3 change, and additional techniques.)  While these concepts are fairly easy to understand, it is important to look a bit deeper.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;We think about slow changes with respect to the source&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Notice the way that the original problem was framed. I asked how the dimensional schema would "respond to changes in the source data."  This is how we usually think about the problem, and for good reason.  After all, the source data exists before it is loaded into the dimensional schema.  If &lt;span style="font-family:courier new;"&gt;birth_date&lt;/span&gt; changes, we overwrite; if &lt;span style="font-family:courier new;"&gt;address &lt;/span&gt;changes, we insert a new record.&lt;br /&gt;&lt;br /&gt;Now observe that a change to the source does not always result in a change in the dimensional schema.  In the example, a change in address resulted in a &lt;span style="font-style: italic;"&gt;new &lt;/span&gt;row&lt;span style=""&gt;—&lt;/span&gt;not a &lt;span style="font-style: italic;"&gt;changed &lt;/span&gt;row.  No data is changed.&lt;br /&gt;&lt;br /&gt;Still, we refer to this process as the occurrence of a type 2 change.  Why?  Because we &lt;span style="font-style: italic;"&gt;think &lt;/span&gt;about slow changes with respect to the &lt;span style="font-style: italic;"&gt;source &lt;/span&gt;data.  And there, a change did occur.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;We document slow changes with respect to the star&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The most common way to document the dimensional schema's response to change is on the dimensional side, on an attribute by attribute basis.  For each column in a dimension table, we note how changes in the source data will be handled.  Our customer example might be documented as follows:&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_oaFlRokTed0/RwvTxn2xn2I/AAAAAAAAABA/9VnPJJHePrk/s1600-h/scd+diagram.jpg"&gt;&lt;img style="cursor: pointer;" src="http://1.bp.blogspot.com/_oaFlRokTed0/RwvTxn2xn2I/AAAAAAAAABA/9VnPJJHePrk/s400/scd+diagram.jpg" alt="" id="BLOGGER_PHOTO_ID_5119418250902478690" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;In the diagram, each non-key attribute is tagged with a 1 or a 2.  This indicates whether changes in the source of the attribute should be handled as type 1 or type 2 changes.&lt;br /&gt;&lt;br /&gt;Documenting SCD behavior in this way is handy.  ETL developers use this information to design a scheme for performing incremental loads.  Report developers use this information to understand how facts will be grouped when combined with different dimension attributes.&lt;br /&gt;&lt;br /&gt;The only drawback to documenting SCD rules in this way is that it can lead to confusion.  By tagging an attribute as a "type 2 SCD" we risk implying that attribute values may change. After all, the "C" in "SCD" stands for "change." &lt;br /&gt;&lt;br /&gt;But of course, this attribute does not change.  Rather, its classification as a type 2 attribute means "for a given natural key, if the source for this attribute undergoes a change, it will be necessary to insert a new row. "&lt;br /&gt;&lt;br /&gt;In future posts, I will look at some common misconceptions about slowly changing dimensions, and discuss additional techniques for handling changes.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:78%;"&gt;Copyright (c) 2007 Chris Adamson&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-6258722360380770040?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=6258722360380770040&amp;isPopup=true" title="1 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/6258722360380770040?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/6258722360380770040?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2007/10/for-slowly-changing-dimensions-change.html" title="For Slowly Changing Dimensions, &quot;Change&quot; is Relative" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://1.bp.blogspot.com/_oaFlRokTed0/RwvTxn2xn2I/AAAAAAAAABA/9VnPJJHePrk/s72-c/scd+diagram.jpg" height="72" width="72" /><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">1</thr:total></entry><entry gd:etag="W/&quot;DkADRH8_fCp7ImA9WB5aFEk.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-3956665458776445335</id><published>2007-09-10T12:00:00.000-04:00</published><updated>2007-09-10T13:19:35.144-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2007-09-10T13:19:35.144-04:00</app:edited><title>Mastering Data Warehouse Aggregates: Errata</title><content type="html">&lt;span style="font-style: italic;"&gt;This post contains updates and corrections for &lt;/span&gt;&lt;span&gt;Mastering Data Warehouse Aggregates:  Solutions for Star Schema Performance&lt;/span&gt;&lt;span style="font-style: italic;"&gt;.  I'll update it as needed&lt;/span&gt;&lt;span style="font-style: italic;"&gt;—&lt;/span&gt;&lt;span style="font-style: italic;"&gt;hopefully that won't be very often!&lt;/span&gt;&lt;br /&gt;&lt;div style="float: left; margin-right: 10px;"&gt;&lt;br /&gt;&lt;img alt="Book: Mastering Data Warehouse Aggregates" src="http://www.oaktonsoftware.com/images/book2_small.gif" align="left"&gt; &lt;br /&gt;&lt;/div&gt;&lt;a target = "_blank" href="http://www.amazon.com/gp/product/0471777099?ie=UTF8&amp;tag=oaktonsoftwar-20&amp;linkCode=as2&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0471777099"&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Mastering Data Warehouse Aggregates&lt;/span&gt;&lt;/a&gt;&lt;br /&gt;By Chris Adamson&lt;br /&gt;&lt;span style="color: rgb(102, 102, 102);"&gt;Wiley Publishing Inc., 2006&lt;/span&gt;&lt;br /&gt;&lt;div style="clear: left;"&gt; &lt;/div&gt;Many thanks to those of you who have sent feedback, praise, comments and questions.  It is good to know that the book has been well received, and is being studied carefully.  Happily, I have only been made aware of a couple of errors.  Here they are, in page number order.&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold; color: rgb(102, 102, 102);"&gt;Page 171:&lt;/span&gt;   Figure 5.7 incorrectly lists a BRAND_KEY in the upper right of the diagram.  The column should be labeled PRODUCT_KEY, as described in the text.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold; color: rgb(102, 102, 102);"&gt;Page 195:&lt;/span&gt;   Step 2.1.4 in figure 6.3 should have a connection to step 2.1.5 (as indicated in the text on page 203).&lt;/li&gt;&lt;/ul&gt;For more information on the book, including the complete table of contents and a sample chapter, visit the &lt;a target="blank" href="http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471777099.html"&gt;Wiley website&lt;/a&gt;.  The book is available through &lt;a target="blank" href="http://www.amazon.com/gp/product/0471777099?ie=UTF8&amp;tag=oaktonsoftwar-20&amp;amp;linkCode=as2&amp;camp=1789&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;creative=9325&amp;amp;creativeASIN=0471777099"&gt;Amazon.com&lt;/a&gt; and at stores everywhere.  And, of course, feel free to send me your comments.&lt;br /&gt;&lt;br /&gt;Chris&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-3956665458776445335?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=3956665458776445335&amp;isPopup=true" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/3956665458776445335?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/3956665458776445335?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2007/09/mastering-data-warehouse-aggregates.html" title="Mastering Data Warehouse Aggregates: Errata" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></entry><entry gd:etag="W/&quot;Ck4AQ3YzfSp7ImA9WB5QF0k.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-907253687513841155</id><published>2007-07-06T12:51:00.001-04:00</published><updated>2007-07-06T12:55:42.885-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2007-07-06T12:55:42.885-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Strategy" /><title>Drive Warehouse Strategy with a Dimensional Model</title><content type="html">&lt;span style="font-style: italic;"&gt;CIO's and managers of successful data warehouses understand that a dimensional model is more than just a design-stage project deliverable. They use the dimensional model to drive data warehouse strategy, capture requirements, set project priorities, and manage project scope.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Fundamentally, a dimensional model deals with the measurement of business processes. It describes how a business process is evaluated, and can be used to frame questions about a process. In this respect, it speaks clearly to the business users of the data warehouse.&lt;br /&gt;&lt;br /&gt;A dimensional model also has technical implications. Its definition determines the data sources that must be integrated, how information must be cleansed or standardized, and what queries or reports can be built. In this respect, it speaks clearly to the developers of the data warehouse.&lt;br /&gt;&lt;br /&gt;These business and technical characteristics of the dimensional model make it an ideal focal point for managing the entire data warehouse life cycle. A dimensional model can serve as the basis for a shared understanding of warehouse strategy. From a business perspective, it imparts a clear understanding of functional capability; from an I.T. perspective, it supports a clear understanding of technical activity.&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Warehouse Strategy  &lt;/span&gt;It is well understood that planning a dimensional model at an enterprise level can enable the incremental implementation of subject area applications. For a Kimball-style dimensional data warehouse, this process is critical to ensuring that stove-pipes are not developed.&lt;br /&gt;&lt;br /&gt;But a high-level dimensional model has additional value, even if you are not building a dimensional data warehouse. Because it provides a clear framework to describe functional capability and technical work-units, a dimensional model is an effective way to plan and document an enterprise data warehouse architecture.&lt;br /&gt;&lt;br /&gt;I use the dimensional model as the central focus of data warehouse strategic plans. It is understood by business and technical constituents, bringing them together with a shared understanding of scope, functionality and technical effort for each subject area. It clearly conveys functionality, while at the same time allowing development activity to be quantified.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li style="font-weight: bold;"&gt;Requirements Definition  &lt;span style="font-weight: normal;"&gt;In the same way, a dimensional model is an ideal way to capture requirements—whether at a strategic level, or for a subject-area implementation (or data mart.) Report requirements, data requirements, and loading requirements can all be expressed in dimensional terms. This makes each requirement understandable to a variety of audiences, and allows their dependencies to be easily cross-referenced.&lt;br /&gt;&lt;br /&gt;I use a dimensional format to capture requirements and develop specifications for deliverables such as reports, applications, ETL routines, and, of course, schema design.&lt;/span&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul style="font-weight: bold;"&gt;&lt;li&gt;Project Prioritization  &lt;span style="font-weight: normal;"&gt;A dimensional model can be cross referenced with business priorities, report functionality, data availability, load requirements, and several other factors that drive a development roadmap. As a common ground for business and technical interaction, it is an invaluable tool.&lt;br /&gt;&lt;br /&gt;Once priorities are set, the dimensional framework can be used to describe them. By segmenting a dimensional model into a set of sequenced projects, it clearly links both functionality and technical effort to the calendar. At the same time, it enables analysis of resource requirements that will be required.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Manging Scope  &lt;/span&gt;The dimensional model is an ideal way to define and manage the scope of projects. I use dimensional terms to describe project objectives, what is considered in and out of project scope, and how change requests will be evaluated by project leadership.&lt;br /&gt;&lt;br /&gt;This is particularly useful for implementation projects that employ an "iterative" build process, where scope-creep is an ever-present possibility. By linking project scope to dimensional characteristics (such as grain, sources, or attribution), physical design can be enhanced through iterations without allowing the project to spiral out of control.&lt;/li&gt;&lt;/ul&gt;A dimensional focus can characterize a variety of deliverables, regardless of the &lt;a href="http://blog.oaktonsoftware.com/2007/03/three-data-warehouse-architectures-that.html"&gt;type of architecture&lt;/a&gt; your data warehouse employs. Strategy documents, report requirement definitions, project definitions, load specifications... A unified dimensional approach links all these deliverables together, and expresses important information in a way that provides value to business and technical audiences.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;© 2007 Chris Adamson&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-907253687513841155?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=907253687513841155&amp;isPopup=true" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/907253687513841155?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/907253687513841155?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2007/07/drive-warehouse-strategy-with.html" title="Drive Warehouse Strategy with a Dimensional Model" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></entry><entry gd:etag="W/&quot;AkIARn4-eSp7ImA9WB5SF0k.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-8081456034830982240</id><published>2007-06-11T15:50:00.000-04:00</published><updated>2007-06-13T11:29:07.051-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2007-06-13T11:29:07.051-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Top Ten Lists" /><category scheme="http://www.blogger.com/atom/ns#" term="Design Techniques" /><title>Ten Things You Won't Learn from that Demo Schema</title><content type="html">&lt;span style="font-style: italic;"&gt;Many people learn about dimensional modeling by studying a sample star schema database that comes with a software product.  These sample databases are useful learning tools—to a point. &lt;/span&gt;&lt;span style=""&gt;&lt;/span&gt;&lt;span style="font-style: italic;"&gt;Here are 10 things you won't learn by studying that demo.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;If you've learned everything you know about star schema by working with a sample database, you probably have a good intuitive grasp of star schema design principles.  In a&lt;span style="text-decoration: underline;"&gt;&lt;span style="font-weight: bold;"&gt;&lt;/span&gt;&lt;/span&gt; &lt;a href="http://blog.oaktonsoftware.com/2007/05/10-things-you-should-know-about-that.html"&gt;previous post&lt;/a&gt;, I  provided a list of 10 terms and principles that most sample databases illustrate well.&lt;br /&gt;&lt;br /&gt;But there are many important things about a dimensional data warehouse that are &lt;span style="font-weight: bold;"&gt;not &lt;/span&gt;revealed by the typical "Orders" or "Sales" demo.  Here are the top 10 things you will &lt;span style="font-weight: bold;"&gt;not &lt;/span&gt;learn from that sample database.&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Multiple Fact Tables  &lt;/span&gt;Most sample databases contain a single star—one fact table and its associated dimension tables.  But it is rare to find a business process that can be modeled with a single fact table; it is impossible to find an enterprise that can. Most real-world designs will involve multiple fact tables, sharing a set of common dimensions.&lt;br /&gt;&lt;br /&gt;When facts become available at different times, or with different dimensionality, they almost always belong in separate fact tables.  Modeling them in a single fact table can have negative consequences. Mixed grain issues may result, complicating the load and making reporting very difficult.  For example, building reports focused on only one of the facts can result in a confusing preponderance of extra rows containing the value zero.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Conformance&lt;/span&gt;  With multiple fact tables, it is also important that each star be designed so that it works with others.  A design principle called &lt;span style="font-weight: bold;"&gt;conformance &lt;/span&gt;helps ensure that as we build each new star, it works well with those that came before it.  This avoids the dreaded stove-pipe. This principle allows a set of star schemas to be planned around a set of common dimensions and implemented incrementally.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Drilling Across  &lt;/span&gt;It's also important to understand how to properly build a report that accesses data from multiple stars.  A single SQL select statement won't do the job. Double counting, or worse, can result. Instead, we follow a process called &lt;span style="font-weight: bold;"&gt;drilling across&lt;/span&gt;, where each star is queried individually.  The two result sets are then combined based on their common attributes. These drill across reports are some of the most powerful in the data warehouse.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Snapshot Fact Tables &lt;/span&gt;The fact table found in most demo stars is usually called a transaction fact table.  But there are real world situations where other types of fact table designs are called for.&lt;br /&gt;&lt;br /&gt;A snapshot design is useful for capturing the result of a series of transactions at a point-in-time; for example, the balance of each account in the general ledger at the end of each day.  This type of design introduces the concept of &lt;span style="font-weight: bold;"&gt;semi-additivity&lt;/span&gt;, which can be a problem for many ad hoc query tools.  It makes no sense to add together yesterday's balance and today's balance.  It is not uncommon to compute averages based on the data in a snapshot star.  But one must be careful here; the SQL Average() function may not always be what you need.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Factless Fact Tables&lt;/span&gt;  Another type of fact table often contains no facts at all.  Factless fact tables are useful in situations where there appears to be nothing to measure aside from the occurrence of an event, such as a customer contact.  They also come in handy when we want to capture information about which there may be no event at all, such as eligibility.&lt;br /&gt;&lt;br /&gt;In addition to transaction, snapshot and factless designs, there are other types of fact table as well.  It is not uncommon to need more than one, even when modeling a single activity.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Roles and Aliasing  &lt;/span&gt;Many business processes involve a dimension in multiple roles.  For example, in accounting a transaction may include the employee who makes a purchase, as well as the employee who approves it.  There is no need for separate"Purchaser" and "Approver" dimensions.  A single "Employee" dimension will do the job.  The fact table will have two foreign key references to the Employee dimension--one that represents the purchaser, and one that represents the approver.  We use SQL "aliasing" when querying this schema in order to capture the two employee roles.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Advanced Slow Change Techniques&lt;/span&gt;  If you are lucky, you were able to learn about Type 1 and Type 2 Slowly Changing Dimension techniques from the demo schema. I described these in a &lt;a href="http://blog.oaktonsoftware.com/2007/05/10-things-you-should-know-about-that.html"&gt;previous post&lt;/a&gt;.  Often, analytic requirements require more.&lt;br /&gt;&lt;br /&gt;A &lt;span style="font-weight: bold;"&gt;Type 3 &lt;/span&gt;change allows you to "have it both ways," analyzing all past and future transactions as if the change had occurred (retroactively) or not all.&lt;br /&gt;&lt;br /&gt;There are also &lt;span style="font-weight: bold;"&gt;hybrid &lt;/span&gt;approaches, one of which tracks the "transaction-time" version of the changed data element as well as the "current-value" of the data element.&lt;br /&gt;&lt;br /&gt;And then there's the &lt;span style="font-weight: bold;"&gt;time-stamped dimension&lt;/span&gt; technique, also called a transaction dimension.  In this version, each row receives an effective date/time and an expiration date time.  This provides Type 2 functionality, but also allows point-in-time analysis of the dimensional data.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Bridge Tables  &lt;/span&gt;Perhaps the most confusing technique for the novice dimensional designer is the use of bridge tables.  These tables are used when the standard one-to-many relationship between dimension and fact does not apply.  There are three situations where bridge tables come in handy:&lt;br /&gt;&lt;br /&gt;An &lt;span style="font-weight: bold;"&gt;attribute bridge &lt;/span&gt;resolves situations where a dimension attribute may repeat multiple times.  For example, a dimension table called "Company" may include an attribute called "Industry." Some companies have more than one industry. Rather than flattening into "Industry 1," "Industry 2," and so on, an attribute bridge captures as many industries as needed.&lt;br /&gt;&lt;br /&gt;A &lt;span style="font-weight: bold;"&gt;dimension bridge&lt;/span&gt; resolves situations where an entire dimension may repeat with respect to facts.  For example, there may be multiple salespeople involved in a sale.  Instead of loading the fact table with multiple salesperson keys, a dimension bridge gracefully manages the group of salespeople.&lt;br /&gt;&lt;br /&gt;A &lt;span style="font-weight: bold;"&gt;hierarchy bridge&lt;/span&gt; resolves situations where a recursive hierarchy exists within a dimension.  For example, companies own other companies.  At times, users may want to roll transactions up that occur beneath a specific company, or vice versa.  Instead of flattening the hierarchy, which imposes limitations and complicates analysis, a hierarchy bridge can be joined to the transaction data in various ways, allowing multiple forms of analysis.&lt;br /&gt;&lt;br /&gt;All bridge implementations have implications for usage, or report building.  Improper use of a bridge can result in double counting or incorrect results.  Bridges also make deployment of business intelligence tools more difficult.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Derived Schemas &lt;/span&gt; Useful stars can also be derived from existing stars.  Often called "second-line" solutions, these derived schemas can accelerate specific types of analysis with powerful results.  The &lt;span style="font-weight: bold;"&gt;merged fact table &lt;/span&gt;combines stars to avoid drilling across.  The  &lt;span style="font-weight: bold;"&gt;sliced fact table &lt;/span&gt;partitions data based on a dimension value, useful in distributed collection and analysis.  The &lt;span style="font-weight: bold;"&gt;pivoted fact table&lt;/span&gt; restructures row-wise data for columnar analysis and vice-versa.  And &lt;span style="font-weight: bold;"&gt;set operation fact tables&lt;/span&gt; provide precomputed results for union, intersect and minus operations on existing stars.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Aggregate Schemas&lt;/span&gt;  One of the reasons the star schema has become so popular is that it provides strong query performance.  Still, there are times when we want results to come back faster.  Aggregate schemas partially summarize the data in a base schema, allowing the database to compute query results faster.  Of course, designers need to identify aggregates that will provide the most help, and queries and reports will receive no benefit unless they actually use the aggregate.  Aggregates are usually designed as separate tables, instead of providing multiple levels of summarization in a single star.  This avoids double counting errors, and can allow the exploitation of an automated query rewrite mechanism so that applications do not need to be aware of the aggregates.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;I limited this list to 10 things.  That's enough to make the point:  a demo schema will only take you so far.  When I teach Advanced Star Schema Design courses, I find that even people with many years of experience have questions, and always learn something new.&lt;br /&gt;&lt;br /&gt;If you want to learn more, read the books recommended in the sidebar of this blog. Take  a class on Advanced Star Schema Design.  Interact with your peers at &lt;a href="http://www.tdwi.org/" target="_blank"&gt;The Data Warehouse Institute&lt;/a&gt; conferences.  And keep reading this blog.  There will always be more to learn.&lt;br /&gt;&lt;br /&gt;Related Posts:&lt;br /&gt;&lt;br /&gt;   &lt;a href="http://blog.oaktonsoftware.com/2007/05/10-things-you-should-know-about-that.html"&gt;Top 10 Thinks You Should Know About that Demo Schema&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;© 2007 Chris Adamson&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-8081456034830982240?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=8081456034830982240&amp;isPopup=true" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/8081456034830982240?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/8081456034830982240?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2007/06/ten-things-you-wont-learn-from-that.html" title="Ten Things You Won't Learn from that Demo Schema" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></entry><entry gd:etag="W/&quot;A04EQn89cCp7ImA9WBFaGEk.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-6955546906224047401</id><published>2007-05-22T10:45:00.000-04:00</published><updated>2007-05-22T12:05:03.168-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2007-05-22T12:05:03.168-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Security" /><title>Data Security: Don't Be the Weak Link</title><content type="html">&lt;span style="font-style: italic;"&gt;Data Warehouse project teams dedicate a lot of time to protecting personal information—and rightly so.  But don’t let your notebook computer become the weak link.  I suggest three simple steps you can follow to protect yourself, your company and your customers.&lt;/span&gt;    &lt;p style="font-weight: bold;" class="MsoNormal"&gt;&lt;o:p&gt;&lt;/o:p&gt;The Letter&lt;/p&gt;    &lt;p class="MsoNormal"&gt;&lt;o:p&gt;&lt;/o:p&gt;Imagine that you have just received a letter that has you more than a little upset.&lt;span style=""&gt;  &lt;/span&gt;It came from a financial institution where you have an account.&lt;span style=""&gt;  &lt;/span&gt;It reads,&lt;/p&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;blockquote&gt;We are writing to inform you of a recent incident which may have compromised some of your personal information…Recently, a notebook computer belonging to one of our employees was stolen out of a hotel meeting room.&lt;o:p&gt;&lt;/o:p&gt;&lt;br /&gt;&lt;br /&gt;The computer contained confidential information about many of our customers.&lt;span style=""&gt;  &lt;/span&gt;Data on the computer included your account numbers, balance information, and social security number…&lt;/blockquote&gt;&lt;p class="MsoNormal"&gt; &lt;/p&gt;    &lt;p class="MsoNormal"&gt;&lt;o:p&gt;&lt;/o:p&gt;The letter goes on to say that the computer was password protected, and that most computers are stolen for the value of the hardware&lt;span&gt;—&lt;/span&gt;not because the thief is looking for confidential data.&lt;span style=""&gt;  &lt;/span&gt;Still, they have set up a hotline, and will provide resources to support you in the event of identity theft.&lt;br /&gt;&lt;/p&gt;    &lt;p class="MsoNormal"&gt;&lt;o:p style="font-weight: bold;"&gt;&lt;/o:p&gt;&lt;span style="font-weight: bold;"&gt;Your Reaction&lt;/span&gt;&lt;br /&gt;&lt;/p&gt;    &lt;p class="MsoNormal"&gt;How do you feel?&lt;span style=""&gt;  &lt;/span&gt;Probably angry.&lt;span style=""&gt;  &lt;/span&gt;Why was that notebook left unsecured?&lt;span style=""&gt;  &lt;/span&gt;Why was information about you on it in the first place? &lt;span style=""&gt; &lt;/span&gt;One of your first reactions might be to consider taking your account elsewhere.&lt;br /&gt;&lt;/p&gt;    &lt;p class="MsoNormal"&gt;&lt;o:p&gt;&lt;/o:p&gt;Now imagine that it was &lt;span style="font-style: italic;"&gt;your &lt;/span&gt;computer that was stolen.&lt;span style=""&gt;  &lt;/span&gt;How comfortable would you be, knowing the theft of &lt;span style="font-style: italic;"&gt;your &lt;/span&gt;computer forced your company to notify several thousand customers that their confidential information may have been breached?&lt;/p&gt;    &lt;p class="MsoNormal"&gt;Worse yet, imagine that you were a consultant working for the company.&lt;span style=""&gt;  &lt;/span&gt;You have caused them a massive headache, and also jeopardized &lt;span style="font-style: italic;"&gt;your own company’s&lt;/span&gt; contract with the financial institution.&lt;/p&gt;    &lt;p style="font-weight: bold;" class="MsoNormal"&gt;Guarding Confidential Data in the Warehouse&lt;/p&gt;    &lt;p class="MsoNormal"&gt;Data warehousing teams are often entrusted with confidential information relating to an organization’s customers, employees and activities.&lt;span style=""&gt;  &lt;/span&gt;We spend a great deal of time designing solutions that safeguard this information: &lt;/p&gt;      &lt;ul&gt;&lt;li&gt;We create alternate identifiers so that social security numbers are not needed to link data sets together &lt;/li&gt;&lt;li&gt;Extract files are maintained on secure servers, and accessed only by specific users&lt;/li&gt;&lt;li&gt;We implement encryption on sensitive data elements&lt;/li&gt;&lt;li&gt;We manage ETL processing in a secure environment&lt;/li&gt;&lt;li&gt;We provide lookup functions that allowing users to find a match to a data element, without providing unfettered access to the element itself&lt;/li&gt;&lt;li&gt;We take steps to ensure that log files do not contain sensitive data elements&lt;/li&gt;&lt;li&gt;We develop database-level security at column and row levels, and provide user access on a need-to-know basis &lt;/li&gt;&lt;li&gt;We audit access to confidential data elements&lt;/li&gt;&lt;/ul&gt;                &lt;p class="MsoNormal"&gt;Most of this work is focused on securing the processing, storage and use of sensitive data elements.&lt;span style=""&gt;  &lt;/span&gt;And most of the focus is on servers.  But what about your PC?&lt;br /&gt;&lt;/p&gt;        &lt;p style="font-weight: bold;" class="MsoNormal"&gt;Don’t Be the Weak Link&lt;/p&gt;    &lt;p class="MsoNormal"&gt;As someone involved in the design, development or testing of an analytic system (or any system), you may have access to files or extracts that contain confidential data.&lt;span style=""&gt;  &lt;/span&gt;Some of this could, quite innocently, land on your PC.&lt;span style=""&gt;  &lt;/span&gt;Your organization may have strict policies about this access.&lt;span style=""&gt;  &lt;/span&gt;&lt;/p&gt;      &lt;p class="MsoNormal"&gt;Whether or not they do, I encourage you to adopt your own policy as well.&lt;span style=""&gt;  &lt;/span&gt;I do.&lt;span style=""&gt;  &lt;/span&gt;And it is a simple one:&lt;span style=""&gt;  &lt;/span&gt;&lt;br /&gt;&lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;/p&gt;&lt;blockquote&gt;"Not on MY computer.&lt;span style=""&gt;  &lt;/span&gt;Period."&lt;br /&gt;&lt;span style=""&gt;  &lt;/span&gt;&lt;/blockquote&gt;&lt;o:p&gt; &lt;/o:p&gt;      &lt;p class="MsoNormal"&gt;Whether developing database design, architecting an ETL process, or developing a quality assurance process, we are all able to do our jobs without turning our notebook computer to the weak link in the security chain.&lt;/p&gt;&lt;p class="MsoNormal"&gt;I refuse to accept any file that contains sensitive data, whether encrypted or not.&lt;span style=""&gt;  &lt;/span&gt;When I need to review an extract file, I examine it on secure server, from a secure workstation.&lt;span style=""&gt;  &lt;/span&gt;You get the idea.&lt;/p&gt;      &lt;p class="MsoNormal"&gt;By keeping this information off our notebooks, we avoid being the cause of one of those dreaded customer letters.&lt;span style=""&gt;  &lt;/span&gt;By protecting ourselves, we also protect our customers and our company.&lt;br /&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/p&gt;  &lt;p style="font-weight: bold;" class="MsoNormal"&gt;Guarding Your PC&lt;/p&gt;      &lt;p class="MsoNormal"&gt;&lt;o:p&gt;&lt;/o:p&gt;Taking a few simple steps to avoid being in that uncomfortable position.&lt;o:p&gt; &lt;/o:p&gt;&lt;/p&gt;        &lt;ol&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Just say No&lt;/span&gt;&lt;o:p style="font-weight: bold;"&gt;&lt;/o:p&gt;&lt;br /&gt;&lt;br /&gt;Don’t accept any confidential data and store it, even temporarily, on your PC.&lt;br /&gt;&lt;br /&gt;&lt;span style=""&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style=""&gt;&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;Wipe it Out&lt;/span&gt;&lt;o:p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/o:p&gt;If you receive a file that turns out to contain confidential data, get rid of it.  Use a “wipe” utility or “secure delete” to completely expunge it from your system.&lt;span style=""&gt;  &lt;/span&gt;Notify the sender that you have done so, and request a “clean” version of the file.&lt;br /&gt;&lt;br /&gt;There are a variety of utilities that can wipe a file or your hard drive.  A list can be found at the bottom of the DBAN page at Sourceforge.  One free utility is Heidi Eraser.    (Links appear at the end of this article.)&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Stay Up to Date&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A physical theft is not the only thing there is to worry about.&lt;span style=""&gt;  &lt;/span&gt;Keep you computer software up to date.&lt;span style=""&gt;  &lt;/span&gt;Yes, I.T. probably keeps you OS, firewall and critical applications up-to-date.&lt;o:p&gt;&lt;/o:p&gt;&lt;br /&gt;&lt;br /&gt;But you’ve got other things on that machine.&lt;span style=""&gt;  &lt;/span&gt;You &lt;span style="font-style: italic;"&gt;know &lt;/span&gt;you do.&lt;span style=""&gt;  &lt;/span&gt;Plug-ins. Utilities.&lt;span style=""&gt;  &lt;/span&gt;Media players.&lt;span style=""&gt;  &lt;/span&gt;They are innocent enough:&lt;span style=""&gt;  &lt;/span&gt;iTunes, WinAmp, Acrobat Reader, Flash Player, Real Player, Java...But these applications and many others can also cause a breach on your machine.&lt;o:p&gt;&lt;/o:p&gt;&lt;br /&gt;&lt;br /&gt;One place to check what needs updating is the Secunia Software Inspector (link at end of post), which scans for a wide variety of programs and tells you what security updates are needed.&lt;span style=""&gt;  &lt;/span&gt;This check requires Java – be sure to disable it after you’re done. &lt;span style=""&gt; &lt;/span&gt;&lt;/li&gt;&lt;/ol&gt;            &lt;p class="MsoNormal" style=""&gt;  &lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;o:p&gt;So don't be the weak link.  Protect your PC and you'll be protecting yourself, your company and your customers.&lt;/o:p&gt;&lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;span style="font-size:85%;"&gt;© 2007 Chris Adamson&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;o:p&gt;&lt;span style="font-style: italic;"&gt;Related Links &lt;/span&gt;&lt;span style="font-style: italic;"&gt;(all open in new browser windows):&lt;/span&gt;&lt;br /&gt;&lt;/o:p&gt;&lt;/p&gt;      &lt;p class="MsoNormal" style="margin-left: 0.5in;"&gt;DBAN page at Sourceforge:  &lt;a href="http://dban.sourceforge.net/" target="_blank"&gt;http://dban.sourceforge.net/&lt;/a&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;u1:p&gt;&lt;br /&gt;List of hard drive wipe utilities appears at bottom of page.&lt;/u1:p&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-left: 0.5in;"&gt;&lt;u1:p&gt;Heidi Eraser (a GNU Licensed File Wipe Utility)  &lt;a href="http://www.heidi.ie/eraser/default.php" target="_blank"&gt;http://www.heidi.ie/eraser/default.php&lt;/a&gt;&lt;br /&gt;&lt;/u1:p&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="margin-left: 0.5in;"&gt;&lt;u1:p&gt;Secunia Software Inspector (free service scans your computer for insecure software versions)   &lt;/u1:p&gt;&lt;a href="http://secunia.com/software_inspector/" target="_blank"&gt;http://secunia.com/software_inspector/&lt;/a&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;span style="font-size:85%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;    &lt;p class="MsoNormal"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-6955546906224047401?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=6955546906224047401&amp;isPopup=true" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/6955546906224047401?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/6955546906224047401?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2007/05/data-security-dont-be-weak-link.html" title="Data Security: Don't Be the Weak Link" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></entry><entry gd:etag="W/&quot;AkEDQn89fSp7ImA9WB5SF0k.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-2382887398690522265</id><published>2007-05-01T10:20:00.000-04:00</published><updated>2007-06-13T11:31:13.165-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2007-06-13T11:31:13.165-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Basics" /><category scheme="http://www.blogger.com/atom/ns#" term="Top Ten Lists" /><category scheme="http://www.blogger.com/atom/ns#" term="Design Techniques" /><title>10 Things You Should Know About that Sample Star Schema</title><content type="html">&lt;span style="font-style: italic;"&gt;Today, many of us learn about the star schema by studying a sample database that comes with a software product&lt;/span&gt;&lt;span style=""&gt;—&lt;/span&gt;&lt;span style="font-style: italic;"&gt;usually one that covers sales or orders.  Here are 10 terms and principles of dimensional modeling to go with that sample schema you've worked with.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The star schema has become a &lt;span style="font-style: italic;"&gt;de facto&lt;/span&gt; standard for the design of analytic databases. Sample stars are often included with RDBMS software, BI Tools and ETL tools. They are also used for tutorials and training.  Almost universally, the sample schema describes a sales or order taking process, similar to the one depicted in the figure below:&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_oaFlRokTed0/RjddjMIgJ7I/AAAAAAAAAAw/Pj-Rlfroifw/s1600-h/sample_schema.gif"&gt;&lt;img style="cursor: pointer;" src="http://2.bp.blogspot.com/_oaFlRokTed0/RjddjMIgJ7I/AAAAAAAAAAw/Pj-Rlfroifw/s400/sample_schema.gif" alt="" id="BLOGGER_PHOTO_ID_5059615565507733426" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span style="font-style: italic;font-size:85%;" &gt;&lt;span style="font-size:100%;"&gt;Figure 1: A demo schema usually represents orders or sales.&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;(Click to Enlarge)&lt;/span&gt;&lt;/span&gt;&lt;span style="font-weight: bold; font-style: italic;font-size:85%;" &gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;You may have learned about the Star Schema by working with a sample like this one. If so, you probably have an intuitive grasp of star schema design principles. Here are ten terms and principles you should know that describe important features of the sample star.&lt;br /&gt;&lt;br /&gt;Most of this is probably readily apparent if you've worked with a sample schema&lt;span style=""&gt;—&lt;/span&gt;what may be new is the terminology.  The first two you probably know:&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Facts &lt;/span&gt;are measurements that describe a business process. They are almost always numeric&lt;span style=""&gt;—&lt;/span&gt;but not all numeric attributes are facts.  You can find facts (or measurements) in almost any analytic request&lt;span style=""&gt;—&lt;/span&gt;"Show me sales dollars by product" (sales dollars).    "How many widgets were sold my John Smith in May?" (quantity ordered).  There are some schemas that do not include facts&lt;span style=""&gt;—&lt;/span&gt;we'll look at those in another post.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Dimensions&lt;/span&gt; give facts context.  They may be textual or numeric.  They are used to specify how facts are "filtered" and "broken out" on reports.  You can usually find dimensions after the words "by" or "for" in an analytic request.  "Show me sales dollars by product" (product).  "What are margin dollars by Month and Salesperson?"  (month, sales rep).&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li class="MsoNormal" style=""&gt;&lt;span style="font-weight: bold;"&gt;Dimension tables are wide.  &lt;/span&gt;Dimension tables  usually group together a set of related dimension attributes, though there are situations where a dimension may include a set of attributes not related to one another.  Dimension tables are not normalized, and usually have a &lt;span style="font-style: italic;"&gt;lot &lt;/span&gt;of attributes&lt;span style=""&gt;—&lt;/span&gt;far more than appear in most sample schemas. This allows a rich set of detail to be used in analyzing facts. 100 or more columns is not uncommon for some dimensions.  For this reason, we often call dimension tables &lt;span style="font-weight: bold;"&gt;wide.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li class="MsoNormal" style=""&gt;&lt;span style="font-weight: bold;"&gt;Dimensions have Surrogate      Keys.&lt;/span&gt;  The primary key for each dimension table is an attribute specifically created for the dimensional schema.  It is an integer assigned by the ETL process, and has no inherent meaning. It is &lt;span style="font-style: italic;"&gt;not &lt;/span&gt;a reused key from a source system, such as a customer ID or product code.  We call these attributes &lt;span style="font-weight: bold;"&gt;natural keys&lt;/span&gt;, and they may exist in the star, but do not serve as unique identifiers.&lt;br /&gt;&lt;br /&gt;In the sample schema, &lt;span style="font-family:courier new;"&gt;customer_key&lt;/span&gt; is a surrogate key generated for the star schema; &lt;span style="font-family:courier new;"&gt;customer_id&lt;/span&gt; is a natural key carried over from a source system.&lt;br /&gt;&lt;br /&gt;By assigning surrogate keys, we enable the star to handle changes to source data differently than the source system does.  For example, in a source system a customer record may be overwritten, while we want the star schema to track changes. Performance considerations also come into play&lt;span style=""&gt;—&lt;/span&gt;a surrogate key avoids the need for multi-column joins.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li class="MsoNormal" style=""&gt;&lt;span style="font-weight: bold;"&gt;Type 2 Changes track history.  &lt;/span&gt;The term "Slowly Changing Dimension" (or SCD) describes how the data warehouse responds to changes in the source of dimensional data. There are several techniques that can be applied when the source of dimension detail changes.  The most common is referred to as a "Type 2" change: an entirely new record is written to the dimension table.&lt;br /&gt;&lt;br /&gt;For example, if a customer moves, the record may simply be updated in a source system.  But in the star schema, we choose to &lt;span style="font-style: italic;"&gt;add &lt;/span&gt;a new row to the customer dimension, complete with a new surrogate key.  All prior facts remain associated with the "old" customer record; all future facts will be associated with the new record.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li class="MsoNormal" style=""&gt;&lt;span style="font-weight: bold;"&gt;Type 1 Changes overwrite history. &lt;/span&gt; The Type 1 change is used when source data changes are not deemed significant, or may be the correction of an error.  In such cases, we perform an update to an existing row in a dimension.  For example, if a customer's gender is updated in the source, we may choose to update it in the corresponding dimension records.  All prior facts are now associated with the changed value.&lt;br /&gt;&lt;br /&gt;In addition to Type 1 and Type 2 changes, there are other SCD techniques.  Hybrid approaches exist as well. Every design should identify which technique(s) will be used for each attribute of each dimension table.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li class="MsoNormal" style=""&gt;&lt;span style="font-weight: bold;"&gt;Fact      tables are narrow.&lt;/span&gt;  A fact table row is usually entirely composed of numeric attributes:  the facts, and foreign key references to the dimensions.  Because of these characteristics, each fact table row is narrow, at least in contrast with wide dimension rows full of textual values.  The narrowness of fact tables is important, because they will accumulate far more rows than dimension tables, and at a much faster rate.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li class="MsoNormal" style=""&gt;&lt;span style="font-weight: bold;"&gt;Fact tables are usually sparse.  &lt;/span&gt;Rows are recorded in the fact table only when there is something to measure. For example, not every customer orders every product from every salesperson each day.  Rows are only recorded when there is an order.   This helps manage the growth of the fact table. It also saves us from having to filter out a huge number of rows that have no sales dollars when displaying results in a report.  (Usually, you don't want a customer sales report to list every product&lt;span style=""&gt;—&lt;/span&gt;only the ones they bought.  You can use an outer join when you want the latter.)&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li class="MsoNormal" style=""&gt;&lt;span style="font-weight: bold;"&gt;Fact Table Grain&lt;/span&gt;  The level of detail represented by a row in a fact table is referred to as its &lt;span style="font-style: italic;"&gt;grain&lt;/span&gt;.  Facts that are recorded with different levels of detail belong in separate fact tables. This avoids an array of reporting difficulties, as well as kludges such as including special rows in dimension tables for "not applicable." Determining the grain of a fact table is an important design step and helps avoid future confusion. (There are times when "not applicable" attributes are necessary, but they are most often a sign of the need for another fact table.)&lt;br /&gt;&lt;br /&gt;In the example, the grain is sales by customer, product, salesperson and date.  A better design might capture sales at the order line level of detail.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li class="MsoNormal" style=""&gt;&lt;span style="font-weight: bold;"&gt;Additivity&lt;/span&gt;.  Facts are usually &lt;span style="font-style: italic;"&gt;additive&lt;/span&gt;.  This means they can be summed across any dimension value.  For example, &lt;span style="font-family:courier new;"&gt;order_dollars&lt;/span&gt; can be aggregated across customers, products, salespeople, or time periods, producing meaningful results.  Additive facts are stored in the fact table.  We also store additive facts that might be computed from other facts.  (&lt;span style="font-family:courier new;"&gt;order_dollars&lt;/span&gt; might be the sum of &lt;span style="font-family:courier new;"&gt;extended_cost&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;margin_dollars&lt;/span&gt;, but why include only two out of the three.?&lt;br /&gt;&lt;br /&gt;Some facts are &lt;span style="font-style: italic;"&gt;non-additive.&lt;/span&gt;  For example, margin rate is a percentage.  Two sales at 50% margin do not equate to a single sale at 100% margin&lt;span style=""&gt;—&lt;/span&gt;this fact is not additive.  In the star, we store the fully additive components of margin (&lt;span style="font-family:courier new;"&gt;order_dollars&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;margin_dollars&lt;/span&gt;) and let front end tools compute the ratio.  There are also semi-additive facts, which we will look at in the next post.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;Most of these terms and principles can be learned by working with a sample schema.  But there are many important principles that the typical "Sales" model does not reveal.  In &lt;a href="http://blog.oaktonsoftware.com/2007/06/ten-things-you-wont-learn-from-that.html"&gt;a future post&lt;/a&gt;, I'll look at the top 10 things the demo schema does &lt;span style="font-weight: bold;"&gt;not &lt;/span&gt;teach you.&lt;br /&gt;&lt;br /&gt;Related Posts:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://blog.oaktonsoftware.com/2007/06/ten-things-you-wont-learn-from-that.html"&gt;Ten Things You Won't Learn from that Demo Schema&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;p class="MsoNormal"&gt;&lt;span style="font-size:78%;"&gt;© 2007   Chris Adamson&lt;/span&gt;&lt;br /&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-2382887398690522265?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=2382887398690522265&amp;isPopup=true" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/2382887398690522265?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/2382887398690522265?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2007/05/10-things-you-should-know-about-that.html" title="10 Things You Should Know About that Sample Star Schema" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/_oaFlRokTed0/RjddjMIgJ7I/AAAAAAAAAAw/Pj-Rlfroifw/s72-c/sample_schema.gif" height="72" width="72" /><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></entry><entry gd:etag="W/&quot;CEYFR3o4eSp7ImA9WBFVEk4.&quot;"><id>tag:blogger.com,1999:blog-2707613555623047672.post-7987290218569903700</id><published>2007-04-10T10:17:00.000-04:00</published><updated>2007-04-10T15:08:36.431-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2007-04-10T15:08:36.431-04:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Architecture" /><title>Three Myths About Star Schema</title><content type="html">&lt;span style="font-style: italic;"&gt;Think Bill Inmon is anti-star schema? Think again. This post debunks three common myths about star schema design.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;As I wrote in a &lt;a href="http://blog.oaktonsoftware.com/2007/03/three-data-warehouse-architectures-that.html"&gt;previous post&lt;/a&gt;, there are three styles of data warehousing that incorporate star schema design:  Kimball’s Dimensional Data Warehouse, Inmon’s Corporate Information Factory, and what I call the “stand-alone” data mart.&lt;br /&gt;&lt;br /&gt;These architectures make use of the star schema in different ways.  And for proponents of each approach, terms like “enterprise data warehouse” and “data mart” have very different connotations.  This confusing situation has given rise to some misconceptions about the star schema.&lt;br /&gt;&lt;br /&gt;In this post, I will set the record straight on three common myths.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Myth #1:  Bill Inmon is anti-Star Schema&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;Perhaps the most common misconception about the star schema is that Bill Inmon is opposed to it.  In fact, Inmon &lt;span style="font-style: italic;"&gt;advocates &lt;/span&gt;the use of the star schema to build &lt;span&gt;data marts&lt;/span&gt; as part of his Corporate Information Factory architecture.  Where he &lt;span&gt;does &lt;/span&gt;&lt;span style="font-style: italic;"&gt;not&lt;/span&gt; want dimensional design used is for the granular repository of enterprise data from which data marts are drawn.&lt;br /&gt;&lt;br /&gt;So where did this myth come from?  In part, it stems from a common oversimplification of data warehousing theory, which distills a large set of technical issues into the catch-phrase “Kimball vs. Inmon.”  Because the star schema has become so closely associated with Kimball, "Kimball vs. Inmon" is often misconstrued as a debate over star schema.&lt;br /&gt;&lt;br /&gt;Lest you have any doubts that Inmon advocates the use of star schema, all you need to do is peruse some of his work.  Or, search out the first edition of Kimball’s seminal work, &lt;a href="http://www.amazon.com/gp/product/0471200247?ie=UTF8&amp;tag=oaktonsoftwar-20&amp;amp;linkCode=as2&amp;amp;camp=1789&amp;creative=9325&amp;amp;creativeASIN=0471200247"&gt;The Data Warehouse Toolkit&lt;/a&gt;.  The forward to the first edition was penned by none other than…Bill Inmon.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Myth #2:  The Star Schema is for Aggregated Data&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I frequently encounter people who believe that the star schema is used only for aggregated data.  While it &lt;span style="font-style: italic;"&gt;can &lt;/span&gt;be used for aggregated data, it does not &lt;span style="font-style: italic;"&gt;have to&lt;/span&gt; be used that way. In fact, in a Kimball-style Dimensional Data Warehouse, one guiding principle dictates that fact tables should store the most granular data possible.  That is quite the opposite of aggregated data.&lt;br /&gt;&lt;br /&gt;So why the confusion?  In other architectures, this principle may not be followed.  A star schema data mart in an Inmon-style data warehouse may aggregate information that is extracted from a repository of atomic data.  A stand-alone data mart may aggregate data drawn from source systems. And even in a Kimball-style architecture, star schemas may be supplemented by aggregate schemas designed to improve performance.  (In fact, I’ve written an &lt;a href="http://www.amazon.com/gp/product/0471777099?ie=UTF8&amp;tag=oaktonsoftwar-20&amp;amp;linkCode=as2&amp;camp=1789&amp;amp;creative=9325&amp;creativeASIN=0471777099"&gt;entire book&lt;/a&gt; on this topic.)&lt;br /&gt;&lt;br /&gt;But none of these uses of the star preclude its use for the storage of granular, atomic data. Don't fall into the trap of assuming a particular application of the technique fully describes its capabilities or limitations.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Myth #3:  Star Schema Data Marts Are Stovepipes&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I hear this one in many flavors:  using a star schema design leads to “stovepipes” or to “silos” or to “islands of information.”  Another variation holds that star schema is for "bottom-up" development.  The notion is that if you have multiple star schema data marts, they will not work together.&lt;br /&gt;&lt;br /&gt;Whatever words are used, the characterization is false.  This myth has probably arisen from a particular application of star schema design:  the stand-alone data mart.  When a data mart is built in the absence of an enterprise context--a standalone data mart--it serves an immediate purpose, but may later found to be incompatible with analytic requirements in related subject areas.&lt;br /&gt;&lt;br /&gt;This sort of "failure" is often observed in stand alone data marts. But it hardly makes sense to blame such failures on the star schema design.  You can build a car without a steering wheel, but that does not mean that cars can only drive in straight lines.&lt;br /&gt;&lt;br /&gt;In fact, star schema can be very effectively used to &lt;span style="font-style: italic;"&gt;avoid &lt;/span&gt;stovepipes.  The principle of dimensional conformance can be applied, whether designing stars as part of a Dimensional Data Warehouse (Kimball) or a Corporate Information Factory (Inmon).  This design technique ensures that common dimensions are used consistently across subject-areas, allowing data from multiple stars to be easily combined into a single report.  In technical-speak, it allows applications to “drill-across” schemas or subject areas.&lt;br /&gt;&lt;br /&gt;The result is a set of data marts that serve departmental needs, but can also be combined to answer questions that span subject areas. They may be built incrementally, but they work together. Even builders of stand-alone data marts can take measures to ensure that the solution will conform with future data marts. (I'll tackle that topic in a future post.)&lt;br /&gt;&lt;br /&gt;So remember:  Inmon is not anti-star schema. The star schema is not limited to aggregate data.  Star schema data marts do not have to be stovepipes.&lt;br /&gt;&lt;br /&gt;Copyright (c) 2007 Chris Adamson&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2707613555623047672-7987290218569903700?l=blog.oaktonsoftware.com'/&gt;&lt;/div&gt;</content><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=2707613555623047672&amp;postID=7987290218569903700&amp;isPopup=true" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/7987290218569903700?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/2707613555623047672/posts/default/7987290218569903700?v=2" /><link rel="alternate" type="text/html" href="http://blog.oaktonsoftware.com/2007/04/three-myths-about-star-schema.html" title="Three Myths About Star Schema" /><author><name>Chris Adamson</name><uri>http://www.blogger.com/profile/14386002073556773411</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="00542310086779199689" /></author><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">0</thr:total></entry></feed>
