Chris Adamson’s Blog

Dimensional Models: Now More Than Ever

2017-09-25T11:28:00.000-04:00

Do new technologies and methods render the dimensional model obsolete?

The top question from readers of this blog continues to be: "Is the dimensional model still relevant?"

It is easy to understand why people ask this question:

Our BI programs have expanded beyond data warehousing to include performance management, analytics, and governance functions.
Methods have evolved and streamlined, thanks to the application of agile principles.
Technologies have advanced to include NoSQL solutions, schema-less paradigms, and virtualization options.

In a recent article for TDWI’s Upside, I discuss these changes to our data management processes, and their impacts on the dimensional model.

The conclusion: these pressures actually increase the importance of the dimensional model.

Here are a few points made in the article:

Although NoSQL technologies are contributing to the evolution of data management platforms, they are not rendering relational storage extinct. It is still necessary to track key business metrics over time, and on this front relational storage reigns. In part, this explains why several big data initiatives seek to support relational processing on top of platforms such as Hadoop. Nonrelational technology is evolving to support relational; the future still contains stars.
[A dimensional view of data] grows in importance as the underlying storage of data assets grows in complexity. The dimensional model is the business's entry point into the sprawling repositories of available data and the focal point that makes sense of it all.
The dimensional model of a business process provides a representation of information needs that simultaneously drives the traditional facts and dimensions of a data mart, the key performance indicators of performance dashboards, the variables of analytics models, and the reference data managed by governance and MDM. In this light, the dimensional model becomes the nexus of a holistic approach managing BI, analytics, and governance programs.

As we move to treat information as a business asset, the dimensional model has become a critical success factor. Yet many organizations are spread so thin that this critical skill is often missing. Be sure that doesn’t apply to your business!

For the full discussion, check out the article: Dimensional Models in the Big Data Era. (Chris Adamson, April 12, 2017, TDWI’s Upside.)

Learn More

Join Chris for three days of dimensional modeling education in New York next month!

TDWI New York Seminar, October 23-25. Earn a certificate and 24 CPE credits. Check out the sidebar of this blog for additional dates. You can also bring my courses on site.

Read the full article on Upside:

Dimensional Models in the Big Data Era. (Chris Adamson, April 12, 2017, TDWI’s Upside.)

Read Chris's book:

Star Schema: The Complete Reference (McGraw-Hill, 2010)

In Praise of the Whiteboard

2017-06-05T12:26:00.001-04:00

Modeling tools are great, but early stage modeling activities are best conducted on a whiteboard. It is better for collaborative development, and handles rapid change more effectively. And there are inexpensive solutions if one is not readily available.

When I lead seminars that cover modeling techniques, this question always comes up: “What’s the best modeling tool?”

My answer: Start modeling activities on a whiteboard. Break out the software later, once the model has stabilized.

Using a whiteboard, you will get to a better solution, faster.

I find this to be true across the spectrum of BI project types and modeling techniques. Examples include:

Dimensional models (OLAP projects)
Strategy maps (Performance Management projects)
Influence Diagrams (Business Analytics projects)
Causal loop diagrams (Business Analytics projects)

There are two reasons a whiteboard works best for this kind of work: it supports collaboration, and it is better suited to the rapid changes common in early stage modeling. In short, a whiteboard is inherently agile.

Collaboration

The best models are produced by small groups, not individuals. Collaboration generates useful and creative ideas which reach beyond what a seasoned modeler can do alone.

Each of the techniques listed above requires collaboration between business and technical personnel. And within either of these realms, a diversity of perspectives always produces better results. Brainstorming is the name of the game.

Use of a modeling tool quashes the creativity and spontaneity of brainstorming sessions. You may have experienced this yourself.

Imagine five people in a room, one person’s laptop connected to a projector. Four people call out ideas, but the facilitator with the laptop can only respond to one at a time. The session becomes frustrating to all participants, no matter how good the facilitator is. The team loses ideas, and participants lose enthusiasm.

Now imagine the same five people in front of a whiteboard, each holding a pen. Everyone is able to get their ideas onto the board. While this may seem like anarchy, it helps ensure that no ideas are lost and it keeps everyone engaged. The result is always a better model, developed faster.

Rapid change

The other reason to start on a whiteboard is practical: it is easy to erase, change, and redraw. And you will be doing a lot of these things if you are collaborating.

Imagine a group is sketching out a model, and decides to make a major change. Perhaps one fact table is to be split into two. Or a single input parameter is to be decomposed into four. If a modeling tool is in use, making the change will require deletion of elements, addition of new elements, and perhaps a few dialog boxes, check boxes, and warning messages. The tool gets in the way of the creative process.

Now imagine a whiteboard is in use. A couple of boxes are drawn, some lines erased, some new lines added. The free flow of ideas continues, uninterrupted. Once again, this is what you want.

Unimpeded collaboration produces better results, faster.

No white board? No problem.

Don’t let the lack of a whiteboard in your cube or project room stop you.

There are many brands of inexpensive whiteboard sheets that cling to the wall. These have the additional benefit of being easy to relocate if you are forced to move to another room. Here is one example, available on Amazon.com:

There are also several kinds of whiteboard-style notebooks. These are useful if you are working alone or in a group of two. They provide the same benefit of being able to collaborate and quickly change your minds, but in a smaller format. The one that I carry is called Wipebook.

I learned about these and similar solutions from clients and students in my seminars.

…and then the tool

All this is not to say that modeling tools are bad. To the contrary, they are essential. Once the ideas have been firmed up, a modeling tool is the next step.

Modeling tools allow you to get ideas into a form that can be reviewed and revised. They support division of labor for the required “grunt work,” such as filling in business definitions, technical characteristics, and other metadata. And they produce useful documentation for developers, maintainers, and consumers of your solutions.

But when you’re getting started, use a whiteboard!

Tapping Into Non-Relational Data

2017-03-20T10:22:00.000-04:00

Modern BI and Analytics programs use non-relational data management for six key functions. You should be cognizant of these functions when you add non-relational technology to your data architecture.

Kinds of Non-Relational Storage

Most IT professionals are familiar with the RDBMS. Relational databases store data in tables that are defined in advance. The definition specifies the columns that comprise the table, their data types, and so forth. The design is referred to as a data model or schema.

Relational storage is immensely useful, but it is not the only game in town. There are several alternative types of data storage including:

Key-value stores store data in associative arrays comprised of sort keys and associated data values. These flexible data structures can be distributed across nodes of commodity hardware, and are manipulated using distributed processing algorithms (map-reduce). Hadoop functions as a key-value store.

Document stores track collections of documents that have self-defining structure, often represented in XML or JSON formats. A document store may be built on top of a key-value store. MongoDB is a document store.

Graph databases store the connections between things as explicit data structures (similar to pointers). These are stored separately from the things they connect. This contrasts with the RDBMS approach, where relationship information is stored within the things being associated (keys within tables). Neo4j is a graph database.
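
To make the contrast concrete, here is a minimal sketch in plain Python that represents the same small fact (a customer places an order) in each of the three styles. The data, the key format, and the field names are invented for illustration. Real products such as Hadoop, MongoDB, and Neo4j have their own APIs, which are not shown here.

```python
import json

# Key-value style: opaque values looked up by key (keys are hypothetical).
key_value_store = {
    "customer:1": "Ann",
    "order:100": "customer=1;amount=40",
}

# Document style: each document carries its own structure, here as JSON.
order_document = json.dumps({
    "order_id": 100,
    "amount": 40,
    "customer": {"id": 1, "name": "Ann"},
})

# Graph style: connections are stored as explicit structures,
# separately from the things they connect.
nodes = {
    "c1": {"type": "customer", "name": "Ann"},
    "o100": {"type": "order", "amount": 40},
}
edges = [("c1", "PLACED", "o100")]


def orders_placed_by(customer_node):
    """Follow PLACED edges from a customer node to its order nodes."""
    return [target for source, label, target in edges
            if source == customer_node and label == "PLACED"]


placed = orders_placed_by("c1")
document_customer = json.loads(order_document)["customer"]["name"]
```

Note how the relationship lives in a different place each time: encoded inside a value, nested inside a document, or held as a separate edge.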

Reasons for Non-Relational Storage

In the age of big data, organizations are tapping into these forms of storage for a variety of reasons. Here are just a few:

No model: not requiring a predefined schema enables storing new data that has yet to be explored and modeled.
Low cost: the cost of data management can be significantly lower than RDBMS storage.
Better fit: some business use cases are a natural fit to alternative paradigms.

But don’t simply add a box to your data architecture for non-relational data. You must plan for specific usage paradigms, and make sure your architecture and processes support them.

Uses for Non-Relational Storage

There are six major use cases for non-relational data in modern data architectures:

Capture: Non-relational storage facilitates intake of raw data. New data sets are captured in non-relational data stores, where they become “raw material” for various uses. A non-relational data store such as Hadoop can be used to capture data without having to model it first. This is often referred to as a data lake. I prefer to call it a landing zone.

Explore: Non-relational storage facilitates exploration and discovery. Exploration is the search for value in new data sets. Exploration applies analytic methods to captured data, often combining it with existing enterprise data. The goal is to find value in the data, and to identify things that will be worth tracking on a regular basis.

Archive: Non-relational storage serves as an archive. Data for which immediate value has not been identified is moved to an archive. From here it can be fetched for future use. Archiving data helps ensure the data lake does not become the fabled “data swamp.” An alternative to archiving unused data is simply to purge it.

Deploy: Non-relational storage supports production analytics. When value is found in data, it is transitioned to a production environment and processes are automated to keep it up to date. Deployments range from simple reports to complex analytic models.

Augment: Non-relational storage serves as a staging area for the data warehouse. In many cases, the insights gained from exploration prove valuable enough to track on a regular basis. Augmentation is the process of adding elements to a relational data warehouse that come from non-relational sources or analytic processes. A credit score, for example, might be incorporated into a customer dimension.

Extend: Non-relational storage expands what can be maintained in the data warehouse. Sometimes there is lasting value in non-relational data, but it is not appropriate to migrate it to relational storage. In such cases, the non-relational data is moved to a non-relational extension of the data warehouse. Applications can link relational and non-relational data. For example, non-relational XML documents may be made available for “drill down” from a dimensional cube.
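
As a rough illustration of the Augment case, the sketch below copies a credit score from a non-relational source into rows of a customer dimension. Everything here is hypothetical: the JSON documents, the column names, and the `augment_customer_dimension` helper are assumptions made for the example, not part of any particular product.

```python
import json

# Hypothetical output of an analytic process, stored as JSON documents.
score_documents = [
    '{"customer_id": 1, "credit_score": 720}',
    '{"customer_id": 2, "credit_score": 655}',
]

# Hypothetical rows of a relational customer dimension.
customer_dimension = [
    {"customer_key": 10, "customer_id": 1, "name": "Ann"},
    {"customer_key": 11, "customer_id": 2, "name": "Bob"},
    {"customer_key": 12, "customer_id": 3, "name": "Cy"},
]


def augment_customer_dimension(dimension_rows, documents):
    """Return new dimension rows with a credit_score attribute added.

    Customers with no matching document get None, so the gap stays visible.
    """
    scores = {}
    for raw in documents:
        doc = json.loads(raw)
        scores[doc["customer_id"]] = doc["credit_score"]
    return [dict(row, credit_score=scores.get(row["customer_id"]))
            for row in dimension_rows]


augmented = augment_customer_dimension(customer_dimension, score_documents)
```

The helper returns new rows rather than changing the originals, which keeps the sketch easy to reason about. A production process would also need to decide how to handle history when a score changes.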

In addition to these six primary use cases, non-relational platforms may serve several utility functions. These include staging, data standardization, cleansing, and so forth.

Learn More

Join me for my course Data Modeling in the Age of Big Data, offered exclusively through TDWI. At the time of this writing, it is offered next at TDWI Chicago on May 11, 2017. You can also bring this course to your site through TDWI Onsite Education. For more information contact me.

Data Alone Does Not Change People’s Minds

2017-03-19T10:58:00.000-04:00

On NPR’s Hidden Brain podcast, cognitive neuroscientist Tali Sharot discusses the role of data in changing people’s behavior.

From Data to Action

The goal of analytics is to have a positive impact on the performance of your organization. To have an impact, you usually need to convince people to change their behavior.

This is required whether you want to convince a CEO to adopt a new strategy, a manager to allocate resources differently, or a knowledge worker to change their processes.

That’s why data visualization and data storytelling have become key skill sets for modern analytics professionals.

Data is Not Enough

How do you convince people to change their behaviors? Many analysts fall into the trap of letting the data speak for itself.

On a recent episode of NPR’s Hidden Brain podcast, cognitive neuroscientist Tali Sharot explains that data alone won’t do the job. (The podcast is embedded above.)

Most people are familiar with the concept of confirmation bias, where we tend to accept data that supports our existing opinions. Sharot suggests there are ways to override this kind of bias.

Some key takeaways:

People evaluate new information based on what they already believe
Strongly held false beliefs are difficult to change with data
Fear tends to lead to inaction, rather than action
Positive feedback or hope is a powerful motivator if you want to change people’s actions

This is a fascinating listen for anyone interested in telling stories with data. Not only does it offer suggestions on how to change people’s behavior, it also illustrates the power of tracking results and making them available to people.

I’ve pre-ordered Sharot’s upcoming book, The Influential Mind. You should too!

Recommended Podcast Apps

I have received a lot of positive feedback from people who enjoy listening to the podcasts I mention on this blog. Several people have asked me how to listen to podcasts.

You can, of course, simply click on the play button in the posts. But you can also subscribe to podcasts using a smartphone app. This lets you listen on the go, and also notifies you when new episodes are available.

Here are two apps I recommend if you use an iOS device:

Castro Podcast Player is perfect if you are new to podcasts, or if you subscribe to a handful of podcasts.
Overcast: Podcast Player is good for people who subscribe to a large number of podcasts. It is more complex, but allows you to set up multiple playlists and priorities.

Further Reading

Read (or Listen to) Discussions of Analytic Models (9/28/2016)
What Hollywood Can Teach Analytics Professionals: How to Tell Stories (12/23/15)

Source: NPR’s Hidden Brain podcast for March 13, 2017

Avoid the Unintended Consequences of Analytic Models

2017-01-25T09:51:00.000-05:00

Cathy O’Neil’s Weapons of Math Destruction is a must-read for analytics professionals and data scientists.

In a world where it is acceptable for people to say, “I’m not good at math,” it’s tempting to lean on analytic models as the arbiters of truth.

But like anything else, analytic models can be done poorly. And sometimes, you must look outside your organization to spot the damages.

The Nature of Analytic Insights

Traditional OLAP focuses on the objective aspects of business information. “The person who placed this $40 order is 39 years old and lives in Helena, Montana.” No argument there.

But analytics go beyond simple descriptive assertions. Analytic insights are derived from mathematical models that make inferences or predictions, often using statistics and data mining.¹

This brings in the messy world of probability. The result is a different kind of insight: “This person is likely to default on their payment.” How likely? What degree of certainty is needed to turn them away?

When you make a decision based on analytics, you are playing the odds at best. But what if the underlying model is flawed?

Several things can go wrong with the model itself:

It is a poor fit to the business situation
It is based on inappropriate proxy metrics
It uses training data that reinforces past errors or injustices
It is so complex that it is not understood by those who use it to make decisions

And here is the worst news: whether or not you manage to avoid these pitfalls, a model can seem to be “working” for one area of your business, while causing damage elsewhere.

The first step in learning to avoid these problems is knowing what to look for.

Shining a Light on Hidden Damages

In Weapons of Math Destruction, Cathy O’Neil teaches you to identify a class of models that do serious harm. This harm might otherwise go unnoticed, since the negative impacts are often felt outside the organization.² She calls these models “weapons of math destruction.”

O’Neil defines a WMD as a model with three characteristics:

Opacity – the workings of the model are not accessible to those it impacts
Scale – the model has the potential to impact large numbers of people
Damage – the model is used to make decisions that may negatively impact individuals

The book explores models that have all three of these characteristics. It exposes their hidden effects on familiar areas of everyday life – choosing a college, getting a job, or securing a loan. It also explores their effects on parts of our culture that might not be familiar to the reader, such as sentencing in the criminal justice system.

Misaligned Incentives

O’Neil’s book is not a blanket indictment of analytics. She points out that analytic models can have wide ranging benefits. This occurs when everyone’s best interests line up.

For example, as Amazon’s recommendation engine improves, both Amazon and their customers benefit. In this case, the internal incentive to improve lines up with the external benefits.

WMDs occur when these interests conflict. O’Neil finds this to be the case for models that screen job applications. If these models reduce the number of résumés that HR staff must consider, they are deemed “good enough” to use. They may also exclude valid candidates from consideration, but there is not an internal incentive to improve them. The fact that they harm outside parties may even go unnoticed.

Untangling Impact from Intent

WMDs can seem insidious, but they are often born of good intentions. O’Neil shows that it is important to distinguish between the business objective and the model itself. It’s possible to have the best of intentions, but produce a model that generates untold damage.

The hand-screening of job applications, for example, has been shown to be inherently biased. Who would argue against “solving” this problem by replacing the manual screening with an objective model?

This may be a noble intention, but O’Neil shows that it fails miserably when the model internalizes the very same biases. Couple that with misaligned incentives for improvement, and the WMD fuels a vicious cycle that can have precisely the opposite of the intended effect.

Learning to Spot Analytic Pitfalls

The first step to avoiding analytics gone awry is to learn what to look for.

“Data scientists all too often lose sight of the folks at the receiving end of the transaction,” O’Neil writes in the introduction. This book is the vaccine that helps prevent that mistake.

If you work in the field of analytics, Weapons of Math Destruction is an essential read.

Notes:

1. OLAP and Analytics are two of the key service areas of a modern BI program. To learn more about what distinguishes them, see The Three Pillars of Modern BI (Feb 9, 2005).

2. But not always. For example, some of the models explored in the book have negative impacts on employees.

Is Your Team Losing the Spirit of the Agile Manifesto?

2016-12-02T12:27:00.001-05:00

As the adoption of agile BI techniques spreads, it is easy to become wrapped up in methodology and lose sight of the agile spirit.

This year is the 15th anniversary of the Agile Manifesto. Mark the occasion by refocusing on the agile principles.

The Age of “NO”

Fifteen years ago, most business software was developed using rich but complex methodologies. Businesses had spent years refining various waterfall-based methods, which were heavily influenced by two decades of information engineering.

The result seemed to make sense from IT’s point of view, ensuring a predictable process, repeatable tasks, and standard deliverables.

To the people who needed the systems, however, the process was inscrutable.¹ It seemed like a bewildering bureaucracy that was managed in a foreign language. They were always being told “no.”

Can we add a field to this screen?
No, you would have had to ask that before the requirements freeze.
Can we change the priority of this report?
No, that decision must be made by the steering committee.
Can we add a value to this domain?
No, that is the province of the modeling group.

The groups were looking at software development from completely different perspectives. They were not really collaborating.

Business people were asking for functionality. IT was beating back the requests by appealing to methodology.

Enter Agile

In 2001, a group of developers met in Colorado to talk about what was working and not working in the world of software development. They produced the Agile Manifesto:

Manifesto for Agile Software Development

We are uncovering better ways of developing
software by doing it and helping others do it.
Through this work we have come to value:

Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan

That is, while there is value in the items on
the right, we value the items on the left more.

Kent Beck, Mike Beedle, Arie van Bennekum, Alistair Cockburn, Ward Cunningham, James Grenning, Jim Highsmith, Andrew Hunt, Ron Jeffries, Jon Kern, Brian Marick, Robert C. Martin, Steve Mellor, Ken Schwaber, Jeff Sutherland, Dave Thomas

© 2001, the above authors
this declaration may be freely copied in any form, but only in its entirety through this notice.

The thrust of the manifesto is on collaboration between business and technical personnel, with an emphasis on visible business results. That is, while methods are important, results are more important.²

The Agile Manifesto helped refocus software development on the product: functional business applications.

Agile Today: The Danger of Cooptation

Fifteen years later, it is safe to say that “Agile” has permeated the mainstream. This is largely a good thing. But whenever something moves from being a new alternative to wide acceptance, there is a potential dark side. As new ideas spread, they can be misunderstood or corrupted; adoption becomes cooptation.

I frequently see signs of back-sliding to The Age of “No.” Even among developers who follow agile-based approaches, there is a tendency to lose sight of the agile principles. Here are two examples from my recent experience:

A team had decided to implement an unusual data model. It could easily return incorrect results if not queried in specific ways. The recommendation was to write some extra documentation explaining the pitfalls, and a short tutorial for developers. The response: “We cannot do that. We are an Agile shop. We cannot produce documentation that is not auto-generated by our tools.”

On another occasion, a team was developing the scope for several work streams. One set of needs clearly required a three-week collaborative activity. This was expressly forbidden. “Our Agile approach requires everything be broken down into two-week sprints.”

In both cases, the response was couched in methodological concerns, with no focus on the business benefit or value.³

This is precisely the kind of methodological tunnel-vision against which the Agile Manifesto was a reaction.

Keeping the Faith

It is hard to disagree with the agile principles, regardless of your organization’s commitment (or lack thereof) to an agile process.

You can take one simple step to ensure that you are not losing touch with agile principles:

Whenever you are tempted to say “no,” pause and reflect.

Why are you denying the request? Is it simply based on process? Think about what the person actually wants. Is there value? Is there a way to address the business concern?

Sometimes “no” is the right answer. But always be sure your evaluation places business capabilities and benefits ahead of process and procedure.

Learn More:

Read more posts about applying agile principles to BI and analytics:

Create Social Documentation (July 8, 2015)
Document Information Requirements Graphically with BDM Diagrams (February 10, 2014)

Notes

1. These people were often referred to as “users.” Over time, this became a derogatory term.
2. Agile is often misinterpreted as emphasizing speed.
3. Luckily, both of these teams saw fit to make exceptions to their processes, prioritizing business value over method.

Probability and Analytics: Reactions to 2016 Election Forecasts

2016-11-17T13:02:00.001-05:00

Reactions to the 2016 election forecasts suggest we don’t do a good job communicating probability and risk.

In a September 2016 post, I suggested readers check out the discussions of analytic models at FiveThirtyEight. One of the links led to their forecast model for the 2016 presidential election.¹

In the past week, I have received quite a bit of email suggesting I should take down the post, given that the model “failed.” For example, one emailer wrote:

How can you continue to promote Nate Silver? The election result proved the analytics wrong.

These reactions expose a real issue with analytics: most people do not understand how to interpret probability.

An analytic failure?

On November 7th, the final prediction of the FiveThirtyEight “Polls Only Model” gave Hillary Clinton a 71% chance of winning. As things turned out, she lost.

Those emailing me were not alone in believing the model failed. The day after the election, there were many stories suggesting FiveThirtyEight and the other aggregators were wrong.²

But were they?

Nate Silver discusses the FiveThirtyEight Model

(If the video above does not play, you can access it here.)

Understanding probability

The FiveThirtyEight model gave Clinton a 71% chance of winning the election. That’s about a 7 in 10 chance. To understand how to interpret this probability, try the following thought experiment:

Suppose you are at Dulles airport, and are about to board a plane. While you are waiting, you are notified that there is a 7 in 10 chance your flight will land safely. Would you get on the plane?

I know I wouldn’t.

When the probability of something happening is 70%, the probability of it not happening is 30%. In the case of the airline flight, that’s not an acceptable risk!

Now suppose the flight lands safely. Was the prediction right?

Maybe, but maybe not. The plane landed safely, but were the odds with the passengers? Was there actually a greater danger that was narrowly avoided? Was there no danger at all?

When a single event is assigned a probability, it’s hard to assess whether the assigned probability was “correct.”

Suppose every flight departing Dulles was given a 7 in 10 chance of landing safely, rather than just one. The next day, we check the results and find that all flights landed safely. Was the prediction correct?

In this case, we are able to say that the model was clearly wrong. About 1,800 flights depart Dulles airport each day. The model predicted that thirty percent, or about 540 flights, would not land safely. It clearly missed the mark, and by a wide margin.

Probabilistic predictions are easier to evaluate when they apply to a large number of events.
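
The arithmetic behind the Dulles example can be checked with a few lines of Python. The sketch assumes each flight is an independent event with the same 7 in 10 chance of landing safely, which is a simplification made only for illustration. The flight count is the approximate figure used above.

```python
import math

flights = 1800   # approximate daily departures, from the example above
p_safe = 0.7     # the hypothetical model's chance of a safe landing

# Expected number of flights that would NOT land safely under the model.
expected_unsafe = flights * (1 - p_safe)

# Log10 of the probability that every flight lands safely, assuming
# independence between flights (an assumption of this sketch).
log10_all_safe = flights * math.log10(p_safe)

print(round(expected_unsafe))   # 540
print(log10_all_safe)           # a large negative number
```

Under these assumptions, the chance of every flight landing safely is far smaller than one in 10^200. Observing an all-safe day is therefore strong evidence against the model, which is not something a single flight can provide.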

Explaining probability

In the days and weeks leading up to the election, the FiveThirtyEight staff spent a good deal of time trying to put the uncertainty of their forecast in context. As the election drew closer, these became daily warnings:

November 6: A post outlined just how close the race was, and how a standard polling miss of 3% could swing the election.
November 7: An update called a Clinton win “probable but far from certain.”
November 8: The final model discussion outlined all the reasons a Clinton win was not a certainty, and explored scenarios that would lead to a loss.

Despite all this, many people were unable to interpret the probabilistic model, and the associated uncertainty.

Avoiding unrealistic expectations

If a research scientist at Yale and the MIT Technology Review misunderstood a probabilistic forecast, how well are people in your business doing?

Are people in your business making decisions based on probabilistic models?
Are they factoring an appropriate risk level into their actions?
Are you doing enough to help them understand the strength of model predictions?

It's important that decision makers comprehend the predictive strength of the models they use. And it’s everyone’s responsibility to make sure they understand.

We have a long, long way to go.

Notes:

1. See the post Read (or Listen to) Discussions of Analytic Models. The model discussion I linked to is: A User’s Guide To FiveThirtyEight’s 2016 General Election Forecast

2. “Aggregators” is a term used by the mainstream press to describe data scientists who build models based on polling data. Here are a few stories that suggested these models were wrong: The Wrap, Vanity Fair, The New Yorker, Quanta Magazine, MIT Technology Review.

Read (or Listen to) Discussions of Analytic Models

2016-09-28T10:56:00.000-04:00

Organizations often feel their analytics are proprietary, and therefore decline to discuss how their models work. One shining exception is Nate Silver’s FiveThirtyEight.com. The site makes a point of exposing how their models are built. They also discuss their models as part of their elections podcast.

Data Storytelling

Chris Adamson on Modeling Challenges

2016-04-24T11:00:00.002-04:00

In a recent interview, the folks at WhereScape asked me some questions about data modeling challenges.

In Business Intelligence, modeling is a social activity. You cannot design a good model alone. You have to go out and talk to people.

As a modeler, your job is to facilitate consensus among all interested parties. Your models need to reflect business needs first and foremost. They must also balance a variety of other concerns — including program objectives, the behavior of your reporting and visualization tools, your data integration tools, and your DBMS.

It’s also important to understand what information resources are available. You need to verify that it is possible to fill the model with actual enterprise data. This means you need to profile and understand potential data sources. If you don’t consider sources of data, your designs are nothing more than wishful thinking.

When considering a non-relational data source, resist the urge to impose structure before you explore it. You’ve got to understand the data before you spend time building a model around it.

Check out the video above, where I discuss these and other topics. For a full-sized version, visit the WhereScape page.

What Hollywood Can Teach Analytics Professionals: How to Tell Stories

2015-12-23T11:11:00.001-05:00

You might not realize it, but you probably have something in common with the creators of the TV show South Park.

Analytics yield insights that can have powerful business impact. These insights come from statistics and data mining—processes that are inaccessible to most people. If you want your business to learn and remember, you have to tell a story.

All too often, the communication of an analytic finding reads like a police report: procedural, laden with jargon, and stripped of meaningful business context.

That’s not interesting. People won’t learn from it, and they certainly won’t change their behavior.

How then to get your point across? You need to learn how to tell stories. Data stories.

Trey Parker and Matt Stone know a thing or two about telling a story. They are the creators of South Park, a wildly successful television show which has been on the air for 19 years. Like you, their success depends on telling interesting stories.

In the video clip above, Parker and Stone are speaking to a group of students at NYU on storytelling strategies. Trey tells the students:

We can take these beats, which are basically the beats of your outline, and if the words “and then” belong between those beats, you’re f***ed. Basically. You’ve got something pretty boring.

What should happen between every beat that you’ve written down is either the word “therefore” or “but.”

Data storytellers make this mistake all the time. "We did this…then we tried that…the algorithm showed this…the correlation coefficient is that…our conclusion is...”

This kind of forensic storytelling is boring. It won’t be remembered, and the value of the insight will be lost. Save the procedural detail for an appendix somewhere. People learn from good stories, not lab reports.

As Matt says later in the clip, you need causality to have an interesting story:

But. Because. Therefore. That gives you the causation between each beat. And that…that’s a story.

Be sure to watch the entire clip and, if you are so inclined, take some time off for an episode or two of South Park. It just might make you a better data scientist!

The embedded video is from the NY Times ArtsBeat blog post, Hello! Matt Stone and Trey Parker Crash a Class at NYU (September 8, 2011). Hat tip to Tony Zhou and his Video Essay on F for Fake at the marvelous blog Every Frame a Painting.

Create Social Documentation

2015-07-08T09:47:00.001-04:00

Documentation is sometimes viewed as a necessary evil. But it doesn't have to be. Here's how to produce documentation that will be used.

Useful documentation gets used -- during all development phases, and by all interested parties.

Burdensome methodologies often expend precious hours producing documentation that is hard to use. Many projects leave behind fat binders of text that hardly anyone will ever open. These examples have given documentation a bad name.

The good news is that documentation can be done right. It does not have to be a drag on project time, it does not have to be a chore to read and review, and it does not have to be something we interact with alone.

Why we need documentation

Documentation is not an after-the-fact explanation of what has been built. Used properly, it is a central component of the entire lifecycle of a BI solution.

Important uses include:

Prior to development: Identify and validate requirements and designs
During development: Specify what to build
After development: Educate business people and support personnel

Of course, there are many other areas in which documentation has value (program planning, governance, change management, etc.). These three above are sufficient to illustrate the value of social documentation.

Social Documentation

Useful documentation should be easy to read and discuss. It should also not be burdensome to produce. Three principles shape social documentation.

Social documentation is the focus of collaboration.

Whenever possible, I recommend to my clients that we use PowerPoint for documentation. Why? Word processors are tailor made for reading, which is a solitary activity. Presentation software is tailor made for collaboration.

Social documentation is easy to navigate.

Support "random access" rather than "sequential access." Presentation software is great for this; we can easily sort and navigate slides by their titles. This can also be achieved using document maps or outlines.

Social documentation is not prose.

Each slide in a presentation, or section in a document, should be set up to capture essential information in a consistent format. This format may be tabular, diagrammatic, or both. Your subject matter will dictate the appropriate format.

But here is the important part:

No paragraphs
No prose
If using PowerPoint: No bullet lists. (They're just a back door to writing paragraphs.)

Uses for social documentation

I find the presentation format excellent for defining program priorities, defining project scope, capturing business requirements, developing top level information architectures, and a variety of other tasks. For specifications, a word processed document with multi-level headers and a document map typically fits the bill.

When documenting business metrics for a dashboard or scorecard, for example, set up a PowerPoint presentation with one slide per metric. Use a standard tabular format to document each metric. This documentation is easy to produce, review and revise, as I will discuss in a moment.

Where presentation software is not practical, word processors can be used in the same way. Divide the document into sections, activate the contents sidebar, and use a consistent tabular format.

Of course, not all documentation is captured in this manner. For example, we might use social documentation to capture a top level star schema design, then use a modeling tool to produce a detailed design.

Advantages of Social Documentation

This simple approach has numerous advantages.

Frictionless and Comprehensive

During requirements specification, social documentation allows you to capture the necessary information in a frictionless and comprehensive manner. A standard tabular format, for example, ensures the same items are filled in. The presentation itself is easy to navigate via sections and slide titles.

Engages with the business

Social documentation invites collaboration. Give people a big fat binder and their eyes will cross. Show them 5 or 6 slides that capture the business metrics they care about, and they will give you feedback.

I always have my laptop with me, so if I happen to be in a room with a SME, I can pull it out, flip to the correct slide, and ask a question.

Incidentally, collaboration with the business is one of the cornerstones of the agile manifesto.

Reviewed together, rather than in isolation

Ever sent out a fat document for review? If you have, you know the results are not good. Most people will not review it by the deadline. When reminded, they will say, "it looks good." A precious few will provide detailed feedback.

Social documentation transforms this process. A review is conducted by bringing people into a room and reviewing the deck. Any agreed upon changes are made directly to the presentation slides.

The documentation is now ready for the next tasks: guiding development and then serving as the basis for education.

Learn More

Read more about documenting BI program activities in these posts:

A great diagramming technique for information requirements: Document Information Requirements Graphically with BDM Diagrams (February 10, 2014)
Recommended approach to documenting dimensional designs: Dimensional Modelers Do Not Focus on Logical vs. Physical (July 5, 2011)

For more details on what to document, check out my book Star Schema: The Complete Reference. Detailed descriptions and examples can be found in Chapter 18, "How To Design And Document A Dimensional Model.”

I also discuss documentation of information requirements and business metrics in the course “Business Information and Modern BI.” Check the sidebar for current offerings.

Join Chris in Europe: 18-22 May 2015

2015-03-26T22:13:00.001-04:00

I will be leading a week of in-depth sessions in Berlin this May. The rigorous agenda includes full courses on Performance Management, Analytics, and Dimensional Modeling.

Join me for any or all of the following sessions:

18 May: TDWI Performance Management: Measurement, Metrics, and Monitoring Learn about performance management and BI, including metric selection, dashboards and scorecards.
19 May: TDWI Business Analytics: Exploration, Experimentation and Discovery Learn about problem framing, problem models, solution models, heuristics and experimentation.
20 May: TDWI Predictive Analytics Fundamentals This class covers the business case for predictive modeling, statistics and data mining, the model development process, and the organizational impacts of predictive modeling.
21-22 May: Dimensional Modeling: Advanced Techniques for Practitioners This is the two-day companion course to my book, Star Schema: The Complete Reference.

Hope to see you there! For more details and to register, visit TDWI Europe:

May 18-20 Info and Registration
May 21-22 Info and Registration

Join me for the whole week, and you will have covered each of the Three Pillars of Modern BI!

BI and the Path to Business Value

2015-03-20T09:27:00.001-04:00

Managing BI services requires a consistent information architecture, even if different teams are responsible for data marts, performance management, and analytics.

Business Value From BI

Business Intelligence is the use of information to improve business performance.^[1] To improve business performance, we must do three things:

Track business performance
Analyze business performance
Impact business performance

Each step on the path to business value is supported by two kinds of BI services, as shown in the illustration.

Tracking performance requires understanding what is currently happening (Performance Management) and what has happened in the past (OLAP).
Analyzing performance requires the ability to get to detail (OLAP) and develop insight into cause and effect (Business Analytics).
Impacting performance requires targeting a business metric (Performance Management) and taking a prescribed course of action (Business Analytics).

Each of these steps leverages a pair of BI services, and each service shares a common interest in business information.^[2] Managing BI services therefore requires a consistent information architecture. This is true even when separate teams manage each area.

Tracking Performance

Understanding performance often starts with summarized data on dashboards and scorecards (Performance Management). The need to investigate potential problems requires detailed data and history (OLAP and Reporting).

As Wayne Eckerson demonstrated in Performance Dashboards, both these areas provide stronger business value when they are integrated. For example, a dashboard is more useful when someone can click a metric and bring up supporting detail in an OLAP cube.

To successfully link Performance Management and OLAP, the two domains must share common definitions for business metrics (facts) and associated reference data (dimensions). Metrics must be calculated in the same way, linked to reference data and different levels of detail, and synchronized (if managed separately).

Analyzing Performance

Analyzing performance is the process of breaking down what has occurred in an attempt to understand it better. Slicing and dicing an OLAP cube is a form of analysis, providing insight through detail. Analytic models provide a deeper level of analysis, providing insight into cause and effect, and extending this to the future through prediction.

OLAP is largely focused on exploring various aggregations of business metrics, while analytics is largely focused on the underlying detail that surrounds them. Our OLAP solutions provide historic detail to Business Analytics in the form of data from the data warehouse.^[3]

The exchange flows in the opposite direction as well. Business analytics develop insights that suggest other things that should be tracked by OLAP services. For example, a particular set of behaviors may define a high value customer. This assessment is developed using Business Analytics, and applied to the customers in the OLAP data mart. For a fun example from the world of sports, check out the book Moneyball by Michael Lewis.^[4]

Improving Performance

All of this is somewhat academic if people in the business do not use this information to make decisions. Business impact occurs at the intersection of Performance Management (which tells us what is important and how we are doing) and Analytics (which suggests the best course of action).

Every analytic model targets a business metric or key performance indicator (KPI) from the world of performance management. That same KPI, in turn, can be used to measure return on investment of the analytic model.

For example, a direct sales manager of enterprise software wants to reduce cost of sales. An analytic model is developed that assesses the likelihood of a prospect to buy enterprise software.

The manager begins using the prospect assessment model to prioritize the work of the sales team. Less likely prospects are reassigned to a telesales force. Over the next two quarters, cost of sales starts falling. The same KPI that the analytic model targeted is used to measure its return on investment.
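The arithmetic behind that last point can be sketched in a few lines of Python. Every figure below is made up for illustration, and the function name `roi` and the choice to treat the drop in the KPI as the gain are assumptions, not something taken from a real engagement.

```python
def roi(baseline_kpi, current_kpi, model_cost):
    """Return on investment, treating the drop in the targeted KPI as the gain.

    All three inputs are in the same currency. This is the ordinary
    (gain - cost) / cost formula, applied to a cost-type KPI.
    """
    savings = baseline_kpi - current_kpi
    return (savings - model_cost) / model_cost


# Hypothetical figures: cost of sales before and after the prospect
# assessment model was put to use, and what the model cost to build.
baseline_cost_of_sales = 1_000_000.0
current_cost_of_sales = 850_000.0
model_cost = 50_000.0

model_roi = roi(baseline_cost_of_sales, current_cost_of_sales, model_cost)
print(f"ROI of the analytic model: {model_roi:.0%}")
```

Both inputs to the comparison come from the same KPI definition, which is why the metric needs to be calculated consistently across performance management and analytics.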

Information as an Asset

It is common to manage each of the pillars of Modern BI as a separate program. The path to business value, however, requires that these programs share a consistent view of business information. BI programs that are not centralized must carefully coordinate around a common information architecture.

Further Reading

[1] For more on this definition of business intelligence, see Business Intelligence in the Modern Era (9/8/2014).
[2] The three service areas are explained in The Three Pillars of Modern BI (2/9/2015).
[3] Sometimes analytic modelers bypass the data warehouse, but there are steps you can take to make this important repository more useful. For tips on how to make your data warehouse more useful to analytic modelers, see Optimizing Warehouse Data for Business Analytics (9/25/13). Note that even with a well designed data warehouse, analytic models often augment this enterprise data with additional data sources.
[4] The Oakland A's used analytics to re-evaluate the basic metrics used to assess the value of a baseball player. See Business Analytics and Dimensional Data (7/17/13).

** Interested in learning more about modern BI programs? **

Check out my new course, Business Information and Modern BI: Evolving Beyond the Dimensional Data Mart. Offered at TDWI conferences and onsite. See the sidebar for upcoming dates.

Modern BI with Chris Adamson: Chicago, May 7

2015-03-11T07:36:00.000-04:00

Join me at TDWI Chicago 2015 for my latest course, Business Information and Modern BI: Evolving Beyond the Dimensional Data Mart.

In this full-day class, I will show you how a modern BI program can help you track, analyze and improve business performance.

With a strong focus on information, we will look at how new technologies and best practices have reshaped the way BI delivers business value.

We will cover all three pillars of Modern BI, and also discuss organizational options, agile development and technology policies.

I'll also be leading classes on Predictive Analytics (5/5/15) and Data Visualization (5/6/15).

Discount Code for Registration

If you are planning to attend, use this link to register, and enter Priority Code 111 for a 10% discount.

Hope to see you there!

The Three Pillars of Modern BI

2015-02-09T09:37:00.000-05:00

Data marts are no longer sufficient to meet the demands of a modern BI program. This post lays out a framework for delivering BI value in the modern era.

The technologies and processes that help us deliver BI services have advanced by leaps and bounds over the last two decades. A modern BI program provides three perspectives on business performance, roughly corresponding to the past, present, and future.

OLAP and Reporting

OLAP and reporting services (or simply "OLAP") provide the "official record" of what has happened in the past--the canonical log of business activity.

This pillar of the modern BI program helps the business understand "where we've been." The typical information products provided in this service area include:

Reports provide pre-built, parameterized access to business information
Analysis provides the ability to explore the official record of business activity by slicing, dicing, drilling, and so forth (OLAP)
Ad hoc query capabilities allow people to ask their own questions about the official record, even if a pre-defined report or analysis does not exist.

For people in the business, these kinds of information products come to define this pillar of the BI program. There is also a fourth important information product of which the business may have less direct awareness:

The integrated record of business activities, aka "Data Marts." This record combines, standardizes and organizes information for business consumption.

Essential in delivering the first three kinds of information products, this component was the primary focus in the early years of BI, when we called the practice "data warehousing." Since then, the discipline has changed and expanded. But it is still essential that the BI program provide the ability to understand the past.

Performance management

Performance management services provide real-time status on key performance indicators, as well as performance versus goals.

KPI's and goals are carefully matched to the viewer's role and linked to business objectives. Goals communicate expectations, while KPI's communicate achievement of expectations.

If OLAP is about "where we have been," then performance management is about "where we are now." Typical information products in this BI service area include:

Dashboards provide real-time or near-real-time status of KPI's
Scorecards communicate progress vs. goals

Information on dashboards and scorecards is carefully tailored for the user or functional area. Metrics are chosen for relevance and actionability, linked to business strategy, and balanced to reflect a holistic picture of performance.

While this service area can stand on its own, performance management solutions are more powerful when people can dig into the KPI's on their dashboards. This capability is enabled by integrating performance management services with OLAP services.

Business analytics

Analytic services probe deeply into data, providing insight into cause and effect, making predictions about what will happen in the future, and prescribing a course of action.

While analytic services draw on data from the past, their objective is to influence the future. Typical information products in this service area include:

Analytic models that make sense of activities or predict future events
Simulations that allow the manipulation of variables to study their potential impact on results
Visualizations that communicate analytic insights
Analytic metrics that assess current state and/or future outcomes, which are fed to OLAP, performance management, and OLTP applications

Like the other pillars of modern BI, analytic services can exist alone but are more powerful in the presence of the other pillars. Prescriptive metrics, for example, are best presented directly on operational dashboards; useful analytic metrics can be recorded and tracked over time in data marts.

Delivering Modern BI

In each area of the business, these capabilities should be balanced and tied together. Centralized management of all three pillars is not required, but they should be coordinated and integrated. A shared roadmap should lay out their planned evolution.

Your objective is business impact, and my next post shows how these services deliver it.

Learn More

BI and the Path to Business Value (3/20/2015) explores how the three pillars of modern BI enable business impact.
Business Intelligence in the Modern Era (9/8/2014) provides a definition of Business Intelligence for modern information assets.

For more on managing the Modern BI program, check out Chris's latest course: Business Information and Modern BI. Check the sidebar for upcoming dates.

Business Intelligence in the Modern Era

2014-09-18T11:10:00.002-04:00

This post offers an updated definition for BI, and suggests that you don't have to think about it as a box on an org chart.

BI has changed a lot in the last two decades. Technologies and best practices have evolved, and we've found more ways in which a BI program can deliver value. Some of these innovations have occurred outside of IT or the BI Competency Centers that many businesses have established. At the same time, many organizations are moving to make business units autonomous.

These changes lead many people to ask what exactly is BI? Is it a box on the org chart? Does it include analytics that were never done by IT? How do data governance and master data management fit in?

Business Intelligence Defined

I define BI as follows:

Business Intelligence:
The use of information to improve business performance
- Chris Adamson

The first thing to note about this definition is that it does not address any specific technologies or methods. These aspects change over time, and they certainly influence what we may be able to achieve. But the objective is always to provide business value.

Secondly, note that this definition is not beholden to the boundaries of a departmental structure. Regardless of who develops, supports or uses solutions, it's all considered BI.

Let's take a quick look at both these aspects.

BI Services and Activities

The reason we commit resources to BI programs is simple: we intend to use information to deliver some kind of business value. The definition has been crafted to cover any activities that support this objective. It can be used to describe a variety of activities that provide business value, both old and new.

Among the older activities it covers:

Traditional reporting, OLAP and ad hoc functions
Dashboards and scorecards
Traditional data warehouses and/or data marts
Data integration services

At the same time, some newer uses of information are covered:

Business analytics and predictive analytics
Master data management
Data governance
Virtualization and federation services

The definition also covers activities that some people think of as on "the other side of the fence" from BI:

Transaction processing

That's intentional; transaction processing manufactures much of the "raw material" that BI programs attempt to leverage. When we plan an operational solution, we should be thinking about these downstream uses.

BI and the Org Chart

While you may have a group responsible for BI program management, it is important to understand that the scope of BI reaches well beyond this group. The delivery of business benefit from information impacts the entire organization.

Some of the functional areas that participate in BI are:

Business units All of the value from BI happens within business areas that use information. This is where decisions are made and impacts are realized. For many businesses, responsibility for development of BI solutions also lies in business areas. This is particularly the case for analytics, but also increasingly for the traditional forms of BI.
BI Competency Centers Whether part of IT or external to it, many organizations have established a centralized resource for planning and overseeing the development of traditional forms of BI, such as data marts, dashboards or scorecards. In some cases, these centers have become focused on providing advisory services to business units that create and manage their own solutions.
Analytic Competency Centers Business analytics often begins within business areas such as marketing or risk management. Analytic competency centers are developed to help other areas of the business leverage information in a similar manner. Whether part of the BI competency center or distinct from it, this is also a core BI function.
IT At a minimum, IT has some responsibility for the technical infrastructure on top of which information systems are built -- networks, computers and the services that keep them up and running. IT may also have responsibility for some of the business applications and data management solutions.

Regardless of how your organization structure divvies up these responsibilities, BI is the sum total of these activities, and not the domain of a particular group or department. A business strategy to create value through information cuts across many departments. It cannot be planned or executed in isolation.

The Future of BI

We're not far from an age where BI is not a separate part of our information architecture. We're not there yet, but several trends have us on this path:

Focus on the future value and re-use of data managed by operational applications
Commitment to data governance
Maturation of master data management solutions
Technological advances in data management and information access

When we finally arrive at a unified information architecture, the definition of BI will still hold. We will be closer to delivering on its promise than ever before.

And, without a doubt, we will have come up with ways of using information to deliver value that have not even been thought of today.

Document Information Requirements Graphically With BDM Diagrams

2014-02-10T12:33:00.001-05:00

BI teams often struggle to keep the business engaged, especially during requirements analysis. This post looks at a graphical technique for documenting information requirements -- one that business people will read and respond to.

Keeping the business engaged is one of the keys to a successful BI program. One technique I have found to be very helpful on this front is Laura Reeves's Business Dimensional Model (BDM).

The BDM is a technique for documenting information requirements. Before I explain the BDM, a few words on the requirements themselves.

Information Requirements

Before you can design a dimensional model, you need to capture the business requirements that it will support. The most successful projects capture business requirements by working directly with people in the business, often through interviews or requirements sessions.

In my book, I suggest that you organize your information requirements by business function. You then state them in simple form: as a group of metrics and their associated dimensionality.

For example, a set of interviews about taking orders might boil down to a requirements statement such as:

Order Information by order date, order line, salesperson, customer and product.

The metrics that comprise the group are then fully documented. For example, "Order Information" is further supported with documentation of:

Order dollars
Order quantity
Cost dollars
Gross margin dollars
Gross margin rate
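To make the metric group concrete, here is a minimal Python sketch of "Order Information" summarized by any subset of its dimensions. The sample rows and the function name `order_information` are hypothetical. The formulas are the usual definitions (gross margin dollars = order dollars minus cost dollars; gross margin rate = gross margin dollars divided by order dollars), assumed here rather than quoted from the requirements.

```python
from collections import defaultdict

# Hypothetical order lines; customers, products and amounts are illustrative.
ORDER_LINES = [
    {"customer": "Acme", "product": "Widget",
     "order_quantity": 10, "order_dollars": 200.0, "cost_dollars": 120.0},
    {"customer": "Acme", "product": "Gadget",
     "order_quantity": 1, "order_dollars": 100.0, "cost_dollars": 90.0},
    {"customer": "Zenith", "product": "Widget",
     "order_quantity": 5, "order_dollars": 100.0, "cost_dollars": 60.0},
]

# Metrics that can simply be summed across rows.
ADDITIVE = ("order_quantity", "order_dollars", "cost_dollars")


def order_information(rows, *dimensions):
    """Summarize the metric group at the level of detail named by `dimensions`."""
    sums = defaultdict(lambda: dict.fromkeys(ADDITIVE, 0.0))
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        for metric in ADDITIVE:
            sums[key][metric] += row[metric]

    result = {}
    for key, s in sums.items():
        margin = s["order_dollars"] - s["cost_dollars"]
        # The rate is a ratio, so it is derived after summing its components.
        rate = margin / s["order_dollars"] if s["order_dollars"] else None
        result[key] = {**s, "gross_margin_dollars": margin,
                       "gross_margin_rate": rate}
    return result


by_customer = order_information(ORDER_LINES, "customer")
overall = order_information(ORDER_LINES)[()]
```

Documenting the rate alongside its components matters because it does not roll up by averaging. In this sample the two customer-level rates average to 0.35, while the rate computed from the overall totals is 0.325.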

Relevant hierarchies in the dimensions should also be specified. For example, "Product" might be described as:

All Products → Category → Brand → Product

Finally, the major dimensions are cross-referenced to the metric groups in a conformance matrix.
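As a rough sketch of that cross-reference, the Python below derives a conformance matrix from the dimensionality of each metric group. "Order Information" follows the requirements statement above. "Shipment Information" and its dimensions are hypothetical, added only so the matrix has a second row, and the function names are likewise made up.

```python
# Dimensionality of each metric group. The second group is hypothetical.
METRIC_GROUPS = {
    "Order Information": ["order date", "order line", "salesperson",
                          "customer", "product"],
    "Shipment Information": ["shipment date", "customer", "product",
                             "shipper"],
}


def conformance_matrix(groups):
    """Cross-reference every dimension against every metric group."""
    dimensions = sorted({d for dims in groups.values() for d in dims})
    matrix = {name: {d: d in dims for d in dimensions}
              for name, dims in groups.items()}
    return dimensions, matrix


def render(dimensions, matrix):
    """Lay the matrix out as plain text, one row per metric group."""
    width = max(len(name) for name in matrix)
    lines = [" " * width + " | " + " | ".join(dimensions)]
    for name, row in matrix.items():
        cells = [("x" if row[d] else "").center(len(d)) for d in dimensions]
        lines.append(name.ljust(width) + " | " + " | ".join(cells))
    return "\n".join(lines)


dimensions, matrix = conformance_matrix(METRIC_GROUPS)
print(render(dimensions, matrix))

# Dimensions referenced by every metric group.
shared = [d for d in dimensions if all(row[d] for row in matrix.values())]
```

The columns marked for more than one metric group are the dimensions that need a single, shared definition.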

These information requirements then drive solution modeling. The next step is to develop a top level dimensional model, and then a detailed database design.

(For more on developing and documenting requirements, including a fully fleshed out example, see my book -- it's listed at the end of this post.)

Getting People to Read It

When it comes to information requirements, you must ensure that the business stakeholders review and respond. (Better still is to involve the business in the identification and documentation process.)

In the book A Manager's Guide to Data Warehousing, Laura Reeves provides a graphical technique that helps keep the business's attention. She calls it the "Business Dimensional Model (BDM)."

This technique integrates nicely with the approach I've outlined above.

Each group of metrics is depicted in a simple diagram, with the metric group in the center and the major dimensions arrayed around it in circles.

For example, the Order Information metric group above might be documented thusly:

Within each circle, the underlined text identifies a dimension. Beneath the dimension, the level of detail applicable in the metric group is listed.

Additional illustrations document the dimension hierarchies. For example, the product dimension from the picture above might be documented like this:

The most detailed level of the dimension is shaded darkly. The arrows indicate hierarchies, going from summarized to detailed. Elements that will drive Type 2 slow changes have a shadow. Separate symbols (not shown) are used for junk dimensions, other derived elements, and future attributes.

People Like Pictures

I've found that using BDM diagrams dramatically increases the participation of business stakeholders. People look at BDM diagrams, understand them, and react to them -- often with great enthusiasm. That's a powerful aid in refining and validating your requirements.

These diagrams are also easy to produce using the built in drawing tools that come with basic productivity software. This means you can often get business stakeholders to participate in their creation. For example, the pictures above were created in Microsoft PowerPoint using basic shapes and Smart Shapes.

Lastly, the ability to produce these diagrams using basic productivity software means they are easy to incorporate in the best format for this kind of documentation: the presentation. I find the presentation format is far more likely to be reviewed than a word processing document. (More on this topic in a future post.)

Further Reading

As I said back in 2009, I am a big fan of Laura Reeves's approach to requirements and design. As you can see, there is a natural affinity between the BDM and the techniques I've talked about in the past. I encourage readers to check out her book (see below).

More info about requirements and documentation can be found on this blog. Have a look at these posts:

I first mentioned Laura's book on this blog in this post from 2009: Recommended Books on the Data Warehouse Lifecycle (July 27, 2009)
For an explanation of the three levels of a dimensional model (Requirements, Top Level Design and Detailed Design) see Dimensional Modelers Do Not Focus on Logical vs. Physical (July 5, 2011)
For an example of a conformance matrix, see The Conformance Matrix (June 5, 2012)
For an explanation of the type 2 slow change, see For Slowly Changing Dimensions, Change is Relative (October 9, 2007)

You can read more about the process of identifying information requirements in these books:

For a full explanation of the BDM, see Laura Reeves's A Manager's Guide to Data Warehousing (Wiley, 2009). The BDM is covered in Chapter 7, "Modeling The Data For Your Business"

The examples in this post are drawn from my book, Star Schema: The Complete Reference (McGraw-Hill, 2010). A more fleshed out explanation of tasks and deliverables, with examples, can be found in Chapter 18, "How To Design and Document a Dimensional Model." The examples from this post come from Figure 18-4 (which in turn builds on the star in Figure 3-3, and the hierarchies in Figure 7-3).

You can help support this blog by using the links above to purchase these books from Amazon.com.

[Edited 2/13/14 - Corrected the links, thank you for the emails.]

Facebook's Ken Rudin on Analytics

2013-11-14T10:19:00.000-05:00

If you are interested in how business analytics impact your BI program, carve out forty-five minutes of time to watch Ken Rudin's recent TDWI keynote: "Big Data, Bigger Impact." The video is embedded below.

Rudin is the director of analytics at Facebook. In his presentation, he discusses several topics that are of interest to readers of this blog. Among them:

Big data technology should be used to extend your traditional BI solution, not replace it. Facebook has realized this, and is working to bring in relational technology to answer traditional business questions.

Successful analytics programs bring together centrally managed core data metrics with a variety of data that is not centrally managed. Rudin shares different ways he has been able to make this happen.

A similar balance can be attained with your organizational structure. Use of "embedded analysts" provides the business benefits of decentralization, while maintaining the efficiencies and scale advantages of a centralized program.

These are just a few of the points made during his talk. If you don't have the time to watch it now, bookmark this page for later.

You'll also want to check out Wayne Eckerson's latest book, Secrets of Analytical Leaders. (Details below.)

Big Data, Bigger Impact

Ken Rudin

TDWI World Conference, Chicago 5/6/2013

Recommended Reading

Wayne Eckerson's excellent book, Secrets of Analytical Leaders, features more insights from Ken Rudin and others.

I highly recommend this book if you are interested in analytics.

Get it from Amazon.com in paperback or Kindle editions.

Optimizing warehouse data for business analytics

2013-09-25T12:57:00.002-04:00

Business analytics often integrate information from your data warehouse with other sources of data. This post looks at the best practices of warehouse design that make this possible.

I receive a lot of questions regarding the best way to structure warehouse data to support an analytics program. The answer is simple: follow the same best practices you've already learned.

I'll cover these practices from a dimensional modeling perspective. Keep in mind that they apply in any data warehouse, including those modeled in third normal form.

1. Store Granular Facts

Analytic modelers often choose sources external to the data warehouse, even when the warehouse seems to contain relevant data. The number one reason for this is insufficient detail. The warehouse contains summarized data; the analytic model requires detail.

In this situation, the analytic modeler has no choice but to look elsewhere. Worse, she may be forced to build redundant processes to transform source data and compile history. Luckily, this is not a failure of warehouse design principles; it's a failure to follow standard best practices.

Best practices of dimensional design dictate that we set the grain of base fact tables at the lowest level of detail possible. Need a daily summary of sales? Store the individual order lines. Asked to track the cost of trips? Store detail about each leg.

Dimensional solutions can contain summarized data. This takes the form of cubes, aggregates, or derived schemas. But these summaries should be derived exclusively from detailed data that also lives in the warehouse.
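To make the point concrete, here is a minimal sketch in Python. The order lines, column names, and the summarize_by_day helper are all made up for illustration; the only idea taken from the discussion above is that the daily summary is computed from the detail, never stored in place of it.

```python
from collections import defaultdict
from datetime import date

# Hypothetical order lines: the grain is one row per order line.
order_lines = [
    {"order_id": 1001, "line": 1, "day": date(2013, 9, 2), "product": "A", "amount": 40.0},
    {"order_id": 1001, "line": 2, "day": date(2013, 9, 2), "product": "B", "amount": 15.0},
    {"order_id": 1002, "line": 1, "day": date(2013, 9, 2), "product": "A", "amount": 20.0},
    {"order_id": 1003, "line": 1, "day": date(2013, 9, 3), "product": "B", "amount": 30.0},
]


def summarize_by_day(lines):
    """Derive a daily sales summary from the detailed order lines."""
    totals = defaultdict(float)
    for row in lines:
        totals[row["day"]] += row["amount"]
    return dict(totals)


# The summary is a by-product of the detail and can be rebuilt at any time.
daily_sales = summarize_by_day(order_lines)
```

The reverse is not possible: given only daily_sales, there is no way to recover the individual order lines an analytic model might need.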

Like all rules, this rule has exceptions. There are times when the cost/benefit calculus is such that it doesn't make sense to house highly granular data indefinitely. But more often than not, summary data is stored simply because basic best practices were not followed.

2. Build “Wide” Dimensions

The more attributes there are in your reference data (aka dimensions), the more useful source material there is for analytic discovery. So build dimensions that are full of attributes, as many as you can find.

If the grain of your fact table gives the analytics team “observations” to work on, the dimensions give them “variables.” And the more variables there are, the better the odds of finding useful associations, correlations, or influences.

Luckily, this too is already a best practice. Unfortunately, it is one that is often misunderstood and violated. Misguided modelers frequently break things down into the essential pieces only, or model just to specific requirements.

3. Track Changes to Reference Data (and Use Effective Dating)

When reference data changes, too many dimensional models default to updating corresponding dimensions, because it is easier.

For example, suppose your company re-brands a product. It's still the same product, but with a new name. You may be tempted to simply update the reference data in your data warehouse. This is easier than tracking changes. It may even seem to make business sense, because 90% of your reports require this-year-versus-last comparison by product name.

Unfortunately, some very important analysis may require understanding how consumer behavior correlates with the product name. You've lost this in your data set. Best practices help avoid these problems.

Dimensional models should track the change history of reference data. In dimensional speak, this means application of type 2 slow changes as a rule. This preserves the historic context of every fact recorded in the fact table.

In addition, every row in a dimension table should track "effective" and "expiration" dates, as well as a flag that identifies rows that are current. This enables the delivery of type 1 behavior (the current value) even as we store type 2 behavior. From an analytic perspective, it also enables useful "what if" analysis.
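A rough sketch of how this can work, using the product re-branding example. Everything here is an assumption made for illustration: the column names, the high-date convention for rows that have not expired, and the choice to treat the expiration date as exclusive (some designs expire a row on the day before the change instead).

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # assumed convention for "not yet expired"

# Hypothetical product dimension. product_key is the surrogate key;
# sku is the natural key carried over from the source system.
product_dim = [
    {"product_key": 1, "sku": "P-100", "product_name": "Sparkle Clean",
     "effective_date": date(2010, 1, 1), "expiration_date": HIGH_DATE,
     "is_current": True},
]


def apply_type2_change(dim, sku, new_name, change_date):
    """Expire the current row for the sku and append a new current row."""
    current = next(r for r in dim if r["sku"] == sku and r["is_current"])
    current["expiration_date"] = change_date
    current["is_current"] = False
    dim.append({
        "product_key": max(r["product_key"] for r in dim) + 1,
        "sku": sku, "product_name": new_name,
        "effective_date": change_date, "expiration_date": HIGH_DATE,
        "is_current": True,
    })


def current_name(dim, sku):
    """Type 1 behavior: the value as it stands today."""
    return next(r["product_name"] for r in dim
                if r["sku"] == sku and r["is_current"])


def name_as_of(dim, sku, as_of):
    """Type 2 behavior: the value that was in effect on a given date."""
    return next(r["product_name"] for r in dim
                if r["sku"] == sku
                and r["effective_date"] <= as_of < r["expiration_date"])


# The product is re-branded on June 1, 2013.
apply_type2_change(product_dim, "P-100", "Sparkle Fresh", date(2013, 6, 1))
```

After the change, current_name(product_dim, "P-100") gives the new name for this-year-versus-last reporting, while name_as_of(product_dim, "P-100", date(2012, 3, 15)) still returns the name the product carried at that time.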

As with all rules, again there are exceptions. In some cases, there may be good reason not to respond to changes in reference data by tracking history. But more often than not, type 1 responses are chosen for the wrong reason: because they are easier to implement.

4. Record Identifying Information, Including Alternate Identifiers

Good dimensional models allow us to trace back to the original source data. To do this, include transaction identifiers (real or manufactured) in fact tables, and maintain identifiers from source systems in dimension tables (these are called "natural keys").

Some of this is just plain necessary in order to get a dimensional schema loaded. For example, if we are tracking changes to a product name in a dimension, we may have multiple rows for a given product. The product's identifier is not a unique identifier, but we must have access to it. If we don't, it would become impossible to load a fact into the fact table.

Identifying information is also essential for business analytics. Data from the warehouse is likely to be combined with data that comes from other places. These identifiers are the connectors that allow analytic modelers to do this. Without them, it may become necessary to bypass the warehouse.

Your analytic efforts, however, may require blending new data with your enterprise data. And that new data may not come with handy identifiers. You have a better chance blending it with enterprise data if your warehouse also includes alternate identifiers, which can be used to do matching. Include things like phone numbers, email addresses, geographic coordinates—anything that will give the analytics effort a fighting chance of linking up data sources.
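As an illustration, here is a small sketch that links external data to a customer dimension through an email address. The tables, the column names, and the normalize_email rule are hypothetical; real matching usually needs more forgiving logic than an exact comparison.

```python
def normalize_email(value):
    """Lowercase and trim so trivially different spellings still match."""
    return value.strip().lower()


# Hypothetical customer dimension rows that carry an alternate identifier.
customer_dim = [
    {"customer_key": 11, "customer_id": "C-001", "email": "pat@example.com"},
    {"customer_key": 12, "customer_id": "C-002", "email": "lee@example.com"},
]

# Hypothetical external data that arrives without warehouse identifiers.
survey_responses = [
    {"email": " Pat@Example.com ", "satisfaction": 4},
    {"email": "unknown@example.com", "satisfaction": 2},
]

# Index the dimension by the alternate identifier.
email_to_key = {normalize_email(c["email"]): c["customer_key"] for c in customer_dim}

matched, unmatched = [], []
for response in survey_responses:
    key = email_to_key.get(normalize_email(response["email"]))
    if key is None:
        unmatched.append(response)
    else:
        matched.append({"customer_key": key, "satisfaction": response["satisfaction"]})
```

Rows that land in unmatched are the ones that cannot be tied back to enterprise data. The more alternate identifiers the dimension carries, the shorter that list is likely to be.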

Summary

If you've been following the best practices of dimensional modeling, you've produced an asset that maximizes value for analytic modelers:

You have granular, detailed event data.
You have rich, detailed reference data.
You are tracking and time-stamping changes to reference data.
You've got transaction identifiers, business keys, and alternate identifiers.

It also goes without saying that conformed dimensions are crucial if you hope to sustain a program of business analytics.

Of course, there are other considerations that may cause an analytic modeler to turn her back on the data warehouse. Latency issues, for example, may steer her to operational solutions. Accessibility and procedural issues, too, may get in the way of the analytic process.

But from a database design perspective, the message is simple: follow those best practices!

Further Reading

You can also read more in prior posts. For example:

Rule 1: State Your Grain (December 9, 2009) covers the fundamentals of grain
Build High Resolution Stars (January 7, 2011) discusses the importance of setting grain at the lowest level possible
For Slowly Changing Dimensions, Change is Relative (October 9, 2007) covers type 1 vs type 2 processing and surrogate keys vs. natural keys.
Responding to Star Schema Detractors with Timestamps (March 12, 2008) covers the use of effective and expiration dates with Type 2 slow changes.
Do I Really Need Surrogate Keys (May 20, 2009) covers the fundamentals of business keys vs. warehouse keys
Creating transaction identifiers for fact tables (October 17, 2011) covers real and manufactured transaction identifiers

You can also read more in my book, Star Schema: The Complete Reference. If you use the links on this page to pick up a copy on Amazon, you will be helping support this blog.

It covers the best practices of dimensional design in depth. For example:

Grain, identifiers, keys and basic slow change techniques are covered in Chapter 3, "Stars and Cubes"
The place of summary data is covered in Chapter 14, "Derived Schemas" and Chapter 15, "Aggregates"
Conformance is covered in Chapter 5, "Conformed Dimensions"
Advanced slow change techniques are explored in Chapter 8, "More Slow Change Techniques"

Business Analytics and Dimensional Data

2013-07-17T16:03:00.001-04:00

Readers of this blog frequently ask about the relationship of business analytics to the dimensional data that is recorded in data marts and the data warehouse.

Business analytics operate on data that often does not come from the data warehouse. The value of business analytics, however, is measured by its impact on business metrics that are tracked in the data warehouse.

Business analytics may also help adjust our notion of which metrics matter the most.

The Data Warehouse and Dimensional Data

The dimensional model is the focal point of business information in the data warehouse. It describes how we track business activities and measure business performance. It may also be the foundation for a performance management program that links metrics to business goals.

Dimensional data is the definitive record of what matters to the business about activities and status. Clearly defined performance indicators (facts) are recorded consistently and cross referenced with standardized and conformed reference data (dimensions).

In this post, when I talk about "the data warehouse," I will have this dimensional data in mind.

Business Analytics

Business analytics seek to provide new insight into business activities. Analytics do not always operate on business metrics, and they don't rely exclusively on information from the data warehouse. Dimensional information may be an input, but other sources of data are also drawn upon.

The outputs of business analytics, however, aim directly at the metrics tracked by our dimensional models. Insights from analytics are used by people to move key metrics in the desired directions. These results are called impacts.

Business analytics may also help in another way. Sometimes, they help us determine which metrics are actually the most important.

A great illustration of these dynamics can be found in the business of Major League Baseball. (If you don't follow baseball, don't worry. You don't have to understand baseball to follow this example.)

Metrics in Baseball

Major league baseball has long been in the business of measurement. Followers of the game are familiar with the "box score" that summarizes each game, "standings" that illustrate the relative performance of teams, and numerous statistics that describe the performance of each player.

These metrics have precise definitions and have been recorded consistently for almost 150 years.¹ Like the metrics in your data warehouse, they are tracked systematically. Professional baseball teams can also set goals for these metrics and compare them to results, much like a scorecard in your performance management program.

How does one improve these results? If you run a baseball team, part of the answer lies in how you choose players. In the book Moneyball² Michael Lewis describes how the Oakland Athletics used a set of techniques known as sabermetrics³ to make smarter choices about which players to add to their roster.

These analytics allowed the A's to make smarter choices with measurable impact--improving performance and reducing costs. Analytics also motivated the A's to change the emphasis given to various metrics.

Business Analytics and the Oakland Athletics

The traditional approach to selecting players was focused on long-held conventional wisdom about what makes a valuable player. For example, offensive value was generally held to derive from the ability to make contact with the baseball and from a player's speed. These skills are at least partially evident in some of the standard baseball metrics -- things like the batting average, stolen bases, runs batted in and sacrifices.

The Oakland A's looked to data to refine their notion of what a valuable player looks like. How do the things players do actually contribute to a win or loss? To do this, the A's went beyond the box scores and statistics -- beyond the data warehouse, so to speak.

By studying every action that is a part of the game -- what players are on base, what kind of pitches are thrown, where the ball lands when it is hit, etc -- the A's realized they could be smarter about assessing how a player adds value. These business analytics led to several useful conclusions:

Batting averages don't tell the whole story about a player's ability to get on base; for example, they exclude walks.
Stolen bases don't always contribute to scoring; much depends on who comes to bat next.
Runs batted in tell as much about who hits before a player as they do about the player himself.
Sacrifices, where an out is recorded but a runner advances, were found to contribute less to the outcome of a game than conventional wisdom held.

You may or may not understand these conclusions, but here is the important thing: the analytics suggested that the A's could better assess a player's impact on winning games by turning away from conventional wisdom. Contact and speed are not the best predictors of winning games. "Patience at the plate" leads to better outcomes.

Impact for the A's

By using these insights to make choices, the A's were able to select less expensive players who could make a more significant contribution to team results. These choices resulted in measurable improvement in many of the standard metrics of baseball--the win/loss ratio in particular. These insights also enabled them to deliver improved financial results.

Analytics also helped the A's in another way: they refined exactly which metrics they should be tracking. For example, in assessing offensive value, on-base percentage should be emphasized over batting average. They also created some of their own metrics to track their performance over time.
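For readers who want to see the difference between the two metrics, here is a short sketch using the standard definitions of batting average and on-base percentage. The two stat lines are invented for illustration and do not describe real players.

```python
def batting_average(hits, at_bats):
    """Hits divided by at-bats. Walks do not appear anywhere in this metric."""
    return hits / at_bats


def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """Times on base divided by the plate appearances that count toward OBP."""
    return (hits + walks + hit_by_pitch) / (
        at_bats + walks + hit_by_pitch + sacrifice_flies
    )


# Two made-up stat lines: one player makes more contact, the other walks more.
contact_hitter = {"hits": 150, "walks": 20, "hit_by_pitch": 0,
                  "at_bats": 500, "sacrifice_flies": 0}
patient_hitter = {"hits": 130, "walks": 90, "hit_by_pitch": 0,
                  "at_bats": 500, "sacrifice_flies": 0}

for label, p in (("contact", contact_hitter), ("patient", patient_hitter)):
    avg = batting_average(p["hits"], p["at_bats"])
    obp = on_base_percentage(**p)
    print(f"{label}: AVG {avg:.3f}  OBP {obp:.3f}")
```

The contact hitter has the higher batting average, but the patient hitter reaches base more often. Which metric you emphasize changes which player looks more valuable.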

The Impact of Analytics

Business analytics tell us what to look for, what works, or what might happen. Examples are signs of impending churn, what makes a web site "sticky", patterns that might indicate fraud, and so forth.

These insights, in turn, are applied in making business decisions. These choices provide valuable impact that can be measured by tracking traditional business metrics. Examples include increased retention rates, reduced costs associated with fraud, and so forth.

These impacts are the desired outcome of the analytic program. If the analytics don't have a demonstrable impact on metrics, they are not providing value.

Business analytics can also help us revise our notion of what to track in our data warehouses, or which metrics to pay closest attention to. Number of calls to the support center, for example, may be less of an indicator of customer satisfaction than the average time to resolve an issue.

Conclusion

As you expand the scope of your BI program to include analytics, remember that your desired outcome is a positive impact on results. Move the needle on business metrics, and the analytics have done their job.

Thanks to my colleague Mark Peco, for suggesting that I use Moneyball as a way to explain analytics without revealing the proprietary insights attained by my customers.

Notes

[1] The box score and many of these statistics were established in the mid-1800s by a sports writer named Henry Chadwick.

[2] Moneyball by Michael Lewis (Norton, 2011).

[3] The Oakland A's are a high-profile example of the use of sabermetrics, but did not originate the concept. See Wikipedia for more information.

Related Posts

See also these posts:

In the Era of Big Data, The Dimensional Model is Essential (June 5, 2013)
The Role of the Dimensional Model in Your BI Program (April 30, 2013)

In the Era of Big Data, The Dimensional Model is Essential

2013-06-05T12:26:00.001-04:00

Don't let the hype around big data lead you to believe your BI program is obsolete.

I receive a lot of questions about "big data." Here is one:

We have been doing data warehousing using Kimball method and dimensional modeling for several years and are very successful (thanks for your 3 books, btw). However, these days we hear a lot about Big Data Analytics, and people say that Big Data is the future trend of BI, and that it will replace data warehousing, etc.

Personally I don't believe that Big Data is going to replace Data Warehousing but I guess that it may still bring certain value to BI. I'm wondering if you could share some thoughts.

"Big data" is the never-ending quest to expand the ways in which our BI programs deliver business value.

As we expand the scope of what we deliver to the business, we must be able to tie our discoveries back to business metrics and measure the impact of our decisions. The dimensional model is the glue that allows us to achieve this.

Unless you plan to stop measuring your business, the dimensional model will remain essential to your BI program. The data warehouse remains relevant as a means to instantiate the information that supports this model. Reports of its death have been greatly exaggerated.

Big Data

"Big Data" is usually defined as a set of data management challenges known as "the three V's" -- volume, velocity and variety. These challenges are not new. Doug Laney first wrote about the three V's in 2001 -- twelve years ago.¹ And even before that, we were dealing with these problems.

Consider the first edition of The Data Warehouse Toolkit, published by Ralph Kimball in 1996.² For many readers, his "grocery store" example provided their first exposure to the star schema. This schema captured aggregated data! The 21 GB fact table was a daily summary of sales, not a detailed record of point-of-sale transactions. Such a data set was presumably too large at the time.

That's volume, the first V, circa 1996.

In the same era, we were also dealing with velocity and variety. Many organizations were moving from monthly, weekly or daily batch loads to real-time or near-real time loads. Some were also working to establish linkages between dimensional data and information stored in document repositories.

New business questions

As technology evolves, we are able to address an ever expanding set of business questions.

Today, it is not unreasonable to expect the grocery store's data warehouse to have a record for every product that moves across the checkout scanner, measured in terabytes rather than gigabytes. With this level of detail, market basket analysis is possible, along with longitudinal study of customer behavior.

But of course, the grocery store is now looking beyond sales to new analytic possibilities. These include tracking the movement of product through the supply and distribution process, capturing interaction behavior of on-line shoppers, and studying consumer sentiment.

We still measure our businesses

What does this mean for the dimensional model? As I've posted before, a dimensional model represents how we measure the business. That's not something we're going to stop doing. Traditional business questions remain relevant, and the information that supports them is the core of our BI solution.

At the same time, we need to be able to link this information to other types of data. For a variety of reasons (V-V-V), some of this information may not be stored in a relational format, and some may not be a part of the data warehouse.

Making sense of all this data requires placing it in the context of our business objectives and activities.

To do this, we must continue to understand and capture business metrics, record transaction identifiers, integrate around conformed dimensions, and maintain associated business keys. These are long established best practices of dimensional modeling.

By applying these dimensional techniques, we can (1) link insights from our analytics to business objectives and (2) measure the impact of resultant business decisions. If we don't do this, our big data analytics become a modern-day equivalent of the stove-pipe data mart.

The data warehouse

The function of the data warehouse is to instantiate the data that supports measurement of the business. The dimensional model can be used toward this aim (think: star schema, cube.)

The dimensional model also has other functions. It is used to express information requirements, to guide program scope, and to communicate with the business. Technology may eventually get us to a point where we can jettison the data warehouse on an enterprise scale,³ but these other functions will remain essential. In fact, their importance becomes elevated.

In any architecture that moves away from physically integrated data, we need a framework that allows us to bring that data together with semantic consistency. This is one of the key functions of the dimensional model.

The dimensional model is the glue that is used to assemble business information from distributed data.

Organizations that leverage a bus architecture already understand this. They routinely bring together information from separate physical data marts, a process supported by the dimensional principle of conformance. Wholesale elimination of the data warehouse takes things one step further.
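Here is a minimal sketch of that process, which is sometimes called drilling across. The two fact tables, their column names, and the total_by helper are hypothetical. The important assumption is that both data marts use the same conformed product dimension, so a product value means the same thing in each.

```python
from collections import defaultdict

# Hypothetical rows from two separate data marts that share a
# conformed product dimension.
orders_fact = [
    {"product": "Widget", "order_dollars": 500.0},
    {"product": "Widget", "order_dollars": 250.0},
    {"product": "Gadget", "order_dollars": 300.0},
]
returns_fact = [
    {"product": "Widget", "return_dollars": 75.0},
]


def total_by(rows, dimension, fact):
    """Step 1: aggregate each star separately, at a common level of detail."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row[fact]
    return totals


orders = total_by(orders_fact, "product", "order_dollars")
returns = total_by(returns_fact, "product", "return_dollars")

# Step 2: merge the two result sets on the shared dimension value.
# This behaves like a full outer join, so products with no returns are kept.
report = {
    product: {
        "order_dollars": orders.get(product, 0.0),
        "return_dollars": returns.get(product, 0.0),
    }
    for product in sorted(set(orders) | set(returns))
}
```

The merge in step 2 only works because the dimension conforms. If each mart spelled or keyed its products differently, there would be nothing reliable to join on.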

Notes

[1] Doug Laney's first published treatment of "The Three V's" can be found on his blog.
[2] Now out of print, this discussion appeared in Chapter 2, "The Grocery Store." Insight into the big data challenges of 1996 can be found in Chapter 17, "The Future."
[3] I think we are a long time away from being able to do this on an enterprise scale. When we do get there, it will be as much due to master data management as it is due to big data or virtualization technologies. I'll discuss virtualization in some future posts.

More reading

Previous posts have dealt with this topic.

The Role of the Dimensional Model in Your BI Program (4/30/2013) details the four ways we use the dimensional model. Only one of these functions involves a database.

In Big Data and Dimensional Modeling (4/20/2012) you can see me discuss the impact of new technologies on the data warehouse and the importance of the dimensional model.

The Role of the Dimensional Model in Your BI Program

2013-04-30T12:05:00.000-04:00

The dimensional model delivers value long before a database is designed or built, and even when no data is ever stored dimensionally. While it is best known as a basis for database design, its other roles may have more important impacts on your BI program.

The dimensional model plays four key roles in Business Intelligence:

The dimensional model is the ideal way to define requirements, because it describes how the business is measured
The dimensional model is ideal for managing scope because it communicates to business people (functionality) and technical people (complexity)
The dimensional model is ideal as a basis for data mart design because it provides ease of use and high performance
The dimensional model is ideal as a semantic layer because it communicates in business terms

Information Requirements

The dimensional model is best understood as an information model, rather than data model. It describes business activities the same way people do: as a system of measurement. This makes it the ideal form to express information needs, regardless of how information will be stored.

A dimensional model defines business metrics or performance indicators in detail, and captures the attendant dimensional context. (For a refresher, see the post What is a Dimensional Model from 4/27/2010.) Metrics are grouped based on shared granularity, cross referenced to shared reference data, and traced to data sources.

This representation is valuable because business questions are constantly changing. If you simply state them, you produce a model with limited shelf life. If you model answers to the question of today, you've provided perishable goods.

A dimensional model establishes information requirements that endure, even as questions change. It provides a strong foundation for multiple facets of BI:

Performance management, including dashboards and scorecards
Analytic processing, including OLAP and ad hoc analysis
Reporting, including both enterprise and operational reports
Advanced analytics, including business analytics, data mining and predictive analytics

All these disciplines center on business metrics. It should be no surprise that when Howard Dresner coined the term Business Intelligence, his definition referenced "facts and fact based systems." It's all about measurement.

Program Roadmap and Project Scope

A dimensional model can be used to describe scope because it communicates to two important audiences.

Business people: functionality The dimensional model describes the measurement of a business process, reflecting how the process is evaluated by participants and observers. It communicates business capability.

Technical personnel: level of effort A dimensional model has technical implications: it determines the data sources that must be integrated, how information must be integrated and cleansed, and how queries or reports can be built. In this respect, it communicates level of effort.

These dual perspectives make the dimensional design an ideal centerpiece for managing the roadmap for your BI program. Fully documented and mapped to data sources, a dimensional model can be divided into projects and prioritized. It is a blueprint that can be understood by all interested parties. A simple conformance matrix communicates both intended functionality and technical level of effort for each project.
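For readers who have not seen one, a conformance matrix can be sketched in a few lines. The stars and dimensions below are invented for illustration; in a real program they would come from the documented dimensional design.

```python
# Hypothetical stars and the dimensions each one references.
stars = {
    "order_facts":    ["day", "product", "customer", "salesperson"],
    "shipment_facts": ["day", "product", "customer", "warehouse"],
    "return_facts":   ["day", "product", "customer"],
}


def conformance_matrix(stars):
    """Return (dimension columns, rows), with True where a star uses a dimension."""
    dimensions = sorted({d for dims in stars.values() for d in dims})
    rows = {name: [d in dims for d in dimensions] for name, dims in stars.items()}
    return dimensions, rows


def render(dimensions, rows):
    """Format the matrix as plain text, one row per star."""
    width = max(len(name) for name in rows)
    lines = [" " * width + "  " + "  ".join(dimensions)]
    for name, flags in rows.items():
        cells = ["X".center(len(d)) if used else " " * len(d)
                 for d, used in zip(dimensions, flags)]
        lines.append(name.ljust(width) + "  " + "  ".join(cells))
    return "\n".join(lines)


dimensions, rows = conformance_matrix(stars)
print(render(dimensions, rows))
```

Each row is a candidate project, and each column is a dimension that must be built once and shared. Reading across a row shows what a project delivers; reading down a column shows where work can be reused.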

At the project level, a dimensional design can be used as the basis for progress reporting. It can also serve as an unambiguous arbiter of change requests. Changes that add data sources or impact grain, for example, are considered out of scope. This is particularly useful for organizations that employ iterative methodologies, but its simplicity makes it easy to reconcile with any development methodology.

Database Design

The dimensional model is best known as the basis for database design. The term "star schema" is far more widely recognized than "dimensional model" (a fact that influenced the name of my most recent book).

In fact, the dimensional model is the de facto standard for data mart design, and many organizations use it to shape the entire data warehouse. It has an important place in W.H. Inmon's Corporate Information Factory, Ralph Kimball's Dimensional Bus architecture, and even in one-off data marts that lack an enterprise focus.

Implemented in a relational database, the dimensional model becomes known as a star schema or snowflake. Implemented in a multidimensional database, it is known as a cube. These implementations offer numerous benefits. They are:

Easily understandable by business people
Extraordinarily flexible from a reporting and analysis perspective
Adaptable to change
Capable of very high performance

Presentation and the Semantic Layer

A dimensional representation is the ideal way to present information to business people, regardless of how it is actually stored. It reflects how people think about the business, so it is used to organize the catalog of items they can call on for analysis.

Many business intelligence tools are architected around this concept, allowing a semantic layer to sit between the user and database tables. The elements with which people can frame questions are categorized as facts and dimensions. One need not know what physical data structures lie beneath.
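The idea can be sketched in a few lines. This is a toy illustration, not the way any particular BI tool works: the SEMANTIC_LAYER mapping, the table and column names, and the build_query function are all hypothetical.

```python
# Hypothetical semantic layer: business names map to physical SQL expressions.
SEMANTIC_LAYER = {
    "facts": {
        "Order Dollars": "SUM(f.order_dollars)",
    },
    "dimensions": {
        "Product Name": "p.product_name",
        "Month": "d.month_name",
    },
    "from": (
        "order_facts f "
        "JOIN product p ON p.product_key = f.product_key "
        "JOIN day d ON d.day_key = f.day_key"
    ),
}


def build_query(fact_names, dimension_names, layer=SEMANTIC_LAYER):
    """Translate a request phrased in business terms into SQL text."""
    dim_exprs = [layer["dimensions"][name] for name in dimension_names]
    select_items = [f'{layer["dimensions"][n]} AS "{n}"' for n in dimension_names]
    select_items += [f'{layer["facts"][n]} AS "{n}"' for n in fact_names]
    sql = f'SELECT {", ".join(select_items)} FROM {layer["from"]}'
    if dim_exprs:
        # Facts are aggregated to the level of the chosen dimensions.
        sql += f' GROUP BY {", ".join(dim_exprs)}'
    return sql


# The user asks for "Order Dollars by Product Name" and never sees a table name.
sql = build_query(["Order Dollars"], ["Product Name"])
print(sql)
```

The person framing the question works only with the names on the left side of the mapping. The physical structures on the right could be a star schema, operational tables, or something else entirely.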

Even the earliest incarnations of the semantic layer leveraged this notion. Many organizations used these tools to impose a dimensional view directly on top of operational data. Today, semantic layers are commonly linked to dimensional data marts.

A dimensional representation of business activity is the starting point for a variety of BI activities:

Building enterprise reports
Defining performance dashboards
Performing ad hoc analysis
Preparing data for an analytic model

The concept of dimensional presentation is receiving renewed attention as federated solutions promise the construction of virtual solutions rather than physical ones.

Further information

I briefly covered these four roles in an interview last year:

Big Data and Dimensional Modeling (4/20/2012)

Many of these themes have been discussed previously:

Three Data Warehouse Architectures that Use Star Schema (3/26/2007)
Drive Warehouse Strategy with a Dimensional Model (7/6/2007)
What is a Dimensional Model? (4/27/2010)
The Conformance Matrix (6/5/2012)

Although I've touched on these topics before, I wanted to bring them together in a single article. In the coming months, I will refer back to these concepts as I address common questions about big data, agile BI and federation.

In the meantime, please help support this blog by picking up a copy of my latest book.

Where To Put Dates

2013-03-28T09:38:00.002-04:00

A reader is trying to decide if certain dates should be modeled as dimensions of a fact table or as attributes of a dimension table.

I have two attributes that I'm really not sure where is the best place to place: 'Account Open Date' and 'Account Close Date'. In my model, I have [Dim Accounts] as a dimension and [F transact] as a fact table containing accounts transactions. An account can have many transactions, so the dates have different cardinality than the transactions.
I thought to put the dates in the Accounts dimension, but this led to problems: difficulties in calculations related to those dates--like if I want to get the transactions of the accounts that opened in the 4th quarter of 2012, or to get the difference between the date of last transaction and the account opening date, and so on. In other words I can't benefit from the Date dimension and the hierarchies it contains.
So I thought about placing those dates in the fact table, but what made me hesitate is that the granularity of those dates is higher than the fact table, so there will be a lot of redundancy.
- Ahmad
Bethlehem, Palestine

This is a common dilemma. Many of our most important dimensions come with a number of possible dates that describe them.

Ahmad is thinking about this problem in the right way: how will my choice affect my ability to study the facts?

It turns out that (1) this is not an either/or question, and (2) granularity is not an issue.

Dates that Describe Important Dimensions

The dates are clearly useful dimension attributes. I suggest that you keep them in the dimension in one of two ways, which I will discuss in a moment.

First, though, let's look at what happens if the dates are only represented as foreign keys in the fact table:

If the dates are not stored in the dimension, the open and close dates are only associated with the Account dimension through the fact table. The fact table only has records when transactions occur. So it becomes harder to find a list of open accounts, or to find the set of accounts that were active as of a particular date.

An additional factless fact table may help here, but it is far simpler to store the dates in the dimension.

Date as Attribute vs Outrigger

If you plan to represent the dates in your dimension table, you have two choices. You can model the dates themselves as attributes, or you can model a pair of day keys in your account dimension. Either approach is acceptable.

The first option does not expose the richness of your full day dimension for analytic usage, but it may be simpler to use for many business questions. Other questions (like your quarterly example) will require a bit more technical knowledge, but most BI tools help with this.

The second option transforms your star into a (partial) snowflake. The day dimension becomes known as an "outrigger" when it connects to your account dimension. This allows you to explicitly leverage all the attributes of your Day dimension. The cost is some extra joins, which may be confusing and may also disrupt star-join optimization.
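Here is what the outrigger might look like, again as a sketch using Python's sqlite3 module. The day table, its calendar_quarter and calendar_year columns, and the open_day_key column in account are hypothetical names chosen for this example. The quarterly question becomes a simple filter on Day attributes, at the price of one more join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE day (
    day_key          INTEGER PRIMARY KEY,
    full_date        TEXT,
    calendar_quarter TEXT,
    calendar_year    INTEGER
);
CREATE TABLE account (
    account_key  INTEGER PRIMARY KEY,
    account_id   TEXT,
    open_day_key INTEGER REFERENCES day (day_key)  -- the outrigger link
);
CREATE TABLE transaction_facts (
    account_key INTEGER,
    day_key     INTEGER,   -- day of the transaction
    amount      REAL
);
INSERT INTO day VALUES
    (1, '2012-06-30', 'Q2', 2012),
    (2, '2012-10-15', 'Q4', 2012),
    (3, '2012-12-01', 'Q4', 2012),
    (4, '2013-01-05', 'Q1', 2013),
    (5, '2013-01-07', 'Q1', 2013),
    (6, '2013-01-09', 'Q1', 2013),
    (7, '2013-02-11', 'Q1', 2013);
INSERT INTO account VALUES
    (1, 'A-100', 2),
    (2, 'A-200', 1),
    (3, 'A-300', 3);
INSERT INTO transaction_facts VALUES
    (1, 4, 50.0),
    (1, 6, 25.0),
    (2, 5, 10.0),
    (3, 7, 40.0);
""")

# Fact -> account -> day (as outrigger): filter on Day attributes directly.
outrigger_totals = conn.execute("""
    SELECT a.account_id, SUM(f.amount)
    FROM transaction_facts f
    JOIN account a      ON a.account_key = f.account_key
    JOIN day open_day   ON open_day.day_key = a.open_day_key
    WHERE open_day.calendar_year = 2012
      AND open_day.calendar_quarter = 'Q4'
    GROUP BY a.account_id
    ORDER BY a.account_id
""").fetchall()
print(outrigger_totals)
```

The alias open_day matters: the same Day table can also be joined to the fact table for the transaction date, so each role needs its own name in the query.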

Making the correct choice here involves balancing several perspectives:

The business view and usability
The capabilities of your BI software front end
The capabilities of your DBMS back end software

Day Keys in the Fact Table

Having said all that, it is also useful to represent at least one of these dates in the fact table. The account open date may be a good dimensional perspective for the analysis of facts.

As you observed, this date has different cardinality than the transactions. The account open date for an account remains constant, even if it has dozens of transactions in your fact table. But the fact that it has low cardinality should not stop you from choosing it as a major dimension in your star!

Your account transaction fact table may have a pair of day keys -- one for the date the account was opened, and one for the date of the transaction.
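A minimal sketch of such a fact table follows, using Python's sqlite3 module. The column names open_day_key and transaction_day_key, the Day attributes, and the sample rows are assumptions for illustration. The Day dimension is joined twice, once per role, and the facts can be grouped by the quarter in which the account was opened without touching the account table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE day (
    day_key          INTEGER PRIMARY KEY,
    full_date        TEXT,
    calendar_quarter TEXT,
    calendar_year    INTEGER
);
CREATE TABLE transaction_facts (
    account_key         INTEGER,
    open_day_key        INTEGER,  -- same value on every row for an account
    transaction_day_key INTEGER,
    amount              REAL
);
INSERT INTO day VALUES
    (1, '2012-06-30', 'Q2', 2012),
    (2, '2012-10-15', 'Q4', 2012),
    (3, '2013-01-05', 'Q1', 2013),
    (4, '2013-01-09', 'Q1', 2013);
INSERT INTO transaction_facts VALUES
    (1, 2, 3, 50.0),
    (1, 2, 4, 25.0),
    (2, 1, 3, 10.0);
""")

# One Day table, two roles: the open date and the transaction date.
by_open_quarter = conn.execute("""
    SELECT open_day.calendar_year,
           open_day.calendar_quarter,
           SUM(f.amount)
    FROM transaction_facts f
    JOIN day open_day ON open_day.day_key = f.open_day_key
    JOIN day txn_day  ON txn_day.day_key  = f.transaction_day_key
    WHERE txn_day.calendar_year = 2013
    GROUP BY open_day.calendar_year, open_day.calendar_quarter
    ORDER BY open_day.calendar_year, open_day.calendar_quarter
""").fetchall()
print(by_open_quarter)
```

The repeated open_day_key values are the redundancy Ahmad was worried about. They cost one integer column per row and buy direct access to every attribute of Day.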

If you choose to do this, the account dimension itself should include the open date. The outrigger solution is not necessary since your fact table has full access to the Day dimension.

Note that I do not recommend this for your account closed date, because that date changes. Storing it as a key for every transaction against an account would require a lot of updates to fact table rows once the account becomes closed.

More Information

I've touched on this topic in the past. In particular, see this post:

Dates in Dimension Tables (5/12/2011)

Although I edited it out of Ahmad's question, he also cited an issue surrounding the use of NULL for accounts that do not have a closed date. On that topic, see this recent post:

Avoid NULL in Dimensions (1/7/2013)

Support this Blog

I maintain this blog in my spare time. If you find it helpful, you can help support it by picking up a copy of my book: Star Schema: The Complete Reference.

Use the links on this blog to get a copy of this or any of the other recommended books, and you will be helping to keep this effort going.

Learn Dimensional Modeling With Chris in 2013

2013-03-06T11:49:00.000-05:00

Several courses in Dimensional Modeling are already scheduled for 2013:

Berlin, Germany (March 13-15)
London, UK (March 18-20)
Washington DC (April 23-25)
Sydney, Australia (June 24-25)
Melbourne, Australia (June 27-28)
Minneapolis, MN (October 8-10)

Chris is also scheduled to teach several TDWI courses at these and other events. For full details, check the sidebar of this blog. (If using a reader, you will need to click through.)

These don't work for you? Check back from time to time. As new cities and dates are added, they will appear on the sidebar.

Optional Relationships Without NULL

2013-02-13T10:03:00.001-05:00

Optional relationships are important in dimensional models. This post shows you how to support them without resorting to NULL keys in the fact table.

Last month, we looked at the impact of allowing dimension attributes to contain NULL. In this post, we'll look at the impact of allowing foreign keys in fact tables to contain NULL.

Once again, NULL will prove problematic. What should be simple queries will require an alternate join syntax, multiple comparisons and nested parentheses.

The preferred solution is to establish special-case rows in dimensions. These rows can be referenced by fact table rows that do not have corresponding dimension detail.

Recap on NULL

NULL is a special SQL keyword used to denote the absence of data.

Last month, I explained why we avoid allowing dimension attributes to contain NULL. NULL fails standard comparisons, necessitating query predicates containing numerous tests which are carefully balanced within sets of parentheses.

For the full story, and the preferred solution, see last month's post: Avoid NULL in Dimensions (1/7/2013).

But that was NULL dimension attributes. What about NULL foreign keys?

Optional Relationships and NULL

Sometimes, the relationship between a fact table and a dimension is optional. This means some rows in the fact table cannot be associated with the dimension.

In an ER model, the traditional solution is to store NULL foreign keys for such rows. Let's take a look at what would happen if we did that in a dimensional model.

You may have noticed that in some stores, the cashier asks you if a salesperson helped you. If so, they record that info. So some sales have a salesperson, some do not.

With an optional relationship to Salesrep, your star schema might look like this:

The dotted line represents an optional relationship. (In other notations, optionality is represented by including circles at the ends of relationship lines.) For fact table rows with no salesperson, salesrep_key contains NULL.

Usability Harmed by NULL Foreign Keys

When a foreign key can contain NULL, we once again face difficulties when answering some simple business questions. As before, NULL complicates queries because it requires a comparison syntax that is different from the syntax for standard values. This time, we'll also be facing different join syntax.

For example, using the sales star, you might like to see all sales where a manager was not involved. Assuming the Salesrep table has a column called salesrep_type, you would be forgiven for adding this to your query:

WHERE salesrep.salesrep_type != 'Manager'

This predicate is not sufficient to find all sales without managerial involvement.

Assuming a standard join is linking sales_facts to salesrep, rows with no salesrep will not appear in the query results. This happens because, for any fact without a salesrep_key, the join to salesrep will fail. An outer join must be used to help facts with no salesreps survive the join.

Even when an outer join is employed, the above constraint remains insufficient. That's because a side effect of the outer join is to create NULL salesreps in the data set. For those rows salesrep_type is NULL, and NULL fails the comparison to 'Manager', so they are still filtered out.

In addition to an outer join, we must supplement the constraint above:

WHERE
( salesrep.salesrep_type != 'Manager' OR
salesrep.salesrep_type IS NULL
) AND...

NULL keys force us to choose the correct join type, perform multiple comparisons against the same dimension attribute, and carefully balance parentheses.
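The three attempts described above can be reproduced with a small experiment. This is a sketch using Python's sqlite3 module; the sale_id and sales_amount columns and the sample rows are hypothetical, added only so the results can be compared.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE salesrep (
    salesrep_key  INTEGER PRIMARY KEY,
    salesrep_type TEXT,
    salesrep_name TEXT
);
CREATE TABLE sales_facts (
    sale_id      INTEGER,   -- hypothetical identifier for this example
    salesrep_key INTEGER,   -- NULL when no salesperson was involved
    sales_amount REAL
);
INSERT INTO salesrep VALUES
    (100, 'Associate', 'Paul Cook'),
    (201, 'Manager',   'Glen Matlock');
INSERT INTO sales_facts VALUES
    (1, 100,  10.0),
    (2, 201,  20.0),
    (3, NULL, 30.0);
""")

def sale_ids(sql):
    """Run a query and return the first column of each row as a list."""
    return [row[0] for row in conn.execute(sql)]

# Attempt 1: standard (inner) join. Sale 3 is lost in the join.
inner_join = sale_ids("""
    SELECT f.sale_id
    FROM sales_facts f
    JOIN salesrep s ON s.salesrep_key = f.salesrep_key
    WHERE s.salesrep_type != 'Manager'
    ORDER BY f.sale_id
""")

# Attempt 2: outer join. Sale 3 survives the join, but NULL != 'Manager'
# is not true, so the WHERE clause discards it anyway.
outer_join_only = sale_ids("""
    SELECT f.sale_id
    FROM sales_facts f
    LEFT OUTER JOIN salesrep s ON s.salesrep_key = f.salesrep_key
    WHERE s.salesrep_type != 'Manager'
    ORDER BY f.sale_id
""")

# Attempt 3: outer join plus an explicit NULL test. Sale 3 finally appears.
outer_join_with_null_test = sale_ids("""
    SELECT f.sale_id
    FROM sales_facts f
    LEFT OUTER JOIN salesrep s ON s.salesrep_key = f.salesrep_key
    WHERE ( s.salesrep_type != 'Manager' OR
            s.salesrep_type IS NULL )
    ORDER BY f.sale_id
""")

print(inner_join, outer_join_only, outer_join_with_null_test)
```

Only the third query returns the sale with no salesrep, and it takes both the outer join and the extra comparison to get there.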

A dimensional model is meant to be understandable and usable from a business perspective. NULL keys do not fit the bill.

Use a Special Case Row

When there is an optional relationship between fact and dimension, best practices call for a special row in the dimension. This row is referenced by facts that would otherwise require a NULL foreign key.

For example, we add a "not applicable" row to our salesrep table like so:

salesrep_key	row_type	salesrep_type	salesrep_name
0	No Salesrep	n/a	n/a
100	Salesrep	Associate	Paul Cook
101	Salesrep	Associate	Steve Jones
201	Salesrep	Manager	Glen Matlock

Now we don't need outer joins, and we don't need to bend over backwards to perform simple comparisons.
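Here is the same kind of experiment with the special-case row in place, sketched with Python's sqlite3 module. The salesrep rows mirror the table above; the sales_facts rows, including the sale_id and sales_amount columns, are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE salesrep (
    salesrep_key  INTEGER PRIMARY KEY,
    row_type      TEXT,
    salesrep_type TEXT,
    salesrep_name TEXT
);
CREATE TABLE sales_facts (
    sale_id      INTEGER,            -- hypothetical identifier
    salesrep_key INTEGER NOT NULL,   -- every fact points at some row
    sales_amount REAL
);
INSERT INTO salesrep VALUES
    (0,   'No Salesrep', 'n/a',       'n/a'),
    (100, 'Salesrep',    'Associate', 'Paul Cook'),
    (101, 'Salesrep',    'Associate', 'Steve Jones'),
    (201, 'Salesrep',    'Manager',   'Glen Matlock');
INSERT INTO sales_facts VALUES
    (1, 100, 10.0),
    (2, 201, 20.0),
    (3, 0,   30.0);   -- no salesperson: references the special-case row
""")

# A standard join and a single comparison are now sufficient.
non_manager_sales = [
    row[0]
    for row in conn.execute("""
        SELECT f.sale_id
        FROM sales_facts f
        JOIN salesrep s ON s.salesrep_key = f.salesrep_key
        WHERE s.salesrep_type != 'Manager'
        ORDER BY f.sale_id
    """)
]
print(non_manager_sales)
```

Declaring salesrep_key as NOT NULL is a design choice in this sketch, not something the technique requires, but it lets the database reject facts that skip the special-case row.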

Further Reading

The technique described in this post can be extended to handle other situations (invalid data, future events, or reference data that becomes available after facts).

Read more about these possibilities in Chapter 6 of my book, Star Schema: The Complete Reference.

Also check out the previous post, Avoid NULL in Dimensions (1/7/2013).

Edited 2/13/13 5:30pm to correct mismatched table headings. Thanks for the emails.