The return on investment in EDW architecture can be phenomenal. But first we must resolve an architectural paradox—fulfilling requirements that are never fully known at build time. This category focuses on architecture strategy, methodology, and related modeling techniques.
Business analytics are the distillation of “big data” that is actually used to make decisions. Surprisingly, business analytics are rarely included in the data models used to build the EDW. More often, BI designers model analytics in the BI tool metadata, assuming that the EDW data model and physical structure will support those analytics. Surprises occur at this point! This project suggests best practices and tools for integrating business analytics into the EDW data modeling process to create better BI solutions.
What is a Business Analytics Data Model?
BI solutions typically involve many thousands of data objects. BI dimensional data modeling is often tasked to capture all “potentially useful” source elements for the target subject area. This largely source-driven approach can lead to a “boil the ocean” effort that delays the deployment of useful BI reporting. To enable a more direct path to BI results, business analytic objects are integrated with the BI dimensional model. The process is a convergence of source/target-driven and analytics-driven modeling.
A Business Analytic Data Model is an enriched form of the BI dimensional data model that includes analytic data objects. Analytic data objects directly influence manual or mechanized decision making, are predictive as well as historical in nature, and can be derived from structured and non-structured data.
Business analytics are the linchpins that unite business requirements analysis with data modeling. Data requirements emerge by decomposing the business analytics. Modeling from business analytics to the required data reveals gaps or structural problems in the data model that might be missed by traditional source-to-target BI dimensional data modeling. Each approach complements the other in getting to the best solution in the shortest time.
What are the Goals of a Business Analytics Data Model?
Business analytics modeling can help solve three major BI/EDW challenges:
Aligning business and IT stakeholders in the process of developing BI/EDW solutions
Ensuring that the EDW data model will support the business reporting requirements
Establishing a framework for overall business performance management
On the surface, these challenges read like general requirements that would apply to any BI/EDW initiative. But why are they so incredibly difficult to achieve? And how can a business analytics model help? Let’s address those questions one at a time.
Aligning Business and IT Stakeholders
IT and business stakeholders want the same thing in the end: a successful BI solution that benefits the organization and all involved. But they approach the journey from different perspectives.
There is a critical point where the business and IT journeys converge, and that is when the business analytics are formulated.
BUSINESS VIEW: Business analytics measure the optimization opportunity. How well are we doing? How big is the prize? Analytics suggest or trigger enabling actions to achieve the opportunity.
IT VIEW: Each business analytic is the nucleus of a mini data model. Collectively, these mini models stretch into all corners of the dimensional model, and back to the data sources.
If we can get the IT and business collaboration right during this critical stage, the probability of a successful BI/EDW result will be improved. Future posts will present some business analytics modeling methods, and pitfalls to avoid, gleaned from many BI projects undertaken by Brightlight and yours truly.
Ensuring that the EDW data model SUPPORTS business analytics reporting
Establishing a framework for business performance management
I would like your insights on business analytics modeling challenges you have encountered at your companies. Please use the comment form at the end of each post. I will respond and try to incorporate your ideas into the demonstration project.
Business hierarchies are the “crown jewels” of an analytic-driven organization. They encode the analytic framework for marketing, product and sales management, and competitive benchmarking. This project will walk you through the steps to create a high-performing EDW model for dynamic, flexible business hierarchies. The project will provide technical artifacts for a Netezza DW appliance solution, but the modeling concepts are applicable to other EDW platforms. I will post the progress of this project on this site, and I invite your suggestions and stories from your own experiences. You can reply using the commenting dialog at the end of each post.
In this project I will lay out a modeling approach and a prototype for hierarchies that work the way you want, and the way Chelsea in finance wants, and Joe in marketing wants, and your merchandising director and suppliers want. The result is a compendium of approaches I have used at various clients in several industries over the last six years, born out of a lot of listening and technical experimentation, and tempered with published research from DW gurus like Ralph Kimball and Joe Celko. I’m certain to have violated some rules along the way, but my view is that things that actually get used leave a trace of broken rules in pursuit of the greater good.
First, the requirements. Below is a list of “We needs” that come up time and again when discussing hierarchy requirements with business folks:
We need to report the same sales data by different roll-up hierarchies
We need hierarchies that can report summary results consistently across business units.
We need hierarchies that support a variable number of levels (aka, “ragged hierarchies”)
We need to link related hierarchies–for example, sales organization hierarchy linked to product hierarchy
We need to track hierarchy changes so we can report based on the hierarchy structure today, or at any historical date.
We need to evaluate the impact of planned hierarchy changes simultaneously with current reporting.
We need to integrate external product or customer data with our internal hierarchies.
We need to update hierarchies with known changes in advance of the delayed feeds from external data suppliers.
We need to navigate the hierarchy in various ways depending on the business question.
We need to dynamically aggregate facts by hierarchy level or node.
We need to track multiple occurrences of attributes such as addresses, industry codes, etc.
We need to handle hierarchy nodes with more than one parent (e.g., a store served by two distribution centers).
We don’t want to update the fact table(s) when a hierarchy changes.
We don’t want to rely on “meaningful identifiers” that become outdated and lead to confusion.
We need to build in health checks that alert us to inconsistencies in the hierarchies that could distort reporting.
We need a standardized approach to building hierarchies that is repeatable and easy to maintain.
Importantly, we need to make the user interface simple and intuitive.
Whew! That is a tall order. When hierarchies are inflexible or difficult to change, it is nearly impossible to address more than a few of these needs. Many organizations learn to live with inflexible hierarchies. We must change that paradigm. The value of the business nimbleness that flexible hierarchies enable will far exceed the investment to create them.
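To make one of these needs concrete, here is a minimal sketch of the "health checks" item. It assumes the hierarchy is available as a simple child-to-parent mapping. The function name `check_hierarchy` and the sample node names are hypothetical, and a real check would run against the dimension tables instead of a Python dictionary.

```python
def check_hierarchy(parent_of):
    """Return (node, problem) tuples for a child -> parent mapping.

    Hypothetical helper for illustration; roots have a parent of None.
    """
    problems = []
    for node, parent in parent_of.items():
        # Orphan: the parent is referenced but never defined as a node.
        if parent is not None and parent not in parent_of:
            problems.append((node, "orphan"))
            continue
        # Cycle: walking up from the node revisits a node before reaching a root.
        seen = {node}
        current = parent
        while current is not None and current in parent_of:
            if current in seen:
                problems.append((node, "cycle"))
                break
            seen.add(current)
            current = parent_of[current]
    return problems


hierarchy = {
    "All Products": None,
    "Beverages": "All Products",
    "Cola": "Beverages",
    "Chips": "Snacks",  # "Snacks" is never defined, so "Chips" is an orphan
    "A": "B",           # A and B point at each other, which is a cycle
    "B": "A",
}

problems = check_hierarchy(hierarchy)
print(problems)
```

Either kind of inconsistency would distort a roll-up, so a check like this is a natural candidate to run after every hierarchy load.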
Look for upcoming posts describing a conceptual architecture for flexible hierarchies, and demonstrating an industry use case. In the meantime, please send me hierarchy challenges you have seen in your own organizations, and I’ll reply and/or incorporate them into the demo. Use the commenting box below.
Ragged Hierarchies are a common BI reporting structure in which the hierarchy contains a variable number of levels. Examples are an organization hierarchy, or a financial chart of accounts. Ragged hierarchies are cumbersome to navigate using standard SQL. But there are ways to simplify the navigation without compromising performance.
It is useful to employ a “tree” metaphor to visualize the modeling challenges of dimensional hierarchies. A “tree” is a collection of reporting hierarchies such as may exist in a product dimension or an organization dimension. A tree consists of “branches” that represent the levels of a hierarchy. Each branch contains one or more “nodes”. A node can be a “leaf” or another branch. A leaf is the most granular node of a hierarchy, and is referred to as a “leaf node”. Each leaf node contains the values of the various attributes of the dimension. The set of all nodes emanating from a given branch node is referred to as a “sub-tree”. A single “hierarchy” is the set of connected nodes between a given node and the trunk. The “trunk” represents the collection of all hierarchies in the dimensional tree.
A “ragged hierarchy” exists when the number of levels varies between hierarchies. Figure 1 depicts an Account dimension with ragged hierarchies. Three hierarchies are highlighted including a single, double, and triple level hierarchy.
Hierarchies are often modeled using “parent keys” in the dimension table. A dimension exhibits recursive hierarchies when there can be two or more levels of parent-child relationships in the same table. Dynamically navigating recursive hierarchies with a BI tool like Cognos or Business Objects is expensive. Each level requires a self-join. There are two basic types of dimension recursive hierarchies.
Fixed Recursive Hierarchies
If the number of levels in the hierarchy is fixed, then the dimension table contains fixed recursive hierarchies. An example is a standard product or merchandise hierarchy (e.g. family > class > commodity > category > item). Because the number of levels is known, the dimension can be modeled with a parent key column for each level of the hierarchy. However, this does not necessarily eliminate the need for recursive joins at query time.
Dynamic Recursive Hierarchies
If the number of levels in the hierarchy is indeterminate, then you have a ragged recursive hierarchy. Project and organization hierarchies are typically ragged. For example, a project may have one or more sub-projects, each of which could have its own sub-projects.
Left outer joins are required to navigate each level of a recursive hierarchy. If the dimensions are large and there are many levels in the hierarchy, performance could become an issue. It is not always possible to optimize recursive joins because the parent-key column functions as both a parent and a child. You cannot co-locate the parent and child rows on the same Netezza SPU. In databases like Oracle that use indexes, separately indexing the primary key and the parent key may offer some performance relief, but performance will still degrade as the number of levels (and therefore joins) increases.
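To show what "one self-join per level" looks like in practice, here is a small sketch using Python's built-in `sqlite3` module as a stand-in database. The `dim_project` table, its columns, and the sample rows are hypothetical. The same join pattern would apply on Netezza or Oracle, though the performance characteristics discussed above obviously would not be reproduced by an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_project (
    project_key INTEGER PRIMARY KEY,
    name        TEXT,
    parent_key  INTEGER          -- points back into the same table
);
INSERT INTO dim_project VALUES
  (1, 'Program X',      NULL),
  (2, 'Project A',      1),
  (3, 'Sub-project A1', 2),
  (4, 'Project B',      1);
""")

# One LEFT OUTER self-join per level above the starting row.
# A deeper hierarchy needs another join for every additional level.
rows = conn.execute("""
SELECT c.name, p1.name AS parent, p2.name AS grandparent
FROM dim_project c
LEFT OUTER JOIN dim_project p1 ON p1.project_key = c.parent_key
LEFT OUTER JOIN dim_project p2 ON p2.project_key = p1.parent_key
ORDER BY c.project_key
""").fetchall()

for row in rows:
    print(row)
```

Because the hierarchy is ragged, shallow rows come back padded with NULLs, and the query author has to guess how many joins are enough.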
Hierarchy Navigation Tables
A ragged hierarchy is simplified by creating a hierarchy navigation table in the database. Navigation tables can be directly queried by the BI tool without expensive recursive joins. Navigation tables can be optimized because the normal table joins are predictable, few in number, can be inner joined, and can be co-located with the main dimension table. There are two basic types of hierarchy navigation tables.
Vertical Navigation Tables
Vertical navigation tables contain a row for every level of every hierarchy. Thus, the leaf node is repeated for each level represented in its hierarchy. If a user wanted to roll up costs to a mid-level node in the project hierarchy, they would simply select that node, and the navigation table would make a single join to the fact table, regardless of the number of levels in the hierarchy. Vertical navigation tables are useful when the user does not know the level of the hierarchy they are querying.
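Here is a minimal sketch of how a vertical navigation table could be derived from parent keys, and how a roll-up then needs only one join. It is plain Python, not database DDL, and it makes two assumptions of my own: every node (not only leaves) is paired with itself and with each of its ancestors, and the hierarchy contains no cycles. All names and amounts are hypothetical.

```python
parent_of = {
    "Program X": None,
    "Project A": "Program X",
    "Sub-project A1": "Project A",
    "Sub-project A2": "Project A",
    "Project B": "Program X",
}


def build_vertical_nav(parent_of):
    """Return (ancestor, descendant, levels_apart) rows, including self-pairs."""
    rows = []
    for node in parent_of:
        ancestor, distance = node, 0
        while ancestor is not None:
            rows.append((ancestor, node, distance))
            ancestor = parent_of[ancestor]
            distance += 1
    return rows


nav = build_vertical_nav(parent_of)

# Facts are recorded against the most granular nodes.
costs = [("Sub-project A1", 100), ("Sub-project A2", 250), ("Project B", 400)]


def rollup(node):
    """Total cost under a node: a single join between navigation rows and facts."""
    descendants = {d for a, d, _ in nav if a == node}
    return sum(amount for fact_node, amount in costs if fact_node in descendants)


print(len(nav), rollup("Project A"), rollup("Program X"))
```

The user picks any node, at any level, and the same single-join query shape answers the question.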
Horizontal Navigation Tables
Horizontal (i.e., flattened) navigation tables contain one row per hierarchy. The levels of the hierarchy are represented as columns. Thus, a horizontal navigation table will have fewer rows than a vertical navigation table for the same dimension. Horizontal navigation tables are useful when the user is selecting a specific level, or needs to display the parent path levels in a query or report.
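A flattened table can be sketched the same way. The example below produces one row per leaf, with the path from the top of the tree spread across level columns. Padding short paths with `None` is my own assumption (some designs repeat the lowest node instead), and the names are hypothetical.

```python
parent_of = {
    "Program X": None,
    "Project A": "Program X",
    "Sub-project A1": "Project A",
    "Sub-project A2": "Project A",
    "Project B": "Program X",
}


def build_horizontal_nav(parent_of):
    """Return one tuple per leaf: (level_1, level_2, ..., level_n), padded with None."""
    parents = {p for p in parent_of.values() if p is not None}
    leaves = [node for node in parent_of if node not in parents]
    paths = []
    for leaf in leaves:
        path, node = [], leaf
        while node is not None:      # walk up to the top of the tree
            path.append(node)
            node = parent_of[node]
        path.reverse()               # top-down order: level_1 first
        paths.append(path)
    width = max(len(path) for path in paths)
    return [tuple(path + [None] * (width - len(path))) for path in paths]


flat = build_horizontal_nav(parent_of)
for row in flat:
    print(row)
```

For this small tree the flattened form has 3 rows, one per leaf, while pairing every node with each of its ancestors yields 11.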
Ragged Hierarchy Special Requirements
There can be variations in requirements within the general theme of ragged hierarchies. As usual, the devil is in the details. Specific modeling techniques to achieve optimum performance depend on the customer’s unique requirements. Examples of special hierarchy scenarios are described below.
Multiple Parent Nodes
This scenario breaks the strict hierarchical relationship of the dimension. A node can have 2 or more parents. For example, a project can be a member of two different “programs” at the same time. This requirement presents unique roll-up challenges.
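A tiny sketch shows why roll-ups get tricky here. The programs, projects, and costs are hypothetical. The point is only that summing the parents naively counts the shared child twice.

```python
# "Project A" belongs to two programs at the same time.
membership = [
    ("Program X", "Project A"),
    ("Program Y", "Project A"),
    ("Program Y", "Project B"),
]
costs = {"Project A": 100, "Project B": 50}

program_totals = {}
for program, project in membership:
    program_totals[program] = program_totals.get(program, 0) + costs[project]

naive_grand_total = sum(program_totals.values())  # adds Project A twice
true_grand_total = sum(costs.values())

print(program_totals, naive_grand_total, true_grand_total)
```

Each program total is defensible on its own, but the grand total is not the sum of the program totals. How to reconcile the two is a business rule that has to be agreed before the model is built.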
Facts Attached to Multiple Levels
This scenario often exists in organizational hierarchies. Each organization has a budget and is responsible for the budgets of the organizations within its span of control. For example, Division A is responsible for the budgets of Departments 1, 2 and 3. Department 1 is responsible for Units 1.a and 1.b. Thus, expense roll-up queries might include costs of the Division A sub-tree including, or excluding, Division A’s own expenses.
Versioned hierarchies exist when the relationships of a hierarchy change over time, and there is a requirement to report based on the hierarchy structure in effect as of a point in time. For example, customer ownership relationships may change over time as a result of mergers and acquisitions.
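The Division A example can be sketched as follows. The expense amounts are hypothetical, and the `include_self` switch is simply one way to express the "including or excluding its own expenses" choice.

```python
parent_of = {
    "Division A": None,
    "Department 1": "Division A",
    "Department 2": "Division A",
    "Department 3": "Division A",
    "Unit 1.a": "Department 1",
    "Unit 1.b": "Department 1",
}

# Facts exist at every level, not only at the leaves.
expenses = {
    "Division A": 500,
    "Department 1": 200,
    "Department 2": 300,
    "Department 3": 100,
    "Unit 1.a": 40,
    "Unit 1.b": 60,
}


def subtree(node):
    """Return the node plus all of its descendants."""
    members = {node}
    changed = True
    while changed:
        changed = False
        for child, parent in parent_of.items():
            if parent in members and child not in members:
                members.add(child)
                changed = True
    return members


def rollup(node, include_self=True):
    """Sum expenses over the sub-tree, optionally leaving out the node's own."""
    total = sum(expenses[n] for n in subtree(node))
    return total if include_self else total - expenses[node]


print(rollup("Division A"), rollup("Division A", include_self=False))
```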
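One possible way to support as-of reporting is to put an effective date and an expiration date on each parent-child relationship. The sketch below assumes that design. The company names and dates are hypothetical.

```python
from datetime import date

# (child, parent, effective_date, expiration_date), hypothetical ownership history
ownership = [
    ("Acme Retail", "Acme Holdings", date(2010, 1, 1), date(2012, 6, 30)),
    ("Acme Retail", "MegaCorp",      date(2012, 7, 1), date(9999, 12, 31)),
]


def parent_as_of(child, as_of):
    """Return the parent in effect for the child on the given date, or None."""
    for row_child, parent, effective, expiration in ownership:
        if row_child == child and effective <= as_of <= expiration:
            return parent
    return None


print(parent_as_of("Acme Retail", date(2011, 5, 1)))
print(parent_as_of("Acme Retail", date(2013, 1, 1)))
```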
Approach to Optimizing Ragged Hierarchies
Optimizing Ragged Hierarchies is a four-step process as depicted in Figure 2.
Requirements Analysis – The first step to optimizing ragged hierarchies is to identify the business requirements. What is the general model and what are the special requirements? The requirements analysis will narrow down the optimization options.
Business Rule Modeling – The second step is to allocate business rule processing between the BI query tool, the database structures, and the ETL application. The choices will depend on the query use cases, the capabilities of the BI tool, and service level agreements (SLAs) related to query response times and available ETL batch window times.
Data Modeling – The third step involves modeling the dimensions, navigation tables, and the database tables and views required to support the loading and navigation of ragged hierarchies.
Physical Optimization – The fourth step is to optimize the physical database structures based on performance tests. The results will depend on a variety of factors including data volumes, hierarchy versioning requirements, the distribution of data, the frequency of updating the hierarchies, and the user query profiles. Physical optimization may lead to adjustments in steps 2 and 3.
In summary, ragged hierarchies are a fact of life in the ragged real world. Hopefully the techniques in this post will help you smooth things over with your users 🙂 I would love to hear your feedback on these and other approaches for handling ragged hierarchies.
Thanks to the pioneering work of Ralph Kimball, slowly changing dimensions (SCDs) have become a cornerstone of dimensional modeling. The most basic type of versioning is to simply update (overwrite) the fields that change, the so-called Type 1 SCD. It turns out that Type 1 SCDs do not work very well for many important business decisions. This is illustrated by the query “show me monthly revenue by customer and sales rep”. Sales rep assignments can change throughout the year. So directly updating the sales rep field in the customer dimension table creates a sticky problem. Why? The current sales rep would get credit for the efforts of her predecessors. Not a happy situation at commission time!
“Effective” is what the business says it is—not what the data warehouse thinks it is
The Type 2 SCD attempts to remedy this situation by creating a “current version” row in the customer dimension table when an important attribute such as sales rep changes. New sales transactions are stamped with the key of the customer dimension version currently in effect. Now sales facts can be associated with the correct salesperson to produce an accurate picture of who sold what to whom, and when. Everyone is happy with their commission checks, right? Not so fast. Type 2 SCD gets you much closer to reality, but not 100%. Consider what happens if the sales rep is updated in the customer dimension several days after the change actually occurred. You would have the transactions tagged to the wrong sales rep during the period between the “business effective date” and the “data warehouse effective date”.
Type 2 SCDs are sufficient for many business reporting scenarios. But when dealing with commission checks, almost is not good enough. Backdating (and forward-dating in some cases) is a fact of business life, and we must deal with it. The EDW must somehow align the late arriving dimension or fact changes.
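The gap between the two dates is easy to see with a small sketch. The reps, dates, and the `rep_in_effect` helper are hypothetical. I also assume a sale is loaded into the warehouse on the day it occurs, so it gets stamped with whichever version is current in the warehouse that day.

```python
from datetime import date

# Two versions of one customer. The rep change took effect in the business
# on March 1 but was not loaded into the warehouse until March 5.
customer_versions = [
    {"key": 1, "sales_rep": "Ann",
     "business_effective": date(2013, 1, 1), "loaded": date(2013, 1, 1)},
    {"key": 2, "sales_rep": "Bob",
     "business_effective": date(2013, 3, 1), "loaded": date(2013, 3, 5)},
]


def rep_in_effect(date_column, on_date):
    """Return the rep of the latest version whose chosen date is on or before on_date."""
    eligible = [v for v in customer_versions if v[date_column] <= on_date]
    return max(eligible, key=lambda v: v[date_column])["sales_rep"]


sale_date = date(2013, 3, 3)
type2_rep = rep_in_effect("loaded", sale_date)                 # what Type 2 stamps
business_rep = rep_in_effect("business_effective", sale_date)  # what the business expects

print(type2_rep, business_rep)
```

The sale on March 3 lands on the old version under Type 2, even though the business considers the new rep to have been in place since March 1.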
BizEff To The Rescue
One way to address the back-dating problem is through “BizEff” (business effective dating) dimensions. BizEff dimensions are useful for business processes that have inherent delays in processing dimension updates. BizEff reporting depends on properly formed dimensions. Typically, only a small subset of “trigger” attributes need to be effective-dated. If the dimension is wide (many columns) and deep (many rows), then segregating the effective-dated columns into a separate BizEff table may improve ETL and query performance. If the dimension is small, or the effective-dated column count is low relative to the total columns in the dimension, then the BizEff dimension should be modeled as a single table.
To illustrate, let’s assume a customer dimension with 50 columns, of which 2 need to be effective-dated:
Customer Status (e.g., prospect, new, tenure, inactive)
Sales Rep (e.g., Sandy Smith)
We will model this dimension with two tables as illustrated in Figure 1. The dim_customer dimension table contains 48 columns–the customer key and all the other columns except the effective-dated columns. The dim_customer_eff table contains 6 columns–the customer key, the effective date, the expiration date, the current row indicator, and the 2 effective-dated customer attributes.
The natural key of the base customer table is the Customer ID. The natural key of the BizEff table is the customer ID and the effective date. In both tables, the customer_id is replaced with an integer surrogate key. The date key could also be replaced with a surrogate integer key. Review Ralph Kimball’s definitive article on the rationale for surrogate keys.
The versioned BizEff rows represent contiguous and non-overlapping effectivity periods defined by the business effective and expiration dates. The effective date is provided by the source system as entered by users. The ELT application must derive the expiration dates for distinct sets of the effective-dated columns. The expiration date is one day before the effective date of the next version. The current version has a high value expiration date (e.g. 9999-12-31) as well as a TRUE value for the current row indicator. A change to any effective-dated column will trigger a new row in the BizEff table. Note that if a particular column value does not change, that value is rolled forward to the next effective-dated row.
If the number of effective-dated trigger columns is large relative to the total number of columns in the dimension, or if the base dimension is relatively small, then there is not much advantage in splitting the base and effective-dated columns into separate tables. Using a single BizEff dimension table will save a join when constructing the BizEff views. There is no fixed rule about when to split a BizEff dimension into two tables. You need to evaluate the capacity and performance implications for your particular situation.
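The expiration-date rule can be sketched in a few lines. This is an illustration only, not the actual load logic. The tuple layout, the `add_expirations` name, and the sample rows are hypothetical, and a real implementation would more likely be set-based SQL inside the database.

```python
from datetime import date, timedelta

HIGH_DATE = date(9999, 12, 31)


def add_expirations(versions):
    """Derive expiration dates and current-row flags.

    versions: (customer_key, effective_date, customer_status, sales_rep) tuples.
    Returns (customer_key, effective_date, expiration_date, is_current,
             customer_status, sales_rep) tuples.
    """
    ordered = sorted(versions, key=lambda v: (v[0], v[1]))
    result = []
    for i, (cust, effective, status, rep) in enumerate(ordered):
        nxt = ordered[i + 1] if i + 1 < len(ordered) else None
        if nxt is not None and nxt[0] == cust:
            # Expire one day before the same customer's next version starts.
            expiration, is_current = nxt[1] - timedelta(days=1), False
        else:
            # Latest version for this customer: open-ended and flagged current.
            expiration, is_current = HIGH_DATE, True
        result.append((cust, effective, expiration, is_current, status, rep))
    return result


source_rows = [
    (1, date(2013, 2, 12), "New", "Sandy Smith"),       # rep unchanged, rolled forward
    (1, date(2013, 1, 10), "Prospect", "Sandy Smith"),
]
bizeff_rows = add_expirations(source_rows)
for row in bizeff_rows:
    print(row)
```

Sorting by customer and effective date first is what guarantees the periods come out contiguous and non-overlapping.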
Now that we have a properly formed BizEff dimension, how do we query against it? The trick is creating a BizEff view that conceals complexity from the client tool. In the case of the 2-table BizEff customer dimension, the view joins three tables–the base customer dimension, the BizEff table and the date dimension (Figure 2). The BizEff view returns a row for every date for which a distinct set of customer attributes is effective. If the customer status for Suzanne Jones is Prospect from January 10 to February 12, the BizEff table would contain one row for that effective period and the BizEff view would return 33 rows.
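Here is a sketch of such a view, using `sqlite3` as a stand-in database. The view name `v_customer_bizeff`, the column names, the year 2013, and the one-year date dimension are all assumptions for illustration. I also assume the Prospect row expires on February 11, the day before the next version takes effect on February 12, which is one reading that yields the 33 rows mentioned above.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE dim_customer_eff (
    customer_key INTEGER, effective_date TEXT, expiration_date TEXT,
    is_current INTEGER, customer_status TEXT, sales_rep TEXT);
CREATE TABLE dim_date (calendar_date TEXT PRIMARY KEY);

INSERT INTO dim_customer VALUES (1, 'Suzanne Jones');
INSERT INTO dim_customer_eff VALUES
  (1, '2013-01-10', '2013-02-11', 0, 'Prospect', 'Sandy Smith'),
  (1, '2013-02-12', '9999-12-31', 1, 'New',      'Sandy Smith');

-- One output row per customer per calendar date covered by a version.
CREATE VIEW v_customer_bizeff AS
SELECT d.calendar_date, c.customer_key, c.customer_name,
       e.customer_status, e.sales_rep
FROM dim_customer c
JOIN dim_customer_eff e ON e.customer_key = c.customer_key
JOIN dim_date d
  ON d.calendar_date BETWEEN e.effective_date AND e.expiration_date;
""")

# Populate a one-year date dimension; ISO date strings sort chronologically.
day = date(2013, 1, 1)
while day <= date(2013, 12, 31):
    conn.execute("INSERT INTO dim_date VALUES (?)", (day.isoformat(),))
    day += timedelta(days=1)

prospect_rows = conn.execute(
    "SELECT COUNT(*) FROM v_customer_bizeff WHERE customer_status = 'Prospect'"
).fetchone()[0]
print(prospect_rows)
```

Note that the open-ended current row does not explode into millions of rows, because the join can only return dates that exist in the date dimension.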
The beauty of the BizEff view is that it can support both “AsOf” and “AsWas” queries.
AsOf queries group data according to the dimensional context in effect “as of” a user-supplied date. The default AsOf date is normally the current date.
AsWas queries group data according to the dimensional context in effect at the time the fact transaction occurred.
BizEff queries are useful for companies that rely on the EDW for operational reporting. This is particularly true for business processes that have inherent delays in processing dimension updates.
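The two query styles can be sketched side by side. To keep the example short I join the fact table directly to a simplified effective-dated table instead of going through a per-date view. The table layouts, reps, and amounts are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer_eff (
    customer_key INTEGER, effective_date TEXT, expiration_date TEXT, sales_rep TEXT);
CREATE TABLE fact_sales (customer_key INTEGER, sale_date TEXT, revenue INTEGER);

INSERT INTO dim_customer_eff VALUES
  (1, '2013-01-01', '2013-02-28', 'Ann'),
  (1, '2013-03-01', '9999-12-31', 'Bob');
INSERT INTO fact_sales VALUES
  (1, '2013-02-15', 100),
  (1, '2013-03-03', 250);
""")

# AsWas: each fact row picks the version in effect on its own transaction date.
as_was = conn.execute("""
SELECT e.sales_rep, SUM(f.revenue)
FROM fact_sales f
JOIN dim_customer_eff e
  ON e.customer_key = f.customer_key
 AND f.sale_date BETWEEN e.effective_date AND e.expiration_date
GROUP BY e.sales_rep
ORDER BY e.sales_rep
""").fetchall()

# AsOf: every fact row is grouped by the version in effect on one supplied date.
as_of = conn.execute("""
SELECT e.sales_rep, SUM(f.revenue)
FROM fact_sales f
JOIN dim_customer_eff e
  ON e.customer_key = f.customer_key
 AND :as_of BETWEEN e.effective_date AND e.expiration_date
GROUP BY e.sales_rep
ORDER BY e.sales_rep
""", {"as_of": "2013-04-01"}).fetchall()

print(as_was)
print(as_of)
```

AsWas splits the revenue between the two reps according to when each sale happened, while AsOf restates all of it under whoever holds the account on the chosen date.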
Best of Both Worlds — BizEff2
Just when you think you have satisfied the business users with an elegant BizEff solution, out comes a new requirement that sends you back to the drawing board. The accountants want to report how the transactions were originally recorded based on the incorrect dimension values–thank you very much Messrs. Sarbanes and Oxley! This requirement is met by including the standard Type2 SCD version key in the BizEff dimension table and in the fact table (Figure 3). The originally recorded “as was” situation can be recreated by joining the dimension table to the fact table using the Type2 version key.
Note that with a BizEff2 dimension, it is necessary to preserve the original relationship of the scd2 key to the fact table. This means that you could have multiple scd2 rows for the same BizEff effective date, or multiple effective dates for the same scd2 key. This is a more complex scenario, but it may be warranted in situations requiring maximum traceability.
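A minimal sketch of the two join paths follows. It uses a single-table BizEff2 dimension with hypothetical names (`dim_customer_bizeff2`, `scd2_key`) and covers only the simple one-to-one case, not the many-to-many situations described above. The sale is stamped with scd2 key 1 because, in this made-up scenario, the rep change had not yet been loaded when the sale arrived.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer_bizeff2 (
    scd2_key INTEGER PRIMARY KEY, customer_key INTEGER,
    effective_date TEXT, expiration_date TEXT, sales_rep TEXT);
CREATE TABLE fact_sales (
    customer_key INTEGER, scd2_key INTEGER, sale_date TEXT, revenue INTEGER);

INSERT INTO dim_customer_bizeff2 VALUES
  (1, 1, '2013-01-01', '2013-02-28', 'Ann'),
  (2, 1, '2013-03-01', '9999-12-31', 'Bob');
INSERT INTO fact_sales VALUES (1, 1, '2013-03-03', 250);
""")

# As originally recorded: equi-join on the Type 2 version key stamped on the fact.
as_recorded = conn.execute("""
SELECT d.sales_rep, f.revenue
FROM fact_sales f
JOIN dim_customer_bizeff2 d ON d.scd2_key = f.scd2_key
""").fetchall()

# Business effective: expression join on the business effective period.
business_effective = conn.execute("""
SELECT d.sales_rep, f.revenue
FROM fact_sales f
JOIN dim_customer_bizeff2 d
  ON d.customer_key = f.customer_key
 AND f.sale_date BETWEEN d.effective_date AND d.expiration_date
""").fetchall()

print(as_recorded, business_effective)
```

The same fact row answers both questions, and it is never updated. Only the join path changes.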
The performance implications of BizEff must be considered separately for ETL (back-room) and query (front-room) processing. Traditionally, the brute force way to ETL late arriving dimension changes is to update the dimension keys for all affected fact transactions. This approach works fine in data marts with relatively small fact tables, but is not very scalable for rapidly growing fact tables with hundreds of millions of rows. The BizEff approach is more efficient for ETL because it operates only on the smaller dimension tables. Even very large versioned dimension tables are small in comparison to transaction grain fact tables.
In general, BizEff queries perform somewhat slower than the standard Type2 queries because the query uses expression joins to identify the correct dimension row. Expression joins using “less than” or “greater than” operators are slower than the equi-joins used for standard Type2 queries. The performance hit for AsOf queries is relatively light because the join of the static AsOf date is performed only once at the dimension level. AsWas BizEff queries will perform somewhat slower than AsOf queries because the expression join must be applied to every fact row.
With the performance of today’s MPP data warehouses, BizEff queries are a viable alternative to the brute force fact update approach. Early performance testing will confirm the viability of BizEff for your particular situation.
So the next time your users cry out for effective dated reporting, just tell them to go BizEff!