Chapter06
Four-Step Dimensional Design Process
1) Choose the Biz Process 2) Declare teh Grain 3) Identify the Dimensions 4) Identify the Facts
More on Dimensions
* Date and Time * Degenerate Dimensions * Slowly Changing Dimensions * Role-Playing Dimensions * Junk Dimensions * Snowflaking and Outriggers * Handling Hierarchies * Many-Valued Dimensions with Bridge Tables
Fables and Falsehoods about Dimensional Modeling
* Fable 1 - Focusing on Departmental Reports * Fable 2 - Premature Summarization * Fable 3 - Overvaluing Normalization
Dimensional Modeling Requirements
* Fact Tables * Dimension Tables
More On Facts
* Three Fundamental Grains * Facts of Differing Granularity and Allocation * Multiple Currencies and Units of Measure * Factless Fact Tables * Consolidated Fact Tables
Three Fundamental Grains
* Transaction Fact Tables * Periodic Snapshot Fact Tables * Accumulating Snapshot Fact Tables
Benefits of Dimensional Modeling
* Understandability to biz users. * Query performance - denormalization reduces query complexity * Dimension tables are equivalents, making query types predictable to dev * Can handle slowly changing dimensions * Extensible with backward compatibility
Making the Case for Dimensional Modeling
* What is Dimensional Modeling? * What about Normalized Modeling? * Benefits of Dimensional Modeling
What is Dimensional Modeling?
A logical design technique for structuring data so that it's intuitive to business users and delivers fast query performance: Divides the world into: * Measurements (usually numeric facts) * Context in dimensional tables, star-joined to the facts (called ""cubes"" in OLAP)
Junk Dimensions
After modeling, there are miscellaneous flags and text attributes that cannot find an appropriate home (comments are one type of junk data). Designer now has a number of bad alternatives which should be avoided: * Leave them in the fact table * Make them into separate dimensions (preferred) * Eliminate them Study the flags and place them into one or more junk dimensions rather than add to fact table. You must consider how much overlap occurs with the various junk attributes
Factless Fact Tables
An event happens, but no quantifiable measurements are created or generated; may be important for many-to-many relationships tha thtey are related to.
Dimension Tables
Big, bulky, descriptive fields (textual). For filtering should be: Verbose, descriptive, complete (no null values) Discretely valued, Quality Assured. If codes are used instead of words, they should have imbedded meaning (Avoid this). While hierarchical, many-to-one relationships may be resolved on one table. Highly correlated clumps of attributes grouped to represent key objects of a biz, such as products, custs, employees, or facilities. These should be nowns. Usually 8-15 dimension tables is very common * Dimension Table Keys * Conformed Dimensions
Data Decomposition
Break up fields: Name = last and first; address = street num, street, city, etc.
Many-Valued Dimensions with Bridge Tables
Bridge tables for Many-to-Many relationships. These should be a last resort due to usability and performance issues.
Fable 1 - Focusing on Departmental Reports
Business needs should be the starting point (measurements). Reports will evolve over time no matter what you do. * Do not create new dimensions for each report * Do not create departmentally bound reports * You can always add new data sources to your fact and dimension tables * Multiple fact tables can be created to support different grains
Fact Tables
Contains measurements or ""numeric"" values or measurements. Large but remove much of the need to run cross-table queries * Fact Table Keys * Fact Table Granularity
Enterprise Data Warehouse Bus Arch
Critical activity, that cannot be too localized or too global. * Planning Crisis * Bus Arch * Value Chain Implications * Common Matrix Mishaps * Taking the Pledge
Bus Arch
Define a standard interface system for all iterative DW/BI components Establish a matrix of items to be integrated into the DW/BI system. These prioritized systems should fit together like a puzzle as the project goes on and helps avoid stovepipe solutions.
1) Overwrite the Dimension Attribute
Delete old and write in new. Easy to implement, difficult to do analysis on past condition due to value deleted. Requires a total update of aggregated data to keep data in sync... CONSIDER ALL OTHER TYPES BEFORE THIS ONE
Fixed Hierarchies
Different departments have many-to-one hierarchical views that may not match definitions; this may require you to maintain both hierarchies Hierarchies are often associated with drill paths in the data.
Facts of Differing Granularity and Allocation
Different granularity cannot be on the same table. Need to allocate higher-level facts to lower-level granularity by defining an allocatoin algorithm. If no agreement then you need two uniquely labeled facts. * Not always matchable * Header facts need precise names
Multiple Time Zones
Dimension table can identify fiscal year, local timezone, HQ timezone, and more
Slowly Changing Dimensions
Dimension tables that change, but slowly. Should be able to track changes and change types: 1) Overwrite teh Dimension Attribute 2) Add a New Dimension Row (or mark old) 3) Add a New Dimension Attribute Mini-Dimensions: Add a New Dimension Hybrid Slowly Changing Dimension Techniques
Taking the Pledge
Everyone uses the same master conformed dimensions. Architect, developers, and senior management (politics is critical)
Fact Table Granularity
Grain is the definition of the mearurement event that defines a fact row. "A fact row is created when _____ occurs." Grain is determined by physical realities of source. Fact tables should contain the lowest, most detailed atomic grain captured by a biz process. Atomic data sustains attacks from ad-hoc analysis and can be extended.
Date/Timestamps
Helps determine how long it takes to complete one action to the next
2) Declare teh Grain
Identify exactly what a fact table row is in the proposed biz process to determine dimensional model. Typically as atomic or finely grained as possible.
4) Identify the Facts
Identify facts and metrics that are critical to the biz process. This requires that you understand the biz needs
Common Matrix Mishaps
Matrix Row Mishaps: * Departmental or overly encompassing rows - instead of processes * Report-centric or narrowly defined rows - instead of process Matrix Column Mishaps: * Overly Generalized columns - person could be Employee or Customer * Separate columns for each level of hierarchy - or for descriptions of ranking of hierarchy in fact table (should be in a dimension table)
Consolidated Fact Tables
Merge measurements from multiple processes into a single fact table. Requires that they use the same grain. This should coexist with the existing schema, but will enhance cross-reference queries
Transaction Fact Tables
Most common and basic, one row per transaction. Very atomic based upon event activity. May need a complimentary snapshot to help users visualize before getting into the weeds Rarely updated after ETL
Multiple Currencies and Units of Measure
Most commonly measured in local currency first and will ideally only be tracked in a single currency type so that the fact table adds up. Different business units may want different units of measure (pallets, containers, units). Combine currency and measurements and each row can have complex lists of required facts. Must strike a balance of captured information vs calculated fields/conversions. Further complicated if these dimensions can slowly change over time (fewer oz of soda per bottle).
2) Add a New Dimension Row (or mark old)
Most popular and powerful (data storage not issue). Add a tag for "Current" or "Not-Current" with date-stamps.
Fact Table Keys
Mutipart key made of foreign keys from dimension tables (no null values)
Date and Time
Necessary and complex, should be guided through the dimension table through the key. * Surrogate Date Keys * Time of Day * Date/Timestamps * Multiple Time Zones
Hybrid Slowly Changing Dimension Techniques
Need to balance analytic power against ease of use (hybrids are challenging). Allows historical references, and current references. * Add a ""current"" or ""old"" value into the dimension table * Consider including a dimension natural key or durable identifier as a fact table foreign key (associates two dimension tables with teh facts)
Fable 3 - Overvaluing Normalization
Normalized data is not the ideal * Conformed dimensions depends on standardized labels, values, and definiions. Conformed dimensions are challenging but necessary * Integration of multiple systems is challenging but possible * Changes to dimensions are a challenge to any data system * Multiple data marts can be incorporated, albeit with some work if dimensions are not conformed
Degenerate Dimensions
Number from the header of a transaction control dimension, extracted and left unlinked to a dimension table with the rest of the facts from the transaction. Can be used for auditing, compliance, or researching data quality concerns. Can also be used to establish ""count per x-type transaction""
3) Identify the Dimensions
Once grain is selected, this may be fairly straightforward. Start thinking about foreign keys and whether you want: * Transaction Level * Snapshot Level * Document Transaction type (PO, Shipment, Invoice)
Value Chain Implications
Page 250. Each step in teh process can be a grain in the fact table with multiple data points. The bus matrix can ensure that dimensions are conformed for later integration. This is important because whether a business needs supply chain management or information management (insurance companies), the approach is the same for the company's value chain
Periodic Snapshot Fact Tables
Periodic pictures of biz performance according to a specific interval. Complement but do not replace transaction level because theere is little detail or dimensions Rarely updated after ETL
Accumulating Snapshot Fact Tables
Rare but worth mentioning. Does not track each transaction but takes information once a process has completed (such as the final stage of invoiceing step). Tracks multiple date foreign keys corresponding to major milestones. The information in these fact tables are updated as the item being tracked is changed in the transaction system
What about Normalized Modeling?
Reduces redundancies and makes transaction loading and updating simple and fast by breaking up everything into small and numerous entities. Many joins in normalized modeling hurts query performance. * 3NF called ""ER models"" (entity relationship)
Time of Day
Separate from date key, time repeats and is too much info to incorporate into date. Only break out into dimension if there are time periods to put in the dimension table (lunch, first shift, second shift, etc)
Snowflaking and Outriggers
Snowflaking has renormalized data and is generally to be avoided due to complexity and query performance. Situations where snowflaking into ""outrigger"" dimensions: * Significant space savings * Data of different grains collected * Increase analytical ease-of-use
Variable Depth Hierarchies via Bridge Tables
Some hierarchies are more complex. Rather than use a recursive reference, you can use a bridge table between the fact table and the dimension table (I think this is a lookup table). This approach allows more drill-down at the cost of useability and performance
Conformed Dimensions
Standardization is a goal for any well-architected data warehouse. * Match keys, attribute names, attirbute definitions, and domain values, regardless of the fact tbale * Attributes should match the granularity of the fact table (for tables of lower granularity) Advantages: * Consistency * Integration * Time-To Market There should be a few conformed categories that can match throughout the enterprise, but different biz units will always have different definitions. This is not a perfect science
Fable 2 - Premature Summarization
Start with atomic and move to summary later: * Atomic data is more flexible than summaries * Relational and BI databases are both capable of answering the same questions * Dimensional models are scaleable * Dimensional models are extensible * Dimensional models should be able to support complex data
Handling Hierarchies
These are a fact of life in a dimensional model * Fixed Hierarchies * Variable Depth Hierarchies via Bridge Tables
1) Choose the Biz Process
This likely happened during the prioritization phase
Planning Crisis
Two major tasks for the DW/BI manager (impossible): * Understand old system perfectly, be the authority, bring over information, and clean the info up * Understand what keeps management up at night. The DW/BI manager can carve off little pieces of the project at a time while bringing the project to conclusion. Stovepipe solutions are to be avoided, there must be an overarching strategy. These two can be in conflict
Dimension Table Keys
Uniquely identified with surrogate keys (4-bytes ints can handle almost any needs). Surrogate advantages are: * Performance (ints are easy to process) * Buffer from operational key management (which can be re-used) * Mapping to integrate disparate sources (using existing natural keys during ETL) * Unknown or not applicable conditions (avoids null values) * Track Changes in Dimensional Attribute values (roll-back)
Surrogate Date Keys
Use surrogate and not the date, itself, (date requires 8-bytes, not 4). Can bend "no meaning" surrogates for date YYYYMMDD, if it is adequate and is "performance enhancing." DO NOT USE FOR BI FILTERING!
3) Add a New Dimension Attribute
Used infrequently, but allows users to compare current condition with past condition. These are ""soft"" changes and work best if there are only a few. If changes happen with a predictable rhythm and the company needs to do historical fact comparisons this technique works
Mini-Dimensions: Add a New Dimension
When a dimension gets too big and may change, it might make sense to break up a fact key to connect to two dimensions: * Banded values - Demographic attributes can be clumped into banded ranges of discrete values * Restricted Growth - limited space requires a mini-dimension * Separation from core dimension - one mini-dimension is physically separated
Role-Playing Dimensions
When the same physical dimension table plays distinct logical roles in a dimension model (ex: several dates may have significance in a fact row that SQL will not let you join to the dimension tables) Uniquely label the columns so that users can tell attributes apart. Ship-to/sold to, facility origin/destination