Chapter06

अब Quizwiz के साथ अपने होमवर्क और परीक्षाओं को एस करें!

Four-Step Dimensional Design Process

1) Choose the Biz Process 2) Declare teh Grain 3) Identify the Dimensions 4) Identify the Facts

More on Dimensions

* Date and Time * Degenerate Dimensions * Slowly Changing Dimensions * Role-Playing Dimensions * Junk Dimensions * Snowflaking and Outriggers * Handling Hierarchies * Many-Valued Dimensions with Bridge Tables

Fables and Falsehoods about Dimensional Modeling

* Fable 1 - Focusing on Departmental Reports * Fable 2 - Premature Summarization * Fable 3 - Overvaluing Normalization

Dimensional Modeling Requirements

* Fact Tables * Dimension Tables

More On Facts

* Three Fundamental Grains * Facts of Differing Granularity and Allocation * Multiple Currencies and Units of Measure * Factless Fact Tables * Consolidated Fact Tables

Three Fundamental Grains

* Transaction Fact Tables * Periodic Snapshot Fact Tables * Accumulating Snapshot Fact Tables

Benefits of Dimensional Modeling

* Understandability to biz users. * Query performance - denormalization reduces query complexity * Dimension tables are equivalents, making query types predictable to dev * Can handle slowly changing dimensions * Extensible with backward compatibility

Making the Case for Dimensional Modeling

* What is Dimensional Modeling? * What about Normalized Modeling? * Benefits of Dimensional Modeling

What is Dimensional Modeling?

A logical design technique for structuring data so that it's intuitive to business users and delivers fast query performance: Divides the world into: * Measurements (usually numeric facts) * Context in dimensional tables, star-joined to the facts (called ""cubes"" in OLAP)

Junk Dimensions

After modeling, there are miscellaneous flags and text attributes that cannot find an appropriate home (comments are one type of junk data). Designer now has a number of bad alternatives which should be avoided: * Leave them in the fact table * Make them into separate dimensions (preferred) * Eliminate them Study the flags and place them into one or more junk dimensions rather than add to fact table. You must consider how much overlap occurs with the various junk attributes

Factless Fact Tables

An event happens, but no quantifiable measurements are created or generated; may be important for many-to-many relationships tha thtey are related to.

Dimension Tables

Big, bulky, descriptive fields (textual). For filtering should be: Verbose, descriptive, complete (no null values) Discretely valued, Quality Assured. If codes are used instead of words, they should have imbedded meaning (Avoid this). While hierarchical, many-to-one relationships may be resolved on one table. Highly correlated clumps of attributes grouped to represent key objects of a biz, such as products, custs, employees, or facilities. These should be nowns. Usually 8-15 dimension tables is very common * Dimension Table Keys * Conformed Dimensions

Data Decomposition

Break up fields: Name = last and first; address = street num, street, city, etc.

Many-Valued Dimensions with Bridge Tables

Bridge tables for Many-to-Many relationships. These should be a last resort due to usability and performance issues.

Fable 1 - Focusing on Departmental Reports

Business needs should be the starting point (measurements). Reports will evolve over time no matter what you do. * Do not create new dimensions for each report * Do not create departmentally bound reports * You can always add new data sources to your fact and dimension tables * Multiple fact tables can be created to support different grains

Fact Tables

Contains measurements or ""numeric"" values or measurements. Large but remove much of the need to run cross-table queries * Fact Table Keys * Fact Table Granularity

Enterprise Data Warehouse Bus Arch

Critical activity, that cannot be too localized or too global. * Planning Crisis * Bus Arch * Value Chain Implications * Common Matrix Mishaps * Taking the Pledge

Bus Arch

Define a standard interface system for all iterative DW/BI components Establish a matrix of items to be integrated into the DW/BI system. These prioritized systems should fit together like a puzzle as the project goes on and helps avoid stovepipe solutions.

1) Overwrite the Dimension Attribute

Delete old and write in new. Easy to implement, difficult to do analysis on past condition due to value deleted. Requires a total update of aggregated data to keep data in sync... CONSIDER ALL OTHER TYPES BEFORE THIS ONE

Fixed Hierarchies

Different departments have many-to-one hierarchical views that may not match definitions; this may require you to maintain both hierarchies Hierarchies are often associated with drill paths in the data.

Facts of Differing Granularity and Allocation

Different granularity cannot be on the same table. Need to allocate higher-level facts to lower-level granularity by defining an allocatoin algorithm. If no agreement then you need two uniquely labeled facts. * Not always matchable * Header facts need precise names

Multiple Time Zones

Dimension table can identify fiscal year, local timezone, HQ timezone, and more

Slowly Changing Dimensions

Dimension tables that change, but slowly. Should be able to track changes and change types: 1) Overwrite teh Dimension Attribute 2) Add a New Dimension Row (or mark old) 3) Add a New Dimension Attribute Mini-Dimensions: Add a New Dimension Hybrid Slowly Changing Dimension Techniques

Taking the Pledge

Everyone uses the same master conformed dimensions. Architect, developers, and senior management (politics is critical)

Fact Table Granularity

Grain is the definition of the mearurement event that defines a fact row. "A fact row is created when _____ occurs." Grain is determined by physical realities of source. Fact tables should contain the lowest, most detailed atomic grain captured by a biz process. Atomic data sustains attacks from ad-hoc analysis and can be extended.

Date/Timestamps

Helps determine how long it takes to complete one action to the next

2) Declare teh Grain

Identify exactly what a fact table row is in the proposed biz process to determine dimensional model. Typically as atomic or finely grained as possible.

4) Identify the Facts

Identify facts and metrics that are critical to the biz process. This requires that you understand the biz needs

Common Matrix Mishaps

Matrix Row Mishaps: * Departmental or overly encompassing rows - instead of processes * Report-centric or narrowly defined rows - instead of process Matrix Column Mishaps: * Overly Generalized columns - person could be Employee or Customer * Separate columns for each level of hierarchy - or for descriptions of ranking of hierarchy in fact table (should be in a dimension table)

Consolidated Fact Tables

Merge measurements from multiple processes into a single fact table. Requires that they use the same grain. This should coexist with the existing schema, but will enhance cross-reference queries

Transaction Fact Tables

Most common and basic, one row per transaction. Very atomic based upon event activity. May need a complimentary snapshot to help users visualize before getting into the weeds Rarely updated after ETL

Multiple Currencies and Units of Measure

Most commonly measured in local currency first and will ideally only be tracked in a single currency type so that the fact table adds up. Different business units may want different units of measure (pallets, containers, units). Combine currency and measurements and each row can have complex lists of required facts. Must strike a balance of captured information vs calculated fields/conversions. Further complicated if these dimensions can slowly change over time (fewer oz of soda per bottle).

2) Add a New Dimension Row (or mark old)

Most popular and powerful (data storage not issue). Add a tag for "Current" or "Not-Current" with date-stamps.

Fact Table Keys

Mutipart key made of foreign keys from dimension tables (no null values)

Date and Time

Necessary and complex, should be guided through the dimension table through the key. * Surrogate Date Keys * Time of Day * Date/Timestamps * Multiple Time Zones

Hybrid Slowly Changing Dimension Techniques

Need to balance analytic power against ease of use (hybrids are challenging). Allows historical references, and current references. * Add a ""current"" or ""old"" value into the dimension table * Consider including a dimension natural key or durable identifier as a fact table foreign key (associates two dimension tables with teh facts)

Fable 3 - Overvaluing Normalization

Normalized data is not the ideal * Conformed dimensions depends on standardized labels, values, and definiions. Conformed dimensions are challenging but necessary * Integration of multiple systems is challenging but possible * Changes to dimensions are a challenge to any data system * Multiple data marts can be incorporated, albeit with some work if dimensions are not conformed

Degenerate Dimensions

Number from the header of a transaction control dimension, extracted and left unlinked to a dimension table with the rest of the facts from the transaction. Can be used for auditing, compliance, or researching data quality concerns. Can also be used to establish ""count per x-type transaction""

3) Identify the Dimensions

Once grain is selected, this may be fairly straightforward. Start thinking about foreign keys and whether you want: * Transaction Level * Snapshot Level * Document Transaction type (PO, Shipment, Invoice)

Value Chain Implications

Page 250. Each step in teh process can be a grain in the fact table with multiple data points. The bus matrix can ensure that dimensions are conformed for later integration. This is important because whether a business needs supply chain management or information management (insurance companies), the approach is the same for the company's value chain

Periodic Snapshot Fact Tables

Periodic pictures of biz performance according to a specific interval. Complement but do not replace transaction level because theere is little detail or dimensions Rarely updated after ETL

Accumulating Snapshot Fact Tables

Rare but worth mentioning. Does not track each transaction but takes information once a process has completed (such as the final stage of invoiceing step). Tracks multiple date foreign keys corresponding to major milestones. The information in these fact tables are updated as the item being tracked is changed in the transaction system

What about Normalized Modeling?

Reduces redundancies and makes transaction loading and updating simple and fast by breaking up everything into small and numerous entities. Many joins in normalized modeling hurts query performance. * 3NF called ""ER models"" (entity relationship)

Time of Day

Separate from date key, time repeats and is too much info to incorporate into date. Only break out into dimension if there are time periods to put in the dimension table (lunch, first shift, second shift, etc)

Snowflaking and Outriggers

Snowflaking has renormalized data and is generally to be avoided due to complexity and query performance. Situations where snowflaking into ""outrigger"" dimensions: * Significant space savings * Data of different grains collected * Increase analytical ease-of-use

Variable Depth Hierarchies via Bridge Tables

Some hierarchies are more complex. Rather than use a recursive reference, you can use a bridge table between the fact table and the dimension table (I think this is a lookup table). This approach allows more drill-down at the cost of useability and performance

Conformed Dimensions

Standardization is a goal for any well-architected data warehouse. * Match keys, attribute names, attirbute definitions, and domain values, regardless of the fact tbale * Attributes should match the granularity of the fact table (for tables of lower granularity) Advantages: * Consistency * Integration * Time-To Market There should be a few conformed categories that can match throughout the enterprise, but different biz units will always have different definitions. This is not a perfect science

Fable 2 - Premature Summarization

Start with atomic and move to summary later: * Atomic data is more flexible than summaries * Relational and BI databases are both capable of answering the same questions * Dimensional models are scaleable * Dimensional models are extensible * Dimensional models should be able to support complex data

Handling Hierarchies

These are a fact of life in a dimensional model * Fixed Hierarchies * Variable Depth Hierarchies via Bridge Tables

1) Choose the Biz Process

This likely happened during the prioritization phase

Planning Crisis

Two major tasks for the DW/BI manager (impossible): * Understand old system perfectly, be the authority, bring over information, and clean the info up * Understand what keeps management up at night. The DW/BI manager can carve off little pieces of the project at a time while bringing the project to conclusion. Stovepipe solutions are to be avoided, there must be an overarching strategy. These two can be in conflict

Dimension Table Keys

Uniquely identified with surrogate keys (4-bytes ints can handle almost any needs). Surrogate advantages are: * Performance (ints are easy to process) * Buffer from operational key management (which can be re-used) * Mapping to integrate disparate sources (using existing natural keys during ETL) * Unknown or not applicable conditions (avoids null values) * Track Changes in Dimensional Attribute values (roll-back)

Surrogate Date Keys

Use surrogate and not the date, itself, (date requires 8-bytes, not 4). Can bend "no meaning" surrogates for date YYYYMMDD, if it is adequate and is "performance enhancing." DO NOT USE FOR BI FILTERING!

3) Add a New Dimension Attribute

Used infrequently, but allows users to compare current condition with past condition. These are ""soft"" changes and work best if there are only a few. If changes happen with a predictable rhythm and the company needs to do historical fact comparisons this technique works

Mini-Dimensions: Add a New Dimension

When a dimension gets too big and may change, it might make sense to break up a fact key to connect to two dimensions: * Banded values - Demographic attributes can be clumped into banded ranges of discrete values * Restricted Growth - limited space requires a mini-dimension * Separation from core dimension - one mini-dimension is physically separated

Role-Playing Dimensions

When the same physical dimension table plays distinct logical roles in a dimension model (ex: several dates may have significance in a fact row that SQL will not let you join to the dimension tables) Uniquely label the columns so that users can tell attributes apart. Ship-to/sold to, facility origin/destination


संबंधित स्टडी सेट्स

mgt 3324: Describe the challenges of being an entrepreneur part time

View Set

Medical Terminology Chapter 2 8th Edition

View Set

People Places and Things to Know Chapter 11 BJU Cultural Geography

View Set

Chapter 22- Performance and Breach of Sales Contracts (on the UCC's side)

View Set