Chapter 9 - Data Warehousing and Data Integration
Duration of the Database
- Natural duration - 13 months or 5 quarters - Financial institutions may need longer duration - Older data is more difficult to source and cleanse
Limitations of Independent Data Marts
- Separate ETL process for each data mart (redundant data and processing) - Inconsistency between data marts - Difficult to drill down for related facts between data marts - Excessive scaling costs are more applications are built - High cost for obtaining consistency between marts
Grain of the Fact Table
- Transactional grain (finest level) - Aggregated grain (more summarized) - Finer grains brings better market basket analysis capability - Finer grain implies more dimension tables, more rows in fact table - In web-based commerce, finest granularity is a click
10 Essential Rules for Dimensional Modeling
1. Use atomic facts 2. Create single-process fact tables 3. Include a date dimension for each fact table 4. Enforce consistent grain 5. Disallow null keys in fact tables 6. Honor hierarchies 7. Decode dimension tables 8. Use surrogate keys 9. Conform dimensions 10. Balance requirements with actual data
Data Mart
A data warehouse that is limited in scope
Data Warehouse
A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes - Subject-oriented (e.g. customers, patients, students) - Integrated (consistent naming conventions, formats, encoding structures; from multiple data sources) - Time-variant (can study trends and changes) - Non-updatable (read-only, periodically refreshed)
Reconciled Data Layer
After ETL, data should be: - Detailed, not summarized yet - Historical, periodic - Normalized, 3rd normal form or higher - Comprehensive, enterprise-wide perspective - Timely, data should be current enough to assist decision making - Quality controlled, accurate with full integrity
Data Warehouse Architectures
All involve some form of ETL -Independent Data Mart -Dependent Data Mart and Operational Data Store -Logical Data Mart and Real-Time Data Warehouse -Three-Layer architecture
Techniques for Data Integration
Consolidation (ETL) Data federation (EII) Data propagation (EAI and EDR)
Data Integration
Creates a unified view of business data
Data Warehouse vs. Data Mart: Scope
Data Warehouse: - Application independent - Centralized, possibly enterprise-wide - Planned Data Mart: - Specific DSS application - Decentralized by user area - Organic, possibly not planned
Data Warehouse vs. Data Mart: Data, Subjects, and Sources
Data Warehouse: - Data is historical, detailed, and summarized - Data is lightly denormalized - Multiple subjects - Many internal and external sources Data Mart: - Data has some history, is detailed and summarized - Data is highly denormalized - One central subject or concern to users - Few internal and external sources
Data Warehouse vs. Data Mart: Other Characteristics
Data Warehouse: - Flexible - Data oriented - Long life - Large - Single complex structure Data Mart: - Restrictive - Project oriented - Short life - Starts small, becomes large - Multi-, semi-complex structures, together complex
Size of Fact Table
Depends on the number of dimensions and the grain of the fact table. Number of rows = product of number of possible values for each dimension associated with the fact table.
Characteristics of Derived Data
Detailed (mostly periodic) data Aggregate (for summary) Distributed (to departmental servers) Most common data model - dimensional model (usually implemented as a star schema)
Surrogate Keys
Dimension table keys should be non-intelligence and non-business related, because: - business keys may change over time - helps keep track of non-key attribute values for a given production key - they are simpler and shorter - can be the same length and format for all keys
Objectives of Derived Data
Ease of use for decision support applications Fast response to predefined user queries Customized data for particular target audiences Ad-hoc query support Data mining capabilities
Data Propagation (EAI and EDR)
Enterprise Application Integration Enterprise Data Replication - data are pushed to duplicate sites as updates occur - sometimes called event-driven propagation - the events are updates themselves - triggers help to facilitate this Duplicate data across databases, with near real-time delay
Data Federation (EII)
Enterprise Information Integration - there is no physical data warehouse; it is "virtual", composed of dynamic queries and views that operate directly on the source data Provides a virtual view of data without actually creating one centralized database
ETL
Extract, Transform, Load - Capture/Extract - Scrub or data cleansing - Transform - Load and Index Performed: - during initial load of Enterprise Data Warehouse (EDW) - during subsequent periodic updates to EDW
Consolidation (ETL)
Extract-Transform-Load - the traditional approach for building physical data warehouses Consolidating all data into a centralized database (like a data warehouse)
Slowly Changing Dimensions
How to maintain knowledge of the past Kimball's approaches: - Type 1: replace old data with new (lose historical data) - Type 2: create a current value field and sever old-valued fields (multivalued) for each changing attribute - Type 3: create a new dimension table row each time the dimension object changes, with all dimension characteristics at the time of change (most common approach)
Issues with Company-Wide View
Inconsistent key structures Synonyms Free-form vs. structured fields Inconsistent data values Missing data
Changed Data Capture (CDC)
Indicates which data have changed since previous data integration activity
Need for Data Warehousing
Integrated, company-wide view of high-quality information (from disparate databases) Separation of operational and informational systems and data (for improved performance)
Application Integration
Involves coordinating the flow of event information, between business applications, often via a service-oriented architecture
User Interaction Integration
Involves creating fewer user interfaces that feed different data systems. Often done using an enterprise portal to interact with different data reporting and business intelligence systems.
Business Process Integration
Involves tighter coordination of activities across business processes so that applications can be shared and more application integration can occur
Variations of Star Schema
Multiple Facts Tables - can improve performance - often used to store facts for different combinations of dimensions - conformed dimensions Factless Facts Tables - No non-key data, but foreign keys for associated dimensions - Used for: - tracking events - inventory coverage
Normalizing Dimension Tables
Multivalued Dimensions - Facts qualified by a set of values for the same business subject - Normalization involves creating a table for an associative entity between dimensions Hierarchies - Sometimes a dimension forms a natural, fixed-depth hierarchy - Design options: - include all information for each level in a single denormalized table - Normalize the dimension into a nested set of 1:N table relationships
Other Data Warehouse Changes
New descriptive attributes New business activity attributes New classes of descriptive attributes Descriptive attributes have become more refined Descriptive data are related to one another New sources of data
Organizational Trends Motivating Data Warehouses
No single system of records Multiple systems not synchronized Organizational need to analyze activities in a balanced way Customer relationship management Supplier relationship management
Record Level Transformation Functions
Selection Joining Normalization Aggregation
Joining
The process of combining data from various sources into a single table or view
Normalization
The process of decomposing relations with anomalies to produce smaller, well-structured relations
Selection
The process of partitioning data according to predefined criteria
Aggregation
The process of transforming data from detailed to summary level
Informational System
a system designed to support decision making based on historical point-in-time and prediction data for complex queries or data-mining applications
Operational System
a system that is used to run a business in real time, based on current data also called a system of record