Chapter 9 - Data Warehousing and Data Integration

Lakukan tugas rumah & ujian kamu dengan baik sekarang menggunakan Quizwiz!

Duration of the Database

- Natural duration - 13 months or 5 quarters - Financial institutions may need longer duration - Older data is more difficult to source and cleanse

Limitations of Independent Data Marts

- Separate ETL process for each data mart (redundant data and processing) - Inconsistency between data marts - Difficult to drill down for related facts between data marts - Excessive scaling costs are more applications are built - High cost for obtaining consistency between marts

Grain of the Fact Table

- Transactional grain (finest level) - Aggregated grain (more summarized) - Finer grains brings better market basket analysis capability - Finer grain implies more dimension tables, more rows in fact table - In web-based commerce, finest granularity is a click

10 Essential Rules for Dimensional Modeling

1. Use atomic facts 2. Create single-process fact tables 3. Include a date dimension for each fact table 4. Enforce consistent grain 5. Disallow null keys in fact tables 6. Honor hierarchies 7. Decode dimension tables 8. Use surrogate keys 9. Conform dimensions 10. Balance requirements with actual data

Data Mart

A data warehouse that is limited in scope

Data Warehouse

A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes - Subject-oriented (e.g. customers, patients, students) - Integrated (consistent naming conventions, formats, encoding structures; from multiple data sources) - Time-variant (can study trends and changes) - Non-updatable (read-only, periodically refreshed)

Reconciled Data Layer

After ETL, data should be: - Detailed, not summarized yet - Historical, periodic - Normalized, 3rd normal form or higher - Comprehensive, enterprise-wide perspective - Timely, data should be current enough to assist decision making - Quality controlled, accurate with full integrity

Data Warehouse Architectures

All involve some form of ETL -Independent Data Mart -Dependent Data Mart and Operational Data Store -Logical Data Mart and Real-Time Data Warehouse -Three-Layer architecture

Techniques for Data Integration

Consolidation (ETL) Data federation (EII) Data propagation (EAI and EDR)

Data Integration

Creates a unified view of business data

Data Warehouse vs. Data Mart: Scope

Data Warehouse: - Application independent - Centralized, possibly enterprise-wide - Planned Data Mart: - Specific DSS application - Decentralized by user area - Organic, possibly not planned

Data Warehouse vs. Data Mart: Data, Subjects, and Sources

Data Warehouse: - Data is historical, detailed, and summarized - Data is lightly denormalized - Multiple subjects - Many internal and external sources Data Mart: - Data has some history, is detailed and summarized - Data is highly denormalized - One central subject or concern to users - Few internal and external sources

Data Warehouse vs. Data Mart: Other Characteristics

Data Warehouse: - Flexible - Data oriented - Long life - Large - Single complex structure Data Mart: - Restrictive - Project oriented - Short life - Starts small, becomes large - Multi-, semi-complex structures, together complex

Size of Fact Table

Depends on the number of dimensions and the grain of the fact table. Number of rows = product of number of possible values for each dimension associated with the fact table.

Characteristics of Derived Data

Detailed (mostly periodic) data Aggregate (for summary) Distributed (to departmental servers) Most common data model - dimensional model (usually implemented as a star schema)

Surrogate Keys

Dimension table keys should be non-intelligence and non-business related, because: - business keys may change over time - helps keep track of non-key attribute values for a given production key - they are simpler and shorter - can be the same length and format for all keys

Objectives of Derived Data

Ease of use for decision support applications Fast response to predefined user queries Customized data for particular target audiences Ad-hoc query support Data mining capabilities

Data Propagation (EAI and EDR)

Enterprise Application Integration Enterprise Data Replication - data are pushed to duplicate sites as updates occur - sometimes called event-driven propagation - the events are updates themselves - triggers help to facilitate this Duplicate data across databases, with near real-time delay

Data Federation (EII)

Enterprise Information Integration - there is no physical data warehouse; it is "virtual", composed of dynamic queries and views that operate directly on the source data Provides a virtual view of data without actually creating one centralized database

ETL

Extract, Transform, Load - Capture/Extract - Scrub or data cleansing - Transform - Load and Index Performed: - during initial load of Enterprise Data Warehouse (EDW) - during subsequent periodic updates to EDW

Consolidation (ETL)

Extract-Transform-Load - the traditional approach for building physical data warehouses Consolidating all data into a centralized database (like a data warehouse)

Slowly Changing Dimensions

How to maintain knowledge of the past Kimball's approaches: - Type 1: replace old data with new (lose historical data) - Type 2: create a current value field and sever old-valued fields (multivalued) for each changing attribute - Type 3: create a new dimension table row each time the dimension object changes, with all dimension characteristics at the time of change (most common approach)

Issues with Company-Wide View

Inconsistent key structures Synonyms Free-form vs. structured fields Inconsistent data values Missing data

Changed Data Capture (CDC)

Indicates which data have changed since previous data integration activity

Need for Data Warehousing

Integrated, company-wide view of high-quality information (from disparate databases) Separation of operational and informational systems and data (for improved performance)

Application Integration

Involves coordinating the flow of event information, between business applications, often via a service-oriented architecture

User Interaction Integration

Involves creating fewer user interfaces that feed different data systems. Often done using an enterprise portal to interact with different data reporting and business intelligence systems.

Business Process Integration

Involves tighter coordination of activities across business processes so that applications can be shared and more application integration can occur

Variations of Star Schema

Multiple Facts Tables - can improve performance - often used to store facts for different combinations of dimensions - conformed dimensions Factless Facts Tables - No non-key data, but foreign keys for associated dimensions - Used for: - tracking events - inventory coverage

Normalizing Dimension Tables

Multivalued Dimensions - Facts qualified by a set of values for the same business subject - Normalization involves creating a table for an associative entity between dimensions Hierarchies - Sometimes a dimension forms a natural, fixed-depth hierarchy - Design options: - include all information for each level in a single denormalized table - Normalize the dimension into a nested set of 1:N table relationships

Other Data Warehouse Changes

New descriptive attributes New business activity attributes New classes of descriptive attributes Descriptive attributes have become more refined Descriptive data are related to one another New sources of data

Organizational Trends Motivating Data Warehouses

No single system of records Multiple systems not synchronized Organizational need to analyze activities in a balanced way Customer relationship management Supplier relationship management

Record Level Transformation Functions

Selection Joining Normalization Aggregation

Joining

The process of combining data from various sources into a single table or view

Normalization

The process of decomposing relations with anomalies to produce smaller, well-structured relations

Selection

The process of partitioning data according to predefined criteria

Aggregation

The process of transforming data from detailed to summary level

Informational System

a system designed to support decision making based on historical point-in-time and prediction data for complex queries or data-mining applications

Operational System

a system that is used to run a business in real time, based on current data also called a system of record


Set pelajaran terkait

Office Applications I Unit 1 Quiz 1

View Set

List #2: North America/U.S. Geography Basics

View Set

Aggregate Demand and Aggregate Supply

View Set

Chapter 10: Security in Network Design

View Set

Appl in Exercise and Sports Science Ch 3

View Set