Ch. 10: Data Quality and Integration


Record Level Transformation Functions (4)

-Selection -Joining -Normalization -Aggregation

The ETL Process (7)

ETL = Extract, Transform, and Load. Steps: -Capture/Extract -Scrub (data cleansing) -Transform -Load and Index. Performed: -During the initial load of the Enterprise Data Warehouse (EDW) -During subsequent periodic updates to the EDW
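
The steps can be illustrated with a minimal Python sketch, assuming a hypothetical list-of-dicts source and an in-memory SQLite database standing in for the EDW:

```python
# Minimal ETL sketch: extract -> scrub -> transform -> load/index.
# Source rows and the "sales" table are made up for illustration.
import sqlite3

source_rows = [  # stand-in for the capture/extract step
    {"cust_id": "17", "name": " Ada Lovelace ", "amount": "120.50"},
    {"cust_id": "17", "name": "Ada Lovelace",   "amount": "120.50"},  # duplicate
]

def scrub(rows):
    """Cleanse: trim whitespace and drop exact duplicates."""
    seen, clean = set(), []
    for r in rows:
        r = {k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            clean.append(r)
    return clean

def transform(rows):
    """Convert from the operational format to the warehouse format."""
    return [(int(r["cust_id"]), r["name"], float(r["amount"])) for r in rows]

# Load and index into the (hypothetical) EDW table.
edw = sqlite3.connect(":memory:")
edw.execute("CREATE TABLE sales (cust_id INT, name TEXT, amount REAL)")
edw.executemany("INSERT INTO sales VALUES (?, ?, ?)", transform(scrub(source_rows)))
edw.execute("CREATE INDEX idx_cust ON sales (cust_id)")
```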

Data quality improvement Step 1: Business Buy-in (5)

-Executive sponsorship -Building a business case -Proving a return on investment (ROI) -Avoidance of cost -Avoidance of opportunity loss

Three main architectures of MDM (3)

Identity registry -master data remains in the source systems; the registry tells applications where to find it Integration hub -data changes are broadcast through a central service to subscribing databases Persistent -a central "golden record" is maintained, and all applications have access; requires applications to push data and is prone to data duplication

Data quality improvement Step 3: Data Stewardship Program (5-roles)

Roles: -Oversight of the data stewardship program -Manage a data subject area -Oversee data definitions -Oversee production of data -Oversee use of data Open question: should stewards report to a business unit or to the IT organization?

Record-level (3 types)

Selection-data partitioning Joining-data combining Aggregation-data summarization
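
A minimal Python sketch of the three record-level functions, using made-up order and customer data:

```python
# Record-level transformations on plain lists of dicts.
orders = [
    {"id": 1, "cust": "A", "region": "east", "total": 10.0},
    {"id": 2, "cust": "B", "region": "west", "total": 25.0},
    {"id": 3, "cust": "A", "region": "east", "total": 5.0},
]
customers = {"A": "Ada", "B": "Bob"}

# Selection: partition the data according to a predefined criterion.
east = [o for o in orders if o["region"] == "east"]

# Joining: combine each order with its matching customer record.
joined = [{**o, "cust_name": customers[o["cust"]]} for o in orders]

# Aggregation: summarize detail rows to one row per customer.
totals = {}
for o in orders:
    totals[o["cust"]] = totals.get(o["cust"], 0.0) + o["total"]
```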

Algorithmic

Single-field transformation that uses a formula or logical expression to derive the target value
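
For example, a hypothetical formula-based transformation deriving an age field from a birth-year field:

```python
# Algorithmic single-field transformation: target value computed
# from one source field with a formula (field names are made up).
from datetime import date

def age_from_birth_year(birth_year: int) -> int:
    """Derive AGE in the warehouse from BIRTH_YEAR in the source."""
    return date.today().year - birth_year
```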

Table lookup

Single-field transformation that uses a separate lookup table keyed by the source record's code
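
A sketch with a made-up code table:

```python
# Table-lookup single-field transformation: the source code value
# keys into a separate table that supplies the target value.
state_lookup = {"PA": "Pennsylvania", "TX": "Texas", "CA": "California"}

def expand_state(code: str) -> str:
    # Flag unknown codes rather than silently passing them through.
    return state_lookup.get(code, f"UNKNOWN({code})")
```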

Characteristics of Quality Data (8)

Uniqueness Accuracy Consistency Completeness Timeliness Currency Conformance Referential integrity

Also (further scrub/cleanse actions)

decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data

Techniques for Data Integration (Data Propagation: EAI/EDR) (def + pros (2) + cons (1))

Enterprise Application Integration (EAI) & Enterprise Data Replication (EDR): duplicate data across databases, with near-real-time delay Pros -data is available in near real time -can work together with ETL Cons -considerable overhead in synchronizing the duplicate data
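
A toy sketch of the propagation idea, with plain lists standing in for the subscribing databases:

```python
# Data propagation sketch: apply each captured source change to
# every duplicate copy with near-real-time delay (all names made up).
targets = [[], []]                 # stand-ins for subscriber databases

def on_change(event):
    """Broadcast a captured change to every duplicate copy."""
    for t in targets:
        t.append(event)            # the synchronization overhead lives here

on_change({"op": "insert", "id": 7})
```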

Techniques for Data Integration (Data Federation: EII) (def + pros (3) + cons (2))

Enterprise Information Integration (EII): provides a virtual view of the data without actually creating one centralized database Pros -data is always current -simple for the calling application -works well for read-only access Cons -may not cope with heavy query workloads -write access to the data sources may not be supported
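
A sketch of the virtual-view idea, with two hypothetical source functions joined at query time instead of materialized centrally:

```python
# Federation (EII) sketch: a query-time virtual view over two
# made-up operational sources; no central copy is ever created.
def crm_source():          # stand-in for one operational system
    return [{"cust_id": 1, "name": "Ada"}]

def billing_source():      # stand-in for another
    return [{"cust_id": 1, "balance": 42.0}]

def virtual_customer_view():
    """Join the sources on the fly, so data is always current."""
    balances = {r["cust_id"]: r["balance"] for r in billing_source()}
    return [{**c, "balance": balances.get(c["cust_id"])} for c in crm_source()]
```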

Field-level ( 2 types)

single-field -transforms one source field into one target field multi-field -transforms many source fields into one target field, or one source field into many target fields
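
Illustrative sketches of each case (field names are made up):

```python
# Field-level transformations: one -> one, many -> one, one -> many.
def to_upper(name: str) -> str:                 # single-field: one -> one
    return name.upper()

def full_name(first: str, last: str) -> str:    # multi-field: many -> one
    return f"{first} {last}"

def split_phone(phone: str):                    # multi-field: one -> many
    area, rest = phone[:3], phone[3:]
    return area, rest
```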

Mapping and Metadata Management (5)

-A design step prior to performing ETL -Required data are mapped to data sources +Graphical or matrix representations -Explanations of reformatting, transformations, and cleansing actions to be done -Process flow involving tasks and jobs -Metadata that: +Identifies data sources +recognizes same data in different systems +represents process flow steps

Data quality improvement Step 4: Improving Data Capture Processes (5)

-Automate data entry as much as possible -Manual data entry should be selected from preset options -Use trained operators when possible -Follow good user interface design principles -Immediate data validation for entered data
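
A sketch of the last point, immediate validation at entry time (the email rule is illustrative, not from the text):

```python
# Immediate data validation: reject or normalize a value before it
# ever reaches storage (the rule here is a hypothetical example).
import re

def validate_email(value: str) -> str:
    value = value.strip().lower()               # normalize before checking
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value):
        raise ValueError(f"invalid email: {value!r}")
    return value
```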

The Reconciled Data Layer (After ETL, data should be...) (6)

-Detailed-not summarized yet -Historical-periodic -Normalized-3rd normal form or higher -Comprehensive-enterprise-wide perspective -Timely-data should be current enough to assist decision-making -Quality controlled-accurate with full integrity

Requirements for Data Governance (4)

-Sponsorship from both senior management and business units -A data steward manager to support, train, and coordinate data stewards -Data stewards for different business units, subjects, and/or source systems -A governance committee to provide data management guidelines and standards

The Reconciled Data Layer (Typical Operational data is..) (4)

-Transient-not historical -Not normalized (perhaps due to denormalization for performance) -Restricted in scope-not comprehensive -Sometimes poor quality-inconsistencies and errors

Data Integration (6)

1. Data integration creates a unified view of business data 2. Other possibilities: -Application integration -Business process integration -User interaction integration 3. Any approach requires changed data capture (CDC) -Indicates which data have changed since previous data integration activity
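
A minimal CDC sketch, assuming the source keeps an update timestamp per row:

```python
# Changed data capture (CDC): select only rows whose update
# timestamp is newer than the last integration run (data made up).
from datetime import datetime

last_run = datetime(2024, 1, 1)
rows = [
    {"id": 1, "updated": datetime(2023, 12, 30)},
    {"id": 2, "updated": datetime(2024, 1, 5)},
]
changed = [r for r in rows if r["updated"] > last_run]  # only id 2
```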

Causes of poor data quality (4)

1. External data sources -Lack of control over data quality 2. Redundant data storage and inconsistent metadata -Proliferation of databases with uncontrolled redundancy and metadata 3. Data entry -Poor data capture controls 4. Lack of organizational commitment -Not recognizing poor data quality as an organizational issue

Steps in Data quality improvement (6)

1. Get business buy-in 2. Perform data quality audit 3. Establish data stewardship program 4. Improve data capture processes 5. Apply modern data management principles and technology 6. Apply total quality management (TQM) practices

Data steward

A person responsible for ensuring that organizational applications properly support the organization's data quality goals

Techniques for Data Integration (Consolidation: ETL) (def + pros (3) + cons (2))

Consolidating all data into one centralized database (such as a data warehouse) Pros -users are isolated from conflicting workloads -history can be retained -data can be accessed quickly Cons -network and storage costs are high -performance degrades as the database grows too large

Master Data Management (MDM)

Disciplines, technologies, and methods to ensure the currency, meaning, and quality of reference data within and across various subject areas

Data governance

High-level organizational groups and processes overseeing data stewardship across the organization

Importance of Data Quality (3)

If the data are bad, the business fails. Period. -GIGO: garbage in, garbage out -Sarbanes-Oxley (SOX) compliance legally mandates data and metadata quality standards

Multi-field transformation

Many sources to one target One source to many targets

Purposes of data quality (4)

Minimize IT project risk Make timely business decisions Ensure regulatory compliance Expand customer base

Data quality improvement Step 5: Apply modern data management principles and technology (5)

Software tools for analyzing and correcting data quality problems: -Pattern matching -Fuzzy logic -Expert systems Also: sound data modeling and database design

Data quality improvement Step 2: Data Quality Audit (5)

Statistically profile all data files Document the set of values for all fields Analyze data patterns (distribution, outliers, frequencies) Verify whether controls and business rules are enforced Use specialized data profiling tools
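
A sketch of the profiling step on a made-up age field, flagging values more than two standard deviations from the mean:

```python
# One data-quality-audit step: statistically profile a field to
# document its value set and spot outliers (values are made up).
from collections import Counter
from statistics import mean, stdev

ages = [34, 29, 41, 37, 29, 250, 33]            # 250 is a likely error
freqs = Counter(ages)                           # document the set of values
mu, sigma = mean(ages), stdev(ages)
outliers = [a for a in ages if abs(a - mu) > 2 * sigma]  # flags 250
```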

Data quality improvement Step 6: TQM Principles and Practices (8)

TQM = Total Quality Management Principles: -Defect prevention -Continuous improvement -Use of enterprise data standards -Balanced focus on customer and product/service -Strong foundation of measurement

Refresh mode

bulk rewriting of target data at periodic intervals

Static extract

capturing a snapshot of the source data at a point in time

Incremental extract

capturing changes that have occurred since the last static extract
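
A sketch contrasting the two extract modes against a hypothetical timestamped source:

```python
# Static vs. incremental extract (rows and timestamps are made up).
from datetime import datetime

source = [
    {"id": 1, "updated": datetime(2024, 1, 2)},
    {"id": 2, "updated": datetime(2024, 2, 9)},
]

def static_extract():
    """Snapshot of all the source data at a point in time."""
    return list(source)

def incremental_extract(since: datetime):
    """Only the changes that occurred since the last static extract."""
    return [r for r in source if r["updated"] > since]
```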

Transform

convert data from format of operational system to format of data warehouse (record level and field level)

Fixing errors

misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies

Capture/Extract

obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse

Update mode

only changes in source data are written to data warehouse

Load/Index

place transformed data into the warehouse and create indexes (refresh and update mode)
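
A sketch of the two load modes against a hypothetical SQLite dimension table:

```python
# Refresh mode rewrites the target in bulk; update mode writes only
# the changed rows (table and columns are made up for illustration).
import sqlite3

edw = sqlite3.connect(":memory:")
edw.execute("CREATE TABLE dim_cust (id INT PRIMARY KEY, name TEXT)")

def refresh(rows):
    """Refresh mode: bulk rewrite of the target table."""
    edw.execute("DELETE FROM dim_cust")
    edw.executemany("INSERT INTO dim_cust VALUES (?, ?)", rows)

def update(changed_rows):
    """Update mode: write only changes in the source data."""
    edw.executemany(
        "INSERT INTO dim_cust VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
        changed_rows,
    )

refresh([(1, "Ada"), (2, "Bob")])
update([(2, "Bobby")])
```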

Joining

the process of combining data from various sources into a single table or view

Normalization

the process of decomposing relations with anomalies to produce smaller, well-structured relations

Selection

the process of partitioning data according to predefined criteria

Aggregation

the process of transforming data from detailed to summary level

Scrub/Cleanse

uses pattern recognition and AI techniques to upgrade data quality (see "Fixing errors" and "Also" above)
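
A toy example of a pattern-recognition cleanse step, using stdlib fuzzy string matching as a stand-in for the AI techniques the card mentions:

```python
# Flag likely duplicate names with fuzzy matching (names made up;
# the 0.85 similarity threshold is an illustrative assumption).
from difflib import SequenceMatcher

names = ["Jon Smith", "John Smith", "Ada Lovelace"]
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if SequenceMatcher(None, a, b).ratio() > 0.85:
            print("possible duplicate:", a, "~", b)
```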

