ETL Concepts

Réussis tes devoirs et examens dès maintenant avec Quizwiz!

Data mining steps

Data extraction, data cleansing, modeling data, applying data mining algorithm, pattern discovery, data visualization

Extraction

ETL functional element

Loading

ETL functional element

Transformation

ETL functional element

Meta data logging

ETL functional element, provides possibilities for automated administration, trend prediction and code reuse

Data profiling

Inspect all data for errors, inconsistencies, redundancies, and incomplete information

Extract system

Source data adapters, job schedulers, filtering and sorting at the source, proprietary data format conversions and data staging after transfer to ETL format

Source system

A database, application, file or other storage facility which the data in a data warehouse is desired

Subsystems of ETL

38 subsystems

Target system

A database, application, file or other storage facility to which the transformed source data is loaded in a data warehouse

Staging area

A place where data is processed before entering the warehouse

Identifying target data

A target for a data element may be an attribute in a dimension table, a numeric measure in a fact table or a summarized total in an aggregation table

Delta data

Both new and changed data, it represents changes from one point in time to the next

Refresh mode

Bulk rewriting of target data at periodic intervals

Log capture

Can be used in case of DBMS which keeps log of changes in reusable form

Timestamp-based capture

Can be used where the source record contains one of more fields with timestamp information, a batch program can extract records which have timestamp greater than last run, there is a possibility of loosing changes if the data is changed more than two times between two runs

Data profiling details

Can data easily be used for other purposes, give metrics on data quality including whether the data conforms to company standards, assess risk in integrating data, track data quality, assess whether meta data accurately describes the actual values in the source database

ETL process steps

Capture, scrub or data cleansing, transform, load and index

Static extract

Capturing a snapshot of the source data at a point in time

Incremental extract

Capturing changes that have occurred since the last static extract

Data monitoring

Check and control data integrity over time

Data validation

Check that all rows in the fact table match rows in dimension tables to enforce data integrity

First generation ETL

Code generation products, generate the source code

Data profiling categories

Column profiling, dependency profiling, redundancy profiling

Data profiling system

Column property analysis including discovery of inferred domains and structure analysis including candidate foreign key - primary relationships, data rule analysis, and value rule analysis

Conversion by lookup

Conversion from code to standardized values

Data quality

Correct, standardize and verify data

Business meta data

Creating definitions and business rules, can be either technical or not technical, is used to help everyone in the business understand the data

Transformation responsibilities

Data validation, data accuracy, data type conversion, business rule application

Pull

Data warehouse extracts the data from the source systems

Other data warehouses

Data which can be useful to enhance decision support is captured from other data warehouses, snapshots can be created from source to destination

Data scrubbing not fixing errors

Decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data

Steps for designing an ETL process

Define the structure of source data, define the structure of destination data, map elements of source data to elements of destination data, define the transformation required, schedule the execution of process

Map source data to target data

Defines the source fields of the data, the destination fields in the data warehouse and any data modifications to transform the data into the desired format

Technical meta data

Describes information about technology such as the ownership of the database and physical characteristics of a database

Develop functional elements

Design an implementation of the 4 ETL functional elements are different from system to system, there will often be multiple versions of each functional element

Derived data characteristics

Detailed (mostly periodic) data, aggregate (for summary), distributed (to departmental servers)

After ETL data should be

Detailed, historical, not normalized (performance), comprehensive, quality controlled

Loading tables in a datawarehouse steps

Dimension tables loaded first, then fact tables, then aggregate tables (there should be an order within these three categories)

Types of load tables

Dimension, fact, aggregate

Planning of staging area

Disk space required, memory/processor requirement, network connectivity, software ETL

Derived data objectives

Ease of use, fast response to predefined user queries, customized data, ad-hoc query support, data mining capabilities

Second generation ETL

Engine-driven products, generate directly executable code

Data augmentation

Enhance data using information from internal and external data sources

Data accuracy

Ensure that fields contain appropriate values, such as only "off" or "on" in a status field

Data type conversion

Ensures that all values for a specified field are stored the same way in the data warehouse regardless of how they were stored in the source system

Business rule application

Ensures that the rules of the business are enforced on the data stored in the warehouse

ETL

Extract, Transform, Load

Conversion

Field level conversion from one datatype to other data type based on predefined criteria

Ways of transforming

Filtering (conditional statements), aggregation (summation of records), merging (concatenate records), derivation (clause's (if/then statements)), explosion (fragment data), data propagation (load data into multiple tables)

Legacy systems

Generally holds transactional data and it is the primary source of the data to any DW, can be in flat files

Data lineage

Meta data, the time that a particular set of records was loaded into the data warehouse

Data confromer

Identification and enforcement of special conformed dimension attributes and conformed fact table measures as the basis for data integration across multiple data sources

Dependency profiling

Identifies intra-table dependencies, addresses whether or not there are non-key attributes that determine or are dependent on non-key attributes

Redundancy profiling

Identifies overlapping values, used to identify candidate foreign keys

Identifying and mapping data

Identifies sources of data elements, the targets for those data elements, and the transformations that must be applied to each data element

Data extraction process

Identify various sources, develop connection details, apply conditions to filter data, read data from different databases (push, pull)

ETL development phases

Identifying and mapping data, developing functional element implementations

Constructive merge

If key of new data matches with existing data then data is marked as superseded and retained. New data is inserted and multiple sets of the same table may exist, a flag to distinguish most recent data will be required

Destructive merge

If key of new data matches with existing data, the existing data is updated with new data else new data is inserted

ETL process automation

It is required to decide at what frequency the data has to be extracted and loaded depending on how fast the data is changing and how the reporting is done

Algorithmic conversion

Lower case to upper case/measurement units

Data integration

Match, merge or link data from a variety of disparate sources

Identifying source data

May be as simple as identifying the server where the data is stored in an OLTP database and the storage type (SQL server database, excel spreadsheet, text file etc.)

ETL tools

Meant to extract, transform and load the data into the data warehouse for decision making

DTS package versioning

Meta data, can be used to view, branch, or retrieve any historical version of a particular DTS package

Schema changes

Meta data, changes to table definitions

Data type usage

Meta data, identifying all tables that use the user defined data type

Data warehouse usage statistics

Meta data, query times for reports

Transformation statistics

Meta data, the execution of time of each stage of transformation, the number of rows processed by the transformation, the last time the transformation was executed, and so on

Fixing errors

Misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies

Star schema

Most common star model, also called dimensional table

Significant business events

Non-predetermined events drive the capture of data, triggered when a completion event is performed

Incremental data

Often copied into the data warehouse while leaving the original data in place

File comparison

One copy of operational data is kept before the updating starts and at the end of an update a comparison is done to extract data required for the data warehouse

OLTP

Online Transaction Processing, data for data warehouse extracted from here

Update mode

Only changes in source data are written to data warehouse

Triggered capture

Only in case the operational system is based on DBMS, the database triggers are written the capture the data on change/insert/delete, if the frequency of data change is high then performance is affected

Delta data capture

Operational events, changed data capture, date last modified, point-in-time comparisons

Other sources

Other software's data which is required for decision support can be extracted, csv, flat files

Column profiling

Provides critical meta data required to perform dependency profiling (must be executed before)

Other of the 38 subsystems

Read power point

ETL loading element

Responsible for loading transformed data into the data warehouse database

ETL metadata functional element

Responsible for maintaining information about the movement and transformation of data and the operation of the data warehouse

Point-in-time snapshots

Scheduled at specific points in time, efficient method to pinpoint specific points in time or ranges of time

Selection

Select data based on some criteria

Aggregation

Selecting summary data from detailed data

Record-level transformations

Selection-data partitioning, joining-data combining, aggregation-data summarization

Field-level transformations

Single-field--from one field to another, multi-field-from many fields to one or one field to many

Change data capture system

Source log file readers, source data and sequence number filters and CRC based record comparison in ETL system

Push

Source system provide the data to the data warehouse

Separation/Concatenation

Split data into different data structures or merge data from two captured data files into single data

Mapping

The definition of the relationship and data flow between source and target objects

Static capture

The entire source data is captured at fixed intervals, can be full, selected attributes, or conditional

Append

The existing data is retained and new data is appending

Application-assisted capture

The operational system is modified to provide changed data

Data mining

The process of automatically searching large volumes of data for patterns using tools such as classification, association rule mining, clustering etc.

Enrichment

The process of making data more informative in context of the data warehouse, can be at record level or field level

Transformation

The process of manipulating data beyond copying

Transportation

The process of moving copied or transformed data from a source to a data warehouse

Aggregate tables

These tables hold fact data which is aggregated on some dimension, they are very fast to access to compared to accessing the fact table and then filtering the data

Dimension tables

These tables hold information which is non-transactional data (eg. products, customers)

Fact tables

These tables hold information which is transactional data (eg. orders, inventory)

Data extraction implementation

Transact-SQL stored procedures, data transformation services (DTS) tasks, or custom applications

Transactional systems

Transactional data different from legacy system can be captured, flat files

Operational data

Transient, restricted in scope, normalized, sometimes poor quality

Data cleansing system

Typically dictionary driven system for complete parsing of names and addresses of individuals and organizations, possibly also products or locations. Maintains back references to all participating original sources


Ensembles d'études connexes

ECON 201: Final Exam (Chapter 11)

View Set

Encoding, Retrieval & Consolidation Quiz 7

View Set

Seeley's AP 1 Final Exam Study Guide

View Set

Psych Midterm Prep Ch. 3: BIOPSYCHOLOGY

View Set

Iggy 11 Chapter 55: Care of Patients with Stomach Disorders

View Set

Financial Management & Policy - Chapter 9

View Set

Chapter 16: workers compensation and unemployment compensation

View Set