ETL Concepts
Data mining steps
Data extraction, data cleansing, modeling data, applying data mining algorithm, pattern discovery, data visualization
Extraction
ETL functional element
Loading
ETL functional element
Transformation
ETL functional element
Meta data logging
ETL functional element, provides possibilities for automated administration, trend prediction and code reuse
Data profiling
Inspect all data for errors, inconsistencies, redundancies, and incomplete information
Extract system
Source data adapters, job schedulers, filtering and sorting at the source, proprietary data format conversions and data staging after transfer to ETL format
Source system
A database, application, file or other storage facility from which the data for a data warehouse is extracted
Subsystems of ETL
38 subsystems
Target system
A database, application, file or other storage facility to which the transformed source data is loaded in a data warehouse
Staging area
A place where data is processed before entering the warehouse
Identifying target data
A target for a data element may be an attribute in a dimension table, a numeric measure in a fact table or a summarized total in an aggregation table
Delta data
Both new and changed data, it represents changes from one point in time to the next
Refresh mode
Bulk rewriting of target data at periodic intervals
Log capture
Can be used when the DBMS keeps a log of changes in a reusable form
Timestamp-based capture
Can be used where the source record contains one or more fields with timestamp information; a batch program extracts records whose timestamp is greater than that of the last run. There is a possibility of losing intermediate changes if the data is changed more than once between two runs
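A minimal sketch of timestamp-based capture in Python; the record layout and the `modified` field name are hypothetical:

```python
from datetime import datetime

# Hypothetical source rows; "modified" is the assumed timestamp field.
source_rows = [
    {"id": 1, "modified": datetime(2024, 1, 1)},
    {"id": 2, "modified": datetime(2024, 1, 5)},
    {"id": 3, "modified": datetime(2024, 1, 9)},
]

last_run = datetime(2024, 1, 3)  # timestamp of the previous batch run

# Extract only records changed since the last run.
delta = [r for r in source_rows if r["modified"] > last_run]
```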
Data profiling details
Whether data can easily be used for other purposes; metrics on data quality, including whether the data conforms to company standards; assessing the risk of integrating data; tracking data quality; assessing whether meta data accurately describes the actual values in the source database
ETL process steps
Capture, scrub or data cleansing, transform, load and index
Static extract
Capturing a snapshot of the source data at a point in time
Incremental extract
Capturing changes that have occurred since the last static extract
Data monitoring
Check and control data integrity over time
Data validation
Check that all rows in the fact table match rows in dimension tables to enforce data integrity
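A minimal sketch of this referential-integrity check, assuming hypothetical key and field names:

```python
# Hypothetical key sets; names are illustrative.
dimension_keys = {10, 20, 30}  # primary keys present in a dimension table
fact_rows = [
    {"order_id": 1, "product_key": 10},
    {"order_id": 2, "product_key": 99},  # orphan: no matching dimension row
]

# Fact rows whose foreign key has no dimension match violate integrity.
orphans = [r for r in fact_rows if r["product_key"] not in dimension_keys]
```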
First generation ETL
Code generation products, generate the source code
Data profiling categories
Column profiling, dependency profiling, redundancy profiling
Data profiling system
Column property analysis including discovery of inferred domains, structure analysis including candidate foreign key to primary key relationships, data rule analysis, and value rule analysis
Conversion by lookup
Conversion from code to standardized values
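A sketch of conversion by lookup, assuming a hypothetical code-to-value table:

```python
# Hypothetical lookup table mapping source codes to standardized values.
country_lookup = {"DK": "Denmark", "SE": "Sweden", "NO": "Norway"}

def to_standard(code, lookup, default="UNKNOWN"):
    """Replace a source code with its standardized value."""
    return lookup.get(code, default)
```

Codes absent from the lookup fall back to a default so bad input is visible rather than silently dropped.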
Data quality
Correct, standardize and verify data
Business meta data
Creating definitions and business rules; can be technical or non-technical; used to help everyone in the business understand the data
Transformation responsibilities
Data validation, data accuracy, data type conversion, business rule application
Pull
Data warehouse extracts the data from the source systems
Other data warehouses
Data which can be useful to enhance decision support is captured from other data warehouses, snapshots can be created from source to destination
Data scrubbing not fixing errors
Decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
Steps for designing an ETL process
Define the structure of source data, define the structure of destination data, map elements of source data to elements of destination data, define the transformation required, schedule the execution of process
Map source data to target data
Defines the source fields of the data, the destination fields in the data warehouse and any data modifications to transform the data into the desired format
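A sketch of such a source-to-target map; the field names and transformations are hypothetical:

```python
# Hypothetical mapping: source field -> (target field, transformation).
mapping = {
    "cust_nm": ("customer_name", str.strip),
    "ord_amt": ("order_amount", float),
}

def apply_mapping(source_record, mapping):
    """Build a target record by renaming and transforming each field."""
    return {tgt: fn(source_record[src]) for src, (tgt, fn) in mapping.items()}

target_record = apply_mapping({"cust_nm": "  Alice ", "ord_amt": "12.5"}, mapping)
```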
Technical meta data
Describes information about technology such as the ownership of the database and physical characteristics of a database
Develop functional elements
Design and implementation of the four ETL functional elements differ from system to system; there will often be multiple versions of each functional element
Derived data characteristics
Detailed (mostly periodic) data, aggregate (for summary), distributed (to departmental servers)
After ETL data should be
Detailed, historical, not normalized (performance), comprehensive, quality controlled
Loading tables in a datawarehouse steps
Dimension tables loaded first, then fact tables, then aggregate tables (there should be an order within these three categories)
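The load ordering above can be sketched as a simple sort over table categories (table names are hypothetical):

```python
# Hypothetical tables tagged with their category.
tables = [("agg_sales", "aggregate"), ("orders", "fact"), ("products", "dimension")]

# Dimensions load first, then facts, then aggregates.
category_order = {"dimension": 0, "fact": 1, "aggregate": 2}
load_sequence = sorted(tables, key=lambda t: category_order[t[1]])
```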
Types of load tables
Dimension, fact, aggregate
Planning of staging area
Disk space required, memory/processor requirements, network connectivity, ETL software
Derived data objectives
Ease of use, fast response to predefined user queries, customized data, ad-hoc query support, data mining capabilities
Second generation ETL
Engine-driven products, generate directly executable code
Data augmentation
Enhance data using information from internal and external data sources
Data accuracy
Ensure that fields contain appropriate values, such as only "off" or "on" in a status field
Data type conversion
Ensures that all values for a specified field are stored the same way in the data warehouse regardless of how they were stored in the source system
Business rule application
Ensures that the rules of the business are enforced on the data stored in the warehouse
ETL
Extract, Transform, Load
Conversion
Field-level conversion from one data type to another based on predefined criteria
Ways of transforming
Filtering (conditional statements), aggregation (summation of records), merging (concatenating records), derivation (if/then statements), explosion (fragmenting data), data propagation (loading data into multiple tables)
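Two of these transformations, filtering and aggregation, sketched on hypothetical detail records:

```python
from collections import defaultdict

# Hypothetical detail records.
records = [
    {"region": "EU", "amount": 100},
    {"region": "EU", "amount": 50},
    {"region": "US", "amount": 70},
]

# Filtering: keep only records matching a condition.
eu_only = [r for r in records if r["region"] == "EU"]

# Aggregation: sum amounts per region.
totals = defaultdict(int)
for r in records:
    totals[r["region"]] += r["amount"]
```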
Legacy systems
Generally hold transactional data and are the primary source of data for any DW; may be in flat files
Data lineage
Meta data, the time that a particular set of records was loaded into the data warehouse
Data conformer
Identification and enforcement of special conformed dimension attributes and conformed fact table measures as the basis for data integration across multiple data sources
Dependency profiling
Identifies intra-table dependencies; addresses whether there are non-key attributes that determine, or are dependent on, other non-key attributes
Redundancy profiling
Identifies overlapping values, used to identify candidate foreign keys
Identifying and mapping data
Identifies sources of data elements, the targets for those data elements, and the transformations that must be applied to each data element
Data extraction process
Identify various sources, develop connection details, apply conditions to filter data, read data from different databases (push, pull)
ETL development phases
Identifying and mapping data, developing functional element implementations
Constructive merge
If the key of the new data matches existing data, the existing data is marked as superseded and retained, and the new data is inserted. Multiple versions of the same row may then exist, so a flag to distinguish the most recent data is required
Destructive merge
If the key of the new data matches existing data, the existing data is updated with the new data; otherwise the new data is inserted
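A minimal sketch of a destructive merge (an upsert) over a target keyed by primary key; the data is hypothetical:

```python
# Hypothetical target table keyed by primary key.
target = {1: {"id": 1, "name": "old"}}
incoming = [{"id": 1, "name": "new"}, {"id": 2, "name": "fresh"}]

# Destructive merge: matching keys are overwritten, others inserted.
for row in incoming:
    target[row["id"]] = row
```

A constructive merge would instead keep the superseded row and add a "most recent" flag to the new one.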
ETL process automation
It is required to decide at what frequency the data has to be extracted and loaded depending on how fast the data is changing and how the reporting is done
Algorithmic conversion
Conversion computed by a rule or formula, e.g. lower case to upper case or between measurement units
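Two tiny examples of algorithmic conversion, one for case and one for units (function names are illustrative):

```python
def inches_to_cm(value):
    """Algorithmic conversion: target value is computed by a formula."""
    return round(value * 2.54, 2)

def normalize_case(text):
    """Algorithmic conversion: lower case to upper case."""
    return text.upper()
```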
Data integration
Match, merge or link data from a variety of disparate sources
Identifying source data
May be as simple as identifying the server where the data is stored in an OLTP database and the storage type (SQL server database, excel spreadsheet, text file etc.)
ETL tools
Meant to extract, transform and load the data into the data warehouse for decision making
DTS package versioning
Meta data, can be used to view, branch, or retrieve any historical version of a particular DTS package
Schema changes
Meta data, changes to table definitions
Data type usage
Meta data, identifying all tables that use the user defined data type
Data warehouse usage statistics
Meta data, query times for reports
Transformation statistics
Meta data, the execution time of each stage of transformation, the number of rows processed by the transformation, the last time the transformation was executed, and so on
Fixing errors
Misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Star schema
The most common dimensional model: a central fact table surrounded by dimension tables
Significant business events
Non-predetermined events drive the capture of data; capture is triggered when the business event completes
Incremental data
Often copied into the data warehouse while leaving the original data in place
File comparison
One copy of operational data is kept before the updating starts and at the end of an update a comparison is done to extract data required for the data warehouse
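A sketch of file comparison using two snapshots keyed by record id (the data is hypothetical); set operations yield the changed, inserted, and deleted records:

```python
# Hypothetical snapshots keyed by record id.
before = {1: "a", 2: "b", 3: "c"}  # copy kept before updating starts
after = {1: "a", 2: "B", 4: "d"}   # operational data after updates

changed = {k for k in after if k in before and after[k] != before[k]}
inserted = set(after) - set(before)
deleted = set(before) - set(after)
```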
OLTP
Online Transaction Processing, data for data warehouse extracted from here
Update mode
Only changes in source data are written to data warehouse
Triggered capture
Only possible when the operational system is based on a DBMS; database triggers are written to capture the data on change/insert/delete. If the frequency of data change is high, performance is affected
Delta data capture
Operational events, changed data capture, date last modified, point-in-time comparisons
Other sources
Data from other software that is required for decision support can be extracted, e.g. CSV or flat files
Column profiling
Provides critical meta data required to perform dependency profiling (must be executed before)
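The domain-discovery part of column profiling can be sketched as collecting a column's distinct values (the column name and rows are hypothetical):

```python
# Hypothetical column values from a source table.
rows = [{"status": "on"}, {"status": "off"}, {"status": "on"}, {"status": "off"}]

# Column profiling: discover the inferred domain of a column.
domain = {r["status"] for r in rows}
```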
Other of the 38 subsystems
See the PowerPoint slides
ETL loading element
Responsible for loading transformed data into the data warehouse database
ETL metadata functional element
Responsible for maintaining information about the movement and transformation of data and the operation of the data warehouse
Point-in-time snapshots
Scheduled at specific points in time, efficient method to pinpoint specific points in time or ranges of time
Selection
Select data based on some criteria
Aggregation
Selecting summary data from detailed data
Record-level transformations
Selection-data partitioning, joining-data combining, aggregation-data summarization
Field-level transformations
Single-field: from one field to another; multi-field: from many fields to one, or from one field to many
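Both directions of a multi-field transformation sketched on a hypothetical name field:

```python
def merge_name(first, last):
    """Multi-field: many source fields to one target field."""
    return f"{last}, {first}"

def split_name(full_name):
    """Multi-field: one source field to many target fields."""
    last, first = full_name.split(", ")
    return {"first": first, "last": last}
```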
Change data capture system
Source log file readers, source data and sequence number filters and CRC based record comparison in ETL system
Push
Source system provide the data to the data warehouse
Separation/Concatenation
Split data into different data structures or merge data from two captured data files into single data
Mapping
The definition of the relationship and data flow between source and target objects
Static capture
The entire source data is captured at fixed intervals, can be full, selected attributes, or conditional
Append
The existing data is retained and the new data is appended
Application-assisted capture
The operational system is modified to provide changed data
Data mining
The process of automatically searching large volumes of data for patterns using tools such as classification, association rule mining, clustering etc.
Enrichment
The process of making data more informative in context of the data warehouse, can be at record level or field level
Transformation
The process of manipulating data beyond copying
Transportation
The process of moving copied or transformed data from a source to a data warehouse
Aggregate tables
These tables hold fact data aggregated on some dimension; they are much faster to access than querying the fact table and then filtering the data
Dimension tables
These tables hold non-transactional data (e.g. products, customers)
Fact tables
These tables hold transactional data (e.g. orders, inventory)
Data extraction implementation
Transact-SQL stored procedures, data transformation services (DTS) tasks, or custom applications
Transactional systems
Transactional data distinct from that in legacy systems can be captured; may be in flat files
Operational data
Transient, restricted in scope, normalized, sometimes poor quality
Data cleansing system
Typically a dictionary-driven system for complete parsing of names and addresses of individuals and organizations, possibly also products or locations; maintains back references to all participating original sources