Ch 12 - Data extraction, transformation, and loading
job sequencing
determine whether the beginning of one job in an extraction job stream has to wait until the previous job has finished successfully
capture by comparing files
do a full file comparison between today's copy and yesterday's copy, compare the record keys to find inserts and deletes, and then capture any changes between the two
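A minimal sketch of this comparison in Python, assuming each copy is a CSV file keyed by a record_id column (the file layout and key name are assumptions for illustration):

```python
import csv

def snapshot_diff(yesterday_path, today_path, key="record_id"):
    """Compare yesterday's and today's full file copies by record key."""
    def load(path):
        with open(path, newline="") as f:
            return {row[key]: row for row in csv.DictReader(f)}

    old, new = load(yesterday_path), load(today_path)
    inserts = [new[k] for k in new.keys() - old.keys()]
    deletes = [old[k] for k in old.keys() - new.keys()]
    updates = [new[k] for k in new.keys() & old.keys() if new[k] != old[k]]
    return inserts, deletes, updates
```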
One major effort within __________ is the improvement of data quality
data transformation
capture based on date and time stamp
the data capture occurs at a later time, and you extract only the data with the date and time stamp later than the specified time
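A sketch of the idea, assuming the source table carries a last_updated column and the previous extract time is saved between runs (the table and column names are invented for illustration):

```python
import sqlite3
from datetime import datetime, timezone

def extract_since(conn: sqlite3.Connection, last_extract_time: str):
    """Pull only rows whose date/time stamp is later than the last run."""
    rows = conn.execute(
        "SELECT * FROM orders WHERE last_updated > ?", (last_extract_time,)
    ).fetchall()
    # Record the new high-water mark for the next incremental extract.
    new_mark = datetime.now(timezone.utc).isoformat()
    return rows, new_mark
```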
Usually, the mass refreshes, whether for initial load or for periodic refreshes, cause difficulties, not so much because of complexities, but because
these load jobs run too long
transformation tools will automatically _______________________ metadata
record the
What is simpler: refresh or update?
refresh
What are the two ways to keep a data warehouse up to date?
refresh and update
What is the multiple sources problem?
results from a single data element having more than one source, so you have to figure out which system to get the value from for the warehouse
data revisions are
revisions since the last time data was captured
summarization
summarization, because keeping data at the lowest level of granularity is sometimes not feasible
What is a TYPE 3 change to handle a slowly changing dimension?
tentative changes where your users need the ability to analyze the metrics in both ways - with the changes and without the changes
initial load
the initial data as of a certain time is moved into the data warehouse to get started
in periodic status, the status value is stored with reference to the time when _____________
the new value became effective
What is the desired goal for using transformation tools?
to eliminate manual methods altogether
code generators
tools that directly deal with the extraction, transformation, and loading of data that enable the process by generating program code to perform these functions
capture through transaction logs
transaction logs of the DBMS maintained for recovery from possible failures
What is the entity identification problem?
unique identification numbers for the same customer may not be the same across multiple systems meaning you will not know which customer records relate to the same customer.
If the target table to be loaded already exists and data exists in the table, the load process ____________
wipes out the existing data and applies the data from the incoming file
What are the difficulties with incremental loads?
1. determine the best method of capture from each source system 2. execute the capture without impacting source systems 3. schedule the incremental loads without impacting use of the data warehouse by the users
Two major factors differentiate the data extraction for a new operational system from the data extraction for a data warehouse. What are they?
1. for a data warehouse, you have to extract data from many disparate sources 2. for a data warehouse, you have to extract data on the changes for ongoing incremental loads as well as for a one-time initial full load. For operational systems, all you need is one-time extractions and conversions
What are the major transformation types?
1. format revisions 2. decoding of fields 3. calculated and derived values 4. splitting of single fields 5. merging of information 6. character set conversion 7. conversion of units of measurements 8. date/time conversion 9. summarization 10. key restructuring 11. deduplication
Activities within the data transformation function can run the gamut of transformation methods. You have to _________________.
reformat internal data structures, resequence data, apply various forms of conversion techniques, supply default values wherever values are missing, and design the whole set of aggregates that are needed for performance improvement
date/time conversion
representation of date and time in standard formats
selection
select either whole records or parts of several records from the source systems
capture based on date and time stamp works well if the number of revised records is __________
small
capture by comparing files is also known as the
snapshot differential technique
Capture through transaction logs will not work if ...
some of your source system data is on indexed or other flat files
data capture through triggers occurs right at the _________ and is therefore quite reliable
source
capture through database triggers is only available for ___________ in databases
source data
__________ goes beyond merely listing the sources to include examination and verification that the identified sources will provide the necessary value to the data warehouse.
source identification
splitting/joining
splitting the selected parts even further or joining parts of selected records from many source systems
You will use __________ data capture primarily for the initial load of the data warehouse.
static
splitting of single fields
store individual components of names and addresses because 1) indexing on individual components improves operating performance and 2) users may need to perform analysis using individual components
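For example, a single full-name field split into components before loading; a rough sketch (the splitting rule here is simplistic and assumed):

```python
def split_name(full_name: str) -> dict:
    """Split a single full-name field into indexable components."""
    parts = full_name.strip().split()
    return {
        "first_name": parts[0] if parts else "",
        "middle_name": " ".join(parts[1:-1]) if len(parts) > 2 else "",
        "last_name": parts[-1] if len(parts) > 1 else "",
    }

# split_name("John Q Public")
# -> {'first_name': 'John', 'middle_name': 'Q', 'last_name': 'Public'}
```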
You also need to determine the mechanism for capturing the changes to data in each of the relevant source systems. These are __________ activities.
strenuous and time-consuming
For operational data in the periodic status category, the history of the changes is preserved in ___________
the source systems themselves
current value
the stored value of an attribute represents the value of the attribute at this moment in time
You have to know the _____ during each day when you can extract data without impacting use of the operational systems
time window
Data in the source system is said to be __________________ because source data changes with time.
time-dependent or temporal
You have to subject the extracted data to all manner of __________so that the data will be fit to be converted into strategic information.
transformations
if the source data is _______, capture of the revisions is not easy
transient
in a current value, the values are ___________
transient or transitory
the approach for data extraction is driven by the _______ requirements
user
What are the two major types of data extractions from the source operational systems?
"as is" (static) data and data of revisions
How do you choose when you have the multiple sources problem?
1. assign a higher priority to one of the sources and pick up the value from that source 2. select from the file based on the last update date 3. choose depending on the values of related fields
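A sketch of option 1, picking the value from the highest-priority source that supplies one (the source names and ranking are invented for illustration; options 2 and 3 would compare last-update dates or related fields instead):

```python
SOURCE_PRIORITY = ["billing", "crm", "legacy"]  # assumed ranking

def pick_value(candidates: dict):
    """candidates maps source system name -> extracted value (or None)."""
    for source in SOURCE_PRIORITY:
        if candidates.get(source) is not None:
            return candidates[source]
    raise ValueError("no source supplied a value")

# pick_value({"crm": "Gold", "legacy": "G"}) -> "Gold"
```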
What are the three categories of ETL tools?
1. data transformation engines 2. data capture through replication 3. code generators
What are the benefits of dividing up the whole load process into smaller chunks and populating a few files at a time?
1. you are able to run smaller loads in parallel 2. you can keep some parts of the data warehouse up and running while loading the other parts
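A sketch of running the smaller chunks as parallel loads (load_file is a hypothetical stand-in for applying one load file to its target table):

```python
from concurrent.futures import ThreadPoolExecutor

def load_file(path: str) -> str:
    # Placeholder: apply one load file to its target table.
    return f"loaded {path}"

chunks = ["facts_part1.dat", "facts_part2.dat", "dim_customer.dat"]

# Several short loads run side by side instead of one long serial load.
with ThreadPoolExecutor(max_workers=3) as pool:
    for result in pool.map(load_file, chunks):
        print(result)
```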
Querying through ______ implies data access to current operational systems as well as to historical data. If self-service business intelligence is opened up to unknown or less trained users, erratic queries through ______ may cause severe system degradation in operational systems.
EII, EII
what is a constructive merge?
If the primary key of an incoming record matches with the key of an existing record, leave the existing record, add the incoming record, and mark the added record as superseding the old record
what is a destructive merge?
If the primary key of an incoming record matches with the key of an existing record, update the matching target record. If the incoming record is a new record without a match with any existing record, add the incoming record to the target table
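A toy contrast of the two merge modes, with the target modeled as a list of records and "key" as the primary key (the record shapes are illustrative only):

```python
def destructive_merge(target: list, incoming: list) -> None:
    """Matching key: overwrite the target record. New key: add the record."""
    index = {rec["key"]: i for i, rec in enumerate(target)}
    for rec in incoming:
        if rec["key"] in index:
            target[index[rec["key"]]] = rec
        else:
            target.append(rec)

def constructive_merge(target: list, incoming: list) -> None:
    """Matching key: keep the old record, add the new one as superseding."""
    for rec in incoming:
        for old in target:
            if old["key"] == rec["key"]:
                old["current"] = False  # demote the superseded record
        target.append({**rec, "current": True})

target = [{"key": 1, "amount": 100, "current": True}]
constructive_merge(target, [{"key": 1, "amount": 150}])
# target now holds both versions; the newer record is marked current.
```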
enterprise application integration (EAI)
In this approach, integration takes the form of linking application systems by allowing them to exchange data, messages, and transactions. A specific application may access the required data transparently without knowing the format of how the data is stored or where the data is stored.
If data already exists in the table, the ___________ process unconditionally adds the incoming data, preserving the existing data in the target table
append
update
application of incremental changes in the data sources
What is the incremental load?
applying ongoing changes as necessary in a periodic manner
refresh
complete reload at specified intervals
What is a full refresh?
completely erasing the contents of one or more tables and reloading with fresh data (initial load is a refresh of all the tables)
enterprise information integration
configuring and providing a unified view of dispersed corporate data (virtually)
What is a TYPE 1 change to handle a slowly changing dimension?
correction of errors
manual data transformation programs are mostly executed in the
data staging area
What are the three functional areas of a data warehouse?
data acquisition, data storage, and information delivery
________________ precedes all other functions
data extraction
What encompasses the areas of data acquisition and data storage?
data extraction, data transformation, and data loading
EII may be applicable to a ____________, but that is not always recommended for a new data warehouse.
federated architecture
Third-party tools usually provide built-in ____
flexibility
Source data is in a constant state of ______
flux
method of extraction
for each data source, define whether the extraction process is manual or tool-based
time window
for each data source, denote the time window for the extraction process
extraction frequency
for each data source, establish how frequently the data extraction must be done: daily, weekly, quarterly, and so on
data capture through replication
most of these tools use the transaction recovery logs maintained by the DBMS. The changes to the source systems captured in the transaction logs are replicated in near real time to the data staging area for further processing
Is there any overhead in capture through transaction logs?
no; the transaction logs are maintained anyway for recovery, so this capture technique adds no extra overhead
during the loads, the data warehouse has to be _____________
offline
What is the initial load?
populating all the data warehouse tables for the very first time
What are the useful remarks about incremental extracts for fact tables?
1. consist of new transactions 2. consist of update transactions 3. use database transaction logs for data capture
What do you need to consider for features in an ETL tool for your environment?
1. data extraction from various relational databases of leading vendors 2. data extraction from old legacy databases, indexed files, and flat files 3. data transformation from one format to another with variations in source and target fields 4. performing of standard conversions, key reformatting, and structural changes 5. provision of audit trails from source to target 6. application of business rules for extraction and transformation 7. combining of several records from the source systems into one integrated target record 8. recording and management of metadata
What are the tips for history loads of the fact tables?
1. identify historical data useful and interesting for the data warehouse 2. define and refine extract business rules 3. capture audit statistics to tie back to the operational systems 4. perform fact table surrogate key look-up 5. improve fact table content 6. restructure the data 7. prepare the load files
What are the useful remarks about incremental loads for fact tables?
1. load as frequently as feasible 2. use partitioned files and indexes 3. apply parallel processing techniques
By its very nature, _____ is generally applied to real-time transaction processing in an operational system.
EAI
Wherever ETL applications require low-latency data, ____ may be used to provide the ____-to-ETL interface for real-time data extraction and consolidation.
EAI, EAI
Although a dynamic ______-to-ETL interface is desirable, most organizations are using _____ to create data sets that are subsequently input into ETL applications.
EAI, EAI
First, _____ cannot be regarded as an integration approach competing with ETL. The two approaches may be combined to complement each other. _____ can be used as an input-producing mechanism for ETL; ETL may be considered as a service mechanism for _____.
EAI, EAI, EAI
Data transformation as part of ETL may sometimes be extremely complex. In these situations, complex data transformations of _____ on the fly may result in very high system overhead.
EII
ETL enables the building of historical data over time in a systematic manner. Wherever historical trends and patterns are the focus of business intelligence, ______ may not be the approach of choice.
EII
In cases where data volumes are too large for consolidation and only a fraction of the consolidated data is used regularly, retrieval and data consolidation using ETL may not be justified. _____ may prove to be a viable alternative.
EII
When using ETL to capture and consolidate data, some latency is to be expected. However, if the operational data is highly volatile and real-time querying and analysis is absolutely essential, then ____ may be considered as an option.
EII
_______________ contains information about the source systems, mappings between source data and data warehouse target data structures, data transformations, and data loading
ETL metadata
What are the two approaches for information integration?
Enterprise Information Integration (EII) and Enterprise Application Integration (EAI)
Which tool provides maximum flexibility for pointing to various source systems, selecting the appropriate data transformation methods, and applying full loads and incremental loads?
data transformation engines
data transformation engines
dynamic and sophisticated data manipulation algorithms that capture data from a designated set of source systems at user-defined intervals, perform elaborate data transformations, send the results to a target environment, and apply the data to target files
in recent years, transformation tools have greatly increased in_____________ and ________________
functionality and flexibility
What is a major disadvantage of using manual techniques for data transformation?
generation of metadata: manual methods do not automatically record metadata about the transformations
it is generally agreed that transformation functions end as soon as _______________ are created
load images
The complexity and the extent of data transformation strongly suggest that ___________methods alone will not be enough
manual
What is a TYPE 2 change to handle a slowly changing dimension?
true changes where you need to preserve the history in the data warehouse
capture in source applications is also known as
application-assisted data capture
conversion
an all-inclusive task: standardize among the data extractions from disparate source systems and make fields usable and understandable to the users
periodic status
the value of the attribute is preserved as the status every time a change occurs
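A data-shape illustration of the two storage categories (values invented):

```python
# Current value: only the latest value of the attribute survives.
current_value = {"account": "A-100", "status": "active"}

# Periodic status: every value is kept with the time it became effective.
periodic_status = [
    {"account": "A-100", "status": "opened",    "effective": "2023-01-05"},
    {"account": "A-100", "status": "suspended", "effective": "2023-06-12"},
    {"account": "A-100", "status": "active",    "effective": "2023-09-01"},
]
```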
With regard to extracting and applying the ongoing incremental changes, there are several difficulties.What are they?
1. Finding the proper extraction method for individual source datasets can be arduous. 2. Once you settle on the extraction method, finding a time window to apply the changes to the data warehouse can be tricky if your data warehouse cannot suffer long downtimes.
What are the advantages and disadvantages of capture in source applications?
1. Good flexibility for capture specifications. 2. Performance of source systems affected a bit. 3. Major revisions to existing applications. 4. Can be used on most legacy systems. 5. Can be used on file-oriented systems. 6. High internal costs because of in-house work.
What are the advantages and disadvantages of capture based on date and time stamp?
1. Good flexibility for capture specifications. 2. Performance of source systems not affected. 3. Major revisions to existing applications likely. 4. Cannot be used on most legacy systems. 5. Can be used on file-oriented systems. 6. Vendor products may be used.
What are the advantages and disadvantages of capture by comparing files?
1. Good flexibility for capture specifications. 2. Performance of source systems not affected. 3. No revisions to existing applications. 4. May be used on legacy systems. 5. May be used on file-oriented systems. 6. Vendor products are used. No internal costs.
What are the advantages and disadvantages of capture of static data?
1. Good flexibility for capture specifications. 2. Performance of source systems not affected. 3. No revisions to existing applications. 4. Can be used on legacy systems. 5. Can be used on file-oriented systems. 6. Vendor products are used. No internal costs.
What are the advantages and disadvantages of capture through database triggers?
1. Not much flexibility for capture specifications. 2. Performance of source systems affected a bit. 3. No revisions to existing applications. 4. Cannot be used on most legacy systems. 5. Cannot be used on file-oriented systems. 6. Vendor products are used. No internal costs.
What are the advantages and disadvantages of capture through transaction logs?
1. Not much flexibility for capture specifications. 2. Performance of source systems not affected. 3. No revisions to existing applications. 4. Can be used on most legacy systems. 5. Cannot be used on file-oriented systems. 6. Vendor products are used. No internal costs.
What are the disadvantages of capture through database triggers?
1. building and maintaining trigger programs puts an additional burden on development effort 2. execution of trigger procedures produces additional overhead
What are the types of deferred data extraction techniques?
1. capture based on date and time stamp 2. capture by comparing files
What are the three distinct options of immediate data capture?
1. capture through transaction logs 2. capture through database triggers 3. capture in source applications
What are the two ways that source systems store data?
1. current values 2. periodic status
What are the steps for using replication to capture changes to source data?
1. identify the source system database table 2. identify and define target files in the staging area 3. create mapping between the source table and target files 4. define the replication mode 5. schedule the replication process 6. capture the changes from the transaction logs 7. transfer captured data from logs to target files 8. verify transfer of data changes 9. confirm success or failure of replication 10. in metadata, document the outcome of replication 11. maintain definitions of sources, targets, and mappings
what are the steps in the source identification process?
1. list each data item of metrics or facts needed for analysis in fact tables 2. list each dimension attribute from all dimensions 3. for each target data item, find the source system and source data item 4. if there are multiple sources for one data element, choose the preferred source 5. identify multiple source fields for a single target field and form consolidation rules 6. identify single source field for multiple target fields and establish splitting rules 7. ascertain default values 8. inspect source data for missing values
What are the challenges in a typical data source environment?
1. multiple character sets 2. multiple data types 3. missing values 4. no default values 5. conflicting business rules 6. inconsistent values 7. multiple naming standards 8. incompatible structures
What are the activities and tasks that compose the ETL process?
1. plan for aggregate fact tables 2. determine data transformation and cleansing rules 3. establish comprehensive data extraction rules 4. prepare data mapping for target data elements from sources 5. integrate all the data sources, both internal and external 6. determine all the target data needed in the data warehouse 7. combine several source data structures into a single row in the target database of the data warehouse 8. split one source data structure into several structures to go into several rows of the target database 9. read data from data dictionaries and catalogs of source systems 10. read data from a variety of file structures including flat files, indexed files (VSAM), and legacy system databases (hierarchical/network) 11. load details for populating atomic fact tables 12. transform data from one format in the source platform to another format in the target platform 13. derive target values for input fields 14. change cryptic values to values meaningful to users
What are the downsides of capture in source applications?
1. revise the programs in the source operational systems and keep them maintained 2. may degrade the performance of the source applications because of the additional processing
What are the basic tasks in data transformation?
1. selection 2. splitting/joining 3. conversion 4. summarization 5. enrichment
What is a list of data extraction issues?
1. source identification 2. method of extraction 3. extraction frequency 4. time window 5. job sequencing 6. exception handling
What are the different difficulties with ETL functions?
1. source systems are very diverse and disparate 2. there is usually a need to deal with source systems on multiple platforms and different operating systems 3. many source systems are older legacy applications running on obsolete database technologies 4. generally, historical data on changes in values is not preserved in source operational systems. historical information is critical in a data warehouse 5. quality of data is dubious in many old source systems that have evolved over time 6. source system structures keep changing over time because of new business conditions. ETL functions must also be modified accordingly 7. even when inconsistent data is detected among disparate source systems, lack of a means for resolving mismatches escalates the problem of inconsistency 8. most source systems do not represent data in types or formats that are meaningful to the users. many representations are cryptic and ambiguous
What is the sequence for initial bulk refresh as well as for incremental data loads?
1. triggering for incremental changes 2. filtering for refreshes and incremental loads 3. data extraction 4. transformation 5. cleansing 6. applying to the data warehouse database
In many cases, you need to convert from ________ to ___________ formats
EBCDIC to ASCII
______________ functions reshape the relevant data from the source systems into useful information to be stored in the data warehouse.
ETL
What is the difference between using third-party tools versus in-house programs for ETL?
Third-party tools are generally more expensive than in-house programs, but they record their own metadata. On the other hand, in-house programs increase the cost of maintenance and are hard to maintain as source systems change.
In a current value, The value of an attribute remains constant only until ______________
a business transaction changes it.
data replication
a method for creating copies of data in a distributed environment
You must perform ___________of ETL for successfully transforming data into strategic information or business intelligence
all three functions
capture in source applications can be used for _________________ of source data
all types
calculated and derived values
calculate and derive values such as average daily balances or operating ratios
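For instance, deriving an average daily balance from source balance snapshots (a minimal sketch; the field meaning is assumed):

```python
def average_daily_balance(daily_balances: list) -> float:
    """Derive a value the source never stores: the average daily balance."""
    return sum(daily_balances) / len(daily_balances) if daily_balances else 0.0

# average_daily_balance([100.0, 120.0, 80.0]) -> 100.0
```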
capture through database triggers
can create trigger programs for all events for which you need data to be captured and have the output written to a separate file that will be used to extract data for the warehouse
_________ presupposes that all relevant source records contain date and time stamps
capture based on date and time stamp
if a source record gets deleted in between two extract runs, the information about the delete is not detected in ___________
capture based on date and time stamp
what data extraction technique is a last resort?
capture by comparing files
as is or static data
capture of data at a given point in time (snapshot of the relevant source data at a certain point in time)
For changing data into information you first need to
capture the data
format revisions
changes to data types and lengths of individual fields
merging of information
combination of fields (concatenation)
key restructuring
come up with keys for the fact and dimension tables based on the keys in the extracted records; transform keys with built-in meanings into generic keys generated by the system itself
character set conversion
conversion of character sets to an agreed standard character set for textual data (EBCDIC or ASCII)
conversion of units of measurement
convert metrics so numbers are all in one standard unit of measurement
What is more common, periodic status or current value?
current value
decoding of fields
decode cryptic codes and change them to values that make sense for the user
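A tiny sketch of decoding, using a code table invented for illustration (different source systems might code gender as 1/2 or M/F):

```python
# Hypothetical code table: cryptic source codes -> user-meaningful values.
GENDER_DECODE = {"1": "male", "2": "female", "M": "male", "F": "female"}

def decode_gender(raw: str) -> str:
    return GENDER_DECODE.get(raw.strip().upper(), "unknown")

# decode_gender("2") -> 'female'; decode_gender("m") -> 'male'
```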
exception handling
determine how to handle input records that cannot be extracted
source identification
identify source applications and source structures
incremental data capture may be ____________ or ___________
immediate or deferred
data revisions are also known as
incremental data capture
When beginning with data extraction you should start by doing a complete ______________ of the source systems.
inventory
deduplication
keep a single record and link all duplicates to the single record
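A sketch of keeping one survivor per customer and linking each duplicate back to it (the customer_id and record_id fields are assumptions):

```python
def deduplicate(records: list):
    """Keep a single record per customer; link each duplicate to it."""
    survivors = {}  # customer_id -> surviving record
    links = []      # (duplicate record_id, surviving record_id)
    for rec in records:
        cid = rec["customer_id"]
        if cid not in survivors:
            survivors[cid] = rec
        else:
            links.append((rec["record_id"], survivors[cid]["record_id"]))
    return list(survivors.values()), links
```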
The ___________on the source systems must contain information on every database and every data structure that are needed from the source systems. You need very detailed information, including database size and volatility of the data.
metadata
in capture through source applications you have to _________ relevant applications that write to the source files
modify
ETL functions form the _______ for the data warehouse information content.
prerequisites
enrichment
rearrangement and simplification of individual fields to make them more useful for the data warehouse environment