CIS 3050: Chapter 9
Static extract
A method of capturing a snapshot of the required source data at a point in time.
Nonupdateable
Data in the DW are loaded and refreshed from operational systems but cannot be updated by end users
Time Variant
Data in the Data Warehouse are carefully associated with a specific period of time so that they may be used to study trends and changes
Transient Data
Data in which changes to existing records are written over previous records, thus destroying the previous data content
Surrogate Key
Every key used to join the fact table with a dimension table should be a surrogate key. Like Product Code or Store Code. WHY? -because business keys always change -allows us to keep track of different nonkey attributes for the same production key over time -often simpler and shorter 0can be of same length and format for all keys despite business dimensions Surrogate keys are non intelligent, system assigned, not natural.
Logical Data Mart and Real Time Warehouse Architecture
I) has logical Data marts II) Data are moved from the DW rather than to a separate staging area in order to utilize the high performance computing power of the warehouse technology to perform the cleansing and transformation steps III) New data marts can be created quickly IV) Data marts are always up to date because data in a view are created when the view is referenced
Conformed dimension
One or more dimension tables associated with two or more fact tables for which the dimension tables have the same business meaning and primary key with each fact table.
application integration
achieved by coordinating the flow of event information between business applications
user interaction integration
achieved by creating fewer user interfaces that feed different data system
business process integration
achieved by tighter coordination of activities across business process so that applications can be shared and more application integration can occur
Derived Data
data that has been selected, formatted, and aggregated for end-user decision support applications
characteristics of data after the ETL
detailed, historical, normalized, comprehensive , timely, quality controlled the goal is to provide a single, authoritative source for data that support decision making
Corporate Information Factory
has both dependent data mart and operational data store. Considered a comprehensive view of organizational data in support of all user data requirements
3 layer data architecture
1) Derived data [Data mart metadata] 2) Reconciled data [EDW metadata] 3)Operational data [Operational metadata[ all come from the enterprise data model
Independent Data Mart Structure
1: data are extracted from the various internal and external source system files and databases. In a large organization, there may be dozens or even hundreds of such files and databases 2: The data from the various source systems are transformed and integrated before being loaded into the data marts. Transactions may be sent to the source systems to correct errors discovered in data staging. The dW is considered as the collection of data marts 3: The data warehouse is a set of physically distinct databases organized for decision support 4: Users access the DW by means of variety of query languages and analytical tools. Results may be fed back to the data warehouse and operational databases
10 rules of dimensional modeling
1: use atomic facts 2: create single process fact tables 3: include a date dimension for every fact table 4: enforce consistent grain 5: disallow null keys in fact table 6: honor hierarchies 7: decode dimension tables 8: use surrogate keys 9: conform dimensions 10: balance requirements with actual date
Enterprise Data Warehouse
A centralized, integrated data warehouse that is the control point and single source of all data made available to end users for decision support applications
Dependent Data Mart
A data mart filled exclusively from an enterprise data warehouse and its reconciled data
Independent Data Mart
A data mart filled with data extracted from the operational environment without the benefit of a data warehouse
Logical Data Mart
A data mart that is created by a relational view of a data warehouse
Incremental extract
A method of capturing only the changes that have occurred in the source data since the last capture
Data scrubbing
A process of using pattern recognition and other artificial intelligence techniques to upgrade the quality of raw data before transforming and moving the data to the data warehouse. Also called data cleansing.
Star Schema
A simple Database Design in which dimensional data are separated from fact or event data. A dimensional model is another name for a star schema . Contains fact table and one or more dimensional tables Product, Period, Store---> SALES
Informational Systems
A system designed to support decision making based on historical point in time and prediction data for complex queries or data mining applications
Operational Systems
A system that is used to run a business in real time, based on current data. Also called a system of record
Data Federation
A technique for data integration that provides a virtual view of integrated data without actually creating one centralized database when an application wants data, a federation engine retrieves relevant data from the actual sources in real time and sends the result to the requesting application . advantage is to access current data, no delay, hides the intricacies of other applications and the way data is stored in the from a given query or application
Changed Data Capture
A technique that indicates which data have been changed since the last data integration activity
Update Mode
An approach to filling a data warehouse in which only changes in the source data are written to the data warehouse
Refresh Mode
An approach to filling a data warehouse that involves bulk rewriting of the target data at periodic intervals.
Real-time data warehouse
An enterprise data warehouse that accepts near real time feeds of transactional data from the systems of record, analyzes, warehouse data, and in near real time relays business rules to the DW and systems of record so that immediate action can be taken in response to business events need for EXCHANGE of data at a NEAR real time pace because there is a need for action to a current part of the org. to know what is happening to make desirable things happen through the operational systems
Snowflake Schema
An expanded version of a star schema in which dimension tables are normalized into several related tables.
Operational Data Store
An integrated, subject oriented and continuously updatable current valued (with recent history) enterprise wide, detailed database designed to serve operational users as they do decision support processing. Integrated source for all the operational data
Periodic Data
Data that are never physically altered or deleted once they have been added to the store
Reconciled Data
Detailed, current data intended to be the single, authoritative source for all decision support applications.
data propagation
Duplicate data across databases, with near real-time delay. data are pushed to duplicate sites as updates occur. these updates can be synchronous or asynchronous which decouples the updates to the remote copies. Advantage is the near real time cascading of data changes throughout the organization.
Dimension Table
Holds descriptive data about the subjects of the business. usually the source of attributes used to qualify or categorize or summarize facts in graphs or queries. each dimension table has a one to many relationship wo the central fact table
Need for Data Warehousing
I) A business requires an integrated, company-wide view of high quality information II) The information systems department must separate informational from operational systems to improve performance dramatically in managing company data
Benefit of real time data warehouse
I) Capture customer data at the time of a business event II) Analyze customer behavior and predict customer responses to possible actions III) Develop rules for optimizing customer interactions IV) Take immediate actions with customers at touch points on best responses to customers as determined by decision rules in order to make desirable results happen
Need to Separate informational and operational systems
I) Informational Systems centralize data that are scattered throughout disparate operational system and makes them readily available for analytical applications II) A properly designed set of informational systems adds value to data by improving their quality and consistency III) A separate set of informational systems eliminates much of the contention for resources when informational applications are confounded with operational processing
Three Layer Data Architecture
I) Operational Data are stored in the various operational systems of record throughout the organization II) Reconciled data are the type of data stored in the enterprise data warehouse and an operational data store III) Derived data are the type of data stored in data marts
Grain of the Fact
Level of detail in a fact table, determined by the intersection of all the components of the primary key, including all foreign keys and any other primary key elements. A common grain would be each business transaction
data transformation
Process of changing the data from their original form to a format suitable for performing a data analysis addressing research objectives. (format)
Data Warehousing
Process whereby organizations create and maintain data warehouses and extract meaning from and help inform decision making through the use of data in the data warehouses.
Integrated
The data housed in data warehouses are defined using consistent naming conventions, formats, encoding structures, and related characteristics gathered from several internal systems of record and also often from sources external to the organization. This means that the data warehouse holds the one version of the truth
Joining
The process of combining data from various sources into a single table or view
Selection
The process of partitioning data according to predefined criteria
Aggregation
The process of transforming data from a detailed level to a summarized level
Why do Businesses need Data warehousing
a business requires an integrated, company wide view of high quality info the IS department must separate informational from operational systems to improve performance dramatically in managing company data
Subject Oriented
a data warehouse is organized around key subjects or high level entities of the enterprise such as customers, patients, students, products, organizational units, and time
Data warehouse
a subject oriented, integrated, time variant, nonupdatable collection of data used in support of management decision making processes and business intelligence
Fact Table
contain factual or quantitative data. measurements that are numerical , continually valued, and additive such as units of books sold
Data Mart
is a data warehouse that is limited in scope whose data are obtained by selecting and summarizing data from a DW or from separate extract, transform, and load processes from source data systems
Independent Data mart downsides
often created for small short term business lose flexibility if want to upgrade larger scale and react to business changes due to the complexity of users having to access data in separate data marts A separate ETL process is developed for each data mart, yields redundancy. Inconsistency across different marts no capability to drill down into greater detail or have a shared repository
Dependent Data mart and operational data store
operational data store --> enterprise data warehouse [dependent data marts both logical and physical]
Logical Data Mart and Real Time Warehouse Architecture
real time data warehouse ---> data marts
Data Consolidation
summarizing data from a set of similar cell ranges
ETL Process
the goal is to provide a single, authoritative source for data that support decision making.
data integration
the integration of data from multiple sources, which provides a unified view of all data