CS 350 Chapter 9
Conformed dimension
One or more dimension tables associated with two or more fact tables for which the dimension tables have the same business meaning and primary key with each fact table
Describe the characteristics of a surrogate key as used in a data warehouse or data mart
- business keys change, often slowly, over time, and we need to remember old and new business key values for the same business object. Surrogate keys allow us to handle changing and unknown keys with ease - Using a surrogate key also allows us to keep track of different nonkey attribute values for the same production key over time - Surrogate keys are often simpler and shorter, especially when the production key is a composite key - Surrogate keys can be of the same length and format for all keys, no matter what business dimensions are involved in the database
ETL
- extracting, transforming, and loading - an important process in which extracts, transforms, and loads data from source systems into the data warehouse - extracting happens periodically. The data warehouse doesn't need to have current data
Trends that encourage the need for data warehousing
- no single system of record - multiple systems are not synchronized - organizations want to analyze the activities in a balanced way - customer relationship management - supplier relationship management
Two types of tables in star schema
- one fact table: contains factualy or quatitative data about a business - one or more dimensional tables: hold descriptive data about the subjects of the business
List four objectives of derived data
- provide ease of use for decision support applications - provide fast response for predefined user queries or requests for information - customize data for particular target user groups - support ad hoc queries and data mining and other analytical applications
Objectives of derived data
- provide ease of use for decision support applications - provide fast response for predefined user queries or requests for information - customize data for particular target user groups - support ad hoc queries and data mining and other analytical applications
List five claimed limitations of independent data marts
- you lose flexibility for the long term and the ability to react to changing business conditions - technical limitations for the size of the data warehouse
Two major factors that drive the need for data warehousing
1. A business requires an integrated, company wide view of the high quality information 2. The information systems department must separate informational from operation systems to improve performance dramatically in managing company data
Changes to be made to a warehouse data model
1. New descriptive attributes 2. New business activity attributes 3. New classes of descriptive attributes 4. Descriptive attributes become more refined 5. Descriptive data are related to one another 6. New source of data
Two debates as to the actual value of independent data marts
1. One debate deals with the nature of the phased approach to implementing a data warehousing environment. The essence of this debate is whether each data mart should or should not evolve in a bottom-up fashion from a subset of enterprise wide decision support data 2. The other debate deals with the suitable database architecture for analytical processing. This debate centers on the extent to which a data mart database should be normalized
Three layered architecture
1. Operational data are stored in the various operational systems of record throughout the organization 2. Reconciled data are the type of data stored in the enterprise data warehouse and an operational data store. Reconciled data are detailed, current data intended to be the single, authoritative source for all decision support applications 3. Derived data are the type of data stored in each of the data marts, Derived data are data that have been selected, formatted, and aggregated for end user decision support applications
Essential rules of dimensional modeling
1. Use atomic facts 2. Create single-process fact tables 3. Include a date dimension for every fact table 4. Enforce consistent grain 5. Disallow null keys in fact table 6. Honor hierarchies 7. Decode dimension tables 8. Use surrogate keys 9. Conform dimensions 10. Balance requirements with actual data
Factors the lead to the need to separate operational and informational systems
1. a data warehouse centralizes data that are scattered throughout disparate operational systems and makes them readily available for decision support applications 2. A properly designed data warehouse adds value to data by improving their quality and consistency 3. A separate data warehouse eliminates much of the contention for resources that results when informational applications are cofounded with operational processing
Three types of metadata
1. operational metadata: describe the data in the various operational systems that feed the enterprise data warehouse. Operational metadata typically exist in a number of different formats and unfortunately are often of poor quality 2. Enterprise data warehouse metadata: are derived from the enterprise data model. EDW metadata describe the reconciled data layer as well as the rules for extracting, transforming, and loading operational data into reconciled data 3. Data mart metadata: describe the derived data layer and the rules for transforming reconciled data to derived data
Enterprise data warehouse (EDW)
A centralized, integrated data warehouse that is the control point and single source of all data made available to end users for decision support applications
Logical data mart
A data mart created by a relational view of a data warehouse. They are not physically separate databases but rather different relational views of one physical, slightly denormalized relational data warehouse
Dependent data mart
A data mart filled exclusively from an enterprise data warehouse and its reconciled data
Independent data mart
A data mart filled with data extracted from the operational environment, without the benefit of a data warehouse
Subject-oriented
A data warehouse is organized around the key of subjects of the enterprise. Major subjects may include customers, patients, students, products, and time
Data mart
A data warehouse that is limited in scope, whose data are obtained by selecting and summarizing data from a data warehouse or from separate extract, transform, and load processes from source data systemd
Star schema
A simple database design in which dimensional data are separated form fact or even data. A dimensional model is another name for a star schema
Data warehouse
A subject-oriented, integrated, time-variant, nonupdateable collection of data used in support of management decision-making processes
Informational system
A system designed to support decision making based on historical point-in-time and prediction data for complex queries or data-mining applications
Operational system
A system that is used to run a business in real time, based on current data
Real-time data warehouse
An enterprise data warehouse that accepts near-real-time feeds or transactional data from the systems of record, analyzes warehouse data, and in near-real-time relays business rules to the data warehouse and systems of record so that immediate action can be taken in response to business events
Snowflake schema
An expanded version of a star schema in which dimension tables are normalized into several related tables
Operational data store (ODS)
An integrated, subject-oriented, continuously updateable, current-valued enterprise-wide, detailed database designed to serve operational users as they do decision support processing
Data warehoues vs data mart
Data Warehouse: Scope: - application independent - centralized, possibly enterprise-wide - planned Data: - historical, detailed, and summarized - lightly denormalized Subjects: - multiple subjects Other characteristics: - flexible - data oriented - long file - large - single complex structure Data Mart: Scope: - specific DSS application - Decentralized by user area - Organic, possibly not planned Data: - some history, detailed, and summarized - highly denormalized Subjects: - one central subject of concern to users Sources: - few internal and external sources Other characteristics: - restrictive - project oriented - short file - starts small, becomes large - multi, semi complex structures, together complex
Nonupdateable
Data in the data warehouse are loaded and refreshed from operational systems but cannot be updated by end users
Time-variant
Data in the data warehouse contain a time dimension so that they may be used to study trends and changes
Transient data
Data in which changes to existing records are written over previous records, thus destroying the previous data content
Periodic data
Data that are never physically altered or deleted once they have been added to the store
Derived data
Data that have been selected, formatted, and aggregated for end-user decision support application
Reconciled data
Detailed, current data intended to be the single, authoritative source for all decision support applications
Operational vs. Informational Systems
Operational: - Primary purpose: run the business on a current basis - Type of data: current representation of state of the business - Primary users: clerks, salespersons, administrators - Scope of usage: narrow, planned, and simple updates and queries - Design goal: performance - Volume: many constant updates and queries on one or a few table rows Informational: - Primary purpose: support managerial decision making - Type of data: historical point-in-time and predictions - Primary users: managers, business analysts, customers - Scope of usage: broad, ad hoc, complex queries and analysis - Design goal: ease of flexible access and use - Volume: periodic batch updates and queries requiring many or all rows
Explain the pros and cons of logical data marts
Pros: a. New data marts can be created quickly because no physical database or database technology needs to be acquired or created. Also, loading routines do not need to be written. b. Data marts are always up-to-date because data in a view are created when the view is referenced. Views can be materialized. Con: Logical data marts are only practical for moderate-sized data warehouses or when high performance data warehousing technology is used.
List four characteristics of a data warehouse
Subject-oriented, integrated, time-variant, and nonupdateable
Integrated
The data housed in the data warehouse are defined using consistent naming conventions, formats, encoding structures, and related characteristics gathered from several internal systems of record and also often from sources of external to the organization
In what ways are dimension tables often not normalized?
The fact table plays the role of a normalized n-ary associative entity that links the instances of the various dimensions, which are in second, but possibly not thid, normal form. Most experts view this as acceptable because dimensions are not updated and avoid costly joins
Grain
The level of detail in a fact table, determined by the intersection of all the components of the primary key, including all foreign keys and any other primary key elements
Briefly describe the major components of data warehouse architecture
The major components include: source data systems, data staging area, data and metadata storage area, and end-user presentation tools
Data warehousing
The process whereby organizations create and maintain data warehouses and extract meaning from and help inform decision making through the use of data in the data warehouses
Differences between transaction and reporting/analytic data warehousing
Transaction = A discrete unit of work that must be completely processed or not processed at all - May involve multiple updates - If any update fails, then all other updates must be cancelled - Transaction properties - Atomicity (works in its entirety or not at all) - Consistency (leaving the database in a consistent state) - Isolation (clearly separate from others) Durability (results kept permanently)
Is star schema a relational model? Explain why or why not
Yes because each dimension table has a one to many relationship with the central fact table