CIS463 (Chapters 9 and 10) and Quiz 2
After ETL, data should be
- Detailed: not summarized yet
- Historical: periodic
- Normalized: 3rd normal form or higher
- Comprehensive: enterprise-wide perspective
- Timely: data should be current enough to assist decision-making
- Quality controlled: accurate with full integrity
record-level transformations
- Selection: data partitioning
- Joining: data combining
- Aggregation: data summarization
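These three record-level transformations map naturally onto basic dataframe operations. Below is a minimal pandas sketch (not from the course materials); the table and column names are hypothetical.

```python
# A minimal pandas sketch of the three record-level transformations.
# Table and column names (orders, customers, region, amount) are
# hypothetical examples.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 10, 20, 30],
    "region": ["East", "West", "East", "East"],
    "amount": [100.0, 250.0, 75.0, 40.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 20, 30],
    "name": ["Acme", "Globex", "Initech"],
})

# Selection: partition the data, keeping only the rows of interest.
east_orders = orders[orders["region"] == "East"]

# Joining: combine data from multiple source tables.
joined = east_orders.merge(customers, on="customer_id")

# Aggregation: summarize detailed records.
summary = joined.groupby("name", as_index=False)["amount"].sum()
print(summary)
```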
improve data capture processes
- automate data entry as much as possible
- where entry is manual, have values selected from preset options
- use trained operators when possible
- validate entered data immediately
why should dimension keys be surrogates
- business keys may change over time
- helps keep track of nonkey attribute values for a given production key
- surrogate keys are simpler and shorter
- surrogate keys can be the same length and format for all keys
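One common way to mint surrogate keys during a dimension load is a simple lookup-or-assign sequence. The sketch below is a hypothetical illustration (the counter stands in for a database sequence, and the business-key format is made up), not a prescribed implementation.

```python
# A minimal sketch of assigning surrogate keys while loading a dimension.
# itertools.count stands in for a database sequence; the business keys
# shown are hypothetical.
from itertools import count

next_key = count(start=1)
surrogate_for = {}          # business key -> surrogate key

def get_surrogate(business_key):
    """Return the existing surrogate key, or mint a new one.

    Because the surrogate is never derived from the business key,
    it stays stable even if the production key changes format.
    """
    if business_key not in surrogate_for:
        surrogate_for[business_key] = next(next_key)
    return surrogate_for[business_key]

print(get_surrogate("CUST-0042"))   # 1
print(get_surrogate("CUST-0099"))   # 2
print(get_surrogate("CUST-0042"))   # 1 again: lookup, not re-mint
```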
multiple fact tables
- can improve performance
- often used to store facts for different combinations of dimensions
- linked through conformed dimensions
TQM principles
- defect prevention
- continuous improvement
- use of enterprise data standards
characteristics of derived data
- detailed (mostly periodic) data
- aggregate (for summary)
- distributed (to departmental servers)
objectives of derived data
- ease of use for decision support applications
- fast response to predefined user queries
- customized data for particular target audiences
- ad-hoc query support
- data mining capabilities
causes of poor data quality
- external data sources: lack of control over data quality
- redundant data storage and inconsistent metadata: proliferation of databases with uncontrolled redundancy and metadata
- data entry: poor data capture controls
- lack of organizational commitment: not recognizing poor data quality as an organizational issue
multivalue dimensions
- facts qualified by a set of values for the same business subject
- normalization involves creating a table for an associative entity between dimensions
issues with company-wide view
- inconsistent key structures
- synonyms
- free-form vs. structured fields
- inconsistent data values
- missing data
data warehouse architectures
- independent data mart
- dependent data mart and operational data store
- logical data mart and real-time data warehouse
- three-layer architecture
need for data warehousing
- integrated, company-wide view of high-quality information (from disparate databases)
- separation of operational and informational systems and data (for improved performance)
purposes of data quality
- minimize IT project risk
- make timely business decisions
- ensure regulatory compliance
- expand customer base
organizational trends motivating data warehouses
- no single system of record
- multiple systems not synchronized
- organizational need to analyze activities in a balanced way
- customer relationship management
- supplier relationship management
limitations of independent data marts
- separate ETL process for each data mart -> redundant data and processing
- inconsistency between data marts
- difficult to drill down for related facts between data marts
- excessive scaling costs as more applications are built
- high cost of obtaining consistency between marts
field-level transformations
- single-field: from one field to one field
- multi-field: from many fields to one, or one field to many
hierarchies
- sometimes a dimension forms a natural, fixed-depth hierarchy
requirements for data governance
- sponsorship from both senior management and business units
- a data steward manager to support, train, and coordinate data stewards
- data stewards for different business units, subjects, and/or source systems
- a governance committee to provide data management guidelines and standards
typical operational data is...
- transient
- not historical
- not normalized
- restricted in scope
- sometimes poor quality
characteristics of quality data
- uniqueness
- accuracy
- consistency
- completeness
- timeliness
- currency
- conformance
- referential integrity
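Several of these characteristics (uniqueness, completeness, referential integrity) can be checked mechanically. A minimal pandas sketch, using hypothetical tables seeded with one violation of each kind:

```python
# A minimal sketch of automated checks for three quality characteristics.
# Table and column names are hypothetical; each table is seeded with
# exactly one deliberate violation.
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 2],      # duplicate key
                          "name": ["Acme", None, "Cyberdyne"]})  # missing name
orders = pd.DataFrame({"order_id": [10, 11],
                       "customer_id": [1, 99]})           # 99 has no parent

# Uniqueness: no duplicate primary-key values.
dupes = customers[customers["customer_id"].duplicated()]

# Completeness: no missing values in required fields.
missing = customers[customers["name"].isna()]

# Referential integrity: every foreign key resolves to a parent row.
orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]

print(f"{len(dupes)} duplicate keys, {len(missing)} incomplete rows, "
      f"{len(orphans)} referential-integrity violations")
```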
10 essential rules for dimensional modeling
- use atomic facts
- create single-process fact tables
- include a date dimension for each fact table
- enforce consistent grain
- disallow null keys in fact tables
- honor hierarchies
- decode dimension tables
- use surrogate keys
- conform dimensions
- balance requirements with actual data
advantages of logical data marts
- new data marts can be created quickly, because no physical database or database technology needs to be created or acquired and no loading routines need to be written
- the data marts are always up to date, because data in a view are created when the view is referenced; views can be materialized if a user has a series of queries and analyses that need to work off the same instantiation of the data mart
Which of the following scenarios favors a columnar physical database structure over the traditional row-based database structure?
- A query that seeks ALL columns from a table with a small number of rows.
- A query that seeks only one or two columns from a table with a huge number of rows.
- A query that seeks data from several sources on the web.
- All of the above favor columnar over row-based database structures.
A query that seeks only one or two columns from a table with a huge number of rows.
Which of the following factors drive the need for data warehousing?
- Businesses need an integrated view of company information.
- Informational data must be kept together with operational data.
- Data warehouses generally have better security.
- Businesses need backup copies of data in case the system crashes.
Businesses need an integrated view of company information.
where did we use multi-field transformations in the SSIS exercises?
Derived column. Done for full name in SSIS#1 and for local currency amount in SSIS#3.
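For comparison, here is how those same multi-field (many-to-one) derived columns might look in pandas. The column names and exchange-rate figure are hypothetical stand-ins, not the actual SSIS exercise data.

```python
# A pandas sketch of multi-field derived-column transformations.
# Column names and the exchange rate are hypothetical.
import pandas as pd

employees = pd.DataFrame({"first_name": ["Ada"], "last_name": ["Lovelace"]})
sales = pd.DataFrame({"amount_usd": [100.0], "exchange_rate": [0.92]})

# SSIS#1 analogue: full name derived from two source fields.
employees["full_name"] = employees["first_name"] + " " + employees["last_name"]

# SSIS#3 analogue: local currency amount derived from amount and rate.
sales["local_amount"] = sales["amount_usd"] * sales["exchange_rate"]
```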
Master Data Management (MDM)
Disciplines, technologies, and methods to ensure the currency, meaning, and quality of reference data within and across various subject areas
You have been hired to design a data warehouse for an important client. Which of the following is the least of your concerns?
- Deciding whether the client should use an enterprise-wide warehouse or individual data marts.
- Identifying the subjects of interest for the client's decision-makers.
- Ensuring ACID compliance of the data warehouse.
- Determining the data sources for the warehouse.
Ensuring ACID compliance of the data warehouse.
Imagine a debate between Ralph Kimball and Bill Inmon. Which is the most likely scenario of that debate?
- Kimball espouses an ETL process which generates corporate-wide data which is then distributed among individual data marts.
- Inmon thinks there is no direct need for ETL, and instead we should just rely on accessing operational databases for our BI needs.
- Inmon thinks the best way to provide an enterprise-wide view of cleansed data is via conformed dimensions between data marts.
- Kimball thinks it's best for individual departments' users to quickly create multi-dimensional databases for their own needs before coordinating with other departments.
Kimball thinks it's best for individual departments' users to quickly create multi-dimensional databases for their own needs before coordinating with other departments.
What is the most likely cardinality of a relationship between a dimension table and a fact table?
- One dimension to one fact.
- One dimension to many facts.
- Many dimensions to one fact.
- Many dimensions to many facts.
One dimension to many facts.
Which law, enacted in the early 2000s, has been most influential in motivating the need for data governance?
- Sarbanes-Oxley
- HIPAA
- the Data Governance Act
- Tyco-Worldcom
Sarbanes-Oxley
dependent data marts
data marts that are loaded from a centralized, enterprise-wide data warehouse; the EDW is built first, and each mart is derived from it
When doing ETL to build a data warehouse, the transformation and cleaning of data is done in:
- an OLTP database
- an independent data mart
- a data staging area
- an MDM system
a data staging area
data mart
a data warehouse that is limited in scope
data steward
a person responsible for ensuring that organizational applications properly support the organization's data quality goals
Data Warehouse
a subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes
informational system
a system designed to support decision making based on historical point-in-time and prediction data for complex queries or data-mining applications
operational system
a system that is used to run a business in real time, based on current data; also called a system of record
The process of transforming data from a detailed to a summary level is called:
- extracting
- updating
- joining
- aggregating
aggregating
helper table
an associative entity that implements an M:N relationship between a dimension and a fact table
snowflake schema
an expanded version of a star schema in which dimension tables are normalized into several related tables
The consolidation approach to data integration has the advantage of:
- providing real-time access to current data
- avoiding contention with OLTP databases
- pushing data to duplicate sites as updates occur
- allowing retention of historical data
- minimizing network, storage, and data maintenance costs
avoiding contention with OLTP databases and allowing retention of historical data
finer grains
better market basket analysis capability
Usually, fact tables are associative entities that implement many-to-many relationships between dimensions. However, sometimes a fact table itself must have a many-to-many relationship with a dimension. In this case, you need an additional table to form the associative entity. According to Hoffer's terminology, this is known as a(n) ________ table.
bridge
refresh mode
bulk rewriting of target data at periodic intervals
time-variant
can study trends and changes
static extract
capturing a snapshot of the source data at a point in time
incremental extract
capturing changes that have occurred since the last static extract
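A minimal sketch contrasting the two extract modes, assuming a hypothetical source table with a last-modified timestamp column:

```python
# Static vs. incremental extract. The table, the modified_at column,
# and the cutoff value are all hypothetical.
import pandas as pd

source = pd.DataFrame({
    "id": [1, 2, 3],
    "modified_at": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
})

# Static extract: a snapshot of everything at a point in time.
static_extract = source.copy()

# Incremental extract: only rows changed since the last extract ran.
last_extract_time = pd.Timestamp("2024-01-15")
incremental_extract = source[source["modified_at"] > last_extract_time]
print(incremental_extract)   # rows 2 and 3 only
```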
persistent
central "golden rule" maintained; all applications have access. Requires applications to push data, prone to data duplication
transient data
changes to existing records are written over previous records, thus destroying the previous data content
what are star schemas composed of?
composed of fact and dimension tables
It is important for data to be stored, exchanged and presented in a form that is specified by their metadata, including adherence to specified domain integrity constraints. This characteristic of quality data is called ___________.
conformance
In an SSIS package, which type of object makes direct use of a middleware provider?
- data source
- SQL execution task
- transformation
- connection manager
connection manager
integrated
consistent naming conventions, formats, and encoding structures across multiple data sources
consolidation (ETL)
consolidating all data into a centralized database
transform
convert data from the format of the operational system to the format of the data warehouse
integration hub
data changes broadcast through central service to subscribing databases
When we consider data in the data warehouse to be time-variant, we mean:
- that the time of storage varies
- data in the warehouse contain a time dimension so that they may be used to study trends and changes
- that there is a time delay between when data are posted and when we report on the data
- units of time differ depending on the data source
data in the warehouse contain a time dimension so that they may be used to study trends and changes
A(n) _________ data mart is a subset that is created directly from the data warehouse. It has the advantages of using a consistent data model and providing quality data.
dependent
size of fact table
depends on the number of dimensions and the grain of the fact table
most common data model
dimensional model (usually implemented as a star schema)
data propagation (Enterprise Application Integration and Enterprise Data Replication)
duplicate data across databases, with near-real-time delay
subject-oriented
e.g. customers, patients, students, products
A database action that results from a transaction is called a(n):
- transition
- event
- log entry
- journal
- happening
event
A(n) ____________ fact table will include ONLY foreign keys to dimensions, and no other fields.
factless
Which of the following will best facilitate drill-down when designing a data warehouse or data mart?
- fine-level granularity of fact tables
- reduction of dimension tables
- use of factless-fact tables
- using push-oriented integration approaches
fine-level granularity of fact tables
transactional grain
finest level
The two main factors that impact the size of a fact table in a data warehouse or data mart are the fact table's ________ and the database's _____
grain and duration
data governance
high-level organizational groups and processes overseeing data stewardship across the organization
three main architectures of MDM
identity registry, integration hub, persistent
The two main figures who made significant contributions to the idea of data warehousing are Ralph Kimball and Bill Inmon. Of the two, which would be more likely to advocate a data warehouse generated from all the schemas of Microsoft's AdventureWorks database?
- Inmon
- Kimball
Inmon
application integration
involves coordinating the flow of event information between business applications, often via a service-oriented architecture
User interaction integration
involves creating fewer user interfaces that feed different data systems. This is often done using an enterprise portal to interact with different data reporting and business intelligence systems.
Business process integration
involves tighter coordination of activities across business processes (e.g., selling and billing) so that applications can be shared and more application integration can occur. ERP systems facilitate this; these systems also require cross-enterprise data to be integrated.
conformed dimension
a dimension that has the same meaning to every fact table with which it relates
logical data mart
not a physically separate database; rather, it is created as a relational view of a data warehouse
identity registry
master data remains in source systems; registry provides applications with location
real-time data warehousing
means that the source data systems, decision support services, and the data warehouse exchange data and business rules at a near-real-time pace
data marts
mini-warehouses, limited in scope
finer grain
more dimension tables, more rows in fact table
aggregated grain
more summarized
periodic data
never physically altered or deleted once they have been added to the store
factless fact tables
no nonkey data, but foreign keys for associated dimensions
Which property of data warehouses makes normalization less important than in OLTP databases?
- nonvolatility
- granularity
- necessity for data cleansing
- desire to eliminate data duplication
nonvolatility
capture/extract
obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
update mode
only changes in source data are written to the data warehouse
One advantage of maintaining ________ data in a data warehouse is that it provides a historical perspective of events and changes over time. An obvious disadvantage is that it takes up much more space, because the results of transaction events (and not just current statuses) are kept. An example of this type of data is found both in the FactCurrencyRate table of the Adventure Works data warehouse, and the EmployeeDepartmentHistory table of the AW OLTP. Database logs, which support backup and recovery procedures, can be used to extract this type of data from an OLTP database into a data warehouse.
periodic
load/index
place transformed data into the warehouse and create indexes
data warehousing
process whereby organizations create and maintain data warehouses and extract meaning from and help inform decision making through the use of data in the data warehouses
in a fact table number of rows =
product of the number of possible values for each dimension associated with the fact table (an upper-bound estimate, since it assumes every combination of dimension values occurs)
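As a worked example, using the product, period, and store dimensions of the sales fact table described below, with hypothetical dimension sizes of 1,000 products, 365 daily periods, and 50 stores:

```latex
% Hypothetical dimension sizes: 1,000 products, 365 periods, 50 stores
\[
\text{max rows} \;=\; \prod_i |D_i| \;=\; 1000 \times 365 \times 50 \;=\; 18{,}250{,}000
\]
```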
data federation (Enterprise Information Integration)
provides a virtual view of data without actually creating one centralized database
fact table
in a star schema, the table holding the factual or event data; for example, it might provide statistics for sales broken down by product, period, and store dimensions
non-updatable
read-only, periodically refreshed
star schema
simple database design in which dimensional data are separated from fact or event data
On page 403 of the Hoffer database textbook, the authors describe two approaches to representing hierarchies of dimension data. If you choose the second approach, this implies that the resulting arrangement of dimension and fact tables will form a(n) ________ schema.
snowflake
operational data
stored in various operational systems of record throughout the organization (and sometimes in external systems)
An SSIS control flow is composed primarily of a sequence of (sometimes repeating) ______.
tasks
The term "timeliness", when referring to the quality of data, means: the database includes a time dimension time-variance maintaining a historical timeline for analytical purposes the time between when data are expected and when they are available for use
the time between when data are expected and when they are available for use
External data sources present problems for data quality because:
- data are not always available
- there is a lack of control over data quality
- there are poor data capture controls
- data are unformatted
there is a lack of control over data quality
fact tables contain
time period data
what are factless fact tables used for?
tracking events and inventory coverage
t/f dimension table keys should be surrogate (non-intelligent and non-business related)
true
t/f in Web-based commerce, finest granularity is a click
true
t/f operational data store provides option for obtaining current data
true
Kimball's approaches
- type 1: just replace the old data with the new (historical data are lost)
- type 2: create a new dimension table row each time the dimension object changes, with all dimension characteristics at the time of the change; this is the most common approach
- type 3: for each changing attribute, create a current-value field and several old-value fields (multivalued)
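A minimal Python sketch of the type 2 approach, assuming a hypothetical customer dimension with effective/end dates; the field names are made up for illustration:

```python
# Type 2 slowly changing dimension: when a tracked attribute changes,
# close out the current row and insert a new one. Field names are
# hypothetical.
from datetime import date

dim_customer = [
    {"surrogate_key": 1, "customer_id": "C42", "city": "Austin",
     "effective_date": date(2023, 1, 1), "end_date": None},
]

def apply_type2_change(rows, customer_id, new_city, change_date):
    """Expire the current row and append a new row for the change."""
    for row in rows:
        if row["customer_id"] == customer_id and row["end_date"] is None:
            row["end_date"] = change_date          # close out old version
    rows.append({
        "surrogate_key": len(rows) + 1,            # next surrogate key
        "customer_id": customer_id,
        "city": new_city,
        "effective_date": change_date,
        "end_date": None,                          # current version
    })

apply_type2_change(dim_customer, "C42", "Dallas", date(2024, 6, 1))
# dim_customer now holds two rows for C42: the expired Austin version
# and the current Dallas version, so history is preserved.
```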
derived
type of data stored in each of the data marts.
reconciled data
type of data stored in the enterprise data warehouse and an operational data store
table lookup
uses a separate table keyed by source record code
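A minimal sketch of a table lookup, assuming a hypothetical state-code mapping table:

```python
# Table lookup transformation: a separate mapping table, keyed by the
# source record's code, supplies the standardized value. Codes and
# record contents are hypothetical.
state_lookup = {"TX": "Texas", "CA": "California", "NY": "New York"}

source_records = [{"cust": 1, "state_code": "TX"},
                  {"cust": 2, "state_code": "CA"}]

for rec in source_records:
    # In a real ETL flow, codes missing from the lookup table would be
    # flagged for cleansing rather than defaulted.
    rec["state_name"] = state_lookup.get(rec["state_code"], "Unknown")
```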
scrub/cleanse
uses pattern recognition and AI techniques to upgrade data quality
What is the most appropriate term or phrase to use when describing the federated data integration approach?
- push-oriented
- virtual
- requires ETL
- normalized
virtual