CIS463 (Chapter 9 and 10) and Quiz 2


After ETL, data should be

- Detailed (not summarized yet) - Historical (periodic) - Normalized (3rd normal form or higher) - Comprehensive (enterprise-wide perspective) - Timely (current enough to assist decision-making) - Quality controlled (accurate with full integrity)

record-level

- Selection (data partitioning) - Joining (data combining) - Aggregation (data summarization)

improve data capture processes

- automate data entry as much as possible - manual data entry should be selected from preset options - use trained operators when possible - immediate data validation for entered data

why should dimension keys be surrogates

- business keys may change over time - helps keep track of nonkey attribute values for a given production key - surrogate keys are simpler and shorter - surrogate keys can be the same length and format for all keys
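
For illustration, a minimal sketch of how a surrogate key is commonly declared (hypothetical DimCustomer table; in SQL Server the IDENTITY property generates the non-intelligent key):

```sql
-- Hypothetical customer dimension: the surrogate key is generated by the database,
-- while the source system's business (production) key is kept as an ordinary attribute.
CREATE TABLE DimCustomer (
    CustomerKey         INT IDENTITY(1,1) PRIMARY KEY, -- surrogate: short, uniform, non-business
    CustomerBusinessKey VARCHAR(20) NOT NULL,          -- production key; may change over time
    CustomerName        VARCHAR(100),
    City                VARCHAR(50)
);
```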

multiple fact tables

- can improve performance - often used to store facts for different combinations of dimensions - linked by conformed dimensions

TQM principles

- defect prevention - continuous improvement - use of enterprise data standards

characteristics of derived data

- detailed (mostly periodic) data - aggregate (for summary) - distributed (to departmental servers)

objectives of derived data

- ease of use for decision support applications - fast response to predefined user queries - customized data for particular target audiences - ad-hoc query support - data mining capabilities

causes of poor data quality

- external data sources (lack of control over data quality) - redundant data storage and inconsistent metadata (proliferation of databases with uncontrolled redundancy and metadata) - data entry (poor data capture controls) - lack of organizational commitment (not recognizing poor data quality as an organizational issue)

multivalue dimensions

- facts qualified by a set of values for the same business subject - normalization involves creating a table for an associative entity between dimensions

issues with company-wide view

- inconsistent key structures - synonyms - free-form vs. structured fields - inconsistent data values - missing data

data warehouse architectures

- independent data mart - dependent data mart and operational data store - logical data mart and real-time data warehouse - three-layer architecture

need for data warehousing

- integrated, company-wide view of high-quality information (from disparate databases) - separation of operational and informational systems and data (for improved performance)

purposes of data quality

- minimize IT project risk - make timely business decisions - ensure regulatory compliance - expand customer base

organizational trends motivating data warehouses

- no single system of records - multiple systems not synchronized - organizational need to analyze activities in a balanced way - customer relationship management - supplier relationship management

limitations of independent data marts

- separate ETL process for each data mart -> redundant data and processing - inconsistency between data marts - difficult to drill down for related facts between data marts - excessive scaling costs as more applications are built - high cost for obtaining consistency between marts

field-level

- single-field (from one field to one field) - multi-field (from many fields to one, or from one field to many)

hierarchies

- sometimes a dimension forms a natural, fixed-depth hierarchy

requirements for data governance

- sponsorship from both senior management and business units - a data steward manager to support, train, and coordinate data stewards - data stewards for different business units, subjects, and/or source systems - a governance committee to provide data management guidelines and standards

typical operational data is...

- transient - not historical - not normalized - restricted in scope - sometimes poor quality

characteristics of quality data

- uniqueness - accuracy - consistency - completeness - timeliness - currency - conformance - referential integrity

10 essential rules for data modeling

- use atomic facts - create single-process fact tables - include a date dimension for each fact table - enforce consistent grain - disallow null keys in fact tables - honor hierarchies - decode dimension tables - use surrogate keys - conform dimensions - balance requirements with actual data

advantages of logical data marts

- new data marts can be created quickly because no physical database or database technology needs to be created or acquired and no loading routines need to be written - the data marts are always up to date because data in a view are created when the view is referenced; views can be materialized if a user has a series of queries and analyses that need to work off the same instantiation of the data mart
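
As a sketch of the idea (table, view, and column names are hypothetical), a logical data mart can be little more than a relational view defined over the warehouse tables:

```sql
-- Hypothetical logical data mart for a sales department: a view over the
-- enterprise data warehouse, so no separate physical database is loaded.
CREATE VIEW SalesMart_MonthlyRevenue AS
SELECT d.CalendarYear,
       d.MonthNumber,
       p.ProductCategory,
       SUM(f.SalesAmount) AS Revenue
FROM FactSales f
JOIN DimDate d    ON f.DateKey = d.DateKey
JOIN DimProduct p ON f.ProductKey = p.ProductKey
GROUP BY d.CalendarYear, d.MonthNumber, p.ProductCategory;
```

The view can also be materialized when, as noted above, a user needs a stable instantiation of the mart for a series of queries.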

Which of the following scenarios favors a columnar physical database structure over the traditional row-based database structure? A query that seeks ALL columns from a table with a small number of rows. A query that seeks only one or two columns from a table with a huge number of rows. A query that seeks data from several sources on the web. All of the above favor columnar over row-based database structures.

A query that seeks only one or two columns from a table with a huge number of rows.

Which of the following factors drive the need for data warehousing? Businesses need an integrated view of company information. Informational data must be kept together with operational data. Data warehouses generally have better security. Businesses need backup copies of data in case the system crashes.

Businesses need an integrated view of company information.

where did we use multi field transformations in SSIS exercises

Derived column. Done for full name in SSIS#1 and for local currency amount in SSIS#3.
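
For comparison, a rough T-SQL equivalent of those many-to-one Derived Column expressions (the staging table and column names here are hypothetical, not the actual SSIS exercise schema):

```sql
-- Multi-field (many-to-one) transformations: several source fields feed one target field.
SELECT FirstName + ' ' + LastName           AS FullName,            -- like the SSIS#1 full-name column
       OrderAmountUSD * ExchangeRateToLocal AS LocalCurrencyAmount  -- like the SSIS#3 local-currency column
FROM StagingSales;
```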

Master Data Management (MDM)

Disciplines, technologies, and methods to ensure the currency, meaning, and quality of reference data within and across various subject areas

You have been hired to design a data warehouse for an important client. Which of the following is the least of your concerns? Deciding whether the client should use an enterprise-wide warehouse or individual data marts. Identifying the subjects of interest for the client's decision-makers. Ensuring ACID compliance of the data warehouse. Determining the data sources for the warehouse.

Ensuring ACID compliance of the data warehouse.

Imagine a debate between Ralph Kimball and Bill Inmon. Which is the most likely scenario of that debate? Kimball espouses an ETL process which generates corporate-wide data which is then distributed among individual data marts. Inmon thinks there is no direct need for ETL, and instead we should just rely on accessing operational databases for our BI needs. Inmon thinks the best way to provide an enterprise-wide view of cleansed data is via conformed dimensions between data marts. Kimball thinks it's best for individual departments' users to quickly create multi-dimensional databases for their own needs before coordinating with other departments.

Kimball thinks it's best for individual departments' users to quickly create multi-dimensional databases for their own needs before coordinating with other departments.

What is the most likely cardinality of a relationship between a dimension table and a fact table? One dimension to one fact. One dimension to many facts. Many dimensions to one fact. Many dimensions to many facts.

One dimension to many facts.

Which law, enacted in the early 2000s, has been most influential in motivating the need for data governance? Sarbanes-Oxley HIPAA the Data Governance Act Tyco-Worldcom

Sarbanes-Oxley

dependent data marts

a centralized enterprise-wide data warehouse, and the data marts are loaded from this enterprise DW.

When doing ETL to build a data warehouse, the transformation and cleaning of data is done in: an OLTP database an independent data mart a data staging area an MDM system

a data staging area

data mart

a data warehouse that is limited in scope

data steward

a person responsible for ensuring that organizational applications properly support the organization's data quality goals

Data Warehouse

a subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes

informational system

a system designed to support decision making based on historical point-in-time and prediction data for complex queries or data-mining applications

operational system

a system that is used to run a business in real time, based on current data; also called a system of record

The process of transforming data from a detailed to a summary level is called: extracting updating joining aggregating

aggregating

helper table

an associative entity that implements an M:N relationship between dimension and fact

snowflake schema

an expanded version of a star schema in which dimension tables are normalized into several related tables

The consolidation approach to data integration has the advantage of: providing real-time access to current data avoiding contention with OLTP databases pushing data to duplicate sites as updates occur allowing retention of historical data minimizing network, storage, and data maintenance costs

avoiding contention with OLTP databases and allowing retention of historical data

finer grains

better market basket analysis capability

Usually, fact tables are associative entities that implement many-to-many relationships between dimensions. However, sometimes a fact table itself must have a many-to-many relationship with a dimension. In this case, you need an additional table to form the associative entity. According to Hoffer's terminology, this is known as a(n) ________ table.

bridge
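
One possible shape of such a bridge table, using a hypothetical hospital-visit example where a single fact row must relate to many diagnoses:

```sql
-- Hypothetical diagnosis dimension.
CREATE TABLE DimDiagnosis (
    DiagnosisKey  INT PRIMARY KEY,
    DiagnosisName VARCHAR(100)
);

-- Bridge (helper) table: the associative entity between the fact and the dimension.
CREATE TABLE DiagnosisGroupBridge (
    DiagnosisGroupKey INT NOT NULL,  -- one group key is shared by all diagnoses tied to a fact row
    DiagnosisKey      INT NOT NULL REFERENCES DimDiagnosis(DiagnosisKey),
    PRIMARY KEY (DiagnosisGroupKey, DiagnosisKey)
);
-- The fact table then carries DiagnosisGroupKey instead of a single DiagnosisKey.
```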

refresh mode

bulk rewriting of target data at periodic intervals

time-variant

can study trends and changes

static extract

capturing a snapshot of the source data at a point in time

incremental extract

capturing changes that have occurred since the last static extract
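
A rough sketch of an incremental extract, assuming the (hypothetical) source table carries a LastModified timestamp and the ETL process remembers when it last ran:

```sql
-- Pull only the rows that changed since the previous extract.
DECLARE @LastExtract DATETIME = '2024-01-31';  -- normally read from ETL control/metadata tables

SELECT *
FROM SourceOrders
WHERE LastModified > @LastExtract;
```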

persistent

central "golden record" maintained; all applications have access. Requires applications to push data, prone to data duplication

transient data

changes to existing records are written over previous records, thus destroying the previous data content

what are star schemas composed of?

composed of fact and dimension tables
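
A minimal star schema sketch (hypothetical retail example): dimension tables hold the descriptive data, and the fact table holds foreign keys to each dimension plus the numeric measures.

```sql
-- Dimension tables: descriptive data about products, stores, and dates.
CREATE TABLE DimProduct (ProductKey INT PRIMARY KEY, ProductName VARCHAR(100));
CREATE TABLE DimStore   (StoreKey   INT PRIMARY KEY, StoreName   VARCHAR(100));
CREATE TABLE DimDate    (DateKey    INT PRIMARY KEY, CalendarDate DATE);

-- Fact table: one foreign key per dimension plus the measures (facts).
CREATE TABLE FactSales (
    ProductKey   INT NOT NULL REFERENCES DimProduct(ProductKey),
    StoreKey     INT NOT NULL REFERENCES DimStore(StoreKey),
    DateKey      INT NOT NULL REFERENCES DimDate(DateKey),
    QuantitySold INT,
    SalesAmount  DECIMAL(10,2),
    PRIMARY KEY (ProductKey, StoreKey, DateKey)
);
```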

It is important for data to be stored, exchanged and presented in a form that is specified by their metadata, including adherence to specified domain integrity constraints. This characteristic of quality data is called ___________.

conformance

In an SSIS package, which type of object makes direct use of a middleware provider? data source Sql execution task transformation connection manager

connection manager

integrated

consistent naming conventions, formats, encoding structures, from multiple data sources

consolidation (ETL)

consolidating all data into a centralized database

transform

convert data from format of operational system to format of data warehouse

integration hub

data changes broadcast through central service to subscribing databases

When we consider data in the data warehouse to be time-variant, we mean: that the time of storage varies data in the warehouse contain a time dimension so that they may be used to study trends and changes that there is a time delay between when data are posted and when we report on the data units of time differ depending on the data source

data in the warehouse contain a time dimension so that they may be used to study trends and changes

A(n) _________ data mart is a subset that is created directly from the data warehouse. It has the advantages of using a consistent data model and providing quality data.

dependent

size of fact table

depends on the number of dimensions and the grain of the fact table

most common data model

dimensional model (usually implemented as a star schema)

data propagation (Enterprise Application Integration and Enterprise Data Replication)

duplicate data across databases, with near real time delay

subject-oriented

e.g. customers, patients, students, products

A database action that results from a transaction is called a(n): transition event log entry journal happening

event

A(n) ____________ fact table will include ONLY foreign keys to dimensions, and no other fields.

factless

Which of the following will best facilitate drill-down when designing a data warehouse or data mart? fine-level granularity of fact tables reduction of dimension tables use of factless-fact tables using push-oriented integration approaches

fine-level granularity of fact tables

transactional grain

finest level

The two main factors that impact the size of a fact table in a data warehouse or data mart are the fact table's ________ and the database's _____

grain and duration

data governance

high-level organizational groups and processes overseeing data stewardship across the organization

three main architectures of MDM

identity registry, integration hub, persistent

The two main figures who made significant contributions to the idea of data warehousing are Ralph Kimball and Bill Inmon. Of the two, which would be more likely to advocate a data warehouse generated from all the schemas of Microsoft's AdventureWorks database? Inmon Kimball

Inmon

application integration

involves coordinating the flow of event information between business applications, often via a service-oriented architecture

User interaction integration

involves creating fewer user interfaces that feed different data systems. This is often done using an enterprise portal to interact with different data reporting and business intelligence systems.

Business process integration

involves tighter coordination of activities across business processes (e.g., selling and billing) so that applications can be shared and more application integration can occur. ERP systems facilitate this; these systems also require cross-enterprise data to be integrated.

conformed dimension

is a dimension that has the same meaning to every fact with which it relates

logical data mart

is not a physically separated database. Rather, it is created by a relational view of a data warehouse

identity registry

master data remains in source systems; registry provides applications with location

real-time data warehousing

means that the source data systems, decision support services, and the data warehouse exchange data and business rules at a near-real-time pace

data marts

mini-warehouses, limited in scope

finer grain

more dimension tables, more rows in fact table

aggregated grain

more summarized

periodic data

never physically altered or deleted once they have been added to the store

factless fact tables

no nonkey data, but foreign keys for associated dimensions
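
A sketch of a factless fact table for event tracking (hypothetical class-attendance example): every column is a foreign key to a dimension, and there are no measures.

```sql
-- Each row records only that an event occurred: a student attended a course on a date.
CREATE TABLE FactAttendance (
    StudentKey INT NOT NULL,  -- FK to a student dimension
    CourseKey  INT NOT NULL,  -- FK to a course dimension
    DateKey    INT NOT NULL,  -- FK to a date dimension
    PRIMARY KEY (StudentKey, CourseKey, DateKey)
);
```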

Which property of data warehouses make normalization less important than in OLTP databases? nonvolatility granularity necessity for data cleansing desire to eliminate data duplication

nonvolatility

capture/extract

obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse

update mode

only changes in source data are written to data warehouse

One advantage of maintaining ________ data in a data warehouse is that it provides a historical perspective of events and changes over time. An obvious disadvantage is that it takes up much more space, because the results of transaction events (and not just current statuses) are kept. An example of this type of data is found both in the FactCurrencyRate table of the Adventure Works data warehouse, and the EmployeeDepartmentHistory table of the AW OLTP. Database logs, which support backup and recovery procedures, can be used to extract this type of data from an OLTP database into a data warehouse.

periodic

load/index

place transformed data into the warehouse and create indexes

data warehousing

process whereby organizations create and maintain data warehouses and extract meaning from and help inform decision making through the use of data in the data warehouses

in a fact table number of rows =

product of number of possible values for each dimension associated with the fact table
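
An illustrative calculation with made-up counts, assuming every combination of dimension values actually occurs (so the product is really an upper bound):

```latex
\text{rows} \le |\text{Product}| \times |\text{Store}| \times |\text{Day}|
            = 1000 \times 200 \times 365 = 73{,}000{,}000
```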

data federation (Enterprise Information Integration)

provides a virtual view of data without actually creating one centralized database

fact table

provides statistics for sales broken down by product, period, and store dimensions
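
Using the hypothetical FactSales star schema sketched earlier, "broken down by" translates into a join to each dimension plus a GROUP BY:

```sql
-- Sales statistics broken down by product, period (here: calendar date), and store.
SELECT p.ProductName,
       d.CalendarDate,
       s.StoreName,
       SUM(f.SalesAmount)  AS TotalSales,
       SUM(f.QuantitySold) AS TotalUnits
FROM FactSales f
JOIN DimProduct p ON f.ProductKey = p.ProductKey
JOIN DimDate d    ON f.DateKey = d.DateKey
JOIN DimStore s   ON f.StoreKey = s.StoreKey
GROUP BY p.ProductName, d.CalendarDate, s.StoreName;
```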

non-updatable

read-only, periodically refreshed

star schema

simple database design in which dimensional data are separated from fact or event data

On page 403 of the Hoffer database textbook, the authors describe two approaches to representing hierarchies of dimension data. If you choose the second approach, this implies that the resulting arrangement of dimension and fact tables will form a(n) ________ schema.

snowflake

operational data

stored in various operational systems of record throughout the organization (and sometimes in external systems)

An SSIS control flow is composed primarily of a sequence of (sometimes repeating) ______.

tasks

The term "timeliness", when referring to the quality of data, means: the database includes a time dimension time-variance maintaining a historical timeline for analytical purposes the time between when data are expected and when they are available for use

the time between when data are expected and when they are available for use

External data sources present problems for data quality because: data are not always available there is a lack of control over data quality there are poor data capture controls data are unformatted

there is a lack of control over data quality

fact tables contain

time period data

what are factless fact tables used for?

tracking events and inventory coverage

t/f dimension table keys should be surrogate (non-intelligent and non-business related)

true

t/f in Web-based commerce, finest granularity is a click

true

t/f operational data store provides option for obtaining current data

true

Kimball's approaches

type 1: just replace old data with new (lose historical data) type 2: create a new dimension table row each time the dimension object changes, with all dimension characteristics at the time of change (most common approach) type 3: for each changing attribute, create a current value field and several old-value fields (multivalued)
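
A rough Type 2 sketch, assuming a hypothetical DimCustomer that carries RowStartDate, RowEndDate, and IsCurrent columns alongside its business key: instead of overwriting, the current row is closed out and a new row is inserted.

```sql
-- Type 2: record the change as a new dimension row, preserving history.
-- Close out the current row for business key 'C123' ...
UPDATE DimCustomer
SET RowEndDate = '2024-06-30', IsCurrent = 0
WHERE CustomerBusinessKey = 'C123' AND IsCurrent = 1;

-- ... then insert a new row carrying the changed attribute value(s).
INSERT INTO DimCustomer (CustomerBusinessKey, CustomerName, City, RowStartDate, RowEndDate, IsCurrent)
VALUES ('C123', 'Jane Doe', 'New City', '2024-07-01', NULL, 1);
```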

derived

type of data stored in each of the data marts.

reconciled data

type of data stored in the enterprise data warehouse and an operational data store

table lookup

uses a separate table keyed by source record code
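
A sketch of the idea in T-SQL (hypothetical staging and lookup tables): the source record's code is the key into a separate lookup table that supplies the target value.

```sql
-- Translate a source region code into its standardized warehouse value.
SELECT s.OrderID,
       l.StandardRegionName   -- value obtained via the lookup table
FROM StagingOrders s
JOIN RegionCodeLookup l ON l.SourceRegionCode = s.RegionCode;
```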

scrub/cleanse

uses pattern recognition and AI techniques to upgrade data quality

What is the most appropriate term or phrase to use when describing the federated data integration approach? push-oriented virtual requires ETL normalized

virtual

