MIS430 FINAL EXAM Study Guide

Réussis tes devoirs et examens dès maintenant avec Quizwiz!

Operational Data Technical Difference

- Small amounts used in a process - High frequency of access - Can be updated - Non-redundant (non-repeating)

Analytical Data Functional Difference

- Used by narrower set of users for decision-making - Subject oriented

Dimension Attribute Characteristics

- not directly related to each other (e.g. customer info + ZIP demographics) - not normalized (e.g. ZIP) - Hierarchical (e.g. date->month->quarter->year)

Semi-structured data

- some structure (but not completely structured) - e.g. collection of e-mails

Kimball Bus Matrix

- Builds upon the concept of conformed dimensions - A structure of common dimensions - Identifies and enforces the relationship between business process metrics (facts) and descriptive attributes (dimensions)

Semi-additive Fact Table Measure

- Can be summarized across some dimensions but not others eg. Account balances can't be added across Date/TIme but can be added across Branch, Customer, and Account over a single day

Data granularity

- Defines level of detail - Multiple levels of detail are usually present

DWH/Data Mart Front-end (BI) Applications

- Provide access for indirect use

Independent Data Mart

- Stand-alone, created in same fashion as DWH - Has its own source systems and ETL infrastructure

Characteristics of DWHs

- Structured repository - Integrated - Subject-oriented - Enterprise-wide - Historical - Time-variant - Non-volatile - Retrieval of analytically useful information (read-only) - Detailed and/or summarized data

Dimension Table Characteristics

- Surrogate Key as the PK - Denormalized: typically 0NF or 1NF - Large # of attributes, capture specific aspects of decisions made - Few records compared to the fact table

Normalized DWH (Inmon)

- Top down approach - Envisions DWH as an integrated analytical DB - modeled using ER modeling and relational modeling - results in normalized relational DB schema - Serves as data source for dimensionally modeled data marts

Operational Data Functional Difference

- Used by all types of employees for tactical purposes - Application oriented

OLAP/BI Tools Purposes

1. Ad-hoc direct analysis of dimensionally modeled data - occurs when a user performs actions, such as pivoting, slicing, and drilling 2. Creation of front-end (BI) apps - can be created simply as a collection of OLAP/BI tool queries created by OLAP/BI tool expert users

Some typical external data sources

1. Data records such as supermarket, medicine store, clothing store, etc. 2. Random articles, news and reports available on the internet

Operational Information Use

1. Original operational purpose 2. As a source system for the DWH

Data Mart

A data store based on the same principles as a data warehouse, but with a more limited scope

ER Modeling

A predominant technique for visualizing DB requirements, used extensively for conceptual modeling of operational DBs

Snowflake Schema

A star schema with normalized dimensions - Usually not used in dimensional modeling

The Data Warehouse Definition

A structured repository of integrated, subject-oriented, enterprise-wide, historical, time-variant and nonvolatile data

Slice and Dice

Adds, replaces, or eliminates specified dimension attributes (or particular values of the dimension attributes) from the already displayed result

Analytical Information Use

Business intelligence processing

Dimension Attributes

Columns contain descriptive information with data type of textual/categorical; can also be numeric but numeric often converted to categorical; provides a basis for analysis of the subject - e.g. product dimension, customer dimension

Creating ETL Infrastructure

Creating necessary procedures and code for: - Automatic extraction of relevant data from sources - Transformation of extracted data so its quality-assured and structure conforms to modeled and implemented DWH - Seamless load of transformed data into DWH

Beta Release (DWH Deployment)

Deployment of a system to a selected group of users to test the usability of the system

Developing Front-end (BI) Applications

Designing and creating applications for indirect use by the end-users - Included in most DWH systems - Contain interfaces (forms + reports) accessible via a navigation mechanism (menu)

Slowly Changing Dimension

Dimension that contains attributes whose values can change

___________________ facilitates the retrieval of data from operational DBs into the DWHs

ETL Infrastructure

Analytical Information

Encompasses all organizational information (transactional), and its primary purpose is to support the performing of managerial analysis tasks

Alpha Release (DWH Deployment)

Internal deployment of a system to the members of the development team for initial testing of its functionalities

Drill Up

Makes the granularity of the data in the query result coarser

Drill Down

Makes the granularity of the data in the query result finer

Constellation Schema

Multiple facts in a dimensional model which can enable: - Quicker development of analytical databases for multiple subjects of analysis - Straightforward cross-fact analysis ex. avg daily UnitsSold vs. avg DefectiveUnitsFound per store/region/quarter/product/category

Why are snowflake schemas not used in dimensional modeling?

Normalization usually not necessary for analytical databases and NOT recommended in dimensional modeling

Source systems

Operational DBs and other operational repositories that provide analytically useful information for the DWH subjects of analysis - can include external data sources

DWH Administration and Maintenance

Performing activities that support the DWH end user, including dealing w/ technical issues: - Providing information security - Ensuring sufficient hard-drive space for DWH contents - Implementing the backup and recovery procedures

Historical (DWH Definition)

Refers to larger time horizon in the DWH than in op. DBs

Enterprise-wide (DWH Definition)

Refers to organization-wide view of the analytically useful information

Subject-oriented (DWH Definition)

Refers to the fundamental difference in the purpose of an operational DB system and a DWH - DWH developed to analyze specific business subject areas

DWH Implementation Time

Relatively long (months, years)

DWH Deployment

Releasing the created and populated DWH and its front-end (BI) apps

Drill hierarchy

Set of attributes within a dimension where an attribute is related to one or more attributes at a lower level but only related to one item at a higher level - Used for drill down/up operations

Production Release (DWH Deployment)

The actual deployment of a functioning system

Purpose of the DWH

The retrieval of analytical information and can store detailed and/or summarized data.

Transformation

Transforming the structure of extracted data in order to fit the structure of the target DWH model - Adding surrogate keys - Translating coded value - Deriving new calculated value Joining data from multiple sources and deduplicating the data

Operational Data Makeup Difference

Typical Time-Horizon: Days/Months Detailed Current

Analytical Data Makeup Difference

Typical Time-Horizon: Years Summarized (and/or Detailed) Values over time (Snapshots)

Creating DWH

Using DBMSoftware to implement the DWH model as a collection of physically created and mutually connected DB tables

Creating the DWH

Using a DBMS to implement the DWH data model as an actual DWH ** Typically, DWHs implemented using a relational DBMS (RDBMS) software

Data granularity ___________. a) is the process of data extraction b) defines the level of detail in a dataset c) aids in the process of data transformation d) allows for the integration of operational information in the DWH e) all of the above

b) defines the level of detail in a dataset

Which OLAP/BI tool feature reorganizes the values displayed in the original query result by moving values of a dimension column from one axis to another? a) Drill down/ Drill up b) Pivot c) Slice and Dice d) Roll up

b) pivot

Which of the following is NOT true for Analytical Data? a) Redundancy not an issue b) Typical Time-Horizon: Years c) Application-oriented d) Used by a narrower set of users for decision-making e) A and B

c) Application-oriented

Type 2 method of dealing with slowly changing dimensions can be combined with the use of ___________________. a) row indicators b) timestamps c) both A and B d) None of the above

c) both A and B

Row indicator

column that quickly indicates whether the record is currently valid

Timestamps

columns that indicate the time interval for which the values in the records are applicable

Dimension (tables)

contain descriptions of the business, organization, or enterprise to which the subject of analysis belongs

(ETL) The initial load, populates initially ________ DWH tables

empty

________ tables are located at the center of a star schema

fact

A data warehouse that contains the ______ level of detail is the most powerful

finest

Front-end (BI) Applications

for indirect access to data in data warehouses and data mart

Design and creation of front-end applications can take place _______________ with DWH creation

in parallel

Coarser granularity aggregated fact tables are quicker to query than detailed fact tables but _______________________________.

limited in terms of what can be retrieved from them

Load

loading the extracted, transformed, and quality assured data into the target DWH - a batch process that inserts the data into the DWH tables in an automatic fashion without user involvement

Due to the amount of details that have to be considered, creating ETL infrastructure is often the ______________________________ part of the DWH development process

most time- and resource-consuming

A typical DWH _________________ retrieves selected analytically useful data from the operational data sources

periodically

Online analytical processing (OLAP)

querying and presenting data from DWHs and/or data marts for analytical purposes

_________________ is the period in which the DWH is reloaded with the new data (e.g. hourly, daily)

refresh cycle

(ETL) After the initial load, every subsequent load is referred to as ____________.

refresh load

A data mart is focused on how many subjects?

singular

A data _________________, or landing zone, is an intermediate storage area used for data processing during the ETL process

staging area

Relational modeling

standard method for logical modeling of operational DBs

Fact table data rapidly grows in # of records, while dimension data is relatively _________

static

The DWH is sometimes referred to as the __________ system, to indicate the fact that it is a ____________ for the data from the source systems

target; destination

Data cleansing (scrubbing)

the detection and correction of low-quality data

Operational Information

the information collected and used in support of day to day operational needs in businesses and other organizations

Visualization of Data

the use of computer-supported, interactive visual representations of data to amplify cognition

The data quality control and improvement are included in the ______________ process

transformation

Online Transaction Processing (OLTP)

updating (i.e. inserting, modifying, deleting), querying and presenting data from DBs for operational purposes

(Fully) Additive Fact Table Measure

- Can be summed across any of the dimensions associated w/ the fact table Ex. Sales -> Can add hourly sales to get sales for a day, week, month, quarter, or year. Or across stores or regions

Transaction Identifier (Fact Table Attribute)

- Can be useful for market basket analysis (association rule mining task)

Non-volatile (DWH Definition)

- Captured snapshot does not change - New data are appended periodically - May be archived after usefulness declines ex. Balance owed by customer at the time is captured, and any change in balance is loaded next extraction

Detailed Fact Table

- Each record refers to a single fact - Have a fine level of granularity

Aggregated Fact Table

- Each record summarizes multiple facts - Have a coarser level of granularity

Pivot (Rotate)

Reorganizes the values displayed in the original query result by moving values of a dimension column from one axis to another

DWH Use

Retrieval of data Indirect - via front-end (BI) applications Direct - via DBMS or OLAP (BI) tools

Fact Table Characteristics

- Concatenated dimension keys as FK - Granularity determined by dimension tables e.g. each record represents sales by (Store, Date, Customer, Product) - Typically numeric attributes - MANY records (sometimes millions)

Structured repository (DWH Definition)

- Contains analytically useful information - Any DB is a structured repository with its structure represented in its metadata

Additional OLAP/BI Tool functionalities

- Graphically visualizing the answers - Creating and examining calculated data - Determining comparative or relative differences - Performing exception analysis, trend analysis, forecasting, and regression analysis

Integrated (DWH Definition)

- Integrates analytically useful data from various operational DBs (and possibly other ext. sources) - Refers to the process of bringing data from multiple sources into a singular DWH

Executive Dashboard

- Intended use by higher-level decision makers - Contains organized easy-to-read display of critically important queries describing org. performance - Usage should require little or no effort or training - can be web-based

DWH Modeling

- Logical modeling techniques include relational modeling and dimensional modeling

Operational Key (natural keys)

- Meaningful / Often used for search - As FK to connect related records E.g. ProductID, CustomerID

Transaction Time (Fact Table Attribute)

- Useful for any event transaction scenarios

Unstructured Data

- no explicit structure - e.g. text document

Two main reasons for the creation of DWH as separate analytical DB

1. The performance of operational data use can be diminished when having to compete with analytical queries 2. Often impossible to structure DB for efficient operational and analytical use

Dimensional DWH (Kimball)

- Collection of dimensionally modeled intertwined data marts i.e. constellation of dimensional models - Set of conformed dimensions is designed first - Fact tables subsequently added - Dimensions shared by 2+ fact table

Facts (tables)

- Contain performance (outcome) measures, metrics or facts - Quantitative (numeric) data related to the subject of analysis + foreign keys - Intended for mathematical computation and quantitative analysis

Type 3 Approach (slowly changing dimension)

- Create a "previous" and "current" column in the dimension table for each column where changes are anticipated - Useful when there is a fixed # of changes possible per column or when only limited history is recorded - Can be combined w/ timestamps

Type 2 Approach (slowly changing dimension)

- Creates a new additional dimension record using a new value for the surrogate key every time a change is made - Used when history should be preserved - Can be combined with use of timestamps and row indicators

Kimball Data Bus

- Data moved to staging area for scrubbing and made consistent - Data Marts created from staging area - Data Marts based on single process - Sum of dat marts can constitute Enterprise DWH - Conformed dimensions are key to success

Big Data

- Data sets in corporations and organizations that contain massive volumes of diverse and rapidly growing data - Typically unstructured or sem-structured

Dependent Data Mart

- Does not have its own source systems - Data comes from data warehouse - Provides users w/ subset of data from DWH, in cases when users or applications do not want, need, or are not allowed to have access to all the data in the entire data warehouse

Transaction-level Detailed Fact Table

- Each row represents a particular transaction e.g. single sales transaction, single care rental transaction, defect detection

Line-item Detailed Fact Table

- Each row represents each line item of a particular transaction e.g. itemized order transaction, invoice transaction

ETL includes the following tasks:

- Extracting analytically useful data from operational data sources - Transforming such data so that it conforms to the structure of the subject-oriented DWH model (while ensuring quality of the transformed data) - Loading transformed and quality-assured data into the DWH

Non-Additive Fact Table Measure

- Facts that cannot be added meaningfully across dimensions ex. textual facts, per-unit prices, percentages + rations, Measures of intensity, averages

Problems with external data

- Frequency of availability - Totally unstructured data - Unpredictability of data

Analytical Data Technical Difference

- Large amounts used in a process - Low/Modest frequency of access - Read (and Append) Only - Redundancy not an issue

Dimensional modeling

- Modeling technique tailored for analytical subject-oriented DB design purposes - In addition to using regular relational concepts (primary keys, foreign keys, integrity constraints, etc.) it distinguishes 2 types of tables: Dimensions + Facts

Type 1 Approach (slowly changing dimension)

- New value replaces old value in dimension record - No history is preserved - Simplest approach, used often when a change in a dimension is the result of an error

ETL Infrastructure

- Process includes using specialized ETL software tools and/or writing code - Due to amount of considered detail, often the most time and resource consuming part in DWH development - Essentially predetermined during requirements and modeling stages

DWH Requirements Collection, Definition, and Visualization

- Results in the requirements specifying the desired capabilities and functionalities of the future DWH - Requirements based on analytical needs to be met by internal and external source systems - Collected through interviewing various stakeholders of the DWH - Collections should be clearly defined and stated in a written document, and then visualized as a conceptual data model (ERD)

OLAP/BI Tools

- Simple point-and-click query-building applications - Tool writes and executes the code in DBMS (e.g. SQL) - Designed for analysis of dimensionally modeled data - End users access the data is typically structured as a dimensional model

Surrogate Key

- Simple, non-composite system-generated key - Auto-increment integer values - Use only within DWH - Uniquely identifies a record & join the dimension to fact table - Handles slowly changing dimension issue

Steps in Development of DWHs

1. Requirements collection, definition, and visualization 2. DWH Modeling 3. Creating the DWH 4. Creating the ETL Infrastructure 5. Developing Front-end (BI) Applications (indirect access) 6. DWH Deployment 7. DWH Use (direct and/or indirect) 8. DWH Admin/maintenance

Time-variant (DWH Definition)

Refers to the snapshots/slices of data from different periods of time across its time horizon Allows: - Historical analysis - Analysis of patterns for predictive use - Impact of decisions in different operational areas E.g. marketing on sales, customer retention, profits - Measure outcome influence- e.g. impact of decision to drop prices on revenues, # of customers

Extraction

Retrieval of analytically useful data from the operational data sources - What to extract is determined in the requirements and modeling stages

In so-called "active" DWHs, the retrieval of data from operational data sources is ______________.

continuous

In active DWHs, the loads occur in micro batches that occur ____________.

continuously

Drill up: a) changing the level of granularity of a particular dimension b) dimension reduction c) a visualization operation that allows an analyst to rotate the cube in space in order to provide an alternative presentation of the data d) A and B

d) A and B

Which of the following aspect of the DWH development process often takes the most time and resources? a) logical design b) implementation c) developing BI Applications d) ETL Process

d) ETL process

Commonly, some of the data in the data sources exhibit ______________ problems

data quality


Ensembles d'études connexes

Pharmacology PrepU; Chapter 22- Psychotherapeutic Agents #3

View Set

Patho Exam 4 Chapter 37 Mastery Quiz

View Set

AP Computer Science - Unit 5 answers

View Set