MIS430 FINAL EXAM Study Guide
Operational Data Technical Difference
- Small amounts used in a process - High frequency of access - Can be updated - Non-redundant (non-repeating)
Analytical Data Functional Difference
- Used by narrower set of users for decision-making - Subject oriented
Dimension Attribute Characteristics
- not directly related to each other (e.g. customer info + ZIP demographics) - not normalized (e.g. ZIP) - Hierarchical (e.g. date->month->quarter->year)
Semi-structured data
- some structure (but not completely structured) - e.g. collection of e-mails
Kimball Bus Matrix
- Builds upon the concept of conformed dimensions - A structure of common dimensions - Identifies and enforces the relationship between business process metrics (facts) and descriptive attributes (dimensions)
Semi-additive Fact Table Measure
- Can be summarized across some dimensions but not others eg. Account balances can't be added across Date/TIme but can be added across Branch, Customer, and Account over a single day
Data granularity
- Defines level of detail - Multiple levels of detail are usually present
DWH/Data Mart Front-end (BI) Applications
- Provide access for indirect use
Independent Data Mart
- Stand-alone, created in same fashion as DWH - Has its own source systems and ETL infrastructure
Characteristics of DWHs
- Structured repository - Integrated - Subject-oriented - Enterprise-wide - Historical - Time-variant - Non-volatile - Retrieval of analytically useful information (read-only) - Detailed and/or summarized data
Dimension Table Characteristics
- Surrogate Key as the PK - Denormalized: typically 0NF or 1NF - Large # of attributes, capture specific aspects of decisions made - Few records compared to the fact table
Normalized DWH (Inmon)
- Top down approach - Envisions DWH as an integrated analytical DB - modeled using ER modeling and relational modeling - results in normalized relational DB schema - Serves as data source for dimensionally modeled data marts
Operational Data Functional Difference
- Used by all types of employees for tactical purposes - Application oriented
OLAP/BI Tools Purposes
1. Ad-hoc direct analysis of dimensionally modeled data - occurs when a user performs actions, such as pivoting, slicing, and drilling 2. Creation of front-end (BI) apps - can be created simply as a collection of OLAP/BI tool queries created by OLAP/BI tool expert users
Some typical external data sources
1. Data records such as supermarket, medicine store, clothing store, etc. 2. Random articles, news and reports available on the internet
Operational Information Use
1. Original operational purpose 2. As a source system for the DWH
Data Mart
A data store based on the same principles as a data warehouse, but with a more limited scope
ER Modeling
A predominant technique for visualizing DB requirements, used extensively for conceptual modeling of operational DBs
Snowflake Schema
A star schema with normalized dimensions - Usually not used in dimensional modeling
The Data Warehouse Definition
A structured repository of integrated, subject-oriented, enterprise-wide, historical, time-variant and nonvolatile data
Slice and Dice
Adds, replaces, or eliminates specified dimension attributes (or particular values of the dimension attributes) from the already displayed result
Analytical Information Use
Business intelligence processing
Dimension Attributes
Columns contain descriptive information with data type of textual/categorical; can also be numeric but numeric often converted to categorical; provides a basis for analysis of the subject - e.g. product dimension, customer dimension
Creating ETL Infrastructure
Creating necessary procedures and code for: - Automatic extraction of relevant data from sources - Transformation of extracted data so its quality-assured and structure conforms to modeled and implemented DWH - Seamless load of transformed data into DWH
Beta Release (DWH Deployment)
Deployment of a system to a selected group of users to test the usability of the system
Developing Front-end (BI) Applications
Designing and creating applications for indirect use by the end-users - Included in most DWH systems - Contain interfaces (forms + reports) accessible via a navigation mechanism (menu)
Slowly Changing Dimension
Dimension that contains attributes whose values can change
___________________ facilitates the retrieval of data from operational DBs into the DWHs
ETL Infrastructure
Analytical Information
Encompasses all organizational information (transactional), and its primary purpose is to support the performing of managerial analysis tasks
Alpha Release (DWH Deployment)
Internal deployment of a system to the members of the development team for initial testing of its functionalities
Drill Up
Makes the granularity of the data in the query result coarser
Drill Down
Makes the granularity of the data in the query result finer
Constellation Schema
Multiple facts in a dimensional model which can enable: - Quicker development of analytical databases for multiple subjects of analysis - Straightforward cross-fact analysis ex. avg daily UnitsSold vs. avg DefectiveUnitsFound per store/region/quarter/product/category
Why are snowflake schemas not used in dimensional modeling?
Normalization usually not necessary for analytical databases and NOT recommended in dimensional modeling
Source systems
Operational DBs and other operational repositories that provide analytically useful information for the DWH subjects of analysis - can include external data sources
DWH Administration and Maintenance
Performing activities that support the DWH end user, including dealing w/ technical issues: - Providing information security - Ensuring sufficient hard-drive space for DWH contents - Implementing the backup and recovery procedures
Historical (DWH Definition)
Refers to larger time horizon in the DWH than in op. DBs
Enterprise-wide (DWH Definition)
Refers to organization-wide view of the analytically useful information
Subject-oriented (DWH Definition)
Refers to the fundamental difference in the purpose of an operational DB system and a DWH - DWH developed to analyze specific business subject areas
DWH Implementation Time
Relatively long (months, years)
DWH Deployment
Releasing the created and populated DWH and its front-end (BI) apps
Drill hierarchy
Set of attributes within a dimension where an attribute is related to one or more attributes at a lower level but only related to one item at a higher level - Used for drill down/up operations
Production Release (DWH Deployment)
The actual deployment of a functioning system
Purpose of the DWH
The retrieval of analytical information and can store detailed and/or summarized data.
Transformation
Transforming the structure of extracted data in order to fit the structure of the target DWH model - Adding surrogate keys - Translating coded value - Deriving new calculated value Joining data from multiple sources and deduplicating the data
Operational Data Makeup Difference
Typical Time-Horizon: Days/Months Detailed Current
Analytical Data Makeup Difference
Typical Time-Horizon: Years Summarized (and/or Detailed) Values over time (Snapshots)
Creating DWH
Using DBMSoftware to implement the DWH model as a collection of physically created and mutually connected DB tables
Creating the DWH
Using a DBMS to implement the DWH data model as an actual DWH ** Typically, DWHs implemented using a relational DBMS (RDBMS) software
Data granularity ___________. a) is the process of data extraction b) defines the level of detail in a dataset c) aids in the process of data transformation d) allows for the integration of operational information in the DWH e) all of the above
b) defines the level of detail in a dataset
Which OLAP/BI tool feature reorganizes the values displayed in the original query result by moving values of a dimension column from one axis to another? a) Drill down/ Drill up b) Pivot c) Slice and Dice d) Roll up
b) pivot
Which of the following is NOT true for Analytical Data? a) Redundancy not an issue b) Typical Time-Horizon: Years c) Application-oriented d) Used by a narrower set of users for decision-making e) A and B
c) Application-oriented
Type 2 method of dealing with slowly changing dimensions can be combined with the use of ___________________. a) row indicators b) timestamps c) both A and B d) None of the above
c) both A and B
Row indicator
column that quickly indicates whether the record is currently valid
Timestamps
columns that indicate the time interval for which the values in the records are applicable
Dimension (tables)
contain descriptions of the business, organization, or enterprise to which the subject of analysis belongs
(ETL) The initial load, populates initially ________ DWH tables
empty
________ tables are located at the center of a star schema
fact
A data warehouse that contains the ______ level of detail is the most powerful
finest
Front-end (BI) Applications
for indirect access to data in data warehouses and data mart
Design and creation of front-end applications can take place _______________ with DWH creation
in parallel
Coarser granularity aggregated fact tables are quicker to query than detailed fact tables but _______________________________.
limited in terms of what can be retrieved from them
Load
loading the extracted, transformed, and quality assured data into the target DWH - a batch process that inserts the data into the DWH tables in an automatic fashion without user involvement
Due to the amount of details that have to be considered, creating ETL infrastructure is often the ______________________________ part of the DWH development process
most time- and resource-consuming
A typical DWH _________________ retrieves selected analytically useful data from the operational data sources
periodically
Online analytical processing (OLAP)
querying and presenting data from DWHs and/or data marts for analytical purposes
_________________ is the period in which the DWH is reloaded with the new data (e.g. hourly, daily)
refresh cycle
(ETL) After the initial load, every subsequent load is referred to as ____________.
refresh load
A data mart is focused on how many subjects?
singular
A data _________________, or landing zone, is an intermediate storage area used for data processing during the ETL process
staging area
Relational modeling
standard method for logical modeling of operational DBs
Fact table data rapidly grows in # of records, while dimension data is relatively _________
static
The DWH is sometimes referred to as the __________ system, to indicate the fact that it is a ____________ for the data from the source systems
target; destination
Data cleansing (scrubbing)
the detection and correction of low-quality data
Operational Information
the information collected and used in support of day to day operational needs in businesses and other organizations
Visualization of Data
the use of computer-supported, interactive visual representations of data to amplify cognition
The data quality control and improvement are included in the ______________ process
transformation
Online Transaction Processing (OLTP)
updating (i.e. inserting, modifying, deleting), querying and presenting data from DBs for operational purposes
(Fully) Additive Fact Table Measure
- Can be summed across any of the dimensions associated w/ the fact table Ex. Sales -> Can add hourly sales to get sales for a day, week, month, quarter, or year. Or across stores or regions
Transaction Identifier (Fact Table Attribute)
- Can be useful for market basket analysis (association rule mining task)
Non-volatile (DWH Definition)
- Captured snapshot does not change - New data are appended periodically - May be archived after usefulness declines ex. Balance owed by customer at the time is captured, and any change in balance is loaded next extraction
Detailed Fact Table
- Each record refers to a single fact - Have a fine level of granularity
Aggregated Fact Table
- Each record summarizes multiple facts - Have a coarser level of granularity
Pivot (Rotate)
Reorganizes the values displayed in the original query result by moving values of a dimension column from one axis to another
DWH Use
Retrieval of data Indirect - via front-end (BI) applications Direct - via DBMS or OLAP (BI) tools
Fact Table Characteristics
- Concatenated dimension keys as FK - Granularity determined by dimension tables e.g. each record represents sales by (Store, Date, Customer, Product) - Typically numeric attributes - MANY records (sometimes millions)
Structured repository (DWH Definition)
- Contains analytically useful information - Any DB is a structured repository with its structure represented in its metadata
Additional OLAP/BI Tool functionalities
- Graphically visualizing the answers - Creating and examining calculated data - Determining comparative or relative differences - Performing exception analysis, trend analysis, forecasting, and regression analysis
Integrated (DWH Definition)
- Integrates analytically useful data from various operational DBs (and possibly other ext. sources) - Refers to the process of bringing data from multiple sources into a singular DWH
Executive Dashboard
- Intended use by higher-level decision makers - Contains organized easy-to-read display of critically important queries describing org. performance - Usage should require little or no effort or training - can be web-based
DWH Modeling
- Logical modeling techniques include relational modeling and dimensional modeling
Operational Key (natural keys)
- Meaningful / Often used for search - As FK to connect related records E.g. ProductID, CustomerID
Transaction Time (Fact Table Attribute)
- Useful for any event transaction scenarios
Unstructured Data
- no explicit structure - e.g. text document
Two main reasons for the creation of DWH as separate analytical DB
1. The performance of operational data use can be diminished when having to compete with analytical queries 2. Often impossible to structure DB for efficient operational and analytical use
Dimensional DWH (Kimball)
- Collection of dimensionally modeled intertwined data marts i.e. constellation of dimensional models - Set of conformed dimensions is designed first - Fact tables subsequently added - Dimensions shared by 2+ fact table
Facts (tables)
- Contain performance (outcome) measures, metrics or facts - Quantitative (numeric) data related to the subject of analysis + foreign keys - Intended for mathematical computation and quantitative analysis
Type 3 Approach (slowly changing dimension)
- Create a "previous" and "current" column in the dimension table for each column where changes are anticipated - Useful when there is a fixed # of changes possible per column or when only limited history is recorded - Can be combined w/ timestamps
Type 2 Approach (slowly changing dimension)
- Creates a new additional dimension record using a new value for the surrogate key every time a change is made - Used when history should be preserved - Can be combined with use of timestamps and row indicators
Kimball Data Bus
- Data moved to staging area for scrubbing and made consistent - Data Marts created from staging area - Data Marts based on single process - Sum of dat marts can constitute Enterprise DWH - Conformed dimensions are key to success
Big Data
- Data sets in corporations and organizations that contain massive volumes of diverse and rapidly growing data - Typically unstructured or sem-structured
Dependent Data Mart
- Does not have its own source systems - Data comes from data warehouse - Provides users w/ subset of data from DWH, in cases when users or applications do not want, need, or are not allowed to have access to all the data in the entire data warehouse
Transaction-level Detailed Fact Table
- Each row represents a particular transaction e.g. single sales transaction, single care rental transaction, defect detection
Line-item Detailed Fact Table
- Each row represents each line item of a particular transaction e.g. itemized order transaction, invoice transaction
ETL includes the following tasks:
- Extracting analytically useful data from operational data sources - Transforming such data so that it conforms to the structure of the subject-oriented DWH model (while ensuring quality of the transformed data) - Loading transformed and quality-assured data into the DWH
Non-Additive Fact Table Measure
- Facts that cannot be added meaningfully across dimensions ex. textual facts, per-unit prices, percentages + rations, Measures of intensity, averages
Problems with external data
- Frequency of availability - Totally unstructured data - Unpredictability of data
Analytical Data Technical Difference
- Large amounts used in a process - Low/Modest frequency of access - Read (and Append) Only - Redundancy not an issue
Dimensional modeling
- Modeling technique tailored for analytical subject-oriented DB design purposes - In addition to using regular relational concepts (primary keys, foreign keys, integrity constraints, etc.) it distinguishes 2 types of tables: Dimensions + Facts
Type 1 Approach (slowly changing dimension)
- New value replaces old value in dimension record - No history is preserved - Simplest approach, used often when a change in a dimension is the result of an error
ETL Infrastructure
- Process includes using specialized ETL software tools and/or writing code - Due to amount of considered detail, often the most time and resource consuming part in DWH development - Essentially predetermined during requirements and modeling stages
DWH Requirements Collection, Definition, and Visualization
- Results in the requirements specifying the desired capabilities and functionalities of the future DWH - Requirements based on analytical needs to be met by internal and external source systems - Collected through interviewing various stakeholders of the DWH - Collections should be clearly defined and stated in a written document, and then visualized as a conceptual data model (ERD)
OLAP/BI Tools
- Simple point-and-click query-building applications - Tool writes and executes the code in DBMS (e.g. SQL) - Designed for analysis of dimensionally modeled data - End users access the data is typically structured as a dimensional model
Surrogate Key
- Simple, non-composite system-generated key - Auto-increment integer values - Use only within DWH - Uniquely identifies a record & join the dimension to fact table - Handles slowly changing dimension issue
Steps in Development of DWHs
1. Requirements collection, definition, and visualization 2. DWH Modeling 3. Creating the DWH 4. Creating the ETL Infrastructure 5. Developing Front-end (BI) Applications (indirect access) 6. DWH Deployment 7. DWH Use (direct and/or indirect) 8. DWH Admin/maintenance
Time-variant (DWH Definition)
Refers to the snapshots/slices of data from different periods of time across its time horizon Allows: - Historical analysis - Analysis of patterns for predictive use - Impact of decisions in different operational areas E.g. marketing on sales, customer retention, profits - Measure outcome influence- e.g. impact of decision to drop prices on revenues, # of customers
Extraction
Retrieval of analytically useful data from the operational data sources - What to extract is determined in the requirements and modeling stages
In so-called "active" DWHs, the retrieval of data from operational data sources is ______________.
continuous
In active DWHs, the loads occur in micro batches that occur ____________.
continuously
Drill up: a) changing the level of granularity of a particular dimension b) dimension reduction c) a visualization operation that allows an analyst to rotate the cube in space in order to provide an alternative presentation of the data d) A and B
d) A and B
Which of the following aspect of the DWH development process often takes the most time and resources? a) logical design b) implementation c) developing BI Applications d) ETL Process
d) ETL process
Commonly, some of the data in the data sources exhibit ______________ problems
data quality