Data Warehouses
True/False: A popular approach is to implement the multidimensional model on top of the relational model
True
True/False: DWs are modeled and structured differently, use different techniques for storage and retrieval and cater to a different set of users
True
True/False: If a transaction encounters errors and must be canceled or rolled back, then all of the data modifications are erased
True
True/False: If a transaction is successful, all of the data modifications made during the transaction are committed and become a permanent part of the database
True
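The two transaction cards above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not from the deck; the table and column names (`accounts`, `balance`) are hypothetical.

```python
import sqlite3

# Hypothetical example table; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.commit()

# A successful transaction: on commit, all modifications become permanent.
conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
conn.commit()

# A failed transaction: rolling back erases all of its modifications.
try:
    conn.execute("UPDATE accounts SET balance = balance - 999 WHERE id = 1")
    raise RuntimeError("business rule violated: balance would go negative")
except RuntimeError:
    conn.rollback()  # the UPDATE above is erased

balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
print(balance)  # 70: the committed change survives, the rolled-back one does not
```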
True/False: The data must be extracted from multiple, heterogeneous sources.
True
True/False: Use SELECT statements to read from a database
True
True/False: Warehousing approach has high query performance
True: - not necessarily most current info, though!
True/False: Traditional Databases generally deal with two-dimensional data (similar to a spreadsheet)
True: However, querying a multi-dimensional data storage model (matrices) is much more efficient.
True/False: OLAP interactive response time is less than 10 seconds
False: OLAP interactive response time is less than *2 seconds*
True/False: Normalization saves storage space
True: this is a benefit of normalization
Star vs Snowflake schemas
Two common multi-dimensional schemas are: star & snowflake 1. Star Schema: Consists of a fact table with a single table for each dimension 2. Snowflake Schema: A variation of star schema, in which the dimensional tables from a star schema are organized into a hierarchy by normalizing them.
Use ? statements to write to a database
Use INSERT statements to write to a database
Use ? statements to read from a database
Use SELECT statements to read from a database
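The read/write cards above in one runnable sketch (table and values are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, grade TEXT)")

# Write to the database with an INSERT statement.
conn.execute("INSERT INTO students VALUES (?, ?)", ("Ada", "A"))

# Read from the database with a SELECT statement.
rows = conn.execute("SELECT name, grade FROM students").fetchall()
print(rows)  # [('Ada', 'A')]
```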
disadvantages of query driven approach (3)
1. Delayed answers during query processing - Slow or unavailable information sources - Complex filtering and integration - Dependent on all heterogeneous sources 2. Inefficient and potentially expensive if queries are frequent - Competes with local processing at sources - Request rate limiting to not overwhelm sources 3. In general, hasn't caught on in industry, except: - When anonymity matters (medical, student data) - When data freshness is paramount (airfare search)
Fact VS Dimension Table
1. Fact Table - Each tuple is a recorded fact. This fact contains some measured or observed variable(s) and identifies it with pointers to dimension tables. The fact table contains the data, and the dimensions to identify each tuple in the data. - A fact table is an agglomerated (grouped) view of transaction data, whereas each dimension table represents "master data" that those transactions belonged to. 2. Dimension Table - It consists of tuples of attributes of the dimension.
Multi-dimensional model (also called "dimensional model") includes two types of tables, which are?
1. Fact Table - Each tuple is a recorded fact. This fact contains some measured or observed variable(s) and identifies it with pointers to dimension tables. The fact table contains the data, and the dimensions to identify each tuple in the data. - A fact table is an agglomerated (grouped) view of transaction data, whereas each dimension table represents "master data" that those transactions belonged to. 2. Dimension Table - It consists of tuples of attributes of the dimension.
Three common multi-dimensional schemas are
1. Star Schema: Consists of a fact table with a single table for each dimension 2. Snowflake Schema: A variation of star schema, in which the dimensional tables from a star schema are organized into a hierarchy by normalizing them 3. Flat table schema - denormalized
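The star schema above can be sketched with sqlite3. All table and column names here (`sales`, `dim_product`, `dim_date`) are hypothetical examples, not from the deck:

```python
import sqlite3

# Star schema sketch: one fact table whose tuples point at dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, quarter  TEXT);
CREATE TABLE sales (               -- fact table: dimension keys + a measure
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     REAL                -- the measured variable
);
INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
INSERT INTO dim_date    VALUES (10, 'Q1'), (11, 'Q2');
INSERT INTO sales VALUES (1, 10, 5.0), (2, 10, 7.5), (1, 11, 2.5);
""")

# Querying a star schema: join the fact table to its dimension tables.
total_q1 = conn.execute("""
    SELECT SUM(s.amount) FROM sales s
    JOIN dim_date d ON s.date_id = d.date_id
    WHERE d.quarter = 'Q1'
""").fetchone()[0]
print(total_q1)  # 12.5
```

A snowflake variant would further normalize the dimension tables (e.g. split `dim_product` into product and category tables linked by a key).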
Second normal form
attributes fully depend on primary key
True/False: ELT is slowly but surely deprecating ETL
True
True/False: During ETL - Extract, data can have varying formats
False: Data must be formatted for consistency within the warehouse
common type of instruction for data management
OLAP = a common workload (a clustering of similar instructions)
OLAP requires an approach to ? sources
OLAP requires an approach to unify sources - collect & combine info - provides integrated view, uniform user interface - supports sharing
OLTP VS OLAP - mostly ______
OLTP - mostly updates OLAP - mostly reads
OLTP vs OLAP
OLTP - mostly updates - many small transactions - GB-TB of data - current state - Very selective accesses - Indexes are essential - Thousands of users (e.g., clerical users) OLAP - mostly reads - Queries are long and complex - TB-PB of data (order of magnitude higher) - historical - Sift through everything - Fast scans are essential - A few users (e.g., decision-makers, analysts)
when creating data, think of OLAP/OLTP?
OLTP - updating data OLAP - look patterns
OLTP VS OLAP - ___________ is essential
OLTP: index OLAP: fast scans
OLTP
Online Transaction Processing - records all business transactions as they occur; acts as a monitor; detects process aborts - DB
What does OLAP stand for
Online analytics processing (OLAP)
Transactions combine multiple ? and ? into single unit of work
Transactions combine multiple reads and writes into single unit of work
DataWarehouse or Query-Driven: backflushing
datawarehouse
difficulty with warehouse approach in industry
difficulty: propagating updates to historical data
goal of OLAP
discover patterns in massive amounts of data collected
Databases ingest data....
efficiently and answer simple SELECT statements quickly
Data management facilitates ____ & effective data science
fast
Tradeoff for ELT
greater flexibility for worse storage, but requires cloud computing
OLAP - Low response time
interactive response time is less than 2 seconds
OLAP - mostly read only
looking for patterns; not updating data
In 3NF, all ? attributes are completely independent of each other
non-key
DataWarehouse or Query-Driven: freshest data
query driven
DataWarehouse or Query-Driven: handle updates to fact tables
query driven
Use SELECT statements to ____ from a database
read
If a transaction encounters errors and must be canceled or ?, then all of the data modifications are erased
rolled back
It is a variation of star schema, in which the dimensional tables from a star schema are organized into a hierarchy by normalizing them
snowflake schema
Consists of a fact table with a single table for each dimension
star schema
Transactions combine multiple reads and writes into single ?
unit of work
Use INSERT statements to ______ to a database
write
Only ? change state of Database
writes
Effective - MapReduce...
constrains programming choices
What is OLAP
*Online Analytical Processing* - describes the functionality for spectral manipulation and analysis of huge volumes of complex data - OLAP - Discover patterns in data collected
Database normalization is NOT always effective (3)
- A fully normalized schema may be very inefficient for complicated read queries - joining multiple tables on the fly. - Historical data are usually large with static relationships. - Normalization cannot be applied over heterogeneous sources.
Fully normalized star schema
- AKA snowflake schema - a variation of star schema, in which the dimensional tables from a star schema are organized into a hierarchy by normalizing them.
Fast
- Add hardware to handle bigger problems - Exploit efficient data structures
How to not protect anonymity in Query-driven approach (6)
- Authentication, captchas→ serious attackers are humans - Only returning aggregate results→ attackers will craft queries with only one item per group - Return few rows at a time or limit the rate of queries→ attackers are patient and will run a sequence of queries - Never return sensitive values (e.g. social security numbers) → attackers will obfuscate data before transmission - Add random noise on every answer→ attackers will repeat the same query to "cancel out" noise - Only return True/False answers→ attacker can eventually guess answer
Query Driven Approach
- Clients communicate with the integration system - metadata --> integration system - integration system <--> wrappers - wrappers <--> sources
Database views and data warehouses are alike in that they both are *read-only* extractions from databases However, they differ in the following ways..(5)
- Data Warehouses exist as persistent storage instead of being materialized on demand. - Data Warehouses are multi-dimensional, supporting multiple levels of aggregation. - Data Warehouses can be indexed for optimal performance. - Views cannot be indexed directly. - Data Warehouses target OLAP and integrate data contained in more than one database.
Warehousing approach
- Data warehouse communicates with clients and integration system - integration system <--> ETL - ETL <--> sources - Information integrated in advance - Stored in DW for direct querying and analysis
Derived (computed) attributes
- Derived (computed) attributes: Attributes are computed by operations on stored and derived values. - DataWarehouse Functionality that can be expected
Drill-down
- Drill-down: Increasing levels of detail are revealed (the complement of roll-up). - expands aggregate data into a finer grained view; bring new dimension to cube - SQL: GROUPBY operator - add - DataWarehouse Functionality that can be expected
ETL vs ELT
- ETL = transform data before loading; only the transformed copy is stored - ELT = copy and store raw data, transform on the fly (transformation at query time, similar in spirit to query-driven) - ELT tradeoff: greater flexibility for worse storage
Fact Table
- Each tuple is a recorded fact. This fact contains some measured or observed variable(s) and identifies it with pointers to dimension tables. The fact table contains the data, and the dimensions to identify each tuple in the data. - A fact table is an agglomerated (grouped) view of transaction data, whereas each dimension table represents "master data" that those transactions belonged to.
ELT
- Extract, Load, and then Transform (data lake) - ~2010 - copy and store raw third-party data - transform on the fly to any structured DB format - NEG: storage > 2x more - POS: (1) cloud computing, data centers => cheap storage and compute (2) not tied to initial transformation scripts - slowly but surely deprecating ETL
Authentication, captchas
- How to not protect anonymity in Query-driven approach - serious attackers are humans
Data warehouses can take advantage of this feature (generally deal with two-dimensional data) as generally these are...
- Non volatile - The degree of predictability of the analysis that will be performed on them is high.
Database normalization benefits (3)
- Normalization saves storage space. - Normalization makes it easier to maintain evolving relationships between business entities. - A relational database system is effective and efficient for transactional databases - update performance is good.
DataWarehouse Functionality that can be expected (7)
- Pivot: Cross tabulation (also referred to as rotation) is performed. - Roll-up (also Drill-up): Data is summarized with increasing generalization (for example, weekly to quarterly to annually). - Drill-down: Increasing levels of detail are revealed (the complement of roll-up). - Slice and dice: Projection operations are performed on the dimensions. - Sorting: Data is sorted by ordinal value. - Selection: Data is filtered by value or range. - Derived (computed) attributes: Attributes are computed by operations on stored and derived values.
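Roll-up, drill-down, and slice from the list above can be sketched in SQL via GROUP BY and WHERE. The table and values (`sales`, `region`, `product`) are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, product TEXT, amount REAL);
INSERT INTO sales VALUES
  ('East','Books',5.0), ('East','Games',7.5),
  ('West','Books',2.5), ('West','Games',4.0);
""")

# Roll-up: fewer GROUP BY columns -> a more generalized summary.
rollup = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Drill-down: add a GROUP BY column to reveal a finer-grained view.
drilldown = conn.execute(
    "SELECT region, product, SUM(amount) FROM sales "
    "GROUP BY region, product ORDER BY region, product"
).fetchall()

# Slice: restrict one dimension to a single value (SELECT/WHERE).
slice_east = conn.execute(
    "SELECT product, amount FROM sales WHERE region = 'East' ORDER BY product"
).fetchall()

print(rollup)  # [('East', 12.5), ('West', 6.5)]
```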
Pivot
- Pivot: Cross tabulation (also referred to as rotation) is performed. - turn; look at dif face of cube - SQL: GROUPBY operator - remove(?) - DataWarehouse Functionality that can be expected
What may work to fix query driven approach
- Prohibit access to certain columns (e.g. SSN) → attacker can't see what no query can access - Require that any part of the answer occurs at least k times in the database → attacker can't distinguish one record among k records - Add random noise on every answer *and* return the same answer to future queries → attackers can't know how far they are from the true answer
Slice and Dice
- Projection operations are performed on the dimensions. - slice, look at subsection - SQL - SELECT clause - DataWarehouse Functionality that can be expected
Roll-up (also Drill-up)
- Roll-up (also Drill-up): Data is summarized with increasing generalization (for example, weekly to quarterly to annually). - SQL: GROUPBY operator - pull out - squish cube - DataWarehouse Functionality that can be expected
Selection
- Selection: Data is filtered by value or range. - DataWarehouse Functionality that can be expected
Sorting
- Sorting: Data is sorted by ordinal value. - DataWarehouse Functionality that can be expected
ETL - Transform
- The data must be fitted into the data model of the warehouse. - Data may have to be converted from its source model into a multi-dimensional format
Functional, popular qualifiers in SQL
- WHERE - GROUPBY - ORDER BY
Query drive approach hasn't caught on in industry, except: (2)
- When anonymity matters (medical, student data) - When data freshness is paramount (airfare search)
Backflushing
- refers to upgrading the source data by returning cleaned data
Data Warehouse processing includes...(3)
- cleaning and reformatting data - ETL (Extract, Transform, Load) - OLAP - Data analytics and mining
OLAP over heterogeneous sources
- many sources of data - hard to avoid duplicate or redundant info - OLAP requires an approach to unify sources
Benefits of Warehousing approach (4)
1. high query performance - not necessarily most current! 2. doesn't interfere with local processing at sources - complex queries, OLTP at info sources 3. info copied at warehouse - can modify, annotate, summarize, restructure, etc. - can store historical info - security; no auditing 4. HAS caught on in industry - difficulty: propagating updates to historical data low response time => higher throughput
When doing OLAP queries in multi-dimensions, we typically think of
3d "Data Cube" - hyper cube = OLAP
A relational database system is effective and efficient for ? databases
A relational database system is effective and efficient for *transactional databases* - update performance is good.
In DataWarehousing, we find patterns and want to update the fact table. How can we do this?
BACKFLUSHING
Back flushing
Back flushing: refers to upgrading the source data by returning cleaned data
OLAP - complex
Complex multiple SQL conditions (group-by, where, joins, etc.)
Approach where information integrated in advance (query or dw)
DATA WAREHOUSING
Data Warehouse (DW) was proposed as a new type of database management system which would keep no transactional data but only summarized ? information for decision making purposes
Data Warehouse (DW) was proposed as a new type of database management system which would keep no transactional data but only summarized *historical* information for decision making purposes
Data Warehouse (DW) was proposed as...
Data Warehouse (DW) was proposed as a new type of database management system which would keep no transactional data but only summarized historical information for decision making purposes
? facilitates fast & effective data science
Data management facilitates fast & effective data science
Database views and data warehouses are alike in that they both are ? extractions from databases
Database views and data warehouses are alike in that they both are *read-only* extractions from databases However, they differ in the following ways: - Data Warehouses exist as persistent storage instead of being materialized on demand. - Data Warehouses are multi-dimensional, supporting multiple levels of aggregation. - Data Warehouses can be indexed for optimal performance. - Views cannot be indexed directly. - Data Warehouses target OLAP and integrate data contained in more than one database. - Each SQL GROUP BY adds a new dimension to the data.
OLAP vs DB
Databases - Ingest data efficiently and answer simple SELECT statements quickly OLAP - Discover patterns in data collected
DataWarehouse or Query-Driven: ETL / ELT
Datawarehouse
DataWarehouse or Query-Driven: interactive response time
Datawarehouse approach
most common approach
Datawarehousing approach
ETL - Extract (definition and 3 key points)
Extract = Acquisition of data for the warehouse 1. The data must be extracted from multiple, heterogeneous sources. 2. Data must be formatted for consistency within the warehouse 3. The data must be cleaned to ensure validity. - Difficult to automate cleaning process. - Back flushing: refers to upgrading the source data by returning cleaned data
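The Extract step above (with Transform and Load from the related cards) can be sketched as a toy pipeline. The two in-memory "sources" stand in for heterogeneous systems; all names and formats here are hypothetical:

```python
import sqlite3

# Two heterogeneous "sources" with inconsistent formats (hypothetical data).
source_a = [{"name": "ada", "sale": "5.00"}]   # JSON-like source
source_b = [("GRACE", "7.5")]                  # CSV-like source

# Extract: acquire records from multiple, heterogeneous sources.
records = [(r["name"], r["sale"]) for r in source_a] + list(source_b)

# Transform: clean and format for consistency within the warehouse.
cleaned = [(name.title(), float(amount)) for name, amount in records]

# Load: store according to the warehouse's data model.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE fact_sales (customer TEXT, amount REAL)")
dw.executemany("INSERT INTO fact_sales VALUES (?, ?)", cleaned)

rows = dw.execute(
    "SELECT customer, amount FROM fact_sales ORDER BY customer"
).fetchall()
print(rows)  # [('Ada', 5.0), ('Grace', 7.5)]
```

Back flushing would be the extra step of pushing `cleaned` back to the sources.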
ETL
Extract, Transform, Load
True/False: OLAP is mostly write-only
FALSE Mostly read-only
True/False: Databases ingest data efficiently and answer simple GROUPBY statements quickly
FALSE: Databases ingest data efficiently and answer simple *SELECT* statements quickly
True/False: In 3NF, all non-key attributes are completely dependent on each other
FALSE: all non-key attributes are completely *independent* of each other
set of tables that share some dimension tables
Fact constellation - However, fact constellations limit the possible queries for the warehouse.
Fact Constellation
Fact constellation is a set of tables that share some dimension tables. However, fact constellations limit the possible queries for the warehouse. (i.e., a dimension table shared by two fact tables.)
True/False: Normalization is always important and must be done.
False: Database normalization is NOT always effective - A fully normalized schema may be very inefficient for complicated read queries - joining multiple tables on the fly. - Normalization cannot be applied over heterogeneous sources
True/False: If a transaction encounters errors, there is nothing you can do once it is committed.
False: If a transaction encounters errors and must be canceled or rolled back, then all of the data modifications are erased
True/False: OLTP cycles = OLAP cycles in data warehousing approach
False: OLAP cycles are separate from OLTP cycles in DW approach
True/False: Use WHERE statements to read from a database
False: Use SELECT statements to read from a database
data items are atomic
First normal form
Typical Dimensions used in corporate DWs:
Fiscal Periods, Product Categories, Geographic Regions
- WHERE - GROUPBY - ORDER BY
Functional, popular qualifiers in SQL
Dimensions is something that can go in a ____________ SQL clause?
GROUPBY
Problem with query driven approach (business intelligence aspect)
Inefficient and potentially expensive if queries are frequent - Competes with local processing at sources
OLAP - Large volume of data
Large volume of data cover historical data from multiple operational data sets
ELT: Positive and Negative
NEG: storage > 2x more POS: - cloud computing, data centers => cheap storage and compute
Normalization makes it easier to maintain evolving ? between business entities
Normalization makes it easier to maintain evolving relationships between business entities
Mostly read-only
OLAP
cover historical data from multiple operational data sets
OLAP - Large volume of data
attributes fully depend on primary key
Second normal form
Snowflake Schema
Snowflake schema is a variation of star schema, in which the dimensional tables from a star schema are organized into a hierarchy by normalizing them.
Snowflake schema is a variation of star schema, in which the dimensional tables from a star schema are organized into a ? by ? them.
Snowflake schema is a variation of star schema, in which the dimensional tables from a star schema are organized into a hierarchy by normalizing them.
Star Schema
Star Schema consists of a fact table with a single table for each dimension
True/False: OLAP interactive response time is less than 2 seconds
TRUE
True/False: OLAP is mostly read-only
TRUE
True/False: A warehouse is a specialized DB
TRUE: OLTP - mostly updates - many small transactions - GB-TB of data - current state - Very selective accesses - Indexes are essential - Thousands of users (e.g., clerical users) OLAP - mostly reads - Queries are long and complex - TB-PB of data (order of magnitude higher) - historical - Sift through everything - Fast scans are essential - A few users (e.g., decision-makers, analysts)
True/False: OLAP has more data than OLTP
TRUE: about order of magnitude OLTP: GB-TB of data OLAP: TB-PB of data (order of magnitude higher)
Relational database modeling - normalization
Tables are decomposed into smaller tables until all attributes (column labels) are tightly coupled with their primary key
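The decomposition described above can be sketched in sqlite3. The tables and names (`emp_flat`, `employee`, `department`) are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized: dept_name repeats for every employee in the department.
CREATE TABLE emp_flat (emp_id INTEGER, emp_name TEXT, dept_id INTEGER, dept_name TEXT);
INSERT INTO emp_flat VALUES (1,'Ada',10,'Research'), (2,'Grace',10,'Research');

-- Decomposed: each attribute is now tightly coupled to its primary key;
-- dept_name is stored once, keyed by dept_id.
CREATE TABLE employee   (emp_id INTEGER PRIMARY KEY, emp_name TEXT, dept_id INTEGER);
CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
INSERT INTO employee   SELECT DISTINCT emp_id, emp_name, dept_id FROM emp_flat;
INSERT INTO department SELECT DISTINCT dept_id, dept_name FROM emp_flat;
""")

# The original rows are still recoverable with a join (lossless decomposition).
rows = conn.execute("""
    SELECT e.emp_id, e.emp_name, d.dept_name
    FROM employee e JOIN department d ON e.dept_id = d.dept_id
    ORDER BY e.emp_id
""").fetchall()
print(rows)  # [(1, 'Ada', 'Research'), (2, 'Grace', 'Research')]
```

This also illustrates the normalization trade-off from the deck: storage is saved, but read queries now require a join.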
OLAP - templates
Templates: data scientists repeat ad-hoc and canned queries (1) over time and (2) across data sets
Building a data warehouse (2: design..)
The design of a Data Warehouse involves the following steps: - Acquisition of data for the warehouse. - Ensuring that data storage meets the query requirements efficiently. - Giving full consideration to the environment in which the data warehouse resides.
Building a data warehouse (1: builders..)
The builders of a Data warehouse should take a *broad view* of the anticipated use of the warehouse. - The design should harmonize dimensions across the whole enterprise and multiple data sources - The design should support ad-hoc querying - An appropriate *schema* should be chosen that reflects the anticipated usage and the business model of the organization.
ETL - Load
The data must be loaded into the warehouse. - Proper design for refresh policy should be considered. - Data may come from different systems, language areas and time-zones. - *Order of loading is critical and semantic constraints must be obeyed* - Storing the data according to the data model of the warehouse - Creating and maintaining required data structures & access paths - Providing for time-variant data as new data are added - Supporting the updating of warehouse data - Refreshing the data - Purging data
all non-key attributes are completely independent of each other
Third normal form
Business intelligence
Use data collected during business operations to (1) improve efficiency (2) inform future strategies and directions (3) generate detailed and informative audits Data science with deep domain expertise and applied statistics (less compsci)
"A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context." -?
Barry Devlin, IBM Consultant
How to do effective drill down on query?
add a group by clause
Third normal form
all non-key attributes are completely independent of each other
Data warehousing and OLAP are key for what type of data scientists?
data analysts
First normal form
data items are atomic