Data Warehouses

Lakukan tugas rumah & ujian kamu dengan baik sekarang menggunakan Quizwiz!

True/False: A popular approach is to implement the multidimensional model on top of the relational model

True

True/False: DWs are modeled and structured differently, use different techniques for storage and retrieval and cater to a different set of users

True

True/False: If a transaction encounters errors and must be canceled or rolled back, then all of the data modifications are erased

True

True/False: If a transaction is successful, all of the data modifications made during the transaction are committed and become a permanent part of the database

True

True/False: The data must be extracted from multiple, heterogeneous sources.

True

True/False: Use SELECT statements to read from a database

True

True/False: Warehousing approach has high query performance

True: - not necessarily most current info, though!

True/False: Traditional Databases generally deal with two- dimensional data (similar to a spread sheet)

True: However, querying performance in a multi-dimensional data storage model (matrices) is much more efficient.

True/False: OLAP interactive response time is less than 10 seconds

True: NOTE: OLAP interactive response time is less than *2 seconds!!!!**

True/False: Normalization saves storage space

True: this is a benefit of normalization

Star vs Snowflake schemas

Two common multi-dimensional schemas are: star & snowflake 1. Star Schema: Consists of a fact table with a single table for each dimension 2. Snowflake Schema It is a variation of star schema, in which the dimensional tables from a star schema are organized into a hierarchy by normalizing them.

Use ? statements to write to a database

Use INSERT statements to write to a database

Use ? statements to read from a database

Use SELECT statements to read from a database

disadvantages of query driven approach (3)

1. Delayed answers during query processing - Slow or unavailable information sources - Complex filtering and integration - Dependent on all heterogeneous sources 2. Inefficient and potentially expensive if queries are frequent - Competes with local processing at sources - Request rate limiting to not overwhelm sources 3. In general, hasn't caught on in industry, except: - When anonymity matters (medical, student data) - When data freshness is paramount (airfare search)

Fact VS Dimension Table

1. Fact Table - Each tuple is a recorded fact. This fact contains some measured or observed variable (s) and identifies it with pointers to dimension tables. The fact table contains the data, and the dimensions to identify each tuple in the data. - A fact table is as an agglomerated (grouped) view of transaction data whereas each dimension table represents "master data" that those transactions belonged to. 2. Dimension Table - It consists of tuples of attributes of the dimension.

Multi-dimensional model (also called "dimensional model") includes two types of tables, which are?

1. Fact Table - Each tuple is a recorded fact. This fact contains some measured or observed variable (s) and identifies it with pointers to dimension tables. The fact table contains the data, and the dimensions to identify each tuple in the data. - A fact table is as an agglomerated (grouped) view of transaction data whereas each dimension table represents "master data" that those transactions belonged to. 2. Dimension Table - It consists of tuples of attributes of the dimension.

Three common multi-dimensional schemas are

1. Star Schema: Consists of a fact table with a single table for each dimension 2. Snowflake Schema It is a variation of star schema, in which the dimensional tables from a star schema are organized into a hierarchy by normalizing them. 3. Flat table schema - denormalized

Second normal form

attributes fully depend on primary key

True/False:

ELT is slowly but surely deprecating ETL

True/False: During ETL - Extract, data can have varying formats

False: Data must be formatted for consistency within the warehouse

common type of instruction for data management

OLAP = common workload clustering of similar instructions

OLAP requires an approach to ? sources

OLAP requires an approach to unify sources - collect & combine info - provides integrated view, uniform user interface - supports sharing

OLTP VS OLAP - mostly ______

OLTP - mostly updates OLAP - mostly reads

OLTP vs OLAP

OLTP - mostly updates - many small transactions - GB-TB of data - current state - Very selective accesses - Indexes are essential - Thousands of users (e.g., clerical users) OLAP - mostly reads - Queries are long and complex - TB-PB of data (order of magnitude higher) - historical - Sift through everything - Fast scans are essential - A few users (e.g., decision- makers, analysts)

when creating data, think of OLAP/OLTP?

OLTP - updating data OLAP - look patterns

OLTP VS OLAP - ___________ is essential

OLTP: index OLAP: fast scans

OLTP

Online Transaction Processing - records all business transactions as they occur; acts as monitor; detects process aborts - DB

What does OLAP stand for

Online analytics processing (OLAP)

Transactions combine multiple ? and ? into single unit of work

Transactions combine multiple reads and writes into single unit of work

DataWarehouse or Query-Driven: backflushing

datawarehouse

difficulty with warehouse approach in industry

difficulty: propagating updates to historical data

goal of OLAP

discover patterns in massive amounts of data collected

Databases ingest data....

efficiently and answer simple SELECT statements quickly

Data management facilitates ____ & effective data science

fast

Tradeoff for ELT

greater flexibility for worse storage BUT need cloud

OLAP - Low response time

interactive response time is less than 2 seconds

OLAP - mostly read only

looking for patterns; not updating data

In 3NF, all ? attributes are completely independent of each other

non-key

DataWarehouse or Query-Driven: freshest data

query driven

DataWarehouse or Query-Driven: handle updates to fact tables

query driven

Use SELECT statements to ____ from a database

read

If a transaction encounters errors and must be canceled or ?, then all of the data modifications are erased

rolled back

It is a variation of star schema, in which the dimensional tables from a star schema are organized into a hierarchy by normalizing them

snowflake schema

Consists of a fact table with a single table for each dimension

star schema

Transactions combine multiple reads and writes into single ?

unit of work

Use INSERT statements to ______ to a database

write

Only ? change state of Database

writes

Effective - Map reduce...

constrains programming choices

What is OLAP

*Online Analytical Processing* - describes the functionality for spectral manipulation and analysis of huge volumes of complex data - OLAP - Discover patterns in data collected

Database normalization is NOT always effective (3)

- A fully normalized schema may be very inefficient for complicated read queries - joining multiple tables on the fly. - Historical data are usually large with static relationships. - Normalization cannot be applied over heterogeneous sources.

Fully normalized star schema

- AKA snowflake schema - a variation of star schema, in which the dimensional tables from a star schema are organized into a hierarchy by normalizing them.

Fast

- Add hardware to handle bigger problems - Exploit efficient data structures

How to not protect anonymity in Query-driven approach (6)

- Authentication, captchas→ serious attackers are humans - Only returning aggregate results→ attackers will craft queries with only one item per group - Return few rows at a time or limit the rate of queries→ attackers are patient and will run a sequence of queries - Never return sensitive values (e.g. social security numbers) → attackers will obfuscate data before transmission - Add random noise on every answer→ attackers will repeat the same query to "cancel out" noise - Only return True/False answers→ attacker can eventually guess answer

Query Driven Approach

- Clients communicate with integration system - metadata --> integration system - integration system < -- > wrappers - wrappers < -- > source

Database views and data warehouses are alike in that they both are *read-only* extractions from databases However, they differ in the following ways..(5)

- Data Warehouses exist as persistent storage instead of being materialized on demand. - Data Warehouses are multi-dimensional, supporting multiple levels of aggregation. - Data Warehouses can be indexed for optimal performance. - Views cannot be indexed directly. - Data Warehouses target OLAP and integrate data contained in more than one database.

Warehousing approach

- Data warehouse communicates with clients and integration system - integration system < -- > ETL - ETL < -- > Source - Information integrated in advance - Stored in DW for direct querying and analysis

Derived (computed) attributes

- Derived (computed) attributes: Attributes are computed by operations on stored and derived values. - DataWarehouse Functionality that can be expected

Drill-down

- Drill-down: Increasing levels of detail are revealed (the complement of roll-up). - expands aggregate data into a finer grained view; bring new dimension to cube - SQL: GROUPBY operator - add - DataWarehouse Functionality that can be expected

ETL vs ELT

- ELT = don't store copy, similar to query driven - ETL = copy store raw data, transform on fly (ELT tradeoff: greater flexibility for worse storage)

Fact Table

- Each tuple is a recorded fact. This fact contains some measured or observed variable (s) and identifies it with pointers to dimension tables. The fact table contains the data, and the dimensions to identify each tuple in the data. - A fact table is as an agglomerated (grouped) view of transaction data whereas each dimension table represents "master data" that those transactions belonged to.

ELT

- Extract, Load, and then Transform (data lake) - ~2010 - copy and store raw third-party data - transform on the fly to any structured DB format - NEG: storage > 2x more - POS: (1) loud computing, data centers => cheap storage and compute (2) not tied to initial transformation scripts - slowly but surely deprecating ETL

Authentication, captchas

- How to not protect anonymity in Query-driven approach - serious attackers are humans

Data warehouses can take advantage of this feature (generally deal with two- dimensional data) as generally these are...

- Non volatile - The degree of predictability of the analysis that will be performed on them is high.

Database normalization benefits (3)

- Normalization saves storage space. - Normalization makes it easier to maintain evolving relationships between business entities. - A relational database system is effective and efficient for transactional databases - update performance is good.

DataWarehouse Functionality that can be expected (7)

- Pivot: Cross tabulation (also referred to as rotation) is performed. - Roll-up (also Drill-up): Data is summarized with increasing generalization (for example, weekly to quarterly to annually). - Drill-down: Increasing levels of detail are revealed (the complement of roll-up). - Slice and dice: Projection operations are performed on the dimensions. - Sorting: Data is sorted by ordinal value. - Selection: Data is filtered by value or range. - Derived (computed) attributes: Attributes are computed by operations on stored and derived values.

Pivot

- Pivot: Cross tabulation (also referred to as rotation) is performed. - turn; look at dif face of cube - SQL: GROUPBY operator - remove(?) - DataWarehouse Functionality that can be expected

What may work to fix query driven approach

- Prohibit access to certain columns (e.g. SSN)→ attacker can't see what no query can access - Any part of the answer occurs k times in the database→ attacker can't distinguish one record among k records - Add random noise on every answer * and * - return the same answer to future queries:→ attackers can't know how far they are from true answer

Slice and Dice

- Projection operations are performed on the dimensions. - slice, look at subsection - SQL - SELECT clause - DataWarehouse Functionality that can be expected

Roll-up (also Drill-up)

- Roll-up (also Drill-up): Data is summarized with increasing generalization (for example, weekly to quarterly to annually). - SQL: GROUPBY operator - pull out - squish cube - DataWarehouse Functionality that can be expected

Selection

- Selection: Data is filtered by value or range. - DataWarehouse Functionality that can be expected

Sorting

- Sorting: Data is sorted by ordinal value. - DataWarehouse Functionality that can be expected

ETL - Transform

- The data must be fitted into the data model of the warehouse. - Data may have to be converted from its source model into a multi-dimensional format

Functional, popular qualifiers in SQL

- WHERE - GROUPBY - ORDER

Query drive approach hasn't caught on in industry, except: (2)

- When anonymity matters (medical, student data) - When data freshness is paramount (airfare search)

Backflushing

- ability to identify patterns

Data Warehouse processing includes...(3)

- cleaning and reformatting data - ETL (EXTRACT < TRANSFORM < LOAD) - OLAP - Data analytics and mining

OLAP over heterogeneous sources

- many sources of data - hard to avoid duplicate or redundant info - OLAP requires an approach to unify sources

Benefits of Warehousing approach (4)

1. high query performance - not necessarily most current! 2. doesn't interfere with local processing at sources - complex queries, OLTP at info sources 3. info copied at warehouse - can modify, annotate, summarize, restructure, etc. - can store historical info - security; no auditing 4. HAS caught on in industry - difficulty: propagating updates to historical data low response time => higher throughput

When doing OLAP queries in multi-dimensions, we typically think of

3d "Data Cube" - hyper cube = OLAP

A relational database system is effective and efficient for ? databases

A relational database system is effective and efficient for *transactional databases* - update performance is good.

In DataWarehousing, we find patterns and want to update the fact table. How can we do this?

BACKFLUSHING

Back flushing

Back flushing: refers to upgrading the source data by returning cleaned data

OLAP - complex

Complex multiple SQL conditions (group-by, where, joins, etc.)

Approach where information integrated in advance (query or dw)

DATA WAREHOUSING

Data Warehouse (DW) was proposed as a new type of database management system which would keep no transactional data but only summarized ? information for decision making purposes

Data Warehouse (DW) was proposed as a new type of database management system which would keep no transactional data but only summarized *historical* information for decision making purposes

Data Warehouse (DW) was proposed as...

Data Warehouse (DW) was proposed as a new type of database management system which would keep no transactional data but only summarized historical information for decision making purposes

? facilitates fast & effective data science

Data management facilitates fast & effective data science

Database views and data warehouses are alike in that they both are ? extractions from databases

Database views and data warehouses are alike in that they both are *read-only* extractions from databases However, they differ in the following ways: - Data Warehouses exist as persistent storage instead of being materialized on demand. - Data Warehouses are multi-dimensional, supporting multiple levels of aggregation. - Data Warehouses can be indexed for optimal performance. - Views cannot be indexed directly. - Data Warehouses target OLAP and integrate data contained in more than one database. each sql groupby is a new dimension of market/data

OLAP vs DB

Databases - Ingest data efficiently and answer simple SELECT statements quickly OLAP - Discover patterns in data collected

DataWarehouse or Query-Driven: ETL / ELT

Datawarehouse

DataWarehouse or Query-Driven: interactive response time

Datawarehouse approach

most common approach

Datawarehousing approach

ETL - Extract (definition and 3 key points)

Extract = Acquisition of data for the warehouse 1. The data must be extracted from multiple, heterogeneous sources. 2. Data must be formatted for consistency within the warehouse 3. The data must be cleaned to ensure validity. - Difficult to automate cleaning process. - Back flushing: refers to upgrading the source data by returning cleaned data

ETL

Extract, Transform, Load

True/False: OLAP is mostly write-only

FALSE Mostly read-only

True/False: Databases ingest data efficiently and answer simple GROUPBY statements quickly

FALSE: Databases ingest data efficiently and answer simple GROUPBY statements quickly

True/False: In 3NF, all non-key attributes are completely dependent of each other

FALSE: all non-key attributes are completely *independent* of each other

set of tables that share some dimension tables

Fact constellation - However, fact constellations limit the possible queries for the warehouse.

Fact Constellation

Fact constellation is a set of tables that share some dimension tables. However, fact constellations limit the possible queries for the warehouse. (ie, dimension table shared by 2 fact tables.

True/False: Normalization is always important and must be done.

False: Database normalization is NOT always effective - A fully normalized schema may be very inefficient for complicated read queries - joining multiple tables on the fly. - Normalization cannot be applied over heterogeneous sources

True/False: If a transaction encounters errors, there is nothing you can do once it is committed.

False: If a transaction encounters errors and must be canceled or rolled back, then all of the data modifications are erased

True/False: OLTP cycles = OLAP cycles in data warehousing approach

False: OLAP cycles are separate from OLTP cycles in DW approach

True/False: Use WHERE statements to read from a database

False: Use SELECT statements to read from a database

data items are atomic

First normal form

Typical Dimensions used in corporate DWs:

Fiscal Periods, Product Categories, Geographic Regions

- WHERE - GROUPBY - ORDER

Functional, popular qualifiers in SQL

Dimensions is something that can go in a ____________ SQL clause?

GROUPBY

Problem with query driven approach (business intelligence aspect)

Inefficient and potentially expensive if queries are frequent - Competes with local processing at sources

OLAP - Large volume of data

Large volume of data cover historical data from multiple operational data sets

ELT: Positive and Negative

NEG: storage > 2x more POS: - loud computing, data centers => cheap storage and compute

Normalization makes it easier to maintain evolving ? between business entities

Normalization makes it easier to maintain evolving relationships between business entities

Mostly read-only

OLAP

cover historical data from multiple operational data sets

OLAP - Large volume of data

attributes fully depend on primary key

Second normal form

Snowflake Schema

Snowflake schema is a variation of star schema, in which the dimensional tables from a star schema are organized into a hierarchy by normalizing them.

Snowflake schema is a variation of star schema, in which the dimensional tables from a star schema are organized into a ? by ? them.

Snowflake schema is a variation of star schema, in which the dimensional tables from a star schema are organized into a hierarchy by normalizing them.

Star Schema

Star Schema consists of a fact table with a single table for each dimension

True/False: OLAP interactive response time is less than 2 seconds

TRUE

True/False: OLAP is mostly read-only

TRUE

True/False: A warehouse is a specialized DB

TRUE: OLTP - mostly updates - many small transactions - GB-TB of data - current state - Very selective accesses - Indexes are essential - Thousands of users (e.g., clerical users) OLAP - mostly reads - Queries are long and complex - TB-PB of data (order of magnitude higher) - historical - Sift through everything - Fast scans are essential - A few users (e.g., decision- makers, analysts)

True/False: OLAp has more data than OLTP

TRUE: about order of magnitude OLTP: GB-TB of data OLAP: TB-PB of data (order of magnitude higher)

Relational database modeling - normalization

Tables are decomposed into smaller tables until all attributes (column labels) are tightly coupled with their primary key

OLAP - templates

Templates data scientists repeat ad-hoc and canned queries (1) over time and (2) across data sets

Building a data warehouse (2: design..)

The Design of a Data Warehouse involves following steps. - Acquisition of data for the warehouse. - Ensuring that Data Storage meets the query requirements efficiently. - Giving full consideration to the environment in which the data warehouse resides.

Building a data warehouse (1: builders..)

The builders of Data warehouse should take a *broad view* of the anticipated use of the warehouse. - The design should harmonize dimensions across the whole enterprise and multiple data sources - The design should support ad-hoc querying - An appropriate !schema! should be chosen that reflects the anticipated usage and the business model of the organization.

ETL - Load

The data must be loaded into the warehouse. - Proper design for refresh policy should be considered. - Data may come from different systems, language areas and time-zones. - *Order of loading is critical and semantic constraints must be obeyed* - Storing the data according to the data model of the warehouse - Creating and maintaining required data structures & access paths - Providing for time-variant data as new data are added - Supporting the updating of warehouse data. - Refreshing the data Purging data

all non-key attributes are completely independent of each other

Third normal form

Business intelligence

Use data collected during business operations to (1) improve efficiency (2) inform future strategies and directions (3) generate detailed and informative audits Data science with deep domain expertise and applied statistics (less compsci)

"A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context." -?

What is a data warehouse? " " -- Barry Devlin, IBM Consultant

How to do effective drill down on query?

add a group by clause

Third normal form

all non-key attributes are completely independent of each other

Data warehousing and OLAP are key for what type of data scientists?

data analysts

First normal form

data items are atomic


Set pelajaran terkait

Leveraging IS in Business Practice Exam 2

View Set

Chap. 14 health problems of early childhood

View Set

Iggy Chapter 26: Care of Patients with Burns

View Set

Chapter 8 - Organization Structure & Design

View Set

Chapter 35 Geriatric Emergencies

View Set

Chapter 25 Asepsis and Infection Control

View Set

Business management review chapter 8 test

View Set