Final Exam: Chp 6, 7, 9, 11, SQL Notes
Characteristics of big data
Volume, variety, velocity, veracity, value
Data lake
a large integrated repository for internal and external data that does not follow a predefined schema. capture everything, dive in anywhere, flexible access
Business Intelligence
a set of methodologies, processes, architectures, and technolgoeis that transform raw data into meaningful and useful information
Data warehouse
a subject-oriented, integrated, time-variant, nonupdatable collection of data used in support of management decision-making processes.
Informational systems
a system designed to support decision making based on historical point-in-time and prediction data for complex queries or data-mining applications. primary purpose = support the managerial decision making
Operational systems
a system that is used to run a business in real time, based on current data; also called a system of record. primary purpose = run the business on a current basis.
DDL: Data Definition Language
commands that define a database, including creating, altering, and dropping tables and establishing constraints. Helps with physical design and maintenance. CREATE/DROP/ALTER
Referential integrity
constraint that ensures that foreign key values of a table must match primary key values of a related table in 1:M relationships. restricting: deletes of primary records, updates of primary records, inserts of dependent records
Materialized View
copy or replication of data, data actually stored, must be refreshed periodically to match corresponding base tables
Schema on read
data model determined later, depends on how you want to use it
Dimension tables
denormalized to maximize performance. contain descriptions about the subjects of the business
Size of the fact table
depends on the number of dimensions and the grain of the fact table. Number of rows = product of number of possible values for each dimension associated with the fact table.
Surrogate Keys
dimension table keys should be surrogate (non intelligent and non-business related) because business keys may change over time and surrogate are simpler and shorter.
text mining
discovering meaningful information algorithmically based on computational analysis of unstructured textual information
Noncorrelated subqueries
do not depend on data from the outer query. Executed once for the entire outer query
Status Data
ex. before or after image... snapshot of what it is
BI&A 1.0
focus on structured quantitative data largely from relational databases
Dimension hierarchies
help to provide levels of aggregation for users wanting summary information in a data warehouse
Slowly changing dimensions (SCD)
how to maintain knowledge of the past according to Kimballs approaches Type 1: just replace old data with new (lose historical data) Type 2: create a new dimension table row each time the dimension object changes, with all dimension characteristics at the time of change. Most common approach. Type 3: for each changing attribute, create a current value field and several old-valued fields (multivalued)
Issue of big data
huge volume, often unstructured
analytics data management infrastructure
important criteria: scalability, parallelism, low latency, and data optimization. These criteria ensure speed, availability, and access
BI&A 3.0
include data form mobile devices (location, sensors etc) as well as Internet of Things
BI&A 2.0
include data from the web (web interaction logs, customer reviews, social media)
Union Join
includes all data from each table that was joined
embedded sql
including hard-coded sql statements in a program written in another language such as c or java
Issues with company-wide view
inconsistent key structures, synonyms, free-form vs structured fields, inconsistent data values, missing data
Need for data warehousing
integrated, company-wide view of high-quality information (from disparate databases). Separation of operational and informational systems and data (for improved performance)
Data Mining Tools
knowledge discovery using a sophisticated blend of techniques from traditional statistics, artificial intelligence, and computer graphics. Goals: explanatory, confirmatory, and exploratory
Grain of the Fact Table
level of detail in a fact table. Transactional grain is the finest level) -> aggregated grain (more summarized) Finer grains = better market basket analysis capability Finer grain = more dimension tables, more rows in fact table. In web-based commerce, finest granularity is a click.
Correlated subqueries
make use of data from the outer query. can use the EXISTS operator. Executed once for each row returned by the outer query
Vector aggregate
multiple values returned from sql query with aggregate function (GROUP BY)
Normalizing dimension tables
multivalued dimensions: facts qualified by a set of values for the same business subject. Normalization involved creating a table for an associated entity between dimensions. Hierarchies: sometimes a dimension forms a natural, fixed depth hierarchy. Design options: include all info for each level in a single denormalized table. Normalize the dimension into a nested set of 1:M table relationships
Duration of the database
natural duration: 13 months or 5 quarters. financial institutions may need longer duration. older data is more difficult to source and cleanse.
Periodic data
never physically altered or deleted once they have been added to the store
Organizational trends motivating data warehouses
no single system of records, multiple systems not synchronized, organizational need to analyze activities in a balanced way, customer relationship management, supplier relationship management
Benefits of Standardized Relational Language
reduced training costs, productivity, application portability, application longevity, reduced dependence on a single vendor, cross-system communication
Join
relational operation that causes 2 or more tables with a common domain to be combined into a single table or view
Data visualization
representation of data in graphical and multimedia formats for human analysis. Without showing precise values, graphs and charts can depict relationships in the data.
Disadvantages of views
use processing time each time view is referenced; may or may not be directly updateable
Boolean Query
By default, processing order of Boolean operators is NOT, then AND, and then OR
DCL: Data Control Language
Commands that control a database, including administering privileges and committing data GRANT/ADD/REVOKE
DML: Data Manipulation Language
Commands that maintain and query a database INSERT/UPDATE/DELETE/SELECT
Schema Definition
Control processing/storage efficiency=choice of indexes & file organizations . Creating indexes = speed up random/sequential access to base table data
Star Schema
Dimensional model implemented as this. has 1:N relationship between dimension and fact tables. Excellent for ad-hoc queries, but bad for online transaction processing.
RDBMS
Relational database management systems: database management system that implements a relational data model, one where manages data are stored in a collection of tables in which all relationships are represented by common values in related tables
Steps in Table Creation
1. identify data types for attributes 2. identify columns that can and cannot be null 3. identify columns that must be unique (candidate keys) 4. identify primary key-foreign key mates 5. Determine default values 6. Identify constraints on columns (domain specifications) 7. Create the table and associated indexes
Derived Data
Objectives: ease of use for decision support applications. Fast response to predefined user queries, data mining capabilities Characteristics: detailed (mostly periodic) data, aggregate, distributed
Guidelines for better query design
understand how indexes are used, don't nest one query inside another query, don't combine a query with itself, combine update operations, consider the total query processing time
10 ESSENTIAL RULES FOR DIMENSIONAL MODELING
use atomic facts, create single-process fact tables, include a date dimension for each fact table, enforce consistent grain, disallow null keys in fact tables, honor hierarchies, decode dimension tables, use surrogate keys, conform dimensions, balance requirements with actual data
Use of prescriptive analytics
use of optimization and simulation tools for prescribing the best action to take. Example: making trading decisions in securities and stock market. Making pricing decisions for airlines and hotels. making product recommendations ex. amazon/netflix. Often requires predictive analytics and game theory
Prescriptive analytics
uses results of predictive analytics along with optimization and simulation tools to recommend actions that will lead to a desired outcome
Using and Defining Views
views provide users controlled access to tables
Use of descriptive analytics
was original emphasis of BI. reporting of aggregate quantitative query results. tabular or data visualization displays. dashboard=a few key indicators. scorecard=like a dashboard, but broader range
Dimensional model
Most common data model. Implemented as star schema.
Variations of the star schema
Multiple facts tables: can improve performance, store facts for diff combos of dimensions Factless facts tables: no nonkey data, but foreign keys for associated dimensions. Used for tracking events and inventory coverage.
Clauses of the select statement
Select, From, Where, Group By, Having, Order By
SQl
Structured Query Language: standard for relational database management systems
Data Dictionary Facilities
System tables that store metadata. users usually can view some of these tables. users are restricted from updating them.
Data Mart
a data warehouse that is limited in scope
Event data
a database action (create/update/delete) that results form a transaction
Transaction
a discrete unit of work that must be completely processed or not processed at all. may involve multiple updates.
Outer Join
a join in which rows that do not have matching values in common columns are nonetheless included in the result table
Dynamic View
a virtual table created dynamically upon request by a user. No data actually stored; instead data from base table made available to user. Based on sql select statement on base tables or other views
dynamic sql
ability for an application program to generate sql code on the fly as the application is running
Business performance management (BPM)
allow managers to measure, monitor, and manage key activities and processes to achieve organizational goals. dashboards are often used to provide an information system in support of bpm.
Helper table
an associated entity that implements a M:N relationship between dimension and fact
Natural Join
an equi-join in which one of the duplicate columns is eliminated in the result table
predictive analytics
applies statistical and computational methods and models to data regarding past and current events to predict what might happen in the future
Conformed Dimension
associated with multiple fact tables. 2 fact tables -> 2 connected start schemas
Big data and analytics impact: applications
business, e-government and politics, science and technology, smart health and well-being, security and public safety
reasons to embed sql in 3gl
can create a more flexible, accessible interface for the user. Possible performance improvement. Database security improvement...grand access only to the application instead of users
Transient data
changes to existing records are written over pervious records, thus destroying the previous data content
Big data and analytics impact: social implications
personal privacy vs collective benefit, ownership and access, data/algorithm quality and reuse, transparency and validation, demands for workforce capabilities and education
Subquery
placing an inner query inside an outer query. Can be noncorrelated or correlated
Schema on write
preexisting data model, how traditional databases are designed (relational databases)
Fact tables
provides statistics for sales broken down by product, period and store dimensions. Contain factual or quantitative data. contain time-period data. Date dimensions are important
Limitations of independent data marts
separate etl process for each data mart (redundant data and processing). Inconsistency between data marts. difficult to drill down for related facts between data marts. excessive scaling costs are more applications are built. high cost for obtaining consistency between marts
Advantages of views
simply with query commands, use little storage space, assist with data security, enhance programming productivity, establish physical data independence, provide customized view for user
Scalar aggregate
single value returned from sql query with aggregate function
Purpose of SQL Standard
specify syntax/semantics for data definitions and manipulation; define data structures and basic operations. Enable portability of database definition and application modules.
Use of predictive analytics
statistical and computational methods that use data regarding past and current events to form models regarding what might happen in the future. Examples: classification trees, linear and logistic regression analysis, machine learning, neural networks, time series analysis
Equi-join
a join in which the joining condition is based on equality between values in the common columns; common columns appear redundantly
Big Data
data that exist in very large volumes and many different varieties (data types) and that need to be processed at a very high velocity (speed)
Descriptive analytics
describes the past status of the domain of interest using a variety of tools through techniques such as reporting, data visualization, dashboards, and scorecards
The future of data warehousing:
integration with big data and analytics. Speed of processing, cost of data storage, unstructured data (new technologies)
More characteristics of big data
schema on read, rather than schema on write & data lake
Analytics
systematic analysis and interpretation of data- typically using mathematical, statistical, and computational tools to improve our understanding of a real-world domain. Encompasses more than BI. transform data to useful form, infrastructure for analysis, data cleanup processes and user interfaces