Final Exam: Chp 6, 7, 9, 11, SQL Notes

Pataasin ang iyong marka sa homework at exams ngayon gamit ang Quizwiz!

Characteristics of big data

Volume, variety, velocity, veracity, value

Data lake

a large integrated repository for internal and external data that does not follow a predefined schema. capture everything, dive in anywhere, flexible access

Business Intelligence

a set of methodologies, processes, architectures, and technolgoeis that transform raw data into meaningful and useful information

Data warehouse

a subject-oriented, integrated, time-variant, nonupdatable collection of data used in support of management decision-making processes.

Informational systems

a system designed to support decision making based on historical point-in-time and prediction data for complex queries or data-mining applications. primary purpose = support the managerial decision making

Operational systems

a system that is used to run a business in real time, based on current data; also called a system of record. primary purpose = run the business on a current basis.

DDL: Data Definition Language

commands that define a database, including creating, altering, and dropping tables and establishing constraints. Helps with physical design and maintenance. CREATE/DROP/ALTER

Referential integrity

constraint that ensures that foreign key values of a table must match primary key values of a related table in 1:M relationships. restricting: deletes of primary records, updates of primary records, inserts of dependent records

Materialized View

copy or replication of data, data actually stored, must be refreshed periodically to match corresponding base tables

Schema on read

data model determined later, depends on how you want to use it

Dimension tables

denormalized to maximize performance. contain descriptions about the subjects of the business

Size of the fact table

depends on the number of dimensions and the grain of the fact table. Number of rows = product of number of possible values for each dimension associated with the fact table.

Surrogate Keys

dimension table keys should be surrogate (non intelligent and non-business related) because business keys may change over time and surrogate are simpler and shorter.

text mining

discovering meaningful information algorithmically based on computational analysis of unstructured textual information

Noncorrelated subqueries

do not depend on data from the outer query. Executed once for the entire outer query

Status Data

ex. before or after image... snapshot of what it is

BI&A 1.0

focus on structured quantitative data largely from relational databases

Dimension hierarchies

help to provide levels of aggregation for users wanting summary information in a data warehouse

Slowly changing dimensions (SCD)

how to maintain knowledge of the past according to Kimballs approaches Type 1: just replace old data with new (lose historical data) Type 2: create a new dimension table row each time the dimension object changes, with all dimension characteristics at the time of change. Most common approach. Type 3: for each changing attribute, create a current value field and several old-valued fields (multivalued)

Issue of big data

huge volume, often unstructured

analytics data management infrastructure

important criteria: scalability, parallelism, low latency, and data optimization. These criteria ensure speed, availability, and access

BI&A 3.0

include data form mobile devices (location, sensors etc) as well as Internet of Things

BI&A 2.0

include data from the web (web interaction logs, customer reviews, social media)

Union Join

includes all data from each table that was joined

embedded sql

including hard-coded sql statements in a program written in another language such as c or java

Issues with company-wide view

inconsistent key structures, synonyms, free-form vs structured fields, inconsistent data values, missing data

Need for data warehousing

integrated, company-wide view of high-quality information (from disparate databases). Separation of operational and informational systems and data (for improved performance)

Data Mining Tools

knowledge discovery using a sophisticated blend of techniques from traditional statistics, artificial intelligence, and computer graphics. Goals: explanatory, confirmatory, and exploratory

Grain of the Fact Table

level of detail in a fact table. Transactional grain is the finest level) -> aggregated grain (more summarized) Finer grains = better market basket analysis capability Finer grain = more dimension tables, more rows in fact table. In web-based commerce, finest granularity is a click.

Correlated subqueries

make use of data from the outer query. can use the EXISTS operator. Executed once for each row returned by the outer query

Vector aggregate

multiple values returned from sql query with aggregate function (GROUP BY)

Normalizing dimension tables

multivalued dimensions: facts qualified by a set of values for the same business subject. Normalization involved creating a table for an associated entity between dimensions. Hierarchies: sometimes a dimension forms a natural, fixed depth hierarchy. Design options: include all info for each level in a single denormalized table. Normalize the dimension into a nested set of 1:M table relationships

Duration of the database

natural duration: 13 months or 5 quarters. financial institutions may need longer duration. older data is more difficult to source and cleanse.

Periodic data

never physically altered or deleted once they have been added to the store

Organizational trends motivating data warehouses

no single system of records, multiple systems not synchronized, organizational need to analyze activities in a balanced way, customer relationship management, supplier relationship management

Benefits of Standardized Relational Language

reduced training costs, productivity, application portability, application longevity, reduced dependence on a single vendor, cross-system communication

Join

relational operation that causes 2 or more tables with a common domain to be combined into a single table or view

Data visualization

representation of data in graphical and multimedia formats for human analysis. Without showing precise values, graphs and charts can depict relationships in the data.

Disadvantages of views

use processing time each time view is referenced; may or may not be directly updateable

Boolean Query

By default, processing order of Boolean operators is NOT, then AND, and then OR

DCL: Data Control Language

Commands that control a database, including administering privileges and committing data GRANT/ADD/REVOKE

DML: Data Manipulation Language

Commands that maintain and query a database INSERT/UPDATE/DELETE/SELECT

Schema Definition

Control processing/storage efficiency=choice of indexes & file organizations . Creating indexes = speed up random/sequential access to base table data

Star Schema

Dimensional model implemented as this. has 1:N relationship between dimension and fact tables. Excellent for ad-hoc queries, but bad for online transaction processing.

RDBMS

Relational database management systems: database management system that implements a relational data model, one where manages data are stored in a collection of tables in which all relationships are represented by common values in related tables

Steps in Table Creation

1. identify data types for attributes 2. identify columns that can and cannot be null 3. identify columns that must be unique (candidate keys) 4. identify primary key-foreign key mates 5. Determine default values 6. Identify constraints on columns (domain specifications) 7. Create the table and associated indexes

Derived Data

Objectives: ease of use for decision support applications. Fast response to predefined user queries, data mining capabilities Characteristics: detailed (mostly periodic) data, aggregate, distributed

Guidelines for better query design

understand how indexes are used, don't nest one query inside another query, don't combine a query with itself, combine update operations, consider the total query processing time

10 ESSENTIAL RULES FOR DIMENSIONAL MODELING

use atomic facts, create single-process fact tables, include a date dimension for each fact table, enforce consistent grain, disallow null keys in fact tables, honor hierarchies, decode dimension tables, use surrogate keys, conform dimensions, balance requirements with actual data

Use of prescriptive analytics

use of optimization and simulation tools for prescribing the best action to take. Example: making trading decisions in securities and stock market. Making pricing decisions for airlines and hotels. making product recommendations ex. amazon/netflix. Often requires predictive analytics and game theory

Prescriptive analytics

uses results of predictive analytics along with optimization and simulation tools to recommend actions that will lead to a desired outcome

Using and Defining Views

views provide users controlled access to tables

Use of descriptive analytics

was original emphasis of BI. reporting of aggregate quantitative query results. tabular or data visualization displays. dashboard=a few key indicators. scorecard=like a dashboard, but broader range

Dimensional model

Most common data model. Implemented as star schema.

Variations of the star schema

Multiple facts tables: can improve performance, store facts for diff combos of dimensions Factless facts tables: no nonkey data, but foreign keys for associated dimensions. Used for tracking events and inventory coverage.

Clauses of the select statement

Select, From, Where, Group By, Having, Order By

SQl

Structured Query Language: standard for relational database management systems

Data Dictionary Facilities

System tables that store metadata. users usually can view some of these tables. users are restricted from updating them.

Data Mart

a data warehouse that is limited in scope

Event data

a database action (create/update/delete) that results form a transaction

Transaction

a discrete unit of work that must be completely processed or not processed at all. may involve multiple updates.

Outer Join

a join in which rows that do not have matching values in common columns are nonetheless included in the result table

Dynamic View

a virtual table created dynamically upon request by a user. No data actually stored; instead data from base table made available to user. Based on sql select statement on base tables or other views

dynamic sql

ability for an application program to generate sql code on the fly as the application is running

Business performance management (BPM)

allow managers to measure, monitor, and manage key activities and processes to achieve organizational goals. dashboards are often used to provide an information system in support of bpm.

Helper table

an associated entity that implements a M:N relationship between dimension and fact

Natural Join

an equi-join in which one of the duplicate columns is eliminated in the result table

predictive analytics

applies statistical and computational methods and models to data regarding past and current events to predict what might happen in the future

Conformed Dimension

associated with multiple fact tables. 2 fact tables -> 2 connected start schemas

Big data and analytics impact: applications

business, e-government and politics, science and technology, smart health and well-being, security and public safety

reasons to embed sql in 3gl

can create a more flexible, accessible interface for the user. Possible performance improvement. Database security improvement...grand access only to the application instead of users

Transient data

changes to existing records are written over pervious records, thus destroying the previous data content

Big data and analytics impact: social implications

personal privacy vs collective benefit, ownership and access, data/algorithm quality and reuse, transparency and validation, demands for workforce capabilities and education

Subquery

placing an inner query inside an outer query. Can be noncorrelated or correlated

Schema on write

preexisting data model, how traditional databases are designed (relational databases)

Fact tables

provides statistics for sales broken down by product, period and store dimensions. Contain factual or quantitative data. contain time-period data. Date dimensions are important

Limitations of independent data marts

separate etl process for each data mart (redundant data and processing). Inconsistency between data marts. difficult to drill down for related facts between data marts. excessive scaling costs are more applications are built. high cost for obtaining consistency between marts

Advantages of views

simply with query commands, use little storage space, assist with data security, enhance programming productivity, establish physical data independence, provide customized view for user

Scalar aggregate

single value returned from sql query with aggregate function

Purpose of SQL Standard

specify syntax/semantics for data definitions and manipulation; define data structures and basic operations. Enable portability of database definition and application modules.

Use of predictive analytics

statistical and computational methods that use data regarding past and current events to form models regarding what might happen in the future. Examples: classification trees, linear and logistic regression analysis, machine learning, neural networks, time series analysis

Equi-join

a join in which the joining condition is based on equality between values in the common columns; common columns appear redundantly

Big Data

data that exist in very large volumes and many different varieties (data types) and that need to be processed at a very high velocity (speed)

Descriptive analytics

describes the past status of the domain of interest using a variety of tools through techniques such as reporting, data visualization, dashboards, and scorecards

The future of data warehousing:

integration with big data and analytics. Speed of processing, cost of data storage, unstructured data (new technologies)

More characteristics of big data

schema on read, rather than schema on write & data lake

Analytics

systematic analysis and interpretation of data- typically using mathematical, statistical, and computational tools to improve our understanding of a real-world domain. Encompasses more than BI. transform data to useful form, infrastructure for analysis, data cleanup processes and user interfaces


Kaugnay na mga set ng pag-aaral

APUSH Unit one & two Short answer questions

View Set

Pharmacology Chapter 15 Assignment

View Set