ITP 487 - Midterm

Ace your homework & exams now with Quizwiz!

outer JOINs

- return rows matching join condition - also return rows with unmatched attribute values for tables to be joined - full outer JOINs = unions

left outer join

- returns rows matching join condition - returns rows in left side table with unmatched values

right outer join

- returns rows matching join condition - returns rows in right side table with unmatched values

foreign key

primary key of one relation placed in another relation to form a link between the relations - single column or composite key - key values are foreign to relation in which they appear as foreign key values

Need for Data Warehousing

Integrated, company-wide view of high-quality information (from disparate databases) Separation of operational and informational systems and data (for improved performance)

SQL

Structured Query Language

data manipulation commands

INSERT, SELECT, COMMIT, UPDATE, ROLLBACK, DELETE

increasing importance of services

Manufacturing (especially of durable goods) is more strongly affected by recessions. The economy is based more on services now (think Yelp?), decreasing the effect of the business cycle on GDP.

inner join

Most common type of join; includes rows in the query only when the joined field matches records in both tables.

informational systems

OLAPs (online analytical processing systems) - provide place for data storage and preparation for analytics - hold large volumes of historical data - archived/aggregated data which may have come from transactions - i.e. Data Warehouse

Star Schema

The most commonly used and the simplest style of dimensional modeling - central to the model is the fact table

Denormalization

The process of combining normalized relations into physical tables based on affinity of use of rows and fields; makes data retrieval faster

Second Normal Form (2NF)

The second stage in the normalization process, in which a relation is in 1NF and there are no partial dependencies (dependencies in only part of the primary key).

mobility

huge boom in data and money

How do BI and data analytics differ?

- BI helps in making business decisions based on past results - Data Analytics helps in making predictions that will help in the future

characteristics of transactional systems

- availability should be as close to 100% as possible -detailed down to individual transactions - updatable - process transactions quickly - store current information, archive older data - support organization's business functions - require concurrency management to deal w/ users who try to access same data at same time - support business processes - process small uniform transactions - optimized for quick writing and storage - data are functionally oriented

many-to-many relationship

- can't be implemented as such in relational model - can be changed into multiple one-to-many relationships - create composite entity

master data

- data associated with various processes, but which rarely changes - same data used across the enterprise - ex: customer, product, vendor

persistent data

- data is physically stored - accessed directly - e.g. database table

virtual data

- data stored elsewhere - virtual data has instructions on how to access in real time - e.g. database view - combine multiple tables into one by maintaining view but doing work underneath (extraction layer)

Data Type (aka Class)

- defines structures of instances of data objects - each attribute type is defined - e.g. defines Student as: SID, name, address, etc.

how to snowflake

- extract attributes from dimension tables, places them into new tables - dimension table should consist of 2+ sets of attributes which define information at different grains (detail) - sets of attributes in same dimension table are populated by different source systems

multi-dimensional data

- facts surrounded by dimensions for analysis - e.g. OLAPs (like DWs) - allows you to slice and dice data, like revenue by dimensions

tabular data

- first row defines class (headers) - subsequent rows are instances of that data type - e.g. spreadsheets, relational DBs

SAP HANA is high performing and efficient

- in-memory - very fast but expensive RAM - uses columnar storage - creates a dictionary for every unique value - columns only hold keys - dictionary compression - attribute values are stored in a dictionary

examples of data sources

- legacy systems (older tech) - web services (HTTP, SOAP, XML, web pages) - crawlers and info agents (web scraping/crawling) - social media (text/clicks, application programming interface (API)) - sensors (equipment - status, alerts, safety monitoring, input/output measurements, prediction of failure, meter reading, diagnosis/repair)

historization of data

- looking at dates valid from mmddyyyy to mmddyyyy

characteristics of the explosion of data

- more users - socialization of everything - entertainment shift - mobility - internet of things - sensors, devices - cloud and longevity of data - low cost storage

structured data source

- organized or modeled in rows/columns - values are proscribed (each cell contains values to a given input) - data more easily scanned/examined - understandable by computers - spreadsheets, flat files, databases - data integrity is of concern (can be changed) - limited file size

transactional data

- reflects the data resulting from executing a business process (a transaction) - includes data such as dates, prices, delivery terms, etc - ex: sales orders, purchase orders, rental booking

data mining techniques

- regression - decision trees - neural networks - association analysis - clustering - classification - genetic algorithms

data storage structures

- relational DBs = transactional systems - DWs = informational systems; DWs ARE relational - cannot be combined, should be used as OLTP or OLAP (one or the other)

relational database

- relationships created using keys - PK, FK

Data Instance (aka Object)

- specific instance of a specific type - e.g. a real student: SID is xxxxxxxxxx, Name is John Smith, etc.

data warehouse structure

- takes large quantities of data, stores for easier access - denormalized relational DB (some redundant data added back into DB to reduce number of tables to speed up queries) - read-only, so modification anomalies aren't relevant - not going to delete transactions - single source of truth - joining tables can be expensive, in DWs data is already stored like a JOIN

unstructured data source

- unorganized - data can be of varying lengths - values may not be topical (may be free format and not relate to each other) - more difficult to scan/examine - must be translated so computer can read - unstructured text, audio, video, pictures, graphics - closed captioning/subtitles

characteristics of informational systems

-less detailed than transactional; stored in summarized form - data extracted from other systems, periodically loaded in -needs to be designed to handle variety of queries, ad hoc reporting - supports managerial decisions making, frequently strategic planning

shortcomings of star schemas

1) duplication of dimensional data (redundancy issues) 2) no support for multiple languages 3) no historization (changing dimensional data) 4) hierarchies can cause anomalies (i.e., changing a city means you have to check with state and postal code)

advantages of star schema

1) star-like layout is easy to understand and implement; 1:M relationship is straightforward 2) star scheme has a SINGLE LEVEL OF JOIN for any query (fact table is joined to 1+ dimension tables)

5 drivers of business intelligence

1. globalization 2. decentralization 3. smaller margins 4. faster reaction to market 5. increasing importance of service

5 components of data warehousing process

1. source systems 2. data staging 3. data warehouse 4. data mart (smaller structures) 5. analytics tools

transitive dependency

A functional dependency between the primary key and one or more nonkey attributes that are dependent on the primary key via another nonkey attribute.

one-to-many relationship

A relationship between two tables in a database in which one record in the primary table can match many (zero, one, or many) records in the related table. - relational modeling ideal - should be the norm in any relational database design

referential integrity constraint

A rule that limits the values of a foreign key to those already existing as primary key values in the corresponding relation - helpful in preventing fraud; e.g. each transaction must have a customer linked to it - on an ER diagram, portrayed as cardinality

Third Normal Form (3NF)

A table is in 3NF when it is in 2NF and no nonkey attribute is functionally dependent on another nonkey attribute; that is, it cannot include transitive dependencies.

snowflake schema

An expanded version of a star schema in which dimension tables are normalized into several related tables. - reduced data redundancy - support of multiple languages, currencies, Units of measure - supports historization

transactional systems

OLTPs (online transactional process systems) - use a 3-tiered system architecture: 1) user interface/presentation tier 2) business services/logic/application tier 3) data services and programming tier

DISTINCT clause

Produces list of values that are unique

Code for inner JOINs

SELECT * FROM TABLE A, TABLE B WHERE A.pk = B.fk;

COUNT function

Tallies the number of cells in a range that contain values.

data staging

The act of formatting data from source databases before fitting it into a data warehouse; it consists of three steps: extraction, transformation, and storage.

GROUP BY clause

The clause that groups rows based on the specified column

Data Modeling

The definition of the data and their relationships i.e. entities, attributes, relationships, constraints in relational DBs

primary key

a candidate key selected as the primary means of identifying rows in a relation - one and only one PK per relation - may be composite key - short, numeric, never changes

key

a combination of one or more columns that is used to identify rows in a relation

composite key

a key that consists of 2+ columns

candidate key

a key that uniquely identifies a row in a relation

data warehouse

a process and architecture that requires robust planning to implement a platform, which consists of the selection, conversion, transformation, consolidation, integration, cleansing and mapping of data from multiple operational data sources to a target DBMS that supports an enterprise's decision-making processes and BI systems

ERP (Enterprise Resource Planning)

a suite of applications called modules, a database, and a set of inherent processes for consolidating business operations into a single, consistent, computing platform

Relation (aka table)

a two-dimensional table that has the following characteristics - rows contain data about an entity - columns contain data about attributes of the entity - all entries in column are the same kind - each column has a unique name - cells of the table hold a single value - order of columns and rows are unimportant - no duplicate rows

SAP HANA

an in-memory database in SAP that is exceptionally fast that uses SQL and supports DB views

surrogate key

artificial column added to a table to serve as a primary key - values are meaningless to users - hidden in forms and reports

one-to-one relationship

between two entities in which an instance of one entity can be related to only one instance of a related entity - should be rare in any relational DB design - could mean entity components weren't properly defined - could mean 2 entities belong in same table

smaller margins

comparing prices between retailers; e.g. US and pharmaceutical industry - insurance doesn't tell you how much they'll cover, so drugs may vary greatly in pricing

Data Package Dimension

contains details of time when a record was loaded into fact table - i.e. loading number and time stamp

Star Schema - Fact Table

contains facts about the business (a list of historical transactions w/ data about key figures or measures) - transactions originate in transactional system, cascade via ETL to DW and then into star schema - 2 types of data: 1) master data/PKs (Product_ID, Time_ID, etc.) - about sales order; 2) transactional data - quantity sold, price, revenue (measures/key figures)

Data Warehousing Process

data from source are extracted, transformed, loaded (ETL) into target system - data staging is area where data are cleaned up/prepared (transformation)

information analysis structure

data warehouse data is consistent, integrated, etc. - the warehouse catalog is the single most important element because it queries the Data Warehouse and lists the tables and columns that exist within it

decentralization

decision-making is decentralized to individual stores (which each have their own models), hence we need BI

unit dimension

defines unit of measure for quantity values; 2 types: 1) currency (currency key links to unit dimension w/ description of currency) 2) unit of measure (UoM)

deletion anomaly

deletion of a row loses information about two or more entities

Internet of Things

describes a system in which everyday objects are connected to the internet and in turn are able to communicate information throughout an interconnected system - e.g. Apple Watch and Stanford Health tracking people's heart health by EKG - e.g. Tesla self-driving cars, have advantage in huge amount of data accumulated by their users - e.g. voice recognition devices, like Amazon Echo, Google Home, Apple Siri, etc.

time dimension

details about the time of a transaction in the fact table, i.e. year, quarter, month

faster reaction to market

e.g. Amazon has millions of vendors who may copy products and sell them on a demand basis (order after transactions)

characteristic

each attribute in a dimension table - may have additional attributes describing them in more detail - attributes stores in attribute tables - additional text stores in text tables (time dependent/independent, language dep/indep) - hierarchies stored in hierarchy tables

cardinality

expresses the specific number of instances in an entity; "count"

globalization

has 2 key areas: 1. manufacturing - goes to where it's cheapest to produce (offshoring/outsourcing) 2. economies - China and India were dominant in history, and will be dominant in the near future, consumers are changing

First Normal Form (1NF)

information is stored in a relational table with each column containing an atomic value; no repeating groups of columns

insertion anomaly

insertion of a fact in one entity can't be done until a fact about another entity is added - can't record until we have information for the entire row

enterprise

large companies

dimension table

master data stored in separate tables; provide more detailed view of the fact e.g. revenue is the fact, customer number/name/address are stored in Customer dimension

minimum cardinality

minimum number of entity instances that must participate in a relationship - if 0, then participation is optional, no entity instance must participate - if 1, then participation is mandatory, at least one entity instance must participate

OLAP

online analytical process system - support managerial decision making - historical point-in-time (snapshots) and predictions - managers, business analysts, customers - broad, ad hoc, complex queries and analysis - ease of flexible access and use - periodic batch updates and queries requiring many or all table rows

OLTP

online transaction processing system - runs the business on a current basis - current representation of state of the business - narrow, planned, and simple updates and queries - many constant updates and queries on one or a few table rows

Joins

performed when data are retrieved from more than one table at a time - list tables in FROM clause of SELECT statement

data warehouse objectives

single source of truth because: - standardized structures and representation for all enterprise information - easy to use access, single point of access to all enterprise information - self service, high quality business reporting and analysis on all levels - high performance environment fed from heterogeneous sources - unburdens systems and IT resources in the transaction environment

entity

some identifiable thing that users want to track - customers - products - sales orders - vendors

entertainment shift

streaming services and cable TV - so many channels to choose from

data view

the data frame mode optimized for the display and analysis of map data

maximum cardinality

the maximum number of entity instances that can participate in a relationship - one-to-one - one-to-many - many-to-many

Data mining

the process of discovering meaningful correlations, patterns, and trends by sifting through large amounts of data stored in repositories; should provide enterprises with insights not available through traditional techniques - machine learning, i.e. Netflix looking at what and when customers watch, where they discovered the movie, and what recommendations they DIDNT select

normalization

the process of structuring a relational database in order to reduce data redundancy and improve data integrity - eliminated modification anomalies - insertion, update, deletion

Business Intelligence (BI)

the user-centered process of exploring data, data relationships and trends - helps improve overall decision-making for enterprises

enterprise resource planning systems (ERP)

transactional systems - fully integrated system to allow functional business areas to share data

update anomaly

update of a fact must be done in all locations of changed fact

SELECT command

used to list contents of table -asterisk (*) can be used as wild

ORDER BY clause

useful when listing order is important


Related study sets

ACCT 450 - Chapter 14 Partnerships: Formation and Operation

View Set

ITExams - CompTIA A+ Core 1 Practice Test

View Set

A. Life Insurance Policy Riders -Guaranteed Insurability Rider (GIR)-

View Set

Ch. 7 Moral Development, Values, and Religion

View Set

Module 4 Thyroid/Para/Adrenal/Pituitary/AKI/CKD/Bioterrorism

View Set

Hazmat - Will Carry - Labeling and Marking

View Set

ALGEBRAIC EXPRESSIONS: EXPONENTS PART 1

View Set