ITP 487 - Midterm
outer JOINs
- return rows matching join condition - also return rows with unmatched attribute values for tables to be joined - full outer JOINs = unions
left outer join
- returns rows matching join condition - returns rows in left side table with unmatched values
right outer join
- returns rows matching join condition - returns rows in right side table with unmatched values
foreign key
primary key of one relation placed in another relation to form a link between the relations - single column or composite key - key values are foreign to relation in which they appear as foreign key values
Need for Data Warehousing
Integrated, company-wide view of high-quality information (from disparate databases) Separation of operational and informational systems and data (for improved performance)
SQL
Structured Query Language
data manipulation commands
INSERT, SELECT, COMMIT, UPDATE, ROLLBACK, DELETE
increasing importance of services
Manufacturing (especially of durable goods) is more strongly affected by recessions. The economy is based more on services now (think Yelp?), decreasing the effect of the business cycle on GDP.
inner join
Most common type of join; includes rows in the query only when the joined field matches records in both tables.
informational systems
OLAPs (online analytical processing systems) - provide place for data storage and preparation for analytics - hold large volumes of historical data - archived/aggregated data which may have come from transactions - i.e. Data Warehouse
Star Schema
The most commonly used and the simplest style of dimensional modeling - central to the model is the fact table
Denormalization
The process of combining normalized relations into physical tables based on affinity of use of rows and fields; makes data retrieval faster
Second Normal Form (2NF)
The second stage in the normalization process, in which a relation is in 1NF and there are no partial dependencies (dependencies in only part of the primary key).
mobility
huge boom in data and money
How do BI and data analytics differ?
- BI helps in making business decisions based on past results - Data Analytics helps in making predictions that will help in the future
characteristics of transactional systems
- availability should be as close to 100% as possible -detailed down to individual transactions - updatable - process transactions quickly - store current information, archive older data - support organization's business functions - require concurrency management to deal w/ users who try to access same data at same time - support business processes - process small uniform transactions - optimized for quick writing and storage - data are functionally oriented
many-to-many relationship
- can't be implemented as such in relational model - can be changed into multiple one-to-many relationships - create composite entity
master data
- data associated with various processes, but which rarely changes - same data used across the enterprise - ex: customer, product, vendor
persistent data
- data is physically stored - accessed directly - e.g. database table
virtual data
- data stored elsewhere - virtual data has instructions on how to access in real time - e.g. database view - combine multiple tables into one by maintaining view but doing work underneath (extraction layer)
Data Type (aka Class)
- defines structures of instances of data objects - each attribute type is defined - e.g. defines Student as: SID, name, address, etc.
how to snowflake
- extract attributes from dimension tables, places them into new tables - dimension table should consist of 2+ sets of attributes which define information at different grains (detail) - sets of attributes in same dimension table are populated by different source systems
multi-dimensional data
- facts surrounded by dimensions for analysis - e.g. OLAPs (like DWs) - allows you to slice and dice data, like revenue by dimensions
tabular data
- first row defines class (headers) - subsequent rows are instances of that data type - e.g. spreadsheets, relational DBs
SAP HANA is high performing and efficient
- in-memory - very fast but expensive RAM - uses columnar storage - creates a dictionary for every unique value - columns only hold keys - dictionary compression - attribute values are stored in a dictionary
examples of data sources
- legacy systems (older tech) - web services (HTTP, SOAP, XML, web pages) - crawlers and info agents (web scraping/crawling) - social media (text/clicks, application programming interface (API)) - sensors (equipment - status, alerts, safety monitoring, input/output measurements, prediction of failure, meter reading, diagnosis/repair)
historization of data
- looking at dates valid from mmddyyyy to mmddyyyy
characteristics of the explosion of data
- more users - socialization of everything - entertainment shift - mobility - internet of things - sensors, devices - cloud and longevity of data - low cost storage
structured data source
- organized or modeled in rows/columns - values are proscribed (each cell contains values to a given input) - data more easily scanned/examined - understandable by computers - spreadsheets, flat files, databases - data integrity is of concern (can be changed) - limited file size
transactional data
- reflects the data resulting from executing a business process (a transaction) - includes data such as dates, prices, delivery terms, etc - ex: sales orders, purchase orders, rental booking
data mining techniques
- regression - decision trees - neural networks - association analysis - clustering - classification - genetic algorithms
data storage structures
- relational DBs = transactional systems - DWs = informational systems; DWs ARE relational - cannot be combined, should be used as OLTP or OLAP (one or the other)
relational database
- relationships created using keys - PK, FK
Data Instance (aka Object)
- specific instance of a specific type - e.g. a real student: SID is xxxxxxxxxx, Name is John Smith, etc.
data warehouse structure
- takes large quantities of data, stores for easier access - denormalized relational DB (some redundant data added back into DB to reduce number of tables to speed up queries) - read-only, so modification anomalies aren't relevant - not going to delete transactions - single source of truth - joining tables can be expensive, in DWs data is already stored like a JOIN
unstructured data source
- unorganized - data can be of varying lengths - values may not be topical (may be free format and not relate to each other) - more difficult to scan/examine - must be translated so computer can read - unstructured text, audio, video, pictures, graphics - closed captioning/subtitles
characteristics of informational systems
-less detailed than transactional; stored in summarized form - data extracted from other systems, periodically loaded in -needs to be designed to handle variety of queries, ad hoc reporting - supports managerial decisions making, frequently strategic planning
shortcomings of star schemas
1) duplication of dimensional data (redundancy issues) 2) no support for multiple languages 3) no historization (changing dimensional data) 4) hierarchies can cause anomalies (i.e., changing a city means you have to check with state and postal code)
advantages of star schema
1) star-like layout is easy to understand and implement; 1:M relationship is straightforward 2) star scheme has a SINGLE LEVEL OF JOIN for any query (fact table is joined to 1+ dimension tables)
5 drivers of business intelligence
1. globalization 2. decentralization 3. smaller margins 4. faster reaction to market 5. increasing importance of service
5 components of data warehousing process
1. source systems 2. data staging 3. data warehouse 4. data mart (smaller structures) 5. analytics tools
transitive dependency
A functional dependency between the primary key and one or more nonkey attributes that are dependent on the primary key via another nonkey attribute.
one-to-many relationship
A relationship between two tables in a database in which one record in the primary table can match many (zero, one, or many) records in the related table. - relational modeling ideal - should be the norm in any relational database design
referential integrity constraint
A rule that limits the values of a foreign key to those already existing as primary key values in the corresponding relation - helpful in preventing fraud; e.g. each transaction must have a customer linked to it - on an ER diagram, portrayed as cardinality
Third Normal Form (3NF)
A table is in 3NF when it is in 2NF and no nonkey attribute is functionally dependent on another nonkey attribute; that is, it cannot include transitive dependencies.
snowflake schema
An expanded version of a star schema in which dimension tables are normalized into several related tables. - reduced data redundancy - support of multiple languages, currencies, Units of measure - supports historization
transactional systems
OLTPs (online transactional process systems) - use a 3-tiered system architecture: 1) user interface/presentation tier 2) business services/logic/application tier 3) data services and programming tier
DISTINCT clause
Produces list of values that are unique
Code for inner JOINs
SELECT * FROM TABLE A, TABLE B WHERE A.pk = B.fk;
COUNT function
Tallies the number of cells in a range that contain values.
data staging
The act of formatting data from source databases before fitting it into a data warehouse; it consists of three steps: extraction, transformation, and storage.
GROUP BY clause
The clause that groups rows based on the specified column
Data Modeling
The definition of the data and their relationships i.e. entities, attributes, relationships, constraints in relational DBs
primary key
a candidate key selected as the primary means of identifying rows in a relation - one and only one PK per relation - may be composite key - short, numeric, never changes
key
a combination of one or more columns that is used to identify rows in a relation
composite key
a key that consists of 2+ columns
candidate key
a key that uniquely identifies a row in a relation
data warehouse
a process and architecture that requires robust planning to implement a platform, which consists of the selection, conversion, transformation, consolidation, integration, cleansing and mapping of data from multiple operational data sources to a target DBMS that supports an enterprise's decision-making processes and BI systems
ERP (Enterprise Resource Planning)
a suite of applications called modules, a database, and a set of inherent processes for consolidating business operations into a single, consistent, computing platform
Relation (aka table)
a two-dimensional table that has the following characteristics - rows contain data about an entity - columns contain data about attributes of the entity - all entries in column are the same kind - each column has a unique name - cells of the table hold a single value - order of columns and rows are unimportant - no duplicate rows
SAP HANA
an in-memory database in SAP that is exceptionally fast that uses SQL and supports DB views
surrogate key
artificial column added to a table to serve as a primary key - values are meaningless to users - hidden in forms and reports
one-to-one relationship
between two entities in which an instance of one entity can be related to only one instance of a related entity - should be rare in any relational DB design - could mean entity components weren't properly defined - could mean 2 entities belong in same table
smaller margins
comparing prices between retailers; e.g. US and pharmaceutical industry - insurance doesn't tell you how much they'll cover, so drugs may vary greatly in pricing
Data Package Dimension
contains details of time when a record was loaded into fact table - i.e. loading number and time stamp
Star Schema - Fact Table
contains facts about the business (a list of historical transactions w/ data about key figures or measures) - transactions originate in transactional system, cascade via ETL to DW and then into star schema - 2 types of data: 1) master data/PKs (Product_ID, Time_ID, etc.) - about sales order; 2) transactional data - quantity sold, price, revenue (measures/key figures)
Data Warehousing Process
data from source are extracted, transformed, loaded (ETL) into target system - data staging is area where data are cleaned up/prepared (transformation)
information analysis structure
data warehouse data is consistent, integrated, etc. - the warehouse catalog is the single most important element because it queries the Data Warehouse and lists the tables and columns that exist within it
decentralization
decision-making is decentralized to individual stores (which each have their own models), hence we need BI
unit dimension
defines unit of measure for quantity values; 2 types: 1) currency (currency key links to unit dimension w/ description of currency) 2) unit of measure (UoM)
deletion anomaly
deletion of a row loses information about two or more entities
Internet of Things
describes a system in which everyday objects are connected to the internet and in turn are able to communicate information throughout an interconnected system - e.g. Apple Watch and Stanford Health tracking people's heart health by EKG - e.g. Tesla self-driving cars, have advantage in huge amount of data accumulated by their users - e.g. voice recognition devices, like Amazon Echo, Google Home, Apple Siri, etc.
time dimension
details about the time of a transaction in the fact table, i.e. year, quarter, month
faster reaction to market
e.g. Amazon has millions of vendors who may copy products and sell them on a demand basis (order after transactions)
characteristic
each attribute in a dimension table - may have additional attributes describing them in more detail - attributes stores in attribute tables - additional text stores in text tables (time dependent/independent, language dep/indep) - hierarchies stored in hierarchy tables
cardinality
expresses the specific number of instances in an entity; "count"
globalization
has 2 key areas: 1. manufacturing - goes to where it's cheapest to produce (offshoring/outsourcing) 2. economies - China and India were dominant in history, and will be dominant in the near future, consumers are changing
First Normal Form (1NF)
information is stored in a relational table with each column containing an atomic value; no repeating groups of columns
insertion anomaly
insertion of a fact in one entity can't be done until a fact about another entity is added - can't record until we have information for the entire row
enterprise
large companies
dimension table
master data stored in separate tables; provide more detailed view of the fact e.g. revenue is the fact, customer number/name/address are stored in Customer dimension
minimum cardinality
minimum number of entity instances that must participate in a relationship - if 0, then participation is optional, no entity instance must participate - if 1, then participation is mandatory, at least one entity instance must participate
OLAP
online analytical process system - support managerial decision making - historical point-in-time (snapshots) and predictions - managers, business analysts, customers - broad, ad hoc, complex queries and analysis - ease of flexible access and use - periodic batch updates and queries requiring many or all table rows
OLTP
online transaction processing system - runs the business on a current basis - current representation of state of the business - narrow, planned, and simple updates and queries - many constant updates and queries on one or a few table rows
Joins
performed when data are retrieved from more than one table at a time - list tables in FROM clause of SELECT statement
data warehouse objectives
single source of truth because: - standardized structures and representation for all enterprise information - easy to use access, single point of access to all enterprise information - self service, high quality business reporting and analysis on all levels - high performance environment fed from heterogeneous sources - unburdens systems and IT resources in the transaction environment
entity
some identifiable thing that users want to track - customers - products - sales orders - vendors
entertainment shift
streaming services and cable TV - so many channels to choose from
data view
the data frame mode optimized for the display and analysis of map data
maximum cardinality
the maximum number of entity instances that can participate in a relationship - one-to-one - one-to-many - many-to-many
Data mining
the process of discovering meaningful correlations, patterns, and trends by sifting through large amounts of data stored in repositories; should provide enterprises with insights not available through traditional techniques - machine learning, i.e. Netflix looking at what and when customers watch, where they discovered the movie, and what recommendations they DIDNT select
normalization
the process of structuring a relational database in order to reduce data redundancy and improve data integrity - eliminated modification anomalies - insertion, update, deletion
Business Intelligence (BI)
the user-centered process of exploring data, data relationships and trends - helps improve overall decision-making for enterprises
enterprise resource planning systems (ERP)
transactional systems - fully integrated system to allow functional business areas to share data
update anomaly
update of a fact must be done in all locations of changed fact
SELECT command
used to list contents of table -asterisk (*) can be used as wild
ORDER BY clause
useful when listing order is important