ITP 249
AGGREGATION FUNCTIONS and the 7 types
Aggregation Means 'grouping together data around a constraint' Once data is grouped, SQL has special commands to analyze the groups Aggregating functions and grouping are usually used together 1. SUM 2. AVG 3. COUNT 4. MIN 5. MAX 6. FIRST 7. LAST
Entity
Anything about which data can be collected and stored
NORMALIZATION PROCESS
Apply a series of tests on a relation to determine whether it satisfies or violates the requirements of a given normal form. If the tests pass, move on to check the requirements of the next level normal form. If the tests fail, the relation is decomposed into simpler relations that individually meet the normalization tests. All these normal forms are based on the functional dependencies among the attributes of a relation.
SURROGATE KEY
Artificial field added to a table to serve as a Primary Key Has values that are meaningless to users Normally hidden from the user interface Example: Employee ID (auto generated) in the previous slide
AVG
Average of all values for a column
Data vs. Information
DATA IS A COLLECTION OF FACTS These facts are independent of overall meaning INFORMATION IS MEANING DERIVED FROM DATA Information is extracted by processing data - organizing, analyzing, summarizing
DATA DEFINITION LANGUAGE (DDL)
Data Definition Language (DDL) is used to define database schema. It consists of SQL commands that create and modify database structures like tables and relationships. SQL commands CREATE, DROP and ALTER are examples of DDL.
DATA MANIPULATION LANGUAGE (DML)
Data Manipulation Language (DML) is used to query and manipulate existing data. SQL commands SELECT, INSERT, UPDATE and DELETE are examples of DML.
FUNCTIONAL DEPENDENCY - FULL vs PARTIAL
FULL Functional Dependency Given a composite key A Attribute B is functionally dependent on A B is NOT functionally dependent on any part of A It's dependent on both! PARTIAL Functional Dependency Given a composite key A Attribute B is functionally dependent on A B is functionally dependent on one part of A (not only B depends on the whole A, but B also depends on a part of A)
BUSINESS RULES
First iteration of the models usually comes from the Business Rules Business Rules are descriptions of policies, principles, procedures, restrictions for an organization They describe characteristics of data as viewed by different people
FIRST NORMAL FORM
Requirements: No repeating groups of columns Each column contains atomic values PK for each table Steps: Identify Primary Keys for all tables Identify all dependencies
(Start of Lecture 7) SELECT CLAUSE VS WHERE
Select is for columns Where is for rows
SUM
Sum of all values for a column
Outer Join
The OUTER JOIN returns all rows from both the participating tables which satisfy the join condition along with rows which do not satisfy the join condition LEFT JOIN: The table before the JOIN command is the main table Lists all the rows from the left table, even if there's no match in the right table RIGHT JOIN: The table after the JOIN command is to the main table Lists all the rows from the right table, even if there's no match in the left table Still need column list and a matching condition
Set Operators (Purpose, 3 types, and compatibility)
To combine separate queries into one result. 3 types of set operations: 1. UNION 2. INTERSECT 3. EXCEPT Set compatible: Number of columns in each SELECT statement must match Data types on aligned columns must match Each SELECT statement must have the same expressions and aggregate functions
Foreign Key
is a Primary Key from another table (related/referenced table) Placed in a relating table to form a relationship to the related table Valid values are NULL and the Primary Key values of the reference table So a Foreign Key value must either be the Primary Key of the referenced table or be NULL
Data Model
simple representation of complex real-world structures
database
specialized structure that allow computer-based systems to store, manage and retrieve data very quickly
RELATIONSHIP TYPES
1 : 1 (one to one) 1 : M (one to many) M : N (many to many)
SUBQUERY in HAVING CLAUSE
HAVING clause restricts the output of the GROUP BY clause. Applies conditions to the grouped rows. The SubQuery appears in the HAVING clause is similar to how it appears in the WHERE clause.
FUNCTIONAL DEPENDENCY RULES
If A determines B and C, then A determines B and A determines C. If A and B determine C, then neither A nor B by itself determines C.
Self Join
In a self join, a table is joined with itself (unary relationship)
RMDBS
"Relational Database Management System." An RDBMS is a DBMS designed specifically for relational databases. Therefore, RDBMSes are a subset of DBMSes. A relational database refers to a database that stores data in a structured format, using rows and columns
3. INNER JOIN - NATURAL JOIN
A NATURAL JOIN is an EQUIJOIN operation that creates an implicit ON clause for you based on the common columns in the two tables being joined. Common columns are columns that have the same name in both tables. Takes advantage of DBMS's table structure to figure out common column (same name, same data type) Natural Join: Guidelines - The associated tables have one or more pairs of identically named columns. - The columns must be the same data type. - Don't use ON clause in a natural join.
SUBQUERY USAGE
A Subquery occurs when the result from an Inner Query is used as part of an Outer Query. In other words, the Inner Query's results are used to inform the Outer Query. Subquery is a query inside a query, normally inside parentheses: First query is the Outer Query. Inside query is the Inner Query. Parentheses are used to distinguish the queries and indicate order of execution: Inner Query is executed first. Output of Inner Query is used as input for Outer Query. Normal queries allow us to ask a question of the database. Subqueries allow us to ask multiple questions of the database. Who sells saw parts? (normal query that will be used as a subquery) Now tell me about those companies? (outer query that uses the subquery's result as input) Subqueries can nest several levels. They're not limited to 1 inner and 1 outer. Building queries using Subqueries: Approach it incrementally, one part at a time. There are many ways to get the same answer. Solutions that work on one DBMS may not work on others.
FUNCTION DEPENDENCY
A functional dependency (FD) is a relationship between two attributes, typically between the PK and other non-key attributes within a table. For any relation R, attribute Y is functionally dependent on attribute X (usually the PK), if for every valid instance of X, that value of X uniquely determines the value of Y. This relationship is indicated by the representation below :
STRONG relationship
A relationship is a Strong relationship if Primary Key of the relating entity CONTAINS Foreign Key (Primary Key of related entity) i.e. FK is part of PK Noted with solid line in Crow's Foot notation
WEAK relationship
A relationship is a Weak relationship if Primary Key of the relating entity does NOT contain Foreign Key (Primary Key of related entity) i.e. FK is not part of PK Noted with dashed line in Crow's Foot notation
Constraint
A restriction placed on the data
INSERTION ANOMALY
An Insert Anomaly occurs when certain attributes cannot be inserted into the database without the presence of other attributes.
UPDATE ANOMALY
An Update Anomaly exists when one or more instances of duplicated data is updated, but not all.
ALIAS and the 2 clauses it can be used in
An alias is an alternative table name for the current query Use alias Column name in the SELECT clause (both are equivalent) Use alias Table name in the FROM clause (both are equivalent)
INDEX
Analogy: Index in a printed book A listing of the arrangement of a table with respect to a particular table's field One index to one field in one table Points to the record's location in the database Used to improve performance Index key: Index's reference point. Points to data location identified by the key Unique Index: An index in which the index key can have only one pointer value (row) associated with it
CONVERT A MANY-TO-MANY RELATIONSHIP
Break the Many-to-Many relationship into two 1:M relationships by introducing a new entity.
BY LOCALITY (2 types)
CENTRALIZED Usually owned and maintained by the organization using the database Example - SIS (Student Information System) at USC DISTRIBUTED Database is decentralized and data is distributed (sometimes redundantly) Example - Domain name servers Blockchain? Bitcoin?
CANDIDATE KEY
Candidate Key (or Key) functionally determines the rest of (all other) fields in a table. Is a potential Primary Key Must be unique, and cannot be NULL Should be secure, never change, and never be recycled
Sources of Business Rules
Company managers Policy makers Department managers Written documents - procedures, standards, manuals End users Any other stakeholders
CONCEPTUAL - LOGICAL - PHYSICAL MODELS
Conceptual: a conceptual data model identifies the highest-level relationships between the different entities (also called entity-based or object-based data models.) Logical: a logical data model describes the data in as much detail as possible, without regard to how they will be physical implemented in the database. Physical: physical data model represents how the model will be built in the database. A physical database model shows all table structures, including column name, column data type, column constraints, primary key, foreign key, and relationships between tables.
Sub-Queries (Definition, Objective, 4 Clauses it can appear in)
Definition: A Subquery is a SELECT statement (query) within another statement. I.E. a query inside a query Subquery = Inner query = Nested query Objective: To write queries in SQL to retrieve data from a table, but restricting the data based upon data in another table (subquery) i.e. to get results from table A, conditioned on results from table B Subquery can occur as part of: SELECT clause FROM clause WHERE clause HAVING clause
Relationship
Describes an association among entities
DESIGN ERD - GENERAL STEPS
Determine all possible Entities, their main Attributes, and Primary Keys. Collapse into one larger view Consider Relationships in pairs Determine Relationship Type, and Cardinality (Max/Min) Assign Foreign Keys. Determine Relationship Strength. Remove all Many-to-Many Relationships. Replace with one to many relationships Fill in remaining Attributes.
3. Except
EXCEPT returns any distinct values from the left query that are NOT found on the right query. Similar to the UNION command, EXCEPT also operates on two SQL statements. The difference is that, while UNION essentially acts as an OR operator (value is selected if it appears in either the first or the second statement), the EXCEPT command acts as an AND operator (value is selected only if it does NOT appear in BOTH statements).
RULES for WELL-FORMED TABLES (3 things)
Each table represents a single subject COURSE table only contains data about courses 2. No data is redundantly stored Allows updates to occur in only one place 3. All non-key attributes dependent on the Primary Key Yields simplest tables and data that is uniquely identifiable by the PK
ENTITY vs ENTITY INSTANCE
Entity refers to ALL items of a kind E.g. ALL Students, ALL Vehicles, ALL Transactions Corresponds to a whole table Synonyms: a Table, an Object Entity Instance refers to a SINGLE item E.g. ONE Student, ONE Vehicle, ONE Transaction Corresponds to a single row in the table Synonyms: a Record, an Object Instance, a Thing
1. Inner Join -Equijoin
Equijoin - matches based on equality Condition is where Primary Key and Foreign Key match INNER JOIN can simply be written as JOIN (by default JOIN is INNER JOIN) FROM A JOIN B ON <condition> is equivalent to FROM A, B WHERE <condition>
HIERARCHICAL MODEL
Every parent has one or many children of a certain type Example - 'Client' can have one or more 'Appointments' PROS Fast access Ensures referential integrity CONS Whole hierarchy must be satisfied to enter data Can only retrieve data from root Data redundancy
RELATIONAL MODEL
Every relation can be thought of as a table Every table has a column that uniquely identifies each row in the table By sharing the identifying field, data can be related PROS Implementation independence Easy data retrieval CONS Computationally expensive
Database Transaction Properties (ACID)
In order to guarantee validity of database transactions, the following ACID properties are required: qAtomicity: Each transaction succeeds or does not. There is no midway point (i.e. All or Nothing) q qConsistency: Ensures database correctness. All integrity constraints must be maintained. q qIsolation: Transactions happening concurrently without leading to inconsistency. Changes are not visible until complete. q qDurability: Updates and modifications are stored so that they persist even if a failure occurs ✘
1. LEFT OUTER JOIN or LEFT JOIN
LEFT JOIN: The table before the JOIN command is the main table Lists all the rows from the left table, even if there's no match in the right table
2. Intersect
List titles that appear in BOTH tables INTERSECT returns any distinct values that are returned by both the query on the left and right sides of the INTERSECT operand.
ESTABLISH - ONE-TO-ONE RELATIONSHIP (where to place foreign/primary keys)
Mandatory - Optional Place the Primary Key of the mandatory side as the Foreign Key on the optional side Optional - Optional Place the Foreign Key on whichever side to have the fewest nulls Mandatory - Mandatory Most likely should be just one entity
Maximum Cardinality
Maximum number of entity instances that can participate in a relationship Possible values are 1 (one) and M (many)
Cardinality
Means "count," and is expressed as a number 0, 1, or M/N Refers to the count of one entity to the other entity in a relationship How to determine Cardinality: Business Rules Common sense
Minimum Cardinality
Minimum number of entity instances that must participate in a relationship Possible values are 0 (optional) and 1 (required/mandatory)
MIN, MAX
Minimum, Maximum attribute value
2. Inner Join - Non-Equijoin
Non-Equijoin - matches based on in-equality
NORMALIZATION
Normalization the process of evaluating and correcting table structures to reduce redundancies and to better separate subjects. Normalization usually occurs after the 1st ERD but before the final ERD.
NO-SQL
Not Only SQL: qNot based on relational databases qThey may support SQL like querying qSchemaless qACID transactions may be compromised to increase performance, availability, speed. Eventually consistent. NoSQL Data Stores: qKey-Value: amazon dynamo qNo schema; data values are identified and accessed via a key qColumn: cassandra qGraph DB: neo4j qDocument: mongodb
BUSINESS RULES TO DATA MODELS
Nouns translate to Entities Usually nouns translate to tables Even so... some Nouns will translate to Attributes And some extra Tables may exist that aren't in the business rules (think M:N) Verbs translate to Relationships Relationships are bidirectional and must identify relationship type Adverbs (can, may, must) translate to Cardinality of a Relationship Cardinality refers to the count of one entity to the other entity in a relationship
COUNT
Number of rows with non-null values
BY USAGE (2 types)
OPERATIONAL DATABASES Maintains data relevant to moment-to-moment operations Optimized for fast data manipulation (inserting and editing data) Example - single Walmart store database ANALYTICAL DATABASES Contains historical data to observe trends and make business decisions Optimized for fast data processing (access and computation) Example - Walmart historical sales database
OBJECT ORIENTED MODEL
Overlays object oriented functionality on traditional relational databases Uses object relational mapping (ORM) middleware between consumer programs and database structures PROS Allows programmers to develop database programs without knowing SQL CONS There is a high cost - financial expense, computation time, external dependencies, maintenance overhead
ANOMALIES (3 types)
Poorly designed DBs can create problems, issues. Anomalies are inconvenient and lead to error-prone situations when we process data in tables. There are three types of anomalies: Update Anomaly Delete Anomaly Insert Anomaly
2. RIGHT OUTER JOIN or RIGHT JOIN
RIGHT JOIN: The table after the JOIN command is to the main table Lists all the rows from the right table, even if there's no match in the left table Still need column list and a matching condition
PLACES WHERE SUBQUERY CAN OCCUR
Recall that Subquery can occur as part of: SELECT clause FROM clause WHERE clause HAVING clause
SECOND NORMAL FORM
Requirements: It is in 1NF It has NO Partial Dependencies Steps: Start at 1NF Identify Partial Dependencies Make New Tables to eliminate Partial Dependencies Assign corresponding dependent attributes
SUBQUERY in SELECT CLAUSE
SELECT clause uses <columnlist> to indicate columns of the result Columns can be attributes of tables Result of an aggregate function <columnlist> can also include Subquery expression: This is called Inline Subquery Subquery must return one single value If creating an Alias for the Subquery's result, cannot use that alias in the <columnlist>
Data Analytics
The goal of this course is to learn how to use database systems not just as systems of record but a system of information. i.e. a repository of data that can be analyzed to derive information, knowledge and perhaps wisdom!
Difference between equijoin and natural join
The key difference between Inner Join and Natural Join is that Inner Join provides the result based on the matched data according to the equality condition specified in the query while Natural Join provides the result based on the column with the same name and same data type present in tables to be joined
SUBQUERY ADVANTAGES
They allow queries that are structured so that it is possible to isolate each part of a statement. They provide alternative ways to perform operations that would otherwise require complex joins and unions. Many people find subqueries more readable than complex joins or unions. Subqueries are not the complete replacement of Joins. Most of the time, it's being used as a complement/alternative approach to complicated Joins.
UNION
UNION: list each job title ONCE even if the title appears in both tables UNION ALL: to show overlap - if CEO appears in each table, it will appear twice in the combined results Think of UNION similar to the OR operator: if value appears in one statement result, it will appear in the final result.
Primary Key
Uniquely identifies each row A Primary Key may: consist of a single column or multiple columns combined Each table must have a Primary Key column Primary Key values must be unique, and cannot be NULL
KEY-VALUE STORES
Use case: qQuick lookups with no 'relational' component (no joins) qQuick and high scalability qOften (mostly) in memory Application: qUser session data between shared applications ie amazon dynamo
COLUMN STORES
Use case: qSuper scalable qMapReduce support Application: qLarge scale real time data logging (Finance, Web Analytics) Cassandra Database uses Cassandra Query Language (CQL) CQL offers a model close to SQL in the sense that data is put in tables containing rows of columns. ie cassandra
SUBQUERY in WHERE-IN CLAUSE
Used when comparing a single attribute to a list of values (result of a Subquery) The Subquery appears in: in the WHERE-IN Clause on the right side of the comparison Subquery result: Requires a Subquery that returns a list of values Values returned by the Subquery must be of comparable data type for the comparison Can be used in combination with Joins
SUBQUERY in WHERE CLAUSE
Used when comparing a single attribute to a single value (result of a Subquery). Most common type: the Subquery appears in the WHERE Clause on the right side of the comparison Subquery result: Requires a Subquery that returns only one single value Value returned by the Subquery must be of comparable data type for the comparison Can be used in combination with Joins.
SUBQUERY in WHERE-ALL&ANY CLAUSE
Used when comparing a single value to a list of values (result of a Subquery) using inequality comparison Meaning: > ALL : (reads "greater than ALL") is equivalent to "greater than the highest in list" < ALL : (reads "less than ALL") is equivalent to "less than lowest" = ANY : (reads "equal to ANY") is equivalent to using the IN operator
NETWORK MODEL
Uses inverted tree structure to represent owner/member relationships through structures with multiple links allowed PROS Complex (real-world) relationships are better modeled Any data can be accessed through structures CONS Users must be familiar with network layout to retrieve data Changing structures is difficult and involves reworking whole network
View
Virtual table made of fields of other tables Database stores base tables Database temporarily assembles views to display query results Allows the viewing of interrelated data from multiple tables Data protection by only displaying results (no edits allowed)
DELETION ANOMALY
A Delete Anomaly exists when certain attributes are lost because of the deletion of other attributes. What other data is removed when we delete a row of Student's Activity data?
JOIN
A JOIN occurs when data is retrieved from multiple tables. Foreign Key and Primary Key of related tables are compared and matched. This implies a relationship exists(weak or strong) A JOIN occurs by listing tables in the FROM clause, e.g. FROM Student, Course The comma between the table names creates a Cartesian Product Table which is filtered by the JOIN matching condition in the WHERE clause The JOIN matching condition is WHERE TABLE1.PrimaryKey = TABLE2.ForeignKey
DOT NOTATION
The dot "." notation allows us to refer to a column within a table Examples: -STUDENT.SID means we're referring to the SID column in the table STUDENT -INVOICE.CUS_CODE means we're referring to the CUS_CODE column in the table INVOICE
Attribute
A characteristic of an entity, Table column rules apply to Attributes
The 2 types of joins and their 3 and 2 respective subcategories
1. Inner Join a. Equijoin b. Non-Equijoin c. Natural Join 2. Outer Join a. left b. right
analytics
1. the analysis of data, typically large sets of business data, by the use of mathematics, statistics, and computer software. 2. the patterns and other meaningful information gathered from the analysis of data.
TABLE FEATURES
2D structure composed of rows and columns Each row represents one occurrence within the entity set(instance) Each column represents an attribute and has a distinct name Each intersection of a row and column represents a single data value All values in a column must conform to the same data format The order of rows and columns does not matter Each table has an attribute that uniquely identifies each row
FIRST, LAST
First, Last value of the results
GROUP BY
GROUP BY by default collects all identical items together Can restrict groups with the HAVING Clause by adding criteria/conditions to restrict the results The <conditionlist> in HAVING Clause is similar to that in the WHERE Clause, but it applies to GROUP BY only
POST RELATIONAL MODEL
Handles volume, velocity and variety of data (3V's) Uses key-value storage of data instead of structured tables, allowing for sparse unstructured data Not always an appropriate replacement for relational database PROS Enabled the 'Big Data' revolution CONS Can involve a large overhead
IMBS
IN-MEMORY DATABASES (IMDB) Using RAM instead of hard disk for the database All relevant data are in memory all the time Speeds up queries to provide real time or near real time analytics capabilities Innovations: qData are stored in RAM qUse of columnar storage for the relational database. qIndexing (is free with columnar storage) qData compression qParallel data processing qPartitioning data
Graph DB
In computing, a graph database (GDB[1]) is a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data. A key concept of the system is the graph(or edge or relationship), which directly relates data items in the store. The relationships allow data in the store to be linked together directly, and in many cases retrieved with one operation. Graph databases, by design, allow simple and fast retrieval[citation needed] of complex hierarchical structures that are difficult to model[according to whom?] in relational systems. Graph databases are similar to 1970s network model databases in that both represent general graphs, but network-model databases operate at a lower level of abstraction[2] and lack easy traversal over a chain of edges.[3] Use Case: qDense network of strongly connected entities Application: qFacebook graph search qGoogle knowledge graph qFraud detection qNetwork infrastructure Graph DB vs SQL: qThe size and degree of connectedness => exponential increase of query times for SQL. qConstant time for Graph DBs on size and linear for connectedness. neo4j
SUBQUERIES VS JOINS
SUBQUERIES and JOINS share many similarities and differences. Joins are typically used: to return rows (of a merged/joined table) Subqueries are typically used: to return a single value (an aggregation) or rows (of a table, or a merged/joined table). to perform tests for set membership or make set comparisons (in vs not in).
FILE SYSTEM MODEL
Sometimes called 'flat model', referring to a 'flat file' that contains data Does not indicate relationships - just maintains records Example - an address book Today -> popular for data transmission (as a delimited file or XML or JSON)
MULTI-USER
Sometimes called 'workgroup' or 'enterprise' depending on scale Example - USC course schedule
SUBQUERY in FROM CLAUSE
Subquery in the FROM clause forms a Temporary Table Can create Alias for the Temporary Table The Alias can be used in places where a table is expected
SINGLE USER
Supports one 'user' (could be a program) Example - desktop database
Where vs Having
The WHERE clause applies the condition to individual rows before the rows are summarized into groups by the GROUP BY clause. However, the HAVING clause applies the condition to the groups after the rows are grouped into groups. Therefore, it is important to note that the HAVING clause is applied after whereas the WHERE clause is applied before the GROUP BY clause.
COMPOSITE DETERMINANT
a determinant composed of multiple fields State, LicenseNumber) => (FirstName, LastName, Address) (StudentID, CourseNumber) => Grade (Manufacturer, SerialNumber) => (Model, WarrantyInfo)