Database Theory
* Six Clauses that form an SQL statement
1) SELECT 2) FROM 3) WHERE 4) GROUP BY 5) HAVING 6) ORDER BY
* DCL
(Data Control Language) Used to create controls (security definitions, access privileges, etc) for access to the database system. GRANT, REVOKE, DENY
* DDL
(Data Definition Language) Contains all commands for creating, updating, and deleting objects. CREATE, DROP, ALTER
* DML
(Data Manipulation Language) Contains all commands for selecting/ querying, merging, joining, and appending, union data together. INSERT, UPDATE, DELETE, SELECT
ETL
(Extracting, Transforming, Loading) The process of extracting data from a transaction database, transforming it, and loading it into an analytical database.
MOLAP
(Multidimensional OLAP (Online Analytical Processing) A source database that is dedicated to the compilation of answers or events.
* OLAP
(Online Analytical Processing) Relational databases that are used for data analytics. Data is not real-time but historic. Created by copying data from an OLTP (Online Transnational Processing) database. The focus of this database type is data retrieval speed.
* OLTP
(Online Transactional Processing) Relational databases focus on real-time data entry. Accuracy of the data and the speed of inserting, updating, and deleting records is the primary concern.
RDBMS
(Relational Database Management System)
ROLAP
(Relational OLAP (Online Analytical Procesing) - A database that stores extra
SSMS
(SQL Server Management Studio) A Microsoft GUI to help manage SQL databases.
SQL
(Structured Query Language) A standardized language for use with ANSI and ISO (International Standards Office) relational databases. Stable and predictable. Will work everywhere.
TPA
(Transaction Path Analysis) A planning method to determine the entity/ attribute assignments.
GROUP BY
<group by list> Optional; Creates groups for output rows according to unique combination of values specified in the GROUP BY clause. Calculates a summary value for aggregate functions in subsequent phases . Detail rows are only available to aggregate functions after the GROUP BY clause is processed. HAVING, SELECT, and ORDER BY must return a single value per group (since they're executed downstream of the GROUP BY clause). Creates groups. Order matters (think of a nested sort in Excel).
ORDER BY
<order by list> Optional; Indicates the sort order of the columns that are going to be returned.
HAVING
<search condition> Optional; Filters the groups created in the GROUP BY clause. Filtering that operates on groups of rows. Searches against the groups that were created by the GROUP BY clause. Must use an aggregate function in the HAVING clause because we're only working with groups.
WHERE
<search condition> Optional; Filters the records based on what you want to return; Filtering that operates on rows.
SELECT
<select list> Required; A list of columns/attributes you want to return; Indicates columns to preserve
FROM
<table source> Required; A list of the entities/tables you want to pull the data from; Gathers rows (cartesian product)
Junction table/ Intermediary table
A 2-column table that houses the correlation of one table as the primary key and the
Statement
A collection of clauses that execute a unit of work.
Clause
A component of an SQL statement.
* Primary Key
A designation assigned to a unique attribute in the table. Uniquely identifies every row in a table. Must be unique, must be non-nullable, only get one per table. Used to identify a column that uniquely identifies each record
Trigger
A pre-defined set of instructions that execute on a certain event.
Entity Relationship Modeling
A process for showing the entities and relationships required for an application via a diagram
Self-Referencing relationship
A relationship that can exist within a table wherein a foreign key references a primary key in the same table. Example 1: An employee has a manager and that manager is an employee. Example 2: A customer refers another customer on the Customer table.
Composite Keys
A table that has more than 1 primary key. Cannot be created inline (outline definition).
Indexes
A way of finding data within a table. Vital when querying.
Optional relationship
Allows NULLs in foreign key columns.
Fourth normal form
Boyce-Codd Normal Form - Each attribute must represent a fact about the primary key (expansion of 2nd normal form). It's a composite (concatenated) primary key on a unique list of possibilities. Allows for parallel processing.
Clustered index
CREATE CLUSTERED INDEX ClusteredIndexName ON schema.TableName(col1, col2, col3); An organization of data within the table (physical sort of records in ascending order). Defines the logical record order in the table. Row-by-row indexing. It is the table itself in ascending order. One clustered index per table. A clustered index on a column causes a table to be stored with rows logically organized by that column's values (commonly defaults to primary key, but can be any unique candidate key). Can only look up on 1 column.
Non-Clustered indexes
CREATE INDEX NonClusteredIndexName ON schema.TableName(col1, col2, col3); Stores pointers to the clustered indexes (typically, non unique). This is a separate structure from the table that is comprised of a few columns in the table that points back to a clustered index. Much like a VLookup (can have more than 1 column to lookup). If no clustered index exists, then it points to where in the heap to search for the target data.
* Surrogate keys / Synthetic identifiers
Candidate keys that can be created when there are no other suitable candidate keys (eg VIN, SSN, ISBN, CUSIP).
XML Schema
DECLARE @MyXMLVariable XML(schema.TableName) A way of describing the structure of an XML document. XML is self-describing.
Business Phase
Data modeling phase 1. Determine the purpose and requirements of the database (business focus).
Users Phase
Data modeling phase 2. Investigate performance requirements and usability issues (user focus).
Modeling Phase
Data modeling phase 3. Document the logical and physical database implementation plan as one.
* Relationship
Defines the common values between two tables (primary key and foreign key pair). A way of identifying the commonality between two attributes in 2 tables by selecting a single column in each table. There is no physical component to a relationship. A verb (attend, has, meets, owns)
XML
Describes data structure and values. xml data type - Stores XML in variables and columns XQuery - Node-tree query language for XML FOR XML clause - Generates XML from relational data. Allows you to get relational data out of a SQL database and into XML form (for use with an app that interacts our database info in XML document format) Extensible Markup Language - Language to define what a document is. Putting content online and describing that content. Describes the hierarchical relationship and categorization of the content. Document storage (NoSQL databases) is primarily related to XML storage.
Xquery
Designed to query XML. Supported by SQL Server 2016
Required relationship
Does not allow NULLs in foreign key columns.
Constraints
Enforce rules about what the data can do (eg. unique constraint, primary key constraint, foreign key constraint)
* Referential Integrity
Ensures that data remains meaningful. Foreign key constraints and triggers can enforce referential integrity in SQL Server 2016. Table relationships must always be consistent. In other words, any foreign key field must agree with the primary key that is referenced by the foreign key. The restriction is on the foreign key, not the primary key (foreign key cannot be NULL or absent). Ensures that key values are consistent across tables. Foreign Key and Primary Key constraints
ERA
Entity Row Attribute
* One-to-one
For a column with a lot of NULL values, remove the sparse column and make it it's own table to cut down on the number of NULLs in the database. For every one record in Customers, there is only one record in Cust Details. These are extremely rare in transactional databases, but they are non-existent in analytical databases. If the relationship is optional, you should create the foreign key in the side of the relationship that allows NULL values.
Fifth normal form
Further decompose tables so that every record is unique and non-dependent. For big data.
Super-type relationship
Generalization; A way to aggregate records based on a specific attribute of that record. Good for creating AND logic to traverse up a tree. Only applicable if the data already exists in the database. Not the same as querying. Improves performance and is akin of de-normalizing. An analytical process for bringing the sub-types back together for reporting.
Dependencies
How the attributes within the entity are related to each other. One goal of normalization is to ensure non-key values are dependent on the primary key.
--
How to comment out in code in SQL Server 2016
* Logical
How we describe what's inside the container.
HOLAP
Hybrid OLAP - A balance between the MOLAP and ROLAP
Third normal form
Identify fields that don't directly describe the entity (remove transitive dependencies and put them in a separate table and join them with an intermediary table). Also, no calculations.
Second normal form
Identify fields that should be grouped into separate tables (a person, place, thing, or event). All non-key columns must relate to the entire primary key (remove composite keys and put them in a separate table and join them with an intermediary table).
* Foreign Key
Indicates that a column that contains a value that is defined in another table (go look it up). It has to be a unique value, but it doesn't have to be the primary key from the other table. Does not have to be unique in the foreign table.
Processed Data
Information. Processing is how data becomes information. Has business content and meaning applied.
I/O Ops
Input-Output operations; how many times the information is written to or retrieved from the physical disk.
* Physical Column/ Field
Logical Attribute - One aspect of the entity. Customer name, product category, order date, or salary amount. Does not imply ordinal positioning.
* Physical Value
Logical Data - The entry for a specific attribute of a single record (where the column and row meet). A single customer name, a single product, a single order, a single employee.
* Physical Table
Logical Entity - What the data within the table describes. A person, place, thing, or event.
* Physical Row
Logical Record - A single instance of the entity (e.g. a single customer, product, event).
MDX, DAX, DMX
MDX (Multidimesional Language Expressions), DAX (Data Analysis Expression), DMX (Data Mining Expressions) Languages used to write functionality within BI (Business Intelligence) platforms.
Synthetic ID
Made up numbers that generally serve as a primary key (must unique).
MPP
Massively Parallell Processing
Transact-SQL/ TSQL
Microsoft's flavor of SQL
Hadoop
Not a database. A big data technology that offers massive processing power for processing very large data loads.
OLAP / SSAS Cube
OLAP (Online Analytical Processing) / SSAS (SQL Server Analysis Services) Multi-dimensional cube. Optimized for data analysis. Alternative way of storing data for analytics. Kind of like a pivot table (stores aggregates and raw at the same time).
Data Warehouse
OLAP (Online Analytical Processing) database. Historical facing. Populated by the ETL (Extracting, Transforming, Loading) process. Read-only databases (users cannot update, insert, or delete).
* No action
One of the 5 ways to deal with primary key value changes. Action will fail and result in a referential integrity violation error.
* ON UPDATE Cascade Updates
One of the 5 ways to deal with primary key value changes. Applies the new value in the primary key and uses it to update all foreign key references in other tables. The only time the primary table is aware of the foreign tables. This doesn't work for triggers because it only works when there's an explicit relationship.
* Cascade Deletes
One of the 5 ways to deal with primary key value changes. Deletes the primary key value in the primary key and deletes all referenced rows in the foreign key referenced tables. All rows associated with the deletion would be gone from the database. This doesn't work for triggers because it only works when there's an explicit relationship. DANGER!
* ON DELETE Set NULL
One of the 5 ways to deal with primary key value changes. Updates all rows associated with the primary key in the foreign key referenced tables to NULL values. Slightly better than cascade deleting.
* ON DELETE Set Default
One of the 5 ways to deal with primary key value changes. Updates all rows associated with the primary key in the foreign key referenced tables to whatever default value the user selects.
* One-to-many
One side is the primary key, and the other side is the foreign key. The constraint to maintain the referential integrity is the responsibility of the foreign key. Many records in Customers, relate to many records in Products. Requires the Orders table to define.
PL/SQL
Oracle's flavor of SQL
* Logical Attribute
Physical Column/ Field - One aspect of the entity. Customer name, product category, order date, or salary amount. Does not imply ordinal positioning.
* Logical Record
Physical Row - A single instance of the entity (e.g. a single customer, product, event)
* Logical Entity
Physical Table - What the data within the table describes. A person, place, thing, or event.
* Logical Data
Physical Value - The entry for a specific attribute of a single record (where the column and row meet). A single customer name, a single product, a single order, a single employee.
R
Popular statistical programming language
4 FOR XML clauses
RAW AUTO EXPLICIT PATH
Data
Raw and unprocessed. Words, numbers, dates with no context. Lacks business content and meaning.
First normal form
Remove redundant fields or comma separated values (no repeating attributes, no cellular lists, no repeating columns eg phone1, phone2, phone3... phoneN)
SELECT DISTINCT <column list>
Returns a set of values that are unique (no dupes) to the identified column(s). This can help identify a candidate key (when SELECT DISTINCT <column list>.SUM == SELECT *.SUM)
RAW mode
SELECT col1, col2 FROM schema.TableName WHERE col1 = 122 OR col1 = 119 FOR XML RAW; Produces: <row col1="122" col2 = "value1"> <row col1="119" col2 = "value2"> SELECT col1, col2 FROM schema.TableName WHERE col1 = 122 OR col1 = 119 FOR XML RAW, ELEMENTS; Produces: <row> <col1> 122 </col1> <col2> "value1" </col12> </row> <row> <col1> 119 </col1> <col2> "value2" </col12> </row> Generates a single <row> element per row in the rowset that is returned by the select statement.
AUTO mode
SELECT col1, col2 FROM schema.TableName WHERE col1 = 122 OR col1 = 119 ORDER BY col2 FOR XML AUTO; Produces <Table 1> <Table 2> <Table 3 result> <Table 3 result> <Table 3 result> SELECT col1, col2 FROM schema.TableName WHERE col1 = 122 OR col1 = 119 ORDER BY col2 FOR XML AUTO, ELEMENTS; Produces Every output as an attribute. For joined tables, the first table referenced becomes the outer group. Tables on subsequent joins become nested inner groups until you get to the innermost table being joined which becomes the innermost child element Generates query results as nested XML elements. This does not provide much control over the shape of the XML generated from a query result.
PolyBase
SQL Server 2016 includes PolyBase, which integrates SQL Server with Hadoop.
Data models
The entities are no longer store houses of events vs. objects (de-normalized). May not have a primary key.
Sub-type relationship
Specialization; A way to split records based on a specific attribute of that record. Good for creating OR logic (as opposed to the avoidance of NULL values) to traverse down a tree. Only applicable if the data already exists in the database. Not the same as querying. Improves performance and is in the area of 5th or 6th normal form. A transactional process for getting rid of NULL values.
Sixth normal form
Sub-typing. Break out categories and sub-categories (creating new tables for items that are commonly queried together). For big data.
* Physical
The container.
Normal Form
The normal form is in the state a table is in once the rule defined by the form is applied. Sequential rules apply (in order to have 2nd normal form, you need 1st normal form).
* Candidate key
The potential primary keys for a table (they COULD be a primary key). Use DISTINCT to find candidate keys
De-normalization
The process of breaking the rules of normal form. A common practice in data warehouse design. From a normalized database, selectively de-normalize some of the tables in it.
* Normalization
The process of reducing redundant data in a database. tHE PROCESS OF USING STANDARDIZED rules to build a database design that will remove redundancies, improve stability, eliminate anomolies, and improve data integrity.
EXPLICIT mode
Ugly and horrible because you have to EXPLICITLY define what the elements and attributes are. Query will look obnoxious.
Column-Oriented Databases
Used to optimize the storage and query response for data. Great when columns store a high number of duplicate values (for better aggregation). aka Column-store indexes. It points you to the index of an attribute instead of a record.
M
What PowerBI and PowerQuery use to interface with SQL queries (Microsoft unique language)
PATH mode
When combined with the nested FOR XML query capability provides the flexibility of the EXPLICIT mode in a simpler manner. Column name or column aliases are treated as XPath expression. These expressions indicate how the values are being mapped to XML.
Intelligence
You analyze information to obtain intelligence. Organizations can use intelligence to understand past performance to predict future trends.
* Many-to-many
You'll be jumping though an intermediary table for this one. For every one record in Customers, there can be many records in Orders