CSCI 3287: Database Systems
Constraint
when the DBMS is running, it will not allow any program to directly access any DBMS managed files (through OS permissions)
naming relationships
write the name of the entity above the rectangle and it should be singular
an occurrence of an entity
each row represents one occurrence of an entity
null/not null
enforces a column to NOT accept NULL values a column can hold null values
why use triggers?
enforces business rules, moves code from an application program to the database, performance improvement, all work done on server
currency control
ensures that one user's work does not inappropriately influence another user's work - no single concurrency control technique is ideal for all circumstances - tradeoffs need to be made between level of protection and throughput
purpose of indexes
faster processing by reducing disk I/O indexes are usually much smaller than the base tables they serve -- less I/O, less buffer space consumed. Indexes are often pinned in memory
tradeoffs with indexes
faster reads slower inserts/updates
the "trade off" with ACID transaction compliance
faster throughput data integrity/accuracy
table
general term for where data is stored in db's
Functional Dependencies (FD)
has the form of X --> Y (reads: X implies Y), where X and Y are sets of attributes. It means that whenever two tuples are identical on all the attributes in X, they must also be identical on all the attributes in Y alternatively: you can interpret X --> Y as: each possible value of X can correspond to exactly one value of Y
breaking deadlock
almost every DBMS has algs for detecting upfront when deadlock occurs, DBMS aborts one of the trans an rolls back any partially completed work
surrogate key
an artificial column added to a relation to serve as a primary key - DBMS supplied - short, numeric, and never changes - an ideal primary key - has artificial values that are meaningless to users - normally hidden in forms and reports
shared vs exclusive
an exclusive lock prohibits other users from using the locked resource a shared lock allows other users to read the locked resource but they cannot update it
Development of database
application program --> DBMS software --> database
Objective of normalization
arrange the data into a series of clearly defined, well-organized relations - each with a primary key - all attributes are functionally dependent on the primary key - each with required foreign keys referencing other relations
Why views?
base table or specific columns can be hidden from certain users who are only allowed access to the view very complex SQL to create the view can be hidden from end users
Why are products called "relational"?
because SQL lang based upon the principles of relational algebra
before and after images in logs
before: a copy of every database record (or page) before it was changed after: a copy of every database record (or page) after it was changed
buffer pools
caching disk I/O can hold objects in memory (smaller tables, indexes)
ROLLBACK
causes the transaction to abort, or terminate unsucessfully
COMMIT
causes the transaction to end successfully
Check
check is used as limit the value range that can be placed in a col if you use check constraint on a single col, it allows only certain values for this col, it applies the same for the table
select
choose which columns to display in the returned results
cold vs warm backup
cold: stop all db traffic and back up quickly MySQLDUMP vs file-level backup (OS) faster requires downtime warm: take the backup while db traffic is in-progress takes longer less inconvenience to user community
key
combination of one or more columns that is used to identify rows in a relation
projection
create a new subset of columns from R
DDL
data definition language - for creating tables, relationships, and other structures (create, drop, alter, truncate)
order by
designate the order in which selected rows should be displayed
read anomalies
dirty read, unrepeatable read, phantom read
Query cache
holds the compiled executable query queries may be run over and over by an application, changing only host variables
cardinality
how many of these are related to how many of these (zero, one or many) on both ends of the relationship
implicit vs explicit
implicit locks are locks placed by DBMS explicit locks are issued by the application program
types of locks
implicit vs explicit shared vs exclusive deadlock
relational data
is "self-describing" and it contains within itself all information about itself (aka metadata)
optionality
is the relationship mandatory (one or more) or optional (zero)
Full outer join
joins all rows in A and B plus those that have a matching entry in table A and B
hot standby via replication
keep a duplicate up-to-date at all times
Inner Join
keyword selects records that have matching values in both tables (joins only the rows that have a matching entry in table A and B)
cardinality and optionality symbols
line with two small cuts = one to mandatory line with one small cut and v = many to mandatory line with open circle and small cut = one to optional line with open circle and v = many to optional
what triggers a long flush to disk?
logs in memory must be written to disk (transaction logs) a log flush in MySQL ("redo logs" "archive logs" in Oracle)
foreign key - constraints
may be defined at the col level should be defined at the table level so you can give it a name maintaining RI
Unique
may have null values give it a name check constraint - applies cond to a col rules for conditions same as WHERE clause not avaliable in mysql
primary key - constraints
may not have null values may be defined at the column level if one column must be defined at the table level if more than one column it includes NOT NULL and UNIQUE
ID Dependence
ID dependence means they have the same ID in their entities
Domain relation name
(column-name1:datatype, column-name2:datatype) ProductID:integer
VIEW
- VIEW is an empty shell of a table definition - view contains no data until it is queried - sometimes considered a virtual table - each time the view is queried, the underlying query that populates the view is re-executed
Normalization
- a method of cleaning up and organizing data - structures the data to fit the RELATIONAL MODEL - a set of rules you apply to the data to clean up the data - a structured, defined, detailed process - BCNF is necessary in normalization process
Widespread use of relational DBMS software due to the following reasons:
- allow to query the info - supports huge data sets (terabytes, petabytes) - provide durability - help control access - 90% of market share
parent-to-child relationships
- an order belongs to one customer - a customer can place zero, one or many orders - an order may contain one or many OrderDetails - a product may be purchased on zero, one or many OrderDetails
Main process of data modeling
- draw a horizontal line across each entity rectangle, and enter the name of the primary key attribute above the line - then list all the remaining attributes within the rectangle below the line - identify any foreign key attributes with an (FK) - walk through the model with your customers
begin process of data modeling
- draw a rectangle to represent each ENTITY - draw a circle to contain each ATTRIBUTES - draw a diamond to represent the RELATIONSHIP - draw a RELATIONSHIP line between each related entity - Relationships with at most one instance of an entity connect with arrows - Relationships with at least one instance of an entity connect with bold lines - draw CARDINALITY and OPTIONALITY symbols on both ends of each relationship line - Highlight the Primary key - resolve any many-to-many relationships by creating an ASSOCIATION (child) entity between the two "parent" entities
Codd's Rules for Relational Model
- each attribute of a tuple must be atomic - each column of a relation defines a . domain - tuples in a relation are a SET, not a list. Order is immaterial. - the columns are a SET of attributes. Order is immaterial. - each tuple must have an attribute that uniquely identifies that tuple - there can be NO duplicate tuples in a relation
Rules about functional dependencies
- if i know a true FD for a relation, I can deduce other true FD's for that relation - Transitive Rule: in relation R(A,B,C) if A->B and B->C then A->C - Augmentation Rule: In relation R(A,B,C) if A->C then A+B->C+B - Combining and splitting rule in a relation R(A,B,C,D) if A+B->C, and A+B->D, then A+B->C,D - Trivial FDs - are assumed to be true. The right side is a subset of the left side - Closure: for a relation, we seek ALL TRUE functional dependency rules - A relation: R(A,B,C,G,H,I) FD's = A->B, A->C, C+G->H, C+G->I, B->H
Preparation of data modeling
- once data requirements are clear then decide business area you are modeling - organize all the data items into ENTITIES and ATTRIBUTES - determine an attribute that can serve as a PRIMARY KEY for each entity - if no appropriate candidate key exists, then plan to create SURROGATE key page
usefulness of DBMS software
- query the information into database - support huge data sets - provides durability (do not lose anything) - control access
Why reorganize an index? tradeoffs?
- reads go faster when indexed - inserts require not only data table updates, but also index updates - for applications that are insert-intensive, indexes can cause significant delays - for small tables, an index can actually hurt performance
group function/group by
- require sql to create an interim answer set, and then process the group function against the interim answer set, delivering a final answer set that contains only the final total for the function - when you combine a GROUP FUNCTION with a WHERE clause, keep in mind that the WHER clause simply reduces the number of rows in the INTERIM answer set before the GROUP function does its calculation
why has the relational model prevailed for over 30 years?
- solid, reliable operating principles - can grow very large (multi-terabytes) - efficient to manage, maintain - available talent in the marketplace - fabulous DBMS software alternatives - design approach is simple, limited yet versatile - data operations (via sql) are simple, limited yet versatile - SQL is very powerful: few lines of code = big results
B-tree
- supports both sequential access of rows AND keyed access - B = Balanced - B-tree indexes balance each leaf page node between half-full and full - B-tree indexes balance the levels of leaf pages - as pages get full, b-tree indexes will "split" to create more room for indexes to grow out - index has pre-allocated overflow space - as the overflow space gets full, the "split" leaf nodes get spread out - therefore B-tree indexes occasionally need to be reorganized, drop the index and recreate it
Subquery
1. WHERE. a query within a query. The answer set to an inner query is used as a predicate in a where clause in the outer query 2. SELECT or FROM derived value or table. Answer set to an inner query is used as a table in select or from clause 3. Subquery must be enclosed in ( )
Key (formal definition)
1. a set of attributes (A1, A2, A3, etc) is a KEY for the relation if that set of attributes functionally determines all the other attributes of the relation 2. No subset of those attributes (A1, A2, A3, etc) functionally determines all the other attributes of the relation
Steps in DB recovery
1. identify the most recent full backup 2. identify the latest transaction log file 3. id the point-in-time of the failure 4. restore the most recent full backup 5. apply transaction logs up to the last commit before the failure occured
Characteristics of Relational DB
1. purpose of a db is to help people track things of interest to them 2. data is stored in tables, which have many rows and cols like a spreadsheet 3. something we store data about is an entity 4. a db stores BOTH data (in tables) AND relationships (between tables) 5. a db may have multiple tables, where each table stores data about a different thing 6. each row in a table stores data about one occurrence of the thing of interest 7. each col is essentially one attribute of the entity or table 8. a db stores metadata - data about itself stored within itself
ACID compliance
A = atomicity - transaction is all or nothing C = Consistency - transaction must maintain data consistency I = isolation - multiple users/transactions cannot step on each other D = durability - once a transaction is finished, the results cannot be lost
Composite/Concatenated Key
A primary key that consists of two or more columns. Need NOT be contiguous.
DBMS must contain
ACID compliance
Rank in terms of complexity and price
Complexity: oracle - hardest IBM DB2 Microsoft SQL server MySQL Microsoft Access Expense: Oracle - $47000 SQL server - 15000 MySQL - 5000 MS Access - FREE
DML
Data manipulation language for querying and modifying the data (insert, update, delete, two formats for inserts, bulk load with insert)
Steps of normalization
First normal form: - remove any multi-valued cells and/or any rows requiring a specific sequence - remove repeating group of attributes - put repeating attributes into a new relation + ex. create a key for a new relation (concatenated key) the key of the original relation + an identifier for each occurrence of the Repeating Group of attributes Second Normal form: - for entities with concatenated keys, make sure that all attributes are dependent on the full key - only meaningful when there is a concatenated key - create a new relation - create a key for the new relation - migrate only the dependent cols to the new relation Third Normal form: - make sure that no attributes are dependent on any other non-key attributes - the same steps as 2nd normal form fourth normal form: - eliminate multivalued dependencies - multiple independent attributes are dependent on the same determinant
E.F. Codd
Founder of relational database, worked for IBM, wrote famous article descr. how relational model could eliminate lock between apps and data
handling dates
HH - represents hour (00 to 23) MM - represents the minutes SS - represents the seconds NNN - represents the fractional seconds
As of fall 2016 there are 4 out top 5 software products used in industry are relational. Which are these?
Oracle, MySQL, SQL server, PostgreSQL
constraint
R <constraint> S says that every tuple in R is also in S R = 0 is a constraint that says R must be empty
where
Restricts the rows returned by placing limits on the col values
primary key
a candidate key selected as the primary means of identifying rows in a relation (only one primary key per relation)
data modeling
a conceptual picture of your database, drawn with a computer based tool
schema
a description of the relation in this format(column-name1, column-name2) used when using SELECT
attributes
a fact of characteristics describing an entity
column
a fact of characteristics describing an entity
table
a general term for where data is stored in databases
What is a DBMS software?
a layer between the data and the application programs
trigger
a piece of code, associated with a TABLE, associated with an EVENT, the CODE fires when the EVENT happens
tuple
a row (a set, not a list each tuple must have an attribute that uniquely identifies that tuple, and no duplicate tuples, atomic)
superkey
a set of attributes within a table whose values can be used to uniquely identify a tuple
relation
a specific kind of table that follow rules such as each cell must hold a single value
relation
a specific kind of table that follows rules such as each cell must hold a single value
page
a unit of I/O, same as "block"
R <union> S
all elements in R or S or both
benefits of clustered index
allow for binary search foreign keys may be indexed other non-key columns may be indexed ("secondary" index) data rows will not be in secondary key sequence MySQL secondary indexes include the primary key values so that access by secondary keys can leverage the clustered index
preventing deadlock
allow users to issue all lock requests up front
why do we care about all of this?
must consider all data when designing a database, must organize data into relations. need to understand FD's to determine keys for every relation. Need keys so that you can normalize all of the data. Must normalize the data to design the database properly.
deadlock
occurs when tow transactions are each waiting on a resource that the other transaction holds
rules of "ONE"
one entity = one table one occurrence of that entity = one row one attribute = one column all cols in the row describe ONE occurrence of that entity all cos are functionally dependent on one primary key
product
one table multiplied by another table (interim answer set)
order by
options are ASC or DESC, defaults to ASC
data modeling process
prep begin main process
referential integrity
prevents inserting a row into a child table where the parent key value is missing
constraints
primary key null/not null unique check foreign key
foreign key
primary key of one relation that is placed into another relation to form a link between the relations. can be a single column or a composite key best enforces referential integrity
join
produce a new relation that combines matching tuples on a common attribute
theta join
produce a new relation that combines matching tuples on a condition
selection
produce a new relation that is a subset of tuples from R
cartesian product
produce a new relation that is the combination of every tuple in one table combined with every tuple in the other in mysql the CROSS JOIN produced a result set which is the product of rows of two associated tables when no WHERE clause is used with CROSS JOIN. result set appeared by multiplying each row of the first table with all rows in the second table if no condition introduced with CROSS JOIN product = one table multiplied by another table (interim answer set)
count
provides a count of how many rows have a value in a column, counted across many rows can be used with any column or with a (*) to simplify count rows
avg
provides the average of the values in a column across many rows must only be used with numeric columns
max
provides the highest value in a column across many rows
min
provides the lowest value in a column across many rows
sum
provides the sum of the values in a column across many rows must only be used with numeric columns
isolation levels
read uncommited, read commited, repeatable read, serializable
anomolies
redundancy - storing a piece of data more than once update/delete errors
What does 'serializable' transactions mean?
refers to two transactions that run concurrently and generate results that are consistent with the results that would have occurred if they had run separately
TRUNCATE statement
removes all rows, keeps structure
DROP statement
removes all rows, removes structure
What does the DBMS software do?
removes necessity for application programs to maintain a direct record/field mapping to the way the data is physically stored
how do locks implement currency control?
resource locking prevents multiple applications from obtaining copies of the same record when the record is about to be changed
distinct
restrict the returned results to have unique values - sql cannot easily determine whether or not a row is a duplicate of another row - answer set may contain duplicate rows - removes duplicates from the answer set
Outer Left/Right Join
returns ALL rows from LEFT/RIGHT table only matching rows from RIGHT/LEFT table
recursive relationship
same entity to participate in the relationship
DESCRIBE statement
shows you want mysql knows about a table
session memory
smaller temp tables - client specific memory - table read buffers - sort operations
entity
some identifiable person, place, thing, event
clustered index
sort and store the data rows in the table or view based on their key values are the columns included in the index definition can only be one clustered index per table, because the data rows themselves can be stored in only one order if table has no clustered index, its data rows are stored in an unordered structure (heap)
thread cache
splits up the execution paths of MySQL into multiple threads, one per user conenction, so they run in parallel
supertype/subtype
super type contains all common attributes, while the subtypes contain specific attributes super type may have a discriminator attribute which indicates the subtype
answer set
table (create and execute queries)
we often use which terms interchangably
table, relation, file, column, attribute, field, row, tuple, record
USE
tells the query engine which database you want to use for your query
Catalog consists of?
the catalog consists of tables. catalog tables contain information about tables, cols, rows, indexes, users, etc.
R <intersection> S
the set of elements in R and S
R <difference> S
the set of elements in R but not in S
insert
used to add new data to a table
group by
used to collapse row with identical values for one or more columns together
Alias
used to give a table, or a column in a table a temporary name often used to make column names more readable alias only exists for duration of the query
association/associate entity
used to resolve a many-to-many relationship by creating an ASSOCIATION("child") entity between the two "parent" entities a dependent child between two independent parents relationship to parents is mandatory