Database Systems Final
Consider the Entity, Boundary, and Controller class categories from the Unified Process. 1. How are classes from the Entity category used when defining our application's schema?
1. Entity classes identify the relations (tables) in our schema. Associations (1-N, N-M) between Entity classes identify the FK relationships in our schema.
1. Describe the primary difference between a dense and sparse index.
A dense index maintains an index entry for every record in the data file. A sparse index maintains an index entry only for each block's anchor record (i.e. the first record of every block) and not for every record in the data file.
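A minimal sketch of the distinction (not from the slides; the record values, block size, and `(key, block_no)` entry layout are illustrative): group an ordered file's records into fixed-size blocks, then build both index kinds and compare entry counts.

```python
# Sketch: contrast dense vs sparse index entry counts for an ordered
# data file whose records are grouped into fixed-size blocks.

def build_indexes(records, records_per_block):
    """records: sorted list of key values; returns (dense, sparse) entries."""
    blocks = [records[i:i + records_per_block]
              for i in range(0, len(records), records_per_block)]
    # Dense index: one (key, block_no) entry for EVERY record.
    dense = [(key, b) for b, block in enumerate(blocks) for key in block]
    # Sparse index: one (anchor_key, block_no) entry per block, where the
    # anchor is the first record of the block.
    sparse = [(block[0], b) for b, block in enumerate(blocks)]
    return dense, sparse

dense, sparse = build_indexes(list(range(100, 200)), records_per_block=10)
print(len(dense), len(sparse))  # 100 entries vs 10 entries
```

Note that the sparse index also illustrates the anchor-record ordering property discussed below: each block's anchor key is less than the next block's anchor key.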
1. What is a serial schedule of multiple transactions?
A serial schedule is the scheduling of multiple transactions for execution in complete isolation, i.e. there is no overlap in the execution of two or more transactions.
Identify the cardinality of each of the following crows-foot (A-D) associations. (Diagram in document)
A: Zero or More B: One or More C: One and Only One D: Zero or One
Our ecommerce application requires that a customer's information be updated i.e. changed such as modify birthdate. Normally, update is an action and not an entity in our design. Under what circumstances would "Customer Update" be considered an Entity to be persisted?
If we needed to record information about individual update operations. For example, the date / time the update occurred, ID of the user making the update, etc.
1. What are two methods of placing data records into disk blocks?
Spanned and Unspanned Records.
Each of these questions refers to ordered data files. 3. What is the relationship between the anchor records of block i and block i+1?
The Anchor Record (AR) index field value of block i should be < the Anchor Record value of block i+1.
4. What is the recommended alternative to storing complex information in a composite attribute?
The recommended alternative is to migrate the composite attribute to a separate table (possibly a weak entity) where each component is maintained by a separate attribute.
2. How does the use of a transaction log increase DBMS performance?
The transaction log is used to persist changes to disk blocks without the need to write the dirty blocks to the drive (reducing I/O operations). Every block read and write is recorded as a log entry and multiple entries are persisted in a single block write (as opposed to N dirty block writes).
Are each of the following schedules Conflict Equivalent to the given serial schedule? Serial: R1(X); W1(X); R1(Y); W1(Y); R2(X); W2(X); 1. R1(X); W1(X); R2(X); W2(X); R1(Y); W1(Y); Hint: You will need to build a Precedence Graph for each schedule.
Yes, Serial and Schedule 1 are Conflict Equivalent. The conflicting operations are executed in the same order in both schedules.
Describe the general solution to normalizing (fixing) relations that are not in second or third normal forms?
"Normalization through Decomposition"
Provide the names and a brief description of the meaning of each letter in the acronym ACID.
**Note: For the exam do not confuse ACID with Isolation Levels. Atomicity: A transaction is an atomic unit of processing; it is either performed in its entirety or not performed at all. Consistency: A correct execution of the transaction must take the database from one consistent (correct) state to another. Isolation: A transaction should not make its updates visible to other transactions until it is committed. Durability: Once a committed transaction changes the state of the database, those changes can never be rolled back (aborted) or otherwise lost e.g. because of system failure.
Name and describe the four levels of Transaction Isolation provided to SQL clients.
**Note: For the exam do not confuse Isolation Levels with ACID. 1. Read Uncommitted: A transaction is allowed to read values written by other transactions but not yet committed by those transactions. 2. Read Committed: A transaction is allowed to read only committed values i.e. values that have been committed by other transactions. 3. Repeatable Reads: Once a transaction starts, the client will always read the same committed value for some X, even if a new value for X has been committed by other transactions. 4. Serializable: The transaction executes in isolation from all other transactions i.e. serial schedules.
What are the four reasons given why the execution of serial schedules always results in a correct database state?
1. All Transactions are assumed to start with a correct database state. 2. Transactions executed in isolation are assumed to produce a correct state once they commit (or are aborted). If an individual transaction corrupts the application's data, the problem lies in the application logic and not the schedule. 3. A serial schedule fully executes each of its N transactions in isolation from the other transactions in the schedule. Only one transaction can be executing, and an individual transaction will run to completion before the next transaction is allowed to start. 4. It follows that once a serial schedule completes the serial execution of its N transactions, the database will be in a correct state.
3. How would this relation suffer from Update Anomalies? CUSTOMER_PURCHASE (CUSTOMER_ID*, PRODUCT_ID*, PURCHASE_DATE, PURCHASE_AMT, PRODUCT_DESCRIPTION, CUSTOMER_NAME) Note: the relation's primary key attributes are marked with *.
1. Because PRODUCT_DESCRIPTION and CUSTOMER_NAME are partially dependent on the relation's PK, these attributes will be duplicated in every tuple that shares the same product or customer. Changing the value of a customer's name or product's description will require updating every tuple that references the entity being modified. 2. Until a customer makes a purchase, they are not represented in the database (assuming no other table maintains customer information). 3. If we eliminate a product from the system (for whatever reason), we will remove all its purchases and the customer information associated with those purchases. This could result in removing customers entirely from the system.
This question is from the previous section, but the information is worth repeating. 1. Define "Blocking Factor" of a file's records.
1. Blocking Factor is floor(B/R), where B is the block size and R is the record size. Blocking Factor (BF) describes how many 'units' fit on a single drive block. The BF of an index file describes how many index entries will fit in a block. The BF of a data file describes how many records (table rows) will fit on a block.
What are the three properties of transactions described in the slides?
1. From the POV of the database's global state, either a transaction's operations complete entirely or are never executed. 2. Multiple transactions can be executing concurrently against shared (common) DBMS state (rows, attributes, etc.). 3. Depending on its "Isolation Level", a transaction can fully isolate the changes it makes to the state it shares with other concurrently executing transactions.
There are two alternative methods of representing an employee's optional (single) phone number. The first is to use a nullable attribute and the second is to use a weak entity. 1. Given a requirement that a large number of the employees (> 80%) are assigned a phone number, which is the preferred method and why?
1. If a large percentage of EMPLOYEE tuples have a single office number, the use of a nullable attribute would seem appropriate. This is because the nullable attribute is more efficient (no join needed), and the large percentage of employees with numbers indicates that the nullable attribute will not be wasteful of table space.
Describe two issues in index search efficiency when indexing using a surrogate integer field vs using an email address hint: comparison and I/O.
1. The comparison of an integer value requires one comparison operation vs N comparisons needed to compare a string of length N. 2. The index file entry blocking factor will be greater for the integer index value vs a variable length string.
The book describes three possible methods of bringing Department (Figure 14.9 B) into First Normal Form. We discussed the first solution in class. Describe the other two possible methods of normalization (changes to the Department schema) and the problems that result when each is applied.
1. The first solution was to move department locations into a separate relation using the department's ID as a foreign key. This was the solution discussed in class and is the preferred solution. 2. The second solution is to expand the relation's primary key to include both department number and individual location. This has the effect of creating a separate tuple for each of a department's locations. For example, Research will be expanded into three tuples for its three locations. It also introduces redundant values into the relation, e.g. dmgr_ssn would be duplicated across all three location tuples (bad). 3. If there is a maximum number of department locations (e.g. three locations), we can introduce three nullable location attributes into the relation. This has the disadvantages of introducing null attributes into the relation (a waste of table space in the DBMS) and of making it difficult to build selects for specific values.
What are the three conditions that identify a conflict between two operations in the same schedule?
1. The operations belong to different transactions. 2. The operations access the same item X. 3. At least one of the operations is a write_item(X).
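The three conditions above can be encoded directly, and the resulting precedence graph used to test the schedules from the conflict-equivalence questions in this section. This is a sketch (the `(txn, action, item)` tuple encoding is my own, not from the slides):

```python
# Sketch: apply the three conflict conditions to build a precedence graph
# and test two schedules for conflict equivalence to a serial schedule.
# Operations are (txn, action, item) tuples, e.g. (1, 'R', 'X') for R1(X).

def conflicts(op_a, op_b):
    ta, aa, xa = op_a
    tb, ab, xb = op_b
    # 1. different transactions, 2. same item, 3. at least one is a write
    return ta != tb and xa == xb and 'W' in (aa, ab)

def precedence_edges(schedule):
    """Directed edges Ti -> Tj for each pair of conflicting operations."""
    edges = set()
    for i, a in enumerate(schedule):
        for b in schedule[i + 1:]:
            if conflicts(a, b):
                edges.add((a[0], b[0]))
    return edges

serial = [(1,'R','X'), (1,'W','X'), (1,'R','Y'), (1,'W','Y'),
          (2,'R','X'), (2,'W','X')]
s1 = [(1,'R','X'), (1,'W','X'), (2,'R','X'), (2,'W','X'),
      (1,'R','Y'), (1,'W','Y')]
s2 = [(1,'R','X'), (2,'R','X'), (1,'W','X'), (1,'R','Y'),
      (2,'W','X'), (1,'W','Y')]

print(precedence_edges(s1) == precedence_edges(serial))  # True
print(precedence_edges(s2) == precedence_edges(serial))  # False
```

Schedule 1's graph matches the serial schedule's (a single T1 → T2 edge), while Schedule 2 adds a T2 → T1 edge because R2(X) precedes W1(X), so the graphs differ.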
What are the three given methods of implementing a derived attribute?
1. Using built-in functions, e.g. aggregate functions MIN(), AVG(), etc. 2. As the attribute of a view. 3. As functions written in the DBMS procedural DML.
Consider the Entity, Boundary, and Controller class categories from the Unified Process. 2. How are classes from the Controller category used in defining our application's schema?
2. Controller classes implement application-specific operations. This may require information from the DB (reports requiring specific selects). Supporting controller operations may lead to the addition of new indexes, foreign keys, views, and materialized views to the schema.
There are two alternative methods of representing an employee's optional (single) phone number. The first is to use a nullable attribute and the second is to use a weak entity. 2. Given a requirement that only a small number of employees (< 10%) are assigned phone numbers, which is the preferred method and why?
2. If a small percentage of EMPLOYEE tuples will have a single office number, the use of a nullable attribute would waste table space, so the weak entity is the best choice. However, the additional processing of performing a join with the weak relation may still make the nullable attribute more attractive.
This question is from the previous section, but the information is worth repeating. 2. In terms of Blocking Factors, what is the advantage of using an index file to search for a specific record in a data file? Hint: The Blocking Factor of the index file vs. the Blocking Factor of the data file.
2. Increasing the performance of most database operations means minimizing the number of blocks that must be read or written from/to the disk. The blocking factor of an index file is much greater (better) than the BF of the data file because more index entries than data file records fit on a block. When searching for a target record, we can locate the data file block containing the target record with fewer block reads if we first search the index to identify the target record's data file block. That is, the I/O efficiency of an index file is much greater than that of a data file.
There are two alternative methods of representing an employee's optional (single) phone number. The first is to use a nullable attribute and the second is to use a weak entity. 3. How would the requirement of 'multiple phone numbers can be assigned to an Employee' affect your decision?
3. A 1-M relationship between employee and phone numbers would require the use of a weak entity.
1. What type of association cardinality between entity classes requires a Join Table to be implemented in the schema?
An N-M association between two entities. Recall, a join table is the mechanism used to implement N-M associations.
Describe Phantom Reads within a transaction.
A Phantom Read is concerned with result sets and not individual records / values. A phantom read occurs when two selects with identical criteria, in the same transaction, produce different result sets because of other transactions' inserts or deletes. It is expected that, within a transaction, identical selects produce the same result sets.
The following questions refer to 'Primary Index on Ordered Data File's Unique Ordering Attributes'. 1. What are the qualities of the data file attributes that can be used to build a Primary Index as described in the slides?
A Primary Index is defined on a unique ordering attribute of an ordered data file. That is, the table's primary key e.g. SSN in the example Employee table.
Describe what triggers a Cascade Abort on an uncommitted transaction.
A cascade abort of Transaction A is triggered by the abort of a second transaction that A depends upon, i.e. Transaction A has read a value that the second transaction previously wrote, and that transaction then aborts. For example, in the schedule S: r1(X); w1(X); r2(X); w2(X); a1, T1's abort will cause T2 to cascade abort because T2's read of X made it dependent on T1's w1(X).
Describe the meaning of a Functional Dependency between attributes in a single relation A1 → A2. Use an example such as the Employee relation's SSN (Social Security Number) and the remaining attributes in the relation. Do not submit SSN as an answer.
A functional dependency A1 → A2 means that for every unique value assigned to attribute A1, the same value must be found assigned to A2. That is, A1 consistently identifies A2. For example, every tuple with a given social security number must contain the same first name, last name, etc. In other words, the same SSN (A1) must not appear in tuples describing different employees / names (A2). Given a relation R, a set of attributes X in R is said to functionally determine another set of attributes Y in R if each X value is always associated with (produces) the same Y value.
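The definition can be checked mechanically over a relation's tuples. This sketch represents tuples as dicts (the `ssn`/`fname`/`lname` column names and sample rows are hypothetical), and the second check also illustrates the asymmetry question later in this section:

```python
# Sketch: check whether a functional dependency X -> Y holds in a relation
# represented as a list of dicts (one dict per tuple).

def fd_holds(rows, lhs, rhs):
    """True iff every distinct lhs value maps to exactly one rhs value."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        if seen.setdefault(x, y) != y:
            return False  # same X associated with two different Y values
    return True

employees = [
    {'ssn': '111', 'fname': 'Ann', 'lname': 'Lee'},
    {'ssn': '222', 'fname': 'Bob', 'lname': 'Ray'},
    {'ssn': '333', 'fname': 'Ann', 'lname': 'Lee'},  # a second "Ann Lee"
]
print(fd_holds(employees, ['ssn'], ['fname', 'lname']))  # True
print(fd_holds(employees, ['fname', 'lname'], ['ssn']))  # False: not symmetric
```

SSN → (fname, lname) holds because each SSN always appears with the same name, but (fname, lname) → SSN fails because two distinct employees share the name "Ann Lee".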
2. What is the minimum number of associated entities required to build a join table?
A join table requires at least two entities.
2. What is the significance of a higher blocking factor?
A larger blocking factor (e.g. 4096/2048 = 2 vs 4096/1024 = 4) implies that more records, and so more information, are retrieved with every drive READ BLOCK operation. We want to reduce the number of I/O operations. By increasing the amount of information maintained in each block, we reduce the number of block reads and increase the efficiency of record search and other DBMS operations.
1. Describe the meaning of a multi-valued attribute. 2. Provide an example of a multi-value attribute. Do not use the example from the slides. 3. How do we convert a multi-valued attribute into a weak entity? For example, maintain a customer's zero or more favorite colors.
A multi-valued attribute maintains multiple values for the same attribute, e.g. a customer's favorite colors, perhaps recorded by providing N color columns (color1, color2, etc.) to hold up to N colors per row. When converting to a weak entity, each of the zero or more values is represented not as a combined single attribute (e.g. "red,blue") but as a separate row in the weak entity. For example, the customer's favorite colors would be maintained in a weak entity table with two attributes: 1) the foreign key into the Customer table and 2) an attribute holding one of the color choices. The choices "red" and "blue" would be maintained in two tuples of the weak entity table.
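The weak-entity conversion can be sketched with sqlite3 (the table and column names `customer` / `customer_favorite_color` are illustrative, not from the slides); note the weak entity's key is the combination of the foreign key and the value itself:

```python
# Sketch: a multi-valued attribute (favorite colors) stored as a weak
# entity: one row per (customer, color) pair.
import sqlite3

db = sqlite3.connect(':memory:')
db.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    -- Weak entity: FK into customer plus one color value per row.
    CREATE TABLE customer_favorite_color (
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        color       TEXT NOT NULL,
        PRIMARY KEY (customer_id, color)
    );
""")
db.execute("INSERT INTO customer VALUES (1, 'Ann')")
db.executemany("INSERT INTO customer_favorite_color VALUES (?, ?)",
               [(1, 'red'), (1, 'blue')])
colors = [c for (c,) in db.execute(
    "SELECT color FROM customer_favorite_color "
    "WHERE customer_id = 1 ORDER BY color")]
print(colors)  # ['blue', 'red']
```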
Are each of the following schedules recoverable? 3. R1(X); W1(X); R2(X); R1(Y); A1; W2(X); C2; Hint: You will need to build a Precedence Graph for each schedule.
A schedule S is recoverable if no transaction T in S commits until all transactions T' that have written some value X that T reads have committed. No. T2 read a value of X written by T1 and then committed even though T1 aborted.
Are each of the following schedules recoverable? 1. R1(X); W1(X); C1; R2(X); W2(X); C2; Hint: You will need to build a Precedence Graph for each schedule.
A schedule S is recoverable if no transaction T in S commits until all transactions T' that have written some value X that T reads have committed. Yes this schedule is recoverable.
Are each of the following schedules recoverable? 2. R1(X); R2(X); W1(X); R1(Y); W2(X); C2; W1(Y); C1; Hint: You will need to build a Precedence Graph for each schedule.
A schedule S is recoverable if no transaction T in S commits until all transactions T' that have written some value X that T reads have committed. Yes. Even though it suffers from a lost update, this schedule is still recoverable because T2 is not dependent on T1 i.e. T2 reads X before T1 writes X.
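The reads-from definition repeated in these three answers can be turned into a small checker. This is a sketch (the `(txn, action, item)` encoding and single-writer bookkeeping are my own simplifications); it reproduces the three verdicts above:

```python
# Sketch: a reads-from based recoverability check.
# Operations are (txn, action[, item]); action is 'R', 'W', 'C' (commit)
# or 'A' (abort).

def is_recoverable(schedule):
    last_writer = {}   # item -> txn that most recently wrote it
    reads_from = {}    # txn -> set of txns whose writes it has read
    finished = {}      # txn -> 'C' or 'A'
    for txn, action, *rest in schedule:
        if action == 'W':
            last_writer[rest[0]] = txn
        elif action == 'R':
            w = last_writer.get(rest[0])
            if w is not None and w != txn:
                reads_from.setdefault(txn, set()).add(w)
        elif action == 'C':
            # T may commit only after every txn it read from has committed.
            for w in reads_from.get(txn, ()):
                if finished.get(w) != 'C':
                    return False
            finished[txn] = 'C'
        elif action == 'A':
            finished[txn] = 'A'
    return True

s1 = [(1,'R','X'),(1,'W','X'),(1,'C'),(2,'R','X'),(2,'W','X'),(2,'C')]
s2 = [(1,'R','X'),(2,'R','X'),(1,'W','X'),(1,'R','Y'),(2,'W','X'),(2,'C'),
      (1,'W','Y'),(1,'C')]
s3 = [(1,'R','X'),(1,'W','X'),(2,'R','X'),(1,'R','Y'),(1,'A'),(2,'W','X'),
      (2,'C')]
print(is_recoverable(s1), is_recoverable(s2), is_recoverable(s3))
# True True False
```

Schedule 2 is recoverable despite the lost update because T2's read of X happens before T1's write, so T2 never reads from T1; Schedule 3 fails because T2 commits after reading a value written by the aborted T1.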
This question applies to "Secondary Index on Order Data File's Non-Ordering Attributes". 2. Why is a secondary index less efficient that a Primary Index on an Ordered Data File?
A secondary dense index is less efficient because the index requires N-1 times more index entries than a primary sparse index where N is the data file's blocking factor (number of data records that fit into a block).
This question applies to "Secondary Index on Ordered Data File's Non-Ordering Attributes". 1. Why must a secondary index always be a dense index?
A secondary index is created against the data file's non-ordering attributes i.e. not the physical ordering of the data file blocks. Data records are randomly ordered so no anchor record is possible. So unlike a sparse index, a secondary index requires an index entry for every record in the data file. This is the definition of a dense index.
2. Which SQL SELECT operator benefits from a B+ tree? Briefly explain your answer.
A select operation that identifies a range of consecutive indexed values. For example, a BETWEEN operator identifying a range of records between two indexed attribute values, e.g. dates. This is because the leaf nodes of the B+ tree maintain pointers to the next index block, and moving from one block to the next in ascending or descending order is an O(1) operation.
3. Can we apply a sparse index to an unordered data file? Explain your answer.
A sparse index cannot be utilized for an unordered data file. A sparse index relies on the fact that the indexed attribute value is ascending across data file blocks. That is, that the indexed attribute values on data file block i is less than the records on block i+1 (assuming ascending order). An unordered data file places records randomly across its blocks, and so each unique index value (data record) requires a pointer to the record's block i.e. a dense index.
1. Consider a two level index on an ordered or unordered data file. What is meant by: The second level index indexes the first level index?
A two level index has a primary (first level) index that indexes the data file blocks (sparse index on an ordered data file) or individual records (dense index on an unordered file). However, the first level index is always an ordered file and so can itself be indexed using a sparse index. That is, a second level index can be added that indexes the anchor entries of each of the first level index blocks. When using a two level index, the DBMS first searches the sparse second level index to directly identify the first level index block that may contain the searched-for value. This is an improvement over the binary search needed when using a single-level index.
2. What are the advantages and disadvantages of each?
A. Ordered Files: Advantage: The data can be searched in O(log2 b) time using a binary search. Disadvantage: An empty slot must be made when inserting a new record into the 'middle' of the file; conversely, the holes created by deleted records must be filled in by shifting the records that follow them. Unordered Files: Advantage: New records are appended to the end of the file, so insertion is very efficient. Disadvantage: Searching the file for a specific record requires a scan of the entire file's contents (unless an index is applied).
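The search-cost difference can be quantified in block reads. A small sketch (counting block accesses only, no real I/O; the block count 1112 is taken from the worked example in this section):

```python
# Sketch: block reads to locate a record's block via binary search on an
# ordered file vs a worst-case linear scan of an unordered file.
import math

def binary_search_reads(num_blocks):
    # At most ceil(log2 b) block reads to locate the target block.
    return math.ceil(math.log2(num_blocks))

def linear_scan_reads(num_blocks):
    # Worst case: every block must be read.
    return num_blocks

b = 1112  # data file block count from the worked example in this section
print(binary_search_reads(b), linear_scan_reads(b))  # 11 vs 1112
```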
Answer this question in the context of the broken EMP_DEPT relation presented in the chapter and discussed in class. Why is it difficult to support a primary key in the EMP_DEPT relation?
All attributes must be nullable, even the primary key; but primary keys cannot be null. This is because inserting a record describing a DEPT without employees requires that all of the employee attributes be null, and inserting a record describing an EMP unassigned to a department requires that the department attributes be null.
Describe the Unrepeatable Read problem.
An Unrepeatable Read occurs when, during transaction Ti, a record is read several times and different values are returned because of other transactions' commits that occur during Ti's execution.
What features of a database schema are identified during the Analysis phase of the SDLC? Hint: Three features.
An application's database schema is produced as a result of: 1. The entity classes that require persistence are translated into tables with columns defined by the class attributes. Most, but not every, entity identified must be persisted. 2. The 1-N and N-M associations identified between entity classes are translated into foreign key relations and join tables. 3. Implementing operations such as entity retrievals, reports, and other operations against persisted entities may require the addition of indexes, views, custom functions, and other DBMS objects to the schema.
In terms of blocking factors, how do indexes promote more efficient searches of the data file?
An index-file entry is typically much smaller than the data-file record. Each index file entry maintains only the ordering field's value and the location of the field's record (typically the block number containing the target record) in the unordered data file. The smaller index entry means a larger blocking factor (B/R or more records per block) than the data file records so an I/O operation retrieving index file entries contains more information per block read. Identifying the location of a target record in an unordered file will require fewer block reads than searching for the record directly in the data file.
2. In terms of functional dependency, describe why the following CUSTOMER_PURCHASE relation is not in second normal form. This table describes every instance of an individual Customer purchasing a Product. CUSTOMER_PURCHASE (CUSTOMER_ID*, PRODUCT_ID*, PURCHASE_DATE, PURCHASE_AMT, PRODUCT_DESCRIPTION, CUSTOMER_NAME)
B. This relation mixes attributes of a customer's purchase, (CUSTOMER_ID, PRODUCT_ID) → (PURCHASE_DATE, PURCHASE_AMT), with attributes that are dependent on only part of the key, i.e. PRODUCT_ID → PRODUCT_DESCRIPTION and CUSTOMER_ID → CUSTOMER_NAME.
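The "normalization through decomposition" fix mentioned earlier in this section can be sketched with sqlite3: each partial dependency gets its own relation. This is illustrative only (the sample data is invented, and I extend the purchase table's key with PURCHASE_DATE as an assumption so one customer can buy the same product twice):

```python
# Sketch: decompose CUSTOMER_PURCHASE into 2NF; the customer's name is
# stored exactly once, so renaming a customer touches one row.
import sqlite3

db = sqlite3.connect(':memory:')
db.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY,
                           customer_name TEXT);
    CREATE TABLE product  (product_id INTEGER PRIMARY KEY,
                           product_description TEXT);
    CREATE TABLE customer_purchase (
        customer_id   INTEGER REFERENCES customer(customer_id),
        product_id    INTEGER REFERENCES product(product_id),
        purchase_date TEXT,
        purchase_amt  REAL,
        PRIMARY KEY (customer_id, product_id, purchase_date)
    );
""")
db.execute("INSERT INTO customer VALUES (1, 'Ann')")
db.execute("INSERT INTO product  VALUES (10, 'Widget')")
db.executemany("INSERT INTO customer_purchase VALUES (?,?,?,?)",
               [(1, 10, '2024-01-01', 9.99), (1, 10, '2024-02-01', 9.99)])
# One UPDATE fixes the name for every purchase -- no update anomaly.
db.execute("UPDATE customer SET customer_name = 'Anne' WHERE customer_id = 1")
row = db.execute(
    "SELECT DISTINCT customer_name "
    "FROM customer JOIN customer_purchase USING (customer_id)").fetchone()
print(row)  # ('Anne',)
```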
2. The number of block reads to search using an ordered data file and no index.
Binary search of the ordered data file: ceil(log2 b) = ceil(log2 1112) = 11 block reads.
1. Describe why we describe drives as 'block devices'.
A block device stores and manages its data in fixed-size blocks, e.g. 4096 bytes. Data is transferred to / from the device as whole blocks rather than one word / byte at a time.
1. Describe "Blocking Factor".
Blocking factor describes the number of records that can be placed in a disk block. It is defined as floor(B/R), where B and R are the block and record sizes.
Using a block size of 4096 (2^12), repeat the calculation given in the slides "Efficiency of Indexes for Search Operations" to produce the number of block reads. · Record size R = 150 bytes · Block size B = 4096 bytes · Number of Data File Records r = 30000 · Index Entry Size RI = (V_SSN + P_R) = (9 + 7) = 16 bytes. Provide the number of block reads for... Calculate what you need to answer the following 3 questions.
Data File Blocking Factor Bfr = floor(B/R) = floor(4096/150) = 27 records/block. Number of data file blocks b = ceil(r/Bfr) = ceil(30000/27) = 1112 data file blocks. Index File Blocking Factor BfrI = floor(B/RI) = floor(4096/16) = 256 index entries/block. Number of index file blocks bI = ceil(r/BfrI) = ceil(30000/256) = 118 index file blocks.
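The same arithmetic as a runnable calculation (floor for records per block, ceiling for block counts and for the binary-search depth):

```python
# Sketch: reproduce the index-efficiency calculation in code.
import math

B, R, r, RI = 4096, 150, 30000, 16  # block, record, record count, entry size

bfr  = B // R               # data file blocking factor
b    = math.ceil(r / bfr)   # number of data file blocks
bfri = B // RI              # index file blocking factor
bi   = math.ceil(r / bfri)  # number of index file blocks

print(bfr, b, bfri, bi)              # 27 1112 256 118
print(math.ceil(math.log2(b)))       # 11 reads: binary search, no index
print(math.ceil(math.log2(bi)) + 1)  # 8 reads: index search + 1 data block
```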
3. Describe the steps required to update the data (e.g. a record) stored on a block device.
Data Update: - The block containing the record must be read from the device into memory. - The record within the block is updated in memory. - The entire block is written back to the drive. This includes the data surrounding the record which was not updated. The key point is that data is not manipulated directly on the drive. The block containing the data to be updated must first be copied into memory where it can be accessed by the processor.
4. What is a Transaction Log Redo operation and when is it performed?
During database startup, the Redo ensures that committed changes to the database state present in the log file are reflected in the associated objects (tables, indexes, etc.). This is needed when the system crashes after a transaction is recorded in the log file but before its dirty data and index file blocks have been flushed to the drive.
1. What are the two records structures discussed in the slides? Hint: Relates to how records lengths are defined.
Fixed Length and Variable Length Records.
2. What is an advantage and disadvantage of each structure?
Fixed Length: Advantage: Simple to calculate the starting byte of each record within the file and starting byte of fields within records. Disadvantage: Wasteful of space when storing variable length data because the max data length must be allocated for every record e.g. Strings. Variable Length: Advantage: Conserves disk space by using just the space needed to hold variable length data. Disadvantage: Expensive to locate specific records within a file and specific fields within records.
2. Provide an example of a Composite Attribute. Do not present the example given in the slides.
For example, if we attempted to maintain a customer's address in a single composite attribute, we would have values such as "223 First St;Apt 123;Plano;TX;75058", where the semicolon delimits the boundary between each of the address's components.
Provide an example of how functional dependencies are not symmetric i.e. if A1 → A2 it is not necessarily true that A2 → A1.
If an SSN identifies a customer's (fname, lname), it is not necessarily the case that a (fname, lname) pair identifies a unique SSN, since two customers may share the same name. Naturally, other examples are possible.
As described in the text, what are the two strategies employed by the DBMS query processor for selects searching multiple attributes in the same table (file) when only one or both attributes are separately indexed? For example, searching for customers by last-name and zip-code when an index exists on only one attribute and when each attribute is individually indexed. Hint: Three strategies are given in the book, but the first two are identical. Select * from customer where (last_name = 'smith' and zip_code = '22334');
If only one of the two attributes is indexed, search for the records that match the indexed attribute value, and linearly search the identified records for the non-indexed attribute value. For example, if only zip-code is indexed, 1) search for the given zip-code and 2) linearly search the zip-code result set for records with the given last-name. If both attributes are individually indexed (both zip-code and last-name have separate indexes), then perform two searches and return the intersection of both result sets.
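Both strategies can be mimicked with plain Python in place of real index scans (the customer rows and ids are invented for illustration):

```python
# Sketch: index-then-filter vs intersect-two-result-sets.

customers = [
    {'id': 1, 'last_name': 'smith', 'zip_code': '22334'},
    {'id': 2, 'last_name': 'jones', 'zip_code': '22334'},
    {'id': 3, 'last_name': 'smith', 'zip_code': '99999'},
]

# Strategy 1: only zip_code is indexed -> "index" scan, then linear filter.
by_zip = [c for c in customers if c['zip_code'] == '22334']
hits1  = [c['id'] for c in by_zip if c['last_name'] == 'smith']

# Strategy 2: both attributes indexed -> intersect the two result sets.
zip_ids  = {c['id'] for c in customers if c['zip_code'] == '22334'}
name_ids = {c['id'] for c in customers if c['last_name'] == 'smith'}
hits2 = zip_ids & name_ids

print(hits1, hits2)  # [1] {1}
```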
3. How would this relation suffer from Update Anomalies? PURCHASE (PURCHASE_ID, PURCHASE_DATE, PURCHASE_AMT, CUSTOMER_ID, CUSTOMER_CITY, CUSTOMER_ZIPCODE) Note the relation's primary key is underlined.
If the customer were to change their address (city or zip-code), we would need to update all of the customer's purchases based on the customer_id attribute. And again, when we remove purchases, we are removing customer information.
Describe each of the Update Anomalies described in the section and the problems that arise when they occur in our schema. Use the broken relation EMP_DEPT in your answer.
Inserting Information: When a new record is to be inserted into EMP_DEPT, both employee and department information must be provided, or null attributes must be allowed. If we wish to support departments without employees, or employees without assigned departments, then all the table's attributes must be nullable. Deleting Information: When we remove all instances of employees from a department, all knowledge of the department is lost. It is generally unacceptable to lose information in a system. Updating Information: When an attribute of the repeated entity (DEPT) in the combined relation is to be changed, all rows containing that entity must be updated. If all instances (rows) are not updated, we will have an entity with two values describing a single property, e.g. updating the manager's ssn in a department.
3. What is the maximum number associated entities we can represent with a join table?
Join tables representing 3-way, 4-way, ... N-way associations between N entities are possible, but when more than three entities are involved the developer should consider refactoring the design.
In terms of increasing DBMS performance what are the indexing and caching techniques described in this section attempting to minimize?
Most of the performance overhead in maintaining a database is the time needed to execute the block I/O operations that read or write blocks of data to / from the drive.
Is this schedule Conflict Equivalent to the given serial schedule? Serial: R1(X); W1(X); R1(Y); W1(Y); R2(X); W2(X); 2. R1(X); R2(X); W1(X); R1(Y); W2(X); W1(Y); Hint: You will need to build a Precedence Graph for each schedule.
No, the serial schedule and Schedule 2 are not Conflict Equivalent. In Schedule 2, T2's R2(X) executes before T1's W1(X), while in the serial schedule W1(X) executes first; the conflicting operations are not executed in the same order.
The following questions refer to 'Primary Index on Ordered Data File's Unique Ordering Attributes'. 2. Can the Primary Index for this strategy (as described in the slides), be applied to both ordered and unordered data files? Explain your answer.
No. As the title implies, this strategy must be applied to an ordered data file. This is because the Primary Index indexes the anchor records of an ordered data file's blocks (it is a sparse index), and not individual data records (a dense index). Unordered data files require a dense index.
The following questions refer to 'Primary Index on Ordered Data File's Unique Ordering Attributes'. 3. Can we have two Primary Index on separate attributes in the same data file? Explain your answer.
No. This index type requires an ordered data file, i.e. a data file that is physically ordered by its ordering attribute. So we cannot have two indexes that rely on orderings of different attributes in the same file, i.e. we cannot have two physical orderings of records in a single data file.
3. What are the five operations recorded in the Transaction Log as described in the slides?
The operations are: a. [start_transaction, tid]: signals the start of a new transaction with a unique transaction id. b. [write_item, tid, X, oldVal, newVal]: transaction tid updated the value of X, overwriting the old value with the new value. c. [read_item, tid, X]: transaction tid read the value of X. d. [commit, tid]: transaction tid was committed. e. [abort, tid]: transaction tid was aborted.
1. Describe the two methods of inserting / maintaining records in data files.
Ordered (Sequential) Files: Records are physically maintained in an order based on an Ordering Field (Usually the table's primary key). Unordered (Heap) Files: Records are physically maintained in the order they were inserted into the database i.e. new records are appended to the end of file.
Describe the difference between Read Committed and Repeatable Read.
Read Committed guarantees that a transaction (Ti) only reads committed values, but that value may change during a transaction's execution because of commits made by other transactions during its execution. Repeatable Read guarantees that Ti sees the same committed value in place when Ti started and throughout its lifetime, even if commits by other transactions occur during its execution.
Informally describe the conditions to be met for a relation to be in First Normal Form.
Relations have no multivalued attributes or nested relations. In layman's terms, all attributes should maintain only a single atomic (indivisible) value.
3. The number of block reads using a single-level index against an un-ordered data file.
Search using a single-level index against an unordered data file: ⌈log2 bI⌉ + 1 = ⌈log2 117⌉ + 1 = 7 + 1 = 8 block reads. Note the +1 read retrieves the data file block containing the record.
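The arithmetic can be checked in a few lines. This sketch uses the section's figures (1112 data blocks, a 117-block index) and compares the indexed search against the linear scan of the unordered file:

```python
import math

# Worked numbers from this section: data file b = 1112 blocks,
# single-level index bI = 117 blocks.
b, bI = 1112, 117

linear_avg = b / 2                       # unordered scan, record exists (average)
linear_worst = b                         # unordered scan, record absent
indexed = math.ceil(math.log2(bI)) + 1   # binary-search the index, +1 data block read

print(linear_avg, linear_worst, indexed)   # 556.0 1112 8
```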
3. Why are serial schedules considered to be an undesirable approach to scheduling the execution of multiple transactions?
Serial schedules are undesirable because transactions are executed individually and do not overlap. Serial execution limits the system's ability to execute multiple transactions concurrently and degrades overall DBMS performance.
According to the slides, for a 'thing' in the problem domain to be considered an Entity class it must possess what three characteristics?
Symbol: The thing has a name. Intension: The thing represents something found in the problem domain. Intension means "the internal content of a concept." Extension: We can identify unique individual instances of the thing.
1. How does the B-Tree differ from the B+ Tree?
The B-Tree maintains both node pointers and <key-value, data record (block) pointer> pairs in every tree node; there is no difference between internal and leaf nodes. The B+-Tree maintains node pointers and key values in the internal nodes only; leaf nodes contain the <key-value, data-record-pointer> pairs. The B+-Tree also maintains the leaf node blocks as a linked list so that indexed records can be traversed (visited) in index order in linear time.
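The leaf-chaining property can be sketched as follows. This is an assumed, simplified structure (not a full B+-Tree implementation): it shows only why a range scan is linear time once the starting leaf is found, because the scan follows sibling pointers instead of re-descending the tree for each key.

```python
# Minimal sketch of B+-Tree leaf nodes chained into a linked list.

class Leaf:
    def __init__(self, entries):
        self.entries = entries   # sorted list of (key, record_pointer)
        self.next = None         # sibling pointer: the B+-Tree leaf chain

def range_scan(leaf, lo, hi):
    """Yield keys in [lo, hi], starting from the leaf containing lo."""
    while leaf is not None:
        for key, _ptr in leaf.entries:
            if key > hi:
                return           # past the range: stop
            if key >= lo:
                yield key
        leaf = leaf.next         # follow the chain, no tree re-descent

a = Leaf([(1, "r1"), (3, "r3")])
b = Leaf([(5, "r5"), (7, "r7")])
a.next = b
print(list(range_scan(a, 3, 5)))   # [3, 5]
```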
2. Explain how data is addressed in block devices.
The OS commands the device to retrieve or save data at specific block addresses, starting at one (the first block) through block N, where N is the capacity of the drive (in bytes) divided by the block size.
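The addressing arithmetic is simple enough to sketch. The drive capacity and block size below are assumptions chosen for illustration:

```python
# Block addressing sketch: the OS addresses the device by block number,
# and each block maps to a fixed byte offset on the drive.

capacity = 500 * 10**9   # 500 GB drive (assumed)
block_size = 4096        # 4 KB blocks (assumed)

n_blocks = capacity // block_size   # blocks are numbered 1..N

def byte_offset(block_no):
    """Byte offset of a block, with block 1 as the first block."""
    return (block_no - 1) * block_size

print(n_blocks)          # 122070312
print(byte_offset(2))    # 4096
```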
Each of these questions refer to ordered data files. 1. Provide a short description of a data block's "Anchor Record".
The anchor record is the first record of a data file block. The record contains the first indexing field value of the sequence of records stored on the data file block.
Each of these questions refer to ordered data files. 2. What is the relationship between the anchor record and the other records on same block?
The anchor record's index field value should be ≤ the ordering attribute field values of the other records on the same data block (assuming ascending order).
2. Describe why the following relation PURCHASE is not in third normal form. PURCHASE (PURCHASE_ID*, PURCHASE_DATE, PURCHASE_AMT, CUSTOMER_ID, CUSTOMER_CITY, CUSTOMER_ZIPCODE) Note the relation's primary key is *ed.
The attributes CUSTOMER_CITY and CUSTOMER_ZIPCODE are functionally dependent on CUSTOMER_ID, which is not the primary key. This is a transitive dependency (PURCHASE_ID → CUSTOMER_ID → CUSTOMER_CITY, CUSTOMER_ZIPCODE), which third normal form forbids.
If both an ordered data file and a single-level index file can be searched in log2(N) data file block reads... 1. Why is it more efficient to search for a record using the index?
The blocking factor of the index file is much greater than the data file so the index can be used to locate the target data file block in far fewer disk block reads (40x fewer reads in the example).
3. Describe the issue with building a WHERE clause (in a SELECT statement) querying for a single part of a complex attribute. Describe an example of this issue from your example above.
The disadvantage of composite attributes is that it is difficult to query for only the value of some component of the composite. For example, identifying the customers in Plano TX and not Plano IL.
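The Plano TX vs. Plano IL problem can be demonstrated concretely. This is a hypothetical illustration (table and column names are assumptions) using Python's sqlite3: when the whole address is one string, a WHERE clause on just the city must fall back to pattern matching and accidentally matches both cities unless the state component is also matched.

```python
import sqlite3

# Composite attribute stored as a single string: querying one component
# (the city) requires fragile LIKE pattern matching.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, address TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)", [
    (1, "12 Oak St, Plano, TX 75023"),
    (2, "9 Elm Ave, Plano, IL 61764"),
])

# Naive match on the city alone returns BOTH Plano TX and Plano IL:
both = conn.execute(
    "SELECT id FROM customer WHERE address LIKE '%Plano%'").fetchall()

# We must also pattern-match the state to isolate Plano TX:
tx_only = conn.execute(
    "SELECT id FROM customer WHERE address LIKE '%Plano, TX%'").fetchall()

print(both, tx_only)   # [(1,), (2,)] [(1,)]
```

With the recommended alternative (separate CITY and STATE columns) the query becomes a simple `WHERE city = 'Plano' AND state = 'TX'`.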
4. What are the sources of a join table's N foreign key attributes?
The join table requires N foreign key references to the primary keys of the entities involved in the N-way association being modeled. These are implemented in the join table as N foreign key references into the N entities' tables.
Describe the two problems that occur when two problem domain entities are combined into a single relation. For example, trying to maintain individual Employee and Department when both entities are combined into a single relation EMP_DEPT. Again, do not use this example. How do we decompose this relation to solve this problem?
The most important issues are: 1) It is difficult to maintain single instances of one entity (an employee or a department) when both are combined into a single relation. For example, if we want to create a new department, we need to pair it with an employee when one may not exist. 2) Likewise, it is difficult to create an instance of one entity when there is initially no association with the second entity. For example, creating a new employee who has not yet been assigned to a department. Also, if we move all employees out of a department, the system loses all knowledge of the department, which is likely not desired. Also important (but not to be graded): there is significant wasted storage space and duplication of information; we need nullable attributes to implement optional information, which can result in inefficient use of table space; and if we want to update employee- or department-specific attributes, we must update multiple rows in the combined relation. The solution is to decompose the combined relation into two relations (EMPLOYEE & DEPARTMENT) and use an FK to maintain the 1-M relationship, or a join table if it happens to be an M-N relationship.
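The decomposition can be sketched in SQL (run here via sqlite3; the table and column names are assumptions, not the exam's example): two relations joined by a nullable FK, so each entity can exist independently and deleting the last employee of a department no longer deletes the department.

```python
import sqlite3

# Decomposition sketch: EMPLOYEE and DEPARTMENT as separate relations
# with a nullable FK implementing the 1-M relationship.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (
    dept_id   INTEGER PRIMARY KEY,
    dept_name TEXT NOT NULL
);
CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    dept_id INTEGER REFERENCES department(dept_id)  -- NULL = unassigned
);
""")
conn.execute("INSERT INTO department VALUES (10, 'Research')")  # no employees yet
conn.execute("INSERT INTO employee VALUES (1, 'Ada', NULL)")    # no department yet
conn.execute("INSERT INTO employee VALUES (2, 'Grace', 10)")

# Moving the last employee out of Research no longer erases Research:
conn.execute("UPDATE employee SET dept_id = NULL WHERE emp_id = 2")
depts = conn.execute("SELECT dept_name FROM department").fetchall()
print(depts)   # [('Research',)]
```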
2. How do multi-level indexes reduce block reads?
The multi-level index is structured as a tree of disk blocks where each layer (level) in the tree indexes the complete range of data record index values. The first level (the leaves of the index tree) references the data file block addresses of the records. The second-level entries index the anchor (first) entries of the first-level index blocks, and so on. Block reads are reduced because, instead of a binary search of a single-level index, each level in the tree directly references the correct block in the next level until the data file block is identified.
If both an ordered data file and a single-level index file can be searched in log2(N) data file block reads... 2. What are some of the other advantages of utilizing an index? Hint: There are two advantages in the slides.
The number of I/O operations needed to maintain (insert, update, delete) an ordered index file's records is fewer than those needed to maintain an ordered data file's records. Multiple orderings (access paths) can be maintained for a single data file. An ordered file can only be efficiently searched on its single ordering attribute, but a data file can maintain several indexes, each of which represents an ordering attribute that can be searched in log2 time.
1. The number of block reads to search an un-ordered data file without an index.
The number of block reads to search an unordered data file without an index is 1112 / 2 = 556 on average if the record exists and 1112 if the record does not exist.
1. Informally describe the conditions to be met for a relation to be in Second Normal Form.
The relation is in 1st Normal Form and, where the relation's primary key is a compound key, all nonprime attributes must be fully functionally dependent on the entire key (not just part of it).
1. Informally describe the conditions to be met for a relation to be in Third Normal Form.
The relation is in 2nd Normal Form and: the relation has no non-key attributes that are functionally dependent on another non-key attribute. That is, there should be no transitive dependencies of a non-key attribute on the primary key.
2. What is the advantage of the sparse over the dense index? Hint: Blocking Factor
Because the sparse index indexes only the anchor records, it requires fewer index entries to cover every record in the data file, and so fewer disk blocks are needed to use and maintain it. That is, during a search, fewer disk read operations are needed to identify the data file block containing the target record, resulting in faster searches.
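The entry-count difference is easy to compute. The record counts and blocking factors below are assumptions for illustration: the dense index needs one entry per record, the sparse index one entry per data block (its anchor record).

```python
import math

# Sparse vs. dense index size sketch (all numbers assumed).
r = 30000       # records in the data file (assumed)
bfr_data = 27   # data file blocking factor (assumed)
bfr_index = 68  # index blocking factor (assumed)

data_blocks = math.ceil(r / bfr_data)

dense_entries = r              # one entry per record
sparse_entries = data_blocks   # one entry per block's anchor record

dense_blocks = math.ceil(dense_entries / bfr_index)
sparse_blocks = math.ceil(sparse_entries / bfr_index)
print(dense_blocks, sparse_blocks)   # 442 17
```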
1. Describe the purpose of the Transaction Log.
The transaction log is a file that records all of the changes to the database state across all transactions.
1. Describe the difference between Simple and Composite Attributes.
The value of a simple attribute is an atomic value i.e. the value cannot be divided into two or more units. An integer attribute is likely a good example of a simple attribute. The value of a composite attribute can be divided into useful sub-units of different types. The disadvantage of composite attributes is that it is difficult to query for only the value of some component of the composite e.g. identifying the customers in Plano TX and not Plano IL.
Describe the suggested method of eliminating the use of a multi-valued attribute to represent the employee's three favorite colors. diagram in assessment 9
This is another example of the use of a weak entity to implement a 1-M relationship between Employee and zero or more color choices. That is, a new table Fav_Col with two attributes: ESsn (an FK to EMPLOYEE Ssn) and Color. Note for the exam you may be asked to describe the new weak entity table.
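The weak entity table can be sketched in SQL (run here via sqlite3). The Fav_Col names follow the answer above; the EMPLOYEE columns are assumptions. Note the weak entity's primary key borrows the FK, combined with the color itself:

```python
import sqlite3

# Weak entity sketch: Fav_Col replaces a multi-valued color attribute.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (
    ssn  TEXT PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE fav_col (
    essn  TEXT NOT NULL REFERENCES employee(ssn),
    color TEXT NOT NULL,
    PRIMARY KEY (essn, color)   -- weak entity: key includes the FK
);
""")
conn.execute("INSERT INTO employee VALUES ('111', 'Ada')")
conn.executemany("INSERT INTO fav_col VALUES (?, ?)",
                 [("111", "red"), ("111", "green"), ("111", "blue")])

colors = [c for (c,) in conn.execute(
    "SELECT color FROM fav_col WHERE essn = '111' ORDER BY color")]
print(colors)   # ['blue', 'green', 'red']
```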
Identify each of the following transactions as representing a Lost Update, Dirty Read, or Incorrect Summary. There is one of each. Diagram in the assessment 12 document.
Transactions: 1. Dirty Read 2. Lost Update 3. Incorrect Summary For Exam: You may be asked to identify a schedule as shown above or asked to provide a brief description of its problem i.e. what is a lost update, dirty read, etc.
2. What is an advantage and disadvantage of each method? (unspanned vs spanned)
Unspanned: Advantage: Easy to index / identify the location of a specific record in the file. Disadvantage: Produces unused space in each disk block. Spanned: Advantage: Eliminates unused space in each disk block. Disadvantage: Records do not start at fixed location in each block.
2. What problems are we trying to prevent when we serially execute the transactions in a schedule i.e. serial schedules?
We are trying to avoid the execution of conflicts (conflicting operations) between two or more transactions that can cause one of the four update anomalies described in the materials i.e. lost update, dirty read, incorrect summary, or unrepeatable read.
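The notion of a conflict can be sketched directly. This is an assumed representation (not from the slides): two operations conflict when they belong to different transactions, access the same item, and at least one is a write.

```python
# Conflict detection sketch over a schedule of (tid, action, item) tuples.

def conflicts(op1, op2):
    (t1, a1, i1), (t2, a2, i2) = op1, op2
    # Different transactions, same item, and at least one write.
    return t1 != t2 and i1 == i2 and "w" in (a1, a2)

schedule = [
    ("T1", "r", "X"),
    ("T2", "w", "X"),   # conflicts with T1's read of X
    ("T1", "w", "X"),   # conflicts with T2's write of X
    ("T2", "r", "Y"),   # no conflict: different item
]

pairs = [(i, j) for i in range(len(schedule))
         for j in range(i + 1, len(schedule))
         if conflicts(schedule[i], schedule[j])]
print(pairs)   # [(0, 1), (1, 2)]
```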
There are two alternative methods of representing an employee's optional (single) phone number. The first is to use a nullable attribute and the second is to use a weak entity. Your answers should consider the performance tradeoff between number of database operations per query and record's size.
We can either assign OFFICE_PNUM as a nullable attribute of EMPLOYEE or create a weak entity maintaining the Employee primary key and the office number. The tradeoff: the nullable attribute requires no extra database operation (the number is retrieved with the EMPLOYEE record) but enlarges every record even when the value is NULL; the weak entity keeps EMPLOYEE records smaller but requires an additional operation (a join or second query) to retrieve the number.
Each of these questions refer to ordered data files. 4. In terms of a block's anchor record value relative to its adjacent blocks, what data file block might the record containing a searched-for field value X be found?
When searching for a specific index field value X, the record will be located on data file block i where ARi ≤ X < ARi+1.
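The ARi ≤ X < ARi+1 rule amounts to a binary search over the anchor record values. The anchor values below are assumptions for illustration:

```python
import bisect

# Locate the data block whose anchor-record range covers search value X.
anchors = [5, 40, 82, 130, 275]   # anchor records of blocks 0..4, ascending

def block_for(x):
    """Return index i of the block where AR_i <= x < AR_(i+1), or None."""
    if x < anchors[0]:
        return None               # smaller than every anchor: not in the file
    return bisect.bisect_right(anchors, x) - 1

print(block_for(82))    # 2  (AR_2 = 82 <= 82 < AR_3 = 130)
print(block_for(100))   # 2
print(block_for(3))     # None
```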
What is the relationship between normalization and cohesion in software engineering? Hint: How do we fix a class with poor cohesion?
In much the same manner as correcting an entity class that lacks cohesion in OOAD, a relation that contains functional dependencies not covered by the relation's primary key can be decomposed into two or more relations, each of which has its own primary key and is in second and third normal form.
NOTE: Review the ER diagram notation used to describe a join table (as shown in the slides) and be prepared to present a join between two tables for the exam