CS 448 Final


What is multivalued dependency? When does it arise?

A multivalued dependency (MVD) is a full constraint between two sets of attributes in a relation: a tuple cannot hold a set of values for an attribute, so each combination of values must appear as its own tuple in the relation. It arises when a relation has constraints that cannot be specified as functional dependencies, typically when two or more independent multivalued attributes of the same entity are stored in one relation, forcing every combination of their values to be repeated as separate tuples.

What is a predicate lock?

Locks access to all records that satisfy an arbitrary predicate (condition)

When a failure occurs, the log is consulted and each operation is either undone or redone. This is a problem because a. searching the entire log is time consuming b. many redos are unnecessary c. both (a) and (b) d. none of the above

c

What is a grid file? What are its advantages and disadvantages?

A grid file constructs a grid array with one linear scale (dimension) for each of the search attributes, so a record can be located from the values of several attributes at once. Advantages: good for multi-attribute and range queries. Disadvantages: the grid directory adds space overhead and may need to be reorganized (scales split or merged) as the file grows or the data distribution becomes skewed.

There is a possibility of a cascading rollback when a. a transaction writes items that have been written only by a committed transaction b. a transaction writes an item that is previously written by an uncommitted transaction c. a transaction reads an item that is previously written by an uncommitted transaction d. both (b) and (c)

c

What is bitmap indexing? Create a relation with two columns and sixteen tuples and show an example of a bitmap index on one or both.

Bitmap indexing is used for relations that contain a large number of rows. It creates an index for one or more columns, and each value or value range in those columns is indexed. A bitmap index is built on one particular value of a particular field and is just an array of bits (0s and 1s); for a given field, one separate bitmap is maintained for each distinct value of that field in the database.
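
The question also asks for a concrete example. Below is a minimal Python sketch (not tied to any DBMS), assuming a hypothetical STUDENT(Id, Dept) relation with sixteen tuples, that builds one bitmap per distinct Dept value:

# Hypothetical relation with two columns and sixteen tuples: (Id, Dept)
depts = ["CS", "EE", "CS", "ME", "CS", "EE", "ME", "CS",
         "EE", "CS", "ME", "EE", "CS", "ME", "EE", "CS"]
rows = [(i + 1, d) for i, d in enumerate(depts)]

# Bitmap index on Dept: one bit array per distinct value,
# where bit i is 1 iff tuple i has that value.
bitmaps = {}
for i, (_, dept) in enumerate(rows):
    bitmaps.setdefault(dept, [0] * len(rows))[i] = 1

for dept, bits in sorted(bitmaps.items()):
    print(dept, "".join(map(str, bits)))
# CS 1010100101001001
# EE 0100010010010010
# ME 0001001000100100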

Spark

Unified analytics engine for large-scale data processing

big data

volume = size of data, velocity = rate of data, variety = types of data, veracity = quality of data

What are some of the commonly used buffer replacement strategies?

- least recently used (LRU) - clock policy - first in first out (FIFO)
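
A minimal Python sketch of the least recently used policy, assuming a hypothetical fixed-size buffer pool keyed by block id (not the API of any particular DBMS):

from collections import OrderedDict

class LRUBufferPool:
    # Keeps at most `capacity` blocks; evicts the least recently used one.
    def __init__(self, capacity):
        self.capacity = capacity
        self.frames = OrderedDict()            # block_id -> block contents

    def get(self, block_id, read_from_disk):
        if block_id in self.frames:
            self.frames.move_to_end(block_id)  # hit: mark as most recently used
            return self.frames[block_id]
        if len(self.frames) >= self.capacity:
            self.frames.popitem(last=False)    # miss: evict least recently used
        self.frames[block_id] = read_from_disk(block_id)
        return self.frames[block_id]

pool = LRUBufferPool(capacity=3)
for b in [1, 2, 3, 1, 4]:                      # block 2 is evicted when 4 arrives
    pool.get(b, read_from_disk=lambda bid: f"data of block {bid}")
print(list(pool.frames))                       # [3, 1, 4]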

What is the system log used for? What are the typical kinds of records in a system log?

The system log keeps track of all transaction operations that affect the values of database items, as well as other transaction information that may be needed to permit recovery from failures. Records: start transaction, read item, write item, commit transaction, abort transaction

How does double buffering improve block access time?

With double buffers, when the CPU is processing the current block of data in buffer 1, it can also retrieve the next block into buffer 2 at the same time. When the processing (of buffer 1) is done, the CPU can then move on to the next block (in buffer 2) immediately without waiting.
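
A minimal Python sketch of the idea, assuming a hypothetical read_block() routine (simulated I/O) and a background thread that fills the second buffer while the first is being processed:

import concurrent.futures, time

def read_block(i):                  # hypothetical disk read (simulated latency)
    time.sleep(0.01)
    return f"block {i}"

def process(block):                 # hypothetical CPU work on one block
    time.sleep(0.01)

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as io:
    pending = io.submit(read_block, 0)        # fill buffer 1
    for i in range(1, 10):
        current = pending.result()            # previous block is now in memory
        pending = io.submit(read_block, i)    # start filling the other buffer
        process(current)                      # CPU works while the next read proceeds
    process(pending.result())                 # process the last block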

Using a log-based recovery scheme might improve performance as well as provide a recovery mechanism by a. writing the log records to disk when each transaction commits b. writing the appropriate log records to disk during the transaction's execution c. waiting to write the log records until multiple transactions commit and writing them as a batch d. never writing the log records to disk

c

What are some variations of the two-phase locking protocol? Why is strict or rigorous two-phase locking often preferred?

strict 2PL: - a transaction T does not release any of its exclusive (write) locks until after it commits or aborts, hence, no other transaction can read or write an item that is written by T unless T has committed - not deadlock free - guarantees strict schedules for recoverability rigorous 2PL: - a transaction T does not release any of its locks (exclusive or shared) until after it commits or aborts - more restrictive variation of strict 2PL - guarantees strict schedules conservative/static 2PL: - requires a transaction to lock all the items it accesses before the transaction begins execution by predeclaring its read-set and write-set - if any of the predeclared items needed cannot be locked, the transaction does not lock any item; instead, it waits until all the items are available for locking - deadlock free, however it is difficult to use in practice because of the need to predeclare the read-set and write-set, which is not possible in some situations Most popular variation of 2PL is strict, however rigorous is easier to implement than strict.

block anchor

the first record in each block of the data file is the block anchor of the block

What is a timestamp? How does the system generate timestamps?

timestamp = a unique identifier created by the DBMS to identify a transaction; in timestamp-ordering concurrency control, deadlocks cannot occur because transactions never wait for locks. Timestamp values are assigned in the order in which the transactions are submitted to the system, so a timestamp can be thought of as the transaction start time. There are different ways of generating timestamps: - use a counter that is incremented each time its value is assigned to a transaction; the counter must periodically be reset to zero when no transactions are executing for some short period of time - use the current date/time value of the system clock and ensure that no two timestamp values are generated during the same tick of the clock
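
A minimal Python sketch of the two generation schemes described above (a hypothetical counter and the system clock), not tied to any particular DBMS:

import itertools, time

_counter = itertools.count(1)

def next_ts_counter():
    # Counter-based timestamp: strictly increasing integers.
    return next(_counter)

_last_clock_ts = 0.0

def next_ts_clock():
    # Clock-based timestamp: make sure no two transactions share a clock tick.
    global _last_clock_ts
    ts = time.time()
    if ts <= _last_clock_ts:            # same tick as the previous transaction
        ts = _last_clock_ts + 1e-6      # nudge forward so timestamps stay unique
    _last_clock_ts = ts
    return ts

print(next_ts_counter(), next_ts_counter())    # 1 2
print(next_ts_clock() < next_ts_clock())       # True: strictly increasing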

Spark vs MapReduce

More efficient: 100x on smaller jobs to 3x on large jobs - Caches data in RAM instead of disk - Faster startup, better CPU utilization - Richer functional programming - Specially suited for iterative algorithms

primary key field

A primary key field is the ordering key field of an ordered file: an ordering field whose value uniquely identifies each record. A primary index is built on the primary key field of the file.

Why are normal forms alone insufficient as a condition for a good schema design?

Normal forms alone are insufficient as a condition for a good schema design because they do not take into account other aspects of the database design such as data integrity, performance, and user requirements. The resulting relations must collectively satisfy these two additional properties: Lossless join property and dependency preservation, in order to qualify as a good design.

What are the reasons for having variable-length records? What types of separator characters are needed for each?

Variable-length records are used when (1) one or more fields are of variable length (e.g., variable-length strings), (2) one or more fields may have multiple values for a record (a repeating field or group), (3) some fields are optional, or (4) the file contains records of different record types. Separator characters are needed to terminate variable-length fields, to separate the repeated values of a repeating field, and to mark the end of each record; a distinct separator character (one not appearing in the data) is used for each purpose.

Draw a state diagram and discuss the typical states that a transaction goes through during execution.

1. active = the initial state; execution of a transaction starts in the active state and the transaction remains there while its read and write operations are in progress 2. partially committed = when the last operation of a transaction has executed it enters the partially committed state; here the transaction may still be aborted, or else it goes to the committed state 3. failed = a transaction goes to the failed state if it is determined that it can no longer proceed with its normal execution, for example if one of the checks fails or if the transaction is aborted during its active state; it may be restarted later 4. aborted = a failed transaction that has been rolled back is in the aborted state; at this point the system has two options: 1) restart the transaction (a restarted transaction is considered a new transaction, which may succeed after a transient failure) or 2) kill the transaction 5. committed = the transaction reaches this state after completing successfully; its effects are now guaranteed to persist 6. terminated = the transaction leaves the system and its information is removed; a transaction is terminated once it has either committed or aborted

What undesirable dependencies are avoided when a relation is in 2NF?

2NF removes all partial dependencies of nonprime attributes on the key and ensures that all nonprime attributes are fully functionally dependent on the key of R.

What undesirable dependencies are avoided when a relation is in 3NF?

3NF removes all transitive dependencies on the key of R and ensures that no nonprime attribute is transitively dependent on the key.

Discuss insertion, deletion, and modification anomalies. Why are they considered bad? Illustrate with examples.

A database anomaly is a data inconsistency that results from an operation, caused by too much redundancy in the database's information. insertion anomalies = An insertion anomaly is the inability to add data to the database due to the absence of other data. ex) Cannot insert a project unless an employee is assigned to it. deletion anomalies = A deletion anomaly is the unintended loss of data due to the deletion of other data. ex) When a project is deleted, the information about every employee who works only on that project is lost as well. modification anomalies = A modification (update) anomaly occurs when the same fact is stored redundantly in many tuples, so an update must be applied to every copy; if any copy is missed, the database becomes inconsistent. ex) Changing a project's name requires updating it in every tuple for every employee assigned to that project. Anomalies are considered bad because they produce incomplete and inconsistent data, and improper insertion, deletion, or update operations violate the integrity properties of the database. Normalization removes these anomalies.

What is the lossless (or nonadditive) join property of a decomposition? Why is it important?

A decomposition D = {R1, R2, ..., Rm} of R has the lossless (nonadditive) join property with respect to the set of dependencies F on R if, for every relation state r of R that satisfies F, the following holds, where * is the natural join of all the relations in D: * (projection R1(r), ..., projection Rm(r)) = r if R1 intersect R2 determines R2 - R1 is in F+ or if R1 intersect R2 determines R1 - R2 is in F+ then the decomposition is lossless join The word loss in lossless refers to loss of information, not to loss of tuples. In fact, for "loss of information" a better term is "addition of spurious information". It is important because it shows the problem of spurious tuples. Lossless join property avoids the problem of spurious tuples that contain the wrong information.
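
A minimal Python sketch of the binary test stated above, using attribute closure to check whether R1 ∩ R2 functionally determines R1 − R2 or R2 − R1 (the schema and FDs are hypothetical):

def closure(attrs, fds):
    # Closure X+ of a set of attributes under a list of FDs given as (lhs, rhs) strings.
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def lossless_binary(r1, r2, fds):
    common = set(r1) & set(r2)
    c = closure(common, fds)
    return (set(r1) - set(r2)) <= c or (set(r2) - set(r1)) <= c

# Example: R(A, B, C) with A -> B and B -> C, decomposed two different ways
fds = [("A", "B"), ("B", "C")]
print(lossless_binary("AB", "BC", fds))   # True: R1 intersect R2 = B and B -> C, so the join is lossless
print(lossless_binary("AC", "BC", fds))   # False: the common attribute C determines nothing else (spurious tuples)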

What is the dependency preservation property for a decomposition? Why is it important?

A decomposition D = {R1, R2, ..., Rm} of R is dependency preserving with respect to F if the union of the projections of F on each Ri in D is equivalent to F. It is important because: 1) Can check that updates to the database do not result in illegal relations being created. 2) Can check updates without having to compute natural joins. 3) Each dependency in F represents a constraint on the database. 4) It is always possible to find a dependency preserving decomposition D with respect to F such that each relation Ri in D is in 3NF

What is a functional dependency? What are the possible sources of the information that defines the functional dependencies that hold among the attributes of a relation schema?

A functional dependency (FD) is a relationship between two sets of attributes in a relation schema, where the values of one set of attributes (the determinant set) uniquely determine the values of another set of attributes (the dependent set). A set of attributes X functionally determines a set of attributes Y if the value of X determines a unique value for Y. The information that defines the FDs comes from the semantics of the attributes: the designer's understanding of the meaning of the attributes and the real-world rules and constraints of the application that must hold among them. FDs cannot be inferred automatically from a given relation state; a particular state can only be used to disprove a proposed FD.

What is a minimal set of functional dependencies? Does every set of dependencies have a minimal equivalent set? Is it always unique?

A minimal cover of a set of functional dependencies E is a set of functional dependencies F that satisfies the property that every dependency in E is in the closure F+ of F. A set of FDs F is minimal if it satisfies the following conditions: 1) Every dependency in F has a single attribute for its RHS. 2) We cannot remove any dependency from F and still have a set of dependencies that is equivalent to F. 3) We cannot replace any dependency X determines A in F with a dependency Y determines A, where Y is a proper subset of X, and still have a set of dependencies that is equivalent to F. How to find a minimal set: 1) split the right-hand side of every FD into single attributes 2) remove all redundant FDs 3) find each extraneous attribute and remove it - for AB->C, either A or B or neither can be extraneous - if the closure of A contains B, then B is extraneous and can be removed - if the closure of B contains A, then A is extraneous and can be removed Every set of functional dependencies has a minimal equivalent set, but it is not necessarily unique.

index locking

A technique used to maintain index integrity. A portion of an index is locked during a database transaction when that portion is being accessed by the transaction as a result of an attempt to access related user data.

What is meant by the concurrent execution of database transactions in a multiuser system? Discuss why concurrency control is needed, and give informal examples.

Concurrent execution of database transactions in a multi-user system means that any number of users can use the same database at the same time. Concurrency control is needed in order to avoid inconsistencies in the database. Ex: Alice and Bob share an account with $500 Alice gets initial amount (x = $500) Bob gets initial amount (x = $500) Alice deposits $100 (x + 100) = $600 Bob withdraws $200 (x - 200) = $300 Alice saves the new balance ($600) Bob saves the new balance ($300) The result should have been 400
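
A minimal Python sketch that replays the interleaving above deterministically (same balance and amounts as in the example), showing the lost update:

balance = 500                 # shared account

alice_view = balance          # Alice reads x = 500
bob_view = balance            # Bob reads x = 500, before Alice writes back

alice_view += 100             # Alice deposits 100  -> 600
bob_view -= 200               # Bob withdraws 200   -> 300

balance = alice_view          # Alice saves 600
balance = bob_view            # Bob saves 300, overwriting Alice's deposit

print(balance)                # 300, but the correct result is 400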

What does the term unnormalized relation refer to? How did the normal forms develop historically from first normal form up to Boyce-Codd normal form?

An unnormalized relation is one that does not meet the condition of any normal form. The normalization process, first proposed by Codd (1972), takes a relation schema through a series of tests to certify whether it satisfies a certain normal form. The process proceeds in a top-down fashion by evaluating each relation against the criteria for the normal forms and decomposing relations as necessary, and can thus be considered relational design by analysis. Initially Codd proposed three normal forms: 1NF, 2NF, and 3NF. A stronger definition of 3NF, called Boyce-Codd normal form (BCNF), was proposed later by Boyce and Codd. All these normal forms are based on a single analytical tool: the functional dependencies among the attributes of a relation. 1NF requires that every attribute's domain contain only atomic values and that no attribute value be a set of values. 2NF removes all partial dependencies of nonprime attributes on the key and ensures that all nonprime attributes are fully functionally dependent on the key of R. 3NF removes all transitive dependencies on the key of R and ensures that no nonprime attribute is transitively dependent on the key.

Why is accessing a disk block expensive? Discuss the time components involved in accessing a disk block.

Accessing a disk block is expensive because it involves mechanical movement of the read/write heads and waiting for the disk to rotate, which is orders of magnitude slower than accessing main memory. Seek time - the time it takes to move the heads to the appropriate cylinder; the greater the seek distance, the greater the time. Rotational latency - the time it takes for the required block to rotate around and come under the disk heads. Block transfer time - the time to transfer the block's data, which involves two components: the speed at which the bits pass under the heads as the disk rotates, and the speed of the interface between the disk and the rest of the computer; the slower of the two limits overall speed for large block transfers.

How does a B-tree differ from a B+-tree? Why is a B+-tree usually preferred as an access structure to a data file?

B-tree: - a B-tree has data pointers in both internal and leaf nodes - used to implement a dynamic multilevel index - has additional constraints that ensure the tree is always balanced and that the space wasted by deletion never becomes excessive - each node is at least half full, and each node in a B-tree of order p can have at most p - 1 search values B+-tree: - a B+-tree has only tree pointers in internal nodes; all data pointers are in leaf nodes - used to implement a dynamic multilevel index - because data pointers are stored only at the leaf nodes, the structure of leaf nodes differs from the structure of internal nodes - leaf nodes have an entry for every value of the search field, along with a data pointer to the record if the search field is a key field A B+-tree is usually preferred because its internal nodes hold only search values and tree pointers, so more entries fit per internal node; this gives a higher fan-out and fewer levels, improving search time. In addition, the leaf nodes are linked by next pointers, so the entire file can be traversed in order.

Define Boyce-Codd normal form. How does it differ from 3NF? Why is it considered a stronger form of 3NF?

BCNF is a stronger form of normalization than 3NF because it eliminates the second condition for 3NF, which allowed the right side of the FD to be a prime attribute. Thus, every left side of an FD in a table must be a superkey. Every table that is BCNF is also 3NF, 2NF, and 1NF, by the previous definitions. Stronger than 3NF, but if it's in this form, it's also in 3NF. Does not allow non-key attributes as determinants. A determinant can be a composite or single attribute.

What are the before image (BFIM) and after image (AFIM) of a data item? What is the difference between in-place updating and shadowing, with respect to their handling of BFIM and AFIM?

BFIM = the old value of the data item before updating AFIM = the new value after updating If shadowing is used, both the BFIM and the AFIM can be kept on disk; hence, it is not strictly necessary to maintain a log for recovering. No undo/no redo for recovery. When in-place updating is used it is necessary to use a log for recovery. The recovery mechanism must ensure that the BFIM of the data item is recorded in the appropriate log entry and that the log entry is flushed to disk before the BFIM is overwritten with the AFIM in the database on disk (write-ahead logging). in-place updating: - writes the buffer to the same original disk location, thus overwriting the old value of any changed data items on disk - a single copy of each database disk block is maintained shadowing: - writes an updated buffer at a different disk location, so multiple versions of data items can be maintained, but this approach is not typically used in practice - for pages updated by the transaction, two versions are kept, the old version is referenced by the shadow directory and the new version by the current directory.

What is the two-phase locking protocol? How does it guarantee serializability?

Basic two-phase locking protocol = a transaction's locking operations all precede its first unlock operation (an expanding phase followed by a shrinking phase). Two-phase locking can cause deadlocks, but it guarantees serializability; it is sufficient, not necessary, for serializability. * if lock conversion is allowed, upgrading of locks must be done during the expanding phase and downgrading of locks must be done during the shrinking phase It can be proved that, if every transaction in a schedule follows the two-phase locking protocol, the schedule is guaranteed to be serializable, obviating the need to test the serializability of schedules. The locking protocol, by enforcing the two-phase locking rules, also enforces serializability.
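
A minimal Python sketch of the two-phase rule itself: a hypothetical transaction object that refuses to acquire any lock after its first unlock (lock compatibility and the lock manager are omitted):

class TwoPhaseTransaction:
    # Enforces: all lock operations precede the first unlock (growing, then shrinking).
    def __init__(self, name):
        self.name = name
        self.locks = set()
        self.shrinking = False          # becomes True after the first unlock

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError(f"{self.name}: lock({item}) after an unlock violates 2PL")
        self.locks.add(item)

    def unlock(self, item):
        self.shrinking = True
        self.locks.discard(item)

t = TwoPhaseTransaction("T1")
t.lock("X"); t.lock("Y")                # growing phase
t.unlock("X")                           # shrinking phase begins
# t.lock("Z")                           # would raise: no new locks after the first unlock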

What are checkpoints, and why are they important? What are transaction commit points, and why are they important?

Checkpoints are a type of entry written into the log periodically at that point when the system writes out to the database on disk all DBMS buffers that have been modified. All actions are suspended in the transactions temporarily. The interval at which to take checkpoints must be decided. The interval may be measured in time or in the number of committed transactions since the last checkpoint. - Important because all transactions that have their commit entries in the log before a checkpoint entry do not need to have their WRITE operations redone in case of a system crash, since all their updates will be recorded in the database on disk during checkpointing. - Help to limit the storage space that transaction logs are taking up and the physical space that keeping track of updates takes up. A transaction T reaches its commit point when all its operations that access the database have been executed successfully and the effect of all the transaction operations on the database have been recorded in the log.

Between the properties of dependency preservation and losslessness, which one must definitely be satisfied? Why?

Losslessness must always be satisfied; without it the decomposition loses information and spurious tuples appear when the relations are joined back. Dependency preservation is desirable but can be sacrificed: a BCNF decomposition can always be made lossless but may not preserve all dependencies, whereas a lossless, dependency-preserving decomposition into 3NF always exists.

How are the OUTER JOIN operations different from the INNER JOIN operations? How is the OUTER UNION operation different from UNION?

Consider two relations R and S. When a user wants to keep all the tuples in R, or all those in S, or all those in both relations in the result of the JOIN, regardless of whether or not they have matching tuples in the other relation, a set of operations called outer joins is used. This satisfies the need of queries in which tuples from two tables are to be combined by matching corresponding rows, but without losing any tuples for lack of matching values. When only matching tuples (based on the join condition) appear in the result, the join is an INNER JOIN (EQUIJOIN and NATURAL JOIN). UNION requires the two relations to be union compatible; OUTER UNION can be applied to two relations that are only partially compatible (share only some attributes): tuples are matched on the common attributes, and attributes that appear in only one relation are kept in the result, padded with NULLs for tuples coming from the other relation.

What are the components of a disk block address?

A disk block's hardware address consists of a cylinder number, a track (surface) number within the cylinder, and a block (sector) number within the track. Many systems instead expose a single logical block address (LBA), a number that the disk drive's controller maps to the corresponding physical address.

What is the difference between a file organization and an access method?

File Access Method = the way by which information/data can be retrieved. There are two methods of file access: - Direct Access - Sequential Access File Organization Method = the process that determines how data/information is stored so that file access can be as easy and quick as possible. Three main ways of file organization: - Sequential - Index-Sequential - Random

Define first, second, and third normal forms when only primary keys are considered. How do the general definitions of 2NF and 3NF, which consider all keys of a relation, differ from those that consider only primary keys?

First Normal Form (1NF): The relation has a Primary Key, which uniquely identifies each row in the relation. Every column stores Atomic Values, and there are no Repeating Groups. Atomic means the column only stores one thing; columns storing comma-separated values are non-atomic. - all attributes depend on the key Second Normal Form (2NF): The relation meets 1NF, and there are No Partial Functional Dependencies. That is, each non-key attribute is Functionally Dependent on the full Primary Key. For a Partial Functional Dependency to occur, the Primary Key must be a Composite Primary Key (a Primary Key that consists of 2 or more columns). - all attributes depend on the whole key Third Normal Form (3NF): The relation meets 2NF, and there are No Transitive Functional Dependencies. A Transitive Functional Dependency is when a non-key column is Functionally Dependent on another non-key column, which is Functionally Dependent on the Primary Key. - all attributes depend on nothing but the key The general definitions of 2NF and 3NF consider all candidate keys, not just the primary key: general 2NF requires every nonprime attribute to be fully functionally dependent on every key of the relation, and general 3NF requires that for every nontrivial functional dependency X -> A, either X is a superkey or A is a prime attribute (a member of some candidate key).

force/no force

Force approach = if all pages updated by a transaction are immediately written to disk before the transaction commits - REDO will never be needed during recovery, since any committed transaction will have all its updates on disk before it is committed No force approach = all pages updated by a transaction are not immediately written to disk before the transaction commits - an updated page of a committed transaction may still be in the buffer when another transaction needs to update it, thus eliminating the I/O cost to write that page multiple times to disk and possibly having to read it again from disk

How does the granularity of data items affect the performance of concurrency control? What factors affect selection of granularity size for data items?

Granularity = size of data items The larger the data item size is, the lower the degree of concurrency permitted. The smaller the data item size is, the more the number of items in the database. More lock and unlock operations will be performed, causing a higher overhead. More storage space will be required for the lock table. Factors: depends on the types of transactions involved - If a typical transaction accesses a small number of records, it is advantageous to have the data item granularity be one record - If a transaction typically accesses many records in the same file, it may be better to have block or file granularity so that the transaction will consider all those records as one (or a few) data items

Hadoop

HDFS = Distributed file system designed to run on commodity hardware MapReduce = Software framework for processing large data sets in a distributed computing environment

cascade rollback

If a transaction T is rolled back, any transaction S that has, in the interim, read the value of some data item X written by T must also be rolled back. Similarly, once S is rolled back, any transaction R that has read the value of some data item Y written by S must also be rolled back; and so on. * only write operations need to be undone A schedule avoids cascade aborts if there are no dirty reads.

How do optimistic concurrency control techniques differ from other concurrency control techniques? Why are they also called validation or certification techniques? Discuss the typical phases of an optimistic concurrency control method.

In optimistic concurrency control techniques, also known as validation or certification techniques, no checking is done while the transaction is executing. This lowers the overhead during transaction execution, with the effect of speeding up the transactions. The techniques are called optimistic because they assume that little interference will occur and hence that most transactions will be validated successfully, so there is no need to do checking during transaction execution. Phases: 1) read phase = a transaction can read values of committed data items from the database; however, updates are applied only to local copies (versions) of the data items kept in the transaction workspace 2) validation phase = checking is performed to ensure that serializability will not be violated if the transaction updates are applied to the database 3) write phase = if the validation phase is successful, the transaction updates are applied to the database; otherwise, the updates are discarded and the transaction is restarted

Discuss the various types of inner join operations. Why is theta join required?

Inner join is used to return rows from both tables that satisfy the given condition. It is the most widely used join operation and can be considered the default join type. Theta join combines two tables based on a condition represented by theta and works with any comparison operator; the general case of the JOIN operation is called a theta join, and it is required because the other inner joins are special cases of it, allowing tuples to be matched on conditions other than equality. EQUIJOIN is a theta join that uses only the equality comparison operator. Natural join does not use an explicit comparison operator; equality on the common attributes is implied. In a natural join the matching attributes should have the same name and domain, and there should be at least one common attribute between the two relations. (Join in general is one of the most expensive operations to implement efficiently in an RDBMS, which is one reason relational systems can have performance problems.)

Discuss the advantages and disadvantages of using (a) an unordered file, (b) an ordered file, and (c) a static hash file with buckets and chaining. Which operations can be performed efficiently on each of these organizations, and which operations are expensive?

Ordered (Sequential) Files: Records are physically maintained in an order based on an Ordering Field (Usually the table's primary key). - Advantage: The data can be searched in Log2 time using a binary search. Records can be scanned very quickly when stored on adjacent disk blocks. - Disadvantage: An empty slot must be made in the data when inserting a new record into the 'middle' of the file. Likewise, empty record slots must be 'filled in' with upstream records to fill the hole created by deleted records. Unordered (Heap) Files: Records are physically maintained in the order they were inserted into the database. - Advantage: New records are appended to the end of the file so insertion is very efficient. - Disadvantage: Searching the file for a specific record requires a scan of the entire file's contents (unless an index is applied). Static (hash) files: - advantages: Performance is exceptional for smaller databases. Aids in Storage management. Hash key values help in faster access to the storage addresses. The Primary key values can be used in the place of the hash value. - disadvantages: It cannot work efficiently with the databases that can be scaled. It is not a good option for large-size databases. Bucket overflow issue occurs if there is more data and less memory.

A file has r = 20,000 STUDENT records of fixed length. Each record has the following fields: Name (30 bytes), Ssn (9 bytes), Address (40 bytes), PHONE (10 bytes), Birth_date (8 bytes), Sex (1 byte), Major_dept_code (4 bytes), Minor_dept_code (4 bytes), Class_code (4 bytes, integer), and Degree_program (3 bytes). An additional byte is used as a deletion marker. The file is stored on the disk whose parameters are given in Exercise 16.27. a. Calculate the record size R in bytes. b. Calculate the blocking factor bfr and the number of file blocks b, assuming an unspanned organization. c. Calculate the average time it takes to find a record by doing a linear search on the file if (i) the file blocks are stored contiguously, and double buffering is used; (ii) the file blocks are not stored contiguously. d. Assume that the file is ordered by Ssn; by doing a binary search, calculate the time it takes to search for a record given its Ssn value.

PAGE 605 AND 606
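
The disk parameters for this exercise (Exercise 16.27) are not reproduced in this set; a hedged Python sketch for parts (a) and (b), assuming the block size B = 2400 bytes used in the next card, would be:

import math

# a. record size R: sum of the field sizes plus the 1-byte deletion marker
R = 30 + 9 + 40 + 10 + 8 + 1 + 4 + 4 + 4 + 3 + 1      # = 114 bytes
# b. blocking factor and number of blocks (unspanned), assuming B = 2400 bytes
B, r = 2400, 20000
bfr = B // R                                           # = 21 records per block
b = math.ceil(r / bfr)                                 # = 953 blocks
# c./d. a linear search reads about b/2 blocks on average; a binary search about ceil(log2 b)
print(R, bfr, b, b / 2, math.ceil(math.log2(b)))       # 114 21 953 476.5 10

The times for parts (c) and (d) then follow from the seek time, rotational delay, and block transfer time given in Exercise 16.27.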

Suppose that a disk unit has the following parameters: seek time s = 20 msec; rotational delay rd = 10 msec; block transfer time btt = 1 msec; block size B = 2400 bytes; interblock gap size G = 600 bytes. An EMPLOYEE file has the following fields: Ssn, 9 bytes; Last_name, 20 bytes; First_name, 20 bytes; Middle_init, 1 byte; Birth_date, 10 bytes; Address, 35 bytes; Phone, 12 bytes;Supervisor_ssn, 9 bytes; Department, 4 bytes; Job_code, 4 bytes; deletion marker, 1 byte. The EMPLOYEE file has r = 30,000 records, fixed-length format, and unspanned blocking. Write appropriate formulas and calculate the following values for the above EMPLOYEE file: a. Calculate the record size R (including the deletion marker), the blocking factor bfr, and the number of disk blocks b. b. Calculate the wasted space in each disk block because of the unspanned organization. c. Calculate the transfer rate tr and the bulk transfer rate btr for this disk unit (see Appendix B for definitions of tr and btr). d. Calculate the average number of block accesses needed to search for an arbitrary record in the file, using linear search. e. Calculate in msec the average time needed to search for an arbitrary record in the file, using linear search, if the file blocks are stored on con- secutive disk blocks and double buffering is used. f. Calculate in msec the average time needed to search for an arbitrary record in the file, using linear search, if the file blocks are not stored on consecutive disk blocks. g. Assume that the records are ordered via some key field. Calculate the average number of block accesses and the average time needed to search for an arbitrary record in the file, using binary search.

PAGE 605 AND 606
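
All the parameters for this exercise are given in the question, so a worked Python sketch using the standard formulas (bfr = floor(B/R), b = ceil(r/bfr), tr = B/btt, btr = tr * B/(B+G)) is below; treating parts (e) and (f) with the bulk transfer rate only in the contiguous, double-buffered case is one reasonable reading of the exercise, not an official solution:

import math

s, rd, btt, B, G = 20, 10, 1, 2400, 600                # msec and bytes, from the question
r = 30000

# a. record size, blocking factor, number of blocks
R = 9 + 20 + 20 + 1 + 10 + 35 + 12 + 9 + 4 + 4 + 1     # = 125 bytes
bfr = B // R                                           # = 19 records per block
b = math.ceil(r / bfr)                                 # = 1579 blocks
# b. wasted space per block (unspanned organization)
waste = B - bfr * R                                    # = 25 bytes
# c. transfer rate and bulk transfer rate
tr = B / btt                                           # = 2400 bytes/msec
btr = tr * B / (B + G)                                 # = 1920 bytes/msec
# d./e./f. linear search reads about b/2 blocks on average
ebt = B / btr                                          # effective block time at the bulk rate = 1.25 msec
t_contiguous = s + rd + (b / 2) * ebt                  # about 1017 msec with double buffering
t_scattered = (b / 2) * (s + rd + btt)                 # about 24475 msec: one seek + latency per block
# g. binary search on the Ssn-ordered file
n_bin = math.ceil(math.log2(b))                        # = 11 block accesses
t_bin = n_bin * (s + rd + btt)                         # = 341 msec
print(R, bfr, b, waste, tr, btr, t_contiguous, t_scattered, n_bin, t_bin)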

What is the difference between primary and secondary storage?

Primary memory (storage) is volatile/temporary, whereas secondary storage is persistent. Primary memory is more expensive and smaller, whereas secondary storage is cheaper and larger. Primary memory is faster, whereas secondary storage is slower. Primary memory is connected to the CPU directly through the memory bus, whereas secondary storage devices are connected through I/O controllers and cables.

indexing field

A record structure consists of several fields, and a record field can be used to construct an index; such a field is called the indexing field. An index access structure is usually defined on a single field of a file. Any field in the file can be used to create an index, and multiple indexes, each on a different field, can be constructed for the same file.

What are the main goals of the RAID technology? How does it achieve them?

RAID (redundant array of independent disks) spreads or replicates data across multiple drives. Its two main goals are to increase data reliability (protecting data when a drive fails) and to improve I/O performance. It achieves them through data striping, which spreads data across disks so they can be accessed in parallel, and through redundancy, using mirroring or parity/error-correcting information from which lost data can be reconstructed after a drive failure. The different RAID levels trade off these goals; not all levels provide redundancy.

ideas for security

Replication, Aggregation and Generalization, Exaggeration and Mutilation, Anonymity, User Profiles, Access Permissions

What is a serial schedule? What is a serializable schedule? Why is a serial schedule considered correct? Why is a serializable schedule considered correct?

Serial Schedule: The serial schedule is a type of schedule where one transaction is executed completely before starting another transaction. No interleaved operations. Serializable Schedule: A serializable schedule is a schedule whose effect on any consistent database instance is guaranteed to be identical to that of some complete serial schedule. A serial schedule is considered correct because it maintains the consistency and isolation properties of the database system. A serializable schedule is accepted as correct because the database is not influenced by the concurrent execution of the transactions. The isolation level affects a transaction's serializability.

steal/no steal

Steal approach = if the recovery protocol allows writing an updated buffer before the transaction commits - avoids the need for a very large buffer space to store all updated pages in memory No steal approach = if a cache buffer page updated by a transaction cannot be written to disk before the transaction commits - UNDO will never be needed during recovery, since a committed transaction will not have any of its updates on disk before it commits

What is meant by the closure of a set of functional dependencies? Illustrate with an example.

The closure of F, denoted as F+, is the set of all regular Functional Dependencies that can be derived from F
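
The question asks for an illustration; a minimal Python sketch, using a small hypothetical schema R(A, B, C, D) with F = {A -> B, B -> C, CD -> A}, computes attribute closures, from which membership of an FD in F+ can be decided (X -> Y is in F+ iff Y is a subset of X+):

def attribute_closure(attrs, fds):
    # X+ : all attributes functionally determined by attrs under the FDs in F.
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

# Hypothetical schema R(A, B, C, D) with F = {A -> B, B -> C, CD -> A}
F = [("A", "B"), ("B", "C"), ("CD", "A")]
print(attribute_closure("A", F))     # A+ = {A, B, C}, so A -> C is in F+ even though it is not in F
print(attribute_closure("CD", F))    # CD+ = {A, B, C, D}, so CD -> ABCD is in F+ (CD is a key of R)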

Discuss the different types of failures. What is meant by catastrophic failure?

Transaction, system, and media failures. Computer failure (system crash) = a hardware or software error occurs during transaction execution. Transaction or system error = an operation such as integer overflow or division by zero, a logical programming error, or a user interrupt causes the transaction to fail. Disk (media) failure = some disk blocks lose their data because of a read or write malfunction or a head crash. A catastrophic failure is one that destroys the database itself or large portions of it (for example a disk crash combined with fire, theft, sabotage, or air-conditioning failure); the system cannot recover using the log alone, so recovery relies on restoring a database backup and, where possible, redoing committed transactions from an archived log.

How does tuple relational calculus differ from domain relational calculus?

Tuple Relational Calculus in DBMS uses a tuple variable (t) that goes to each row of the table and checks if the predicate is true or false for the given row. Depending on the given predicate condition, it returns the row or part of the row. Domain Relational Calculus uses domain Variables to get the column values required from the database based on the predicate expression or condition.

When are two sets of functional dependencies equivalent? How can we determine their equivalence?

Two sets of functional dependencies E and F are equivalent if E+= F+. That is E is equivalent to F if E covers F (every FD in F is in E+) and F covers E (every FD in E is in F+).

What are UNDO-type and REDO-type log entries?

UNDO-type log entries = include the old value (BFIM) of the item since this is needed to undo the effect of the operation from the log (by setting the item value in the database back to its BFIM) REDO-type log entries = includes the new value (AFIM) of the item written by the operation since this is needed to redo the effect of the operation from the log (by setting the item value in the database on disk to its AFIM)

What is union compatibility? Why do the UNION, INTERSECTION, and DIFFERENCE operations require that the relations on which they are applied be union compatible?

Two relations are union compatible if they have the same number of attributes and the corresponding attributes have the same domains. UNION, INTERSECTION, and DIFFERENCE require the relations they are applied to to be union compatible because they are binary set operations on tuples: for the result to be a well-formed relation, every tuple in the result must have the same number of attributes drawn from the same domains.

For incremental logging with immediate updates, a log record for a transaction would contain a. a transaction name, a data item name, and the old and new value of the item b. a transaction name, a data item name, and the old value of the item c. a transaction name, a data item name, and the new value of the item d. a transaction name and a data item name

a

secondary key field

A secondary index is an ordered file with two fields (like a primary index). The first field has the same data type as some nonordering field of the data file, which is the indexing field. If this secondary access structure is built on a key field, one that has a distinct value for every record, that field is called a secondary key field.

snapshot isolation

A transaction sees the data items that it reads based on the committed values of the items in the database snapshot (database state) taken when the transaction starts. Write locks are needed, but read locks are not. Snapshot isolation ensures that the phantom record problem does not occur, since the transaction (or, in some cases, the individual statement) only sees records that were committed in the database at the time it started; it also prevents dirty reads and nonrepeatable reads.

Consider the following decompositions for the relation schema R of Exercise 14.24. Determine whether each decomposition has (1) the dependency preservation property, and (2) the lossless join property, with respect to F. Also determine which normal form each relation in the decomposition is in. a. D1 ={R1,R2,R3,R4,R5};R1 ={A,B,C},R2 ={A,D,E},R3 ={B,F}, R4 = {F, G, H}, R5 = {D, I, J} b. D2 = {R1, R2, R3}; R1 = {A, B, C, D, E}, R2 = {B, F, G, H}, R3 = {D, I, J} c. D3 ={R1,R2,R3,R4,R5};R1 ={A,B,C,D},R2 ={D,E},R3 ={B,F}, R4 = {F, G, H}, R5 = {D, I, J}

a. dependency preserving, lossless, 3nf b. dependency preserving, lossless, 1nf c. not dependency preserving, not lossless, 1nf

Which of the following schedules is (conflict) serializable? For each serializable schedule, determine the equivalent serial schedules. a. r1(X); r3(X); w1(X); r2(X); w3(X); b. r1(X); r3(X); w3(X); w1(X); r2(X); c. r3(X); r2(X); w3(X); r1(X); w1(X); d. r3(X); r2(X); r1(X); w3(X); w1(X);

a. not serializable b. not serializable c. serializable (T2 -> T3 -> T1) d. not serializable
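
A minimal Python sketch of the test used for these answers: build the precedence graph (an edge Ti -> Tj whenever an operation of Ti conflicts with a later operation of Tj) and look for a cycle; schedule (c) above is used as the input:

from itertools import combinations

def precedence_edges(schedule):
    # schedule: list of (txn, op, item) in order; conflict = same item, different txns, at least one write
    edges = set()
    for (t1, op1, x1), (t2, op2, x2) in combinations(schedule, 2):
        if x1 == x2 and t1 != t2 and "w" in (op1, op2):
            edges.add((t1, t2))
    return edges

def has_cycle(edges, nodes):
    adj = {n: [b for a, b in edges if a == n] for n in nodes}
    state = {n: 0 for n in nodes}               # 0 = unvisited, 1 = on stack, 2 = done
    def dfs(n):
        state[n] = 1
        for m in adj[n]:
            if state[m] == 1 or (state[m] == 0 and dfs(m)):
                return True
        state[n] = 2
        return False
    return any(state[n] == 0 and dfs(n) for n in nodes)

# schedule (c): r3(X); r2(X); w3(X); r1(X); w1(X)
sched_c = [("T3", "r", "X"), ("T2", "r", "X"), ("T3", "w", "X"),
           ("T1", "r", "X"), ("T1", "w", "X")]
edges = precedence_edges(sched_c)               # edges T2 -> T3, T3 -> T1, T2 -> T1
print(has_cycle(edges, {"T1", "T2", "T3"}))     # False: conflict serializable, order T2, T3, T1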

Consider the three transactions T1, T2, and T3, and the schedules S1 and S2 given below. Draw the serializability (precedence) graphs for S1 and S2, and state whether each schedule is serializable or not. If a schedule is serializable, write down the equivalent serial schedule(s). T1: r1 (X); r1 (Z); w1 (X); T2: r2 (Z); r2 (Y); w2 (Z); w2 (Y); T3: r3 (X); r3 (Y); w3 (Y); S1: r1 (X); r2 (Z); r1 (Z); r3 (X); r3 (Y); w1 (X); w3 (Y); r2 (Y); w2 (Z); w2 (Y); S2: r1 (X); r2 (Z); r3 (X); r1 (Z); r2 (Y); r3 (Y); w1 (X); w2 (Z); w3 (Y); w2 (Y);

S1: serializable. Precedence graph edges: T3 -> T1 (r3(X) before w1(X)), T1 -> T2 (r1(Z) before w2(Z)), T3 -> T2 (w3(Y) before r2(Y) and w2(Y)). There is no cycle, so the equivalent serial schedule is T3 -> T1 -> T2. S2: not serializable. It has the same edges plus T2 -> T3 (r2(Y) before w3(Y)), which creates the cycle T1 -> T2 -> T3 -> T1.

In case of transaction failure under a deferred update incremental logging scheme, which of the following will be needed? a. an undo operation b. a redo operation c. an undo and redo operation d. none of the above

b

Incremental logging with deferred updates implies that the recovery system must a. store the old value of the updated item in the log b. store the new value of the updated item in the log c. store both the old and new value of the updated item in the log d. store only the Begin Transaction and Commit Transaction records in the log

b

The write-ahead logging (WAL) protocol simply means that a. writing of a data item should be done ahead of any logging operation b. the log record for an operation should be written before the actual data is written c. all log records should be written before a new transaction begins execution d. the log never needs to be written to disk

b

For correct behavior during recovery, undo and redo operations must be a. commutative b. associative c. idempotent d. distributive

c

Discuss the problems of deadlock and starvation, and the different approaches to dealing with these problems.

deadlock: - occurs when each transaction T in a set of two or more transactions is waiting for some item that is locked by some other transaction T′ in the set - each transaction in the set is in a waiting queue, waiting for one of the other transactions in the set to release the lock on an item - because the other transaction is also waiting, it will never release the lock deadlock solutions: 1) deadlock prevention protocols = design the system in such a way that the possibility of deadlock is excluded - every transaction lock all the items it needs in advance, if any of the items cannot be obtained, none of the items are locked - ordering all the items in the database and making sure that a transaction that needs several items will lock them according to that order - wait die (transaction timestamp) = an older transaction is allowed to wait for a younger transaction, whereas a younger transaction requesting an item held by an older transaction is aborted and restarted, avoid starvation - wound wait (transaction timestamp) = a younger transaction is allowed to wait for an older one, whereas an older transaction requesting an item held by a younger transaction preempts the younger transaction by aborting it, avoid starvation - no waiting algorithm = if a transaction is unable to obtain a lock, it is immediately aborted and then restarted after a certain time delay without checking whether a deadlock will actually occur or not - cautious waiting algorithm = no transaction will ever wait for another blocked transaction 2) deadlock detection = the system checks if a state of deadlock actually exists - wait for graph, state of deadlock if and only if there is a cycle in the graph, need to abort certain transactions based on victim selection 3) timeouts = if a transaction waits for a period longer than a system-defined timeout period, the system assumes that the transaction may be deadlocked and aborts it—regardless of whether a deadlock actually exists starvation: - occurs when a transaction cannot proceed for an indefinite period of time while other transactions in the system continue normally - may occur if the waiting scheme for locked items is unfair in that it gives priority to some transactions over others - can occur because of victim selection if the algorithm selects the same transaction as victim repeatedly, thus causing it to abort and never finish execution - due to live locks which are a deadlock-like situation in which processes block each other with a repeated state change yet make no progress starvation solutions: 1) first-come first-served queue or fair waiting scheme 2) allow some transactions to have priority over others but increase the priority of a transaction the longer it waits, until it eventually gets the highest priority and proceeds 3) use higher priorities for transactions that have been aborted multiple times
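
A minimal Python sketch of the two timestamp-based prevention rules named above (wait-die and wound-wait), assuming a smaller timestamp means an older transaction:

def wait_die(requester_ts, holder_ts):
    # Older transactions may wait; younger requesters die (abort and restart with the same timestamp).
    return "wait" if requester_ts < holder_ts else "abort requester"

def wound_wait(requester_ts, holder_ts):
    # Older requesters wound (abort) the younger holder; younger requesters wait.
    return "abort holder" if requester_ts < holder_ts else "wait"

# T1 (ts = 5, older) requests an item held by T2 (ts = 9, younger):
print(wait_die(5, 9), "|", wound_wait(5, 9))    # wait | abort holder
# T2 (ts = 9, younger) requests an item held by T1 (ts = 5, older):
print(wait_die(9, 5), "|", wound_wait(9, 5))    # abort requester | wait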

Discuss the UNDO and REDO operations and the recovery techniques that use each.

deferred update/no-steal approach: uses redo, no undo (suitable for short transactions and transactions that change few items). Recovery using deferred update in a multiuser environment: - there is no need to redo the write_item operations of any transaction committed before the last checkpoint - the write_item operations of transactions that reached their commit point after the last checkpoint are redone from the log, in the order in which they were written to the log - transactions that did not reach their commit point are simply ignored immediate update: uses undo and redo (steal/no-force approach) or undo and no redo (steal/force approach). Recovery using immediate updates in a multiuser environment (steal/no-force, UNDO/REDO): - undo all the write_item operations of the active (uncommitted) transactions, in the reverse of the order in which they were written to the log - redo all the write_item operations of the committed transactions from the log, in the order in which they were written to the log (as an optimization, the redo can scan the log backward and redo only the last update of each item X) steal/force approach (UNDO/NO-REDO): - undo the write_item operations of the uncommitted transactions as above; no redo is needed, because forcing pages to disk at commit guarantees that all committed updates are already in the database on disk

assessing concurrency control

degree of concurrency (the number of transactions that can run concurrently), throughput, amount of blocking, number of deadlocks

Define the violations caused by each of the following: dirty read, non repeatable read, and phantoms.

dirty read = uncommitted data is read. A transaction T1 may read the update of a transaction T2 that has not yet committed; if T2 fails and is aborted, then T1 has read a value that does not exist and is incorrect. nonrepeatable read = before transaction A is over, another transaction B accesses and modifies the same data, so the data read twice by transaction A may differ. A transaction T1 may read a given value from a table; if another transaction T2 later updates that value and T1 reads it again, T1 sees a different value. A nonrepeatable read occurs when, during the course of a transaction, a row is retrieved twice and the values within the row differ between reads. phantom read = while a transaction is reading a set of records, another transaction inserts or deletes rows that satisfy the same search condition, so when the records are read again a new "phantom" row appears (or one disappears). A phantom read occurs when, in the course of a transaction, two identical queries are executed and the collection of rows returned by the second query differs from the first.

protect privacy of data

encryption, provide aggregate data, provide limited data records, increase noise

dense index

has an index entry for every search key value (and hence every record) in the data file

nondense (sparse) index

has index entries for only some of the search values, has fewer entries than the number of records in the file

clustering field/index

If the ordering field is not a key field (numerous records in the file can have the same value for the ordering field), this index is used, and the file is called a clustered file. It speeds up retrieval of all the records that have the same value for the clustering field. The clustering index is an ordered file with two fields: the first is of the same type as the clustering field of the data file, and the second is a disk block pointer. It is a nondense index because it has one entry for each distinct value of the indexing field, which is a nonkey field by definition and hence has duplicate values, rather than an entry for every record in the file.

How does multilevel indexing improve the efficiency of searching an index file?

multilevel index: reduces the part of the index that must be searched by the blocking factor of the index (the fan-out) at each step, so the record search space is divided n ways at each search level rather than 2 ways as in binary search. It treats the index file (the first level of the multilevel index) as an ordered file with a distinct value for each record; because the first level is a sorted file, a primary index can be built for it, which is called the second level of the multilevel index, and so on. - while searching for a record, it reduces the number of block accesses for a given indexing field value - multilevel indexing also reduces the insertion and deletion problems of single-level indexes, since index blocks can leave some space free for new entries - it is usually implemented using B-trees and B+-trees
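
A small Python sketch of why this helps: each level divides the remaining search space by the index fan-out, so the number of index levels (and block accesses) is logarithmic in the number of first-level entries (the fan-out and entry count below are hypothetical):

import math

fo = 68             # hypothetical fan-out: index entries per index block
r1 = 30000          # hypothetical number of first-level index entries

levels = math.ceil(math.log(r1, fo))        # = 3 index levels
blocks_read = levels + 1                    # one block per level, plus the data block
print(levels, blocks_read)                  # 3 4
# compare with binary search over a single-level index of ceil(r1 / fo) = 442 blocks:
print(math.ceil(math.log2(math.ceil(r1 / fo))))   # 9 block accesses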

What is meant by a safe expression in relational calculus?

one that is guaranteed to yield a finite number of tuples as its result, otherwise, the expression is called unsafe

State the informal guidelines for relation schema design that we discussed. Illustrate how violation of these guidelines may be harmful.

semantics of the relational attributes must be clear (each tuple in a relation should represent one entity or relationship instance) redundant information in tuples and update anomalies (wastes storage and causes problems with update anomalies) null values in tuples (as few as possible) spurious tuples (avoid at any cost, check lossless join property)

