DATABASE FINAL

dirty read

A dirty read occurs when a transaction reads a database object that has been modified by another not-yet-committed transaction.

recoverable schedule

A recoverable schedule is one in which a transaction can commit only after all other transactions whose changes it has read have committed.

schedule

A schedule is a series of (possibly overlapping) transactions.

avoids-cascading-aborts schedule

A schedule that avoids cascading aborts is one in which transactions read only the changes of committed transactions. Such a schedule is not only recoverable; aborting a transaction can also be accomplished without cascading the abort to other transactions.

serializable schedule

A serializable schedule over a set S of transactions is a schedule whose effect on any consistent database instance is identical to that of some complete serial schedule over the set of committed transactions in S. (For example, an interleaved schedule is equivalent to running T1 then T2 if T1's read and write of B are not influenced by T2's actions on A, and the net effect is the same when the actions are swapped to obtain the serial schedule.)

Transaction In what ways is it different from an ordinary program (in a language such as C)?

A transaction is an execution of a user program, and is seen by the DBMS as a series or list of actions. The actions that can be executed by a transaction include reads and writes of database objects, whereas actions in an ordinary program could involve user input, access to network devices, user interface drawing, etc.

Durability

Durability defines the persistence of committed data: once a transaction commits, the data should persist in the database even if the system crashes before the data is written to non-volatile storage.

Correspondence between SQL clauses and Relational Algebra operators

FROM -> cross product of all relations in the list
WHERE -> selection operator
SELECT -> projection to the attribute list
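
This correspondence can be illustrated with a minimal Python sketch (the relations and attribute names below are toy assumptions, not from the text):

```python
# FROM -> cross product, WHERE -> selection, SELECT -> projection,
# sketched over two toy relations.
from itertools import product

students = [{"sid": 1, "name": "Ann"}, {"sid": 2, "name": "Bob"}]
enrolled = [{"e_sid": 1, "course": "DB"}, {"e_sid": 2, "course": "OS"}]

# FROM students, enrolled -> cross product of all relations in the list
cross = [{**s, **e} for s, e in product(students, enrolled)]

# WHERE students.sid = enrolled.sid -> selection
selected = [t for t in cross if t["sid"] == t["e_sid"]]

# SELECT name, course -> projection to the attribute list
result = [(t["name"], t["course"]) for t in selected]
# result == [("Ann", "DB"), ("Bob", "OS")]
```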

Describe how to evaluate a grouping query with aggregation operator MAX using a sorting-based approach.

First we sort all of the tuples on the GROUP BY attribute. Next we sort the elements within each group on the MAX attribute, taking care not to sort beyond the group boundaries.
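
A sorting-based GROUP BY with MAX can be sketched in Python like this (toy rows assumed; `itertools.groupby` stands in for detecting group boundaries during the scan):

```python
# Sort on the GROUP BY attribute, then scan each group for its maximum.
from itertools import groupby

rows = [("a", 3), ("b", 7), ("a", 9), ("b", 2)]
rows.sort(key=lambda r: r[0])          # sort on the GROUP BY attribute

result = [(grp, max(v for _, v in grp_rows))
          for grp, grp_rows in groupby(rows, key=lambda r: r[0])]
# result == [("a", 9), ("b", 7)]
```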

nested loop join:

for each tuple r in R:
    for each tuple s in S:
        if r.a = s.a, append (r, s) to the result

cost: M + (M * f) * N (f = tuples per page)
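
The loop above, as a runnable sketch (toy relations assumed):

```python
# Tuple-at-a-time nested loop join on attribute a.
R = [{"a": 1, "x": "r1"}, {"a": 2, "x": "r2"}]
S = [{"a": 2, "y": "s1"}, {"a": 3, "y": "s2"}]

result = []
for r in R:                # outer relation
    for s in S:            # inner relation, rescanned for every outer tuple
        if r["a"] == s["a"]:
            result.append((r["x"], s["y"]))
# result == [("r2", "s1")]
```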

Strict 2PL

Strict 2PL is the most widely used locking protocol, where 1) a transaction requests a shared (exclusive) lock on an object before it reads (modifies) the object, and 2) all locks held by a transaction are released only when the transaction completes (commits or aborts).

Define the term most selective access path for a query.

The most selective access path is the query access path that retrieves the fewest pages during query evaluation. This is the most efficient way to gather the query's results.

Suppose that you are building a DBMS and want to add a new aggregate operator called SECOND LARGEST, which is a variation of the MAX operator. Describe how you would implement it.

The operator SECOND LARGEST can be implemented using sorting. For each group (if there is a GROUP BY clause), we sort the tuples and return the second largest value for the desired attribute. The cost here is the cost of sorting.
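
One way to sketch SECOND LARGEST (toy data assumed; per-group sorting as described above):

```python
# SECOND LARGEST via sorting, evaluated per group.
from itertools import groupby

rows = [("a", 3), ("a", 9), ("a", 5), ("b", 7), ("b", 2)]
rows.sort(key=lambda r: r[0])          # group the tuples

result = []
for grp, grp_rows in groupby(rows, key=lambda r: r[0]):
    vals = sorted((v for _, v in grp_rows), reverse=True)
    result.append((grp, vals[1]))      # second entry of the sorted group
# result == [("a", 5), ("b", 2)]
```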

goal of DBMS

To provide a way to store and retrieve database information that is both convenient and efficient.

Explain why it is not always possible to perform SQL UPDATE/DELETE/INSERT statements on top of a view.

Updates on views must be translated into one or more updates on base relations. For some views, the translation can be done in an unambiguous way. In that case, updates are allowed. For example, if a view selects tuples from a relation that satisfy a simple condition, then updates on the view can be translated into updates on the base table in a straightforward fashion. In other cases, however, there may not be one unambiguous translation. For example, if a view joins two tables together and a user wants to delete a tuple from the view, that delete operation can be translated into deleting one or more tuples from one or both of the underlying tables. Depending on how we would implement the delete operation, additional tuples might also disappear from the view, which would be an undesirable side-effect. For such a view, the only reasonable solution is thus to disallow updates. When it is possible to perform updates on a view, the view is called updatable.

why are B+ tree leaf nodes linked

Linking the leaf nodes allows rapid in-order traversal and makes range queries easier.

Strict 2PL is used in many database systems. Give two reasons for its popularity.

1. It ensures only safe interleavings of transactions, so that schedules are recoverable, avoid cascading aborts, etc. 2. It is very simple and easy to implement: the lock manager only needs to provide lookups for shared/exclusive locks and an atomic locking mechanism (such as a semaphore).

ACID properties

A DBMS must ensure four important properties of transactions to maintain data in the face of concurrent access and system failures:
A - atomicity: all of a transaction's actions are carried out, or none are
C - consistency: a consistent DB must stay consistent after the transaction
I - isolation: each transaction must "appear" to run in a vacuum (it need not consider the effects of other concurrently executing transactions)
D - durability: if a transaction completes, its effects must be persistent

blind write

A blind write is when a transaction writes to an object without ever reading the object.

Discuss two scenarios where a view can be helpful in a database management system.

A view is a relation defined by a query and there are many benefits to views: Views help provide logical data independence. That is, if applications execute queries against views, we can change the conceptual schema of the database without changing the applications. We only need to update the view definitions. Views can help increase physical data independence: We can split base tables into smaller tables horizontally or vertically and hide this data organization behind a view. Views can help secure the data by limiting access to that data: We can create views that only reveal the information users are allowed to see. We can then grant users access to the views instead of the base tables. Views can help encapsulate complex queries. Views can speed-up computations if we materialize the content of a view instead of re-computing it whenever the view is used.

When does a general selection condition match an index? What is a primary term in a selection condition with respect to a given index?

An index matches a selection condition if the index can be used to retrieve just the tuples that satisfy the condition. A primary term in a selection condition is a conjunct that matches an index (i.e., one that can be used with the index).

unrepeatable read

An unrepeatable read occurs when a transaction reads the same object twice and gets different values, even though it has not modified the object in between. Suppose a transaction T2 changes the value of an object A that has been read by a transaction T1 while T1 is still in progress. If T1 tries to read the value of A again, it will get a different result, even though it has not modified A.

Atomicity

Atomicity means that either all actions of a transaction are completed fully, or none are. This means there are no partial transactions (such as when half the actions complete and the other half do not).

What are the three major steps of the database design (data modeling) process? Define each by one sentence.

Conceptual modeling: The process of identifying the entities, their properties, and the relationships among these entities that are part of the environment that the information system models. In our case, we used the E-R model for conceptual modeling.
Logical modeling: The process of mapping the conceptual model to the primitives of the data model that is used. In the case of relational DBMSs, this is the process of defining the relational schema.
Physical modeling: The process of defining the physical organization characteristics (e.g., indexes, file structure) of the database.

Consistency

Consistency involves beginning a transaction with a 'consistent' database, and finishing with a 'consistent' database. For example, in a bank database, money should never be "created" or "deleted" without an appropriate deposit or withdrawal. Every transaction should see a consistent database.

structure of DBMS

DRAW PIC

Discuss the pros and cons of hash join, sort-merge join, and block nested loops join.

Hash join provides excellent performance for equality joins, and can be tuned to require very few extra disk accesses beyond a one-time scan (provided enough memory is available). However, hash join is worthless for non-equality joins. Sort-merge joins are suitable when there is either an equality or non-equality based join condition. Sort-merge also leaves the results sorted which is often a desired property. Sort-merge join has extra costs when you have to use external sorting (there is not enough memory to do the sort in-memory). Block nested loops is efficient when one of the relations will fit in memory and you are using an MRU replacement strategy. However, if an index is available, there are better strategies available (but often indexes are not available).

If the join condition is not equality, can you use sort-merge join? Can you use hash join? Can you use index nested loops join? Can you use block nested loops join?

If the join condition is not equality, you can use sort-merge join, index nested loops (if you have a range style index such as a B+ tree index or ISAM index), or block nested loops join. Hash joining works best for equality joins and is not suitable otherwise.

Isolation

Isolation ensures that a transaction can run independently, without considering any side effects that other concurrently running transactions might have. When a database interleaves transaction actions for performance reasons, the database protects each transaction from the effects of other transactions.

Explain the difference between logical and physical data independence.

Logical data independence means that users are shielded from changes in the logical structure of the data, while physical data independence insulates users from changes in the physical storage of the data. Consider, for example, a Students relation. We could choose to store Students tuples in a heap file, with a clustered index on the sname field. Alternatively, we could choose to store it with an index on the gpa field, or to create indexes on both fields, or to store it as a file sorted by gpa. These storage alternatives are not visible to users, except in terms of improved performance, since they simply see a relation as a set of tuples. This is what is meant by physical data independence. In short, logical data independence means application programs and users are unaffected by changes in the logical schema of the database (and vice versa), whereas physical data independence means they are unaffected by changes in the physical schema.

Give an example of how buffer replacement policies can affect the performance of a join algorithm.

One example where the buffer replacement strategy affects join performance is the use of LRU versus MRU in a simple nested loops join. If the relations don't fit in main memory, then the buffer strategy is critical. Say there are M buffer pages, N of which are filled by the first relation, and the second relation has M - N + P pages, meaning all of the second relation will fit in the buffer except P pages. Since we must do repeated scans of the second relation, the replacement policy comes into play. With LRU, whenever we need a page it will just have been paged out, so every page request requires a disk I/O. With MRU, on the other hand, we only need to reread P - 1 pages of the second relation, since the others remain in memory.

abstraction in dbms

Physical level: How the data are stored, e.g. index, B-tree, hashing. Lowest level of abstraction; complex low-level structures described in detail.
Conceptual level: Next highest level of abstraction. Describes what data are stored and the relationships among the data. Database administrator level.
View level: Highest level. Describes part of the database for a particular group of users; there can be many different views of a database. E.g. tellers in a bank get a view of customer accounts, but not of payroll data.

What is referential integrity? How do you represent it in relational model?

Referential integrity refers to a relationship between two entities such that if there is a referential integrity constraint from entity E1 to entity E2, an instance of E2 has to exist for an instance of E1 to exist. In the relational model, this is represented by foreign keys.

What is the main difference between relational calculus and relational algebra?

Relational calculus is a declarative language that requires the user to specify the predicates that need to be satisfied by all the tuples in the result relation. Relational algebra, on the other hand, is procedural where the user specifies how to execute the query by means of relational algebraic operators.

Mongo DB

collection = table, document = row, field = column, index = index.
The underlying data models are different: relational databases structure data into tables and rows, while MongoDB structures data into collections of documents. As the tutorial states, "relational databases define columns at the table level, while MongoDB defines its fields at the document level". In MongoDB, each document within a collection can have its own unique set of fields; fields are tracked with each individual document, so a document carries much more information than just a row. MongoDB collections do not enforce a schema; as the tutorial states, a collection is "schema-less" and not strict about what goes in.

The advantages of using a DBMS

data independence and efficient access
reduced application development time
data integrity and security
data administration
concurrent access and crash recovery

page nested loop join:

for each page Pr in R:
    for each page Ps in S:
        join all tuples r in Pr with all tuples s in Ps, appending matches to the result

cost: M + M * N
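
A sketch with pages modeled as lists of tuples (toy data assumed):

```python
# Page nested loop join: for each page of R, scan every page of S.
R_pages = [[(1, "r1"), (2, "r2")], [(3, "r3")]]   # M = 2 pages
S_pages = [[(2, "s1")], [(3, "s2"), (4, "s3")]]   # N = 2 pages

result = []
for Pr in R_pages:          # M page reads
    for Ps in S_pages:      # N page reads per outer page -> M + M*N total
        for r in Pr:
            for s in Ps:
                if r[0] == s[0]:
                    result.append((r[1], s[1]))
# result == [("r2", "s1"), ("r3", "s2")]
```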

Eliminate duplicates:

for each partition i = 1 ... B-1:
    create an in-memory hash table
    for each tuple in partition i:
        hash the tuple using h2()
        scan the bucket for a duplicate
        if there is no duplicate, insert the tuple into the bucket
    append the hash table's contents to the result file

h1() and h2() must be different hash functions, because we don't want all the tuples in a partition to hash to the same bucket.
Pitfalls: skewed distributions; a partition could be larger than B pages (partition it again with a new hash function).
cost: N + 3M
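
The in-memory phase can be sketched like this (toy partition assumed; `h2` is a stand-in for the second hash function):

```python
# Duplicate elimination within one partition: hash with h2 into buckets,
# scan the bucket for a duplicate before inserting.
def h2(tup, num_buckets=4):            # assumed second hash function
    return hash(tup) % num_buckets

partition = [(1, "a"), (2, "b"), (1, "a"), (3, "c"), (2, "b")]

buckets = {}
result = []
for tup in partition:
    bucket = buckets.setdefault(h2(tup), [])
    if tup not in bucket:              # scan bucket for a duplicate
        bucket.append(tup)
        result.append(tup)
# result == [(1, "a"), (2, "b"), (3, "c")]
```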

performance: insert:

heap file: 2
sorted file: N
clustered B+ tree: logF(1.5*N) + 1
unclustered B+ tree: logF(0.15*N) + 3
hash index: 4

performance: equality search:

heap file: N
sorted file: log2(N)
clustered B+ tree: logF(1.5*N)
unclustered B+ tree: logF(0.15*N)
hash index: 2

performance: range search:

heap file: N
sorted file: log2(N) + number of matching pages
clustered B+ tree: logF(1.5*N) + number of matching pages
unclustered B+ tree: logF(0.15*N) + number of matching pages
hash index: N
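
The equality-search formulas above can be turned into a small cost calculator (a sketch; the 1.5*N and 0.15*N constants are the usual textbook approximations for leaf pages):

```python
# Equality-search cost (in page I/Os) for each file organization,
# given N data pages and B+ tree fanout F.
import math

def cost_equality(N, F):
    return {
        "heap file": N,
        "sorted file": math.log2(N),
        "clustered B+ tree": math.log(1.5 * N, F),
        "unclustered B+ tree": math.log(0.15 * N, F),
        "hash index": 2,
    }

costs = cost_equality(N=1_000_000, F=100)
# the heap-file scan dominates; the hash index stays at ~2 I/Os
```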

properties of a good schedule

high throughput -> interleaving
serializable -> isolation
avoids conflicts
avoids cascading aborts

Block nested loop join:

buffer usage: B-2 pages hold a block of R, 1 page holds S, 1 page holds output

for each block br (of B-2 pages) of R:
    for each page of S:
        join all tuples r in br with tuples s in the S page
        append each match (r, s) to the output, flushing the output page when full

cost: M + ceil(M / (B-2)) * N
If both relations are big, BNL has many "blocks" -> worse performance.
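
A block nested loops sketch that also counts page I/Os (toy pages assumed, B = 4 buffer pages):

```python
# Block nested loop join: hold B-2 pages of R, rescan S once per block.
import math

R_pages = [[(1, "r1")], [(2, "r2")], [(3, "r3")], [(4, "r4")]]  # M = 4
S_pages = [[(2, "s1"), (4, "s2")]]                              # N = 1
B = 4
block_size = B - 2

result, io = [], 0
for i in range(0, len(R_pages), block_size):
    block = [r for page in R_pages[i:i + block_size] for r in page]
    io += len(R_pages[i:i + block_size])      # read the block of R
    for Ps in S_pages:                        # rescan S for every block
        io += 1
        for s in Ps:
            for r in block:
                if r[0] == s[0]:
                    result.append((r[1], s[1]))

# io == M + ceil(M / (B-2)) * N == 4 + 2 * 1 == 6
assert io == len(R_pages) + math.ceil(len(R_pages) / block_size) * len(S_pages)
```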

indexing (technique for evaluating operator: selection, projection, join)

indexing-selection: If the selection is an equality and a B+ tree or hash index exists on the field in the condition, we can retrieve the relevant tuples by finding them in the index and then locating them on disk. indexing-projection: If a multiattribute B+ tree index exists on all of the projection attributes, then one only needs to look at the leaves of the B+ tree. indexing-join: When an index is available, joining two relations can be more efficient. Say there are two relations A and B, and there is a secondary index on the join attribute over relation A. The join works as follows: for each tuple in B, we look up the join attribute in the index over relation A to see if there is a match. If there is a match, we store the joined tuple; otherwise we move on to the next tuple in relation B.

iteration (technique for evaluating operator: selection, projection, join)

iteration-selection: Scan the entire collection, checking the condition on each tuple and adding the tuple to the result if the condition is satisfied. iteration-projection: Scan the entire relation, and eliminate unwanted attributes in the result. iteration-join: To join two relations, take the first tuple in the first relation and scan the entire second relation to find tuples that match the join condition. Once the first tuple has been compared to all tuples in the second relation, repeat with the second tuple of the first relation, and so on.

sorting goals

minimize CPU costs (CS 35)
minimize disk I/O
overlap CPU and disk I/O
use sequential I/O instead of random I/O

Uses for sorting

ORDER BY and GROUP BY commands (users may want answers in some order, like increasing age)
eliminating duplicates (useful for eliminating duplicate copies in a collection of records)
query processing - making joins efficient (a widely used algorithm for the important join operation requires sorting)
bulk loading a B+ tree (sorting the records is the first step in bulk loading a tree index)

partitioning (technique for evaluating operator: selection, projection, join)

partitioning-selection: Do a binary search on sorted data to find the first tuple that matches the condition. To retrieve the remaining entries, we simply scan the collection starting at the first tuple we found. partitioning-projection: To eliminate duplicates when doing a projection, one can simply project out the unwanted attributes and hash a combination of the remaining attributes so duplicates can be easily detected. partitioning-join: One can join using partitioning with a hash join variant or a sort-merge join. For sort-merge join, we sort both relations on the join condition and then scan both relations to identify matches; after sorting, this requires only a single scan over each relation.

sorting info / equations

pass 0:
    number of sorted runs: ceil(N/B)
    size of each run: B pages (the last one may be shorter)
total number of passes: ceil(log_{B-1}(ceil(N/B))) + 1
total I/O cost: 2 * N * (total number of passes)
two passes suffice when B - 1 >= N/B (the B-1 merge inputs can absorb all pass-0 runs)
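
These formulas can be checked with a small sketch (integer arithmetic is used for the pass count to avoid floating-point log pitfalls):

```python
# External sort statistics: N data pages, B buffer pages.
import math

def sort_stats(N, B):
    runs = math.ceil(N / B)        # sorted runs produced by pass 0
    passes = 1
    while runs > 1:                # each merge pass fans in B-1 runs
        runs = math.ceil(runs / (B - 1))
        passes += 1
    return passes, 2 * N * passes  # every pass reads and writes N pages

passes, io = sort_stats(N=1000, B=11)
# passes == 3 (pass 0 makes 100 runs; 100 -> 10 -> 1), io == 6000
```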

hash join:

phase 1: partition R; partition S
phase 2: for each partition i, load Ri into an in-memory hash table; for each tuple of Si, hash it and search the bucket for a match, outputting matches to the result

cost: 3(M + N)
the smaller relation must be small enough (N < B^2)
susceptible to skewed data
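
The build-and-probe idea of phase 2 can be sketched in memory (toy relations assumed; the phase 1 partitioning is omitted):

```python
# Hash join, in-memory phase: build a hash table on the smaller
# relation, then probe it with each tuple of the other relation.
S = [(2, "s1"), (3, "s2")]                 # build side (smaller)
R = [(1, "r1"), (2, "r2"), (3, "r3")]      # probe side

table = {}
for key, val in S:                         # build phase
    table.setdefault(key, []).append(val)

result = [(rv, sv) for key, rv in R for sv in table.get(key, [])]
# result == [("r2", "s1"), ("r3", "s2")]
```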

sort-merge join: (improving sort-merge join)

sort R: M log M
sort S: N log N
merge R and S:
    r = first record in R; s = first record in S
    if r > s: advance s
    if r < s: advance r
    if r = s: add (r, s) to the result, looping until there are no more matches
    repeat until done with R or S
total cost: M log M + N log N + M + N

How can we improve sort-merge join? The sorting of R and the sorting of S each have merging phases, and the join of R and S also has a merging phase, so combine all these merging phases:
Pass 0: sort subfiles of R and S individually.
Pass 1: merge the sorted runs of R, merge the sorted runs of S, and merge the resulting R and S files as they are generated, checking the join condition.
Why does the larger relation have to be small enough (M < B^2)? To do the combined merge in one pass, the first page of each sorted run must fit in the buffer; if M is too big, they cannot all fit, which puts a limit on the actual relation size.
cost: 3(M + N); the larger relation must be small enough (M < B^2)
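
The merge step can be sketched as follows (toy relations assumed; duplicate join keys within a relation are ignored for simplicity):

```python
# Sort-merge join on the first attribute of each tuple.
R = sorted([(3, "r3"), (1, "r1"), (2, "r2")])   # sort R
S = sorted([(2, "s1"), (4, "s2"), (3, "s3")])   # sort S

result, i, j = [], 0, 0
while i < len(R) and j < len(S):                # single merging scan
    if R[i][0] < S[j][0]:
        i += 1
    elif R[i][0] > S[j][0]:
        j += 1
    else:
        result.append((R[i][1], S[j][1]))
        i += 1
        j += 1
# result == [("r2", "s1"), ("r3", "s3")]
```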

Explain how the use of Strict 2PL would prevent interference between the two transactions.

Strict 2PL would require T2 to obtain an exclusive lock on Y before writing to it. This lock would have to be held until T2 committed or aborted; this would block T1 from reading Y until T2 was finished, so there would be no interference.

