Chapter 8. Storage and Indexing
Where does a DBMS store persistent data?
A DBMS stores vast quantities of data, and the data must persist across program executions. Therefore, data is stored on external storage devices such as disks and tapes
Positive side of using composite key index
A composite key index can support a broader range of queries because it matches more selection conditions. Further, since data entries in a composite index contain more information about the data record (i.e., more fields than a single-attribute index), the opportunities for index-only evaluation strategies are increased.
Primary keys and secondary keys
An index on a set of fields that includes the primary key is called a primary index; other indexes are called secondary indexes.
Negative side of using composite key index
Composite index must be updated in response to any operation (insert, delete, or update) that modifies any field in the search key. A composite index is also likely to be larger than a single-attribute search key index because the size of entries is larger. For a composite B+ tree index, this also means a potential increase in the number of levels, although key compression can be used to alleviate this problem (see Section 10.8.1).
How does a DBMS bring data into main memory for processing?
Data is fetched into main memory as needed for processing. Data is read into memory for processing, and written to disk for persistent storage, by a layer of software called the buffer manager. When the files and access methods layer (which we often refer to as just the file layer) needs to process a page, it asks the buffer manager to fetch the page, specifying the page's rid. The buffer manager fetches the page from disk if it is not already in memory.
Delete a record
Delete a record that is specified using its rid. We must identify the page that contains the record, fetch it from disk, modify it, and write it back. Depending on the file organization, we may have to fetch, modify, and write back other pages as well.
The cost of page I/O (input from disk to main memory and output from memory to disk)
Dominates the cost of typical database operations, and database systems are carefully optimized to minimize this cost.
Scan operation
Fetch all records in the file. The pages in the file must be fetched from disk into the buffer pool. There is also a CPU overhead per record for locating the record on the page (in the pool).
Search with Range Selection
Fetch all records that satisfy a range selection; for example, "Find all employee records with age greater than 35."
Search with Equality Selection
Fetch all records that satisfy an equality selection; for example, "Find the employee record for the employee with age 23 and sal 50." Pages that contain qualifying records must be fetched from disk, and qualifying records must be located within retrieved pages.
Heap File
Files containing an unsorted set of records that are uniquely identified by a record id which allows them to be inserted or deleted using that id.
Unique index
In general, a secondary index contains duplicates. If we know that no duplicates exist, that is, we know that the search key contains some candidate key, we call the index a unique index.
Insert a Record
Insert a given record into the file. We must identify the page in the file into which the new record must be inserted, fetch that page from disk, modify it to include the new record, and then write back the modified page. Depending on the file organization, we may have to fetch, modify, and write back other pages as well.
What is an index?
It is a performance tuning method of allowing faster retrieval of records from the table. An index creates an entry for each value and it allows us to efficiently retrieve all records that satisfy search conditions on the search key fields of the index. We can also create additional indexes on a given collection of data records, each with a different search key, to speed up search operations that are not efficiently supported by the file organization used to store the data records
Data entry
It refers to the records stored in an index file A data entry with search key value k, denoted as k*, contains enough information to locate (one or more) data records with search key value k. We can efficiently search an index to find the desired data entries, and then use these to obtain data records (if these are distinct from data entries).
Disk space manager
Space on disk is managed by the ___________ , according to the DBMS software architecture described in Section 1.8. When the files and access methods layer needs additional space to hold new records in a file, it asks the disk space manager to allocate an additional disk page for the file; it also informs the disk space manager when it no longer needs one of its disk pages. The disk space manager keeps track of the pages in use by the file layer; if a page is freed by the file layer, the space manager tracks this, and reuses the space if the file layer requests a new page later on.
Leaf-level
The bottom level of a clustered or nonclustered index, or the bottom level of a hierarchy. It contains the data entries.
Tree-based index
The data entries are arranged in sorted order by search key value, and a hierarchical search data structure is maintained that directs searches to the correct page of data entries. (See figure 8.3)
Composite Search Keys
The search key for an index can contain several fields; such keys are called _________________ or concatenated keys.
What DBMS component reads and writes data from main memory, and what is the unit of I/O?
The unit of information read from or written to disk is a page. The size of a page is a DBMS parameter, and typical values are 4KB or 8KB. The buffer manager fetches the page from disk if it is not already in memory.
Primary index
Two data entries are said to be duplicates if they have the same value for the search key field associated with the index. A primary index is guaranteed not to contain duplicates, but an index on other (collections of) fields can contain duplicates.
Hashing (Hash-based index)
We can organize records using a technique called _____________ to quickly find records that have a given search key value. In this approach, the records in a file are grouped in buckets, where a bucket consists of a primary page and, possibly, additional pages linked in a chain. The bucket to which a record belongs can be determined by applying a special function, called a hash function, to the search key. Given a bucket number, a hash-based index structure allows us to retrieve the primary page for the bucket in one or two disk l/Os. (See figure 8.2)
Clustered index
When a file is organized so that the ordering of data records is the same as or close to the ordering of data entries in some index, we say that the index is clustered