11. Indexing Structures
Define "Blocking Factor" of both Index and Data Files.
(B/R) where B is the block size and R is the record size. Blocking Factor (BF) describes how many 'units' fit on a single drive block. The BF of an index file describes how many index entries will fit in a block. The BF of a data file describes how many records (rows) will fit on a block.
What are the two strategies discussed in the text for searching for multiple attributes in the same table (file) when one or both attributes are indexed? For example, searching for customers with a given last-name and zip-code. Hint: Three strategies are given in the book, but the first two are identical.
1. If only one of the two attributes is indexed, search for the records that match the indexed attribute value, and linearly search the identified records for the non-indexed attribute value. For example, if only zip-code is indexed, 1) search for the given zip-code and 2) linearly search the zip-code result set for records with the given last-name. 2. If both attributes are individually indexed (both zip-code and last-name have separate indexes), then perform two searches and return the intersection of both result sets.
What are the qualities of the data file attributes that can be used to build a Primary Index as described in the slides?
A Primary Index is defined on a unique ordering attribute of an ordered data file e.g. SSN in the example Employee table.
Describe the primary difference between a dense and sparse index.
A dense index maintains an index entry for every record in the data file. A sparse entry maintains an index entry for every anchor record (every block) in the data file.
Why must a secondary index be a dense index?
A secondary index is created against the data file's non-ordering attributes i.e. not the physical ordering of the data file blocks. So unlike a sparse index, a secondary index requires an index entry for every record in the data file. This is the definition of a dense index.
What type of select operation benefits most from a B+ tree? Briefly explain your answer.
A select operation that is identifies a range of consecutive indexed values. For example a BETWEEN operation identifying a range of records between two indexed attribute values e.g. a date. This is because the leaf nodes of the B+ tree maintain pointers to the next index block and moving from one block to the next in an ascending or descending order is an O(1) operation.
Describe the structure of a multi-level index.
A single-level index with additional layers that serve to index the index. Like the single-level index, the first level index entries are physically ordered according to the indexing attribute values of the data records being indexed. The multi-level index is structured as tree of disk blocks where each layer (level) in the tree indexes the complete range of data record index values. The first level (leaves of the index tree) reference the record address (block address) in the data file. The second level entries index the anchor records (first) entries of the first level index entry blocks. The third level entries index the anchor record entries of the second level index entry blocks. Etc.
Using a block size of 4096 (1012), repeat the calculation given in the slides "Efficiency of Indexes for Search Operations" to produce the number of block reads. • Record size R = 150 bytes • Block size B = 4096 bytes • Number of Data File Records r = 30000. • Index Entry Size RI = (VSSN+ PR) = (9+7) = 16 bytes. Provide the number of block reads for... 1. The number of block reads to search an un-ordered data file without an index. 2. The number of block reads to search using an ordered data file. 3. The number of block reads using a single-level index against an un-ordered data file.
Data File Blocking Factor Bfr = (B div R) = (4096 div 150) = 27 records/block. Number of data file blocks b = (r/Bfr) = (30000/27) = 1112 data file blocks. Index File Blocking Factor BfrI= (B div RI) = (4096 div 16) = 256 index entries / block. Number of Index File Blocks bl = (r/ BfrI) = (30000/256) = 117 index file blocks 1. The number of block reads to search an un-ordered data file without an index is 1112 / 2 if the record exists and 1112 if the record does not exist. 2. Binary Search of Ordered Data File Log2b = Log2 1112 = 11 block reads. 3. Search using a single level index against an unordered data file: Log2bI = Log2 117 + 1 = 8 block reads. Note the +1 read is to retrieve the data file block containing the record.
In terms of Blocking Factors, what is the advantage of using an index file to search for a specific record in a data file? Hint: The Blocking Factor of the index file vs. the Blocking Factor of the data file.
Increasing the performance of most database operations depends on decreasing the number of blocks that must be read or written from/to the disk. The blocking factor of an index file is much greater (better) than the BF of the data file because more index entries will fit on a block than records. When searching for a target record, we can locate the data file block containing the record with fewer block reads when we search with the index rather than retrieving and searching the data file records directly. That is, the I/O efficiency of the index file is much higher than the data file.
Can we apply a sparse index to an unordered data file? Explain your answer.
No, a sparse index relies on the fact that the indexed attribute value is ascending across data file blocks. Indexed attribute values on index block i are less than the values on index block i+1 (assuming ascending order). An unordered data file places records randomly across its blocks, and so each unique index value (data record) requires a pointer to the record's block i.e. a dense index.
Can a Primary Index as described in the slides, be applied to both ordered and unordered data files?
No. Because a Primary Index indexes the anchor records of an ordered data file's blocks, and not individual data records (i.e. it is sparse), a Primary Index can only be applied to ordered data files.
Can we have two Primary Index on separate attributes in the same data file? Explain your answer.
No. Primary Indexes require an ordered data file i.e. a data file that is physically ordered by its ordering-attribute(s). So we cannot have two indexes that rely on an ordering of different attributes on the same file i.e. we cannot have two physical orderings in a single file.
What is the relationship between the anchor record on block i and block i+1?
The Anchor Record (AR) index field value of block i should be < the Anchor Record value of block i+1.
How does a B-Tree differ from a B+-Tree?
The B-Tree maintains both node pointers, and <key-values & record (block) pointers> in the tree nodes. There is no difference between internal and leaf nodes. The B+-Tree maintains node pointers and key values in the internal nodes. Leaf nodes are reserved for <key, record-pointers> The B+-Tree maintains the leaf node blocks as a linked list so that indexed records can be transversed (visited) in index order in linear time.
What is the relationship between the anchor record and the other records on same block?
The anchor record's index field value should be ≤ all other field values of other records on the same data block (assuming ascending order).
Both an ordered data file and an index file can be search in Log2N data file bock reads. Why is it is more efficient to search for a value using the index?
The blocking factor of the index file is (usually) much greater than the data file and the index can be used to locate the target data file block in fewer disk block reads.
Why is a secondary index less efficient that a Primary Index on an ordered data file?
The index requires N-1 times more index entries than a primary index where N is the data file's blocking factor (number of data records that fit into a block.
What is the advantage of the sparse over the dense index?
The sparse index requires fewer index entries to cover every record in the file and so fewer disk block are needed to maintain a sparse index. During a search, fewer disk read operations are needed to identify the data file block containing the target record resulting in faster and more efficient searches.
Provide a short description of a data block's "Anchor Record".
This is the first record of a data block. The record contains the starting indexing field value of the sequence of records stored on the data file block.
In terms of anchor record values, what block might the record containing a searched-for field value VAL be found?
When searching for a specific index field value (VAL), the record will be located on block i where ARi ≤ VAL < ARi+1