4707 Final Exam
Which primary file structure covered in class can support range search very well and why? Which secondary file structure covered in class can support range search very well and why?
Primary: A B-tree can support range search very well because it is a balanced tree structure that maintains a well-defined and uniform distribution of keys throughout the tree. This allows for efficient search of keys within a given range, as the tree can be traversed in a predictable and efficient manner to find the keys within the specified range. Secondary: A B+ tree can support range search very well because it stores the actual data in the leaf nodes and uses a series of pointers to connect the leaf nodes to one another. This allows for efficient traversal of the tree in any order, including traversal within a given range of keys. Additionally, the use of pointers in a B+ tree allows for fast access to the data in the tree, which can further improve the performance of range search operations.
Briefly explain how RAID can benefit physical database design.
RAID (Redundant Array of Independent Disks) is a technology that allows multiple disks to be combined into a single logical storage unit. Benefits: Providing improved performance and reliability for the storage of the database files by combining multiple disks into a single logical unit that can be accessed as a single entity by the database server as well as provide improved reliability for the storage of the database files by using redundancy, where extra copies of the data are stored on multiple disks.
Give two example considerations in information lifecycle management.
Retention policies: Information lifecycle management involves deciding how long to retain different types of data and when to delete or archive it. This can be based on factors such as legal or regulatory requirements, the value of the data to the organization, and the cost of storing the data. Data migration: As data ages, it may need to be migrated from one storage medium to another. For example, data that is accessed frequently might be stored on high-performance storage devices, while data that is accessed infrequently might be moved to slower and less expensive storage.
Write SQL: Give an example of an ambiguous query using the University Database.
SELECT * FROM Table1 LEFT JOIN Table2 ON Table1.ID = Table2.ID RIGHT JOIN Table3 ON Table1.ID = Table3.ID In this query, Table1 is a non-preserved table because it is involved in a one-sided outer join (the LEFT JOIN). The result of this query may depend on the order in which the joins are listed in the FROM clause, because the non-preserved table (Table1) is involved in another join (the RIGHT JOIN).
Briefly explain how "filter and refine" works in general to speed up spatial queries
"Filter and refine" is a technique used to speed up spatial queries in databases. It works by first applying a coarse filter to the spatial data to eliminate irrelevant data, and then applying a more detailed filter to refine the results.
1.) give an example of collision using an LR with key 11 and 2.) how linear probing solves the collision problem.
1.) A collision in a hash file occurs when two or more data values are assigned to the same location in the primary storage. For example, suppose we have a hash file with 5 PRs and a hash function of key mod 5. If we try to store two data values with keys of 11 and 16 in the hash file, the hash function would return 11 mod 5 = 1 and 16 mod 5 = 1 for both data values, indicating that they should be stored in PR 1. This would cause a collision, because there is only space for one LR in PR 1. 2.) Linear probing is a technique for resolving collisions in a hash file. When a collision occurs, the database searches for the next available location in the primary storage and stores the data value there. For example, if a collision occurs in PR 1, the database would search for the next available location, which might be PR 2. If PR 2 is available, the database would store the data value in PR 2. If PR 2 is not available, it would search for the next available location, and so on.
Give an example of a ColumnStore. Briefly explain the benefit of a ColumnStore.
1.) A columnstore is a type of database storage that organizes data into columns, rather than rows as in traditional row-oriented storage. EX.) Name | Jane | John | Bob | Sally Age | 30 | 25 | 28 | 29 Gender | Female | Male | Male | Female 2.) The benefit of a columnstore is that it allows for more efficient storage and access to data, especially for columns that are frequently used together. This can significantly improve the performance of queries that involve a large number of columns, as it reduces the amount of data that needs to be read from storage and processed in memory. Additionally, columnstores can be compressed more effectively than row-oriented storage, which can further reduce the storage footprint of the data and improve the performance of queries.
Briefly explain the non-clustering index and clustering index and their usage.
1.) A non-clustering index is a type of index that provides an alternative sorting order for the rows in a table, without changing the physical order of the rows on disk. Non-clustering indexes can be used to improve the performance of queries that filter or sort the data in a table by the indexed columns. 2.) A clustering index is a type of index that physically reorders the rows in a table based on the indexed columns. This can improve the performance of queries that access large ranges of data in the table, as the rows can be read from disk in the same order as the indexed columns.
give a scenario when PCTFREE is large and a scenario when PCTFREE is small. You should also explain the PCTUSED parameter in these two scenarios and justify your choice.
1.) A scenario when PCTFREE is large might be when a database table is expected to have a high rate of insert and update operations, and there is a need to prevent database fragmentation In this scenario, it would also be appropriate to set PCTUSED to a relatively low value, as this would allow blocks to be considered for reuse only when they are almost completely empty. 2.) A scenario when PCTFREE is small might be when a database table is expected to have a low rate of insert and update operations, and there is a need to maximize the use of storage space and minimize the amount of unused space in the table. In this scenario when PCTFREE is small, it would be appropriate to set PCTUSED to a relatively high value to allow blocks to be considered for reuse only when they are almost completely full.
Briefly explain why 1.) application profiles are important for physical database design. Also, 2.) statistics about query frequencies and the distribution of parameter values are important. 3.) Give an example query and use the example to explain query frequencies and the distribution of parameter values.
1.) Application profiles are important for physical database design because they provide information about how the database will be used in practice. This information can help database designers make decisions about how to organize and index the data, how to allocate resources, and how to optimize query performance. 2.) Statistics about query frequencies and the distribution of parameter values are important because they can help database designers identify which queries are used most frequently and which parameter values are most common. This information can be used to optimize the database design and improve query performance. 3.) Suppose we have a database that contains customer information for an online retail store. The database receives a lot of queries that search for customer information by customer name, and many of these queries specify the first and last name of the customer. In this case, the query frequency and the distribution of parameter values for the customer name search would be important for physical database design. The database designer could use this information to create an index on the customer name column, which would improve the performance of the customer name search query.
Give and explain two example selection rules for a non-clustering index.
1.) Choose columns that are frequently used in filter or sort operations in queries. For example, if a table contains customer data and the queries that are run against it often include conditions on the customer's name or address, then these columns could be good candidates for a non-clustering index. 2.) Choose columns that have a low degree of cardinality, or uniqueness, in the data. This means that the values in the indexed columns are not distributed evenly and there are many repeated values.
Give and explain two example selection rules for a clustering index.
1.) Choose columns that are frequently used in range queries and other types of queries that access large ranges of data in the table. For example, if a table contains sales data and the queries that are run against it often include a range of dates or a range of customer IDs, then the date or customer ID columns could be good candidates for a clustering index. 2.) Choose columns that have a high degree of cardinality, or uniqueness, in the data. This means that the values in the indexed columns are distributed evenly and there are relatively few repeated values.
Give an example of a Bitmap index. Briefly explain the type of columns that would benefit from a Bitmap index.
1.) Consider a database table with a column "Gender" that contains the values "Male" and "Female". We can create a bitmap index for this column by creating a bitmap with two bits, one for "Male" and one for "Female". 2.) Bitmap indexes are most useful for columns that have a low cardinality (i.e., a small number of distinct values) and a high proportion of NULL values. In such cases, bitmap indexes can provide a compact and efficient way to represent the values in the column and support fast access to the data.
Give an example of a Bitmap join index. Briefly explain the benefit of a Bitmap join index.
1.) Consider two tables "Employees" and "Departments" in a database, where the Employees table contains a column "DepartmentId" that references the departments in the Departments table. We can create a bitmap join index for these two tables by creating a bitmap index for the "DepartmentId" column in the Employees table and a bitmap index for the "Id" column in the Departments table. 2.) The benefit of a bitmap join index is that it allows for efficient and fast evaluation of join conditions, even for large tables with a large number of rows. This can significantly improve the performance of queries that involve joins, as it reduces the amount of data that needs to be scanned and compared to evaluate the join condition. Additionally, bitmap join indexes can be used in conjunction with other indexing techniques, such as B-trees or hash tables, to further improve the performance of join operations.
Give a denormalization example for a generalization hierarchy. Explain the benefits and risks with the example.
1.) Denormalize the tables: "Car", "Boat", "Airplane" into a singular "Vehicle". 2.) Benefits: Improved query performance, reduced complexity of the database design. Risks: Increase in amount of redundant data in the database, changes to one type of vehicle may need to be propagated to multiple rows in the denormalized table.
Briefly explain a Function index and give an example.
A function index is a type of index in a database that is based on the result of a function or expression applied to the values in a given column. It allows for efficient search and access to data based on the values of the expression, rather than the actual values in the column. Consider a database table with a column "DateOfBirth" that contains the date of birth of each person in the table. We can create a function index for this column by defining a function that extracts the year from the date of birth, and then creating an index based on the result of the function applied to the values in the column.
briefly explain a better solution for estimating the number of rows in some query results when data are skewed.
An equal height histogram is a better solution for estimating the number of rows in some query results when data are skewed because it takes into account the distribution of the data. In an equal height histogram, the bins are sized so that each bin contains the same number of data points, regardless of the actual values of the data. This means that the bins will be larger for the values that are more spread out, and smaller for the values that are closer together. This allows the histogram to more accurately represent the distribution of the data, making it easier to estimate the number of rows in each bin and the number of rows in the entire dataset.
briefly explain application buffers and DBMS buffers.
Application buffers and DBMS buffers are both types of memory storage used in database management systems (DBMS). 1.) Application buffers are used to store data that is being transferred between the DBMS and the application that is using the database. 2.) DBMS buffers, on the other hand, are used by the DBMS itself to store data that is being accessed or modified by the database.
Briefly explain "hints" in writing SQLs for improving the optimization results.
In SQL, a "hint" is a directive or suggestion that is provided to the query optimizer to influence the way it generates an execution plan for a query. Hints can be used to specify a particular index or join algorithm to use, to force a specific execution order for the query, or to control other aspects of the query execution plan.
briefly explain the difference between the hints "all" and "first rows".
In general, the ALL hint is best suited for queries that are expected to return a large number of rows, while the FIRST ROWS hint is best suited for queries that are used for interactive applications or other scenarios where only a small number of results is needed.
Dynamic Hash Files: Show how inserting key 9 with hash address 10001 into the below structure would work with dynamic hash files. Note that the capacity of each data bucket (PR) is 2 LRs.
Increase ending of hash address by one bit and then organize based on that
Briefly describe spatial data vector and raster models in spatial databases with examples of a point, a line, and a polygon.
Points: specific location, like a church Lines: distinctive start and end points with locations along the route, like a road. Polygon: a region bounded by several points. Ie each corner of a square is a point and the region within is the polygon. Government districts. Raster model: Manages continuous and amorphous models, ie snowfall or wetlands. Raster is grid based. A raster line is filled in cubes on a grid connected to eachother.
Briefly explain each box below in the query optimization process (see #7 in discord document)
Syntax analysis: the process of checking the program's source code for grammatical correctness, ensuring that it follows the rules of the programming language's syntax. Semantic analysis: The process of checking the program's source code for meaning and correctness, ensuring that it performs the intended operations and produces the expected results. Query Transformation: The process of optimizing a database query by transforming its original form into an equivalent form that can be executed more efficiently. Access Plan Evaluation: The process of choosing the most efficient execution plan for a database query by comparing the relative costs and benefits of different access plans. Access Plan Interpretation: The process of analyzing each step in the plan to understand how the DBMS will retrieve the requested data. This can help identify potential bottlenecks or inefficiencies in the plan. Code Generation: The process of converting a high-level query into a form that can be executed by the database management system.
Briefly explain how PCTFREE and PCTUSED parameters are used and how physical database design benefits from these two parameters.
The PCTFREE and PCTUSED parameters are used in database design to control the allocation and reuse of space in database tables and indexes. PCTFREE specifies the minimum amount of free space that should be left in each database block, and PCTUSED specifies the minimum amount of used space that should be present in each block before it is considered for reuse. The benefit of using PCTFREE and PCTUSED in physical database design is that they allow for more efficient use of storage space and better performance of database operations. By controlling the allocation and reuse of space in database blocks, these parameters can help to prevent wasteful use of space, reduce the number of disk accesses required for database operations, and improve the performance of insert, delete, and update operations.
give an example of when columns are related, using only the distribution of values in the columns cannot accurately estimate the number of rows in some query results.
The age and income of individuals are often related, in that individuals tend to earn more as they get older and gain more experience in the workforce. However, the distribution of values in the age and income columns may not accurately reflect this relationship. The age column may have a uniform distribution of values, while the income column may have a highly skewed distribution with a few very high earners and many low earners.
Briefly explain the goal of physical database design.
The goal of physical database design is to create an efficient and effective way of storing and organizing data in a database. This involves designing the physical structure of the database, including the data structures and files used to store the data, as well as the access methods and indexing techniques used to retrieve and manipulate the data.
briefly explain how weight is set in the equation. Please also explain how one would set the weight differently for a DBMS using only SSD and a DBMS using only hard drives (disks).
The weight is usually close to 0 to reflect that many CPU operations can be performed in the time to perform one physical record transfer with a hard disk. For solid state devices, the weight is larger to reflect much faster electronic switching to read data.
Equal-width histograms do not work well with skewed data. Give an example of why equal-width histograms cannot accurately estimate the number of rows in some query results.
They do not take into account the distribution of the data. For example, suppose we have a dataset with two groups of values: a small group of values that are close together, and a large group of values that are more spread out. If we use an equal-width histogram to visualize this data, the bins will all be the same size, regardless of the actual distribution of the data. This means that the bin containing the small group of values will be much taller than the other bins, while the bin containing the large group of values will be much shorter. This makes it difficult to accurately estimate the number of rows in each bin, and therefore the number of rows in the entire dataset.
Give three example constraints in physical data design and briefly explain how each impacts the goal you described.
1.) Hardware constraints: The available hardware resources, such as the amount of memory and storage space, can impact the physical design of the database. For example, if the database needs to store a large amount of data but the available hardware has limited storage capacity, the physical design of the database will need to take this into account by using data compression or other techniques to reduce the amount of space required. 2.) Performance requirements: The performance requirements of the application using the database can also impact the physical design of the database. For example, if the application requires fast access to the data, the physical design of the database may need to prioritize indexing and other access methods that can improve the speed of data retrieval. 3.) Data integrity: Ensuring the integrity of the data stored in the database is another important constraint in physical database design. This can involve implementing constraints such as primary and foreign keys, as well as using data redundancy and backup strategies to protect against data loss. The physical design of the database will need to take these constraints into account to ensure that the data is stored and accessed in a way that maintains its integrity.
Physical database design requires selecting file structures for primary and secondary storage. 1.) Give an example of a primary storage choice and 2.) a secondary storage choice and 3.) briefly explain how they work together (i.e., how searching the secondary storage can find the LRs in the primary storage).
1.) One example of a primary storage choice is a hash index, which is a type of index that uses a hash function to map data values to specific locations in the primary storage. This allows the database to quickly look up data values by their hash values, which can improve query performance. 2.) One example of a secondary storage choice is a b-tree index, which is a type of index that organizes data values in a hierarchical tree structure. This allows the database to quickly search for data values by traversing the tree, starting from the root node and following the branches until it reaches the leaf nodes that contain the data values. 3.) When these two storage choices are used together, the database can use the hash index to quickly locate the LRs (leaf records) in the primary storage that contain the data values that are being searched for. It can then use the b-tree index to navigate to the specific LRs that contain the data values, which allows it to retrieve the data values from the primary storage and return them to the user. This two-stage approach allows the database to efficiently search for data values in both the primary and secondary storage, which can improve query performance.
The combined measure of database performance is an equation that includes two components and a weight (a real number). Briefly explain the two components.
1.) PRA is the number of physical record accesses 2.) CPU-OP is the number of CPU operations such as comparisons and assignments.
A DBMS can store logical records (LRs) in physical records (PR) in various ways, including a) multiple LRs per PR, b) one LR split across multiple PRs, and c) one PR containing LRs from different tables. Briefly discuss how option c) could help reduce the number of physical record accesses during queries.
1.) Reduced # of PR accesses since LR data can be retrieved from single PR access 2.) More data able to be stored in less PR's
Briefly explain how query optimization benefits from table profiles and application profiles.
1.) Table profiles include information such as the number of rows in the table, the distribution of values in the columns, and the size and usage patterns of indexes. The query optimizer can use this information to generate more accurate estimates of the cost and performance of different execution plans for a query, and to choose the plan that is most likely to be efficient and effective. 2.) Application profiles include information such as the types and frequency of queries that are run, the data access patterns of the application, and the workload on the database server. The query optimizer can use this information to generate execution plans that are tailored to the specific needs of the application, and to adapt to changes in the workload over time.
Briefly explain how the record format, including compression and derived data, can benefit physical database design and the trade-offs.
1.) The record format, including the use of compression and derived data, can benefit physical database design by reducing the amount of space that is required to store the data in the database. This can improve the performance of the database, as it can reduce the amount of disk I/O that is required to access the data, and can also reduce the amount of memory and other resources that are needed to store and process the data. 2.) The trade-offs associated with using compression and derived data in the record format include the increased complexity of the database design and the potential for reduced data integrity.
Briefly explain the benefit of B+trees compared to Btrees.
3 benefits: 1.) B+ trees store the actual data in the leaf nodes, rather than in the non-leaf nodes as in B-trees. This allows for more efficient access to the data, as all the data can be accessed by traversing the leaf nodes in sequence. 2.) B+ trees use a series of pointers to connect the leaf nodes to one another, allowing for efficient traversal of the tree in any order. 3.) B+ trees allow for more flexible and dynamic tree structures, as the non-leaf nodes can have any number of keys (as long as they have at least two keys)
Explain the following Btrees properties: a) balanced, b) bushy, c) block-oriented, and d) dynamic
Balanced: A B-tree is considered balanced if all its leaf nodes are at the same depth. This ensures that the tree has a uniform and well-defined structure, which helps to maintain good performance even as the number of items in the tree increases. Bushy: A B-tree is considered bushy if it has a large number of children per node. This allows the tree to store a large number of items in a compact and efficient manner, as it reduces the number of levels in the tree and thus reduces the number of disk accesses required to find a particular item. Block-oriented: A B-tree is considered block-oriented if it is designed to store its data in blocks on disk, rather than in memory. This allows the tree to handle large amounts of data by only loading the necessary blocks into memory as needed, which reduces the memory footprint of the tree and allows it to efficiently handle data that is too large to fit entirely in memory. Dynamic: A B-tree is considered dynamic if it can resize itself automatically as the number of items in the tree changes. This allows the tree to maintain good performance even as the number of items increases or decreases, as it can adjust its structure to accommodate the changing data.
Give examples of direction, distance, and topology queries (in plain English, not SQLs).
Direction query: What is the nearest restaurant north of my current location? Distance query: What are the restaurants within a 1-mile radius Topology query: How many restaurants within the Los Angeles city boundary?
Transferring physical records between the application buffers, DBMS buffers, and the operating system is a routine task. Draw the diagram showing how LRs and PRs are organized in the application and DBMS buffers and the operating system's storage units (e.g., disk). You need to also indicate the data flow (i.e., read and write).
Look at picture #1 in file loaded in Discord
Briefly explain how Hash files can be used as the primary storage using the following constrain a) Hash function = key mod 5, b) 5 PRs, c) LRs with key = 1, 2, 3, 4, 6, d) each PR can store two LRs. You need to show 5 PRs and the LRs stored in each PR. You need to show how you assign an LR to a PR.
Look at picture #2 in file loaded in Discord
now you have inserted the LR with key 11 using linear probing. Briefly explain how many PR reads are needed to search for the LR with key 11. Also, Briefly explain how many PR reads are needed to search for the LR with key 5 (which is not in any PRs).
Look at picture #3 in file loaded in Discord
Construct a Btree with the following keys: 55, 1, 2, 5, 7. Each PR contains at most four keys.
Look at picture #4 in file loaded in Discord
delete key 7 from the Btree.
Look at picture #5 in file loaded in Discord
Using your Btree in Q20 to draw a B+tree
Look at picture #6 in file loaded in Discord
Briefly explain map projections.
Map projections are used to represent spatial data on a map. Spatial databases typically store spatial data in a coordinate system, such as latitude and longitude, and a map projection is used to transform these coordinates into a two-dimensional representation on a map.
Table profile: briefly explain why a) the number of unique values, b) the distribution of values, and c) correlations of columns are important for physical database design.
a.) If a column has a small number of unique values, the database can use a smaller data type to store the values, which can save space and improve performance. Additionally, if a column has a large number of unique values, the database may need to use a larger data type or use additional indexing or data structure techniques to support efficient access to the data. b.) If the values in a column are heavily skewed towards a few specific values, the database may need to use data structures or indexing techniques that are optimized for this type of distribution. Additionally, if the values in a column are evenly distributed, the database may be able to use simpler data structures and access methods, which can improve performance and save space. c.) if two columns are highly correlated, the database may be able to use this information to improve data compression and reduce the amount of space required to store the data. Additionally, if two columns are not correlated, the database may be able to store the data from each column in separate physical records, which can improve the performance of queries that only need to access one of the columns.
Briefly compare the steps of PR read for the file structures of a) unordered, b) ordered, and c) hash for inserting a new record, key search, and range search. Draw examples LRs and PRs to explain your answer.
a.) In an unordered file structure, inserting a new record would typically involve simply adding the new record to the end of the file. To perform a key search, the file would need to be scanned sequentially to find the record with the desired key. To perform a range search, the file would need to be scanned sequentially and any records within the specified range would be returned. EX.) [(K1, V1), (K2, V2), (K3, V3), ..., (K, V)] b.) In an ordered file structure, inserting a new record would involve finding the correct position in the file to maintain the order, and then inserting the record at that position. To perform a key search, a binary search algorithm could be used to quickly find the record with the desired key. To perform a range search, a search algorithm could be used to find the first and last records within the specified range, and then all records between those two records could be returned. EX.) [(K1, V1), (K2, V2), (K3, V3), ..., (K, V), ...] c.) In a hash-based file structure, inserting a new record would involve computing the hash value for the key and using that value to determine the bucket in which the record should be stored. To perform a key search, the hash value for the key would be computed and used to find the bucket containing the record with the desired key. To perform a range search, all the records in all the buckets would need to be scanned to find any records within the specified range. EX.) Bucket 1: [(K1, V1), (K2, V2), ...] Bucket 2: [(K3, V3), (K4, V4), ...] ... Bucket N: [(K, V), ...]