Indexing and Joins
What's the min and max number of pointers that a B+ tree internal node can have?
A B+ tree internal node must have at minimum ceiling(n/2) pointers and at most n pointers.
What's the min and max number of key-values that a B+ tree leaf node can contain?
A B+ tree node has n pointers and n-1 key-values. At minimum, a leaf must have ceiling((n-1)/2) key-values and at maximum it can have n-1 key-values. If there are too few key-values, we've underflowed the node and the tree ends up being taller than it should be, which causes poor performance. If there are too many key-values, we've overflowed the node and don't have space to add new keys.
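For concreteness, here is a tiny Python sketch (following the same convention as above, where n is the maximum number of pointers per node) that computes the occupancy bounds for a given fanout:

```python
import math

def bplus_tree_bounds(n):
    """Occupancy bounds for a B+ tree where n is the max number of pointers per node."""
    return {
        "internal_pointers": (math.ceil(n / 2), n),            # internal nodes
        "leaf_values":       (math.ceil((n - 1) / 2), n - 1),  # leaf nodes
    }

# Example: fanout n = 4 -> internal nodes hold 2-4 pointers, leaves hold 2-3 values
print(bplus_tree_bounds(4))
```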
InnoDB vs. MyISAM
Both are storage managers. InnoDB supports foreign keys and referential integrity. MyISAM is more for dumping data and doesn't really have a concept of referential integrity.
What data types can B+ trees be used for?
Data types that can be compared, such as numbers
Should we use a hash join for an equi or nonequi join?
Equi-join since hash indices are optimized for random access, which works well with equality queries.
Why do key-value stores such as Tokyo Cabinet use B+ trees when they're supposed to be in RAM?
Having a B+ tree can help optimize for range queries on a particular key
MyISAM vs. InnoDB
MyISAM is how the ISAM indexing method is implemented in MySQL. It's good for a use case where you just need to dump data without referential integrity. If you need support for foreign keys, referential integrity, transactions, etc., then you need InnoDB, which uses a B+ tree.
What type of joins can merge joins be used for?
Natural joins and equijoins
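As a rough illustration of how a merge join pairs up equal keys on two already-sorted inputs, here is a minimal Python sketch; the relations and attribute names are made up:

```python
def merge_join(left, right, key):
    """Merge join: both inputs must already be sorted on `key`.
    Yields merged rows for every pair with equal join keys."""
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Collect the run of equal keys on the right, then pair it
            # with every left row carrying the same key.
            j_start = j
            while j < len(right) and right[j][key] == lk:
                j += 1
            while i < len(left) and left[i][key] == lk:
                for r in right[j_start:j]:
                    yield {**left[i], **r}
                i += 1

# Hypothetical relations, sorted on "id"
students = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bo"}]
grades   = [{"id": 1, "grade": "A"}, {"id": 2, "grade": "B"}]
print(list(merge_join(students, grades, "id")))
```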
Should we use a B+ tree index for an equi or nonequi join?
Nonequi-join since B+ trees are optimized for sequential access, which works well with range queries.
How many block transfers would be required if both the index and the data file is on RAM?
Only the number of block transfers initially used to load in the index and data file. After the initial loading, the rest is trivial since we are doing everything in RAM rather than on disk.
How would we do a shuffle-based join if we're doing an outer join?
The inner part of the join would be computed as normal, but the outer part of the join would be quite slow. We'd fix one relation and then shuffle around the rows of the other relation so that the fixed relation can be compared with every tuple in the other relation.
When are hash joins performant?
They are very performant if all or most of the blocks of R fit in the buffer. If very few blocks of R fit in the buffer, hash joins are pretty non-performant. We want to minimize the number of times we have to build a hash table and the number of times we have to do a full table scan of S.
What's the problem with the ISAM index?
Very slow and requires constant rebuilding of the file. That's because over time, we may get a lot of overflow blocks. Having a bunch of overflow blocks causes the access pattern to become more like random access than sequential, which is not what we want. The index would become very bloated and inefficient. We'd have to keep rebuilding the data files to deal with this issue, which is computationally expensive. Using a B+ tree instead would prevent us from having to keep rebuilding the file.
What is a shuffle-based join?
We have to use a shuffle-based join if one relation doesn't fit entirely in RAM. We have to shuffle rows around to the nodes, similarly to the shuffle phase in Hadoop MapReduce. All records that have the same value for the join key get assigned to the same node. For example, if a record s from S and a record r from R both have the same value K on the join key, those records are shuffled to the same machine. Each machine locally computes a theta join, and at the end the results are combined together. When we're doing an inner join, this is a very simple, embarrassingly parallel operation.
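A toy single-machine sketch of the shuffle idea, where the "nodes" are just in-memory buckets and rows are routed by hashing the join key (the function and relation names are hypothetical):

```python
from collections import defaultdict

def shuffle_join(R, S, key, num_nodes=4):
    """Route rows of R and S to 'nodes' by hash of the join key, then
    join locally on each node; an inner equijoin, run in one process
    purely for illustration."""
    nodes = defaultdict(lambda: ([], []))  # node_id -> (rows of R, rows of S)
    for r in R:
        nodes[hash(r[key]) % num_nodes][0].append(r)
    for s in S:
        nodes[hash(s[key]) % num_nodes][1].append(s)

    results = []
    for r_part, s_part in nodes.values():
        # Local join on each "node" (here a simple hash join)
        table = defaultdict(list)
        for r in r_part:
            table[r[key]].append(r)
        for s in s_part:
            for r in table.get(s[key], []):
                results.append({**r, **s})
    return results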
Steps for doing a hash join
Whichever relation is chosen as the build side (let's say R) is used as follows:
1. The system reads in as many blocks of R as will fit into the buffer and creates a hash table. The key of the hash table is the join key; the value is the rest of the attributes in that record. So you basically have a reduced set of rows stored in a hash table.
2. Then we perform a full table scan over the probe side S. For each tuple in S, if its join key is contained in the build side's hash table, output the match.
3. Once we're finished scanning S, we clear the hash table/build side and start again from step 1 until all blocks of R are exhausted.
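A minimal Python sketch of these steps, with the buffer capacity expressed as a row count purely for illustration:

```python
from collections import defaultdict

def block_hash_join(R, S, key, buffer_rows=1000):
    """Repeatedly build a hash table on as much of R (the build side) as
    fits in the 'buffer', then scan all of S (the probe side) for matches."""
    results = []
    for start in range(0, len(R), buffer_rows):
        # Step 1: build a hash table on the chunk of R that fits in the buffer.
        build = defaultdict(list)
        for r in R[start:start + buffer_rows]:
            build[r[key]].append(r)
        # Step 2: full table scan of the probe side S against this hash table.
        for s in S:
            for r in build.get(s[key], []):
                results.append({**r, **s})
        # Step 3: the hash table is discarded and the next chunk of R is loaded.
    return results
```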
What is the time it takes in terms of disk IO to find a key in a B+ tree?
(h + 1)(tS + tT), where h is the height of the B+ tree, tS is the seek time, and tT is the time to transfer one block.
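A back-of-the-envelope example of the formula; the fanout and disk timings below are assumed values, not measurements:

```python
import math

def bplus_lookup_cost_ms(num_keys, fanout, t_seek_ms=4.0, t_transfer_ms=0.1):
    """Estimated worst-case lookup cost (h + 1) * (tS + tT).
    Assumes the worst case where every node is only half full."""
    h = math.ceil(math.log(num_keys, math.ceil(fanout / 2)))  # worst-case height
    return (h + 1) * (t_seek_ms + t_transfer_ms)

# e.g. 10 million keys with fanout 100 -> height 5, roughly 6 * 4.1 ms of disk I/O
print(bplus_lookup_cost_ms(10_000_000, 100))
```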
What is a broadcast-based join?
- If one relation R is small enough to fit into RAM on every executor in the cluster as well as the driver, that relation is broadcast to every executor in the cluster. It's basically a distributed cache. Each executor has its own little piece Si of the bigger relation S and compares its piece of S with R. At the end, the results from all the executors are combined together.
- Spark will automatically use a broadcast-based join if the size of the smaller table is less than 10 MB by default, but you can set the threshold manually.
- In the case of a broadcast-based hash join, each executor uses its piece of the bigger relation S as the build side on which the hash table is built.
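A toy sketch of the broadcast idea as described above (build the hash table on each executor's slice of S and probe with the broadcast copy of R); the executor slices and names are made up, and everything runs in one process for illustration:

```python
from collections import defaultdict

def broadcast_hash_join(small_R, S_partitions, key):
    """small_R plays the broadcast relation; S_partitions is a list of slices
    of the big relation S, one per 'executor'."""
    results = []
    for S_i in S_partitions:          # one iteration per executor
        build = defaultdict(list)
        for s in S_i:                 # build side: this executor's slice of S
            build[s[key]].append(s)
        for r in small_R:             # probe side: the broadcast copy of R
            for s in build.get(r[key], []):
                results.append({**r, **s})
    return results
```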
Restrictions on broadcast-based hash joins
- Only supports equijoins, because that's how a hash table works.
- Doesn't support full outer joins, because each executor only has a small piece of S; it would be extremely inefficient to shuffle around the rows of S to generate all possible mismatches.
Although secondary indices are less efficient than primary ones, what are cases in which we may want to use one?
1. If we have a lot of non-unique key-values, a secondary index may be better. 2. If we're indexing on something other than the primary key, we may want to use a secondary index. Although we're allowed to use any kind of key we want in either index, it may make sense to have a primary index on the primary key and a secondary index on some other type of key, e.g. a foreign key.
What are the steps of query processing?
1. Parsing and translation. The SQL is translated into an internal form, similar to the parser in a compiler; the internal form is usually based on relational algebra. The parser also checks syntax and whether the relations used actually exist. Is it syntactically correct? Are you using tables that actually exist? That's the first thing that happens and it's easy to handle.
2. Optimization. The internal expression is rewritten into tons and tons of equivalent expressions, and the optimizer picks the most efficient query plan, the one with the least cost or fastest performance (depending on what the cost function outputs). There are a variety of criteria we can use, such as: block transfers and seeks (disk I/O), the number of tuples/rows in the table, the current CPU and RAM usage, data transfer speed, disk space, network transfer (big in a distributed system), and time.
3. Evaluation. The query-evaluation engine executes the chosen plan and returns the result.
What are the complexity considerations for a B+ tree?
1. The time complexity for insertion, deletion, and traversal is O(log_⌈n/2⌉ k), where k is the number of keys in the index and n is the fanout (maximum number of children). 2. There is space overhead for storing the nodes, and key values are duplicated because the internal nodes act only as routers. 3. There may be wasted space when a node is only half full. But these costs are definitely worth it, because the time complexity of a B+ tree index pales in comparison to the time it takes to keep rebuilding an ordered ISAM index with overflow blocks.
What's the minimum number of entries a root node must have?
At least 1. Unlike other nodes, the root is not required to be half full: as an internal node it needs only 2 pointers (1 key). You'll typically have 1 entry to start with and then add more as you insert values into the B+ tree.
What are the results of query processing?
1. Expressions that can be used at the physical level of the RDBMS (parsing and translation) 2. A series of query-optimizing transformations (optimization) 3. An actual evaluation of the query (evaluation)
What are the advantages and disadvantages of using a primary index?
Advantages:
- Minimizes block transfers
- Can be used for range queries, not just point queries
Disadvantages:
- As the index grows, there can be too many overflow blocks, which can cause access to become random rather than sequential. It is difficult to maintain contiguity on disk, so the primary index has to be rebuilt to get good performance.
Advantages and disadvantages of shuffle hash join
Advantages:
- Doesn't require either table to fit in RAM. This is great for big data.
- Doesn't require keys to be sortable; a merge join requires sortable keys.
- Isn't susceptible to key skew. A merge join is susceptible to key skew because it involves sorting on the join keys.
Disadvantages:
- Building the hash table uses additional RAM and processing time.
- Only works for equijoins and doesn't support full outer joins.
- Shuffling is very expensive, far more expensive than broadcasting.
What data types can a hash index be used for?
Anything that's hashable/serializable
What is a broadcast-based nested loop join?
Basically the nested loop join where one of the relations fits entirely in RAM. That relation is broadcast to all the executor nodes and they compare it against their little piece of S. This is typically the least performant, aside from the cross join, but it's used as a fallback and it has no restriction on the join condition.
How can a join be done using a Bloom Filter?
If not all the blocks of R can fit in the RAM buffer, then instead of the sliding hash table approach, we could create a Bloom filter. We store R's join keys in the Bloom filter. Then we do a full table scan of S, typically just one, and check whether each key is in the Bloom filter. If it's not, we throw out that record and move on. If the key does seem to be in the Bloom filter, there could be a false positive: either we really should join the two records together or we have a collision. One way to handle this is to join the records together even if there might be a collision, and then at the end do a final scan over the result and check whether the actual key values are equal; if it was a collision/false positive, throw the row out of the result. This is a common way of doing a join in a distributed framework, but you can use it on one node too. Oracle uses Bloom filters to do hash joins.
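A single-node Python sketch of the approach, with a deliberately tiny, hand-rolled Bloom filter so the example stays self-contained (real systems use larger bit arrays and better hash functions):

```python
from collections import defaultdict

class TinyBloom:
    """A toy Bloom filter built on Python's hash(): may return false
    positives, never false negatives."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        return [hash((i, key)) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))

def bloom_filter_join(R, S, key):
    """Build a Bloom filter on R's join keys, use it to discard most
    non-matching rows of S in one scan, then verify survivors exactly."""
    bloom = TinyBloom()
    for r in R:
        bloom.add(r[key])

    # One pass over S: keep only rows whose key might be in R.
    candidates = [s for s in S if bloom.might_contain(s[key])]

    # Final exact check to throw out Bloom-filter false positives.
    r_by_key = defaultdict(list)
    for r in R:
        r_by_key[r[key]].append(r)
    return [{**r, **s} for s in candidates for r in r_by_key.get(s[key], [])]
```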
How does the query optimizer decide whether to do a selection or projection first?
It considers the ratio of rows to columns. If there are a lot more rows than columns, it makes sense to filter out rows via selection first and then project out the columns. Whereas if there are a lot more columns than rows, it makes sense to filter out columns via projection first and then filter the rows.
Pros of the nested loop join?
It doesn't require an index and it can be used for any join, no restrictions on the join condition. In contrast, while index joins, hash joins, and merge joins are faster, they can typically only be used for natural joins or equijoins.
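A minimal sketch showing why there's no restriction on the join condition: the predicate is an arbitrary function of two rows (the example relations and band predicate are made up):

```python
def nested_loop_join(R, S, predicate):
    """Compare every row of R with every row of S; works for any theta join."""
    return [{**r, **s} for r in R for s in S if predicate(r, s)]

# Example: a non-equi join against a hypothetical salary-band table
employees = [{"name": "Ann", "salary": 55_000}, {"name": "Bo", "salary": 72_000}]
bands = [{"band": "junior", "lo": 0, "hi": 60_000},
         {"band": "senior", "lo": 60_000, "hi": 200_000}]
print(nested_loop_join(employees, bands,
                       lambda e, b: b["lo"] <= e["salary"] < b["hi"]))
```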
When using a B+ tree, how often does the data file get rebuilt?
The tree is adjusted on every insert and delete, but only locally: a node split or merge propagates along a single root-to-leaf path, and because it's a multilevel index this is actually quite fast. Inserting keys and updating the index on the fly is much quicker than having to wait hours to rebuild the entire index as with the ISAM method.
What makes the performance of a B+ tree so good?
It is balanced: every path from the root to the leaf level has the same length, so it takes the same number of block transfers/seeks to reach any key-value in the tree.
When is the broadcast hash join a good fit?
When one of the relations fits entirely in RAM on the driver and on every executor in the cluster
storage manager
responsible for storing, retrieving, and managing data in memory and on disk