Week 6: Hash tables, Binary Search Trees, Graphs
In order to implement this idea, we need two things:
--> A table structure in which to store the data. The table is simply a Java array. ("bucket array for hashtable")
--> A way to get from a key to a particular spot in the table (index in the array). The way this is done is via a hash function h, which maps a key k to an index h(k) in the table. Ideally, the hash function spreads the objects fairly evenly over the table.
We say that a hash function is "good" if ...
1. It maps the keys in our map in a way that minimizes collisions
2. It is fast and easy to compute
Linear probing has *2 DISADVANTAGES*
1. Removing
2. Clustering
What makes a *good* compression function?
A good compression function minimizes the number of collisions for a given set of distinct hash codes. A simple compression function is to take the hashcode modulo the table size, as we did above. This scheme works well when the table size is a prime number, but not so well otherwise.
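The modulo scheme just described can be sketched as follows (class and method names here are ours, not from the notes). Java's Math.floorMod is used so that negative hash codes still land in [0, N − 1], which a plain % would not guarantee:

```java
// Sketch of the "hash code modulo table size" compression function.
public class Compression {
    // Maps an arbitrary int hash code to a table index in [0, N - 1].
    // Math.floorMod handles negative hash codes, which plain % would not.
    static int compress(int hashCode, int tableSize) {
        return Math.floorMod(hashCode, tableSize);
    }

    public static void main(String[] args) {
        System.out.println(compress("hello".hashCode(), 11));
        System.out.println(compress(-7, 11)); // negative codes still land in [0, 10]
    }
}
```

Using a prime like 11 for the table size is what makes this simple scheme behave well, per the paragraph above.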
*What makes a *good* hash function*?
A good hash function has three desirable properties:
1. It can be computed quickly.
2. It spreads the universe of keys fairly evenly over the table.
3. Small changes in the key (e.g., changing a character in a string or changing the order of the letters) should usually result in a different hash value.
What does a hash table do?
A hash table uses a *hash function* to map the keys of a map (or set) to corresponding indices in a table. Ideally, keys will be well distributed in the range from 0 to N − 1 by a hash function, but in practice there may be two or more distinct keys that get mapped to the same index.
Open Addressing: Linear Probing
A simple method for collision handling with open addressing is linear probing. With this approach, if we try to insert an entry (k, v) into a bucket A[j] that is already occupied, where j = h(k), then we next try A[(j+1) mod N]. If A[(j+1) mod N] is also occupied, then we try A[(j+2) mod N], and so on, until we find an empty bucket that can accept the new entry. Once this bucket is located, we simply insert the entry there. Of course, this collision resolution strategy requires that we change the implementation when searching for an existing key, which is the first step of all get, put, or remove operations. In particular, to attempt to locate an entry with key equal to k, we must examine consecutive slots, starting from A[h(k)], until we either find an entry with an equal key or we find an empty bucket. (See Figure 10.7.) The name "linear probing" comes from the fact that accessing a cell of the bucket array can be viewed as a "probe," and that consecutive probes occur in neighboring cells (when viewed circularly).
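The probing loop described above might be sketched like this: a minimal, keys-only version (no values, no removal) that assumes the table never fills, i.e., n < N. All names are ours:

```java
// Minimal linear-probing sketch (keys only; assumes n < N so probing terminates).
public class LinearProbe {
    static final int N = 7;               // table capacity
    static String[] table = new String[N];

    static int h(String k) { return Math.floorMod(k.hashCode(), N); }

    // Probe A[h(k)], A[(h(k)+1) mod N], ... until an empty bucket is found.
    static void put(String k) {
        int j = h(k);
        while (table[j] != null && !table[j].equals(k)) j = (j + 1) % N;
        table[j] = k;
    }

    // Same probe sequence: stop at a matching key or at an empty bucket.
    static boolean contains(String k) {
        int j = h(k);
        while (table[j] != null) {
            if (table[j].equals(k)) return true;
            j = (j + 1) % N;
        }
        return false;
    }

    public static void main(String[] args) {
        put("ant"); put("bee"); put("cat");
        System.out.println(contains("bee"));  // true
        System.out.println(contains("dog"));  // false
    }
}
```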
Why do we want 2 "equal" objects to have the same hash code?
Because if you insert an item in the hash table and then search for it using something that is equal to the item, then you expect to find it. If the hashcodes are different you won't find it. You will be looking in the wrong slot.
DISADVANTAGE #2: Clustering
But there's another, more insidious, problem. It's called clustering. Long runs of occupied slots build up, increasing the average search time. Clusters arise because an empty slot preceded by *t* full slots gets filled next with a probability of (t + 1)/N. That's because the probability of hashing to any of the t slots is 1/N (and there are t of them), plus the next key could hash directly into that empty slot. And that's not the only way that clusters grow. Clusters can coalesce when one cluster grows and meets with another cluster.
Maps and Sets
For a map, we want to use the key of the (key, value) pair to figure out where we are going. For a set, we use the object itself; in other words, the object is the key. So we'll talk about storing keys in the hash table, possibly with an associated value.
Requirements for a good hash code... what makes a hash code *good*?
For hashing schemes to be reliable, *it is imperative that any two objects that are viewed as "equal" to each other have the same hash code*. This is important because if an entry is inserted into a map, and a later search is performed on a key that is considered equivalent to that entry's key, the map must recognize this as a match. Therefore, when using a hash table to implement a map, we want equivalent keys to have the same hash code so that they are guaranteed to map to the same bucket. More formally, if a class defines equivalence through the *equals* method then that class should also provide a consistent implementation of the hashCode method, such that if x.equals(y) then x.hashCode() == y.hashCode(). As an example, Java's String class defines the equals method so that two instances are equivalent if they have precisely the same sequence of characters. That class also overrides the hashCode method to provide consistent behavior. Java's primitive wrapper classes also define hashCode().
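A minimal illustration of the equals/hashCode contract, using a hypothetical Point class of our own:

```java
import java.util.Objects;

// Hypothetical class illustrating the equals/hashCode contract.
public class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }

    @Override public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    // Required by the contract: equal Points must report equal hash codes.
    @Override public int hashCode() { return Objects.hash(x, y); }

    public static void main(String[] args) {
        Point a = new Point(2, 3), b = new Point(2, 3);
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode()); // true
    }
}
```

Had hashCode not been overridden, a and b would usually get different default hash codes, land in different buckets, and the later search described above would miss.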
What is *linear probing*?
INSERTION: Suppose we want to insert key k and that h(k) = i, but slot i is already occupied. We cannot put key k there. Instead, we probe (look at) slot i + 1. If it's empty, put key k there. If slot i + 1 is occupied, probe slot i + 2. Keep going, wrapping around to slot 0 after probing slot N − 1. *As long as n < N*, i.e., α < 1, we are *guaranteed to eventually find an empty slot*. If the table fills, that is, if α reaches 1, then increase the table size and rehash everything.
Sets
If a set is a map with keys and no values, why wouldn't you just use an ArrayList/array/linked list?
One more time for the people in the back - why is it important for hash codes to avoid collisions as much as possible?
If the hash codes of our keys cause collisions, then there is no hope for our compression function to avoid them
Chaining: How many items do we expect to look at when searching for an item? For unsuccessful search (it wasn't in the map or set), we would look at everything in the appropriate list. But how many elements is that?
If the table has N slots and there are n keys stored in it, there would be n/N keys per slot on average, and hence n/N elements per list on average. We call this ratio, n/N, the *load factor*, and we denote it by α. If the hash function did a perfect job and distributed the keys perfectly evenly among the slots, then each list has α elements. In an unsuccessful search, the average number of items that we would look at is α. For a successful search we find the element (so always do 1 comparison), and look at about half of the other elements in the list. This means that for successful search you look at about 1 + α/2 items. Either way the running time would be Θ(1 + α). Why "1 + "? Because even if α < 1, we have to account for the time computing the hash function h, which we assume to be constant, and for starting the search. (Of course, if α < 1, then we cannot perfectly distribute the keys among all the slots, since a list cannot have a fraction of an element.)
What is a collision?
If there are two or more keys with the same hash value, then two different entries will be mapped to the same bucket in A. In this case, we say that a collision has occurred. To be sure, there are ways of dealing with collisions, which we will discuss later, but the best strategy is to try to avoid them in the first place.
How does this create further problems?
If we remove many keys, we can end up with all (or almost all) of the "empty" slots being marked, and unsuccessful searches can go on for a very long time.
Finding a *good* hash code
In particular, if you define *equals* for an object, it is very important that you override hashCode so that *two items considered "equal" have the same hashcode*.
Chaining
Instead of storing each element directly in the table, each slot in the table references a linked list. The linked list for slot i holds all the keys k for which h(k) = i. The keys are k1, k2, ..., k8. We show each linked list as a noncircular, doubly linked list without a sentinel, and table slots without a linked list are null. Of course, we could make a circular, doubly linked list for each table slot instead, and have each table slot reference one such list, even if the list is empty. In some situations, especially when we only insert into the hash table and never remove elements, singly linked lists suffice.
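A bare-bones chaining sketch along these lines, using Java's LinkedList for the per-slot lists (class and method names are ours):

```java
import java.util.LinkedList;

// Chaining sketch: slot i holds a linked list of all keys k with h(k) = i.
public class Chained {
    static final int N = 5;
    @SuppressWarnings("unchecked")
    static LinkedList<String>[] table = new LinkedList[N];

    static int h(String k) { return Math.floorMod(k.hashCode(), N); }

    static void insert(String k) {
        int i = h(k);
        if (table[i] == null) table[i] = new LinkedList<>();
        table[i].add(k);                  // Θ(1): prepend/append to slot i's list
    }

    static boolean contains(String k) {   // a miss walks the whole list for slot i
        int i = h(k);
        return table[i] != null && table[i].contains(k);
    }

    static void remove(String k) {        // search the list, then unlink the node
        int i = h(k);
        if (table[i] != null) table[i].remove(k);
    }

    public static void main(String[] args) {
        insert("ant"); insert("bee");
        remove("ant");
        System.out.println(contains("ant")); // false
        System.out.println(contains("bee")); // true
    }
}
```

Slots whose list would be empty stay null, matching the description above.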
Hash function: the how of it
Java computes hash functions in two steps. The first is to have every object have a hashCode() method, which returns an int. The default is often based on the address of the object in memory. Then we have to map this integer into an entry in the table. This is called the compression function. (Fortunately, Java's library takes care of the compression function, but *leaves the hash code up to us!*)
Wait... that doesn't sound like a problem. Why can't we just remove the key from the slot?
Let's take the following situation. We insert keys k1, k2, and k3, in that order, where h(k1) = h(k2) = h(k3). That is, all three keys hash to the same slot, but k1 goes into slot i, k2 goes into slot i + 1, and k3 goes into slot i + 2 (see INSERTION). Then we remove key k2, which opens up a hole in slot i + 1. Then we search for key k3. What's going to happen? We probe slot i, see that it holds k1, not k3, and then probe slot i + 1. *Because slot i + 1 is now empty, we conclude that k3 is not in the hash table. Wrong!*
DISADVANTAGE #1: Removing
Linear probing is a nice idea, but it has a couple of problems. One is how to *remove* a key from the hash table.
What about as n increases to be larger than N?
Note that if n gets much larger than N, then search times go up. How can we avoid this? The same way that we do for an ArrayList. When the table gets too full, we create a new one about double the size and rehash everything into the new table. What is "too full"? Java implementations typically start the table with size 11 and double the table size when α exceeds 0.75. Everything is peachy now, right? Yes, except that we are now counting on the table having several empty slots. In other words, we're wasting lots of space, to say nothing of all the links within the lists. If memory is at a premium, as it would be in an embedded system or handheld device, we might regret wasting it.
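The grow-and-rehash step might be sketched like this, on a keys-only linear-probing table. The size-11 start and 0.75 threshold follow the notes; everything else (names, structure) is our assumption:

```java
// Rehashing sketch: when the load factor would exceed 0.75, double N and reinsert.
public class Rehash {
    static String[] table = new String[11];  // typical Java-style starting size
    static int n = 0;                        // number of keys stored

    static int slot(String k, int cap) { return Math.floorMod(k.hashCode(), cap); }

    static void put(String k) {
        if ((double) (n + 1) / table.length > 0.75) grow();
        int j = slot(k, table.length);
        while (table[j] != null) j = (j + 1) % table.length;  // linear probing
        table[j] = k;
        n++;
    }

    // Allocate a table about double the size and rehash every key into it;
    // each key's slot changes because the compression is now mod the new size.
    static void grow() {
        String[] old = table;
        table = new String[old.length * 2];
        for (String k : old) {
            if (k == null) continue;
            int j = slot(k, table.length);
            while (table[j] != null) j = (j + 1) % table.length;
            table[j] = k;
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 20; i++) put("key" + i);
        System.out.println(table.length);  // grew past the initial 11
    }
}
```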
Worst case
Of course, the worst case is bad. It occurs when all keys hash to the same slot. It can happen, even with simple uniform hashing, but of course it's highly unlikely. But the possibility cannot be avoided. If an adversary puts n·N items into the N-slot table, then one of the slots will have at least n items in it. He or she then makes those n items the data for the problem that you are dealing with, and you are stuck. (There is an idea called universal hashing, which basically computes a different hash code every time you run the program, so that data that is slow one time might be fast the next.) Should the worst case occur, the worst-case time for an unsuccessful search is Θ(n), since the entire list of n elements has to be searched. For a successful search, the worst-case time is still Θ(n), because the key being searched for could be the last element in the list.
How can we store everything in the table even when there's a collision?
One simple scheme is called *linear probing*.
Linear probing: SEARCHING
Searching with linear probing uses the same idea as inserting. Compute i = h(k), and search slots i, i + 1, i + 2, ..., wrapping around at slot N − 1, *until either we find key k or we hit an empty slot. If we hit an empty slot, then key k was not in the hash table.*
How do we evaluate a hash function h(k)?
The evaluation of a hash function has 2 portions: A) A hash code that maps a key k to an integer; B) A compression function that maps the hash code to an integer within a range of indices, [0, N − 1], for a bucket array.
Phase A: Hash Codes
The first action that a hash function performs is to take an arbitrary key k in our map and compute an integer that is called the hash code for k; this integer need not be in the range [0, N − 1], and may even be negative. We desire that the set of hash codes assigned to our keys should avoid collisions as much as possible. For *if the hash codes of our keys cause collisions, then there is no hope for our compression function to avoid them*.
What is a hash function?
The goal of a hash function, h, is to map each key k to an integer in the range [0, N − 1], where N is the capacity of the bucket array for a hash table. Equipped with such a hash function, h, the main idea of this approach is to use the hash function value, h(k), as an index into our bucket array, A, instead of the key k (which may not be appropriate for direct use as an index). That is, we store the entry (k, v) in the bucket A[h(k)].
Phase B of hash function: Compression function
The hash code for a key k will typically not be suitable for immediate use with a bucket array, because the integer hash code may be negative or may exceed the capacity of the bucket array. Thus, once we have determined an integer hash code for a key object k, there is still the issue of mapping that integer into the range [0, N − 1]. This computation, known as a compression function, is the second action performed as part of an overall hash function.
Advantages of separating hash function evaluation into 2 portions:
The hash code portion of that computation is independent of a specific hash table size. This allows the development of a general hash code for each object that can be used for a hash table of any size; only the compression function depends upon the table size. This is particularly convenient, because the underlying bucket array for a hash table may be dynamically resized, depending on the number of entries currently stored in the map.
Collision-handling schemes: Overview
The main idea of a hash table is to take a bucket array, A, and a hash function, h, and use them to implement a map by storing each entry (k, v) in the "bucket" A[h(k)]. This simple idea is challenged, however, when we have two distinct keys, k1 and k2, such that h(k1) = h(k2); i.e., distinct keys have the same hash value/index in the bucket array. The existence of such collisions prevents us from simply inserting a new entry (k, v) directly into the bucket A[h(k)]. It also complicates our procedure for performing insertion, search, and deletion operations.
2) Open Addressing
The second way to handle collisions is called open addressing. The idea is to store everything in the table itself, even when collisions occur. There are no linked lists.
Open Addressing
The separate chaining rule has many nice properties, such as affording simple implementations of map operations, but it nevertheless has one slight disadvantage: It requires the use of an auxiliary data structure to hold entries with colliding keys. If space is at a premium (for example, if we are writing a program for a small handheld device), then we can use the alternative approach of storing each entry directly in a table slot. This approach saves space because no auxiliary structures are employed, but it requires a bit more complexity to properly handle collisions. There are several variants of this approach, collectively referred to as open addressing schemes, which we discuss next. Open addressing requires that the load factor is always at most 1 and that entries are stored directly in the cells of the bucket array itself.
So... how do we handle collisions? How many ways are there and what are they
There are 2 ways to handle collisions: 1) Chaining 2) Open addressing
Double hashing
There are various schemes that help a little. One is double hashing, where we take two different hash functions h1 and h2 and make a third hash function hʹ from them: hʹ(k, p) = (h1(k) + p·h2(k)) mod N. The first probe goes to slot h1(k), and each successive probe is offset from the previous probe by the amount h2(k), all taken modulo N. Now, unlike linear or quadratic probing, the probe sequence depends in two ways upon the key, and as long as we design h1 and h2 so that if h1(k1) = h1(k2) it's really unlikely that h2(k1) = h2(k2), we can avoid clustering. Again, we must choose our hash functions h1 and h2 carefully.
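A sketch of that probe sequence, with toy choices of h1 and h2 (all names are ours) and a prime N, so that any nonzero step size eventually visits every slot:

```java
// Double-hashing probe sequence: h'(k, p) = (h1(k) + p*h2(k)) mod N.
// h1 and h2 here are illustrative toy functions; N is prime so every
// nonzero step size h2(k) cycles through all N slots.
public class DoubleHash {
    static final int N = 13;

    static int h1(String k) { return Math.floorMod(k.hashCode(), N); }
    static int h2(String k) { return 1 + Math.floorMod(k.hashCode(), N - 1); } // never 0

    // The p-th probe for key k.
    static int probe(String k, int p) {
        return (h1(k) + p * h2(k)) % N;
    }

    public static void main(String[] args) {
        // Successive probes for one key are each offset by h2(k), mod N.
        for (int p = 0; p < 4; p++) System.out.println(probe("ant", p));
    }
}
```

Note that h2 is built to never return 0, since a step size of 0 would probe the same slot forever.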
Now, what if the keys are not perfectly distributed?
Things get a little trickier, but we operate on the assumption of simple uniform hashing, where we assume that any given key is equally likely to hash into any of the m slots, without regard to which slot any other key hashed into. When we say "any given key," we mean any possible key, not just those that have been inserted into the hash table. For example, if the keys are strings, then simple uniform hashing says that any string—not just the strings that have been inserted—is equally likely to hash into any slot. Under the assumption of simple uniform hashing, any search, whether successful or not, takes Θ(1 + α) time on average.
How about inserting and removing from a hash table with chaining?
To insert key k, just compute h(k) and insert the key into the linked list for slot h(k), creating the linked list if necessary. That takes Θ(1) time. How about removing an element? If we assume that we have already searched for it and have a reference to its linked-list node, and that the list is doubly linked, then removing takes Θ(1) time. Again, that's after having paid the price for searching. In fact, you can do just as well with a singly linked list if you keep track of the position before the one you are considering.
How can we correct this error?
We need some way to *mark a slot as having held a key that we have since removed*, so that it should be *treated as full during a search but as empty during insertion* (Why?)
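One common way to implement such a marker is a "tombstone" sentinel. This keys-only sketch (names and sentinel are ours) treats a tombstone as full during search but as reusable during insertion:

```java
// Tombstone sketch for removal under linear probing: a removed slot is marked,
// treated as full (skipped over) during search but as empty during insertion.
public class Tombstone {
    static final String DELETED = new String("<deleted>"); // identity-unique sentinel
    static final int N = 7;
    static String[] table = new String[N];

    static int h(String k) { return Math.floorMod(k.hashCode(), N); }

    static void put(String k) {
        int j = h(k);
        // Stop at an empty slot OR a tombstone: both accept a new key.
        while (table[j] != null && table[j] != DELETED && !table[j].equals(k))
            j = (j + 1) % N;
        table[j] = k;
    }

    static boolean contains(String k) {
        int j = h(k);
        while (table[j] != null) {            // tombstones do NOT stop the search
            if (table[j] != DELETED && table[j].equals(k)) return true;
            j = (j + 1) % N;
        }
        return false;
    }

    static void remove(String k) {
        int j = h(k);
        while (table[j] != null) {
            if (table[j] != DELETED && table[j].equals(k)) { table[j] = DELETED; return; }
            j = (j + 1) % N;
        }
    }

    public static void main(String[] args) {
        put("a"); put("b"); put("c");
        remove("b");
        System.out.println(contains("c")); // still found, even past the tombstone
    }
}
```

Because search skips tombstones instead of stopping, the k3 scenario above is no longer answered incorrectly; the cost is exactly the buildup of marked slots described next.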
Define collisions
When *multiple keys map* to the same table index
Why collisions occur
if each key mapped to a distinct index in the table, we'd be done. That would be like all Sears customers having the last two digits of their phone numbers be unique. But they are not. Likewise, "randomly" distributed keys don't do what you might expect. If you "randomly" distribute n keys into a table of size n, you might expect to find one item in most of the slots, with very few empty slots. In fact, over a third of the slots are expected to be empty (each slot stays empty with probability (1 − 1/n)^n ≈ 1/e ≈ 0.37), meaning the other two thirds are expected to have collisions.
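The "over a third empty" claim is easy to check empirically. This small simulation (ours, not from the notes) throws n keys uniformly at random into n slots and measures the fraction of slots left empty, which comes out near 1/e ≈ 0.37:

```java
import java.util.Random;

// Experiment: distribute n keys "randomly" over an n-slot table and
// count empty slots; about (1 - 1/n)^n ≈ 1/e of them stay empty.
public class EmptySlots {
    static double emptyFraction(int n, long seed) {
        boolean[] occupied = new boolean[n];
        Random rng = new Random(seed);
        for (int i = 0; i < n; i++) occupied[rng.nextInt(n)] = true; // n random throws
        int empty = 0;
        for (boolean b : occupied) if (!b) empty++;
        return (double) empty / n;
    }

    public static void main(String[] args) {
        System.out.printf("empty fraction = %.3f%n", emptyFraction(100_000, 42));
    }
}
```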