Hashing
In a successful search
With separate chaining, about 1 + Lambda/2 nodes are examined on average: the node holding the key plus about half of the other nodes in its list.
In an unsuccessful search
The number of nodes to examine is Lambda on average.
General Ideas when Hashing?
Each key is mapped into some number in the range 0 to TableSize − 1 and placed in the appropriate cell. The mapping is called the hash function. Ideally, the hash function:
- Should be simple to compute.
- Should ensure that any two distinct keys get different cells.
Example: insert {5, 15, 6, 3, 27, 8} into a table with TableSize = 10. If position h(key) = key mod TableSize is occupied, apply linear probing: the ith probe is (h(key) + i) % TableSize, i = 1, 2, 3, 4, ...
Index:   0   1   2   3   4   5   6   7   8   9
Element: -   -   -   3   -   5   15  6   27  8
Primary cluster: cells 5 through 9.
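The insertion example above can be traced with a few lines of Python. This is only a minimal sketch: it assumes TableSize = 10 (the cells are numbered 0 to 9), uses None for an empty cell, and the function name insert_linear is made up for illustration.

```python
def insert_linear(table, key):
    """Insert key with linear probing; return the index where it landed."""
    size = len(table)
    home = key % size                  # h(key) = key mod TableSize
    for i in range(size):              # i = 0 is the home cell, then i = 1, 2, ...
        index = (home + i) % size
        if table[index] is None:       # first empty cell wins
            table[index] = key
            return index
    raise RuntimeError("table is full")


table = [None] * 10                    # assumed TableSize = 10
for key in [5, 15, 6, 3, 27, 8]:
    insert_linear(table, key)

print(table)  # [None, None, None, 3, None, 5, 15, 6, 27, 8]
```

Only 5 and 3 end up in their home cells; 15, 6, 27 and 8 are each pushed one cell to the right, which is how the run of occupied cells from 5 to 9 builds up.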
Define Quadratic probing
A collision resolution method that eliminates the primary clustering problem of linear probing by using the probe function c(i) = i^2 instead of c(i) = i.
Compare: AVL Tree vs. Hash Table Average Complexity? Find min/max? Items in a range? Sorted input?
                     AVL Tree                    Hash Table
Average complexity   O(log N)                    O(1)
Find min/max         Yes                         No
Items in a range     Yes                         No
Sorted input         Very bad (many rotations)   No problems
Advantages and Disadvantages of Separate Chaining
Advantages:
- Simple to implement.
- The hash table never fills up; we can always add more elements to a chain.
Disadvantages:
- Parts of the table/array might never be used.
- Uses extra space for links.
- As chains get longer, search time increases to O(n) in the worst case.
Advantages and Disadvantages of Open Addressing
Advantages of open addressing:
- All items are stored in the hash table itself. There is no need for another data structure.
Disadvantages of open addressing:
- The keys of the objects to be hashed must be distinct.
- Dependent on choosing a proper table size.
- Requires the use of a three-state (EMPTY, OCCUPIED, DELETED) flag in each cell.
For Collision Resolution what is Separate Chaining?
All keys that map to the same table location are kept in a linked list
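A minimal sketch of separate chaining, under a few assumptions: keys are integers, the hash function is key mod TableSize, and a Python list stands in for each linked list. The class name ChainedHashTable and its method names are invented for this example.

```python
class ChainedHashTable:
    """Separate chaining: one chain (here a Python list) per table cell."""

    def __init__(self, table_size=7):
        self.chains = [[] for _ in range(table_size)]

    def _chain(self, key):
        # All keys with the same value of key mod TableSize share a chain.
        return self.chains[key % len(self.chains)]

    def insert(self, key):
        chain = self._chain(key)
        if key not in chain:           # ignore duplicates
            chain.append(key)

    def find(self, key):
        # Cost: O(1) to pick the chain, plus the time to traverse it.
        return key in self._chain(key)

    def remove(self, key):
        chain = self._chain(key)
        if key in chain:
            chain.remove(key)


t = ChainedHashTable(7)
for key in [7, 14, 3, 10]:
    t.insert(key)

print(t.chains[0])  # [7, 14]  (7 and 14 both hash to cell 0)
print(t.chains[3])  # [3, 10]  (3 and 10 both hash to cell 3)
```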
Define a collision
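A collision occurs when two distinct keys "hash" to the same value, that is, to the same cell. Besides choosing a hash function, we must decide what to do when this happens (collision resolution).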
What is the bad news for quadratic probing?
With quadratic probing there is NO guarantee of finding an empty cell once the table gets more than half full, or even before the table gets half full if the table size is not prime. Theorem: if the table is at least half empty (Lambda < 1/2) and TableSize is prime, then we are always guaranteed to be able to insert a new element.
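The effect of the table size can be checked by listing which cells quadratic probing can ever reach from one home cell. The sketch below assumes the probe function c(i) = i^2; the helper name reachable_cells and the sizes 11 and 16 are chosen only for illustration.

```python
def reachable_cells(home, table_size):
    """Distinct cells visited by the probes (home + i*i) mod table_size."""
    return {(home + i * i) % table_size for i in range(table_size)}


prime_cells = reachable_cells(0, 11)       # prime table size
non_prime_cells = reachable_cells(0, 16)   # non-prime table size

print(sorted(prime_cells))      # [0, 1, 3, 4, 5, 9]
print(sorted(non_prime_cells))  # [0, 1, 4, 9]
```

With size 11 the probes reach 6 of the 11 cells, more than half, so a table that is less than half full always has a free cell among them. With size 16 only 4 cells are reachable, so an insertion can fail when just those 4 cells are occupied, long before the table is half full.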
Define Double Hashing
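General idea:
- Given two good hash functions u and v, it is very unlikely that two different keys collide under both, i.e. that u(k1) == u(k2) and v(k1) == v(k2).
- So make the probe function f(i) = i * v(key).
Detail: make sure v(key) cannot be 0, otherwise every probe lands on the same cell.
Formula, with h1 = u and h2 = v: (h1(key) + h2(key) * i) mod TableSize.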
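A small sketch of the double hashing formula. The card does not say which second hash function to use, so the choice here is an assumption: h2(key) = R − (key mod R) with R a prime smaller than the table size, a common textbook choice that can never evaluate to 0. The table size 11, R = 7, and the function name are likewise made up for the example.

```python
def double_hash_probes(key, table_size, r, count):
    """First `count` cells probed by (h1(key) + i * h2(key)) mod table_size."""
    h1 = key % table_size
    h2 = r - (key % r)                 # assumed second hash; always in 1..r, never 0
    return [(h1 + i * h2) % table_size for i in range(count)]


# 49 and 60 share the home cell 5 in a table of size 11 ...
print(double_hash_probes(49, 11, 7, 4))  # [5, 1, 8, 4]
print(double_hash_probes(60, 11, 7, 4))  # [5, 8, 0, 3]
```

The two keys collide on the first probe but use different step sizes (7 and 3), so their probe sequences separate immediately.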
Which collision resolution strategy is the best choice for hash tables?
Gonnet and Baeza-Yates compare several hashing strategies; their results suggest that quadratic probing is the fastest method.
Define the load factor (lambda) of a hash table.
Lambda = N / TableSize, where N is the number of items in the table.
Important consideration when picking the table size.
The choice of hash function and table size needs to be considered carefully. If the table size is 10 and the keys all end in zero, the hash function Key mod TableSize is a bad choice, because every key lands in cell 0.
- It is a good idea to ensure that the table size is prime. Why? Real-life data tends to have a pattern: "multiples of 61" are probably less likely than "multiples of 60".
- If the input keys are random integers, then the function Key mod TableSize is very simple to compute and distributes the keys evenly.
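The point about keys ending in zero is easy to check. The keys below are invented for illustration, and 11 is used as an example of a prime table size.

```python
keys = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]   # all end in zero

cells_size_10 = sorted({key % 10 for key in keys})  # TableSize = 10
cells_size_11 = sorted({key % 11 for key in keys})  # TableSize = 11 (prime)

print(cells_size_10)  # [0]
print(cells_size_11)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

With TableSize = 10 all ten keys collide in cell 0; with TableSize = 11 every key gets its own cell.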
Explain Open Addressing
Important points:
- All items are stored in the hash table itself.
- In addition to the cell data (if any), each cell keeps one of the three states: EMPTY, OCCUPIED, DELETED.
- While inserting, if a collision occurs, alternative cells are tried until an empty cell is found.
- Deletion (lazy deletion): when a key is deleted, the slot is marked as DELETED.
- Probe sequence: the sequence of array indexes that is followed in searching for an empty cell during an insertion, or in searching for a key during find or delete operations.
Quadratic probing is better than linear probing because it eliminates primary clustering; however, what is a possible drawback?
It may result in secondary clustering: if h(k1) = h(k2), the probe sequences for k1 and k2 are exactly the same. This shared sequence of locations is called a secondary cluster.
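Secondary clustering can be seen by printing the probe sequences of two keys that share a home cell. A minimal sketch, assuming h(key) = key mod TableSize and c(i) = i^2; the table size 11 and the keys 3 and 14 are picked only for illustration.

```python
def quadratic_probes(key, table_size, count):
    """First `count` cells probed by (h(key) + i*i) mod table_size."""
    home = key % table_size
    return [(home + i * i) % table_size for i in range(count)]


a = quadratic_probes(3, 11, 5)
b = quadratic_probes(14, 11, 5)    # 14 mod 11 == 3, same home cell as key 3

print(a)  # [3, 4, 7, 1, 8]
print(b)  # [3, 4, 7, 1, 8]
```

Because the sequence depends only on the home cell, every key that hashes to cell 3 competes for exactly the same cells.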
Given integer values, what is the hash function?
Key mod TableSize
Define primary clustering
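Keys tend to cluster around the table locations that they originally hash to, forming runs of occupied cells; any key that hashes into a run needs several probes and then makes the run longer.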
Given the load factor Lambda, what is the general rule for separate chaining?
Make the table size about as large as the number of elements expected, so that Lambda ≈ 1.
Can we eliminate collisions?
No. We can reduce the number of collisions and we have ways of handling them, but we cannot remove the possibility of collisions.
What is the computational effort for a search with separate chaining?
O(1) to evaluate the hash function + time to traverse the list.
Running time complexity of hash table operations
On average, a hash table with a good hash function will achieve O(1) inserts, searches, and removes, but in the worst case an operation may require O(N).
Is there a way to use the "unused" space in the table/array instead of using chains to make more space?
Yes: open addressing. Main idea: use the empty space in the table itself.
Three types of Collision Resolutions
Separate Chaining, Quadratic Probing, Double Hashing
What is one main issue when it comes to hashing? What limitations do we have?
There are a finite number of cells and a virtually infinite supply of keys, so giving every distinct key its own cell is impossible: we cannot give a computer that much memory space.
When should a hash table be used?
Use a hash table if there is any suspicion of SORTED input and NO ordering information is required.
How does the probe function c(i) = i change to avoid primary clustering?
We can avoid primary clustering by replacing the probe function c(i) = i with c(i) = i^2 or, more generally, bucket = (Hash(item->key) + c1 * i + c2 * i * i) % N
Given the following table (linear probing, h(key) = key mod 10), find(109) = ? find(58) = ? delete(38) = ? find(8) = ?
Index:   0   1    2   3   4   5   6   7   8   9
Element: 8   109  10  -   -   -   -   -   38  19
find(109) = 1
find(58) = null (T[8], T[9], T[0], T[1], and T[2] ≠ 58; T[3] = null)
delete(38): T[8] is marked DELETED ("no data, don't stop")
find(8): T[8] is DELETED (no data), move to next; T[9] = 19 ≠ 8, move to next; T[0] = 8, found, so find(8) = 0
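The trace above can be reproduced with a small open-addressing table that uses lazy deletion. This is a sketch under stated assumptions: linear probing with h(key) = key mod 10, distinct keys, and an insertion order (38, 19, 8, 109, 10) that is only a guess chosen because it produces the table shown. The class name ProbingTable and the sentinels EMPTY and DELETED are invented for the example.

```python
EMPTY, DELETED = object(), object()    # sentinel cell states; any other value is OCCUPIED


class ProbingTable:
    def __init__(self, size=10):
        self.cells = [EMPTY] * size

    def _probe(self, key):
        """Yield the linear probe sequence (h(key) + i) mod TableSize."""
        size = len(self.cells)
        home = key % size
        for i in range(size):
            yield (home + i) % size

    def insert(self, key):
        # Assumes key is not already present (keys must be distinct).
        for index in self._probe(key):
            if self.cells[index] is EMPTY or self.cells[index] is DELETED:
                self.cells[index] = key
                return index
        raise RuntimeError("table is full")

    def find(self, key):
        for index in self._probe(key):
            cell = self.cells[index]
            if cell is EMPTY:          # a truly empty cell ends the search
                return None
            if cell is not DELETED and cell == key:
                return index
            # DELETED or a different key: keep probing ("don't stop")
        return None

    def delete(self, key):
        index = self.find(key)
        if index is not None:
            self.cells[index] = DELETED    # lazy deletion: mark, do not empty
        return index


table = ProbingTable(10)
for key in [38, 19, 8, 109, 10]:       # assumed insertion order
    table.insert(key)

print(table.find(109))   # 1
print(table.find(58))    # None
print(table.delete(38))  # 8
print(table.find(8))     # 0
```

If delete reset the cell to EMPTY instead of DELETED, find(8) would stop at cell 8 and wrongly report that 8 is not in the table.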
Common probe sequences are of the form
h_i(key) = (h(key) + c(i)) mod TableSize, where i = 0, 1, ..., TableSize − 1 and c(0) = 0. The function c(i) is used to resolve collisions.
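The general form can be written once and specialised by passing in c(i). A minimal sketch, assuming h(key) = key mod TableSize; the function name probe_sequence, the key 27, and the table size 7 are chosen only for illustration.

```python
def probe_sequence(key, table_size, c):
    """Cells h_i(key) = (h(key) + c(i)) mod TableSize for i = 0 .. TableSize - 1."""
    home = key % table_size            # h(key); c(0) = 0 means the first probe is the home cell
    return [(home + c(i)) % table_size for i in range(table_size)]


linear = probe_sequence(27, 7, lambda i: i)          # c(i) = i
quadratic = probe_sequence(27, 7, lambda i: i * i)   # c(i) = i^2

print(linear)     # [6, 0, 1, 2, 3, 4, 5]
print(quadratic)  # [6, 0, 3, 1, 1, 3, 0]
```

Linear probing visits every cell once, while the quadratic sequence starts repeating after the first four probes, which ties back to the guarantee that quadratic probing only offers while the table is less than half full.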