CSE100 Week 3
modding by a prime number will guarantee that there are no factors, other than _______ & ________, that will cause the mod function to never return some index values.
1 and the prime number itself
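A small sketch of why a prime capacity helps, assuming keys that all share a factor with a non-prime capacity. The helper name `distinctIndices` is hypothetical and exists only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Count how many distinct indices a group of keys reaches under key % m.
// distinctIndices is a hypothetical helper, not part of any library.
std::size_t distinctIndices(const std::vector<unsigned>& keys, unsigned m) {
    assert(m > 0);  // modding by zero is undefined
    std::set<unsigned> seen;
    for (unsigned k : keys) {
        seen.insert(k % m);
    }
    return seen.size();
}
```

With the keys 0, 4, 8, ..., 36 (all multiples of 4), a capacity of 8 only ever reaches indices 0 and 4, while a capacity of 7 reaches all 7 indices.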
disadvantages of Separate Chaining
1. We require extra storage now for pointers 2. All the data in our Hash Table is no longer huddled near 1 memory location (since pointers can point to memory anywhere), - this poor locality causes poor cache performance - i.e., it takes the computer longer to find data that isn't located near previously accessed data
Calculate P_{3,5}(≥1 collision). Input your answer in decimal form and round to the nearest thousandth.
3 insertions into 5 slots. P(no collision 1st) = 1 (all slots empty); P(no collision 2nd) = 4/5 (1 slot filled); P(no collision 3rd) = 3/5 (2 slots filled). Thus P(no collisions) = 1 * (4/5) * (3/5) = .48, and P(≥1 collision) = 1 - P(no collisions) = 1 - .48 = .52
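The same calculation can be sketched as a loop, assuming a uniformly random hash function and an initially empty table. The function name `probAtLeastOneCollision` is hypothetical.

```cpp
#include <cassert>

// Probability of at least one collision when inserting n keys into an
// initially empty table of m slots, assuming every slot is equally likely.
double probAtLeastOneCollision(int n, int m) {
    assert(m > 0 && n >= 0);
    double noCollision = 1.0;
    for (int i = 0; i < n; ++i) {
        if (i >= m) return 1.0;  // more keys than slots: a collision is certain
        // i slots are already filled, so (m - i) of m slots are still free
        noCollision *= static_cast<double>(m - i) / m;
    }
    return 1.0 - noCollision;
}
```

Calling `probAtLeastOneCollision(3, 5)` multiplies 5/5 * 4/5 * 3/5 = .48 and returns approximately .52, matching the hand calculation.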
Given a Hash Table where H(k) = k % M and M = 7: after inserting the keys 31, 77, and 708 into our Hash Table (in that order), which index will the key 49 end up hashing to using Linear Probing?
31 % 7 = 3; 77 % 7 = 0; 708 % 7 = 1; 49 % 7 = 0 (occupied) >> 1 (occupied) >> 2 (free), so index 2
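The trace above can be checked with a minimal Linear Probing insert. This is a sketch that assumes non-negative integer keys and uses a sentinel value for empty slots; the names `EMPTY` and `insertLinearProbing` are hypothetical.

```cpp
#include <cassert>
#include <vector>

const int EMPTY = -1;  // sentinel for an unused slot (keys assumed non-negative)

// Insert key using H(k) = k % M with Linear Probing.
// Returns the index the key ends up in, or -1 if the table is full.
int insertLinearProbing(std::vector<int>& table, int key) {
    const int m = static_cast<int>(table.size());
    assert(m > 0 && key >= 0);
    int index = key % m;
    for (int probes = 0; probes < m; ++probes) {
        if (table[index] == EMPTY) {
            table[index] = key;
            return index;
        }
        index = (index + 1) % m;  // wrap around to stay within the array
    }
    return -1;
}
```

With a 7-slot table, inserting 31, 77, 708, and 49 in that order returns indices 3, 0, 1, and 2.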
Suppose you have a Hash Table that has 500 slots. It currently holds 99 keys, all in different locations in the Hash Table (no collisions thus far). What is the probability that the next key you insert WILL cause a collision?
99/500
how to use hash values in order to determine indices to use to store elements in an array
Call hash(key), and save the result as hashValue Perform a second "hashing" by modding hashValue by m to get a valid index in the array (i.e., index = hashValue % m) - m = array length
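A minimal sketch of the two steps, assuming string keys and using `std::hash` from the standard library as the hash function. The function name `indexFor` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Map a string key to a valid index in an array of length m.
std::size_t indexFor(const std::string& key, std::size_t m) {
    assert(m > 0);
    std::size_t hashValue = std::hash<std::string>{}(key);  // step 1: hash(key)
    return hashValue % m;                                   // step 2: mod by array length
}
```

The actual value returned depends on the standard library's `std::hash` implementation, but it is always in the range 0 to m - 1 and is the same every time for the same key within a run.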
how 2 hash tables are used for cuckoo hashing
H1(k) hashes keys exclusively to the first Hash Table T1; H2(k) hashes keys exclusively to the second Hash Table T2. A key k starts by hashing to T1. If another arbitrary key j collides with key k at some point in the future, key k then hashes to T2. However, a key can also get kicked out of T2, in which case it hashes back to T1 and potentially kicks out another key into T2.
open addressing collision resolution strategies
Linear probing, Double hashing, Random hashing, Cuckoo hashing
Double hashing
Linear probing with two hash functions: H1(k) to calculate the hashing index, and H2(k) to calculate the offset in the probing sequence. H2(k) is only used if there's a collision. Improves on Linear probing for insert/find and for the distribution of keys in the hash table, BUT remove is still awkward because of "delete flags". Average O(1), Worst O(n)
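A sketch of a Double hashing insert, assuming non-negative integer keys. The specific formulas for `h1` and `h2` are assumptions chosen for illustration, as are all the names used here.

```cpp
#include <cassert>
#include <vector>

const int EMPTY = -1;  // sentinel for an unused slot (keys assumed non-negative)

// Illustrative hash functions; these particular formulas are assumptions.
int h1(int k, int m) { return k % m; }
int h2(int k, int m) { return 1 + (k % (m - 1)); }  // never 0, so probing always advances

// Probe i lands on (H1(k) + i * H2(k)) % M.
// Returns the index used, or -1 if no free slot was found within M probes.
int insertDoubleHashing(std::vector<int>& table, int key) {
    const int m = static_cast<int>(table.size());
    assert(m > 1 && key >= 0);
    const int start = h1(key, m);
    const int step = h2(key, m);  // only matters once there is a collision
    for (int i = 0; i < m; ++i) {
        int index = (start + i * step) % m;
        if (table[index] == EMPTY) {
            table[index] = key;
            return index;
        }
    }
    return -1;
}
```

With M = 7, the keys 0, 7, and 14 all have H1 = 0 but different offsets (1, 2, and 3), so they land at indices 0, 2, and 3 instead of forming one contiguous run. When M is prime, any offset from 1 to M - 1 eventually visits every slot.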
define a perfect hash function
MUST have both the Property of Equality and the Property of Inequality: it MUST return different hash values for different keys. Example: h(k) = k is a perfect hash function
average case for find insert delete with linear probing?
O(1) only if table is not very full
worst case runtime of find_linear_probing?
O(n)
worst case runtime of insert_linear_probing?
O(n)
how costly is resizing a hash table?
O(n), because you have to rehash all of the elements into the new hash table; the size used by the indexing step (the mod) must also be updated to the new capacity
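A sketch of the rehash step, assuming a Linear Probing table of non-negative integer keys with a sentinel for empty slots. The names `EMPTY` and `rehashInto` are hypothetical.

```cpp
#include <cassert>
#include <vector>

const int EMPTY = -1;  // sentinel for an unused slot (keys assumed non-negative)

// Build a new table with newSize slots and re-insert every key from the old one.
// Each stored key is visited once, which is where the O(n) cost comes from.
std::vector<int> rehashInto(const std::vector<int>& oldTable, int newSize) {
    int count = 0;
    for (int key : oldTable) {
        if (key != EMPTY) ++count;
    }
    assert(newSize > count);  // guarantees the probing loop below terminates
    std::vector<int> newTable(newSize, EMPTY);
    for (int key : oldTable) {
        if (key == EMPTY) continue;
        int index = key % newSize;          // the index now uses the NEW size
        while (newTable[index] != EMPTY) {  // Linear Probing on collision
            index = (index + 1) % newSize;
        }
        newTable[index] = key;
    }
    return newTable;
}
```

Keys generally change position: in a 3-slot table the keys 3, 7, and 5 sit at indices 0, 1, and 2, but after rehashing into 7 slots they sit at indices 3, 0, and 5.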
Suppose you have a hash table that can hold 100 elements. It currently stores 30 elements (w/o collisions so far). What is the probability that your next TWO inserts will cause AT LEAST one collision (assuming a totally random hash function)?
P(>= 1 collision) = 1 - P(no collisions). First insert: P(collision) = 30/100, so P(no collision on 1st insert) = 1 - (30/100) = .7. Second insert (given the first did not collide, 31 slots are now filled): P(collision) = 31/100, so P(no collision on 2nd insert) = 1 - (31/100) = .69. P(no collisions for both inserts) = .7 * .69 = .483. P(>= 1 collision) = 1 - .483 = .517, which rounds to .52
Cons of cuckoo hashing
Potential for infinite cycles, because every key only has 2 possible locations. Worst case O(n) insert if we have to rehash the entire table, but the average case is still O(1)
2 required properties of a hash function
Property of Equality: Given two keys k and l, if k and l are equal, h(k) MUST equal h(l). In other words, if two keys are equal, they must have the same hash value. Property of Inequality: Given two keys k and l, if k and l are not equal, h(k) SHOULD NOT equal h(l). In other words, if two keys are not equal, it would be nice (but NOT NECESSARY!) for them to have different hash values
closed addressing collision resolution strategy
Separate Chaining aka open hashing strategy
is this hash function valid: unsigned int hashValue(Data key) { return 0; }
YES because keys that are of equal value will have the same hash value, but that is pretty terrible, because we will have a lot of collisions
advantages of Separate Chaining
average-case performance is much better than Linear Probing and Double Hashing as the number of keys approaches, and even exceeds, the capacity of the Hash Table - this is because the probability of future collisions does not increase each time an inserting key faces a collision
Random hashing
Based on linear probing: use a pseudorandom number generator seeded by the key to produce a sequence of hash values. The generator must be seeded by the key to make sure the hash function is deterministic (always produces the same hash values for the same key). Once an individual hash value is returned, the algorithm just mods it by the capacity of the Hash Table. If there is a collision, the algorithm just chooses the next hash value in the pseudorandomly-produced sequence of hash values.
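A sketch of the probe sequence, assuming unsigned integer keys. Using `std::mt19937` as the generator is an assumption made for this example, and the function name `probeSequence` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Produce the first `count` probe indices for `key` in a table of `capacity` slots.
// Seeding the generator with the key keeps the sequence deterministic per key.
std::vector<std::size_t> probeSequence(unsigned key, std::size_t capacity, std::size_t count) {
    assert(capacity > 0);
    std::mt19937 gen(key);  // choice of generator is an assumption
    std::vector<std::size_t> indices;
    for (std::size_t i = 0; i < count; ++i) {
        indices.push_back(gen() % capacity);  // mod each hash value by the capacity
    }
    return indices;
}
```

An insert would try the indices in order and stop at the first free slot; find would regenerate the same sequence from the same key and follow it.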
Linear probing is called what kind of collision strategy
closed hashing strategy, aka open addressing strategy
How to avoid collisions while constructing a hash table
Generally, the more extra space you have, the lower the expected number of collisions. Keep the load factor in mind and resize at around .70. If we expect to insert N keys into our Hash Table, we should allocate an array of roughly size M = 1.3N. Always choose the capacity of our Hash Table to be a prime number: a prime has no factors other than 1 and itself, so keys that share a common factor (other than the prime itself) cannot cause the mod function to skip some indices.
Cuckoo Hashing
If an inserting key collides with a key already in the Hash Table, the inserting key pushes out the old key and takes its place. The displaced key then hashes to a new location. Uses 2 hash functions, H1(k) and H2(k), and usually 2 Hash Tables: H1(k) maps to the first hash table and H2(k) maps to the second, so every key has exactly 2 possible locations. H1(k) is the first location that a key always maps to (but it doesn't necessarily stay there).
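A sketch of Cuckoo Hashing with two tables, assuming non-negative integer keys. The hash formulas, the displacement limit, and all names here are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <utility>
#include <vector>

const int EMPTY = -1;  // sentinel for an unused slot (keys assumed non-negative)

struct CuckooTable {
    std::vector<int> t1, t2;
    explicit CuckooTable(int m) : t1(m, EMPTY), t2(m, EMPTY) { assert(m > 0); }

    // Illustrative hash functions; these formulas are assumptions.
    int h1(int k) const { return k % static_cast<int>(t1.size()); }
    int h2(int k) const {
        const int m = static_cast<int>(t2.size());
        return (k / m) % m;
    }

    // A key can only live at t1[h1(k)] or t2[h2(k)], so find checks two slots.
    bool find(int k) const { return t1[h1(k)] == k || t2[h2(k)] == k; }

    // Returns false after too many displacements (a likely cycle). At that
    // point one key, possibly not the one passed in, is left without a slot;
    // a fuller implementation would rebuild the tables with new hash functions.
    bool insert(int k) {
        assert(k >= 0);
        if (find(k)) return true;
        const int maxKicks = 2 * static_cast<int>(t1.size());
        for (int kicks = 0; kicks < maxKicks; ++kicks) {
            int& slot1 = t1[h1(k)];
            std::swap(k, slot1);  // place k in T1, pick up whatever was there
            if (k == EMPTY) return true;
            int& slot2 = t2[h2(k)];
            std::swap(k, slot2);  // the displaced key moves to T2
            if (k == EMPTY) return true;
        }
        return false;
    }
};
```

With 5 slots per table, inserting 3 and then 8 (both have H1 = 3) leaves 8 in T1 at index 3 and pushes 3 into T2 at index 0.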
describe Linear Probing
if an object key maps to an index that is already occupied, simply shift over and try the next available index. Must do this to calculate next index: index = (index + 1) % M ^ this is to keep the index within the bounds of the array (it loops through array)
Describe the trickiness of delete for Linear Probing and describe the fix
If you just find a key by probing for it and empty its slot, you break lookups for any key that probed past that slot on insertion (i.e., a key stored to the right of it, wrapping around), because find terminates when it reaches an empty slot, so it would never reach those keys. To fix: replace the key you want to delete with a delete flag. Insert can reuse these locations; find sees the flag and does not terminate, it keeps searching
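A sketch of delete flags in a Linear Probing table, assuming non-negative integer keys and two sentinel values. The names `EMPTY`, `DELETED`, and `ProbingTable` are hypothetical.

```cpp
#include <cassert>
#include <vector>

const int EMPTY = -1;    // slot has never been used
const int DELETED = -2;  // delete flag: slot was used, key has been removed

struct ProbingTable {
    std::vector<int> slots;
    explicit ProbingTable(int m) : slots(m, EMPTY) { assert(m > 0); }

    // Returns the index of key, or -1 if absent.
    // Stops at EMPTY but keeps probing past DELETED flags.
    int find(int key) const {
        const int m = static_cast<int>(slots.size());
        int index = key % m;
        for (int probes = 0; probes < m; ++probes) {
            if (slots[index] == EMPTY) return -1;
            if (slots[index] == key) return index;
            index = (index + 1) % m;
        }
        return -1;
    }

    // Insert may reuse a DELETED slot. Returns false if the table is full.
    bool insert(int key) {
        assert(key >= 0);
        if (find(key) != -1) return true;  // already present
        const int m = static_cast<int>(slots.size());
        int index = key % m;
        for (int probes = 0; probes < m; ++probes) {
            if (slots[index] == EMPTY || slots[index] == DELETED) {
                slots[index] = key;
                return true;
            }
            index = (index + 1) % m;
        }
        return false;
    }

    // Replace the key with a delete flag instead of emptying the slot.
    bool remove(int key) {
        int index = find(key);
        if (index == -1) return false;
        slots[index] = DELETED;
        return true;
    }
};
```

With M = 7, inserting 0 and then 7 places 7 at index 1 after a collision. After removing 0, `find(7)` still returns 1 because the flag at index 0 does not stop the search, and a later insert of 14 reuses index 0.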
Separate Chaining
Each slot of the Hash Table stores a pointer to a Linked List instead of a single key; every key that hashes to that slot is added to its list.
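A sketch of Separate Chaining, assuming non-negative integer keys. For brevity it stores a `std::list` directly in each slot rather than a raw pointer to a list; the struct name `ChainedTable` and its members are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

struct ChainedTable {
    std::vector<std::list<int>> buckets;  // each slot holds a list of keys
    explicit ChainedTable(std::size_t m) : buckets(m) { assert(m > 0); }

    std::size_t indexOf(int key) const {
        return static_cast<std::size_t>(key) % buckets.size();
    }

    // Search only the chain at the key's index.
    bool find(int key) const {
        const std::list<int>& chain = buckets[indexOf(key)];
        return std::find(chain.begin(), chain.end(), key) != chain.end();
    }

    // Colliding keys simply join the same chain; no other slot is touched.
    void insert(int key) {
        assert(key >= 0);
        if (!find(key)) buckets[indexOf(key)].push_back(key);
    }
};
```

With 7 slots, the keys 0, 7, and 14 all end up in the chain at index 0, and every other slot stays empty.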
good hash function for strings must...(2)
1. Must have a time complexity of O(k), where k = length of the string; specifically, it must iterate through all the chars of the string. 2. Must perform arithmetic that is non-commutative: it can't just add the ASCII values of all the chars together, because then the hash value for Hello = the hash value for olleH
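A sketch contrasting an order-sensitive string hash with a plain character sum. The multiplier 31 is a common choice but is an assumption here, and both function names are hypothetical.

```cpp
#include <cassert>  // used by the checks in the usage note
#include <string>

// Order-sensitive polynomial hash: visits every char once, so it is O(k).
unsigned int hashString(const std::string& s) {
    unsigned int h = 0;
    for (unsigned char c : s) {
        h = h * 31u + c;  // multiply-then-add makes character order matter
    }
    return h;
}

// Commutative counterexample: just sums the characters.
unsigned int sumChars(const std::string& s) {
    unsigned int h = 0;
    for (unsigned char c : s) {
        h += c;
    }
    return h;
}
```

`sumChars("Hello")` and `sumChars("olleH")` are both 500, while `hashString` gives 69609650 for "Hello" and 105835250 for "olleH", so `assert(hashString("Hello") != hashString("olleH"));` holds.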
Consider a Hash Table with 100 slots. Collisions are resolved using Separate Chaining. Assuming that there is an equal probability of mapping to any index of the Hash Table, what is the probability that the first 3 slots (index 0, index 1, and index 2) are unfilled after the first 3 insertions?
no access to 3 slots = 97 possible slots 97/100 * 97/100 * 97/100 each time is 97 because separate chaining allows for insertion into an index that already has an element
C++, a Hash Table is called
unordered_set
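A short usage sketch of `std::unordered_set`; the wrapper function `addWord` is hypothetical and only there to show what `insert` reports.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Returns true if `word` was newly inserted, false if it was already present.
bool addWord(std::unordered_set<std::string>& words, const std::string& word) {
    // insert returns a pair: an iterator and a bool saying whether it inserted
    return words.insert(word).second;
}
```

Membership can then be checked with `words.count(word)` or by comparing `words.find(word)` against `words.end()`.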
The run time of the hash function must be independent of
the number of elements in the Hash Table, BUT it might not be independent of the size of the data you are hashing. For primitive data types, hash functions are constant time, because primitives are small. For types that are collections of other types (e.g. lists, strings, etc.), good hash functions iterate over all the elements in the collection, so technically a hash function's runtime is sometimes O(k), where k = number of elements in the key being hashed
describe the special case of O(1) for hash tables
runtime is constant with respect to the NUMBER OF ELEMENTS in the Hash Table ex: mapping a string of length k to an index in an array is in reality O(k) overall: we first perform a O(k) operation to compute a hash value for the string then perform a O(1) operation to map this hash value to an index in the array. But note that this has nothing to do with how many other strings are in the Hash Table.
cons of linear probing
Slows down the hash table drastically as it fills, because we're essentially doing a linear search for an open spot; keys also tend to form clusters
load factor
The ratio of stored elements to slots: alpha = n/m, where n = number of elements in the table and m = table size. The expected number of operations a Hash Table performs depends on it: when the load factor is >= .70, the number of calculations done increases drastically, which means more collisions
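A small sketch of the load factor check. The .70 threshold follows the rule of thumb in these notes; the function names `loadFactor` and `shouldResize` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// alpha = n / m, where n = number of stored elements and m = number of slots.
double loadFactor(std::size_t n, std::size_t m) {
    assert(m > 0);
    return static_cast<double>(n) / static_cast<double>(m);
}

// True once the table is full enough that a resize is worth doing.
bool shouldResize(std::size_t n, std::size_t m) {
    return loadFactor(n, m) >= 0.70;
}
```

A table holding 30 keys in 100 slots has a load factor of .3 and needs no resize; one holding 71 keys in 100 slots does.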
While prime numbers smooth over a number of flaws, they don't necessarily solve...
the problem of unequal distributions aka it doesn't necessarily solve clustering - this is more related to hash function
Pros of cuckoo hashing
Worst case O(1) find: if the key is not at either index1 = H1(k) or index2 = H2(k), then it is not in the table. Worst case O(1) "delete": if the key exists in our table, we know that it is at either index1 = H1(k) or index2 = H2(k). Can use even more than 2 hash tables if you want
even if we have a perfect hash function, can we still get collisions? why?
Yes: by picking an unfortunate size for our array or choosing a bad indexing hash function, two different hash values can still map to the same index