Data Structures Test 5 - Hashing and Garbage Collection
Linear Probing, Quadratic Probing, and Double Hashing
List the open addressing/collision resolution solutions
After Tri-Color Garbage Collection, what is garbage in this set?
object3, object4
After Tri-Color Garbage Collection, what is garbage in this table?
Counting GCs have a cost every time an assignment operator is used with a reference. With tracing, the passes through the data tend to be periodic, and there is only a performance cost at those times when the collector is running. Counting has the advantage of never having a buildup of garbage: as soon as trash is created, it is immediately identified and reclaimed. Tracing does allow garbage to build up, which means we'll have less available memory than we actually should.
Compare Tracing and Counting Garbage Collectors
Load Factor is a measure of the number of occupied cells in the table versus the table size: if the table is empty, the load factor is 0.0; if the table is full, the load factor is 1.0. A full table guarantees a collision, an empty table has no possibility of a collision, and in between, the load factor tells us the odds of a collision for the next inserted data point. Example: 7 (in table) / 9 (table size) -> 0.7778
Define Load Factor
If we have a table that has a load factor of 1.0 and is a perfect hash, then we have achieved a minimal perfect hash: the entire table is full and there were no collisions filling it. A minimal perfect hash is very unlikely in practice.
Define minimal perfect hash
Normalization is the use of the modulo operator to transform a hashed value into a value that is a valid index for the hash table.
Define normalization
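A minimal sketch of normalization in C++; the table size of 9 is an assumed value, chosen to match the probing examples elsewhere in these notes:

```cpp
#include <cstddef>

// Assumed table size, matching the probing examples in these notes.
const std::size_t TABLE_SIZE = 9;

// Normalization: modulo maps any hashed value into a valid table index.
std::size_t normalize(std::size_t hashed) {
    return hashed % TABLE_SIZE;  // always in [0, TABLE_SIZE - 1]
}
```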
In the context of hash tables, if a table contains values that were added in such a way that there were no collisions, then we have a perfect hash. Consequently, our worst-case lookup performance is O(1).
Define perfect hash
With double hashing (open addressing), a collision is resolved by applying a different hash function to the key to create the offset from the collision point. The odds that two keys which collided under one hash function also follow the same offsets under a second are very low. A popular second hash function is: hash2(key) = PRIME - (key % PRIME), where PRIME is a prime smaller than TABLE_SIZE. A good second hash function must never evaluate to zero, and must make sure that all cells can be probed.
Explain Double Hashing and its benefits
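A sketch of the probe sequence. TABLE_SIZE of 11 and PRIME of 7 are assumed values, not from the card; a prime table size is used because it guarantees the step from hash2 can reach every cell:

```cpp
#include <cstddef>

const std::size_t TABLE_SIZE = 11;  // prime, so every cell is reachable
const std::size_t PRIME = 7;        // a prime smaller than TABLE_SIZE

std::size_t hash1(std::size_t key) { return key % TABLE_SIZE; }

// The popular second hash from the card: key % PRIME is in [0, PRIME-1],
// so the result is in [1, PRIME] and never evaluates to zero.
std::size_t hash2(std::size_t key) { return PRIME - (key % PRIME); }

// Cell examined on the i-th probe after a collision for this key.
std::size_t doubleHashProbe(std::size_t key, std::size_t i) {
    return (hash1(key) + i * hash2(key)) % TABLE_SIZE;
}
```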
Linear probing (open addressing) function: P(i) = i. Take the probe number and add it as an offset to the collision point. Linear probing is simple, and its sequential probes play well with spatial locality of reference; its drawback is the buildup of primary clusters.
Example: modulo hash, table size 9, linear probing
Key - Hash - Index
10 - 1 - 1
28 - 1 - 1 + 1
40 - 4 - 4
22 - 4 - 4 + 1
17 - 8 - 8
35 - 8 - 8 + 1 (wrap around to 0)
0 - 0 - 0 + 1, 0 + 2, 0 + 3
Explain Linear Probing and its benefits
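The card's walk-through as runnable C++. This is a minimal sketch: -1 marks an empty cell, keys are assumed non-negative, and the table is assumed never to fill completely:

```cpp
#include <cstddef>
#include <vector>

const int EMPTY = -1;  // sentinel for an unused cell

// Insert with linear probing: on a collision, step one cell at a
// time (wrapping) until an empty cell is found.
std::size_t linearProbeInsert(std::vector<int>& table, int key) {
    std::size_t index = key % table.size();   // modulo-normalized hash
    while (table[index] != EMPTY)
        index = (index + 1) % table.size();   // offset grows by 1, wraps
    table[index] = key;
    return index;
}
```

Inserting the card's keys in order reproduces its indexes, including 35 wrapping around to cell 0.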
With quadratic probing (open addressing) we use the function P(i) = i^2, so each attempt to resolve the collision offsets from the collision point at 1 away, 4 away, 9 away, etc., wrapping if needed like we did with linear probing.
P(1) -> 1 * 1
P(2) -> 2 * 2
P(3) -> 3 * 3
Example: modulo hash, table size 9, quadratic probing
Key - Hash - Index
10 - 1 - 1
28 - 1 - 1 + 1
40 - 4 - 4
22 - 4 - 4 + 1
17 - 8 - 8
19 - 1 - 1 + 1, 1 + 4, 1 + 9, 1 + 16, 1 + 25, ...
Quadratic probing reduces the size of the primary clusters, at the expense of introducing secondary clusters. Clusters are bad: as they get bigger, the odds of hitting them increase, and insert/lookup performance decreases.
Explain Quadratic Probing and its benefits
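A minimal sketch of the insert. One caution worth noting about the card's table: with size 9 (not prime), the offsets i^2 mod 9 only take the values {0, 1, 4, 7}, so probes from home cell 1 repeat over cells {1, 2, 5, 8}; once those four fill, a key like 19 can never be placed. Prime table sizes avoid the worst of this.

```cpp
#include <cstddef>
#include <vector>

const int EMPTY = -1;

// Insert with quadratic probing: the i-th probe looks i*i cells past
// the collision point (i = 0 is the home cell itself). Assumes keys
// are non-negative and that a reachable cell is still empty; with a
// non-prime table size the i*i offsets can cycle without ever
// covering every cell, and this loop would then never terminate.
std::size_t quadraticProbeInsert(std::vector<int>& table, int key) {
    std::size_t home = key % table.size();
    for (std::size_t i = 0; ; ++i) {
        std::size_t index = (home + i * i) % table.size();
        if (table[index] == EMPTY) {
            table[index] = key;
            return index;
        }
    }
}
```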
As we're reading an array, it's common to start at one point, then use the next element, and the next element after that, and so on. In addition to i and sum probably travelling together through the caches of the CPU, the same thing will likely be happening to blocks of the array. If the array has 1 million ints, all 4 megabytes of it are allocated in RAM, but it is not fed to the CPU in one go. Small blocks of the array are passed along, so that after processing the first element, when we want the second element, we don't have to go all the way back to RAM for the value; it may well already be in the CPU cache.

File systems behave the same way: when you write or read to/from disk, things are done to cut down on the overhead. Writes tend to be buffered, then committed to disk as a block operation. For example, write an int several times and you probably won't see any disk IO until you close the stream, call flush, or fill the buffer. The same applies to reading: ask for a single int from disk with your fstream and it won't read just 4 bytes; it will probably read something like 4 KB of data, in anticipation that you may well want the next few bytes too.

Another sequential example is pipelining, a pattern the CPU uses to get its work done faster. An instruction fed into the CPU may take several clock cycles to work through, and as it flows through the stages of the CPU it leaves the earlier stages idle. So, given the sequential instruction fetch of the Von Neumann architecture, the CPU goes ahead and starts the next instruction while instruction 1 is still being processed.
Explain Sequential Locality of Reference
With spatial locality of reference, resources that are next to a resource being accessed are more likely to be accessed than resources further away.

    int global = 10;

    int total(int array[], int size) {
        int sum = 0;
        for (int i = 0; i < size; i++)
            sum += array[i];
        return sum;
    }

There are three areas of memory in the code above: a global variable, and a function, which has memory allocated on the stack, as well as the address of the array being passed in. The global is off in some area of RAM far away from the stack frame of total. The stack frame of total is on the run-time stack. The array could be in a number of places: it might be a local variable in another function, so also on the stack; it could be dynamically allocated, and thus on the heap; or it may be pointing to a global variable. As the function runs, sum and i are close together in memory, and the array is being hit as well. Things that are useful right now, and the things next to them, may be useful soon, so mechanisms like caching may well bring them along. Another example is pre-fetching; think of Google Maps loading the map tiles adjacent to the ones you're viewing.
Explain Spatial Locality of Reference
Resources that were used recently are more likely to be used again in the near future than resources that have not been used recently. Again this shows up in caching, specifically the L1, L2, and L3 caches on the CPU, all of which have very limited footprints. Pages in these caches need to be evicted once we identify new data that is more pertinent at this point in time, so the least recently used page is the likely candidate to be dumped from the cache.
Explain Temporal Locality
White set - potential garbage.
Gray set - objects that may have references to objects in the white set.
Black set - all objects that are not garbage; objects with no references to objects in the white set.
Initial state: the Black set is empty; the Gray set contains the root objects (global variables, stack variables, and the like); the White set contains all other objects.
The algorithm works on the Gray set. While the Gray set is not empty: pull an object from the Gray set and 'blacken' it by moving any objects it references that are in the White set into the Gray set; with no white references left, the pulled object moves to the Black set. Once the Gray set is empty, all remaining objects in the White set are trash.
Explain Tri Color Garbage Collection
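A minimal sketch of that loop over a hypothetical object graph; the Graph type and the object names in the usage below are illustrative, not from the cards:

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical object graph: each object name maps to the names it
// holds references to.
using Graph = std::map<std::string, std::vector<std::string>>;

// Tri-color marking: returns the final white set, i.e. the garbage.
std::set<std::string> collectGarbage(const Graph& graph,
                                     const std::set<std::string>& roots) {
    std::set<std::string> white, black;
    std::set<std::string> gray = roots;          // roots start gray
    for (const auto& entry : graph)
        if (!roots.count(entry.first))
            white.insert(entry.first);           // everything else starts white
    while (!gray.empty()) {
        std::string obj = *gray.begin();         // pull an object from gray
        gray.erase(gray.begin());
        auto it = graph.find(obj);
        if (it != graph.end())
            for (const std::string& ref : it->second)
                if (white.erase(ref))            // white referents turn gray
                    gray.insert(ref);
        black.insert(obj);                       // obj references no white now
    }
    return white;                                // never reached: trash
}
```

With root A referencing B, and C referencing D off to the side, the collector blackens A and B and reports C and D as garbage.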
More or less an extension of RAM, where a page file (typically pagefile.sys in C:\ or C:\Windows) is used to store RAM contents. An application that is 'running' but not active can be moved, in part or in whole, from RAM to disk. Essentially, idle applications, because they haven't done anything for a while, can sit on disk, freeing up RAM for another application, either a new one or an existing one that is growing. Moving the data in and out leverages all 3 types of locality of reference: we still operate in chunks when moving parts of an application, and we use time (recency of use) to decide which chunks should or could be moved.
Explain Virtual Memory
Break the key apart by character and add those characters together; we'll just use the corresponding ASCII value of each character.
Key: abc -> 97 + 98 + 99 -> 294
Key - Math - Value
abc - 97 + 98 + 99 - 294
Abc - 65 + 98 + 99 - 262
Explain the Character Folding hash function
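The card's math as a small C++ function:

```cpp
#include <string>

// Character folding: sum the ASCII value of every character in the key.
unsigned int charFold(const std::string& key) {
    unsigned int sum = 0;
    for (char c : key)
        sum += static_cast<unsigned char>(c);  // 'a' = 97, 'b' = 98, ...
    return sum;
}
```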
Here, we'll decide how many groups, or how big a group is, break the key apart into those groups, and then smash the groups together with addition.
Fold Shift: group size 2
Key - Groups - Value
12345 - 12 + 34 + 5 - 51
123456 - 12 + 34 + 56 - 102
As an optional second step, we can reverse the digits of alternating groups (even or odd).
Fold Boundary (odd groups reversed)
Key - Groups - Value
12345 - 21 + 34 + 5 - 60
123456 - 21 + 34 + 65 - 120
(12 flipped to 21, 5 flipped to 5, 56 flipped to 65)
Fold Boundary (even groups reversed)
Key - Groups - Value
12345 - 12 + 43 + 5 - 60
123456 - 12 + 43 + 56 - 111
Explain the Folding by Groups hash function
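Both variants as C++ over the key's digit string; the function names and the flipOddGroups flag are my own labels for the card's even/odd alternation:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Fold shift: split the key's digit string into groups of size g and
// add the groups together.
unsigned int foldShift(const std::string& key, std::size_t g) {
    unsigned int sum = 0;
    for (std::size_t start = 0; start < key.size(); start += g)
        sum += std::stoul(key.substr(start, g));
    return sum;
}

// Fold boundary: same, but reverse the digits of alternating groups
// first. flipOddGroups chooses whether the 1st, 3rd, ... groups or
// the 2nd, 4th, ... groups get reversed.
unsigned int foldBoundary(const std::string& key, std::size_t g,
                          bool flipOddGroups) {
    unsigned int sum = 0;
    std::size_t n = 0;
    for (std::size_t start = 0; start < key.size(); start += g, ++n) {
        std::string group = key.substr(start, g);
        if ((n % 2 == 0) == flipOddGroups)       // is this group flipped?
            std::reverse(group.begin(), group.end());
        sum += std::stoul(group);
    }
    return sum;
}
```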
Take the key and square it, then extract the middle.
Key - Square - Middle (3)
1234 - 1522756 - 227
12345 - 152399025 - 399
Key - Square - Middle (4)
1234 - 1522756 - 2275 or 5227
12345 - 152399025 - 3990 or 2399
Explain the Mid Square hash function
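A sketch over the decimal digits. When the digits left over after removing the middle split unevenly, "the middle" is ambiguous, which is why the card lists two answers for width 4; this version biases toward the left candidate:

```cpp
#include <cstddef>
#include <string>

// Mid-square: square the key, then pull `width` digits from the
// middle of the decimal result.
unsigned long midSquare(unsigned long key, std::size_t width) {
    std::string digits = std::to_string(key * key);
    std::size_t start = (digits.size() - width) / 2;  // left-biased middle
    return std::stoul(digits.substr(start, width));
}
```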
Algorithm: take the ASCII value of each character, multiply it by the 1-based index of that character, then sum the total.
Key - ASCII - Folding
ABC - 65, 66, 67 - 65*(1) + 66*(2) + 67*(3) = 398
Abc - 65, 98, 99 - 65*(1) + 98*(2) + 99*(3) = 558
Explain the Program Character Based Folding hash function
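The same algorithm in C++; the function name is mine:

```cpp
#include <cstddef>
#include <string>

// Position-weighted character folding: each character's ASCII value
// is multiplied by its 1-based index, then everything is summed.
unsigned int weightedCharFold(const std::string& key) {
    unsigned int sum = 0;
    for (std::size_t i = 0; i < key.size(); ++i)
        sum += static_cast<unsigned char>(key[i]) * (i + 1);
    return sum;
}
```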
Analogous to the way an ASCII or EBCDIC character string representing a decimal number is converted to a numeric quantity for computing, a variable-length string can be converted as x0*a^(k-1) + x1*a^(k-2) + ... + x(k-2)*a + x(k-1). This is simply a polynomial in a non-zero radix a != 1 that takes the components (x0, x1, ..., x(k-1)) as the characters of the input string of length k. It can be used directly as the hash code, or a hash function can be applied to it to map the potentially large value to the hash table size. The value of a is usually a prime number at least large enough to hold the number of different characters in the character set of potential keys. Radix conversion hashing of strings minimizes the number of collisions. Available data sizes may restrict the maximum length of string that can be hashed with this method. For example, a 128-bit double long word will hash only a 26-character alphabetic string (ignoring case) with a radix of 29; a printable ASCII string is limited to 9 characters using radix 97 and a 64-bit long word. However, alphabetic keys are usually of modest length, because keys must be stored in the hash table. Numeric character strings are usually not a problem; 64 bits can count up to 10^19, or 19 decimal digits with radix 10.
Explain the Radix Transformation function
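The polynomial above is usually evaluated with Horner's rule, which avoids computing explicit powers. The radix of 31 below is an assumed choice (prime, larger than a lowercase alphabet), not something the card specifies:

```cpp
#include <string>

// Polynomial (radix) hash via Horner's rule: treats the characters
// as digits in base `radix`, computing x0*a^(k-1) + ... + x(k-1).
unsigned long long radixHash(const std::string& key,
                             unsigned long long radix = 31) {
    unsigned long long h = 0;
    for (char c : key)
        h = h * radix + static_cast<unsigned char>(c);  // shift, add digit
    return h;
}
```

Because position matters in a positional number system, "ab" and "ba" hash differently, unlike plain character folding.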
D, F, H
For the following objects in memory, run the Tri-Color Garbage Collector and determine which objects are garbage. Root objects are A, G. What is garbage?
Redundant data and data patterns can skew the hash results, and overflow can bite during the computation.
Overflow: data exceeds the capacity, or range, of the storage. For example, with an 8-bit integer the range is -128 to 127 (2^8 = 256 values):
25 * 25 -> 625; 625 - 256 -> 369; 369 - 256 -> 113
20 * 20 -> 400; 400 - 256 -> 144; as a signed byte, 144 - 256 -> -112
16 bits: +/- 32,767 (2^15 - 1)
32 bits: roughly +/- 2 billion (2^31 - 1)
64 bits: roughly +/- 9.2 quintillion (2^63 - 1); 2^64 is 18,446,744,073,709,551,616
Data with patterns can skew the hash values, so we might give that data a chop from the get-go:
Data - Chopped - Squared - Chopped^2
11112345 - 2345 - 123484211399025 - 5499025
11114567 - 4567 - 123533599597489 - 20857489
11116551 - 6551 - 123577706135601 - 42915601
We're trying to avoid the clustering in hash values that exists as a byproduct of the data patterns we're seeing, since clustering reduces performance.
How can redundant data patterns in hashing be dealt with?
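The card's 8-bit overflow arithmetic, runnable. Note the conversion of an out-of-range int back to int8_t is guaranteed modular (wrap-around) behavior from C++20 on, and behaves that way on mainstream compilers before that:

```cpp
#include <cstdint>

// 8-bit overflow: the product is computed as int, then truncated
// back to 8 bits, so values wrap modulo 2^8 and results past 127
// reappear as negatives in a signed byte.
int8_t multiply8bit(int8_t a, int8_t b) {
    return static_cast<int8_t>(a * b);
}
```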
3
Identify which cell adding key 40 with a normalized hash result of 2 is placed into the following hash table using linear probing for collision resolution.
2
Identify which cell adding key 60 with a normalized hash result of 6 is placed into the following hash table using linear probing for collision resolution.
Hash(k): hash the value and normalize it.
Insert(k): keep probing until an empty slot is found; once one is found, insert k there.
Search(k): keep probing until a slot holding k is found or an empty slot is reached.
Delete(k): the delete operation is interesting. If we simply empty the slot, a later search may stop short and fail, so the slots of deleted keys are specially marked as "deleted".
List out the names of the functions a hash table class would have and their purpose.
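A sketch of those four operations with linear probing and tombstones. Keys are assumed non-negative, and the table is assumed to always keep at least one EMPTY cell so probes terminate:

```cpp
#include <cstddef>
#include <vector>

const int EMPTY = -1;    // never held a key
const int DELETED = -2;  // tombstone: held a key that was removed

class HashTable {
    std::vector<int> cells;
public:
    explicit HashTable(std::size_t size) : cells(size, EMPTY) {}

    // Hash: hash the value and normalize it with modulo.
    std::size_t hash(int key) const { return key % cells.size(); }

    // Insert(k): probe until an open slot (EMPTY or DELETED) is found.
    void insert(int key) {
        std::size_t i = hash(key);
        while (cells[i] >= 0)
            i = (i + 1) % cells.size();
        cells[i] = key;
    }

    // Search(k): probe until k is found or an EMPTY slot is reached.
    // A DELETED slot does not stop the probe, which is the whole point.
    bool search(int key) const {
        for (std::size_t i = hash(key); cells[i] != EMPTY;
             i = (i + 1) % cells.size())
            if (cells[i] == key) return true;
        return false;
    }

    // Delete(k): mark the slot DELETED rather than EMPTY, so later
    // searches probe past it instead of stopping short.
    void remove(int key) {
        for (std::size_t i = hash(key); cells[i] != EMPTY;
             i = (i + 1) % cells.size())
            if (cells[i] == key) { cells[i] = DELETED; return; }
    }
};
```

Deleting 10 (which sits in 28's probe path) and then finding 28 demonstrates why the tombstone is needed.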
list, tree, stack, local
In this pseudocode, what are the root objects for tri-color garbage collection at Garbage Collection 1?
Modulo - take the key value, divide it by some value, and keep the remainder. The size of the array is often used as the divisor when hashing for storage.
Folding - take the key and split it into parts, then merge those parts back together (shift and boundary folding variants).
Radix transformation - convert the key from one base to another.
Mid-square - square the value and then extract the middle section of the result.
List all of the simple hash functions
1,692
Use shift folding, length 3, on the following value to calculate the Hash Value. 698741253
Line 5
Using Reference Counting Garbage Collection. Identify which line, if any, will cause the reclamation of the object "one".
1. int main()
2. {
3.     Object A = new object("one");
4.     Object B = new object("two");
5.     A = B;
6.     B = null;
7.     A = new object("three");
8. }
4
Using Reference Counting Garbage Collection. Identify which line, if any, will cause the reclamation of the object C.
1. Reference x = new Object(C);
2. Reference y = x;
3. y = new Object(B);
4. x = y;
5. x = NULL;
When we make a reference to a resource, we keep a counter associated with that resource; as references are assigned and cleared, we increment and/or decrement that counter. If a counter ever reaches 0, we immediately free up that resource.
What are Counting Garbage Collectors?
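A minimal intrusive sketch of the bookkeeping. The Object type and the `reclaimed` flag are illustrative stand-ins for real allocation, not anyone's actual runtime:

```cpp
// Illustrative counted object: `reclaimed` stands in for freeing memory.
struct Object {
    int refCount = 0;
    bool reclaimed = false;
};

void addRef(Object* obj) {
    if (obj) ++obj->refCount;
}

void release(Object* obj) {
    if (obj && --obj->refCount == 0)
        obj->reclaimed = true;  // counter hit 0: reclaim immediately
}

// Models `ref = target;` for a counted reference: this is the cost
// paid on every assignment that the card mentions.
void assign(Object*& ref, Object* target) {
    addRef(target);             // increment first: safe for self-assignment
    release(ref);
    ref = target;
}
```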
These will generally start from defined roots and then trace through those objects to determine what is reachable, and what is not. Those items that cannot be reached are flagged as garbage, and can be reclaimed.
What are Tracing/Marking Garbage Collectors?
Data storage for fast insert and retrieval, ideally O(1) Cryptography - such as password storage Signature Validation - Data Integrity checks.
What are some real world uses for hashing?
Two common types: Tracing GCs, and Counting GCs.
What are the types of Garbage Collection
Dynamically allocated memory typically comes and goes:

    int* array = new int[120];
    delete[] array;

We allocated 480 bytes on the heap for the array, then at some point in the future we decided we didn't need that memory any more and freed it back up. In this example, we're manually managing memory. With garbage collection, an algorithm does the job of cleaning up after the application for us.
What is Garbage Collection
Locality of Reference is a pattern, or phenomenon, often observed in computer systems, and we've built things into those systems to leverage it. Three types: spatial, temporal, and sequential (a subcategory of spatial).
What is Locality of Reference and what are the types?
A hash table uses a hash function to compute an index, also called a hash code, into an array of buckets or slots, from which the desired value can be found. A bucket is one of those array cells, where one or more values may be stored.
What is a bucket?
With chaining, we use an array of linked lists for the hash table; all keys that share a given hash value share the same list. If order matters, the list must be maintained in an applicable order, with sorted ascending as the default.
What is the Chaining collision resolution?
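A minimal sketch of a chained table, keeping each chain sorted ascending as the card's default:

```cpp
#include <cstddef>
#include <list>
#include <vector>

// Chaining: the table is an array of linked lists, and all keys that
// share a modulo-normalized hash share one list.
class ChainedTable {
    std::vector<std::list<int>> buckets;
public:
    explicit ChainedTable(std::size_t size) : buckets(size) {}

    void insert(int key) {
        std::list<int>& chain = buckets[key % buckets.size()];
        auto it = chain.begin();
        while (it != chain.end() && *it < key) ++it;  // keep ascending order
        chain.insert(it, key);
    }

    bool search(int key) const {
        for (int k : buckets[key % buckets.size()])
            if (k == key) return true;
        return false;
    }
};
```

Keys 10 and 28 both hash to 1 in a 9-cell table; with chaining they simply share that cell's list instead of probing elsewhere.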
Tri-Color Garbage Collector This system uses 3 sets to categorize resources, aka objects in memory that it's managing. White set, Gray set and a Black Set.
What is the system for Tracing/Marking Garbage Collectors
index 2 : 32
index 3 : 25
index 4 : 44
What will the table look like if 15 is deleted with chain collapsing?
key % 10
Which of these Hash Functions yields a perfect hash with a 10 element array for the following values? (Remember to use integer math) 6, 32, 33, 44
key / 10
key % 10
(key % 10) + (key / 10)
(key % 10) - (key / 10)
None of these
(key % 10) + (key / 10)
Which of these Hash Functions yields a perfect hash with a 10 element array for the following values? (Remember to use integer math) 6, 43, 71, 76
key / 10
key % 10
(key % 10) + (key / 10)
(key % 10) - (key / 10)
None of these