data structures and algs exam 2
assignment 2
check with code
recitation 5 (recursive constructor, toString, reverse)
check with code
stack uses
(1) Runtime stack for method calls (especially recursive calls): when a method is called, its activation record is pushed onto the runtime stack; when it finishes, its activation record is popped from the runtime stack. (2) Testing for matching parentheses: the left paren ( must come before the right ) and they must be the same paren type. When we encounter a left paren we push it onto the stack; when we encounter a right paren we check the stack. If the stack is empty, there is an error (no left paren to match the right); if it is not empty, we pop the stack and check the character (if it does not match the left paren type there is an error). Otherwise we continue. Once we have read all of the input we check the stack: if it is not empty, there is an error (no right paren to match a left). (3) Postfix expressions: operators follow their operands, which is useful since no parentheses are needed. The idea is that each operator seen is applied to the two most recently seen operands. To evaluate with a stack, we read each token: if it is an operand we push it onto the stack; if it is an operator we pop the right operand, pop the left operand, apply the operator, and push the result onto the stack. After all tokens have been read we pop the final answer from the stack. If we ever pop an empty stack, or the stack is not empty at the end, the expression has an error.
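As a concrete illustration of the postfix idea, here is a minimal Java sketch that evaluates expressions with single-digit operands and the four basic operators (the class name and the single-digit restriction are my own simplifications, not from the lecture):

import java.util.ArrayDeque;
import java.util.Deque;

public class PostfixEval {
    // Evaluate a postfix expression such as "23*4+" (single-digit operands assumed).
    public static int evaluate(String expr) {
        Deque<Integer> stack = new ArrayDeque<>();
        for (char c : expr.toCharArray()) {
            if (Character.isDigit(c)) {
                stack.push(c - '0');                  // operand: push its value
            } else {
                if (stack.size() < 2)
                    throw new IllegalArgumentException("missing operand");
                int right = stack.pop();              // right operand is popped first
                int left = stack.pop();
                switch (c) {
                    case '+': stack.push(left + right); break;
                    case '-': stack.push(left - right); break;
                    case '*': stack.push(left * right); break;
                    case '/': stack.push(left / right); break;
                    default: throw new IllegalArgumentException("bad token: " + c);
                }
            }
        }
        int result = stack.pop();
        if (!stack.isEmpty())                          // leftover operands => malformed expression
            throw new IllegalArgumentException("too many operands");
        return result;
    }

    public static void main(String[] args) {
        System.out.println(evaluate("23*4+"));         // prints 10
    }
}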
LL vs arrayList iterator implementation
A LL implementation uses a node reference as the sole instance variable for the iterator; it is initialized to the first node when the iterator is created and then progresses down the list with each call to next. An array list implementation uses an integer to store the index of the current value in the iteration, incremented with each call to next (for remove to be implemented there must also be a way to fill in the gap). Note that using an iterator for the array list will not improve our access time to "visit" all the values of the list; however, it does allow for consistent access, so we can use an iterator on any list without knowing whether it is an array list or a linked list.
dictionary data structure
A dictionary (symbol table) is an abstract structure that associates a value with a key. We use the key to search the data structure for the value. The key and value are separate entities; for a given application we may only need the keys, the values, or both. The symbol table is an interface (the dictionary specification does not require any specific implementation). Methods: add(key, value) ⇒ returns a value; remove(key) ⇒ removes a key-value pair (returns the value it is removing); getValue(key) ⇒ gets the value for a key without removing it; getKeyIterator and getValueIterator (allow us to iterate through all keys and values); isEmpty, getSize, clear. We could implement this interface using what we already know: an underlying sorted array, or an underlying sorted linked list. Both of these implementations are similar in that the basic search involves direct comparison of keys. In other words, to find a target key K we must compare K to one or more keys that are present in the data structure. If we change our basic approach perhaps we can get an improvement.
towers of hanoi runtime
A recursive algorithm with a single recursive call produces a linear chain of calls. When a recursive algorithm has two recursive calls, the execution trace is a binary tree, which is difficult to reproduce without recursion (to do it, the programmer must create and maintain their own stack to hold all the various data values, i.e. the local state information for each call); this increases the likelihood of errors or bugs in the code.
why is it hard to do towers of hanoi iteratively?
A recursive algorithm with a single recursive call produces a linear chain of calls. When a recursive algorithm has two recursive calls (as towers of hanoi does), the execution trace is a binary tree, which is difficult to reproduce without recursion (to do it, the programmer must create and maintain their own stack to hold all the various data values, i.e. the local state information for each call); this increases the likelihood of errors or bugs in the code.
stack implementations
A stack can easily be implemented using either an array or a linked list. Array: push adds to the end of the logical array, pop removes from the end of the logical array. Linked list: push to the front of the linked list, pop from the front of the linked list.
list iterator
An iterator can be used for any Java collection (list, queue, deque), but for the list we can add more functionality using the ListIterator interface, which allows us to traverse in both directions. Note that a singly linked list will not support the list iterator (so you can't do this with the author's list, but you can with the standard Java list, and you can always do it with an array). A list iterator is best implemented externally (its methods are not part of the class being iterated over; instead it is an inner class); we need more logic to handle traversal in both directions as well as the set and remove methods.
double hashing vs linear probing
As alpha increases, double hashing shows improvement over linear probing. However, as alpha approaches 1 (as N ⇒ M) both schemes degrade to O(N) performance. Since there are only M locations in the table, as it fills there are fewer empty locations remaining, and multiple collisions will occur even with double hashing. For inserts and unsuccessful finds, both probe until an empty location in the table is found (and few exist), so it could take close to M probes before the collision is resolved (since the table is almost full, O(M) is basically O(N)).
Array vs linked list implementations (stack and queue)
As long as resizing is done in an intelligent way, the array versions of these may be a bit faster than the linked list versions. Stack: push and pop are O(1) amortized time for both implementations, but they are a constant factor faster in normal use with the array-based version. Queue: enqueue and dequeue are O(1) amortized time for both implementations, but they are a constant factor faster in normal use with the array version. Notice, though, that the ArrayList does not automatically size "downward" when items are deleted, so the ArrayList stack will not do so either; it could waste memory if it previously held many items and now holds few.
recursive implementations of insertion sort and selection sort
As with sequential search and some other simple problems, this is more to show how it can be done rather than something we would actually do. Since these each have only a single recursive call and are not divide and conquer, there is no efficiency or implementation motivation for doing them recursively.
Simple hashing concept
Assume we have an array (table) T of size M. Assume we have a function h(x) that maps from our key space into indexes (0, 1, ..., M-1). Also assume that h(x) can be computed in time proportional to the length of the key. Now how do we insert and find some key x? To insert: i = h(x) (this gives us an index that is a legal position within the table); T[i] = x (we place the key at that index in the table). To find: i = h(x) (calculating h(x) gives the same value); if (T[i] == x) return true (we found it), else return false.
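A minimal sketch of this collision-free idealization for String keys (the class name, the use of hashCode, and the no-collision assumption are illustrative only):

public class SimpleHashTable {
    private final String[] table;   // T
    private final int M;

    public SimpleHashTable(int m) {
        this.M = m;
        this.table = new String[m];
    }

    // h(x): maps a key into 0..M-1 in time proportional to the key length
    private int h(String x) {
        return (x.hashCode() & 0x7fffffff) % M;
    }

    // insert: T[h(x)] = x  (this sketch assumes no collision ever occurs)
    public void insert(String x) {
        table[h(x)] = x;
    }

    // find: recompute h(x) and check that single location
    public boolean find(String x) {
        return x.equals(table[h(x)]);
    }
}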
insertion sort with a linked list
At each iteration we simply remove the front node from the list and insert it in order into a second (new) list. We can create two separate lists and move the nodes from the old list into the new one; we are not creating any new nodes, just moving the ones we already have. Each node is removed from the original list and inserted into the second list in the proper order. We insert from front to back of the sorted list, comparing the data as we go. Note that in this case the worst case is when every node ends up at the end of the list, meaning the worst case occurs if the data is initially sorted. It is the same number of comparisons as the array version, and the sum is the same, but it comes from a different set of data. You could run a preprocessor that checks whether the data is sorted (so this case could be detected and skipped), but the check itself is O(N), so if there is a low probability of the list being sorted it is not worth it.
selection sort
At iteration i of the outer loop, find the ith smallest item and swap it into location i. As i increases, the number of items under consideration decreases, until you reach the last element, which does not need to be swapped. Here the outer for loop traverses from 0 to N-2 (the last element needs no pass of its own, assuming the array is of size N); we find the smallest remaining item using a nested for loop that starts at index i+1 and iterates through the rest of the array, then swap that smallest item into position i. Note that both loops are counter driven, meaning that no matter how the data is organized, the number of iterations is the same.
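A short sketch of the counter-driven structure described above, written for an int array rather than the generic Comparable version presumably used in class:

public static void selectionSort(int[] a) {
    int n = a.length;
    for (int i = 0; i <= n - 2; i++) {        // outer loop: position being filled
        int smallest = i;
        for (int j = i + 1; j <= n - 1; j++)  // inner loop: find the smallest remaining item
            if (a[j] < a[smallest])
                smallest = j;
        int temp = a[i];                      // swap the ith smallest into location i
        a[i] = a[smallest];
        a[smallest] = temp;
    }
}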
iterator implementation
Could we make it part of the list interface? This would give our list the extra functionality of an iterator and would allow the iterator methods to access the underlying list directly and be implemented efficiently (because they are both part of the same class). Drawback: if the iterator methods are part of the list, we have no way of creating/using multiple distinct iterations over the same underlying data (if we move the current item, it is changed globally; there is no way to have a second iterator with a different state). To fix this, we need to somehow separate the iterator from the list while still giving the iterator access to the data within the list. Solution: implement the iterator externally, but on top of the list (make each iterator a new object in a separate class, but give it access to the implementation of the underlying list so that it can access it efficiently, i.e. a private inner class). In standard Java, the List interface already has the iterator method built in, but the author has a separate list interface that provides the iterator. Implementation of the next method: it has two parts; first it saves the current node, then it moves the reference down one node and returns the data from the saved node (note that if hasNext returns false an exception is thrown).
stack idea
Data is added and removed from one end only (typically called the top). Logically, the top item is the only item that can even be seen. You can push an item onto the top of the stack, pop an item from the top of the stack, or peek at the top item without disturbing it. A stack also has the methods isEmpty and clear. A stack organizes data last in, first out (LIFO/FILO).
deque
Deque: double-ended queue (allows adds or removes at either end). Circular array: add to front ⇒ move the front index to the left; remove from back ⇒ move the back index to the left; both indices can wrap around either end of the array. Linked implementation: a singly linked list with front and back references, or a circular singly linked list with a back reference, worked for the queue but will not work for the deque, because you must be able to remove from the back (which requires the node before the last one). What does work is a circular, doubly linked list: it allows us to move in either direction and easily update either end of the deque.
linked implementations of queue
Doubly linked list: we have easy access to both the front and the end of the queue. We could also build our queue with a LinkedList object (what is done in the JDK); even though LinkedList can do a lot more than queue operations, if we use a Queue reference to the object we restrict it to the queue operations. Circular singly linked list (with a reference to the back node): add to the back, remove from the front; enqueue at the node one after the last node (making it the new last node); dequeue the front node. Preserving your own memory: when dequeuing, instead of removing the node and allowing it to be garbage collected, we keep it in the chain ourselves for reuse; this saves some overhead of creating new nodes. Two references: queueNode and freeNode. queueNode is the front of the queue (it will be the next node dequeued); freeNode is the first node after the rear of the queue (it will be the next node filled by an enqueue; if none are left we create a new node).
closed addressing
Each location in the hash table represents a collection of data; if we have a collision, we resolve it within the collection without changing hash addresses. The most common form is separate chaining: we use a simple linked list at each location in the table and place each new node at the end of its list. We could also insert at the front of the list (O(1)), but then we could not check for duplicates along the way. Performance of separate chaining depends on chain length: if a key is not found we must search through the entire chain. Chain length is determined by the load factor (alpha): average chain length = total number of nodes / M, and the total number of nodes is N, so the average chain length = N/M = alpha. As long as alpha is a small constant, performance will still be O(1). This means that, unlike in open addressing, N can be greater than M; this is a more graceful degradation than open addressing, where alpha has a hard bound of 1. However, if N is much greater than M it can still degrade to O(N) performance, so we may still need to resize the array (note that N would have to get much bigger than it would in open addressing). A poor hash function can also degrade this to O(N) (i.e. if we hash all items to the same list). We could come up with a separate chaining scheme that mitigates the damage caused by a poor hash function by choosing a better collection (e.g. a sorted array or a binary search tree), but we don't really do this because there is a reasonable assumption that the hash function is good.
How do we fix the cluster issues caused by linear probing?
Even after a collision, we need to make all of the locations available to a key; this way, the probability from filled locations is redistributed throughout the remaining empty locations in the table, rather than just being pushed down to the first empty location after the cluster. Idea of the solution: suppose C locations are full and M-C locations are empty. Instead of the insert probability of the C full locations falling on just a few locations, we want P(insert at a full location) == 0, with the C/M probability mass that would have landed on full locations spread so that each empty location picks up an extra (C/M)/(M-C). In other words, there is a probability of C/M that the hash value lands on a filled location, and we'd like that probability to be divided evenly among the M-C remaining open locations. Now, after a collision at index i, a key is still equally likely to be inserted at any remaining open location (rather than just being moved down to the bottom of the cluster). We can do this with double hashing.
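A hedged sketch of a double-hashing probe sequence; the particular second hash (1 + key mod (M-1)) is a common textbook choice and an assumption here, not necessarily the one used in lecture:

public class DoubleHashTable {
    private final Integer[] table;
    private final int M;                      // ideally a prime

    public DoubleHashTable(int m) {
        M = m;
        table = new Integer[m];
    }

    private int h1(int key) { return Math.floorMod(key, M); }

    // second hash: never 0, so every probe actually moves; step depends on the key
    private int h2(int key) { return 1 + Math.floorMod(key, M - 1); }

    // probe sequence: h1(k), h1(k)+h2(k), h1(k)+2*h2(k), ... (all mod M)
    public void insert(int key) {
        int index = h1(key);
        int step = h2(key);
        while (table[index] != null)          // assumes the table is not full
            index = (index + step) % M;
        table[index] = key;
    }
}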
random pivot index-quicksort
For each call, we choose a random index between first and last (inclusive) and use that as the pivot (we would have to swap it to the end of the array). The worst case could be just as bad as the simple pivot choice, but for the average case it is very unlikely that a random pivot will always be bad. Consider our initial choice of a pivot: the probability that it is an extreme (largest or smallest) value is 2/N. For the second call that probability is 2/(N-1), then 2/(N-2), 2/(N-3), and so on. To find the probability that all of these occur we multiply the probabilities: (2/N)*(2/(N-1))*(2/(N-2))*...*(2/1). This gives 2^N/N!, which is very small, essentially zero. Thus the random pivot has a worst case of N^2, but the probability is infinitesimal and it will usually run in NlgN. However, generating random numbers has a lot of overhead, so we would need to run an empirical analysis to see if it is actually faster; we would have to compare the overhead of finding the median in median-of-three to the overhead of generating random numbers.
quicksort-when to stop recursion?
For simple quicksort we stop when the logical size is one. However, the benefit of divide and conquer decreases as the problem size gets smaller; at some point the cost of recursion outweighs the savings you get from divide and conquer. For example, dividing an array of size 100 cuts 50 off the problem, while dividing an array of size 10 cuts only 5 off (a far smaller saving). If you think about the execution trace as a binary tree, the majority of the calls happen at the lower levels (most of the recursive overhead occurs at the bottom). Therefore an optimization is to cut off the bottom levels, getting rid of those recursive calls and their overhead. So it is good to stop the recursion early and switch to insertion sort at that point. Insertion sort is a good choice because it works well when the data is in a small array and doesn't have to move far (the array is mostly sorted). Note that changing the base case does not asymptotically improve the runtime; we are only changing it by a constant. This analysis could also apply to mergesort. Alternatively, we could stop at a base case > 1 but not sort the small subarrays in the recursive call at all; after all the recursion is complete, we then insertion sort the entire array. Even though insertion sort is poor overall, if the data is mostly sorted due to quicksort we will be close to the best case for insertion sort and maybe get better overall results. The size at which to stop would have to be determined empirically.
Functionality of an iterator
hasNext method: returns true if there are items left in the collection that have not yet been visited. next: returns the next item in the collection (throws an exception if no items remain). remove: removes the last item returned from the collection (destroys the list in the process of viewing it). Example of using an iterator: calculating the mode of a collection. We start at the first value, count how many times it occurs, and proceed to the next value (continuing all the way through, keeping track of the value with the highest count). We have two separate iterations: one going through the list identifying each item, and one counting the occurrences of that item. If we did this with nested loops using getEntry it would be N^2 for the array list and N^3 for the linked list, but if we use two iterators (the outer considering the next item in the list, the inner counting the number of occurrences of that item) we get an asymptotic time of N^2 no matter how the list is implemented. Note that these two iterators are independent, and that after initializing an iterator, the first call to next returns the first element.
Example 3: towers of hanoi problem
Here we have three towers; on the first one we have disks of decreasing size. Our goal is to get all the disks onto the last tower, but we can only move one disk at a time and we cannot put a larger disk on top of a smaller one. Basic solution idea: we have N disks and 3 towers (start, mid, and end). To move N disks from start to end, we move N-1 disks from start to mid, then move the last disk from start to end, then move the N-1 disks from mid to end. But can't we only move one disk at a time, not N-1 like the solution says? Since the solution is recursive, we break the problem down with each recursive call so that the actual move of a disk is always done one at a time.

Solution: in general, the pile of N-1 disks goes where you don't want the pile to end up, and the Nth disk goes where you do want it to end up. Say we have a tower of three disks: we want to move the top two disks to position 2 and the bottom to position 3. We look at the subproblem of moving the top two disks to position 2: we move the top disk to position 3 and the middle disk to position 2, then move the top disk onto position 2. Then we move the bottom disk to position 3. Now we have to move the middle and top disks from position 2 to position 3: we move the top disk to position 1, then the middle disk to position 3, then the top disk to position 3.

Code: the method accepts size, start, mid, and end parameters. First we test whether the size of the tower is 1; if it is, simply move that disk to where it wants to go (from start to end). If not, make a recursive call that swaps the end and mid positions (this is basically alternating where you want the top of the pile to go). Once the size variable equals one, the top disk is moved. Then the next disk is moved to the spot not taken by the top disk (where the pile wants to end up); this is the print statement in the current call. Then another recursive call is made, this time moving the smaller pile onto the disk that was just placed.
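A runnable version of the recursive structure described above; the parameter order (size, start, mid, end) follows the notes, while the print format is illustrative:

public class Hanoi {
    // Move 'size' disks from tower 'start' to tower 'end', using 'mid' as the spare.
    public static void solveTowers(int size, int start, int mid, int end) {
        if (size == 1) {
            System.out.println("Move disk from " + start + " to " + end);
        } else {
            solveTowers(size - 1, start, end, mid);   // move N-1 disks out of the way
            System.out.println("Move disk from " + start + " to " + end);  // move the largest disk
            solveTowers(size - 1, mid, start, end);   // move the N-1 disks onto it
        }
    }

    public static void main(String[] args) {
        solveTowers(3, 1, 2, 3);   // 2^3 - 1 = 7 moves
    }
}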
mergesort
How do we divide the problem? We can simply (logically) break the array in half based on index value, and continue recursively until we reach the base case of an array of size 1, which is already sorted. So if we have an array of size 8 we end up with 8 "arrays" of size one. Once the base case is reached, we have to determine how to put the pieces back together (this leads to the second question about using subproblems to solve the overall problem). How do we use the subproblems to solve the overall problem? Consider one call after both of its recursive calls have completed: at this point we have two sorted subarrays, one on the left and one on the right. We merge the two subarrays into an overall sorted array. Note that this is where we really do the "work" of the sort: we compare items and move them based on those comparisons. In pseudocode: if the logical size of A is greater than one, break A into left and right halves, recursively sort the left half, recursively sort the right half, then merge the two sorted halves together.

For the actual code: the merge method has variables marking the start of the first half, the end of the first half (mid), the start of the second half (mid+1), and the end of the second half (last). A loop runs as long as neither half has been exhausted; whichever front item is smaller goes into the current spot in the temp array (marked with an index variable incremented on each iteration). If one of the halves runs out first, one of two for loops fills the temp array with the rest of the other half's values (this works because both halves are sorted); only one of those loops will execute. We then copy the data back into the original array (the data in mergesort is not sorted in place!), which adds linear overhead to the runtime. For the actual mergesort method: since we don't want to make a new temp array with each recursive call (too much overhead), we make one temp array and pass it into the recursive method. If the first index is less than the last index, we call mergesort on first..mid, then on mid+1..last; after both recursive calls, the merge call is made, guarded by an if statement so that the merge only happens if the largest value in the left half is greater than the smallest value in the right half (if it were not, the two halves are already in order and no merging is needed). This guard is not required, but it can save runtime in some cases, since the merge step has the overhead of copying the data into temp and back into the array.
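A compact sketch matching this description (int array, one shared temp array, and the optional "skip the merge if already in order" check); while loops replace the for loops mentioned above, so this is a paraphrase rather than the exact lecture code:

public class MergeSortDemo {
    // Recursive mergesort over a[first..last]; one shared temp array is passed down.
    public static void mergeSort(int[] a, int[] temp, int first, int last) {
        if (first < last) {
            int mid = (first + last) / 2;
            mergeSort(a, temp, first, mid);        // sort left half
            mergeSort(a, temp, mid + 1, last);     // sort right half
            if (a[mid] > a[mid + 1])               // skip merge if halves are already in order
                merge(a, temp, first, mid, last);
        }
    }

    private static void merge(int[] a, int[] temp, int first, int mid, int last) {
        int left = first, right = mid + 1, index = first;
        while (left <= mid && right <= last)       // take the smaller front item each time
            temp[index++] = (a[left] <= a[right]) ? a[left++] : a[right++];
        while (left <= mid)   temp[index++] = a[left++];    // leftover left half
        while (right <= last) temp[index++] = a[right++];   // leftover right half
        for (int i = first; i <= last; i++)        // copy back: not an in-place sort
            a[i] = temp[i];
    }

    public static void main(String[] args) {
        int[] a = {8, 3, 5, 1, 9, 2};
        mergeSort(a, new int[a.length], 0, a.length - 1);
        System.out.println(java.util.Arrays.toString(a));   // [1, 2, 3, 5, 8, 9]
    }
}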
quicksort
How do we divide the problem? Instead of using index values, we break up the data based on how it compares to a special data value called the pivot. We compare all values to the pivot and place them into 3 groups (less than the pivot, equal to the pivot, greater than the pivot). Since we are dividing by comparing values to another value, the division between these groups will not generally be in half. How do we use subproblem solutions to solve our overall problem? We don't have to do anything; we are already comparing during partition. Since the pivot ends up in its correct spot, if we recursively sort the left side and recursively sort the right side, the whole array is sorted. So we don't even need to consider question 2 here (unlike mergesort); however, unlike mergesort, implementing question 1 takes a lot of work.

Primitive quicksort: we make the pivot value A[last]. If the size of A is > 1, choose the pivot value and partition the data. The data is not yet sorted, but we know that at least one item (the pivot) is in its correctly sorted location; partition is done in place, and the rest of the data is now more sorted than it was before, since each item is at least on the correct side of the array. We then recursively sort the left and right sides of the array. Why is there a second condition on the index from the right? It keeps us in bounds: we could have a situation where all the values are greater than the pivot, in which case we would decrement the index from the right below 0 and get an index out of bounds exception. We don't need this for the index from the left (which advances while values are less than the pivot) because even if all the values in the array are less than the pivot, we will eventually hit the pivot itself (the last item in the array), which ends the loop.

Basic idea of partition: we start with a counter on the left and a counter on the right of the array. As long as the data at the left counter is less than the pivot we do nothing (just increment the counter); as long as the data at the right counter is greater than the pivot we do nothing (just decrement the counter). The idea is that this data is already on the correct side, so we don't have to move it. When both counters "get stuck" it means there is data on the left that should be on the right and vice versa, so we swap the values and continue. When the left counter is >= the right counter, we stop and put the pivot in its place by swapping A[pivotIndex] with A[indexFromLeft] and setting pivotIndex = indexFromLeft. What happens if we come across an item on one side that needs to be swapped, but no item on the other side that needs to be swapped? Nothing! We thought the item needed to be swapped, but it was actually just the last item on the left side of the array.
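A sketch of the primitive pivot-at-the-end scheme, including the bounds check on the index from the right discussed above (the swap helper and the exact loop structure are my own paraphrase, not the lecture code):

public class QuickSortDemo {
    public static void quickSort(int[] a, int first, int last) {
        if (first < last) {
            int pivotIndex = partition(a, first, last);
            quickSort(a, first, pivotIndex - 1);   // recursively sort left of the pivot
            quickSort(a, pivotIndex + 1, last);    // recursively sort right of the pivot
        }
    }

    // Pivot is A[last]; returns the pivot's final (sorted) position.
    private static int partition(int[] a, int first, int last) {
        int pivot = a[last];
        int indexFromLeft = first;
        int indexFromRight = last - 1;
        while (indexFromLeft <= indexFromRight) {
            while (a[indexFromLeft] < pivot)       // will stop at the pivot itself at the latest
                indexFromLeft++;
            while (indexFromRight >= first && a[indexFromRight] > pivot)  // bounds check needed here
                indexFromRight--;
            if (indexFromLeft < indexFromRight) {  // both stuck: swap and continue
                swap(a, indexFromLeft, indexFromRight);
                indexFromLeft++;
                indexFromRight--;
            } else {
                break;
            }
        }
        swap(a, indexFromLeft, last);              // put the pivot into its correct spot
        return indexFromLeft;
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}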
median of three quicksort
How do we make the worst case rare? The problem with the simple pivot is that we chose the pivot from one specific location; we can fix this by choosing a pivot more intelligently. We don't pick the pivot from any one index; instead we consider 3 candidates each time we partition (the first, middle, and last elements). We order these three items, putting the smallest value into the first index, the middle value into the middle index, and the largest into the last index, so we know A[first] <= A[mid] <= A[last]. Now we use A[mid] as the pivot (the median of the three values we checked). Sorted data is now the best case (the median of three is the actual median of the data, so the runtime is NlgN). For the partition method we put the pivot in the second-to-last position, not the last as we did with the simple pivot, because we know A[last] must already be >= the pivot; we can therefore partition starting from first+1 and last-2. Note that the median-of-three method does not guarantee that the worst case of N^2 will not occur; it does guarantee that there is at least one item in each bin. If there is only one item in a bin for every call, we still get N^2. This only reduces the likelihood of the worst case and makes the situation in which it would occur non-obvious (there are still inputs that require O(N^2)). So we say the expected-case runtime is NlgN and the worst case is N^2 (this is true for both the simple pivot and median of three, but the worst case for median of three is much less likely to occur).
insertion sort
Idea: remove the items one at a time from the original array and insert them into a new array, putting them into the correct relative order as you insert. We could do this with two arrays, but that would double our memory requirements; instead we sort in place, which only uses a constant amount of extra memory. To do this, we split the array into a sorted section and an unsorted section. In each iteration of the outer loop, we take the leftmost item of the unsorted (right) section and put it into its correct relative location in the sorted section; the sorted section grows by one and the unsorted section shrinks by one. The code: we iterate from first + 1 (because a single element is already sorted) up to last. We then call another method that walks from one position left of the element being inserted back toward the start of the array; the loop condition uses compareTo so that we only keep going while the item being inserted is less than the element at the index we are checking. Inside the loop, we shift the item at that index one position to the right. After the loop exits, we place the item being inserted into the index + 1 spot.
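A small in-place sketch (ints instead of Comparable items; the helper logic is folded into one method rather than the two methods the notes describe):

public static void insertionSort(int[] a, int first, int last) {
    for (int unsorted = first + 1; unsorted <= last; unsorted++) {
        int toInsert = a[unsorted];                  // leftmost item of the unsorted section
        int index = unsorted - 1;
        while (index >= first && toInsert < a[index]) {
            a[index + 1] = a[index];                 // shift the larger item one spot to the right
            index--;
        }
        a[index + 1] = toInsert;                     // drop the item into the gap
    }
}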
Mergesort runtime
If we have an array of size N: at each level, the number of subproblems doubles but the size of each is cut in half, so the total number of comparisons needed at each level stays the same, O(N) (the number of calls doubles but the call sizes are halved, so it cancels out). Every call does the same amount of work, relative to its size, to put the items back together; for example at the level with problems of size N/4 there are 4 calls of size N/4, giving N total work. Since the size is cut in half with each call, there are lgN levels in total, so our total runtime is N*lgN (when multiplying big-O terms we do not throw out either factor). Note that we are looking at mergesort level by level when doing the analysis, but the actual execution is a tree execution like towers of hanoi (each call generates two calls): we recursively sort the left side of the array, going all the way down to the base case and merging back up, before even considering the right side. More formal analysis: the execution trace has lgN levels (because each call of size X spawns two calls of size X/2), and at each call of size X we do X work (to merge the subarrays back together). This leads to the sum N + 2(N/2) + 4(N/4) + ... If we assume that N = 2^k, this can be written as 2^0*2^k + 2^1*2^(k-1) + 2^2*2^(k-2) + ..., where the first factor in each term is the number of problems/calls at that level and the second factor is the size of each problem. Even though these terms look different, each one equals 2^k; in other words, we are adding 2^k a total of k+1 times, which gives the sum (k+1)*2^k. Since k = lgN, this is (lgN+1)*N, which is O(NlgN). Compared to binary search, we have lgN levels but we are doing N operations per level instead of 1. This holds for both the worst and average case.
why is quicksort not stable?
If we look at the partition method, the swap that exchanges two values when we come across items on the wrong sides does not take into account any equal items around them, so two equal keys can end up in the opposite relative order from the original data.
queue basic idea
In a queue, data is added to the end and removed from the front; logically, items other than the front cannot be accessed. Fundamental operations: enqueue an item to the end of the queue, dequeue an item from the front of the queue, and front (look at the front item without disturbing it). To implement, we need a structure that has access to both the front and the end, and we want the enqueue and dequeue operations to be O(1).
insertion sort vs selection sort
In insertion sort we compare X to items in the sorted section and place it among them; in selection sort we look for the smallest item in the unsorted section. Both have two for loops, but in insertion sort the inner loop stops when the item reaches its correct insertion point, while in selection sort the comparison is not part of the loop condition, so the inner loop always runs to the end. They end up with the same worst-case runtime, but they will not always have the same performance, since insertion sort's runtime can vary a lot with the input.
Example 2: boggle
In this problem you are given a grid and a word, and you have to figure out whether the word is located somewhere within the grid. Each letter must touch the previous letter, we can only move right/down/left/up, and we can only use a grid cell once in the word. Basic idea of the solution: each recursive call considers one location on the board; as the recursive calls "stack up" we match letters on the board with characters in the word. If we match all the characters in the word then we are done. If a word is K letters long, we would have K calls built up on the runtime stack, with the last call matching the last letter in the word. If we get stuck within a word we backtrack to try a different direction in our match (like the maze problem). Solution: we first traverse the squares in order until we find the starting letter; once we find it, we start to recurse. We look for the next letter in the order right, down, left, up, and when we have looked in all four directions and the letter is still not found, we backtrack.

Code: one method has two nested for loops that iterate through the rows and columns; it calls the recursive method (with parameters row, col, word, 0, board). If the word is found it stores the indices of the solution and prints the answer plus the board; if it is not found, the loop proceeds. The recursive method header has arguments for the row, the column, the word, the position in the word, and the board array. It first makes sure the row and column are within bounds, and that the char at the given row and column matches the current letter of the word. Then the char at that spot is turned to uppercase so that the same cell won't be used twice (can't go backwards), and so the board shows where the word is. The base case occurs when the location in the word equals word.length()-1. If this is not true, we set a boolean by trying right, and if that fails we try down, then left, then up: answer = findWord(r, c+1, word, loc+1, bo); // Right; if (!answer) answer = findWord(r+1, c, word, loc+1, bo); // Down; if (!answer) answer = findWord(r, c-1, word, loc+1, bo); // Left; if (!answer) answer = findWord(r-1, c, word, loc+1, bo); // Up. If after all of this the answer is not found, we need to backtrack: the row, column, and word index go back via the activation record, but any changes we made to the board (i.e. turning the letter uppercase) must be undone. If after all of that it is still not found, we go back to the first method and a new starting square is chosen.
bubble sort
Item j is compared to item j+1. If the data is in order, A[j] should be less than or equal to A[j+1], so we do nothing; if A[j] is greater, they are out of order, so we swap them. We keep making passes from the beginning until everything is sorted. Each pass moves at least one item to its correct spot within the array. Bubble sort is O(N^2) in the worst case.
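A minimal sketch of the pass structure; the early-exit "swapped" flag is an extra optimization I've added, not something stated in the notes:

public static void bubbleSort(int[] a) {
    int n = a.length;
    for (int pass = 0; pass < n - 1; pass++) {
        boolean swapped = false;
        for (int j = 0; j < n - 1 - pass; j++) {   // item j compared to item j+1
            if (a[j] > a[j + 1]) {                 // out of order: swap
                int t = a[j]; a[j] = a[j + 1]; a[j + 1] = t;
                swapped = true;
            }
        }
        if (!swapped) break;                       // no swaps on this pass: already sorted
    }
}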
open addressing-how do we keep the table mostly empty
Linear probing: resize when alpha ⇒ 1/2; double hashing: resize when alpha ⇒ 3/4. We must monitor the logical size (number of entries) vs. the physical size (array length) to calculate alpha. Then we resize the array and rehash all of the values when alpha gets past the threshold.
problems with mergesort
Mergesort's runtime of NlgN is a definite improvement over the runtime of the primitive sorts (asymptotically it is optimal, as it achieves the lower bound). However, in order to "merge" we need an extra array for temporary storage (we are not sorting in place). This adds memory requirements (although O(N) extra memory is not really a big deal); more importantly, copying to and from this extra memory slows the algorithm down in real terms (the asymptotic runtime is very good, but when actually timed in practice we can do better). Why doesn't copying data affect the asymptotic runtime? We have N items to assign into the temp array and we do at most N-1 comparisons (this is O(N)); then we have to copy the items back from the temp array (also O(N)). The total is O(N) + O(N) = O(N), so the extra linear amount of work to copy the data back slows the algorithm down empirically but not asymptotically.
Quicksort vs Mergesort
Mergesort has a more consistent runtime than quicksort (mergesort is always NlgN, whereas quicksort has a worst case of N^2). However, in the normal case quicksort outperforms mergesort (due to the extra array and the copying of data, mergesort is normally slower than quicksort). This is why many predefined sorts in programming languages are quicksort (e.g. in the JDK, Arrays.sort() for primitive types uses a dual-pivot quicksort). However, quicksort is not a stable sort: if we are given two equal items X1 and X2, where X1 appears before X2 in the original data, mergesort will keep X1 before X2, but this is not guaranteed with quicksort (e.g. if you sort alphabetically and then by salary, in a stable sort two people with the same salary remain in alphabetical order). Thus, for more complex (object) types, where we may want to sort on different fields, it may be better to use mergesort even if it is slower. This can be seen in Java: up through JDK 6, Java used mergesort for objects and quicksort for primitive types (stability does not matter for primitive types, so you use the faster sort; stability does matter for objects, so you use the stable but slower sort). In JDK 7+ they switched to Timsort for objects, which is more complicated but faster and still stable (it is derived from mergesort but incorporates several different approaches).
boyer moore method
Mismatched character heuristic: we compare the pattern to the text from right to left instead of left to right. Now, if we mismatch a character early, we have the potential to skip many characters with only one comparison (with one mismatch we may move down the entire length of the pattern, M positions). Assuming our search progresses with the mismatch always occurring at the end, about N/M comparisons are required (for text length N and pattern length M); this is a big improvement over the brute force algorithm, since it is sublinear. Will the search always progress like that? Not always, but when searching text with a relatively large alphabet we often encounter characters that do not appear in the pattern, and the algorithm takes advantage of this. Note that we might not be able to slide all M locations: if the mismatched text character does appear in the pattern, we must not slide farther than the position where that character is last seen (rightmost) in P.

How do we figure out how far to skip? We preprocess the pattern to create a right array, indexed by all the characters in the alphabet, where each value indicates how far we can skip if that text character mismatches the pattern. It is initialized to -1 for every character (meaning the character is not in the pattern); each character in the pattern is then given the value of the index where it appears in the pattern, and if a character appears more than once we use the largest (rightmost) index. The larger the value in the right array, the less we can skip. In the algorithm we then skip by either 1 or j - right[mismatched char], whichever is larger (j being the position in the pattern, which runs from M-1 down to 0).

Can the mismatched character heuristic be poor? Yes, for example text XXXXXXXXXXXX with a pattern like YXX...X: the pattern must be completely compared (M character comparisons, right to left) before we mismatch and skip. Note that in this case the computed skip j - right['X'] would be negative (j = 0 while right['X'] = M-1), so we would actually skip just 1. Also note that this is the opposite of the worst case in the brute force algorithm! Runtime: we do M comparisons, move down 1, do M comparisons, and so on; we move down N-M+1 times before the pattern goes past the end of the text, which gives a total of (N-M+1)(M) comparisons, or O(MN), the same as the brute force worst case. This is why the Boyer-Moore method has two heuristics. The second heuristic guarantees that the runtime will never be worse than linear: the mismatched character heuristic alone has a worst case of MN, while with the second heuristic the worst case is linear, O(N), and the best case is about N/M comparisons. Note that the Boyer-Moore algorithm has some preprocessing overhead due to the creation of the right array.
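A hedged sketch of building the right array and computing a skip, assuming a 256-character alphabet (the class and method names are illustrative, not from the lecture code):

public class BoyerMooreSkip {
    // Build the "right" array: rightmost index of each character in the pattern, -1 if absent.
    public static int[] buildRight(String pattern) {
        int[] right = new int[256];                  // assume an extended-ASCII alphabet
        java.util.Arrays.fill(right, -1);
        for (int j = 0; j < pattern.length(); j++)
            right[pattern.charAt(j)] = j;            // later (larger) indexes overwrite earlier ones
        return right;
    }

    // Skip amount when text character c mismatches pattern position j: always at least 1.
    public static int skip(int[] right, char c, int j) {
        return Math.max(1, j - right[c]);
    }

    public static void main(String[] args) {
        int[] right = buildRight("YXX");
        System.out.println(skip(right, 'X', 0));     // mismatch at j=0 against 'X': skip 1
        System.out.println(skip(right, 'Q', 2));     // 'Q' not in pattern: skip 3 (whole pattern)
    }
}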
runtime for simple sorting algorithms (insertion sort, selection sort, bubble sort)
Note that all of these simple sorting algorithms have similar runtimes, O(N^2), in the worst case. For a small number of items their simplicity makes them okay to use, but for a larger number of items this isn't a good runtime.
Backtracking example 1: 8 queens problem
Problem: how can I place 8 queens on a chessboard such that no queen can take any other in the next move? Queens can move horizontally, vertically, or diagonally. Backtracking is needed when we reach a dead end, where no more queens can be placed on the board. Basic idea of the solution: all queens must be in different rows and columns; since the board is 8x8, there must be one queen in every column and one in every row. Note: though each queen is on its own diagonals, not every diagonal needs a queen. Using recursion: to place 8 queens on the board we place a queen in a legal row and column, then recursively place the remaining 7 queens. Using backtracking: our initial choices may not lead to a solution, so we need a way to undo a choice and try another one.

Solution: each recursive call attempts to place one queen in a specific column; within each call, a loop iterates through the 8 rows of that column. For a given call, the state of the board from previous placements is known (i.e. where are the other queens?) and is used to determine whether a square is legal. If we place a queen in column 7, we are finished. If in some column k there are no legal spots (all rows have been tried and the loop is over), the call terminates and backtracks to the previous call in column k-1; if that call finds another legal row, we again move forward to column k. Note that backtracking can occur multiple times, and we may backtrack through more than one column (we could backtrack all the way to the first column if needed).

Code: first you have a for loop that loops through the rows. Inside, you determine whether the current position (row, col) is safe. You do this by checking whether the row and both diagonals are safe; the column is always safe because you are working column by column. Row safety is tracked with a boolean (true/false) array. For the left diagonals (starting at the bottom left corner), the row and column indices of every square on a diagonal add to the same value, so you can use row+col as the index into a left-diagonal array; for the right diagonals (starting at the top left corner), you can use the difference row-col, adding 7 so that the index is non-negative. If the square is safe, mark the current row, left diagonal, and right diagonal as false (taken). If you are on the seventh column, you are done; if not, make a recursive call to the next column. That recursive call keeps advancing columns and placing queens until you reach the seventh column, or it terminates when its loop finishes without success. If it ends because the loop ends, the current activation record is popped off and you return to the previous recursive call; there you undo what you did before (setting the row and diagonals back to true), and the loop iterates again (meaning the queen is moved to the next safe row).
shellsort
Rather than comparing adjacent items, we compare items that are farther away from each other. Specifically, we compare and "sort" items that are K locations apart for some K; i.e. we insertion sort subarrays of our original array whose elements are K locations apart. We gradually reduce K from a large value to a small one, ending with K = 1. Note that when K = 1 the algorithm is straight insertion sort (but by the time we get to it, the data won't have to move very far). From the outside this looks like it would be worse than insertion sort (the last iteration is a full insertion sort, plus the previous iterations do insertion sorts of subarrays), but when it is actually timed it outperforms insertion sort. Insertion sort has a very good best-case runtime, O(N) (if the data is sorted), and shellsort moves toward this best case by sorting subarrays of elements K apart (so by the time we get to the full insertion sort the data is very close to the best case). A good implementation of shellsort will have about O(N^(3/2)) execution time. Code: the simplest implementation has the gap start at N/2. We have a while loop that iterates while the gap K > 0 and halves K after each iteration. Inside the loop, a for loop starts at the first element and goes up to first + gap, incrementing by one each time; for example, if the gap is 4 this loop runs 4 times (once for each interleaved subarray), and if the gap is 2 it runs twice. It calls another method that does insertion sort, but steps by the gap instead of by one.
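A sketch using the simple gap sequence N/2, N/4, ..., 1 described above; the gap-stride insertion sort is inlined here rather than split into a separate method as in the lecture code:

public static void shellSort(int[] a) {
    int n = a.length;
    for (int gap = n / 2; gap > 0; gap /= 2) {        // shrink the gap: n/2, n/4, ..., 1
        for (int start = 0; start < gap; start++) {   // one interleaved subarray per start index
            // insertion sort of a[start], a[start+gap], a[start+2*gap], ...
            for (int i = start + gap; i < n; i += gap) {
                int toInsert = a[i];
                int j = i - gap;
                while (j >= start && toInsert < a[j]) {
                    a[j + gap] = a[j];                // shift by 'gap' instead of by 1
                    j -= gap;
                }
                a[j + gap] = toInsert;
            }
        }
    }
}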
iterators
Recall what the list interface is: a set of methods that indicates the behavior of the classes that implement it. Nothing is specified about how the classes that implement the list are themselves implemented (i.e. the list could be stored in an array or a linked list). How can the user of any list class access all the data in a sequential way? We could copy the data into an array and return the array. The list interface does have the getEntry method, but this is really bad for the linked list: getEntry(1) is 1 operation, getEntry(2) is 2 operations, getEntry(3) is 3 operations, giving 1 + 2 + 3 + ... ⇒ N(N+1)/2 = O(N^2). This is so poor for a linked list because each getEntry restarts at the beginning of the list (there is no memory from one call to the next). What if we could remember where we stopped the last time and resume from there the next time? Iterator: a program component that allows us to iterate through a list in a sequential way, regardless of how the list is implemented. The iterator maintains the state of the iteration and progresses one location at a time (we don't start at the beginning each time; we resume where we left off). The details of how we progress are left up to the implementers (abstraction); the user just knows it goes through the data.
collision
Simple hashing fails in the case of a collision: this is where h(x1) == h(x2) even though x1 != x2. Two distinct keys hash to the same location!
what is a sorting algorithm?
Sorting is a very common and useful process (we sort names, salaries, movie grosses, etc.), so it is important to understand how sorting works and how it can be done efficiently. By default we will consider sorting in increasing order: for all indices i, j: if i < j, then A[i] <= A[j] (this allows for duplicates). For decreasing order we change this to A[i] >= A[j].
selection sort runtime
The key instruction is a comparison. Since we don't do the comparison in the loop header, the iterations of both for loops are based solely on the indices (we iterate through all indices no matter what). In the outer loop i goes from 0 to N-2; in the inner loop j goes from i+1 to N-1, and we do one comparison in each iteration of the inner loop. Thus it does not matter how the data is initially organized: there is no best or worst case; all cases iterate the same number of times and do the same number of comparisons. When i = 0 the inner loop goes from 1 to N-1 ⇒ N-1 comparisons; when i = 1 it goes from 2 to N-1 ⇒ N-2 comparisons; when i = 2 it goes from 3 to N-1 ⇒ N-3 comparisons; ...; when i = N-2 the inner loop goes from N-1 to N-1 ⇒ 1 comparison. The total is 1 + 2 + ... + (N-1), which gives the sum (N-1)(N)/2, so O(N^2) ⇒ the same as the worst case of insertion sort.
insertion sort runtime
The key instruction is the comparison of array items. Worst case: reverse sorted data (high to low), because each item moves all the way to the front: 1 comparison to insert the second element, 2 comparisons for the third, all the way up to N-1 comparisons for the last (we don't do anything with the first element). This is a variation of the special sum 1+2+3+...+N = N(N+1)/2 (VERY IMPORTANT); here the total is (N-1)N/2, which gives O(N^2). On average the actual number of comparisons is a bit better, but the asymptotic time is still O(N^2). Exam question: what is the best case for insertion sort? A sorted array, because there would be one comparison for each element, or N-1 total comparisons ⇒ O(N). This means that insertion sort is a good sort for almost-sorted data, but by itself it has a high asymptotic time.
recursive sorting algorithms
The lower bound for comparison-based sorting is NlgN in the worst case (having a lower bound does not necessarily mean it is achievable to write code with that runtime, but for comparison sorting it is). We can use divide and conquer to sort: we define sorting an array of N items in terms of sorting one or more smaller arrays (for example an array of size N/2); this works well for recursive algorithms. How can we apply divide and conquer to sorting? We have two questions to consider. First, how do we divide our array into subproblems (do we break the array by index, or some other way)? Second, how do we use the solutions of the subproblems to determine the overall solution (i.e. once our recursive call(s) are complete, what more needs to be done to complete the sort)?
Issue with iterators (remove method)
The structure of iterators allows access to the underlying list, and we can also change the underlying list with methods like remove. However, if we modify the underlying list via another access path (through the list's own methods or through a different iterator) during an iteration, we will get a ConcurrentModificationException the next time we call next(). We would be changing the underlying state unbeknownst to the iterator, which could have unexpected results, so Java does not allow this. Note that it is fine if we modify through the same iterator.
mergesort runtime compared to towers of hanoi
Towers of hanoi required 2^N - 1 moves, while mergesort only requires about NlgN comparisons. This is because of how the problem size decreases: for towers of hanoi it shrinks to N-1, for mergesort to N/2. Towers of hanoi: the execution trace has N levels (each call of size X spawns two calls of size X-1), and at each call we do one move. This leads to a sum of 1 + 2 + 4 + ... = 2^0 + 2^1 + 2^2 + ... + 2^(N-1), or O(2^N). Therefore, even though the execution traces look similar, the runtimes do not.
quicksort runtime
Unlike mergesort, which had the same average and worst cases because it was index based, the performance of quicksort depends on the quality of the divide (how the other values relate to the pivot). In the situation where the pivot is always the middle value, we get an execution trace that is O(NlgN), similar to mergesort: the tree height is lgN and the work at each level is N. However, since an extra array is not needed in quicksort, the measured runtime will actually be faster than mergesort. In the situation where the pivot is always the extreme value of the partition (the worst case, when the pivot is the smallest or largest item): if we consider the case where the pivot is always the largest item, there are no items greater than the pivot, so that side of the partitioned array is empty (the situation could also be the opposite, where all the data is greater than the pivot). For the worst case to occur, this must happen every time partition is called.

Analysis: since the comparisons are done in partition, we look at the partition algorithm for each call. In the first call, partition must compare N-1 items to the pivot, and all values (except the pivot) are less than the pivot; this spawns a single recursive call of size N-1. In the second call, partition must compare N-2 items to the pivot; this spawns a single recursive call of size N-2. The resulting sum is 1 + 2 + ... + (N-1) ⇒ (N-1)(N)/2, or O(N^2). This is the same runtime as the primitive sorts, but it plays out worse in real time because we also have recursive overhead. Why is this so poor? Recall the idea of divide and conquer, where recursive calls are a fraction of the original size; in this case the recursive calls are only one smaller than the original size (N-1), so we lose the power of divide and conquer and our runtime ends up O(N^2).

Thus quicksort has two very different runtimes depending on the data circumstances; how will it perform normally/on average? This depends on how the data is distributed and how the pivot is chosen. If the data is already sorted and we pick the pivot to be the last item in the array, then every time we pick the pivot there are no values greater than it (so the recursive subarray shrinks by only one item each time). This means our worst case occurs when the data is already sorted (the pivot is always the greatest element and there is no data in the "greater than the pivot" partition); reverse-sorted data is also a worst case (now the pivot is always the smallest item). Both are O(N^2). Is sorted data something we may come across regularly (i.e. something we should be worried about)? In real life we might sort twice in a row or sort almost-sorted data (after adding one new item). However, for random data quicksort should perform well, since it is unlikely that poor pivots will consistently be chosen. Therefore, in the average/expected case, the primitive quicksort has O(NlgN) runtime (even though the worst case is N^2). Still, we don't want a relatively simple scenario to lead to worst-case behavior (the worst case should be rare).
rabin karp method
Using hashing for string matching: if we calculate a hash function for strings (using the powers-of-256 idea), we know that if the integer values match, the strings match too. Note that we need to keep these integer values at a reasonable size; we do this by taking the hash values mod a large integer, guaranteeing that we won't get overflow. We also need to be able to incrementally update a hash value so that we can progress down the text string looking for a match: we don't want to rehash from scratch every time we move down by one, as this would give us a runtime of MN (rehashing M characters at each of roughly N positions). Using the properties of modular arithmetic, a character can be removed from the beginning of the string's hash and a new one added to the end: with each mismatch we remove the leftmost character from the hash value and add the next character of the text to it. We are still moving left to right in the text, but we are not directly comparing characters; we are comparing hash values. Note: hashing only guarantees no collisions if h(x) is unique for every key, i.e. the key space is no bigger than the table size; that is not practical to implement here, so we might get a collision (the hash values match but the strings do not). What do we do if a collision occurs? Monte Carlo version: we don't check for collisions; the algorithm is guaranteed to run in O(N) time and is highly likely to be correct, but it could fail if a collision occurs. Las Vegas version: we check for collisions; the algorithm is highly likely to run in O(N) time (the worst case would be a collision at every position, each mismatching only on the last character, giving O(MN)), but it is guaranteed to be correct. Rabin-Karp does not improve the normal-case runtime of brute force.
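A hedged sketch of the rolling-hash update at the heart of Rabin-Karp, in the Monte Carlo style (a hash match is reported without re-checking characters); the base 256 and the particular prime modulus are assumptions, not values from the lecture:

public class RabinKarpSketch {
    private static final int R = 256;                  // alphabet size (base)
    private static final long Q = 1_000_000_007L;      // large prime modulus

    // Returns the index of the first match of pat in txt, or -1.
    public static int search(String pat, String txt) {
        int m = pat.length(), n = txt.length();
        if (m > n) return -1;
        long rm = 1;                                   // R^(m-1) mod Q, used to remove the leading char
        for (int i = 1; i < m; i++) rm = (rm * R) % Q;

        long patHash = hash(pat, m), txtHash = hash(txt, m);
        if (patHash == txtHash) return 0;
        for (int i = m; i < n; i++) {
            // remove the leftmost character, shift, then add the next text character
            txtHash = (txtHash + Q - rm * txt.charAt(i - m) % Q) % Q;
            txtHash = (txtHash * R + txt.charAt(i)) % Q;
            if (patHash == txtHash) return i - m + 1;  // Monte Carlo: trust the hash match
        }
        return -1;
    }

    private static long hash(String s, int m) {        // Horner-style hash of the first m chars
        long h = 0;
        for (int i = 0; i < m; i++) h = (h * R + s.charAt(i)) % Q;
        return h;
    }

    public static void main(String[] args) {
        System.out.println(search("needle", "find the needle here"));  // 9
    }
}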
How do we reduce the number of collisions?
We do so with a good hash function. With a good hash function, collisions are a pseudo-random occurrence (collisions will occur, but due to chance, not because of similarities or patterns in the keys); when collisions do occur it is because we are unlucky. A good hash function must utilize the entire key (if possible) and exploit any differences between keys, and it should utilize the full address space of the hash table.

Examples of creating hash functions: (1) consider one based on phone numbers of Pitt students where M = 1000. Attempt 1: the first 3 digits of the number (412). This is a bad idea because most people will have the same first three digits (the area code). Attempt 2: take the phone number as an integer % M. This is better because we are then using the last 3 digits, which are pseudo-random. (2) A hash for words in a table of size M. Attempt 1: add the ASCII values. Problem 1: this does not fully exploit the differences in the keys (even though we use the entire key, we don't take into account the positions of the characters). Problem 2: we don't use the full address space (even small words will have hash values in the hundreds, and larger hash values will be well below 1000); thus if M = 1000 there will likely be collisions in the middle of the table and many empty locations at the beginning and end. Attempt 2: to do better, we need to utilize all of the characters, keep track of their positions, and use all of the table. We can do this by thinking about integers: 123 is different from 321 even though they have the same digits, because each position has a different power of ten: 123 = 1*10^2 + 2*10^1 + 3*10^0. We can do something similar for hash values of arbitrary strings. Integers with given digits in given positions differ because we have 10 digits and each location is a different power of ten; for ASCII characters we have 256 of them, so we multiply each character by a different power of 256. This will definitely distinguish the hash values of all strings. Will this utilize all of the table? The table is size M, and these numbers get large very quickly, so we need to wrap the value by M (take % M). In practice these numbers quickly become larger than even a long can store, so if we store them in an int or a long the values will wrap and no longer be unique for each string (this is okay; it just results in a collision). Calculating the values should be done efficiently so that h(x) can be computed quickly; Horner's method can be applied to calculate h(x) values efficiently.

One other good approach to hashing (a string): choose M to be a prime number and calculate the hash function as h(x) = f(x) % M, where f(x) is some function that converts x into a large "random" integer in an intelligent way. It is not actually random, but the idea is that if the keys are converted into very large integers (much bigger than the number of actual keys), collisions will still occur because of the pigeonhole principle, but less frequently.
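A brief sketch of a Horner's-method string hash using the powers-of-256 idea; overflow is simply allowed to wrap (as the notes allow), and the sign-bit mask is my addition:

// Horner's method: h = ((...((c0*256 + c1)*256 + c2)...)*256 + c_(n-1)) % M
public static int hash(String key, int M) {
    int h = 0;
    for (int i = 0; i < key.length(); i++) {
        h = h * 256 + key.charAt(i);     // positions matter: each char effectively gets a different power of 256
    }
    return (h & 0x7fffffff) % M;         // clear the sign bit after any wrap, then map into 0..M-1
}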
why is mergesort stable?
We do the comparison in the merge method as compareTo <= 0, meaning we only bring an item over from the right subarray when it is strictly less than the current left item If there is a tie, we take the item from the left side first, so equal items keep their original relative order If we used < instead (taking the right item on ties) the sort would not be stable
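A sketch of the relevant part of merge (array and index names are mine), showing where the <= 0 matters:

// merges the sorted runs a[lo..mid] and a[mid+1..hi] using the auxiliary array temp
static <T extends Comparable<? super T>> void merge(T[] a, T[] temp, int lo, int mid, int hi) {
    int i = lo, j = mid + 1, k = lo;
    while (i <= mid && j <= hi) {
        if (a[i].compareTo(a[j]) <= 0) temp[k++] = a[i++];  // tie -> take the LEFT item: stable
        else                           temp[k++] = a[j++];  // right item only when strictly smaller
    }
    while (i <= mid) temp[k++] = a[i++];
    while (j <= hi)  temp[k++] = a[j++];
    for (int x = lo; x <= hi; x++) a[x] = temp[x];          // copy the merged run back
}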
radix sort
We know that comparison based sorts have a lower bound of NlgN, but can we do better than that? Consider an array of strings We could use a comparison based sort that uses compareTo to get NlgN runtime However, if we recognize that a string is an array of characters, we can take a different approach Consider the positions in each string (from rightmost to leftmost) and the character value at each position Instead of comparing these characters to one another, we use each as an index into a "bin" (actually a queue) of strings, based on the ASCII character values We first look at the last character of each string and put the string into the proper queue (we have an arraylist of queues of strings) We then copy the data in order from the queues back into the array and consider the next character to the left Since a queue is first in first out, the relative ordering of strings placed in the same bin remains the same Why does this work? Each time we put the data into the bins we are sorting based on that character (there is an implicit ordering of the bins, so placing strings in the proper bin orders them) We proceed from the least significant to the most significant character Strings that agree on characters 0 through k first differ at some later position; the pass on that position puts them in the correct relative order, and that order does not change during the later passes on positions k down to 0 (those characters are equal and the bins are FIFO) Note that direct comparison of strings goes from left to right, but radix sort processes characters from right to left What do we do if strings are different lengths? We pad the shorter strings on the right with a character that is smaller than 'A' so they are all the same length Therefore a string with no character at that position will be placed before one that has a character there '@' is a good choice because it comes immediately before 'A' in ASCII
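A minimal sketch of the bin-of-queues pass in Java, assuming for simplicity that every string has already been padded to the same length K (method and variable names are mine):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Queue;

static void radixSort(String[] a, int K) {
    final int R = 256;                                  // one bin per extended-ASCII character
    ArrayList<Queue<String>> bins = new ArrayList<>(R);
    for (int i = 0; i < R; i++) bins.add(new ArrayDeque<>());
    for (int pos = K - 1; pos >= 0; pos--) {            // rightmost character first
        for (String s : a)
            bins.get(s.charAt(pos)).add(s);             // distribute by the character at pos
        int idx = 0;
        for (Queue<String> bin : bins)                  // collect in bin order; FIFO keeps it stable
            while (!bin.isEmpty())
                a[idx++] = bin.poll();
    }
}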
radix sort runtime + other faults
We must iterate through each position in a string (K positions) For each position we must iterate through all of the strings, putting each into a bucket (O(N)): an enqueue is O(1) so doing it for all N items is O(N), and a dequeue is O(1) so removing them from the buckets and putting them back into the array is also O(N) If the max string length is K and the length of the array is N, this yields a runtime of O(KN) (here K is an actual character count, not an asymptotic variable) If we consider K to be a constant, this runtime is O(N) However, there is considerable overhead We need space overhead for the bins (O(N)) and time overhead for copying (the sort is not in place) We also have overhead in extracting the individual values (for a string this isn't a problem, but for a radix sort of an int, for example, isolating each digit requires some math) Because of this overhead, even though radix sort might be asymptotically better than a comparison sort, in actual runtime it might not be Also, even though KN looks smaller than NlgN, for some inputs K may be larger than lgN (medium or small arrays with long keys) Also, radix sort is not a generally applicable sorting algorithm To use it we must be able to break our key into separate values for which an ordering can be utilized Comparison based sorts allow arbitrary logic to be used for the comparison (as long as it is implemented in the compareTo method), perhaps even utilizing multiple data values
why is 8 queens so hard to solve iteratively
We need to store state information as we try (and un-try) many queen locations on the board Consider placing one queen per column: we need to know the current column index as well as which row the queen occupies in each earlier column We need to keep track of this information efficiently, and we must be able to undo placements on the board when we backtrack The runtime stack does this automatically via activation records (the col parameter and the row local variable are separate for each call of the method) Without recursion, we would need to store and update this information ourselves This can be done (by using our own stack rather than the runtime stack), but since the mechanism is already built into recursive programming, why not utilize it? The iterative solution does not even have the benefit of lower overhead, since we would be maintaining our own stack anyway
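A sketch of the recursive version (the board representation and names are mine, not the exact code from class), showing how col and row live in each activation record:

// queenInRow[c] = row of the queen already placed in column c (for columns < col)
static boolean placeQueens(int[] queenInRow, int col) {
    int n = queenInRow.length;
    if (col == n) return true;                      // every column has a queen: solved
    for (int row = 0; row < n; row++) {             // row is a local variable of THIS call
        if (isSafe(queenInRow, col, row)) {
            queenInRow[col] = row;                  // try this placement
            if (placeQueens(queenInRow, col + 1))   // recurse on the next column
                return true;
            // falling through "un-tries" the placement; the runtime stack restores col and row
        }
    }
    return false;                                   // no row works: the caller must backtrack
}

static boolean isSafe(int[] queenInRow, int col, int row) {
    for (int c = 0; c < col; c++) {
        int r = queenInRow[c];
        if (r == row || Math.abs(r - row) == Math.abs(c - col))   // same row or same diagonal
            return false;
    }
    return true;
}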
How do we implement hashing in the dictionary interface?
We will need an array of Entry<K,V> for our table We will then have a constructor that finds the next prime number for the table size Add method (adding key value pairs into the hash table) ⇒ linear probing Enlarge hash table method: we want a new hash table that is larger (and prime in size) and we rehash all of the values into that table Get value method Remove method How would separate chaining change this? We would still have an array, but instead of an array of Entry<K,V> it is an array of linked lists of Entry<K,V> (each location represents the front of a linked list)
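A rough sketch of the add method with linear probing (Entry, hashTable, numberOfEntries, getHashIndex, isTooFull, and enlargeHashTable are assumed names, and the DELETED-state handling discussed later is omitted):

public V add(K key, V value) {
    int index = getHashIndex(key);                     // assumed: key's hash mapped into [0, table length)
    while (hashTable[index] != null
            && !hashTable[index].getKey().equals(key))
        index = (index + 1) % hashTable.length;        // linear probe: try the next slot, wrapping around
    V oldValue = null;
    if (hashTable[index] == null) {
        hashTable[index] = new Entry<>(key, value);    // empty slot found: insert the new pair
        numberOfEntries++;
        if (isTooFull())
            enlargeHashTable();                        // bigger (prime) table, rehash everything
    } else {
        oldValue = hashTable[index].getValue();        // key already present: replace its value
        hashTable[index].setValue(value);
    }
    return oldValue;
}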
open addressing: double hashing
When a collision occurs, increment the index (mod table size) just as in linear probing, but instead of using an increment of one, use a second hash function h2(x) to determine the increment This way keys that hash to the same location will likely not have the same increment Having h1(x1) == h1(x2) with x1 != x2 (with a good hash function) is bad luck, but ALSO having h2(x1) == h2(x2) is REALLY bad luck (it should occur even less frequently) It also allows a collided key to move anywhere in the table Note that linear probing is basically just a special case of double hashing (where the increment is always one instead of coming from a second hash function) The way this works in practice: the first hash gives an absolute address (go to location i), the second hash gives an increment This creates more clusters that are shorter in length and thereby slows the degradation of hashing Note that we will still get collisions, but because h2(x) varies for different keys, it spreads the data throughout the table even after an initial collision However, to make sure that double hashing actually works we must also make sure the increment is > 0 (if the increment is 0 the probe won't go anywhere) and make sure no index is tried twice before all indices are tried once To guarantee that, make M a prime number
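A small sketch of the probe sequence (Entry is an assumed class; using h2(x) = 1 + hash % (M-1) is just one common textbook choice that keeps the increment in 1..M-1):

static <K, V> int findSlot(K key, Entry<K, V>[] table) {
    int M = table.length;                              // assumed prime
    int h = key.hashCode() & 0x7fffffff;               // non-negative hash
    int index = h % M;                                 // h1: absolute starting location
    int step  = 1 + (h % (M - 1));                     // h2: increment, never 0
    while (table[index] != null && !table[index].getKey().equals(key))
        index = (index + step) % M;                    // same increment on each probe for this key
    return index;
}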
open addressing: how do we fix the delete problem?
Why is deletion a problem? If you have a cluster, delete one of the middle values in the cluster, and then search for a value stored after the deleted one, you will get that the value does not exist (the probe chain was broken) For linear probing: rehash all the keys from the deleted key to the end of the cluster This is not a lot of work if the table is mostly empty (the clusters are short, so few items need to be moved) We can't do this for double hashing because keys do not form one contiguous chain, so we can't determine which cluster a key came from Rehashing everything instead would be O(N) (the same as copying the data over), and it is not clear that is worth it How do we handle deletion for double hashing? ⇒ consider 3 states for every location in the table Empty: will stop an insert (put the value there), will stop a find (return not found) Full: will not stop an insert (keep looking), will not stop a find unless it holds the key we are looking for (we are inside a cluster) Deleted: will stop an insert (re-use the location), will NOT stop a find (the slot was inside a cluster and we don't want to stop the search early)
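A sketch of the three-state idea (the enum, the parallel state array, and the helper names are mine):

enum SlotState { EMPTY, FULL, DELETED }

// assumed fields: Entry<K,V>[] table; SlotState[] state;  (parallel arrays)
public V getValue(K key) {
    int M = table.length;
    int index = hashIndex(key);                        // assumed: maps key into [0, M)
    for (int probes = 0; probes < M; probes++) {
        if (state[index] == SlotState.EMPTY)
            return null;                               // EMPTY stops a find: not in the table
        if (state[index] == SlotState.FULL
                && table[index].getKey().equals(key))
            return table[index].getValue();            // found inside a cluster
        index = nextIndex(index, key);                 // assumed: +1 (linear) or +h2(key) (double), mod M
        // a DELETED slot falls through to here -- it does NOT stop the search
    }
    return null;                                       // every slot probed
}
// remove would use the same probe loop and, on a match, set state[index] = DELETED
// rather than clearing the slot, so later searches can still probe past it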
what makes insertion sort's performance poor?
With each comparison you either do nothing or move the data one location (which is a very small amount) If the data is greatly out of order, it will take a lot of comparisons to get it into the right order (as each comparison is only one shift) If we can move the data farther with one comparison, we can improve our runtime
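For reference, a standard insertion sort, where each comparison in the inner loop shifts the data at most one position:

static <T extends Comparable<? super T>> void insertionSort(T[] a) {
    for (int i = 1; i < a.length; i++) {
        T item = a[i];
        int j = i - 1;
        while (j >= 0 && a[j].compareTo(item) > 0) {
            a[j + 1] = a[j];   // one comparison, one shift of a single position
            j--;
        }
        a[j + 1] = item;
    }
}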
Can we guarantee that collisions do not occur?
Yes, but only when the size of the key space (K) is less than or equal to the table size M Key space: the number of possible keys that could be put in the table When K <= M there is a technique called perfect hashing that can ensure no collisions This also works if N <= M and the keys are known in advance, which effectively reduces the key space to N (N being the actual number of keys put in the table) Pigeonhole principle: if we have more pigeons (potential keys) than pigeonholes (table locations), at least two pigeons must share a pigeonhole This is usually the case Example: an employer is using social security numbers as keys, with M = 1000 and N = 500 It seems like we should be able to avoid collisions since the table will never be full However, the key space is 10^9 since we do not know in advance what the 500 keys will be (employees are hired and fired, so the keys change) To guarantee perfect hashing the table size would have to be 10^9
Rec exercise 6: adding diagonals to word boggle
You basically just need to add extra if statements to check for the additional (diagonal) directions

if (!answer) answer = findWord(r+1, c,   word, loc+1, bo);  // down
if (!answer) answer = findWord(r,   c-1, word, loc+1, bo);  // left
if (!answer) answer = findWord(r-1, c,   word, loc+1, bo);  // up
if (!answer) answer = findWord(r-1, c+1, word, loc+1, bo);  // up-right
if (!answer) answer = findWord(r-1, c-1, word, loc+1, bo);  // up-left
if (!answer) answer = findWord(r+1, c+1, word, loc+1, bo);  // down-right
if (!answer) answer = findWord(r+1, c-1, word, loc+1, bo);  // down-left
dual pivot quicksort
What if we choose more than one pivot? In dual pivot quicksort we choose two pivots (P1 and P2, with P1 <= P2) and create three partitions Items that are < P1 Items that are >= P1 and <= P2 Items that are > P2 This yields 3 subarrays that must be sorted recursively As long as the pivots are chosen wisely, this is an incremental improvement over traditional quicksort It has been incorporated into the JDK in Java for sorting of primitive types
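A sketch of a dual pivot partition for ints (this is not the JDK's actual implementation, which adds many optimizations such as pivot selection and insertion sort cutoffs):

static void dualPivotQuickSort(int[] a, int lo, int hi) {
    if (lo >= hi) return;
    if (a[lo] > a[hi]) swap(a, lo, hi);               // make sure p1 <= p2
    int p1 = a[lo], p2 = a[hi];
    int lt = lo + 1, i = lo + 1, gt = hi - 1;
    while (i <= gt) {
        if (a[i] < p1)      swap(a, i++, lt++);       // grow the "< p1" region
        else if (a[i] > p2) swap(a, i, gt--);         // grow the "> p2" region (re-examine a[i])
        else                i++;                      // belongs to the middle region
    }
    swap(a, lo, --lt);                                // drop p1 between regions 1 and 2
    swap(a, hi, ++gt);                                // drop p2 between regions 2 and 3
    dualPivotQuickSort(a, lo, lt - 1);                // items < p1
    dualPivotQuickSort(a, lt + 1, gt - 1);            // items between p1 and p2
    dualPivotQuickSort(a, gt + 1, hi);                // items > p2
}

static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }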
list iterator methods
hasNext, next, hasPrevious, previous, nextIndex, previousIndex, remove, set, add
open addressing: linear probing
If a collision occurs at location i, try (in sequence) locations i+1, i+2, ... (mod M, wrapping around the table) until the collision is resolved For insert: the collision is resolved when an empty location is found For find: the collision is resolved (found) when the item is found during the sequence of probes, and resolved (not found) when an empty location is found or when the index circles back to i Performance: O(1) for insert and search in normal use (when the table is not very full), because at most a few probes will be required before the collision is resolved Issues: what happens as the table fills with keys? Load factor (alpha): a = N/M (the fraction of the table that is full) How does alpha affect linear probing performance? Consider a hash table of size M that is empty, using a good hash function Given a random key x, the probability that x will be inserted into any particular location i in the table is 1/M Now consider a hash table of size M in which a cluster of C consecutive locations is filled Given a random key x, what is the probability that x will be inserted into the location immediately following the cluster? (C+1)/M Why? The probability of x hashing to any given location is still 1/M, but if x hashes to any of the C locations in the cluster, it will end up in the location just after the cluster Thus we have the C locations in the cluster plus the one directly after it (any key hashing within the cluster is placed in the spot after the cluster, increasing that spot's probability) Why is this bad? A collision in linear probing is resolved found when the key is found (somewhere within the cluster), or resolved not found when an empty location is found (which may require traversing the entire cluster) As the clusters get longer we need more probes in both situations, but especially for not found As alpha increases the cluster size begins to approach M, and search times degrade from O(1) to O(M), i.e. O(N) (hashing degenerates into sequential search)
sorting linked lists (mergesort vs quicksort)
Merge sort is more viable for linked lists, but there is more overhead For example, how do you divide a linked list in half (based on node position)? ⇒ we have to traverse the nodes to get to the middle, which is O(N) However, this does not add any extra asymptotic work, since the work required for each call of merge is already O(N) No new nodes need to be made We first get the middle node by traversing the list, then we separate the two halves To merge the two halves back together we don't need to copy into an auxiliary array and back like we had to for the array based mergesort Quicksort could work, but it essentially requires a doubly linked list and the partition overhead would be more than is worthwhile
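A sketch of the divide step for a singly linked list (the Node class and names are mine; the caller handles lists of length < 2 as the base case and tracks the length):

// assumed node class: class Node<T> { T data; Node<T> next; }
// cuts the list in two after the first length/2 nodes and returns the start of the second half
static <T> Node<T> splitInHalf(Node<T> first, int length) {
    Node<T> current = first;
    for (int i = 1; i < length / 2; i++)   // O(N) walk to the middle: no extra asymptotic cost
        current = current.next;
    Node<T> secondHalf = current.next;     // first node of the right half
    current.next = null;                   // break the chain between the halves
    return secondHalf;
}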
array based implementations of queue
primQ1 and primQ2 were not effective because they had an O(N) operation The circular array allowed both enqueue and dequeue to be O(1)
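A minimal circular array queue sketch (class and field names are mine; resizing when full is omitted):

class CircularQueue<T> {
    private T[] items;
    private int front = 0;      // index of the front item
    private int size  = 0;      // number of items currently stored

    @SuppressWarnings("unchecked")
    CircularQueue(int capacity) { items = (T[]) new Object[capacity]; }

    void enqueue(T item) {
        if (size == items.length) throw new IllegalStateException("full");  // resizing omitted
        items[(front + size) % items.length] = item;   // wrap around with mod: O(1)
        size++;
    }

    T dequeue() {
        if (size == 0) throw new IllegalStateException("empty");
        T item = items[front];
        items[front] = null;                           // let the old slot be garbage collected
        front = (front + 1) % items.length;            // no shifting: O(1)
        size--;
        return item;
    }
}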
Collision resolution
redesigning our hashing operations to work despite collisions Open addressing If a collision occurs at index i in the table, try alternative index values until the collision is resolved Thus a key may not necessarily end up in the location that its hash function indicates We must choose alternative locations in a consistent, predictable way so that items can be located correctly Our table can store at most M keys Closed addressing Each index i in the table represents a collection of keys Thus a collision at location i simply means that more than one key will be in or searched for within the collection at that location The number of keys that can be stored in the table depends on the max size allowed for the collections
iterator method implementations
review!
current search algs
to search we are given some collection C and some key value K-we have to find/retrieve some object whose key matches K For an arr/vector Unsorted: sequential search O(N) Sorted: binary search O(lgN) For a linked list Unsorted: sequential search using pointers O(N) Sorted: sequential search O(N)-can't do binary search with a linked list So right now we are looking at O(lgN) as the best time for searching But these searching methods have all involved the direct comparison of keys Could we possibly do better by using a different approach?
basic idea of string matching
we are given a pattern string P of length M and a text string A of length N, we want to figure out if all characters in P match a substring of the characters in A, starting from some index i
string matching brute force algo
We start at the beginning of the pattern and the text and go left to right, character by character If a mismatch occurs we restart the comparison one position over from the previous starting position in the text, back at the beginning of the pattern This uses nested loops Runtime in the normal case: we mismatch right away or after only a few character matches at each location in the text, so we do roughly constant work per position and may have to go through the entire text of length N ⇒ O(N) The longer the prefix of the pattern that matches at each position, the longer the inner loop runs, so the worst case is when the pattern is almost completely compared (about M comparisons) each time we move one index down in A This happens, for example, when the pattern is XXY and the text is XXXXXXXXXXXXY We would do roughly M comparisons at each of the text start positions 0, 1, ..., N-M, so the total is about M*(N-M+1) = O(MN) Because the worst case is a bit contrived (not very likely), this is still a good algorithm; the Java SDK uses it for the indexOf() method How can we improve the brute force algorithm? We could improve the worst case performance This is good theoretically, but in reality the worst case does not occur very often We could improve the normal case performance This would be very helpful, especially for searches in long text files
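A sketch of the brute force search (the method name is mine, not the actual indexOf source):

// returns the first index where pat occurs in txt, or -1 if it never does
static int bruteForceSearch(String txt, String pat) {
    int n = txt.length(), m = pat.length();
    for (int i = 0; i <= n - m; i++) {                 // each possible start position in the text
        int j = 0;
        while (j < m && txt.charAt(i + j) == pat.charAt(j))
            j++;                                       // keep matching until a mismatch or the end of pat
        if (j == m) return i;                          // the whole pattern matched at position i
    }
    return -1;                                         // up to ~M comparisons at each of N-M+1 positions
}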
Backtracking
Proceed forward toward a solution until no solution can be achieved along the current path At that point, undo part of the partial solution (backtrack) to a point where you can proceed forward again and try a different path