INTERVIEW Datavant / General / InterviewCake, April 2019
Where logs come up in algorithms and interviews
"How many times must we double 1 before we get to nn" is a question we often ask ourselves in computer science. Or, equivalently, "How many times must we divide nn in half in order to get back down to 1?" both : log(b2)(n)
What does log_10(100) mean?
"what power must we raise 10 to, to get 100"? The answer is 2.
How to trim down a dictionary even more?
'Hash' the contents first, to store a constant-size fingerprint of each file in our dictionary instead of the whole file itself.
Disadvantages of linked lists
(1) They have O(i)-time lookups: the worst case to look up an item in the list is O(n), vs. O(1) lookups in arrays and dynamic arrays. (2) Linked list "walks" are not cache-friendly: the next node could be anywhere in memory!
formula for the number of nodes on the last level of a binary tree?
(n + 1) / 2, where n is the total number of nodes. Ex: a perfect tree with levels 1 + 2 + 4 + 8 has n = 15 nodes total, and (15 + 1) / 2 = 8 of them are on the last level (the levels form a doubling series).
How does the binary search algorithm work?
*Start with a sorted list.* Repeatedly divide the range in half, checking the middle of the range against the target. While the range has more than one candidate: if guess == target, we've found it (base case). If the range narrows down to nothing without a match, exit the loop and return False. A minimal sketch follows.
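A minimal iterative sketch of that idea (my own, assuming a sorted list of comparable items; the name binary_search isn't from the card):

def binary_search(target, nums):
    """Return True if target is in the sorted list nums."""
    floor_index = -1
    ceiling_index = len(nums)

    # While there's at least one unchecked index between floor and ceiling
    while floor_index + 1 < ceiling_index:
        # Guess the midpoint of the remaining range
        half_distance = (ceiling_index - floor_index) // 2
        guess_index = floor_index + half_distance
        guess_value = nums[guess_index]

        if guess_value == target:
            return True
        if guess_value > target:
            # Target would be to the left, so move the ceiling down
            ceiling_index = guess_index
        else:
            # Target would be to the right, so move the floor up
            floor_index = guess_index

    return False

e.g. binary_search(5, [1, 3, 5, 7]) returns True; binary_search(4, [1, 3, 5, 7]) returns False.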
Cake's solution to the duplicate file problem:

import os
import hashlib


def find_duplicate_files(starting_directory):
    files_seen_already = {}
    stack = [starting_directory]

    # We'll track tuples of (duplicate_file, original_file)
    duplicates = []

    while len(stack):
        current_path = stack.pop()

        # If it's a directory, put the contents in our stack
        if os.path.isdir(current_path):
            for path in os.listdir(current_path):
                full_path = os.path.join(current_path, path)
                stack.append(full_path)

        # If it's a file
        else:
            # Get its hash
            file_hash = sample_hash_file(current_path)

            # Get its last edited time
            current_last_edited_time = os.path.getmtime(current_path)

            # If we've seen it before
            if file_hash in files_seen_already:
                existing_last_edited_time, existing_path = files_seen_already[file_hash]
                if current_last_edited_time > existing_last_edited_time:
                    # Current file is the dupe!
                    duplicates.append((current_path, existing_path))
                else:
                    # Old file is the dupe!
                    duplicates.append((existing_path, current_path))
                    # But also update files_seen_already to have
                    # the new file's info
                    files_seen_already[file_hash] = (current_last_edited_time, current_path)

            # If it's a new file, throw it in files_seen_already
            # and record its path and last edited time,
            # so we can tell later if it's a dupe
            else:
                files_seen_already[file_hash] = (current_last_edited_time, current_path)

    return duplicates


def sample_hash_file(path):
    num_bytes_to_read_per_sample = 4000
    total_bytes = os.path.getsize(path)
    hasher = hashlib.sha512()

    with open(path, 'rb') as file:
        # If the file is too short to take 3 samples, hash the entire file
        if total_bytes < num_bytes_to_read_per_sample * 3:
            hasher.update(file.read())
        else:
            # Integer division, so seek() gets a whole number of bytes
            num_bytes_between_samples = (
                (total_bytes - num_bytes_to_read_per_sample * 3) // 2
            )

            # Read first, middle, and last bytes
            for offset_multiplier in range(3):
                start_of_sample = (
                    offset_multiplier
                    * (num_bytes_to_read_per_sample + num_bytes_between_samples)
                )
                file.seek(start_of_sample)
                sample = file.read(num_bytes_to_read_per_sample)
                hasher.update(sample)

    return hasher.hexdigest()
- As we go, take a "fingerprint" (hash) of each file in constant time using the first few, middle few, and last few bytes
- Build a dictionary of fingerprint : (last edited time, file path)
- If a file's fingerprint is already in the dict, assume we have a duplicate
- Assumptions: two different files won't have the same fingerprint; the most recently edited file is the duplicate; two files with the same contents are the same file
OVERALL TIME IS O(n)
How to write a function that checks if a string is a permutation of a palindrome? https://www.interviewcake.com/question/python3/permutation-palindrome?course=fc1&section=hashing-and-hash-tables
- Check that each character left of the middle has a corresponding copy right of the middle
- Equivalently, check that every character appears an even number of times, with AT MOST ONE character appearing an odd number of times (it would sit in the middle)
- Data structure to use: DICTIONARY (char : char_count, or char : bool)
- What if we just track whether or not each character appears an odd number of times? Then we can map characters to booleans. This is more explicit (we don't have to check each count's parity, we already have booleans) and we avoid the risk of integer overflow if some characters appear a huge number of times.
What actually goes in a function's stack frame? It's not the function's name.
- Local variables - Arguments passed to the function - Information about the caller's status - "return address" - what line to return to once the function returns
merge sort idea
- divide list in half - sort the two halves - merge the sorted halves into one sorted whole - repeat chunk recursively.
What is happening when you slice an array, e.g. arr[:3]? Why is it O(n)?
1) allocating a new list 2) copying the elements from the original list to the new list Overall, adds O(n) time and space
Why use hashing?
1. Dictionaries - suppose we want to look up values based on arbitrary "keys" instead of just indices. We can translate keys into hash values; this is actually how dictionaries work.
2. Preventing "man-in-the-middle" attacks - for data sent over the internet, we publish the expected hash; if the received data hashes to something else, it may have been corrupted or tampered with.
hashing function
1. Must be one-way (ex: "sum up all the character codes, then mod the sum") - you can't recover the input from the output
2. Variable-length input produces fixed-length output
3. Must have few or no collisions
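A quick illustration of the fixed-length-output property using Python's hashlib (sha256 here is my choice for the demo; the file-dedup cards elsewhere use sha512):

import hashlib

# Variable-length inputs, fixed-length outputs (sha256 -> 64 hex characters)
for text in ["a", "lies", "a much, much longer input string"]:
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    print(len(digest), digest[:16] + "...")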
hashing prob presented - how big does he propose to make the blocks?
4000 bytes, because the disk reads in blocks of about that size anyway, and taking a bigger sample of the file makes it more likely that different files get different fingerprints.
greedy algorithm
A greedy algorithm builds up a solution by choosing the option that looks the best at every step. merge_ranges was a greedy Careful: sometimes a greedy algorithm doesn't give you an optimal solution: - When filling a duffel bag with cakes of different weights and values, choosing the cake with the highest value per pound doesn't always produce the best haul. - To find the cheapest route visiting a set of cities, choosing to visit the cheapest city you haven't been to yet doesn't produce the cheapest overall itinerary.
Fisher-Yates Shuffle
An algorithm that randomizes the order of a list of items in place. It's a fair ("truly random") shuffle because it can be proven that for each index, the probability that any given item ends up there is 1/n. See the sketch below.
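A sketch of an in-place Fisher-Yates shuffle (my own; assumes Python's random.randrange for the random pick):

import random

def fisher_yates_shuffle(items):
    """Shuffle a list in place so every ordering is equally likely."""
    for index_we_are_choosing_for in range(len(items) - 1):
        # Pick a random not-yet-placed item (including the current one)
        random_index = random.randrange(index_we_are_choosing_for, len(items))

        # Swap it into the spot we're choosing for
        items[index_we_are_choosing_for], items[random_index] = (
            items[random_index], items[index_we_are_choosing_for])

Choosing from the *remaining* items (not the whole list each time) is what makes every permutation equally likely.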
stack overflow or, "RecursionError" in Py 3.6
An error condition that occurs when the call stack runs out of space, e.g. when recursion goes too deep and keeps pushing new stack frames onto a stack that's already at its maximum capacity. Python surfaces this as RecursionError ("maximum recursion depth exceeded").
Overall summary of arrays and data structures: each has its tradeoffs. You can't have it all.
Arrays have fast lookups but their sizes need to be specified ahead of time. There are two ways to get around this: dynamic arrays and linked lists. Linked lists have faster appends and prepends than dynamic arrays, but dynamic arrays have faster lookups. Fast lookups are really useful, especially if you can look things up not just by indices (0, 1, 2, 3, etc.) but by arbitrary keys ("lies", "foes"...any string). That's what hash tables are for. The only problem with hash tables is they have to deal with hash collisions, which means some lookups could be a bit slow.
How to improve a recursive function's space complexity
Ex:
def prod(n):
    return 1 if n <= 1 else n * prod(n - 1)
-> Make it an iterative function:
def prod_iter(n):
    result = 1
    for num in range(1, n + 1):
        result *= num
    return result
As we iterate through the loop, the local variables change, but we stay in the same stack frame because we don't call any other functions.
Lists - lookup complexity
Average Case : O(1) Worst case: O(1)
Lists - append complexity
Average Case : O(1) Worst case: O(n)
Lists - delete complexity
Average Case : O(n) Worst case: O(n)
Lists - insert complexity
Average Case : O(n) Worst case: O(n)
Lists - space complexity
Average Case : O(n) Worst case: O(n)
Why are hash tables / dictionaries NOT cache friendly?
Behind the scenes, they use linked lists, which do not put data together in memory!
Most simple operations on fixed-width integers (addition, subtraction, multiplication, division) take constant time (O(1) time).
But that efficiency comes at a cost: their values are limited. Specifically, they're limited to 2^n possibilities, where n is the number of bits.
Size vs. Capacity
Say we've appended 4 items, but the underlying array has a length of 10. We'd say this dynamic array's size is 4 and its capacity is 10. It keeps track of where the "filled" part ends by storing an end_index, leaving the extra capacity for later appends.
Strengths of array
Fast lookups - O(1) time to look up an element by index, regardless of length
Fast appends - O(1) time to add a new element to the end of the array
Weaknesses of array
Fixed size Costly inserts & deletes - need to "scoot over" - O(n)
Approach for messy problems
For messy problems like this, focus on clearly explaining to your interviewer what the trade-offs are for each decision you make. The actual choices you make probably don't matter that much, as long as you show a strong ability to understand and compare your options.
log(base-10)(100) = 2. The 100 is the "answer"; the small number (here 10) is the base. That is: what do I need to raise 10 to in order to get 100?
General structure of log questions
In a perfect tree, what is the tree's height(h)? or, how many "levels" does the tree have?
If we count the number of nodes on each level, we notice it successively doubles as we go: 1, 2, 4, 8, ... The total number of nodes n in a perfect binary tree is always odd: the first level always has 1 node, and every level after that has an even number of nodes. The last level holds (n + 1) / 2 nodes, so h = log(base-2)((n + 1) / 2) + 1 = log(base-2)(n + 1), which is roughly log(base-2)(n). (See the short derivation below.)
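A short derivation of those counts (my own working, same quantities as above):

n = 1 + 2 + 4 + \dots + 2^{h-1} = 2^h - 1
\;\Rightarrow\; 2^h = n + 1
\;\Rightarrow\; h = \log_2(n + 1) \approx \log_2 n

and the last level alone holds 2^{h-1} = \tfrac{n+1}{2} nodes.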
"Perfect tree"
In a perfect binary tree, every level is completely full: every internal node has exactly two children and all the leaves are on the same level.
when a computer does a read, how does it grab contents?
In constant-size chunks called 'blocks'. On a Mac, the default block size is 4 KB.
hash collisions - how to handle? *Interesting*
Instead of storing the actual hash values in our array, have each array slot hold a pointer to a linked list, which holds the counts for all the words that hash to that index.
how to handle hash collisions? *just one way of many!
Instead of storing values directly in the array, have each array slot store a POINTER to a LINKED LIST, which in turn holds one node per key that hashed to that index. Each node here has:
- a key ("lies")
- a value (20)
- a next pointer to the next node ("foes" : 1), whose next is None at the end
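A toy sketch of that idea (separate chaining); all class and method names here are made up for illustration, not a real dict implementation:

class Node:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.next = None  # pointer to the next node that hashed to this slot

class ChainedHashMap:
    def __init__(self, num_slots=30):
        # each slot holds a pointer to the head of a linked list (or None)
        self.slots = [None] * num_slots

    def _index(self, key):
        # toy hash: sum of character codes, modded by the number of slots
        return sum(ord(char) for char in key) % len(self.slots)

    def set(self, key, value):
        index = self._index(key)
        node = self.slots[index]
        while node:  # walk the chain looking for an existing key
            if node.key == key:
                node.value = value
                return
            node = node.next
        new_node = Node(key, value)  # not found: prepend a new node
        new_node.next = self.slots[index]
        self.slots[index] = new_node

    def get(self, key):
        node = self.slots[self._index(key)]
        while node:
            if node.key == key:
                return node.value
            node = node.next
        raise KeyError(key)

Usage: m = ChainedHashMap(); m.set("lies", 20); m.set("foes", 1); m.get("lies") returns 20 even if "lies" and "foes" land in the same slot.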
pointer
Integers stored in an array (or elsewhere) whose value is the address of another spot in memory.
Fixes:
- Items in an array no longer have to be the same length
- We don't need uninterrupted free memory to store the items themselves in one contiguous block
Cons:
- Pointers are not cache-friendly
- Slight slowdown from the extra memory hop
To avoid slicing (which is costly for space/time), what can you usually do?
Keep track of the indices in the list!
a data structure that can store a string, has fast appends, and doesn't require you to say how long the string will be ahead of time
Linked Lists! Possible to use pointers to do this
example of a hashing function for "lies"
l -> 108
i -> 105
e -> 101
s -> 115
sum = 429; 429 % 30 = 9
Modding our sum by 30 makes sure we get a whole number less than 30, which we can use as an index in our dictionary's underlying array. **Irreversible**
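The same toy hash written out in Python (ord() gives the character codes used above; 30 is the slot count assumed in the card):

def toy_hash(word, num_slots=30):
    # Sum the character codes, then mod by the number of slots
    return sum(ord(char) for char in word) % num_slots

# toy_hash("lies") -> (108 + 105 + 101 + 115) % 30 = 429 % 30 = 9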
Time to add a node to the end of a linked list
O(1)
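A minimal sketch of why (my own illustration; assumes the list keeps a tail pointer, which is what makes the O(1) append possible):

class LinkedListNode:
    def __init__(self, value):
        self.value = value
        self.next = None

class LinkedList:
    def __init__(self):
        self.head = None
        self.tail = None

    def prepend(self, value):
        # O(1): just rewire the head pointer
        node = LinkedListNode(value)
        node.next = self.head
        self.head = node
        if self.tail is None:
            self.tail = node

    def append(self, value):
        # O(1) because we keep a tail pointer; without it we'd
        # have to walk the whole list first, which is O(n)
        node = LinkedListNode(value)
        if self.tail is None:
            self.head = self.tail = node
        else:
            self.tail.next = node
            self.tail = node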
What is the time complexity of this?

def is_riffle_best(half1, half2, deck):
    half1_ix = 0
    half2_ix = 0
    half1_max_ix = len(half1) - 1
    half2_max_ix = len(half2) - 1

    for card in deck:
        # If we still have cards in half1 and the "top" card in half1 is the same,
        if half1_ix <= half1_max_ix and card == half1[half1_ix]:
            half1_ix += 1
        # If we still have cards in half2 and the "top" card in half2 is the same,
        elif half2_ix <= half2_max_ix and card == half2[half2_ix]:
            half2_ix += 1
        else:
            return False

    # If all cards in the shuffled deck have been accounted for, this is a riffle!
    return True
O(n) time and O(1) additional space: one pass through the deck, tracking only a couple of index variables.
Runtime of binary search
O(log(n)) Why: "how many times must we divide our original list size (n) in half until we get down to 1?" In binary search, we are dividing the original list size in half that many times!
Sorting time costs O(______) in general
O(n * log(base-2)(n)) ...the best worst-case runtime for sorting
What is the complexity of this?

def merge_ranges(meetings):
    """
    A feature to see the times in a day when everyone is available.

    In HiCal, a meeting is stored as a tuple of integers (start_time, end_time).
    These integers represent the number of 30-minute blocks past 9:00am.
    https://www.interviewcake.com/question/python3/merging-ranges?course=fc1&section=array-and-string-manipulation

    >>> merge_ranges([(0, 1), (3, 5), (4, 8), (10, 12), (9, 10)])
    [(0, 1), (3, 8), (9, 12)]
    """
    # Merge meeting ranges
    # Want to do O(n) - What if we sorted our list of meetings by start time?
    sort_meetings = sorted(meetings)

    merged_meetings = [sort_meetings[0]]

    for curr_start, curr_end in sort_meetings[1:]:
        last_merged_start, last_merged_end = merged_meetings[-1]

        if curr_start <= last_merged_end:
            merged_meetings[-1] = (last_merged_start, max(last_merged_end, curr_end))
        else:
            merged_meetings.append((curr_start, curr_end))

    return merged_meetings
O(n log n)
Why: even though we only walk through our list of meetings once to merge them, we sort all the meetings first, giving us a runtime of O(n log n). It's worth noting that if our input were already sorted, we could skip the sort and do this in O(n) time!
Space is O(n): we create a new list of merged meeting times. In the worst case, none of the meetings overlap, giving us a list identical to the input list, so we have a worst-case space cost of O(n).
Note: the "brute force" algorithm would have been O(n^2), looping through the list for each meeting. However, since we decided to sort first and then use a greedy approach, we found a better runtime.
The brute force approach would be to check every permutation of the input string to see if it is a palindrome.
O(n!) permutations to check (and each check is itself O(n)). Ouch...
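For contrast, a sketch of what that brute force could look like using itertools.permutations (my own; the O(n) set-based solution on the next card is far better):

from itertools import permutations

def has_palindrome_permutation_brute_force(the_string):
    # Try every ordering of the characters: O(n!) orderings,
    # each compared against its reverse in O(n) time
    for perm in permutations(the_string):
        if perm == perm[::-1]:
            return True
    return False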
Time complexity of this:

def has_palindrome_permutation(the_string):
    # Track characters we've seen an odd number of times
    unpaired_characters = set()

    for char in the_string:
        if char in unpaired_characters:
            unpaired_characters.remove(char)
        else:
            unpaired_characters.add(char)

    # The string has a palindrome permutation if it
    # has one or zero characters without a pair
    return len(unpaired_characters) <= 1
O(n)
Why: we make one iteration through the n characters in the string. The unpaired-characters set is the only thing taking up non-constant space.
What is the runtime of a single doubling operation?
O(n) With an array, you must copy all n items from the array, where each item takes O(1) time, so overall it is O(n) worst case.
Space/time complexity of this:

def sort_scores(scores, highest):
    """
    Write a function that takes:
    - scores: a list of unsorted scores
    - highest: the highest possible score in the game
    and returns a sorted (descending) list of scores in less than O(n lg n) time
    (where n = len(scores)).

    >>> sort_scores([37, 89, 41, 65, 91, 53], 100)
    [91, 89, 65, 53, 41, 37]
    """
    # Count how many times each score appears
    counts = {}
    for s in scores:
        counts[s] = counts.get(s, 0) + 1

    # Walk the possible scores from highest to lowest, appending each
    # score to the result once per time it appeared
    sorted_scores = []
    for score in range(highest, -1, -1):
        for _ in range(counts.get(score, 0)):
            sorted_scores.append(score)

    return sorted_scores
O(n) time and O(n) space.
** Wait, aren't we nesting two loops towards the bottom? So shouldn't it be O(n^2) time?
Notice what those loops iterate over. The outer loop runs once for each possible score (at most highest + 1 times, a constant for a given game); the inner loop runs once for each time that score occurred, so across the whole outer loop it runs n times total.
What is the time complexity of this reverse-list in place alg?

def reverse_list(lst):
    """
    Write a function that takes a list of characters and reverses
    the letters in place. (Note: lists are mutable but strings are not.)

    >>> reverse_list(['a', 'b', 'c'])
    ['c', 'b', 'a']
    >>> reverse_list(['a', 'b'])
    ['b', 'a']
    >>> reverse_list([])
    []
    >>> reverse_list(['c'])
    ['c']
    """
    left = 0
    right = len(lst) - 1

    while left < right:
        # Swap, move toward middle:
        lst[left], lst[right] = lst[right], lst[left]
        left += 1
        right -= 1

    return lst
O(n) time, O(1) space. We swap in place with two pointers, so we don't allocate any new list (building a reversed copy instead, e.g. lst[::-1], would cost O(n) extra space).
What is the time complexity of this version 2?

def is_riffle_iter(half1, half2, deck, deck_ix=0, half1_ix=0, half2_ix=0):
    """
    A better RECURSIVE function that takes a deck and checks if it has been
    shuffled in a "riffle" way.

    Note this has better time complexity than the previous recursive version!
    Instead of slicing the lists, it just keeps track of indexes.
    """
    # Base case is still the same: we have hit the end of the deck
    if deck_ix == len(deck):
        return True

    # If we still have cards in half1 and the top card is the same
    # as the top card in deck,
    if half1_ix < len(half1) and half1[half1_ix] == deck[deck_ix]:
        half1_ix += 1

    # If we still have cards in half2 and the top card is the same
    # as the top card in deck,
    elif half2_ix < len(half2) and half2[half2_ix] == deck[deck_ix]:
        half2_ix += 1

    # If either half is depleted OR there are no matches,
    else:
        return False

    # Move on to the next card:
    deck_ix += 1
    return is_riffle_iter(half1, half2, deck, deck_ix, half1_ix, half2_ix)
O(n) time, but still O(n) space for the recursive call stack. We can do better: the iterative is_riffle_best above runs in O(1) space.
What is the time complexity of this?

def is_riffle(half1, half2, deck):
    """
    Write a RECURSIVE function that takes a deck and checks if it has been
    shuffled in a "riffle" way.

    ## If deck is a "riffle" of half1 and half2, the first card from deck
    ## should be either the same as the first card from half1 or the same as
    ## the first card from half2.
    ## Go through the deck, matching and throwing out cards as you match.
    ## If we get to the end, return True.
    """
    # Base case: we've matched every card in the deck
    if len(deck) == 0:
        return True

    # If half1 isn't empty and its top card matches the top of the deck,
    if len(half1) and half1[0] == deck[0]:
        # Take the tops off half1 and deck and recurse
        return is_riffle(half1[1:], half2, deck[1:])

    # Same check against half2
    elif len(half2) and half2[0] == deck[0]:
        return is_riffle(half1, half2[1:], deck[1:])

    # Top of deck doesn't match the top of either half, so it's not a riffle
    else:
        return False
O(n^2)
Each recursive call slices the lists, which costs O(n), and there can be n calls. Recursive solutions that slice get expensive this way.
What is the runtime for this?

def time_equal(flight_length, movie_lengths):
    """
    Write a function that takes an integer flight_length (in minutes) and a
    list of integers movie_lengths (in minutes) and returns a boolean
    indicating whether there are two different movies in movie_lengths whose
    lengths sum to flight_length.

    >>> time_equal(0, [3, 3, 4])
    False
    >>> time_equal(10, [3, 7, 4])
    True
    >>> time_equal(100, [30, 70])
    True
    >>> time_equal(100, [])
    False
    """
    # Brute force: check every pair (don't pair a movie with itself)
    for i in range(len(movie_lengths)):
        for j in range(i + 1, len(movie_lengths)):
            if movie_lengths[i] + movie_lengths[j] == flight_length:
                return True
    return False
O(n^2) - quadratic (not exponential), because for each movie we iterate through the whole list again -_-
Solution: replace the inner loop with something faster. Is there a way to check for the existence of the matching second movie length in constant time? Yes, sets! (See the sketch below.)
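A sketch of that set-based idea, keeping the signature from the card above (the name time_equal_set is mine):

def time_equal_set(flight_length, movie_lengths):
    # One pass: for each movie, check if a complementary length was already seen
    movie_lengths_seen = set()
    for first_movie_length in movie_lengths:
        matching_second_movie_length = flight_length - first_movie_length
        if matching_second_movie_length in movie_lengths_seen:
            return True
        movie_lengths_seen.add(first_movie_length)
    return False

This is O(n) time and O(n) space, and it never pairs a movie with itself because we check the set before adding the current length.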
Outcome      | Steps                                                  | Probability
item #1 is a | a is picked first                                      | 1/n
item #2 is a | a not picked first, a picked second                    | ((n-1)/n) * (1/(n-1)) = 1/n
item #3 is a | a not picked first or second, a picked third           | ((n-1)/n) * ((n-2)/(n-1)) * (1/(n-2)) = 1/n
item #4 is a | a not picked first, second, or third, a picked fourth  | ((n-1)/n) * ((n-2)/(n-1)) * ((n-3)/(n-2)) * (1/(n-3)) = 1/n
Rigorous proof that any given item a has the same probability of ending up at any spot That is, (1/n)
What is the time complexity of this?

def word_cloud(string):
    """
    Build a word cloud: an infographic where the size of a word corresponds
    to how often it appears in the body of text. Return a word count dictionary.

    >>> word_cloud("cloudy cloudy day")
    {'cloudy': 2, 'day': 1}
    >>> word_cloud("cloudy cloudy day day day")
    {'cloudy': 2, 'day': 3}
    >>> word_cloud("Cloudy cloudy Day day day")
    {'cloudy': 2, 'day': 3}
    >>> word_cloud("")
    {}
    """
    counts = {}
    words = [s.lower() for s in string.split(" ")]

    # Count each lowercase word (skip the empty string produced by splitting "")
    for word in words:
        if word:
            counts[word] = counts.get(word, 0) + 1

    return counts
Runtime and memory cost are both O(n).
The processor has a cache (often several levels of them) where it stores a copy of stuff it's recently read from RAM.
So reading from sequential memory addresses is faster than jumping around.
How to solve 10^x = 100 for x?
Take the log(base-10) of both sides: log(base-10)(10^x) = log(base-10)(100). "What power must we raise 10 to in order to get 10^x?" -> x. So the left side simplifies: x = log(base-10)(100). Now evaluate: x = 2.
The processor is connected to a memory controller. The memory controller does the actual reading and writing to and from RAM. It has a direct connection to each shelf of RAM.
That direct connection is important. It means we can access address 0 and then immediately access address 918,873 without having to "climb down" our massive bookshelf of RAM. That's why we call it Random Access Memory (RAM)—we can Access the bits at any Random address in Memory right away.
Average lookup time for a hash table
Theoretically, if every hash had a collision, the hash table would degrade to a linked list However... Collisions are fairly rare, so lookups on a hash table are O(1) time.
What is RAM? "working memory" VS "Storage"
Think of RAM like a really tall bookcase with a lot of shelves. Like, billions of shelves. We call a shelf's number its address. Each shelf holds 8 bits (each a 0 or a 1). 8 bits is called a byte, so each shelf has one byte of storage. ** RAM is not where mp3s and apps get stored. In addition to "memory," your computer has storage (sometimes called "persistent storage" or "disk"). While memory is where we keep the variables our functions allocate as they crunch data for us, storage is where we keep files like mp3s, videos, Word documents, and even executable programs or apps. RAM is basically an array already!
Answer to 2 Egg Drop from 100 stories
This triangular series reduces to n * (n + 1) / 2 = 100, which solves to give n = 13.651. We round up to 14 to be safe. So our first drop will be from the 14th floor, our second will be 13 floors higher on the 27th floor, and so on until the first egg breaks. Once it breaks, we'll use the second egg to try every floor starting with the last floor where the first egg didn't break. At worst, we'll drop both eggs a combined total of 14 times.
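A quick sanity check of that strategy in Python (my own sketch; assumes we start at floor 14 and shrink the jump by one each time):

def worst_case_drops(num_floors=100, first_step=14):
    worst = 0
    floor = 0
    step = first_step
    while floor < num_floors:
        floor += step                        # drop egg #1 here
        drops_so_far = first_step - step + 1
        # If egg #1 breaks here, egg #2 linearly checks the (step - 1) floors below
        worst = max(worst, drops_so_far + (step - 1))
        step -= 1
    return worst

# worst_case_drops() -> 14, matching the card's answer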
""" Greedy algo for the max profit finder """ def max_prod_of_3(ints): if len(ints) < 3: raise ValueError('Error, less than 3') # Initialize hi = max(ints[0], ints[1]) lo = min(ints[0], ints[1]) hi_of_2 = ints[0] * ints[1] # highest product of 2 lo_of_2 = ints[0] * ints[1] # lowest hi_of_3 = ints[0] * ints[1] * ints[2] # highest product of 3 # Iter through, updating any of the variables: for i in range(2, len(ints)): curr = ints[i] hi_of_3 = max(hi_of_3, curr * hi_of_2, curr * lo_of_2) hi_of_2 = max(hi_of_2, curr * hi, curr * lo) lo_of_2 = min(lo_of_2, curr * hi, curr * lo) hi = max(hi, curr) lo = min(lo, curr) return hi_of_3
Time complexity: O(n), because we only have to iterate through the list once. Space is O(1) (just a handful of running variables).
TimSort
Timsort is Python's built-in sorting algorithm. It's actually optimized for sorting lists where subsections of the list are already sorted. For this reason, a more naive algorithm:

def merge_sorted_lists(arr1, arr2):
    return sorted(arr1 + arr2)

is actually faster (in Python 2.7) until n gets pretty big. Like 1,000,000.
What would happen if all our dict keys caused hash collisions?
Unlikely, but it would cause O(n) lookups. If you think a lot of hash collisions might occur, consider dynamic array resizing (a bigger underlying array means fewer keys per slot).
What are some common ways to get O(n) runtime?
1. Use a greedy algorithm. But in this case we're not looking to just grab a specific value from our input set (e.g. the "largest" or the "greatest difference") - we're looking to reorder the whole set. That doesn't lend itself as well to a greedy approach.
2. Use counting. We can build a list score_counts where the indices represent scores and the values represent how many times the score appears. Once we have that, can we generate a sorted list of scores?
"keep two pointers" pattern
Used for checking if a string is a palindrome:

civic
^   ^

civic
 ^ ^

civic
  ^

(walk the two pointers toward the middle)
def sample_hash_file(path):
    # Helper function
    byte_size = 4000
    total_bytes = os.path.getsize(path)
    hasher = hashlib.sha512()

    with open(path, 'rb') as file:
        if total_bytes < byte_size * 3:
            hasher.update(file.read())
        else:
            # Integer division, so seek() gets a whole number of bytes
            bytes_btw_each = (total_bytes - byte_size * 3) // 2

            # Read first, middle, and last blocks:
            for x in range(3):
                start = x * (bytes_btw_each + byte_size)
                file.seek(start)
                sample = file.read(byte_size)
                hasher.update(sample)

    return hasher.hexdigest()
Uses hashlib and a few of its methods:
- Constructor: hasher = hashlib.sha512()
- hasher.update(data)  # feeds bytes (e.g. the output of file.read()) into the hash
- hasher.hexdigest()  # returns the digest as a hex string
""" Greedy algo for the max profit finder """ def get_max_profit(stock_prices): """ stock_prices is a list of stocks from that day note: you can only sell AFTER you buy, and profit could be negative """ if len(stock_prices) < 2: raise ValueError('Err, profits require at least 2 prices') # Initialize min_price = stock_prices[0] max_profit = stock_prices[1] - stock_prices[0] for i in range(1, len(stock_prices)): price = stock_prices[i] profit = price - min_price # Potential profit max_profit = max(max_profit, profit) # Update if we did better min_price = min(min_price, price) # Update if we found lower return max_profit
What is complexity? O(1) for space O(n) for time because we only loop through list once
short circuit evaluation
When a boolean expression is evaluated, the evaluation starts at the left-hand expression and proceeds to the right, stopping as soon as the final outcome is determined.
ex:
if it_is_friday and it_is_raining:
    print("board games at my place!")
Suppose it_is_friday is False. Since False and anything is False, Python won't bother evaluating it_is_raining at all.
Amortization of dynamic arrays In industry we usually wave our hands and say dynamic arrays have a time cost of O(1) for appends, even though strictly speaking that's only true for the average case or the amortized cost.
With dynamic arrays, every expensive append where we have to grow the array "buys" us many cheap appends in the future. Instead of looking at the worst case for each individual append (O(n)), look at the overall cost of a whole series of m appends:
- the cost of appending the m items themselves, plus
- the cost of any array doubling / copying we need to do along the way.
Each doubling is costly, but because the capacity doubles each time, it takes twice as long before we need to double again.
Remember: even though the amortized cost of an append is O(1), the worst-case cost of a single append is still O(n).
half1[1:]
a slice costs O(m) time and space, where m is the size of the slice, because it makes a copy of that whole chunk of the list
address of nth item in array=
address of array start + (n * size of each item in bytes)
That's the tradeoff. Arrays have fast lookups (O(1) time), but each item in the array needs to be the same size, and you need a big block of uninterrupted free memory to store the array.
.isdir()
an os.path function (os.path.isdir) that returns whether a path is a directory or not
.listdir()
an os function that returns a list of the names of the entries (files and subdirectories) in a given directory
Dictionaries/HashMaps are built on ________.
arrays, which are pretty similar:
- You can look up a "key" in O(1) time, just like an index
- A hash map is like a "hack" on top of an array that lets us use flexible keys instead of being stuck with sequential indices
dynamic array
built on top of a normal array; programmed to resize itself when it runs out of space. ** When expanding, it usually makes a new, 2x bigger array in a fresh block of uninterrupted memory and copies the items over. See the sketch below.
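A bare-bones sketch of that doubling behavior (a real Python list does this in C; all names here are illustrative):

class DynamicArray:
    def __init__(self):
        self.capacity = 1
        self.size = 0                        # how many slots are actually used
        self.slots = [None] * self.capacity

    def append(self, item):
        if self.size == self.capacity:
            # Out of room: allocate a new array twice as big and copy -- O(n)
            self.capacity *= 2
            new_slots = [None] * self.capacity
            for i in range(self.size):
                new_slots[i] = self.slots[i]
            self.slots = new_slots
        # Normal case: O(1)
        self.slots[self.size] = item
        self.size += 1

Here size vs. capacity is exactly the distinction from the "Size vs. Capacity" card above.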
hashing function does what?
convert a key to an array index (an integer) Run the key through the hashing function to get the index to go to in the underlying array to grab the value.
merge sort
def merge_sort(list_to_sort):
    """
    https://www.interviewcake.com/article/python/logarithms?course=fc1&section=algorithmic-thinking
    """
    # Base case: lists with fewer than 2 elements are sorted
    if len(list_to_sort) < 2:
        return list_to_sort

    # Step 1: divide the list in half
    # We use integer division, so we'll never get a "half index"
    mid_index = len(list_to_sort) // 2
    left = list_to_sort[:mid_index]
    right = list_to_sort[mid_index:]

    # Step 2: sort each half
    sorted_left = merge_sort(left)
    sorted_right = merge_sort(right)

    # Step 3: merge the sorted halves
    sorted_list = []
    current_index_left = 0
    current_index_right = 0

    # sorted_left's first element comes next
    # if it's less than sorted_right's first
    # element or if sorted_right is exhausted
    while len(sorted_list) < len(left) + len(right):
        if ((current_index_left < len(left)) and
                (current_index_right == len(right) or
                 sorted_left[current_index_left] < sorted_right[current_index_right])):
            sorted_list.append(sorted_left[current_index_left])
            current_index_left += 1
        else:
            sorted_list.append(sorted_right[current_index_right])
            current_index_right += 1

    return sorted_list
GREEDY approach with TWO PASSES to solve this problem:
**You have a list of integers, and for each index you want to find the product of every integer except the integer at that index.**
* No division allowed! *
# Strategy: the product of all the integers except the integer at each index can be broken down into two pieces:
1 - the product of all the integers before each index, and
2 - the product of all the integers after each index.
>>> prod_finder([3, 1, 2, 5, 6, 4])
[240, 720, 360, 144, 120, 180]
def prod_finder(ints):
    if len(ints) < 2:
        raise IndexError('Getting the product of numbers at other indices requires at least 2 numbers')

    products = [None] * len(ints)

    # Forward traversal:
    p = 1  # Product so far, before index
    for i in range(len(ints)):
        products[i] = p
        p *= ints[i]

    # Reverse traversal:
    p = 1  # Re-initialize product so far, for after index
    for i in range(len(ints) - 1, -1, -1):
        products[i] *= p
        p *= ints[i]

    return products
The space cost of stack frames
Ex. consider this recursive function:
def prod(n):
    return 1 if n <= 1 else n * prod(n - 1)
Each function call creates its own stack frame, taking up space on the call stack. This is why recursive functions can have poor space complexity. For prod(10), by the time we hit the base case there are 10 stack frames on the call stack; in general the call stack takes up O(n) space, even though the function itself doesn't create any data structures.
how to get the edit time of current path
edit_time = os.path.getmtime(current_path)
Logarithms are used for solving for x when x is an _______.
exponent
How to create a full path?
full_path = os.path.join(current_path, path)
hash table is also known as
hash, hash map, map, unordered map, dictionary
stack frame
in a call stack, there is one of these for each function call Each time a function is called, a new stack frame gets created at the top of the call stack, right above the caller's stack frame. Once the function is finished, its stack frame is removed, passing control back to the caller and restoring the caller's stack frame. This is how a program knows what to do after the function returns!
set
like a hash map, but it only stores keys w/o values
- not ordered
- no indices
- no duplicates
In Python, the set implementation is largely copied from the dict implementation.
set.add()
like list.append(), but sets are unordered (and adding a duplicate has no effect)
Binary search runtime is best only when ______
list is already sorted This takes advantage of the optimal runtime O(log(n))
"Destructive" aka "in place" algorithm
operates on a list or other data structure directly, so the original is destroyed; it doesn't allocate new memory and generally has O(1) space cost. ** Be careful ** that you won't need the original later (e.g. for debugging)
hash table pros/cons
Organizes data so you can quickly look up values for a given key.
Strengths:
- Fast lookups: O(1) on average, but O(n) in the worst case
- Flexible keys: you can use any key as long as it is hashable!
Weaknesses:
- Slow worst-case lookups: O(n)
- Unordered: if you're looking for the smallest key, you need to iterate through everything (O(n))
- Single-directional lookups: looking up the value for a key takes O(1), but looking up the key(s) for a value takes O(n)
- Not cache-friendly
linked lists have faster _____ (O(1) time) than dynamic arrays (O(n) time).
prepends
file.read()
returns the contents of a file as a string (or as bytes, if the file was opened in binary mode, e.g. 'rb')
tail call optimization (TCO)
Something going on behind the scenes where a language *might* free up a recursive function's stack frame before making its final (tail) call, saving space. You can optimize a recursive function with TCO, but Python and Java do not allow it!!!
hash function
takes data (like a string or file) and outputs a "hash": a fixed-size string or number
Why is merge sort O(n log(base-2) (n))?
the log(base-2)(n) comes from the number of times we have to cut n in half to get down to sublists of just 1 element (base case) The additional n comes from the time cost of merging all n items together each time we merge two sorted sublists.
call stack
what the program uses to keep track of function calls
hash collision
when multiple keys (or files) produce the same hash value. Needs to be handled.
How do you get contents of a file with a with open as f block?
with open(current_path) as file: file_contents = file.read()