Python Data Structures / Algorithms
Binary Search Time Complexity
-every time the array doubles in size, it takes one extra iteration of the algo: 2^(# iterations) = (array size)
-Binary Search is O(log n)
Sorting - Insertion Sort
-select the first unsorted element, swap elements until the unsorted element is in the correct position
-advance the sorted marker
Time - O(n^2)
Space - O(1)
best case - input array already sorted: just one comparison for each item in the array
worst case - array in reverse order
-very good for small arrays (even better than quicksort). A good quicksort implementation will use insertion sort for arrays smaller than a threshold.
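A minimal in-place sketch of the above (function name is illustrative):

    def insertion_sort(arr):
        # advance the sorted marker one element at a time
        for i in range(1, len(arr)):
            j = i
            # swap the unsorted element left until it's in the correct position
            while j > 0 and arr[j - 1] > arr[j]:
                arr[j - 1], arr[j] = arr[j], arr[j - 1]
                j -= 1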
Memoization
-store the values of function calls so we don't have to repeat the work
-Ex - the Fibonacci sequence has many repeated calls ( fib(3), fib(2), etc...)
-in python, we can either:
1) add a decorator to the function to cache the first maxsize values: @lru_cache(maxsize=1000)
2) create a dictionary of cached values
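Both options as a minimal sketch (fib/fib_memo are illustrative names):

    from functools import lru_cache

    @lru_cache(maxsize=1000)  # 1) decorator caches the first maxsize results
    def fib(n):
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    cache = {}  # 2) explicit dictionary of cached values
    def fib_memo(n):
        if n not in cache:
            cache[n] = n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)
        return cache[n]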
Hash Table Collision
-two different inputs generate the same hash value/index -fixed via chaining, open addressing, or double hashing
Stack - implemented with a list
-*can use append() and pop() on a simple python list to implement a stack*
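For example:

    stack = []
    stack.append(1)    # push
    stack.append(2)    # push
    top = stack.pop()  # pop -> 2 (LIFO)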
DFS - Tree/Forward/Back/Cross Edge
-Tree Edge - part of the path explored by DFS
-Back Edge - points up the tree hierarchy (descendant to ancestor)
-Forward Edge - points down the hierarchy (ancestor to descendant)
-Cross Edge - edge between two vertices w/ no hierarchy between them
Binary Tree
-each node has at most two children (0, 1, or 2)
BST - Search, Insert, Delete Complexity
Search, Insert, Delete: avg => lgn, worst case => n
-worst case occurs when the tree degenerates into a linked list
Sorting - Radix sort complexity
Time - O(wn) (w => word size) Space - O(w+n)
Sorting - Timsort Complexity
Time - nlgn Space - n
Rehashing
When the load factor increases to more than pre-defined value (default load factor is .75), the complexity increases. To overcome this, the array (or underlying data structure) is doubled and all values are hashed again against this larger array to maintain a low load factor and low complexity.
Sorting - Bucket Sort
1) create n empty buckets
2) put each item into the appropriate bucket
3) sort the individual buckets
4) concatenate all the buckets
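A short sketch of those four steps, assuming the inputs are floats uniformly distributed in [0, 1):

    def bucket_sort(arr):
        n = len(arr)
        buckets = [[] for _ in range(n)]        # 1) create n empty buckets
        for x in arr:
            buckets[int(x * n)].append(x)       # 2) place each item (assumes 0 <= x < 1)
        for b in buckets:
            b.sort()                            # 3) sort individual buckets
        return [x for b in buckets for x in b]  # 4) concatenate all buckets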
Tree Traversal BFS - Level Order
BFS: Level Order: Top to bottom, left to right
Graph Search vs Graph Traversal
Graph search - stop searching when you find the element you're looking for
Graph traversal - look at every element
Sets and Maps
Set - collection of unique items
-can think of it like a bag of items - you reach in and grab one, but never know which one you'll get
-a map is a set-based data structure (the keys form a set)
-a list is an array-based data structure
DFS Complexity
Time => O(V+E) - the algorithm loops over each vertex and each edge
Space => O(V) (we store the visited set in both the recursive and iterative versions)
Binary Search - calculating mid
mid = lo + (hi-lo)//2
-equivalent to (lo+hi)//2, but avoids integer overflow in fixed-width languages when lo+hi exceeds the max int (not an issue for Python's arbitrary-precision ints)
https://stackoverflow.com/questions/25571359/why-we-write-lohi-lo-2-in-binary-search
Sequential/Linear Search
search every item sequentially O(n)
Sorting - Merge Sort complexity
(# iterations) * (# comparisons @ each iteration)
2^iter = n ==> iter = log(n)
# comparisons @ each iteration = n
Time Complexity: *O(n lgn)*
Space complexity: *O(n)*
-Space complexity - we only use 2 different arrays at each step, the original one and the new one we're copying into
Tree Terminology - Internal Nodes, External Nodes / Leaf
*Internal Nodes* => all nodes except leaves (if there's only one node, it's external)
*External Nodes / Leaves* => nodes at the end, with no children
Binary Tree - Search(item), Delete(item), Insert Complexity
*Search* - O(n) (any traversal algo is linear, since in a BT there's no order to the nodes)
*Delete* - O(n) (starts with a search, since we need to find what we want to delete. Deletions are tricky - sometimes we need to promote grandchildren up and/or re-arrange nodes)
*Insert* - avg => lgn, worst => n
Sorting - Bubble Sort (sinking sort)
-in each iteration, the largest element will "bubble" to the top
-an in-place, stable sorting algo
-A better version, known as modified bubble sort, includes a flag that is set if an exchange is made during a pass over the array. If no exchange is made, the array must already be in order (no two elements need to be switched) and the sort should end - see the sketch below.
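A sketch of the modified version with the exchange flag:

    def bubble_sort(arr):
        for i in range(len(arr) - 1):
            swapped = False
            # largest remaining element bubbles to the top of this pass
            for j in range(len(arr) - 1 - i):
                if arr[j] > arr[j + 1]:
                    arr[j], arr[j + 1] = arr[j + 1], arr[j]
                    swapped = True
            if not swapped:  # no exchange made: already sorted, end early
                return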
Degree, In-degree, Out-degree, Degree of Graph
------------------------
Undirected Graph
------------------------
Degree - the number of edges connected to a vertex (loop edges count as 2)
------------------------
Directed Graph
------------------------
1) In-degree - number of edges incoming to a vertex
2) Out-degree - number of edges outgoing from a vertex
(a directed loop edge counts as 1 in-degree and 1 out-degree)
Degree of Graph - sum of the degrees of all vertices
Sorting - Quicksort vs MergeSort
-Quicksort is faster for smaller datasets and for arrays, since it has good locality of reference
-Mergesort is better for stability, large datasets (especially those that don't fit in memory), or linked lists (LLs don't have good locality of reference). You're also guaranteed nlgn worst case.
Sorting - External Sorting
-a class of sorting algos that handle massive amounts of data that don't fit into RAM / main memory
-External merge sort - loads as much of the input as fits into RAM, sorts it, and stores it as a chunk on disk; repeats until all data is in sorted chunks, then merges them by loading parts of each chunk into memory. https://en.wikipedia.org/wiki/External_sorting#External%20merge%20sort
Hash Table
-a data structure that implements an associative array abstract data type - a structure that can map keys to values
-allows you to do lookups in constant time
-take some value ==> convert it with a hash formula ==> get a coded version of the value ==> use it as an index into the hash table
-a common pattern is to take the modulo/remainder of the last few digits of a big number
Hash Table String Keys
-the key can be a string - one way is using ASCII values combined w/ a hash formula
-can generally use plain ASCII values if you have 30 or fewer string keys
-if you use all the letters in the string key, the hash value becomes huge and memory may not be able to represent it as an index
Graph Connectivity
-measures the minimum # of elements that need to be removed for a graph to become disconnected
-in the left group (in pic), we can remove one connection and it becomes disconnected
-can sometimes use connectivity to answer which graph is 'stronger'
Sorting - Merge Sort
-mergesort is an example of *divide and conquer* - the idea of breaking the array up, sorting all the parts, then building it back up again (several sorting algos use this principle)
-worst case - reversing a list
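A minimal (not in-place) sketch of the divide and conquer idea:

    def merge_sort(arr):
        if len(arr) <= 1:
            return arr
        mid = len(arr) // 2
        left, right = merge_sort(arr[:mid]), merge_sort(arr[mid:])  # divide
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):  # merge the two sorted halves
            if left[i] <= right[j]:              # <= keeps the sort stable
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        return merged + left[i:] + right[j:]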
Sorting - Selection Sort
-search through the array and find the min
-swap the min with the first unsorted element
-advance the sorted marker
Time - O(n^2)
Space - O(1)
Space Complexity (Auxiliary)
-the amount of memory an algo takes
Auxiliary Space - extra or temp space used by an algo
Space Complexity - total space taken by an algo w/ respect to input size
Space Complexity = (Auxiliary Space) + (Input space)
Merge Sort uses O(n) auxiliary space; Insertion Sort and Heap Sort use O(1) auxiliary space. The space complexity of all these sorting algorithms is O(n) though, since the input itself takes O(n).
Graph - Weakly Connected vs Strongly Connected vs Complete
-weakly and strongly connected generally apply to directed graphs
*Weakly connected* - connected if considered as an undirected graph (there may not be a directed path b/w some pairs of vertices)
*Strongly Connected* - for a directed graph, there is a path between all pairs of vertices
*Unilaterally connected* - semi-path (touches all vertices): there's a path from A to B, but not necessarily from B to A
*Complete* - an undirected graph with an edge between every pair of nodes
HashTable Collisions - Open Addressing / Probing (closed hashing)
-when a collision occurs, we look for the next open space (linear probing; also quadratic probing, double hashing, etc...)
-CPython uses random probing, where the next slot is picked in a pseudo-random order
AVL Tree
a self-balancing Binary Search Tree where the difference between the heights of the left and right subtrees cannot be more than one for any node
-doesn't need to be complete: a complete tree has its nodes packed all the way to the left, while an AVL tree keeps the left and right subtree heights within one of each other without pushing nodes to the left
worst case complexity (search, insert, delete) - lgn
Sorting - Locality of Reference (linear search, quicksort, mergesort, heapsort)
accesses concentrated around a small number of memory locations
linear search - over an array has the best locality of reference; over a linked list, bad locality of reference
Quicksort - the partition strategy generally reads/swaps values at array indexes that are close to each other
Mergesort - in the final step it can access locations n/2 apart
Heapsort - compares against values at locations that are twice or half the index of the current element (in max_heapify())
best to worst: linear search over an array > quicksort > mergesort > heapsort
Stacks
LIFO data structure
-list-based data structure where we push elements on top, and pop off the top
-a stack itself is an abstract data type; it can be implemented any way as long as there are push and pop operations (e.g. each element could be implemented with a next pointer plus a head pointer)
Collection Data Structure Comparisons
Lists - indexed elements; good for accessing elements in the middle; insertion and deletion are messy O(n); has potentially unused memory space
Linked List - better for insert/delete O(1), but difficult to access elements in the middle
Stack - easy to implement with a LL; easy push and pop O(1)
Queue - easy to implement; fast enqueue/dequeue
Sorting - In Place vs Not In Place
-In Place ==> *sorted without having to copy to a new data structure*
-in-place sorts have lower space complexity, since we're not recreating the data structure
-tradeoff b/w space and time complexity - won't matter for small arrays, but with millions of elements it makes a huge difference
In Place: Bubble Sort, Selection Sort, Insertion Sort, Heapsort
Not In-Place: Merge Sort (requires O(n) extra space)
Binary Search
-check if the target value is less than or greater than the middle of the array ("Is 7 greater than or less than this element?"); split in half each time
-Constraint: *elements must be sorted first*
-with an even # of elements, choose the lower-index middle element as the pivot/comparison
-the cruder sequential search is worse: O(n) for most cases (the best case is O(1), if the target is the first element in the sorted list)
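A minimal iterative sketch, using the overflow-safe mid calculation from the card above:

    def binary_search(arr, target):  # arr must be sorted
        lo, hi = 0, len(arr) - 1
        while lo <= hi:
            mid = lo + (hi - lo) // 2  # lower middle for an even # of elements
            if arr[mid] == target:
                return mid
            elif arr[mid] < target:
                lo = mid + 1
            else:
                hi = mid - 1
        return -1  # not found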
Graph Representation - Adjacency List vs Adjacency Matrix (optimal implementation and complexity )
----------------------------------
Adj Matrix - dictionary of lists
----------------------------------
add/remove vertex => V
add/remove edge => 1
query => 1
space => V^2
----------------------------------
Adj List - dictionary of sets
----------------------------------
add vertex => 1
remove vertex => V
add/remove edge => 1
query => 1
space => V+E
Heap vs Binary Heap
-a heap can have any number of children; a binary heap has at most 2
-a binary heap is a *complete tree* - all levels except the last are filled - so it won't degenerate
Graph Representation comparison - when to use Adj Matrix vs Adj List
-adjacency matrix best for dense -adjacency list best for sparse Adj List is overall more efficient
Linked Lists
-an extension of a list, but *no indices*. Each element can have a next and previous pointer -much easier to insert/delete since we're just moving pointers -difficult to access elements in the middle -worse locality of reference than arrays
Hash Table Load Factor
-gives a sense of how full a hash table is; it's also the expected length of a chain (if we're using chaining)
load factor less than 1 => mostly empty spaces, wasting space
load factor more than 1 => collisions
α = n/m (m => table size, n => elements)
-as long as you set your table size to m = n, we're guaranteed O(1) expected operations
Priority Queue implementation array vs heap
Simple Array (unsorted):
----------------
insert() - O(1) (append)
get/deleteHighestPriority() - O(n) (linear scan)
Linked List:
-----------------------
insert() - O(1)
get/deleteHighestPriority() - O(n) (slightly better on delete)
As Heap Array:
-----------------
insert() - lgn
getHighestPriority() - 1
deleteHighestPriority() - lgn
Python *List* Time Complexity
append - O(1) (amortized)
pop (from the end) - O(1)
insert and remove - O(n)
get item - O(1)
*in* membership testing - O(n)
-the largest costs come from growing beyond the current allocation size (because everything must move), or from inserting or deleting somewhere near the beginning (because everything after that must move)
-good for accessing elements in the middle
-if you need to add/remove at both ends, consider using a collections.deque instead
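For example:

    from collections import deque

    d = deque([1, 2, 3])
    d.appendleft(0)  # O(1) at the left end (O(n) for a list)
    d.append(4)      # O(1) at the right end
    d.popleft()      # O(1) -> 0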
Binary Tree - Level vs number of nodes
each new level can have twice as many nodes as the one before it - we're adding a power of two at each level (level k holds up to 2^(k-1) nodes)
Binary Search Tree (BST)
every value on the left is smaller, every value on right is larger
Depth-First Search
follow one path as far as it will go
-push the nodes you've seen onto the stack; when you hit a node you've seen before, pop the current node from the stack and try another edge
DFS ==> O(V+E) - we visit each edge and vertex once (the runtime is technically 2E, since we see each edge from both ends)
Binary Heap and Complexity (search, insert, delete, extractMax/extractMin, peek)
heap w/ at most two children, often stored in an array
-must be complete (all levels except the last are full, and nodes fill in from the left). New values are added at the bottom level, left to right.
-being complete guarantees the correct shape and worst case O(lgn)
Search: Avg - N, Worst - N
Insert: Avg - 1, Worst - lgn
Delete: Avg - lgn, Worst - lgn
Peek: O(1)
Efficiency/Complexity
how well you're using your computer's resources to get a particular job done - think about it in terms of space and time
-there are tradeoffs in algorithm efficiency - a slow methodical approach vs a faster method that reduces repetition
-can use heuristics that are not as precise but still pretty good
-efficiency can rely on creativity, but there are often tips and tricks
Red-Black Tree
not as good at self-balancing as AVL, but good enough to guarantee lgn worst case
Rules:
---------
-each node is Red or Black
-root is Black
-all (nil) leaves are Black
-a Red node has two Black children
-every path from a node to a leaf has the same number of Black nodes
worst case complexity (search, insert, delete) - lgn
Complexity - number of digits in number
the number of digits in a number n is O(lgn)
-a number n with d digits has a value up to 10^d: n <= 10^d ==> d >= log10(n), so d = O(lgn)
-code that loops over the digits will be O(d) = O(lgn)
Sorting - Non-Comparison Sorts - Radix and Counting Sort
radix sort uses counting sort as a subroutine
DFS Recursive Implementation
recursive:

    def dfsRecursive(self, v):
        self.visited.add(v)
        for neighbor in self.graph[v]:
            if neighbor not in self.visited:
                self.dfsRecursive(neighbor)
Complexity - Fibonacci and alternatives
(time / space)
recursive - 2^N / N
recursive + memoization - N / N
iterative - N / 1
matrix multiplication - lgN / 1
closed-form formula - 1 / 1
Sorting - Stability
stability means that equivalent elements retain their relative positions after sorting
-useful when sorting on multiple keys - e.g. preserving the order of entries with the same last name but different first names
1) stable by default: Insertion Sort, Merge Sort, Bubble Sort
2) unstable by default: Quick Sort, Heap Sort, Selection Sort, Shell Sort
-any sorting algo that is not stable can be modified to be stable
-Radix Sort requires that the underlying sorting algo be stable
Complexity - looping from 1 to (x^2<=n)
O( sqrt(n) )
Stack Time Complexity
push - O(1) pop - O(1)
Python heapq operations complexity (push, pop, heapify)
push - lgn
pop - lgn
heapify (build a heap in place) - n
build a heap by pushing all items one at a time - nlgn
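For example:

    import heapq

    h = [5, 1, 4, 2]
    heapq.heapify(h)             # build a heap in place - O(n), faster than n pushes
    heapq.heappush(h, 3)         # O(lgn)
    smallest = heapq.heappop(h)  # O(lgn); heapq is a min-heap, so this is 1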
Sorting - Bubble Sort Complexity
-overall we do (n-1) iterations, and at each iteration we do up to (n-1) comparisons
Time - O(n^2) (best case O(n) - already sorted, or only 1 number needing to bubble up)
Space - O(1) (in place)
Applications of DFS
1) Detecting a cycle - if we see a back edge during DFS, we have a cycle
2) Finding a minimum spanning tree
3) Topological sort (for directed graphs) - a linear ordering of vertices such that all arrows point downwards. The graph must be a DAG (doesn't work for cyclic graphs)
4) Finding strongly connected components (3 in pic)
5) Path finding - find a path between two vertices: store the path on a stack, and as soon as the destination is found, return the stack
6) Checking if a graph is bipartite (no edge connects two vertices of the same color, only opposite colors). Implement via a coloring scheme: run DFS, coloring each vertex opposite to its parent
7) Solving puzzle problems - given a matrix/maze w/ obstacles, traverse it with DFS
Priority Queue
ADT that is an extension of a queue w/ the following properties:
1) every item has a priority
2) elements with higher priority are dequeued before lower priority elements
3) two elements w/ the same priority are served according to their order in the queue
Operations:
-----------------
insert(item, priority): inserts an item with the given priority
getHighestPriority(): returns the highest priority item
deleteHighestPriority(): removes the highest priority item
-used in algos like Dijkstra's shortest path and Prim's minimum spanning tree
Types of Binary Tree: Full (strict/proper)
Every node has 0 or 2 children (never just 1 child)
Can be used to represent mathematical expressions
Types of Binary Tree: Perfect
All internal nodes have exactly 2 children, and all leaf nodes are on the same level/depth
All perfect trees are complete, but not vice versa
Types of Binary Tree: Complete
All levels are completely filled except possibly the last, and in the last level the nodes are as far left as possible
-a perfect binary tree whose rightmost leaves have been removed is a complete binary tree
Binary Heap as Array Child/Parent Indice formula
Arr[(i-1)//2] => parent node
Arr[(2*i)+1] => left child node
Arr[(2*i)+2] => right child node
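As Python helpers (names illustrative):

    def parent(i): return (i - 1) // 2
    def left(i):   return 2 * i + 1
    def right(i):  return 2 * i + 2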
Sorting - Binary Insertion Sort vs Normal Insertion Sort
Binary insertion sort reduces the number of comparisons (vs normal insertion sort) by using binary search to find the proper location to insert the selected item
Normal insertion sort ==> O(n) comparisons per insertion in the worst case
Binary insertion sort ==> O(lgn) comparisons per insertion in the worst case
Binary insertion sort still has a worst case running time of O(n^2), due to the swaps required for each insertion
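In Python, the bisect module provides the binary-search step; a sketch that builds up a new sorted list (not in place, unlike normal insertion sort):

    import bisect

    def binary_insertion_sort(arr):
        result = []
        for x in arr:
            bisect.insort(result, x)  # binary search O(lgn) + insert O(n)
        return result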
Lists/Arrays
all the properties of a collection, but *ordered*
-in Python, lists are implemented as arrays
-the main difference between a list and an array is the operations you can perform on them: e.g. you can divide an array by 3 and each number in the array will be divided by 3; if you try to divide a list by 3, Python will throw an error
-the growth pattern of the number of slots of a list is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88
-good locality of reference
Recursion (requirements)
function that calls itself -3 requirements: 1) call itself 2) base case 3) alter the input parameter -without a base case, we could end up in infinite recursion
Queue Types: 2)Deques (double-ended queue)
queue that goes both ways. Can enqueue or dequeue from either end. -generalized version of both stacks and queues - since you can represent either of them with it: a)Stack - add and remove from the same end (either end) b)Queue - add on one end, remove on the other
Hash Table - Linear Probing vs Chaining
Chaining:
---------------
Pros: simpler to implement, very flexible size, less sensitive to the hash function or load factor
Cons: wastes space (some parts of the hash table are never used)
Open Addressing:
-----------------------
Pros: better cache performance, as everything is in the same table
Cons: more computation, the table may become full, suffers from clustering
Tree Traversal DFS - Pre, In, Post
DFS:
1) *PreOrder* (diagonal-wise): Visit ==> Left ==> Right
-used to create a copy or serialize the tree (for later deserializing)
2) *InOrder* (column-wise): Left ==> Visit ==> Right
-in a BST, inorder traversal gives the nodes in sorted order
3) *PostOrder* (level-wise from the bottom, left to right): Left ==> Right ==> Visit
-used to delete the tree using just O(1) extra space, since it deletes children before parents (other traversals would require more space)
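Minimal recursive sketches, assuming nodes with val/left/right attributes:

    def preorder(node):   # Visit ==> Left ==> Right
        if node:
            print(node.val)
            preorder(node.left)
            preorder(node.right)

    def inorder(node):    # Left ==> Visit ==> Right (sorted order for a BST)
        if node:
            inorder(node.left)
            print(node.val)
            inorder(node.right)

    def postorder(node):  # Left ==> Right ==> Visit
        if node:
            postorder(node.left)
            postorder(node.right)
            print(node.val)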
Graph
Data structure designed to show relationships between objects (also called a network)
-a tree is a specific type of graph
-can have cycles, can start anywhere, has no root node
-nodes generally store data, but edges can also store data in the form of weights
Types of Binary Tree: Degenerate
Every parent node has only one child, either left or right
-suffers the same performance as a linked list (slightly slower, since in a tree we check both left and right)
Queue Types: 1)Standard Queue
FIFO data structure (ex - a line of ppl)
-front - oldest element in the Q (front of the line)
-back - newest added elements
enqueue() - add an element to the back
dequeue() - remove an element from the front
peek() - look at the front element
Sorting - HeapSort
Heapsort uses a heap (a complete binary tree - all levels except the last are filled, and nodes are all the way to the left). From left to right, top to bottom, each element of the input array becomes a node.
1) build a max heap (heapify())
2) swap the first and last elements - this moves the highest element to the end (similar to selection sort)
3) remove that node from the heap, sift the new root down, and repeat
-the heap is also efficient for priority queues, as it supports insert/delete/extract
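A short sketch using Python's heapq; the textbook version max-heapifies the input array in place, while this min-heap variant shows the same O(nlgn) idea:

    import heapq

    def heapsort(arr):
        h = list(arr)
        heapq.heapify(h)  # 1) build the heap - O(n)
        # 2)+3) repeatedly extract the root (min here, max in the classic version)
        return [heapq.heappop(h) for _ in range(len(h))]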
Complexity - Fibonacci
O(2^N)
-for recursive calls ==> O(branches^depth)
-a tighter bound is O(1.6^N), since at the bottom of the call stack there is sometimes 1 recursive call instead of 2
-computing all Fibonacci numbers from 1 to N (without memoization) is still O(2^N)
Tree Terminology - Levels, depth, height
*level* => number of connections it takes to reach the root + 1 (the root is level 1)
*height of a node* => number of edges between the node and its furthest leaf
*depth of a node* => number of edges to the root
-height and depth move inversely: the root's height is highest, a leaf's depth is highest
Ω, O, Θ - Complexity
*Ω* / Big Omega => lower bound
-printing an array is Ω(n), Ω(lgn), and Ω(1)
*O* / Big Oh => upper bound
-printing an array is O(n), O(n^2), O(2^n)
*Θ* / Big Theta => both O and Ω - a tight bound on the runtime
-the industry meaning of big O is closer to what academics mean by Θ; in interviews we always try to offer the tightest description of the runtime
Graph Edges
-nodes are people, but the edges between them could describe many things: people who have met each other, people who lived in the same city at the same time, or people who worked on a project at the same time
-the information we decide to store depends on the use case
Sorting - Quicksort
-a divide and conquer sorting algo (like mergesort)
-picks an element as a pivot and places all items less than the pivot to its left, and all items greater than the pivot to its right
-recursively quicksorts the left and right halves
-in-place but not stable
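A minimal sketch of the idea; note this version is not in place (it builds new lists), unlike the in-place partition described above:

    def quicksort(arr):
        if len(arr) <= 1:
            return arr
        pivot = arr[len(arr) // 2]  # middle element; a random pivot is also common
        less    = [x for x in arr if x < pivot]
        equal   = [x for x in arr if x == pivot]
        greater = [x for x in arr if x > pivot]
        return quicksort(less) + equal + quicksort(greater)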
Hash Table vs Arrays, Linked List, BST, Direct Access Table (Complexity)
-arrays and linked lists suffer search or insert/delete problems
-a balanced BST is lgn for all ops
-a direct access table is constant for all ops, but the table would have to be huge (possibly larger than what could be represented in memory)
-a HashTable is constant for all operations on avg
Queue Types: 3)Priority Queue
-assign each element a numerical priority on insertion
-on dequeue, remove the element with the highest priority
-in a priority tie, remove the oldest element
-a priority queue is an abstract data type and can be implemented in many ways, though typically as a heap (the heap must be a tree - no cycles - and not necessarily binary)
Sorting - Bucket Sort Complexity
-best case => uniform distribution over the buckets => *O(n+k)* (k buckets)
-worst case => input keys close to each other (clustered), so some buckets hold more elements than avg; with all elements in a single bucket, performance is dominated by the inner sorting algo => *O(nlgn)* for merge/quicksort
Space - O(n)
Abstract Data Type
-can be implemented in many different ways. -Stacks and Q's are abstract data types that can be implemented in many ways. Q's often implemented as Linked List.
Efficient Hash Function
-choose between 1) a hash function that spreads out values evenly but uses a lot of space, or 2) one that uses fewer buckets but might have to search within each bucket
-hashing questions are popular because there's never a perfect solution - you're expected to talk about the upsides and downsides of whatever you choose. Do your best to optimize your hash function.
Sorting - Basic Approach (Naive)
-compare every element to every other element, until everything is sorted
-very inefficient
-ex: bubble sort, selection sort
HashTable Collisions - Separate Chaining (open hashing)
-each cell of the table points to a linked list
-Pros: simple to implement; the table never fills up, and we're always able to add more elements to a chain
-Cons: wastes space on the extra linked lists; slower lookup (O(n) worst case, if all elements land in one linked list)
Binary Search Time Complexity alternate way of thinking
-each time we split the array in half we have n/2, n/4, n/8, etc... items left. When we've split enough times, only 1 item is left (where i is the number of splits/comparisons):
n/2^i = 1 ==> 2^i = n ==> i*log(2) = log(n) ==> i = log2(n)
Graph Directed / Undirected
-edges can have a direction (directed graph)
-an undirected graph has edges with no direction
-we often want a directed acyclic graph (DAG), which guarantees no cycles - *cycles can be dangerous* and lead to infinite loops
Sorting - Heapsort complexity
-worst case - swapping a value from the bottom of the tree to the top takes lgn swaps (the height of the tree is lgn); this swapping is done n times ==> O(nlgn)
Time - nlgn
Space - 1
Types of Binary Tree: Balanced
The difference between the left and right subtree heights is at most one for every node
A tree that balances itself is a "self-balancing binary tree" (AVL Trees, Red-Black Trees)
Graph - Disconnected vs Connected
Disconnected - has a vertex/node that can't be reached from the other vertices (undirected)
Connected - has no disconnected vertices
Tree Definition and Properties
Trees are a restricted form of a graph: directed (one direction) and acyclic (no cycles)
-trees are an extension of a linked list
Properties:
1) a tree must be connected (can't have an unconnected node)
2) no cycles (acyclic)
Complexity - sort each string in a list, then sort the list itself
['aba', 'cbd', 'cdc'] ===> ['aab', 'bcd', 'ccd']
-a = # of elements in the list
-s = length of the longest string
1) sort each string in the list: a * s*lg(s)
2) sort the list itself: each string comparison takes O(s), so s * a*lg(a)
total = a*s*(lg(a) + lg(s))
Collections
a group of things with no order (we can't say "give me the 3rd element in the collection")
-can contain different types
-many data structures are extensions of collections: lists, arrays, LLs, stacks, queues
Sorting - Quicksort Complexity
best - nlgn
avg - nlgn
worst - n^2
space - lgn (n in the worst case) (partition uses O(1) space, multiplied by the recursion tree depth of lgn)
-worst case occurs when the pivot is always the greatest or smallest element (array already in ascending or descending order, or all elements the same). If we know the arrays are nearly sorted, we don't want to use quicksort.
-the n^2 problem can be mitigated by choosing a random or median pivot value; the worst case can still occur if the max (or min) element happens to be chosen as the pivot
-best case occurs when partition divides the list into two nearly equal pieces; each recursive call then processes a list of half the size
Sorting - Timsort
hybrid algo that uses merge sort (for large sequences) and binary insertion sort (for small sequences) -looks for increasing/decreasing sequences -incredibly fast for nearly sorted data
Linked List Time Complexity
insert/delete - O(1) *(removing the last element w/ no tail pointer can be O(n))*
get item - O(n)
DFS Iterative Implementation
iterative:

    def dfsIterative(self, v):
        stack = [v]
        while stack:
            vertex = stack.pop()
            if vertex not in self.visited:
                self.visited.add(vertex)
                # push unvisited neighbors onto the stack (graph values are sets)
                stack.extend(self.graph[vertex] - self.visited)
        return self.visited
Heap
a specific type of tree where the root element is either the max or min value
max-heap => all parents >= children
min-heap => all parents <= children
-a generic heap can have any number of children; a binary heap has at most two
-often stored in an array (less storage space vs using Nodes with left/right pointers)