Strings Extra


What is the insight in finding the longest repeating substring?

1. BF is O(n^2). Enumerating every substring. Comparing substring to longest, replacing if needed. 2. Building Suffix Trie allows for faster queries but also takes O(n^2) to build

What are 5 examples of problems that a suffix trie can efficiently solve (after being built)?

1. Find the Longest Common Substring between X and Y 2. Find the Longest Repeating Substring in X 3. Find the Longest Palindromic Substring in X 4. Find Substrings that are Common to X and Y 5. Find Substrings that are Not Common to X and Y

Advantages of a suffix tree

- Can easily find how many times a substring occurs within a string. Given by how many dollar signs (terminator leaves) branch off below it. - Can find the longest repeating substring - Least repeating substring - All substrings that don't overlap

Suffix tree tradeoff

- O(n^2) setup for O(m) query, where m is the length of the pattern - Suffix tree is a more flexible option

Insight for # of submatrices that sum to target

1.

What's the insight behind print a string sinusoidally?

1. Count the # of positions before it comes back to the start, then modulo based on that. 3 rows -> modulo 4.

What's one of the most popular applications of suffix trees?

Autocomplete

What's the insight for Generate Numeronyms?

I had the right intuition from the start, but I used DFS to shrink both sides. The correct solution used two for-loops. There was no special insight, just a confusing test case.

What is one major concern with using a Trie as a solution?

It uses a lot of memory. Too many pointers at each node (O(L), where L is alphabet length). That is why compressed/radix Tries are useful

What is the worst case time complexity of Rabin Karp algorithm, where N is the size of the text and M is the size of the pattern?

O(MxN). Inputs that generate same hash values make this close to brute-force

In a suffix tree how can we distinguish between words?

Diff char at the ends. $ vs #. Terminating characters.

What's the MOST effective way to enumerate all substrings?

Double for loop:
for i in range(0, n):
    for j in range(i, n):
        # do stuff
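A minimal sketch of the double loop, assuming an exclusive end index so that `s[i:j]` is the substring. The function name `all_substrings` is made up for illustration.

```python
def all_substrings(s):
    """Yield every non-empty contiguous substring of s."""
    n = len(s)
    for i in range(n):                 # start index, inclusive
        for j in range(i + 1, n + 1):  # end index, exclusive
            yield s[i:j]
```

There are n(n+1)/2 substrings, and each slice copies up to n characters, so materializing all of them costs O(n^3) character work in the worst case.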

How do wildcards work in string matching?

Ex. * matches any sequence of chars (including none), ? matches exactly one char.

T/F If our predominant query is to find all strings matching a prefix, then we should use a self-balancing BST over a Trie

F

T/F An FSA can have only one accepting state?

F Can have many

Find prefix with wildcard

FIND THIS AGAIN. WAS SLEEPING. ~ 120-180 m

KMP T/F? We need to pre-process entire text, but not pattern

False. Other way around

Sliding window problems

Find substrings with at least 3 L's

Rabin Karp could be useful for what disk purpose?

Keep two files on disk in sync, by copying only the deltas.

What is the maximum number of children that a Trie node can have?

L, where L is the number of characters of the alphabet that the interviewer gives us!

What technique does Rolling Hash use, in order to avoid overflow and still minimize conflicts?

Take each arithmetic operation modulo a prime

Strings in Java / Python are immutable?

Yes. If s = "Interview", then str = s, and we then overwrite str, s is still "Interview", because str now points to a new string and s still points to "Interview". A String being immutable doesn't mean the reference to it is immutable.

TC of substring search

complexity is O(Length of Text x Length of Pattern), which is not linear time.

When looking for the smallest controlling set, suffix tree is optimal

No. Then the linear approach is better. However suffix tree is useful for many other things.

Default encoding for text is ASCII on most systems?

No. UTF-8

Given a corpus of N strings, of length L each, what if we used a BST instead of a Trie, in order to store strings for efficient lookups? What would be the time complexity of inserting a new string in such a BST? Assume that it's a self-balancing BST.

O(L * log N) Compare each incoming string to the given string in the tree node and make Log N such decisions to find its right place

TC of looking up pattern in suffix trie?

O(Length of Pattern)

Substring Search

S, P -> Determine whether P is a substring of S

Controlling set. When does a set control a string?

A set of characters. A set controls a string if all the characters in the set appear in the string. Want a func that returns the shortest substring controlled by the set. Ex. s = "helloworld", set = { 'l', 'r', 'w' }

Automaton is made up of

Set of states

Finite state automata - What is the state with outside arrow pointing to it?

Start state

Automaton is always on a

State

How does substr func work?

Substring function creates a new String object. Every time. Every time you use Substring, you're taking a space complexity cost of O(N).

How to find the longest shared substring

Suffix tree and look for two symbols on the same letter. Ex. $ and #.

T/F All substrings are prefixes of a suffix?

T

T/F If our predominant query is simply to check the existence of a given String, then we should use a HashTable over a Trie

T

What's the BF approach for a controlling set?

s = "helloworld" , set = { 'l', 'r', 'w' } BF: enumerate all substrings n^2 * evaluate size of the string n = O(n^3) Triple for loop. Double for loop to get every substring. Inside for loop to check every letter against the set.

T/F? SubArray and SubString are the same thing, except the former is in the context of any array and the latter is in the context of a String Both SubArrays and SubStrings are defined to be contiguous group of elements of a given array. A SubSequence needs to make sure that the order is preserved SubArray, SubString and SubSequence - all of them preserve order There is no concept of ordering of elements in a SubSet

T

Trie is a tree?

T

How should you lead your interviewer

Talk through your approach. "First I'm going to work on the algorithmic optimizations. I have an O(n^3) to start, I'm going to write out test cases and pseudocode. Then I'll work on smaller optimizations."

Finite state automata - What is the state with a double circle?

The final state / accepting state. Pattern is true

What is delta in FSA. What's the function signature?

Transition function: f(i, x) = j
i = current state
x = symbol given as input
j = next state
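As an illustration of the transition function, here is a small sketch of a deterministic automaton over the alphabet {a, b} that accepts strings ending in "ab". The table `DELTA`, the state numbering, and the helper `accepts` are assumptions made for this example.

```python
# delta maps (current state, input symbol) -> next state.
# State 0: no progress, state 1: last char was 'a', state 2: last two chars were "ab".
DELTA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}
START = 0
ACCEPTING = {2}

def accepts(text):
    """Run the automaton over text (chars must be 'a' or 'b')."""
    state = START
    for ch in text:
        state = DELTA[(state, ch)]  # f(i, x) = j
    return state in ACCEPTING
```

The automaton is always in exactly one state, and a single pass over the input decides acceptance.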

What is a suffix tree? TC / SC?

Tree of all the suffixes of a word. Similar to a prefix tree.
TC construct: O(n^2)
TC find: O(m) (m == size of the term being searched)
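A minimal sketch of the naive O(n^2) construction and the O(m) lookup, using nested dicts as nodes. The names `build_suffix_trie` and `contains`, and the use of "$" as the terminator key, are assumptions for illustration; the text is assumed not to contain "$".

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts: O(n^2) time and space."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["$"] = True  # terminator: a suffix ends here
    return root

def contains(trie, pattern):
    """Every substring is a prefix of some suffix, so walk the pattern from the root: O(m)."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True
```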

String problem categories

Trees Pattern matching on dict Pointer manipulation - string reversal Suffix trees

ASCII, UTF-8 and ISO 8859 all have the same encoding for characters 32-127?

True

What's the order of operations in minimum window substr?

1. Create the pattern dict (t_dict).
2. Create start / end pointers. Whenever we find a char that is in the pattern, increment its count in the string dict (s_dict). Increment a counter only while s_dict holds no more of that char than t_dict requires:
    if s_char in t_dict and s_dict[s_char] <= t_dict[s_char]:
        count += 1
3. When count equals the pattern length, run a while loop to drop chars from the left that are not in the pattern or that s_dict holds more of than t_dict needs:
    start_char = s[start]
    while start_char not in t_dict or s_dict[start_char] > t_dict[start_char]:
        if start_char in t_dict:
            s_dict[start_char] -= 1
        start += 1
        start_char = s[start]

Insight for Word Break II

1. DFS passing s, word dict and memo 2. Iterate through words to see if the s starts with the word. If so, DFS forward. Add the cur word with all possible endings. Add to memo for future calls.
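The memoized DFS described above might look like the following sketch. The function name `word_break` and the space-joined output format are assumptions, and the dictionary words are assumed to be non-empty.

```python
def word_break(s, words, memo=None):
    """Return every way to split s into dictionary words, joined by spaces."""
    if memo is None:
        memo = {}
    if s in memo:
        return memo[s]
    if not s:
        return [""]  # one way to split the empty string: use no words
    results = []
    for w in words:
        if s.startswith(w):
            # DFS forward on the remainder and attach the current word to every ending
            for tail in word_break(s[len(w):], words, memo):
                results.append(w if not tail else w + " " + tail)
    memo[s] = results
    return results
```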

Insight for shortest way to form a string?

1. Given each char. Have i = pat.find(ch, i). If not found reset i. i += 1
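A sketch of the `find`-based scan, assuming the problem is to count how many subsequences of a source string must be concatenated to form the target. The name `shortest_way` and the -1 return for an impossible target are assumptions.

```python
def shortest_way(source, target):
    """Count the subsequences of source needed to spell target, or -1 if impossible."""
    if not target:
        return 0
    if set(target) - set(source):
        return -1  # some target char never appears in source
    count = 1
    i = 0  # next position in source to search from
    for ch in target:
        i = source.find(ch, i)
        if i == -1:
            # ran off the end of source: start a new pass from the beginning
            count += 1
            i = source.find(ch)
        i += 1
    return count
```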

Insight for print matrix in spiral order

1. Have a list of 4 moves 2. While loop doesn't have to know what the move is. Just which index is it. Check if bounds are right and that next isn't a None. If it is rotate. If 4+ rotations, end
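One possible sketch of the move-list idea. It differs from the card in one respect: instead of overwriting visited cells with None, it tracks them in a separate `seen` grid, and it stops after visiting every cell rather than counting rotations. The name `spiral_order` is an assumption.

```python
def spiral_order(matrix):
    """Return the elements of a rectangular matrix in clockwise spiral order."""
    if not matrix or not matrix[0]:
        return []
    rows, cols = len(matrix), len(matrix[0])
    moves = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # right, down, left, up
    seen = [[False] * cols for _ in range(rows)]
    out = []
    r = c = d = 0  # d indexes into moves
    for _ in range(rows * cols):
        out.append(matrix[r][c])
        seen[r][c] = True
        nr, nc = r + moves[d][0], c + moves[d][1]
        # rotate when the next cell is out of bounds or already visited
        if not (0 <= nr < rows and 0 <= nc < cols) or seen[nr][nc]:
            d = (d + 1) % 4
            nr, nc = r + moves[d][0], c + moves[d][1]
        r, c = nr, nc
    return out
```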

Insight for Jump Game II?

1. Keep track of current / next ladder. Don't jump ladder until you're at the end of it.
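The ladder idea can be sketched as a single greedy pass. Here `cur_end` is the end of the current ladder and `farthest` is the end of the next one; both names, and the assumption that the last index is always reachable, are mine.

```python
def min_jumps(nums):
    """Minimum number of jumps to reach the last index (assumed reachable)."""
    jumps = 0
    cur_end = 0   # last index reachable with `jumps` jumps
    farthest = 0  # last index reachable with one more jump
    for i in range(len(nums) - 1):
        farthest = max(farthest, i + nums[i])
        if i == cur_end:  # end of the current ladder: now we have to jump
            jumps += 1
            cur_end = farthest
    return jumps
```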

Insights for Word Subsets?

1. Preprocess the sets into one dict 2. Compare each word to a copy of merged_set

Insight for minimum window substr

1. Two pointers. Create a target_dict and a cur_dict 2. Two modes cur_count < target_count, and otherwise 3. If less, add ch[j] increment j 4. If eq, remove ch[i] and increment i
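A sketch of the two-mode window, assuming the usual problem statement (smallest window of s containing every char of t with multiplicity). It keeps a running `matched` counter so the two dicts are never compared in full. The names `min_window`, `need`, `have`, and `matched` are illustrative.

```python
from collections import Counter

def min_window(s, t):
    """Return the shortest substring of s that contains all chars of t, or ""."""
    if not s or not t:
        return ""
    need = Counter(t)   # target_dict
    have = Counter()    # cur_dict
    matched = 0         # chars of t (with multiplicity) currently covered
    best = ""
    i = 0
    for j, ch in enumerate(s):
        # grow mode: add ch[j]
        have[ch] += 1
        if ch in need and have[ch] <= need[ch]:
            matched += 1
        # shrink mode: the window covers t, so remove ch[i] while it still does
        while matched == len(t):
            if not best or j - i + 1 < len(best):
                best = s[i:j + 1]
            left = s[i]
            have[left] -= 1
            if left in need and have[left] < need[left]:
                matched -= 1
            i += 1
    return best
```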

Why was your original implementation of minimum window an order of magnitude slower?

1. Was constantly comparing an entire dict. Worst case if the dict is large, it's an additional O(n)

Questions to ask in a strings question:

Could any of the data structures help: 1. Two pointers 2. Creating a tree or suffix trie 3. Adding words to a dict / set 4. Enumerating all substrings 5. Creating a DP table

What's the optimization for a substring search? What's it called

Creating a table that records, for each position in the pattern, how far to fall back after a miss, so we only reset that far instead of restarting the whole pattern. KMP (Knuth-Morris-Pratt).

What's the insight for minimum window subsequence?

DFS forward, return index of matched end of subsequence. MEMOize the answer so any further request from same point will return same end. Only have to iterate through that part of the string once for that char.

Insight for shortest subarray with sum at least k?

Firstly, the naive algorithm finds the shortest subarray by comparing every combination of start and end indexes of a subarray of A, such as A[i:j]. Time: O(N^2) => TLE, not accepted by the OJ.
Secondly, we iterate through the whole array once with a heap, which takes O(N log N) time and is accepted by the OJ.
- sm stores the sum of the numbers up to and including the current index.
- l stores the shortest length found so far, initialized to float("inf").
- A heap queue stores tuples (preSum, preIndex), where preSum is the prefix sum through preIndex. We push (0, -1) first so that subarrays starting at index 0 are checked too.
How we check valid subarrays: at the current index i we know sm, and we need a previous preSum such that sm - preSum >= K. In other words, preSum must be less than or equal to the critical difference limit sm - K. Python's heapq gives us the smallest tuple, prioritized by sum and then index, so the smallest preSum is always on top.
The most important step: pop the smallest tuple from the heap while i - preIndex >= l or preSum <= the critical difference limit. We cannot use this tuple later, because the length only grows as i increases. A popped tuple gives a new shortest subarray if i - preIndex < l, in which case we update l.
Finally we return l after iterating over the whole array. If l is still its initial value, we couldn't find any subarray, so return -1.
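A sketch of the heap-based approach described above, under the assumption that the input is a list of integers (possibly negative) and K is the required minimum sum. Variable names are illustrative. Because every index is pushed and popped at most once, the cost is O(N log N).

```python
import heapq

def shortest_subarray(A, K):
    """Length of the shortest non-empty subarray with sum >= K, or -1 if none."""
    heap = [(0, -1)]       # (prefix sum through index, index); (0, -1) is the empty prefix
    prefix = 0
    best = float("inf")
    for i, x in enumerate(A):
        prefix += x
        limit = prefix - K  # any earlier prefix sum <= limit gives a valid subarray
        while heap and (heap[0][0] <= limit or i - heap[0][1] >= best):
            pre_sum, pre_index = heapq.heappop(heap)
            if pre_sum <= limit and i - pre_index < best:
                best = i - pre_index
        heapq.heappush(heap, (prefix, i))
    return best if best != float("inf") else -1
```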

In what case, does the entire pre-processed table for KMP consist of all zeroes?

When the pattern is all different characters e.g. abcd

Find smallest number by removing k digits. Ex. num = 3194, k = 2

Within window, find smallest one. Is smallest on the left? If not, drop to the left.

What algo to build suffix tree in linear time?

Ukkonen's algo. Very difficult to implement in an hour.

How does KMP work?

1. For loop to start i at every point of the string. We walk through to compare each char with the pattern one at a time. 2. If not a match, we check: "do we have a suffix that is also a prefix?" 3. If so, we can start off from that point when resuming the outside loop, where the suffix begins. Tricky. Look at i-1 from the char that didn't match. Look at the index it points to. If the char at that index also doesn't match, repeat, until either a match or back at 0. 4. If not, we can jump all the way to the char that didn't match in the outer loop.
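A compact sketch of the standard KMP formulation: a table of "longest proper prefix that is also a suffix" lengths built from the pattern, then one pass over the text that never moves backward. The names `build_lps` and `kmp_search` are illustrative.

```python
def build_lps(pattern):
    """lps[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    lps = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = lps[k - 1]  # fall back to the next shorter prefix-suffix
        if pattern[i] == pattern[k]:
            k += 1
        lps[i] = k
    return lps

def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    lps = build_lps(pattern)
    matches = []
    k = 0  # number of pattern chars matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = lps[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = lps[k - 1]  # keep going to find overlapping matches
    return matches
```

For a pattern with all distinct characters such as "abcd", `build_lps` returns all zeroes.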

What's the insight for regex matcher?

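1. Good case for DP 2. Make DP table, where dp[i][j] says whether the first i chars of the string match the first j chars of the pattern 3. Only two conditions: 3a. The pattern char matches the string char or is ".", so take dp[i-1][j-1] 3b. The pattern char is a *, in which case we can take dp[i][j-2] (zero repeats) OR, if pattern[j-2] == string[i-1] or pattern[j-2] == ".", we can take dp[i-1][j] 4. Else false 5. Take the bottom right val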
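A sketch of the DP table for a matcher that supports "." and "*", assuming a well-formed pattern (every "*" follows some character). The name `is_match` and the indexing convention, where dp[i][j] covers the first i chars of s and the first j chars of p, are assumptions.

```python
def is_match(s, p):
    """True if the whole string s matches pattern p ('.' = any char, '*' = zero or more of previous)."""
    m, n = len(s), len(p)
    dp = [[False] * (n + 1) for _ in range(m + 1)]
    dp[0][0] = True
    # an empty string can match patterns like "a*b*"
    for j in range(2, n + 1):
        if p[j - 1] == "*":
            dp[0][j] = dp[0][j - 2]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if p[j - 1] == "*":
                if j >= 2:
                    zero = dp[i][j - 2]  # use zero copies of the starred char
                    more = (p[j - 2] == s[i - 1] or p[j - 2] == ".") and dp[i - 1][j]
                    dp[i][j] = zero or more
            elif p[j - 1] == "." or p[j - 1] == s[i - 1]:
                dp[i][j] = dp[i - 1][j - 1]
    return dp[m][n]  # bottom right value
```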

What's the insight behind KMP?

1. Instead of costing O(nm), we can lower it by not having to backtrack to the start of the pattern each time. Instead given our kmp array, we can step back to the previous spot that shared the same prefix.

For longest repeating substring, what's the conditional(s) that marks a given node as a potential longest?

1. It has more than 1 child 2. It's the end_of_suffix and has a child

What's the insight for joining words to make a palindrome?

1. Preprocess words by making a dict of them reversed. Reversed dict 2. Iterate through each word to check both if was the start vs if it was the end of a palindrome. 3. Check for if fragment is in reversed dict, make sure it's not pointing to itself, and make sure that any unused str is a palindrome itself.

What are the two types of string problems?

1. Substring 2. Search

How to implement a wildcard matcher with recursion?

1. When the chars are the same, or the pattern char is the single-char wildcard (?), move both forward. 2. If not match return false 3. If * then recurse with both accepting next char or not. Moving i forward or j forward. 4. If * and target string not at end, false.
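A recursive sketch, assuming glob-style semantics where "?" matches exactly one char and "*" matches any run of chars, including none. Memoization with `lru_cache` is an addition of mine to avoid exponential blowup; the name `wildcard_match` is illustrative.

```python
from functools import lru_cache

def wildcard_match(s, p):
    """True if the whole string s matches the wildcard pattern p."""
    @lru_cache(maxsize=None)
    def go(i, j):
        if j == len(p):
            return i == len(s)  # pattern used up: string must be used up too
        if p[j] == "*":
            # either '*' matches nothing (move j) or it absorbs s[i] (move i)
            return go(i, j + 1) or (i < len(s) and go(i + 1, j))
        if i < len(s) and (p[j] == "?" or p[j] == s[i]):
            return go(i + 1, j + 1)
        return False
    return go(0, 0)
```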

Tree vs Trie

A tree is a general structure of recursive nodes. There are many types of trees. Popular ones are binary tree and balanced tree. A Trie is a kind of tree, also known as a prefix tree, digital search tree, or retrieval tree (hence the name 'trie'). A suffix tree is a compressed trie built from all the suffixes of a string.

Important starting premise of a string

Default behavior: IMMUTABLE vs MUTABLE. Each runtime has its own version. Java - immutable, so need StringBuilder. C++ - different (std::string is mutable). Python - immutable. When you have immutable strings

Given a corpus of N strings, of length L each, what if we used a HashTable instead of a Trie, in order to store strings for efficient lookups? Which of the following are true statements?

Both Tries and HashTables have the same time complexity O(L) of insert, because in a HashTable, we need to compute a hashkey for each insert which takes O(L). Even though both Tries and HashTables have the same complexity of insert [O(L)], all things equal, Tries are faster in implementation because we don't need to run the strings through a hashing function and handle collisions. Which one is actually faster for our implementation, depends on what kind of queries we will get. e.g. If we simply need to find existence of a string, HashTables are faster. If however we need to find prefixes, then Tries are faster.

String vs Character array for pw?

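Character array, because Strings are immutable: a password String cannot be wiped and lingers in memory, so a memory dump can reveal what the password is! A char array can be overwritten after use.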

What does a suffix tree do?

Compresses duplicates Can find the most repeated substring of at least k length Duplicate substring Shared substring

Why is KMP important?

Concept / intuition. It's a first step to creating a sliding window. a Jump table. Finds patterns in a string. Don't need to remember how to implement

Which of the following are true, for solving the Longest Common Substring problem? (Given two strings of length M and N) If we use hash tables and hash all substrings of both strings, then we can reduce the time complexity to constant time, because hashtables are always constant

F

Text can exist outside of an encoding?

False. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII. There Ain't No Such Thing As Plain Text.

What's the linear solution for controlling set?

Have two pointers starting from left. When you don't have controlling set grow to right. When you do, move the left right. Grow from one side, shrink from the other. Like a worm Two pointers, while loop. If growing j++, shrinking i++. Need a data structure to support growing, counting, removing. Like a hashmap.
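The grow/shrink "worm" can be sketched as follows, using a hashmap of counts for the chars in the set. The name `shortest_controlled` and the choice to return None when no substring is controlled are assumptions.

```python
def shortest_controlled(s, chars):
    """Shortest substring of s containing every char in chars, or None if there is none."""
    need = set(chars)
    if not need:
        return ""
    counts = {}   # occurrences of each needed char inside the window
    covered = 0   # how many needed chars are present at least once
    best = None
    i = 0
    for j, ch in enumerate(s):
        # grow to the right
        if ch in need:
            counts[ch] = counts.get(ch, 0) + 1
            if counts[ch] == 1:
                covered += 1
        # shrink from the left while the window is still controlled
        while covered == len(need):
            if best is None or j - i + 1 < len(best):
                best = s[i:j + 1]
            left = s[i]
            if left in need:
                counts[left] -= 1
                if counts[left] == 0:
                    covered -= 1
            i += 1
    return best
```

Each pointer moves at most n steps, so the scan is O(n).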

What was the issue with Boggler Solver?

Found a working solution that works with DFS of every cell in the matrix. Though as expected it runs slow for some of the bigger use cases. matrix = n*m dictCount = d wordLen = w O(n*m*d*w)

Rabin Karp question uses:

Given two strings A and B, and a number X find if they have a common sequence of length X. Given two strings, find out if one is a rotation of the other.

How does a rolling hash work?

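Good for comparing large things a small bit at a time. window = size of substr. hash = val_of_first_char * alphabet_size^(window_size - 1) + ... + val_of_last_char * alphabet_size^0. When sliding by one letter, subtract the outgoing char's term (its value times alphabet_size^(window_size - 1)), multiply the rest by alphabet_size to shift it over, and add the incoming char.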
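A sketch of Rabin-Karp built on this rolling hash. The base 256 and the modulus 1_000_000_007 are arbitrary choices for illustration, and the function name `rabin_karp` is an assumption. A hash hit is verified with a direct comparison, because different windows can share a hash.

```python
def rabin_karp(text, pattern, base=256, mod=1_000_000_007):
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return -1
    high = pow(base, m - 1, mod)  # weight of the outgoing (leftmost) char
    p_hash = t_hash = 0
    for k in range(m):
        p_hash = (p_hash * base + ord(pattern[k])) % mod
        t_hash = (t_hash * base + ord(text[k])) % mod
    for i in range(n - m + 1):
        # compare hashes first; confirm with a real comparison to rule out collisions
        if p_hash == t_hash and text[i:i + m] == pattern:
            return i
        if i < n - m:
            # drop text[i], shift the rest, append text[i + m]
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return -1
```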

Ex of double for loop inclusive vs exclusive

Inclusive (j is the last index of the substring):
for i in range(0, n):
    for j in range(i, n):
Exclusive (j is one past the last index):
for i in range(0, n):
    for j in range(i + 1, n + 1):

When is an index inclusive vs exclusive?

Inclusive = value is included in the array Exclusive = not included, length typically exclusive

BF substring search. Fast or slow? Why?

It's slow. It's not linear time. Linear time is possible and this one doesn't give us that. It needs us to save previous text. In large corpus of text, when we get a mismatch towards the very end of the pattern, we need to have previous text saved, so that we can go back and increment one.

If L is the size of the given alphabet, and N is the size of the longest string in the input, what is the time complexity of Insert, Delete and Query operations on a Trie?

N, N, N (These 3 operations are independent of the size of the alphabet)

Sliding window tip

Need to pick a size that allows you to make a decision

If N is the size of the text, and M is the size of the pattern, then what is the time complexity of KMP, if we were to find ALL occurrences of the pattern?

O(M + N), because regardless of how many occurrences there are, the text pointer never moves backward, so the total number of comparisons stays linear

If N is the size of the text, and M is the size of the pattern, then what is the time complexity of KMP?

O(M + N), which is linear

TC of building suffix tree?

O(N^2) - We're adding N suffixes with max length N

How many nodes in a Suffix Trie made out of a string of length N?

O(NxN) - N suffixes, with max length of N and each node is added to the tree

What's the optimized version for controlling set?

O(n^2)

What's the BF code for controlling set?

O(n^3)

BF of finding largest palindrome

O(n^3). Double for loop to enumerate all substrings. Two pointers to validate if palindrome.
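A sketch of the O(n^3) brute force: the double loop enumerates substrings and a two-pointer check validates each one. The name `longest_palindrome_bf` is illustrative; on ties it keeps the first longest palindrome found.

```python
def longest_palindrome_bf(s):
    """Brute-force longest palindromic substring."""
    def is_pal(i, j):
        # two pointers walking inward over s[i..j], inclusive
        while i < j:
            if s[i] != s[j]:
                return False
            i += 1
            j -= 1
        return True

    best = ""
    for i in range(len(s)):
        for j in range(i, len(s)):
            if j - i + 1 > len(best) and is_pal(i, j):
                best = s[i:j + 1]
    return best
```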

How long does it take to get the len of a string in Python, Java, and C?

Python/Java/C#/C++: It takes constant time to find the length of a string, because the length is stored with the string object (Python, Java and C# strings are also immutable, so it never changes after creation). C: It takes linear time to find the length of a string every time, as you have to go through each character up to the null terminator.

What's the easier way to write a tree on the whiteboard

Radix tree. Minimizes the amount of writing. Don't need individual nodes for a single path that doesn't branch.

What's the BF for finding substring?

Starting from every char. Walking through substring and input at same time. n = str m = substr O(nm)

What is a substring?

Subarray of the characters. Continuous chunk Similar to subarray BC is substring of ABCD BD is NOT

Which of the following are true, for solving the Longest Common Substring problem? (Given two strings of length M and N) After we construct a Suffix Trie, then it can be solved in linear time O(M + N), because all it takes is a DFS over the combined tree.

T

Which of the following are true, for solving the Longest Common Substring problem? (Given two strings of length M and N) Brute Force takes O(M^3 x N), where you'd first find all substrings of M [O(M^2)] and check if they exist in N (MxN).

T

Which of the following are true, for solving the Longest Common Substring problem? (Given two strings of length M and N) If we are open to using space O(MxN), then we can use DP to solve it in O(MxN) time.

T

Solution to longest substring with at most k distinct char? TC / SC?

TC: O(n) - one pass through SC: O(1) - dict will have a max size
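A sliding-window sketch for this problem. The name `longest_k_distinct` is illustrative. The dict never holds more than k + 1 keys, which is the sense in which the space is bounded.

```python
def longest_k_distinct(s, k):
    """Length of the longest substring of s with at most k distinct chars."""
    counts = {}  # char -> occurrences inside the window
    best = 0
    i = 0
    for j, ch in enumerate(s):
        counts[ch] = counts.get(ch, 0) + 1
        # too many distinct chars: shrink from the left
        while len(counts) > k:
            left = s[i]
            counts[left] -= 1
            if counts[left] == 0:
                del counts[left]
            i += 1
        best = max(best, j - i + 1)
    return best
```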

Rabin Karp for finding a substring

Take a window of the text the size of the pattern and compare its hash with the pattern's hash in constant time. -> Rolling Hash. O(m+n) on average; O(mn) in the worst case, when hashes collide.

What's the insight behind indices in a text string?

There are spaces b/w every word so can use text.split(" ") to get an array of each word. Build a dict to keep track of all indices of each. Do a simple lookup. O(n)

Code to enumerate all substrings

Two for loops

History of string problems

Used to be warm up problem. Ex reverse a string Ex. Phone screen Just b/c they're simple, doesn't mean there aren't important nuances

What's the insight for Boggler Solver?

Using a trie, we can at least search all the word simultaneously. Can remove dictCount from the equation. matrix = n*m dictCount = d wordLen = w O(n*m*d*w) Turns to O(n*m*w)

What was the problem we ran into when joining words to make a palindrome? Which cases worked?

Worked 1. Any non-palindrome word Didn't work 1. When the words themselves contained palindromes. Different rules were needed. Ex. 'a' and 'levela'

