System Design + Scalability (Cracking the Coding Interview questions + PM interview questions)


Pros and cons of XML approach

+ Easy to distribute and can be easily read by both machines and humans
+ Most languages have libraries to perform XML parsing, so it's easy to consume
+ New data can be added to the XML file as extra nodes without breaking existing parsers
+ Since the data is stored as XML files, existing tools can be used for backing up the data
- Sends clients all of the information, even if they only want part of it (inefficient)
- Performing any query on the data requires parsing the entire file
- Clients are limited to grabbing the data only in the ways we anticipated
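To make the "easy to parse" point concrete, here is a minimal sketch of consuming such a feed with Python's standard xml.etree.ElementTree. The <quotes>/<quote> schema and the field names are assumptions for illustration, not part of the original question.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed layout -- the real schema would be whatever we publish.
SAMPLE_FEED = """
<quotes date="2024-01-15">
  <quote symbol="GOOG">
    <open>140.10</open>
    <close>141.52</close>
    <volume>18200000</volume>
  </quote>
  <quote symbol="MSFT">
    <open>388.00</open>
    <close>390.27</close>
    <volume>21500000</volume>
  </quote>
</quotes>
"""

def parse_quotes(xml_text):
    """Return {symbol: {field: value}} for every <quote> node in the feed."""
    root = ET.fromstring(xml_text)
    quotes = {}
    for node in root.findall("quote"):
        quotes[node.get("symbol")] = {
            child.tag: float(child.text) for child in node
        }
    return quotes

if __name__ == "__main__":
    print(parse_quotes(SAMPLE_FEED)["GOOG"]["close"])  # 141.52
```

Note how a client that only wants the closing price still downloads and parses the full feed, which is exactly the inefficiency listed above.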

Pros and cons of SQL DB

+ Easy way to do query processing over the data if additional features are needed
+ Rolling back, backing up data, and security can be provided using standard database features
+ Easy for clients to integrate, since SQL access is a standard feature
- Much heavier weight and more complex than we need
- Difficult for humans to read, so we would need to implement an additional layer to view and maintain the data, increasing implementation costs
- Security: clients may perform expensive and inefficient queries, and we need to be careful not to give clients access they shouldn't have

Pros and cons of FTP server and text files

+ Easy to maintain (files are easily viewed and backed up)
- Requires more complex parsing to do any kind of query
- If any additional data is added to the text file, it may break the clients' parsing mechanism

Design a cache system for simplified search engine. System requirements

- Efficient lookups given a key
- Expiration of old data so that it can be replaced with new data
- Handling updates to / clearing of the cache when the results for a query change

Load Balancer for google.com (For California)

- It's logical to select the closest server group based on your region, but if that group is full we need to reroute to a nearby group, e.g. Nevada. This needs a variation of weighted round-robin. Round robin uses a queue, e.g. 10 available slots; the queue is FIFO, so servers are dequeued and enqueued.
- 2-tiered queue system: 1) states, 2) individual server clusters. With weighted round robin the queue is effectively re-sorted each time, so available servers in your closest states show up first (see the sketch below).
- Complexity: access is O(1) since the closest available server is always at the head; delete is O(1) since it's a dequeue; add is O(1) since it's an enqueue.
- Could add a switch preventing starvation, where a task is continually denied access to a resource.
- Could use Layer 4 (transport layer) or Layer 7 (application layer) load balancers, which aren't based purely on physical proximity.
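A minimal sketch of the two-tier idea, assuming regions are pre-sorted by proximity and each region keeps a FIFO queue of free servers. The class, region names, and server IDs are all made up for illustration.

```python
from collections import deque

class TwoTierBalancer:
    """Tier 1: regions ordered by proximity; tier 2: a FIFO queue of free servers per region."""

    def __init__(self, regions_by_proximity):
        # regions_by_proximity: list of (region_name, [server, ...]), closest region first.
        self.regions = [(name, deque(servers)) for name, servers in regions_by_proximity]

    def acquire(self):
        # The head of the first non-empty queue is the closest free server,
        # so a successful acquire is O(1) whenever the local region has capacity.
        for name, queue in self.regions:
            if queue:
                return name, queue.popleft()
        raise RuntimeError("no capacity anywhere")

    def release(self, region, server):
        # Enqueue at the tail: plain round-robin within a region.
        for name, queue in self.regions:
            if name == region:
                queue.append(server)
                return
        raise KeyError(region)

lb = TwoTierBalancer([("california", ["ca-1", "ca-2"]), ("nevada", ["nv-1"])])
print(lb.acquire())  # ('california', 'ca-1'); once CA is empty, requests spill to Nevada
```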

Design a cache system for simplified search engine. Assumptions

- Other than calling out to other machines, all query processing happens on the initial machine that was called
- The number of queries we wish to cache is large
- Calling between machines is quick
- Results for a given query are an ordered list of URLs, each of which has an associated 50-character title and 200-character summary
- The most popular queries are very popular, so they essentially always appear in the cache

Design Cache for many machines

1) Give each machine its own cache: + Quick, as there are no machine-to-machine calls. - A less effective cache, since many repeat queries are treated as fresh queries.
2) Each machine has a copy of the cache: when new items are added, they are sent to all machines. + Common queries are nearly always in the cache. - Updating the cache means firing off data to many machines, and the cache as a whole holds much less data.
3) Each machine stores a segment of the cache: divide the cache so each machine holds a different part. Queries can be assigned to machines with a hash formula (see the sketch below), or the system can be designed so that machine j returns null if it doesn't have the query in its cache. - Increases the number of machine-to-machine calls.
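A rough sketch of option 3's hash-based assignment. The cluster size and the dict standing in for a remote call are assumptions; in a real system the lookup on the owning machine would be an RPC.

```python
import hashlib

NUM_MACHINES = 32  # assumed cluster size

def machine_for(query):
    """Map a query string to the machine that owns its cache segment.
    md5 keeps the assignment stable across processes (unlike Python's built-in hash())."""
    digest = hashlib.md5(query.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_MACHINES

def lookup(query, caches):
    """caches[i] stands in for 'ask machine i for its segment' -- here just a dict per machine."""
    owner = machine_for(query)
    return caches[owner].get(query)  # None means a cache miss on the owning machine
```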

Design Google Search service

1) Web crawler, 2) index (quick retrieval), 3) serve results (ranking).
Serving results: need to understand the search string (query rewriting) and the context from the user's account - based on search history, YouTube/other Google property history, location, and interest profile. This allows searches to be more accurate and pertinent, giving a better user experience.
Ranking: query-independent and query-dependent signals, e.g. CTR and the number of follow-on queries.
Filtering: remove offensive, illegal, malicious, spam, and sexually explicit results.

Lossy compression (allow image to lose some of its content)

1. Transform encoding: the image's colors are averaged out using a formula over small blocks (the discrete cosine transform). The image suffers color loss and may show artifacts like pixelation in parts of the image. You can dictate how much quality you want to remain in the image.
2. Chroma subsampling: attempts to keep brightness the same in all areas rather than averaging everything. This tricks the eye into not readily noticing any dip in quality. Works well for video streams.
3. Google WebP: instead of averaging color info, it predicts the color of a pixel by looking at the fragments surrounding it. The data that's actually written into the resulting compressed image is the difference between the predicted color and the actual color. Instead of printing a whole bunch of zeroes, it compresses all of them into one symbol that represents them.

Bayesian vs AI

Bayesian: uses statistical methods to predict probabilities based on existing info. Simpler and faster.
AI: gathers enough info to formulate a scenario and match it against pre-existing patterns so it can know what to do next.
Bayesian is better when we don't have enough data, want to avoid overfitting (trying too hard to find a pattern), and don't have enough experience.
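A toy illustration of the Bayesian idea (the spam-filter numbers below are made up purely for the arithmetic):

```python
def posterior(prior, likelihood, evidence):
    """Bayes' rule: P(hypothesis | data) = P(data | hypothesis) * P(hypothesis) / P(data)."""
    return likelihood * prior / evidence

# Assumed numbers: P(spam) = 0.2, P("free" | spam) = 0.6, P("free" | not spam) = 0.05.
p_free = 0.2 * 0.6 + 0.8 * 0.05           # total probability of seeing the word "free"
print(posterior(prior=0.2, likelihood=0.6, evidence=p_free))  # 0.75 = P(spam | "free")
```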

Lossless compression (preserves quality while maintaining a small image size - minimizes distortion and preserves clarity)

Builds an index of all the pixels and groups same-colored pixels together (the way file compression works), e.g. the DEFLATE algorithm. Instead of just running through the data and storing multiple instances of same-colored pixels as a single data unit (run-length encoding), it grabs duplicate strings found within the entire file and sets a pointer for each duplicate found. Where pixels are used frequently, it replaces all of those pixels with a weighted symbol that compresses everything further. None of the pixels are forced to change color; the only difference is how much space is taken up on your hard drive.
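A small sketch of the simplest lossless step mentioned above, run-length encoding. DEFLATE goes much further (back-references to any earlier duplicate string plus Huffman coding), but this shows the key property: the pixels decode back exactly, only the storage changes.

```python
def run_length_encode(pixels):
    """Collapse runs of identical pixel values into (value, count) pairs."""
    if not pixels:
        return []
    encoded = [[pixels[0], 1]]
    for p in pixels[1:]:
        if p == encoded[-1][0]:
            encoded[-1][1] += 1          # extend the current run
        else:
            encoded.append([p, 1])       # start a new run
    return [tuple(pair) for pair in encoded]

def run_length_decode(pairs):
    return [value for value, count in pairs for _ in range(count)]

row = ["white"] * 6 + ["black"] * 2 + ["white"] * 4
assert run_length_decode(run_length_encode(row)) == row  # lossless round trip
print(run_length_encode(row))  # [('white', 6), ('black', 2), ('white', 4)]
```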

Design Cache for a single system

We need a data structure that can easily purge old data and do efficient lookups.
- A linked list allows easy purging of old data and moving "fresh" items to the front; remove the last element of the linked list when the list exceeds a certain size.
- A hash table gives efficient lookups but doesn't allow easy data purging.
- Merge the two data structures: in the linked list, move a node to the front every time it's accessed, so the end of the linked list always contains the stalest info. The hash table maps from a query to the corresponding node in the linked list. This lets us not only efficiently return cached results but also move the appropriate node to the front of the list, updating its "freshness" (see the sketch below).
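A compact sketch of the merged structure, assuming Python's OrderedDict, which is a hash table whose keys also sit in a doubly linked list - exactly the combination described above. The tiny capacity is just for the demo.

```python
from collections import OrderedDict

class QueryCache:
    """LRU-style cache: O(1) lookup via the hash table, O(1) freshness updates and
    purging via the underlying linked list of keys."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, query):
        if query not in self.items:
            return None                                 # cache miss
        self.items.move_to_end(query, last=False)       # refresh: move node to the front
        return self.items[query]

    def put(self, query, results):
        self.items[query] = results
        self.items.move_to_end(query, last=False)       # newest entry goes to the front
        if len(self.items) > self.capacity:
            self.items.popitem(last=True)               # purge the stalest entry from the tail

cache = QueryCache()
cache.put("cats", ["url1", "url2"])
cache.put("dogs", ["url3"])
print(cache.get("cats"))  # ['url1', 'url2'], and "cats" is now the freshest entry
```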

How would you design data structures for a very large social network? How to design an algorithm to show shortest path between 2 people? Scaled case

Friends may not live on the same machine as we do, so replace the list of friends with a list of their IDs and traverse those. Use a hash table to look up which machine holds each friend.
Optimizations:
- Reduce machine jumps by batching them (e.g. if 5 friends live on one machine, fetch them in a single call).
- Divide people across machines by country, city, state, or university.
- Usually in BFS we mark a node as visited. We can't do that here, since multiple searches could be going on at the same time, so it's a bad idea to edit the data. Instead, mimic the marking of nodes with a hash table that looks up a node ID and says whether it's been visited. (A sketch of the batched traversal follows.)
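A sketch of BFS over person IDs with per-level batching. lookup_machine(person_id) and fetch_batch(machine, ids) are assumed interfaces standing in for the machine hash table and the batched RPC; they are not a real API.

```python
from collections import defaultdict, deque

def shortest_path_bfs(source_id, target_id, lookup_machine, fetch_batch):
    """BFS across machines; `visited` is our external hash table of seen IDs."""
    if source_id == target_id:
        return [source_id]
    visited = {source_id: None}                 # id -> predecessor
    frontier = deque([source_id])
    while frontier:
        # Group this level's IDs by machine so each machine is called once per level.
        by_machine = defaultdict(list)
        for _ in range(len(frontier)):
            pid = frontier.popleft()
            by_machine[lookup_machine(pid)].append(pid)
        for machine, ids in by_machine.items():
            for pid, friend_ids in fetch_batch(machine, ids).items():
                for fid in friend_ids:
                    if fid in visited:
                        continue
                    visited[fid] = pid
                    if fid == target_id:
                        return _rebuild_path(visited, target_id)
                    frontier.append(fid)
    return None                                  # no connection found

def _rebuild_path(visited, target_id):
    path, node = [], target_id
    while node is not None:
        path.append(node)
        node = visited[node]
    return list(reversed(path))
```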

You have 10 billion URLs. How do you detect duplicate docs?

Each URL has an average of 100 characters, with each character taking 4 bytes, so 10 billion URLs take up about 4 TB.
Simple solution: create a hash table where each URL maps to TRUE if it has already been found in the list, or sort the list and look for adjacent duplicate values (takes more time).
If we can't store it all in memory, we could solve this by storing the data on disk or by partitioning it across multiple machines.
Storing data on disk: split the list of URLs into 4000 chunks of 1 GB each, dividing the URLs up by hash value so all copies of a URL land in the same chunk. Then load each file into memory and create a hash table of its URLs.
Multiple machines: instead of writing a URL to file x, send it to machine x. + Parallelizes the operation so chunks can be processed simultaneously (important for a large amount of data). - Relies on 4000 different machines operating perfectly, which isn't realistic, so we need to think about failure handling.
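A sketch of the disk-based variant under the assumptions above (4000 chunks, each small enough to fit in memory); the file naming and the naive append are just for illustration.

```python
import hashlib
import os

NUM_CHUNKS = 4000  # ~4 TB of URLs / ~1 GB per chunk, per the estimate above

def chunk_id(url):
    """The same URL always hashes to the same chunk, so its duplicates land in the same file."""
    return int(hashlib.sha1(url.encode("utf-8")).hexdigest(), 16) % NUM_CHUNKS

def partition(urls, out_dir="chunks"):
    os.makedirs(out_dir, exist_ok=True)
    for url in urls:                              # urls can be a lazy iterator over the full list
        path = os.path.join(out_dir, f"x{chunk_id(url):04d}.txt")
        with open(path, "a") as f:                # naive append; a real job would buffer writes
            f.write(url + "\n")

def dedupe_chunk(path):
    """Each ~1 GB chunk now fits in memory, so a plain set finds the duplicates."""
    seen, dupes = set(), []
    with open(path) as f:
        for line in f:
            url = line.strip()
            if url in seen:
                dupes.append(url)
            else:
                seen.add(url)
    return dupes
```

The multi-machine version keeps the same hashing, but chunk_id picks a machine instead of a file.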

Reduce bandwidth for Google Search

Files (videos, images, text):
- Cache policies: if a user is from a certain country, ask browsers to cache more data, i.e. if they are far from Google servers and have limited last-mile bandwidth
- Cache settings: require client browsers to turn on cache settings
- Smaller result sets (fewer top results)
- Less detailed results: smaller snippets, fewer image and video previews, at lower resolution
- Improve file compression: image, text, and file compression

If you were designing a web crawler how would you avoid getting into infinite loops?

We need to detect and prevent cycles. Create a hash table where we set hash[v] to true after we visit page v. We can crawl the web using BFS: each time we visit a page, gather all of its links and insert them at the end of a queue; if we've already visited a page, ignore it.
"Page v" can be defined based on content or on URL. URL parameters are arbitrary, so we could define it based on content, but what if the content is always updating? We could base it on an estimate of the degree of similarity: if, based on content and URL, a page is deemed sufficiently similar to other pages, we deprioritize crawling its children.
For each page we can create a signature based on snippets of its content and its URL. We keep a database that stores the list of items we need to crawl. On each iteration, we select the highest-priority page to crawl, then do the following (see the sketch below):
1. Open the page and create a signature based on specific subsections of the page and its URL
2. Query the database to see whether anything with this signature has been crawled recently
3. If something with this signature has been recently crawled, insert this page back into the database at a low priority
4. If not, crawl the page and insert its links into the database
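A sketch of that loop. fetch(url) and extract_links(content) are assumed helpers, a heap plays the role of the priority-ordered crawl database, and the signature's snippet choice plus the priority constants are arbitrary assumptions.

```python
import hashlib
import heapq

MAX_PRIORITY = 50  # cap re-insertions so the sketch always terminates

def signature(url, content):
    """Fingerprint built from the URL plus a few content snippets (first/middle/last 200 chars)."""
    mid = len(content) // 2
    snippets = content[:200] + content[mid:mid + 200] + content[-200:]
    return hashlib.sha1((url + snippets).encode("utf-8")).hexdigest()

def crawl(seed_urls, fetch, extract_links, max_pages=1000):
    frontier = [(0, url) for url in seed_urls]   # (priority, url); lower number = higher priority
    heapq.heapify(frontier)
    seen_signatures = set()
    crawled = 0
    while frontier and crawled < max_pages:
        priority, url = heapq.heappop(frontier)
        content = fetch(url)
        sig = signature(url, content)
        if sig in seen_signatures:
            # Looks like something already crawled: push it back at a lower priority
            # instead of descending into its links right away.
            if priority + 10 < MAX_PRIORITY:
                heapq.heappush(frontier, (priority + 10, url))
            continue
        seen_signatures.add(sig)
        crawled += 1
        for link in extract_links(content):
            heapq.heappush(frontier, (priority + 1, link))
```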

Building a service that will be called by client applications to get simple stock price info. How to design service that provides info to client info?

Requirements: ease of use for clients, ease of use for ourselves, flexibility for future demands, scalability and efficiency.
#1 Store the data in simple text files and let clients download it through an FTP server.
#2 Use a standard SQL database.
#3 Use XML to distribute the data. Can provide a web service (SOAP) for client data access; this adds a layer but provides additional security and makes it easier for clients to integrate.

Design a dictionary lookup for Scrabble (using English)

Use a trie plus the "rack of letters". The rack can have repeated letters, letters out of order, and anagrams.
Algorithm: takes an array of letters and returns an array of strings; these strings are the solutions to the problem.
- The longest word in Scrabble is taken to be 8 letters.
- The trie has a dummy symbol at the head that isn't a letter and represents the empty string. It has children A-Z, each of which again has children A-Z, and so on. Doing this up to depth 8 gives about 26^8 + 1 nodes, each of which uses relatively little memory.
- Improvement: remove paths down the tree that don't contain words, e.g. "aaa".
- Each word lookup requires O(N^2), where N is the length of the word. Space complexity is about 26^8 + 1 nodes, with each node taking 2 (char) + 1 (boolean) + 8 (object reference) = 11 bytes.
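A small sketch of the trie with the "only store real words" improvement, plus a rack search that handles repeated and out-of-order letters. The word list and rack are just example data.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # letter -> TrieNode; only prefixes of real words exist
        self.is_word = False

class Trie:
    def __init__(self, words):
        self.root = TrieNode()
        for word in words:
            node = self.root
            for letter in word:
                node = node.children.setdefault(letter, TrieNode())
            node.is_word = True

    def words_from_rack(self, rack):
        """Return every dictionary word buildable from the rack, letters in any order,
        each letter used at most as often as it appears in the rack."""
        found = set()

        def dfs(node, remaining, prefix):
            if node.is_word:
                found.add(prefix)
            for i, letter in enumerate(remaining):
                child = node.children.get(letter)
                if child:
                    dfs(child, remaining[:i] + remaining[i + 1:], prefix + letter)

        dfs(self.root, rack, "")
        return found

trie = Trie(["cat", "act", "at", "cab"])
print(sorted(trie.words_from_rack("tac")))  # ['act', 'at', 'cat']
```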

How would you design data structures for a very large social network? How to design an algorithm to show shortest path between 2 people? Simple case

Use a graph with each person as a node and an edge between two nodes indicating friendship. To find the path between two people, do a simple breadth-first search rather than depth-first search, since BFS finds the shortest path while DFS can wander very deep before reaching the target. Could also do a bidirectional breadth-first search: two breadth-first searches, one from the source and one from the destination; when the searches collide, we've found a path. This visits far fewer nodes than a single BFS: with k friends per person and a path of length q, it's O(k^(q/2) + k^(q/2)) = O(k^(q/2)), whereas plain BFS is O(k^q). A sketch follows.
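A simplified sketch of bidirectional BFS over an in-memory adjacency dict (the friends graph below is made up). It alternates one full level from each side and reports the path length at the first collision; a production version would finish the level and keep the minimum.

```python
from collections import deque

def bidirectional_bfs(graph, source, target):
    """Path length in edges between two people; `graph` maps person -> list of friends."""
    if source == target:
        return 0
    dist_s, dist_t = {source: 0}, {target: 0}
    q_s, q_t = deque([source]), deque([target])

    def expand(queue, dist_here, dist_other):
        for _ in range(len(queue)):                # expand exactly one level
            person = queue.popleft()
            for friend in graph[person]:
                if friend in dist_other:           # the two searches collided
                    return dist_here[person] + 1 + dist_other[friend]
                if friend not in dist_here:
                    dist_here[friend] = dist_here[person] + 1
                    queue.append(friend)
        return None

    while q_s and q_t:
        for args in ((q_s, dist_s, dist_t), (q_t, dist_t, dist_s)):
            hit = expand(*args)
            if hit is not None:
                return hit
    return None  # no connection

friends = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(bidirectional_bfs(friends, "A", "D"))  # 3
```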

Design Cache: How to update results when contents change

Results change when: 1. the content at a URL changes, 2. the ordering of results changes in response to a page's rank changing, 3. new pages appear related to a particular query.
To handle #1 and #2: create a separate hash table that tells us which cached queries are tied to a specific URL (see the sketch below). This is handled completely separately from the other cache and can reside on different machines, but it requires a lot of data. If the data doesn't require instant refreshing, we could instead periodically crawl through the cache stored on each machine to purge queries tied to updated URLs.
#3: update single-word queries by parsing the content at the new URL and purging those queries from the caches, but this only handles one-word queries. We can also implement an automatic time-out on the cache so that no data, no matter how popular, can sit in the cache for more than x minutes without being periodically refreshed.
Optimization: implement time-outs based on topic or on the URLs, so the time-out value reflects how frequently the page has been updated in the past. The time-out for a query would be the MIN of the time-outs for each of its URLs.
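A sketch of the reverse index for #1 and #2. The query_cache is assumed to expose a delete(query) method (e.g. an LRU cache like the one sketched earlier, extended with deletion); the class and method names here are illustrative only.

```python
from collections import defaultdict

class CacheInvalidator:
    """Second hash table mapping a URL to every cached query whose results contain it,
    so a change at one URL purges only the affected cache entries."""

    def __init__(self, query_cache):
        self.query_cache = query_cache
        self.queries_for_url = defaultdict(set)

    def record(self, query, result_urls):
        # Call this whenever a query's results are written into the cache.
        for url in result_urls:
            self.queries_for_url[url].add(query)

    def url_changed(self, url):
        # Purge every query tied to the updated URL; the next request recomputes and re-caches it.
        for query in self.queries_for_url.pop(url, set()):
            self.query_cache.delete(query)
```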

