System Design
MapReduce Fault Tolerance and Idempotency
- Fault tolerant ==> MapReduce handles faults (network partitions and machine failures) by re-running the Map or Reduce task where the failure occurred - this assumes that map and reduce operations are idempotent - MapReduce system admins only care about the input and output at each step - they don't have to worry about fault tolerance or the intricacies of the Map and Reduce functions
When to choose SQL vs When to choose NoSQL
Reasons to use SQL: -need to ensure ACID compliance to reduce anomalies -data is structured and unchanging and you're not experiencing massive growth - no need to scale Reasons to use NoSQL: -when the rest of application is fast and we want to prevent db from being bottleneck -storing large volumes of data with little to no structure (adding new entity types on the fly) -you want to use cloud computing and storage (cost savings and easily scalable) -Rapid development/want to quickly iterate - no need to prep a schema ahead of time
Key-Value Stores
Most popular type of non-relational db (Redis, DynamoDB). Maps string keys to arbitrary values (values are usually stored as strings, but can be other types) - very similar to a hash table - works well for caching or dynamic configuration (params or constants/global vars defined for the whole system) - O(1) lookup time ==> low latency and high throughput - some key-value stores persist data (even if the server crashes) - Redis is an in-memory key-value store - some give strong consistency, while others only give eventual consistency
HTTP Privacy - MITM Attack
there is an implied privacy b/w client and server over HTTP - however a malicious actor can hijack/intercept the underlying IP packets and read the data - or alter the data and relay them altered. Bad breach of security/privacy Man In the Middle - having unwanted 3rd party read/alter communication. -MITM can also fake sending its own public key in TLS handshake
Load Balancer
A type of reverse proxy that helps distribute load based on metric. -Monitors/watches server health (elevated error rate or crashes) - only sends requests to healthy servers -pairs well with horizontal scaling / adding more nodes by better making use of more nodes - improves throughput as each server won't be overloaded - lower latency - servers respond to requests faster because they won't be clogged/bogged down LB Types: 1) Smart Clients - Balance load across several hosts, detecting if they are down, and detecting if new hosts are added 2) Hardware Load Balancers (Citrix NetScaler) - most expensive but high performance. Difficult to configure. Sometimes used in conjunction with software - used as the first point of contact LB 3) Software Load Balancers (HAProxy - most popular open source) - considered hybrid of smart client and hw load balancer. -usually a good first choice - if needed you can later transition to hw or smart client
Scalability
Capability of a system, process, or network to grow and manage increased demand. - generally performance degrades as a system grows ==> - network speed is slower because machines are far apart - as you grow it's easy for the system to become untenable - even though you built it to be distributed. A scalable architecture avoids this performance degradation by balancing load evenly across all nodes
Pubsub Idempotent Operations
Idempotent - an operation that has the same ultimate outcome regardless of how many times it's performed - since pubsub has at-least-once delivery, it sometimes sends a message more than once - if the consumer's operations are non-idempotent, those duplicate deliveries cause incorrect results
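A minimal sketch of one common way to make a consumer idempotent under at-least-once delivery: remember the IDs of messages already applied and skip repeats. The class name, the `message_id` field, and the in-memory `set` are assumptions for illustration; a real consumer would keep the seen IDs in durable storage.

```python
class PaymentConsumer:
    """Applies each message at most once, even if it is delivered repeatedly."""

    def __init__(self):
        self.balance = 0
        self.seen_ids = set()  # IDs of messages already applied (hypothetical dedup store)

    def handle(self, message_id, amount):
        # A redelivered message is acknowledged but not applied a second time.
        if message_id in self.seen_ids:
            return False
        self.seen_ids.add(message_id)
        self.balance += amount
        return True


consumer = PaymentConsumer()
consumer.handle("msg-1", 50)
consumer.handle("msg-1", 50)  # duplicate delivery of the same message
consumer.handle("msg-2", 25)
print(consumer.balance)
```

Without the `seen_ids` check, the duplicate delivery of `msg-1` would be added to the balance twice.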
Horizontal vs Vertical Scaling
Horizontal - adding more servers to the pool of resources - easy to add a machine to the existing pool - Ex ==> Cassandra and MongoDB - easily scale horizontally by adding more machines to meet growing needs. Vertical - add more power (CPU, RAM, storage, etc.) to an existing server. Usually limited by the capacity of a single server - and scaling one server can involve downtime. Ex ==> MySQL - can switch to bigger machines, but this usually involves downtime
Redundant Load Balancers
Load balancer can be a single point of failure - we can add a second LB. Both LB can monitor health of server pool - and both are capable of switching on if the other LB fails
Improving Availability
- Avoid single points of failure through redundancy - i.e. if you have just one app server or one load balancer, it is a single point of failure
API Design Consequences
- Once you release an API and customers rely on/consume/build on it, making changes to the API becomes extremely difficult or impossible (because so many services rely on it). API design has long-lasting consequences for lots of people - API design/changes usually go through a rigorous review process
Caching Potential Problems
1) sometimes harmful to have a cache with a poor eviction policy - constantly making extra calls - higher latency 2) thrashing - constantly loading cache and evicting it without actually using the cache values (no longer using the cache results) 3) consistency with in memory cache on multiple servers -this is ok for non-critical data (maybe profile data). But not good for critical banking, healthcare data
SQL Server Transaction Syntax
BEGIN TRY
    BEGIN TRANSACTION T1
        DELETE ...
        UPDATE ...
        INSERT ....
    COMMIT TRANSACTION T1
END TRY
BEGIN CATCH
    -- only roll back if a transaction is still open
    IF (@@TRANCOUNT > 0)
    BEGIN
        ROLLBACK TRANSACTION T1
        PRINT 'Error detected, all changes reversed'
    END
END CATCH
Pubsub Pros and Cons
Pros ==> - resilient through network partitions because of guaranteed at-least-once delivery - rewind/replayability - end-to-end encryption - message retention (for a specified period) - loose coupling - easy testing Cons ==> because of at-least-once delivery, non-idempotent operations can have a bad impact
RabbitMQ vs Kafka vs SQS
- RabbitMQ offers more complex routing use-cases versus Kafka's simpler broadcast-to-topics architecture - Kafka provides higher throughput - SQS is an AWS managed service w/ high availability and easy scaling - RabbitMQ is usually managed on your own server whereas SQS is in the cloud (and scales more easily) - RabbitMQ can be slightly faster - both provide good fault tolerance, high availability, and scalability
S3 vs HDFS
- S3 is cheaper, elastic, high durability and availability, and encrypted -Hadoop is usually managed on prem (availability and maintenance concerns) -Hadoop is faster per core, but slower in performance per dollar
Cache Invalidation Schemes (write policy)
- caching requires maintenance to keep the cache coherent with the source of truth (the database)
- data modified in the db should be invalidated in the cache (to prevent inconsistent app behavior)
1) Write-through cache - data is written to the cache and the db (cache and db always in sync) - safe and slow - best for financial/healthcare/critical data
   Pros: nothing lost in a power failure/crash, cache always coherent
   Cons: high latency with two operations (slow but data safe)
2) Write-back cache - update only the cache and immediately confirm back to the client. The write back to storage is done at certain intervals
   Pros: low latency, high throughput for write-intensive applications
   Cons: data availability risk if a crash happens before the db write is made (but can be resilient with multiple writes) (fast but risky)
3) Write-around cache - data is written only to the db, not the cache
   Pros: reduces flood/load on the cache - especially for data that doesn't need to be immediately re-read
   Cons: higher latency for reading recently written data (have to go to the db to fetch, then possibly update the cache)
- possible to use a hybrid - best of both worlds
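The three write policies can be sketched side by side with two dictionaries standing in for the cache and the database. This is an illustrative toy under stated assumptions, not a production design: `CachedStore`, `dirty`, and `flush` are hypothetical names, and `flush` stands in for the periodic write back to storage.

```python
class CachedStore:
    def __init__(self):
        self.cache = {}
        self.db = {}        # stands in for the database (source of truth)
        self.dirty = set()  # keys updated in the cache but not yet in the db

    def write_through(self, key, value):
        # Write to cache and db together: always in sync, two operations per write.
        self.cache[key] = value
        self.db[key] = value

    def write_back(self, key, value):
        # Write only to the cache and return; the db is updated later by flush().
        self.cache[key] = value
        self.dirty.add(key)

    def flush(self):
        # Periodic write back of dirty entries; anything not flushed is lost on a crash.
        for key in self.dirty:
            self.db[key] = self.cache[key]
        self.dirty.clear()

    def write_around(self, key, value):
        # Write only to the db and drop any stale cached copy.
        self.db[key] = value
        self.cache.pop(key, None)

    def read(self, key):
        if key not in self.cache:           # cache miss: fetch from the db
            self.cache[key] = self.db[key]
        return self.cache[key]


store = CachedStore()
store.write_back("cart", ["book"])
print("cart" in store.db)   # the db has not seen the write yet
store.flush()
print(store.db["cart"])
```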
Reliability vs Availability
A reliable system is an available system. However, If it's available , it's not necessarily reliable
Eventual vs Strong consistency
Eventual Consistency ==> db is eventually updated (async update) so some future transaction/query is not guaranteed to see the most up to date data Strict/Strong Consistency - guarantees each transaction sees the most up to date data. -Large overhead for maintaining strict/strong consistency in NoSQL because they are distributed - it usually slows down the system as we scale up - have to vertically scale servers -strong consistency also called linearizability
Consensus Algorithms
Paxos and Raft - complicated consensus algos that have a lease mechanism Zookeeper and Etcd ==> allow you to implement leader election in your own custom way Etcd ==> key value store with high availability and strong consistency - rare to get both of these strong consistency - multiple machines - reading and writing to the same key value pair in key value store - always guaranteed to return correct value
Cloud Storage - AWS S3, Azure Blob Storage
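- S3 is not a traditional file system - more like a key/object store - S3 originally offered only eventual consistency for overwrites and deletes; since December 2020 it provides strong read-after-write consistency - systems can scale automatically (elastic) - and you only pay for the amount of storage consumed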
RabbitMQ
- producer sends a message to an exchange, which routes it using the binding key to one or more queues - consumers subscribe to queues. Different types of exchanges allow a variety of routing: - fanout exchange - duplicates the message to every queue bound to it - direct exchange - message is routed to queues whose binding key exactly matches its routing key - topic exchange - sent to queues w/ a partial (pattern) match on the binding key - headers exchange - message routed according to its headers - cloud friendly, runs as a cluster, consumer sends message ACK back to broker, browser based UI, support
Example of API Design
Ex ==> Asked to design the Stripe API - list a couple of entities and their types - write out sample data for an actual entity - define some endpoints: these are CRUD (POST, GET, PATCH, DELETE) CRUDOperation(paramName: paramType) ==> ReturnType - always make sure that any list endpoint is paginated with a limit. Exercises to prepare: 1) go through systemexpert's API design questions 2) Pick a favorite product/service - look through its API documentation - a useful way to understand a production-grade API (Stripe API, Cloud IoT) 3) Pick a product/service with a public API - how would you design that API? - go through the entire API design yourself - then go through their docs and see if your design is similar
Distributed Cache Stale/Consistency Problems (each server has its own in-memory cache)
- Ex ==> youtube comments section - User A posts a comment and caches to server 1 - User B reads that comment and caches to server 2 - User A updates comment and caches update to server 1 - User B reads comment - it is pulled from cache, but the cache is outdated -better to put a single cache separate from each server (like a redis) - then all servers use a single source of truth
Replication Consistency
If slaves/replicas have to be consistent with the main/master db, replication needs to happen in a synchronous way: - if a write op on the replica fails, then the main db op should not complete - write ops take longer in a master/slave replica setup, but it improves availability - this increases latency and reduces throughput on writes. Replication can be either synchronous or async. Ex DB replica async updates - db across regions in US and India - a US user posts to the db ==> the write is fast initially to the US db, but it still needs to async update the India db. The async update doesn't block anything - the US user sees his post immediately - this works ok if we don't care about replicas not being up to date quickly
Caching Locality of Reference
Load balancing helps you scale horizontally (more servers) - but caching enables you to make better use of the resources you already have - caching takes advantage of locality of reference - recently requested data is likely to be requested again
Anonymous Proxy vs Transparent Proxy
Open proxy - accessible by any internet user - allows users within a network group to store and forward internet services (DNS/web pages) to reduce and control the bandwidth used by the group. 2 types of open proxy: Anonymous - reveals its identity as a server but doesn't disclose the initial IP address. This proxy server can be discovered easily, but it still hides IP addresses. Transparent - reveals its identity and reveals the first IP address via HTTP headers. Usually used to cache websites
DNS Round-Robin
DNS server can return multiple possible web servers in a round-robin fashion - single domain name can get multiple possible IP addresses
Use Case of Redis and Clustered configurations
- a global cache -persistent storage which is like a cache - high availability mode with replicate to a slave, - clustered will shard/partition the data across nodes in the cluster - you can combine clustering with high availability mode
Publish-Subscribe Pattern
Apache Kafka, Google Cloud Pubsub. Pubsub 4 entities: 1) Publishers (servers) - publish data to the topics - publishers publish messages to one or more topics, one or more clients subscribe to the topic 2) Subscribers (clients) - clients subscribe to topics (rather than subscribing to publishers directly). Clients listen for data from topics. - Publishers and subscribers don't know about each other 3) Topic - kind of like channels of specific info (possibly a specific type of data in each) - effectively a persistent streaming/database solution encapsulated in a bidirectional pubsub model - multiple topics represent different entities (possibly different database tables) 4) Messages - represent some form of data relevant to subscribers - user requests should not be exposed directly to pubsub
File System vs Distributed File System
File System - way in which files are named / placed logically and retrieved from hd / ssd. -File systems use compression techniques to store -handles directory structure / naming -Ex: NTFS, FAT32 - windows ==> NTFS - mac ==> HFS - ExFAT - readable by linux,mac,windows (ok performance) Distributed File System - designed to handle data across machines - understand all storage availability in whole cluster. -looks like one big hard disk (dfs manages lookups over multiple machines) Ex ==> Hadoop Distributed File System, Cluster File System
Active vs Passive Redundancy
Passive Redundancy - multiple components at a given layer - if a node dies, nothing bad will happen - the other components in the layer keep running smoothly - although they temporarily might have more load until the broken node gets fixed - twin engine airplane ==> if one engine fails, the other can still fly and land the plane safely. Active Redundancy - only a few of the machines take traffic and we have some idle machines waiting for a failure. When a node fails the other machines will know and take over - this is related to Leader Election - Adding these redundancy measures improves availability
Pubsub Properties
Properties 1) Guaranteed at-least-once delivery (subscribers ACK to topic) - sometimes messages are sent more than once (exactly-once delivery is impossible in a distributed system) 2) Ordering of Messages - messages put on topics will be sent to client subscribers in the same order they were sent - kind of like a queue (ie ==> stock orders sent in order, stock prices sent in order) 3) Replayability (stems from persistent storage) - many pubsub solutions allow rewind/replayability (rewinding to a previous message / snapshot of the topic) 4) Content-based Filtering at subscriber ==> subscriber1 might only want stock prices of FANG companies - but S2 only wants non-tech companies 5) Flexibility/Extensibility - most cloud solutions give you things out of the box - everything autoscales, topics get sharded automatically - and devs are abstracted away from this 6) End to End Encryption - only subscribers know how to read messages
Replication vs Sharding
Replication - mirrored data - usually Master-Slaves setup Pros: -Better Availability since slaves can take over if Master fails (fault-tolerance) - Slaves can handle read operations - Cons: -slower writes since we have to copy the data over Sharded Cluster - each shard in cluster takes care of part of data. Both reads and writes are served by the node where the data resides -Sharding is essentially performing load balancing by routing operations to different db servers Pros : - Much more scalable - since each server not hosting as much data Cons - (without replica) ==> low fault tolerance, low availability - when one shard of the cluster goes down , the data on it is not available -more complexity because now have to manage distributing data b/w shards - additional config Sharding and Replicating (sharded cluster) (usually when we care about availability and scalable performance - we create a shard replica aka replica set in mongo) https://dba.stackexchange.com/questions/52632/difference-between-sharding-and-replication-on-mongodb
NoSQL DB
Unstructured, distributed, and have a dynamic schema. Roughly 4 categories: Key-value Stores - data stored in an array of key-value pairs - key is the attribute name (Redis, Voldemort, Dynamo). Document databases - data is stored in documents - documents are grouped by collections. Each document can have an entirely different structure (Mongo, Couch). Wide-Column Db (columnar db) - column families, which are containers for rows. Don't need to know all columns up front (each row can also have a different number of columns). Best for analyzing large datasets (Cassandra, HBase). Graph Database - used to store data whose relations are represented as a graph. Graph with nodes (entities), properties (info about the entities), and lines (connections between the entities) (Neo4J and InfiniteGraph)
Relational Databases and ACID
- Imposed structure stored in a table form (called relations) -rows in table are instances of the entity - referred to as records -columns represent attributes of the entity -all tables in relational db have defined schema 1) Atomicity - if a transaction consists of multiple operations-these suboperations are collectively a unit - they all fail or they all succeed ie ==> deduct funds from one acct, and add to another account ==> both need to succeed 2)Consistency - no stale state in the db, where one transaction has executed, and another transaction doesn't know about it. Consistency ensures each transaction knows about others. --eventual vs strict/strong consistency 3) Isolation - multiple transactions can occur at the same time, but they'll execute as if they have been done sequentially one-by-one - if we do 2 different transactions at the same time - one of the transactions will hang while the other is processing 4) Durability - when you make a transaction in db - the effects of that transaction are permanent - data is stored on disk
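The atomicity example above (deduct from one account, add to another) can be sketched with Python's built-in `sqlite3` module. The `accounts` table, the `CHECK` constraint, and the `transfer` helper are assumptions made for this illustration. Using the connection as a context manager commits if the block succeeds and rolls back if it raises, so both updates happen or neither does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move funds between accounts as one unit: both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The debit would make the balance negative; the credit is rolled back too.
        return False


def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM accounts"))


print(transfer(conn, "alice", "bob", 150), balances(conn))
print(transfer(conn, "alice", "bob", 40), balances(conn))
```

The first transfer credits bob and then fails on the debit, so the credit is undone as well; the second transfer succeeds and both rows change.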
Rendezvous Hashing (Highest Random Weight)
- for each key/client, calculate its ranking (score) of all the servers and map it to the highest-ranked one - if we retire a server, and the key/client was mapped to this server, then the key/client picks its second-highest-ranked server - with a simple hashing function almost all keys get remapped, but with the rendezvous strategy only the keys/clients mapped to the retired server get rerouted. Comparison - Consistent hashing ==> O(lg n) binary search, sometimes doesn't provide even distribution for small clusters (fixed by virtual replicas) - Rendezvous hashing ==> O(n), simpler to understand and code. use rendezvous if clusters are small,
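A minimal sketch of rendezvous (highest random weight) hashing, assuming SHA-256 over a `"server:key"` string as the scoring function; the server names and the `weight` / `pick_server` helpers are hypothetical. Each lookup scores every server, which is the O(n) cost noted above.

```python
import hashlib


def weight(server, key):
    # Score for a (server, key) pair; any well-mixed hash would do here.
    digest = hashlib.sha256(f"{server}:{key}".encode()).hexdigest()
    return int(digest, 16)


def pick_server(servers, key):
    # O(n): score every server for this key and take the highest.
    return max(servers, key=lambda s: weight(s, key))


servers = ["s1", "s2", "s3", "s4"]
keys = [f"user-{i}" for i in range(1000)]

before = {k: pick_server(servers, k) for k in keys}
remaining = [s for s in servers if s != "s3"]  # retire s3
after = {k: pick_server(remaining, k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
# Only keys that lived on the retired server change owner.
print(all(before[k] == "s3" for k in moved))
```

A key that was not on `s3` keeps its owner because its highest-scoring server is still in the pool; a key that was on `s3` falls through to its second-highest score.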
Load Balancer Selection Algorithm
- only selects from the pool of healthy backend servers. Random redirection - purely random order. Round robin - cycles through the list of servers in a certain order. Weighted Round Robin - places weights (integer based on processing capacity) on servers - servers with higher weights receive more connections. Better for situations where some servers are more powerful. Based on performance/load - actively monitor servers via health checks and route to the server with the least load. IP Hash - hash the IP address of the client and redirect to one of the servers. Will send the same clients to the same server (which maximizes cache hit rate). Path Based - distributes requests based on the path of the request (/payments vs /code path) - helps isolate requests (we can push updates to one path without affecting others). Least Connection Method - route to the server with the fewest connections (useful when persistent client connections are unevenly distributed). Least Response Time Method - directs to the server w/ the fewest connections and lowest response time. Least Bandwidth Method - directs to the server w/ the least amount of traffic in Mbps
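A few of these selection strategies can be sketched in a handful of lines. This is an illustrative toy, assuming the pool passed in already contains only healthy servers; the server names and function names are hypothetical, and the weighted variant simply repeats each server `weight` times rather than interleaving them as some real load balancers do.

```python
import hashlib
import itertools


def round_robin(pool):
    # Endless iterator cycling through the servers in order.
    return itertools.cycle(pool)


def weighted_round_robin(weights):
    # Each server appears `weight` times per cycle, so heavier servers get more requests.
    expanded = [server for server, w in weights.items() for _ in range(w)]
    return itertools.cycle(expanded)


def least_connections(active_connections):
    # Pick the server that currently has the fewest open connections.
    return min(active_connections, key=active_connections.get)


def ip_hash(client_ip, pool):
    # The same client IP always maps to the same server while the pool is unchanged.
    digest = hashlib.sha256(client_ip.encode()).hexdigest()
    return pool[int(digest, 16) % len(pool)]


pool = ["app-1", "app-2", "app-3"]
rr = round_robin(pool)
print([next(rr) for _ in range(4)])
print(least_connections({"app-1": 5, "app-2": 2, "app-3": 9}))
print(ip_hash("203.0.113.7", pool) == ip_hash("203.0.113.7", pool))
```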
Cache Eviction Policies
- while cache invalidation is the strategy for handling cache writes when the cache has space left, cache eviction is handling what happens when our cache is full or the data is stale/outdated
1. First In First Out (FIFO): The cache evicts the first block accessed first without any regard to how often or how many times it was accessed before.
2. Last In First Out (LIFO): The cache evicts the block accessed most recently first without any regard to how often or how many times it was accessed before.
3. Least Recently Used (LRU): Discards the least recently used items first. <== Best default usage
4. Most Recently Used (MRU): Discards, in contrast to LRU, the most recently used items first.
5. Least Frequently Used (LFU): Counts how often an item is needed. Those that are used least often are discarded first.
6. Random Replacement (RR): Randomly selects a candidate item and discards it to make space when necessary.
- a bad eviction policy leads to thrashing
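Since LRU is called out as the usual default, here is a minimal sketch of it built on `collections.OrderedDict`. The `LRUCache` class and its `get` / `put` methods are names chosen for this example; the ordered dict keeps the least recently used entry at the front so it can be evicted when capacity is exceeded.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # oldest (least recently used) entry first

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" is now more recently used than "b"
cache.put("c", 3)  # cache is full, so "b" is evicted
print(list(cache.items))
```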
When to use Sharding/Partitioning
- because of the complexity, sharding should be a last resort. Common Scenarios - app data grows to exceed the storage capacity of a single db node - volume of reads/writes surpasses what a single node or its read replicas can handle (resulting in high latency or timeouts) - bandwidth needed by requests exceeds the bandwidth available to a single db node and its read replicas - giving timeouts. Exhaust all other options first - move the db to its own server - implement caching - if read performance is an issue, use one or more read replicas - slaves can take read requests - vertical scaling
Sharding / Partitioning Method (Horizontal, Vertical, Directory Based)
-goal is to evenly distribute data Horizontal - different rows inserted into different tables based on their value (all zip codes greater than 10000) Cons ==> can lead to hotspots if unbalanced split Vertical - partitioned by entity or feature ==> user profile on one server, friends list on another, photos on another. Normalizing is similar to vertical partitioning. Pros ==> -easy and low impact on app, - can isolate infrequently used columns on separate server, Cons ==> -new features/add'l growth can cause problems - might need to adjust existing setup -querying across vertical splits requires joins Directory Based - a lookup service that sits on top of existing partitioning method. Basically maintain your own lookup table Pros ==> -loose coupling - we can easily add to the db pool or change partition schema (can't as easily add servers with naive modulo # servers partition) Cons ==> Single point of failure
Row vs Column DB
- traditional row sql dbs store data in blocks on HDD (commodity servers) - but rows don't fit properly in a block and take up more space - rows are written sequentially, and when we want to access 1 column/attribute value in a row, we have to pull the whole row - getting analytical data (time series where we want just 1 column) is not good with a row db - the head has to seek to various locations so lookup is slower - column dbs store each column sequentially - much more space efficient because a column is stored in whole blocks - column dbs are partitioned/sharded by column (each column on one machine) - much easier to pull analytical workloads / time series type data - because the data type is the same within a column, we can compress the data since it's a homogeneous type (in a row we have heterogeneous data types) (3x more compression) - writes might take longer for columnar vs row - but columnar is usually for reads - business requirements ==> if more column reads, go with columnar - if the db has a transactional workload ==> specifically locating a full row/record ==> go with row
Latency of various reads: memory, ssd, hdd, network, intn'l round trip
1 MB Memory ==> 250 us
1 MB SSD ==> 1,000 us
1 MB from 1 Gbps network ==> 10,000 us
1 MB HDD ==> 20,000 us
Packet from CA to Netherlands back to CA ==> 150,000 us
- in memory is fastest
- sometimes local network or fast internet is faster than an HDD read
- ssd is faster than hdd
- sending a packet over an international network is very slow
Leader Election vs Idempotent API vs Optimistic Locking vs Workflow engine
1) Idempotent API - sometimes easier to understand - the API can handle multiple retry requests without causing problems in the system 2) Optimistic Locking - read the record's version number, then when you go to update the record, if it's not the same version number, abort the operation - this allows us to do multiple retries without causing problems (pessimistic locking actually locks the record - but can cause deadlocks) 3) Workflow Engines - a leader manages the workflow in a system (Apache Airflow, .Net State machine, AWS Step Functions)
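The optimistic locking flow (read the version, write only if the version is unchanged, otherwise abort and retry) can be sketched as below. `Store`, `VersionConflict`, and `update_with_retry` are hypothetical names; the sketch is single-threaded, whereas a real database would perform the compare-and-write atomically, for example with a conditional update on the version column.

```python
class VersionConflict(Exception):
    """Raised when the record changed between our read and our write."""


class Store:
    def __init__(self):
        self.records = {}  # key -> (version, value)

    def read(self, key):
        return self.records.get(key, (0, None))

    def write(self, key, value, expected_version):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(key)  # someone else updated the record first
        self.records[key] = (current_version + 1, value)


def update_with_retry(store, key, fn, attempts=3):
    # Re-read and re-apply the change whenever the version check fails.
    for _ in range(attempts):
        version, value = store.read(key)
        try:
            store.write(key, fn(value), version)
            return True
        except VersionConflict:
            continue
    return False


store = Store()
store.write("stock", 10, expected_version=0)
update_with_retry(store, "stock", lambda qty: qty - 1)
print(store.read("stock"))
```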
Partition/Sharding Problems
1) Joins and Denormalization - joins on a single-server db are easy, but on a sharded db they are often not feasible. Workaround - denormalize the db so that queries are performed on a single table. However we now deal with the perils of denormalization such as data inconsistency 2) Referential Integrity - cross-shard queries on a partitioned db are not feasible - also trying to enforce data integrity constraints (foreign keys) in a sharded db is complicated. Workaround - can enforce referential integrity in application code 3) Rebalancing - sometimes we want to change the sharding scheme if a) non-uniform distribution / hotspots or b) a lot of load on one shard. Workaround - a) create more shards b) rebalance existing shards. These changes usually require downtime. Directory partitioning's loose coupling makes rebalancing easier - but at the cost of increased complexity and a single point of failure (deferring sharding to application logic) 4) Not well supported by SQL dbs - usually a roll-your-own approach
Partitioning Criteria
1) Key/Hash Based - hash(1 or more attributes) ==> partition # - hash(pk int) ==> partition # (0 to 99) Pros ==> - an industry-grade hash fn ensures uniform distribution Cons ==> - fixes the total # of db servers unless we want to change the hash fn - which would require redistribution of data (workaround is using consistent hashing) 2) List - each partition is assigned a list of values ==> ie all users living in Scandinavia - list of Norway, Sweden, Finland, etc. 3) Round Robin - ensures uniform data distribution. With n partitions, the i-th tuple is assigned to partition (i mod n) 4) Range Partitioning (as opposed to horizontal partition that splits - this chooses an actual range - like zip code 700000 to 799999) 5) Composite Partitioning - combine any of the above schemes to create a new one (ie apply list partition, then hash partition) - consistent hashing could be considered a composite of hash and list ==> hash reduces the key space to a size of the number of lists
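Each of the basic criteria above reduces to a small function from a record to a partition number. The sketch below is illustrative only: the function names, the SHA-256 choice, and the example boundaries are assumptions, not part of any particular database.

```python
import bisect
import hashlib


def hash_partition(key, n):
    # Key/hash based: hash the key, then reduce it to one of n partitions.
    digest = hashlib.sha256(str(key).encode()).hexdigest()
    return int(digest, 16) % n


def list_partition(value, lists):
    # List based: each partition owns an explicit set of values.
    for partition, members in lists.items():
        if value in members:
            return partition
    raise KeyError(value)


def round_robin_partition(i, n):
    # Round robin: the i-th tuple goes to partition i mod n.
    return i % n


def range_partition(value, upper_bounds):
    # Range based: upper_bounds is a sorted list of exclusive upper limits.
    return bisect.bisect_right(upper_bounds, value)


regions = {0: {"norway", "sweden", "finland"}, 1: {"spain", "italy"}}
print(hash_partition(12345, 100))
print(list_partition("sweden", regions))
print(round_robin_partition(7, 3))
print(range_partition(150000, [100000, 200000]))
```

A composite scheme would chain two of these, for example `list_partition` to choose a region and then `hash_partition` within it.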
Step by Step Guide for System Design
1) Requirement Clarification ==> always ask clarifying questions to understand the exact scope (only have 30-40 mins)
   - Will users of our service be able to post tweets and follow other people?
   - Should we also design to create and display the user's timeline?
   - Will tweets contain photos and videos?
2) Define what APIs are expected from the system - helps establish the exact contract - ensures we have the correct requirements
   - postTweet(user_id, tweet_data, tweet_location, user_location, timestamp, ...)
3) Back of the envelope estimation - estimate the scale of the system
   - # of tweets, how many timeline generations per sec, etc.
   - How much storage would we need? This will vary based on whether users are allowed to store photos/videos
   - What network bandwidth usage are we expecting?
4) Define a data model - helps clarify the system and guides data partitioning/mgmt. Helps identify the various entities of the system (storage, transportation, encryption) - what kind of db - NoSQL (Cassandra, Mongo) or MySQL (what kind of block storage)
   - User: UserID, Name, Email, DoB, CreationDate, LastLogin, etc.
   - Tweet: TweetID, Content, TweetLocation, NumberOfLikes, TimeStamp, etc.
   - UserFollows: UserID1, UserID2
   - FavoriteTweets: UserID, TweetID, TimeStamp
5) High-Level Design - draw a block diagram with 5-6 boxes representing the core components of the system
   - For Twitter we need multiple application servers to serve read/write requests w/ load balancers in front of them for traffic distribution
   - Need distributed file storage for photos/videos
6) Detailed Design - dig deeper into components (this is where the interviewer can help guide you towards areas they want you to explore further) - should discuss pros and cons / tradeoffs of choices
   - Since we'll be storing a huge amount of data, how should we partition our data to distribute it to multiple databases? Should we try to store all the data of a user on the same database? What issues can it cause?
7) Identify Bottlenecks - discuss as many bottlenecks as possible and different approaches to mitigate them
   - single point of failure? Do we have enough replicas so that if we lose a few servers we're still available?
Availability
1) how resilient a system is to failures (the system's fault tolerance) 2) percentage of time in a given period (usually a year) that the system is at least operational enough to satisfy its primary functions - often use the five 9's as a standard: 99.999% available - availability becomes important for systems that have to remain online - hospital or airplane software - Some parts of a system have to be highly available while others don't - Ex ==> Stripe - the transaction processing should be highly available while the Stripe dashboard does not need to be as highly available (not as critical) Tradeoffs ==> high availability might come at the cost of higher latency and lower throughput - so think about which parts need high availability
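To make the "nines" concrete, the allowed downtime per year follows directly from the availability percentage. A short worked calculation, assuming a 365-day year (the helper name is hypothetical):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_minutes_per_year(availability_percent):
    # Fraction of the year the system is allowed to be down, in minutes.
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}% available ==> {downtime_minutes_per_year(nines):.2f} minutes down per year")
```

Five 9's works out to roughly 5.26 minutes of downtime per year, while two 9's (99%) allows about 3.65 days.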
Client-Server Model - what happens when we go to google.com
1) Initial typing - type the first letter ==> show autocomplete list, some browsers even search an index or send to the default search engine
2) URL parsing - is it a search term (we go to google with the term) or is it a valid URL
3) Protocol - is it http (port 80, unencrypted) or https (port 443, encrypted) - usually defaults to encrypted
4) DNS - first check the browser dns cache, then check the hosts cache, then check DoH or DNS over TLS, then do a dns query
   - DNS query to find the IP of google ==> send an IP packet to the router, the router rewrites the source address (NAT) and sends it to the ISP DNS; if the ISP DNS doesn't have the answer it finds it and returns it - now the client has the google IP
5) TCP Connection - 3-way handshake - now we have full duplex comms
6) TLS 1.3 - transport layer security (newer version of SSL) - do a Diffie-Hellman key exchange - the result is a shared symmetric key on client and server - so we can send each other secure messages now. ALPN negotiates which protocol to use (http, http/2, http/3)
7) GET HTTP request - add headers (no body for GET) - compress/encrypt the data then send. If http, need to set up a TCP connection for each additional asset; if HTTP/2, can use the same connection; http/2 w/ push can predict all the assets the user needs and push them so there is only one round trip
8) HTML Parsing - browser builds the DOM tree as the packets arrive on the wire - builds the file top to bottom - if it finds a js script, unless defined as async, it will synchronously block the DOM to get the script
CAP Theorem Examples
2 machines/servers/db (master and slave) in cluster w/ connection b/w them Available and Partition Tolerant: -Available if one machine goes down, the other is available -Partition Tolerant - even though connection b/w machines has gone bad - system is still available -No consistency here though. This is an eventual consistency system (eventual consistency does not work in a banking or secure place - but it could work for instagram)
Long-Polling , WebSockets, Server-Sent Events
Ajax Polling - just normal interval polling. HTTP Long-Polling (Hanging GET) - allows the server to push info to the client whenever data is available - server may not respond immediately - if the server doesn't have data, it holds the request and waits until some data is available (or until connection timeout) - once data is available, the full response is sent. Then the client immediately re-requests info from the server so that the server always has an available waiting request it can use. WebSockets (streaming) - full bidirectional/duplex communication over a single TCP connection. Established via the WebSocket handshake. - Low overhead for real-time data - A standardized way for the server to send data to the browser without being asked. Cons ==> not automatically multiplexed over HTTP/2. Also frame based, not stream based. Server-Sent Events - client establishes a persistent/long-term connection with the server. Only the server can send data (if the client wants to send data it needs to use another connection/protocol) - SSE is best when we need real-time traffic from server to client or if the server will be sending multiple events to the client - unidirectional pubsub type model - part of the HTML5 standard (polyfills available for browsers) - natural fit with HTTP/2 - multiplexing out of the box https://medium.com/system-design-blog/long-polling-vs-websockets-vs-server-sent-events-c43ba96df7c1
Content Distribution Network / Content Delivery Network (CDN)
CDNs are a type of cache used for serving large amounts of static media. - user requests asset ==> goes to CDN ==> CDN serves it if it has it locally; if the CDN doesn't have it locally it will query the back-end, cache it locally, and return it to the user - large systems usually have their own CDN - if the system isn't big enough for one yet, we can keep the option open by serving static media from a separate subdomain such as static.yourservice.com - using a lightweight HTTP server like Nginx until we move to a CDN later - any content that is not personalized or confidential, and where we can expect multiple people to access the same files (images, videos, pdf etc..), is a good candidate for a CDN. (personal data would be stored within the system)
System Design Interview vs Coding/Algorithm Interview
Coding interviews ==> there's an objective optimal answer (you can still tackle coding problems w/o data structures) System Design - requires design fundamentals and it's more subjective - SD questions are intentionally vague - it's the candidate's job to propose a solution and defend that position (eliminate the interviewer's doubt) SD composed of: 1) foundational knowledge 2) key characteristics/tradeoffs ==> availability, latency, throughput, consistency, redundancy 3) Actual components - load balancer, proxy, caches, rate limiting, leader election 4) Actual tech - real/existing products/services (amazon s3, google cloud storage, redis, nginx)
Consistent Hashing
Consistent hashing is an effective strategy for distributed caching (similar to a Distributed Hash Table) -distributes data across a cluster in a way that minimizes reorganization when nodes are added or removed. Easier to scale the system up or down. - can be used to determine which cache server a key is stored on, - or which server a client gets routed to - hash the key/client to a single integer and place it on a circle ==> go clockwise until you reach the integer of a server - to account for more powerful servers (or vertically scaled servers), we can hash the server more than once and place two (or more) points for the same server/cache on the circle so more requests get routed to it - in consistent hashing, when adding a server - only about k/n keys need to be remapped on average - (k = total # of keys/clients, n = total number of servers). Whereas with a simple mod hash function - nearly all keys need to be remapped
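The ring described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation: the HashRing class, the server names (cache-a ...), the choice of MD5 for ring positions, and 100 points per server are all hypothetical choices made for the example.

```python
import bisect
import hashlib


def ring_position(label: str) -> int:
    """Hash a string to an integer position on the circle (MD5 is an arbitrary choice)."""
    return int(hashlib.md5(label.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, servers, points_per_server=100):
        # Each server is placed on the circle several times; giving a stronger
        # server more points would route more keys to it.
        self._ring = sorted(
            (ring_position(f"{server}#{i}"), server)
            for server in servers
            for i in range(points_per_server)
        )
        self._positions = [pos for pos, _ in self._ring]

    def server_for(self, key: str) -> str:
        # Go clockwise: first server point at or after the key's position,
        # wrapping around to the start of the circle.
        idx = bisect.bisect_left(self._positions, ring_position(key))
        return self._ring[idx % len(self._ring)][1]


keys = [f"user-{i}" for i in range(10_000)]
before = HashRing(["cache-a", "cache-b", "cache-c", "cache-d"])
after = HashRing(["cache-a", "cache-b", "cache-c", "cache-d", "cache-e"])

moved = [k for k in keys if before.server_for(k) != after.server_for(k)]
print(f"keys remapped after adding a 5th server: {len(moved) / len(keys):.1%}")
```

Every key that moves lands on the newly added server, and the moved fraction stays in the neighborhood of 1/5 rather than the large majority that a plain modulo scheme would reshuffle.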
API Design vs System Design
Design Twitter or Twitter API, Design Stripe or Stripe API Initial Questions/Requirements Sys Design ==> interviewee asks clarifying questions - what we're designing, what parts - what functionality do we support - what regions we support API Design - still have to ask clarifying questions - just the API that supports the homepage? The trending tab? Who will consume our API Drawing out System/API Sys Design ==> Draw out diagrams, system components, what the SQL table looks like API Design ==> Write an outline of the API - various entities For Twitter ==> tweet entity - For Stripe ==> Charge or customer entity -API Endpoints -Parameters of API Endpoints and their types -Response types -Usually want CRUD types -Don't have to write logic, just outlining them -There's no objective right or wrong answer with either sys or API design - two interviewees' answers might be totally different but both valid
Distributed vs Global Cache
Distributed Cache - each node owns part of the cached data (divided up using consistent hashing). The server processing a request knows where to look (it accesses other servers based on the consistent hashing to grab the correct cache data) Pros: increased cache space via adding nodes Cons: remedying a missing node can be problematic. (Possibly fixed by storing copies of data - but logic gets complicated). If data is truly lost we can just pull from the db Global Cache - all nodes use the same single cache space Pros: good for a fixed dataset Cons: can overwhelm a single cache 2 types of global cache: 1) Global - cache responsible ==> on cache miss - the cache itself becomes responsible for retrieving from the db (more popular type) - and for eviction 2) Global - server/nodes responsible ==> on cache miss, the server/request node fetches from the db and optionally updates the cache - better when the application logic understands the eviction strategy and can respond better
Efficiency, Latency, Throughput
Efficiency of a system is measured by: 1) latency - delay to obtain an item - or how long it takes for data to traverse a system - to get from one point in a system to another point. Different parts of a system have different latency - and will often involve tradeoffs (like making the site more available) - IE ==> network request latency - to go from client to server, then server back to client - or a server reading data from disk 2) throughput - amount of work a component/machine can perform in a given amount of time - IE ==> how much data can be transferred over the network - sometimes measured in Gigabits per second - if we have multiple requests coming into a server - how many of these requests can the server handle in a given time (1 second) - how many bits does the server let through in that amount of time. (server could be a bottleneck) - to increase/optimize throughput ==> simply pay the cloud provider more to increase it - increasing throughput doesn't necessarily fix all problems - for example if we have just 1 server and we increase its throughput, we still have that single point of failure, it's better to add servers -Latency and Throughput are related measures of performance - but not necessarily correlated - some parts of a system have good (low) latency (fast data transfers) while other parts of the system have poor (low) throughput
Peer-to-Peer Networks
Ex ==> we want to deploy a 5GB file from one machine to thousands of machines multiple times a day Peer-to-Peer Networks - split up the 5GB file into very small chunks/pieces, send the chunks to peers, and let them communicate with each other to collect the remaining chunks Pros ==> parallelizes the 5GB transfer - much faster than having one or multiple dedicated serving machines Peer Discovery / Peer Selection 1) Use a central tracker db that can orchestrate 2) Gossip or Epidemic Protocol - peers talk amongst themselves ("I just talked to that server, it has chunks 4, 10, and 15 etc..."). Peers gossip/share the mappings. Industry example ==> Kraken ==> distribute 20K 1GB blobs in under 30 seconds
Forward / Reverse Proxies
Forward Proxy - server that sits between a client (or set of clients) and another server (or set of servers) - acts on behalf of the client - a vpn is a forward proxy - Pros: anonymity, caching, blocking unwanted sites, geofencing Reverse Proxy - good for load balancing, caching, isolating internal traffic, filtering out bad requests, logging, canary deployment (test a new feature on one particular server), protecting against ddos attack/heavy load - most useful case of a reverse proxy ==> Load Balancer to distribute request load b/w servers - a DNS query would likely return the IP of the reverse proxy - not the backend server Proxy Hiding the Source - with a forward proxy, the server doesn't know which specific client it's talking to (the forward proxy hides the client IP address) - with a reverse proxy, the client doesn't know which specific server it's connecting to (the reverse proxy hides the server IPs)
HTTP Versions - pipelining/multiplexing
HTTP/1.1 -in HTTP/1.1 you can only send one request at a time on a given TCP connection - have to wait for the response before sending another request on the same TCP connection (bottleneck) - or you can open separate TCP connections HTTP/1.1 Pipelining - a speedup modification that allowed multiple requests on the same TCP connection without waiting for the response - requires responses to come back in the same order as the requests, which causes head-of-line blocking - if an early response is slow, the ones behind it have to wait HTTP/2 - do as much as possible with as few round trips as possible. Get all resources over one TCP connection Server Push - server knows you need additional assets. Rather than just sending index.html, it pushes the other resources on the same TCP connection
HTTPS and TLS Handshake
HTTPS runs on top of TLS (Transport Layer Security - formerly known as SSL) TLS Handshake 1) Client sends hello (random string of bytes, TLS versions, cipher suites supported) 2) Server responds w/ hello and the SSL certificate (SSL cert contains the public key) 3) Client verifies the SSL cert (against the CA) and encrypts a Premaster Secret using the public key and sends it back to the server 4) Because the server has the private key, it can decrypt the Premaster Secret (server is the only one who can decrypt it) 5) Both client and server now have: client hello data, server hello data, unencrypted premaster secret - using these three things they each generate a Session Key -Session Key ==> only used during this session - afterwards they throw it away 6) Client sends a finished message that is encrypted w/ the session key 7) Server sends a finished message encrypted with the session key - the above is the RSA key exchange used by TLS 1.2 and earlier - TLS 1.3 removed it and always uses an ephemeral Diffie-Hellman key exchange (slightly more costly) -Diffie-Hellman with ephemeral keys gives perfect forward secrecy - if the server is hacked and its private key obtained, the attacker can only read future messages, not past recorded ones (RSA key exchange would expose past and future messages). But if your server is hacked you will probably lose the credit card #'s in the db anyway, so PFS only helps so much -each new session has to go through the TLS handshake to generate new session keys (unless session resumption is used)
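The Diffie-Hellman exchange mentioned above can be illustrated with plain modular arithmetic. This is a toy sketch only: the modulus P (the Mersenne prime 2^127 - 1), the base G = 3, and the make_keypair helper are assumptions picked for readability, whereas real TLS uses standardized groups or elliptic curves plus authentication of the server's share.

```python
import hashlib
import secrets

# Toy public parameters, known to everyone (including an eavesdropper).
P = 2**127 - 1  # a Mersenne prime; far too small/structured for real use
G = 3


def make_keypair():
    """Pick a random private exponent and derive the public value G^private mod P."""
    private = secrets.randbelow(P - 3) + 2  # in the range 2..P-2
    public = pow(G, private, P)
    return private, public


client_private, client_public = make_keypair()
server_private, server_public = make_keypair()

# Only the public values cross the network; each side combines the other's
# public value with its own private exponent.
client_secret = pow(server_public, client_private, P)
server_secret = pow(client_public, server_private, P)

# Both sides derive the same symmetric session key from the shared secret.
client_session_key = hashlib.sha256(client_secret.to_bytes(16, "big")).digest()
server_session_key = hashlib.sha256(server_secret.to_bytes(16, "big")).digest()
print("session keys match:", client_session_key == server_session_key)
```

Because fresh private exponents are generated per session and then discarded, a later compromise of the server's long-term key does not reveal earlier session keys, which is the forward secrecy property noted above.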
Hashing
Hashing simply transforms an arbitrary piece of data into a fixed-size value - typically an integer. In sys design this data might be an IP address, user name, HTTP request (anything that can be hashed) In practice you never write your own hashing function - you usually use an industry-grade hashing fn (MD5, SHA1, bcrypt) that gives a close-to-uniform distribution Problems with Load Balancing Selection Algos - if we have a computationally expensive operation on the server, and we use Round Robin LB selection, - the same type of request or same client will be routed to different servers, and the result of the expensive operation will not be in that server's cache (cache miss) - in a simple hashing scenario - we hash the client name (modulo # of servers) - the client is always routed to the same server - so this fixes the cache misses described above - However this is a problem once we start adding/removing servers (horizontal scaling) - changing the # of servers messes up almost all the mappings (and we can't just have downtime to update the cache mappings) - this is where consistent hashing and rendezvous hashing are better
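A quick way to see why the simple "hash modulo # of servers" scheme breaks under scaling is to count how many clients change servers when one server is added. The sketch below is illustrative: the server_index function, the client names, and the use of SHA-1 are assumptions for the example.

```python
import hashlib


def server_index(client: str, num_servers: int) -> int:
    """Route a client by hashing its name and taking it modulo the server count."""
    digest = hashlib.sha1(client.encode()).hexdigest()
    return int(digest, 16) % num_servers


clients = [f"client-{i}" for i in range(10_000)]

# The same client always maps to the same server while the count is fixed...
assert server_index("client-42", 4) == server_index("client-42", 4)

# ...but growing from 4 to 5 servers reroutes most clients, so their cached
# results on the old servers are lost.
moved = sum(server_index(c, 4) != server_index(c, 5) for c in clients)
print(f"clients rerouted going from 4 to 5 servers: {moved / len(clients):.1%}")
```

A client keeps its server only when its hash gives the same remainder mod 4 and mod 5, which for uniform hashes happens about one time in five.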
HTTP
HyperText Transfer Protocol - higher level abstraction (over TCP and IP) that uses a request-response model methods: GET/PUT/POST/DELETE path - server might have multiple paths headers - collection of metadata key-value pairs HTTP response error status codes - 401 Unauthorized - server received an unauthenticated request. Possibly invalid credentials or the login URL has changed (may also need to clear cache) 404 Not Found - the page the user is looking for cannot be found - often fixed by setting up a redirect 500 Internal Server Error - generic error returned when something is wrong with your server - 502 Bad Gateway - an invalid response somewhere along the way - usually fixed 403 Forbidden - server understood the request but refuses to authorize it https://en.wikipedia.org/wiki/List_of_HTTP_status_codes Idempotent - repeating the same request has the same effect as making it once - of the methods above, the only non-idempotent one ==> POST
IP & TCP
Internet Protocol - the modern internet runs on the Internet Protocol. IP alone can't guarantee packets are received - data is sent in IP packets (fundamental unit of data) -header holds source IP, dest IP, total size of packet, IP version - packet size limited to 2^16 bytes ==> ~65k bytes ==> ~.065 MB TCP - Transmission Control Protocol - more powerful wrapper around IP - connection established via a 3-way handshake -sends packets in an ordered, reliable, error-free way - guaranteeing packets arrive (if error, resend) -TCP doesn't give a SWE a robust framework to build on directly, that's where HTTP becomes more valuable
Leader Election
Leader election is the simple idea of giving one thing (a process, host, thread, object, or human) in a distributed system some special powers. Those special powers could include the ability to assign work, the ability to modify a piece of data, or even the responsibility of handling all requests in the system. We assign one node as leader so we're only doing a certain operation once and not multiple times. -sometimes follower nodes can handle read requests, but only the leader should handle write requests -DynamoDB, Zookeeper, etcd Pros ==> -easier for humans to think about, - high availability and consistency with things like etcd/Raft consensus algo Cons ==> - a single leader is a single point of failure - leader is a single point of trust - if the leader is doing the wrong work w/ nobody checking, it can cause problems
Logging and Monitoring
Logging - some type of component that collects logs that you later go through to debug issues. Service collects logs and stores them in a db (IE - StackDriver). -allows you to debug systems at scale Monitoring - gather meaningful metrics, and have tools to monitor those metrics. You want visibility into the system's health, performance, general status. -Sometimes done through a time series db (InfluxDB, Graphite, Prometheus) -similar to logging, monitoring is a tool or thing that makes managing a system easier/better. Monitoring - Scraping Logs You can use some type of service that scrapes your logs to generate metrics Con ==> -Problem is if you ever decide to change your logging, you risk breaking monitoring Alerting - With good monitoring - you can set up alerts for when there's an elevated error rate (can even send to your slack channel) -Good alerting depends on good monitoring
Distributed File System Internals
Name Node - acts like master - knows where each file is in the data nodes and which nodes/blocks are free -does health checks on data nodes (what the size of each data node is, and if it's failing) -name node proactively moves data when a machine fails -If block size is set to 64MB ==> all files we upload are broken down into 64MB chunks -Name Node can be a single point of failure -Data Nodes have Rack Awareness - they usually exist in different racks (sometimes different data centers too) to ensure good availability and fault tolerance -replicas shouldn't be placed in the same rack
SQL vs NoSQL
SQL Pros ==> -relational for easy structuring and querying -well defined/structured data prevents errors in the data model -ACID protects integrity and reduces anomalies Cons ==> -Structured data takes more time to set up -Usually have to vertically scale which is more expensive -Hard to scale for write-heavy systems -Master ==> slave read replica is easy for read-only systems. -By increasing read replicas, we have more availability, but in turn we sacrifice consistency (data updates are asynchronous) - higher chance of accessing stale data (CAP Theorem) NoSQL Pros ==> -More flexible unstructured on-the-fly data -Much better for horizontal scaling for both read- and write-heavy systems Cons ==> -eventual consistency - since NoSQL is usually distributed and used for write-heavy systems - we lose strong consistency. After a write to a shard in a distributed NoSQL db - there will be a small delay before the update is propagated to replicas - this could mean we get stale data (more of a distributed-system problem than a NoSQL issue) -No ACID ==> NoSQL is more scalable and faster at the cost of losing ACID - more data integrity and anomaly issues
SLA and SLO
Service level agreements - not just an implied guarantee of availability - an explicit guarantee -often in cloud services, if they don't meet five 9's of availability they will credit the customer's bill An SLA is made up of SLOs (one SLO is percentage uptime, another SLO is a guarantee of at most X errors)
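To make "five 9's" concrete, the allowed downtime for an uptime SLO is just (1 - availability) times the period. A small worked calculation, assuming a 365-day year (the function name allowed_downtime_seconds is made up for the example):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Downtime budget per year implied by an uptime percentage."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


for label, pct in [("two 9's", 99.0), ("three 9's", 99.9),
                   ("four 9's", 99.99), ("five 9's", 99.999)]:
    minutes = allowed_downtime_seconds(pct) / 60
    print(f"{label} ({pct}%): {minutes:,.1f} minutes of downtime per year")
```

Five 9's works out to roughly 5.3 minutes per year, which is why each extra 9 is disproportionately expensive to guarantee.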
Sharding / Partitioning
Sharding/Partitioning is an alternative to replication that is better for large scale systems (large quantities of data). Sharding splits the data up - better because it's not great to replicate huge amounts of data -after a certain point vertical scaling maxes out, and we have to horizontally scale by sharding Sharding and Partitioning ==> both break large data into smaller subsets Sharding ==> subsets spread across multiple computers Partitioning ==> grouping subsets of data w/in a single db instance
Static vs Dynamic Configuration
Static - config bundled/packaged w/ app code. Have to deploy the whole code base to make a change to config (slow and safe) Pros ==> breaking config changes are usually caught in code deploy/review - therefore it's safer Cons ==> slower because we have to deploy the entire app Dynamic - Completely separate from app code. Pros ==> config changes take effect immediately. More power and flexibility (we could even build a UI on top of the config and immediately get changes) Cons ==> More complex - backed by a db that your system queries -changes don't go through any tests - more risky Workaround -possibly need a complex deployment system/tools around config changes. Any dynamic config changes go through a review process and have access control (only certain people can make them). -Or changes are only allowed to deploy every 4 hours
Symmetric vs Asymmetric Encryption
Symmetric Encryption (fast but not as safe) - encryption that relies on symmetric algorithms - relying on a single key to both encrypt and decrypt data ex ==> AES, DES, 3DES, IDEA, RC4 Pros ==> Much faster than asymmetric, easy to set up Cons ==> key has to be shared b/w 2 parties safely (MITM can steal it - or potentially alter it and give to client) - with PEM Asymmetric Encryption (slow but safer) - requires two keys to work - a public key that's made public, and a private key used to decrypt the data. - if you encrypt a message w/ the public key, you can only decrypt it w/ the private key. Only the server has the private key Pros ==> doesn't force sharing of secret keys (supports digital signing, which authenticates data or the key so we know it wasn't tampered with) Cons ==> slower and requires more effort
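The one-key vs two-key difference can be shown with two deliberately tiny toys. Neither is secure or representative of AES or real RSA: the XOR cipher stands in for "one shared key does both directions", and the RSA numbers (p = 61, q = 53, e = 17) are the usual small textbook values chosen purely for illustration.

```python
# --- Symmetric (toy): the same key encrypts and decrypts ---
def xor_cipher(data: bytes, key: bytes) -> bytes:
    """XOR each byte with a repeating key; applying it twice restores the data."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


shared_key = b"k3y"
ciphertext = xor_cipher(b"hello", shared_key)
assert xor_cipher(ciphertext, shared_key) == b"hello"

# --- Asymmetric (toy RSA): public key encrypts, private key decrypts ---
p, q = 61, 53
n = p * q                    # modulus, part of both keys
phi = (p - 1) * (q - 1)
e = 17                       # public exponent
d = pow(e, -1, phi)          # private exponent (modular inverse, Python 3.8+)

public_key = (e, n)          # can be handed to anyone
private_key = (d, n)         # stays on the server


def rsa_encrypt(message: int, key) -> int:
    exponent, modulus = key
    return pow(message, exponent, modulus)


def rsa_decrypt(cipher: int, key) -> int:
    exponent, modulus = key
    return pow(cipher, exponent, modulus)


message = 65                 # must be smaller than n in this toy
cipher = rsa_encrypt(message, public_key)
print("decrypted:", rsa_decrypt(cipher, private_key))
```

In the symmetric half both parties must already hold shared_key, which is exactly the distribution problem noted above; in the asymmetric half only public_key ever needs to be sent.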
Polling and Streaming
When a client needs to get data updated frequently we can use polling or streaming: Polling - client issues a request on a set interval (every X seconds) Cons ==> -Won't get a "live" feeling, chat app will be delayed - If you increase polling frequency you put a huge load on the server (as the system scales this is bad) -sometimes the responses are empty - just taking up resources on the client Streaming - client opens a long-lived connection (like a socket) that stays open so long as the network is healthy and neither machine closes it. Client is "listening" and the server is the active machine pushing
Configuration
a set of parameters or constants that your system/app uses. Instead of having them deeply intertwined in code - you write/configure them in an isolated file
database index
an auxiliary/additional data structure for fast searching on (one or more) columns/attributes of a table. Stores column values in sorted order, each pointing to the relevant row Pros: -allows us to do an O(log n) binary search instead of a sometimes O(n) linear table scan Cons: takes up more space, also when you write/update the db - you also have to update the index Analogy - a DB index is like a table of contents - ordered in a way that makes it easy to search - points to where the main data is stored - fast read operations at the cost of slower write operations - lots of different types ==> bitmap index, reverse index, dense index clustered index ==> contains the base table data itself (usually on the pk) non-clustered index ==> independent of the base data - points to rows or to the clustered index
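The read/write tradeoff above can be sketched with a sorted list standing in for the index. This is a simplification (real databases typically use B-trees, not a flat sorted list), and the table contents, age_index, find_by_age, and insert_row are hypothetical names for the example.

```python
import bisect

# The "table": rows stored in insertion order.
rows = [
    {"name": "dana", "age": 41},
    {"name": "eli", "age": 29},
    {"name": "farah", "age": 35},
    {"name": "gus", "age": 29},
]

# The "index" on age: (age, row position) pairs kept in sorted order.
age_index = sorted((row["age"], pos) for pos, row in enumerate(rows))
age_keys = [age for age, _ in age_index]


def find_by_age(age):
    """Binary search the index (O(log n)) instead of scanning every row (O(n))."""
    lo = bisect.bisect_left(age_keys, age)
    hi = bisect.bisect_right(age_keys, age)
    return [rows[pos] for _, pos in age_index[lo:hi]]


def insert_row(row):
    """Writes pay extra: the index must be updated along with the table."""
    rows.append(row)
    pos = len(rows) - 1
    i = bisect.bisect_right(age_keys, row["age"])
    age_keys.insert(i, row["age"])
    age_index.insert(i, (row["age"], pos))


insert_row({"name": "hana", "age": 29})
print([r["name"] for r in find_by_age(29)])
```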
Caching and use cases
caching avoids having to redo the same operation over and over (especially computationally complex operations that take a lot of time) - the result of caching is lower latency - sometimes avoids extra round trip network requests Caching on client - cache on client such that we don't go to the server Caching on server - maybe the client always needs to interact with the server, but the server doesn't need to go to the db (also hardware caching - cpu cache is faster than mem) Speeding Up System Use Case for READS: 1) client ==> server ==> db - cache to avoid network requests - put a cache on the client or server so we only do this multi-hop once - ie data loaded into a website can be cached on the client for a faster experience 2) avoid repeating some long computationally expensive operation - if this long operation is on the server - cache on the server or possibly the client Relieving Load on Single Point Use Case for READS 3) If we had multiple servers (say millions of requests) hitting the database, inserting a global cache will relieve the load on the database - each server could also have its own cache - a cache server runs on more expensive hardware (SSD) - and it gets slower the more we fill it up - so we can't fill it up completely
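A server-side cache in front of the database can be sketched with Python's functools.lru_cache. The "database" here is a stand-in function that just counts how often it is called; read_profile_from_db, get_profile, and the maxsize value are illustrative assumptions.

```python
import functools

db_reads = 0  # counts how many times we actually hit the (pretend) database


def read_profile_from_db(user_id: int) -> str:
    """Stand-in for a slow database round trip."""
    global db_reads
    db_reads += 1
    return f"profile-of-user-{user_id}"


@functools.lru_cache(maxsize=1024)  # bounded: least recently used entries are evicted
def get_profile(user_id: int) -> str:
    return read_profile_from_db(user_id)


get_profile(7)   # cache miss: goes to the db
get_profile(7)   # cache hit: served from memory
get_profile(8)   # different key: another miss
print("db reads:", db_reads, "| cache stats:", get_profile.cache_info())
```

The bounded maxsize reflects the point above that a cache cannot simply be filled without limit; something has to be evicted.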
SSL Certificate
digital certificate signed by a trusted 3rd party called a Certificate Authority (CA) - an organization that clients trust to vouch for identities -all of the data in the certificate is signed with the CA's private key so we know it's legit. The client (chrome) will have the public keys of all major CAs such that it can verify any SSL cert -Certificate contains info about the server itself and the public key of the server (sometimes public/private keys are given to the server by the CA)
CAP Theorem ==> Consistency vs Availability vs Partition Tolerance
distributed software system can only provide at most 2 of the 3 guarantees. There is an inherent tradeoff. Consistency - data is the same across the cluster, so you can read or write from/to any node and get the same data. Availability means the ability to access the cluster even if a node in the cluster goes down (achieved by replicating the data) Partition tolerance means that the cluster continues to function even if there is a network "partition" (communication break) between two nodes (both nodes are up, but can't communicate). A system that is partition-tolerant can sustain any amount of network failure that doesn't result in a failure of the entire network. Data is sufficiently replicated across combinations of nodes and networks to keep the system up through intermittent outages. - do we allow copies of our data to become out of sync or not. If we do allow it, we might provide more availability -Example of
MapReduce
framework to process large datasets in a distributed file system in an efficient, quick, and fault-tolerant way -similar to map and reduce in programming (python) Distributed data ==> Map ==> transform data into Key/Value pairs ==> Key/Value pairs shuffled around in a way that makes sense ==> Reduce into some final output -distributed file system ==> dataset split into chunks and replicated across machines -some central control plane (manager) - that is aware of where all chunks are (what worker machine stores what data) - operations are done on the machine where the data is (don't want to transfer data around) - k/v pair structure ==> one way to combine key/value pairs is to combine keys (sum their values - as shown in picture)
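The Map ==> shuffle ==> Reduce flow above can be sketched in a single process with a word count, the usual introductory example. In a real cluster each chunk would live on a different worker; here the chunks list, map_chunk, shuffle, and reduce_key are illustrative names and everything runs locally.

```python
from collections import defaultdict

# Pretend each string is a chunk of the dataset stored on a different worker.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]


def map_chunk(chunk: str):
    """Map: turn a chunk into (key, value) pairs."""
    return [(word, 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_key(key: str, values):
    """Reduce: combine the values for one key (here, sum them)."""
    return key, sum(values)


mapped = [pair for chunk in chunks for pair in map_chunk(chunk)]
counts = dict(reduce_key(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because map_chunk and reduce_key have no side effects, re-running either one on a failed worker's input gives the same result, which is what makes retry-based fault tolerance workable.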
Serviceability / Manageability
how easy it is to operate or maintain - ease of diagnosing and understanding problems when they occur, ease of making updates/modifications. The simplicity and speed with which a system can be repaired or maintained. If the time to fix a failed system increases - the availability will decrease.
Replication and Redundancy
Redundancy - the duplication of critical components or functions of a system with the intention of increasing the reliability of the system - usually in the form of a backup or fail-safe. Removes single points of failure and provides a backup in a crisis Replication - sharing info to ensure consistency b/w redundant resources, such as software or hardware components - to improve reliability, fault-tolerance, or accessibility -replication in DBMS - usually a master-slave relationship b/w original and copies. Main master db handles reads/writes - then writes are propagated to the slaves - each slave outputs an ACK that it received the update. -a slave/replica takes over when the main/master db goes down and begins taking read/write requests
Rate Limiting
setting some threshold on certain operations past which - these operations will return errors. Limiting the number of operations that can be performed in a given amount of time (limiting throughput) -useful for security and performance. IE ==> preventing a Denial of Service attack (although it won't prevent DDoS) -system-wide rate limit - only 100 requests per minute for the whole system for example -Ex ==> rate limit based on header data, specific user, or IP address Tier-based Rate limiting (algoexpert example) - one code execution every second - 3 run-code operations in a span of 10 seconds - 10 times in a minute -this tier-based approach complicates the logic
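The tiered limits above (1 per second, 3 per 10 seconds, 10 per minute) can be sketched as a sliding-window limiter kept in memory. This is one possible approach under stated assumptions, not the implementation from the example being referenced: TieredRateLimiter and its methods are made-up names, state lives in a single process, and the caller passes in the current time so the behavior is easy to follow.

```python
from collections import deque

# (max operations, window in seconds) for each tier from the notes.
TIERS = [(1, 1.0), (3, 10.0), (10, 60.0)]


class TieredRateLimiter:
    def __init__(self, tiers=TIERS):
        self.tiers = tiers
        self.longest_window = max(window for _, window in tiers)
        self.history = {}  # client id -> timestamps of accepted operations

    def allow(self, client_id: str, now: float) -> bool:
        stamps = self.history.setdefault(client_id, deque())
        # Forget operations older than the longest window.
        while stamps and now - stamps[0] >= self.longest_window:
            stamps.popleft()
        # Reject if any tier is already at its limit.
        for limit, window in self.tiers:
            recent = sum(1 for t in stamps if now - t < window)
            if recent >= limit:
                return False
        stamps.append(now)
        return True


limiter = TieredRateLimiter()
for t in [0.0, 0.5, 1.0, 2.0, 3.0, 10.0]:
    print(f"t={t:>4}: {'allowed' if limiter.allow('user-1', t) else 'rejected'}")
```

In a multi-server deployment the history would need to live in a shared store (the notes mention Redis elsewhere) so that every server sees the same counts.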
Reliability
the probability that a system will not fail in a given period - often measured as mean time between failures - a distributed system is considered reliable if it keeps delivering services even when one or more components fail Ex ==> Amazon user transactions should never be canceled due to the failure of a machine/component - Amazon has introduced reliability into their distributed system via redundancy (both of software components and data) -setting up redundancy to improve reliability has a cost - we have to pay to achieve such resilience by removing single points of failure