Grokking the System Design Interview

Three places to put a load balancer

- Between the user and the web server
- Between web servers and an internal platform layer, like application servers or cache servers (most common)
- Between the internal platform layer and the database

Availability

By definition, availability is the time a system remains operational to perform its required function in a specific period. It is a simple measure of the percentage of time that a system, service, or a machine remains operational under normal conditions. Availability takes into account maintainability, repair time, spares availability, and other logistics considerations.
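
As a worked example of the "percentage of time" measure, here is a minimal sketch. The function name `availability` and the yearly downtime figure are assumptions chosen for illustration, not values from the source.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Return the fraction of the period during which the system was operational."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("the measured period must be positive")
    return uptime_hours / total


# Hypothetical year: 8,760 hours in total, of which 8.76 hours were downtime.
yearly = availability(8760 - 8.76, 8.76)
print(f"{yearly:.3%}")  # prints 99.900%
```

Shortening repair time reduces `downtime_hours`, which is why maintainability and spares availability feed directly into this number.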

Reliability

By definition, reliability is the probability that a system will not fail in a given period. In simple terms, a distributed system is considered reliable if it keeps delivering its services even when one or several of its software or hardware components fail. A reliable distributed system achieves this through redundancy of both the software components and data.

Write-through cache

Data is written into the cache and the corresponding database simultaneously. Allows for fast retrieval and, since the same data gets written in the permanent storage, we will have complete data consistency between the cache and the storage. Ensures that nothing will get lost in case of a crash, power failure, or other system disruptions. Since every write operation must be done twice before returning success to the client, this scheme has the disadvantage of higher latency for write operations.
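
A minimal sketch of the write-through policy, assuming a plain dictionary stands in for permanent storage. The class name `WriteThroughCache` and its methods are hypothetical.

```python
class WriteThroughCache:
    """Every write goes to storage and to the cache before it is acknowledged."""

    def __init__(self, store: dict):
        self.store = store  # stand-in for the permanent database (assumption)
        self.cache = {}

    def write(self, key, value) -> None:
        # Both writes complete before returning, which is the source of the
        # extra write latency described above.
        self.store[key] = value
        self.cache[key] = value

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.store[key]  # cache miss: fall back to storage
        self.cache[key] = value
        return value


db = {}
wt = WriteThroughCache(db)
wt.write("tweet:1", "hello")
```

After `write` returns, `db` and `wt.cache` hold the same value, so losing the cache loses no data.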

Write-back cache

Data is written to cache alone, and completion is immediately confirmed to the client. The write to the permanent storage is done after specified intervals or under certain conditions. This results in low-latency and high-throughput for write-intensive applications; however, this speed comes with the risk of data loss in case of a crash or other adverse event because the only copy of the written data is in the cache.
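
A minimal sketch of the write-back policy, again assuming a dictionary stands in for permanent storage. The class name `WriteBackCache`, the `dirty` set, and the explicit `flush` call are hypothetical; a real system would trigger the flush on a timer or under other conditions.

```python
class WriteBackCache:
    """Writes are acknowledged after touching the cache only; storage catches up later."""

    def __init__(self, store: dict):
        self.store = store  # stand-in for the permanent database (assumption)
        self.cache = {}
        self.dirty = set()  # keys written to the cache but not yet persisted

    def write(self, key, value) -> None:
        self.cache[key] = value
        self.dirty.add(key)

    def flush(self) -> None:
        # Persist every pending write. Anything still in `dirty` when the
        # cache crashes is lost.
        for key in self.dirty:
            self.store[key] = self.cache[key]
        self.dirty.clear()


db = {}
wb = WriteBackCache(db)
wb.write("tweet:1", "hello")
pending_before_flush = "tweet:1" not in db
wb.flush()
```

Between `write` and `flush` the only copy of the data lives in the cache, which is the data-loss window the definition warns about.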

Data partitioning

Data partitioning is a technique to break up a big database (DB) into many smaller parts. It is the process of splitting up a DB/table across multiple machines to improve the manageability, performance, availability, and load balancing of an application. The justification for data partitioning is that, after a certain scale point, it is cheaper and more feasible to scale horizontally by adding more machines than to grow it vertically by adding beefier servers.
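
One common way to split a table across machines is to hash a key and take the result modulo the number of partitions. The sketch below assumes that scheme; the function name `shard_for` and the example key are hypothetical, and the source does not prescribe a particular partitioning method.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index in [0, num_shards) using a stable hash."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # hashlib gives the same digest on every machine and every run,
    # unlike Python's built-in hash() for strings.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


shard = shard_for("user-42", 4)
```

With plain modulo hashing, changing `num_shards` remaps most keys, so adding a machine forces a large data move.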

Common problems of data partitioning

- Joins and denormalization
- Referential integrity
- Rebalancing

Load Balancer

A load balancer (LB) helps spread traffic across a cluster of servers to improve the responsiveness and availability of applications, websites, or databases. The LB also keeps track of the status of all the resources while distributing requests. If a server is not available to take new requests, is not responding, or has an elevated error rate, the LB stops sending traffic to it.
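
A minimal sketch of this behavior, assuming round-robin distribution (the source does not name an algorithm). The class `RoundRobinBalancer` and the `mark_down`/`mark_up` calls are hypothetical; in practice a health checker would invoke them.

```python
class RoundRobinBalancer:
    """Hand out servers in rotation, skipping any that are marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0

    def mark_down(self, server) -> None:
        self.healthy.discard(server)

    def mark_up(self, server) -> None:
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        # Try each server at most once per call so that we stop cleanly
        # when every server is down.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


lb = RoundRobinBalancer(["a", "b", "c"])
lb.mark_down("b")
picks = [lb.pick() for _ in range(4)]  # ['a', 'c', 'a', 'c']
```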

System interface definition Twitter examples

postTweet(user_id, tweet_data, tweet_location, user_location, timestamp, ...)
generateTimeline(user_id, current_time, user_location, ...)
markTweetFavorite(user_id, tweet_id, timestamp, ...)

80-20 Rule

Roughly 20% of tweets generate 80% of the read traffic, so caching that hot 20% can serve most reads.

Example Clarification for Designing Twitter

- Will users of our service be able to post tweets and follow other people?
- Should we also design to create and display the user's timeline?
- Will tweets contain photos and videos?
- Are we focusing on the backend only, or are we developing the front-end too?
- Will users be able to search tweets?
- Do we need to display hot trending topics?
- Will there be any push notification for new (or important) tweets?

Cache

A cache is like short-term memory: it has a limited amount of space, but is typically faster than the original data source and contains the most recently accessed items. Caches can exist at all levels in architecture, but are often found at the level nearest to the front end, where they are implemented to return data quickly without taxing downstream levels.
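
A cache with limited space that keeps the most recently accessed items can be sketched as a least-recently-used (LRU) cache. LRU is an assumption here, one of several possible eviction policies, and the class name `LRUCache` is hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self.items:
            return default
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value) -> None:
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # drop the least recently used


lru = LRUCache(2)
lru.put("a", 1)
lru.put("b", 2)
lru.get("a")     # "a" is now the most recently used
lru.put("c", 3)  # evicts "b"
```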

CDN

Content Delivery (or Distribution) Network. A kind of cache that comes into play for sites serving large amounts of static media. In a typical CDN setup, a request will first ask the CDN for a piece of static media; the CDN will serve that content if it has it locally available. If it isn't available, the CDN will query the back-end servers for the file, cache it locally, and serve it to the requesting user.
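
The request flow just described can be sketched as a pull-through cache. The class `PullThroughCDN`, the `fetch_from_origin` callback, and the example path are hypothetical stand-ins for a real CDN node and back-end server.

```python
class PullThroughCDN:
    """Serve from local storage when possible, otherwise fetch from origin and keep a copy."""

    def __init__(self, fetch_from_origin):
        self.fetch_from_origin = fetch_from_origin  # callable: path -> content
        self.local = {}
        self.origin_requests = 0

    def get(self, path: str):
        if path not in self.local:
            # Not available locally: query the back-end and cache the result.
            self.local[path] = self.fetch_from_origin(path)
            self.origin_requests += 1
        return self.local[path]


def fake_origin(path: str) -> str:
    # Stand-in for the back-end servers (assumption).
    return f"contents of {path}"


cdn = PullThroughCDN(fake_origin)
first = cdn.get("/img/logo.png")
second = cdn.get("/img/logo.png")  # served locally, no origin request
```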

Step 3: System interface definition

Define what APIs are expected from the system. This will establish the exact contract expected from the system and confirm that we haven't gotten any requirements wrong.

Step 4: Defining data model

Defining the data model in the early part of the interview will clarify how data will flow between different system components. Later, it will guide data partitioning and management. The candidate should identify the various system entities, how they will interact with each other, and different aspects of data management like storage, transportation, encryption, etc. Which database system should we use? Will a NoSQL database like Cassandra best fit our needs, or should we use a MySQL-like solution? What kind of block storage should we use to store photos and videos?

Step 5: High-level design

Draw a block diagram with 5-6 boxes representing the core components of our system. We should identify enough components that are needed to solve the actual problem from end to end.

High-level design Twitter examples

For Twitter, at a high level, we will need multiple application servers to serve all the read/write requests, with load balancers in front of them for traffic distribution. If we're assuming that we will have a lot more read traffic than write traffic, we can decide to have separate servers to handle each. On the back-end, we need an efficient database that can store all the tweets and support a large number of reads. We will also need a distributed file storage system for storing photos and videos.

Horizontal vs Vertical Scaling

Horizontal scaling means adding more servers to your pool of resources, whereas vertical scaling means adding more power (CPU, RAM, storage, etc.) to an existing server.
- Horizontal scaling: NoSQL databases such as Cassandra and MongoDB make it easy to add more machines.
- Vertical scaling: MySQL makes it easy to switch to a bigger machine, but doing so often involves downtime.

Step 1: Requirements Clarification

It is always a good idea to ask questions about the exact scope of the problem we are trying to solve. Design questions are mostly open-ended, and they don't have ONE correct answer. That's why clarifying ambiguities early in the interview becomes critical. Candidates who spend enough time to define the end goals of the system always have a better chance to be successful in the interview. Also, since we only have 35-40 minutes to design a (supposedly) large system, we should clarify what parts of the system we will be focusing on.

Step 2: Back of the Envelope Estimations

It is always a good idea to estimate the scale of the system we're going to design. This will also help later when we focus on scaling, partitioning, load balancing, and caching.
- What scale is expected from the system (e.g., number of new tweets, number of tweet views, number of timeline generations per sec., etc.)?
- How much storage will we need? We will have different storage requirements if users can have photos and videos in their tweets.
- What network bandwidth usage are we expecting? This will be crucial in deciding how we will manage traffic and balance load between servers.
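
A sketch of the arithmetic behind such an estimate. Every input number below is an assumption picked for illustration, not a figure from the source; in an interview these would come from the clarification step.

```python
SECONDS_PER_DAY = 86_400

# Hypothetical inputs (assumptions, text-only tweets):
new_tweets_per_day = 100_000_000
avg_tweet_bytes = 300

# Scale: write requests per second.
tweets_per_second = new_tweets_per_day / SECONDS_PER_DAY

# Storage: new data per day, in gigabytes (1 GB = 10**9 bytes).
storage_per_day_gb = new_tweets_per_day * avg_tweet_bytes / 1e9

# Bandwidth: incoming bytes per second, expressed in megabytes.
ingress_mb_per_second = tweets_per_second * avg_tweet_bytes / 1e6

print(f"{tweets_per_second:,.0f} tweets/s")
print(f"{storage_per_day_gb:,.0f} GB/day")
print(f"{ingress_mb_per_second:.2f} MB/s ingress")
```

Allowing photos and videos would change `avg_tweet_bytes` by orders of magnitude, which is why the storage question depends on the media requirement.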

Database indexes

Indexes make reads faster but can slow down writes, because every write must also update the index. If your application is write-heavy, it might not make sense to add indexes.
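
A small sketch using Python's built-in `sqlite3` module to show a read switching from a full scan to an index lookup. The table layout, the index name `idx_tweets_user_id`, and the sample rows are assumptions, and the exact wording of the query plan varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, user_id INTEGER, content TEXT)"
)
conn.executemany(
    "INSERT INTO tweets (user_id, content) VALUES (?, ?)",
    [(i % 10, f"tweet {i}") for i in range(100)],
)


def plan(sql: str) -> str:
    """Return SQLite's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[3] for row in rows)  # the fourth column is the detail text


query = "SELECT content FROM tweets WHERE user_id = 3"
plan_before = plan(query)  # a scan of the whole table
conn.execute("CREATE INDEX idx_tweets_user_id ON tweets (user_id)")
plan_after = plan(query)   # a search that uses idx_tweets_user_id

print(plan_before)
print(plan_after)
```

From this point on, every insert into `tweets` also has to update `idx_tweets_user_id`, which is the write cost mentioned above.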

Reliability vs Availability

Reliability is availability over time considering the full range of possible real-world conditions that can occur. If a system is reliable, it is available. However, if it is available, it is not necessarily reliable. In other words, high reliability contributes to high availability, but it is possible to achieve a high availability even with an unreliable product by minimizing repair time and ensuring that spares are always available when they are needed.

Scalability

Scalability is the capability of a system, process, or a network to grow and manage increased demand. Any distributed system that can continuously evolve in order to support the growing amount of work is considered to be scalable.

Write-around cache

Similar to write-through cache, but data is written directly to permanent storage, bypassing the cache. This can keep the cache from being flooded with writes that will not subsequently be re-read, but it has the disadvantage that a read request for recently written data will create a "cache miss": the data must be read from slower back-end storage, so the read experiences higher latency.
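
A minimal sketch of the write-around policy, assuming a dictionary stands in for permanent storage. The class name `WriteAroundCache` and the `misses` counter are hypothetical. Dropping a stale cached copy on write is a design choice of this sketch, not something the source specifies.

```python
class WriteAroundCache:
    """Writes bypass the cache; only reads populate it."""

    def __init__(self, store: dict):
        self.store = store  # stand-in for the permanent database (assumption)
        self.cache = {}
        self.misses = 0

    def write(self, key, value) -> None:
        self.store[key] = value
        # Drop any older cached copy so later reads do not return stale data.
        self.cache.pop(key, None)

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        self.misses += 1  # recently written data always misses once
        value = self.store[key]
        self.cache[key] = value
        return value


db = {}
wa = WriteAroundCache(db)
wa.write("tweet:1", "hello")
first_read = wa.read("tweet:1")   # cache miss, served from storage
second_read = wa.read("tweet:1")  # cache hit
```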

Serviceability or Manageability

The simplicity and speed with which a system can be repaired or maintained; if the time to fix a failed system increases, then availability will decrease. Things to consider for manageability are the ease of diagnosing and understanding problems when they occur, ease of making updates or modifications, and how simple the system is to operate (i.e., does it routinely operate without failure or exceptions?).

Efficiency

Two standard measures of a distributed system's efficiency are:
- response time (or latency), which denotes the delay to obtain the first item
- throughput (or bandwidth), which denotes the number of items delivered in a given time unit (e.g., a second)
The two measures correspond to the following unit costs:
- Number of messages globally sent by the nodes of the system, regardless of the message size
- Size of messages, representing the volume of data exchanged

Defining data model Twitter examples

User: UserID, Name, Email, DoB, CreationDate, LastLogin, etc.
Tweet: TweetID, Content, TweetLocation, NumberOfLikes, TimeStamp, etc.
UserFollow: UserID1, UserID2
FavoriteTweets: UserID, TweetID, TimeStamp
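
The entities above can be sketched as Python dataclasses. The field names follow the list; the field types are assumptions, and the fields covered by "etc." are deliberately left out.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class User:
    user_id: int
    name: str
    email: str
    dob: date
    creation_date: datetime
    last_login: datetime


@dataclass
class Tweet:
    tweet_id: int
    content: str
    tweet_location: str  # assumed to be a free-form string
    number_of_likes: int
    timestamp: datetime


@dataclass
class UserFollow:
    user_id1: int
    user_id2: int


@dataclass
class FavoriteTweet:
    user_id: int
    tweet_id: int
    timestamp: datetime


now = datetime(2024, 1, 1, 12, 0)
favorite = FavoriteTweet(user_id=1, tweet_id=10, timestamp=now)
```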
