software architecture

AJAX - Long Polling

Long polling suits simple asynchronous data-fetch use cases where you do not want the client to poll the server repeatedly

Capacity Estimation: Facebook Messenger

500M daily users, 40 messages per user per day, 1 KB per message
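Working those numbers through (a quick back-of-the-envelope sketch in Python, using decimal units):

```python
daily_users = 500_000_000
messages_per_user = 40
message_size_kb = 1

daily_messages = daily_users * messages_per_user           # 20 billion messages/day
daily_storage_tb = daily_messages * message_size_kb / 1e9  # KB -> TB (decimal)
# roughly 20 TB of new message data per day, before replication
```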

HTTP Protocol

Protocol for data exchange over the World Wide Web. Stateless, Request-response protocol

What is ACID?

Typically associated with relational databases: Atomicity, Consistency, Isolation, Durability

What is the reason to use relational database?

When accuracy is very important.

proxies

(substitutes) filter/log/transform/cache requests

SQL

ACID: atomicity, consistency, isolation, durability. Vertical scaling. Fixed schema: each row contains all info about one entity; each column contains a separate data point. Ex: MySQL, Oracle, Postgres

CAP choices

- CP: atomic read & write - AP: allow for eventual consistency, works despite external failures.

Why is it important to define the data model early? (List 2)

- Clarifies how data will flow among different components of the system - Later, it will guide towards data partitioning and management

Why is it always a good idea to ask questions about the exact scope of the problem being solved (AKA requirements clarification)?

- Design questions are open-ended, and have more than 1 correct answer - Clarifying ambiguities early in the interview becomes critical - Candidates who spend enough time to define the end goals of the system always have a better chance to be successful in the interview - With only 35-40 minutes to design a (supposedly) large system, we should clarify what parts of the system we will be focusing on.

Why is it important to define what APIs are expected from the system (AKA system interface definition)?

- Establish the exact contract expected from the system - Ensure that requirements are NOT wrong

Cache eviction policies

- FIFO - LIFO - LRU - LFU - RR: random replacement
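As an illustration, LRU (the policy most often asked about) can be sketched in a few lines of Python on top of OrderedDict:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()  # insertion order doubles as recency order

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry
```

FIFO and LFU differ only in how the eviction victim is chosen; the surrounding structure stays the same.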

Say the disadvantages of failover

- Fail-over adds more hardware and additional complexity. - There is a potential for loss of data if the active system fails before any newly written data can be replicated to the passive.

Protocols

- HTTP: GET/POST/PUT/PATCH resources - TCP: handshakes, data is intact - UDP: connection-less, less reliable, but lowest latency - RPC: remote procedure call: executes procedures in a different address space; used for performance with internal communications (Protobuf, Thrift, Avro) - REST: representational state transfer: client/server model; all communication must be stateless and cacheable

What should the candidate do when digging deeper into 2-3 components (detailed design)?

- Present different approaches and their pros/cons - Explain why one approach is preferred It is VERY IMPORTANT to consider tradeoffs between different options while keeping system constraints in mind

AJAX polling, long polling, WebSockets, Server-sent events

- AJAX polling: client repeatedly polls (requests) data -> HTTP overhead - long polling (hanging GET): server delays the response until an update is available or a timeout is reached - WebSockets: full-duplex, persistent connection - Server-sent events: long-lived connection that the server uses to push data to the client
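The hanging-GET behavior can be mimicked locally with a blocking queue (a sketch only; a real server would hold an HTTP request open instead):

```python
import queue
import threading

updates = queue.Queue()

def long_poll(timeout=2.0):
    # server-side analogue of a hanging GET: block until an update
    # arrives or the timeout fires, then respond
    try:
        return updates.get(timeout=timeout)
    except queue.Empty:
        return None  # timed out; the client would immediately re-poll

# simulate an update arriving 0.1s after the client starts waiting
threading.Timer(0.1, lambda: updates.put("new message")).start()
result = long_poll()  # unblocks as soon as the update lands
```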

Say some questions to ask regarding features (system design)

- Size/capacity (of whatever we're making). - Queries per second. - Access strategy (for a cache).

General disadvantages of replication

- There is a potential for loss of data if the master fails before any newly written data can be replicated to other nodes. - Writes are replayed to the read replicas. If there are a lot of writes, the read replicas can get bogged down with replaying writes and can't do as many reads. - The more read slaves, the more you have to replicate, which leads to greater replication lag. - On some systems, writing to the master can spawn multiple threads to write in parallel, whereas read replicas only support writing sequentially with a single thread. Replication adds more hardware and additional complexity.

What are some questions you can ask when defining the data model for an application?

- Which database system should we use? - Will NoSQL like Cassandra best fit our needs, or should we use a MySQL-like solution? - What kind of block storage should we use to store photos and videos?

cache invalidation algorithms/schemes

- cache-aside: look in cache. If cache-miss, load from db, cache, and return. - write-through cache: client -> db -> cache. write to cache and db at the same time -> high write latency - write-around cache (lazy): write db & invalidate cache. Read would populate the cache -> high read latency - write-back cache: write to cache alone, add to queue, the queue would write to DB after. - refresh ahead: similar to cache-aside, but load hot items in-advance
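A minimal sketch of cache-aside reads combined with write-around invalidation, using plain dicts as stand-ins for the database and the cache tier:

```python
db = {"user:1": {"name": "Ada"}}   # stand-in for the database
cache = {}                         # stand-in for e.g. a Redis tier

def get_user(key):
    # cache-aside: check the cache; on a miss, load from db and populate
    if key in cache:
        return cache[key]
    value = db.get(key)
    if value is not None:
        cache[key] = value
    return value

def update_user(key, value):
    # write-around: write the db and invalidate the cached copy;
    # the next read repopulates the cache (hence the higher read latency)
    db[key] = value
    cache.pop(key, None)
```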

where to cache

- close to front-end to not tax downstream levels. Ex: application server, CDN, ISP.

Availability patterns

- fail-over: active-passive, active-active - replication: master-slave, master-master

DB sharding: partitioning methods

- horizontal: split by row -> can become unbalanced - vertical: split by column -> scaling problem when the number of rows grows too big - directory-based: a lookup service maps keys to shards; similar to OS storage

DB sharding: partitioning criteria

- key or hash - list: group entities - round-robin - composite: combination of the above. Common problems: joins, referential integrity, rebalancing
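Key/hash-based partitioning in miniature (the shard count and key format are invented for illustration). Note that changing the shard count remaps most keys, which is the rebalancing problem noted above and the usual motivation for consistent hashing:

```python
import hashlib

NUM_SHARDS = 4  # illustrative shard count

def shard_for(key: str) -> int:
    # a stable hash (not Python's randomized built-in hash()) keeps
    # a key on the same shard across processes and restarts
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```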

Latency vs. throughput

- latency: time to perform an action - throughput: number of such actions per unit of time

DB design for PasteBin

- no relationship between records, except if we want to store which paste belongs to which user - Paste: URLHash, ContentKey, ExpirationDate, CreationDate - User: UserID, Name, Email, CreationDate, LastLogin

DB indexing

- provides the basis for rapid random lookups and efficient access of ordered records - reduces write performance

3 common places for load balancers

- user & web server - web server & platform layer - platform layer & database

Consistency patterns

- weak consistency: read might not see the write: memcache - eventual consistency: DNS, email - strong consistency: file system, RDBMS

DNS traffic routing algorithms

- weighted round robin - latency-based - geo location based
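Weighted round robin can be sketched as below (region names and weights are invented); each endpoint appears in the rotation in proportion to its weight:

```python
import itertools

def weighted_round_robin(servers):
    # servers: list of (name, weight); each name appears `weight`
    # times per cycle, so traffic splits in proportion to the weights
    expanded = [name for name, weight in servers for _ in range(weight)]
    return itertools.cycle(expanded)

rr = weighted_round_robin([("us-east", 3), ("eu-west", 1)])
first_cycle = [next(rr) for _ in range(4)]
# -> ["us-east", "us-east", "us-east", "eu-west"]
```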

What aspects of the system should the candidate be able to identify when defining the data model? (List 3)

1. Various entities of the system 2. How the various entities interact with each other 3. Different aspects of data management, i.e. storage, transportation, encryption, etc.

What should the candidate do to communicate the high-level design of a system? (List 2 steps)

1. Draw a block diagram with 5-6 boxes representing the core components 2. Identify enough components that are needed to solve the actual problem from end-to-end.

Example of high-level design for Twitter

1. For Twitter, at a high level, we will need multiple application servers to serve all the read/write requests, with load balancers in front of them for traffic distribution. 2. If we're assuming that we will have a lot more read traffic (as compared to write), we can decide to have separate servers for handling these scenarios. 3. On the backend, we need an efficient database that can store all the tweets and can support a huge number of reads. 4. We will also need a distributed file storage system for storing photos and videos.

What are some questions a candidate can ask to identify and resolve bottlenecks? (List 4)

1. Is there any single point of failure in our system? What are we doing to mitigate it? 2. Do we have enough replicas of the data so that if we lose a few servers we can still serve our users? 3. Similarly, do we have enough copies of different services running such that a few failures will not cause total system shutdown? 4. How are we monitoring the performance of our service? Do we get alerts whenever critical components fail or their performance degrades?

What is the high-level view of the step-by-step approach to solve multiple design problems? (List 7)

1. Requirements clarifications 2. System interface definition 3. Back-of-the-envelope estimation 4. Defining data model 5. High-level design 6. Detailed design 7. Identifying and resolving bottlenecks

7 main steps of Grokking System Design

1. Requirements: functional, non-functional (technical), and extended. (who, how they use it, how many users, what the system does, input/output, how much data, how many requests, read/write ratio) 2. System interface: APIs 3. Capacity Estimation: traffic (qps), storage (db), bandwidth (network), memory (cache) -> high level estimate for each action 4. Data model/ DB design: what type of db, entity db structure & relationship. What's the estimation for the metadata/storage. 5. High-level design: block diagram of each service layer 6. Detail design: go into the first 2-3 components from the functional requirements. Also include a walk-through of the algorithms (if user does this, then what ...) 7. bottlenecks/scaling/technical issues

Example of detailed design questions for Twitter

1. Since we will be storing a massive amount of data, how should we partition our data to distribute it to multiple databases? - Should we try to store all the data of a user on the same database? - What issue could it cause? 2. How will we handle hot users who tweet a lot or follow lots of people? 3. Since users' timeline will contain the most recent (and relevant) tweets, should we try to store our data in such a way that is optimized for scanning the latest tweets? 4. How much and at which layer should we introduce cache to speed things up? 5. What components need better load balancing?

Why is it always a good idea to estimate the scale of the system we're going to design (AKA back-of-the-envelope estimation)?

Estimating the scale helps later when it's time to focus on scaling, partitioning, load balancing, and caching... (SPLBC)

What are some questions that can help estimate the scale of a system? (List 3)

1. What scale is expected from the system (e.g., number of new tweets, number of tweet views, number of timeline generations per sec., etc.)? 2. How much storage will we need? We will have different numbers if users can have photos and videos in their tweets. 3. What network bandwidth usage are we expecting? This will be crucial in deciding how we will manage traffic and balance load between servers.

Capacity Estimation: Twitter

1B users, 200M daily active; 0.5 tweets/user/day; each user follows 200. Timeline: visited 2 times/day; each user also visits 5 other timelines, each showing 20 tweets (400:1 read/write ratio). Each tweet: 2 bytes + 30 bytes metadata; 1 in 5 tweets has a photo, 1 in 10 has a video => ~250kb/tweet
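The write-side numbers above translate into QPS roughly as follows (a decimal-unit sketch):

```python
daily_active = 200_000_000
tweets_per_user_per_day = 0.5
read_write_ratio = 400

writes_per_day = daily_active * tweets_per_user_per_day  # 100M new tweets/day
write_qps = writes_per_day / 86_400                      # ~1.2K writes/sec
read_qps = write_qps * read_write_ratio                  # ~460K reads/sec
```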

HTTP code full

1×× Informational
100 Continue, 101 Switching Protocols, 102 Processing
2×× Success
200 OK, 201 Created, 202 Accepted, 203 Non-authoritative Information, 204 No Content, 205 Reset Content, 206 Partial Content, 207 Multi-Status, 208 Already Reported, 226 IM Used
3×× Redirection
300 Multiple Choices, 301 Moved Permanently, 302 Found, 303 See Other, 304 Not Modified, 305 Use Proxy, 307 Temporary Redirect, 308 Permanent Redirect
4×× Client Error
400 Bad Request, 401 Unauthorized, 402 Payment Required, 403 Forbidden, 404 Not Found, 405 Method Not Allowed, 406 Not Acceptable, 407 Proxy Authentication Required, 408 Request Timeout, 409 Conflict, 410 Gone, 411 Length Required, 412 Precondition Failed, 413 Payload Too Large, 414 Request-URI Too Long, 415 Unsupported Media Type, 416 Requested Range Not Satisfiable, 417 Expectation Failed, 418 I'm a teapot, 421 Misdirected Request, 422 Unprocessable Entity, 423 Locked, 424 Failed Dependency, 426 Upgrade Required, 428 Precondition Required, 429 Too Many Requests, 431 Request Header Fields Too Large, 444 Connection Closed Without Response, 451 Unavailable For Legal Reasons, 499 Client Closed Request
5×× Server Error
500 Internal Server Error, 501 Not Implemented, 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout, 505 HTTP Version Not Supported, 506 Variant Also Negotiates, 507 Insufficient Storage, 508 Loop Detected, 510 Not Extended, 511 Network Authentication Required, 599 Network Connect Timeout Error

common http codes

1×× Informational, 2×× Success, 3×× Redirection, 4×× Client Error, 5×× Server Error
Examples: 200 OK, 300 Multiple Choices, 301 Moved Permanently, 302 Found, 304 Not Modified, 307 Temporary Redirect, 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found, 410 Gone, 500 Internal Server Error, 501 Not Implemented, 503 Service Unavailable

Twitter APIs

post(userId, tweetData, tweetLoc = None, userLoc = None, uploadMediaIDs = None, timestamp,...) generateTimeLine(userId, timestamp, ...) markTweetFav(userId, tweetId, timestamp, ...)

IPv4 vs. IPv6

32-bit vs. 128-bit addresses. IPv6 also brings: no more NAT (Network Address Translation), auto-configuration, no more private address collisions, better multicast routing, a simpler header format, simplified and more efficient routing, true quality of service (QoS, also called "flow labeling"), built-in authentication and privacy support, flexible options and extensions, and easier administration (say good-bye to DHCP)

Capacity Estimation: Dropbox

500M users, 100M active, 3 devices, 200 files, average file size: 100KB, 10 years
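Working the storage side of those numbers through (decimal units; deduplication and replication ignored):

```python
total_users = 500_000_000
files_per_user = 200
avg_file_kb = 100

total_files = total_users * files_per_user   # 100 billion files
total_pb = total_files * avg_file_kb / 1e12  # KB -> PB (decimal)
# -> 10 PB of raw file data
```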

Capacity Estimation: Instagram

500M users, 1M active, 2 photos/activeUser/day, average photo size: 200KB, 10 years
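Working those numbers through (decimal units; replication ignored):

```python
active_users = 1_000_000
photos_per_user_per_day = 2
photo_kb = 200
years = 10

daily_gb = active_users * photos_per_user_per_day * photo_kb / 1e6  # 400 GB/day
ten_year_tb = daily_gb * 365 * years / 1000                         # 1460 TB ~= 1.5 PB
```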

What are heuristic exceptions?

A heuristic exception refers to a transaction participant's decision to unilaterally take some action without the consensus of the transaction manager, usually as a result of some kind of catastrophic failure between the participant and the transaction manager.
In a distributed environment, communication failures can happen. If communication between the transaction manager and a recoverable resource is not possible for an extended period of time, the recoverable resource may decide to unilaterally commit or roll back changes done in the context of a transaction. Such a decision is called a heuristic decision. It is one of the worst errors that may happen in a transaction system, as it can lead to parts of the transaction being committed while other parts are rolled back, thus violating the atomicity property of the transaction and possibly leading to data integrity corruption.
Because of the dangers of heuristic exceptions, a recoverable resource that makes a heuristic decision is required to maintain all information about the decision in stable storage until the transaction manager tells it to forget about the heuristic decision. The actual data about the heuristic decision that is saved in stable storage depends on the type of recoverable resource and is not standardized. The idea is that a system manager can look at the data and possibly edit the resource to correct any data integrity problems.

Peer to Peer Architecture

A P2P network is a network in which computers, also known as nodes, can communicate with each other without the need for a central server. A seeder is a node that hosts the data on its system and provides bandwidth to upload the data to the network; a leecher is a node that downloads the data from the network.

Message Queue: Publish Subscribe Model

A Publish-Subscribe model is the model where multiple consumers receive the same message sent from a single or multiple producers.

What is a cluster?

A cluster is a group of computer machines that can individually run a piece of software. Clusters are typically utilized to achieve high availability for server software, and clustering is used in many types of servers for this purpose. App server cluster: a group of machines that run an application server that can be reliably utilized with a minimum of downtime. Database server cluster: a group of machines that run a database server that can be reliably utilized with a minimum of downtime.

What's a CDN?

A content delivery network (CDN) is a globally distributed network of proxy servers, serving content from locations closer to the user. Generally, static files such as HTML/CSS/JS, photos, and videos are served from CDN, although some CDNs such as Amazon's CloudFront support dynamic content. The site's DNS resolution will tell clients which server to contact. Serving content from CDNs can significantly improve performance in two ways: - Users receive content at data centers close to them - Your servers do not have to serve requests that the CDN fulfills

What is a distributed system?

A distributed system, in its simplest definition, is a group of computers working together so as to appear as a single computer to the end user. These machines have a shared state, operate concurrently, and can fail independently without affecting the whole system's uptime.

Message-oriented Middleware (MOM)

A microservice sends a message to, or receives one from, the MOM; the sender and the recipient do not know each other.

What is eventual consistency?

A model for database consistency in which updates to the database will propagate through the system so that all data copies will be consistent eventually.

Say some additional benefits of Load Balancers

Additional benefits include: - SSL termination - Decrypt incoming requests and encrypt server responses so backend servers do not have to perform these potentially expensive operations - Removes the need to install X.509 certificates on each server - Session persistence - Issue cookies and route a specific client's requests to same instance if the web apps do not keep track of sessions

What do you call an external system that is used for authentication (like Facebook, Google...)?

An identity provider.

SQL vs NoSQL -> BASIC QUESTIONS

- Are joins required? => SQL - DB too big to fit on one machine? => NoSQL - Is technology maturity required? => SQL (rare) - Read-write pattern: lots of reads? => NoSQL

What is consistency?

Assuming you have a storage system that has more than one machine, consistency implies that the data is the same across the cluster, so you can read or write to/from any node and get the same data.

Atom

Atom does not require additional infrastructure, just HTTP. Old events are easily accessible if necessary. The sequence can be guaranteed. The Atom feed is consistent.

ACID

Atomicity: each transaction is all or nothing. Consistency: each transaction brings the DB from one valid state to another. Isolation: executing transactions concurrently vs. serially has the same result. Durability: once a transaction has been committed, it remains so.

In NoSQL DBs, what does BASE mean?

Basically Available, Soft state, Eventual consistency

In NoSQL DBs, what does Basically Available mean?

Basically Available: This constraint states that the system does guarantee the availability of the data in the sense of the CAP theorem; there will be a response to any request. But that response could still be a 'failure' to obtain the requested data, or the data may be in an inconsistent or changing state, much like waiting for a check to clear in your bank account.

Explain Blue-Green Deployment Technique

Blue-green deployment is a technique that reduces downtime and risk by running two identical production environments called Blue and Green. At any time, only one of the environments is live, with the live environment serving all production traffic. For this example, Blue is currently live and Green is idle. As you prepare a new version of your software, deployment and the final stage of testing takes place in the environment that is not live: in this example, Green. Once you have deployed and fully tested the software in Green, you switch the router so all incoming requests now go to Green instead of Blue. Green is now live, and Blue is idle. This technique can eliminate downtime due to application deployment. In addition, blue-green deployment reduces risk: if something unexpected happens with your new version on Green, you can immediately roll back to the last version by switching back to Blue.

What types of NoSQL DBs are supported by DynamoDB?

Both key-values and documents.

What is Websocket?

WebSocket is a computer communications protocol providing full-duplex communication channels over a single TCP connection. It sits at the same level as HTTP (layer 7 of the OSI model) but is a different protocol.

Define master-master replication, and mention 3 disadvantages particular to this method of replication.

Both masters serve reads and writes and coordinate with each other on writes. If either master goes down, the system can continue to operate with both reads and writes. Disadvantages: - You'll need a load balancer or you'll need to make changes to your application logic to determine where to write. - Most master-master systems are either loosely consistent (violating ACID) or have increased write latency due to synchronization. (MR: This is basically CAP Theorem). - Conflict resolution comes more into play as more write nodes are added and as latency increases.

Disadvantage(s): CDN

CDN costs could be significant depending on traffic, although this should be weighed against the costs you would incur by not using a CDN. Content might be stale if it is updated before the TTL expires. CDNs require changing URLs for static content to point to the CDN.

Component design and algorithms: YouTube/Netflix

Client (upload/search/view) <-> app servers <-> processing queue <-> encode service
app servers <-> user DB + metadata DB
encode service <-> metadata DB + media storage
Video processing: multiple chunks per format (codec) and resolution -> use a processing queue
Manage traffic: metadata uses master/slave
Dedup should happen early: when the user starts uploading
Caching: use a CDN (content delivery network) and an LRU cache for video/metadata

Component design and algorithms: Twitter

Client <-> load balancer <-> app servers <-> DB + file storage
Cache: 80-20 rule, LRU; for each user, the key is ownerID and the value is a linkedHashMap
Timeline: see FB
LB: typical 3-LB pattern: client/app, app/db, app/cache

Cloud Foundry

Cloud Foundry's solutions are very similar to those of Kubernetes: service discovery also works via DNS, and load balancing is transparently implemented by Cloud Foundry. For routing of external requests, Cloud Foundry relies on DNS. Resilience: Cloud Foundry itself does not offer a solution in this area.

What is Clustering?

Clustering is needed to achieve high availability for server software. The main purpose of clustering is to achieve 100% availability, or zero downtime, in service. A typical server software can run on one computer machine and can serve as long as there is no hardware failure or some other failure. By creating a cluster of more than one machine, we reduce the chances of our service becoming unavailable when one of the machines fails. Clustering does not always guarantee that the service will be 100% available, since there is still a chance that all the machines in a cluster fail at the same time. However, this is not very likely if you have many machines and they are located in different locations or supported by their own resources.

The conformist pattern

Conformist means that a bounded context simply uses a domain model from another bounded context. ex: order -> statistics

Data Ingestion

Data ingestion is a collective term for the process of collecting data streaming in from several different sources and making it ready to be processed by the system. There are two primary ways to ingest data: in real time and in batches.

What is denormalization?

Denormalization attempts to improve read performance at the expense of some write performance. Redundant copies of the data are written in multiple tables to avoid expensive joins. Some RDBMS such as PostgreSQL and Oracle support materialized views which handle the work of storing redundant information and keeping redundant copies consistent. Once data becomes distributed with techniques such as federation and sharding, managing joins across data centers further increases complexity. Denormalization might circumvent the need for such complex joins. In most systems, reads can heavily outnumber writes 100:1 or even 1000:1. A read resulting in a complex database join can be very expensive, spending a significant amount of time on disk operations. Disadvantage(s): denormalization - Data is duplicated. - Constraints can help redundant copies of information stay in sync, which increases complexity of the database design. - A denormalized database under heavy write load might perform worse than its normalized counterpart.

Say some disadvantages of LBs

Disadvantage(s): load balancer The load balancer can become a performance bottleneck if it does not have enough resources or if it is not configured properly. Introducing a load balancer to help eliminate single points of failure results in increased complexity. A single load balancer is a single point of failure, configuring multiple load balancers further increases complexity.

Say some advantages of MongoDB

Dynamic schema: As mentioned, this gives you flexibility to change your data schema without modifying any of your existing data. Scalability: MongoDB is horizontally scalable, which helps reduce the workload and scale your business with ease. Manageability: The database doesn't require a database administrator. Since it is fairly user-friendly in this way, it can be used by both developers and administrators. Speed: It's high-performing for simple queries. Flexibility: You can add new columns or fields on MongoDB without affecting existing rows or application performance.

Events

Each microservice decides for itself how it reacts to the events. This leads to better decoupling. Temporary inconsistency cannot be avoided: it takes time for asynchronous communication to reach all systems.

In NoSQL DBs, what does Eventual Consistency mean?

Eventual consistency: The system will eventually become consistent once it stops receiving input. The data will propagate to everywhere it should sooner or later, but the system will continue to receive input and is not checking the consistency of every transaction before it moves onto the next one.

ETL

Extract, Transform, and Load. Used to standardize data across systems, allowing it to be queried.

What Is Fail Over?

Failover means switching to another machine when one machine fails. Failover is an important technique for achieving high availability. Typically a load balancer is configured to fail over to another machine when the main machine fails. To achieve the least downtime, most load balancers support a heartbeat check, which ensures that the target machine is responding. As soon as a heartbeat signal fails, the load balancer stops sending requests to that machine and redirects them to other machines or the cluster.

What are the patterns of high availability?

Fail-over and replication Fail-over can be active-passive or active-active. Replication can be Master-slave or master-master.

What are the 5 steps to solving a problem in a system design interview?

Feature expectations (first 2 mins): "What are the features?" As said earlier, there is no wrong design. There are just good and bad designs, and the same solution can be a good design for one use case and a bad design for another. It is hence extremely important to get a very clear understanding of what the requirements for the question are.
Estimations (2-5 mins): "What are the scalability requirements?" The next step is usually to estimate the scale required for the system. The goal of this step is to understand the level of sharding required (if any) and to zero in on the design goals for the system. For example, if the total data required for the system fits on a single machine, we might not need to go into sharding and the complications that go with a distributed system design. Or, if the most frequently used data fits on a single machine, caching could be done on a single machine.
Design goals (1 min): "What are the design goals?" "Latency?", "Consistency?", "Availability?" Figure out what the most important goals for the system are. There are systems that are latency-sensitive, in which case a solution that does not account for this might lead to a bad design.
Skeleton of the design (4-5 mins): "What are the operations that need to be supported?" (Describe the API.) 30-40 mins is not enough time to discuss every single component in detail. As such, a good strategy is to discuss at a very high level with the interviewer and go into a deep dive of components as inquired by the interviewer.
Deep dive (20-30 mins): Describe the application layer. Describe the database layer. Plan for fault tolerance (i.e., when a part of the system dies, then what?).

Define DB Federation (or functional partitioning)

Federation (or functional partitioning) splits up databases by function. For example, instead of a single, monolithic database, you could have three databases: forums, users, and products, resulting in less read and write traffic to each database and therefore less replication lag. Smaller databases result in more data that can fit in memory, which in turn results in more cache hits due to improved cache locality. With no single central master serializing writes you can write in parallel, increasing throughput. Disadvantage(s): federation Federation is not effective if your schema requires huge functions or tables. You'll need to update your application logic to determine which database to read and write. Joining data from two databases is more complex with a server link. Federation adds more hardware and additional complexity.

Why should you structure your solution by components?

For medium-sized apps and above, monoliths are really bad: having one big piece of software with many dependencies is just hard to reason about and often leads to spaghetti code. Even smart architects, those who are skilled enough to tame the beast and 'modularize' it, spend great mental effort on design, and each change requires carefully evaluating the impact on other dependent objects. The ultimate solution is to develop small software: divide the whole stack into self-contained components that don't share files with others, each constituting very few files (e.g. API, service, data access, test, etc.) so that it's very easy to reason about. Some may call this 'microservices' architecture; it's important to understand that microservices are not a spec which you must follow, but rather a set of principles. Structuring your solution by self-contained components is good (orders, users...); grouping your files by technical role is bad (i.e. controllers, models, helpers...).

requirement for shortURL

Functional: - URL -> short alias - access short alias -> redirect to URL - user can pick custom short link - link has expiration (default or user define) Non-functional: - highly available - low latency - shortlink is not guessable Extended: - Analytics - accessible REST APIs
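A common way to meet the short-alias requirement is base62-encoding a unique numeric id (a sketch; note that a plain auto-increment id makes links guessable, which violates the non-functional requirement above, so real designs add an offset or randomness):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62(n: int) -> str:
    # encode a numeric id as a short, URL-safe alias
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

# 6 base62 characters cover 62**6 ~= 56.8 billion distinct URLs
```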

What enhancements brings HTTP2 over HTTP?

HTTP/2 allows the server to "push" content. Also multiplexing, header compression, and prioritization of requests.

Hexagonal Architecture

The hexagonal approach is an evolution of the layered architecture. There is an inside component which holds the business logic, and then the outside layer, the ports and the adapters, which involve the databases, message queues, APIs, and so on.

NoSQL

Horizontal scaling. Different storage models, dynamic schema (good for rapid development). - Key-value: Redis, Dynamo - Document: CouchDB, MongoDB - Wide-column: Cassandra, HBase - Graph: Neo4j, InfiniteGraph NoSQL applications: - clickstream/log - leaderboard/scoring data - temp data: shopping cart - frequently accessed tables - metadata/lookup tables

What is horizontal scaling?

Horizontal scaling takes place by duplicating host machines as needed. The drawbacks are the challenges of distributed systems.

Horizontal vs. vertical scalability

Horizontal: buy more machines. Vertical: improve the machines.

Hybrid App for mobile

Hybrid apps are primarily built using open web-based technologies such as HTML5, CSS, and JavaScript. They run in a native container and communicate with the native OS via a wrapper or a middle layer.

What Is IP Address Affinity Technique For Load Balancing?

IP address affinity is another popular way to do load balancing. In this approach, the client IP address is associated with a server node, and all requests from that IP address are served by one server node. This approach is easy to implement, since the IP address is always available in an HTTP request and no additional settings need to be performed. This type of load balancing can be useful if your clients are likely to have cookies disabled. However, there is a downside: if many of your users are behind a NATed IP address, all of them will end up using the same server node, which may cause uneven load on your server nodes. NATed IP addresses are really common; in fact, anytime you are browsing from an office network, it's likely that you and all your coworkers are using the same NATed IP address.
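
A minimal sketch of the idea, assuming a fixed pool of three hypothetical nodes. A cryptographic hash keeps the mapping stable across processes (Python's built-in `hash()` is salted per run):

```python
# Hypothetical sketch of IP address affinity: each client IP is hashed
# to pick a fixed server node, so repeat requests land on the same node.
import hashlib

SERVERS = ["node-a", "node-b", "node-c"]  # assumed server pool

def pick_server(client_ip: str) -> str:
    # Hash the IP so the mapping is stable across requests and restarts.
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

# The same IP always maps to the same node. This is also the NAT
# downside: every client behind one NATed IP shares a single node.
assert pick_server("203.0.113.7") == pick_server("203.0.113.7")
```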

Why is it better to cache objects instead of database queries?

If you cache queries, when the object changes you'd need to find out which queries are affected by that change, which is not trivial.

Kappa Architecture

In Kappa architecture, all the data flows through a single data streaming pipeline, as opposed to the Lambda architecture, which has different data streaming layers that converge into one. Kappa contains only two layers: Speed, which is the stream processing layer, and Serving, which is the final layer.

Define active-active fail-over

In active-active, both servers are managing traffic, spreading the load between them. If the servers are public-facing, the DNS would need to know about the public IPs of both servers. If the servers are internal-facing, application logic would need to know about both servers. Active-active failover can also be referred to as master-master failover.

Describe the differences in scalability between SQL and NoSQL DBs.

In most situations, SQL databases are vertically scalable, which means that you can increase the load on a single server by increasing things like CPU, RAM or SSD. NoSQL databases, on the other hand, are horizontally scalable. This means that you handle more traffic by sharding, or adding more servers in your NoSQL database. NoSQL databases are the preferred choice for large or ever-changing data sets.

What is cap theorem?

In the presence of network partitions, you cannot achieve both consistency and availability. On the web we almost always choose availability.

What is availability?

In the context of a database cluster, Availability refers to the ability to always respond to queries ( read or write ) irrespective of nodes going down.

What is partition tolerance

In the context of a database cluster, cluster continues to function even if there is a "partition" (communications break) between two nodes (both nodes are up, but can't communicate).

What are the four layers of the TCP/IP suite?

LITA: Link layer Internet layer Transport layer Application layer

HTML5 Event Source API & Server Sent Events

Instead of the client polling for data, the server automatically pushes the data to the client whenever updates are available. The data flow is in one direction only: from the server to the client.

What is the CAP theorem?

It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: Consistency (all nodes see the same data at the same time) Availability (every request receives a response about whether it succeeded or failed) Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures). In other words, in the presence of a network partition, one has to choose between consistency and availability.

Types of NoSQL DBs. Describe them.

Key-Wide-Docu-Graph. (kiwi-Grdo) - Key-value stores are the simplest. Every item in the database is stored as an attribute name (or "key") together with its value. Riak, Voldemort, and Redis are the most well-known in this category. - Wide-column stores store data together as columns instead of rows and are optimized for queries over large datasets. The most popular are Cassandra and HBase. - Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents. MongoDB is the most popular of these databases. - Graph databases are used to store information about networks, such as social connections. Examples are Neo4J and HyperGraphDB.

Lambda Architecture

Lambda is a distributed data processing architecture that leverages both the batch & the real-time streaming data processing approaches to tackle the latency issues arising out of the batch processing approach. It joins the results from both approaches before presenting them to the end user. The architecture typically has three layers: the Batch layer, the Speed layer, and the Serving layer. The Batch layer deals with the results acquired via batch processing the data, the Speed layer gets data from the real-time streaming data processing, and the Serving layer combines the results obtained from both the Batch & the Speed layers.

Say some examples of design goals

Latency - Is this problem very latency sensitive (Or in other words, Are requests with high latency and a failing request, equally bad?). For example, search typeahead suggestions are useless if they take more than a second. Consistency - Does this problem require tight consistency? Or is it okay if things are eventually consistent? Availability - Does this problem require 100% availability?

Define Layer 4 LB

Layer 4 load balancers look at info at the transport layer to decide how to distribute requests. Generally, this involves the source, destination IP addresses, and ports in the header, but not the contents of the packet. Layer 4 load balancers forward network packets to and from the upstream server, performing Network Address Translation (NAT).

Define Layer 7 LB

Layer 7 load balancers look at the application layer to decide how to distribute requests. This can involve contents of the header, message, and cookies. Layer 7 load balancers terminate network traffic, read the message, make a load-balancing decision, then open a connection to the selected server. For example, a layer 7 load balancer can direct video traffic to servers that host videos while directing more sensitive user billing traffic to security-hardened servers. At the cost of flexibility, layer 4 load balancing requires less time and computing resources than layer 7, although the performance impact can be minimal on modern commodity hardware.

In what ways can Load balancers distribute traffic?

Load balancers can route traffic based on various metrics, including: Random Least loaded Session/cookies Round robin or weighted round robin Layer 4 Layer 7

Define load balancer

Load balancers distribute incoming client requests to computing resources such as application servers and databases. In each case, the load balancer returns the response from the computing resource to the appropriate client. Load balancers are effective at: - Preventing requests from going to unhealthy servers - Preventing overloading resources - Helping eliminate single points of failure Load balancers can be implemented with hardware (expensive) or with software such as HAProxy.

What Is Load Balancing?

Load balancing is a simple technique for distributing workloads across multiple machines or clusters. The most common and simplest load balancing algorithm is round robin, in which requests are distributed in circular order, ensuring all machines get an equal number of requests and no single machine is overloaded or underloaded. The purpose of load balancing is to: optimize resource usage (avoid overload and under-load of any machine), achieve maximum throughput, and minimize response time. The most common load balancing techniques in web-based applications are: round robin, session affinity (sticky sessions), and IP address affinity.
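
The round-robin and weighted round-robin dispatch described above can be sketched in a few lines (server names and weights are made up):

```python
# Illustrative round-robin dispatch: hand out servers in circular order.
from itertools import cycle

servers = ["s1", "s2", "s3"]
rr = cycle(servers)  # plain round robin

first_six = [next(rr) for _ in range(6)]
# → ['s1', 's2', 's3', 's1', 's2', 's3']

# Weighted round robin: repeat each server proportionally to its capacity,
# so "s1" (weight 3) receives three requests for every one "s2" gets.
weights = {"s1": 3, "s2": 1}
wrr = cycle([s for s, w in weights.items() for _ in range(w)])
```

Real load balancers track health checks and connection counts on top of this; the circular ordering itself is all round robin requires.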

Techniques to scale a relational database

Master-slave replication, master-master replication, federation (aka functional partitioning), sharding, denormalization, and SQL tuning (including DB partitioning).

Say some advantages of MySQL

Maturity: MySQL is an extremely established database, meaning that there's a huge community, extensive testing and quite a bit of stability. Compatibility: MySQL is available for all major platforms, including Linux, Windows, Mac, BSD and Solaris. It also has connectors to languages like Node.js, Ruby, C#, C++, Java, Perl, Python and PHP, meaning that it's not limited to SQL query language. Cost-effective: The database is open source and free. Replicable: The MySQL database can be replicated across multiple nodes, meaning that the workload can be reduced and the scalability and availability of the application can be increased. Sharding: While sharding cannot be done on most SQL databases, it can be done on MySQL servers. This is both cost-effective and good for business.

What Is Middle Tier Clustering?

Middle tier clustering is just a cluster that is used to serve the middle tier of an application. This is popular since many clients may use the middle tier, and it may also serve a lot of heavy load, which requires it to be highly available. Failure of the middle tier can cause multiple clients and systems to fail, so clustering at the middle tier of an application is one common approach. In general, any application with business logic that can be shared across multiple clients can use a middle tier cluster for high availability.

MySQL vs MongoDB. When to use?

MySQL is a strong choice for any business that will benefit from its pre-defined structure and set schemas. For example, applications that require multi-row transactions - like accounting systems or systems that monitor inventory - or that run on legacy systems will thrive with the MySQL structure. MongoDB, on the other hand, is a good choice for businesses that have rapid growth or databases with no clear schema definitions. More specifically, if you cannot define a schema for your database, if you find yourself denormalizing data schemas, or if your schema continues to change - as is often the case with mobile apps, real-time analytics, content management systems, etc.- MongoDB can be a strong choice for you.

Kubernetes

Nodes are the servers on which Kubernetes runs; they are organized in a cluster. A pod is one or more Docker containers that together provide a service; containers that belong to one pod run on one node. A replica set ensures that a certain number of instances of a pod run, which allows the load to be distributed across the pods and makes the system fail-safe: if a pod fails, a new pod is automatically started. Load balancing is ensured by Kubernetes by distributing the traffic for the IP address of a Kubernetes service to the individual pods at the IP level; this is transparent for callers and for the called microservice. DNS offers service discovery. Routing is covered by Kubernetes via the load balancer or node ports of the services; this is also transparent for the microservices. Resilience is offered by Kubernetes via the restarting of containers and load balancing.

Describe Optimistic vs. pessimistic locking

Optimistic concurrency control (OCC) is a concurrency control method applied to transactional systems such as relational database management systems and software transactional memory. OCC assumes that multiple transactions can frequently complete without interfering with each other. While running, transactions use data resources without acquiring locks on those resources. Before committing, each transaction verifies that no other transaction has modified the data it has read. If the check reveals conflicting modifications, the committing transaction rolls back and can be restarted. Pessimistic locking, in contrast, locks the resource for the whole transaction.

Data model/ DB design Instagram:

Photo: photoID, userID, storagePath, photoGeoLoc, userGeoLoc, creationDate User: userID, name, email, DOB, creationDate, lastLogin UserFollow: pair <UserID1, UserID2> Photo files -> S3 or HDFS User/Photo metadata -> RDBMS Relationship/follow -> wide-column store like Cassandra Use the capacity estimation to estimate storage for metadata and file storage. Shard based on photoID (since userID is not balanced).

Message Queue: Point to Point Model

Point to point communication is a pretty simple use case where the message from the producer is consumed by only one consumer.

Document Oriented Database

Popular: MongoDB, CouchDB, OrientDB, Google Cloud Datastore, Amazon Document DB semi-structured data, need a flexible schema which would change often Typical use cases: Real-time feeds Live sports apps Writing product catalogs Inventory management Storing user comments Web-based multiplayer games

Key Value Database

Popular: Redis, Hazelcast, Riak, Voldemort & Memcache. Super-fast data fetch. Typical use cases of a key value database are the following: Caching Persisting user state Persisting user sessions Managing real-time data Implementing queues Creating leaderboards in online games & web apps Implementing a pub-sub system

Projects where SQL/NoSQL is ideal

Projects where SQL is ideal: - logical related discrete data requirements which can be identified up-front - data integrity is essential - standards-based proven technology with good developer experience and support. Projects where NoSQL is ideal: - unrelated, indeterminate or evolving data requirements - simpler or looser project objectives, able to start coding immediately - speed and scalability is imperative.

Define Pull CDNs

Pull CDNs grab new content from your server when the first user requests the content. You leave the content on your server and rewrite URLs to point to the CDN. This results in a slower request until the content is cached on the CDN. A time-to-live (TTL) determines how long content is cached. Pull CDNs minimize storage space on the CDN, but can create redundant traffic if files expire and are pulled before they have actually changed. Sites with heavy traffic work well with pull CDNs, as traffic is spread out more evenly with only recently-requested content remaining on the CDN.
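
The pull-plus-TTL behavior can be sketched as a tiny cache in front of a stand-in origin function (the origin callable, paths, and TTL here are assumptions):

```python
# Sketch of pull-CDN behavior: fetch from the origin on the first
# request, then serve cached copies until the TTL expires.
import time

class PullCache:
    def __init__(self, origin, ttl=60.0, clock=time.monotonic):
        self.origin, self.ttl, self.clock = origin, ttl, clock
        self.store = {}  # path -> (content, fetched_at)

    def get(self, path):
        hit = self.store.get(path)
        if hit and self.clock() - hit[1] < self.ttl:
            return hit[0]                       # fresh: serve from cache
        content = self.origin(path)             # miss or stale: pull from origin
        self.store[path] = (content, self.clock())
        return content

calls = []  # record every origin fetch
cdn = PullCache(lambda p: calls.append(p) or f"body:{p}", ttl=60)
cdn.get("/logo.png")
cdn.get("/logo.png")
assert calls == ["/logo.png"]  # second request never reached the origin
```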

Define Push CDNs

Push CDNs receive new content whenever changes occur on your server. You take full responsibility for providing content, uploading directly to the CDN and rewriting URLs to point to the CDN. You can configure when content expires and when it is updated. Content is uploaded only when it is new or changed, minimizing traffic, but maximizing storage. Sites with a small amount of traffic or sites with content that isn't often updated work well with push CDNs. Content is placed on the CDNs once, instead of being re-pulled at regular intervals.

Types of CDN?

Push and Pull

ROCA (Resource-oriented Client Architecture)

ROCA splits into two parts: The server-side and the client-side architecture. The server-side consists of RESTful backends, serving human-readable content as well as services for machine-to-machine communication, either public or internal. The client-side focuses on a sustainable and maintainable usage of JavaScript and CSS, based on the principle of Progressive Enhancement The server adheres to the REST principles: All resources have an unambiguous URL. Links to web pages can be sent by e-mail and then accessed from any browser if the necessary authorizations are given. HTTP methods are used correctly. For example, GETs do not change data. The server is stateless.

SQL vs NoSQL

Reasons for SQL: Structured data Strict schema Relational data Need for complex joins Transactions Clear patterns for scaling More established: developers, community, code, tools, etc Lookups by index are very fast - Data fits in a single machine. - Not write-heavy. Reasons for NoSQL: Semi-structured data Dynamic or flexible schema Non relational data No need for complex joins Store many TB (or PB) of data Very data intensive workload Very high throughput for IOPS Sample data well-suited for NoSQL: Rapid ingest of clickstream and log data Leaderboard or scoring data Temporary data, such as a shopping cart Frequently accessed ('hot') tables Metadata/lookup tables

What is replication?

Replication refers to frequently copying the data across multiple machines. Post replication, multiple copies of the data exist across machines. This might help in case one or more of the machines die due to some failure.

REST API

Representational State Transfer. A REST API acts as a gateway, a single entry point into the system. It takes advantage of the HTTP methods to establish communication, enables servers to cache the response, and is a stateless process.

Describe the TCP/IP link layer

The link layer defines the networking methods within the scope of the local network link on which hosts communicate without intervening routers. This layer includes the protocols used to describe the local network topology and the interfaces needed to effect transmission of Internet layer datagrams to next-neighbor hosts.

What is SOLID?

S.O.L.I.D is an acronym for the first five object-oriented design (OOD) principles by Robert C. Martin. S - Single-responsibility principle. A class should have one and only one reason to change, meaning that a class should have only one job. O - Open-closed principle. Objects or entities should be open for extension, but closed for modification. L - Liskov substitution principle. Let q(x) be a property provable about objects x of type T. Then q(y) should be provable for objects y of type S where S is a subtype of T. I - Interface segregation principle. A client should never be forced to implement an interface that it doesn't use, or clients shouldn't be forced to depend on methods they do not use. D - Dependency inversion principle. Entities must depend on abstractions, not on concretions. It states that the high level module must not depend on the low level module; both should depend on abstractions.

Describe the differences in structure between SQL and NoSQL DBs, and when are SQL DBs better.

SQL databases are table-based, while NoSQL databases are either document-based, key-value pairs, graph databases or wide-column stores. This makes relational SQL databases a better option for applications that require multi-row transactions - such as an accounting system - or for legacy systems that were built for a relational structure.

What Is Scalability?

Scalability is the ability of a system, network, or process to handle a growing amount of load by adding more resources. Resources can be added in two ways: Scaling up - adding more resources to the existing nodes, for example more RAM, storage, or processing power. Scaling out - adding more nodes to support more users. Either approach can be used to scale an application; however, the cost of adding resources (per user) may change as the volume increases. Adding resources should increase the application's ability to take more load in proportion to the resources added. An ideal application should be able to serve a high level of load with few resources; in practice, a linearly scalable system may be the best option achievable. Poorly designed applications may have a really high cost of scaling up/out, since they require more resources per user as the load increases.

How to speed up a noSQL DB?

Sharding (it reduces the number of index entries per DB node)! Also denormalization (removing references to one table in another by embedding the referred entries of the first table in the second). Also indexing, and using statistical methods provided by the DB (like checking how many documents are scanned for a certain query). Also a memory cache. Finally, there can be vertical scaling, like enhancing the memory or the processor of the DB machine.

Netflix stack

Service discovery is offered by Eureka. Eureka focuses on Java with the Java client. Client-side caching is very fast and resilient Resilience: Hystrix implements timeout, fail fast, bulkhead (a separate thread pool can be set up for each called microservice), circuit breaker. Load balancing: Ribbon implements client-side load balancing, avoids single points of failure or bottlenecks. Ribbon relies on Eureka for service discovery but can also use Consul. Routing: Zuul's dynamic filters are very flexible. Zuul ensures maximum flexibility. A reverse proxy might be the safer option

What is DB sharding? Say advantages and disadvantages.

Sharding distributes data across different databases such that each database can only manage a subset of the data. Taking a users database as an example, as the number of users increases, more shards are added to the cluster. Similar to the advantages of federation, sharding results in less read and write traffic, less replication, and more cache hits. Index size is also reduced, which generally improves performance with faster queries. If one shard goes down, the other shards are still operational, although you'll want to add some form of replication to avoid data loss. Like federation, there is no single central master serializing writes, allowing you to write in parallel with increased throughput. Common ways to shard a table of users is either through the user's last name initial or the user's geographic location. Disadvantages of sharding: You'll need to update your application logic to work with shards, which could result in complex SQL queries. Data distribution can become lopsided in a shard. For example, a set of power users on a shard could result in increased load to that shard compared to others. Rebalancing adds additional complexity. A sharding function based on consistent hashing can reduce the amount of transferred data. Joining data from multiple shards is more complex. Sharding adds more hardware and additional complexity.

What is sharding?

Sharding is a type of database partitioning that separates very large databases into smaller, faster, more easily managed parts called data shards. Sharding makes a database system highly scalable. The total number of rows in each table in each database is reduced, since the tables are divided and distributed across multiple servers. This reduces the index size, which generally means improved search performance. The most common approach for creating shards is consistent hashing of a unique id in the application (e.g. user id). The downsides of sharding: It requires the application to be aware of the data location. Any addition or deletion of nodes from the system requires some rebalancing. If you require a lot of cross-node join queries, performance will be really bad; therefore, knowing how the data will be queried becomes really important. A wrong sharding logic may result in worse performance, so make sure you shard based on the application's needs.
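
The consistent-hashing approach mentioned above might look like this sketch (shard names and the virtual-node count are assumptions); the point is that adding or removing a node only remaps the keys between it and its ring neighbor:

```python
# Illustrative consistent-hashing ring for sharding by a unique id.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=100):
        # vnodes: virtual nodes per physical node, to even out the ring.
        self.ring = sorted(
            (self._hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def shard_for(self, key):
        # Walk clockwise to the first node at or after the key's hash.
        i = bisect.bisect(self.keys, self._hash(key)) % len(self.keys)
        return self.ring[i][1]

ring = HashRing(["shard-1", "shard-2", "shard-3"])
```

Usage: `ring.shard_for("user:42")` always returns the same shard for the same id, which is exactly the "application must know the data location" property the card describes.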

How to scale a noSQL DB

Sharding.

In NoSQL DBs, what does Soft State mean?

Soft state: The state of the system could change over time, so even during times without input there may be changes going on due to 'eventual consistency,' thus the state of the system is always 'soft.'

Define reverse proxy

Source: Wikipedia A reverse proxy is a web server that centralizes internal services and provides unified interfaces to the public. Requests from clients are forwarded to a server that can fulfill it before the reverse proxy returns the server's response to the client. Additional benefits include: Increased security - Hide information about backend servers, blacklist IPs, limit number of connections per client Increased scalability and flexibility - Clients only see the reverse proxy's IP, allowing you to scale servers or change their configuration SSL termination - Decrypt incoming requests and encrypt server responses so backend servers do not have to perform these potentially expensive operations; removes the need to install X.509 certificates on each server Compression - Compress server responses Caching - Return the response for cached requests Static content - Serve static content directly (HTML/CSS/JS, photos, videos, etc.)

What are benchmarking and profiling?

Strategies for enhancing SQL DBs. Benchmark - Simulate high-load situations with tools such as ab. Profile - Enable tools such as the slow query log to help track performance issues.

Streaming Over HTTP

Stream large data over HTTP by breaking it into smaller chunks. Possible with HTML5 & a JavaScript Stream API.

What is TCP?

TCP is a connection-oriented protocol that addresses numerous reliability issues in providing a reliable byte stream: data arrives in-order data has minimal error (i.e., correctness) duplicate data is discarded lost or discarded packets are resent includes traffic congestion control

What does the Internet Protocol (IP) do?

The Internet Protocol performs two basic functions: Host addressing and identification: This is accomplished with a hierarchical IP addressing system. Packet routing: This is the basic task of sending packets of data (datagrams) from source to destination by forwarding them to the next network router closer to the final destination.

What is MD5

The MD5 algorithm is a widely used hash function producing a 128-bit hash value.

What are the Twelve-Factor App principles?

The Twelve-Factor App methodology is a methodology for building software-as-a-service applications. These best practices are designed to enable applications to be built with portability and resilience when deployed to the web. Codebase - There should be exactly one codebase for a deployed service, with the codebase being used for many deployments. Dependencies - All dependencies should be declared, with no implicit reliance on system tools or libraries. Config - Configuration that varies between deployments should be stored in the environment. Backing services - All backing services are treated as attached resources, attached and detached by the execution environment. Build, release, run - The delivery pipeline should strictly consist of build, release, run. Processes - Applications should be deployed as one or more stateless processes with persisted data stored on a backing service. Port binding - Self-contained services should make themselves available to other services by specified ports. Concurrency - Concurrency is advocated by scaling individual processes. Disposability - Fast startup and shutdown are advocated for a more robust and resilient system. Dev/Prod parity - All environments should be as similar as possible. Logs - Applications should produce logs as event streams and leave the execution environment to aggregate. Admin Processes - Any needed admin tasks should be kept in source control and packaged with the application.
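
The Config factor in practice, sketched with assumed variable names: settings that vary between deployments come from the environment, with a development fallback in code.

```python
# Twelve-factor "Config": read deploy-specific settings from the
# environment rather than from the codebase. Names are illustrative.
import os

DATABASE_URL = os.environ.get("DATABASE_URL", "sqlite:///dev.db")  # dev fallback
DEBUG = os.environ.get("DEBUG", "0") == "1"  # off unless explicitly enabled
```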

Describe the TCP/IP application layer

The application layer is the scope within which applications create user data and communicate this data to other applications on another or the same host. This is the layer in which all higher level protocols, such as SMTP, FTP, SSH, HTTP, operate. Processes are addressed via ports which essentially represent services.

HTTP PULL

The client pulls the data from the server whenever it requires it. Not ideal & a waste of resources; excessive pulls by the clients have the potential to bring down the server. Two ways to pull: HTTP GET - request to the server manually by triggering an event, like clicking a button. AJAX (Asynchronous JavaScript & XML) - fetch the updated data from the server by automatically sending requests over and over at stipulated intervals.

In SQL DBs, what does ACID mean?

The four characteristics of database transactions: Atomicity, Consistency, Isolation and Durability. Atomicity: each transaction either happens or doesn't happen. Consistency: Databases are always in a consistent state. Isolation: Each transaction should be independent of each other. The effect of several transactions should be them running independently. Durability: The result of transactions should be permanent.
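
Atomicity can be demonstrated with Python's built-in sqlite3: a simulated crash mid-transfer rolls back the partial write (the table and values are made up for illustration):

```python
# Sketch of atomicity: either both rows of the transfer commit, or neither.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
db.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
db.commit()

try:
    with db:  # commits on success, rolls back the transaction on error
        db.execute("UPDATE accounts SET balance = balance - 50 WHERE name='alice'")
        raise RuntimeError("crash mid-transfer")  # simulate a failure here
        db.execute("UPDATE accounts SET balance = balance + 50 WHERE name='bob'")
except RuntimeError:
    pass

# Atomicity: the partial debit was rolled back, alice still has 100.
(balance,) = db.execute("SELECT balance FROM accounts WHERE name='alice'").fetchone()
assert balance == 100
```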

Describe the TCP/IP internet layer

The internet layer exchanges datagrams across network boundaries. It provides a uniform networking interface that hides the actual topology (layout) of the underlying network connections. This layer defines the addressing and routing structures used for the TCP/IP protocol suite. The primary protocol in this scope is the Internet Protocol, which defines IP addresses. Its function in routing is to transport datagrams to the next IP router that has the connectivity to a network closer to the final data destination.

Define master-slave replication, and mention one particular disadvantage of this method of replication.

The master serves reads and writes, replicating writes to one or more slaves, which serve only reads. Slaves can also replicate to additional slaves in a tree-like fashion. If the master goes offline, the system can continue to operate in read-only mode until a slave is promoted to a master or a new master is provisioned. Disadvantage: Additional logic is needed to promote a slave to a master.

Describe the TCP/IP transport layer

The transport layer performs host-to-host communications on either the same or different hosts and on either the local network or remote networks separated by routers. It provides a channel for the communication needs of applications. Protocols: TCP and UDP.

two types of clients

A thin client is a client which holds just the user interface of the application; it has no business logic of any sort. In contrast, a thick client holds all or some part of the business logic.

Describe Write back cache

This is a caching system where the write is done directly to the caching layer, and the write is confirmed as soon as the write to the cache completes. The cache then asynchronously syncs this write to the DB. This leads to really quick write latency and high write throughput. But, as is the case with any non-persistent / in-memory write, we stand the risk of losing the data in case the caching layer dies. We can improve our odds by having more than one replica acknowledge the write (so that we don't lose data if just one of the replicas dies).

Describe Write around cache

This is a caching system where the write goes directly to the DB. The cache system reads the information from the DB in case of a miss. While this ensures a lower write load on the cache and faster writes, it can lead to higher read latency for applications which write and quickly re-read the information.

Describe Write through cache

This is a caching system where writes go through the cache, and a write is confirmed as a success only if the writes to BOTH the DB and the cache succeed. This is really useful for applications which write and quickly re-read the information. However, write latency will be higher in this case as there are writes to 2 separate systems.
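
A toy write-through cache, with plain dicts standing in for the cache and the DB (in a real system the two writes would need their own failure handling):

```python
# Minimal write-through sketch: a write succeeds only after both the
# cache and the backing "DB" (a dict here) have been updated.
class WriteThroughCache:
    def __init__(self, db):
        self.db = db      # stand-in for the database
        self.cache = {}   # stand-in for the caching layer

    def write(self, key, value):
        self.db[key] = value     # write to the DB...
        self.cache[key] = value  # ...and to the cache; both must succeed

    def read(self, key):
        if key not in self.cache:            # cache miss
            self.cache[key] = self.db[key]   # populate from the DB
        return self.cache[key]

db = {}
c = WriteThroughCache(db)
c.write("user:1", "ada")
```

After the write, a re-read hits the cache while the DB already holds the same value, which is the fast write-then-re-read property the card describes.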

Data model/ DB design Twitter:

Tweet: tweetID, content, tweetLoc, userLoc, creationDate, numFav User: userID, name, email, DOB, creationDate, lastLogin Follow: <UserID1, UserID2> Fav: <tweetID, UserID>, creationDate Sharding: prefer tweetID over userID (user data is not balanced). However, with plain tweetIDs we would have to scan every shard to find new tweets or tweets from a user, so combine tweetID with creation time. Use epoch time: seconds + an incrementing sequence.
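
The epoch-seconds-plus-sequence ID scheme can be sketched as follows (the bit split is an assumption; this is the same idea behind Snowflake-style IDs):

```python
# Time-ordered tweet IDs: high bits are epoch seconds, low bits an
# incrementing sequence, so IDs sort by creation time.
import time
from itertools import count

SEQ_BITS = 20  # room for ~1M IDs per second (an assumed split)
_seq = count()

def next_tweet_id(now=None):
    seconds = int(now if now is not None else time.time())
    return (seconds << SEQ_BITS) | (next(_seq) & ((1 << SEQ_BITS) - 1))

a = next_tweet_id(now=1_700_000_000)
b = next_tweet_id(now=1_700_000_001)
assert a < b  # later creation time => larger ID
```

Because recent IDs share a time prefix, a shard can find "new tweets" with a simple range scan instead of scanning the whole table.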

What is BASE?

Typically for NoSQL: Basic Availability The database appears to work most of the time. Soft-state Stores don't have to be write-consistent, nor do different replicas have to be mutually consistent all the time. Eventual consistency Stores exhibit consistency at some later point (e.g., lazily at read time).

DB design for shortURL

URL: hash, originalURL, creationDate, expDate User: name, email, creationDate, lastLogin No relationships are needed, so a key-value store like Cassandra or DynamoDB is fine.
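
One common way to produce the hash key (not prescribed by the card) is to hash the original URL and base62-encode part of the digest; a sketch, with the 6-character length as an assumption:

```python
# Illustrative short-key generation: hash the URL, base62-encode a
# slice of the digest to get a short, URL-safe alias.
import hashlib
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 chars

def short_key(url: str, length: int = 6) -> str:
    n = int(hashlib.md5(url.encode()).hexdigest(), 16)
    out = []
    while n and len(out) < length:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(out)

key = short_key("https://example.com/some/long/path")
```

Note this makes the key deterministic and guessable from the URL; a production design would mix in randomness or a counter and handle collisions.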

Techniques to speed up a relational DB

Usually vertical scaling: adding indexes / DB tuning (like DB partitioning) / removing logs / upgrading memory, processor power, bandwidth... See https://www.catswhocode.com/blog/10-sql-tips-to-speed-up-your-database . Also, use a memory cache (like Amazon ElastiCache or Memcached); MySQL + Memcached is a common combination. And also a CDN.

What is vertical scaling?

Vertical scaling takes place through an increase in the specifications of an individual resource, such as upgrading a server with a larger hard drive, more memory, or a faster CPU.

Define the 3 consistency patterns

Weak consistency: After a write, reads may or may not see it. A best-effort approach is taken. Seen in systems such as Memcached. Works well in real-time use cases such as VoIP, video chat, and realtime multiplayer games. For example, if you are on a phone call and lose reception for a few seconds, when you regain connection you do not hear what was spoken during the connection loss.
Eventual consistency: After a write, reads will eventually see it (typically within milliseconds). Data is replicated asynchronously. Seen in systems such as DNS and email. Works well in highly available systems.
Strong consistency: After a write, reads will see it. Data is replicated synchronously. Seen in file systems and RDBMSes. Works well in systems that need transactions.

What is the reason to use NoSQL database?

When availability is most important.

client

Window to our application. Popular open-source technologies for writing the web-based user interface are ReactJS, AngularJS, VueJS, jQuery etc. All these libraries use JavaScript.

Define active-passive fail-over

With active-passive fail-over, heartbeats are sent between the active and the passive server on standby. If the heartbeat is interrupted, the passive server takes over the active's IP address and resumes service. The length of downtime is determined by whether the passive server is already running in 'hot' standby or whether it needs to start up from 'cold' standby. Only the active server handles traffic. Active-passive failover can also be referred to as master-slave failover.

What are the 3 types of caching systems?

Write through cache: writes go through the cache, and a write is confirmed as a success only if the writes to BOTH the DB and the cache succeed. Really useful for applications that write and quickly re-read the information; however, write latency is higher, as there are writes to 2 separate systems.
Write around cache: the write goes directly to the DB; the cache reads the information from the DB in case of a miss. This ensures a lower write load on the cache and faster writes, but can lead to higher read latency for applications that write and quickly re-read the information.
Write back cache: the write is done directly to the caching layer and is confirmed as soon as the write to the cache completes; the cache then asynchronously syncs the write to the DB. This gives very low write latency and high write throughput, but, as with any non-persistent / in-memory write, we risk losing the data in case the caching layer dies. We can improve our odds by having more than one replica acknowledge the write (so that we don't lose data if just one replica dies).

What are the strategies for caching?

Write-through, write-around, write-back. https://www.interviewbit.com/problems/design-cache/

ESI (Edge Side Includes)

XML-based markup language that provides a means to assemble resources in HTTP clients. Unlike other in-markup languages, ESI is designed to leverage client tools like caches to improve perceived end-user performance, reduce processing overhead on the origin server, and enhance availability. ESI is primarily intended for processing on surrogates (intermediaries that operate on behalf of the origin server, also known as "reverse proxies"). Varnish is a web cache that can be used as an ESI implementation.

PasteBin APIs

addPaste(APIKey, pasteData, alias=None, pasteName=None, expireDate=None)
getPaste(APIKey, pasteKey)
deletePaste(APIKey, pasteKey)

load balancing algorithms

after the health check, the load balancer would choose by:
- least connections
- least response time
- least bandwidth (least amount of traffic)
- (weighted) round robin
- IP hash
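Two of these algorithms are simple enough to sketch; this is an illustrative in-process model (real load balancers track connections out-of-process):

```python
import itertools

class RoundRobinBalancer:
    """Hand out servers in a fixed rotation."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)
    def pick(self):
        return next(self._cycle)

class LeastConnectionsBalancer:
    """Send each request to the server with the fewest open connections."""
    def __init__(self, servers):
        self.connections = {s: 0 for s in servers}
    def pick(self):
        server = min(self.connections, key=self.connections.get)
        self.connections[server] += 1
        return server
    def release(self, server):
        """Call when a request finishes to free up the connection slot."""
        self.connections[server] -= 1
```

Round robin is stateless per request; least-connections adapts to servers handling slow requests, at the cost of tracking state.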

Progressive Web Apps

apps, with the look and feel of native apps, that can run in the browser of both mobile and desktop devices & can also be installed, from the browser, on the device of the user

Capacity Estimation PasteBin

assuming 1M/day, each 10 MB, 5:1 read/write, 5y storage Find: traffic (QPS) storage (DB) bandwidth (network) memory (cache)

Capacity Estimation shortURL

assuming 500M new URL/month, each 500 bytes, 100:1 read/write, 5y storage Find: traffic (QPS) storage (DB) bandwidth (network) memory (cache)
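A back-of-the-envelope pass on the shortURL numbers above, assuming 30-day months and the common 80-20 rule (cache 20% of daily reads) — the exact constants are assumptions:

```python
# Inputs from the estimate above
new_urls_per_month = 500_000_000
url_size_bytes = 500
read_write_ratio = 100
seconds_per_month = 30 * 24 * 3600           # ~2.6M seconds

# Traffic (QPS)
write_qps = new_urls_per_month / seconds_per_month   # ~193 writes/s
read_qps = write_qps * read_write_ratio              # ~19.3k reads/s

# Storage (DB) over 5 years
storage_5y = new_urls_per_month * 12 * 5 * url_size_bytes   # 15 TB

# Memory (cache): 80-20 rule over one day of reads
daily_reads = read_qps * 24 * 3600
cache_memory = 0.2 * daily_reads * url_size_bytes    # ~170 GB
```

Bandwidth follows directly: write bandwidth is roughly write_qps × 500 B ≈ 100 KB/s, read bandwidth ~100× that.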

Consul

powerful service discovery technology. Consul is very flexible; thanks to its DNS interface and Consul Template, it can be used with many technologies, and it is more transparent to use than Eureka. Apache httpd can be used as a load balancer and router for HTTP requests; load balancing can also be implemented with Ribbon.

separate ways pattern

bounded contexts are not related at the software level although a relation would be conceivable

basic design & algorithms of shortURL

client <-> (App Server + App DB) <-> (key generation service + key DB)
- key generation service: length limit; MD5/SHA256 of the originalURL + a seqNo (so the same original URL from different users doesn't map to the same shortURL)
- DB partitioning: hash-based is better than URL-based
- cache: LRU (linked hashmap)
- load balancers: client-appServer, appServer-DB, appServer-cache
- cleanup of expired URLs: use a daemon process (a cron job is less frequent, e.g. once/hour)
- optional: security -> needs an access/permission service

basic design & algorithms of PasteBin

client <-> (App Server + metadata DB + object storage (ex: Amazon S3))
- the storage layer is divided into metadata and object storage
- can have a separate key generation service
- the rest is similar to shortURL

shared kernel pattern

common core that is shared by multiple bounded contexts. an anti-pattern for microservices systems

WebHooks

consumers register an HTTP endpoint with the service, along with a unique API key. It's like leaving a phone number: "call me on this number when an event occurs, so I don't have to keep calling you to check."

shortURL APIs

createURL(APIKey, URL, alias=None, userName=None, expireDate=None)
deleteURL(APIKey, alias)

Component design and algorithms dropbox

Design consideration:
- break files into chunks: easier to transfer, resume failed operations, transfer only updated chunks/diffs, save bandwidth
High level: clients -> block server / metadata / sync <-> storage / metadata (take a look at the charts in Grokking)
Client: break the client into 4 parts: chunker / indexer / watcher / internal DB. client <-> block/cloud storage; client <-> synchronization service
Metadata: keep a copy with the client; maintain versioning. Keep info about chunks, files, users, devices, workspaces (folders)
Synchronization service: notifies clients of any change to the files and syncs the local DB to the cloud. Based on its offline period, a client polls for updates from the server. To handle the high volume of push/pull requests, we add a messaging queue service with a request queue & response queues: the request queue is global, while response queues are specific to individual clients.
Dedup: post-process dedup (after upload) wastes bandwidth -> use in-line dedup (hash right at the local device)
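The chunker plus in-line dedup idea can be sketched as below; the 4 MB chunk size is an assumption, and `stored` stands in for the server-side chunk index:

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB chunks (size is an assumption)

def chunk_and_hash(data, chunk_size=CHUNK_SIZE):
    """Split a file's bytes into fixed-size chunks and fingerprint each:
    hashing happens locally, before anything is sent (in-line dedup)."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return [(hashlib.sha256(c).hexdigest(), c) for c in chunks]

def upload(data, stored, chunk_size=CHUNK_SIZE):
    """Upload only chunks the server hasn't seen; return how many were sent.
    `stored` models the server's hash -> chunk index."""
    uploaded = 0
    for digest, chunk in chunk_and_hash(data, chunk_size):
        if digest not in stored:
            stored[digest] = chunk
            uploaded += 1
    return uploaded
```

Re-uploading an unchanged file sends zero chunks, which is exactly the bandwidth saving the design consideration is after.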

Write-Back

write directly to the cache instead of the database; the cache then writes the data to the database after some delay, as per the business logic.

Caching Strategies: Write-Through

each & every piece of information written to the database goes through the cache: before the data is written to the DB, the cache is updated with it. This works well for write-heavy workloads like massive multiplayer online games.

Shared Nothing Architecture

eliminates all single points of failure. Every module has its own memory and its own disk, so even if several modules in the system go down, the other modules stay unaffected. It also helps with scalability and performance.

requirement for Instagram (photo sharing)

functional requirements:
- upload/download/view photos
- search
- follow another user
- news feed
non-functional:
- highly available
- highly reliable
- low-latency reads
- consistency can take a hit: if the user doesn't see new photos/videos right away, it's fine
- read-heavy

requirement for facebook messenger

functional:
- one-on-one conversation
- online/offline status
- chat history
non-functional:
- real-time, low latency
- highly consistent: between devices
- high availability
extended:
- group chat
- push notification

requirement for twitter

functional:
- post tweets
- follow user
- mark favorite
- display timeline
non-functional:
- highly available
- low latency
- consistency can take a hit
extended:
- search tweets
- reply to a tweet
- timeline: trending, suggestion, moments
- tag
- notification

requirement for Youtube/Netflix

functional:
- upload
- share/view
- search title
- stats
non-functional:
- reliable
- available
- minimal lag (low latency)

requirement for PasteBin

functional:
- upload or paste data (only text) <-> unique URL
- pick custom alias
- timespan/expiration
non-functional:
- reliable
- available
- minimum latency
- should not be guessable
extended requirements:
- analytics
- REST API

requirement for dropbox

functional:
- upload/download files from any device
- share
- auto-sync
- large files
- ACID
- offline editing
non-functional:
- huge read/write volume
- availability, reliability, durability
extended: snapshots of data so the user can go back

Kafka

High throughput and low latency; can save records permanently. Kafka also has stream-processing capabilities. Kafka organizes data in records, which is what other MOMs call "messages". Partitions allow strong guarantees concerning the order of records, as well as parallel processing. Supports exactly-once semantics, that is, a guaranteed one-time delivery.

horizontal vs vertical scaling

horizontal: adding more servers. ex: Cassandra/MongoDB vertical: upgrade 1 server. ex: MySQL

What are the steps for solving a systems design question? (HiredInTech)

http://old.hiredintech.com/system-design/the-system-design-process/ - Scope the problem: Don't make assumptions; Ask questions; Understand the constraints and use cases. - Sketch up an abstract design that illustrates the basic components of the system and the relationships between them. - Think about the bottlenecks these components face when the system scales. - Address these bottlenecks by using the fundamental principles of scalable system design.

Component design and algorithms Instagram

Images:
- client/user upload <-> upload service <-> image storage + metadata
- client/user download/view/search <-> download/view service <-> image storage + metadata
News feed:
- pre-generate/cache news feeds in a separate table
- use a hybrid push/pull model
- need to sort photos by creationTime -> epoch time. The photoID has 2 parts: epoch time + sequence -> shard based on this
Caching: Memcached, LRU, 80-20 rule

Design Pastebin.com (or Bit.ly)

https://github.com/donnemartin/system-design-primer/blob/master/solutions/system_design/pastebin/README.md

Specify Twitter's System design

https://github.com/donnemartin/system-design-primer/blob/master/solutions/system_design/twitter/README.md

Steps to approach a problem ( InterviewBit)

https://www.interviewbit.com/courses/system-design/topics/storage-scalability/
Feature expectations (first 2 mins): There is no wrong design. There are just good and bad designs, and the same solution can be a good design for one use case and a bad design for another. It is hence extremely important to get a very clear understanding of what the requirements for the question are.
Estimations (2-5 mins): Next, estimate the scale required for the system. The goal of this step is to understand the level of sharding required (if any) and to zero in on the design goals for the system. For example, if the total data required for the system fits on a single machine, we might not need sharding and the complications that go with a distributed system design; or if the most frequently used data fits on a single machine, caching could be done on a single machine.
Design goals (1 min): Figure out the most important goals for the system. There may be latency-sensitive systems, in which case a solution that does not account for latency might be a bad design.
Skeleton of the design (4-5 mins): 30-40 mins is not enough time to discuss every single component in detail. A good strategy is to discuss at a very high level with the interviewer and deep-dive into components as the interviewer enquires.
Deep dive (20-30 mins): This is an extension of the previous section.

REST vs messaging

https://www.slideshare.net/ewolff/rest-vs-messaging-for-microservices

Describe consistent hashing

https://www.toptal.com/big-data/consistent-hashing https://www.youtube.com/watch?v=zaRkONvyGr8&list=PLMCXHnjXnTnvo6alSjVkgxV-VH6EPyvoX&index=4
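A minimal consistent-hash ring sketch with virtual nodes (the replica count and MD5 hash are illustrative assumptions):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Keys and nodes hash onto the same ring; a key belongs to the first
    node clockwise from it, so removing a node only remaps its own keys."""
    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas   # virtual nodes per physical node
        self._keys = []            # sorted hash positions on the ring
        self._ring = {}            # position -> node
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            pos = self._hash(f"{node}:{i}")
            bisect.insort(self._keys, pos)
            self._ring[pos] = node

    def remove(self, node):
        for i in range(self.replicas):
            pos = self._hash(f"{node}:{i}")
            self._keys.remove(pos)
            del self._ring[pos]

    def get(self, key):
        pos = self._hash(key)
        idx = bisect.bisect(self._keys, pos) % len(self._keys)  # wrap around
        return self._ring[self._keys[idx]]
```

Virtual nodes (the `replicas` loop) spread each physical node over many ring positions, which evens out the key distribution.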

Event-driven architecture

is all about processing asynchronous data streams. In an event-driven system everything is treated as a stream.

tier

physical separation at the component level, not the code level (in contrast to a layer, which is a logical separation at the code level)

Caching Strategies: Cache Aside

most common strategy. The data is lazy-loaded into the cache. When the user sends a request for particular data, the system first looks for it in the cache. If present, it is simply returned from the cache. If not, the data is fetched from the database, the cache is updated, and the data is returned to the user. Works best with read-heavy workloads.
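A sketch of cache-aside combined with LRU eviction (see the eviction policies card); `db_fetch` is a hypothetical stand-in for the real database read:

```python
from collections import OrderedDict

def db_fetch(key):
    """Stand-in for the real database read (hypothetical)."""
    return f"value-for-{key}"

class CacheAside:
    """Lazy-load on a miss; evict the least-recently-used entry when full."""
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.cache = OrderedDict()   # insertion order doubles as recency order

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)        # hit: mark as recently used
            return self.cache[key]
        value = db_fetch(key)                  # miss: go to the DB
        self.cache[key] = value                # update the cache
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict the LRU entry
        return value
```

The `OrderedDict` is the "linked hashmap" the shortURL design card mentions for LRU.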

open host service pattern

offers a generic interface with several services. frequently found at public APIs

proxy server types

open proxy: accessible by any internet user; an anonymous proxy hides IPs, a transparent proxy caches websites
reverse proxy: centralizes internal services and provides unified interfaces to the public (retrieves resources and returns them to the client as if they originated from the proxy server itself)

Web Socket

persistent, bi-directional, low-latency data flow. Have bi-directional data? Go ahead with Web Sockets. One more thing: after the initial HTTP handshake, the Web Socket protocol doesn't run over HTTP; it runs directly over TCP

Wide-column database

popular: Cassandra, HBase, Google Bigtable, ScyllaDB primarily used to handle massive amounts of data, technically called Big Data

Time Series Database

popular: InfluxDB, TimescaleDB, Prometheus. Data is generally ingested from IoT devices, self-driving vehicles, industrial sensors, social networks, stock market financial data, etc. Time-series data is primarily used for running analytics, drawing conclusions, and making future business decisions based on the results of the analytics

Graph Database

popular: Neo4J. Good for visualization; low latency. Ideal use cases of graph databases are building social, knowledge, and network graphs

HTTP PUSH

The server keeps pushing new updates to the client whenever they are available (by contrast, in the HTTP Pull mechanism clients use AJAX (Asynchronous JavaScript & XML) to send requests to the server). Push techniques:
- Ajax long polling
- Web Sockets
- HTML5 Event Source
- Message Queues
- Streaming over HTTP
The connection between the client and the server stays open with the help of heartbeat interceptors.

Caching Strategies: Read-Through

similar to Cache Aside: the data is also lazy-loaded into the cache. The subtle difference is that the cache always stays consistent with the database. You can also pre-load the cache with the information expected to be requested most by the users

customer/supplier pattern

The supplier is upstream and the customer is downstream. Ex: payment can become a customer of order processing (supplier <-> customer, ex: order <-> payment). When the bounded context payment does not obtain the necessary data from the bounded context order processing, products can be ordered but not paid for. Therefore, the customer/supplier pattern is an obvious choice.

Segmented P2P file transfer

A system hosts a large file. Other nodes in the network that need the file locate the system containing it, then download the file in chunks, re-hosting each downloaded chunk simultaneously, making it more available to other users

anti-corruption layer

the bounded context does not directly use the domain model of the other bounded context; instead, it contains a layer for decoupling its own domain model from the model of the other bounded context.

self-contained system

type of microservice that specifies elements of a macro architecture. An SCS includes logic, data, and a UI. SCSs divide the system into several independent web applications and focus on loose coupling. An SCS must have a UI.

Youtube/Netflix APIs

uploadVid(apiDevKey, videoTitle, videoDescription, tags[]=None, categoryID, defaultLanguage, recordingDetails, videoContent)
searchVid(apiDevKey, searchQuery, userLoc, resultSize, pageToken)
streamVid(apiDevKey, videoID, offset, codec, resolution)

Component design and algorithms FB messenger

users <-> chat server <-> storage
Chat server:
- push > pull: use HTTP long polling or web sockets
- delays would shuffle the order of messages -> need a timestamp and sort by time
Storage:
- no relationships, and we cannot afford delay -> use a wide-column store
User status (online/offline):
- user A logs in or interacts -> write a timestamp (lastLogin)
- any interaction from B to A pulls A's status
- can use a time-out
Cache: cache recent messages for the screen (mobile vs web)
Group chat: this needs a push model -> limit the # of users to avoid fan-out
Push notification: pub-sub service. Need a notification service to send push notifications when offline users come online
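The last-seen timestamp with a time-out can be sketched as below; the 60-second timeout is an assumption:

```python
import time

ONLINE_TIMEOUT = 60  # seconds without activity before a user shows offline (assumed)

class PresenceTracker:
    """Write a last-seen timestamp on every interaction; readers derive
    online/offline status instead of the server pushing it."""
    def __init__(self, timeout=ONLINE_TIMEOUT):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, user_id, now=None):
        """Record activity (login, message sent, long-poll renewal...)."""
        self.last_seen[user_id] = now if now is not None else time.time()

    def is_online(self, user_id, now=None):
        """Any interaction from B to A calls this to pull A's status."""
        now = now if now is not None else time.time()
        seen = self.last_seen.get(user_id)
        return seen is not None and (now - seen) <= self.timeout
```

Deriving status from a timestamp avoids a write on every disconnect, which clients can't reliably report anyway.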

Data model/ DB design Youtube/Netflix:

videoMetadata: videoID, title, description, size, thumbnail, uploader/user, likes, dislikes, views, creationDate
comment: commentID, <videoID, userID>, comment, creationDate
user: userID, name, email
Videos are stored in HDFS or GlusterFS; thumbnail info is stored in Bigtable.
Metadata sharding: userID is not good -> use videoID

