SDI
How can you optimize systems that have high read-to-write ratios?
1. Caching: keeping frequently accessed data in memory significantly reduces response time for reads, since memory access is much faster than disk access.
2. Load balancing: distributing read traffic across multiple replicas of the database reduces the load on any single server.
3. Database replication: replicating the database to multiple servers ensures there are enough resources available to handle a high volume of read requests. A master-slave or master-master replication setup can also separate read and write operations onto dedicated resources, such as read-only replicas for read-heavy workloads (see the sketch below).
4. CDN (content delivery network): a CDN distributes read traffic for static content across geographically dispersed servers, reducing the load on the origin servers.
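A minimal sketch of points 2 and 3, routing writes to the primary while spreading reads across replicas. The replica names and the route_query helper are made up for illustration; a real deployment would hold actual database connections.

```python
import itertools

# Hypothetical replica pool; in practice these would be database connections.
READ_REPLICAS = ["replica-1", "replica-2", "replica-3"]
PRIMARY = "primary"
_replica_cycle = itertools.cycle(READ_REPLICAS)

def route_query(sql):
    """Send writes to the primary and spread reads across read replicas."""
    is_write = sql.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE"))
    target = PRIMARY if is_write else next(_replica_cycle)
    print(f"executing on {target}: {sql}")
    return target

route_query("SELECT * FROM users WHERE id = 42")           # goes to a read replica
route_query("UPDATE users SET name = 'a' WHERE id = 42")   # goes to the primary
```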
How can you optimize systems that have high write-to-read ratios?
1. Sharding: spread write traffic across multiple servers by partitioning the data on a shard key and storing each partition on a separate server. This distributes the load so write operations can be executed quickly.
2. Write buffering: offload some write operations to a queue or buffer and process them asynchronously, so that write operations do not slow down read operations (see the sketch below).
3. Purpose-built storage: use a storage system optimized for the traffic pattern. NoSQL databases such as MongoDB or Cassandra are designed to handle high write loads and can scale horizontally to handle increased traffic. A master-slave or master-master replication setup can also help to separate read and write operations on a database.
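A minimal sketch of point 2, buffering writes in an in-process queue and draining them in the background. The persist helper is a stand-in; a production system would use a durable broker such as Kafka or RabbitMQ rather than an in-memory queue.

```python
import queue
import threading
import time

# Hypothetical write buffer: request handlers enqueue writes, a background
# worker drains them so the request path is not blocked on the database.
write_buffer = queue.Queue(maxsize=10_000)

def handle_request(record):
    write_buffer.put(record)       # enqueue and return immediately
    return "accepted"

def write_worker():
    while True:
        record = write_buffer.get()
        persist(record)            # single or batched writes to the database/shard
        write_buffer.task_done()

def persist(record):
    time.sleep(0.01)               # stand-in for the actual database write
    print("persisted", record)

threading.Thread(target=write_worker, daemon=True).start()
handle_request({"user_id": 1, "event": "click"})
write_buffer.join()                # wait for the async write to finish (demo only)
```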
What is a load balancer?
A load balancer evenly distributes incoming traffic among web servers that are defined in a load-balanced set.
• If server 1 goes offline, all the traffic will be routed to server 2. This prevents the website from going offline. We can also add a new healthy web server to the server pool to balance the load.
• If the website traffic grows rapidly and two servers are not enough to handle it, the load balancer handles this gracefully: you only need to add more servers to the web server pool, and the load balancer automatically starts sending requests to them.
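A toy sketch of the behavior described above. The per-server health flag is a simplification; real load balancers learn this from periodic health checks.

```python
import itertools

# Hypothetical web server pool: name -> healthy?
servers = {"server-1": True, "server-2": True}
_counter = itertools.count()

def pick_server():
    """Round-robin across healthy servers, skipping any that are offline."""
    healthy = [name for name, ok in servers.items() if ok]
    if not healthy:
        raise RuntimeError("no healthy servers available")
    return healthy[next(_counter) % len(healthy)]

servers["server-1"] = False      # server 1 goes offline
print(pick_server())             # all traffic now goes to server-2
servers["server-3"] = True       # add a server; it starts receiving traffic
print(pick_server())
```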
What are the challenges of rate limiting in a distributed environment?
1. Race condition: if two requests concurrently read the counter value before either of them writes it back, each increments the counter by one and writes it back without checking the other thread. For example, both requests might read a counter of 3 and each write back 4, when the correct value after both increments should be 5. Locks are the most obvious solution for race conditions, but locks significantly slow down the system. Two strategies are commonly used instead: a Lua script, or the sorted sets data structure in Redis (a sketch of the Lua approach follows below).
2. Synchronization issue: to support millions of users, one rate limiter server might not be enough to handle the traffic, and when multiple rate limiter servers are used, synchronization is required. Because the web tier is stateless, a client can send requests to different rate limiters; if no synchronization happens, rate limiter 1 does not contain any data about client 2, and the rate limiter cannot work properly. One possible solution is sticky sessions, so a client always sends traffic to the same rate limiter, but this is neither scalable nor flexible. A better approach is to use a centralized data store such as Redis.
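A minimal sketch of the Lua-script approach, assuming a local Redis instance and the redis-py client; the key naming scheme and the limits are made up. Because INCR and EXPIRE run inside one script, the read-modify-write is atomic and the race condition above cannot occur.

```python
import redis

r = redis.Redis()  # assumes a local Redis instance

# Increment the per-client counter and arm its expiry in a single atomic step.
LUA_INCR = """
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return current
"""

def is_allowed(client_id, limit=5, window_seconds=60):
    key = f"rate:{client_id}"                       # hypothetical key scheme
    count = r.eval(LUA_INCR, 1, key, window_seconds)
    return count <= limit
```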
What is partition tolerance?
A partition indicates a communication break between two nodes. Partition tolerance means the system continues to operate despite network partitions.
What is a Content delivery network (CDN)?
A CDN is a network of geographically dispersed servers used to deliver static content. CDN servers cache static content like images, videos, CSS, JavaScript files, etc. Here is how CDN works at the high-level: when a user visits a website, a CDN server closest to the user will deliver static content. Intuitively, the further users are from CDN servers, the slower the website loads. For example, if CDN servers are in San Francisco, users in Los Angeles will get content faster than users in Europe.
What is a cache?
A cache is a temporary storage area that stores the results of expensive operations or frequently accessed data in memory so that subsequent requests are served more quickly. The cache tier is a temporary data store layer, much faster than the database. The benefits of having a separate cache tier include better system performance, the ability to reduce database workloads, and the ability to scale the cache tier independently.
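A minimal cache-aside sketch, using an in-memory dict with a TTL as a stand-in for a real cache tier such as Redis or Memcached. The load_from_database helper is a hypothetical placeholder for the slow query.

```python
import time

_cache = {}            # key -> (value, expires_at)
TTL_SECONDS = 30

def get(key):
    entry = _cache.get(key)
    if entry and entry[1] > time.time():
        return entry[0]                            # cache hit: served from memory
    value = load_from_database(key)                # cache miss: fall back to the DB
    _cache[key] = (value, time.time() + TTL_SECONDS)
    return value

def load_from_database(key):
    return f"row-for-{key}"                        # placeholder for the slow DB query
```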
What is a message queue?
A message queue is a durable component, stored in memory, that supports asynchronous communication. It serves as a buffer and distributes asynchronous requests. The basic architecture of a message queue is simple. Input services, called producers/publishers, create messages, and publish them to a message queue. Other services or servers, called consumers/subscribers, connect to the queue, and perform actions defined by the messages. Decoupling makes the message queue a preferred architecture for building a scalable and reliable application. With the message queue, the producer can post a message to the queue when the consumer is unavailable to process it. The consumer can read messages from the queue even when the producer is unavailable.
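A toy producer/consumer sketch using Python's in-process queue to illustrate the decoupling; a production system would use a broker such as RabbitMQ or Kafka, and the message payload here is made up.

```python
import queue
import threading

message_queue = queue.Queue()

def producer():
    for i in range(3):
        message_queue.put({"job": "resize_photo", "photo_id": i})   # publish messages

def consumer():
    while True:
        message = message_queue.get()      # blocks until a message arrives
        print("processing", message)       # perform the action the message defines
        message_queue.task_done()

threading.Thread(target=consumer, daemon=True).start()
producer()                                 # the producer never waits on the consumer
message_queue.join()                       # demo only: wait for processing to finish
```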
what is a rate limiter and the benefits?
A rate limiter is used to control the rate of traffic sent by a client or a service. In the HTTP world, a rate limiter limits the number of client requests allowed to be sent over a specified period. If the API request count exceeds the threshold defined by the rate limiter, all the excess calls are blocked: the rate limiter returns status code 429 (Too Many Requests), and the blocked request is either dropped or forwarded to a message queue to be processed later. Benefits:
• Prevent resource starvation caused by Denial of Service (DoS) attacks. A rate limiter prevents DoS attacks, intentional or unintentional, by blocking the excess calls.
• Reduce cost. Limiting excess requests means fewer servers and more resources allocated to high-priority APIs. Rate limiting is extremely important for companies that use paid third-party APIs. For example, if you are charged on a per-call basis for external APIs such as checking credit, making a payment, or retrieving health records, limiting the number of calls is essential to reduce costs.
• Prevent servers from being overloaded. To reduce server load, a rate limiter is used to filter out excess requests caused by bots or user misbehavior.
Advantages of consistent hashing?
Automatic scaling: servers could be added and removed automatically depending on the load. Heterogeneity: the number of virtual nodes for a server is proportional to the server capacity. For example, servers with higher capacity are assigned with more virtual nodes.
What is availability?
Availability means any client that requests data gets a response, even if some of the nodes are down.
What are availability zones?
Availability zones (AZs) are a way for cloud providers to provide high availability for their services. An availability zone is a physically separate location within a region that has its own power, cooling, and networking infrastructure. By using multiple availability zones, a cloud provider can ensure that their services continue to function even if one availability zone goes offline. Some examples of how cloud providers use availability zones:
• Data replication: data can be replicated across multiple availability zones, so that if one zone goes offline, the data is still accessible from another.
• Load balancing: requests can be distributed across multiple availability zones, so that if one zone goes offline, requests can still be fulfilled by another.
• Auto-scaling: resources such as servers and storage can be automatically added or removed across multiple availability zones, depending on the workload.
What is CAP Theorem?
CAP theorem states it is impossible for a distributed system to simultaneously provide more than two of these three guarantees: consistency, availability, and partition tolerance.
What is consistency?
Consistency means all clients see the same data at the same time, no matter which node they connect to. Strong consistency is usually achieved by forcing a replica not to accept new reads/writes until every replica has agreed on the current write. This approach is not ideal for highly available systems because it can block new operations. Dynamo and Cassandra adopt eventual consistency, which is a common choice of consistency model for key-value stores. With eventual consistency, concurrent writes can allow inconsistent values to enter the system, and clients must read the values and reconcile them.
What is consistent hashing?
Consistent hashing is a special kind of hashing such that when a hash table is resized, only k/n keys need to be remapped on average, where k is the number of keys and n is the number of slots. In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped.
• Only a minimal number of keys are redistributed when servers are added or removed.
• It is easy to scale horizontally because data are more evenly distributed.
• It mitigates the hotspot key problem: excessive access to a specific shard could cause server overload. Imagine data for Katy Perry, Justin Bieber, and Lady Gaga all end up on the same shard; consistent hashing helps by distributing the data more evenly.
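A toy hash ring with virtual nodes, sketched under the assumption that MD5 is an acceptable hash and that 100 virtual nodes per server is a reasonable default; real systems tune the virtual node count per server capacity.

```python
import bisect
import hashlib

def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Toy hash ring: each server owns many virtual nodes on the ring."""
    def __init__(self, servers, vnodes=100):
        self._ring = []                                  # sorted (hash, server) pairs
        for server in servers:
            for i in range(vnodes):
                self._ring.append((_hash(f"{server}#{i}"), server))
        self._ring.sort()

    def get_server(self, key):
        h = _hash(key)
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self._ring, (h,)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["s1", "s2", "s3"])
print(ring.get_server("katy_perry"))   # only ~1/n of keys move when a server changes
```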
What is database replication and advantages?
Database replication can be used in many database management systems, usually with a master/slave relationship between the original (master) and the copies (slaves). A master database generally only supports write operations. A slave database gets copies of the data from the master database and only supports read operations. Advantages of database replication:
• Better performance: in the master-slave model, all writes and updates happen in master nodes, whereas read operations are distributed across slave nodes. This model improves performance because it allows more queries to be processed in parallel.
• Reliability: if one of your database servers is destroyed by a natural disaster, such as a typhoon or an earthquake, data is still preserved. You do not need to worry about data loss because data is replicated across multiple locations.
• High availability: by replicating data across different locations, your website remains in operation even if a database goes offline, since you can access data stored in another database server.
What is Fixed window counter algorithm?
The fixed window counter algorithm works as follows:
• The algorithm divides the timeline into fixed-sized time windows and assigns a counter to each window.
• Each request increments the counter by one.
• Once the counter reaches the pre-defined threshold, new requests are dropped until a new time window starts.
A major problem with this algorithm is that a burst of traffic at the edges of time windows can let more requests than the allowed quota go through.
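A minimal in-process sketch of the fixed window counter; the limit and window size are arbitrary, and a distributed deployment would keep the counters in Redis instead of a local dict.

```python
import time
from collections import defaultdict

WINDOW_SECONDS = 60
LIMIT = 100
_counters = defaultdict(int)   # (client, window) -> count; old windows are not cleaned up here

def allow_request(client_id):
    window = int(time.time() // WINDOW_SECONDS)   # index of the current fixed window
    key = (client_id, window)
    _counters[key] += 1                           # each request increments the counter
    return _counters[key] <= LIMIT                # drop once the threshold is exceeded
```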
What is high availability?
High availability is the ability of a system to be continuously operational for a desirably long period of time. High availability is measured as a percentage, with 100% meaning a service that has zero downtime. Most services fall between 99% and 100%.
What is horizontal scaling (sharding)?
Horizontal scaling, referred to as "scale-out", allows you to scale by adding more servers into your pool of resources. Horizontal scaling is more desirable for large scale applications due to the limitations of vertical scaling.
CP (consistency partition) vs AP(availability partition)
In a distributed system, partitions cannot be avoided, and when a partition occurs, we must choose between consistency and availability. Suppose node n3 goes down and cannot communicate with n1 and n2. If clients write data to n1 or n2, the data cannot be propagated to n3. If data is written to n3 but not yet propagated to n1 and n2, n1 and n2 have stale data. If we choose consistency over availability (CP system), we must block all write operations to n1 and n2 to avoid data inconsistency among these three servers, which makes the system unavailable. Bank systems usually have extremely high consistency requirements. For example, it is crucial for a bank system to display the most up-to-date balance info; if inconsistency occurs due to a network partition, the bank system returns an error until the inconsistency is resolved. However, if we choose availability over consistency (AP system), the system keeps accepting reads even though it might return stale data. For writes, n1 and n2 keep accepting writes, and data is synced to n3 when the network partition is resolved.
What is stateless architecture?
In a stateless architecture, HTTP requests from users can be sent to any web server, which fetches state data from a shared data store. State data is stored in a shared data store and kept out of the web servers. A stateless system is simpler, more robust, and more scalable.
What is a leaky bucket algorithm?
The leaking bucket algorithm is similar to the token bucket except that requests are processed at a fixed rate. It is usually implemented with a first-in-first-out (FIFO) queue. The algorithm works as follows:
• When a request arrives, the system checks if the queue is full. If it is not full, the request is added to the queue.
• Otherwise, the request is dropped.
• Requests are pulled from the queue and processed at regular intervals.
The leaking bucket algorithm takes two parameters:
• Bucket size: equal to the queue size. The queue holds the requests to be processed at a fixed rate.
• Outflow rate: how many requests can be processed per fixed time interval, usually per second.
Pros:
• Memory efficient given the limited queue size.
• Requests are processed at a fixed rate, so it is suitable for use cases where a stable outflow rate is needed.
Cons:
• A burst of traffic fills up the queue with old requests, and if they are not processed in time, recent requests will be rate limited.
• There are two parameters in the algorithm, and it might not be easy to tune them properly.
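A minimal sketch of the leaky bucket using a bounded FIFO queue and a background worker draining it at a fixed rate; the bucket size and outflow rate values are arbitrary.

```python
import queue
import threading
import time

BUCKET_SIZE = 4          # queue size
OUTFLOW_RATE = 2         # requests processed per second
bucket = queue.Queue(maxsize=BUCKET_SIZE)

def allow_request(request):
    try:
        bucket.put_nowait(request)   # queue not full: request is accepted
        return True
    except queue.Full:
        return False                 # queue full: request is dropped

def drain():
    while True:
        request = bucket.get()
        print("processing", request)
        time.sleep(1 / OUTFLOW_RATE)  # fixed processing rate

threading.Thread(target=drain, daemon=True).start()
```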
What is sliding window counter algorithm?
The sliding window counter algorithm is a hybrid approach that combines the fixed window counter and the sliding window log. Assume the rate limiter allows a maximum of 7 requests per minute, there were 5 requests in the previous minute, and there are 3 in the current minute. For a new request that arrives 30% of the way into the current minute, the number of requests in the rolling window is calculated using the following formula:
• requests in the current window + requests in the previous window * overlap percentage of the rolling window and the previous window
• Using this formula, we get 3 + 5 * 70% = 6.5 requests. Depending on the use case, the number can be rounded up or down; in our example, it is rounded down to 6. Since the rate limiter allows a maximum of 7 requests per minute, the current request can go through.
Pros:
• It smooths out spikes in traffic because the rate is based on the average rate of the previous window.
• Memory efficient.
Cons:
• It only works for a not-so-strict look-back window. It is an approximation of the actual rate because it assumes requests in the previous window are evenly distributed. However, this problem may not be as bad as it seems: according to experiments done by Cloudflare [10], only 0.003% of requests are wrongly allowed or rate limited among 400 million requests.
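A small sketch of the weighted-count formula above; the function name and the rounding choice are illustrative.

```python
def sliding_window_count(curr_count, prev_count, elapsed_fraction):
    """Estimate requests in the rolling window.

    elapsed_fraction is how far we are into the current window (0.0-1.0),
    so the previous window overlaps the rolling window by (1 - elapsed_fraction).
    """
    return curr_count + prev_count * (1 - elapsed_fraction)

# The example above: 3 requests this minute, 5 last minute, 30% into the window.
estimate = sliding_window_count(3, 5, 0.3)    # 3 + 5 * 0.7 = 6.5
allowed = int(estimate) < 7                   # rounded down to 6, under the limit of 7
print(estimate, allowed)                      # 6.5 True
```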
What is a token bucket algorithm?
The token bucket algorithm is widely used for rate limiting. It works as follows:
• A token bucket is a container with a pre-defined capacity. Tokens are put in the bucket at preset rates periodically. Once the bucket is full, no more tokens are added; extra tokens overflow.
• Each request consumes one token. When a request arrives, we check if there are enough tokens in the bucket.
• If there are enough tokens, we take one token out for each request, and the request goes through.
• If there are not enough tokens, the request is dropped.
The token bucket algorithm takes two parameters:
• Bucket size: the maximum number of tokens allowed in the bucket.
• Refill rate: the number of tokens put into the bucket every second.
Pros:
• The algorithm is easy to implement.
• Memory efficient.
• Token bucket allows a burst of traffic for short periods; a request can go through as long as there are tokens left.
Cons:
• The two parameters, bucket size and token refill rate, might be challenging to tune properly.
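A minimal token bucket sketch with the two parameters named above; the capacity and refill rate values are arbitrary.

```python
import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity          # maximum number of tokens in the bucket
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow_request(self):
        now = time.monotonic()
        # Refill based on elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1              # each request consumes one token
            return True
        return False                      # not enough tokens: request is dropped

bucket = TokenBucket(capacity=4, refill_rate=2)    # 4 tokens max, 2 tokens/second
print([bucket.allow_request() for _ in range(6)])  # the burst drains the bucket
```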
What is vertical scaling?
Vertical scaling, referred to as "scale up", means the process of adding more power (CPU, RAM, etc.) to your servers. When traffic is low, vertical scaling is a great option, and the simplicity of vertical scaling is its main advantage. Unfortunately, it comes with serious limitations:
• Vertical scaling has a hard limit. It is impossible to add unlimited CPU and memory to a single server.
• Vertical scaling does not have failover and redundancy. If one server goes down, the website/app goes down with it completely.
• The overall cost of vertical scaling is high. Powerful servers are much more expensive.
Where should we store rate limiting counter logic?
Where shall we store the counters? Using the database is not a good idea due to the slowness of disk access. An in-memory cache is chosen because it is fast and supports a time-based expiration strategy. For instance, Redis is a popular option for implementing rate limiting: it is an in-memory store that offers two useful commands, INCR and EXPIRE.
• INCR: increases the stored counter by 1.
• EXPIRE: sets a timeout for the counter. If the timeout expires, the counter is automatically deleted.
The flow works as follows:
1. The client sends a request to the rate limiting middleware.
2. The rate limiting middleware fetches the counter from the corresponding bucket in Redis and checks whether the limit is reached.
3. If the limit is reached, the request is rejected.
4. If the limit is not reached, the request is sent to the API servers; meanwhile, the system increments the counter and saves it back to Redis.
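A sketch of steps 2-4 using redis-py; the key naming and limits are made up. Note that fetching and incrementing in separate commands is not atomic; the Lua-script variant shown earlier closes that gap.

```python
import redis

r = redis.Redis()        # assumes a local Redis instance
LIMIT = 100
WINDOW_SECONDS = 60

def middleware(client_id):
    key = f"counter:{client_id}"           # hypothetical per-client bucket key
    count = int(r.get(key) or 0)           # 2. fetch the counter from Redis
    if count >= LIMIT:
        return "429 Too Many Requests"     # 3. limit reached: reject
    pipe = r.pipeline()
    pipe.incr(key)                          # 4. otherwise increment the counter...
    pipe.expire(key, WINDOW_SECONDS)        # ...and (re)arm the time-based expiration
    pipe.execute()
    return "forward to API servers"
```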
What is sliding window log algorithm?
The sliding window log algorithm works as follows:
• The algorithm keeps track of request timestamps. Timestamp data is usually kept in a cache, such as the sorted sets of Redis.
• When a new request comes in, remove all the outdated timestamps. Outdated timestamps are defined as those older than the start of the current time window.
• Add the timestamp of the new request to the log.
• If the log size is the same as or lower than the allowed count, the request is accepted. Otherwise, it is rejected.
Pros:
• Rate limiting implemented by this algorithm is very accurate. In any rolling window, requests will not exceed the rate limit.
Cons:
• The algorithm consumes a lot of memory because, even if a request is rejected, its timestamp might still be stored in memory.
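A minimal sketch of the algorithm using a Redis sorted set via redis-py; the key naming, limit, and window size are assumptions.

```python
import time
import redis

r = redis.Redis()        # assumes a local Redis instance
LIMIT = 2                # allowed requests per window
WINDOW_SECONDS = 60

def allow_request(client_id):
    key = f"log:{client_id}"                          # hypothetical per-client log key
    now = time.time()
    r.zremrangebyscore(key, 0, now - WINDOW_SECONDS)  # remove outdated timestamps
    r.zadd(key, {str(now): now})                      # add the new request's timestamp
    r.expire(key, WINDOW_SECONDS)                     # let idle logs expire
    return r.zcard(key) <= LIMIT                      # accept if log size within limit
```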