Distributed Systems
(total elapsed time - downtime) / Number of Failures
How is MTBF calculated?
Maintenance Time / Number of repairs
How is MTTR calculated
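The two formulas above can be checked with a small sketch. The function names and the sample month (720 hours, 6 hours of downtime, 3 failures) are made up for illustration.

```python
def mtbf(total_elapsed_hours: float, downtime_hours: float, failures: int) -> float:
    """Mean time between failures: uptime divided by the number of failures."""
    return (total_elapsed_hours - downtime_hours) / failures


def mttr(maintenance_hours: float, repairs: int) -> float:
    """Mean time to repair: total maintenance time divided by the number of repairs."""
    return maintenance_hours / repairs


# Hypothetical month: 720 hours elapsed, 6 hours of downtime, 3 failures.
print(mtbf(720, 6, 3))  # 238.0
print(mttr(6, 3))       # 2.0
```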
scrubbing server
In a CDN, after the routing system sends the client the IP address of the nearest proxy server, what component handles the request next?
the routing system
In a CDN, when a client first wants to send a request, what is the first component it is sent to?
for any connection, all incoming packets are forwarded to the same tier 3 load balancer
In a typical data center load balancing system, there are three tiers of load balancers. What is the role of the second tier which sits at layer 4 of OSI
less hardware infrastructure/layers are required to do load balancing, network latency will be reduced because of no intermediate hop, with a dedicated load balancing layer there can be traffic issues
In designing twitter, client side load balancing is utilized. What are the advantages of this approach?
uniqueness and scalability
In the context of a sequencer, what is the problem with using multiple databases that all start at different values and increment by the number of databases? What requirements are violated?
graph databases can be used in social applications
What are popular applications of graph databases?
Amazon DynamoDB, Redis, and Memcached
What are some popular key-value databases
they are quite expensive, the configuration requires additional human resources, availability can be an issue as additional hardware is required in the case of failure
What are the downsides of hardware based load balancers
lack of standardization and consistency
What are the drawbacks of noSQL databases?
DNS resolver, root-level name servers, top-level domain name servers, authoritative name servers
What are the four types of servers in the DNS hierarchy?
user enters a domain name in the browser, browser sends DNS query to ISP, ISP forwards DNS query to DNS infrastructure, DNS infrastructure sends list of IP addresses to ISP, ISP forwards list to browser, browser responds with HTTP request to ISP, ISP forwards HTTP request to web server
What are the steps in the high level flow of DNS?
caching, server replication, and protocol
What are the three main things that make DNS a reliable system
MTBF and MTTR
What are the two ways of measuring system reliability
DNS uses caching at different layers and DNS name servers are in a hierarchical form
What are two aspects of domain name systems that allow them to scale well and cater to the requests of the entire internet?
load balancers are typically deployed in pairs. To maintain availability, enterprises will also use clusters of load balancers that use heartbeat communication. If the entire cluster fails, manual rerouting must occur
What happens if a load balancer fails? Are they not a single point of failure?
128-bit numbers make primary-key indexing slower, we cannot guarantee uniqueness, and they are not monotonically increasing
What is the problem with using a UUID for a sequencer?
multiple queues based on task categories
What is the solution to the fact that in a task scheduler, some tasks need to be processed with more urgency than others further ahead in a FIFO queue
UDP has no handshaking dialogues and therefore exposes the user to the unreliability of the underlying network
What makes UDP a more unreliable communication protocol as compared with TCP?
pull
What type of CDN strategy (push or pull) is appropriate when dynamic content is being requested?
applications that require a large volume of semi-structured or unstructured data, low latency, and flexible data models
What type of applications require noSQL databases?
HTTP responses
What type of content do web servers serve?
in the browser, the operating system, and the local name server. Some frequently visited services are also cached in the ISP DNS resolvers themselves
Where are DNS requests typically cached?
application server
Which server layer deals with longer running processes that are more resource intensive?
small additional costs are required to implement shadow load balancers on commodity hardware
Why isn't availability an issue when it comes to software based load balancers
99%
Youtube: What uptime % would be considered good as we consider availability?
the client could be malicious, if the uploaded video is a duplicate it can be filtered out, encoders will only be available on a private IP address within Youtube's network
Youtube: when a user uploads a video, it is passed through a server before hitting the encoder. Why is this the case?
simple design that does not require dealing with impedance mismatch, easy horizontal scaling, highly available, cost because you do not need to pay for a RDMS and many noSQL databases are open source
advantages of NoSQL databases
a document-term index
an inverted index is an improvement on a document-level index in a distributed search system. What does an inverted index make use of?
health checking of servers, TLS termination, predictive analytics on traffic patterns, reduced human intervention in the case of failure handling, increased security
aside from ensuring that services will be scalable, available, and highly performant, what are key benefits and services of load balancers?
99.9%
at what level do cloud providers like google, microsoft, and amazon set their availabilities in their SLAs?
web layer, application layer, and database layer
at what three levels will a cache generally be present in a system?
layer 4 (network/transport) or layer 7 (application)
at what two layers of the OSI can load balancers be placed?
a term is a frequently occurring word that can be thrown away
define "term" in the context of a document-term matrix in distributed search
DNS redirection is a method that can be used to route users to the nearest proxy servers in a content delivery network. In step one, a client is mapped to the appropriate network location. In step two, that location distributes the load over proxy servers.
define DNS redirection
the network time protocol is a networking protocol for clock synchronization between computer systems over packet-switched, variable latency data networks
define NTP
the broker server is the core component of the pub-sub system. It handles read and write requests. A broker will have multiple topics where each topic can have multiple partitions associated with it.
define a broker in the context of a pub-sub system
a cache that embeds cache and service functionality within the same host
define a co-located cache
a report collector independent of the primary service. It is made independent to avoid the situations where client agents want to report an error to a failed service
define a collector in the context of client side monitoring
columnar databases store data in columns instead of rows, which enables quick and efficient access to all entries in a database column
define a columnar database
a group of geographically distributed proxy servers where a proxy server is an intermediate server between a client and the origin server
define a content delivery network
a mathematical matrix that represents the frequency of terms in a list of documents
define a document-term matrix in the context of distributed search
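A document-term matrix and the inverted index built from the same counts can be sketched as follows. The three-document corpus and the whitespace tokenizer are assumptions for illustration; a real indexer would also normalize and filter terms.

```python
from collections import Counter, defaultdict

docs = ["the cat sat", "the dog sat down", "cat and dog"]  # hypothetical corpus

# Term frequencies per document, using a naive whitespace tokenizer.
counts = [Counter(doc.split()) for doc in docs]
vocabulary = sorted(set().union(*counts))

# Document-term matrix: one row per document, one column per term.
matrix = [[c[term] for term in vocabulary] for c in counts]

# Inverted index: term -> list of document ids that contain it.
inverted = defaultdict(list)
for doc_id, c in enumerate(counts):
    for term in c:
        inverted[term].append(doc_id)

print(vocabulary)       # ['and', 'cat', 'dog', 'down', 'sat', 'the']
print(matrix[0])        # [0, 1, 0, 0, 1, 1]
print(inverted["cat"])  # [0, 2]
```

The matrix answers "which terms are in this document", while the inverted index answers "which documents contain this term", which is the lookup a search query needs.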
the internet's naming service that maps human-friendly domain names to machine-readable IP addresses
define a domain name system
key value stores are distributed hash tables where a key binds to a specific value and does not assume anything about the structure of the value
define a key value store as a distributed systems building block
an agent that collects logs from each node and dumps them into storage. This way we do not need to visit each node if we want to know about a particular event
define a log accumulator
an intermediate component between the interacting entities known as producers and consumers
define a messaging queue
rate limiters put a limit on the number of requests a service will fulfill and throttle requests that cross the limit
define a rate limiter
each counter has a specific number of shards as needed. These shards are run on different computational units in parallel.
define a sharded counter
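A minimal single-process sketch of the idea, assuming random shard selection for writes. In a real system each shard would live on a separate computational unit; the class and method names here are hypothetical.

```python
import random


class ShardedCounter:
    def __init__(self, num_shards: int):
        # Each list slot stands in for a shard on its own node.
        self.shards = [0] * num_shards

    def increment(self, amount: int = 1) -> None:
        # Random selection spreads writes across the shards.
        self.shards[random.randrange(len(self.shards))] += amount

    def value(self) -> int:
        # A read has to visit every shard, so reads cost more than writes.
        return sum(self.shards)


likes = ShardedCounter(num_shards=4)
for _ in range(100):
    likes.increment()
print(likes.value())  # 100
```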
a persistent sequence of messages stored in the local storage of a broker
define a topic in the context of publisher-subscriber systems
the data is first written to the cache and asynchronously written to the database. Inconsistency is inevitable when the client reads stale data from the database
define a write back cache and its considerations
writes on the cache as well as the database. This increases write latency but ensures strong consistency between the database and the cache
define a write through cache and its considerations
a prober embedded in the client application that sends the appropriate service reports about any failures
define an agent in the context of client side monitoring
encoders and transcoders compress videos and transform them into different formats and qualities to support varying numbers of devices according to their screen resolution and bandwidth
define an encoder
apache kafka is a framework for distributed real-time event processing and storage
define apache kafka
a database transaction is considered an atomic unit and if one statement fails, the whole transaction is aborted and rolled back
define atomicity in regards to databases
messages are placed in the queue in the order they are received which may not necessarily be the order they were sent
define best effort ordering
a storage solution for unstructured data where every type of data is stored as a binary large object
define blob storage
consistent hashing assigns each server or item in a distributed hash table a place on an abstract circle called a ring
define consistent hashing
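A minimal sketch of a hash ring, assuming MD5 as the hash function and one position per server (no virtual nodes). The class and function names are hypothetical.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a position on the ring using MD5 (an arbitrary choice)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, servers):
        # A sorted list of (position, server) pairs forms the ring.
        self._ring = sorted((ring_position(s), s) for s in servers)

    def add(self, server: str) -> None:
        bisect.insort(self._ring, (ring_position(server), server))

    def lookup(self, key: str) -> str:
        """Return the first server clockwise from the key's position."""
        positions = [pos for pos, _ in self._ring]
        i = bisect.bisect_right(positions, ring_position(key))
        return self._ring[i % len(self._ring)][1]


ring = HashRing(["s1", "s2", "s3"])
before = {k: ring.lookup(k) for k in ("user:1", "user:2", "user:3")}
ring.add("s4")
after = {k: ring.lookup(k) for k in before}
# Any key whose owner changed can only have moved to the new server.
print({k: (before[k], after[k]) for k in before})
```

Adding a server only takes over the keys between its predecessor and itself, so the other keys stay where they were.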
the system should guarantee that completed transactions will survive permanently in the database even in system failure events
define durability in regards to databases
edge side includes is a popular dynamic content caching strategy in CDNs. Since many dynamic content web pages only change in small portions, ESI caches the portions that do not change and only notifies of the portions that do.
define edge side includes markup language
the number of requests can cross the predefined limit if the system has excess resources available
define elastic throttling
the distribution of traffic load across multiple geographic regions
define global server load balancing
there is a hard limit on the number of API requests. So whenever a request crosses the limit, it is discarded
define hard throttling
scaling by increasing the number of machines in the network
define horizontal scaling
horizontal sharding is used to divide a table into multiple tables row wise
define horizontal sharding
impedance mismatch occurs in relational databases when there is a difference between the relational model and the in-memory data structures
define impedance mismatch
a characteristic of operations that allows them to be applied several times without influencing the outcome
define idempotence
the organization and manipulation of data that's done to facilitate fast and accurate information retrieval
define indexing in the context of a search system.
in the case of multiple concurrent transactions, it should give the same result as if they were run in isolation
define isolation in regards to databases
load balancing achieved within a data center
define local load balancing
all secondary nodes replicate the actual data changes.
define logical log replication
the probability that the service will restore its functions within a specified time of fault occurrence
define maintainability
in power of two load balancing, two nodes are selected at random and the one with less load is chosen. This simple technique is exponentially better than random selection
define power of two client side load balancing
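A small simulation of the technique, assuming ten servers, requests that never complete, and load measured as a simple request count. The function name is hypothetical.

```python
import random


def pick_server(loads: dict, rng: random.Random) -> str:
    """Sample two distinct servers at random and return the less loaded one."""
    a, b = rng.sample(list(loads), 2)
    return a if loads[a] <= loads[b] else b


rng = random.Random(0)  # seeded only to make the run repeatable
loads = {f"server{i}": 0 for i in range(10)}
for _ in range(1000):
    loads[pick_server(loads, rng)] += 1

# Gap between the busiest and the idlest server after 1000 requests.
print(max(loads.values()) - min(loads.values()))
```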
an asynchronous service-to-service communication method that is popular in serverless and microservices architecture
define pub-sub messaging
the probability that the service will perform its functions for a specified time
define reliability
refers to keeping multiple copies of data at various nodes which are preferably geographically distributed to achieve availability, scalability, and performance
define replication in regards to databases
each request is forwarded to a server in the pool in a repeating sequential manner
define round robin load balancer scheduling
the ability of a system to handle an increasing amount of workload without compromising performance
define scalability
the number of requests can exceed the predefined limit by a certain percentage
define soft throttling
the primary node saves all statements that it executes and sends them to the secondary nodes to perform
define statement-based replication which is a type of primary-secondary replication
in strict ordering messages are placed in a messaging queue in the order they were produced usually based on an attached timestamp
define strict ordering in the context of messaging queues
comprised of clearly defined data types with patterns that make them easily searchable. Typically quantitative data
define structured data
the primary node waits for acknowledgements from the secondary nodes about updating the data. After receiving acknowledgement from all secondary nodes, the primary node reports success to the client
define synchronous replication
some applications provide a different level of service to users based on their IP addresses. In that case, hashing the IP address is performed to assign users' requests to servers
define the IP hash method of load balancing
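A minimal sketch of IP hashing, assuming SHA-256 and a plain modulo over a fixed server list. The function name and the addresses are made up for illustration.

```python
import hashlib


def server_for_ip(client_ip: str, servers: list) -> str:
    """Hash the client IP and map it onto the server list by modulo."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


servers = ["app-1", "app-2", "app-3"]
# The same client IP always lands on the same server.
print(server_for_ip("203.0.113.7", servers) == server_for_ip("203.0.113.7", servers))  # True
```

With plain modulo, changing the number of servers remaps most clients, which is the problem consistent hashing addresses.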
anycast is a routing methodology in which all edge servers located in multiple locations share the same single IP address. The Border Gateway Protocol (BGP) is then used to route clients based on the internet's natural network flow
define the anycast method for routing clients to the nearest proxy server in DNS
they adhere to a particular schema before storing data. Instances are stored in rows, each of which has a unique key identifying that tuple which can be linked to other tables
define the key factors of a relational database
content gets sent automatically to the CDN proxy servers from the origin server in the push model
define the push CDN content caching strategy
servers are unaware of each other, there is no synchronization, data sharing, and communication between the servers
define the shared-nothing architecture utilized by memcached
scaling by providing additional capabilities such as CPUs or RAM to an existing device
define vertical scaling
memory hardware that fetches and stores data at high speed but is lost once the system is turned off
define volatile computer memory
if a server has a higher capacity for serving clients' requests, it is given a higher weight. The load balancer forwards requests according to the weight of each node
define weighted round-robin load balancing
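A naive sketch of weighted round-robin, assuming integer weights and a schedule that simply repeats each server `weight` times per cycle. The server names are hypothetical; with equal weights this reduces to plain round-robin.

```python
import itertools


def weighted_round_robin(weights: dict):
    """Return an endless iterator where each server appears `weight` times per cycle."""
    schedule = [server for server, w in weights.items() for _ in range(w)]
    return itertools.cycle(schedule)


picker = weighted_round_robin({"big": 3, "small": 1})
print([next(picker) for _ in range(8)])
# ['big', 'big', 'big', 'small', 'big', 'big', 'big', 'small']
```

This version sends consecutive requests to the same server. Production load balancers typically interleave the schedule more smoothly.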
data is written only to the database and written to the cache when there is a cache miss later. Not a favorable strategy for reading recently updated data
define write around cache and its considerations
the primary node saves the query before executing it in a log file known as a write ahead log file. It then uses these logs to write data onto the secondary nodes
define write-ahead log shipping
cassandra is PA/EL
describe cassandra DB in terms of the PACELC theorem
mongoDB is PA/EC
describe mongoDB in terms of the PACELC theorem
redis does not have strong consistency due to its use of asynchronous replication
describe the consistency level of redis and why
a client sends a task to a rate limiter which accepts. The task is sent to the task submitter which gets a UUID from a generator. The task is then sent to a database. The task is then batched and prioritized before being placed in a distributed queue. The task is then sent to the cloud resources for processing
describe the flow of a task scheduler
the local server itself queries the root, then the top-level domain, then the authoritative server
describe the iterative process of DNS resolution
a burst of requests can fill the bucket and if not processed in the specified time, recent requests can take a hit. determining the optimal bucket size and outflow rate is a challenge
disadvantages of the leaking bucket algorithm for rate limiting
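A minimal sketch of a leaking bucket, assuming a discrete `tick` that stands in for the constant outflow rate. The class name and parameters are hypothetical; it shows how a burst fills the bucket and later requests get dropped.

```python
from collections import deque


class LeakyBucket:
    def __init__(self, capacity: int, leak_per_tick: int):
        self.capacity = capacity            # bucket size
        self.leak_per_tick = leak_per_tick  # outflow rate
        self.queue = deque()

    def offer(self, request) -> bool:
        """Queue the request if there is room, otherwise drop it."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def tick(self) -> list:
        """Process up to `leak_per_tick` queued requests, oldest first."""
        n = min(self.leak_per_tick, len(self.queue))
        return [self.queue.popleft() for _ in range(n)]


bucket = LeakyBucket(capacity=3, leak_per_tick=1)
print([bucket.offer(i) for i in range(5)])  # [True, True, True, False, False]
print(bucket.tick())                        # [0]
```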
no. Usually they use the services of a CDN provider like Akamai or Cloudflare
do most companies build their own CDNs?
most content providers use both to get the benefits of each approach
do most content providers use a push or pull CDN caching approach?
static content
do web servers typically serve static or dynamic content?
read throughput
does MongoDB have a higher read throughput or a higher write throughput?
a system's ability to execute persistently even if one or more of its components fail
fault tolerance
availability
for key-value stores do we prefer consistency or availability?
we introduce secondary monitoring systems which pull from the application. These then push to a datacenter monitoring system which pushes to a global monitoring system
how can we utilize a hybrid push/pull approach to improve on scalability in a monitoring service
CDNs cache content that serves as a backup whenever the origin server fails. If a proxy server stops working, other operational proxy servers step in and continue to drive traffic
how do CDNs guarantee availability
we partition the blobs based on the complete path of the blob which is the combination of the account ID, container ID, and blob ID
how do we ensure that data with the same user or container is stored in the same partition in blob store systems?
we can periodically read all the shards of a counter and cache the results
how do we improve on the low read throughput and high read latency that is inherent to sharded counters?
as a string which means all non string data must be serialized
how does memcache store data?
13 (managed by 12 different organizations)
how many DNS logical root name servers are there spread strategically around the world?
1000
how many QPS can mySQL handle?
8000
how many requests per second can a single server handle
8000
how many users per day should we estimate that a server can handle
netflix might be able to store more than 90% of its content in the CDN while this is not feasible for Youtube due to the volume of content
how might netflix and youtube have different strategies for caching their content in a content delivery network?
a few MBs
how much data can we usually store in the value component of a key-value store
the length of the network path and the capacity (bandwidth) limits along the network path
network distance between a user and a proxy server in a CDN is a function of what two variables?
pagination
often a user will want a list of blobs matching some condition. However, when these lists become very large this can affect performance. What is a solution to this problem?
certain content requires the execution of scripts that can be executed at proxy servers instead of the origin server
other than edge side includes, what is a popular optimization technique for dynamic content caching in CDNs?
CPU register, CPU cache, RAM, SSD, Magnetic Disk
rank the five types of memory from fastest to slowest
No, because it cannot possibly meet the data storage and query requirements, and failure of the one mega server would result in downtime for everyone
should key value servers be located on a single node and why?
pipelining
since redis uses a client-server model, a request cannot be handled until the server responds to the prior request. What is the method that Redis uses to get around this?
true
true or false, DNS can be considered a global server load balancer?
true
true or false, blobs are broken up into a constant amount of chunks?
false
true or false, in nonvolatile memory the processor has direct access to the data
true
true or false, weighted round robin is a static load balancing algorithm?
within the premises of the user's network
typically, where are DNS resolvers located?
throttle rule retriever
under a rate limiter system, assume that the service owner adds some extra rules to the database. Consequently data in the cache will become outdated. Which components are responsible for updating the cache
right after it finishes updating itself
under asynchronous replication, when does a primary node report success to the client?
when the workload is read-heavy
under what circumstances is primary-secondary replication appropriate?
root-level name servers
what DNS name servers hold mappings for the top-level domains such as .com, .edu, .us, etc.
receiving logs, storing those logs locally, pushing the logs to a pub-sub system
what actions is a log accumulator responsible for
the sliding window log algorithm
what algorithm proposes a solution to the window edge problem that occurs in the fixed window counter algorithm
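A minimal sketch of a sliding window log, assuming timestamps are passed in explicitly (in seconds) and a limit of 2 requests per 60 seconds. The class name is hypothetical.

```python
from collections import deque


class SlidingWindowLog:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()  # timestamps of accepted requests

    def allow(self, now: float) -> bool:
        # Drop timestamps that have fallen out of the window ending at `now`.
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False


limiter = SlidingWindowLog(limit=2, window_seconds=60)
print([limiter.allow(t) for t in (58, 59, 61, 62)])
# [True, True, False, False]
```

A fixed window with a boundary at t=60 would accept all four of these requests, two on each side of the edge. The log rejects the last two because the window slides with each request.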
redundant cache servers
what allows for the high availability of distributed caches?
sharding of the cached data
what allows for the scalability of distributed caches?
Google Cloud Firestore and MongoDB
what are examples of document databases?
dynamic algorithms are far better because they maintain the state of serving hosts and therefore are worth the extra effort and complexity
what are more effective, dynamic or static load balancing algorithms?
local or default servers
what are other names for DNS resolvers?
least recently used, most recently used, least frequently used, most frequently used
what are some cache eviction strategies
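As one concrete instance of these strategies, least recently used (LRU) eviction can be sketched with `collections.OrderedDict`. The class name and the capacity of 2 are assumptions for illustration.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value) -> None:
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # "a" is now the most recently used
cache.put("c", 3)      # evicts "b"
print(cache.get("b"))  # None
```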
round-robin scheduling, weighted round-robin, least connections, least response time, IP hash, URL hash
what are some examples of popular load balancing algorithms?
conflict avoidance where all writes for a given record go through the same leader, last-write-wins, custom logic
what are the three methods of conflict resolution in a multi-leader database?
account metadata, container metadata, blob metadata
what are the three types of metadata that the metadata storage in a blob store holds?
operability, lucidity, modifiability
what are the three underlying aspects of maintainability
a primary-secondary model and a cluster of independent hosts model
what are the two approaches for performing message replication in a cluster of different hosts in a messaging queue
service-side errors and client-side errors
what are the two broad categories of monitoring focus for distributed systems?
synchronous replication within a storage cluster and asynchronous replication across data centers and regions
what are the two levels of replication that occur in blob storage?
document partitioning and term partitioning
what are the two methods for partitioning in a distributed indexing/search system?
dedicated cache servers and co-located cache
what are the two methods for sharding in cache clusters
separation of search and indexing nodes and not recomputing indexes when making replicas
what are the two things that need to be fixed in order to make a distributed search system scalable
a relational database and a graph database
what are the two types of databases necessary in a task scheduling system
key-range based sharding, hash based sharding, and consistent hashing
what are the types of horizontal sharding?
very efficient for a large number of aggregation and data analytics queries by drastically reducing the disk I/O requirements and the amount of data required to load from the disk
what are the use cases for columnar databases?
the data model maps directly to objects in code, support for ubiquitous JSON documents, the high flexibility of the schema where not all documents have to have the same fields
what are three differentiators (advantages) that document databases have over relational databases?
round-robin selection, random selection, metrics-based selection
what are three methods for assigning write requests to shards in a sharded counter?
data structure storage (memcached stores all data as strings), database for persistent storage, and a message broker that can translate millions of messages per second
what are three of the key features of redis
use appropriate authentication and resource authorization, consider code sandboxing using dockers or virtual machines, use performance isolation between tasks by monitoring resource utilization
what are three ways to deal with untrusted tasks in a task scheduler
best effort ordering and strict ordering
what are two approaches to message ordering in messaging queues
low latency and high availability
what are two non functional requirements that are needed for both youtube, netflix and also any other streaming service?
a locking mechanism (suboptimal due to performance issues) and the set then get approach
what are two potential solutions to the race condition problem of rate limiters
when we have to maintain changes in the replicated data over time
what causes the main problem in data replication?
blob indexing
what component in blob store solves the problem of blob management and querying (finding a blob in a sea of blobs)?
segments help identify the start and end of a message using an offset address
what do segments do within partitions in the context of a pub-sub system
clients can still read from secondary nodes if the primary node has failed
what does it mean that primary-secondary replication is read resilient?
a cache client
what entity performs the hash calculations for caching?
proxy servers usually serve content from RAM, CDN proxy servers are placed near the users to provide faster access to content, request routing ensures users are directed to the nearest proxy server, long-tailed content is stored in nonvolatile storage instead of directing to the origin server
what factors of a CDN help it achieve high performance?
the system needs to detect such cases where a counter unexpectedly starts getting very high write traffic. We'll dynamically increase the number of shards of the affected counter to mitigate the situation
what happens when a user with just a few followers has a post go viral on twitter (in the context of counters)
IDs that would have been generated in a dead period are wasted
what is a con of using the twitter snowflake method to encode causality in unique identifiers
randomly assigning nodes in the ring may cause non-uniform distribution
what is a disadvantage of consistent hashing?
session oriented applications like web applications
what is a good application of a key-value database?
LRU cache because the most recently uploaded data is the most likely to get views
what is a good cache eviction methodology for social application
bigtable because of its high throughput and scalability for storing key-value data
what is a good database for storing thumbnails and why
bounded waiting time for users
what is a non functional requirement that is unique to distributed task schedulers?
quorums
what is a popular approach for solving the concurrent write problem in leaderless replication database systems?
a single point of failure
what is a problem with using a central database's auto-increment feature to generate unique IDs for a sequencer?
when message size exceeds the original packet size of 512 bytes
what is a time where a DNS may use the TCP communication protocol over the UDP (user datagram protocol)?
vertical sharding is used to increase the speed of data retrievals from a table consisting of columns with very wide text or a binary large object
what is a time where vertical sharding may be useful?
Google's TrueTime API, which returns confidence intervals of time rather than points
what is a unique identifier for sequencers that guarantees all of the required properties including causality
if we use a push model to push metrics from the services to the data collector service
what is a way a service side monitoring service can be overwhelmed and become a bottleneck
mongoDB ensures atomicity in concurrent write operations and avoids collisions by returning duplicate-key errors for record-duplication issues
what is an advantage of mongoDB that makes it a good use case for URL shortening?
cassandra
what is an example of a columnar database
mapreduce
what is an example of a distributed data processing system that an indexer will use for index construction in a distributed search system?
mongoDB
what is an example of a document database?
read only memory
what is an example of nonvolatile memory?
a system that processes financial ATM transactions
what is an example where a sampler service is not viable in distributed logging
durability
what is an important non functional requirement of messaging queues
a failover server is activated
what is done in the case of a failure of a range handler in a sequencer
volatile memory
what is more costly per unit size- volatile memory or nonvolatile memory
creating containers to group blobs
what is one of the key functional requirements of blob storage?
replication
what is one of the most commonly used methods of providing fault tolerance for a system?
rules for how to operate the rate limiter
what is stored in the database/cache associated with a rate limiter?
reduced capex and opex for extra hardware
what is the advantage of a co-located cache
using time stamps based on synchronized clocks
what is the best approach for ordering incoming messages in a messaging queue
a range handler
what is the best way of generating unique IDs for a sequencer
when the data is write once read many
what is the circumstance where blob storage is useful
client then load balancer then a layer of services with cache clients then a layer of cache servers that feed into a persistent storage layer
what is the diagram of a distributed cache in simple terms
maintainability is measured by mean time to repair while reliability is measured by both mean time to repair as well as mean time between failure
what is the difference between maintainability and reliability
the client lacks the overall information to choose the most suitable server for its request which may result in requesting an already over-loaded server
what is the disadvantage of using client multiplexing to route a client to the nearest proxy server in DNS?
any non-deterministic function such as now() might result in distinct writes on the follower and leader
what is the disadvantage of using statement-based replication
if the primary node fails, the writes that weren't copied to the secondary nodes will be lost
what is the downside of asynchronous replication?
we need to build a system such that many nodes could collectively work as if we had a single huge server
what is the downside of horizontal scaling
high latency
what is the downside of synchronous replication
when content changes infrequently, the polling approach consumes unnecessary bandwidth
what is the downside of using periodic polling in a CDN to ensure content consistency?
We can only grow to the limitations of our server and the dollar cost of vertical scaling is usually high
what is the downside of vertical scaling
if two intervals overlap, we are unsure in which order the events occurred
what is the downside to using Google's truetime API as a sequencer?
each microservice sends its metrics to the monitoring system, resulting in a heavy traffic load on the infrastructure which can cause a bottleneck for business operations
what is the drawback to using a push based approach in a monitoring system
a load balancer
what is the first point of contact within a data center after the firewall?
DAU / RPS of a server
what is the formula for calculating the number of servers required
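A worked instance of the formula above, assuming 500 million daily active users (a made-up figure) and the 8000 requests per second per server used elsewhere in these cards. The estimate treats the DAU count as a rough stand-in for peak requests per second, so it is a back-of-the-envelope upper bound rather than a capacity plan.

```python
import math


def servers_needed(daily_active_users: int, rps_per_server: int) -> int:
    """Back-of-the-envelope server count: DAU divided by per-server RPS, rounded up."""
    return math.ceil(daily_active_users / rps_per_server)


print(servers_needed(500_000_000, 8000))  # 62500
```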
we lose a significant range any time a server dies
what is the major con of a range handler?
in a queue, only one consumer consumes a message as opposed to multiple consumers in a pub-sub system
what is the major difference between pub-sub messaging and queue based messaging?
a consistent burst of traffic at the window edges could cause a potential decrease in performance
what is the major disadvantage of the fixed window counter algorithm for rate limiting
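A minimal sketch of the window-edge problem described in the card above. The class name `FixedWindowLimiter` and the explicit `now` argument are assumptions made for illustration, not part of any particular library.

```python
class FixedWindowLimiter:
    """Allow at most `limit` requests per fixed window of `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.counts = {}  # window index -> number of requests accepted

    def allow(self, now):
        index = int(now // self.window)  # which fixed window `now` falls in
        if self.counts.get(index, 0) >= self.limit:
            return False
        self.counts[index] = self.counts.get(index, 0) + 1
        return True


limiter = FixedWindowLimiter(limit=3, window=60)
# Three requests at the end of one window and three at the start of the next
# are all accepted, so six requests pass within about one second.
edge_burst = [limiter.allow(t) for t in (59.0, 59.1, 59.2, 60.0, 60.1, 60.2)]
```

With a limit of 3 per 60 seconds, the burst straddling the boundary lets through twice the intended rate.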
conflict
what is the major drawback of multi-leader replication?
minimizing latency
what is the most important non functional requirement of a CDN?
setting up a configuration service that sits between the cache client and the cache servers that continuously monitors the health of the cache servers
what is the most robust (yet costly) method for keeping cache clients up to date on the presence and health of each cache server?
in a distributed system where there can be millions of events occurring per second, we need a mechanism to distinguish these events from each other
what is the motivation for using sequencers?
continuation token
what is the name of the indicator used in pagination under a blob store system?
as nodes join or leave the system, a minimal number of keys need to move
what is the primary benefit of consistent hashing
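A rough sketch of the benefit named in the card above, assuming a ring with one point per node and MD5 as the ring hash (both are illustrative choices; real systems typically add virtual nodes). The names `HashRing` and `ring_hash` are hypothetical.

```python
import bisect
import hashlib


def ring_hash(value):
    # Stable hash onto the ring (Python's built-in hash() is randomized per run).
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes):
        # Each node owns the arc that ends at its own point on the ring.
        self.points = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key):
        hashes = [h for h, _ in self.points]
        # First node clockwise from the key's hash, wrapping around at the end.
        i = bisect.bisect_right(hashes, ring_hash(key)) % len(self.points)
        return self.points[i][1]


keys = [f"key-{i}" for i in range(1000)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
```

Every key in `moved` ends up on the newly added node; keys owned by the other nodes' arcs stay where they were.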
unstructured catalog data like JSON files or other complex hierarchical data. It is also a good option for content management applications such as blogs and video platforms
what is the primary use case for document databases?
index recomputation is a waste of resources
what is the problem that arises in search systems when each replica of a primary node stores the index
kilobyte, megabyte, gigabyte, terabyte, petabyte
what is the progression of byte sizes?
increase the availability of the queue
what is the purpose of replicating queues on multiple servers?
memcached has a nearly deterministic O(1) speed serving millions of keys per second
what is the query speed of memcached
verify the consumer, retention time management, message receiving options management, allow multiple reads
what is the role of a consumer manager in a pub-sub system
to fetch content and create documents
what is the role of a crawler in a search system?
resolvers initiate the querying sequence and forward requests to the other DNS name servers. They can also answer users' DNS queries from their cache
what is the role of a dns resolver in the dns infrastructure hierarchy?
to fairly divide all clients' requests among the pool of available servers
what is the role of a load balancer
responds to search queries by running the query on the index created by the indexer
what is the role of a searcher in a search system
broker/topics registry and managing replication by assigning the new lead broker in the case of failure
what is the role of the cluster manager in a pub-sub system
a temporary data storage that can serve data faster by keeping data entries in memory
what is the simple definition of a cache
we compute the index on the primary node only and then communicate the inverted index (binary blob/file) to the replicas
what is the solution to index recomputation when replicating nodes in a distributed search system?
storing metadata indicating a time to live
what is the solution to having stale data stored in a distributed cache
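A small sketch of the time-to-live idea from the card above: each entry stores its expiry time as metadata and is treated as a miss once that time passes. `TTLCache` and its injectable `clock` parameter are assumptions for illustration.

```python
import time


class TTLCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be exercised in tests
        self.entries = {}   # key -> (value, expiry time)

    def put(self, key, value):
        self.entries[key] = (value, self.clock() + self.ttl)

    def get(self, key):
        item = self.entries.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self.clock() >= expires_at:
            del self.entries[key]  # stale: evict and report a miss
            return None
        return value
```

On a miss caused by expiry, the caller would go back to the data store and repopulate the entry.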
if the number is too small, we face high contention for writes. If the number is too large we encounter higher overhead on the read operation
what is the tradeoff in determining the number of shards for a sharded counter?
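The tradeoff in the card above can be seen in a toy in-memory version, sketched here under the assumption that each shard is just an integer slot (a real sharded counter would place shards on separate nodes). `ShardedCounter` is a hypothetical name.

```python
import random


class ShardedCounter:
    def __init__(self, num_shards):
        self.shards = [0] * num_shards

    def increment(self):
        # Writes pick a random shard, spreading contention across shards.
        self.shards[random.randrange(len(self.shards))] += 1

    def value(self):
        # Reads must visit every shard, so read cost grows with the shard count.
        return sum(self.shards)


counter = ShardedCounter(num_shards=8)
for _ in range(1000):
    counter.increment()
```

Fewer shards means more writers competing for each slot; more shards means `value()` has more slots to aggregate.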
web server
what is typically the first point of contact after a load balancer?
mean time to repair
what is used to measure maintainability
all of the writes must go through the leader node
what limits the performance of primary-secondary replication?
multiword queries necessitate sending long mapping lists between groups of nodes for merging
what makes term partitioning difficult in distributed search indexers
scalability, availability, performance
what non functional characteristics do load balancers improve
the write request queue grows, maximum shard utilization decreases, and the user will not get quick responses
what problems might arise if shards are selected in an order (sequentially) rather than randomly?
iterative
what type of DNS resolution reduces query load on the DNS infrastructure- iterative or recursive?
the metadata for a blob's chunks is cached on the client side when it is read for the first time. The client can then go directly to the data nodes without communicating with the master node.
what type of caching can occur on the client side in a blob store system?
asynchronous communication
what type of communication do messaging queues enable?
User Datagram Protocol (UDP)
what type of communication protocol do many clients of DNS use?
eventual consistency
what type of consistency level do domain name systems guarantee?
HTTP responses as well as other protocols
what type of content do application servers serve
a flat data organization pattern with no hierarchies or sub directories
what type of data organization pattern does blob storage follow?
propagation delay, by bringing the data closer to its users
what type of delay does a CDN primarily try to alleviate?
a stateless load balancer
what type of load balancer does not keep track of any session information?
messages that failed to be consumed and have reached the maximum attempts limit
what type of messages do dead-letter queues contain?
leader-follower protocol
what type of replication does MongoDB use?
leaderless
what type of replication does the cassandra DB use?
vertical sharding
what type of sharding (vertical or horizontal) is more amenable to manual partitioning?
Different ISPs have different numbers of users resulting in uneven load distribution. It does not consider end-server crashes
when a DNS infrastructure responds with a reordered list of IP addresses, it is essentially performing global server load balancing in a round robin format. What are the two disadvantages to this approach?
static content delivery where the origin server decides which content to deliver to users of the CDN
when is the push CDN content strategy appropriate?
when we have incoming requests with extensive service time or requests with widely differing length of service time
when is the weighted round robin algorithm inappropriate for load balancing?
a sampler service
when it is not viable to log every single event in a distributed logger, what can we use instead?
if we are operating an application where we need to be able to make changes even if we are offline such as a calendar application
when might a multi-leader replication strategy be useful?
when a firewall prevents the monitoring system from accessing the servers directly
when might a push metrics model be appropriate in a monitoring system?
if the data is unstructured, if there is a need to serialize and deserialize data, if the size of the data to be stored is large
when should a non-relational database be chosen?
if the data to be stored is structured, if acid properties are required, if the size of the data is relatively small and can fit on one node
when should a relational database be chosen
non relational
when strong consistency is not required, what might be a good type of database to use?
highly available but weak consistency
where do noSQL databases fall on the spectrum of consistency and availability?
in RAM to support low latency for the search
where do we store the index in distributed search and why?
web server
which server layer deals with HTTP requests and responses only?
application server
which server layer deals with several communication protocols one of which is HTTP?
there is no indication of the severity of the issues, no structure, and they are hard to track
why can't we just use print statements instead of distributed logging systems?
there is a cost to queue storage and some tasks need to be repeatedly scheduled at regular intervals and so need to be stored for long periods
why do we use a database in task scheduling instead of just storing everything in the queue
because it makes a great tradeoff between speed, storage size, and cost
why is RAM a good choice for serving cached data as opposed to CPU register or Magnetic Disk
all API requests pass through the rate limiter. Therefore it needs minimum latency so as not to affect the user experience
why is low latency critical for rate limiters
it provides a much shorter delay
why might a DNS system favor the unreliable UDP communication protocol over something like TCP that uses a handshaking dialogue?
statement-based replication, write-ahead log shipping, logical (row-based) log replication
name three different types of primary-secondary replication
kilobyte
2^10 bytes
gigabyte
2^30 bytes
petabyte
2^50 bytes
megabyte
2^20 bytes
terabyte
2^40 bytes
the percentage of time that a system is accessible to clients and operates under normal conditions
Define availability
dependent and independent tasks
broadly, tasks can be of two types. What are they?
a fault tolerance technique that saves a system's state in stable storage when the system state is consistent
checkpointing
what are the disadvantages of the token bucket algorithm for rate limiting
choosing an optimal value for the parameters is a difficult task. Taking a token from the bucket might require a lock, which can increase the request's processing delay if contention on the lock increases
they can perform global traffic management between different time zones
cloud based load balancers may not necessarily replace a local on-premise load balancing facility. In that case, what is their primary role?
redis runs as a single process using one core whereas memcached can efficiently use multicore systems with multithreading technology
compare multithreading on memcached and redis
redis provides persistence while memcached does not, although it can be achieved with third party tools
compare persistence of memcached and redis
redis automates replication whereas in memcached it is subject to third party tools
compare replication in redis vs memcached
memcached is simple but leaves most of the effort for managing clusters to the developer. Redis automates most of the scalability and data division issues
compare the simplicity of memcached and redis
what are the steps that occur when a user writes a blob to blob store
if the request successfully passes the rate limiter, the load balancer forwards the client's request to one of the front end servers. The front end server then sends the request to the master node. The master node assigns a unique ID and splits the data into chunks. The chunks are replicated for redundancy. The master node stores metadata in metadata storage. After the blob is written, a fully qualified path is returned to the client.
top-level domain name servers
in a DNS, what type of name server do root level servers forward queries to?
blob storage
in a client side monitoring service, what kind of database will we use to store the time series data?
the URI namespace delegation of all objects cached in the CDN
in a content delivery network, what does the origin server provide to the request routing system
filterer, error aggregator, alert aggregator
in a distributed logger, what are the three services that will work on the pub-sub data
least response time
in a performance sensitive service, which load balancing algorithm might be preferred?
a queue
in a pub-sub system, what is used to maintain each topic
relational
in a pub-sub system, what type of database (relational or non-relational) is used to maintain which consumers have subscribed to which topics
by maintaining a set of rules in a rules and actions database
in a service side distributed monitoring system, how can we ensure that our system will send out an alert when a metric breaches a critical value
execute these tasks in off-peak resource times
in a social app like facebook, some tasks do not need to be executed urgently such as recommending friends. What is a good way to execute these tasks in a resource efficient manner?
execution caps
in a task scheduler, some task scripts may never halt due to a bug in the script. What is the solution?
monitoring the health of servers directly and handling TCP congestion control protocols
in addition to distributing traffic to servers, what is the role of tier three load balancers which sit at layer 7?
we mark blobs as deleted and they will be garbage collected later
in blob storage, deleting from many nodes takes time and holding a client until that is done is not a viable option. What is the solution?
the data node
in blob storage, where does the garbage collection process run: at the master node, the data node, or the monitoring system?
partitions
in blob store, there are a large number of data nodes on which blobs can be stored. It would take a while to search each one for a particular blob. What method is used to overcome this?
metadata storage
in blob store, where are the partition mappings stored?
embedding two probers in the application which will be an agent and a collector
in client side error monitoring, what is an improvement on the simple prober approach?
these components extract and filter terms from the partitions assigned to them by the cluster manager. These machines output inverted indexes in parallel which serve as input to reducers.
in mapreduce, what is the role of mappers?
the reducer combines mappings for various terms to generate a summarized index
in mapreduce, what is the role of reducers
the manager initiates the process by assigning a set of partitions to mappers. Once the mappers are done, the cluster manager assigns the output of mappers to reducers
in mapreduce, what is the role of the cluster manager?
using a time window approach where we have to sort messages received within a specific time frame and then put them in the relevant queue
in messaging queues, we frequently need to sort messages based on timestamps. How can we minimize latency based on this extra processing time?
we can tag a unique process identifier with the time stamp
in messaging queues, we will use time stamps based on synchronized clocks to order incoming messages. How can we ensure that we order messages when two concurrent sessions ask for a timestamp at the exact same time?
asynchronous writes
in the case of distributed caches, do we prefer synchronous or asynchronous writes to replica servers to copy data?
any random host within the cluster
in the cluster of independent hosts model, which component is responsible for replicating messages in the other nodes?
placing CDN proxy servers in ISP network
in the context of CDN deployment, what does off-premises mean?
a smaller data center could be placed near major IXPs
in the context of CDN deployment, what does on-premises deployment mean
time to live
in the periodic polling method of CDN content consistency, a lot of bandwidth is unnecessarily consumed by polling for content that does not change frequently. What is a strategy that addresses this issue
sign bit, 41 bits for time stamp in milliseconds, 10 bits for a worker number which identifies the server, and 12 bits for the sequence number which is incremented by the server each time
in the twitter snowflake sequencer method, what are the different types of bits used for encoding
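The bit layout in the card above can be sketched with plain shifts and masks. This is an illustrative encoding only: the function names are hypothetical, and the timestamp is assumed to be milliseconds since some custom epoch that fits in 41 bits.

```python
TIMESTAMP_BITS, WORKER_BITS, SEQUENCE_BITS = 41, 10, 12


def make_id(timestamp_ms, worker, sequence):
    assert 0 <= timestamp_ms < 1 << TIMESTAMP_BITS
    assert 0 <= worker < 1 << WORKER_BITS
    assert 0 <= sequence < 1 << SEQUENCE_BITS
    # The sign bit stays 0, so the result is a non-negative 64-bit integer.
    return (
        (timestamp_ms << (WORKER_BITS + SEQUENCE_BITS))
        | (worker << SEQUENCE_BITS)
        | sequence
    )


def split_id(uid):
    sequence = uid & ((1 << SEQUENCE_BITS) - 1)
    worker = (uid >> SEQUENCE_BITS) & ((1 << WORKER_BITS) - 1)
    timestamp_ms = uid >> (WORKER_BITS + SEQUENCE_BITS)
    return timestamp_ms, worker, sequence
```

Because the timestamp occupies the high bits, IDs generated in a later millisecond always compare greater than IDs from an earlier one, regardless of worker or sequence number.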
non-persistent
is a cache a persistent or non persistent storage area?
client, rate limiter, load balancer, front-end servers, data nodes, master node, metadata storage, monitoring service, administrator
list the key components of blob store design
edge side includes (ESI) markup language
name a popular dynamic data compression technique for caching data in a CDN
routing system, scrubber servers, proxy servers, distribution system, origin servers, management system
name the components of a content delivery network (CDN)
monitor anomalies in the use of CPU/memory/disk/network bandwidth by a process, monitor overall server health, monitor hardware component faults on a server such as memory failures, monitor server's ability to reach out-of-server critical services such as a network file system
what are some examples of things we want our monitoring system to do for us in distributed monitoring?
failure in DNS name resolution, any failure in routing along the path from the client to the service provider, any failures with third-party infrastructure such as middleboxes and CDNs
what are some factors that can cause failure in clients being unable to reach the server?
temperature differences, equipment age, manufacturing defects, virtualized clocks
what are some of the causes of physical clock drift
proxy servers usually serve content from RAM, CDN proxy servers are usually placed near the users, the request routing system ensures that users are directed to the nearest proxy server, proxy servers have long-tail content stored in nonvolatile storage systems
what are some of the design decisions in CDNs that minimize latency
preventing resource starvation, managing policies and quotas, controlling data flow, and avoiding excess costs
what are some scenarios where a rate limiter would be useful?
DNS redirection, anycast, client multiplexing, HTTP redirection
what are some techniques that can be used to route users to the nearest proxy server in a CDN?
the advantage is that keys are uniformly distributed across nodes. The disadvantage is that we cannot perform range queries with this technique
what are the advantages and disadvantages of hash based sharding
flexibility because the database can be modified while queries are happening, reduced redundancy, concurrency, backup and disaster recovery
what are the advantages of relational databases
it allows a burst of traffic as long as there are enough tokens. It is space efficient because it keeps only a small amount of state
what are the advantages of the token bucket algorithm for rate limiters
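A single-threaded sketch of a token bucket, assuming time is passed in explicitly and starts at 0. The class name and parameters are illustrative; a concurrent version would need a lock around `allow`, which is the contention cost the token bucket disadvantages card mentions.

```python
class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)   # bucket starts full
        self.last = 0.0                 # time of the previous call

    def allow(self, now):
        # Refill for the elapsed time, never exceeding capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The whole limiter state is two numbers (`tokens` and `last`), which is why the algorithm is considered space efficient, and a full bucket lets a burst of up to `capacity` requests through at once.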
an inverted index facilitates full-text searches and reduces the time of counting the occurrence of a word in each document because we have mappings against each term
what are the advantages of using an inverted index in search?
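As a rough illustration of the term-to-document mapping described in the card above, here is a tiny in-memory inverted index. The whitespace tokenizer and the sample documents are assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to {doc_id: [positions where the term occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


docs = {
    1: "search engines index documents",
    2: "an inverted index maps terms to documents",
}
index = build_inverted_index(docs)
```

Looking up `index["index"]` returns the matching documents and positions directly, so counting occurrences of a term does not require scanning every document.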
there is flexibility in terms of hardware choices for each functionality and it is possible to scale application servers and cache servers separately
what are the advantages of using dedicated cache servers for sharding?
range-query-based scheme is easy to implement and range queries can be performed using the partition keys
what are the advantages of using key-range based data sharding?
if a node fails or does routine maintenance, the workload is uniformly distributed over other nodes. It is up to each node to decide how many virtual nodes it is responsible for which allows nodes with more computational capacity to take more of the load
what are the benefits of using virtual nodes in consistent hashing?
a cache serves data from RAM while a key-value store writes data to non-volatile storage. Key-value stores should survive failures while caches must be repopulated from scratch
what are the differences between a cache and a key value store
there is storage overhead and additional processing time up front for adding a new document
what are the disadvantages of using an inverted index in distributed search?
impedance mismatch
what are the downsides of relational databases
the content provider cannot do anything if the public CDN is down, users in a region get no benefit if no proxy servers are located where their traffic comes from, and it is possible that some domains or IP addresses of CDN providers are blocked or restricted in some places
what are the downsides of using a public CDN
incomplete coverage and lack of user imitation
what are the downsides of using probers to mimic user behavior in client side monitoring?
concurrent management to separate users, we cannot grant different access rights, scaling, content search is difficult
what are the limitations of file storage?
a log accumulator, storage, a log indexer, and visualization
what are the major components of a distributed logging system?
scalable, available, and fault tolerant
what are the non functional requirements of a key-value store
they minimize user perceived latency, pre-generate expensive queries from the database, store user session data temporarily, serve data from temporary storage even if the data store is down temporarily, reduce network costs by serving data from local resources
what are the primary benefits of a distributed cache?
uniqueness, scalability, availability, and a 64 bit ID
what are the requirements for a unique identifier system
requirements, estimation, storage schema, high level design, api design, detailed design, evaluation, distinctive component
what are the steps in the RESHADED approach
configurable service where different applications can choose between the tradeoffs of consistency and availability, the ability to always write to the key value store, and hardware heterogeneity
what are the three functional requirements of a key value store
a data collector service, a time series database, a querying service, a rules and actions database, a blob store, and a service discoverer like kubernetes, an alert manager, and a dashboard
what are the three high level components of a client side distributed monitoring service?
account layer, container layer, blob layer
what are the three levels of abstraction in blob store?
a crawler, an indexer, and a searcher
what are the three main components of a search system?
periodic polling, time to live, and leases
what are the three methods for maintaining content consistency in a CDN?