Distributed Systems

(total elapsed time - downtime) / Number of Failures

How is MTBF calculated?

Maintenance Time / Number of repairs

How is MTTR calculated
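
The two formulas above can be checked with a small worked example. This is a minimal sketch: the function names and the sample numbers are illustrative assumptions, not values from the cards.

```python
def mtbf(total_elapsed_hours: float, downtime_hours: float, failures: int) -> float:
    """Mean time between failures: (total elapsed time - downtime) / number of failures."""
    return (total_elapsed_hours - downtime_hours) / failures


def mttr(maintenance_hours: float, repairs: int) -> float:
    """Mean time to repair: total maintenance time / number of repairs."""
    return maintenance_hours / repairs


# Hypothetical month: 720 hours elapsed, 6 hours of downtime spread over 3 failures.
print(mtbf(720, 6, 3))  # 238.0
print(mttr(6, 3))       # 2.0
```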

scrubbing server

In a DNS system, after the routing system sends the client the IP address of the nearest proxy server, what component handles the request next

the routing system

In a DNS system, when a client first wants to send a request, what is the first component it is sent to?

for any connection, all incoming packets are forwarded to the same tier 3 load balancer

In a typical data center load balancing system, there are three tiers of load balancers. What is the role of the second tier which sits at layer 4 of OSI

less hardware infrastructure/layers are required to do load balancing, network latency will be reduced because of no intermediate hop, with a dedicated load balancing layer there can be traffic issues

In designing twitter, client side load balancing is utilized. What are the advantages of this approach?

uniqueness and scalability

In the context of a sequencer, what is the problem with using multiple databases that all start at different values and increment by the number of databases? What requirements are violated?

graph databases can be used in social applications

What are popular applications of graph databases?

Amazon DynamoDB, Redis, and memcached DB

What are some popular key-value databases

they are quite expensive, the configuration requires additional human resources, availability can be an issue as additional hardware is required in the case of failure

What are the downsides of hardware based load balancers

lack of standardization and consistency

What are the drawbacks of noSQL databases?

DNS resolver, root-level name servers, top-level domain name servers, authoritative name servers

What are the four types of servers in the DNS hierarchy?

user enters a domain name in the browser, browser sends DNS query to ISP, ISP forwards DNS query to DNS infrastructure, DNS infrastructure sends list of IP addresses to ISP, ISP forwards list to browser, browser responds with HTTP request to ISP, ISP forwards HTTP request to web server

What are the steps in the high level flow of a dns

caching, server replication, and protocol

What are the three main things that make DNS a reliable system

MTBF and MTTR

What are the two ways of measuring system reliability

DNS uses caching at different layers and DNS name servers are in a hierarchical form

What are two aspects of domain name systems that allow them to scale well and cater to the requests of the entire internet?

load balancers are typically deployed in pairs. To maintain availability, enterprises will also use clusters of load balancers that use heartbeat communication. If the entire cluster fails, manual rerouting must occur

What happens if a load balancer fails? Are they not a single point of failure?

128-bit numbers make primary-key indexing slower, we cannot guarantee uniqueness, and they are not monotonically increasing

What is the problem with using a UUID for a sequencer?

multiple queues based on task categories

What is the solution to the fact that in a task scheduler, some tasks need to be processed with more urgency than others further ahead in a FIFO queue

UDP has no handshaking dialogues and therefore exposes the user to the unreliability of the underlying network

What makes UDP a more unreliable communication protocol as compared with TCP?

pull

What type of CDN strategy (push or pull) is appropriate when dynamic content is being requested?

applications that require a large volume of semi-structured or unstructured data, low latency, and flexible data models

What type of applications require noSQL databases?

HTTP responses

What type of content do web servers serve?

in the browser, the operating system, and the local name server. Some frequently visited services are also cached in the ISP DNS resolvers themselves

Where are DNS requests typically cached?

application server

Which server layer deals with longer running processes that are more resource intensive?

small additional costs are required to implement shadow load balancers on commodity hardware

Why isn't availability an issue when it comes to software based load balancers

99%

Youtube: What uptime % would be considered good as we consider availability?

the client could be malicious, if the uploaded video is a duplicate it can be filtered out, encoders will only be available on a private IP address within Youtube's network

Youtube: when a user uploads a video, it is passed through a server before hitting the encoder. Why is this the case?

simple design that does not require dealing with impedance mismatch, easy horizontal scaling, high availability, and lower cost because you do not need to pay for an RDBMS and many NoSQL databases are open source

advantages of NoSQL databases

a document-term index

an inverted index is an improvement on a document-level index in a distributed search system. What does an inverted index make use of?

health checking of servers, TLS termination, predictive analytics on traffic patterns, reduced human intervention in the case of failure handling, increased security

aside from ensuring that services will be scalable, available, and highly performant, what are key benefits and services of load balancers?

99.9%

at what level do cloud providers like google, microsoft, and amazon set their availabilities in their SLAs?
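
To make an SLA percentage concrete, the downtime it permits per year can be derived from it. This is a minimal sketch: the function name and the 365-day year are assumptions made for illustration.

```python
def allowed_downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a service may be down while still meeting the given availability."""
    hours_per_year = 365 * 24  # assumes a non-leap year
    return hours_per_year * (100 - availability_percent) / 100
```

Under these assumptions, 99.9% allows roughly 8.76 hours of downtime per year, while 99% allows roughly 87.6 hours.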

web layer, application layer, and database layer

at what three levels will a cache generally be present in a system?

layer 4 (transport) or layer 7 (application)

at what two layers of the OSI can load balancers be placed?

a term is a word kept for indexing after frequently occurring words that carry little meaning have been thrown away

define "term" in the context of a document-term matrix in distributed search

DNS redirection is a method that can be used to route users to the nearest proxy servers in a content delivery network. In step one, a client is mapped to the appropriate network location. In step two, that location distributes the load over proxy servers.

define DNS redirection

the network time protocol is a networking protocol for clock synchronization between computer systems over packet-switched, variable latency data networks

define NTP

the broker server is the core component of the pub-sub system. It handles read and write requests. A broker will have multiple topics where each topic can have multiple partitions associated with it.

define a broker in the context of a pub-sub system

a cache that embeds cache and service functionality within the same host

define a co-located cache

a report collector independent of the primary service. It is made independent to avoid the situations where client agents want to report an error to a failed service

define a collector in the context of client side monitoring

columnar databases store data in columns instead of rows, which enables quick and efficient access to all entries in a database column

define a columnar database

a group of geographically distributed proxy servers where a proxy server is an intermediate server between a client and the origin server

define a content delivery network

a mathematical matrix that represents the frequency of terms in a list of documents

define a document-term matrix in the context of distributed search
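
A document-term matrix can be built in a few lines. This is a minimal sketch: the function name, the whitespace tokenizer, and the sample documents are assumptions for illustration, and no frequently occurring words are filtered out here.

```python
from collections import Counter


def document_term_matrix(documents):
    """Return (terms, matrix) where matrix[i][j] counts terms[j] in documents[i]."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    # Columns are the sorted union of all terms seen in any document.
    terms = sorted(set().union(*counts))
    matrix = [[count[term] for term in terms] for count in counts]
    return terms, matrix


terms, matrix = document_term_matrix(["red fish blue fish", "blue sky"])
print(terms)   # ['blue', 'fish', 'red', 'sky']
print(matrix)  # [[1, 2, 1, 0], [1, 0, 0, 1]]
```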

the internet's naming service that maps human-friendly domain names to machine-readable IP addresses

define a domain name system

key value stores are distributed hash tables where a key binds to a specific value and does not assume anything about the structure of the value

define a key value store as a distributed systems building block

an agent that collects logs from each node and dumps them into storage. This way we do not need to visit each node if we want to know about a particular event

define a log accumulator

an intermediate component between the interacting entities known as producers and consumers

define a messaging queue

a rate limiter puts a limit on the number of requests a service will fulfill and throttles requests that cross the limit

define a rate limiter

each counter has a specific number of shards as needed. These shards are run on different computational units in parallel.

define a sharded counter
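
The idea of a sharded counter can be sketched in a single process. The class and method names below are hypothetical, and the shards here are just list slots; in a real system each shard would live on a different computational unit.

```python
import random


class ShardedCounter:
    def __init__(self, num_shards: int) -> None:
        # One partial count per shard.
        self.shards = [0] * num_shards

    def increment(self, amount: int = 1) -> None:
        # Random selection of the shard that absorbs this write.
        self.shards[random.randrange(len(self.shards))] += amount

    def value(self) -> int:
        # A read has to visit every shard and sum the partial counts.
        return sum(self.shards)
```

Writes touch a single shard, which is what spreads the write load; reads touch all of them, which is why reads are the slower path.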

a persistent sequence of messages stored in the local storage of a broker

define a topic in the context of publisher-subscriber systems

the data is first written to the cache and asynchronously written to the database. Inconsistency is inevitable when the client reads stale data from the database

define a write back cache and its considerations

writes on the cache as well as the database. This increases write latency but ensures strong consistency between the database and the cache

define a write through cache and its considerations
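
The difference between write-through and write-back can be shown with plain dictionaries standing in for the cache and the database. This is a minimal sketch: the class names and the explicit `flush` method are assumptions, and the asynchronous write of a write-back cache is modeled as a deferred call.

```python
class WriteThroughCache:
    def __init__(self, database: dict) -> None:
        self.cache = {}
        self.database = database

    def write(self, key, value) -> None:
        # Both stores are updated before the write returns.
        self.cache[key] = value
        self.database[key] = value


class WriteBackCache:
    def __init__(self, database: dict) -> None:
        self.cache = {}
        self.database = database
        self.dirty = set()  # keys written to the cache but not yet to the database

    def write(self, key, value) -> None:
        # Only the cache is updated; the database lags behind.
        self.cache[key] = value
        self.dirty.add(key)

    def flush(self) -> None:
        # Stands in for the asynchronous write to the database.
        for key in self.dirty:
            self.database[key] = self.cache[key]
        self.dirty.clear()
```

Until `flush` runs, a reader that goes straight to the database sees the old value, which is the inconsistency the write-back card describes.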

a prober embedded in the client application that sends the appropriate service reports about any failures

define an agent in the context of client side monitoring

encoders and transcoders compress videos and transform them into different formats and qualities to support varying numbers of devices according to their screen resolution and bandwidth

define an encoder

apache kafka is a framework for distributed real-time event processing and storage

define apache kafka

a database transaction is considered an atomic unit and if one statement fails, the whole transaction is aborted and rolled back

define atomicity in regards to databases

messages are placed in the queue in the order they are received which may not necessarily be the order they were sent

define best effort ordering

a storage solution for unstructured data where every type of data is stored as a binary large object

define blob storage

consistent hashing assigns each server or item in a distributed hash table a place on an abstract circle called a ring

define consistent hashing
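
A hash ring can be sketched with a sorted list and binary search. The names `ring_position` and `HashRing` are hypothetical, SHA-256 is an arbitrary choice of hash, and each server occupies a single position on the ring.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a server name or item key to a position on the ring."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes) -> None:
        # Sorted (position, node) pairs represent the ring.
        self._ring = sorted((ring_position(node), node) for node in nodes)

    def add(self, node: str) -> None:
        bisect.insort(self._ring, (ring_position(node), node))

    def lookup(self, key: str) -> str:
        # Walk clockwise to the first node at or after the key, wrapping around.
        positions = [position for position, _ in self._ring]
        index = bisect.bisect_right(positions, ring_position(key)) % len(self._ring)
        return self._ring[index][1]
```

When a server is added, the only keys that change owner are the ones that now map to the new server; every other key stays where it was.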

the system should guarantee that completed transactions will survive permanently in the database even in system failure events

define durability in regards to databases

edge side includes is a popular dynamic content caching strategy in CDNs. Since many dynamic content web pages only change in small portions, ESI caches the portions that do not change and only notifies of the portions that do.

define edge side includes markup language

the number of requests can cross the predefined limit if the system has excess resources available

define elastic throttling

the distribution of traffic load across multiple geographic regions

define global server load balancing

there is a hard limit on the number of API requests. So whenever a request crosses the limit, it is discarded

define hard throttling

scaling by increasing the number of machines in the network

define horizontal scaling

horizontal sharding is used to divide a table into multiple tables row wise

define horizontal sharding

impedance mismatch occurs in relational databases when there is a difference between the relational model and the in-memory data structures

define impedance mismatch

a characteristic of operations that allows them to be applied several times without influencing the outcome

define idempotence

the organization and manipulation of data that's done to facilitate fast and accurate information retrieval

define indexing in the context of a search system.

in the case of multiple concurrent transactions, it should give the same result as if they were run in isolation

define isolation in regards to databases

load balancing achieved within a data center

define local load balancing

all secondary nodes replicate the actual data changes.

define logical log replication

the probability that the service will restore its functions within a specified time of fault occurrence

define maintainability

in power of two load balancing, two nodes are selected at random and the one with less load is chosen. This simple technique is exponentially better than random selection

define power of two client side load balancing
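
The selection step of power of two load balancing is short enough to write out. This is a minimal sketch: `pick_server` and the `loads` mapping (server name to current load) are hypothetical names, and how load is measured is left open.

```python
import random


def pick_server(loads: dict, rng: random.Random) -> str:
    """Sample two distinct servers at random and return the less loaded one."""
    first, second = rng.sample(list(loads), 2)
    return first if loads[first] <= loads[second] else second
```

For example, `pick_server({"a": 5, "b": 1}, random.Random())` always returns `"b"`, because with only two servers both are sampled and the lighter one wins.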

an asynchronous service-to-service communication method that is popular in serverless and microservices architecture

define pub-sub messaging

the probability that the service will perform its functions for a specified time

define reliability

refers to keeping multiple copies of data at various nodes which are preferably geographically distributed to achieve availability, scalability, and performance

define replication in regards to databases

each request is forwarded to a server in the pool in a repeating sequential manner

define round robin load balancer scheduling

the ability of a system to handle an increasing amount of workload without compromising performance

define scalability

the number of requests can exceed the predefined limit by a certain percentage

define soft throttling

the primary node saves all statements that it executes and sends them to the secondary nodes to perform

define statement-based replication which is a type of primary-secondary replication

in strict ordering messages are placed in a messaging queue in the order they were produced usually based on an attached timestamp

define strict ordering in the context of messaging queues

comprised of clearly defined data types with patterns that make them easily searchable. Typically quantitative data

define structured data

the primary node waits for acknowledgements from the secondary nodes about updating the data. After receiving acknowledgement from all secondary nodes, the primary node reports success to the client

define synchronous replication

some applications provide a different level of service to users based on their IP addresses. In that case, hashing the IP address is performed to assign users' requests to servers

define the IP hash method of load balancing
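
IP hashing reduces to hashing the client address and taking the result modulo the number of servers. This is a minimal sketch: `server_for_ip` is a hypothetical name and SHA-256 is an arbitrary choice of hash.

```python
import hashlib


def server_for_ip(client_ip: str, servers: list) -> str:
    """Map a client IP address to one server; the same IP always gets the same server."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]
```

The mapping is stable only while the server list stays the same; changing the number of servers remaps most clients.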

anycast is a routing methodology in which all edge servers located in multiple locations share the same single IP address. The Border Gateway Protocol is then used to route clients based on the internet's natural network flow

define the anycast method for routing clients to the nearest proxy server in DNS

they adhere to a particular schema before storing data. Instances are stored in rows, each of which has a unique key identifying that tuple which can be linked to other tables

define the key factors of a relational database

content gets sent automatically to the CDN proxy servers from the origin server in the push model

define the push CDN content caching strategy

servers are unaware of each other, there is no synchronization, data sharing, and communication between the servers

define the shared-nothing architecture utilized by memcached

scaling by providing additional capabilities such as CPUs or RAM to an existing device

define vertical scaling

memory hardware that fetches and stores data at high speed but is lost once the system is turned off

define volatile computer memory

if a server has a higher capacity for serving clients' requests, it is given a higher weight. The load balancer forwards requests according to the weight of each node

define weighted round-robin load balancing
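
The simplest form of weighted round-robin repeats each server in the rotation as many times as its weight. This is a minimal sketch: the function name and the integer weights are assumptions, and production load balancers often interleave the picks more evenly than this.

```python
import itertools


def weighted_round_robin(weights: dict):
    """Yield server names forever, each appearing `weight` times per cycle."""
    expanded = [server for server, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)


rotation = weighted_round_robin({"big": 2, "small": 1})
print([next(rotation) for _ in range(6)])
# ['big', 'big', 'small', 'big', 'big', 'small']
```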

data is written only to the database and written to the cache when there is a cache miss later. Not a favorable strategy for reading recently updated data

define write around cache and its considerations

the primary node saves the query before executing it in a log file known as a write ahead log file. It then uses these logs to write data onto the secondary nodes

define write-ahead log shipping

cassandra is PA/EL

describe cassandra DB in terms of the PACELC theorem

mongoDB is PA/EC

describe mongoDB in terms of the PACELC theorem

redis does not have strong consistency due to its use of asynchronous replication

describe the consistency level of redis and why

a client sends a task to a rate limiter, which accepts it. The task is sent to the task submitter, which gets a UUID from a generator. The task is then sent to a database. The task is then batched and prioritized before being placed in a distributed queue. The task is then sent to the cloud resources for processing

describe the flow of a task scheduler

the local server itself queries the root server, then the top-level domain server, then the authoritative server

describe the iterative process of DNS resolution

a burst of requests can fill the bucket and if not processed in the specified time, recent requests can take a hit. determining the optimal bucket size and outflow rate is a challenge

disadvantages of the leaking bucket algorithm for rate limiting
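
The disadvantages above are easier to see against a concrete bucket. This is a minimal sketch of the usual queue-based formulation: the class name, the `tick` method standing in for the passage of time, and the parameter names are all assumptions.

```python
from collections import deque


class LeakyBucket:
    def __init__(self, capacity: int, leak_per_tick: int) -> None:
        self.capacity = capacity            # bucket size
        self.leak_per_tick = leak_per_tick  # outflow rate
        self.queue = deque()

    def offer(self, request) -> bool:
        """Queue the request, or reject it if the bucket is already full."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def tick(self) -> list:
        """Release up to `leak_per_tick` of the oldest requests for processing."""
        count = min(self.leak_per_tick, len(self.queue))
        return [self.queue.popleft() for _ in range(count)]
```

A burst that fills the queue causes every later `offer` to fail until `tick` drains some entries, and both `capacity` and `leak_per_tick` have to be tuned by hand.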

no. Usually they use the services of a CDN provider like Akamai or Cloudflare

do most companies build their own CDNs?

most content providers use both to get the benefits of each approach

do most content providers use a push or pull CDN caching approach?

static content

do web servers typically serve static or dynamic content?

read throughput

does MongoDB have a higher read throughput or a higher write throughput?

a system's ability to execute persistently even if one or more of its components fail

fault tolerance

availability

for key-value stores do we prefer consistency or availability?

we introduce secondary monitoring systems which pull from the application. These then push to a datacenter monitoring system which pushes to a global monitoring system

how can we utilize a hybrid push/pull approach to improve on scalability in a monitoring service

CDNs cache content that serves as a backup whenever the origin server fails. If a proxy server stops working, other operational proxy servers step in and continue to drive traffic

how do CDNs guarantee availability

we partition the blobs based on the complete path of the blob which is the combination of the account ID, container ID, and blob ID

how do we ensure that data with the same user or container is stored in the same partition in blob store systems?

we can periodically read all the shards of a counter and cache the results

how do we improve on the low read throughput and high read latency that is inherent to sharded counters?

as a string which means all non string data must be serialized

how does memcache store data?

13 (managed by 12 different organizations)

how many DNS logical root name servers are there spread strategically around the world?

1000

how many QPS can mySQL handle?

8000

how many requests per second can a single server handle

8000

how many users per day should we estimate that a server can handle

netflix might be able to store more than 90% of its content in the CDN while this is not feasible for Youtube due to the volume of content

how might netflix and youtube have different strategies for caching their content in a content delivery network?

a few MBs

how much data can we usually store in the value component of a key-value store

the length of the network path and the capacity (bandwidth) limits along the network path

network distance between a user and a proxy server in a CDN is a function of what two variables?

pagination

often a user will want a list of blobs matching some condition. However, when these lists become very large this can affect performance. What is a solution to this problem?

certain content requires the execution of scripts that can be executed at proxy servers instead of the origin server

other than edge side includes, what is a popular optimization technique for dynamic content caching in CDNs?

CPU register, CPU cache, RAM, SSD, Magnetic Disk

rank the five types of memory from fastest to slowest

No, because it cannot possibly meet the data storage and query requirements, and failure of the one mega server would result in downtime for everyone

should key value servers be located on a single node and why?

pipelining

since redis uses a client-server model, a request cannot be handled until the server responds to the prior request. What is the method that Redis uses to get around this?

true

true or false, DNS can be considered a global server load balancer?

true

true or false, blobs are broken up into a constant amount of chunks?

false

true or false, in nonvolatile memory the processor has direct access to the data

true

true or false, weighted round robin is a static load balancing algorithm?

within the premise of a users network

typically, where are DNS resolvers located?

throttle rule retriever

under a rate limiter system, assume that the service owner adds some extra rules to the database. Consequently data in the cache will become outdated. Which components are responsible for updating the cache

right after it finishes updating itself

under asynchronous replication, when does a primary node report success to the client?

when the workload is read-heavy

under what circumstances is primary-secondary replication appropriate?

root-level name servers

what DNS name servers hold mappings for the top-level domains such as .com, .edu, .us, etc.

receiving logs, storing those logs locally, pushing the logs to a pub-sub system

what actions is a log accumulator responsible for

the sliding window log algorithm

what algorithm proposes a solution to the window edge problem that occurs in the fixed window counter algorithm
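
A sliding window log keeps the timestamp of each accepted request and counts only those inside the trailing window. This is a minimal sketch: the class and parameter names are hypothetical, timestamps are passed in explicitly, and rejected requests are not logged (some variants log them too).

```python
from collections import deque


class SlidingWindowLog:
    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.log = deque()  # timestamps of accepted requests, oldest first

    def allow(self, now: float) -> bool:
        # Drop timestamps that have fallen out of the trailing window.
        while self.log and self.log[0] <= now - self.window_seconds:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True
```

With a limit of 2 per 60 seconds, requests at t=58 and t=59 are accepted and one at t=61 is rejected. A fixed window that resets at t=60 would have let it through.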

redundant cache servers

what allows for the high availability of distributed caches?

sharding of the cached data

what allows for the scalability of distributed caches?

Google Cloud Firestore and MongoDB

what are examples of document databases?

dynamic algorithms are far better because they maintain the state of serving hosts and therefore are worth the extra effort and complexity

what are more effective, dynamic or static load balancing algorithms?

local or default servers

what are other names for DNS resolvers?

least recently used, most recently used, least frequently used, most frequently used

what are some cache eviction strategies
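
Of these strategies, least recently used is the one most often written out by hand. This is a minimal sketch built on `collections.OrderedDict`; the class name and the `None` return on a miss are assumptions.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items = OrderedDict()  # least recently used first

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value) -> None:
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry
```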

round-robin scheduling, weighted round-robin, least connections, least response time, IP hash, URL hash

what are some examples of popular load balancing algorithms?

conflict avoidance where all writes for a given record go through the same leader, last-write-wins, custom logic

what are the three methods of conflict resolution in a multi-leader database?

account metadata, container metadata, blob metadata

what are the three types of metadata that the metadata storage in a blob store holds?

operability, lucidity, modifiability

what are the three underlying aspects of maintainability

a primary-secondary model and a cluster of independent hosts model

what are the two approaches for performing message replication in a cluster of different hosts in a messaging queue

service-side errors and client-side errors

what are the two broad categories of monitoring focus for distributed systems?

synchronous replication within a storage cluster and asynchronous replication across data centers and regions

what are the two levels of replication that occur in blob storage?

document partitioning and term partitioning

what are the two methods for partitioning in a distributed indexing/search system?

dedicated cache servers and co-located cache

what are the two methods for sharding in cache clusters

separation of search and indexing nodes and not recomputing indexes when making replicas

what are the two things that need to be fixed in order to make a distributed search system scalable

a relational database and a graph database

what are the two types of databases necessary in a task scheduling system

key-range based sharding, hash based sharding, and consistent hashing

what are the types of horizontal sharding?

very efficient for a large number of aggregation and data analytics queries by drastically reducing the disk I/O requirements and the amount of data required to load from the disk

what are the use cases for columnar databases?

the data model maps directly to objects in code, support for ubiquitous JSON documents, the high flexibility of the schema where not all documents have to have the same fields

what are three differentiators (advantages) that document databases have over relational databases?

round-robin selection, random selection, metrics-based selection

what are three methods for assigning write requests to shards in a sharded counter?

data structure storage (memcached stores all data as strings), database for persistent storage, and a message broker that can translate millions of messages per second

what are three of the key features of redis

use appropriate authentication and resource authorization, consider code sandboxing using dockers or virtual machines, use performance isolation between tasks by monitoring resource utilization

what are three ways to deal with untrusted tasks in a task scheduler

best effort ordering and strict ordering

what are two approaches to message ordering in messaging queues

low latency and high availability

what are two non functional requirements that are needed for both youtube, netflix and also any other streaming service?

a locking mechanism (suboptimal due to performance issues) and the set then get approach

what are two potential solutions to the race condition problem of rate limiters

when we have to maintain changes in the replicated data over time

what causes the main problem in data replication?

blob indexing

what component in blob store solves the problem of blob management and querying (finding a blob in a sea of blobs)?

segments help identify the start and end of a message using an offset address

what do segments do within partitions in the context of a pub-sub system

clients can still read from secondary nodes if the primary node has failed

what does it mean that primary-secondary replication is read resilient?

a cache client

what entity performs the hash calculations for caching?

proxy servers usually serve content from RAM, CDN proxy servers are placed near the users to provide faster access to content, request routing ensures users are directed to the nearest proxy server, long-tailed content is stored in nonvolatile storage instead of directing to the origin server

what factors of a CDN help it achieve high performance?

the system needs to detect such cases where a counter unexpectedly starts getting very high write traffic. We'll dynamically increase the number of shards of the affected counter to mitigate the situation

what happens when a user with just a few followers has a post go viral on twitter (in the context of counters)

IDs that would have been generated in a dead period are wasted

what is a con of using the twitter snowflake method to encode causality in unique identifiers

randomly assigning nodes in the ring may cause non-uniform distribution

what is a disadvantage of consistent hashing?

session oriented applications like web applications

what is a good application of a key-value database?

LRU cache because the most recently uploaded data is the most likely to get views

what is a good cache eviction methodology for social application

bigtable because of its high throughput and scalability for storing key-value data

what is a good database for storing thumbnails and why

bounded waiting time for users

what is a non functional requirement that is unique to distributed task schedulers?

quorums

what is a popular approach for solving the concurrent write problem in leaderless replication database systems?

a single point of failure

what is a problem with using a central database's auto-increment feature to generate unique IDs for a sequencer?

when message size exceeds the original packet size of 512 bytes

what is a time where a DNS may use the TCP communication protocol over the UDP (user datagram protocol)?

vertical sharding is used to increase the speed of data retrievals from a table consisting of columns with very wide text or a binary large object

what is a time where vertical sharding may be useful?

Google's TrueTime API, which returns confidence intervals of time rather than points

what is a unique identifier for sequencers that guarantees all of the required properties including causality

if we use a push model to push metrics from the services to the data collector service

what is a way a service side monitoring service can be overwhelmed and become a bottleneck

mongoDB ensures atomicity in concurrent write operations and avoids collisions by returning duplicate-key errors for record-duplication issues

what is an advantage of mongoDB that makes it a good use case for URL shortening?

cassandra

what is an example of a columnar database

mapreduce

what is an example of a distributed data processing system that an indexer will use for index construction in a distributed search system?

mongoDB

what is an example of a document database?

read only memory

what is an example of nonvolatile memory?

a system that processes financial ATM transactions

what is an example where a sampler service is not viable in distributed logging

durability

what is an important non functional requirement of messaging queues

a failover server is activated

what is done in the case of a failure of a range handler in a sequencer

volatile memory

what is more costly per unit size- volatile memory or nonvolatile memory

creating containers to group blobs

what is one of the key functional requirements of blob storage?

replication

what is one of the most commonly used methods of providing fault tolerance for a system?

rules for how to operate the rate limiter

what is stored in the database/cache associated with a rate limiter?

reduced capex and opex for extra hardware

what is the advantage of a co-located cache

using time stamps based on synchronized clocks

what is the best approach for ordering incoming messages in a messaging queue

a range handler

what is the best way of generating unique IDs for a sequencer

when the data is write once read many

what is the circumstance where blob storage is useful

client, then load balancer, then a layer of services with cache clients, then a layer of cache servers that feed into a persistent storage layer

what is the diagram of a distributed cache in simple terms

maintainability is measured by mean time to repair while reliability is measured by both mean time to repair as well as mean time between failure

what is the difference between maintainability and reliability

the client lacks the overall information to choose the most suitable server for its request which may result in requesting an already over-loaded server

what is the disadvantage of using client multiplexing to route a client to the nearest proxy server in DNS?

any non-deterministic function such as now() might result in distinct writes on the follower and leader

what is the disadvantage of using statement-based replication

if the primary node fails, the writes that weren't copied to the secondary nodes will be lost

what is the downside of asynchronous replication?

we need to build a system such that many nodes could collectively work as if we had a single huge server

what is the downside of horizontal scaling

high latency

what is the downside of synchronous replication

when content changes infrequently, the polling approach consumes unnecessary bandwidth

what is the downside of using periodic polling in a CDN to ensure content consistency?

We can only grow to the limitations of our server and the dollar cost of vertical scaling is usually high

what is the downside of vertical scaling

if two intervals overlap, we are unsure in which order the events occurred

what is the downside to using Google's TrueTime API as a sequencer?

each microservice sends its metrics to the monitoring system, resulting in a heavy traffic load on the infrastructure which can cause a bottleneck for business operations

what is the drawback to using a push based approach in a monitoring system

a load balancer

what is the first point of contact within a data center after the firewall?

DAU / RPS of a server

what is the formula for calculating the number of servers required
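As a back-of-the-envelope sketch of the formula above: the figures below are hypothetical, and the estimate assumes the worst case in which every daily active user sends one request in the same second.

```python
import math


def servers_needed(daily_active_users, rps_per_server):
    """Estimate server count as DAU / per-server RPS, rounded up."""
    return math.ceil(daily_active_users / rps_per_server)


# Hypothetical numbers: 500 million DAU, 64,000 requests/second per server.
print(servers_needed(500_000_000, 64_000))
```

Rounding up matters because a fractional server still has to be provisioned as a whole machine.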

we lose a significant range any time a server dies

what is the major con of a range handler?

in a queue, only one consumer consumes a message as opposed to multiple consumers in a pub-sub system

what is the major difference between pub-sub messaging and queue based messaging?

a consistent burst of traffic at the window edges could cause a potential decrease in performance

what is the major disadvantage of the fixed window counter algorithm for rate limiting
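The edge-burst problem can be seen in a minimal fixed window counter sketch. The class name, the limit of 5, and the 60-second window are assumptions chosen for illustration; time is passed in explicitly so the behavior is reproducible.

```python
class FixedWindowCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.current_window = None
        self.count = 0

    def allow(self, now):
        # Each request belongs to the window its timestamp falls into.
        window_id = int(now // self.window)
        if window_id != self.current_window:
            # A new window starts: the counter resets to zero.
            self.current_window = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


limiter = FixedWindowCounter(limit=5, window_seconds=60)
# Five requests at the end of one window and five at the start of the next:
# all ten are admitted within about one second, twice the intended limit.
allowed = [limiter.allow(59.0) for _ in range(5)]
allowed += [limiter.allow(60.0) for _ in range(5)]
print(allowed)
```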

conflict

what is the major drawback of multi-leader replication?

minimizing latency

what is the most important non functional requirement of a CDN?

setting up a configuration service that sits between the cache client and the cache servers that continuously monitors the health of the cache servers

what is the most robust (yet costly) method for keeping cache clients up to date on the presence and health of each cache server?

in a distributed system where there can be millions of events occurring per second, we need a mechanism to distinguish these events from each other

what is the motivation for using sequencers?

continuation token

what is the name of the indicator used in pagination under a blob store system?

as nodes join or leave the system, a minimal number of keys need to move

what is the primary benefit of consistent hashing

unstructured catalog data like JSON files or other complex hierarchical data. It is also a good option for content management applications such as blogs and video platforms

what is the primary use case for document databases?

index recomputation is a waste of resources

what is the problem that arises in search systems when each replica of a primary node stores the index

kilobyte, megabyte, gigabyte, terabyte, petabyte

what is the progression of byte sizes?

increase the availability of the queue

what is the purpose of replicating queues on multiple servers?

memcached has a nearly deterministic O(1) speed, serving millions of keys per second

what is the query speed of memcached

verify the consumer, retention time management, message receiving options management, allow multiple reads

what is the role of a consumer manager in a pub-sub system

to fetch content and create documents

what is the role of a crawler in a search system?

resolvers initiate the querying sequence and forward requests to the other DNS name servers. They can also serve users' DNS queries from their cache

what is the role of a dns resolver in the dns infrastructure hierarchy?

to fairly divide all clients' requests among the pool of available servers

what is the role of a load balancer

responds to search queries by running the query on the index created by the indexer

what is the role of a searcher in a search system

broker/topics registry and managing replication by assigning the new lead broker in the case of failure

what is the role of the cluster manager in a pub-sub system

a temporary data storage that can serve data faster by keeping data entries in memory

what is the simple definition of a cache

we compute the index on the primary node only and then communicate the inverted index (binary blob/file) to the replicas

what is the solution to index recomputation when replicating nodes in a distributed search system?

storing metadata indicating a time to live

what is the solution to having stale data stored in a distributed cache
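A minimal sketch of the time-to-live idea, assuming each entry stores its expiry time next to the value. The class and method names are hypothetical, and the clock is passed in explicitly to keep the example deterministic.

```python
class TTLCache:
    def __init__(self):
        self.store = {}  # key -> (value, expires_at)

    def put(self, key, value, ttl, now):
        # The metadata kept with each entry is its expiry time.
        self.store[key] = (value, now + ttl)

    def get(self, key, now):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            # The entry is stale: evict it and report a miss.
            del self.store[key]
            return None
        return value


cache = TTLCache()
cache.put("user:1", "alice", ttl=30, now=0)
print(cache.get("user:1", now=10))
print(cache.get("user:1", now=31))
```

After a miss caused by expiry, the caller would normally reload the value from the persistent store and put it back with a fresh TTL.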

if the number is too small, we face high contention for writes. If the number is too large we encounter higher overhead on the read operation

what is the tradeoff in determining the number of shards for a sharded counter?
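The tradeoff can be sketched with a toy in-memory sharded counter. This is an illustration under simplifying assumptions (a single process, no real storage or locking); the class name is hypothetical.

```python
import random


class ShardedCounter:
    def __init__(self, num_shards):
        self.shards = [0] * num_shards

    def increment(self):
        # A write touches one randomly chosen shard, so more shards
        # means less contention on any single one.
        self.shards[random.randrange(len(self.shards))] += 1

    def value(self):
        # A read must visit every shard, so more shards means
        # more work per read.
        return sum(self.shards)


counter = ShardedCounter(num_shards=8)
for _ in range(1000):
    counter.increment()
print(counter.value())
```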

web server

what is typically the first point of contact after a load balancer?

mean time to repair

what is used to measure maintainability

all of the writes must go through the leader node

what limits the performance of primary-secondary replication?

multiword queries necessitate sending long mapping lists between groups of nodes for merging

what makes term partitioning difficult in distributed search indexers

scalability, availability, performance

what non functional characteristics do load balancers improve

the write request queue grows, maximum shard utilization decreases, and users do not get quick responses

what problems might arise if shards are selected in an order (sequentially) rather than randomly?

iterative

what type of DNS resolution reduces query load on the DNS infrastructure- iterative or recursive?

the metadata for a blob's chunks is cached on the client side when it is read for the first time. The client can then go directly to the data nodes without communicating with the master node.

what type of caching can occur on the client side in a blob store system?

asynchronous communication

what type of communication do messaging queues enable?

user datagram protocol (UDP)

what type of communication protocol do many clients of DNS use?

eventual consistency

what type of consistency level do domain name systems guarantee?

HTTP responses as well as other protocols

what type of content do applications servers serve

a flat data organization pattern with no hierarchies or sub directories

what type of data organization pattern does blob storage follow?

propagation delay, by bringing the data closer to its users

what type of delay does a CDN primarily try to alleviate?

a stateless load balancer

what type of load balancer does not keep track of any session information?

messages that failed to be consumed and have reached the maximum attempts limit

what type of messages do dead-letter queues contain?

leader-follower protocol

what type of replication does MongoDB use?

leaderless

what type of replication does the cassandra DB use?

vertical sharding

what type of sharding (vertical or horizontal) is more amenable to manual partitioning?

Different ISPs have different numbers of users resulting in uneven load distribution. It does not consider end-server crashes

when a DNS infrastructure responds with a reordered list of IP addresses, it is essentially performing global server load balancing in a round robin format. What are the two disadvantages to this approach?

static content delivery where the origin server decides which content to deliver to users of the CDN

when is the push CDN content strategy appropriate?

when we have incoming requests with extensive service time or requests with widely differing length of service time

when is the weighted round robin algorithm inappropriate for load balancing?

a sampler service

when it is not viable to log every single event in a distributed logger, what can we use instead?

if we are operating an application where we need to be able to make changes even if we are offline such as a calendar application

when might a multi-leader replication strategy be useful?

when a firewall prevents the monitoring system from accessing the servers directly

when might a push metrics model be appropriate in a monitoring system?

if the data is unstructured, if there is a need to serialize and deserialize data, if the size of the data to be stored is large

when should a non-relational database be chosen?

if the data to be stored is structured, if acid properties are required, if the size of the data is relatively small and can fit on one node

when should a relational database be chosen?

non relational

when strong consistency is not required, what might be a good type of database to use?

highly available but weak consistency

where do noSQL databases fall on the spectrum of consistency and availability?

in RAM to support low latency for the search

where do we store the index in distributed search and why?

web server

which server layer deals with HTTP requests and responses only?

application server

which server layer deals with several communication protocols one of which is HTTP?

there is no indication of the severity of the issues, no structure, and they are hard to track

why can't we just use print statements instead of distributed logging systems?

there is a cost to queue storage and some tasks need to be repeatedly scheduled at regular intervals and so need to be stored for long periods

why do we use a database in task scheduling instead of just storing everything in the queue

because it makes a great tradeoff between speed, storage size, and cost

why is RAM a good choice for serving cached data as opposed to CPU register or Magnetic Disk

all API requests pass through the rate limiter. Therefore it needs minimum latency so as not to affect the user experience

why is low latency critical for rate limiters

it provides a much shorter delay

why might a DNS system favor the unreliable UDP communication protocol over something like TCP that uses a handshaking dialogue?

statement-based replication, write-ahead log shipping, logical (row-based) log replication

name three different types of primary-secondary replication

kilobyte

2^10 bytes

gigabyte

2^30 bytes

petabyte

2^50 bytes

megabyte

2^20 bytes

terabyte

2^40 bytes

the percentage of time that a system is accessible to clients and operates under normal conditions

Define availability

dependent and independent tasks

broadly, tasks can be of two types. What are they?

a fault tolerance technique that saves a system's state in stable storage when the system state is consistent

checkpointing

what are the disadvantages of the token bucket algorithm for rate limiting

choosing optimal values for the parameters is difficult. Taking a token from the bucket might require a lock, which can increase the request's processing delay if contention on the lock increases
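A minimal token bucket sketch shows where both parameters and the lock come in. The class name, the capacity of 3, and the refill rate of one token per second are assumptions for illustration; time is passed in explicitly so the result is reproducible.

```python
import threading


class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)
        self.last = 0.0
        self.lock = threading.Lock()

    def allow(self, now):
        # Every request takes the lock, so heavy contention here
        # adds to request latency.
        with self.lock:
            elapsed = max(0.0, now - self.last)
            self.tokens = min(self.capacity,
                              self.tokens + elapsed * self.refill_rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False


bucket = TokenBucket(capacity=3, refill_rate=1.0)
burst = [bucket.allow(0.0) for _ in range(4)]
later = bucket.allow(2.0)
print(burst, later)
```

The only state kept is the token count and the last refill time, which is why the algorithm is space efficient.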

they can perform global traffic management between different time zones

cloud based load balancers may not necessarily replace a local on-premise load balancing facility. In that case, what is their primary role?

redis runs as a single process using one core whereas memcached can efficiently use multicore systems with multithreading technology

compare multithreading on memcached and redis

redis provides persistence while memcached does not, although it can be achieved with third-party tools

compare persistence of memcached and redis

redis automates replication whereas in memcached it is subject to third party tools

compare replication in redis vs memcached

memcached is simple but leaves most of the effort for managing clusters to the developer. Redis automates most of the scalability and data division issues

compare the simplicity of memcached and redis

what are the steps that occur when a user writes a blob to blob store

if the request successfully passes the rate limiter, the load balancer forwards the client's request to one of the front-end servers. The front-end server then sends the request to the master node. The master node assigns a unique ID and splits the data into chunks. The chunks are replicated for redundancy. The master node stores metadata in metadata storage. After the blob is written, a fully qualified path is returned to the client.

top-level domain name servers

in a DNS, what type of name server do root level servers forward queries to?

blob storage

in a client side monitoring service, what kind of database will we use to store the time series data?

the URI namespace delegation of all objects cached in the CDN

in a content delivery network, what does the origin server provide to the request routing system

filterer, error aggregator, alert aggregator

in a distributed logger, what are the three services that will work on the pub-sub data

least response time

in a performance sensitive service, which load balancing algorithm might be preferred?

a queue

in a pub-sub system, what is used to maintain each topic

relational

in a pub-sub system, what type of database (relational or non-relational) is used to maintain which consumers have subscribed to which topics

by maintaining a set of rules in a rules and actions database

in a service side distributed monitoring system, how can we ensure that our system will send out an alert when a metric breaches a critical value

execute these tasks in off-peak resource times

in a social app like facebook, some tasks do not need to be executed urgently, such as recommending friends. What is a good way to execute these tasks in a resource-efficient manner?

execution caps

in a task scheduler, some task scripts may never halt due to a bug in the script. What is the solution?

monitoring the health of servers directly and handling TCP congestion control protocols

in addition to distributing traffic to servers, what is the role of tier three load balancers which sit at layer 7?

we mark blobs as deleted and they will be garbage collected later

in blob storage, deleting from many nodes takes time and holding a client until that is done is not a viable option. What is the solution?

the data node

in blob storage, where does the garbage collection process run: at the master node, the data node, or the monitoring system?

partitions

in blob store, there are a large number of data nodes on which blobs can be stored. It would take a while to search each one for a particular blob. What method is used to overcome this?

metadata storage

in blob store, where are the partition mappings stored?

embedding two probers in the application: an agent and a collector

in client side error monitoring, what is an improvement on the simple prober approach?

these components extract and filter terms from the partitions assigned to them by the cluster manager. These machines output inverted indexes in parallel, which serve as input to reducers.

in mapreduce, what is the role of mappers?

the reducer combines mappings for various terms to generate a summarized index

in mapreduce, what is the role of reducers

the manager initiates the process by assigning a set of partitions to mappers. Once the mappers are done, the cluster manager assigns the output of mappers to reducers

in mapreduce, what is the role of the cluster manager?
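The three MapReduce roles above can be sketched in a single process. This is a toy illustration, not a distributed implementation: the function names and the two sample documents are made up, and the "cluster manager" is just the code that hands partitions to the mapper and mapper output to the reducer.

```python
from collections import defaultdict


def mapper(doc_id, text):
    """Extract the distinct terms of one document as (term, doc_id) pairs."""
    return [(term, doc_id) for term in sorted(set(text.lower().split()))]


def reducer(pairs):
    """Combine the mappings for each term into one summarized index."""
    index = defaultdict(list)
    for term, doc_id in pairs:
        index[term].append(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Cluster manager role: assign partitions to mappers, then pass
# the mapper output on to the reducer.
partitions = {1: "search engine index", 2: "inverted index"}
mapped = [pair for doc_id, text in partitions.items()
          for pair in mapper(doc_id, text)]
index = reducer(mapped)
print(index)
```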

using a time window approach where we have to sort messages received within a specific time frame and then put them in the relevant queue

in messaging queues, we frequently need to sort messages based on timestamps. How can we minimize latency based on this extra processing time?

we can tag a unique process identifier with the time stamp

in messaging queues, we will use time stamps based on synchronized clocks to order incoming messages. How can we ensure that we order messages when two concurrent sessions ask for a timestamp at the exact same time?
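A minimal sketch of the tie-break, assuming each message is a (timestamp_ms, process_id, payload) tuple. The tuple layout and the sample values are hypothetical.

```python
def order_messages(messages):
    """Sort by timestamp first, then by process identifier to break ties."""
    return sorted(messages, key=lambda m: (m[0], m[1]))


messages = [
    (1000, 7, "from process 7"),
    (1000, 3, "from process 3"),  # same timestamp as the message above
    (999, 9, "earlier message"),
]
for message in order_messages(messages):
    print(message)
```

Because process identifiers are unique, two messages can never compare as equal, so every run produces the same total order.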

asynchronous writes

in the case of distributed caches, do we prefer synchronous or asynchronous writes to replica servers to copy data?

any random host within the cluster

in the cluster of independent hosts model, which component is responsible for replicating messages in the other nodes?

placing CDN proxy servers in ISP network

in the context of CDN deployment, what does off-premises mean?

a smaller data center could be placed near major IXPs

in the context of CDN deployment, what does on-premises deployment mean

time to live

in the periodic polling method of CDN content consistency, a lot of bandwidth is unnecessarily consumed by polling for content that does not change frequently. What is a strategy that addresses this issue?

sign bit, 41 bits for time stamp in milliseconds, 10 bits for a worker number which identifies the server, and 12 bits for the sequence number which is incremented by the server each time

in the twitter snowflake sequencer method, what are the different types of bits used for encoding
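The bit layout can be sketched as plain bit packing. This is an illustration of the layout described above, not Twitter's implementation: the function names are hypothetical, and timestamp_ms is assumed to be milliseconds since some chosen epoch.

```python
TIMESTAMP_BITS = 41
WORKER_BITS = 10
SEQUENCE_BITS = 12


def make_id(timestamp_ms, worker_id, sequence):
    """Pack the three fields into one integer; the sign bit stays 0."""
    if not 0 <= timestamp_ms < (1 << TIMESTAMP_BITS):
        raise ValueError("timestamp does not fit in 41 bits")
    if not 0 <= worker_id < (1 << WORKER_BITS):
        raise ValueError("worker id does not fit in 10 bits")
    if not 0 <= sequence < (1 << SEQUENCE_BITS):
        raise ValueError("sequence does not fit in 12 bits")
    return ((timestamp_ms << (WORKER_BITS + SEQUENCE_BITS))
            | (worker_id << SEQUENCE_BITS)
            | sequence)


def split_id(snowflake_id):
    """Recover (timestamp_ms, worker_id, sequence) from a packed ID."""
    sequence = snowflake_id & ((1 << SEQUENCE_BITS) - 1)
    worker_id = (snowflake_id >> SEQUENCE_BITS) & ((1 << WORKER_BITS) - 1)
    timestamp_ms = snowflake_id >> (WORKER_BITS + SEQUENCE_BITS)
    return timestamp_ms, worker_id, sequence


print(make_id(1, 1, 1))
print(split_id(make_id(123456789, 42, 7)))
```

Since the timestamp occupies the most significant bits, IDs generated later sort after earlier ones, at least across different milliseconds.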

nonpersistent

is a cache a persistent or non persistent storage area?

client, rate limiter, load balancer, front-end servers, data nodes, master node, metadata storage, monitoring service, administrator

list the key components of blob store design

edge side includes (ESI) markup language

name a popular dynamic data compression technique for caching data in a CDN

routing system, scrubber servers, proxy servers, distribution system, origin servers, management system

name the components of a content delivery network (CDN)

monitor anomalies in the use of CPU/memory/disk/network bandwidth by a process, monitor overall server health, monitor hardware component faults on a server such as memory failures, monitor server's ability to reach out-of-server critical services such as a network file system

what are some examples of things we want our monitoring system to do for us in distributed monitoring?

failure in DNS name resolution, any failure in routing along the path from the client to the service provider, any failures with third-party infrastructure such as middleboxes and CDNs

what are some factors that can cause failure in clients being unable to reach the server?

temperature differences, equipment age, manufacturing defects, virtualized clocks

what are some of the causes of physical clock drift

proxy servers usually serve content from RAM, CDN proxy servers are usually placed near the users, the request routing system ensures that users are directed to the nearest proxy server, proxy servers have long-tail content stored in nonvolatile storage systems

what are some of the design decisions in CDNs that minimize latency

preventing resource starvation, managing policies and quotas, controlling data flow, and avoiding excess costs

what are some scenarios where a rate limiter would be useful?

DNS redirection, anycast, client multiplexing, HTTP redirection

what are some techniques that can be used to route users to the nearest proxy server in a CDN?

the advantage is that keys are uniformly distributed across nodes. The disadvantage is that we cannot perform range queries with this technique

what are the advantages and disadvantages of hash based sharding

flexibility because the database can be modified while queries are happening, reduced redundancy, concurrency, backup and disaster recovery

what are the advantages of relational databases

it allows a burst of traffic as long as there are enough tokens. It is space efficient because it keeps only limited state

what are the advantages of the token bucket algorithm for rate limiters

an inverted index facilitates full-text searches and reduces the time of counting the occurrence of a word in each document because we have mappings against each term

what are the advantages of using an inverted index in search?

there is flexibility in terms of hardware choices for each functionality and it is possible to scale application servers and cache servers separately

what are the advantages of using dedicated cache servers for sharding?

range-query-based scheme is easy to implement and range queries can be performed using the partition keys

what are the advantages of using key-range based data sharding?
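A minimal sketch of key-range sharding, assuming string partition keys and three made-up boundary values that split the key space into four shards. It shows why a range query only needs to visit the shards whose ranges intersect it.

```python
import bisect

# Hypothetical boundaries: shard 0 holds keys below "g", shard 1 holds
# keys from "g" up to (not including) "n", shard 2 from "n" up to "t",
# and shard 3 holds everything from "t" onward.
BOUNDARIES = ["g", "n", "t"]


def range_shard(key):
    """Return the index of the shard whose key range contains key."""
    return bisect.bisect_right(BOUNDARIES, key)


def shards_for_range(low, high):
    """Return the contiguous shards covering all keys from low to high."""
    return list(range(range_shard(low), range_shard(high) + 1))


print(range_shard("alice"))
print(shards_for_range("h", "p"))
```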

if a node fails or does routine maintenance, the workload is uniformly distributed over other nodes. It is up to each node to decide how many virtual nodes it is responsible for which allows nodes with more computational capacity to take more of the load

what are the benefits of using virtual nodes in consistent hashing?
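A minimal sketch of a hash ring with virtual nodes. The node names, the virtual node counts, and the use of MD5 to place points on the ring are assumptions for illustration. The node given more virtual nodes stands in for a machine with more capacity.

```python
import bisect
import hashlib


def ring_position(label):
    """Map a string to a position on the ring."""
    return int(hashlib.md5(label.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self):
        self.ring = []  # sorted list of (position, node)

    def add_node(self, node, vnodes):
        # Each physical node appears at several positions on the ring.
        for i in range(vnodes):
            bisect.insort(self.ring, (ring_position(f"{node}#{i}"), node))

    def remove_node(self, node):
        self.ring = [(pos, n) for pos, n in self.ring if n != node]

    def lookup(self, key):
        # The owner is the first virtual node clockwise from the key.
        positions = [pos for pos, _ in self.ring]
        idx = bisect.bisect_right(positions, ring_position(key))
        return self.ring[idx % len(self.ring)][1]


ring = HashRing()
ring.add_node("node-a", vnodes=8)
ring.add_node("node-b", vnodes=8)
ring.add_node("node-c", vnodes=16)  # more capacity, more virtual nodes

keys = [f"key{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.remove_node("node-b")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "of", len(keys), "keys moved")
```

Only the keys that belonged to the removed node change owner; every other key stays where it was, and the removed node's keys are spread over the remaining nodes rather than landing on a single neighbor.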

a cache serves data from RAM while a key-value store writes data to non-volatile storage. Key-value stores should survive failures while caches must be repopulated from scratch

what are the differences between a cache and a key value store

there is storage overhead and additional processing time up front for adding a new document

what are the disadvantages of using an inverted index in distributed search?

impedance mismatch

what are the downsides of relational databases

the content provider cannot do anything if the public CDN is down, if there are no proxy servers located in the region where some traffic comes from they are out of luck, and it is possible that some domains or IP addresses of CDN providers are blocked or restricted in some places

what are the downsides of using a public CDN

incomplete coverage and lack of user imitation

what are the downsides of using probers to mimic user behavior in client side monitoring?

concurrent management to separate users, we cannot grant different access rights, scaling, content search is difficult

what are the limitations of file storage?

a log accumulator, storage, a log indexer, and visualization

what are the major components of a distributed logging system?

scalable, available, and fault tolerant

what are the non functional requirements of a key-value store

they minimize user perceived latency, pre-generate expensive queries from the database, store user session data temporarily, serve data from temporary storage even if the data store is down temporarily, reduce network costs by serving data from local resources

what are the primary benefits of a distributed cache?

uniqueness, scalability, availability, and a 64 bit ID

what are the requirements for a unique identifier system

requirements, estimation, storage schema, high level design, api design, detailed design, evaluation, distinctive component

what are the steps in the RESHADED approach

configurable service where different applications can choose between the tradeoffs of consistency and availability, the ability to always write to the key value store, and hardware heterogeneity

what are the three functional requirements of a key value store

a data collector service, a time series database, a querying service, a rules and actions database, a blob store, and a service discoverer like kubernetes, an alert manager, and a dashboard

what are the three high level components of a client side distributed monitoring service?

account layer, container layer, blob layer

what are the three levels of abstraction in blob store?

a crawler, an indexer, and a searcher

what are the three main components of a search system?

periodic polling, time to live, and leases

what are the three methods for maintaining content consistency in a CDN?

