Chapter 8: The Trouble with Distributed Systems
In the __________ system model, an algorithm is not allowed to make any timing assumptions -- in fact, it does not even have a clock (so it cannot use timeouts). Some algorithms can be designed for this model, but it is very restrictive.
Asynchronous
It is unwise for a service to assume that its clients will always be well __________, because the clients are often run by people whose priorities are very different from the priorities of the people running the service. Thus, it is a good idea for any service to protect itself from accidentally abusive clients.
Behaved
Large datacenter networks are often based on IP and Ethernet, arranged in Clos topologies to provide high _____________ bandwidth.
Bisection
In synchronous networks, there is no queueing, so the maximum end-to-end latency of the network is fixed. We call this a _____________.
Bounded delay
Datacenter networks and the internet use packet switching because they are optimized for _____________. A circuit is good for an audio or video call, which needs to transfer a fairly constant number of bits per second for the duration of the call.
Bursty traffic
Distributed systems problems become much harder if there is a risk that nodes may "lie" (send arbitrary faulty or corrupted responses) -- for example, if a node may claim to have received a particular message when in fact it didn't. Such behavior is known as a __________.
Byzantine Fault
A system is __________ if it continues to operate correctly even if some of the nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering with the network.
Byzantine Fault-Tolerant
In your datacenter, all the nodes are controlled by your organization (so they can hopefully be trusted). In most server-side systems, the cost of deploying __________ solutions makes them impractical.
Byzantine Fault-Tolerant
The __________ imagines a situation in which n generals need to agree on a battle plan. As they have set up camp on different sites, they can only communicate by messenger, and the messengers sometimes get delayed or lost (like packets in a network). Their endeavor is hampered by the fact that there are some traitors in their midst. Most of the generals are loyal, and thus send truthful messages, but the traitors may try to deceive and confuse the others by sending fake or untrue messages.
Byzantine Generals Problem
The problem of reaching consensus in an untrusting environment is known as the __________.
Byzantine Generals Problem
Using _____________ for bursty data transfers wastes network capacity and makes transfers unnecessarily slow. By contrast, TCP dynamically adapts the rate of data transfer to the available network capacity.
Circuits
There is a spectrum of philosophies on how to build large-scale computing systems: at the other extreme is _____________, which is not very well defined but is often associated with multi-tenant datacenters, commodity computers connected with an IP network (often Ethernet), elastic/on-demand resource allocation, and metered billing.
Cloud computing
Because of the difficulty of synchronizing physical clocks over an asynchronous packet network, it doesn't make sense to think of a physical clock (a time-of-day clock in particular) as a point in time; it is more like a range of times, within a __________ interval. (e.g., [earliest possible timestamp, latest possible timestamp])
Confidence
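A minimal sketch of treating a clock reading as a range rather than a point. The `TimeInterval` type, the `now_with_uncertainty` helper, and the error bound passed to it are hypothetical names chosen for illustration. Using non-overlapping intervals to decide ordering is an assumption about how such a range might be used, not something the card itself states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    earliest: float  # earliest possible timestamp, in seconds
    latest: float    # latest possible timestamp, in seconds

    def definitely_before(self, other):
        # Only safe to claim an ordering when the two ranges do not overlap.
        return self.latest < other.earliest


def now_with_uncertainty(reading, max_error):
    """Turn a single clock reading into [reading - error, reading + error]."""
    return TimeInterval(reading - max_error, reading + max_error)


a = now_with_uncertainty(100.000, max_error=0.005)
b = now_with_uncertainty(100.020, max_error=0.005)
c = now_with_uncertainty(100.008, max_error=0.005)

print(a.definitely_before(b))  # ranges do not overlap
print(a.definitely_before(c))  # ranges overlap, so the order is unknown
```

When two intervals overlap, neither reading can be said to come first, which is exactly the information a single timestamp hides.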
Prematurely declaring a node _____________ is problematic: if the node is actually alive and in the middle of performing some action (for example, sending an email), and another node takes over, the action may end up being performed twice.
Dead
When a node is declared _____________, its responsibilities need to be transferred to other nodes, which places additional load on other nodes and the network. If the system is already struggling with high load, it could happen that the node actually wasn't dead but only slow to respond due to overload; transferring its load to other nodes can cause a cascading failure.
Dead
The quartz clock in a computer is not very accurate: it __________ (runs faster or slower than it should). If your NTP daemon is misconfigured, or a firewall is blocking NTP traffic, the clock error due to drift can quickly become large.
Drifts
Applications depend on clocks in various ways to answer important questions. The following questions use clocks to measure __________ (e.g., the time interval between a request being sent and a response being received): 1. Has this request timed out yet? 2. What's the 99th percentile response time of this service? 3. How many queries per second did this service handle on average in the last five minutes? 4. How long did the user spend on our site?
Durations
The internet shares network bandwidth _____________: this approach has the downside of queueing, but the advantage is that it maximizes utilization of the wire. The wire has a fixed cost, so if you utilize it better, each byte you send over the wire is cheaper.
Dynamically
Every time the lock server grants a lock or lease, it also returns a __________ token, which is a number that increases every time a lock is granted. We can then require that every time a client sends a write request to the storage service, it must include its current token.
Fencing
Using __________ tokens requires the resource (service) itself to take an active role in checking tokens by rejecting any writes with an older token than one that has already been processed.
Fencing
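The token check described in the fencing cards above can be sketched as follows. `FencedStorage` and its `write` method are hypothetical names, and the token values 33 and 34 are made up for illustration; this is a sketch of the idea, not any particular storage service's API.

```python
class FencedStorage:
    """Hypothetical storage service that checks a fencing token on every write."""

    def __init__(self):
        self.highest_token = 0  # highest token processed so far
        self.data = {}

    def write(self, key, value, token):
        # Reject any write whose token is older than one already processed.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.data[key] = value
        return True


storage = FencedStorage()
# Client 2 holds the newer lease (token 34) and writes first.
accepted_new = storage.write("file", "from client 2", token=34)
# Client 1 wakes up from a pause, still believing its expired lease (token 33) is valid.
accepted_old = storage.write("file", "from client 1", token=33)
print(accepted_new, accepted_old, storage.data["file"])
```

The important design point is that the check lives in the storage service, not in the clients, so a client that is wrong about its own lease still cannot corrupt the data.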
When using a lock or lease to protect access to some resource, such as file storage, we need to ensure that a node that is under a false belief of being "the chosen one" cannot disrupt the rest of the system. A fairly simple technique that achieves this goal is called __________.
Fencing
__________ tokens detect and block a node that is inadvertently acting in error (e.g., because it hasn't yet found out that its lease has expired). However, if the node deliberately wanted to subvert the system's guarantees, it could easily do so by sending messages with a fake token.
Fencing
TCP performs _____________ (also known as congestion avoidance or backpressure), in which a node limits its own rate of sending in order to avoid overloading a network link or the receiving node. This means additional queuing at the sender before the data even enters the network.
Flow control
Some software runs in environments where a failure to respond within a specified time can cause serious damage: computers that control aircraft, rockets, robots, cars, and other physical objects must respond quickly and predictably to their sensor inputs. In these systems, there is a specified deadline by which the software must respond; if it doesn't meet the deadline, that may cause a failure of the entire system. These are so-called __________ systems.
Hard real-time
Adding redundant networking gear doesn't reduce faults as much as you might hope, since it doesn't guard against _____________ (e.g., misconfigured switches), which is a major cause of outages.
Human error
__________ (LWW) cannot distinguish between writes that occurred sequentially in quick succession and writes that were truly concurrent (neither writer was aware of the other). Additional causality tracking mechanisms, such as version vectors, are needed in order to prevent violations of causality.
Last-write-wins
__________ (LWW) is a conflict resolution strategy widely used in both multi-leader replication and leaderless databases (e.g., Cassandra and Riak). There are fundamental problems with it, as it requires synchronized clocks between all nodes to work correctly.
Last-write-wins
In a geographically distributed deployment (keeping data geographically close to your users to reduce access _____________), communication most likely goes over the internet, which is slow and unreliable compared to local networks.
Latency
Internet-related applications need to be able to serve users with low _____________ at any time. Making the service unavailable -- for example, stopping the cluster for repair -- is not acceptable.
Latency
Some _____________-sensitive applications use UDP rather than TCP.
Latency
So-called __________ clocks, which are based on incrementing counters rather than oscillating quartz crystals, are a safer alternative for ordering events. These clocks do not measure the time of day or the number of seconds elapsed, only the relative ordering of events (whether one event happened before or after another).
Logical
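A minimal sketch of a counter-based logical clock. The card only says "incrementing counters"; the Lamport-style rule used here (take the maximum of the local and received counter, then add one) is one well-known way to do it and is an assumption on my part. The class and method names are hypothetical.

```python
class LamportClock:
    """Counter-based logical clock: orders events, does not measure time."""

    def __init__(self):
        self.counter = 0

    def tick(self):
        # A local event (including sending a message) advances the counter.
        self.counter += 1
        return self.counter

    def receive(self, remote_counter):
        # On receipt, jump ahead of whatever the sender had already seen.
        self.counter = max(self.counter, remote_counter) + 1
        return self.counter


node_a = LamportClock()
node_b = LamportClock()

t_send = node_a.tick()           # node A sends a message
t_recv = node_b.receive(t_send)  # node B receives it
print(t_send, t_recv)
```

The receive event always gets a larger counter than the send event that caused it, regardless of what either node's quartz clock says.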
A _____________ timeout means a long wait until a node is declared dead (and during this time, users may have to wait or see error messages).
Long
Although we can assume that nodes are generally honest, it can be worth adding mechanisms to software that guard against weak forms of "__________" -- for example, invalid messages due to hardware issues, software bugs, and misconfiguration (e.g., detecting corrupted network packets and carefully sanitizing any inputs from users).
Lying
Most server implementations cannot guarantee that they can handle requests within some _____________ time.
Maximum
A __________ clock is suitable for measuring a duration (time interval), such as a timeout or a service's response time, because it is guaranteed to always move forward.
Monotonic
In a distributed system, using a __________ clock for measuring time (e.g., timeouts) is usually fine because it doesn't assume any synchronization between different nodes' clocks and is not sensitive to slight inaccuracies of measurement.
Monotonic
It makes no sense to compare __________ clock values from two different computers, because they don't mean the same thing.
Monotonic
You can check the value of the __________ clock at one point in time, do something, and then check the clock again at a later time. The difference between the two values tells you how much time elapsed between the two checks.
Monotonic
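The check-twice pattern from the card above, sketched with Python's `time.monotonic()`. The `timed` helper is a hypothetical name used only for this illustration.

```python
import time


def timed(func):
    """Run func and return (result, elapsed seconds) using the monotonic clock."""
    start = time.monotonic()   # first check
    result = func()            # do something
    elapsed = time.monotonic() - start  # second check minus first
    return result, elapsed


result, elapsed = timed(lambda: sum(range(1000)))
print(result, elapsed >= 0)
```

Only the difference between the two readings is meaningful; the absolute value returned by `time.monotonic()` has no defined reference point.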
__________ clocks don't need synchronization.
Monotonic
__________ synchronization can only be as good as the network delay, so there is a limit to its accuracy when you're on a congested network with variable packet delays.
NTP
__________'s synchronization accuracy is itself limited by the network round-trip time, in addition to other sources of error such as quartz drift.
NTP
To better detect _____________ in public clouds and multi-tenant datacenters, you can only choose timeouts experimentally: you need to determine an appropriate trade-off between failure detection delay (taking too long to detect a failure) and risk of premature timeouts (incorrectly shutting off a node that is actually still alive).
Network Faults
To detect _____________, a system can continually measure response times and their variability (jitter) and automatically adjust timeouts according to the observed response time distribution.
Network Faults
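One simple way to derive a timeout from observed response times might look like the sketch below. The formula (mean plus a multiple of the standard deviation), the multiplier `k`, and the `floor` value are all assumptions chosen for illustration; the card does not prescribe a specific formula.

```python
import statistics


def adaptive_timeout(response_times, k=4.0, floor=0.05):
    """Suggest a timeout in seconds from recent response times.

    mean + k * jitter, where jitter is the population standard deviation.
    """
    mean = statistics.mean(response_times)
    jitter = statistics.pstdev(response_times)
    return max(floor, mean + k * jitter)


steady = [0.10, 0.10, 0.10, 0.10]
jittery = [0.05, 0.10, 0.40, 0.10]

print(adaptive_timeout(steady))
print(adaptive_timeout(jittery))
```

A link with more variable response times gets a longer timeout, which trades slower failure detection for fewer premature timeouts.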
It is possible to synchronize clocks to some degree: the most commonly used mechanism is the __________ (NTP), which allows the computer clock to be adjusted according to the time reported by a group of servers. The servers in turn get their time from a more accurate time source, such as a GPS receiver.
Network Time Protocol
The variability of packet delays on asynchronous computer networks is most often due to queueing: if several different nodes simultaneously try to send packets to the same destination, the network switch must queue them up and feed them into the destination network link one by one. On a busy network link, a packet may have to wait a while until it can get a slot (this is called _____________). If there is so much incoming data that the switch queue fills up, the packet is dropped, so it needs to be resent -- even though the network is functioning fine.
Network congestion
When one part of the network is cut off from the rest due to a network fault, that is sometimes called a network partition or netsplit. The more general term is _____________.
Network fault
Even if _____________ are rare in your environment, the fact that they can occur means that your software needs to be able to handle them.
Network faults
The crash-stop fault model is a system model that describes assumptions about __________.
Node failures
In public clouds and multi-tenant datacenters, resources are shared among many customers. As you have no control over or insight into other customers' usage of the shared resources, network delays can be highly variable if someone near you (a _____________) is using a lot of resources.
Noisy neighbor
The difficulty is that partial failures are _____________: if you try to do anything involving multiple nodes and the network, it may sometimes work and sometimes unpredictably fail.
Non-deterministic
The time it takes for a message to travel across a network is _____________: you may not even know whether something succeeded or not.
Non-deterministic
If you use software that requires synchronized time-of-day clocks, it is essential that you also carefully monitor the clock __________ between all the machines. Any node whose clock drifts too far from others should be declared dead and removed from the cluster. Such monitoring ensures that you notice the broken clocks before they can cause too much damage.
Offsets
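A minimal sketch of the monitoring check described above. The function name, the node names, the offsets, and the 100 ms threshold are all hypothetical; how the offsets are actually measured is outside the scope of this sketch.

```python
def find_bad_clocks(offsets_ms, max_offset_ms):
    """Return the nodes whose measured clock offset exceeds the allowed bound."""
    return sorted(
        node for node, offset in offsets_ms.items()
        if abs(offset) > max_offset_ms
    )


# Measured offset of each node's clock from a reference, in milliseconds.
offsets_ms = {"node-a": 3.0, "node-b": -250.0, "node-c": 12.5}
bad = find_bad_clocks(offsets_ms, max_offset_ms=100.0)
print(bad)
```

Nodes returned by such a check would be candidates for being declared dead and removed from the cluster.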
In some specific circumstances you might get some feedback to explicitly tell you that something is not working: if you can reach the machine on which the node should be running, but no process is listening on the destination port (e.g., because the process crashed), the _____________ will helpfully close or refuse TCP connections by sending a RST or FIN packet in reply. However, if the node crashed while it was handling your request, you have no way of knowing how much data was actually processed by the remote node.
Operating system
Ethernet and IP (used widely in datacenter networks) are _____________-switched protocols, which suffer from queuing and thus unbounded delays in the network.
Packet
In a distributed system, there may well be some parts of the system that are broken in some unpredictable way, even though other parts of the system are working fine. This is known as a _____________.
Partial Failure
The __________ system model means that a system behaves like a synchronous system most of the time, but it sometimes exceeds the bounds for network delay, process pauses, and clock drift. This is a realistic model of many systems: most of the time, networks and processes are quite well behaved -- otherwise we would never be able to get anything done -- but we have to reckon with the fact that any timing assumptions may be shattered occasionally. When this happens, network delay, pauses, and clock error may become arbitrarily large.
Partially synchronous
A node in a distributed system must assume that its execution can be __________ for a significant length of time at any point, even in the middle of a function. During this time, the rest of the world keeps moving and may even declare the node dead because it's not responding. Eventually, the node may continue running, without even noticing that it was asleep until it checks its clock sometime later.
Paused
Say you have a database with a single leader per partition. How does a node know that it is still leader (that it hasn't been declared dead by the others), and that it may safely accept writes? One option is for the leader to obtain a lease from other nodes. Only one node can hold the lease at any one time. In order to remain leader, the node must periodically renew the lease before it expires. For sure, you shouldn't use time-of-day clocks to figure out when the lease expires. And, even if we change the protocol to only use the local monotonic clock, we can't assume that very little time passes between the point we check the time and the time when the request is processed. This is because the thread processing the request could be __________ for a time longer than the lease expiration interval. This could happen for various reasons: 1. Garbage collection 2. Virtual machine could be suspended 3. Operating system context-switches to another thread, or the hypervisor switches to a different virtual machine. 4. If the application performs synchronous disk access, a thread may be paused waiting for a slow disk I/O operation to complete. 5. If the operating system is configured to allow swapping to disk (paging), a simple memory access may result in a page fault that requires a page from disk to be loaded into memory. The thread is paused while this slow I/O operation takes place. 6. A Unix process can be paused by sending it the SIGSTOP signal
Paused
Time-of-day and monotonic clocks measure actual elapsed time (as opposed to logical clocks, which measure the relative ordering of events). These clocks are also known as __________ clocks.
Physical
Applications depend on clocks in various ways to answer important questions. The following questions use clocks to describe __________ (events that occur on a particular date, at a particular time): 1. When was this article published? 2. At what date and time should the reminder email be sent? 3. When does this cache entry expire? 4. What is the timestamp on this error message in the log file?
Points in time
The variability of packet delays on asynchronous computer networks is most often due to queueing: when a packet reaches the destination machine, if all CPU cores are currently busy, the incoming request from the network is _____________ by the operating system until the application is ready to handle it.
Queued
_____________ delays have an especially wide range when a system (using asynchronous packet networks) is close to its maximum capacity.
Queueing
Most commonly, the __________ is an absolute majority of more than half the nodes. A majority allows the system to continue working if individual nodes have failed.
Quorum
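The majority rule can be sketched as a pair of small helpers. The function names are hypothetical, and the sketch assumes every node has exactly one vote.

```python
def has_quorum(votes, cluster_size):
    """True if votes form an absolute majority (more than half the nodes)."""
    return votes > cluster_size // 2


def tolerated_failures(cluster_size):
    """How many nodes can fail while a majority can still be formed."""
    return (cluster_size - 1) // 2


print(has_quorum(3, 5), has_quorum(2, 4))
print(tolerated_failures(3), tolerated_failures(5))
```

Because any two majorities of the same cluster must share at least one node, two conflicting decisions cannot both gather a quorum.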
A distributed system cannot exclusively rely on a single node, because a node may fail at any time, potentially leaving the system stuck and unable to recover. Instead, many distributed algorithms rely on a __________, that is, voting among the nodes: decisions require some minimum number of votes from several nodes in order to reduce the dependence on any one particular node.
Quorum
If a __________ of nodes declares another node dead, then it must be considered dead, even if that node still very much feels alive. The individual node must abide by the decision and step down.
Quorum
Providing __________ guarantees in a system requires support from all levels of the software stack. For most server-side data processing systems, such guarantees are simply not economical or appropriate. Consequently, these systems must suffer the pauses and clock instability that come from operating in a non-real-time environment.
Real-time
In some specific circumstances you might get some feedback to explicitly tell you that something is not working: if a node process crashed (or was killed by an administrator) but the node's operating system is still running, a _____________ can notify other nodes about the crash so that another node can take over quickly without having to wait for a timeout to expire.
Script
Distributed systems are different from programs running on a single computer: there is no __________ memory, only message passing via an unreliable network with variable delays, and the systems may suffer from partial failures, unreliable clocks, and processing pauses.
Shared
When writing multi-threaded code on a single machine, we have fairly good tools for making it thread-safe: mutexes, semaphores, atomic counters, lock-free data structures, blocking queues, and so on. Unfortunately, these tools don't directly translate to distributed systems, because a distributed system has no __________ -- only messages sent over an unreliable network.
Shared memory
In _____________ systems, a bunch of machines are connected by a network. The network is the only way those machines can communicate -- we assume that each machine has its own memory and disk, and one machine cannot access another machine's memory or disk.
Shared-nothing
_____________ has become the dominant approach for building internet services: it's comparatively cheap because it requires no special hardware, it can make use of commoditized cloud computing services, and it can achieve high reliability through redundancy across multiple geographically distributed datacenters.
Shared-nothing
A _____________ timeout detects faults faster, but carries a higher risk of incorrectly declaring a node dead when in fact it has only suffered a temporary slowdown (e.g., due to a load spike on the node or the network).
Short
The bigger a system gets, the more likely it is that one of its components is broken. In a system with thousands of nodes, it is reasonable to assume that _____________ is always broken.
Something
A _____________ is more like a single-node computer than a distributed system: it deals with partial failure by letting it escalate into total failure -- if any part of the system fails, just let everything crash (like a kernel panic on a single machine).
Supercomputer
The __________ system model assumes bounded network delay, bounded process pauses, and bounded clock error. This does not imply exactly synchronized clocks or zero network delay; it just means you know that network delay, pauses, and clock drift will never exceed some fixed upper bound. This model is not a realistic model of most practical systems, because unbounded delays and pauses do occur.
Synchronous
A __________ is an abstraction that describes what things an algorithm may assume, to identify the kinds of faults that we expect to happen in a system.
System model
In a distributed system, we can state the assumptions we are making about the behavior (the __________) and design the actual system in such a way that it meets those assumptions.
System model
A __________ clock returns the current date and time according to some calendar (also known as wall-clock time).
Time-of-day
__________ clocks are unsuitable for measuring elapsed time, because they may jump back in time.
Time-of-day
__________ clocks are usually synchronized with NTP.
Time-of-day
__________ clocks have various oddities. In particular, if the local clock is too far ahead of the NTP server, it may be forcibly reset and appear to jump back to a previous point in time.
Time-of-day
__________ clocks need to be set according to an NTP server or other external time source in order to be useful.
Time-of-day
A _____________ is the only sure way of detecting a fault.
Timeout
A _____________ is the only sure way of detecting a fault: after some time you give up waiting and assume that the response is not going to arrive.
Timeout
If you send a request and don't get a response, it's not possible to distinguish whether the request was lost, the remote node is down, or the response was lost. The usual way of handling this nondeterminism is a _____________.
Timeout
TCP considers a packet to be lost if it is not acknowledged within some _____________ (which is calculated from observed round-trip times), and lost packets are automatically retransmitted. The application does not see the retransmission itself, but it does see the resulting delay.
Timeout
There's no "correct" value for _____________ -- they need to be determined experimentally.
Timeouts
The asynchronous model is a system model that describes __________ assumptions.
Timing
The partially synchronous model is a system model that describes __________ assumptions.
Timing
The synchronous model is a system model that describes __________ assumptions.
Timing
The choice between UDP and TCP is a trade-off between reliability and variability of delays: as _____________ does not perform flow control and does not retransmit lost packets, it avoids some of the reasons for variable network delays (although it is still susceptible to switch queues and scheduling delays).
UDP
_____________ (rather than TCP) is a good choice in situations where delayed data is worthless.
UDP
Asynchronous networks have _____________: they try to deliver packets as quickly as possible, but there is no upper limit on the time it may take for a packet to arrive.
Unbounded delays
If we want to make distributed systems work, we must accept the possibility of partial failure and build fault-tolerance mechanisms into the software. In other words, we need to build a reliable system from _____________ components.
Unreliable
Multi-tenancy with dynamic resource partitioning (sharing resources between users) provides better resource _____________, so it is cheaper, but it has the downside of the variable delays.
Utilization
The internet and most internal networks in datacenters (often Ethernet) are _____________ networks. In this kind of network, one node can send a message (a packet) to another node, but the network gives no guarantees as to when it will arrive, or whether it will arrive at all.
asynchronous packet
One particular situation in which it is tempting, but dangerous, to rely on __________ is in the ordering of events across multiple nodes.
clocks
There is a spectrum of philosophies on how to build large-scale computing systems: at one end of the scale is the field of _____________. Supercomputers with thousands of CPUs are typically used for computationally intensive scientific computing tasks, such as weather forecasting or molecular dynamics.
high-performance computing
One fundamental problem with __________ (LWW) is that database writes can mysteriously disappear: a node with a lagging clock is unable to overwrite values previously written by a node with a fast clock until the clock skew between the nodes has elapsed. This scenario can cause arbitrary amounts of data to be silently dropped without any error being reported to the application (i.e., write B should overwrite write A, but the node handling B has a slower clock than the node handling A, so A overwrites B when the writes are replicated).
last-write-wins
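The lost-write scenario above can be reproduced in a few lines. The `lww_merge` function and the timestamp values are hypothetical; the sketch assumes each write carries a timestamp taken from the clock of the node that handled it.

```python
def lww_merge(current, incoming):
    """Keep whichever (timestamp, value) pair has the larger timestamp."""
    return incoming if incoming[0] > current[0] else current


# Write A is handled by a node whose clock runs slightly fast.
write_a = (42.004, "A")
# Write B happens later in real time, but its node's clock lags behind,
# so it is stamped with a smaller timestamp.
write_b = (42.003, "B")

replica = lww_merge(write_a, write_b)
print(replica)
```

The replica keeps A and silently discards B, even though B was the more recent write, and no error is reported to the application.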
Frequently, a system requires there to be only __________ of some thing: - Only one node is allowed to be the leader for a database partition, to avoid split brain. - Only one transaction or client is allowed to hold the lock for a particular resource or object, to prevent concurrently writing to it and corrupting it. - Only one user is allowed to register a particular username, because a username must uniquely identify a user. Implementing this in a distributed system requires care: even if a node believes that it is "the chosen one", that doesn't necessarily mean a quorum of nodes agree.
one