Distributed Systems Exam 3
What is the golden rule of recoverability?
"Never modify the only copy" transaction makes changes to local copy of resource until a commit or a rollback
Describe the AFS architecture
(client caches files on local system) - client requests to access a specific file - cache is checked first, request made to server if necessary - server sends copy back to client for modifications - changes are ONLY applied to client until the file is closed - client copy is copied back to the server with new updates - client also retains copy in cache for future updates - Amdahl's Law: MAKE THE COMMON CASE FAST - implements session semantics
How is concurrency control implemented in NFS?
(clients require locks on files) - client informs the server of intent to lock - if available, the server grants the lock to the client with lease - if client dies while holding the lock, the lease expires - client may renew lease before the old lease expires
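A minimal sketch of lease-based locking, assuming a toy in-memory server. The class name `LeaseLockServer`, its methods, and the 10-second lease are hypothetical and are not part of the NFS protocol. Time is passed in explicitly as `now` so the example runs without sleeping.

```python
class LeaseLockServer:
    """Toy lock server: one lock per file name, each granted with an expiry time."""

    def __init__(self, lease_seconds=10.0):
        self.lease_seconds = lease_seconds
        self.locks = {}  # file name -> (client id, lease expiry time)

    def acquire(self, filename, client, now):
        holder = self.locks.get(filename)
        # The lock is free if nobody holds it or the holder's lease has run out.
        if holder is None or holder[1] <= now:
            self.locks[filename] = (client, now + self.lease_seconds)
            return True
        return False

    def renew(self, filename, client, now):
        holder = self.locks.get(filename)
        # Only the current holder may renew, and only before the lease expires.
        if holder is not None and holder[0] == client and holder[1] > now:
            self.locks[filename] = (client, now + self.lease_seconds)
            return True
        return False


server = LeaseLockServer(lease_seconds=10.0)
granted_a = server.acquire("notes.txt", "A", now=0.0)   # A gets the lock
granted_b = server.acquire("notes.txt", "B", now=5.0)   # B is refused, A still holds it
renewed_a = server.renew("notes.txt", "A", now=8.0)     # lease now runs until 18.0
late_b = server.acquire("notes.txt", "B", now=20.0)     # A never renewed again, lease expired
```

If client A crashes after time 8.0, nothing has to clean up after it: the lease simply runs out and B's later request succeeds.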
Name the messaging modes we discussed in class
- PubSub: 1 --> many - Point to point: 1 --> 1 Both of these are considered MOM (message oriented middleware)
What is an RDD?
- Resilient Distributed Dataset - fundamental data structure in Spark - immutable --> we do not modify, we create new ones
Explain Network Time Protocol
- allows users to synch with UTC (coordinated universal time) over the internet - atomic clocks are at the top of the graph - time servers (like CMU clock) synch with atomic clocks - our devices synch with time servers - think of this as a hierarchy
How does a client read a file sequentially in GFS?
- client computes the chunk index based on the byte offset of where it wants to start reading the file - client calls master server with file name and chunk index - master server returns chunk ID and location of replicas - client calls chunk server directly and receives the chunk (no caching in this sequence)
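The read sequence above can be sketched as follows. The chunk size (64 MB here), the file name, the chunk IDs, and the `master_table` dictionary are all assumptions made for illustration; a real master keeps this metadata in its own data structures.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # assumed chunk size in bytes

def chunk_index(byte_offset, chunk_size=CHUNK_SIZE):
    # Which chunk of the file holds this byte?
    return byte_offset // chunk_size

def offset_in_chunk(byte_offset, chunk_size=CHUNK_SIZE):
    # Where inside that chunk does the read start?
    return byte_offset % chunk_size

# Hypothetical master metadata: (file name, chunk index) -> (chunk ID, replica locations)
master_table = {
    ("/logs/web.log", 0): ("chunk-0001", ["cs1", "cs2", "cs3"]),
    ("/logs/web.log", 1): ("chunk-0002", ["cs2", "cs4", "cs5"]),
}

def locate(filename, byte_offset):
    # Step 1: the client computes the chunk index itself.
    index = chunk_index(byte_offset)
    # Step 2: the master maps (file name, chunk index) to a chunk ID and replicas.
    chunk_id, replicas = master_table[(filename, index)]
    # Step 3: the client would now contact one of the replicas directly.
    return chunk_id, replicas, offset_in_chunk(byte_offset)

result = locate("/logs/web.log", 70 * 1024 * 1024)
```

With these assumed values, a read starting at byte offset 70 MB lands in the second chunk (index 1), 6 MB into it.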
Explain the 'reduce' stage of MapReduce
- combines intermediate values into 1+ final values for each key - can have multiple reduce stages in a pipeline
Name some issues with microservice architectures
- data replication becomes more difficult - system becomes more complex --> each new service requires connections to other services - debugging can become more difficult - performance hit when we switch from memory-based communication to network communication
What are two types of decoupling that make messaging systems more resilient?
- decoupled in space: sender does not know identity of receivers, and vice versa - decoupled in time: system can store messages until they can be delivered successfully
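A rough sketch of both kinds of decoupling, using a hypothetical in-memory `Broker` class (not a real MOM product or the JMS API). Senders and receivers only know a queue or topic name, never each other, and a queue holds messages until someone asks for them.

```python
from collections import defaultdict, deque

class Broker:
    def __init__(self):
        self.queues = defaultdict(deque)      # point-to-point: each message goes to one consumer
        self.subscribers = defaultdict(list)  # pub/sub: each message goes to every subscriber

    def send(self, queue_name, message):
        # Decoupled in time: the message waits here until a consumer shows up.
        self.queues[queue_name].append(message)

    def receive(self, queue_name):
        queue = self.queues[queue_name]
        return queue.popleft() if queue else None

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Decoupled in space: the publisher never sees who the subscribers are.
        for callback in self.subscribers[topic]:
            callback(message)


broker = Broker()
broker.send("orders", "order-1")    # no consumer is running yet
first = broker.receive("orders")    # delivered later, exactly once
second = broker.receive("orders")   # queue is now empty

inbox_a, inbox_b = [], []
broker.subscribe("prices", inbox_a.append)
broker.subscribe("prices", inbox_b.append)
broker.publish("prices", 42)        # one publish reaches both subscribers
```

This toy topic does not store messages for subscribers that join late; a durable subscription in a real system would.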
Explain how GFS splits and organizes files
- each file is mapped to a set of chunks (each 64 MB in GFS; HDFS defaults to 128 MB blocks) - each cluster has one master and hundreds of chunk servers - each chunk is replicated on multiple chunk servers - master has metadata and knows locations of chunk replicas - chunk servers know what replicas they have
What are benefits of a microservice when compared to monoliths?
- easier to replicate parts and scale - adding new features does not impact existing system - more equipped to handle failures since we can replicate more effectively
Apache Spark
- faster and more flexible version of MapReduce - uses RDDs
What parts does every S3 object have?
- key - data - user metadata (tags, etc.) - system metadata (time of creation, etc.) - storage class --> pay more for faster access (DUSKS)
How are changes allowed and propagated in GFS?
- master server grants permission to a process (lease) - if available, master provides locations of chunks - program accesses these chunkservers directly - modifying chunkserver (always primary chunk holder) propagates any changes to servers with backup copies - no changes saved until all chunkservers acknowledge (atomic)
Explain the 'shuffle and sort' phase of MapReduce
- responsible for redistributing the map output across the reducers - ensure that all values associated with a particular key end up on the same reducer
What are some requirements of GFS?
- run reliably with daily component failures - Google issue: not massive # of files, but file sizes are huge - write once, append / read many - long reads/appends dominate access --> no caching needed - throughput more important than latency
Explain the 'map' stage of MapReduce
- takes records from source as key/value pairs - mapper produces 1+ intermediate values with an output key - can have multiple map stages in a pipeline
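The map, shuffle and sort, and reduce stages covered in these cards can be sketched as a single-machine word count. This is only an illustration in plain Python, not the Hadoop API; the function names and sample records are made up.

```python
from collections import defaultdict

def map_stage(record):
    # Input record is a (line number, line) pair; emit one (word, 1) pair per word.
    _, line = record
    return [(word.lower(), 1) for word in line.split()]

def shuffle_and_sort(pairs):
    # Group every value under its key so one reducer sees all values for that key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_stage(key, values):
    # Combine the intermediate values for one key into a final value.
    return key, sum(values)

records = [(1, "the cat sat"), (2, "the dog sat"), (3, "The end")]
intermediate = [pair for record in records for pair in map_stage(record)]
grouped = shuffle_and_sort(intermediate)
counts = dict(reduce_stage(key, values) for key, values in grouped)
```

In a real cluster the mappers and reducers run on different machines, and the shuffle moves data across the network between them.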
Name some issues with monolithic architectures
- to scale, we have to replicate the entire architecture - adding new features/updates impacts the entire service - tied to one coupled installation
What is Java Message Service (JMS)?
- widely used abstraction API for interacting with different MOM systems (JMS IS NOT MOM) - client-facing service
What is a system in the context of service architectures?
A collection of operating microservices
Explain the concept of a consistency model. Provide some examples.
A contract between processes and a data store. If the processes agree to obey certain rules, the store promises to work correctly. Examples: strict, sequential, eventual
What parts of the CAP Theorem does S3 uphold?
AP: S3 is available and tolerates partitions, but it is only consistent eventually
What is NTP accurate enough for? What is it not accurate enough for?
Accurate enough to keep logs on personal machine Not accurate enough to keep logs across distributed systems with multiple machines
How do we add instrumentation to microservice architectures?
Add timers to microservices to log performance speeds Our primary concerns are latency and throughput
Atomicity
All or nothing, no 'intermediate' steps We either commit the whole transaction or abort the entire process
In a mobile app, why are long-running tasks run in a background thread? a) It makes that task faster b) It keeps the GUI responsive while the task is still running c) It makes the GUI faster d) It's just a programming convenience and isn't really needed
B
List some common use cases of S3
Backup storage and archival storage Replicating objects across regions for performance and fault tolerance Data for static websites Source and destination of data for applications running on EC2
In a mobile app, which of the following is not a way for GUI components to be defined? a) Programmatically -- usually in the main method b) In a setup file using xml c) By an asynchronous thread d) Interactively, using a GUI editor
C
Session semantics consistency model
Changes are initially visible only to the process that modifies the file Changes become visible when the file is closed
How does a generic distributed file system work?
Consists of client and server computers Client module: interface used by apps --> makes calls to server Server: contains flat file service and directory service, both of these provide an RPC interface for clients to use
Which flat file operation does not pass a UUID
Create()
What storage systems have weak consistency?
DNS
Define continuous delivery and continuous deployment
Delivery: code changes are automatically tested, software is always in a 'deployable' state Deployment: automatically deploy changes (one step further than delivery)
Consistency
Different meaning than in DB's Data is in a consistent state when a transaction starts and when it ends Ex. Total $$$ in two accounts is the same after money is moved between them
What is a transaction coordinator?
Framework (class, methods, etc.) that we build to carry out transactions and maintain isolation. Example transaction T: tid = openTransaction(); a.withdraw(tid, 100); b.deposit(tid, 100); c.withdraw(tid, 200); d.deposit(tid, 200); closeTransaction(tid);
What were the goals of the Andrew File System
Goals 1) SCALABILITY 2) Reduce client-server interactions using client caches
What were the goals of Sun NFS (Network File System)
Goals 1) appear like a UNIX file system 2) Implement a POSIX API (standards for UNIX-like systems) 3) Files available from any machine
What is two-phase locking?
Growing phase: acquire all locks that are needed for transaction Shrinking phase: release all locks Once any lock is released, no new locks may be acquired
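A small sketch of the two-phase rule, assuming a shared dictionary as the lock table. The class `Transaction2PL` and the exception name are hypothetical; a real lock manager would block or queue waiting transactions instead of returning False.

```python
class TwoPhaseLockingError(Exception):
    pass

class Transaction2PL:
    def __init__(self, name):
        self.name = name
        self.held = set()
        self.shrinking = False  # becomes True at the first release

    def acquire(self, lock_table, item):
        # Growing phase only: once any lock has been released, no new locks.
        if self.shrinking:
            raise TwoPhaseLockingError("cannot acquire after releasing a lock")
        owner = lock_table.get(item)
        if owner is not None and owner != self.name:
            return False  # held by another transaction
        lock_table[item] = self.name
        self.held.add(item)
        return True

    def release(self, lock_table, item):
        # The first release starts the shrinking phase.
        self.shrinking = True
        self.held.discard(item)
        if lock_table.get(item) == self.name:
            del lock_table[item]


lock_table = {}
t1 = Transaction2PL("T1")
t1.acquire(lock_table, "a")
t1.acquire(lock_table, "b")
t1.release(lock_table, "a")      # shrinking phase begins here
try:
    t1.acquire(lock_table, "c")  # violates two-phase locking
    violated = False
except TwoPhaseLockingError:
    violated = True
```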
How are Google File System and Hadoop related?
Hadoop is an open source implementation of the GFS design (its file system, HDFS, is modeled on GFS)
Explain how serial transactions can produce different answers
If both transactions operate on the same data, order of operations can become relevant Must accept both answers as correct
What is the central idea behind Hadoop and GFS
Instead of moving data to the code, we move code to data With VMs, one machine can act as many With Hadoop/GFS, it is opposite --> many act as one
How is concurrency control implemented in AFS
It isn't... No support for large shared databases or updating files that have multiple replicas
What type of evaluation do RDDs use?
LAZY EVALUATION - the execution does not start until an action is triggered (e.g. collect, count, reduce); transformations such as map or reduceByKey only record what should be computed
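Lazy evaluation can be illustrated without Spark. The `LazyDataset` class below is a hypothetical stand-in written in plain Python, not the Spark API: transformations only record a step and return a new object, and nothing runs until the `collect` action is called.

```python
class LazyDataset:
    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)  # recorded transformations, never mutated

    def map(self, fn):
        # Transformation: return a NEW dataset with one more recorded step.
        return LazyDataset(self.source, self.steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self.source, self.steps + (("filter", fn),))

    def collect(self):
        # Action: only now is the recorded pipeline actually executed.
        items = iter(self.source)
        for kind, fn in self.steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []

def square(x):
    calls.append(x)  # record that real work happened
    return x * x

base = LazyDataset([1, 2, 3, 4])
evens = base.map(square).filter(lambda v: v % 2 == 0)
ran_before_action = len(calls)  # still 0: nothing has executed yet
result = evens.collect()        # triggers the whole pipeline
```

Because each transformation returns a new object, `base` is left unchanged, which mirrors the immutability of RDDs.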
What timekeeping approach have we adopted for dist. systems?
Lamport Clocks (Logical Clocks)
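A minimal sketch of a Lamport clock, assuming the usual rules: increment the counter on every local event and send, and on receive set it to one more than the larger of the local counter and the message's timestamp. The class and method names are illustrative.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the logical clock by one.
        self.time += 1
        return self.time

    def send(self):
        # Sending is an event; the returned value is attached to the message.
        return self.tick()

    def receive(self, message_time):
        # Jump past the sender's timestamp if it is ahead of us.
        self.time = max(self.time, message_time) + 1
        return self.time


p, q = LamportClock(), LamportClock()
p.tick()                     # p = 1
p.tick()                     # p = 2
stamp = p.send()             # p = 3, message carries 3
q.tick()                     # q = 1
received = q.receive(stamp)  # q = max(1, 3) + 1 = 4
```

The receive event always gets a larger timestamp than the send event, so causally related events are ordered without synchronized physical clocks.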
What is the most popular way to achieve transaction isolation?
Locking
List directory service operations
- Lookup(directory, name) --> fileID - AddName(directory, name, fileID) - UnName(directory, name) - GetNames(directory, pattern) --> nameSeq
What is the primary aim of a server that supports transactions?
Maximize concurrency Data consistency is more important than transaction speed
What are the main components needed to use JMS?
Messaging clients: produce and consume messages Message destinations: queues/topics to send and receive messages JMS-Compatible MOM
What does "smart endpoints and dumb pipes" mean?
Microservices (endpoints) are smart because they contain all the business logic Communication (pipes) are dumb because message contents are kept simple; the concern is transferring the data, not trying to understand it
What should we never do when correcting clocks?
NEVER SET THE CLOCK BACKWARDS - we will end up w/ duplicate timestamps for events
Describe the NFS architecture
NFS uses RPC over TCP or UDP. NFS sits under a virtual file system layer --> the NFS client converts file operations into RPC calls, and the NFS server receives those calls and carries them out on its local UNIX file system. Directories are distributed. Remote mount: a file system on one machine can also appear in the file system hierarchy of another machine
What is the CAP Theorem argument?
Nodes A and B are partitioned - We write x to B and then attempt to read x from A A is either unavailable or not consistent ... we must choose We know that partitions happen (P), so we must choose AP or CP
What is serial transaction execution?
One transaction runs to completion before the other transaction begins. Then the second transaction runs to completion
Two-phase commit
Phase 1: voting --> coordinator sends message to all participating nodes to verify whether each node can commit the transaction, nodes send back 'yes' or 'no' Phase 2: commit --> coordinator sends 'commit' or 'abort' message to nodes based on results of voting phase
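The two phases can be sketched with in-memory objects. The `Participant` class and the state names are hypothetical, and the sketch ignores timeouts, crashes, and logging, which a real coordinator has to handle.

```python
class Participant:
    def __init__(self, name, can_commit):
        self.name = name
        self.can_commit = can_commit
        self.state = "working"

    def vote(self):
        # Phase 1: answer the coordinator's "can you commit?" request.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, decision):
        # Phase 2: apply the coordinator's global decision.
        self.state = decision


def two_phase_commit(participants):
    # Phase 1 (voting): ask every participant, so build the full list of votes.
    votes = [p.vote() for p in participants]
    decision = "committed" if all(votes) else "aborted"
    # Phase 2 (commit): tell everyone the outcome.
    for p in participants:
        p.finish(decision)
    return decision


happy = [Participant("a", True), Participant("b", True)]
outcome_ok = two_phase_commit(happy)

mixed = [Participant("a", True), Participant("b", False)]
outcome_fail = two_phase_commit(mixed)
```

A single "no" vote is enough to abort the transaction on every node, which is what keeps the outcome all-or-nothing.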
In JMS, are message production and consumption synch or asynch?
Production: always synchronous. Consumption: either asynch --> register as listener on queue or topic, or synch --> read and block until message is available (or timeout)
Explain how a cache is utilized with strict consistency
Programs cannot observe any differences between cached copies and stored data after an update Every process works with the same cache
Who decides the number of mappers and reducers?
The program
Sequential consistency
The result of any execution is the same as if the (read and write) operations of all processes were executed in some sequential order, and the operations of each individual process appear in this sequence in the order specified by its program
What is a serial equivalent interleaving?
The result of the interleaved transactions is the same as if the transactions were serial. Some interleavings are not serially equivalent
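A worked example, assuming two made-up transactions on a single balance: T adds 10 and U doubles it. The `run` helper and the operation names are hypothetical. The two serial orders give different (both acceptable) answers, while the interleaving shown matches neither, so it is not serially equivalent.

```python
def run(schedule, balance=100):
    local = {}  # each transaction's private working copy
    for txn, op in schedule:
        if op == "read":
            local[txn] = balance
        elif op == "add10":
            local[txn] += 10
        elif op == "double":
            local[txn] *= 2
        elif op == "write":
            balance = local[txn]
    return balance

T = [("T", "read"), ("T", "add10"), ("T", "write")]
U = [("U", "read"), ("U", "double"), ("U", "write")]

serial_tu = run(T + U)  # (100 + 10) * 2
serial_ut = run(U + T)  # (100 * 2) + 10

# Both transactions read the old balance before either one writes (lost update).
interleaved = run([
    ("T", "read"), ("U", "read"),
    ("T", "add10"), ("T", "write"),
    ("U", "double"), ("U", "write"),
])
```

Here U overwrites T's write using a stale read, so T's update is lost.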
What is the primary purpose of the directory service?
Translate text names to UFIDs The directory service is a client of the flat file service
Isolation
Two transactions do not interfere with each other
How are non-distributed file systems typically organized? Why?
Typically layered - each layer depends only on layer below it SEPARATION OF CONCERNS!
What file system(s) guarantee(s) strict consistency?
UNIX (the local file system). NFS and AFS only approximate this. S3 makes no real attempt to approximate this
Define Venus and Vice
Venus: client-side software used to implement AFS, responsible for caching files on client system and making calls to AFS if required Vice: another name for AFS servers, responsible for managing file system and access controls
Eventual consistency
Weak consistency Eventually, all of the copies of a data store will return the same value
What are the four requirements of deadlock?
1) Mutual exclusion: resources need mutual exclusion. They are not thread safe 2) Reservation: resources may be reserved while a process is waiting for more 3) No preemption: you cannot force a process to give up a resource 4) Circular wait is possible
What are three solutions for deadlocking?
1) Prevention -- disallow one of the four requirements 2) avoidance -- study what is required before beginning 3) detection -- use timeouts or wait-for graphs
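Detection with a wait-for graph comes down to finding a cycle. The sketch below assumes the graph is given as a dictionary mapping each transaction to the set of transactions it is waiting on; the function name and the transaction labels are illustrative.

```python
def has_deadlock(wait_for):
    """Return True if the wait-for graph contains a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}

    def visit(node):
        color[node] = GREY
        for nxt in wait_for.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                return True  # reached a node already on the current path: cycle
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(wait_for))


circular = {"T1": {"T2"}, "T2": {"T3"}, "T3": {"T1"}}
chain = {"T1": {"T2"}, "T2": {"T3"}}

deadlocked = has_deadlock(circular)
not_deadlocked = has_deadlock(chain)
```

Once a cycle is found, a system would typically abort one transaction in the cycle to break the circular wait.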
What are four properties of a microservice architecture?
1) Small (as measured in complexity) 2) Focused: low coupling, separation of concerns 3) Autonomous: separate process or service 4) Well-defined API: easy for other services to interact with it
How are flat file service operations performed in general DFS?
1) The client module makes a call on one of the operations 2) The directory service receives the call 3) Directory service returns unique file ID 4) client sends request to flat file service using file ID 5) flat file service returns data or status
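The steps above can be sketched with two toy in-memory services. The class and method names are modeled loosely on the operations listed in these cards, but the Python signatures are assumptions, and real services would be reached over RPC.

```python
import uuid

class FlatFileService:
    def __init__(self):
        self.files = {}  # unique file ID -> file contents

    def create(self):
        # Create takes no file ID; it is the operation that hands one out.
        ufid = uuid.uuid4().hex
        self.files[ufid] = b""
        return ufid

    def write(self, ufid, position, data):
        content = self.files[ufid]
        self.files[ufid] = content[:position] + data + content[position + len(data):]

    def read(self, ufid, position, n):
        return self.files[ufid][position:position + n]


class DirectoryService:
    def __init__(self):
        self.names = {}  # (directory, text name) -> unique file ID

    def add_name(self, directory, name, ufid):
        self.names[(directory, name)] = ufid

    def lookup(self, directory, name):
        return self.names[(directory, name)]


files, directories = FlatFileService(), DirectoryService()
new_id = files.create()
directories.add_name("/home", "todo.txt", new_id)
files.write(new_id, 0, b"hello world")

found_id = directories.lookup("/home", "todo.txt")  # text name -> file ID
data = files.read(found_id, 6, 5)                   # file ID -> bytes
```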
Name and describe two common distributed file system models
1) remote access model: the client sends requests to access a file on the server, but the file itself never leaves the server 2) upload/download model: client sends request to access a file, and the file is downloaded to the client for editing --> it is uploaded back to the server upon completion
What are the main tasks that Hadoop/GFS handle?
1) storing files 2) running applications on top of files
Can a module be both a queue listener and a queue writer? a) No, they're the same b) Yes, because every queue needs both c) No, the JMS protocol only allows one or the other d) Yes, that's how the data is transferred from one queue to another e) No, that's only possible with topics
d
What is the difference between JMS and Message Oriented Middleware (MOM)? a) None, they're the same b) MOM provides an interface to JMS c) JMS provides Point-to-Point queues and topics, but MOMs only provide topics d) JMS only interacts with Servlets e) JMS provides an interface to MOM
e
What is the highest level of computation in MapReduce
job
What are interleaved transactions?
The individual actions of the transactions are mixed, but the program order within each transaction is preserved. Example: T1 = a b c d, T2 = w x y z. Interleaving: a b w x y c d z
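A short check of what "the program order remains" means, using the operations a b c d and w x y z from the card. The helper name is hypothetical, and it assumes operation names are unique across transactions.

```python
def preserves_program_order(interleaving, *transactions):
    for txn in transactions:
        # Pull out this transaction's operations in the order they appear.
        seen = [op for op in interleaving if op in txn]
        if seen != list(txn):
            return False
    # Every operation must appear exactly once.
    return len(interleaving) == sum(len(t) for t in transactions)


t1 = ["a", "b", "c", "d"]
t2 = ["w", "x", "y", "z"]

valid = preserves_program_order("a b w x y c d z".split(), t1, t2)
invalid = preserves_program_order("b a w x y c d z".split(), t1, t2)
```

The first sequence is a legal interleaving; the second is not, because it runs b before a.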
What is an inconsistency window?
The period between a data update and the moment that all replicas have the updated value
Strict one copy
A read after a write always gets the value that was just written
Name some valid JMS message types
Text Object Map Bytes Stream (TOMBS)
What happens when the server is stateless in a distributed file system?
Each request by the client must have all information needed to perform the job May have to authenticate and authorize for every request
What kind of consistency does S3 maintain?
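Eventual consistency (the original S3 model): If you PUT a new object, subsequent reads will return the object. If you overwrite with a PUT, the change will be reflected eventually. If you DELETE, it will eventually be removed. (Since December 2020, AWS documents strong read-after-write consistency for S3 PUTs and DELETEs.)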
Strict consistency
Every read on data item x returns a value corresponding to the result of the most recent write. NOT POSSIBLE in a distributed system due to message latency
Durable
The commit causes a permanent change to stable storage We can recover from crashes, probably using some sort of log-based recovery algorithm
What kind of storage is S3? How is data stored and accessed?
Remote object storage Data is stored as objects, which have no defined format Data is accessed using REST - PUT, GET, DELETE
What consistency model do transactions typically utilize?
Sequential consistency
Why can't a distributed system have a global clock?
Skew: two clocks will have two different times Drift: clocks vary in speed
What is a challenge of end-to-end instrumentation w/ microservices?
System clocks are not exactly synched
What is the highest level of computation in Spark?
application