Chapter 12: The Future of Data Systems
By structuring applications around dataflow and checking constraints __________, we can avoid most coordination and create systems that maintain integrity but still perform well, even in geographically distributed scenarios and in the presence of faults.
Asynchronously
Secondary indexes often cross partition boundaries. A partitioned system with secondary indexes either needs to send writes to multiple partitions (if the index is term-partitioned) or send reads to all partitions (if the index is document-partitioned). Such cross-partition communication is most reliable and scalable if the index is maintained ___________.
Asynchronously
Compared to transactions, event-based systems can provide better __________. In the event sourcing approach, user input to the system is represented as a single immutable event, and any resulting state updates are derived from that event. The derivation can be made deterministic and repeatable, so that running the same log of events through the same version of the derivation code will result in the same state updates.
Auditability
Typically conflated with consistency, __________ means ensuring that users observe the system in an up-to-date state. For example, if a user reads from a stale copy of the data, they may observe it in an inconsistent state. However, that inconsistency is temporary, and will eventually be resolved by waiting and trying again.
Timeliness
With systems that are small enough, constructing a totally ordered event log is entirely feasible. However, as systems are scaled toward bigger and more complex workloads, limitations begin to emerge. In most cases, constructing a totally ordered log requires all events to pass through a __________ node that decides on the ordering. If the throughput of events is greater than a single machine can handle, you need to partition it across multiple machines. The order of events in two different partitions is then ambiguous.
(Single) leader
The traditional approach to synchronizing writes requires distributed transactions across heterogeneous storage systems. When data crosses the boundary between different technologies, an __________ event log with idempotent writes is a much more robust and practical approach.
Asynchronous
Unlike with transactions, when unbundling an operation across multiple stages of a stream processor, consumers of the stream log are __________ by design, so a sender does not wait until its message has been processed by consumers. However, it is possible for a client to wait for a message to appear on an output stream.
Asynchronous
__________ transactions usually provide both timeliness (e.g., linearizability) and integrity (e.g., atomic commit) guarantees.
ACID
Composing stream operators into dataflow systems (i.e., for unbundling databases) has many characteristics in common with the standard microservice approach (with the same benefit of organizational scalability). However, the underlying communication mechanism is very different: one-directional, __________ message streams rather than synchronous request/response interactions. The dataflow approach replaces a synchronous network request to another service with a query to a local database (a derived materialized view generated by a stream of changes from the other service), which may be on the same machine, even in the same process. Not only is the dataflow approach faster, but it is also more robust to the failure of another service. The fastest and most reliable network request is no network request at all!
Asynchronous
With both hardware and software not always living up to the ideal that we would like them to be, it seems that data corruption is inevitable sooner or later. Thus, we should at least have a way of finding out if data has been corrupted so that we can fix it and try to track down the source of the error. Checking the integrity of data is known as __________. Moreover, mature systems tend to consider the possibility of unlikely things going wrong, and manage that risk. For example, large-scale storage systems such as HDFS and Amazon S3 do not fully trust disks: they run background processes that continually read back files, compare them to other replicas, and move files from one disk to another, in order to mitigate the risk of silent corruption.
Auditing
When maintaining derived data, ___________ processing allows large amounts of accumulated historical data to be reprocessed in order to derive views onto an existing dataset.
Batch
___________ processing has a quite strong functional flavor: it encourages deterministic, pure functions whose output depends only on the input and which have no side effects other than the explicit outputs, treating inputs as immutable and outputs as append-only.
Batch
The goal of unbundling databases is not to compete with individual databases on performance for particular workloads; the goal is to allow you to combine several different databases in order to achieve good performance for a much wider range of workloads than is possible with a single piece of software. It's about __________, not depth. The advantages of unbundling and composition only come into the picture when there is no single piece of software that satisfies all your requirements.
Breadth
Consistency in the sense of ACID is based on the idea that the database starts off in a consistent state, and a transaction transforms it from one consistent state to another consistent state. However, this notion only makes sense if you assume that the transaction is free from __________.
Bugs
Instead of treating reads as transient, it is possible to represent read requests as a stream of events, and send both the read events and the write events (for a particular database partition) through a stream processor; the processor responds to read events by emitting the result of the read to an output stream. Recording a log of read events potentially has benefits with regard to tracking __________ dependencies and data provenance across a system: it would allow you to reconstruct what the user saw before they made a particular decision. For example, in an online shop, it is likely that the predicted shipping date and the inventory status shown to a customer affect whether they choose to buy an item. To analyze this connection, you need to record the result of the user's query of the shipping and inventory status. Writing read events to durable storage thus enables better tracking of such dependencies, but it incurs additional storage and I/O cost. (A minimal sketch follows the answer below.)
Causal
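A minimal sketch of the idea, with in-memory lists standing in for durable log partitions (all names here are hypothetical, not an API from the book):

```python
# Reads and writes flow through the same log partition; the processor
# answers each read by emitting a result to an output stream, which also
# records exactly what the user saw (provenance).
log = []          # stands in for a durable log partition
responses = []    # output stream of read results

def submit_write(key, value):
    log.append(("write", key, value))

def submit_read(request_id, key):
    log.append(("read", request_id, key))  # the read itself is an event

def run_processor():
    state = {}
    for event in log:
        if event[0] == "write":
            _, key, value = event
            state[key] = value
        else:
            _, request_id, key = event
            responses.append((request_id, state.get(key)))

submit_write("inventory:item42", "in stock")
submit_read("req-1", "inventory:item42")
submit_write("inventory:item42", "sold out")
submit_read("req-2", "inventory:item42")
run_processor()
print(responses)  # [('req-1', 'in stock'), ('req-2', 'sold out')]
```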
For reasoning about dataflows in your application, you might arrange for data to first be written to a system of record database, then capture the changes made to that database and apply them to the search index in the same order. If __________ (CDC) is the only way of updating the index, you can be confident that the index is entirely derived from the system of record, and therefore consistent with it (barring bugs in the software). Writing to the database is the only way of supplying new input into this system. (A minimal sketch follows the answer below.)
Change data capture
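A minimal sketch of the pattern, assuming an in-memory changelog stands in for the CDC feed (hypothetical names, not a real CDC API):

```python
# The search index is derived *only* from the database's change stream,
# applied in commit order, so it stays consistent with the system of record.
changelog = []   # stands in for a CDC feed, in commit order

def db_write(db, key, value):
    db[key] = value
    changelog.append((key, value))   # change data capture

def apply_changes_to_index(index):
    for key, value in changelog:
        for word in value.split():
            index.setdefault(word, set()).add(key)

database, search_index = {}, {}
db_write(database, "doc1", "stream processing")
db_write(database, "doc2", "batch processing")
apply_changes_to_index(search_index)
print(search_index["processing"])  # {'doc1', 'doc2'} (set order may vary)
```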
In a distributed setting, enforcing a uniqueness constraint requires __________: if there are several concurrent requests with the same value, the system somehow needs to decide which one of the conflicting operations is accepted, and reject the others as violations of the constraint.
Consensus
The most common way of achieving __________ when enforcing a uniqueness constraint is to make a single node the leader, and put it in charge of making all the decisions.
Consensus
The term __________ typically conflates two different requirements that are worth considering separately: timeliness and integrity.
Consistency
Dataflow systems can provide the data management services for many applications without requiring coordination, while still giving strong integrity guarantees. Such a __________-avoiding system can achieve better performance and fault tolerance than systems that need to perform synchronous coordination.
Coordination
We want to build applications that are reliable and __________ (i.e., programs whose semantics are well defined and understood, even in the face of various faults).
Correct
Ensuring that an operation is executed atomically, while satisfying constraints, becomes more interesting when several partitions are involved. However, partitioned log-based messaging can achieve correctness equivalent to a distributed transaction across the partitions, without an atomic commit. Example: a transfer of money from account A to account B:
1. The request to transfer money from account A to account B is given a unique request ID by the client, and appended to a log partition based on the request ID.
2. A stream processor reads the log of requests. For each request message, it emits two messages to output streams: a debit instruction to the payer account A (partitioned by A), and a credit instruction to the payee account B (partitioned by B). The original request ID is included in those emitted messages.
3. Further processors consume the streams of credit and debit instructions, deduplicate by request ID, and apply the changes to the account balances.
Steps 1 and 2 are necessary because if the client directly sent the credit and debit instructions, it would require an atomic commit across those two partitions to ensure that either both or neither happen. To avoid the need for a distributed transaction, we first durably log the request as a single message, and then derive the credit and debit instructions from that first message. By breaking down the multi-partition transaction into two differently partitioned stages and using the end-to-end request ID, we have achieved the same __________ property (i.e., every request is applied exactly once to both the payer and payee accounts), even in the presence of faults, and without using an atomic commit protocol. (A minimal sketch follows the answer below.)
Correctness
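A minimal single-process sketch of the two-stage transfer described above, with lists standing in for durable log partitions (all names hypothetical):

```python
# Stage 1: one durably logged message per transfer request.
request_log = []
# Stage 2: per-account instruction logs derived from the request log.
debit_log, credit_log = [], []
balances = {"A": 100, "B": 0}
applied = set()   # (request_id, account) pairs already applied

def submit_transfer(request_id, payer, payee, amount):
    request_log.append((request_id, payer, payee, amount))

def run_stage2():
    # Deterministically derive debit and credit instructions,
    # carrying the end-to-end request ID along.
    for request_id, payer, payee, amount in request_log:
        debit_log.append((request_id, payer, amount))
        credit_log.append((request_id, payee, amount))

def apply_instructions(log, sign):
    for request_id, account, amount in log:
        if (request_id, account) in applied:   # deduplicate by request ID
            continue
        applied.add((request_id, account))
        balances[account] += sign * amount

submit_transfer("req-1", "A", "B", 30)
run_stage2()
run_stage2()                          # a retry: idempotence absorbs it
apply_instructions(debit_log, -1)
apply_instructions(credit_log, +1)
print(balances)                       # {'A': 70, 'B': 30}
```

Note that replaying stage 2 appends duplicate instructions to the logs, but the request-ID deduplication in the final stage still yields an exactly-once effect.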
If you need strong assurance of __________, then serializability and atomic commit are established approaches, but they come at a cost: they typically only work in a single datacenter (ruling out geographically distributed architectures), and they limit the scale and fault-tolerance properties you can achieve.
Correctness
The approach of unbundling databases by composing specialized storage and processing systems with application code is also becoming known as the "__________" approach (which has a lot of overlap with dataflow languages, functional reactive programming, and logic programming languages): when a record in a database changes, we want any index for the record to be automatically updated, and any cached views or aggregations that depend on the record to be automatically refreshed, while also being fault-tolerant, scalable, and durable.
Database inside-out
The trend for web applications has been to keep stateless application logic separate from state management (__________): not putting application logic in the database and not putting persistent state in the application. With these stateless services, any user request can be routed to any application server, and the server forgets everything about the request once it has sent the response. This style of development is convenient, as servers can be added or removed at will.
Databases
An interesting property of event-based dataflow systems is that they __________ timeliness and integrity. When processing event streams asynchronously, there is no guarantee of timeliness, unless you explicitly build consumers that wait for a message to be processed before returning. But integrity is in fact central to streaming systems.
Decouple
In a dataflow system (i.e., using unbundled databases), the __________ is the place where the write path and read path meet.
Derived dataset
Some client applications are stateful (e.g., "single-page" JavaScript applications and mobile applications). Using a dataflow system, state changes could flow through an end-to-end write path: from the interaction on one device that triggers a state change, via event logs and through several derived data systems and stream processors, all the way to the user interface of a person observing the state on another device. These state changes could be propagated with fairly low delay -- say, under one second end to end. In order to extend the write path all the way to the end user (i.e., treating the state on the stateful client as a __________), we need to move away from request/response interaction (which requires polling for updates) and toward publish/subscribe dataflow. The advantage of this change is more responsive user interfaces and better offline support.
Derived dataset
The outputs of batch and stream processes are ___________ such as search indexes, materialized views, recommendations to show to users, aggregate metrics, and so on.
Derived datasets
The role of caches, indexes, materialized views, and other __________ is simple: they shift the boundary between the read path and the write path. They allow us to do more work on the write path, by precomputing results, in order to save effort on the read path.
Derived datasets
Checking the integrity of data systems is best done in an __________ fashion: the more systems we can include in an integrity check, the fewer opportunities there are for corruption to go unnoticed at some stage of the process. If we can check that an entire derived data pipeline is correct end to end, then any disks, networks, services, and algorithms along the path are implicitly included in the check.
End-to-end
Having continuous __________ integrity checks gives you increased confidence about the correctness of your systems, which in turn allows you to move faster. Like automated testing, auditing increases the chances that bugs will be found quickly, and thus reduces the risk that a change to the system or a new storage technology will cause damage. If you are not afraid of making changes, you can much more readily evolve an application to meet changing requirements.
End-to-end
The scenario of suppressing duplicate transactions (i.e., duplicate suppression) is just one example of a more general principle called the __________ argument, which was articulated by Saltzer, Reed, and Clark in 1984: the function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the endpoints of the communication system (e.g., the client). Therefore, providing that function as a feature of the communication system itself is not possible. (Sometimes an incomplete version of the function provided by the communication system may be useful as a performance enhancement.)
End-to-end
Violations of timeliness are "__________ consistency".
Eventual
You can solve the data integration problem by using batch processing and event streams to let data changes flow between different systems. In this approach, certain systems are designated as systems of record, and other data is derived from them through transformations. In this way we can maintain indexes, materialized views, machine learning models, statistical summaries, and more. By making these derivations and transformations asynchronous and loosely coupled, a problem in one area is prevented from spreading to unrelated parts of the system, increasing the robustness and fault-tolerance of the system as a whole. Expressing dataflows as transformations from one dataset to another also helps __________ applications: if you want to change one of the processing steps, for example to change the structure of an index or cache, you can just rerun the new transformation code on the whole input dataset in order to rederive the output. Similarly, if something goes wrong, you can fix the code and reprocess the data in order to recover.
Evolve
Unifying batch and stream processing in one system requires the following features (a minimal sketch of replay follows the answer below):
* The ability to replay historical events through the same processing engine that handles the stream of recent events.
* ___________ semantics for stream processors -- that is, ensuring that the output is the same as if no fault had occurred, even if faults did in fact occur.
* Tools for windowing by event time, not by processing time, since processing time is meaningless when reprocessing historical events.
Exactly-once
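A minimal sketch of the replay idea: one deterministic handler, windowed by event time, serves both batch replay of historical events and processing of recent ones (hypothetical names):

```python
from collections import Counter

def process(events, counts):
    # Window by event time (the timestamp inside the event), not by
    # processing time, so replays produce the same output.
    for event_time, page in events:
        hour = event_time - (event_time % 3600)
        counts[(hour, page)] += 1

historical = [(1000, "/home"), (2000, "/home"), (4000, "/about")]
live = [(4100, "/home")]

counts = Counter()
process(historical, counts)  # batch replay of the stored log
process(live, counts)        # the same code handles recent events
print(counts)
```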
__________ or effectively-once semantics is a mechanism for preserving integrity. If an event is lost, or if an event takes effect twice, the integrity of the data system could be violated. Thus, fault-tolerant message delivery and duplicate suppression (e.g., idempotent operations) are important for maintaining the integrity of a data system in the face of faults.
Exactly-once
The important thing to keep in mind about unbundled databases is that maintaining derived data is not the same as asynchronous job execution, for which messaging systems are traditionally designed: __________ is key for derived data: losing just a single message causes the derived dataset to go permanently out of sync with its data source. Both message delivery and derived state updates must be reliable. For example, many actor systems by default maintain actor state and messages in memory, so they are lost if the machine running the actor crashes.
Fault tolerance
If we start from the premise that there is no single data model or storage format that is suitable for all access patterns, I speculate that there are two avenues by which different storage and processing tools can nevertheless be composed into a cohesive system. One is __________ databases -- i.e., unifying reads. It is possible to provide a unified query interface to a wide variety of underlying storage engines and processing methods -- an approach also known as a polystore. (For example, PostgreSQL's foreign data wrapper feature fits this pattern.) Applications that need a specialized data model or query interface can still access the underlying storage engine directly, while users who want to combine data from disparate places can do so easily through the federated interface. This query interface follows the relational tradition of a single integrated system with a high-level query language and elegant semantics, but a complicated implementation.
Federated
Derived views allow ___________ evolution. If you want to restructure a dataset, you do not need to perform the migration as a sudden switch. Instead, you can maintain the old schema and the new schema side by side as two independently derived views onto the same underlying data. You can then start shifting a small number of users to the new view in order to test its performance and find any bugs, while most users continue to be routed to the old view. Gradually, you can increase the proportion of users accessing the new view, and eventually you can drop the old view. The beauty of such a migration is that every stage of the process is easily reversible if something goes wrong: you always have a working system to go back to.
Gradual
Distributed transactions are used within some stream processors to achieve exactly-once semantics, and this can work quite well. However, when a transaction would need to involve systems written by different groups of people, the lack of a standardized transactional protocol makes integration much harder. An ordered log of events with __________ consumers is a much simpler abstraction, and thus much more feasible to implement across heterogeneous systems.
Idempotent
One of the most effective approaches for exactly-once semantics in event-based systems is to make the operation __________; that is, to ensure that it has the same effect, no matter whether it is executed once or multiple times. However, taking an operation that does not naturally have this property and making it have it requires some effort and care: you need to maintain some additional metadata (such as the set of operation IDs that have updated a value), and ensure fencing (using fencing tokens) when failing over from one node to another. In other words, you need to perform duplicate suppression. (A minimal sketch follows the answer below.)
Idempotent
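A minimal sketch of duplicate suppression via operation IDs (hypothetical names; a real system would persist the ID set and also handle fencing on failover):

```python
# An increment made idempotent by recording the IDs of operations
# that have already updated the value.
counter = {"value": 0, "applied_ids": set()}

def increment(op_id, amount):
    if op_id in counter["applied_ids"]:   # duplicate suppression
        return
    counter["applied_ids"].add(op_id)
    counter["value"] += amount

increment("op-7", 5)
increment("op-7", 5)     # retried delivery: no effect the second time
print(counter["value"])  # 5
```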
Serializability and atomic commit do not guarantee data integrity. If an application has a bug that causes it to write incorrect data, or to delete data from a database, serializable transactions aren't going to save you. Application bugs can occur, and people make mistakes. __________ and append-only data make it easier to recover from such mistakes, since you remove the ability of faulty code to destroy good data. Although these properties are useful, they are not a cure-all by themselves -- you still need duplicate suppression (i.e., idempotence) and consideration of the end-to-end argument.
Immutable
Coordination and constraints reduce the number of apologies you have to make due to __________, but potentially also reduce the performance and availability of your system, and thus potentially increase the number of apologies you have to make due to outages.
Inconsistencies
For reasoning about dataflow in your application, allowing the application to directly write to both the search index and the system of record database introduces the problem in which two clients concurrently send conflicting writes, and the two storage systems process them in a different order. In this case, neither the database nor the search index is "in charge" of determining the order of writes, and so they may make contradictory decisions and become permanently __________ with each other.
Inconsistent
Whenever a batch, stream, or ETL process transports data from one place and form to another place and form, it is acting like the database subsystem that keeps indexes or materialized views up to date. Viewed like this, batch and stream processors are like elaborate implementations of triggers, stored procedures, and materialized view maintenance routines. The derived data systems they maintain are like different ___________ types.
Index
Although strict uniqueness constraints require timeliness and synchronous coordination, many applications are actually fine with loose constraints that may be temporarily violated and fixed up later, as long as __________ is preserved throughout.
Integrity
Databases offer features that preserve __________, such as foreign key or uniqueness constraints.
Integrity
Dataflow systems can maintain __________ guarantees on derived data without atomic commit, linearizability, or synchronous cross-partition coordination.
Integrity
Dataflow systems could operate with loose coordination. Such a system could operate distributed across multiple datacenters in a multi-leader configuration, asynchronously replicating between regions. Any one datacenter can continue operating independently from the others, because no synchronous cross-region coordination is required. Such a system would have weak timeliness guarantees -- it could not be linearizable without introducing coordination -- but it can still have strong __________ guarantees. Synchronous coordination can still be introduced in places where it is needed (for example, to enforce strict constraints before an operation from which recovery is not possible), but there is no need for everything to pay the cost of coordination if only a small part of an application needs it.
Integrity
In most applications, __________ is much more important than timeliness. Violations of the latter can be annoying and confusing, but violations of the former can be catastrophic. For example, on your credit card statement, it is not surprising if a transaction that you made within the last 24 hours does not yet appear. However, it would be very bad if the statement balance was not equal to the sum of the transactions plus the previous statement balance.
Integrity
Often conflated with consistency, __________ means absence of corruption; i.e., no data loss, and no contradictory or false data. In particular, if some derived dataset is maintained as a view onto some underlying data, the derivation must be correct. Unlike timeliness, if this property is violated, the inconsistency is permanent: waiting and trying again is not going to fix database corruption in most cases. Instead, explicit checking and repair is needed.
Integrity
Reliable stream processing systems can preserve __________ (using a combination of the mechanisms listed below) without requiring distributed transactions and an atomic commit protocol, which means they can potentially achieve comparable correctness with much better performance and operational robustness.
* Representing the content of the write operation as a single message, which can easily be written atomically -- an approach that fits very well with event sourcing.
* Deriving all other state updates from that single message using deterministic derivation functions.
* Passing a client-generated request ID through all these levels of processing, enabling end-to-end duplicate suppression and idempotence.
* Making messages immutable and allowing derived data to be reprocessed from time to time, which makes it easier to recover from bugs.
Integrity
Strong __________ guarantees can be implemented scalably with asynchronous event processing, by using end-to-end operation identifiers to make operations idempotent and by checking constraints asynchronously. Clients can either wait until the check has passed, or go ahead without waiting but risk having to apologize about a constrain violation. This approach is much more scalable and robust than the traditional approach of using distributed transactions, and fits with how many business processes work in practice.
Integrity
How does the approach of keeping different data systems consistent with each other using derived data systems fare in comparison to using distributed transactions? At an abstract level, they achieve a similar goal by different means. Distributed transactions decide on an ordering of writes by using locks for mutual exclusion, while CDC and event sourcing use a log for ordering. Distributed transactions use atomic commit to ensure that changes take effect exactly once, while log-based systems are often based on deterministic retry and idempotence. The biggest difference is that transaction systems usually provide __________, which implies useful guarantees such as reading your own writes. On the other hand, derived data systems are often updated asynchronously, and so they do not by default offer the same timing guarantees.
Linearizability
A convenient property of transactions is that they are __________: that is, a writer waits until a transaction is committed, and thereafter its writes are immediately visible to all readers.
Linearizable
Capturing causal dependencies between events when total ordering isn't available (e.g., the ordering of an unfollow event from a friends service and a post event from a posts service in a microservice architecture) is tricky. A starting point: ___________ timestamps (e.g., "Sequence Number Ordering") can provide total ordering without coordination, so they may help in cases where total order broadcast is not feasible. However, they still require recipients to handle events that are delivered out of order, and they require additional metadata to be passed around.
Logical
Compared to distributed transactions, the big advantage of log-of-events-based integration (for writes and reads) between different systems using idempotent consumers is __________ between the various components, which manifests itself in two ways:
1. At a system level, asynchronous event streams make the system as a whole more robust to outages or performance degradation of individual components. If a consumer runs slow or fails, the event log can buffer messages, allowing the producer and any other consumers to continue running unaffected. By contrast, the synchronous interaction of distributed transactions tends to escalate local faults into large-scale failures.
2. At a human level, unbundling data systems allows different software components and services to be developed, improved, and maintained independently from each other by different teams. Specialization allows each team to focus on doing one thing well, with well-defined interfaces to other teams' systems. Event logs provide an interface that is powerful enough to capture fairly strong consistency properties (due to durability and ordering of events), but also general enough to be applicable to almost any kind of data.
Loose coupling
The advantage of service-oriented architecture over a single monolithic application is primarily organizational scalability through __________: different teams can work on different services, which reduces coordination effort between teams (as long as the services can be deployed and updated independently).
Loose coupling
Cryptographic auditing and integrity checking, which may become relevant for data systems in general, often relies on __________, which are trees of hashes that can be used to efficiently prove that a record appears in some dataset (and a few other things). Outside of the hype of cryptocurrencies, certificate transparency is a security technology that relies on these trees to check the validity of TLS/SSL certificates.
Merkle trees
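A minimal sketch of computing a Merkle root over a list of records, so that two replicas can compare a single hash to detect divergence (real implementations also produce compact membership proofs, which this sketch omits):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    # Hash the leaves, then pairwise-hash each level up to a single root.
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the odd node out
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

records = [b"event-1", b"event-2", b"event-3", b"event-4"]
print(merkle_root(records).hex())
# Any tampering changes the root:
tampered = [b"event-1", b"tampered", b"event-3", b"event-4"]
print(merkle_root(records) == merkle_root(tampered))  # False
```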
The important thing to keep in mind about unbundled databases is that maintaining derived data is not the same as asynchronous job execution, for which messaging systems are traditionally designed: when maintaining derived data, the order of state changes is often important (if several views are derived from an event log, they need to process the events in the same order so that they remain consistent with each other). Many __________ do not have this property when redelivering unacknowledged messages.
Message brokers
With systems that are small enough, constructing a totally ordered event log is entirely feasible. However, as systems are scaled toward bigger and more complex workloads, limitations begin to emerge. When applications are deployed as ___________, a common design choice is to deploy each service and its durable state as an independent unit, with no durable state shared between services. When two events originate in different services, there is no defined order for those events.
Microservices
With systems that are small enough, constructing a totally ordered event log is entirely feasible. However, as systems are scaled toward bigger and more complex workloads, limitations begin to emerge. Some applications maintain client-side state that is updated immediately on user input (without waiting for confirmation from a server), and even continue to work offline. With such applications, clients and servers are very likely to see events in different orders. This is an example of ___________ replication.
Multi-leader
With systems that are small enough, constructing a totally ordered event log is entirely feasible. However, as systems are scaled toward bigger and more complex workloads, limitations begin to emerge. If the servers are spread across multiple geographically distributed datacenters, for example in order to tolerate an entire datacenter going offline, you typically have a separate leader in each datacenter, because ___________ make synchronous cross-datacenter coordination inefficient. This implies an undefined ordering of events that originate in two different datacenters.
Network delays
In a distributed setting, enforcing uniqueness constraints rules out asynchronous multi-master __________, because it could happen that different masters concurrently accept conflicting writes, and thus the values are no longer unique. If you want to be able to immediately reject any writes that would violate the constraint, synchronous coordination is unavoidable.
Replication
Using log-based messaging, a stream processor consumes all the messages in a log partition sequentially on a single thread. Thus, for enforcing a uniqueness constraint, if the log is partitioned based on the value that needs to be unique, a stream processor can unambiguously and deterministically decide which one of several conflicting operations came first. For example, in the case of several users trying to claim the same username (a minimal sketch follows the answer below):
1. Every request for a username is encoded as a message, and appended to a partition determined by the hash of the username.
2. A stream processor sequentially reads the requests in the log, using a local database to keep track of which usernames are taken. For every request for a username that is available, it records the name as taken and emits a success message to an output stream. For every request for a username that is already taken, it emits a rejection message to an output stream.
3. The client that requested the username watches the output stream and waits for a success or rejection message corresponding to its request.
This algorithm scales easily to a large request throughput by increasing the number of partitions, as each partition can be processed independently. Further, this approach can work for many other kinds of constraints. Its fundamental principle is that any writes that may conflict are routed to the same __________ and processed sequentially.
Partition
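A minimal single-partition sketch of the algorithm above (hypothetical names; in a real system, requests would be routed to this partition by something like hash(username) % num_partitions):

```python
# Requests for usernames arrive in log order, so the processor can
# decide conflicts deterministically on a single thread.
username_log = []     # stands in for the log partition for these usernames
output_stream = []    # success/rejection messages, keyed by request ID

def request_username(request_id, username):
    username_log.append((request_id, username))

def run_uniqueness_processor():
    taken = set()     # local state: usernames already claimed
    for request_id, username in username_log:
        if username in taken:
            output_stream.append((request_id, "rejected"))
        else:
            taken.add(username)
            output_stream.append((request_id, "success"))

request_username("req-1", "alice")
request_username("req-2", "alice")   # concurrent conflicting request
run_uniqueness_processor()
print(output_stream)  # [('req-1', 'success'), ('req-2', 'rejected')]
```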
In a distributed setting, uniqueness checking can be scaled out by __________ based on the value that needs to be unique. For example, if you need to ensure uniqueness by request ID, you can ensure all requests with the same request ID are routed to the same partition. If you need usernames to be unique, you can partition by a hash of the username.
Partitioning
Violations of integrity are "__________ consistency".
Perpetual
Databases have inherited a passive approach to mutable data found in most programming languages: if you want to find out whether the content of the database has changed, often your only option is to __________ (i.e., to repeat your request periodically). Subscribing to changes is only just beginning to emerge as a feature.
Poll
Building for scale that you don't need is wasted effort and may lock you into an inflexible design. In effect, it is a form of __________.
Premature optimization
Being explicit about dataflow makes the __________ of data much clearer, which makes integrity checking much more feasible. For the event log (i.e., event sourcing and CDC), we can use hashes to check that the event storage has not been corrupted. For any derived state, we can rerun the batch and stream processors that derived it from the event log in order to check whether we get the same result, or even run a redundant derivation in parallel. A deterministic and well-defined dataflow also makes it easier to debug and trace the execution of a system in order to determine why it did something. If something unexpected occurred, it is valuable to have the diagnostic capability to reproduce the exact circumstances that led to the unexpected event -- a kind of time-travel debugging capability. (A minimal hash-chain sketch follows the answer below.)
Provenance
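A minimal sketch of using hashes to check that the event storage has not been corrupted, by chaining each event's hash with its predecessor's (hypothetical; production systems would more likely use Merkle trees or signed logs):

```python
import hashlib

def chain(events):
    # Each hash covers the previous hash plus the current event, so any
    # later corruption of the stored log changes every subsequent hash.
    prev = b""
    hashes = []
    for event in events:
        prev = hashlib.sha256(prev + event.encode()).digest()
        hashes.append(prev)
    return hashes

events = ["deposit:30", "withdraw:10", "deposit:5"]
expected = chain(events)          # recorded at write time

events[1] = "withdraw:1000"       # silent corruption of the log
print(chain(events) == expected)  # False: the tampering is caught
```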
For queries that touch a single partition, the effort of sending queries through a stream and collecting a stream of responses (i.e., recording __________ events) is perhaps overkill. However, this idea opens the possibility of distributed execution of complex queries that need to combine data from several partitions, taking advantage of the infrastructure for message routing, partitioning, and joining that is already provided by stream processors. For example, this approach has been used to compute the number of people who have seen a URL on Twitter -- i.e., the union of the follower sets of everyone who has tweeted that URL. As the set of Twitter users is partitioned, this computation requires combining results from many partitions.
Read
In a dataflow system (i.e., using unbundled databases), the __________ path to the derived dataset is the portion of the journey that only happens when someone asks for it (i.e., lazy evaluation).
Read
In a dataflow system (i.e., using unbundled databases), you create derived datasets (such as search indexes, materialized views, and predictive models) because you want to query them at some later time. This is the __________ path of the derived dataset: when serving a user request you read from the derived dataset, perhaps perform some more processing on the results, and construct the response to the user.
Read
The CAP theorem uses consistency in the sense of linearizability, which is a strong way of achieving __________. Weaker forms of this property exist, like read-after-write consistency, which can be useful.
Timeliness
Reprocessing existing data (using batch processing) provides a good mechanism for maintaining a system, evolving it to support new features and changed requirements. Without reprocessing, ___________ evolution is limited to simple changes like adding a new optional field to a record, or adding a new type of record. This is the case both in a schema-on-write and in a schema-on-read context. On the other hand, with reprocessing it is possible to restructure a dataset into a completely different model in order to better serve new requirements.
Schema
Capturing causal dependencies between events when total ordering isn't available (e.g., the ordering of an unfollow event from a friends service and a post event from a posts service in a microservice architecture) is tricky. A starting point: conflict resolution algorithms help with processing events that are delivered in an unexpected order. They are useful for maintaining _________, but they do not help if actions have external side effects (such as sending a notification to a user).
State
For reasoning about dataflows in your application, if it is possible for you to funnel all user input through a single system that decides on an ordering for all writes, it becomes much easier to derive other representations of the data by processing the writes in the same order. This is an application of the __________ (SMR) approach. Whether you use change data capture or an event sourcing log is less important than simply the principle of deciding on a total order.
State machine replication
__________ systems such as databases are designed to remember things forever (more or less), so if something goes wrong, the effects also potentially last forever -- which means they require more careful thought.
Stateful
When maintaining derived data, ___________ processing allows changes in the input to be reflected in derived views with low delay.
Stream
An example of the end-to-end argument is duplicate __________. In this case, we can't rely on TCP, database transactions, or stream processors to entirely rule out duplicates caused by faults (TCP only deduplicates within a single connection, not across them; a retried database transaction arrives over a new TCP connection, so the database may process it twice; and retrying distributed transactions can cause similar issues). Solving the problem requires an end-to-end solution: e.g., a transaction identifier that is passed all the way from the end-user client to the database. For example, this identifier can be included as a hidden form field in the client application, or calculated as a hash of all the relevant form fields. (A minimal sketch follows the answer below.)
Suppression
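A minimal sketch of end-to-end duplicate suppression, assuming the client generates the transaction ID once and resends it verbatim on retries (hypothetical names):

```python
import uuid

processed = {}   # transaction_id -> outcome, kept by the server

def handle_payment(transaction_id, amount, balances, account):
    if transaction_id in processed:        # end-to-end deduplication
        return processed[transaction_id]   # return the earlier outcome
    balances[account] -= amount
    processed[transaction_id] = "ok"
    return "ok"

balances = {"alice": 100}
tx = str(uuid.uuid4())                     # generated once, client side
handle_payment(tx, 30, balances, "alice")
handle_payment(tx, 30, balances, "alice")  # retried after a timeout
print(balances["alice"])                   # 70, not 40
```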
When reasoning about the behavior of a system, we can make assumptions that certain things might go wrong, but other things won't. We call these assumptions our __________. Traditionally, these models take a binary approach toward faults: we assume that some things can happen, and other things can never happen. In reality, it is more a question of probabilities: some things are more likely, other things less likely.
System model
In the emerging architecture of derived data systems, instead of implementing triggers, stored procedures, and materialized view maintenance routines as features of a single integrated database product, they are provided by various different pieces of software, running on different machines, administered by different __________.
Teams
Ensuring that an operation is executed atomically, while satisfying constraints, becomes more interesting when several partitions are involved. In the traditional approach to databases, executing such a transaction would require an atomic commit across all the partitions involved, which essentially forces it into a total order with respect to all other transactions on any of those partitions. Since there is now cross-partition coordination, different partitions can no longer be processed independently, so __________ is likely to suffer.
Throughput
Most consensus algorithms are designed for situations in which the throughput of a single node is sufficient to process the entire stream of events, and these algorithms do not provide a mechanism for multiple nodes to share the work of ordering the events (i.e., to form a ___________ ordering). It is still an open research problem to design consensus algorithms that can scale beyond the throughput of a single node and that work well in a geographically distributed setting.
Total (ordering)
In formal terms, deciding on a total order of events is known as ___________, which is equivalent to consensus.
Total order broadcast
Log-based messaging ensures that all consumers see messages in the same order -- a guarantee that is formally known as __________ and is equivalent to consensus.
Total order broadcast
Database __________ take a wide range of possible issues (concurrent writes, constraint violations, crashes, network interruptions, disk failures) and collapse them down to two possible outcomes: commit or abort.
Transactions
In principle, derived data systems could be maintained synchronously, just as a relational database updates secondary indexes synchronously within the same transaction as writes to the table being indexed. However, asynchrony is what makes systems based on event logs robust: it allows a fault in one part of the system to be contained locally, whereas distributed ___________ abort if any one participant fails, so they tend to amplify failures by spreading them to the rest of the system.
Transactions
Required for unbundling databases, stable message ordering and fault-tolerant message processing are quite stringent demands, but they are much less expensive and more operationally robust than distributed __________.
Transactions
If we start from the premise that there is no single data model or storage format that is suitable for all access patterns, I speculate that there are two avenues by which different storage and processing tools can nevertheless be composed into a cohesive system. One is __________ databases -- i.e., unifying writes. While federation addresses read-only querying across several different systems, it does not have a good answer to synchronizing writes across those systems. We said that within a single database, creating a consistent index is a built-in feature. When we compose several storage systems, we similarly need to ensure that all data changes end up in all the right places, even in the face of faults. Making it easier to reliably plug together storage systems (e.g., through change data capture and event logs) is like unbundling a datastore's index-maintenance features in a way that can synchronize writes across disparate technologies. The unbundled approach follows the Unix tradition of small tools that do one thing well, that communicate through a uniform low-level API, and that can be composed using a higher-level language.
Unbundled
Instead of treating a database as a passive variable that is manipulated by the application, we can think much more about the interplay and collaboration between state, state changes, and code that processes them. Application code responds to state changes in one place by triggering state changes in another place. __________ the database means taking this idea and applying it to the creation of derived datasets outside of the primary database: caches, full-text search indexes, machine learning, or analytics systems. We can use stream processing and messaging systems for this purpose.
Unbundling
Capturing causal dependencies between events when total ordering isn't available (e.g., the ordering of an unfollow event from a friends service and a post event from a posts service in a microservice architecture) is tricky. A starting point: if you can log an event to record the state of the system that the user saw before making a decision, and give that event a ___________, then any later events can reference that event identifier in order to record the causal dependency.
Unique identifier
If a transaction mutates several objects in a database, it is difficult to tell after the fact what that transaction means. Even if you capture the transaction logs, the insertions, updates, and deletions in various tables do not necessarily give a clear picture of __________ those mutations were performed. The invocation of the application logic that decided on those mutations is transient and cannot be reproduced.
Why
Dataflow systems (i.e., using unbundled databases) give you a process for creating derived datasets (such as search indexes, materialized views, and predictive models) and keeping them up to date. This process can be called the __________ path of the derived dataset: whenever some piece of information is written to the system, it may go through multiple stages of batch and stream processing, and eventually every derived dataset is updated to incorporate the data that was written.
Write
In a dataflow system (i.e., using unbundled databases), the __________ path of a derived dataset is the portion of the journey that is precomputed -- i.e., that is done eagerly as soon as the data comes in, regardless of whether anyone has asked to see it. (i.e., eager evaluation)
Write
__________ semantics means arranging the computation such that the final effect is the same as if no faults had occurred, even if the operation actually was retried due to some fault.
exactly-once