Chapter 10: Batch Processing
A very simplified view of a __________ index such as Lucene is a file (the term dictionary) in which you can efficiently look up a particular keyword and find the list of all the document IDs containing that keyword (the postings list).
Full-text search
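A minimal sketch of that structure in Python (the documents and tokenization here are invented for illustration): the term dictionary maps each keyword to its postings list of document IDs.

```python
from collections import defaultdict

documents = {  # hypothetical document IDs and contents
    1: "batch processing with unix tools",
    2: "unix pipes and mapreduce",
    3: "mapreduce batch jobs on hadoop",
}

index = defaultdict(list)  # the term dictionary
for doc_id, text in sorted(documents.items()):
    for term in set(text.split()):
        index[term].append(doc_id)  # grow this term's postings list

print(index["mapreduce"])  # -> [2, 3]
print(index["unix"])       # -> [1, 2]
```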
The input files to Unix commands are normally treated as __________. This means you can run the commands as often as you want, trying various command-line options, without damaging the input files.
Immutable
__________ (or clickstream data) are log events describing the things that logged-in users did on a website.
Activity events
It is very common for MapReduce jobs to be chained together into __________, such that the output of one job becomes the input to the next job.
Workflows
A __________ of a job is the amount of memory to which the job needs random access (RAM).
Working set
If you need to perform a full-text search over a fixed set of documents, then a __________ process is a very effective way of building the indexes: the mappers partition the set of documents as needed, each reducer builds the index for its partition, and the index files are written to the distributed filesystem.
Batch
__________ processes are less sensitive to faults than online systems (services), because they do not immediately affect users if they fail (the client is not expecting a response), and they can always be run again.
Batch
A __________ system (offline systems) takes a large amount of input data, runs a job to process it, and produces some output data. Jobs often take a while (from a few minutes to several days), so there normally isn't a user waiting for the job to finish. Instead, batch jobs are often scheduled to run periodically (for example, once a day).
Batch processing
The simplest way of performing a map-side join with MapReduce applies in the case where a large dataset is joined with a small dataset. In particular, the small dataset needs to be small enough that it can be loaded entirely into memory in each of the mappers (e.g., load a small user database into memory to use as a hash-table and merge it with the larger dataset of activity events). This simple but effective algorithm is called a __________ join, to reflect the fact that each mapper for a partition of the large input (e.g., activity events) reads the entirety of the (same) small input (e.g., user database) and puts it in a hash table.
Broadcast hash
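A rough Python sketch of the broadcast hash join idea, assuming invented record layouts for the user database and activity events: the small input is loaded entirely into an in-memory hash table, and the large input is streamed past it.

```python
def broadcast_hash_join(small_users, large_events):
    # Load the small dataset entirely into an in-memory hash table.
    users_by_id = {user["user_id"]: user for user in small_users}
    # Stream over the large dataset and probe the hash table for each record.
    for event in large_events:
        user = users_by_id.get(event["user_id"])
        if user is not None:
            yield {**event, "date_of_birth": user["date_of_birth"]}

users = [{"user_id": 1, "date_of_birth": "1990-01-01"}]
events = [{"user_id": 1, "url": "/home"}, {"user_id": 2, "url": "/about"}]
print(list(broadcast_hash_join(users, events)))  # only user 1's event is joined
```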
Each input file of a MapReduce job is typically hundreds of megabytes in size. The MapReduce scheduler tries to run each map task in one of the machines that stores a replica of the input file, provided that machine has enough spare RAM and CPU resources to run the map task. This principle is known as "putting the __________ near the data": it saves copying the input file over the network, reducing network load and increasing locality.
Computation
Every MapReduce job is independent from every other job. This setup is reasonable if the output from the first job is a dataset that you want to publish widely within your organization. In that case, you need to be able to refer to it by name and reuse it as input to several different jobs (including jobs developed by other teams). Publishing data to a well-known location in the distributed filesystem allows loose __________ so that jobs don't need to know who is producing their input or consuming their output.
Coupling
In order to fix some problems with MapReduce (particularly issues concerning intermediate state), several new execution engines for distributed batch computations were developed, the most well known of which are Spark, Tez, and Flink. These engines have one thing in common: they handle an entire workflow as one job, rather than breaking it up into independent subjobs. Since they explicitly model the flow of data through several processing stages, these systems are known as __________ engines.
Dataflow
When using a __________, materialized datasets on HDFS are still usually the inputs and final outputs of a job. As with MapReduce, the inputs are immutable and the output is completely replaced. The improvement over MapReduce is that you save yourself writing all the intermediate state to the filesystem as well.
Dataflow engine
You can use __________ to implement the same computations as MapReduce workflows, and they usually execute significantly faster due to the optimizations described here.
Dataflow engines
Technically speaking, derived data is redundant, in the sense that it duplicates existing information. However, it is often essential for getting good performance on read queries. It is commonly __________.
Denormalized
On a high level, systems that store and process data can be grouped into two broad categories: A __________ is the result of taking some existing data from another system and transforming or processing it in some way. If you lose derived data, you can recreate it from the original source. A classic example is a cache: data can be served from the cache if present, but if the cache doesn't contain what you need, you can fall back to the underlying database.
Derived data system
Dataflow engines avoid writing intermediate state to HDFS, so they take a different approach to tolerating faults: if a machine fails and the intermediate state on that machine is lost, it is recomputed from other data that is available (a prior intermediate stage if possible, or otherwise the original input data, which is normally on HDFS). To enable this recomputation, the framework must keep track of how a given piece of data was computed -- which input partitions it used, and which operators were applied to it. When recomputing data, it is important to know whether the computation is __________. This question matters if some of the lost data has already been sent to downstream operators. The solution in the case of nondeterministic operators is normally to kill the downstream operators as well, and run them again on the new data.
Deterministic
In many datasets, it is common for one record to have an association with another record: a __________ in a document model.
Document reference
In many datasets, it is common for one record to have an association with another record: a __________ in a graph model.
Edge
In order to tolerate machine and disk failures with HDFS, file blocks are replicated on multiple machines. Replication may mean full replication or an __________ scheme such as Reed-Solomon codes, which allows lost data to be recovered with lower storage overhead than full replication. The technique is similar to RAID, which provides redundancy across several disks attached to the same machine; the difference is that in a distributed filesystem, file access and replication are done over a conventional datacenter network without special hardware.
Erasure coding
Hadoop has often been used for implementing __________ (ETL) processes: data from transaction processing systems is dumped into the distributed filesystem in some raw form, and then MapReduce jobs are written to clean up that data, transform it into a relational form, and import it into an MPP data warehouse for analytic purposes.
Extract, Transform, and Load
An advantage of fully materializing intermediate state to a distributed filesystem is that it is durable, which makes __________ fairly easy in MapReduce: if a task fails, it can just be restarted on another machine and read the same input again from the filesystem.
Fault tolerance
In Unix, the interface for most CLI tools is a __________, which is just an ordered sequence of bytes. Because this is such a simple interface, many different things can be represented using the same interface: an actual file on the filesystem, a communication channel to another process (Unix socket, stdin, stdout), a device driver (say /dev/audio or /dev/lp0), a socket representing a TCP connection, and so on.
File (descriptor)
In many datasets it is common for one record to have an association with another record: a __________ in a relational model.
Foreign key
__________ has scaled well. Such large scale has become viable because the cost of data storage and access on this distributed filesystem, using commodity hardware and open source software, is much lower than that of the equivalent capacity on a dedicated storage appliance.
HDFS
The __________ ecosystem includes both random-access OLTP databases such as HBase and MPP-style analytic databases such as Impala. Neither HBase nor Impala uses MapReduce, but both use HDFS for storage. They are very different approaches to accessing and processing data, but they can nevertheless coexist and be integrated in the same system.
Hadoop
With __________, there is no need to import the data into several different specialized systems for different kinds of processing: the system is flexible enough to support a diverse set of workloads within the same cluster. Not having to move data around makes it a lot easier to derive value from the data, and a lot easier to experiment with new processing models.
Hadoop
__________ opened up the possibility of indiscriminately dumping data into HDFS, and only later figuring out how to process it further. By contrast, MPP databases typically require careful up-front modeling of the data and query patterns before importing the data into the database's proprietary storage format.
Hadoop
While Unix tools use stdin and stdout as input and output, MapReduce jobs read and write files on a distributed filesystem. In Hadoop's implementation of MapReduce, that filesystem is called __________ (HDFS), an open source reimplementation of the Google File System (GFS).
Hadoop Distributed File System
In a MapReduce job, to ensure that all key-value pairs with the same key end up at the same reducer, the framework uses a __________ of the key to determine which reduce task should receive a particular key-value pair.
Hash
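A one-function sketch of that routing rule in Python (the reducer count is an arbitrary assumption; crc32 is used because Python's built-in hash() of strings is salted per process and would not be stable across machines):

```python
import zlib

NUM_REDUCERS = 4  # arbitrary number of reduce tasks, for illustration

def reducer_for(key: str) -> int:
    # A stable hash of the key, modulo the number of reduce tasks, guarantees
    # that all key-value pairs with the same key go to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS

print(reducer_for("http://example.com/home"))  # always the same reducer number
```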
MapReduce gave engineers the ability to easily run their own code over large datasets. If you have HDFS and MapReduce, you can build a SQL query execution engine on top of it, and indeed that is what the __________ project did. However, you can also write many other forms of batch processes that do not lend themselves to being expressed as a SQL query.
Hive
In MapReduce, when grouping records by a __________ key and aggregating them, you can perform the grouping in two stages. The first MapReduce stage sends records to a random reducer, so that each reducer performs the grouping on a subset of records for the key and outputs a more compact aggregated value per key. The second MapReduce job then combines the values from all of the first-stage reducers into a single value per key.
Hot (key)
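A single-process Python simulation of the two-stage approach, counting occurrences per key (the fan-out factor and record layout are assumptions made for illustration):

```python
import random
from collections import Counter

FANOUT = 4  # number of first-stage reducers the hot key is scattered across

def first_stage(records):
    # Stage 1: scatter each record across FANOUT "reducers" via a random
    # suffix, and pre-aggregate within each (key, suffix) partition.
    partials = Counter()
    for key, value in records:
        partials[(key, random.randrange(FANOUT))] += value
    return partials

def second_stage(partials):
    # Stage 2: strip the suffix and combine the partial aggregates
    # into a single value per original key.
    totals = Counter()
    for (key, _suffix), subtotal in partials.items():
        totals[key] += subtotal
    return totals

records = [("celebrity", 1)] * 1000 + [("ordinary_user", 1)] * 3
print(second_stage(first_stage(records)))  # both stages simulated in one process
```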
A __________ (or linchpin object) appears when there is a very large amount of data related to a single key.
Hot key
Batch jobs are easy to maintain because they minimize irreversibility: if you introduce a bug into the code and the output is wrong or corrupted, you can simply roll back to a previous version of the code and rerun the job, and the output will be correct again. This idea of being able to recover from buggy code has been called __________ fault tolerance.
Human
When using batch processing to build machine learning systems, the output of these batch jobs is often some kind of database. These databases need to be queried from the web application that handles user requests, which is usually separate from the Hadoop infrastructure. A good solution for getting the output from the batch process back into the database where the web application can query it is to build a brand-new database inside the batch job and write it as files to the job's output directory in the distributed filesystem. Those data files are then __________ once written, and can be loaded in bulk into servers that handle read-only queries.
Immutable
Dataflow engines offer several advantages compared to the MapReduce model: It is usually sufficient for __________ between operators to be kept in memory or written to local disk, which requires less I/O than writing it to HDFS (where it must be replicated to several machines and written to disk on each replica). MapReduce already uses this optimization for mapper output, but dataflow engines generalize the idea.
Intermediate state
Every MapReduce job is independent from every other job. This is reasonable if the output is used widely within your organization. However, in many cases, you know that the output of one job is only ever used as input to one other job, which is maintained by the same team. In this case, the files on the distributed filesystem are simply __________: a means of passing data from one job to the next. This is common in complex workflows (e.g., for machine learning).
Intermediate state
In many datasets it is common for one record to have an association with another record. In MapReduce, like in databases, a __________ is necessary whenever you have some code that needs to access records on both sides of an association (both the record that holds the reference and the record being referenced).
Join
The output of a reduce-side join is partitioned and sorted by the join key, whereas the output of a map-side join is partitioned and sorted in the same way as the __________ (since one map task is started for each file block of the join's large input, regardless of whether a partitioned or broadcast join is used).
Large input
A stream job operates on events shortly after they happen, whereas a batch job operates on a fixed set of input data. This difference allows stream processing systems to have lower __________ than the equivalent batch systems.
Latency
Dataflow engines offer several advantages compared to the MapReduce model: There are no unnecessary __________ tasks, since the work done by a mapper can often be incorporated into the preceding reduce operator (because a mapper does not change the partitioning of a dataset).
Map
The number of __________ tasks for a MapReduce job is determined by the number of input file blocks.
Map
When performing a join in MapReduce, if you can make certain assumptions about your input data, it is possible to make joins faster by using a so-called __________-side join. This approach uses a cut-down MapReduce job in which there are no reducers and no sorting. Instead, each mapper simply reads one input file block from the distributed filesystem and writes one output file to the filesystem.
Map
Compared to MPP's focus on parallel execution of analytic SQL queries on a cluster of machines, __________ and a distributed filesystem provide something like a general-purpose operating system that can run arbitrary programs.
MapReduce
For a __________ job, after the key-value pairs are mapped, they are sorted by key. This sorting is performed in stages, since the input dataset is likely too large to be sorted in memory on a single machine: First, after the mapper callback is applied, each map task partitions its output by reducer, based on the hash of the key. Each of these partitions is written to a sorted file (e.g., [mapper #m, reducer #r]) on the mapper's local disk.
MapReduce
The pattern for data processing in __________ is: 1. Read a set of input files, and break it up into records. 2. Call the mapper function to extract a key and value from each input record. 3. Sort all of the key-value pairs by key. 4. Call the reducer function to iterate over the sorted key-value pairs. If there are multiple occurrences of the same key, the sorting has made them adjacent in the list, so it is easy to combine those values without having to keep a lot of state in memory.
MapReduce
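Those four steps can be simulated in a few lines of Python for a word-count-style job; this is a single-process sketch, not a distributed implementation:

```python
from itertools import groupby

def mapper(record):
    # Step 2: extract a key and value from each input record.
    for word in record.split():
        yield (word, 1)

def reducer(key, values):
    # Step 4: combine all values for one key into an output record.
    yield (key, sum(values))

def mapreduce(records):
    # Step 1 is reading `records`; step 3 is the sort that makes equal keys adjacent.
    pairs = sorted(kv for record in records for kv in mapper(record))
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield from reducer(key, (value for _, value in group))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
print(dict(mapreduce(lines)))  # e.g. {'the': 3, 'fox': 2, ...}
```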
To create a __________ job, you need to implement two callback functions, the mapper and reducer.
MapReduce
__________ can tolerate the failure of a map or reduce task without it affecting the job as a whole by retrying work at the granularity of an individual task. It is also very eager to write data to disk, partly for fault tolerance, and partly on the assumption that the dataset will be too big to fit in memory anyway. The system is designed to tolerate frequent unexpected task terminations in this way because the freedom to arbitrarily terminate processes enables better resource utilization in a computing cluster (where resources are shared between high-priority and low-priority tasks).
MapReduce
__________ is a bit like Unix tools, but distributed across potentially thousands of machines. As with most Unix tools, a job takes one or more inputs and produces one or more outputs, and running a job normally does not modify the input and does not have any side effects other than producing the output.
MapReduce
__________ is a programming framework with which you can write code to process large datasets in a distributed filesystem like HDFS.
MapReduce
__________ jobs are like a sequence of commands where each command's output is written to a temporary file, and the next command reads from the temporary file.
MapReduce
__________'s approach of fully materializing intermediate state has downsides compared to Unix pipes (which stream): - A job can only start when all tasks in the preceding jobs (that generate its inputs) have completed, whereas processes connected by a Unix pipe are started at the same time, with output being consumed as soon as it is produced. Having to wait until all of the preceding job's tasks have completed slows down the execution of the workflow as a whole. - Storing intermediate state in a distributed filesystem means those files are replicated across several nodes, which is often overkill for such temporary data.
MapReduce
In a MapReduce job, the role of the __________ is to prepare the data by putting it into a form that is suitable for sorting. (This callback doesn't handle sorting directly, though.)
Mapper
To create a MapReduce job, you need to implement two callback functions: The __________ is called once for every input record, and its job is to extract the key and value from the input record. For each input, it may generate any number of key-value pairs (including none). It does not keep any state from one input record to the next, so each record is handled independently.
Mapper
__________ (MPP) databases focus on parallel execution of analytic SQL queries on a cluster of machines.
Massively parallel processing
The process of writing out intermediate state to files is called __________. It means to eagerly compute the result of some operation and write it out, rather than computing it on demand when requested.
Materialization
With dataflow engines, recovering from faults by recomputing data is not always the right answer: if the intermediate data is much smaller than the source data, or if the computation is very CPU-intensive, it is probably cheaper to __________ the intermediate data to files than to recompute it.
Materialize
Using the sort Unix tool, chunks of data can be sorted in memory and written out to disk as segment files, and then multiple sorted segments can be merged into a larger sorted file. __________ has sequential access patterns that perform well on disks.
Mergesort
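The same two-phase idea can be sketched in Python with heapq.merge, which combines already-sorted runs sequentially (the chunk size is kept artificially small for illustration):

```python
import heapq

CHUNK_SIZE = 3  # artificially small so this tiny example produces several segments

def external_sort(items):
    # Phase 1: sort fixed-size chunks in memory; each sorted chunk stands in
    # for a segment file that `sort` would have written out to disk.
    segments = [sorted(items[i:i + CHUNK_SIZE])
                for i in range(0, len(items), CHUNK_SIZE)]
    # Phase 2: merge the sorted segments; heapq.merge consumes each segment
    # front to back, which corresponds to sequential reads on disk.
    return list(heapq.merge(*segments))

print(external_sort([9, 4, 7, 1, 8, 2, 6, 3, 5]))  # [1, 2, 3, ..., 9]
```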
HDFS consists of a daemon process running on each machine, exposing a network service that allows other nodes to access files stored on that machine (assuming that every general-purpose machine in a datacenter has some disk attached to it). A central server called the __________ keeps track of which file blocks are stored on which machine. Thus, HDFS conceptually creates one big filesystem that can use the space on the disks of all machines running the daemon.
NameNode
Dataflow engines offer several advantages compared to the MapReduce model: __________ can start executing as soon as their input is ready; there is no need to wait for the entire preceding stage to finish before the next one starts.
Operators
Like MapReduce, dataflow engines work by repeatedly calling a user-defined function to process one record at a time on a single thread. They parallelize work by partitioning inputs, and they copy the output of one function over the network to become the input to another function. Unlike in MapReduce, these functions need not take the strict roles of alternating map and reduce, but instead can be assembled in more flexible ways. We call these functions __________.
Operators
Using the MapReduce programming model separates the physical network communication aspects of computation (getting the data to the right machine/partition) from the application logic (processing the data once you have it). This separation contrasts with the typical use of databases, where a request to fetch data from the database often occurs somewhere deep inside a piece of application code. Since MapReduce handles all network communication, it also shields the application from having to worry about __________, such as the crash of another node.
Partial failures
For joins in MapReduce, if the inputs to the map-side join are partitioned in the same way, then the hash join approach can be applied to each partition independently (e.g., user database and activity events are both partitioned based on the last decimal digit of the user ID). If the partitioning is done correctly, you can be sure that all the records you might want to join are located in the same numbered partition, and so it is sufficient for each mapper to only read one partition from each of the input datasets. This approach only works if both the join's inputs have the same number of partitions, with records assigned to partitions based on the same key and the same hash function. This approach is known as __________ joins.
Partitioned hash
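A rough Python sketch of a partitioned hash join, assuming both inputs are partitioned by the last decimal digit of an invented user_id field:

```python
NUM_PARTITIONS = 10  # partition by the last decimal digit of the user ID

def partition(records):
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for record in records:
        parts[record["user_id"] % NUM_PARTITIONS].append(record)
    return parts

def partitioned_hash_join(users, events):
    # Both inputs are partitioned by the same key and hash function, so each
    # mapper only needs to join partition i of one input with partition i of
    # the other, using an in-memory hash table per partition.
    for user_part, event_part in zip(partition(users), partition(events)):
        users_by_id = {u["user_id"]: u for u in user_part}
        for event in event_part:
            user = users_by_id.get(event["user_id"])
            if user is not None:
                yield {**event, "date_of_birth": user["date_of_birth"]}

users = [{"user_id": 13, "date_of_birth": "1990-01-01"}]
events = [{"user_id": 13, "url": "/home"}, {"user_id": 27, "url": "/about"}]
print(list(partitioned_hash_join(users, events)))
```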
MapReduce can parallelize a computation across many machines. This parallelization is based on __________: the input to a job is typically a directory in HDFS, and each file or file block within the input directory is considered to be a separate partition that can be processed by a separate map task.
Partitioning
Many dataflow engines are built around the idea of __________ execution, like Unix pipes: that is, incrementally passing the output of an operator to other operators, and not waiting for the input to be complete before starting to process it.
Pipe(lined)
In Unix, __________ let you attach the stdout of one process to the stdin of another process (with a small in-memory buffer, and without writing the entire intermediate data stream to disk).
Pipes
Unix __________ do not fully materialize the intermediate state, but instead stream the output to the input incrementally, using only a small in-memory buffer.
Pipes
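The same wiring can be reproduced from Python with the standard subprocess module, attaching the stdout of one process to the stdin of the next (the commands are arbitrary examples):

```python
import subprocess

# Equivalent of the shell pipeline `ls | sort -r`: the stdout of `ls`
# is attached directly to the stdin of `sort`, with no intermediate file.
ls = subprocess.Popen(["ls"], stdout=subprocess.PIPE)
sort = subprocess.Popen(["sort", "-r"], stdin=ls.stdout, stdout=subprocess.PIPE)
ls.stdout.close()  # let `ls` receive SIGPIPE if `sort` exits early
output, _ = sort.communicate()
print(output.decode())
```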
The number of __________ tasks for a MapReduce job is configured by the job author (it can differ from the number of map tasks).
Reduce
With MapReduce, a __________-side join has the advantage that you do not need to make any assumptions about the input data: whatever its properties and structure, the mappers can prepare the data to be ready for joining. However, the downside is that all the sorting, copying to reducers, and merging of reducer inputs can be quite expensive. Depending on the available memory buffers, data may be written to disk several times as it passes through the stages of MapReduce.
Reduce
With MapReduce, in a _________-side join, the mappers take the role of preparing the input data: extracting the key and value from each record, assigning the key-value pairs to a reducer partition, and sorting by key.
Reduce
In a MapReduce job, the role of the __________ is to process the data that has been sorted.
Reducer
To create a MapReduce job, you need to implement two callback functions: The MapReduce framework takes the key-value pairs produced by the mappers, collects all the values belonging to the same key, and calls the __________ with an iterator over that collection of values. This callback can produce output records (such as the number of occurrences of the same URL).
Reducer
__________ is usually the primary measure of performance of a service system, and availability is often very important (if the client can't reach the service, the user will probably get an error message).
Response time
Dataflow engines offer several advantages compared to the MapReduce model: Because all joins and data dependencies in a workflow are explicitly declared, the __________ has an overview of what data is required where, so it can make locality optimizations. For example, it can try to place the task that consumes some data on the same machine as the task that produces it, so that the data can be exchanged through a shared memory buffer rather than having to copy it over the network.
Scheduler
Hadoop's model of collecting data in its raw form and worrying about schema design later allows data collection to be sped up (a concept sometimes known as a "data lake" or "enterprise data hub"). This shifts the burden of interpreting the data: instead of forcing the producer of a dataset to bring it into a standardized format, the interpretation of the data becomes the consumer's problem (the __________ approach).
Schema-on-read
A __________ (online system) waits for a request or instruction from a client to arrive. When one is received, the service tries to handle it as quickly as possible and sends a response back.
Service
A common use of grouping (e.g. GROUP BY in SQL) is collating all the activity events for a particular user session, in order to find out the sequence of actions that the user took -- a process called __________.
Sessionization
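A rough sessionization sketch in Python: events are grouped by user and split into sessions wherever the gap between consecutive events exceeds a timeout (the 30-minute cutoff and the record layout are assumptions):

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # 30 minutes, in seconds (an arbitrary cutoff)

def sessionize(events):
    # events: iterable of (user_id, timestamp, action); returns sessions per user.
    by_user = defaultdict(list)
    for user_id, ts, action in events:
        by_user[user_id].append((ts, action))
    sessions = defaultdict(list)
    for user_id, user_events in by_user.items():
        user_events.sort()  # collate this user's events in time order
        current = []
        for ts, action in user_events:
            if current and ts - current[-1][0] > SESSION_TIMEOUT:
                sessions[user_id].append(current)  # gap too long: close the session
                current = []
            current.append((ts, action))
        sessions[user_id].append(current)
    return sessions

events = [(1, 0, "view"), (1, 60, "click"), (1, 4000, "view"), (2, 10, "view")]
print(dict(sessionize(events)))  # user 1 has two sessions, user 2 has one
```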
A __________ storage approach (e.g., as used by Network Attached Storage) is implemented by a central storage appliance, often using custom hardware and special network infrastructure such as Fibre Channel.
Shared-disk
A __________ storage approach requires no special hardware, only computers connected by a conventional datacenter network.
Shared-nothing
Hadoop Distributed File System (HDFS) is based on the __________ principle, in contrast to the shared-disk approach of Network Attached Storage (NAS) and Storage Area Network (SAN) architectures.
Shared-nothing
For a MapReduce job, whenever a mapper finishes reading its input file and writing its sorted output files, the reducers download the files of sorted key-value pairs for their partition from each mapper. This process of partitioning by reducer, sorting, and copying data partitions from mappers to reducers is known as the __________.
Shuffle
The handling of output from MapReduce jobs follows the Unix philosophy. By treating inputs as immutable and avoiding __________ (such as writing to external databases from within a job), batch jobs not only achieve good performance but also become much easier to maintain.
Side effects
The biggest limitation of Unix tools is that they run on a __________ -- and that's where tools like Hadoop come in.
Single machine
Collecting all activity related to a celebrity (e.g., replies to something they posted) in a single reducer can lead to significant __________ (also known as hot spots) -- that is, one reducer that must process more records than the others. Since a MapReduce job is only complete when all of its mappers and reducers have completed, any subsequent jobs must wait for the slowest reducer to complete before they can start.
Skew
The __________ utility in GNU Coreutils (Linux) automatically handles larger-than-memory datasets by spilling to disk, and automatically parallelizes sorting across multiple CPU cores. This means that a simple chain of Unix commands easily scales on large datasets, without running out of memory. The bottleneck is likely to be the rate at which the input file can be read from disk.
Sort
For a MapReduce job, the reduce task takes the files for their partition from the mappers and merges them together, preserving the __________.
Sort order
In a __________ join with MapReduce, the mapper output is sorted by key, and the reducers then merge together the sorted list of records from both sides of the join (e.g., given two mappers, one mapper reads the database, and the other reads activity event logs, then the reducer merges each user information with the user's activity).
Sort-merge
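A single-process Python sketch of the sort-merge join (record layouts invented): both inputs are tagged and sorted by user ID so that, within each key, the user record arrives before that user's activity events.

```python
from itertools import groupby

def sort_merge_join(user_records, activity_events):
    # Tag each record so that, after sorting by (user_id, tag), the user
    # record ("0-user") arrives before that user's activity events ("1-event").
    tagged = ([(u["user_id"], "0-user", u) for u in user_records] +
              [(e["user_id"], "1-event", e) for e in activity_events])
    tagged.sort(key=lambda t: (t[0], t[1]))  # stands in for the shuffle and sort
    for user_id, group in groupby(tagged, key=lambda t: t[0]):
        user = None
        for _, tag, record in group:  # the reducer: merge records for one key
            if tag == "0-user":
                user = record
            elif user is not None:
                yield {**record, "date_of_birth": user["date_of_birth"]}

users = [{"user_id": 1, "date_of_birth": "1990-01-01"}]
events = [{"user_id": 1, "url": "/home"}, {"user_id": 1, "url": "/about"}]
print(list(sort_merge_join(users, events)))
```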
Dataflow engines offer several advantages compared to the MapReduce model: Expensive work such as __________ need only be performed in places where it is actually required, rather than always happening by default between every map and reduce stage.
Sorting
A __________ system is somewhere between online and offline/batch processing (so it is sometimes called near-real-time or nearline processing). Like a batch processing system, this system consumes inputs and produces outputs (rather than responding to requests). However, a stream job operates on events shortly after they happen, whereas a batch job operates on a fixed set of input data.
Stream processing
Indiscriminate data dumping (such as in Hadoop) shifts the burden of interpreting the data from the producer to the consumer. This can be an advantage if the producers and consumers are different teams with different priorities. There may not be one ideal data model, but rather different views onto the data that are suitable for different purposes. Simply dumping data in its raw form allows for several such transformations. This approach has been dubbed the __________: "raw data is better".
Sushi principle
On a high level, systems that store and process data can be grouped into two broad categories: A __________, also known as source of truth, holds the authoritative version of your data. When new data comes in, it is first written here. If there is any discrepancy between another system and this system, then the value in this system is (by definition) the correct one.
System of record
In order to achieve good __________ in a batch process, the computation must be (as much as possible) local to one machine. Making random-access requests over the network for every record you want to process is too slow. Moreover, querying a remote database would mean that the batch job becomes nondeterministic, because the data in the remote database might change. Thus a better approach is to copy the data locally (e.g., take a copy of the user database extracted from a database backup using an ETL process) and to put it in the same distributed filesystem as the MapReduce job.
Throughput
The primary performance measure of a batch job (in a batch processing system) is usually __________ (the time it takes to crunch through an input dataset of a certain size).
Throughput
In __________, separating the input/output wiring from the program logic makes it easier to compose small tools into bigger systems.
Unix
In a __________ system, many data analyses can be done in a few minutes using some combination of awk, sed, grep, sort, uniq, and xargs, and they perform surprisingly well.
Unix
__________ commands pass the output of one process as input to another process directly, using only a small in-memory buffer.
Unix
The __________ -- automation, rapid prototyping, incremental iteration, being friendly to experimentation, and breaking down large projects into manageable chunks -- sounds remarkably like the Agile and DevOps movements of today.
Unix philosophy
The __________ was described in 1978 as follows: 1. Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new "features". 2. Expect the output of every program to become the input to another, as yet unknown, program. Don't clutter output with extraneous information. Avoid stringently columnar or binary input formats. Don't insist on interactive input. 3. Design and build software, even operating systems, to be tried early, ideally within weeks. Don't hesitate to throw away the clumsy parts and rebuild them. 4. Use tools in preference to unskilled help to lighten a programming task, even if you have to detour to build the tools and expect to throw some of them out after you've finished using them.
Unix philosophy
A common use for batch processing is to build machine learning systems such as classifiers (e.g., spam filters, anomaly detection, image recognition) and recommendation systems (e.g., people you may know, products you may be interested in, or related searches). The output of these batch jobs is often some kind of __________: for example, one that can be queried by user ID to obtain suggested friends for that user, or can be queried by product ID to get a list of related products.
database
When the intermediate state on a machine is lost, dataflow engines recompute it from other data that is available (a prior intermediate stage, etc.). If the computation of any lost data is not deterministic, downstream operators need to be killed as well and run again on the new data. In order to avoid such cascading __________, it is better to make operators deterministic. However, it is easy for nondeterministic behavior to accidentally creep in: iterating over the elements of a hash table, which are unordered; using random numbers; or using the system clock or external data sources (e.g., an external database). Such causes of nondeterminism need to be removed in order to reliably recover from faults, for example by generating pseudorandom numbers using a fixed seed.
faults
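One of those fixes, generating pseudorandom numbers from a fixed seed, can be sketched in Python: rerunning the operator reproduces exactly the same output, so recomputed data matches what downstream operators may already have seen.

```python
import random

def sample_operator(records, seed=42):
    # A fixed seed makes the "random" sampling deterministic: rerunning this
    # operator after a failure reproduces exactly the same output records.
    rng = random.Random(seed)
    return [r for r in records if rng.random() < 0.5]

first_run = sample_operator(range(10))
recomputed = sample_operator(range(10))
assert first_run == recomputed  # lost intermediate state can be safely recomputed
print(first_run)
```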