Week 7: MapReduce

JobTracker

Runs on the same machine as the NameNode in HDFS.
1. Resource management and scheduling: assigns map and reduce tasks to TaskTrackers.
2. Monitoring: checks that data is not lost and reassigns a task if it is.

Worker Failure

The Master pings workers periodically. If a worker's connection times out, the progress of the affected partition is reset and the partition is assigned to a new worker. Other workers are notified to read the data from the new machine. Since tasks from offline workers get re-executed by available ones, large-scale worker failure causes no data loss but simply increases the time until the result of the MapReduce job becomes available.

MapReduce physical layer

The JobTracker is the analogue of the NameNode/HMaster, and the TaskTracker of the DataNode/RegionServer. They are also colocated, so short-circuiting "brings the query to the data".

MapReduce optimization - Combine

Map and reduce are linear in the number of key-value pairs, but the shuffle is quadratic (in the worst case every mapper sends data to every reducer), so you want to reduce the amount of data to be shuffled. Combine: apply the reduce function to the mapped k-v pairs before shuffling, when you flush and when you compact. Requirements:
- The intermediate type is the same as the output type.
- The reduce function is associative and commutative.
  Associative: a + (b + c) = (a + b) + c
  Commutative: a + b = b + a
Generally, the combine function is the same as the reduce function. If you're counting words in a document, map creates a ('word', 1) k-v pair for each word, and combine and reduce compute a count grouped by key.
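As a concrete illustration, here is a minimal sketch of a Hadoop driver (Java API) for that word-count job. WordCountMapper and WordCountReducer are assumed names, sketched under the "MR step 2: Map" and "MR step 6: Reduce" cards below; because summing counts is associative and commutative, the same class is registered as both combiner and reducer.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // emits ('word', 1) pairs
        job.setCombinerClass(WordCountReducer.class);  // reduce is associative + commutative
        job.setReducerClass(WordCountReducer.class);   // sums the counts per word
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}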

Default number of map and reduce tasks

Map: one task for each block on the DFS. Blocks can be made a bit smaller so as to get smaller map tasks for optimization, but not too small. Reduce: the default is 1.
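A sketch of overriding that reduce default, assuming a Job configured as in the WordCountDriver sketch above:

job.setNumReduceTasks(8);  // 8 reduce tasks/partitions instead of the default 1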

What is MapReduce from a compsci high level perspective?

MapReduce is a framework for running highly parallelisable computations on many machines built from commodity hardware, while keeping it simple to write parallel jobs.

Master Failure

Master can write checkpoints to an off-machine storage medium which a new master can use to resume work if the old one dies. Otherwise a MapReduce job can also be aborted and restarted.

MapReduce Job

One instance of a mapping, shuffling and reducing, i.e. the entire thing.

Commutative

(a+b) = (b+a)

Benefits of the combine function (assuming lots of repeated keys and values)

1. Lower memory requirements on reducers.
2. Less data transferred over the network (i.e. lower overall communication volume).
Does NOT reduce mapper computation time or mapper memory requirements (since combining happens on flush and compaction).

Mapping phase

1. Split the input data into blocks (128 MB); only one replica per block is used.
2. For each input block, the JobTracker creates a map task, generally assigned to the TaskTracker where the data is stored. There are many more map tasks than mappers, since one machine likely holds many input blocks.
3. Map tasks execute in parallel; reducers are idle during this time.
4. Intermediate values are kept sorted in memory, and flushed/compacted LSM-tree-style for efficiency. They are not stored in HDFS, since intermediate data will be deleted and does not need replication.
5. If a map task fails, it is reassigned to another mapper to recreate the map output (the data can be transferred over the network).

Stages of MapReduce

1. Splitting
2. Mapping
3. Stack intermediate key-values
4. Sort intermediate k-v pairs by key
5. Partition by key (or groups of keys)
6. Reduce

Reduce phase

After receiving all relevant intermediate k-v pairs over the network, apply the reduce function and write the final key-values to disk. Output is stored on HDFS and replicated. A different file is needed for each reducer, since HDFS only supports sequential writes, not updates.

Shuffle phase

All map tasks are finished, and all intermediate k-v pairs are stored on the mapper machines: in a tree structure in memory and as SSTables on disk. The shuffle MUST wait until map is finished.
1. Each TaskTracker runs an HTTP server.
2. Reducers connect to all of the mapper machines and request the values for their assigned keys. The mappers package the keys for each reducer into separate partitions and send them off.

Task Granularity

By making the number of map and reduce partitions (M and R) greater than the number of machines, multiple partitions can be dynamically assigned to the same machine to allow for load balancing (i.e. weaker machines get fewer partitions and faster machines get more). The upper bound on the number of partitions is set by the master's ability to perform scheduling decisions.

MapReduce in space and time

Distributed through time within a task (sequential processing), through space across tasks (parallel machines).

HashPartitioner

HashPartitioner:
- takes a single argument defining the number of partitions (so you predefine the number of reducers)
- values are assigned to partitions using a hash of the keys
- if the distribution of keys is not uniform, part of the cluster can end up idle
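A minimal sketch of this partitioning logic in Java; Hadoop's own HashPartitioner works essentially this way:

import org.apache.hadoop.mapreduce.Partitioner;

public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numPartitions) {
        // Mask the sign bit so the result is non-negative, then map the
        // key's hash into [0, numPartitions).
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}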

Function signature

In computer science, a type signature or type annotation defines the inputs and outputs for a function, subroutine or method.

Hashing

In hashing, large keys are converted into small keys using hash functions, and the k-v pairs are stored in a data structure called a hash table. The idea of hashing is to distribute entries (key/value pairs) uniformly across an array, the array being the range of possible values of the hash-table index. Hashing is implemented in two steps:
- An element is converted into an integer by using a hash function.
- The element is stored in the hash table, where it can be quickly retrieved using the hashed key.
hash = hashfunc(key)
index = hash % array_size
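A tiny plain-Java illustration of those two steps (the key "apple" and the array size 16 are arbitrary):

public class HashDemo {
    public static void main(String[] args) {
        String key = "apple";
        int arraySize = 16;
        int hash = key.hashCode();                           // step 1: element -> integer
        int index = (hash & Integer.MAX_VALUE) % arraySize;  // step 2: integer -> array slot
        System.out.println(key + " -> bucket " + index);
    }
}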

Map/Reduce task

Individual map or reduce task on a partition.

Input/output formats

Input MUST be k-v pairs, and there are tools to convert different data sources to this format. E.g. use the primary key of a table as the key and the rest of the row's values as the value. For text files, each line's position (its byte offset, in Hadoop's TextInputFormat) serves as the key and the line as the value.
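A sketch of wiring this up in the Hadoop Java API, assuming a configured Job named job (fully qualified class names keep the fragment short):

job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);            // (byte offset, line) pairs
job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.class); // binary k-v output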

Sequence file

A key-value binary format supported by Hadoop: a flat file structure consisting of serialized key-value pairs. Record layout: key length | key | value length | value.
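A minimal sketch of writing such a file with Hadoop's SequenceFile API; the path /tmp/demo.seq and the records are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/demo.seq");
        // Each append serializes one (key, value) record into the flat file.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            writer.append(new Text("apple"), new IntWritable(3));
            writer.append(new Text("pear"), new IntWritable(5));
        }
    }
}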

Storage of input/output

Lowest level: storage as a file system (HDFS), with input files taking any form (but then converted into a sequence file of k-v pairs).
Higher level: a table in HBase.

MR step 2: Map

Make a function call for every k-v pair: sequentially within a split, in parallel across splits.
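A sketch of such a map function in the Hadoop Java API: the hypothetical WordCountMapper below (referenced by the driver sketch above) is called once per (byte offset, line) input pair and emits one ('word', 1) pair per word occurrence.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);  // one intermediate pair per word occurrence
        }
    }
}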

Storing intermediate task data

Map tasks write their output to the local disk, not HDFS, because it is intermediate output. Replication (as in HDFS) is not required, since the data is discarded after the job; failed tasks are re-run automatically anyway.

MR step 5: Partition by key

Repartition the keys, but this time make sure that all instances of a key end up in the same partition. There can be multiple keys in a partition.

Load balancing in MapReduce

Requires prior knowledge of the distribution of the keys. It is not handled automatically by the mappers, and hashing the keys inside the mappers does not solve it with high probability.

TaskTracker

Runs on every DataNode, and spawns zero or more mappers and zero or more reducers depending on task allocation. For map tasks, a TaskTracker is generally responsible for mapping input data stored locally with its DataNode.

Inputs and outputs must be

Since the output of mapper and reducer must be serializable, they have to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
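A minimal sketch of a custom key type satisfying both requirements; YearMonthKey is a made-up example:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearMonthKey implements WritableComparable<YearMonthKey> {
    private int year;
    private int month;

    public YearMonthKey() {}  // no-arg constructor required for deserialization

    public YearMonthKey(int year, int month) { this.year = year; this.month = month; }

    @Override public void write(DataOutput out) throws IOException {
        out.writeInt(year);   // serialization: Writable
        out.writeInt(month);
    }

    @Override public void readFields(DataInput in) throws IOException {
        year = in.readInt();  // deserialization: Writable
        month = in.readInt();
    }

    @Override public int compareTo(YearMonthKey other) {
        int c = Integer.compare(year, other.year);  // ordering: Comparable, used for sorting
        return c != 0 ? c : Integer.compare(month, other.month);
    }
}

For use with HashPartitioner, such a key should also override hashCode() (and equals()) consistently with compareTo().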

MR step 1: Splitting

Size: ideally equal-sized chunks, matching the HDFS block size.

Backup tasks (i.e. Hedging)

Slowdown of a MapReduce job is frequently caused by a single machine taking significantly longer than the rest. As the job is only completed when all partitions have been processed, a MapReduce job close to completion will run the same remaining partitions on multiple machines and take the result of the first machine that completes. This prevents a single machine from slowing down the job.

Skipping Bad Records

Some records may cause crashes of the map/reduce function. If a crash happens, the worker sends a signal to the master that indicates the record causing the crash. If multiple signals with the same record are received, the master indicates that this record should be skipped.

Partitioner

Assigns intermediate keys to reduce partitions, i.e. decides which reducer each k-v pair is sent to:
HashPartitioner -- uses the key's Java hash
TotalOrderPartitioner -- partitions by sorted key ranges
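A sketch of selecting a partitioner on a job (HashPartitioner is already the default; a configured Job named job is assumed):

job.setPartitionerClass(org.apache.hadoop.mapreduce.lib.partition.HashPartitioner.class);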

MR step 3-4: Stack intermediate key values and sort them

Stack all sets of intermediate k-v pairs and sort them by key

Problem that MapReduce solves

TBs of data, thousands of nodes. You want to read TBs of data that are distributed across many machines -- MapReduce takes these shards and processes them in parallel.

MapReduce and Distributed Data

Takes advantage of distributed storage to query in parallel. Input data is sharded, so is output data.

Intermediate split MapReduce

The intermediate data produced by the map job is then partitioned into R splits using a hash function.

Types in MapReduce

Three levels, each allowed to be different:
1. Input type
2. Intermediate type
3. Output type
Generally, input != intermediate = output.
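These three levels show up directly in the generics of the Hadoop Java API; for the word-count sketches in this set:

Mapper<LongWritable, Text, Text, IntWritable>   // input (LongWritable, Text) -> intermediate (Text, IntWritable)
Reducer<Text, IntWritable, Text, IntWritable>   // intermediate (Text, IntWritable) -> output (Text, IntWritable)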

Shards

Two levels of sharding: 1. Physical level of HDFS 2. User level: user sees the chunks output by MapReduce as separate files (part 001, 002, etc.). Each output file is created by a single machine, in parallel.

Master in MapReduce

When a MapReduce job is started the program holding the instructions for the map and reduce functions is copied to all machines that are part of the job. One of the copies of the program is special and is called the "master". The Master keeps track of the progress of individual machines. It also propagates the location of data of completed jobs to the workers.

Locality

When assigning jobs the master takes into account the location of data and nodes to reduce overall network usage.

MR step 6: Reduce

With a group of k-v pairs sharing the same key as input, call the reduce function to generate output; in the output, keys are unique. Generally only one k-v pair is output per intermediate key (with the intermediate key as the output key), but there can be multiple outputs with different output keys. Again: sequentially within a partition, in parallel across partitions.
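A sketch of such a reduce function in the Hadoop Java API: the hypothetical WordCountReducer below (used as both combiner and reducer in the driver sketch above) receives one intermediate key together with all of its values.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();  // sum all partial counts for this word
        }
        total.set(sum);
        context.write(word, total);  // one output pair per intermediate key
    }
}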

MapReduce Fault Tolerance

Worker failure: the master (JobTracker) pings workers (TaskTrackers) regularly. If no response is received from a worker within some amount of time, the worker is marked failed and its map tasks are reset back to idle.
Master failure: the master writes periodic checkpoints of its data structures. In case of failure, just reboot from the last checkpoint.

Associativity

a + (b + c) = (a+b) + c

MapReduce can run on...

local FS, HDFS (most common), Azure, S3, etc.

