Big Data Analytics Ch2 MapReduce

Why is a good split size important?

* If the split is too small, the overhead of managing the splits and of map task creation begins to dominate the total job execution time.
* Failed processes or other jobs running concurrently make load balancing desirable, and the quality of the load balancing increases as the splits become more fine-grained.

MapReduce Phases

* Map phase, then reduce phase.
* Each phase has key-value pairs as input and output.
* The output from the map function is processed by the MapReduce framework before being sent to the reduce function. This processing sorts and groups the key-value pairs by key.
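
As a worked illustration (word count; the data here is hypothetical, not from the source), the key-value pairs flow through the phases like this:

    map input:     (0, "the cat sat"), (12, "the dog sat")
    map output:    (the, 1), (cat, 1), (sat, 1), (the, 1), (dog, 1), (sat, 1)
    sort/group:    (cat, [1]), (dog, [1]), (sat, [1, 1]), (the, [1, 1])
    reduce output: (cat, 1), (dog, 1), (sat, 2), (the, 2)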

What are the Big Ideas of MapReduce?

1. "Embarrassing" Parallelism 2. Scalability and avoiding bottlenecks 3. Fault Tolerance, failures are common 4. Bring the Computation to the Data 5. Stragglers

Problems with Splitting up Data to Run in Parallel:

1. Dividing the work into equal-size pieces isn't always easy or obvious. For example, the file size for different years varies widely.
2. Combining the results of independent processing is challenging.
3. Processing is still limited by the capacity of a single machine. When we start using multiple machines, a whole host of other factors come into play, mainly falling into the categories of coordination and reliability.

HDFS Data blocks

Files are split into 128 MB blocks and then stored in the Hadoop Distributed File System (HDFS). All blocks of a file are the same size except the last block, which can be either the same size or smaller.

What is a good split size?

For most jobs, a good split size tends to be the size of an HDFS block, which is 128 MB by default. This is a good size because it is the largest size of input that can be guaranteed to be stored on a single node.

splits, input splits

Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits.
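
The split size can also be tuned per job. A minimal sketch, assuming the new (org.apache.hadoop.mapreduce) API; the class name and the 64 MB/256 MB bounds are illustrative:

    import java.io.IOException;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDemo {
        public static void main(String[] args) throws IOException {
            Job job = Job.getInstance();
            // Bound the computed split size; by default it tracks the HDFS block size.
            FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
            FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // 256 MB
        }
    }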

Explain how tasks are split up in Hadoop

Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks. The tasks are scheduled using YARN and run on nodes in the cluster. If a task fails, it will be automatically rescheduled to run on a different node.

Hadoop provides its own set of basic Java types that are optimized for network serialization, including:

LongWritable, which corresponds to a Java Long; Text (like Java String); and IntWritable (like Java Integer)
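
These types appear in mapper and reducer signatures. A minimal sketch of a word-count mapper using them (the class name and tokenization are illustrative, not from the source):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Input key: byte offset of the line (LongWritable); input value: the line (Text).
    // Output key: a word (Text); output value: a count (IntWritable).
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);  // emit (word, 1)
                }
            }
        }
    }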

When the hadoop command is invoked with a classname as the first argument, it launches _________________________

a Java virtual machine (JVM) to run the class. The hadoop command adds the Hadoop libraries (and their dependencies) to the classpath and picks up the Hadoop configuration, too.
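
For example, assuming a driver class named MaxTemperature (the class name is hypothetical):

    hadoop MaxTemperature input/sample.txt output

This runs MaxTemperature's main() in a JVM that already has the Hadoop classpath and configuration set up.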

Combiner Functions

a function to be run on the map output; this function's output forms the input to the reduce function. A combiner tries to minimize the data transferred between map and reduce, since jobs are limited by the bandwidth available on the cluster.
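
A minimal sketch, assuming a word-count job whose reduce function is a sum; because addition is associative and commutative, the same class can serve as both combiner and reducer (the class name is illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sums the counts for each word; usable as both combiner and reducer.
    public class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

In the driver it would be wired in with job.setCombinerClass(IntSumReducer.class) alongside job.setReducerClass(IntSumReducer.class).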

Pseudo-Distributed

a method of running Hadoop whereby all Hadoop daemons run on the same machine; essentially, a cluster consisting of a single machine.

When we run a job on a Hadoop cluster, we will package the code into a JAR file (_____________________________________). Rather than explicitly specifying the name of the JAR file, ____________________________________, which Hadoop will use to locate the relevant JAR file by looking for the JAR file containing this class.

a. which Hadoop will distribute around the cluster b. we can pass a class in the Job's setJarByClass() method
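
A minimal driver sketch (class names are illustrative, reusing the hypothetical mapper and reducer sketched above):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            job.setJobName("word count");
            // Hadoop finds the JAR containing this class and ships it around the cluster.
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }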

To scale out, we need to store the data in a ____________

distributed filesystem (typically HDFS). This allows Hadoop to move the MapReduce computation to each machine hosting a part of the data, using Hadoop's resource management system, called YARN.

When there are multiple reducers, the map tasks partition their output, each creating one partition for ________________.

each reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition.
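
By default, the partition is chosen by hashing the key. A minimal sketch mirroring that default behavior (Hadoop ships this logic as HashPartitioner; the class name here is illustrative):

    import org.apache.hadoop.mapreduce.Partitioner;

    // Records with the same key always hash to the same partition,
    // and therefore end up at the same reducer.
    public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            // Mask the sign bit so the result is non-negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }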

The number of reduce tasks is not governed by the size of the input, but instead is specified __________.

independently.
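
In the driver this is a single call, for example (the count 4 is arbitrary here):

    job.setNumReduceTasks(4);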

Reduce tasks don't have the advantage of data ______________.

locality; the input to a single reduce task is normally the output from all mappers.

Network Serialization

the ability to transmit objects back and forth between two different programs over a socket, using input and output streams

Serialization

the process of translating data structures or object state into a format that can be stored or transmitted and reconstructed later
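
In Hadoop this contract is the Writable interface: two methods that write and read an object's fields in order. A minimal sketch with a hypothetical PointWritable (a type used as a key would additionally need to implement WritableComparable):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    public class PointWritable implements Writable {
        private int x;
        private int y;

        @Override
        public void write(DataOutput out) throws IOException {
            // Serialize the fields in a fixed order...
            out.writeInt(x);
            out.writeInt(y);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // ...and deserialize them in the same order.
            x = in.readInt();
            y = in.readInt();
        }
    }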

Hadoop Streaming

uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program
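
Because the contract is just standard input and standard output, even a plain Java program can serve as a Streaming mapper. A sketch (Streaming's default convention separates key and value with a tab):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    // Reads lines from stdin and writes "word<TAB>1" lines to stdout.
    public class StreamingWordCountMapper {
        public static void main(String[] args) throws IOException {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String line;
            while ((line = in.readLine()) != null) {
                for (String token : line.trim().split("\\s+")) {
                    if (!token.isEmpty()) {
                        System.out.println(token + "\t1");
                    }
                }
            }
        }
    }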

