Hadoop Chapter 2

Hadoop creates one map task for each

split

input files (or data) are divided into fixed-size pieces called

splits

After writing a MapReduce program, it is normal to try it out on a small dataset to flush out any immediate problems with the code. T or F?

True

The input types of the reduce function must match the output types of the map function. T or F?

True

It is possible to have zero reduce tasks. T or F?

True

More splits mean the time it takes to process each split is small compared to the time to process the whole input. T or F?

True

It's generally easier and more efficient to process a smaller number of relatively large files. T or F?

True

MapReduce scales to the size of your data and the size of your hardware T or F?

True

Reducer needs the complete data from all mappers. T or F?

True

if a task fails it will be automatically rescheduled on a different node T or F?

True

If splits are too small, the overhead of managing the splits begins to dominate the total job execution time. T or F?

True

Knowing the job ID and task IDs can be very important when debugging MapReduce jobs. T or F?

True

quality of load balancing increases as splits become more fine-grained T or F?

True

The output from the map function is processed by the MapReduce framework before being sent to the reduce function. T or F?

True

number of reduce tasks is specified:

independently (it is not governed by the size of the input)

Hadoop divides the input to a MapReduce job into fixed size pieces called:

input splits or splits

map tasks write their output to the local disk not to HDFS because

it is intermediate output: it's processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away

The job runner must provide the following information to the Job:

(1) data input path, (2) data output path, (3) mapper and reducer classes, (4) data types of output if different from those of input.
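A minimal job-runner (driver) sketch in Java covering those four pieces of information; the class names (WordCountDriver, WordCountMapper, WordCountReducer, matching the Mapper and Reducer sketches later in this set), the argument-based paths, and the reducer count of 2 are illustrative assumptions, not anything fixed by Hadoop:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // (1) data input path and (2) data output path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // (3) mapper and reducer classes (hypothetical names)
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // (4) output data types, needed when they differ from the input types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // the number of reduce tasks is chosen independently of the input size
        job.setNumReduceTasks(2);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

waitForCompletion(true) submits the job and prints progress to the console; the boolean it returns indicates whether the job succeeded.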

Hadoop does its best to run a map task on the node where its input data resides, so it doesn't use up cluster bandwidth

Data locality optimization

3 types of data locality:

1. Data-local 2. Rack-local 3. off-rack

MapReduce works by breaking the processing into 2 phases:

1. Map phase 2. Reduce phase

Hadoop runs the job by dividing it into 2 types of tasks:

1. Map tasks 2. Reduce tasks

an input path can be:

1. a single file 2. a directory 3. a file pattern

job runner parameters

1. The input path method can be called multiple times to load source files from multiple locations. 2. The output path must NOT exist when the job is run, or Hadoop will refuse to run it. 3. When the job finishes successfully, 2 files are created in the output folder: _SUCCESS and part-r-00000. The latter contains the output of the Reducer.
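A hedged sketch of how those points look in driver code; the paths under /data and /results are hypothetical, and deleting an existing output directory is only appropriate when the old results are disposable:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobRunnerPaths {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "multi-input example");

        // addInputPath can be called multiple times:
        // a single file, a directory, or a glob pattern
        FileInputFormat.addInputPath(job, new Path("/data/2023/records.txt"));
        FileInputFormat.addInputPath(job, new Path("/data/2024"));
        FileInputFormat.addInputPath(job, new Path("/data/archive/part-*"));

        // the output path must not already exist, or Hadoop refuses to run
        // the job; deleting it first avoids that failure (only if the old
        // results are disposable)
        Path out = new Path("/results/example");
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(out)) {
            fs.delete(out, true); // recursive delete
        }
        FileOutputFormat.setOutputPath(job, out);
    }
}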

Problems with parallelizing the processing:

1. Dividing the work into equal-size pieces isn't always easy or obvious. 2. Combining results from independent processes may require further processing. 3. You are still limited by the processing capacity of a single machine.

Hadoop tries to run Map tasks on the machine closest to the data source

Data locality

In a multiple-Reducer case, each reducer needs the complete data from all mappers. T or F?

True

MapReduce programs can be implemented in various languages such as

Java, Ruby, and Python

forms the specification of the job and gives you control over how the job is run

Job Object

a programming model for data processing that is inherently parallel, really comes into its own for large data sets

MapReduce

a unit of work that the client wants to be performed; it consists of the input data, the MapReduce program, and configuration information

MapReduce Job

To create a Java MapReduce application, we need to define at least 3 programs (classes):

Mapper, Reducer, and the job runner

Hadoop creates one map task for each BLANK, which runs the user-defined map function for each record in the BLANK

split

Hadoop schedules Map and Reduce tasks using

YARN

Tasks are scheduled using BLANK and run on nodes in the cluster

YARN

A good split size tends to be about the size of:

an HDFS block, 128 MB by default
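If a job does need to influence the split size, the FileInputFormat class in the newer MapReduce API exposes minimum and maximum split-size settings; this is a sketch, and the 256 MB and 512 MB values are arbitrary examples rather than recommendations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split size example");

        // By default the split size ends up roughly equal to the HDFS block
        // size (128 MB), which keeps each map task's input on a single node.
        // These settings can override that, e.g. to force larger splits:
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024); // 256 MB
        FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024); // 512 MB
    }
}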

If a task fails, it will be:

automatically rescheduled to another node

a classic tool for processing line-oriented data; extracts 2 fields from the data

awk

reduce tasks don't have the advantage of BLANK because the input to a single reduce task is normally the output from all mappers

data locality

The optimal split size is the block size because:

it is the largest size of input that can be guaranteed to be stored on a single node

The class that creates a MapReduce Job instance and runs it with the mapper and reducer classes that the user defined.

job runner

part of MapReduce; merely extracts the data and emits it as output (the data preparation phase)

map function

The programmer specifies the BLANK function and the BLANK function in MapReduce

map, reduce

Needs to specify the data types for 4 parameters: input key, input value, output key, and output value; the first phase of a MapReduce program

mapper
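A minimal Mapper sketch showing those four type parameters in order (input key, input value, output key, output value); the word-count logic and the name WordCountMapper are illustrative assumptions:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Generic parameters: input key, input value, output key, output value.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // data preparation phase: extract fields from the input line and
        // emit them as (key, value) pairs
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}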

Because the output of map is essentially the input to reduce, the input types of the reduce function must BLANK the output types of the map function.

match

All tasks run on BLANK in the cluster

nodes

When there are multiple reducers, the map tasks BLANK their output, creating one BLANK for each reduce task

partition
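Hadoop's default HashPartitioner hashes the key modulo the number of reduce tasks; the sketch below is a hypothetical custom Partitioner (FirstLetterPartitioner) that routes keys by their first letter, only to show where the partitioning decision is made:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reduce task (partition) each map output record is sent to.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String k = key.toString();
        // route keys by their first character, so keys starting with the
        // same letter always land in the same partition
        int firstChar = k.isEmpty() ? 0 : Character.toLowerCase(k.charAt(0));
        return (firstChar & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It would be registered on the job with job.setPartitionerClass(FirstLetterPartitioner.class).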

output of BLANK is normally stored in HDFS for reliability

reduce

sorts and groups the key-value pairs by key

reduce function

also specifies the data types for input key, input value, output key, and output value.

reducer
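A matching Reducer sketch; its first two type parameters (Text, IntWritable) match the Mapper's output types, as required. The name WordCountReducer and the summing logic are illustrative assumptions:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Generic parameters: input key, input value, output key, output value.
// The input types must match the Mapper's output types.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // written to part-r-00000 (and siblings) in the job's output
        // directory on HDFS
        context.write(key, new IntWritable(sum));
    }
}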

Output of the Reducer is saved to HDFS and is usually:

replicated

