Hadoop Chapter 2
Hadoop creates one map task for each
split
input files (or data) are divided into fixed-size pieces called
splits
After writing a MapReduce program, it is normal to try it out on a small dataset to flush out any immediate problems with the code. T or F?
True
The input types of the reduce function must match the output types of the map function. T or F?
True
It is possible to have zero reduce tasks. T or F?
True
More splits means the time it takes to process each split is small compared to the time to process the whole input. T or F?
True
It's generally easier and more efficient to process a smaller number of relatively large files. T or F?
True
MapReduce scales to the size of your data and the size of your hardware T or F?
True
Reducer needs the complete data from all mappers. T or F?
True
If a task fails, it will be automatically rescheduled on a different node. T or F?
True
If splits are too small, the overhead of managing the splits begins to dominate the total job execution time. T or F?
True
Knowing the job ID and task IDs can be very important when debugging MapReduce jobs. T or F?
True
The quality of load balancing increases as splits become more fine-grained. T or F?
True
The output from the map function is processed by the MapReduce framework before being sent to the reduce function. T or F?
True
The number of reduce tasks is specified:
independently (it is not governed by the size of the input; see the snippet below)
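A minimal illustrative snippet, assuming a Job instance named job as created in the job runner sketched further below; the count of 4 is arbitrary:

    // The number of reduce tasks is set explicitly on the Job,
    // independently of the input size.
    job.setNumReduceTasks(4);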
Hadoop divides the input to a MapReduce job into fixed-size pieces called:
input splits or splits
map tasks write their output to the local disk, not to HDFS, because:
it is intermediate output: it is processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away
The job runner must provide the following information to the Job:
(1) the data input path, (2) the data output path, (3) the mapper and reducer classes, and (4) the data types of the output if different from those of the input (see the sketch below)
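A minimal sketch of such a job runner, assuming the hypothetical WordCountMapper and WordCountReducer classes sketched later in these notes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);

            // (1) data input path and (2) data output path
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // (3) mapper and reducer classes
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);

            // (4) output data types; map output types would only need to be
            //     set separately if they differed from these
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }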
Hadoop does its best to run the map task on a node where the input data resides, so it doesn't use up valuable cluster bandwidth
Data locality optimization
3 types of data locality:
1. Data-local 2. Rack-local 3. Off-rack
MapReduce works by breaking the processing into 2 phases:
1. Map phase 2. Reduce phase
Hadoop runs the job by dividing it into 2 types of tasks:
1. Map tasks 2. Reduce tasks
An input path can be:
1. a single file, 2. a directory, or 3. a file pattern
Job runner parameters
1. The input path method (FileInputFormat.addInputPath) can be called multiple times to load source files from multiple locations. 2. The output path must NOT exist when the job is run, or Hadoop will refuse to run it. 3. When the job finishes successfully, 2 files are created in the output folder: _SUCCESS and part-r-00000; the latter contains the output of the Reducer (see the snippet below).
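Illustrative only (the paths are hypothetical, and job is the Job instance from the job runner sketch above):

    // addInputPath can be called more than once to read from several locations.
    FileInputFormat.addInputPath(job, new Path("/data/2022"));
    FileInputFormat.addInputPath(job, new Path("/data/2023"));
    // The output directory must not already exist, or Hadoop refuses to run the job.
    FileOutputFormat.setOutputPath(job, new Path("/results/run1"));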
Problems with parallelizing the processing:
1. Dividing the work into equal-size pieces isn't always easy or obvious. 2. Combining the results from independent processes may require further processing. 3. You are still limited by the processing capacity of a single machine.
Hadoop tries to run Map tasks on the machine closest to the data source
Data locality
In a multiple-reducer case, each reducer needs data from all of the mappers (its partition of every mapper's output). T or F?
True
MapReduce programs can be implemented in various languages such as
Java, Ruby, and Python
forms the specification of the job and gives you control over how the job is run
Job Object
a programming model for data processing that is inherently parallel and really comes into its own for large datasets
MapReduce
a unit of work that the client wants to be performed; it consists of the input data, the MapReduce program, and configuration information
MapReduce Job
To create a Java MapReduce application, we need to define at least 3 programs (classes):
Mapper, reducer, and the job runner
Hadoop creates one map task for each BLANK, which runs the user-defined map function for each record in the BLANK
split
Hadoop schedules Map and Reduce tasks using
YARN
Tasks are scheduled using BLANK and run on nodes in the cluster
YARN
A good split size tends to be about the size of:
an HDFS block, which is 128 MB by default
If a task fails, it will be:
automatically rescheduled to another node
a classic tool for processing line-oriented data; it extracts 2 fields from the data
awk
reduce tasks don't have the advantage of BLANK because the input to a single reduce task is normally the output from all mappers
data locality
The optimal split size is the block size because:
it is the largest size of input that can be guaranteed to be stored on a single node
the class that creates a MapReduce Job instance and runs it with the mapper and reducer classes that the user defined
job runner
the part of MapReduce that merely extracts the data and emits it as output (the data preparation phase)
map function
The programmer specifies the BLANK function and the BLANK function in MapReduce
map, reduce
needs to specify the data types for 4 parameters (it handles the first phase of a MapReduce job): input key, input value, output key, and output value (see the sketch below)
mapper
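A minimal sketch (a hypothetical word-count mapper, not from the original notes) showing the four type parameters:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Type parameters: input key, input value, output key, output value.
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Input key = byte offset of the line, input value = the line itself.
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);  // output key = word, output value = 1
                }
            }
        }
    }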
Because the output of map is essentially the input to reduce, the input types of the reduce function must BLANK the output types of the map function.
match
All tasks run on BLANK in the cluster
nodes
When there are multiple reducers, the map tasks BLANK their output, creating one BLANK for each reduce task (see the sketch below)
partition
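A sketch of the idea: Hadoop's default HashPartitioner assigns each map output record to one of the reduce tasks by hashing its key; the hypothetical class below writes that behavior out explicitly.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Records with the same key always land in the same partition,
    // and there is one partition per reduce task.
    public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }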
output of BLANK is normally stored in HDFS for reliability
reduce
receives the key-value pairs after the framework has sorted and grouped them by key
reduce function
also specifies the data types for the input key, input value, output key, and output value (see the sketch below)
reducer
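A minimal sketch, pairing with the hypothetical WordCountMapper above; its input types (Text, IntWritable) match the mapper's output types:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Type parameters: input key, input value, output key, output value.
    public class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();  // values arrive grouped by key
            }
            context.write(key, new IntWritable(sum));
        }
    }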
Output of the Reducer is saved to HDFS and is usually:
replicated