Hadoop Midterm Practice Exam
Which daemon distributes individual tasks to machines?
JobTracker
How are keys and values presented and passed to the reducers during a standard sort and shuffle phase of MapReduce?
Keys are presented to the reducer in sorted order; values for a given key are not sorted
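The ordering guarantee above can be illustrated with a small pure-Python simulation. This is only a sketch of the behavior, not Hadoop code; the function name `shuffle` and the sample pairs are made up for illustration.

```python
from collections import defaultdict

def shuffle(map_outputs):
    """Group (key, value) pairs by key and return the groups in sorted key order.

    Values keep whatever arrival order they had; they are not sorted.
    """
    groups = defaultdict(list)
    for key, value in map_outputs:
        groups[key].append(value)
    return [(key, groups[key]) for key in sorted(groups)]

# Hypothetical output of two mappers, concatenated in arrival order.
pairs = [("pear", 3), ("apple", 9), ("pear", 1), ("apple", 2)]
for key, values in shuffle(pairs):
    print(key, values)
```

The keys come out as apple, then pear, while the values for each key stay in arrival order (9 before 2, 3 before 1).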
What is map-side join?
A map-side join is performed in the map phase, in memory
Which daemon is responsible for instantiating and monitoring individual map and reduce tasks?
TaskTracker
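A minimal pure-Python sketch of the idea, assuming the smaller dataset fits in memory: it is loaded into a dictionary once, and each record of the larger dataset is joined against it as the record is mapped. The names `load_small_table` and `map_side_join` and the sample data are hypothetical.

```python
def load_small_table(rows):
    """Build an in-memory lookup from the small dataset (it must fit in RAM)."""
    return {key: value for key, value in rows}

def map_side_join(large_records, lookup):
    """Join each large-side record against the in-memory lookup, with no reduce step."""
    for key, value in large_records:
        if key in lookup:  # inner join: records without a match are dropped
            yield key, (value, lookup[key])

departments = load_small_table([(1, "sales"), (2, "engineering")])
employees = [(1, "ann"), (2, "bob"), (3, "cy")]
joined = list(map_side_join(employees, departments))
print(joined)
```

Because the whole lookup table lives in memory on every node that runs a map task, the approach only works when one side of the join is small.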
What is the default input format?
TextInputFormat with byte offset as a key and entire line as a value
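The key/value shape described above can be sketched in pure Python. This only mimics the record layout (byte offset as key, line text as value) and is not the Hadoop implementation; `text_input_records` is a made-up name.

```python
def text_input_records(data: bytes):
    """Yield (byte_offset, line) pairs for each line in the input bytes."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The value is the line without its terminator; the key is where it starts.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)

records = list(text_input_records(b"alpha\nbeta\ngamma\n"))
print(records)
```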
What applies to the distributed cache?
The transfer happens behind the scenes before any task is executed. The distributed cache is read-only. Files in the distributed cache are automatically deleted from slave nodes when the job finishes.
Hadoop starts transferring a mapper's output as soon as that mapper finishes its task; it does not wait until the last map task has finished (T/F)?
True
The intermediate data is held on the data node's local disk (T/F)?
True
What is the size of a block in HDFS?
64 MB or 128 MB, depending on the Hadoop version (the value is configurable)
How can you disable the reduce step?
A developer can always set the number of reducers to zero, which completely disables the reduce step
What is Writable?
A Java interface that needs to be implemented by the key and value types used in MapReduce processing
Which statements are correct for Hadoop's pseudo-distributed mode?
It is a single-machine cluster. All daemons run on the same machine.
A MapReduce job processes millions of input records and generates the same number of key-value pairs (in the millions). The data is not uniformly distributed, so the job creates a significant amount of intermediate data that must be transferred between mappers and reducers, which is a potential bottleneck. A custom implementation of which interface is most likely to reduce the amount of intermediate data transferred across the network?
Combiner
What is data localization?
Hadoop will start the map task on the node where the data block is stored in HDFS
If a mapper runs slowly relative to the others, then:
No reducer can start until the last mapper has finished. If a mapper is running slowly, Hadoop will start another instance of that mapper on another machine. Hadoop will kill the slow mapper if it is still running when the new one finishes. The result of whichever mapper instance finishes first will be used.
What are the features of the Hadoop framework?
Nodes talk to each other as little as possible. Computation happens where the data is stored. Data is replicated multiple times on the system.
What are the common problems with map-side joins?
Out-of-memory exceptions on slave nodes
Suppose that your job's input is a (huge) set of word tokens and their number of occurrences (word counts), and that you want to sort them by number of occurrences. Which of the following classes will help you get a globally sorted file?
Partitioner
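Why a partitioner yields a globally sorted result can be sketched in pure Python: if the partitioner sends ordered key ranges to ordered reducers, and each reducer sees its keys sorted, then concatenating the reducer outputs in order is a global sort. Hadoop ships a TotalOrderPartitioner built on this idea; the code below is only a simulation, and the function names, split points, and sample records are assumptions chosen for illustration.

```python
from bisect import bisect_right

def range_partition(key, split_points):
    """Return the reducer index for a key; split_points must be sorted."""
    return bisect_right(split_points, key)

def total_order_sort(records, split_points):
    """Simulate a job whose partitioner sends key ranges to ordered reducers."""
    partitions = [[] for _ in range(len(split_points) + 1)]
    for count, word in records:
        partitions[range_partition(count, split_points)].append((count, word))
    # Each reducer receives its own keys in sorted order.
    outputs = [sorted(part) for part in partitions]
    # Concatenating reducer outputs in index order gives a global sort.
    return [record for output in outputs for record in output]

# Hypothetical (count, word) records, keyed by number of occurrences.
records = [(7, "the"), (1, "zebra"), (4, "map"), (9, "a"), (2, "key")]
result = total_order_sort(records, split_points=[3, 8])
print(result)
```

With the default hash partitioning, each reducer's output would be sorted on its own, but the files could not simply be concatenated into one sorted file.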
What is the function of combiner?
It runs locally on a single mapper's output. Using a combiner can reduce network traffic. Generally, the combiner and reducer code is the same.
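The effect on network traffic can be sketched with a word-count example in pure Python. This is an illustration under simple assumptions, not Hadoop code; `map_words` and `combine` are made-up names.

```python
from collections import Counter

def map_words(line):
    """Emit (word, 1) for each word in the line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Sum counts per word locally, using one mapper's output only."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

mapper_output = map_words("to be or not to be")
combined = combine(mapper_output)
print(len(mapper_output), "pairs before combining,", len(combined), "after")
```

Here six pairs shrink to four before anything is sent to the reducers. Summing is associative and commutative, which is why the same function can serve as both combiner and reducer in this case.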
Which daemon is responsible for the housekeeping of the NameNode?
Secondary NameNode
What is the role of the NameNode?
Splits big files into smaller blocks and sends them to different data nodes. Manages the HDFS file system and supplies the addresses of the data on the different data nodes.
