Hadoop exam review

What happens when a slave node fails?

- The JobTracker and NameNode detect the failure.
- All tasks on the failed node are re-scheduled.
- The NameNode replicates the user data to another node.

How does the JobTracker schedule a job?

- The client application submits the job to the JobTracker.
- The JobTracker talks to the NameNode to determine the location of the data.
- The JobTracker locates TaskTracker nodes with available slots at or near the data.
- The JobTracker submits the work to the chosen TaskTracker nodes.
- The TaskTracker nodes are monitored; if they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
- A TaskTracker notifies the JobTracker when a task fails. The JobTracker then decides what to do: it may resubmit the task elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
- When the work is complete, the JobTracker updates its status.
- The client application can poll the JobTracker for this information.

What is the distributed cache in the MapReduce framework?

The distributed cache is used when you want to share files across all nodes in a Hadoop cluster. The files could be executable JAR files or simple property files.
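
For illustration, a minimal Java sketch of adding a file to the distributed cache; the HDFS path, symlink name, and job name are assumptions, not from the original answer.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class CacheExample {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "cache-example");
            // The file already lives in HDFS; "#lookup" gives it a local symlink name
            // so tasks on every node can open it as the local file "lookup".
            job.addCacheFile(new URI("/user/hadoop/lookup.properties#lookup"));
            // (mapper, reducer, and input/output paths would be configured here)
        }
    }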

What are the scheduling policies available in YARN?

- FIFO Scheduler: puts application requests in a queue and runs them in the order of submission.
- Capacity Scheduler: keeps a separate dedicated queue for smaller jobs and starts them as soon as they are submitted.
- Fair Scheduler: gives each job a fair share of the cluster's resources.
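
The scheduler is selected with the yarn.resourcemanager.scheduler.class property. Below is a sketch of that setting expressed through the Java Configuration API for illustration only; in practice it belongs in yarn-site.xml on the ResourceManager host.

    import org.apache.hadoop.conf.Configuration;

    public class SchedulerConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Switch from the stock Capacity Scheduler to the Fair Scheduler.
            conf.set("yarn.resourcemanager.scheduler.class",
                     "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
        }
    }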

Name the job control options specified by MapReduce.

Job.submit() and Job.waitForCompletion(boolean)
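
A minimal sketch of the two submission styles (mapper, reducer, and path configuration omitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SubmitStyles {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "example");
            // (mapper, reducer, and input/output paths would be configured here)

            // Asynchronous: job.submit() returns immediately; poll job.isComplete() later.
            // Synchronous: block until the job finishes; "true" prints progress to the client.
            boolean success = job.waitForCompletion(true);
            System.exit(success ? 0 : 1);
        }
    }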

What is a JobTracker?

The JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted.

What are the main components of a MapReduce job?

- Main driver class: provides the job configuration parameters.
- Mapper class: must extend the Mapper class and implements the map() method.
- Reducer class: must extend the Reducer class and implements the reduce() method.
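
A minimal word-count style sketch of the three components; the class names and tokenizing logic are illustrative assumptions.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper class: emits (word, 1) for every word in a line.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reducer class: sums the counts for each word.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        // Driver: supplies the job configuration parameters.
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }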

What is a task instance in Hadoop? Where does it run?

Task instances are the actual MapReduce tasks that run on each slave node. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance), which ensures that a process failure does not take down the TaskTracker. There can be multiple Task Instance processes running on a slave node.

What is a TaskTracker in Hadoop?

The TaskTracker monitors these task instances, capturing the output and exit codes. When the task instances finish, the TaskTracker notifies the JobTracker.

What are the most common input formats defined in Hadoop?

TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat

How does the client communicate with HDFS?

Client communication with HDFS happens through the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file. Client applications can then talk directly to a DataNode, once the NameNode has provided the location of the data.
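
A sketch of reading a file through the HDFS Java API; the file path is an assumption. FileSystem.get() consults the NameNode for metadata, while the bytes themselves stream directly from the DataNodes.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);        // handle to HDFS
            Path file = new Path("/user/hadoop/input/data.txt");
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(file)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }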

How is HDFS different from traditional file systems?

The differences from other distributed file systems are:
- HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
- HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.
- HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets; they write data once but read it one or more times. HDFS supports write-once-read-many semantics on files.

What is a sequence file in Hadoop?

A sequence file stores binary key/value pairs. Unlike a regular compressed file, a sequence file supports splitting even when the data inside the file is compressed.
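
A sketch of writing a sequence file of Text/IntWritable pairs; the output path and the sample records are assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("/user/hadoop/output/pairs.seq");
            // Writer options declare where the file goes and the key/value types.
            SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(IntWritable.class));
            writer.append(new Text("alpha"), new IntWritable(1));
            writer.append(new Text("beta"), new IntWritable(2));
            writer.close();
        }
    }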

What is Sqoop in Hadoop?

Sqoop is a tool used to transfer data between relational database management systems and Hadoop.

What is the difference between an HDFS block and an InputSplit?

An HDFS block splits data into physical divisions, while an InputSplit in MapReduce splits the input files logically.

What is a MapReduce combiner?

Combiners are used to increase the efficiency of a MapReduce program. They aggregate the intermediate map output locally on each individual mapper, which helps reduce the amount of data that needs to be transferred to the reducers. You can use the reducer code as a combiner if the operation performed is commutative and associative.
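
As a sketch: summation is commutative and associative, so Hadoop's stock IntSumReducer can serve as both combiner and reducer (mapper and path configuration omitted).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    public class CombinerConfig {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "combiner-example");
            job.setCombinerClass(IntSumReducer.class);  // local aggregation on each mapper's output
            job.setReducerClass(IntSumReducer.class);   // global aggregation across all mappers
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // (mapper class and input/output paths would be configured here)
        }
    }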

For a Hadoop job, how will you write a custom partitioner?

Create a new class that extends the Partitioner class and override its getPartition() method. In the driver that runs the MapReduce job, add the custom partitioner by calling job.setPartitionerClass().
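
A minimal sketch of such a partitioner; the "US"-prefix routing rule is an invented example, not from the original answer.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class CountryPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            if (numReduceTasks == 0 || key.toString().startsWith("US")) {
                return 0;                 // all "US" keys go to the first reducer
            }
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;  // others by hash
        }
    }

In the driver it would then be registered with job.setPartitionerClass(CountryPartitioner.class).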

What is an InputFormat in Hadoop?

An InputFormat defines the input specifications for a job. It performs the following functions:
- validates the input of the job
- splits the input files into logical InputSplits
- provides an implementation of RecordReader to extract input records

What is TextInputFormat?

Files are broken into lines; the key is the position (byte offset) of the line in the file and the value is the line of text.

What are IdentityMapper and IdentityReducer in MapReduce?

IdentityMapper: maps inputs directly to outputs. IdentityReducer: performs no reduction, writing all input values directly to the output.

What is speculative execution?

Speculative execution is a way of coping with individual machine performance: in large clusters some machines perform worse than others, so Hadoop launches duplicate (speculative) copies of slow-running tasks on other nodes and uses the output of whichever copy finishes first.

What is Hadoop MapReduce?

MapReduce is the core processing framework of Hadoop, built around two important phases, Map and Reduce. Map converts a set of data into key/value pairs; Reduce takes the output from the map phase and combines it into a smaller set of key/value pairs.

Explain JobConf in MapReduce.

JobConf is the primary interface for defining a map-reduce job in Hadoop for job execution. JobConf specifies the mapper, combiner, partitioner, reducer, input format, and output format implementations, along with other advanced job components.

What is ResourceManager in YARN?

The ResourceManager is the YARN master process. It responds to client requests, manages resources, and creates containers; a scheduler determines when and where a container is created. The ResourceManager has two main components, the Scheduler and the ApplicationsManager:
- Scheduler: responsible for allocating resources.
- ApplicationsManager: responsible for accepting job submissions, negotiating the first container for executing the application, and providing the service for restarting that container on failure.

What is a RecordReader in MapReduce?

A RecordReader is used to read key/value pairs from an InputSplit, converting the byte-oriented view of the input into a record-oriented view that is presented to the Mapper.

What is a DataNode in Hadoop?

A DataNode stores data in the Hadoop file system (HDFS). There is only one DataNode process running on any Hadoop slave node, and it runs in its own Java Virtual Machine process. On startup, a DataNode connects to the NameNode. DataNode instances can talk to each other, mostly when replicating data.

What are the core Hadoop components?

- Data storage: HDFS
- Data processing: MapReduce / Spark

How are the HDFS blocks replicated?

HDFS stores each file as a sequence of blocks. The NameNode makes all decisions regarding replication of blocks.

How is the HDFS block size different from the traditional file system block size?

HDFS uses the local file system to store each HDFS block as a separate file. The HDFS block size (64 MB or 128 MB by default) is far larger than a typical local file system block, so the two cannot be compared directly.

What is HDFS?

HDFS, Hadoop Distributed File System, is responsible for storing large volumes of data on the cluster.

How many daemon processes run on a Hadoop v1 system?

Hadoop is comprised of five separate daemons. Each of these daemons runs in its own Java Virtual Machine.

The following 3 daemons run on the master node:
- NameNode: stores and maintains the metadata for HDFS.
- Secondary NameNode: performs housekeeping functions for the NameNode.
- JobTracker: manages MapReduce jobs, distributing individual tasks to the machines running the TaskTracker.

The following 2 daemons run on each slave node:
- DataNode: stores the actual HDFS data blocks.
- TaskTracker: responsible for instantiating and monitoring individual map and reduce tasks.

What is Heartbeat in HDFS?

A heartbeat is a signal sent between a DataNode and the NameNode, and between a TaskTracker and the JobTracker. If the NameNode or JobTracker does not receive the signal, it considers that there is some issue with the DataNode or TaskTracker.

What is the HDFS block size?

In HDFS, data is split into blocks and distributed across multiple nodes in the cluster. Each block is typically 64 MB or 128 MB in size. Each block is replicated multiple times; the default is to replicate each block three times, and the replicas are stored on different nodes.
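
These defaults are normally set in hdfs-site.xml; below is a sketch of the equivalent client-side settings through the Java Configuration API (the property names are the Hadoop 2+ keys).

    import org.apache.hadoop.conf.Configuration;

    public class BlockSettings {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.setLong("dfs.blocksize", 128L * 1024 * 1024);  // 128 MB blocks
            conf.setInt("dfs.replication", 3);                  // three replicas per block
            // Files written by a client using this Configuration pick up these values.
        }
    }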

When the NameNode is down, what happens to the JobTracker?

The NameNode is the single point of failure in HDFS, so when the NameNode is down the cluster is effectively offline: HDFS cannot be accessed and the JobTracker cannot run any jobs.

How does the NameNode handle DataNode failures?

The NameNode periodically receives a heartbeat and a block report from each of the DataNodes in the cluster. When the NameNode notices that it has not received a heartbeat message from a DataNode after a certain amount of time, the DataNode is marked as dead. Since its blocks are now under-replicated, the system begins replicating the blocks that were stored on the dead DataNode. The NameNode orchestrates the replication of data blocks from one DataNode to another; the data transfer happens directly between DataNodes and never passes through the NameNode.

Is YARN a replacement of Hadoop MapReduce?

No, YARN is not a replacement for Hadoop MapReduce; it is a more powerful and efficient technology that supports MapReduce and is also referred to as MapReduce v2.

Can a reducer communicate with another reducer?

No, the MapReduce programming model does not allow reducers to communicate with each other; reducers run in isolation.

Is it possible to change the number of mappers to be created?

No, the number of mappers is determined by the number of input splits.

What is a partitioner and what is it used for?

The partitioner controls the partitioning of the intermediate MapReduce output keys, typically using a hash function. The partitioning process determines which reducer a key/value pair is sent to. HashPartitioner is the default class available in Hadoop; it implements int getPartition(K key, V value, int numReduceTasks), which returns the partition number using numReduceTasks.

What are the key components of YARN?

- ResourceManager: the YARN master process. It mediates resources on a Hadoop cluster; it responds to client requests to create containers, and a scheduler determines when and where a container can be created.
- NodeManager: the slave process that runs on every node in a cluster. Its job is to create, monitor, and kill containers. It services requests from the ResourceManager and ApplicationMaster to create containers, and it reports on the status of the containers to the ResourceManager.
- ApplicationMaster: responsible for negotiating resource requirements with the ResourceManager and working with NodeManagers to execute and monitor tasks. The ApplicationMaster is also responsible for application-specific fault-tolerance behavior: it receives status messages from the ResourceManager when its containers fail, and it can decide to take action based on these events, either by asking the ResourceManager to create a new container or by ignoring the events.
- Container: an application-specific process that is created by a NodeManager on behalf of an ApplicationMaster, with a constrained set of resources (memory, CPU, etc.).

How many JVMs run on a slave node?

A single instance of the TaskTracker runs on each slave node as a separate JVM process. A single instance of the DataNode daemon also runs on each slave node. One or more Task Instances run on each slave node, each in its own JVM.

What is the NameNode in Hadoop?

The NameNode in Hadoop stores all the file location information (metadata) for HDFS; it does not store the data of these files itself. There is only one NameNode process running on any Hadoop cluster, and it runs in its own Java Virtual Machine process. The NameNode is a single point of failure for the HDFS cluster: when the NameNode goes down, the file system goes offline. Client applications talk to the NameNode whenever they wish to locate a file, and the NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.

What main configuration parameters are specified for a MapReduce job?

- the input location of the job in HDFS
- the output location of the job in HDFS
- the input and output formats
- the classes containing the map and reduce functions
- the JAR file containing the mapper, reducer, and driver classes

Where is the mapper's intermediate key/value output stored?

The mapper writes its intermediate data to the local file system (not HDFS) of each individual mapper node. This is typically a temporary directory that can be set up in the configuration by the Hadoop administrator.

What are shuffling and sorting in MapReduce?

Shuffling is the process of transferring the map output to the reducers; sorting orders the intermediate keys between the map and reduce phases, so that each reducer receives its data grouped and sorted by key before reducing it.

When are reducers started in a MapReduce job?

Reducers do not start reducing until all map tasks have completed.

Can the number of reducers be set to zero?

Yes, setting the number of reducers to zero is a valid configuration in Hadoop. When you set the number of reducers to zero, no reducer is executed and the output of each mapper is stored in a separate file on HDFS.
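
A sketch of configuring such a map-only job (mapper class and paths omitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class MapOnlyJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "map-only");
            job.setNumReduceTasks(0);   // skip shuffle/sort and reduce; mapper output goes to HDFS
            // (mapper class and input/output paths would be configured here)
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }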

