Data Analytics
Why learn about YARN?
Anatomy of a YARN application run; YARN schedulers; components of YARN: Resource Manager (one per cluster) and Node Manager (one per data node)
Map reduce in YARN
The client submits the application/job to the YARN Resource Manager (RM). The RM finds a Node Manager (NM) and asks it to launch a container for the Application Master (AM). The AM takes responsibility for executing and monitoring the job. AM functionality depends on the application framework (MapReduce functions differently than Spark or another framework).
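The submission steps above can be sketched as a toy model. This is a minimal sketch, not YARN's API: the class names, the first-fit placement, and the memory sizes are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy stand-in for a YARN Node Manager with a fixed memory budget."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

    def launch(self, label: str, mb: int) -> bool:
        # A container is modeled as a (label, memory) reservation on this node.
        if mb > self.free_mb:
            return False
        self.free_mb -= mb
        self.containers.append((label, mb))
        return True


class ResourceManager:
    """Toy stand-in for the Resource Manager: picks a node for each request."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, label: str, mb: int):
        # First-fit placement (an assumption); real YARN schedulers are more elaborate.
        for node in self.nodes:
            if node.launch(label, mb):
                return node.name
        return None


def submit_job(rm, tasks, am_mb=512, task_mb=1024):
    """Client -> RM -> AM container -> AM requests one container per task."""
    am_node = rm.allocate("ApplicationMaster", am_mb)
    if am_node is None:
        raise RuntimeError("no node can host the Application Master")
    # The AM negotiates containers and records where each task landed.
    placement = {task: rm.allocate(task, task_mb) for task in tasks}
    return am_node, placement


nodes = [NodeManager("nm1", 2048), NodeManager("nm2", 2048)]
am_node, placement = submit_job(ResourceManager(nodes), ["map-0", "map-1", "reduce-0"])
print(am_node, placement)
```

In this model the Application Master occupies a container of its own, which is why the first node runs out of room for tasks sooner than the second.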
Flume, Sqoop
Data Ingesting Services
Pig, Hive
Data Processing Services using Query (SQL-Like)
MapReduce
Data processing using programming
Responsibility of MR engine
Executes MR programs; takes workload from MapReduce for more efficient execution
Goal of Analytics
Gain insights and act on complex issues
HDFS
Hadoop Distributed File System
Spark
In-memory Data Processing
MapReduce: Mapping
Transforms the input data set into a collection of key-value pairs
MapReduce: Reducing
Aggregates all pairs that share the same key
MapReduce Flow
Input file -> Input Split (multiple) -> RecordReader (multiple) -> Mapper -> Shuffling and Sorting -> Reducer -> RecordWriter -> Output file
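The flow above can be imitated in a few lines of plain Python. This is a single-process word-count sketch, not Hadoop code; the function names are chosen here to mirror the stages and are assumptions.

```python
from collections import defaultdict


def record_reader(split):
    """Yield (byte offset, line) records from one input split."""
    offset = 0
    for line in split.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))


def mapper(_key, line):
    """Emit a (word, 1) pair for each word in the line."""
    for word in line.split():
        yield word.lower(), 1


def shuffle_and_sort(pairs):
    """Group values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Consolidate all values that share a key."""
    return key, sum(values)


def run_job(splits):
    # Each split is read and mapped independently of the others.
    mapped = [pair
              for split in splits
              for key, line in record_reader(split)
              for pair in mapper(key, line)]
    return [reducer(key, values) for key, values in shuffle_and_sort(mapped)]


splits = ["the quick brown fox\nthe lazy dog\n", "The dog barks\n"]
result = run_job(splits)
print(result)
```

Because no mapper call depends on another, the per-split work could run on separate machines; only the shuffle step needs to see all pairs for a given key.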
Oozie
Job Scheduling
The _____________ executes the Mapper/Reducer task as a child process in a separate JVM.
Task Tracker
YARN Architecture
The responsibilities of the Hadoop 1.0 JobTracker are now split: the Resource Manager manages resource allocation in the cluster; the Application Master manages the resource needs of individual applications; the Node Manager is a generalized TaskTracker; a container executes an application-specific process.
MapReduce 1.0
The JobTracker is the master daemon, responsible for assigning tasks and tracking their execution progress. TaskTrackers are slave daemons that run on the systems where data nodes reside; they are responsible for spawning a child JVM to execute map, reduce, and intermediate tasks.
Mahout, Spark MLlib
Machine Learning
Zookeeper
Managing Cluster
Name given to processing done in Hadoop
MapReduce
The general-purpose computing model and runtime system for distributed area
MapReduce
Hadoop 2.x process
MapReduce (Data processing) + Other Frameworks (Data processing[MPI]) -> YARN (Resource management) -> HDFS (Distributed Redundant Storage)
Hadoop 1.x process
MapReduce (Resource management, data processing) -> HDFS (Distributed Redundant Storage)
Resource Manager
Master service, usually deployed in a high-availability configuration. The Node Manager is responsible for launching and managing containers. A container is a Linux control group (cgroup), a Linux feature that allows CPU, memory, and disk I/O bandwidth to be allocated to a user process.
YARN Components: Container
Name given to a package of resources, including RAM, CPU, network, HDD, etc.
HBase
NoSQL database
Ambari
Provision, Monitor and Maintain cluster
_____________ function/node is responsible for consolidating the results produced by each of the Map() functions/tasks.
Reduce
YARN purpose
Resource manager for Hadoop Clusters, cluster manager for Hadoop 2.x, framework to provide computational resources for execution engines
Apache Drill
SQL on Hadoop
Why map reduce?
Scalability bottleneck caused by having a single JobTracker (think of one instructor in a class full of students with questions). According to Yahoo, the practical limits of this design are reached at 5,000 nodes and 40,000 concurrently running tasks. The computational resources on each slave node are divided by a cluster administrator into a fixed number of map and reduce slots. Hadoop was designed to run MapReduce jobs only.
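The cost of fixed map and reduce slots can be shown with a little arithmetic. This is a sketch under stated assumptions: the function name and the slot counts are hypothetical, and a slot is simply counted as busy whenever a task of its kind is waiting.

```python
def slot_utilization(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Fraction of a node's fixed slots that can be busy at one moment."""
    # A map slot can only run a map task, and a reduce slot only a reduce task.
    busy = min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)
    return busy / (map_slots + reduce_slots)


# Hypothetical node with 4 map slots and 2 reduce slots during a map-only phase.
map_phase = slot_utilization(4, 2, pending_maps=10, pending_reduces=0)
# The same node once only reduce work remains.
reduce_phase = slot_utilization(4, 2, pending_maps=0, pending_reduces=5)
print(map_phase, reduce_phase)
```

In both phases part of the node sits idle even though work is queued, because a free slot of one kind cannot take a task of the other kind.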
Solr & Lucene
Searching and Indexing
Benefit of MapReduce
Shared-nothing data processing platform: all mappers can work independently, and no critical region or data is shared among mappers and reducers
Hadoop 1.x purpose
System for creating and executing MapReduce applications, responsible for managing cluster resources (CPU, memory, disk I/O, and network bandwidth)
A _____________ Tracker acts as the Slave and is responsible for executing a Task assigned to it by the Job Tracker.
Task
Data Node can talk to...
Task Tracker
Who is YARN for?
Teams who are creating new computation engines
Input File Formats
TextInputFormat KeyValueTextInputFormat SequenceFileInputFormat SequenceFileAsTextInputFormat
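The first two formats differ in what they hand to the mapper as key and value. The sketch below imitates that behavior in plain Python; the function names are assumptions, and the tab separator is the default for KeyValueTextInputFormat.

```python
def text_input_format(data: str):
    """Mimic TextInputFormat: key is the line's byte offset, value is the line."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))


def key_value_text_input_format(data: str, separator: str = "\t"):
    """Mimic KeyValueTextInputFormat: split each line at the first separator."""
    for line in data.splitlines():
        key, _sep, value = line.partition(separator)
        yield key, value  # value is "" when the line has no separator


data = "alice\t30\nbob\t25\n"
offset_records = list(text_input_format(data))
kv_records = list(key_value_text_input_format(data))
print(offset_records)
print(kv_records)
```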
What is Hadoop named after?
The toy elephant of Cutting's son
YARN Components: Resource Manager
To manage the use of resources across the cluster
YARN Components: Node Manager
To oversee the containers running on the cluster nodes
YARN Components: Client
To submit Map-Reduce jobs
Descriptive Analytics
What happened?
Predictive Analytics
What is likely to happen?
Prescriptive Analytics
What should I do about it?
YARN Components: Application Master
Negotiates with the Resource Manager for resources and runs the application-specific process (map or reduce tasks) in the containers it is granted
YARN
Yet Another Resource Negotiator
Which statement is correct? a) MapReduce tries to place the data and the compute as close as possible b) Combine in MapReduce is performed using the Mapper() function c) Reduce Task in MapReduce is performed using the Map() function d) All of the mentioned
a
All of the following accurately describe Hadoop, EXCEPT: a) Open source b) C-based language c) Java-based d) Distributed computing approach
b
Which statement is incorrect? a) A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner b) The MapReduce framework operates exclusively on <key, value> pairs c) Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods d) None of the mentioned
d
compareTo()
for the Comparable interface
The number of maps is usually based on the number of _____________ splits.
input
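The link between input size and the number of map tasks can be put into rough numbers. This is only an estimate: the function name and the 128 MB split size are assumptions, and it ignores details such as how Hadoop treats a small trailing piece of a file.

```python
import math


def estimate_map_tasks(file_sizes_bytes, split_size_bytes):
    """Rough count of input splits: each file is cut into split-sized pieces."""
    return sum(math.ceil(size / split_size_bytes) for size in file_sizes_bytes)


MB = 1024 * 1024
# Two hypothetical input files of 300 MB and 50 MB with a 128 MB split size.
estimated = estimate_map_tasks([300 * MB, 50 * MB], 128 * MB)
print(estimated)
```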
readFields()
work with DataInput class to deserialize the class contents
write()
work with DataOutput class to serialize the class contents
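The three methods above usually appear together on one key type. The sketch below is a Python analogue, not Hadoop's Java interface: the class name, the snake_case method names, and the exact byte layout are assumptions chosen for illustration.

```python
import io
import struct
from functools import total_ordering


@total_ordering
class WordCountRecord:
    """Hypothetical record type shaped like a serializable, comparable key."""

    def __init__(self, word="", count=0):
        self.word = word
        self.count = count

    def write(self, out):
        """Serialize the fields, in a fixed order, to a binary stream."""
        encoded = self.word.encode("utf-8")
        out.write(struct.pack(">H", len(encoded)))  # 2-byte length prefix
        out.write(encoded)
        out.write(struct.pack(">i", self.count))    # 4-byte big-endian int

    def read_fields(self, inp):
        """Deserialize the fields in the same order write() emitted them."""
        (length,) = struct.unpack(">H", inp.read(2))
        self.word = inp.read(length).decode("utf-8")
        (self.count,) = struct.unpack(">i", inp.read(4))

    def compare_to(self, other):
        """Return -1, 0, or 1, in the spirit of Java's compareTo()."""
        a, b = (self.word, self.count), (other.word, other.count)
        return (a > b) - (a < b)

    def __eq__(self, other):
        return self.compare_to(other) == 0

    def __lt__(self, other):
        return self.compare_to(other) < 0


buffer = io.BytesIO()
WordCountRecord("dog", 2).write(buffer)
buffer.seek(0)
copy = WordCountRecord()
copy.read_fields(buffer)
print(copy.word, copy.count)
```

The read method must consume fields in exactly the order the write method produced them, since the byte stream carries no field names.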