Data Analytics


Why learn about YARN?

Anatomy of a YARN application run; YARN schedulers; components of YARN: Resource Manager (one per cluster), Node Manager (one per data node)

Map reduce in YARN

The client submits the application/job to the YARN Resource Manager (RM). The RM finds a Node Manager (NM) and asks it to launch a container for the Application Master (AM). The AM takes responsibility for executing and monitoring the job. AM functionality depends on the application framework (MapReduce functions differently than Spark or another framework).
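The submission flow on this card can be sketched as a toy model. This is not YARN's API: the class names, the `free_containers` counter, and the first-fit placement rule are all assumptions made for illustration, and a real Application Master runs inside its own container and negotiates with the Resource Manager itself.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy stand-in for the per-node daemon that launches containers."""
    name: str
    free_containers: int  # hypothetical fixed capacity, for illustration only
    running: list = field(default_factory=list)

    def launch_container(self, process: str) -> str:
        if self.free_containers == 0:
            raise RuntimeError(f"{self.name} has no free containers")
        self.free_containers -= 1
        self.running.append(process)
        return f"{self.name}:{process}"


@dataclass
class ResourceManager:
    """Toy stand-in for the single cluster-wide resource allocator."""
    node_managers: list

    def allocate(self, process: str) -> str:
        # First-fit placement: an assumption, not YARN's real scheduling policy.
        for nm in self.node_managers:
            if nm.free_containers > 0:
                return nm.launch_container(process)
        raise RuntimeError("cluster is full")

    def submit_application(self, app_name: str, task_count: int) -> list:
        # The client submits to the RM, which starts the AM in a container.
        events = [self.allocate(f"{app_name}-AM")]
        # The AM then requests one container per task (modelled inline here).
        for i in range(task_count):
            events.append(self.allocate(f"{app_name}-task{i}"))
        return events


rm = ResourceManager([NodeManager("nm1", 2), NodeManager("nm2", 2)])
events = rm.submit_application("wordcount", 3)
print(events)
```

The Application Master lands in the first free container, and its tasks spill over to the second node once the first is full.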

Flume, Sqoop

Data Ingesting Services

Pig, Hive

Data Processing Services using Query (SQL-Like)

MapReduce

Data processing using programming

Responsibility of MR engine

Executing MapReduce (MR) programs; it takes the workload from MapReduce for more efficient execution

Goal of Analytics

Gain insights and act on complex issues

HDFS

Hadoop Distributed File System

Spark

In-memory Data Processing

MapReduce: Mapping

Transforms the input data set into a collection of key-value pairs

MapReduce: Reducing

Groups together all pairs with the same key and consolidates them into the final results

MapReduce Flow

Input file -> Input Split (multiple) -> RecordReader (multiple) -> Mapper -> Shuffling and Sorting -> Reducer -> RecordWriter -> Output file
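The flow above can be imitated in a single process to see what each stage contributes. This is a minimal sketch in plain Python, not Hadoop code: the function names mirror the stages on the card, the word-count job is an assumed example, and the "splits" are just lists of lines.

```python
from collections import defaultdict


def record_reader(split):
    """Turn one input split into (key, value) records: (line number, line)."""
    for line_no, line in enumerate(split):
        yield line_no, line


def mapper(_key, line):
    """Emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Group values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Consolidate all values that share a key."""
    return key, sum(values)


def run_job(input_splits):
    mapped = []
    for split in input_splits:  # each split would go to its own map task
        for key, line in record_reader(split):
            mapped.extend(mapper(key, line))
    return dict(reducer(k, vs) for k, vs in shuffle_and_sort(mapped))


splits = [["the quick brown fox"], ["the lazy dog", "the end"]]
counts = run_job(splits)
print(counts)
```

Because each mapper call only sees its own record, the map stage could run on separate machines; only the shuffle step needs to bring pairs with the same key together.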

Oozie

Job Scheduling

The _____________ executes the Mapper/Reducer task as a child process in a separate JVM.

Task Tracker

YARN Architecture

The responsibilities of the Hadoop 1.0 JobTracker are now split. The Resource Manager manages resource allocation in the cluster. The Application Master manages the resource needs of individual applications. The Node Manager is a generalized TaskTracker. A container executes an application-specific process.

MapReduce 1.0

The JobTracker is a master daemon responsible for assigning tasks and tracking their execution progress. TaskTrackers are slave daemons that run on the machines where data nodes reside; they are responsible for spawning a child JVM to execute map, reduce, and intermediate tasks.

Mahout, Spark MLlib

Machine Learning

Zookeeper

Managing Cluster

Name given to processing done in Hadoop

MapReduce

The general-purpose computing model and runtime system for distributed data analytics

MapReduce

Hadoop 2.x process

MapReduce (Data processing) + Other Frameworks (Data processing[MPI]) -> YARN (Resource management) -> HDFS (Distributed Redundant Storage)

Hadoop 1.x process

MapReduce (Resource management, data processing) -> HDFS (Distributed Redundant Storage)

Resource Manager

Master service, usually deployed in a high-availability configuration. The Node Manager is responsible for launching and managing containers. A container is backed by a Linux control group (cgroup), a Linux kernel feature that allows CPU, memory, and disk I/O bandwidth to be allocated to a user process.

YARN Components: Container

The name given to a package of resources including RAM, CPU, network, HDD, etc.

HBase

NoSQL database

Ambari

Provision, Monitor and Maintain cluster

_____________ function/node is responsible for consolidating the results produced by each of the Map() functions/tasks.

Reduce

YARN purpose

Resource manager for Hadoop Clusters, cluster manager for Hadoop 2.x, framework to provide computational resources for execution engines

Apache Drill

SQL on Hadoop

Why map reduce?

Scalability bottleneck caused by having a single JobTracker (think of one instructor in a class of students with questions). According to Yahoo, the practical limits of this design are reached with 5,000 nodes and 40,000 tasks running concurrently. The computational resources on each slave node are divided by a cluster administrator into a fixed number of map and reduce slots. Hadoop was designed to run MapReduce jobs only.

Solr & Lucene

Searching and Indexing

Benefit of MapReduce

Shared-nothing data processing platform: all mappers can work independently, and no critical region or data is shared among mappers and reducers

Hadoop 1.x purpose

System for creating and executing MapReduce applications, also responsible for managing cluster resources (CPU, memory, disk I/O, and network bandwidth)

A _____________ Tracker acts as the Slave and is responsible for executing a Task assigned to it by the Job Tracker.

Task

Data Node can talk to...

Task Tracker

Who is YARN for?

Teams who are creating new computation engines

Input File Formats

TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat, SequenceFileAsTextInputFormat

What is Hadoop named after?

The toy elephant that belonged to Doug Cutting's son

YARN Components: Resource Manager

To manage the use of resources across the cluster

YARN Components: Node Manager

To oversee the containers running on the cluster nodes

YARN Components: Client

To submit MapReduce jobs

Descriptive Analytics

What happened?

Predictive Analytics

What is likely to happen?

Prescriptive Analytics

What should I do about it?

YARN Components: Application Master

Negotiates with the Resource Manager for resources and runs the application-specific process (map or reduce tasks) in the containers it is granted

YARN

Yet Another Resource Negotiator

Which statement is correct? a) MapReduce tries to place the data and the compute as close as possible b) Combine in MapReduce is performed using the Mapper() function c) Reduce Task in MapReduce is performed using the Map() function d) All of the mentioned

a

All of the following accurately describe Hadoop, EXCEPT: a) Open source b) C-based language c) Java-based d) Distributed computing approach

b

Which statement is incorrect? a) A MapReduce job usually splits the input data set into independent chunks which are processed by the map tasks in a completely parallel manner b) The MapReduce framework operates exclusively on <key, value> pairs c) Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods d) None of the mentioned

d

compareTo()

for the Comparable interface

The number of maps is usually driven by the number of _____________ splits.

input
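As a rough worked example of how input size drives the number of map tasks, the estimate below assumes one map task per split and a split size equal to a 128 MiB block. Both are assumptions for illustration: `estimate_map_tasks` is a hypothetical helper, the actual split size is configurable, and real input formats may apply extra rules when sizing the last split.

```python
def estimate_map_tasks(file_size_bytes, split_size_bytes=128 * 1024 * 1024):
    """Simplified estimate: one map task per full or partial split."""
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division avoids floating-point rounding.
    return -(-file_size_bytes // split_size_bytes)


one_gib = 1024 ** 3
print(estimate_map_tasks(one_gib))            # 1 GiB of input
print(estimate_map_tasks(one_gib * 3 // 2))   # 1.5 GiB of input
```

Under these assumptions a 1 GiB file yields 8 map tasks and a 1.5 GiB file yields 12.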

readFields()

Works with the DataInput class to deserialize the class contents

write()

Works with the DataOutput class to serialize the class contents
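Hadoop's key types are written in Java, where the Writable interface declares `write(DataOutput)` and `readFields(DataInput)`, and Comparable declares `compareTo()`. The sketch below is only a Python analogue of that pattern, not Hadoop code: `YearWordKey`, its fields, and the byte layout are hypothetical, `read_fields` stands in for `readFields()`, and the ordering methods stand in for `compareTo()`.

```python
import io
import struct
from functools import total_ordering


@total_ordering
class YearWordKey:
    """Hypothetical composite key: a word plus a year."""

    def __init__(self, word="", year=0):
        self.word = word
        self.year = year

    def write(self, out):
        """Serialize the fields to a binary stream (analogue of write())."""
        data = self.word.encode("utf-8")
        out.write(struct.pack(">I", len(data)))  # length prefix
        out.write(data)
        out.write(struct.pack(">i", self.year))

    def read_fields(self, inp):
        """Deserialize the fields in the same order they were written."""
        (length,) = struct.unpack(">I", inp.read(4))
        self.word = inp.read(length).decode("utf-8")
        (self.year,) = struct.unpack(">i", inp.read(4))

    def _as_tuple(self):
        return (self.word, self.year)

    # Ordering used when keys are sorted (analogue of compareTo()).
    def __eq__(self, other):
        return self._as_tuple() == other._as_tuple()

    def __lt__(self, other):
        return self._as_tuple() < other._as_tuple()


buffer = io.BytesIO()
original = YearWordKey("hadoop", 2006)
original.write(buffer)

buffer.seek(0)
restored = YearWordKey()
restored.read_fields(buffer)
print(restored.word, restored.year)
```

The round trip only works because both methods handle the fields in the same order, which is the contract the two cards above describe.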

