3.1 Hadoop Architecture and HDFS


Essential Points

- HDFS is the storage layer of Hadoop
- It chunks data into blocks and distributes them across the cluster when data is stored
- Slave nodes run DataNode daemons, managed by a single NameNode on a master node
- Access HDFS using Hue, the hdfs command, or the HDFS API
- YARN manages resources in a Hadoop cluster and schedules jobs
- YARN works with HDFS to run tasks where the data is stored
- Slave nodes run NodeManager daemons, managed by a ResourceManager on a master node
- Monitor jobs using Hue, the YARN web UI, or the yarn command
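As a rough illustration of the command-line access mentioned above, the sketch below only builds the argument lists for a few common `hdfs` and `yarn` invocations and prints them. The paths and file name are hypothetical, and nothing is executed, so it runs without a Hadoop installation.

```python
import shlex

def hdfs_ls(path):
    """Argument list for listing an HDFS directory."""
    return ["hdfs", "dfs", "-ls", path]

def hdfs_put(local_path, hdfs_path):
    """Argument list for copying a local file into HDFS."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_path]

def yarn_list_apps():
    """Argument list for listing applications known to YARN."""
    return ["yarn", "application", "-list"]

# Hypothetical paths, used only for illustration.
commands = [
    hdfs_ls("/user/alice"),
    hdfs_put("sales.csv", "/user/alice/"),
    yarn_list_apps(),
]
for cmd in commands:
    print(shlex.join(cmd))
```

On a machine with the Hadoop client installed, each list could be passed to `subprocess.run(cmd, check=True)` to actually run the command.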

HDFS

HDFS is a filesystem written in Java
- Based on Google's GFS
Sits on top of a native filesystem
- Such as ext3, ext4, or xfs
Provides redundant storage for massive amounts of data
- Using readily available, industry-standard computers
HDFS performs best with a modest number of large files
- Millions rather than billions of files
- Each file typically 100 MB or more
Files in HDFS are write-once
- No random writes to files are allowed
HDFS is optimized for large, streaming reads of files
- Rather than random reads
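To see why a modest number of large files is preferable, the sketch below counts how many files and blocks must be tracked as metadata when the same amount of data is stored as small files versus large files. It assumes equal-sized files and a 128 MB block size; `namespace_objects` is a hypothetical helper, not part of any Hadoop API.

```python
def namespace_objects(total_mb, file_size_mb, block_size_mb=128):
    """Return (files, blocks) needed to hold total_mb of data
    stored as equal-sized files of file_size_mb each."""
    num_files = -(-total_mb // file_size_mb)             # ceiling division
    blocks_per_file = -(-file_size_mb // block_size_mb)  # ceiling division
    return num_files, num_files * blocks_per_file

ONE_TB_MB = 1024 * 1024

small = namespace_objects(ONE_TB_MB, file_size_mb=1)     # 1 MB files
large = namespace_objects(ONE_TB_MB, file_size_mb=1024)  # 1 GB files

print("1 MB files:", small)  # (1048576, 1048576)
print("1 GB files:", large)  # (1024, 8192)
```

Under these assumptions, the same terabyte needs roughly a million file entries as 1 MB files but only about a thousand as 1 GB files, even though the data volume is identical.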

HDFS NameNode Availability

The NameNode daemon must be running at all times
- If the NameNode stops, the cluster becomes inaccessible
HDFS is typically set up for high availability
- Two NameNodes: Active and Standby
Small clusters may use classic mode
- One NameNode
- One "helper" node called the Secondary NameNode
- The Secondary NameNode does bookkeeping; it is not a backup

Cluster Components

A cluster has three main components that work together to provide distributed data processing
- Storage --> Resource Management --> Processing
We start with the storage component: HDFS

What is YARN?

YARN = Yet Another Resource Negotiator
YARN is the Hadoop processing layer that contains
- A resource manager
- A job scheduler
YARN allows multiple data processing engines to run on a single Hadoop cluster
- Batch programs (Spark, MapReduce)
- Interactive SQL (Impala)
- Advanced analytics (Spark, Impala)
- Streaming (Spark Streaming)

Hadoop Cluster Terminology

A cluster is a group of computers working together
- Provides data storage, data processing, and resource management
A node is an individual computer in the cluster
- Master nodes manage distribution of work and data to worker nodes
A daemon is a program running on a node
- Each Hadoop daemon performs a specific function in the cluster

Running an application in YARN

Containers
- Created by the ResourceManager (RM) upon request
- Allocate a certain amount of resources (memory, CPU) on a slave node
- Applications run in one or more containers
ApplicationMaster (AM)
- One per application
- Framework/application specific
- Runs in a container
- Requests more containers to run application tasks
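The interaction described above can be sketched as a toy model: a resource manager grants containers on nodes with enough free memory and CPU, the first container hosts the ApplicationMaster, and further containers are requested for tasks. All class and function names here are hypothetical, and the first-fit placement is a deliberate simplification; it does not represent how real YARN schedulers choose nodes.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_mem_mb: int
    free_vcores: int

@dataclass
class Container:
    node: str
    mem_mb: int
    vcores: int

class ToyResourceManager:
    """Toy stand-in for the RM: tracks free resources per node."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, mem_mb, vcores):
        """Grant a container on the first node that fits, else None."""
        for node in self.nodes:
            if node.free_mem_mb >= mem_mb and node.free_vcores >= vcores:
                node.free_mem_mb -= mem_mb
                node.free_vcores -= vcores
                return Container(node.name, mem_mb, vcores)
        return None

def run_application(rm, num_tasks, am_mem_mb=1024, task_mem_mb=2048):
    """Start one AM container, then request one container per task."""
    am = rm.allocate(am_mem_mb, 1)
    if am is None:
        return None, []
    tasks = []
    for _ in range(num_tasks):
        container = rm.allocate(task_mem_mb, 1)
        if container is not None:
            tasks.append(container)
    return am, tasks

rm = ToyResourceManager([Node("node1", 4096, 4), Node("node2", 4096, 4)])
am, tasks = run_application(rm, num_tasks=3)
print("AM on:", am.node)
print("Tasks on:", [c.node for c in tasks])
```

In this run the AM and the first task land on node1, and the remaining two tasks land on node2 once node1 no longer has enough free memory.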

How Files are stored

Data files are split into 128 MB blocks, which are distributed at load time
Each block is replicated on multiple DataNodes (default 3x)
The NameNode stores the metadata
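The arithmetic behind block splitting and replication can be sketched as follows, using the 128 MB block size and 3x replication from the notes. `block_layout` is a hypothetical helper for illustration only; it assumes the last block holds only the remaining data rather than being padded to full size.

```python
import math

BLOCK_SIZE_MB = 128  # block size stated in the notes
REPLICATION = 3      # default replication factor stated in the notes

def block_layout(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, size of last block in MB,
    raw MB stored across the cluster after replication)."""
    if file_size_mb <= 0:
        return 0, 0, 0
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    last_block_mb = file_size_mb - (num_blocks - 1) * block_size_mb
    raw_mb = file_size_mb * replication
    return num_blocks, last_block_mb, raw_mb

blocks, last_mb, raw_mb = block_layout(500)
print(blocks, last_mb, raw_mb)  # 4 116 1500
```

A 500 MB file therefore becomes three full 128 MB blocks plus one 116 MB block, and the cluster stores 1500 MB in total once each block has three replicas.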

YARN Daemons

ResourceManager (RM)
- Runs on a master node
- Global resource scheduler
- Arbitrates system resources between competing applications
- Has a pluggable scheduler to support different algorithms (such as the Capacity Scheduler and Fair Scheduler)
NodeManager (NM)
- Runs on slave nodes
- Communicates with the RM

