3.1 Hadoop Architecture and HDFS
Essential Points
- HDFS is the storage layer of Hadoop
  - Chunks data into blocks and distributes them across the cluster when data is stored
  - Slave nodes run DataNode daemons, managed by a single NameNode on a master node
  - Access HDFS using Hue, the hdfs command, or the HDFS API
- YARN manages resources in a Hadoop cluster and schedules jobs
  - Works with HDFS to run tasks where the data is stored
  - Slave nodes run NodeManager daemons, managed by a ResourceManager on a master node
  - Monitor jobs using Hue, the YARN web UI, or the yarn command
HDFS
HDFS is a filesystem written in Java
- Based on Google's GFS
Sits on top of a native filesystem
- Such as ext3, ext4, or xfs
Provides redundant storage for massive amounts of data
- Using readily available, industry-standard computers
HDFS performs best with a modest number of large files
- Millions rather than billions of files
- Each file typically 100 MB or more
Files in HDFS are "write once"
- No random writes to files are allowed
HDFS is optimized for large, streaming reads of files
- Rather than random reads
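One way to see why "millions rather than billions" matters: every file and every block is a metadata object the NameNode holds in memory. The sketch below is back-of-the-envelope arithmetic only; the ~150 bytes per namespace object is a commonly cited rule of thumb, not an exact figure.

```python
BLOCK_SIZE_MB = 128
BYTES_PER_OBJECT = 150  # rough rule-of-thumb for NameNode heap per file/block object


def namenode_heap_mb(num_files, file_size_mb):
    """Estimate NameNode heap consumed by namespace metadata."""
    blocks_per_file = -(-file_size_mb // BLOCK_SIZE_MB)  # ceiling division
    objects = num_files * (1 + blocks_per_file)          # 1 file object + its blocks
    return objects * BYTES_PER_OBJECT / 1024 / 1024


# The same 1 PB of data, stored two ways:
big = namenode_heap_mb(num_files=1_000_000, file_size_mb=1024)      # 1M x 1 GB files
small = namenode_heap_mb(num_files=1_000_000_000, file_size_mb=1)   # 1B x 1 MB files

print(f"1M large files: ~{big:,.0f} MB of NameNode heap")
print(f"1B small files: ~{small:,.0f} MB of NameNode heap")
```

The large-file layout needs on the order of a gigabyte of NameNode heap; the small-file layout needs hundreds of gigabytes for the same data, which is why HDFS favors fewer, bigger files.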
HDFS NameNode Availability
The NameNode daemon must be running at all times
- If the NameNode stops, the cluster becomes inaccessible
HDFS is typically set up for high availability
- Two NameNodes: Active and Standby
Small clusters may use "classic" mode
- One NameNode
- One "helper" node called the Secondary NameNode
  - Performs bookkeeping, not backup
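The Active/Standby idea can be sketched as a tiny state machine. This is an illustration of the failover concept only, not Hadoop code; real HA failover is coordinated automatically (e.g., via ZooKeeper) rather than by a function call like this.

```python
class NameNode:
    """Minimal stand-in for a NameNode with an HA role."""

    def __init__(self, name, state):
        self.name = name
        self.state = state      # "active" or "standby"
        self.healthy = True


def failover(active, standby):
    """If the active NameNode has failed, promote the standby."""
    if not active.healthy and standby.state == "standby":
        active.state = "standby"
        standby.state = "active"


nn1 = NameNode("nn1", "active")
nn2 = NameNode("nn2", "standby")

nn1.healthy = False   # simulate a crash of the active NameNode
failover(nn1, nn2)
print(nn2.state)      # the standby has taken over; the cluster stays accessible
```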
Cluster Components
A cluster has three main components that work together to provide distributed data processing:
Storage (HDFS) --> Resource Management --> Processing
We start with the storage component: HDFS
What is YARN?
YARN = Yet Another Resource Negotiator
YARN is the Hadoop processing layer that contains
- A resource manager
- A job scheduler
YARN allows multiple data processing engines to run on a single Hadoop cluster
- Batch programs (Spark, MapReduce)
- Interactive SQL (Impala)
- Advanced analytics (Spark, Impala)
- Streaming (Spark Streaming)
Hadoop Cluster Terminology
A cluster is a group of computers working together
- Provides data storage, data processing, and resource management
A node is an individual computer in the cluster
- Master nodes manage distribution of work and data to worker nodes
A daemon is a program running on a node
- Each Hadoop daemon performs a specific function in the cluster
Running an application in YARN
Containers
- Created by the ResourceManager upon request
- Allocate a certain amount of resources (memory, CPU) on a slave node
- Applications run in one or more containers
Application Master (AM)
- One per application
- Framework/application specific
- Runs in a container
- Requests more containers to run application tasks
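The container model above can be sketched in a few lines. This is an illustrative simulation, not the YARN API: a ResourceManager grants containers on slave nodes, the first container runs the Application Master, and the AM then requests more containers for its tasks.

```python
class Node:
    """A slave node with a fixed amount of allocatable memory."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb


class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb):
        """Grant a container on the first node with enough free memory."""
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                return {"node": node.name, "memory_mb": memory_mb}
        return None  # request waits until resources free up


rm = ResourceManager([Node("slave1", 4096), Node("slave2", 4096)])

am = rm.allocate(1024)                          # first container hosts the AM
tasks = [rm.allocate(2048) for _ in range(3)]   # AM requests task containers
print(am, tasks)
```

Note how the second and third task containers land on slave2 once slave1's memory is exhausted: containers go wherever the cluster has capacity.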
How Files are stored
Data files are split into 128 MB blocks, which are distributed at load time
Each block is replicated on multiple DataNodes (default 3x)
The NameNode stores the metadata
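The arithmetic of block splitting and replication is worth making concrete. A quick sketch (numbers only; real HDFS also spreads replicas across racks for fault tolerance):

```python
import math

BLOCK_SIZE_MB = 128
REPLICATION = 3  # default replication factor


def blocks_for(file_size_mb):
    """Number of blocks a file occupies; the last block holds the remainder."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)


file_mb = 500
n = blocks_for(file_mb)          # 3 full 128 MB blocks + one 116 MB block
replicas = n * REPLICATION       # block copies spread across DataNodes
raw_mb = file_mb * REPLICATION   # raw cluster storage consumed

print(f"{file_mb} MB file -> {n} blocks, {replicas} block replicas, {raw_mb} MB raw")
```

So a 500 MB file becomes 4 blocks and 12 replicated block copies, consuming 1500 MB of raw cluster storage.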
YARN Daemons
ResourceManager (RM)
- Runs on the master node
- Global resource scheduler
- Arbitrates system resources between competing applications
- Has a pluggable scheduler to support different algorithms (such as the Capacity Scheduler and Fair Scheduler)
NodeManager (NM)
- Runs on slave nodes
- Communicates with the RM
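To give a feel for the "pluggable scheduler" idea, here is a toy version of fair scheduling: split cluster memory evenly across running applications. Real YARN schedulers are far more sophisticated (queues, weights, minimum shares, data locality), so treat this as a conceptual sketch only; the application names are made up.

```python
def fair_shares(cluster_memory_mb, apps):
    """Give each running application an equal share of cluster memory."""
    share = cluster_memory_mb // len(apps)
    return {app: share for app in apps}


# Hypothetical applications competing for a 24 GB cluster:
shares = fair_shares(24576, ["spark-etl", "impala-adhoc", "mr-report"])
print(shares)  # each application gets an equal 8192 MB share
```

A capacity scheduler would instead divide resources by pre-configured queue capacities; swapping one policy for another without changing the rest of YARN is exactly what "pluggable" means here.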
