Week 7: Hadoop, MapReduce, Hive, HDFS, YARN, and Spark
Client Nodes
-Have Hadoop installed with all the required cluster configuration settings. -Are in charge of loading the data into the Hadoop cluster. -Submit MapReduce jobs describing how the data needs to be processed, and then retrieve the output once job processing is completed.
Single Node Cluster
-Has only a single machine. -All the daemons, i.e., DataNode, NameNode, TaskTracker, and JobTracker, run on the same machine/host. -Everything runs on a single JVM instance. -The default replication factor is 1.
HDFS Characteristics
-High fault tolerance. -May consist of thousands of server machines. -Has high throughput. -Is designed to store and scan millions of rows of data and to count or add subsets of the data. -Is designed to support large datasets in batch-style jobs. -Is economical: it can be built on commodity hardware and heterogeneous platforms, which are low-priced and easily available.
Balancing
-If a data node dies or crashes, the blocks present on it are lost! -Those blocks will be considered "under-replicated" compared to the other blocks. -The master node (NameNode) signals the data nodes holding the remaining replicas of the blocks in question to re-replicate them, so that the overall distribution of blocks stays balanced. -NO TWO REPLICAS OF A BLOCK ARE PRESENT ON THE SAME DATANODE!
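The re-balancing idea can be sketched in a few lines. This is an illustrative simulation, not Hadoop's actual code: given a map of block locations and the set of live nodes, it finds under-replicated blocks and picks new target nodes, never placing two replicas on the same node. All names and the node IDs are invented for the example.

```python
# Illustrative sketch of NameNode-style re-replication after a DataNode dies.
REPLICATION_FACTOR = 3

def rereplicate(block_locations, live_nodes):
    """block_locations: dict block_id -> set of nodes currently holding it."""
    plan = {}
    for block, holders in block_locations.items():
        holders = holders & live_nodes          # replicas on dead nodes are lost
        missing = REPLICATION_FACTOR - len(holders)
        if missing > 0:
            # candidate targets: live nodes that do NOT already hold a replica
            targets = sorted(live_nodes - holders)[:missing]
            plan[block] = targets
    return plan

locations = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn4", "dn5"},
}
live = {"dn1", "dn3", "dn4", "dn5"}             # dn2 has crashed
print(rereplicate(locations, live))
```

Note how the `live_nodes - holders` set difference enforces the "no two replicas on one DataNode" rule from the card above.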
Zookeeper Coordination
-In a Hadoop cluster, coordinating and synchronizing nodes can be a challenging task. -Zookeeper is well suited to this problem.
Spark
-Is an alternative framework to Hadoop, written in Scala, but supports varied applications written in Java, Python, etc. -Compared to MapReduce, it provides in-memory processing, which accounts for faster processing. -In addition to the batch processing offered by Hadoop, it can also handle real-time processing.
Kafka
-It is a distributed streaming platform designed to store and process streams of records. It is written in Scala. -It builds real-time streaming data pipelines that reliably move data between applications, and real-time applications that transform or react to streams of data. -It uses a messaging system for transferring data from one application to another: a sender, a message queue, and a receiver are involved in the data transfer.
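The sender / message queue / receiver pattern can be sketched with an in-memory queue. This is only a toy model of the pattern Kafka generalizes; real Kafka adds topics with partitions, durable storage, and replication, and the record strings here are made up.

```python
# Minimal sender -> queue -> receiver sketch of the messaging pattern.
from queue import Queue

topic = Queue()  # stands in for a Kafka topic

def produce(records):
    """Sender: append records to the queue."""
    for r in records:
        topic.put(r)

def consume(n):
    """Receiver: read n records in the order they were sent."""
    return [topic.get() for _ in range(n)]

produce(["click:/home", "click:/cart", "purchase:42"])
events = consume(3)
print(events)
```

The key property the sketch shows is decoupling: the producer and consumer never call each other directly, they only share the queue.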
Hadoop YARN
-It is a resource management layer that runs on top of HDFS (it is not itself a file system). -It is responsible for managing cluster resources to make sure you don't overload one machine. -It performs job scheduling to make sure that jobs are scheduled in the right place.
HeartBeat
-It is the signal sent continuously by each data node to the NameNode. -If the NameNode does not receive a heartbeat from a data node, it will consider that data node "dead".
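The dead-node check boils down to a timestamp comparison. Below is an illustrative sketch, not HDFS's actual logic: the NameNode side records the last heartbeat time per node and declares a node dead once it has been silent longer than a timeout. The timeout and timestamps are invented for the example (real HDFS uses its own configured intervals).

```python
# Sketch of heartbeat-based failure detection on the NameNode side.
HEARTBEAT_TIMEOUT = 30  # seconds; illustrative, not the HDFS default

def dead_nodes(last_heartbeat, now):
    """last_heartbeat: dict node -> timestamp (seconds) of its last heartbeat."""
    return sorted(n for n, t in last_heartbeat.items()
                  if now - t > HEARTBEAT_TIMEOUT)

beats = {"dn1": 100, "dn2": 95, "dn3": 60}
print(dead_nodes(beats, now=100))   # dn3 was last seen 40s ago
```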
Metadata
-At startup, the NameNode loads into memory the metadata for the blocks that reside on each DataNode. -The metadata size is limited by the RAM available on the NameNode. -As mentioned earlier, each file is split into one or more blocks that are stored and replicated across DataNodes; this is the data block split.
Secondary NameNode Server
-Maintains the edit log and namespace image information in sync with the NameNode server. -At times, the namespace images from the NameNode server are not updated; therefore, you cannot totally rely on the Secondary NameNode server for the recovery process.
DataNodes
-Store and serve the actual file blocks; the NameNode tracks the names and locations of the blocks. -By default, each file block is 128 megabytes.
Components Of A Hadoop Cluster
-Master Node -Slave/Worker Nodes -Client Nodes
HDFS Components
-NameNode -Secondary NameNode -File System -Metadata -DataNode
Dedicated Master Nodes
-NameNode -Secondary NameNode -YARN
Checkpoint Node Or Backup Node
-Provides checkpointing services for the NameNode. This involves reading the NameNode's edit log for changes to files in HDFS (new, deleted, and appended files) since the last checkpoint, and applying them to the NameNode's master file that maps files to data blocks. -In addition, the Backup Node keeps a copy of the file system namespace in memory and keeps it in sync with the state of the NameNode. For high availability deployments, do not use a checkpoint node or backup node — use a Standby NameNode instead. In addition to being an active standby for the NameNode, the Standby NameNode maintains the checkpointing services and keeps an up-to-date copy of the file system namespace in memory.
HDFS Data Storage
-Provides distributed storage. -Can be implemented on commodity hardware. -Provides data security. -Highly fault-tolerant: If one machine goes down, the data from that machine goes to the next machine.
Master Nodes
-Responsible for storing data in HDFS and executing parallel computation on the stored data using MapReduce. -Has three nodes: NameNode, Secondary NameNode and JobTracker.
NameNode Server
-The NameNode server is the core component of an HDFS cluster. -There can be only one NameNode server in an entire cluster. -The NameNode maintains and executes the file system namespace operations, such as opening, closing, and renaming files and directories present in HDFS. -The namespace image and the edit log store the file system metadata. -The NameNode also determines the mapping of blocks to DataNodes. -Furthermore, the NameNode is a single point of failure.
Benefits Of Data Block Approach
-Simplified Replication -Fault-Tolerance -Reliability
Big Data Challenges
-Single Central Storage -Serial Processing: One Input, One Processor, One Output -Lack of ability to process unstructured data.
HDFS Limitations
-Small files are a problem: for a large number of small files, there will be lots of seeks (read/write) and lots of movement from one node to another to retrieve each small file, which is very inefficient due to file I/O. -Low-latency data access: applications that require low-latency access to data, i.e., within milliseconds, will not work well with HDFS.
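A bit of arithmetic shows a second cost of small files: NameNode memory. The figure of roughly 150 bytes of NameNode heap per namespace object (file or block) is a commonly cited rule of thumb, treated here as an assumption rather than an exact HDFS number; the file counts are invented for the comparison.

```python
# Back-of-the-envelope arithmetic for the small-files problem.
BYTES_PER_OBJECT = 150  # assumption: rough rule of thumb, not an exact figure

def namenode_memory(num_files, blocks_per_file):
    # one metadata object per file plus one per block
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

small_files = namenode_memory(100_000_000, 1)   # 100M tiny files, 1 block each
large_files = namenode_memory(8, 8192)          # 8 huge files, 8192 blocks each
print(small_files // 2**30, "GiB vs", large_files // 2**20, "MiB")
```

Many tiny files inflate the object count (and thus NameNode RAM) by orders of magnitude compared with a few large, many-block files.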
Hadoop Benefits
-Speed: Hadoop's concurrent processing, MapReduce model, and HDFS let users run complex queries in just a few seconds. -Diversity: Hadoop's HDFS can store different data formats: structured, semi-structured, and unstructured. -Cost-Effective: Hadoop is an open-source data framework. -Resilient: Data stored in a node is replicated in other cluster nodes, ensuring fault tolerance. -Scalable: Since Hadoop functions in a distributed environment, you can easily add more servers.
Sqoop
-Sqoop is used to transfer data between Hadoop and external datastores such as relational databases and enterprise data warehouses. -It imports data from external datastores into HDFS, Hive, and HBase.
Map Phase
-The data is assigned a key and a value of 1. -These key-value pairs are then shuffled and sorted together based on their keys.
Resource Manager (YARN)
It is the master daemon of YARN and is responsible for resource assignment and management among all the applications. Whenever it receives a processing request, it forwards it to the corresponding node manager and allocates resources for the completion of the request accordingly. It has two major components: Scheduler and Application Manager.
Scheduler
It performs scheduling based on the resource requirements of the applications and the available resources. It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking, and it does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.
Client
It submits map-reduce jobs.
Node Manager
It takes care of an individual node in the Hadoop cluster and manages applications and workflow on that particular node. Its primary job is to keep up-to-date with the Resource Manager. It monitors resource usage, performs log management, and kills a container based on directions from the Resource Manager. It is also responsible for creating the container process and starting it at the request of the Application Master.
Secondary NameNode
Keeps a checkpoint copy of the NameNode metadata; it is not a full, up-to-date backup of the NameNode.
NameNode
Keeps track of all the information on files (i.e., the metadata on files), such as a file's access time, which user is currently accessing a file, and where in the Hadoop cluster each file is saved.
File System NameNode Operation
Maintains two persistent files: 1.) A transaction log called the EditLog. 2.) A namespace image called the FsImage.
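The relationship between the two files can be sketched as a checkpoint: the FsImage is a snapshot of the namespace, the EditLog records changes since that snapshot, and checkpointing replays the log into the image and empties the log. This is a toy model with invented paths, not the on-disk formats HDFS actually uses.

```python
# Sketch of the FsImage + EditLog checkpoint idea.
def checkpoint(fsimage, edit_log):
    """Replay logged namespace operations into the image, then clear the log."""
    for op, path in edit_log:
        if op == "create":
            fsimage.add(path)
        elif op == "delete":
            fsimage.discard(path)
    edit_log.clear()
    return fsimage

image = {"/data/a.txt", "/data/b.txt"}                    # last FsImage snapshot
log = [("create", "/data/c.txt"), ("delete", "/data/a.txt")]  # EditLog since then
checkpoint(image, log)
print(sorted(image))
```

This is also why the Secondary NameNode is only a checkpointing helper: it produces a fresher FsImage, not a live copy of the running NameNode.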
NameNode And Standby NameNode
Manage HDFS storage and ensure high availability. Each runs on its own dedicated master node.
HDFS MapReduce
Manages the nodes for processing.
JobTracker
-Monitors the parallel processing of data using MapReduce, while the NameNode handles the data storage function with HDFS. -On Hadoop 1 servers, handles cluster resource management and scheduling. -Replaced by YARN in Hadoop 2; with YARN, the JobTracker is obsolete and isn't used. -A number of Hadoop deployments still haven't migrated to Hadoop 2 and YARN.
Resource Manager
Oversees the scheduling of application tasks and management of the Hadoop cluster's resources. This service is the heart of YARN.
Data Node
Reads, writes, processes, and replicates the data. They also send signals, known as heartbeats, to the name node. These heartbeats show the status of the data node.
JournalNode
Receives edit log modifications indicating changes to files in HDFS from the NameNode. At least three JournalNode services (and it's always an odd number) must be running in a cluster, and they're lightweight enough that they can be collocated with other services on the master nodes.
Big Data
Refers to the massive amount of data that cannot be stored, processed, and analyzed using traditional ways.
Cluster Utilization
YARN supports dynamic utilization of the cluster in Hadoop, which enables optimized cluster utilization.
Fault Tolerance
The ability of a system to respond to unexpected failures or system crashes: the backup system immediately and automatically takes over, with no loss of service.
Value
The ability to turn data into useful insights for your business.
Reduce Phase
The aggregation takes place, and the final output is obtained.
Variety
The different types of data: structured, semi-structured, unstructured.
Spark Core
The main execution engine for Spark and other APIs built on top of it.
Hadoop MapReduce
The processing unit of Hadoop. In the MapReduce approach, the processing is done at the slave nodes, and the final result is sent to the master node.
MapReduce
The processing unit.
YARN
The resource management unit.
Scalability
The scheduler in the Resource Manager of the YARN architecture allows Hadoop to extend to and manage thousands of nodes and clusters.
Velocity
The speed at which data is generated, collected, and analyzed.
Parallel Processing With Distributed Storage
The storage unit is distributed among each of the processors, resulting in efficient storage and access of data with no network overhead. This setup is how data engineers and analysts manage big data effectively.
HDFS
The storage unit.
Master And Slave Nodes
These nodes form the HDFS cluster. The name node is called the master, and the data nodes are called the slaves.
Veracity
Trustworthiness in terms of quality and accuracy
Hadoop Distributed File System (HDFS)
Uses name nodes and data nodes to store extensive data.
Apache Pig
-Was developed by Yahoo researchers and is targeted mainly towards non-programmers. -It was designed with the ability to analyze and process large datasets without using complex Java code. -It provides a high-level data processing language that can perform numerous operations without getting bogged down in too many technical concepts. -Pig was developed for analyzing large datasets and overcomes the difficulty of writing map and reduce functions.
Compatibility
YARN supports existing MapReduce applications without disruption, making it compatible with Hadoop 1.0 as well.
Who uses Hadoop?
•British Airways •Uber •The Bank of Scotland •Netflix •The National Security Agency (NSA) •The UK's Royal Mail system •Expedia •Twitter
Three Master Nodes
-Active NameNode -Standby NameNode -Resource Manager
Ambari
-Apache Ambari is an open-source tool responsible for keeping track of running applications and their statuses. -Ambari manages, monitors, and provisions Hadoop clusters.
Big Data Solutions
-Distributed Storage -Parallel Processing: Multiple Inputs, Multiple Processors, Multiple Outputs -Ability to process every type of data.
Block Replication Architecture
-Each file is split into a sequence of blocks. -All blocks except the last one in the file are of the same size. -Blocks are replicated for fault tolerance.
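The split rule above ("all blocks except the last are the same size") is just fixed-size slicing. The sketch below uses a toy 10-byte block size instead of HDFS's 128 MB default so the result is easy to inspect; it is an illustration of the rule, not HDFS code.

```python
# Sketch of the HDFS block split: fixed-size slices, last one may be short.
BLOCK_SIZE = 10  # bytes; illustrative stand-in for the 128 MB default

def split_into_blocks(data, block_size=BLOCK_SIZE):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"x" * 25)   # a 25-byte "file"
print([len(b) for b in blocks])         # every block full-size except the last
```

Each of these blocks would then be replicated (three times by default) across different DataNodes for fault tolerance.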
HDFS File System
-Exposes a file system namespace and allows user data to be stored in files. -Has a hierarchical file system with directories and files.
HBase
-HBase is a column-based NoSQL database. -It runs on top of HDFS and can handle any type of data. -It allows real-time processing and random read/write operations to be performed on the data.
Three Major Components Of Hadoop
-HDFS -MapReduce -YARN
MapReduce Data Processing
-Hadoop data processing is built on MapReduce, which processes large volumes of data in a parallel, distributed manner. -We have big data that needs to be processed, with the intent of eventually arriving at an output. -In the beginning, the input data is divided up to form the input splits. -The first phase is the Map phase, where the data in each split is processed to produce output values. -In the shuffle and sort phase, the mapping phase's output is taken and grouped into blocks of similar data. -Finally, the output values from the shuffling phase are aggregated, returning a single output value.
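The phases above can be traced with a toy word count: split the input, map each word to a (word, 1) pair, shuffle/sort the pairs into groups by key, then reduce each group by summing. This runs in one process purely for illustration; real MapReduce performs the same steps across many machines.

```python
# Toy word count tracing the Map -> shuffle/sort -> Reduce phases.
from collections import defaultdict

def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]

def shuffle_sort(pairs):
    """Group values by key and order the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(groups):
    """Aggregate each group to produce the final counts."""
    return {key: sum(values) for key, values in groups.items()}

splits = ["big data big", "data big deal"]            # two input splits
pairs = [kv for s in splits for kv in map_phase(s)]   # map each split
result = reduce_phase(shuffle_sort(pairs))
print(result)
```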
Container
It is a collection of physical resources such as RAM, CPU cores and disk on a single node. The containers are invoked by Container Launch Context (CLC) which is a record that contains information such as environment variables, security tokens, dependencies etc.
Dataset
It is first split into chunks, which are then processed in parallel.
Application Manager
It is responsible for accepting the application and negotiating the first container from the Resource Manager. It also restarts the Application Master container if a task fails.
Hadoop Data
It is stored in a distributed manner in HDFS. There are two components: the name node and the data node. While there is only one name node, there can be multiple data nodes.
Apache Hive
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Has two major components: -Hive Command Line -JDBC/ODBC driver
Java Database Connectivity (JDBC)
Application is connected through JDBC Driver.
Open Database Connectivity (ODBC)
Application is connected through ODBC Driver.
Hadoop
Big Data Framework
Zookeeper
Coordinates distributed components and provides mechanisms to keep them in sync. Zookeeper is used to detect the failure of the NameNode and elect a new NameNode. It's also used with HBase to manage the states of the HMaster and the RegionServers.
Redundancy
Critical in avoiding single points of failure, which is why a typical deployment has two switches and three master nodes.
Streaming API
Enables Spark to handle real-time data. It can easily integrate with a variety of data sources like Flume, Kafka, and Twitter.
Replication
Is performed three times by default across the data nodes. It is done this way so that if a commodity machine fails, you can replace it with a new machine that has the same data.
Name Node
Is responsible for the workings of the data nodes. It stores the metadata.
Multi-Tenancy
YARN allows multiple processing engines to access the same data, giving organizations the benefit of multi-tenancy.
Hadoop Challenges
-There's a steep learning curve. If you want to run a query in Hadoop's file system, you need to write MapReduce functions in Java, a process that is non-intuitive. Also, the ecosystem is made up of lots of components. -Not every dataset can be handled the same way. Hadoop doesn't give you a "one size fits all" advantage; different components run things differently, and you need to sort them out with experience. -MapReduce is limited. Yes, it's a great programming model, but MapReduce uses a file-intensive approach that isn't ideal for real-time, interactive, and iterative tasks or data analytics. -Security is an issue. There is a lot of data out there, and much of it is sensitive. Hadoop still needs to incorporate proper authentication, data encryption, provisioning, and frequent auditing practices.
Slave/Worker Nodes
-This component in a Hadoop cluster is responsible for storing the data and performing computations. -Every slave/worker node runs both a TaskTracker and a DataNode service to communicate with the Master node in the cluster. -The DataNode service is secondary to the NameNode and the TaskTracker service is secondary to the JobTracker.
Mahout
-Used to create scalable and distributed machine learning algorithms such as clustering, linear regression, classification, and so on. -It has a library that contains built-in algorithms for collaborative filtering, classification, and clustering.
Hive
-Uses SQL (Structured Query Language) to facilitate reading, writing, and managing large datasets residing in distributed storage. -It was developed with a vision of incorporating the concepts of tables and columns with SQL, since users were comfortable writing queries in SQL. -It is a distributed data warehouse system developed by Facebook. -It allows for easy reading, writing, and managing of files on HDFS. -It has its own query language, Hive Query Language (HQL), which is very similar to SQL. -This makes it very easy for programmers to express MapReduce jobs using simple HQL queries.
Hadoop Data Processing
-Volume -Velocity -Variety -Veracity -Value
Multi Node Cluster
-Will have more than one machine. -All the essential daemons are up and run on different machines/hosts. -Has a master-slave architecture wherein one machine acts as the master, running the NameNode daemon, while the other machines act as slave or worker nodes running the other Hadoop daemons. -Can be configured based on the number of data nodes available.
YARN Cluster Resources Management
-YARN handles the cluster of nodes and acts as Hadoop's resource management unit. -YARN allocates CPU, memory, and other resources to different applications. -Two components: 1.) Resource Manager (Master) - The master daemon. It manages the assignment of resources such as CPU, memory, and network bandwidth. 2.) Node Manager (Slave) - The slave daemon; it reports resource usage to the Resource Manager.
Application Workflow In Hadoop Yarn
1. The client submits an application. 2. The Resource Manager allocates a container to start the Application Master. 3. The Application Master registers itself with the Resource Manager. 4. The Application Master negotiates containers from the Resource Manager. 5. The Application Master notifies the Node Manager to launch the containers. 6. The application code is executed in the containers. 7. The client contacts the Resource Manager/Application Master to monitor the application's status. 8. Once the processing is complete, the Application Master un-registers with the Resource Manager.
Apache Pig Consists Of
1. Pig Latin - The scripting language, similar to SQL. 2. Pig Latin Compiler - Converts Pig Latin code into executable code; it is the execution engine on which Pig Latin runs.
Stages Of Big Data Processing
1.) Ingestion: Flume, Kafka, Sqoop 2.) Storage: HDFS, HBase 3.) Processing: MapReduce, Spark 4.) Analysis: Pig, Hive, Spark
Hadoop Cluster
A collection of independent components connected through a dedicated network to work as a single centralized data processing resource.
GraphX
A graph computation engine that enables users to interactively build, transform, and reason about graph-structured data at scale and comes with a library of common algorithms.
Volume
A massive amount of data generated every second.
MLlib
A scalable machine learning library that will enable you to perform data science tasks while leveraging the properties of Spark at the same time.
Oozie
A workflow scheduler system used to manage Hadoop jobs. -It consists of two parts: 1.) Workflow engine - Consists of Directed Acyclic Graphs (DAGs), which specify a sequence of actions to be executed. 2.) Coordinator engine - Made up of workflow jobs triggered by time and data availability.
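The DAG idea behind the workflow engine can be sketched with the standard library: actions form a directed acyclic graph and must run in dependency order, which a topological sort produces. The action names below are invented for the example; this illustrates the scheduling concept, not Oozie's actual engine.

```python
# Sketch of DAG-ordered workflow execution using a topological sort.
from graphlib import TopologicalSorter  # Python 3.9+

# action -> set of actions it depends on
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
}

order = list(TopologicalSorter(workflow).static_order())
print(order)   # every action appears after all of its dependencies
```

`static_order` raises `CycleError` if the graph has a cycle, mirroring why workflow definitions must be *acyclic*.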
Yet Another Resource Negotiator (YARN)
Acts as an operating system for Hadoop in managing cluster resources.
Spark SQL API
Allows for querying structured data stored in DataFrames or Hive tables.
Application Master
An application is a single job submitted to the framework. The Application Master is responsible for negotiating resources with the Resource Manager and for tracking the status and monitoring the progress of a single application. It asks the Node Manager to launch containers by sending a Container Launch Context (CLC), which includes everything the application needs to run. Once the application is started, it sends health reports to the Resource Manager from time to time.
Flume
Another data collection and ingestion tool: a distributed service for collecting, aggregating, and moving large amounts of log data. -It ingests online streaming data from social media, log files, and web servers into HDFS. -Data is taken from various sources, depending on your organization's needs. -It then goes through a source, a channel, and a sink. -The sink feature ensures that everything is in sync with the requirements. -Finally, the data is dumped into HDFS.
