Hadoop, Linux, and Big Data - Week 3 Revature
How might we scale an HDFS cluster past a few thousand machines?
o HDFS Federation, which runs multiple independent NameNodes that each manage part of the namespace, can be used if you need tens of thousands of machines.
When does the combine phase run, and where does each combine task run?
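o The combiner is a partial reduction that runs on the map output, on the same machine as the map task, before shuffle and sort.
o The output of the combiner is sent over the network to the actual reduce task as input, so combining reduces the amount of data that has to be shuffled.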
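The effect of a combiner can be sketched in plain Python. This is only an illustration of the idea, not Hadoop code: the function names map_words and combine are made up for this example, and the input line is arbitrary.

```python
from collections import Counter

def map_words(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combine step: sum the counts per key locally, before any network transfer."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

# Hypothetical output of one map task.
map_output = map_words("to be or not to be")
combined = combine(map_output)

print(len(map_output), "pairs before combining")
print(len(combined), "pairs after combining:", combined)
```

Fewer pairs leave the mapper's machine, and the reducer still receives enough information to compute the same totals.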
What is/was Unix? Why is Ubuntu a Unix-like operating system?
o Unix was an OS developed at AT&T's Bell Labs. A large community of hackers and academics modified the Unix source to create and share distributions, but AT&T later decided to close the source.
o The community then worked on the GNU (GNU's Not Unix) Project, an attempt to recreate Unix from the ground up as open-source, copyleft software. GNU tools combined with the Linux kernel form GNU/Linux, the Linux we know today. Ubuntu is a distribution based on GNU/Linux, which is why it is Unix-like rather than Unix.
How are DataNodes fault tolerant?
For DataNodes, their fault tolerance is handled by the NameNode. DNs send heartbeats to the NN, so when a DN goes down, it stops sending those heartbeats, and the NN knows to make new replicas of all the data stored on the downed DN.
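A minimal sketch of the heartbeat idea, assuming a made-up timeout and made-up node and block names (none of these values are real HDFS defaults or APIs):

```python
def find_dead_nodes(last_heartbeat, now, timeout_seconds):
    """Return the DataNodes whose last heartbeat is older than the timeout."""
    return sorted(
        node for node, seen in last_heartbeat.items()
        if now - seen > timeout_seconds
    )

def blocks_to_re_replicate(block_locations, dead_nodes):
    """Return the blocks that had a replica on any dead node."""
    dead = set(dead_nodes)
    return sorted(
        block for block, nodes in block_locations.items()
        if dead & set(nodes)
    )

# Hypothetical cluster state: timestamps are in seconds.
last_heartbeat = {"dn1": 100, "dn2": 40, "dn3": 98}
block_locations = {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn1", "dn3"]}

dead = find_dead_nodes(last_heartbeat, now=101, timeout_seconds=30)
print("dead nodes:", dead)
print("blocks needing new replicas:", blocks_to_re_replicate(block_locations, dead))
```

The NameNode can do this because it holds the block-to-DataNode mapping as metadata; the new replicas are copied from the surviving DataNodes.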
What is the job of the NameNode? What about the DataNode?
NameNode: master daemon. The NN keeps the image of the distributed filesystem. It doesn't store any of the actual data in the files/directories, but it does contain the metadata for files and directories.
DataNode: worker daemon. There are many of these in your cluster, typically 1 per machine, except the machine running the NN. The DN stores the actual data in the filesystem and communicates its status to the NN. DataNodes run on "commodity hardware", which just means regular servers, nothing specialized.
What does the NodeManager do?
NodeManagers manage bundles of resources called containers running on their machine and report their status back to the ResourceManager.
What is an ApplicationMaster? How many of them are there per job?
There is 1 ApplicationMaster per job, created and managed by the ApplicationsManager. ApplicationMasters run in containers on the cluster and are responsible for communicating with the Scheduler to get the resources their jobs need. This allows the ApplicationsManager to be ultimately responsible for job completion, while offloading most of the work to ApplicationMasters running on worker nodes.
What do the following commands do? 1) hdfs dfs -get /user/adam/myfile ~ 2) hdfs dfs -put ~/coolfile /user/adam/
1) Copies the file /user/adam/myfile from HDFS into our home directory on the local filesystem
2) Copies the file ~/coolfile from our local filesystem into the /user/adam/ directory in HDFS
How many blocks will a 200MB file occupy in HDFS, if we assume the default HDFS block size for Hadoop v2+?
2 (one 128MB, the other 72MB)
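The arithmetic behind this answer can be checked with a short sketch. The function name split_into_blocks is made up for illustration; 128MB is the default block size in Hadoop v2+.

```python
BLOCK_SIZE_MB = 128  # default HDFS block size in Hadoop v2+

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size of each block a file of the given size is stored in."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block only takes up as much space as the data left over.
        sizes.append(remainder)
    return sizes

blocks = split_into_blocks(200)
print(len(blocks), "blocks:", blocks)
```

A file that is an exact multiple of the block size, such as 256MB, gets no partial block at the end.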
What is the default number of replications for each block? How are these replications typically distributed across the cluster?
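3. Replication information and other metadata are stored on the NameNode, and the NameNode makes all decisions about where data/replicas will be stored on the cluster. Every block of every file is replicated across the cluster. With the default placement policy, the replicas are typically spread over more than one rack: one on the writer's node (or a random node), and the other two on different nodes in a different rack.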
What is a daemon?
A daemon is just a long-running process. HDFS and YARN both involve multiple different daemons running on different machines. Typically, applications running on a cluster will have one or more master daemons responsible for coordinating work and many worker daemons responsible for actually doing the work.
What is a VM?
A virtual machine is an OS like the one running on the hardware we're using right now, except that it runs on virtual hardware instead. When I use a virtual machine, I create some virtual resources on top of my physical resources and install an OS on those virtual resources.
Which responsibilities does the ApplicationsManager have?
Accepts job submissions and creates the ApplicationMaster for each submitted job. Also responsible for the fault tolerance of ApplicationMasters.
What is AWS? (short)
Amazon Web Services. Amazon describes it as the world's most comprehensive cloud platform. With it, you can access technology services, such as computing power, storage, and databases, on an as-needed basis from a cloud provider.
What is Hadoop?
Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation.
What is a Container in YARN?
Containers are bundles of resources (such as RAM and cores); tasks are what run inside them. The Scheduler allocates containers across the cluster based on requests, and the RM has tasks run in those containers. ApplicationMasters also run in containers on the cluster.
What's the CDH?
Cloudera Distribution of Hadoop. One of the ways Hadoop is used in the wild. Offers help managing Hadoop clusters. Pay them to set up a cluster for you and then you can manage it yourself or they can do it for you.
What was the "Hadoop Explosion"?
Different tools were needed for different tasks, so Hadoop inspired a lot of other technologies that were built alongside and on top of it, like Spark, Hive, and Kafka.
What's the difference between an absolute and a relative path?
Every path we specify is either relative or absolute. Absolute paths start with / and fully specify the location. Relative paths don't start with / and specify the location based on our current directory
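A short sketch of how a path is resolved, using Python's posixpath module so that it follows Unix rules on any platform. The directories and file names are hypothetical.

```python
import posixpath

def resolve(path, current_dir):
    """Return the full location a path refers to, given the current directory."""
    if posixpath.isabs(path):
        # Absolute paths start with / and ignore the current directory.
        return posixpath.normpath(path)
    # Relative paths are interpreted starting from the current directory.
    return posixpath.normpath(posixpath.join(current_dir, path))

print(resolve("/user/adam/myfile", "/home/adam"))
print(resolve("docs/notes.txt", "/home/adam"))
print(resolve("../bob", "/home/adam"))
```

The same relative path points to a different location whenever the current directory changes, while an absolute path always points to the same place.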
What is data locality and why is it important?
In Hadoop, Data locality is the process of moving the computation close to where the actual data resides on the node, instead of moving large data to computation. This minimizes network congestion and increases the overall throughput of the system.
What are some differences between hard disk space and RAM?
RAM is short-term memory. It is where running programs and the inputs and outputs of the commands running on the processor are stored, and it is cleared when the machine shuts down. Hard disk is long-term storage: it stores files, and it is the part of the computer that preserves data when you shut down.
What is rack awareness?
Rack awareness is knowledge of the network structure, i.e. the location of the different DataNodes across the Hadoop cluster. While reading/writing data in HDFS, the NameNode chooses a DataNode in the same rack or, if none is available, in a nearby rack.
What is ssh?
SSH, also known as Secure Shell or Secure Socket Shell, is a network protocol that gives users, particularly system administrators, a secure way to access a computer over an unsecured network.
What are the 3 Vs of big data (Gartner's 3 Vs)?
Volume: Big data processing involves large amounts of data, at least 1TB.
Velocity: Big/fast data involves processing data that is produced rapidly and may need to be processed in near-real-time.
Variety: Big data involves processing data in multiple formats from multiple sources.
What are the different hadoop modes?
Standalone Mode: Standalone mode is usually the fastest Hadoop mode, as it uses the local file system for all input and output. In summary:
• Used for debugging purposes
• HDFS is not used
• Uses the local file system for input and output
• No need to change any configuration files
• The default Hadoop mode
Pseudo-distributed Mode: Pseudo-distributed mode is also known as a single-node cluster, where both the NameNode and DataNode reside on the same machine.
• A Hadoop deployment running on a single node is considered pseudo-distributed mode
• All the master and slave daemons run on the same node
• Mainly used for testing purposes
• Replication factor will be ONE for blocks
• Changes are required in all three configuration files: mapred-site.xml, core-site.xml, hdfs-site.xml
Fully-Distributed Mode (Multi-Node Cluster): This is the production mode of Hadoop, where multiple nodes are running. Data is distributed across several nodes and processing is done on each node.
• The production mode of Hadoop
• Separate nodes for master and slave daemons
• Data is distributed across and used on multiple nodes
What are some examples of structured data? Unstructured data?
Structured data is highly-organized and formatted in a way so it's easily searchable in relational databases (dates, phone numbers, ssn, addresses, etc). Unstructured data has no pre-defined format or organization, making it much more difficult to collect, process, and analyze (text files, reports, images, video files, etc).
Be able to explain the significance of Mapper[LongWritable, Text, Text, IntWritable] and Reducer[Text, IntWritable, Text, IntWritable]
o The Mapper class is a generic type with four formal type parameters that specify the input key, input value, output key, and output value types of the map function. The Reducer class has the same four type parameters, and its input key and value types (Text, IntWritable) must match the Mapper's output types.
o Rather than use built-in Java types, Hadoop provides its own set of basic types that are optimized for network serialization. These are found in the org.apache.hadoop.io package.
o Here we use LongWritable, which corresponds to a Java Long, Text (like Java String), and IntWritable (like Java Integer).
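The four type parameters can be illustrated with type-annotated Python functions. This is a stand-in, not the Hadoop API: int plays the role of LongWritable and IntWritable, str plays the role of Text, and the names map_fn and reduce_fn are made up. It also assumes a word count job over text input, where the map input key is the position of the line in the file.

```python
from typing import Iterator, List, Tuple

# Like Mapper[LongWritable, Text, Text, IntWritable]:
# input key = line offset, input value = line, output = (word, count) pairs.
def map_fn(offset: int, line: str) -> Iterator[Tuple[str, int]]:
    for word in line.split():
        yield (word, 1)

# Like Reducer[Text, IntWritable, Text, IntWritable]:
# the input key and value types match the mapper's output types.
def reduce_fn(word: str, counts: List[int]) -> Tuple[str, int]:
    return (word, sum(counts))

pairs = list(map_fn(0, "big data big cluster"))
print(pairs)
print(reduce_fn("big", [1, 1]))
```

The key point is that the middle two types of the Mapper and the first two types of the Reducer must agree, because the mapper's output is the reducer's input.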
How does the chmod command change file permissions?
The chmod command can be used to explicitly assign privileges to the user (owner), group, and others. This can be done either in octal number format, e.g. 777 for all privileges to all three classes, or in letter format, e.g. u+rwx, g+rwx, o+rwx.
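The numeric form of a mode is three octal digits, one each for user, group, and other, where read counts 4, write counts 2, and execute counts 1. A small sketch of how a numeric mode maps to the letters shown by ls -l (the function name octal_to_symbolic is made up for illustration):

```python
def octal_to_symbolic(mode):
    """Convert a three-digit octal mode string such as '754' to 'rwxr-xr--'."""
    result = ""
    for digit in mode:  # one digit each for user, group, other
        bits = int(digit, 8)
        result += "r" if bits & 4 else "-"
        result += "w" if bits & 2 else "-"
        result += "x" if bits & 1 else "-"
    return result

for mode in ["777", "754", "640"]:
    print(mode, octal_to_symbolic(mode))
```

So chmod 754 gives the user read, write, and execute, the group read and execute, and everyone else read only.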
How do permissions work in Unix?
There are 3 permission classes on a Linux system: User, Group, and Other. Linux divides file permissions into read, write, and execute, denoted by r, w, and x. The permissions on a file can be changed with the 'chmod' command.
How many NameNodes exist on a cluster?
There is one active NameNode per cluster, unless your cluster is multiple thousands of machines, in which case HDFS Federation lets you run several.
How does a Standby NameNode make the NameNode fault tolerant?
The Standby NameNode is a daemon that runs on another machine and follows the same steps as the NameNode while they are occurring in real time. It receives the information in the EditLog and keeps its own FSImage. Then, if the real NameNode fails, the Standby NameNode steps in and becomes the new NameNode. This is called failover. This is the best option, but requires more resources.
What purpose does a Secondary NameNode serve?
The Secondary NameNode periodically (every hour by default) keeps backups of the NN metadata. It isn't capable of stepping in or functioning as a replacement NameNode; it just preserves FS information in a secondary location in case of total failure of the NN, to avoid catastrophic data loss.
How do we interact with the distributed filesystem?
Through FS shell commands, such as hdfs dfs -get and hdfs dfs -put. A related command is jar, which runs a JAR file: users can bundle their MapReduce code in a JAR file and execute it using this command.
What are users, what are groups?
Groups can contain multiple users. All users belonging to a group have the same group permissions on a file. A user (or account) on a system is uniquely identified by a number called the UID (user identifier). The root user, or superuser, can access all files, while a normal user has limited access to files.
Know basic file manipulation and navigation commands in Unix:
ls -al: displays the contents of a directory in long form, with details, including hidden files. Lets us see permissions
cd: change directory
pwd: print working directory (prints the path of the directory you're currently in)
mkdir: make directory, to make a new folder
touch: make a new empty file with touch filename
nano: basic command line text editor
man: shows the manual page for a given command
less: shows the contents of a file one screen at a time
cat: prints the contents of a file to the command line
mv: moves a file from source to destination
cp: copies the specified file
rm: removes a file
history: shows the history of commands
history | grep [old command]: shows prior usage of some command. Very handy when you forget.
In a typical Hadoop cluster, what's the relationship between HDFS data nodes and YARN node managers?
There is typically one of each per worker machine: the DataNode is the HDFS worker daemon and the NodeManager is the YARN worker daemon. NodeManagers manage bundles of resources called containers running on their machine and report the status back to the RM. We submit jobs to the ResourceManager. Tasks are the individual pieces jobs are broken up into, and they run inside containers. Because a DataNode and a NodeManager share each machine, map and reduce tasks can run on the machines whose DataNodes hold the data they need.
What does the ResourceManager do?
There is one ResourceManager per cluster; the RM is the master daemon. It is responsible for providing computing resources (i.e. RAM, cores, disk) for jobs.
What is a package manager? what package manager do we have on Ubuntu?
Package managers download applications or parts of applications for you, installing and managing dependencies automatically. On Ubuntu we'll use APT (Advanced Package Tool).
Know the input and output of the shuffle + sort phase
Input: the key-value pairs output by the mappers. Output: the same pairs sorted by key, with all values for a given key grouped together. These are passed to the reducer, which makes the data easier to process.
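A minimal sketch of what shuffle and sort does to the mapper output, ignoring partitioning across multiple reducers. The function name shuffle_and_sort and the sample pairs are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(mapper_outputs):
    """Merge the pairs from all mappers, sort by key, and group values per key."""
    all_pairs = [pair for output in mapper_outputs for pair in output]
    all_pairs.sort(key=itemgetter(0))  # sort by key only
    return [
        (key, [value for _, value in group])
        for key, group in groupby(all_pairs, key=itemgetter(0))
    ]

# Hypothetical outputs of two map tasks.
mapper_a = [("be", 2), ("to", 2)]
mapper_b = [("be", 1), ("or", 1)]

reducer_input = shuffle_and_sort([mapper_a, mapper_b])
print(reducer_input)
```

Each reducer call then receives one key together with the full list of values for that key.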
Which responsibilities does the Scheduler have?
The Scheduler is responsible for allocating resources (containers) across the cluster based on requests.
