CSC 4740
Open Source
Documentation is online
hdfs dfs -get shakespeare/poems ~/shakepoems.txt && less ~/shakepoems.txt
Download the poems file into the local filesystem, then view the local copy
Scalable
Easily add a machine to the cluster
Fault-tolerant, scalable, open source, distributed
Features of Hadoop
HMaster
HBase component that assigns regions and creates and deletes tables
RegionServers
HBase component that serves data for reads and writes
hdfs dfs -ls /
HDFS command to print the contents of the root directory
hdfs dfs -mkdir weblog
HDFS command to create a directory 'weblog' in HDFS
gunzip -c access_log.gz | hdfs dfs -put - weblog/access_log
Command to unzip access_log.gz and upload it to the weblog directory in HDFS in one step
tar zxvf shakespeare.tar.gz
Local shell command (not an HDFS command) to extract shakespeare.tar.gz
hdfs dfs -ls shakespeare
List the contents of the /user/training/shakespeare directory
hdfs dfs -cat shakespeare/* | grep 'tiger'
Print all lines containing the word 'tiger' in all files stored in the HDFS shakespeare directory, without storing any file on your local filesystem.
hdfs dfs -ls /user/training (or, equivalently, hdfs dfs -ls)
Print the contents of your HDFS home directory
hdfs dfs -cat shakespeare/histories | tail -n 50
Print the last 50 lines of the histories file
hdfs dfs -cat wordcounts/part-r-00000 | less
View the contents of the output in wordcounts/part-r-00000
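As a rough sketch of what can be done with that output besides paging through it: assuming the job was a word count and wrote one word, a tab, and a count per line (the default text output layout), the most frequent word can be picked out with sort. The sample file and its contents below are hypothetical stand-ins so the pipeline runs without a cluster.

```shell
# Hypothetical stand-in for wordcounts/part-r-00000: word<TAB>count per line.
part_file=$(mktemp)
printf 'the\t120\ntiger\t7\nand\t98\n' > "$part_file"

# On a cluster the input would come from:
#   hdfs dfs -cat wordcounts/part-r-00000 | sort ...
# Sort on the second (count) field, numerically, largest first.
top_word=$(sort -t "$(printf '\t')" -k2,2nr "$part_file" | head -n 1 | cut -f1)
echo "most frequent word: $top_word"

rm "$part_file"
```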
HDFS
What is the storage layer of Hadoop?
YARN
What manages computing resources and schedules submitted jobs?
Hadoop MapReduce
What processes data on the cluster for Hadoop?
job
a 'full program'
Fault Tolerant
capable of continuing operation even if a component fails
MapReduce
is a component for distributing a job across multiple nodes
HBase
is a distributed column-oriented data store built on top of HDFS
cluster
is a group of computers working together; it provides data storage, data processing, and resource management
task attempt
is a particular instance of an attempt to execute a task.
daemon
is a program running on a node, performs a specific function in the cluster
Node
is an individual computer in the cluster
task
is the execution of a single Mapper or Reducer over a slice of data.
Master node
manages distribution of work and data to worker nodes
YARN
the Hadoop processing layer that contains a resource manager and a job scheduler
Distributed
Can work on multiple machines at the same time
Apache Hadoop
A software framework for storing, processing, and analyzing "big data"
hdfs dfs -cat shakespeare/* | grep 'tiger' | sort | hdfs dfs -put - tigers.txt
Find all lines containing the word 'tiger' in all files stored in the HDFS shakespeare directory, then store them in lexicographically sorted order in an HDFS file called tigers.txt, without storing any file on your local filesystem.
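The filter-and-sort stages of that pipeline can be tried without a cluster. The sketch below is a local stand-in only: a temporary directory with two made-up files plays the role of the HDFS shakespeare directory, `cat` replaces `hdfs dfs -cat`, and the result is captured in a variable instead of being piped to `hdfs dfs -put - tigers.txt`.

```shell
# Local stand-in: a temp directory plays the role of the HDFS shakespeare directory.
workdir=$(mktemp -d)
printf 'the tiger sleeps\nno cats here\n' > "$workdir/comedies"
printf 'a tiger roars\nimitate the action of the tiger\n' > "$workdir/histories"

# Same shape as:
#   hdfs dfs -cat shakespeare/* | grep 'tiger' | sort | hdfs dfs -put - tigers.txt
# LC_ALL=C forces plain byte-order (lexicographic) sorting.
sorted=$(cat "$workdir"/* | grep 'tiger' | LC_ALL=C sort)
echo "$sorted"

rm -r "$workdir"
```

Because every stage reads from standard input and writes to standard output, nothing is written to the local disk between stages; that is what lets the real command finish with `hdfs dfs -put -`.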
HBase is a distributed _____ _____ data store built on top of ______.
column-oriented; HDFS
hadoop version
Command line to print the installed Hadoop version
hadoop
Command line to print a help message
job.setNumReduceTasks(0);
Driver code call that creates a Map-only job by setting the number of Reducers to 0
Storage, processing, resource management
Core Hadoop Components
hdfs dfs -cat shakespeare/* | wc -l
Count the number of lines in all files stored in the HDFS shakespeare directory, without storing any file on your local filesystem.
hdfs dfs -mkdir testlog && gunzip -c access_log.gz | head -n 5000 | hdfs dfs -put - testlog/test_access_log
Create a smaller version of the log file (e.g., the first 5,000 lines) and store it in HDFS in a new directory named testlog
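Before uploading a sample like this, it can be worth checking that the truncation step yields the expected number of lines. The sketch below does that locally under stated assumptions: the gzip file is a generated stand-in (6,000 numbered lines), not a real access log, and the upload stage is left out.

```shell
# Generated stand-in for access_log.gz: 6,000 numbered lines, compressed.
workdir=$(mktemp -d)
seq 1 6000 | gzip > "$workdir/access_log.gz"

# Same decompress-and-truncate stages as the HDFS pipeline,
# but counting lines instead of piping to: hdfs dfs -put - testlog/test_access_log
sample_lines=$(gunzip -c "$workdir/access_log.gz" | head -n 5000 | wc -l | tr -d ' ')
echo "lines in sample: $sample_lines"

rm -r "$workdir"
```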
hdfs dfs -ls /user
HDFS command to print the contents of the /user directory
hdfs dfs
HDFS command to print a help message
ZooKeeper
HBase component for maintaining cluster status
128MB
How big is each HDFS block by default?
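A small worked example of what that block size means for a stored file, assuming a 128 MB block size and a hypothetical 300 MB file: the file is split into full blocks plus one final partial block, and the final block holds only the leftover bytes.

```shell
# Assumed block size from the card above; the file size is a made-up example.
block_size=$((128 * 1024 * 1024))
file_size=$((300 * 1024 * 1024))   # hypothetical 300 MB file

# Ceiling division: number of blocks needed to hold the file.
num_blocks=$(( (file_size + block_size - 1) / block_size ))
# Bytes actually stored in the final, partial block.
last_block=$(( file_size - (num_blocks - 1) * block_size ))

echo "blocks: $num_blocks"
echo "last block bytes: $last_block"
```

Here the file needs 3 blocks: two full 128 MB blocks and a final block holding the remaining 44 MB.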
speculative execution
If a Mapper appears to be running significantly more slowly than the others, a new instance of the Mapper will be started on another machine, operating on the same data
hdfs dfs -put shakespeare shakespeare
Insert shakespeare directory into HDFS
NameNode
The one machine that is selected to store the HDFS metadata
hdfs dfs -rm shakespeare/glossary
Remove the glossary file from shakespeare
