Data Science - Hadoop Ecosystem & Security
look at more details of a file in HDFS
$ hdfs fsck /user/vagrant/salaries.csv - this tells you more info about the file, including how many blocks it contains
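For a fuller picture, fsck accepts extra flags; a minimal sketch reusing the same example file:
$ hdfs fsck /user/vagrant/salaries.csv -files -blocks -locations
# -files lists the file, -blocks lists each block, -locations shows which DataNodes hold each replica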
remove machines and all data
$ vagrant destroy
suspend machines
$ vagrant suspend (when you don't need the machines for now; use $ vagrant resume to bring them back)
replication of data blocks
- HDFS provides a reliable way to store large data in a distributed environment as data blocks - blocks are also replicated to provide fault tolerance - the default replication factor is 3 - the first replica is placed on the local rack - the next two replicas are stored on a different (remote) rack, each on a different DataNode within that rack
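A quick way to inspect or change replication from the shell (a sketch; salaries.csv is just the example file used elsewhere in these notes):
$ hdfs getconf -confKey dfs.replication              # print the configured default replication factor
$ hdfs dfs -setrep -w 2 /user/vagrant/salaries.csv   # re-replicate an existing file to 2 copies and wait until done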
YARN (Yet Another Resource Negotiator)
- a hadoop ecosystem module that provides resource management - also called the operating system of hadoop - responsible for managing and monitoring workloads; it is a layer that separates the resource management layer from the processing components layer
vagrantfile
- a project has a Vagrantfile with its configuration - it marks the root directory of your project - it describes the kind of machine and resources the project needs to run, and what software to install - to change network settings, memory, or CPU, open the Vagrantfile in Notepad (or equivalent) and make the changes
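After editing the Vagrantfile, a minimal workflow to apply the changes might look like this (a sketch):
$ vagrant validate   # check that the edited Vagrantfile is syntactically valid
$ vagrant reload     # restart the machines so new memory/CPU/network settings take effect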
HDFS files and blocks
- the hadoop application is responsible for distributing the data blocks across multiple nodes - files are split into chunks and chunks are stored in blocks - blocks are managed by the NameNode and stored on DataNodes - blocks are traditionally either 64 MB or 128 MB - the default block size is 128 MB
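To confirm the block size a running cluster is actually using (a sketch):
$ hdfs getconf -confKey dfs.blocksize   # prints the block size in bytes, e.g. 134217728 = 128 MB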
why is a traditional database not a solution?
- it is NOT horizontally scalable because we can't keep adding resources or more computational nodes - a traditional db is designed for structured data, so it is not a good choice when we have a variety of data
MapReduce: Reducer phase
- the shuffled and sorted data is taken as input for the reducer phase - values with the same key are combined, and the resulting key/value pairs are added to the HDFS system; the record writer writes the data from the reducer to HDFS - after the reducer logic is applied, it will give output as part-r-00000, part-r-00001, etc. - in the mapred-site.xml configuration file, we can set properties that control the number of reducers for a specific job
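Besides editing mapred-site.xml, the reducer count can also be set per job on the command line; a sketch, assuming the job's driver uses ToolRunner so the generic -D option is parsed (the jar and class names are placeholders):
$ hadoop jar myjob.jar MyJob -D mapreduce.job.reduces=4 /input /output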
MapReduce: Intermediate Phase
- the shuffling and sorting of data occurs in this phase - the intermediate data is produced here after the map computations - hadoop uses a round-robin algorithm to write intermediate data to local disk
Which of the following genres does Hadoop produce?
Distributed-File System
HDFS provides a command line interface called __________ used to interact with HDFS.
FS Shell
Data flow in MapReduce
First: a mapreduce job is a unit of work that the client wants to be performed. Second: a mapreduce job consists of the input data, the mapreduce program, and configuration information. Then: hadoop runs the job by dividing the work into phases: mapping and reducing. Pipeline: input data stored on HDFS -> InputSplit -> RecordReader -> Mapper -> Combiner -> Partitioner -> shuffling and sorting -> Reducer -> output data stored on HDFS
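To see this pipeline run end to end, the examples jar that ships with Hadoop can be used; a sketch (the jar location varies by distribution, and the paths reuse the example file from these notes):
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/vagrant/salaries.csv /user/vagrant/wc-out
$ hadoop fs -cat /user/vagrant/wc-out/part-r-00000 | head   # inspect the first few reducer output records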
What application / service stores the data on the hadoop cluster?
HDFS
Hadoop components/Modules
HDFS (Hadoop distributed file system) - primary storage for Hadoop; provides scalable, fault-tolerant, cost-efficient storage for Big Data; runs on commodity hardware MapReduce - an execution engine YARN (yet another resource negotiator) - hadoop ecosystem module that provides resource management
NameNode
Keeps information on data (block) locations and coordinates HDFS; the location metadata lives on the NameNode but the data itself does not (it looks over all the DataNodes on all the machines)
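To ask the NameNode what it currently knows about the cluster (a sketch):
$ hdfs dfsadmin -report   # prints capacity, remaining space, and the list of live/dead DataNodes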
tools for Hadoop
Mahout (machine learning) Ambari, Ganglia, Nagios Sqoop (data exchange), Cascading (like Pig [scripting] or Hive [SQL]) Oozie (project management; workflow) Flume (streaming input into hadoop) [log collection] Protobuf, Avro, Thrift Fuse-DFS (provides Linux access to HDFS) HUE (Hadoop User Experience)
___________________ is a general-purpose computing model and runtime system for distributed data analytics.
MapReduce
Hadoop is a framework that works with a variety of related tools. Which of the following best suits it?
MapReduce, Hive, HBase
Is the Namenode running on a master node or on worker node?
Master Node
Companies and Big Data
NASA : gathers nearly 1.73 GB of data every hour Facebook : gathers 500 terabytes a day eBay : 40 PB clusters for search, consumer recommendation and merchandising
master machines
NameNode & JobTracker
Each daemon runs on a single node as a separate Java process in ___________ mode.
Pseudo Distributed mode
Hadoop achieves reliability by replicating the data across multiple hosts and hence does not require ________ storage on hosts.
RAID
HDFS allows a client to read a file that is already opened for writing.
True
things to install
VirtualBox Vagrant
where was Hadoop invented?
Yahoo! Hadoop was invented at Yahoo! and inspired by Google's GFS (Google File System) and Google's MapReduce papers
what is Big Data/ 4 V's of Big Data
a collection of data sets so large and complex that your legacy IT systems cannot handle them (terabytes, petabytes, exabytes of data). Data is considered 'Big Data' if it satisfies the four V's: Volume - size/scale of data Variety - of data; data is often unstructured or semi-structured; the different forms of data Velocity - speed of processing data Veracity - (extra, added by IBM) uncertainty of the quality of data; analysis of streaming data
what is a computer cluster?
a set of connected computers that work together so that they can be viewed as a single system - easy to add more power (simply add a new computer)
Which type of data can Hadoop deal with?
all structured, semi-structured, unstructured
Which of the following are Hadoop Daemons?
all (DataNode, NameNode, Secondary Namenode)
Which type of data is a good fit for Hadoop?
behavioral
Vagrant
can create and configure virtual development environments - lightweight - reproducible - portable - a wrapper around VirtualBox, KVM, AWS, Docker, VMWare, Hyper-V - creates identical dev environments for operations and developers - disposable environments
HCatalog
can reference data in either HBase or HDFS; makes it accessible to other tools like Pig, Hive, and MapReduce
to launch a 3 node cluster
open cmd -> go to the folder where vagrant is set up -> type $ vagrant up node1 node2 node3 (in my case: c:\HashiCorp\Vagrant>) * in my experience, though, go to the project folder you created and type $ vagrant init first, then the command above
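Cleaned up, the sequence might look like this (a sketch; my-hadoop-project is a placeholder folder name):
$ cd my-hadoop-project          # the project folder that contains (or will contain) the Vagrantfile
$ vagrant init                  # first time only: creates a Vagrantfile
$ vagrant up node1 node2 node3  # boot the three nodes
$ vagrant status                # confirm all three are running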
How to upload file to HDFS
cmd -> project folder -> $ vagrant ssh node1 [logs into node1] -> hadoop fs -put /vagrant/data/salaries.csv salaries.csv
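To verify the upload (a sketch, reusing the same example file):
$ hadoop fs -ls /user/vagrant/                   # confirm salaries.csv is listed
$ hadoop fs -du -h /user/vagrant/salaries.csv    # show its size in HDFS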
HBase
columnar data store that provides a simple interface to distributed data; can be accessed by Pig, Hive, or MapReduce to store into HDFS; reliable and durable; used for FB Messenger; stores some metadata in ZooKeeper (a scalable solution for coordinating servers)
what is graph data structure?
connected nodes with relations between them, like friends on Facebook
Hadoop runs on ____ platform
cross
Which type of example data would be best suited for Hadoop?
customer purchasing data for sales analytics Feedback: non-critical data about a customer's interactions with a business is an excellent fit for Hadoop. HDFS supports storing unstructured data in a manner that is scalable and accessible.
Which of the following has the largest Hadoop cluster?
All daemons execute in separate nodes in __________ mode.
fully-distributed mode
what is geodata?
geologic data, like GPS coordinates
Files in HDFS can be merged using which command?
getmerge (hadoop fs -getmerge)
big users of Hadoop cluster
Google, Yahoo!, IBM, Amazon, Cloudera, MapR Technologies, DataStax
what model/algorithm is not used in data science?
graph detection
Hadoop 1.x VS Hadoop 2.x
hadoop 1.x: MapReduce (cluster resource management & data processing), HDFS (redundant, reliable storage) hadoop 2.x: MapReduce (data processing) and others, YARN (cluster resource management), HDFS2 (redundant, highly-available & reliable storage)
Which of the following gets into Safemode?
hadoop dfsadmin -safemode enter
How to know the current status of Safemode?
hadoop dfsadmin -safemode get
Which of the command is used to come out of Safemode?
hadoop dfsadmin -safemode leave
To define the heap size, we use _______ config files.
hadoop-env.sh
MapReduce: mapper phase
here, the input data will be split into <key, value> pairs by the RecordReader - the key must be writable and comparable in the processing phase; the value only needs to be writable - the further processing of the data is done inside the task tracker - the combiner is called the mini reducer: the reducer code is placed in the mapper as a combiner. When the mapper output is a large amount of data, we place the reducer code in the mapper as a combiner for better performance and to solve the bandwidth problem - hash partitioning is the default partitioning used in this process
Pig
high-level language that translates down into MapReduce; allows you to write a high-level description of how the data is processed and hand it to Pig, which runs the job for you; increases productivity; a high-level platform for creating programs that run on Hadoop
_____________ jobs are optimized for scalability but not latency.
hive
Which of the following property configuration is done in mapred-site.xml?
host and port where mapreduce jobs run
Hadoop setup process
https://www.guru99.com/how-to-install-hadoop.html
Which of the following statements is false about Hadoop?
it is best for live streaming of data
Map-Reduce jobs are submitted on ____________
JobTracker
Which of the following command is used to check all the active running daemons?
jps
JobTracker
keeps track of the jobs being run (looks over all the TaskTrackers on all machines)
check size/block amount of file
ls -ahl /vagrant/data/salaries.csv *outputs something like: 1 vagrant vagrant 16M Mar 4 17:09 /vagrant/data/salaries.csv where 16M is 16 megabytes, too small to separate into multiple blocks
Hadoop works in ____________
master-worker fashion
Hadoop I/O Hadoop comes with a set of ________ for data I/O.
none of the mentioned
The total number of partitioners is equal to ____________.
the number of reducers
Ambari-server
runs on node1; it installs and configures all the nodes
The output of the mapper is sent to __________.
partitioner
Hadoop uses hadoop-metrics.properties file for ________.
performance reporting purpose
motivation for MapReduce 2
scalability bottleneck caused by having a single JobTracker. According to Yahoo!, the practical limits of such a design are reached with a cluster of 5k nodes and 40k tasks running concurrently - hadoop was designed to run MR jobs only
DataNode
sends block reports & heartbeats to the NameNode. These nodes hold copies/backups of the information being stored
The output of the mapper is first written on the local disk for sorting and _________ process.
shuffling
what is velocity?
speed of data processing
Hive
SQL-like language; like Pig, it takes a higher-level language (here SQL-like) and breaks it down into MapReduce
Big Data challenges
storage - we need to know how to store the huge data efficiently computational efficiency - the data must be stored in a way that is suitable for efficient computation data loss - due to disruption or hardware failure; we need to have a proper recovery strategy in place cost - the solution that we propose should be cost-effective
hadoop cluster
storing and analyzing huge amounts of data, both structured and unstructured - low-cost - one machine in the cluster is the NameNode and another is the JobTracker; these are the master machines - highly scalable - highly resistant to hardware failures - data has backups so it does not get lost
unzip file in cmd
$ sudo apt install unzip
$ unzip master.zip
The default InputFormat is __________, which treats each line of input as a new value with the associated key being the byte offset.
TextInputFormat
The default size of distributed cache is ___________.
10 GB
what version of java is supported by Hadoop?
8
________ is the slave/worker node and holds the user data in the form of Data Blocks.
DataNode
Slave Machines
DataNode & TaskTracker
what is the management frontend in Hortonworks called?
Ambari
which Hadoop distribution is fully open source?
Apache Hadoop and Hortonworks
What license type is the Hadoop core using?
Apache License. Projects under the Apache umbrella are released under the Apache License.
components of YARN
Client: submits MapReduce jobs ResourceManager: manages the use of resources across the cluster NodeManager: responsible for launching and monitoring compute containers on machines in the cluster MapReduce ApplicationMaster: coordinates the tasks running a MapReduce job; it runs in a container that is scheduled by the ResourceManager's YarnScheduler when the application is submitted
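Two quick commands to see these components on a running cluster (a sketch):
$ yarn node -list          # NodeManagers the ResourceManager knows about
$ yarn application -list   # applications currently tracked by the ResourceManager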
HBase uses the _______ File System to store its data.
Hadoop
___________ files deal with small file problems.
Hadoop Archive
select the correct answer
Hadoop scales horizontally and will have a lower cost per GB when storing a lot of data
Which statement is true about Hadoop?
It will commonly be used as an alternative file system
components of HDFS
NameNode: - centerpiece of HDFS - manages the file system's namespace, metadata, and file blocks - this process must be up all the time - a single point of failure (SPOF) DataNode: - the data blocks of a file are stored in a set of DataNodes; the client app gets the list of DataNodes where the data blocks of a particular file are stored from the NameNode - after this, the DataNode is responsible for serving read/write requests from the file system's clients - the DataNode stores, deletes, and replicates blocks upon instruction from the NameNode Secondary NameNode: - a cluster has, besides the primary NameNode, a Secondary NameNode - ex: the primary NN is like the main flight engine that keeps working in all situations; however, in case the primary node fails, the Secondary NN comes into the picture
During ____________ process, duplicate task is created to improve the overall execution time.
Speculative execution
Hadoop
the framework to process and analyze Big Data; a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model - supports a huge volume of data - stores data efficiently and reliably - data loss is unavoidable, so the proposed solution gives good recovery strategies - the solution should be horizontally scalable as the data grows - should be cost-effective - minimizes the learning curve: it should be easy for programmers and non-programmers - scalable (the same program runs on 1, 1000, or 4000 machines); scales linearly - simple APIs - petabytes of data - high availability - scalability - fault tolerant - economic a set of Open Source projects key attributes: - redundant and reliable (no data loss) - extremely powerful - batch-processing centric - easy to program distributed apps - runs on commodity hardware created by Doug Cutting and Mike Cafarella
MapReduce
the processing part of Hadoop; its server on a typical machine is called a TaskTracker; a core building block for processing in the Hadoop framework; Hadoop 2 moves resource management (like the infrastructure to monitor nodes, allocate resources, and schedule jobs) into YARN a programming framework that allows us to perform distributed and parallel processing on large data sets in a distributed environment - users specify the computation in terms of a map and a reduce function - the MapReduce model performs parallel computation across large-scale clusters of machines
architecture of MapReduce
the workflow of MR takes place in different phases, and the output will be stored in HDFS including replications - the job tracker takes care of the MR jobs in the hadoop cluster at different nodes - the job tracker plays a vital role in planning jobs and keeps track of all the map and reduce jobs - the task tracker is responsible for running the actual map and reduce tasks happens in 3 phases: mapper phase intermediate phase reducer phase
you have to process 5000 TB of data with Hadoop. The configuration of the data node available is: - 8 GB RAM - 25 TB HDD - 150 MB/s R/W speed you have a hadoop cluster with replication factor as 2 and block size as 64 MB How many DataNodes are required to store 5000 TB of data?
total data = replication factor * amount of data to be processed = 2 * 5000 TB = 10,000 TB DataNodes required = total data / storage per DataNode = 10,000 TB / 25 TB = 400 DataNodes
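A one-line shell check of the arithmetic (just to confirm the worked answer):
$ echo $(( 2 * 5000 / 25 ))   # prints 400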
get file from web in cmd
wget http://github.com/wardviaene/hadoop-ops-course/archive/master.zip
HDFS (Hadoop Distributed File System)
when data is uploaded here, it is divided into small blocks and distributed over the cluster; the data part of Hadoop; its server on a typical machine is called a DataNode *to upload a file to HDFS: hadoop fs -put - this registers metadata in the NameNode, which sends back lists of IP addresses of DataNodes - each DataNode holds copies of data (3x by default) - hadoop interacts with HDFS through shell commands - data transfer rate is very high - infinitely scalable - fits well with replication to provide fault tolerance and availability
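A few common FS Shell commands for interacting with HDFS (a sketch; the paths are placeholders):
$ hadoop fs -ls /                                # list the HDFS root
$ hadoop fs -get /user/vagrant/salaries.csv .    # copy a file out of HDFS to the local working directory
$ hadoop fs -rm /user/vagrant/old.csv            # delete a file from HDFS (old.csv is a placeholder)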
adding storage
when you need another replica or more storage, create another machine; each machine holds a TaskTracker and a DataNode; having multiple machines running Hadoop creates a cluster
Is the NodeManager running on a master node or on worker node?
worker node
To control HDFS replication factor, which configuration file is used?
hdfs-site.xml (the dfs.replication property)
Which of the file contains the configuration setting for NodeManager and ResourceManager?
yarn-site.xml
