Data Science - Hadoop Ecosystem & Security


look at more details of a file (blocks, nodes)

$ hdfs fsck /user/vagrant/salaries.csv
this tells you more info about the file, including how many blocks it contains
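A slightly more detailed check, as a sketch: the -files, -blocks, and -locations flags are standard hdfs fsck options, and the path is just the example file from this card.

$ hdfs fsck /user/vagrant/salaries.csv -files -blocks -locations   # list each block and which DataNodes hold its replicas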

remove machines and all data

$ vagrant destroy

suspend machines

$ vagrant suspend (when you don't need the machines for now; run $ vagrant resume when you want to use them again)

replication of nodes

- HDFS provides a reliable way to store large data in a distributed environment as data blocks
- blocks are also replicated to provide fault tolerance
- the default replication factor is 3 (commands to check and change this are sketched below)
- the first replica is placed on the local rack
- the next two replicas are stored on a different (remote) rack, each on a different data node within that rack
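A minimal sketch of checking and changing replication from the shell, assuming the salaries.csv example file used elsewhere in these notes:

$ hdfs getconf -confKey dfs.replication                 # show the configured default replication factor
$ hdfs dfs -setrep -w 2 /user/vagrant/salaries.csv      # set this file's replication factor to 2 and wait until done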

YARN (Yet Another Resource Negotiator)

- a Hadoop ecosystem module that provides resource management
- also called the operating system of Hadoop
- responsible for managing and monitoring workloads
- it's a layer that separates the resource management layer from the processing components layer

vagrantfile

- a project has a Vagrantfile with the configuration
- it marks the root directory of your project
- it describes the kind of machine and resources needed for the project to run and what software to install
- to change network settings, memory, or CPU, open the Vagrantfile in Notepad (or equivalent) and make changes
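A small follow-up sketch: after editing the Vagrantfile, restart the machine so the new settings take effect (vagrant reload is a standard command; node1 is the node name assumed from the cluster example later in these notes).

$ vagrant reload node1   # restart node1 and re-apply the edited Vagrantfile configuration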

HDFS files and blocks

- the Hadoop application is responsible for distributing the data blocks across multiple nodes
- files are split into chunks and chunks are stored in blocks
- blocks are managed by the NameNode and stored on DataNodes
- blocks are traditionally either 64 MB or 128 MB
- the default block size is 128 MB (a quick way to check is sketched below)
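A quick check of block size and replication from the shell, assuming the example file path used in these notes (hadoop fs -stat and hdfs getconf are standard commands):

$ hdfs getconf -confKey dfs.blocksize                      # configured default block size in bytes (134217728 = 128 MB)
$ hadoop fs -stat "%o %r %n" /user/vagrant/salaries.csv    # this file's block size, replication factor, and name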

why traditional database is not a solution?

- it is NOT horizontally scalable because we can't add resources or more computational nodes
- a traditional database is designed for structured data, so it is not a good choice when we have a variety of data

MapReduce: Reducer phase

- the shuffled and sorted data is taken as input for the reducer phase
- the input data is combined and values belonging to the same key are aggregated; the record writer then writes the reducer's output to HDFS
- after the reducer logic is applied, the output appears as part-r-00000, part-r-00001, etc.
- in the mapred-site.xml configuration file, we can set properties that control the number of reducers for a specific task (see the sketch below)
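A minimal sketch of setting the reducer count per job rather than in mapred-site.xml: mapreduce.job.reduces is the standard property, while the examples jar path varies by installation, so treat it as an assumption.

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount -D mapreduce.job.reduces=4 input output   # run wordcount with 4 reducers; output is part-r-00000 .. part-r-00003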

MapReduce: Intermediate Phase

- the shuffling and sorting of data occurs in this phase
- the intermediate data is produced after some computations
- Hadoop uses a round-robin algorithm to write intermediate data to local disk

Which of the following genres does Hadoop produce?

Distributed-File System

HDFS provides a command line interface called __________ used to interact with HDFS.

FS Shell

Data flow in MapReduce

first: a MapReduce job is a unit of work that the client wants to be performed
second: a MapReduce job consists of input data, the MapReduce program, and configuration information
then: Hadoop runs the job by dividing the task into phases: mapping and reducing
input data stored on HDFS -> InputSplit -> RecordReader -> Mapper -> Combiner -> Partitioner -> shuffling and sorting -> Reducer -> output data stored on HDFS
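A minimal end-to-end sketch of this flow using the stock wordcount example; the jar path and the input file are assumptions, the HDFS shell commands themselves are standard.

$ hadoop fs -mkdir -p input
$ hadoop fs -put /vagrant/data/salaries.csv input/
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount input output
$ hadoop fs -cat output/part-r-00000 | head   # inspect the reducer output stored on HDFS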

What application / service stores the data on the hadoop cluster?

HDFS

Hadoop components/Modules

HDFS (Hadoop Distributed File System) - primary storage for Hadoop; provides scalable, fault-tolerant, cost-efficient storage for Big Data; runs on commodity hardware
MapReduce - the execution engine
YARN (Yet Another Resource Negotiator) - the Hadoop ecosystem module that provides resource management

NameNode

keeps information on data location and coordination for HDFS; the location metadata lives on the NameNode, but not the data itself (it looks over all the DataNodes on all the machines)

tools for Hadoop

Mahout (machine learning)
Ambari, Ganglia, Nagios (management and monitoring)
Sqoop (data exchange)
Cascading (like Pig [scripting] or Hive [SQL])
Oozie (project management; workflow)
Flume (streaming input into Hadoop) [log collection]
Protobuf, Avro, Thrift (serialization)
Fuse-DFS (provides Linux access to HDFS)
HUE (Hadoop User Experience)

___________________ is a general-purpose computing model and runtime system for distributed data analytics.

MapReduce

Hadoop is a framework that works with a variety of related tools. Which of the following best suits it?

MapReduce, Hive, HBase

Is the Namenode running on a master node or on worker node?

Master Node

Companies and Big Data

NASA: gathers nearly 1.73 GB of data every hour
Facebook: gathers 500 terabytes a day
eBay: 40 PB clusters for search, consumer recommendation and merchandising

master machines

NameNode & JobTracker

Each daemon runs on a single node as a single Java process in ___________ mode.

Pseudo Distributed mode

Hadoop achieves reliability by replicating the data across multiple hosts and hence does not require ________ storage on hosts.

RAID

HDFS allows a client to read a file that is already opened for writing.

True

things to install

VirtualBox Vagrant

where was Hadoop invented?

Yahoo!
Hadoop was invented at Yahoo! and was inspired by Google's GFS (Google File System) and MapReduce papers

what is Big Data/ 4 V's of Big Data

a collection of data sets so large and complex that your legacy IT systems cannot handle them (terabytes, petabytes, exabytes of data). Data is considered 'Big Data' if it satisfies the V's:
Volume - the size/scale of data
Variety - the different forms of data; data is often unstructured or semi-structured
Velocity - the speed of processing data
Veracity - (extra, added by IBM) the uncertainty of the quality of data; analysis of streaming data

what is a computer cluster?

a set of connected computers that work together that could be viewed as a single system - easy to add more power (simply add a new computer)

Which type of data can Hadoop deal?

all structured, semi-structured, unstructured

Which of the following are Hadoop Daemons?

all (DataNode, NameNode, Secondary Namenode)

Which type of data is a good fit for Hadoop?

behavioral

Vagrant

can create and configure virtual development environments
- lightweight
- reproducible
- portable
- a wrapper around VirtualBox, KVM, AWS, Docker, VMware, Hyper-V
- creates identical dev environments for operations and developers
- disposable environments

HCatalog

can reference data in either HBase or HDFS; makes it accessible to tools like Pig, Hive, and MapReduce

to launch a 3 node cluster

cmd -> go to the folder where Vagrant is installed -> type $ vagrant up node1 node2 node3 (in my case: c:\HashiCorp\Vagrant>)
* in my experience though, go to the project folder you created, type $ vagrant init, then run the above

How to upload file to HDFS

cmd -> project folder -> $ vagrant ssh node1 (logs into node1) -> $ hadoop fs -put /vagrant/data/salaries.csv salaries.csv
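A quick sketch of verifying the upload with standard HDFS shell commands (salaries.csv is the file name used above):

$ hadoop fs -ls                  # list the HDFS home directory; salaries.csv should appear
$ hadoop fs -tail salaries.csv   # print the last kilobyte of the uploaded file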

HBase

columnar data store; provides a simple interface to distributed data; can be accessed by Pig, Hive, or MapReduce to store data into HDFS
reliable and durable; used for Facebook Messenger
stores some metadata in ZooKeeper (a scalable solution for coordinating servers)

what is graph data structure?

connected nodes with relations between them, like friends on Facebook

Hadoop runs on ____ platform

cross

Which type of example data would be best suited for Hadoop?

customer purchasing data for sales analytics
non-critical data about a customer's interactions with a business is an excellent fit for Hadoop; HDFS supports storing unstructured data in a manner that is scalable and accessible

Which of the following has the largest Hadoop cluster?

Facebook

All daemons execute in separate nodes in __________ mode.

fully-distributed mode

what is geodata?

geographic data, like GPS coordinates

All Files in HDFS can be merged using which of the command?

getmerge ($ hadoop fs -getmerge)
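A minimal usage sketch: getmerge concatenates every file under an HDFS directory into one local file; the paths here are assumptions based on the wordcount example earlier in these notes.

$ hadoop fs -getmerge output /vagrant/data/merged.txt   # merge all part-r-* files in 'output' into a single local file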

big users of Hadoop cluster

Google, Yahoo!, IBM, Amazon
Cloudera, MapR Technologies, DataStax

what model/algorithm is not used in data science?

graph detection

Hadoop 1.x VS Hadoop 2.x

Hadoop 1.x: MapReduce (cluster resource management & data processing), HDFS (redundant, reliable storage)
Hadoop 2.x: MapReduce (data processing) and others, YARN (cluster resource management), HDFS2 (redundant, highly-available & reliable storage)

Which of the following gets into Safemode?

hadoop dfsadmin -safemode enter

How to know the current status of Safemode?

hadoop dfsadmin -safemode get

Which of the command is used to come out of Safemode?

hadoop dfsadmin -safemode leave
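A quick sketch of the full safemode cycle using the current hdfs form of the same admin command (hdfs dfsadmin replaces the deprecated hadoop dfsadmin):

$ hdfs dfsadmin -safemode enter   # put the NameNode into safemode
$ hdfs dfsadmin -safemode get     # check the current status
$ hdfs dfsadmin -safemode wait    # block until safemode is off (useful in scripts)
$ hdfs dfsadmin -safemode leave   # come out of safemode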

To define the heap size, we use _______ config files.

hadoop-env.sh

MapReduce: mapper phase

- here, the input data is split into <key, value> pairs by the RecordReader
- the key must be writable and comparable in the processing phase; the value only needs to be writable
- the further processing of data is done inside the task tracker
- the combiner is called the mini reducer: the reducer code is placed in the mapper as a combiner. When the mapper output is a large amount of data, we place the reducer code in the mapper as a combiner for better performance, to reduce the bandwidth problem
- hash partitioning is the default partitioning used in this process

Pig

a high-level language that translates down into MapReduce; allows you to write a high-level description of how the data is processed and give it to Pig, which runs the application for you; increases productivity; a high-level platform for creating programs that run on Hadoop

_____________ jobs are optimized for scalability but not latency.

Hive

Which of the following property configuration is done in mapred-site.xml?

host and port where mapreduce jobs run

Hadoop setup process

https://www.guru99.com/how-to-install-hadoop.html

Which of the following statement is false about Hadoop?

it is best for live streaming of data

Map-Reduce jobs are submitted on ____________

JobTracker

Which of the following command is used to check all the active running daemons?

jps

JobTracker

keeps track of the jobs being run (looks over all the TaskTrackers on all machines)

check size/block amount of file

$ ls -ahl /vagrant/data/salaries.csv
* outputs something like: 1 vagrant vagrant 16M Mar 4 17:09 /vagrant/data/salaries.csv, where 16M is 16 megabytes, too small to be split into multiple blocks

Hadoop works in ____________

master-worker fashion

Hadoop I/O: Hadoop comes with a set of ________ for data I/O.

none of the mentioned

The total number of partitioner is equal to ____________.

the number of reducers

Ambari-server

runs on node1; it installs and configures all the nodes

The output of the mapper is sent to __________.

partitioner

Hadoop uses hadoop-metrics.properties file for ________.

performance reporting purpose

motivation for MapReduce 2

- the scalability bottleneck caused by having a single JobTracker; according to Yahoo!, the practical limits of such a design are reached with a cluster of 5,000 nodes and 40,000 tasks running concurrently
- Hadoop 1.x was designed to run MapReduce jobs only

DataNode

sends block reports & heartbeats to the NameNode. DataNodes hold the copies/backups of the data being stored

The output of the mapper is first written on the local disk for sorting and _________ process.

shuffling

what is velocity?

speed of data processing

Hive

an SQL-like language; like Pig, it takes a higher-level language (here SQL-like) and breaks it down into MapReduce

Big Data challenges

storage - we need to know how to store the huge data efficiently
computational efficiency - the data must be stored in a way that is suitable for computation
data loss - due to disruption or hardware failure; we need to have a proper recovery strategy in place
cost - the solution that we propose should be cost-effective

hadoop cluster

- stores and analyzes huge amounts of data, both structured and unstructured
- low-cost
- one machine in the cluster is the NameNode; another is the JobTracker; these are the master machines
- highly scalable
- highly resistant to hardware failures
- data has backups so it does not become lost

unzip file in cmd

$ sudo apt install unzip
$ unzip master.zip

The default InputFormat is __________, which treats each line of input as a new value; the associated key is the byte offset.

TextInputFormat

The default size of distributed cache is ___________.

10 GB

what version of java is supported by Hadoop?

8

________ is the slave/worker node and holds the user data in the form of Data Blocks.

DataNode

Slave Machines

DataNode & TaskTracker

what is the management frontend in Hortonworks called?

Ambari

which Hadoop distribution is fully open source?

Apache Hadoop and Hortonworks

What license type is the Hadoop core using?

Apache License. Projects under the Apache umbrella are released under the Apache License.

components of YARN

Client: submits MapReduce jobs
ResourceManager: manages the resources across the cluster
NodeManager: responsible for launching and monitoring the compute containers on machines in the cluster
MapReduce ApplicationMaster: coordinates the tasks of a MapReduce job; the ResourceManager arranges the container in which an application's ApplicationMaster runs when the YARN scheduler schedules the application
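A small sketch of inspecting these components on a running cluster; yarn node -list and yarn application -list are standard YARN CLI commands that query the ResourceManager.

$ yarn node -list           # show the NodeManagers registered with the ResourceManager
$ yarn application -list    # show applications currently tracked by the ResourceManager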

HBase uses the _______ File System to store its data.

Hadoop

___________ files deal with small file problems.

Hadoop Archive

select the correct answer

Hadoop scales horizontally and will have a lower cost per GB when storing a lot of data

Which statement is true about Hadoop?

It will commonly be used as an alternative file system

components of HDFS

NameNode:
- the centerpiece of HDFS
- manages the file system's namespace, metadata, and file blocks
- this process must be up all the time
- a single point of failure (SPOF)
DataNode:
- the data blocks of a file are stored in a set of DataNodes; the client app gets the list of DataNodes where the data blocks of a particular file are stored from the NameNode
- after this, the DataNode is responsible for serving read/write requests from the file system's clients
- the DataNode stores, deletes, and replicates blocks upon instruction from the NameNode
Secondary NameNode:
- each cluster has more than one NameNode process: the secondary NameNode in addition to the primary NameNode
- ex: the primary NN is like the main flight engine that continues to work in all situations; however, in case the primary node fails, the secondary NN comes into the picture
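A quick sketch of seeing this from the command line; hdfs dfsadmin -report is a standard admin command that asks the NameNode for the state of the cluster.

$ hdfs dfsadmin -report   # prints overall capacity plus per-DataNode status, capacity, and last heartbeat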

During ____________ process, duplicate task is created to improve the overall execution time.

Speculative execution

Hadoop

the framework to process and analyze Big Data; a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model
- supports huge volumes of data (petabytes)
- stores data efficiently and reliably; data loss is unavoidable, so the solution gives good recovery strategies
- horizontally scalable as the data grows; scales linearly (the same program runs on 1, 1,000, or 4,000 machines)
- cost-effective
- minimizes the learning curve: it should be easy for programmers and non-programmers
- simple APIs
- high availability, scalability, fault tolerance, economy
a set of Open Source projects
key attributes:
- redundant and reliable (no data loss)
- extremely powerful
- batch-processing centric
- easy to program distributed apps
- runs on commodity hardware
created by Doug Cutting and Mike Cafarella

MapReduce

the processing part of Hadoop; its server on a typical machine is called a TaskTracker; one of the core building blocks of processing in the Hadoop framework
(Hadoop 2 moves resource management, like the infrastructure to monitor nodes, allocate resources, and schedule jobs, into YARN)
a programming framework that allows us to perform distributed and parallel processing on large data sets in a distributed environment
- users specify the computation in terms of a map and a reduce function
- the MapReduce model performs parallel computation across large-scale clusters of machines

architecture of MapReduce

the workflow of MR takes place in different phases, and the output is stored in HDFS, including replications
- the JobTracker takes care of the MR jobs in the Hadoop cluster across the different nodes
- the JobTracker plays a vital role in the planning of jobs and keeps track of the whole set of map and reduce jobs
- the TaskTracker is responsible for actually running the map and reduce tasks
happens in 3 phases: mapper phase, intermediate phase, reducer phase

you have to process 5000 TB of data with Hadoop. The configuration of the data node available is: - 8 GB RAM - 25 TB HDD - 150 MB/s R/W speed you have a hadoop cluster with replication factor as 2 and block size as 64 MB How many DataNodes are required to store 5000 TB of data?

total data = replication factor * amount of data to be processed = 2 * 5000 TB = 10,000 TB
DataNodes = total data / storage per DataNode = 10,000 / 25 = 400 DataNodes

get file from web in cmd

wget http://github.com/wardviaene/hadoop-ops-course/archive/master.zip

HDFS (Hadoop Distributed File System)

when data is uploaded here, it is divided into small blocks and distributed over the cluster; the data part of Hadoop; its server on a typical machine is called a DataNode
* to upload a file to HDFS: hadoop fs -put - this puts metadata in the NameNode, which sends back a list of IP addresses of DataNodes
- each DataNode holds copies of data (3x by default)
- you interact with HDFS via shell commands
- the data transfer rate is very high
- infinitely scalable
- fits well with replication to provide fault tolerance and availability

adding storage

when you need another duplicate or more storage, create another machine; a machine holds a TaskTracker and DataNode; having multiple machines with Hadoop creates a cluster

Is the NodeManager running on a master node or on worker node?

worker node

To control HDFS replication factor, which configuration file is used?

hdfs-site.xml (the dfs.replication property)

Which of the file contains the configuration setting for NodeManager and ResourceManager?

yarn-site.xml

