Big Data 2 Exam 3

Procedural language

Instructions are arranged in line within modules, functions, etc. Can manipulate data.

HIVE metadata is stored externally in the metastore

TRUE

Synchronization

Keeps the multiple processes executing concurrently in a cluster running smoothly

NameNode

Keeps track of where data is physically stored

Hadoop MapReduce

MapReduce is used to indicate the process, but it is made of two distinct pieces:
--the algorithm (the "true" MapReduce), and
--the implementation of the algorithm and its environment
Hadoop MapReduce specifically refers to the implementation developed by the Apache Hadoop project

Scheduling

MapReduce jobs get broken down into individual tasks for the map and the reduce portions of the application

The _____________ node regulates file access in Hadoop

Name

ApplicationMaster

Notifies the ResourceManager if more resources are necessary to support the running application

Partitioner and a sort

Perform the gathering and shuffling of intermediate results

Algorithm

A series of steps that need to occur in service to an overall goal

Sqoop

Bulk import, direct input and data export are key features of this tool in the Hadoop ecosystem

reduce

Can't begin until all the mapping is done and isn't finished until all instances are complete

Sending reports to the NameNode about availability and performing checksum validations are tasks performed by the __________________

Data Node

A cluster manager is a versatile, resilient, clustered approach to big data file management

FALSE

A decision programming structure causes a program to execute the same code repetitively until told to stop

FALSE

Checksum validation is a type of encryption

FALSE

HDFS and MapReduce perform their work on nodes in a cluster hosted on racks of virtual machines

FALSE

HIVE was developed to have a very rapid query response time

FALSE

KEEP IT WARM is a MapReduce guideline that means that the system should have backup data nodes

FALSE

Process control is a key Zookeeper capability

FALSE

The boss node of Hadoop is known as the "traffic cop"

FALSE

A _______________ is a set of instructions that do a specific task, often needing information passed to it in variables

Function

Syntax errors are the likely problem if the program is not running

TRUE

Output format

Takes the key-value pair and organizes the output for writing to HDFS

Record writer

Takes the output format data and writes it to HDFS in the form necessary for the requirements of the application program

reduce function

Takes the output of a map function and "reduces" the list in whatever fashion the programmer desires

Hadoop Foundation and Ecosystem

HDFS and MapReduce provide the foundation for Hadoop. Hadoop has several subservices that run specialized functions for the ecosystem.

checksum validation

an error detection method where the sender assigns a numeric value to a string depending on the number of bits it contains, and the receiver calculates the number to ensure it is correct.
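
As a rough illustration of the idea (a toy sketch, not the CRC-based scheme HDFS actually uses), a sender/receiver checksum exchange in Python might look like this:

def checksum(data: bytes) -> int:
    # Toy checksum: a value that depends on the number of 1-bits in the string.
    return sum(bin(byte).count("1") for byte in data)

# The sender computes the value and transmits it alongside the data...
message = b"hello hadoop"
sent_value = checksum(message)

# ...and the receiver recalculates the number to ensure it is correct.
received = b"hello hadoop"
assert checksum(received) == sent_value, "checksum mismatch: data corrupted"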

Programs

must be syntactically correct to run; must be logically correct to produce reasonable output

Data node

sends reports to the NameNode about availability and performs checksum validations

Object oriented languages

use a different type of focus (objects) to achieve the same goal. Can manipulate input data.

MapReduce

Designed as a programming model combined with the implementation of that model--in essence, a reference implementation

Common Applications

Developed as procedural programs or object-oriented programs. Have formal step-by-step instructions that facilitate the needs of the application. Examples: Java, C++, COBOL, VB

Functional programs

Don't manipulate the data; they interpret the data by analyzing it for trends and patterns and then assembling the important elements into lists. Each operation is independent, so the order of processing is not as important. Examples: R, LISP, Prolog
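
A minimal Python sketch of that style (standing in for R or LISP): each operation builds a new list from the input rather than changing it, so the steps could run in any order:

data = [3, 7, 1, 9, 4]

# Each operation produces a NEW list; the original data is never mutated.
doubled = list(map(lambda x: x * 2, data))     # [6, 14, 2, 18, 8]
large = list(filter(lambda x: x > 3, data))    # [7, 9, 4]

print(data)  # [3, 7, 1, 9, 4] -- untouched, so repeated reads stay consistent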

Keeping the data and the code together is one of the best optimizations for MapReduce performance

TRUE

One important difference between Hbase tables and RDBMS tables is versioning

TRUE

Programs must be logically correct to produce reasonable output

TRUE

The output of reduce is also a key and a value

TRUE

YARN provides global resource management in the Hadoop environment

TRUE

Zookeeper provides resilient, fault-tolerant distributed applications in the Hadoop environment

TRUE

HIVE uses 3 mechanisms for data organization

Tables, Partitions, Buckets

Checksum validations

Used to guarantee the contents of files in HDFS

NameNode

Uses a "rack id" to keep track of the data nodes in the cluster

Rebalancer

When you add new nodes, HDFS will not rebalance automatically. However, HDFS provides a _____ tool that can be invoked manually.

Fault/error handling

all programs *should* have error handling built in so that the system can properly react when a failure or error occurs (such as assigning a new node to complete a failed node's processes)

Transaction log

includes a list of every operation recorded that supports data integrity

Sqoop (SQL to Hadoop)

is the ETL process in the Hadoop system. It is able to work on non-Hadoop data sets to enable them to be manipulated in the Hadoop environment
--executed at the command line (meaning coding is required)
--highly functional, including the ability to examine a data source and determine the appropriate mode of transfer for it
--interacts with Hive and Hbase

Hbase

A columnar (non-relational) database that can hold billions of rows layered across Hadoop clusters
--provides real-time access to data
--highly configurable
--tracks changes by versioning the data, where the version is a timestamp attribute
--organized somewhat like a taxonomy to make searching more efficient (e.g., employee, type of employee, specific employee)

Pipeline

A connection between multiple data nodes that exists to support the movement of data across the servers

Function

A function is a set of instructions that do a specific task, often needing information passed to it in variables so it can perform the task. Functions usually have output back to the program, also usually passed back through variables.
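
A minimal Python example of that pattern (the function name and variables are invented for illustration):

def rectangle_area(width, height):
    # Information is passed in through the variables width and height...
    result = width * height
    # ...and the output is passed back to the caller, also through a variable.
    return result

room = rectangle_area(4, 5)  # room now holds 20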

map function

A function that generates an output list from an input list by applying a function to each element in the input list.
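
In Python terms (a sketch, not Hadoop's API), such a function can be written as:

def apply_map(func, input_list):
    # Apply func to each element of the input list to build the output list.
    return [func(item) for item in input_list]

squares = apply_map(lambda x: x * x, [1, 2, 3, 4])  # [1, 4, 9, 16]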

MapReduce

A program based on two functions - the map function and the reduce function. Allows huge sets of data to be worked with at the same time over a number of nodes

HIVE

A relational data warehouse layer that allows SQL-savvy users to interact directly with structured data (via HiveQL) while retaining the ability to implement analysis with MapReduce
--not fast, but extensive and more scalable than a traditional data warehouse
--allows for data to be partitioned (direct access to a subset of data via a directory) or stored as buckets (files stored in the partition directory)

NameNode

Acts as a "traffic cop"

MapReduce-Reduce Function

After the map for each node has been created, the reduce function is used:
reducer(word, values)   [from the master list compiled from all maps]
  for each value in values: sum = sum + value
  emit(word, sum)   [a new list of all unique words and the number of times they appear in the data set]
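
That pseudocode translates into a runnable Python sketch, assuming the shuffle and sort have already grouped each word with its list of counts:

def reducer(word, values):
    # Sum the counts gathered for this word from all of the maps.
    total = 0
    for value in values:
        total = total + value
    return (word, total)  # "emit (word, sum)"

shuffled = {"bear": [1, 1], "deer": [1], "river": [1, 1, 1]}
counts = [reducer(word, vals) for word, vals in shuffled.items()]
# [('bear', 2), ('deer', 1), ('river', 3)]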

Distributed file systems

--Be sure there is a redundant master node that is ready if the main master node fails
--Distributed files should be as large as possible, with a minimum number of nodes required
--Bandwidth concerns are more about sustained throughput than supporting quick mapping and reducing functions; coding to optimize streaming data during reads and writes enhances the overall system
--Security is a necessary evil: too much causes performance degradation; not enough may leave data vulnerable. Authorization is the primary security means for MapReduce, as it is more likely to suffer a local issue than an outside attack

Output collector

Collects the output from the independent mappers and passes it to the reducers

Programming

Composed of structures that accomplish different tasks and a syntax that allows the programmer to communicate with the computer

The step between the mapping function and the reduce function in MapReduce is called splitting

FALSE

YARN uses a unique key to ensure that all of the processing is related to solving the same problem

FALSE

HDFS

Hadoop Distributed File System

Name Node

Is a master server that manages files and regulates access:
--opening, closing, and renaming files and directories
The name node should be replicated in case of failure

HDFS (Hadoop Distributed File System)

Is an approach to data/file management, NOT a storage facility.
HDFS facilitates the process of managing data for easy access (write once, read many), allowing for greater coherency and increased throughput.
Portable across platforms.
HDFS is a collection of clusters.

Data Node

Is an element that manages its local block:
--read and write request management
--block creation, deletion, and replication when directed by the Name Node
One file may be distributed across many blocks, so constant communication from the data nodes to the name node is critical

Origins of MapReduce

It was evident that the resources necessary to support users were not going to keep up as the number of users expanded. The idea of distributed computing, enabling bigger applications and data sources across a network of cheaper computers (called a cluster), was a solution; however, it was necessary but insufficient. The work distribution had to happen in parallel in order to support:
--processing that needed to expand or contract as necessary (scalable)
--processing that was reliable (redundancy)
--ease of development of services without regard to where resources are physically located
MapReduce was developed to be a generic programming model capable of:
--parallel execution
--fault tolerance
--load balancing
--data manipulation
MapReduce was named for its two functions, which were already common programming functions: mapping and reducing

Encapsulation

Large sets of code can be organized to allow for reuse for certain tasks (encapsulation). Very often these include a type of code cluster called a function.

Hadoop

MapReduce addresses many of the challenges of working with big data; however, it needs an environment in which to work. Hadoop provides the distributed file system framework in which MapReduce works. Hadoop works on data of any structure. Hadoop self-manages such that constant revision of the environment occurs to maximize efficiencies and minimize problems caused by errors or failures.

NodeManager

Monitors the application's usage of CPU, disk, network, and memory and reports back to the ResourceManager

MapReduce distribution of work

Must be performed in parallel for the following three reasons:
--the processing must be able to expand and contract automatically
--the processing must be able to proceed regardless of failures in the network or the individual systems
--developers leveraging this approach must be able to create services that are easy for other developers to leverage; therefore, the approach must be independent of where the data and computations execute

A ___________ is a connection between multiple data nodes that exists to support the movement of data across the servers

Pipeline

MapReduce Optimization

Programming code can be used to implement some optimization, particularly with regard to reliability and performance. Other means to increase reliability and performance include:
--appropriate infrastructure, particularly by physically organizing servers to allow the best speed and reliability possible at the rack level
--appropriate infrastructure, particularly with regard to distributed file systems that increase resources over those available on an individual machine (virtualization, in this case in a master-slave style): slave nodes store data, and master nodes "call" the nodes as requests are received
--ensuring that synchronization programming immediately copies mapping results to the reducing nodes so that the reduce processing can begin immediately

Reporter function

Provides information gathered from map tasks so you know when and if the map tasks are complete

Record reader

Record writer in reverse

MapReduce Foundational Behaviors

SCHEDULING - self-manages the number of tasks and the number of nodes so that all mapping occurs prior to reducing
SYNCHRONIZING - self-manages the tasks by holding task results in limbo until all have completed; once tasks are completed, maps are placed in a "shuffle and sort" area
CODE/DATA COLOCATION - because there is enhanced efficiency when the data and the code reside in the same node, a copy of the code is sent to each node
FAULT/ERROR HANDLING - all programs *should* have error handling built in so that the system can properly react when a failure or error occurs, such as assigning a new node to complete a failed node's processes (see the sketch below)
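
A toy Python sketch of the fault-handling idea (execute_on and its failure rate are invented for illustration): when a node fails, the task is simply reassigned to another node:

import random

def execute_on(node, task):
    # Stand-in for remote execution; nodes fail at random here.
    if random.random() < 0.3:
        raise RuntimeError(node + " failed")
    return task + " completed on " + node

def run_with_failover(task, nodes):
    # React to a failure by assigning the task to a new node.
    for node in nodes:
        try:
            return execute_on(node, task)
        except RuntimeError:
            continue  # this node failed; try the next one
    raise RuntimeError("no node could complete " + task)

print(run_with_failover("map-task-7", ["node1", "node2", "node3"]))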

Structures include:

SEQUENCE - a set of instructions that are performed one instruction at a time in the order stated
SELECTION (or decision) - a set of instructions that are performed according to the outcome of a question
LOOP - a set of instructions that are performed iteratively until something tells it to stop
All programs are a combination of the above high-level structures, as the sketch below shows.
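
All three structures in a few lines of Python:

total = 0                 # SEQUENCE: statements run one at a time, in order
numbers = [4, 7, 2]

for n in numbers:         # LOOP: repeat until the list is exhausted
    if n > 3:             # SELECTION: act on the outcome of a question
        total = total + n

print(total)              # 11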

Foundational behaviors of MapReduce

Scheduling, Synchronization, Code/data colocation, Fault/error handling

Pig programs can be run in three different ways

Script, Grunt, Embedded

HDFS clusters

Sometimes referred to as being "rack-aware"

Apache Hadoop Project

The process begins with a user request to run MapReduce and continues until the results are written to the HDFS. HDFS and MapReduce depend on clustered nodes over multiple servers. The mapping function works on each input pair; as pairs are found they go to the Output Collector, and as the Output Collector fills, the data goes to the Partitioner. The Partitioner determines the "bucket" into which each pair goes (e.g., bear, deer). This is an iterative process, along with sorting, that interacts with all nodes. The reduce function gathers pairs until all processing is done and then performs the reduce. After reduction, the results are formatted for output and written to the HDFS.

A list of every operation recorded that supports data integrity is located in a ___________________

Transaction log

HDFS capabilities to support data integrity

Transaction logs; checksum validations; additional tasks including:
--detailed metadata about the files: in which blocks they reside, how they have changed, who has access, how many there are, what nodes exist in the clusters, and where critical information such as the transaction logs resides
--functions by the data nodes, including storage and retrieval of data blocks, storage of metadata, checksum validations, activity reports to the name node, provision of metadata and data on demand to authorized user applications, and restructuring of data as appropriate to maximize efficiencies
--a supporting environment for MapReduce

YARN

YARN (Yet Another Resource Negotiator) is a core service that provides resource and per-application management:
--resource management includes a scheduler that dynamically allocates resources according to the pre-set needs of the application
--application management includes a notifier that activates when additional resources are required by the application

Zookeeper

Allows the distributed environment to work smoothly with few faults:
--synchronizes processes such that they occur in the proper order by starting and stopping nodes as appropriate
--ensures proper configuration of resources and maintains configuration consistency
--assigns a node to be a leader that then interacts with the application
--supports effective messaging among nodes

Pig and Pig Latin

An environment that supports development-like activities by non-developers. Pig is the support environment (script based) for Pig Latin, the language that allows loading and processing of input data. Pig is also capable of producing map and reduce processes, so the user is not required to know how to do so. Because Pig is a simple environment and the language is easily learned (similar to SQL), it is relatively easy for less technical end users to examine data, test a small set, and approve jobs before involving huge sets of data.

LISP

artificial intelligence language

MapReduce-Mapping Function

Has its basis in an artificial intelligence development language (LISP). Unlike other programming languages that may manipulate and change the structure of data, functional languages DO NOT: functional languages create new STRUCTURES that become the output of the program. The advantage of this is that the original data remains untouched, allowing multiple accesses of the data without concern for consistency issues. Reading and writing data is not necessary in the traditional sense, so where the data resides does not concern the programming.
MapReduce uses word counting as its classic example. You must use a key pair for the mapping. Here is a simple word count map function:
mapper(filename, file-contents)
  for each word in file-contents
    emit(word, 1)
This will look through every file designated and compile a list of (word, 1) pairs.
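
The pseudocode above as a runnable Python sketch (a toy stand-in, not Hadoop's actual mapper API):

def mapper(filename, file_contents):
    # Emit a (word, 1) pair for every word found in the file's contents.
    return [(word, 1) for word in file_contents.split()]

pairs = mapper("doc1.txt", "deer bear river bear")
# [('deer', 1), ('bear', 1), ('river', 1), ('bear', 1)]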

Code/data colocation

because there is enhanced efficiency when the data and the code reside in the same node, a copy of the code is sent to each node

Metadata

data about data

Name Node

regulates file access in Hadoop

What does a Block server do?

--stores and retrieves the data blocks in the local file system of the server
--stores the metadata of a block in the local file system
--performs periodic validations of file checksums
--sends regular reports to the NameNode
--provides metadata
--forwards data to other data nodes

