Big Data 2 Exam 3
Procedural language
Instructions execute in order and are organized into modules, functions, etc. Can manipulate data.
Bulk import, direct input, and data export are key features of which tool in the Hadoop ecosystem?
Sqoop
HIVE metadata is stored externally in the metastore
TRUE
Synchronization
Keeps the multiple processes executing concurrently in a cluster running smoothly
NameNode
Keeps track of where data is physically stored
Hadoop MapReduce
MapReduce is used to indicate the process, but it is made of two distinct pieces: --the algorithm (the "true" MapReduce), and --the implementation of the algorithm and its environment. Hadoop MapReduce specifically refers to the implementation developed by the Apache Hadoop project
Scheduling
MapReduce jobs get broken down into individual tasks for the map and the reduce portions of the application
The _____________ node regulates file access in Hadoop
Name
ApplicationMaster
Notifies the ResourceManager if more resources are necessary to support the running application
Partitioner and a sort
Perform the gathering and shuffling of intermediate results
Algorithm
A series of steps that need to occur in service to an overall goal
reduce
Can't begin until all the mapping is done and isn't finished until all instances are complete
Sending reports to the NameNode about availability and performing checksum validations are tasks performed by the __________________
Data Node
A cluster manager is a versatile, resilient, clustered approach to big data file management
FALSE
A decision programming structure causes a program to execute the same code repetitively until told to stop
FALSE
Checksum validation is a type of encryption
FALSE
HDFS and MapReduce perform their work on nodes in a cluster hosted on racks of virtual machines
FALSE
HIVE was developed to have a very rapid query response time
FALSE
"Keep it warm" is a MapReduce guideline that means that the system should have backup data nodes
FALSE
Process control is a key Zookeeper capability
FALSE
The boss node of Hadoop is known as the "traffic cop"
FALSE
A _______________ is a set of instructions that do a specific task, often needing information passed to it in variables
Function
Output format
Takes the key-value pair and organizes the output for writing to HDFS
Record writer
Takes the output format data and writes it to HDFS in the form necessary for the requirements of the application program
reduce function
Takes the output of a map function and "reduces" the list in whatever fashion the programmer desires
Hadoop Foundation and Ecosystem
HDFS and MapReduce provide the foundation for Hadoop. Hadoop has several subservices that run specialized services for the ecosystem
checksum validation
an error detection method where the sender assigns a numeric value to a string depending on the number of bits it contains, and the receiver calculates the number to ensure it is correct.
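A minimal Python sketch of the idea (the byte-sum-modulo-256 scheme is a deliberately simplified assumption for illustration; real systems such as HDFS use stronger checksums like CRC32):

    def checksum(data: bytes) -> int:
        # Sender: reduce the message to a small numeric value.
        return sum(data) % 256

    def verify(data: bytes, expected: int) -> bool:
        # Receiver: recalculate the number and compare.
        return checksum(data) == expected

    c = checksum(b"hello")
    print(verify(b"hello", c))  # True: contents arrived intact
    print(verify(b"hallo", c))  # False: corruption detected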
Programs
must be syntactically correct to run; must be logically correct to produce reasonable output
Data node
sends reports to the NameNode about availability and performs checksum validations
Object oriented languages
use a different type of focus (objects) to achieve the same goal. Can manipulate input data.
MapReduce
Designed as a programming model combined with the implementation of that model--in essence, a reference implementation
Common Applications
Developed as procedural programs or object-oriented programs. Have formal step-by-step instructions that facilitate the needs of the application. Examples: Java, C++, COBOL, VB
Functional programs
Don't manipulate the data; they interpret the data by analyzing it for trends and patterns and then assembling important elements into lists. Each operation is independent, so the order of processing is not as important. Examples: R, LISP, Prolog
Keeping the data and the code together is one of the best optimizations for MapReduce performance
TRUE
One important difference between Hbase tables and RDBMS tables is versioning
TRUE
Programs must be logically correct to produce reasonable output
TRUE
Syntax errors are the likely problem if the program is not running
TRUE
The output of reduce is also a key and a value
TRUE
YARN provides global resource management in the Hadoop environment
TRUE
Zookeeper provides resilient, fault-tolerant distributed applications in the Hadoop environment
TRUE
HIVE uses 3 mechanisms for data organization
Tables, partitions, and buckets
Checksum validations
Used to guarantee the integrity of the contents of files in HDFS
NameNode
Uses a "rack id" to keep track of the data nodes in the cluster
Rebalancer
When you add new nodes, HDFS will not rebalance automatically. However, HDFS provides a _____ tool that can be invoked manually.
Fault/error handling
all programs *should* have error handling built in so that the system can properly react when a failure or error occurs (such as assigning a new node to complete a failed node's processes)
Transaction log
includes a list of every operation recorded that supports data integrity
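A minimal Python sketch of an append-only transaction log (the file name and record format are invented for illustration):

    def record(path, operation):
        # Append-only: operations are recorded in order and never
        # overwritten, so the log can be replayed to check integrity.
        with open(path, "a") as log:
            log.write(operation + "\n")

    record("transactions.log", "CREATE /data/file1")
    record("transactions.log", "REPLICATE /data/file1 -> node3")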
Sqoop (SQL to Hadoop)
is the ETL process in the Hadoop system. It is able to work on non-Hadoop data sets to enable them to be manipulated in the Hadoop environment --it is executed at the command line (meaning coding is required) --Sqoop is highly functional, including the ability to examine a data source and determine the appropriate mode of transfer for it --Interacts with Hive and Hbase
Hbase
A columnar (non-relational) database that can hold billions of rows layered across Hadoop clusters --It provides real-time access to data --Highly configurable --Tracks changes by versioning the data, where the version is a timestamp attribute --Organized somewhat like a taxonomy to make searching more efficient (e.g., employee, type of employee, specific employee)
Pipeline
A connection between multiple data nodes that exists to support the movement of data across the servers
Function
A function is a set of instructions that do a specific task, often needing information passed to it in variables so it can perform the task. Functions usually have output back to the program, also usually passed back through variables.
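A minimal Python sketch (the conversion function is an invented example): information is passed in through a variable, and the result is passed back through a return value.

    def fahrenheit_to_celsius(temp_f):
        # Input arrives through the variable temp_f;
        # output is passed back to the caller with return.
        return (temp_f - 32) * 5 / 9

    print(fahrenheit_to_celsius(212))  # 100.0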
map function
A function that generates an output list from an input list by applying a function to each element in the input list.
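A minimal Python sketch of the idea (the square function is an invented example):

    def square(x):
        return x * x

    # Applying the function to each element of the input list
    # generates a new output list; the input list is untouched.
    print(list(map(square, [1, 2, 3, 4])))  # [1, 4, 9, 16]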
MapReduce
A program based on two functions, the map function and the reduce function. Allows huge sets of data to be worked with at the same time over a number of nodes
HIVE
A relational data warehouse layer that allows SQL-savvy users to interact directly with structured data (HiveQL) while retaining the ability to implement analysis with MapReduce --not fast, but extensive and more scalable than a traditional data warehouse --allows for data to be partitioned (direct access to a subset of data via a directory) or stored as buckets (files stored in the partition directory)
NameNode
Acts as a "traffic cop"
MapReduce-Reduce Function
After the map for each node has been created, the reduce function is used: --reducer(word, values) [values come from the master list compiled from all maps] --for each value in values: sum = sum + value --emit(word, sum) [a new list of all unique words and the number of times they appear in the data set]
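A minimal runnable Python version of that reducer pseudocode (names are illustrative; a real Hadoop reducer typically runs in Java or via Hadoop Streaming):

    def reducer(word, values):
        # values holds every count gathered for this word
        # from the master list compiled from all maps.
        total = 0
        for value in values:
            total += value
        return (word, total)  # emit (word, sum)

    print(reducer("deer", [1, 1, 1]))  # ('deer', 3)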
Distributed file systems
Be sure there is a redundant master node that is ready if the main master node fails. Distributed files should be as large as possible, with a minimum number of nodes required. Bandwidth concerns are more about sustained throughput than supporting quick mapping and reducing functions. Coding to optimize streaming data during reads and writes enhances the overall system. Security is a necessary evil: too much causes performance degradation; not enough may leave data vulnerable. Authorization is the primary security means for MapReduce, as it is more likely to suffer a local issue than an outside attack.
Output collector
Collects the output from the independent mappers and passes it to the reducers
Programming
Comprised of structures that accomplish different tasks and a syntax that allows the programmer to communicate with the computer
The step between the mapping function and the reduce function in MapReduce is called splitting
FALSE
YARN uses a unique key to ensure that all of the processing is related to solving the same problem
FALSE
HDFS
Hadoop Distributed File System
Name Node
Is a master server that manages files and regulates access: --opening, closing, and renaming files and directories. The name node should be replicated in case of failure
HDFS (Hadoop Distributed File System)
Is an approach to data/file management, NOT a storage facility. The HDFS facilitates the process of managing data for easy access (write once, read many), allowing for greater coherency and increased throughput. Portable across platforms. The HDFS is a collection of clusters
Data Node
Is an element that manages its local block: --read and write request management --block creation, deletion, and replication when directed by the Name Node. One file may be distributed across many blocks, so constant communication from the data nodes to the name node is critical
Origins of MapReduce
It was evident that the resources necessary to support users were not going to keep up as the number of users expanded. The idea of distributed computing to enable bigger applications and data sources across a network of cheaper computers (called a cluster) was a solution. However, it was a necessary but insufficient solution: the work distribution had to happen in parallel in order to support: --processing that needed to expand or contract as necessary (scalable) --processing that was reliable (redundancy) --ease of development of services without regard to physically located resources. MapReduce was developed to be a generic programming model capable of: --parallel execution --fault tolerance --load balancing --data manipulation. MapReduce was named for its two functions that were already part of common programming: mapping and reducing
Encapsulation
Large sets of code can be organized to allow for reuse for certain tasks (encapsulation). Very often these include a type of code cluster called a function.
Hadoop
MapReduce addresses many of the challenges of working with big data. However, it needs an environment in which to work. Hadoop provides the distributed file system framework in which MapReduce works. Hadoop works on data of any structure. Hadoop self-manages such that constant revision of the environment occurs to maximize efficiencies and minimize problems caused by errors or failures
NodeManager
Monitors the application's usage of CPU, disk, network, and memory and reports back to the ResourceManager
MapReduce distribution of work
Must be performed in parallel for the following three reasons: --the processing must be able to expand and contract automatically --the processing must be able to proceed regardless of failures in the network or the individual systems --developers leveraging this approach must be able to create services that are easy to leverage by other developers. Therefore, this approach must be independent of where the data and computations have executed
A ___________ is a connection between multiple data nodes that exists to support the movement of data across the servers
Pipeline
MapReduce Optimization
Programming code can be used to implement some optimization, particularly with regard to reliability and performance. Other means to increase reliability and performance include: --appropriate infrastructure, particularly by physically organizing servers to allow the best speed and reliability possible at that level --appropriate infrastructure, particularly with regard to distributed file systems that increase resources over those available on an individual machine (virtualization, in this case a master-slave style); slave nodes store data, and master nodes "call" the nodes as requests are received --ensuring that synchronization programming immediately copies mapping results to the reducing nodes so that the reducing processing can begin immediately
Reporter functiob
Provides information gathered from map tasks so you know when and if the map tasks are complete
Record reader
The record writer in reverse: reads data from HDFS and converts it into the key-value pairs the map function consumes
MapReduce Foundational Behaviors
SCHEDULING - self manages the number of tasks and the number of nodes so that all mapping occurs prior to reducing SYNCHRONIZING - self manages the tasks by holding task results in limbo until all have completed; once tasks are completed maps are placed in a "shuffle and sort" area CODE/DATA COLOCATION - because there is enhanced efficiency when the data and the code reside in the same node, a copy of the code is sent to each node FAULT/ERROR HANDLING - all programs *should* have error handling built in so that the system can properly react when a failure or error occurs (such as assigning a new node to complete a failed node's processes)
Structures include:
SEQUENCE - a set of instructions that are performed one instruction at a time in the order stated SELECTION (or decision) - a set of instructions that are performed according to the outcome of a question LOOP - a set of instructions that are performed iteratively until something tells it to stop. All programs are a combination of these high-level structures (see the sketch below).
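A minimal Python sketch showing all three structures together (the numbers are invented for illustration):

    total = 0                      # SEQUENCE: statements run in order
    numbers = [4, 8, 15, 16]

    for n in numbers:              # LOOP: repeats until the list is exhausted
        if n % 2 == 0:             # SELECTION: the outcome of a question
            total += n             # ...decides which instructions run
        else:
            print(n, "is odd; skipped")

    print("sum of evens:", total)  # prints 28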
Foundational behaviors of MapReduce
Scheduling Synchronization Code/data colocation Fault/error handling
Pig programs can be run in three different ways
Script Grunt Embedded
HDFS clusters
Sometimes referred to as being "rack-aware"
Apache Hadoop Project
The process begins with a user request to run MapReduce and continues until results are written to the HDFS. HDFS and MapReduce depend on clustered nodes over multiple servers. The mapping function works on each input pair. As pairs are found, they go to the Output Collector; as the OC fills, the data goes to the Partitioner. The partitioner determines the "bucket" into which each pair goes (e.g., bear, deer). This is an iterative process, along with sorting, that involves all nodes. The reduce function gathers pairs until all processing is done and then performs the reduction. After reduction, the results are formatted for output and written to the HDFS (see the sketch below)
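A minimal pure-Python simulation of that flow, with mapping, partitioning into per-word buckets, sorting, and reducing; everything here is an illustrative assumption that runs in one process rather than across clustered nodes:

    from collections import defaultdict

    docs = ["deer bear river", "car car river", "deer car bear"]

    # Map: every input produces (word, 1) pairs, sent to the output collector.
    pairs = [(word, 1) for doc in docs for word in doc.split()]

    # Partition + shuffle and sort: one "bucket" per key (e.g., bear, deer).
    buckets = defaultdict(list)
    for word, count in pairs:
        buckets[word].append(count)

    # Reduce: collapse each bucket to (word, sum), then format for output.
    results = {word: sum(counts) for word, counts in sorted(buckets.items())}
    print(results)  # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}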
A list of every operation recorded that supports data integrity is located in a ___________________
Transaction log
HDFS capabilities to support data integrity
Transaction logs; Checksum validations; Additional tasks including: --detailed metadata about the files, in which blocks they reside, how they have changed, who has access, how many there are, what nodes exist in the clusters, and where critical information such as the transaction logs resides; --Functions by the data nodes including storage and retrieval of data blocks, storage of metadata, checksum validations, activity reports to the name node, provision of metadata and data on demand to authorized user applications, and restructuring of data as appropriate to maximize efficiencies; --Supporting environment for MapReduce
YARN
YARN (Yet Another Resource Negotiator) is a core service that provides resource and per-application management: --Resource management includes a scheduler that dynamically allocates resources according to pre-set needs of the application --Application management includes a notifier that activates when additional resources are required by the application.
Zookeeper
allows the distributed environment to work smoothly with few faults: --synchronizes processes such that they occur in the proper order by starting and stopping nodes as appropriate --ensures proper configuration of resources and maintains configuration consistency --assigns a node to be a leader that then interacts with the application --supports effective messaging among nodes
Pig and Pig Latin
Are an environment that supports development-like activities by non-developers. Pig is the support environment (script based) for Pig Latin that allows loading and processing of input data. Pig is also capable of producing map and reduce processes so that the user is not required to know how to do so. Because Pig is a simple environment and the language is easily learned (similar to SQL), it is relatively easy for less technical end users to examine data, test a small set, and approve jobs before involving huge sets of data
LISP
artificial intelligence language
MapReduce-Mapping Function
The basis for an artificial intelligence development language (LISP). Unlike other programming languages that may manipulate and change the structure of data, functional languages DO NOT; FUNCTIONAL LANGUAGES create new STRUCTURES that become the output of the program. The advantage of this is that the original data remains untouched, allowing multiple accesses of the data without concern for consistency issues. Reading and writing data is not necessary in the traditional sense, so where the data is DOES NOT concern the programming. The classic MapReduce example is word counting. You must use a key pair for the mapping. Here is a simple word count map function: mapper(filename, file-contents) --for each word in file-contents: emit(word, 1). This looks through every designated file and compiles a list of (word, 1) pairs
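A minimal runnable Python version of that map pseudocode (the function and file names are illustrative; real Hadoop mappers typically run in Java or via Hadoop Streaming):

    def mapper(filename, file_contents):
        # Emit a (word, 1) pair for every word in the file;
        # the original data is never modified.
        return [(word, 1) for word in file_contents.split()]

    print(mapper("doc1.txt", "deer bear river"))
    # [('deer', 1), ('bear', 1), ('river', 1)]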
Code/data colocation
because there is enhanced efficiency when the data and the code reside in the same node, a copy of the code is sent to each node
Metadata
data about data
Name Node
regulates file access in Hadoop
What does a Block server do?
--stores and retrieves the data blocks in the local file system of the server --stores the metadata of a block in the local file system --performs periodic validations of file checksums --sends regular reports to the NameNode --provides metadata --forwards data to other data nodes