BIGDATAII LAST TEST


Directs traffic. Typically has a backup node for every server rack in the cluster. When you submit jobs, it decides which nodes are allocated to you. Tracks data, creates log files, and logs the traffic that goes through it. Opens, closes, and renames files. Keeps track of where data is physically stored.

Name node
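A minimal Python sketch of the bookkeeping this card describes, with hypothetical file, block, and node names: the NameNode maps each file to its blocks and each block to the data nodes (and racks) that hold its replicas.

```python
# file -> ordered list of blocks making up that file
namespace = {
    "/logs/2024/clicks.log": ["blk_001", "blk_002"],
}
# block -> data nodes (with rack IDs) holding a replica
block_locations = {
    "blk_001": [("datanode1", "rack-A"), ("datanode3", "rack-B")],
    "blk_002": [("datanode2", "rack-A"), ("datanode4", "rack-B")],
}

def locate(path):
    # Answer a client request: which nodes hold each block of the file.
    return [(blk, block_locations[blk]) for blk in namespace[path]]

print(locate("/logs/2024/clicks.log"))
```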

-Software developers design applications based on. -A series of steps that need to occur in service to an overall goal. -Real power of MapReduce is the capability to

-algorithms -algorithm -divide and conquer

-All of these are stored locally, primarily for performance reasons. Are replicated across several data nodes, so the failure of one server may not necessarily corrupt a file. - The degree of replication, the number of data nodes, and the HDFS namespace are established when this is implemented. All parameters can be adjusted during the operation of this

-data blocks -cluster

An error-detection method where the sender assigns a numeric value to a string depending on the number of bits it contains, and the receiver recalculates the number to ensure it is correct. The receiver's binary string has to match the string sent from the sender (match the sum of the binary bits).

Checksum Validations
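A minimal Python sketch of the idea, using a count of 1-bits as the checksum for illustration (HDFS itself uses CRC-based checksums):

```python
def checksum(data: bytes) -> int:
    # Sum of the binary 1-bits across every byte of the message.
    return sum(bin(byte).count("1") for byte in data)

sent = b"hello hadoop"
received = b"hello hadoop"   # flip any byte here to see the check fail

# Receiver recalculates and compares against the sender's value.
assert checksum(received) == checksum(sent), "checksum mismatch: reject"
print("checksums match:", checksum(sent))
```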

-All scripts are run on a single machine. Hadoop MapReduce and HDFS are not required.

-LOCAL MODE

Lots of small files should be avoided. Distributed file systems supporting MapReduce engines work best when they are populated with a modest number of large files.

THE BIGGER THE BETTER

-Applies a function to each element (defined as a key-value pair) of a list and produces a new list. You can use this with impunity because it will not harm your precious stored data; it executes without making any changes to the original list and without concern for where the data is stored. Is commutative - in other words, the order in which the function is executed doesn't matter.

- map function
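A minimal Python sketch of this behavior: the function is applied to each key-value pair, a new list comes back, and the original list is never modified, so the elements could be processed in any order (or on different machines) with the same result.

```python
pairs = [("a", 1), ("b", 2), ("c", 3)]

# Apply a side-effect-free function to every (key, value) element.
doubled = list(map(lambda kv: (kv[0], kv[1] * 2), pairs))

print(doubled)  # [('a', 2), ('b', 4), ('c', 6)] -- a new list
print(pairs)    # [('a', 1), ('b', 2), ('c', 3)] -- original untouched
```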

-is an open source software foundation. - Yet Another Resource Negotiator. Provides resource and application management. Scheduling and resource management. Lets the name node decide where everything goes

-Apache -YARN

-HDFS works by breaking large files into smaller pieces called this. Are stored on data nodes. -Responsibility of this node to know which blocks on which data nodes make up the complete file. Also acts as a "traffic cop," managing all access to the files, including reads, writes, creates, deletes, and replication of data blocks on the data nodes. Can and should be replicated to guard against a single point of failure. Is smart.

-BLOCKS -NAMENODE
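A minimal Python sketch of the splitting step, with a tiny block size so the result is visible (real HDFS blocks default to 64-128 MB):

```python
BLOCK_SIZE = 8  # bytes, for illustration only

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # Fixed-size pieces; each would be replicated across data nodes.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

for n, blk in enumerate(split_into_blocks(b"a file far larger than one block")):
    print(f"blk_{n:03d}: {blk!r}")
```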

________ validations are used to guarantee the contents of files in HDFS. When a client requests a file, it can verify the contents by examining its ________. If the ________ matches, the file operation can continue. If not, an error is reported. ________ files are hidden to help avoid tampering.

-Checksum validations

-Complete collection of all the files in the cluster is sometimes referred to as the -operate in a "loosely coupled" fashion. Allows the cluster elements to behave dynamically, adding or subtracting servers as the demand increases or decreases.

-FILE SYSTEM NAME SPACE -NAME AND DATA NODES

-is the most comprehensive collection of tools and technologies available today to target big data challenges. -is a master service and controls a NodeManager in each of the nodes of a Hadoop cluster. -included in the ResourceManager; its sole task is to allocate system resources to specific running applications (tasks), but it does not monitor or track the application's status.

-HADOOP ECOSYSTEM -RESOURCE MANAGER -SCHEDULER

-is a distributed, non-relational (columnar) database that utilizes HDFS as its persistence store. • Capable of hosting very large tables (billions of rows/columns) because it is layered on Hadoop clusters of commodity hardware. • Provides random, real-time read/write access to big data. • Highly configurable, providing a great deal of flexibility to address huge amounts of data efficiently. • All data is stored in tables with rows and columns, similar to relational database management systems. -tracks changes in a cell and makes it possible to retrieve any version of the contents should it become necessary.

-HBASE -VERSIONING
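A minimal Python sketch of versioning as this card describes it: every write to a (row, column) cell is kept, so any earlier version can still be retrieved. This models the idea only; it is not the HBase API (HBase keys versions by timestamp).

```python
cells = {}  # (row, column) -> list of values, oldest first

def put(row, col, value):
    cells.setdefault((row, col), []).append(value)

def get(row, col, version=-1):
    # version=-1 returns the newest value; 0 reaches back to the oldest.
    return cells[(row, col)][version]

put("user1", "email", "old@example.com")
put("user1", "email", "new@example.com")
print(get("user1", "email"))             # newest version
print(get("user1", "email", version=0))  # original version still there
```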

- addresses volume, velocity, and variety by breaking files into a related collection of smaller blocks. These blocks are distributed among the data nodes in the ________ cluster and are managed by the NameNode. is resilient, so these blocks are replicated throughout the cluster in case of a server failure. -data about data

-HDFS -METADATA

service that is designed to address these possibilities. Goal is to balance the data nodes based on how full each set of local disks might be. runs while the cluster is active and can be throttled to avoid congestion of network traffic. is effective, but it does not have a great deal of built-in intelligence.

-HDFS REBALANCER
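A minimal Python sketch of the rebalancing goal, with hypothetical node names, utilization figures, and threshold: keep moving data from the fullest node to the emptiest until the spread falls within the threshold.

```python
THRESHOLD = 0.10  # allowed spread in disk utilization (10%)

usage = {"datanode1": 0.90, "datanode2": 0.40, "datanode3": 0.50}

while max(usage.values()) - min(usage.values()) > THRESHOLD:
    fullest = max(usage, key=usage.get)
    emptiest = min(usage, key=usage.get)
    usage[fullest] -= 0.05   # move roughly one block's worth of data
    usage[emptiest] += 0.05

print(usage)  # utilizations now within the threshold of each other
```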

-A reliable, high bandwidth, low-cost, data storage cluster that facilitates the management of related files across machines. Is a versatile, resilient, clustered approach to managing files in a big data environment. Not the final destination for files, but rather is a data service that offers a unique set of capabilities needed when data volumes and velocity are high. Data is written once and then read many times thereafter. Provides the highest levels of performance when the entire cluster is in the same physical rack in the data center. -A high-performance parallel/distributed data-processing implementation of the MapReduce algorithm.

-Hadoop Distributed File System (HDFS) -MapReduce Engine

-What are the types of steps (ch8) -Hadoop is built on top of -Do you have to use MapReduce to take advantage of Hadoop Stack

-MAP, REDUCE, SORT -MapReduce -NO

-has been reinvigorated as a core technology for processing lists of data elements (keys and values) -in functional languages do not modify the structure of the data; they create new data structures as their output. -data itself is unmodified

-Map -Operator -Original

-was created for Google's big data needs, uses any type of data, has 3 types of nodes, all built on commodity hardware, built to be fault tolerant

-MapReduce

- can perform its work on different machines in a network and get the same result as if all the work was done on a single machine. • It can also draw from multiple data sources, internal or external. • Keeps track of its work by creating a unique key to ensure that all the processing is related to solving the same problem. This key is also used to pull all the output together at the end of all the distributed tasks.

-MapReduce
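A minimal Python word-count sketch of the flow this card describes: map emits (key, value) pairs, a shuffle step groups every value with the same key, and reduce combines each group, so the key is what pulls the distributed output back together. The input strings are hypothetical.

```python
from collections import defaultdict

documents = ["the quick fox", "the lazy dog", "the fox"]

# Map phase: emit one (word, 1) pair per word. On a real cluster these
# calls run on different machines, with the same result as one machine.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group all values belonging to the same key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: combine each key's values into a single result.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'the': 3, 'quick': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```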

-monitors the application's usage of CPU, disk, network, and memory and reports back to the ResourceManager -notifies the NodeManager, and the NodeManager negotiates with the ResourceManager (Scheduler) for the additional capacity on behalf of the application.

-NODE MANAGER -APPLICATION MASTER

Within the HDFS cluster, data blocks are replicated across multiple data nodes and access is managed by this node. It uses a "rackID" to keep track of the data nodes in the cluster.

-Name node

-was designed to make Hadoop more approachable and usable by non-developers. Interactive, or script-based, execution environment supporting Pig Latin, a language used to express data flows. creates a set of map and reduce jobs. -supports the loading and processing of input data with a series of operators that transform the input data and produce the desired output. provides an abstract way to get answers from big data by focusing on the data and not the structure of a custom software program. supports a wide variety of types, expressions, functions, diagnostic operators, macros, and file system commands.

-PIG -PIG LATIN

-A connection between multiple data nodes that exists to support the movement of data across the servers. -placement on the data nodes is critical to data replication and support for data pipelining.

-PIPELINE -BLOCK

-are mapped to subdirectories in the underlying file system and represent the distribution of data throughout the table. -are stored as files in the partition directory in the underlying file system. are based on the hash of a column in the table

-Partitions -Buckets
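A minimal Python sketch of bucket assignment, with a hypothetical column and bucket count: hash the column value and take the remainder to pick the bucket file.

```python
import zlib

NUM_BUCKETS = 4

def bucket_for(column_value):
    # Deterministic hash of the bucketing column picks the bucket.
    return zlib.crc32(column_value.encode()) % NUM_BUCKETS

for user_id in ["alice", "bob", "carol"]:
    print(user_id, "-> bucket", bucket_for(user_id))
```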

Has been a feature of functional programming languages for many years. Takes the output of a map function and "reduces" the list in whatever fashion the programmer desires. First step this function requires is to place a value in something called an accumulator, which holds an initial value. Function then processes each element of the list and performs the operation you need across the list. At the end of the list, this function returns a value based on what operation you wanted to perform on the output list.

-Reduce Function
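A minimal Python sketch of this behavior using functools.reduce: the accumulator starts from an initial value, the function is applied across the list, and a single value comes back at the end. The input list is hypothetical map output.

```python
from functools import reduce

mapped_output = [3, 1, 4, 1, 5]

# reduce(function, list, initial accumulator value): the accumulator
# carries the running result across every element of the list.
total = reduce(lambda acc, x: acc + x, mapped_output, 0)
print(total)  # 14
```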

-Pig programs can be run in three different ways -Simply a file containing Pig Latin commands, identified by the .pig suffix -A command interpreter: you can type Pig Latin on the grunt command line and Grunt will execute the command on your behalf. -Pig programs can be executed as part of a Java program

-SCRIPT, GRUNT, EMBEDDED -SCRIPT -GRUNT -EMBEDDED

-A tool that offers the capability to extract data from non-Hadoop data stores, transform the data into a form usable by Hadoop, and then load the data into HDFS (ETL, "Extract, Transform, Load"). is a command-line interpreter; you type commands into the interpreter and they are executed one at a time. works by looking at the database you want to import and selecting an appropriate import function for the source data. It then reads the metadata for the table (or database) and creates a class definition of your input requirements

-SQOOP

FOUNDATIONAL BEHAVIORS OF MAPREDUCE: DEFINITIONS (START) -Jobs get broken down into individual tasks for the map and the reduce portions of the application. Mapping must be concluded before reducing can take place. Those tasks are prioritized according to the number of nodes in the cluster.

-Scheduling

-ETL process. Extract from a database, transform the data, load it into a new database in a new language/form. Interacts with Hive and HBase. -Independent Apache project. Typically faster and more expensive than Hadoop for Big Data. Resilient distributed datasets are a key feature. Requires a distributed file system and a distributed resource manager. Uses HDFS and YARN. Written in Scala and runs on the JVM. Frequently used with Jupyter notebooks.

-Sqoop -Spark

-mechanisms copy the mapping results to the reducing nodes immediately after they have completed so that the processing can begin right away. All values from the same key are sent to the same reducer, again ensuring higher performance and better efficiency. -are written directly to the file system, so it must be designed and tuned for best results.

-Synchronization -reduction outputs
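A minimal Python sketch of the routing rule described here, with a hypothetical reducer count: hashing the key guarantees that all values for one key land on the same reducer, no matter which mapper produced them.

```python
import zlib

NUM_REDUCERS = 3

def reducer_for(key):
    # Deterministic hash, so every mapper routes a key the same way.
    return zlib.crc32(key.encode()) % NUM_REDUCERS

for key in ["fox", "dog", "fox", "the"]:
    print(key, "-> reducer", reducer_for(key))
# "fox" goes to the same reducer both times, so its values meet there.
```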

-are a very common practice in file system and database design. They keep track of every operation and are effective in auditing or rebuilding of the file system should something untoward occur. -uses transaction logs and checksum validation to ensure integrity across the cluster. supports a number of capabilities designed to provide data integrity.

-Transaction logs -HDFS

File System -MapReduce implementation is supported by a -Major difference between local and distributed file systems is -need to be spread across multiple machines or nodes in a network. -MapReduce implementations rely on a ________ of distribution, where the master node stores all the metadata, access rights, mapping and location of files and blocks, and so on. -are nodes where the actual data is stored. -All requests go here and then are handled by the appropriate slave node.

-distributed file system -capacity -File systems -master-slave style -The slaves -The master

(END) Most MapReduce engines have very robust error handling and fault tolerance. Engine must recognize that something is wrong and make the necessary correction. Engine is designed so that it recognizes when a job is incomplete and will automatically assign the task to a different node.

Fault/Error Handling

-server is responsible for: o Storing and retrieving the data blocks in the local file system of the server. o Storing the metadata of a block in the local file system based on the metadata template in the NameNode. o Performing periodic validations of file checksums. o Sending regular reports to the NameNode about what blocks are available for file operations. o Providing metadata and data to clients on demand. o Forwarding data to other data nodes based on a "pipeline" model.

BLOCK SERVER

Most effective processing occurs when the mapping function (the code) is collocated on the same machine with the data it needs to process. Process scheduler is very clever and can place the code and its related data on the same node prior to execution (or vice versa).

Code/Data Colocation

Contain HDFS data. HDFS is redundant, so if you have data on the Hadoop cluster, you'll have at least 2x redundancy (stored on 2 nodes) or 3x (stored on 3 nodes). Manages local blocks (creation, deletion, replication) as directed by the name node.

Data Node

-are not very smart, but are resilient. Constantly ask the NameNode whether there is anything for them to do. Also tells the NameNode what data nodes are out there and how busy they are. Also communicate among themselves so that they can cooperate during normal file system operations. Blocks for one file are likely to be stored on multiple of these nodes. also provide "heartbeat" messages to detect and ensure connectivity between the NameNode and the data nodes. use local disks in the commodity server for persistence. Known as block servers

Data nodes
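A minimal Python sketch of the heartbeat exchange, with hypothetical names and a stubbed NameNode; this models the idea only, not the real HDFS protocol.

```python
import time

class NameNodeStub:
    def heartbeat(self, node_id, load):
        print(f"heartbeat from {node_id} (load {load:.0%})")
        return []  # no pending commands for this node right now

def heartbeat_loop(node_id, namenode, beats=3):
    for _ in range(beats):           # a real data node loops forever
        commands = namenode.heartbeat(node_id, load=0.42)
        for command in commands:     # e.g. "replicate blk_001 to datanode4"
            print("executing:", command)
        time.sleep(0.1)              # real interval is a few seconds

heartbeat_loop("datanode1", NameNodeStub())
```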

Was developed because it represented the most pragmatic way to allow companies to manage huge volumes of data easily. Allowed big problems to be broken down into smaller elements so that analysis could be done quickly and cost-effectively. Is now an open source project managed by the Apache Software Foundation. Is a fundamental building block in our desire to capture and process big data. Designed to parallelize data processing across computing nodes to speed computations and hide latency.

HADOOP

Columnar database that can hold billions of rows. NoSQL database. Powerful database. If you're using Hadoop as a database, you're really using this. If you change files, you cannot guarantee that they are changed elsewhere in the system.

HBASE

is stored in the NameNode, and while the cluster is operating, all the metadata is loaded into the physical memory of the NameNode server.

HDFS METADATA

-is a batch-oriented, data-warehousing layer built on the core elements of Hadoop (HDFS and MapReduce). • Provides users who know SQL with a simple SQL-like implementation called HiveQL, without sacrificing access via mappers and reducers. With this you can get SQL-like access to structured data and sophisticated big data analysis with MapReduce. • Not designed for quick responses to queries. Is best used for data mining and deeper analytics that do not require real-time behaviors. It is very scalable, extensible, and resilient, something that the average data warehouse is not.

HIVE

Relational data warehouse layer. Uses SQL-like queries for those who already know SQL. Translates SQL into MapReduce queries.

HIVE

Designed to process huge amounts of structured and unstructured data and is implemented on racks of commodity servers as a Hadoop cluster. Able to detect changes, including failures, and adjust to those changes and continue to operate without interruption.

Hadoop

Manages the distributed files. You have different nodes that can store data and the data is split up on those nodes. If you're using data on one node, there is always a copy on another node as a backup. Facilitates the process of reading and writing data. To move data into ______, you just drag and drop.

Hadoop Distributed File System (HDFS):

decides how the file is going to be broken into smaller pieces for processing using a function called InputSplit. It then assigns a RecordReader to transform the raw data for processing by the map.

INPUT FORMAT
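A minimal Python sketch of the two roles named here, with hypothetical sizes: an InputSplit-style step carves the input into chunks, and a RecordReader-style step turns each chunk's raw bytes into (key, value) records for the map.

```python
def input_splits(data, split_size):
    # Carve the raw input into chunks (the InputSplit idea).
    for offset in range(0, len(data), split_size):
        yield offset, data[offset:offset + split_size]

def record_reader(offset, chunk):
    # Turn raw bytes into records: key = byte offset, value = the line.
    pos = offset
    for line in chunk.splitlines(keepends=True):
        yield pos, line.strip()
        pos += len(line)

data = b"first line\nsecond line\nthird line\n"
for offset, chunk in input_splits(data, split_size=len(data)):  # one split
    for key, value in record_reader(offset, chunk):
        print(key, value)
```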

Adding layers of security on the distributed file system will degrade its performance. File permissions are there to guard against unintended consequences, not malicious behavior. The best approach is to ensure that only authorized users have access to the data center environment and to keep the distributed file system protected from the outside.

KEEP IT SECURE

-Master node could get overworked because everything begins there. If the master node fails, the entire file system is inaccessible until the master is restored. Create a "warm standby" master node that can jump into service if a problem occurs with the online master.

KEEP IT WARM

Uses Pig Latin (not the actual Pig Latin word game). Simple programming language. Allows you to develop code that you can test and use quickly. Not the best thing for big jobs. Functions are already debugged. If you have the proper syntax, it will work. Language is easy (similar to SQL).

PIG

-When multiple processes execute concurrently in a cluster, you need a way to keep things running smoothly. Automatic execution framework. Keeps track of what has run and when. Intermediate data is copied over the network as it is produced using a mechanism called "shuffle and sort". This gathers and prepares all the mapped data for reduction.

Synchronization

Highly sustained network bandwidth is more important than quick execution times of the mappers or reducers. Optimal approach is for the code to stream lots of data when it is reading and again when it is time to write to the file system.

THE LONG VIEW

Kept in the name node. Help support data integrity and the management of the HDFS cluster and distributed database.

Transaction Logs

