Hadoop, MapReduce, Pig, Hive

How does Pig work?

On the client side.

What is Data-Level Parallelism?

Performing the same operation at the same time on independent pieces of data.
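
A minimal sketch of the idea, assuming a hypothetical `square` operation and a small list of independent values. Threads are used only to keep the example self-contained; on a real cluster the same pattern would be spread across processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x: int) -> int:
    """The single operation that is applied to every data element."""
    return x * x


def parallel_square(values: list[int], workers: int = 4) -> list[int]:
    # Every element is independent of the others, so the same function
    # can be applied to all of them concurrently. pool.map returns the
    # results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, values))


if __name__ == "__main__":
    print(parallel_square([1, 2, 3, 4]))  # [1, 4, 9, 16]
```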

What are the disadvantages of MapReduce?

1 Data processing is slow in MapReduce. 2 It supports batch processing only.

What are the advantages of YARN?

1 Flexibility - enables data processing models beyond MapReduce. Other applications can run alongside MapReduce programs in Hadoop 2. 2 Efficiency - allows many applications to run on the same cluster.

What are the disadvantages of Vertical Scaling?

1 Hardware failures can cause data loss. 2 Disk seek time. 3 Processing time. 4 Single point of failure.

What are the advantages of Spark?

1 In-Memory Processing using a DAG execution engine: in-memory processing increases processing speed. 2 Fault Tolerance: Spark RDDs are designed to handle the failure of any worker node in the cluster. 3 Supports multiple languages. 4 Speed: up to 100x faster in memory and 10x faster on disk (compared with MapReduce).

What are the disadvantages of Hive?

1 It does not support updates and deletes, only overwriting or appending data. 2 Subqueries are not supported. 3 Very high latency. 4 Cannot work with unstructured data.

What are the disadvantages of Spark?

1 It offers only near-real-time processing of stream data (micro-batch processing). 2 No file management system of its own; it depends on Hadoop. 3 Keeping the data in memory is very expensive.

What are the disadvantages of Hadoop?

1 Lack of Security: it is missing encryption at the storage and network levels. 2 Cannot handle small-data environments. 3 Risky functioning: cybercriminals can easily exploit frameworks that are built on Java.

What are the disadvantages of Pig?

1 No support. 2 Still in development. 3 Delay in execution: if we do not dump or store the final result, the commands are not executed. 4 Does not enforce a schema. 5 Pig does not support partitioning.

What are the disadvantages of Storm?

1 No support. 2 Adoption of change is slow.

What are the advantages of Hadoop?

1 Open Source: its code can be modified according to business requirements. 2 Distributed Processing: data is processed in parallel on a cluster of nodes. 3 Fault Tolerance: if any node goes down, the data on that node can be recovered from replicas on other nodes. This is done automatically by the framework. 4 Scalability: new nodes can easily be added horizontally without downtime. 5 Economic: not very expensive, as it runs on a cluster of commodity hardware. 6 Easy to use: the framework takes care of distributed computing. 7 Data Locality: move the computation to the data instead of the data to the computation. When a client submits a MapReduce job, the job is moved to the nodes that hold the data rather than bringing the data to the location where the job was submitted.

What are the advantages of Pig?

1 Self-optimizing (focus on semantics rather than efficiency). 2 Works with structured and unstructured data. 3 Easy to learn and program. 4 Less development time. 5 Easy to write UDFs (User Defined Functions).

What are the advantages of MapReduce?

1 Simplicity - MapReduce jobs are easy to run. Applications can be written in languages such as Java and Python. 2 Scalability - can process petabytes of data. 3 Speed - because of parallel processing, problems that would take days to solve are solved in hours or minutes. 4 Fault Tolerance - handled by re-execution: detect failed workers; re-execute completed and in-progress map tasks; re-execute in-progress reduce tasks; task completion is committed through the master.
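
The re-execution idea behind MapReduce fault tolerance can be illustrated with a toy scheduler. This is only a sketch under stated assumptions: `run_with_reexecution` and `flaky_double` are hypothetical names, a worker failure is simulated with a `RuntimeError`, and nothing here is the Hadoop API.

```python
def run_with_reexecution(tasks, worker, max_attempts=3):
    """Run every task; when the worker fails, re-execute that task."""
    results = {}
    for task_id, payload in tasks.items():
        for attempt in range(1, max_attempts + 1):
            try:
                # The result is committed only once the task completes.
                results[task_id] = worker(payload, attempt)
                break
            except RuntimeError:
                # Failure detected: fall through and re-execute the task.
                continue
        else:
            raise RuntimeError(f"task {task_id} failed {max_attempts} times")
    return results


def flaky_double(payload, attempt):
    # Hypothetical worker: simulates a crash on the first attempt
    # whenever the input is odd, then succeeds on the retry.
    if payload % 2 == 1 and attempt == 1:
        raise RuntimeError("simulated worker failure")
    return payload * 2


if __name__ == "__main__":
    print(run_with_reexecution({"a": 1, "b": 2}, flaky_double))
```

Because each task is a pure function of its input, running it a second time gives the same result, which is what makes re-execution a safe recovery strategy.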

What are the advantages of Hive?

1 Table structure similar to relational databases. 2 Queries data using a SQL-like language. 3 Format conversion within Hive. 4 Hive supports partitioning.

What are the advantages of Storm?

1 Open source and fast. 2 Scalable (runs parallel tasks). 3 Low latency.

What is Amazon S3?

Amazon S3 stores data and allows clients to securely run queries on that data without moving it to a separate analytics platform.

What is AWS EMR (Elastic MapReduce)?

Amazon EMR (Amazon Elastic MapReduce) provides a managed Hadoop framework using the elastic infrastructure of Amazon EC2 and Amazon S3.

What is Amazon Web Services?

Amazon Web Services (AWS) provides a cloud platform, the first to introduce a pay-as-you-go cloud computing model that scales to provide users with computing and storage as needed.

What is Hadoop?

An open-source software framework, written in Java, for distributed storage and processing of large data sets. Three main components: HDFS, YARN, and MapReduce.

What are Hadoop Daemons?

Daemons are processes that run in the background. Hadoop mainly runs 4 daemons:
• NameNode - runs on the master node for HDFS.
• DataNode - runs on slave nodes for HDFS.
• ResourceManager - runs on the master node for YARN. Keeps track of live and dead nodes.
• NodeManager - runs on slave nodes for YARN.

What is Amazon EC2?

Elastic Compute Cloud
• Virtual machines (instances)
• Different configurations of RAM, processing, storage, architecture
• Identity and security management
• Linux, Windows, specific applications
• Billed on an hourly basis
• Can spin up on demand

What is Apache Spark?

Fast, in-memory cluster computing for Big Data. Combines batch, stream, and interactive computations (via shell). Language integration: preferably Scala, but also Java, Python, R, and SQL (via API). Fully compatible with Hadoop; also works standalone (locally) and runs on Mesos or YARN. Useful for stream processing using Spark Streaming, which is a wrapper over batch processing. It operates on data at rest.

Cloud Service Models:

• Infrastructure as a Service (IaaS)
• Platform as a Service (PaaS)
• Software as a Service (SaaS)

What is Apache Pig?

A high-level platform for analyzing datasets stored in HDFS. Pig uses the Pig Latin language, which is similar to SQL. A Pig script loads the data, expresses complex transformations, and stores or dumps the result in the required format; Pig translates the script into MapReduce jobs so that it can be executed on Hadoop. Pig requires a Java runtime environment. It is used by researchers and programmers. Pig was first developed by Yahoo.

How does MapReduce work?

MapReduce works by breaking the processing into two phases: Map and Reduce. The Map function takes a set of data and converts it into another set of data, in which individual elements are broken down into tuples (key/value pairs). The Reduce function takes the output of the Map as its input, groups those tuples by key, and combines the values for each key into a single result.
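
The two phases can be sketched in plain Python with the classic word-count example. This is an illustrative, single-machine sketch, not Hadoop code: the function names `map_phase`, `shuffle`, and `reduce_phase` are assumptions, and the explicit shuffle step stands in for the grouping by key that the framework performs between Map and Reduce.

```python
from collections import defaultdict


def map_phase(line):
    """Map: break one input line into (key, value) tuples."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
    # {'big': 2, 'data': 1, 'cluster': 1}
```

On a real cluster, many copies of the map and reduce functions run in parallel on different nodes; the logic of each function stays the same.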

What is Apache Spark RDD?

RDD stands for "Resilient Distributed Dataset". It is the fundamental data structure of Apache Spark. Resilient, i.e. fault-tolerant: able to recompute missing or damaged partitions after node failures. Distributed, since the data resides on multiple nodes. Dataset: the user can load the data set externally from a JSON file, CSV file, text file, or a database via JDBC, with no specific data structure required.
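
The "resilient" property can be illustrated with a toy class that remembers how each partition was derived and recomputes it when it is lost. `ToyRDD` and its methods are hypothetical and are not the Spark API; the sketch assumes the source partitions themselves are stored reliably.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitions plus the lineage to rebuild them."""

    def __init__(self, source_partitions, transform=lambda x: x):
        self._source = source_partitions  # assumed to be reliably stored input
        self._transform = transform       # lineage: how each element is derived
        self._cache = {}                  # computed partitions held in memory

    def map(self, fn):
        # Record the new step in the lineage instead of computing eagerly.
        prev = self._transform
        return ToyRDD(self._source, lambda x: fn(prev(x)))

    def compute(self, index):
        # Compute a partition on demand, or recompute it after a loss.
        if index not in self._cache:
            self._cache[index] = [self._transform(x) for x in self._source[index]]
        return self._cache[index]

    def lose_partition(self, index):
        # Simulate a node failure that drops one in-memory partition.
        self._cache.pop(index, None)

    def collect(self):
        return [x for i in range(len(self._source)) for x in self.compute(i)]


if __name__ == "__main__":
    rdd = ToyRDD([[1, 2], [3, 4]]).map(lambda x: x * 10)
    print(rdd.collect())   # [10, 20, 30, 40]
    rdd.lose_partition(1)  # partition 1 is lost
    print(rdd.collect())   # recomputed from lineage: [10, 20, 30, 40]
```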

How Spark works:

The Driver rules it all:
• Connects to a cluster manager (Standalone/Mesos/YARN) to allocate resources
• Acquires executors on the cluster
• Sends app code to the executors
• Sends tasks for the executors to run

What is YARN?

YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop.

What is HDFS?

A distributed file system that provides storage in Hadoop. Two main components: 1 NameNode (Master Node): stores the metadata and regulates client access to files. The NameNode keeps the metadata in memory for fast retrieval, so it should be configured on a high-end machine. 2 DataNode (Slave Node): manages the data storage of the system.
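
The split between metadata and data storage can be sketched with a toy model. Everything here is a simplifying assumption: `ToyHDFS` and `DataNode` are hypothetical classes, blocks are a few characters long, and replicas are placed round-robin. In real HDFS the client streams blocks to the DataNodes directly, and the NameNode only hands out metadata.

```python
class DataNode:
    """Slave node: stores the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}


class ToyHDFS:
    def __init__(self, datanodes, block_size=4, replication=2):
        self.datanodes = datanodes
        self.block_size = block_size
        self.replication = replication
        # NameNode role: in-memory metadata only,
        # filename -> list of (block_id, [names of DataNodes holding it]).
        self.metadata = {}

    def write(self, filename, data):
        entries = []
        count = len(self.datanodes)
        for number, offset in enumerate(range(0, len(data), self.block_size)):
            block_id = f"{filename}#{number}"
            chunk = data[offset:offset + self.block_size]
            # Round-robin placement of each replica on a different node.
            targets = [self.datanodes[(number + r) % count]
                       for r in range(self.replication)]
            for node in targets:
                node.blocks[block_id] = chunk
            entries.append((block_id, [node.name for node in targets]))
        self.metadata[filename] = entries

    def read(self, filename, failed=()):
        by_name = {node.name: node for node in self.datanodes}
        parts = []
        for block_id, holders in self.metadata[filename]:
            alive = [h for h in holders if h not in failed]
            if not alive:
                raise IOError(f"all replicas of {block_id} are unavailable")
            parts.append(by_name[alive[0]].blocks[block_id])
        return "".join(parts)


if __name__ == "__main__":
    fs = ToyHDFS([DataNode("dn0"), DataNode("dn1"), DataNode("dn2")])
    fs.write("notes.txt", "hello world")
    print(fs.read("notes.txt", failed=("dn1",)))  # hello world
```

Reading still succeeds with one DataNode marked as failed because every block has a replica on another node, which is the same mechanism behind Hadoop's fault tolerance.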

What is a Hadoop Cluster?

A group of computers connected together via LAN. A Hadoop cluster consists of a number of commodity machines connected together. They communicate with a high-end machine that acts as the master. The master and slaves implement distributed computing over distributed data storage.

What is Storm?

Distributed computing for streaming data. Runs on Mesos and YARN. Operates on data in motion. Data can be processed exactly once, at least once, or at most once.

How does Hadoop work?

In a master-slave fashion. There is one master node and n slave nodes. The master manages the slaves, while the slaves are the actual worker nodes. The master should be deployed on well-configured hardware, not just commodity hardware, as it is the centrepiece of the Hadoop cluster.

What is Apache Hive?

A data warehouse built on top of Apache Hadoop that helps with the analysis of large datasets stored in Hadoop. Hive uses a language called HiveQL (HQL), which is similar to SQL. Hive automatically translates HiveQL queries into MapReduce jobs, which execute on Hadoop. It is used by data analysts and supports ETL operations on the server side. Hive was first developed by Facebook.

What is MapReduce?

The data processing layer of Hadoop. It is a data-parallel programming model for writing applications that process, in parallel, large volumes of structured and unstructured data stored in HDFS.

Pig Data Types:

• A Tuple is an ordered set of Fields, represented by parentheses. Example: ('bob',55,12.2)
• A Map is a set of key/value pairs, represented as [key#value]. Example: ['name'#'bob', 'age'#'55']
• A Bag is an unordered collection of Tuples, represented by { }. Example: {('bob',55,12.2), ('sally',52,11.3)}
• A Relation is an (outer) Bag of Tuples
• Tuples in the same bag can have different numbers of fields
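
As a rough analogy only, the Pig data types can be mirrored with built-in Python structures. The variable names and the extra `('amy', 40)` tuple are made up for illustration, and a Python list keeps insertion order whereas a Pig bag does not guarantee any order.

```python
# Tuple: an ordered set of fields.
person = ("bob", 55, 12.2)

# Map: key/value pairs.
info = {"name": "bob", "age": "55"}

# Bag: a collection of tuples. A list is used here for simplicity,
# but a Pig bag carries no ordering guarantee.
people = [("bob", 55, 12.2), ("sally", 52, 11.3), ("amy", 40)]


def field_counts(bag):
    """Return the distinct tuple lengths found in a bag, smallest first."""
    return sorted({len(t) for t in bag})


if __name__ == "__main__":
    print(person[0])             # bob
    print(info["age"])           # 55
    print(field_counts(people))  # [2, 3]
```

The last call shows that tuples in the same bag may have different numbers of fields.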

Cloud Computing Characteristics:

• Broad network access
• On-demand self-service
• Rapid elasticity (statistical multiplexing)
• Resource pooling
• Measured service (and costs)

Why Cloud Computing?

• Customers can focus on core operations
• Infrastructure can be consumed as needed
• Scalability is no longer a limiting factor

