Big Data Infrastructure


Write a filtering program that uses filter or where.

# Defining a filter function word = inputRDD.filter(lambda s: "error" in s) def containsError(s): return "error" in s word = inputRDD.filter(containsError) word.take(10)

Given a program, like word count, be able to diagram/explain how the data is processed through each program step

Client → Split → Map (Map-Node) → Shuffle → Reduce (Reduce-Node) → Output
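As a small illustrative trace (assuming the classic word-count job and the input line "the cat saw the dog"):

Split:   the input file is divided into splits, one per map task
Map:     "the cat saw the dog" → (the,1) (cat,1) (saw,1) (the,1) (dog,1)
Shuffle: pairs are grouped and sorted by key → (cat,[1]) (dog,[1]) (saw,[1]) (the,[1,1])
Reduce:  the values for each key are summed → (cat,1) (dog,1) (saw,1) (the,2)
Output:  the reduced pairs are written to the output files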

What is the trend for storage of data? What are some sources of increased data?

From floppy disks and CDs to flash drives and the cloud, people need to store more and more data, and they also want to access that data in a relatively easy manner. Once data becomes easy to store, people create even more of it. Some of the sources of increased data are social media, the Internet of Things (IoT), and search queries from search engines.

General question form is:

Given data and a sequence of PySpark commands, write out the result.
Given data, a sequence of PySpark commands, and the result, add one or more (a few) commands so that the execution of the program matches the result.

What is Hadoop?

Hadoop is an ecosystem consisting of many software components, but the most fundamental idea of Hadoop is the combination of a distributed file system, such as HDFS, with a computing system such as MapReduce or Spark.

Which is better for data mining?

Hadoop is preferable to an RDBMS for data mining because it is designed to process large datasets at high speed.

What are the important features that Hadoop offers (compared to common computing)?

High fault tolerance
Fast processing of large datasets

Know the student presentations. What technologies were presented?

Kafka Streaming. Streaming is a process in which big data is quickly processed in order to extract real-time insights from it. The data on which processing is done is data in motion. Big data streaming is ideally a speed-focused approach in which a continuous stream of data is processed.

What does linear speedup refer to, with regards to adding computer nodes to processing infrastructure? What specific issues prevent linear speedup for parallel processing systems? (slides 2.2)

Linear speedup means that adding N times as many nodes makes processing N times faster. It is associated with horizontal scaling, where one adds machines (CPUs) to a cluster to speed up processing, as opposed to vertical scaling, which means buying larger and more powerful computers/CPUs. The challenges that prevent linear speedup are:
- Start-up cost (starting an operation on many processors)
- Contention for resources between processors
- Data transfer
- The slowest processor becomes the bottleneck
- Data skew and load balancing
A shared-nothing architecture is the most scalable option: it minimizes resource sharing across processors and can use commodity hardware, but it is hard to program and manage.

What are examples of transformation operation?

Map, Filter, GroupBy, Join (lazy operations that build RDDs from other RDDs)
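A minimal PySpark sketch of transformations (the data and names below are made up for illustration); each call only defines a new RDD and computes nothing yet:

nums = sc.parallelize([1, 2, 3, 4, 5, 6])
evens = nums.filter(lambda x: x % 2 == 0)           # transformation: keep even numbers
pairs = evens.map(lambda x: (x, x * x))             # transformation: build (value, square) pairs
grouped = pairs.groupByKey()                        # transformation: group the pairs by key
other = sc.parallelize([(2, "two"), (4, "four")])
joined = pairs.join(other)                          # transformation: join two pair RDDs on their keys
# nothing has run yet; an action such as joined.collect() would trigger execution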

Terms: node, rack, block, split, task

Node = a device, such as a computer, used in the cluster for computation; nodes can be further divided into master (name) nodes and data (worker) nodes.
Rack = a frame used to hold several nodes in bundles.
Block = a file is divided into one or more blocks, each with a maximum size of 128 MB.
Split = the division of a file into separate blocks; also known as a partition.
Task = a smaller unit of work, derived from a job, such as a map or reduce task.

Can PySpark process files that are on the Linux file system (but not on HDFS)? Can it process Amazon S3 files?

PySpark does not have to use HDFS; it can also read files from the local Linux file system and from Amazon S3. However, the advantages of HDFS are great, since it provides a file system that is fault tolerant and handles large datasets very fast. The disadvantage of using HDFS with PySpark is that it is not efficient when dealing with many small files, as each carries a long overhead time, and HDFS also takes a while to set up.
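A rough sketch (the paths are placeholders; reading from S3 assumes the cluster has the S3A connector and AWS credentials configured):

# read from the local Linux file system, bypassing HDFS
local_rdd = sc.textFile("file:///home/user/logs/app.log")

# read from Amazon S3 (bucket and key below are hypothetical)
s3_rdd = sc.textFile("s3a://my-bucket/logs/app.log")

print(local_rdd.count(), s3_rdd.count())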

What languages are supported by Spark?

Python, Java, Scala, and R

What is the return data type of lines = sc.textFile ?

RDD - Resilient Distributed Dataset

What is the return data type of sc.parallelize([1, 2, 3, 4, 5]) ?

RDD - Resilient Distributed Dataset

What is a paired RDD? Give an example of its usage.

RDDs that contain key-value pairs. Pair RDDs have their own operations; for example, you can reduce, group, or sort your data by key.
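A small sketch of pair-RDD operations (the data is made up):

sales = sc.parallelize([("apples", 3), ("pears", 2), ("apples", 5)])

# reduceByKey: total per fruit -> [('apples', 8), ('pears', 2)] (order may vary)
print(sales.reduceByKey(lambda a, b: a + b).collect())

# groupByKey: all values collected per fruit
print(sales.groupByKey().mapValues(list).collect())

# sortByKey: pairs sorted alphabetically by fruit
print(sales.sortByKey().collect())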

Compare Hadoop vs RDBMS. Which is better for transaction processing (like credit card purchases)?

An RDBMS is better for transaction processing since its focus is on atomicity and consistency, while Hadoop focuses on processing speed for large datasets.

Why use Spark instead of the traditional Map Reduce coding in Hadoop?

Spark runs in-memory on the cluster, while Hadoop MapReduce persists data back to disk after performing each map or reduce action; as a result, Spark can be up to 100x faster.

Compare streaming technologies

Storm: Apache Storm is a free and open-source distributed real-time computation system. Storm's main use cases include real-time analytics, continuous computation, and online machine learning.
Flink: Flink is an open-source framework for distributed stream processing. It is capable of high throughput and low latency, and has better speeds compared to Storm. Flink's use cases include optimization of e-commerce search results in real time and network/sensor monitoring and error detection.
Difference with Kafka: Kafka combines distributed and traditional messaging systems and is used for more complex transformations.

Given replication factor 3, where are the three blocks stored?

The blocks are stored on three separate nodes, preferably two on the same rack and the third on a different, but adjacent, rack.

By what steps does a client write blocks to HDFS?

The client communicates with the DistributedFileSystem, which communicates with the NameNode; the NameNode says which DataNodes the blocks should be written to. The client then writes through an FSDataOutputStream to the first DataNode, which forwards (pipelines) the data to the next DataNode, and so on.

How is the closest block determined (consider a metric)?

The closest block is determined using a metric of nodes, racks, and data centers. For instance, a block can be "one node away" or "one rack away".

What do these nodes do: Name node, data (worker) node?

The name node keeps track of the metadata for all files in the cluster (which blocks make up each file and where they are stored), while the data (worker) node does the actual work, such as storing blocks, serving reads/writes, and running computations.

What is the relationship between task and block?

The number of blocks processed equals the number of map tasks: one map task per block (split).
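One way to see this in PySpark (the path is hypothetical; for a file on HDFS the number of partitions usually matches the number of blocks, although a minPartitions argument can change it):

rdd = sc.textFile("hdfs:///data/big_file.txt")
# roughly one partition, and hence one map task, per HDFS block
print(rdd.getNumPartitions())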

What are the 4 V's of Big Data?

Volume - lots of data (lots!)
Variety - structured and unstructured data
Velocity - data arrives faster than transactional data (like credit card transactions); think sensor data, so fast that some RDBMSs have trouble writing it all
Value - make sense of the data so that action can be taken (in time)

How did the tutorial demonstrate the integrated analysis of unstructured and structured data?

We used Flume to ingest the web clickstream data set and Sqoop to import the structured data, then integrated them and ran queries using Hive.

What are key features that the framework provides for parallel programming? Consider replication, fault tolerance, etc.

Below are the key features HDFS provides:
- Data replication: HDFS replicates (copies) portions of a file (in the form of blocks) across multiple computers, so the data can still be retrieved in case of hardware failures.
- Fault tolerance: in case of a fault or hardware failure, HDFS automatically recovers the lost data and re-assigns it to a different node/rack.
- HDFS allows MapReduce to run on commodity hardware and open-source software, controlling data flow and optimization through a master-worker architecture.
- The framework controls the entire MapReduce process flow by distributing tasks among nodes and running those tasks in parallel to achieve better performance.

Define Big Data

Big Data is data that exceeds the processing capacity of conventional database systems. As consumers and business users, the size and scale of the data is not what we care about; what Big Data is really about is the ability to capture and analyze data and gain actionable insights from it at a much lower cost than was historically possible. With the ability to gauge customer needs and satisfaction through analytics comes the power to give customers what they want. We no longer need complex software that takes months or years to set up and use; nearly all the analytics power we need is available through simple software downloads or in the cloud.

What is an RDD?

Resilient Distributed Dataset:
- An immutable collection of objects spread across a cluster
- Built through parallel transformations (map, filter, etc.)
- Automatically rebuilt on failure
- Controllable persistence (caching in RAM)

Write a simple SQL query for a notebook, given a dataset. (Use the %sql magic command.)

%sql
SELECT "COLUMN NAME" FROM "TABLE NAME"

Explain shuffle and sort. What occurs on the Map node and on the Reduce node?

Shuffle and sort groups the map output by key into sorted groups. On the map node, elements are paired into key/value pairs and then sent to the reduce node. On the reduce node, these key/value pairs are merged and reduced based on those groups.

What Hadoop components were used in the Cloudera tutorial?

- Sqoop, which is used for importing data from relational databases into Hadoop.
- Hive, a data warehouse software project used for data summarization, query, and analysis; Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
- Impala, a distributed SQL query engine for Apache Hadoop.
- Flume, the standard tool for streaming log and event data into Hadoop.
- Hue, a web interface for analyzing data with Apache Hadoop.

By what steps does a client read blocks from the HDFS?

The client communicates with the DistributedFileSystem, which communicates with the NameNode; the NameNode says where the specific file blocks are located. The client then reads through an FSDataInputStream, which reads from the specific DataNodes where the blocks are located.

Compare and contrast the role of Data Analyst and Data Engineer

While a data analyst is more focused on analyzing the data, the data engineer's role revolves more around the architecture of the database system and the file systems. A data analyst needs to be well rounded and knowledgeable in several areas such as programming, statistical modelling, math, business operations, and such. A data engineer, on the other hand, is required to have a more in-depth knowledge of the technical aspect of data processing, storing, and management.

What are the magic commands for Databricks (e.g., %sql, etc)?

While a notebook has a default language, in Databricks you can mix languages by using the language magic commands. Given a notebook, you can execute code in any of the other supported languages by specifying one of the strings below at the beginning of a cell:
%python - execute Python code in a notebook (even if that notebook is not Python).
%sql - execute SQL code in a notebook (even if that notebook is not SQL).
%r - execute R code in a notebook (even if that notebook is not R).
%scala - execute Scala code in a notebook (even if that notebook is not Scala).
%sh - execute shell code in your notebook.
%fs - use Databricks Utilities filesystem commands.

Which reads data faster: (a) one big fast disk drive, or (b) many smaller disk drives

(b) Many smaller disk drives read data faster, because they can be read in parallel, giving a higher aggregate transfer rate than a single large drive.
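A rough back-of-the-envelope illustration (the transfer rates are assumed for the sake of the example): reading 1 TB from a single drive at 200 MB/s takes about 5,000 seconds (roughly 83 minutes), while reading the same 1 TB spread across 100 drives at 200 MB/s each, in parallel, takes about 50 seconds. This parallel-read advantage is a core motivation for HDFS and MapReduce.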

Write a program to filter, and include only, lines containing "ERROR" and then count those lines.

# Defining a filter function word = inputRDD.filter(lambda s: "error" in s) def containsError(s): return "error" in s word = inputRDD.filter(containsError) word.count()

Write the word-count program using map and reduceByKey

# read the text file into an RDD, where file lines become RDD rows
rdd = sc.textFile("/databricks-datasets/learning-spark/README.md")

# a plain map produces one element per line, each element being the list of words in that line
print(rdd.map(lambda x: x.split(" ")).take(10))

# flatMap flattens the per-line lists, so that each RDD element is simply a word
words = rdd.flatMap(lambda x: x.split(" "))

# convert each element to (word, 1), where the 1 can be used to count
print("flatMap")
print(words.map(lambda x: (x, 1)).take(10))

# then use reduceByKey to add up the 1's for each word; x and y are the values (from elements with equal keys)
result = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
print("reduceByKey")
# top(10, key=...) returns the 10 words with the highest counts, in descending order of their counts
print(result.top(10, key=lambda x: x[1]))

What three steps are common to many Hadoop applications (slides 2.2)?

- Batch processing
- Parallel execution
- Spread data over a cluster of servers and take the computation to the data

What's a node container?

A container, also known as an executor, is where the processing of each partition takes place.

What is the relationship between split and block?

A file is split into blocks. The terms split and block are sometimes used interchangeably, although strictly a block is the physical division of the data in HDFS while a split is the logical division that a map task processes.

What are map tasks?

A map task takes an input value (such as a word or a line) and maps it to an output value (in our word-count example, we map each word to a 1).

What are reduce tasks?

A reduce task takes the output from a map as an input and combines those data tuples into a smaller set of tuples.

What's the difference between transformation and action? What does the term, lazy, mean in this context?

A transformation is a function that produces a new RDD from an existing RDD, while an action is performed on the actual dataset; actions do not form new RDDs. Transformations are lazy: they only build a graph (DAG) of operations without computing anything. When you call an action on the result of those transformations, everything is computed at once.
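A short sketch of the laziness (the path is a placeholder):

lines = sc.textFile("file:///var/log/syslog")       # transformation: nothing is read yet
errors = lines.filter(lambda s: "ERROR" in s)       # transformation: still nothing computed
# only the action below forces Spark to read the file and run the filter
print(errors.count())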

If Hadoop cannot be used, what are some other alternatives that may be used for parallel processing of data?

- Apache Spark: faster speed and a good programming interface; the Spark framework runs in-memory on a cluster, but it needs lots of memory at once.
- Cluster Map Reduce: provides a Hadoop-like framework for MapReduce jobs run in a distributed environment; by simplifying movement of data and minimizing dependencies that can slow data pulls, it can be faster.
- HPCC: High Performance Computing Cluster, which uses the Enterprise Control Language (ECL).
- Hydra: supports streaming and batch operations using a tree-based data structure, so it can store and process data across clusters that may have thousands of nodes.

What is block replication? Why is it done?

Block replication refers to storing copies (replicas) of each block on separate nodes, in order to maintain high fault tolerance.

What do these components do: Resource manager, Application manager?

The client requests that the resource manager execute a job.
Resource manager:
1. Obtains the file's block locations from the name node (metadata)
2. Computes input splits (the entire file for small files, otherwise file-size/block-size splits)
3. Requests that a node manager start the job
Application manager (in a data node):
1. Starts a map task for each split, and some reduce tasks (each map task is placed close to its data, according to the metadata)
2. Monitors execution, restarting tasks as necessary

What are examples of actions?

Count, Collect, Save (Return a result or write it to storage)
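A short sketch (the output path is a placeholder):

nums = sc.parallelize([1, 2, 3, 4, 5])
print(nums.count())                     # action: returns 5 to the driver
print(nums.collect())                   # action: returns [1, 2, 3, 4, 5] to the driver
nums.saveAsTextFile("/tmp/nums_out")    # action: writes the RDD out to storage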

What is a common business use of data mining? (I.e., What is cross-selling?)

Cross-selling is a common business use of data mining. Based on searches and product purchasing behavior, a business organization can figure out what products to recommend next. Using the sales history, current and social events, customer buying behaviors, and social media, an organization can recommend to its customers what items or services are frequently bought together. More importantly, the recommendations can be either customer specific/general or be based on what is trending.

Numbers vary, but about how many rows (or GB) of data are too few to be worthwhile to build an entire Hadoop application?

Data should be larger than 100MB in order for a Hadoop application to be a viable option

How do commodity hardware, Open Source, and the Cloud enable Big Data?

Data storage infrastructure such as commodity hardware, Open Source, and the Cloud is primarily responsible for storing, and to some extent processing, the immense amounts of data that companies are capturing. With people creating more and more data, companies used to store the data on their own commodity hardware, such as data servers. Then open-source software such as Linux, an operating system that runs on low-cost hardware, helped companies store data at lower cost. And instead of buying hardware and software, installing it in their own data centers, and then maintaining that infrastructure, companies now store data in the Cloud, such as Amazon Web Services (AWS), to get the capabilities they want on demand over the Internet. All of these data storage infrastructures allow companies to store massive amounts of data easily, which in turn enables them to process and analyze data more efficiently.

In a few sentences, describe a Big Data application

Dirty data is a record that contains mistakes, errors or incomplete values. The challenge of dirty data is to find ways to clean the data. Data cleaning requires going through the data meticulously, noting where incorrect or absent values could be hurting data accuracy. Detecting, correcting or removing incomplete or corrupted data is especially challenging when the data sets are massive.

Given a business problem, provide a recommendation as to when Hadoop would be appropriate, as well as when it would not be appropriate.

Hadoop would be appropriate for a bank trying to detect credit card fraud or analyzing capital markets to predict future movements in the market. Hadoop would, however, not be a good choice for handling the bank's financial transactions.

Assume your manager requests that you justify the use of Hadoop for some application at work. Present multiple reasons that justify the use of Hadoop for data mining.

If the application uses very large data sets, then for more advanced data mining methods like web or text mining and predictive analytics, Hadoop is appropriate: Hadoop is a stack of frameworks that support data analysis over distributed data, and it supports unstructured or semi-structured text.
1. Scalable: if the data set size can vary, Hadoop is a highly scalable storage platform, because it can store and distribute very large data sets across hundreds of inexpensive servers that operate in parallel. Unlike traditional relational database systems (RDBMS) that can't scale to process large amounts of data, Hadoop enables businesses to run applications on thousands of nodes involving thousands of terabytes of data.
2. Cost effective: Hadoop also offers a cost-effective storage solution for businesses' exploding data sets. The problem with traditional relational database management systems is that it is extremely cost-prohibitive to scale to such a degree in order to process such massive volumes of data. Hadoop uses commodity hardware, keeping costs down.
3. Flexible: Hadoop enables businesses to easily access new data sources and tap into different types of data (both structured and unstructured) to generate value from that data. This means businesses can use Hadoop to derive valuable business insights from data sources such as social media, email conversations, or clickstream data.
4. Fast: if you're dealing with large volumes of unstructured data, Hadoop is able to efficiently process terabytes of data in just minutes, and petabytes in hours.
5. Resilient to failure: a key advantage of using Hadoop is its fault tolerance. When data is sent to an individual node, that data is also replicated to other nodes in the cluster, which means that in the event of failure, there is another copy available for use.

