Data Mining (1)

Why Use Spark Operations?

- Spark is implemented in Scala/Java and runs in the JVM, so its built-in operations are faster than equivalent plain Python code
- Spark can analyze and optimize the computing process

Data Mining Tasks

- Descriptive methods
  = Find human-interpretable patterns that describe the data (e.g. clustering)
- Predictive methods
  = Use some variables to predict unknown or future values of other variables
  = Example: recommender systems

Map Reduce (Inverted Index)

- (ID, content) ⇒ (content, List[IDs])
- Map: for each word in the input value, emit (word, tweet_ID) as an intermediate (key, value) pair
- Reduce: emit the key and the list of tweet_IDs associated with that key
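A minimal single-process Python sketch of this job; the shuffle is simulated with a dict and the tweet data is made up for illustration:

    from collections import defaultdict

    # (ID, content) pairs; sample data invented for illustration
    tweets = [(1, "big data is big"), (2, "data mining"), (3, "big mining")]

    # Map: emit (word, tweet_ID) for each word in the content
    intermediate = []
    for tweet_id, content in tweets:
        for word in content.split():
            intermediate.append((word, tweet_id))

    # Shuffle/Reduce: group tweet IDs by word, keeping each ID once per word
    index = defaultdict(list)
    for word, tweet_id in intermediate:
        if tweet_id not in index[word]:
            index[word].append(tweet_id)

    print(dict(index))  # {'big': [1, 3], 'data': [1, 2], 'is': [1], 'mining': [2, 3]}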

The challenges of Big Data?

- Cannot be stored in one place
- Access failures may be unexpected
- Unpredictable diversity
- Messy, noisy, and errors are inevitable
- Dynamic
- ...

Components of Distributed File System

- Chunk server
  = Files are split into contiguous chunks (typically 16-64 MB)
  = Each chunk is replicated (usually 2x or 3x)
  = Try to keep replicas in different racks, in case the switch on a rack fails and the entire rack becomes inaccessible
- Master node (name node)
  = Stores metadata about where files are stored
  = Might be replicated, otherwise it becomes a single point of failure
- Client library for file access
  = Talks to the master to find the chunk servers that store the chunks
  = Connects directly to chunk servers to access data without going through the master node

Partition Function

- Controls how keys get partitioned
  = Records with the same intermediate key must end up at the same Reduce worker
- The system uses a default partition function: hash(key) mod R
- Sometimes it is useful to override the hash function:
  = E.g. hash(hostname(URL)) mod R ensures URLs from the same host end up in the same output file
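For illustration, a sketch of the default and an overridden partitioner in Python, assuming R = 4 reduce tasks (note that Python's hash() varies between runs unless PYTHONHASHSEED is fixed):

    from urllib.parse import urlparse

    R = 4  # number of reduce tasks (assumed)

    def default_partition(key):
        return hash(key) % R

    def host_partition(url):
        # all URLs from the same host land in the same partition/output file
        return hash(urlparse(url).hostname) % R

    print(host_partition("http://example.com/a") ==
          host_partition("http://example.com/b"))  # True: same host, same partition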

Partitions

- Data are split into multiple partitions by a hashing function
- Ensure the partitions are balanced
- A common partition size is 64 MB; a common number of partitions is 2-3x the number of workers
- Fewer partitions may sometimes give better performance, because partitioning itself has a cost
- repartition(numPartitions) changes the number of partitions

Distributed File System

- Data is kept in "chunks" spread across machines (chunk servers)
- Each chunk is replicated on different machines
  = E.g. with 4 chunk servers, file 1 is split into 6 chunks and each chunk is replicated twice; the replicas of a chunk are never on the same chunk server
- Chunk servers also serve as compute servers
  = Bring computation to data!

General Characteristic of Good Problem for Map-Reduce

- Data set is truly "BIG"
  = Terabytes, not tens of gigabytes
  = Hadoop/MapReduce is designed for terabyte/petabyte-scale computation
  = Most real-world problems process less than 100 GB of input
- Don't need a fast response time
  = When submitting jobs, Hadoop latency can be 1 minute
  = Not well suited for problems that require faster response times
  = A good pre-computation engine
- Good for applications that work in batch mode
  = Runs over the entire data set
  = Takes time to initiate and run; the shuffle step can be time-consuming
  = Does not provide good support for random access to datasets
- Best suited for data that can be expressed as "KEY-VALUE PAIRS" without losing context or dependencies
  = Graph data is hard to process using MapReduce
  = Graph algorithms need information about the entire graph for each iteration
- Other problems/data NOT suited for MapReduce:
  = Tasks that need results of intermediate steps to compute the results of the current step
  = Some machine learning algorithms
- SUMMARY: good candidates for MapReduce:
  = Jobs that process huge quantities of data and either summarize or transform the content
  = Collected data has elements that can easily be captured with an identifier (key) and a corresponding value

Map Reduce (Get unique integers)

- Design a MapReduce algorithm to take a very large file of integers and produce the set of unique integers as output
- The large file of integers cannot fit in the memory of a node
- Map: emit (integer, 1) once only for each unique integer in each chunk, e.g. maintain an array or a map to track whether that integer has been seen/emitted before
- Shuffle: groups together all values for the same integer: (integer, (1, 1, 1, ...)); the same integer might appear in chunks from multiple Map tasks, but each integer key goes to only one Reduce task
- Reduce: each Reduce task eliminates duplicates (ignoring the list of 1's) for each integer key and emits (integer); the output from multiple Reduce tasks is combined (not required to be in any order)
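A single-process Python simulation of this job; the chunks and their contents are made up, and the shuffle is implied by collecting the intermediate pairs:

    # Each chunk is mapped independently, as separate Map tasks would be
    chunks = [[1, 2, 2, 7], [2, 7, 9], [1, 9, 9]]   # made-up input

    # Map: emit (integer, 1) once per unique integer per chunk
    intermediate = []
    for chunk in chunks:
        seen = set()   # tracks integers already emitted in this chunk
        for x in chunk:
            if x not in seen:
                seen.add(x)
                intermediate.append((x, 1))

    # Shuffle + Reduce: each key goes to one reducer, which ignores the 1's
    unique = sorted({key for key, _ in intermediate})
    print(unique)  # [1, 2, 7, 9]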

Map Reduce (Compute Average, with combiner)

- Design a MapReduce algorithm to take a very large file of integers and produce the average of all the integers as output
- The large file of integers cannot fit in the memory of a node
- Map: each Map task produces (key, (number of integers, sum of integers)) for its chunk as output
  = The key can be the same value for all mappers, e.g. 1
  = The value is the (number, sum) tuple
- Reduce: a single Reduce task sums all the sums of integers, sums all the numbers of integers, then calculates and emits the average
  = Since each Map task summarizes a large chunk of data with a single (key, (number, sum)) pair, a single reducer can be used even with thousands of Map tasks
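A minimal Python simulation under the same assumptions (made-up chunks, key fixed to 1):

    # Map: each task summarizes its chunk as (1, (count, sum))
    chunks = [[4, 8, 15], [16, 23], [42]]            # made-up input
    mapped = [(1, (len(c), sum(c))) for c in chunks]

    # Reduce: a single task adds the counts and sums, then divides
    total_n = sum(n for _, (n, _) in mapped)
    total_s = sum(s for _, (_, s) in mapped)
    print(total_s / total_n)  # 18.0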

Map Reduce (Count the number of unique integers)

- Design a MapReduce algorithm to take a very large file of integers and produce the count of distinct integers as output
- The large file of integers cannot fit in the memory of a node
- TWO STAGES NEEDED!
- First phase:
  = Map: just emit (integer, 1) for each unique integer
  = Reduce: eliminates duplicates (across chunks)
- Second phase:
  = Map: each Map task gets some unique integers from the previous Reduce phase as input and outputs a key-value pair like (some key, count) (the key could be 1)
  = Reduce: a single reducer sums all counts from the Map tasks and produces the overall count
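A compact Python simulation of the two phases; the input and the split into two second-phase Map tasks are invented:

    # Phase 1: Map emits (integer, 1); Reduce keeps one copy per key
    data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]   # made-up input
    distinct = set(data)                     # stands in for phase-1 output

    # Phase 2: each Map task counts its share and emits (1, partial_count);
    # a single Reduce task sums the partial counts
    shares = [list(distinct)[0::2], list(distinct)[1::2]]  # pretend 2 map tasks
    partials = [(1, len(s)) for s in shares]
    print(sum(c for _, c in partials))       # 7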

Map Reduce (Find largest integer)

- Design a MapReduce algorithm to take a very large file of integers and produce the largest integer as output
- The large file of integers cannot fit in the memory of a node
- Map: each Map task produces (1, largest-integer), the largest value in its local chunk, as a (key, value) pair
- Group by key: (1, (10, 61, 46))
- Reduce: a single Reduce task picks the largest integer: (1, (10, 61, 46)) ⇒ (1, 61) ⇒ 61
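The same flow as a few lines of Python, with made-up chunks matching the numbers above:

    # Map: each task emits (1, max of its chunk)
    chunks = [[10, 3], [61, 2], [46, 5]]     # made-up input
    mapped = [(1, max(c)) for c in chunks]   # [(1, 10), (1, 61), (1, 46)]

    # Group by key yields (1, [10, 61, 46]); a single Reduce picks the max
    print(max(v for _, v in mapped))  # 61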

Map Reduce (Integers divisible by 7)

- Design a MapReduce algorithm that takes a very large file of integers and produces as output all unique integers from the original file that are evenly divisible by 7
- The large file of integers cannot fit in the memory of a node
- Map:
    for v in valuelist:
        if v % 7 == 0:
            emit(v, 1)
- Reduce:
    # eliminate duplicates: one output per key
    emit(key, 1)
- Why check divisibility by 7 in the Map task rather than the Reduce task? ⇒ To reduce communication: send less data over the network.

Resilient Distributed Dataset (RDD)

- Distributed
  = Data are split into multiple partitions and distributed across nodes to be processed in parallel
- Resilient
  = Spark keeps track of transformations and enables efficient recovery
- Built-in data structure
  = You can't access the values directly in PySpark

Redundant Storage Infrastructure

- Distributed file system
  = Stores data multiple times across a cluster
  = Provides a global file namespace
  = E.g. Google GFS, Hadoop HDFS
- Typical usage pattern
  = Huge files (100s of GB to TB)
  = Data is rarely updated in place
  = Reads and appends are common

What is data mining?

- Given lots of data
- Discover patterns and models that are:
  = Valid: hold on new data with some certainty
  = Useful: it should be possible to act on them
  = Unexpected: non-obvious to the system
  = Understandable: humans should be able to interpret the pattern

Bonferroni's principle

- If you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.
- A risk with "data mining" is that an analyst can "discover" patterns that are meaningless.

Map Reduce (Count Friends)

- In a social network (Facebook, Instagram, ...), how many friends does each person have?
- Map: emit the person's name as key and a count of 1 for each friendship as value, e.g. (Jim, 1)
- Reduce: count the total number of friends for each person
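A small Python simulation; the friendship list is invented:

    from collections import Counter

    # undirected friendships; names are made up
    friendships = [("Jim", "Sue"), ("Jim", "Bob"), ("Sue", "Bob")]

    # Map: for each friendship, emit (person, 1) for both endpoints
    pairs = []
    for a, b in friendships:
        pairs.extend([(a, 1), (b, 1)])

    # Reduce: sum the 1's per person
    counts = Counter()
    for name, one in pairs:
        counts[name] += one
    print(dict(counts))  # {'Jim': 2, 'Sue': 2, 'Bob': 2}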

Data Flow

- Input and final output are stored on a distributed file system (DFS)
  = The scheduler tries to schedule Map tasks "close" to the physical storage location of the input data
- Intermediate results are stored on the local file system of Map workers
  = E.g. the output of the Map step
- Output is often the input to another MapReduce task

Map Reduce Formal

- Input: a set of key-value pairs
- The programmer needs to specify two methods:
  ⇒ Map(k, v) ➝ <k', v'>*
    - Takes a key-value pair and outputs a set of key-value pairs
    - There is one Map call for every (k, v) pair
  ⇒ Reduce(k', <v'>*) ➝ <k', v''>*
    - All values v' with the same key k' are reduced together
    - There is one Reduce function call per unique key k'

Map - Reduce

- Map
  = Divide the file into many "records"
  = Extract something (e.g. a word) from each record (as the key)
  = Output one or multiple things for each record
- Group by key
  = Sort and shuffle
- Reduce
  = Aggregate, summarize, filter, or transform
  = Output the result

Dealing with Failures

- Map worker failure:
  = Map tasks completed or in progress at the worker are reset to idle
  = Idle tasks are eventually rescheduled on other worker(s)
- Reduce worker failure:
  = Only in-progress tasks are reset to idle
  = Idle Reduce tasks are restarted on other worker(s)
- Master failure:
  = The MapReduce task is aborted and the client is notified

Coordination: Master

- The master node takes care of coordination:
  = Task status: idle, in-progress, completed
  = Idle tasks get scheduled as workers become available
  = When a Map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer (R = number of reducers)
  = The master pushes this info to reducers
- The master pings workers periodically to detect failures

Map-Reduce addresses the challenges

- Node failure ⇒ store data redundantly on multiple nodes
- Network bottleneck ⇒ move computation close to the data to minimize data movement
- Distributed programming ⇒ Map and Reduce functions

Cluster Computing Challenges

- Node failures
  ⇒ Store data persistently and keep it available when nodes fail
  ⇒ Deal with node failures during a long-running computation
- Network bottleneck
  ⇒ A framework that does not move data around so much while it's doing computation
- Distributed/parallel programming is hard
  ⇒ A simple model that hides most of the complexity

RDD Operations

- Transformations
  = Lazy: the result is not immediately computed
- Actions
  = Eager: the result is immediately computed
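A minimal PySpark sketch of the difference, assuming a local PySpark installation:

    from pyspark import SparkContext

    sc = SparkContext("local", "demo")
    rdd = sc.parallelize(range(10))

    squares = rdd.map(lambda x: x * x)  # transformation: lazy, nothing runs yet
    print(squares.sum())                # action: triggers the computation (285)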

Common "Actions" on RDD

- getNumPartitions()
- foreachPartition(func)
- collect()
- take(n)
- count(), sum(), max(), min(), mean()
- reduce(func)
- aggregate(zeroVal, seqOp, combOp)
- countByKey()

Common "Transformations" on RDD

- map(func)
- mapValues(func)
- filter(func)
- flatMap(func)
- reduceByKey(func, [numTasks])
- groupByKey([numTasks])
- distinct([numTasks])
- mapPartitions(func)
⇒ All return another RDD

reduceByKey() vs. reduce()

- reduceByKey() returns an RDD ⇒ reduces values per key
- reduce() returns a non-RDD value ⇒ reduces all values!
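A short PySpark illustration, assuming a local installation:

    from pyspark import SparkContext

    sc = SparkContext("local", "demo")

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    print(pairs.reduceByKey(lambda x, y: x + y).collect())
    # an RDD of per-key sums: [('a', 4), ('b', 2)] (order may vary)

    nums = sc.parallelize([1, 2, 3, 4])
    print(nums.reduce(lambda x, y: x + y))  # a plain value: 10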

Steps of executing a Spark program

1. The driver program runs the Spark application, which creates a SparkContext
2. The SparkContext connects to a cluster manager to allocate resources
3. Spark acquires executors on nodes in the cluster, which run computations for your application
4. The driver program sends your application code to the executors to execute

Caching and Persistence

By default, RDDs are recomputed each time you run an action on them. This can be expensive (time-consuming) if you need to use a dataset more than once. Spark allows you to control what is cached in memory: use persist() or cache().
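A PySpark sketch of the idea; "input.txt" is a placeholder path used only for illustration:

    from pyspark import SparkContext

    sc = SparkContext("local", "demo")

    # "input.txt" is a placeholder file for this example
    words = sc.textFile("input.txt").flatMap(lambda line: line.split())
    words.cache()                    # mark the RDD to be kept in memory

    print(words.count())             # first action: reads the file, caches the RDD
    print(words.distinct().count())  # reuses the cached RDD instead of re-reading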

Map Reduce (Distributed Sort)

Goal: sort a very large list of (first name, last name) pairs by last name, then first name
- Map: emit (lastName, firstName)
- Group by key: group together entries with the same last name
  = Divide the keys into non-overlapping alphabetical ranges (sorting); keys are sorted in alphabetical order
- Reduce: processes one key at a time
  = For each (lastName, list(firstName)), emit (lastName, firstName) with the first names in alphabetical order (sorting)
- Merge the output from all Reduce tasks (e.g. write)
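A tiny Python stand-in for the flow, with one reducer and made-up names:

    # Map: emit (lastName, firstName)
    people = [("Ada", "Smith"), ("Bob", "Jones"), ("Cal", "Smith")]  # made up
    mapped = [(last, first) for first, last in people]

    # The shuffle assigns sorted, non-overlapping key ranges to reducers;
    # a single "reducer" here just sorts by last name, then first name
    for last, first in sorted(mapped):
        print(last, first)   # Jones Bob / Smith Ada / Smith Cal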

Map Reduce (Word length histogram, with combiner)

How many "big" (10+ letters), "medium" (5-9 letters), "small" (2-4 letters) and "tiny" (1 letter) words are used? Split the document into multiple chunks and process each chunk on different nodes Map: Sort up words in designated key value and then use combiner to sum up the count. ("big", 17) ... Reduce: Sum up the count for each key.

How many Map and Reduce jobs?

M Map tasks, R Reduce tasks
- Rule of thumb:
  = Make M much larger than the number of nodes in the cluster
  = One DFS chunk per Map task is common
  = Improves dynamic load balancing and speeds up recovery from worker failures
- Usually R is smaller than M
  = Output is spread across R files
  = Google example: often 200,000 Map tasks and 5,000 Reduce tasks on 2,000 machines

Map-Reduce Summary

Map task:
- Input is in key-value format, e.g. key = file location, value = text
- Map code is written by the user
- Processes chunks and produces a sequence of key-value pairs. Note: these are not "keys" in the usual sense; they do not have to be unique
Group by key / shuffle:
- Collects key-value pairs from each Map task
- Values associated with each key are formed into a list of values
- All key-value pairs with the same key go to the same Reduce task
Reduce task:
- Reduce code is written by the user
- Produces output key-value pairs

Map Reduce (Relational Join)

Map task:
- Key = the key used for the join
- Value = a tuple with all fields from the table (including the table name)
Group by key:
- Groups together all values (tuples) associated with each key
Reduce:
- Emit the joined values (without the table names)
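A single-process Python sketch of this reduce-side join; the table names, join key, and rows are all invented:

    from collections import defaultdict

    # two tables sharing the join key user_id; rows are made up
    orders = [(1, "book"), (2, "pen")]   # (user_id, item)
    users = [(1, "Ann"), (2, "Raj")]     # (user_id, name)

    # Map: tag each tuple with its table name, keyed by the join key
    intermediate = [(k, ("Orders", v)) for k, v in orders]
    intermediate += [(k, ("Users", v)) for k, v in users]

    # Group by key, then Reduce: pair values coming from different tables
    groups = defaultdict(list)
    for k, tagged in intermediate:
        groups[k].append(tagged)
    for k, vals in groups.items():
        names = [v for tag, v in vals if tag == "Users"]
        items = [v for tag, v in vals if tag == "Orders"]
        for name in names:
            for item in items:
                print((k, name, item))   # joined row, table tags dropped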

Map-Reduce: Environment

The MapReduce environment takes care of:
- Partitioning the input data
- Scheduling the program's execution across a set of nodes
- Performing the group-by-key step
- Handling machine failures
- Managing required inter-machine communication

Combiners

- Often a Map task will produce many pairs of the form (k, v1), (k, v2), ... for the SAME key k
- Instead of producing many pairs with the same key, we can sum the n occurrences of a word w and emit (w, n) in the Map task before shipping to reducers
  = Now each node only sends a single value for each word
  = Can save network time by pre-aggregating values in the mapper
- Combiners run after the Mappers and before the Reducers
  = The combiner receives all data emitted by the Mapper as input
  = The output of the combiner is then sent to the reducers
  = Usage of combiners is optional
- When to use it?
  = If a combiner is suitable for the job, instances of the combiner run on every node that has run Map tasks
  = The combiner is a "mini-reduce" process (usually the same as the reduce function)
  ⇒ If a reduce function is both COMMUTATIVE and ASSOCIATIVE (e.g. sum: add up all the input values), it can be used directly as a combiner
  ⇒ If the reducer cannot be used directly as a combiner because of commutativity or associativity, you may still be able to write a third class to use as a combiner (e.g. average); see the sketch below
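A Python sketch of that "third class" idea for average, which is not itself associative; the function names are made up:

    # Sum is commutative and associative, so a sum reducer can double as a
    # combiner. Average is not, so the combiner emits (count, sum) partial states.
    def avg_combiner(values):             # runs per map node, per key
        vals = list(values)
        return (len(vals), sum(vals))     # partial (count, sum)

    def avg_reducer(partials):            # runs once per key
        n = sum(c for c, _ in partials)
        s = sum(t for _, t in partials)
        return s / n

    print(avg_reducer([avg_combiner([1, 2, 3]), avg_combiner([4, 5])]))  # 3.0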

Map Reduce (Matrix Multiplication)

One phase:
- Map:
  = For each element (i, j) of A, emit ((i, k), ('A', j, A[i,j])) for k in 1...N
  = For each element (j, k) of B, emit ((i, k), ('B', j, B[j,k])) for i in 1...L
- Reduce:
  = Key = (i, k), a position in C
  = Value = SUM over j of A[i,j] * B[j,k]
Two phases:
1. Multiply the appropriate values in the 1st MapReduce phase
2. Add up in the 2nd MapReduce phase
- 1st Map:
  = For each matrix element A[i,j]: emit (j, ('A', i, A[i,j]))
  = For each matrix element B[j,k]: emit (j, ('B', k, B[j,k]))
- 1st Reduce:
  = For each key j, produce all possible products: for each (i, k) pair coming from A and B, i.e. ('A', i, A[i,j]) and ('B', k, B[j,k]), emit ((i, k), A[i,j] * B[j,k])
- 2nd Map:
  = The input is the (key, value) pairs from the 1st Reduce task; let each ((i, k), A[i,j] * B[j,k]) pass through
- 2nd Reduce:
  = For each (i, k), add up the values and emit ((i, k), SUM(values))
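The one-phase version as a single-process Python simulation; the matrices and dimensions are made up, and 0-based indices are used:

    from collections import defaultdict

    # A is L x M, B is M x N; entries are made up
    A = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
    B = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}
    L, N = 2, 2

    # Map: replicate each A and B entry to every output cell it contributes to
    inter = defaultdict(list)
    for (i, j), a in A.items():
        for k in range(N):
            inter[(i, k)].append(("A", j, a))
    for (j, k), b in B.items():
        for i in range(L):
            inter[(i, k)].append(("B", j, b))

    # Reduce: for each cell (i, k), sum A[i,j] * B[j,k] over j
    C = {}
    for (i, k), vals in inter.items():
        a_vals = {j: v for tag, j, v in vals if tag == "A"}
        b_vals = {j: v for tag, j, v in vals if tag == "B"}
        C[(i, k)] = sum(a_vals[j] * b_vals[j] for j in a_vals)
    print(C)  # {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}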

What is a switch?

- Switch: connects nodes, e.g. 1 GB/sec bandwidth between any pair of nodes in a rack
- Backbone switch: connects racks, e.g. 2-10 GB/sec bandwidth between racks

Map Reduce (word count)

map(key, value):
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    result = 0
    for each count v in values:
        result += v
    emit(key, result)

