TDDE31 - MapReduce, Spark, Big Data for ML

Parallel Processing of Big-Data

Storage on a single hard disk (with sequential access) would prevent parallel processing. Instead, use lots of servers with lots of hard disks: standard server nodes in a cluster. Distribute the data across the nodes to allow for parallel access.

Lineage, RDD transformations

When defining RDDs through transformations like map, reduceByKey etc., we are not actually creating the RDDs. The RDDs are only created when an action is run, for optimization purposes. (Example: a read followed by a filter. If Spark knows the whole chain of operations before creating any RDDs, it can create just one at the end, which is much more efficient.)
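
A minimal PySpark sketch of this lazy evaluation, assuming a local SparkContext and a hypothetical input file data.txt; the transformations only build up lineage, and nothing runs until the final count():

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "lineage-demo")

# Transformations only record lineage; nothing is read or computed here.
lines  = sc.textFile("data.txt")                  # hypothetical input file
errors = lines.filter(lambda l: "ERROR" in l)     # still just a recipe
pairs  = errors.map(lambda l: (l.split()[0], 1))  # still just a recipe

# The action triggers execution. Because Spark now sees the whole chain,
# it can pipeline the filter and map into one pass over each partition.
print(pairs.count())
```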

Hadoop recommends how many mappers per cluster node?

10-100 mappers per node, i.e. M, R >> P (the number of nodes). Also: "A reducer cannot start while a mapper is still in progress."

Limitations MapReduce

For multiple sequential MapReduce steps, it is a good idea to somehow keep intermediate results in memory for the next mapper, and thereby reduce I/O. This is the idea behind Spark.

Why can it be inefficient to run several MapReduce jobs in a row?

Because after every reduce, the data is saved to global storage, and the next mapper then has to fetch it from there. With Spark you can avoid this. (The output of a mapper is stored on local storage; the output of a reducer is stored on global storage.)

Difference between Spark and plain MapReduce

In MapReduce, the data must be fetched from global storage after every iteration. In Spark you can choose to keep certain data locally between iterations, which is more efficient: rdd.persist(). Note that without persisting, the in-memory data is also discarded after an action.
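
As a sketch of the difference, the PySpark snippet below persists an RDD so that two consecutive actions reuse the in-memory partitions instead of re-reading the file (points.txt is a hypothetical input):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "persist-demo")

# Keep the parsed data in memory between iterations/actions.
data = sc.textFile("points.txt").map(lambda l: float(l)).persist()

# Both actions below reuse the cached partitions; without persist(),
# the file would be read and parsed again for each action.
total = data.sum()
count = data.count()
print(total / count)

data.unpersist()
```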

Construction of new RDDs

An RDD is lazy: it is not materialized when it is defined, only when an action needs it, and by default it is never materialized in secondary storage.

Task granularity

M (the number of map tasks) and R (the number of reduce tasks) need to be larger than the number of nodes, but not too much larger, or the scheduling overhead grows (the master has to schedule all of them). A large M and R (i.e. smaller tasks) does give more efficient work distribution, though, overhead aside.

Saving I/O across multiple sequential MapReduce steps: splitting the MapReduce construct into simpler operations - 2 main categories

Map is purely local; Reduce needs to work globally, e.g. to calculate a full sum, and wants to write to the global file system.
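
A small PySpark word count can illustrate the two categories (a sketch with hypothetical paths input.txt and counts_out): the map side runs locally per partition, while reduceByKey needs a global shuffle before the result is written to the shared file system.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "local-vs-global")

words = sc.textFile("input.txt").flatMap(lambda l: l.split())

# Local category: each partition is mapped independently, no communication.
pairs = words.map(lambda w: (w, 1))

# Global category: reduceByKey shuffles so that all values for a key meet
# on one node before being summed.
counts = pairs.reduceByKey(lambda a, b: a + b)

# Write the final result back to the (global) file system.
counts.saveAsTextFile("counts_out")
```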

Spark Idea: Data Flow Computing in Memory

Only when we encounter an action do we have to materialize the result. This lets us process the data that should be output in a different order: instead of sweeping over the whole data set again and again, we follow the chain of computations on smaller blocks of data held in cache/main memory. This means much less I/O, and data we do not need is not read several times.

Data sharing is achieved via RDDs. How can they be defined?

An RDD can be defined from another RDD (via a transformation) or by reading from disk.
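
Both routes in one short PySpark sketch (the HDFS path is hypothetical):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sources")

# 1) Define an RDD by reading from disk (e.g. a file in HDFS)
from_disk = sc.textFile("hdfs:///data/input.txt")

# 2) Define an RDD from another RDD, via a transformation
upper = from_disk.map(lambda line: line.upper())
```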

MapReduce, reduce phase

Reducer: runs a user-defined reduce function once per key grouping; can aggregate, filter, and combine data; its output is 0 or more key/value pairs sent to the output formatter. Output formatter: translates the final (key, value) pairs from the reduce function and writes them to a file in HDFS.
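
As a sketch (not the exact course code), a Hadoop Streaming reducer in Python for word count could look like this; it assumes the framework delivers key<TAB>value lines sorted by key and writes the emitted lines to a file in HDFS:

```python
#!/usr/bin/env python3
# Minimal Hadoop Streaming reducer sketch: one reduce per key grouping.
import sys

current_key, current_sum = None, 0

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key == current_key:
        current_sum += int(value)
    else:
        if current_key is not None:
            # Emit 0 or more key/value pairs; here, one aggregated pair per key.
            print(f"{current_key}\t{current_sum}")
        current_key, current_sum = key, int(value)

if current_key is not None:
    print(f"{current_key}\t{current_sum}")
```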

Spark: Shared variables

Shared = not partitioned and distributed; accessible to all workers.
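
Spark's two kinds of shared variables are broadcast variables (read-only, shipped once to every worker) and accumulators (workers can only add to them). A minimal PySpark sketch, with a made-up lookup table:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "shared-vars")

# Broadcast variable: one read-only copy per worker, not partitioned.
lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})

# Accumulator: workers can only add to it; the driver reads the total.
missing = sc.accumulator(0)

def translate(x):
    if x not in lookup.value:
        missing.add(1)
        return 0
    return lookup.value[x]

result = sc.parallelize(["a", "b", "x", "c"]).map(translate).collect()
print(result, "missing keys:", missing.value)
```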

Spark streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards.
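
A minimal streaming word count sketch over a TCP socket (localhost:9999 is an assumption, e.g. fed by `nc -lk 9999`), using 5-second micro-batches:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-wordcount")
ssc = StreamingContext(sc, batchDuration=5)      # 5-second micro-batches

# Ingest a live text stream from a TCP socket.
lines = ssc.socketTextStream("localhost", 9999)

counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()   # in a real job, push to a filesystem, database or dashboard

ssc.start()
ssc.awaitTermination()
```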

Phases in MapReduce

Map phase: reads the input, processes it element-wise and produces temporary sequences of data; works on key/value pairs that come in from HDFS. Shuffle phase: reorders the temporary data into intermediate form according to predefined criteria (grouping by key). Reduce phase: accumulation over the data items that fulfill some property.
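
The corresponding map side, again as a Hadoop Streaming sketch in Python, reads records from stdin (coming from HDFS) and emits temporary (key, value) pairs; the shuffle phase between this mapper and the reducer shown earlier is handled by the framework, which sorts and groups the pairs by key:

```python
#!/usr/bin/env python3
# Minimal Hadoop Streaming mapper sketch for word count.
import sys

for line in sys.stdin:              # input records from HDFS
    for word in line.split():
        print(f"{word}\t1")         # temporary (key, value) pairs for the shuffle
```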

