TDDE31 - MapReduce, Spark, Big Data for ML
Parallel Processing of Big-Data
Storage on a single hard disk (sequential access) would prevent parallel processing. Instead, use many standard server nodes in a cluster, each with its own local disks, and distribute the data across the nodes to allow for parallel access.
Lineage, RDD transformations
When defining RDDs through transformations like map, reduceByKey, etc., we are not actually computing them. The RDDs are only materialized when an action is run, which allows for optimization. (Example: a read followed by a filter; if Spark knows the whole chain of operations before materializing anything, it can compute only the final result, which is much more efficient.)
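A minimal PySpark sketch of this lazy evaluation; the HDFS path and the filter condition are made up for illustration:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "LazyLineage")

    # Transformations only record lineage; nothing is read or computed yet.
    lines  = sc.textFile("hdfs:///data/log.txt")            # hypothetical path
    errors = lines.filter(lambda l: "ERROR" in l)
    counts = errors.map(lambda l: (l.split()[0], 1)) \
                   .reduceByKey(lambda a, b: a + b)

    # Only this action triggers execution, so Spark can fuse the read, filter
    # and map into one pass instead of materializing each intermediate RDD.
    print(counts.take(5))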
Hadoop recommends how many mappers per cluster node?
10-100 mappers per node, so M, R >> P (the number of cluster nodes). "A reducer cannot start while a mapper is still in progress."
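As a rough, illustrative calculation (the numbers are assumed, not from the course): with P = 20 worker nodes and about 50 map tasks per node, M ≈ 20 · 50 = 1000, and R might then be set to a few hundred, so M, R >> P = 20.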
Limitations MapReduce
For multiple sequential MapReduce steps: it is a good idea to somehow keep the intermediate results in memory for the next mapper and thereby reduce I/O. This is the idea behind Spark.
Why can it be inefficient to run several MapReduce jobs in a row?
Because after each reduce, the data is written to global storage, and the next mapper then has to fetch it from there. With Spark you can avoid this. (The output of a mapper is stored on local storage; the output of a reducer is stored on global storage.)
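A sketch of how two logical MapReduce rounds become a single Spark job, so only the final result goes to global storage; the paths and the second round's logic are assumptions for illustration:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "ChainedRounds")

    words  = sc.textFile("hdfs:///data/text").flatMap(lambda l: l.split())
    counts = words.map(lambda w: (w, 1)) \
                  .reduceByKey(lambda a, b: a + b)          # "round 1"
    by_len = counts.map(lambda kv: (len(kv[0]), kv[1])) \
                   .reduceByKey(lambda a, b: a + b)         # "round 2"

    # Only the end result is written to the global file system; the output of
    # round 1 never goes back to HDFS between the rounds.
    by_len.saveAsTextFile("hdfs:///out/counts_by_word_length")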
Difference between Spark and plain MapReduce
In MapReduce, the data must be fetched from global storage after every iteration. In Spark you can choose to keep selected data locally between iterations, which is more efficient: rdd.persist(). Without persisting, the intermediate results are dropped after an action and would have to be recomputed.
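A minimal sketch of persisting an RDD that is reused across iterations; the input path and the per-iteration computation are just placeholders:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "PersistSketch")

    points = sc.textFile("hdfs:///data/points.txt") \
               .map(lambda l: [float(x) for x in l.split(",")]) \
               .persist()          # keep the parsed data cached between iterations

    for i in range(10):
        # Each action reuses the cached RDD instead of re-reading and
        # re-parsing the input from HDFS on every iteration.
        total = points.map(lambda p: sum(p)).reduce(lambda a, b: a + b)
        print(i, total)

    points.unpersist()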
Construction of new RDDs
If an RDD is lazy, it means that it is never materialized in secondary storage; it is only computed when an action needs it.
Task granularity
M (number of map tasks) and R (number of reduce tasks) need to be larger than the number of nodes P, but not too much larger, because then the overhead gets big (the master has to schedule all of them). A large M and R (i.e. smaller tasks) does make for more efficient work, though, minus that scheduling overhead.
Saving I/O with multiple sequential MapReduce steps: Splitting the MapReduce Construct into Simpler Operations - 2 Main Categories
Map-like operations are purely local (per node/partition), while reduce-like operations need to work globally, for example to compute a full sum, and typically want to write to the global file system.
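A small sketch contrasting the two categories: map/filter run locally on each partition, while the final reduce has to combine results from all partitions (the data here is made up):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "LocalVsGlobal")
    nums = sc.parallelize(range(1, 101), 4)     # 4 partitions

    # Local, map-like operations: each partition is processed independently.
    squares = nums.map(lambda x: x * x)
    evens   = squares.filter(lambda x: x % 2 == 0)

    # Global, reduce-like operation: partial results from all partitions must
    # be combined, which requires communication across the cluster.
    print(evens.reduce(lambda a, b: a + b))     # 171700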
Spark Idea: Data Flow Computing in Memory
Only when we encounter an action do we have to materialize the result. This means we can process the data that should be output in a different order: instead of sweeping over the data again and again, we follow the order of computation on smaller blocks of data held in cache/main memory. This gives much less I/O, and data we do not need is not read several times.
Data sharing is achieved via RDDs. How can they be defined?
An RDD can be defined from another RDD (via a transformation) or by reading data from disk.
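A quick PySpark sketch of these options, with made-up paths and data:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "RDDCreation")

    # From stable storage (disk / HDFS):
    lines = sc.textFile("hdfs:///data/input.txt")       # hypothetical path

    # From an existing RDD, via a transformation (this builds the lineage):
    upper = lines.map(lambda l: l.upper())

    # (Also possible: from a collection in the driver program.)
    nums = sc.parallelize([1, 2, 3, 4, 5])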
Mapreduce, reduce phase
Reducer: runs a user-defined reduce function once per key grouping; it can aggregate, filter, and combine data, and its output is 0 or more key/value pairs sent to the output formatter. Output formatter: translates the final (key, value) pairs from the reduce function and writes them to a file in HDFS.
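As an illustration only (using Hadoop Streaming rather than the native Java API), a word-count reducer sketch: the shuffle phase delivers "word<TAB>count" lines sorted by key on stdin, and the reducer emits one aggregated pair per key:

    #!/usr/bin/env python3
    # Word-count reducer sketch for Hadoop Streaming.
    import sys

    current_word, current_count = None, 0

    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        count = int(count)
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")   # 0 or more pairs per key
            current_word, current_count = word, count

    if current_word is not None:
        print(f"{current_word}\t{current_count}")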
Spark: Shared variables
Shared = not partitioned and distributed; accessible to all workers.
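Spark's two kinds of shared variables are broadcast variables (read-only, shipped to every worker) and accumulators (workers can only add to them; the driver reads the result). A small sketch with made-up data:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "SharedVariables")

    lookup = sc.broadcast({"a": 1, "b": 2})     # read-only copy on every worker
    misses = sc.accumulator(0)                  # workers add, driver reads

    def translate(x):
        if x not in lookup.value:
            misses.add(1)
            return 0
        return lookup.value[x]

    rdd = sc.parallelize(["a", "b", "c", "a"])
    print(rdd.map(translate).sum())             # 4
    print(misses.value)                         # 1 (read after the action has run)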
Spark streaming
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards.
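The classic network word count from the Spark Streaming (DStream) programming guide, shown here as a sketch; it listens on a local TCP socket, which you could feed with e.g. nc -lk 9999:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc  = SparkContext("local[2]", "NetworkWordCount")
    ssc = StreamingContext(sc, 1)                       # 1-second batches

    lines  = ssc.socketTextStream("localhost", 9999)
    words  = lines.flatMap(lambda line: line.split(" "))
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    counts.pprint()                                     # print each batch's counts

    ssc.start()
    ssc.awaitTermination()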
Phases in mapreduce
Map phase: reads the input, processes it element-wise and produces temporary sequences of data; works on key/value pairs that come in from HDFS. Shuffle phase: reorders the temporary data according to predefined criteria (typically grouping by key). Reduce phase: accumulation over data items that fulfill some property.
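A plain-Python toy simulation of the three phases on a word-count example (no Hadoop involved, just to make the data flow concrete):

    from collections import defaultdict

    docs = ["spark speeds up mapreduce",
            "mapreduce has map shuffle and reduce phases"]

    # Map phase: process each input element independently, emit temporary (key, value) pairs.
    mapped = [(word, 1) for doc in docs for word in doc.split()]

    # Shuffle phase: reorder/group the temporary pairs by key.
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)

    # Reduce phase: accumulate over all values that share a key.
    counts = {key: sum(values) for key, values in grouped.items()}
    print(counts)                       # e.g. {'mapreduce': 2, 'map': 1, ...}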