MapReduce


Flow 10. RecordWriter?

1. Writes the Reducer's output key-value pairs to the output files.

Flow 6. Combiner?

1. A 'mini-reducer'. 2. Performs local aggregation on the mapper's output. 3. Helps minimize the data transfer between mapper and reducer. 4. The combiner's output is passed to the partitioner for further work.
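The local aggregation a combiner performs can be sketched in plain Python. This is only an illustration under assumptions, not Hadoop code: the function name `combine` and the sample pairs are hypothetical, and summing is just one possible aggregation.

```python
from collections import defaultdict

def combine(mapper_output):
    """Sum the values per key within a single mapper's output (hypothetical helper)."""
    totals = defaultdict(int)
    for key, value in mapper_output:
        totals[key] += value
    # Sorting only makes the result deterministic for this illustration.
    return sorted(totals.items())

# Made-up output of one mapper: four pairs before combining.
mapper_output = [("cat", 1), ("dog", 1), ("cat", 1), ("cat", 1)]
combined = combine(mapper_output)
print(combined)
```

Four pairs shrink to two before anything crosses the network, which is the saving the combiner is meant to provide.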

Flow 3. InputSplits?

1. Created by the InputFormat; each split logically represents the data that will be processed by an individual Mapper. 2. One map task is created for each split. 3. The number of map tasks therefore equals the number of InputSplits. 4. Each split is divided into records, and each record is processed by the mapper.
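To make the "one map task per split" rule concrete, here is a simplified sketch that cuts a file of a given length into fixed-size logical splits. The function `compute_splits` and the sizes are assumptions for illustration; Hadoop's real split computation also considers block locations and other settings that are ignored here.

```python
def compute_splits(file_length, split_size):
    """Return (start_offset, length) pairs covering the file (hypothetical helper)."""
    return [
        (start, min(split_size, file_length - start))
        for start in range(0, file_length, split_size)
    ]

# Assumed sizes in arbitrary units: a 300-unit file cut into 128-unit splits.
splits = compute_splits(300, 128)
num_map_tasks = len(splits)  # one map task per split
print(splits, num_map_tasks)
```

The last split is shorter than the others because the file length is not a multiple of the split size.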

Flow 1. Input Files?

1. Data for a MapReduce task is stored in input files, which typically live in HDFS. 2. The format of these files is arbitrary; both line-based log files and binary formats can be used.

Flow 2. InputFormat?

1. InputFormat defines how input files are split and read. 2. It selects the files used for input. 3. It creates the InputSplits.

Flow 7. Partitioner?

1. The partitioner is used only if the number of reducers is greater than 1 (with a single reducer, no partitioning is needed). 2. It takes the combiner's output and partitions it. 3. Partitioning takes place on the basis of the key; the data is then sorted. 4. A hash function applied to the key (or a subset of the key) derives the partition. 5. According to the key value: 5.1 each combiner output is partitioned, 5.2 records having the same key go into the same partition, 5.3 each partition is sent to a reducer. 6. Partitioning allows an even distribution of the map output over the reducers.
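Hash partitioning as described above can be sketched as "hash of the key, modulo the number of reducers". The sketch below uses CRC32 as a deterministic stand-in hash; that choice, the function name `partition`, and the sample keys are assumptions and do not reproduce the hash Hadoop actually applies to its key types.

```python
import zlib

def partition(key, num_reducers):
    """Map a key to a reducer index in [0, num_reducers) (hypothetical helper)."""
    # CRC32 is a stand-in hash chosen because it is stable across runs.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Made-up combiner output routed to 3 reducers.
pairs = [("cat", 3), ("dog", 1), ("cat", 2), ("ant", 5)]
partitions = {}
for key, value in pairs:
    partitions.setdefault(partition(key, 3), []).append((key, value))
print(partitions)
```

Because the partition depends only on the key, both `"cat"` records land in the same partition and therefore reach the same reducer.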

Flow 4. RecordReader?

1. Communicates with the InputSplit in Hadoop MapReduce. 2. Converts the data into key-value pairs suitable for reading by the mapper. 3. Default: the RecordReader of TextInputFormat (the default InputFormat), which reads the input line by line. 4. The RR keeps communicating with the InputSplit until the file has been read completely. 5. It assigns a byte offset (a unique number) to each line in the file. 6. The key-value pairs [byte offset, line] are sent to the mapper for further processing.
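The [byte offset, line] pairs can be illustrated with a small generator over raw bytes. This is a sketch of the idea only: `read_records` is a hypothetical name, and the sample data is made up.

```python
def read_records(data: bytes):
    """Yield (byte_offset, line_text) for each line in the data (hypothetical helper)."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The key is the offset of the line's first byte; the value is the line
        # without its trailing newline.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)

records = list(read_records(b"hello world\nfoo bar\nbaz\n"))
print(records)
```

Each offset is unique within the file because it counts every byte that precedes the line, newline characters included.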

What is MapReduce?

1. The data processing layer of Hadoop. 2. Processes large volumes of structured and unstructured data stored in HDFS. 3. Works in parallel by dividing the submitted job into a set of independent tasks (sub-jobs). 4. This parallel processing improves the speed and reliability of the cluster. 5. We only need to express the custom code (business logic) in the form MapReduce expects; the engine takes care of the rest.

Flow 11. OutputFormat?

1. The OutputFormat determines how the RecordWriter writes output key-value pairs to the output files. 2. The OutputFormat instances provided by Hadoop write files to HDFS or to the local disk. 3. The final output of the reducer is thus written to HDFS by the OutputFormat.

Flow 9. Reducer?

1. Input: the set of intermediate key-value pairs produced by the mappers. 2. Runs the reducer function on each of them to generate the output. 3. The output of the reducer is the final output, which is stored in HDFS.

Flow 5. Mapper?

1. Processes each input record (from the RR) and generates a new key-value pair. 2. The key-value pair generated by the Mapper differs from the input pair. 3. The output, known as intermediate output, is written to the local disk. 4. It is not stored on HDFS: it is temporary data, and writing it to HDFS would create unnecessary copies. 5. The Mapper's output is passed to the combiner for further processing.

Flow 8. Shuffling and Sorting?

1. The mapper output is shuffled to the reduce node (a normal slave node on which the reduce phase runs, hence called the reducer node). 2. Shuffling is the physical movement of the data, done over the network. 3. Once all the mappers have finished and their output has been shuffled to the reducer nodes, 3.1 the intermediate output is merged and sorted, 3.2 and then provided as input to the reduce phase.
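The merge-and-sort step can be sketched as grouping values by key across all mapper outputs and ordering the keys. This in-memory version is an assumption-laden illustration: `shuffle_and_sort` is a hypothetical name, and real shuffling moves data over the network rather than through a dictionary.

```python
from collections import defaultdict

def shuffle_and_sort(mapper_outputs):
    """Merge several mappers' outputs into sorted (key, [values]) groups (hypothetical helper)."""
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for key, value in output:
            grouped[key].append(value)
    return [(key, grouped[key]) for key in sorted(grouped)]

# Made-up intermediate output from two mappers.
mapper_outputs = [
    [("cat", 1), ("dog", 1)],
    [("cat", 2), ("ant", 1)],
]
reduce_input = shuffle_and_sort(mapper_outputs)
print(reduce_input)
```

Each reducer call then receives one key together with all the values collected for it.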

How Does Hadoop MapReduce Work?

Two phases: Map phase and Reduce phase. 1. Map: where we specify all the complex logic, business rules, and costly code. 2. Reduce: where we specify light-weight processing such as aggregation or summation.
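A word count is a common way to see the two phases side by side. The sketch below runs both phases in a single Python process, so it shows the programming model only, not Hadoop's distributed execution; `map_fn`, `reduce_fn`, `run_job`, and the input lines are all hypothetical.

```python
from collections import defaultdict

def map_fn(offset, line):
    """Map phase: emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(key, values):
    """Reduce phase: light-weight aggregation, here a sum."""
    return key, sum(values)

def run_job(lines):
    """Run map, group by key, then reduce (hypothetical driver)."""
    grouped = defaultdict(list)
    # The line index stands in for the byte offset a real reader would supply.
    for offset, line in enumerate(lines):
        for key, value in map_fn(offset, line):
            grouped[key].append(value)
    return [reduce_fn(key, grouped[key]) for key in sorted(grouped)]

result = run_job(["the cat sat", "the dog sat"])
print(result)
```

The per-line logic lives in `map_fn`, while `reduce_fn` only sums, matching the split between costly map code and light-weight reduce code.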

