Data Science - MapReduce
Changing the compression ratio of the data is an example of optimizing _____.
the onload of the data
understanding MapReduce
- Map works on (K1, V1) input key-value pairs and emits (K2, V2) intermediate values (one list per node)
- shuffle/sort groups the intermediate values by key
- Reduce takes (K2, list(V2)) and emits (K3, V3)
- MR uses a functional paradigm
Cloudera has developed a framework for MapReduce called ______________.
MRUnit
Which statement about coding a JAR file for MapReduce 2.0 is true?
Map and Reduce classes are usually static classes. Feedback: Hadoop itself needs to be able to instantiate the Mapper or Reducer with the reference class configured in the Job.
Which MapReduce phase is based on code that you have to write?
Mapping
When tuning for optimal spill ratio, what should you aim for?
The number of spilled records is equal to the number of map output records.
Architecturally, where is the YARN layer located?
between MapReduce and HDFS
MapReduce is a model that processes?
big data sets
map()
takes the input and converts it into another set of data, where each element is converted to key-value pairs. It processes and counts the input, then sends the results to reduce()
Unit Testing
the process of examining a small unit or piece of software to verify that it meets the business need provided.
JobContext interface
the super interface for all the classes in MapReduce that define different jobs. It gives a read-only view of the job
Which scenario is distributed caching often used for?
translating data using a lookup
A JobContext interface's main class is the Job class.
true
Optimizing the onload of data by breaking the data into smaller chunks will affect which part of the MapReduce() process?
Splitting. Feedback: Splitting is one of the first parts of the process, and data can be onloaded faster if the data to split is smaller.
Which statements in a MapReduce framework describe JUnit?
Mapper ?
In your MapReduce 2.0 code, you have a static class called Map() that implements the mapping. From which parent class should this class inherit?
Mapper()
shuffle and sort phase
takes the mapped data and groups/sorts it by key so it can then be reduced
What is the default size of an HDFS block?
64 MB
MapContext
<keyin, valuein, keyout, valueout> defines the context for the mapper
reduce()
takes the output as an input from the mapper and combines these key-value pairs into a smaller set of tuples
When setting up a MapReduce job on GCP, where would you specify the output destination for the results?
the Arguments field
Why is MapReduce required in the first place?
the big data stored in HDFS is not stored in a traditional fashion (likely this one?), OR because of the load on the servers
maps
the individual tasks that transform the input records into intermediate records. It's not compulsory for the intermediate records to be of the same type as the input. A given input pair can be mapped to zero or many output pairs.
Which of the following is used to deactivate the reduction step?
job.setNumReduceTasks(0)
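A minimal sketch of what this could look like in a MapReduce 2.0-style driver; the class and job names are illustrative, not from the notes above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyJobExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only example");
        // zero reducers: the reduce step is deactivated and each mapper's
        // output is written directly to the output path
        job.setNumReduceTasks(0);
    }
}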
What is it called when MapReduce() kills a job based on parameters you have set?
speculative execution. Feedback: Speculative execution is an optimization strategy where a computer system executes certain tasks that may not be required.
methods to write MR jobs
- standard - usually with Java
- streaming (like piping) in the language of choice: Python/C#
- pipes
- abstraction libraries: Hive, Pig (includes data cleansing), etc. (higher-level languages)
Coding patterns
- standard - usually written in Java
- Hadoop Streaming - Java base + other language for mapper/reducer logic
- Hadoop Pipes - C++
Running multiple jobs through several mappers and reducers at once is known as _____.
subdividing tasks. Feedback: MapReduce is scalable. Generally speaking, this is done by subdividing the activities into tiny chunks, far more than the number of worker nodes.
custom data types
- supported - write a RawComparator, called a custom comparator, which often includes a custom input split. You can also include a custom partitioner
- use the Text type only when you are using text data
In which year did Google publish a paper titled MapReduce?
2004
Which of the following are MapReduce components?
mapper
Who introduced MapReduce?
Google
_______________ is the processing unit of Hadoop.
MapReduce
shuffling
process of exchanging the intermediate outputs from the map tasks to where they are required by the reducers
Partitioning behaves like a hash function.
true
This list value goes through a shuffle phase, and the values are given to the reducer as <key2, list(value2)>.
true
MRUnit
used when a job has been running for hours and finally returns an unexpected output with an error in the job. It checks the code before it is moved to production
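A rough sketch of an MRUnit unit test, assuming a hypothetical WordCountMapper that emits (word, 1) pairs (a sketch of such a mapper appears under Mapper Class below); the input text and expected output are illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {
    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // wrap the mapper under test in an MRUnit driver
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    }

    @Test
    public void testMapper() throws Exception {
        mapDriver.withInput(new LongWritable(0), new Text("code code"))
                 .withOutput(new Text("code"), new IntWritable(1))
                 .withOutput(new Text("code"), new IntWritable(1))
                 .runTest();   // fails the test if actual output differs
    }
}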
When caching is used with Hadoop, where are cache files located?
distributed across nodes running the job. Feedback: Hadoop Distributed Cache is a way to copy small files or archives to worker nodes in time. Hadoop does this in such a manner that these worker nodes will use them while working on a task.
When coding in Java using MapReduce 2.0, _____ are a common method of keeping track of job results or tracking key-value pairs.
Counters. Feedback: You might want to determine the proportion of successful or rejected records, you might want to look at the quality of the data to see if further preprocessing is required, or something else has to be done with the data, so that you can get a sense of what percentage is being rejected.
A partition divides data into segments.
True
Input types
where <key, value>:
- TextInputFormat: each line in a text file is a record. <LongWritable (offset of line), Text (content of line)>
- KeyValueTextInputFormat: each line is a record. The first separator (\t) divides the line. <Text (before separator), Text (after separator)>
- SequenceFileInputFormat<K,V>: input format for reading sequence files
- NLineInputFormat: like TextInputFormat, but each split has exactly N lines. <LongWritable, Text>
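A minimal sketch of wiring one of these input formats into a job; the class name and input path are illustrative, not from the notes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input format example");
        // each line becomes one record, split at the first tab into (Text key, Text value)
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/user/hadoop/input"));
    }
}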
What must be considered before running a new JAR file in Google Cloud Platform?
whether the Hadoop version matches. Feedback: If the image we download isn't in the correct format for our virtualization software, we will not be able to run it.
What should you plan for when running Hadoop jobs?
Jobs should be monitored to confirm successful completion.
Which of the following is also called Mini-reducer?
Combiner
ways to run MR jobs
configure JobConf options:
- from the dev environment (IDE)
- from a GUI (Hue / HDInsight console)
- from the command line: hadoop jar filename.jar input output
Which Hadoop file system shell command input from the command line will run a MapReduce() job from a JAR file?
hadoop jar jobname.jar /home/input /home/output
Feedback: This hadoop file system shell command input from the command line will run a MapReduce() job from a specified JAR file.
mapReduce libraries
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
languages
Java, HiveQL (HQL), Pig Latin, Python, C#, JavaScript, R. HiveQL & Pig are better for non-programmers
Which of the following happens when the number of reducers is set to zero?
none (no reduce phase runs; the map output is written directly as the final output)
output types
- TextOutputFormat: writes each record as a line of text. Keys and values are written as strings and separated by \t
- SequenceFileOutputFormat<K, V>: writes the key/value pairs in sequence file format. Works with SequenceFileInputFormat
- NullOutputFormat<K,V>: outputs nothing
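A small sketch of selecting one of these output formats on a job; the class name and output path are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class OutputFormatExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "output format example");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // write the key/value pairs in sequence file format instead of plain text
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/output"));
    }
}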
Which statement is false regarding combiners?
They eliminate the need for an actual Reducer.
RecordReader handles recording limits in MapReduce text files or sequence files.
True
Mapper Class
defines the Map job. maps input key-value pairs to intermediate key-value pairs. method: map(keyin, valuein, org.apache.hadoop.mapreduce.Mapper.Context context)
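A minimal sketch of a Mapper subclass for a word-count style job; the class name and tokenizing logic are illustrative, not taken from the notes:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // split the input line into tokens and emit an intermediate (word, 1) pair for each
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}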
Which Hadoop file system shell command input from the Linux terminal can be used to read the contents of a file?
hadoop fs -cat /home/file.txt
Feedback: The `cat` command is used to read files. The `hadoop fs` portion tells the cluster you are sending the command through the Hadoop file system.
uses of MapReduce
- indexing and searching
- classification - used to make classifiers
- recommendation - can be used for recommendation engines, as used by e-commerce companies like Amazon and Flipkart
- analytics
What is the main form of MapReduce() output, regardless of format?
key and value pairs. Feedback: In MapReduce, a certain key-value pair is processed by the map function which outputs a certain number of key-value pairs, while the Reduce function processes values grouped by the same key and outputs another set of key-value pairs.
partitioner
- behaves like a condition in processing the inputs
- the number of partitioners equals the number of reducers
- divides the input according to the number of reducers
- happens after the mapper phase and before the reducer phase
- partitions the <key, value> pairs of the mapper output
- partitions using a user-defined function
- behaves like a hash function
It accepts the key-value pairs from the map task as input. The partition divides the data into segments. According to the condition given for partitions, the input key-value paired data can be divided into parts (see the sketch below).
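A hedged sketch of a user-defined partitioner; the class name is illustrative and the hash-by-key logic simply mirrors the default behaviour:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HashKeyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // divide the mapper output into numReduceTasks segments by key,
        // so all pairs with the same key go to the same reducer
        return (key.toString().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}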
coding steps (java)
- create a class
- create a static (global) Map class
- create a static Reduce class
- create a main function that creates a job; the job calls the map and reduce classes

public class MapReduce {
    public static void main(String[] args) {
        // create a job runner instance
        // call the Map instance on the job instance
        // call the Reduce instance on the job instance
    }
    public void Map() {
        // write mapper
    }
    public void Reduce() {
        // write reducer
    }
}
MapReduce 1.0
- distributed, scalable, cheap
- storage = HDFS - triple replicated = commodity hardware
- processing = parallel via Map (local) and Reduce (aggregated)
- package: org.apache.hadoop.mapred
- almost 10 years old and only allows batch processing (not interactive)
LocalJobRunner
- more helpful in debugging the job than in testing the job
- runs map-reduce jobs in a single JVM and hence can be easily debugged using an IDE. It helps us run the job against the local file system.
- to enable job execution using LJR, set: conf.set("mapred.job.tracker", "local")
- in case we want to use the local filesystem for input/output, then set: conf.set("fs.default.name", "local")
ReduceContext
<keyin, valuein, keyout, valueout> specifies the context to be passed to the Reducer
Writable data types
- BooleanWritable (1 byte) - wrapper for a standard boolean; sort policy = false before, true after
- ByteWritable (1 byte) - wrapper for a single byte; sort policy = ascending
- DoubleWritable (8 bytes) - wrapper for a double; sort policy = ascending
- FloatWritable (4 bytes) - wrapper for a float; sort policy = ascending
- IntWritable (4 bytes) - wrapper for an int; sort policy = ascending
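A small sketch of using these wrapper types; the values are arbitrary:

import org.apache.hadoop.io.BooleanWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableExample {
    public static void main(String[] args) {
        IntWritable count = new IntWritable(42);           // 4-byte wrapper for an int
        BooleanWritable flag = new BooleanWritable(true);  // 1-byte wrapper for a boolean
        Text line = new Text("hello");                     // UTF-8 text value
        // compareTo() follows the sort policy (ascending for IntWritable)
        System.out.println(count.compareTo(new IntWritable(7)) > 0); // prints true
    }
}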
Which of the following are the best testing and debugging practices for MapReduce jobs?
Build a small Hadoop cluster for the sole purpose of debugging and testing MapReduce code.
The nodes in MapReduce are collectively called a ___________.
Cluster
Which of the following is used to provide multiple inputs to Hadoop?
FileInputFormat
libraries
HBase, Hive, Pig, Sqoop, Oozie, Mahout, others
Hadoop Sub-projects
Hive, PIG, HBase, and Zookeeper
Key Components (Java classes)
- Input/Output (Data) - Writable / WritableComparable
- Mapper
- Reducer
- Partitioner
- Reporter
- OutputCollector
Which of the following is a feature of LocalJobRunner?
It can run in a debugger and can step through the code.
Which of the following statements describe JUnit in a MapReduce framework?
It is JAR based.
Why does the functional programming aspect of Hadoop cause the requirement for task trackers on each node to have to run in separate JVMs?
JVMs do not share state. Feedback: Unlike traditional relational database management systems, JVMs do not share state.
Daemons and services
JVMs or services: isolated processes
- job tracker: one per cluster (controller and scheduler)
- task tracker: one per node (monitors tasks)
Job configurations
- specify input/output locations for job instances
- job clients submit jobs for execution
Job Class
JobContext interface's main class. It allows the user to configure, submit, execute and query the job. The set methods work until the job is submitted; after that they throw an IllegalStateException
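A sketch of a driver built around the Job class, assuming the hypothetical WordCountMapper (Mapper Class above) and WordCountReducer (Reducer Class below); paths come from the command-line arguments:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // after submission the set methods throw IllegalStateException; only queries are allowed
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}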
Which command creates the directory /user/hadoop/mydir in the Hadoop file system?
hadoop fs -mkdir /user/hadoop/mydir
_______ is a Java library for unit testing.
JUnit
In recovery mode, why is the name node started?
Recover data when there is only one metadata storage location
Which function of MapReduce 1.0 outputs a single combined list of key-value pairs?
Reduce(). Feedback: The role of the Reducer is to process data that comes from the mapper. After processing, it generates a new set of output, which will be stored in HDFS.
When implementing the MapReduce functionality, which important aspect of the Map function should you consider?
The Map function is implemented as a static class.
Which improvement of MapReduce 2.0 over MapReduce 1.0 separates resource management layer from the processing layer, allowing the use of different processing frameworks such as graph, interactive, and stream processing engines?
YARN resource manager. Feedback: YARN permits different data processing engines like graph processing, interactive processing, stream processing as well as batch processing to run and process data stored in HDFS (Hadoop Distributed File System).
________ builds upon JUnit and supplies built-in functionality lacking in JUnit.
MRUnit? or possibly both (MRUnit and DbUnit)
Which of the following describes JUnit?
a Java library that is designed for unit testing.
JUnit
a Java library designed for unit testing. It is not part of the standard Java class libraries, but it is included in several popular IDEs. It provides automated testing and validation.
combiner
a local reducer that aggregates the data on each node. Common/standard practice, since less data is passed on
What is the correct data flow sequence in the following: A. InputFormat B. Mapper C. Combiner D. Reducer E. Partitioner F. OutputFormat
abcedf
Apache Nutch
the story starts with an algorithm to rank web pages, called PageRank (Larry Page introduced this concept), which implies that the highest-ranked page is the most relevant. In 1997, Doug Cutting wrote the first version of Lucene, a text search library used to search web pages faster. In 2001, Mike Cafarella joined him in indexing web pages, resulting in a sub-project named Apache Nutch
MapReduce 2.0
builds off v1.0
- package: org.apache.hadoop.mapreduce
- splits the existing JobTracker's roles: resource management and job life-cycle management
- many benefits (e.g. scalability): distributed job life-cycle management, support for multiple MapReduce APIs in a cluster, batch or real-time/interactive processing
- supports many frameworks via YARN - MapReduce programming not required - fits more business scenarios
- adds enterprise features (security, high availability)
- execution flexibility/control: mapper/reducer configure method takes params; tools/GenericOptionsParser takes HCI options; reports can use app-specific info
- distributed cache improvements: can be used to distribute read-only data for jobs
What is the term for an optimization that aggregates duplicate values and reduces them to singular values locally in the mapper?
combiner. Feedback: A Combiner, also known as a semi-reducer, is an optional class that works by taking inputs from the Map class and then passing output key-value pairs to the Reducer class. The main purpose of the Combiner is to summarize the output records of the map with the same key.
Your MapReduce 2.0 logic needs a distributed cache of pattern files. In which code function will you get these pattern files from the cache?
configure()
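A rough sketch (old mapred API) of pulling cached pattern files inside configure(); the class name, field, and error handling are illustrative only:

import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class PatternMapperBase extends MapReduceBase {
    private Path[] patternFiles;

    @Override
    public void configure(JobConf job) {
        try {
            // local copies of the cached pattern files on this worker node
            patternFiles = DistributedCache.getLocalCacheFiles(job);
        } catch (IOException e) {
            throw new RuntimeException("Could not read distributed cache", e);
        }
    }
}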
Reducer Class
defines the reduce job in MapReduce. Using the JobContext.getConfiguration() method, reducer implementations can access the configuration for the job. A reducer has 3 phases: shuffling, sorting, reducing. Method: reduce(keyin, Iterable<valuein> values, org.apache.hadoop.mapreduce.Reducer.Context context). All of the values with the same key are presented to a single reducer together.
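A minimal sketch of a Reducer subclass that sums the counts for each key; the class name is illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // all values with the same key arrive together; combine them into one output pair
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        total.set(sum);
        context.write(key, total);
    }
}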
The main objective of combiners is to reduce the output value of the mapper.
false
Which of the following data will the client read in Hadoop from the HDFS file system?
gets only the block locations from the namenode
hadoop shell commands
hadoop fs -cat file:///file2
hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop fs -copyFromLocal <fromDir> <toDir>
etc.
MapReduce
it's the processing unit of Hadoop, using which the data in Hadoop can be processed: map() and reduce(). It's an API, or set of libraries.
- job: a unit of MapReduce work/instance
- Map task: runs on each node
- Reduce task: runs on some nodes
- source data: HDFS or other location (like cloud)
The concept was brought by Google and adopted by Apache Hadoop. It's a framework with the capability to process the data that sits as blocks in the respective DataNodes, so the processing goes to the data and only the result is returned.
Which of the following command is used to end a failed job on MapReduce?
kill
Which of the following statement describes InputSplit?
logical representation of data
linux shell commands
- ls: list folder contents
- cat: reads (displays) a file
- mkdir: makes a directory
- cd: change to a directory
- sudo command: runs command as admin
- chmod file: show/change permissions of file
- tab key: autocompletes for you
combiners
main objective is to reduce the volume of output passed on from the mapper (a local, partial reduce)
- used between the mapper and reducer
- if there's a lot of map output, then it's a good idea to design a combiner
- input: <key, list(values)> - output: <key, value>
ex: input of mapper phase: <My name is AJ> <I like to code for fun>
output of mapper phase: <My,1> <name,1> <is,1> <AJ,1> <I,1> <like,1> <to,1> <code,1> <for,1> <fun,1>
ex2: input of combiner - the values for each key are grouped into a list, so a word that appeared twice would arrive as <word,(1,1)> (that is what the "extra 1's" denote); the combiner then sums each list locally, e.g. <word,2>. In this example no word repeats, so the combiner output is the same as its input. Registering a combiner on a job is shown in the sketch below.
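A small sketch of registering a combiner, reusing the hypothetical WordCountReducer from above as the local mini-reducer (valid here because summing is associative and commutative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CombinerConfigExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combiner example");
        // aggregate (key, 1) pairs locally on each mapper node before the shuffle,
        // so less data is passed to the reducers
        job.setCombinerClass(WordCountReducer.class);
    }
}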
Optimizations
- optimize before the job runs (file sizes; compression, encryption)
- optimize onload of the data
- optimize the map phase of the job
- optimize the shuffle phase of the job
- optimize the reduce phase of the job
- optimize after the job completes
tip
verify your cluster configuration parameters
- do you have unused resources?
- do you have overstressed resources? (slowing it down)
- are you using any type of cluster-monitoring tools? (worth it)
***the simplest optimization is to add more nodes***
When will you elect to use the streaming method for developing a MapReduce job?
when you want to create the mapping or reducing logic in a language different from Java