Hortonworks Java
What is the data type of the return value of the getPartition method?
An int between 0 (inclusive) and the number of reducers (exclusive).
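The arithmetic behind the default partitioner can be sketched in plain Java (the class name here is hypothetical; Hadoop's real HashPartitioner extends org.apache.hadoop.mapreduce.Partitioner<K,V>, but the computation is the same masked-hash modulo):

```java
// Plain-Java sketch of the arithmetic Hadoop's default HashPartitioner applies.
// Masking off the sign bit before the modulo guarantees the result falls in
// the valid range [0, numReduceTasks).
public class HashPartitionSketch {
    public static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Same key always maps to the same partition, in [0, 4)
        System.out.println("partition = " + getPartition("hadoop", 4));
    }
}
```

The sign-bit mask matters because hashCode can be negative, and a negative modulo result would be an invalid partition index.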
What is the parent class of a Hive UDF?
A Hive UDF extends the org.apache.hadoop.hive.ql.exec.UDF class.
What determines the number of Reducers of a MapReduce job?
You choose the number of Reducers, typically via the Job's setNumReduceTasks method or the mapreduce.job.reduces property.
Consider the following HBase code: put.add("a".getBytes(), "b".getBytes(), "c".getBytes()); What does "a" represent? What does "b" represent? What does "c" represent?
a) Column family; b) Column qualifier; c) Cell value
What method in the Job class is used to implement a linear chain of MapReduce jobs?
waitForCompletion
True or False: A Combiner is only used during the Map phase.
False. A Combiner is invoked on the reduce side if the Reducer needs to spill <key,value> pairs to disk.
True or False: A custom Partitioner must ensure that equal keys get sent to the same partition.
False. A custom partitioner can do whatever it wants as far as distributing key/value pairs to Reducers.
True or False: HBase is a standalone database alternative to HDFS.
False. HBase runs on top of Hadoop and relies on HDFS for its data storage.
True or False: To configure a reducer test, you need a specific input <key,value> pair and a specific output <key,value> pair.
False. The input to the reducer needs to match <key,iterable(value)>, so you need to use a specific key along with a collection (which could be of size 1) of values. The output is only required if you are using the runTest method.
True or False: The shuffle/sort phase sorts the keys and values as they are passed to the Reducer.
False. The keys are sorted, but the values are not.
Suppose the <key,value> pairs being passed to a Combiner are of type <Text,Double>. What will the <key, value> pairs output by the Combiner look like?
The Combiner's reduce method receives <Text, Iterable<Double>>, but the pairs it outputs must be <Text, Double>. A Combiner's output types must match the Mapper's output types, because the Reducer cannot tell combined pairs apart from uncombined ones.
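A Combiner's role can be sketched in plain Java (this summing example and the class name are illustrative; a real Combiner extends Reducer and works on Writable types): it receives each key with a collection of values, but must emit pairs of the same types the Mapper emitted.

```java
import java.util.*;

// Sketch of what a summing Combiner does between map and reduce: the input to
// its reduce method is key -> Iterable of values, and the output is one
// <key,value> pair per key, with the value type unchanged.
public class CombinerSketch {
    public static Map<String, Double> combine(Map<String, List<Double>> grouped) {
        Map<String, Double> out = new TreeMap<>();
        for (Map.Entry<String, List<Double>> e : grouped.entrySet()) {
            double sum = 0.0;
            for (double v : e.getValue()) sum += v;  // local aggregation
            out.put(e.getKey(), sum);                // one pair out per key
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<Double>> grouped = new HashMap<>();
        grouped.put("a", Arrays.asList(1.0, 2.0, 3.0));
        System.out.println(combine(grouped));  // {a=6.0}
    }
}
```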
List three common Oozie workflow actions:
<pig>, <hive>, and <map-reduce>
What are the two arguments passed into the ControlledJob constructor?
A Job instance and a List of Job instances that the new Job is dependent on.
What are the two ways to define a custom counter?
A custom counter can be any string, or you can write an enum of custom counters.
What is the name of the management and monitoring tool that comes with the Hortonworks Data Platform?
Ambari
What four things must a MapReduce job do?
1) Extend Configured. This implements Configurable and allows you to create the Configuration object. 2) Instantiate a Job object within the run method. The Job class contains more than 20 "set" methods for configuring the MapReduce job. 3) Implement Tool. Allows for handling of command-line arguments. 4) Define a main method that uses the ToolRunner.run method to run the MapReduce job with the command-line arguments.
What are the three phases of reduce?
1) Shuffle - also called the fetch phase. Mapper output retrieved with Netty. Records with the same key are combined and sent to the reducer. 2) Sort - Simultaneous with Shuffle. Records are sorted by key as they are fetched. 3) Reduce - The reduce method is invoked for each key, with the records combined into an iterable collection.
What happens if a container fails to complete its task in a YARN application?
ApplicationMaster requests a new container from the ResourceManager and tries the task again.
The main purpose of the Reduce task is to _______
Apply business logic by analyzing data and answering the question or solving a problem.
How many arguments are passed into the evaluate method of a Hive UDF?
As many as you want, and any valid data type that you want.
What is the benefit of using a Bloom filter when joining datasets?
Bloom filter avoids sending records to reducers that are not a result of the join, saving network traffic and CPU processing.
Given an HBase table name, put the following in order of retrieval to access a cell's value: a) Column qualifier b) Rowkey c) Column family d) Timestamp
Rowkey -> Column Family -> Column Qualifier -> Timestamp
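One way to picture HBase's logical data model is as nested sorted maps keyed in that retrieval order: rowkey, then column family, then qualifier, then timestamp. This is a hedged, standalone sketch (the class is hypothetical; the real client API lives in org.apache.hadoop.hbase.client):

```java
import java.util.*;

// Sketch of HBase cell addressing as nested sorted maps:
// rowkey -> column family -> qualifier -> timestamp -> value.
public class CellModelSketch {
    static Map<String, Map<String, Map<String, NavigableMap<Long, String>>>> table = new TreeMap<>();

    static void put(String row, String fam, String qual, long ts, String val) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(fam, f -> new TreeMap<>())
             .computeIfAbsent(qual, q -> new TreeMap<Long, String>(Comparator.reverseOrder()))
             .put(ts, val);
    }

    // Newest version wins, mirroring a Get with the default of one version.
    static String get(String row, String fam, String qual) {
        return table.get(row).get(fam).get(qual).firstEntry().getValue();
    }

    public static void main(String[] args) {
        put("row1", "a", "b", 100L, "old");
        put("row1", "a", "b", 200L, "c");
        System.out.println(get("row1", "a", "b"));  // c (the newest timestamp)
    }
}
```

The reversed timestamp comparator reflects that HBase serves the most recent cell version by default.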
What are the two techniques for performing Map Aggregation?
Combiner, in-map aggregation.
What is partitioning?
The assignment of <key,value> pairs to Reducers: each Mapper output record is placed in a partition, and each partition is fetched by exactly one Reducer.
In a custom RecordReader, what is the main goal of the nextKeyValue method?
Determine whether another record exists and, if so, read its key and value.
During the shuffle/sort phase, keys are sorted by their _______
During the shuffle/sort phase, keys are sorted by their natural order and keys that are equal are grouped together.
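The sort-and-group contract can be sketched with a sorted map in plain Java (the class name is illustrative; the framework actually does this across spill files and a merge sort):

```java
import java.util.*;

// Sketch of the shuffle/sort contract: map-output pairs arrive unordered, the
// framework sorts them by the key's natural order, and equal keys are grouped
// so the reducer sees each key once with all of its values.
public class ShuffleSortSketch {
    public static SortedMap<String, List<Integer>> shuffleSort(List<Map.Entry<String, Integer>> mapOutput) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();  // natural key order
        for (Map.Entry<String, Integer> e : mapOutput) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        out.add(new AbstractMap.SimpleEntry<>("b", 1));
        out.add(new AbstractMap.SimpleEntry<>("a", 2));
        out.add(new AbstractMap.SimpleEntry<>("b", 3));
        System.out.println(shuffleSort(out));  // {a=[2], b=[1, 3]}
    }
}
```

Note the values within a key keep no guaranteed order, matching the card above: keys are sorted, values are not.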
What file is required to be a part of an Oozie workflow?
Each Oozie workflow must contain a workflow.xml configuration file.
What happens to the output of the Reducer(s)?
Each Reducer writes a file to HDFS.
T/F: Map, Shuffle, and Reduce all share the same JVM on a node.
F. Each runs in a separate JVM, unless specially configured.
If you have a MapReduce job with a lot of intermediate disk spills, what should you do?
If possible, increase the mapreduce.task.io.sort.mb value.
Where does HCatalog store its schema information?
In the Hive metastore.
For an MRUnit test, what would you typically do in an @Before method?
Instantiate the test driver objects and initialize them with mapper and reducer instances.
How does the TotalOrderPartitioner achieve a total ordering of the output?
It samples the data, determines ideal partition points, saves those points in a partition file that is shared amongst all mappers.
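The routing step can be sketched in plain Java (class name hypothetical; in Hadoop the split points come from InputSampler and are read from the shared partition file):

```java
import java.util.*;

// Sketch of the TotalOrderPartitioner idea: split points chosen from a sample
// of the data route each key so that every key in partition i sorts before
// every key in partition i+1, giving a totally ordered final output.
public class TotalOrderSketch {
    private final String[] splitPoints;  // sorted; numPartitions = splitPoints.length + 1

    public TotalOrderSketch(String[] splitPoints) { this.splitPoints = splitPoints; }

    public int getPartition(String key) {
        int idx = Arrays.binarySearch(splitPoints, key);
        // binarySearch returns (-(insertion point) - 1) when the key is absent
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }

    public static void main(String[] args) {
        TotalOrderSketch p = new TotalOrderSketch(new String[]{"g", "p"});  // 3 partitions
        System.out.println(p.getPartition("apple"));   // 0
        System.out.println(p.getPartition("melon"));   // 1
        System.out.println(p.getPartition("zebra"));   // 2
    }
}
```

Concatenating the reducers' output files in partition order then yields one globally sorted result.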
What determines the grouping of keys?
Keys are grouped by the key's Grouping Comparator (set via the Job's setGroupingComparatorClass method).
What determines the sort order of keys?
Keys are sorted based on the compareTo method of its data type.
How does a RawComparator increase the performance of a MapReduce job?
Keys can be compared without the overhead of being deserialized.
What happens if all the <key,value> pairs output by a Mapper do not fit into the memory of the Mapper?
When the map output buffer reaches a threshold, the <key,value> pairs are spilled to disk, meaning they are written to a temporary file on the local filesystem.
What are the three main phases of a MapReduce job?
Map phase, shuffle/sort phase, and reduce phase
Which join technique is more efficient: map-side or reduce-side?
Map-side is preferred and almost always more efficient.
What are the three types of tests you can run using MRUnit?
Mapper test, Reducer test, and MapReduce test
What are the three main steps to follow to implement a secondary sort?
Move part of the value into the key (typically done by writing a custom key class); write a custom partitioner that ensures records are sent to the appropriate reducer; and write a custom grouping comparator.
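The three roles above can be sketched with ordinary Java types (all names here are illustrative; in Hadoop the composite key would be a custom WritableComparable, the partitioner a Partitioner subclass, and the grouping comparator a RawComparator):

```java
import java.util.*;

// Secondary-sort sketch: part of the value (the secondary field) is moved into
// a composite key, while partitioning and grouping still use only the natural key.
public class SecondarySortSketch {
    static class CompositeKey implements Comparable<CompositeKey> {
        final String naturalKey;  // e.g., station id
        final int secondary;      // e.g., temperature, moved out of the value

        CompositeKey(String naturalKey, int secondary) {
            this.naturalKey = naturalKey;
            this.secondary = secondary;
        }

        // Sort by natural key first, then by the secondary field.
        public int compareTo(CompositeKey o) {
            int c = naturalKey.compareTo(o.naturalKey);
            return c != 0 ? c : Integer.compare(secondary, o.secondary);
        }
    }

    // Partitioner role: route by natural key only, so all records for one
    // natural key reach the same reducer.
    static int getPartition(CompositeKey key, int numReducers) {
        return (key.naturalKey.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Grouping-comparator role: keys with the same natural key are "equal",
    // so they arrive in one reduce call, values already sorted by secondary.
    static final Comparator<CompositeKey> GROUPING =
        Comparator.comparing((CompositeKey k) -> k.naturalKey);

    public static void main(String[] args) {
        CompositeKey a = new CompositeKey("s1", 10);
        CompositeKey b = new CompositeKey("s1", 30);
        System.out.println(a.compareTo(b) < 0);       // true: secondary breaks the tie
        System.out.println(GROUPING.compare(a, b));   // 0: grouped together
    }
}
```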
What Output Format would be a good choice if you needed the result of a MapReduce job to appear in five different files?
MultipleOutputs<K,V>
What determines the number of Mappers of a MapReduce job?
Number of input splits.
How many times does a Combiner get invoked?
One or more times, depending on whether the map output buffer has to be spilled to disk more than once.
What are the two main capabilities of Oozie?
Oozie Workflow, for defining Hadoop job workflows; and the Oozie Coordinator, for scheduling recurring workflows.
What is the benefit of using sequence files in a Hadoop job?
Performance. Sequence files are an efficient way to serialize and deserialize objects in Hadoop.
What is the benefit of in-map aggregation?
Performance. You avoid the unnecessary disk IO and serialization/deserialization of keys and values.
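The in-map aggregation pattern can be sketched without Hadoop types (class name hypothetical; in a real Mapper the HashMap would be built up across map calls and emitted in the cleanup method):

```java
import java.util.*;

// Sketch of in-map aggregation for word count: instead of emitting <word, 1>
// for every token (each pair serialized and possibly spilled to disk), the
// mapper accumulates counts in memory and emits one pair per distinct word
// when the task finishes.
public class InMapAggregationSketch {
    private final Map<String, Integer> counts = new HashMap<>();

    // map(): aggregate in memory rather than writing one pair per token
    public void map(String line) {
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
        }
    }

    // cleanup(): emit the aggregated pairs once, at the end of the task
    public Map<String, Integer> cleanup() { return counts; }

    public static void main(String[] args) {
        InMapAggregationSketch m = new InMapAggregationSketch();
        m.map("to be or not to be");
        System.out.println(m.cleanup());  // each word mapped to its count
    }
}
```

Compared with a Combiner, this skips the serialize/sort/deserialize round trip entirely, at the cost of the mapper holding the aggregate in memory.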
What are the three main components of YARN?
ResourceManager, NodeManager, and ApplicationMaster
List two Pig commands that cause a logical plan to execute:
Any two of STORE, DUMP, and ILLUSTRATE — all three cause a logical plan to execute.
What determines the Jar file that contains the Mapper and Reducer class?
The Job's setJarByClass method lets Hadoop determine the JAR file that contains the Mapper and Reducer classes.
Which is the better option in MapReduce for string concatenation: the + operator or using the StringBuilder class?
StringBuilder
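The reason is allocation churn: each + on Strings inside a loop creates a new String (and an implicit builder), which adds garbage-collection pressure in a mapper invoked millions of times. A minimal sketch (class name illustrative):

```java
// One reused StringBuilder per record avoids the per-iteration String
// allocations that the + operator would create inside the loop.
public class ConcatSketch {
    public static String joinWithBuilder(String[] parts, char sep) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) sb.append(sep);
            sb.append(parts[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinWithBuilder(new String[]{"a", "b", "c"}, ','));  // a,b,c
    }
}
```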
Which is the better option in MapReduce for splitting strings: StringUtils.split or String.split?
StringUtils.split
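The advantage is that StringUtils.split scans for a literal delimiter character, while String.split compiles its argument as a regular expression on every call. A minimal non-regex split, sketched in plain Java (class and method names are illustrative), shows the idea:

```java
import java.util.*;

// Character-scanning split: no regex compilation, one pass over the input.
public class SplitSketch {
    public static List<String> splitOnChar(String s, char sep) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == sep) {
                parts.add(s.substring(start, i));  // field before this delimiter
                start = i + 1;
            }
        }
        parts.add(s.substring(start));  // trailing field (may be empty)
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitOnChar("a,b,,c", ','));  // [a, b, , c]
    }
}
```

In a hot map method that parses every input record, avoiding per-call regex compilation is a measurable saving.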
A Combiner class extends
The Reducer class
Where are value records aggregated into a collection?
The Reducer, as they are being read in.
In what process are the keys presented along with all values that belong to them?
The Reducer.
What happens to the underlying data of a Hive-managed table when the table is dropped?
The data and folders are deleted from HDFS.
What are the data types of the <key,value> pairs created from an HCatInputFormat?
The key is WritableComparable (not a useful value), and the value is an HCatRecord that contains the fields of a given record.
Suppose the Mappers of a MapReduce job output <key,value> pairs that are of type <integer,string>. What will the pairs look like that are processed by the corresponding Reducers?
The pairs coming into the Reducer will look like <integer, (string,string,string,...)>
The data type of the key passed to a TableMapper is ImmutableBytesWritable. What does this key contain?
The rowkey
What is the difference between running a test using the runTest method vs. using the run method?
The runTest method runs the test and either fails or succeeds. The run method always succeeds and returns the output of the values that were input to the mapper or reducer.
What is the data type of the value passed to a TableMapper?
The value is a Result object, which contains all the data of the row.
What two types of events can be used to trigger an Oozie coordinator job?
Time-based, data-based
True or False: To perform a bucket join, the two datasets must be sorted by the join key.
True
True or False: To configure a mapper test, you need a specific input <key,value> pair and a specific output <key,value> pair.
True only if the runTest method is being used, otherwise you are not required to specify the output.
True or False: Every cell in HBase has its own timestamp.
True.
True or False: The Hive metastore requires an underlying SQL database.
True. Hive uses an in-memory database called Derby by default, but you can configure Hive to use any SQL database.
How do you configure a Combiner for a MapReduce job?
Use the setCombinerClass method of the Job class.
Determine which InputFormat to use given the following use cases (assuming that there is likely not a single correct answer): a) Unstructured text data: b) Log files: c) CSV file with structured customer information: d) The input for the MapReduce job is the immediate output from another MapReduce job e) Tab-delimited data representing recorded temperatures measured at weather stations:
a) TextInputFormat; b) Possibly KeyValueTextInputFormat if a proper delimiter could be used to generate a useful key; c) A custom InputFormat would work best; d) SequenceFileInputFormat; e) KeyValueTextInputFormat, or a custom InputFormat
Name two compression algorithms that support splitting:
bzip2 and LZO
What are the two methods in the InputFormat interface?
getSplits and createRecordReader
What are the three arguments passed in to the getPartition method?
key, value, number of reducers.
What property configures when a spill occurs?
mapreduce.map.sort.spill.percent property.
What property configures the mapper's output memory buffer?
The mapreduce.task.io.sort.mb property.
A Hive table consists of a schema stored in the Hive ______ and data stored in ______.
metastore, HDFS.