Full Midterm Review


Name Node

1. one for each cluster 2. manages file system namespace and metadata 3. should have lots of memory 4. represents a single point of failure

What is the JobTracker of map reduce responsible for?

1. one for each hadoop cluster 2. receives jobs from clients and distributes mapReduce jobs to Task Trackers 3. monitors both Map and Reduce jobs on task trackers

What is the Hadoop NameNode?

1. there is a name node for each hadoop cluster 2. Manages file system namespace and metadata 3. should have lots of memory 4. represents a single point of failure

3. ___________ part of the MapReduce is responsible for processing one or more chunks of data and producing the output results. a) Maptask b) Mapper c) Task execution d) All of the mentioned

Answer: a Explanation: Map Task in MapReduce is performed using the Map() function.

1._________ operator is used to review the schema of a relation. a) DUMP b) DESCRIBE c) STORE d) EXPLAIN

Answer: b Explanation: DESCRIBE returns the schema of a relation.

What is hadoop mapreduce?

A YARN-based system for parallel processing of large data sets.

7. Which of the following file contains user defined functions (UDFs) ? a) script2-local.pig b) pig.jar c) tutorial.jar d) excite.log.bz2

Answer: c Explanation: tutorial.jar contains java classes also.

2. Point out the correct statement : a) The framework groups Reducer inputs by keys b) The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged c) Since JobConf.setOutputKeyComparatorClass(Class) can be used to control how intermediate keys are grouped, these can be used in conjunction to simulate secondary sort on values d) All of the mentioned

Answer: d Explanation: If equivalence rules for keys while grouping the intermediates are different from those for grouping keys before reduction, then one may specify a Comparator.

9. You can specify parameter names and parameter values in one of the ways: a) As part of a command line. b) In parameter file, as part of a command line c) With the declare statement, as part of Pig script d) All of the mentioned

Answer: d Explanation: Parameter substitution may be used inside of macros.

10. Which of the following parameter is to collect keys and combined values ? a) key b) values c) reporter d) output

Answer: d Explanation: reporter parameter is for facility to report progress.

Which of the following service(s) is(are) required to execute the following command? hadoop fs -mkdir ISDS7511 a. MapReduce b. HDFS c. Yarn d. Zookeeper

B. HDFS

What happens when a DataNode doesn't send heartbeats to the NameNode A. Nothing happens, HDFS is a distributed file system and can handle this B. blocks immediately get replicated to maintain the replication factor of the blocks C. The namenode will wait a preconfigured amount of time before the replication process kicks off

C

Define Spark Dataframe

DataFrames are distributed table-like collections with well-defined rows and columns. • In English: DataFrames are basically the same thing as Tables.
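A minimal PySpark sketch (assuming pyspark is installed and a local session can be started; the names and values are made up) showing that a DataFrame really is just a distributed table with named columns:

```python
from pyspark.sql import SparkSession

# Hypothetical local session, just for illustration.
spark = SparkSession.builder.master("local[*]").appName("df-demo").getOrCreate()

# A DataFrame: a distributed, table-like collection with rows and named columns.
people = spark.createDataFrame(
    [("pankush", 26), ("jane", 31)],   # rows
    ["firstname", "age"],              # column names
)
people.show()         # render the "table"
people.printSchema()  # column names and types
```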

What are we doing when we set the environment variables "PYSPARK_DRIVER_PYTHON" and "PYSPARK_DRIVER_PYTHON_OPTS"?

During these steps, we connect pyspark with Jupyter Notebook for our convenience (Because the Jupyter Notebook interface looks way nicer than the black and white CLI!!!!!)
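For reference, a hedged sketch of the values these variables usually get (jupyter/notebook is the common convention, not something stated in these notes; in practice you set them as system environment variables, the os.environ calls below are only to illustrate):

```python
import os

# Illustration only: these are the values commonly used so that the pyspark
# launcher starts a Jupyter Notebook as the driver instead of the plain shell.
# Normally you set them as Windows/Linux environment variables, not in Python.
os.environ["PYSPARK_DRIVER_PYTHON"] = "jupyter"
os.environ["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
```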

Map Function

Takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs (e.g. firstname/pankush, age/26). Map functions can be run on all data nodes in parallel; this requires that there are no dependencies between the data. Output is stored on a local disk.

Reduce Function

The Reduce (fold) function then combines elements of the data structure in some systematic way. The output of Reduce is stored in HDFS. Reduce functions can run in parallel.
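A plain-Python sketch (not real Hadoop code) of the Map and Reduce functions described in the two cards above, using word count as the example; the grouping step in the middle stands in for the framework's shuffle/sort:

```python
from collections import defaultdict

# Map: takes key/value pairs and emits zero or more output key/value pairs.
def map_func(doc_name, line):
    for word in line.split():
        yield (word, 1)

# Reduce: combines all values that share a key in some systematic way.
def reduce_func(word, counts):
    return (word, sum(counts))

# Simulate the framework: map over the input, group by key, then reduce.
docs = {"doc1": "big data is big", "doc2": "data is stored in blocks"}
grouped = defaultdict(list)
for name, line in docs.items():
    for key, value in map_func(name, line):
        grouped[key].append(value)

print([reduce_func(k, v) for k, v in sorted(grouped.items())])
# [('big', 2), ('blocks', 1), ('data', 2), ('in', 1), ('is', 2), ('stored', 1)]
```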

What is hadoop common?

The common utilities that support the other Hadoop modules.

What is a node?

A non-enterprise, commodity piece of hardware that stores data; each such computer is called a node.

9. __________ is a generalization of the facility provided by the MapReduce framework to collect data output by the Mapper or the Reducer a) Partitioner b) OutputCollector c) Reporter d) All of the mentioned

Answer: b Explanation: Hadoop MapReduce comes bundled with a library of generally useful mappers, reducers, and partitioners.

Spark Low level APIs

Include two parts: Resilient Distributed Datasets (RDDs) and Distributed Shared Variables. The Low Level APIs are the physical forms and methods Spark uses to store and manipulate the data. You have more control over the data, but they are not convenient AT ALL. DO NOT START LEARNING SPARK WITH LOW LEVEL APIs; disregard any part of any textbook/course about the low level.
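A tiny sketch of the RDD part of the low-level API (local session assumed), just to show what "more control, less convenience" looks like compared with DataFrames:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# An RDD is a low-level distributed collection: no schema, just Python objects.
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.map(lambda x: x * x).reduce(lambda a, b: a + b))  # 55
```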

If you want to use spark on your own machine:

You need to have Java installed and the necessary environment variables set.

3. The daemons associated with the MapReduce phase are ________ and task-trackers. a) job-tracker b) map-tracker c) reduce-tracker d) all of the mentioned

Answer: a Explanation: Map-Reduce jobs are submitted on the job-tracker.

6. __________ is a framework for collecting and storing script-level statistics for Pig Latin. a) Pig Stats b) PStatistics c) Pig Statistics d) None of the mentioned

Answer: c Explanation: The new Pig statistics and the existing Hadoop statistics can also be accessed via the Hadoop job history file.

If the path to this bin folder is not stored in the PATH variable, what will happen if I type in pyspark in Command line interface?

"The system cannot find the path specified" will be displayed in the command line interface

Hadoop was named after what?

The toy elephant of Doug Cutting's son

How to use Spark: Spark's Language APIs

• Scala: Spark is primarily written in Scala. This is the native language of Spark. • Java • Python • R

Spark is:

• Unified computing engine for parallel processing • A set of libraries for parallel data processing • A tool to manage and coordinate tasks on data across a cluster of computers

For the following command, we could tell the current user is () [cloudera@quickstart Downloads]$ mkdir ISDS7511 Select one: a. Cloudera b. Quickstart c. Root d. hdfs

a. Cloudera

For the following command, we could tell the current working directory's name is ISDS7511. [cloudera@quickstart Downloads]$ mkdir ISDS7511 True False

False

What are the two major components of the MapReduce layer? A. TaskTracker B. JobTracker C. NameNode D. DataNode

A & B

What are the 3 components of Spark Structured APIs?

A. Datasets B. DataFrames C. SQL
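A short sketch (local session assumed, sample data made up) showing two of these components, the DataFrame API and SQL, working on the same data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("structured-demo").getOrCreate()
df = spark.createDataFrame([("ISDS", 7511), ("ISDS", 1100)], ["dept", "course"])

# DataFrame API
df.filter(df.course > 2000).show()

# SQL over the same data, via a temporary view
df.createOrReplaceTempView("courses")
spark.sql("SELECT dept, course FROM courses WHERE course > 2000").show()
```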

6. Which of the following function is used to read data in PIG ? a) WRITE b) READ c) LOAD d) None of the mentioned

Answer: c Explanation: PigStorage is the default load function.

4. Use the __________ command to run a Pig script that can interact with the Grunt shell (interactive mode). a) fetch b) declare c) run d) all of the mentioned

Answer: c Explanation: With the run command, every store triggers execution.

How do we solve the error message "the system cannot find the path specified" displayed in the command line interface?

• Add the path that contains these binaries to the PATH variable

What is Hadoop? 1. 2. 3. 4.

1. Framework rather than a single solution 2. Scalable and distributed framework 3. Data pipeline of massive amounts of data 4. Both structured and unstructured data

Which service is not necessary for the following command? hadoop jar anagram.jar AnagramDriver /user/cloudera/DSIS7511_Peter_hdfs /user/cloudera/output Select one: a. HDFS b. Mapreduce c. Zookeeper

c. zookeeper

8. ________ is the slave/worker node and holds the user data in the form of Data Blocks. a) DataNode b) NameNode c) Data block d) Replication

Answer: a Explanation: A DataNode stores data in the [HadoopFileSystem]. A functional filesystem has more than one DataNode, with data replicated across them.

What are the two basic layers comprising the Hadoop Architecture? A. ZooKeeper and MapReduce B. HDFS and Hive C. MapReduce and HDFS D.Impala and HDFS

c

Which service is not necessary for the following command? hadoop jar anagram.jar AnagramDriver /user/cloudera/DSIS7511_Peter_hdfs /user/cloudera/output Select one: a. HDFS b. Mapreduce c. Zookeeper

c

Who created Hadoop?

Doug Cutting and Mike Cafarella, in 2005, for the Yahoo search engine

What is Hadoop not? 1. 2.

1. An alternative for SQL 2. Always fast and efficient

Spark's Tool Set:

1. Low Level APIs: A. Distributed Variables B. RDDs 2. Structured APIs: A. Datasets B. DataFrames C. SQL 3. Structured Streaming 4. Machine Learning Libraries 5. Ecosystem + Packages (didn't go into details regarding this)

When is Hadoop needed?

1. You have more data than you can maintain 2. Backup and duplication tasks are difficult 3. efficient data analysis is needed: a. data size is huge b. complexity of data c. depth of analysis

Why do we need mapReduce?

1. Because the big data stored on Hadoop HDFS is not stored in a traditional fashion - the data is divided into chunks that are stored in data nodes 2. There is no single location where the complete data resides, so we don't have any app that can process the data directly 3. We needed a framework capable of processing the data as blocks in the data nodes, so the processing can go to the data node, process the data, and only bring back the result

What are the four components of hadoop ?

1. hadoop common 2. hadoop distributed file system 3. hadoop yarn 4. hadoop mapreduce

What is the Hadoop DataNode?

1. multiple for each hadoop cluster 2. stores/manages blocks of data 3. commodity hardware would suffice 4. the client finds out from the NameNode which DataNodes have the relevant blocks and queries them directly 5. DataNodes also report to the NameNode occasionally about what blocks they have

What is the map reduce task tracker?

1. multiple in each hadoop cluster 2. executes map and reduce operations 3. reads data blocks from data nodes

What are the 4 types of nodes?

1. name node 2. data node 3. secondary name node 4. checkpoint node

What are HDFS blocks?

1. Blocks use a fixed size, so it is easy to calculate how many will fit on a disk 2. A file can be larger than any single disk in the cluster 3. No space is wasted: if a chunk of a file is smaller than the block size, only the needed space is used
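A quick worked example of that block arithmetic (using the 64 MB default mentioned in the next card; the 1 GB file size is made up):

```python
import math

block_size_mb = 64   # default block size mentioned in these notes
file_size_mb = 1000  # a hypothetical ~1 GB file

num_blocks = math.ceil(file_size_mb / block_size_mb)
last_block_mb = file_size_mb - (num_blocks - 1) * block_size_mb

print(num_blocks)     # 16 blocks
print(last_block_mb)  # the last block only occupies 40 MB, not a full 64 MB
```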

What is hadoop distributed file system?

A distributed file system that provides high-throughput access to application data. It is a Hadoop file system that runs on top of the resident OS file system. Designed to handle very large files. Performs fewer seeks on data due to larger blocks of data. Stores data in default blocks of 64 MB or larger (a UNIX block is 4 KB).

What is hadoop yarn?

A framework for job scheduling and cluster resource management.

What is a hadoop cluster?

A Hadoop cluster is a combination of racks, each with multiple nodes.

For the following command, we could tell the current user is () [cloudera@quickstart Downloads]$ su root Select one: a. Cloudera b. Quickstart c. Root d. hdfs

A.

When a MapReduce task is started, the first task that is started in a container is... A. The application master B. the map task C. the reduce task D. the applications manager

A.

What would happen to the map task if a 1 GB file would only contain one line and your Map tasks reads line by line? A. the first map task would read and process all blocks B. all map tasks would process all blocks C. no map task would process any file, the output of the map task would be nothing D. every map task processes its own block by default

A. The first map task would read the first line, and because this line spans all the blocks, the first map task is going to process the entire line; the other map tasks will do nothing

2. Point out the correct statement : a) Data locality means movement of algorithm to the data instead of data to algorithm b) When the processing is done on the data algorithm is moved across the Action Nodes rather than data to the algorithm c) Moving Computation is expensive than Moving Data d) None of the mentioned

Answer: a Explanation: Data flow framework possesses the feature of data locality.

4. The JobTracker pushes work out to available _______ nodes in the cluster, striving to keep the work as close to the data as possible a) DataNodes b) TaskTracker c) ActionNodes d) All of the mentioned

Answer: b Explanation: A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status whether the node is dead or alive.

1. ________ is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. a) Hive b) MapReduce c) Pig d) Lucene

Answer: b Explanation: MapReduce is the heart of hadoop.

6. InputFormat class calls the ________ function and computes splits for each file and then sends them to the jobtracker. a) puts b) gets c) getSplits d) all of the mentioned

Answer: c Explanation: InputFormat uses their storage locations to schedule map tasks to process them on the tasktrackers.

5. Point out the wrong statement : a) The map function in Hadoop MapReduce have the following general form:map:(K1, V1) → list(K2, V2) b) The reduce function in Hadoop MapReduce have the following general form: reduce: (K2, list(V2)) → list(K3, V3) c) MapReduce has a complex model of data processing: inputs and outputs for the map and reduce functions are key-value pairs d) None of the mentioned

Answer: c Explanation: MapReduce is relatively simple model to implement in Hadoop.

HDFS - Replication

Blocks with data are replicated on multiple nodes, which allows for node failure without data loss

In MapReduce, keys and values are sorted when sent to the reducer True False

False. Only keys are sorted, not values

Question 1: True or False: The client sends the data to the NameNode and the NameNode divides the data into blocks and sends it to the datanodes

False. The client sends its data directly to the DataNodes. Only metadata is sent to the NameNode

Define Hadoop's HDFS

It's a Hadoop file system that runs on top of the resident OS file system. Designed to handle very large files. Performs fewer seeks on data due to larger blocks of data. Stores data in default blocks of 64 MB or larger (a UNIX block is 4 KB).

Combiner/Partition Function

Lessens the data transfers between Map and Reduce tasks by taking the outputs of multiple Map functions and combining them into a single input to a Reduce function. The Partition function is given a key value and the number of reducers, and returns the index of the desired reducer.
- For the Reduce function to work, all key/value pairs for a particular key value must be sent to the same Reducer.
- To determine to which Reducer a particular key/value pair is to be sent, the Partition function is used.
- The Partition function is given a key/value pair and the number of Reducers.
- It uses some process, typically a hash or a modulo function, that converts the key value to a reducer index value. Whatever function is used to generate the reducer index, you want it to evenly distribute the data across all reducer DataNodes.
- Also, between the map and reduce stages, the data will be sorted and forwarded to DataNodes so that the entire set of key/value pairs for a particular key is sent to the same Reducer.
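A minimal sketch of a modulo-style Partition function like the one described above (hypothetical; Hadoop's default partitioner does the equivalent in Java using the key's hashCode):

```python
# Given a key and the number of reducers, return the index of the reducer
# that must receive every key/value pair carrying that key.
def partition(key, num_reducers):
    return hash(key) % num_reducers

# Within one run, the same key always maps to the same reducer index,
# so all of its key/value pairs end up on the same Reducer.
print(partition("pankush", 4))
print(partition("pankush", 4))  # same index as the line above
```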

What is Map Reduce?

MapReduce is the processing component of Apache Hadoop; it processes data in parallel in a distributed environment.

Data Node

Multiple for each hadoop cluster. Stores/manages blocks of data. Commodity hardware would suffice. The client finds out from the NameNode which DataNodes have the relevant blocks and queries them directly. DataNodes also report to the NameNode occasionally about what blocks they have.

Spark Dataframes have:

Schemas: Define the column names and types of a DataFrame • Columns: Represent a simple type or a complex type • Rows: a record of data
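A sketch (local session assumed) that makes the schema/column/row terms concrete by defining a schema explicitly:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.master("local[*]").appName("schema-demo").getOrCreate()

# Schema: the column names and types of the DataFrame
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("age", IntegerType(), True),
])

# Rows: records of data conforming to that schema
rows = [("pankush", 26), ("jane", 31)]

df = spark.createDataFrame(rows, schema)
df.printSchema()
df.select(df["age"] + 1).show()  # columns can be used in expressions
```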

71. What is Jupyter Notebook?

a. Jupyter Notebook is just an interactive interface to code. Since we installed it with Anaconda, a Python distribution, it is linked to Python by default.

What are the binaries in spark?

• Binaries refer to executable or runnable programs. • That's why you have seen a lot of "bin" folders. • These bin folders contain a lot of executable files (binaries). • You can run these commands by typing them in the terminal, because the path to the bin folder containing the commands has been written to a system variable: Path. • Every time you issue a command in the terminal, the system will start to look for the binary in the locations stored in the variable "Path". Here we have a bin folder containing a command: pyspark.
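A small sketch of what that PATH lookup amounts to, using Python's standard library (shutil.which performs the same directory-by-directory search the shell does):

```python
import os
import shutil

# PATH is just a list of directories, separated by os.pathsep (';' on Windows).
print(os.environ["PATH"].split(os.pathsep))

# shutil.which searches those directories for an executable, the same way the
# terminal does when you type `pyspark`; it returns the full path, or None.
print(shutil.which("pyspark"))
```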

Hadoop is : 1. 2. 3. 4.

1. Framework rather than a single solution 2. Scalable and distributed framework 3. Data pipeline of massive amounts of data 4. Both structured and unstructured data

For the following command, we could tell the current working directory's name is ISDS7511. [cloudera@quickstart Downloads]$ mkdir ISDS7511 Select one: True False

False

What is considered to be part of the Apache Basic Hadoop Modules? A. HDFS B. Yarn C. MapReduce D. Impala

A & C

For the following command, we could tell the current user is () [cloudera@quickstart Downloads]$ mkdir ISDS7511 Select one: a. Cloudera b. Quickstart c. Root d. hdfs

a

The following command will do: [cloudera@quickstart Downloads]$ su root Select one: a. Change the user from cloudera to root b. change the working directory from current to "/root" c. set up the java environment

a

Hadoop is not: 1. 2.

an alternative for SQL always fast and efficient

What does HDFS stand for? A. Hadoop Data File System B. Hadoop Distributed File System C. Hadoop Data File Scalability D. Hadoop Datanode File Security

b

Which of the following service(s) is(are) required to execute the following command? hadoop fs -mkdir ISDS7511 Select one: a. MapReduce b. HDFS c. Yarn d. Zookeeper

b

9. HDFS provides a command line interface called __________ used to interact with HDFS. a) "HDFS Shell" b) "FS Shell" c) "DFS Shell" d) None of the mentioned

Answer: b Explanation: The File System (FS) shell includes various shell-like commands that directly interact with the Hadoop Distributed File System (HDFS).

3. Which of the following command is used to show values to keys used in Pig ? a) set b) declare c) display d) All of the mentioned

Answer: a Explanation: All Pig and Hadoop properties can be set, either in the Pig script or via the Grunt command line.

7. Which is the most popular NoSQL database for scalable big data store with Hadoop ? a) Hbase b) MongoDB c) Cassandra d) None of the mentioned

Answer: a Explanation: HBase is the Hadoop database: a distributed, scalable Big Data store that lets you host very large tables — billions of rows multiplied by millions of columns — on clusters built with commodity hardware.

6. HDFS and NoSQL file systems focus almost exclusively on adding nodes to : a) Scale out b) Scale up c) Both Scale out and up d) None of the mentioned

Answer: a Explanation: HDFS and NoSQL file systems focus almost exclusively on adding nodes to increase performance (scale-out) but even they require node configuration with elements of scale up.

6. Which of the following scenario may not be a good fit for HDFS ? a) HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file b) HDFS is suitable for storing data related to applications requiring low latency data access c) HDFS is suitable for storing data related to applications requiring low latency data access d) None of the mentioned

Answer: a Explanation: HDFS can be used for storing archive data since it is cheaper as HDFS allows storing the data on low cost commodity hardware while ensuring a high degree of fault-tolerance.

2. Point out the correct statement : a) Hadoop is ideal for the analytical, post-operational, data-warehouse-ish type of workload b) HDFS runs on a small cluster of commodity-class nodes c) NEWSQL is frequently the collection point for big data d) None of the mentioned

Answer: a Explanation: Hadoop together with a relational data warehouse, they can form very effective data warehouse infrastructure.

Point out the correct statement : a) Hive is not a relational database, but a query engine that supports the parts of SQL specific to querying data b) Hive is a relational database with SQL support c) Pig is a relational database with SQL support d) All of the mentioned

Answer: a Explanation: Hive is a SQL-based data warehouse system for Hadoop that facilitates data summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems.

4. ___________ operator is used to view the step-by-step execution of a series of statements. a) ILLUSTRATE b) DESCRIBE c) STORE d) EXPLAIN

Answer: a Explanation: ILLUSTRATE allows you to test your programs on small datasets and get faster turnaround times.

3. Input to the _______ is the sorted output of the mappers. a) Reducer b) Mapper c) Shuffle d) All of the mentioned

Answer: a Explanation: In Shuffle phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.

8. The output of the reduce task is typically written to the FileSystem via : a) OutputCollector b) InputCollector c) OutputCollect d) All of the mentioned

Answer: a Explanation: In reduce phase the reduce(Object, Iterator, OutputCollector, Reporter) method is called for each pair in the grouped inputs.

7. Reducer is input the grouped output of a : a) Mapper b) Reducer c) Writable d) Readable

Answer: a Explanation: In the phase the framework, for each Reducer, fetches the relevant partition of the output of all the Mappers, via HTTP.

1. In order to read any file in HDFS, instance of __________ is required. a) filesystem b) datastream c) outstream d) inputstream

Answer: a Explanation: InputDataStream is used to read data from file.

7. You can run Pig in interactive mode using the ______ shell. a) Grunt b) FS c) HDFS d) None of the mentioned

Answer: a Explanation: Invoke the Grunt shell using the "pig" command (as shown below) and then enter your Pig Latin statements and Pig commands interactively at the command line.

10. PigUnit runs in Pig's _______ mode by default. a) local b) tez c) mapreduce d) none of the mentioned

Answer: a Explanation: Local mode does not require a real cluster but a new local one is created each time.

_________ can best be described as a programming model used to develop Hadoop-based applications that can process massive amounts of data. a) MapReduce b) Mahout c) Oozie d) All of the mentioned

Answer: a Explanation: MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm.

________ is general-purpose computing model and runtime system for distributed data analytics. a) Mapreduce b) Drill c) Oozie d) None of the mentioned

Answer: a Explanation: Mapreduce provides a flexible and scalable foundation for analytics, from traditional reporting to leading-edge machine learning algorithms.

8. __________ maps input key/value pairs to a set of intermediate key/value pairs. a) Mapper b) Reducer c) Both Mapper and Reducer d) None of the mentioned

Answer: a Explanation: Maps are the individual tasks that transform input records into intermediate records.

3. HDFS works in a __________ fashion. a) master-worker b) master-slave c) worker/slave d) all of the mentioned

Answer: a Explanation: The NameNode serves as the master and each DataNode serves as a worker/slave

1. ________ systems are scale-out file-based (HDD) systems moving to more uses of memory in the nodes. a) NoSQL b) NewSQL c) SQL d) All of the mentioned

Answer: a Explanation: NoSQL systems make the most sense whenever the application is based on data with varying data types and the data can be stored in key-value notation.

8. Which of the following is correct syntax for parameter substitution using cmd ? a) pig {-param param_name = param_value | -param_file file_name} [-debug | -dryrun] script b) {%declare | %default} param_name param_value c) {%declare | %default} param_name param_value cmd d) All of the mentioned

Answer: a Explanation: Parameter Substitution is used to substitute values for parameters at run time.

The Pig Latin scripting language is not only a higher-level data flow language but also has operators similar to : a) SQL b) JSON c) XML d) All of the mentioned

Answer: a Explanation: Pig Latin, in essence, is designed to fill the gap between the declarative style of SQL and the low-level procedural style of MapReduce.

4. _________ function is responsible for consolidating the results produced by each of the Map() functions/tasks. a) Reduce b) Map c) Reducer d) All of the mentioned

Answer: a Explanation: Reduce function collates the work and resolves the results.

5. Point out the wrong statement : a) Reducer has 2 primary phases b) Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures c) It is legal to set the number of reduce-tasks to zero if no reduction is desired d) The framework groups Reducer inputs by keys (since different mappers may have output the same key) in sort stage

Answer: a Explanation: Reducer has 3 primary phases: shuffle, sort and reduce.

8. Which of the following is the default mode ? a) Mapreduce b) Tez c) Local d) All of the mentioned

Answer: a Explanation: Specify local mode using the -x flag (pig -x local).

9. Which of the following will compile the Pigunit ? a) $pig_trunk ant pigunit-jar b) $pig_tr ant pigunit-jar c) $pig_ ant pigunit-jar d) None of the mentioned

Answer: a Explanation: The compile will create the pigunit.jar file.

7. Which of the following phases occur simultaneously ? a) Shuffle and Sort b) Reduce and Sort c) Shuffle and Map d) All of the mentioned

Answer: a Explanation: The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.

2. Point out the correct statement : a) DataNode is the slave/worker node and holds the user data in the form of Data Blocks b) Each incoming file is broken into 32 MB by default c) Data blocks are replicated across different nodes in the cluster to ensure a low degree of fault tolerance d) None of the mentioned

Answer: a Explanation: There can be any number of DataNodes in a Hadoop Cluster.

2. Point out the correct statement : a) MapReduce tries to place the data and the compute as close as possible b) Map Task in MapReduce is performed using the Mapper() function c) Reduce Task in MapReduce is performed using the Map() function d) All of the mentioned

Answer: a Explanation: This feature of MapReduce is "Data Locality".

8. Which of the following is the default mode ? a) Mapreduce b) Tez c) Local d) All of the mentioned

Answer: a Explanation: To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation.

Hadoop is a framework that works with a variety of related tools. Common cohorts include: a) MapReduce, Hive and HBase b) MapReduce, MySQL and Google Apps c) MapReduce, Hummer and Iguana d) MapReduce, Heron and Trumpet

Answer: a Explanation: To use Hive with HBase you'll typically want to launch two clusters, one to run HBase and the other to run Hive.

The number of maps is usually driven by the total size of : a) inputs b) outputs c) tasks d) None of the mentioned

Answer: a Explanation: Total size of inputs means total number of blocks of the input files.

6. Which of the following command can be used for debugging ? a) exec b) execute c) error d) throw

Answer: a Explanation: With the exec command, store statements will not trigger execution; rather, the entire script is parsed before execution starts.

1. Pig operates in mainly how many nodes ? a) Two b) Three c) Four d) Five

Answer: a Explanation: You can run Pig (execute Pig Latin statements and Pig commands) using various mode: Interactive and Batch Mode.

2. Point out the correct statement : a) You can run Pig in either mode using the "pig" command b) You can run Pig in batch mode using the Grunt shell c) You can run Pig in interactive mode using the FS shell d) None of the mentioned

Answer: a Explanation: You can run Pig in either mode using the "pig" command (the bin/pig Perl script) or the "java" command (java -cp pig.jar ...).

4. _____________ is used to read data from bytes buffers . a) write() b) read() c) readwrite() d) all of the mentioned

Answer: a Explanation: readfully method can also be used instead of read method.

1. A ________ serves as the master and there is only one NameNode per cluster. a) Data Node b) NameNode c) Data block d) Replication

Answer: b Explanation: All the metadata related to HDFS including the information about data nodes, files stored on HDFS, and Replication, etc. are stored and maintained on the NameNode.

All of the following accurately describe Hadoop, EXCEPT: a) Open source b) Real-time c) Java-based d) Distributed computing approach

Answer: b Explanation: Apache Hadoop is an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware.

5. Point out the wrong statement: a) You can run Pig scripts from the command line and from the Grunt shell b) DECLARE defines a Pig macro c) Use Pig scripts to place Pig Latin statements and Pig commands in a single file d) None of the mentioned

Answer: b Explanation: DEFINE defines a Pig macro.

10. HDFS is implemented in _____________ programming language. a) C++ b) Java c) Scala d) None of the mentioned

Answer: b Explanation: HDFS is implemented in Java and any computer which can run Java can host a NameNode/DataNode on it.

7. ________ is a utility which allows users to create and run jobs with any executables as the mapper and/or the reducer. a) Hadoop Strdata b) Hadoop Streaming c) Hadoop Stream d) None of the mentioned

Answer: b Explanation: Hadoop streaming is one of the most important utilities in the Apache Hadoop distribution.
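A hedged sketch of what a streaming mapper can look like in Python (the streaming protocol is plain input lines on stdin and tab-separated key/value lines on stdout; the file name mapper.py is a placeholder):

```python
#!/usr/bin/env python
# mapper.py -- a Hadoop Streaming mapper: raw input lines arrive on stdin,
# and each output line is a tab-separated key/value pair on stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)
```

A matching reducer script would read the sorted key/value lines from stdin and sum the counts per key; both scripts are handed to the streaming jar via its -mapper and -reducer options.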

Hive also support custom extensions written in : a) C# b) Java c) C d) C++

Answer: b Explanation: Hive also support custom extensions written in Java, including user-defined functions (UDFs) and serializer-deserializers for reading and optionally writing custom formats.

1. Which of the following is shortcut for DUMP operator ? a) \de alias b) \d alias c) \q d) None of the mentioned

Answer: b Explanation: If alias is ignored last defined alias will be used.

9. Applications can use the _________ provided to report progress or just indicate that they are alive. a) Collector b) Reporter c) Dashboard d) None of the mentioned

Answer: b Explanation: In scenarios where the application takes a significant amount of time to process individual key/value pairs, this is crucial since the framework might assume that the task has timed-out and kill that task.

10. _________ is the primary interface for a user to describe a MapReduce job to the Hadoop framework for execution. a) Map Parameters b) JobConf c) MemoryConf d) None of the mentioned

Answer: b Explanation: JobConf represents a MapReduce job configuration.

7. The ________ class mimics the behavior of the Main class but gives users a statistics object back. a) PigRun b) PigRunner c) RunnerPig d) None of the mentioned

Answer: b Explanation: Optionally, you can call the API with an implementation of progress listener which will be invoked by Pig runtime during the execution.

3. You can run Pig in batch mode using __________ . a) Pig shell command b) Pig scripts c) Pig options d) All of the mentioned

Answer: b Explanation: Pig script contains Pig Latin statements.

6. Interface ____________ reduces a set of intermediate values which share a key to a smaller set of values. a) Mapper b) Reducer c) Writable d) Readable

Answer: b Explanation: Reducer implementations can access the JobConf for the job.

5. Point out the wrong statement : a) To run Pig in local mode, you need access to a single machine b) The DISPLAY operator will display the results to your terminal screen c) To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation d) All of the mentioned

Answer: b Explanation: The DUMP operator will display the results to your terminal screen.

5. Point out the wrong statement : a) The framework calls reduce method for each pair in the grouped inputs b) The output of the Reducer is re-sorted c) reduce method reduces values for a given key d) None of the mentioned

Answer: b Explanation: The output of the Reducer is not re-sorted.

2. Point out the correct statement : a) During the testing phase of your implementation, you can use LOAD to display results to your terminal screen b) You can view outer relations as well as relations defined in a nested FOREACH statement c) Hadoop properties are interpreted by Pig d) None of the mentioned

Answer: b Explanation: Viewing outer relations is possible using DESCRIBE operator.

8. ___________ is a simple xUnit framework that enables you to easily test your Pig scripts. a) PigUnit b) PigXUnit c) PigUnitX d) All of the mentioned

Answer: b Explanation: With PigUnit you can perform unit testing, regression testing, and rapid prototyping. No cluster set up is required if you run Pig in local mode.

4. __________ are highly resilient and eliminate the single-point-of-failure risk with traditional Hadoop deployments a) EMR b) Isilon solutions c) AWS d) None of the mentioned

Answer: b Explanation: enterprise data protection and security options including file system auditing and data-at-rest encryption to address compliance requirements is also provided by Isilon solution.

10. _______ refers to incremental costs with no major impact on solution design, performance and complexity. a) Scale-out b) Scale-down c) Scale-up d) None of the mentioned

Answer: c Explanation: Adding more CPU/RAM/Disk capacity to Hadoop DataNode that is already part of a cluster does not require additional network switches.

______ is a platform for constructing data flows for extract, transform, and load (ETL) processing and analysis of large datasets. a) Pig Latin b) Oozie c) Pig d) Hive

Answer: c Explanation: Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.

2. Point out the correct statement: a) Invoke the Grunt shell using the "enter" command b) Pig does not support jar files c) Both the run and exec commands are useful for debugging because you can modify a Pig script in an editor d) All of the mentioned

Answer: c Explanation: Both commands promote Pig script modularity as they allow you to reuse existing components.

9. HBase provides ___________ like capabilities on top of Hadoop and HDFS. a) TopTable b) BigTop c) Bigtable d) None of the mentioned

Answer: c Explanation: Google Bigtable leverages the distributed data storage provided by the Google File System.

5. Point out the wrong statement : a) EMC Isilon Scale-out Storage Solutions for Hadoop combine a powerful yet simple and highly efficient storage platform b) Isilon's native HDFS integration means you can avoid the need to invest in a separate Hadoop infrastructure c) NoSQL systems do provide high latency access and accommodate less concurrent users d) None of the mentioned

Answer: c Explanation: NoSQL systems do provide low latency access and accommodate many concurrent users.

Mapper and Reducer implementations can use the ________ to report progress or just indicate that they are alive. a) Partitioner b) OutputCollector c) Reporter d) All of the mentioned

Answer: c Explanation: Reporter is a facility for MapReduce applications to report progress, set application-level status messages and update Counters.

4. ________ NameNode is used when the Primary NameNode goes down. a) Rack b) Data c) Secondary d) None of the mentioned

Answer: c Explanation: Secondary namenode is used for all time availability and reliability.

5. Point out the wrong statement : a) ILLUSTRATE operator is used to review how data is transformed through a sequence of Pig Latin statements b) ILLUSTRATE is based on an example generator c) Several new private classes make it harder for external tools such as Oozie to integrate with Pig statistics d) None of the mentioned

Answer: c Explanation: Several new public classes make it easier for external tools such as Oozie to integrate with Pig statistics.

1. A ________ node acts as the Slave and is responsible for executing a Task assigned to it by the JobTracker. a) MapReduce b) Mapper c) TaskTracker d) JobTracker

Answer: c Explanation: TaskTracker receives the information necessary for execution of a Task from JobTracker, Executes the Task, and Sends the Results back to JobTracker.

8. The ___________ can also be used to distribute both jars and native libraries for use in the map and/or reduce tasks. a) DataCache b) DistributedData c) DistributedCache d) All of the mentioned

Answer: c Explanation: The child-jvm always has its current working directory added to the java.library.path and LD_LIBRARY_PATH.

Point out the wrong statement : a) Hadoop's processing capabilities are huge and its real advantage lies in the ability to process terabytes & petabytes of data b) Hadoop uses a programming model called "MapReduce"; all the programs should conform to this model in order to work on the Hadoop platform c) The programming model, MapReduce, used by Hadoop is difficult to write and test d) All of the mentioned

Answer: c Explanation: The programming model, MapReduce, used by Hadoop is simple to write and test.

4. Pig Latin statements are generally organized in one of the following ways : a) A LOAD statement to read data from the file system b) A series of "transformation" statements to process the data c) A DUMP statement to view results or a STORE statement to save the results d) All of the mentioned

Answer: d Explanation: A DUMP or STORE statement is required to generate output.

7. The need for data replication can arise in various scenarios like : a) Replication Factor is changed b) DataNode goes down c) Data Blocks get corrupted d) All of the mentioned

Answer: d Explanation: Data is replicated across different DataNodes to ensure a high degree of fault-tolerance.

3. Which of the following operator is used to view the map reduce execution plans ? a) DUMP b) DESCRIBE c) STORE d) EXPLAIN

Answer: d Explanation: EXPLAIN displays execution plans.

______ jobs are optimized for scalability but not latency. a) Mapreduce b) Drill c) Oozie d) Hive

Answer: d Explanation: Hive Queries are translated to MapReduce jobs to exploit the scalability of MapReduce.

5. Point out the wrong statement : a) Replication Factor can be configured at a cluster level (Default is set to 3) and also at a file level b) Block Report from each DataNode contains a list of all the blocks that are stored on that DataNode c) User data is stored on the local file system of DataNodes d) DataNode is aware of the files to which the blocks stored on it belong to

Answer: d Explanation: NameNode is aware of the files to which the blocks stored on it belong to.

10.$ pig -x tez_local ... will enable ________ mode in Pig. a) Mapreduce b) Tez c) Local d) None of the mentioned

Answer: d Explanation: Tez Local Mode is similar to local mode, except internally Pig will invoke tez runtime engine.

5. Point out the wrong statement : a) A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner b) The MapReduce framework operates exclusively on pairs c) Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods d) None of the mentioned

Answer: d Explanation: The MapReduce framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.

What are the two majority types of nodes in HDFS? A. MetaNode B. NameNode C. RackNode D. BlockNode E. DataNode

B & E

What is Yarn used as an alternative to in Hadoop 2.0 and higher versions of Hadoop? A. Pig B. Hive C. ZooKeeper D.MapReduce E. HDFS

D

In Amazon Web Services (AWS), you pay for different types of instances to run the Data Nodes and Name Node for the Hadoop distributed file system. AWS allows its customers to use two types of instances: 1. Reserved instances, which, once started, will not end until you terminate them. 2. Spot instances, which are much cheaper because they use idle resources in AWS, but which might be reclaimed by AWS when those computing resources become unavailable. If we want to decrease our cost of using AWS for our Hadoop deployment, we should use a reserved instance for our data node and a spot instance for the name node, because we always need the name node to maintain the file system and this will help us keep within budget. Select one: True False

False

Hadoop is written in what language? Why?

Java, for distributed storage and processing.

What are Hadoop advantages over a traditional platform? A. Scalability B. Reliability C. Flexibility D. Cost

A, B, C & D

10. _________ are scanned in the order they are specified on the command line. a) Command line parameters b) Parameter files c) Declare and default preprocessors d) Both parameter files and command line parameters

Answer: d Explanation: Parameter files and command line parameters are scanned in FIFO manner.

