Topics in Big Data EXAM
A client reading data from the HDFS filesystem in Hadoop -gets the data from the namenode -gets the block locations from the datanode -gets both the data and the block locations from the namenode -gets only the block locations from the namenode
gets only the block locations from the namenode
When you set up a Hadoop cluster, which command is used to verify whether all the Hadoop daemons are running on the machine? -top -ps -jps -fsck
jps
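As a rough check (process names only; the PIDs vary by machine), jps on a healthy single-node cluster typically lists NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager, plus Jps itself; any daemon missing from that list has not started.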
In which language is Hadoop written? -C++ -Python -C# -Java
Java
Which is the slave daemon of YARN? -NodeManager -Container -ApplicationMaster -ResourceManager
NodeManager
Users can control which keys (and hence records) go to which Reducer by implementing a custom? -All of the mentioned -Reporter -Partitioner -OutputSplit
Partitioner
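A minimal sketch of a custom Partitioner in the new MapReduce API (the class name and key scheme are illustrative, and the job is assumed to run with two reducers and non-empty keys): keys are routed to a reducer by their first letter instead of by hash.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Illustrative custom Partitioner: keys starting with A-M go to reducer 0,
    // all other keys go to reducer 1.
    public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            char first = Character.toUpperCase(key.toString().charAt(0));
            return (first <= 'M' ? 0 : 1) % numPartitions;
        }
    }

It is enabled in the driver with job.setPartitionerClass(AlphabetPartitioner.class).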
Apache Spark has APIs in · Java · Scala · Python · All of the above
All of the above
In which of the following languages can you write Hadoop code? -Python -R -Java -C++
All of them
Which of the following is not a scheduling option available in YARN -FIFO scheduler -Fair scheduler -Balanced scheduler -Capacity scheduler
Balanced scheduler
Namenode stores filesystem metadata, which is further divided into ____ -Editlog -work directory -None of the above -Fsimage
Editlog and Fsimage
Which of the following key features of HDFS ensure against data loss? -Fault tolerant -Scalable -Replication -Portable
Replication
________ is the slave/worker node and holds the user data in the form of Data Blocks. -NameNode -Data block -Replication -DataNode
DataNode
MapReduce is a programming model used in Hadoop for processing Big Data. It's also a processing technique for what? -Distributed computing -System with multiple components -Java -Python
Distributed computing
Point out the wrong statement regarding the driver class in a MapReduce implementation -The driver class is responsible for setting up the MapReduce job to run in Hadoop -We specify the job name, the data types of input/output, and the names of the mapper and reducer classes in the driver class -Driver class is optional in MapReduce -We also need to set input and output directories for the MapReduce job
Driver class is optional in MapReduce
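A minimal driver sketch, assuming a word-count style job and using the mapper/reducer classes bundled with Hadoop (TokenCounterMapper, IntSumReducer) so it compiles on its own; the input and output directories come from the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            // The driver wires up the whole job: name, classes, types, and paths.
            Job job = Job.getInstance(new Configuration(), "word count"); // job name
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(TokenCounterMapper.class);   // mapper class
            job.setCombinerClass(IntSumReducer.class);      // optional combiner; must not change the final result
            job.setReducerClass(IntSumReducer.class);       // reducer class
            job.setOutputKeyClass(Text.class);              // output key type
            job.setOutputValueClass(IntWritable.class);     // output value type
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Without a driver there is no job to submit, which is why "Driver class is optional in MapReduce" is the wrong statement above.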
Select the statement that identifies all the data types associated with Big Data. · Semi-structured data is not associated with Big Data. · Unstructured data is not associated with Big Data. · Structured, semi-structured, and unstructured data are all associated with Big Data. · Only unstructured data is associated with Big Data.
Structured, semi-structured, and unstructured data are all associated with Big Data.
The datanode and namenode are respectively ___ -Master and worker nodes -None -Both are worker nodes -Worker and Master nodes
Worker and Master nodes
The hdfs command put is used to -Copy files from local file system to HDFS. -Copy files from HDFS to local filesystem. -Copy files or directories from HDFS to local filesystem. -Copy files or directories from local file system to HDFS.
-Copy files from local file system to HDFS. -Copy files or directories from local file system to HDFS.
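For example (file and directory names are illustrative), hdfs dfs -put sales.csv /user/data/ copies the local file sales.csv into the HDFS directory /user/data/; the source argument may be a single file or a whole directory.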
When writing data to HDFS what is true if the replication factor is three? -Data is written to DataNodes on three separate racks (if Rack Aware). -Data is written to blocks on three different DataNodes. -None of the above -Data is written to DataNodes on two separate racks (if Rack Aware).
-Data is written to blocks on three different DataNodes. -Data is written to DataNodes on two separate racks (if Rack Aware).
Namenode keeps metadata in -HDFS -Memory -Both -None of the above
-Memory
What are the components of a Hadoop 1 architecture (before 2014)? -HDFS and MapReduce -HDFS, MapReduce, and YARN -DataNode and NameNode -Jobtracker and Tasktracker
-HDFS and MapReduce
In a Hadoop cluster, what is true for an HDFS block that is no longer available due to disk corruption or machine failure? -It can be replicated from its alternative locations to other live machines. -The namenode allows new client requests to keep trying to read it. -It is lost forever -The MapReduce job process runs ignoring the block and the data stored in it.
-It can be replicated from its alternative locations to other live machines.
Which of the following are components of Hadoop? -YARN -HDFS -MapReduce -Spark
-YARN -HDFS -MapReduce
HDFS works in a __________ fashion. -worker-master fashion -master-slave fashion -master-worker fashion -slave-master fashion
-master-slave fashion -master-worker fashion
The default number of reducers for a MapReduce job is __________ -3 -1 -2 -None of the above
1
Which cluster managers does Spark support? · Standalone Cluster Manager · MESOS · YARN · All of the above
All of the above
Which of the following capabilities are quantifiable advantages of distributed processing? · You can add and remove execution nodes as and when required, significantly reducing infrastructure costs. · Since problem instructions are executed on separate execution nodes, memory and processing requirements are low even while processing large volumes of data. · Parallel processing can process Big Data in a fraction of the time compared to linear processing. · Parallel processing fixes and executes errors locally without impacting other nodes.
You can add and remove execution nodes as and when required, significantly reducing infrastructure costs. Parallel processing can process Big Data in a fraction of the time compared to linear processing.
Which of the following statements about Hadoop are true? -Collection of computers working together at the same time to perform tasks -Hadoop allows for running applications on clusters -Processes massive amounts of data in distributed file systems that are linked together -Set of open-source programs and procedures which can be used as the framework for Big Data operations
All of them
What is TRUE about transformations? · Transformations are the functions that are applied on an RDD · Filter and Map apply to each element of an RDD and create a new RDD · Transformations are not executed until an action is called · All of these
All of these
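A small Java sketch (the data and application name are illustrative) showing all three points: filter and map each return a new RDD, and nothing actually executes until the count action is called.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LazyTransformations {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("lazy-demo").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));
            // Transformations: each call returns a new RDD; no work happens yet.
            JavaRDD<Integer> evens   = numbers.filter(n -> n % 2 == 0);
            JavaRDD<Integer> doubled = evens.map(n -> n * 2);

            // The action finally triggers execution of the whole lineage.
            long howMany = doubled.count();
            System.out.println("Count after filter+map: " + howMany);

            sc.close();
        }
    }

Running with local[*] keeps the example self-contained; on a cluster the master URL would instead point at YARN, Mesos, or a standalone master.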
Which of the following physically stores the data? -Master Node -All of the above -Data Node -Name Node
Data Node
Which of the following is TRUE about Apache Spark? · Hadoop is way faster than Apache Spark · It provides high-level APIs only in Java · Apache Spark is a fast and general-purpose open-source cluster computing system · All of these
Apache Spark is a fast and general-purpose open-source cluster computing system
Which of the following is the framework-specific entity that negotiates resources from the ResourceManager? -ApplicationMaster -NodeManager -ResourceManager -All of the above
ApplicationMaster
The minimum amount of data that HDFS can read or write is called a _____________. -NameNode -Block -Datanode -None of the above
Block
Which types of data processing does Spark offer? · Batch-based processing of data streams · Interactive processing · None · Both
Both
Point out the wrong statement regarding the combiner -Combiner can speed up the job execution -Combiner will be called as long as it is specified in the job configuration -The existing reducer can be used as a combiner -Combiner should not affect the final result
Combiner will be called as long as it is specified in the job configuration
In which mode do all daemons execute on separate nodes? -None of the above -Pseudo-distributed mode -Fully distributed mode -Local (Standalone) mode
Fully distributed mode
What are the main components of the Hadoop framework? -Kafka -HDFS -YARN -MapReduce
HDFS, YARN, and MapReduce
What are the components of a Hadoop architecture? -DataNode and NameNode -HDFS, MapReduce, and YARN -Jobtracker and Tasktracker -HDFS and MapReduce
HDFS, MapReduce, and YARN; HDFS and MapReduce
________ is a utility which allows users to create and run jobs with any executables as the mapper and/or the reducer. -None of the mentioned -Hadoop Stream -Hadoop Strdata -Hadoop Streaming
Hadoop Streaming
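A typical invocation (paths are illustrative; the streaming jar ships with Hadoop and its exact file name varies by version) uses ordinary executables as the mapper and reducer:
hadoop jar hadoop-streaming.jar -input /user/in -output /user/out -mapper /bin/cat -reducer /usr/bin/wc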
Which one of the following is FALSE about Hadoop? -It is a distributed framework -MapReduce is the processing engine in Hadoop -Hadoop can work with commodity hardware -Hadoop was created by Google
Hadoop was created by Google
_________ is the default Partitioner for partitioning key space. -Partitioner -HashPartitioner -HashPar -None of the mentioned
HashPartitioner
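The default behaviour boils down to hashing the key and taking it modulo the number of reducers; a sketch equivalent to what HashPartitioner does:

    import org.apache.hadoop.mapreduce.Partitioner;

    // Sketch of the default behaviour: mask off the sign bit of the key's hashCode
    // and take the remainder modulo the number of reduce tasks.
    public class HashPartitionerSketch<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }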
The CapacityScheduler supports _____________ queues to allow for more predictable sharing of cluster resources. -Hierarchical -Networked -None of the above -Partition
Hierarchical
Point out the correct statement. -The right number of reduces seems to be 0.95 or 1.75 -Increasing the number of reduces increases the framework overhead -With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish -All of the mentioned
Increasing the number of reduces increases the framework overhead
The Hadoop MapReduce framework spawns one map task for each __________ generated by the InputFormat for the job. -All of the mentioned -OutputSplit -InputSplit -InputSplitStream
InputSplit
The number of maps is usually driven by the total size of ____________ -None of the above -Inputs -Output -Task
Inputs
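As a rough worked example (sizes illustrative): a 10 GB input with a 128 MB split size yields about 10240 / 128 = 80 input splits, and therefore roughly 80 map tasks.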
Which of the following is not the feature of Spark? · Fault-tolerance · Supports in-memory computation · It is not cost efficient · Compatible with other file storage system
It is not cost efficient
Users can bundle their MapReduce code in a _________ file and execute it using jar command. -py -Jar -Java -xml
Jar
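For example (jar name, class, and paths are illustrative): hadoop jar wordcount.jar WordCountDriver /user/input /user/output runs the driver class packaged in wordcount.jar against the given HDFS input and output directories.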
Although the Hadoop framework is implemented in Java, MapReduce applications need not be written in ____________ -Python -C++ -Java -None of the above
Java
What is Spark's data loading mechanism? · Eager Loading · Both of these · Lazy loading · None of these
Lazy loading
Which of the following is a data processing engine for Hadoop Framework? -MapReduce -Spark -HDFS -YARN
MapReduce
Is YARN a replacement of Hadoop MapReduce? (Y/N)
No
What is FALSE about RDD? · RDD is immutable · RDD provides two kinds of operations: transformations & actions · Spark revolves around the concept of RDD · None of the above
None of the above
Which of the following capabilities are quantifiable advantages of parallel processing? · You can add and remove execution nodes as and when required, significantly reducing infrastructure costs. · Since problem instructions are executed on separate execution nodes, memory and processing requirements are low even while processing large volumes of data. · Parallel processing can process Big Data in a fraction of the time compared to linear processing. · Parallel processing fixes and executes errors locally without impacting other nodes.
Parallel processing can process Big Data in a fraction of the time compared to linear processing
Which of the following are not design goals of HDFS? -Fault detection and recovery -Prevent deletion of data -Provide high network bandwidth for data movement -Handle huge dataset
Prevent deletion of data; Provide high network bandwidth for data movement
In MapReduce, the number of reducers can be changed by ____________ -The number of map tasks -Input size -Programmer sets the number of Reducers -The number of nodes in a cluster
Programmer sets the number of Reducers
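A tiny sketch (class name illustrative) of the driver call that controls this; omitting it leaves the job at the default of one reducer.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCountDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "reducer-count-demo");
            // The programmer explicitly chooses how many reduce tasks the job runs with.
            job.setNumReduceTasks(4);
            System.out.println("Reducers requested: " + job.getNumReduceTasks());
        }
    }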
Hive is a ____ -Query Language -Database -Data Flow Language -Programming Language
Query Language
The basic abstraction of Spark Streaming is · DataFrame · RDD · Shared variable · None of the above
RDD
All of the following accurately describe Hadoop, EXCEPT -Open-source -Real-time -Java-based -Distributed computing approach
Real-time
What does RDD stand for? · Redundant Distributed Database · Resilient Distributed Database · Resilient Distributed Dataset · None
Resilient Distributed Dataset
What is YARN? -None of the above -Storage layer -Batch processing engine -Resource Management Layer
Resource Management Layer
Which among the following is the ultimate authority that arbitrates resources among all the applications in the system? -Container -ApplicationMaster -ResourceManager -NodeManager
ResourceManager
Spark is developed in which language · Java · Scala · Python · R
Scala
Which of the following phases occur simultaneously -Shuffle and Map -Both A and B -Shuffle and Sort -Reduce and Sort
Shuffle and Sort
Which statement best describes small data? · Small Data is available in limited quantities that humans can easily interpret with little or no digital processing. · Small data consists of batches of big data requiring large amounts of compute power. · Small Data is available in quantities that humans can easily interpret after digital processing. · Small data has little or no structure or is semi-structured. Examples of semi-structured data include social media posts that could be images accompanied by hashtags, while unstructured data could include medical records from millions of patients.
Small Data is available in limited quantities that humans can easily interpret with little or no digital processing.
What is the driver program of Spark? · SparkContext · Cluster Manager · Worker Node · All
SparkContext
Point out the wrong statement. -It is legal to set the number of reduce-tasks to zero if no reduction is desired. -The Mapreduce framework does not sort the map-outputs before sending them to the reduce tasks -None of the above -The outputs of the map-tasks go directly to the local File System
The Mapreduce framework does not sort the map-outputs before sending them to the reduce tasks
The total number of partitioners is equal to -The number of reducers -The number of combiners -All of the above -The number of mappers
The number of reducers
Hadoop can be deployed on commodity servers, which provides low-cost processing and storage of huge volumes of unstructured data. -True -False
True
What is Writable in Hadoop? -None of these answers are correct -Writable is a Java interface that needs to be implemented for HDFS writes -Writable is a Java interface that needs to be implemented for streaming data to remote servers -Writable is a Java interface that needs to be implemented for MapReduce processing
Writable is a Java interface that needs to be implemented for MapReduce processing
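A minimal custom Writable sketch (type and field names are illustrative); keys would additionally implement WritableComparable, but a plain value type only needs write and readFields.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // Illustrative value type: Hadoop serializes it with write() and
    // rebuilds it on the receiving side with readFields().
    public class PageVisit implements Writable {
        private long timestamp;
        private int durationSeconds;

        public PageVisit() { }   // no-arg constructor required for deserialization

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(timestamp);
            out.writeInt(durationSeconds);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            timestamp = in.readLong();
            durationSeconds = in.readInt();
        }
    }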
Which of the following is the architectural center of Hadoop that allows multiple data processing engines. -HDFS -YARN -Hive -Incubator
YARN
Which of the following manages the resources among all the applications running in a Hadoop cluster? -NameNode -DataNode -YARN -MapReduce
YARN
Apache Hadoop YARN stands for : -Yet Another Resource Network -None of the above -Yet Another Resource Negotiator -Yet Another Reserve Negotiator
Yet Another Resource Negotiator
In HDFS the files cannot be ___ -executed -none of the above -deleted -read
executed
Which of these statements describe big data? Check all that apply. · Data generated in huge volumes and can be structured, semi-structured, or unstructured. · Big Data arrives continuously at enormous speed from multiple sources. · Big Data is relatively consistent and is stored as JSON or XML forms. · Big Data is mostly located in storage within Enterprises and Data Centers.
· Data generated in huge volumes and can be structured, semi-structured, or unstructured. · Big Data arrives continuously at enormous speed from multiple sources. · Big Data is mostly located in storage within Enterprises and Data Centers.