Big Data Exam 1 (Q1 -Q4)
What class performs the summary operations against intermediate data on the mapper node?
Combiner
In a Hadoop "stack", which node periodically replicates and stores data from the name node in case it fails?
Secondary name node
Data loading in HDFS
Sqoop
______________ refers to both how fast data is being produced and how fast the data must be processed (i.e., captured, stored, and analyzed) to meet the demand or need. RFID tags, automated sensors, GPS devices, and smart meters are driving an increasing need to deal with torrents of data in near-real time.
Velocity
Cluster Management in HDFS
YARN
HDFS is
a distributed file system made of commodity hardware
A slave node in Hadoop is:
a node where data is stored and processed
Using data to understand customers/clients and business operations to sustain and foster growth and profitability is:
an increasingly challenging task for today's enterprises
Which Big Data approach promotes efficiency, lower cost, and better performance by processing jobs in a shared, centrally managed pool of IT resources?
grid computing
Correct format for submitting MapReduce jobs
hadoop jar MRjar.jar MRDriver inputdir outputdir
Command to list files in hdfs
hdfs dfs -ls
Command to display a list of blocks that make up each file:
hdfs fsck / -files -blocks
Config file for HDFS is stored in
hdfs-site.xml
The number of map tasks equals the number of input file splits, and may even be increased by the user by reducing the input split size. This leads to
improved resource utilization
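The split-to-task relationship above can be checked with quick arithmetic; a minimal sketch (the file and split sizes below are made-up illustrative values, not Hadoop defaults for any particular cluster):

```python
import math

def num_map_tasks(file_size_bytes, split_size_bytes):
    """Each input split gets its own map task, so the count is ceil(file / split)."""
    return math.ceil(file_size_bytes / split_size_bytes)

one_gib = 1024 ** 3
# With a 128 MB split size, a 1 GiB file yields 8 map tasks.
print(num_map_tasks(one_gib, 128 * 1024 ** 2))  # → 8
# Halving the split size doubles the number of map tasks (more parallelism).
print(num_map_tasks(one_gib, 64 * 1024 ** 2))   # → 16
```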
What are the details about data stored in the name node?
location and size of blocks, size of files, permissions and ownership
Clients make communication in HDFS through the
name node only
Functions of reducer in MapReduce
reduce, shuffle and sort
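The reducer-side stages can be sketched in plain Python; a toy word-count illustration (the intermediate pairs are invented sample data, not Hadoop output):

```python
from itertools import groupby
from operator import itemgetter

# Intermediate (key, value) pairs as they might arrive from several mappers.
mapped = [("cat", 1), ("dog", 1), ("cat", 1), ("bird", 1), ("dog", 1)]

# Shuffle and sort: group the pairs by key so each key's values arrive together.
mapped.sort(key=itemgetter(0))

# Reduce: sum the values for each key (a word-count reducer).
reduced = {key: sum(v for _, v in group)
           for key, group in groupby(mapped, key=itemgetter(0))}
print(reduced)  # → {'bird': 1, 'cat': 2, 'dog': 2}
```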
The analytics layer of the Big Data stack is experiencing what type of development currently?
significant development
In HDFS blocks are stored
spread over multiple data nodes
Traditional data warehouses have not been able to keep up with
the variety and complexity of data
Data flows can be highly inconsistent, with periodic peaks, making data loads hard to manage. What is this feature of Big Data called?
variability
A newly popular unit of data in the Big Data era is the petabyte (PB), which is:
10^15 bytes
Default block size in HDFS is
128 MB
In MapReduce how many main functions are involved?
2
Default config for mapper and reducer:
2 mappers and 1 reducer
Default replication factor in HDFS is
3
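The replication factor directly multiplies raw storage consumption; a minimal sketch of that arithmetic (the 10 GiB file size is an invented example):

```python
def raw_storage_bytes(logical_size_bytes, replication_factor=3):
    """Raw cluster capacity consumed when every block is stored
    replication_factor times (the HDFS default is 3)."""
    return logical_size_bytes * replication_factor

ten_gib = 10 * 1024 ** 3
# A 10 GiB file consumes 30 GiB of raw cluster storage at replication 3.
print(raw_storage_bytes(ten_gib) // 1024 ** 3)  # → 30
```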
Which of the following options are used to pass parameters to a MapReduce program during runtime?
A + B
Choose the non-relational database that is part of the Hadoop ecosystem: A. Hive B. HBase C. NoSQL D. ZooKeeper
B. HBase
Which statement is not true? A. Hadoop consists of multiple products B. HDFS is a file system, not an RDBMS C. HDFS and network file systems are related D. Hive resembles SQL but is not standard SQL
C
Which tool simplifies Java-based MapReduce processing?
Pig Latin
T/F: Block size and replication factor cannot be configured once we set up the HDFS
False (both can be changed after the cluster is set up)
The requirement is to upload data to HDFS as soon as it reaches the company. Which tool in the Hadoop ecosystem is meant to satisfy this requirement?
Flume
The final output of the reduce tasks is stored in:
HDFS (since the final output is relatively small)
How does Hadoop work?
It breaks up Big Data into multiple parts so each part can be processed and analyzed at the same time on multiple computers.
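That split-process-combine idea can be illustrated with a toy parallel word count; this is a conceptual sketch in plain Python, not Hadoop itself, and the input text is invented:

```python
from multiprocessing import Pool
from collections import Counter

def count_words(chunk):
    """Process one part of the data independently (the 'map' side of the idea)."""
    return Counter(chunk.split())

if __name__ == "__main__":
    text = "big data big cluster data data"
    # Break the input into parts; here, simply split the word list in half.
    words = text.split()
    mid = len(words) // 2
    parts = [" ".join(words[:mid]), " ".join(words[mid:])]

    # Process each part at the same time on separate worker processes.
    with Pool(2) as pool:
        partial_counts = pool.map(count_words, parts)

    # Combine the partial results (the 'reduce' side of the idea).
    total = sum(partial_counts, Counter())
    print(total)  # → Counter({'data': 3, 'big': 2, 'cluster': 1})
```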
What InputFormat is best suited to run a second MapReduce job that takes the output key-value pairs from the first as its input?
KeyValueTextInputFormat
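KeyValueTextInputFormat splits each line at the first separator (a tab by default) into a key and a value, which is why it pairs naturally with the tab-separated text output of a previous job. A minimal Python sketch of that parsing rule (not Hadoop's actual implementation):

```python
def parse_key_value_line(line, separator="\t"):
    """Split at the FIRST separator, as KeyValueTextInputFormat does;
    a line with no separator becomes a key with an empty value."""
    key, _sep, value = line.partition(separator)
    return key, value

print(parse_key_value_line("cat\t2"))         # → ('cat', '2')
print(parse_key_value_line("dog\t2\textra"))  # → ('dog', '2\textra')
print(parse_key_value_line("nosep"))          # → ('nosep', '')
```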
Which takes key-value pairs as input and produces it as output too?
Mappers and reducers
The ___________ node in a Hadoop cluster provides clients with information on where in the cluster particular data is stored.
Name node
True statement
The number of input splits is equal to the number of map tasks
The term "Big Data" is relative, as it depends on the size of the organization using it.
True