Data Science - Big Data Assessment
which of these can be the purpose of counters?
all: the number of tasks that were launched and ran successfully; the correct number of bytes that were read and written; the number of mappers and reducers launched
which of the following will give min value of votes in the dataframe "df"? df.max() df.agg(min("votes")).show() df.agg().min("votes").show() none
df.agg(min("votes")).show()
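A minimal sketch of that option, assuming a toy DataFrame named df with a "votes" column (the data here is made up):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.min

    val spark = SparkSession.builder.appName("MinVotes").master("local[*]").getOrCreate()
    val df = spark.createDataFrame(Seq(("A", 10), ("B", 3), ("C", 7))).toDF("candidate", "votes")
    df.agg(min("votes")).show()   // prints a single row with min(votes) = 3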
can a Spark RDD be shared between SparkContexts?
false
is sql querying supported in HBase?
false
A DataFrame can be created from an existing RDD. You would create the DataFrame from the existing RDD by inferring schema using case classes in which one of the given cases? - if your dataset has more than 22 fields - if all your users are going to need the dataset parsed in the same way - if you have two sets of users who will need the text dataset parsed differently - none of these
if all your users are going to need the dataset parsed in the same way. The programmatic-schema approach is the one meant for the other situations: the Spark SQL guide says to specify the schema programmatically "when case classes cannot be defined ahead of time (for example ... a text dataset will be parsed and fields will be projected differently for different users)". Spark SQL and DataFrames - Spark 2.3.1 Documentation (spark.apache.org/docs/2.3.1/sql-programming-guide.html), ctrl+f 'parsed'
below are the steps given to compute the number of lines which contain the word "ADMIN": val inRDD = sc.textFile("/user/user01/data/weatherdat.csv"); val tempRD = inRDD.map(line => line.split("|")); val tempRDD = inRDD.filter(line => line.contains("ADMIN")); val tempCount = tempRDD.count(). Choose the step at which inRDD will be computed.
inRDD is computed and the data loaded only when count() is applied, because "The transformations are only computed when an action requires a result to be returned to the driver program." (https://spark.apache.org/docs/latest/rdd-programming-guide.html)
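A small sketch of that lazy-evaluation behaviour, assuming sc is the SparkContext available in spark-shell (the path and search word are taken from the question):

    val inRDD   = sc.textFile("/user/user01/data/weatherdat.csv")  // nothing is read yet, only lineage is recorded
    val tempRDD = inRDD.filter(line => line.contains("ADMIN"))     // still lazy: just another node in the lineage graph
    val tempCount = tempRDD.count()                                // action: the file is actually read and the pipeline runs here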
which are the options available in spark to save the data to disk only, where the DataFrame object is referred to as 'out'?
out.persist(StorageLevel.DISK_ONLY)
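A minimal sketch, assuming out is an existing DataFrame as in the question:

    import org.apache.spark.storage.StorageLevel

    out.persist(StorageLevel.DISK_ONLY)  // keep the partitions on disk only, not in memory
    // out.cache() would default to MEMORY_AND_DISK for DataFrames, so DISK_ONLY has to be passed explicitly
    out.unpersist()                      // release the disk copies when they are no longer needed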
which of the following tools defines a data flow language? - hive - HBase - MapReduce - Pig
pig
what type of database is HBase? schema-rigid schema-flexi schema-less not a db
schema-less
which is the default serde in Hive? - DynamicSerDe - ThriftSerDe - metadataTypedColumnsetSerDe - LazySimpleSerDe
LazySimpleSerDe (org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe is Hive's built-in default SerDe for delimited text tables)
RDDs can also be unpersisted to remove RDD from permanent storage like memory and/or disk
true
spark supports loading data from HBase
true
zero reducers are allowed in map reduce
true
Why is Serde used in hive? - used for serialization and deserialization - compress the file - read and write files into hdfs - lazyloading of files into cache
used for serialization and deserialization (the SerDe tells Hive how to read rows from and write rows to files in HDFS)
An election RDD is provided with tuples (candidate, count). Which of the below Scala snippets is used to get the candidate with the lowest number of votes? val low = electionRDD.sortByKey().first val low = electionRDD.sortByKey(false).first (not this one, goes in descending order) val low = electionRDD.map(x=>(x._2, x._1)).sortByKey().first val low = electionRDD.map(x=>(x._2, x._1)).sortByKey(false).first (not this one)
val low = electionRDD.map(x=>(x._2, x._1)).sortByKey().first (the key of the original tuple is the candidate name, so sortByKey alone would sort alphabetically; the counts have to be swapped into key position before sorting)
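A hedged sketch with made-up vote counts, assuming electionRDD is an RDD[(String, Int)] of (candidate, count):

    val electionRDD = sc.parallelize(Seq(("alice", 120), ("bob", 45), ("carol", 300)))

    // swap to (count, candidate) so sortByKey orders by vote count, then take the first element
    val low = electionRDD.map(x => (x._2, x._1)).sortByKey().first   // (45, bob)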
partitioner is used in a mapreduce job - when there are multiple reducers - when keys have to be grouped and sent to particular mapper - when keys have to be grouped and send to particular reducer - none
when there are multiple reducers
which of the following languages is used by apache CouchDB for storing the data? javascript json csv xml
json (CouchDB stores documents as JSON; JavaScript is used for views and queries, not for storage)
is relational join on two large tables on common key possible in mapreduce job?
yes
choose the correct statement about RDD - none -rdd is a programming paradigm - rdd is a distributed data structure - rdd is a database
rdd is a distributed data structure (a Resilient Distributed Dataset is an immutable, partitioned collection of records, not a paradigm or a database)
in what year was apache spark made an open-source technology?
2010
you have the following key-value pairs as output from your map test (the,1) (fox, 1) (faster,1) (than,1) (the,1) (cat,1) how many keys will be passed to the reducer? -3 -4 -5 -6
5 (the distinct keys are the, fox, faster, than, cat; the two (the,1) pairs are grouped under a single key)
which is the default database for storing metadata of hive tables? - MySQL - derby - oracle - none of these
Derby
Apache CouchDB is written in ____ programming language java javascript erlang python
Erlang
By default, Spark uses which algorithm to remove old and unused RDD to release more memory?
Least Recently Used (LRU)
which is the default storage level in spark? MEMORY_ONLY MEMORY_ONLY_SER MEMORY_AND_DISK MEMORY_AND_DISK_SER
MEMORY_ONLY
The programming paradigm used in Spark is __________. - none of the options - generalized - map only - mapreduce
MapReduce
what is the input format in mapreduce job to read multiple lines at a time - TextInputFormat - MultiLineInputFormat - NLineTextInputFormat - NLineInputFormat
NLineInputFormat (TextInputFormat hands each mapper a whole split one line per record; NLineInputFormat splits the input so that each mapper receives N lines)
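A hedged sketch of configuring it from Scala with the Hadoop API; the job name is a placeholder and the value 5 is arbitrary:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat

    val job = Job.getInstance(new Configuration(), "nline-demo")
    job.setInputFormatClass(classOf[NLineInputFormat])
    NLineInputFormat.setNumLinesPerSplit(job, 5)   // each input split, and so each mapper, gets 5 lines at a time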
which of these are responsible to output key-value pairs from the Reducer phase to output files? reducer Writer RecordReader RecordWriter none of these
RecordWriter
____ protocol is used for API to access the documents SMTP HTTP IP SSH
HTTP (CouchDB exposes its documents through a RESTful HTTP API)
which tells Spark how and where to access a cluster? Spark Streaming Spark Context Spark Core Spark Session
Spark Context
Yarn architecture, which replaces the classic MR1 architecture, removes which one? - single point of failure in the Namenode - resource pressure on the JobTracker - HDFS latency - ability to run frameworks other than MapReduce, such as MPI
resource pressure on the JobTracker (YARN splits the JobTracker's duties between the ResourceManager and per-application ApplicationMasters; running non-MapReduce frameworks such as MPI is something YARN adds rather than removes, and it does not address the Namenode or HDFS latency)
Oozie workflow jobs are Directed _____ graphs of actions? - acyclical -cyclical -elliptical - all the above
acyclical
rdd is ___ - recomputable - fault-tolerant -all - immutable
all
how would you set the number of executors in any spark based application, assume we need to set 12 executors? --num-executors 12 can't change, it is always fixed conf.set(12) none of the above
--num-executors 12 (passed to spark-submit; the same setting is available as spark.executor.instances in the configuration)
which of the following will return all the elements of the dataset as an array at the driver program? - first() - head() - collect() - none of the above
collect()
In a MapReduce program, which interface should a combiner class implement when the input is IntWritable keys and Text values, and it emits IntWritable keys and IntWritable values? - combiner - mapper - reducer - reducer<IntWritable, Text, IntWritable, IntWritable>
reducer<IntWritable, Text, IntWritable, IntWritable> (Hadoop has no Combiner interface; a combiner is a Reducer whose input types match the mapper's output types)
which of the following application types can Spark run in addition to batch-processing jobs? all graph processing machine learning stream processing
all (Spark ships GraphX for graph processing, MLlib for machine learning, and Spark Streaming for stream processing)
how can you set the reduce phase in a MapReduce program to zero?
job.setNumReduceTasks(0)
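A sketch using the Hadoop Java API from Scala; the job name is a placeholder:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapreduce.Job

    val job = Job.getInstance(new Configuration(), "map-only-job")
    job.setNumReduceTasks(0)   // zero reducers: the mapper output is written straight to HDFS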
in a map-side join we store the small file in the distributed cache and read it in the mapper class before the records are processed. Where do you implement the logic to read the file stored in the distributed cache and store it into an associative array? combine map init setup
setup (setup() runs once per mapper before any map() calls, which is exactly "before the records are processed")
what will happen if the reducers are set to zero in the job configuration? - map only job takes place - reduce only job takes place - combiners output will be stored as final output in the hdfs - none of these
map only job takes place
which out of these is not optional in a MapReduce program? - mapper - combiner (optional) - reducer (optional) - none of these
mapper
which of these phases comes after the partitioner phase? -mapper phase - reducer phase - output - none
reducer
which of the following features of apache couchDB is an operation to avail extra disc space for the database by removing unused data? views replication ACID properties compaction
compaction (compaction rewrites the database file without old revisions and unused data, freeing disk space; replication copies databases between nodes)
In Spark Context, which all context are available by default (choose multiple)? - sparkcontext - sqlcontext - hivecontext - Streaming context
sparkContext, sqlContext, hiveContext
in scala which of the following would be used to specify a User Defined Function (UDF) that can be used in a SQL statement on Apache Spark DataFrames? registerUDF("func name", func def) sqlContext.udf(function definition) udf((arguments)=>{function definition}) sqlContext.udf.register("func name", func def")
sqlContext.udf.register("func name", func def) docs.databricks.com/spark/latest/spark-sql/udf-scala.html
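A short sketch of registering and calling a UDF, assuming the spark-shell's sqlContext; the function name and the "people" table are made up for illustration:

    sqlContext.udf.register("toUpper", (s: String) => s.toUpperCase)
    sqlContext.sql("SELECT toUpper(name) FROM people").show()   // the UDF can now be used in SQL statements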
the data types of the mapper for a product analytics job where the input key is month of data type IntWritable and input values represent items of data type Text, is configured at
the InputFormat used by the job determines the mapper's input key and value types
which of the following is true of running a Spark application on Hadoop YARN? - in Hadoop YARN mode, the RDDs and variables are always in the same memory space - there are two deploy modes that can be used to launch spark application on YARN- client mode and cluster mode - irrespective of the mode, the driver is launched in the client process that submitted the job - running in hadoop YARN has the advantage of having multiple users running the spark interactive shell
there are two deploy modes that can be used to launch spark application on YARN- client mode and cluster mode
when building a standalone application, you need to create the SparkContext. To do this in Scala, you would include which of the following within the main method? val conf=new SparkConf().setAppName("AuctionsApp") val sc= new SparkContext(conf) val sc = SparkContext().setAppName("AuctionsApp") val conf= SparkConf(sc) val sc=new SparkContext() val conf= new SparkConf().setAppName("AuctionsApp")
val conf=new SparkConf().setAppName("AuctionsApp") val sc= new SparkContext(conf)
which of the following files contains the configuration settings for the HDFS daemons? yarn-hdfs.xml hdfs-site.xml mapred-hdfs-site.xml all
hdfs-site.xml
what is RDD Lineage - it is a process that reconstructs lost data partitions - it gives information about the integrity of the input source - it gives information about the schema of the data - it is used for data replication
it is a process that reconstructs lost data partitions (lineage is the recorded graph of transformations that Spark replays to rebuild lost partitions)
why is partition used in hive? 1. hive distributes execution load horizontally 2. hadoop specific compression 3. join two tables 4. bucketing purpose
hive distributes execution load horizontally. A partition divides a table into parts based on the values of a particular column, storing the data in slices; queries that filter on the partition column read only the matching slices, so response time improves and the data becomes easier to manage. A partition is created when data is inserted into the table.
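A hedged HiveQL sketch issued through a HiveContext (as in the questions further down); the table and column names are hypothetical:

    hiveContext.sql(
      """CREATE TABLE IF NOT EXISTS sales (item STRING, amount DOUBLE)
        |PARTITIONED BY (sale_month STRING)""".stripMargin)
    // a query filtering on sale_month scans only the matching partition directories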
which of the following is true about a Spark application running on Hadoop YARN? - yarn provides the facility of having multiple users use the spark interactive shell simultaneously - in hadoop yarn mode, the RDDs and variables are always in the same memory space - yarn supports launching of the spark app in client mode and cluster mode deployment - yarn launches the driver in the client process where the job is submitted
yarn supports launching of the spark app in client mode and cluster mode deployment. Client mode and cluster mode are the two deploy modes for launching Spark applications on YARN: in cluster mode the driver runs inside an application master process managed by YARN on the cluster, and the client can go away after starting the application; in client mode the driver runs in the client process, and the application master is only used for requesting resources from YARN.
to launch a spark application in any one of the four modes (local, standalone, Mesos or YARN), use ___ all ./bin/spark-submit ./bin/SparkContext ./bin/submit-app
./bin/spark-submit
The HDFS block size configuration is 128MB in your hadoop cluster. The HDFS directory contains 50 small files each of 200 MB in size. How many map tasks will be created when the inputformat for your job is TextInputFormat? 100 128 50 200
100 (each 200 MB file spans two 128 MB blocks, so TextInputFormat creates two splits per file: 50 x 2 = 100 map tasks)
which of the following is not a type of counter in MapReduce? 1. job counters (is one) 2. user defined counters (is one) 3. filesystem counters (is one) 4. execute counter
4. execute counter
in a spark program, there are two datasets on weather. The first, the weather records dataset (D1), has the current temperature recordings of every major place in the country. The AverageTemperatureRecords dataset (D2) has the average temperature of each place, with the exception that not every place has a record in it. What would we use to get all the current temperatures and average temperatures of all the places? D1.join(D2) D2.join(D1) D1.leftOuterJoin(D2) D2.leftOuterJoin(D1)
D1.leftOuterJoin(D2) (an inner join such as D1.join(D2) drops the places that have no average-temperature record; the left outer join keeps every place from D1 and returns None where D2 has no match)
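A hedged sketch with made-up places and temperatures, where D1 holds the current temperatures and D2 the averages (missing one place):

    val D1 = sc.parallelize(Seq(("delhi", 41), ("pune", 33), ("shimla", 18)))
    val D2 = sc.parallelize(Seq(("delhi", 35), ("pune", 29)))

    // keeps every place from D1; places absent from D2 get None for the average
    D1.leftOuterJoin(D2).collect()
    // e.g. Array((delhi,(41,Some(35))), (pune,(33,Some(29))), (shimla,(18,None)))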
which of the following is a column-oriented database that runs on top of hdfs - hbase - hive - sqoop - flume
HBase
where does the output of a reducer get stored - hdfs - local system - both - none
HDFS
an instance of the spark sql execution engine that integrates with the data stored in Hive is _____ - sparkHiveConnector - HiveLoader - HiveContext - HiveSparkConnector
HiveContext
spark can integrate with which of the following data storage systems? - hive - google cloud - all - cassandra
all
the type(s) of operation(s) that can be performed on RDDs is/are ___ - all - map - action - transformation - action and map
all
what are the benefit(s) of using appropriate file formats in spark? - faster accessing during read and write - all - schema oriented - more compression support
all
what file systems does Apache Spark support? HDFS local system amazon S3 all
all
what kind of data can be handled by spark? semi-structured unstructured structured all
all
which of the following file formats is/are supported by spark? - json - sequence file -csv -all -parquet
all
what is replication factor? - replication factor controls how many times the Name node replicates the meta data - replication factor creates multiple copies of the same file to be served to the clients - replication factor controls how many times each individual block can be replicated - all
replication factor controls how many times each individual block can be replicated. Explanation: HDFS splits a file into blocks and stores each block on multiple DataNodes; the replication factor (3 by default) sets how many copies of every block exist, so the data survives node failures. It does not govern the NameNode's metadata.
Which of the following is true of caching the RDD?
all - rdd.persist(MEMORY_ONLY) is the same as rdd.cache() - use rdd.cache() to cache the RDD - cache behavior depends on available memory. If there is not enough memory, action will reload from the file instead of the cache
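A minimal sketch of the cache()/persist(MEMORY_ONLY) equivalence, reusing the sfpd.txt path from the questions below:

    import org.apache.spark.storage.StorageLevel

    val lines  = sc.textFile("/path to file/sfpd.txt")
    val cached = lines.persist(StorageLevel.MEMORY_ONLY)  // identical to lines.cache()
    cached.count()   // first action materializes the RDD and caches whatever partitions fit in memory
    cached.count()   // served from memory where cached; anything evicted or never cached is recomputed from the file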
spark can store its data in ___ - all - mongoDB - cassandra - hdfs
hdfs
the following are characteristics shared by hadoop and spark, except ____ - both use open source APIs to link between different tools - both are data processing platforms - both have their own file system - both are cluster computing environments
both have their own file system
choose the correct statement: - execution starts with the call of transformation - execution starts with the call of action - all the transformations and actions are lazily evaluated - none
execution starts with the call of action
spark is 100x faster than MapReduce due to development in scala
false
we can edit the data of rdd like conversion to uppercase
false
the client reading the data from HDFS fileSystem in Hadoop does which of the following? gets the data from the namenode gets both the data and block location from the namenode gets the block location from the datanode gets only the block locations from the namenode
gets only the block locations from the namenode
on what basis does partitioner groups the output and sent to the next stage? - on the basis of Key - on the basis of Value - both of these - none of these
on the basis of Key (the partitioner hashes the key to decide which reducer receives each record; the value is not consulted)
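A hedged sketch of a custom Hadoop partitioner to make the point concrete; the class name is made up:

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.Partitioner

    class KeyHashPartitioner extends Partitioner[Text, IntWritable] {
      // the reducer is chosen from the key alone; the value plays no part
      override def getPartition(key: Text, value: IntWritable, numReduceTasks: Int): Int =
        (key.hashCode & Integer.MAX_VALUE) % numReduceTasks
    }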
how is versioning information added to data in hbase? - VersionNo - Key Value - KeyNo - VersionValue
Key Value (each cell version is stored as a separate KeyValue with its own timestamp)
in reducers the input received after the sort and shuffle phase of the mapreduce will be
keys are presented to reducer in sorted order; values for given key are not sorted
in map reduce job in which point the reducer class reduce method will be invoked? - as soon as at least one mapper has finished processing its input split - as soon as a mapper has emitted at least one record - not until all mappers have finished processing all records - it depends on the inputFormat used for the job
not until all mappers have finished processing all records
find the appropriate Scala statement to load the Transaction file into an RDD. Assume that SparkContext is available as "scCntxt" and SQLContext as "sqlCntxt"? val inData=sqlContext.loadText("/path to file/txn.txt") val inData=sc.loadFile("/path to file/txn.txt") val inData=sc.textFile("/path to file/txn.txt") val inData=sc.loadText("/path to file/txn.txt")
val inData=sc.textFile("/path to file/txn.txt")
Which of the following Scala statements would be most appropriate to load the data (sfpd.txt) into an RDD? Assume that SparkContext is available as the variable sc and SQLContext as the variable sqlContext.
val sfpd = sc.textFile("/path to file/sfpd.txt")