Data Science - Big Data Assessment
which of these can be the purpose of counters?
all: the number of tasks that were launched and ran successfully; the correct number of bytes that were read and written; the number of mappers and reducers launched
which of the following will give min value of votes in the dataframe "df"? df.max() df.agg(min("votes")).show() df.agg().min("votes").show() none
df.agg(min("votes")).show()
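A minimal sketch of that option, assuming a toy DataFrame named df with a "votes" column (the data here is made up):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.min

    val spark = SparkSession.builder.appName("MinVotes").master("local[*]").getOrCreate()
    val df = spark.createDataFrame(Seq(("A", 10), ("B", 3), ("C", 7))).toDF("candidate", "votes")
    df.agg(min("votes")).show()   // prints a single row with min(votes) = 3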
can a Spark RDD be shared between SparkContexts?
false
is sql querying supported in HBase?
false
A DataFrame can be created from an existing RDD. You would create the DataFrame from the existing RDD by inferring schema using case classes in which one of the given cases? - if your dataset has more than 22 fields - if all your users are going to need the dataset parsed in the same way - if you have two sets of users who will need the text dataset parsed differently - none of these
if all your users are going to need the dataset parsed in the same way. The programmatic-schema approach is the one meant for the other situations: the Spark SQL guide says to specify the schema programmatically "when case classes cannot be defined ahead of time (for example ... a text dataset will be parsed and fields will be projected differently for different users)". Spark SQL and DataFrames - Spark 2.3.1 Documentation (spark.apache.org/docs/2.3.1/sql-programming-guide.html), ctrl+f 'parsed'
below are the steps given to compute the number of lines which contain the word "ADMIN": val inRDD = sc.textFile("/user/user01/data/weatherdat.csv"); val tempRD = inRDD.map(line => line.split("|")); val tempRDD = inRDD.filter(line => line.contains("ADMIN")); val tempCount = tempRDD.count(). Choose the step at which inRDD will be computed.
inRDD is computed and the data loaded only when count() is applied, because "The transformations are only computed when an action requires a result to be returned to the driver program." (https://spark.apache.org/docs/latest/rdd-programming-guide.html)
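A small sketch of that lazy-evaluation behaviour, assuming sc is the SparkContext available in spark-shell (the path and search word are taken from the question):

    val inRDD   = sc.textFile("/user/user01/data/weatherdat.csv")  // nothing is read yet, only lineage is recorded
    val tempRDD = inRDD.filter(line => line.contains("ADMIN"))     // still lazy: just another node in the lineage graph
    val tempCount = tempRDD.count()                                // action: the file is actually read and the pipeline runs here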
which are the options available in spark to save the data to disk only, where the DataFrame object is referred to as 'out'?
out.persist(StorageLevel.DISK_ONLY)
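A minimal sketch, assuming out is an existing DataFrame as in the question:

    import org.apache.spark.storage.StorageLevel

    out.persist(StorageLevel.DISK_ONLY)  // keep the partitions on disk only, not in memory
    // out.cache() would default to MEMORY_AND_DISK for DataFrames, so DISK_ONLY has to be passed explicitly
    out.unpersist()                      // release the disk copies when they are no longer needed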
which of the following tools defines a data flow language? - hive - HBase - MapReduce - Pig
pig
what type of database is HBase? schema-rigid schema-flexi schema-less not a db
schema-less
which is the default serde in Hive? - DynamicSerDe - ThriftSerDe - metadataTypedColumnsetSerDe - LazySimpleSerDe
LazySimpleSerDe (org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe is Hive's built-in default SerDe for delimited text tables)
RDDs can also be unpersisted to remove RDD from permanent storage like memory and/or disk
true
spark supports loading data from HBase
true
zero reducers are allowed in map reduce
true
Why is Serde used in hive? - used for serialization and deserialization - compress the file - read and write files into hdfs - lazyloading of files into cache
used for serialization and deserialization (the SerDe tells Hive how to read rows from and write rows to files in HDFS)
An election RDD is provided with tuples (candidate, count). Which of the below Scala snippets is used to get the candidate with the lowest number of votes? val low = electionRDD.sortByKey().first val low = electionRDD.sortByKey(false).first (not this one, goes in descending order) val low = electionRDD.map(x=>(x._2, x._1)).sortByKey().first val low = electionRDD.map(x=>(x._2, x._1)).sortByKey(false).first (not this one)
val low = electionRDD.map(x=>(x._2, x._1)).sortByKey().first (the key of the original tuple is the candidate name, so sortByKey alone would sort alphabetically; the counts have to be swapped into key position before sorting)
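A hedged sketch with made-up vote counts, assuming electionRDD is an RDD[(String, Int)] of (candidate, count):

    val electionRDD = sc.parallelize(Seq(("alice", 120), ("bob", 45), ("carol", 300)))

    // swap to (count, candidate) so sortByKey orders by vote count, then take the first element
    val low = electionRDD.map(x => (x._2, x._1)).sortByKey().first   // (45, bob)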
partitioner is used in a mapreduce job - when there are multiple reducers - when keys have to be grouped and sent to particular mapper - when keys have to be grouped and send to particular reducer - none
when there are multiple reducers
which of the following languages is used by apache CouchDB for storing the data? javascript json csv xml
json (CouchDB stores documents as JSON; JavaScript is used for views and queries, not for storage)
is relational join on two large tables on common key possible in mapreduce job?
yes
choose the correct statement about RDD - none -rdd is a programming paradigm - rdd is a distributed data structure - rdd is a database
rdd is a distributed data structure (a Resilient Distributed Dataset is an immutable, partitioned collection of records, not a paradigm or a database)
in what year was apache spark made an open-source technology?
2010
you have the following key-value pairs as output from your map test (the,1) (fox, 1) (faster,1) (than,1) (the,1) (cat,1) how many keys will be passed to the reducer? -3 -4 -5 -6
5 (the distinct keys are the, fox, faster, than, cat; the two (the,1) pairs are grouped under a single key)
which is the default database for storing metadata of hive tables? - MySQL - derby - oracle - none of these
Derby
Apache CouchDB is written in ____ programming language java javascript erlang python
Erlang
By default, Spark uses which algorithm to remove old and unused RDD to release more memory?
Least Recently Used (LRU)
which is the default storage level in spark? MEMORY_ONLY MEMORY_ONLY_SER MEMORY_AND_DISK MEMORY_AND_DISK_SER
MEMORY_ONLY
The programming paradigm used in Spark is __________. - none of the options - generalized - map only - mapreduce
MapReduce
what is the input format in mapreduce job to read multiple lines at a time - TextInputFormat - MultiLineInputFormat - NLineTextInputFormat - NLineInputFormat
NLineInputFormat (TextInputFormat hands each mapper a whole split one line per record; NLineInputFormat splits the input so that each mapper receives N lines)
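A hedged sketch of configuring it from Scala with the Hadoop API; the job name is a placeholder and the value 5 is arbitrary:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat

    val job = Job.getInstance(new Configuration(), "nline-demo")
    job.setInputFormatClass(classOf[NLineInputFormat])
    NLineInputFormat.setNumLinesPerSplit(job, 5)   // each input split, and so each mapper, gets 5 lines at a time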
which of these are responsible to output key-value pairs from the Reducer phase to output files? reducer Writer RecordReader RecordWriter none of these
RecordWriter
____ protocol is used for API to access the documents SMTP HTTP IP SSH
HTTP (CouchDB exposes its documents through a RESTful HTTP API)
which tells Spark how and where to access a cluster? Spark Streaming Spark Context Spark Core Spark Session
Spark Context
Yarn architecture, which replaces the classic MR1 architecture, removes which one? - single point of failure in the Namenode - resource pressure on the JobTracker - HDFS latency - ability to run frameworks other than MapReduce, such as MPI
resource pressure on the JobTracker (YARN splits the JobTracker's duties between the ResourceManager and per-application ApplicationMasters; running non-MapReduce frameworks such as MPI is something YARN adds rather than removes, and it does not address the Namenode or HDFS latency)
Oozie workflow jobs are Directed _____ graphs of actions? - acyclical -cyclical -elliptical - all the above
acyclical
rdd is ___ - recomputable - fault-tolerant -all - immutable
all
how would you set the number of executors in any spark based application, assume we need to set 12 executors? --num-executors 12 can't change, it is always fixed conf.set(12) none of the above
--num-executors 12 (passed to spark-submit; the same setting is available as spark.executor.instances in the configuration)
which of the following will return all the elements of the dataset as an array at the driver program? - first() - head() - collect() - none of the above
collect()
In a MapReduce program, which interface should a combiner class implement when the input is IntWritable keys and Text values, and it emits IntWritable keys and IntWritable values? - combiner - mapper - reducer - reducer<IntWritable, Text, IntWritable, IntWritable>
reducer<IntWritable, Text, IntWritable, IntWritable> (Hadoop has no Combiner interface; a combiner is a Reducer whose input types match the mapper's output types)
which of the following application types can Spark run in addition to batch-processing jobs? all graph processing machine learning stream processing
all (Spark ships GraphX for graph processing, MLlib for machine learning, and Spark Streaming for stream processing)
how can you set the reduce phase in a MapReduce program to zero?
job.setNumReduceTasks(0)
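A sketch using the Hadoop Java API from Scala; the job name is a placeholder:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapreduce.Job

    val job = Job.getInstance(new Configuration(), "map-only-job")
    job.setNumReduceTasks(0)   // zero reducers: the mapper output is written straight to HDFS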
in a map-side join we store the small file in the distributed cache and read it in the mapper class before the records are processed. Where do you implement the logic to read the file stored in the distributed cache and store it into an associative array? combine map init setup
setup (setup() runs once per mapper before any map() calls, which is exactly "before the records are processed")
what will happen if the reducers are set to zero in the job configuration? - map only job takes place - reduce only job takes place - combiners output will be stored as final output in the hdfs - none of these
map only job takes place
which out of these is not optional in a MapReduce program? - mapper - combiner (optional) - reducer (optional) - none of these
mapper
which of these phases comes after the partitioner phase? -mapper phase - reducer phase - output - none
reducer
which of the following features of apache couchDB is an operation to avail extra disc space for the database by removing unused data? views replication ACID properties compaction
compaction (compaction rewrites the database file without old revisions and unused data, freeing disk space; replication copies databases between nodes)
In Spark Context, which all context are available by default (choose multiple)? - sparkcontext - sqlcontext - hivecontext - Streaming context
sparkContext, sqlContext, hiveContext
in scala which of the following would be used to specify a User Defined Function (UDF) that can be used in a SQL statement on Apache Spark DataFrames? registerUDF("func name", func def) sqlContext.udf(function definition) udf((arguments)=>{function definition}) sqlContext.udf.register("func name", func def")
sqlContext.udf.register("func name", func def) docs.databricks.com/spark/latest/spark-sql/udf-scala.html
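A short sketch of registering and calling a UDF, assuming the spark-shell's sqlContext; the function name and the "people" table are made up for illustration:

    sqlContext.udf.register("toUpper", (s: String) => s.toUpperCase)
    sqlContext.sql("SELECT toUpper(name) FROM people").show()   // the UDF can now be used in SQL statements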
the data types of the mapper for a product analytics job where the input key is month of data type IntWritable and input values represent items of data type Text, is configured at
the InputFormat used by the job determines the mapper's input key and value types
which of the following is true of running a Spark application on Hadoop YARN? - in Hadoop YARN mode, the RDDs and variables are always in the same memory space - there are two deploy modes that can be used to launch spark application on YARN- client mode and cluster mode - irrespective of the mode, the driver is launched in the client process that submitted the job - running in hadoop YARN has the advantage of having multiple users running the spark interactive shell
there are two deploy modes that can be used to launch spark application on YARN- client mode and cluster mode
when building a standalone application, you need to create the SparkContext. To do this in Scala, you would include which of the following within the main method? val conf=new SparkConf().setAppName("AuctionsApp") val sc= new SparkContext(conf) val sc = SparkContext().setAppName("AuctionsApp") val conf= SparkConf(sc) val sc=new SparkContext() val conf= new SparkConf().setAppName("AuctionsApp")
val conf=new SparkConf().setAppName("AuctionsApp") val sc= new SparkContext(conf)
which of the following files contains the configuration settings for the HDFS daemons? yarn-hdfs.xml hdfs-site.xml mapred-hdfs-site.xml all
hdfs-site.xml
what is RDD Lineage - it is a process that reconstructs lost data partitions - it gives information about the integrity of the input source - it gives information about the schema of the data - it is used for data replication
it is a process that reconstructs lost data partitions (lineage is the recorded graph of transformations that Spark replays to rebuild lost partitions)
why is partition used in hive? 1. hive distributes execution load horizontally 2. hadoop specific compression 3. join two tables 4. bucketing purpose
hive distributes execution load horizontally. A partition divides a table into parts based on the values of a particular column, storing the data in slices; queries that filter on the partition column read only the matching slices, so response time improves and the data becomes easier to manage. A partition is created when data is inserted into the table.
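A hedged HiveQL sketch issued through a HiveContext (as in the questions further down); the table and column names are hypothetical:

    hiveContext.sql(
      """CREATE TABLE IF NOT EXISTS sales (item STRING, amount DOUBLE)
        |PARTITIONED BY (sale_month STRING)""".stripMargin)
    // a query filtering on sale_month scans only the matching partition directories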
which of the following is true about a Spark application running on Hadoop YARN? - yarn provides the facility of having multiple users use the spark interactive shell simultaneously - in hadoop yarn mode, the RDDs and variables are always in the same memory space - yarn supports launching of the spark app in client mode and cluster mode deployment - yarn launches the driver in the client process where the job is submitted
yarn supports launching of the spark app in client mode and cluster mode deployment. Client mode and cluster mode are the two deploy modes for launching Spark applications on YARN: in cluster mode the driver runs inside an application master process managed by YARN on the cluster, and the client can go away after starting the application; in client mode the driver runs in the client process, and the application master is only used for requesting resources from YARN.
to launch a spark application in any one of the four modes (local, standalone, Mesos or YARN), use ___ all ./bin/spark-submit ./bin/SparkContext ./bin/submit-app
./bin/spark-submit
The HDFS block size configuration is 128MB in your hadoop cluster. The HDFS directory contains 50 small files each of 200 MB in size. How many map tasks will be created when the inputformat for your job is TextInputFormat? 100 128 50 200
100 (each 200 MB file spans two 128 MB blocks, so TextInputFormat creates two splits per file: 50 x 2 = 100 map tasks)
which of the following is not a type of counter in MapReduce? 1. job counters (is one) 2. user defined counters (is one) 3. filesystem counters (is one) 4. execute counter
4. execute counter
in a spark program, there are two datasets on weather. The first, the weather records dataset (D1), has the current temperature recordings of every major place in the country. The AverageTemperatureRecords dataset (D2) has the average temperature of each place, with the exception that not every place has a record in it. What would we use to get all the current temperatures and average temperatures of all the places? D1.join(D2) D2.join(D1) D1.leftOuterJoin(D2) D2.leftOuterJoin(D1)
D1.leftOuterJoin(D2) (an inner join such as D1.join(D2) drops the places that have no average-temperature record; the left outer join keeps every place from D1 and returns None where D2 has no match)
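A hedged sketch with made-up places and temperatures, where D1 holds the current temperatures and D2 the averages (missing one place):

    val D1 = sc.parallelize(Seq(("delhi", 41), ("pune", 33), ("shimla", 18)))
    val D2 = sc.parallelize(Seq(("delhi", 35), ("pune", 29)))

    // keeps every place from D1; places absent from D2 get None for the average
    D1.leftOuterJoin(D2).collect()
    // e.g. Array((delhi,(41,Some(35))), (pune,(33,Some(29))), (shimla,(18,None)))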
which of the following is a column-oriented database that runs on top of hdfs - hbase - hive - sqoop - flume
HBase
where does the output of a reducer get stored - hdfs - local system - both - none
HDFS
an instance of the spark sql execution engine that integrates with the data stored in Hive is _____ - sparkHiveConnector - HiveLoader - HiveContext - HiveSparkConnector
HiveContext
spark can integrate with which of the following data storage systems? - hive - google cloud - all - cassandra
all
the type(s) of operation(s) that can be performed on RDDs is/are ___ - all - map - action - transformation - action and map
all
what are the benefit(s) of using appropriate file formats in spark? - faster accessing during read and write - all - schema oriented - more compression support
all
what file systems does Apache Spark support? HDFS local system amazon S3 all
all
what kind of data can be handled by spark? semi-structured unstructured structured all
all
which of the following file formats is/are supported by spark? - json - sequence file -csv -all -parquet
all
what is replication factor? - replication factor controls how many times the Name node replicates the meta data - replication factor creates multiple copies of the same file to be served to the clients - replication factor controls how many times each individual block can be replicated - all
replication factor controls how many times each individual block can be replicated. Explanation: HDFS splits a file into blocks and stores each block on multiple DataNodes; the replication factor (3 by default) sets how many copies of every block exist, so the data survives node failures. It does not govern the NameNode's metadata.
Which of the following is true of caching the RDD?
all - rdd.persist(MEMORY_ONLY) is the same as rdd.cache() - use rdd.cache() to cache the RDD - cache behavior depends on available memory. If there is not enough memory, action will reload from the file instead of the cache
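A minimal sketch of the cache()/persist(MEMORY_ONLY) equivalence, reusing the sfpd.txt path from the questions below:

    import org.apache.spark.storage.StorageLevel

    val lines  = sc.textFile("/path to file/sfpd.txt")
    val cached = lines.persist(StorageLevel.MEMORY_ONLY)  // identical to lines.cache()
    cached.count()   // first action materializes the RDD and caches whatever partitions fit in memory
    cached.count()   // served from memory where cached; anything evicted or never cached is recomputed from the file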
spark can store its data in ___ - all - mongoDB - cassandra - hdfs
hdfs
the following are characteristics shared by hadoop and spark, except ____ - both use open source APIs to link between different tools - both are data processing platforms - both have their own file system - both are cluster computing environments
both have their own file system
choose the correct statement: - execution starts with the call of transformation - execution starts with the call of action - all the transformations and actions are lazily evaluated - none
execution starts with the call of action
spark is 100x faster than MapReduce due to development in scala
false
we can edit the data of rdd like conversion to uppercase
false
the client reading the data from HDFS fileSystem in Hadoop does which of the following? gets the data from the namenode gets both the data and block location from the namenode gets the block location from the datanode gets only the block locations from the namenode
gets only the block locations from the namenode
on what basis does partitioner groups the output and sent to the next stage? - on the basis of Key - on the basis of Value - both of these - none of these
on the basis of Key (the partitioner hashes the key to decide which reducer receives each record; the value is not consulted)
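A hedged sketch of a custom Hadoop partitioner to make the point concrete; the class name is made up:

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.Partitioner

    class KeyHashPartitioner extends Partitioner[Text, IntWritable] {
      // the reducer is chosen from the key alone; the value plays no part
      override def getPartition(key: Text, value: IntWritable, numReduceTasks: Int): Int =
        (key.hashCode & Integer.MAX_VALUE) % numReduceTasks
    }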
how is versioning information added to data in hbase? - VersionNo - Key Value - KeyNo - VersionValue
Key Value (each cell version is stored as a separate KeyValue with its own timestamp)
in reducers the input received after the sort and shuffle phase of the mapreduce will be
keys are presented to reducer in sorted order; values for given key are not sorted
in map reduce job in which point the reducer class reduce method will be invoked? - as soon as at least one mapper has finished processing its input split - as soon as a mapper has emitted at least one record - not until all mappers have finished processing all records - it depends on the inputFormat used for the job
not until all mappers have finished processing all records
find the appropriate Scala statement to load the Transaction file into an RDD. Assume that SparkContext is available as "scCntxt" and SQLContext as "sqlCntxt"? val inData=sqlContext.loadText("/path to file/txn.txt") val inData=sc.loadFile("/path to file/txn.txt") val inData=sc.textFile("/path to file/txn.txt") val inData=sc.loadText("/path to file/txn.txt")
val inData=sc.textFile("/path to file/txn.txt")
Which of the following Scala statements would be most appropriate to load the data (sfpd.txt) into an RDD? Assume that SparkContext is available as the variable sc and SQLContext as the variable sqlContext.
val sfpd = sc.textFile("/path to file/sfpd.txt")