COSC 3365 - Homework 9
Spark cannot infer the DataFrame schema from the following type of file, even when the file is properly formed:
.xlsx
Which of the following is not true for DataFrames?
Are very similar to document-based databases
Column operations in Spark include:
Arithmetic operations
The Catalyst Optimizer is an extensible framework for SparkSQL which optimizes query results in a Dataset or DataFrame.
False
Column operations in Spark include arithmetic operations, but not data testing operations or sorting functions.
False
The company CrowdStrike uses Apache Spark to store data in the cloud.
False
Datasets are a collection of objects, whose type is defined at the time of execution.
False
GraphX is a framework used in Spark to make visualizations of the data.
False
In DataFrames, each row can contain a collection of values of specific types such as integers, floats, and strings, but not arrays or lists.
False
Spark Streaming is a Spark API that can be used to create applications that stream data into the cloud.
False
Spark is written in Java and can be used by coding in Scala or Python.
False
What does the "left_outer" join type do in the command customersDF.join(zipCodeDF, customersDF("City") === zipCodeDF("City"), "left_outer").show()?
Lists all customers and adds the zip code to those customers whose City has a match in zipCodeDF; customers without a match get a null zip code
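The left outer join behavior can be sketched in plain Python, without Spark. The sample customers and zip codes are made up, and `left_outer_join` is a hypothetical helper rather than a Spark function. As a simplification, the sketch merges the two City columns into one, whereas a Spark join on an expression keeps both.

```python
def left_outer_join(left_rows, right_rows, key):
    """Keep every left row; attach matching right values, or None when nothing matches."""
    # Columns contributed by the right side (everything except the join key).
    right_columns = sorted({col for row in right_rows for col in row if col != key})
    joined = []
    for left in left_rows:
        # A null (None) key never matches anything, mirroring SQL join semantics.
        matches = [r for r in right_rows if left[key] is not None and r[key] == left[key]]
        if matches:
            for right in matches:
                merged = dict(left)
                merged.update({c: right.get(c) for c in right_columns})
                joined.append(merged)
        else:
            # No match: the left row survives with None in the right-side columns.
            merged = dict(left)
            merged.update({c: None for c in right_columns})
            joined.append(merged)
    return joined


# Made-up sample data for illustration only.
customers = [
    {"Name": "Ana", "City": "Houston"},
    {"Name": "Ben", "City": "Austin"},
    {"Name": "Cy", "City": None},
]
zip_codes = [{"City": "Houston", "Zip": "77002"}]

result = left_outer_join(customers, zip_codes, "City")
for row in result:
    print(row)
```

All three customers appear in the result; only the customer whose City is found in the zip-code data receives a non-null Zip.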
Which of the following is not a main component in Big Data Analysis?
MATLAB
By using DataFrame actions you cannot:
Select columns dynamically according to a condition
What is the result of submitting the following command: dataframe.select("col1", "col3").where("col1 > 0").take(3).show()?
Shows col1 and col3 of the first 3 rows where col1 is positive
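The select, where, and take steps can be modeled over a plain Python list of dictionaries. This is a sketch with made-up rows, and `select`, `where`, and `take` are hypothetical stand-ins written for illustration, not the Spark methods. In Spark itself, `take(n)` returns the rows to the driver as a local array or list rather than a DataFrame, so the sketch returns a plain list as well.

```python
# Made-up sample rows for illustration only.
rows = [
    {"col1": 5, "col2": "a", "col3": 1.0},
    {"col1": -2, "col2": "b", "col3": 2.0},
    {"col1": 7, "col2": "c", "col3": 3.0},
    {"col1": 0, "col2": "d", "col3": 4.0},
    {"col1": 9, "col2": "e", "col3": 5.0},
    {"col1": 4, "col2": "f", "col3": 6.0},
]


def select(data, *columns):
    """Project each row onto the named columns."""
    return [{c: row[c] for c in columns} for row in data]


def where(data, predicate):
    """Keep only the rows for which the predicate is true."""
    return [row for row in data if predicate(row)]


def take(data, n):
    """Return the first n rows as a plain list."""
    return data[:n]


first_three = take(where(select(rows, "col1", "col3"), lambda r: r["col1"] > 0), 3)
for row in first_three:
    print(row)
```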
Which of the following commands cannot be used to get the first row of the DataFrame?
show(1)
Which of the following is not a Spark API?
Spark Core
Which of the following is the main entry point of Spark for working with structured data?
SparkSQL
DataFrame actions generate a new output, whereas transformations produce a new DataFrame from an existing one.
True
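The difference between transformations and actions can be illustrated with a toy model in plain Python. `LazyFrame` is a hypothetical class invented for this sketch and is not part of Spark; it only assumes that transformations record steps and return a new frame, while actions run the recorded steps and return ordinary values.

```python
class LazyFrame:
    """Toy model: transformations record steps, actions execute them."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    # Transformations: return a new LazyFrame and do no work yet.
    def where(self, predicate):
        return LazyFrame(self._rows, self._steps + (("where", predicate),))

    def select(self, *columns):
        return LazyFrame(self._rows, self._steps + (("select", columns),))

    # Actions: run the recorded steps and return plain Python values.
    def collect(self):
        data = list(self._rows)
        for kind, arg in self._steps:
            if kind == "where":
                data = [row for row in data if arg(row)]
            else:
                data = [{c: row[c] for c in arg} for row in data]
        return data

    def count(self):
        return len(self.collect())


calls = []


def is_adult(row):
    calls.append(row["name"])  # record each evaluation of the predicate
    return row["age"] >= 18


# Made-up sample data for illustration only.
people = LazyFrame([{"name": "Ana", "age": 30}, {"name": "Ben", "age": 12}])
adults = people.where(is_adult).select("name")  # transformations only
calls_before_action = len(calls)                # nothing has run yet
names = adults.collect()                        # the action triggers the work
print(calls_before_action, names)
```

In this model, `where` and `select` return another `LazyFrame`, while `collect` and `count` return a list and an integer.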
In Python, there is an additional way to select the columns of a DataFrame: accessing them like the attributes of an object.
True
In Python, DataFrame columns may be referenced in the same way as the fields of an object, like this: DF.field
True
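Attribute-style column access can be mimicked in plain Python with `__getattr__`. `TinyFrame` below is a hypothetical class written only to illustrate the idea; it does not reproduce how Spark represents columns.

```python
class TinyFrame:
    """Toy container that exposes its columns both as frame["name"] and frame.name."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, name):
        # Bracket access: frame["field"]
        return self._columns[name]

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so real attributes win.
        try:
            return self.__dict__["_columns"][name]
        except KeyError:
            raise AttributeError(name) from None


# Made-up sample data for illustration only.
DF = TinyFrame({"field": [1, 2, 3], "other": ["a", "b", "c"]})
print(DF.field)
print(DF["field"])
```

Both forms return the same column here. Bracket access is the safer choice when a column name collides with an existing attribute or method name.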
In Python, you cannot use Datasets, because Python is a dynamically typed language.
True
In Spark, you can generate a DataFrame from a custom list and then rename all the columns to define your own schema.
True
Scala is a functional programming language that runs on the Java Virtual Machine.
True
Spark applications can perform large-scale data processing such as extract, transform, and load (ETL).
True
Spark has been gaining ground on MapReduce because of its faster processing, lower latency, and data streaming abilities.
True
Spark MLlib is a machine learning library for building Big Data machine learning applications on Spark.
True
The Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, and interacting with storage systems.
True
What is the correct command to read all json files in a given directory?
spark.read.json("directory/*.json")
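The effect of the `*.json` wildcard can be sketched with the Python standard library alone. `read_json_dir` is a hypothetical helper, and the files are temporary ones created just for the demonstration. The sketch assumes one JSON object per line, which is the layout Spark's JSON reader expects by default.

```python
import glob
import json
import os
import tempfile


def read_json_dir(pattern):
    """Load every line-delimited JSON record from the files matching the pattern."""
    records = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines
                    records.append(json.loads(line))
    return records


with tempfile.TemporaryDirectory() as directory:
    # Made-up files: two JSON files and one file the wildcard should ignore.
    samples = [("a.json", {"id": 1}), ("b.json", {"id": 2}), ("notes.txt", {"id": 99})]
    for name, payload in samples:
        with open(os.path.join(directory, name), "w", encoding="utf-8") as handle:
            handle.write(json.dumps(payload) + "\n")

    loaded = read_json_dir(os.path.join(directory, "*.json"))

print(loaded)
```

Only the files whose names end in `.json` contribute records; `notes.txt` is skipped by the pattern.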
Which of the following is not a DataFrame transformation?
collect