Databricks Certified Associate Developer for Apache Spark 3.0 - Python
What is a shuffle?
A shuffle is the process by which data is redistributed across partitions, for example so that rows with the same key end up in the same partition.
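For example (using the storesDF DataFrame from the questions below), a narrow transformation avoids a shuffle while a wide one forces it:
storesDF.filter("sqft > 10000")          # narrow: no shuffle
storesDF.groupBy("division").count()     # wide: rows are redistributed across partitions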
Which of the following is the default storage level for persist() for a non-streaming dataframe/dataset? A) MEMORY_AND_DISK B) MEMORY_AND_DISK_SER C) DISK_ONLY D) MEMORY_ONLY_SER E) MEMORY_ONLY
A) MEMORY_AND_DISK
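A quick way to verify, assuming an existing DataFrame df:
df.persist()       # no argument defaults to StorageLevel.MEMORY_AND_DISK
df.storageLevel    # returns the storage level currently set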
Which of the following operations can be used to sort the rows of a DataFrame? A) sort() and orderBy() B) orderby() C) sort() and orderby() D) orderBy() E) sort()
A) sort() and orderBy()
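The two are aliases in PySpark; a minimal sketch with storesDF:
from pyspark.sql.functions import col
storesDF.orderBy("sqft")            # ascending by default
storesDF.sort(col("sqft").desc())   # identical behavior to orderBy()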
Which of the following code blocks will not always return the exact number of distinct values in column division? A) storesDF.agg(approx_count_distinct(col("division")).alias("divisionDistinct")) B) storesDF.agg(approx_count_distinct(col("division"), 0).alias("divisionDistinct")) C) storesDF.agg(countDistinct(col("division")).alias("divisionDistinct")) D) storesDF.select("division").dropDuplicates().count() E) storesDF.select("division").distinct().count()
A) storesDF.agg(approx_count_distinct(col("division")).alias("divisionDistinct"))
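A sketch of the difference, using storesDF:
from pyspark.sql.functions import approx_count_distinct, countDistinct, col
storesDF.agg(countDistinct(col("division")).alias("divisionDistinct"))          # always exact
storesDF.agg(approx_count_distinct(col("division")).alias("divisionDistinct"))  # approximate; default max estimation error rsd = 0.05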
What is an out-of-memory error in Spark?
An out-of-memory error occurs when either the driver or an executor does not have enough memory to collect or process the data allocated to it.
Which of the following operations can be used to return the top n rows from a DataFrame? A) DataFrame.n() B) DataFrame.take(n) C) DataFrame.head D) DataFrame.show(n) E) DataFrame.collect(n)
B) DataFrame.take(n)
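For example:
storesDF.take(3)    # returns the first 3 rows as a list of Row objects
storesDF.show(3)    # by contrast, only prints 3 rows and returns None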
Which of the following operations can be used to split an array column into an individual DataFrame row for each element in the array? A) extract() B) split() C) explode() D) arrays_zip() E) unpack()
C) explode()
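A minimal sketch, assuming a hypothetical DataFrame df with an array column 'items':
from pyspark.sql.functions import explode, col
df.select(col("storeId"), explode(col("items")).alias("item"))   # one output row per array element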
Which of the following DF operations is always classified as a narrow transformation? A) df.sort() B) df.distinct() C) df.repartition() D) df.select() E) df.join()
D) df.select()
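For contrast:
storesDF.select("storeId")    # narrow: each input partition produces exactly one output partition
storesDF.distinct()           # wide: deduplication requires a shuffle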
Which of the following operations is most likely to result in a skew in the size of your data's partitions? A) df.collect() B) df.cache() C) df.repartition(n) D) df.coalesce(n) E) df.persist()
D) df.coalesce(n)
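The reason, sketched with storesDF:
storesDF.coalesce(4)      # merges existing partitions without a shuffle, so sizes can become uneven
storesDF.repartition(4)   # performs a full shuffle, producing roughly even partitions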
Which of the following code blocks returns the number of rows in DF 'storesDF'? A. storesDF.withColumn("numberOfRows", count()) B. storesDF.withColumn(count().alias("numberOfRows")) C. storesDF.countDistinct() D. storesDF.count() E. storesDF.agg(count())
D. storesDF.count()
How do you print the schema of a DataFrame?
DataFrame.printSchema()
Which of the following operations fails to return a DataFrame where every row is unique? A) DataFrame.distinct() B) DataFrame.drop_duplicates(subset = None) C) DataFrame.drop_duplicates() D) DataFrame.dropDuplicates() E) DataFrame.drop_duplicates(subset = "all")
E) DataFrame.drop_duplicates(subset = "all")
Which of the following operations will trigger evaluation? A) df.filter() B) df.distinct() C) df.intersect() D) df.join() E) df.count()
E) df.count()
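For example:
filtered = storesDF.filter("sqft > 10000")   # transformation: only builds the query plan
filtered.count()                             # action: triggers a Spark job and returns a result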
If you have a DF with more partitions than you have (single-core) executors, what happens?
Performance will be suboptimal because not all of the data can be processed at the same time; shuffle operations will create a large number of connections; there is increased overhead in managing resources for each data-processing task; and there is an increased risk of out-of-memory errors, depending on the size of the executors.
What data structures are Spark DataFrames built on top of?
RDDs (resilient distributed datasets)
What are slots?
Slots are resources for parallelization within a Spark application.
Spark has a few different execution/deployment modes: cluster, client, and local. What is a Spark execution/deployment mode?
Spark's execution/deployment mode determines where the driver and executors are physically located when a Spark application is run.
What is a group of tasks that can be executed in parallel to compute the same set of operations on potentially multiple machines?
Stage
What is a combination of a block of data and a set of transformations that runs on a single executor?
Task
The code block below contains an error. It is intended to return a DF containing only the rows from DF 'storesDF' where the value in col 'sqft' is <= 25000. storesDF.filter(sqft <= 25000)
The column name sqft needs to be quoted and wrapped in the col() function like storesDF.filter(col("sqft") <= 25000).
The code block contains an error. The code block is intended to return a 15% sample of rows from DF 'storesDF' without replacement. Identify the error. storesDF.sample(True, fraction = 0.15)
The first argument, True, sets the sampling to be with replacement; it should be False, or simply omitted, since without replacement is the default: storesDF.sample(fraction = 0.15).
The code block shown below contains an error. The code block is intended to return a new DataFrame where column division from DataFrame storesDF has been renamed to column state and column managerName from DataFrame storesDF has been renamed to column managerFullName. Identify the error. Code block: (storesDF.withColumnRenamed("state", "division") .withColumnRenamed("managerFullName", "managerName"))
The first argument to operation withColumnRenamed() should be the old column name and the second argument should be the new column name.
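Corrected code block:
(storesDF.withColumnRenamed("division", "state")
    .withColumnRenamed("managerName", "managerFullName"))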
What is a Spark Driver?
The Spark driver is the node on which the Spark application's main method runs to coordinate the application. It contains the SparkContext object and, in cluster mode, is responsible for scheduling the execution of tasks across the worker nodes.
What is the difference between transformations and actions?
Transformations are business-logic operations that do not trigger execution, while actions are operations that trigger execution in order to return results.
What are worker nodes in cluster-mode Spark?
Worker nodes are machines that host the executors responsible for the execution of tasks.
Which operation is used to convert a DF column from one type to another?
col().cast()
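For example, casting column 'sqft' in storesDF to an integer:
from pyspark.sql.functions import col
storesDF.withColumn("sqft", col("sqft").cast("integer"))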
What is the code block which returns a DF containing summary statistics only for column 'sqft' in DF 'storesDF'?
storesDF.describe("sqft")
What is the code block which returns the sum of values in column 'sqft' in DF 'storesDF' grouped by distinct values in col 'division'?
storesDF.groupBy("division").agg(sum(col("sqft")))
What is the code that returns a DataFrame where rows in DataFrame "storesDF" containing missing values in every column have been dropped?
storesDF.na.drop("all")
What is the code block needed to return a dataframe containing only column 'storeId' and column 'division' from a dataframe called 'storesDF'?
storesDF.select("storeId", "division")
What is the code that returns a new DF from a DF 'storesDF' where column 'numberOfManagers' is the constant integer 1?
storesDF.withColumn("numberOfManagers", lit(1))
What is the code that returns a new DF with a new column 'sqft100' that is 1/100th of a column 'sqft' in a DF 'storesDF'? Note: col 'sqft100' is not in the original DF.
storesDF.withColumn("sqft100", col("sqft") / 100)
What code returns a new DataFrame where column "storeCategory" is an all-lowercase version of column "storeCategory" in DataFrame "storesDF"?
storesDF.withColumn("storeCategory", lower(col("storeCategory")))
Fill in the blanks in the block below to return a new DF with the mean of column 'sqft' from DF 'storesDF' in col 'sqftMean'. storesDF.__1__(__2__(__3__).alias("sqftMean"))
1 - agg 2 - mean 3 - col("sqft")
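Assembled: storesDF.agg(mean(col("sqft")).alias("sqftMean")) (with mean and col imported from pyspark.sql.functions)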
In what order should the below lines of code be run in order to create and register a SQL UDF named "ASSESS_PERFORMANCE" using the Python function 'assessPerformance' and apply it to column 'customerSatisfaction' in table 'stores'? Lines of code: 1. spark.udf.register("ASSESS_PERFORMANCE", assessPerformance) 2. spark.sql("SELECT customerSatisfaction, assessPerformance(customerSatisfaction) AS result FROM stores") 3. spark.udf.register(assessPerformance, "ASSESS_PERFORMANCE") 4. spark.sql("SELECT customerSatisfaction, ASSESS_PERFORMANCE(customerSatisfaction) AS result FROM stores")
1 -> 4
The code block below should extract the value for column 'sqft' from the first row of DF 'storesDF'. Fill in the blanks. __1__.__2__.__3__
1) storesDF 2) first() 3) sqft
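Assembled: storesDF.first().sqft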
What is a broadcast variable?
A broadcast variable is cached in its entirety on each worker node, so it does not need to be shipped or shuffled between nodes with each stage.
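A minimal sketch, assuming an active SparkSession named spark and a hypothetical lookup table:
regions = spark.sparkContext.broadcast({"NY": "Northeast", "CA": "West"})
regions.value["NY"]   # each executor reads its locally cached copy; nothing is reshuffled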