Databricks

The code block shown below should return a DataFrame with only column aSquared dropped from DataFrame df. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block: df.__1__(__2__)
1. remove 2. aSquared
1. drop 2. "aSquared"
1. remove 2. "aSquared"
1. drop 2. aSquared

1. drop 2. "aSquared"
Explanation: The correct usage of drop is df.drop("col_name").

The code block shown below should return a new DataFrame with a new column named "casted" whose value is the long equivalent of column "a", which is an integer column. The new DataFrame should also contain all the previously existing columns from DataFrame df. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block: df._1_(_2_, _3_)
1. withColumn 2. casted 3. col(a).cast(long)
1. withColumn 2. "casted" 3. col("a").cast("long")
1. withColumnRenamed 2. casted 3. col("a").cast("long")
1. withColumn 2. "casted" 3. cast(a)
1. withColumn 2. "casted" 3. cast(col("a")
1. withColumnRenamed 2. "casted" 3. col("a").cast("long")

1. withColumn 2. "casted" 3. col("a").cast("long")
Explanation: Read the questions and responses carefully. You will see many questions like this one; try to visualize the code and write it down if it helps. Column names passed as strings are always quoted, and you use .cast() to cast a column.

tableA is a DataFrame consisting of 20 fields and 40 billion rows of data with a surrogate key field. tableB is a DataFrame functioning as a lookup table for the surrogate key, consisting of 2 fields and 5,000 rows. If the in-memory size of tableB is 22MB, what occurs when the following code is executed? df = tableA.join(tableB, "primary_key")
An exception will be thrown due to tableB being greater than the 10MB default threshold for a broadcast join.
A non-broadcast join will be executed with a shuffle phase since the broadcast table is greater than the 10MB default threshold and the broadcast hint was not specified.
The contents of tableB will be partitioned so that each of the keys that need to be joined on in tableA partitions on each executor will match.
The contents of tableB will be replicated and sent to each executor to eliminate the need for a shuffle stage during the join.

A non-broadcast join will be executed with a shuffle phase since the broadcast table is greater than the 10MB default threshold and the broadcast hint was not specified.
Explanation: By default, spark.sql.autoBroadcastJoinThreshold is 10MB. A table larger than this threshold is not broadcast automatically, so without an explicit broadcast hint Spark falls back to a join that shuffles data.
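The size check behind that answer can be sketched in plain Python. This is a simplified model, not Spark's planner: the function name picks_broadcast_join and its parameters are hypothetical, and real Spark compares the threshold against its own size estimate of the table rather than a number you pass in.

```python
# Simplified model of the automatic broadcast decision (not Spark's actual planner).
# Default value of spark.sql.autoBroadcastJoinThreshold: 10MB.
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024


def picks_broadcast_join(table_size_bytes, threshold_bytes=DEFAULT_THRESHOLD_BYTES, hinted=False):
    """Return True if this toy model would broadcast the smaller table."""
    if hinted:
        # An explicit broadcast hint requests a broadcast regardless of the threshold.
        return True
    if threshold_bytes < 0:
        # Setting the threshold to -1 disables automatic broadcasting.
        return False
    return table_size_bytes <= threshold_bytes


table_b_size = 22 * 1024 * 1024  # the 22MB lookup table from the question
print(picks_broadcast_join(table_b_size))
print(picks_broadcast_join(table_b_size, hinted=True))
```

In this model the 22MB table is only broadcast when the hint is set, which matches the reasoning in the explanation.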

Which of the following are correct for slots?
Each executor has a number of slots.
Each slot can be assigned a task.
Spark parallelizes via slots.
It is interchangeable with tasks.
All of the answers are correct.

Each executor has a number of slots.
Each slot can be assigned a task.
Spark parallelizes via slots.
Explanation: Slots are not the same thing as executors. An executor can have multiple slots, and tasks are executed on slots. Review this concept well for the exam. https://spark.apache.org/docs/latest/cluster-overview.html
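As a rough illustration of how slots bound parallelism, the arithmetic below assumes one slot per executor core and one core per task. The executor counts and the helper names total_slots and task_waves are made up for the example.

```python
import math


def total_slots(num_executors, cores_per_executor):
    # Assumption: one slot per core, i.e. each task needs exactly one core.
    return num_executors * cores_per_executor


def task_waves(num_tasks, slots):
    # Tasks beyond the available slots have to wait for a later "wave".
    return math.ceil(num_tasks / slots)


slots = total_slots(num_executors=4, cores_per_executor=8)
print(slots, task_waves(num_tasks=200, slots=slots))
```

With 4 executors of 8 cores each there are 32 slots, so 200 tasks need 7 waves under these assumptions.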

Which command can we use to get the number of partitions of a DataFrame named df?
df.rdd.getPartitionSize()
df.getNumPartitions()
df.rdd.getNumPartitions()
df.getPartitionSize()

df.rdd.getNumPartitions()
Explanation: The correct answer here is df.rdd.getNumPartitions().

The following statement will create a managed table: dataframe.write.option('path', "/my_paths/").saveAsTable("managed_my_table")
TRUE
FALSE

False
Explanation: Spark manages the metadata, while you control the data location. As soon as you add the 'path' option to the DataFrame writer, the table is treated as a global external (unmanaged) table. When you drop the table, only the metadata is dropped; the data files remain. A global unmanaged/external table is available across all clusters.

Narrow Transformation

In a narrow transformation, all the elements required to compute the records in a single partition live in a single partition of the parent RDD. Only a limited subset of partitions is used to calculate the result. Narrow transformations are the result of operations such as map() and filter().

Wide Transformation

In a wide transformation, the elements required to compute the records in a single partition may live in many partitions of the parent RDD, so data has to be shuffled between partitions. Wide transformations are the result of operations such as groupByKey() and reduceByKey().
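The difference between the two kinds of transformation can be sketched without Spark by modelling partitions as plain Python lists. The helpers narrow_map_values and wide_group_by_key, and the toy partitioner based on character codes, are assumptions made for this illustration; Spark's own partitioning works differently.

```python
from collections import defaultdict

# Two "partitions" of (key, value) records, modelled as plain lists.
partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]


def narrow_map_values(parts, fn):
    # Narrow: each output partition is built from exactly one input partition.
    return [[(k, fn(v)) for k, v in part] for part in parts]


def wide_group_by_key(parts, num_out):
    # Wide: records with the same key may sit in different input partitions,
    # so each output partition may need data from every input partition.
    out = [defaultdict(list) for _ in range(num_out)]
    for part in parts:
        for k, v in part:
            target = sum(map(ord, k)) % num_out  # deterministic toy partitioner
            out[target][k].append(v)
    return [dict(d) for d in out]


doubled = narrow_map_values(partitions, lambda v: v * 2)
grouped = wide_group_by_key(partitions, num_out=2)
print(doubled)
print(grouped)
```

Note how the values for key "a" start in two different input partitions and end up together in one output partition; that movement is what a shuffle does.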

We want to store an RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, the partitions that don't fit should be stored on disk and read from there when they're needed. Each partition should also be replicated on two cluster nodes. Which storage level do we need to choose?
MEMORY_AND_DISK_2_SER
MEMORY_ONLY_2
MEMORY_AND_DISK_2
MEMORY_AND_DISK

MEMORY_AND_DISK_2
StorageLevel.MEMORY_AND_DISK_2 is the same as the MEMORY_AND_DISK storage level, but it replicates each partition on two cluster nodes.

What won't cause a full shuffle, knowing that DataFrame 'df' has 8 partitions?
All of them will cause a full shuffle
df.coalesce(4)
df.repartition(12)

df.coalesce(4)
coalesce() avoids a full shuffle. When the number of partitions is known to be decreasing, the executors can keep data on the minimum number of partitions, moving data only off the extra nodes and onto the nodes that are kept.
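A toy model can make the contrast concrete. The two functions below only imitate the idea with Python lists: the names are borrowed from Spark for readability, the modulo-based grouping is an assumption of this sketch, and Spark's real coalesce chooses which partitions to merge based on locality.

```python
def coalesce(parts, n):
    # Merge whole partitions into n buckets. Partitions are moved as units;
    # individual records are never redistributed one by one.
    out = [[] for _ in range(n)]
    for i, part in enumerate(parts):
        out[i % n].extend(part)
    return out


def repartition(parts, n):
    # Full shuffle: every record is reassigned individually (round-robin here).
    records = [r for part in parts for r in part]
    out = [[] for _ in range(n)]
    for i, r in enumerate(records):
        out[i % n].append(r)
    return out


eight = [[i] for i in range(8)]  # 8 partitions with one record each
four = coalesce(eight, 4)
twelve = repartition(eight, 12)
print(four)
print(twelve)
```

In this model, going from 8 to 4 partitions only appends some partitions to others, while going from 8 to 12 touches every record.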

If we want to create a constant integer 1 as a new column 'new_column' in a DataFrame df, which code block should we select?
df.withColumn(new_column, lit(1))
df.withColumnRenamed('new_column', lit(1))
df.withColumn("new_column", lit("1"))
df.withColumn("new_column", 1)
df.withColumn("new_column", lit(1))

df.withColumn("new_column", lit(1))
Explanation: The second argument of DataFrame.withColumn must be a Column, so you have to use lit() to add the constant value 1.

Suppose that we have a DataFrame with a column 'today' in the format 'YYYY-MM-DD'. You want to add a new column 'week_ago' to this DataFrame, and you want its value to be one week prior to column 'today'. Select the correct code block.
df.withColumn("week_ago", date_sub(col("today"), 7))
df.withColumn(week_ago, date_sub(col("today"), 7))
df.withColumn("week_ago", col("today")- 7))
df.withColumn( date_sub(col("today"), 7), "week_ago")
df.withColumn("week_ago", week_sub(col("today"), 7))

df.withColumn("week_ago", date_sub(col("today"), 7))
Explanation: date_sub and date_add are functions in the package org.apache.spark.sql.functions (pyspark.sql.functions in Python).
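For a single value, the date arithmetic in that answer can be mirrored with Python's standard library. The helper week_ago is a hypothetical stand-in that shows what subtracting 7 days from a 'YYYY-MM-DD' string yields; it is not how Spark implements date_sub.

```python
from datetime import date, timedelta


def week_ago(today_str):
    # Parse an ISO 'YYYY-MM-DD' string, subtract 7 days, and format it back.
    return (date.fromisoformat(today_str) - timedelta(days=7)).isoformat()


print(week_ago("2024-03-05"))  # crosses a month boundary in a leap year
```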

Which of the following code blocks replaces the Parquet file content, given that a file already exists with the name that we want to write?
df.write.mode("overwrite").option("compression", "snappy").save("path")
df.save.format("parquet").mode("overwrite").option("compression", "snappy").path("path")
df.write.format("parquet").option("compression", "snappy").path("path")

df.write.mode("overwrite").option("compression", "snappy").save("path")
Explanation: Parquet is the default file format, so the DataFrame is saved as Parquet even without a format() call. If the target path already exists and you do not set mode("overwrite"), the write fails with an error.

Which of the following DataFrame operations is classified as a narrow transformation?
filter()
orderBy()
distinct()
coalesce()
repartition()

filter()

When joining two DataFrames, suppose we need to evaluate the keys in both DataFrames and include all rows from the left DataFrame as well as any rows in the right DataFrame that have a match in the left DataFrame. If there is no equivalent row in the right DataFrame, we want to insert null. Which join type should we select? df1.join(person, joinExpression, joinType)
joinType = "leftAnti"
joinType = "left_outer"
joinType = "leftOuter"
joinType = "left_semi"

joinType = "left_outer"
Explanation: The correct answer is joinType = "left_outer". For example: df1.join(person, joinExpression, "left_outer").show()
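The left outer join semantics described in the question can be sketched over lists of dictionaries. The function left_outer_join and the sample rows are hypothetical, and the sketch assumes keys are unique on the right side; it illustrates the null-filling behavior, not Spark's join implementation.

```python
def left_outer_join(left, right, key):
    # Index the right side by key (assumes each key appears at most once there).
    right_by_key = {row[key]: row for row in right}
    right_cols = sorted({c for row in right for c in row if c != key})
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        # Every left row is kept; unmatched right-side columns become None.
        extra = {c: (match.get(c) if match else None) for c in right_cols}
        joined.append({**row, **extra})
    return joined


orders = [{"id": 1, "item": "pen"}, {"id": 2, "item": "ink"}]
people = [{"id": 1, "name": "Ana"}]
result = left_outer_join(orders, people, "id")
print(result)
```

The row with id 2 has no match on the right, so it is kept with name set to None, which corresponds to the null that a left outer join inserts.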

Which 3 of the following DataFrame operations are NOT classified as an action? Choose 3 answers:
limit()
printSchema()
cache()
show()
foreach()
first()

limit(), printSchema(), cache()

There is a global temp view named 'my_global_view'. If I want to query this view within Spark, which command should I choose?
spark.read.view("global_temp.my_global_view")
spark.read.table("my_global_view")
spark.read.table("global_temp.my_global_view")
spark.read.view("my_global_view")

spark.read.table("global_temp.my_global_view")
Explanation: Global temp views are accessed via the prefix 'global_temp'.

Spark dynamically handles skew in sort-merge joins by splitting (and replicating if needed) skewed partitions. Which property needs to be enabled to achieve this?
spark.sql.skewJoin.enabled
spark.sql.adaptive.skewJoin.enabled
spark.sql.adaptive.optimize.skewJoin

spark.sql.adaptive.skewJoin.enabled

You need to transform a column named 'date' to a timestamp format. Assume that the column 'date' is timestamp compatible. You have written the code block below, but it contains an error. Identify and fix it. df.select(to_timestamp(col("date")).show()
to_timestamp always requires a format, so you need to add one: df.select(to_timestamp(col("date"), 'yyyy-dd-MM'))
to_timestamp() is not a valid operation. The proper function is toTimestamp(): df.select(toTimestamp(col("date")))
to_timestamp() is not a valid operation. The proper function is toTimestamp(), and we also need to add a format: df.select(toTimestamp(col("date"), 'yyyy-dd-MM')))
We need to add a format, and it should be the first parameter passed to this function: df.select(to_timestamp('yyyy-dd-MM', col("date")))

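to_timestamp always requires a format, so you need to add one: df.select(to_timestamp(col("date"), 'yyyy-dd-MM'))
Explanation: This is the quiz's expected answer, and it is the only option that uses a valid function name (to_timestamp) with the arguments in the right order (column first, format second). Note, however, that the format argument is optional in Spark's API; the defect in the original snippet is the missing closing parenthesis before .show().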

