Databricks - Data Engineer Associate Practice Exam 5
The default threshold of VACUUM is 7 days. The internal audit team has asked that certain tables retain deleted files for at least 365 days as part of a compliance requirement. Which of the below settings is needed to implement this?
a. ALTER TABLE table_name SET TBLPROPERTIES (delta.deletedFileRetentionDuration = 'interval 365 days')
b. MODIFY TABLE table_name SET TBLPROPERTY (delta.maxRetentionDays = 'interval 365 days')
c. ALTER TABLE table_name SET EXTENDED TBLPROPERTIES (delta.deletedFileRetentionDuration = 'interval 365 days')
d. ALTER TABLE table_name SET EXTENDED TBLPROPERTIES (delta.vacuum.duration = 'interval 365 days')
a. ALTER TABLE table_name SET TBLPROPERTIES (delta.deletedFileRetentionDuration = 'interval 365 days')
Explanation: The syntax is ALTER TABLE table_name SET TBLPROPERTIES ( property_key [ = ] property_val [, ...] ). TBLPROPERTIES allows you to set key-value pairs on a table. See "Table properties and table options (Databricks SQL)" in the Databricks on AWS documentation.
Which of the following locations hosts the driver and worker nodes of a Databricks-managed cluster? a. Data plane b. Control plane c. Databricks Filesystem d. JDBC data source e. Databricks web application
a. Data plane
Explanation: The answer is Data plane, which is where compute (all-purpose clusters, job clusters, DLT) runs; this is generally in the customer's cloud account. There is one exception: SQL Warehouses. Of the three SQL Warehouse compute types currently available (classic, pro, serverless), classic and pro compute is located in the customer's cloud account, while serverless compute is located in the Databricks cloud account.
Which of the following scenarios is the best fit for AUTO LOADER? a. Efficiently process new data incrementally from cloud object storage b. Efficiently move data incrementally from one delta table to another delta table c. Incrementally process new data from streaming data sources like Kafka into delta lake d. Incrementally process new data from relational databases like MySQL e. Efficiently copy data from one data lake location to another data lake location
a. Efficiently process new data incrementally from cloud object storage
Explanation: The answer is: Efficiently process new data incrementally from cloud object storage. AUTO LOADER only supports ingesting files stored in cloud object storage. Auto Loader cannot process streaming data sources like Kafka or Delta streams; use Structured Streaming for those sources.
Auto Loader supports two ways to ingest data incrementally:
Directory listing - lists the directory and maintains state in RocksDB; supports incremental file listing.
File notification - uses a trigger plus a queue to store file notifications that can later be used to retrieve the files; unlike directory listing, file notification can scale up to millions of files per day.
[OPTIONAL] Auto Loader vs COPY INTO: Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without any additional setup. It provides a Structured Streaming source called cloudFiles: given an input directory path on cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing existing files in that directory.
When to use Auto Loader instead of COPY INTO:
- You want to load data from a file location that contains files on the order of millions or higher. Auto Loader can discover files more efficiently than the COPY INTO SQL command and can split file processing into multiple batches.
- You do not plan to load subsets of previously uploaded files. With Auto Loader, it can be more difficult to reprocess subsets of files. However, you can use the COPY INTO SQL command to reload subsets of files while an Auto Loader stream is simultaneously running.
You are currently working on storing data you received from different customer surveys; this data is highly unstructured and changes over time. Why is a Lakehouse a better choice compared to a data warehouse? a. Lakehouse supports schema enforcement and evolution, traditional data warehouses lack schema evolution. b. Lakehouse supports SQL c. Lakehouse supports ACID d. Lakehouse enforces data integrity e. Lakehouse supports primary and foreign keys like a data warehouse
a. Lakehouse supports schema enforcement and evolution, traditional data warehouses lack schema evolution.
A dataset has been defined using Delta Live Tables and includes an expectations clause: CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') What is the expected behavior when a batch of data containing data that violates these constraints is processed? a. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log. b. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log. c. Records that violate the expectation cause the job to fail. d. Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset. e. Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table.
a. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.
Explanation: The answer is: Records that violate the expectation are added to the target dataset and recorded as invalid in the event log. With a bare EXPECT clause (no ON VIOLATION action), invalid records are retained in the target dataset and the violation is recorded in the event log. Delta Live Tables supports three expectation modes for handling bad data in DLT pipelines: retain (EXPECT), drop (EXPECT ... ON VIOLATION DROP ROW), and fail (EXPECT ... ON VIOLATION FAIL UPDATE).
What is the main difference between the silver layer and gold layer in medallion architecture? a. Silver is optimized to perform ETL, Gold is optimized for query performance b. Gold is optimized to perform ETL, Silver is optimized for query performance c. Silver is a copy of Bronze, Gold is a copy of Silver d. Silver is stored in Delta Lake, Gold is stored in memory e. Silver may contain aggregated data, Gold may preserve the granularity of original data
a. Silver is optimized to perform ETL, Gold is optimized for query performance
Explanation: See "Medallion Architecture - Databricks". Gold layer: 1. Powers ML applications, reporting, dashboards, ad hoc analytics 2. Refined views of data, typically with aggregations 3. Reduces strain on production systems 4. Optimizes query performance for business-critical data. Exam focus: understand the role of each layer (bronze, silver, gold) in the medallion architecture; you will see varying questions targeting each layer and its purpose.
The sales team has asked the Data Engineering team to develop a dashboard that shows sales performance for all stores, and they would like to be able to select an individual store location. Which of the following approaches can the Data Engineering team use to build this functionality into the dashboard? a. Use query parameters, which then allow the user to choose any location b. Currently dashboards do not support parameters c. Use the Databricks REST API to create a dashboard for each location d. Use a SQL UDF function to filter the data based on the location e. Use dynamic views to filter the data based on the location
a. Use query parameters, which then allow the user to choose any location
Explanation: Databricks supports many types of parameters in dashboards; a drop-down list can be created based on a query that returns the unique list of store locations. Here are simple queries that take a parameter:
SELECT * FROM sales WHERE field IN ( {{ Multi Select Parameter }} )
or
SELECT * FROM sales WHERE field = {{ Single Select Parameter }}
Query parameter types: Text, Number, Dropdown List, Query Based Dropdown List, Date and Time.
You are currently working on creating a Spark stream process to read and write in a one-time micro batch, and also rewrite the existing target table. Fill in the blanks to complete the below command successfully.
spark.table("source_table")
  .writeStream
  .option("____", "dbfs:/location/silver")
  .outputMode("____")
  .trigger(Once=____)
  .table("target_table")
a. checkpointlocation, complete, True
b. targetlocation, overwrite, True
c. checkpointlocation, True, overwrite
d. checkpointlocation, True, complete
e. checkpointlocation, overwrite, True
a. checkpointlocation, complete, True
What is the output of the below function when executed with input parameters 1, 3:
def check_input(x, y):
  if x < y:
    x = x + 1
    if x > y:
      x = x + 1
      if x < y:
        x = x + 1
  return x
a. 1 b. 2 c. 3 d. 4 e. 5
b. 2
Explanation: The answer is 2. Note that the second and third if statements are nested inside the first. With x = 1 and y = 3: the first condition (x < y) is true, so x becomes 2; the nested condition (x > y) is then false, so the remaining nested blocks are skipped and the function returns 2.
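A quick way to verify the stated answer is to run the function directly. This is a minimal sketch in plain Python, assuming the ifs are nested as the stated answer implies:

```python
def check_input(x, y):
    # The second and third if statements are nested inside the first,
    # so they only run when the preceding condition was true.
    if x < y:
        x = x + 1
        if x > y:
            x = x + 1
            if x < y:
                x = x + 1
    return x

print(check_input(1, 3))  # → 2 (x becomes 2, then 2 > 3 is false, so we return)
```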
Which of the following array functions takes an input column and returns a unique list of values as an array? a. COLLECT_LIST b. COLLECT_SET c. COLLECT_UNION d. ARRAY_INTERSECT e. ARRAY_UNION
b. COLLECT_SET
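To make the distinction concrete, here is a plain-Python sketch of the difference between COLLECT_LIST and COLLECT_SET semantics. Note this is only an illustration: Spark does not guarantee element order in these aggregates, so first-occurrence order is an assumption here.

```python
def collect_list(values):
    # COLLECT_LIST: aggregates values into an array, keeping duplicates
    return list(values)

def collect_set(values):
    # COLLECT_SET: aggregates values into an array of distinct values
    # (first-occurrence order here; Spark makes no ordering guarantee)
    return list(dict.fromkeys(values))

column = [1, 2, 2, 3, 3, 3]
print(collect_list(column))  # → [1, 2, 2, 3, 3, 3]
print(collect_set(column))   # → [1, 2, 3]
```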
Which of the below commands can be used to drop a DELTA table? a. DROP DELTA table_name b. DROP TABLE table_name c. DROP TABLE table_name FORMAT DELTA d. DROP table_name
b. DROP TABLE table_name
What is the main difference between the silver layer and the gold layer in medallion architecture? a. Silver may contain aggregated data b. Gold may contain aggregated data c. Data quality checks are applied in gold d. Silver is a copy of bronze data e. Gold is a copy of silver data
b. Gold may contain aggregated data
Explanation: See "Medallion Architecture - Databricks". Exam focus: understand the role of each layer (bronze, silver, gold) in the medallion architecture; you will see varying questions targeting each layer and its purpose.
A dataset has been defined using Delta Live Tables and includes an expectations clause: CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION DROP ROW What is the expected behavior when a batch of data containing data that violates these constraints is processed? a. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log. b. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log. c. Records that violate the expectation cause the job to fail. d. Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset. e. Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table.
b. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.
Explanation: The answer is: Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log. The ON VIOLATION DROP ROW action drops violating records and logs the violation. Delta Live Tables supports three expectation modes for handling bad data in DLT pipelines: retain (EXPECT), drop (EXPECT ... ON VIOLATION DROP ROW), and fail (EXPECT ... ON VIOLATION FAIL UPDATE).
You are asked to debug a Databricks job that is taking too long to run on Sundays. What steps would you take to identify the step that is taking longer to run? a. Notebook activity of a job run is only visible when using an all-purpose cluster. b. Under the Workflows UI, select the job you want to monitor and select the run; notebook activity can be viewed. c. Enable debug mode in Jobs to see the output activity of a job; output should be available to view. d. Once a job is launched, you cannot access the job's notebook activity. e. Use the compute's Spark UI to monitor the job activity.
b. Under the Workflows UI, select the job you want to monitor and select the run; notebook activity can be viewed.
Explanation: The answer is: under the Workflows UI, select the job you want to monitor and select the run; notebook activity can be viewed. You can view current active runs or completed runs; click on a run to view the notebook output.
Your team has hundreds of jobs running but it is difficult to track cost of each job run, you are asked to provide a recommendation on how to monitor and track cost across various workloads a. Create jobs in different workspaces, so we can track the cost easily b. Use Tags, during job creation so cost can be easily tracked c. Use job logs to monitor and track the costs d. Use workspace admin reporting e. Use a single cluster for all the jobs, so cost can be easily tracked
b. Use Tags, during job creation so cost can be easily tracked
Explanation: The answer is: use tags during job creation so cost can be easily tracked. Review the link below for more details: https://docs.databricks.com/administration-guide/account-settings/usage-detail-tags-aws.html
Tags propagate from pools to the clusters created from them, and can also be applied directly to clusters created without pools.
You have written a notebook to generate a summary data set for reporting. The notebook was scheduled using a job cluster, but you realized it takes an average of 8 minutes to start the cluster. What feature can be used to start the cluster in a timely fashion? a. Set up an additional job to run ahead of the actual job so the cluster is running when the second job starts b. Use the Databricks cluster pools feature to reduce the startup time c. Use Databricks Premium edition instead of Databricks Standard edition d. Pin the cluster in the cluster UI page so it is always available to the jobs e. Disable auto termination so the cluster is always running
b. Use the Databricks cluster pools feature to reduce the startup time
Explanation: Cluster pools allow us to reserve VMs ahead of time; when a new job cluster is created, VMs are grabbed from the pool. Note: while the VMs sit idle in the pool waiting to be used by a cluster, only the cloud provider cost (e.g., Azure VM cost) is incurred; Databricks runtime cost is only billed once a VM is allocated to a cluster. Here is a demo of how to set up a pool and follow some best practices: https://www.youtube.com/watch?v=FVtITxOabxg&ab_channel=DatabricksAcademy
You were asked to write python code to stop all running streams, which of the following command can be used to get a list of all active streams currently running so we can stop them, fill in the blank. for s in _______________: s.stop() a. Spark.getActiveStreams() b. spark.streams.active c. activeStreams() d. getActiveStreams() e. spark.streams.getActive
b. spark.streams.active
Which of the following Python statements can be used to replace the schema name and table name in the query?
a. table_name = "sales"
   schema_name = "bronze"
   query = f"select * from schema_name.table_name"
b. table_name = "sales"
   query = "select * from {schema_name}.{table_name}"
c. table_name = "sales"
   query = f"select * from {schema_name}.{table_name}"
d. table_name = "sales"
   query = f"select * from + schema_name +"."+table_name"
c. table_name = "sales"
   query = f"select * from {schema_name}.{table_name}"
Explanation: The answer is query = f"select * from {schema_name}.{table_name}". It is best to use f-strings to interpolate Python variables into a string rather than using string concatenation.
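The f-string behavior is easy to check in plain Python; the variable names below match the question:

```python
schema_name = "bronze"
table_name = "sales"

# Option a: without braces, the literal text is kept, not the variable values
wrong = f"select * from schema_name.table_name"

# Option c: braces interpolate the variables into the string
query = f"select * from {schema_name}.{table_name}"

print(wrong)  # → select * from schema_name.table_name
print(query)  # → select * from bronze.sales
```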
Which of the following commands can be used to query a delta table? a. %python spark.sql("select * from table_name") b. %sql Select * from table_name c. Both A & B d. %python execute.sql("select * from table") e. %python delta.sql("select * from table")
c. Both A & B
Explanation: The answer is both options A and B. Options D and E are incorrect because there is no command in Spark called execute.sql or delta.sql.
Which command deletes records from the transactions Delta table where transactionDate is greater than the current timestamp? a. DELETE FROM transactions FORMAT DELTA WHERE transactionDate > current_timestamp() b. DELETE FROM transactions IF transactionDate > current_timestamp() c. DELETE FROM transactions WHERE transactionDate > current_timestamp() d. DELETE FROM transactions WHERE transactionDate > current_timestamp() KEEP_HISTORY e. DELETE FROM transactions WHERE transactionDate GE current_timestamp()
c. DELETE FROM transactions where transactionDate > current_timestamp()
You are currently looking at a table that contains data from an e-commerce platform. Each row contains a list of items (item numbers) that were present in the cart; when the customer makes a change to the cart, the entire cart contents are saved as a separate list and appended to the existing list for the duration of the customer session. To identify all the items the customer bought, you have to make a unique list of items. You were asked to create a unique list of the items added to the cart by the user. Fill in the blanks of the below query by choosing the appropriate higher-order functions. Note: see below sample data and expected output.
Schema: cartId INT, items ARRAY<ARRAY<INT>>
Sample data:
cartId | items
1 | [[1,100,200,300], [1,250,300]]
2 | [[10,150,200,300], [1,210,300], [350]]
Expected output:
cartId | items
1 | [1,100,200,300,250]
2 | [10,150,200,300,1,210,350]
Fill in the blanks:
SELECT cartId, ______(________(items)) FROM carts
a. ARRAY_UNION, ARRAY_DISTINCT
b. ARRAY_DISTINCT, ARRAY_UNION
c. ARRAY_DISTINCT, FLATTEN
d. FLATTEN, ARRAY_DISTINCT
e. ARRAY_DISTINCT, ARRAY_FLATTEN
c. ARRAY_DISTINCT, FLATTEN
Explanation: FLATTEN transforms an array of arrays into a single array. ARRAY_DISTINCT returns an array of the same type as the input argument with all duplicate values removed.
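The two functions can be mimicked in plain Python to sanity-check the expected output (first-occurrence order is assumed for the distinct step; this is an illustration of the semantics, not Spark itself):

```python
from itertools import chain

def flatten(arrays):
    # FLATTEN: collapse an array of arrays into a single array
    return list(chain.from_iterable(arrays))

def array_distinct(values):
    # ARRAY_DISTINCT: remove duplicates, keeping first-occurrence order
    return list(dict.fromkeys(values))

cart1 = [[1, 100, 200, 300], [1, 250, 300]]
print(array_distinct(flatten(cart1)))  # → [1, 100, 200, 300, 250]
```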
What is the main difference between the bronze layer and silver layer in a medallion architecture? a. Duplicates are removed in bronze, schema is applied in silver b. Silver may contain aggregated data c. Bronze is raw copy of ingested data, silver contains data with production schema and optimized for ELT/ETL throughput d. Bad data is filtered in Bronze, silver is a copy of bronze data
c. Bronze is raw copy of ingested data, silver contains data with production schema and optimized for ELT/ETL throughput
Explanation: See "Medallion Architecture - Databricks". Exam focus: understand the role of each layer (bronze, silver, gold) in the medallion architecture; you will see varying questions targeting each layer and its purpose.
Which of the following statements is correct about cluster pools? a. Cluster pools allow you to perform load balancing b. Cluster pools allow you to create a cluster c. Cluster pools allow you to save time when starting a new cluster d. Cluster pools are used to share resources among multiple teams e. Cluster pools allow you to have all the nodes in the cluster from a single physical server rack
c. Cluster pools allow you to save time when starting a new cluster
How does a Delta Lake differ from a traditional data lake? a. Delta lake is Data warehouse service on top of data lake that can provide reliability, security, and performance b. Delta lake is a caching layer on top of data lake that can provide reliability, security, and performance c. Delta lake is an open storage format like parquet with additional capabilities that can provide reliability, security, and performance d. Delta lake is an open storage format designed to replace flat files with additional capabilities that can provide reliability, security, and performance e. Delta lake is proprietary software designed by Databricks that can provide reliability, security, and performance
c. Delta lake is an open storage format like parquet with additional capabilities that can provide reliability, security, and performance
Explanation: What is Delta?
Delta Lake is:
- Open source
- Built upon standard data formats
- Optimized for cloud object storage
- Built for scalable metadata handling
Delta Lake is not:
- Proprietary technology
- Storage format
- Storage medium
- Database service or data warehouse
Kevin is the owner of both the sales table and the regional_sales_vw view, which uses the sales table as the underlying source for the data. Kevin is looking to grant the SELECT privilege on the view regional_sales_vw to a newly joined team member, Steven. Which of the following is a true statement? a. Kevin cannot grant access to Steven since he does not have the security admin privilege b. Kevin, although the owner, does not have the ALL PRIVILEGES permission c. Kevin can grant access to the view, because he is the owner of the view and the underlying table d. Kevin cannot grant access to Steven since he does not have the workspace admin privilege e. Steven will also require SELECT access on the underlying table
c. Kevin can grant access to the view, because he is the owner of the view and the underlying table
Explanation: The answer is: Kevin can grant access to the view, because he is the owner of the view and the underlying table. Ownership determines whether you can grant privileges on derived objects to other users. A user who creates a schema, table, view, or function becomes its owner; the owner is granted all privileges and can grant privileges to other users.
Identify one of the below statements that can query a Delta table using the PySpark DataFrame API. a. spark.read.mode("delta").table("table_name") b. spark.read.table.delta("table_name") c. spark.read.table("table_name") d. spark.read.format("delta").LoadTableAs("table_name") e. spark.read.format("delta").TableAs("table_name")
c. spark.read.table("table_name")
Which of the following SQL statements can be used to update a transactions table, to set a flag on the table from Y to N. a. MODIFY transactions SET active_flag = 'N' WHERE active_flag = 'Y' b. MERGE transactions SET active_flag = 'N' WHERE active_flag = 'Y' c. UPDATE transactions SET active_flag = 'N' WHERE active_flag = 'Y' d. REPLACE transactions SET active_flag = 'N' WHERE active_flag = 'Y'
c. UPDATE transactions SET active_flag = 'N' WHERE active_flag = 'Y'
Explanation: The answer is UPDATE transactions SET active_flag = 'N' WHERE active_flag = 'Y'. Delta Lake supports UPDATE statements on Delta tables; all changes made as part of the update are ACID compliant.
You are working on a dashboard that takes a long time to load in the browser, due to the fact that each visualization contains a lot of data to populate, which of the following approaches can be taken to address this issue? a. Increase size of the SQL endpoint cluster b. Increase the scale of maximum range of SQL endpoint cluster c. Use Databricks SQL Query filter to limit the amount of data in each visualization d. Remove data from Delta Lake e. Use Delta cache to store the intermediate results
c. Use Databricks SQL Query filter to limit the amount of data in each visualization
Explanation: Note: the question may sound misleading, but these are the types of questions the exam asks. A query filter lets you interactively reduce the amount of data shown in a visualization, similar to a query parameter but with a few key differences: a query filter limits data after it has been loaded into your browser. This makes filters ideal for smaller datasets and environments where query executions are time-consuming, rate-limited, or costly. This is different from a filter applied at the data level; it operates at the visualization level, so you can toggle how much data you want to see. For example:
SELECT action AS `action::filter`, COUNT(0) AS "actions count" FROM events GROUP BY action
When queries have filters, you can also apply filters at the dashboard level: select the "Use Dashboard Level Filters" checkbox to apply the filter to all queries. See "Query filters | Databricks on AWS".
Which of the following techniques can be used to implement fine-grained access control to rows and columns of a Delta table based on the user's access? a. Use Unity Catalog to grant access to rows and columns b. Row and column access control lists c. Use dynamic view functions d. Data access control lists e. Dynamic access control lists with Unity Catalog
c. Use dynamic view functions
Explanation: The answer is: use dynamic view functions. Here is an example that limits access to rows based on the user being part of the managers group; in the below view, a user who is not a member of the managers group can only see rows where the total amount is <= 1000000.
Dynamic view function to filter rows:
CREATE VIEW sales_redacted AS
SELECT user_id, country, product, total
FROM sales_raw
WHERE CASE WHEN is_member('managers') THEN TRUE ELSE total <= 1000000 END;
Dynamic view function to hide column data based on the user's access:
CREATE VIEW sales_redacted AS
SELECT user_id,
  CASE WHEN is_member('auditors') THEN email ELSE 'REDACTED' END AS email,
  country, product, total
FROM sales_raw
Please review the following for more details: https://docs.microsoft.com/en-us/azure/databricks/security/access-control/table-acls/object-privileges#dynamic-view-functions
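The CASE logic in the two views can be illustrated with a small plain-Python sketch; is_member is approximated here by checking a set of group names, which is purely an assumption for illustration:

```python
def visible(row, user_groups):
    # Row filter: managers see every row, everyone else only
    # rows where total <= 1,000,000 (mirrors the WHERE CASE clause)
    return "managers" in user_groups or row["total"] <= 1_000_000

def redacted_email(email, user_groups):
    # Column mask: only auditors see the real email address
    return email if "auditors" in user_groups else "REDACTED"

row = {"user_id": 1, "total": 2_000_000}
print(visible(row, {"managers"}))               # → True
print(visible(row, {"analysts"}))               # → False
print(redacted_email("a@b.com", {"analysts"}))  # → REDACTED
```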
At the end of the inventory process a file gets uploaded to cloud object storage. You are asked to build a process to ingest the data incrementally; the schema of the file is expected to change over time, and the ingestion process should be able to handle these changes automatically. Below is the Auto Loader command to load the data; fill in the blanks for successful execution of the code.
spark.readStream
  .format("cloudFiles")
  .option("_______", "csv")
  .option("_______", 'dbfs:/location/checkpoint/')
  .load(data_source)
  .writeStream
  .option("_______", 'dbfs:/location/checkpoint/')
  .option("_______", "true")
  .table(table_name)
a. format, checkpointlocation, schemalocation, overwrite
b. cloudfiles.format, checkpointlocation, cloudfiles.schemalocation, overwrite
c. cloudfiles.format, cloudfiles.schemalocation, checkpointlocation, mergeSchema
d. cloudfiles.format, cloudfiles.schemalocation, checkpointlocation, append
e. cloudfiles.format, cloudfiles.schemalocation, checkpointlocation, overwrite
c. cloudfiles.format, cloudfiles.schemalocation, checkpointlocation, mergeSchema
The table temp_data has one column called raw containing JSON data that records the temperature every four hours in the day for the city of Chicago. You are asked to calculate the maximum temperature that was ever recorded for the 12:00 PM hour across all the days. Parse the JSON data and use the necessary array function to calculate the max temp.
Table: temp_data
Column: raw
Datatype: string
{
  "chicago": [
    {"date": "01-01-2021", "temp": [25,28,34,56,39,25]},
    {"date": "01-02-2021", "temp": [25,28,49,54,38,25]},
    {"date": "01-03-2021", "temp": [25,28,49,58,38,25]}
  ]
}
Expected output: 58
a. select max(raw.chicago.temp[3]) from temp_data
b. select array_max(raw.chicago[*].temp[3]) from temp_data
c. select array_max(from_json(raw['chicago'].temp[3],'array<int>')) from temp_data
d. select array_max(from_json(raw:chicago[*].temp[3],'array<int>')) from temp_data
e. select max(from_json(raw:chicago[3].temp[3],'array<int>')) from temp_data
d. select array_max(from_json(raw:chicago[*].temp[3],'array<int>')) from temp_data
Explanation: Note: this is a difficult question; you are more likely to see easier questions similar to this, but the better prepared you are, the easier it is to pass the exam. The readings are taken every four hours starting at midnight, so the 12:00 PM reading is the fourth element, index 3. raw:chicago[*].temp[3] uses the semi-structured JSON path syntax to extract that reading from every day's record, from_json converts the result into an array<int>, and array_max returns the largest value. Use the link below for more examples; it will definitely help you: https://docs.databricks.com/optimizations/semi-structured.html
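The same computation can be sanity-checked in plain Python with the json module (the JSON from the question is cleaned up slightly so it parses):

```python
import json

raw = """{"chicago": [
  {"date": "01-01-2021", "temp": [25, 28, 34, 56, 39, 25]},
  {"date": "01-02-2021", "temp": [25, 28, 49, 54, 38, 25]},
  {"date": "01-03-2021", "temp": [25, 28, 49, 58, 38, 25]}
]}"""

doc = json.loads(raw)
# raw:chicago[*].temp[3] → the fourth reading (12:00 PM) from every day
noon_temps = [day["temp"][3] for day in doc["chicago"]]
print(max(noon_temps))  # → 58
```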
The below sample input data contains two columns: cartId, also known as the session id, and items; every time a customer makes a change to the cart, the cart contents are stored as an array in the table. The marketing team asked you to create a unique list of items that were ever added to the cart by each customer. Fill in the blanks by choosing the appropriate array functions so the query produces the expected result shown below.
Schema: cartId INT, items ARRAY<INT>
Sample data:
cartId | items
1 | [1,100,200,300]
1 | [1,250,300]
SELECT cartId, ___(___(items)) as items
FROM carts GROUP BY cartId
Expected result:
cartId | items
1 | [1,100,200,300,250]
a. FLATTEN, COLLECT_UNION
b. ARRAY_UNION, FLATTEN
c. ARRAY_UNION, ARRAY_DISTINCT
d. ARRAY_UNION, COLLECT_SET
e. ARRAY_DISTINCT, ARRAY_UNION
d. ARRAY_UNION, COLLECT_SET
Explanation: COLLECT_SET is an aggregate function that combines the column values from all rows in the group into a list of unique values; ARRAY_UNION then combines the collected arrays and removes any duplicates.
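A plain-Python sketch of the intended semantics: gather every item array seen for a cart, then union them into one duplicate-free list. First-occurrence order is an assumption for illustration; Spark does not guarantee ordering here.

```python
def union_of_item_arrays(rows):
    # Combine all item arrays seen for a cart into one list,
    # dropping duplicates (first-occurrence order)
    merged = []
    for items in rows:
        for item in items:
            if item not in merged:
                merged.append(item)
    return merged

cart_rows = [[1, 100, 200, 300], [1, 250, 300]]
print(union_of_item_arrays(cart_rows))  # → [1, 100, 200, 300, 250]
```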
Your colleague was walking you through how a job was setup, but you noticed a warning message that said, "Jobs running on all-purpose cluster are considered all purpose compute", the colleague was not sure why he was getting the warning message, how do you best explain this warning message? a. All-purpose clusters cannot be used for Job clusters, due to performance issues. b. All-purpose clusters take longer to start the cluster vs a job cluster c. All-purpose clusters are less expensive than the job clusters d. All-purpose clusters are more expensive than the job clusters e. All-purpose cluster provide interactive messages that can not be viewed in a job
d. All-purpose clusters are more expensive than the job clusters
Explanation: The warning appears because pricing for all-purpose clusters is higher than for job clusters. AWS pricing (Aug 15th 2022), Photon:
Jobs compute - Standard ed. $0.10/DBU, Premium ed. $0.15/DBU
All-purpose compute - Standard ed. $0.40/DBU, Premium ed. $0.55/DBU
You are working on IoT data where each device has 5 readings in an array, collected in Celsius. You were asked to convert each individual reading from Celsius to Fahrenheit; fill in the blank with an appropriate function that can be used in this scenario.
Schema: deviceId INT, deviceTempC ARRAY<double>
Input data:
deviceId | deviceTempC
1 | [25.00,26.00,25.00,26.00,27.00]
Expected data:
deviceId | deviceTempF
1 | [77.00,78.80,77.00,78.80,80.60]
SELECT deviceId, __(deviceTempC, i -> (i * 9/5) + 32) as deviceTempF
FROM sensors
a. APPLY b. MULTIPLY c. ARRAYEXPR d. TRANSFORM e. FORALL
d. TRANSFORM
Explanation: transform(expr, func) transforms each element in the array expr using the function func.
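The lambda passed to TRANSFORM is applied to every array element; a minimal plain-Python equivalent of the semantics (rounding added here for tidy output):

```python
def transform(arr, func):
    # TRANSFORM(expr, func): apply func to every element of the array
    return [func(x) for x in arr]

device_temp_c = [25.0, 26.0, 25.0, 26.0, 27.0]
device_temp_f = transform(device_temp_c, lambda c: round(c * 9 / 5 + 32, 2))
print(device_temp_f)  # → [77.0, 78.8, 77.0, 78.8, 80.6]
```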
A SQL Dashboard was built for the supply chain team to monitor the inventory and product orders, but all of the timestamps displayed on the dashboards are showing in UTC format, so they requested to change the time zone to the location of New York. How would you approach resolving this issue? a. Move the workspace from Central US zone to East US Zone b. Change the timestamp on the delta tables to America/New_York format c. Change the spark configuration of SQL endpoint to format the timestamp to America/New_York d. Under SQL Admin Console, set the SQL configuration parameter time zone to America/New_York e. Add SET Timezone = America/New_York on every of the SQL queries in the dashboard.
d. Under SQL Admin Console, set the SQL configuration parameter time zone to America/New_York
Explanation: The answer is: under the SQL Admin Console, set the SQL configuration parameter time zone to America/New_York. This configures the setting once, so the entire dashboard changes without editing individual queries.
To configure all warehouses with SQL parameters:
1. Click Settings at the bottom of the sidebar and select SQL Admin Console.
2. Click the SQL Warehouse Settings tab.
3. In the SQL Configuration Parameters textbox, specify one key-value pair per line, separating the name of the parameter from its value with a space. For this case, add the line: timezone America/New_York
See "SQL configuration parameters | Databricks on AWS".
You are asked to set up an AUTO LOADER to process incoming data; this data arrives in JSON format and gets dropped into cloud object storage, and you are required to process the data as soon as it arrives in cloud storage. Which of the following statements is correct? a. AUTO LOADER is native to Delta Lake and cannot support external cloud object storage b. AUTO LOADER has to be triggered from an external process when the file arrives in the cloud storage c. AUTO LOADER needs to be converted to a Structured Streaming process d. AUTO LOADER can only process continuous data when stored in Delta Lake e. AUTO LOADER can support the file notification method so it can process data as it arrives
e. AUTO LOADER can support file notification method so it can process data as it arrives Explanation: Auto Loader supports two modes when ingesting new files from cloud object storage. Directory listing: Auto Loader identifies new files by listing the input directory, using a directory polling approach. File notification: Auto Loader can automatically set up a notification service and queue service that subscribe to file events from the input directory. File notification is more efficient and can be used to process data in near real time as it arrives in cloud object storage. Choosing between file notification and directory listing modes | Databricks on AWS
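As a minimal sketch, an Auto Loader stream in file notification mode might be configured as follows. The paths, schema, and table name are hypothetical placeholders, and this only runs on a Databricks runtime where the `cloudFiles` source and the global `spark` session are available:

```python
# Hypothetical Auto Loader configuration (Databricks only); paths are placeholders.
df = (spark.readStream
      .format("cloudFiles")                           # Auto Loader source
      .option("cloudFiles.format", "json")            # incoming files are JSON
      .option("cloudFiles.useNotifications", "true")  # file notification mode
      .schema("deviceId INT, dateTimeCollected TIMESTAMP")  # assumed schema
      .load("s3://<bucket>/landing/"))

(df.writeStream
   .option("checkpointLocation", "s3://<bucket>/_checkpoints/landing/")
   .table("bronze_events"))
```

With `cloudFiles.useNotifications` set to `true`, Auto Loader subscribes to file events instead of repeatedly listing the input directory.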
Unity Catalog helps you manage which of the below resources in Databricks at the account level? a. Tables b. ML Models c. Dashboards d. Metastores and Catalogs e. All of the above
e. All of the above Explanation: The answer is all of the above. Unity Catalog is a unified governance solution for all data and AI assets, including files, tables, machine learning models, and dashboards, in your lakehouse on any cloud.
How does the Lakehouse replace the dependency on using data lakes and data warehouses in a data and analytics solution? a. Open, direct access to data stored in standard data formats. b. Supports ACID transactions. c. Supports BI and Machine learning workloads d. Support for end-to-end streaming and batch workloads e. All the above
e. All the above Explanation: The Lakehouse combines the benefits of a data warehouse and a data lake (Lakehouse = Data Lake + Data Warehouse), providing all of the benefits listed above.
Which of the following statements is true about Databricks Repos? a. You can approve the pull request if you are the owner of Databricks repos b. A workspace can only have one instance of git integration c. Databricks Repos and Notebook versioning are the same features d. You cannot create a new branch in Databricks repos e. Databricks repos allow you to comment and commit code changes and push them to a remote branch
e. Databricks repos allow you to comment and commit code changes and push them to a remote branch Explanation: See the diagram below to understand the roles Databricks Repos and the Git provider play when building a CI/CD workflow. All the steps highlighted in yellow can be done in Databricks Repos; all the steps highlighted in gray are done in a Git provider like GitHub or Azure DevOps.
John Smith is a newly joined team member in the Marketing team who currently has read access to the sales table but does not have the ability to delete rows from the table. Which of the following commands helps you grant this ability? a. GRANT USAGE ON TABLE table_name TO [email protected] b. GRANT DELETE ON TABLE table_name TO [email protected] c. GRANT DELETE TO TABLE table_name ON [email protected] d. GRANT MODIFY TO TABLE table_name ON [email protected] e. GRANT MODIFY ON TABLE table_name TO [email protected]
e. GRANT MODIFY ON TABLE table_name TO [email protected] Explanation: The answer is GRANT MODIFY ON TABLE table_name TO [email protected]; please note that INSERT, UPDATE, and DELETE are combined into one privilege called MODIFY. Below is the list of privileges that can be granted to a user or a group: SELECT: gives read access to an object. CREATE: gives the ability to create an object (for example, a table in a schema). MODIFY: gives the ability to add, delete, and modify data to or from an object. USAGE: does not give any abilities, but is an additional requirement to perform any action on a schema object. READ_METADATA: gives the ability to view an object and its metadata. CREATE_NAMED_FUNCTION: gives the ability to create a named UDF in an existing catalog or schema. MODIFY_CLASSPATH: gives the ability to add files to the Spark classpath. ALL PRIVILEGES: gives all privileges (is translated into all the above privileges).
Once a cluster is deleted, which of the below additional actions needs to be performed by the administrator? a. Remove virtual machines; storage and networking are automatically dropped b. Drop storage disks; virtual machines and networking are automatically dropped c. Remove networking; virtual machines and storage disks are automatically dropped d. Remove logs e. No action needs to be performed. All resources are automatically removed.
e. No action needs to be performed. All resources are automatically removed. Explanation: When a cluster is deleted, Databricks automatically releases the underlying cloud resources (virtual machines, storage disks, and networking); no manual cleanup by the administrator is required.
How can the VACUUM and OPTIMIZE commands be used to manage the DELTA lake? a. VACUUM command can be used to compact small parquet files, and the OPTIMIZE command can be used to delete parquet files that are marked for deletion/unused. b. VACUUM command can be used to delete empty/blank parquet files in a delta table. OPTIMIZE command can be used to update stale statistics on a delta table. c. VACUUM command can be used to compress the parquet files to reduce the size of the table, OPTIMIZE command can be used to cache frequently used delta tables for better performance. d. VACUUM command can be used to delete empty/blank parquet files in a delta table, OPTIMIZE command can be used to cache frequently used delta tables for better performance. e. OPTIMIZE command can be used to compact small parquet files, and the VACUUM command can be used to delete parquet files that are marked for deletion/unused.
e. OPTIMIZE command can be used to compact small parquet files, and the VACUUM command can be used to delete parquet files that are marked for deletion/unused. Explanation: VACUUM: You can remove files that are no longer referenced by a Delta table and are older than the retention threshold by running the VACUUM command on the table. VACUUM is not triggered automatically. The default retention threshold for the files is 7 days. To change this behavior, see Configure data retention for time travel. OPTIMIZE: Using OPTIMIZE you can compact data files on Delta Lake, which can improve the speed of read queries on the table. Too many small files can significantly degrade the performance of the query.
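As an illustration, the two commands might be used together like this (the table name is hypothetical; RETAIN 168 HOURS matches the 7-day default retention threshold):

```sql
OPTIMIZE sales;                 -- compact small parquet files into larger ones
VACUUM sales RETAIN 168 HOURS;  -- delete unreferenced files older than 7 days
```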
You were asked to identify the number of times a temperature sensor exceeded the threshold temperature (100.00) for each device. Each row contains 5 readings collected every 5 minutes. Fill in the blanks with the appropriate functions. Schema: deviceId INT, deviceTemp ARRAY<double>, dateTimeCollected TIMESTAMP sample data: deviceId | deviceTemp | dateTimeCollected 1 | [99.00,99.00,99.00,100.10,100.9] | 10-10-2021 10:10:00 1 | [99.00,99.00,100.00,100.15,102] | 10-10-2021 10:15:00 1 | [99.00,99.00,100.00,100.20,101] | 10-10-2021 10:20:00 SELECT deviceId, __(__(__(deviceTemp, i -> i > 100.00))) FROM devices GROUP BY deviceId a. SUM, COUNT, SIZE b. SUM, SIZE, SLICE c. SUM, SIZE, ARRAY_CONTAINS d. SUM, SIZE, ARRAY_FILTER e. SUM, SIZE, FILTER
e. SUM, SIZE, FILTER Explanation: The FILTER function can be used to filter an array based on an expression. The SIZE function can be used to get the size of an array. SUM is used to calculate the total per device.
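The same logic can be mirrored in plain Python to sanity-check the intent: FILTER keeps the readings above the threshold, SIZE counts them per row, and SUM totals the counts per device (using the sample rows from the question):

```python
# Sample rows from the question: (deviceId, deviceTemp array)
rows = [
    (1, [99.00, 99.00, 99.00, 100.10, 100.9]),
    (1, [99.00, 99.00, 100.00, 100.15, 102.0]),
    (1, [99.00, 99.00, 100.00, 100.20, 101.0]),
]

totals = {}
for device_id, temps in rows:
    above = [t for t in temps if t > 100.00]  # FILTER(deviceTemp, i -> i > 100.00)
    totals[device_id] = totals.get(device_id, 0) + len(above)  # SIZE, then SUM

print(totals)  # {1: 6}
```

Note that 100.00 itself is not counted, since the predicate is strictly greater than the threshold.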
One of the queries in the Databricks SQL Dashboard takes a long time to refresh. Which of the below steps can be taken to identify the root cause of this issue? a. Restart the SQL endpoint b. Select the SQL endpoint cluster, Spark UI, SQL tab to see the execution plan and time spent in each step c. Run optimize and Z ordering d. Change the Spot Instance Policy from "Cost optimized" to "Reliability Optimized." e. Use Query History to view queries, select the query, and check the query profile to see time spent in each step
e. Use Query History to view queries, select the query, and check the query profile to see time spent in each step Explanation: The answer is to use Query History, view the queries, select the query, and check the query profile to see the time spent in each step. Here is the view of the query profile; for more info use the link https://docs.microsoft.com/en-us/azure/databricks/sql/admin/query-profile As you can see, the Databricks SQL query profile is quite different from the Spark UI and provides much clearer information on how time is spent across queries and in each step.
You are looking to process the data based on two variables: process if the department is "supply chain" or if the process flag is set to True. Which of the following Python statements accomplishes this? a. if department = "supply chain" | process: b. if department == "supply chain" or process = TRUE: c. if department == "supply chain" | process == TRUE: d. if department == "supply chain" | if process == TRUE: e. if department == "supply chain" or process:
e. if department == "supply chain" or process: Explanation: In Python, or is the boolean operator (| is the bitwise operator), equality is tested with ==, and a truthy check on process is preferred over comparing it to True.
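A quick runnable check of the correct expression, using the boolean `or` (not the bitwise `|`) and a truthy test on the flag; the helper function and sample values are hypothetical:

```python
def should_process(department, process):
    # Boolean `or` short-circuits; `process` is tested for truthiness
    # rather than compared to True.
    return department == "supply chain" or process

print(should_process("supply chain", False))  # True
print(should_process("marketing", True))      # True
print(should_process("marketing", False))     # False
```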