Databricks - Data Engineer Associate Practice Exam 2


You are working on a table called orders, which contains data for 2021, and a second table called orders_archive, which contains data for 2020. You need to combine the data from the two tables, and there may be identical rows in both tables. You want to combine the results from both tables and eliminate the duplicate rows. Which of the following SQL statements accomplishes this? a. SELECT * FROM orders UNION SELECT * FROM orders_archive b. SELECT * FROM orders INTERSECT SELECT * FROM orders_archive c. SELECT * FROM orders UNION ALL SELECT * FROM orders_archive d. SELECT * FROM orders_archive MINUS SELECT * FROM orders e. SELECT DISTINCT * FROM orders JOIN orders_archive ON orders.id = orders_archive.id

a. SELECT * FROM orders UNION SELECT * FROM orders_archive Explanation The answer is SELECT * FROM orders UNION SELECT * FROM orders_archive. UNION and UNION ALL are set operators: UNION combines the output from both queries and also eliminates duplicates, while UNION ALL combines the output from both queries and keeps duplicate rows.
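As an illustration, here is a minimal PySpark sketch you could run in a notebook to compare the two operators; the tiny single-column tables and the overlapping id value are made up for the example.

spark.createDataFrame([(1,), (2,)], ["id"]).createOrReplaceTempView("orders")
spark.createDataFrame([(2,), (3,)], ["id"]).createOrReplaceTempView("orders_archive")

# UNION removes the duplicate row with id = 2
spark.sql("SELECT * FROM orders UNION SELECT * FROM orders_archive").count()      # 3 rows
# UNION ALL keeps it
spark.sql("SELECT * FROM orders UNION ALL SELECT * FROM orders_archive").count()  # 4 rows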

A notebook accepts an input parameter that is assigned to a Python variable called department, and this parameter is optional. You want to control the flow of the code using this parameter: if the department variable has a value, execute the code; if no department value is passed, skip the code execution. How do you achieve this using Python? a. if department is not None: #Execute code else: pass b. if (department is not None) #Execute code else pass c. if department is not None: #Execute code end: pass d. if department is not None: #Execute code then: pass e. if department is None: #Execute code else: pass

a. if department is not None: #Execute code else: pass
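A minimal sketch of how this pattern might look in a Databricks notebook, assuming the parameter is passed in as a widget named "department"; the widget name and the load logic are illustrative.

dbutils.widgets.text("department", "")                     # optional parameter with an empty default
department = dbutils.widgets.get("department") or None     # treat an empty string as "not passed"

if department is not None:
    print(f"Running the load for department: {department}")  # execute the code
else:
    pass                                                    # no department supplied, skip execution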

Which of the following is a true statement about global temporary views? a. A global temporary view is available only on the cluster where it was created; when the cluster restarts, the global temporary view is automatically dropped. b. A global temporary view is available on all clusters for a given workspace c. A global temporary view persists even if the cluster is restarted d. A global temporary view is stored in a user database e. A global temporary view is automatically dropped after 7 days

a. A global temporary view is available only on the cluster where it was created; when the cluster restarts, the global temporary view is automatically dropped. Explanation The answer is: a global temporary view is available only on the cluster where it was created. Two types of temporary views can be created: session-scoped and global. A session-scoped temporary view is only available within a Spark session, so another notebook on the same cluster cannot access it, and if the notebook is detached and reattached the temporary view is lost. A global temporary view is available to all notebooks on the cluster; if the cluster restarts, the global temporary view is lost.
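A minimal sketch of both view types, using an illustrative DataFrame; note that a global temporary view is queried through the global_temp schema.

df = spark.range(5)

df.createOrReplaceTempView("demo_tv")             # session-scoped: visible only to this Spark session
df.createOrReplaceGlobalTempView("demo_gtv")      # cluster-scoped: visible to other notebooks on the same cluster

spark.sql("SELECT * FROM demo_tv").show()
spark.sql("SELECT * FROM global_temp.demo_gtv").show()   # global temp views live in the global_temp schema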

Which of the following is a correct statement on how the data is organized in storage when managing a DELTA table? a. All of the data is broken down into one or many parquet files, log files are broken down into one or many JSON files, and each transaction creates new data file(s) and a log file. b. All of the data and the log are stored in a single parquet file c. All of the data is broken down into one or many parquet files, but the log is stored as a single JSON file, and every transaction creates new data file(s) and appends to the log file. d. All of the data is broken down into one or many parquet files, and the log file is removed once the transaction is committed. e. All of the data is stored in one parquet file, and log files are broken down into one or many JSON files.

a. All of the data is broken down into one or many parquet files, log files are broken down into one or many JSON files, and each transaction creates new data file(s) and a log file. Explanation The answer is: all of the data is broken down into one or many parquet files, the log files are broken down into one or many JSON files, and each transaction creates new data file(s) and a log file. A Delta table's storage location contains the parquet data files alongside a _delta_log directory that holds the JSON transaction log files.
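A sketch of what listing a Delta table's storage location might show; the path is illustrative and the output is a typical example, not exact file names.

for f in dbutils.fs.ls("dbfs:/mnt/delta/transactions"):
    print(f.path)

# Typical layout (illustrative):
# dbfs:/mnt/delta/transactions/_delta_log/                      <- JSON log files, e.g. 00000000000000000000.json, one per commit
# dbfs:/mnt/delta/transactions/part-00000-<uuid>.snappy.parquet <- data files
# dbfs:/mnt/delta/transactions/part-00001-<uuid>.snappy.parquet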

You noticed that a colleague is manually copying notebooks with a _bkp suffix to store previous versions. Which of the following features would you recommend instead? a. Databricks notebooks support change tracking and versioning b. Databricks notebooks should be copied to a local machine and source control set up locally to version the notebooks c. Databricks notebooks can be exported into dbc archive files and stored in the data lake d. Databricks notebooks can be exported as HTML and imported at a later time

a. Databricks notebooks support change tracking and versioning Explanation The answer is: Databricks notebooks support automatic change tracking and versioning. While editing a notebook, open the version history panel on the right side to view all changes; every change you make is captured and saved.

How do you create a delta live tables pipeline and deploy using DLT UI? a. Within the Workspace UI, click on Workflows, select Delta Live tables and create a pipeline and select the notebook with DLT code. b. Under Cluster UI, select SPARK UI and select Structured Streaming and click create pipeline and select the notebook with DLT code. c. There is no UI, you can only setup DELTA LIVE TABLES using Python and SQL API and select the notebook with DLT code. d. Use VS Code and download DBX plugin, once the plugin is loaded you can build DLT pipelines and select the notebook with DLT code. e. Within the Workspace UI, click on SQL Endpoint, select Delta Live tables and create pipeline and select the notebook with DLT code.

a. Within the Workspace UI, click on Workflows, select Delta Live tables and create a pipeline and select the notebook with DLT code. Explanation The answer is, Within the Workspace UI, click on Workflows, select Delta Live tables and create a pipeline and select the notebook with DLT code. https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-quickstart.html

What is the expected output of the query SELECT COUNT(DISTINCT *) FROM user on this table?
userId | username    | email
1      | john.smith  | john.smith@com
2      | NULL        | david.clear@com
3      | kevin.smith | kevin.smith@com
a. 3 b. 2 c. 1 d. 0

b. 2 Explanation The answer is 2; COUNT(DISTINCT *) ignores rows in which any column is NULL, so the row with a NULL username is not counted.

You are working to ingest millions of files that get uploaded to cloud object storage for consumption, and you are asked to build a process to ingest this data. The schema of the files is expected to change over time, and the ingestion process should be able to handle these changes automatically. Which of the following methods can be used to ingest the data incrementally? a. AUTO APPEND b. AUTO LOADER c. COPY INTO d. Structured Streaming e. Checkpoint

b. AUTO LOADER Explanation The answer is AUTO LOADER. Use Auto Loader instead of the COPY INTO SQL command when: you want to load data from a file location that contains files in the order of millions or more (Auto Loader can discover files more efficiently than COPY INTO and can split file processing into multiple batches); you need the file notification method (COPY INTO supports only directory listing, while Auto Loader can also ingest files as they arrive in cloud object storage by leveraging cloud provider queues and triggers together with Spark Structured Streaming); or your data schema evolves frequently (Auto Loader provides better support for schema inference and evolution; see Configuring schema inference and evolution in Auto Loader).

You are working on a production job failure for a job set up on a job cluster; the failure is due to a data issue. What type of cluster do you need to start to investigate and analyze the data? a. A job cluster can be used to analyze the problem b. An all-purpose (interactive) cluster is the recommended way to run commands and view the data. c. The existing job cluster can be used to investigate the issue d. A Databricks SQL endpoint can be used to investigate the issue

b. An all-purpose (interactive) cluster is the recommended way to run commands and view the data. Explanation The answer is: an all-purpose (interactive) cluster is the recommended way to run commands and view the data. A job cluster does not provide a way for a user to interact with a notebook once the job is submitted, but an interactive cluster allows you to display data, view visualizations, and write or edit queries, which makes it a perfect fit for investigating and analyzing the data.

Which of the following locations in the Databricks product architecture hosts the notebooks and jobs? a. Data plane b. Control plane c. Databricks Filesystem d. JDBC data source e. Databricks web application

b. Control plane Explanation The answer is Control plane. Databricks operates most of its services out of a control plane and a data plane; note that serverless features such as SQL endpoints and DLT compute use shared compute in the control plane. Control plane (stored in the Databricks cloud account): includes the backend services that Databricks manages in its own Azure account; notebook commands and many other workspace configurations are stored in the control plane and encrypted at rest. Data plane (stored in the customer cloud account): managed by your Azure account, it is where your data resides and where data is processed; you can use Azure Databricks connectors so that your clusters can connect to external data sources outside your Azure account to ingest data or for storage.

You are still noticing query slowness after running OPTIMIZE, which resolved the small-files problem. The column you are using to filter the data (transactionId) has high cardinality and is an auto-incrementing number. Which Delta optimization can you enable to filter data effectively based on this column? a. Create a BLOOM FILTER index on transactionId b. Perform OPTIMIZE with ZORDER on transactionId c. transactionId has high cardinality, you cannot enable any optimization. d. Increase the cluster size and enable delta optimization e. Increase the driver size and enable delta optimization

b. Perform OPTIMIZE with ZORDER on transactionId Explanation The answer is: perform OPTIMIZE with ZORDER BY transactionId. Here is a simple explanation of how Z-ordering works: once the data is ordered, only the data that is needed is brought into Spark's memory when a file is scanned, because the column min/max statistics of each data file tell Delta which files need to be scanned.
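A minimal sketch of what running this might look like from a notebook; the table name sales and the filter value are illustrative.

spark.sql("OPTIMIZE sales ZORDER BY (transactionId)")   # co-locates rows with similar transactionId values in the same files

# Subsequent filters on transactionId can now skip most data files based on per-file min/max statistics
spark.sql("SELECT * FROM sales WHERE transactionId = 981723").show()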

The operations team is interested in monitoring a recently launched product; the team wants to set up an email alert when the number of units sold increases by more than 10,000 units, and they want to monitor this every 5 minutes. Fill in the blanks below to finish the steps we need to take: · Create a ___ query that calculates total units sold · Set up a ____ on the query with trigger condition Units Sold > 10,000 · Set up a ____ schedule to run every 5 minutes · Add destination ______ a. Python, Job, SQL Cluster, email address b. SQL, Alert, Refresh, email address c. SQL, Job, SQL Cluster, email address d. SQL, Job, Refresh, email address e. Python, Job, Refresh, email address

b. SQL, Alert, Refresh, email address Explanation The answer is SQL, Alert, Refresh, email address. Here are the steps from the Databricks documentation for creating an alert on a single column of a query. Do one of the following: click Create in the sidebar and select Alert, or click Alerts in the sidebar and click the + New Alert button. Search for a target query (to alert on multiple columns, you need to modify your query; see Alert on multiple columns). In the Trigger when field, configure the alert: the Value column drop-down controls which field of your query result is evaluated, the Condition drop-down controls the logical operation to be applied, and the Threshold text input is compared against the Value column using the Condition you specify. Note: if a target query returns multiple records, Databricks SQL alerts act on the first one; as you change the Value column setting, the current value of that field in the top row is shown beneath it. In the When triggered, send notification field, select how many notifications are sent when your alert is triggered: Just once (send a notification when the alert status changes from OK to TRIGGERED), Each time alert is evaluated (send a notification whenever the alert status is TRIGGERED, regardless of its status at the previous evaluation), or At most every (send a notification whenever the alert status is TRIGGERED at a specific interval, which lets you avoid notification spam for alerts that trigger often). Regardless of which notification setting you choose, you receive a notification whenever the status goes from OK to TRIGGERED or from TRIGGERED to OK; the schedule settings affect how many notifications you will receive if the status remains TRIGGERED from one execution to the next (for details, see Notification frequency). In the Template drop-down, choose a template: Use default template (the alert notification is a message with links to the Alert configuration screen and the Query screen) or Use custom template (the alert notification includes more specific information about the alert; a box with input fields for subject and body is displayed, any static content is valid, and you can incorporate built-in template variables such as ALERT_STATUS).

Which of the following tools provides Data Access Control, Access Audit, Data Lineage, and Data Discovery? a. DELTA LIVE Pipelines b. Unity Catalog c. Data Governance d. DELTA lake e. Lakehouse

b. Unity Catalog

The data engineering team is required to share data with the data science team, and both teams use different workspaces in the same organization. Which of the following techniques can be used to simplify sharing data across workspaces? *Please note the question is asking how data is shared within an organization across multiple workspaces. a. Data Sharing b. Unity Catalog c. DELTA lake d. Use a single storage location e. DELTA LIVE Pipelines

b. Unity Catalog Explanation The answer is Unity Catalog. Unity Catalog works at the account level: it can create a metastore and attach that metastore to many workspaces, so a single metastore can be shared by both workspaces. Prior to Unity Catalog, the option was to use a single cloud object storage location and manually mount it in the second Databricks workspace; Unity Catalog really simplifies that. Review product features: https://databricks.com/product/unity-catalog

Which of the following SQL statements can replace Python variables in Databricks SQL code when the notebook is set in SQL mode?
%python
table_name = "sales"
schema_name = "bronze"

%sql
SELECT * FROM ____________________
a. SELECT * FROM f{schema_name.table_name} b. SELECT * FROM {schema_name.table_name} c. SELECT * FROM ${schema_name}.${table_name} d. SELECT * FROM schema_name.table_name

c. SELECT * FROM ${schema_name}.${table_name} Explanation The answer is SELECT * FROM ${schema_name}.${table_name}.
%python
table_name = "sales"
schema_name = "bronze"

%sql
SELECT * FROM ${schema_name}.${table_name}
${python_variable} substitutes Python variables into Databricks SQL code.

Direct queries on external files have limited options; you want to create an external table for pipe-delimited CSV files with a header. Fill in the blanks to complete the CREATE TABLE statement: CREATE TABLE sales (id int, unitsSold int, price FLOAT, items STRING) ________ ________ LOCATION "dbfs:/mnt/sales/*.csv" a. FORMAT CSV OPTIONS ("true", "|") b. USING CSV TYPE ("true", "|") c. USING CSV OPTIONS (header = "true", delimiter = "|") d. FORMAT CSV FORMAT TYPE (header = "true", delimiter = "|") e. FORMAT CSV TYPE (header = "true", delimiter = "|")

c. USING CSV OPTIONS (header = "true", delimiter = "|") Explanation The answer is USING CSV OPTIONS (header = "true", delimiter = "|"). Here is the syntax to create an external table with additional options:
CREATE TABLE table_name (col_name1 col_type1, ...)
USING data_source
OPTIONS (key1 = 'value1', key2 = 'value2')
LOCATION '/location'

At the end of the inventory process, a file gets uploaded to cloud object storage, and you are asked to build a process to ingest the data incrementally; the schema of the file is expected to change over time, and the ingestion process should be able to handle these changes automatically. Below is the Auto Loader command to load the data; fill in the blanks for successful execution of the code.
spark.readStream
  .format("cloudfiles")
  .option("_______", "csv")
  .option("_______", "dbfs:/location/checkpoint/")
  .load(data_source)
  .writeStream
  .option("_______", "dbfs:/location/checkpoint/")
  .option("_______", "true")
  .table(table_name)
a. format, checkpointlocation, schemalocation, overwrite b. cloudfiles.format, checkpointlocation, cloudfiles.schemalocation, overwrite c. cloudfiles.format, cloudfiles.schemalocation, checkpointlocation, mergeSchema d. cloudfiles.format, cloudfiles.schemalocation, checkpointlocation, overwrite e. cloudfiles.format, cloudfiles.schemalocation, checkpointlocation, append

c. cloudfiles.format, cloudfiles.schemalocation, checkpointlocation, mergeSchema Explanation The answer is cloudfiles.format, cloudfiles.schemalocation, checkpointlocation, mergeSchema. Here is the end-to-end syntax of the streaming ELT; the complete list of options is documented at Auto Loader options | Databricks on AWS.
spark.readStream
  .format("cloudfiles")                                               # stream data source that reads data as it arrives, based on the trigger
  .option("cloudfiles.format", "csv")                                 # format of the incoming files
  .option("cloudfiles.schemalocation", "dbfs:/location/checkpoint/")  # location to store the inferred schema and subsequent changes
  .load(data_source)
  .writeStream
  .option("checkpointlocation", "dbfs:/location/checkpoint/")         # location of the stream's checkpoint
  .option("mergeSchema", "true")                                      # merge the schema across files; enabled by default for Auto Loader when inferring the schema
  .table(table_name)                                                  # target table

Which of the following Python statements can be used to replace the schema name and table name in the query statement? a. table_name = "sales" schema_name = "bronze" query = f"select * from schema_name.table_name" b. table_name = "sales" schema_name = "bronze" query = "select * from {schema_name}.{table_name}" c. table_name = "sales" schema_name = "bronze" query = f"select * from {schema_name}.{table_name}" d. table_name = "sales" schema_name = "bronze" query = f"select * from + schema_name +"."+table_name"

c. table_name = "sales" schema_name = "bronze" query = f"select * from { schema_name}.{table_name}" xplanation Answer is table_name = "sales" query = f"select * from {schema_name}.{table_name}" f strings can be used to format a string. f" This is string {python variable}" https://realpython.com/python-f-strings/

What is the purpose of a silver layer in Multi hop architecture? a. Replaces a traditional data lake b. Efficient storage and querying of full and unprocessed history of data c. A schema is enforced, with data quality checks. d. Refined views with aggregated data e. Optimized query performance for business-critical data

c. A schema is enforced, with data quality checks. Explanation The answer is: a schema is enforced, with data quality checks. Medallion Architecture - Databricks. Silver layer: reduces data storage complexity, latency, and redundancy; optimizes ETL throughput and analytic query performance; preserves the grain of the original data (without aggregation); eliminates duplicate records; enforces the production schema; applies data quality checks and quarantines corrupt data. Exam focus: understand the role of each layer (bronze, silver, gold) in the medallion architecture; you will see varying questions targeting each layer and its purpose.

The data engineering team has a job set up with a task that loads data into a reporting table every day at 8:00 AM and takes about 20 minutes. The operations team is planning to use that data to run a second job, so they need access to the latest complete set of data. What is the best way to orchestrate this job setup? a. Add the operations reporting task in the same job and set the data engineering task to depend on the operations reporting task b. Set up a second job to run at 8:20 AM in the same workspace c. Add the operations reporting task in the same job and set the operations reporting task to depend on the data engineering task d. Use Auto Loader to run every 20 minutes to read the initial table, set the trigger to once, and create a second job e. Set up a Delta Live Table based on the first table and set the job to run in continuous mode

c. Add the operations reporting task in the same job and set the operations reporting task to depend on the data engineering task Explanation The answer is: add the operations reporting task in the same job and set the operations reporting task to depend on the data engineering task.

As a Data Engineer, you were asked to create a delta table to store the below transaction data:
transactionId | transactionDate        | unitsSold
1             | 01-01-2021 09:10:24 AM | 100
2             | 01-01-2021 10:30:15 AM | 100
a. CREATE DELTA TABLE transactions ( transactionId int, transactionDate timestamp, unitsSold int) b. CREATE TABLE transactions ( transactionId int, transactionDate timestamp, unitsSold int) FORMAT DELTA c. CREATE TABLE transactions ( transactionId int, transactionDate timestamp, unitsSold int) d. CREATE TABLE USING DELTA transactions ( transactionId int, transactionDate timestamp, unitsSold int) e. CREATE TABLE transactions ( transactionId int, transactionDate timestamp, unitsSold int) LOCATION DELTA

c. CREATE TABLE transactions ( transactionId int, transactionDate timestamp, unitsSold int) Explanation The answer is CREATE TABLE transactions ( transactionId int, transactionDate timestamp, unitsSold int). When you create a table in Databricks, it is stored in DELTA format by default.
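A small sketch to verify the default from a notebook; DESCRIBE DETAIL reports the table's format, which should show delta even though the CREATE TABLE statement never mentions it (the table name comes from the question, the check itself is illustrative).

spark.sql("CREATE TABLE IF NOT EXISTS transactions (transactionId int, transactionDate timestamp, unitsSold int)")
spark.sql("DESCRIBE DETAIL transactions").select("format", "location").show(truncate=False)  # format column shows 'delta'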

Which of the following techniques structured streaming uses to ensure recovery of failures during stream processing? a. Checkpointing and Watermarking b. Write ahead logging and watermarking c. Checkpointing and write-ahead logging d. Delta time travel e. The stream will failover to available nodes in the cluster f. Checkpointing and Idempotent sinks

c. Checkpointing and write-ahead logging Explanation The answer is checkpointing and write-ahead logging. Structured Streaming uses checkpointing and write-ahead logs to record the offset range of the data being processed during each trigger interval.
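For context, a minimal sketch of where the checkpoint is configured on a streaming write; the table names and checkpoint path are illustrative.

query = (spark.readStream
    .table("sales")                                                      # streaming read from a Delta table
    .writeStream
    .format("delta")
    .option("checkpointLocation", "dbfs:/mnt/checkpoints/sales_mirror")  # offsets and write-ahead log state stored here for failure recovery
    .outputMode("append")
    .toTable("sales_mirror"))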

A newly joined data analyst requested read-only access to tables. Assuming you are the owner/admin, which section of the Databricks platform facilitates granting SELECT access to the user? a. Admin console b. User settings c. Data explorer d. Azure Databricks control plane IAM e. Azure RBAC

c. Data explorer Explanation Answer is Data Explorer https://docs.databricks.com/sql/user/data/index.html Data explorer lets you easily explore and manage permissions on databases and tables. Users can view schema details, preview sample data, and see table details and properties. Administrators can view and change owners, and admins and data object owners can grant and revoke permissions. To open data explorer, click Data in the sidebar.

You are designing an analytical solution to store structured data from your e-commerce platform and unstructured data from website traffic and the app store. How would you approach where to store this data? a. Use a traditional data warehouse for structured data and a data lakehouse for unstructured data. b. A data lakehouse can only store unstructured data and cannot enforce a schema c. A data lakehouse can store structured and unstructured data and can enforce a schema d. Traditional data warehouses are good for storing structured data and enforcing a schema

c. A data lakehouse can store structured and unstructured data and can enforce a schema Explanation The answer is: a data lakehouse can store structured and unstructured data and can enforce a schema. What Is a Lakehouse? - The Databricks Blog

How does a Delta Lake differ from a traditional data lake? a. Delta lake is Data warehouse service on top of data lake that can provide reliability, security, and performance b. Delta lake is a caching layer on top of data lake that can provide reliability, security, and performance c. Delta lake is an open storage format like parquet with additional capabilities that can provide reliability, security, and performance d. Delta lake is an open storage format designed to replace flat files with additional capabilities that can provide reliability, security, and performance e. Delta lake is proprietary software designed by Databricks that can provide reliability, security, and performance

c. Delta lake is an open storage format like parquet with additional capabilities that can provide reliability, security, and performance Explanation The answer is: Delta lake is an open storage format like parquet with additional capabilities that can provide reliability, security, and performance. Delta Lake is: · open source · built upon standard data formats · optimized for cloud object storage · built for scalable metadata handling. Delta Lake is not: · proprietary technology · a storage format · a storage medium · a database service or data warehouse.

A newly joined team member, John Smith, on the Marketing team currently does not have any access to the data and requires read access to the customers table. Which of the following statements can be used to grant access? a. GRANT SELECT, USAGE TO [email protected] ON TABLE customers b. GRANT READ, USAGE TO [email protected] ON TABLE customers c. GRANT SELECT, USAGE ON TABLE customers TO [email protected] d. GRANT READ, USAGE ON TABLE customers TO [email protected] e. GRANT READ, USAGE ON customers TO [email protected]

c. GRANT SELECT, USAGE ON TABLE customers TO [email protected] Explanation The answer is GRANT SELECT, USAGE ON TABLE customers TO [email protected]. Data object privileges - Azure Databricks | Microsoft Docs

Here is the example query:
-- Get order summary
create or replace table orders_summary as
select product_id, sum(order_count) order_count
from (
  select product_id, order_count from orders_instore
  union all
  select product_id, order_count from orders_online
)
group by product_id;

-- Get supply summary
create or replace table supply_summary as
select product_id, sum(supply_count) supply_count
from supply
group by product_id;

-- Get on-hand stock based on orders summary and supply summary
with stock_cte as (
  select nvl(s.product_id, o.product_id) as product_id,
         nvl(supply_count, 0) - nvl(order_count, 0) as on_hand
  from supply_summary s
  full outer join orders_summary o on s.product_id = o.product_id
)
select * from stock_cte
where on_hand = 0;

The data analyst team put together these queries to identify items that are out of stock based on orders and replenishment, but when they run them all together for the final output the team noticed it takes a really long time. You were asked to look at why the queries are running slow and identify steps to improve the performance; when you looked at it, you noticed all the queries run sequentially on a SQL endpoint cluster. Which of the following steps can be taken to improve the performance?

c. Increase the cluster size of the SQL endpoint. Explanation The answer is to increase the cluster size of the SQL endpoint. Here the queries run sequentially, and since a single query cannot span more than one cluster, adding more clusters won't speed up an individual query; increasing the cluster size will improve performance because each query can use the additional compute in the warehouse. In the exam, additional context will not be given; instead, look for cue words and determine whether the queries run sequentially or concurrently: if the queries run sequentially, scale up (more nodes); if the queries run concurrently (more users), scale out (more clusters). A SQL endpoint scales horizontally (scale-out) and vertically (scale-up), and you have to understand when to use which. Scale up: increase the size of the cluster from X-Small to Small, Medium, X-Large, and so on, which adds more worker nodes; if you are trying to improve the performance of a single query, the additional memory, nodes, and CPU in the cluster will improve performance. Scale out: add more clusters (increase the max number of clusters); if you are trying to improve throughput and run as many queries as possible, additional clusters will improve performance.

The data engineering team noticed that one of the jobs normally finishes in 15 minutes but randomly gets stuck when reading remote databases due to network packet drops. Which of the following steps can be used to improve the stability of the job? a. Use the Databricks REST API to monitor long-running jobs and issue a kill command b. Use the Job runs, active runs UI section to monitor and kill long-running jobs c. Modify the task to include a timeout that kills the job if it runs more than 15 minutes. d. Use the Spark job timeout setting in the Spark UI e. Use the cluster timeout setting in the job cluster UI

c. Modify the task to include a timeout that kills the job if it runs more than 15 minutes. Explanation The answer is: modify the task to include a timeout that kills the job if it runs more than 15 minutes. https://docs.microsoft.com/en-us/azure/databricks/data-engineering/jobs/jobs#timeout

A dataset has been defined using Delta Live Tables and includes an expectations clause: CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION FAIL What is the expected behavior when a batch of data containing data that violates these constraints is processed? a. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log. b. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log. c. Records that violate the expectation cause the job to fail d. Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset. e. Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table.

c. Records that violate the expectation cause the job to fail Explanation The answer is: records that violate the expectation cause the job to fail. Delta Live Tables supports three types of expectations to handle bad data in DLT pipelines; review the example constraints below.
Retain invalid records: use the expect operator when you want to keep records that violate the expectation; records that violate the expectation are added to the target dataset along with valid records:
CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01')
Drop invalid records: use expect or drop to prevent the processing of invalid records; records that violate the expectation are dropped from the target dataset:
CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION DROP ROW
Fail on invalid records: when invalid records are unacceptable, use expect or fail to halt execution immediately when a record fails validation; if the operation is a table update, the system atomically rolls back the transaction:
CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION FAIL UPDATE
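For reference, a hedged sketch of how the same expectations look in the DLT Python API; the dataset name events_validated and the source events_raw are made up for the example.

import dlt

@dlt.table
@dlt.expect_or_fail("valid_timestamp", "timestamp > '2020-01-01'")  # equivalent of ON VIOLATION FAIL UPDATE
def events_validated():
    # @dlt.expect(...) would keep violating rows; @dlt.expect_or_drop(...) would drop them
    return dlt.read("events_raw")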

What is the best way to query external CSV files located on DBFS storage to inspect the data using SQL? a. SELECT * FROM 'dbfs:/location/csv_files/' FORMAT = 'CSV' b. SELECT CSV.* FROM 'dbfs:/location/csv_files/' c. SELECT * FROM csv.`dbfs:/location/csv_files/` d. You cannot query external files directly, use COPY INTO to load the data into a table first e. SELECT * FROM 'dbfs:/location/csv_files/' USING CSV

c. SELECT * FROM csv.`dbfs:/location/csv_files/` Explanation The answer is SELECT * FROM csv.`dbfs:/location/csv_files/`. You can query external files stored on storage using the syntax SELECT * FROM format.`/location`, where format is csv, json, parquet, or text.

Which of the following programming languages can be used to build a Databricks SQL dashboard? a. Python b. Scala c. SQL d. R e. All of the above

c. SQL

The marketing team is launching a new campaign and wants to monitor its performance for the first two weeks. They would like to set up a dashboard with a refresh schedule that runs every 5 minutes. Which of the below steps can be taken to reduce the cost of this refresh over time? a. Reduce the size of the SQL cluster b. Reduce the max size of auto scaling from 10 to 5 c. Setup the dashboard refresh schedule to end in two weeks d. Change the spot instance policy from reliability optimized to cost optimized e. Always use an X-Small cluster

c. Setup the dashboard refresh schedule to end in two weeks

What is the underlying technology that makes Auto Loader work? a. Loader b. Delta Live Tables c. Structured Streaming d. DataFrames e. Live DataFrames

c. Structured Streaming Explanation The answer is Structured Streaming. Auto Loader is built on top of Structured Streaming and provides a Structured Streaming source called cloudFiles. Given an input directory path on cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing existing files in that directory.

You are asked to work on building a data pipeline, and you have noticed that you are working on a very large-scale ETL with many data dependencies. Which of the following tools can be used to address this problem? a. AUTO LOADER b. JOBS and TASKS c. SQL Endpoints d. DELTA LIVE TABLES e. STRUCTURED STREAMING with MULTI HOP

d. DELTA LIVE TABLES Explanation The answer is DELTA LIVE TABLES. DLT simplifies data dependencies by building a DAG of dependencies between live tables. Here is how those data dependencies are expressed without additional metadata:
create or replace live view customers
as select * from customers;

create or replace live view sales_orders_raw
as select * from sales_orders;

create or replace live view sales_orders_cleaned
as select s.*
   from live.sales_orders_raw s
   join live.customers c on c.customer_id = s.customer_id
   where c.city = 'LA';

create or replace live table sales_orders_in_la
as select * from live.sales_orders_cleaned;

Documentation on DELTA LIVE TABLES: https://databricks.com/product/delta-live-tables https://databricks.com/blog/2022/04/05/announcing-generally-availability-of-databricks-delta-live-tables-dlt.html DELTA LIVE TABLES addresses the following challenges when building ETL processes: complexities of large-scale ETL (hard to build and maintain dependencies, difficult to switch between batch and streaming); data quality and governance (difficult to monitor and enforce data quality, impossible to trace data lineage); difficult pipeline operations (poor observability at a granular data level, laborious error handling and recovery).

When you drop an external DELTA table using the SQL Command DROP TABLE table_name , how does it impact metadata(delta log, history), and data stored in the storage? a. Drops table from metastore, metadata(delta log, history)and data in storage b. Drops table from metastore, data but keeps metadata(delta log, history) in storage c. Drops table from metastore, metadata(delta log, history)but keeps the data in storage d. Drops table from metastore, but keeps metadata(delta log, history)and data in storage e. Drops table from metastore and data in storage but keeps metadata(delta log, history)

d. Drops table from metastore, but keeps metadata (delta log, history) and data in storage Explanation The answer is: drops the table from the metastore but keeps the metadata (delta log, history) and data in storage. When an external table is dropped, only the table definition is dropped from the metastore; everything else, including data and metadata (the Delta transaction log and time-travel history), remains in storage. The Delta log is considered part of the metadata because if you drop a column in a Delta table (managed or external), the column is not physically removed from the parquet files but is instead recorded in the Delta log, which makes the log a key metadata layer for a Delta table to work. Compare how external and managed Delta tables differ in how they are created and in what happens when the table is dropped.

Grant full privileges to the new marketing user Kevin Smith on the table sales. a. GRANT FULL PRIVILEGES TO [email protected] ON TABLE sales b. GRANT ALL PRIVILEGES TO [email protected] ON TABLE sales c. GRANT FULL PRIVILEGES ON TABLE sales TO [email protected] d. GRANT ALL PRIVILEGES ON TABLE sales TO [email protected] e. GRANT ANY PRIVILEGE ON TABLE sales TO [email protected]

d. GRANT ALL PRIVILEGES ON TABLE sales TO [email protected] Explanation The answer is GRANT ALL PRIVILEGES ON TABLE sales TO [email protected]. The general form is GRANT <privilege> ON <securable_type> <securable_name> TO <principal>. Here are the available privileges; ALL PRIVILEGES gives full access to an object. SELECT: gives read access to an object. CREATE: gives the ability to create an object (for example, a table in a schema). MODIFY: gives the ability to add, delete, and modify data to or from an object. USAGE: does not give any abilities, but is an additional requirement to perform any action on a schema object. READ_METADATA: gives the ability to view an object and its metadata. CREATE_NAMED_FUNCTION: gives the ability to create a named UDF in an existing catalog or schema. MODIFY_CLASSPATH: gives the ability to add files to the Spark class path. ALL PRIVILEGES: gives all privileges (is translated into all of the above privileges).

You are noticing that your job cluster takes 6 to 8 minutes to start, which is delaying your job from finishing on time. What steps can you take to reduce the cluster startup time? a. Set up a second job ahead of the first job to start the cluster, so the cluster is ready with resources when the job starts b. Use an all-purpose cluster instead to reduce cluster startup time c. Reduce the size of the cluster; the smaller the cluster size, the shorter it takes to start the cluster d. Use cluster pools to reduce the startup time of the jobs e. Use SQL endpoints to reduce the startup time

d. Use cluster pools to reduce the startup time of the jobs Explanation The answer is: use cluster pools to reduce the startup time of the jobs. Cluster pools allow us to reserve VMs ahead of time; when a new job cluster is created, VMs are grabbed from the pool. Note: while the VMs in the pool are waiting to be used by a cluster, only the Azure infrastructure cost is incurred; the Databricks runtime cost is only billed once a VM is allocated to a cluster. Here is a demo of how to set this up, along with some best practices: https://www.youtube.com/watch?v=FVtITxOabxg&ab_channel=DatabricksAcademy

Which of the following results in the creation of an external table? a. CREATE TABLE transactions (id int, desc string) USING DELTA LOCATION EXTERNAL b. CREATE TABLE transactions (id int, desc string) c. CREATE EXTERNAL TABLE transactions (id int, desc string) d. CREATE TABLE transactions (id int, desc string) TYPE EXTERNAL e. CREATE TABLE transactions (id int, desc string) LOCATION '/mnt/delta/transactions'

e. CREATE TABLE transactions (id int, desc string) LOCATION '/mnt/delta/transactions' Explanation The answer is CREATE TABLE transactions (id int, desc string) LOCATION '/mnt/delta/transactions'. Any time a table is created using LOCATION it is considered an external table. Below is the general syntax:
CREATE TABLE table_name ( column column_data_type, ...)
USING format
LOCATION "dbfs:/..."

spark.readStream.format("delta").table("sales").createOrReplaceTempView("streaming_view") Which of the following operations are not supported on a streaming dataset view? a. SELECT sum(unitssold) FROM streaming_view b. SELECT max(unitssold) FROM streaming_view c. SELECT id, sum(unitssold) FROM streaming_view GROUP BY id ORDER BY id d. SELECT id, count(*) FROM streaming_view GROUP BY id e. SELECT * FROM streadming_view ORDER BY id

e. SELECT * FROM streaming_view ORDER BY id Explanation The answer is SELECT * FROM streaming_view ORDER BY id. Note that sorting combined with GROUP BY works without any issues; see the notes on each option below. Certain operations are not allowed on streaming data: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations Multiple streaming aggregations (i.e. a chain of aggregations on a streaming DataFrame) are not yet supported on streaming Datasets. Limit and taking the first N rows are not supported on streaming Datasets. Distinct operations on streaming Datasets are not supported. Deduplication after aggregation is not supported on a streaming Dataset. Sorting operations are supported on streaming Datasets only after an aggregation and in Complete output mode; sorting without an aggregation function is not supported. In summary, for the options above: the sum and max aggregations run without issues on the stream, GROUP BY works, GROUP BY with ORDER BY works, but ORDER BY without a GROUP BY fails.
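A minimal sketch of the kind of test described above (the table and column names follow the question); the error for the unsupported query surfaces when the streaming query is actually started, for example via display() in a notebook.

stream_df = spark.readStream.format("delta").table("sales")
stream_df.createOrReplaceTempView("streaming_view")

ok_agg   = spark.sql("SELECT id, sum(unitssold) AS total FROM streaming_view GROUP BY id ORDER BY id")  # supported: sort after aggregation, Complete output mode
bad_sort = spark.sql("SELECT * FROM streaming_view ORDER BY id")                                         # unsupported: sort without aggregation

# display(ok_agg)    # runs as a streaming query
# display(bad_sort)  # raises an AnalysisException once the stream is started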

What is the best way to describe a data lakehouse compared to a data warehouse? a. A data lakehouse provides a relational system of data management b. A data lakehouse captures snapshots of data for version control purposes. c. A data lakehouse couples storage and compute for complete control. d. A data lakehouse utilizes proprietary storage formats for data. e. A data lakehouse enables both batch and streaming analytics.

e. A data lakehouse enables both batch and streaming analytics. Explanation The answer is: a data lakehouse enables both batch and streaming analytics. A lakehouse has the following key features:
Transaction support: in an enterprise lakehouse many data pipelines will often be reading and writing data concurrently; support for ACID transactions ensures consistency as multiple parties concurrently read or write data, typically using SQL.
Schema enforcement and governance: the lakehouse should have a way to support schema enforcement and evolution, supporting DW schema architectures such as star/snowflake schemas; the system should be able to reason about data integrity, and it should have robust governance and auditing mechanisms.
BI support: lakehouses enable using BI tools directly on the source data, which reduces staleness, improves recency, reduces latency, and lowers the cost of having to operationalize two copies of the data in both a data lake and a warehouse.
Storage is decoupled from compute: in practice this means storage and compute use separate clusters, so these systems are able to scale to many more concurrent users and larger data sizes; some modern data warehouses also have this property.
Openness: the storage formats they use are open and standardized, such as Parquet, and they provide an API so a variety of tools and engines, including machine learning and Python/R libraries, can efficiently access the data directly.
Support for diverse data types ranging from unstructured to structured data: the lakehouse can be used to store, refine, analyze, and access data types needed for many new data applications, including images, video, audio, semi-structured data, and text.
Support for diverse workloads: including data science, machine learning, and SQL and analytics; multiple tools might be needed to support all these workloads, but they all rely on the same data repository.
End-to-end streaming: real-time reports are the norm in many enterprises; support for streaming eliminates the need for separate systems dedicated to serving real-time data applications.

Which of the following describes how Databricks Repos can help facilitate CI/CD workflows on the Databricks Lakehouse Platform? a. Databricks Repos can facilitate the pull request, review, and approval process before merging branches b. Databricks Repos can merge changes from a secondary Git branch into a main Git branch c. Databricks Repos can be used to design, develop, and trigger Git automation pipelines d. Databricks Repos can store the single-source-of-truth Git repository e. Databricks Repos can commit or push code changes to trigger a CI/CD process

e. Databricks Repos can commit or push code changes to trigger a CI/CD process Explanation The answer is: Databricks Repos can commit or push code changes to trigger a CI/CD process. In a typical CI/CD workflow, the development steps (clone a repo, create a branch, commit and push changes) are done in Databricks Repos, while the automation steps (pull request, review, merge, build, and release pipelines) are done in a Git provider such as GitHub or Azure DevOps.

If you create a database sample_db with the statement CREATE DATABASE sample_db what will be the default location of the database in DBFS? a. Default location, DBFS:/user/ b. Default location, /user/db/ c. Default Storage account d. Statement fails "Unable to create database without location" e. Default Location, dbfs:/user/hive/warehouse

e. Default Location, dbfs:/user/hive/warehouse Explanation The answer is dbfs:/user/hive/warehouse; this is the default location where Spark stores user databases. The default can be changed using the spark.sql.warehouse.dir parameter (via the cluster Spark config or the session config), and you can also provide a custom location using the LOCATION keyword. FYI: DBFS = Databricks File System.
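A quick sketch of how you might check this from a notebook; the database name comes from the question, and the exact output depends on the workspace configuration.

print(spark.conf.get("spark.sql.warehouse.dir"))               # typically dbfs:/user/hive/warehouse

spark.sql("CREATE DATABASE IF NOT EXISTS sample_db")
spark.sql("DESCRIBE DATABASE sample_db").show(truncate=False)  # the Location row shows dbfs:/user/hive/warehouse/sample_db.db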

What is the purpose of a gold layer in Multi-hop architecture? a. Optimizes ETL throughput and analytic query performance b. Eliminate duplicate records c. Preserves grain of original data, without any aggregations d. Data quality checks and schema enforcement e. Powers ML applications, reporting, dashboards and adhoc reports.

e. Powers ML applications, reporting, dashboards and adhoc reports. Explanation The answer is: powers ML applications, reporting, dashboards, and ad hoc reports. Review the link below for more info: Medallion Architecture - Databricks. Gold layer: powers ML applications, reporting, dashboards, and ad hoc analytics; provides refined views of data, typically with aggregations; reduces strain on production systems; optimizes query performance for business-critical data. Exam focus: understand the role of each layer (bronze, silver, gold) in the medallion architecture; you will see varying questions targeting each layer and its purpose.

What is the purpose of the bronze layer in a Multi-hop architecture? a. Can be used to eliminate duplicate records b. Used as a data source for Machine learning applications. c. Perform data quality checks, corrupt data quarantined d. Contains aggregated data that is to be consumed into Silver e. Provides efficient storage and querying of full unprocessed history of data

e. Provides efficient storage and querying of full unprocessed history of data Explanation The answer is: provides efficient storage and querying of the full, unprocessed history of data. Medallion Architecture - Databricks. Bronze layer: raw copy of ingested data; replaces the traditional data lake; provides efficient storage and querying of the full, unprocessed history of data; no schema is applied at this layer. Exam focus: understand the role of each layer (bronze, silver, gold) in the medallion architecture; you will see varying questions targeting each layer and its purpose.

You are trying to create an object by joining two tables; it should be accessible to the data science team and should not get dropped if the cluster restarts or the notebook is detached. What type of object are you trying to create? a. Temporary view b. Global Temporary view c. Global Temporary view with cache option d. External view e. View

e. View Explanation The answer is View. A view can be used to join multiple tables, and because it is persisted in the metastore, others can access it and it survives cluster restarts and notebook detaches.
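A minimal sketch, assuming hypothetical orders and customers tables already exist in the metastore; the view name is illustrative.

spark.sql("""
  CREATE OR REPLACE VIEW sales_by_customer AS
  SELECT c.customer_id, c.customer_name, SUM(o.amount) AS total_amount
  FROM orders o
  JOIN customers c ON o.customer_id = c.customer_id
  GROUP BY c.customer_id, c.customer_name
""")
# The view definition lives in the metastore, so it survives cluster restarts and notebook detaches.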

