DP-203 Microsoft Data Engineer

Ace your homework & exams now with Quizwiz!

Question Set 2 Q33 You are developing a solution that will stream to Azure Stream Analytics. The solution will have both streaming data and reference data. Which input type should you use for the reference data? A. Azure Cosmos DB B. Azure Blob storage C. Azure IoT Hub D. Azure Event Hubs

Correct Answer: B Stream Analytics supports Azure Blob storage and Azure SQL Database as the storage layer for Reference Data. Reference: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-use-reference-data

Question Set 1 Q24 Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain description data that has an average length of 1.1 MB. You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics. You need to prepare the files to ensure that the data copies quickly. Solution: You modify the files to ensure that each row is more than 1 MB. Does this meet the goal? A. Yes B. No

Correct Answer: B Instead convert the files to compressed delimited text files. Reference: https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data

Question Set 2 Q27 You have an Azure Storage account and a data warehouse in Azure Synapse Analytics in the UK South region. You need to copy blob data from the storage account to the data warehouse by using Azure Data Factory. The solution must meet the following requirements: ✑ Ensure that the data remains in the UK South region at all times. ✑ Minimize administrative effort. Which type of integration runtime should you use? A. Azure integration runtime B. Azure-SSIS integration runtime C. Self-hosted integration runtime

Correct Answer: A Incorrect Answers: C: Self-hosted integration runtime is to be used On-premises. Reference: https://docs.microsoft.com/en-us/azure/data-factory/concepts-integration-runtime

Topic 2 - Question Set 2 Q15 You are designing a statistical analysis solution that will use custom proprietary Python functions on near real-time data from Azure Event Hubs. You need to recommend which Azure service to use to perform the statistical analysis. The solution must minimize latency. What should you recommend? A. Azure Synapse Analytics B. Azure Databricks C. Azure Stream Analytics D. Azure SQL Database

Correct Answer: C Reference: https://docs.microsoft.com/en-us/azure/event-hubs/process-data-azure-stream-analytics

Question Set 1 Q37 You are designing a partition strategy for a fact table in an Azure Synapse Analytics dedicated SQL pool. The table has the following specifications: ✑ Contain sales data for 20,000 products. Use hash distribution on a column named ProductID. ✑ Contain 2.4 billion records for the years 2019 and 2020. Which number of partition ranges provides optimal compression and performance for the clustered columnstore index? A. 40 B. 240 C. 400 D. 2,400

Correct Answer: A Each partition should have around 1 millions records. Dedicated SQL pools already have 60 partitions. We have the formula: Records/(Partitions*60)= 1 million Partitions= Records/(1 million * 60) Partitions= 2.4 x 1,000,000,000/(1,000,000 * 60) = 40 Note: Having too many partitions can reduce the effectiveness of clustered columnstore indexes if each partition has fewer than 1 million rows. Dedicated SQL pools automatically partition your data into 60 databases. So, if you create a table with 100 partitions, the result will be 6000 partitions. Reference: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-dedicated-sql-pool

Question Set 2 Q25 HOTSPOT - You are designing a real-time dashboard solution that will visualize streaming data from remote sensors that connect to the internet. The streaming data must be aggregated to show the average value of each 10-second interval. The data will be discarded after being displayed in the dashboard. The solution will use Azure Stream Analytics and must meet the following requirements: ✑ Minimize latency from an Azure Event hub to the dashboard. ✑ Minimize the required storage. ✑ Minimize development effort. What should you include in the solution? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point Hot Area: Questions: - Azure Stream Analytics input type - Azure Stream Analytics output type - Aggregation query location Answers: - Azure Event Hub - Azure SQL Database - Azure Stream Analytics - Microsoft Power BI

Correct Answer: Azure Event Hub, Microsoft Power BI, Azure Stream Analytics Reference: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-power-bi-dashboard

Question Set 1 Q28 You plan to implement an Azure Data Lake Storage Gen2 container that will contain CSV files. The size of the files will vary based on the number of events that occur per hour. File sizes range from 4 KB to 5 GB. You need to ensure that the files stored in the container are optimized for batch processing. What should you do? A. Convert the files to JSON B. Convert the files to Avro C. Compress the files D. Merge the files

Correct Answer: B XX Alternate answer: D Avro supports batch and is very relevant for streaming. Note: Avro is framework developed within Apache's Hadoop project. It is a row-based storage format which is widely used as a serialization process. AVRO stores its schema in JSON format making it easy to read and interpret by any program. The data itself is stored in binary format by doing it compact and efficient. Reference: https://www.adaltas.com/en/2020/07/23/benchmark-study-of-different-file-format/

Question Set 1 Q4 You have files and folders in Azure Data Lake Storage Gen2 for an Azure Synapse workspace as shown in the following exhibit. /topfolder/ - File1.csv - /folder1/ - File2.csv - /folder2/ - File3.csv - File4.csv You create an external table named ExtTable that has LOCATION='/topfolder/'. When you query ExtTable by using an Azure Synapse Analytics serverless SQL pool, which files are returned? A. File2.csv and File3.csv only B. File1.csv and File4.csv only C. File1.csv, File2.csv, File3.csv, and File4.csv D. File1.csv only

Correct Answer: C To run a T-SQL query over a set of files within a folder or set of folders while treating them as a single entity or rowset, provide a path to a folder or a pattern(using wildcards) over a set of files or folders. Reference:https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-data-storage#query-multiple-files-or-folders

Question Set 2 Q35 HOTSPOT - You are designing a monitoring solution for a fleet of 500 vehicles. Each vehicle has a GPS tracking device that sends data to an Azure event hub once per minute. You have a CSV file in an Azure Data Lake Storage Gen2 container. The file maintains the expected geographical area in which each vehicle should be. You need to ensure that when a GPS position is outside the expected area, a message is added to another event hub for processing within 30 seconds. The solution must minimize cost. What should you include in the solution? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Answer Area Service - An Azure Synapse Analytics Apache Spark Pool - An Azure Synapse Analytics serverless SQL Pool - Azure Data Factory - Azure Stream Analytics Window - Hopping - No window - Session - Tumbling Analysis Type - Event pattern matching - Lagged record comparison - Point within polygon - Polygon overlap

Correct Answer: Service: Azure Stream Analytics Window: Hopping Analysis type: Point within polygon Box 1: Azure Stream Analytics - Box 2: Hopping - Hopping window functions hop forward in time by a fixed period. It may be easy to think of them as Tumbling windows that can overlap and be emitted more often than the window size. Events can belong to more than one Hopping window result set. To make a Hopping window the same as a Tumbling window, specify the hop size to be the same as the window size. Box 3: Point within polygon - Reference: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions

Question Set 2 Q26 DRAG DROP - You have an Azure Stream Analytics job that is a Stream Analytics project solution in Microsoft Visual Studio. The job accepts data generated by IoT devices in the JSON format. You need to modify the job to accept data generated by the IoT devices in the Protobuf format. Which three actions should you perform from Visual Studio on sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order. Select and Place: Actions - Change the Event Serialization Format to Protobuf in the input.json file of the job and reference the DLL. - Add an Azure Stream Analytics Custom Deserializer Project (.NET) project to the solution. - Add .NET deserializer code for Protobuf to the custom deserializer project. - Add .NET deserializer code for Protobuf to the Stream Analytics project. - Add an Azure Stream Analytics Application project to the solution.

Correct Answer: 1 Add an Azure Stream Analytics Custom Deserializer Project (.NET) project to the solution. 2 Add .NET deserializer code for Protobuf to the custom deserializer project. 3 Add an Azure Stream Analytics Application project to the solution. YY Alternate Answer: 1 Add an Azure Stream Analytics Customer Deserializer Project(.net) project to the Solution 2 Add .net deseriliaizer Code to ProtoBuf to customer deserializer project 3 Change the event Serialization format to protobuf in the input.json File of the job and reference the DLL. Step 1: Add an Azure Stream Analytics Custom Deserializer Project (.NET) project to the solution.Create a custom deserializer - 1. Open Visual Studio and select File > New > Project. Search for Stream Analytics and select Azure Stream Analytics Custom Deserializer Project (.NET). Give the project a name, like Protobuf Deserializer. 2. In Solution Explorer, right-click your Protobuf Deserializer project and select Manage NuGet Packages from the menu. Then install theMicrosoft.Azure.StreamAnalytics and Google.Protobuf NuGet packages. 3. Add the MessageBodyProto class and the MessageBodyDeserializer class to your project. 4. Build the Protobuf Deserializer project. Step 2: Add .NET deserializer code for Protobuf to the custom deserializer projectAzure Stream Analytics has built-in support for three data formats: JSON, CSV, and Avro. With custom .NET deserializers, you can read data from other formats such as Protocol Buffer, Bond and other user defined formats for both cloud and edge jobs. Step 3: Add an Azure Stream Analytics Application project to the solutionAdd an Azure Stream Analytics project 1. In Solution Explorer, right-click the Protobuf Deserializer solution and select Add > New Project. Under Azure Stream Analytics > Stream Analytics, chooseAzure Stream Analytics Application. Name it ProtobufCloudDeserializer and select OK. 2. Right-click References under the ProtobufCloudDeserializer Azure Stream Analytics project. Under Projects, add Protobuf Deserializer. It should be automatically populated for you. Reference: https://docs.microsoft.com/en-us/azure/stream-analytics/custom-deserializer

Question Set 2 Q28 HOTSPOT - You have an Azure SQL database named Database1 and two Azure event hubs named HubA and HubB. The data consumed from each source is shown in the following table. You need to implement Azure Stream Analytics to calculate the average fare per mile by driver. How should you configure the Stream Analytics input for each source? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Answers HubA HubB Database1 All same choice: Stream Reference

Correct Answer: HubA: Stream HubB: Stream Database1: Reference HubA: Stream - HubB: Stream - Database1: Reference - Reference data (also known as a lookup table) is a finite data set that is static or slowly changing in nature, used to perform a lookup or to augment your data streams. For example, in an IoT scenario, you could store metadata about sensors (which don't change often) in reference data and join it with real time IoT data streams. Azure Stream Analytics loads reference data in memory to achieve low latency stream processing Reference: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-use-reference-data

Topic 2 - Question Set 2 Q23 DRAG DROP - You have an Azure Data Lake Storage Gen2 account that contains a JSON file for customers. The file contains two attributes named FirstName and LastName. You need to copy the data from the JSON file to an Azure Synapse Analytics table by using Azure Databricks. A new column must be created that concatenates the FirstName and LastName values. You create the following components: ✑ A destination table in Azure Synapse ✑ An Azure Blob storage container ✑ A service principal Which five actions should you perform in sequence next in is Databricks notebook? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order. Select and Place: Actions: Mount the Data Lake Storage onto DBFS Write the results to a table in Azure Synapse Perform transformations on the file Specify a temporary folder to stage the data Write the results to Data Lake Storage Read the file into a data frame Drop the data frame Perform transformations on the data frame Answer Area:

Correct Answer: Mount the data onto DBFS Read the file into a data frame Perform transformation on the frame Specify a temporary folder to stage the data Write the results to a table in Azure Synapse Step 1: Read the file into a data frame. You can load the json files as a data frame in Azure Databricks. Step 2: Perform transformations on the data frame. Step 3: Specify a temporary folder to stage the data Specify a temporary folder to use while moving data between Azure Databricks and Azure Synapse. Step 4: Write the results to a table in Azure Synapse. You upload the transformed data frame into Azure Synapse. You use the Azure Synapse connector for Azure Databricks to directly upload a dataframe as a table in a Azure Synapse. Step 5: Drop the data frame -Clean up resources. You can terminate the cluster. From the Azure Databricks workspace, select Clusters on the left. For the cluster to terminate, under Actions, point to the ellipsis (...) and select the Terminate icon. Reference: https://docs.microsoft.com/en-us/azure/azure-databricks/databricks-extract-load-sql-data-warehouse

Question Set 1 Q14 HOTSPOT - You have an Azure Synapse Analytics dedicated SQL pool that contains the users shown in the following table. Name Role User1 Server admin User2 db_datareader User1 executes a query on the database, and the query returns the results shown in the following exhibit. select c.name, ... from sys.masked_columns as c inner join sys.tables ... inner join sys.types ... where is_masked = 1; Results 1 BirthDate 2 Gender 3 EmailAddress 4 YearlyIncome User1 is the only user who has access to the unmasked data. Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic. NOTE: Each correct selection is worth one point. When user2 queries the YearlyIncome column When user1 queries the BirthDate column

Correct Answer: 0, values stored in the database Box 1: 0 - The YearlyIncome column is of the money data type.The Default masking function: Full masking according to the data types of the designated fields ✑ Use a zero value for numeric data types (bigint, bit, decimal, int, money, numeric, smallint, smallmoney, tinyint, float, real). Box 2: the values stored in the database Users with administrator privileges are always excluded from masking, and see the original data without any mask. Reference: https://docs.microsoft.com/en-us/azure/azure-sql/database/dynamic-data-masking-overview

Question Set 1 Q36 HOTSPOT - You plan to develop a dataset named Purchases by using Azure Databricks. Purchases will contain the following columns: ✑ ProductID ✑ ItemPrice ✑ LineTotal ✑ Quantity ✑ StoreID ✑ Minute ✑ Month ✑ Hour Year - ✑ Day You need to store the data to support hourly incremental load pipelines that will vary for each Store ID. The solution must minimize storage costs. How should you complete the code? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: df.write __1__ __2__.mode("append")__3__ __1__: .bucketBy, .partitionBy, .range, .sortBy __2__: ("*"), ("StoreID","Hour"), ("StoreID", "Y","M","D","H") __3__: ".csv", ".json", ".parquet", ".saveAsTable" > ("/Purchases")

Correct Answer: >>below<< df.write.partitionBy("StoreID","Year", "Month", "Day", "Hour") .mode("append").parquet("/Purchases") Box 1: partitionBy - We should overwrite at the partition level.Example:df.write.partitionBy("y","m","d").mode(SaveMode.Append).parquet("/data/hive/warehouse/db_name.db/" + tableName) Box 2: ("StoreID", "Year", "Month", "Day", "Hour") Box 3: parquet("/Purchases") Reference: https://intellipaat.com/community/11744/how-to-partition-and-write-dataframe-in-spark-without-deleting-partitions-with-no-new-data

Question Set 2 Q54 HOTSPOT - You are building an Azure Data Factory solution to process data received from Azure Event Hubs, and then ingested into an Azure Data Lake Storage Gen2 container. The data will be ingested every five minutes from devices into JSON files. The files have the following naming pattern. /{deviceType}/in/{YYYY}/{MM}/{DD}/{HH}/{deviceID}_{YYYY}{MM}{DD}HH}{mm}.json You need to prepare the data for batch data processing so that there is one dataset per hour per deviceType. The solution must minimize read times. How should you configure the sink for the copy activity? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Parameter - @pipeline(),TriggerTime - @pipeline(),TriggerType - @trigger().outputs.windowStartTime - @trigger().startTime Naming pattern - /{deviceID}/out/{YYYY}/{MM}/{DD}/{HH}.json - /{YYYY}/{MM}/{DD}/{deviceType}.json - /{YYYY}/{MM}/{DD}/{HH}.json - /{YYYY}/{MM}/{DD}/{HH}_{deviceType}.json Copy behavior: Add dynamic content, Flatten hierarchy, Merge files

Correct Answer: @trigger().startTime , /{YYYY}/{MM}/{DD}/{HH}_{deviceType}.json, Flatten hierarchy ?? Alternate Answer: @trigger().outputs.windowStartTime, /{YYYY}/{MM}/{DD}/{HH}_{deviceType}.json, Merge files Box 1: @trigger().startTime - startTime: A date-time value. For basic schedules, the value of the startTime property applies to the first occurrence. For complex schedules, the trigger starts no sooner than the specified startTime value. Box 2: /{YYYY}/{MM}/{DD}/{HH}_{deviceType}.json One dataset per hour per deviceType. Box 3: Flatten hierarchy - FlattenHierarchy: All files from the source folder are in the first level of the target folder. The target files have autogenerated names. Reference: https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipeline-execution-triggers https://docs.microsoft.com/en-us/azure/data-factory/connector-file-system

Question Set 2 Q51 Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You have an Azure Data Lake Storage account that contains a staging zone. You need to design a daily process to ingest incremental data from the staging zone, transform the data by executing an R script, and then insert the transformed data into a data warehouse in Azure Synapse Analytics. Solution: You use an Azure Data Factory schedule trigger to execute a pipeline that copies the data to a staging table in the data warehouse, and then uses a stored procedure to execute the R script. Does this meet the goal? A. Yes B. No

Correct Answer: A YY Alternate Answer: B - Cannot execute the R script using stored procedure If you need to transform data in a way that is not supported by Data Factory, you can create a custom activity with your own data processing logic and use the activity in the pipeline. Note: You can use data transformation activities in Azure Data Factory and Synapse pipelines to transform and process your raw data into predictions and insights at scale. Reference: https://docs.microsoft.com/en-us/azure/data-factory/transform-data

Topic 2 - Question Set 2 Q8 You have an Azure Data Factory instance that contains two pipelines named Pipeline1 and Pipeline2. Pipeline1 has the activities shown in the following exhibit. Stored Procedure (1)/Fail->Set variable 1 Pipeline2 has the activities shown in the following exhibit. Execute pipeline:Execute Pipeline1/Succeed-> set variable1 You execute Pipeline2, and Stored procedure1 in Pipeline1 fails. What is the status of the pipeline runs? A. Pipeline1 and Pipeline2 succeeded. B. Pipeline1 and Pipeline2 failed. C. Pipeline1 succeeded and Pipeline2 failed. D. Pipeline1 failed and Pipeline2 succeeded.

Correct Answer: A Activities are linked together via dependencies. A dependency has a condition of one of the following: Succeeded, Failed, Skipped, or Completed. Consider Pipeline1: If we have a pipeline with two activities where Activity2 has a failure dependency on Activity1, the pipeline will not fail just because Activity1 failed. If Activity1 fails and Activity2 succeeds, the pipeline will succeed. This scenario is treated as a try-catch block by Data Factory. The failure dependency means this pipeline reports success. Note: If we have a pipeline containing Activity1 and Activity2, and Activity2 has a success dependency on Activity1, it will only execute if Activity1 is successful. In this scenario, if Activity1 fails, the pipeline will fail. Reference: https://datasavvy.me/category/azure-data-factory/

Question Set 1 Q22 Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain description data that has an average length of 1.1 MB. You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics. You need to prepare the files to ensure that the data copies quickly. Solution: You convert the files to compressed delimited text files. Does this meet the goal? A. Yes B. No

Correct Answer: A All file formats have different performance characteristics. For the fastest load, use compressed delimited text files. Reference: https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data

Question Set 1 Q2 You have an Azure Synapse workspace named MyWorkspace that contains an Apache Spark database named mytestdb. You run the following command in an Azure Synapse Analytics Spark pool in MyWorkspace. CREATE TABLE mytestdb.dbo.myParquetTable (EmployeeID int, EmployeeName string, EmployeeStartDate date) USING Parquet - You then use Spark to insert a row into mytestdb.dbo.myParquetTable. The row contains the following data. EmpName EmpId EmpStartDate Alice 24 2020-01-25 One minute later, you execute the following query from a serverless SQL pool in MyWorkspace. SELECT EmployeeID - FROM mytestdb.dbo.myParquetTable WHERE name = 'Alice'; What will be returned by the query? A. 24 B. an error C. a null value

Correct Answer: A Once a database has been created by a Spark job, you can create tables in it with Spark that use Parquet as the storage format. Table names will be converted to lower case and need to be queried using the lower case name. These tables will immediately become available for querying by any of the Azure Synapse workspace Spark pools. They can also be used from any of the Spark jobs subject to permissions. Note: For external tables, since they are synchronized to serverless SQL pool asynchronously, there will be a delay until they appear. Reference:https://docs.microsoft.com/en-us/azure/synapse-analytics/metadata/table

Question Set 2 Q39 Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You are designing an Azure Stream Analytics solution that will analyze Twitter data. You need to count the tweets in each 10-second window. The solution must ensure that each tweet is counted only once. Solution: You use a tumbling window, and you set the window size to 10 seconds. Does this meet the goal? A. Yes B. No

Correct Answer: A Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals. The following diagram illustrates a stream with a series of events and how they are mapped into 10-second tumbling windows. Reference: https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics

Topic 2 - Question Set 2 Q2 Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You plan to create an Azure Databricks workspace that has a tiered structure. The workspace will contain the following three workloads: ✑ A workload for data engineers who will use Python and SQL. ✑ A workload for jobs that will run notebooks that use Python, Scala, and SQL. ✑ A workload that data scientists will use to perform ad hoc analysis in Scala and R. The enterprise architecture team at your company identifies the following standards for Databricks environments: ✑ The data engineers must share a cluster. ✑ The job cluster will be managed by using a request process whereby data scientists and data engineers provide packaged notebooks for deployment to the cluster. ✑ All the data scientists must be assigned their own cluster that terminates automatically after 120 minutes of inactivity. Currently, there are three data scientists. You need to create the Databricks clusters for the workloads. Solution: You create a Standard cluster for each data scientist, a High Concurrency cluster for the data engineers, and a High Concurrency cluster for the jobs. Does this meet the goal? A. Yes B. No

Correct Answer: A We need a High Concurrency cluster for the data engineers and the jobs. Note: Standard clusters are recommended for a single user. Standard can run workloads developed in any language: Python, R, Scala, and SQL. A high concurrency cluster is a managed cloud resource. The key benefits of high concurrency clusters are that they provide Apache Spark-native fine-grained sharing for maximum resource utilization and minimum query latencies. Reference: https://docs.azuredatabricks.net/clusters/configure.html

Topic 2 - Question Set 2 Q21 You are creating an Azure Data Factory data flow that will ingest data from a CSV file, cast columns to specified types of data, and insert the data into a table in an Azure Synapse Analytic dedicated SQL pool. The CSV file contains three columns named username, comment, and date. The data flow already contains the following: ✑ A source transformation. ✑ A Derived Column transformation to set the appropriate types of data. ✑ A sink transformation to land the data in the pool. You need to ensure that the data flow meets the following requirements: ✑ All valid rows must be written to the destination table. ✑ Truncation errors in the comment column must be avoided proactively. ✑ Any rows containing comment values that will cause truncation errors upon insert must be written to a file in blob storage. Which two actions should you perform? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point. A. To the data flow, add a sink transformation to write the rows to a file in blob storage. B. To the data flow, add a Conditional Split transformation to separate the rows that will cause truncation errors. C. To the data flow, add a filter transformation to filter out rows that will cause truncation errors. D. Add a select transformation to select only the rows that will cause truncation errors.

Correct Answer: AB B: Example: 1. This conditional split transformation defines the maximum length of "title" to be five. Any row that is less than or equal to five will go into the GoodRows stream. Any row that is larger than five will go into the BadRows stream. 2. This conditional split transformation defines the maximum length of "title" to be five. Any row that is less than or equal to five will go into the GoodRows stream. Any row that is larger than five will go into the BadRows stream. A: 3. Now we need to log the rows that failed. Add a sink transformation to the BadRows stream for logging. Here, we'll "auto-map" all of the fields so that we have logging of the complete transaction record. This is a text-delimited CSV file output to a single file in Blob Storage. We'll call the log file "badrows.csv". 4. The completed data flow is shown below. We are now able to split off error rows to avoid the SQL truncation errors and put those entries into a log file. Meanwhile, successful rows can continue to write to our target database. Reference: https://docs.microsoft.com/en-us/azure/data-factory/how-to-data-flow-error-rows

Topic 2 - Question Set 2 Q9 HOTSPOT - A company plans to use Platform-as-a-Service (PaaS) to create the new data pipeline process. The process must meet the following requirements: Ingest: ✑ Access multiple data sources. ✑ Provide the ability to orchestrate workflow. ✑ Provide the capability to run SQL Server Integration Services packages. Store: ✑ Optimize storage for big data workloads. ✑ Provide encryption of data at rest. ✑ Operate with no size limits. Prepare and Train: ✑ Provide a fully-managed and interactive workspace for exploration and visualization. ✑ Provide the ability to program in R, SQL, Python, Scala, and Java. ✑ Provide seamless user authentication with Azure Active Directory. Model & Serve: ✑ Implement native columnar storage. ✑ Support for the SQL language ✑ Provide support for structured streaming. You need to build the data integration pipeline. Which technologies should you use? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Ingest: Logic Apps, ADF, Azure Automation Store: ADLS, Azure Blob Storage, Azure Files Prepare/Train: HDInsight Apache Spark Cluster, Azure Databricks, HDInsight Apache Storm Cluster Model/Serve: HDInsight Apache Kafka Cluster, Azure Synapse Analytics, ADLS

Correct Answer: ADF, ADLS, ADB, Azure Synapse Analytics Ingest: Azure Data Factory - Azure Data Factory pipelines can execute SSIS packages. In Azure, the following services and tools will meet the core requirements for pipeline orchestration, control flow, and data movement: Azure Data Factory, Oozie on HDInsight, and SQL Server Integration Services (SSIS). Store: Data Lake Storage - Data Lake Storage Gen1 provides unlimited storage. Note: Data at rest includes information that resides in persistent storage on physical media, in any digital format. Microsoft Azure offers a variety of data storage solutions to meet different needs, including file, disk, blob, and table storage. Microsoft also provides encryption to protect Azure SQL Database, Azure CosmosDB, and Azure Data Lake. Prepare and Train: Azure Databricks Azure Databricks provides enterprise-grade Azure security, including Azure Active Directory integration. With Azure Databricks, you can set up your Apache Spark environment in minutes, autoscale and collaborate on shared projects in an interactive workspace. Azure Databricks supports Python, Scala, R, Java and SQL, as well as data science frameworks and libraries including TensorFlow, PyTorch and scikit-learn. Model and Serve: Azure Synapse Analytics Azure Synapse Analytics/ SQL Data Warehouse stores data into relational tables with columnar storage. Azure SQL Data Warehouse connector now offers efficient and scalable structured streaming write support for SQL Data Warehouse. Access SQL Data Warehouse from Azure Databricks using the SQL Data Warehouse connector. Note: As of November 2019, Azure SQL Data Warehouse is now Azure Synapse Analytics. Reference: https://docs.microsoft.com/bs-latn-ba/azure/architecture/data-guide/technology-choices/pipeline-orchestration-data-movement https://docs.microsoft.com/en-us/azure/azure-databricks/what-is-azure-databricks

Question Set 1 Q17 You have an Azure Data Lake Storage Gen2 container that contains 100 TB of data. You need to ensure that the data in the container is available for read workloads in a secondary region if an outage occurs in the primary region. The solution must minimize costs. Which type of data redundancy should you use? A. geo-redundant storage (GRS) B. read-access geo-redundant storage (RA-GRS) C. zone-redundant storage (ZRS) D. locally-redundant storage (LRS)

Correct Answer: B Correct Answer: B Geo-redundant storage (with GRS or GZRS) replicates your data to another physical location in the secondary region to protect against regional outages. However, that data is available to be read only if the customer or Microsoft initiates a failover from the primary to secondary region. When you enable read access to the secondary region, your data is available to be read at all times, including in a situation where the primary region becomes unavailable. Incorrect Answers: A: While Geo-redundant storage (GRS) is cheaper than Read-Access Geo-Redundant Storage (RA-GRS), GRS does NOT initiate automatic failover. C, D: Locally redundant storage (LRS) and Zone-redundant storage (ZRS) provides redundancy within a single region. Reference:https://docs.microsoft.com/en-us/azure/storage/common/storage-redundancy

Topic 2 - Question Set 2 Q19 Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You are designing an Azure Stream Analytics solution that will analyze Twitter data. You need to count the tweets in each 10-second window. The solution must ensure that each tweet is counted only once. Solution: You use a hopping window that uses a hop size of 5 seconds and a window size 10 seconds. Does this meet the goal? A. Yes B. No

Correct Answer: B Instead use a tumbling window. Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals. Reference: https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics

Question Set 2 Q53 You are designing an Azure Databricks cluster that runs user-defined local processes. You need to recommend a cluster configuration that meets the following requirements: ✑ Minimize query latency. ✑ Maximize the number of users that can run queries on the cluster at the same time. ✑ Reduce overall costs without compromising other requirements. Which cluster type should you recommend? A. Standard with Auto Termination B. High Concurrency with Autoscaling C. High Concurrency with Auto Termination D. Standard with Autoscaling

Correct Answer: B A High Concurrency cluster is a managed cloud resource. The key benefits of High Concurrency clusters are that they provide fine-grained sharing for maximum resource utilization and minimum query latencies. Databricks chooses the appropriate number of workers required to run your job. This is referred to as autoscaling. Autoscaling makes it easier to achieve high cluster utilization, because you don't need to provision the cluster to match a workload. Incorrect Answers: C: The cluster configuration includes an auto terminate setting whose default value depends on cluster mode: Standard and Single Node clusters terminate automatically after 120 minutes by default. High Concurrency clusters do not terminate automatically by default. Reference: https://docs.microsoft.com/en-us/azure/databricks/clusters/configure

Topic 2 - Question Set 2 Q6 You plan to perform batch processing in Azure Databricks once daily. Which type of Databricks cluster should you use? A. High Concurrency B. automated C. interactive

Correct Answer: B Azure Databricks has two types of clusters: interactive and automated. You use interactive clusters to analyze data collaboratively with interactive notebooks. You use automated clusters to run fast and robust automated jobs. Example: Scheduled batch workloads (data engineers running ETL jobs) This scenario involves running batch job JARs and notebooks on a regular cadence through the Databricks platform. The suggested best practice is to launch a new cluster for each run of critical jobs. This helps avoid any issues (failures, missing SLA, and so on) due to an existing workload (noisy neighbor) on a shared cluster. Reference: https://docs.databricks.com/administration-guide/cloud-configurations/aws/cmbp.html#scenario-3-scheduled-batch-workloads-data-engineers-running-etl-jobs

Question Set 2 Q38 You have an Azure Databricks workspace named workspace1 in the Standard pricing tier. You need to configure workspace1 to support autoscaling all-purpose clusters. The solution must meet the following requirements: ✑ Automatically scale down workers when the cluster is underutilized for three minutes. ✑ Minimize the time it takes to scale to the maximum number of workers. ✑ Minimize costs. What should you do first? A. Enable container services for workspace1. B. Upgrade workspace1 to the Premium pricing tier. C. Set Cluster Mode to High Concurrency. D. Create a cluster policy in workspace1.

Correct Answer: B For clusters running Databricks Runtime 6.4 and above, optimized autoscaling is used by all-purpose clusters in the Premium plan Optimized autoscaling: Scales up from min to max in 2 steps. Can scale down even if the cluster is not idle by looking at shuffle file state. Scales down based on a percentage of current nodes. On job clusters, scales down if the cluster is underutilized over the last 40 seconds. On all-purpose clusters, scales down if the cluster is underutilized over the last 150 seconds. The spark.databricks.aggressiveWindowDownS Spark configuration property specifies in seconds how often a cluster makes down-scaling decisions. Increasing the value causes a cluster to scale down more slowly. The maximum value is 600. Note: Standard autoscaling -Starts with adding 8 nodes. Thereafter, scales up exponentially, but can take many steps to reach the max. You can customize the first step by setting the spark.databricks.autoscaling.standardFirstStepUp Spark configuration property. Scales down only when the cluster is completely idle and it has been underutilized for the last 10 minutes. Scales down exponentially, starting with 1 node. Reference: https://docs.databricks.com/clusters/configure.html

Topic 2 - Question Set 2 Q13 You have an Azure Data Factory that contains 10 pipelines. You need to label each pipeline with its main purpose of either ingest, transform, or load. The labels must be available for grouping and filtering when using the monitoring experience in Data Factory. What should you add to each pipeline? A. a resource tag B. a correlation ID C. a run group ID D. an annotation

Correct Answer: D Annotations are additional, informative tags that you can add to specific factory resources: pipelines, datasets, linked services, and triggers. By adding annotations, you can easily filter and search for specific factory resources. Reference: https://www.cathrinewilhelmsen.net/annotations-user-properties-azure-data-factory/

Question Set 1 Q39 You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table contains purchases from suppliers for a retail store. FactPurchase will contain the following columns. Name Data Type Nullable PurchaseKey bigint No DateKey int No SupplierKey int No StockItemKey int No ... FactPurchase will have 1 million rows of data added daily and will contain three years of data. Transact-SQL queries similar to the following query will be executed daily. SELECT SupplierKey, StockItemKey, COUNT(*) FROM FactPurchase WHERE DateKey >= 20210101 AND DateKey <= 20210131 GROUP By SupplierKey, StockItemKey Which table distribution will minimize query times? A. replicated B. hash-distributed on PurchaseKey C. round-robin D. hash-distributed on DateKey

Correct Answer: B Hash-distributed tables improve query performance on large fact tables, and are the focus of this article. Round-robin tables are useful for improving loading speed. Incorrect: Not D: Do not use a date column. All data for the same date lands in the same distribution. If several users are all filtering on the same date, then only 1 of the 60 distributions do all the processing work. Reference: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

Question Set 1 Q23 Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain description data that has an average length of 1.1 MB. You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics. You need to prepare the files to ensure that the data copies quickly. Solution: You copy the files to a table that has a columnstore index. Does this meet the goal? A. Yes B. No

Correct Answer: B Instead convert the files to compressed delimited text files. Reference: https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data

Question Set 2 Q40 Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You are designing an Azure Stream Analytics solution that will analyze Twitter data. You need to count the tweets in each 10-second window. The solution must ensure that each tweet is counted only once. Solution: You use a session window that uses a timeout size of 10 seconds. Does this meet the goal? A. Yes B. No

Correct Answer: B Instead use a tumbling window. Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals. Reference: https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics

Topic 2 - Question Set 2 Q18 Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You are designing an Azure Stream Analytics solution that will analyze Twitter data. You need to count the tweets in each 10-second window. The solution must ensure that each tweet is counted only once. Solution: You use a hopping window that uses a hop size of 10 seconds and a window size of 10 seconds. Does this meet the goal? A. Yes B. No

Correct Answer: B Instead use a tumbling window. Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals. Reference: https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics

Question Set 1 Q25 You build a data warehouse in an Azure Synapse Analytics dedicated SQL pool. Analysts write a complex SELECT query that contains multiple JOIN and CASE statements to transform data for use in inventory reports. The inventory reports will use the data and additional WHERE parameters depending on the report. The reports will be produced once daily. You need to implement a solution to make the dataset available for the reports. The solution must minimize query times. What should you implement? A. an ordered clustered columnstore index B. a materialized view C. result set caching D. a replicated table

Correct Answer: B Materialized views for dedicated SQL pools in Azure Synapse provide a low maintenance method for complex analytical queries to get fast performance without any query change. Incorrect Answers: C: One daily execution does not make use of result cache caching. Note: When result set caching is enabled, dedicated SQL pool automatically caches query results in the user database for repetitive use. This allows subsequent query executions to get results directly from the persisted cache so recomputation is not needed. Result set caching improves query performance and reduces compute resource usage. In addition, queries using cached results set do not use any concurrency slots and thus do not count against existing concurrency limits. Reference: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/performance-tuning-materialized-views https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/performance-tuning-result-set-caching

Question Set 1 Q32 You plan to ingest streaming social media data by using Azure Stream Analytics. The data will be stored in files in Azure Data Lake Storage, and then consumed by using Azure Databricks and PolyBase in Azure Synapse Analytics. You need to recommend a Stream Analytics data output format to ensure that the queries from Databricks and PolyBase against the files encounter the fewest possible errors. The solution must ensure that the files can be queried quickly and that the data type information is retained. What should you recommend? A. JSON B. Parquet C. CSV D. Avro

Correct Answer: B Need Parquet to support both Databricks and PolyBase. Alternate Answer: The Avro format is great for data and message preservation. Avro schema with its support for evolution is essential for making the data robust for streaming architectures like Kafka, and with the metadata that schema provides, you can reason on the data. Having a schema provides robustness in providing meta-data about the data stored in Avro records which are self- documenting the data. Reference: http://cloudurable.com/blog/avro/index.html https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-file-format-transact-sql

Question Set 2 Q52 Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You plan to create an Azure Databricks workspace that has a tiered structure. The workspace will contain the following three workloads: ✑ A workload for data engineers who will use Python and SQL. ✑ A workload for jobs that will run notebooks that use Python, Scala, and SQL. ✑ A workload that data scientists will use to perform ad hoc analysis in Scala and R. The enterprise architecture team at your company identifies the following standards for Databricks environments: ✑ The data engineers must share a cluster. ✑ The job cluster will be managed by using a request process whereby data scientists and data engineers provide packaged notebooks for deployment to the cluster. ✑ All the data scientists must be assigned their own cluster that terminates automatically after 120 minutes of inactivity. Currently, there are three data scientists. You need to create the Databricks clusters for the workloads. Solution: You create a High Concurrency cluster for each data scientist, a High Concurrency cluster for the data engineers, and a Standard cluster for the jobs. Does this meet the goal? A. Yes B. No

Correct Answer: B Need a High Concurrency cluster for the jobs. Standard clusters are recommended for a single user. Standard can run workloads developed in any language: Python, R, Scala, and SQL. A high concurrency cluster is a managed cloud resource. The key benefits of high concurrency clusters are that they provide Apache Spark-native fine-grained sharing for maximum resource utilization and minimum query latencies. Reference: https://docs.azuredatabricks.net/clusters/configure.html

Question Set 2 Q36 You are designing an Azure Databricks table. The table will ingest an average of 20 million streaming events per day. You need to persist the events in the table for use in incremental load pipeline jobs in Azure Databricks. The solution must minimize storage costs and incremental load times. What should you include in the solution? A. Partition by DateTime fields. B. Sink to Azure Queue storage. C. Include a watermark column. D. Use a JSON format for physical data storage.

Correct Answer: B The Databricks ABS-AQS connector uses Azure Queue Storage (AQS) to provide an optimized file source that lets you find new files written to an Azure Blob storage (ABS) container without repeatedly listing all of the files. This provides two major advantages :✑ Lower latency: no need to list nested directory structures on ABS, which is slow and resource intensive. ✑ Lower costs: no more costly LIST API requests made to ABS. Reference: https://docs.microsoft.com/en-us/azure/databricks/spark/latest/structured-streaming/aqs

Question Set 2 Q29 You have an Azure Stream Analytics job that receives clickstream data from an Azure event hub. You need to define a query in the Stream Analytics job. The query must meet the following requirements: ✑ Count the number of clicks within each 10-second window based on the country of a visitor. ✑ Ensure that each click is NOT counted more than once. How should you define the Query? A. SELECT Country, Avg(*) AS Average FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, SlidingWindow(second, 10) B. SELECT Country, Count(*) AS Count FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, TumblingWindow(second, 10) C. SELECT Country, Avg(*) AS Average FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, HoppingWindow(second, 10, 2) D. SELECT Country, Count(*) AS Count FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, SessionWindow(second, 5, 10)

Correct Answer: B Tumbling window functions are used to segment a data stream into distinct time segments and perform a function against them, such as the example below. The key differentiators of a Tumbling window are that they repeat, do not overlap, and an event cannot belong to more than one tumbling window. Incorrect Answers: A: Sliding windows, unlike Tumbling or Hopping windows, output events only for points in time when the content of the window actually changes. In other words, when an event enters or exits the window. Every window has at least one event, like in the case of Hopping windows, events can belong to more than one sliding window. C: Hopping window functions hop forward in time by a fixed period. It may be easy to think of them as Tumbling windows that can overlap, so events can belong to more than one Hopping window result set. To make a Hopping window the same as a Tumbling window, specify the hop size to be the same as the window size. D: Session windows group events that arrive at similar times, filtering out periods of time where there is no data. Reference: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions

Question Set 2 Q46 Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You plan to create an Azure Databricks workspace that has a tiered structure. The workspace will contain the following three workloads: ✑ A workload for data engineers who will use Python and SQL. ✑ A workload for jobs that will run notebooks that use Python, Scala, and SQL. ✑ A workload that data scientists will use to perform ad hoc analysis in Scala and R. The enterprise architecture team at your company identifies the following standards for Databricks environments: ✑ The data engineers must share a cluster. ✑ The job cluster will be managed by using a request process whereby data scientists and data engineers provide packaged notebooks for deployment to the cluster. ✑ All the data scientists must be assigned their own cluster that terminates automatically after 120 minutes of inactivity. Currently, there are three data scientists. You need to create the Databricks clusters for the workloads. Solution: You create a Standard cluster for each data scientist, a Standard cluster for the data engineers, and a High Concurrency cluster for the jobs. Does this meet the goal? A. Yes B. No

Correct Answer: B We need a High Concurrency cluster for the data engineers and the jobs. Note: Standard clusters are recommended for a single user. Standard can run workloads developed in any language: Python, R, Scala, and SQL. A high concurrency cluster is a managed cloud resource. The key benefits of high concurrency clusters are that they provide Apache Spark-native fine-grained sharing for maximum resource utilization and minimum query latencies. Reference: https://docs.azuredatabricks.net/clusters/configure.html

Topic 2 - Question Set 2 Q1 Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution. After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen. You plan to create an Azure Databricks workspace that has a tiered structure. The workspace will contain the following three workloads: ✑ A workload for data engineers who will use Python and SQL. ✑ A workload for jobs that will run notebooks that use Python, Scala, and SQL. ✑ A workload that data scientists will use to perform ad hoc analysis in Scala and R. The enterprise architecture team at your company identifies the following standards for Databricks environments: ✑ The data engineers must share a cluster. ✑ The job cluster will be managed by using a request process whereby data scientists and data engineers provide packaged notebooks for deployment to the cluster. ✑ All the data scientists must be assigned their own cluster that terminates automatically after 120 minutes of inactivity. Currently, there are three data scientists. You need to create the Databricks clusters for the workloads. Solution: You create a Standard cluster for each data scientist, a High Concurrency cluster for the data engineers, and a Standard cluster for the jobs. Does this meet the goal? A. Yes B. No

Correct Answer: B We would need a High Concurrency cluster for the jobs. Note: Standard clusters are recommended for a single user. Standard can run workloads developed in any language: Python, R, Scala, and SQL. A high concurrency cluster is a managed cloud resource. The key benefits of high concurrency clusters are that they provide Apache Spark-native fine-grained sharing for maximum resource utilization and minimum query latencies. Reference: https://docs.azuredatabricks.net/clusters/configure.html

Question Set 1 Q33 You have an Azure Synapse Analytics dedicated SQL pool named Pool1. Pool1 contains a partitioned fact table named dbo.Sales and a staging table named stg.Sales that has the matching table and partition definitions. You need to overwrite the content of the first partition in dbo.Sales with the content of the same partition in stg.Sales. The solution must minimize load times. What should you do? A. Insert the data from stg.Sales into dbo.Sales. B. Switch the first partition from dbo.Sales to stg.Sales. C. Switch the first partition from stg.Sales to dbo.Sales. D. Update dbo.Sales from stg.Sales.

Correct Answer: B YY Alternate Answer: C A way to eliminate rollbacks is to use Metadata Only operations like partition switching for data management. For example, rather than execute a DELETE statement to delete all rows in a table where the order_date was in October of 2001, you could partition your data monthly. Then you can switch out the partition with data for an empty partition from another table Note: Syntax: SWITCH [ PARTITION source_partition_number_expression ] TO [ schema_name. ] target_table [ PARTITION target_partition_number_expression ] Switches a block of data in one of the following ways: ✑ Reassigns all data of a table as a partition to an already-existing partitioned table. ✑ Switches a partition from one partitioned table to another. ✑ Reassigns all data in one partition of a partitioned table to an existing non-partitioned table. Reference: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-dedicated-sql-pool

Question Set 2 Q34 You are designing an Azure Stream Analytics job to process incoming events from sensors in retail environments. You need to process the events to produce a running average of shopper counts during the previous 15 minutes, calculated at five-minute intervals. Which type of window should you use? A. snapshot B. tumbling C. hopping D. sliding

Correct Answer: B YY Alternative Answer: C (HOPPING) Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals. The following diagram illustrates a stream with a series of events and how they are mapped into 10-second tumbling windows. Reference: https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics

Question Set 1 Q34 You are designing a slowly changing dimension (SCD) for supplier data in an Azure Synapse Analytics dedicated SQL pool. You plan to keep a record of changes to the available fields. The supplier data contains the following columns. Name Description SupplierSystemId Unique supplier ID SupplierName Name of supplier company SupplierAddress SupplierDescription SupplierCategory Which three additional columns should you add to the data to create a Type 2 SCD? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point. A. surrogate primary key B. effective start date C. business key D. last modified date E. effective end date F. foreign key

Correct Answer: BCE effectiveStartDate, effectiveEndDate, businessKey ? Alternative Answer: ABE C: The Slowly Changing Dimension transformation requires at least one business key column. BE: Historical attribute changes create new records instead of updating existing ones. The only change that is permitted in an existing record is an update to a column that indicates whether the record is current or expired. This kind of change is equivalent to a Type 2 change. The Slowly Changing Dimension transformation directs these rows to two outputs: Historical Attribute Inserts Output and New Output. Reference: https://docs.microsoft.com/en-us/sql/integration-services/data-flow/transformations/slowly-changing-dimension-transformation

Topic 2 - Question Set 2 Q17 You need to implement a Type 3 slowly changing dimension (SCD) for product category data in an Azure Synapse Analytics dedicated SQL pool. You have a table that was created by using the following Transact-SQL statement. Which two columns should you add to the table? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point. create table dbo.dimProduct ( productKey, productSourceId, productName, color, sellStartDate, sellEndDate, rowInsertedDateTime, rowUpdatedDateTime, etlAuditId) A. [EffectiveStartDate] [datetime] NOT NULL, B. [CurrentProductCategory] [nvarchar] (100) NOT NULL, C. [EffectiveEndDate] [datetime] NULL, D. [ProductCategory] [nvarchar] (100) NOT NULL, E. [OriginalProductCategory] [nvarchar] (100) NOT NULL,

Correct Answer: BE A Type 3 SCD supports storing two versions of a dimension member as separate columns. The table includes a column for the current value of a member plus either the original or previous value of the member. So Type 3 uses additional columns to track one key instance of history, rather than storing additional rows to track each change like in a Type 2 SCD. This type of tracking may be used for one or two columns in a dimension table. It is not common to use it for many members of the same table. It is often used in combination with Type 1 or Type 2 members. Reference: https://k21academy.com/microsoft-azure/azure-data-engineer-dp203-q-a-day-2-live-session-review/ https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types

Question Set 2 Q32 You have two Azure Data Factory instances named ADFdev and ADFprod. ADFdev connects to an Azure DevOps Git repository. You publish changes from the main branch of the Git repository to ADFdev. You need to deploy the artifacts from ADFdev to ADFprod. What should you do first? A. From ADFdev, modify the Git configuration. B. From ADFdev, create a linked service. C. From Azure DevOps, create a release pipeline. D. From Azure DevOps, update the main branch.

Correct Answer: C In Azure Data Factory, continuous integration and delivery (CI/CD) means moving Data Factory pipelines from one environment (development, test, production) to another. Note: The following is a guide for setting up an Azure Pipelines release that automates the deployment of a data factory to multiple environments. 1. In Azure DevOps, open the project that's configured with your data factory. 2. On the left side of the page, select Pipelines, and then select Releases. 3. Select New pipeline, or, if you have existing pipelines, select New and then New release pipeline. 4. In the Stage name box, enter the name of your environment. 5. Select Add artifact, and then select the git repository configured with your development data factory. Select the publish branch of the repository for the Default branch. By default, this publish branch is adf_publish. 6. Select the Empty job template. Reference: https://docs.microsoft.com/en-us/azure/data-factory/continuous-integration-deployment

Question Set 1 Q15 You have an enterprise data warehouse in Azure Synapse Analytics. Using PolyBase, you create an external table named [Ext].[Items] to query Parquet files stored in Azure Data Lake Storage Gen2 without importing the data to the data warehouse. The external table has three columns. You discover that the Parquet files have a fourth column named ItemID. Which command should you run to add the ItemID column to the external table? a. alter external table [ext].[items] add [itemid] int; b. drop external file format parquetfile1; ... c. drop external table [ext1].[items] create external table file format parquetfile1 ... d. alter table [ext].[items] add [itemid] int;

Correct Answer: C Incorrect Answers:A, D: Only these Data Definition Language (DDL) statements are allowed on external tables ✑ CREATE TABLE and DROP TABLE ✑ CREATE STATISTICS and DROP STATISTICS ✑ CREATE VIEW and DROP VIEW Reference: https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql

Question Set 1 Q1 You have a table in an Azure Synapse Analytics dedicated SQL pool. The table was created by using the following Transact-SQL statement. You need to alter the table to meet the following requirements: ✑ Ensure that users can identify the current manager of employees. ✑ Support creating an employee reporting hierarchy for your entire company. ✑ Provide fast lookup of the managers' attributes such as name and job title. Which column should you add to the table? create table [dbo].[DimEmployee] ( employeeKey int identity(1,1) not null, employeeId int not null, firstName varchar (100) not null, lastName varchar (100) not null, jobTitle varchar (100) not null, lastHireDate date null, streetAddress varchar (500) not null, city varchar (200) not null, stateProvince varchar (50) not null, portalCode varchar (10) not null) A. [ManagerEmployeeID] [smallint] NULL B. [ManagerEmployeeKey] [smallint] NULL C. [ManagerEmployeeKey] [int] NULL D. [ManagerName] [varchar](200) NULL

Correct Answer: C We need an extra column to identify the Manager. Use the data type as the EmployeeKey column, an int column. Reference: https://docs.microsoft.com/en-us/analysis-services/tabular-models/hierarchies-ssas-tabular

Question Set 1 Q18 You plan to implement an Azure Data Lake Gen 2 storage account. You need to ensure that the data lake will remain available if a data center fails in the primary Azure region. The solution must minimize costs. Which type of replication should you use for the storage account? A. geo-redundant storage (GRS) B. geo-zone-redundant storage (GZRS) C. locally-redundant storage (LRS) D. zone-redundant storage (ZRS)

Correct Answer: C YY Alternate Answer: D? Zone redundancy Locally redundant storage (LRS) copies your data synchronously three times within a single physical location in the primary region. LRS is the least expensive replication option Reference: https://docs.microsoft.com/en-us/azure/storage/common/storage-redundancy

Question Set 1 Q43 You are designing a dimension table for a data warehouse. The table will track the value of the dimension attributes over time and preserve the history of the data by adding new rows as the data changes. Which type of slowly changing dimension (SCD) should you use? A. Type 0 B. Type 1 C. Type 2 D. Type 3

Correct Answer: C A Type 2 SCD supports versioning of dimension members. Often the source system doesn't store versions, so the data warehouse load process detects and manages changes in a dimension table. In this case, the dimension table must use a surrogate key to provide a unique reference to a version of the dimension member. It also includes columns that define the date range validity of the version (for example, StartDate and EndDate) and possibly a flag column (for example, IsCurrent) to easily filter by current dimension members. Incorrect Answers: B: A Type 1 SCD always reflects the latest values, and when changes in source data are detected, the dimension table data is overwritten. D: A Type 3 SCD supports storing two versions of a dimension member as separate columns. The table includes a column for the current value of a member plus either the original or previous value of the member. So Type 3 uses additional columns to track one key instance of history, rather than storing additional rows to track each change like in a Type 2 SCD. Reference: https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types

Question Set 1 Q40 You are implementing a batch dataset in the Parquet format. Data files will be produced be using Azure Data Factory and stored in Azure Data Lake Storage Gen2. The files will be consumed by an Azure Synapse Analytics serverless SQL pool. You need to minimize storage costs for the solution. What should you do? A. Use Snappy compression for files. B. Use OPENROWSET to query the Parquet files. C. Create an external table that contains a subset of columns from the Parquet files. D. Store all data as string in the Parquet files.

Correct Answer: C An external table points to data located in Hadoop, Azure Storage blob, or Azure Data Lake Storage. External tables are used to read data from files or write data to files in Azure Storage. With Synapse SQL, you can use external tables to read external data using dedicated SQL pool or serverless SQL pool. Reference: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables

Topic 2 - Question Set 2 Q5 You need to trigger an Azure Data Factory pipeline when a file arrives in an Azure Data Lake Storage Gen2 container. Which resource provider should you enable? A. Microsoft.Sql B. Microsoft.Automation C. Microsoft.EventGrid D. Microsoft.EventHub

Correct Answer: C Event-driven architecture (EDA) is a common data integration pattern that involves production, detection, consumption, and reaction to events. Data integration scenarios often require Data Factory customers to trigger pipelines based on events happening in storage account, such as the arrival or deletion of a file in Azure Blob Storage account. Data Factory natively integrates with Azure Event Grid, which lets you trigger pipelines on such events. Reference: https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-event-trigger https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipeline-execution-triggers

Question Set 2 Q47 You have the following Azure Data Factory pipelines: ✑ Ingest Data from System1 ✑ Ingest Data from System2 ✑ Populate Dimensions ✑ Populate Facts Ingest Data from System1 and Ingest Data from System2 have no dependencies. Populate Dimensions must execute after Ingest Data from System1 and Ingest Data from System2. Populate Facts must execute after Populate Dimensions pipeline. All the pipelines must execute every eight hours. What should you do to schedule the pipelines for execution? A. Add an event trigger to all four pipelines. B. Add a schedule trigger to all four pipelines. C. Create a patient pipeline that contains the four pipelines and use a schedule trigger. D. Create a patient pipeline that contains the four pipelines and use an event trigger.

Correct Answer: C Schedule trigger: A trigger that invokes a pipeline on a wall-clock schedule. Reference: https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipeline-execution-triggers

Question Set 1 Q5 HOTSPOT - You are planning the deployment of Azure Data Lake Storage Gen2. You have the following two reports that will access the data lake: ✑ Report1: Reads three columns from a file that contains 50 columns. ✑ Report2: Queries a single record based on a timestamp. You need to recommend in which format to store the data in the data lake to support the reports. The solution must minimize read times. What should you recommend for each report? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: - Report1 -- Avro -- CSV -- Parquet -- TSV - Report2 -- Avro -- CSV -- Parquet -- TSV

Correct Answer: CSV Avro Report1: CSV - CSV: The destination writes records as delimited data. Report2: AVRO - AVRO supports timestamps. Not Parquet, TSV: Not options for Azure Data Lake Storage Gen2. Reference:https://streamsets.com/documentation/datacollector/latest/help/datacollector/UserGuide/Destinations/ADLS-G2-D.html

Topic 2 - Question Set 2 Q10 You have the following table named Employees. firstName lastName hireDate employeeType Jane Doe 2019-08-23 new Ben Smith 2917-12-15 standard You need to calculate the employee_type value based on the hire_date value. How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content. NOTE: Each correct selection is worth one point. Select and Place: Values - Case - Else - Over - Partition By -Row_Number Answer Area select *, _1_ when hireDate >= '2019-01-01' then 'New' _2_ 'Standard' end as employeeType from employees

Correct Answer: Case, Else Box 1: CASE - CASE evaluates a list of conditions and returns one of multiple possible result expressions. CASE can be used in any statement or clause that allows a valid expression. For example, you can use CASE in statements such as SELECT, UPDATE, DELETE and SET, and in clauses such as select_list, IN, WHERE, ORDER BY, and HAVING. Syntax: Simple CASE expression: CASE input_expression - WHEN when_expression THEN result_expression [ ...n ] [ ELSE else_result_expression ] END - Box 2: ELSE - Reference: https://docs.microsoft.com/en-us/sql/t-sql/language-elements/case-transact-sql

Question Set 1 Q10 HOTSPOT - You have an Azure Data Lake Storage Gen2 container. Data is ingested into the container, and then transformed by a data integration application. The data is NOT modified after that. Users can read files in the container but cannot modify the files. You need to design a data archiving solution that meets the following requirements: ✑ New data is accessed frequently and must be available as quickly as possible. ✑ Data that is older than five years is accessed infrequently but must be available within one second when requested. ✑ Data that is older than seven years is NOT accessed. After seven years, the data must be persisted at the lowest cost possible. ✑ Costs must be minimized while maintaining the required availability. How should you manage the data? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point Hot Area: Five Year Old Data Seven Year Old Data Options: Delete the Blob Move to Archive Storage Move to Cool Storage Move to Hot Storage

Correct Answer: Cool Storage, Archive Storage Box 1: Move to cool storage Box 2: Move to archive storage Archive - Optimized for storing data that is rarely accessed and stored for at least 180 days with flexible latency requirements, on the order of hours.

Question Set 1 Q46 You are performing exploratory analysis of the bus fare data in an Azure Data Lake Storage Gen2 account by using an Azure Synapse Analytics serverless SQL pool. You execute the Transact-SQL query shown in the following exhibit. select paymentType, sum(fareAmount) as fareTotal from openrowset ( bulk 'cs/busfare/tripdata2020*.csv', data_source = 'busData', format = 'csv', parser_version = '2.0', firstrow = 2 ) with (paymentType int 10, fareAmount float 11) as nyc group by paymentType order by paymentType; What do the query results include? A. Only CSV files in the tripdata_2020 subfolder. B. All files that have file names that beginning with "tripdata_2020". C. All CSV files that have file names that contain "tripdata_2020". D. Only CSV that have file names that beginning with "tripdata_2020".

Correct Answer: D

Question Set 2 Q31 You need to schedule an Azure Data Factory pipeline to execute when a new file arrives in an Azure Data Lake Storage Gen2 container. Which type of trigger should you use? A. on-demand B. tumbling window C. schedule D. event

Correct Answer: D Event-driven architecture (EDA) is a common data integration pattern that involves production, detection, consumption, and reaction to events. Data integration scenarios often require Data Factory customers to trigger pipelines based on events happening in storage account, such as the arrival or deletion of a file in AzureBlob Storage account. Reference: https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-event-trigger

Question Set 2 Q41 You use Azure Stream Analytics to receive data from Azure Event Hubs and to output the data to an Azure Blob Storage account. You need to output the count of records received from the last five minutes every minute. Which windowing function should you use? A. Session B. Tumbling C. Sliding D. Hopping

Correct Answer: D Hopping window functions hop forward in time by a fixed period. It may be easy to think of them as Tumbling windows that can overlap and be emitted more often than the window size. Events can belong to more than one Hopping window result set. To make a Hopping window the same as a Tumbling window, specify the hop size to be the same as the window size. Reference: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions

Question Set 1 Q12 You need to design an Azure Synapse Analytics dedicated SQL pool that meets the following requirements: ✑ Can return an employee record from a given point in time. ✑ Maintains the latest employee information. ✑ Minimizes query complexity. How should you model the employee data? A. as a temporal table B. as a SQL graph table C. as a degenerate dimension table D. as a Type 2 slowly changing dimension (SCD) table

Correct Answer: D A Type 2 SCD supports versioning of dimension members. Often the source system doesn't store versions, so the data warehouse load process detects and manages changes in a dimension table. In this case, the dimension table must use a surrogate key to provide a unique reference to a version of the dimension member. It also includes columns that define the date range validity of the version (for example, StartDate and EndDate) and possibly a flag column (for example,IsCurrent) to easily filter by current dimension members. Reference:https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types

Question Set 1 Q30 You are designing a financial transactions table in an Azure Synapse Analytics dedicated SQL pool. The table will have a clustered columnstore index and will include the following columns: ✑ TransactionType: 40 million rows per transaction type ✑ CustomerSegment: 4 million per customer segment ✑ TransactionMonth: 65 million rows per month ✑ AccountType: 500 million per account type . You have the following query requirements: ✑ Analysts will most commonly analyze transactions for a given month. ✑ Transactions analysis will typically summarize transactions by transaction type, customer segment, and/or account type You need to recommend a partition strategy for the table to minimize query times. On which column should you recommend partitioning the table? A. CustomerSegment B. AccountType C. TransactionType D. TransactionMonth

Correct Answer: D For optimal compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is needed. Before partitions are created, dedicated SQL pool already divides each table into 60 distributed databases. Example: Any partitioning added to a table is in addition to the distributions created behind the scenes. Using this example, if the sales fact table contained 36 monthly partitions, and given that a dedicated SQL pool has 60 distributions, then the sales fact table should contain 60 million rows per month, or 2.1 billion rows when all months are populated. If a table contains fewer than the recommended minimum number of rows per partition, consider using fewer partitions in order to increase the number of rows per partition.

Question Set 1 Q26 You have an Azure Synapse Analytics workspace named WS1 that contains an Apache Spark pool named Pool1. You plan to create a database named DB1 in Pool1. You need to ensure that when tables are created in DB1, the tables are available automatically as external tables to the built-in serverless SQL pool. Which format should you use for the tables in DB1? A. CSV B. ORC C. JSON D. Parquet

Correct Answer: D Serverless SQL pool can automatically synchronize metadata from Apache Spark. A serverless SQL pool database will be created for each database existing in serverless Apache Spark pools. For each Spark external table based on Parquet or CSV and located in Azure Storage, an external table is created in a serverless SQL pool database. Reference: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-storage-files-spark-tables

Question Set 1 Q27 You are planning a solution to aggregate streaming data that originates in Apache Kafka and is output to Azure Data Lake Storage Gen2. The developers who will implement the stream processing solution use Java. Which service should you recommend using to process the streaming data? A. Azure Event Hubs B. Azure Data Factory C. Azure Stream Analytics D. Azure Databricks

Correct Answer: D The following tables summarize the key differences in capabilities for stream processing technologies in Azure.

Question Set 1 Q6 You are designing the folder structure for an Azure Data Lake Storage Gen2 container. Users will query data by using a variety of services including Azure Databricks and Azure Synapse Analytics serverless SQL pools. The data will be secured by subject area. Most queries will include data from the current year or current month. Which folder structure should you recommend to support fast queries and simplified folder security? A. /{SubjectArea}/{DataSource}/{DD}/{MM}/{YYYY}/{FileData}_{YYYY}_{MM}_{DD}.csv B. /{DD}/{MM}/{YYYY}/{SubjectArea}/{DataSource}/{FileData}_{YYYY}_{MM}_{DD}.csv C. /{YYYY}/{MM}/{DD}/{SubjectArea}/{DataSource}/{FileData}_{YYYY}_{MM}_{DD}.csv D. /{SubjectArea}/{DataSource}/{YYYY}/{MM}/{DD}/{FileData}_{YYYY}_{MM}_{DD}.csv

Correct Answer: D There's an important reason to put the date at the end of the directory structure. If you want to lock down certain regions or subject matters to users/groups, then you can easily do so with the POSIX permissions. Otherwise, if there was a need to restrict a certain security group to viewing just the UK data or certain planes, with the date structure in front a separate permission would be required for numerous directories under every hour directory. Additionally, having the date structure in front would exponentially increase the number of directories as time went on. Note: In IoT workloads, there can be a great deal of data being landed in the data store that spans across numerous products, devices, organizations, and customers. Itג€™s important to pre-plan the directory layout for organization, security, and efficient processing of the data for down-stream consumers. A general template to consider might be the following layout: {Region}/{SubjectMatter(s)}/{yyyy}/{mm}/{dd}/{hh}/

Question Set 2 Q49 You are monitoring an Azure Stream Analytics job by using metrics in Azure. You discover that during the last 12 hours, the average watermark delay is consistently greater than the configured late arrival tolerance. What is a possible cause of this behavior? A. Events whose application timestamp is earlier than their arrival time by more than five minutes arrive as inputs. B. There are errors in the input data. C. The late arrival policy causes events to be dropped. D. The job lacks the resources to process the volume of incoming data.

Correct Answer: D Watermark Delay indicates the delay of the streaming data processing job. There are a number of resource constraints that can cause the streaming pipeline to slow down. The watermark delay metric can rise due to: 1. Not enough processing resources in Stream Analytics to handle the volume of input events. To scale up resources, see Understand and adjust StreamingUnits. 2. Not enough throughput within the input event brokers, so they are throttled. For possible solutions, see Automatically scale up Azure Event Hubs throughput units. 3. Output sinks are not provisioned with enough capacity, so they are throttled. The possible solutions vary widely based on the flavor of output service being used. Reference: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-time-handling

Topic 2 - Question Set 2 Q4 A company has a real-time data analysis solution that is hosted on Microsoft Azure. The solution uses Azure Event Hub to ingest data and an Azure StreamAnalytics cloud job to analyze the data. The cloud job is configured to use 120 Streaming Units (SU). You need to optimize performance for the Azure Stream Analytics job. Which two actions should you perform? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point. A. Implement event ordering. B. Implement Azure Stream Analytics user-defined functions (UDF). C. Implement query parallelization by partitioning the data output. D. Scale the SU count for the job up. E. Scale the SU count for the job down. F. Implement query parallelization by partitioning the data input.

Correct Answer: DF ?? Alternative Answer: CF D: Scale out the query by allowing the system to process each input partition separately. F: A Stream Analytics job definition includes inputs, a query, and output. Inputs are where the job reads the data stream from. Reference: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-parallelization

Question Set 1 Q35 HOTSPOT - You have a Microsoft SQL Server database that uses a third normal form schema. You plan to migrate the data in the database to a star schema in an Azure Synapse Analytics dedicated SQL pool. You need to design the dimension tables. The solution must optimize read operations. What should you include in the solution? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Transform data for dim tables by - Maintaining to a third normal form - Normalizing to a fourth normal form - Denormalizing to a second normal form For primary key cols in the dimension tables use - New IDENTITY columns - A new computed column - The business key column from the source sys

Correct Answer: Denormalize to a second normal form, New identity columns Box 1: Denormalize to a second normal form - Denormalization is the process of transforming higher normal forms to lower normal forms via storing the join of higher normal form relations as a base relation. Denormalization increases the performance in data retrieval at cost of bringing update anomalies to a database. Box 2: New identity columns - The collapsing relations strategy can be used in this step to collapse classification entities into component entities to obtain flat dimension tables with single-part keys that connect directly to the fact table. The single-part key is a surrogate key generated to ensure it remains unique over time. Example: >> picture << Note: A surrogate key on a table is a column with a unique identifier for each row. The key is not generated from the table data. Data modelers like to create surrogate keys on their tables when they design data warehouse models. You can use the IDENTITY property to achieve this goal simply and effectively without affecting load performance. Reference: https://www.mssqltips.com/sqlservertip/5614/explore-the-role-of-normal-forms-in-dimensional-modeling/ https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-identity

Question Set 1 Q21 From a website analytics system, you receive data extracts about user interactions such as downloads, link clicks, form submissions, and video plays. The data contains the following columns. You need to design a star schema to support analytical queries of the data. The star schema will contain four tables including a date dimension. To which table should you add each column? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Columns: - EventCategory - ChannellGrouping - TotalEvents Tables: - DimDate - DimChannel - DimEvent - FactEvents

Correct Answer: DimEvent, DimChannel, FactEvents Box 1: DimEvent - Box 2: DimChannel - Box 3: FactEvents - Fact tables store observations or events, and can be sales orders, stock balances, exchange rates, temperatures, etc Reference: https://docs.microsoft.com/en-us/power-bi/guidance/star-schema

Question Set 1 Q11 DRAG DROP - You need to create a partitioned table in an Azure Synapse Analytics dedicated SQL pool. How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content. NOTE: Each correct selection is worth one point. Select and Place: Clustered Index Collate Distribution Partition Partition Function Partition Scheme Put here: Create table table 1 ... ( ________ = hash(ID), ________ = (ID range left ... )

Correct Answer: Distribution, Partition Box 1: DISTRIBUTION - Table distribution options include DISTRIBUTION = HASH ( distribution_column_name ), assigns each row to one distribution by hashing the value stored in distribution_column_name. Box 2: PARTITION - Table partition options. Syntax:PARTITION ( partition_column_name RANGE [ LEFT | RIGHT ] FOR VALUES ( [ boundary_value [,...n] ] )) Reference:https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-azure-sql-data-warehouse?

Question Set 1 Q20 You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table contains purchases from suppliers for a retail store. FactPurchase will contain the following columns. Name PurchaseKey bigint DateKey int SupplierKey int isOrderFinalized Bit ... FactPurchase will have 1 million rows of data added daily and will contain three years of data. Transact-SQL queries similar to the following query will be executed daily. SELECT -SupplierKey, StockItemKey, IsOrderFinalized, COUNT(*) FROM FactPurchase - WHERE DateKey >= 20210101 - AND DateKey <= 20210131 - GROUP By SupplierKey, StockItemKey, IsOrderFinalized Which table distribution will minimize query times? A. replicated B. hash-distributed on PurchaseKey C. round-robin D. hash-distributed on IsOrderFinalized

Correct Answer: Hash-distributed on PurchaseKey Correct Answer: B Hash-distributed tables improve query performance on large fact tables. To balance the parallel processing, select a distribution column that: ✑ Has many unique values. The column can have duplicate values. All rows with the same value are assigned to the same distribution. Since there are 60 distributions, some distributions can have > 1 unique values while others may end with zero values. ✑ Does not have NULLs, or has only a few NULLs. ✑ Is not a date column. Incorrect Answers: C: Round-robin tables are useful for improving loading speed. Reference: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

Question Set 2 Q45 HOTSPOT - You have an Azure Storage account that generates 200,000 new files daily. The file names have a format of {YYYY}/{MM}/{DD}/{HH}/{CustomerID}.csv. You need to design an Azure Data Factory solution that will load new data from the storage account to an Azure Data Lake once hourly. The solution must minimize load times and costs. How should you configure the solution? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Load methodology: - Full Load - Incremental Load - Load individual files as they arrive Trigger: - Fixed schedule - New file - Tumbling window

Correct Answer: Incremental Load, Tumbling window Box 1: Incremental load - Box 2: Tumbling window - Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals. The following diagram illustrates a stream with a series of events and how they are mapped into 10-second tumbling windows. Reference: https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics

Question Set 2 Q30 HOTSPOT - You are building an Azure Analytics query that will receive input data from Azure IoT Hub and write the results to Azure Blob storage. You need to calculate the difference in the number of readings per sensor per hour. How should you complete the query? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: select sensorId, growth = reading - __1__ (reading) over (partition by sensorId __2__ (hour, 1) from input __1__: lag, last, lead __2__: limit duration, offset, when

Correct Answer: lag, limit duration Box 1: LAG - The LAG analytic operator allows one to look up a 'previous' event in an event stream, within certain constraints. It is very useful for computing the rate of growth of a variable, detecting when a variable crosses a threshold, or when a condition starts or stops being true. Box 2: LIMIT DURATION - Example: Compute the rate of growth, per sensor: SELECT sensorId, growth = reading -LAG(reading) OVER (PARTITION BY sensorId LIMIT DURATION(hour, 1)) FROM input Reference: https://docs.microsoft.com/en-us/stream-analytics-query/lag-azure-stream-analytics

Topic 2 - Question Set 2 Q7 You are processing streaming data from vehicles that pass through a toll booth. You need to use Azure Stream Analytics to return the license plate, vehicle make, and hour the last vehicle passed during each 10-minute window. How should you complete the query? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: with LastInWindow as ( select __1__ (time) as LastEventTime from input timestamp by time group by __2__ (minute, 10) ) select input.LicensePlate, input.Make, input.Time from input timestamp by time inner join lastInWindow on __3__ (minute, input, LastInWindow) between 0 and 10 and Input.Time = LastInWindow.LastEventTime __1__: Count, Max, Min, TopOne __2__: Hopping, Session, Sliding, Tumbling __2__: DateAdd, DateDiff, DateName, DatePart

Correct Answer: Max, Tumbling, DateDiff Box 1: MAX - The first step on the query finds the maximum time stamp in 10-minute windows, that is the time stamp of the last event for that window. The second step joins the results of the first query with the original stream to find the event that match the last time stamps in each window. Query: WITH LastInWindow AS ( SELECT MAX(Time) AS LastEventTime FROM Input TIMESTAMP BY Time GROUP BY TumblingWindow(minute, 10) ) SELECT Input.License_plate, Input.Make, Input.Time - FROM Input TIMESTAMP BY Time - INNER JOIN LastInWindow - ON DATEDIFF(minute, Input, LastInWindow) BETWEEN 0 AND 10 AND Input.Time = LastInWindow.LastEventTime Box 2: TumblingWindow - Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals. Box 3: DATEDIFF - DATEDIFF is a date-specific function that compares and returns the time difference between two DateTime fields, for more information, refer to date functions. Reference: https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics

Question Set 1 Q7 HOTSPOT - You need to output files from Azure Data Factory. Which file format should you use for each type of output? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Columnar Format: Avro Parquet GZip TXT JSON with a Timestamp: Avro Parquet GZip TXT

Correct Answer: Parquet, Avro Box 1: Parquet - Parquet stores data in columns, while Avro stores data in a row-based format. By their very nature, column-oriented data stores are optimized for read-heavy analytical workloads, while row-based databases are best for write-heavy transactional workloads. Box 2: Avro - An Avro schema is created using JSON format. AVRO supports timestamps. Note: Azure Data Factory supports the following file formats (not GZip or TXT).

Question Set 1 Q16 HOTSPOT - You have two Azure Storage accounts named Storage1 and Storage2. Each account holds one container and has the hierarchical namespace enabled. The system has files that contain data stored in the Apache Parquet format. You need to copy folders and files from Storage1 to Storage2 by using a Data Factory copy activity. The solution must meet the following requirements: ✑ No transformations must be performed. ✑ The original folder structure must be retained. ✑ Minimize time required to perform the copy activity. How should you configure the copy activity? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Source dataset type: - Binary - Parquet - Delimited text Copy activity copy behavior: - FlattenHierarchy - MergeFiles - PreserveHierarchy

Correct Answer: Parquet, PreserveHierarchy XX Alternate answer: Binary, PreserveHierarchy Box 1: Parquet - For Parquet datasets, the type property of the copy activity source must be set to ParquetSource. Box 2: PreserveHierarchy - PreserveHierarchy (default): Preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical to the relative path of the target file to the target folder. Incorrect Answers: ✑ FlattenHierarchy: All files from the source folder are in the first level of the target folder. The target files have autogenerated names. ✑ MergeFiles: Merges all files from the source folder to one file. If the file name is specified, the merged file name is the specified name. Otherwise, it's an autogenerated file name. Reference: https://docs.microsoft.com/en-us/azure/data-factory/format-parquet https://docs.microsoft.com/en-us/azure/data-factory/connector-azure-data-lake-storage

Topic 2 - Question Set 2 Q12 DRAG DROP - You have an Apache Spark DataFrame named temperatures. A sample of the data is shown in the following table. Date Temp ... ... 18-01-2021 3 19-01-2021 4 20-01-2021 2 21-01-2021 2 ... ... You need to produce the following table by using a Spark SQL query. Year Jan Feb Mar Apr May 2019 2.3 4.1 5.2 7.6 9.2 2020 2.4 4.2 4.9 7.8 9.1 2021 2.6 5.3 3.4 7.9 9.5 How should you complete the query? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content. NOTE: Each correct selection is worth one point. Select and Place: Values: Cast, Collate, Convert, Flatten, Pivot, UnPivot select * from (select >> fields << from temperatures where date between here and now) __1__ ( avg (__2__ (Temp as Decimal(4,1))) For Month in ( 1 Jan, 2 Feb, etc.) Order by Year ASC

Correct Answer: Pivot, Cast Box 1: PIVOT - PIVOT rotates a table-valued expression by turning the unique values from one column in the expression into multiple columns in the output. And PIVOT runs aggregations where they're required on any remaining column values that are wanted in the final output. Incorrect Answers: UNPIVOT carries out the opposite operation to PIVOT by rotating columns of a table-valued expression into column values. Box 2: CAST - If you want to convert an integer value to a DECIMAL data type in SQL Server use the CAST() function. Example:S ELECT -CAST(12 AS DECIMAL(7,2) ) AS decimal_value; Here is the result:decimal_value12.00 Reference: https://learnsql.com/cookbook/how-to-convert-an-integer-to-a-decimal-in-sql-server/ https://docs.microsoft.com/en-us/sql/t-sql/queries/from-using-pivot-and-unpivot

Question Set 1 Q29 HOTSPOT - You store files in an Azure Data Lake Storage Gen2 container. The container has the storage policy shown in the following exhibit. {"rules": [ {"enabled":true, "name":"contosorule", ... ... "prefixMatch":["container1/contoso"]}... Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic. NOTE: Each correct selection is worth one point. Hot Area: The files are ___ after 30 days: deleted from the container moved to archive storage moved to cool storage moved to hot storage The storage policy applies to __: container1/contoso.csv container1/docs/contoso.json container1/mycontoso/contoso.csv

Correct Answer: moved to cool storage, container1/contoso.csv Box 1: moved to cool storage - The ManagementPolicyBaseBlob.TierToCool property gets or sets the function to tier blobs to cool storage. Support blobs currently at Hot tier. Box 2: container1/contoso.csv - As defined by prefixMatch.prefixMatch: An array of strings for prefixes to be matched. Each rule can define up to 10 case-senstive prefixes. A prefix string must start with a container name. Reference: https://docs.microsoft.com/en-us/dotnet/api/microsoft.azure.management.storage.fluent.models.managementpolicybaseblob.tiertocool

Question Set 2 Q44 HOTSPOT - You have an Azure Data Factory instance named ADF1 and two Azure Synapse Analytics workspaces named WS1 and WS2.ADF1 contains the following pipelines: ✑ P1: Uses a copy activity to copy data from a nonpartitioned table in a dedicated SQL pool of WS1 to an Azure Data Lake Storage Gen2 account ✑ P2: Uses a copy activity to copy data from text-delimited files in an Azure Data Lake Storage Gen2 account to a nonpartitioned table in a dedicated SQL pool ofWS2 - You need to configure P1 and P2 to maximize parallelism and performance. Which dataset settings should you configure for the copy activity if each pipeline? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: P1: Set the Copy method to Bulk insert Set the Copy method to Polybase Set the Isolation level to Repeatable read Set the Partition option to Dynamic range P2: Set the Copy method to Bulk insert Set the Copy method to Polybase Set the Isolation level to Repeatable read Set the Partition option to Dynamic range

Correct Answer: Polybase, Bulk insert ?? Alternate Answer: Dynamic Range or Polybase, Polybase Box 1: Set the Copy method to PolyBase While SQL pool supports many loading methods including non-Polybase options such as BCP and SQL BulkCopy API, the fastest and most scalable way to load data is through PolyBase. PolyBase is a technology that accesses external data stored in Azure Blob storage or Azure Data Lake Store via the T-SQL language. Box 2: Set the Copy method to Bulk insert Polybase not possible for text files. Have to use Bulk insert. Reference: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/load-data-overview Alternate Reference: https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-versioned-feature-summary

Question Set 1 Q8 HOTSPOT - You use Azure Data Factory to prepare data to be queried by Azure Synapse Analytics serverless SQL pools. Files are initially ingested into an Azure Data Lake Storage Gen2 account as 10 small JSON files. Each file contains the same data attributes and data from a subsidiary of your company. You need to move the files to a different folder and transform the data to meet the following requirements: ✑ Provide the fastest possible query times. ✑ Automatically infer the schema from the underlying files. How should you configure the Data Factory copy activity? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Copy behavior: Flatten Hierarchy Merge Files Preserve Hierarchy Sink File Type: CSV JSON Parquet TXT

Correct Answer: Preserve Hierarchy, Parquet ALTERNATE: Merge, Parquet Box 1: Preserver hierarchy - Compared to the flat namespace on Blob storage, the hierarchical namespace greatly improves the performance of directory management operations, which improves overall job performance. Box 2: Parquet - Azure Data Factory parquet format is supported for Azure Data Lake Storage Gen2.Parquet supports the schema property. Reference:https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction https://docs.microsoft.com/en-us/azure/data-factory/format-parquet

Question Set 1 Q31 HOTSPOT - You have an Azure Data Lake Storage Gen2 account named account1 that stores logs as shown in the following table. Type Designated retention period Application 360 days Infrastructure 60 days You do not expect that the logs will be accessed during the retention periods. You need to recommend a solution for account1 that meets the following requirements: ✑ Automatically deletes the logs at the end of each retention period ✑ Minimizes storage costs What should you include in the recommendation? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: To minimize storage costs: - Store logs in the Archive access tier - Store logs in the Cool access tier - Store infra logs in Cool and app logs in Archive To delete logs automatically: - Azure Data Factory pipelines - Azure Blob storage lifecycle management rules - Immutable Azure Blob storage time-based retention policies

Correct Answer: Store infra logs in Cool and app logs in Archive, Azure Blob storage lifecycle management rules Box 1: Store the infrastructure logs in the Cool access tier and the application logs in the Archive access tier For infrastructure logs: Cool tier - An online tier optimized for storing data that is infrequently accessed or modified. Data in the cool tier should be stored for a minimum of 30 days. The cool tier has lower storage costs and higher access costs compared to the hot tier. For application logs: Archive tier - An offline tier optimized for storing data that is rarely accessed, and that has flexible latency requirements, on the order of hours. Data in the archive tier should be stored for a minimum of 180 days. Box 2: Azure Blob storage lifecycle management rules Blob storage lifecycle management offers a rule-based policy that you can use to transition your data to the desired access tier when your specified conditions are met. You can also use lifecycle management to expire data at the end of its life. Reference: https://docs.microsoft.com/en-us/azure/storage/blobs/access-tiers-overview

Topic 2 - Question Set 2 Q3 HOTSPOT - You plan to create a real-time monitoring app that alerts users when a device travels more than 200 meters away from a designated location. You need to design an Azure Stream Analytics job to process the data for the planned app. The solution must minimize the amount of code developed and the number of technologies used. What should you include in the Stream Analytics job? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Input type: - Stream - Reference Function: - Aggregate - Geospatial - Windowing

Correct Answer: Stream, Geospatial Input type: Stream - You can process real-time IoT data streams with Azure Stream Analytics. Function: Geospatial - With built-in geospatial functions, you can use Azure Stream Analytics to build applications for scenarios such as fleet management, ride sharing, connected cars, and asset tracking. Note: In a real-world scenario, you could have hundreds of these sensors generating events as a stream. Ideally, a gateway device would run code to push these events to Azure Event Hubs or Azure IoT Hubs. Reference: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-get-started-with-azure-stream-analytics-to-process-data-from-iot-devices https://docs.microsoft.com/en-us/azure/stream-analytics/geospatial-scenarios

Question Set 2 Q43 HOTSPOT - You are designing an Azure Stream Analytics solution that receives instant messaging data from an Azure Event Hub. You need to ensure that the output from the Stream Analytics job counts the number of messages per time zone every 15 seconds. How should you complete the Stream Analytics query? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Select TimeZone, count(*) as MessageCount From MessageStream __1__ CreatedAt Group By TimeZone, __2__ (second, 15) __1__: Last, Over, System.TimeStamp(), Timestamp By __2__: Hopping, Session, Sliding, Tumbling (Window)

Correct Answer: Timestamp By, TumblingWindow YY Alternate Answer: System.Timestamp(), TumblingWindow Box 1: timestamp by - Box 2: TUMBLINGWINDOW - Tumbling window functions are used to segment a data stream into distinct time segments and perform a function against them, such as the example below. The key differentiators of a Tumbling window are that they repeat, do not overlap, and an event cannot belong to more than one tumbling window. Reference: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions

Question Set 2 Q50 HOTSPOT - You are building an Azure Stream Analytics job to retrieve game data. You need to ensure that the job returns the highest scoring record for each five-minute time interval of each game. How should you complete the Stream Analytics query? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: select __1__ as HighestScore from input timestamp by CreatedAt group by __2__ __1__ Collect(Score) Collect Top(1) over (order by score desc) Game, max(Score) TopOne() over (partition by Game order by Score desc) __2__ Game Hopping(minute,5) Tumbling(minute,5) Windows(TumblingWindow(minute,5),Hopping(minute,5))

Correct Answer: TopOne() over(partition by Game order by Score desc), Hopping(minute,5) YY Alternate Answer: TopOne, Tumbling Box 1: TopOne OVER(PARTITION BY Game ORDER BY Score Desc) TopOne returns the top-rank record, where rank defines the ranking position of the event in the window according to the specified ordering. Ordering/ranking is based on event columns and can be specified in ORDER BY clause. Box 2: Hopping(minute,5) Hopping window functions hop forward in time by a fixed period. It may be easy to think of them as Tumbling windows that can overlap and be emitted more often than the window size. Events can belong to more than one Hopping window result set. To make a Hopping window the same as a Tumbling window, specify the hop size to be the same as the window size. Reference: https://docs.microsoft.com/en-us/stream-analytics-query/topone-azure-stream-analytics https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions

Question Set 2 Q24 HOTSPOT - You build an Azure Data Factory pipeline to move data from an Azure Data Lake Storage Gen2 container to a database in an Azure Synapse Analytics dedicated SQL pool. Data in the container is stored in the following folder structure. /in/{YYYY}/{MM}/{DD}/{HH}/{mm} The earliest folder is /in/2021/01/01/00/00. The latest folder is /in/2021/01/15/01/45. You need to configure a pipeline trigger to meet the following requirements: ✑ Existing data must be loaded. ✑ Data must be loaded every 30 minutes. ✑ Late-arriving data of up to two minutes must be included in the load for the time at which the data should have arrived. How should you configure the pipeline trigger? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Answer Area: - Type: Event, On-demand, Schedule, Tumbling Window Additional Properties: - Prefix: /in/, Event: Blob created - Recurrence: 30 minutes, Start time: 2021-01-01T00:00 - Recurrence: 30 minutes, Start time: 2021-01-01T00:00, Delay: 2 minutes - Recurrence: 32 minutes, Start time: 2021-01-15T01:45

Correct Answer: Tumbling window, Recurrence: 30 minutes, Start time: 2021-01-01T00:00, Delay: 2 minutes Box 1: Tumbling window - To be able to use the Delay parameter we select Tumbling window. Box 2:Recurrence: 30 minutes, not 32 minutes Delay: 2 minutes. The amount of time to delay the start of data processing for the window. The pipeline run is started after the expected execution time plus the amount of delay. The delay defines how long the trigger waits past the due time before triggering a new run. The delay doesn't alter the window startTime. Reference: https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-tumbling-window-trigger

Question Set 1 Q38 HOTSPOT - You are creating dimensions for a data warehouse in an Azure Synapse Analytics dedicated SQL pool. You create a table by using the Transact-SQL statement shown in the following exhibit. create table [dbo].[DimProduct]( productKey int identity(1,1) not null, productSourceId int not null, productNumber nvarchar(25) not null, ... sellStartDate date, sellEndDate date, etlAuditId int; Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic. NOTE: Each correct selection is worth one point. Hot Area: DimProduct is a __ slowly changing dimension (SCD) - Type 0 - Type 1 - Type 2 The ProductKey column is a __ - a surrogate key - a business key - an audit column

Correct Answer: Type 2, business key YY Alternate Answer: Type 2, surrogate Key Box 1: Type 2 - A Type 2 SCD supports versioning of dimension members. Often the source system doesn't store versions, so the data warehouse load process detects and manages changes in a dimension table. In this case, the dimension table must use a surrogate key to provide a unique reference to a version of the dimension member. It also includes columns that define the date range validity of the version (for example, StartDate and EndDate) and possibly a flag column (for example,IsCurrent) to easily filter by current dimension members. Incorrect Answers: A Type 1 SCD always reflects the latest values, and when changes in source data are detected, the dimension table data is overwritten. Box 2: a business key - A business key or natural key is an index which identifies uniqueness of a row based on columns that exist naturally in a table according to business rules. For example business keys are customer code in a customer table, composite of sales order header number and sales order item line number within a sales order details table. Reference: https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types

Topic 2 - Question Set 2 Q14 HOTSPOT - The following code segment is used to create an Azure Databricks cluster. { num_workers: null, autoscale: {min_workers: 2, max_workers: }, cluster_name: MyCluster, spark_version: latest-stable-scala-2.11 spark_conf: { spark.databricks.cluster.profile: serverless spark.databricks.repl.allowedLanguages: sql,pythom,r}, node_type_id: Standard_DS13_v2, ssh_public_keys: [], custom_tags: { ResourceClass: Serverless }, spark_env_vars: { Pyspark_Python: />>path<<}, autotermination_minutes: 90 enable_elastic_disk: true init_scripts: [] } For each of the following statements, select Yes if the statement is true. Otherwise, select No. NOTE: Each correct selection is worth one point. Hot Area: yes or no statement The Databricks cluster supports multiple concurrent users. The Databricks cluster minimizes costs when running scheduled jobs that execute notebooks. The Databricks cluster supports the creation of a Delta Lake table.

Correct Answer: Yes, No, Yes Box 1: Yes - A cluster mode of ג€˜High Concurrencyג€™ is selected, unlike all the others which are ג€˜Standardג€™. This results in a worker type of Standard_DS13_v2. Box 2: No - When you run a job on a new cluster, the job is treated as a data engineering (job) workload subject to the job workload pricing. When you run a job on an existing cluster, the job is treated as a data analytics (all-purpose) workload subject to all-purpose workload pricing. Box 3: Yes - Delta Lake on Databricks allows you to configure Delta Lake based on your workload patterns. Reference: https://adatis.co.uk/databricks-cluster-sizing/ https://docs.microsoft.com/en-us/azure/databricks/jobshttps://docs.databricks.com/administration-guide/capacity-planning/cmbp.html https://docs.databricks.com/delta/index.html

Question Set 2 Q42 HOTSPOT - You configure version control for an Azure Data Factory instance as shown in the following exhibit. >>Git configuration<< with >>respository<< Type: Azure DevOps Git Azure DevOps Account: CONTOSO Project Name: Data Repository Name: dwh_batchetl Collaboration Branch: main Publish Branch: adf_publish Root folder: / Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic. NOTE: Each correct selection is worth one point. Hot Area: ARM templates for the pipeline assets are stored in: / adf_publish main Parameterization template A Data Factory ARM template named contososales can be found in: / /contososales /dwh_batchetl/adf_publish/contososales /main

Correct Answer: adf_publish, / dwh_batchetl/adf_publish/contososales Box 1: adf_publish - The Publish branch is the branch in your repository where publishing related ARM templates are stored and updated. By default, it's adf_publish. Box 2: / dwh_batchetl/adf_publish/contososales Note: RepositoryName (here dwh_batchetl): Your Azure Repos code repository name. Azure Repos projects contain Git repositories to manage your source code as your project grows. You can create a new repository or use an existing repository that's already in your project. Reference: https://docs.microsoft.com/en-us/azure/data-factory/source-control

Question Set 2 Q48 DRAG DROP - You are responsible for providing access to an Azure Data Lake Storage Gen2 account. Your user account has contributor access to the storage account, and you have the application ID and access key. You plan to use PolyBase to load data into an enterprise data warehouse in Azure Synapse Analytics. You need to configure PolyBase to connect the data warehouse to storage account. Which three components should you create in sequence? To answer, move the appropriate components from the list of components to the answer area and arrange them in the correct order. Select and Place: Components: a database scoped credential an asymmetric key an external data source a database encryption key an external file format

Correct Answer: an asymmetric key, a database scoped credential, an eternal data source YY Alternate Answer: a database scoped credential, an external data source, an external file format Step 1: an asymmetric key - A master key should be created only once in a database. The Database Master Key is a symmetric key used to protect the private keys of certificates and asymmetric keys in the database. Step 2: a database scoped credential Create a Database Scoped Credential. A Database Scoped Credential is a record that contains the authentication information required to connect an external resource. The master key needs to be created first before creating the database scoped credential. Step 3: an external data source - Create an External Data Source. External data sources are used to establish connectivity for data loading using Polybase. Reference: https://www.sqlservercentral.com/articles/access-external-data-from-azure-synapse-analytics-using-polybase

Topic 2 - Question Set 2 Q20 HOTSPOT - You are building an Azure Stream Analytics job to identify how much time a user spends interacting with a feature on a webpage. The job receives events based on user actions on the webpage. Each row of data represents an event. Each event has a type of either 'start' or 'end'. You need to calculate the duration between start and end events. How should you complete the query? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: select [user], feature, __1__ second, __2__ [time] over (partition by [user], feature limit duration(hour, 1) when event = 'start', time) as duration from input timestamp by time where event = 'end' __1__: dateAdd(, dateDiff(, datePart( __2__: isfirst, last, topone

Correct Answer: dateDiff, Last Box 1: DATEDIFF - DATEDIFF function returns the count (as a signed integer value) of the specified datepart boundaries crossed between the specified startdate and enddate. Syntax: DATEDIFF ( datepart , startdate, enddate ) Box 2: LAST - The LAST function can be used to retrieve the last event within a specific condition. In this example, the condition is an event of type Start, partitioning the search by PARTITION BY user and feature. This way, every user and feature is treated independently when searching for the Start event. LIMIT DURATION limits the search back in time to 1 hour between the End and Start events. Example: SELECT - [user], feature, DATEDIFF(second,LAST(Time) OVER (PARTITION BY [user], feature LIMIT DURATION(hour,1) WHEN Event = 'start'),Time) as duration - FROM input TIMESTAMP BY Time - WHERE Event = 'end' Reference: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-stream-analytics-query-patterns

Question Set 2 Q37 HOTSPOT - You have a self-hosted integration runtime in Azure Data Factory. The current status of the integration runtime has the following configurations: ✑ Status: Running ✑ Type: Self-Hosted ✑ Version: 4.4.7292.1 ✑ Running / Registered Node(s): 1/1 ✑ High Availability Enabled: False ✑ Linked Count: 0 ✑ Queue Length: 0 ✑ Average Queue Duration. 0.00s The integration runtime has the following node details: ✑ Name: X-M ✑ Status: Running ✑ Version: 4.4.7292.1 ✑ Available Memory: 7697MB ✑ CPU Utilization: 6% ✑ Network (In/Out): 1.21KBps/0.83KBps ✑ Concurrent Jobs (Running/Limit): 2/14 ✑ Role: Dispatcher/Worker ✑ Credential Status: In Sync Use the drop-down menus to select the answer choice that completes each statement based on the information presented. NOTE: Each correct selection is worth one point. Hot Area: If the X-M node becomes unavailable, all executed pipelines will: - fail until the node comes back online - switch to another integration runtime - exceed the CPU limit The number of concurrent jobs and the CPU usage indicate that the Concurrent Jobs (Running/Limit) value should be: - raised - lowered - left as is

Correct Answer: fail until the node comes back online, lowered Box 1: fail until the node comes back online We see: High Availability Enabled: False Note: Higher availability of the self-hosted integration runtime so that it's no longer the single point of failure in your big data solution or cloud data integration with Data Factory. Box 2: lowered - We see: Concurrent Jobs (Running/Limit): 2/14CPU Utilization: 6% Note: When the processor and available RAM aren't well utilized, but the execution of concurrent jobs reaches a node's limits, scale up by increasing the number of concurrent jobs that a node can run Reference: https://docs.microsoft.com/en-us/azure/data-factory/create-self-hosted-integration-runtime

Topic 2 - Question Set 2 Q11 DRAG DROP - You have an Azure Synapse Analytics workspace named WS1.You have an Azure Data Lake Storage Gen2 container that contains JSON-formatted files in the following format. >> json file format << You need to use the serverless SQL pool in WS1 to read the files. How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content. NOTE: Each correct selection is worth one point. Select and Place: Values: - opendatasource - openjson - openquery - openrowset Answer area: select * from __1__ >> statements << as q cross apply __2__ (contextcustomdimensions) with >> more stuff <<

Correct Answer: openrowset, openjson Box 1: openrowset - The easiest way to see to the content of your CSV file is to provide file URL to OPENROWSET function, specify csv FORMAT. Example: SELECT *FROM OPENROWSET( BULK 'csv/population/population.csv', DATA_SOURCE = 'SqlOnDemandDemo', FORMAT = 'CSV', PARSER_VERSION = '2.0', FIELDTERMINATOR =',', ROWTERMINATOR = '\n' Box 2: openjson - You can access your JSON files from the Azure File Storage share by using the mapped drive, as shown in the following example: SELECT book.* FROM - OPENROWSET(BULK N't:\books\books.json', SINGLE_CLOB) AS json CROSS APPLY OPENJSON(BulkColumn) WITH( id nvarchar(100), name nvarchar(100), price float, pages_i int, author nvarchar(100)) AS book Reference: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-single-csv-file https://docs.microsoft.com/en-us/sql/relational-databases/json/import-json-documents-into-sql-server

Question Set 1 Q45 HOTSPOT - You are building an Azure Synapse Analytics dedicated SQL pool that will contain a fact table for transactions from the first half of the year 2020. You need to ensure that the table meets the following requirements: ✑ Minimizes the processing time to delete data that is older than 10 years ✑ Minimizes the I/O for queries that use year-to-date values How should you complete the Transact-SQL statement? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Create table dbo.factTransaction (transactionTypeId int not null, transactionDateId int not null, customerId int not null, recipientId int not null, amount money not null) with __1__ __2__ range right... __1__ - clustered columnstore index - distribution - partition - truncate_target __2__ - transactionDateId - transactionDateId, transactionTypeId - hash (transactionTypeId) - round robin

Correct Answer: partition, transactionDateId Box 1: PARTITION - RANGE RIGHT FOR VALUES is used with PARTITION. Box 2: [TransactionDateID] Partition on the date column. Example: Creating a RANGE RIGHT partition function on a datetime column The following partition function partitions a table or index into 12 partitions, one for each month of a year's worth of values in a datetime column. CREATE PARTITION FUNCTION [myDateRangePF1] (datetime)AS RANGE RIGHT FOR VALUES ('20030201', '20030301', '20030401','20030501', '20030601', '20030701', '20030801','20030901', '20031001', '20031101', '20031201'); Reference: https://docs.microsoft.com/en-us/sql/t-sql/statements/create-partition-function-transact-sql

Topic 2 - Question Set 2 Q16 You have an enterprise data warehouse in Azure Synapse Analytics that contains a table named FactOnlineSales. The table contains data from the start of 2009 to the end of 2012.You need to improve the performance of queries against FactOnlineSales by using table partitions. The solution must meet the following requirements: ✑ Create four partitions based on the order date. ✑ Ensure that each partition contains all the orders places during a given calendar year. How should you complete the T-SQL command? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: create table dbo.factOnlineSales (>>fields<<) with (clustered columnstore index) partition (orderDateKey range __1__ for values (__2__) __1__: left, right __2__: 20090101, 20121231 20100101,20110101,20120101 20090101,20100101,20110101,20120101

Correct Answer: right; 20100101,20110101,20120101 https://docs.microsoft.com/en-us/sql/t-sql/statements/create-partition-function-transact-sql?view=sql-server-ver15#c-creating-a-range-right-partition-function-on-a-datetime-column Range Left or Right, both are creating similar partition but there is difference in comparison For example: in this scenario, when you use LEFT and 20100101,20110101,20120101 Partition will be, datecol<=20100101, datecol>20100101 and datecol<=20110101, datecol>20110101 and datecol<=20120101, datecol>20120101 But if you use range RIGHT and 20100101,20110101,20120101 Partition will be, datecol<20100101, datecol>=20100101 and datecol<20110101, datecol>=20110101 and datecol<20120101, datecol>=20120101 In this example, Range RIGHT will be suitable for calendar comparison Jan 1st to Dec 31st Reference: https://docs.microsoft.com/en-us/sql/t-sql/statements/create-partition-function-transact-sql?view=sql-server-ver15

Topic 2 - Question Set 2 Q22 DRAG DROP - You need to create an Azure Data Factory pipeline to process data for the following three departments at your company: Ecommerce, retail, and wholesale. The solution must ensure that data can also be processed for the entire company. How should you complete the Data Factory data flow script? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content. NOTE: Each correct selection is worth one point. Select and Place: Values all, ecommerce, retail, wholesale dept=='ecommerce', dept=='retail', dept=='wholesale' dept=='ecommerce', dept=='wholesale', dept=='retail' disjoint: false disjoint: true ecommerce, retail, wholesale, all Answer CleanData split(__1__ __2__) ~> SplitByDept@ (__3__)

Correct Answer: split( dept=='ecommerce', dept=='retail', dept=='wholesale' disjoint: false) ~> SplitByDept@(ecommerce, retail, wholesale, all) The conditional split transformation routes data rows to different streams based on matching conditions. The conditional split transformation is similar to a CASE decision structure in a programming language. The transformation evaluates expressions, and based on the results, directs the data row to the specified stream. Box 1: dept=='ecommerce', dept=='retail', dept=='wholesale' First we put the condition. The order must match the stream labeling we define in Box 3. Syntax:<incomingStream> split(<conditionalExpression1> <conditionalExpression2> ... disjoint: {true | false}) ~> <splitTx>@(stream1, stream2, ..., <defaultStream>) Box 2: discount : false - disjoint is false because the data goes to the first matching condition. All remaining rows matching the third condition go to output stream all. Box 3: ecommerce, retail, wholesale, all Label the streams - Reference: https://docs.microsoft.com/en-us/azure/data-factory/data-flow-conditional-split

Question Set 1 Q13 You have an enterprise-wide Azure Data Lake Storage Gen2 account. The data lake is accessible only through an Azure virtual network named VNET1. You are building a SQL pool in Azure Synapse that will use data from the data lake. Your company has a sales team. All the members of the sales team are in an Azure Active Directory group named Sales. POSIX controls are used to assign the Sales group access to the files in the data lake. You plan to load data to the SQL pool every hour. You need to ensure that the SQL pool can load the sales data from the data lake. Which three actions should you perform? Each correct answer presents part of the solution. NOTE: Each area selection is worth one point. A. Add the managed identity to the Sales group. B. Use the managed identity as the credentials for the data load process. C. Create a shared access signature (SAS). D. Add your Azure Active Directory (Azure AD) account to the Sales group. E. Use the shared access signature (SAS) as the credentials for the data load process. F. Create a managed identity.

Create Managed identity Add Managed identity to the Sales Group Use Managed identity as the credentials for the data load process YY Alternate: FAB Correct Answer: ADF The managed identity grants permissions to the dedicated SQL pools in the workspace. Note: Managed identity for Azure resources is a feature of Azure Active Directory. The feature provides Azure services with an automatically managed identity in Azure AD - Reference:https://docs.microsoft.com/en-us/azure/synapse-analytics/security/synapse-workspace-managed-identity

Question Set 1 Q44 DRAG DROP - You have data stored in thousands of CSV files in Azure Data Lake Storage Gen2. Each file has a header row followed by a properly formatted carriage return (/ r) and line feed (/n).You are implementing a pattern that batch loads the files daily into an enterprise data warehouse in Azure Synapse Analytics by using PolyBase. You need to skip the header row when you import the files into the data warehouse. Before building the loading pattern, you need to prepare the required database objects in Azure Synapse Analytics. Which three actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order. NOTE: Each correct selection is worth one point Select and Place: Actions <--> Answer here - Create a database scoped credential that uses Azure Active Directory Application and a Service Principal Key - Create an external data source that uses the abfs location - Use Create External Table As Select (CETAS) and configure the reject options to specify reject values or percentages - Create an external file format and set the First_Row option

Create an external data source Create an external file format Use CETAS Alternative Answer: Create database scoped credential Create external data source Use CETAS Step 1: Create an external data source that uses the abfs location Create External Data Source to reference Azure Data Lake Store Gen 1 or 2 Step 2: Create an external file format and set the First_Row option. Create External File Format. Step 3: Use CREATE EXTERNAL TABLE AS SELECT (CETAS) and configure the reject options to specify reject values or percentages To use PolyBase, you must create external tables to reference your external data. Use reject options. Note: REJECT options don't apply at the time this CREATE EXTERNAL TABLE AS SELECT statement is run. Instead, they're specified here so that the database can use them at a later time when it imports data from the external table. Later, when the CREATE TABLE AS SELECT statement selects data from the external table, the database will use the reject options to determine the number or percentage of rows that can fail to import before it stops the import. Reference: https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-t-sql-objects https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-as-select-transact-sql

Question Set 1 Q41 DRAG DROP - You need to build a solution to ensure that users can query specific files in an Azure Data Lake Storage Gen2 account from an Azure Synapse Analytics serverless SQL pool. Which three actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order. NOTE: More than one order of answer choices is correct. You will receive credit for any of the correct orders you select. Select and Place: Actions <--> Put Answers here Create an external file format object Create an external data source Create a query that uses Create Table as Select Create a table Create an external table

Create an external data source Create an external file format object Create an external table Step 1: Create an external data source You can create external tables in Synapse SQL pools via the following steps: 1. CREATE EXTERNAL DATA SOURCE to reference an external Azure storage and specify the credential that should be used to access the storage. 2. CREATE EXTERNAL FILE FORMAT to describe format of CSV or Parquet files. 3. CREATE EXTERNAL TABLE on top of the files placed on the data source with the same file format. Step 2: Create an external file format object Creating an external file format is a prerequisite for creating an external table. Step 3: Create an external table Reference: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-table1

Question Set 1 Q9 HOTSPOT - You have a data model that you plan to implement in a data warehouse in Azure Synapse Analytics as shown in the following exhibit. DimEmployee <- FctDailyBookings -> DimCustomer -> DimTime All the dimension tables will be less than 2 GB after compression, and the fact table will be approximately 6 TB. The dimension tables will be relatively static with very few data inserts and updates. Which type of table should you use for each table? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: DimCustomer \ Hash Distributed Dim Employee \ Round Robin FctDailyBookings / Replicated DimTime /

Dims are Replicated Fact is Hash Distributed Box 1: Replicated - Replicated tables are ideal for small star-schema dimension tables, because the fact table is often distributed on a column that is not compatible with the connected dimension tables. If this case applies to your schema, consider changing small dimension tables currently implemented as round-robin to replicated. Box 2: Replicated Box 3: Replicated Box 4: Hash-distributed - For Fact tables use hash-distribution with clustered columnstore index. Performance improves when two hash tables are joined on the same distribution column. Reference:https://azure.microsoft.com/en-us/updates/reduce-data-movement-and-make-your-queries-more-efficient-with-the-general-availability-of-replicated-tables/ https://azure.microsoft.com/en-us/blog/replicated-tables-now-generally-available-in-azure-sql-data-warehouse/

Question Set 1 Q19 HOTSPOT - You have a SQL pool in Azure Synapse. You plan to load data from Azure Blob storage to a staging table. Approximately 1 million rows of data will be loaded daily. The table will be truncated before each daily load. You need to create the staging table. The solution must minimize how long it takes to load the data to the staging table. How should you configure the table? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one point. Hot Area: Distribution - Hash - Replicated - Round-robin Indexing - Clustered - Clustered columnstore - Heap Partitioning - Date - None

Hash, Clustered columnstore, Date YY Alternate Answers: Round-Robin, Heap, None Box 1: Hash - Hash-distributed tables improve query performance on large fact tables. They can have very large numbers of rows and still achieve high performance. Incorrect Answers: Round-robin tables are useful for improving loading speed. Box 2: Clustered columnstore - When creating partitions on clustered columnstore tables, it is important to consider how many rows belong to each partition. For optimal compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is needed. Box 3: Date - Table partitions enable you to divide your data into smaller groups of data. In most cases, table partitions are created on a date column. Partition switching can be used to quickly remove or replace a section of a table. Reference: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-partition https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

Question Set 1 Q42 You are designing a data mart for the human resources (HR) department at your company. The data mart will contain employee information and employee transactions. From a source system, you have a flat extract that has the following fields: ✑ EmployeeID FirstName - ✑ LastName ✑ Recipient ✑ GrossAmount ✑ TransactionID ✑ GovernmentID ✑ NetAmountPaid ✑ TransactionDate You need to design a star schema data model in an Azure Synapse Analytics dedicated SQL pool for the data mart. Which two tables should you create? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point. A. a dimension table for Transaction B. a dimension table for EmployeeTransaction C. a dimension table for Employee D. a fact table for Employee E. a fact table for Transaction

dimension table for Employee fact table for Transaction Correct Answer: CE C: Dimension tables contain attribute data that might change but usually changes infrequently. For example, a customer's name and address are stored in a dimension table and updated only when the customer's profile changes. To minimize the size of a large fact table, the customer's name and address don't need to be in every row of a fact table. Instead, the fact table and the dimension table can share a customer ID. A query can join the two tables to associate a customer's profile and transactions. E: Fact tables contain quantitative data that are commonly generated in a transactional system, and then loaded into the dedicated SQL pool. For example, a retail business generates sales transactions every day, and then loads the data into a dedicated SQL pool fact table for analysis. Reference: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-overview

Question Set 1 Q3 DRAG DROP - You have a table named SalesFact in an enterprise data warehouse in Azure Synapse Analytics. SalesFact contains sales data from the past 36 months and has the following characteristics: ✑ Is partitioned by month ✑ Contains one billion rows ✑ Has clustered columnstore indexes At the beginning of each month, you need to remove data from SalesFact that is older than 36 months as quickly as possible. Which three actions should you perform in sequence in a stored procedure? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order. Select and Place: ✑ Switch the partition containing the stale data from SalesFact to SalesFact_Work. ✑ Truncate the partition containing the stale data. ✑ Drop the SalesFact_Work table. ✑ Create and empty table names SalesFact_Work that has the same schema as SalesFact. ✑ Execute a DELETE statement where the value in the Date column is more than 36 months ago. ✑ Copy the data to a new table by using CREATE TABLE AS SELECT (CTAS).

✑ Create an empty table names SalesFact_Work that has the same schema as SalesFact. ✑ Switch the partition containing the stale data from SalesFact to SalesFact_Work. ✑ Drop the SalesFact_Work table.


Related study sets

La Celestina Intro - Spanish 4 Honors

View Set

Talent Development Ch. 1 Questions

View Set

Auditing the Revenue Cycle Day 10

View Set

F2: Determine the need for Behavior-Analytic Services

View Set

Management Test 2 Answers answered incorrectly

View Set

Chapter 13- Cardiovascular System

View Set

(not mine) CFP Retirement Planning Final

View Set

Nuance Clintegrity Physician Query

View Set