DP-203


You need to create an Azure Data Factory pipeline to process data for the following three departments at your company: Ecommerce, retail, and wholesale. The solution must ensure that data can also be processed for the entire company. How should you complete the Data Factory data flow script? CleanData split ( _____________(1)____________________ _____________(2)____________________ ) ~> SplitByDept@ ( _____________(3)____________________ )

(1) dept == 'ecommerce', dept == 'retail', dept == 'wholesale'
(2) disjoint: false
(3) ecommerce, retail, wholesale, all
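Assembled, the completed data flow script (using the stream names given in the question) would read roughly:

CleanData split(dept == 'ecommerce',
    dept == 'retail',
    dept == 'wholesale',
    disjoint: false) ~> SplitByDept@(ecommerce, retail, wholesale, all)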

What is the difference between a surrogate key and a business key?

A surrogate key is a system-generated value (a GUID, a sequence number, etc.) with no business meaning that is used to uniquely identify a record in a table. A business key (also called a natural key) is a column or set of columns that already exists in the table (i.e., attributes of the entity within the data model) and uniquely identifies a record in the table.

What Ingest PaaS meets the following requirements: ✑ Access multiple data sources. ✑ Provide the ability to orchestrate workflow. ✑ Provide the capability to run SQL Server Integration Services packages.

Azure Data Factory

What Storage PaaS meets the following requirements: ✑ Optimize storage for big data workloads. ✑ Provide encryption of data at rest. ✑ Operate with no size limits.

Azure Data Lake Storage

You have a JSON file you need to open. Finish the SQL statement to do so select * from _________(1)___________ ( bulk 'latest/ecdc_cases.jsonl', data_source = 'covid', format = 'csv', fieldterminator ='0x0b', fieldquote = '0x0b' ) with (doc nvarchar(max)) as rows cross apply _______(2)_______ (doc) with ( date_rep datetime2, cases int, fatal int '$.deaths', country varchar(100) '$.countries_and_territories') where country = 'Serbia' order by country, date_rep desc;

(1) OPENROWSET (2) OPENJSON
The easiest way to see the content of your JSON file is to provide the file URL to the OPENROWSET function and specify csv FORMAT. This returns each line as a single string in a separate row of the result set:
{"date_rep":"2020-07-24","day":24,"month":7,"year":2020,"cases":3,"deaths":0,"geo_id":"AF"}
{"date_rep":"2020-07-25","day":25,"month":7,"year":2020,"cases":7,"deaths":0,"geo_id":"AF"}
You can then use the JSON_VALUE and OPENJSON functions to parse the values in the JSON documents and return them as relational values:
date_rep     cases  geo_id
2020-07-24   3      AF
2020-07-25   7      AF
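For reference, the completed statement looks like this (it assumes the external data source named covid already exists and points at the folder containing latest/ecdc_cases.jsonl):

SELECT *
FROM OPENROWSET(
        BULK 'latest/ecdc_cases.jsonl',
        DATA_SOURCE = 'covid',
        FORMAT = 'csv',
        FIELDTERMINATOR = '0x0b',
        FIELDQUOTE = '0x0b'
     ) WITH (doc NVARCHAR(MAX)) AS rows
CROSS APPLY OPENJSON(doc)
     WITH (
        date_rep DATETIME2,
        cases INT,
        fatal INT '$.deaths',
        country VARCHAR(100) '$.countries_and_territories'
     )
WHERE country = 'Serbia'
ORDER BY country, date_rep DESC;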

You have an Azure Data Lake Storage Gen2 account that contains a JSON file for customers. The file contains two attributes named FirstName and LastName. You need to copy the data from the JSON file to an Azure Synapse Analytics table by using Azure Databricks. A new column must be created that concatenates the FirstName and LastName values. You create the following components: ✑ A destination table in Azure Synapse ✑ An Azure Blob storage container ✑ A service principal. Which five actions should you perform in sequence next in the Databricks notebook? Actions:
- Mount the data lake storage onto DBFS
- Write the results to a table in Azure Synapse
- Perform transformations on the file
- Specify a temporary folder to stage the data
- Write the results to data lake storage
- Read the file into a data frame
- Drop the data frame
- Perform transformations on the data frame

1) Mount the data lake storage onto DBFS
2) Read the file into a data frame
3) Perform transformations on the data frame
4) Specify a temporary folder to stage the data
5) Write the results to a table in Azure Synapse

You have an Azure Active Directory (Azure AD) tenant that contains a security group named Group1. You have an Azure Synapse Analytics dedicated SQL pool named dw1 that contains a schema named schema1. You need to grant Group1 read-only permissions to all the tables and views in schema1. The solution must use the principle of least privilege. What actions should you take?

1. Create a database user from the external provider for Group1.
2. Create a database role (Role1) and grant it SELECT on schema1.
3. Add the Group1 user to Role1.
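A minimal T-SQL sketch of those three steps, run inside dw1 (Role1 is an illustrative role name):

-- 1. Create a database user mapped to the Azure AD group
CREATE USER [Group1] FROM EXTERNAL PROVIDER;

-- 2. Create a role that has only SELECT on schema1
CREATE ROLE Role1;
GRANT SELECT ON SCHEMA::schema1 TO Role1;

-- 3. Add the group's database user to the role
ALTER ROLE Role1 ADD MEMBER [Group1];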

You are designing a streaming data solution that will ingest variable volumes of data. You need to ensure that you can change the partition count after creation. Which service should you use to ingest the data? A. Azure Event Hubs Dedicated B. Azure Stream Analytics C. Azure Data Factory D. Azure Synapse Analytics

A. Azure Event Hubs Dedicated. You cannot change the partition count for an event hub after creation, except for event hubs in a dedicated cluster.

You have an Azure Storage account and a data warehouse in Azure Synapse Analytics in the UK South region. You need to copy blob data from the storage account to the data warehouse by using Azure Data Factory. The solution must meet the following requirements: ✑ Ensure that the data remains in the UK South region at all times. ✑ Minimize administrative effort. Which type of integration runtime should you use? A. Azure integration runtime B. Azure-SSIS integration runtime C. Self-hosted integration runtime

A. Azure integration runtime Explanation: If you have strict data compliance requirements and need to ensure that data does not leave a certain geography, you can explicitly create an Azure IR in that region and point the linked service to this IR by using the connectVia property.

You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain description data that has an average length of 1.1 MB. You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics. You need to prepare the files to ensure that the data copies quickly. Which of these solutions meets the goal? A. Convert the files to compressed delimited text files B. Modify the files to ensure that each row is more than 1 MB C. Copy the files to a table that has a columnstore index

A. Convert the files to compressed delimited text files

You are creating an Azure Data Factory data flow that will ingest data from a CSV file, cast columns to specified types of data, and insert the data into a table in an Azure Synapse Analytics dedicated SQL pool. The CSV file contains three columns named username, comment, and date. The data flow already contains the following: ✑ A source transformation. ✑ A Derived Column transformation to set the appropriate types of data. ✑ A sink transformation to land the data in the pool. You need to ensure that the data flow meets the following requirements: ✑ All valid rows must be written to the destination table. ✑ Truncation errors in the comment column must be avoided proactively. ✑ Any rows containing comment values that will cause truncation errors upon insert must be written to a file in blob storage. Which two actions should you perform? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point. A. To the data flow, add a sink transformation to write the rows to a file in blob storage. B. To the data flow, add a Conditional Split transformation to separate the rows that will cause truncation errors. C. To the data flow, add a filter transformation to filter out rows that will cause truncation errors. D. Add a select transformation to select only the rows that will cause truncation errors.

A. To the data flow, add a sink transformation to write the rows to a file in blob storage. B. To the data flow, add a Conditional Split transformation to separate the rows that will cause truncation errors.

You are designing an Azure Stream Analytics solution that will analyze Twitter data. You need to count the tweets in each 10-second window. The solution must ensure that each tweet is counted only once. Solution: You use a hopping window that uses a hop size of 10 seconds and a window size of 10 seconds. Does this meet the goal? A. Yes B. No

A. Yes Hopping windows that have the hop size the same as the window size are essentially tumbling windows

You are designing an Azure Stream Analytics solution that will analyze Twitter data. You need to count the tweets in each 10-second window. The solution must ensure that each tweet is counted only once. A. You use a tumbling window, and you set the window size to 10 seconds. B. You use a session window that uses a timeout size of 10 seconds. C. You use a hopping window that uses a hop size of 5 seconds and a window size 10 seconds

A. You use a tumbling window and you set the window size to 10 seconds. Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals.

You plan to create an Azure Synapse Analytics dedicated SQL pool. You need to minimize the time it takes to identify queries that return confidential information as defined by the company's data privacy regulations and the users who executed the queries. Which two components should you include in the solution? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point. A. sensitivity-classification labels applied to columns that contain confidential information B. resource tags for databases that contain confidential information C. audit logs sent to a Log Analytics workspace D. dynamic data masking for columns that contain confidential information

A. sensitivity-classification labels applied to columns that contain confidential information C. audit logs sent to a Log Analytics workspace

What is the best JSON-like format for storing data in the data lake when you need to query a single record based on a timestamp?

AVRO - a row-based format that has a logical timestamp type

You are designing a real-time dashboard solution that will visualize streaming data from remote sensors that connect to the internet. The streaming data must be aggregated to show the average value of each 10-second interval. The data will be discarded after being displayed in the dashboard. The solution will use Azure Stream Analytics and must meet the following requirements: ✑ Minimize latency from an Azure Event hub to the dashboard. ✑ Minimize the required storage. ✑ Minimize development effort. What should you include in the solution? Azure Stream Analytics input type: ________________________ Azure Stream Analytics output type: ________________________ Aggregation query location: ________________________

Azure Stream Analytics input type: Azure Event Hub
Azure Stream Analytics output type: Microsoft Power BI
Aggregation query location: Azure Stream Analytics

What Preparing & Training PaaS meets the following requirements: ✑ Provide a fully-managed and interactive workspace for exploration and visualization. ✑ Provide the ability to program in R, SQL, Python, Scala, and Java. Provide seamless user authentication with Azure Active Directory.

Azure Databricks

While working on one of your company's projects, your teammate wants to check the options for input to an Azure Stream Analytics task that needs high throughput and low latencies. He is confused about the input that he should use in this case. He approaches you and asks you to help him. Which Azure product would you suggest to him and why?

Azure Event Hubs. It is a highly scalable event ingestion service that can take in and process over a million events per second. You can transform and store the data sent to an event hub by using storage/batching adapters or a real-time analytics provider.

You need to decide on the technology choice that your team should use for batch processing in Azure. The requirements demand the technology to meet the following capabilities: - Autoscaling - In-memory caching of data - Query from external relational stores - Support for firewall What technology would you choose and why?

Azure HDInsight with Spark. It is a managed Hadoop service that can be used for deploying and managing Hadoop clusters in Azure. It can be used with Spark, MapReduce, Hive, or Hive LLAP for batch processing and meets all the capabilities mentioned in the scenario. Note: Databricks supports a firewall when integrated with a VNET, but cannot meet the given capabilities alone.

What Model & Serve PaaS meets the following requirements: ✑ Implement native columnar storage. ✑ Support for the SQL language ✑ Provide support for structured streaming.

Azure Synapse Analytics

You plan to create an Azure Databricks workspace that has a tiered structure. The workspace will contain the following three workloads: ✑ A workload for data engineers who will use Python and SQL. ✑ A workload for jobs that will run notebooks that use Python, Scala, and SQL. ✑ A workload that data scientists will use to perform ad hoc analysis in Scala and R. The enterprise architecture team at your company identifies the following standards for Databricks environments: ✑ The data engineers must share a cluster. ✑ The job cluster will be managed by using a request process whereby data scientists and data engineers provide packaged notebooks for deployment to the cluster. ✑ All the data scientists must be assigned their own cluster that terminates automatically after 120 minutes of inactivity. Currently, there are three data scientists. You need to create the Databricks clusters for the workloads. A. You create a High Concurrency cluster for each data scientist, a High Concurrency cluster for the data engineers, and a Standard cluster for the jobs. B. You create a Standard cluster for each data scientist, a High Concurrency cluster for the data engineers, and a Standard cluster for the jobs. C. You create a Standard cluster for each data scientist, a High Concurrency cluster for the data engineers, and a High Concurrency cluster for the jobs.

B.
Data scientists: Standard clusters - only Standard clusters support Scala, and they can terminate automatically after 120 minutes of inactivity.
Data engineers: High Concurrency cluster - better resource sharing.
Jobs: Standard cluster - only Standard clusters support Scala; High Concurrency clusters do not support the packaged-notebook deployment described.

You plan to perform batch processing in Azure Databricks once daily. Which type of Databricks cluster should you use? A. High Concurrency B. automated C. interactive

B. Automated Azure Databricks has two types of clusters: interactive and automated. You use interactive clusters to analyze data collaboratively with interactive notebooks. You use automated clusters to run fast and robust automated jobs.

You are developing a solution that will stream to Azure Stream Analytics. The solution will have both streaming data and reference data. Which input type should you use for the reference data? A. Azure Cosmos DB B. Azure Blob storage C. Azure IoT Hub D. Azure Event Hubs

B. Azure Blob storage Stream Analytics supports Azure Blob storage and Azure SQL Database as the storage layer for Reference Data.

You are designing a statistical analysis solution that will use custom proprietary Python functions on near real-time data from Azure Event Hubs. You need to recommend which Azure service to use to perform the statistical analysis. The solution must minimize latency. What should you recommend? A. Azure Synapse Analytics B. Azure Databricks C. Azure Stream Analytics D. Azure SQL Database

B. Azure Databricks Azure Databricks supports near real-time data from Azure Event Hubs. And includes support for R, SQL, Python, Scala, and Java

You have an Azure Synapse Analytics dedicated SQL pool. You need to ensure that data in the pool is encrypted at rest. The solution must NOT require modifying applications that query the data. What should you do? A. Enable encryption at rest for the Azure Data Lake Storage Gen2 account. B. Enable Transparent Data Encryption (TDE) for the pool. C. Use a customer-managed key to enable double encryption for the Azure Synapse workspace. D. Create an Azure key vault in the Azure subscription and grant access to the pool.

B. Enable Transparent Data Encryption (TDE) for the pool. Transparent Data Encryption (TDE) helps protect against the threat of malicious activity by encrypting and decrypting your data at rest. When you encrypt your database, associated backups and transaction log files are encrypted without requiring any changes to your applications. TDE encrypts the storage of an entire database by using a symmetric key called the database encryption key.

You have an Azure Stream Analytics job that receives clickstream data from an Azure event hub. You need to define a query in the Stream Analytics job. The query must meet the following requirements: ✑ Count the number of clicks within each 10-second window based on the country of a visitor. ✑ Ensure that each click is NOT counted more than once. How should you define the Query? A. SELECT Country, Avg(*) AS Average FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, SlidingWindow(second, 10) B. SELECT Country, Count(*) AS Count FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, TumblingWindow(second, 10) C. SELECT Country, Avg(*) AS Average FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, HoppingWindow(second, 10, 2) D. SELECT Country, Count(*) AS Count FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, SessionWindow(second, 5, 10)

B. SELECT Country, Count(*) AS Count FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, TumblingWindow(second, 10) Tumbling window functions are used to segment a data stream into distinct time segments and perform a function against them, such as the example below. The key differentiators of a Tumbling window are that they repeat, do not overlap, and an event cannot belong to more than one tumbling window. Incorrect Answers: A: Sliding windows, unlike Tumbling or Hopping windows, output events only for points in time when the content of the window actually changes. In other words, when an event enters or exits the window. Every window has at least one event, like in the case of Hopping windows, events can belong to more than one sliding window. C: Hopping window functions hop forward in time by a fixed period. It may be easy to think of them as Tumbling windows that can overlap, so events can belong to more than one Hopping window result set. To make a Hopping window the same as a Tumbling window, specify the hop size to be the same as the window size. D: Session windows group events that arrive at similar times, filtering out periods of time where there is no data.

You have an Azure Databricks workspace named workspace1 in the Standard pricing tier. You need to configure workspace1 to support autoscaling all-purpose clusters. The solution must meet the following requirements: ✑ Automatically scale down workers when the cluster is underutilized for three minutes. ✑ Minimize the time it takes to scale to the maximum number of workers. ✑ Minimize costs. What should you do first? A. Enable container services for workspace1. B. Upgrade workspace1 to the Premium pricing tier. C. Set Cluster Mode to High Concurrency. D. Create a cluster policy in workspace1.

B. Upgrade workspace1 to the Premium pricing tier.

You have a partitioned table in an Azure Synapse Analytics dedicated SQL pool. You need to design queries to maximize the benefits of partition elimination. What should you include in the Transact-SQL queries? A. JOIN B. WHERE C. DISTINCT D. GROUP BY

B. WHERE When you add the "WHERE" clause to your T-SQL query it allows the query optimizer to access only the relevant partitions to satisfy the filter criteria of the query - which is what partition elimination is all about.

You need to implement a Type 3 slowly changing dimension (SCD) for product category data in an Azure Synapse Analytics dedicated SQL pool. You have a table that was created with a Transact-SQL statement. What two columns would you expect to see? A. [EffectiveStartDate] [datetime] NOT NULL, B. [CurrentProductCategory] [nvarchar] (100) NOT NULL, C. [EffectiveEndDate] [datetime] NULL, D. [ProductCategory] [nvarchar] (100) NOT NULL, E. [OriginalProductCategory] [nvarchar] (100) NOT NULL,

B. [CurrentProductCategory] [nvarchar] (100) NOT NULL,
E. [OriginalProductCategory] [nvarchar] (100) NOT NULL,
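A minimal sketch of a Type 3 dimension table that uses those two columns (the table name and the other columns are illustrative, not from the question):

CREATE TABLE dbo.DimProduct
(
    [ProductKey]              [int]           NOT NULL,  -- surrogate key
    [ProductName]             [nvarchar](200) NOT NULL,
    -- Type 3: keep the current value and the original value side by side
    [CurrentProductCategory]  [nvarchar](100) NOT NULL,
    [OriginalProductCategory] [nvarchar](100) NOT NULL
);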

You have an Azure Synapse Analytics dedicated SQL pool that contains a large fact table. The table contains 50 columns and 5 billion rows and is a heap. Most queries against the table aggregate values from approximately 100 million rows and return only two columns. You discover that the queries against the fact table are very slow. Which type of index should you add to provide the fastest query times? A. nonclustered columnstore B. clustered columnstore C. nonclustered D. clustered

B. clustered columnstore

You implement an enterprise data warehouse in Azure Synapse Analytics. You have a large fact table that is 10 terabytes (TB) in size. Incoming queries use the primary key SaleKey column to retrieve data. You need to distribute the large fact table across multiple nodes to optimize performance of the table. Which technology should you use? A. hash distributed table with clustered index B. hash distributed table with clustered Columnstore index C. round robin distributed table with clustered index D. round robin distributed table with clustered Columnstore index E. heap table with distribution replicate

B. hash distributed table with clustered Columnstore index Hash-distributed tables improve query performance on large fact tables. Columnstore indexes can achieve up to 100x better performance on analytics and data warehousing workloads and up to 10x better data compression than traditional rowstore indexes.

You have a SQL pool in Azure Synapse. You discover that some queries fail or take a long time to complete. You need to monitor for transactions that have rolled back. Which dynamic management view should you query? A. sys.dm_pdw_request_steps B. sys.dm_pdw_nodes_tran_database_transactions C. sys.dm_pdw_waits D. sys.dm_pdw_exec_sessions

B. sys.dm_pdw_nodes_tran_database_transactions

When should you use clustered columnstore indexes?

By default, dedicated SQL pool creates a clustered columnstore index when no index options are specified on a table. Clustered columnstore tables offer both the highest level of data compression and the best overall query performance. Clustered columnstore tables will generally outperform clustered index or heap tables and are usually the best choice for large tables

You are designing an enterprise data warehouse in Azure Synapse Analytics that will contain a table named Customers. Customers will contain credit card information. You need to recommend a solution to provide salespeople with the ability to view all the entries in Customers. The solution must prevent all the salespeople from viewing or inferring the credit card information. What should you include in the recommendation? A. data masking B. Always Encrypted C. column-level security D. row-level security

C. Column-level security The key word is 'infer' Dynamic Data Masking should not be used as an isolated measure to fully secure sensitive data from users running ad-hoc queries on the database. It is appropriate for preventing accidental sensitive data exposure, but will not protect against malicious intent to infer the underlying data.
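Column-level security is implemented by granting SELECT only on the non-sensitive columns; a sketch (the table, column, and role names are illustrative):

-- Salespeople can read every row, but the credit card column is excluded from the grant,
-- so it can be neither viewed nor inferred through ad hoc queries.
GRANT SELECT ON dbo.Customers (CustomerID, FirstName, LastName, City) TO SalesRole;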

You need to trigger an Azure Data Factory pipeline when a file arrives in an Azure Data Lake Storage Gen2 container. Which resource provider should you enable? A. Microsoft.Sql B. Microsoft.Automation C. Microsoft.EventGrid D. Microsoft.EventHub

C. Microsoft.EventGrid

You have an Azure Data Factory version 2 (V2) resource named Df1. Df1 contains a linked service. You have an Azure Key vault named vault1 that contains an encryption key named key1. You need to encrypt Df1 by using key1. What should you do first? A. Add a private endpoint connection to vaul1. B. Enable Azure role-based access control on vault1. C. Remove the linked service from Df1. D. Create a self-hosted integration runtime.

C. Remove the linked service from Df1. As stated in the documentation: "Ensure the Data Factory is empty. The data factory can't contain any resources such as linked services, pipelines, and data flows. For now, deploying customer-managed key to a non-empty factory will result in an error."

You create an Azure Databricks cluster and specify an additional library to install. When you attempt to load the library to a notebook, the library is not found. You need to identify the cause of the issue. What should you review? A. notebook logs B. cluster event logs C. global init scripts logs D. workspace logs

C. global init scripts logs Cluster-scoped Init Scripts: Init scripts are shell scripts that run during the startup of each cluster node before the Spark driver or worker JVM starts. Databricks customers use init scripts for various purposes such as installing custom libraries, launching background processes, or applying enterprise security policies.

What function do you use in SQL to convert an integer value to a DECIMAL data type?

CAST()
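For example:

SELECT CAST(123 AS DECIMAL(10, 2));  -- returns 123.00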

You are working on a columnstore table. Although columnstore indexes and tables are always stored with columnstore compression, you are interested in further reducing the columnstore data size. For this purpose, you decide to configure an add-on compression known as archival compression. Which compression option would you use to compress the data by using archival compression?

COLUMNSTORE_ARCHIVE
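A sketch of the rebuild (the table name and partition scope are illustrative):

-- Rebuild the columnstore with archival compression to further shrink the data
ALTER TABLE dbo.FactSales
REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);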

What Data Definition Language (DDL) statements are allowed on external tables?

CREATE TABLE and DROP TABLE
CREATE STATISTICS and DROP STATISTICS
CREATE VIEW and DROP VIEW

Complete the following Transact-SQL statement to create a partitioned table in Azure Synapse Analytics dedicated SQL pool CREATE TABLE table1 ( ID INTEGER, Col1 VARCHAR (10), Col2 VARCHAR (10), ) WITH ( ________________________ = HASH(ID), ________________________ = (ID RANGE LEFT FOR VALUES (1, 1000)) );

The two fill-ins are DISTRIBUTION and PARTITION:
CREATE TABLE table1
(
    ID INTEGER,
    Col1 VARCHAR (10),
    Col2 VARCHAR (10)
)
WITH
(
    DISTRIBUTION = HASH(ID),
    PARTITION (ID RANGE LEFT FOR VALUES (1, 1000))
);

How should you choose the distribution column for a table?

Choose a distribution column with data that a) distributes evenly, b) has many unique values, c) has no NULLs or only a few NULLs, and d) is not a date column.

What file format has the fastest load?

Compressed, delimited text files

When you create a temporal table in Azure SQL Database, it automatically creates a history table in the same database for capturing the historical records. Which of the following statements are true about the temporal table and history table? A. A temporal table must have 1 primary key. B. To create a temporal table, System versioning needs to be set to On. C. To create a temporal table, System Versioning needs to be set to Off. D. It is mandatory to mention the name of the history table when you create the temporal table. E. If you don't specify the name for the history table, the default naming convention is used for the history table. F. You can specify the table constraints for the history table.

Correct: A. A temporal table must have 1 primary key. B. To create a temporal table, System versioning needs to be set to On. E. If you don't specify the name for the history table, the default naming convention is used for the history table. Incorrect: C. To create a temporal table, System Versioning needs to be set to Off. D. It is mandatory to mention the name of the history table when you create the temporal table. F. You can specify the table constraints for the history table.

After checking the Monitor tab in the Azure Synapse Studio environment, you realize that you can improve the performance of the run. Now you decide to use bucketed tables to improve the performance. Which of the following are the recommended practices to consider while using bucketed tables? A. Avoid the use of SortMerge join whenever possible B. Prefer the use of SortMerge join as much as you can C. Never consider the most selective joins D. Start with the most selective joins E. Move joins that increase the number of rows after aggregations whenever possible F. The order of various types of joins matters when it comes to the resource consumption

Correct: A. Avoid the use of SortMerge join whenever possible D. Start with the most selective joins E. Move joins that increase the number of rows after aggregations whenever possible F. The order of various types of joins matters when it comes to the resource consumption Incorrect: B. Prefer the use of SortMerge join as much as you can C. Never consider the most selective joins

SQLite differs from commercial relational database systems in terms of features as it does not support a number of features supported by commercial relational database systems. Which of the following statement(s) is/are true about SQLite? A. SQLite also assigns a type to the columns like most relational database systems B. SQLite supports all - LEFT OUTERJOIN, RIGHT OUTERJOIN, and FULL OUTERJOIN C. SQLite supports no type of OUTERJOIN D. SQLite supports only LEFT OUTERJOIN, not RIGHT OUTERJOIN or FULL OUTERJOIN E. You can't create views in SQLite

Correct: D. SQLite supports only LEFT OUTERJOIN, not RIGHT OUTERJOIN or FULL OUTERJOIN Incorrect: A. SQLite also assigns a type to the columns like most relational database systems B. SQLite supports all - LEFT OUTERJOIN, RIGHT OUTERJOIN, and FULL OUTERJOIN C. SQLite supports no type of OUTERJOIN E. You can't create views in SQLite

You need to create data semantic models in SQL Server Analysis Services. There are some recommended best practices for data modeling that one should follow. Which of the following practices are considered as the best practices you would mind while creating data semantic models? A. Never create a dimension model snowflake as well as a star while ingesting data from various sources. B. Create a dimension model snowflake and/or a star, even if you need to ingest data from various sources C. Only include the integer surrogate keys or value encoding in the model and exclude all the natural keys from dimension tables. D. Only include the natural keys in the model and exclude the integer surrogate keys or value encoding. E. Decrease the cardinality to reduce the uniqueness of the values and allow much better compression. F. Increase the cardinality to reduce the uniqueness of the values and allow much better compression.

Correct: B. Create a dimension model snowflake and/or a star, even if you need to ingest data from various sources C. Only include the integer surrogate keys or value encoding in the model and exclude all the natural keys from dimension tables. E. Decrease the cardinality to reduce the uniqueness of the values and allow much better compression. Incorrect: A. Never create a dimension model snowflake as well as a star while ingesting data from various sources. D. Only include the natural keys in the model and exclude the integer surrogate keys or value encoding. F. Increase the cardinality to reduce the uniqueness of the values and allow much better compression.

You have an enterprise data warehouse in Azure Synapse Analytics that contains a table named FactOnlineSales. The table contains data from the start of 2009 to the end of 2012. You need to improve the performance of queries against FactOnlineSales by using table partitions. The solution must meet the following requirements: ✑ Create four partitions based on the order date. ✑ Ensure that each partition contains all the orders placed during a given calendar year. How should you complete the T-SQL command? Create Table [dbo].FactOnlineSales ([OrderDateKey] [datetime] NOT NULL, [StoreKey] [int] NOT NULL, [ProductKey] [int] NOT NULL) WITH (CLUSTERED COLUMNSTORE INDEX) PARTITION ([OrderDateKey] RANGE ____________________ FOR VALUES (__________________________________)

The fill-ins are RANGE RIGHT and the boundary values 20100101, 20110101, 20120101:
CREATE TABLE [dbo].[FactOnlineSales]
(
    [OrderDateKey] [datetime] NOT NULL,
    [StoreKey] [int] NOT NULL,
    [ProductKey] [int] NOT NULL
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ([OrderDateKey] RANGE RIGHT FOR VALUES (20100101, 20110101, 20120101))
);
With RANGE RIGHT, each boundary value is the first day of a calendar year, so the four partitions hold the 2009, 2010, 2011, and 2012 orders respectively.

You have an Azure data factory. You need to examine the pipeline failures from the last 60 days. What should you use? A. the Activity log blade for the Data Factory resource B. the Monitor & Manage app in Data Factory C. the Resource health blade for the Data Factory resource D. Azure Monitor

D. Azure Monitor Data Factory stores pipeline-run data for only 45 days. Use Azure Monitor if you want to keep that data for a longer time.

You have an Azure Synapse Analytics dedicated SQL pool named Pool1 and a database named DB1. DB1 contains a fact table named Table1. You need to identify the extent of the data skew in Table1. What should you do in Synapse Studio? A. Connect to the built-in pool and run DBCC PDW_SHOWSPACEUSED. B. Connect to the built-in pool and run DBCC CHECKALLOC. C. Connect to Pool1 and query sys.dm_pdw_node_status. D. Connect to Pool1 and query sys.dm_pdw_nodes_db_partition_stats.

D. Connect to Pool1 and query sys.dm_pdw_nodes_db_partition_stats. DBCC PDW_SHOWSPACEUSED is the usual way to check data skew, but it must be run against the dedicated pool rather than the built-in (serverless) pool, so options A and B are wrong; querying sys.dm_pdw_nodes_db_partition_stats on Pool1 returns the row counts per distribution, from which the skew can be calculated.

You work in an Azure Synapse Analytics dedicated SQL pool that has a table titled Pilots. Now you want to restrict user access in such a way that users in the 'IndianAnalyst' role can only see the pilots from India. Which of the following would you add to the solution? A. Table partitions B. Encryption C. Column-Level security D. Row-level security E. Data masking

D. Row-level security
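Row-level security is built from a predicate function plus a security policy; a sketch for the Pilots scenario (the function, column, and policy names are illustrative):

-- Members of IndianAnalyst see only rows where Country = 'India'; other users are unrestricted
CREATE FUNCTION dbo.fn_PilotFilter(@Country AS VARCHAR(50))
    RETURNS TABLE
    WITH SCHEMABINDING
AS
RETURN SELECT 1 AS fn_result
       WHERE IS_ROLEMEMBER('IndianAnalyst') = 0
          OR @Country = 'India';

CREATE SECURITY POLICY PilotFilterPolicy
    ADD FILTER PREDICATE dbo.fn_PilotFilter(Country) ON dbo.Pilots
    WITH (STATE = ON);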

You have an Azure Data Factory that contains 10 pipelines. You need to label each pipeline with its main purpose of either ingest, transform, or load. The labels must be available for grouping and filtering when using the monitoring experience in Data Factory. What should you add to each pipeline? A. a resource tag B. a correlation ID C. a run group ID D. an annotation

D. an annotation Annotations are additional, informative tags that you can add to specific factory resources: pipelines, datasets, linked services, and triggers. By adding annotations, you can easily filter and search for specific factory resources.

You need to design an Azure Synapse Analytics dedicated SQL pool that meets the following requirements: ✑ Can return an employee record from a given point in time. ✑ Maintains the latest employee information. ✑ Minimizes query complexity. How should you model the employee data? A. as a temporal table B. as a SQL graph table C. as a degenerate dimension table D. as a Type 2 slowly changing dimension (SCD) table

D. as a Type 2 slowly changing dimension (SCD) table

You are designing an Azure Synapse Analytics dedicated SQL pool. You need to ensure that you can audit access to Personally Identifiable Information (PII). What should you include in the solution? A. column-level security B. dynamic data masking C. row-level security (RLS) D. sensitivity classifications

D. sensitivity classifications Data Discovery & Classification is built into Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse Analytics. It provides basic capabilities for discovering, classifying, labeling, and reporting the sensitive data in your databases. Your most sensitive data might include business, financial, healthcare, or personal information. Discovering and classifying this data can play a pivotal role in your organization's information-protection approach. It can serve as infrastructure for: ✑ Helping to meet standards for data privacy and requirements for regulatory compliance. ✑ Various security scenarios, such as monitoring (auditing) access to sensitive data. ✑ Controlling access to and hardening the security of databases that contain highly sensitive data.
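Labels are applied per column with T-SQL; a sketch (the table and column names are illustrative):

ADD SENSITIVITY CLASSIFICATION TO dbo.Customers.CreditCardNumber
WITH (LABEL = 'Highly Confidential', INFORMATION_TYPE = 'Credit Card');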

You plan to implement an Azure Data Lake Gen 2 storage account. You need to ensure that the data lake will remain available if a data center fails in the primary Azure region. The solution must minimize costs. Which type of replication should you use for the storage account? A. geo-redundant storage (GRS) B. geo-zone-redundant storage (GZRS) C. locally-redundant storage (LRS) D. zone-redundant storage (ZRS)

D. zone-redundant storage (ZRS)

What is the difference between fact tables and dimension tables?

Dimension tables describe business entities—the things you model. Entities can include products, people, places, and concepts including time itself (e.g., a date dimension table) Fact tables store observations or events, and can be sales orders, stock balances, exchange rates, temperatures, etc. A fact table contains dimension key columns that relate to dimension tables, and numeric measure columns.

You have a SQL pool in Azure Synapse. You plan to load data from Azure Blob storage to a staging table. Approximately 1 million rows of data will be loaded daily. The table will be truncated before each daily load. You need to create the staging table. The solution must minimize how long it takes to load the data to the staging table. How should you configure the table? Distribution & indexing

Distribution - round-robin Indexing - Heap
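A minimal sketch of such a staging table (the column list is illustrative):

CREATE TABLE dbo.StageSales
(
    SaleID   INT            NOT NULL,
    SaleDate DATE           NOT NULL,
    Amount   DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,  -- fastest to load: no hash computation or data movement by key
    HEAP                         -- no index to maintain while loading
);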

You have an on-premises data warehouse that includes the following fact tables. Both tables have the following columns: DateKey, ProductKey, RegionKey. There are 120 unique product keys and 65 unique region keys. Sales Table: - 600 GB in size - DateKey is used extensively in WHERE clause in queries - ProductKey is used extensively in join operations - RegionKey is used for grouping Invoice table: - 6 GB in size - DateKey and ProductKey are used extensively in the WHERE clause in queries - RegionKey is used for grouping Queries that use the data warehouse take a long time to complete. You plan to migrate the solution to use Azure Synapse Analytics. You need to ensure that the Azure-based solution optimizes query performance and minimizes processing skew. What should you recommend? Distribution type for Sales & Invoice tables? Distribution column for Sales & Invoice tables?

Distribution type (both tables): hash-distributed
Sales distribution column: ProductKey, because it is used extensively in joins
Invoice distribution column: RegionKey, because it is not used in the WHERE clause

When should you use clustered and nonclustered indexes?

For queries that must look up a single row, or very few rows, with extreme speed, consider a clustered index or a nonclustered secondary index. The disadvantage of a clustered index is that the only queries that benefit are those that use a highly selective filter on the clustered index column.

Describe GRS

Geo-redundant storage (GRS) copies your data synchronously three times within a single physical location in the primary region using LRS. It then copies your data asynchronously to a single physical location in a secondary region that is hundreds of miles away from the primary region. GRS offers durability for Azure Storage data objects of at least 99.99999999999999% (16 9's) over a given year.

What type of table should you use for a 6TB fact table in a star schema?

Hash-distributed with a clustered columnstore index
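A sketch of such a fact table definition (the table and column names are illustrative):

CREATE TABLE dbo.FactSales
(
    SaleKey    BIGINT         NOT NULL,
    ProductKey INT            NOT NULL,
    Quantity   INT            NOT NULL,
    Amount     DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(SaleKey),   -- spreads the 6 TB evenly across the 60 distributions
    CLUSTERED COLUMNSTORE INDEX     -- best compression and scan performance for large fact tables
);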

You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table contains purchases from suppliers for a retail store. FactPurchase will have 1 million rows of data added daily and will contain three years of data. Transact-SQL queries similar to the following query will be executed daily. SELECT - SupplierKey, StockItemKey, IsOrderFinalized, COUNT(*) FROM FactPurchase - WHERE DateKey >= 20210101 - AND DateKey <= 20210131 - GROUP By SupplierKey, StockItemKey, IsOrderFinalized Which table distribution will minimize query times? A. replicated B. hash-distributed on PurchaseKey C. round-robin D. hash-distributed on IsOrderFinalized

B. hash-distributed on PurchaseKey
Hash-distributed tables improve query performance on large fact tables. To balance the parallel processing, select a distribution column that: ✑ Has many unique values. The column can have duplicate values; all rows with the same value are assigned to the same distribution. Since there are 60 distributions, some distributions can have more than one unique value while others may end with zero values. ✑ Does not have NULLs, or has only a few NULLs. ✑ Is not a date column.

A serverless cluster is another word for a ____________________________ cluster

High Concurrency

Describe a hopping window

Hopping window functions hop forward in time by a fixed period. It may be easy to think of them as Tumbling windows that can overlap and be emitted more often than the window size. Events can belong to more than one Hopping window result set. E.g., Every 5 seconds, give me the count of Tweets over the last 10 seconds
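The example above, written as a Stream Analytics query (the input name and timestamp column are illustrative):

-- Every 5 seconds, count the tweets received over the last 10 seconds
SELECT Topic, COUNT(*) AS TweetCount
FROM TwitterStream TIMESTAMP BY CreatedAt
GROUP BY Topic, HoppingWindow(second, 10, 5)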

What are the differences between the three types of SCDs?

In a Type 1 SCD the new data overwrites the existing data. The existing data is therefore lost, as it is not stored anywhere else.
A Type 2 SCD retains the full history of values. When the value of a chosen attribute changes, the current record is closed. A new record is created with the changed data values, and this new record becomes the current record. Each record contains the effective time and expiration time to identify the time period during which the record was active.
A Type 3 SCD stores two versions of values for certain selected attributes. Each record stores the previous value and the current value of the selected attribute. When the value of any of the selected attributes changes, the current value is stored as the old value and the new value becomes the current value.

You have an Azure event hub named retailhub that has 16 partitions. Transactions are posted to retailhub. Each transaction includes the transaction ID, the individual line items, and the payment details. The transaction ID is used as the partition key. You are designing an Azure Stream Analytics job to identify potentially fraudulent transactions at a retail store. The job will use retailhub as the input. The job will output the transaction ID, the individual line items, the payment details, a fraud score, and a fraud indicator. You plan to send the output to an Azure event hub named fraudhub. You need to ensure that the fraud detection solution is highly scalable and processes transactions as quickly as possible. How should you structure the output of the Stream Analytics job? Number of partitions = ? Partition key = ?

It should be configured the same way as the input: 16 partitions, with the transaction ID as the partition key.

Log Analytics Workspaces store the data collected by Azure Monitor Logs. Log query is the query that is used to retrieve the data from a Log Analytics workspace. What language are these log queries written in?

Kusto Query Language (KQL)

Place these in the correct order from least expensive to most expensive: GRS LRS ZRS

LRS ZRS GRS

You need to collect application metrics, streaming query events, and application log messages for an Azure Databricks cluster. Which type of library and workspace should you implement?

Library: Azure Databricks Monitoring Library Workspace: Azure Log Analytics

Describe LRS

Locally redundant storage (LRS) replicates your data three times within a single data center in the primary region. LRS provides at least 99.999999999% (11 nines) durability of objects over a given year. LRS is the lowest-cost redundancy option and offers the least durability compared to other options. LRS protects your data against server rack and drive failures. However, if a disaster such as fire or flooding occurs within the data center, all replicas of a storage account using LRS may be lost or unrecoverable. To mitigate this risk, Microsoft recommends using zone-redundant storage (ZRS), geo-redundant storage (GRS), or geo-zone-redundant storage (GZRS).

Does an ADF schedule trigger allow for a delay?

No

You have been assigned the Storage Blob Data Contributor role at a container level. Which of the following two statements are true? a. You have been granted write, read, and delete access to all blobs in that container. b. You can view a blob within Azure portal.

Only a. Statement b is false: a data-plane role such as Storage Blob Data Contributor is not enough on its own to view a blob in the Azure portal; browsing the account in the portal also requires a control-plane role (for example, Reader) on the storage account.

Which Databricks cluster type supports Scala? Why does the other not?

Only Standard clusters support Scala; these clusters are recommended for single users. High Concurrency clusters provide high performance and security by running user code in separate processes, which is not possible in Scala.

What file format supports the schema property?

Parquet

What is the best format to store data in the data lake when the data has a large number of columns (50)?

Parquet - column-oriented binary file format

What is the best way to regularly move large amounts of partitioned data (billions of rows) to a new table?

Partition switching - It is extremely fast because it is a metadata-only operation that updates the location of the data, no data is physically moved
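A sketch of the operation (the table names and partition number are illustrative; the target table must have an identical definition and an empty matching partition):

-- Move partition 1 of the fact table into an archive table as a metadata-only operation
ALTER TABLE dbo.FactSales
SWITCH PARTITION 1 TO dbo.FactSales_Archive PARTITION 1;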

What type of table should you use for 2GB relatively static dimension tables in a star schema?

Replicated - Replicated tables are ideal for small star-schema dimension tables, because the fact table is often distributed on a column that is not compatible with the connected dimension tables

You are building an Azure Stream Analytics query that will receive input data from Azure IoT Hub and write the results to Azure Blob storage. You need to calculate the difference in readings per sensor per hour. How should you complete the query? SELECT sensorId, growth = reading - ___________ (reading) OVER (PARTITION BY sensorId _______________ (hour, 1))

SELECT sensorId, growth = reading - LAG (reading) OVER (PARTITION BY sensorId LIMIT DURATION (hour, 1)) The LAG analytic operator allows one to look up a "previous" event in an event stream, within certain constraints. It is very useful for computing the rate of growth of a variable, detecting when a variable crosses a threshold, or when a condition starts or stops being true. In Stream Analytics, the scope of LAG (that is, how far back in history from the current event it needs to look) is always limited to a finite time interval, using the LIMIT DURATION clause. Syntax: LAG(<scalar_expression >, [<offset >], [<default>]) OVER ([PARTITION BY <partition key>] LIMIT DURATION(<unit>, <length>) [WHEN boolean_expression])

You are designing a monitoring solution for a fleet of 500 vehicles. Each vehicle has a GPS tracking device that sends data to an Azure event hub once per minute. You have a CSV file in an Azure Data Lake Storage Gen2 container. The file maintains the expected geographical area in which each vehicle should be. You need to ensure that when a GPS position is outside the expected area, a message is added to another event hub for processing within 30 seconds. The solution must minimize cost. What should you include in the solution? Service: _______________________ Window: _______________________ Analysis type: _______________________

Service: Azure Stream Analytics Window: No window Analysis type: Point within polygon

Describe a session window

Session window functions group events that arrive at similar times, filtering out periods of time where there is no data. A session window has three main parameters: timeout, maximum duration, and partitioning key (optional). A session window begins when the first event occurs. If another event occurs within the specified timeout from the last ingested event, the window extends to include the new event. Otherwise, if no events occur within the timeout, the window is closed at the timeout. If events keep occurring within the specified timeout, the session window keeps extending until the maximum duration is reached. The maximum-duration checking intervals are the same size as the specified maximum duration; for example, if the maximum duration is 10, the checks on whether the window exceeds the maximum duration happen at t = 0, 10, 20, 30, and so on. E.g., tell me the count of tweets that occur within 5 minutes of each other
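The example above, written as a query (the input name is illustrative; timeout 5 minutes, maximum duration 10 minutes):

SELECT Topic, COUNT(*) AS TweetCount
FROM TwitterStream TIMESTAMP BY CreatedAt
GROUP BY Topic, SessionWindow(minute, 5, 10)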

Describe sliding windows

Sliding windows, unlike Tumbling or Hopping windows, output events only for points in time when the content of the window actually changes. In other words, when an event enters or exits the window. So, every window has at least one event. Similar to Hopping windows, events can belong to more than one sliding window. E.g., Give me the count of Tweets for all topics which are Tweeted more than 10 times in the last 10 seconds
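The example above, written as a query (the input name is illustrative):

SELECT Topic, COUNT(*) AS TweetCount
FROM TwitterStream TIMESTAMP BY CreatedAt
GROUP BY Topic, SlidingWindow(second, 10)
HAVING COUNT(*) > 10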

Describe a snapshot window

Snapshot windows group events that have the same timestamp. Unlike the other windowing types, which require a specific window function (such as SessionWindow()), you can apply a snapshot window by adding System.Timestamp() to the GROUP BY clause. E.g., give me the count of tweets with the same topic type that occur at exactly the same time
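The example above, written as a query (the input name is illustrative):

SELECT Topic, COUNT(*) AS TweetCount
FROM TwitterStream TIMESTAMP BY CreatedAt
GROUP BY Topic, System.Timestamp()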

Which type of databricks cluster terminates automatically after 120 minutes?

Standard and Single Node clusters terminate automatically after 120 minutes by default. High Concurrency clusters do not terminate automatically by default.

You have an Azure Synapse Analytics SQL pool named Pool1 on a logical Microsoft SQL server named Server1. You need to implement Transparent Data Encryption (TDE) on Pool1 by using a custom key named key1. Which five actions should you perform in sequence? Enable TDE on Pool1 Assign a managed identity to Server1 Configure key1 as the TDE protector for Server1 Add key1 to the Azure key vault Create an Azure key vault and grant the managed identity permissions to the key vault

Step 1: Assign a managed identity to Server1, so that the server can authenticate to the key vault.
Step 2: Create an Azure key vault and grant the managed identity permissions to the vault.
Step 3: Add key1 to the Azure key vault. The recommended way is to import an existing key from a .pfx file or reference an existing key in the vault; alternatively, generate a new key directly in Azure Key Vault.
Step 4: Configure key1 as the TDE protector for Server1.
Step 5: Enable TDE on Pool1.

You need to ensure that pipeline-run data is retained for 120 days. The solution must ensure that you can query the data by using the Kusto query language. What actions do you need to take?

Step 1: Create a Log Analytics workspace that has Data Retention set to 120 days. Step 2: From Azure Portal, add a diagnostic setting. Step 3: Select the PipelineRuns Category Step 4: Send the data to a Log Analytics workspace.

You work at TCT and are responsible for managing the jobs in Azure. You decide to add a new job. While specifying the job constraints, you set the maxWallClockTime property to 30 minutes. What does that mean?

The job can be in the active or running state for a maximum of 30 minutes.

What are some tips to choose the best distribution column?

To minimize data movement, select a distribution column that: - Is used in JOIN, GROUP BY, DISTINCT, OVER, and HAVING clauses. When two large fact tables have frequent joins, query performance improves when you distribute both tables on one of the join columns. When a table is not used in joins, consider distributing the table on a column that is frequently in the GROUP BY clause. - Is not used in WHERE clauses. This could narrow the query to not run on all the distributions. - Is not a date column. WHERE clauses often filter by date. When this happens, all the processing could run on only a few distributions.

You have an Azure subscription that contains a logical Microsoft SQL server named Server1. Server1 hosts an Azure Synapse Analytics SQL dedicated pool named Pool1. You need to recommend a Transparent Data Encryption (TDE) solution for Server1. The solution must meet the following requirements: ✑ Track the usage of encryption keys. ✑ Maintain the access of client apps to Pool1 in the event of an Azure datacenter outage that affects the availability of the encryption keys. What should you include in the recommendation?

To track the usage of encryption keys: use TDE with customer-managed keys stored in Azure Key Vault, where key usage can be logged and audited.
To maintain the access of client apps to Pool1 in the event of an Azure datacenter outage: create and configure Azure key vaults in two Azure regions.

True / False for HC Clusters: It supports multiple concurrent users It minimizes costs when running scheduled jobs that execute notebooks It supports the creation of a delta lake table

Supports multiple concurrent users: True
Minimizes costs when running scheduled jobs that execute notebooks: False
Supports the creation of a Delta Lake table: True

Describe a tumbling window

Tumbling window functions are used to segment a data stream into distinct time segments and perform a function against them, such as the example below. The key differentiators of a Tumbling window are that they repeat, do not overlap, and an event cannot belong to more than one tumbling window. E.g., Tell me the count of tweets per time zone every 10 seconds

What is the default masking function for numeric data types?

Use a zero value for numeric data types (bigint, bit, decimal, int, money, numeric, smallint, smallmoney, tinyint, float, real).
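The mask is applied per column; a sketch (the table and column names are illustrative):

-- Non-privileged users see 0 instead of the real value
ALTER TABLE dbo.Customers
ALTER COLUMN CreditLimit ADD MASKED WITH (FUNCTION = 'default()');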

When should you use heap tables?

When you are temporarily landing data in dedicated SQL pool, you may find that using a heap table makes the overall process faster. This is because loads to heaps are faster than to index tables and in some cases the subsequent read can be done from cache. If you are loading data only to stage it before running more transformations, loading the table to heap table is much faster than loading the data to a clustered columnstore table.

Does an ADF tumbling window trigger allow for a delay?

Yes

Describe ZRS

Zone-redundant storage (ZRS) replicates your Azure Storage data synchronously across three Azure availability zones in the primary region. Each availability zone is a separate physical location with independent power, cooling, and networking. ZRS offers durability for Azure Storage data objects of at least 99.9999999999% (12 9's) over a given year.

You need to set your default Azure region. You know that the region can be set using the following command: az config set defaults.location=<REGION> Here you need to replace <REGION> with the name of a region that is available for your subscription and that you want to set. Which command would you run in Cloud Shell to check the regions that are available for your Azure subscription?

az account list-locations

You have created a data frame. But before issuing SQL queries, you decide to save your dataframe as a temporary view. What method will help you in creating the temporary view?

createOrReplaceTempView

You develop a dataset named DBTBL1 by using Azure Databricks. DBTBL1 contains the following columns: ✑ SensorTypeID ✑ GeographyRegionID ✑ Year ✑ Month ✑ Day ✑ Hour ✑ Minute ✑ Temperature ✑ WindSpeed ✑ Other You need to store the data to support daily incremental load pipelines that vary for each GeographyRegionID. The solution must minimize storage costs. How should you complete the code? df.write._____________(_____________).mode("append")._____________

df.write.partitionBy("Year", "Month", "Day", "GeographyRegionID").mode("append").parquet("/DBTBL1")

You have created an external table in Azure Data Explorer. Now, a database user needs to run a KQL (Kusto Query Language) query on this external table. What function should he use to refer to this table?

external_table()

You are working on ADLS Gen1. Suddenly you realize you need to know the schema of the external data. What plug-in would you use to know the external data schema?

infer_storage_schema

While configuring Normalize Data, you decide to select the Zscore mathematical function from the Transformation method dropdown list to apply to the chosen columns. What is the Zscore function?

z = (x - mean(x)) / stdev(x)
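For example, a value of 75 in a column with mean 60 and standard deviation 10 normalizes to z = (75 - 60) / 10 = 1.5.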

