DP-203- engineer to be

Hopping Window

"A type of window in which consecutive windows "hop" forward in time by a fixed period. The window is defined by two time spans: the period P and the window length L. Every P time units, a new window of size L is created." Windows can overlap, e.g.: Every 5 secs give the count of tweets over the last 10 secs

Snapshot window

"A window that is defined according to the start and end times of the event in the stream, instead of a fixed grid along the timeline." E.g.: Give me the count of tweets with the same topic type that occur at exactly the same time.

When you want to switch to SparkSQL in a notebook, what is the first command to type? %%csharp %%spark %%sparksql %%pyspark %%sql

%%sql --> the other magics select a different language: %%pyspark is Python, %%spark is Scala, %%csharp is .NET (C#)

Difference between locally redundant storage (LRS) and zone-redundant storage (ZRS)

ZRS copies your data synchronously across three Azure availability zones in the primary region, whereas LRS copies your data synchronously three times within a single physical location in the primary region

Microsoft Azure Storage is a managed service that provides durable, secure, and scalable storage in the cloud. Azure Files enables you to set up highly available network file shares that can be accessed using the standard Server Message Block (SMB) protocol. This means that multiple VMs can share the same files with both read and write access. You can read the files using the REST interface or the storage client libraries. You can also associate a unique URL to any file to allow fine-grained access to a private file for a set period of time. Which are common scenarios where File shares can be used? (Select all that apply) - Log files such as diagnostics, metrics, and crash dumps. - Shared data between on-premises applications and Azure VMs to allow migration of apps to the cloud instantly. - Storing shared configuration files for VMs, tools, or utilities so that everyone is using unique versions. - Shared data between on-premises applications and Azure VMs to allow migration of apps to the cloud over a period of time. - Storing shared configuration files for VMs, tools, or utilities so that everyone is using the same version.

- Log files such as diagnostics, metrics, and crash dumps. - Shared data between on-premises applications and Azure VMs to allow migration of apps to the cloud over a period of time. - Storing shared configuration files for VMs, tools, or utilities so that everyone is using the same version.

Which Index Type offers the highest compression in Synapse Analytics? Round-Robin Replicated Heap Columnstore Rowstore

Columnstore --> A clustered columnstore index has the highest level of compression!

Integration Runtime

Compute infrastructure used by Azure Data Factory and Azure Synapse pipelines. An IR provides these capabilities: Data Flow, Data Movement, Activity Dispatch, SSIS Package Execution

SSIS is primarily a control flow engine that manages the execution of workflows. Workflows are held in packages, which can be executed [?]. (Select all that apply) Randomly On a schedule On demand Only once

On a schedule On demand Only once

Scenario: You are working on a project and have begun creating an Azure Data Lake Storage Gen2 account. Configuration of this account must allow for processing analytical data workloads for best performance. Which option should you configure when creating the storage account? On the Basic Tab, set the Performance option to ON. On the Advanced tab, set the Hierarchical Namespace to Enabled. On the Basic tab, set the Performance option to Standard. On the Networking tab, set the Hierarchical Namespace to ON.

On the Advanced tab, set the Hierarchical Namespace to Enabled.

Scenario: A customer of Ultron Electronics is attempting to use a $300 store credit for the full amount of a new purchase. They are trying to double-spend their credit by creating two transactions at the exact same time using the entire store credit. The customer is making two transactions using two different devices. The database behind the scenes is an ACID-compliant transactional database. What would be the result? - None of the listed options. - One order would be processed and use the in-store credit, and the other order would not be processed. - Both orders would be processed and use the in-store credit. - One order would be processed and use the in-store credit, and the other order would update the remaining inventory for the items in the basket, but would not complete the order.

One order would be processed and use the in-store credit, and the other order would not be processed. This is because of ACID: A: Atomicity --> Transactions are all or nothing C: Consistency --> Only valid data is saved I: Isolation --> Transactions do not affect each other D: Durability --> Written data will not be lost
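
The double-spend outcome can be illustrated with a small Python sketch using the standard-library sqlite3 module. The table name `credit`, the customer `alice`, and the helper `place_order` are made up for illustration, and the two orders are run one after the other, which only mimics how isolation serializes two simultaneous transactions; it is not a real concurrency test.

```python
import sqlite3

# In-memory database holding a single store credit of 300 (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE credit (customer TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO credit VALUES ('alice', 300)")
conn.commit()


def place_order(conn, customer, amount):
    """Spend credit in one transaction; return True if the order went through."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE credit SET balance = balance - ? "
            "WHERE customer = ? AND balance >= ?",
            (amount, customer, amount),
        )
        # The UPDATE only matches a row while enough credit is left.
        return cur.rowcount == 1


# Two orders that each try to use the full credit.
results = [place_order(conn, "alice", 300), place_order(conn, "alice", 300)]
balance = conn.execute(
    "SELECT balance FROM credit WHERE customer = 'alice'"
).fetchone()[0]
print(results, balance)
```

The first order succeeds and the second finds no credit left, so only one order is processed and the balance never goes negative.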

What is cosmos.oltp?

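cosmos.oltp is the Spark data source format used in Azure Synapse to read from and write to the Azure Cosmos DB transactional store (cosmos.olap targets the analytical store). OLTP stands for Online Transaction Processing and is used to handle real-time transactional data.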

When to use parquet?

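For OLAP rather than OLTP: Parquet is a columnar file format, so it suits analytical workloads that scan a few columns across many rows, and it is a poor fit for transactional workloads that read and write individual rows.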

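What command do I use if I want to load JSON from ADLS Gen2 into SparkPool?

OPENROWSET T-SQL command --> note that OPENROWSET runs in a serverless SQL pool; in a notebook attached to a Spark pool, JSON is read with spark.read.json instead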

Azure role-based access control (RBAC) is the authorization system you use to manage access to Azure resources. To grant access, you may assign roles to which of the following top level classifications? (Select four) Assets Workflows Attributes Service principals Users Orchestrations Managed identities Groups Devices

Service principals Users Managed identities Groups

Databricks File System (DBFS)

The Databricks File System (DBFS) enables you to mount cloud storage and use it to work with and persist file-based data.

Azure Cosmos DB analytical store is a fully isolated column store for enabling large-scale analytics against operational data in your Azure Cosmos DB, without any impact to your transactional workloads. True or False: You can only enable analytical store at the time of creating a new container.

True

Notebooks

When working with Spark, Notebooks provide an interactive environment in which you can combine text and graphics in Markdown format with cells containing code that you run interactively in the notebook session.

Scenario: A teammate is working on solution for transferring data between a dedicated SQL Pool and a serverless Apache Spark Pool using the Azure Synapse Apache Spark Pool to Synapse SQL connector. When could SQL Auth be used for this connection? - Always, anytime you want to transfer data between the SQL and Spark Pool. - None of the listed options. - When you need a token-based authentication to a dedicated SQL outside of the Synapse Analytics workspace. - Never, it is not necessary to use SQL Auth when transferring data between a SQL or Spark Pool.

When you need a token-based authentication to a dedicated SQL outside of the Synapse Analytics workspace.

Which are valid authentication methods in Azure Synapse Analytics? (Select five) SAML MFA Azure Active Directory SSL SQL Authentication SAS OAuth Managed identity Key

- MFA (multi-factor authentication: specific for environments that have that requirement) - Azure Active Directory (only one login required, access to other resources is managed centrally) - SQL Authentication (for user accounts that are not part of an AAD tenant, e.g. external users that need to access the data) - SAS (shared access signature: like keys, but for external users that need access, especially untrusted clients. Service-level: access to specific resources; account-level: access to additional resources or abilities, e.g. creating files) - Managed identity (feature of AAD that gives Azure services an automatically managed identity) - Key (primary and secondary key for each storage account; with Key Vault, access is logged and secrets are stored in individual vaults, so they are hard to leak)

Once you have checked the monitor tab within the Azure Synapse Studio environment, and feel that you could improve the performance of the run, you have several things to consider: • Choose the data abstraction • Use the optimal data format • Use the cache option • Check the memory efficiency • Use Bucketing • Optimize Joins and Shuffles if appropriate • Optimize Job Execution If you did decide to use bucketed tables, which of the following are recommended practices? (Select all that apply) - Move joins that increase the number of rows after aggregations. - Avoid the use of SortMerge join when possible. - It's advised to start with the broadest selective joins. - The order of the different type of joins does matter when it comes to resource consumption.

- Move joins that increase the number of rows after aggregations. - Avoid the use of SortMerge join when possible. - The order of the different type of joins does matter when it comes to resource consumption.

Types of Slowly Changing Dimensions (SCD)

- Type 0: Dimension is never updated - Type 1: does not track change, overwrites the old data with new data. - Type 2: tracks changes by creating a new ROW each time a change occurs, rather than updating the existing record. Start and end date columns (and often a current-row flag) mark the period in which each row was valid. This allows for historical data to be preserved, and is the most widely used type of SCD. - Type 3: tracks changes by adding new COLUMNS that hold the previous value next to the current one. This preserves a limited amount of history, but can lead to an increase in table size. - Type 4: keeps only the current values in the dimension table and moves the history into a separate history table.
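
The difference between Type 1 and Type 2 is easiest to see in code. The following is a minimal Python sketch over plain dictionaries; the column names (`key`, `city`, `start_date`, `end_date`, `is_current`) and both helper functions are assumptions chosen for illustration, not a Synapse or Data Factory API.

```python
from datetime import date


def apply_type1(row, new_attrs):
    """Type 1: overwrite the attributes in place; the old values are lost."""
    row.update(new_attrs)


def apply_type2(rows, key, new_attrs, change_date):
    """Type 2: close the current row for `key` and append a new current row."""
    for row in rows:
        if row["key"] == key and row["is_current"]:
            row["is_current"] = False
            row["end_date"] = change_date
            rows.append({
                **row,
                **new_attrs,
                "start_date": change_date,
                "end_date": None,
                "is_current": True,
            })
            break  # stop right after appending; at most one current row per key


customer_t1 = {"key": 1, "city": "Berlin"}
apply_type1(customer_t1, {"city": "Munich"})

customer_t2 = [{"key": 1, "city": "Berlin", "start_date": date(2020, 1, 1),
                "end_date": None, "is_current": True}]
apply_type2(customer_t2, 1, {"city": "Munich"}, date(2023, 6, 1))
```

After the calls, `customer_t1` only knows "Munich", while `customer_t2` holds two rows: the closed "Berlin" row and the current "Munich" row.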

How do you create a DataFrame object? (Select all that apply) - Use the createDataFrame() function - Introduce a variable name and equate it to something like myDataFrameDF = - Execute createOrReplaceObject() - Use the DF.create() syntax

- Use the createDataFrame() function - Introduce a variable name and equate it to something like myDataFrameDF =

Two of the core features of Delta Lake are performing UPSERTS and Time Travel operations. What does the Time Travel operation do? (Select all that apply) - Writing complex temporal queries. - Because Delta Lake is version controlled, you have the option to query past versions of the data using a single file storage system. - Re-creating analyses, reports, or outputs (for example, the output of a machine learning model). This could be useful for debugging or auditing, especially in regulated industries. - Providing snapshot isolation for a set of queries for fast changing tables.

- Writing complex temporal queries. - Because Delta Lake is version controlled, you have the option to query past versions of the data using a single file storage system. - Re-creating analyses, reports, or outputs (for example, the output of a machine learning model). This could be useful for debugging or auditing, especially in regulated industries. - Providing snapshot isolation for a set of queries for fast changing tables.

True or False: Azure SQL Managed Instance supports Azure Synapse link for SQL.

False: The source must be Azure SQL Database or SQL Server 2022, and the target database (used as an analytical data store) must be a dedicated SQL pool in an Azure Synapse Analytics workspace. --> What is Azure Synapse Link (ASL) used for? ASL automatically replicates changes made to tables in the operational database to corresponding tables in an analytical database. That way, we don't need complex ETL processes anymore and can use ELT instead. Why is ELT better than ETL?: ELT is faster by comparison; data is loaded directly into the destination system and transformed there in parallel

Which of the following allow programmatic interaction with Azure Data Factory? .NET Python Java ARM Templates JavaScript REST APIs C++ PowerShell

.NET Python ARM Templates REST APIs PowerShell

How do you cache data into the memory of the local executor for instant access? .cacheLocalExe() .save().inMemory() .inMemory().save() .cache()

.cache() or .persist()

3 methods to update table in SQL pool with information from other table

1) Insert data from table 2 into table 1: New rows, but old ones not overwritten 2) Switch the first partition from table 1 to table 2: The metadata of the table is modified to reference the new partition, this causes the new partition to be read and the old partition to be dropped from the table. The data is not deleted but rather moved to a different partition. This is a fast and efficient process that minimizes the load time and resources required. 3) Update table 1 from table 2: This would overwrite the existing data in table 1, but would not be the best solution as it would take longer and require more resources to update the data.

4 methods to query datasets

1) Ordered clustered columnstore index: sorts the data before it is compressed into segments, which improves segment elimination and query performance (large data warehousing scenarios) 2) Materialized view: low maintenance, only worthwhile if the query is always the same 3) Result set caching: caches query results so repeated queries skip recomputation (when the same query is executed several times a day) 4) Replicated table: keeps a full copy of the table on every compute node

What do I use to query dimension tables?

1) Replicated: if the table is <= 2 GB, because each node has the whole table and data movement is avoided 2) Hash-distributed: for a large table; rows are assigned to distributions by hashing a chosen column, so the query scales out across all the nodes (better query performance than round-robin) 3) Heap: when large amounts of data have to be inserted or updated quickly (query performance doesn't really matter) 4) Round-robin: rows of a large dimension table are spread evenly across the distributions, without taking any column values into consideration
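
To see why the hash-distributed and round-robin options above behave differently, here is a rough Python sketch. The function names, the use of `zlib.crc32` as the hash, and the four buckets are assumptions for illustration only; a dedicated SQL pool uses its own internal hash function and a fixed set of 60 distributions.

```python
import zlib
from itertools import cycle


def round_robin(rows, n):
    """Deal rows out to n buckets in turn, ignoring their contents."""
    buckets = [[] for _ in range(n)]
    for row, i in zip(rows, cycle(range(n))):
        buckets[i].append(row)
    return buckets


def hash_distribute(rows, n, column):
    """Send each row to the bucket chosen by hashing one column's value."""
    buckets = [[] for _ in range(n)]
    for row in rows:
        i = zlib.crc32(str(row[column]).encode()) % n
        buckets[i].append(row)
    return buckets


rows = [{"customer_id": i % 3, "amount": i} for i in range(9)]
rr = round_robin(rows, 4)
hashed = hash_distribute(rows, 4, "customer_id")
```

Round-robin gives evenly sized buckets but scatters each customer's rows, so a join or aggregation on `customer_id` needs data movement. Hashing keeps all rows with the same `customer_id` in one bucket, at the risk of skew when a few values dominate.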

3 Types of tables to create in a database

1) Standard table: stores data within the database 2) External table: access data that is stored outside of the database, such as in a CSV file or a table in another database. The data is not stored within the database, but can be queried as if it were. 3) View: virtual table that is based on the result of an SQL SELECT statement. It does not store data itself, but rather provides a way to access data from one or more underlying tables in a specific way. Read-only!

In Azure Cosmos DB analytical store, there are two different schema representation modes, depending on the account: 1) SQL (Core) API accounts 2) Azure Cosmos DB API for MongoDB accounts

1) Well-defined schema representation (the default for SQL (Core) API accounts): each property has a single data type across all items, taken from the first item that uses it; values that do not match that type are not represented in the analytical store. This keeps the schema simple and predictable for data management and querying. 2) Full-fidelity schema representation (the default for API for MongoDB accounts): designed for polymorphic schemas; every combination of property and data type gets its own column, so no values are dropped, but the resulting schema is more complex to work with than a well-defined one

By using multimaster replication in Azure Cosmos DB, you can often achieve a response time of less than 1 second from anywhere in the world. Azure Cosmos DB is guaranteed to achieve a response time of less than [?] for reads and writes. 1 ms 100 ms 500 ms 1000 ms 200 ms 10 ms

10 ms

When enabling autoscale in an Apache Spark Pool, it checks if an up- or downscale is necessary every ... seconds

30

Delta Lake and Data Lake

A data lake is basically just a place to store all your less-structured data (compared to a relational database). Delta Lake is an open-source storage layer that sits on top of a data lake and records every change as an entry in an append-only transaction log, similar to change data capture, rather than updating the data files in place. Writes are atomic because a change only becomes visible once its log entry has been appended. Delta Lake is not a storage service itself; in Azure it typically runs on top of ADLS Gen2
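
The append-only idea described above, and why it makes querying older versions possible, can be sketched in a few lines of Python. `ToyDeltaTable` and its methods are invented for this illustration; real Delta Lake keeps JSON commit files in a `_delta_log` directory next to Parquet data files and is far more involved.

```python
class ToyDeltaTable:
    """Toy table whose state is defined purely by an append-only commit log."""

    def __init__(self):
        self.log = []

    def commit(self, add=(), remove=()):
        """Append one log entry and return the new version number."""
        self.log.append({"add": list(add), "remove": list(remove)})
        return len(self.log) - 1

    def read(self, version=None):
        """Replay the log up to `version` (the latest when None)."""
        last = len(self.log) - 1 if version is None else version
        rows = []
        for entry in self.log[: last + 1]:
            rows = [r for r in rows if r not in entry["remove"]]
            rows.extend(entry["add"])
        return rows


table = ToyDeltaTable()
v0 = table.commit(add=["a", "b"])
v1 = table.commit(add=["c"], remove=["a"])
latest = table.read()
as_of_v0 = table.read(version=0)
```

Nothing is ever overwritten: `latest` reflects both commits, while `as_of_v0` replays only the first one, which is the essence of time travel.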

Tumbling Window

A hopping window whose hop size is equal to the window size.
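
To make the relationship between hopping and tumbling windows concrete, here is a minimal Python sketch that counts events per window. The function `window_counts`, its parameters, and the sample timestamps are illustrative assumptions, not Azure Stream Analytics syntax; the sketch also assumes each window covers the interval (end - length, end].

```python
def window_counts(timestamps, period, length, horizon):
    """Count events per window.

    A window closes every `period` seconds and covers the preceding
    `length` seconds, i.e. the interval (end - length, end].
    """
    counts = {}
    end = period
    while end <= horizon:
        start = end - length
        counts[(start, end)] = sum(1 for t in timestamps if start < t <= end)
        end += period
    return counts


tweets = [1, 3, 4, 7, 8, 12, 14]  # event times in seconds

# Hopping: every 5 seconds, count the tweets of the last 10 seconds.
hopping = window_counts(tweets, period=5, length=10, horizon=15)

# Tumbling: the hop size equals the window size, so windows never overlap.
tumbling = window_counts(tweets, period=5, length=5, horizon=15)
```

In the hopping result the tweets at 7 and 8 are counted in two windows, (0, 10] and (5, 15]; in the tumbling result every tweet is counted exactly once.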

Azure Storage provides a REST API to work with the containers and data stored in each account. See the below command: HTTP GET https://myaccount.blob.core.windows.net/?comp=list What would this command return? All of the listed options. A list of all the queues in a container A list of all the tables in a container A list of all the files in a container A list all the blobs in a container A list of all containers None of the listed options.

A list of all containers
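
The URL shape decides what is listed. The Python sketch below only builds the request URLs; the helper `blob_service_url` and the container name `mycontainer` are assumptions for illustration, and a real call would additionally need authorization (for example a SAS token or a signed header).

```python
from urllib.parse import urlencode


def blob_service_url(account, container=None, **params):
    """Build a Blob service REST URL for an account or one of its containers."""
    base = f"https://{account}.blob.core.windows.net/"
    if container:
        base += container
    return f"{base}?{urlencode(params)}"


# Account level: lists the containers in the storage account.
list_containers = blob_service_url("myaccount", comp="list")

# Container level: lists the blobs inside one container.
list_blobs = blob_service_url(
    "myaccount", "mycontainer", restype="container", comp="list"
)
```

Because the command in the question addresses the account root with no container in the path, it returns containers rather than blobs.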

Azure Synapse SQL pools support placing complex data processing logic into Stored procedures True or False: Multiple users and client programs can perform operations on underlying database objects through a procedure, even if the users and programs do not have direct permissions on those underlying objects.

True. A stored procedure is a pre-compiled collection of SQL statements that are stored under a name and processed as a unit.

The following are the facets of Azure Databricks security: • Data Protection • IAM/Auth • Network • Compliance Which of the following comprise Data Protection within Azure Databricks security? (Select five) ACLs Azure Private Link TLS Azure VNet service endpoints Vault Secrets Managed Keys AAD VNet Injection VNet Peering

ACLs TLS Vault Secrets Managed Keys AAD

What core component of ADF is this: "It is a specific action performed on the data in a pipeline like the transformation or ingestion of the data. Each pipeline can have one or more."

Activity

Across all organizations and industries, common use cases for Azure Synapse Analytics are which of the following? (Select all that apply) IoT device deployment Advanced analytics Data exploration and discovery AI learning troubleshooting Real time analytics Integrated analytics Modern data warehousing Data integration

Advanced analytics Data exploration and discovery Real time analytics Integrated analytics Modern data warehousing Data integration

While Agile, CI/CD, and DevOps are different, they support one another. Agile focuses on the ..., CI/CD on ..., and DevOps on ...

Agile focuses on the development process, CI/CD on practices, and DevOps on culture.

Which of the following are you able to load into Azure Synapse Analytics? (Select all that apply) Data batches Non-Azure clouds Semi-structured data On-premises Structured data Relational datastores Non-relational datastores Data streams

All of them

Which of the following services allow customers to store semi-structured datasets in Azure. Azure SQL for VM Azure Cosmos DB Azure SQL Datawarehouse Azure File Storage Azure SQL Database Azure Table Storage Azure Content Delivery Network (CDN) Azure Blob Storage

Azure Cosmos DB Azure File Storage Azure Table Storage Azure Blob Storage

You can monitor Azure Stream Analytics jobs by using which of the following? (Select all that apply) An activity log for each running job Predictive dashboards that show expected service and application health status Real-time dashboards that show service and application health trends Diagnostic logs Alerts on issues in applications or services

An activity log for each running job Real-time dashboards that show service and application health trends Diagnostic logs Alerts on issues in applications or services

What is a lambda architecture and what does it try to solve? - An architecture that splits incoming data into two paths - a batch path and a streaming path. This architecture helps address the need to provide real-time processing in addition to slower batch computations. - None of the listed options. - An architecture that defines a data processing pipeline whereby microservices act as compute resources for efficient large-scale data processing. - An architecture that employs the latest Scala runtimes in one or more Databricks clusters to provide the most efficient data processing platform available today.

An architecture that splits incoming data into two paths - a batch path and a streaming path. This architecture helps address the need to provide real-time processing in addition to slower batch computations. Reason: a query can then combine recent real-time results with "older" batch results
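
How the two paths come together at query time can be sketched as follows. The names `serve_query`, `batch_view`, and `speed_view`, and the page-view counts, are hypothetical; the sketch assumes both paths produce per-key counts and that the streaming path only covers events the batch path has not processed yet.

```python
def serve_query(batch_view, speed_view):
    """Merge precomputed batch counts with counts from the streaming path."""
    merged = dict(batch_view)
    for key, count in speed_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged


# Batch path: complete but hours old. Streaming path: only the latest events.
batch_view = {"home": 1000, "pricing": 250}
speed_view = {"home": 12, "signup": 3}

page_views = serve_query(batch_view, speed_view)
```

The merged view answers with totals that are both complete (from the batch path) and up to date (from the streaming path).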

Azure Databricks is an amalgamation of multiple technologies that enable you to work with data at scale. What are the 6 technologies included in Azure Databricks?

Apache Spark clusters, Delta Lake, Hive metastore, Notebooks, SQL Warehouses, Databricks File System (DBFS)

Difference between Apache Spark, HDInsight, Azure Databricks, Synapse Spark

Apache Spark: open-source big data analytics. HDInsight: Microsoft implementation of open-source Spark within Azure. Azure Databricks: Azure-managed Spark-as-a-service solution. Synapse Spark: embedded Spark capability within Azure Synapse Analytics. Basically: plain Apache Spark comes with no SLA; HDInsight provides one; Azure Databricks adds a management platform; and Synapse Spark is for when I don't have an existing Spark implementation but still want to use Spark

An Azure Stream Analytics job supports which of the following input types? (Select three) Azure Blob Storage Azure Queue Storage Azure Table Storage Azure IoT Hub Azure Event Hub

Azure Blob Storage Azure IoT Hub Azure Event Hub

Input options for a Stream Analytics job (depending on latency and throughput): Azure Blob storage, Azure Data Lake Storage Gen2, Azure IoT Hub, Azure Event Hubs

Azure Blob storage: store and process large amounts of data in a cost-effective manner. Azure Data Lake Storage Gen2: large volume of data with low latency. Azure IoT Hub: specific for IoT data. Azure Event Hubs: low latency, high throughput, high-velocity, real-time event data

You need a NoSQL database with a supported API model, at planet scale, and with low-latency performance.

Azure Cosmos DB

Scenario: You are determining the type of Azure service needed to fit the following specifications and requirements: Data classification: Semi-structured because of the need to extend or modify the schema for new products Operations: • Customers require a high number of read operations, with the ability to query many fields within the database. • The business requires a high number of write operations to track its constantly changing inventory. Latency & throughput: High throughput and low latency. Transactional support: Because all of the data is both historical and yet changing, transactional support is required. Which would be the best Azure service to select? Azure Cosmos DB Azure Blob Storage Azure Queue Storage Azure SQL Database Azure Route Table

Azure Cosmos DB --> supports semi-structured data but we can still use SQL for queries

What is Azure Cosmos DB analytical store?

Azure Cosmos DB analytical store is a fully isolated column store for enabling large-scale analytics against operational data in your Azure Cosmos DB, without any impact to your transactional workloads.

Which technology is typically used as a staging area in a modern data warehousing architecture? Azure Synapse Spark Pools Azure Synapse SQL Lakes Azure Data Lake Azure Synapse Spark Lakes Azure Data Pools Azure Synapse SQL Pools

Azure Data Lake

Which of the following are deployed along with Azure Synapse Analytics? (Select all that apply) Azure Machine Learning Azure Queue Storage Azure Data Lake Storage Gen2 Azure Synapse Workspace Azure Kubernetes Service

Azure Data Lake Storage Gen2 Azure Synapse Workspace Azure Kubernetes Service

Databases: name which are SQL/NoSQL, scale and latency: Azure Database for PostgreSQL Azure DB Server Azure Database for MySQL Azure DB for PostgreSQL Single Server Azure Database for MariaDB Azure DB for MySQL Single Server Azure Cosmos DB

Azure Database for PostgreSQL: SQL, planet scale, low latency. Azure DB Server: SQL, scale depends on configuration, low latency. Azure Database for MySQL: SQL, planet scale, low latency. Azure DB for PostgreSQL Single Server: SQL, single server, low latency. Azure Database for MariaDB: SQL, planet scale, low latency. Azure DB for MySQL Single Server: SQL, single server, low latency. Azure Cosmos DB: NoSQL, planet scale (means being able to handle a massive amount of data), low latency. So PostgreSQL, MariaDB and MySQL are similar; PostgreSQL Single Server and MySQL Single Server are similar; and Azure DB Server as well as Azure Cosmos DB are somewhat unique (Cosmos DB is the only NoSQL one!)

[?] provides one-click setup, streamlined workflows, an interactive workspace for Spark-based applications plus it adds capabilities to Apache Spark, including fully managed Spark clusters and an interactive workspace. Azure Data Catalogue Azure Storage Explorer Azure Cosmos DB Azure Data Factory Azure Data Lake Storage Azure Databricks Azure SQL Data warehouse

Azure Databricks

When creating a new cluster in the Azure Databricks workspace, what happens behind the scenes? - None of the listed options. - Azure Databricks provisions a dedicated VM that processes all jobs, based on your VM type and size selection. - Azure Databricks creates a cluster of driver and worker nodes, based on your VM type and size selections. - When an Azure Databricks workspace is deployed, you are allocated a pool of VMs. Creating a cluster draws from this pool.

Azure Databricks creates a cluster of driver and worker nodes, based on your VM type and size selections.

What is Azure Databricks from a high level?

Azure Databricks service launches and manages Apache Spark clusters within your Azure subscription.

When creating a notebook, you need to specify the pool to attach to the notebook, that is, a SQL or Spark pool. In order to bring data to a notebook, you have several options. It is possible to load data from which of the following? Azure File Storage Azure Data Factory Azure Data Lake Store Gen 2 Azure Cosmos DB SQL Pool Azure Blob Storage

Azure File Storage Azure Data Lake Store Gen 2 Azure Cosmos DB SQL Pool Azure Blob Storage

The IT team has an Azure subscription which contains an Azure Storage account and they plan to create an Azure container instance named OShaughnessy001 that will use a Docker image named Source001. Source001 contains a Microsoft SQL Server instance that requires persistent storage. Right now the team is configuring a storage service for OShaughnessy001 and there is debate around which of the following should be used. As the expert consultant, the team looks to you for direction. Which should you advise them to use? Azure Queue storage Azure Files Azure Table storage Azure Blob storage

Azure Files --> main reason in this scenario is containerization: persistent volume for containers with Azure Files

What is HDI

Azure HDInsight is a managed, full-spectrum, open-source analytics service in the cloud for enterprises. With HDInsight, you can use open-source frameworks such as Apache Spark, Apache Hive, LLAP, Apache Kafka, Hadoop, and more in your Azure environment.

Configuration and synchronization of data between an on-premises Microsoft SQL Server database to Azure SQL Database --> Requirements: • Execute an initial data synchronization to Azure SQL Database (minimize downtime) • Execute bi-directional data synchronization after initial synchronization Which synchronization method should you advise the team to use? Data Migration Assistant Azure SQL Data Sync Transactional replication Backup and restore SQL Server Agent job

Azure SQL Data Sync --> SQL Data Sync is a service built on Azure SQL Database that lets you synchronize the data you select bi-directionally across multiple databases, both on-premises and in the cloud. --> First, I define a hub database (must be an Azure SQL database) and the rest are member databases. The sync happens between the hub and individual members

Azure offers several types of storage for data, the one chosen should depend on the needs of the users. Each data store has a different price structure. When you want to store data but don't need to query it, which would be the most cost efficient choice? Azure Stream Analytics Azure Data Lake Storage Azure Data Factory Azure Data Catalogue Azure Databricks Azure Storage

Azure Storage

Sliding Window

Azure Stream Analytics outputs events only for those points in time when the content of the window actually changes, in other words when an event enters or exits the window. E.g.: Alert me whenever a topic is mentioned more than 3 times in under 10 secs
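
The alert example can be sketched in Python with one queue of recent timestamps per topic. `sliding_alerts` and the sample events are assumptions for illustration; as a simplification the sketch only re-evaluates the window when an event enters it, not when one exits, and it expects the events sorted by time.

```python
from collections import defaultdict, deque


def sliding_alerts(events, window=10, threshold=3):
    """Return (time, topic) whenever a topic has more than `threshold`
    mentions within the last `window` seconds."""
    recent = defaultdict(deque)  # topic -> timestamps still inside the window
    alerts = []
    for t, topic in events:
        q = recent[topic]
        q.append(t)
        # Drop mentions that have fallen out of the window (t - window, t].
        while q and q[0] <= t - window:
            q.popleft()
        if len(q) > threshold:
            alerts.append((t, topic))
    return alerts


events = [(0, "azure"), (2, "azure"), (3, "spark"),
          (5, "azure"), (8, "azure"), (20, "azure")]
alerts = sliding_alerts(events)
```

Here the fourth "azure" mention at second 8 falls within 10 seconds of the first, so one alert fires; by second 20 the earlier mentions have left the window and no further alert is raised.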

[?] makes it possible to replicate data from SQL Server 2022 or Azure SQL Database to a dedicated pool in Azure Synapse Analytics with low latency. This replication enables you to analyze operational data in near-real-time without incurring a large resource utilization overhead on your transactional data store. Azure Application Insights Azure Synapse Link for SQL Azure Data Lake Storage Gen2 Azure Cosmos DB

Azure Synapse Link for SQL

[?] offers both serverless and dedicated resource models to work with both descriptive and diagnostic analytical scenarios. This is a distributed query system that enables you to implement data warehousing and data virtualization scenarios using standard T-SQL.

Azure Synapse SQL

Azure Event Hubs keeps received messages from your sender application, even when the hub is unavailable. Messages received after the hub becomes unavailable are successfully transmitted to the application as soon as the hub becomes available. Which of the following provides message counts and other metrics that you can use as a health check for your Event Hubs? Azure Analysis Services Azure portal Azure Advisor Azure Monitor

Azure portal

Apache Spark for Azure Synapse

Big data engineering and machine learning solutions

Scenario: You are working in the sales department of a company and part of your role is to manage the storage of customer profile and sales data. A common request is to generate a list of "the top 100 customers including name, account number and sales figures for a given time period" or "who are the customers within a given geographic region?" Is Azure Blob storage a good choice for this data? Yes No

No. Blobs are not appropriate for structured data that needs to be queried frequently. They have higher latency than memory and local disk and don't have the indexing features that make databases efficient at running queries.

Which are Azure Storage supported languages and frameworks? (Select all that apply) C# Python Java .NET Node.js Go

C# Python Java .NET Node.js Go

While Agile, CI/CD, and DevOps are different, they support one another Which is best described by: "Focuses on software-defined life cycles highlighting tools that emphasize automation."

CI/CD

I want to read data from Azure Data Lake Storage Gen2. It's a small amount of columns I need to read. Which file format is the best for that?

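Parquet --> it is a columnar format, so a query reads only the columns it needs; CSV is row-based, so every row has to be read in full even when only a few columns are needed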

Which feature of Spark determines how your code is executed? Tungsten Record Format Catalyst Optimizer Java Garbage Collection Cluster Configuration

Catalyst Optimizer

Init Scripts provide a way to configure cluster's nodes. It is recommended to favour Cluster Scoped Init Scripts over Global and Named scripts. Which of the following is best described by: "You specify the script in cluster's configuration by either writing it directly in the cluster configuration UI or storing it on DBFS and specifying the path in Cluster Create API. Any location under DBFS /databricks folder except /databricks/init can be used for this purpose." Cluster Scoped Cluster Named Interactive Global

Cluster Scoped

What do I use to query large fact tables?

Clustered Columnstore Index

How can you manage the lifecycle of data and define how long it will be retained for in an analytical store? Configure the deletion duration for records in the transactional store. Configure the purge duration in a container. Configure the cache to set the time to retain the data in memory. Configure the default Time to Live (TTL) property for records stored.

Configure the default Time to Live (TTL) property for records stored.

You can use either the REST API or the Azure client library to programmatically access a storage account. What is the primary advantage of using the client library? Availability Cost Localization Convenience

Convenience

Azure Synapse Pipelines

Create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores. You can build complex ETL processes that transform data visually with data flows or by using compute services such as Azure HDInsight Hadoop, or Azure Databricks.

Query languages used in Synapse SQL can have different supported features depending on consumption model. Which of the following are compatible with the Serverless consumption model? DDL statements (CREATE, ALTER, DROP) Data export Control of flow Data Load Cross-database queries Built-in table value functions Labels UPDATE statement Built-in functions (text) Aggregates Operators DELETE statement MERGE statement SELECT statement INSERT statement Transactions Built-in functions (analysis)

DDL statements (CREATE, ALTER, DROP) Data export Control of flow Cross-database queries Built-in table value functions Built-in functions (text) Aggregates Operators SELECT statement Transactions Built-in functions (analysis)

Query languages used in Synapse SQL can have different supported features depending on consumption model. Which of the following are compatible with the Dedicated consumption model? DDL statements (CREATE, ALTER, DROP) Data export Control of flow Data Load Cross-database queries Built-in table value functions Labels UPDATE statement Built-in functions (text) Aggregates Operators DELETE statement MERGE statement SELECT statement INSERT statement Transactions Built-in functions (analysis)

DDL statements (CREATE, ALTER, DROP) Data export Control of flow UPDATE statement DELETE statement MERGE statement Built-in table value functions SELECT statement Built-in functions (text) Data Load Aggregates Operators INSERT statement Transactions Labels

What is the Databricks Delta command to display metadata? SHOW SCHEMA tablename MSCK DETAIL tablename METADATA SHOW tablename DESCRIBE DETAIL tableName

DESCRIBE DETAIL tableName

Which distribution option should Eddie's IT team use for a product dimension table that will contain 1,000 records in Synapse Analytics? DISTRIBUTION = ROUND_ROBIN DISTRIBUTION = HEAP([ProductId]) DISTRIBUTION = HASH([ProductId]) DISTRIBUTION = REPLICATE

DISTRIBUTION = REPLICATE --> Works best if table is <2GB or if the table is static and does not change much --> dimension tables

Who does this? Provisions and sets up data platform technologies that are on-premises and in the cloud. They manage and secure the flow of structured and unstructured data from multiple sources. The data platforms they use can include relational databases, nonrelational databases, data streams, and file stores. They ensure that data services securely and seamlessly integrate with other data platform technologies or application services such as Azure Cognitive Services, Azure Search, or even bots.

Data Engineer

You can natively perform data transformations with Azure Synapse pipelines code free using the Mapping Data Flow task. Mapping Data Flows provide a fully visual experience with no coding required. Your data flows will run on your own execution cluster for scaled-out data processing. Data flow activities can be operationalized via which of the following? (Select four) Data Factory scheduling Control capabilities Monitor hub Data hub Manage hub Integrate hub Monitoring capabilities Flow capabilities

Data Factory scheduling Control capabilities Monitoring capabilities Flow capabilities

By default, how long are the Azure Data Factory diagnostic logs retained for?

Data Factory stores pipeline-run data for only 45 days. Use Azure Monitor if you want to keep that data for a longer time.

Integration Runtime (IR) is the compute infrastructure used by Azure Data Factory. It provides data integration capabilities across different network environments. Which of the following are valid data integration capabilities? (Select four) Test Lab execution Data Flow Control Flow Analytic dispatch SSIS package execution Activity dispatch Data transformation activities Data movement Data storage

Data Flow SSIS package execution Activity dispatch Data movement

Which tool is used to perform an assessment of migrating SSIS packages to Azure SQL Database services? Data Migration Assistant Lab Services SQL Server Management Studio SQL Server Upgrade Advisor Data Migration Service ARM templates Data Migration Assessment

Data Migration Assistant

The benefits of using Delta Lake in Azure Databricks include which of the following? (Select four) Data versioning and time travel. Support for ACID transactions. The underlying data for Delta Lake tables is stored in JSON format Support for batch and streaming data You can select, insert, update, and delete rows of data in the same way you would in a relational database system.

Data versioning and time travel. Support for ACID transactions. Support for batch and streaming data You can select, insert, update, and delete rows of data in the same way you would in a relational database system.

[?] enables ad hoc data preparation scenarios, where organizations are wanting to unlock insights from their own data stores without going through the formal processes of setting up a data warehouse. The ETL process Synapse SQL Synapse pipelines Data virtualization

Data virtualization

Which of the following is best described by "While each cluster node has its own local file system, the nodes in a cluster have access to a shared, distributed file system in which they can access and operate on data files. This enables you to mount cloud storage and use it to work with and persist file-based data"?

Databricks file system

What core component of ADF is this: "It is basically the collected data that users require, which is used as input for the ETL process. It can be in JSON, CSV, ORC, or text format."

Datasets

What are the consumption models in Azure Synapse SQL?

Dedicated and serverless. Difference: The Dedicated SQL Pool can be put on pause to save costs, and the Serverless Pool can take over to create views or external tables for querying, and afterwards it can be imported into Power BI.

Linux Foundation [?] is an open-source storage layer for Spark that enables relational database capabilities for batch and streaming data. By using [?], you can implement a data lakehouse architecture in Spark to support SQL-based data manipulation semantics with support for transactions and schema enforcement. The result is an analytical data store that offers many of the advantages of a relational database system with the flexibility of data file storage.

Delta Lake

The [?] is a vast improvement upon the traditional Lambda architecture. At each stage, we enrich our data through a unified pipeline that allows us to combine batch and streaming workflows through a shared filestore with ACID-compliant transactions. Data Lake architecture No-SQL architecture Data Sea architecture Delta Lake architecture Anaconda architecture Serverless architecture

Delta Lake architecture

What is Delta Lake?

Delta Lake is a transactional storage layer designed specifically to work with Apache Spark and Databricks File System (DBFS). At the core of Delta Lake is an optimized Spark table. It stores your data as Apache Parquet files in DBFS and maintains a transaction log that efficiently tracks changes to the table. --> A plain data lake stores vast amounts of raw data inexpensively, but it lacks transactions and schema enforcement. That is why we layer Delta Lake on top.

As a Data Engineer, you can transfer and move data in several ways. The most common tool is Azure Data Factory which provides robust resources and nearly 100 enterprise connectors. Azure Data Factory also allows you to transform data by using a wide variety of languages. Azure has opened the way for technologies that can handle unstructured data at an unlimited scale. This change has shifted the paradigm for loading and transforming data from [?]. ELT → ETL ETL → RTO RPO → RTO ETL → MTD MTD → RPO ETL → ELT

ETL → ELT --> The benefit of ELT is that you can store data in its original format, so that the time required to load the data is reduced.

Azure Data Factory provides a variety of methods for ingesting data, and also provides a range of methods to perform transformations. These methods are: • Mapping Data Flows • Compute Resources • SSIS Packages Mapping Data Flows provides a number of different transformations types that enable you to modify data without using any code. They are broken down into the following categories: • Schema modifier transformations • Row modifier transformations • Multiple inputs/outputs transformations Which of the following are valid transformations available in the Mapping Data Flow? (Select all that apply) Exists Round Lookup Union Trim Alter row Between Conditional split Flatten Derived column Aggregate Filter Avg Merge Join

Exists (Check whether your data exists in another source or stream.) Lookup Union Alter row Conditional split (Route rows of data to different streams based on matching conditions.) Flatten Derived column (Generate new columns or modify existing fields) Aggregate Filter Join --> not available: Round Trim Between Avg Merge

During the process of creating a notebook, you need to specify the pool that needs to be attached to the notebook that is, a SQL or Spark pool. True or False: Notebook cells are individual blocks of code or text that runs as a group. If you want to skip cells within the group, a simple skip notation in the cell is all that is required.

False

True or False: In a serverless SQL pool, if statistics are missing, the query optimizer creates statistics on entire tables in the query predicate or join condition to improve cardinality estimates for the query plan. False True

False

True or False: In Azure Data Factory, in order to debug pipelines or activities, it is necessary to publish your workflows. Pipelines or activities which are being tested may be confined to containers to isolate them from the production environment.

False --> there is no need to publish changes in the pipeline or activities before you want to debug. This is helpful in a scenario where you want to test the changes and see if it works as expected before you actually save and publish them.

True or False: When you create an Azure Databricks workspace, a Databricks appliance is deployed as an Azure resource in your subscription. When you create a cluster in the workspace, you specify the region, but Azure Databricks manages all other aspects of the cluster. True False

False --> we also have to specify the types and sizes of the virtual machines (VMs) and some other configuration options

True or False: Materialized views are prewritten queries with joins and filters whose definition is saved and the results persisted to both serverless and dedicated SQL pools.

False, only dedicated

Required: • Migration of the database to Azure SQL Database • Synchronize users from Active Directory to Azure Active Directory (Azure AD) • Configure Azure SQL Database to use an Azure AD user as administrator Which of the following should be configured? For each Azure SQL Database server, set the Access Control to administrator. For each Azure SQL Database server, set the Active Directory to administrator. For each Azure SQL Database, set the Access Control to administrator. For each Azure SQL Database, set the Active Directory administrator role.

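For each Azure SQL Database server, set the Active Directory to administrator. --> The Azure AD administrator is configured on the logical server, not on individual databases.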

Column level security

For example, if you want to ensure that a specific user 'Leo' can only access certain columns of a table because he's in a specific department.

Is Geo-redundant storage (GRS) or Read-Access Geo-Redundant Storage (RA-GRS) cheaper?

Geo-redundant storage (GRS), because it does not provide read access to the secondary region. RA-GRS adds that read access and costs more.

Scenario: You have created a storage account name using a standardized naming convention within your department. Your teammate is concerned with this practice because the name of a storage account must be [?]. Globally unique None of the listed options Unique within the containing resource group Unique within your Azure subscription

Globally unique

"An implementation by Microsoft of Open Source Spark, managed on the Azure Platform. You can use this for a spark environment when you are aware of the benefits of Apache Spark in its OSS form, but you want an SLA. Usually this is of interest to Open Source Professionals needing an SLA as well as Data Platform experts experienced with Microsoft."

HDI

Which kind of table to choose for fact table? Replicated Round Robin Hash-Distributed

Hash-Distributed

Hive metastore

Hive is an open-source technology used to define a relational abstraction layer of tables over file-based data. A Hive metastore is created for each Spark cluster when it's created.

Azure provides many ways to store your data and there are several tools that create a storage account. Which aspects guide a user's decision on the tool used to create a storage account? (Select two) The datatype being stored in the account Tool cost If the user wants a GUI Location restrictions of the data centre If the user needs automation

If the user wants a GUI If the user needs automation --> Available tools: • Azure Portal (GUI) • Azure CLI (Command-line interface) (Automation) • Azure PowerShell (Automation) • Management client libraries (Automation) --> Usually, bc the creation of a storage account is a one time event, I use the portal. If I do need automation I would use CLI or PowerShell, bc scripting is faster but if the application already exists I can use the MCL

Row level security

If you want to restrict, for example, a customer's data access to only the rows that are relevant to their company, you can implement RLS. Allowing only salespeople in New York City to see that city's sales is an example of RLS, because New York is a value (row) in the City dimension.

When is it unnecessary to use import statements for transferring data between a dedicated SQL and Spark pool? - Import statements are not needed since they are pre-loaded with the Azure Synapse Studio integrated notebook experience. - None of the listed options. - Use token-based authentication. - It is always necessary to use import statements for transferring data between a dedicated SQL and Spark pool. - Use the PySpark connector.

Import statements are not needed since they are pre-loaded with the Azure Synapse Studio integrated notebook experience.

What type of process are the driver and the executors? Java processes Python processes C++ processes JavaScript

Java processes --> The driver and the executors are JVMs (Java Virtual Machines). A JVM's job is to execute other programs. It allows Java to run and optimizes program memory.

You are working on a project where you create a DataFrame which is designed to read data from Azure Blob Storage. Next, you plan to create an additional DataFrame by filtering the initial DataFrame. Which feature of Spark causes these transformations to be analyzed? Lazy Execution Tungsten Record Format Cluster configuration Java Garbage Collection

Lazy Execution
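
--> A minimal plain-Python sketch of the idea, not Spark itself: transformations such as a filter are only recorded, and nothing is read until an action runs. The `LazyFrame` class, `read_blob` function, and sample rows are all made up for illustration.

```python
from typing import Callable, Iterable, List


class LazyFrame:
    """Toy stand-in for a DataFrame: transformations are recorded, not run."""

    def __init__(self, source: Callable[[], Iterable[dict]], steps=None):
        self._source = source            # called only when an action runs
        self._steps = list(steps or [])  # recorded filter predicates

    def filter(self, predicate: Callable[[dict], bool]) -> "LazyFrame":
        # Transformation: returns a new frame with one more recorded step.
        return LazyFrame(self._source, self._steps + [predicate])

    def collect(self) -> List[dict]:
        # Action: only now is the source read and the recorded plan applied.
        rows = list(self._source())
        for predicate in self._steps:
            rows = [r for r in rows if predicate(r)]
        return rows


reads = []  # grows each time the (hypothetical) blob source is really read


def read_blob():
    reads.append("read")
    return [{"city": "London", "sales": 10}, {"city": "Paris", "sales": 3}]


initial = LazyFrame(read_blob)
filtered = initial.filter(lambda r: r["sales"] > 5)  # nothing is read yet
reads_before_action = len(reads)
collected = filtered.collect()  # the action triggers the read and the filter
```

Because the whole chain is known before anything runs, an engine is free to analyze and optimize it first. That is the point of the card.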

What core component of ADF is this: "It has information on the different data sources and Data Factory uses this information to connect to data originating sources. It is mainly used to locate the data stores in the machines and also represent the compute services for the activity to be executed, such as running spark jobs on spark clusters or running hive queries using hive services from the cloud."

Linked service

What are the four storage options?

Locally redundant storage (LRS) Zone-redundant storage (ZRS) Geo-redundant storage (GRS) Read-Access Geo-Redundant Storage (RA-GRS)

Which storage option is the least expensive one?

Locally redundant storage (LRS), bc it copies your data synchronously three times within a single physical location

What are the differences between the four storage options?

Locally redundant storage (LRS)—synchronously replicates data to three disks within a data center in the primary region. Offers a moderate level of availability at a lower cost. Zone-redundant storage (ZRS)—synchronously replicates data among three Azure availability zones in the primary region. Provides a higher level of resilience at higher cost. Geo-Redundant storage (GRS)—stores another three copies of data in a paired Azure region Read-Access Geo-Redundant (RA-GRS)—same as GRS, but allows data to be read from both Azure regions

Scenario: The organization you work at has data which is specific to a country or region due to regulatory control requirements. When considering Azure Storage Accounts, which option meets the data diversity requirement? - Enable virtual networks for the proprietary data and not for the public data. This will require separate storage accounts for the proprietary and public data. - Locate the organization's data in a data centre with the strictest data regulations to ensure that regulatory requirement thresholds have been met. In this way, only one storage account will be required for managing all data, which will reduce data storage costs. - Locate the organization's data in a data centre in the required country or region with one storage account for each location. - None of the listed options.

Locate the organization's data in a data centre in the required country or region with one storage account for each location. --> I need one storage account for every group of settings (e.g. location, replication strategy, owner etc.). That means, if I have data that is specific to a country or region, I need one storage account for each location. If I have some data that is proprietary and some for public consumption, then I can enable virtual networks for the proprietary data and not for the public data. This will also require separate storage accounts.

Scenario: The company you work at stores several website asset types in Azure Storage. These types include images and videos. Which of the following is the best way to secure browser apps to lock GET requests? Use Private Link on the company's websites. Lock GET requests down to specific domains using Vault. Lock GET requests down to specific domains using CORS. Use Private Endpoints between the VMs and the company websites.

Lock GET requests down to specific domains using CORS. --> CORS uses HTTP headers so that a web application at one domain can access resources from a server at a different domain. By using CORS, web apps ensure that they load only authorized content from authorized sources.
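
--> A rough plain-Python model of how a CORS rule decides whether to send the allow header. It is a sketch of the concept only, not the Azure Storage implementation. The `CorsRule` class, `cors_headers` function, and example domains are hypothetical. In practice the browser enforces the result by refusing to hand the response to the page when the header is missing.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class CorsRule:
    allowed_origins: Tuple[str, ...]  # domains allowed to call the service
    allowed_methods: Tuple[str, ...]  # HTTP verbs the rule covers


def cors_headers(rule: CorsRule, origin: str, method: str) -> Dict[str, str]:
    """Return the CORS response header, or {} when the request matches no rule."""
    if method.upper() not in rule.allowed_methods:
        return {}
    if "*" not in rule.allowed_origins and origin not in rule.allowed_origins:
        return {}
    return {"Access-Control-Allow-Origin": origin}


# Hypothetical rule: only the company website may issue GET requests.
rule = CorsRule(
    allowed_origins=("https://www.contoso.example",),
    allowed_methods=("GET",),
)
allowed = cors_headers(rule, "https://www.contoso.example", "GET")
blocked = cors_headers(rule, "https://other.example", "GET")
```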

How do you perform UPSERT in a Delta dataset?

MERGE INTO my-table USING data-to-upsert --> UPSERT is an operation that inserts rows into a table if they don't exist, otherwise they are updated. With MERGE, I can insert and update rows at once
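
--> To make the semantics concrete, here is a small plain-Python model of what an upsert does, using a dict keyed by `id` as a stand-in for the table. It is not the Delta Lake API. The `merge_upsert` function and the sample rows are invented for illustration.

```python
from typing import Dict, Iterable


def merge_upsert(target: Dict[int, dict], updates: Iterable[dict],
                 key: str = "id") -> Dict[int, dict]:
    """Model MERGE: update rows whose key matches, insert the rest."""
    merged = {k: dict(v) for k, v in target.items()}  # leave the input untouched
    for row in updates:
        k = row[key]
        if k in merged:
            merged[k].update(row)  # like WHEN MATCHED THEN UPDATE
        else:
            merged[k] = dict(row)  # like WHEN NOT MATCHED THEN INSERT
    return merged


table = {1: {"id": 1, "name": "Ann"}, 2: {"id": 2, "name": "Bob"}}
changes = [{"id": 2, "name": "Robert"}, {"id": 3, "name": "Cy"}]
upserted = merge_upsert(table, changes)
```

Row 2 is updated and row 3 is inserted in a single pass, which is the behaviour the MERGE statement gives you on a Delta table.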

In the Azure Portal you can use the Pause command within the dedicated SQL pool and within Azure Synapse Studio, in the [?] hub which allows you to enable it, and set the number of minutes idle. Explore and analyze Integrate Develop Data Manage Ingest Monitor

Manage

The team plans to use Azure Data Factory to prepare data to be queried by Azure Synapse Analytics serverless SQL pools. Files will be initially ingested into an Azure Data Lake Storage Gen2 account as 10 small JSON files. Each file will contain the same data attributes and data from a subsidiary of Palmer. The team needs to move the files to a different folder and transform the data. Required: • Provide the fastest possible query times. • Automatically infer the schema from the underlying files. Which of the following should you advise them to use? Append Files Flatten hierarchy Preserve hierarchy Merge Files

Merge Files

MTD

Microsoft Defender for Endpoint on Android and iOS is Microsoft's mobile threat defense solution (MTD)

"Where can I find the Copy Data activity ?" Which of the below is the correct location? Azure Function Data Explorer Move & Transform Databricks Batch Service

Move & Transform

The IT team plans to create an Azure Databricks workspace that has a tiered structure. The workspace will contain the following three workloads: • A workload for data engineers who will use Python and SQL. • A workload for jobs that will run notebooks that use Python, Scala, and SQL. • A workload that data scientists will use to perform ad hoc analysis in Scala and R. Required: Create the Databricks clusters for the workloads. Solution: The team decides to create a High Concurrency cluster for each data scientist, a High Concurrency cluster for the data engineers, and a Standard cluster for the jobs. Does this meet the requirement?

No --> High-concurrency clusters do not support Scala. Standard clusters are recommended for a single user and can run workloads developed in any language: Python, R, Scala, and SQL. High Concurrency clusters work only for SQL, Python, and R.

Which Azure Service is Azure Synapse Pipelines based on? Azure Synapse Spark pools Azure Synapse Studio None of the listed options Azure Data Warehouse Azure Data Explorer Azure Stream Analytics Azure Synapse Link

None of the listed options --> ADF

A shuffle occurs when we need to move data from one node to another in order to complete a stage. Depending on the type of transformation you are doing, you may cause a shuffle to occur. This happens when all the executors need to see all of the data in order to accurately perform the action. If the job requires a wide transformation, you can expect the job to execute slower because all of the partitions need to be shuffled around in order to complete the job. There are two control knobs of a shuffle that can be used to optimize. Which are these options? (Select two) Minimum data set size Number containers that can be accessed by a specific partition Cap on compute resource allocation Region allocation Number of partitions being shuffled Number of partitions that you can compute in parallel Maximum data set size

Number of partitions being shuffled Number of partitions that you can compute in parallel

Four core components of Azure Data Factory

Pipeline Activity Dataset Linked Service

The team plans to use Azure Data Factory to prepare data to be queried by Azure Synapse Analytics serverless SQL pools. Files will be initially ingested into an Azure Data Lake Storage Gen2 account as 10 small JSON files. Each file will contain the same data attributes and data from a subsidiary of Palmer. The team needs to move the files to a different folder and transform the data. Required: • Provide the fastest possible query times. • Automatically infer the schema from the underlying files. As the Azure expert, the team looks to you for advice on how they should configure the Data Factory copy activity with respect to the sink file type. Which of the following should you advise them to use? Parquet TXT CSV JSON

Parquet

What file format is serverless SQL pool best compatible with?

Parquet

SQL and Spark can directly explore and analyze which types of files stored in the data lake? (Select all that apply) PDF Parquet CSV JSON XLS XLSX TSV TXT

Parquet CSV JSON TSV TXT

Formula to calculate partitions

Partitions = Records / (1 million * 60)
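
--> A worked example of the formula in Python. It reads the 60 as the number of distributions in a dedicated SQL pool and the 1 million as the minimum rows per partition per distribution. That reading is my assumption, since the card does not say. The function name and the 600 million row figure are made up.

```python
ROWS_PER_PARTITION_PER_DISTRIBUTION = 1_000_000  # the "1 million" in the formula
DISTRIBUTIONS = 60  # assumed meaning of the "60": distributions per table


def max_partitions(records: int) -> int:
    """Partitions = Records / (1 million * 60), rounded down."""
    return records // (ROWS_PER_PARTITION_PER_DISTRIBUTION * DISTRIBUTIONS)


# Hypothetical fact table with 600 million rows: 600M / 60M = 10 partitions.
example = max_partitions(600_000_000)
```

Under this reading, a table with fewer than 60 million rows gives 0, meaning it is too small to be worth partitioning.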

Azure Synapse Link

Perform operational analytics with near real-time hybrid transactional and analytical processing

What core component of ADF is this: "It is created to perform a specific task by composing the different activities in the task in a single workflow. Activities can be data ingestion (Copy data to Azure) -> data processing (Perform Hive Query). We can schedule the task and manage all the activities in a single process. It is also used to run multiple operations in parallel."

Pipeline

[?] is a data flow object that can be added to the canvas designer as an activity in an Azure Data Factory pipeline to perform code free data preparation. It enables individuals who are not conversant with the traditional data preparation technologies such as Spark or SQL Server, and languages such as Python and T-SQL to prepare data at cloud scale iteratively. Data Expression Orchestrator Mapping Data Flow Data Flow Expression Builder Power Query Data Stream Expression Builder Data Expression Script Builder

Power Query = Wrangling Data Flow --> Can be added as an activity to the pipeline

Which pricing tier do I need if I want to use SQL persona view in the Azure Databricks portal?

Premium

What 3 coding languages can be used in Delta Lake?

PySpark, Scala, and .NET

To access data in your company storage account, your client makes requests over HTTP or HTTPS. Every request to a secure resource must be authorized. Which service ensures that the client has the permissions required to access the data. Vault Private Link RBAC Azure AD

RBAC: authorize a user to use resources in Azure Azure AD: identity store with which users can authenticate onto Azure and can access resources that are part of Azure subscription Azure Private Link: Works for users belonging to different AAD tenants --> Generally, the main two things are RBAC and Azure AD

Azure Storage provides a REST API to work with the containers and data stored in each account. To work with data in a storage account, your app will need which pieces of data? (Select two) Private access key REST API endpoint Instance key Subscription key Access key Public access key

REST API endpoint Access key

RPO

Recovery Point Objective is your goal for the maximum amount of data the organization can tolerate losing

RTO

Recovery Time Objective is the goal your organization sets for the maximum length of time it should take to restore normal operations following an outage or data loss.

Which kind of table to choose for small star schema dimension tables? Replicated Round Robin Hash-Distributed

Replicated

What are the three types of tables?

Replicated, Round Robin, Hash-Distributed
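
--> A simplified plain-Python sketch of how hash and round-robin distribution place rows. It assumes 60 distributions and uses CRC32 as a stand-in hash, because the real hash function is internal to the service. Round robin is modelled here by row number, which is a simplification. All function names and keys are hypothetical.

```python
import zlib
from collections import Counter

DISTRIBUTIONS = 60  # assumed number of distributions in a dedicated SQL pool


def hash_distribution(key: str) -> int:
    # Same key always lands on the same distribution (good for joins on that key).
    # CRC32 is only a stand-in for the service's internal hash function.
    return zlib.crc32(key.encode("utf-8")) % DISTRIBUTIONS


def round_robin_distribution(row_number: int) -> int:
    # Rows are spread evenly with no regard to their content.
    return row_number % DISTRIBUTIONS


keys = [f"customer-{i}" for i in range(600)]
hash_placement = [hash_distribution(k) for k in keys]
rr_counts = Counter(round_robin_distribution(i) for i in range(len(keys)))

# A replicated table is not split at all: every compute node keeps a full
# copy, which is why it suits small dimension tables.
```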

When to use AVRO?

Row based format, has logical type timestamp

Which of the following tools are used to create and deploy SQL Server Integration Packages on an Azure-SSIS integration runtime, or for on-premises SQL Server? SQL Server Upgrade Advisor SQL Server Data Tools dtexec Data Migration Assessment SQL Server Management Studio Data Migration Assistant Data Migration Service

SQL Server Data Tools (SSIS packages) SQL Server Management Studio (monitor, run, stop etc packages)

SQL Warehouses

SQL Warehouses are relational compute resources with endpoints that enable client applications to connect to an Azure Databricks workspace and use SQL to work with data in tables.

In Azure Synapse Studio, the Data hub is where you access which of the following? (Select three) Power BI Notebooks SQL serverless databases Data flows Pipeline canvas Activities Provisioned SQL pool databases Master Pipeline External data sources SQL scripts

SQL serverless databases Provisioned SQL pool databases External data sources

Which integration runtime is required for Azure Data Factory to ingest data from the on-premises server? None of the listed options. Self-Hosted Integration Runtime On-demand HDInsight cluster Azure-SSIS Integration Runtime Azure Integration Runtime

Self-Hosted Integration Runtime

AIM plans to implement an Azure Cosmos DB database which will be replicated to four global regions where only the one closest to London will be writable. During events, the Cosmos DB will write 250,000 JSON each day and the consistency level must meet the following. Requirements: · The system must guarantee monotonic reads and writes within a session. · The system must provide the fastest throughput available. · Latency must be the lowest available. As the expert, the team looks to you for direction. Which of the following consistency levels should you advise them to utilize? Session Consistent Prefix Bounded Staleness Strong Eventual

Session --> The system must guarantee monotonic reads and writes within a session. Azure Cosmos DB offers five well-defined levels. From strongest to weakest, the levels are: • Strong • Bounded staleness • Session • Consistent prefix • Eventual. Moving down the list, consistency weakens and latency drops, while availability and throughput go up.

What are Azure Synapse Studio notebooks based on?

Spark

Apache Spark clusters

Spark is a distributed data processing solution that makes use of clusters to scale processing across multiple compute nodes. Each Spark cluster has a driver node to coordinate processing jobs, and one or more worker nodes on which the processing occurs.

What are Spark Pool Clusters?

Spark pool clusters are groups of computers that are treated as a single computer and handle the execution of commands issued from notebooks. --> To improve scale and performance, the data processing is parallelized across several computers. A cluster consists of a Spark Driver and Worker nodes. The Driver node sends work to the Worker nodes and instructs them to pull data from a specified data source. Moreover, you can configure the number of nodes that are required to perform the task.

Which of the following are valid options for a cluster mode in Azure Databricks?

Standard, Single Node, High Concurrency

[?] logs every Azure Storage account operation in real time, and you can search the logs for specific requests. Filter based on the authentication mechanism, the success of the operation, or the resource that was accessed.

Storage Analytics

To configure an application to receive messages from an Event Hub, which of the following information must be provided so that the application can create connection credentials? (Select all that apply) Storage account name Storage account container name Primary shared access key Event Hub name Storage account connection string Shared access policy name Event Hub namespace name

Storage account name Storage account container name Primary shared access key Event Hub name Storage account connection string Shared access policy name Event Hub namespace name

There are a number of required values to create your Azure Databricks workspace. Which are they? (Select five) Subscription Pricing Tier Workspace Name Node Type Autopilot Options Cluster Mode Resource Group Location Databricks Runtime Version

Subscription Pricing Tier Workspace Name Resource Group Location

To provide a better authoring experience, Azure Data Factory allows you to configure version control software for easier change tracking and collaboration. Which of the below does Azure Data Factory integrate with? (Select all that apply) SourceForge Team Foundation Server AWS CodeCommit Google Cloud Source Repositories GitLab BitBucket Git repositories Source Safe Launchpad

Team Foundation Server (=Azure DevOps) Git repositories

When doing a write stream command, what does the outputMode("append") option do? The append outputMode allows records to update to the output log. The append mode allows records to be updated and changed in place. The append outputMode allows records to be added to the output sink. The append mode replaces existing records and updates aggregates.

The append outputMode allows records to be added to the output sink. --> informs the write stream to add only new records to the output sink.

TCO

The term total cost of ownership (TCO) describes the final cost of owning a given technology. In on-premises systems, TCO includes the following costs: • Hardware • Software licensing • Labour (installation, upgrades, maintenance) • Datacentre overhead (power, telecommunications, building, heating and cooling)

Query languages used in Synapse SQL can have different supported features depending on consumption model. What are the differences between the features of the serverless and dedicated model?

They share most features, but the dedicated model supports INSERT, UPDATE, DELETE, MERGE, labels, and data load (which serverless can't do), while serverless supports cross-database queries, which dedicated can't do.

Wrangling Data Flow is a data flow object that can be added to the canvas designer as an activity in an Azure Data Factory pipeline to perform code free data preparation. There are two ways to create a wrangling data flow in Azure Data Factory. • Click the plus icon and select Data Flow in the factory resources pane. • In the activities pane of the pipeline canvas, open the Move and Transform accordion and drag the Data flow activity onto the canvas. In both methods, in the side pane that opens, select Create new data flow and choose Wrangling data flow. Click OK. Once you have selected a source, then clicked on create, what is the result? - This opens the Data Flow Wrangler UI. - This both opens the Data Flow Wrangler UI and creates a new instance of Data Factory which can be manipulated in either a CLI or GUI environment. - This both opens the Online Mashup Editor and creates a new instance of Data Factory which can be manipulated in either a CLI or GUI environment. - This creates a new instance of Data Factory which can be manipulated in either a CLI or GUI environment. - None of the listed options. - This opens the Online Mashup Editor.

This opens the Online Mashup Editor.

If you are performing analytics on the data, set up the storage account as an Azure Data Lake Storage Gen2 account by setting the Hierarchical Namespace option to which of the following? Disabled OFF ON Auto-scale Ticked/Checked Enabled

Ticked/Checked --> ADLS Gen2 is Blob storage optimized for analytics by enabling the hierarchical namespace, which organizes blobs into directories. If you want to store data without performing analysis on it, set the Hierarchical Namespace option to Disabled to set up the storage account as an Azure Blob storage account. If you are performing analytics on the data, set the option to Ticked/Checked to set up the storage account as an Azure Data Lake Storage Gen2 account.

Tony is asking the development team to enable the Delta Lake feature which will allow him to retrieve data from previous versions of a table. Which of the following should you recommend to the team to employ? Time Travel Catalogue Tables Hindsight Spark Structured Streaming

Time Travel

Synapse Studio comes with an integrated notebook experience. The notebooks in Synapse Studio are a web interface that enables you to create, edit, or transform data in the files. It is based on a live code experience, including visualizations and narrative text. True or False: You can access data in the primary storage account directly. There's no need to provide the secret keys.

True

True or False: Concurrency and the allocation of resources across connected users are also factors that can limit load performance into Azure Synapse Analytics SQL pools. To optimize load operations, the recommendations are to minimize the number of simultaneous load jobs that are running, or to assign higher resource classes, which reduces the number of active running tasks.

True

Within the context of Azure Databricks, sharing data from one worker to another can be a costly operation. Spark has optimized this operation by using a format called [?] which prevents the need for expensive serialization and de-serialization of objects in order to get data from one JVM to another. Lineage Pipelining Stage boundary Stages Tungsten Shuffles

Tungsten --> The bottleneck is mostly CPU, and Tungsten eliminates this problem to an extent. Spark uses two engines to optimize and run queries, Catalyst and Tungsten, in that order. Catalyst generates a query plan and Tungsten uses that plan to generate code. Tungsten uses cache-aware computation and the Tungsten Row Format (a binary data representation).

Peter plans to have the IT team use a Slowly Changing Dimension (SCD) that keeps the history of dimension member changes by adding a new row to the table for each change. Which SCD type would be the best fit? Type 3 SCD Type 6 SCD Type 2 SCD Type 1 SCD Type 5 SCD Type 4 SCD

Type 2 SCD
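The row-per-change behaviour of a Type 2 SCD can be sketched in plain Python. This is only an illustration, not how any particular warehouse implements it: the column names (`start_date`, `end_date`, `is_current`) and the `apply_scd2` helper are hypothetical.

```python
from datetime import date


def apply_scd2(dimension, key, new_attrs, change_date):
    """Close the current row for `key` if its attributes changed, then append a new current row."""
    current = next(
        (row for row in dimension if row["key"] == key and row["is_current"]),
        None,
    )
    if current is not None:
        if current["attrs"] == new_attrs:
            return dimension  # no change, so no new history row
        # Expire the old version instead of overwriting it (that would be Type 1).
        current["end_date"] = change_date
        current["is_current"] = False
    dimension.append(
        {
            "key": key,
            "attrs": dict(new_attrs),
            "start_date": change_date,
            "end_date": None,
            "is_current": True,
        }
    )
    return dimension


dim = []
apply_scd2(dim, "C1", {"city": "Berlin"}, date(2023, 1, 1))
apply_scd2(dim, "C1", {"city": "Munich"}, date(2024, 6, 1))
```

After the second call the table holds two rows for `C1`: the expired Berlin row and the current Munich row, so the full history stays queryable.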

A(n) [?] schema may be defined at query time. Azure Cosmos DB data type Structured data type Hybrid data type Unstructured data type

Unstructured data type --> This means that data can be loaded onto a data platform in its native format.

How can parameters be passed into an Azure Databricks notebook from Azure Data Factory? - Deploy the notebook as a web service in Databricks, defining parameter names and types. - Render the notebook to an API endpoint in Databricks, defining parameter names and types. - Use the new API endpoint option on a notebook in Databricks and provide the parameter name. - Use notebook widgets to define parameters that can be passed into the notebook.

Use notebook widgets to define parameters that can be passed into the notebook.

Required: • Ensure that the SQL pool can load the sales data from the data lake. The team has assembled some actions being considered to meet the requirement, which are shown below. As the Azure expert, Oswald looks to you for advice on the correct actions to take. Which of the following actions should you recommend they perform? (Select three) Use the managed identity as the credentials for the data load process. Add your Azure Active Directory (Azure AD) account to the Sales group. Create a managed identity. Add the managed identity to the Sales group. Create a shared access signature (SAS). Use the shared access signature (SAS) as the credentials for the data load process.

Use the managed identity as the credentials for the data load process. Create a managed identity. Add the managed identity to the Sales group.

Optimal count of streaming units (meaning computing resources that are allocated to execute a Stream Analytics job)

Rule of thumb: 6 SUs per partition. The higher the number of SUs, the more CPU and memory resources are allocated to your job. For example, with 10 partitions, 6 x 10 = 60 SUs is a good allocation.
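The sizing arithmetic from this card, written out as a tiny helper. The function name is hypothetical and the factor of 6 is simply the card's rule of thumb, not a value read from any Azure API.

```python
def recommended_streaming_units(partitions, sus_per_partition=6):
    """Multiply the partition count by the per-partition rule of thumb from the card."""
    return partitions * sus_per_partition


# The card's example: 10 partitions at 6 SUs each.
example_sus = recommended_streaming_units(10)
```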

When planning and implementing your Azure Databricks deployments, you have a number of considerations about networking and network security implementation details including which of the following? (Select four) VNet Peering Managed Keys TLS Azure Private Link VNet Injection ACLs AAD Azure VNet service endpoints Vault Secrets

VNet Peering (Peer with another VN) Azure Private Link (most secure way to access Azure data services from Azure Databricks) VNet Injection (specific network customizations, I can use my own VNET) Azure VNet service endpoints

Which workload management feature influences the order in which a request gets access to resources? Workload importance Workload classification Workload priority Workload isolation

Workload importance --> influences the order in which a request gets access to resources. Higher importance = first access

Do all the 3 IR types support private and public network?

Yes

Which of the following is a benefit of parameterizing a linked service in Azure Data Factory? - You don't have to create a single linked service for each database that is on the same SQL Server. - You don't have to create a single linked service for each database that uses a set of SQL Servers. A single parameterized linked service can be used for multiple SQL Servers providing they are all of the same type of SQL Server (Azure, MySQL, MariaDB, PostgreSQL, Oracle, Amazon...). - You don't have to create a single linked service for each database that uses a set of SQL Servers. A single parameterized linked service can be used for multiple SQL Servers regardless if they are all of the same type of SQL Server or not (Azure, MySQL, MariaDB, PostgreSQL, Oracle, Amazon...). They must all be relational database types. - None of the listed options.

You don't have to create a single linked service for each database that is on the same SQL Server.

Azure Databricks: • The maximum number of jobs that a workspace can create in an hour is [A] • At any time, you cannot have more than [B] jobs simultaneously running in a workspace • There can be a maximum of [C] notebooks or execution contexts attached to a cluster • There can be a maximum of [D] Azure Databricks API calls/hour [A] 750, [B] 250, [C] 300, [D] 1250 [A] 1000, [B] 150, [C] 150, [D] 1500 [A] 250, [B] 50, [C] 200, [D] 500 [A] 500, [B] 100, [C] 250, [D] 1000

[A] 1000, [B] 150, [C] 150, [D] 1500

What data lies in dimension tables?

attribute data that might change but usually changes infrequently. For example, a customer's name and address are stored in a dimension table and updated only when the customer's profile changes. To minimize the size of a large fact table, the customer's name and address don't need to be in every row of a fact table. Instead, the fact table and the dimension table can share a customer ID.

Azure Blob storage lifecycle management rules

automatically transition and delete blobs based on their age or the date when they were last modified. These rules can be used to move data from hot storage to cool storage, or to archive data to an offline storage tier
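An age-based rule of this kind boils down to a threshold check. The sketch below mimics that decision in plain Python; the thresholds (30, 90, 365 days) and the function name are made-up examples, and a real policy is declared on the storage account rather than coded by hand.

```python
def lifecycle_action(days_since_modified, cool_after=30, archive_after=90, delete_after=365):
    """Return the tier or action an age-based rule would pick for a blob of the given age."""
    if days_since_modified > delete_after:
        return "delete"
    if days_since_modified > archive_after:
        return "archive"
    if days_since_modified > cool_after:
        return "cool"
    return "hot"


# Blobs of increasing age move hot -> cool -> archive -> delete.
actions = [lifecycle_action(age) for age in (10, 45, 120, 400)]
```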

Synapse Spark can be used to read and transform objects into a flat structure through data frames. Synapse SQL serverless can be used to query such objects directly and return those results as a regular table. With Synapse Spark, you can transform nested structures into columns and array elements into multiple rows. The steps below, which show the techniques involved in dealing with complex data types, have been shuffled. a. Flatten nested schema Use the function to flatten the nested schema of the data frame (df) into a new data frame. b. Define a function for flattening We define a function to flatten the nested schema. c. Flatten child nested Schema Use the function you create to flatten the nested schema of the data frame into a new data frame. d. Explode Arrays Transform the array in the data frame into a new dataframe where you also define the column that you want to select. Which is the correct technique sequence to deal with complex data types? b → a → c → d a → c → b → d b → a → d → c c → b → d → a

b → a → d → c
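The b → a → d → c order can be traced on a single record using plain Python dictionaries in place of Spark data frames. This is a sketch of the idea only: `flatten` and `explode` here are hypothetical helpers, not the Spark functions, and the sample order is invented.

```python
def flatten(record, prefix=""):
    """Step b: turn nested dicts into one level, joining names with underscores."""
    flat = {}
    for name, value in record.items():
        column = f"{prefix}{name}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{column}_"))
        else:
            flat[column] = value  # lists are left alone until they are exploded
    return flat


def explode(rows, array_column):
    """Step d: emit one output row per element of the list in `array_column`."""
    exploded = []
    for row in rows:
        for element in row[array_column]:
            new_row = dict(row)
            new_row[array_column] = element
            exploded.append(new_row)
    return exploded


order = {
    "id": 1,
    "customer": {"name": "Ana", "city": "Lisbon"},
    "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}],
}

step_a = [flatten(order)]                 # a: flatten the top-level nested schema
step_d = explode(step_a, "items")         # d: one row per array element
step_c = [flatten(row) for row in step_d] # c: flatten the structs that were inside the array
```

The child structs can only be flattened after the explode, because until then they are hidden inside the array; that is why c comes last.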

What is cosmos.olap?

cosmos.olap is the method that connects to the analytical store in Azure Cosmos DB. OLAP stands for Online Analytical Processing and is used to handle multi-dimensional data.

Scenario: You are working at an online retailer and have been tasked with finding average of sales transactions by storefront. Which of the following aggregates would you use? df.select(col("storefront")).avg("completedTransactions") df.groupBy(col("storefront")).avg(col("completedTransactions")) df.groupBy(col("storefront")).avg("completedTransactions") df.select(col("storefront")).avg("completedTransactions").groupBy(col("storefront"))

df.groupBy(col("storefront")).avg("completedTransactions") --> avg() takes the column name as a string, so there is no need to wrap it in col() again.
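What the group-then-average computes can be shown without a Spark session. The helper below is a plain-Python stand-in (not the Spark API), and the sample rows are invented for illustration.

```python
from collections import defaultdict


def avg_by_group(rows, group_key, value_key):
    """Group rows by `group_key` and average `value_key` within each group."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [running sum, row count]
    for row in rows:
        accumulator = totals[row[group_key]]
        accumulator[0] += row[value_key]
        accumulator[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


sales = [
    {"storefront": "north", "completedTransactions": 10},
    {"storefront": "north", "completedTransactions": 20},
    {"storefront": "south", "completedTransactions": 5},
]
averages = avg_by_group(sales, "storefront", "completedTransactions")
```

Grouping has to happen before the aggregate, which is why the options that call `select(...)` first do not produce a per-storefront average.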

Which command orders by a column in descending order in Azure Databricks?

df.orderBy(col("requests").desc()) --> we can also use the sort function, which sorts in ascending order by default

Which is the correct syntax for overwriting data in Azure Synapse Analytics from a Databricks notebook? df.write.format("com.databricks.spark.sqldw").update().option("...").option("...").save() df.write.format("com.databricks.spark.sqldw").overwrite().option("...").option("...").save() df.write.format("com.databricks.spark.sqldw").mode("overwrite").option("...").option("...").save() df.write.mode("overwrite").option("...").option("...").save()

df.write.format("com.databricks.spark.sqldw").mode("overwrite").option("...").option("...").save() --> I need to specify the format, intended write mode and options

What is a replicated table?

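A full copy of the table is made available on every compute node, so joins against it need no data movement. Best suited to small tables such as dimensions.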

What is a Hash Distributed table?

Rows are assigned to distributions by a deterministic hash of a chosen distribution column, so rows with the same value always land in the same distribution. Hash-distributed tables improve query performance on large fact tables. They can have very large numbers of rows and still achieve high performance. When deciding which column to distribute on, don't choose a date column; choose a column with many unique values.

What is a Round Robin table?

Rows are spread evenly across all distributions without regard to their values. Round-robin tables are useful for improving loading speed.
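The difference between hash and round-robin placement can be simulated in a few lines. A dedicated SQL pool spreads each table over 60 distributions; everything else here is an assumption for illustration, in particular `zlib.crc32` standing in for the engine's internal hash and the modulo counter standing in for its round-robin assignment.

```python
import zlib
from collections import Counter

DISTRIBUTIONS = 60  # a dedicated SQL pool always uses 60 distributions


def hash_distribution(key):
    """Same key -> same distribution. crc32 is only a stand-in hash (assumption)."""
    return zlib.crc32(str(key).encode("utf-8")) % DISTRIBUTIONS


def round_robin_distribution(row_number):
    """Rows are dealt out in turn, ignoring their content."""
    return row_number % DISTRIBUTIONS


# 6000 rows belonging to 500 hypothetical customers.
customer_ids = [f"cust-{i % 500}" for i in range(6000)]

hash_placement = [hash_distribution(c) for c in customer_ids]
rr_placement = [round_robin_distribution(n) for n in range(len(customer_ids))]

# Round-robin gives every distribution exactly the same number of rows.
rr_counts = Counter(rr_placement)
# Hash keeps all rows of one customer together in a single distribution.
cust7_distributions = {hash_distribution(c) for c in customer_ids if c == "cust-7"}
```

Keeping equal keys together is what makes joins and aggregations on the distribution column cheap, while the content-blind dealing of round-robin is what makes loads fast.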

Immutable Azure Blob storage retention policies

A feature of Azure Blob storage that allows you to protect your data from deletion or modification. When you enable an immutable retention policy on a container, no one, including the storage account owner, can delete or modify blobs within that container for the specified period of time.

From a high level, the Azure Databricks service launches and manages Apache Spark clusters within your Azure subscription. Apache Spark clusters are groups of computers that are treated as a single computer and handle the execution of commands issued from notebooks. In Databricks, the notebook interface ... [?] - specifies the types and sizes of the virtual machines. - is the driver program. - provides the fastest virtualized network infrastructure in the cloud. - pulls data from a specified data source.

is the driver program.

In ADF, I can define the pipeline in JSON format. There are different properties in the definition. Which of the JSON properties are required? name description parameters activities

name activities
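A minimal pipeline definition containing only the required properties might look like the dictionary below, which serializes to the JSON form. The pipeline and activity names are placeholders, and `missing_required` is a hypothetical helper written for this sketch, not part of any Azure SDK.

```python
import json

REQUIRED_PIPELINE_PROPERTIES = ("name", "activities")


def missing_required(pipeline):
    """List the required properties that are absent or empty in a pipeline definition."""
    present = {
        "name": pipeline.get("name"),
        # In the JSON definition, activities sit under the "properties" object.
        "activities": pipeline.get("properties", {}).get("activities"),
    }
    return [prop for prop in REQUIRED_PIPELINE_PROPERTIES if not present[prop]]


pipeline = {
    "name": "DemoPipeline",  # placeholder name
    "properties": {
        "activities": [
            {
                "name": "WaitBriefly",  # placeholder activity
                "type": "Wait",
                "typeProperties": {"waitTimeInSeconds": 1},
            }
        ]
    },
}

pipeline_json = json.dumps(pipeline, indent=2)
problems = missing_required(pipeline)
```

Leaving out `description` and `parameters` is fine, since both are optional; dropping `activities` is what makes the definition invalid.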

Delta Lake

open source storage layer that runs on top of Data Lakes, used for transaction logging, data type constraints etc.

What is SSIS?

platform for building complex Extract Transform and Load (ETL) solutions. SSIS is typically used to build pipelines

What data lies in fact tables?

quantitative data that are commonly generated in a transactional system, and then loaded into the dedicated SQL pool. For example, a retail business generates sales transactions every day, and then loads the data into a dedicated SQL pool fact table for analysis.

What command should be issued to view the list of active streams? Invoke spark.view.active Invoke spark.streams.active Invoke spark.view.activeStreams Invoke spark.streams.show

spark.streams.active

With regards to the Workspace to VNet ratio, Microsoft recommends ... [?]

that you should only deploy one Workspace in any VNET --> it is possible to have more than one, but for security reasons only one is recommended. To still share common networking resources, we can implement VNET Peering without the workspaces having to be in the same VNET

When working with Azure Data Factory, a dataset is a named view of data that simply points or references the data you want to use in your activities as inputs and outputs. A dataset in Data Factory can be defined in a JSON format. Which of the JSON properties are required? (Select all that apply) typeProperties name type structure (schema)

typeProperties name type

Three IR types

• Azure • Self-hosted • Azure-SSIS

Azure Advisor provides recommendations in which areas? • Cost • Security • Encryption deficiencies • Reliability • Operational excellence • Performance

• Cost • Security • Reliability • Operational excellence • Performance

In Azure Synapse Studio, the Develop hub is where you access which of the following? (4)

• SQL scripts • Notebooks • Data flows • Power BI

What are the three pricing tiers of Azure Databricks and what are the differences?

• Standard - Core Apache Spark capabilities with Azure AD integration. • Premium - Role-based access controls and other enterprise-level features. • Trial - A 14-day free trial of a premium-level workspace

Azure Databricks includes an integrated notebook interface for working with Spark. Which are valid features of notebooks when using Spark? (6)

• Syntax highlighting and error support. • Code auto-completion. • Interactive data visualizations. • The ability to export results. • Export results and notebooks in .html or .ipynb format. • Develop code using Python, SQL, Scala, and R.

As great as data lakes are at inexpensively storing our raw data, they also bring with them which performance challenges?

• Too many small or very big files: more time is spent opening and closing files than reading their contents (worse with streaming). • Partitioning, also known as "poor man's indexing", breaks down if you pick the wrong fields or when the data has many dimensions or high-cardinality columns. • No caching: cloud storage throughput is low (cloud object storage is 20-50MB/s/core vs 300MB/s/core for local SSDs).

