MODULE(S) KNOWLEDGE CHECKS
By default, how many partitions will a new Event Hub have?
4. Event Hubs default to 4 partitions. Partitions are the buckets within an Event Hub. Each publication will go into only one partition. Each consumer group may read from one or more partitions.
You want to configure a private endpoint. You open up Azure Synapse Studio, go to the Manage hub, and see that the private endpoints option is greyed out. Why is the option not available?
A managed virtual network has not been created. In order to create a private endpoint, you first must create a managed virtual network.
Which T-SQL Statement loads data directly from Azure Storage?
COPY. The T-SQL COPY Statement reads data from Azure Blob Storage or the Azure Data Lake and inserts it into a table within the SQL Pool.
What sort of pipeline is required in Azure DevOps for creating artifacts used in releases?
A Build pipeline
What is an Azure Key Vault-backed secret scope?
A Databricks secret scope that is backed by Azure Key Vault instead of Databricks. A secret scope is provided by Azure Databricks and can be backed by either Databricks or Azure Key Vault.
What is the difference between a star schema and a snowflake schema?
All dimensions in a star schema join directly to the fact table (denormalized), while some dimension tables in a snowflake schema are normalized. A star schema is highly denormalized so that the fact table joins directly to each dimension; a snowflake schema normalizes some dimensions into multiple tables such as DimProduct, DimProductSubcategory, and DimProductCategory.
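The difference can be sketched with SQLite from the Python standard library. The table names DimProduct, DimProductSubcategory, and DimProductCategory come from the answer above; FactSales, DimProductStar, and all sample rows are hypothetical and exist only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflake: the product dimension is normalized into three tables.
CREATE TABLE DimProductCategory (CategoryKey INTEGER PRIMARY KEY, CategoryName TEXT);
CREATE TABLE DimProductSubcategory (SubcategoryKey INTEGER PRIMARY KEY, SubcategoryName TEXT, CategoryKey INTEGER);
CREATE TABLE DimProduct (ProductKey INTEGER PRIMARY KEY, ProductName TEXT, SubcategoryKey INTEGER);
CREATE TABLE FactSales (ProductKey INTEGER, Amount REAL);

INSERT INTO DimProductCategory VALUES (1, 'Bikes'), (2, 'Accessories');
INSERT INTO DimProductSubcategory VALUES (10, 'Mountain Bikes', 1), (20, 'Helmets', 2);
INSERT INTO DimProduct VALUES (100, 'Trail 500', 10), (200, 'Sport Helmet', 20);
INSERT INTO FactSales VALUES (100, 1000.0), (100, 500.0), (200, 30.0);

-- Star: one denormalized product dimension (hypothetical name) that joins
-- directly to the fact table.
CREATE TABLE DimProductStar AS
SELECT p.ProductKey, p.ProductName, s.SubcategoryName, c.CategoryName
FROM DimProduct p
JOIN DimProductSubcategory s ON s.SubcategoryKey = p.SubcategoryKey
JOIN DimProductCategory c ON c.CategoryKey = s.CategoryKey;
""")

# Snowflake query: three joins to reach the category name.
snowflake_totals = conn.execute("""
SELECT c.CategoryName, SUM(f.Amount)
FROM FactSales f
JOIN DimProduct p ON p.ProductKey = f.ProductKey
JOIN DimProductSubcategory s ON s.SubcategoryKey = p.SubcategoryKey
JOIN DimProductCategory c ON c.CategoryKey = s.CategoryKey
GROUP BY c.CategoryName
ORDER BY c.CategoryName
""").fetchall()

# Star query: a single join gives the same answer.
star_totals = conn.execute("""
SELECT d.CategoryName, SUM(f.Amount)
FROM FactSales f
JOIN DimProductStar d ON d.ProductKey = f.ProductKey
GROUP BY d.CategoryName
ORDER BY d.CategoryName
""").fetchall()

print(snowflake_totals)
print(star_totals)
```

Both queries return the same totals; the star version trades storage redundancy for fewer joins.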
What is SCIM?
An open standard that enables organizations to import both groups and users from Azure Active Directory into Azure Databricks. Azure Active Directory roles have no relationship with groups created inside of Azure Databricks. SCIM enables synchronizing users and groups, and synchronization is automatic after initial import.
Within Azure Synapse Link for Azure Cosmos DB, which column-oriented store is optimized for analytical queries?
Analytical store. An analytical store is a data store optimized for analytical queries.
What steps are required to authorize Azure DevOps to connect to and deploy notebooks to a staging or production Azure Databricks workspace?
Create a new Access Token within the user settings in the production Azure Databricks workspace, then use the token as the Databricks bearer token in the Databricks Notebooks Deployment step of the Release pipeline. The Access Token allows you to grant access to resources within an Azure Databricks workspace without passing in user credentials.
Which DataFrame method do you use to create a temporary view?
createOrReplaceTempView()
Which method for renaming a DataFrame's column is incorrect?
df.alias("timestamp", "dateCaptured")
Which command specifies a column value in a DataFrame's filter? Specifically, filter by a productType column where the value is equal to book?
df.filter(col("productType") == "book")
When using the Column Class, which command filters based on the end of a column value? For example, a column named verb and filtered by words ending with "ing".
df.filter(col("verb").endswith("ing")) The Column Class supports both the endswith() method and the like() method (example - col("verb").like("%ing")).
You need to find the average of sales transactions by storefront. Which of the following aggregates would you use?
df.groupBy(col("storefront")).avg("completedTransactions") The syntax shown groups the data by the storefront Column, then calculates the average value of completed sales transactions.
Which is the correct syntax for overwriting data in Azure Synapse Analytics from a Databricks notebook?
df.write.format("com.databricks.spark.sqldw").mode("overwrite").option("...").option("...").save()
When you want to switch to SparkSQL in a notebook, what is the first command to type?
%%sql. When you want to switch to SparkSQL in a notebook, type the %%sql command.
How do you list files in DBFS within a notebook?
%fs ls /my-file-path The file system magic (%fs) goes in the cell before the ls command.
How do you disable Azure Synapse Link for Azure Cosmos DB?
Delete the Azure Cosmos DB account. Deleting the Azure Cosmos DB account will disable and remove Azure Synapse Link.
Which Azure Synapse Studio hub would you go to in order to create notebooks?
Develop. The Develop hub is where you manage SQL scripts, Synapse notebooks, data flows, and Power BI reports.
If mounting an Azure Data Lake Storage (ADLS) account to a workspace, what cluster feature must be used to have ACLs within ADLS applied to the user executing commands in a notebook?
Enable ADLS Passthrough on a cluster. When enabled, authentication automatically takes place in Azure Data Lake Storage (ADLS) from Azure Databricks clusters using the same Azure Active Directory (Azure AD) identity that one uses to log into Azure Databricks. Any ACLs applied at the folder or file level in ADLS are enforced based on the user's identity.
Below is a list of key benefits of using Azure Stream Analytics to process streaming data. Which of the following statements is incorrect?
Integration with Azure Blob storage. Integration with Azure Blob storage is not one of the primary benefits of using Azure Stream Analytics to process streaming data. The integration with Blob storage can be used to process static data.
How do you create a DataFrame object?
Introduce a variable name and equate it to something like myDataFrameDF =
What command should be issued to view the list of active streams?
Invoke spark.streams.active
What type of process are the driver and the executors?
Java processes. The driver and the executors are Java processes.
To parallelize work, the unit of distribution is a Spark Cluster. Every Cluster has a Driver and one or more executors. Work submitted to the Cluster is split into what type of object?
Jobs. Each parallelized action is referred to as a Job. The results of each Job are returned to the Driver. Depending on the work required, multiple Jobs may be required. Each Job is broken down into Stages.
What function provides a rowset view over a JSON document?
OPENJSON. The OPENJSON function provides a rowset view over a JSON document.
Mike is creating an Azure Data Lake Storage Gen2 account. He must configure this account to be able to process analytical data workloads for best performance. Which option should he configure when creating the storage account?
On the Advanced tab, set the Hierarchical Namespace to Enabled. If you want to enable the best performance for analytical workloads in Data Lake Storage Gen2, then on the Advanced tab of the Storage Account creation set the Hierarchical Namespace to Enabled.
How many drivers does a Cluster have?
Only one. A Cluster has one and only one driver.
What is an element of an Apache Spark Pool in Azure Synapse Analytics?
Spark Instance. The definition of an Apache Spark pool is that, when instantiated, it is used to create an Apache Spark instance that processes data.
Which statement about the Azure Databricks Data Plane is true?
The Data Plane is hosted within the client subscription and is where all data is processed and stored. All data is processed by clusters hosted within the client Azure subscription, and data is stored within Azure Blob storage and any connected Azure services within this portion of the platform architecture.
Spark is a distributed computing environment. Therefore, work is parallelized across executors. At which two levels does this parallelization occur?
The Executor and the Slot. The first level of parallelization is the Executor, a Java virtual machine running on a node, typically one instance per node. Each Executor has a number of Slots to which the Driver can assign parallelized Tasks.
What is an example of a branching activity used in control flows?
The If Condition activity. An example of a branching activity is the If Condition activity, which is similar to an if statement in programming languages.
To consume processed event streaming data in near-real-time to produce dashboards containing rich visualizations, which of the following services should you use?
Power BI. Power BI provides a platform for visualizing and analyzing aggregated data in near-real-time. Azure Stream Analytics can target Power BI as an output destination. Processed data is passed into Power BI to facilitate near-real-time dashboard updates.
Which type of analytics answers the question "What is likely to happen in the future based on previous trends and patterns?"
Predictive. Predictive analytics answers the question "What is likely to happen in the future based on previous trends and patterns?".
Which feature commits the changes of Azure Data Factory work in a custom branch created from the main branch in a Git repository?
Pull request. After a developer is satisfied with their changes, they create a pull request from their feature branch to the master or collaboration branch to get their changes reviewed by peers.
Which language can be used to define Spark job definitions?
PySpark. PySpark can be used to define Spark job definitions.
Which of the following statements is a benefit of materialized views?
Reducing the execution time for complex queries with JOINs and aggregate functions. Materialized views help to improve complex query performance. The more complex the query, the higher the potential for execution-time saving.
Which is the default distribution used for a table in Synapse Analytics?
Round-Robin. Round-Robin is the default distribution created for a table and delivers fast performance when used for loading data but may negatively impact larger queries.
When configuring network access to your Azure Storage Account, what is the default network rule?
To allow all connections from all networks. The default network rule is to allow all connections from all networks.
What's the purpose of linked services in Azure Data Factory?
To represent a data store or a compute resource that can host execution of an activity. Linked services define the connection information needed for Data Factory to connect to external resources.
Applications that publish messages to Azure Event Hub very frequently will get the best performance using Advanced Message Queuing Protocol (AMQP) because it establishes a persistent socket.
True. Publishers can use either HTTPS or AMQP. AMQP opens a socket and can send multiple messages over that socket.
Which SCD type would you use to update the dimension members without keeping track of history?
Type 1 SCD. When a value changes, Type 1 SCD will update the existing record without keeping history.
Which SCD type would you use to keep history of changes in dimension members by adding a new row to the table for each change?
Type 2 SCD. When a value changes, Type 2 SCD will add a new row for the entity with a start date, end date, and unique key which will join back to any transactions in the fact table within the effective date range.
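The two SCD types above can be contrasted with a minimal plain-Python sketch. The column names (customer_id, city, surrogate_key, start_date, end_date) and both function names are hypothetical; a real dimension load would normally be done in SQL or a data flow.

```python
from datetime import date


def apply_type1(dimension, business_key, new_attrs):
    """Type 1: overwrite the existing row in place, keeping no history."""
    for row in dimension:
        if row["customer_id"] == business_key:
            row.update(new_attrs)
    return dimension


def apply_type2(dimension, business_key, new_attrs, change_date):
    """Type 2: close the current row and append a new row for the change."""
    current = next(
        r for r in dimension
        if r["customer_id"] == business_key and r["end_date"] is None
    )
    next_key = max(r["surrogate_key"] for r in dimension) + 1
    current["end_date"] = change_date  # the old version stops being effective
    new_row = {
        **current,
        **new_attrs,
        "surrogate_key": next_key,     # new unique key for the new version
        "start_date": change_date,
        "end_date": None,              # open-ended: this is the current row
    }
    dimension.append(new_row)
    return dimension


def initial_dimension():
    return [{
        "surrogate_key": 1, "customer_id": "C1", "city": "Portland",
        "start_date": date(2020, 1, 1), "end_date": None,
    }]


type1_result = apply_type1(initial_dimension(), "C1", {"city": "Seattle"})
type2_result = apply_type2(
    initial_dimension(), "C1", {"city": "Seattle"}, date(2024, 1, 1)
)
print(len(type1_result), len(type2_result))
```

Type 1 leaves one row with the new city; Type 2 leaves two rows, so fact rows can still join to the version that was effective on their date.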
How do you perform UPSERT in a Delta dataset?
Use MERGE INTO my-table USING data-to-upsert
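As a rough illustration of what an upsert does, the following plain-Python sketch models the matched and not-matched branches of a MERGE. It is not Delta Lake code; the function name merge_upsert and the sample rows are hypothetical.

```python
def merge_upsert(target, updates, key):
    """Model of an upsert: update rows whose key matches, insert the rest."""
    by_key = {row[key]: dict(row) for row in target}
    for row in updates:
        if row[key] in by_key:
            by_key[row[key]].update(row)  # matched: update the existing row
        else:
            by_key[row[key]] = dict(row)  # not matched: insert a new row
    return list(by_key.values())


target_rows = [
    {"id": 1, "name": "alpha"},
    {"id": 2, "name": "beta"},
]
rows_to_upsert = [
    {"id": 2, "name": "beta-v2"},  # existing key, so the row is updated
    {"id": 3, "name": "gamma"},    # new key, so the row is inserted
]

merged = merge_upsert(target_rows, rows_to_upsert, "id")
print(merged)
```

In a real MERGE INTO statement the two branches correspond to the WHEN MATCHED and WHEN NOT MATCHED clauses.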
What is one of the possible ways to optimize an Apache Spark Job?
Use bucketing. Bucketed tables are optimized because they store metadata about how the data is bucketed and sorted.
How can parameters be passed into an Azure Databricks notebook from Azure Data Factory?
Use notebook widgets to define parameters that can be passed into the notebook. You can configure parameters by using widgets on the Databricks notebook. You then pass in parameters with those names via a Databricks notebook activity in Data Factory.
When is it unnecessary to use import statements for transferring data between a dedicated SQL and Apache Spark pool?
Use the integrated notebook experience from Azure Synapse Studio. Import statements are not needed since they are pre-loaded in case you use the Azure Synapse Studio integrated notebook experience.
What SparkSQL method reads data from the analytical store?
cosmos.olap. cosmos.olap is the method that connects to the analytical store in Azure Cosmos DB.
Which command orders by a column in descending order?
df.orderBy(col("requests").desc()) Use the desc() method on the Column Class to reverse the order.
How do you infer the data types and column names when you read a JSON file?
spark.read.option("inferSchema", "true").json(jsonFile) This approach is the correct way to infer the file's schema.
In Spark Structured Streaming, what method should be used to read streaming data into a DataFrame?
spark.readStream Use the spark.readStream method to start reading data from a streaming query into a DataFrame.
Which Dynamic Management View enables you to view the active connections against a dedicated SQL pool?
sys.dm_pdw_exec_requests. sys.dm_pdw_exec_requests enables you to view the active connections against a dedicated SQL pool.
In which modes does Azure Databricks provide data encryption?
At-rest and in-transit. Data stored in Azure Storage is encrypted using server-side encryption that is seamlessly accessed by Azure Databricks. All data transmitted between the Data Plane and the Control Plane is always encrypted in-flight via TLS.
Which of the following technologies typically provides an ingestion point for data streaming in an event processing solution that uses static data as a source?
Azure Blob storage. Azure Blob storage provides an ingestion point for data streaming in an event processing solution that uses static data as a source.
Which of the following services is an invalid input for an Azure Stream Analytics job?
Azure Cosmos DB. Azure Cosmos DB is a valid output for your Stream Analytics jobs but cannot be used as an input.
Which Azure Service is Azure Synapse Pipelines based on?
Azure Data Factory. Azure Synapse Pipelines is based on the Azure Data Factory service.
Which technology is typically used as a staging area in a modern data warehousing architecture?
Azure Data Lake. Azure Data Lake Storage Gen2 is the technology that will be used to stage data before loading it into the various components of Azure Synapse Analytics.
When creating a new cluster in the Azure Databricks workspace, what happens behind the scenes?
Azure Databricks creates a cluster of driver and worker nodes, based on your VM type and size selections. At the time of cluster creation, you specify the types and sizes of the virtual machines (VMs) to use for both the Driver and Worker nodes, but Azure Databricks manages all other aspects of the cluster.
Which Azure service detects anomalies in account activities and notifies you of potential harmful attempts to access your account?
Azure Defender for Storage. Microsoft Defender for Storage detects anomalies in account activity. It then notifies you of potentially harmful attempts to access your account.
What is a supported connector for built-in parameterization?
Azure Synapse Analytics. Azure Synapse Analytics is a supported connector for built-in parameterization for Linked Services in Azure Data Factory.
In order to create an Apache Spark pool in Azure Synapse Analytics, what needs to be created first?
Azure Synapse Analytics Workspace. In order to create an Apache Spark pool in Azure Synapse Analytics, you would have to create a Synapse Analytics Workspace.
Which Azure Synapse Analytics component enables you to perform Hybrid Transactional and Analytical Processing?
Azure Synapse Link. Azure Synapse Link is the component that enables Hybrid Transactional and Analytical Processing.
Where can you develop T-SQL scripts and notebooks in Azure Synapse Analytics?
Azure Synapse Studio. Azure Synapse Studio is where you can develop T-SQL scripts and notebooks.
What component of Azure Synapse Analytics allows Spark pools and the SQL on-demand engine to share databases and tables?
Azure Synapse shared metadata. The shared metadata gives the workspace SQL engines access to databases and tables created with Spark.
Which feature of Spark determines how your code is executed?
Catalyst Optimizer. Spark SQL uses Catalyst's general tree transformation framework in four phases: Analysis, Logical Optimization, Physical Planning, and Code Generation.
Which Index Type offers the highest compression?
Columnstore. This is the default index type created for a table. It works on segments of rows that get compressed and optimized by column.
What does Azure Data Lake Storage (ADLS) Passthrough enable?
Commands running on a configured cluster can read and write data in ADLS without configuring service principal credentials
How does splitting source files help maintain good performance when loading into Synapse Analytics?
Compute node to storage segment alignment. SQL Pools have 60 storage segments. Compute can also scale to 60 nodes and so optimizing for alignment of these 2 resources can dramatically decrease load times.
Which transformation in the Mapping Data Flow is used to route data rows to different streams based on matching conditions?
Conditional Split. A Conditional Split transformation routes data rows to different streams based on matching conditions. The conditional split transformation is similar to a CASE decision structure in a programming language.
How can you manage the lifecycle of data and define how long it will be retained for in an analytical store?
Configure the default Time to Live (TTL) property for records stored. Configuring the default Time to Live (TTL) property for records stored in an analytical store can manage the lifecycle of data and define how long it will be retained for.
Once Azure Synapse Link is configured on Cosmos DB, what is the first step to perform to use Azure Synapse Analytics serverless SQL pools to query the Azure Cosmos DB data?
Create a database. Before being able to issue any queries using Azure Synapse Analytics serverless SQL pools, you first must create a database.
What are the two prerequisites for connecting Azure Databricks with Azure Synapse Analytics that apply to the Azure Synapse Analytics instance?
Create a database master key and configure the firewall to enable Azure services to connect.
Which notebook format is used in Databricks?
DBC. The supported Databricks notebook format is the DBC file type.
What is the Databricks Delta command to display metadata?
DESCRIBE DETAIL tableName
What distribution option would be best for a sales fact table that will contain billions of records?
DISTRIBUTION = HASH([SalesOrderNumber]). Hash distribution provides good read performance for a large table by distributing records across compute nodes based on the hash key.
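To make the idea of distributing rows by a hash key concrete, here is a small plain-Python model that compares hash placement with round-robin placement across 60 distributions. It uses zlib.crc32 as a stand-in hash; the function Synapse actually uses is not described in these notes, and the order numbers are made up.

```python
import zlib
from collections import Counter

DISTRIBUTIONS = 60  # a dedicated SQL pool spreads table rows over 60 distributions


def hash_distribution(key):
    """Stand-in hash: the same key always lands in the same distribution."""
    return zlib.crc32(str(key).encode("utf-8")) % DISTRIBUTIONS


def round_robin_distribution(row_number):
    """Round-robin: rows are dealt out in turn, regardless of their values."""
    return row_number % DISTRIBUTIONS


# Hypothetical sales order numbers, used only for illustration.
order_numbers = [f"SO{n}" for n in range(1000, 1600)]

hash_placement = {o: hash_distribution(o) for o in order_numbers}
round_robin_counts = Counter(
    round_robin_distribution(i) for i in range(len(order_numbers))
)

print(hash_placement["SO1000"], hash_distribution("SO1000"))
print(min(round_robin_counts.values()), max(round_robin_counts.values()))
```

Because every row with the same SalesOrderNumber maps to the same distribution, joins and aggregations on that key can avoid moving data; round-robin spreads rows evenly but gives no such guarantee.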
What distribution option would you use for a product dimension table that will contain 1,000 records in Synapse Analytics?
DISTRIBUTION = REPLICATE. Replicate will result in a copy of the table on each compute node, which performs well with joins to the distributed fact table.
Which tool is used to perform an assessment of migrating SSIS packages to Azure SQL Database services?
Data Migration Assistant. The Data Migration Assistant is used to perform an assessment of migrating SSIS packages to Azure SQL Database services.
What can cause a slower performance on join or shuffle jobs?
Data skew. Skew is caused by asymmetry in your job data.
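A toy model shows why skew slows joins and shuffles: when rows are partitioned by key, one very common key puts most rows in a single partition, and that partition's task runs long after the others finish. The partition count, keys, and row counts below are hypothetical, and zlib.crc32 stands in for whatever partitioner the engine uses.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical shuffle partition count


def partition_for(key):
    """Stand-in partitioner: rows with the same key go to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


# Asymmetric data: one key accounts for 90% of the rows.
row_keys = ["US"] * 900 + ["DE"] * 50 + ["FR"] * 50

partition_sizes = Counter(partition_for(k) for k in row_keys)
largest_share = max(partition_sizes.values()) / len(row_keys)

print(dict(partition_sizes))
print(largest_share)
```

Whatever the hash values turn out to be, all 900 "US" rows share one partition, so at least 90% of the work falls on a single task.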
What size does OPTIMIZE compact small files to?
Around 1 GB. The Spark optimization team determined this value to be a good compromise between speed and performance.
In Azure Synapse Studio, where would you view the contents of the primary data lake store?
In the linked tab of the Data hub. The linked tab of the data hub is where you can view the contents of the primary data lake store.
How do you cache data into the memory of the local executor for instant access?
.cache() The cache() method is an alias for persist(). Calling this moves data into the memory of the local executor.
What is required to specify the location of a checkpoint directory when defining a Delta Lake streaming query?
.writeStream.format("delta").option("checkpointLocation", checkpointPath) ...
What is the maximum size for a single publication (individual or batch) that is allowed by Azure Event Hub?
1 MB. The maximum size for a single publication (individual or batch) that is allowed by Azure Event Hub is 1 MB.
What is a DataFrame?
A creation of a data structure. A DataFrame creates a data structure and it's one of the core data structures in Apache Spark.
Which of the following statements describes a wide transformation?
A wide transformation requires sharing data across workers. It does so by shuffling data. Wide transformation shares data across workers by shuffling data between executors.
Which of the definitions below best describes a Tumbling window?
A windowing function that segments a data stream into a contiguous series of fixed-size, non-overlapping time segments and operates against them. Events cannot belong to more than one tumbling window.
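The definition above can be illustrated with a few lines of plain Python that count events per tumbling window. This is a sketch, not Stream Analytics syntax: event times are plain integers in seconds, and the choice to treat each window as start-inclusive and end-exclusive is an assumption made for the example.

```python
from collections import defaultdict


def tumbling_window_counts(event_times, window_seconds):
    """Count events per fixed-size, non-overlapping window."""
    counts = defaultdict(int)
    for t in event_times:
        # Integer division maps each event to exactly one window.
        window_start = (t // window_seconds) * window_seconds
        counts[(window_start, window_start + window_seconds)] += 1
    return dict(sorted(counts.items()))


# Hypothetical event timestamps, in seconds.
event_times = [1, 4, 9, 10, 12, 27]
window_counts = tumbling_window_counts(event_times, 10)
print(window_counts)
```

Each event contributes to exactly one window, which is the property that distinguishes tumbling windows from overlapping window types.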
What Transact-SQL function is used to perform a HyperLogLog function?
APPROX_COUNT_DISTINCT. The APPROX_COUNT_DISTINCT function is used to perform a HyperLogLog function.
Which feature in alerts can be used to determine how an alert is fired?
Add criteria. The add criteria feature enables you to determine how an alert is fired.
What is a lambda architecture and what does it try to solve?
An architecture that splits incoming data into two paths - a batch path and a streaming path. This architecture helps address the need to provide real-time processing in addition to slower batch computations. The lambda architecture is a big data processing architecture that combines both batch- and real-time processing methods.
What are Azure Synapse Studio notebooks based on?
Apache Spark. Azure Synapse Studio notebook is purely Apache Spark based.
What does the CD in CI/CD mean?
Both are correct (Continuous Delivery and Continuous Deployment). Continuous Delivery automates your release process up to the point where human intervention is needed, by clicking a button. Continuous Deployment takes a step further by removing the human intervention and relying on automated tests to automatically determine whether the build should be deployed into production.
How do column statistics improve query performance?
By keeping track of how much data exists between ranges in columns. It tracks cardinality and range density to determine which data access paths return the fewest rows for speed.
What optimization does the following command perform: OPTIMIZE Students ZORDER BY Grade?
Ensures that all data backing, for example, Grade=8 is colocated, then rewrites the sorted data into new Parquet files
What is a step in flattening a nested schema?
Explode Arrays. Explode Arrays is a third step in flattening nested schemas. It is necessary to transform the array in the DataFrame into a new DataFrame where the column that you want to select is defined.
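What exploding an array does to the rows can be shown with a plain-Python sketch. It only models the row-level effect (one output row per array element); the function name and the sample order data are hypothetical.

```python
def explode(rows, array_column):
    """Return one output row per element of the array column."""
    exploded = []
    for row in rows:
        for item in row[array_column]:
            flat = {k: v for k, v in row.items() if k != array_column}
            flat[array_column] = item  # the array is replaced by one element
            exploded.append(flat)
    return exploded


# Hypothetical nested records.
orders = [
    {"order_id": 1, "items": ["book", "pen"]},
    {"order_id": 2, "items": ["lamp"]},
    {"order_id": 3, "items": []},  # an empty array yields no output rows
]

flat_orders = explode(orders, "items")
print(flat_orders)
```

After this step each row holds a single scalar value in the former array column, which is what lets it be selected like any other column.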
Which version control software does Azure Data Factory integrate with?
Git repositories. Azure Data Factory allows you to configure a Git repository with either Azure Repos or GitHub. Git is a version control system that allows for easier change tracking and collaboration.
What is the name of the application architecture that enables near real-time querying to provide insights?
HTAP. HTAP stands for Hybrid Transactional and Analytical Processing, which enables you to gain insights from operational systems without impacting the performance of the operational system.
What is the Python syntax for defining a DataFrame in Spark from an existing Parquet file in DBFS?
IPGeocodeDF = spark.read.parquet("dbfs:/mnt/training/ip-geocode.parquet")
What Transact-SQL function verifies if a piece of text is valid JSON?
ISJSON ISJSON is a Transact-SQL function that verifies if a piece of text is valid JSON.
What happens to Databricks activities (notebook, JAR, Python) in Azure Data Factory if the target cluster in Azure Databricks isn't running when the cluster is called by Data Factory?
If the target cluster is stopped, Databricks will start the cluster before attempting to execute. This situation will result in a longer execution time because the cluster must start, but the activity will still execute as expected.
Where do you enable Azure Synapse Link for Azure Cosmos DB?
In Azure Cosmos DB. When you enable Azure Synapse Link for Azure Cosmos DB it must be done in Azure Cosmos DB.
You create a DataFrame that will read some data from Azure Blob Storage, and then you create another DataFrame by filtering the initial DataFrame. What feature of Spark causes these transformations to be analyzed?
Lazy Execution. Transformations applied to DataFrames are lazy, meaning they will not trigger any jobs. If you pass the DataFrame to a display function, a job will be triggered because display is an action.
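Python generators behave in a similar deferred way and make a convenient analogy. This sketch is not Spark code: read_source stands in for a read, the generator expression for a filter transformation, and list() for an action; all names are hypothetical.

```python
read_log = []


def read_source():
    """Stand-in for reading from storage; logs each record it actually reads."""
    for n in range(5):
        read_log.append(f"read {n}")
        yield n


source = read_source()                          # "read": nothing happens yet
filtered = (n for n in source if n % 2 == 0)    # "filter": still nothing read

log_before_action = list(read_log)              # snapshot taken before any action

result = list(filtered)                         # "action": the work runs now

print(log_before_action)
print(result)
print(len(read_log))
```

Defining the read and the filter does no work; only the final call that needs results triggers it, which mirrors how transformations wait for an action.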
Which ALTER DATABASE statement parameter allows a dedicated SQL pool to scale?
MODIFY. MODIFY is used to scale a dedicated SQL pool.
Which hub is where you can grant access to Synapse workspace and resources?
Manage hub. You can grant access to Synapse workspace in the Manage hub.
You require an Azure Synapse Analytics Workspace to access an Azure Data Lake Store using the benefits of the security provided by Azure Active Directory. What is the best authentication method to use?
Managed identities. Managed identities provide Azure services with an automatically managed identity in Azure Active Directory. You can use the Managed Identity capability to authenticate to any service that supports Azure Active Directory authentication.
Which component enables you to perform code free transformations in Azure Synapse Analytics?
Mapping data flow. You can natively perform data transformations with Azure Data Factory code free using the Mapping Data Flow task.
In which version of SQL Server was SSIS Projects introduced?
SQL Server 2012. SSIS Projects were introduced in SQL Server 2012 and are the unit of deployment for SSIS solutions.
Which tool is used to create and deploy SQL Server Integration Services (SSIS) packages on an Azure-SSIS integration runtime, or for on-premises SQL Server?
SQL Server Data Tools. SQL Server Data Tools is typically used to create and deploy SQL Server Integration Services (SSIS) packages.
In what language can the Azure Synapse Apache Spark to Synapse SQL connector be used?
Scala. The connector uses Scala to integrate Apache Spark pools with dedicated SQL pools in Azure Synapse Analytics.
How can all notebooks in Synapse Studio be saved?
Select the Publish all button on the workspace command bar. To save all notebooks in your workspace, select the Publish all button on the workspace command bar.
You have a requirement to occasionally prepare data for ad hoc data exploration and analysis. Which resource model in Azure Synapse Analytics is the most effective to use to meet this requirement?
Serverless. The serverless resource model is the ideal resource model in this scenario as it makes use of the resources when required.
You are working on a project with a 3rd party vendor to build a website for a customer. The image assets that will be used on the website are stored in an Azure Storage account that is held in your subscription. You want to give read access to this data for a limited period of time. What security option would be the best option to use?
Shared Access Signatures. A shared access signature is a string that contains a security token that can be attached to a URI. Use a shared access signature to delegate access to storage objects and specify constraints, such as the permissions and the time range of access.
Which transformation is used to load data into a data store or compute resource?
Sink. A Sink transformation allows you to choose a dataset definition for the destination output data. You can have as many sink transformations as your data flow requires.
In which phase of big data processing is Azure Data Lake Storage located?
Store. Store is the phase in which Azure Data Lake Storage resides when processing big data solutions.
When doing a write stream command, what does the outputMode("append") option do?
The append outputMode allows records to be added to the output sink. The outputMode "append" option informs the write stream to add only new records to the output sink. The "complete" option is to rewrite the full output, which is applicable to aggregation operations. Finally, the "update" option is for updating changed records in place.
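The three output modes can be compared with a simplified plain-Python model of two micro-batches. This is not Structured Streaming code: the append output is shown for the raw (non-aggregated) rows, while complete and update are shown for a running count per word, and every name here is hypothetical.

```python
def run_batches(batches):
    """Model what each output mode would emit for every micro-batch."""
    running_counts = {}
    append_out, complete_out, update_out = [], [], []
    for batch in batches:
        # append: only the rows that are new in this batch.
        append_out.append(list(batch))

        changed = {}
        for word in batch:
            running_counts[word] = running_counts.get(word, 0) + 1
            changed[word] = running_counts[word]

        # complete: the full aggregated result, rewritten every batch.
        complete_out.append(dict(running_counts))
        # update: only the aggregate rows that changed in this batch.
        update_out.append(changed)
    return append_out, complete_out, update_out


append_out, complete_out, update_out = run_batches([["a", "b"], ["a", "c"]])
print(append_out)
print(complete_out)
print(update_out)
```

In the second batch, complete re-emits the unchanged count for "b" while update emits only "a" and "c".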
What would be the best approach to investigate if the data at hand is unevenly allocated across all distributions?
Using DBCC PDW_SHOWSPACEUSED to see the number of table rows that are stored in each of the 60 distributions. DBCC PDW_SHOWSPACEUSED returns the number of table rows that are stored in each of the 60 distributions.
To achieve improved query performance, which one would be the best data type for storing data that contains less than 128 characters?
VARCHAR(128) Limiting the size of the data type and not using size variability will provide the best performance.
What happens if the command option("checkpointLocation", pointer-to-checkpoint directory) is not specified?
When the streaming job stops, all state around the streaming job is lost, and upon restart, the job must start from scratch. Setting the checkpointLocation is required for many sinks used in Structured Streaming. For those sinks where this setting is optional, keep in mind that when you do not set this value, you risk losing your place in the stream.
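The effect of a checkpoint can be sketched in plain Python by saving the next offset to a file between runs. This only models the idea of remembering your place in the stream; the JSON file layout, the function name process, and the event names are hypothetical and unrelated to the real checkpoint directory format.

```python
import json
import os
import tempfile


def process(events, checkpoint_path=None):
    """Return the events handled in this run, resuming from a saved offset."""
    start = 0
    if checkpoint_path and os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_offset"]  # resume where the last run stopped
    handled = events[start:]
    if checkpoint_path:
        with open(checkpoint_path, "w") as f:
            json.dump({"next_offset": len(events)}, f)
    return handled


with tempfile.TemporaryDirectory() as checkpoint_dir:
    path = os.path.join(checkpoint_dir, "offsets.json")
    first_run = process(["e1", "e2"], path)           # fresh start
    second_run = process(["e1", "e2", "e3"], path)    # restart with a checkpoint

# Restart without a checkpoint: the job starts from scratch.
no_checkpoint_run = process(["e1", "e2", "e3"])

print(first_run, second_run, no_checkpoint_run)
```

With the checkpoint, the restarted run handles only the new event; without it, everything is processed again from the beginning.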
Which Workload Management capability manages minimum and maximum resource allocations during peak periods?
Workload Isolation. Workload Isolation assigns maximum and minimum usage values for varying resources under load. These adjustments can be done live without having to take the SQL Pool offline.
Which workload management feature influences the order in which a request gets access to resources?
Workload importance. Workload importance influences the order in which a request gets access to resources. On a busy system, a request with higher importance has first access to resources.
