DP-203 Studying

What is required to specify the location of a checkpoint directory when defining a Delta Lake streaming query?

.writeStream.format("delta").option("checkpointLocation", checkpointPath)
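
As a minimal sketch of how that option fits into a full streaming write, the helper below wraps the call chain from the answer. The function name `start_delta_stream`, its parameters, and the `append` output mode are assumptions made for illustration; `stream_df` is assumed to be a streaming DataFrame from a Spark session with Delta Lake available.

```python
def start_delta_stream(stream_df, output_path, checkpoint_path):
    """Start a streaming write to a Delta table with an explicit checkpoint directory.

    stream_df is assumed to be a streaming DataFrame (hypothetical input);
    output_path and checkpoint_path are illustrative storage locations.
    """
    return (
        stream_df.writeStream
        .format("delta")                                # write in Delta Lake format
        .outputMode("append")                           # assumption: append-only sink
        .option("checkpointLocation", checkpoint_path)  # where progress is tracked
        .start(output_path)                             # returns a StreamingQuery
    )
```

The checkpoint directory is what lets the query resume from where it stopped after a restart, so each streaming query would normally get its own checkpoint path.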

What value should be set in the fieldterminator and fieldquote variables to read JSON files?

Set both the fieldterminator and fieldquote variables to 0x0b to read JSON files.

Which CREDENTIAL identity type can be time limited?

A shared access signature (SAS) can be time limited.

Which metadata object do you create that contains a query that reads data from a data lake?

A VIEW is a metadata object that contains a query that reads data from a data lake.

What is Azure Data Lake Storage Gen2?

A data lake is a repository of data that is stored in its natural format, usually as blobs or files. Azure Data Lake Storage is a comprehensive, scalable, and cost-effective data lake solution for big data analytics built into Azure. It has a hierarchy and provides redundancy through locally redundant storage (LRS) or geo-redundant storage (GRS).

What are the two types of access control lists?

Access ACLs: Control access to an object. Files and directories both have access ACLs.
Default ACLs: Templates of ACLs associated with a directory that determine the access ACLs for any child items that are created under that directory. Files do not have default ACLs.
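
The relationship between the two ACL types can be sketched with a toy model. This is not the storage service's API: the `Node` class, the `create_child` helper, and the ACL strings are hypothetical, and the sketch ignores umask handling and the case where the parent has no default ACL.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """Toy file-system item (hypothetical model, not a real SDK type)."""
    name: str
    is_dir: bool
    access_acl: Dict[str, str] = field(default_factory=dict)
    default_acl: Optional[Dict[str, str]] = None  # only directories carry one


def create_child(parent: Node, name: str, is_dir: bool) -> Node:
    """Create a child whose access ACL is templated from the parent's default ACL."""
    if not parent.is_dir:
        raise ValueError("parent must be a directory")
    inherited = dict(parent.default_acl or {})
    return Node(
        name=name,
        is_dir=is_dir,
        access_acl=inherited,
        # Child directories keep the template for their own children; files do not.
        default_acl=dict(inherited) if is_dir else None,
    )


raw = Node("raw", True,
           access_acl={"user:alice": "rwx"},
           default_acl={"user:alice": "r-x"})
data_file = create_child(raw, "a.csv", is_dir=False)
sub_dir = create_child(raw, "2024", is_dir=True)
```

In this model the file ends up with an access ACL only, while the subdirectory receives both an access ACL and a copy of the default ACL.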

What authentication types are supported by Serverless SQL pools?

Anonymous access: Used to access publicly available files placed on Azure storage accounts that allow anonymous access.
Shared access signature (SAS): Provides delegated access to resources in a storage account. With a SAS, you can grant clients access to resources in the storage account without sharing account keys. A SAS gives you granular control over the type of access you grant to clients who have the SAS: validity interval, granted permissions, acceptable IP address range, and acceptable protocol (https/http).
Managed Identity: A feature of Azure Active Directory (Azure AD) that provides Azure services for serverless SQL pool. It deploys an automatically managed identity in Azure AD, which can be used to authorize the request for data access in Azure Storage. Before accessing the data, the Azure Storage administrator must grant permissions to the Managed Identity. Granting permissions to a Managed Identity is done the same way as granting permissions to any other Azure AD user.
User Identity: Also known as "pass-through", an authorization type where the identity of the Azure AD user that logged into serverless SQL pool is used to authorize access to the data. Before accessing the data, the Azure Storage administrator must grant permissions to the Azure AD user. Because this authorization type relies on the Azure AD user that logged into serverless SQL pool, it is not supported for SQL user types.

What is Azure Databricks?

Azure Databricks is a data analytics platform optimized for the Microsoft Azure cloud services platform. Azure Databricks offers three environments for developing data intensive applications: Databricks SQL, Databricks Data Science & Engineering, and Databricks Machine Learning.

What is Lambda Architecture?

The lambda architecture is a big data processing architecture that combines batch and real-time processing methods. It features an append-only, immutable data source that serves as the system of record. Timestamped events are appended to existing events (nothing is overwritten), and data is implicitly ordered by time of arrival. There are really two pipelines, one batch and one streaming, hence the name lambda architecture.
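
A rough sketch of the two pipelines, using an in-memory list as the append-only log. All names here (`append_event`, `build_batch_view`, `build_speed_view`, `query`) and the idea of a single `cutoff` timestamp separating the layers are simplifying assumptions, not part of any specific product.

```python
from collections import Counter


def append_event(log, timestamp, key):
    """Append a timestamped event; existing events are never overwritten."""
    log.append((timestamp, key))


def build_batch_view(log, cutoff):
    """Batch pipeline: recompute counts over everything up to the cutoff."""
    return Counter(key for ts, key in log if ts <= cutoff)


def build_speed_view(log, cutoff):
    """Streaming pipeline: cover only events newer than the last batch run."""
    return Counter(key for ts, key in log if ts > cutoff)


def query(batch_view, speed_view, key):
    """Serving step: merge both views to answer a query."""
    return batch_view[key] + speed_view[key]


event_log = []
for ts, key in [(1, "login"), (2, "login"), (3, "purchase"), (7, "login")]:
    append_event(event_log, ts, key)

batch_view = build_batch_view(event_log, cutoff=5)
speed_view = build_speed_view(event_log, cutoff=5)
total_logins = query(batch_view, speed_view, "login")
```

Maintaining two code paths that must agree on the same result is the main cost of this design, which is the point the Delta Lake architecture later addresses.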

What are the 4 components of a typical event processing pipeline built on top of Stream Analytics?

Event producer: Any application, system, or sensor that continuously produces event data of interest. Examples include sensors tracking the flow of water through a utility pipe and an application such as Twitter that generates tweets against a single hashtag.
Event ingestion system: Receives the data from an event producer and passes it to an analytics engine. Azure Event Hubs, Azure IoT Hub, or Azure Blob storage can serve as the ingestion system.
Stream analytics engine: The compute platform that processes, aggregates, and transforms incoming data streams. Azure Stream Analytics provides the Stream Analytics query language (SAQL), a subset of Transact-SQL tailored to perform computations over streaming data. The engine supports windowing functions that are fundamental to stream processing and are implemented by using SAQL.
Event consumer: A destination for the output from the stream analytics engine. The output can be stored in a data storage platform, such as Azure Data Lake Storage Gen2, Azure Cosmos DB, Azure SQL Database, or Azure Blob storage. Alternatively, you can consume the output in near real time using Power BI dashboards.
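
To make the windowing idea concrete, here is a small Python sketch of a tumbling-window count, the kind of aggregation a stream analytics engine performs. It is not SAQL: the function name, the `(timestamp, device)` event shape, and the choice to label each window by its start time with an inclusive start boundary are all assumptions for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per device in fixed, non-overlapping time windows.

    events: iterable of (timestamp_in_seconds, device_id) pairs (hypothetical shape).
    Returns a dict keyed by (window_start, device_id).
    """
    counts = defaultdict(int)
    for timestamp, device in events:
        # Each event belongs to exactly one window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, device)] += 1
    return dict(counts)


readings = [(0, "pipe-a"), (3, "pipe-a"), (10, "pipe-a"), (12, "pipe-b")]
per_window = tumbling_window_counts(readings, window_seconds=10)
```

Because the windows do not overlap, every event is counted once, which is what distinguishes a tumbling window from hopping or sliding windows.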

What is the difference between Azure Data Lake storage and Azure Blob Storage?

In Azure Blob storage, you can store large amounts of unstructured ("object") data in a single hierarchy, also known as a flat namespace. You can access this data by using HTTP or HTTPS. Azure Data Lake Storage Gen2 builds on Blob storage and optimizes I/O of high-volume data by using hierarchical namespaces. Hierarchical namespaces organize blob data into directories and store metadata about each directory and the files within it. This structure allows operations such as directory renames and deletes to be performed in a single atomic operation. Flat namespaces, by contrast, require several operations proportionate to the number of objects in the structure. Hierarchical namespaces keep the data organized, which yields better storage and retrieval performance for analytical use cases and lowers the cost of analysis. If you are not doing analysis on the data, you can leave the hierarchical namespace disabled (keep the namespace flat).
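
The cost difference for a directory rename can be illustrated with a toy model. The dictionaries below stand in for storage and are purely hypothetical; `rename_flat` and `rename_hierarchical` are illustrative names, and the operation counts are a simplification of what a real service does.

```python
def rename_flat(blobs, old_prefix, new_prefix):
    """Flat namespace: every blob whose name starts with the prefix is rewritten.

    blobs maps full blob names to content. Returns (new mapping, operation count).
    """
    renamed, operations = {}, 0
    for name, content in blobs.items():
        if name.startswith(old_prefix):
            renamed[new_prefix + name[len(old_prefix):]] = content
            operations += 1  # one operation per affected object
        else:
            renamed[name] = content
    return renamed, operations


def rename_hierarchical(tree, old_name, new_name):
    """Hierarchical namespace: the directory entry is re-pointed in one step.

    tree maps directory names to {file name: content}.
    """
    tree[new_name] = tree.pop(old_name)
    return tree, 1


flat_store = {"raw/a.csv": 1, "raw/b.csv": 2, "other/c.csv": 3}
flat_result, flat_ops = rename_flat(flat_store, "raw/", "staged/")

tree_store = {"raw": {"a.csv": 1, "b.csv": 2}, "other": {"c.csv": 3}}
tree_result, tree_ops = rename_hierarchical(tree_store, "raw", "staged")
```

In the flat model the work grows with the number of objects under the prefix, while the hierarchical model touches a single directory entry regardless of how many files it contains.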

What are the 4 phases of processing big data solutions, regardless of the architecture?

Ingestion - The ingestion phase identifies the technology and processes that are used to acquire the source data. This data can come from files, logs, and other types of unstructured data that must be put into the Data Lake Store. The technology used will vary depending on how frequently the data is transferred. For example, for batch movement of data, Azure Data Factory may be the most appropriate technology. For real-time ingestion of data, Apache Kafka for HDInsight or Stream Analytics may be appropriate.
Store - The store phase identifies where the ingested data should be placed. In this case, we're using Azure Data Lake Storage Gen2.
Prep and train - The prep and train phase identifies the technologies that are used to perform data preparation and model training and scoring for data science solutions. The common technologies used in this phase are Azure Databricks, Azure HDInsight, or Azure Machine Learning Services.
Model and serve - Finally, the model and serve phase involves the technologies that will present the data to users. These can include visualization tools such as Power BI, or other data stores such as Azure Synapse Analytics, Azure Cosmos DB, Azure SQL Database, or Azure Analysis Services. Often, a combination of these technologies will be used depending on the business requirements.

Which authentication method is the likeliest choice for an individual from an external organization who needs to access your serverless SQL pool?

SQL Authentication, which uses a username and password stored within the serverless SQL pool.

Which role enables a user to create external table as select (CETAS) against an Azure Data Lake Gen2 data store?

Storage Blob Data Contributor, because it provides read/write access. Read/write access is needed for a user to run create external table as select (CETAS).

What is Delta Lake Architecture?

The Delta Lake Architecture is a vast improvement upon the traditional Lambda architecture. At each stage, we enrich our data through a unified pipeline that allows us to combine batch and streaming workflows through a shared filestore with ACID-compliant transactions.
Bronze tables contain raw data ingested from various sources (JSON files, RDBMS data, IoT data, etc.).
Silver tables provide a more refined view of our data. We can join fields from various bronze tables to enrich streaming records, or update account statuses based on recent activity.
Gold tables provide business-level aggregates often used for reporting and dashboarding. This would include aggregations such as daily active website users, weekly sales per store, or gross revenue per quarter by department.
The end outputs are actionable insights, dashboards, and reports of business metrics.
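
The bronze, silver, and gold stages can be sketched with plain Python lists standing in for tables. This is only an illustration of the refinement steps, not Delta Lake code: the sample records, the field names `store` and `amount`, and the helpers `to_silver` and `to_gold` are all made up for the example.

```python
import json
from collections import defaultdict

# Bronze: raw records exactly as ingested (hypothetical sample data).
bronze = [
    '{"store": "A", "amount": "10.5"}',
    '{"store": "A", "amount": "bad"}',
    '{"store": "B", "amount": "4"}',
]


def to_silver(bronze_rows):
    """Parse and clean raw rows, dropping records with an unparseable amount."""
    silver = []
    for raw in bronze_rows:
        record = json.loads(raw)
        try:
            record["amount"] = float(record["amount"])
        except ValueError:
            continue  # assumption: invalid rows are simply skipped
        silver.append(record)
    return silver


def to_gold(silver_rows):
    """Aggregate cleaned rows into a business-level total per store."""
    totals = defaultdict(float)
    for record in silver_rows:
        totals[record["store"]] += record["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Each stage reads only from the stage before it, so the raw bronze data stays available for reprocessing if the cleaning rules change.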

What function is used to read the data in files stored in a data lake?

The OPENROWSET function is used to read the data in files stored in a data lake.

What character in a file path can be used to select all the files or folders that match the rest of the path?

The asterisk character (*) in a file path can be used to select all the files or folders that match the rest of the path.
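
A rough Python sketch of how such a wildcard selects paths. It assumes that an asterisk matches any run of characters within a single path segment and does not cross a `/`; that boundary rule, the `matches` helper, and the sample paths are assumptions of this sketch rather than a description of the service's exact matching rules.

```python
import re


def matches(pattern, path):
    """Return True if path matches pattern, where * matches within one segment."""
    # Escape the literal parts and join them with "any characters except /".
    regex = "[^/]*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, path) is not None


paths = [
    "csv/taxi/year=2017/part-01.csv",
    "csv/taxi/year=2018/part-01.csv",
    "csv/taxi/readme.txt",
]
selected = [p for p in paths if matches("csv/taxi/year=*/*.csv", p)]
```

Here the first asterisk selects every `year=` folder and the second selects every `.csv` file inside each of them.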

Which metadata object provides the connection information to the files in a data lake store?

A DATA SOURCE provides the connection information to the files in a data lake store.

What command lets you create and save an external data source instead of accessing it by its full path?

create external data source, for example:
create external data source covid
with (
    location = 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases'
);

What function lets you read an external data source like CSV from SQL?

openrowset, for example:
select top 10 *
from openrowset(
    bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.csv',
    format = 'csv',
    parser_version = '2.0',
    firstrow = 2
) as rows

What command should be issued to view the list of active spark streams?

spark.streams.active
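
A short sketch of how that list might be used in a notebook. The helper names `describe_active_streams` and `stop_all_streams` are illustrative; `spark` is assumed to be an existing SparkSession, and each item in `spark.streams.active` is a streaming query exposing `id`, `name`, and `stop()`.

```python
def describe_active_streams(spark):
    """Return (id, name) for every streaming query that is currently active."""
    return [(query.id, query.name) for query in spark.streams.active]


def stop_all_streams(spark):
    """Stop every active streaming query, for example at the end of a lab."""
    for query in spark.streams.active:
        query.stop()
```

Stopping streams explicitly matters in shared clusters, because an active query keeps consuming compute until it is stopped.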

