DP900 - Explore data storage and processing in Azure

What is a pipeline in Azure Data Factory

A pipeline is a logical grouping of activities that performs a unit of work. Together, the activities in a pipeline perform a task. For example, a pipeline might contain a series of activities that ingests raw data from Azure Blob storage, and then runs a Hive query on an HDInsight cluster to partition the data and store the results in a Cosmos DB database.

What are Synapse pipelines in Synapse Analytics

A pipeline is a logical grouping of activities that together perform a task. For example, a pipeline could contain a set of activities that ingest and clean log data, and then kick off a mapping data flow to analyze the log data. The pipeline allows you to manage the activities as a set instead of each one individually. You deploy and schedule the pipeline instead of the activities independently.

Can data stored in Azure Synapse Analytics be used to build and train models with Azure Machine Learning?

Yes. Any data stored in Azure Synapse Analytics can be used to build and train models with Azure Machine Learning.

What is Apache Hive

Apache Hive provides interactive SQL-like facilities for querying, aggregating, and summarizing data. The data can come from many different sources. Queries are converted into tasks, and parallelized. Each task can run on a separate node in the HDInsight cluster, and the results are combined before being returned to the user.

What is Apache Kafka

Apache Kafka is a clustered streaming service that can ingest data in real time. It's a highly scalable solution that offers publish and subscribe features.

What is Apache Storm

Apache Storm is a scalable, fault tolerant platform for running real-time data processing applications. Storm can process high volumes of streaming data using comparatively modest computational requirements. Storm is designed for reliability, so that events shouldn't be lost. Storm solutions can also provide guaranteed processing of data, with the ability to replay data that wasn't successfully processed the first time. Storm can interoperate with a variety of event sources, including Azure Event Hubs, Azure IoT Hub, Apache Kafka, and RabbitMQ (a message queuing service). Storm can also write to data stores such as HDFS, Hive, HBase, Redis, and SQL databases. You write a Storm application using the APIs provided by Apache.

What is Azure Data Factory

Azure Data Factory is a service that can ingest large amounts of raw, unorganized data from relational and non-relational systems, and convert this data into meaningful information. Data Factory provides a scalable and programmable ingestion engine that you can use to implement complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects.

What is Azure Data Lake Analytics

Azure Data Lake Analytics is an on-demand analytics job service that you can use to process big data. It provides a framework and set of tools that you use to analyze data held in Microsoft Azure Data Lake Store, and other repositories. You write jobs that contain queries to transform data and extract insights.

What is Azure Data Lake Store compatible with

Azure Data Lake Store is compatible with the Hadoop Distributed File System (HDFS). You can run Hadoop jobs using Azure HDInsight that can read and write data in Data Lake Store efficiently.

What is an Access Control List in Azure Data Lake Store

Azure Data Lake Store provides granular security over data, using Access Control Lists. An Access Control List specifies which accounts can access which files and folders in the store. If you are more familiar with Linux, you can use POSIX-style permissions to grant read, write, and search access based on file ownership and group membership of users.

What is Azure Data Lake, and what three main elements does it comprise

Azure Data Lake is a collection of analytics and storage services that you can combine to implement a big data solution. It comprises three main elements: 1. Data Lake Store 2. Data Lake Analytics 3. HDInsight

What is Azure Databricks

Azure Databricks is an analytics platform optimized for the Microsoft Azure cloud services platform. Designed in collaboration with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.

What is Azure HDInsight

Azure HDInsight is a managed analytics service in the cloud. It's based on Apache Hadoop, a collection of open-source tools and utilities that enable you to run processing tasks over large amounts of data. HDInsight uses a clustered model, similar to that of Synapse Analytics. HDInsight stores data using Azure Data Lake storage. You can use HDInsight to analyze data using frameworks such as Hadoop Map/Reduce, Apache Spark, Apache Hive, Apache Kafka, Apache Storm, R, and more.

What is Azure Synapse Analytics

Azure Synapse Analytics is a generalized analytics service. You can use it to read data from many sources, process this data, generate various analyses and models, and save the results.

What is Azure Synapse Analytics

Azure Synapse Analytics is an integrated analytics service that allows organizations to gain insights quickly from all their data, at any scale, across both data warehouses and big data analytics systems.

Azure Synapse Analytics is designed to run queries on what size of datasets.

Azure Synapse Analytics is designed to run queries over massive datasets. You can manually scale the SQL pool up to 60 nodes.

what clustered architecture can Azure Synapse Analytics use

Azure Synapse Analytics uses a clustered architecture. Each cluster has a control node that is used as the entry point to the system. When you run Transact-SQL statements or start Spark jobs from a notebook, the request is sent to the control node.

What is Azure Synapse Link for Azure Cosmos DB

Azure Synapse Link for Azure Cosmos DB is a cloud-native hybrid transactional and analytical processing (HTAP) capability that enables you to run near real-time analytics over operational data stored in Azure Cosmos DB.

What are the components of Azure Synapse Analytics?

Azure Synapse is composed of the following elements:
1. Synapse SQL pool: a collection of servers running Transact-SQL, the dialect of SQL used by Azure SQL Database and Microsoft SQL Server. You write your data processing logic using Transact-SQL.
2. Synapse Spark pool: a cluster of servers running Apache Spark to process data. You write your data processing logic using one of the four supported languages: Python, Scala, SQL, and C# (via .NET for Apache Spark). Spark pools support Azure Machine Learning through integration with the SparkML and AzureML packages.
3. Synapse Pipelines: a logical grouping of activities that together perform a task. The activities in a pipeline define actions to perform on your data. For example, you might use a copy activity to transfer data from a source dataset to a destination dataset, include activities that transform the data as it is transferred, or combine data from multiple sources.
4. Synapse Link: a component that allows you to connect to Cosmos DB. You can use it to perform near real-time analytics over the operational data stored in a Cosmos DB database.
5. Synapse Studio: a web user interface that enables data engineers to access all the Synapse Analytics tools. You can use Synapse Studio to create SQL and Spark pools, define and run pipelines, and configure links to external data sources.

Can business analysts, data engineers, and data scientists use Synapse Spark pools or Synapse SQL pools

Yes. Business analysts, data engineers, and data scientists can use Synapse Spark pools or Synapse SQL pools to run near real-time business intelligence, analytics, and machine learning pipelines, without impacting the performance of transactional workloads on Azure Cosmos DB.

By default what is created in each Azure Synapse Analytics workspace

By default, an on-demand SQL pool is created in each Azure Synapse Analytics workspace. You can then create additional SQL pools, either on-demand or provisioned.

What is the Data Lake Store

Data Lake Store provides a file system that can store near limitless quantities of data. It uses a hierarchical organization (like the Windows and Linux file systems), but you can hold massive amounts of raw data (blobs) and structured data. It is optimized for analytics workloads.

Can Databricks process streaming data

Yes, Databricks can also process streaming data.

What data can Databricks process

Databricks can process data held in many different types of storage, including Azure Blob storage, Azure Data Lake Store, Hadoop storage, flat files, databases, and data warehouses.

What architecture does Databricks use

Databricks uses an extensible architecture based on drivers.

Hadoop Map/Reduce has largely been replaced by what

Hadoop Map/Reduce has largely been replaced by Spark, which offers a more advanced set of operations and a simpler interface.

Describe Hadoop Map/Reduce

Hadoop Map/Reduce uses a simple framework to split a task over a large dataset into a series of smaller tasks over subsets of the data that can be run in parallel, and the results then combined. You write your Map/Reduce code in a language such as Java, and then submit this code as a job to the Hadoop cluster.
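
As a rough illustration only (plain Python rather than Hadoop's Java API, and the word-count task is made up), the map/reduce pattern splits the input into chunks, processes each chunk independently in parallel, and then combines the partial results:

from collections import Counter
from multiprocessing import Pool

def map_chunk(lines):
    # "Map" step: count the words in one subset of the data.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def word_count(all_lines, workers=4):
    # Split the dataset into smaller chunks that can be processed in parallel.
    chunks = [all_lines[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(map_chunk, chunks)
    # "Reduce" step: combine the partial results into the final answer.
    return sum(partials, Counter())

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    print(word_count(lines))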

How long does resuming a SQL pool in Synapse Analytics take

You can't run any queries until the pool is resumed, and resuming a pool can take several minutes.

Describe how U-SQL code is processed in Azure Data Lake Analytics

It's important to understand that the U-SQL code only provides a description of the work to be performed. Azure Data Lake Analytics determines how best to actually carry out this work. Data Lake Analytics takes the U-SQL description of a job, parses it to make sure it is syntactically correct, and then compiles it into an internal representation. Data Lake Analytics then breaks down this internal representation into stages of execution. Each stage performs a task, such as extracting the data from a specified source, dividing the data into partitions, processing the data in each partition, aggregating the results in a partition, and then combining the results from across all partitions. Partitioning is used to improve parallelization, and the processing for different partitions is performed concurrently on different processing nodes. The data for each partition is determined by the U-SQL compiler, according to the way in which the job retrieves and processes the data.

Describe Spark Jobs

Like Map/Reduce jobs, Spark jobs are parallelized into a series of subtasks that run on the cluster. You can write Spark jobs as part of an application, or you can use interactive notebooks. These notebooks are the same as those that you can run from Azure Databricks. Spark includes libraries that you can use to read and write data in a wide variety of data stores (not just HDFS). For example, you can connect to relational databases such as Azure SQL Database, and other services such as Azure Cosmos DB.
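
A minimal PySpark sketch of such a job (the file paths and column names are hypothetical, echoing the stock-price example used elsewhere in this set): the read, the aggregation, and the write are all split into tasks that run in parallel on the cluster.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In a Synapse or Databricks notebook a SparkSession is normally provided as `spark`;
# creating one explicitly keeps the sketch self-contained.
spark = SparkSession.builder.appName("stock-aggregation").getOrCreate()

# Hypothetical input: a CSV of stock tickers and prices held in the data lake.
prices = spark.read.csv("/StockMarket/StockPrices.csv", header=True, inferSchema=True)

# Aggregate in parallel across the cluster's worker nodes.
max_prices = prices.groupBy("Ticker").agg(F.max("Price").alias("MaxPrice"))

# Write the combined results back to storage.
max_prices.write.mode("overwrite").csv("/output/MaxPrices", header=True)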

What services can read and write to the Data Lake Store directly

Services such as Azure Data Factory, Azure Databricks, Azure HDInsight, Azure Data Lake Analytics, and Azure Stream Analytics can read and write to the Data Lake Store directly.

What are Spark pools in Synapse Analytics

Spark pools enable you to process data held in many formats, such as csv, json, xml, parquet, orc, and avro. Spark can be extended to support many more formats with external data sources. Spark pools provide the basic building blocks for performing in-memory cluster computing. A Spark job can load and cache data into memory and query it repeatedly. In-memory computing is much faster than disk-based applications. Spark pools in Azure Synapse are compatible with Azure Storage and Azure Data Lake Storage, so you can use Spark pools to process your data stored in Azure. Spark pools can have autoscaling enabled, so that pools scale by adding or removing nodes as needed. Also, Spark pools can be shut down with no loss of data since all the data is stored in Azure Storage or Data Lake Storage.
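
As a hedged sketch (the storage account, container, and column names are invented), loading data into memory once and then querying it repeatedly from a Spark pool notebook might look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark pools can read several formats directly; Parquet is used here.
sales = spark.read.parquet("abfss://data@mydatalake.dfs.core.windows.net/sales/")

# cache() keeps the DataFrame in memory after its first use, so the
# repeated queries below avoid re-reading the files from storage.
sales.cache()

sales.filter(sales.Region == "EMEA").count()
sales.groupBy("Region").sum("Amount").show()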

How does Synapse Analytics use PolyBase

Synapse Analytics uses a technology called PolyBase to make external data look like SQL tables. You can run queries against these tables directly, or you can transfer the data into a series of SQL tables managed by Synapse Analytics for querying later. Synapse uses Azure Storage to manage your data while it's being processed.

You have a large amount of data held in files in Azure Data Lake storage. You want to retrieve the data in these files and use it to populate tables held in Azure Synapse Analytics. Which processing option is most appropriate?

Synapse SQL pool

Which of the components of Azure Synapse Analytics allows you to train AI models using AzureML

Synapse Spark

What are Spark clusters in Synapse Analytics

Synapse Spark runs clusters based on Apache Spark rather than Azure SQL Database. You write your analytics jobs as notebooks, using code written in Python, Scala, C#, or Spark SQL (this is a different dialect from Transact-SQL). You can combine code written in multiple languages in the same notebook.

What is Synapse Studio for Synapse Analytics

Synapse Studio is a web interface that enables you to create pools and pipelines interactively. With Synapse Studio you can develop, test, and debug Spark notebooks and Transact-SQL jobs. You can monitor the performance of operations that are currently running, and you can manage the serverless or provisioned resources. All of these capabilities are accessed via the web-native Synapse Studio that allows for model management, monitoring, coding, and security.

What is the Cosmos DB Analytical Store

Synapse Link uses a feature of Cosmos DB named the Cosmos DB analytical store. The analytical store contains a copy of the data in a Cosmos DB container, but organized as a column store. Column stores group data by column rather than by row, which makes them a more efficient format for analytical workloads that aggregate data down a column rather than across a row, such as generating sum totals, averages, or maximum and minimum values for a column. Cosmos DB automatically keeps the data in its containers synchronized with the copies in the column store.

What Data Integration engine do pipelines use

Synapse pipelines use the same Data Integration engine used by Azure Data Factory. This gives you the power in Synapse Studio to create pipelines that can connect to over 90 data sources, from flat files and databases to online services. You can create codeless data flows that let you perform complex mappings and transformations on data as it flows into your analytic solutions. For example, a pipeline might contain three activities: one that ingests data, a Spark notebook that generates a machine learning model, and an Azure Function at the end of the pipeline that tests the machine learning model to validate it.

What do the activities in a pipeline define

The activities in a pipeline define actions to perform on your data. For example, you may use a copy activity to copy data from Azure Blob Storage into Azure Synapse using a SQL pool. Then, use a data flow activity or a notebook activity using a Spark pool to process and generate a machine learning model.

What is the Data Movement Service in Azure Synapse Analytics

The control and compute nodes use the Data Movement Service (DMS) to move data across the nodes as necessary to run queries in parallel and return accurate results.

what is the control node in Azure Synapse Analytics

The control node runs a parallel processing engine that splits the operation into a set of tasks that can be run concurrently. Each task performs part of the workload over a subset of the source data. Each task is sent to a compute node to actually do the processing. The control node gathers the results from the compute nodes and combines them into an overall result.

In Azure Databricks how do you change the language a cell uses

The first line in the cell is a magic command of the form %language. For example, %scala switches the cell to Scala.

The Databricks processing engine is provided by what

The processing engine is provided by Apache Spark. Spark is a parallel-processing engine that supports large-scale analytics. You write application code that consumes data from one or more sources, merges, reformats, filters, and remodels this data, and then stores the results. Spark distributes the work across a cluster of computers, each of which can process its data in parallel with the others. This strategy helps to reduce the time required to perform the work. Spark is designed to handle massive quantities of data.
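
A brief PySpark sketch of that pattern (the mount points, table names, and columns are illustrative only): two sources are read, merged, filtered, reshaped, and the result is stored.

from pyspark.sql import functions as F

# `spark` is the session Databricks provides in every notebook.
orders = spark.read.json("/mnt/raw/orders/")            # one source
customers = spark.read.parquet("/mnt/raw/customers/")   # another source

# Merge, filter, and reshape the data in parallel across the cluster.
enriched = (orders.join(customers, on="CustomerId", how="inner")
                  .filter(F.col("Amount") > 0)
                  .withColumn("Year", F.year("OrderDate")))

# Store the remodeled result for later analysis.
enriched.write.mode("append").parquet("/mnt/curated/orders_enriched/")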

what is Transact-SQL

This is the same dialect of SQL used by Azure SQL Database, with some extensions for reading data from external sources, such as databases, files, and Azure Data Lake storage. You can use these extensions to load data quickly, generate aggregations and other analytics, create tables and views, and store information using these tables and views. You can use the results for later reporting and processing.

what is Spark

This is the same open-source technology used to power Azure Databricks. You write your analytical code using notebooks in a programming language such as C#, Scala, Python, or SQL. The Spark libraries provided with Azure Synapse Analytics enable you to read data from external sources, and also write out data in a variety of different formats if you need to save your results for further analysis.

what two technologies can process data for Azure Synapse Analytics

Transact-SQL and Spark

Using Azure Data Factory, you can create and schedule what

Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that can ingest data from the disparate data stores used by the gaming company. You can build complex ETL processes that transform data visually with data flows or by using compute services such as Azure HDInsight, Azure Databricks, and Azure SQL Database. You can then publish the transformed data to Azure Synapse Analytics for business intelligence applications to consume.

What are SQL pools in Azure Synapse Analytics

When you use Synapse SQL, your analytics workload runs using a SQL pool. In a SQL pool, the Control and Compute nodes in the cluster run a version of Azure SQL Database that supports distributed queries. You define your logic using Transact-SQL statements. You send your Transact-SQL statements to the control node, which splits up the work into queries that operate over a subset of the data, and then sends these smaller queries to the compute nodes. The data is split into chunks called distributions. A distribution is the basic unit of storage and processing for parallel queries that run on distributed data. Each of the smaller queries runs on one of the data distributions.
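
Conceptually (a plain-Python illustration, not how Synapse actually implements it), distributing rows by a hash key means each row is assigned to one of the pool's 60 distributions, so each smaller query works on only its own chunk and the control node combines the partial results:

NUM_DISTRIBUTIONS = 60  # a dedicated SQL pool spreads data across 60 distributions

def assign_distribution(row, key="CustomerId"):
    # Hash the distribution key to decide which chunk the row belongs to.
    return hash(row[key]) % NUM_DISTRIBUTIONS

rows = [{"CustomerId": 1, "Amount": 10},
        {"CustomerId": 2, "Amount": 25},
        {"CustomerId": 1, "Amount": 5}]

chunks = {}
for row in rows:
    chunks.setdefault(assign_distribution(row), []).append(row)

# Each distribution is aggregated independently (on the compute nodes),
# then the partial results are combined into the overall answer.
partials = [sum(r["Amount"] for r in chunk) for chunk in chunks.values()]
print(sum(partials))  # 40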

Can you pause a SQL pool Synapse Analytics

You can also pause a SQL pool if you don't require it for a while. Pausing releases the resources associated with the pool. You aren't charged for these resources until you manually resume the pool.

What languages can you write Spark code in

You can write the Spark application code using several languages, such as Python, R, Scala, Java, and SQL. Spark has a number of libraries for these languages, providing complex analytical routines that have been optimized for the clustered environment. These libraries include modules for machine learning, statistical analysis, linear and non-linear modeling, predictive analytics, and graphics.

How do you define a job in Azure Data Lake Analytics

You define a job using a language called U-SQL. This is a hybrid language that takes features from both SQL and C#, and provides declarative and procedural capabilities that you can use to process data.

What is a Databricks notebook

You write Databricks applications using a Notebook. A notebook contains a series of steps (cells), each of which contains a block of code. For example, one cell might contain the code that connects to a data source, the next cell reads the data from that source and converts it into a model in-memory, the next cell plots a graph, and a final cell saves the data from the in-memory model to a repository. The first line in the cell is %language. For example, %scala.
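
As a rough sketch of that cell structure (the data source, paths, and layout are illustrative), a Databricks Python notebook might contain cells like these, with one cell switching language via a magic command:

# Cell 1: connect to a data source and read it into a DataFrame.
df = spark.read.csv("/mnt/raw/sensor_readings.csv", header=True, inferSchema=True)

# Cell 2: convert the data into an in-memory model (here, a simple aggregation).
daily = df.groupBy("DeviceId").avg("Temperature")

# Cell 3: plot a graph of the results (display is a Databricks notebook helper).
display(daily)

# Cell 4: a cell written in another language, selected with the %scala magic.
# %scala
# println("This cell runs Scala")

# Cell 5: save the data from the in-memory model to a repository.
daily.write.mode("overwrite").parquet("/mnt/curated/daily_averages/")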

What 2 reasons do you Use SQL pools in Synapse Analytics

1. Complex reporting. You can use the full power of Transact-SQL to run complex SQL statements that summarize and aggregate data. 2. Data ingestion. PolyBase enables you to retrieve data from many external sources and convert it into a tabular format. You can reformat this data and save it as tables and materialized views in Azure Synapse.

Spark pools in Synapse Analytics are especially suitable for what two scenarios

1. Data Engineering/Data Preparation. Apache Spark includes many language features to support preparation and processing of large volumes of data so that it can be made more valuable and then consumed by other services within Synapse Analytics. This is enabled through the Spark libraries that support processing and connectivity. 2. Machine Learning. Apache Spark comes with MLlib, a machine learning library built on top of Spark that you can use from a Spark pool in Synapse Analytics. Spark pools in Synapse Analytics also include Anaconda, a Python distribution with a variety of packages for data science including machine learning. When combined with built-in support for notebooks, you have an environment for creating machine learning applications.
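
A small MLlib sketch (the feature names and values are made up) showing how a model can be trained directly in a Spark pool notebook:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.getOrCreate()

# Tiny invented training set; in practice this would be read from the data lake.
df = spark.createDataFrame(
    [(1.0, 2.0, 5.0), (2.0, 1.0, 6.0), (3.0, 4.0, 11.0)],
    ["Feature1", "Feature2", "Label"])

# Assemble the feature columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=["Feature1", "Feature2"], outputCol="features")
training = assembler.transform(df)

# Train a simple regression model on the Spark pool.
model = LinearRegression(labelCol="Label", featuresCol="features").fit(training)
print(model.coefficients)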

Synapse Link has a wide range of uses, including what

1. Supply chain analytics and forecasting. You can query operational data directly and use it to build machine learning models, then feed the results generated by these models back into Cosmos DB for near-real-time scoring. You can use these assessments to successively refine the models and generate more accurate forecasts.
2. Operational reporting. You can use Synapse Analytics to query operational data using Transact-SQL running in a SQL pool. You can publish the results to dashboards using the support provided for familiar tools such as Microsoft Power BI.
3. Batch data integration and orchestration. With supply chains getting more complex, supply chain data platforms need to integrate with a variety of data sources and formats. The Azure Synapse data integration engine allows data engineers to create rich data pipelines without requiring a separate orchestration engine.
4. Real-time personalization. You can build engaging e-commerce solutions that allow retailers to generate personalized recommendations and special offers for customers in real time.
5. IoT maintenance. Industrial IoT innovations have drastically reduced downtimes of machinery and increased overall efficiency across all fields of industry. One such innovation is predictive maintenance analytics for machinery at the edge of the cloud. The historical operational data from IoT device sensors can be used to train predictive models such as anomaly detectors, which are then deployed back to the edge for real-time monitoring. Looping back allows for continuous retraining of the predictive models.

The example U-SQL block below reads data from a file named StockPrices.csv, which is held in a folder named StockMarket in Data Lake Storage. This is a text file that contains stock market information (tickers, prices, and possibly other data) held in comma-separated format. The EXTRACT statement reads the file line by line and pulls out the data in the Ticker and Price fields (it skips the first line, where a CSV file typically holds field name information rather than data). The SELECT statement calculates the maximum price for each ticker. The OUTPUT statement stores the results to another file in Data Lake Storage.

@priceData =
    EXTRACT Ticker string,
            Price int
    FROM "/StockMarket/StockPrices.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

@maxPrices =
    SELECT Ticker, MAX(Price) AS MaxPrice
    FROM @priceData
    GROUP BY Ticker;

OUTPUT @maxPrices
    TO "/output/MaxPrices.csv"
    USING Outputters.Csv(outputHeader: true);

A U-SQL job can output results to what

A U-SQL job can output results to a single CSV file, partition the results across multiple files, or can write to other destinations. For example, Data Lake Analytics enables you to create custom outputters if you want to save data in a particular format (such as XML or HTML). You can also write data to the Data Lake Catalog. The catalog provides a SQL-like interface to Data Lake Storage, enabling you to create tables, and views, and run INSERT, UPDATE, and DELETE statements against these tables and views.

What is a driver

A driver is a piece of code that connects to a specific data source and enables you to read and write that source. A driver is typically provided as part of a library that you can load into the Databricks environment. Drivers are available for many Azure services, including Azure SQL Database, Azure Cosmos DB, Azure Blob storage, and Azure Data Lake storage, as well as many services and databases produced by third-parties, such as MySQL and PostgreSQL.
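
For example, reading a table from Azure SQL Database into a DataFrame through the JDBC driver might look like this (a minimal sketch; the server, database, table, and credentials are placeholders):

# `spark` is the session provided by the Databricks notebook.
jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"

df = (spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", "dbo.SalesOrders")
        .option("user", "my_user")
        .option("password", "my_password")
        .load())

df.show(5)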

