Azure Data Engineering

to specify the location of a checkpoint directory when defining a Delta Lake streaming query

.writeStream.format("delta").option("checkpointLocation", checkpointPath)
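A minimal PySpark sketch of where that option sits in a complete streaming write. The function name `start_delta_stream`, the `outputMode("append")` choice, and both path parameters are assumptions for illustration; `events_df` is assumed to be a streaming DataFrame created elsewhere with `spark.readStream`.

```python
def start_delta_stream(events_df, output_path, checkpoint_path):
    """Start a streaming write to a Delta table and return the query handle.

    events_df is assumed to be a streaming DataFrame (from spark.readStream).
    """
    return (
        events_df.writeStream
        .format("delta")
        .outputMode("append")  # assumption: append-only sink
        .option("checkpointLocation", checkpoint_path)
        .start(output_path)
    )
```

Each streaming query would normally get its own checkpoint directory, so that its progress can be recovered after a restart.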

Azure Tables

A NoSQL store for schema-less storage of structured data

Azure SQL Managed Instance

A PaaS deployment option providing near-100% compatibility with the latest version of SQL Server. This allows customers to lift and shift on-premises apps to the cloud with minimal changes.

Azure SQL Database

A PaaS managed relational database with capabilities similar to SQL Server

Azure Data Factory

A cloud integration service that orchestrates the movement of data between various data stores. Using this, you can create and schedule data-driven workflows (pipelines) that ingest data from disparate data stores. Through this, raw data can be organized into meaningful data stores and data lakes for better business decisions.

Apache Spark notebook

A collection of cells. These cells are run to execute code, to render formatted text, or to display graphical visualizations.

Azure Data Catalog

A fully managed cloud service that helps document information about an organization's data sources. With this, users can discover, understand, and consume data sources. Includes a crowdsourcing model of metadata and annotations.

Azure Cosmos DB

A globally distributed, low-latency, multi-model database. Can leverage these API models: SQL API, MongoDB API, Cassandra API, Gremlin API, Table API

Azure Blobs

A massively scalable object store for text and binary data

Azure Queues

A messaging store for reliable messaging between application components.

Apache Spark

A unified processing engine that can analyze big data using SQL, machine learning, graph processing, or real-time stream analysis

Activities in a pipeline

Actions that you perform on your data. It can take zero or more input datasets and produce one or more output datasets.

Supported Connectors for built-in parameterization

Amazon Redshift, Azure Cosmos DB (SQL API), Azure Database for MySQL, Azure SQL Database, Azure Synapse Analytics, MySQL, Oracle, SQL Server, Generic HTTP, Generic REST

Custom State Passing

The output or state of an activity that needs to be consumed by a subsequent activity in the pipeline.

Event Consumer

An application that consumes the data and takes specific action based on the insights. Examples include alert generation, dashboards, or sending data to another event processing engine.

Event Processor

An engine that consumes event data streams and derives insights from them. Depending on the problem space, event processors either process one incoming event at a time (such as a heart rate monitor) or multiple events at a time (such as a highway toll lane sensor).

Azure SQL Data Warehouse (Azure Synapse Analytics)

A cloud-based, enterprise-class MPP (massively parallel processing) data warehouse designed to process massive amounts of data.

Azure Data Factory Control Flow

An orchestration of pipeline activities that includes chaining activities in a sequence, branching, defining parameters at the pipeline level, and passing arguments while invoking the pipeline on demand or from a trigger. Can also include looping containers, that can pass information for each iteration of the looping container.

Azure Databricks Key Features

Apache Spark-based analytics platform. Enterprise security. Integration with other Cloud Services.

Azure SQL Data Warehouse - when to use

As data loads increase organizations may have to find a way to reduce the processing time for business intelligence workloads including reporting. This is the PaaS solution that is optimized to process high volumes of data.

What a cluster driver does

Assign units of work to Slots for parallel execution; decide how to partition the data so that it can be distributed for parallel processing, assign a Partition of data to each task

Azure Storage Accounts

Azure Blobs, Azure Files, Azure Queues, and Azure Tables

Chaining activities

Azure Data Factory control flow that chains activities in a sequence or a pipeline. It is possible to use the dependsOn property in an activity definition to chain it with an upstream activity.
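As a rough illustration of how `dependsOn` chains activities, the sketch below models a pipeline definition as a Python dict and derives a run order from it. The pipeline and activity names are hypothetical, and `execution_order` is only a helper written for this example, not part of any Azure SDK.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline definition, shaped like the pipeline JSON.
pipeline = {
    "name": "ExamplePipeline",
    "activities": [
        {"name": "CopyRaw", "type": "Copy", "dependsOn": []},
        {
            "name": "TransformRaw",
            "type": "DatabricksNotebook",
            "dependsOn": [
                {"activity": "CopyRaw", "dependencyConditions": ["Succeeded"]}
            ],
        },
        {
            "name": "LoadWarehouse",
            "type": "Copy",
            "dependsOn": [
                {"activity": "TransformRaw", "dependencyConditions": ["Succeeded"]}
            ],
        },
    ],
}

def execution_order(pipeline_definition):
    """Return activity names in an order that respects every dependsOn entry."""
    graph = {
        activity["name"]: {dep["activity"] for dep in activity["dependsOn"]}
        for activity in pipeline_definition["activities"]
    }
    return list(TopologicalSorter(graph).static_order())

print(execution_order(pipeline))
```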

Where to parameterize linked service in Azure Data Factory

Azure Data Factory user interface, the Azure portal, or a programming interface of your preference

Secrets

Azure Key Vault object that stores storage account key information

SQL Database options in Azure

Azure SQL Database, Azure SQL Managed Instance, SQL Server

Stages of Delta Lake Architecture

Bronze, Silver, Gold

Common control flow activities

Chaining activities, Branching activities, Parameters, Custom state passing, Trigger-based flows, Delta flows

supported Databricks notebook format

DBC file type

Workload Management

Data Loading feature that limits the number of resources a group of requests can consume in Azure Synapse Analytics

Azure Data Factory Activity Categories

Data movement, Data transformation, Control activities.

What Databricks offers that is not Open-source Spark

Databricks Workspace, workflows, Runtime, I/O (DBIO), Serverless, Enterprise Security (DBES)

Azure Analysis Services

Enterprise-grade tabular data models in the cloud

Areas of Data Flow

Event Production -> Event Queuing and Stream Ingestion -> Stream Analytics -> Storage, Presentation, and Action
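The flow can be sketched with plain Python generators standing in for each stage. This is only an analogy: the sample readings, the three-event rolling window, and the 120 bpm threshold are made-up assumptions, and the queuing stage is reduced to the generator hand-off.

```python
from collections import deque

def produce_heart_rates():
    """Event producer: yields (second, beats_per_minute) readings."""
    readings = [72, 75, 130, 135, 140, 80]  # made-up sample data
    for second, bpm in enumerate(readings):
        yield second, bpm

def process(events, window_size=3):
    """Event processor: rolling average over the last window_size events."""
    window = deque(maxlen=window_size)
    for second, bpm in events:
        window.append(bpm)
        yield second, sum(window) / len(window)

def consume(averages, threshold=120):
    """Event consumer: collect an alert whenever the average exceeds threshold."""
    return [(second, round(avg, 1)) for second, avg in averages if avg > threshold]

alerts = consume(process(produce_heart_rates()))
print(alerts)
```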

Azure Stream Analytics

Event processing engine that enables the consumption and analysis of high volumes of streaming data in real time

Azure Stream Analytics

Event-processing engine that allows you to examine high volumes of data streaming from devices. Incoming data can be from devices, sensors, web, social media, apps and more.

Data Lake Storage Gen2 Benefits

Hadoop compatible access; Security; Performance; Data redundancy

Azure Data Lake Store (Gen 1 and Gen 2)

Hadoop-compatible data repository that can store unlimited NoSQL data. Designed for customers who need to store massive amounts of data with high compute capabilities for big data analytics.

Data Streams

In the context of analytics, these are event data generated by sensors or other sources that can be analyzed by another technology

Azure Data Lake Store Gen 2 additional feature

Includes a new BLOB based hierarchical file system. Gen 2 acts as a storage layer for a wide range of compute platforms including Azure Databricks, Hadoop, or Azure HDInsight.

IaaS

Infrastructure as a Service, includes Server and Storage, Networking, Firewall, Security, and Datacenter physical plant or building

4 stages for processing big data solutions

Ingestion, Store, Prep and train, Model and serve

IR

Integration Runtime

command to view the list of active streams

Invoke spark.streams.active
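A short sketch of reading that list in PySpark. The helper name `describe_active_streams` is an assumption for illustration; `spark` is assumed to be an existing `SparkSession`, and `id`, `name`, and `isActive` are attributes of each streaming query it returns.

```python
def describe_active_streams(spark):
    """Return (id, name, isActive) for every active streaming query."""
    return [
        (str(query.id), query.name, query.isActive)
        for query in spark.streams.active
    ]
```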

Azure Databricks

Is a "Spark as a Service". Is a fully-managed version of the open source Apache Spark analytics and data processing engine. Brings an enterprise-grade and secure cloud-based Big Data and Machine Learning platform as a premium offering on Azure

Type of process for Spark driver and executors

Java Processes

Azure Files

Managed file shares for cloud or on-premises deployments. Extends on-premises file shares to the cloud.

Azure SQL Data Warehouse Key Features

Massively Parallel Processing (MPP). Ability to handle petabytes of data quickly. Compute and storage are separate, so the compute nodes can be scaled independently. Has the capability of pausing and resuming the compute layer. You only pay for the compute that you use.

Number of drivers a Spark Cluster has

One

PaaS

Platform as a Service, includes IaaS and Development and Tools, Data Transformation and Movement, and Databases and Analytics

Serving Layer

Power BI and Azure Analysis Services

Azure HDInsight

Provides technologies for ingesting, processing, and analyzing big data to support batch processing, data warehousing, IoT, and Data Science. Low-cost cloud solution containing several technologies including Apache Hadoop, Apache Spark, Apache Kafka, Apache HBase, Interactive Query, and Apache Storm - all in one approach.

SQL Server on IaaS

Provides the option of a fully functioning version of SQL Server in the cloud.

Power BI

Query Batch & real-time views, merge results

Apache Spark notebook uses

Read and process huge files and data sets; Query, explore, and visualize data sets; Join disparate data sets found in data lakes; Train and evaluate machine learning models; Process live streams of data; Perform analysis on large graph data sets and social networks

What does an Azure-SSIS IR support

Running packages that are or to be deployed into SSIS catalog (SSISDB) hosted by your Azure SQL Database server or Managed Instance in Project Deployment Model; Running packages that are or to be deployed into file system, Azure Files, or SQL Server database (MSDB) hosted by your Azure SQL Managed Instance in Package Deployment Model

SSIS

SQL Server Integration Services

Types of Cloud services

SaaS, PaaS, and IaaS

Languages supported by Apache Spark

Scala (primary language), Python (PySpark), R, Java, SQL

Event Producer

Sensors or processes that generate data continuously such as a heart rate monitor or a highway toll lane sensor

SaaS

Software as a Service, includes IaaS, PaaS, and Hosted Applications/Apps

Dependency Conditions of an Activity

Succeeded, Failed, Skipped, Completed
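One way to read these four conditions is as a check on the upstream activity's final status. The sketch below is a simplified model, not Data Factory's implementation: the function name `should_run` is hypothetical, "Completed" is taken to mean the upstream activity ran to the end whether it succeeded or failed, and several conditions on one dependency are assumed to combine as a logical OR.

```python
def should_run(upstream_status, dependency_conditions):
    """Decide whether a downstream activity runs.

    upstream_status: "Succeeded", "Failed", or "Skipped".
    dependency_conditions: list such as ["Succeeded"] or ["Completed"].
    """
    for condition in dependency_conditions:
        if condition == "Completed":
            # Assumption: Completed matches either a success or a failure.
            if upstream_status in ("Succeeded", "Failed"):
                return True
        elif condition == upstream_status:
            return True
    return False

print(should_run("Failed", ["Completed"]))
```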

Azure Cosmos DB Key Features

Supports up to 99.999% uptime. Automatic failover if there is a regional disaster. Multi-master replication. Low latency: guaranteed to achieve less than 10 ms response time for reads and writes. Consistency levels addressing global-scale needs: strong, bounded staleness, session, consistent prefix, and eventual.

The 2 levels where work occurs in parallel in Spark

The Executor and the Slot

IF Condition Activity

The control activity that can be used to branch an activity on the condition that evaluates True or False

Distribution = Replicate

The distribution option you would use for a product dimension table that will contain 1,000 static records in Synapse Analytics

Debug

The feature that enables you to interact with the Mapping Data Flow transformations that you create

Azure Databricks - When to use

The intention of this platform is to provide the ease of deploying a collaborative Machine Learning environment based on Spark that can be used between data scientists, data engineers, and business analysts.

Azure Data Lake

The technology that is typically used as a staging area in a modern data warehousing architecture

Branching activities

This Azure Data Factory control flow activity evaluates a set of activities, and when the condition evaluates to true, a set of activities are executed. When it evaluates to false, then an alternative set of activities is executed.
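A small sketch of the branching idea, with an If Condition activity modelled as a Python dict. The activity names and the `rowCount` parameter are hypothetical, and `activities_to_run` is a helper for this example only; it takes the already-evaluated condition as a boolean rather than evaluating the expression string.

```python
# Hypothetical If Condition activity, shaped like the pipeline JSON.
if_condition_activity = {
    "name": "CheckRowCount",
    "type": "IfCondition",
    "typeProperties": {
        "expression": {
            "value": "@greater(pipeline().parameters.rowCount, 0)",
            "type": "Expression",
        },
        "ifTrueActivities": [{"name": "LoadRows", "type": "Copy"}],
        "ifFalseActivities": [{"name": "SendEmptyFileAlert", "type": "WebActivity"}],
    },
}

def activities_to_run(activity, condition_result):
    """Pick the branch that matches the evaluated condition."""
    branch = "ifTrueActivities" if condition_result else "ifFalseActivities"
    return [item["name"] for item in activity["typeProperties"].get(branch, [])]

print(activities_to_run(if_condition_activity, True))
```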

Clustered Index

This index helps performance when a large fact table is stored as a heap and your queries aggregate values from ~100M rows but return only 2 rows

Azure HDInsight - when to use

This is a strong option when you need a high throughput data store for NoSQL data.

Cluster Slot

A unit of parallelism on an Executor to which the Driver can assign a parallelized Task. The number of slots is determined by the number of cores and CPUs of each node
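The driver, partition, and slot relationship can be imitated on one machine with a thread pool. This is an analogy only, not Spark code: the threads play the role of slots, and `partition`, `task`, and `run_job` are hypothetical names for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, num_partitions):
    """Driver step: split the data into roughly equal partitions."""
    return [data[i::num_partitions] for i in range(num_partitions)]

def task(part):
    """Unit of work that one slot performs on one partition."""
    return sum(x * x for x in part)

def run_job(data, num_slots=4):
    """Assign one partition to each slot, then combine the partial results."""
    partitions = partition(data, num_slots)
    with ThreadPoolExecutor(max_workers=num_slots) as slots:
        partial_results = list(slots.map(task, partitions))
    return sum(partial_results)

total = run_job(list(range(10)))
print(total)
```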

Azure Data Lake Store Key Features

Unlimited scalability, Hadoop compatibility, Access Control Lists (ACLs) / POSIX-compliance, Optimized ABFS driver designed for big data analytics, Zone redundant storage / Geo-redundant storage

Azure Stream Analytics - when to use

Use it if your organization needs to respond to data events in real time, or needs to analyze large batches of data in a continuous, time-bound stream.

Azure Cosmos DB - When to use

Use this when you need a NoSQL database with one of the supported API models, global scale, and low-latency performance.

Azure Data Catalog - when to use

Use this when you require a multiuser approach to document your data stores and provide a solution that can help business users better understand their data.

OpenRowSet

What allows you to query a file

Type 2 SCD

What to use when your SQL Pool needs to be able to return an employee record from a given point of time, maintains latest employee information, & min query complexity
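A minimal sketch of the Type 2 pattern, using the standard-library `sqlite3` module in place of a SQL pool. The table name, the columns (`valid_from`, `valid_to`, `is_current`), and the sample employee are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_employee (
        employee_id INTEGER,
        department  TEXT,
        valid_from  TEXT,
        valid_to    TEXT,     -- NULL while the row is the current version
        is_current  INTEGER
    )
""")

def apply_change(employee_id, department, change_date):
    """Close the current row (if any) and insert a new current row."""
    conn.execute(
        "UPDATE dim_employee SET valid_to = ?, is_current = 0 "
        "WHERE employee_id = ? AND is_current = 1",
        (change_date, employee_id),
    )
    conn.execute(
        "INSERT INTO dim_employee VALUES (?, ?, ?, NULL, 1)",
        (employee_id, department, change_date),
    )

def department_as_of(employee_id, as_of_date):
    """Return the department that was valid on the given ISO date."""
    row = conn.execute(
        "SELECT department FROM dim_employee "
        "WHERE employee_id = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?)",
        (employee_id, as_of_date, as_of_date),
    ).fetchone()
    return row[0] if row else None

apply_change(1, "Sales", "2020-01-01")
apply_change(1, "Finance", "2022-06-01")
print(department_as_of(1, "2021-03-15"))
```

Every change adds a row instead of overwriting one, so point-in-time lookups stay a single filtered query, and the latest record is the row with `is_current = 1`.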

Heap index

When you are temporarily landing data in Dedicated SQL Pool, you should use this to make the overall process faster

Cluster Executor

a Java virtual machine running on a node, typically, one instance per node

lambda architecture

a big data processing architecture that addresses the problem of large data set latency by combining both batch- and real-time processing methods
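A toy sketch of the idea: a query merges a precomputed batch view with a small real-time view holding events the batch layer has not processed yet. The page names and counts are made up, and real batch and speed layers would be separate systems rather than two dicts.

```python
# Precomputed by the batch layer (complete, but stale by some hours).
batch_view = {"page_a": 1000, "page_b": 250}

# Maintained by the speed layer (only events newer than the last batch run).
speed_view = {"page_a": 12, "page_c": 3}

def query(page):
    """Serving step: merge batch and real-time counts for one page."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(query("page_a"))
```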

Databricks Delta

a data-management system that's fast, reliable, and able to handle large volumes of data in different raw formats; provides the best of data lake, data warehousing, and streaming data-ingestion systems

Bronze tables

contain raw data ingested from various sources (such as JSON files, RDBMS data, or IoT data); the first stage of the Delta Lake architecture, before the data is refined into Silver tables

Delta Lake architecture

a vast improvement upon the traditional Lambda architecture. At each stage, we enrich our data through a unified pipeline that allows us to combine batch and streaming workflows through a shared filestore with ACID-compliant transactions

Data Lake Storage Gen2

combines a file system with a storage platform to help you quickly identify insights into your data; builds on Azure Blob storage capabilities to optimize it specifically for analytics workloads

A use-case for parameterizing a linked service in which you can pass through dynamic values during run time

connecting to several different databases that are on the same SQL server
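A sketch of what that use-case might look like, with the linked service modelled as a Python dict. The server name `myserver`, the parameter name `DBName`, and the helper `resolve_connection_string` are hypothetical; the helper only imitates the run-time substitution with a string replace.

```python
# Hypothetical parameterized linked service, shaped like the JSON definition.
linked_service = {
    "name": "AzureSqlDatabase",
    "properties": {
        "type": "AzureSqlDatabase",
        "parameters": {"DBName": {"type": "String"}},
        "typeProperties": {
            "connectionString": (
                "Server=tcp:myserver.database.windows.net,1433;"
                "Database=@{linkedService().DBName};"
            )
        },
    },
}

def resolve_connection_string(service, **parameter_values):
    """Substitute each declared parameter into the connection string."""
    properties = service["properties"]
    text = properties["typeProperties"]["connectionString"]
    for name in properties["parameters"]:
        text = text.replace("@{linkedService()." + name + "}", parameter_values[name])
    return text

print(resolve_connection_string(linked_service, DBName="SalesDb"))
```

One definition can then serve every database on the server, with the database name supplied per run.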

Job

each parallelized action; result of which is returned to the Driver; broken down into Stages

Apache Spark clusters

groups of computers that are treated as a single computer and handle the execution of commands issued from notebooks

Ingestion (stage of processing big data)

identifies the technology and processes that are used to acquire the source data. This data can come from files, logs, and other types of unstructured data that must be put into the Data Lake Store. The technology that is used will vary depending on the frequency that the data is transferred (Azure Data Factory for batch, Apache Kafka or Stream Analytics for real-time)

data lake

is a repository of data that is stored in its natural format, usually as blobs or files

Cluster Driver

is the JVM in which our application runs. There's only one in each cluster.

Apache Spark clusters architecture

master-worker type architecture, allowing processing of data to be parallelized across many computers to improve scale and performance. They consist of a Spark Driver (master) and worker nodes. The driver node sends work to the worker nodes and instructs them to pull data from a specified data source.

Silver tables

provide a more refined view of our data. We can join fields from various bronze tables to enrich streaming records, or update account statuses based on recent activity

Gold tables

provide business level aggregates often used for reporting and dashboarding. This would include aggregations such as daily active website users, weekly sales per store, or gross revenue per quarter by department.
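The bronze, silver, and gold stages can be sketched in plain Python, with lists of dicts standing in for Delta tables. The sample rows, the "drop rows whose amount does not parse" cleaning rule, and the per-day total are assumptions chosen for illustration.

```python
from collections import defaultdict

# Bronze: raw records exactly as ingested (amounts are still strings).
bronze = [
    {"user": "u1", "day": "2024-01-01", "amount": "10.5"},
    {"user": "u2", "day": "2024-01-01", "amount": "bad"},
    {"user": "u1", "day": "2024-01-02", "amount": "4.5"},
    {"user": "u3", "day": "2024-01-01", "amount": "5.0"},
]

def to_silver(rows):
    """Refine: keep rows whose amount parses, and cast it to float."""
    silver = []
    for row in rows:
        try:
            silver.append({**row, "amount": float(row["amount"])})
        except ValueError:
            continue  # drop malformed records
    return silver

def to_gold(rows):
    """Aggregate: business-level total per day."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["day"]] += row["amount"]
    return dict(totals)

gold = to_gold(to_silver(bronze))
print(gold)
```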

Data Lake Storage

provides a repository where you can upload and store huge amounts of unstructured data with an eye toward high-performance big data analytics; comprehensive, scalable, and cost-effective data lake solution for big data analytics built into Azure

A pipeline in Azure Data Factory

represents a logical grouping of activities where the activities together perform a certain task.

Hierarchical Namespace = Disabled

to set up the storage account as an Azure Blob storage account, storing data without performing analysis on the data

Hierarchical Namespace = Enabled

to set up the storage account as an Azure Data Lake Storage Gen2 account and perform analytics on the data

A use-case for setting global parameters

when you have multiple pipelines where the parameter names and values are identical

