Azure Data Engineering
to specify the location of a checkpoint directory when defining a Delta Lake streaming query
.writeStream.format("delta").option("checkpointLocation", checkpointPath)
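A minimal PySpark sketch of a full streaming write that uses this option (df is assumed to be a streaming DataFrame; the paths and query name are illustrative, not from the source):

    checkpointPath = "/delta/events/_checkpoints/etl"
    query = (df.writeStream
               .format("delta")
               .option("checkpointLocation", checkpointPath)  # where stream progress is tracked
               .outputMode("append")
               .start("/delta/events"))                       # target Delta table path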
Azure Tables
A NoSQL store for schema-less storage of structured data
Azure SQL Managed Instance
A PaaS deployment option providing near 100% compatibility with the latest version of SQL Server. This allows customers to lift and shift on-premises apps to the cloud with minimal changes.
Azure SQL Database
A PaaS managed relational database with capabilities similar to SQL Server
Azure Data Factory
A cloud integration service that orchestrates the movement of data between various data stores. Using this you can create and schedule data-driven workflows (pipelines) that can ingest data from disparate data stores. Through this, raw data can be organized into meaningful data stores and data lakes for better business decisions
Apache Spark notebook
A collection of cells. These cells are run to execute code, to render formatted text, or to display graphical visualizations.
Azure Data Catalog
A fully managed cloud service. Can be used to help document information about an organization's data sources. With this, users can discover, understand, and consume data sources. Includes a crowdsourcing model of metadata and annotations.
Azure Cosmos DB
A globally distributed, low-latency, multi-model database. Can leverage these API models: SQL API, MongoDB API, Cassandra API, Gremlin API, Table API
Azure Blobs
A massively scalable object store for text and binary data
Azure Queues
A messaging store for reliable messaging between application components.
Apache Spark
A unified processing engine that can analyze big data using SQL, machine learning, graph processing, or real-time stream analysis
Activities in a pipeline
Actions that you perform on your data. An activity can take zero or more input datasets and produce one or more output datasets.
Supported Connectors for built-in parameterization
Amazon Redshift, Azure Cosmos DB (SQL API), Azure Database for MySQL, Azure SQL Database, Azure Synapse Analytics, MySQL, Oracle, SQL Server, Generic HTTP, Generic REST
Custom State Passing
Used when an activity's output, or its state, needs to be consumed by a subsequent activity in the pipeline.
Event Consumer
An application that consumes the data and takes specific action based on the insights. Examples include alert generation, dashboards, or sending the data to another event processing engine
Event Processor
An engine that consumes event data streams and derives insights from them. Depending on the problem space, an event processor either processes one incoming event at a time (such as a heart rate monitor) or multiple events at a time (such as a highway toll lane sensor)
Azure SQL Data Warehouse (Azure Synapse Analytics)
An enterprise-class, cloud-based, massively parallel processing (MPP) data warehouse designed to process massive amounts of data.
Azure Data Factory Control Flow
An orchestration of pipeline activities that includes chaining activities in a sequence, branching, defining parameters at the pipeline level, and passing arguments while invoking the pipeline on demand or from a trigger. Can also include looping containers that pass information for each iteration of the loop.
Azure Databricks Key Features
Apache Spark-based analytics platform. Enterprise security. Integration with other Cloud Services.
Azure SQL Data Warehouse - when to use
As data loads increase, organizations may need to find a way to reduce the processing time for business intelligence workloads, including reporting. This is the PaaS solution that is optimized to process high volumes of data.
What a cluster driver does
Assigns units of work to slots for parallel execution; decides how to partition the data so that it can be distributed for parallel processing; assigns a partition of data to each task
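A small PySpark sketch of this behavior, assuming a Databricks notebook where spark is the pre-created SparkSession and the path is illustrative:

    df = spark.read.parquet("/mnt/data/sales")  # illustrative path
    print(df.rdd.getNumPartitions())            # how many partitions (and therefore tasks) exist
    df = df.repartition(16)                     # have the driver redistribute the data into 16 partitions
    df.count()                                  # an action: the driver schedules 16 tasks across executor slots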
Azure Storage Accounts
Azure Blobs, Azure Files, Azure Queues, and Azure Tables
Chaining activities
Azure Data Factory control flow that chains activities in a sequence within a pipeline. You can use the dependsOn property in an activity definition to chain it with an upstream activity.
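A sketch of how the dependsOn property looks in an activity definition, shown here as a Python dict mirroring the pipeline JSON (activity names are illustrative assumptions):

    notify_activity = {
        "name": "NotifyOnSuccess",
        "type": "WebActivity",
        "dependsOn": [
            {
                "activity": "CopySalesData",           # the upstream activity to chain from
                "dependencyConditions": ["Succeeded"]  # Succeeded / Failed / Skipped / Completed
            }
        ]
    }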
Where to parameterize linked service in Azure Data Factory
Azure Data Factory user interface, the Azure portal, or a programming interface of your preference
Secrets
Azure Key Vault object that stores storage account key information
SQL Database options in Azure
Azure SQL Database, Azure SQL Managed Instance, SQL Server
Stages of Delta Lake Architecture
Bronze, Silver, Gold
Common control flow activities
Chaining activities, Branching activities, Parameters, Custom state passing, Looping containers, Trigger-based flows, Delta flows
supported Databricks notebook format
DBC file type
Workload Management
A feature in Azure Synapse Analytics that limits the resources a group of requests can consume (for example, during data loading)
Azure Data Factory Activity Categories
Data movement, Data transformation, Control activities.
What Databricks offers that is not Open-source Spark
Databricks Workspace, workflows, Runtime, I/O (DBIO), Serverless, Enterprise Security (DBES)
Azure Analysis Services
Enterprise-grade tabular data models in the cloud
Areas of Data Flow
Event Production -> Event Queuing and Stream Ingestion -> Stream Analytics -> Storage, Presentation, and Action
Azure Stream Analytics
Event-processing engine that enables the consumption and analysis of high volumes of streaming data in real time. Incoming data can come from devices, sensors, the web, social media, apps, and more.
Data Lake Storage Gen2 Benefits
Hadoop compatible access; Security; Performance; Data redundancy
Azure Data Lake Store (Gen 1 and Gen 2)
Hadoop-compatible data repository that can store unlimited NoSQL data. Designed for customers who require the ability to store massive amounts of data with high-compute capabilities for big data analytics.
Data Streams
In the context of analytics, these are event data generated by sensors or other sources that can be analyzed by another technology
Azure Data Lake Store Gen 2 additional feature
Includes a new blob-based hierarchical file system. Gen 2 acts as a storage layer for a wide range of compute platforms, including Azure Databricks, Hadoop, and Azure HDInsight.
IaaS
Infrastructure as a Service, includes Server and Storage, Networking, Firewall, Security, and Datacenter physical plant or building
4 stages for processing big data solutions
Ingestion, Store, Prep and train, Model and serve
IR
Integration Runtime
command to view the list of active streams
Invoke spark.streams.active
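For example, in a notebook you could enumerate the active queries like this (a sketch; the attributes shown are standard StreamingQuery properties):

    for query in spark.streams.active:
        print(query.id, query.name, query.isActive)  # identify each running stream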
Azure Databricks
Is a "Spark as a Service". Is a fully-managed version of the open source Apache Spark analytics and data processing engine. Brings an enterprise-grade and secure cloud-based Big Data and Machine Learning platform as a premium offering on Azure
Type of process for Spark driver and executors
Java Processes
Azure Files
Managed file shares for cloud or on-premises deployments. Extends on-premises file shares to the cloud.
Azure SQL Data Warehouse Key Features
Massively Parallel Processing (MPP). Ability to handle petabytes of data quickly. Compute and storage are separated, so the compute nodes can be scaled independently. Has the capability of pausing and resuming the compute layer. You only pay for the compute that you use.
Number of drivers a Spark Cluster has
One
PaaS
Platform as a Service, includes IaaS and Development and Tools, Data Transformation and Movement, and Databases and Analytics
Serving Layer
Power BI and Azure Analysis Services
Azure HDInsight
Provides technologies for ingesting, processing, and analyzing big data to support batch processing, data warehousing, IoT, and Data Science. Low-cost cloud solution containing several technologies including Apache Hadoop, Apache Spark, Apache Kafka, Apache HBase, Interactive Query, and Apache Storm - all in one approach.
SQL Server on IaaS
Provides the option of a fully functioning version of SQL Server in the cloud.
Power BI
Query Batch & real-time views, merge results
Apache Spark notebook uses
Read and process huge files and data sets; Query, explore, and visualize data sets; Join disparate data sets found in data lakes; Train and evaluate machine learning models; Process live streams of data; Perform analysis on large graph data sets and social networks
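A short sketch of a typical notebook workflow covering the first few uses (the file path, column names, and the Databricks display() helper are assumptions for illustration):

    df = spark.read.option("header", True).csv("/mnt/raw/trips.csv")  # read a data set
    df.createOrReplaceTempView("trips")                               # expose it to SQL
    result = spark.sql("SELECT vendor, COUNT(*) AS trip_count FROM trips GROUP BY vendor")
    display(result)                                                   # visualize in a Databricks notebook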
What does an Azure-SSIS IR support
Running packages deployed into the SSIS catalog (SSISDB) hosted by your Azure SQL Database server or Managed Instance (Project Deployment Model); running packages deployed into a file system, Azure Files, or the SQL Server database (MSDB) hosted by your Azure SQL Managed Instance (Package Deployment Model)
SSIS
SQL Server Integration Services
Types of Cloud services
SaaS, PaaS, and IaaS
Languages supported by Apache Spark
Scala (primary language), Python (PySpark), R, Java, SQL
Event Producer
Sensors or processes that generate data continuously, such as a heart rate monitor or a highway toll lane sensor
SaaS
Software as a Service, includes IaaS, PaaS, and Hosted Applications/Apps
Dependency Conditions of an Activity
Succeeded, Failed, Skipped, Completed
Azure Cosmos DB Key Features
Supports up to 99.999% uptime. Automatic failover if there is a regional disaster. Multi-master replication. Low latency: guaranteed to achieve less than 10 ms response time for reads and writes. Consistency levels addressing global-scale needs: strong, bounded staleness, session, consistent prefix, and eventual.
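A minimal sketch of querying a container with the azure-cosmos Python SDK; the endpoint, key, and database/container names are placeholders, not from the source:

    from azure.cosmos import CosmosClient

    client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
    container = client.get_database_client("retail").get_container_client("products")
    items = container.query_items(
        query="SELECT * FROM c WHERE c.category = @cat",
        parameters=[{"name": "@cat", "value": "books"}],
        enable_cross_partition_query=True,
    )
    for item in items:
        print(item["id"])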
The 2 levels where work occurs in parallel in Spark
The Executor and the Slot
IF Condition Activity
The control activity that can be used to branch based on a condition that evaluates to true or false
Distribution = Replicate
The distribution option you would use for a product dimension table that will contain 1,000 static records in Synapse Analytics
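A sketch of creating such a replicated dimension table in a dedicated SQL pool, with the T-SQL sent through pyodbc (the connection string and table name are placeholders):

    import pyodbc

    conn = pyodbc.connect("<dedicated SQL pool connection string>", autocommit=True)
    conn.cursor().execute("""
        CREATE TABLE dbo.DimProduct
        (
            ProductKey  INT          NOT NULL,
            ProductName NVARCHAR(50) NOT NULL
        )
        WITH
        (
            DISTRIBUTION = REPLICATE,        -- full copy of the table on every compute node
            CLUSTERED COLUMNSTORE INDEX
        );
    """)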
Debug
The feature that enables you to interact with the Mapping Data Flow transformations that you create
Azure Databricks - When to use
The intention of this platform is to provide the ease of deploying a collaborative Machine Learning environment based on Spark that can be used between data scientists, data engineers, and business analysts.
Azure Data Lake
The technology that is typically used as a staging area in a modern data warehousing architecture
Branching activities
This Azure Data Factory control flow activity evaluates a condition; when the condition evaluates to true, one set of activities is executed, and when it evaluates to false, an alternative set of activities is executed.
Clustered Index
This index helps performance when you have a large fact table stored as a heap and your queries aggregate values from ~100M rows but return only 2 rows
Azure HDInsight - when to use
This is a strong option when you need a high throughput data store for NoSQL data.
Cluster Slot
The unit within an Executor to which the Driver can assign parallelized tasks. The number of slots is determined by the number of cores and CPUs of each node
Azure Data Lake Store Key Features
Unlimited scalability, Hadoop compatibility, Access Control Lists (ACLs) / POSIX-compliance, Optimized ABFS driver designed for big data analytics, Zone redundant storage / Geo-redundant storage
Azure Stream Analytics - when to use
Use it if your organization needs to respond to data events in real time, or needs to respond to large batches of data in a continuous, time-bound stream.
Azure Cosmos DB - When to use
Use this when a NoSQL database of one of the supported API models is required and global scale and low-latency performance are needed.
Azure Data Catalog - when to use
Use this when you require a multiuser approach to document your data stores and provide a solution that can help business users better understand their data.
OpenRowSet
The T-SQL function that allows you to query the contents of a file directly
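A sketch of using it from a Synapse serverless SQL pool, with the query sent through pyodbc (the connection string and storage path are placeholders):

    import pyodbc

    conn = pyodbc.connect("<serverless SQL pool connection string>", autocommit=True)
    cursor = conn.cursor()
    cursor.execute("""
        SELECT TOP 10 *
        FROM OPENROWSET(
            BULK 'https://<account>.dfs.core.windows.net/files/sales/*.parquet',
            FORMAT = 'PARQUET'
        ) AS sales;
    """)
    for row in cursor.fetchall():
        print(row)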
Type 2 SCD
What to use when your SQL pool needs to be able to return an employee record from a given point in time, maintain the latest employee information, and minimize query complexity
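A minimal sketch of what a Type 2 SCD dimension table can look like (table and column names are illustrative); each change inserts a new row, and StartDate/EndDate/IsCurrent identify which version was valid at a given point in time:

    scd2_ddl = """
    CREATE TABLE dbo.DimEmployee
    (
        EmployeeSK INT IDENTITY(1,1) NOT NULL, -- surrogate key
        EmployeeID INT               NOT NULL, -- business key
        Name       NVARCHAR(100)     NOT NULL,
        Department NVARCHAR(50)      NOT NULL,
        StartDate  DATE              NOT NULL, -- when this version became effective
        EndDate    DATE              NULL,     -- NULL for the current version
        IsCurrent  BIT               NOT NULL
    );
    """
    # Run scd2_ddl against the dedicated SQL pool, for example with pyodbc as in the earlier sketch.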
Heap index
When you are temporarily landing data in Dedicated SQL Pool, you should use this to make the overall process faster
Cluster Executor
a Java virtual machine running on a node; typically, one instance per node
lambda architecture
a big data processing architecture that addresses the problem of large data set latency by combining both batch- and real-time processing methods
Databricks Delta
a data-management system that's fast, reliable, and able to handle large volumes of data in different raw formats; provides the best of data lake, data warehousing, and streaming data-ingestion systems
Bronze tables
contain raw data ingested from various sources (for example, JSON files, RDBMS data, or IoT data), stored as-is so it can be refined into silver and gold tables downstream
Delta Lake architecture
a vast improvement upon the traditional Lambda architecture. At each stage, we enrich our data through a unified pipeline that allows us to combine batch and streaming workflows through a shared filestore with ACID-compliant transactions
Data Lake Storage Gen2
combines a file system with a storage platform to help you quickly identify insights into your data; builds on Azure Blob storage capabilities to optimize it specifically for analytics workloads
A use-case for parameterizing a linked service in which you can pass through dynamic values during run time
connecting to several different databases that are on the same SQL server
Job
each parallelized action; its result is returned to the Driver, and it is broken down into Stages
Apache Spark clusters
groups of computers that are treated as a single computer and handle the execution of commands issued from notebooks
Ingestion (stage of processing big data)
identifies the technology and processes that are used to acquire the source data. This data can come from files, logs, and other types of unstructured data that must be put into the Data Lake Store. The technology that is used will vary depending on the frequency with which the data is transferred (Azure Data Factory for batch, Apache Kafka or Stream Analytics for real-time)
data lake
is a repository of data that is stored in its natural format, usually as blobs or files
Cluster Driver
is the JVM in which our application runs. There's only one in each cluster.
Apache Spark clusters architecture
master-worker type architecture, allowing processing of data to be parallelized across many computers to improve scale and performance. They consist of a Spark Driver (master) and worker nodes. The driver node sends work to the worker nodes and instructs them to pull data from a specified data source.
Silver tables
provide a more refined view of our data. We can join fields from various bronze tables to enrich streaming records, or update account statuses based on recent activity
Gold tables
provide business level aggregates often used for reporting and dashboarding. This would include aggregations such as daily active website users, weekly sales per store, or gross revenue per quarter by department.
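A compact PySpark sketch of moving data through the bronze, silver, and gold stages with Delta tables (paths and column names are assumptions for illustration, and spark is the notebook's SparkSession):

    from pyspark.sql import functions as F

    # Bronze: land the raw events as-is
    bronze = spark.read.json("/mnt/raw/events/")
    bronze.write.format("delta").mode("append").save("/delta/bronze/events")

    # Silver: cleaned and enriched view of the bronze data
    silver = (spark.read.format("delta").load("/delta/bronze/events")
              .filter(F.col("event_type").isNotNull())
              .withColumn("event_date", F.to_date("event_ts")))
    silver.write.format("delta").mode("overwrite").save("/delta/silver/events")

    # Gold: business-level aggregate for reporting
    gold = (spark.read.format("delta").load("/delta/silver/events")
            .groupBy("event_date")
            .agg(F.countDistinct("user_id").alias("daily_active_users")))
    gold.write.format("delta").mode("overwrite").save("/delta/gold/daily_active_users")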
Data Lake Storage
provides a repository where you can upload and store huge amounts of unstructured data with an eye toward high-performance big data analytics; comprehensive, scalable, and cost-effective data lake solution for big data analytics built into Azure
A pipeline in Azure Data Factory
represents a logical grouping of activities where the activities together perform a certain task.
Hierarchical Namespace = Disabled
to set up the storage account as an Azure Blob storage account, storing data without performing analysis on the data
Hierarchical Namespace = Enabled
to set up the storage account as an Azure Data Lake Storage Gen2 account and perform analytics on the data
A use-case for setting global parameters
when you have multiple pipelines where the parameter names and values are identical