Microsoft Azure Exams DP-200 and DP-201
Data Engineering Practices
- Provision - Process - Secure - Monitor - Disaster Recovery
4 Pillars of a Great Data Architecture
- Security - Performance & Scalability - Availability & Recoverability - Efficiency & Operations
Cosmos DB Consistency Levels
- Strong - Bounded Staleness - Session (default) - Consistent Prefix - Eventual
Lambda architecture
A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation. The rise of ____________________ is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latencies of map-reduce.
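A minimal Python sketch of the idea, assuming a toy page-view counting workload. The event data and the names batch_view, speed_view, and serve are hypothetical illustrations, not part of any Azure service.

```python
from collections import Counter

# Hypothetical events: (timestamp, page). Everything up to BATCH_CUTOFF has
# already been processed by the batch layer; later events have not.
EVENTS = [(1, "home"), (2, "docs"), (3, "home"), (4, "home"), (5, "docs")]
BATCH_CUTOFF = 3


def batch_view(events, cutoff):
    """Batch layer: recompute the view from all data up to the cutoff."""
    return Counter(page for ts, page in events if ts <= cutoff)


def speed_view(events, cutoff):
    """Speed layer: cover only the recent events the batch layer has not seen."""
    return Counter(page for ts, page in events if ts > cutoff)


def serve(batch, speed):
    """Serving layer: join the two views before presentation."""
    return dict(batch + speed)


result = serve(batch_view(EVENTS, BATCH_CUTOFF), speed_view(EVENTS, BATCH_CUTOFF))
print(result)
```

When the batch layer catches up, the cutoff moves forward and the speed view for that period is discarded in favor of the recomputed batch view.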
Azure SQL Database
A general-purpose relational database provided as a managed service. With it, you can create a highly available and high-performance data storage layer for the applications and solutions in Azure. It can be the right choice for a variety of modern cloud applications because it enables you to process both relational data and non-relational structures, such as graphs, JSON, spatial, and XML. It's based on the latest stable version of the Microsoft SQL Server database engine. You can use advanced query processing features, such as high-performance in-memory technologies and intelligent query processing. The newest capabilities of SQL Server are released first to ____________________, and then to SQL Server itself. You get the newest SQL Server capabilities with no overhead for patching or upgrading, tested across millions of databases.
Request Unit
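The unit in which Cosmos DB expresses the cost of a database operation. It abstracts the CPU, memory, and IOPS the operation consumes; a point read of a 1 KB item costs 1 RU. Throughput is provisioned in Request Units per second (RU/s), so it is not simply a count of reads or writes per second.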
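A rough sizing sketch in Python, assuming a steady workload. The function name and the write cost are hypothetical; real costs depend on item size, indexing policy, and consistency level, so measured values should replace the constants.

```python
import math

# Assumed per-operation costs in RUs. A point read of a 1 KB item costs 1 RU;
# the write cost below is an illustrative assumption, since real write costs
# depend on item size and indexing policy.
READ_COST_RU = 1.0
WRITE_COST_RU = 5.0


def required_ru_per_second(reads_per_sec, writes_per_sec,
                           read_cost=READ_COST_RU, write_cost=WRITE_COST_RU):
    """Estimate the RU/s to provision for a steady read/write workload."""
    total = reads_per_sec * read_cost + writes_per_sec * write_cost
    return math.ceil(total)


estimate = required_ru_per_second(reads_per_sec=500, writes_per_sec=100)
print(estimate)
```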
Serving layer
In a lambda architecture, output from the batch and speed layers is stored in the ____________________, which responds to ad-hoc queries by returning precomputed views or building views from the processed data.
Azure Stream Analytics
A serverless, scalable complex event processing engine by Microsoft that enables users to develop and run real-time analytics on multiple streams of data from sources such as devices, sensors, web sites, social media, and other applications. Users can set up alerts to detect anomalies, predict trends, trigger necessary workflows when certain conditions are observed, and make data available to other downstream applications and services for presentation, archiving, or further analysis.
Databricks
An Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. Designed with the founders of Apache Spark. Is intended to be fast, easy, and collaborative. Is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts. For a big data pipeline, the data (raw or structured) is ingested into Azure through Azure Data Factory in batches, or streamed in near real time using Kafka, Event Hubs, or IoT Hub. This data lands in a data lake for long-term persisted storage, in Azure Blob Storage or Azure Data Lake Storage. Is able to read data from multiple data sources such as Azure Blob Storage, Azure Data Lake Storage, Azure Cosmos DB, or Azure SQL Data Warehouse.
Azure Synapse Analytics
Analytics service that brings together enterprise data warehousing and Big Data analytics. Lets you query data on your terms, using either serverless on-demand or provisioned resources—at scale. Brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate BI and machine learning needs.
Cosmos DB
Azure's NoSQL database. Has APIs for SQL (Core), MongoDB, Cassandra, Azure Table, and Gremlin.
Azure Data Factory
Cloud-based ETL and data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale. Allows you to create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores.
Control Node
In MPP
Batch layer
In a Lambda architecture, the layer that precomputes results using a distributed processing system that can handle very large quantities of data. Aims at perfect accuracy by being able to process all available data when generating views. This means it can fix any errors by recomputing based on the complete data set, then updating existing views. Output is typically stored in a read-only database, with updates completely replacing existing precomputed views.
Time to Live
Feature of Cosmos DB (to be completed)
Speed layer
In a lambda architecture, the layer that processes data streams in real time and without the requirements of fix-ups or completeness. This layer sacrifices throughput as it aims to minimize latency by providing real-time views into the most recent data. Essentially, it is responsible for filling the "gap" caused by the batch layer's lag in providing views based on the most recent data. This layer's views may not be as accurate or complete as the ones eventually produced by the batch layer, but they are available almost immediately after data is received, and can be replaced when the batch layer's views for the same data become available.
Data Factory Components
- Linked Service - Activities - Pipeline - Triggers - Dataset - Parameters - Integration Runtime - Control Flow
Kusto Query Language (KQL)
Query language used to query Azure services such as Azure Data Explorer and Azure Monitor Logs. A ____________________ query is stated in plain text, using a data-flow model designed to make the syntax easy to read, author, and automate. The query uses schema entities that are organized in a hierarchy similar to SQL's: databases, tables, and columns. The query consists of a sequence of query statements, delimited by a semicolon (;). At least one statement must be a tabular expression statement, that is, a statement that produces data arranged in a table-like mesh of columns and rows. The query's tabular expression statements produce the results of the query.
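To illustrate the statement structure, the Python sketch below holds a small query as plain text and splits it on the semicolon delimiter. The query is a hypothetical example; the StormEvents table and its columns are assumed to exist, as in Microsoft's public sample database. The first statement is a let statement and the second is the tabular expression statement that produces the result.

```python
# A hypothetical KQL query held as plain text. StormEvents and its columns
# are assumed to exist, as in Microsoft's public sample database.
QUERY = """
let min_damage = 1000000;
StormEvents
| where DamageProperty > min_damage
| summarize EventCount = count() by State
"""

# Naive split on the statement delimiter; adequate here because the query
# contains no semicolons inside string literals.
statements = [s.strip() for s in QUERY.split(";") if s.strip()]

for number, statement in enumerate(statements, start=1):
    print(f"statement {number}:\n{statement}\n")
```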
Design for Recoverability
- Recovery Point Objective (RPO) - Recovery Time Objective (RTO)
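The two objectives measure different things: RPO bounds how much data may be lost, RTO bounds how long the service may be down. A minimal Python sketch, assuming a hypothetical incident timeline and a hypothetical helper named meets_objectives:

```python
from datetime import datetime, timedelta


def meets_objectives(last_backup, failure, restored, rpo, rto):
    """Check one outage against a Recovery Point and Recovery Time Objective.

    Data loss is the time between the last good backup and the failure;
    downtime is the time between the failure and the restored service.
    """
    data_loss = failure - last_backup
    downtime = restored - failure
    return data_loss <= rpo, downtime <= rto


# Hypothetical incident timeline.
last_backup = datetime(2024, 1, 1, 12, 0)
failure = datetime(2024, 1, 1, 12, 45)
restored = datetime(2024, 1, 1, 15, 0)

rpo_ok, rto_ok = meets_objectives(
    last_backup, failure, restored,
    rpo=timedelta(hours=1), rto=timedelta(hours=2),
)
print(rpo_ok, rto_ok)
```

In this timeline 45 minutes of data are lost, which is within the one-hour RPO, but the service is down for 2 hours 15 minutes, which misses the two-hour RTO.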
MPP Table geometries
- Round Robin - Hash Distributed - Replicated
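A simplified Python sketch of how the three options place rows. The function names are hypothetical, crc32 stands in for the engine's internal hash function, and the sequential deal is a simplification of round-robin assignment. The count of 60 distributions matches Synapse dedicated SQL pools.

```python
import zlib
from itertools import cycle

NUM_DISTRIBUTIONS = 60  # distribution count used by Synapse dedicated SQL pools


def hash_distribute(rows, key):
    """Hash distributed: a row's key value decides its distribution, so equal
    keys always land together. crc32 stands in for the engine's hash function."""
    placement = {}
    for row in rows:
        d = zlib.crc32(str(row[key]).encode()) % NUM_DISTRIBUTIONS
        placement.setdefault(d, []).append(row)
    return placement


def round_robin_distribute(rows):
    """Round robin: rows are dealt out evenly, ignoring their contents
    (a sequential deal here, for simplicity)."""
    placement = {}
    for d, row in zip(cycle(range(NUM_DISTRIBUTIONS)), rows):
        placement.setdefault(d, []).append(row)
    return placement


def replicate(rows, num_nodes):
    """Replicated: every compute node holds a full copy of the table."""
    return {node: list(rows) for node in range(num_nodes)}


rows = [{"customer_id": i % 5, "amount": i} for i in range(120)]
hashed = hash_distribute(rows, "customer_id")
dealt = round_robin_distribute(rows)
copies = replicate(rows, num_nodes=3)
print(len(hashed), len(dealt), len(copies))
```

With only five distinct key values, the hash-distributed table occupies at most five distributions, which shows why a low-cardinality key causes skew, while round robin spreads the same rows evenly.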
Blob Storage
To be completed. Uses flat namespace.
Data Lake Storage Gen2
To be completed. Uses hierarchical namespace.
Compute Node
In MPP