AZURE DP-203
Temporal Database
A database that contains time-varying historical data with the possible inclusion of current and future data and has the ability to manipulate this data
orchestration
ADF can use a similar approach: whilst it has native functionality to ingest and transform data, sometimes it will instruct another service, such as Azure Databricks, to perform the actual work required on its behalf, for example executing a transformation query. In this case, it is Databricks that performs the work, not ADF. ADF merely orchestrates the execution of the query, and then provides the pipelines to move the data onto the next step or destination.
Data Sharding for Scaling single server limitations
Computing resources Geography of data and users Network bandwidth Storage space
SSIS package execution
Natively execute SQL Server Integration Services (SSIS) packages in a managed Azure compute environment.
Data Lake Storage
Scalable Durable Secure
Big Data Components
Source Storage Process Processed data storage Reporting
Index Table Pattern
Create indexes over the fields in data stores Ability to emulate secondary indexes
Partition SQL Database
Elastic pools support horizontal scale Shards can hold more than one dataset Shardlets in a shard map should have the same schema Avoid mixing highly active shardlets in the same shard
Partitioning Purpose
Improve Availability Improve performance Improve Scalability Improve Security
Design ingestion patterns
Integrate hub Develop hub
Simple Repartition
Only writing to a single Hive partition
RDD
Resilient Distributed Datasets
Manage hub
Select SQL Pools Drag performance level
Performance Tiers
Standard Premium
Graph database
A graph database offers an alternative way to track relationships; its structure resembles sociograms with their interlinked nodes
Partitioning Data Lakes Table Storage
A key-value store that is designed to support partitioning Partitions are managed internally by the service Each entity must include a partition key and row key
When to use Azure Synapse Analytics
Across all organizations and industries, the common use cases for Azure Synapse Analytics are identified by the need for:
Publish
After the raw data has been refined into a business-ready consumable form from the transform and enrich phase, you can load the data into Azure Data Warehouse, Azure SQL Database, Azure Cosmos DB, or whichever analytics engine your business users can point to from their business intelligence tools
Azure Databricks
An advanced analytics managed Apache Spark-as-a-Service solution Provides an end-to-end data engineering and data science solution and management platform Data Engineers and Data Scientists working on big data projects every day It provides the ability to create and manage an end-to-end big data/data science project using one platform
Data Sink
An external entity that consumes information generated by a system
Apache Spark for Azure Synapse
Apache Spark is an open-source, memory-optimized system for managing big data workloads. Use it when you want a Spark engine for big data processing or data science and you don't mind that there is no service level agreement provided to keep the services running. It is usually of interest to open-source professionals, and the reason for Apache Spark is to overcome the limitations that symmetric multiprocessing (SMP) systems impose on big data workloads.
Change data processes
As a data engineer you'll extract raw data from a structured or unstructured data pool and migrate it to a staging data repository. Because the data source might have a different structure than the target destination, you'll transform the data from the source schema to the destination schema. This process is called transformation. You'll then load the transformed data into the data warehouse. Together, these steps form a process called extract, transform, and load (ETL). An alternative approach is extract, load, and transform (ELT). In ELT, the data is immediately extracted and loaded into a large data repository such as Azure Cosmos DB or Azure Data Lake Storage. This change in process reduces the resource contention on source systems. Data engineers can begin transforming the data as soon as the load is complete.
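The ETL flow above can be sketched in a few lines of Python; the record shapes and store names here are illustrative stand-ins, not a real pipeline.

```python
# Minimal ETL sketch: extract from a "source", transform the schema,
# load into a "warehouse". In-memory lists stand in for real data stores.

def extract(source):
    # Pull raw rows from the source system.
    return list(source)

def transform(rows):
    # Map the source schema (name, amount_cents) to the
    # destination schema (customer, amount in dollars).
    return [{"customer": r["name"], "amount": r["amount_cents"] / 100}
            for r in rows]

def load(rows, warehouse):
    warehouse.extend(rows)

source = [{"name": "Ada", "amount_cents": 1250}]
warehouse = []

load(transform(extract(source)), warehouse)  # ETL: transform before load

# ELT would instead load the raw rows first and transform them inside
# the target store, reducing contention on the source system.
print(warehouse)  # → [{'customer': 'Ada', 'amount': 12.5}]
```

The only difference between the two patterns is where the `transform` step runs: before the load (ETL) or inside the target repository after the load (ELT).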
Web
As a data engineer, use the Azure Cosmos DB multimaster replication model to create a data architecture that supports web and mobile applications. Thanks to Microsoft performance commitments, these applications can achieve a response time of less than 10 ms anywhere in the world. By reducing the processing time of their websites, global organizations can increase customer satisfaction.
Hierarchical Namespace
Atomic directory analytics processing ACLs for directories and files Bottleneck management Performance optimizations
Data Encryption
Automatic at rest Client-side
Access Control
Azure Active Directory Role based access controls Shared access signature shared key Anonymous
Azure Data Lake Storage Gen2
Azure Blob storage Hierarchical namespace File system driver and REST API ABFS Azure Blob File System
ADF
Azure Data Factory
Monitor
Azure Data Factory has built-in support for pipeline monitoring via Azure Monitor, API, PowerShell, Azure Monitor logs, and health panels on the Azure portal, to monitor the scheduled activities and pipelines for success and failure rates.
Disable Geo-replication capabilities
Azure Data Lake Gen2
Data Migration
Azure manual services AzCopy .NET Library Blobfuse Azure Data Factory Azure Data Box
Blob Types
Block Append Page
Latency
Calculate requirements End to end Server
Data Lake Characteristics
Centralized, schema-less data Rapid ingestion Map and control data Data catalog Self-service
Circuit Breaker Pattern
Closed Open Half-Open
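The three states can be sketched as a small state machine; the threshold, timeout, and class name below are illustrative assumptions, not a specific library's API.

```python
import time

class CircuitBreaker:
    """Sketch of the three states: Closed (normal operation), Open
    (fail fast), Half-Open (probe whether the service has recovered)."""

    def __init__(self, threshold=3, reset_timeout=30.0):
        self.threshold = threshold          # failures before opening
        self.reset_timeout = reset_timeout  # seconds before half-open
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"    # allow one trial call through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.threshold:
                self.state = "open"         # trip the breaker
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "closed"               # success closes the circuit
        return result
```

While Open, callers fail immediately instead of waiting on a broken downstream service; a successful Half-Open probe returns the breaker to Closed.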
Cloud Support
Cloud systems are easy to support because the environments are standardized. When Microsoft updates a product, the update applies to all consumers of the product.
Cloud Multilingual support
Cloud systems often store data as a JSON file that includes the language code identifier (LCID). The LCID identifies the language that the data uses. Apps that process the data can use translation services such as the Bing Translator API to convert the data into an expected language when the data is consumed or as part of a process to prepare the data.
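A record of this shape might look as follows; the field names and the language tag standing in for the LCID are assumptions for illustration.

```python
import json

# Hypothetical record: data stored with a language code so downstream
# apps know whether translation is needed before the data is consumed.
record = json.dumps({"lcid": "de-DE", "text": "Guten Tag"})

parsed = json.loads(record)
needs_translation = parsed["lcid"] != "en-US"
# A real pipeline might call a translation service (e.g. the Bing
# Translator API) here; this sketch only flags that translation is needed.

print(needs_translation)  # → True
```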
Transform and enrich
Compute services such as Databricks and Machine Learning can be used to prepare or produce transformed data on a maintainable and controlled schedule to feed production environments with cleansed and transformed data. In some instances, you may even augment the source data with additional data to aid analysis, or consolidate it through a normalization process to be used in a Machine Learning experiment as an example.
Data movement
Copy data across data stores in public network and data stores in private network (on-premises or virtual private network). It provides support for built-in connectors, format conversion, column mapping, and performant and scalable data transfer.
Open Synapse Studio
Create a spark notebook in Azure Synapse Analytics
Star Schema Design Considerations
Current and historical data Emphasis on fast reads Redundant data storage Data that isn't real-time
Linked Service
Data Factory supports a wide variety of data sources that you can connect to through the creation of an object known as a Linked Service, which enables you to ingest the data from a data source in readiness to prepare the data for transformation and/or analysis. In addition, Linked Services can fire up compute services on demand.
Data Lake Zoning Data
Data Separation Governance Service Level Agreements Security
Planning Data Lake Governance
Data catalog Data quality Compliance Self-service Growth
Object Replication
Data distribution and tiering Applied through policy Asynchronous Blob versioning Constraints and limitations
Data Lake Challenges
Data governance Data Swamp (uncontrolled) Security Emerging tech/skill sets
Data Lake Maturity
Data ingestion and storage Sandbox/experimentation Complements warehouse Drives data operations (final)
Data Lake Security
Data security Access Control Network security Application security
Data Lake Architecture
Decoupled Agile Resilient Auditable
Extract
Define the data source: Identify source details such as the resource group, subscription, and identity information such as a key or secret. Define the data: Identify the data to be extracted. Define data by using a database query, a set of files, or an Azure Blob storage name for blob storage.
Load
Define the destination: During a load, many Azure destinations can accept data formatted as a JavaScript Object Notation (JSON), file, or blob. You might need to write code to interact with application APIs. Start the job: Test the ETL job in a development or test environment. Then migrate the job to a production environment to load the production system. Monitor the job: ETL operations can involve many complex processes. Set up a proactive and reactive monitoring system to provide information when things go wrong. Set up logging according to the technology that will use it.
Data Lake Use Cases
Descriptive analysis Diagnostic analysis Predictive analysis Prescriptive analysis Data mining
Descriptive analytics
Descriptive analytics answers the question "What is happening in my business?" This question is typically answered through the creation of a data warehouse. Azure Synapse Analytics leverages the dedicated SQL pool capability that enables you to create a persisted data warehouse to perform this type of analysis. You can also make use of the serverless SQL pool to prepare data from files stored in a data lake to create a data warehouse interactively.
Diagnostic analytics
Diagnostic analytics deals with answering the question "Why is it happening?". This may involve exploring information that already exists in a data warehouse, but typically involves a wider search of your data estate to find more data to support this type of analysis.
Activity dispatch
Dispatch and monitor transformation activities running on a variety of compute services such as Azure Databricks, Azure HDInsight, Azure Machine Learning, Azure SQL Database, SQL Server, and more.
On-premises Licensing
Each OS that's installed on a server might have its own licensing cost. OS and software licenses are typically sold per server or per CAL (Client Access License). As companies grow, licensing arrangements become more restrictive.
Data Lake Data Catalog
Find existing data Discover new data sources Collect metadata Categorize and assign tags
Materialized View Pattern
Generates prepopulated views in advance Not the authoritative source of data (temporary) Bridges different data stores Simplifies views over data that is difficult to query
Redundancy Options
Geo-redundant Geo zone redundant
Benefits Azure Blob Storage
High availability Encrypted Scalable Managed service Client access libraries
Partition Shard
Horizontal Partitioning Choose the correct sharding key
Access Tiers
Hot - Frequently accessed Cool - Not frequently accessed Archive - lowest cost, highest access cost.
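The trade-off between the tiers can be sketched as a simple decision function; the day thresholds below are illustrative assumptions, not Azure defaults, and real tiering is done with lifecycle management policies.

```python
def choose_tier(days_since_last_access: int) -> str:
    """Illustrative thresholds only. Hot = frequent access, Cool =
    infrequent access, Archive = rarely accessed."""
    if days_since_last_access < 30:
        return "Hot"        # lowest access cost, highest storage cost
    if days_since_last_access < 180:
        return "Cool"
    return "Archive"        # lowest storage cost, highest access cost

print(choose_tier(7))    # → Hot
print(choose_tier(365))  # → Archive
```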
On-premises Support
Hundreds of vendors sell physical server hardware. This variety means server administrators might need to know how to use many different platforms. Because of the diverse skills required to administer, maintain, and support on-premises systems, organizations sometimes have a hard time finding server administrators to hire.
VNet Injection
If you're looking to do specific network customizations, you could deploy Azure Databricks data plane resources in your own VNet. In this scenario, instead of using the managed VNet, which restricts you from making changes, you "bring your own" VNet where you have full control. Azure Databricks will still create the managed VNet, but it will not use it.
On-premises Multilingual support
In on-premises SQL Server systems, multilingual support is difficult and expensive. One issue with multiple languages is the sorting order of text data. Different languages can sort text data differently. To address this issue, the SQL Server database administrator must install and configure the data's collation settings. But these settings can work only if the SQL database developers considered multilingual functionality when they were designing the system. Systems like this are complex to manage and maintain.
Star Schema Benefits
Increases simplicity and understandability Diminishes complexity when building reports Creates a single source of consolidated data
Azure Data Lake Gen2 Processing Big Data
Ingest Store Prepare data Presentation
Apache Spark
Is an open-source, memory-optimized system for managing big data workloads When you want the benefits of Apache Spark for big data processing and/or data science work without the Service Level Agreements (SLAs) of a provider Open-source professionals To overcome the limitations that symmetric multiprocessing (SMP) systems impose on big data workloads
Set-up Azure Data Factory
It is easy to set up Azure Data Factory from within the Azure portal, you only require the following information: Name: The name of the Azure Data Factory instance Subscription: The subscription in which the ADF instance is created Resource group: The resource group where the ADF instance will reside Version: select V2 for the latest features Location: The datacenter location in which the instance is stored
Relational Schema Design Considerations
Minimal data redundancy Optimized for fast read and write operations Real-time and current data
Partition Design Patterns
Minimize cross-partition joins Replicate static reference data Periodically rebalance shards
On-premises Computing environment
On-premises environments require physical equipment to execute applications and services. This equipment includes physical servers, network infrastructure, and storage. The equipment must have power, cooling, and periodic maintenance by qualified personnel. A server needs at least one operating system (OS) installed. It might need more than one OS if the organization uses virtualization technology.
On-premises Maintenance
On-premises systems require maintenance for the hardware, firmware, drivers, BIOS, operating system, software, and antivirus software. Organizations try to reduce the cost of this maintenance where it makes sense.
Parquet
Open-source columnar storage format for Hadoop
IoT solutions
Over the last couple of years, hundreds of thousands of devices have been produced to generate sensor data. These are known as IoT devices. Using technologies like Azure IoT Hub, you can design a data solution architecture that captures information from IoT devices so that the information can be analyzed.
Data Lake Folder Structure
Parent - raw data Source - source systems Entity Date
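A common way to realize this layout is a path of the form raw/source/entity/date; the zone and folder names below are one convention among many, not a fixed standard.

```python
from datetime import date

def raw_path(source: str, entity: str, day: date) -> str:
    # Parent zone "raw", then source system, entity, and date folders.
    # Date folders (year/month/day) make time-based partition pruning easy.
    return f"raw/{source}/{entity}/{day:%Y/%m/%d}"

print(raw_path("salesdb", "orders", date(2023, 5, 1)))
# → raw/salesdb/orders/2023/05/01
```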
Partitioning
Partition ranges Block sizes Appropriate naming
Planning a Data Lake
Platform quotas Lake placement Lake distribution Management Silos
Real-time / near real-time
Processing of a typically infinite stream of input data, whose time until results are ready is short, measured in milliseconds or seconds in the longest of cases.
Analytical Workloads
Provide data for reporting, decision making, and planning. Generally centered around the SELECT command.
Batch
Queries or programs that take tens of minutes, hours, or days to complete. Activities could include initial data wrangling, complete ETL pipeline, or preparation for downstream analytics.
Partition Azure Service Bus
Queues and topics are scoped in the Service Bus namespace Each namespace imposes quotas on available resources Messages that are sent as part of a transaction must specify a partition key
Planning Data Lake Structure
Raw zone Cleansed zone Curated zone Exploratory zone
Partition Table
Select partition and row keys by how data is accessed Supports transactional operations in the same partition Use vertical partitioning for dividing fields into groups
What is Apache Spark?
Spark is a unified processing engine that can analyze big data using SQL, machine learning, graph processing, or real-time stream analysis
Anatomy of Blob Storage
Storage account Container Files and blobs
Connect and collect
The first step in building an orchestration system is to define and connect all the required sources of data together, such as databases, file shares, and FTP web services. The next step is to ingest the data as needed to a centralized location for subsequent processing.
developer's and learner's perspective
The number of Partitions my data is divided into. The number of Slots I have for parallel execution. How many Jobs am I triggering? And lastly the Stages those jobs are divided into.
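The partition/slot part of this mental model can be sketched in pure Python; this is a stand-in for Spark's execution model, not PySpark code, and the partition and worker counts are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

# Data split into partitions, processed in parallel by a fixed
# number of slots: the core of the Spark mental model.
data = list(range(100))
num_partitions = 4
partitions = [data[i::num_partitions] for i in range(num_partitions)]

def process(partition):
    # One task per partition; here, a partial sum of squares.
    return sum(x * x for x in partition)

# 2 slots → at most 2 partition tasks run at the same time.
with ThreadPoolExecutor(max_workers=2) as slots:
    partial_sums = list(slots.map(process, partitions))

print(sum(partial_sums))  # → 328350 (sum of squares 0..99)
```

With 4 partitions but only 2 slots, the work proceeds in two waves, which is exactly why partition count relative to available slots matters for parallelism.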
Data exploration and discovery
The serverless SQL pool functionality provided by Azure Synapse Analytics enables Data Analysts, Data Engineers and Data Scientist alike to explore the data within your data estate. This capability supports data discovery, diagnostic analytics, and exploratory data analysis.
Document database
They are set up similar to JSONs so they excel at heterogeneous data formats and are easy to implement
Databricks runtime best practices
Tune shuffle for optimal performance Partition your data
Row-oriented Databases
Typically requires multiple indexes Slow bulk operations Easy to add rows, hard to add columns Fast concurrent CRUD operations
Azure Data Lake Gen2 Best Practice
Use security groups Integrated firewall Hadoop Distributed copy Scheduled data copy Directory layout
Partitioning Data Lakes Blob Storage
Use when you need to upload or download large volumes of data quickly Makes it possible to hold large binary objects Containers group related blobs with the same security
VNet Peering
Virtual network (VNet) peering allows the virtual network in which your Azure Databricks resource is running to peer with another Azure virtual network. Traffic between virtual machines in the peered virtual networks is routed through the Microsoft backbone infrastructure, much like traffic is routed between virtual machines in the same virtual network, through private IP addresses only.
Parquet
When it comes to storing refined versions of the data for possible querying, the recommended data format
Cloud Lift and shift
When moving to the cloud, many customers migrate from physical or virtualized on-premises servers to Azure Virtual Machines. This strategy is known as lift and shift. Server administrators lift and shift an application from a physical environment to Azure Virtual Machines without rearchitecting the application.
Databricks appliance
When you create an Azure Databricks service, a "Databricks appliance" is deployed as an Azure resource in your subscription.
Geo-zone redundant Storage
Zone redundant in primary region Locally redundant in secondary region Sixteen 9s durability
Isolation
ensures that one transaction is not impacted by another transaction.
Consistency
ensures that the data is consistent both before and after the transaction.
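These transaction guarantees can be seen in miniature with Python's built-in sqlite3 module; the table and the CHECK constraint below are an illustrative sketch, not an Azure example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('A', 100), ('B', 0)")
conn.commit()

try:
    with conn:  # one transaction: both updates commit, or neither does
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'A'")
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'B'")
except sqlite3.IntegrityError:
    pass  # CHECK constraint violated → the entire transfer rolls back

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # → {'A': 100, 'B': 0}  (consistent before and after)
```

The failed transfer leaves the data exactly as it was: consistency is preserved because the constraint-violating transaction is rolled back as a unit.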
Apache Spark pools in Azure Synapse Analytics
has Apache Spark capabilities embedded. For organizations that don't have existing Apache Spark implementations yet, Apache Spark pools in Azure Synapse Analytics provide the functionality to spin up an Apache Spark cluster to meet data engineering needs without the overhead of the other Apache Spark platforms. Data engineers, data scientists, data platform experts, and data analysts can come together within Azure Synapse Analytics where the Apache Spark cluster is running to quickly collaborate on various analytical solutions.
Apache Spark notebook
is a collection of cells. These cells are run to execute code, to render formatted text, or to display graphical visualizations.
Key value stores
Stores key-value pairs Fast lookup of key -> value Massive scalability Good for simple associative data and big data Bad for complex, highly relational data Ex. Redis
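The access pattern is a plain dictionary lookup; the sketch below uses an in-memory dict as a stand-in for a real store like Redis, with illustrative key names.

```python
# Minimal in-memory key-value store sketch: O(1) lookup by key,
# no joins or relational queries, which is both its strength and limit.
store = {}

def put(key, value):
    store[key] = value

def get(key, default=None):
    return store.get(key, default)

put("session:42", {"user": "ada", "cart": ["sku-1"]})
print(get("session:42")["user"])    # → ada
print(get("missing", "not found"))  # → not found
```

Anything relational (say, "all sessions whose cart contains sku-1") would require scanning every key, which is why key-value stores fit simple associative data and not complex relationships.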
Coalesce
to come together Only writing to a single Hive partition Spark coalesces first
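Coalesce reduces the partition count by merging existing partitions without redistributing individual records (no full shuffle); the pure-Python sketch below illustrates the idea and is not Spark's actual implementation.

```python
# Sketch of coalesce: combine whole partitions into fewer partitions.
# Rows are never split up or reshuffled individually.
partitions = [[1, 2], [3], [4, 5], [6]]

def coalesce(parts, n):
    merged = [[] for _ in range(n)]
    for i, p in enumerate(parts):
        merged[i % n].extend(p)  # whole partitions merged, rows untouched
    return merged

print(coalesce(partitions, 2))  # → [[1, 2, 4, 5], [3, 6]]
```

This is why coalesce is cheaper than repartition: repartition performs a full shuffle to rebalance rows, while coalesce only glues existing partitions together.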
Cloud Computing environment
"Cloud computing environments provide the physical and logical infrastructure to host services, virtual servers, intelligent applications, and containers for their subscribers. Different from on-premises physical servers, cloud environments require no capital investment. Instead, an organization provisions service in the cloud and pays only for what it uses. Moving servers and services to the cloud also reduces operational costs. Within minutes, an organization can provision anything from virtual servers to clusters of containerized apps by using Azure services. Azure automatically creates and handles all of the physical and logical infrastructure in the background. In this way, Azure reduces the complexity and cost of creating the services."
On-premises Availability
High-availability systems must be available most of the time. Service-level agreements (SLAs) specify your organization's availability expectations. System uptime can be expressed as three nines, four nines, or five nines. These expressions indicate system uptimes of 99.9 percent, 99.99 percent, or 99.999 percent. To calculate system uptime in terms of hours, multiply these percentages by the number of hours in a year (8,760).
On-premises Scalability
When administrators can no longer scale up a server, they can instead scale out their operations. To scale an on-premises server horizontally, server administrators add another server node to a cluster. Clustering uses either a hardware load balancer or a software load balancer to distribute incoming network requests to a node of the cluster. A limitation of server clustering is that the hardware for each server in the cluster must be identical. So when the server cluster reaches maximum capacity, a server administrator must replace or upgrade each node in the cluster.
On-premises Total cost of ownership
Total cost of ownership (TCO) describes the final cost of owning a given technology. In on-premises systems, TCO includes the following costs: hardware, software licensing, labor, and datacenter overhead.
Parquet vs CSV
Parquet reduces stored and scanned data CSV is simple and widely used Parquet utilizes efficient columnar storage CSV is stored as row-based data
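The difference in scan cost comes from the physical layout; the pure-Python sketch below contrasts a row-based (CSV-like) and a columnar (Parquet-like) layout of the same table, using made-up sample data.

```python
# The same table in two layouts.
rows = [("ada", 30), ("bob", 25), ("eve", 35)]                  # row-based
columns = {"name": ["ada", "bob", "eve"], "age": [30, 25, 35]}  # columnar

# Computing an aggregate over one field:
avg_row = sum(r[1] for r in rows) / len(rows)        # touches every row
avg_col = sum(columns["age"]) / len(columns["age"])  # reads one column only

# Identical answer, but the columnar layout never reads the "name" data,
# which is why Parquet reduces scanned data for analytical queries.
print(avg_row, avg_col)  # → 30.0 30.0
```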
data-driven workflow
(called pipelines) that can ingest data from disparate data stores. You can build complex ETL processes that transform data visually with data flows or by using compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure Synapse Analytics.
Zone Redundant Storage
3 replicas over 3 availability zones Twelve 9s durability Automated failover Geo restrictions No archive tier
Locally Redundant Storage
3 replicas within the data center Eleven 9s durability Synchronous writes
Cloud Total cost of ownership
A subscription can be based on usage that's measured in compute units, hours, or transactions. The cost includes hardware, software, disk storage, and labor. Because of economies of scale, an on-premises system can rarely compete with the cloud in terms of the measurement of the service usage.
cluster best practices
Arrive at the correct cluster size by iterative performance testing Workload requires caching (like machine learning) ETL and analytic workloads
Azure Databricks
Azure Databricks is a managed Apache Spark-as-a-Service proprietary solution that provides an end-to-end data engineering/data science platform. Azure Databricks is of interest for many data engineers and data scientists working on big data projects today. It provides the platform in which you can create and manage the big data/data science projects all on one platform. These services are not mutually exclusive. It is common to find customers who use a combination of these technologies working together.
Azure Data Lake Gen2 Security
Azure Defender Azure Storage encryption Private endpoints
Predictive analytics
Azure Synapse Analytics also enables you to answer the question "What is likely to happen in the future based on previous trends and patterns?" by using its integrated Apache Spark engine. Azure Synapse Spark pools can be used with other services such as Azure Machine Learning Services, or Azure Databricks.
Real time analytics
Azure Synapse Analytics can capture, store and analyze data in real-time or near-real time with features such as Azure Synapse Link, or through the integration of services such as Azure Stream Analytics and Azure Data Explorer.
Data integration
Azure Synapse Pipelines enables you to ingest, prepare, model and serve the data to be used by downstream systems. This can be used by components of Azure Synapse Analytics exclusively.
Ingest and Prep
Azure Synapse SQL Serverless Azure Synapse Spark Azure Synapse Pipelines Azure Data Factory Azure Databricks
Cloud Availability
Azure duplicates customer content for redundancy and high availability. Many services and platforms use SLAs to ensure that customers know the capabilities of the platform they're using.
Partitioning
By time is a frequently used strategy.
Security best practice
Consider isolating each workspace in its own VNet Do not store any production data in Default Databricks Filesystem (DBFS) Folders Always hide secrets in a key vault Access control - Azure Data Lake Storage (ADLS) passthrough Configure audit logs and resource utilization metrics to monitor activity Querying VM metrics in Log Analytics once you have started the collection using the above document
Partition Azure Search
Create an instance per geographic region Create a local and global service Can be divided into 1-4, 6, or 12 partitions Each partition can store up to 300 GB.
Data Factory Contributor role
Create, edit, and delete data factories and child resources including datasets, linked services, pipelines, triggers, and integration runtimes. Deploy Resource Manager templates. Resource Manager deployment is the deployment method used by Data Factory in the Azure portal. Manage App Insights alerts for a data factory. At the resource group level or above, lets users deploy Resource Manager templates. Create support tickets.
Apache Spark for Azure Synapse
Embedded Apache Spark capability within Azure Synapse Analytics residing on the same platform that contains data warehouses and data integration capabilities, as well as integrating with other Azure services Enables organizations without existing Apache Spark implementations to fire up an Apache Spark cluster to meet data engineering needs without the overhead of the other Apache Spark platforms listed Data Engineers, Data Scientists, Data Platform experts and Data Analysts It provides the ability to scale efficiently with Apache Spark clusters within a one stop shop analytical platform to meet your needs.
Partitioning Data Lakes Queue Storage
Enables asynchronous messaging between processes Maximum individual message size 64 KB Can contain any number of queues or messages Can handle up to 2,000 messages per second.
Advanced analytics
Enables organizations to perform predictive analytics using both the native features of Azure Synapse Analytics, and integrating with other technologies such as Azure Databricks.
Develop hub
Expand SQL scripts
tools and integration best practices
Favor cluster scoped init scripts over global and named scripts Use cluster log delivery feature to manage logs
Partition Functional
Improve isolation and data access performance Separate read-only and read write data.
Azure Synapse Data Analytics
Improve performance by applying filters Improve performance by partitioning while loading Implements dropping of the oldest partition
Cloud Maintenance
In the cloud, Microsoft manages many operations to create a stable computing environment. This service is part of the Azure product benefit. Microsoft manages key infrastructure services such as physical hardware, computer networking, firewalls and network security, datacenter fault tolerance, compliance, and physical security of the buildings. Microsoft also invests heavily to battle cybersecurity threats, and it updates operating systems and firmware for the customer. These services allow data engineers to focus more on data engineering and eliminating system complexity.
Healthcare
In the healthcare industry, use Azure Databricks to accelerate big-data analytics and AI solutions. Apply these technologies to genome studies or pharmacy sales forecasting at a petabyte scale. Using Databricks features, you can set up your Spark environment in minutes and autoscale quickly and easily.
Sharding Strategies
Lookup - map each key to its shard via a lookup table Range - similar data in the same shard Hash - spread data evenly across shards.
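The range and hash strategies can be sketched as key-to-shard functions; the shard count, key names, and month-per-shard granularity below are illustrative assumptions.

```python
import hashlib

NUM_SHARDS = 4

def hash_shard_for(key: str) -> int:
    # A stable hash of the sharding key spreads rows evenly across shards.
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def range_shard_for(order_date: str) -> str:
    # Range sharding keeps similar keys together, e.g. one shard per month.
    return order_date[:7]

print(hash_shard_for("customer-123"))  # some value in 0..3
print(range_shard_for("2023-05-17"))   # → 2023-05
```

Hash sharding balances load but scatters related rows; range sharding keeps related rows together (good for time-range queries) at the risk of hot shards, which is why the choice of sharding key matters.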
Integrate hub
Manage integration pipelines within the Integrate hub. If you are familiar with Azure Data Factory (ADF), then you will feel at home in this hub. The pipeline creation experience is the same as in ADF, which gives you another powerful integration built into Azure Synapse Analytics, removing the need to use Azure Data Factory for data movement and transformation pipelines.
Azure HDInsight
Microsoft's implementation of open-source Apache Spark managed within the realms of Azure When you want the benefits of OSS Spark with the SLA of a provider Open-source professionals wanting SLAs and Microsoft Data Platform experts To take advantage of the OSS big data analytics platform with SLAs in place to ensure business continuity
Cluster
One and only one driver
OLTP
Online Transaction Processing
Interactive query
Querying batch data at "human" interactive speeds, which with the current generation of technologies means results are ready in time frames measured in seconds to minutes.
Stages
Read Select Filter GroupBy Select Filter Write
Distribution
Regions Availability zones Manual failover Last sync time
Column-oriented Databases
Relies on a single index per column Fast bulk operations Easy to add columns, hard to add rows Slow concurrent CRUD operations
Star Schema Drawbacks
Requires expertise to design the model and configure processes Requires ongoing maintenance.
Partition Cosmos
Resources are subject to quota limitations Database queries are scoped at the collection level
Planning Data Lake Access Management
Role-based Access Control Access control lists Security groups
Lifecycle Management
Rules/policies Lifecycle management Automated tier migration
job responsibilities
SQL Server professionals generally work only with relational database systems. Data engineers also work with unstructured data and a wide variety of new data types, such as streaming data.
Cloud Scalability
Scalability in on-premises systems is complicated and time-consuming. But scalability in the cloud can be as simple as a mouse click. Typically, scalability in the cloud is measured in compute units. Compute units might be defined differently for each Azure product.
Sharding Pattern Benefits
Scale out systems by inserting shards on additional storage nodes Balancing workload across shards reduces contention Each storage node can use off-the-shelf hardware Shards can be physically relocated closer to the users accessing the data
Geo-redundant Application Design
Secondary region read-only Asynchronous replication Last Sync Time Storage Client Library Failover
Data warehouse architecture components
Select integrate hub Select manage hub Select develop hub Select data hub
Data hub
Select the Linked tab (1), expand the Azure Data Lake Storage Gen2 group
Partition Vertical
Set security controls for each partition Reduces concurrent access Separate slow-moving and dynamic data Reduces I/O and performance cost
Azure Data Lake Gen2 Access Control
Shared key Shared access signature (SAS) Role-based Access Control (RBAC) Access control lists
Apache Spark pools in Azure Synapse Analytics
Speed and Efficiency: There is a quick start-up time for nodes and automatic shut-down when instances are not used within 5 minutes after the last job, unless there is a live notebook connection. Ease of creation: Creating an Apache Spark pool can be done through the Azure portal, PowerShell, or .NET SDK for Azure Synapse Analytics. Ease of use: Within the Azure Synapse Analytics workspace, you can connect directly to the Apache Spark pool and interact with the integrated notebook experience, or use custom notebooks derived from Nteract. Notebook integration helps you develop interactive data processing and visualization pipelines. REST API IDE Pre-loaded Anaconda libraries Scalability
Snowflake schema
Suitable for data warehouses Normalized dimension tables Less space, more complex Slower, more complex SQL queries No redundant data
Star schema
Suitable for data marts Dimension tables not normalized More space, less complex Faster, less complex SQL queries Includes redundant data
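The trade-off between the two schemas shows up in query shape. A minimal star-schema query (table and column names invented for illustration) joins a fact table directly to a denormalized dimension:

```python
# Star-schema query against an in-memory SQLite database: the fact
# table joins directly to one denormalized dimension table. In a
# snowflake schema, dim_product would itself join to a normalized
# category table, adding one more join per level of normalization.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY,
                              name TEXT, category TEXT);  -- redundant category
    CREATE TABLE fact_sales (product_key INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware');
    INSERT INTO fact_sales VALUES (1, 10.0), (1, 5.0);
""")
rows = con.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product d USING (product_key)
    GROUP BY d.category
""").fetchall()
# rows == [('Hardware', 15.0)]
```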
SCIM integration
System for Cross-domain Identity Management, an open standard that allows you to automate user provisioning.
Modern data warehousing
This involves the ability to integrate all data, including big data, to reason over data for analytics and reporting purposes from a descriptive analytics perspective, independent of its location or structure.
Prescriptive analytics
This type of analytics looks at executing actions based on real-time or near real-time analysis of data, using predictive analytics. Azure Synapse Analytics provides this capability through Apache Spark and Azure Synapse Link, and by integrating streaming technologies such as Azure Stream Analytics.
Geo-redundant Storage
Three LRS copies in the primary region, plus three more copies replicated asynchronously to a paired secondary region (six copies in total)
Data ingestion and preparation
To ingest data, customers can do so code-free with over 100 data integration connectors in Azure Data Factory. Data Factory empowers customers to perform code-free ETL/ELT, including preparation and transformation. And while many customers are heavily invested in the SQL Server Integration Services (SSIS) packages they created, they can run those packages in Azure Data Factory without having to rewrite them.
Secrets
Using the Secrets APIs, secrets can be stored securely in an Azure Key Vault or a Databricks backend, and authorized users can consume them to access services. Azure Databricks has two types of secret scopes: Key Vault-backed and Databricks-backed. These secret scopes allow you to store secrets, such as database connection strings, securely. If someone tries to output a secret to a notebook, it is replaced by [REDACTED]. This helps prevent someone from viewing the secret or accidentally leaking it when displaying or sharing the notebook.
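The redaction behavior described above can be sketched as a simple filter over notebook output. This is only an illustration of the idea, not Databricks' actual implementation; `dbutils.secrets.get` is the real API for reading a secret from a scope:

```python
# Sketch of output redaction: any known secret value appearing in
# notebook output is replaced by [REDACTED], mirroring how Azure
# Databricks masks secrets fetched from a secret scope.
def redact(output: str, secrets: list[str]) -> str:
    for value in secrets:
        output = output.replace(value, "[REDACTED]")
    return output

conn_str = "Server=db;Password=s3cr3t!"  # imagine dbutils.secrets.get(...)
print(redact(f"connecting with {conn_str}", [conn_str]))
# prints: connecting with [REDACTED]
```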
Integrated analytics
With the variety of analytics that can be performed on the data at your disposal, putting together the services in a cohesive solution can be a complex operation. Azure Synapse Analytics removes this complexity by integrating the analytics landscape into one service. That way you can spend more time working with the data to bring business benefit, rather than provisioning and maintaining multiple systems to achieve the same outcomes.
Account Fail-over
Write to primary, read-only replications in secondary Modify Azure DNS Re-enable Geo-redundancy
Azure HDInsight (HDI)
is an implementation by Microsoft of open-source Apache Spark, managed on the Azure Platform. You can use HDI for an Apache Spark environment when you are aware of the benefits of Apache Spark in its OSS form, but you want a Service Level Agreement (SLA). This implementation is usually of interest to open-source professionals needing an SLA and data platform experts experienced with Microsoft products and services.
Azure Synapse Analytics
is an integrated analytics platform, which combines data warehousing, big data analytics, data integration, and visualization into a single environment.
Azure Private Link
is currently the most secure way to access Azure data services from Azure Databricks. Private Link enables you to access Azure PaaS Services (for example, Azure Storage, Azure Cosmos DB, and SQL Database) and Azure hosted customer/partner services over a Private Endpoint in your virtual network.
Access control (IAM)
is the blade that you use to assign roles to grant access to Azure resources. It's also known as identity and access management and appears in several locations in the Azure portal.
Atomicity
means a transaction must execute exactly once and must be atomic; either all of the work is done, or none of it is. Operations within a transaction usually share a common intent and are interdependent.
Durability
means that the changes made due to the transaction are permanently saved in the system. Committed data is saved by the system so that even in the event of a failure and system restart, the data is available in its correct state.
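Both properties can be seen with SQLite's transaction support: either all statements in the transaction commit, or a failure rolls every one of them back.

```python
# Atomicity demo with SQLite: a failed transfer rolls back both
# updates, so the accounts never reflect half of the work.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
con.executemany("INSERT INTO accounts VALUES (?, ?)",
                [("alice", 100), ("bob", 0)])
con.commit()

try:
    with con:  # context manager commits on success, rolls back on error
        con.execute("UPDATE accounts SET balance = balance - 50 "
                    "WHERE name = 'alice'")
        raise RuntimeError("crash before the matching credit")
except RuntimeError:
    pass

# The debit was rolled back: Alice still has her full balance.
balances = dict(con.execute("SELECT name, balance FROM accounts"))
# balances == {'alice': 100, 'bob': 0}
```

Durability is what a committed version of this transaction would add: once the commit succeeds on a file-backed database, the transfer survives a process crash or restart.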
What is a cluster?
A set of networked computers that work together to process your data. The first step is to create a cluster.
OLAP
online analytical processing