AZURE DP-203

Temporal Database

A database that contains time-varying historical data with the possible inclusion of current and future data and has the ability to manipulate this data

orchestration

ADF can use a similar approach: whilst it has native functionality to ingest and transform data, it sometimes instructs another service to perform the actual work on its behalf, such as Azure Databricks executing a transformation query. In that case, Databricks performs the work, not ADF. ADF merely orchestrates the execution of the query, and then provides the pipelines to move the data on to the next step or destination.

Data Sharding: single-server limitations that motivate scaling out

Computing resources
Geography of data and users
Network bandwidth
Storage space

SSIS package execution

Natively execute SQL Server Integration Services (SSIS) packages in a managed Azure compute environment.

Data Lake Storage

Scalable, durable, secure

Big Data Components

Source
Storage
Process
Processed data storage
Reporting

Index Table Pattern

Create indexes over the fields in data stores
Ability to emulate secondary indexes

Partition SQL Database

Elastic pools support horizontal scale
Shards can hold more than one dataset (shardlet)
Shardlets in the same shard map should have the same schema
Avoid mixing highly active shardlets in the same shard

Partitioning Purpose

Improve availability
Improve performance
Improve scalability
Improve security

Design ingestion patterns

Integrate hub
Develop hub

Simple Repartition

Only writing to a single Hive partition

RDD

Resilient Distributed Datasets
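
A minimal PySpark sketch of working with an RDD; the session setup and the toy data are illustrative, not from the source.

```python
from pyspark.sql import SparkSession

# Assumes a local Spark installation; in Databricks or Synapse a session already exists.
spark = SparkSession.builder.appName("rdd-demo").getOrCreate()

# An RDD is a fault-tolerant collection partitioned across the cluster.
rdd = spark.sparkContext.parallelize(range(10), numSlices=4)

# Transformations (map) are lazy; the action (reduce) triggers execution.
total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)  # 285
```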

Manage hub

Select SQL pools
Drag performance level

Performance Tiers

Standard, Premium

Graph database

A graph database offers an alternative way to track relationships; its structure resembles sociograms with their interlinked nodes

Partitioning Data Lakes Table Storage

A key-value store that is designed to support partitioning
Partitions are managed internally by Table storage
Each entity must include a partition key and row key

When to use Azure Synapse Analytics

Across all organizations and industries, the common use cases for Azure Synapse Analytics are identified by the need for modern data warehousing, advanced analytics, data exploration and discovery, real-time analytics, data integration, and integrated analytics.

Publish

After the raw data has been refined into a business-ready consumable form from the transform and enrich phase, you can load the data into Azure Data Warehouse, Azure SQL Database, Azure Cosmos DB, or whichever analytics engine your business users can point to from their business intelligence tools

Azure Databricks

An advanced analytics, managed Apache Spark-as-a-Service solution
Provides an end-to-end data engineering and data science solution and management platform
Of interest to data engineers and data scientists working on big data projects every day
Provides the ability to create and manage an end-to-end big data/data science project using one platform

Data Sink

An external entity that consumes information generated by a system

Apache Spark

Apache Spark is an open-source, memory-optimized system for managing big data workloads. It is used when you want a Spark engine for big data processing or data science and you don't mind that there is no service level agreement provided to keep the services running. Usually, it is of interest to open-source professionals, and the reason for Apache Spark is to overcome the limitations of what were known as SMP systems for big data workloads.

Change data processes

As a data engineer you'll extract raw data from a structured or unstructured data pool and migrate it to a staging data repository. Because the data source might have a different structure than the target destination, you'll transform the data from the source schema to the destination schema. This process is called transformation. You'll then load the transformed data into the data warehouse. Together, these steps form a process called extract, transform, and load (ETL). An alternative approach is extract, load, and transform (ELT). In ELT, the data is immediately extracted and loaded into a large data repository such as Azure Cosmos DB or Azure Data Lake Storage. This change in process reduces the resource contention on source systems. Data engineers can begin transforming the data as soon as the load is complete.
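
A minimal PySpark sketch of the ELT flow described above: extract and load the raw data first, then transform once the load is complete. The paths and column names are hypothetical.

```python
from pyspark.sql import functions as F

# Extract + Load: land the raw file in the data lake as-is (hypothetical paths).
raw = spark.read.option("header", True).csv("/landing/sales.csv")
raw.write.mode("overwrite").parquet("/datalake/raw/sales")

# Transform later, after the load, reducing contention on the source system.
sales = spark.read.parquet("/datalake/raw/sales")
cleaned = (sales
           .withColumn("amount", F.col("amount").cast("double"))
           .dropna(subset=["customer_id"]))
cleaned.write.mode("overwrite").parquet("/datalake/curated/sales")
```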

Web

As a data engineer, use the Azure Cosmos DB multi-master replication model to create a data architecture that supports web and mobile applications. Thanks to Microsoft performance commitments, these applications can achieve a response time of less than 10 ms anywhere in the world. By reducing the processing time of their websites, global organizations can increase customer satisfaction.

Hierarchical Namespace

Atomic directory operations for analytics processing
ACLs for directories and files
Bottleneck management
Performance optimizations
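
A short sketch of atomic directory operations with the azure-storage-file-datalake SDK; the account, credential, and paths are placeholders.

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholders: supply your own account URL and credential.
service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential="<account-key>",
)
fs = service.get_file_system_client("lake")

# With a hierarchical namespace, directories are real objects:
# creating or renaming one is a single atomic metadata operation.
directory = fs.create_directory("raw/sales")
directory.rename_directory("lake/raw/sales_2023")
```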

Data Encryption

Automatic encryption at rest
Client-side encryption

Access Control

Azure Active Directory
Role-based access control
Shared access signature
Shared key
Anonymous

Azure Data Lake Storage Gen2

Built on Azure Blob storage
Hierarchical namespace
File system driver and REST API
ABFS (Azure Blob File System) driver
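
The ABFS driver surfaces the lake to Spark through abfss:// URIs. A one-line read sketch; the container, account, and path are placeholders.

```python
# URI format: abfss://<container>@<account>.dfs.core.windows.net/<path>
df = spark.read.parquet("abfss://raw@<account>.dfs.core.windows.net/sales/2023/")
df.show(5)
```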

ADF

Azure Data Factory

Monitor

Azure Data Factory has built-in support for pipeline monitoring via Azure Monitor, API, PowerShell, Azure Monitor logs, and health panels on the Azure portal, to monitor the scheduled activities and pipelines for success and failure rates.

Disable Geo-replication capabilities

Azure Data Lake Gen2

Data Migration

Manual options: AzCopy, .NET library, Blobfuse
Azure Data Factory
Azure Data Box

Blob Types

Block, append, page

Latency

Calculate requirements
End-to-end latency
Server latency

Data Lake Characteristics

Centralized, schema-less data
Rapid ingestion
Map and control data
Data catalog
Self-service

Circuit Breaker Pattern

Closed, Open, Half-Open
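
A minimal Python sketch of the three states named above; the threshold and timeout values are illustrative.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"   # allow one trial call through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"        # trip the breaker
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = "closed"          # success resets the breaker
            return result
```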

Cloud Support

Cloud systems are easy to support because the environments are standardized. When Microsoft updates a product, the update applies to all consumers of the product.

Cloud Multilingual support

Cloud systems often store data as a JSON file that includes the language code identifier (LCID). The LCID identifies the language that the data uses. Apps that process the data can use translation services such as the Bing Translator API to convert the data into an expected language when the data is consumed or as part of a process to prepare the data.

Transform and enrich

Compute services such as Databricks and Machine Learning can be used to prepare or produce transformed data on a maintainable and controlled schedule to feed production environments with cleansed and transformed data. In some instances, you may even augment the source data with additional data to aid analysis, or consolidate it through a normalization process to be used in a Machine Learning experiment as an example.

Data movement

Copy data across data stores in public network and data stores in private network (on-premises or virtual private network). It provides support for built-in connectors, format conversion, column mapping, and performant and scalable data transfer.

Open Synapse Studio

Create a Spark notebook in Azure Synapse Analytics

Star Schema Design Considerations

Current and historical data
Emphasis on fast reads
Redundant data storage
Data that isn't real-time

Linked Service

Data Factory supports a wide variety of data sources that you can connect to through the creation of an object known as a Linked Service, which enables you to ingest the data from a data source in readiness to prepare the data for transformation and/or analysis. In addition, Linked Services can fire up compute services on demand.

Data Lake Zoning

Data separation
Governance
Service level agreements
Security

Planning Data Lake Governance

Data catalog
Data quality
Compliance
Self-service
Growth

Object Replication

Data distribution and tiering
Applied through policy
Asynchronous
Blob versioning
Constraints and limitations

Data Lake Challenges

Data governance
Data swamp (uncontrolled)
Security
Emerging tech/skill sets

Data Lake Maturity

Data ingestion and storage
Sandbox/experimentation
Complements warehouse
Drives data operations (final)

Data Lake Security

Data security
Access control
Network security
Application security

Data Lake Architecture

Decoupled, agile, resilient, auditable

Extract

Define the data source: Identify source details such as the resource group, subscription, and identity information such as a key or secret.
Define the data: Identify the data to be extracted. Define data by using a database query, a set of files, or an Azure Blob storage name for blob storage.

Load

Define the destination: During a load, many Azure destinations can accept data formatted as JavaScript Object Notation (JSON), files, or blobs. You might need to write code to interact with application APIs.
Start the job: Test the ETL job in a development or test environment. Then migrate the job to a production environment to load the production system.
Monitor the job: ETL operations can involve many complex processes. Set up a proactive and reactive monitoring system to provide information when things go wrong. Set up logging according to the technology that will use it.

Data Lake Use Cases

Descriptive analysis
Diagnostic analysis
Predictive analysis
Prescriptive analysis
Data mining

Descriptive analytics

Descriptive analytics answers the question "What is happening in my business?" This question is typically answered through the creation of a data warehouse. Azure Synapse Analytics leverages the dedicated SQL pool capability that enables you to create a persisted data warehouse to perform this type of analysis. You can also make use of the serverless SQL pool to prepare data from files stored in a data lake to create a data warehouse interactively.

Diagnostic analytics

Diagnostic analytics deals with answering the question "Why is it happening?". This may involve exploring information that already exists in a data warehouse, but typically involves a wider search of your data estate to find more data to support this type of analysis.

Activity dispatch

Dispatch and monitor transformation activities running on a variety of compute services such as Azure Databricks, Azure HDInsight, Azure Machine Learning, Azure SQL Database, SQL Server, and more.

On-premises Licensing

Each OS that's installed on a server might have its own licensing cost. OS and software licenses are typically sold per server or per CAL (Client Access License). As companies grow, licensing arrangements become more restrictive.

Data Lake Data Catalog

Find existing data
Discover new data sources
Collect metadata
Categorize and assign tags

Materialized View Pattern

Generates prepopulated views in advance
Not the authoritative source of data (temporary)
Bridges different data stores
Creates views over data that is difficult to query directly

Redundancy Options

Geo-redundant
Geo-zone-redundant

Benefits Azure Blob Storage

High availability
Encrypted
Scalable
Managed service
Client access libraries

Partition Shard

Horizontal partitioning
Choose the correct sharding key

Access Tiers

Hot - frequently accessed
Cool - not frequently accessed
Archive - lowest storage cost, highest access cost
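
A sketch of moving a blob between tiers with the azure-storage-blob SDK; the connection string and blob location are placeholders.

```python
from azure.storage.blob import BlobClient

# Placeholder connection string and blob location.
blob = BlobClient.from_connection_string(
    "<connection-string>", container_name="logs", blob_name="2023/app.log"
)

# Rarely read data can be pushed to a cheaper tier; reading it back later
# costs more (and, for Archive, requires rehydration first).
blob.set_standard_blob_tier("Cool")
```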

On-premises Support

Hundreds of vendors sell physical server hardware. This variety means server administrators might need to know how to use many different platforms. Because of the diverse skills required to administer, maintain, and support on-premises systems, organizations sometimes have a hard time finding server administrators to hire.

VNet Injection

If you're looking to do specific network customizations, you could deploy Azure Databricks data plane resources in your own VNet. In this scenario, instead of using the managed VNet, which restricts you from making changes, you "bring your own" VNet where you have full control. Azure Databricks will still create the managed VNet, but it will not use it.

On-premises Multilingual support

In on-premises SQL Server systems, multilingual support is difficult and expensive. One issue with multiple languages is the sorting order of text data. Different languages can sort text data differently. To address this issue, the SQL Server database administrator must install and configure the data's collation settings. But these settings can work only if the SQL database developers considered multilingual functionality when they were designing the system. Systems like this are complex to manage and maintain.

Star Schema Benefits

Increases simplicity and understandability
Diminishes complexity when building reports
Creates a single source of consolidated data

Azure Data Lake Gen2 Processing Big Data

Ingest
Store
Prepare data
Presentation

Apache Spark

An open-source, memory-optimized system for managing big data workloads
Use when you want the benefits of Apache Spark for big data processing and/or data science work without the service level agreements (SLAs) of a provider
Of interest to open-source professionals
Exists to overcome the limitations of symmetric multiprocessing (SMP) systems imposed on big data workloads

Set-up Azure Data Factory

It is easy to set up Azure Data Factory from within the Azure portal; you only require the following information:
Name: The name of the Azure Data Factory instance
Subscription: The subscription in which the ADF instance is created
Resource group: The resource group where the ADF instance will reside
Version: Select V2 for the latest features
Location: The datacenter location in which the instance is stored
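
The same inputs appear when creating a factory programmatically. A hedged sketch using the azure-mgmt-datafactory SDK; the subscription ID, resource group, factory name, and region are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

# Subscription selects where the instance is created and billed (placeholder ID).
client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Name, resource group, and location mirror the portal fields; V2 is implied by the SDK.
factory = client.factories.create_or_update(
    resource_group_name="my-rg",
    factory_name="my-adf-instance",
    factory=Factory(location="westeurope"),
)
print(factory.provisioning_state)
```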

Relational Schema Design Considerations

Minimal data redundancy
Optimized for fast read and write operations
Real-time and current data

Partition Design Patterns

Minimize cross-partition joins
Replicate static reference data
Periodically rebalance shards

On-premises Computing environment

On-premises environments require physical equipment to execute applications and services. This equipment includes physical servers, network infrastructure, and storage. The equipment must have power, cooling, and periodic maintenance by qualified personnel. A server needs at least one operating system (OS) installed. It might need more than one OS if the organization uses virtualization technology.

On-premises Maintenance

On-premises systems require maintenance for the hardware, firmware, drivers, BIOS, operating system, software, and antivirus software. Organizations try to reduce the cost of this maintenance where it makes sense.

Parquet

An open-source columnar storage format for Hadoop

IoT solutions

Over the last couple of years, hundreds of thousands of devices have been produced to generate sensor data. These are known as IoT devices. Using technologies like Azure IoT Hub, you can design a data solution architecture that captures information from IoT devices so that the information can be analyzed.

Data Lake Folder Structure

Parent - raw data
Source - source systems
Entity
Date

Partitioning

Partition ranges
Block sizes
Appropriate naming

Planning a Data Lake

Platform quotas
Lake placement
Lake distribution
Management
Silos

Real-time / near real-time

Processing of a typically infinite stream of input data, whose time until results are ready is short, measured in milliseconds or seconds in the longest of cases.

Analytical Workloads

Provide data for reporting, decision making, and planning. Generally centered around the SELECT command.

Batch

Queries or programs that take tens of minutes, hours, or days to complete. Activities could include initial data wrangling, complete ETL pipeline, or preparation for downstream analytics.

Partition Azure Service Bus

Queues and topics are scoped to the Service Bus namespace
Each namespace imposes quotas on available resources
Messages that are sent as part of a transaction must specify a partition key

Planning Data Lake Structure

Raw zone
Cleansed zone
Curated zone
Exploratory zone

Partition Table

Select partition and row keys by how data is accessed
Supports transactional operations in the same partition
Use vertical partitioning for dividing fields into groups

What is Apache Spark?

Spark is a unified processing engine that can analyze big data using SQL, machine learning, graph processing, or real-time stream analysis
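
A small sketch of that unified engine: the same DataFrame can be analyzed with the DataFrame API or plain SQL. The file path and columns are hypothetical.

```python
# Load once, then analyze with either the DataFrame API or SQL.
events = spark.read.json("/data/events.json")
events.createOrReplaceTempView("events")

top_countries = spark.sql("""
    SELECT country, COUNT(*) AS n
    FROM events
    GROUP BY country
    ORDER BY n DESC
""")
top_countries.show(10)
```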

Anatomy of Blob Storage

Storage account
Container
Files and blobs

Connect and collect

The first step in building an orchestration system is to define and connect all the required sources of data together, such as databases, file shares, and FTP web services. The next step is to ingest the data as needed to a centralized location for subsequent processing.

developer's and learner's perspective

The number of partitions my data is divided into. The number of slots I have for parallel execution. How many jobs am I triggering? And lastly, the stages those jobs are divided into.

Data exploration and discovery

The serverless SQL pool functionality provided by Azure Synapse Analytics enables data analysts, data engineers, and data scientists alike to explore the data within your data estate. This capability supports data discovery, diagnostic analytics, and exploratory data analysis.

Document database

They are structured much like JSON documents, so they excel at heterogeneous data formats and are easy to implement

Databricks runtime best practices

Tune shuffle for optimal performance
Partition your data

Row-oriented Databases

Typically requires multiple indexes
Slow bulk operations
Easy to add rows, hard to add columns
Fast concurrent CRUD operations

Azure Data Lake Gen2 Best Practice

Use security groups
Integrated firewall
Hadoop distributed copy
Scheduled data copy
Directory layout

Partitioning Data Lakes Blob Storage

Use when you need to upload or download large volumes of data quickly
Makes it possible to hold large binary objects
Containers group related blobs with the same security requirements

VNet Peering

Virtual network (VNet) peering allows the virtual network in which your Azure Databricks resource is running to peer with another Azure virtual network. Traffic between virtual machines in the peered virtual networks is routed through the Microsoft backbone infrastructure, much like traffic is routed between virtual machines in the same virtual network, through private IP addresses only.

Parquet

The recommended data format for storing refined versions of data for possible querying

Cloud Lift and shift

When moving to the cloud, many customers migrate from physical or virtualized on-premises servers to Azure Virtual Machines. This strategy is known as lift and shift. Server administrators lift and shift an application from a physical environment to Azure Virtual Machines without rearchitecting the application.

Databricks appliance

When you create an Azure Databricks service, a "Databricks appliance" is deployed as an Azure resource in your subscription.

Geo-zone redundant Storage

Zone-redundant in the primary region
Locally redundant in the secondary region
Sixteen 9s durability

Isolation

ensures that one transaction is not impacted by another transaction.

Consistency

ensures that the data is consistent both before and after the transaction.

Apache Spark pools in Azure Synapse Analytics

has Apache Spark capabilities embedded. For organizations that don't have existing Apache Spark implementations yet, Apache Spark pools in Azure Synapse Analytics provide the functionality to spin up an Apache Spark cluster to meet data engineering needs without the overhead of the other Apache Spark platforms. Data engineers, data scientists, data platform experts, and data analysts can come together within Azure Synapse Analytics where the Apache Spark cluster is running to quickly collaborate on various analytical solutions.

Apache Spark notebook

is a collection of cells. These cells are run to execute code, to render formatted text, or to display graphical visualizations.

Key value stores

Stores key-value pairs
Fast lookup: key -> value
Massive scalability
Good for simple associative data and big data
Bad for complex, highly relational data
Example: Redis
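
A minimal redis-py sketch of the key -> value lookup; the local server and key names are illustrative.

```python
import redis

# Assumes a Redis server running on localhost.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

r.set("user:42:name", "Ada")    # write a key-value pair
print(r.get("user:42:name"))    # fast lookup by key -> "Ada"
```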

Coalesce

To come together
Only writing to a single Hive partition
Spark <word> first
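
A sketch contrasting coalesce with repartition in PySpark; the data and output path are illustrative.

```python
df = spark.range(1_000_000)

wide = df.repartition(8)    # full shuffle: redistributes all rows
narrow = wide.coalesce(1)   # narrow: merges existing partitions, no shuffle

# One output partition means a single file per Hive partition on write.
narrow.write.mode("overwrite").parquet("/tmp/one-file")
```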

Cloud Computing environment

"Cloud computing environments provide the physical and logical infrastructure to host services, virtual servers, intelligent applications, and containers for their subscribers. Different from on-premises physical servers, cloud environments require no capital investment. Instead, an organization provisions service in the cloud and pays only for what it uses. Moving servers and services to the cloud also reduces operational costs. Within minutes, an organization can provision anything from virtual servers to clusters of containerized apps by using Azure services. Azure automatically creates and handles all of the physical and logical infrastructure in the background. In this way, Azure reduces the complexity and cost of creating the services."

On-premises Availability

High-availability systems must be available most of the time. Service-level agreements (SLAs) specify your organization's availability expectations. System uptime can be expressed as three nines, four nines, or five nines. These expressions indicate system uptimes of 99.9 percent, 99.99 percent, or 99.999 percent. To calculate system uptime in terms of hours, multiply these percentages by the number of hours in a year (8,760).

On-premises Scalability

When administrators can no longer scale up a server, they can instead scale out their operations. To scale an on-premises server horizontally, server administrators add another server node to a cluster. Clustering uses either a hardware load balancer or a software load balancer to distribute incoming network requests to a node of the cluster. A limitation of server clustering is that the hardware for each server in the cluster must be identical. So when the server cluster reaches maximum capacity, a server administrator must replace or upgrade each node in the cluster.

On-premises Total cost of ownership

Describes the final cost of owning a given technology. In on-premises systems, TCO includes the following costs:
Hardware
Software licensing
Labor
Datacenter overhead

Parquet vs CSV

Parquet reduces stored and scanned data
CSV is simple and widely used
Parquet utilizes efficient columnar storage
CSV is stored as row-based data
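
A sketch of converting CSV to Parquet and benefiting from columnar pruning; the paths and column names are hypothetical.

```python
# Row-based CSV in, columnar Parquet out.
csv_df = (spark.read
               .option("header", True)
               .option("inferSchema", True)
               .csv("/data/sales.csv"))
csv_df.write.mode("overwrite").parquet("/data/sales.parquet")

# Selecting two columns from Parquet scans only those columns on disk.
spark.read.parquet("/data/sales.parquet").select("region", "amount").show()
```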

data-driven workflow

Data-driven workflows (called pipelines) can ingest data from disparate data stores. You can build complex ETL processes that transform data visually with data flows or by using compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure Synapse Analytics.

Zone Redundant Storage

3 replicas over 3 availability zones
Twelve 9s durability
Automated failover
Geo restrictions
No archive tier

Locally Redundant Storage

3 replicas within the data center
Eleven 9s durability
Synchronous writes

Cloud Total cost of ownership

A subscription can be based on usage that's measured in compute units, hours, or transactions. The cost includes hardware, software, disk storage, and labor. Because of economies of scale, an on-premises system can rarely compete with the cloud when service usage is measured.

cluster best practices

Arrive at the correct cluster size by iterative performance testing
Workload requires caching (like machine learning)
ETL and analytic workloads

Azure Databricks

Azure Databricks is a managed Apache Spark-as-a-Service proprietary solution that provides an end-to-end data engineering/data science platform. Azure Databricks is of interest to many data engineers and data scientists working on big data projects today. It provides the platform on which you can create and manage big data/data science projects all in one place. These services are not mutually exclusive. It is common to find customers who use a combination of these technologies working together.

Azure Data Lake Gen2 Security

Azure Defender
Azure Storage encryption
Private endpoints

Predictive analytics

Azure Synapse Analytics also enables you to answer the question "What is likely to happen in the future based on previous trends and patterns?" by using its integrated Apache Spark engine. Azure Synapse Spark pools can be used with other services such as Azure Machine Learning Services, or Azure Databricks.

Real time analytics

Azure Synapse Analytics can capture, store and analyze data in real-time or near-real time with features such as Azure Synapse Link, or through the integration of services such as Azure Stream Analytics and Azure Data Explorer.

Data integration

Azure Synapse Pipelines enables you to ingest, prepare, model and serve the data to be used by downstream systems. This can be used by components of Azure Synapse Analytics exclusively.

Ingest and Prep

Azure Synapse SQL serverless
Azure Synapse Spark
Azure Synapse Pipelines
Azure Data Factory
Azure Databricks

Cloud Availability

Azure duplicates customer content for redundancy and high availability. Many services and platforms use SLAs to ensure that customers know the capabilities of the platform they're using.

Partitioning

Partitioning by time is a frequently used strategy.
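
A sketch of time-based partitioning on write in PySpark; it assumes the DataFrame carries year and month columns (hypothetical).

```python
# Each distinct (year, month) pair becomes its own folder,
# so time-bounded queries read only the partitions they need.
(df.write
   .partitionBy("year", "month")
   .mode("append")
   .parquet("/datalake/curated/sales"))
```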

Security best practice

Consider isolating each workspace in its own VNet
Do not store any production data in the default Databricks File System (DBFS) folders
Always hide secrets in a key vault
Access control: Azure Data Lake Storage (ADLS) passthrough
Configure audit logs and resource utilization metrics to monitor activity
Query VM metrics in Log Analytics once you have started collection

Partition Azure Search

Create an instance per geographic region
Create a local and global service
Can be divided into 1-4, 6, or 12 partitions
Each partition can store up to 300 GB

Data Factory Contributor role

Create, edit, and delete data factories and child resources including datasets, linked services, pipelines, triggers, and integration runtimes
Deploy Resource Manager templates (Resource Manager deployment is the deployment method used by Data Factory in the Azure portal)
Manage App Insights alerts for a data factory
At the resource group level or above, lets users deploy Resource Manager templates
Create support tickets

Apache Spark for Azure Synapse

Embedded Apache Spark capability within Azure Synapse Analytics, residing on the same platform that contains data warehouses and data integration capabilities, and integrating with other Azure services
Enables organizations without existing Apache Spark implementations to fire up an Apache Spark cluster to meet data engineering needs without the overhead of the other Apache Spark platforms listed
Of interest to data engineers, data scientists, data platform experts, and data analysts
Provides the ability to scale efficiently with Apache Spark clusters within a one-stop-shop analytical platform

Partitioning Data Lakes Queue Storage

Enables asynchronous messaging between processes
Maximum individual message size: 64 KB
Can contain any number of queues or messages
Can handle up to 2,000 messages per second

Advanced analytics

Enables organizations to perform predictive analytics using both the native features of Azure Synapse Analytics, and integrating with other technologies such as Azure Databricks.

Develop hub

Expand SQL scripts

tools and integration best practices

Favor cluster-scoped init scripts over global and named scripts
Use the cluster log delivery feature to manage logs

Partition Functional

Improve isolation and data access performance
Separate read-only and read-write data

Azure Synapse Data Analytics

Improve performance by applying filters
Improve performance by partitioning while loading
Implements dropping of the oldest partition

Cloud Maintenance

In the cloud, Microsoft manages many operations to create a stable computing environment. This service is part of the Azure product benefit. Microsoft manages key infrastructure services such as physical hardware, computer networking, firewalls and network security, datacenter fault tolerance, compliance, and physical security of the buildings. Microsoft also invests heavily to battle cybersecurity threats, and it updates operating systems and firmware for the customer. These services allow data engineers to focus more on data engineering and eliminating system complexity.

Healthcare

In the healthcare industry, use Azure Databricks to accelerate big-data analytics and AI solutions. Apply these technologies to genome studies or pharmacy sales forecasting at a petabyte scale. Using Databricks features, you can set up your Spark environment in minutes and autoscale quickly and easily.

Sharding Strategies

Lookup
Range - similar data in the same shard
Hash - spread data evenly across shards
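
A minimal Python sketch of the hash strategy: a stable hash of the sharding key spreads rows evenly across shards. The shard names are illustrative.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(key: str) -> str:
    # A stable hash (unlike Python's salted built-in hash()) gives the
    # same placement across processes and restarts.
    digest = hashlib.sha256(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("customer-1042"))  # always maps to the same shard
```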

Integrate hub

Manage integration pipelines within the Integrate hub. If you are familiar with Azure Data Factory (ADF), then you will feel at home in this hub. The pipeline creation experience is the same as in ADF, which gives you another powerful integration built into Azure Synapse Analytics, removing the need to use Azure Data Factory for data movement and transformation pipelines.

Azure HDInsight

Microsoft's implementation of open-source Apache Spark, managed within the realms of Azure
Use when you want the benefits of OSS Spark with the SLA of a provider
Of interest to open-source professionals wanting SLAs and Microsoft data platform experts
Exists to take advantage of the OSS big data analytics platform with SLAs in place to ensure business continuity

Cluster

One and only one driver

OLTP

Online Transaction Processing

Interactive query

Querying batch data at "human" interactive speeds, which with the current generation of technologies means results are ready in time frames measured in seconds to minutes.

Stages

Read -> Select -> Filter -> GroupBy -> Select -> Filter -> Write
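
The sequence above maps onto a PySpark job like the following sketch; Spark fuses the narrow steps into one stage and starts a new stage at the groupBy shuffle. The paths and columns are hypothetical.

```python
df = spark.read.parquet("/data/events")       # Read

result = (df.select("country", "amount")      # Select: narrow, same stage
            .filter(df.amount > 0)            # Filter: narrow, same stage
            .groupBy("country")               # GroupBy: shuffle starts a new stage
            .sum("amount"))

result.write.parquet("/data/by-country")      # Write: the action that runs the job
```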

Distribution

Regions
Availability zones
Manual failover
Last sync time

Column-oriented Databases

Relies on a single index per column
Fast bulk operations
Easy to add columns, hard to add rows
Slow concurrent CRUD operations

Star Schema Drawbacks

Requires expertise to design the model and configure processes
Requires ongoing maintenance

Partition Cosmos

Resources are subject to quota limitations
Database queries are scoped at the collection level

Planning Data Lake Access Management

Role-based access control
Access control lists
Security groups

Lifecycle Management

Rules/policies
Lifecycle management
Automated tier migration

job responsibilities

SQL Server professionals generally work only with relational database systems. Data engineers also work with unstructured data and a wide variety of new data types, such as streaming data.

Cloud Scalability

Scalability in on-premises systems is complicated and time-consuming. But scalability in the cloud can be as simple as a mouse click. Typically, scalability in the cloud is measured in compute units. Compute units might be defined differently for each Azure product.

Sharding Pattern Benefits

Scale out systems by adding shards on additional storage nodes
Reduce contention by balancing the workload across shards
Each storage node can use off-the-shelf hardware
Shards can be physically relocated closer to the users accessing the data

Geo-redundant Application Design

Secondary region read-only
Asynchronous replication
Last Sync Time
Storage Client Library
Failover

Data warehouse architecture components

Select integrate hub
Select manage hub
Select develop hub
Select data hub

Data hub

Select the Linked tab, then expand the Azure Data Lake Storage Gen2 group

Partition Vertical

Set security controls for each partition
Reduces concurrent access
Separate slow-moving and dynamic data
Reduces I/O and performance cost

Azure Data Lake Gen2 Access Control

Shared key
Shared access signature (SAS)
Role-based access control (RBAC)
Access control lists

Apache Spark pools in Azure Synapse Analytics

Speed and efficiency: There is a quick start-up time for nodes, and automatic shut-down when instances are not used within 5 minutes after the last job, unless there is a live notebook connection.
Ease of creation: Creating an Apache Spark pool can be done through the Azure portal, PowerShell, or the .NET SDK for Azure Synapse Analytics.
Ease of use: Within the Azure Synapse Analytics workspace, you can connect directly to the Apache Spark pool and interact with the integrated notebook experience, or use custom notebooks derived from Nteract. Notebook integration helps you develop interactive data processing and visualization pipelines.
REST API
IDE
Pre-loaded Anaconda libraries
Scalability

Snowflake schema

Suitable for data warehouses
Normalized dimension tables
Less space, more complex
Slower, more complex SQL queries
No redundant data

Star schema

Suitable for data marts
Dimension tables not normalized
More space, less complex
Faster, less complex SQL queries
Includes redundant data

SCIM integration

System for Cross-domain Identity Management, an open standard that allows you to automate user provisioning.

Modern data warehousing

This involves the ability to integrate all data, including big data, to reason over data for analytics and reporting purposes from a descriptive analytics perspective, independent of its location or structure.

Prescriptive analytics

This type of analytics looks at executing actions based on real-time or near real-time analysis of data, using predictive analytics. Azure Synapse Analytics provides this capability through Apache Spark and Azure Synapse Link, and by integrating streaming technologies such as Azure Stream Analytics.

Geo-redundant Storage

Three LRS copies in the primary region, asynchronously replicated to a secondary region

Data ingestion and preparation

To ingest data, customers can do so code-free with over 100 data integration connectors in Azure Data Factory. Data Factory empowers customers to do code-free ETL/ELT, including preparation and transformation. And while a lot of our customers are currently heavily invested in the SQL Server Integration Services (SSIS) packages they created, they can leverage these in Azure Data Factory without having to rewrite those packages.

Secrets

Using the Secrets APIs, secrets can be securely stored in an Azure Key Vault or a Databricks-backed store. Authorized users can consume the secrets to access services. Azure Databricks has two types of secret scopes: Key Vault-backed and Databricks-backed. These secret scopes allow you to store secrets, such as database connection strings, securely. If someone tries to output a secret to a notebook, it is replaced by [REDACTED]. This helps prevent someone from viewing the secret or accidentally leaking it when displaying or sharing the notebook.
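
Inside a Databricks notebook, secrets are read through dbutils; the scope and key names here are placeholders.

```python
# dbutils is provided by the Databricks runtime (not a pip install).
password = dbutils.secrets.get(scope="my-keyvault-scope", key="sql-password")

# Printing a secret shows [REDACTED] instead of the value.
print(password)
```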

Integrated analytics

With the variety of analytics that can be performed on the data at your disposal, putting together the services in a cohesive solution can be a complex operation. Azure Synapse Analytics removes this complexity by integrating the analytics landscape into one service. That way you can spend more time working with the data to bring business benefit, rather than spending much of your time provisioning and maintaining multiple systems to achieve the same outcomes.

Account Fail-over

Write to primary, read-only replication in secondary
Modify Azure DNS
Re-enable geo-redundancy

Azure HDInsight (HDI)

is an implementation by Microsoft of open-source Apache Spark, managed on the Azure Platform. You can use HDI for an Apache Spark environment when you are aware of the benefits of Apache Spark in its OSS form, but you want a Service Level Agreement (SLA). This implementation is usually of interest to open-source professionals needing an SLA and data platform experts experienced with Microsoft products and services.

Azure Synapse Analytics

is an integrated analytics platform, which combines data warehousing, big data analytics, data integration, and visualization into a single environment.

Azure Private Link

is currently the most secure way to access Azure data services from Azure Databricks. Private Link enables you to access Azure PaaS Services (for example, Azure Storage, Azure Cosmos DB, and SQL Database) and Azure hosted customer/partner services over a Private Endpoint in your virtual network.

Access control (IAM)

is the blade that you use to assign roles to grant access to Azure resources. It's also known as identity and access management and appears in several locations in the Azure portal.

Atomicity

means a transaction must execute exactly once and must be atomic; either all of the work is done, or none of it is. Operations within a transaction usually share a common intent and are interdependent.

Durability

means that the changes made due to the transaction are permanently saved in the system. Committed data is saved by the system so that even in the event of a failure and system restart, the data is available in its correct state.
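
A minimal sqlite3 sketch tying the atomicity and durability cards together: the whole transfer commits, or none of it does. The table and amounts are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")

try:
    with conn:  # one transaction: commit on success, rollback on error
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE id = 2")
        balance, = conn.execute(
            "SELECT balance FROM accounts WHERE id = 1").fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
except ValueError:
    pass

# Atomic: neither UPDATE persisted after the rollback.
print(conn.execute("SELECT * FROM accounts").fetchall())  # [(1, 100), (2, 0)]
```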

What is a cluster?

Networked computers that work together to process your data. The first step is to create a cluster.

OLAP

Online Analytical Processing

