Azure Data Fundamentals
Describe stream processing
Process that allows you to handle data almost instantaneously as it moves from one source (e.g. a device) to another (e.g. a log file).
Describe batch data
Data that is collected into a group and processed all at once, with jobs running to completion one after another. Batch processes are not interactive.
key-value store
Similar to a relational table, except that each row can have any number of columns
Analytical workloads
Typically read-only systems that store vast volumes of historical data or business metrics, such as sales performance and inventory levels
SaaS
Typically specific software packages that are installed and run on virtual hardware in the cloud
lifecycle management policy
automatically move a blob from Hot to Cool, and then to the Archive tier, as it ages and is used less frequently
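Such a rule is defined as JSON attached to the storage account. Below is a minimal sketch of what one could look like, written as a Python dict; the rule name, the logs/ prefix, and the day thresholds are illustrative assumptions, so verify the schema against the Azure Storage documentation before applying it.

    # Sketch of a blob lifecycle management rule, expressed as a Python dict that
    # mirrors the JSON policy attached to a storage account. The rule name, the
    # prefix, and the day thresholds are assumptions.
    lifecycle_policy = {
        "rules": [
            {
                "name": "age-based-tiering",       # hypothetical rule name
                "enabled": True,
                "type": "Lifecycle",
                "definition": {
                    "filters": {
                        "blobTypes": ["blockBlob"],
                        "prefixMatch": ["logs/"],  # only blobs under this prefix
                    },
                    "actions": {
                        "baseBlob": {
                            # Cool after 30 idle days, Archive after 90, delete after 365.
                            "tierToCool": {"daysAfterModificationGreaterThan": 30},
                            "tierToArchive": {"daysAfterModificationGreaterThan": 90},
                            "delete": {"daysAfterModificationGreaterThan": 365},
                        }
                    },
                },
            }
        ]
    }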
Bar and column charts
enable you to see how a set of variables changes across different categories
ELT
Extract, Load, and Transform. The process differs from ETL in that the data is stored before being transformed
Azure Database for MySQL
is a PaaS implementation of MySQL in the Azure cloud, based on the MySQL Community Edition
entity
is described as a thing about which information needs to be known or held.
Block blobs
A block blob is handled as a set of blocks. Each block can vary in size, up to 100 MB
Page blobs
A page blob is organized as a collection of fixed-size 512-byte pages
Archive tier
provides the lowest storage cost, but with increased latency.
Describe a bar chart
representation of categorical data using rectangular bars whose lengths are proportional to the values they represent
Describe a pie chart
representation of data using a circle divided into proportional slices
Avro
row-based format. It was created by Apache. Each record contains a header that describes the structure of the data in the record.
Power BI visualization
a visual representation of data, like a chart, a color-coded map, or other interesting things you can create to represent your data visually
scaling
act of increasing (or decreasing) the resources used by a service
data warehouse
also stores large quantities of data, but the data in a warehouse has been processed to convert it into a format for efficient analysis.
Atomicity
guarantees that each transaction is treated as a single unit, which either succeeds completely, or fails completely
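A minimal sketch of atomicity in Python, assuming an Azure SQL database reachable through pyodbc and a hypothetical Accounts table: either both updates are committed or neither is.

    # Sketch of an atomic transaction using pyodbc against an Azure SQL database.
    # The connection string and the Accounts table are hypothetical.
    import pyodbc

    conn = pyodbc.connect(
        "Driver={ODBC Driver 17 for SQL Server};"
        "Server=tcp:myserver.database.windows.net;Database=mydb;"
        "UID=myuser;PWD=mypassword"
    )
    conn.autocommit = False  # group the statements into one transaction
    cursor = conn.cursor()
    try:
        # Both updates succeed together or fail together.
        cursor.execute("UPDATE Accounts SET Balance = Balance - 100 WHERE Id = 1")
        cursor.execute("UPDATE Accounts SET Balance = Balance + 100 WHERE Id = 2")
        conn.commit()    # the transaction succeeds completely
    except Exception:
        conn.rollback()  # ...or fails completely, leaving no partial change
        raise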
Synapse Spark pool
is a cluster of servers running Apache Spark to process data. You write your data processing logic using one of the four supported languages: Python, Scala, SQL, and C# (via .NET for Apache Spark). Spark pools support Azure Machine Learning through integration with the SparkML and AzureML packages.
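A minimal PySpark sketch of the kind of logic you might run on a Spark pool; the data lake path and column names are made up, and in a Synapse notebook the SparkSession is normally provided for you as spark.

    # Sketch of Spark pool processing. getOrCreate() is used so the snippet is
    # self-contained; the path and column names are illustrative assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.csv(
        "abfss://raw@mydatalake.dfs.core.windows.net/sales/*.csv",  # hypothetical path
        header=True,
        inferSchema=True,
    )

    # Simple aggregation: total revenue per product category.
    df.groupBy("Category").sum("Revenue").show()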
Synapse SQL pool
is a collection of servers running Transact-SQL. Transact-SQL is the dialect of SQL used by Azure SQL Database, and Microsoft SQL Server. You write your data processing logic using Transact-SQL.
SQL Database server
is a logical construct that acts as a central administrative point for multiple single or pooled databases, logins, firewall rules, auditing rules, threat detection policies, and failover groups
Synapse Pipelines
A pipeline is a logical grouping of activities that together perform a task.
NoSQL
is a rather loose term that simply means non-relational.
Azure Table Storage
is a scalable key-value store held in the cloud. You create a table using an Azure storage account.
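A minimal sketch of key-value access with the azure-data-tables package; the connection string, table name, and entity fields are assumptions.

    # Sketch of storing and retrieving an entity in Azure Table Storage.
    from azure.data.tables import TableServiceClient

    service = TableServiceClient.from_connection_string("<storage-connection-string>")
    table = service.create_table_if_not_exists("Products")

    # Every entity is addressed by its PartitionKey + RowKey pair.
    table.create_entity({
        "PartitionKey": "beverages",
        "RowKey": "coffee-001",
        "Name": "Coffee",
        "Price": 3.50,
    })

    entity = table.get_entity(partition_key="beverages", row_key="coffee-001")
    print(entity["Name"], entity["Price"])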
Range index
is based on an ordered tree-like structure
transactional system (OLTP)
is often what most people consider the primary function of business computing, it records transactions
dot plot chart
is similar to a bubble chart and scatter chart, but can plot categorical data along the X-Axis
Provisioning
is the act of running a series of tasks that a service provider, such as Azure SQL Database, performs to create and configure a service
Hot tier
is the default. You use this tier for blobs that are accessed frequently
Azure Cosmos DB
schema-agnostic database that allows you to iterate on your application without having to deal with schema or index management. Automatically indexes every property for all items in your container without having to define any schema or configure secondary indexes
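A minimal sketch of that schema-agnostic behaviour using the azure-cosmos package; the endpoint, key, database, container (assumed to be partitioned on /category), and item shapes are all assumptions.

    # Sketch: two items with different fields live in the same container, and
    # every property is indexed automatically with no schema to define.
    from azure.cosmos import CosmosClient

    client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
    container = client.get_database_client("store").get_container_client("items")

    container.upsert_item({"id": "1", "category": "book", "title": "Azure Basics"})
    container.upsert_item({"id": "2", "category": "device", "ram_gb": 16, "ports": ["USB-C"]})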
scatter chart
shows the relationship between two numerical values
Analysis
You typically use batch processing for performing complex analytics. Stream processing is used for simple response functions, aggregates, or calculations such as rolling averages.
Azure Database for PostgreSQL
a PaaS implementation of PostgreSQL in the Azure Cloud. This service provides the same availability, performance, scaling, security, and administrative benefits as the MySQL service.
Azure HDInsight
a big data processing service, that provides the platform for technologies such as Spark in an Azure environment.
Change Feed
an ordered, read-only record of the updates made to a blob.
Azure Data Factory
a cloud-based data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale
Power BI datasets
a collection of data that Power BI uses to create its visualizations.
Power BI report
a collection of visualizations that appear together on one or more pages
column family database
a column family database can appear very similar to a relational database, but its real power lies in its denormalized approach to structuring sparse data
Cassandra API
a column family database management system
distributed database
a database in which data is stored across different physical locations
Azure Cosmos DB
a multi-model NoSQL database management system
Snapshots
a read-only version of a blob at a particular point in time.
data lake
a repository for large quantities of raw data.
Azure Virtual Network
a representation of your own network in the cloud.
Azure Blob storage
a service that enables you to store massive amounts of unstructured data, or blobs, in the cloud
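A minimal sketch of uploading and reading back a blob with the azure-storage-blob package; the connection string, container name, and blob name are assumptions, and the container is assumed to already exist.

    # Sketch of blob upload/download with azure-storage-blob.
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string("<storage-connection-string>")
    container = service.get_container_client("raw-data")

    # Upload unstructured data as a block blob.
    container.upload_blob(name="logs/app-2023-01-01.txt", data=b"application log lines...")

    # Read it back.
    downloaded = container.download_blob("logs/app-2023-01-01.txt").readall()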
Power BI tile
a single visualization on a report or a dashboard
document key
a unique identifier for the document
view
a virtual table based on the result set of a query
Parquet
another columnar data format. It was created by Cloudera and Twitter. Contains row groups. Data for each column is stored together in the same row group. Each row group contains one or more chunks of data.
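A minimal sketch of writing and reading Parquet with pandas; a Parquet engine such as pyarrow is assumed to be installed, and the file and column names are made up.

    # Sketch of columnar storage with Parquet via pandas.
    import pandas as pd

    df = pd.DataFrame({
        "product": ["coffee", "tea", "coffee"],
        "units":   [12, 7, 4],
    })

    df.to_parquet("sales.parquet")                 # values for each column are stored together
    parquet_df = pd.read_parquet("sales.parquet")  # read the columnar file back
    print(parquet_df)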
Describe cognitive analytics
applying human-like intelligence to tasks
Treemaps
are charts of colored rectangles, with size representing the relative value of each item
Cognitive analytics
attempts to draw inferences from existing data and patterns, derive conclusions based on existing knowledge bases, and then add these findings back into the knowledge base for future inferences--a self-learning feedback loop
unstructured data
data such as audio files, video files, and binary data files that might not have a specific structure
stream processing
each new piece of data is processed when it arrives
SQL Database managed instance
effectively runs a fully controllable instance of SQL Server in the cloud. You can install multiple databases on the same instance. You have complete control over this instance, much as you would for an on-premises server
Line charts
emphasize the overall shape of an entire series of values, usually over time
Spatial indices
enable efficient queries on geospatial objects such as points, lines, polygons, and multi-polygons. They are used on correctly formatted GeoJSON objects; Points, LineStrings, Polygons, and MultiPolygons are currently supported
Azure File Storage
enables you to create files shares in the cloud, and access these file shares from anywhere with an internet connection
Azure Database Migration Service (DMS)
enables you to restore a backup of your on-premises databases directly to databases running in Azure Data Services
in-place updates
enabling an application to modify the values of specific fields in a document without rewriting the entire document
Consistency
ensures that a transaction can only take the data in the database from one valid state to another
Isolation
ensures that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially
Data processing solutions
fall into one of two broad categories: analytical systems, and transaction processing systems
Describe diagnostic analytics
form of analytics that answers the question, "Why did it happen?"
Filled map
if you have geographical data, you can use a filled map to display how a value differs in proportion across a geography or region
Owner privilege
gives full access to the data, including managing security, such as adding new users and removing access from existing users.
Read/write access
gives users the ability to view and modify existing data.
Durability
guarantees that once a transaction has been committed, it will remain committed even if there's a system failure such as a power outage or crash
Cool tier
has lower performance and incurs reduced storage charges compared to the Hot tier
Prescriptive analytics
helps answer questions about what actions should be taken to achieve a goal or target
Descriptive analytics
helps answer questions about what has happened, based on historical data
Predictive analytics
helps answer questions about what will happen in the future
Diagnostic analytics
helps answer questions about why things happened
index
helps you search for data in a table
Gremlin API
implements a graph database interface to Cosmos DB.
Composite indexes
increase the efficiency when you are performing operations on multiple fields
Describe descriptive analytics
interpretation of historical data to better understand changes that have occurred
Append blobs
An append blob is a block blob optimized to support append operations
Synapse Studio
is a web user interface that enables data engineers to access all the Synapse Analytics tools.
Azure Databricks
is an Apache Spark environment running on Azure to provide big data processing, streaming, and machine learning
Azure Synapse Analytics
is an analytics engine. It's designed to process large amounts of data very quickly
Azure Database for MariaDB
is an implementation of the MariaDB database management system adapted to run in Azure. It's based on the MariaDB Community Edition.
MongoDB API
is another well-known document database, with its own programmatic interface
Data Querying
searching data for trends, or attempting to determine the cause of problems in your systems
Describe the concepts of data processing
manipulation of data by a computer
Read-only access
means the users can read data but can't modify any existing data or create new data.
modern data warehouse
might contain a mixture of relational and non-relational data, including files, social media streams, and Internet of Things (IoT) sensor data
Power BI dashboard
must fit on a single page, often called a canvas (the canvas is the blank backdrop in Power BI Desktop or the service, where you put visualizations).
Optimized Row Columnar format (ORC)
organizes data into columns rather than rows
clustered index
physically reorganizes a table by the index key. This arrangement can improve the performance of queries still further, because the RDBMS system doesn't have to follow references from the index to find the corresponding data in the underlying table
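A minimal sketch of creating a clustered index with Transact-SQL sent through pyodbc; the connection string and the Orders table and column are hypothetical.

    # Sketch: physically reorder the Orders table by OrderDate.
    import pyodbc

    conn = pyodbc.connect("<azure-sql-connection-string>")
    cursor = conn.cursor()
    cursor.execute("CREATE CLUSTERED INDEX IX_Orders_OrderDate ON dbo.Orders (OrderDate)")
    conn.commit()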
Business Intelligence
refers to technologies, applications, and practices for the collection, integration, analysis, and presentation of business information. The purpose of business intelligence is to support better decision making
document database
represents the opposite end of the NoSQL spectrum from a key-value store. Each document has a unique ID, but the fields in the documents are transparent to the database management system.
Single Database
resource type creates a database in Azure SQL Database with its own set of resources and is managed via a server
Azure database administrator
responsible for the design, implementation, maintenance, and operational aspects of on-premises and cloud-based database solutions built on Azure data services and SQL Server
Elastic Pool
similar to Single Database, except that by default multiple databases can share the same resources, such as memory, data storage space, and processing power.
Azure Database for PostgreSQL single-server
single-server deployment option for PostgreSQL provides similar benefits as Azure Database for MySQL. You choose from three pricing tiers: Basic, General Purpose, and Memory Optimized. Each tier supports different numbers of CPUs, memory, and storage sizes—you select one based on the load you expect to support
ETL
stands for Extract, Transform, and Load. The raw data is retrieved and transformed before being saved
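A minimal sketch of the ETL pattern with pandas, where the data is transformed before it is saved to the target store; the file names, columns, and cleaning rule are illustrative assumptions (in ELT, by contrast, the raw file would be loaded first and transformed inside the target system).

    # Sketch of Extract -> Transform -> Load with pandas.
    import pandas as pd

    raw = pd.read_csv("raw_sales.csv")                     # Extract
    clean = raw.dropna(subset=["price"])                   # Transform: drop invalid rows
    clean["revenue"] = clean["price"] * clean["quantity"]  # Transform: derive a metric
    clean.to_parquet("warehouse/sales.parquet")            # Load the transformed data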
Data Visualization
the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to spot and understand trends, outliers, and patterns in data
Soft delete
This feature enables you to recover a blob that has been removed or overwritten, by accident or otherwise.
Table API
This interface enables you to use the Azure Table Storage API to store and retrieve documents.
SQL API
This interface provides a SQL-like query language over documents, enabling you to identify and retrieve documents using SELECT statements
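A minimal sketch of such a query using the azure-cosmos package; the endpoint, key, database, container, and category value are assumptions.

    # Sketch of querying documents through the Cosmos DB SQL API.
    from azure.cosmos import CosmosClient

    client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
    container = client.get_database_client("store").get_container_client("items")

    # SELECT statements identify and retrieve matching documents.
    items = container.query_items(
        query="SELECT c.id, c.title FROM c WHERE c.category = @category",
        parameters=[{"name": "@category", "value": "book"}],
        enable_cross_partition_query=True,
    )
    for item in items:
        print(item)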
Azure Database for PostgreSQL Hyperscale
Hyperscale (Citus) is a deployment option that scales queries across multiple server nodes to support large database loads. Your database is split across nodes. Data is split into chunks based on the value of a partition key or sharding key. Consider using this deployment option for the largest database PostgreSQL deployments in the Azure Cloud.
Disadvantages of batch
The time delay between ingesting the data and getting the results. All of a batch job's input data must be ready before a batch can be processed. Even minor data errors, such as typographical errors in dates, can prevent a batch job from running.
Azure Data Services
These services are a series of DBMSs managed by Microsoft in the cloud. Each data service takes care of the configuration, day-to-day management, software updates, and security of the databases that it hosts. All you do is create your databases under the control of the data service.
batch processing
Arriving data elements are collected into a group. The whole group is then processed at a future time
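As a toy illustration (plain Python, no Azure services), the difference between the two models can be sketched as follows; the event values are made up.

    # Toy sketch contrasting batch and stream processing.
    events = [3, 7, 2, 9]            # arriving data elements

    # Batch: collect the whole group first, then process it at a future time.
    batch_total = sum(events)

    # Stream: each new piece of data is processed when it arrives,
    # e.g. maintaining a rolling aggregate.
    running_total = 0
    for event in events:
        running_total += event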
Azure Analysis Services
Enables you to build tabular models to support online analytical processing (OLAP) queries
Versioning
can maintain and restore earlier versions of a blob.
matrix
A matrix visual is a tabular structure that summarizes data
Data Size
Batch processing is suitable for handling large datasets efficiently. Stream processing is intended for individual records or micro-batches consisting of a few records.
PaaS
Allows you to specify the resources that you require (based on how large you think your databases will be, the number of users, and the performance you require), and Azure automatically creates the necessary virtual machines, networks, and other devices for you
Data Scope
Batch processing can process all the data in the dataset. Stream processing typically only has access to the most recent data received, or within a rolling time window
Describe differences between batch and stream data
Batch data requires a dataset to be collected over time before it is processed; stream data depends on an analytics tool that consumes the data as it arrives.
data analyst
Enables businesses to maximize the value of their data assets, responsible for designing and building scalable models, cleaning and transforming data, and enabling advanced analytics capabilities through reports and visualizations.
IaaS
Enables you to create a virtual infrastructure in the cloud that mirrors the way an on-premises data center might work. You can create a set of virtual machines, connect them together using a virtual network, and add a range of virtual devices
Describe data visualization
Field that deals with the graphical representation of data.
Data
Is a collection of facts such as numbers, descriptions, and observations used in decision making.
Semi-structured data
Is information that doesn't reside in a relational database but still has some structure to it. Examples include documents held in JavaScript Object Notation (JSON) format.
Advantages of batch
Large volumes of data can be processed at a convenient time. It can be scheduled to run at a time when computers or systems might otherwise be idle, such as overnight, or during off-peak hours.
normalized
Splitting tables out into separate groups of columns
Performance
The latency for batch processing is typically a few hours. Stream processing typically occurs immediately, with latency in the order of seconds or milliseconds.
data engineer
collaborates with stakeholders to design and implement data-related assets that include data ingestion pipelines, cleansing and transformation activities, and data stores for analytical workloads
Azure Data Lake Storage
combines the hierarchical directory structure and file system semantics of a traditional file system with security and scalability provided by Azure
relational database
comprises a set of tables. A table can have zero (if the table is empty) or more rows. Each table has a fixed set of columns. You can define relationships between tables using primary and foreign keys, and you can access the data in tables using SQL
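A minimal sketch of two related tables defined with primary and foreign keys, sent as Transact-SQL through pyodbc; the connection string and table design are made up.

    # Sketch of a relational design: Orders references Customers by key,
    # and SQL joins the two tables back together.
    import pyodbc

    conn = pyodbc.connect("<azure-sql-connection-string>")
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE dbo.Customers (
            CustomerId INT PRIMARY KEY,
            Name       NVARCHAR(100) NOT NULL
        );
        CREATE TABLE dbo.Orders (
            OrderId    INT PRIMARY KEY,
            CustomerId INT NOT NULL REFERENCES dbo.Customers(CustomerId),
            OrderDate  DATE NOT NULL
        );
    """)
    conn.commit()

    cursor.execute(
        "SELECT o.OrderId, c.Name FROM dbo.Orders o "
        "JOIN dbo.Customers c ON c.CustomerId = o.CustomerId"
    )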
Data analytics
concerned with examining, transforming, and arranging data so that you can study it and extract useful information
container
The content of each item in a container is projected as a JSON document, then converted into a tree representation.
Azure Data Factory
described as a data integration service
analytical system (OLAP)
designed to support business users who need to query data and gain a big picture view of the information held in a database
key influencer chart
displays the major contributors to a selected result or value
Data Ingestion
the process of capturing the raw data
Reporting
the process of organizing data into informational summaries to monitor how different areas of an organization are performing
Describe predictive analytics
the use of historical data, statistical models and machine learning techniques to identify the likelihood of future outcomes.
Data Transformation/Data Processing
performing cleaning operations to remove any questionable or invalid data, or performing aggregations such as calculating profit, margin, and other Key Performance Indicators (KPIs)
Describe ETL
a type of data integration that refers to extract, transform, and load, used to blend data from multiple sources
Describe ELT
a type of data integration that takes advantage of the target system to do the data transformation
graph database
used to store and query information about complex relationships, contains nodes (information about objects), and edges (information about the relationships between objects).
Describe prescriptive analytics
using machine learning techniques to help businesses decide on a course of action to take
non-relational system
you store the information for entities in collections or containers rather than relational tables. Two entities in the same collection can have a different set of fields rather than a regular set of columns found in a relational table