Azure data Fundamentals: 1. Explore core data concepts
The raw data might not be in a format that is suitable for querying. The data might contain anomalies that should be filtered out, or it may require transforming in some way. The process of reformatting, cleaning, standardizing, removal of bad information, creating new features, etc. is called
Data Transformation/Data Processing
what is data?
Data is a collection of facts such as numbers, descriptions, and observations used in decision making
______________ is simply the conversion of raw data to meaningful information through a process
Data processing
_____________ guarantees that once a transaction has been committed, it will remain committed even if there's a system failure such as a power outage or crash.
Durability
azure data processing tools
Functions; Cognitive Services; Databricks; Other tools
What is an analytical system?
In contrast to systems designed to support OLTP, an analytical system is designed to support business users who need to query data and gain a big picture view of the information held in a database. Analytical systems are concerned with capturing raw data, and using it to generate insights
__________________ ensures that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially.
Isolation
In those key-value arrays, a _______ serves as a unique identifier to retrieve a specific value. Those values can be anything from a number or a string to a complex object, like a JSON file.
Key
is the time taken for the data to be received and processed.
Latency
_____________databases enable you to store data in a format that more closely matches the original structure
Non-relational
what applications require a graph database?
OLTP apps with highly correlated data; apps that require easy updates to single or many objects; apps with a flexible data modelling; apps with data that has requirements that evolve; apps with data that has hierarchical data structures
After your service is provisioned, the service needs to be configured so that users can be given access to the data. You can typically define several levels of access:
Read-only; Read/Write; Owner privilege (full access)
_________________ provides a graphical interface, enabling you to query data, perform general database administration tasks, and generate scripts for automating database maintenance and support operations.
SQL Server Management Studio
_________________ data is information that doesn't reside in a relational database but still has some structure to it
Semi-structured
ORC is
The Optimized Row Columnar (ORC)
Performance: Batch vs. Streaming
The latency for batch processing is typically a few hours. Stream processing typically occurs immediately, with latency in the order of seconds or milliseconds.
___________ the data can often be useful as a tool for examining data. You can generate charts such as bar charts, line charts, plot results on geographical maps, pie charts, or illustrate how data changes over time
Visualizing
Blob stands for
Binary Large Object
The process of capturing the raw data
Data ingestion
what is unstructured data?
Does not naturally contain fields
How is data in a relational table organized?
Rows and columns
______________ is typically tabular data that is represented by rows and columns in a database
Structured data
in a normalized database schema, no data
duplication exists
diagnostic analytics is used to
figure out why something is happening. Take the findings of descriptive analytics. 1. identify anomalies 2. collect data related to anomalies 3. use analytics to explain the anomalies
descriptive data analytics examples
generating reports to provide and view sales data
A ___________ contains nodes (information about objects), and edges (information about the relationships between objects).
graph
You can use a _____________database to store and query information about complex relationships
graph
_____________ stores entities centric around relationships. Enables applications to perform queries by traversing a network of nodes and edges
graph database
cosmos db supports graph databases through the _______ api
gremlin
Transactional systems are often
high-volume, sometimes handling many millions of transactions in a single day.
OLTP stands for
online transaction processing
views are created to
simplify the query
parquet is developed for
storing nested data types efficiently
Analysis: Batch vs. Streaming
You typically use batch processing for performing complex analytics. Stream processing is used for simple response functions, aggregates, or calculations such as rolling averages.
what is Power BI
a collection of software, services, aps, and connectors
a view is
a virtual table based on the results set of a query
Data processing solutions often fall into one of two broad categories:
analytical systems, and transaction processing systems.
In ____________ processing, newly arriving data elements are collected into a group. The whole group is then processed at a future time as a ____________
batch ; batch
Buffering and processing the data in groups is called _____________
batch processing.
non-relational collections are often defined
by labeling each field with the name it represents
data processing
cleans the data and converts it into a more meaningful format for exploration
each _______________ in a table is defined by a datatype
column
_____________ database can appear very similar to a relational database
column family
apache cassandra is a
column family database
Each row in a table has the same set of
columns
An example of an effective use of batch processing would be a
connection to a mainframe system. Vast amounts of data need to be transferred into a data analysis system and the data is not real-time.
data ingestion can be imported by
continuous stream or in batches
____________ is stored in a table
data
steps in data analytics
data ingestion (raw data) --> data processing (clean/transform data) --> data exploration (queries, visualizations)
Most analytical data processing systems need to perform similar tasks:
data ingestion, data transformation, data querying, and data visualization
Differences between batch and streaming data include:
data scope; data size; performance; analysis
what is semi-structured data?
data structure is defined within the actual data by fields.
In data analytics, different data sources are typically brought together in a
data warehouse so you can answer questions with the data
A column family is a
database object that contains columns of related data. It is a tuple (pair) that consists of a key-value pair, where the key is mapped to a value that is a set of columns. In analogy with relational databases, a column family is as a "table", each key-value pair being a "row". Each column is a tuple (triplet) consisting of a column name, a value, and a timestamp
you can often categorize your data analytics as
descriptive; diagnostic; predictive; prescriptive; cognitive
A ____________database is a database in which data is stored across different physical locations. It may be held in multiple computers located in the same physical location (for example, a datacenter), or may be dispersed over a network of interconnected computers
distributed
azure cosmos db is an example of a
document-based database
One disadvantage of a document database is
duplication due to information being stored by document. For example, if two customer documents have the same address, the address would need to be stored in both documents. Duplication increases storage required and also makes maintenance more complex.
information about the relationship between objects in a graph database
edge
unstructured data is often used to
extract data form and categorize or identify "structures"
ELT
extract, load, transform
ETL stands for
extract, transform, load
data analytics help you
identify strengths and weaknesses in your organization
cognitive analytics
inspired by how the human brain works. self-learning feedback loop
the aim of the data processing
is to convert the data into meaningful structures
json
javascript object notation
in a normalized database schema, data is retrieved by
joining tables together in a query
in a relational database, data is retrieved by
joining tables together in a query
example of semi-structured data
json
types of non-relational data
json, avro, orc, parquet
A ___________ database stores Associative arrays
key-value
4 examples of NoSQL categories
key-value stores, document based, column family databases, graph databases
Other types of semi-structured data
key-value stores; graph databases
examples of Power BI visualizations
line graph, bar chart; matrix; tree map; scatter plot;
What is NoSQL?
loose term, to describe non-relational databases
While normalization enables fast throughput for transactions, it can make querying
more complex
non-relational collections can have
multiple entities in the same collection or container with different fields; ;
Streaming data processing is beneficial in most scenarios where
new, dynamic data is generated on a continual basis.
information about objects in a graph database
nodes
non-relational collections can have a different
non-tabular schema
A document database is a type of
nonrelational database that is designed to store and query data as JSON-like documents
Typically, the end result of the _____________ process is that your data is split into a large number of narrow, well-defined tables (a narrow table is a table with few columns), with references from one table to another, as shown in the image below.
normalization
typically when we design a relational database, we need to _____________ the data
normalize
Splitting tables out into separate groups of columns to support fast processing is called
normalized.
In __________________, information systems typically facilitate and manage transaction-oriented applications
online transaction processing
Parquet is an
open source file format available to any project in the Hadoop ecosystem
predictive analytics used to
predict what will happen in the future
data ingestion
processing of obtaining and importing data for immediate or later use
The act of setting up the database server is called
provisioning.
an index optimizes
queries for faster data retrieval;
After data is ingested and transformed, you can ________ the data to analyze it
query
the focus of a key-value store is to
read/write data quickly
Streaming handles data in __________
real time
data is normalized to
reduce
Databases that hold tables in the form of structured data are called
relational databases
Structured data is typically stored in a ____________database such as _________ or _____________
relational; SQL Server; Azure SQL Database
avro is a
row-based format
Avro is a
row-oriented remote procedure call and data serialization framework developed within Apache's Hadoop project
all _____________ in a table have the same # of columns
rows
benefit of relational database
simple structure; easier to understand; can remove duplication with normalization
In _______________, each new piece of data is processed when it arrives. For example, data ingestion.
stream processing
Processing data as it arrives is called ______________
streaming
characteristics of a relational database
structure of tables and columns, rigid structure, tables are related to each other with primary and foreign keys
A key-value database stores data as a single collection without
structure or relation
3 classifications of data
structured, semi-structured, or unstructured.
a _______________ is where data is stored in a relational database
table
____________ consists of rows and columns
tables
When compared to non-distributed database systems, any data update to a distributed database will
take time to apply across multiple locations.
frequent uses of unstructured data in ML
text analytics, sentiment analysis, vision api
an index reduces
the amount of data pages that need to be read to retrieve the data in a SQL statement;
data processing generally occurs after
the data has been ingested
to query a key-value store you need to know
the keys you want to look up the values for
in a column family database
there are groups of columns that are logically grouped together
Stream processing is ideal for
time-critical operations that require an instant real-time response.
A ______________ is a sequence of operations that are atomic. This means that either all operations in the sequence must be completed successfully, or if something goes wrong, all operations run so far in the sequence must be undone
transaction
A primary use of relational databases is to handle
transaction processing.
An example of ineffective batch-processing would be to
transfer small amounts of real-time data, such as a financial stock-ticker.
Depending on the __________ such as structured, semi-structured, or unstructured, data will be stored _______________.
type of data; differently
In a non-relational database, each key should have a
unique id
Not all data is structured or even semi-structured. For example, audio and video files, and binary data files might not have a specific structure. This is called
unstructured data
You can categorize data in many different ways, depending not only on how it's structured, but also on how the data is __________
used
prescriptive analytics
used to answer questions about what actions need to be taken to achieve certain targets. rely on machine learning to figure out patterns in past experience
in a normalized database schema primary keys and foreign keys are
used to defined relationships
examples of unstructured data
video, audio, media streams, documents
Relational and non-relational databases are suited to different
workloads.
Which one of the following tasks is a role of a database administrator? ( 1. Backing up and restoring databases 2. Creating dashboards and reports 3. Identifying data quality issues )
1. Backing up and restoring databases
difference between etl and elt
1. ETL loads data first into the staging server and then into the target system whereas ELT loads data directly into the target system. 2. ETL model is used for on-premises, relational and structured data while ELT is used for scalable cloud structured and unstructured data sources. 3. ETL is mainly used for a small amount of data whereas ELT is used for large amounts of data.
Advantages of batch processing include:
1. Large volumes of data can be processed at a convenient time.; 2. It can be scheduled to run at a time when computers or systems might otherwise be idle, such as overnight, or during off-peak hours.
Which one of the following roles is not a data job role? ( 1. Systems Administrator 2. Data Analyst 3. Database Administrator )
1. Systems Administrator
Disadvantages of batch processing include:
1. The time delay between ingesting the data and getting the results. 2. All of a batch job's input data must be ready before a batch can be processed. This means data must be carefully checked. Problems with data, errors, and program crashes that occur during batch jobs bring the whole process to a halt.
Which of the following is an example of unstructured data? (1. An Employee table with columns Employee ID, Employee Name, and Employee Designation 2. Audio and Video files 3. A table within SQL Server database )
2. Audio and Video files
Which of the following tools is a visualization and reporting tool? ( 1. SQL Server Management Studio 2. Power BI 3. SQL )
2. Power BI
Examples of streaming data include:
A financial institution tracks changes in the stock market in real time, computes value-at-risk, and automatically rebalances portfolios based on stock price movements.
What is a transactional system?
A transactional system is often what most people consider the primary function of business computing. A transactional system records transactions
A transactional database must adhere to the ______________ properties to ensure that the database remains consistent while processing transactions.
ACID (Atomicity, Consistency, Isolation, Durability)
_______________ are typically read-only systems that store vast volumes of historical data or business metrics, such as sales performance and inventory levels. _______________ are used for data analysis and decision making.
Analytical workloads
______________ are generated by aggregating the facts presented by the raw data into summaries, trends, and other kinds of "Business information."
Analytics
______________ guarantees that each transaction is treated as a single unit, which either succeeds completely, or fails completely
Atomicity
You can provision other services as well in Azure. For example, if you want to store unstructured data such as video or audio files, you can use
Azure Blob storage
If you want to store semi-structured data such as documents, you can use a service such as
Azure Cosmos DB.
_______________ currently provides connections to on-premises SQL Server databases, Azure SQL Database, PostgreSQL, Azure SQL Data Warehouse, and SQL Server Big Data Clusters, amongst others.
Azure Data Studio
_______________ provides a graphical user interface for managing many different database systems.
Azure Data Studio
____________ Database is a service that runs in the cloud. You can use it to create and access relational tables
Azure SQL
____________ is managed and run by Azure, you just specify that you want a database server to be created.
Azure SQL Database service
________________ provides database services in Azure. It's similar to SQL Server, except that it runs in the cloud. You can manage _____________ using Azure portal.
Azure SQL database; Azure SQL database
Data scope: Batch vs. Streaming
Batch processing can process all the data in the dataset. Stream processing typically only has access to the most recent data received, or within a rolling time window (the last 30 seconds, for example).
Data size: Batch vs. Streaming
Batch processing is suitable for handling large datasets efficiently. Stream processing is intended for individual records or micro batches consisting of few records.
__________ ensures that a transaction can only take the data in the database from one valid state to another. A ___________ database should never lose or create data in a manner that can't be accounted for.
Consistency; consistent