Azure Data Fundamentals: 1. Explore core data concepts

The raw data might not be in a format that is suitable for querying. The data might contain anomalies that should be filtered out, or it may require transforming in some way. The process of reformatting, cleaning, standardizing, removing bad information, creating new features, and so on is called

Data Transformation/Data Processing

what is data?

Data is a collection of facts such as numbers, descriptions, and observations used in decision making

______________ is simply the conversion of raw data to meaningful information through a process

Data processing

_____________ guarantees that once a transaction has been committed, it will remain committed even if there's a system failure such as a power outage or crash.

Durability

Azure data processing tools

Functions; Cognitive Services; Databricks; Other tools

What is an analytical system?

In contrast to systems designed to support OLTP, an analytical system is designed to support business users who need to query data and gain a big picture view of the information held in a database. Analytical systems are concerned with capturing raw data, and using it to generate insights

__________________ ensures that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially.

Isolation

In those key-value arrays, a _______ serves as a unique identifier to retrieve a specific value. Those values can be anything from a number or a string to a complex object, like a JSON file.

Key
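
As a rough sketch of this model (the store contents below are hypothetical), a plain Python dictionary behaves like a key-value store: the key is the only handle you have, and the value is opaque to the store:

```python
# A minimal sketch of the key-value model, using a plain Python dict
# as the associative array. The keys and values here are invented.
store = {
    "user:1001": '{"name": "Ana", "tier": "gold"}',   # value is an opaque blob
    "user:1002": '{"name": "Bo", "tier": "silver"}',
    "session:9f3a": "2024-01-15T10:22:00Z",
}

# Retrieval is a direct lookup by key; the store never inspects the value.
print(store["user:1001"])
```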

___________ is the time taken for the data to be received and processed.

Latency

_____________databases enable you to store data in a format that more closely matches the original structure

Non-relational

what applications require a graph database?

OLTP apps with highly correlated data; apps that require easy updates to single or many objects; apps with flexible data modeling; apps whose data requirements evolve; apps with hierarchical data structures

After your service is provisioned, the service needs to be configured so that users can be given access to the data. You can typically define several levels of access:

Read-only; Read/Write; Owner privilege (full access)

_________________ provides a graphical interface, enabling you to query data, perform general database administration tasks, and generate scripts for automating database maintenance and support operations.

SQL Server Management Studio

_________________ data is information that doesn't reside in a relational database but still has some structure to it

Semi-structured

ORC is

The Optimized Row Columnar format, which organizes data into columns rather than rows

Performance: Batch vs. Streaming

The latency for batch processing is typically a few hours. Stream processing typically occurs immediately, with latency in the order of seconds or milliseconds.

___________ the data can often be useful as a tool for examining data. You can generate charts such as bar charts, line charts, plot results on geographical maps, pie charts, or illustrate how data changes over time

Visualizing

Blob stands for

Binary Large Object

The process of capturing the raw data

Data ingestion

what is unstructured data?

Does not naturally contain fields

How is data in a relational table organized?

Rows and columns

______________ is typically tabular data that is represented by rows and columns in a database

Structured data

in a normalized database schema, no data

duplication exists

diagnostic analytics is used to

figure out why something is happening. Building on the findings of descriptive analytics: 1. identify anomalies; 2. collect data related to the anomalies; 3. use analytics to explain the anomalies

descriptive data analytics examples

generating reports to provide and view sales data

A ___________ contains nodes (information about objects), and edges (information about the relationships between objects).

graph

You can use a _____________database to store and query information about complex relationships

graph

_____________ stores entities with an emphasis on the relationships between them, and enables applications to perform queries by traversing a network of nodes and edges

graph database
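
The idea is easy to sketch in plain Python (this is illustrative only, not Gremlin or any real graph database API; all node and edge names are invented):

```python
# Sketch of a tiny graph: nodes hold object data, edges hold relationships.
nodes = {"alice": {"role": "engineer"}, "bob": {"role": "manager"}, "org": {"name": "Contoso"}}
edges = [("alice", "reports_to", "bob"), ("bob", "works_for", "org"), ("alice", "works_for", "org")]

def traverse(start, relation):
    """Follow edges of a given relationship type from a starting node."""
    return [dst for src, rel, dst in edges if src == start and rel == relation]

print(traverse("alice", "reports_to"))  # ['bob']
```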

Cosmos DB supports graph databases through the _______ API

Gremlin

Transactional systems are often

high-volume, sometimes handling many millions of transactions in a single day.

OLTP stands for

online transaction processing

views are created to

simplify the query

Parquet is developed for

storing nested data types efficiently

Analysis: Batch vs. Streaming

You typically use batch processing for performing complex analytics. Stream processing is used for simple response functions, aggregates, or calculations such as rolling averages.

what is Power BI

a collection of software services, apps, and connectors

a view is

a virtual table based on the results set of a query
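
A minimal sketch using Python's built-in sqlite3 module shows the idea; the table, column, and view names are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "Ana", 250.0), (2, "Bo", 80.0), (3, "Ana", 40.0)])

# The view is a named, reusable query; no data is copied.
conn.execute("CREATE VIEW big_orders AS SELECT customer, total FROM orders WHERE total > 100")
print(conn.execute("SELECT * FROM big_orders").fetchall())  # [('Ana', 250.0)]
```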

Data processing solutions often fall into one of two broad categories:

analytical systems, and transaction processing systems.

In ____________ processing, newly arriving data elements are collected into a group. The whole group is then processed at a future time as a ____________

batch ; batch

Buffering and processing the data in groups is called _____________

batch processing.
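
A toy Python sketch of that buffering behavior (the batch size and records are invented):

```python
# Collect newly arriving elements into a buffer, then process the whole
# group at once: the essence of batch processing.
buffer, BATCH_SIZE = [], 3

def process_batch(batch):
    print(f"processing {len(batch)} records:", batch)

for record in ["r1", "r2", "r3", "r4", "r5"]:
    buffer.append(record)
    if len(buffer) == BATCH_SIZE:      # the group is full...
        process_batch(buffer)          # ...process it as a batch
        buffer = []

if buffer:                             # flush any leftover records
    process_batch(buffer)
```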

non-relational collections are often defined

by labeling each field with the name it represents

data processing

cleans the data and converts it into a more meaningful format for exploration

each _______________ in a table is defined by a datatype

column

_____________ database can appear very similar to a relational database

column family

Apache Cassandra is a

column family database

Each row in a table has the same set of

columns

An example of an effective use of batch processing would be a

connection to a mainframe system. Vast amounts of data need to be transferred into a data analysis system and the data is not real-time.

data can be ingested as a

continuous stream or in batches

____________ is stored in a table

data

steps in data analytics

data ingestion (raw data) --> data processing (clean/transform data) --> data exploration (queries, visualizations)
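
A compressed, hypothetical illustration of those three steps in Python (the records and cleaning rule are invented):

```python
# Ingest raw data, process (clean/transform) it, then explore it.
raw = ["  42 ", "17", "", "oops", "5"]                  # ingestion: raw records

cleaned = [int(s) for s in (r.strip() for r in raw)     # processing: trim values,
           if s.isdigit()]                              # drop bad rows, cast to int

print(max(cleaned), sum(cleaned) / len(cleaned))        # exploration: simple queries
```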

Most analytical data processing systems need to perform similar tasks:

data ingestion, data transformation, data querying, and data visualization

Differences between batch and streaming data include:

data scope; data size; performance; analysis

what is semi-structured data?

data structure is defined within the actual data by fields.
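
JSON is the classic case: as the hypothetical record below shows, every field name travels with the data itself, so no external table schema is needed to interpret it:

```python
import json

# A hypothetical semi-structured record: the structure (field names) is
# embedded in the data rather than defined by an external schema.
record = json.loads('{"id": 7, "name": "Ana", "tags": ["vip", "trial"]}')
print(record["name"], record["tags"])  # Ana ['vip', 'trial']
```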

In data analytics, different data sources are typically brought together in a

data warehouse so you can answer questions with the data

A column family is a

database object that contains columns of related data. It is a tuple (pair) that consists of a key-value pair, where the key is mapped to a value that is a set of columns. In analogy with relational databases, a column family is like a "table", with each key-value pair being a "row". Each column is a tuple (triplet) consisting of a column name, a value, and a timestamp
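
Expressed as a rough Python sketch (all keys, column names, and timestamps are hypothetical), that definition looks like:

```python
# A column family maps a row key to a set of columns; each column is a
# (name, value, timestamp) triplet.
personal = {  # the column family (the "table")
    "row-1": {  # key -> set of columns (a "row")
        "name": ("name", "Ana",     1700000000),
        "city": ("city", "Seattle", 1700000050),
    },
    "row-2": {
        "name": ("name", "Bo",      1700000100),
        # note: rows need not share the same set of columns
    },
}
print(personal["row-1"]["city"])  # ('city', 'Seattle', 1700000050)
```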

you can often categorize your data analytics as

descriptive; diagnostic; predictive; prescriptive; cognitive

A ____________database is a database in which data is stored across different physical locations. It may be held in multiple computers located in the same physical location (for example, a datacenter), or may be dispersed over a network of interconnected computers

distributed

Azure Cosmos DB is an example of a

document-based database

One disadvantage of a document database is

duplication due to information being stored by document. For example, if two customer documents have the same address, the address would need to be stored in both documents. Duplication increases storage required and also makes maintenance more complex.
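
A small sketch of the duplication problem with two hypothetical customer documents:

```python
# Two invented customer documents in a document database. Because each
# document is self-contained, the shared address is stored twice.
customers = [
    {"id": 1, "name": "Ana", "address": {"street": "1 Main St", "city": "Redmond"}},
    {"id": 2, "name": "Bo",  "address": {"street": "1 Main St", "city": "Redmond"}},
]

# If the street changes, every document holding it must be updated.
for doc in customers:
    doc["address"]["street"] = "2 Main St"
```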

information about the relationship between objects in a graph database

edge

unstructured data is often used to

extract data from it and categorize or identify "structures"

ELT

extract, load, transform

ETL stands for

extract, transform, load

data analytics help you

identify strengths and weaknesses in your organization

cognitive analytics

inspired by how the human brain works; uses a self-learning feedback loop

the aim of data processing

is to convert the data into meaningful structures

JSON

JavaScript Object Notation

in a normalized database schema, data is retrieved by

joining tables together in a query

in a relational database, data is retrieved by

joining tables together in a query
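
A minimal sketch with Python's built-in sqlite3 module; the schema is invented, but it shows two normalized tables related by a primary key/foreign key pair and rejoined in a query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two narrow, normalized tables; orders.customer_id is a foreign key
# referencing the customers primary key.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, "
             "customer_id INTEGER REFERENCES customers(id), total REAL)")
conn.execute("INSERT INTO customers VALUES (1, 'Ana'), (2, 'Bo')")
conn.execute("INSERT INTO orders VALUES (10, 1, 250.0), (11, 2, 80.0)")

# Retrieval joins the tables back together in a single query.
rows = conn.execute("""
    SELECT c.name, o.total
    FROM orders AS o JOIN customers AS c ON o.customer_id = c.id
""").fetchall()
print(rows)  # [('Ana', 250.0), ('Bo', 80.0)]
```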

example of semi-structured data

JSON

types of non-relational data

JSON, Avro, ORC, Parquet

A ___________ database stores Associative arrays

key-value

4 examples of NoSQL categories

key-value stores, document based, column family databases, graph databases

Other types of semi-structured data

key-value stores; graph databases

examples of Power BI visualizations

line graph; bar chart; matrix; tree map; scatter plot

What is NoSQL?

a loose term used to describe non-relational databases

While normalization enables fast throughput for transactions, it can make querying

more complex

non-relational collections can have

multiple entities in the same collection or container with different fields

Streaming data processing is beneficial in most scenarios where

new, dynamic data is generated on a continual basis.

information about objects in a graph database

nodes

non-relational collections can have a different

non-tabular schema

A document database is a type of

nonrelational database that is designed to store and query data as JSON-like documents

Typically, the end result of the _____________ process is that your data is split into a large number of narrow, well-defined tables (a narrow table is a table with few columns), with references from one table to another.

normalization

typically when we design a relational database, we need to _____________ the data

normalize

Splitting tables out into separate groups of columns to support fast processing is called

normalization.

In __________________, information systems typically facilitate and manage transaction-oriented applications

online transaction processing

Parquet is an

open source file format available to any project in the Hadoop ecosystem

predictive analytics is used to

predict what will happen in the future

data ingestion

the process of obtaining and importing data for immediate or later use

The act of setting up the database server is called

provisioning.

an index optimizes

queries for faster data retrieval

After data is ingested and transformed, you can ________ the data to analyze it

query

the focus of a key-value store is to

read/write data quickly

Streaming handles data in __________

real time

data is normalized to

reduce data duplication

Databases that hold tables in the form of structured data are called

relational databases

Structured data is typically stored in a ____________database such as _________ or _____________

relational; SQL Server; Azure SQL Database

Avro is a

row-based format

Avro is a

row-oriented remote procedure call and data serialization framework developed within Apache's Hadoop project

all _____________ in a table have the same number of columns

rows

benefits of a relational database

simple structure; easier to understand; can remove duplication with normalization

In _______________, each new piece of data is processed when it arrives. For example, data ingestion.

stream processing

Processing data as it arrives is called ______________

streaming
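
A sketch of processing each element as it arrives, here with a rolling average over the most recent readings (the sensor_stream source and all values are invented stand-ins for a real continuous feed):

```python
import collections
import random

def sensor_stream(n):
    """Stand-in for a continuous data source (values are invented)."""
    for _ in range(n):
        yield random.uniform(18.0, 24.0)

window = collections.deque(maxlen=5)   # rolling window of recent readings
for reading in sensor_stream(20):
    window.append(reading)             # each element is processed on arrival
    rolling_avg = sum(window) / len(window)
    print(f"reading={reading:.1f}  rolling_avg={rolling_avg:.1f}")
```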

characteristics of a relational database

structure of tables and columns, rigid structure, tables are related to each other with primary and foreign keys

A key-value database stores data as a single collection without

structure or relation

3 classifications of data

structured, semi-structured, or unstructured.

a _______________ is where data is stored in a relational database

table

____________ consists of rows and columns

tables

When compared to non-distributed database systems, any data update to a distributed database will

take time to apply across multiple locations.

frequent uses of unstructured data in ML

text analytics, sentiment analysis, vision APIs

an index reduces

the number of data pages that need to be read to retrieve the data for a SQL statement
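
With Python's built-in sqlite3 module you can watch this happen (the table and index names are made up): EXPLAIN QUERY PLAN reports whether the engine scans the whole table or searches via the index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [(f"s{i % 100}", float(i)) for i in range(10_000)])

# Without an index this query scans the whole table; with it, the engine
# reads far fewer pages to find the matching rows.
conn.execute("CREATE INDEX idx_readings_sensor ON readings (sensor)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM readings WHERE sensor = 's7'").fetchall()
print(plan)  # the plan details show a SEARCH using idx_readings_sensor
```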

data processing generally occurs after

the data has been ingested

to query a key-value store you need to know

the keys you want to look up the values for

in a column family database

there are groups of columns that are logically grouped together

Stream processing is ideal for

time-critical operations that require an instant real-time response.

A ______________ is a sequence of operations that are atomic. This means that either all operations in the sequence must be completed successfully, or if something goes wrong, all operations run so far in the sequence must be undone

transaction
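
A small sqlite3 sketch of atomicity (account names and amounts are hypothetical): a simulated failure partway through causes every operation in the transaction to be undone:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('Ana', 100.0), ('Bo', 50.0)")
conn.commit()

try:
    with conn:  # transaction: commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'Ana'")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'Bo'")
        raise RuntimeError("simulated crash before commit")
except RuntimeError:
    pass

# Both updates were undone together: all or nothing.
print(conn.execute("SELECT * FROM accounts").fetchall())
# [('Ana', 100.0), ('Bo', 50.0)]
```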

A primary use of relational databases is to handle

transaction processing.

An example of ineffective batch-processing would be to

transfer small amounts of real-time data, such as a financial stock-ticker.

Depending on the __________ such as structured, semi-structured, or unstructured, data will be stored _______________.

type of data; differently

In a non-relational database, each key should have a

unique id

Not all data is structured or even semi-structured. For example, audio and video files, and binary data files might not have a specific structure. This is called

unstructured data

You can categorize data in many different ways, depending not only on how it's structured, but also on how the data is __________

used

prescriptive analytics

used to answer questions about what actions need to be taken to achieve certain targets; relies on machine learning to figure out patterns in past experience

in a normalized database schema primary keys and foreign keys are

used to define relationships

examples of unstructured data

video, audio, media streams, documents

Relational and non-relational databases are suited to different

workloads.

Which one of the following tasks is a role of a database administrator? ( 1. Backing up and restoring databases 2. Creating dashboards and reports 3. Identifying data quality issues )

1. Backing up and restoring databases

difference between etl and elt

1. ETL loads data first into a staging server and then into the target system, whereas ELT loads data directly into the target system.
2. ETL is used for on-premises, relational, structured data, while ELT is used for scalable cloud structured and unstructured data sources.
3. ETL is mainly used for small amounts of data, whereas ELT is used for large amounts of data.
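
A toy Python sketch of the ordering difference; the extract/transform/load helpers are hypothetical stand-ins for real pipeline stages:

```python
# Illustrative only: the same three steps, in the two orders.
def extract():          return [" 42 ", "17", "bad"]
def transform(rows):    return [int(s) for s in (r.strip() for r in rows) if s.isdigit()]
def load(rows, target): target.extend(rows)

warehouse, lake = [], []

# ETL: transform in a staging step, then load the clean data.
load(transform(extract()), warehouse)

# ELT: load the raw data first; transform later, inside the target system.
load(extract(), lake)
lake[:] = transform(lake)

print(warehouse, lake)  # [42, 17] [42, 17]
```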

Advantages of batch processing include:

1. Large volumes of data can be processed at a convenient time.
2. It can be scheduled to run at a time when computers or systems might otherwise be idle, such as overnight or during off-peak hours.

Which one of the following roles is not a data job role? ( 1. Systems Administrator 2. Data Analyst 3. Database Administrator )

1. Systems Administrator

Disadvantages of batch processing include:

1. The time delay between ingesting the data and getting the results.
2. All of a batch job's input data must be ready before a batch can be processed. This means data must be carefully checked; problems with data, errors, and program crashes that occur during batch jobs bring the whole process to a halt.

Which of the following is an example of unstructured data? (1. An Employee table with columns Employee ID, Employee Name, and Employee Designation 2. Audio and Video files 3. A table within SQL Server database )

2. Audio and Video files

Which of the following tools is a visualization and reporting tool? ( 1. SQL Server Management Studio 2. Power BI 3. SQL )

2. Power BI

Examples of streaming data include:

A financial institution tracks changes in the stock market in real time, computes value-at-risk, and automatically rebalances portfolios based on stock price movements.

What is a transactional system?

A transactional system is often what most people consider the primary function of business computing. A transactional system records transactions

A transactional database must adhere to the ______________ properties to ensure that the database remains consistent while processing transactions.

ACID (Atomicity, Consistency, Isolation, Durability)

_______________ are typically read-only systems that store vast volumes of historical data or business metrics, such as sales performance and inventory levels. _______________ are used for data analysis and decision making.

Analytical workloads

______________ are generated by aggregating the facts presented by the raw data into summaries, trends, and other kinds of "Business information."

Analytics

______________ guarantees that each transaction is treated as a single unit, which either succeeds completely, or fails completely

Atomicity

You can provision other services as well in Azure. For example, if you want to store unstructured data such as video or audio files, you can use

Azure Blob storage

If you want to store semi-structured data such as documents, you can use a service such as

Azure Cosmos DB.

_______________ currently provides connections to on-premises SQL Server databases, Azure SQL Database, PostgreSQL, Azure SQL Data Warehouse, and SQL Server Big Data Clusters, amongst others.

Azure Data Studio

_______________ provides a graphical user interface for managing many different database systems.

Azure Data Studio

____________ Database is a service that runs in the cloud. You can use it to create and access relational tables

Azure SQL

____________ is managed and run by Azure; you just specify that you want a database server to be created.

Azure SQL Database service

________________ provides database services in Azure. It's similar to SQL Server, except that it runs in the cloud. You can manage _____________ using the Azure portal.

Azure SQL Database; Azure SQL Database

Data scope: Batch vs. Streaming

Batch processing can process all the data in the dataset. Stream processing typically only has access to the most recent data received, or within a rolling time window (the last 30 seconds, for example).

Data size: Batch vs. Streaming

Batch processing is suitable for handling large datasets efficiently. Stream processing is intended for individual records or micro batches consisting of few records.

__________ ensures that a transaction can only take the data in the database from one valid state to another. A ___________ database should never lose or create data in a manner that can't be accounted for.

Consistency; consistent

