DP-900: Azure Data Fundamentals
How do you create Azure Table Storage table?
Create resource>>StorageAccount>> Set settings and deploy>> select table in overview of that storage account>> + table
How can I get more control?
Install db on virtual machine in azure
What are the advantages of using Azure Table Storage?
It is simpler to scale. Same time to insert data in empty table or table with billions of entries. Can hold up to 500 TB of data. Row insertion is fast Data retrieval is fast if you specify partition and row keys Table can hold semi structured data.
Drawback to normalization?
It isn't as intuitive, because of all the joins to get data back together queries may be more expensive.
Some formats of semi structured data
JSON, Avro (row based format), ORC Optimized Row Columnar Format (columns rather than rows), Parquet another columnar format
Eventual Consistency
Least Consistent Changes will sync to all locations eventually. Could appear out of order.
What is a clustered index?
Like an index afore mentioned except the table is sorted by that index, no copy needed, table is sorted by the clustered index.
What are cognitive analytics?
ML recommendations based on a model. AI
What is Azure Cosmos DB?
Multi-model No SQL database management system. Manages data as partitioned set of documents. Which is a collection of fields identified by a key. It is highly scalable. Cosmos DB automatically allocates space in a container for your partitions and each partition can grow up to 10GB in size. Indexes created and maintained automatically
How is data in a relational database organized?
Rows and Columns
What are key characteristics of storing non relational data?
Store the data without structure. No schema imposed on the data. You store and consume it in its natural format.
Describe Streaming Data
Streaming data is continuous no start or stop. Examples would be IOT, logs, etc. Other examples: sensor data, event or IOT hub, blob storage for logs, apache kafka, netflix, youtube, course video.
How many ways can you classify data?
Structured (tabular, csv represents rows and columns) Semi-Structured (JSON, key-value, graph db) Unstructured (audio or video files)
What is provisioning?
The act of setting up the database is provisioning.
What is change feed option in Blob Storage Service?
The change feed for a blob provides an ordered, read-only, record of the updates made to the blob
`Can you combine analysis services with Synapse Analytics?
Yes, example use Synapse to clean and shave raw data into well processed data. Then use Analysis services to do detailed interrogation of data. Synapse is a more powerful processor, will process the data faster and spit out a smaller, meaner and cleaner data set that Analysis Services can use.
When to use block blob vs page blob
block blob - infrequent changes page blob - when require random read/write access
In Azure Databricks, how do you change the language a cell uses?
first line %language example: %python
What is Azure table storage key comprised of?
partition identifier + row key Items in same partition are stored in row key order
What are snapshots
read only version of a blob at a particular point in time
What is azure data lake storage?
repo for large quantity of raw data. Think of it as a staging point for your ingested data before it's massaged and converted into format suitable for analytics
You can define several levels of access in Azure for your data, is this true?
yes, read-only access, read-write access, owner privilege.
How does azure data services cost you?
(Base price covers infrastructure and licensing) + (Administration costs)
What is ELT process?
(ELT) a form of data processing that stands for extract, load then transform. Example of how this would be done Extract data and load it to data lake then perform transformations with databricks and move to synapse data warehouse Data is available before transformations
Azure Relational Database Options
(IAAS) Infrastructure as a service - (PAAS) Platform as a service (SAAS) Software as a service
What is IaaS and when should you use it?
(Iaas) Infrastructure as a service. IaaS is when you create a virtual infrastructure in the cloud that mirrors the way an on premises data center may work. You are responsible for day-to-day adminstrative operations such as configuration, backups patching and restoring data. Best for migrations where you want to lift your on premise system into the cloud without any changes. May have to make small configuration changes but that is it
What is SQL Managed Instance
(Isn't used much) Close to 100% compatability to on premise. Fully managed service (Azure manages) 4-80 vCores 32GB to 8TB
What are two kinds of data processing solutions?
(OLTP) Online Transaction Processing (OLAP) Online Analytics Processing
What is a security principle?
An object that represents a user, group, service, or managed identity that is requesting access to Azure resources.
How many access tiers does blob storage provide
3 Hot teir - default, used for blobs that are accessed frequently, high performance Cool Tier - lower performance, reduced storage charges compared to hot tier, used for data accessed infrequently. Can migrate cool tier back to hot tier. Archive Tier - lowest cost, increased latency. intended for historical data that mustn't be lost. Can switch to cool or hot tier, but the blob must be rehydrated first.
What is the availability of Azure Data Services?
99.99 (4 9s)
Eventual Consistency
A model for database consistency in which updates to the database will propagate through the system so that all data copies will be consistent eventually. Minimizes latency but will lead to temporary inconsistent data between locations.
What is Azure Data Factory?
An orchestration tool (like ssis) Data integration service retrieve data and convert it to format that you process.
What is Azure Synapse Analytics?
An analytics engine, designed to process large amounts of data very quickly. Ingest data from external sources such as flat files, Azure Data Lake, or other DB managment systems. Transform and aggregate this data into a format suitable for analytics processing, perform queries, reports, graphs, and charts.
What is a strong consistency option?
All writes are only visible to clients after changes are confirmed as written successfully to all replicas. Unavailable if you need to distribute your data across multiple global regions
A ___ is a collection of preset, ready-made visuals and reports that are shared with an entire organization
An App
What is Azure Databricks?
Apache spark environment running on azure to provide big data processing, streaming and machine learning
A collection of ready-made visuals prearranged in dashboards and reports is called what in PowerBi
App
What is the advantage of using multi-region replication with Cosmos DB
Availability is increased
What is a command-line utility optimized for transferring large files (and blobs) between your local computer and Azure File storage?
AzCopy
Use __________ for smaller volumes of data (a few terabytes) and multiple sources that can be correlated.
Azure Analysis Services
What are the roles in the world of data?
Azure DBA Role Data Engineer Data Analyst
What are several tools that you can use to provision services
Azure Portal Azure Command-line interface (cli) Azure Powershell (ARM) Azure Resource Manager Templates
What are the options for protecting Azure resources such as storage account, Azure cosmos DB, etc.
Azure Private Endpoint Firewalls and virtual networks Configure Authentication Configure Access Control Configure advanced security
What are currently available relational databases on azure?
Azure SQL Database Azure Database MySQL Azure Database PostgreSQL Azure Database MariaDB
Use ____________ for very high volumes of data (multi-terabyte to petabyte size) and very complex queries and aggregations.
Azure Synapse Analytics
_________ leverages a massively parallel processing (mpp) architecture.
Azure Synapse Analytics
What is Polybase in the above design?
Azure Synapse Analytics uses tech named polybase. Polybase enables you to retrieve data from relational and non-relational sources, such as delimited text files, Azure blob storage, data lake storage. You can save the data read in as sql tables within the synapse analytics service. Converts read data from all sources into sql tables in synapse.
What is the Azure service that implements the NoSQL key-value model?
Azure Table Storage
What is Azure Table Storage?
Azure Table Storage is a scalabel key-value store held in the cloud. You create a table using an Azure storage account. In Azure table storage items are referred to as rows and fields are known as columns.
What is Azure RBAC?
Azure role-based access control.
What is Azure Table Storage?
Azure table storage is a scalable key-value store held in the cloud. You can use it to store semi-structured data.
What is Batch Data?
Batch data includes CSV, TSV, Json, XML, Parquet, Blob files, another database, cache for offline viewing
What is batch data?
Batch data is any load of data that has a beginning and end, is not continuous.
Modern Data Warehouse
Big Data Collected at escalating volumes Higher velocities Greater variety of formats May be historical or streamed from source Have to use this data to provide insight
What is the cheapest way to store data in Azure?
Blob Storage
What are different types of blobs that Azure Blob Service supports?
Block Blobs - blocks of up to 100 mb grouped for a maximum size of over 4.7TB Page Blobs - collection of fixed size 512-byte pages. A page blob is optimized to support random read and write operations. Can hold up to 8 TB of data. Azure uses these to implement virtual disk storage for virtual machines Append Blobs - block blob optimized to support append operations. can only add blocks to end of append blob. Can't update or delete existing blocks. Each block can vary in size up to 4 MB, max size of append b lob is just over 195 GB
What should you do if you want to maintain and restore earlier versions of a blob.
Versioning
What is the purpose of data ingestion?
Capture data as quickly as possible
What is Azure SQL DB
Close to 100% compatibility to on premise Many options for provisioned and serverless DB Pay for performance or pay for hardware 2-80 vCores 5GB to 4TB Starting at $5 per month Uses SQL Server Engine Underneath
What are disadvantages of using Azure Table Storage?
Consistency needs to be given consideration as transactional updates across multiple entities aren't guaranteed. No referential integrity difficult to filter, queries based on non-key fields result in full table scans.
Azure Synapse Analytics includes a control node and a pool of compute nodes. Explain the architecture
Control node is brain of architecture. Its the front end that interacts with all applications. MPP engine runs on control node to optimize and coordinate parallel queries. When you submit a request the control node transforms it into smaller requests that run against distinct subsets of the data in parallel. Compute nodes provide the computational power. The data to be processed is distributed evenly across the nodes. Users and applications send processing requests to the control node, node sends queries to compute nodes. results are sent to control node and combined to result.
Give me example of Non-Relational DB options?`
Cosmos DB, Table Storage (up to 5PB), Blob (up to 5 PB, cheapest), File Storage.
SQL Database supports two purchasing models, what are they?
DTUs and vCore
Two important stages of data analytics
Data Ingestion and Data Processing
What are the tasks that are involved in an Analytical System?
Data Ingestion: Process of ingesting data, could be batch or stream data. Also could be structured, semi-structured or Unstructured data. Data Transformation/Processing: Process of organizing and cleaning data, may also add calculations. This results in a usable data set for analytics and insight. Data Querying: Query data for insight. Data Visualization: Allows you to understand data more intuitively, tool to better understand data and share data.
What is data?
Data is a collection of facts such as numbers, descriptions and observations used in decision making.
Difference between data lake and data warehouse
Data lake is raw unstructured data, data warehouse is structured and organized
What are Prescriptive Analytics?
Describe what to do to fix problem. We need to drop prices 10% to encourage customers to make bigger purchases since less people are outside.
What are Predictive Analytics?
Describe what will happen. Next time we have a pandemic we can expect revenue to drop 10% based on history.
What are Diagnostic Analytics?
Describes why it happened. For example revenue is down 10% year over year due to corona virus which resulted in less purchases industry wide.
What are the 5 types of Analytics?
Descriptive (What Happened), Diagnostic (Why it Happened), Predictive (What will Happen), Prescriptive (What should I do), Cognitive (Machine learning predictions based on model)
What are Descriptive Analytics?
Descriptive Analytics describe what happened. For example revenue is down 10% year over year.
Which consistency option of CosmosDB provides the lowest latency and least consistency?
Eventual Consistency
ETL
Extract, transform, load. Perform extract and transform logic before loading data to be available. Data is not available before transformations are performed
What is the overarching principle for network security in Azure SQL Datbase?
Grant minimum permissions to successfully complete task.
Why use SQL Server in a VM?
Guaranteed to be compatible to on premise sql server. No data limitations (run above 4 TB) Pay for server and licensing not per DB (could be pos/neg)
Some use cases for non relational data?
IOT and telematics Retail and marketing - microsoft uses cosmos DB for its own ecommerce platforms (windows store and xbox live) Gaming - requires low latency Web and Mobile Application - well suited for modeling social interactions and more
Scenarios where CosmosDB is Suitable?
IOT and telematics Retail and marketing Gaming Mobile applications
Session Option
If application makes number of changes, they'll be visible to that application and in order. Other apps may see old data, although any changes that are seen will be in order as they would be in consistent prefix
Why creating indexes make inserts or updates or deletes slow?
Index requires more memory which incurs additional processing charges for these statements.
What is Azure Blob Storage?
Store massive amounts of unstructured data in the cloud
How Azure handles Security?
Must have public network access and add IP to get through firewall. If you don't have public access no one can access. Only servers on azure network could access. TLS version 1.2 min Must check allow other azure services to access server if you want diff apps to access Can list ip range in firewall, or list ips individually Also can give access through the firewall through virtual network. Database permissions managed at database level not firewall and security level
What are use cases of Azure SQL Database?
Need modern stable sql server features Applications require high availability Systems w/variable load and need to scale quickly
If you set up your sql Database networking with "public endpoint" without any other actions, which type of user can connect to the database?
None, you must configure firewall to allow anyone in. Must update the firewall access list.
Limitations to Azure Data Services
Not all features of a database management system are available because azure manages many of them for you. You have no control over platform on which services run
What are NoSQL databases?
Not only SQL data stores. Generally fall into 4 categories - Key Value: very fast - Graph - Columnar - Document
What are ACID principles?
OLTP must adhere to ACID priciples Atomicity - each transaction is single unit fails completely or succeeds completely. Consistency - A consistent database should never lose or create data from original transaction. It is consistent through out. Isolation - Ensures that concurrent execution of transactions leaves the database in the same state it was obtained. Durability - Ensures once committed transaction remains committed even if there is a system failure or crash.
PowerBI consists of three main elements. What are those elements?
Power BI Desktop Power BI Service Power BI Mobile
Which power BI tool helps build paginated reports?
Power BI Report Builder
Which category (IaaS,PaaS,SaaS) does azure data services fall into?
PaaS
What is PaaS and when should you use it?
PaaS stands for Platform-as-a-service. Don't have to create the infrastructure, azure does this for you. You specify resources that you require and azure automatically creates the necessary virtual machines, networks and other devices for you. PaaS is more scalable allows you to scale without purchasing your own hardware.
Which component of an Azure Data Factory can be triggered to run data ingestion tasks?
Pipelines
What are point queries and range queries in Azure Table Storage?
Point Query - retrieves a single row Range Query - returns a range of rows
Which port doze Azure SQL Database use?
Port 1433, must be opened by your networks firewall if you want to connect Azure SQL Database server.
Common flow of activity in PowerBI
PowerBI Desktop (create report)>>(Share to) PowerBI Service and PowerBi Mobile
What is the common flow of powerBI
PowerBI Desktop create report Publish to PowerBI Service create visuals (reports,dasboards,etc.) Share dashboards with others View and interact through mobile apps
What is normalization?
Process of separating data into separate tables that can be connected. Normalization reduces the number of rows and storage as you have less duplicated information. It does require more joins and connection to get the data back together.
Use cases of Azure Table Storage?
Product Catalog Logging
What is change feed on a blob?
Provides ordered, read only, record of the updates made to a blob. You can use it to monitor blob changes.
Session
Read your own writes If application makes a number of changes they all be visible to the application in that order. Other apps may see old data though any changes that appear will appear in order. Sometimes known as read your own writes.
What are the benefits of normalization
Reduces data duplication Reduces common typos and error on inputs
What is Business Intelligence?
Refers to the technologies, applications, and practices for the collection, integration, analysis, and presentation of business information. The purpose of business intelligence is to support better decision making.
When you configure CosmosDB for replication what is the default behavior?
Region which created account supports write ops, replicas are all read-only.
Describe characteristics of relational data
Relational data is structured, has a schema, is rigid. Databases composed of tables with rows and columns. Data integrity based on keys, datatypes, relations. Components include: - tables - views - Primary: unique row - Foreign Keys: child parent relationship Schema - layout of the database including table names, column names and their data types Databases enforce integrity
What is replication and how many options we have when provisioning a storage account?
Replication - Azure storage account replicated 3x in region you specify as primary location for account Locally Redundant Storage (LRS) - Copies data synchronously three times within single physical location in region. LRS is least expensive, but isn't recommended for applications that require high availability Geo Redundant Storage (GRS) - copies data synchronously 3x in single physical location (primary). Then copies your data asynchronously to a single physical location in the secondary region, protects you against regional outages. Read-Access geo-redundant storage (RA-GRS) - Similar to GRS but you only have read access in secondary location. Used to recover from failure and reduce latency for customers that need read access to data in that area.
Reporting vs Visualization
Reporting - informational summaries Visualization - graphs and charts
What are the different APIs that Cosmos supports
SQL API Table API MongoDB API - document Cassandra API - columnar Gremlin API - graph
Synapse spark vs Synapse SQL Pool
SQL Pool for sql tables Spark for datascience AI stuff
Azure Synapse Analytics supports two computational models. What are those?
SQL Pools Spark Pools
What are Microsoft Azure's 4 Relational Databases?
SQL Server in VM SQL Managed Instance Azure SQL Database Azure SQL Database for MySQL, PostgreSQL or Maria DB
What are the 3 elements of a RBAC role?
Security principal, role definition, scope Security principal - an object that represents a user, group service principal, or managed identity that is requesting access to Azure resources. Role definition - owner: has full access - contributor: near full, can't delegate access to others - reader: can view existing Azure resources - user access administrator - lets you manage user access to Azure resources Scope - lists set of resources access applies to
Use cases for Azure Blob Storage
Serve images Store files Stream video or audio Store data for backup, restore, disaster, or archive Store data for analysis by an on-premises or Azure-hosted service
What are the features of Azure File Storage?
Share up to 100 TB of data in a single storage accfount max size of file is 1TB supports 2000 concurrent connections per shared file upload through azure portal, use tools such as AZCopy utility. Azure file sync to sync locally
What is the difference between analysis services and synapse analytics?
Significant overlap, but it's suited for processing on a smaller scale.
What is Azure HD Insight?
Similar to Synapse but the nodes are running the spark processing engine rather than Azure SQL Database
What are the options available when selecting Azure SQL Database?
Single Database Elastic Pool - multiple databases can share the same resources such as memory, data storage space. Resources are referred to as a pool. Helpful if you have databases whose resources vary over time. Managed Instance - runs fully controllable instance of sql server in the cloud.
Consistent Prefix
Slow right order Changes will appear in order though there may be a delay before they become visible.
Which feature should you enable if you want to recover a blob that has been removed or overwritten by accident?
Soft Delete
Non-Relational databases allow you to store data in a format that more closely matches the original structure. What is the disadvantage to this?
Some of the data is duplicated. This increases storage requirement and makes maintenance more complex.
In a Spark pool, nodes are replaced with ______?
Spark clusters
How many tiers are there for Azure File Storage?
Standard & Premium
What is bounded staleness option?
There is a lag between writing and then reading the updated data. You specify this staleness either as a period of time or number of previous versions the data will be inconsistent for.
Consistent Prefix Option
This option ensures that changes will appear in order though there may be a delay before they are visible
When to use Azure Synapse Analytics?
To perform complex queries and aggregations
Azure Table Storage is intended to support very large volumes of data up to several hundred TBs in size? T/F?
True
Azure data factory is essentially an extension of Azure Blob storage, organized as a near-infinite file system. True?
True
Cosmos DB guarantees less then 10 ms latencies for reads and rights at the 99th percentile around the world. T/F
True
Non relational systems such as azure cosmos db (non relational) support indexing when the sstructure of the indexed data can vary from record to record? T/F?
True
T/F. A container provides a convenient way of grouping blobs, you can organize blobs in hierarchy of folders inside a container.
True
T/F. Before you can use AZCopy you need to create a shared access signature?
True
You control who can read and write blobs inside a container at the container level? T/F?
True
You need to create a storage account before creating an Azure Table Storage? T/F?
True
What is shared Access Signature?
Use it to grant limited rights to resources in an azure storage account for a specified time period. Enables applications to access resources such as blobs and files, without requiring that they're authenticated first.
When might you use Polybase?
Use it when you are ingesting data with Azure Synapse
How to migrate from hot tier to archive after 6 months?
Use lifecycle management policies. Automatically move from hot>>cool>>archive as something ages
What does VLAN and ACL stand for?
VLAN - Virtual Local Area Network ACL - Access Control List
Building blocks of Power Bi
Visualization, datasets, reports, dashboards, tiles
What are the basic building blocks of powerBI?
Visualizations - visual representation of data, sometimes called visuals DataSets - collection of data that PBI uses to create visualization Reports - collection of visualization spanning 1+ pages Dashboards - single page collection a visualizations Tiles - single visualization
When use managed instance?
When all you want is lift and shift.
What is an index?
When you create an index in the database, you specify the column to be indexed. An index contains a copy of this column data in a sorted order with pointers to the corresponding rows in the table, when the user runs a query that specifies this column in the where clause the database system can use the index to fetch the data more quickly then scanning through the entire table row by row.
Strong
Writes are only visible to clients after changes are confirmed as written successfully to all replicas. Unavailable globally
Is a csv file an example of unstructured data?
Yes
What are potential disadvantages to SQL Server in a VM?
You have to do all updates, pick install your sql server version. You manage everything.
Bounded Staleness
lag between writing and reading the updated data. specify staleness as period of time, or number of previous versions the data will be inconsistent for
Consistency Spectrum
low latency, least consistent --> high latency, most consistent Eventual --> Strong