DP-900
What tool would you use for real time visualization of streamed data?
Azure Stream Analytics
What open source big data processing technologies does HDInsight support?
- Apache Spark
- Apache Hadoop
- Apache HBase
- Apache Kafka
- Apache Storm
Non-Relational Databases
- Key-value databases
- Document databases: key-value where the value is a JSON document
- Column-family databases: store tabular data comprising rows and columns, but columns can be divided into groups known as column families
- Graph databases: store entities as nodes with links to define relationships between them
Apache Avro
Apache Avro is a row-based, optimized file format in which each record contains a header describing the structure of the data in that record. The header is JSON; the data is binary. Good for compressing data and minimizing storage and network bandwidth.
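The row-based idea can be sketched with the standard library: a JSON header describes the record structure, and each record is stored as compact binary. This is a conceptual illustration only, not the real Avro wire format, and the field names are made up:

```python
import json
import struct

# Hypothetical row-based layout in the spirit of Avro: a JSON header
# describes the fields, records themselves are binary (NOT real Avro).
header = json.dumps({"fields": [["name", "string"], ["age", "int"]]}).encode()

def encode_record(name: str, age: int) -> bytes:
    # Length-prefixed UTF-8 string followed by a 4-byte signed int.
    name_bytes = name.encode()
    return struct.pack("I", len(name_bytes)) + name_bytes + struct.pack("i", age)

def decode_record(blob: bytes) -> tuple:
    (n,) = struct.unpack_from("I", blob, 0)
    name = blob[4:4 + n].decode()
    (age,) = struct.unpack_from("i", blob, 4 + n)
    return name, age

print(decode_record(encode_record("Ada", 36)))  # ('Ada', 36)
```

Because every record carries the same schema described once in the header, the binary payload stays small — the property that makes row-based formats like Avro bandwidth-friendly.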
Benefits of SQL Managed Instance
Automated backups, software update management, and other maintenance handled by Azure; near-100% compatibility with on-prem SQL Server; reduced management overhead.
What tool ingests big data and deposits it into a Data Store?
Azure Data Factory
What data services are PaaS?
Azure SQL Database, Azure Cosmos DB, and Azure Database for open-source engines (MariaDB, PostgreSQL, MySQL)
What provides an interface to interact with Azure Synapse Analytics and allows the creation of interactive notebooks in which synapse code and markdown content can be combined?
Azure Synapse Studio
Data engineers use __ to host data lakes: blob storage with a hierarchical namespace that enables files to be organized in folders in a distributed file system
Azure storage
Azure data factory
A big data orchestration tool consisting of pipelines that are triggered manually, on a schedule, or by events. A pipeline collects data from a source, transforms it through various tasks, and then sinks it into another data store.
Azure Data Lake sits on what kind of storage?
Blob storage
Data sharding
Breaking up semi-structured data into partitions that can be processed by different compute resources for better performance. Example: break up data by month of the year.
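The month-of-the-year example can be sketched in a few lines; the record fields here are invented for illustration:

```python
from collections import defaultdict
from datetime import date

# Illustrative sketch: shard records by month so each shard can be
# handed to a different compute node and processed independently.
records = [
    {"id": 1, "ts": date(2023, 1, 15)},
    {"id": 2, "ts": date(2023, 1, 20)},
    {"id": 3, "ts": date(2023, 3, 2)},
]

shards = defaultdict(list)
for rec in records:
    shards[rec["ts"].month].append(rec)  # shard key: month

print(sorted(shards))  # [1, 3] -> two shards, processed in parallel
```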
Append blobs size limits and restrictions
Blocks can only be added to the end of an append blob; existing blocks cannot be updated or deleted. Each block can be up to 4 MB, and the maximum append blob size is about 195 GB.
OLTP (online transaction processing)
Capturing a high volume of small transactions that constantly come in and are stored as they happen. Needs fast access and normalized data. Examples: SQL Server, Azure Database for PostgreSQL/MySQL/MariaDB.
Cassandra is an example of a ___ model type database
Column
Apache Parquet is an example of what kind of data storage type?
Columnar, where data is stored as columns (e.g. get all ages by looking at a single column). Parquet is a columnar data file format created by Cloudera and Twitter. A file contains row groups; the data for each column is stored together in the same row group, and each row group contains one or more chunks of data. The file includes metadata that describes the set of rows found in each chunk, so an application can use the metadata to quickly locate the correct chunk for a given set of rows and then retrieve the data in the specified columns for those rows. Very efficient compression and encoding schemes.
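The row-store vs. column-store contrast can be shown conceptually with plain Python (this sketches the idea behind Parquet, not its actual file format; the sample data is made up):

```python
# Conceptual columnar storage: values for each column live together,
# so reading one column touches only that column's data.
row_store = [
    {"name": "Ann", "age": 34},
    {"name": "Bob", "age": 51},
    {"name": "Cy",  "age": 29},
]

# Pivot rows into a column store: one contiguous list per column.
col_store = {key: [row[key] for row in row_store] for key in row_store[0]}

# "Get all ages" reads a single column instead of scanning every row.
print(col_store["age"])  # [34, 51, 29]
```

Storing a column contiguously is also what enables Parquet's strong compression: similar values sit next to each other and encode well.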
Data Definition Language (DDL)
Create, modify, and remove objects in a database such as tables, stored procedures, and views. Keywords: CREATE, ALTER, DROP, RENAME.
Data Manipulation Language (DML)
DBMS language that changes database content, including data creation, updates, insertions, and deletions. Keywords: INSERT, UPDATE, DELETE.
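DDL and DML can be exercised together with Python's built-in sqlite3 module; the table and column names below are invented for the demo:

```python
import sqlite3

# Sketch: DDL creates the table, DML changes its contents.
con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.execute("CREATE TABLE Product (id INTEGER PRIMARY KEY, name TEXT, price REAL)")  # DDL
cur.execute("INSERT INTO Product (name, price) VALUES ('widget', 9.99)")             # DML
cur.execute("UPDATE Product SET price = 7.99 WHERE name = 'widget'")                 # DML
cur.execute("DELETE FROM Product WHERE price > 100")                                 # DML

print(cur.execute("SELECT name, price FROM Product").fetchall())  # [('widget', 7.99)]
```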
What tool do you use to check compatibility with on-prem SQL server with a migration to managed instance?
Data Migration Assistant (DMA). This tool analyzes your databases on SQL Server and reports any issues that could block migration to a managed instance.
batch processing
Data is collected, stored somewhere, and processed on an interval. Meant for large volumes of data; high latency.
Stream processing
Data is processed as it arrives, in real time. Must be scaled correctly, since insufficient resources mean data can be lost.
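The batch-vs-stream distinction can be sketched with a list (already-collected data) versus a generator (events consumed as they arrive); the event values are arbitrary:

```python
# Batch: all data already collected, processed in one pass (high latency).
# Stream: each event processed the moment it arrives (low latency).
events = [3, 1, 4, 1, 5]

def batch_process(stored):
    return sum(stored)  # one result after the whole interval

def stream_process(source):
    running = 0
    for event in source:
        running += event
        yield running  # emit an up-to-date result per event

print(batch_process(events))         # 14
print(list(stream_process(events)))  # [3, 4, 8, 9, 14]
```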
Types of analysis
- Descriptive - Tell me what happened
- Diagnostic - Why it happened
- Predictive - What is going to happen based on history
- Prescriptive - What should I do if I want this outcome?
- Cognitive - Conclusions based on knowledge
All feed into Power BI.
Cosmos DB (SQL API) and MongoDB are examples of a ___ model type database
Document
Block blobs size and limit
Each block can vary in size, up to 100 MB. A block blob can consist of up to 50,000 blocks, for a maximum size of about 4.7 TB.
What does ETL and ELT mean?
Extract, Transform, Load and Extract, Load, Transform: methods for processing big data.
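ETL transforms data before loading it into the destination; ELT loads the raw data first and transforms it inside the destination store. A stand-in sketch (extract/transform/load are made-up functions, not a real pipeline API):

```python
def extract():
    return [" alice ", " BOB "]  # raw data from a source

def transform(rows):
    return [r.strip().lower() for r in rows]  # clean/normalize

def load(rows, store):
    store.extend(rows)  # sink into the destination
    return store

# ETL: transform BEFORE loading into the destination.
etl_store = load(transform(extract()), [])

# ELT: load raw data first, then transform in the destination.
elt_store = transform(load(extract(), []))

print(etl_store, elt_store)  # both end as ['alice', 'bob']
```

ELT is common with data lakes and warehouses, where the destination has enough compute to do the transformation itself.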
Page blob size and limits
Fixed-size 512-byte pages. A page blob can hold up to 8 TB of data.
Azure Synapse Analytics
Fully managed data warehouse with integral security at every level of scale at no extra cost. An end-to-end solution for data analytics: high-performance SQL with data lake flexibility and open-source Apache Spark, data pipelines for ingestion and transformation, and native support for log and telemetry analytics with Azure Synapse Data Explorer pools.
What data structure is good for knowing relationships between data nodes?
Graph, which breaks data into nodes and creates relationships between nodes that can be queried
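A minimal sketch of the graph model with stdlib Python — entities as nodes, relationships as edges, queried by traversal (the names are invented for the example):

```python
from collections import deque

# Nodes and directed "knows" relationships, as an adjacency map.
edges = {
    "Alice": ["Bob"],
    "Bob": ["Carol"],
    "Carol": [],
}

def reachable(start):
    # Breadth-first traversal: every node `start` is connected to.
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(reachable("Alice"))  # {'Bob', 'Carol'}
```

Graph databases make this kind of multi-hop relationship query cheap, where a relational join chain would grow with each hop.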
What tool would you use to migrate existing hadoop based solution to the cloud?
HDInsight
What must you enable to allow a gen2 storage account to be optimized as a datalake for HDinsight, Azure Databricks, Synapse Analytics, and more?
Hierarchical namespace must be enabled for the storage account (this is irreversible), and a blob container must be created.
Data lakehouse
Hybrid approach that combines features of data lakes and data warehouses: raw data is stored as files in a data lake, and a relational storage layer abstracts the underlying files and exposes them as tables meant for querying.
Data Control Language (DCL)
Manage access to objects in a database by granting, denying, or revoking permissions to specific users or groups. Keywords: GRANT, DENY, REVOKE. Example: GRANT SELECT, INSERT, UPDATE ON Product TO user1;
Azure Data Warehouse
Relational database built on SQL Server where data is stored in a schema optimized for data analytics rather than transactional workloads
What SQL solution is best for "Lift and Shift"?
SQL Server on a VM, IaaS, or SQL MI if you don't want a VM overhead
What does PaaS data options offer?
- Scalability
- High availability natively
- No OS access
- Constant updates and patches
What two modes can SQL Database run in?
Single Database - a single database, charged per hour, that can scale by adding more space, memory, and processing power. Elastic Pool - purchase a pool of resources (RAM, space, processing power) on which you can run multiple database instances; each instance uses resources from the pool and releases them when no longer needed.
Azure Data Lake
A means of storing and analyzing massive amounts of unstructured data. Runs on general-purpose v2 storage (block blobs), meant for high-performance data access.
TSQL
Transact-SQL, Microsoft's dialect of SQL (e.g. CREATE TABLE, DROP TABLE)
Azure data bricks does what in the ELT/ETL process?
Transforms and cleans the data and then deposits it into a structured data sink such as Azure Synapse Analytics
How many logical partitions can be in a Cosmos DB container and how big can they be?
Unlimited logical partitions; each can hold up to 20 GB
partition key
Used to identify and separate data into logical partitions in semi-structured data.
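How a partition key groups items into logical partitions can be sketched in a few lines, Cosmos DB-style: items sharing a key value land in the same partition (the items and the `city` key are made up for the demo):

```python
from collections import defaultdict

items = [
    {"id": "1", "city": "Seattle"},
    {"id": "2", "city": "London"},
    {"id": "3", "city": "Seattle"},
]

# Partition key: "city". Every item with the same value goes to the
# same logical partition, which is how queries scoped to one key value
# touch only one partition.
partitions = defaultdict(list)
for item in items:
    partitions[item["city"]].append(item)

print(sorted(partitions))          # ['London', 'Seattle']
print(len(partitions["Seattle"]))  # 2
```

Choosing a key with many distinct, evenly used values keeps any one logical partition from hitting the 20 GB cap mentioned above.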
OLAP (online analytical processing)
Very large volumes of data stored in a data warehouse for analytics, such as Synapse. Manipulation of information to create business intelligence in support of strategic decision making.