Databricks Competition Study

What are the 4 main elements of data governance?

1) Availability of Data (where is it, who has access) 2) Usability of data (what format, what language is used in processing the data) 3) Integrity of data 4) Security of data

What is the Machine Learning Process?

1) Data Preparation 2) Exploratory Data Analysis 3) Feature Engineering 4) Model Training 5) Model Validation 6) Deployment 7) Monitoring

What are the existing problems of data cleanrooms?

1) Data movement and replication - leads to vendor lock-in and added storage cost 2) Restricted to SQL - can't run arbitrary workloads; limited to simple SQL statements 3) Hard to scale - can't expand collaboration (stuck with one vendor)

What are the motivations behind edge computing/fog computing?

1) Low latency 2) Poor connectivity 3) Reduced server load 4) Privacy. Examples: facial recognition, sensors in jet engines, sensors in cells

Why Databricks over EMR?

1) Lower TCO (Total Cost of Ownership) 2) Reliable Analytics 3) Collaborative 4) Scalable Data Pipelines

How does Delta Lake differ from Iceberg?

1) Overall performance - loading and querying data is 3.5x faster 2) Load performance - loading from Parquet into the intended format (Delta is faster) 3) Query performance - Delta is 4.5x faster

What are the 4 components of MLFlow?

1) Tracking 2) Projects 3) Models 4) Model Registry

What are the 2 ways you can ingest data into a lakehouse?

1) A network of data ingestion partners 2) Integrating data into Delta Lake with Auto Loader

What are the 2 main risks in ML systems?

1) technical risk - inherent to the system itself 2) risk of noncompliance with external systems

How to know if it's a production workload?

1) The workload is operationalized 2) New data is processed on a regular basis (batch/streaming) 3) Key parts have been automated through workflows/jobs 4) It drives business value

What were the 3 components of Hadoop?

1) HDFS - the storage unit (Hadoop Distributed File System); data is split into blocks and stored on several data nodes in the cluster 2) MapReduce - data processing 3) YARN - Yet Another Resource Negotiator (manages the resources)

Each HDFS block is how many megabytes?

128 megabytes (the default HDFS block size)

What are the key features of Delta Lake?

ACID Transactions, Time Travel, Schema Enforcement, Audit History, Iceberg to Delta Converter, OPTIMIZE ZORDER, Clones, etc.
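
Two of these features lend themselves to a quick illustration. A minimal sketch, assuming a Databricks/Delta SQL environment and a hypothetical Delta table named events with an event_date column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Time Travel: query an earlier snapshot recorded in the transaction log.
old_snapshot = spark.sql("SELECT * FROM events VERSION AS OF 0")

# OPTIMIZE ZORDER: compact small files and co-locate related data by a
# chosen column to speed up selective queries.
spark.sql("OPTIMIZE events ZORDER BY (event_date)")
```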

Sagemaker

AWS product that is a fully managed ML platform that enables data scientists and developers to create, train and deploy ML models

What was the first major solution to distributed storage?

Apache Hadoop - built for local, on-premises storage rather than for the cloud

What Databricks Products address Hadoop migration needs for data engineering?

Auto Loader, DLT, Databricks Workflows

What is the difference between batch processing and streaming processing?

Batch is data at rest (allows you to look into more depth). Stream is data that is constantly arriving. Streaming requires a different infrastructure, uses different methods, and has different goals - looking for anomalies or quick trends.
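
A minimal PySpark sketch of the contrast, assuming JSON files land in a hypothetical /data/events directory:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Batch: data at rest - read the directory once and analyze it in depth.
batch_df = spark.read.json("/data/events")

# Streaming: data constantly arriving - the same directory is treated as an
# unbounded source and processed incrementally as new files appear.
stream_df = spark.readStream.schema(batch_df.schema).json("/data/events")
```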

What is Edge Computing?

Instead of relying only on central storage and processing, connected devices at the edges of the network can store and process the data themselves

In MEDDPICC, what do the two C's stand for?

Champion: the person who has power, influence, and credibility within the customer's organization. Competition: what other vendors the company is evaluating

Why have data marketplaces seen limited use?

1) Closed platforms (one per vendor) 2) Limited to just datasets

Databricks vs Sagemaker?

Comparing Databricks to SageMaker is an unfair comparison, because matching a complete end-to-end platform requires SageMaker plus 20+ other services. SageMaker brings overhead and complexity, deteriorated team collaboration, data duplication, higher TCO, and tedious governance.

Azure Synapse Weakness

Complex, with many different options for computing, storage, and required integrations: there are 4 different compute options and 3 storage options. For streaming, Synapse relies on other services like Azure ML and Azure Stream Analytics. Vendor lock-in - it can't be multi-cloud.

What is Model Training in the ML lifecycle?

Data scientists explore algorithms to find the best-performing model

What are the major risks of storing data on the cloud?

1) Data dissipation - security 2) Cost containment 3) Remote possibility - servers could fail

How are data engineers used in the ML lifecycle?

Data Engineers prepare production data and make it available for consumption (raw data)

How does Databricks differentiate from Snowflake?

Data science/machine learning, data ingestion, ETL, streaming capabilities, and data sharing. With Snowflake, you will have high costs, limited capabilities, limited unstructured data support, inefficient engineering, and vendor lock-in.

What Databricks Products address Hadoop migration needs for Data Warehousing?

Databricks SQL

For streaming capabilities, how does Databricks differ from Snowflake?

Databricks reads from any streaming service, while Snowflake only supports Kafka and was not designed for high-velocity data but rather for structured data at rest

In MEDDPICC, what do the two D's stand for?

Decision Criteria: the various criteria against which the buying decision will be judged. Decision Process: the series of steps that form the process the buyer will follow to make a decision

What is Delta Lake?

Delta Lake brings data warehousing functionality to data lakes. It's a storage framework that takes data in Parquet format (a format everyone in big data knows) and adds an abstraction layer to it (transaction log, metadata, etc.), which resolves the key issues that have been a pain for all practitioners. Data stays in the customer's own cloud object store.
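
A minimal PySpark sketch, assuming a Spark session with Delta Lake configured; the path is hypothetical (in practice it would be a cloud object store such as s3:// or abfss://):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# On disk this is ordinary Parquet files plus a _delta_log/ transaction log -
# the abstraction layer that adds ACID guarantees to the open format.
df.write.format("delta").mode("overwrite").save("/tmp/demo_table")

# Readers resolve the log first, so they always see a consistent snapshot.
spark.read.format("delta").load("/tmp/demo_table").show()
```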

What is Delta Live Tables?

Delta Live Tables is the first ETL framework that allows you to build reliable data pipelines (accelerating ETL development). It automatically manages your infrastructure at scale so data analysts and engineers can spend less time on tooling and focus on getting value from data. DLT fully supports Python and SQL and works with both batch and streaming. DLT manages task orchestration, cluster management, monitoring, data quality, and error handling, and supports CDC (change data capture).
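
A minimal sketch of a DLT pipeline definition in Python; it only runs inside a Delta Live Tables pipeline, and the landing path and table names are assumptions:

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw events ingested incrementally from cloud storage")
def raw_events():
    return (
        spark.readStream.format("cloudFiles")   # Auto Loader source
        .option("cloudFiles.format", "json")
        .load("/landing/events")
    )

@dlt.table(comment="Cleaned events derived from the raw streaming table")
def clean_events():
    return dlt.read_stream("raw_events").where(col("id").isNotNull())
```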

MLFlow Models

Deploy machine learning models in diverse serving environments
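
A minimal sketch using the scikit-learn flavor; the model and data are illustrative:

```python
import mlflow
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

# Log the model in MLflow's packaging format.
with mlflow.start_run() as run:
    mlflow.sklearn.log_model(model, artifact_path="model")

# The same artifact can be deployed to many serving environments; here it is
# simply reloaded locally through the generic pyfunc interface.
loaded = mlflow.pyfunc.load_model(f"runs:/{run.info.run_id}/model")
print(loaded.predict(np.array([[1.5]])))
```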

What companies have migrated from AWS EMR to Databricks?

Disney+ - $15.6 million in infrastructure savings, with a reduction in compute costs for data processing, modeling, and analytics. Also: Plume, Wild Life, Ancestry

Price difference between Databricks and EMR?

EMR is relatively fast and extremely cheap; however, it comes with high operational cost, operational overhead, and risks. AWS is effectively giving EMR away for free to sell EC2 hours.

EMR

ETL/stream processing. Many open-source projects (Hadoop, Presto, Jupyter) are packaged and rolled into a single click, which integrates well with the AWS ecosystem. A cloud big data platform for running large-scale distributed data processing jobs and interactive SQL queries. EMR includes the EMR runtime, which provides performance enhancements.

In MEDDPICC, what does the E stand for?

Economic Buyer: the person with the overall authority in the buying decision

What Databricks features helps out with observability and governance?

Expectations help prevent bad data from flowing into tables, track data quality over time, and provide tools to troubleshoot bad data through granular pipeline observability, so you get a high-fidelity lineage diagram of your pipeline, can track dependencies, and can aggregate data quality metrics across all of your pipelines.
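
Expectations are declared on DLT datasets. A minimal sketch (runs only inside a DLT pipeline; table and column names are assumptions):

```python
import dlt

@dlt.table
@dlt.expect("non_negative_amount", "amount >= 0")   # record violations in metrics
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")   # drop rows that fail
def clean_orders():
    return dlt.read("raw_orders")
```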

What is Project Lightspeed?

Faster and simpler stream processing: 1) Predictable low latency 2) Enhanced functionality 3) Operations and troubleshooting 4) Connectors & ecosystem

What Databricks Products address Hadoop migration needs for Data Science and ML?

Feature Store, MLFlow, ML Runtime, MLOps and AutoML

Databricks Workflows

A fully managed orchestration service. Orchestration is the configuration of multiple tasks into one complete end-to-end process; the goal is to streamline and optimize the execution of frequent, repeatable processes.

At Summit, what were Snowflake's biggest announcements?

Iceberg Tables, Snowpark for Python, and Snowpipe Streaming; also support for transactional workloads with Unistore and Hybrid Tables

In MEDDPICC, what does the I stand for?

Implicate the Pain: identify the pain and show how your solution solves it

What are Data cleanrooms?

A secure environment to run computations on joint data. Run any computation in Python, SQL, R, or Java. No data replication. Scalability.

How is HDFS Fault Tolerant?

Makes copies of the data and stores them across multiple systems - a replication factor of 3

In MEDDPICC, what does the M stand for?

Metrics: the quantifiable measures of value that your solution can provide

What are the different types of SQL Workloads?

Online Transaction Processing (OLTP) Workloads, Decision Support System or data warehouse (DSS or DW) Workloads, Online analytical processing (OLAP) Workloads

MLFlow Projects

Package data science code in a format to reproduce runs on any platform
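
A minimal sketch of launching a project programmatically; the example URI and parameter come from MLflow's public example repo, and running it assumes the project's declared environment tooling is available:

```python
import mlflow

# The MLproject file in the target repo declares entry points and parameters;
# MLflow recreates the declared environment and executes the run.
submitted = mlflow.projects.run(
    uri="https://github.com/mlflow/mlflow-example",
    parameters={"alpha": 0.5},
)
print(submitted.run_id)
```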

In MEDDPICC, what does the P stand for?

Paper Process: how will you go from decision to signed contract

What are the 4 ways companies make decisions?

1) Revenue growth 2) Cost reduction & savings 3) Strategic initiatives 4) Driving efficiency

In MEDDPICC, what does the R stand for?

Risks: specific risks that will need to be monitored in order to get the deal done

What is Structured Data?

Rows and columns; variables are predefined. Ex: spreadsheets, relational databases

How do we position with Snowflake?

Snowflake handles DW capabilities while Databricks handles the DS/ML capabilities

For ETL, how does Databricks differ from Snowflake?

Snowflake has no ETL tools and relies heavily on 3rd-party vendors, which increases cost and complexity. Databricks has Delta Live Tables.

For data science, how does Databricks differ from Snowflake?

Snowflake is built for SQL workloads; customers either use Snowpark or rely on 3rd-party tools for DS and ML, like Dataiku or DataRobot. Databricks lets your data scientists use any libraries/languages in the same platform as their data engineers, and manage the end-to-end ML pipeline with MLflow

For data sharing, how does Databricks differ from Snowflake?

Snowflake's data format is proprietary, and users can only share data with other Snowflake accounts (vendor lock-in). Snowflake takes the data from your cloud storage, conducts transformations, and then pushes the data back, so you pay an egress tax both ways plus storage, and you pay for compute to send data. Delta Sharing is an open standard for data sharing, with no replication of datasets.
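
On the recipient side, a minimal sketch with the open-source delta-sharing Python client; the profile file and share/schema/table names are assumptions:

```python
import delta_sharing

# The .share profile file holds the sharing server endpoint and a bearer
# token; the fragment names <share>.<schema>.<table>.
table_url = "config.share#my_share.my_schema.my_table"

# Load the shared table into pandas - no dataset replication, and the
# recipient does not need an account on the provider's platform.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```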

Velocity

Speed with which the data arrives -data that comes in rapidly and changes frequently

What Databricks Products address Hadoop migration needs for Streaming workloads?

Structured Streaming, DLT, AutoLoader

What is artificial intelligence?

Techniques that allow computers to do things typically done by humans

How are text, audio, and video data measured?

Text is measured in Kilobytes Audio is measured in megabytes Video is measured in gigabytes

What does the Lakehouse consist of?

The lakehouse combines the scalability and cost-effectiveness of data lakes with the reliability and performance of data warehouses

Why Databricks over Synapse?

1) Unified collaborative environment and a single compute engine 2) Significantly better cost performance 3) Multi-cloud and open 4) Best for the 2 primary use cases correlated to the Lakehouse architecture 5) Data sharing - live tables without copying (Synapse requires copying data)

What is hybrid cloud?

Using both public cloud providers and secure private cloud

What is Multi-cloud?

Using several cloud storage and computing providers simultaneously

What is Unstructured data?

Variables and fields are not labeled or identified Ex: text, photos, videos, audio

What is Semi-Structured data?

Variables that are marked with tags. Ex: HTML, XML, JSON (webpages and social media data)

When is anomaly detection important?

When identifying process failures, finding potential value (outliers), and detecting fraud

How do we position with AWS Sagemaker?

Win with Delta, which addresses the need for a data-centric ML platform, and with ease of integration through MLflow (which can deploy models to SageMaker). Databricks also supports Serverless Model Endpoints.

For data ingestion, how does Databricks differ from Snowflake?

1) With Snowflake, ingestion goes through various Snowflake stages, often needing limited SQL pipelines or Snowpipe, which is inefficient and expensive 2) The Snowflake tax - an egress tax for moving data to and from, plus storage. With Databricks, ingestion is simple with Auto Loader, where data is automatically transformed into Delta tables.

What are the 4 components of YARN?

Yet Another Resource Negotiator - manages the resources. 1) Resource Manager 2) Node Manager 3) Application Master 4) Containers

What is Databricks Workflows?

A fully managed orchestration service for all your data, analytics, and AI. Allows users to build ETL pipelines that are automatically managed, including ingestion and lineage, using Delta Live Tables. Lets you "repair and rerun" failed jobs, and provides deep monitoring capabilities and centralized observability across all your workflows.

What do Workflows allow you to do?

Allow users to build ETL pipelines that are automatically managed. Orchestrating and managing production workflows is a bottleneck for many organizations, requiring external tools like Apache Airflow, Azure Data Factory, AWS Step Functions, or GCP Workflows.

MLFlow Tracking

Allows you to record and query experiments: code, data, config, and results
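
A minimal sketch; the parameter, metric, and tag names are illustrative:

```python
import mlflow

with mlflow.start_run():
    mlflow.log_param("max_depth", 5)    # configuration going into the run
    mlflow.log_metric("rmse", 0.73)     # result coming out of the run
    mlflow.set_tag("team", "demo")      # metadata for querying runs later
```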

What is Exploratory data analysis in the ML lifecycle?

Analysis conducted by data scientists to assess the statistical properties of the available data and determine whether it addresses the business question

What is AutoLoader?

automatically processes files landing in cloud storage
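
A minimal sketch on Databricks (where spark is predefined); the paths and table name are assumptions:

```python
stream = (
    spark.readStream.format("cloudFiles")               # Auto Loader source
    .option("cloudFiles.format", "json")                # format of landing files
    .option("cloudFiles.schemaLocation", "/chk/schema") # where the inferred schema is tracked
    .load("s3://bucket/landing/")
)

(stream.writeStream
    .option("checkpointLocation", "/chk/ingest")        # exactly-once bookkeeping
    .trigger(availableNow=True)                         # process the backlog, then stop
    .toTable("bronze_events"))
```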

MLFlow Model Registry

A collaborative hub for all ML models; use webhooks to automate and integrate your machine learning pipeline with existing CI/CD tools and workflows
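
A minimal sketch; the run URI placeholder and model name are assumptions:

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register a logged model under a named entry in the registry.
result = mlflow.register_model("runs:/<run_id>/model", "churn_model")

# Promote the new version through lifecycle stages; webhooks can trigger
# CI/CD automation on transitions like this.
client = MlflowClient()
client.transition_model_version_stage(
    name="churn_model", version=result.version, stage="Staging"
)
```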

What is Machine Learning?

A collection of algorithms that can find patterns in data to predict outcomes. Models improve over time, and solutions range from simple linear regressions to more complex neural networks.

What is Apache Spark?

A multi-language engine for executing data engineering, data science, and ML on single-node machines or clusters
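
A minimal PySpark sketch; the data is illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

# The same code runs unchanged on a laptop or on a cluster; Spark plans and
# distributes the aggregation.
df.groupBy("key").agg(F.sum("value").alias("total")).show()
```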

What is Feature Engineering in the ML lifecycle?

Data scientists clean data and apply business logic and specialized transformations to engineer features for model training

Variety

format of the data (Structured, Unstructured or semi-structured data)

What is a data lake?

Holds data that is structured, semi-structured, and unstructured - gets rid of data silos so the whole organization can access the data

Volume

How much data you have - more data than fits in a computer's RAM (memory), or even more than fits on a hard drive

What is MEDDPICC?

A sales qualification methodology

What is MLOps?

A set of processes and automation for managing models, data, and code to improve performance, stability, and long-term efficiency in ML systems. MLOps = ModelOps + DataOps + DevOps

What is MLFlow?

An open-source platform to manage the ML lifecycle, including experimentation, reproducibility, and a central model registry

When you talk about ETL, you also hear about Production Workloads. What are production workloads?

jobs that are scheduled, automated and critical to our customers running their everyday business

What does EMR lack?

Lacks simplicity in management and configuration setup, has a high dependency on multiple other AWS tools, and performance-wise it is not on par with Databricks

What is Unity Catalog?

Offers a centralized governance solution for all data and AI assets, with built-in search and discovery. The newest announcements around data lineage enhance performance and scalability and give businesses a complete view of the entire data lifecycle.

What is Databricks Marketplace?

An open marketplace for data solutions, built on Delta Sharing. Consists of notebooks, data files, data tables, solution accelerators, ML models, and dashboards.

What are Iceberg Tables?

An open table format that pairs Snowflake's powerful performance engine and platform capabilities with open formats and storage managed by customers. Competes with Delta Lake.

What is Photon?

query engine on Databricks which provides up to 12x better performance compared to other cloud data warehouses

Databricks SQL

A serverless data warehouse that lets you run all of your SQL and BI applications at scale

What is MapReduce?

Splits the data into parts, processes them separately on different data nodes, and then aggregates the results to give a final output
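
A pure-Python sketch of the idea using word count; real MapReduce distributes the map and reduce phases across data nodes:

```python
from collections import defaultdict

docs = ["big data big clusters", "data nodes"]

# Map: each node emits (key, 1) pairs for its own split of the data.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group intermediate pairs by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each group into the final output.
result = {word: sum(counts) for word, counts in groups.items()}
print(result)  # {'big': 2, 'data': 2, 'clusters': 1, 'nodes': 1}
```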

Hadoop

Stored and processed vast amounts of data efficiently using clusters of commodity hardware (multiple storage units and processors). A framework that manages big data storage in a distributed way and processes it.

What is Snowpipe Streaming?

Supports streaming data ingestion into Snowflake, but only from Kafka. Databricks supports Kinesis, Event Hubs, Kafka, and many more.

What is Snowpark with Python?

Targets application development workloads, both with native Streamlit integrations and for ISV partners. It can read files and tables and apply basic transformations. Snowpark is a proprietary API that works only on Snowflake compute.

What is DevOps?

The concept of developer operations - deploying software applications

What is a Data Warehouse?

A unified place to keep an organization's data sets - handles structured data

What are the 3 V's of big data?

volume, velocity, variety

