Databricks Competition Study
What are the 4 main elements of data governance?
1) Availability of Data (where is it, who has access) 2) Usability of data (what format, what language is used in processing the data) 3) Integrity of data 4) Security of data
What is the Machine Learning Process?
1) Data Preparation 2) Exploratory Data Analysis 3) Feature Engineering 4) Model Training 5) Model Validation 6) Deployment 7) Monitoring
What are the existing problems with data cleanrooms?
1) Data movement and replication - leads to vendor lock-in and added storage costs 2) Restricted to SQL - can't run arbitrary workloads; limited to simple SQL statements 3) Hard to scale - can't expand collaboration (stuck with one vendor)
What are the motivations behind edge computing/fog computing?
1) Low Latency 2) Poor Connectivity 3) Reduced Server Load 4) Privacy Examples: Facial Recognition, Sensors in jet engines, Sensors in cells
Why Databricks over EMR?
1) Lower TCO (Total Cost of Ownership) 2) Reliable Analytics 3) Collaborative 4) Scalable Data Pipelines
How does Delta Lake differ from Iceberg?
1) Overall performance - loading and querying data is 3.5x faster 2) Load performance - loading from Parquet into the respective format (Delta is faster) 3) Query performance - Delta is 4.5x faster
What are the 4 components of MLFlow?
1) Tracking 2) Projects 3) Models 4) Model Registry
What 2 ways can you ingest data into a lakehouse?
1) a network of data ingestion partners 2) integrating data into Delta Lake with Auto Loader
What are the 2 main risks in ML systems?
1) technical risk - inherent to the system itself 2) risk of noncompliance with external systems
How to know if it's a production workload?
1) workload is operationalized 2) new data is processed on a regular basis (batch/streaming) 3) key parts have been automated through workflows/jobs 4) drives business value
What were the 3 components of Hadoop?
1) HDFS - storage unit (Hadoop Distributed File System); data is stored in blocks, which are replicated across several data nodes in the cluster 2) MapReduce - data processing 3) YARN - resource negotiator (manages the resources)
Each HDFS block is how many megabytes?
128 megabytes
What are the key features of Delta Lake?
ACID Transactions, Time Travel, Schema Enforcement, Audit History, Iceberg to Delta Converter, OPTIMIZE Z-Order, Clones, etc.
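A minimal PySpark sketch (table path hypothetical, assuming a Delta-enabled Spark session) of two of these features, Time Travel and Schema Enforcement:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Time Travel: read the table as it existed at an earlier version.
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)   # or: .option("timestampAsOf", "2023-01-01")
      .load("/tmp/events"))       # hypothetical Delta table path

# Schema Enforcement: appending a DataFrame whose schema does not match
# the table's schema fails with an error instead of corrupting the table.
new_rows = spark.createDataFrame([(1, "click")], ["id", "action"])
new_rows.write.format("delta").mode("append").save("/tmp/events")
```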
Sagemaker
AWS product that is a fully managed ML platform that enables data scientists and developers to create, train and deploy ML models
What was the first major solution to distributed storage?
Apache Hadoop - built for local (on-premises) storage rather than cloud object storage
What Databricks Products address Hadoop migration needs for data engineering?
Auto Loader, DLT, Databricks Workflows
What is the difference between batch processing and streaming processing?
Batch is data at rest (allows looking into more depth). Streaming - data is constantly arriving. Streaming requires a different infrastructure, uses different methods, and has different goals - looking for anomalies or quick trends
What is Edge Computing?
Central storage and processing, plus connected devices at the edges of the network; the devices can store and process the data themselves
In MEDDPICC, what does the C's stand for?
Champion: the person who has power, influence, and credibility within the customers' organization Competition: what other vendors the company is evaluating
Why have data marketplaces seen limited use?
1) Closed platforms (one per vendor) 2) Limited to just datasets
Databricks vs Sagemaker?
Comparing Databricks to Sagemaker alone is an unfair comparison because, to match a complete end-to-end platform, Sagemaker requires 20+ other services. Sagemaker brings overhead and complexity, deteriorated team collaboration, data duplication, higher TCO, and tedious governance
Azure Synapse Weaknesses
Complex, with many different options for computing, storage, and required integrations: 4 different compute options and 3 storage options. For streaming, Synapse relies on other services like Azure ML and Azure Stream Analytics. Vendor lock-in - can't be multi-cloud
What is Model Training in the ML lifecycle?
Data scientists explore algorithms to find the best-performing model
What are the major risks of storing data on the cloud?
Data dissipation (security), cost containment, and the remote possibility that servers could fail
How are data engineers used in the ML lifecycle?
Data Engineers prepare production data and make it available for consumption (raw data)
How does Databricks differentiate from Snowflake?
Data science/machine learning, data ingestion, ETL, streaming capabilities, and data sharing. With Snowflake, you will have high costs, limited capabilities, limited unstructured data support, inefficient engineering, and vendor lock-in
What Databricks Products address Hadoop migration needs for Data Warehousing?
Databricks SQL
Streaming capabilities - how does Databricks differ from Snowflake?
Databricks reads from any streaming service, while Snowflake only supports Kafka and was not designed for high-velocity data but rather structured data at rest
In MEDDPICC, what does the D's stand for?
Decision Criteria: the various criteria by which a decision to purchase your solution will be judged Decision Process: the series of steps that form the process the buyer will follow to make a decision
What is Delta Lake?
Delta Lake brings data warehousing functionality to data lakes. It's a storage framework that takes data in Parquet format (a format everyone in big data knows) and adds an abstraction layer to it (transaction log, metadata, etc.), resolving key issues that have been a pain for all practitioners. Data stays in the customer's own cloud object store
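Because Delta keeps the data files in Parquet and only adds a transaction log on top, an existing Parquet table can be converted in place. A sketch using the delta-spark Python package (path hypothetical):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Adds a Delta transaction log over the existing Parquet files in place.
DeltaTable.convertToDelta(spark, "parquet.`/data/raw/events`")
```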
What is Delta Live Tables?
Delta Live Tables is the first ETL framework that allows you to build reliable data pipelines (accelerate ETL development) - it automatically manages your infrastructure at scale so data analysts and engineers can spend less time on tooling and focus on getting value from data. DLT fully supports Python and SQL and works with both batch and streaming. DLT manages task orchestration, cluster management, monitoring, data quality, and error handling, and supports CDC (change data capture)
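A minimal DLT sketch in Python (source path and table names hypothetical; `spark` is provided by the pipeline runtime):

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw events landed in cloud storage")
def raw_events():
    return spark.read.format("json").load("/data/landing/events")

@dlt.table(comment="Cleaned events")
def clean_events():
    # DLT infers the dependency on raw_events and orchestrates the order.
    return dlt.read("raw_events").where(col("event_type").isNotNull())
```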
MLFlow Models
Deploy machine learning models in diverse serving environments
What companies have migrated from AWS EMR to Databricks?
Disney+ - $15.6 million in infrastructure savings, with a reduction in compute costs for data processing, modeling, and analytics. Others: Plume, Wild Life, Ancestry
Price difference between Databricks and EMR?
EMR is relatively fast and extremely cheap; however, it comes with high operational cost, operational overhead, and risk. AWS is effectively giving EMR away for free to sell EC2 hours
EMR
ETL/stream processing. Many open source projects (Hadoop, Presto, Jupyter) are packaged and rolled into a single click, and it integrates well with the AWS ecosystem. A cloud big data platform for running large-scale distributed data processing jobs and interactive SQL queries. Includes the EMR runtime, which provides performance enhancements
In MEDDPICC, what does the E stand for?
Economic Buyer: the person with the overall authority in the buying decision
What Databricks features helps out with observability and governance?
Expectations - help prevent bad data from flowing into tables, track data quality over time, and provide tools to troubleshoot bad data, with granular pipeline observability: you get a high-fidelity lineage diagram of your pipeline, can track dependencies, and can aggregate data quality metrics across all of your pipelines
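A sketch of an expectation in a DLT pipeline (rule name, condition, and upstream table hypothetical):

```python
import dlt

@dlt.table
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")  # drop violating rows, track the count
def validated_events():
    return dlt.read("clean_events")   # hypothetical upstream table
```

`expect` variants only record violations, `expect_or_drop` removes the rows, and `expect_or_fail` stops the pipeline.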
What is Project Lightspeed?
Faster and simpler stream processing 1) predictable low latency 2) enhanced functionality 3) Operations and Troubleshooting 4) Connectors & Ecosystem
What Databricks Products address Hadoop migration needs for Data Science and ML?
Feature Store, MLFlow, ML Runtime, MLOps and AutoML
Databricks Workflows
Fully managed orchestration service Orchestration is the configuration of multiple tasks into one complete end-to-end process the goal is to streamline and optimize the execution of frequent, repeatable processes
At Summit, what were Snowflake's biggest announcements?
Iceberg Tables, Snowpark for Python, and Snowpipe Streaming -also support for transactional workloads with Unistore and Hybrid Tables
In MEDDPICC, what does the I stand for?
Implicate the Pain: identify the pain and show how your solution solves it
What are Data cleanrooms?
It's a secure environment to run computations on joint data -run any computation in Python, SQL, R, or Java -no data replication -scalability
How is HDFS Fault Tolerant?
Makes copies of the data and stores them across multiple systems -a replication factor of 3
In MEDDPICC, what does the M stand for?
Metrics: are the quantifiable measures of value that your solution can provide
What are the different types of SQL Workloads?
Online Transaction Processing (OLTP) Workloads, Decision Support System or data warehouse (DSS or DW) Workloads, Online analytical processing (OLAP) Workloads
MLFlow Projects
Package data science code in a format to reproduce runs on any platform
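A sketch of launching a project run (URI and parameter are illustrative; MLflow rebuilds the declared environment so the run reproduces elsewhere):

```python
import mlflow

mlflow.projects.run(
    uri="https://github.com/mlflow/mlflow-example",  # repo containing an MLproject file
    parameters={"alpha": 0.4},
)
```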
In MEDDPICC, what does the P stand for?
Paper Process: how will you go from decision to signed contract
What are the 4 ways companies make decisions?
1) Revenue growth 2) Cost reduction & savings 3) Strategic initiatives 4) Driving efficiency
In MEDDPICC, what does the R stand for?
Risks: specific risks that will need to be monitored in order to get the deal done
What is Structured Data?
Rows and columns, variables are predefined Ex: spreadsheets, relational databases
How do we position with Snowflake?
Snowflake handles DW capabilities while Databricks handles the DS/ML capabilities
For ETL- how does Databricks differ from Snowflake
Snowflake has no ETL tools and relies heavily on 3rd party vendors, which increases cost and complexity. Databricks - Delta Live Tables
For Data Science- how does Databricks differ from Snowflake?
Snowflake is built for SQL workloads. They either use Snowpark or rely on 3rd party DS and ML tools like Dataiku or DataRobot. Databricks allows your data scientists to use any libraries/languages in the same platform as their data engineers, and to manage the end-to-end ML pipeline with MLflow
For Data Sharing- how does Databricks differ from Snowflake?
Snowflake's data format is proprietary; users can only share data with other Snowflake accounts (vendor lock-in). Snowflake takes the data from your cloud storage, conducts transformations, and then pushes the data back, so you pay an egress tax both ways plus storage, and you pay for compute to send data. Delta Sharing - an open standard for data sharing, with no replication of datasets
Velocity
Speed with which the data arrives -data that comes in rapidly and changes frequently
What Databricks Products address Hadoop migration needs for Streaming workloads?
Structured Streaming, DLT, AutoLoader
What is artificial intelligence?
Techniques that allow computers to do things typically done by humans
How are text, audio, and video data measured?
Text is measured in kilobytes, audio in megabytes, video in gigabytes
What does the Lakehouse consist of?
The lakehouse combines the scalability and cost effectiveness of data lakes with the reliability and performance of data warehouses
Why Databricks over Synapse?
1) Unified collaborative environment and a single compute engine 2) Significantly better cost performance 3) Multi-cloud and open 4) Best for 2 primary use cases correlated to the lakehouse architecture 5) Data sharing and live tables without copying - Synapse requires copying data
What is hybrid cloud?
Using both public cloud providers and secure private cloud
What is Multi-cloud?
Using several cloud storage and computing providers simultaneously
What is Unstructured data?
Variables and fields are not labeled or identified Ex: text, photos, videos, audio
What is Semi-Structured data?
Variables that are marked with tags: HTML, XML, JSON (webpages and social media data)
When is anomaly detection important?
When identifying process failures, finding potential value (outliers), and detecting fraud
How do we position with AWS Sagemaker?
Win with Delta - it addresses the need for a data-centric ML platform, with ease of integration through MLflow's Sagemaker support and Serverless Model Endpoints
For Data Ingestion- how does Databricks differ from Snowflake
With Snowflake, 1) ingestion goes through various Snowflake stages, often needing limited SQL pipelines or Snowpipe, which is inefficient and expensive 2) Snowflake tax - an egress tax on moving data to and from, plus charges as that data is stored. With Databricks, ingestion is simple with Auto Loader, where data is automatically transformed into Delta tables
What are the 4 components of YARN?
Yet Another Resource Negotiator - manages the resources. 1) Resource Manager 2) Node Manager 3) Application Master 4) Containers
What is Databricks Workflows?
a fully managed orchestration service for all your data, analytics, and AI -allows users to build ETL pipelines that are automatically managed, including ingestion and lineage, using Delta Live Tables -"repair and rerun" failed jobs -provides deep monitoring capabilities and centralized observability across all your workflows
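A hedged sketch with the databricks-sdk Python package (job and notebook names hypothetical) showing a two-task workflow where one task depends on the other:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # picks up auth from the environment

w.jobs.create(
    name="nightly-etl",
    tasks=[
        jobs.Task(
            task_key="ingest",
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/etl/ingest"),
        ),
        jobs.Task(
            task_key="transform",
            depends_on=[jobs.TaskDependency(task_key="ingest")],
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/etl/transform"),
        ),
    ],
)
```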
What do Workflows allow you to do?
allow users to build ETL pipelines that are automatically managed. Orchestrating and managing production workflows is a bottleneck for many organizations and requires external tools like Apache Airflow, Azure Data Factory, AWS Step Functions, or GCP Workflows
MLFlow Tracking
allows you to record and query experiments: code, data, config, and results
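A minimal tracking sketch (parameter and metric names are illustrative):

```python
import mlflow

with mlflow.start_run(run_name="example"):
    mlflow.log_param("alpha", 0.5)                         # configuration
    mlflow.log_metric("rmse", 0.73)                        # result
    mlflow.log_text("notes about this run", "notes.txt")   # artifact
```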
What is Exploratory data analysis in the ML lifecycle?
analysis conducted by data scientists to assess statistical properties of the available data and determine if it addresses the business question
What is AutoLoader?
automatically processes files landing in cloud storage
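An Auto Loader sketch (bucket and paths hypothetical; `spark` is the notebook's session):

```python
# Incrementally pick up new files as they land, inferring and tracking schema.
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/tmp/_schema/events")
      .load("s3://my-bucket/landing/events/"))

(df.writeStream
   .option("checkpointLocation", "/tmp/_checkpoints/events")
   .trigger(availableNow=True)   # process everything available, then stop
   .toTable("bronze_events"))
```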
MLFlow Model Registry
collaborative hub for all ML models - use webhooks to automate and integrate your machine learning pipeline with existing CI/CD tools and workflows
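A sketch of registering a logged model (run id left as a placeholder, model name hypothetical):

```python
import mlflow

mlflow.register_model(
    model_uri="runs:/<run_id>/model",  # model logged under an earlier run
    name="churn-classifier",
)
```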
What is Machine Learning?
collection of algorithms that can find patterns in data to predict outcomes -models improve over time -ranges from simple (linear regression) to more complex solutions (neural networks)
What is Apache Spark?
multi-language data processing engine for executing data engineering, data science, and ML on single-node machines or clusters
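A minimal PySpark sketch; the same DataFrame code runs in local mode or on a cluster unchanged:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df.groupBy("key").agg(F.sum("value").alias("total")).show()
```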
What is Feature Engineering in the ML lifecycle?
data scientists clean data and apply business logic and specialized transformations to engineer features for model training
Variety
format of the data (Structured, Unstructured or semi-structured data)
What is a data lake?
holds data that is structured, semi-structured, and unstructured -gets rid of data silos so the whole organization can access the data
Volume
how much data you have - more data than fits in a computer's RAM (memory) -more data than fits on a hard drive
What is MEDDPICC?
is a sales qualification methodology (Metrics, Economic Buyer, Decision Criteria, Decision Process, Paper Process, Implicate the Pain, Champion, Competition)
What is MLOps?
is a set of processes and automation for managing models, data, and code to improve performance, stability, and long-term efficiency in ML systems. MLOps = ModelOps + DataOps + DevOps
What is MLFlow?
is an open source platform to manage the ML lifecycle including experimentation, reproducibility and a central model registry
When you talk about ETL, you also hear about Production Workloads. What are production workloads?
jobs that are scheduled, automated and critical to our customers running their everyday business
What does EMR lack?
lacks simplicity in management and configuration setup, has a high dependency on multiple other AWS tools, and performance-wise it is not on par with Databricks
What is Unity Catalog?
offers a centralized governance solution for all data and AI assets with built-in search and discovery. The newest announcements around data lineage enhance performance and scalability and give businesses a complete view of the entire data lifecycle
What is Databricks Marketplace?
open marketplace for data solutions, built on Delta Sharing. Consists of notebooks, data files, data tables, solution accelerators, ML models, and dashboards
What are Iceberg Tables?
open table format- pairs Snowflake's powerful performance engine and platform capabilities with open formats and storage managed by customers -competes with Delta Lake
What is Photon?
query engine on Databricks which provides up to 12x better performance compared to other cloud data warehouses
Databricks SQL
serverless data warehouse that lets you run all of your SQL and BI applications at scale
What is MapReduce?
splits the data into parts and processes them separately on different data nodes and then the results are aggregated to give a final output
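A toy pure-Python illustration of the idea (not Hadoop code): count words by mapping over partitions independently, then reducing the partial results:

```python
from collections import Counter
from functools import reduce

partitions = [["big", "data"], ["big", "cluster"], ["data", "data"]]

# Map phase: each partition is processed separately, as on separate nodes.
partials = [Counter(p) for p in partitions]

# Reduce phase: partial results are merged into the final output.
total = reduce(lambda a, b: a + b, partials)
print(total)  # Counter({'data': 3, 'big': 2, 'cluster': 1})
```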
Hadoop
stored and processed vast amounts of data efficiently using clusters of commodity hardware (multiple storage units and processors) -a framework that manages big data storage in a distributed way and processes it
What is Snowpipe Streaming?
supports streaming data (Kafka only) -Databricks supports Kinesis, Event Hubs, Kafka, and many more
What is Snowpark with Python?
targets application development workloads, both with native Streamlit integrations and for ISV partners -can read files and tables and apply basic transformations. Snowpark is a proprietary API that works only on Snowflake compute
What is DevOps?
the concept of developer operations - deploy software applications
What is a Data Warehouse?
unified place to keep an organization's data sets - handles structured data
What are the 3 V's of big data?
volume, velocity, variety