Databricks Fundamentals
What are tables, views, and volumes (databricks)
These are at the lowest level in the data object hierarchy
What does a Workspace organize?
objects (notebooks, libraries, dashboards, and experiments) into folders and provides access to data objects and computational resources.
What does Delta Live Tables (DLT) automate?
operational complexities such as: - infrastructure management - task orchestration - error handling and recovery - performance optimization.
What is a catalog
the first layer of the object hierarchy, used to organize your data assets. It contains schemas (databases)
What is the control plane
the management plane where Databricks runs the workspace application and manages notebooks, configuration and clusters.
What is the Data Plane
the plane that handles your data processing
What is a schema (general)
the specification of the records in the database e.g. the schema can say a record will consist of: PersonID (unique index number), FamilyName (40 chars), FirstName (40 chars), DateOfBirth (Date)
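A minimal Python sketch of the record specification above. The class name `PersonRecord` is hypothetical, and the sketch assumes that "40 chars" means a maximum length of 40 characters.

```python
from dataclasses import dataclass
from datetime import date

MAX_NAME_LENGTH = 40  # from the example: FamilyName and FirstName are 40 chars


@dataclass(frozen=True)
class PersonRecord:
    """One record that follows the example schema (hypothetical class name)."""

    person_id: int       # PersonID: unique index number
    family_name: str     # FamilyName: at most 40 characters (assumed to be a maximum)
    first_name: str      # FirstName: at most 40 characters (assumed to be a maximum)
    date_of_birth: date  # DateOfBirth: a date

    def __post_init__(self):
        # Reject any value that does not match the schema.
        if not isinstance(self.person_id, int):
            raise TypeError("person_id must be an integer")
        for field_name in ("family_name", "first_name"):
            value = getattr(self, field_name)
            if not isinstance(value, str) or len(value) > MAX_NAME_LENGTH:
                raise ValueError(
                    f"{field_name} must be a string of at most {MAX_NAME_LENGTH} characters"
                )
        if not isinstance(self.date_of_birth, date):
            raise TypeError("date_of_birth must be a date")
```

The schema is the class definition; each `PersonRecord(...)` instance is a record that conforms to it.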
What is a metastore
the top-level container for metadata.
What data engineering capabilities does the Databricks Lakehouse Platform offer to simplify the work of data engineers?
- Automatic deployment and data operations - SQL and Python development compatibility - End-to-end data pipeline visibility
Which of the following compute resources is available in the Databricks Lakehouse Platform?
- Classic clusters - Serverless Databricks SQL warehouses
The Databricks Lakehouse Platform architecture consists of what two planes?
- Control plane - Data plane
What contributes directly to high levels of data quality within the Databricks Lakehouse Platform?
- Data expectations enforcement - Table schema evolution
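To make "data expectations enforcement" concrete, here is a rough sketch in plain Python. It is not the Delta Live Tables API: the function `enforce_expectations`, the expectation names, and the sample rows are all hypothetical, and the sketch assumes that a row failing any expectation is dropped.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Expectation = Callable[[Row], bool]


def enforce_expectations(
    rows: List[Row], expectations: Dict[str, Expectation]
) -> Tuple[List[Row], Dict[str, int]]:
    """Keep rows that pass every expectation and count failures per expectation."""
    kept: List[Row] = []
    violations = {name: 0 for name in expectations}
    for row in rows:
        # Collect the names of all expectations this row fails.
        failed = [name for name, check in expectations.items() if not check(row)]
        for name in failed:
            violations[name] += 1
        if not failed:
            kept.append(row)
    return kept, violations


# Hypothetical sample data and rules.
rows = [
    {"id": 1, "amount": 10},
    {"id": None, "amount": 5},
    {"id": 3, "amount": -1},
]
expectations = {
    "id_not_null": lambda r: r["id"] is not None,
    "amount_non_negative": lambda r: r["amount"] >= 0,
}
clean_rows, violation_counts = enforce_expectations(rows, expectations)
```

Only the first row survives here, and the violation counts give the kind of quality signal that a pipeline could monitor.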
What are the benefits of the Databricks Lakehouse Platform being designed to support all data and artificial intelligence (AI) workloads?
- Data teams can all utilize secure data from a single source to deliver reliable, consistent results across workloads at scale. - Data workloads can be automatically scaled when needed. - Data analysts, data engineers, and data scientists can easily collaborate within a single platform. - Analysts can easily integrate their favorite business intelligence (BI) tools for further analysis.
NEED TO DIG DEEPER TO UNDERSTAND: Which of the following correctly describes how a specific capability of the Databricks Lakehouse Platform supports a data streaming pattern? Select three responses. ☐ Structured Streaming enables stream-based machine learning inference. ☐ Databricks Workflows automatically passes data from task to task in regular microbatches. ☐ MLflow ingests its automatic experiment tracking data into a stream for continuous monitoring.
- Delta Live Tables processes ETL pipelines on streaming data with advanced monitoring mechanisms. - Auto Loader continuously and incrementally ingests streaming data.
If a data architect is evaluating data warehousing solutions for their organization to use, what are some benefits of using the Databricks Lakehouse Platform for warehousing that you'd reference?
- Engineering capabilities supporting warehouse source data - Best available price/performance - A rich ecosystem of business intelligence (BI) integrations - Local development software to integrate with other capabilities
What are some common problems within a data lake architecture that can be easily solved by using the Databricks Lakehouse Platform?
- Lack of ACID transaction support - Too many small files - Ineffective partitioning
What resources exist within the Databricks control plane?
- Notebooks - Cluster configurations
What performance features are supported by Databricks SQL Serverless?
- Photon Engine - Predictive IO - Intelligent Workload Management
What SQL warehouse types does Databricks SQL support?
- Serverless - Pro - Classic
What are key features and advantages of using Photon?
- Support for SQL and equivalent DataFrame operations with Delta and Parquet tables. - Accelerated queries that process data faster and include aggregations and joins. - Faster performance when data is accessed repeatedly from the disk cache. - Robust scan performance on tables with many columns and many small files. - Faster Delta and Parquet writing using UPDATE, DELETE, MERGE INTO, INSERT, and CREATE TABLE AS SELECT, including wide tables that contain thousands of columns. - Replaces sort-merge joins with hash-joins.
What does Databricks make available as part of Databricks Machine Learning to support machine learning workloads?
- Support for distributed model training on big data - Built-in real-time model serving - Built-in automated machine learning development - Optimized and preconfigured machine learning frameworks
What challenges would a data organization likely face when migrating from a data warehouse to a data lake
- There are increased data reliability issues in a data lake. - There are increased security and privacy concerns in a data lake.
In which of the following ways do serverless compute resources differ from classic compute resources within the Databricks Lakehouse Platform?
- They exist within the Databricks cloud account - They are always running and reserved for a single, specific customer when needed
What architecture benefits are provided directly by the Databricks Lakehouse Platform?
- Unified security and governance approach for all data assets - Built on open source and open standards - Available on and across multiple clouds
What are the benefits of Databricks Workflows (big picture)
- Unified with the Databricks Data Intelligence Platform - Reliability at scale - Deep monitoring and observability - Batch and streaming - Efficient compute - Seamless user experience
What is Delta Sharing?
A multicloud, open-source solution to securely and efficiently share live data from the lakehouse to any external system.
What is Intelligent workload management (IWM)?
A set of features that enhances Databricks SQL Serverless's ability to process large numbers of queries quickly and cost-effectively.
What is Predictive IO
A suite of features for speeding up selective scan operations in SQL queries. It can provide a wide range of speedups.
What is Unity Catalog
A unified governance solution for all data and AI assets including files, tables, and machine learning models in your lakehouse on any cloud. It offers improved Lakehouse data object governance and organization capabilities for data segregation.
What are Workspaces?
A workspace is an environment for accessing all of your Databricks assets.
What is a schema (databricks)
Also known as databases, they are the second layer of the object hierarchy and contain tables and views.
What are models (databricks)
Although they are not, strictly speaking, data assets, they can also be managed in Unity Catalog and reside at the lowest level in the object hierarchy.
What technology is Photon built on top of?
Apache Spark
Why does Databricks make special features available to support machine learning workloads?
Because data organizations need specialized environments designed specifically for machine learning workloads.
What is a consequence of using Unity Catalog to manage, organize and segregate data objects?
Complete data object referencing requires three levels
What is metadata
Data about data
Which of the Databricks Lakehouse Platform services or capabilities provides a data warehousing experience to its users?
Databricks SQL
What would you tell a customer who wants to continue using other preferred vendors for use cases like data ingestion, data transformation, business intelligence, and machine learning?
Databricks can be integrated directly with a large number of Databricks partners.
Why did Databricks develop Delta Sharing
Each of the traditional sharing tools and solutions comes with its own set of limitations
When Databricks SQL users leverage serverless Databricks SQL warehouses rather than classic Databricks SQL warehouses, what is one advantage they experience?
Expedited environment startup
What does Databricks Workflows let you do?
It lets you easily define, manage and monitor multitask workflows for ETL, analytics and machine learning pipelines. With a wide range of supported task types, deep observability capabilities and high reliability, your data teams are empowered to better automate and orchestrate any pipeline and become more productive.
What tasks does Databricks Workflows support?
It supports tasks for data ingestion, data engineering, machine learning, and business intelligence (BI)
What does Intelligent workload management (IWM) use to allocate resources?
It uses AI-powered prediction and dynamic management techniques to ensure that workloads get the right amount of resources quickly.
Explain Databricks to a 15 year old
It's a way of executing five or so languages on Spark distributed computing. The code can be anything from ETL to data science and machine learning, depending on what you write. It also acts as a platform for managing all of the above: sharing, collaboration, and cluster (virtual computer) management. (Users report that it is pretty intuitive, since they can just write SQL and interact with their data lake, but it can be expensive if not managed right.)
Explain Databricks to a five year old
Makes little bits of big computers use data in lots of ways and in lots of languages.
Why did Databricks develop Databricks Workflows?
Many organizations use a variety of open-source and proprietary tools for data orchestration, but these tools often have their own limitations.
List these relational entities in order from largest (most coarse) to smallest (most granular) within their hierarchy: - Schema (Database) - Metastore - Catalog - Table
Metastore → Catalog → Schema (Database) → Table
What do volumes provide governance for?
Non-tabular data.
What is Delta Lake
Delta Lake, one of the foundational technologies of the Databricks Lakehouse Platform, is an open-source, file-based storage format. Its benefits include ACID transaction guarantees, scalable data and metadata handling, audit history and time travel, table schema enforcement and schema evolution, support for deletes/updates/merges, and unified streaming and batch data processing.
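The "audit history and time travel" benefit can be pictured with a toy Python model. This is only an illustration, not how Delta Lake is implemented (Delta Lake records changes in a transaction log rather than copying full snapshots), and the class `VersionedTable` and its methods are hypothetical.

```python
import copy
from typing import Dict, List, Optional, Tuple

Row = Dict[str, object]


class VersionedTable:
    """Toy table that keeps every committed snapshot (hypothetical class)."""

    def __init__(self) -> None:
        self._snapshots: List[List[Row]] = [[]]  # version 0 is the empty table
        self._history: List[str] = ["CREATE"]    # one audit entry per version

    @property
    def latest_version(self) -> int:
        return len(self._snapshots) - 1

    def commit(self, operation: str, rows: List[Row]) -> int:
        """Store a new snapshot with its operation name; return the new version."""
        self._snapshots.append(copy.deepcopy(rows))
        self._history.append(operation)
        return self.latest_version

    def read(self, version: Optional[int] = None) -> List[Row]:
        """Read the latest snapshot, or an older one ("time travel")."""
        if version is None:
            version = self.latest_version
        return copy.deepcopy(self._snapshots[version])

    def history(self) -> List[Tuple[int, str]]:
        """Return (version, operation) pairs: the audit history."""
        return list(enumerate(self._history))


table = VersionedTable()
v1 = table.commit("INSERT", [{"id": 1}])
v2 = table.commit("DELETE", [])
```

After the two commits, `table.read()` returns the current (empty) table, while `table.read(version=v1)` still returns the row that was later deleted.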
Describe the motivation for the creation of the data lakehouse
Organizations needed a single, flexible, high-performance system to support data, analytics, and machine learning workloads.
What technology has Databricks introduced to further speed up and scale all query-based workloads?
Photon
What is Databricks Photon
Photon is a high-performance Databricks-native vectorized query engine that runs your SQL workloads and DataFrame API calls faster to reduce your total cost per workload.
Maintaining and improving data quality is a major goal of modern data engineering. Why?
Poor data quality filters down through the business, impacting decision making, AI use cases, and ML models, and creating additional work for data engineers.
What are 2 security features made available in the Databricks Lakehouse Platform by Unity Catalog
- Single-source-of-truth identity management - Workspace-specific data metastores
What is the benefit to a business if they use Photon?
Although Photon is more expensive, it offers a more performant experience. Overall, the TCO is worth it for the business: cluster maintenance and optimization exercises used to take time and require expensive, specialized talent, whereas Photon just works.
What is "Delta Live Tables" or "DLT"
a declarative framework for ETL and ML pipelines so data engineers can focus on helping their organizations get value from their data.
What is Databricks Workflows
a managed orchestration service, fully integrated with the Databricks Data Intelligence Platform.
What does each metastore expose?
a three-level namespace (catalog.schema.table) that organizes your data.
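As a small sketch of what the three-level namespace means in practice, the Python below splits a fully qualified name into its parts. The names `TableName` and `parse_table_name` and the example `main.sales.orders` are hypothetical, and the sketch assumes plain identifiers with no quoted dots.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    """The three levels of a fully qualified table name."""

    catalog: str
    schema: str
    table: str


def parse_table_name(full_name: str) -> TableName:
    """Split 'catalog.schema.table' into its three levels.

    Assumes simple identifiers; quoted identifiers that contain dots
    are not handled in this sketch.
    """
    parts = full_name.split(".")
    if len(parts) != 3 or any(not part for part in parts):
        raise ValueError(
            f"expected a name of the form catalog.schema.table, got {full_name!r}"
        )
    return TableName(*parts)


name = parse_table_name("main.sales.orders")  # hypothetical table
```

A two-part name such as `sales.orders` is rejected, which mirrors the point that complete data object referencing requires three levels.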
How has data sharing traditionally been performed?
by proprietary vendor solutions, SSH File Transfer Protocol (SFTP), or cloud-specific solutions.
What is a database
A collection of one or more tables of records (actual data that follows a schema of some sort), plus other things such as stored queries.
What is a benefit of using Delta Live Tables (DLT) for engineers?
engineers can also treat their data as code and apply software engineering best practices like testing, monitoring and documentation to deploy reliable pipelines at scale.
How does the Databricks Lakehouse Platform make data governance simpler?
Via Unity Catalog, which provides a single governance solution across workload types and clouds.
Why did Databricks develop Delta Live Tables?
In the past, a lot of data engineering resources needed to be contributed to the development of tooling and other mechanisms for creating and managing data workloads.
Where do non-serverless compute resources exist?
Inside the customer's AWS/Azure/GCP environment
Why did Databricks introduce Photon?
It can be challenging for a data lakehouse to provide both performance and scalability for all of its query-based workloads to the standards of a data warehouse and a data lake.
True or False, Databricks Workflows supports workloads across multiple cloud service providers and tools?
True; Databricks Workflows supports workloads across multiple cloud service providers and tools
