Modernizing Data Lakes and Data Warehouses with GCP
SQL STRUCT
A container of other data fields, which can be of different data types.
SQL ARRAY
A container of other data fields which must be of the same type.
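In BigQuery Standard SQL the two can be combined in one row; a minimal sketch (column aliases are illustrative):

```sql
-- A STRUCT mixes data types; an ARRAY holds elements of one type
SELECT
  STRUCT('Alice' AS name, 30 AS age, TRUE AS active) AS person,  -- mixed types
  [1, 2, 3] AS scores;                                           -- same type
```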
Data Engineer
A data engineer builds pipelines to enable data-driven decisions.
What is the difference between a data lake and a data warehouse?
A data lake holds raw data. A data warehouse stores data in a way that makes it efficient to query.
Federated Data Source (External Data Source)
A data source that you can query directly even though the data is not stored in BigQuery.
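For example, BigQuery can define an external table over files in Cloud Storage and query them in place; a sketch, with hypothetical dataset, table, and bucket names:

```sql
-- The data stays in Cloud Storage; BigQuery reads it at query time
CREATE EXTERNAL TABLE mydataset.ext_sales
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/sales/*.csv']
);

SELECT * FROM mydataset.ext_sales LIMIT 10;
```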
Data Warehouse
A data warehouse stores transformed data in a usable condition for business insights.
User Defined Functions (UDF)
A function provided by the user of a program or environment, in a context where the usual assumption is that functions are built into the program or environment.
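In BigQuery this can be a temporary SQL UDF defined alongside the query that uses it; the function name and logic below are illustrative:

```sql
-- A temporary user-defined function for simple string cleanup
CREATE TEMP FUNCTION cleanup(s STRING)
RETURNS STRING
AS (LOWER(TRIM(s)));

SELECT cleanup('  Hello World  ');  -- 'hello world'
```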
Cloud Composer
A managed Apache Airflow service that helps you create, schedule, monitor and manage workflows.
Throughput
A measure of how many units of information a system can process in a given amount of time.
Failover
A method of protecting computer systems from failure, in which standby equipment automatically takes over when the main system fails.
Ad-hoc Queries
A request for information created to meet an unplanned information need; typically not saved for later reuse.
Data Lake
A scalable and secure data platform that allows enterprises to ingest, store, process, and analyze any type or volume of information, usually in raw format. Its point is to make data *accessible* for analytics.
Table Partitioning
A way to divide a large table into smaller, more manageable parts without having to create separate tables for each part.
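In BigQuery, partitioning is declared in the table DDL; a sketch with hypothetical names:

```sql
-- Each day's rows land in their own partition of one logical table
CREATE TABLE mydataset.sales
(
  sale_id   INT64,
  amount    NUMERIC,
  sale_date DATE
)
PARTITION BY sale_date;
```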
What are challenges that data engineers face?
Access to data. Data accuracy and quality. Availability of computational resources. Query performance.
Overhead
Any combination of excess or indirect computation time, memory, bandwidth, or other resources required to perform a specific task.
Google Cloud's Operations Suite (formerly Stackdriver)
Designed to monitor, troubleshoot, and improve cloud infrastructure, software, and application performance. It helps you efficiently build and run workloads, keeping applications performant and available.
IAM (Identity & Access Management)
Enables you to manage access to cloud services and resources securely. Google and AWS both have this.
What does ETL stand for?
Extract, Transform, Load
What are the key considerations when building a data lake?
Handling of multiple data types. Scalability. High-throughput ingestion. Fine-grained access control to objects. Connectivity to other tools.
How does denormalization increase performance?
It avoids intensive joins in SQL.
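One way BigQuery supports this is with nested, repeated fields: related records live inside the parent row, so no join is needed at query time. A sketch with an illustrative schema:

```sql
-- Line items are stored inside each order row instead of a separate table
CREATE TABLE mydataset.orders
(
  order_id INT64,
  items    ARRAY<STRUCT<sku STRING, qty INT64, price NUMERIC>>
);

-- Reading the nested items requires flattening, not a join
SELECT order_id, item.sku, item.qty
FROM mydataset.orders, UNNEST(items) AS item;
```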
What should you always assume about any raw data from source systems?
It needs to be cleaned, transformed, and stored in a data warehouse.
What are considerations when choosing a data warehouse?
Pipeline type. Scalability. Data organization. Performance design. Maintenance.
Data Mart
Subset of a data warehouse oriented to a specific business line
Normalization
The process of applying rules to a database design to ensure that information is divided into the appropriate tables.
What is a potential issue if you stored all your data in one giant table?
The table row size could be too large for traditional reporting databases.
Before you can operate over the elements of a STRUCT or ARRAY, what must you do?
They must be unpacked: use the UNNEST() function to flatten an ARRAY into rows; the fields of a STRUCT are accessed with dot notation.
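A minimal sketch of both (names and values are illustrative):

```sql
-- UNNEST() turns an ARRAY into rows; dot notation reads STRUCT fields
WITH t AS (
  SELECT STRUCT('Bo' AS name) AS person, [10, 20, 30] AS scores
)
SELECT person.name, score
FROM t, UNNEST(scores) AS score;  -- returns three rows
```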
Slots
Units of computational capacity, each comprising a certain amount of CPU and RAM, that BigQuery uses to execute queries.
When should you use EL?
When the data is already clean and correct.
When should you use ETL?
When the data requires transformation before loading, when loading has to happen continuously, or when you want to integrate with continuous integration/continuous delivery (CI/CD) systems.
When should you use ELT?
When you're not sure what kinds of transformation are needed and/or the transformation can be expressed in SQL.
What affects the path your data takes to get to the cloud?
Where your data is now. How big your data is. Where it has to go. How much transformation is needed.
What do machine learning teams need data engineers to help with?
Capturing new features in a stable pipeline, so that the data is available at production time and the models have more insight.
Clustering
Organizing a table's data based on the contents of one or more specified columns, which colocates related rows and can improve query performance and reduce cost.
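In BigQuery, clustering is declared in the DDL, often together with partitioning; a sketch with hypothetical names:

```sql
-- Rows are sorted within each partition by the clustered column
CREATE TABLE mydataset.events
(
  event_date  DATE,
  customer_id STRING,
  payload     STRING
)
PARTITION BY event_date
CLUSTER BY customer_id;
```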
Partitioned Tables
A special table that is divided into segments, called partitions, which make it easier to manage and query your data.
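Filtering on the partitioning column lets BigQuery prune partitions and scan less data; a sketch against a hypothetical date-partitioned table:

```sql
-- Only the January 2024 partitions are scanned, not the whole table
SELECT COUNT(*)
FROM mydataset.sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31';
```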