Modernizing Data lakes and Data Warehouses with GCP

¡Supera tus tareas y exámenes ahora con Quizwiz!

SQL STRUCT

A container of other data fields which can be of different data types

SQL ARRAY

A container of other data fields which must be of the same type.

Data Engineer

A data engineer builds pipelines to enable data-driven decisions.

What is the difference between a data lake and a data warehouse?

A data lake holds raw data. A data warehouse stores data in a way that makes it efficient to query.

Federated Data Source (External Data Source)

A data source that you can query directly even though the data is not stored in BigQuery.

Data Warehouse

A data warehouse stores transformed data in a usable condition for business insights.

User Defined Functions (UDF)

A function provided by the user of a program or environment, in a context where the usual assumption is that functions are built into the program or environment.

Cloud Composer

A managed Apache Airflow service that helps you create, schedule, monitor and manage workflows.

Throughput

A measure of how many units of information a system can process in a given amount of time.

Failover

A method of protecting computer systems from failure, in which standby equipment automatically takes over when the main system fails.

Ad-hoc Queries

A request for information created due to unplanned information needs that is typically not saved for later use

Data Lake

A scalable and secure data platform that allows enterprises to ingest, store, process, and analyze any type or volume of information. Usually stores data in raw format. The point of it is to make data *ACCESSIBLE* for analytics!

Table Partitioning

A way to divide a large table into smaller, more manageable parts without having to create separate tables for each part.

What are challenges that data engineers face?

Access to data. Data accuracy and quality. Availability of computational resources. Query performance.

Overhead

Any combination of excess or indirect computation time, memory, bandwidth, or other resources that are required to perform a specific task

Google Cloud's Operations Suite (formerly Stackdriver)

Designed to monitor, troubleshoot, and improve cloud infrastructure, software, and application performance. Efficiently build and run workloads, keeping applications performant and available.

IAM (Identity & Access Management)

Enables you to manage access to cloud services and resources securely. Google and AWS both have this.

What does ETL stand for?

Extract, Transform, Load

What are the key considerations when building a data lake?

Handling of multiple data types. Scalability. High-throughput ingestion. Fine-grained access control to objects. Connectivity to other tools.

How does denormalization increase performance?

It avoids intensive joins in SQL.

What should you always assume about any raw data from source systems?

It needs to be cleaned, transformed and stored in a data warehouse

What are considerations when choosing a data warehouse?

Pipeline type. Scalability. Data organization. Performance design. Maintenance.

Data Mart

Subset of a data warehouse oriented to a specific business line

Normalization

The process of applying rules to a database design to ensure that information is divided into the appropriate tables.

What is a potential issue if you stored all your data in one giant table?

The table row size could be too large for traditional reporting databases.

Before you can operate over the elements of a STRUCT or ARRAY, what must you do?

They must be unpacked. (Use the UNNEST() function.)

Slots

Units of computation that comprise a certain number of CPU and RAM.

When should you use EL?

When the data is already clean and correct.

When should you use ETL?

When the data loading has to happen continuously or you want to integrate continuous integration or continuous delivery systems.

When should you use ELT?

When you're not sure what kinds of transformation are needed and/or the transformation can be expressed in SQL.

What affects the path your data takes to get to the cloud?

Where your data is now. How big your data is. Where it has to go. How much transformation is needed.

Why do machine learning teams need data engineers to help with?

capturing new features in a stable pipeline in order to have data available at production time with more insight in the models

Clustering

organizing items into related groups during recall from long-term memory

Partitioned Tables

special table that is divided into segments, called partitions, that make it easier to manage and query your data.


Conjuntos de estudio relacionados

20.3 The acid dissociation constant Ka

View Set

Synchronous Sequential Circuits Intro

View Set

Chapter 6 Managerial Planning and Goal Setting

View Set

Communication 101 Midterm Chapters 7-8

View Set

Chapter 3 - Numerical Descriptive Measures

View Set

NUR 311L Skills Lab - Course Point: Mod. 3

View Set