GCP - Data Engineer Certification

ML: Feature Engineering

*Feature engineering* means transforming raw data into a feature vector. Expect to spend significant time doing feature engineering. Many machine learning models must represent the features as real-numbered vectors since the feature values must be multiplied by the model weights.
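For instance, here is a minimal Python sketch of one such transformation - one-hot encoding a categorical value into a real-numbered vector. The feature name and vocabulary below are illustrative assumptions, not part of any GCP API.

    # Hypothetical vocabulary for a categorical feature "street_name"
    vocabulary = ["Main St", "Oak Ave", "Elm St"]

    def one_hot(value, vocabulary):
        """Map a categorical value to a real-valued vector the model can multiply by weights."""
        return [1.0 if value == v else 0.0 for v in vocabulary]

    print(one_hot("Oak Ave", vocabulary))  # [0.0, 1.0, 0.0]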

VPC

*V*irtual *P*rivate *C*loud - a global, private, isolated virtual network partition that provides managed networking functionality.

DataStore: Full Indexing

- "Built-in" Indices on each property (~field) of each entity kind (~table row) - "Composite" Indices on multiple property values - If you are certain a property will never be queried, can explicitly exclude it from indexing - Each query is evaluated using its "perfect index"

Streaming: Micro batches: Session Window

- *Changing* window size based on session data - No overlapping time - Number of entities differ within a window - Session gap determines window size

DataFlow (Apache Beam): Driver and Runner

- *Driver* defines computation DAG (pipeline) - *Runner* executes DAG on a backend - Beam supports multiple backends -- Apache Spark -- Apache Flink -- Google Cloud Dataflow -- Beam Model

VM: Live Migration Stages

- *Pre-migration brownout*: VM executing on source when most of the state is sent from source to target - *Blackout*: A brief moment when the VM is not running anywhere. - *Post-migration brownout*: VM is on the target, the source is present and might offer support (forwards packets from the source to target VMs till networking is updated)

GCE: High-CPU Machine Types

- 0.9 GB memory per vCPU - naming: n1-highcpu-<2,4,8,16,32,64,96 vCPUs>. - Fixed at 16 persistent disks, 64TB total size.

VM: Billing Model

- All machine types are charged for a minimum of 1 minute - After 1 minute, instances are charged in 1-second increments

Pub/Sub: Publishers

- Any application that can make HTTPS requests to googleapis.com -- App Engine app -- App running on Compute Engine instance -- App running on third party network -- Any mobile or desktop app -- Even a browser

GKE: Container Cluster: Autoscaling

- Automatic resizing of clusters with *Cluster Autoscaler* - Periodically checks whether there are any pods waiting and resizes the cluster if needed - Also monitors node usage and deletes a node if all of its pods can be scheduled elsewhere

Network Load Balancing

- Based on incoming IP protocol data, such as address, port, and protocol type - *Pass-through, regional* load balancer - does not proxy connections from clients - Use it to load balance UDP traffic, and TCP and SSL traffic - Load balances traffic on ports that are not supported by the SSL proxy and TCP proxy load balancers

StackDriver: Service Tiers and Retention

- Basic - no StackDriver account - free and 5 GB cap - Retention period of log data depends on service tier

HDFS

- Built on commodity hardware - Highly fault tolerant; hardware failure is the norm - Suited to batch processing - data access has high throughput rather than low latency - Supports very large data sets - Manages file storage across multiple disks - A cluster of machines; each machine is a node in the cluster - Each disk is on a different machine in the cluster - One node is the name node and the others are data nodes

VM: Sustained Discounts for Custom Machines

- Calculates sustained use discounts by combining memory and CPU usage - Tries to combine resources to qualify for the biggest sustained usage discounts possible

VPC: IP Addresses

- Can be assigned to resources e.g. VMs - Each VM has an internal IP address - One or more secondary IP addresses - Can also have an external IP address

Cloud Storage: Domain-Named Buckets

- Cloud Storage considers bucket names that contain dots to be domain names - Must be syntactically valid DNS names -- E.g bucket...example.com is not valid. - End with a currently-recognized top-level domain, such as .com - Pass domain ownership verification.

VPC: Instance Routing Tables

- Every route in a VPC maps to zero or more instances - A route applies to an instance if the tags of the route and the instance match - If a route has no tag, it applies to all instances in the network - All routes together form the routes collection

BigTable

- Fast scanning of sequential key values - use BigTable - Columnar database, good for sparse data - Sensitive to hot spotting - need to design key structure carefully - Similar to HBase

Unmanaged Instance Groups

- Groups of dissimilar instances that you can add and remove from the group - Do not offer autoscaling, rolling updates or instance templates - Not recommended, used only when you need to apply *load balancing to pre-existing* configurations

BigTable: Hotspotting and Schema Design

- Like Cloud Spanner, data stored in sorted lex order of keys - Data is distributed based on key values - So, performance will be really poor if -- Reads/writes are concentrated in some ranges -- For instance if key values are sequential - Use hashing of key values, or non-sequential keys

Data Exfiltration: Bastion Hosts

- Limit source IPs that can communicate with the bastion - Configure firewall rules to allow SSH traffic to private instances from only the bastion host - Bastion hosts play a role similar to jump hosts

Load Balancing: Target Pools

- Network load balancing forwards traffic to target pools - A *group of instances* which receive incoming traffic from forwarding rules - Can only be used with forwarding rules for TCP and UDP traffic - Can have *backup pools* which will receive requests if the first pool is unhealthy - *failoverRatio* is the ratio of healthy instances to failed instances in a pool - If primary target pool's ratio is *below the failoverRatio* traffic is sent to the backup pool

OAuth: Caution

- OAuth client ID secrets are viewable by all project owners and editors, but not readers - If you revoke access to some user, remember to reset these secrets to prevent data exfiltration

Hadoop Ecosystem: Hive

- Provides an SQL interface to Hadoop - The bridge to Hadoop for folks who don't have exposure to OOP in Java

HDFS: Replication

- Replicate blocks based on the replication factor - Store replicas in different locations - The replica locations are also stored in the name node.

StackDriver: Error Reporting

- StackDriver error reporting works on -- AppEngine Standard Environment - log entries with a stack trace and severity of ERROR or higher automatically show up -- AppEngine Flexible Environment - anything written to stderr automatically shows up -- Compute Engine - instrument - throw error in exception catch block -- Amazon EC2: Enable StackDriver logging

Deployment Manager: Types

- A type represents a single API resource or a set of resources and is used for resource creation - *Base type*: creates a single primitive resource; a type provider is used to create additional base types - *Composite type*: contains one or more templates, preconfigured to work together

IAP Limitations

- Will not protect against activity inside VM, e.g. someone SSH-ing into a VM or AppEngine flexible environment - Need to configure firewall and load balancer to disallow traffic not from serving infrastructure - Need to turn on HTTP signed headers

ML: discrete feature

A feature with a finite set of possible values. For example, a feature whose values may only be animal, vegetable, or mineral is a discrete (or categorical) feature.

ML: prediction bias

A value indicating how far apart the average of predictions is from the average of labels in the data set. Not to be confused with the bias term in machine learning models or with bias in ethics and fairness.

VM: Instances and storage options

By default, each Compute Engine instance has a small *root persistent disk* that contains the operating system. When applications running on your instance require more storage space, you can add additional storage options to your instance.

ML: bucketing

Converting a (usually continuous) feature into multiple binary features called buckets or bins, typically based on value range. For example, instead of representing temperature as a single continuous floating-point feature, you could chop ranges of temperatures into discrete bins. Given temperature data sensitive to a tenth of a degree, all temperatures between 0.0 and 15.0 degrees could be put into one bin, 15.1 to 30.0 degrees could be a second bin, and 30.1 to 50.0 degrees could be a third bin.
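A small Python sketch of the bucketing described above, assuming the three illustrative bins (0.0-15.0, 15.1-30.0, 30.1-50.0):

    def bucketize(temperature):
        """Return a 3-element one-hot list indicating which temperature bin the value falls into."""
        bins = [(0.0, 15.0), (15.1, 30.0), (30.1, 50.0)]  # illustrative bin edges
        return [1.0 if low <= temperature <= high else 0.0 for low, high in bins]

    print(bucketize(22.4))  # [0.0, 1.0, 0.0]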

VM: Instances and projects

Each instance belongs to a Google Cloud Platform Console project, and a project can have one or more instances. When you create an instance in a project, you specify the zone, operating system, and machine type of that instance. When you delete an instance, it is removed from the project.

DEVSHELL_PROJECT_ID

Environment variable that holds the current project ID.

Cloud Spanner: Primary keys

Every table must have a primary key, and that primary key can be composed of zero or more columns of that table. If you declare a table to be a child of another table, the primary key column(s) of the parent table must be the prefix of the primary key of the child table. This means that if a parent table's primary key is composed of N columns, the primary key of each of its child tables must begin with those same N columns, in the same order and starting with the same column.

How can a user switch between two different projects from cloud shell using gcloud command

Get which projects you have: - gcloud projects list Check which project is currently active: - gcloud config get-value project - gcloud config list Set the project you want to use: - gcloud config set project project-id - In other commands, pass --project "Project_ID" as a flag. - Or set the CLOUDSDK_CORE_PROJECT environment variable >> export CLOUDSDK_CORE_PROJECT="my-project-123456"

Cloud DNS

Google Cloud DNS is a high-performance, resilient, global Domain Name System (DNS) service that publishes your domain names to the global DNS in a cost-effective way. - Hierarchical distributed database that lets you store IP addresses and other data and look them up by name - Publish zones and records in the DNS - No burden of managing your own DNS server

VPC: Subnets are Regional

Instances from different regions cannot be on the same subnet. Subnets can contain resources from multiple zones, or from a single zone.

Cloud Data Transfer Use Cases - Decommission Tape Libraries and Infrastructure

Many organizations accumulate vast libraries of magnetic tape as they copy data for backup, archival or disaster recovery purposes. You can easily transfer data from tape to Google Cloud Storage. Once in Google Cloud you can generate new insights with advanced analytics, discover it more easily for regulatory and legal purposes and apply machine learning.

HDFS: Choosing Replica Locations

Maximize redundancy: - Store replicas "far away" i.e. on different nodes Minimize write bandwidth: - This requires that replicas be stored close to each other

BigQuery Commands: SQL Query

Most popular girl's name in the past few years: bq query "SELECT name, count FROM babynames.all_names WHERE gender = 'F' ORDER BY count DESC LIMIT 5"

Shared VPC: Host Project

Project that hosts sharable VPC networking resources within a Cloud Organization.

Spark Core

Spark Core is just a computing engine. It needs two additional components. - A *Storage System* that stores the data to be processed -- Local file system -- HDFS - A *Cluster Manager* to help Spark run tasks across a cluster of machines -- Built-in Cluster Manager -- YARN Both of these are plug and play components.

TensorFlow: Getting a tf.Tensor object's rank

To determine the rank of a tf.Tensor object, call the tf.rank method. Example: r = tf.rank(my_image)

Dataproc: High Availability (Beta)

When creating a Google Cloud Dataproc cluster, you can put the cluster into Hadoop High Availability (HA) mode by specifying the number of master instances in the cluster. The number of masters can only be specified at cluster creation time. Currently, Cloud Dataproc supports two master configurations: - 1 master (default, non HA) - 3 masters (Hadoop HA) *Instance Names* The default master is named cluster-name-m; HA masters are named cluster-name-m-0, cluster-name-m-1, cluster-name-m-2. *Apache ZooKeeper* In an HA Cloud Dataproc cluster, all masters participate in a ZooKeeper cluster, which enables automatic failover for other Hadoop services.

Cloud Storage: Moving and Renaming Buckets

When you create a bucket, you permanently define its name, its geographic location, and the project it is part of. However, you can effectively move or rename your bucket: - If there is no data in your old bucket, simply delete the bucket and create another bucket with a new name, in a new location, or in a new project. - If you have data in your old bucket, create a new bucket with the desired name, location, and/or project, copy data from the old bucket to the new bucket, and delete the old bucket and its contents.

VM: Shared Core

- Ideal for applications that do not require a lot of resources - Small, non-resource intensive applications

Global Forwarding Rules

- Route traffic by IP address, port and protocol to a load balancing proxy - Can only be used with *global* load balancing HTTP(S), SSL Proxy and TCP Proxy - *Regional* forwarding rules can be used with regional load balancing and individual instances

YARN: Node Manager

- Run on all other nodes - Manages tasks on the individual node

HDFS: Reading a File

- Use metadata in the name node to look up block locations - Read the blocks from respective locations

Pig vs. Hive

*Pig*: - Used to extract, transform and load data *into a data warehouse* - Used by developers to bring together useful data in one place - Uses Pig Latin, a procedural, data flow language *Hive*: - Used to query data *from a data warehouse* to generate reports - Used by analysts to retrieve business information from data - Uses HiveQL, a structured query language

Load Balancing: Backend Service Components

- *Health Check*: Polls instances to determine which ones can receive requests - *Backends*: Instance groups of VMs which can be automatically scaled - *Session Affinity*: Attempts to send requests from the same client to the same VM - *Timeout*: Time the backend service will wait for a backend to respond

Dataproc vs Dataflow Workloads

- *Stream processing (ETL)*: Dataflow - *Batch processing (ETL)*: Both - *Iterative processing and notebooks*: Dataproc - *Machine learning with Spark ML*: Dataproc - *Preprocessing for machine learning*: Dataflow (with Cloud ML Engine)

VM: How do you retrieve Meta Data

gcloud compute instances create example-instance --metadata foo=bar
gcloud compute instances add-metadata INSTANCE --metadata lettuce=green
gcloud compute instances remove-metadata INSTANCE --keys lettuce
curl -H "Metadata-Flavor: Google" "http://metadata.google.internal/computeMetadata/v1/instance/disks/0/type"

VM: How to move instance between Zones

gcloud compute instances move jjain1-vm --zone us-east1-b --destination-zone us-east1-c Instances can be moved along with their resources only within a region: from one zone to another.

Cloud Storage: List Bucket

gsutil ls gsutil ls -L -b gs://[BUCKET_NAME]/ (it will give you the geographic *Location* and default *Storage* class of the bucket) To get the size of the bucket: gsutil du -s gs://[BUCKET_NAME]/

Pig vs. SQL

*Pig*: - A *data flow* language, transforms data to store in a warehouse. - Specifies *exactly how* data is to be modified at every step. - Purpose of processing is to *store in a queryable format*. - Used to *clean data* with inconsistent or incomplete schema. *SQL*: - A *query* language, is used for retrieving results - *Abstracts* away how queries are executed - Purpose of data extraction is *analysis* - *Extract insights*, generate reports, drive decisions

Hadoop Ecosystem: HBase

- A database management system on top of Hadoop - Integrates with your application just like a traditional database

YARN

- Co-ordinates tasks running on the cluster - Assigns new nodes in case of failure Two major components: - Resource Manager - Node Manager.

VPC: Dynamic Routing Mode

- Determines which subnets are visible to Cloud Routers - *Global dynamic routing*: Cloud router advertises all subnets in the VPC network to the on-premise router - *Regional dynamic routing*: Advertises and propagates only those routes in its local region

VPC: Firewall: Rule Assignment

- Every rule is assigned to every instance in a network - Rule assignment can be restricted using tags or service accounts -- Allow traffic from instances with source tag "backend" -- Deny traffic to instances running as service account "[email protected]"

Default Hadoop Replication Strategy

- First location chosen at random - Second location has to be on a different rack (if possible) - Third replica is on the same rack as the second, but on a different node - Reduces inter-rack traffic and improves write performance - Read operations are sent to the rack closest to the client

BigQuery: Querying and Viewing

- Interactive queries - Batch queries - Views - Partitioned tables

GCP Internal Load Balancing

- Not proxied - differs from traditional model - lightweight load-balancing built on top of Andromeda network virtualization stack - provides software-defined load balancing that directly delivers the traffic from the client instance to a backend instance

GKE: Container Registry

- Private registry for Docker images - Can access Container Registry through secure HTTPS endpoints, which lets you push, pull, and manage images from any system, whether it's a Compute Engine instance or your own hardware - Can use the Docker credential helper command-line tool to configure Docker to authenticate directly with Container Registry - Can use third-party cluster management, continuous integration, or other solutions outside of Google Cloud Platform

GCE: Image Types

- Public images for Linux and Windows Server that Google provides - Private images that you create or import to Compute Engine - Community-supported images for other operating systems

Cloud SQL, Cloud Spanner

- Relational databases - super-structured data, constraints etc - ACID properties - use for transaction processing (OLTP) - Too slow and too many checks for analytics/BI/warehousing (OLAP) - Recall that OLTP needs strict write consistency, OLAP does not - Cloud Spanner is Google proprietary, more advanced than Cloud SQL - Cloud Spanner offers "horizontal scaling" - i.e. bigger data, more instances, replication etc.

ACID at the row level

- Updates to a single row are atomic. - All columns in a row are updated or none are updated.

GCP Virtual Private Cloud

A VPC network, sometimes just called a "network," is a virtual version of a physical network, like a data center network. It provides connectivity for your Compute Engine virtual machine (VM) instances, Kubernetes Engine clusters, App Engine Flex instances, and other resources in your project. Projects can contain multiple VPC networks. New projects start with a default network that has one subnet in each region (an auto mode network).

ML: classification model

A type of machine learning model for distinguishing among two or more discrete classes. For example, a natural language processing classification model could determine whether an input sentence was in French, Spanish, or Italian.

ML: bias (math)

An intercept or offset from an origin. Bias (also known as the bias term) is referred to as b or w0 in machine learning models. For example, bias is the b in the following formula: y' = b + w1x1 + w2x2 + ... + wnxn. Not to be confused with bias in ethics and fairness or prediction bias.

MapReduce: Reduce

An operation to combine the results of the map step

Cloud Bigtable: storage model

Cloud Bigtable stores data in massively scalable tables, each of which is a sorted key/value map. The table is composed of *rows*, each of which typically describes a single entity, and *columns*, which contain individual values for each row. Each row is indexed by a single *row key*, and columns that are related to one another are typically grouped together into a *column family*. Each column is identified by a combination of the column family and a *column qualifier*, which is a unique name within the column family. Each row/column intersection can contain multiple *cells* at different timestamps, providing a record of how the stored data has been altered over time. Cloud Bigtable tables are sparse; if a cell does not contain any data, it does not take up any space.
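As a purely conceptual illustration (plain Python, not the Cloud Bigtable client API - the row key, families, qualifiers, and values are made up), a single row could be pictured as nested maps keyed by row key, then column family, then column qualifier, then timestamp:

    # One Bigtable row; each row key maps to column families, each family to
    # column qualifiers, and each qualifier to timestamped cells.
    row = {
        "com.example.site#20240101": {              # row key (illustrative)
            "stats": {                               # column family
                "views": {1704067200: b"42",         # cells at different timestamps
                          1704153600: b"57"},
            },
            "meta": {
                "title": {1704067200: b"Home page"},
            },
        },
    }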

GCE: Storage Options

Each instance comes with a small root persistent disk containing the OS Add additional storage options: - Standard Persistent disks - SSD - Local SSDs - Cloud Storage

DataStore: Server-Side Encryption

Google Cloud Datastore automatically encrypts all data before it is written to disk. There is no setup or configuration required and no need to modify the way you access the service. The data is automatically and transparently decrypted when read by an authorized user.

VPC: Using Routes

- Many-to-one NATs -- Multiple hosts mapped to one public IP - Transparent proxies -- Direct all external traffic to one machine

Dataproc: Connectors

- BigQuery - BigTable - Cloud Storage

VPC: Routes

A route is a mapping of an IP range to a destination. Routes tell the VPC network where to send packets destined for a particular IP address.

CDN

*C*ontent *D*elivery *N*etwork

VPC: Interconnecting Networks

3 options - Virtual Private Networks (VPNs) using Cloud Router - Dedicated Interconnect - Direct and Carrier Peering

Referring to tables in BigQuery

<project id>:<dataset>.<table> Examples: publicdata:samples.shakespeare bigquery-public-data:usa_names.usa_1910_current

ML: synthetic feature

A *feature* not present among the input features, but created from one or more of them. Kinds of synthetic features include: - *Bucketing* a continuous feature into range bins. - Multiplying (or dividing) one feature value by other feature value(s) or by itself. - Creating a *feature cross*. Features created by *normalizing* or *scaling* alone are not considered synthetic features.

BigTable: Instances

A Cloud Bigtable instance is mostly just a container for your clusters and nodes, which do all of the real work. Important properties: - The instance type (production or development) - The storage type (SSD or HDD) - The application profiles, for instances that use replication.

StackDriver Accounts

A Stackdriver account holds monitoring and other configuration information for a group of GCP projects and AWS accounts that are monitored together.

Shared VPC: Shared VPC network

A VPC network owned by the host project and shared with one or more service projects in the Cloud Organization.

ML: Validation set

A subset of the data set—disjoint from the training set—that you use to adjust hyperparameters.

VM: Image (definition)

An image in Compute Engine is a cloud resource that provides a reference to an immutable disk.

MapReduce: Map

An operation performed in parallel, on small portions of the dataset

Identity and Security

Authentication - Who are you?
- Standard flow - critical to get it right
- End-User Accounts
- Service Accounts
- API Keys - not critical to get it right
Authorization - What can you do?
- Identity and Access Management (Cloud IAM)

BigQuery Data Transfer Service: Supported data sources

BigQuery Data Transfer Service supports loading data from the following data sources: - Google AdWords - DoubleClick Campaign Manager - DoubleClick for Publishers - Google Play (beta) - YouTube - Channel Reports - YouTube - Content Owner Reports

Hosting with Google Container Engine

Can use prebuilt container images for hosting specific websites like WordPress, etc. - Need for DevOps largely mitigated - Can use Jenkins for CI/CD - StackDriver for logging and monitoring

Properties of HBase

- Columnar store - Denormalized storage - Only CRUD operations - ACID at the row level

ML: Overfitting

Creating a model that matches the training data so closely that the model fails to make correct predictions on new data.

Stream-first Architecture

Data can come from multiple sources. - Files - Databases - Streams Major components: - Message transport - Stream processing *The stream is the source of truth*

Instance Template

Defines the machine type, image, zone and other properties of an instance. A way to save the instance configuration to use it later to create new instances or groups of instances - *Global* resource not bound to a zone or a region - Can *reference zonal resources* such as a persistent disk -- In such cases can be used only within the zone

Columnar Store

Different column values of each RowId are stored as contiguous rows. Good for: - *Sparse tables*: No wastage of space when storing sparse data - *Dynamic attributes*: Update attributes dynamically without changing the storage structure Empty cells are OK in traditional databases, but not in big data, whose size is in the range of petabytes.

Mix-and-Match Use Cases

Hybrid Use Cases: - Use *App Engine* for the front end serving layer, while running *Redis* in *Compute Engine*. - Use *Container Engine* for a rendering micro-service that uses *Compute Engine* VMs running Windows to do the actual frame rendering. - Use *App Engine* for your web front end, *Cloud SQL* as your database, and *Container Engine* for your big data processing.

ML: inference

In machine learning, often refers to the process of making predictions by applying the trained model to unlabeled examples. In statistics, inference refers to the process of fitting the parameters of a distribution conditioned on some observed data.

GKE: Container Cluster: Master Endpoint

Managed master also runs the Kubernetes API server, which - services REST requests - schedules pod creation and deletion on worker nodes - synchronizes pod information (such as open ports and location)

HDFS: Name Node

Manages the overall file system. Stores: - The directory structure - Metadata of the files

OLAP

Online Analytical Processing

HDFS: Data nodes

Physically stores the data in the files

ML: Precision

Precision = TP/(TP + FP)

ML: Recall

Recall = TP/(TP + FN)
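A tiny Python helper computing both metrics from confusion-matrix counts; the counts in the example call are made up for illustration:

    def precision_recall(tp, fp, fn):
        """Precision = TP/(TP+FP); Recall = TP/(TP+FN)."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return precision, recall

    print(precision_recall(tp=8, fp=2, fn=4))  # (0.8, 0.666...)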

ML: Generalization

Refers to your model's ability to make correct predictions on new, previously unseen data as opposed to the data used to train the model.

Spark APIs

- Scala - Python - Java

BigQuery Shell Commands

Show table info: bq show publicdata:samples.shakespeare Show first 10 rows: bq head -n 10 publicdata:samples.shakespeare

Containers vs VMs

Some points of comparison between Containers and Virtual Machine. *Containers*: - Virtualise the Operating System - More portable - Quick to boot - Size - tens of MBs *Virtual Machines*: - Virtualise hardware - Less portable - Slow to boot - Size - tens of GBs

TensorFlow: Components

TensorFlow consists of the following two components: - a graph protocol buffer - a runtime that executes the (distributed) graph These two components are analogous to Python code and the Python interpreter. Just as the Python interpreter is implemented on multiple hardware platforms to run Python code, TensorFlow can run the graph on multiple hardware platforms, including CPU, GPU, and TPU.
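A minimal sketch of that graph/runtime split, assuming the TensorFlow 1.x API used elsewhere in these notes (tf.Session, tf.constant):

    import tensorflow as tf

    # Building the graph (the protocol-buffer side): no computation happens yet.
    a = tf.constant(3.0)
    b = tf.constant(4.0)
    total = a + b

    # The runtime executes the graph - on CPU, GPU, or TPU.
    with tf.Session() as sess:
        print(sess.run(total))  # 7.0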

Shared VPC: Organization

The Cloud Organization is the top level in the Cloud Resource Hierarchy and the top-level owner of all the projects and resources created under it. A given host project and its service projects must be under the *same Cloud Organization*.

Cloud DNS: Resource Record Changes

The changes are first made to the authoritative servers and are then picked up by the DNS resolvers when their caches expire

ML: Training

The process of determining the ideal parameters comprising a model.

ML: Test set

The subset of the data set that you use to test your model after the model has gone through initial vetting by the validation set.

ML: Training set

The subset of the data set used to train a model.

BigQuery Commands: Export table to cloud storage

bq extract babynames.all_names gs://<bucket name>/export/all_names.csv

BigTable and HBase

- BigTable is basically GCP's managed HBase -- This is a much stronger link than between say Hive and BigQuery! - Usual advantages of GCP - -- scalability -- low ops/admin burden -- cluster resizing without downtime -- many more column families before performance drops (~100 OK)

Load Balancing: Session Affinity

- *Client IP*: Hashes the IP address to send requests from the same IP to the same VM -- Requests from different users might look like they come from the same IP -- Users who move networks might lose affinity - *Cookie*: Issues a cookie named GCLB on the first request -- Subsequent requests from clients with the cookie are sent to the same instance

BigTable: Warming the Cache

- BigTable will improve performance over time - Will observe read and write patterns and redistribute data so that shards are evenly hit - Will try to store roughly same amount of data in different nodes - This is why testing over hours is important to get true sense of performance

DevOps

- Compute Engine Management with Puppet, Chef, Salt and Ansible - Automated Image Builds with Jenkins, Packer, and Kubernetes - Distributed Load Testing with Kubernetes - Continuous Delivery with Travis CI - Managing Deployments with Spinnaker

VM: Rightsizing Recommendations

- Compute Engine provides machine recommendations to help optimize resource utilization - Automatically generated based on system metrics gathered by Stackdriver monitoring - Uses last 8 days of data for recommendations

VM: Inferred Instances

- Compute engine gives you the maximum available discount by clubbing instance usage together - Different instances running the same predefined machine type are combined to create inferred instances

BigQuery: Data Model

- Dataset = set of tables and views - Table must belong to dataset - Dataset must belong to a project - Tables contain records with rows and columns (fields) - Nested and repeated fields are supported.

GKE: Storage options

- Storage options as with Compute Engine - However, remember that container disks are *ephemeral*. - Need to use gcePersistentDisk abstraction for persistent disk

Data Loss Prevention API

- Understand and manage sensitive data in Cloud Storage or Cloud Datastore - Easily classify and redact sensitive data -- Classify textual and image-based information -- Redact sensitive data from text files

ML: Loss

A measure of how far a model's *predictions* are from its *label*. Or, to phrase it more pessimistically, a measure of how bad the model is. To determine this value, a model must define a loss function. For example, linear regression models typically use *mean squared error* for a loss function, while logistic regression models use *Log Loss*.

ML: calibration layer

A post-prediction adjustment, typically to account for prediction bias. The adjusted predictions and probabilities should match the distribution of an observed set of labels.

Data Exfiltration

An authorized person extracts data from the secured systems where it belongs, and either shares it with unauthorized third parties or moves it to insecure systems. Data exfiltration can occur due to the actions of malicious or compromised actors, or accidentally.

Shared VPC: Service project

Project that has permission to use the shared VPC networking resources from the host project.

Traditional RDBMS vs. DataStore: Similarities

Traditional RDBMS: - Atomic transactions - Indices for fast lookup - Some queries use indices - not all - Query time depends on both the size of the data set and the size of the result set DataStore: - Atomic transactions - Indices for fast lookup - All queries use indices! - Query time is independent of the data set size; it depends on the result set alone

VM: Apply tags

gcloud compute instances add-tags jjain3 --tags "http-server"

ML: activation function

A function (for example, ReLU or sigmoid) that takes in the weighted sum of all of the inputs from the previous layer and then generates and passes an output value (typically nonlinear) to the next layer.
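For instance, ReLU and sigmoid can each be written in a couple of lines of plain Python (a conceptual sketch, not a deep-learning framework API):

    import math

    def relu(weighted_sum):
        """Passes positive inputs through unchanged and clips negatives to 0."""
        return max(0.0, weighted_sum)

    def sigmoid(weighted_sum):
        """Squashes any real-valued input into the range (0, 1)."""
        return 1.0 / (1.0 + math.exp(-weighted_sum))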

ML: stochastic gradient descent (SGD)

A gradient descent algorithm in which the batch size is one. In other words, SGD relies on a single example chosen uniformly at random from a data set to calculate an estimate of the gradient at each step.
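A bare-bones sketch of SGD (batch size 1) for simple linear regression; the dataset, learning rate, and step count are made up for illustration:

    import random

    # Illustrative data following y = 2x + 1; the model starts from w = b = 0.
    data = [(x, 2 * x + 1) for x in range(10)]
    w, b, learning_rate = 0.0, 0.0, 0.01

    for step in range(1000):
        x, y = random.choice(data)          # one example chosen at random
        prediction = w * x + b
        error = prediction - y
        w -= learning_rate * error * x      # gradient of squared loss w.r.t. w
        b -= learning_rate * error          # gradient of squared loss w.r.t. b

    print(round(w, 2), round(b, 2))         # approaches roughly 2.0 and 1.0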

Instance Groups

A group of machines which can be created and managed together to avoid individually controlling each instance in the project - Managed - Unmanaged

DAG

*D*irected *A*cyclic *G*raph Examples: Flink, Apache Beam, TensorFlow

OLAP: Window Functions

- A suite of functions which are syntactic sugar for complex queries - Make complex operations *simple* without needing many *intermediate* calculations - Reduces the need for intermediate tables to store temporary data

Cloud Spanner: Interleaving

Cloud Spanner stores rows in sorted order by primary key values, with child rows inserted between parent rows that share the same primary key prefix. This insertion of child rows between parent rows along the primary key dimension is called *interleaving*, and child tables are also called *interleaved tables*. This enables fast access like HBase.

Cloud DNS: Record types

*A* - Address record, maps hostnames to IPv4 addresses *SOA* - Start of authority - specifies authoritative information on a managed zone *MX* - Mail exchange used to route requests to mail servers *NS* - Name Server record, delegates a DNS zone to an authoritative server

YARN: Resource Manager

- Runs on a single master node - Schedules tasks across nodes

MIG: Health Checks and Autohealing

- A MIG applies health checks to monitor the instances in the group - If a service has failed on an instance, that instance is recreated (*autohealing*) - Similar to health checks used in load balancing but the objective is different -- LB health checks are used to determine where to send traffic -- MIG health checks are used to recreate instances - Typically configure health checks for both LB and MIGs - The new instance is recreated based on the template that was used to originally create it (might be different from the default instance template) - Disk data might be lost unless explicitly snapshotted

Hadoop Ecosystem: Pig

- A data manipulation language - Transforms unstructured data into a structured format - Query this structured data using interfaces like Hive

Application Credentials

- A service account is a Google account that is associated with your GCP project, as opposed to a specific user. - Create from -- GCP Console -- Programmatically - Service account is associated with credentials via environment variable GOOGLE_APPLICATION_CREDENTIALS - At any point, one set of credentials is 'active', called *Application Default Credentials*.
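A minimal sketch of picking up Application Default Credentials in Python, assuming the google-auth client library is installed and GOOGLE_APPLICATION_CREDENTIALS points at a service-account key file:

    import google.auth

    # Reads GOOGLE_APPLICATION_CREDENTIALS (or falls back to the environment's
    # default service account) and returns the active credentials and project.
    credentials, project_id = google.auth.default()
    print(project_id)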

VPC Network Peering

- Allows private RFC1918 connectivity across two VPC networks - Networks can be in the same or in different projects - Primary and secondary ranges should not overlap with any peered ranges. - Build SaaS ecosystems in GCP, services can be made available privately across different VPC networks - Useful for organizations: -- With several network administrative domains -- Which want to peer with other organizations on the GCP

Pub/Sub: Use-cases

- Balancing workloads in network clusters - Asynchronous order processing - Distributing event notifications - Refreshing distributed caches - Logging to multiple systems simultaneously - Data streaming - Reliability improvement

Dataproc: Scaling Clusters

- Can scale up/down even when jobs are running - Operations for scaling are: -- Add workers to run jobs faster. -- Remove workers to save on cost. -- Add HDFS storage - Because clusters can be scaled more than once, you might want to increase/decrease the cluster size at one time, and then decrease/increase the size later. *Graceful Decommissioning*: When you downscale a cluster, work in progress may terminate before completion. If you are using *Cloud Dataproc v 1.2 or later*, you can use Graceful Decommissioning, which incorporates graceful YARN decommissioning to finish work in progress on a worker before it is removed from the Cloud Dataproc cluster.

DataFlow: Transforms: Types

- Core Transforms - Composite Transforms

Use Case: Document database, NoSQL

- CouchDB, MongoDB (key-value/indexed database) - GCP: DataStore

Cloud Key Management: Object hierarchy: CryptoKey

- A cryptographic key used for a specific purpose - A CryptoKey is used to protect some corpus of data - Users with the appropriate permissions on the CryptoKey can encrypt and decrypt data with it

DataStore

- Document data - e.g. XML or HTML - has a characteristic pattern - Key-value structure, i.e. structured data - Typically not used either for OLTP or OLAP - Fast lookup on keys is the most common use-case - Speciality of DataStore is that query execution time depends on the size of the returned result (not the size of the data set) - So, a query returning 10 rows will take the same length of time whether the dataset is 10 rows or 10 billion rows - Ideal for "needle-in-a-haystack" type applications, i.e. lookups of nonsequential keys - Indices are always fast to read, slow to write - So, don't use it for write-intensive data

Choosing the right IAM roles

- In a smaller organisation, the *owner*, *editor* and *viewer* roles provide sufficient granularity for key management - In a large organisation, separation of duties is required; the recommended roles are: a) one for the business owners whose application requires encryption, b) one for the user managing the cloud, c) one for the user or service using keys for encryption and decryption operations

VPC: VPCs are Global

- Instances can be in different zones of the same region - Instances can also be in different regions - All machines communicate using internal IP addresses

Cloud Key Management: Object hierarchy: KeyRing

- A KeyRing is a grouping of CryptoKeys for organisational purposes - Permissions can be managed on the combination of CryptoKey and KeyRing, so there is no need to act on each key individually

StackDriver: Using Logs

- Monitor virtually anything - VM instances, AWS EC2 instances, database instances ... - Exporting to sinks: Cloud Storage, BigQuery datasets, Pub/Sub topics - Create metrics to view in StackDriver monitoring

VM: High CPU Machines

- Less memory per vCPU as compared with regular machines (more vCPU relative to memory)

Service Accounts

- Most flexible and widely supported authentication method - Different GCP APIs support different credential types, but all GCP APIs support service accounts - For most applications that run on a server and need to communicate with GCP APIs, use service accounts

Bucket Storage Classes

- Multi-regional - frequent access from anywhere in the world - Regional - frequent access from specific region - Nearline - accessed once a month at max - Coldline - accessed once a year at max

GKE: Load Balancing

- Network load balancing works out-of-box with Container Engine - For HTTP load balancing, need to integrate with Compute Engine load balancing

CloudSQL: Cloud Proxy

- Provides secure access to your Cloud SQL Second Generation instances without having to whitelist IP addresses or configure SSL. - *Secure connections:* The proxy automatically encrypts traffic to and from the database; SSL certificates are used to verify client and server identities. - *Easier connection management:* The proxy handles authentication with Google Cloud SQL, removing the need to provide static IP addresses.

BigQuery: Alternatives to Loading

- Public datasets - Shared datasets - Stackdriver log files (needs export - but direct)

Service Accounts: Why use them?

- Service accounts are associated with a project, not a user - So, any project user gets access to all required resources at one go - Btw, can also assign roles to service accounts - Only use end-user accounts if you'd like to differentiate even between different end-users on the same project

How do you prevent VM from accidental deletion

- Set the deletionProtection property (only a user that has been granted a role with the compute.instances.create permission can reset the flag to allow the resource to be deleted) - gcloud compute instances describe example-instance | grep "deletionProtection" - gcloud compute instances update [INSTANCE_NAME] [--deletion-protection | --no-deletion-protection]

PySpark

- This is just like a Python shell - Use Python functions, dicts, lists etc - You can import and use any installed Python modules - Launches by default in a local non-distributed mode

Dataproc: BigQuery Connector

- You can use a BigQuery connector to enable programmatic read/write access to BigQuery. - This is an ideal way to process data that is stored in BigQuery. No command-line access is exposed. - The BigQuery connector is a Java library that enables Hadoop to process data from BigQuery. *Pricing considerations:* When using the connector, you will also be *charged* for any associated BigQuery usage fees. Additionally, the BigQuery connector downloads data into a Cloud Storage bucket before running a Hadoop job. After the Hadoop job successfully completes, the data is deleted from Cloud Storage. You are *charged* for storage according to Cloud Storage pricing.

Container

A container image is a lightweight, stand-alone, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, settings

ML: ROC (receiver operating characteristic) Curve

A curve of true positive rate vs. false positive rate at different classification thresholds.

Apache Pig

A high level scripting language to work with data with *unknown* or *inconsistent* schema. - Part of the Hadoop eco-system - Works well with unstructured, incomplete data - Can work directly on files in HDFS Used to get data *into* a data warehouse

TensorFlow: Estimator

An instance of the tf.Estimator class, which encapsulates logic that builds a TensorFlow graph and runs a TensorFlow session. You may create your own custom Estimators (as described here) or instantiate premade Estimators created by others.

TensorFlow: Tensors

Arrays of arbitrary dimensionality. - A scalar is a *0-d array* (a 0th-order tensor). For example, "Howdy" or 5 - A vector is a *1-d array* (a 1st-order tensor). For example, [2, 3, 5, 7, 11] or [5] - A matrix is a *2-d array* (a 2nd-order tensor). For example, [[3.1, 8.2, 5.9], [4.3, -2.7, 6.5]]

Cloud Storage: Auto-Scaling

Cloud Storage is a multi-tenant service, meaning that users share the same set of underlying resources. In order to make the best use of these shared resources, buckets have an initial IO capacity of around 1000 write requests per second and 5000 read requests per second, which average to 2.5PB written and 13PB read in a month for 1MB objects. As the request rate for a given bucket grows, Cloud Storage automatically increases the IO capacity for that bucket by distributing the request load across multiple servers.

BigTable: Load Balancing

Each Cloud Bigtable zone is managed by a master process, which balances workload and data volume within clusters. The master splits busier/larger tablets in half and merges less-accessed/smaller tablets together, redistributing them between nodes as needed. Cloud Bigtable manages all of the splitting, merging, and rebalancing automatically, saving users the effort of manually administering their tablets. To get the best write performance from Cloud Bigtable, it's important to distribute writes as evenly as possible across nodes. One way to achieve this goal is by using row keys that do not follow a predictable order. At the same time, it's useful to group related rows so they are adjacent to one another, which makes it much more efficient to read several rows at the same time.

Google Data Studio: Filters

Filters work by including or excluding records in your data that meet a set of conditions that you specify. *Include* filters retrieve only the records that match the conditions, while *exclude* filters retrieve only the records that DON'T match the conditions. Note that filters do not transform your data in any way. They simply reduce the amount of data displayed in the report. Filters conditions consist of one or more *clauses*. Multiple clauses can be joined with "OR" logic (true if any conditions are met), "AND" logic (true if all conditions are met), or both.

Datastore

Google Cloud Datastore is a NoSQL document database built for automatic scaling, high performance, and ease of application development. - Document data - eg XML or HTML - has a characteristic pattern - Key-value structure, i.e. structured data - Typically not used either for OLTP or OLAP - Fast lookup on keys is the most common use-case - SQL-like queries and REST API - For mobile and web development - Stack Driver monitoring is integrated. - Support MapReduce framework on top of Datastore, for processing large amounts of data in parallel and distributed fashion. - Speciality of DataStore is that query execution time depends on size of returned result (not size of data set) - Ideal for "needle-in-a-haystack" type applications, i.e. lookups of nonsequential keys

Cloud SQL: Supported open source SQLs

MySQL - fast and the usual. PostgreSQL - complex queries.

Pig: Extract, Transform, Load

Pull unstructured, inconsistent data from source, clean it and place it in another database where it can be analyzed

ML: Squared loss

The linear regression models we'll examine here use a loss function called *squared loss* (also known as *L2 loss*). The squared loss for a single example is the square of the difference between the label and the prediction = (observation - prediction(x))² = (y - y')²
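Spelled out in Python over a whole (made-up) dataset, mean squared error is just the average of those per-example squared losses:

    def mean_squared_error(labels, predictions):
        """Average of (y - y')**2 over all examples."""
        squared = [(y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)]
        return sum(squared) / len(squared)

    print(mean_squared_error([3.0, -1.0, 2.0], [2.5, 0.0, 2.0]))  # 0.4166...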

TensorFlow: Special Tensor Types

The main ones are: - tf.Variable - tf.constant - tf.placeholder - tf.SparseTensor With the exception of *tf.Variable*, the value of a tensor is immutable, which means that in the context of a single execution tensors only have a single value. However, evaluating the same tensor twice can return different values; for example that tensor can be the result of reading data from disk, or generating a random number.

Setting up Cloud KMS in separate project

A project owner can access and manage everything in a project, including its keys, so consider keeping keys in a separate project: a) Create the key project without an owner - recommended. b) Grant an owner role for your key project - not recommended.

Google App Engine: Standard environment

Using the App Engine standard environment means that your application instances run in a sandbox, using the runtime environment of a supported language. Optimal for applications with the following characteristics: - Source code is written in specific versions of the supported programming languages: >> Python 2.7 >> Java 8, Java 7 >> Node.js 8 (beta) >> PHP 5.5 >> Go 1.6, 1.8, and 1.9 - Intended to *run for free or at very low cost*, where you pay only for what you need and when you need it. - Experiences *sudden and extreme spikes of traffic* which require immediate scaling.

ML: outliers

Values distant from most other values. In machine learning, any of the following are outliers: - Weights with high absolute values. - Predicted values relatively far away from the actual values. - Input data whose values are more than roughly 3 standard deviations from the mean. Outliers often cause problems in model training.

Cloud Storage: Copy Bucket

gsutil cp gs://[bucket_name]/[object_name] [object_destination]

Cloud Storage: Create Bucket

gsutil mb -p [PROJECT_NAME] -c [STORAGE_CLASS] -l [BUCKET_LOCATION] gs://[BUCKET_NAME]/ Notes: - you cannot nest buckets - you need to specify a globally-unique name, which must be less than 1024 bytes in length - you can only change the bucket name and location by deleting and re-creating the bucket - you store objects in the bucket, which are immutable (they cannot change throughout their storage lifetime) - objects have two components: object data and object metadata

Editing VM instance

- Can't change the zone once created. - To change the CPU type, you need to stop the VM, make the change and start it again. - Instead of giving access to all the APIs, you can set access at a more granular level. - VMs can be labelled to help in viewing billing, etc for a group of resources.

GKE: Container Cluster: Node Instances

- Managed from the master - Run the services necessary to support *Docker* containers - Each node runs the *docker runtime* and hosts a *Kubelet* agent, which manages the Docker containers scheduled on the host

Autoscaling: Autoscaling Policy

- Average CPU utilization - Stackdriver monitoring metrics - HTTP(S) load balancing server capacity (utilization or RPS) - Pub/Sub queueing workload (alpha)

BigQuery Overview

- BigQuery is Google's serverless, highly scalable, low cost enterprise data warehouse designed to make all your data analysts productive. - Because there is no infrastructure to manage, you can focus on analyzing data to find meaningful insights using familiar SQL and you don't need a database administrator. - BigQuery enables you to analyze all your data by creating a logical data warehouse over managed, columnar storage as well as data from object storage, and spreadsheets. - BigQuery makes it easy to securely share insights within your organization and beyond as datasets, queries, spreadsheets and reports. - BigQuery allows organizations to capture and analyze data in real-time using its powerful streaming ingestion capability so that your insights are always current. - BigQuery is free for up to 1TB of data analyzed each month and 10GB of data stored.

Cloud Storage: Bucket Labels

- Bucket labels are key:value metadata pairs that allow you to group your buckets. - You can apply multiple labels to each bucket, with a maximum of 64 labels per bucket. View bucket labels: gsutil ls -L -b gs://[BUCKET_NAME]/ Remove a bucket label: gsutil label ch -d [KEY_1] gs://[BUCKET_NAME]/

Dataproc: Restartable Jobs

- By default, Dataproc jobs do NOT restart on failure - Can optionally change this - useful for long-running and streaming jobs (eg Spark Streaming) - Specify the maximum number of retries per hour (the upper limit is 10 retries per hour) - Mitigates out-of-memory, unscheduled reboots

VPC: Dynamic Routes

- Can be implemented using Cloud Router on the GCP - Uses BGP to exchange route information between networks - Networks automatically and rapidly discover changes - Changes implemented without disrupting traffic

Autoscaling Policy: Stackdriver monitoring metrics

- Can configure the autoscaler to use standard or custom metrics - Not all standard metrics are valid utilization metrics that the autoscaler can use -- the metric must contain data for a VM instance -- the metric must define how busy the resource is, the metric value increases or decreases proportional to the number of instances in the group

ML: Labels

A *label* is the thing we're predicting—the y variable in simple linear regression. The label could be the future price of wheat, the kind of animal shown in a picture, the meaning of an audio clip, or just about anything.

Pig Latin

A *procedural, data flow* language to extract, transform and load data. Procedural: - Series of *well-defined steps* to perform operations - No *if statements* or *for loops* Data Flow: - Focused on *transformations* applied to the data - Written with a *series* of data operations in mind - Data from one or more sources can be read, processed and stored in *parallel*. - Cleans data, precomputes common aggregates before storing in a data warehouse

ML: feature cross

A *synthetic feature* formed by crossing (taking a *Cartesian product* of) individual binary features obtained from categorical data or from continuous features via bucketing. Feature crosses help represent nonlinear relationships.
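A small Python sketch of a feature cross built from two already-bucketed features; the bucket names and values below are illustrative:

    # Two already-bucketed features (illustrative bucket names).
    latitude_bucket = "lat_bin_2"
    rooms_bucket = "rooms_bin_1"

    # The cross is one binary feature per combination in the Cartesian product.
    lat_bins = ["lat_bin_1", "lat_bin_2", "lat_bin_3"]
    room_bins = ["rooms_bin_1", "rooms_bin_2"]
    crossed_vocab = [f"{a} x {b}" for a in lat_bins for b in room_bins]

    crossed_value = f"{latitude_bucket} x {rooms_bucket}"
    one_hot = [1.0 if v == crossed_value else 0.0 for v in crossed_vocab]
    print(one_hot)  # exactly one 1.0 among the 6 combinations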

Resilient Distributed Datasets

Partitions - Data is divided into partitions - Distributed to multiple machines, called nodes - Nodes process data in parallel Read-only - RDDs are immutable - Only two types of operations -- Transformation: The user may define a chain of transformations on the dataset -- Action: Request a result using an action Lineage - When created, an RDD just holds metadata -- A transformation -- Its parent RDD - Every RDD knows where it came from - Lineage can be traced back all the way to the source
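A short PySpark sketch of the transformation/action split, assuming a SparkContext named sc (for example, the one the pyspark shell provides); the data and partition count are illustrative:

    # Transformations (filter, map) only record lineage - nothing runs yet.
    numbers = sc.parallelize(range(100), numSlices=4)   # partitioned across nodes
    evens = numbers.filter(lambda n: n % 2 == 0)
    squares = evens.map(lambda n: n * n)

    # The action triggers execution across the partitions.
    print(squares.take(5))  # [0, 4, 16, 36, 64]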

DataFlow: Transforms: Composite Transforms

The model of transforms in the Dataflow SDKs is modular, in that you can build a transform that is implemented in terms of other transforms. You can think of a composite transform as a complex step in your pipeline that contains several nested steps.

Hive

- Hive runs on top of the Hadoop distributed computing framework - Hive stores its data in HDFS - Hive runs all processes in the form of MapReduce jobs under the hood - You don't need to write MapReduce code to work with Hive

Hive Data Ownership

- Hive stores files in *HDFS* - Hive files can be read and written by many technologies - *Hadoop*, *Pig*, *Spark* - Hive database schema *cannot be enforced* on these files

BigTable: Avoiding Hotspotting

- Field Promotion: Use in reverse URL order like Java package names -- This way keys have similar prefixes, differing endings - Salting -- Hash the key value

Use Case: Storing media, Blob Storage

- File system - maybe HDFS. - GCP: Cloud Storage

Use Case: Fast random access with mobile SDKs

- Firebase Realtime DB

Streaming: Micro batches: Sliding Window

- Fixed window size - Overlapping time - sliding interval - Number of entities differ within a window

Streaming: Micro batches: Tumbling Window

- Fixed window size. - Non-overlapping time - Number of entities differ within a window - The window tumbles over the data, in a nonoverlapping manner

Cloud Storage: Bucket Storage Classes: Multi-regional Storage

- Frequently accessed ("hot" objects), such as serving website content, interactive workloads, or mobile and gaming applications. - Highest availability of the storage classes - Geo-redundant - Cloud Storage stores your data redundantly in at least two regions separated by at least 100 miles within the multi-regional location of the bucket.

DataStore: Perfect Index

- Given a query, which is the index that most optimally returns query results? - Depends on following (in order) -- equality filter -- inequality filter (only 1 allowed) -- sort conditions if any specified

Apache Beam: Where

- Global windows - Fixed windows - Sliding windows - Session windows - Custom windows - Custom merging windows - Timestamp control
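A sketch of applying some of these windowing choices with the Apache Beam Python SDK; the element values, timestamps, and window sizes are illustrative assumptions:

    import apache_beam as beam
    from apache_beam.transforms import window

    with beam.Pipeline() as p:
        # Attach (illustrative) timestamps so windowing has something to act on.
        events = (p
                  | beam.Create([("user1", 1), ("user1", 2)])
                  | beam.Map(lambda kv: window.TimestampedValue(kv, kv[1])))

        fixed = events | "fixed" >> beam.WindowInto(window.FixedWindows(60))
        sliding = events | "sliding" >> beam.WindowInto(
            window.SlidingWindows(size=60, period=30))
        sessions = events | "sessions" >> beam.WindowInto(window.Sessions(gap_size=600))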

Hosting with Google Compute Engine

- Google Cloud Launcher - Choose machine types, disk sizes before deployment - Get accurate cost estimates before deployment - Can customise configuration, rename instances, etc - After deployment, have full control of the VM instances *Storage options:* Cloud Storage buckets, standard persistent disks, SSD (solid state disks), local SSD *Storage Technologies:* Cloud SQL (MySQL, PostgreSQL), GCP NoSQL tools - BigTable, Datastore *Load Balancing:* *- Network load balancing:* forwarding rules based on address, port, protocol *- HTTP load balancing:* look into content, examine cookies, route certain clients to one server *StackDriver* for logging and monitoring

VM: High Memory Machines

- More memory per vCPU as compared with regular machines - Useful for tasks which require more memory as compared to processing - 6.5 GB of RAM per core

VPC: Firewall: Ingress Connections

- Source CIDR ranges, Protocols, Ports - Sources with specific tags or service accounts -- Allow: Permit matching ingress connections -- Deny: Block the matching ingress connections

Stream Processing

Data is received as a stream - Log messages - Tweets - Climate sensor data Process the data one entity at a time - Filter error messages - Find references to the latest movies - Track weather patterns Store, display, act on filtered messages - Trigger an alert - Show trending graphs - Warn of sudden squalls

Load Balancing Hierarchy

External:
-- Global
---- HTTP/HTTPS
---- SSL Proxy
---- TCP Proxy
-- Regional
---- Network
Internal:
-- Regional

ACID Properties

*A*tomicity *C*onsistency *I*solation *D*urability

IAP: Authentication & Authorisation

*Authentication:* - Requests come from 2 sources: -- App Engine -- Cloud Load Balancing (HTTPS) - Cloud IAP checks the user's browser credentials - If none exist, the user is redirected to an OAuth 2.0 Google Account sign-in - Those credentials sent to IAM for authorisation *Authorisation:* - As before using IAM

CRUD operations

*C*reate, *R*ead, *U*pdate & *D*elete Traditional databases and SQL support: - Joins: Combining information across tables using keys - Group By: Grouping and aggregating data for the groups - Order By: Sorting rows by a certain column HBase does not support SQL (NoSQL). Only a limited set of operations are allowed in HBase (CRUD). - No operations involving multiple tables - No indexes on tables - No constraints

IAP and IAM

- IAP is an additional step, not a bypassing of IAM - So, users and groups still need correct Cloud Identity Access Management (Cloud IAM) role

Kubernetes Container Cluster

- Kubernetes Master - Kubernetes Node agent/client (Kubelet) - Pods - each consists of a set of containers.

GCE: Other Machine Types

- Memory-optimized machine types - Shared-core machine types (f1-micro, g1-small) - Custom machine types - Provides GPUs that you can add to your VM instances (NVIDIA Tesla V100, P100 and K80 GPUs)

MapReduce

- Processing huge amounts of data requires running processes on many machines. - MapReduce is a programming paradigm - Takes advantage of the inherent parallelism in data processing - A task of large scale is processed in two stages -- map -- reduce - Programmer defines these 2 functions. Hadoop does the rest - behind the scenes

Streaming: Micro batches: Types of Windows

- Tumbling Window - Sliding Window - Session Window

GCE: Preemptible Instances: Ways for handling graceful shutdown

- Using metadata >> startup-script-url or startup-script >> shutdown-script or shutdown-script-url - API: instances.delete request or instances.stop

VM: Availability Policy

--maintenance-policy MIGRATE | TERMINATE. The default is MIGRATE. By default, instances are automatically set to restart unless you provide the --no-restart-on-failure flag.

BigQuery: Table Schema

A BigQuery table contains individual records organized in rows. Each record is composed of columns (also called fields). Every table is defined by a *schema* that describes the column names, data types, and other information. You can specify the schema of a table when it is created, or you can create a table without a schema and declare the schema in the query job or load job that first populates it with data.
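A minimal sketch, assuming the google-cloud-bigquery Python client and hypothetical project/dataset/table ids, of creating a table with an explicit schema:

from google.cloud import bigquery

client = bigquery.Client()   # assumes application default credentials and a default project
schema = [
    bigquery.SchemaField("name", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("gender", "STRING"),
    bigquery.SchemaField("count", "INTEGER"),
]
table = bigquery.Table("my-project.babynames.babynames_2011", schema=schema)   # hypothetical ids
client.create_table(table)   # the schema could instead be declared by the first load or query job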

DataFlow: PCollection: Limitations

A PCollection has several key aspects in which it differs from a regular collection class: - A PCollection is immutable. Once created, you cannot add, remove, or change individual elements. - A PCollection does not support random access to individual elements. - A PCollection belongs to the pipeline in which it is created. You cannot share a PCollection between Pipeline objects. - A PCollection may be physically backed by data in existing storage, or it may represent data that has not yet been computed. - You can use a PCollection in computations that generate new pipeline data (as a new PCollection); however, you cannot change the elements of an existing PCollection once it has been created.

ML: class-imbalanced data set

A binary classification problem in which the labels for the two classes have significantly different frequencies. For example, a disease data set in which 0.0001 of examples have positive labels and 0.9999 have negative labels is a class-imbalanced problem, but a football game predictor in which 0.51 of examples label one team winning and 0.49 label the other team winning is not a class-imbalanced problem.

BigTable: Clusters

A cluster represents the actual Cloud Bigtable service. Each cluster belongs to a single Cloud Bigtable instance, and an instance can have up to 2 clusters. When your application sends requests to a Cloud Bigtable instance, those requests are actually handled by one of the clusters in the instance.

ML: weight

A coefficient for a feature in a linear model, or an edge in a deep network. The goal of training a linear model is to determine the ideal weight for each feature. If a weight is 0, then its corresponding feature does not contribute to the model.

ML: scaling

A commonly used practice in feature engineering to tame a feature's range of values to match the range of other features in the data set. For example, suppose that you want all floating-point features in the data set to have a range of 0 to 1. Given a particular feature's range of 0 to 500, you could scale that feature by dividing each value by 500.
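A small illustrative sketch in Python of min-max scaling, which generalizes the divide-by-500 example above:

def scale_to_unit_range(values):
    # Linearly rescale a feature so its values fall in [0, 1].
    lo, hi = min(values), max(values)
    return [(v - lo) / float(hi - lo) for v in values]

scale_to_unit_range([0, 125, 250, 500])   # -> [0.0, 0.25, 0.5, 1.0]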

ML: Features

A feature is an input variable—the x variable in simple linear regression. A simple machine learning project might use a single feature, while a more sophisticated machine learning project could use millions of features. In the spam detector example, the features could include the following: - words in the email text - sender's address - time of day the email was sent - email contains the phrase "one weird trick."

ML: Standard Heuristic for Model Tuning

A few rules of thumb that may help guide you: - Training error should steadily decrease, steeply at first, and should eventually plateau as training converges. - If the training has not converged, try running it for longer. - If the training error decreases too slowly, increasing the learning rate may help it decrease faster. >>> But sometimes the exact opposite may happen if the learning rate is too high. - If the training error varies wildly, try decreasing the learning rate. >>> Lower learning rate plus larger number of steps or larger batch size is often a good combination. - Very small batch sizes can also cause instability. First try larger values like 100 or 1000, and decrease until you see degradation. Again, never go strictly by these rules of thumb, because the effects are data dependent. Always experiment and verify.

ML: continuous feature

A floating-point feature with an infinite range of possible values.

ML: sigmoid function

A function that maps logistic or multinomial regression output (log odds) to probabilities, returning a value between 0 and 1. The sigmoid function has the following formula: y = 1 / (1 + e^(-sigma)) where sigma in *logistic regression* problems is simply: sigma = b + w1*x1 + w2*x2 + ... + wn*xn In other words, the sigmoid function converts sigma into a probability between 0 and 1. In some neural networks, the sigmoid function acts as the activation function.
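A one-function sketch of the sigmoid in Python; the variable name simply mirrors the symbol used above:

import math

def sigmoid(sigma):
    # Maps log odds (sigma = b + w1*x1 + ... + wn*xn) to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-sigma))

sigmoid(0.0)   # 0.5
sigmoid(4.0)   # ~0.98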

ML: Models

A model defines the relationship between features and label. For example, a spam detection model might associate certain features strongly with "spam". Let's highlight two phases of a model's life: - *Training* means creating or *learning* the model. That is, you show the model labeled examples and enable the model to gradually learn the relationships between features and label. - *Inference* means applying the trained model to unlabeled examples. That is, you use the trained model to make useful predictions (y').

VM: Instances and networks

A project can have up to *five* VPC networks, and each Compute Engine instance belongs to one VPC network. Instances in the same network communicate with each other through a local area network protocol. An instance uses the Internet to communicate with any machine, virtual or physical, outside of its own network.

Shared VPC: Standalone project

A project that does not share networking resources with any other project.

Cloud Storage: Key Terms: Resources

A resource is an entity within Google Cloud Platform. Each project, bucket, and object in Google Cloud Platform is a resource, as are things such as Compute Engine instances.

ML: regularization rate (Lambda)

A scalar value, represented as lambda, specifying the relative importance of the regularization function. The following simplified loss equation shows the regularization rate's influence: minimize(loss function + lambda * (regularization function)) Raising the regularization rate reduces overfitting but may make the model less accurate.

ML: one-hot encoding

A sparse vector in which: - One element is set to 1. - All other elements are set to 0. One-hot encoding is commonly used to represent strings or identifiers that have a finite set of possible values. For example, suppose a given botany data set chronicles 15,000 different species, each denoted with a unique string identifier. As part of feature engineering, you'll probably encode those string identifiers as one-hot vectors in which the vector has a size of 15,000.
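A minimal sketch of building a one-hot vector in Python; the species identifiers and vocabulary size are hypothetical:

def one_hot(index, size):
    # A sparse vector: a single 1 at 'index', 0 everywhere else.
    vec = [0] * size
    vec[index] = 1
    return vec

species_vocab = {"quercus_robur": 0, "acer_rubrum": 1}   # ... up to 15,000 entries
encoded = one_hot(species_vocab["acer_rubrum"], size=15000)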

ML: hidden layer

A synthetic layer in a neural network between the input layer (that is, the features) and the output layer (the prediction). A neural network contains one or more hidden layers.

ML: gradient descent

A technique to minimize *loss* by computing the gradients of loss with respect to the model's parameters, conditioned on training data. Informally, gradient descent iteratively adjusts parameters, gradually finding the best combination of *weights* and bias to minimize loss.
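A toy sketch of gradient descent for a one-feature linear model trained on MSE loss; the data and learning rate are made up for illustration:

def gradient_descent_step(w, b, examples, learning_rate):
    # One update of w and b using the gradient of MSE loss for y ~ w*x + b.
    n = float(len(examples))
    dw = sum(2 * (w * x + b - y) * x for x, y in examples) / n
    db = sum(2 * (w * x + b - y) for x, y in examples) / n
    return w - learning_rate * dw, b - learning_rate * db

w, b = 0.0, 0.0
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]   # toy (x, y) pairs
for _ in range(1000):
    w, b = gradient_descent_step(w, b, data, learning_rate=0.05)
# w and b converge toward the least-squares fit of the toy data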

Hadoop Ecosystem: Oozie

A tool to schedule workflows on all the Hadoop ecosystem technologies

DataFlow (Apache Beam): Transforms

A transform is a data processing operation, or a step, in your pipeline. A transform takes one or more PCollections as input, performs a processing function that you provide on the elements of that PCollection, and produces an output PCollection. Your transforms don't need to be in a strict linear sequence within your pipeline. You can use conditionals, loops, and other common programming structures to create a branching pipeline or a pipeline with repeated structures. You can think of your pipeline as a directed graph of steps, rather than a linear sequence.
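A minimal sketch of a branching pipeline with labelled transforms, using the Apache Beam Python SDK on the default DirectRunner; the log messages are made up:

import apache_beam as beam

with beam.Pipeline() as p:
    lines = p | "Read" >> beam.Create(["error: disk full", "info: started", "error: timeout"])
    errors = lines | "FilterErrors" >> beam.Filter(lambda s: s.startswith("error"))
    upper = lines | "ToUpper" >> beam.Map(str.upper)   # a second branch from the same PCollection
    count = errors | "CountErrors" >> beam.combiners.Count.Globally()
    count | "Print" >> beam.Map(print)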

ML: L1 regularization

A type of regularization that penalizes weights in proportion to the sum of the absolute values of the weights. In models relying on sparse features, L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0, which removes those features from the model.

ML: L2 regularization

A type of regularization that penalizes weights in proportion to the sum of the squares of the weights. L2 regularization helps drive outlier weights (those with high positive or low negative values) closer to 0 but not quite to 0. L2 regularization always improves generalization in linear models.

Cloud Launcher

A way to launch common software packages and stacks on Google Compute Engine with just a few clicks. - Click on Cloud Launcher - Choose WordPress - Choose zone, region, machine type, boot disk size, network and admin name. - Click Deploy - Open the web page and check that you can access its front end (assuming you have deployed WordPress or something similar)

BigTable: Architecture

All client requests go through a front-end server before they are sent to a Cloud Bigtable node. The nodes are organized into a Cloud Bigtable cluster, which belongs to a Cloud Bigtable instance, a container for the cluster. Each node in the cluster handles a subset of the requests to the cluster. By adding nodes to a cluster, you can increase the number of simultaneous requests that the cluster can handle, as well as the maximum throughput for the entire cluster. If you enable replication by adding a second cluster, you can also send different types of traffic to different clusters, and you can fail over to one cluster if the other cluster becomes unavailable. A Cloud Bigtable table is sharded into blocks of contiguous rows, called tablets, to help balance the workload of queries. Tablets are stored on Colossus, Google's file system, in SSTable format. An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings. In addition to the SSTable files, all writes are stored in Colossus's shared log as soon as they are acknowledged by Cloud Bigtable, providing increased durability. Importantly, data is never stored in Cloud Bigtable nodes themselves; each node has pointers to a set of tablets that are stored on Colossus.

Cloud Storage: Key Terms: Projects

All data in Cloud Storage belongs inside a project. A project consists of a set of users, a set of APIs, and billing, authentication, and monitoring settings for those APIs. You can have one project or multiple projects.

ML: Rectified Linear Unit (ReLU)

An activation function with the following rules: - If input is negative or zero, output is 0. - If input is positive, output is equal to input.

BigTable: Application Profiles

An application profile, or app profile, stores settings that tell your Cloud Bigtable instance how to handle incoming requests from an application. App profiles affect how your applications communicate with an instance that uses replication. As a result, app profiles are especially useful for instances that have 2 clusters. An app profile defines the *routing policy* that Cloud Bigtable uses. It also controls whether *single-row transactions* are allowed.

Spark

An engine for data processing and analysis. - General Purpose -- Exploring -- Cleaning and Preparing -- Applying Machine Learning -- Building Data Applications - Interactive -- REPL: Read-Evaluate-Print-Loop -- Interactive environments, fast feedback - Distributed Computing -- Process data across a cluster of machines -- Integrate with Hadoop -- Read data from HDFS

ML: AUC (Area under the ROC Curve)

An evaluation metric that considers all possible *classification thresholds*. The Area Under the ROC curve is the probability that a classifier will be more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.

BigQuery: Loading Data

BigQuery manages the technical aspects of storing your structured data, including compression, encryption, replication, performance tuning, and scaling. BigQuery stores data in the Capacitor columnar data format, and offers the standard database concepts of tables, partitions, columns, and rows. - Batch loads -- CSV -- JSON (newline delimited) -- Avro -- GCP Datastore backups - Streaming loads -- High volume event tracking logs -- Realtime dashboards Other Google Sources - Cloud storage - Analytics 360 - Datastore - Dataflow - Cloud storage logs

Block Storage

Characteristics of block storage: - This is the lowest level of storage without any abstraction and structure to data. - Meant for use from VMs but independent of the VM. Retains data through VM removal or reboots. - Location tied to VM location. Remember the options available on Compute Engine VMs: - Standard persistent disks - Regional SSDs - Local SSDs

CryptoKey Version state

CryptoKeyVersion has a state: a) *Enabled* (ENABLED) - May be used for encryption and decryption requests. b) *Disabled* (DISABLED) - May not be used, but can be placed back into the Enabled state. c) *Scheduled for destruction* (DESTROY_SCHEDULED) - Scheduled for destruction and will be destroyed soon. d) *Destroyed* (DESTROYED) - Key material is no longer stored in Cloud KMS.

Container vs VM: Layers

The layers that make up containers versus virtual machines: *Containers:* - App + Bins/Libs - per container. - Docker Runtime (virtualizes the OS) - Host OS - Infrastructure *Virtual Machines:* - App + Bins/Libs + Guest OS - per VM. - Hypervisor (virtualizes the hardware) - Infrastructure

Working with Cloud Storage

Different ways to interact with Cloud Storage: - XML and JSON APIs - Command line (gsutil) - GCP Console (web) - Client SDK

BigTable: Nodes

Each cluster in a production instance has 3 or more nodes, which are compute resources that Cloud Bigtable uses to manage your data. Cloud Bigtable splits all of the data from your tables into smaller tablets, which are stored on disk, separate from the nodes. Each node is responsible for keeping track of specific tablets on disk; handling incoming reads and writes for its tablets; and performing maintenance tasks on its tablets, such as periodic compactions. A cluster must have enough nodes to support its current workload and the amount of data it stores. Otherwise, the cluster might not be able to handle incoming requests, and latency could go up.

ML: Categorical Data

Features having a discrete set of possible values. For example, consider a categorical feature named house style, which has a discrete set of three possible values: Tudor, ranch, colonial. By representing house style as categorical data, the model can learn the separate impacts of Tudor, ranch, and colonial on house price. Sometimes, values in the discrete set are mutually exclusive, and only one value can be applied to a given example. For example, a car maker categorical feature would probably permit only a single value (Toyota) per example. Other times, more than one value may be applicable. A single car could be painted more than one different color, so a car color categorical feature would likely permit a single example to have multiple values (for example, red and white). Categorical features are sometimes called discrete features.

Firebase Hosting + Google Cloud Storage

Firebase Hosting is production-grade web content hosting for developers. With Hosting, you can quickly and easily deploy web apps and static content to a global content delivery network (CDN) with a single command. Key capabilities: - *Served over a secure connection*: The modern web is secure. Zero-configuration SSL is built into Firebase Hosting, so content is always delivered securely. - *Fast content delivery*: Each file that you upload is cached on SSDs at CDN edges around the world. No matter where your users are, the content is delivered fast. - *Rapid deployment*: Using the Firebase CLI, you can get your app up and running in seconds. Command line tools make it easy to add deployment targets into your build process. - *One-click rollbacks*: Quick deployments are great, but being able to undo mistakes is even better. Firebase Hosting provides full versioning and release management with one-click rollbacks.

External versus internal load balancing

GCP's load balancers can be divided into external and internal load balancers. External load balancers distribute traffic coming from the internet to your GCP network. Internal load balancers distribute traffic within your GCP network.

Google Cloud Functions

Google Cloud Functions is a serverless execution environment for building and connecting cloud services. With Cloud Functions you write simple, single-purpose functions that are attached to events emitted from your cloud infrastructure and services. Your Cloud Function is triggered when an event being watched is fired. Your code executes in a fully managed environment. There is no need to provision any infrastructure or worry about managing any servers. Cloud Functions are written in JavaScript and execute in a Node.js v6.14.0 environment on Google Cloud Platform. You can take your Cloud Function and run it in any standard Node.js runtime, which makes both portability and local testing a breeze.

TensorFlow: Session

Graphs must run within a TensorFlow session, which holds the state for the graph(s) it runs: with tf.Session() as sess: sess.run(tf.global_variables_initializer()); print(y.eval()) When working with tf.Variables, you must explicitly initialize them by running *tf.global_variables_initializer* at the start of your session, as shown above and in the sketch below. A session can distribute graph execution across multiple machines: *Distributed TensorFlow*.
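A runnable version of the snippet above, assuming a TensorFlow 1.x release (the graph/session API used throughout these notes):

import tensorflow as tf   # assumes TensorFlow 1.x

x = tf.Variable(3.0)
y = x * x

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())   # explicitly initialize tf.Variables
    print(y.eval())                                # 9.0; eval() runs y in the enclosing session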

BigTable: Single-row transactions

In Cloud Bigtable, reads and writes are always atomic at the row level. Cloud Bigtable does not provide atomicity above the row level; for example, Cloud Bigtable does not support transactions that atomically update more than one row. However, Cloud Bigtable also supports some write operations that would require a transaction in other databases: - *Read-modify-write operations*, including increments and appends. A read-modify-write operation reads an existing value; increments or appends to the existing value; and writes the updated value to the table. - *Check-and-mutate operations*, also known as conditional mutations or conditional writes. In a check-and-mutate operation, Cloud Bigtable checks a row to see if it meets a specified condition. If the condition is met, Cloud Bigtable writes new values to the row.

BigQuery: Query Plan Explanation

In the web UI, click on "Explanation". Helps in debugging complex queries. Embedded within query jobs, BigQuery includes diagnostic query plan and timing information. This is similar to the information provided by statements such as EXPLAIN in other database and analytical systems. This information can be retrieved from the API responses of methods such as jobs.get. For long running queries, BigQuery will periodically update these statistics. These updates happen independently of the rate at which the job status is polled, but typically will not happen more frequently than every 30 seconds. Additionally, query jobs that do not leverage execution resources, such as dry run requests or results that can be served from cached results will not include the additional diagnostic information, though other statistics may be present.

Cloud Storage: Domain Verification

Number of ways to demonstrate ownership of a site or domain, including: -- Adding a special Meta tag to the site's homepage. -- Uploading a special HTML file to the site. -- Verifying ownership directly from Search Console. -- Adding a DNS TXT or CNAME record to the domain's DNS configuration.

Cloud Storage: Bucket Versioning

Object Versioning can be turned on and off using the gsutil tool, the JSON API, or the XML API - not with the Console. - gsutil versioning set on gs://[BUCKET_NAME] - gsutil versioning get gs://[BUCKET_NAME] - gsutil ls -a gs://[BUCKET_NAME]

Cloud Storage: Key Terms: Objects

Objects are the individual pieces of data that you store in Cloud Storage. There is no limit on the number of objects that you can create in a bucket. Objects have two components: *object data* and *object metadata*. - Object data is typically a file that you want to store in Cloud Storage. - Object metadata is a collection of name-value pairs that describe various object qualities. *Object names* - An object's name is treated as a piece of object metadata in Cloud Storage. - Object names can contain any combination of Unicode characters (UTF-8 encoded) and must be less than 1024 bytes in length. - Can use '/' to give an impression of directory type structure.

Cloud Storage: Object Metadata

Objects stored in Cloud Storage have metadata associated with them. Metadata identifies properties of the object, as well as specifies how the object should be handled when it's accessed. Metadata exists as *key:value* pairs. The mutability of metadata varies: some metadata you can edit at any time, some metadata you can only set at the time the object is created, and some metadata you can only view. For example, you can edit the value of the Cache-Control metadata at any time, but you can only assign the storageClass metadata when the object is created or rewritten, and you cannot directly edit the value for the generation metadata, though the generation value changes when the object is overwritten. There are two categories of metadata that users can change for objects: - *Fixed-key metadata:* Metadata whose keys are set, but for which you can specify a value. - *Custom metadata:* Metadata that you add by specifying both a key and a value associated with the key. Setting metadata: gsutil setmeta -h "[METADATA_KEY]:[METADATA_VALUE]" gs://[BUCKET_NAME]/[OBJECT_NAME]

Cloud Data Transfer Use Cases - Machine Learning

Once transferred to Google Cloud Storage or BigQuery, your data is accessible via Google Cloud Dataflow processing service for machine learning projects. Google Cloud Machine Learning Engine is a managed service that enables you to easily build machine learning models, that work on any type of data, of any size. Create your model with the powerful TensorFlow framework that powers many Google products, from Google Photos to Google Cloud Speech. Build models of any size with our managed scalable infrastructure. Your trained model is immediately available for use with our global prediction platform that can support thousands of users and TBs of data.

SQL vs. HBase Shell Commands

SQL: 1. select * from census 2. select name from census 3. select * from census limit 1 4. select * from census where rowkey = 1 HBase Shell: 1. scan 'census' 2. scan 'census', {COLUMNS => ['personal:name']} 3. scan 'census', {LIMIT => 1} 4. get 'census', 1

Storage Transfer Service: Options

Storage Transfer Service has options that make data transfers and synchronization between data sources and data sinks easier. For example, you can: - Schedule one-time transfer operations or recurring transfer operations. - Delete existing objects in the destination bucket if they don't have a corresponding object in the source. - Delete source objects after transferring them. - Schedule periodic synchronization from data source to data sink with advanced filters based on file creation dates, file-name filters, and the times of day you prefer to import data. In order to have full access to Storage Transfer Service, you must be the *EDITOR* or *OWNER* of the project that creates the transfer job. If you are a VIEWER of the project, you can view and list transfer jobs and transfer operations associated with the data sink.

Storage Transfer Service Overview

Storage Transfer Service transfers data from an online data source to a data sink. Your data source can be an Amazon Simple Storage Service (Amazon S3) bucket, an HTTP/HTTPS location, or a Cloud Storage bucket. Your data sink (the destination) is always a Cloud Storage bucket. You can use Storage Transfer Service to: - Back up data to a Cloud Storage bucket from other storage providers. - Move data from a Multi-Regional Storage bucket to a Nearline Storage bucket to lower your storage costs.

ML: TensorFlow

Tensorflow is a computational framework for building machine learning models. TensorFlow provides a variety of different toolkits that allow you to construct models at your preferred level of abstraction. You can use lower-level APIs to build models by defining a series of mathematical operations. Alternatively, you can use higher-level APIs (like tf.estimator) to specify predefined architectures, such as linear regressors or neural networks. TensorFlow gets its name from tensors, which are arrays of arbitrary dimensionality. Using TensorFlow, you can manipulate tensors with a very high number of dimensions. TensorFlow operations create, destroy, and manipulate tensors. Most of the lines of code in a typical TensorFlow program are operations.

TensorFlow: Rank

The *rank* of a tf.Tensor object is its number of dimensions. Synonyms for rank include *order* or *degree* or *n-dimension*. Note that rank in TensorFlow is not the same as matrix rank in mathematics. 0 - Scalar (magnitude only) 1 - Vector (magnitude and direction) 2 - Matrix (table of numbers) 3 - 3-Tensor (cube of numbers) n - n-Tensor (you get the idea)

TensorFlow: Shape

The *shape* of a tensor is the number of elements in each dimension. TensorFlow automatically infers shapes during graph construction. These inferred shapes might have known or unknown rank. If the rank is known, the sizes of each dimension might be known or unknown.

Dataflow Programming Model

The Dataflow programming model is designed to simplify the mechanics of large-scale data processing. When you program with a Dataflow SDK, you are essentially creating a data processing job to be executed by one of the Cloud Dataflow runner services. This model lets you concentrate on the logical composition of your data processing job, rather than the physical orchestration of parallel processing. The Dataflow model provides a number of useful abstractions that insulate you from low-level details of distributed processing, such as coordinating individual workers, sharding data sets, and other such tasks. These low-level details are fully managed for you by Cloud Dataflow's runner services.

Cloud Spanner: Efficient Bulk Loading

The common theme for optimal bulk loading performance is to *minimize the number of machines that are involved in each write*, because aggregate write throughput is maximized when fewer machines are involved. Cloud Spanner uses load-based splitting to evenly distribute your data load across nodes: after a few minutes of high load, Cloud Spanner introduces split boundaries between rows of non-interleaved tables and assigns each split to a different server. - *Partition your data by primary key*: A good rule of thumb for your number of partitions is 10 times the number of nodes in your Cloud Spanner instance. So if you have N nodes, with a total of 10xN partitions, you can assign rows to partitions by: >>> Sorting your data by primary key. >>> Dividing it into 10xN separate sections. >>> Creating a set of worker tasks that upload the data. - *Commit between 1 MiB to 5 MiB mutations at a time* - *Upload data before creating secondary indexes* - *Periodic bulk uploads to an existing database* Inefficient Practices: - Don't write rows one at a time - Don't package N random rows into a commit with N mutations - Don't sequentially add all rows in primary key order

ML: accuracy

The fraction of predictions that a classification model got right. In multi-class classification, accuracy is defined as follows: accuracy = (correct predictions)/(total number of examples) In binary classification, accuracy has the following definition: accuracy = (True Positives + True Negatives)/(total number of examples)

ML: feature set

The group of features your machine learning model trains on. For example, postal code, property size, and property condition might comprise a simple feature set for a model that predicts housing prices.

ML: regularization

The penalty on a model's complexity. Regularization helps prevent overfitting. Different kinds of regularization include: - L1 regularization - L2 regularization - dropout regularization - early stopping (this is not a formal regularization method, but can effectively limit overfitting)

TensorFlow: tf.data API

The tf.data API enables you to build complex input pipelines from simple, reusable pieces. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training. The pipeline for a text model might involve extracting symbols from raw text data, converting them to embedding identifiers with a lookup table, and batching together sequences of different lengths. The tf.data API makes it easy to deal with large amounts of data, different data formats, and complicated transformations. The tf.data API introduces two new abstractions to TensorFlow: - tf.data.Dataset - tf.data.Iterator
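A small sketch of a tf.data input pipeline, again assuming the TensorFlow 1.x API used elsewhere in these notes; the toy features and labels are made up:

import tensorflow as tf   # assumes TensorFlow 1.x

features = [[1.0], [2.0], [3.0], [4.0]]
labels = [0, 0, 1, 1]

dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
           .shuffle(buffer_size=4)
           .batch(2))

iterator = dataset.make_one_shot_iterator()   # a tf.data.Iterator
next_batch = iterator.get_next()

with tf.Session() as sess:
    print(sess.run(next_batch))   # one (features, labels) batch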

Google Data Studio: How the cache works

There are 2 parts to Data Studio cache: the query cache, and the prefetch cache. *Query cache*: The query cache remembers the queries (requests for data) issued by the components in a report. When a person viewing the report requests the exact same query (i.e., the same dimensions, metrics, filter conditions, and date range) as a previously received query, then the data is served from the cache. If the response can't be served from the query cache, Data Studio next looks to the prefetch cache. *Prefetch cache*: The prefetch cache (A.K.A. the "Smart cache") predicts the data that a component could request by analyzing the dimensions, metrics, filters, and date range properties and controls on the report. Data Studio then stores (prefetches) as much of the data as possible that could be used to answer the predicted queries. When a query can't be answered by the query cache, Data Studio tries to answer it using this prefetched data. If the query can't be answered by the prefetch cache, the data will come from the underlying data set. *Cache refresh and expiration*: Both the query cache and prefetch cache automatically expire periodically (approximately every 12 hours). If you can edit the report, you can refresh both caches at any time by viewing the report and clicking Refresh data.

Cloud Datalab: Pricing

There is no charge for using Google Cloud Datalab. However, you do pay for any Google Cloud Platform resources you use with Cloud Datalab, for example: - *Compute resources*: You incur costs from the time of creation to the time of deletion of the Cloud Datalab VM instance. The default Cloud Datalab VM machine type is n1-standard-1, but you can choose a different machine type. You are also charged for a 20GB Standard Persistent Disk, which is used as a Boot Disk, and a 200GB Standard Persistent Disk, where user notebooks are stored. >>> The 20GB boot disk is deleted when the VM instance is deleted, but the 200GB disk remains after the deletion of the VM until you delete it. - *Storage resources*: Notebooks are saved to Persistent Disk and backed up to Google Cloud Storage - *Data Analysis Services*: You incur Google BigQuery costs when issuing SQL queries within Cloud Datalab notebooks. Also, when you use Google Cloud Machine Learning, you may incur Cloud Machine Learning Engine and/or Google Cloud Dataflow charges.

Cloud Storage: Key Terms: Namespace

There is only *one* Cloud Storage namespace, which means every bucket must have a unique name across the entire Cloud Storage namespace. Object names must be unique only within a given bucket.

Global versus regional load balancing

Use global load balancing when your users and instances are globally distributed, your users need access to the same applications and content, and you want to provide access using a single anycast IP address. Global load balancing can also provide IPv6 termination. Use regional load balancing when your users and instances are concentrated in one region and you only require IPv4 termination. Global load balancing requires that you use the Premium Tier of Network Service Tiers. For regional load balancing, you can use Standard Tier.

GCE: Storage: Cloud Storage Buckets

Use when latency and throughput are not a priority and when you must share data easily between multiple instances or zones. - Flexible, scalable, durable - infinite size possible - Performance depends on storage class.

Google App Engine: Flexible environment

Using the App Engine flexible environment means that your application instances run within Docker containers on Google Compute Engine virtual machines (VMs). Optimal for applications with the following characteristics: - Source code that is written in a version of any of the supported programming languages: Python, Java, Node.js, Go, Ruby, PHP, or .NET - Runs in a Docker container that includes a custom runtime or source code written in *other programming languages*. - *Depends on other software, including operating system packages* such as imagemagick, ffmpeg, libgit2, or others through apt-get. - Uses or depends on frameworks that include *native code*. - Accesses the resources or services of your Cloud Platform project that reside in the *Compute Engine network*.

Dataproc: Initialisation Actions

When creating a Cloud Dataproc cluster, you can specify initialization actions in executables or scripts that Cloud Dataproc will run on all nodes in your Cloud Dataproc cluster immediately after the cluster is set up. Initialization actions often set up job dependencies, such as installing Python packages, so that jobs can be submitted to the cluster without having to install dependencies when the jobs are run. - Can specify scripts to be run from GitHub or Cloud Storage - Can do so via GCP console, gcloud CLI or programmatically - Run as root (i.e. no sudo required) - Use absolute paths - Use shebang line to indicate script interpreter

Google App Engine: Flexible environment vs Compute Engine

While the flexible environment runs services in instances on Compute Engine VMs, the flexible environment differs from Compute Engine in the following ways: - The VM instances used in the flexible environment are restarted on a weekly basis. During restarts, Google's management services apply any necessary operating system and security updates. - You always have root access to Compute Engine VM instances. By default, SSH access to the VM instances in the flexible environment is disabled. If you choose, you can enable root access to your app's VM instances. - The geographical region of the VM instances used in the flexible environment is determined by the location that you specify for the *App Engine application* of your GCP project. Google's management services ensure that the VM instances are co-located for optimal performance.

DataStore: Locations

You can store your Cloud Datastore data in either a multi-region location or a regional location. Data in a *multi-region* location operates in a multi-zone and multi-region replicated configuration. Select a multi-region location if you want to maximize the availability and durability of your database. Data in a *regional* location operates in a multi-zone replicated configuration. Select a regional location if your application is more sensitive to write latency or if you want co-location with other Google Cloud Platform resources that your application may use.

BigQuery Commands: Load data from storage

bq load --source_format=CSV babynames.babynames_2011 gs://<bucket-name>/babynames/yob2011.txt name:string,gender:string,count:integer Can use a wildcard (*) in the file names, e.g. yob20*.txt.

HBase: 4-dimensional Data Model

Row Key - Column Family - Column - Timestamp Example: Row Key = Employee ID; Column Families = Work, Personal; Columns = Dept, Grade, Title (Work) and Name, SSN (Personal)

ML: Regression vs. classification

A *regression* model predicts continuous values. For example, regression models make predictions that answer questions like the following: - What is the value of a house in California? - What is the probability that a user will click on this ad? A *classification* model predicts discrete values. For example, classification models make predictions that answer questions like the following: - Is a given email message spam or not spam? - Is this an image of a dog, a cat, or a hamster?

TensorFlow: tf.Tensor

A *tf.Tensor* object represents a partially defined computation that will eventually produce a value. TensorFlow programs work by first building a graph of tf.Tensor objects, detailing how each tensor is computed based on the other available tensors and then by running parts of this graph to achieve the desired results. A tf.Tensor has the following properties: - data type (float32, int32, or string, for example) - shape

Pig on Other Technologies

*Apache Tez*: - *Tez* is an extensible framework which *improves on MapReduce* by making its operations *faster*. *Apache Spark*: - *Spark* is another *distributed computing technology* which is scalable, flexible and fast.

Google Compute Choices

*App Engine*: - A flexible, zero ops (serverless!) platform for building highly available apps - Focus on writing code, no need to concern about server, cluster, OS, or other infrastructure. - Support for several languages... or bring your own app runtime. - Use cases: >> Web sites; Mobile app and gaming backends >> RESTful APIs >> Internet of things (IoT) apps. *Container Engine*: - Logical infra powered by Kubernetes, the open source container orchestration system. - Increase velocity and improve operability by separating the app from the OS. - Don't have dependencies on a specific operating system & run the application anywhere. - Use cases >> Containerized workloads >> Cloud-native distributed systems >> Hybrid applications. *Compute Engine*: - Virtual machines running in Google's global data center network. - Gives complete control over infra and direct access to high-performance hardware (GPUs and local SSDs). - Need to make OS-level changes, and necessary drivers for optimal performance. - Direct access to GPUs that you can use to accelerate specific workloads. - Use cases >> Any workload requiring a specific OS or OS configuration >> Currently deployed, on-premises software that you want to run in the cloud. >> Anything which can't be containerised easily; or need existing VM images

VPC: Types of subnets

*Auto Mode*: Automatically sets up a single subnet in each region - can manually create more subnets. This is *default*. *Custom Mode*: No subnets are set up by default, we have to manually configure all subnets You can switch a network from auto mode to custom mode. This conversion is *one-way*; custom mode networks cannot be changed to auto networks.

Batch vs. Stream Processing

*Batch*: - Bounded, finite datasets - Slow pipeline from data ingestion to analysis - Periodic updates as jobs complete - Order of data received unimportant - Single global state of the world at any point in time *Stream*: - Unbounded, infinite datasets - Processing immediate, as data is received - Continuous updates as jobs run constantly - Order important, out of order arrival tracked - No global state, only history of events received

BigQuery: Performance: Optimizing Query Computation

*Best practice:* - Avoid using JavaScript user-defined functions. Calling a JavaScript UDF requires the instantiation of a Java subprocess. Spinning up this process and running the UDF directly impacts query performance. If possible, use a native (SQL) UDF instead. - If the SQL aggregation function you're using has an equivalent approximation function, the approximation function will yield faster query performance. For example, instead of using COUNT(DISTINCT), use APPROX_COUNT_DISTINCT(). - Use ORDER BY only in the outermost query or within window clauses (analytic functions). Push complex operations to the end of the query. - For queries that join data from multiple tables, optimize your join patterns. Start with the largest table. - When you query partitioned tables, use the _PARTITIONTIME pseudo column. Filtering the data using _PARTITIONTIME allows you to specify a date or range of dates.

BigQuery: Performance: Avoiding SQL Anti-Patterns

*Best practice:* - Typically, self-joins are used to compute row-dependent relationships. The result of using a self-join is that it potentially doubles the number of output rows. This increase in output data can cause poor performance. Instead of using a self-join, use a *window (analytic) function* to reduce the number of additional bytes that are generated by the query. - Partition skew, sometimes called data skew, is when data is partitioned into very unequally sized partitions. This creates an imbalance in the amount of data sent between slots. You can't share partitions between slots, so if one partition is especially large, it can slow down, or even crash the slot that processes the oversized partition. - Cross joins are queries where each row from the first table is joined to every row in the second table (there are non-unique keys on both sides). The worst case output is the number of rows in the left table multiplied by the number of rows in the right table. In extreme cases, the query might not finish. - Using point-specific DML statements is an attempt to treat BigQuery like an Online Transaction Processing (OLTP) system. BigQuery focuses on Online Analytical Processing (OLAP) by using table scans and not point lookups. If you need OLTP-like behavior (single-row updates or inserts), consider a database designed to support OLTP use cases such as Google Cloud SQL.

Compute Instance Creation Example

*CLI:* $ gcloud compute instances create example-instance-1 example-instance-2 example-instance-3 --zone us-central1-a To create an instance with the latest Red Hat Enterprise Linux 7 image available, run: $ gcloud compute instances create example-instance --image-family rhel-7 --image-project rhel-cloud --zone us-central1-a *API Method:* instances.insert

BigTable: What it's good for

- *Time-series data*, such as CPU and memory usage over time for multiple servers. - *Marketing data*, such as purchase histories and customer preferences. - *Financial data*, such as transaction histories, stock prices, and currency exchange rates. - *Internet of Things data*, such as usage reports from energy meters and home appliances. - *Graph data*, such as information about how users are connected to one another.

Cloud Dataproc vs Cloud Dataflow

*Cloud Dataproc:* Cloud Dataproc is good for environments dependent on specific components of the Apache big data ecosystem: - Tools/packages - Pipelines - Skill sets of existing resources *Cloud Dataflow:* Cloud Dataflow is typically the preferred option for greenfield environments: - Less operational overhead - Unified approach to development of batch or streaming pipelines - Uses Apache Beam - Supports pipeline portability across Cloud Dataflow, Apache Spark, and Apache Flink as runtimes

Cloud Dataflow vs. Cloud Dataproc: Which should you use?

*Cloud Dataproc:* If you have dependencies on specific tools/packages in the Apache Hadoop/Spark ecosystem or if you favour a hands-on/DevOps approach to operations. *Cloud Dataflow:* If you don't have any dependencies on the Hadoop/Spark ecosystem and favour a hands-off/serverless approach.

VM: Image Lifecycle

*DEPRECATED*: Images that are no longer the latest, but can still be launched by users. Users will see a warning at launch that they are no longer using the most recent image. *OBSOLETE*: Images that should not be launched by users or automation. An attempt to create an instance from these images will fail. You can use this image state to archive images so their data is still available when mounted as a non-boot disk. *DELETED*: Images that have already been deleted or are marked for deletion in the future. These cannot be launched, and you should delete them as soon as possible.

Google Data Studio: Dimensions and Metrics

*Dimensions describe. Metrics measure.* *Dimensions* are data categories. Dimension values are names, descriptions or other characteristics of a category. *Metrics* measure the things contained in dimensions. In Data Studio, metrics values are always aggregated: given metric X, the value can be a sum, a count, a ratio of X, etc. Metrics are always numbers. Dimensions can be any other kind of data, including unaggregated numbers, dates, text, and boolean (true/false) values.

VPC: IP Addresses: Ephemeral vs Static

*Ephemeral*: - Available only till the VM is stopped, restarted or terminated - No distinction between regional and global IP addresses *Static*: - Permanently assigned to a project and available till explicitly detached - Regional or global resources -- Regional: Allows resource of the region to use the address -- Global: Used only for global forwarding rules in global load balancing - Unassigned static IPs incur a cost

HBase: Filters

*Filters* allow you to control what data is returned from a scan operation. Built-in Filters: - Conditions on row keys - Conditions on columns - Multiple conditions on columns - Timestamp range

Hive Metastore

*Hive Metastore*: The bridge between data stored in files and the tables exposed to users - Stores *metadata* for all the tables in Hive - *Maps* the files and directories in Hive to tables - Holds *table definitions* and the *schema* for each table - Has information on *converting* files to table representations

HiveQL

*Hive Query Language*: A SQL-like interface to the underlying data - Modeled on the Structured Query Language (SQL) - Familiar to analysts and engineers - Simple query constructs -- select -- group by -- join - Hive exposes files in HDFS in the form of tables to the user - Write SQL-like query in HiveQL and submit it to Hive - Hive will translate the query to MapReduce tasks and run them on Hadoop - MapReduce will process files on HDFS and return results to Hive

Hive vs. RDBMS

*Hive:* Large datasets - Gigabytes or petabytes - Calculating trends Parallel computations - Distributed system with multiple machines - Semi-structured data files partitioned across machines - Disk space cheap, can add space by adding machines High latency - Records not indexed, cannot be accessed quickly - Fetching a row will run a MapReduce that might take minutes Read operations - Not the owner of data - Schema-on-read Not ACID compliant by default - Data can be dumped into Hive tables from any source HiveQL *RDBMS:* Small datasets - Megabytes or gigabytes - Accessing and updating individual records Serial computations - Single computer with backup - Structured data in tables on one machine - Disk space expensive on a single machine Low latency - Records indexed, can be accessed and updated fast - Queries can be answered in milliseconds or microseconds Read/write operations - Sole gatekeeper for data - Schema-on-write ACID compliant - Only data which satisfies constraints are stored in the database SQL

Hive vs RDBMS (cont)

*Hive:* - Schema on read, no constraints enforced - Minimal index support - Row level updates, deletes as a special case - Many more built-in functions - Only equi-joins allowed - Restricted subqueries *RDBMS:* - Schema on write - keys, not null, unique all enforced - Indexes allowed - Row level operations allowed in general - Basic built-in functions - No restriction on joins - Whole range of subqueries

Identity and Access Management (IAM)

*Identities:* - End-user (Google) account - Service account - Google group - G-Suite domain - Cloud Identity domain - allUsers, allAuthenticatedUsers *Roles:* - lots of granular roles - per resource *Resources:* - Projects - Compute Engine instances - Cloud Storage buckets *Policy:* - Associate identities with roles

Cloud Storage: Bucket Storage Classes: Regional Storage

- Appropriate for storing data that is used by Compute Engine instances. - Better performance for data-intensive computations, as opposed to storing your data in a multi-regional location

Cloud Spanner: Instances

*Instance configuration*: An instance configuration defines the geographic placement and replication of the databases in that instance. When you create an instance, you must configure it as either regional or multi-region. You make this choice by selecting an instance configuration, which determines where your data is stored for that instance. *Node count*: Your choice of node count determines the amount of serving and storage resources that are available to the databases in that instance. Each node provides up to *2 TiB* of storage. The peak read and write throughput values that nodes can provide depend on the instance configuration, as well as on schema design and data-set characteristics.

VPC: IP Addresses: Internal vs External

*Internal* - Ephemeral, changes every 24 hours or on VM restarts - Allocated from the range of IP addresses available to a subnet to which the resource belongs - VMs know their internal IP - Hostname is mapped to internal IP "instance-1.c.test-project123.internal" - VPC networks automatically resolve internal IP addresses to host names *External* - Can be ephemeral or static - Ephemeral: Allocated from a pool of external IP addresses. - Static: Reserved - charged when not assigned to VM - VMs unaware of external IP - Hosts with external IPs allow connections from outside the VPC - Need to publish public DNS records to point to the instance with the external IP - Can use Cloud DNS

ML: Mean square error (MSE)

*Mean square error (MSE)* is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples: MSE = (1/N) * SUM over (x,y) in D of (y - prediction(x))^2 where: - *(x,y)* is an example in which >>> *x* is the set of features (for example, chirps/minute, age, gender) that the model uses to make predictions. >>> *y* is the example's label (for example, temperature). - *prediction(x)* is a function of the weights and bias in combination with the set of features *x*. - *D* is a data set containing many labeled examples, which are *(x,y)* pairs. - *N* is the number of examples in *D*. Although MSE is commonly used in machine learning, it is neither the only practical loss function nor the best loss function for all circumstances.
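A direct transcription of the formula into Python; the example data and prediction function are hypothetical:

def mean_squared_error(examples, prediction):
    # examples: (x, y) pairs from data set D; prediction: the model's prediction(x)
    losses = [(y - prediction(x)) ** 2 for x, y in examples]
    return sum(losses) / float(len(losses))

mean_squared_error([(1.0, 2.0), (2.0, 4.0)], prediction=lambda x: 1.5 * x + 0.5)   # -> 0.125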

Google Data Studio: Sharing Permissions

*Owner access*: *Is Owner* access means you have complete control over the file. *View access*: *Can view* access lets users see the report as a whole, or see the schema of the data source. View access to a report lets people interact with any filters or date range controls available. It does not let users change the data source or report in any way. *Edit access*: *Can edit* access lets users modify the report or data source. For reports, users can add, change or remove charts and controls. They can add and remove data sources, change the report styling, and set up new filters or modify existing ones. Edit access to a data source lets users modify its schema. They can add or change calculated fields, disable and enable fields, and change data types and field aggregations (when permitted by the data source).

Cloud Storage: Best Practices

*Ramp up request rate gradually* To ensure that Cloud Storage auto-scaling always provides the best performance, you should ramp up your request rate gradually for any bucket that hasn't had a high request rate in several weeks or that has a new range of object keys. If your request rate is less than 1000 write requests per second or 5000 read requests per second, then no ramp-up is needed. If your request rate is expected to go over these thresholds, you should start with a request rate below or near the thresholds and then double the request rate no faster than every 20 minutes. *Use a naming convention that distributes load evenly across key ranges* Auto-scaling of an index range can be slowed when using sequential names, such as object keys based on a sequence of numbers or timestamp. This occurs because requests are constantly shifting to a new index range, making redistributing the load harder and less effective. In order to maintain a high request rate, avoid using sequential names. Using completely random object names will give you the best load distribution. If you want to use sequential numbers or timestamps as part of your object names, introduce randomness to the object names by adding a hash value before the sequence number or timestamp. *Reorder bulk operations to distribute load evenly across key ranges* Even if you are not able to choose the object names, you can control the order in which the objects are uploaded or deleted to achieve the same effect as using random names. If you have many folders and many files under each folder to upload, a good strategy is to upload from multiple folders in parallel and randomly choose which folders and files are uploaded. Doing so allows the system to distribute the load more evenly across entire key ranges, which allows you to achieve a high request rate after the initial ramp-up.
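A minimal sketch in Python of introducing randomness into sequential object names, as the naming-convention advice above suggests; the name layout and prefix length are illustrative choices:

import hashlib

def distributed_object_name(timestamp, filename):
    # Prepend a short hash so uploads with sequential timestamps spread
    # across the bucket's key ranges instead of hitting one index range.
    digest = hashlib.md5("{}/{}".format(timestamp, filename).encode()).hexdigest()
    return "{}/{}/{}".format(digest[:6], timestamp, filename)

distributed_object_name("20180611-120000", "log.txt")   # -> '<6-char-hash>/20180611-120000/log.txt'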

Cloud Spanner: Types of replicas

*Read-write*: Read-write replicas support both reads and writes. - Maintain a full copy of your data. - Serve reads. - Can vote whether to commit a write. - Participate in leadership election. - Are eligible to become a leader. - Are the only type used in single-region instances. *Read-only*: Read-only replicas only support reads (not writes). - Are only used in multi-region instances. - Maintain a full copy of your data, which is replicated from read-write replicas. - Serve reads. - Do not participate in voting to commit writes. - Can usually serve stale reads without needing a round-trip to the default leader region. Strong reads may require a round-trip to the leader replica. - Are not eligible to become a leader. *Witness*: Witness replicas don't support reads but do participate in voting to commit writes. These replicas: - Are only used in multi-region instances. - Do not maintain a full copy of data. - Do not serve reads. - Vote whether to commit writes. - Participate in leader election but are not eligible to become leader.

Cloud Spanner: Hotspotting

- As in HBase - need to choose the primary key carefully - Do not use monotonically increasing values, else writes will land on the same locations - hotspotting - Use a hash of the key value if your keys are naturally monotonically ordered - Under the hood, Cloud Spanner divides data among servers across key ranges

Crypto Key Rotation

*Rotation in Cloud KMS:* - Rotating a key means generating a new CryptoKeyVersion of a CryptoKey and marking that version as the primary. - The primary CryptoKeyVersion is the one used to encrypt data from that point on. *Frequency of key rotation:* - Encryption keys can be rotated regularly or irregularly. - Regular rotation limits the amount of data encrypted with any single key version. - Irregular rotation is used to restrict access to data, for example after a suspected compromise. *Automatic rotation:* - A CryptoKey rotation schedule can be set using the gcloud command-line tool or via the Google Cloud Platform Console. - A rotation schedule is defined by a rotation period and a next rotation time. *Manual rotation:* - Used for irregular key rotation - Performed using the gcloud command-line tool or via the Cloud Platform Console.

VPC: Service Accounts vs. Tags

*Service Accounts:* - Represents the identity that the instance runs with. - An instance can have just one service account - Restricted by IAM permissions; permission to start an instance with a service account has to be explicitly given - Changing a service account requires stopping and restarting an instance *Tags:* - Logically group resources for billing or applying firewalls - An instance can have any number of tags - Tags can be changed by any user who can edit an instance - Changing tags is a metadata update and is a much lighter operation Prefer service accounts to tags for grouping instances so that firewall rules can be applied.

Google AppEngine Environments

*Standard*: Pre-configured with: Java 7, Python 2.7, Go, PHP *Flexible*: More choices: Java 8, Python 3.x, .NET - Serverless! - Instance classes determine price, billing - Laundry list of services - pay for what you use

VM: Startup Scripts vs. Baking

*Startup Scripts:* - Longer for the instance to be ready - Startup scripts might fail and have to be idempotent - Rollback has to be handled for applications and image separately - The script will need to install dependencies during application deployment - Each deployment might reference different versions if the latest version of the software has changed *Baking:* - Much faster to go from boot to application readiness - Much more reliable for application deployments - Version management is easier, easier to roll back to previous versions - Fewer external dependencies during application bootstrap - Scaling up creates instances with identical software versions

GCE: Preemptible Instances: Termination Steps

*Step 1*: Compute Engine sends a Soft Off signal *Step 2*: Shutdown script should clean up and give up control within 30 seconds. *Step 3*: If not, Compute Engine sends a Mechanical Off signal. *Step 4*: Compute Engine transitions to Terminated state

Cloud Storage: Consistency

*Strongly consistent operations:* - Read-after-write - Read-after-metadata-update - Read-after-delete - Bucket listing - Object listing - Granting access to resources *Eventually consistent operations:* - Revoking access from resources *Cache control and consistency* Cached objects that are publicly readable might not exhibit strong consistency. If you allow an object to be cached, and the object is in the cache when it is updated or deleted, the cached object is not updated or deleted until its cache lifetime expires. The cache lifetime of an object is defined by the *Cache-Control metadata* associated with the object.

Traditional RDBMS vs. HBase

*Traditional RDBMS*: - Data arranged in rows and columns - Supports SQL - Complex queries such as grouping, aggregates, joins etc - Normalized storage to minimize redundancy and optimize space - ACID compliant *HBase*: - Data arranged in a column-wise manner - NoSQL database - Only basic operations such as create, read, update and delete - Denormalized storage to minimize disk seeks - ACID compliant at the row level

Denormalized storage

*Traditional* databases use normalized forms of database design to *minimize redundancy*. In normalized form, data is made more granular by splitting it across multiple tables, which optimizes storage space. In distributed systems, however, network bandwidth is more costly than storage; storage is cheap. The goal instead is to minimize the number of disk seeks: with denormalized storage, a single read of one record returns all details about an employee in one operation.

DataFlow (Apache Beam): Typical Beam Driver Program

- *Create a Pipeline* object - Create an initial PCollection for pipeline data -- Source API to read data from an external source -- Create transform to build a PCollection from in-memory data. - *Define the Pipeline* transforms to change, filter, group, analyse PCollections -- transforms do not change input collection - *Output the final*, transformed PCollection(s), typically using the Sink API to write data to an external source. - *Run the pipeline* using the designated Pipeline Runner.
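A minimal word-count driver using the Apache Beam Python SDK might follow these steps as in the sketch below; the bucket paths are placeholders and the runner is whatever the pipeline options select:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# 1. Create the Pipeline object (runner chosen via options, e.g. DataflowRunner).
options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    # 2. Create an initial PCollection from an external source.
    lines = p | "Read" >> beam.io.ReadFromText("gs://example-bucket/input.txt")

    # 3. Apply transforms; each transform produces a new PCollection,
    #    the input PCollection is never modified.
    counts = (
        lines
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "CountPerWord" >> beam.CombinePerKey(sum)
    )

    # 4. Output the final PCollection using a sink.
    counts | "Write" >> beam.io.WriteToText("gs://example-bucket/output")
# 5. The pipeline runs on the designated runner when the 'with' block exits.
```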

ML: Empirical Risk Minimization

*Training* a model simply means learning (determining) good values for all the weights and the bias from labeled examples. In supervised learning, a machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss; this process is called *empirical risk minimization*. Loss is the penalty for a bad prediction. That is, *loss* is a number indicating how bad the model's prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples.
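As a toy illustration (using squared error as the per-example loss, which is just one possible choice), empirical risk is the average loss over the labeled examples:

```python
def squared_loss(prediction, label):
    """L2 loss for a single example: (prediction - label)^2."""
    return (prediction - label) ** 2

def empirical_risk(model, examples):
    """Average loss over all labeled examples; training searches for the
    weights and biases that minimize this quantity."""
    losses = [squared_loss(model(x), y) for x, y in examples]
    return sum(losses) / len(losses)

# Hypothetical linear model y = 2x evaluated on toy labeled data.
model = lambda x: 2 * x
print(empirical_risk(model, [(1, 2.1), (2, 3.8), (3, 6.3)]))
```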

BigQuery & MapReduce Selection Criteria

*Use BigQuery*: - Finding particular records with specified conditions. For example, to find request logs with a specified account ID. - Quick aggregation of statistics with dynamically-changing conditions. For example, getting a summary of request traffic volume from the previous night for a web application and drawing a graph from it. - Trial-and-error data analysis. For example, identifying the cause of trouble and aggregating values by various conditions, including by hour, day, etc. *Use MapReduce*: - Executing complex data mining on Big Data which requires multiple iterations and paths of data processing with programmed algorithms. - Executing large join operations across huge datasets. - Exporting large amounts of data after processing.

Google Data Studio: Introduction

*Visualize your data* Data Studio turns your data into informative, easy to read, easy to share, and fully customizable dashboards and reports. Use the drag and drop report editor to create charts, apply filters, color themes, etc. *Connect to your data* Reports in Data Studio get their information from one or more data sources. Using the Data Sources tool, you can easily connect to wide variety of data, without programming. *Share and collaborate* Data Studio reports and data sources are stored as files on Google Drive. Just as in Drive, it's easy to share your files with individuals, teams, or the world. To tell your data stories as broadly as possible, embed your reports in other pages, such as Google Sites, blog posts, marketing articles, and annual reports.

ML: mini-batch stochastic gradient descent (SGD)

*mini-batch*: A small, randomly selected subset of the entire batch of examples run together in a single iteration of training or inference. The batch size of a mini-batch is usually between 10 and 1,000. It is much more efficient to calculate the loss on a mini-batch than on the full training data. *Mini-batch stochastic gradient descent*: A *gradient descent* algorithm that uses *mini-batches*. In other words, mini-batch SGD estimates the gradient based on a small subset of the training data. *Vanilla SGD* uses a mini-batch of size 1.
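A minimal NumPy sketch of mini-batch SGD for linear regression with squared loss (setting batch_size=1 reduces to the "vanilla" SGD mentioned above); all names and data are illustrative:

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=32, learning_rate=0.01, epochs=10):
    """Sketch of mini-batch SGD for linear regression with squared loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)            # shuffle example order
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]   # one mini-batch
            err = X[idx] @ w + b - y[idx]           # prediction error on the batch
            # Gradient of mean squared error, estimated on the mini-batch only.
            w -= learning_rate * 2 * X[idx].T @ err / len(idx)
            b -= learning_rate * 2 * err.mean()
    return w, b

# Toy usage with synthetic data.
X = np.random.randn(1000, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3
print(minibatch_sgd(X, y))
```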

ML: parameter & hyperparameter

*parameter*: A variable of a model that the ML system trains on its own. For example, weights are parameters whose values the ML system gradually learns through successive training iterations. *hyperparameter*: The "knobs" that you tweak during successive runs of training a model. For example, learning rate is a hyperparameter.

VM: Cloud Platform Free Tier

- *1 f1-micro VM instance* per month (US regions, excluding Northern Virginia). - *30 GB of Standard persistent disk* storage per month. - *5 GB of snapshot storage* per month. - *1 GB* egress from North America to other destinations per month (excluding Australia and China).

Datastore Features

- *Atomic transactions:* can execute a set of operations where either all succeed, or none occur. - *High availability of reads and writes.* - *Massive scalability with high performance:* uses a distributed architecture to automatically manage scaling. Uses a mix of indexes and query constraints so your queries scale with the *size of your result set*, not the size of your data set. - *Flexible storage and querying of data:* maps naturally to object-oriented and scripting languages, and is exposed to applications through multiple clients. It also provides a SQL-like query language. - *Balance of strong and eventual consistency:* ensures that entity lookups by key and ancestor queries always receive strongly consistent data. All other queries are eventually consistent. - *Encryption at rest:* automatically encrypts all data before it is written to disk and automatically decrypts the data when read by an authorized user. - *Fully managed with no planned downtime.*

BigQuery Best Practices: Controlling Costs

- Avoid SELECT *. Query only the columns that you need. - Sample data using preview options. Don't run queries to explore or preview table data. - Price your queries before running them. Before running queries, preview them to estimate costs. - Limit query costs by restricting the number of bytes billed. Best practice: Use the maximum bytes billed setting to limit query costs. - LIMIT doesn't affect cost. Best practice: Do not use a LIMIT clause as a method of cost control. - Partition data by date. If possible, partition your BigQuery tables by date. Partitioning your tables allows you to query relevant subsets of data, which improves performance and reduces costs. - Materialize query results in stages. If you create a large, multi-stage query, each time you run it, BigQuery reads all the data that is required by the query. You are billed for all the data that is read each time the query is run. - Keeping large result sets in BigQuery storage has a cost. If you don't need permanent access to the results, use the default table expiration to automatically delete the data for you. - There is no charge for loading data into BigQuery. There is a charge, however, for streaming data into BigQuery. Unless your data must be immediately available, load your data rather than streaming it.
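Two of these practices, pricing a query with a dry run and capping the bytes billed, can be combined with the BigQuery Python client roughly as follows; the project/credentials and the public-dataset query are assumptions:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials and project
sql = "SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` LIMIT 10"

# Dry run: estimates the bytes processed without executing or billing the query.
dry_cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
dry_job = client.query(sql, job_config=dry_cfg)
print("Would process {} bytes".format(dry_job.total_bytes_processed))

# Real run, but fail the job if it would bill more than ~100 MB.
run_cfg = bigquery.QueryJobConfig(maximum_bytes_billed=100 * 1024 * 1024)
for row in client.query(sql, job_config=run_cfg).result():
    print(row.name)
```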

Bounded vs Unbounded Datasets

- *Bounded* datasets are processed in batches - *Unbounded* datasets are processed as streams

MIG: Configuring Health Checks

- *Check Interval*: The time to wait between attempts to check instance health - *Timeout*: The length of time to wait for a response before declaring check attempt failed - *Health Threshold*: How many consecutive "healthy" responses indicate that the VM is healthy - *Unhealthy Threshold*: How many consecutive "failed" responses indicate VM is unhealthy

TensorFlow: Layers

- *Estimator (tf.estimator)*: High-level, OOP API. - *tf.layers/tf.losses/tf.metrics*: Libraries for common model components. - *TensorFlow*: Lower-level APIs

Apache Beam: When

- Configurable triggering - Event-time triggers - Processing-time triggers - Count triggers - [Meta]data driven triggers - Composite triggers - Allowed latencies - Timers

VPC: Features

- *Global*: Resources from across zones, regions. VPCs are global. Subnets are regional. - *Multi-tenancy*: VPCs can be shared across GCP projects - *Private and secure*: IAM, firewall rules - *Scalable*: Add new VMs, containers to the network, without any workload shutdown or downtime. - A single project has a quota of 5 networks - A single network has a limit of 7000 instances - Within a network the resources communicate with each other often and are *trusted* - Resources in other networks are treated just like *any other external resource* (even if they are in the same project)

Internal Load Balancing: Health Checks

- *HTTP, HTTPS health checks:* These provide the highest fidelity, they verify that the web server is up and serving traffic, not just that the instance is healthy. - *SSL (TLS) health checks:* Configure the SSL health checks if your traffic is not HTTPS but is encrypted via SSL(TLS) - *TCP health checks:* For all TCP traffic that is not HTTP(S) or SSL(TLS), you can configure a TCP health check

Cloud Storage: Access Control Options

- *Identity and Access Management (IAM) permissions:* Grant access to buckets as well as bulk access to a bucket's objects. IAM permissions give you broad control over your projects and buckets, but not fine-grained control over individual objects. - *Access Control Lists (ACLs):* Grant read or write access to users for individual buckets or objects. In most cases, you should use IAM permissions instead of ACLs. Use ACLs only when you need fine-grained control over individual objects. - *Signed URLs (query string authentication):* Give time-limited read or write access to an object through a URL you generate. Anyone with whom you share the URL can access the object for the duration of time you specify, regardless of whether or not they have a Google account. - *Signed Policy Documents:* Specify what can be uploaded to a bucket. Policy documents allow greater control over size, content type, and other upload characteristics than signed URLs, and can be used by website owners to allow visitors to upload files to Cloud Storage. - *Firebase Security Rules:* Provide granular, attribute-based access control to mobile and web apps using the Firebase SDKs for Cloud Storage. For example, you can specify who can upload or download objects, how large an object can be, or when an object can be downloaded.
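For instance, a time-limited signed URL can be generated with the Cloud Storage Python client roughly as below; the bucket and object names are hypothetical, and the client needs credentials that can sign (e.g. a service account key):

```python
from datetime import timedelta
from google.cloud import storage

client = storage.Client()
blob = client.bucket("example-bucket").blob("reports/summary.pdf")

# Anyone with this URL can GET the object for the next hour,
# whether or not they have a Google account.
url = blob.generate_signed_url(expiration=timedelta(hours=1), method="GET")
print(url)
```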

Load Balancing: Backends

- *Instance Group*: Can be a managed or unmanaged instance group - *Balancing Mode*: Determines when the backend is at full usage -- CPU utilization, Requests per second - *Capacity Setting*: A % of the balancing mode which determines the capacity of the backend

Cloud Dataprep

- *Instant data exploration*: Visually explore and interact with data in seconds. Instantly understand data distribution and patterns. You don't need to write code. You can prepare data with a few clicks. - *Intelligent data cleansing*: Cloud Dataprep automatically identifies data anomalies and helps you to take corrective action fast. Get data transformation suggestions based on your usage pattern. Standardize, structure, and join datasets easily with a guided approach. - *Serverless*: Cloud Dataprep is a serverless service, so you do not need to create or manage infrastructure. This helps you to keep your focus on the data preparation and analysis. - *Seriously powerful*: Cloud Dataprep is built on top of the powerful Cloud Dataflow service. Cloud Dataprep is auto-scalable and can easily handle processing massive data sets. - *Supports common data sources of any size*: Process diverse datasets — structured and unstructured. Transform data stored in CSV, JSON, or relational table formats. Prepare datasets of any size, megabytes to terabytes, with equal ease. - *Integrated with Google Cloud Platform*: Easily process data stored in Cloud Storage, BigQuery, or from your desktop. Export clean data directly into BigQuery for further analysis. Seamlessly manage user access and data security with Cloud Identity and Access Management.

Autoscaling

- *Managed instance groups* automatically add or remove instances based on increases and decreases in load - Helps your applications *gracefully handle increases in traffic* - *Reduces cost* when load is lower - Define autoscaling policy, the autoscaler takes care of the rest For *GKE* groups autoscaling is different, called *Cluster Autoscaling* - Autoscaling Policy - Target Utilization Level

Internal Load Balancing

- *Private load balancing IP address* that only your VPC instances can access - VPC traffic stays *internal* - less latency, more security - No public IP address needed - Useful to balance requests from your *frontend instances to your backend instances*

Pub/Sub: Basics

- *Publisher* apps create and send messages on a *Topic* -- *Messages* persisted in a message store until delivered/acknowledged -- One queue per subscription - *Subscriber* apps subscribe to a topic to receive messages -- *Push* - WebHook endpoint -- *Pull* - HTTPS request to endpoint - Subscription is a queue (message stream) to a subscriber - Message = data + attributes sent by publisher to a topic - Message Attributes = key-value pairs sent by publisher with message - Once acknowledged by subscriber, message deleted from queue
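A minimal publish/pull sketch with the Pub/Sub Python client, assuming a hypothetical project, topic and subscription that already exist:

```python
from google.cloud import pubsub_v1

project_id, topic_id, sub_id = "example-project", "example-topic", "example-sub"

# Publisher: send a message (data bytes + optional attributes) to a topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)
future = publisher.publish(topic_path, b"payload bytes", origin="sensor-42")
print("Published message id:", future.result())

# Subscriber (pull-style): receive and acknowledge messages from a subscription.
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(project_id, sub_id)

def callback(message):
    print("Received:", message.data, message.attributes)
    message.ack()  # acknowledged messages are removed from the subscription queue

streaming_pull = subscriber.subscribe(sub_path, callback=callback)
# streaming_pull.result() would block here and keep receiving messages.
```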

Cloud Storage: Data Encryption Options

- *Server-side encryption:* encryption that occurs after Cloud Storage receives your data, but before the data is written to disk and stored. >>> *Google-managed encryption keys:* Cloud Storage uses its server-side encryption keys to encrypt your data. This is the default for Cloud Storage encryption. >>> *Customer-supplied encryption keys:* You can create and manage your own encryption keys for server-side encryption, which replace the Google-managed encryption keys. >>> *Customer-managed encryption keys:* You can generate and manage your encryption keys using Cloud Key Management Service. These replace the Google-managed encryption keys. - *Client-side encryption:* encryption that occurs before data is sent to Cloud Storage. Such data arrives at Cloud Storage already encrypted but also undergoes server-side encryption.

Storage Options for Compute

- *Standard persistent disks*: Efficient and reliable block storage. - *Regional persistent disks*: Efficient and reliable block storage with synchronous replication across two zones in a region. - *SSD persistent disks*: Fast and reliable block storage. - *Regional SSD persistent disks*: Fast and reliable block storage with synchronous replication across two zones in a region. - *Local SSD*: High performance local block storage. - *Cloud Storage Buckets*: Affordable object storage.

Cloud Storage: Lifecycle Management

- Assign a lifecycle management configuration to a bucket; the configuration applies to current and future objects in the bucket. - Each lifecycle management configuration contains a set of rules. When defining a rule, you can specify any set of conditions for any action. - Each rule should contain only one action.
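A lifecycle configuration with one action per rule could be set with the Cloud Storage Python client roughly as follows; the bucket name and the 30/365-day conditions are assumptions:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("example-bucket")

# Rule 1: move objects to Nearline storage once they are 30 days old.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
# Rule 2: delete objects once they are 365 days old.
bucket.add_lifecycle_delete_rule(age=365)

bucket.patch()  # persist the lifecycle configuration on the bucket
```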

VPC: Firewall Rules for the "default" network

- *default-allow-internal*: Allows ingress network connections of any protocol and port between VM instances on the network - *default-allow-ssh*: Allows ingress TCP connections from any source to any instance on the network over port 22 - *default-allow-icmp*: Allows ingress ICMP traffic from any source to any instance on the network. - *default-allow-rdp*: Allows ingress remote desktop protocol traffic to TCP port 3389.

gsutil or Transfer Service?

- *gsutil* can be used to get data into cloud storage buckets - Prefer the transfer service when transferring from AWS, etc - If copying files over from on-premise, use gsutil

VPC: What is a route made of?

- *name*: User-friendly name - *network*: The name of the network to which this route applies - *destRange*: The destination IP range that this route applies to - *instanceTags*: Instance tags that this route applies to, applies to all instances if empty - *priority*: Used to break ties in case of multiple matches and one of: - *nextHopInstance*: Fully qualified URL. Instance must already exist - *nextHopIp*: The IP address - *nextHopNetwork*: URL of network - *nextHopGateway*: URL of gateway - *nextHopVpnTunnel*: URL of VPN tunnel

GCE: Standard Machine Types

- 3.75 GB memory per vCPU - naming: n1-standard-<1,2,4,8,16,32,64,96 vCPUs> - Fixed at 16 persistent disks, 64TB total size.

GCE: High-memory Machine Types

- 6.5 GB memory per vCPU - naming: n1-highmem-<2,4,8,16,32,64,96 vCPUs>. Total RAM = 6.5 x vCPU count. - Fixed at 16 persistent disks, 64TB total size.

TensorFlow: Graph

- A *TensorFlow graph* (also known as a *computational graph* or a *dataflow graph*) is a graph data structure. - A graph's nodes are operations (in TensorFlow, every operation is associated with a graph). - Many TensorFlow programs consist of a single graph, but TensorFlow programs may optionally create multiple graphs. - A graph's *nodes are operations*; a graph's *edges are tensors*. - Tensors flow through the graph, manipulated at each node by an operation. The output tensor of one operation often becomes the input tensor to a subsequent operation. - TensorFlow implements a lazy execution model, meaning that nodes are only computed when needed, based on the needs of associated nodes.
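A tiny TensorFlow 1.x-style sketch of building a graph and lazily executing it in a session (names and values are arbitrary):

```python
import tensorflow as tf  # TensorFlow 1.x graph-mode API

graph = tf.Graph()
with graph.as_default():
    # Nodes are operations; edges are the tensors flowing between them.
    a = tf.constant(3.0, name="a")
    b = tf.constant(4.0, name="b")
    total = tf.add(a, b, name="total")   # not computed yet (lazy execution)

with tf.Session(graph=graph) as sess:
    print(sess.run(total))  # nodes are evaluated only when the session needs them
```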

VPC: Dynamic Routing for VPN tunnels

- A Cloud Router belongs to a particular network and a particular region - Subnets segmenting the network IP space - Advertises subnet changes using the BGP - Also learns about subnet changes in the on premise network through BGP - The IP address of the Cloud Router and the gateway router should both be link local IP addresses (valid only for communication within the network link)

Cloud Spanner: Data Model

- A Cloud Spanner database can contain one or more tables. - Tables look like relational database tables in that they are structured with rows, columns, and values, and they contain primary keys. - Data in Cloud Spanner is strongly typed: you must define a schema for each database and that schema must specify the data types of each column of each table. Allowable data types include scalar and array types. - You can also define one or more secondary indexes on a table, and interleave tables to model parent-child relationships.

DataFlow (Apache Beam): pCollection

- A PCollection represents a set of data in your pipeline. - The Dataflow PCollection classes are specialized container classes that can represent data sets of virtually unlimited size. - A PCollection can hold a data set of a *fixed size* (such as data from a text file or a BigQuery table), or an *unbounded data* set from a continuously updating data source (such as a subscription from Google Cloud Pub/Sub). PCollections are the inputs and outputs for each step in your pipeline.

CloudSQL: High Availability Configuration

- A Second Generation instance is in a high availability configuration when it has a failover replica - The failover replica must be in a different zone than the original instance, also called the master. - All changes made to the data on the master, including to user tables, are replicated to the failover replica using semisynchronous replication.

VPC: Static Routing for VPN tunnels

- A VPN tunnel connecting a gateway at either end (Google Cloud & Peer network) - A new subnet added to the on premise network - New routes need to be added to the cloud VPC to reach the new subnet - VPN tunnel will need to be torn down and re-established to include the new subnet - Static routes are slow to converge as updates are manual

VPC: Firewall: Implied Rules

- A default "allow egress" rule. -- Allows all egress connections. Rule has a priority of 65535. - A default "deny ingress" rule. -- Deny all ingress connection. Rule has a priority of 65535.

Hadoop Ecosystem: Spark

- A distributed computing engine used along with Hadoop - Interactive shell to quickly process datasets - Has a bunch of built in libraries for machine learning, stream processing, graph processing etc.

VM: Baking

- A more *efficient* way to provision infrastructure - Create a custom image with your *configuration incorporated into the public image*

CloudSQL: Point-in-time-recovery (PITR)

- A point-in-time recovery always creates a new instance; you cannot perform a point-in-time recovery to an existing instance. - The target instance should have the same database version as the instance from which the backup was taken. - You cannot restore an instance using a backup taken in a different GCP project. - If you are restoring to an instance that is in a high availability configuration (it has a failover replica) or to an instance with read replicas, you must delete all replicas and recreate them after the restore operation completes.

Types of Logs

- Audit logs: permanent GCP logs (no retention period) - Admin activity logs: for actions that modify config or metadata - Data access logs: API calls that create modify or read user-provided data - Admin activity logs are always on; data access logs need to be enabled (can be big) - BigQuery data access logs are always on by default

YARN: Scheduling Policies

- FIFO Scheduler - Capacity Scheduler - Fair Scheduler

Pub/Sub: Message Life

- A publisher application creates a topic in the Google Cloud Pub/Sub service and sends messages to the topic. A message contains a payload and optional attributes that describe the payload content. - Messages are persisted in a message store until they are delivered and acknowledged by subscribers. - The Pub/Sub service forwards messages from a topic to all of its subscriptions, individually. - Each subscription receives messages either by Pub/Sub pushing them to the subscriber's chosen endpoint, or by the subscriber pulling them from the service. - The subscriber receives pending messages from its subscription and acknowledges each one to the Pub/Sub service. - When a message is acknowledged by the subscriber, it is removed from the subscription's queue of messages.

Shared VPC: Host and Service projects

- A service project can only be associated with a single host - A project cannot be a host as well as a service project at the same time - Instances in a project can only be assigned external IPs from the same project - Existing projects can use shared VPC networks - Instances on a shared VPC need to be created explicitly for the VPC

DataStore: Designing for scale

- A single entity group in Cloud Datastore should not be updated too rapidly. - Avoid high read or write rates to Cloud Datastore keys that are lexicographically close. - Gradually ramp up traffic to new Cloud Datastore kinds or portions of the keyspace. - Avoid deleting large numbers of Cloud Datastore entities across a small range of keys. - Use sharding or replication for hot Cloud Datastore keys. >>> You can use *replication* if you need to read a portion of the key range at a higher rate than permitted. >>> You can use *sharding* if you need to write to a portion of the key range at a higher rate than permitted.

VPC: Alias IP Ranges

- A single service on a VM requires just one IP address - Multiple services on the same VM may need different IP addresses - Subnets have a primary and secondary CIDR range - Using IP aliasing can set up multiple IP addresses drawn from the primary or secondary CIDR ranges - Multiple containers or services on a VM can have their own IP - VPCs automatically set up routes for the IPs - Containers don't need to do their own routing, simplifies traffic management - Can separate infrastructure from containers (infra will draw from the primary range, containers from the secondary range)

Streams Using Micro-batches

- A stream of integers grouped into batches. - If the batches are small enough... it approximates real-time stream processing. - Exactly-once semantics, replay micro-batches. Each item is processed exactly once. - Latency-throughput trade-off based on batch sizes. Depending on the application, choose a batch size that hits the sweet spot. - Examples: *Spark Streaming*, *Storm Trident*.

VM: Premium Images

- Additional per-second charges, same charges across the world - Red Hat Enterprise Linux, Microsoft Windows - Charges vary based on the machine type used - SQL Server images are charged per minute

YARN: Application Master Process

- All processes on a node are run within a container. - This is the logical unit for the resources the process needs - memory, CPU, etc. - A container executes a specific application - 1 NodeManager can have multiple containers. - The ResourceManager starts off the Application Master within a Container - Performs the computation required for the task - If additional resources are required, the Application Master makes the request

HBase: Column Family

- All rows have the same set of column families - Each column family is stored in a separate data file - Set up at schema definition time - Can have different columns for each row

VM: RAM Disk

- Allocate high-performance memory to use as a disk - A RAM disk has very low latency and high performance - Used when your application expects a file system structure and can't simply store its data in memory - No storage redundancy or flexibility - Shares memory with your applications - Contents persist only as long as the VM is up

Load Balancing: Firewall Rules

- Allow traffic from 130.211.0.0/22 and 35.191.0.0/16 to reach your instances - IP ranges that the *load balancer* and the *health checker* use to connect to backends - Allow traffic on the *port* that the global forwarding rule has been configured to use

Load Balancing: Backend Buckets

- Allow you to use Cloud Storage buckets with HTTP(S) load balancing - Traffic is directed to the bucket instead of a backend - Useful in load balancing requests to *static content*

DataStore: Special query types

- Ancestor queries: An ancestor query limits its results to the specified entity and its descendants. - Kindless queries: A query with no kind and no ancestor retrieves all of the entities of an application from Cloud Datastore. Such kindless queries cannot include filters or sort orders on property values. They can, however, filter on entity keys and use ancestor filters. - Projection queries: Projection queries allow you to query Cloud Datastore for just those specific properties of an entity that you actually need, at lower latency and cost than retrieving the entire entity. A *keys-only* query (which is a type of projection query) returns just the keys of the result entities instead of the entities themselves.
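A hedged sketch of these query types with the Cloud Datastore Python client, assuming a hypothetical "Task" kind:

```python
from google.cloud import datastore

client = datastore.Client()

# Keys-only query: returns only the keys of matching entities.
q = client.query(kind="Task")
q.keys_only()
keys = [entity.key for entity in q.fetch(limit=10)]

# Projection query: fetch only the properties you actually need.
q = client.query(kind="Task")
q.projection = ["priority", "percent_complete"]
for task in q.fetch(limit=10):
    print(task["priority"], task["percent_complete"])

# Ancestor query: limits results to a given entity and its descendants.
ancestor = client.key("TaskList", "default")
q = client.query(kind="Task", ancestor=ancestor)
```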

Hive Metastore: Requirements

- Any database with a JDBC driver can be used as a metastore. - Development environments use the built-in Derby database: *Embedded metastore* - *Same Java process* as Hive itself - *One Hive session* to connect to the database *Production environments*: - *Local metastore*: Allows multiple sessions to connect to Hive - *Remote metastore*: Separate processes for Hive and the metastore

OAuth 2.0

- Application needs to access resources on behalf of a specific user - Application presents consent screen to user; user consents - Application requests credentials from some authorisation server - Application then uses these credentials to access resources Creation: - GCP Console => API Manager => Credentials => Create - Select "OAuth client ID" - Will create OAuth client secret

Cloud SQL: Automatic Storage Increase

- Available storage is checked every 30 seconds. If available storage falls below a threshold size (calculated based on the currently provisioned size), additional storage capacity is automatically added to your instance. - Storage size can be increased, but it can't be decreased.

BigQuery: Schema Auto-Detection

- Available while -- Loading data -- Querying external data - BigQuery selects a random file in the data source and scans up to 100 rows of data to use as a representative sample - Then examines each field and attempts to assign a data type to that field based on the values in the sample

ML: Representation: Qualities of Good Features

- Avoid rarely used discrete feature values. - Prefer clear and obvious meanings. Ex. house age in years as opposed to number of seconds since start. - Don't mix "magic" values with actual data. Ex. instead of using -1 for an invalid value, use a separate flag to indicate whether the value is valid. - Account for upstream instability. The definition of a feature shouldn't change over time. A feature that depends on the output of another model, for example, can change when that model changes; be ready to handle this.

BigQuery: Batch Queries

- BigQuery will schedule these to run whenever possible (idle resources) - Don't count towards limit on concurrent usage - If not started within 24 hours, BigQuery makes them interactive

BigTable: Use for Time Series

- BigTable is a natural fit for Timestamp data (range queries) - Say IOT sensor network emitting data at intervals -- Use *Device ID # Time* as row key if common query = "All data for a device over period of time" -- Use *Time # Device ID* as row key if common query = "All data for a period for all devices"
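A small illustrative sketch of the two row-key layouts; the "#" separator follows the card above, and the zero-padding is an assumption rather than a fixed convention:

```python
def device_first_key(device_id, timestamp):
    """Row key for 'all data for one device over a time range' queries."""
    return "{}#{:0>19}".format(device_id, timestamp)

def time_first_key(device_id, timestamp):
    """Row key for 'all data for all devices in a time range' queries.
    Note: a plain timestamp prefix can hot-spot writes on the newest rows."""
    return "{:0>19}#{}".format(timestamp, device_id)

print(device_first_key("sensor-42", 1514764800))
print(time_first_key("sensor-42", 1514764800))
```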

VM: Image Contents

- Boot loader - Operating system - File system structure - Software - Customizations

Stream-first Arch: Message Transport

- Buffer for event data - Performant and persistent - Decoupling multiple sources from processing - Popular products: *Kafka*, *MapR* streams

Dataproc: Cluster Machine Types

- Built using Compute Engine VM instances - Cluster - Need at least 1 master and 2 workers - Preemptible instances - OK *if* used with care

BigQuery: Data Formats

- CSV - JSON (newline delimited) - Avro (open source data format that bundles serialized data with the data's schema in the same file) - Cloud Datastore backups (BigQuery converts data from each entity in Cloud Datastore backup files to BigQuery's data types)

DataStore: Transaction Support

- Can optionally use transactions - not required - Not as strong as Cloud Spanner (which is ACID++), but stronger than BigQuery or BigTable

Cloud Spanner: Staleness

- Can set *timestamp bounds* - Strong - "read latest data" - Bounded Staleness - "read version no later than ..." - Exact Staleness - "read at exactly ..." -- (could be in past or future) - Cloud Spanner has a version GC that reclaims versions older than 1 hour
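With the Cloud Spanner Python client, these timestamp bounds might be expressed roughly as below; the instance, database, and query are placeholders:

```python
import datetime
from google.cloud import spanner

client = spanner.Client()
database = client.instance("example-instance").database("example-db")

# Strong read (default): always sees the latest committed data.
with database.snapshot() as snap:
    rows = list(snap.execute_sql("SELECT 1"))

# Bounded staleness: read a version no older than 15 seconds.
with database.snapshot(max_staleness=datetime.timedelta(seconds=15)) as snap:
    rows = list(snap.execute_sql("SELECT 1"))

# Exact staleness: read at exactly 'now minus 10 seconds'.
with database.snapshot(exact_staleness=datetime.timedelta(seconds=10)) as snap:
    rows = list(snap.execute_sql("SELECT 1"))
```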

Load Balancing: Backend Service

- Centralized service for managing backends - Backends contain instance groups which handle user requests - Knows which instances it can use, how much traffic they can handle - Monitors the health of backends and does not send traffic to unhealthy instances

BigTable: Overview

- Cloud Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns, enabling you to store terabytes or even petabytes of data. - A single value in each row is indexed; this value is known as the *row key*. - Cloud Bigtable is ideal for storing very large amounts of single-keyed data with very low latency. - It supports high read and write throughput at low latency, and it is an ideal data source for MapReduce operations. Cloud Bigtable is exposed to applications through multiple client libraries, including a supported extension to the Apache HBase library for Java. As a result, it integrates with the existing Apache ecosystem of open-source Big Data software.

DataFlow

- Cloud Dataflow is a unified programming model and a managed service for developing and executing a wide variety of data processing patterns. - Cloud Dataflow includes SDKs for defining data processing workflows, and a Cloud Platform managed service to run those workflows on Google Cloud Platform resources such as Compute Engine, BigQuery, and more. - Used to transform data - Loosely semantically equivalent to Apache Spark - Based on Apache Beam. Dataflow (1.x) was not based on Beam

Cloud Datalab

- Cloud Datalab is packaged as a container and run in a VM (Virtual Machine) instance. - Cloud Datalab uses notebooks instead of the text files containing code. Notebooks bring together code, documentation written as markdown, and the results of code execution—whether as text, image or, HTML/JavaScript. - Cloud Datalab notebooks can be stored in Google Cloud Source Repository, a git repository. This git repository is cloned onto persistent disk attached to the VM. This clone forms your workspace. To share your work with other users, push your changes from this local workspace to the repository. - When the executed code accesses Google Cloud services such as BigQuery or Google Machine Learning Engine, it uses the service account available in the VM. Hence, the service account must be authorized to access the data or request the service. - The VM used for running Cloud Datalab is a shared resource accessible to all the members of the associated cloud project. Therefore, using an individual's personal cloud credentials to access data is strongly discouraged.

Cloud Key Management: Object hierarchy: Project

- Cloud KMS resources belong to a project. - Accounts granted primitive IAM roles on a project also receive the corresponding permissions on the Cloud KMS resources in that project.

CloudSQL: On-demand and Automatic Backups

- Cloud SQL retains up to 7 automated backups for each instance. They are incremental. - Automatic back-ups are automatically deleted when master is deleted. - On-demand backups are not automatically deleted. - Backup data is stored in two regions for redundancy.

Use Case: Storage for Compute, Block Storage along with mobile SDKs

- Cloud Storage for Firebase

HBase: Column

- Columns are units within a column family - New columns can be added on the fly - ColumnFamily:ColumnName = Work:Department

GKE: Advantages

- Componentization - microservices - Portability - Rapid deployment - Orchestration - Kubernetes clusters - Image registration - pull images from container registry - Flexibility - mix-and-match with other cloud providers, on-premise

VPC: Interconnect: VPN

- Connects your on premise network to the Google Cloud VPC or two VPCs. - Offers 99.9% service availability - Traffic is encrypted by one VPN gateway and then decrypted by another VPN gateway - Supports both static and dynamic routes for traffic between on-premise and cloud - Only IPSec gateway to gateway scenarios are supported, does not work with client software on a laptop - Must have a static external IP address - Needs to know what destination IPs are allowed and create routes to forward packets to those IPs - Can have multiple tunnels to a single VPN gateway, site-to-site VPN VPN will have *higher latency* and *lower throughput* as compared with dedicated interconnect and peering options.

Cloud SQL: Concepts

- Cloud SQL is a fully-managed MySQL and PostgreSQL database service. - Fully managed relational databases. - SLA - 99.95% availability. - Cloud SQL - for up to 10TB of storage capacity, 40,000 IOPS, and 416GB of RAM per instance. Anything beyond - use Spanner. - Cloud SQL is SSAE 16, ISO 27001, PCI DSS v3.0, and HIPAA compliant. - Import and export databases using mysqldump, or import and export CSV files. - Data replication between multiple zones with automatic failover. - Automated and on-demand backups, and point-in-time recovery.

Hosting a static site

- Create a cloud storage bucket that uses a domain name. Ex. Bucket reports.example.com for hosting http://reports.example.com. - Domain name ownership verification is required by adding TXT records, HTML tags in header, etc. - Define: >>> MainPageSuffix = "index.html" or "main.html", etc >>> NotFoundPage = "404.html" - Copy over content directly to bucket *OR* Store in GitHub and use WebHook to run update script *OR* Use CI/CD tools like Jenkins & use cloud storage plugin for post build steps. Best option if the content is static and also if users can upload content like files, photos, videos, etc.
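Assuming the domain-named bucket above already exists and ownership has been verified, the website settings and a public upload could be configured with the Cloud Storage Python client roughly as follows:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("reports.example.com")  # domain-named bucket

# Serve index.html at the site root and 404.html for missing objects.
bucket.configure_website(main_page_suffix="index.html", not_found_page="404.html")
bucket.patch()

# Upload a page and make it publicly readable.
blob = bucket.blob("index.html")
blob.upload_from_filename("index.html", content_type="text/html")
blob.make_public()
```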

VPC: Static Routes

- Create and maintain a routing table - A topology change in the network requires routes to be manually updated - Cannot re-route traffic automatically if a link fails - Suitable for small networks with stable topologies - Routers do not advertise routes

GCE: Image Creation

- Creator has full root privileges, SSH capability >> Can share with other users Steps: - In the Google Cloud Platform Console, go to the Create an image page. - Specify the source from which you want to create an image. This can be a persistent disk, another image, or a disk.raw file in Google Cloud Storage. - Specify the properties for your image. For example, you can specify an image family name for your image to organize this image as part of an image family. - If you are creating an image from a disk attached to a running instance, check "Force creation from running instance" to confirm that you want to create the image while the instance is running. - Click Create to create the image.

CryptoKey and CryptoKeyVersion states

- The state (enabled, disabled, etc.) is tracked on each CryptoKeyVersion. - A CryptoKey can only be used to encrypt data when its primary CryptoKeyVersion is enabled. - Decryption does not require the primary version; the version that was used to encrypt the data is used.

OLAP: Partitioning

- Data may be naturally split into logical units. E.g. customers in the US split based on state. - Each of these units will be stored in a different directory - State-specific queries will run only on data in one directory - Splits may *not* be of the same size

Relational Data

- Data is organized in structured schemas, with primary keys, etc. - Data is split into different tables and linked (normalized). - Can't handle missing values well (as opposed to a columnar database).

Pub/Sub: Architecture

- Data plane, which handles moving messages between publishers and subscribers - Control plane, which handles the assignment of publishers and subscribers to servers on the data plane - The servers in the data plane are called forwarders, and the servers in the control plane are called routers.

BigQuery: Interactive Queries

- Default mode (executed as soon as possible) - Count towards limits on -- Daily usage -- Concurrent usage

VPC: Route: Creating a Network

- Default route for internet traffic. - One route for every subnet that is created.

Access control for Deployment Manager

- Deployment Manager uses the credentials of the Google APIs service account to create Google Cloud Platform resources. - The Google APIs service account is automatically granted editor permissions on the project. - The service account exists indefinitely with the project and is only deleted when the project is deleted.

Deployment Manager: Deployment

- A deployment is a collection of resources, deployed and managed together.

VPC: Firewall: Egress Connections

- Destination CIDR ranges, Protocols, Ports - Destinations with specific tags or service accounts -- Allow: Permit matching egress connections -- Deny: Block the matching egress connections

Cloud SQL: Backups and Binary logging

- Determine whether automated backups are performed and binary logging is enabled or not. - Required for the creation of replicas and clones, and for point-in-time recovery.

VPC: Interconnect: Direct Peering

- Direct connection between on-premise network and Google at Google's edge network locations - *BGP* routes exchanged for dynamic routing - Direct peering can be used to reach all of Google's services, including the full suite of GCP products - Special billing rate for GCP egress traffic, other traffic billed at standard GCP rates

VPC: 2 Default Routes

- Direct packets to specific destinations that carry them to the outside world (uses external IP addresses) - Allow instances on a VPC to send packets directly to each other (uses internal IP addresses) The existence of a route does not mean that a packet will get to the destination. *Firewall* rules have to be configured to allow the packet through.

VPC: Interconnect: Dedicated Interconnect

- Direct physical connection and RFC 1918 communication between on-premise network and cloud VPC - Can transfer large amounts of data between networks - More cost effective than using high bandwidth internet connections or using VPN tunnels - Capacity of a single connection is 10Gbps - A maximum of 8 connections supported Cross connect between the Google network and the on premise router in a common colocation facility.

Apache Beam: How

- Discarding - Accumulating - Accumulating & Retracting

VM: Sustained Use Discounts

- Discounts for running a VM instance for a significant portion of the billing month - Say you run an instance for 25% of the month, you get a discount for every incremental minute - Applied automatically, no action to avail of these

StackDriver: Trace

- Distributed tracing system that collects latency data from Google App Engine, Google HTTP(S) load balancers, and applications instrumented with the Stackdriver Trace SDKs - Think TensorBoard for Google Cloud Apps

VPC: Interconnect: Dedicated Interconnect Benefits

- Does not traverse the public internet. Fewer hops between points so fewer points of failure - Can use internal IP addresses over a dedicated connection - Scale connection based on needs up to 80Gbps - Cost of egress traffic from VPC to on-premise network reduced

BigTable: Row Keys to Avoid

- Domain names - Sequential numeric values - Timestamps alone - Timestamps as prefix of row-key - Mutable or repeatedly updated values

Data Exfiltration: Don'ts for VMs

- Don't allow outgoing connections to unknown addresses - Don't make IP addresses public - Don't allow remote connection software e.g. RDP - Don't give SSH access unless absolutely necessary

DataStore: When to avoid

- Don't use if you need *very strong transaction support* (OLTP) - OK for basic ACID support though - Don't use for non-hierarchical or unstructured data - BigTable is better - Don't use if analytics/business intelligence/data warehousing - use BigQuery instead - Don't use for immutable blobs like movies each > 10 MB - use Cloud Storage instead - Don't use if application has lots of writes and updates on key columns

BigTable: Avoid BigTable When

- Don't use if you need transaction support (OLTP) - use Cloud SQL or Cloud Spanner - Don't use for data less than 1 TB (can't parallelize) - Don't use if analytics/business intelligence/data warehousing - use BigQuery instead - Don't use for documents or highly structured hierarchies - use DataStore instead - Don't use for immutable blobs like movies each > 10 MB - use Cloud Storage instead

GCE: Storage: Persistent Disks

- Durable network storage devices that instances can access like physical disks in a desktop or a server. - Compute Engine manages physical disks and data distribution to ensure redundancy and optimize performance. - Encrypted (custom encryption possible) - Built-in redundancy - Restricted to the zone where the instance is located *Two types*: Standard and SSD - *Standard Persistent*: These are regular hard disks. They are cheap. OK to use for sequential access. - *SSD Persistent*: Fast and expensive. Good for random access.

VPC: Cloud Router

- Dynamically exchange routes between Google VPCs and on premise networks - Fully distributed and managed Google cloud service - Peers with on premise gateway or router to exchange route information - Uses the BGP or Border Gateway Protocol - To enable dynamic routing, create a Cloud Router. Then, configure a BGP session between the Cloud Router and your on-premises gateway or router. - The new subnets are seamlessly advertised between networks. Instances in the new subnets can start sending and receiving traffic immediately.

GCE: Projects and Instances

- Each instance belongs to a project - Projects can have any number of instances - Projects can have up to 5 VPCs (Virtual Private Cloud networks) - Each instance belongs in one VPC >> instances within a VPC communicate on the LAN >> instances across VPCs communicate over the internet

BigTable: Schema Design

- Each table has just one index - the row key. Choose it well - Rows are sorted lexicographically by row key - All operations are atomic at row level - Related entities in adjacent rows

BigTable: Designing Schema: General Concepts

- Each table has only one index, the row key. There are no secondary indices. - Rows are sorted lexicographically by row key, from the lowest to the highest byte string. Row keys are sorted in big-endian, or network, byte order, the binary equivalent of alphabetical order. - All operations are atomic at the row level. Avoid schema designs that require atomicity across rows. - Ideally, both reads and writes should be distributed evenly across the row space of the table. - In general, keep all information for an entity in a single row. An entity that doesn't need atomic updates and reads can be split across multiple rows. Splitting across multiple rows is recommended if the entity data is large (hundreds of MB). - Related entities should be stored in adjacent rows, which makes reads more efficient. - Cloud Bigtable tables are sparse. Empty columns don't take up any space. As a result, it often makes sense to create a very large number of columns, even if most columns are empty in most rows.

VPC: Interconnect: Carrier Peering

- Enterprise grade network services connecting your infrastructure to Google using a service provider - Can get high availability and lower latency using one or more links - No Google SLA, the SLA depends on the carrier - Special billing rate for GCP egress traffic, other traffic billed at standard GCP rates

Cloud DNS: Managed Zone

- Entity that manages DNS records for a given suffix (example.com) - Maintained by Cloud DNS

VPC: The "default" Network

- Every GCP project has an auto-mode network set up by default - It comes with a number of routes and firewall rules preconfigured - Gets us up and running without thinking about networks

CloudSQL: External Read Replicas

- External read replicas are external MySQL instances that replicate from a Cloud SQL master. - Example: A MySQL instance running on Compute Engine is considered an external instance. - Replicating to a MySQL instance hosted by another cloud platform or on-premise is not possible.

DataStore: Best Practices: Entities

- Group highly related data in entity groups. Entity groups enable ancestor queries, which return strongly consistent results. Ancestor queries also rapidly scan an entity group with minimal I/O because the entities in an entity group are stored at physically close places on Cloud Datastore servers. - Avoid writing to an entity group more than once per second. Writing at a sustained rate above that limit makes eventually consistent reads more eventual, leads to time outs for strongly consistent reads, and results in slower overall performance of your application. A batch or transactional write to an entity group counts as only a single write against this limit. - Do not include the same entity (by key) multiple times in the same commit. Including the same entity multiple times in the same commit could impact Cloud Datastore latency.

GKE: Container Cluster

- Group of Compute Engine instances running Kubernetes. - It consists of -- one or more node instances, and -- a managed Kubernetes master endpoint.

Use Case: Fast scanning, NoSQL

- HBase (columnar database) - GCP: BigTable

Major Blocks of Hadoop

- HDFS - MapReduce - YARN - Yet Another Resource Negotiator.

Network Load Balancer: Firewall Rules

- HTTP health check probes are sent from the IP ranges 209.85.152.0/22, 209.85.204.0/22, and 35.191.0.0/16. - The load balancer uses the same ranges to connect to the instances - Firewall rules should be configured to allow traffic from these IP ranges

Load Balancing: Health Checks

- HTTP(S), SSL and TCP health checks - HTTP(S): Verifies that the instance is healthy and the web server is serving traffic - TCP, SSL: Used when the service expects a TCP or SSL connection, i.e. not HTTP(S) - GCP creates redundant copies of the health checker automatically, so health checks might happen more frequently than you expect

Stream Processing with Flink

- Handles *out of order* or late arriving data. - *Exactly once processing* for stateful computations. - *Flexible windowing* based on time, sessions, count, etc. - *Lightweight* fault tolerance and checkpointing. - *Distributed*, runs in large scale clusters.

Cloud Endpoints

- Helps create, share, maintain, and secure your APIs - Uses the distributed Extensible Service Proxy to provide low latency and high performance - Provides authentication, logging, monitoring - Host your API anywhere Docker is supported so long as it has Internet access to GCP - Ideally, use with -- App Engine (flexible or some types of standard) -- Container Engine instance -- Compute Engine instance *Note* - proxy and API containers must be on same instance to avoid network access

Stream-first Arch: Stream Processing

- High throughput, low latency - Fault tolerance with low overhead - Manage out of order events - Easy to use, maintainable - Replay streams - Examples: *Streaming Spark*, *Storm*, *Flink*

Hadoop Ecosystem

- Hive - HBase - Pig - Kafka - Spark - Oozie

Use Case: Analytics/Data Warehouse (OLAP)

- Hive (SQL-like, but MapReduce on HDFS) - GCP: BigQuery

Use Case: SQL Interface atop file data

- Hive (SQL-like, but MapReduce on HDFS) - GCP: BigQuery

StackDriver: Types of Monitored Projects

- Hosting Projects: holds the monitoring configuration for the Stackdriver account — the dashboards, alert policies, uptime checks, and so on. - To monitor a single GCP project, create new StackDriver account within that 1 project - To monitor multiple GCP projects, create new StackDriver account in an otherwise empty hosting project -- Don't use hosting project for any other purpose - AWS Connector Projects: When you add an AWS account to a Stackdriver account, Stackdriver Monitoring creates the AWS connector project for you, typically giving it a name beginning AWS Link. - The Monitoring and Logging agents on your EC2 instances send their metrics and logs to this connector project. - If you use StackDriver logging from AWS, those logs will be in the AWS connector project (not in the host project of the Stackdriver account) - Don't put any GCP resources in an AWS connector project. This will not be monitored! - Monitored Projects: Regular (non-AWS) projects within GCP that are being monitored.

Identity-Aware Proxy (IAP)

- Identity-Aware Proxy (IAP) is an HTTPS-based, i.e. web-based, way to combine all the identity management. - IAP acts as an additional safeguard on a particular resource - Turning on IAP for a resource causes creation of an OAuth 2.0 Client ID & secret (per resource). Don't delete any of these! IAP will stop working. - Central authorization layer for applications accessed by HTTPS - Application-level access control model instead of relying on network-level firewalls - With Cloud IAP, you can set up group-based application access: - a resource could be accessible for employees and inaccessible for contractors, or only accessible to a specific department.

DataStore: Best Practices: Indexes

- If a property will never be needed for a query, exclude the property from indexes. Unnecessarily indexing a property could result in increased latency to achieve consistency, and increased storage costs of index entries. - Avoid having too many composite indexes. Excessive use of composite indexes could result in increased latency to achieve consistency, and increased storage costs of index entries. If you need to execute ad hoc queries on large datasets without previously defined indexes, use Google BigQuery. - Do not index properties with monotonically increasing values (such as a NOW() timestamp). Maintaining such an index could lead to hotspots that impact Cloud Datastore latency for applications with high read and write rates.

VM: Custom Machines

- If none of the predefined machine types fit your workloads, use a custom machine type - Save the cost of running on a machine which is more powerful than what you need - Billed according to the number of vCPUs and the amount of memory used

Access control for users

- If users have access permissions on the project, they can create configurations and deployments. - IAM (Identity and Access Management) - Supports predefined and primitive roles. - Primitive roles map directly to the legacy project owner, editor and viewer roles.

DataStore: Best Practices: Queries

- If you need to access only the key from query results, use a *keys-only query*. A keys-only query returns results at lower latency and cost than retrieving entire entities. - If you need to access only specific properties from an entity, use a *projection query*. A projection query returns results at lower latency and cost than retrieving entire entities. - Likewise, if you need to access only the properties that are included in the query filter (for example, those listed in an order by clause), use a *projection query*. - Do not use offsets. Instead use *cursors*. Using an offset only avoids returning the skipped entities to your application, but these entities are still retrieved internally. The skipped entities affect the latency of the query, and your application is billed for the read operations required to retrieve them. - If you need strong consistency for your queries, use an ancestor query. To use ancestor queries, you first need to structure your data for strong consistency.

CloudSQL: Instances

- Instances need to be created explicitly -- Not serverless; needs a database instance -- Specify region while creating instance - First vs. second generation instances -- Second generation instances allow proxy support - no need to whitelist IP addresses or configure SSL -- Higher availability configuration -- Maintenance won't take down the server

Deployment Manager: Configuration

- Describes all the resources you want for a single deployment; this file is written in YAML syntax. - It lists each of the resources you want to create and its respective resource properties. - A configuration must contain a resource. Each resource must contain three components: a) Name - a user-defined string for identification; b) Type - the type of resource being deployed; c) Properties - the parameters of the resource type

Deployment Manager: Manifest

- A read-only object that contains the original configuration. - Deployment Manager generates a manifest each time a deployment is created or updated. - The manifest is useful for troubleshooting issues with a deployment.

OLAP: Join Optimizations

- Join operations are MapReduce jobs under the hood. - Optimize joins by *reducing* the amount of data held *in memory*. - Or by structuring joins as a *map-only* operation

VM: Live Migration

- Keeps your VM instances running even during a hardware or software update - Live migrates your instance to another host in the same zone without rebooting VMs -- infrastructure maintenance and upgrades -- network and power grid maintenance -- Failed hardware -- Host and BIOS updates -- Security changes, etc. - VM gets a notification that it needs to be evicted - A new VM is selected for migration, the empty "target" - A connection is authenticated between the two - Instances with GPUs cannot be live migrated; they get a 60 minute notice before termination - Instances with local SSDs attached can be live migrated - Preemptible instances cannot be live migrated, they are always terminated

DataStore: Best Practices: Keys

- Key names are auto-generated if not provided at entity creation. They are allocated so as to be evenly distributed in the key space. - For a key that uses a custom name, always use UTF-8 characters except a forward slash (/). - For a key that uses a numeric ID: >>> Do not use a negative number for the ID. >>> Do not use the value 0 (zero) for the ID. If you do, you will get an automatically allocated ID. >>> If you wish to manually assign your own numeric IDs to the entities you create, have your application obtain a block of IDs with the allocateIds() method. This will prevent Cloud Datastore from assigning one of your manual numeric IDs to another entity.
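A small sketch of reserving a block of numeric IDs with the Python client (the allocateIds() pattern above); the kind name and count are illustrative.
    from google.cloud import datastore

    client = datastore.Client()

    incomplete_key = client.key("Task")              # key with no ID assigned yet
    keys = client.allocate_ids(incomplete_key, 10)   # reserve a block of 10 numeric IDs

    # Datastore will not hand these reserved IDs to any other entity
    entity = datastore.Entity(key=keys[0])
    entity.update({"done": False})
    client.put(entity)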

BigQuery

- Latency is a bit higher than BigTable or DataStore - prefer those for low latency - No ACID properties - can't be used for transaction processing (OLTP) - Great for analytics/business intelligence/data warehouse (OLAP) - Recall that OLTP needs strict write consistency, OLAP does not - Superficially similar in use-case to Hive - SQL-like abstraction for non-relational data - Underlying implementation is actually quite different from Hive.

Load Balancing

- Load balancing and autoscaling for groups of instances - Scale your application to support heavy traffic - Detect and remove unhealthy VMs, healthy VMs automatically re-added - Route traffic to the closest VM - Fully managed service, redundant and highly available

VPC Network Peering Benefits

- Lower latency as compared with public IP networking - Better security since services need not expose an external IP address - Using internal IPs for traffic avoids egress bandwidth pricing on the GCP

Dataproc

- Managed Hadoop + Spark - Includes: Hadoop, Spark, Hive and Pig - "No-ops": create cluster, use it, turn it off using Cloud Dataproc automation. -- Use Google Cloud Storage, not HDFS - else billing will hurt - Ideal for moving existing code to GCP

Managed Instance Groups and Load Balancing

- Managed instance groups are a pool of similar machines which can be scaled automatically - Load balancing can be external or internal, global or regional - Basic components of HTTP(S) load balancing - target proxy, URL map, backend service and backends - Use cases and architecture diagrams for all the load balancing types HTTP(S), SSL proxy, TCP proxy, network and internal load balancing

High Availability

- Managed service. *no additional configuration* needed to ensure high availability - Can configure *multiple instance groups in different zones* to guard against failures in a single zone - With multiple instance groups all instances are treated as if they are in a *single pool* and the load balancer distributes traffic amongst them using the load balancing algorithm

Pub/Sub

- Messaging "middleware" - Many-to-many asynchronous messaging - Decouple sender and receiver

GCE: Preemptible Instances

- Much much cheaper than regular Compute Engine instances - Can be *terminated* at any time if GCE needs the resources and definitely after running for 24 hours. - Suitable for *batch or fault-tolerant* applications. - Probability of termination varies by day/zone etc. - Cannot live migrate (stay alive during updates) or auto-restart on maintenance. - Not billed for instances preempted in the first 10 minutes.

Schema-on-read

- Number of columns, column types, constraints specified at table creation - Hive *tries to impose* this schema when data is read - It may not succeed, may *pad data with nulls*

Autoscaling Policy: HTTP(S) Load Balancing Server Capacity

- Only works with -- CPU utilization -- maximum requests per second/instance - These are the only settings that can be controlled by adding and removing instances. - Autoscaling does not work with maximum requests per group; this setting is *independent* of the number of instances in a group.

Resource Hierarchy

- Organization >> project >> resource - Can set an IAM access control policy at any level in the resource hierarchy - Resources inherit the policies of the parent resource *Organization:* - Not required, but helps separate projects from individual users - Link with G-suite super admin - Root of hierarchy, rules cascade down *Folders:* - Logical groupings of projects

Data Exfiltration: Types

- Outbound email - Downloads to insecure devices - Uploads to external services - Insecure cloud behaviour - Rogue admins, pending employee terminations

Apache Beam: What

- ParDo (Parallel Do) - GroupByKey - Flatten - Combine - Composite Transforms. - Side Inputs - Source API - Metrics - Stateful Processing

Cloud Spanner: Parent - Child

- Parent-child relationships between tables - These cause the related rows to be physically co-located for fast access - If you query Students and Grades together, make Grades child of Students -- Data locality will be enforced between 2 independent tables! - Every table must have primary keys - To declare a table is a child of another, prefix the parent's primary key onto the primary key of the child. (This storage model resembles HBase)

OLAP: Queries on Big Data

- Partitioning and Bucketing of Tables - Join Optimizations - Window Functions

Deployment Manager: Templates

- Parts of the configuration are abstracted into individual building blocks. These files are written in Python or Jinja2. - They are much more flexible than individual configuration files and intended to support easy portability across deployments. - The interpretation of each template eventually must result in the same YAML syntax.

VPC Network Peering Properties

- Peered networks are *administratively separate* - routes, firewalls, VPNs and traffic management applied independently - One VPC can peer with multiple networks with a limit of 25 - Only directly peered networks can communicate

Load Balancing: TCP Proxy Load Balancing

- Perform load balancing based on transport layer (TCP) - Allows you to use a single IP address for all users around the world. - Automatically routes traffic to the instances that are closest to the user. - Advantage of transport layer load balancing: -- more intelligent routing possible than with network layer load balancing -- better security - TCP vulnerabilities can be patched at the load balancer

GCE: Storage: Local SSD

- Physically attached to the server that hosts your virtual machine instance - Local SSDs have higher throughput and lower latency - The data that you store on a local SSD persists only until you stop or delete the instance - Small - each local SSD is 375 GB in size (can go up to 8 SSDs, i.e. 3 TB per instance). - Very high IOPS and low latency - Unlike persistent disks, you must manage the striping on local SSDs yourself - Encrypted, custom encryption not possible

Load Balancing Algorithm

- Picks an instance based on a hash of: -- the source IP and port -- destination IP and port -- protocol - This means that incoming TCP connections are spread across instances and each new connection may go to a different instance. - Regardless of the session affinity setting, all packets for a connection are directed to the chosen instance until the connection is closed; closed connections have no impact on load balancing decisions for new incoming connections - This can result in imbalance between backends if long-lived TCP connections are in use.

Pig on Hadoop

- Pig runs *on top* of the Hadoop distributed computing framework. - *Reads* files from HDFS, *stores intermediate* records in HDFS and *writes* its final output to HDFS. - *Decomposes* operations into MapReduce jobs which run in parallel. - Provides *non-trivial, built-in* implementations of standard data operations, which are very *efficient*. - Pig optimizes operations *before* MapReduce jobs are run, to speed operations up

DataFlow (Apache Beam): Pipeline

- Pipeline: single, potentially repeatable job, from start to finish, in Dataflow - Encapsulates series of computations that accepts some input data from external sources, transforms data to provide some useful intelligence, and produce output - A pipeline consists of two parts: data (PCollections) and transforms applied to that data (Transforms). - Defined by driver program -- The actual pipeline computations run on a backend, abstracted in the driver by a runner.
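A minimal word-count style sketch of a driver program with the Apache Beam Python SDK; it runs on the default DirectRunner unless pipeline options select another backend. The input strings and transform labels are illustrative.
    import apache_beam as beam

    # The driver defines the DAG; the runner (DirectRunner here) executes it.
    with beam.Pipeline() as p:
        (p
         | "Read"  >> beam.Create(["see spot run", "run spot run"])   # input PCollection
         | "Split" >> beam.FlatMap(lambda line: line.split())
         | "Pair"  >> beam.Map(lambda word: (word, 1))
         | "Count" >> beam.CombinePerKey(sum)
         | "Print" >> beam.Map(print))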

BigTable: Reasons for Poor Performance

- Poor schema design (eg sequential keys) - Inappropriate workload -- too small (<300 GB) -- used in short bursts (needs hours to tune performance internally) - Cluster too small - Cluster just fired up or scaled up - HDD used instead of SSD - Development v Production instance

Load Balancing: Few concepts of L7/HTTP or HTTPS LB

- Port 80 or port 8080 or port 443. - Support URL-based or Content Based routing. Create URL Maps and direct traffic to different instances based on the incoming URL. >> you can send requests for http://www.example.com/audio to one backend service, which contains instances configured to deliver audio files, and requests for http://www.example.com/video to another backend service, which contains instances configured to deliver video files. >> Route requests for static content to a Cloud Storage bucket. - Supports session affinity - sends all requests from the same client to the same virtual machine instance as long as the instance stays healthy and has capacity. >> (two types: client IP affinity, cookie affinity) - The health of each backend instance is verified using an HTTP health check

Zonal vs. Regional MIG

- Prefer regional instance groups to zonal so application load can be spread across multiple zones - This protects against failures within a single zone - Choose zonal if you want lower latency and avoid cross-zone communication

Pub/Sub: Subscribers

- Pull subscribers - any app that can make HTTPS requests to googleapis.com - Push subscribers - must be a WebHook endpoint that can accept POST requests over HTTPS

Use Case: Transaction Processing (OLTP)

- RDBMS - GCP: Cloud SQL, Cloud Spanner

Spark: Resilient Distributed Datasets

- RDDs are the main programming abstraction in Spark - RDDs are in-memory collections of objects - With RDDs, you can interact and play with billions of rows of data
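A small PySpark sketch of the RDD abstraction; the data and the app name are illustrative.
    from pyspark import SparkContext

    sc = SparkContext("local", "rdd-demo")

    rdd = sc.parallelize(range(1, 1001))                 # in-memory collection as an RDD
    squares = rdd.map(lambda x: x * x)                   # transformation (lazy)
    total = squares.filter(lambda x: x % 2 == 0).sum()   # action triggers execution
    print(total)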

Target Proxy

- Referenced by one or more global forwarding rules - Route the incoming requests to a URL map to determine where they should be sent - Specific to a protocol (HTTP, HTTPS, SSL and TCP) - Should have an SSL certificate if it terminates HTTPS connections (limit of 10 SSL certificates) - Can connect to backend services via HTTP or HTTPS

Cloud Spanner: Data Types

- Remember that tables are strongly-typed (schemas must have types) - Non-normalized types such as ARRAY and STRUCT available too - STRUCTs are not OK in tables, but can be returned by queries (e.g. if a query returns an ARRAY of STRUCTs) - ARRAYs are OK in tables, but ARRAYs of ARRAYs are not

Load Balancing: SSL Proxy Load Balancing

- Remember the OSI network layer stack: physical, data link, network, transport, session, presentation, application? - The usual combination is TCP/IP: network = IP, transport = TCP, application = HTTP - For secure traffic: add session layer = SSL (secure socket layer), and application layer = HTTPS - Use only for non-HTTP(S) SSL traffic - For HTTP(S), just use HTTP(S) load balancing - SSL connections are terminated at the global layer then proxied to the closest available instance group

VM: How do you add persistent disks after you have created the instance?

- Remember, you already created a boot disk when creating the instance. - Disks are zonal resources, so they reside in a particular zone for their entire lifetime. - A persistent disk can be a standard (HDD) or solid-state (SSD) drive. You can also attach an ephemeral local SSD for high-performance I/O. Each local SSD is 375 GB in size, but you can attach up to eight devices for 3 TB of total SSD storage space per instance. Commands: - *gcloud* compute disks create my-disk-1 my-disk-2 --zone us-east1-a --size 100GB - *gcloud* compute instances attach-disk INSTANCE_NAME --disk=DISK

Cloud Key Management: Object hierarchy: CryptoKey Version

- Represents the key material; a key can have many versions, numbered starting from 1. - A version can be in states such as enabled, disabled or scheduled for destruction. - The primary version is the one used to encrypt data.

VPC: Overview

- Resources in GCP projects are split across VPCs (Virtual Private Clouds) - Routes and forwarding rules must be configured to allow traffic within a VPC and with the outside world - Traffic flows only after firewall rules are configured specifying what traffic is allowed or not - VPN, peering, shared VPCs are some of the ways to connect VPCs or a VPC with an on premise network

BigTable: Types of Row Keys

- Reverse domain names - String identifiers - Timestamps as suffix in key

BigTable: Size Limits

- Row keys: 4KB per key - Column Families: ~100 per table - Column Values: ~ 10 MB each - Total Row Size: ~100 MB

DataStore: Multitenancy

- Separate data partitions for each client organization - Can use the same schema for all clients, but vary the values - Specified via a namespace (inside which kinds and entities can exist)

Endpoints: API Keys

- Simple encrypted string - Can be used when calling certain APIs that *don't need to access private user data*. - Useful in clients such as browser and mobile applications that don't have a backend server - The API key is used to track API requests associated with your project for quota and billing. Creation: - GCP Console => API Manager => Credentials => Create - Select "API Key" Beware: - Can be used by anyone - Vulnerable to man-in-the-middle attacks - Does not identify the user or application making the request
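A sketch of passing an API key on a request; the endpoint and parameters below are illustrative (not taken from these notes), and the key value is a placeholder.
    import requests

    API_KEY = "YOUR_API_KEY"   # created under API Manager => Credentials

    # The key identifies the project (for quota/billing), not the user
    resp = requests.get(
        "https://translation.googleapis.com/language/translate/v2",   # example public API
        params={"key": API_KEY, "q": "hello", "target": "es"},
    )
    print(resp.status_code)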

HDFS: Block Size

- Default block size is 128 MB. - Block size is a trade-off -- larger blocks reduce parallelism -- smaller blocks increase overhead - The large default helps minimize the time taken to seek to the block on the disk

OLAP: Bucketing

- Size of each split should be the same - Based on hash of a column value - address, name, timestamp anything - Each bucket is a separate file - Makes sampling and joining data more efficient

Cloud Storage: Bucket Storage Classes: Nearline Storage

- Slightly lower availability. 99.0% availability SLA. - 30-day minimum storage duration - Data retrieval costs. - Very low cost per GB stored. Higher per operation costs. Use case: - Data you plan to read or modify on average once a month or less - Data backup, disaster recovery, and archival storage.

DataFlow (Apache Beam): I/O Sources and Sinks

- Source & Sink: different data storage formats, such as files in Google Cloud Storage, BigQuery tables - Custom sources and sinks possible too Source: - Twitter feed - log messages Sink: - BigQuery - BigTable

Spark: Lazy Evaluation

- Spark keeps a record of the series of transformations requested by the user. - It groups the transformations in an efficient way when an Action is requested.

BigQuery: Partitioned Tables

- Special table where data is partitioned for you - No need to create partitions manually or programmatically - Manual partitions - performance degrades - Limit of 1000 tables per query does *not* apply - Date-partitioned tables offered by BigQuery - Need to declare table as partitioned at creation time - No need to specify schema (can do while loading data) - BigQuery automatically creates date partitions
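A sketch of creating a date-partitioned table with the google-cloud-bigquery Python client; the project, dataset, table and schema names are assumptions for illustration.
    from google.cloud import bigquery

    client = bigquery.Client()

    table = bigquery.Table(
        "my-project.my_dataset.events",                        # hypothetical table ID
        schema=[bigquery.SchemaField("user_id", "STRING")],
    )
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY               # BigQuery manages the date partitions
    )
    table = client.create_table(table)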

Logging with StackDriver

- Stackdriver Logging includes storage for logs, a user interface (the Logs Viewer), and an API to manage logs programmatically - Stackdriver Logging lets you -- read and write log entries -- search and filter your logs -- export your logs -- create logs-based metrics.

StackDriver: Metrics

- Stackdriver Monitoring has metrics for -- the CPU utilization of your VM instances -- the number of tables in your SQL databases -- hundreds more - Can create custom metrics for StackDriver monitoring to track - Three types: -- gauge metrics -- delta metrics -- cumulative metrics - Metric data will be available in StackDriver monitoring for 6 weeks

GCE: Machine Types

- Standard - High-memory - High-CPU - Shared-core (small, non-resource intensive) - Can attach GPU dies to most machine types

VM: Create a Image and launch an instance using it

- Stop the instance - Go to Images link on left navigator - Click Create Image - Choose source as Disk and disk source as Instance name of Stopped Instance. - Give a nice name and create image. - Create a new instance with this image. Choose image from custom images. - Launch it with HTTP / HTTPs checked - so you can verify the instance

Cloud KMS: Choosing a secret management solution

- Example approaches: store secrets in code, encrypted with a key from Cloud KMS, or store secrets in a Google Cloud Storage bucket. - Considerations when changing secrets: rotating secrets, caching secrets locally, or using a separate solution. - Encryption options: application-layer encryption using a key in Cloud KMS, or the default encryption built into the Cloud Storage bucket. - Managing access to secrets: access controls on the bucket in which the secret is stored, and access controls on the key which encrypts that bucket. - Key rotation and secret rotation are both part of secret management. - Permission management without a service account requires several users: an organization-level administrator, a second user with the appropriate storage role, a third user with the cloudkms.admin role, and a fourth user that has both the storage.objectAdmin and cloudkms.cryptoKeyEncrypterDecrypter roles.

Cloud Dataproc/Dataflow: Recommended Workflows

- Stream processing (ETL): Dataflow - Batch processing (ETL): Both Dataproc & Dataflow - Iterative processing and notebooks: Dataproc - Machine learning with Spark ML: Dataproc - Preprocessing for machine learning: Dataflow (with Cloud ML Engine)

GKE: Container Cluster: Node Pool

- Subset of machines within a cluster that all have the same configuration. - Useful for customizing instance profiles in your cluster - You can also: >> run multiple Kubernetes node versions on each node pool in your cluster >> update each node pool independently >> target different node pools for specific deployments.

Cloud Spanner: Transactions

- Supports serialisability - Cloud Spanner transaction support is super-strong, even stronger than traditional ACID -- Transactions commit in an order that is reflected in their commit timestamps -- These commit timestamps are "real time" so you can compare them to your watch - Two transaction modes -- Locking read-write (slow) -- Read-only (fast) - If making a one-off read, use something known as a "Single Read Call" -- Fastest, no transaction checks needed!

Autoscaling Policy: Average CPU Utilization

- Target utilization level of 0.75 maintains average CPU utilization at 75% *across all instances* - If utilization exceeds the target, *more instances will be added* - If utilization reaches 100% during times of heavy usage the autoscaler might increase the number of instances by -- 50% -- 4 instances - whichever is *larger*

TensorFlow: constants or variables

- Tensors can be stored in the graph as constants or variables. - Constants hold tensors whose values can't change, while variables hold tensors whose values can change. - A constant is an operation that always returns the same tensor value. - A variable is an operation that will return whichever tensor has been assigned to it.
To define a constant, use the tf.constant operator and pass in its value. For example:
    x = tf.constant(5.2)
Similarly, you can create a variable like this:
    y = tf.Variable([5])
Or you can create the variable first and then subsequently assign a value like this (note that you always have to specify a default value):
    y = tf.Variable([0])
    y = y.assign([5])

CloudSQL: Cloud Proxy: Operation

- The Cloud SQL Proxy works by having a local client, called the proxy, running in the local environment - Your application communicates with the proxy with the standard database protocol used by your database. - The proxy uses a secure tunnel to communicate with its companion process running on the server. When you start the proxy, you need to tell it: - What Cloud SQL instances it should establish connections to - Where it will listen for data coming from your application to be sent to Cloud SQL - Where it will find the credentials it will use to authenticate your application to Cloud SQL You can install the proxy anywhere in your local environment. The location of the proxy binaries does not impact where it listens for data from your application.

Autoscaler with Multiple Policies

- The autoscaler will scale based on the policy which provides the *largest number of VMs* in the group. - This ensures that you always have enough machines to handle your workload. - Can handle a maximum of *5 policies* at a time.

Internal Load Balancing: Load Balancing Algorithm

- The backend instance for a client is selected using a hashing algorithm that takes instance health into consideration. - Using a 5-tuple hash, five parameters for hashing: -- client source IP -- client port -- destination IP (the load balancing IP) -- destination port -- protocol (either TCP or UDP) - Introduce session affinity by hashing on only some of the 5 parameters -- Hash based on 3-tuple (Client IP, Dest IP, Protocol) -- Hash based on 2-tuple (Client IP, Dest IP)

Cloud KMS: Envelope encryption

- The key used to encrypt data itself is called a data encryption key (DEK). - The DEK is encrypted (or wrapped) by a key encryption key (KEK).
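A purely local sketch of envelope encryption using the cryptography package; in practice the KEK lives in Cloud KMS and wrapping/unwrapping the DEK is a KMS encrypt/decrypt call, but here a local key stands in for the KEK.
    from cryptography.fernet import Fernet

    kek = Fernet(Fernet.generate_key())    # key encryption key (stand-in for a Cloud KMS key)
    dek = Fernet.generate_key()            # data encryption key

    ciphertext = Fernet(dek).encrypt(b"sensitive payload")   # data encrypted with the DEK
    wrapped_dek = kek.encrypt(dek)                           # DEK wrapped by the KEK

    # To read the data: unwrap the DEK, then decrypt the ciphertext
    plaintext = Fernet(kek.decrypt(wrapped_dek)).decrypt(ciphertext)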

Autoscaling: Target Utilization Level

- The level at which you want to maintain your VMs - Interpreted differently based on the autoscaling policy that you've chosen

Transfer Service: Importing Data

- The transfer service helps get data *into* Cloud Storage from: -- AWS, i.e. an S3 bucket -- HTTP/HTTPS location -- Local files -- Another Cloud Storage Bucket Bells & Whistles - One-time vs recurring transfers - Delete from destination if they don't exist in source - Delete from source after copying over - Periodic synchronization of source and destination based on file filters

DataStore: Consistency

- Two consistency levels possible for query results -- Strongly consistent: return up-to-date result, however long it takes -- Eventually consistent: faster, but might return stale

OLAP: Splitting

- Two ways to narrow down the data to be examined or processed. -- Partitioning -- Bucketing - Splits data into smaller, manageable parts - Enables performance optimizations

HBase: Row Key

- Uniquely identifies a row - Can be primitives, structures, arrays - Represented internally as a byte array - Sorted in ascending order

BigQuery: Slots

- Unit of Computational capacity needed to run queries - BigQuery calculates on basis of query size, complexity - Usually default slots sufficient - Might need to be expanded for very large, complex queries - Slots are subject to quota policies - Can use *StackDriver Monitoring* to track slot usage

Cloud Storage: Bucket Storage Classes: Coldline Storage

- Unlike other "cold" storage services, same throughput and latency (i.e. not slower to access) - 90-day minimum storage duration, costs for data access, and higher per-operation costs - Infrequently accessed data, such as data stored for legal or regulatory reasons

DataStore: Implications of Full Indexing

- Updates are really slow - No joins possible - Can't filter results based on subquery results - Can't include more than one inequality filter (one is OK)

Use Case: Storage for Compute, Block Storage

- Use persistent disks (HDD) or SSD. - Same in GCP.

BigTable: SSD or HDD Disks

- Use SSD unless skimping on cost - SSD can be 20x faster on individual row reads - More predictable throughput too (no disk seek variance) - Don't even think about HDD unless storing > 10 TB and all batch queries - The more random access, the stronger the case for SSD

Data Exfiltration: Dos for VMs

- Use VPCs and firewalls between them - Use a bastion host as a chokepoint for access - Use Private Google Access - Use Shared VPC, aka Cross-Project Networking

DataStore: Best Practices: API calls

- Use batch operations for your reads, writes, and deletes instead of single operations. Batch operations are more efficient because they perform multiple operations with the same overhead as a single operation. - If a transaction fails, ensure you try to rollback the transaction. The rollback minimizes retry latency for a different request contending for the same resource(s) in a transaction. Note that a rollback itself might fail, so the rollback should be a best-effort attempt only. - Use asynchronous calls where available instead of synchronous calls.
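A small sketch of batched reads and writes with the Python client; the entity kind and properties are illustrative.
    from google.cloud import datastore

    client = datastore.Client()

    task1 = datastore.Entity(client.key("Task", 1))
    task1.update({"done": False})
    task2 = datastore.Entity(client.key("Task", 2))
    task2.update({"done": True})

    client.put_multi([task1, task2])                   # one round trip for both writes
    tasks = client.get_multi([task1.key, task2.key])   # one round trip for both reads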

DataStore: When to use

- Use for crazy scaling of read performance - to virtually any size - Use for hierarchical documents with key/value data. - Product catalogs that provide real-time inventory and product details for a retailer. - User profiles that deliver a customized experience based on the user's past activities and preferences. - Transactions based on *ACID* properties, for example, transferring funds from one bank account to another.

BigTable: Use BigTable When

- Use for very fast scanning and high throughput - Use for non-structured key/value data - Where each data item < 10 MB and total data > 1 TB - Use where writes infrequent/unimportant (no ACID) but fast scans crucial - Use for Time Series data

Cloud Spanner: SQL Best Practices

- Use query parameters to speed up frequently executed queries - Understand how Cloud Spanner executes queries - Use secondary indexes to speed up common queries - Write efficient queries for range key lookup - Write efficient queries for joins - Avoid large reads inside read-write transactions - Use ORDER BY to ensure the ordering of your SQL results
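A sketch of a parameterized query with the google-cloud-spanner Python client, reusing the Albums table from the examples elsewhere in these notes; the instance and database IDs are assumptions.
    from google.cloud import spanner
    from google.cloud.spanner_v1 import param_types

    client = spanner.Client()
    database = client.instance("my-instance").database("my-db")   # hypothetical IDs

    with database.snapshot() as snapshot:
        results = snapshot.execute_sql(
            "SELECT AlbumId, AlbumTitle FROM Albums WHERE SingerId = @singer_id",
            params={"singer_id": 1},
            param_types={"singer_id": param_types.INT64},
        )
        for row in results:
            print(row)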

SparkContext

- When the shell is launched it initializes a SparkContext. - The SparkContext represents a connection to the Spark Cluster - Used to load data into memory from a specified source - The data gets loaded into an RDD

End-user Authentication

- Use service accounts wherever possible - In certain specific cases however, end-user authentication is unavoidable - You need to access resources on behalf of an end user of your application -- For example, your application needs to access Google BigQuery datasets that belong to users of your application. - You need to authenticate as yourself (not as your application) -- For example, because the Cloud Resource Manager API can create and manage projects owned by a specific user, you would need to authenticate as a user to create projects on their behalf.

VPC: External IP Addresses

- Use to communicate across VPCs - Traffic using external IP addresses can cause additional billing charges - Can be ephemeral or static - VMs are not aware of their external IP address

VPC: Internal IP Addresses

- Use within a VPC - Cannot be used across VPCs unless we have special configuration (like shared VPCs or VPNs) - Can be ephemeral or static, typically ephemeral - VMs know their internal IP address (VM name and IP is available to the network DNS)

HBase: Timestamp

- Used as the version number for the values stored in a column - The value for any version can be accessed

Primary CryptoKey Version

- Used at the time of encryption. - At any given point in time, only one CryptoKeyVersion can be primary. - If the primary CryptoKeyVersion is disabled, the CryptoKey cannot encrypt data.

Shared VPC

- Used to be called XPN (Cross-Project Networking) - So far one project, multiple networks - Shared VPC allows cross-project networking, i.e. multiple projects, one network. - Creates a VPC network of RFC1918 IP spaces that associated projects can use. - Firewall rules and policies apply to all projects on the network

VM: Images

- Used to create boot disks for VM instances - *Public images*: -- provided and maintained by Google, open source communities, third party vendors -- all projects have access and can use them - *Custom images*: -- Available only to your project -- Create a custom image from boot disks and other images - Most of the *public images* can be used for *no cost* - Some *premium* images may have an additional cost - Custom images that you import to compute engine add no cost to your instance - They incur an *image storage charge* when stored in your project (tar and gzipped file) - Images are configured as a part of the *instance template* of a managed instance group

VM: Startup Scripts

- Used to customize the instance created using a public image - The script runs commands that deploys the application as it boots - Script should be *idempotent* to avoid inconsistent or partially configured state

URL Map

- Used to direct traffic to different instances based on the incoming URL -- http://www.example.com/audio -> backend service1 -- http://www.example.com/video -> backend service2

Hadoop Blocks: Co-ordination

- User defines map and reduce tasks using the MapReduce API - A job is triggered on the cluster - YARN figures out where and how to run the job, and stores the result in HDFS

Scenario: "Sign in to Quora using Google"

- User navigates to quora.com - Quora needs to access resources on behalf of user - Quora presents Google sign-in screen to user; user signs in - Quora requests Google to authenticate user - Quora has authenticated user, now releases resource - Resource owner: Quora guarding access to your account - Resource server: Quora granting access to your account - Client: Quora talking to Google - Authorisation server: Google

Scenario: "Access API via GCP Project"

- User wants to access some API - Project needs to access that API on behalf of user - Project requests GCP API Manager to authenticate user by passing client secret; API manager responds - Project has authenticated user, now gives API access - Resource owner: Project guarding access to your account - Resource server: Project granting access to your account - Client: Project talking to API manager - Authorisation server: API manager

Load Balancing: Load Distribution

- Uses CPU utilization of the backend or requests per second as the *balancing mode* - Maximum values can be specified for both - Short bursts of traffic above the limit can occur - Incoming requests are first sent to the *region closest to the user*, if that region has capacity - Traffic distributed amongst zone instances based on capacity - Round robin distribution across instances in a zone - Round robin can be overridden by session affinity

Managed Instance Group

- Uses an *instance template* to create a group of *identical* instances - Changes to the instance group changes all instances in the group - Can automatically scale the number of instances in the group - Work with load balancing to distribute traffic across instances - If an instance stops, crashes or is deleted the group automatically recreates the instance with the same template - Can identify and recreate unhealthy instances in a group (autohealing) - Two types: 1) Zonal, 2) Regional.

VPC: Flow Logs

- Using Flow Logs, you can monitor network traffic to and from your VMs for TCP and UDP protocols. - Enable or disable VPC Flow Logs per network subnet. - It will capture source and destination IP addresses, source and destination ports and protocol number, time stamp, number of packets, throughput, etc. - Where do you use? >>> Network monitoring, diagnostics >>> Network forensics (e.g.: which IPs talked with whom and when) >>> Real-time security analysis - This can provide real-time monitoring, correlation of events, analysis, and security alerts. >>> Cost / expense optimization. - You can view flow logs in Stackdriver Logging, and you can export logs to any destination that Stackdriver Logging export supports (3 destinations): >>> Cloud Storage Buckets >>> Cloud Pub/Sub (publish & subscribe for real time messaging) and >>> BigQuery (fully managed enterprise data warehouse).

StackDriver: Metric Latency

- VM CPU utilisation - once a minute, available with 3-4 minutes lag - If writing data programmatically to metric time series -- first write takes a few minutes to show up -- subsequent writes visible within seconds

GCP Virtual Machines: Overview

- VMs offer many useful features such as live migration which allows them to remain up even during maintenance events - Rightsizing recommendations allow you to use the right sized machines for your workloads - Google offers sustained use and committed use discounts which help reduce your cloud bill - Images help you instantiate new VMs with the OS and applications of your choice baked in

BigQuery: Estimating Costs

- When you enter a query in the web UI, the query validator verifies the query syntax and provides an estimate of the number of bytes read. You can use this estimate to calculate query cost in the pricing calculator. - When you run a query in the CLI, you can use the --dry_run flag to estimate the number of bytes read. You can use this estimate to calculate query cost in the pricing calculator.
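The same dry-run estimate is also available from the Python client; a minimal sketch (the query text is illustrative):
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

    job = client.query(
        "SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013`",
        job_config=job_config,
    )
    print("This query would read {} bytes.".format(job.total_bytes_processed))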

What is Runtime Configurator?

- Lets you define and store data as a hierarchy of key-value pairs in Google Cloud. - These key-value pairs are used to dynamically configure services, communicate service states, send notifications of changes to data, and share information between multiple tiers of services. - The Runtime Configurator also offers watcher and waiter services. *Concepts:* - Config resource - contains a hierarchical list of variables. - Variables - the key-value pairs that belong to a RuntimeConfig resource. - Watchers - use the watch() method to watch a variable and return when the variable changes. - Waiters - wait until a cardinality condition on the variables is met.

Deployment Manager: Resource

- Represents a single API resource: either one provided by a Google-managed base type, or an API resource provided by a Type Provider. - To specify a resource, provide a Type for that resource.

Cloud Key Management: Object hierarchy: Location

- Represents the geographical data-center location where requests to Cloud KMS for a given resource are handled. - Locations close to you are faster and more reliable. - Global: KMS resources in the global location are available from multiple data centers.

Google Compute Engine (IaaS)

- You need complete control over your infrastructure and direct access. - You want to tune the hardware and squeeze out the last drop of performance. - Control load balancing and scaling yourself. - Configuration, administration and management - all on you. - No need to buy the machines, install the OS, etc. - Have custom-made applications that can't be containerized.

VM: Shared Core Bursting

- f1-micro machine types offer *bursting* capabilities that allow instances to use additional physical CPU for short periods of time. - Bursting happens *automatically* when needed. - The instance will automatically take advantage of available CPU in bursts. - Bursts are not permanent, only possible *periodically*.

VM: How to reset or restart an instance

- gcloud compute instances reset example-instance - gcloud compute instances stop example-instance - gcloud compute instances start example-instance A stopped instance does not incur charges, but all of the resources that are attached to the instance will still be charged.

CloudSQL: Operate SQL DB

- gcloud sql connect jj123 --user root - give password - CREATE DATABASE jj123db; - USE jj123db; - CREATE TABLE students (studentName VARCHAR(255), idStudent INT NOT NULL AUTO_INCREMENT, PRIMARY KEY(idStudent)); - SELECT * FROM students

Cloud Spanner: Interleaved Table Example

-- Schema hierarchy:
--   + Singers
--     + Albums (interleaved table, child table of Singers)
--       + Songs (interleaved table, child table of Albums)

    CREATE TABLE Singers (
      SingerId   INT64 NOT NULL,
      FirstName  STRING(1024),
      LastName   STRING(1024),
      SingerInfo BYTES(MAX),
    ) PRIMARY KEY (SingerId);

    CREATE TABLE Albums (
      SingerId   INT64 NOT NULL,
      AlbumId    INT64 NOT NULL,
      AlbumTitle STRING(MAX),
    ) PRIMARY KEY (SingerId, AlbumId),
      INTERLEAVE IN PARENT Singers ON DELETE CASCADE;

    CREATE TABLE Songs (
      SingerId   INT64 NOT NULL,
      AlbumId    INT64 NOT NULL,
      TrackId    INT64 NOT NULL,
      SongName   STRING(MAX),
    ) PRIMARY KEY (SingerId, AlbumId, TrackId),
      INTERLEAVE IN PARENT Albums ON DELETE CASCADE;

HTTP/HTTPS Load Balancing

A global, external load balancing service offered on the GCP. Distributes HTTP(S) traffic among groups of instances based on: -- proximity to the user -- requested URL -- or both. - Traffic from the internet is sent to a global forwarding rule - this rule determines which proxy the traffic should be directed to - The global forwarding rule directs incoming requests to a target HTTP proxy - The target HTTP proxy checks each request against a URL map to determine the appropriate backend service for the request - The backend service directs each request to an appropriate backend based on serving capacity, zone, and instance health of its attached backends - The health of each backend instance is verified using either an HTTP health check or an HTTPS health check - if HTTPS, request is encrypted - Actual request distribution can happen based on CPU utilization, requests per instance - Can configure the managed instance groups making up the backend to scale as the traffic scales (based on the parameters of utilization or requests per second) - HTTPS load balancing requires the target proxy to have a signed certificate to terminate the SSL connection - Must create firewall rules to allow requests from load balancer and health checker to get through to the instances - *Session affinity:* All requests from same client to same server based on either -- client IP -- cookie

Machine Learning Algorithm

A machine learning algorithm is an algorithm that is able to learn from the data.

ML: logistic regression

A model that generates a probability for each possible discrete label value in classification problems by applying a sigmoid function to a linear prediction. Although logistic regression is often used in binary classification problems, it can also be used in multi-class classification problems (where it becomes called multi-class logistic regression or multinomial regression).
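A tiny numeric sketch of the idea: a sigmoid squashes a linear prediction into a probability, which a classification threshold then maps to a label. The weights, features and threshold below are made up for illustration.
    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    # z = w1*x1 + w2*x2 + bias  (illustrative weights and features)
    z = 0.8 * 2.0 + (-0.3) * 1.0 + 0.1
    p = sigmoid(z)                               # probability of the positive class
    label = "spam" if p >= 0.9 else "not spam"   # 0.9 is an example classification threshold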

ML: neural network

A model that, taking inspiration from the brain, is composed of layers (at least one of which is hidden) consisting of simple connected units or neurons followed by nonlinearities.

ML: neuron

A node in a neural network, typically taking in multiple input values and generating one output value. The neuron calculates the output value by applying an activation function (nonlinear transformation) to a weighted sum of input values.

ML: classification threshold

A scalar-value criterion that is applied to a model's predicted score in order to separate the positive class from the negative class. Used when mapping logistic regression results to binary classification. For example, consider a logistic regression model that determines the probability of a given email message being spam. If the classification threshold is 0.9, then logistic regression values above 0.9 are classified as spam and those below 0.9 are classified as not spam.

Cloud Spanner: Secondary Indices

A secondary index is helpful for quickly looking up data when searching by one or more non-key columns. - Like in HBase, key-based storage ensures fast sequential scan of keys >>> Remember that tables must have primary keys - Unlike in HBase, can also add secondary indices >>> Might cause same data to be stored twice - Fine-grained control on use of indices >>> Force query to use a specific index (index directives) >>> Force column to be copied into a secondary index (STORING clause) Example:
    CREATE INDEX AlbumsByAlbumTitle ON Albums(AlbumTitle);

    SELECT AlbumId, AlbumTitle, MarketingBudget
    FROM Albums@{FORCE_INDEX=AlbumsByAlbumTitle}
    WHERE SingerId = 1 AND AlbumTitle >= 'Aardvark' AND AlbumTitle < 'Goo'

TensorFlow: tf.data.Dataset

A tf.data.Dataset represents a sequence of elements, in which each element contains one or more Tensor objects. For example, in an image pipeline, an element might be a single training example, with a pair of tensors representing the image data and a label. There are two distinct ways to create a dataset: - Creating a *source* (e.g. Dataset.from_tensor_slices()) constructs a dataset from one or more tf.Tensor objects. - Applying a *transformation* (e.g. Dataset.batch()) constructs a dataset from one or more tf.data.Dataset objects.
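A minimal sketch of building a Dataset from in-memory tensors and applying transformations; the feature and label values are illustrative.
    import tensorflow as tf

    features = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    labels = tf.constant([0, 1, 0])

    # Source: each element is a (features, label) pair of tensors
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))

    # Transformations: each call returns a new Dataset
    dataset = dataset.shuffle(buffer_size=3).batch(2)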

TensorFlow: tf.data.Iterator

A tf.data.Iterator provides the main way to extract elements from a dataset. The operation returned by Iterator.get_next() yields the next element of a Dataset when executed, and typically acts as the interface between input pipeline code and your model. The simplest iterator is a "one-shot iterator", which is associated with a particular Dataset and iterates through it once. For more sophisticated uses, the Iterator.initializer operation enables you to reinitialize and parameterize an iterator with different datasets, so that you can, for example, iterate over training and validation data multiple times in the same program.
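A sketch of a one-shot iterator in the TensorFlow 1.x style used by these notes; the dataset contents are illustrative.
    import tensorflow as tf

    dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4])
    iterator = dataset.make_one_shot_iterator()   # iterates through the Dataset exactly once
    next_element = iterator.get_next()

    with tf.Session() as sess:
        while True:
            try:
                print(sess.run(next_element))
            except tf.errors.OutOfRangeError:     # raised when the dataset is exhausted
                break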

ML: binary classification

A type of classification task that outputs one of two mutually exclusive classes. For example, a machine learning model that evaluates email messages and outputs either "spam" or "not spam" is a binary classifier.

BigQuery: Views

A view is a virtual table defined by a SQL query. When you create a view, you query it in the same way you query a table. When a user queries the view, the query results contain data only from the tables and fields specified in the query that defines the view. - Can't assign access control - based on user running view - Can create *authorised view*: share query results with groups without giving read access to underlying data - Can give *row-level permissions* to different users within same view - Can't export data from a view - Can't use JSON API to retrieve data - Can't mix standard and legacy SQL, e.g., standard SQL query can't access legacy-SQL view. - No user-defined functions allowed - No wildcard table references allowed - Limit of 1000 authorized views per data set. Queries are billed according to the total amount of data in all table fields referenced directly or indirectly by the top-level query.

Dataproc: Using Preemptible VMs

All preemptible instances added to a cluster use the machine type of the cluster's non-preemptible worker nodes. The addition or removal of preemptible workers from a cluster does not affect the number of non-preemptible workers in the cluster. Cloud Dataproc adds preemptible instances as secondary workers in a *managed instance group*, which contains only preemptible workers. The managed group automatically re-adds workers lost due to reclamation as capacity permits. *Rules* for using preemptible workers with a Cloud Dataproc cluster: - *Processing only* — Since preemptibles can be reclaimed at any time, preemptible workers do not store data. Preemptibles added to a Cloud Dataproc cluster only function as processing nodes. - *No preemptible-only clusters* — To ensure clusters do not lose all workers, Cloud Dataproc cannot create preemptible-only clusters. Cloud Dataproc will automatically add two non-preemptible workers to the cluster. - *Persistent disk size* — As a default, all preemptible workers are created with the smaller of 100GB or the primary worker boot disk size.

ML: Examples

An *example* is a particular instance of data, *x*. We break examples into two categories: - labeled examples - unlabeled examples A *labeled example* includes both feature(s) and the label. Use labeled examples to *train* the model. In our spam detector example, the labeled examples would be individual emails that users have explicitly marked as "spam" or "not spam." An *unlabeled example* contains features but not the label. Once we've trained our model with labeled examples, we use that model to predict the label on unlabeled examples. In the spam detector, unlabeled examples are new emails that humans haven't yet labeled.

BigTable: Routing policy

An app profile specifies the routing policy that Cloud Bigtable should use for each request: - *Single-cluster routing* routes all requests to 1 cluster in your instance. - *Multi-cluster routing* tells Cloud Bigtable that it can route each request to any available cluster. If one cluster becomes unavailable, and an app profile uses multi-cluster routing, any traffic that uses that app profile automatically fails over to the other cluster. In contrast, if an app profile uses single-cluster routing, you must manually fail over.

DataStore: Index

An index is defined on a list of properties of a given entity kind, with a corresponding order (ascending or descending) for each property. For use with ancestor queries, the index may also optionally include an entity's ancestors. Two types of indexes: - *Built-in:* By default, Cloud Datastore automatically predefines an index for each property of each entity kind. These single property indexes are suitable for simple types of queries. - *Composite Index:* Composite indexes index multiple property values per indexed entity. Composite indexes support complex queries and are defined in an index configuration file (index.yaml)

Cloud Storage: Key Terms: Data opacity

An object's data component is completely opaque to Cloud Storage. It is just a chunk of data to Cloud Storage.

Apache Flink

Apache Flink is an open source, distributed system built using the stream-first architecture. *The stream is the source of truth* *Streaming execution model*: Processing is continuous, one event at a time. *Everything is a stream*: Batch processing with bounded datasets are a special case of the unbounded dataset. *Same engine for all*: Streaming and batch APIs both use the same underlying architecture

Cloud Storage: Load redistribution time

As a bucket approaches its IO capacity limit, Cloud Storage typically takes on the order of minutes to detect and accordingly redistribute the load across more servers. Consequently, if the request rate on your bucket increases faster than Cloud Storage can perform this redistribution, you may run into temporary limits, specifically higher latency and error rates. Ramping up the request rate gradually for your buckets avoids such latency and errors.

Cloud Spanner: Benefits of replication

Benefits of Cloud Spanner replication include: - *Data availability*: Having more copies of your data makes the data more available to clients that want to read it. Also, Cloud Spanner can still serve writes even if some of the replicas are unavailable, because only a majority of voting replicas are required in order to commit a write. - *Geographic locality*: Having the ability to place data across different regions and continents with Cloud Spanner means data can be geographically closer to the users and services that need it. - *Single database experience*: Because of the synchronous replication and global strong consistency, at any scale Cloud Spanner behaves the same, delivering a single database experience. - *Easier application development*: Cloud Spanner's ACID transactions with global strong consistency means developers don't have to add extra logic in the applications to deal with eventual consistency, making application development and subsequent maintenance faster and easier.

BigQuery: Table Types

BigQuery supports the following table types: - Native tables: tables backed by native BigQuery storage. - External tables: tables backed by storage external to BigQuery. BigTable, Cloud Storage, Google Drive - Views: Virtual tables defined by a SQL query.

Cloud Storage: Key Terms: Buckets

Buckets are the basic containers that hold your data. Everything that you store in Cloud Storage must be contained in a bucket. You can use buckets to organize your data and control access to your data, but unlike directories and folders, you cannot nest buckets. When you create a bucket, you specify a globally-unique name, a geographic location where the bucket and its contents are stored, and a default storage class. The default storage class you choose applies to objects added to the bucket that don't have a storage class specified explicitly. *Bucket names* Bucket names have more restrictions than object names and must be globally unique, because every bucket resides in a single Cloud Storage namespace. *Bucket labels* Bucket labels are key:value metadata pairs that allow you to group your buckets along with other Google Cloud Platform resources such as virtual machine instances and persistent disks.

ML: gradient clipping

Capping gradient values before applying them. Gradient clipping helps ensure numerical stability and prevents exploding gradients.

Cloud Bigtable and other storage options

Cloud Bigtable is not a relational database; it does not support SQL queries or joins, nor does it support multi-row transactions. Also, it is not a good solution for storing less than 1 TB of data. - If you need full SQL support for an online transaction processing (OLTP) system, consider Cloud Spanner or Cloud SQL. - If you need interactive querying in an online analytical processing (OLAP) system, consider BigQuery. - If you need to store immutable blobs larger than 10 MB, such as large images or movies, consider Cloud Storage. - If you need to store highly structured objects in a document database, with support for ACID transactions and SQL-like queries, consider Cloud Datastore.

Cloud Datalab: Usage Scenarios

Cloud Datalab is an interactive data analysis and machine learning environment designed for Google Cloud Platform. You can use it to explore, analyze, transform, and visualize your data interactively and to build machine learning models from your data. A few ideas to get you started: - Write a few SQL queries to explore the data in BigQuery. Put the results in a Dataframe and visualize them as a histogram or a line chart. - Read data from a CSV file in Google Cloud Storage and put it in a Dataframe to compute statistical measures such as mean, standard deviation, and quantiles using Python. - Try a TensorFlow or scikit-learn model to predict results or classify data.

Cloud KMS: Overview of secret management

Cloud KMS (Key Management Service) allows you to keep encryption keys in one central cloud service, for direct use by other cloud resources and applications. With Cloud KMS you are the ultimate custodian of your data, you can manage encryption in the cloud the same way you do on-premises, and you have a provable and monitorable root of trust over your data. - Common ways of storing secrets include in code or binaries, in deployment management, etc. - Security concerns include authorisation, verification of usage, encryption at rest, rotation and isolation. - Consistency and version management describe the functionality concerns of secret management.

Cloud ML Engine Overview

Cloud ML Engine mainly does two things: - Enables you to train machine learning models at scale by running TensorFlow training applications in the cloud. - Hosts those trained models for you in the cloud so that you can use them to get predictions about new data. Cloud ML Engine manages the computing resources that your training job needs to run, so you can focus more on your model than on hardware configuration or resource management.

Cloud Spanner: Replication

Cloud Spanner automatically gets replication at the byte level from the underlying distributed filesystem that it's built on. Cloud Spanner writes database mutations to files in this filesystem, and the filesystem takes care of replicating and recovering the files when a machine or disk fails. Cloud Spanner also replicates data to provide the additional benefits of data availability and geographic locality. - Cloud Spanner creates multiple copies, or "replicas," of these rows, then stores these replicas in different geographic areas. - Cloud Spanner uses a synchronous, *Paxos*-based replication scheme, in which voting replicas take a vote on every write request before the write is committed. - This property of globally synchronous replication gives you the ability to read the most up-to-date data from any Cloud Spanner read-write or read-only replica.

Cloud Spanner

Cloud Spanner is a fully managed, mission-critical, relational database service that offers transactional consistency at global scale, schemas, SQL (ANSI 2011 with extensions), and automatic, synchronous replication for high availability. Cloud Spanner offers: - *Strong consistency*, including strongly consistent secondary indexes. - *SQL support*, with ALTER statements for schema changes. - *Managed instances with high availability* through transparent, synchronous, built-in data replication. Cloud Spanner offers regional and multi-region instance configurations. Use when you need high availability, strong consistency, transactional reads and writes (especially writes!). *Don't use if* - Data is not relational, or not even structured - Want an open source RDBMS - Strong consistency and availability is overkill

Cloud Spanner: Splits

Cloud Spanner lets you define hierarchies of parent-child relationships between tables up to seven layers deep, which means you can co-locate rows of seven logically independent tables. As your database grows, Cloud Spanner divides your data into chunks called "splits", where individual splits can move independently from each other and get assigned to different servers, which can be in different physical locations. A split is defined as a range of rows in a top-level table, where the rows are ordered by primary key. The start and end keys of this range are called "split boundaries". Cloud Spanner automatically adds and removes split boundaries, which changes the number of splits in the database. Cloud Spanner splits data based on load: it adds split boundaries automatically when it detects high read or write load spread among many keys in a split. The parent-child table relationships that you define, along with the primary key values that you set for rows of related tables, give you control over how data is split under the hood.

Cloud Storage: Object key indexing

Cloud Storage supports consistent object listing, which enables users to run data processing workflows easily against Cloud Storage. In order to provide consistent object listing, Cloud Storage maintains an index of object keys for each bucket. This index is stored in lexicographical order and is updated whenever objects are written to or deleted from a bucket. Adding and deleting objects whose keys all exist in a small range of the index naturally increases the chances of contention. Cloud Storage detects such contention and automatically redistributes the load on the affected index range across multiple servers. Similar to scaling a bucket's IO capacity, when accessing a new range of the index, such as when writing objects under a new prefix, you should ramp up the request rate gradually. Not doing so may result in temporarily higher latency and error rates.

DataFlow: Transforms: Core Transforms

Core transforms form the basic building blocks of pipeline processing. Each core transform provides a generic processing framework for applying business logic that you provide to the elements of a PCollection. When you use a core transform, *you provide the processing logic as a function object*. The function you provide gets applied to the elements of the input PCollection(s). Instances of the function may be executed in parallel across multiple Google Compute Engine instances, given a large enough data set, and subject to optimizations performed by the pipeline runner service. Your function produces the output elements, if any, which are added to the output PCollection(s). Requirements for user-provided function objects: - Your function object must be *serializable*. - Your function object must be *thread-compatible*; be aware that the Dataflow SDKs are not thread-safe. - We recommend making your function object *idempotent*.
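As an illustration, here is a minimal sketch of a core transform (ParDo) written with the Apache Beam Python SDK; this is an assumption on my part, since the card's requirements were phrased for the Dataflow SDKs, and the pipeline contents and class name below are hypothetical.

```python
import apache_beam as beam

class ExtractWordsFn(beam.DoFn):
    """User-provided function object: keep it serializable, thread-compatible, and idempotent."""
    def process(self, element):
        # Emit zero or more output elements for each input element.
        for word in element.split():
            yield word

with beam.Pipeline() as p:
    lines = p | "Create" >> beam.Create(["hello world", "core transforms example"])
    words = lines | "ExtractWords" >> beam.ParDo(ExtractWordsFn())
    words | "Print" >> beam.Map(print)
```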

Google App Engine

Fully managed serverless application platform - Just write the code - leave the rest to the platform - Open & familiar languages and tools - Pay only for what you use - Comparable platforms: Heroku, Engine Yard - Supports large codebases in many languages You can run your applications in App Engine using the *flexible environment* or *standard environment*. You can also choose to simultaneously use both environments for your application and allow your services to take advantage of each environment's individual benefits.

Google Cloud Launcher

Google Cloud Launcher offers ready-to-go development stacks, solutions, and services to accelerate development. - Deploy production-grade solutions in a few clicks - Single bill for all your GCP and 3rd party services - Manage solutions using Deployment Manager - Notifications when a security update is available - Direct access to partner support

Load Balancer

Google Cloud Platform Load Balancing gives you the ability to distribute load-balanced compute resources in single or multiple regions, to meet your high availability requirements, to put your resources behind a single anycast IP and to scale your resources up or down with intelligent Autoscaling. Cloud Load Balancing is fully integrated with Cloud CDN for optimal content delivery. Cloud load balancers can be divided up as follows: - Global versus regional load balancing - External versus internal load balancing - Traffic type

Google Transfer Appliance

Google Transfer Appliance is a high capacity storage server that enables you to transfer up to one petabyte of data on a single appliance and securely ship it to a Google upload facility, where the data is uploaded to Google Cloud Storage. You can serially lease multiple Transfer Appliances if your data size exceeds one petabyte. Transfer Appliance offers two models: - The rackable 100 terabyte (TB), which stores from 100 TB up to potentially 200 TB of data, depending on the deduplication and compression ratio of your data. - The standalone 480 TB, which stores from 480 TB up to potentially 1 petabyte (PB).

DataFlow (Apache Beam): Side Inputs

In addition to the main input PCollection, you can provide additional inputs to a ParDo transform in the form of *side inputs*. A side input is an additional input that your DoFn can access each time it processes an element in the input PCollection. When you specify a side input, you create a view of some other data that can be read from within the ParDo transform's DoFn while processing each element. Side inputs are useful if your ParDo needs to inject additional data when processing each element in the input PCollection, but the additional data needs to be determined at runtime (and not hard-coded). Such values might be determined by the input data, or depend on a different branch of your pipeline. There is a fundamental difference between side inputs and main inputs. Main inputs are sharded across multiple worker instances in your Dataflow job, so each element is read only once for the entire job. With side inputs, each worker may read the same elements multiple times.
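A minimal sketch of a side input using the Apache Beam Python SDK (an assumption; the card describes Dataflow SDK behavior). The mean word length computed on one branch of the pipeline is made available to every worker processing the main input:

```python
import apache_beam as beam
from apache_beam.pvalue import AsSingleton

with beam.Pipeline() as p:
    words = p | "Words" >> beam.Create(["a", "bb", "ccc", "dddd"])

    # A value determined at runtime on a different branch of the pipeline.
    mean_len = words | beam.Map(len) | beam.combiners.Mean.Globally()

    # Each worker reads the side input while processing each main-input element.
    longer = words | beam.Filter(
        lambda word, avg: len(word) > avg, avg=AsSingleton(mean_len))
    longer | beam.Map(print)
```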

ML: Convergence

Informally, often refers to a state reached during training in which training loss and validation loss change very little or not at all with each iteration after a certain number of iterations. In other words, a model reaches convergence when additional training on the current data will not improve the model. In deep learning, loss values sometimes stay constant or nearly so for many iterations before finally descending, temporarily producing a false sense of convergence.

Apache Flink Stack

Layer 3a (libraries on DataStream): - CEP (event processing) - Table (relational) Layer 3b (libraries on DataSet): - Table (relational) - Gelly (graph) - FlinkML (machine learning) Layer 2 (core APIs): - DataStream API (stream processing) - DataSet API (batch processing) Layer 1: - Runtime Layer 0 (deployment): - Local (single JVM) - Cluster (standalone, YARN) - Cloud (GCE, EC2)

VPC: Subnets

Logical partitioning of the network - Defined by an IP address prefix range - Specified in CIDR notation - IP ranges cannot overlap between subnets - Subnets in GCP can contain resources only from a single region CIDR notation - 10.123.9.0/24 - Contains all IP addresses in the range 10.123.9.0 to 10.123.9.255 - the /24 is the number of bits in the network prefix - Each subnet has a contiguous private RFC1918 IP space
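The CIDR arithmetic can be checked with Python's standard ipaddress module; a small sketch using the example range above:

```python
import ipaddress

# /24 means the first 24 bits are the network prefix, leaving 256 addresses.
subnet = ipaddress.ip_network("10.123.9.0/24")
print(subnet.network_address)     # 10.123.9.0
print(subnet.broadcast_address)   # 10.123.9.255
print(subnet.num_addresses)       # 256

# Ranges may not overlap between subnets in the same VPC.
other = ipaddress.ip_network("10.123.9.128/25")
print(subnet.overlaps(other))     # True -> these two could not coexist as separate subnets
```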

OLTP

Online Transactional Processing

VPC: Firewall Rules

Protects your virtual machine (VM) instances from unapproved connections, both inbound (ingress) and outbound (egress). You can create firewall rules to allow or deny specific connections based on a combination of IP addresses, ports, and protocol. - *Action*: allow or deny - *Direction*: ingress or egress - Source IPs (ingress), Destination IPs (egress) - Protocol and port - Specific instance names - Priorities and tiebreakers GCP firewall rules are *stateful*: if a connection is allowed, all traffic in the flow is also allowed, in *both directions*. A few kinds of traffic are always blocked (outgoing traffic to port 25 (SMTP), GRE (Generic Routing Encapsulation) traffic, etc.). A rule with a deny action overrides a rule with an allow action if the two rules have the same priority.

Dataproc: Single Node Clusters

Single node clusters are Cloud Dataproc clusters with only one node. This single node acts as the master and worker for your Cloud Dataproc cluster. There are a number of situations where single node Cloud Dataproc clusters can be *useful*, including: - Trying out new versions of Spark and Hadoop or other open source components - Building proof-of-concept (PoC) demonstrations - Lightweight data science - Small-scale non-critical data processing - Education related to the Spark and Hadoop ecosystem *Limitations:* - Single node clusters are not recommended for large-scale parallel data processing. - n1-standard-1 machine types have limited resources and are not recommended for YARN applications. - Single node clusters are not available with high-availability since there is only one node in the cluster. - Single node clusters cannot use preemptible VMs.

Dataproc: Cluster Web Interfaces

Some of the core open source components included with Google Cloud Dataproc clusters, such as Apache Hadoop and Apache Spark, provide Web interfaces. These interfaces can be used to manage and monitor cluster resources and facilities, such as the YARN resource manager, the Hadoop Distributed File System (HDFS), MapReduce, and Spark. Other components or applications that you install on your cluster may also provide Web interfaces (for example, Install and run a Jupyter notebook on a Cloud Dataproc cluster). *YARN ResourceManager*: http://master-host-name:8088 *HDFS NameNode*: http://master-host-name:9870 (In earlier Cloud Dataproc releases (pre-1.2), the HDFS Namenode Web UI port was 50070)

Hadoop Ecosystem: Kafka

Stream processing for unbounded datasets

BigQuery Data Transfer Service

The BigQuery Data Transfer Service automates data movement from Software as a Service (SaaS) applications such as Google AdWords and DoubleClick on a scheduled, managed basis. Your analytics team can lay the foundation for a data warehouse without writing a single line of code. You can access the BigQuery Data Transfer Service using the: - BigQuery web UI - BigQuery command-line tool - BigQuery Data Transfer Service API After you configure a data transfer, the BigQuery Data Transfer Service automatically loads data into BigQuery on a regular basis. You can also initiate data backfills to recover from any outages or gaps. Currently, you cannot use the BigQuery Data Transfer Service to transfer data out of BigQuery.

Dataproc: Cloud Storage connector

The Cloud Storage connector is an open source Java library that lets you run Apache Hadoop or Apache Spark jobs directly on data in Cloud Storage, and offers a number of benefits over choosing the Hadoop Distributed File System (HDFS). Benefits of the Cloud Storage connector: - *Direct data access* - Store your data in Cloud Storage and access it directly, with no need to transfer it into HDFS first. - *HDFS compatibility* - You can easily access your data in Cloud Storage using the gs:// prefix instead of hdfs://. - *Interoperability* - Storing data in Cloud Storage enables seamless interoperability between Spark, Hadoop, and Google services. - *Data accessibility* - When you shut down a Hadoop cluster, you still have access to your data in Cloud Storage, unlike HDFS. - *High data availability* - Data stored in Cloud Storage is highly available and globally replicated without a loss of performance. - *No storage management overhead* - Unlike HDFS, Cloud Storage requires no routine maintenance such as checking the file system, upgrading or rolling back to a previous version of the file system, etc. - *Quick startup* - In HDFS, a MapReduce job can't start until the NameNode is out of safe mode - a process that can take from a few seconds to many minutes depending on the size and state of your data. With Cloud Storage, you can start your job as soon as the task nodes start, leading to significant cost savings over time.
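For example, on a Dataproc cluster (where the connector is pre-installed) a Spark job can read and write gs:// paths exactly where it would otherwise use hdfs:// paths. A minimal PySpark sketch; the bucket and paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-connector-demo").getOrCreate()

# Read directly from Cloud Storage - no copy into HDFS needed.
events = spark.read.json("gs://my-example-bucket/events/*.json")

# Write results back to Cloud Storage; they remain available after the cluster is deleted.
events.groupBy("event_type").count() \
      .write.parquet("gs://my-example-bucket/output/event_counts")
```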

Cloud Data Transfer Use Cases - Data Center Migration

The data you create and store on-premises takes relentless focus and significant resources to manage cost-effectively, securely, and reliably. As organizations face exponential growth of their data, many are turning to the cloud to scale with them. For your structured and unstructured data sets, whether they are small and frequently accessed or huge and rarely referenced, Google offers solutions to migrate that data quickly to Google Cloud Storage, BigQuery, or Dataproc.

Cloud ML: Prepare your trainer and data for the cloud

The key to getting started with Cloud ML Engine is your training application, written in TensorFlow. You can develop your trainer as you would any other TensorFlow application, but you need to follow a few guidelines about your approach to work well with cloud training. You must make your trainer into a Python package and stage it on Google Cloud Storage where your training job can access it. As with your application package, your data must be stored where Cloud ML Engine can access it. The easiest solution is to store your data in Google Cloud Storage in a bucket associated with the same project that you use for Cloud ML Engine tasks.
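A minimal sketch of turning the trainer into a Python package with a setup.py; the package name, version, and dependency list are hypothetical and would match your own trainer module:

```python
# setup.py (placed at the root of the trainer package)
from setuptools import find_packages, setup

setup(
    name="trainer",
    version="0.1",
    packages=find_packages(),
    install_requires=["tensorflow"],  # installed on the training workers
    description="Example TensorFlow trainer packaged for Cloud ML Engine",
)
```

The resulting package is staged in a Cloud Storage bucket in the same project, alongside the training data, so the training job can access both.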

Cloud Spanner: Choosing a primary key

The primary key uniquely identifies each row in a table. If you want to update or delete existing rows in a table, then the table must have a primary key composed of one or more columns. Often your application already has a field that's a natural fit for use as the primary key. There are techniques that can spread the load across multiple servers and avoid hotspots: - *Hash the key* and store it in a column. Use the hash column as the primary key. - *Swap the order* of the columns in the primary key. - *Use a Universally Unique Identifier (UUID)*. Version 4 UUID is recommended, because it uses random values in the high-order bits. Don't use a UUID algorithm that stores the timestamp in the high order bits. - *Bit-reverse* sequential values.
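A short sketch of the hash, UUID, and bit-reverse techniques in plain Python (the helper names are hypothetical):

```python
import hashlib
import uuid

def hashed_key(natural_key: str) -> str:
    """Hash the natural key and use the hash column as (part of) the primary key."""
    return hashlib.sha256(natural_key.encode()).hexdigest()[:16]

def bit_reversed(sequence_value: int, bits: int = 64) -> int:
    """Bit-reverse a sequential value so consecutive IDs spread across splits."""
    return int(format(sequence_value, f"0{bits}b")[::-1], 2)

print(hashed_key("customer-42"))   # e.g. a short hex digest
print(bit_reversed(1001))          # large, non-monotonic value
print(str(uuid.uuid4()))           # v4 UUID: random high-order bits, no timestamp prefix
```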

ML: Basic Assumptions in Supervised Learning

Three basic assumptions in all of supervised learning: - We draw examples *independently and identically distributed (i.i.d.)* at random from the distribution - The distribution is *stationary*: it doesn't change over time - We always pull from the *same distribution*: including training, validation, and test sets

Cloud Data Transfer Use Cases - Content Storage and Delivery

To serve users around the world with the highest availability, Google offers multi-regional setups designed for video streaming and frequently accessed content like web sites and images. For analytics and batch processing, regional setups are available to meet the unique requirements of those workloads. For content-rich use cases like these you can choose a data transfer option that will have minimal impact on your network while moving large amounts of data.

GKE: Container Builder

Tool that executes your container image builds on Google Cloud Platform's infrastructure - How it works: >> imports source code from a variety of repositories or cloud storage spaces >> executes a build to your specifications >> produces artifacts such as Docker containers or Java archives

Traditional RDBMS vs. DataStore: Differences

Traditional RDBMS: - Structured relational data - Rows stored in Tables - Rows consist of fields - Primary Keys for unique ID - Rows of table have same properties (Schema is strongly enforced) - Types of all values in a column are the same - Lots of joins - Filtering on subqueries - Multiple inequality conditions DataStore: - Structured hierarchical data (XML, HTML) - Entities of different Kinds (think HTML tags) - Entities consist of Properties - Keys for unique ID - Entities of a kind can have different properties (think optional tags in HTML) - Types of different properties with same name in an entity can be different. - No joins - No filtering on subqueries - Only one inequality filter OK per query
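To make the query restrictions concrete, here is a hedged sketch using the google-cloud-datastore Python client; the kind and property names are hypothetical:

```python
from google.cloud import datastore

client = datastore.Client()

# Entities of kind "Task" may each carry different properties.
query = client.query(kind="Task")
query.add_filter("done", "=", False)      # equality filters: as many as you like
query.add_filter("priority", ">=", 4)     # inequality filter: only one property per query
tasks = list(query.fetch(limit=10))
```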

Is Transfer Appliance suitable for me?

Transfer Appliance is a good fit for your data transfer needs if: - You are an existing Google Cloud Platform (GCP) customer. - Your data resides in the United States. - Your data size is greater than or equal to 20TB. - You don't require HIPAA compliance.

BigQuery: Performance Factors

When evaluating query performance in BigQuery, the amount of work required depends on a number of factors: - *Input data and data sources (I/O):* How many bytes does your query read? - *Communication between nodes (shuffling):* How many bytes does your query pass to the next stage? How many bytes does your query pass to each slot? - *Computation:* How much CPU work does your query require? - *Outputs (materialization):* How many bytes does your query write? - *Query anti-patterns:* Are your queries following SQL best practices?

BigQuery: Performance: Optimizing Communication Between Slots

When evaluating your communication throughput, consider the amount of shuffling that is required by your query. How many bytes are passed between stages? How many bytes are passed to each slot? For example, a *GROUP BY* clause passes like values to the same slot for processing. The amount of data that is shuffled directly impacts communication throughput and as a result, query performance. *Best practice:* - Trim the data as early in the query as possible, before the query performs a JOIN. If you reduce data early in the processing cycle, shuffling and other complex operations only execute on the data that you need. - WITH clauses are used primarily for readability because they are not materialized. For example, placing all your queries in WITH clauses and then running UNION ALL is a misuse of the WITH clause. If a query appears in more than one WITH clause, it executes in each clause. - Partitioned Tables perform better than date-named tables. When you create tables sharded by date, BigQuery must maintain a copy of the schema and metadata for each date-named table. Also, when date-named tables are used, BigQuery might be required to verify permissions for each queried table. This practice also adds to query overhead and impacts query performance. - Table sharding refers to dividing large datasets into separate tables and adding a suffix to each table name. Avoid creating too many table shards. If you are sharding tables by date, use time-partitioned tables instead.
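For instance, trimming and projecting each side before the JOIN reduces the bytes shuffled between slots. A hedged sketch with the google-cloud-bigquery Python client; the project, dataset, and table names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Filter and project each input *before* joining so less data is shuffled.
sql = """
SELECT o.order_id, c.country
FROM (
  SELECT order_id, customer_id
  FROM `my-project.sales.orders`
  WHERE order_date >= '2018-01-01'
) AS o
JOIN (
  SELECT customer_id, country
  FROM `my-project.sales.customers`
) AS c
USING (customer_id)
"""
rows = client.query(sql).result()
```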

BigQuery: Performance: Input Data and Data Sources

When evaluating your input data, consider the required I/O. How many bytes does your query read? Are you properly limiting the amount of input data? Is your data in native BigQuery storage or an external data source? The amount of data read by a query and the source of the data impact query performance and cost. You can examine how the input data is read by a query by using the query plan explanation. *Best Practices:* - Control projection - Query only the columns that you need. Avoid SELECT * - When querying a time-partitioned table, use the _PARTITIONTIME pseudo column to filter the partitions. - BigQuery performs best when your data is denormalized. Rather than preserving a relational schema such as a star or snowflake schema, denormalize your data and take advantage of nested and repeated fields. - Querying tables in BigQuery managed storage is typically much faster than querying external tables in Google Cloud Storage, Google Drive, or Google Cloud Bigtable. Use an external data source for these use cases: >>> Performing extract, transform, and load (ETL) operations when loading data >>> Frequently changing data >>> Periodic loads such as recurring ingestion of data from Cloud Bigtable - Use wildcards to query multiple tables by using concise SQL statements. Wildcard tables are a union of tables that match the wildcard expression. Wildcard tables are useful if your dataset contains: >>> Multiple, similarly named tables with compatible schemas >>> Sharded tables
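A hedged sketch of controlling projection and pruning partitions with _PARTITIONTIME, using the google-cloud-bigquery Python client; the table name is hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT user_id, event_name            -- query only the columns you need; avoid SELECT *
FROM `my-project.analytics.events`    -- a time-partitioned table
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2018-06-01') AND TIMESTAMP('2018-06-07')
"""
for row in client.query(sql).result():
    print(row.user_id, row.event_name)
```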

BigQuery: Performance: Managing Query Outputs

When evaluating your output data, consider the number of bytes written by your query. How many bytes are written for your result set? Are you properly limiting the amount of data written? Are you repeatedly writing the same data? The amount of data written by a query impacts query performance (I/O). If you are writing results to a permanent (destination) table, the amount of data written also has a cost. *Best practice:* - Avoid repeatedly joining the same tables and using the same subqueries. - Carefully consider materializing large result sets to a destination table. Writing large result sets has performance and cost impacts. - If you are sorting a very large number of values, use a LIMIT clause.
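When a large result set really does need to be materialized, it can be written to an explicit destination table rather than recomputed. A hedged sketch with the google-cloud-bigquery Python client; the table names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string("my-project.reporting.top_customers"),
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
sql = """
SELECT customer_id, SUM(amount) AS lifetime_value
FROM `my-project.sales.orders`
GROUP BY customer_id
ORDER BY lifetime_value DESC
LIMIT 1000                          -- bound the number of sorted values written
"""
client.query(sql, job_config=job_config).result()
```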

ML: NaN trap

When one number in your model becomes a NaN during training, which causes many or all other numbers in your model to eventually become a NaN. NaN is an abbreviation for "Not a Number."

DataFlow (Apache Beam): Programming Model: Major Concepts

When you think about data processing with Dataflow, you can think in terms of four major concepts: - *Pipeline*: a single, potentially repeatable job, from start to finish, in Dataflow - *PCollection*: specialized container classes that can represent data sets of virtually unlimited size - *Transforms*: take one or more PCollections as input, perform a processing function that you provide on the elements of that PCollection, and produce an output PCollection - *I/O Sources & Sinks*: different data storage formats, such as files in Google Cloud Storage, or BigQuery tables
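A minimal sketch touching all four concepts, written with the Apache Beam Python SDK (an assumption; a local text file stands in for a Cloud Storage or BigQuery sink):

```python
import apache_beam as beam

# Pipeline: the single job, from source to sink.
with beam.Pipeline() as p:
    # PCollection: created here from an in-memory source.
    numbers = p | "Source" >> beam.Create([1, 2, 3, 4, 5])
    # Transform: applies your logic and produces a new PCollection.
    squares = numbers | "Square" >> beam.Map(lambda x: x * x)
    # Sink: writes the results out.
    squares | "Sink" >> beam.io.WriteToText("/tmp/squares")
```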

Application Default Credentials

When your code uses a client library, Application Default Credentials (ADC) checks for your credentials in the following order: - First, ADC checks whether the environment variable GOOGLE_APPLICATION_CREDENTIALS is set. If the variable is set, ADC uses the service account key file that the variable points to. - If the environment variable isn't set, ADC uses the default service account that Compute Engine, Container Engine, App Engine, and Cloud Functions provide, for applications that run on those services. - If ADC can't use either of the above credentials, an error occurs.
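A small sketch of letting ADC resolve credentials in Python; google.auth.default() follows the same order, and client libraries apply it implicitly when constructed without explicit credentials:

```python
import google.auth
from google.cloud import storage

# Resolves credentials: GOOGLE_APPLICATION_CREDENTIALS first,
# then the platform's default service account, else an error is raised.
credentials, project_id = google.auth.default()
print("Resolved project:", project_id)

# Client libraries use the same strategy when no credentials are passed.
client = storage.Client()
```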

Cloud Data Transfer Use Cases - Backup and Archival

To keep your data available even when outages occur, you can use Google's data transfer services to easily back up data from another cloud storage provider to Google Cloud Storage. You can retain that data cost-effectively by taking advantage of the ultra low-cost, highly durable, and highly available archival storage offered through Google's Nearline and Coldline storage classes. Object lifecycle management automates the transition, moving data from one storage class to the next depending on your business's cost and availability needs at the time.

Google Data Studio: What you can filter?

You can apply filters to the following components: - *Charts*. For example, you can display a pie chart of new versus returning users in your biggest markets with a filter including Country IN "United States,Canada,Mexico,Japan" - *Filter controls*. For example, you can let your viewers select from a list of best selling products on Quantity Sold Greater than (>) 100. - *Groups*. For example, you can group 2 sets of charts and filter on Device Category to show website traffic in one set, and on the other to show mobile traffic. - *Pages*. Page-level filters apply to every chart on that page. For example, you can dedicate page 1 of your Google Analytics report to mobile app traffic, and page 2 to desktop traffic by filtering on the Device Category dimension. - *Reports*. Every chart in the report is subject to the filter. For example, you can create a report that focuses on your best customers by setting the report-level filter property to Lifetime Value Greater than or equal to 10,000.

ML: batch & step

batch: The set of examples used in one *iteration* (that is, one *gradient* update) of *model training*. batch size: The number of examples in a *batch*. For example, the batch size of *SGD* is 1, while the batch size of a *mini-batch* is usually between 10 and 1000. Batch size is usually fixed during training and inference; however, TensorFlow does permit dynamic batch sizes. step: A forward and backward evaluation of one batch. step size (learning rate): A scalar used to train a model via gradient descent. During each iteration, the gradient descent algorithm multiplies the learning rate by the gradient. The resulting product is called the *gradient step*.
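A tiny NumPy sketch of one step: the gradient step is the learning rate times the gradient, subtracted from the weights (the gradient values here are made up):

```python
import numpy as np

def sgd_step(weights, gradient, learning_rate=0.01):
    """One training step: weights minus (learning_rate * gradient)."""
    return weights - learning_rate * gradient

w = np.array([0.5, -0.3])
grad = np.array([0.2, 0.1])   # hypothetical gradient of the loss at w (batch size 1 for SGD)
w = sgd_step(w, grad)
print(w)                      # [ 0.498 -0.301]
```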

Cloud Storage: List Objects in Bucket

gsutil ls -r gs://[BUCKET_NAME]/**

