Combined Courses

Steps to stream into Bigtable using Dataflow

1. Get/create table a. Get authenticated session b. Create table 2. Convert the objects to write into Mutation(s) inside a ParDo 3. Write mutations to Bigtable

Create a Bigtable cluster using gcloud (or the web UI)

"gcloud beta bigtable instances create INSTANCE"

The need for fast decisions leads to streaming

1. Massive data, from varied sources, that keeps growing over time 2. Need to derive insights immediately in the form of dashboards 3. Need to make timely decisions

Cloud Storage

persistent storage, durable, replicated. It can be made globally available if you need to. Usually used as a first step of the data life cycle

Apache Spark

popular, flexible, powerful way to process large datasets; it's able to mix different kinds of applications and to adjust how it uses the available resources.

Precision

positive predictive value = TP / (TP + FP)

Bigtable is

probably a better place for high throughput sensor data.

Training ML

process of optimizing the weights; includes gradient descent + evaluation

Dataflow

provides a fully-managed, autoscaling execution environment for Beam pipelines

In order to do that we looked at an architecture that consisted of ingesting the

data into Pub/Sub, processing the data in-stream using Cloud Dataflow, and streaming it into BigQuery for durable storage and interactive analysis.

Compute engine

you take your workload and you just run it as is on the cloud

to associate the messages with the timestamps

you have to create that association in the PCollection. In this case, rather than use ProcessContext.output(), you can use ProcessContext.outputWithTimestamp().

Datalab applications that ran previously

1. Do not show up in Dataproc > Jobs page in Console. 2. Only applications submitted from Console are tracked in Console.

Ingesting data into a pipeline

1. Read data from file system, GCS, BigQuery, Pub/Sub a. Text formats return a String b. BigQuery returns a TableRow
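
A minimal sketch in the Beam Python SDK (where BigQuery reads come back as Python dicts rather than Java TableRow objects); the bucket, table, and topic names are placeholders, and ReadFromBigQuery assumes a newer SDK version:

import apache_beam as beam

with beam.Pipeline() as p:
    # Text sources return one string per line.
    lines = p | 'ReadText' >> beam.io.ReadFromText('gs://my-bucket/input*.csv')

    # BigQuery rows arrive as dicts in the Python SDK.
    rows = p | 'ReadBQ' >> beam.io.ReadFromBigQuery(
        gcs_location='gs://my-bucket/tmp',
        query='SELECT name, value FROM `my-project.my_dataset.my_table`',
        use_standard_sql=True)

    # Pub/Sub reads are only valid in streaming pipelines:
    # msgs = p | beam.io.ReadFromPubSub(topic='projects/my-project/topics/my-topic')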

Custom ML models

1. Tensorflow 2. Machine Learning Engine

Creating a cluster on Dataproc

1. a deployment manager template, which is an infrastructure automation service in Google Cloud 2. CLI commands 3. Google Cloud console

neural network model

A hidden layer, a layer of neurons, is a combination of neurons, all of which share the same set of inputs.

Evaluation

Is the model good enough? Has to be done on full dataset

perKey

Key-value pairs such as in tuples

learning rate

The size of the changes we make to the weights

Cloud Dataproc

cloud based implementation of Hadoop.

Bigquery details:

easy, inexpensive 1. Latency on the order of seconds 2. 100k rows/second streaming

if you need globally consistent data or more than one cloud sql instance

use cloud spanner

Train_and_evaluate Manages:

1. Distribute the graph 2. Share variables 3. Evaluate occasionally 4. Handle machine failures 5. Create checkpoint files 6. Recover from failures 7. Saves summaries for Tensor Board

This skew can be caused by

1. A discrepancy between how you handle data in the training and serving pipelines. 2. A change in the data between when you train and when you serve. 3. A feedback loop between your model and your algorithm.

Designing for Bigtable

1. A table has only one index (the row key) 2. Group related columns into column families 3. Two types of designs (wide or narrow tables) 4. Rows are sorted lexicographically by row key, from lowest to highest bytes 5. Queries that use the row key, a row prefix, or a row range are the most efficient 6. Store related entities in adjacent rows 7. Distribute your writes and reads across rows 8. Design row keys to avoid hotspotting

Machine Learning

1. A way to use standard algorithms to derive predictive insights from data and make repeated decisions. 2. Training adjusts the weights of this mathematical function (the ML model) so that its outputs match the labeled examples.

A dataset contains tables and views

1. Access Control Lists for Reader/Writer/Owner 2. Applied to all tables/views in dataset

Compute nodes on GCP are

1. Allocated on demand, and you pay for the time that they are up. 2. Fungible resource 3. Software packages need to be downloaded 4. Provide unlimited options because of customization!

So, you can write JavaScript UDFs, you can write SQL UDFs

1. The amount of data a UDF outputs per input row should be <5 MB 2. Each user can run 6 concurrent JavaScript UDF queries per project 3. Native-code JavaScript functions aren't supported 4. JavaScript bitwise operations handle only the most significant 32 bits 5. A query job can have a maximum of 50 JavaScript UDF resources a. Each inline code blob is limited to a maximum size of 32 KB b. Each external code resource is limited to a maximum size of 1 MB

Interactive, Iterative Development with Cloud Datalab:

1. Analyze data in BigQuery, Compute Engine, or Cloud Storage 2. Use existing Python packages

Monitor BigQuery with stackdriver

1. Available to all BigQuery customers 2. Fully interactive GUI; customers can create custom dashboards displaying up to 13 BigQuery metrics, including: a. Slot utilization b. Queries in flight c. Uploaded bytes d. Stored bytes

Bigquery uploading data:

1. Batch - web console and command line using the bq command; batch data on Cloud Storage or streaming data via Cloud Dataflow 2. Stream - stream data in with Cloud Dataflow. If, for example, you're receiving sensor data or log data in real time, you can process them with Cloud Dataflow and stream them into BigQuery. Even as the data are streaming in, you can run queries on that data. 3. Federated data source - raw form as CSV, JSON, or Avro files (including Google Sheets - query it with BigQuery)

Streaming data into Bigquery

1. BigQuery provides streaming ingestion at a rate of 100,000 rows/table/second a. Provided by the REST API's tabledata().insertAll() method b. Works for partitioned and standard tables 2. Streaming data can be queried as it arrives a. Data available within seconds 3. For data consistency, enter an insertID for each inserted row a. De-duplication is done on a best-effort basis, and can be affected by network errors b. Can be done manually
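
A hedged sketch of streaming ingest with the google-cloud-bigquery Python client; the table name and rows are made up, and row_ids plays the role of insertId for best-effort de-duplication:

from google.cloud import bigquery

client = bigquery.Client()
table_id = 'my-project.my_dataset.sensor_readings'  # hypothetical table

rows = [{'sensor_id': 's-001', 'reading': 22.5},
        {'sensor_id': 's-002', 'reading': 19.1}]

# One insertId per row; retries with the same IDs are de-duplicated best-effort.
errors = client.insert_rows_json(table_id, rows, row_ids=['evt-1001', 'evt-1002'])
if errors:
    print('Streaming insert errors:', errors)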

Pub/Sub simplifies event distribution:

1. By replacing synchronous point-to-point connections with a single highly-available asynchronous bus 2. Asynchronous -> publisher never waits a. A subscriber can get the message now or any time (within 7 days) 3. Can avoid overprovisioning for spikes with Pub/Sub

A job is a potentially long-running action

1. Can be cancelled

3 aspects to Big data

1. Can use the same tools for batch as for streaming 2. Another aspect of big data is variety: audio, video, images, unstructured text, blog posts, etc. 3. The third aspect of big data is near real-time data processing: data that's coming in so fast that you need to process it just to keep up with the data

Tips for improving performance

1. Change schema to minimize data skew 2. Takes a while after scaling up nodes for performance improvement to be seen 3. Test with > 300 GB and for minutes-to-hours 4. Disk Speed: SSD faster than HDD 5. Performance increases linearly with number of nodes 6. Make sure clients and Bigtable are in same zone

Before you begin, though, be sure to gather and prepare your data: clean, split, engineer features, and preprocess features

1. Do all this on your training data. 2. Put your data in an online source that Cloud ML can access, for example, Cloud Storage.

This module covered ways to expand the value of Dataproc by leveraging other services available on Google Cloud Platform

1. Cloud Storage is useful in many ways: not just holding data, but also holding application code and initialization scripts. It can also be used as an intermediary between BigQuery and Dataproc. 2. For more direct communication, a BigQuery connector is available in Dataproc, and you can develop an application that reads directly from BigQuery. 3. You also learned how to output shards to handle parallel processing and to accumulate those shards back into a table when importing data into BigQuery, leveraging JSON formatting and tools. 4. You learned about cluster customization with installation scripts, and how to use metadata to determine whether a script is running on a master node or a worker node 5. You learned how properties for core software can be set for some configuration files from the command line or via the REST API.

In reality, ML is

1. Collect data 2. Organize data 3. Create model 4. Use machines to flesh out the model from data 5. Deploy fleshed out model

Bigquery storage is

1. Columnar 2. Each column is stored in a separate, compressed, encrypted file that is replicated 3+ times 3. No indexes, keys, or partitions required 4. Meant for immutable, massive datasets

A table is a collection of columns

1. Columnar storage 2. Views are virtual tables defined by SQL query 3. Tables can be external (e.g., on Cloud Storage)

Load and export into BigQuery

1. Command line interface called bq (comes with the gcloud SDK) 2. Web user interface 3. Use an API: a Python API or a Dataflow API

Cloud Storage How do you get your data onto cloud storage?

1. Command line tool called gsutil (simplest; comes with the gcloud SDK) 2. So whichever machine you're going to be uploading the data from, install the gcloud SDK, get gsutil, and then run gsutil copy, that's cp: gsutil cp sales*.csv gs://acme-sales/data/

tf.transform

1. Computes min, max, vocab, etc., and stores them in metadata.json 2. In the serving function, use the metadata to scale the raw inputs before providing them to the model

To pass in a PCollection

1. Convert the PCollection to a View (asList, asMap): cz = ...; czmap = cz.apply("ToView", View.asMap()) 2. Call the ParDo with side input(s): ParDo.withSideInputs(...).of(...) 3. Within the ParDo, get the side input from the context

Tensorflow Steps

1. Create task.py to parse command-line parameters and send along to train_and_evaluate 2. The model.py contains the ML model in TensorFlow(Estimator API) 3. Package up TensorFlow model as python package (needs to contain an __init__.py in every folder) 4. Verify that the model works as a Python package 5. Then use the gcloud command to submit the training job, either locally or to cloud

Bucketizing in Cloud ML

1. Creating bucketized features using TensorFlow 2. Number of buckets is a hyperparameter 3. tf.feature_column.bucketized_column(lat, latbuckets) 4. Pipeline for bucketized and crossed features
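
A hedged sketch of bucketized and crossed feature columns; the bucket boundaries and column names are arbitrary choices, not the course's exact values:

import numpy as np
import tensorflow as tf

nbuckets = 16  # the number of buckets is a hyperparameter
latbuckets = np.linspace(32.0, 42.0, nbuckets).tolist()
lonbuckets = np.linspace(-124.0, -114.0, nbuckets).tolist()

lat = tf.feature_column.numeric_column('latitude')
lon = tf.feature_column.numeric_column('longitude')

b_lat = tf.feature_column.bucketized_column(lat, latbuckets)
b_lon = tf.feature_column.bucketized_column(lon, lonbuckets)

# Feature cross of the two bucketized columns, e.g. for the wide part of a model.
loc_cross = tf.feature_column.crossed_column([b_lat, b_lon],
                                             hash_bucket_size=nbuckets * nbuckets)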

Apply Transform to PCollection

1. Data in a pipeline are represented by a PCollection a. Supports parallel processing b. Not an in-memory collection; can be unbounded c. Applying a transform to a PCollection returns another PCollection

Beyond linear regression with estimators

1. Deep neural network (DNNRegressor) 2. Classification (LinearClassifier or DNNClassifier)

Did we use triggers?

1. Default trigger setting used, which is trigger first when the watermark passes the end of the window, and then trigger again every time there is late arriving data

Steps to define an Estimator API model

1. Define input feature columns 2. Create a model, passing in the feature columns 3. Write input_fn (returns features, labels); features is a dict 4. Train the model using model.train 5. Use trained model to predict
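
A minimal sketch of these steps with the TF 1.x-style Estimator API used in the course; the feature name and the tiny in-memory dataset are invented for illustration:

import tensorflow as tf

featcols = [tf.feature_column.numeric_column('sq_footage')]          # 1. feature columns
model = tf.estimator.LinearRegressor(featcols, model_dir='trained')  # 2. model

def train_input_fn():                                                 # 3. (features dict, labels)
    features = {'sq_footage': tf.constant([1000.0, 2000.0, 3000.0])}
    labels = tf.constant([100.0, 200.0, 300.0])
    return features, labels

model.train(train_input_fn, steps=100)                                # 4. train

predict_input_fn = lambda: {'sq_footage': tf.constant([1500.0, 2500.0])}
for pred in model.predict(predict_input_fn):                          # 5. predict
    print(pred['predictions'])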

Pub/sub is a low latency, guaranteed delivery service

1. Does not guarantee order of messages 2. At-least-once delivery means that repeated delivery is possible

Relational databases

1. Do not support very high throughput 2. Scale pretty well to a few hundred gigabytes for queries 3. Aggregations on structured data 4. Transactional updates on relatively small datasets

Bigtables These are row keys to avoid:

1. Domains 2. Sequential (numeric) IDs 3. Static, repeatedly updated identifiers

Cloud ML Engine Scalable Training:

1. During training, Engine will help you distribute your pre-processing, your model training, even your hyperparameter tuning and finally deploy your trained model to the cloud.

Data studio lets you build dashboards and reports

1. Easy to read, share, and fully customizable 2. Handles authentication, access rights, and structuring of data

Stream processing:

1. Element-wise stream processing is easy 2. Aggregating is hard 3. Composite processing on unbounded data is super difficult

Pull:

1. Endpoint can be a server or a device capable of making API call 2. Delays between publication and delivery 3. Ideal for large number of dynamically created subscribers

Push:

1. Endpoint can only be HTTPS server accepting Webhook 2. Immediate delivery; no latency 3. Ideal for subscribers needing closer to real time performance

Hyperparameters tuning

1. Ensure that your model writes out evaluation metrics periodically. 2. Ensure that the outputs of different trials don't clobber each other. 3. Create a YAML configuration file. 4. Submit training job, configuration file included.

What do you pass in train_and_evaluate?

1. Estimator 2. Train spec 3. Eval spec

Handle late data: watermarks, triggers, accumulation

1. Event time a. Bound to the source of the event, e.g. the event with ID X is given a timestamp relative to the source scope b. Always < processing time 2. Processing time a. Relative to the engine processing the event, e.g. the event with ID X or the event with ID Y is being processed now b. Always > event time

ParDo is useful for a variety of common data processing operations including

1. Filtering a data set. 2. Formatting or type-converting each element in a data set. 3. Extracting parts of each element in a data set. 4. Performing computations on each element in a data set.
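
A small Beam Python sketch of a DoFn used via ParDo to filter, extract, and type-convert; the CSV layout is hypothetical:

import apache_beam as beam

class ExtractLargeAmounts(beam.DoFn):
    def process(self, element):
        # element is assumed to be a CSV line like "account_id,amount"
        account_id, amount = element.split(',')
        if float(amount) > 100.0:           # filtering
            yield {'account': account_id,   # extracting parts / type-converting
                   'amount': float(amount)}

with beam.Pipeline() as p:
    (p
     | beam.Create(['a1,250.0', 'a2,12.5', 'a3,980.0'])
     | beam.ParDo(ExtractLargeAmounts())
     | beam.Map(print))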

Programming Tensorflows requires 2 steps:

1. First step, create the graph. 2. Second step, run the graph. 3. Does lazy evaluation: you need to run the graph to get results
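
A minimal sketch of the two steps with the TF 1.x API used in the course (TF 2.x would need tf.compat.v1):

import tensorflow as tf

# Step 1: build the graph - nothing is computed yet (lazy evaluation).
a = tf.constant([5, 3, 8])
b = tf.constant([3, -1, 2])
c = tf.add(a, b)

# Step 2: run the graph to get results.
with tf.Session() as sess:
    print(sess.run(c))   # [8 2 10]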

Need to process variable amounts of data that will grow over time:

1. Fixed or slowly scaled clusters are a waste 2. Windowing lets us answer the question of "where in event time" we are computing the aggregates a. Windowing divides data into event-time-based finite chunks b. Required when doing aggregations over unbounded data c. Fixed, sliding, sessions 3. Beam's unified model is very powerful and handles different processing paradigms

What is BigQuery?

1. Fully managed data warehouse 2. Fast, petabyte-scale with the convenience of SQL 3. Encrypted, durable, and highly available 4. Virtually unlimited resources only pay for what you use 5. Provides streaming ingest to unbounded data sets

BigQuery Characteristics:

1. Fully managed data warehouse that lets you do ad-hoc SQL queries on massive volumes of data 2. Ingesting data into BigQuery: a. Files on disk or Cloud Storage b. Streaming data c. Federated data sources

Cloud SQL characteristics:

1. Fully managed database service 2. Flexible pricing 3. Familiar 4. Managed backups 5. Automatic replication 6. Fast connection from Compute Engine and App Engine 7. Connection from anywhere 8. Google security

Cloud SQL and Cloud Dataproc offer familiar tools. What is the value-add provided by Google Cloud Platform?

1. Fully-managed versions of the software offer no-ops 2. Running it on Google infrastructure offers reliability and cost savings 3. Fast Random Access 4. CRUD operations are easily implemented on Datastore

Built-in functions are faster than JavaScript UDFs

1. Functions - what work are we doing on the data? 2. Guideline - some operators are faster than others

Create subscriptions, pull messages

1. gcloud pubsub subscriptions create --topic sandiego mysub1 2. gcloud pubsub subscriptions pull --auto-ack mysub1

Tensorflow Overfitting:

1. Get more training data - a model trained on more data will naturally generalize better 2. When that is no longer possible, the next best solution is to use techniques like regularization 3. Reduce the size of the model i.e. the number of learnable parameters in the model (which is determined by the number of layers and the number of units per layer) 4. Dropout, applied to a layer, consists of randomly "dropping out" (i.e. set to zero) a number of output features of the layer during training; The "dropout rate" is the fraction of the features that are being zeroed-out; it is usually set between 0.2 and 0.5
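
A hedged Keras sketch combining two of the techniques above (a smaller model with L2 regularization and dropout); the layer sizes, the 0.001 L2 factor, and the 0.3 dropout rate are arbitrary:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu',
                          kernel_regularizer=tf.keras.regularizers.l2(0.001),
                          input_shape=(20,)),
    tf.keras.layers.Dropout(0.3),   # zero out 30% of activations during training
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])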

Dataproc

1. Google-managed: Hadoop, Pig, Hive, Spark 2. Image versioning 3. Familiar 4. Resize in seconds 5. Automated cluster management 6. Integrates with google cloud 7. Flexible VMs 8. Google Security

Low cardinality Group Bys are faster

1. Grouping - how much data are we grouping per-key for aggregation? 2. Guideline - low cardinality/groups -> fast, high-cardinality -> slower 3. However, high key cardinality (more groups) leads to more shuffling

Asynchronous processing advantages:

1. High availability 2. Balance load across multiple workers for high throughput 3. Reduce coupling 4. Reduce latency

TensorFlow toolkit hierarchy

1. High-level "out-of-box" API does distributed training 2. Components useful when building custom NN models 3. Python API gives you full control 4. C++ API is quite low-level 5. TF runs on different hardware (CPU/GPU/TPU/Android)

What is work for a query?

1. I/O - how many bytes did you read? 2. Shuffle - how many bytes did you pass to the next stage? a. Grouping - how many bytes do you pass to each group 3. Materialization - how many bytes did you write? 4. CPU work - User-defined functions (UDFs), functions

Streaming(data processing for unbounded datasets)

1. Infinite data set 2. Is never complete, especially when considering time 3. Stored in multiple temporary, yet durable stores

Challenge #1: Variable volumes require ability of ingest to scale and be fault-tolerant

1. Ingesting variable volumes: massive amounts of streaming events, handle spiky/bursty data, high availability and durability 2. The way to get a durable, highly available messaging system is to use Pub/Sub.

The original data is organized visually, but if you had to write an algorithm to process the data, how might you approach?

1. It could be by rows, by columns, by rows then fields, and the different approaches would perform differently based on the query. 2. Your method might not be parallelizable. The original data can be interpreted and stored in many ways in a database.

Pub/Sub: how it works: Topics and subscriptions

1. It's an asynchronous communication pattern 2. Multiple topics in Pub/Sub 3. One or more publishers send a message to a topic 4. At-least-once delivery guarantee a. A subscriber ACKs each message for every subscription b. A message is resent if the subscriber takes more than "ackDeadline" to respond c. A subscriber can extend the deadline per message 5. Exactly-once, ordered processing a. Pub/Sub delivers at least once b. Dataflow: deduplicate, order and window c. Separation of concerns -> scale
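
A hedged sketch with a recent google-cloud-pubsub Python client (the request-dict call style is from the 2.x client); project, topic, and subscription names are placeholders:

from google.cloud import pubsub_v1

project_id = 'my-project'

# Publish: data must be bytes; the returned future resolves to the message ID.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, 'sandiego')
future = publisher.publish(topic_path, b'lane=2,speed=61.3')
print(future.result())

# Pull and ACK messages from a subscription.
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(project_id, 'mysub1')
response = subscriber.pull(request={'subscription': sub_path, 'max_messages': 10})
ack_ids = [msg.ack_id for msg in response.received_messages]
if ack_ids:
    subscriber.acknowledge(request={'subscription': sub_path, 'ack_ids': ack_ids})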

Do the biggest joins first

1. Joins - in what order are you merging data? 2. Guideline - Biggest, smallest, decreasing size thereafter 3. Avoid self-join if you can, since it squares the number of rows processed

Use projects to

1. Limit access to datasets and jobs 2. Manage billing

BigQuery pricing - free operations

1. Loading data into BigQuery 2. Exporting data from BigQuery 3. Any queries on metadata (how many rows are in this table, how many columns, what are the column names) - those are always free 4. Any cached query is free: if you run the exact same query again in that project, the results come back from cache. The cache is per-user for privacy reasons, so two users in a project don't share a cache, but any query whose results are returned from the cache is free 5. Any query that has an error is also free.

Cloud MLE supports hyperparameter tuning

1. Make the parameter a command line argument 2. Make sure outputs don't clobber each other 3. Supply hyperparameters to training job

Pub/Sub - publish-subscribe

1. Message bus that's a great way to deal with the challenge of ingesting variable volumes of data 2. Has the ability to ingest data at high speeds 3. Works well with streaming data 4. Durable and fault tolerant 5. No ops design 6. Serverless

So it's important to look at how to get data into BigQuery, because that's something you'll be doing quite a bit

1. You can load data into BigQuery using a command line interface called bq; it comes with the gcloud SDK. 2. You can use the web user interface 3. You can use an API: a Python API or a Dataflow API. 4. Pretty much all of the tools in Google Cloud can talk to BigQuery and write their data into it.

Tensorflow Underfitting:

1. Occurs when there is still room for improvement on the test data 2. Results if the model is not powerful enough, is over-regularized, or has simply not been trained long enough 3. The network has not learned the relevant patterns in the training data

Data studio connects to various GCP data sources

1. Offers a BigQuery connector 2. Read from a table or run a custom query 3. Build charts and graphs or lay data out on a map

Don't project unnecessary columns

1. On how many columns are you operating? 2. Excess columns incur wasted I/O and materialization

Filter early and often using WHERE clauses

1. On how many rows (or partitions) are you operating? 2. Excess rows incur "waste" similar to excess columns

General rules for feature engineering

1. Overly specific attributes should be discarded 2. Categorical values could be one-hot encoded 3. Preprocess data to create a vocabulary of keys 4. Don't mix magic numbers with data

ParDo allows for parallel processing:

1. ParDo acts on one item at a time (like a map in mapreduce) a. Multiple instances of class on many machines b. Should not contain any state 2. Useful for: a. Filtering b. Converting one Java type to another c. Extracting parts of an input (e.g. fields of tableRow) d. Calculating values from different parts of input

Batch and Streaming set up

1. Pub/Sub is a global message bus. 2. Dataflow is capable of doing batch and streaming; the core doesn't change. It gives you better ways to deal with late data and unordered data 3. BigQuery gives you the power of doing analytics both on historical data and on streaming data.

DAG steps

1. Read in data, transform it, write out 2. Can branch, merge, use if-then statements, etc.

what makes a good feature? Represent raw data in a form conducive to ML

1. Related to what is being predicted? 2. Value should be known at prediction time 3. Numeric with meaningful magnitude 4. Enough examples 5. Good features bring human insight to the problem

Datalab frees from being constrained by hardware limitations:

1. Run it on any Compute Engine instance that you want 2. Change the machine specs after it's been provisioned a. You can go into the web console, find the running VM, stop it, and restart it with a new machine configuration 3. As a developer, to work in Datalab, you simply connect to the VM that's running the notebook server 4. The notebooks themselves can be persisted in Git or in a cloud repository, so you can delete the VM if you don't need it anymore. 5. To start up Cloud Datalab, you go into Cloud Shell and type datalab create. If you have gcloud installed on your local computer, you can run this datalab create command from your own computer instead of from Cloud Shell.

Cloud Dataflow on the cloud

1. Scalable, fault-tolerant, multi-step processing of data 2. Offers you ways to spend less on ops and administration, incorporate real-time data into apps and architectures, apply machine learning broadly and easily, and, as an end goal, create citizen data scientists so that everybody in your organization can work with data.

Streaming processing makes it possible to derive real-time insights from growing data

1. Scale to variable volumes 2. Act on real-time data using continuous queries 3. Derive insights from data in flight

Asynchronous processing Potential use cases:

1. Send an SMS 2. Train ML model 3. Process data from multiple sources 4. Weekly reports

To use custom timestamps, perhaps based on message producer's clock:

1. Set an attribute in Pub/Sub with the timestamp when publishing: batch.publish(event_data, mytime="2017-04-12T23:20:50.52Z") 2. Tell Dataflow which Pub/Sub attribute is the timestamp attribute: p.apply(PubsubIO.readStrings().fromTopic(t).withTimestampAttribute("mytime")).apply(...)

Understand BigQuery plans

1. Significant difference between avg and max time a. Probably data skew - use APPROX_TOP_COUNT to check b. Filter early to work around it 2. Most time spent reading from intermediate stages a. Consider filtering earlier in the query 3. Most time spent on CPU tasks a. Consider approximate functions, inspect UDF usage, filter earlier

Training is very sensitive to batch-size and learning-rate

1. Size of model 2. Number of hash buckets 3. Embedding size

Order on the outermost query

1. Sorting - how many values do you need to sort? a. Filter first reduces the number of values you need to sort b. Ordering first forces you to sort the world

Can enforce only-once handling in dataflow even if your publisher might retry publishes

1. Specify a unique label when publishing to Pub/Sub 2. When reading, tell Dataflow which PubSub attribute is the idLabel

Learning rate

1. Start with a model with random weights 2. Calculate error on labeled dataset (every batch) 3. Change the weights so that the error goes down (every batch) 4. Repeat step 2-3 until the model is good enough.
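
A toy numpy sketch of this loop for a one-weight model y = w * x, with made-up data and a fixed learning rate:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                        # labeled data; the true weight is 2.0

w = np.random.randn()              # 1. start with random weights
learning_rate = 0.05

for step in range(100):
    error = w * x - y              # 2. calculate error on the batch
    gradient = 2 * np.mean(error * x)
    w -= learning_rate * gradient  # 3. change the weight so the error goes down
                                   # 4. repeat until good enough
print(round(w, 3))                 # converges toward 2.0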

BigQuery Not-Free

1. Storage a. Amount of data in table b. Ingest rate of streaming data c. Automatic discount for old data 2. Processing a. On-demand OR flat-rate plans b. On-demand is based on amount of data processed c. 1 TB/month free d. Have to opt in to run high-compute queries

Pre-trained ML models

1. Vision API 2. Speech API 3. Jobs API 4. Translation API 5. Natural Language API

Dataflow Templates enable a new development and execution workflow

1. The templates help separate the development activities and the developers from the execution activities and the users. The user environment no longer has dependencies back to the development environment. 2. The new approach facilitates the scheduling of batch jobs and opens up more ways for users to submit jobs and more opportunities for automation. 3. Runtime parameters work through the ValueProvider interface, so that your users can set these values when the template is submitted. 4. ValueProvider can be used in IO transforms and DoFn functions, and there are static and nested versions of ValueProvider for more complex cases. 5. You specify the location of the template in Cloud Storage, an output location in Cloud Storage, and the name-value parameters that map to the ValueProvider interface. 6. Example templates for basic tasks are provided, including word count, Cloud Pub/Sub to BigQuery, Cloud Storage text to Cloud Pub/Sub, Cloud Pub/Sub to Cloud Storage text, and so forth.

Can I not read directly from BigQuery?

1. There is a BigQuery reader in TensorFlow. 2. When we do our training, we're gonna be reading multiple times. We're gonna be reading from different parameter servers. We're gonna be reading in chunks. We can read from BigQuery, but it's gonna be cheaper to read from CSV.

Table partitioning

1. Time-partitioned tables are a cost-effective way to manage data 2. Easier to write queries spanning time periods 3. When you create tables with time-based partitions, BigQuery automatically loads data into the correct partition a. Declare the table as partitioned at creation time using this flag: --time_partitioning_type b. To create a partitioned table with an expiration time for data, use this flag: --time_partitioning_expiration

Create topic and publish message

1. To create a topic, you use gcloud 2. Messages are opaque in Pub/Sub (no parsing) 3. Publishing actually sends a web call, a REST API call.

Unbounded datasets are quite common

1. Traffic sensors along highways 2. Usage information of Cloud component by every user with a GCP project 3. Credit card transactions 4. User moves in multi-user online gaming

Reading from Bigtable

1. Typically programmatic using the HBase API 2. HBase command-line client 3. BigQuery

Cluster performance

1. Under typical workloads cloud Bigtable delivers highly predictable performance. When everything is running smoothly, you can expect the following performance for each node in your Cloud Bigtable cluster, depending on which type of storage your cluster uses

There are three steps to training your model at Cloud ML Engine.

1. Use TensorFlow to write your code. 2. Package up your trainer as a Python module. 3. Configure and start your ML Engine job.

For realistic, real-world ML models, we need to:

1. Use a fault-tolerant distributed training framework 2. Choose a model based on validation datasets 3. Monitor training, especially if it will take days 4. Resume training if necessary

Best practices Dataflow + BigQuery to enable fast data-driven decisions

1. Use dataflow to do the processing/transforms 2. Create multiple tables for easy analysis 3. Take advantage of BigQuery for streaming analysis for dashboards and long term storage to reduce storage cost 4. Create views for common query support

Python: Map vs flatmap

1. Use map for 1:1 relationship between input & output 2. Flatmap for non 1:1 relationships, usually with generator 3. Java: use apply(parDo) for both cases
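
A minimal Beam Python sketch of the difference (Map is 1:1, FlatMap is 1:many, typically via a generator):

import apache_beam as beam

def split_into_words(line):
    for word in line.split():
        yield word

with beam.Pipeline() as p:
    lines = p | beam.Create(['the quick brown fox', 'jumps'])
    lengths = lines | 'Len' >> beam.Map(len)                   # exactly one output per line
    words = lines | 'Words' >> beam.FlatMap(split_into_words)  # zero or more outputs per line
    words | beam.Map(print)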

Wildcard tables - Standard SQL

1. Use wildcards to query multiple tables using concise SQL statements 2. Wildcard tables are the union of all tables matching the wildcard expression 3. Useful if your dataset contains: a. Multiple, similarly named tables with compatible schemas b. Sharded tables 4. When you query, each row contains a special column with the wildcard match. Example: FROM `bigquery-public-data.noaa_gsod.gsod*` matches all tables in noaa_gsod that begin with the string 'gsod'. The backtick (`) is required. Longer prefixes perform better than shorter prefixes; for example, .gsod200* versus .*

Techniques to deal with the three Vs

1. Volume a. Terabytes, petabytes b. Mapreduce autoscaling analysis 2. Velocity a. Realtime or near-realtime b. streaming 3. Variety a. Social networks, blog posts, logs, sensors b. Unstructured data and machine learning

Triggers control when results are emitted

1. What are you computing? What = transformations 2. Where in event time? Where = windowing 3. When in processing time? When = watermarks + triggers 4. How do refinements relate? How = accumulation

How dataflow handles streaming data while balancing tradeoffs

1. Windowing model a. Which supports unaligned event-time windows, and a simple API for their creation and use 2. Triggering model a. That binds the output times of results to runtime characteristics of the pipeline, with a powerful and flexible declarative API for describing desired triggering semantics 3. Incremental processing model a. That integrates retractions and updates into the windowing and triggering models

Stream processing in Dataflow accounts for this

1. Works with out-of-order messages when computing aggregates 2. Automatically removes duplicates based on the internal Pub/Sub ID

Cloud Datalab:

1. Write code in Python 2. Run cell (shift enter) 3. Examine output 4. Write commentary in markdown 5. Share and collaborate

Can write data out to same formats

1. Write data to file system, GCS, BigQuery, Pub/Sub 2. Can prevent sharding of output (do only if it is small) 3. May have to transform PCollection<Integer>, etc., to PCollection<String> before writing out
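
A hedged sketch of writing out with the Python SDK, converting to strings first and limiting the output to a single shard (advisable only for small outputs); the output path is a placeholder:

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create([1, 2, 3])
     | 'ToString' >> beam.Map(str)                  # PCollection[int] -> PCollection[str]
     | beam.io.WriteToText('gs://my-bucket/output/result',
                           file_name_suffix='.txt',
                           num_shards=1))           # prevents sharding of the output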

Modifying bigtable clusters

1. You can add or remove nodes or change a cluster's name without any downtime 2. You cannot modify the Cluster ID, zone, or storage type after a cluster is created

Understand query performance

1. You can optimize your queries and your data, but you still need to monitor performance 2. Two primary approaches: a. Per-query explain plans i. What did my query do? b. Project-level monitoring through Google Stackdriver i. What is going on with all my resources in this project?

Dataproc customization options

1. You can start up a single node cluster where all the Hadoop services are installed on a single VM. If you're developing code, you might want to use this for cost control or to give each developer their own environment. 2. Standard mode has a single master node. In Hadoop, the master node is the ingress point for job submission. Normally, having a single master node is sufficient. 3. If you have a very long-running job then you might want to use the high availability option. That provides three master nodes, so the loss of a single VM will not result in losing the job.

Three possible places to do feature engineering:

1. You could do it on the fly as you read in the data, either in the input function itself or by creating feature columns. 2. Alternatively, you could do it as a separate step before you do the training, and then your input function reads the preprocessed data. 3. The third option is to do the preprocessing in Dataflow and create a dataset of preprocessed features.

Three things that you need to do to build an effective machine learning model:

1. You need to scale it out to large data; we just looked at that with Cloud ML 2. You need to do what's called feature engineering 3. Hyperparameter tuning

Dataprep provides

1. a high-leverage method to quickly create Dataflow pipelines without coding. 2. This is especially useful for data quality tasks and for master data tasks, combining data from multiple sources where programming may not be required. 3. The pipeline can be output as a Dataflow Template for continued use in Dataflow. 4. For example, you could set up a data quality job to clean up source data provided by a native system that's destined for data analysis. This template can then be used by administrative staff periodically to submit clean data for the analysis tasks. 5. The module also covered processing logs in Cloud Dataflow and some Apache Beam resources.

Dataproc

1. a managed Hadoop service on Google Cloud Platform 2. It's fast, convenient, and offers several unique flexible features. 3. Bdutil eliminates the complexity of deploying a cluster, and if you use it in the cloud, connectivity to cloud-based services ceases to be an issue, so it eliminates a lot of the IT overhead. 4. Operational and performance tuning overhead remains - you are responsible for your own custom code used in your jobs.

Pig

1. a scripting language; you would write your MapReduce programs in that scripting language at a higher level. 2. It is almost like an ETL language: extraction, transformation, and loading of data.

BigQuery in respect to Dataproc

1. a serverless, highly scalable, low-cost enterprise data warehouse with a fast interactive interface designed for use by data analysts. 2. Its utility overlaps significantly with Hadoop, and in some cases it can be used instead of Hadoop. 3. It can also be an add-on to Dataproc to extend the abilities of Dataproc.

Apache beam

1. a unified model for batch and stream processing 2. supporting multiple runtimes

Dataproc connecting

1. also hosts the HDFS name node at port 9870, which gives insight into HDFS. 2. You can SSH directly to the cluster nodes, or you can use a SOCKS proxy to connect your browser through an SSH tunnel. One reason to use SSH is to directly access software installed on the cluster, such as Hive, Pig, and PySpark. This can be a great way of interacting directly with the cluster and learning about the open source software that's installed by default on Dataproc.

Dataprep overview

1. an interactive graphical system for preparing structured or unstructured data for use in analytics (such as BigQuery), visualization (such as Data Studio), and training machine learning models 2. Input integration is provided for Cloud Storage, BigQuery, and uploaded files. 3. Offers a graphical user interface for interactively designing a pipeline. 4. Offers a rich set of tools for working with data 5. The format of a string field can have transformations applied: change to uppercase, to proper case (initial uppercase letters), trim leading and trailing whitespace, or remove whitespace altogether 6. These are the kinds of transformations commonly needed to improve the quality of data produced by a native system in preparation for big data processing

Apache beam :

1. an open source API that lets you define a data pipeline. The API shown here is Apache Beam: you are basically creating a pipeline that reads some text, and the text being read is from Cloud Storage. 2. Python, Java 3. Executable on Cloud Dataflow, Flink, Spark, etc.

Hive

1. an open source software project that implements a data warehouse and offers an SQL-like query language 2. HiveQL is not identical to standard SQL. 3. used for structured data, similar to SQL

Apache Bigtop

1. an open-source project that automates packaging deployment and integration of other projects in the Hadoop ecosystem. 2. gathers the core Hadoop components and makes sure that the configuration works. It uses Jenkins for continuous integration testing. 3. makes sure that the default Dataproc clusters perform well. It's common when installing Hadoop software manually to accidentally include software and services that actually aren't used in the configuration. 4. makes sure you're not wasting resources by eliminating elements that are not really needed.

Can associate a timestamp with inputs

1. automatic timestamp when reading from Pub/Sub a. The timestamp is the time the message was published to the topic 2. For batch inputs, explicitly assign a timestamp when emitting at some step in your pipeline a. outputWithTimestamp()
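
A hedged Python SDK sketch of assigning event-time timestamps to batch elements (the Java SDK's outputWithTimestamp() plays the same role); the record layout is hypothetical:

import apache_beam as beam

class AddTimestamp(beam.DoFn):
    def process(self, element):
        # element is assumed to carry a 'ts' field holding Unix seconds
        yield beam.window.TimestampedValue(element, element['ts'])

with beam.Pipeline() as p:
    (p
     | beam.Create([{'sensor': 's1', 'ts': 1500000000},
                    {'sensor': 's2', 'ts': 1500000060}])
     | beam.ParDo(AddTimestamp())
     | beam.WindowInto(beam.window.FixedWindows(60))   # event-time windowing now works
     | beam.Map(print))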

As mentioned before, HDFS is available and you can use it if you want to reduce the changes for adoption from your existing system. Using Cloud Storage instead of HDFS

1. avoid some of the complexity of configuring your cluster. 2. dynamically scale to meet requirements. You wouldn't have to try to predict storage consumption in advance. Because the cluster would only need disk space for working storage and software, sizing the cluster nodes becomes much easier. 3. makes the cluster stateless, so you can shut it down when you don't need it.

Why would you want to run Hadoop on a cloud platform to begin with

1. cheap storage 2. When running with only gigabytes or terabytes of data, running Hadoop on your own cluster can be efficient.

Earlier in the course, the concept of machine learning was introduced and three categories of work were identified

1. The first category consists of problems that require human insight to solve; those are not good candidates for machine learning solutions, at least not yet. 2. The second category consists of problems that essentially reduce to counting; these are easy problems to solve with big data processing. 3. The third category consists of problems that at first appear to have no easy solution, mainly because they involve unstructured data. However, with some ingenuity and using machine learning services, these problems can be transformed into counting problems, and they become very powerful applications for big data processing.

Hadoop alternatives come with operational and maintenance overhead. You can overcome these limitations with Cloud Dataproc, which was designed to deal with them

1. create a cluster specifically for one job 2. use cloud storage instead of HDFS 3. shutdown the cluster when it's not actually processing data 4. use custom machines to closely match the CPU and memory requirements of the job 5. on non-critical jobs requiring huge clusters, use preemptible VMs to hasten results and cut costs at the same time.

BigQuery

1. data warehouse 2. petabyte scale data warehouse on Google Cloud 3. denormalized

Randomly shuffle the filenames in the filename_queue

1. different file sizes 2. different complexity 3. don't want the same machine to be stuck every time

Bigtable

1. don't need transactional support 2. capacity of petabytes 3. high-throughput scenarios 4. With Bigtable you basically deal with flattened data; it's not for hierarchical data, and you search only based on the key. Because you can search only based on the key, the key itself and the way you design it become extremely important. 5. NoOps 6. automatically balanced 7. automatically replicated 8. compacted 9. It's essentially NoOps: you don't have to manage any of that infrastructure, and you can deal with extremely high throughput data.

Date and time functions

1. enable date and time manipulation for timestamps, date strings, and timestamp data types 2. BigQuery uses epoch time

record defaults get used two ways.

1. to figure out what the default value is 2. to determine what the type of the column is

Pub/Sub :

1. has to be primed: a subscription only receives messages published after it exists, so only the second message will show. 2. simplifies systems by removing the need for every component to speak to every component 3. connects applications and services through a messaging infrastructure

Providing other inputs to a ParDo

1. in memory objects can be provided as usual

So if you're thinking about a good stream processing solution, there are three key challenges that it needs to address.

1. it needs to be able to scale. 2. you want to essentially use continuous queries 3. we want to be able to do SQL-like queries that operate over time windows over that data

Spanner

1. it uses familiar relational semantics, so traditional database analysts will adapt to it easily. 2. Data is sharded within the zone, providing high throughput. 3. And it provides high availability by design, so there's no manual intervention required to deal with a zone failure.

Reason to Use Tensorflow:

1. machine-learning researcher interested in extending the open source SDK 2. creating new machine learning models for research, et cetera.

Store related entities in adjacent rows

1. Make the query parameter the row key 2. Add a reverse timestamp to the row key 3. Distribute the writing load between tablets while allowing common queries to return consecutive rows
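
A small sketch of one such row-key scheme: put the queried field in the key and append a reverse timestamp so the most recent rows for an entity sort first; the field names and constant are invented:

import time

MAX_TS = 10000000000  # any value safely above current Unix time

def make_row_key(sensor_id, event_unix_seconds):
    reverse_ts = MAX_TS - int(event_unix_seconds)
    return '{}#{:010d}'.format(sensor_id, reverse_ts)

print(make_row_key('highway-101-sensor-17', time.time()))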

Dataflow:

1. manages the provisioning of these machines. 2. Auto-scales your pipeline if necessary, such that transform one happens at scale, completely distributed, and then everything comes streaming back into transform two. 3. completely NoOps data pipeline. 4. The intermediate processing can be identical even when you move from a batch to a streaming scenario. 5. Swap out the input 6. Connect to Pub/Sub, BigQuery, Cloud Storage 7. Dataflow is where we see a lot of data pipelines migrating, because you really want to be able to process historical data and real-time data in an identical way. That's the only way you'll be able to build, for example, a machine learning pipeline that is trained on historical data and operates on real-time arriving data. 8. ingest, transform, and load; filtering, grouping, windowing, etc.; consider using it instead of Spark

Cloud Pub/Sub:

1. message oriented architectures 2. offers reliable real-time messaging that's accessible through HTTP 3. reliable delivery, decoupled workers 4. asynchronous processing

Bigquery details

1. near real-time analysis of very large datasets 2. it's no-ops, so you're only paying for what you use 3. it gives you durability 4. it gives you replication, which gives you very cheap storage (about the same cost as Cloud Storage), and it gives you immutable audit logs so you know who accessed the data and when. Very importantly, because there's only one BigQuery and it's global, it allows you to mash up different datasets

Reason to use machine learning APIs:

1. pre-built models that you incorporate into your applications 2. you're not training a machine learning model when you use the ML APIs 3. The machine learning APIs are built off Google's data. If you've ever used the Android app where you can point the phone at a foreign-language sign and get it translated, that app uses translation and optical character recognition: OCR (Optical Character Recognition) is part of the Vision API, and translation is part of the Translate API.

Tensorflow: Directed graph

1. preparing a graph for execution on multiple hardware devices 2. processing the graph to add quantization, debug nodes, and create summaries 3. the graph can be compiled, for example to fuse ops and improve performance; for instance, two consecutive add nodes might be fused into a single one.

BigQuery Overview:

1. serverless data warehouse that operates at massive scale. 2. To use BigQuery, you don't have to store your data in a cluster. 3. To query the data, make an API call or invoke BigQuery from just a web browser. 4. You can analyze terabytes to petabytes of data 5. Requires unchecking the cache-results option to avoid cached results

Monitor training Tensorflow:

1. set your verbosity to INFO. 2. By default, TensorFlow's logging level is WARN, so it doesn't show you much. 3. If you want TensorFlow to show you the loss as it trains, change the level to INFO or DEBUG. 4. The levels are DEBUG, INFO, WARN, ERROR, and FATAL. 5. A graphical way to monitor training is TensorBoard. 6. You point TensorBoard at the model output directory; whether it's a local directory or on Cloud Storage, TensorBoard can read from both.

Stream processing poses several challenges:

1. size a. Traffic data will only grow with more sensors and higher frequency 2. scalability and fault-tolerance a. Handle growing traffic data volumes, distributed sensors, and still be fault tolerant 3. programming model a. Compare traffic over past hour against that of last Friday at same time: is this stream or batch? 4. Unboundedness a. What happens if data from a sensor arrives late?

Cloud dataflow key ideas:

1. the execution framework for Apache beam pipelines. 2. Allows for decoupling producers and consumers of data in large organizations and complex systems

Denormalizing(nested and repeated fields)

1. the strategy of accepting repeated fields in the data to gain processing performance. 2. Data must be normalized before it can be denormalized. 3. Denormalization is another increase in the orderliness of the data. 4. takes more storage - repeated fields 5. because it no longer is relational, queries can be processed more efficiently and in parallel using columnar processing. 6. Nested can be understood as a form of repeated field

Normalizing the data

1. turning it into a relational system. 2. This stores the data efficiently and makes query processing a clear and direct task. Normalizing increases the orderliness of the data.

Pig

1. used for semi-structured data, similar to SQL + scripting 2. provides SQL primitives similar to Hive, but in a more flexible scripting language format. 3. deal with semi-structured data, such as data having partial schemas, or for which the schema is not yet known. 4. sometimes used for Extract Transform Load (ETL) 5. generates Java MapReduce jobs. 6. not designed to deal with unstructured data.

Smart way to explore the space in Cloud ML

1. uses a Bayesian optimization approach that can be applied to autotune parameters (like learning rate, number of hidden nodes, etc.) of your machine learning model.

Repeat the data and send it along in chunks

1. we now have our filename_queue by taking these file names, randomly shuffling them, and adding them for num_epochs 2. We need to set up the readers that are going to do the decoding; the reader in our case is a TextLineReader because these are CSV files 3. Read a batch of records from the filename_queue 4. Take that record, which at this point is just a line - it's a scalar 5. We make it a tensor with the same shape using expand_dims; that becomes our value, which is now just a string. 6. We then ask TensorFlow to decode the CSV, i.e., decode it as a comma-separated-value string.

build effective machine learning models.

1. you need to collect all the data that you can; collecting the data so that you can do analytics on it is extremely important. 2. Once you have the data, you want to bring human insight into it using good features, and you can also take advantage of modern improvements in neural network architectures to get the best possible accuracy once you've decided how you're going to build your ML. 3. When you're doing machine learning, the accuracy improvement you get comes through hard work: through feature engineering, through hyperparameter tuning, and through lots of data. 4. What Cloud ML Engine gives you is an environment in which you can do all of these things.

Reason to use Cloud ML Engine:

1. you're in an industry as a data scientist and you want to build a machine learning model on your dataset, and the machine learning model that you're building is something that's pretty well understood 2. it's a no-op, so you're not in the business of managing infrastructure to be able to do machine learning at scale over real-world datasets.

Difference between BigQuery and Bigtable Latency:

a. BigQuery is in the order of seconds b. BigTable is in the order of milliseconds.

Cloud Storage reduce latency

Choose the closest zone/region; Distribute your apps and data across zones to reduce service disruptions and regions for global availability

BigQuery

1. It supports nested and repeated fields. 2. Separates out storage and compute 3. Structured or tabular data 4. Near-real-time analysis 5. Completely no-ops 6. Durable - replicated in multiple places and pretty inexpensive in terms of storage 7. Immutable audit logs 8. Mashing up different datasets to derive insights

A message in Pub/Sub persist for

7 days

Configure Alerts

Define thresholds on job or resource group-level metrics, and alert when these metrics reach specified values. Stackdriver alerting can notify on a variety of conditions such as long streaming system lag or failed jobs.

The elements are divided into datasets, recipes and output

A dataset roughly translates into a Dataflow pipeline read, a recipe usually translates into multiple pipeline transformations, and an output translates into a pipeline action

Cloud pub/sub overview:

A global, multitenant, managed, real-time messaging service. 1. Discoverability 2. Availability 3. Durability 4. Scalability 5. Low latency

Feature Cross

A synthetic feature formed by crossing (multiplying or taking a Cartesian product of) individual features; help represent nonlinear relationships.

A data warehouse

can be a source of structured data examples for your ML model.

Why does Tensorflow need a DEFAULT value?

Because neural networks are adding and subtracting machines: every input must be a number, so a missing field needs a numeric DEFAULT value.

Feature creation in TensorFlow is also possible

can be quite powerful since it is so flexible; you will need to add the call to all input functions (train, eval, serving)

Advantages of putting the preprocessing directly in TensorFlow

Tell your prediction graph that you want the same transformations carried out in TensorFlow during serving; to do that, you use a library called TensorFlow Transform. Discretizing and feature crossing are examples of preprocessing that is done in TensorFlow itself. These operations are part of your model graph, so they are carried out in an identical fashion in both training and serving.

Vision API match question

Automatically reject inappropriate image content

No tail skew

Average and max are identical

Files accepted in the Bigquery web ui

Avro, Parquet, JSON (newline-delimited), and CSV

Machine scale tiers and pricing

Basic - single worker instance; Standard_1 - 1 master, 4 workers, 3 parameter servers; Premium_1 - 1 master, 19 workers, 11 parameter servers; Basic_GPU - 1 worker with a GPU; Custom - priced by the hour

Difference between BigQuery and Bigtable Structure:

BigQuery (SQL) Bigtable (NoSQL)

Data flow export to

BigQuery, Cloud Storage text file, cloud storage Avro file

But if you need very high throughput, very low latency, then you need

Bigtable

you have an existing Hadoop application that reads and writes data to an HBase database

Bigtable is the path for separating storage and compute. You migrate that data to Bigtable and update the references. It uses the same API as HBase, so the code change will be minimal. Bigtable is also a strong candidate for real-time solutions.

Built-in functions are all faster than JavaScript UDFs. Example - exact COUNT(DISTINCT) is very costly, but APPROX_COUNT_DISTINCT is very fast. Note:

Check to see if there are reasonable approximate functions for your query

Translation API match question

Build application to monitor Spanish twitter feed

Training and evaluation input functions

CSV_COLUMNS = [...]
def read_dataset(filename, mode, batch_size=512):
    ...

And we allow Cloud ML to be able to write into our bucket because

Cloud ML runs as a service account or a robot account.

Data flow import from

Cloud Pub/Sub subscription, cloud storage text file

When you load data from Cloud Storage into BigQuery, your data can be in any of the following formats:

Comma-separated values (CSV) JSON (newline-delimited) Avro Parquet ORC (Beta) Cloud Datastore exports Cloud Firestore exports

Chart Dataflow metrics in Stackdriver Dashboards:

Create Dashboards and chart time series of Dataflow metrics.

Metrics for dataflow in lab

Data watermark age: The age (time since event timestamp) of the most recent item of data that has been fully processed by the pipeline. System lag: The current maximum duration that an item of data has been awaiting processing, in seconds

Feature Engineering

def add_more_features(feats):
    # will be covered in next chapter; for now, just a no-op
    return feats

Serving input function

def serving_input_fn():
    ...
    return tf.estimator.export.ServingInputReceiver(features, feature_pholders)

Train and evaluate loop

def train_and_evaluate(args):
    ...
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

Tensorflow match question

Create, test new machine learning methods

Dataflow current vCPU count:

Current # of virtual CPUs used by job and updated on value change.

Two types of features in feature comparison model

Dense and Sparse

Datalab match question

Develop Big Data algorithms interactively in python

Overfit

Does not generalize

Cloud SQL

Does not handle high-throughput needs. If you have sensors distributed all across the world and you're getting back millions of messages a minute, that's not something this database can handle

Prefer combine

Count.perKey is faster than GroupByKey followed by a counting ParDo: collection.apply(Count.perKey()) is faster than collection.apply(GroupByKey.create()).apply(ParDo.of(new DoFn() { void processElement(ProcessContext c) { c.output(KV.of(c.element().getKey(), Iterables.size(c.element().getValue()))); } }))

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

Estimator comes with a method that handles distributed training and evaluation

Cloud ML Engine Scalable Prediction:

For predictions, the ML model is accessible via a REST API, and it includes all the pre-processing and feature creation that you did, so your client code can simply supply the raw input variables and get back a prediction

Datalab price:

Free - just pay for Google Cloud resources

Google Cloud Storage

Good option for storing data that may be required to be read at some time later and imported into a cluster for analysis

the BigQuery Data Transfer Service supports loading data from the following data sources

Google AdWords DoubleClick Campaign Manager DoubleClick for Publishers Google Play (beta) YouTube - Channel Reports YouTube - Content Owner Reports

Bigtable match question

High-throughput writes of wide-column data

Feature columns

INPUT_COLUMNS = [ tf.feature_column.numeric_column('pickuplon'), ... ]

Using GCS as staging

If you need to get your data into BigQuery, for example, a good option is to first get it into GCS and then use GCS as a staging area to import it into BigQuery, or into Dataproc, or into any other cluster.

Monitor User-Defined Metrics:

In addition to Dataflow metrics, Dataflow exposes user-defined metrics (SDK Aggregators) as Stackdriver custom counters in the Monitoring UI, available for charting and alerting. Any Aggregator defined in a Dataflow pipeline will be reported to Stackdriver as a custom metric. Dataflow will define a new custom metric on behalf of the user and report incremental updates to Stackdriver approximately every 30 seconds.

Tensorflow's capacity:

In deep learning, the number of learnable parameters in a model

Dataflow Job status:

Job status (Failed, Successful), reported as an enum every 30 secs and on update.

Dataflow Elapsed time:

Job elapsed time (measured in seconds), reported every 30 secs.

Which of these is a way of discretizing a continuous variable?

layers.bucketized_column()

Which of these is a way of encoding categorical data?

layers.sparse_column_with_keys()
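The two cards above use the older contrib layers names; in the core tf.feature_column API the same ideas look roughly like this (the boundaries and vocabulary values are made up for illustration):

import tensorflow as tf

# discretize a continuous variable into buckets
lat = tf.feature_column.numeric_column('pickuplat')
lat_buckets = tf.feature_column.bucketized_column(
    lat, boundaries=[38.0, 39.0, 40.0, 41.0, 42.0])

# encode categorical data from a known set of keys
day = tf.feature_column.categorical_column_with_vocabulary_list(
    'dayofweek', vocabulary_list=['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'])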

hyperparameter

The learning rate is an example of what is known as a hyperparameter.

Problem with vertical scaling

Marginally diminishing returns. For each increased unit of computing power, the cost goes up and the value provided goes down.

Dataflow System lag:

Max lag across the entire pipeline, reported in seconds.

Cloud Pub/Sub push subscriptions

Or it could be a push subscription, in which the client system says "call this endpoint whenever there's a new message for me," and that endpoint gets called by this architecture whenever there's a new message.

Cloud ML Engine match question

No-ops, custom machine learning applications at scale

Dataflow Estimated byte count :

Number of bytes processed per PCollection.

Problem: it is in-memory, so it is usually only used with a subset of the data. First method:

One method that was discussed at the beginning was to use Cloud Storage as an intermediary, because both BigQuery and Dataproc can communicate with Cloud Storage. Cloud Storage is fast, but there are two operations involved: 1. writing out from BigQuery to Cloud Storage, and 2. reading from Cloud Storage into Dataproc. This method is appropriate for periodic or infrequent transfers; in other circumstances, a more direct communication method would be useful.

Cloud Storage Transfer Services

Online data - transfer once or multiple times

TextIO

Output methods do not support output with the timestamp.

Deep models components:

Output units, hidden layers, dense embeddings, sparse features

Transform

ParDo GroupByKey CoGroupByKey Combine Flatten Partition

Machine learning

Pattern recognition from examples

Gradient Descent

Process of reducing error

Cloud Storage IAM

Project level, Bucket level, Object level

Datastore match question

Searching for objects by attribute value

Working with Estimator API

Set up the machine learning model: 1. Regression or classification? 2. What is the label? 3. What are the features? Carry out the ML steps: 1. Train the model 2. Evaluate the model 3. Predict with the model

important for distributed training

Shuffling

One thing to note with Google Compute Engine is that disk performance scales with the size of the VM

So the second key to using Dataproc to overcome Hadoop's limitations is to use Cloud Storage instead of HDFS. It reduces the complexity of disk provisioning and enables you to shut down your cluster when it's not processing a job. There are some more handy options: 1. You can load multiple initialization scripts to customize the software on the Dataproc workers and on the master. 2. Dataproc comes preconfigured with a Cloud Storage connector, so the cluster already knows how to communicate with buckets located in the project; you can stage initialization scripts there. 3. The network section allows you to do things like associate a tag name with the nodes, so that later on you can create a very narrow firewall rule to allow access to the cluster services, for example some of the Hadoop web interfaces. 4. Dataproc uses Google Compute Engine virtual machines, so it inherits the ability to select virtual machines with different qualities; you can match these qualities to your processing requirements to gain greater control over the speed and cost of your data processing solution.

Challenge #2: Latency is to be expected

So, the Beam/Dataflow model provides for exactly-once processing of events. 1. What results are calculated? 2. Where in event time should we calculate those results? 3. When in processing time should we emit those results? 4. How do we refine already-computed results?

Horizontal Scaling (scaling out):

That's the distributed parallel processing solution. You acquire or borrow many smaller computers and use them together.

Can add new features in dataflow:

The advantage of putting the preprocessing directly in TensorFlow is that these operations are part of your model graph, so they are carried out in an identical fashion in both training and serving. Feature creation in TensorFlow is also possible and can be quite powerful since it is so flexible, but you will need to add the call to all input functions (train, eval, serving). Can add new features in Dataflow: 1. Ideal for features that involve time-windowed aggregations (streaming) 2. You will have to compute these features in the real-time pipeline for predictions (i.e. you will have to use Dataflow for predictions also)

Cloud ML Engine - best practice

The fewer trials you run in parallel, the better the result in terms of accuracy, but the longer it's going to take; that's the tradeoff you're making. At the end, Cloud ML Engine comes back and tells you which trial was the best.

Cloud Spanner

The first horizontally scalable, globally consistent database. It's proprietary, not open source. Consider what it means to have a relational database that's consistent but also distributed and global. Think about what might be involved in coordinating transactions on components of relational database located around the world. It seems like a very difficult problem to solve.

Tensorflow

The more capacity the network has, the quicker it will be able to model the training data (resulting in a low training loss), but the more susceptible it is to overfitting (resulting in a large difference between the training and validation loss).

PubSub timestamp

The timestamp will be the time at which the element was published to Pub/Sub. If you want to use a custom timestamp, it must be published as a Pub/Sub attribute, and you tell Dataflow about it using the timestamp label setting.

weights

parameters we optimize

Speech-to-Text (speech api) match question

Transcribe customer support calls

machine learning APIs as a way to make sense of unstructured data

We said that whether you have images, audio, video, or free-form text, you can use the machine learning APIs to extract entities, labels, people, events, etc., so you can make sense of it; then take that unstructured data, get entities, get sentiment, get labels, and then, if necessary, do machine learning on that data.

Why isn't Cloud Shell used in this lab?

Unlike Compute Engine, Cloud Shell has no SLA. Therefore, the availability of the Cloud Shell VM cannot be guaranteed during the lab.

why do we say the Cloud ML Engine is repeatable, scalable, et cetera?

Well, repeatable: if you were using bare TensorFlow yourself, you would have to keep track of all kinds of things, such as the order of pre-processing operations, what the scaling parameters are, etc.

Cloud AutoML

a new technology that helps to automate the creation of machine learning models.

The batch size

the number of points over which we try out the changes in weights.

Dataproc's third key to overcoming Hadoop's limitations: using managed instance groups

Use custom machine types to closely manage the resources the job requires. The primary group has a two-node minimum, and you can define a secondary group with as few as zero preemptible instances to start. You can manually scale the cluster later, or you can set up autoscaling.

Yaml

a parameter named train_batch_size, a parameter named nbuckets, a parameter named hidden_units, and specifying the exploration space
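A hedged sketch of what such a hyperparameter-tuning YAML could look like for Cloud ML Engine (the goal, metric tag, ranges, and trial counts are assumptions, not values from the course; only the parameter names come from the card above):

trainingInput:
  hyperparameters:
    goal: MINIMIZE
    hyperparameterMetricTag: rmse        # assumed metric name
    maxTrials: 30
    maxParallelTrials: 3
    params:
      - parameterName: train_batch_size
        type: INTEGER
        minValue: 64
        maxValue: 512
        scaleType: UNIT_LOG_SCALE
      - parameterName: nbuckets
        type: INTEGER
        minValue: 10
        maxValue: 20
        scaleType: UNIT_LINEAR_SCALE
      - parameterName: hidden_units
        type: CATEGORICAL
        categoricalValues: ["128 32 4", "256 128 16", "64 64 64"]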

App Engine

a serverless way to run web applications and autoscales, reliable

Spark knows the truth

data partitioning, data replication, data recovery, pipelining of processing, all are automated by Spark so you don't have to worry about them.

Bigquery match question

Warehousing structured data

Recall

the true positive rate = TP / (TP + FN)

Dataproc is

a way by which you can run a lot of the Hadoop ecosystem tools, Pig, Hive, Spark, etc.

A neuron

a way to combine all of the inputs, and make a single decision on those inputs

Cloud dataflow

a way to execute apache beam data pipelines on Google Cloud platform

The simplest neuron does _______________.

a weighted sum of its inputs.

Dataflow connector for cloud bigtable

Years of engineering to: 1. Teach Bigtable to configure itself 2. Isolate performance from noisy neighbors 3. React automatically to new patterns, splitting, and balancing 4. Look at access patterns and improve itself

accessing dataproc

You request a cluster from the service using either the console web interface, the Gcloud command, or programmatically via the API.

Tensorboard

a collection of visualization tools that are especially designed to help you visualize TensorFlow models and the training of TensorFlow models.

Example is

a combination of label and input. An input and its corresponding label together form an example.

An example in machine learning terms is

a combination of the input, the input for which we want an output, and a label, which is a true output, the thing that we know, this is what it needs to be.

pipeline

a directed graph of steps

TensorFlow, at its heart, is

a high performance library for numerical computation. So it's a library that lets you work with numbers in an efficient way. It's open-source and follows a graph processing idea, in a way that's similar to Apache Beam and data flow.

Pub/Sub features

a. Fast: order of 100s of milliseconds b. Fan in, fan out parallel consumption c. Push and pull delivery flows i. a subscriber can keep checking whether a new message is available, which is pull ii. it can register for notifications when there is a new message which is called push. d. Client libraries i. Idiomatic, hand-built in Java, Python, C#, Ruby, PHP, Node.js ii. Auto-generated in 10 gRPC languages

Pub/sub is global service

a. Messages stored in region closest to publisher (in multiple availability zones) b. A subscription collates a topic from different regions c. Subscribers can be anywhere in world; no change of code

1. A table can have only one index (the row key)

a. Rows are stored in ascending order of the row key b. None of the other columns can be indexed

3. Two types of designs

a. Wide tables when every column value exists for every row b. Narrow tables for sparse data

Pub/Sub

an auto-scaling message queue

Dialogue Flow

an end-to-end development suite for building conversational interfaces for dialogue systems. It uses machine learning to recognize the intent and context of what a user is saying.

tensor

an n-dimensional array

The Natural Language API

analyzes free-form text and identifies parts of speech, and provides analyses such as sentiment analysis, entity analysis, entity sentiment analysis, syntactic analysis, and content classification.

Content classification

analyzes text content and returns a content category for the content. Content classification is performed by using the classifyText method

accuracy is the one that you will use if the dataset is

balanced

BigQuery and Sheets:

being able to join a table in sheets with a table in BigQuery

Bigtable details:

big, fast, noSQL, autoscaling 1. Low latency / high throughput 2. 100,000 QPS @ 6 ms latency for a 10-node cluster 3. You pay for the number of Bigtable nodes that you are running

Container Engine

containerize it and put it into a Docker Container and we will basically orchestrate those containers and manage them for you

Eval spec

controls the evaluation

Label

correct output for some input. This is what you train the model with. The label is a correct output for an input.

To read sharded CSV files

create a tf.data.TextLineDataset(filenames).map(decode_csv) giving it a function to decode the CSV into features, labels
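Putting that together, a small sketch of reading sharded CSV files with tf.data (the column names, defaults, and file pattern are hypothetical):

import tensorflow as tf

CSV_COLUMNS = ['fare_amount', 'pickuplon', 'pickuplat']   # hypothetical columns
DEFAULTS = [[0.0], [-74.0], [40.7]]

def decode_csv(line):
    # turn one CSV line into (features dict, label)
    values = tf.decode_csv(line, record_defaults=DEFAULTS)
    features = dict(zip(CSV_COLUMNS, values))
    label = features.pop('fare_amount')
    return features, label

def read_dataset(file_pattern, num_epochs, batch_size):
    filenames = tf.gfile.Glob(file_pattern)                # e.g. 'gs://bucket/train-*.csv'
    dataset = tf.data.TextLineDataset(filenames).map(decode_csv)
    return dataset.repeat(num_epochs).batch(batch_size)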

Bigquery export

data studio, GCS

Repeat the data and send it along in chunks

dataset = dataset.repeat(num_epochs).batch(batch_size)

DNNs good for

dense, highly correlated features

Bigquery

destination table write preference: Write if empty, append to table, overwrite table

Training-serving skew

difference between performance during training and performance during serving.

Cloud ML Engine tuning

do hyper-parameter tuning and will remember these hyperparameters.

a ParDo transform considers

each element in the input PCollection, performs some processing function (your user code) on that element, and emits zero, one, or multiple elements to an output PCollection.
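A small Beam Python sketch of a ParDo that emits a variable number of outputs per input element (the sample data and step names are arbitrary):

import apache_beam as beam

class ExtractWords(beam.DoFn):
    def process(self, element):
        # zero, one, or many outputs per input element
        for word in element.split():
            yield word

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.Create(['the quick brown fox', '', 'jumps over'])
     | 'Words' >> beam.ParDo(ExtractWords())
     | 'Print' >> beam.Map(print))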

Syntactic analysis

extracts linguistic information, breaking up the given text into a series of sentences and tokens (generally, word boundaries), providing further analysis on those tokens. Syntactic Analysis is performed with the analyzeSyntax method

Cloud Spanner Natural use cases include:

financial applications and inventory applications traditionally served by relational database technology. Here are some example mission-critical use cases: powering customer authentication and provisioning for multinational businesses; building consistent systems for transactions and inventory management in the financial services and retail industries; supporting high-volume systems that require low latency and high throughput in the advertising and media industries.

Cross-entropy

for classification problems, the most commonly used error measure - because it is differentiable

Weights

free parameters in machine learning model ; the weights are the things that you get to change so that your model captures your data

Both BigQuery and Bigtable are what kind of services?

fully managed, no-ops services.

Command for submitting on cloud

gcloud ml-engine jobs submit training $JOBNAME

Command for running local on cloud

gcloud ml-engine local train

Key skew can lead to increased tail latency

Note: get a count of your groups when trying to understand performance

c. Distribute the writing load between tablets while allowing common queries to return consecutive rows

i. Can you have both distributed writes and block reads? ii. E.g. highway - milemarker - reverse timestamp (I35-347-123456789)

b. Add reverse timestamp to the rowkey

i. Will you often need to retrieve the latest few records? ii. E.g. highway - milemarker - reverse timestamp (I35-347-123456789)

Make query parameter the row key

i. What is the most common query you need to support? ii. E.g. highway - milemarker (I35-347) iii. Entities are considered related if users are likely to pull both records in a single query; this makes reads more efficient iv. Results would come from the same tablet

Sentiment analysis

inspects the given text and identifies the prevailing emotional opinion within the text, especially to determine a writer's attitude as positive, negative, or neutral. Sentiment analysis is performed through the analyzeSentiment method

Entity analysis

inspects the given text for known entities (Proper nouns such as public figures, landmarks, and so on. Common nouns such as restaurant, stadium, and so on.) and returns information about those entities

Entity sentiment analysis

inspects the given text for known entities (proper nouns and common nouns), returns information about those entities, and identifies the prevailing emotional opinion of the entity within the text, especially to determine a writer's attitude toward the entity as positive, negative, or neutral. Entity sentiment analysis is performed with the analyzeEntitySentiment method.

Dataflow

is a runner. Each step is elastically scaled

Refactoring

is a software engineering term that essentially means that you're taking your program and you're changing the design of your program without adding any extra features. The reason you are changing the design of your program is so that you can do extra things with it.

The timestamp of the message in Pub Sub

is going to be the time at which you call the publish method. The time at which you publish into Pub/Sub is the timestamp of the message.

A graph definition

is separate from the training loop because this is a lazy evaluation model. It minimizes the Python-C++ context switches and enables the computation to be very efficient. Note that c, after you call tf.add, is not the actual values. You have to evaluate c in the context of a TensorFlow session to get a numpy array of values, which we are calling "numpy_c".
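A tiny example of that lazy-evaluation pattern (TensorFlow 1.x style, matching the tf.add / numpy_c names mentioned above):

import tensorflow as tf

# build the graph: nothing is computed yet
a = tf.constant([5, 3, 8])
b = tf.constant([3, -1, 2])
c = tf.add(a, b)            # c is a Tensor in the graph, not actual values

# evaluate c in the context of a session to get a numpy array
with tf.Session() as sess:
    numpy_c = sess.run(c)
    print(numpy_c)          # [ 8  2 10]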

So one of the cool things is that the TextLineReader in TensorFlow not only reads from local files,

it also reads from Google Cloud Storage

Dataflow is a core part of this architecture

it does ingest, it does transformation, and it does load; it can do filtering, it can do grouping, and it can do windowing, and windowing of course is very important if you're dealing with unbounded data, i.e. streaming data.

Bigtable and clusters

it uses clusters, but those clusters only contain pointers to the data; they don't contain the data itself. The clusters consist of nodes, and these nodes contain the metadata; the data itself remains on Colossus, Google's underlying storage system.

Cloud Storage Usage

persistent storage and as staging ground for import to other google cloud products.

we looked at how to do resilient stream processing on GCP

it was important to be able to ingest variable volumes because you could have spikes in your data, it's important to be able to deal with latency because latency is a fact of life, and we want to be able to derive real-time insights from the data even as the data are streaming in.

Example

label + input

Cloud Dataflow connector for Cloud Bigtable.

makes it possible to use Cloud Bigtable in a Cloud Dataflow pipeline.

model

mathematical function that takes an input and creates an output that approximates the label for that input.

Cloud Storage default storage class

multi-regional, regional, nearline, coldline

Pig

often used for cleaning up data and for turning semi-structured data into structured data and originally developed to submit MapReduce jobs.

Epoch

one pass through the entire dataset; it consists of going through multiple batches. In our example, if our training dataset had 100,000 samples and each batch was 100, then an epoch consists of 1,000 batches, or 1,000 steps ("step" is another word that's often used: one step is one tweak of the weights).

Neural network

only as good as the input that it is provided with.

Side inputs

other, smaller data (for example, from BigQuery) that you need to read in and join with your main input

If, for example, you're simulating historical data, so you're publishing at some time in 2017 but the data actually comes from 2008, we might want to set metadata which is the timestamp of the message, because

Pub/Sub is not going to parse the message and figure out what the actual timestamp is inside it. The other thing to be aware of is that you don't have to call publish one at a time for every message.
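A minimal sketch of publishing with a custom timestamp attribute (assuming the google-cloud-pubsub Python client; the project, topic, payload, and attribute name are hypothetical):

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('my-project', 'sensor-readings')   # hypothetical names

# the event time travels as a message attribute; Pub/Sub itself only stamps the publish time
future = publisher.publish(topic_path,
                           b'speed=74,lane=2',
                           event_timestamp='2008-01-01T00:00:00Z')
future.result()   # block until the message has been published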

The Vision API

recognizes objects and other qualities in images

The Speech API

recognizes spoken words in audio files

PTransform:

represents a data processing operation, or a step, in your pipeline

A PCollection

represents a distributed data set that your Beam pipeline operates on

Direct runner

running locally

Cloud ML Engine Scalable:

scale your service with as many machines as needed to reach the required number of queries per second. And this is important: you need high-quality execution at both training and prediction time, because while computing a TensorFlow model once is relatively cheap, the point of an ML model is to do prediction for lots and lots of repeated requests.

Pub/Sub is

serverless as we'll see later, you don't need to start any clusters, just publish messages to a topic

Cloud ML Engine simplifies

simplifies the bookkeeping and ensures that the trained model is what you run at prediction time. This helps you handle training-serving skew; it can be quite easy otherwise for the training pipeline to do something that the prediction pipeline doesn't do.

MapReduce approach splits Big Data

so that each compute node processes data local to it

Linear for

sparse, independent features

A chatbot

special purpose program that's designed to conduct a convincingly intelligent conversation

sharding

split the data as it's being copied in from mass storage, and distribute it to the nodes, a process called sharding.

Instead of using HDFS for storage

store your data from Dataproc in Google Cloud Storage

globally

in cases where you want to combine across the entire PCollection, e.g. a sum over all of the floats

The difference between a view and a table

that a new table is materialized, it is no longer live. A view is not materialized and therefore it is live

When a message is read from PubSub by a subscriber

that data includes that timestamp. The timestamp is the basis of all windowing primitives including watermarks, triggers, and lag monitoring of delayed messages

Cloud Spanner is suited for applications

that require relational database support, strong consistency, transactions, and horizontal scalability.

we need this extra input function

that will map between the JSON received from the REST API and the features as expected from the model. So this extra input function is called the serving input function.

Batch size

the amount of data we compute error on

precision and recall are the measures that you will use to describe the performance of your ML model if

the dataset is unbalanced

Dataflow runner

the graph gets launched on the cloud, and all of the compute is now happening on the cloud.

The scale-tier essentially controls

the kind of resources that you want this program to take

Vertical scaling (scaling up):

the mainframe solution: you build or acquire a larger computer.

When a publisher publishes a message,

the message is associated with a timestamp

legacy SQL

the original Google SQL that currently is a default in BigQuery

Input

the thing that you know and that you can provide even at the time of prediction. For example, if the inputs are images, the image itself is an input.

Training

this process of adjusting the weights of a model in such a way that it can make predictions, given an input.

Prediction

this process of taking an input in and applying the mathematical model to it, so as to get an output that hopefully is the correct output for that input

Beam supports

time-based shuffle (Windowing) (Published datatime, not received datatime; event time vs processing time)

Google Data Studio needs access

to Bigquery

The watermark:

tracks how far behind the system is 1. Where in event time to compute? 2. When in processing time to emit?

The Translation API

translates among 80 languages

weights

tunable parameters

If you are running Java code in Dataflow

use Maven, because Maven will also take care of downloading dependencies for you and managing them. By default this runs locally, where the default runner is a local runner (the direct runner).

Bigquery and Bigtable

user generated queries, ad-hoc queries, queries that you have that you do once in a long while.

A project contains

users and datasets

An important solution to the diminishing returns of vertical scaling

was horizontal scaling, also called distributed processing. Instead of provisioning a bigger machine, you use a cluster of smaller machines called nodes. One early software for coordinating the nodes was called MapReduce. To make distributed processing work in a traditional storage environment, you have to split the data as it's being copied in from mass storage,and distribute it to the nodes, a process called sharding.

Handling late data

watermarks, triggers, accumulation

Cloud Pub/Sub pull subscriptions

whenever the client system is ready to process a new message, it goes ahead and asks: are there any new messages?

The ML model microservice

will auto-scale for you, all the way down to zero if there's no traffic, and up to however many machines you need if you have lots of traffic.

Dataflow instead of keying off the internal PubSub ID

will now key off on this attribute instead and make sure that any particular ID gets processed only once.

Bigtable Connecting:

To work with Bigtable, you use the HBase API. You basically get your table from the connection, create a put operation, add all of your columns, and then put that into the table; you've now added a new row to the table.
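The course works through this with the HBase Java API; purely as an illustration, a similar write using the google-cloud-bigtable Python client might look like this (the project, instance, table, column family, and row key are all hypothetical):

from google.cloud import bigtable

client = bigtable.Client(project='my-project')       # hypothetical project id
instance = client.instance('my-instance')            # hypothetical instance id
table = instance.table('sensor-data')                # hypothetical table name

row = table.row(b'I35-347-123456789')                # row key: highway-milemarker-reverse timestamp
row.set_cell('data', 'speed', b'74')                 # column family, column qualifier, value
row.set_cell('data', 'lane', b'2')
row.commit()                                         # writes the new row to Bigtable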

