Combined Courses
Steps to stream into Bigtable using Dataflow
1. Get/create table a. Get authenticated session b. Create table 2. Convert object to write into Mutation(s) inside a ParDo 3. Write mutations to bigtable
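A minimal Beam Python sketch of these steps (project, instance, table, and column family names are illustrative; the table and column family are assumed to already exist):

import apache_beam as beam
from apache_beam.io.gcp.bigtableio import WriteToBigTable
from google.cloud.bigtable import row as bt_row

class ToMutation(beam.DoFn):
    # Convert each element (a dict here) into a Bigtable row mutation
    def process(self, element):
        direct_row = bt_row.DirectRow(row_key=element['key'].encode('utf-8'))
        direct_row.set_cell('cf1', b'value', str(element['value']).encode('utf-8'))
        yield direct_row

with beam.Pipeline() as p:
    (p
     | beam.Create([{'key': 'sensor1#1234', 'value': 42}])
     | beam.ParDo(ToMutation())
     | WriteToBigTable(project_id='my-project',
                       instance_id='my-instance',
                       table_id='my-table'))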
Create a bigtable cluster using gcloud (or web UI)
"gcould beta bigtable instance create INSTANCE"
The need for fast decisions leads to streaming
1. Massive data, from varied sources, that keeps growing over time 2. Need to derive insights immediately in the form of dashboards 3. Need to make timely decisions
Cloud Storage
persistent storage, durable, replicated. It can be made globally available if you need to. Usually used as a first step of the data life cycle
Apache Spark
popular, flexible, powerful way to process large datasets; it's able to mix different kinds of applications and to adjust how it uses the available resources.
Precision
positive predictive value = TP / (TP + FP)
Bigtable is
probably a better place for high throughput sensor data.
Training ML
process of optimizing the weights; includes gradient descent + evaluation
Dataflow
provides a fully-managed, autoscaling execution environment for Beam pipelines
In order to do that, we looked at an architecture that consisted of ingesting the data into
Pub/Sub, processing the data in stream using Cloud Dataflow, and streaming it into BigQuery for durable storage and interactive analysis.
Compute engine
you take your workload and you just run it as is on the cloud
to associate the messages with the timestamps
you have to create that association in the PCollection. In this case, rather than use ProcessContext.output(), you use ProcessContext.outputWithTimestamp().
Datalab applications that ran previously
1. Do not show up in Dataproc > Jobs page in Console. 2. Only applications submitted from Console are tracked in Console.
Ingesting data into a pipeline
1. Read data from file system, GCS, BigQuery, Pub/Sub a. Text formats return String b. BigQuery returns a TableRow
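For example, with the Beam Python SDK (bucket, project, and table names are illustrative; reading from BigQuery with a query may also need a temp/GCS location configured on the pipeline):

import apache_beam as beam

with beam.Pipeline() as p:
    # Text files: one string per line
    lines = p | 'ReadText' >> beam.io.ReadFromText('gs://my-bucket/input*.csv')

    # BigQuery: one dict per row in Python (the Java SDK returns TableRow objects)
    rows = p | 'ReadBQ' >> beam.io.ReadFromBigQuery(
        query='SELECT name, value FROM `my-project.my_dataset.my_table`',
        use_standard_sql=True)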
Custom ML models
1. Tensorflow 2. Machine Learning Engine
Creating a cluster on Dataproc
1. a deployment manager template, which is an infrastructure automation service in Google Cloud 2. CLI commands 3. Google Cloud console
neural network model
A hidden layer, a layer of neurons, is a combination of neurons, all of which share the same set of inputs.
Evaluation
Is the model good enough? Has to be done on full dataset
perKey
Key-value pairs such as in tuples
learning rate
The size of the changes we make to the weights
Cloud Dataproc
cloud based implementation of Hadoop.
Bigquery details:
easy, inexpensive 1. Latency in order of seconds 2. 100k rows/second streaming
if you need globally consistent data or more than one cloud sql instance
use cloud spanner
Train_and_evaluate Manages:
1. Distribute the graph 2. Share variables 3. Evaluate occasionally 4. Handle machine failures 5. Create checkpoint files 6. Recover from failures 7. Saves summaries for Tensor Board
This skew can be caused by
1. A discrepancy between how you handle data in the training and serving pipelines. 2. A change in the data between when you train and when you serve. 3. A feedback loop between your model and your algorithm.
Designing for Bigtable
1. A table has only one index (the row key) 2. Group related columns into column families 3. Two types of designs (wide or narrow tables) 4. Rows are sorted lexicographically by row key, from lowest to highest bytes 5. Queries that use the row key, a row prefix, or a row range are the most efficient 6. Store related entities in adjacent rows 7. Distribute your writes and reads across rows 8. Design row keys to avoid hot spotting
Machine Learning
1. A way to use standard algorithms to derive predictive insights from data and make repeated decisions. 2. Works by adjusting the weights of a mathematical function (the model) so that its output matches the labeled examples.
A dataset contains tables and views
1. Access Control Lists for Reader/Writer/Owner 2. Applied to all tables/views in dataset
Compute nodes on GCP are
1. Allocated on demand, and you pay for the time that they are up. 2. Fungible resource 3. Software packages need to be downloaded 4. Provide unlimited options because of customization!
So, you can write JavaScript UDFs, you can write SQL UDFs
1. Amount of data UDF outputs per input row should be <5 MB 2. Each user can run 6 concurrent JavaScript UDF queries per project 3. Native code JavaScript functions aren't supported 4. JavaScript bitwise operations handle only the most significant 32 bits 5. A query job can have a maximum of 50 JavaScript UDF resources a. Each inline code blob is limited to a maximum size of 32 KB b. Each external code resource is limited to a maximum size of 1 MB
Interactive, Iterative Development with Cloud Datalab:
1. Analyze data in BigQuery, Compute Engine, or Cloud Storage 2. Use existing Python packages
Monitor BigQuery with stackdriver
1. Available for all Big Query customers 2. Fully interactive GUI. Customers can create custom dashboards displaying up to 13 BigQuery metrics, including: a. Slots utilization b. Queries in flight c. Uploaded bytes d. Stored bytes
Bigquery uploading data:
1. Batch - web console and command line using the bq command; batch data on Cloud Storage, or streaming data via Cloud Dataflow 2. Stream - stream data in with Cloud Dataflow. If, for example, you're receiving sensor data in real time, log data in real time, you can process them with Cloud Dataflow and stream them into BigQuery. Even as the data are streaming in, you can run queries on that data. 3. Federated data source - raw form as CSV files or JSON files or Avro files (including Google Sheets - query it with BigQuery)
Streaming data into Bigquery
1. BigQuery provides streaming ingestion at a rate of 100,000 rows/table/second a. Provided by the REST API's tabledata().insertAll() method b. Works for partitioned and standard tables 2. Streaming data can be queried as it arrives a. Data available within seconds 3. For data consistency, enter an insertId for each inserted row a. De-duplication is done on a best-effort basis, and can be affected by network errors b. Can be done manually
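A sketch of a streaming insert with the google-cloud-bigquery Python client (table and field names are illustrative); row_ids supplies the insertId used for best-effort de-duplication:

from google.cloud import bigquery

client = bigquery.Client()
rows = [{"sensor_id": "s-42", "temp": 21.7}]
errors = client.insert_rows_json("my-project.my_dataset.readings",
                                 rows,
                                 row_ids=["s-42-2017-04-12T23:20:50Z"])
if errors:
    print("Streaming insert errors:", errors)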
Pub/Sub simplifies event distribution:
1. By replacing synchronous point-to-point connections with a single, highly available asynchronous bus 2. Asynchronous -> publisher never waits a. A subscriber can get the message now or any time (within 7 days) 3. Can avoid overprovisioning for spikes with Pub/Sub
A job is a potentially long-running action
1. Can be cancelled
3 aspects to Big data
1. Can use the same tools for batch as for streaming 2. Another aspect of big data is variety; audio, video, images, etc, unstructured text, blog posts. 3. But the third aspect to big data, is near real-time data processing, data that's coming in so fast that you need to process it to keep up with the data
Tips for improving performance
1. Change schema to minimize data skew 2. Takes a while after scaling up nodes for performance improvement to be seen 3. Test with > 300 GB and for minutes-to-hours 4. Disk Speed: SSD faster than HDD 5. Performance increases linearly with number of nodes 6. Make sure clients and Bigtable are in same zone
Before you begin, though, be sure to gather and prepare your data: clean, split, engineer features, preprocess features
1. Do all this on your training data. 2. Put your data in an online source that Cloud ML can access, for example, Cloud Storage.
This module covered ways to expand the value of Dataproc by leveraging other services available on Google Cloud Platform
1. Cloud storage is useful in many ways, not just holding data, but also holding application code and initialization scripts. It can also be used as an intermediary between BigQuery and Dataproc. 2. For more direct communications, a BigQuery connector is available in Dataproc, and you can develop an application that reads directly from BigQuery. 3. You also learned about how to output shards to handle parallel processing and to accumulate those shards back into a table when importing data into BigQuery leveraging JSON formatting and tools. 4. You learned about cluster customization with installation scripts, how to use metadata to determine whether the script is running on a master node or a worker node 5. you learned how properties for core software can be set for some configuration files from the command line or via the REST API.
In reality, ML is
1. Collect data 2. Organize data 3. Create model 4. Use machines to flesh out the model from data 5. Deploy fleshed out model
Bigquery storage is
1. Columnar 2. Each column in separate, compressed encrypted file that is replicated 3+ times 3. No indexes, keys, or partitions required 4. Meant for immutable, massive datasets
A table is a collection of columns
1. Columnar storage 2. Views are virtual tables defined by SQL query 3. Tables can be external (e.g., on Cloud Storage)
Load and export into BigQuery
1. Command line interface called bq (comes with the gcloud SDK) 2. Web user interface 3. Use an API: a Python API or a Dataflow API
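Besides bq and the web UI, a load job can also be started from Python; a sketch (URI and table name are illustrative):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True)                      # let BigQuery infer the schema
load_job = client.load_table_from_uri(
    "gs://acme-sales/data/sales.csv",     # hypothetical source file
    "my-project.sales_dataset.sales",     # hypothetical destination table
    job_config=job_config)
load_job.result()                         # wait for the load to finish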
Cloud Storage How do you get your data onto cloud storage?
1. Command line tool called gsutil (simplest; comes with the gcloud SDK) 2. So whichever machine you're going to be uploading the data from, install gcloud, get gsutil, and then say gsutil copy, that's cp: gsutil cp sales*.csv gs://acme-sales/data/
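The same upload can be done from Python with the Cloud Storage client (bucket and object names follow the gsutil example above):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("acme-sales")
blob = bucket.blob("data/sales.csv")      # destination object name
blob.upload_from_filename("sales.csv")    # local file to upload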
tf.transform
1. Computes min, max, vocab, etc., and stores them in metadata.json 2. In the serving function, use the metadata to scale the raw inputs before providing them to the model
To pass in a PCollection
1. Convert the PCollection to a View (asList, asMap) cz = ... czmap = cz.apply("ToView", View.asMap()) 2. Call the ParDo with side input(s): ParDo.withSideInputs().of() 3. Within the ParDo, get the side input from the context
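In the Beam Python SDK the equivalent is to wrap the PCollection (e.g. with beam.pvalue.AsDict or AsList) and pass it as an extra argument; a sketch with made-up data:

import apache_beam as beam

with beam.Pipeline() as p:
    rates = p | 'Rates' >> beam.Create([('usd', 1.0), ('eur', 1.1)])
    orders = p | 'Orders' >> beam.Create([('eur', 20.0), ('usd', 5.0)])

    # rate_map is materialized as a dict and made available to every element
    in_usd = orders | beam.Map(
        lambda order, rate_map: order[1] * rate_map[order[0]],
        rate_map=beam.pvalue.AsDict(rates))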
Tensorflow Steps
1. Create task.py to parse command-line parameters and send along to train_and_evaluate 2. The model.py contains the ML model in TensorFlow(Estimator API) 3. Package up TensorFlow model as python package (needs to contain an __init__.py in every folder) 4. Verify that the model works as a Python package 5. Then use the gcloud command to submit the training job, either locally or to cloud
Bucketizing in Cloud ML
1. Creating bucketized features using Tensorflow 2. Number of buckets is a hyper parameter 3. tf.feature_column.bucketized_column(lat,latbuckets) 4. pipeline for bucketized and crossed features
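For example (column names, boundaries, and the number of buckets are illustrative):

import numpy as np
import tensorflow as tf

nbuckets = 16   # hyperparameter
lat = tf.feature_column.numeric_column('pickuplat')
lon = tf.feature_column.numeric_column('pickuplon')
latbuckets = np.linspace(38.0, 42.0, nbuckets).tolist()
lonbuckets = np.linspace(-76.0, -72.0, nbuckets).tolist()
b_lat = tf.feature_column.bucketized_column(lat, latbuckets)
b_lon = tf.feature_column.bucketized_column(lon, lonbuckets)
# Feature cross of the two bucketized columns
ploc = tf.feature_column.crossed_column([b_lat, b_lon], hash_bucket_size=nbuckets * nbuckets)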
Apply Transform to PCollection
1. Data in a pipeline are represented by a PCollection a. Supports parallel processing b. Not an in-memory collection; can be unbounded c. Applying a transform to a PCollection returns another PCollection
Beyond linear regression with estimators
1. Deep neural network (DNNRegressor) 2. Classification (LinearClassifier or DNNClassifier)
Did we use triggers?
1. Default trigger setting used, which is trigger first when the watermark passes the end of the window, and then trigger again every time there is late arriving data
Steps to define an Estimator API model
1. Define input feature columns 2. Create a model, passing in the feature columns 3. Write input_fn (returns features, labels); features is a dict 4. Train the model using model.train 5. Use trained model to predict
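A minimal sketch of those steps with the TF 1.x Estimator API, using made-up data:

import tensorflow as tf

# 1. Define input feature columns
featcols = [tf.feature_column.numeric_column('sq_footage')]

# 2. Create a model, passing in the feature columns
model = tf.estimator.LinearRegressor(featcols, model_dir='./trained_model')

# 3. Input function returns (features dict, labels)
def train_input_fn():
    features = {'sq_footage': tf.constant([1000.0, 2000.0, 3000.0])}
    labels = tf.constant([100.0, 200.0, 300.0])
    return features, labels

# 4. Train the model
model.train(train_input_fn, steps=100)

# 5. Use the trained model to predict
def predict_input_fn():
    return {'sq_footage': tf.constant([1500.0])}

predictions = model.predict(predict_input_fn)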
Pub/sub is a low latency, guaranteed delivery service
1. Does not guarantee order of messages 2. At-least-once delivery means that repeated delivery is possible
Relational databases
1. Do not support very, very high throughput 2. Scale pretty well to a few hundred gigabytes for queries 3. Aggregations on structured data 4. Transactional updates on relatively small datasets
Bigtable: row keys to avoid
1. Domains 2. Sequential (numeric) IDs 3. Static, repeatedly updated identifiers
Cloud ML Engine Scalable Training:
1. During training, Engine will help you distribute your pre-processing, your model training, even your hyperparameter tuning and finally deploy your trained model to the cloud.
Data studio lets you build dashboards and reports
1. Easy to read, share, and fully customizable 2. Handles authentication, access rights, and structuring of data
Stream processing:
1. Element-wise stream processing is easy 2. Aggregating is hard 3. Composite on unbounded data is super difficult
Pull:
1. Endpoint can be a server or a device capable of making API call 2. Delays between publication and delivery 3. Ideal for large number of dynamically created subscribers
Push:
1. Endpoint can only be HTTPS server accepting Webhook 2. Immediate delivery; no latency 3. Ideal for subscribers needing closer to real time performance
Hyperparameters tuning
1. Ensure that your model writes out evaluation metrics periodically. 2. Ensure that the outputs of different trials don't clobber each other. 3. Create a YAML configuration file. 4. Submit training job, configuration file included.
What do you pass in train_and_evaluate?
1. Estimator 2. Train spec 3. Eval spec
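A sketch of the three pieces, assuming estimator, train_input_fn, eval_input_fn, and serving_input_fn are already defined (as in the surrounding cards):

import tensorflow as tf

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=5000)
exporter = tf.estimator.LatestExporter('exporter', serving_input_fn)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn,
                                  steps=None,          # evaluate on the full eval set
                                  exporters=exporter,
                                  throttle_secs=60)    # evaluate at most once a minute
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)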
Handle late data: watermarks, triggers, accumulation
1. Event time a. Bound to the source of the event e.g. the event of ID X is given a timestamp relative to the source scope b. Always < processing time 2. Processing time a. Relative to the engine processing the event e.g. the event of ID X and event time Y is being processed now b. Always > event time
ParDo is useful for a variety of common data processing operations including
1. Filtering a data set. 2. Formatting or type-converting each element in a data set. 3. Extracting parts of each element in a data set. 4. Performing computations on each element in a data set.
Programming TensorFlow requires 2 steps:
1. First step, create the graph. 2. Second step, run the graph. 3. Does lazy evaluation: you need to run the graph to get results
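A minimal TF 1.x example of the two steps:

import tensorflow as tf

# Step 1: build the graph (nothing is computed yet - lazy evaluation)
a = tf.constant([3.0, 5.0])
b = tf.constant([4.0, 7.0])
c = tf.add(a, b)

# Step 2: run the graph to get results
with tf.Session() as sess:
    print(sess.run(c))   # [ 7. 12.]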
Need to process variable amounts of data that will grow over time:
1. Fixed or slowly scaled clusters are a waste 2. Windowing lets us answer the question of "where in event time" we are computing the aggregates a. Windowing divides data into event-time-based finite chunks b. Required when doing aggregations over unbounded data c. Fixed, sliding, sessions 3. Beam's unified model is very powerful and handles different processing paradigms
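For example, a fixed one-minute window in the Beam Python SDK (project and topic names are illustrative):

import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

opts = PipelineOptions()
opts.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=opts) as p:
    (p
     | beam.io.ReadFromPubSub(topic='projects/my-project/topics/sandiego')
     | 'window' >> beam.WindowInto(window.FixedWindows(60))   # 60-second fixed windows
     | 'one' >> beam.Map(lambda msg: ('events', 1))
     | 'count' >> beam.CombinePerKey(sum))                    # count of events per window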
What is BigQuery?
1. Fully managed data warehouse 2. Fast, petabyte-scale with the convenience of SQL 3. Encrypted, durable, and highly available 4. Virtually unlimited resources; only pay for what you use 5. Provides streaming ingest to unbounded data sets
BigQuery Characteristics:
1. Fully managed data warehouse that lets you do ad-hoc SQL queries on massive volumes of data 2. Ways to ingest data into BigQuery: a. Files on disk or Cloud Storage b. Streamed data c. Federated data sources
Cloud SQL characteristics:
1. Fully managed database service 2. Flexible pricing 3. Familiar 4. Managed backups 5. Automatic replication 6. Fast connection from Compute Engine and App Engine 7. Connection from anywhere 8. Google security
Cloud SQL and Cloud Dataproc offer familiar tools. What is the value-add provided by Google Cloud Platform?
1. Fully-managed versions of the software offer no-ops 2. Running it on Google infrastructure offers reliability and cost savings 3. Fast Random Access 4. CRUD operations are easily implemented on Datastore
Built-in functions are faster than JavaScript UDFs
1. Functions - what work are we doing on the data? 2. Guideline - some operators are faster than others
Create subscriptions, pull messages
1. gcloud pubsub subscriptions create --topic sandiego mysub1 2. gcloud pubsub subscriptions pull --auto-ack mysub1
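The Python client equivalent of pulling from mysub1 (project ID is illustrative):

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path('my-project', 'mysub1')

def callback(message):
    print(message.data)
    message.ack()        # acknowledge within the ackDeadline

streaming_pull = subscriber.subscribe(sub_path, callback=callback)
try:
    streaming_pull.result(timeout=30)   # listen for ~30 seconds
except Exception:
    streaming_pull.cancel()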
Tensorflow Overfitting:
1. Get more training data - a model trained on more data will naturally generalize better 2. When that is no longer possible, the next best solution is to use techniques like regularization 3. Reduce the size of the model i.e. the number of learnable parameters in the model (which is determined by the number of layers and the number of units per layer) 4. Dropout, applied to a layer, consists of randomly "dropping out" (i.e. set to zero) a number of output features of the layer during training; The "dropout rate" is the fraction of the features that are being zeroed-out; it is usually set between 0.2 and 0.5
Dataproc
1. Google-managed: Hadoop, Pig, Hive, Spark 2. Image versioning 3. Familiar 4. Resize in seconds 5. Automated cluster management 6. Integrates with google cloud 7. Flexible VMs 8. Google Security
Low cardinality Group Bys are faster
1. Grouping - how much data are we grouping per-key for aggregation? 2. Guideline - low cardinality/groups -> fast, high-cardinality -> slower 3. However, high key cardinality (more groups) leads to more shuffling
Asynchronous processing advantages:
1. High availability 2. Balance load across multiple workers to handle high throughput 3. Reduce coupling 4. Reduce latency
TensorFlow toolkit hierarchy
1. High-level "out-of-box" API does distributed training 2. Components useful when building custom NN models 3. Python API gives you full control 4. C++ API is quite low-level 5. TF runs on different hardware (CPU/GPU/TPU/Android)
What is work for a query?
1. I/O - how many bytes did you read? 2. Shuffle - how many bytes did you pass to the next stage? a. Grouping - how many bytes do you pass to each group 3. Materialization - how many bytes did you write? 4. CPU work - User-defined functions (UDFs), functions
Streaming(data processing for unbounded datasets)
1. Infinite data set 2. Is never complete, especially when considering time 3. Stored in multiple temporary, yet durable stores
Challenge #1: Variable volumes require ability of ingest to scale and be fault-tolerant
1. Ingesting variable volumes: massive amounts of streaming events, handle spiky/bursty data, high availability and durability 2. the way to get a durable, highly available messaging system is to use Pub/Sub.
The original data is organized visually, but if you had to write an algorithm to process the data, how might you approach?
1. It could be by rows, by columns, by rows then fields, and the different approaches would perform differently based on the query. 2. Your method might not be parallelizable. The original data can be interpreted and stored in many ways in a database.
Pub/Sub: how it works: Topics and subscriptions
1. It's an asynchronous communication pattern 2. Multiple topics in Pub/Sub 3. One or more publishers send a message to a topic 4. At-least-once delivery guarantee a. A subscriber ACKs each message for every subscription b. A message is resent if the subscriber takes more than "ackDeadline" to respond c. A subscriber can extend the deadline per message 5. Exactly once, ordered processing a. Pub/Sub delivers at least once b. Dataflow: deduplicate, order and window c. Separation of concerns -> scale
Do the biggest joins first
1. Joins - in what order are you merging data? 2. Guideline - Biggest, smallest, decreasing size thereafter 3. Avoid self-join if you can, since it squares the number of rows processed
Use projects to
1. Limit access to datasets and jobs 2. Manage billing
BigQuery pricing: free operations
1. Loading data into BigQuery 2. Exporting data from BigQuery 3. Any queries on metadata (how many rows are in this table, how many columns, what the column names are - queries like that are always free) 4. Any cached query is free: if you run a query and then run the exact same query in that project, it comes back from the cache; the cache is per user for privacy reasons, so two users in a project don't share the cache 5. Any query that has an error is also free
Cloud MLE supports hyperparameter tuning
1. Make the parameter a command line argument 2. Make sure outputs don't clobber each other 3. Supply hyperparameters to training job
Pub/Sub - publish-subscribe
1. Message bus that's a great way to deal with the challenge of ingesting variable volumes of data 2. Has the ability to ingest data at high speeds 3. Works well with streaming data 4. Durable and fault tolerant 5. No ops design 6. Serverless
It's important to look at how to get data into BigQuery, because that's something you'll be doing quite a bit
1. You can load data into BigQuery using a command line interface called bq; it comes with the gcloud SDK. 2. You can use the web user interface 3. You can use an API: a Python API, a Dataflow API. 4. Pretty much all of the tools in Google Cloud will be able to talk to BigQuery and will be able to write their data into BigQuery.
Tensorflow Underfitting:
1. Occurs when there is still room for improvement on the test data 2. Results if the model is not powerful enough, is over-regularized, or has simply not been trained long enough 3. Network has not learned the relevant patterns in the training data
Data studio connects to various GCP data sources
1. Offers a BigQuery connector 2. Read from a table or run a custom query 3. Build charts and graphs or lay data out on a map
Don't project unnecessary columns
1. On how many columns are you operating? 2. Excess columns incur wasted I/O and materialization
Filter early and often using WHERE clauses
1. On how many rows (or partitions) are you operating? 2. Excess rows incur "waste" similar to excess columns
General rules for feature engineering
1. Overly specific attributes should be discarded 2. Categorical values could be one-hot encoded 3. Preprocess data to create a vocabulary of keys 4. Don't mix magic numbers with data
ParDo allows for parallel processing:
1. ParDo acts on one item at a time (like a map in mapreduce) a. Multiple instances of class on many machines b. Should not contain any state 2. Useful for: a. Filtering b. Converting one Java type to another c. Extracting parts of an input (e.g. fields of tableRow) d. Calculating values from different parts of input
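A Python DoFn sketch, e.g. filtering lines and extracting one field (the input format is made up):

import apache_beam as beam

class ExtractAmount(beam.DoFn):
    # Stateless: each element is processed independently, possibly on many machines
    def process(self, element):
        fields = element.split(',')
        if len(fields) > 2:              # filtering
            yield float(fields[2])       # extracting/converting one part

with beam.Pipeline() as p:
    amounts = (p
               | beam.Create(['2017-04-12,store-1,10.5', 'bad-line'])
               | beam.ParDo(ExtractAmount()))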
Batch and Streaming set up
1. Pub/Sub is a global message bus. 2. Dataflow is capable of doing batch and streaming; the core doesn't change. It gives you the ability to deal with late data and unordered data. 3. BigQuery gives you the power of doing analytics both on historical data and on streaming data.
DAG steps
1. Read in data, transform it, write out 2. Can branch, merge, use if-then statements, etc.
what makes a good feature? Represent raw data in a form conducive to ML
1. Related to what is being predicted? 2. Value should be known for prediction 3. Numeric with meaningful magnitude 4. Enough examples 5. Good features bring human insight to problem
Datalab frees from being constrained by hardware limitations:
1. Run it on any Compute Engine instance that you want 2. Change the machine specs after it's been provisioned a. You can go into the web console, find the running VM, stop it and restart it with a new machine configuration 3. As a developer, to work in Datalab, you simply connect to the VM that's running the notebook server 4. The notebooks themselves can be persisted in Git or in a cloud repository, so you can delete the VM if you don't need it anymore. 5. To start up Cloud Datalab, you go into Cloud Shell and type datalab create. If you have gcloud installed on your local computer, you can run this datalab create command from your own computer instead of from Cloud Shell.
Cloud Dataflow on the cloud
1. Scalable, fault-tolerant multi-step processing of data 2. offers you ways to spend less on ops and administration. Incorporate real-time data into apps and architectures, apply machine learning broadly easily and as an end goal, create citizen data scientists so that everybody in your organization can work with data.
Streaming processing makes it possible to derive real-time insights from growing data
1. Scale to variable volumes 2. Act on real-time data using continuous queries 3. Derive insights from data in flight
Asynchronous processing Potential use cases:
1. Send an SMS 2. Train ML model 3. Process data from multiple sources 4. Weekly reports
To use custom timestamps, perhaps based on message producer's clock:
1. Set an attribute in pubsub with the timestamp when publishing batch.publish(event_data,mytime="2017-04-12T23:20:50.52Z") 2. Tell dataflow which PubSub attribute is the timestampLabel p.apply(PubsubIO.readStrings().fromTopic(t).withTimestampAttribute("mytime") ).apply(...)
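A publisher-side sketch in Python (project and topic are illustrative; the attribute name mytime matches the example above, and the Beam Python equivalent of withTimestampAttribute is ReadFromPubSub(..., timestamp_attribute='mytime')):

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('my-project', 'sandiego')

# Attributes are passed as keyword arguments; Dataflow will use 'mytime'
# as the element timestamp when told which attribute is the timestampLabel.
publisher.publish(topic_path,
                  b'speed=55,lane=2',
                  mytime='2017-04-12T23:20:50.52Z')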
Understand BigQuery plans
1. Significant difference between avg and max time a. Probably data skew - use APPROX_TOP_COUNT to check b. Filter early to work around it 2. Most time spent reading from intermediate stages a. Consider filtering earlier in the query 3. Most time spent on CPU tasks a. Consider approximate functions, inspect UDF usage, filter earlier
Training is very sensitive to batch-size and learning-rate
1. Size of model 2. Number of hash buckets 3. Embedding size
Order on the outermost query
1. Sorting - how many values do you need to sort? a. Filter first reduces the number of values you need to sort b. Ordering first forces you to sort the world
Can enforce only-once handling in dataflow even if your publisher might retry publishes
1. Specify a unique label when publishing to Pub/Sub 2. When reading, tell Dataflow which PubSub attribute is the idLabel
Learning rate
1. Start with a model with random weights 2. Calculate error on labeled dataset (every batch) 3. Change the weights so that the error goes down (every batch) 4. Repeat step 2-3 until the model is good enough.
BigQuery Not-Free
1. Storage a. Amount of data in table b. Ingest rate of streaming data c. Automatic discount for old data 2. Processing a. On-demand OR flat-rate plans b. On-demand based on amount of data processed c. 1 TB/month free d. Have to opt in to run high-compute queries
Pre-trained ML models
1. Vision API 2. Speech API 3. Jobs API 4. Translation API 5. Natural Language API
Dataflow Templates enable a new development and execution workflow
1. The templates help separate the development activities and the developers from the execution activities and the users. The user environment no longer has dependencies back to the development environment. 2. The new approach facilitates the scheduling of batch jobs and opens up more ways for users to submit jobs and more opportunities for automation. 3. Runtime parameters work through the ValueProvider interface, so that your users can set these values when the template is submitted. 4. ValueProvider can be used in I/O transforms and DoFn functions, and there are static and nested versions of ValueProvider for more complex cases. 5. You specify the location of the template in Cloud Storage, an output location in Cloud Storage, and the name-value parameters that map to the ValueProvider interface. 6. Example templates for basic tasks are provided, including word count, Cloud Pub/Sub to BigQuery, Cloud Storage text to Cloud Pub/Sub, Cloud Pub/Sub to Cloud Storage text, and so forth.
Can I not read directly from BigQuery?
1. There is a BigQuery reader in TensorFlow. 2. When we do our training, we're gonna be reading multiple times. We're gonna be reading from different parameter servers. We're gonna be reading in chunks. We can read from BigQuery, but it's gonna be cheaper to read from CSV.
Table partitioning
1. Time-partitioned tables are a cost-effective way to manage data 2. Easier to write queries spanning time periods 3. When you create tables with time-based partitions, BigQuery automatically loads data into the correct partition a. Declare the table as partitioned at creation time using this flag: --time_partitioning_type b. To create a partitioned table with an expiration time for data, use this flag: --time_partitioning_expiration
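The same options are available programmatically; a sketch with the Python client (table name, schema, and expiration are illustrative):

from google.cloud import bigquery

client = bigquery.Client()
table = bigquery.Table("my-project.my_dataset.events",
                       schema=[bigquery.SchemaField("ts", "TIMESTAMP"),
                               bigquery.SchemaField("value", "FLOAT")])
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    expiration_ms=90 * 24 * 60 * 60 * 1000)   # drop partitions older than ~90 days
client.create_table(table)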
Create topic and publish message
1. To create a topic, you use gcloud 2. Messages are opaque in Pub/Sub (no parsing) 3. Publishing actually sends a web call, a REST API call.
Unbounded datasets are quite common
1. Traffic sensors along highways 2. Usage information of Cloud component by every user with a GCP project 3. Credit card transactions 4. User moves in multi-user online gaming
Reading from Bigtable
1. Typically programmatic using HBase API 2. HBase command line client 3. Bigquery
Cluster performance
1. Under typical workloads cloud Bigtable delivers highly predictable performance. When everything is running smoothly, you can expect the following performance for each node in your Cloud Bigtable cluster, depending on which type of storage your cluster uses
There are three steps to training your model at Cloud ML Engine.
1. Use TensorFlow to write your code. 2. Package up your trainer as a Python module. 3. Configure and start your ML Engine job.
For realistic, real-world ML models, we need to:
1. Use a fault-tolerant distributed training framework 2. Choose a model based on validation datasets 3. Monitor training, especially if it will take days 4. Resume training if necessary
Best practices Dataflow + BigQuery to enable fast data-driven decisions
1. Use dataflow to do the processing/transforms 2. Create multiple tables for easy analysis 3. Take advantage of BigQuery for streaming analysis for dashboards and long term storage to reduce storage cost 4. Create views for common query support
Python: Map vs flatmap
1. Use Map for a 1:1 relationship between input & output 2. FlatMap for non-1:1 relationships, usually with a generator 3. Java: use apply(ParDo) for both cases
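For example:

import apache_beam as beam

with beam.Pipeline() as p:
    lines = p | beam.Create(['the quick brown fox', 'jumps'])

    # Map: exactly one output per input (1:1)
    lengths = lines | beam.Map(len)

    # FlatMap: zero or more outputs per input (iterable/generator)
    words = lines | beam.FlatMap(lambda line: line.split())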
Wildcard tables - Standard SQL
1. Use wildcards to query multiple tables using concise SQL statements 2. Wildcard tables are the union of tables matching the wildcard expression 3. Useful if your dataset contains: a. Multiple, similarly named tables with compatible schemas b. Sharded tables 4. When you query, each row contains a special column with the wildcard match. Example: FROM `bigquery-public-data.noaa_gsod.gsod*` matches all tables in noaa_gsod that begin with the string 'gsod'. The backtick (`) is required. Richer prefixes perform better than shorter prefixes, for example .gsod200* versus .*
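Issued from the Python client, a wildcard query over the public dataset in the example looks roughly like this:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT `max` AS max_temp, _TABLE_SUFFIX AS table_id
FROM `bigquery-public-data.noaa_gsod.gsod194*`   -- matches gsod1940, gsod1941, ...
WHERE `max` != 9999.9
LIMIT 10
"""
for row in client.query(sql).result():
    print(row.max_temp, row.table_id)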
Techniques to deal with the three Vs
1. Volume a. Terabytes, petabytes b. Mapreduce autoscaling analysis 2. Velocity a. Realtime or near-realtime b. streaming 3. Variety a. Social networks, blog posts, logs, sensors b. Unstructured data and machine learning
Triggers control when results are emitted
1. What are you computing? What = transformations 2. Where in event time? Where = windowing 3. When in processing time? When = watermarks + triggers 4. How do refinements relate? How = accumulation
How dataflow handles streaming data while balancing tradeoffs
1. Windowing model a. Which supports unaligned eventtime windows, and a simple API for their creation and use 2. Triggering model a. That binds the output times of results to runtime characteristics of the pipeline with a powerful and flexible declarative API for describing desired triggering semantics 3. Incremental processing model a. That integrates retractions and updates into the windowing and triggering models
Stream processing in Dataflow accounts for this
1. Works with out-of-order messages when computing aggregates 2. Automatically removes duplicates based on the internal Pub/Sub ID
Cloud Datalab:
1. Write code in Python 2. Run cell (shift enter) 3. Examine output 4. Write commentary in markdown 5. Share and collaborate
Can write data out to same formats
1. Write data to file system, GCS, BigQuery, Pub/Sub 2. Can prevent sharding of output (do only if it is small) 3. May have to transform PCollection<Integer>, etc. to PCollection<String> before writing out
Modifying bigtable clusters
1. You can add or remove nodes or change a cluster's name without any downtime 2. You cannot modify the Cluster ID, zone, or storage type after a cluster is created
Understand query performance
1. You can optimize your queries and your data, but you still need to monitor performance 2. Two primary approaches: a. Per-query explain plans i. What did my query do? b. Project-level monitoring through Google Stackdriver i. What is going on with all my resources in this project?
Dataproc customization options
1. You can start up a single node cluster where all the Hadoop services are installed on a single VM. If you're developing code, you might want to use this for cost control or to give each developer their own environment. 2. Standard mode has a single master node. In Hadoop, the master node is the ingress point for job submission. Normally, having a single master node is sufficient. 3. If you have a very long-running job then you might want to use the high availability option. That provides three master nodes, so the loss of a single VM will not result in losing the job.
Three possible places to do feature engineering:
1. You could do it on the fly as you read in the data in the input function itself, or by creating feature columns. 2. Alternately, you could do it as a separate step before you do the training. And then your input function reads the preprocessed data. 3. The third option, is to do the pre-processing in Dataflow and create a preprocessed data features.
Three things that you need to do to build an effective machine learning model:
1. You need to scale it out to large data; we just looked at that with Cloud ML 2. You need to do what's called feature engineering 3. Hyper parameter tuning
Dataprep provides
1. a high-leverage method to quickly create Dataflow pipelines without coding. 2. This is especially useful for data quality tasks and for Master Data tasks, combining data from multiple sources where programming may not be required. 3. The pipeline can be output as a Dataflow Template for continued use in Dataflow. 4. For example, you could set up a data quality job to clean up source data provided by a native system that's destined for data analysis. Then this template can be used by the administrative staff periodically to submit clean data for the analysis tasks. 5. The module also covered processing logs in Cloud Dataflow and some Apache Beam resources.
Dataproc
1. a managed Hadoop service on Google Cloud Platform 2. It's fast, convenient, and offers several unique flexible features. 3. Bdutil eliminates the complexity of deploying a cluster. And if you use it in the cloud, connectivity to cloud based services ceases to be an issue. So it eliminates a lot of the IT overhead. 4. Operational and performance tuning overhead remains - Responsible for your own custom code used in your jobs.
Pig
1. a scripting language; you write your MapReduce programs in that scripting language at a higher level. 2. It is almost like an ETL language - extraction, transformation, loading of data.
BigQuery in respect to Dataproc
1. a serverless, highly scalable, low-cost enterprise data warehouse with a fast interactive interface designed for use by data analysts. 2. utility overlaps significantly with Hadoop, and in some cases can be used instead of Hadoop. 3. an add-on to Dataproc to extend the abilities of Dataproc.
Apache beam
1. a unified model for batch and stream processing 2. supporting multiple runtimes
Dataproc connecting
1. also hosts the HDFS name node at port 9870, which gives insight in the HDFS. 2. You can SSH directly to the cluster nodes. you can use a SOCKS proxy to connect your browser through a SSH tunnel. One reason is to use SSH is to directly access software installed on the cluster such as Hive, Pig, and Pyspark. This can be a great way of interacting directly with the cluster and learning about the open source software that's installed by default on Dataproc.
Dataprep overview
1. an interactive graphical system for preparing structured or unstructured data for use in analytics (such as BigQuery), visualization (like Data Studio), and to train machine learning models 2. input integration: provided for Cloud Storage, BigQuery, and files 3. offers a graphical user interface for interactively designing a pipeline 4. offers a rich set of tools for working with data 5. the format of a string field can have transformations applied to change to uppercase, to proper case (that's initial uppercase letters), to trim leading and trailing whitespace, and to remove whitespace altogether 6. These are the kinds of transformations commonly needed for improving the quality of data produced by a native system in preparation for big data processing
Apache beam :
1. an open source API that lets you define a data pipeline. The API shown here is Apache Beam, and you are basically creating a pipeline, reading some text, and the text that we're reading is from Cloud Storage. 2. Python, Java 3. Executable on Cloud Dataflow, Flink, Spark, etc.
Hive
1. an open source software project that implements a data warehouse and offers an SQL-like query language 2. HiveQL is not identical to standard SQL. 3. used for structured data, similar to SQL
Apache Bigtop
1. an open-source project that automates packaging deployment and integration of other projects in the Hadoop ecosystem. 2. gathers the core Hadoop components and makes sure that the configuration works. It uses Jenkins for continuous integration testing. 3. makes sure that the default Dataproc clusters perform well. It's common when installing Hadoop software manually to accidentally include software and services that actually aren't used in the configuration. 4. makes sure you're not wasting resources by eliminating elements that are not really needed.
Can associate a timestamp with inputs
1. automatic timestamp when reading from PubSub a. Timestamp is the time that message was published to topic 2. For batch inputs, explicitly assign timestamp when emitting at some step in your pipeline a. outputWithTimestamp()
As mentioned before, HDFS is available and you can use it if you want to reduce the changes for adoption from your existing system. Using Cloud Storage instead of HDFS
1. avoid some of the complexity of configuring your cluster. 2. dynamically scale to meet requirements. You wouldn't have to try to predict storage consumption in advance. Because the cluster would only need disk space for working storage and software, sizing the cluster nodes becomes much easier. 3. makes the cluster stateless, so you can shut it down when you don't need it.
Why would you want to run Hadoop on a cloud platform to begin with
1. cheap storage 2. When running with only gigabytes or terabytes of data, running Hadoop on your own cluster is efficient.
Earlier in the course, the concept of machine learning was introduced and three categories of work were identified
1. The first category consists of problems that require human insight to solve. Those are not good candidates for machine learning solutions, at least not yet. 2. The second category is problems that essentially reduce to counting. These are easy problems to solve with big data processing. 3. The third category is problems that at first appear to have no easy solution, mainly because they involve unstructured data. However, with some ingenuity and using machine learning services, these problems can be transformed into counting problems and they become very powerful applications for big data processing.
Hadoop alternatives come with operational and maintenance overhead. You can overcome these limitations with Cloud Dataproc, which was designed to deal with them
1. create a cluster specifically for one job 2. use cloud storage instead of HDFS 3. shutdown the cluster when it's not actually processing data 4. use custom machines to closely match the CPU and memory requirements of the job 5. on non-critical jobs requiring huge clusters, use preemptible VMs to hasten results and cut costs at the same time.
Big Query
1. data warehouse 2. petabyte scale data warehouse on Google Cloud 3. denormalized
Randomly shuffle the filenames in the filename_queue
1. different file sizes 2. different complexity 3. don't want the same machine to be stuck every time
Bigtable
1. don't need transactional support 2. capacity of petabytes 3. high throughput scenarios 4. With Bigtable you basically deal with flattened data; it's not for hierarchical data, it's flattened and you search only based on the key. And because you can search only based on the key, the key itself and the way you design it becomes extremely important. 5. NoOps 6. automatically balanced 7. automatically replicated 8. compacted 9. it's essentially NoOps 10. You don't have to manage any of that infrastructure; you can deal with extremely high throughput data.
Date and time functions
1. enable date and time manipulation for timestamps, date strings, and timestamp data types 2. bigquery uses epoch time
record defaults get used two ways.
1. To figure out what a default value is 2. To determine what the type of the column is
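A sketch (column names and defaults are illustrative): the defaults both fill in missing values and tell TensorFlow each column's type.

import tensorflow as tf

CSV_COLUMNS = ['fare_amount', 'pickuplon', 'pickuplat']
DEFAULTS = [[0.0], [-74.0], [40.0]]        # float defaults -> columns parsed as float32

def parse_row(line):
    cols = tf.decode_csv(line, record_defaults=DEFAULTS)
    features = dict(zip(CSV_COLUMNS, cols))
    label = features.pop('fare_amount')    # first column is the label
    return features, label

dataset = tf.data.TextLineDataset('gs://my-bucket/taxi-train.csv').map(parse_row)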
Pub/Sub :
1. has to be primed so the second message will only show. 2. simplifies systems by removing the need for every component to speak to every component 3. connects applications and services through a messaging infrastructure
Providing other inputs to a ParDo
1. in memory objects can be provided as usual
So if you're thinking about a good stream processing solution, there are three key challenges that it needs to address.
1. it needs to be able to scale. 2. you want to essentially use continuous queries 3. we want to be able to do SQL-like queries that operate over time windows over that data
Spanner
1. it uses familiar relational semantics, so traditional database analysts will adapt to it easily. 2. Data is sharded within the zone, providing high throughput. 3. And it provides high availability by design, so there's no manual intervention required to deal with a zone failure.
Reason to Use Tensorflow:
1. machine-learning researcher interested in extending the open source SDK 2. creating new machine learning models for research, et cetera.
Store related entities in adjacent rows
1. make the query parameter the row key 2. Add a reverse timestamp to the row key 3. Distribute the writing load between tablets while allowing common queries to return consecutive rows
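A sketch of building such a row key in Python (the entity ID and the use of a Java-style max long for the reverse timestamp are illustrative):

import time

LONG_MAX = 2**63 - 1

def make_row_key(vehicle_id):
    # Newest events sort first for a given vehicle_id prefix
    reverse_ts = LONG_MAX - int(time.time() * 1000)
    return '{}#{}'.format(vehicle_id, reverse_ts).encode('utf-8')

print(make_row_key('NYC3425'))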
Dataflow:
1. manages the provisioning of these machines. 2. The auto-scaling, if necessary, of your pipeline, such that transform one just happens at scale, completely distributed, and then everything comes streaming back into transform two. 3. completely NoOps data pipeline. 4. it can have the intermediate processing be identical even when you move from a batch to a streaming scenario. 5. swap out the input 6. Run to Pub/Sub, BigQuery, Cloud Storage 7. So, Dataflow is where we see a lot of data pipelines migrating, because you really want to be able to process historical data and real-time data in an identical way. That's the only way you'll be able to build a machine learning pipeline, for example, that is trained on historical data and operates on real-time arriving data. 8. ingest, transform, and load, filtering, grouping, windowing etc.; consider using it instead of Spark
Cloud Pub/Sub:
1. message oriented architectures 2. offers reliable real-time messaging that's accessible through HTTP 3. reliable delivery, decoupled workers 4. asynchronous processing
Bigquery details
1. near real-time analysis of very large datasets 2. it's no-ops, so you're only paying for what you use 3. it gives you durability 4. it gives you replication, which gives you very cheap storage (about the same cost as Cloud Storage), and it gives you immutable audit logs so you know who's accessed the data and when; and the very important thing is that because there's only one BigQuery that's global, it allows you to mash up different datasets
Reason to use machine learning APIs:
1. pre-built models that you incorporate into your applications 2. you're not training a machine learning model when you use the ML APIs 3. The machine learning APIs are built off Google's data, so if you've ever used the Android app where you can point the application at a foreign language sign and get it translated - that app uses translation and optical character recognition. OCR, Optical Character Recognition, is part of the Vision API and translation is part of the Translate API.
Tensorflow: Directed graph
1. preparing a graph for execution on multiple hardware devices 2. process the graph to add quantization, debug nodes, create summaries 3. can be compiled, for example, to fuse ops to improve performance. For example, you may have two consecutive add nodes and you might want to fuse them into a single one.
BigQuery Overview:
1. serverless data warehouse that operates at massive scale. 2. To use BigQuery, you don't have to store your data in a cluster. 3. To query the data, make an API call or invoke BigQuery from just a web browser. 4. You can analyze terabytes to petabytes of data 5. Requires unchecking "Use Cached Results" to avoid the query cache
Monitor training Tensorflow:
1. set your verbosity to be INFO. 2. By default, TensorFlow's error logging level is at warn, so it doesn't show you a bunch of stuff. 3. So if you want TensorFlow to show you the loss as it trains, change the error level to INFO or change it to debug. 4. The levels are debug, info, warn, error and fatal. 5. Graphical way to monitor training through TensorBoard 6. You can use TensorBoard and you point it at the model output directory, whether it's a local directory or on cloud storage, TensorBoard can read from both.
Stream processing poses several challenges:
1. size a. Traffic data will only grow with more sensors and higher frequency 2. scalability and fault-tolerance a. Handle growing traffic data volumes, distributed sensors, and still be fault tolerant 3. programming model a. Compare traffic over past hour against that of last Friday at same time: is this stream or batch? 4. Unboundedness a. What happens if data from a sensor arrives late?
Cloud dataflow key ideas:
1. the execution framework for Apache beam pipelines. 2. Allows for decoupling producers and consumers of data in large organizations and complex systems
Denormalizing(nested and repeated fields)
1. the strategy of accepting repeated fields in the data to gain processing performance. 2. Data must be normalized before it can be denormalized. 3. Denormalization is another increase in the orderliness of the data. 4. takes more storage - repeated fields 5. because it no longer is relational, queries can be processed more efficiently and in parallel using columnar processing. 6. Nested can be understood as a form of repeated field
Normalizing the data
1. turning it into a relational system. 2. This stores the data efficiently and makes query processing a clear and direct task. Normalizing increases the orderliness of the data.
Pig
1. used for semi-structured data, similar to SQL + scripting 2. provides SQL primitives similar to Hive, but in a more flexible scripting language format. 3. deal with semi-structured data, such as data having partial schemas, or for which the schema is not yet known. 4. sometimes used for Extract Transform Load (ETL) 5. generates Java MapReduce jobs. 6. not designed to deal with unstructured data.
Smart way to explore the space in Cloud ML
1. uses a Bayesian optimization approach that can be applied to autotune parameters (like learning rate, number of hidden nodes, etc.) of your machine learning model.
Repeat the data and send it along in chunks
1. we now have our filename_queue by taking these file names, randomly shuffling them, and adding them num_epochs times 2. We need to set up our readers that are going to do the decoding. The reader in our case is a TextLineReader because these are CSV files 3. Read a batch of records from the filename_queue 4. take that record, which at this point is just a line - it's a scalar 5. we make it a tensor with the same shape using expand_dims. And that basically becomes our value. It's now just a string. 6. And we take that and we ask TensorFlow to do a decode of the CSV. So decode this as a comma-separated-value string.
build effective machine learning models.
1. you need to collect all the data that you can - collecting the data so that you can do analytics on it is extremely important. 2. Once you have the data, you want to bring human insight into the data using good features, and then we also looked at how you can take advantage of modern improvements in neural network architectures to get the best possible accuracy once you've decided that this is how you're going to build your ML. 3. When you're doing machine learning, the accuracy improvement that you're going to get comes through hard work: through feature engineering, through hyperparameter tuning, and through lots of data. 4. What Cloud ML Engine gives you is an environment in which you can do all of these things.
Reason to use Cloud ML Engine:
1. you're in an industry as a data scientist and you want to build a machine learning model on your data set and the machine learning model that you're building is something that's pretty well understood 2. a no-op, so that you're not in the business of managing infrastructure to be able to do machine learning at scale over real-world datasets.
Difference between BigQuery and Bigtable Latency:
a. BigQuery is in the order of seconds b. BigTable is in the order of milliseconds.
Cloud Storage reduce latency
Choose the closest zone/region; Distribute your apps and data across zones to reduce service disruptions and regions for global availability
BigQuery
1. It supports nested and repeated fields. 2. separates out storage and compute 3. structured data or tabular data 4. near-real-time analysis 5. completely no-ops 6. durable - replicated in multiple places and pretty inexpensive in terms of storage 7. immutable audit logs 8. Mashing up different datasets to derive insights
A message in Pub/Sub persist for
7 days
Configure Alerts
: Define thresholds on job or resource group-level metrics and alert when these metrics reach specified values. Stackdriver alerting can notify on a variety of conditions such as long streaming system lag or failed jobs.
The elements are divided into datasets, recipes and output
A dataset roughly translates into a Dataflow pipeline read, a recipe usually translates into multiple pipeline transformations, and an output translates into a pipeline action
Cloud pub/sub overview:
A global, multitenant, managed, real-time messaging service. 1. Discoverability 2. Availability 3. Durability 4. Scalability 5. Low latency
Feature Cross
A synthetic feature formed by crossing (multiplying or taking a Cartesian product of) individual features; help represent nonlinear relationships.
A data warehouse
can be a source of structured data examples for your ML model.
Why does Tensorflow need a DEFAULT value?
Because neural networks are adding and subtracting machines, TensorFlow needs a number for every input; a DEFAULT value stands in when data is missing.
Feature creation in tensorflow also possible
can be quite powerful since it is so flexible; you will need to add the call to all input functions (train, eval, serving)
Advantages of putting the preprocessing directly in TensorFlow
You tell your prediction graph that you want the same transformations carried out in TensorFlow during serving. To do that, you use a library called TensorFlow Transform. Discretizing and feature crossing are examples of preprocessing that is done in TensorFlow itself. These operations are part of your model graph, and so they are carried out in an identical fashion in both training and in serving.
Vision API match question
Automatically reject inappropriate image content
No tail skew
Average and max are identical
Files accepted in the Bigquery web ui
Avro, Parquet, Json (newline delimited), and CSV
Machine scale tiers and pricing
Basic - single worker instance; Standard_1 - 1 master, 4 workers, 3 parameter servers; Premium_1 - 1 master, 19 workers, 11 parameter servers; Basic_GPU - 1 worker with GPU; Custom - you choose the machine configuration; priced by the hour
Difference between BigQuery and Bigtable Structure:
BigQuery (SQL) Bigtable (NoSQL)
Dataflow export to
BigQuery, Cloud Storage text file, Cloud Storage Avro file
But if you need very high throughput, very low latency, then you need
Bigtable
you have an existing Hadoop application that reads and writes data to an HBase database
Bigtable is the path for separating storage and compute. You migrate that data to Bigtable and update the references. It uses the same API as HBase, so the code change will be minimal. Bigtable is also a strong candidate for real-time solutions.
Built-in functions, exact or approximate, are all faster than JavaScript UDFs. Example - exact COUNT(DISTINCT) is very costly, but APPROX_COUNT_DISTINCT is very fast. Note:
Check to see if there are reasonable approximate functions for your query
Translation API match question
Build application to monitor Spanish twitter feed
Training and evaluation input functions
CSV_COLUMNS = def read_dataset(filename, mode, batch_size=512): ...
And we allow Cloud ML to be able to write into our bucket because
Cloud ML runs as a service account or a robot account.
Dataflow import from
Cloud Pub/Sub subscription, Cloud Storage text file
When you load data from Cloud Storage into BigQuery, your data can be in any of the following formats:
Comma-separated values (CSV) JSON (newline-delimited) Avro Parquet ORC (Beta) Cloud Datastore exports Cloud Firestore exports
Chart Dataflow metrics in Stackdriver Dashboards:
Create Dashboards and chart time series of Dataflow metrics.
Metrics for dataflow in lab
Data watermark age: The age (time since event timestamp) of the most recent item of data that has been fully processed by the pipeline. System lag: The current maximum duration that an item of data has been awaiting processing, in seconds
Feature Engineering
def add_more_features(feats): # will be covered in next chapter; for now, just a no-op return feats
Serving input function
def serving_input_fn(): ... return tf.estimator.export.ServingInputReceiver(features, feature_pholders)
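A filled-in sketch (TF 1.x; the feature names are illustrative):

import tensorflow as tf

def serving_input_fn():
    # Placeholders describe what a prediction client will send at serving time
    feature_pholders = {
        'pickuplon': tf.placeholder(tf.float32, [None]),
        'pickuplat': tf.placeholder(tf.float32, [None]),
    }
    # Any serving-time transformations of the raw inputs would go here
    features = dict(feature_pholders)
    return tf.estimator.export.ServingInputReceiver(features, feature_pholders)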
Train and evaluate loop
def train_and_evaluate(args): ... tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
Tensorflow match question
Create, test new machine learning methods
Dataflow current vCPU count:
Current # of virtual CPUs used by job and updated on value change.
Two types of features in feature comparison model
Dense and Sparse
Datalab match question
Develop Big Data algorithms interactively in python
Overfit
Does not generalize
Cloud SQL
Does not handle high throughput needs. If you have sensors that are distributed all across the world, and you're basically getting back millions of messages a minute, that's not something that this database could handle
Prefer Combine over GroupByKey
Collection.apply(Count.perKey())
is faster than:
Collection.apply(GroupByKey.create())
          .apply(ParDo.of(new DoFn<KV<String, Iterable<Integer>>, KV<String, Long>>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                  c.output(KV.of(c.element().getKey(), (long) Iterables.size(c.element().getValue())));
              }
          }))
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
Estimator comes with a method that handles distributed training and evaluation
Cloud ML Engine Scalable Predictable:
For predictions, the ML model is accessible via a rest API and it includes all the pre-processing and feature creation that you did so your client code can simply supply the raw input variables and get back a prediction
Datalab price:
Free - just pay for Google Cloud resources
Google Cloud Storage
Good option for storing data that may be required to be read at some time later and imported into a cluster for analysis
the BigQuery Data Transfer Service supports loading data from the following data sources
Google AdWords DoubleClick Campaign Manager DoubleClick for Publishers Google Play (beta) YouTube - Channel Reports YouTube - Content Owner Reports
Bigtable match question
High-throughput writes of wide-column data
Feature columns
INPUT_COLUMNS = [tf.feature_column.numeric_column('pickuplon'), ...]
Using GCS as staging
If, for example, you need to get your data into BigQuery, a good option is to first get it into GCS and then use GCS as a staging area to import it into BigQuery, or into Dataproc, or into any other cluster.
Monitor User-Defined Metrics:
In addition to Dataflow metrics, Dataflow exposes user-defined metrics (SDK Aggregators) as Stackdriver custom counters in the Monitoring UI, available for charting and alerting. Any Aggregator defined in a Dataflow pipeline will be reported to Stackdriver as a custom metric. Dataflow will define a new custom metric on behalf of the user and report incremental updates to Stackdriver approximately every 30 seconds.
Tensorflow's capacity:
In deep learning, the number of learnable parameters in a model
Dataflow Job status:
Job status (Failed, Successful), reported as an enum every 30 secs and on update.
Dataflow Elapsed time:
Job elapsed time (measured in seconds), reported every 30 secs.
Which of these is a way of discretizing a continuous variable?
Layers.bucketized_column()
Which of these is a way of encoding categorical data?
Layers.sparse_columns_with_keys()
hyperparameter
Learning rate is an example of what is known as a hyperparameter.
Problem with vertical scaling
Diminishing marginal returns: for each additional unit of computing power, the cost goes up while the added value goes down.
Dataflow System lag:
Max lag across the entire pipeline, reported in seconds.
Cloud Pub/Sub push subscriptions
With a push subscription, the client system says "call this endpoint whenever there's a new message for me," and that endpoint gets called by Pub/Sub whenever there is a new message
Cloud ML Engine match question
No-ops, custom machine learning applications at scale
Dataflow Estimated byte count :
Number of bytes processed per PCollection.
Problem: it is in memory, so usually only used with a subset of data.
First method
One method that was discussed at the beginning was to use Cloud Storage as an intermediary, because both BigQuery and Dataproc can communicate with Cloud Storage. Cloud Storage is fast, but there are two operations involved: 1. writing out from BigQuery to Cloud Storage, 2. reading from Cloud Storage into Dataproc. This method is appropriate for periodic or infrequent transfers; in other circumstances a more direct communication method would be useful.
Cloud Storage Transfer Service
Online data - transfer once or multiple times
TextIO
Output methods do not support output with the timestamp.
Deep models components:
Output units, hidden layers, dense embeddings, sparse features
Transform
ParDo GroupByKey CoGroupByKey Combine Flatten Partition
Machine learning
Pattern recognition from examples
Gradient Descent
Process of reducing error
Cloud Storage IAM
Project level, Bucket level, Object level
Datastore match question
Searching for objects by attribute value
Working with Estimator API
Set up machine learning model 1. regression or classification? 2. What is the label? 3. What are the features Carry out ML steps 1. Train the model 2. Evaluate the model 3. Predict with the model
important for distributed training
Shuffling
One thing to note with Google Compute Engine is that disk performance scales with the size of the VM
So the second key to using Dataproc to overcome Hadoop's limitations is to use Cloud Storage instead of HDFS. It reduces the complexity of disk provisioning and enables you to shut down your cluster when it's not processing a job. There are some more handy options: 1. You can load multiple initialization scripts to customize the software on Dataproc workers and on the master. 2. Dataproc comes preconfigured with a Cloud Storage connector, so the cluster already knows how to communicate with buckets located in the project; you can stage initialization scripts there. 3. The network section allows you to do things like associate a tag name with the nodes, so that later on you can create a very narrow firewall rule to allow access to the cluster services, for example to access some of the Hadoop web interfaces. 4. Dataproc uses Google Compute Engine virtual machines, so it inherits the ability to select virtual machines with different qualities; you can match these qualities to your processing requirements to gain greater control over the speed and cost of your data processing solution.
Challenge #2: Latency is to be expected
So, the Beam/Dataflow model provides for exactly-once processing of events. 1. What results are calculated? 2. Where in event time should we calculate those results? 3. When in processing time should you save out those results? 4. How do you refine already-computed results?
Horizontal Scaling (scaling out):
That's the distributed parallel processing solution. You acquire or borrow many smaller computers and use them together.
Can add new features in dataflow:
The advantage of putting the preprocessing directly in TensorFlow is that these operations are part of your model graph, so they are carried out identically in both training and serving. Feature creation in TensorFlow is also possible -> can be quite powerful since it is so flexible, but you will need to add the call to all input functions (train, eval, serving). Can add new features in Dataflow: 1. Ideal for features that involve time-windowed aggregations (streaming) 2. You will have to compute these features in a real-time pipeline for predictions (i.e. you will have to use Dataflow for predictions also)
Cloud ML Engine - best practice
The fewer trials you run in parallel, the better the search is in terms of accuracy, but the longer it's going to take; that's the tradeoff you're making. At the end of it, Cloud ML Engine comes back and tells you which set of hyperparameters was the best.
Cloud Spanner
The first horizontally scalable, globally consistent database. It's proprietary, not open source. Consider what it means to have a relational database that's consistent but also distributed and global. Think about what might be involved in coordinating transactions on components of relational database located around the world. It seems like a very difficult problem to solve.
Tensorflow
The more capacity the network has, the quicker it will be able to model the training data (resulting in a low training loss), but the more susceptible it is to overfitting (resulting in a large difference between the training and validation loss).
PubSub timestamp
The timestamp will be the time at which the element was published to Pub/Sub. If you want to use a custom timestamp, it must be published as a Pub/Sub attribute, and you tell Dataflow about it using the timestamp label (timestamp attribute) setting.
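A hedged Python sketch of this pattern: attach the event time as a message attribute when publishing, then point the pipeline at that attribute. The attribute name 'event_time', the project, the topic, and the pre-existing 'pipeline' object are all illustrative assumptions; in the Java SDK this corresponds to the timestamp label.

from google.cloud import pubsub_v1
import apache_beam as beam

# Publish: carry the true event time as a message attribute
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('my-project', 'my-topic')
publisher.publish(topic_path, b'sensor reading', event_time='2008-06-01T12:00:00Z')

# Read: tell Beam/Dataflow to use that attribute as the element timestamp
messages = (pipeline
            | beam.io.ReadFromPubSub(topic='projects/my-project/topics/my-topic',
                                     timestamp_attribute='event_time'))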
weights
parameters we optimize
Speech-to-Text (speech api) match question
Transcribe customer support calls
machine learning APIs as a way to make sense of unstructured data
We said that whether you have images, audio, video, or free-form text, you can use the machine learning APIs to extract entities, labels, people, events, etc., so you can make sense of it; then take that unstructured data, get entities, sentiment, and labels, and then, if necessary, do machine learning on that data.
Why isn't Cloud Shell used in this lab?
Unlike Compute Engine, Cloud Shell has no SLA. Therefore, the availability of the Cloud Shell VM cannot be guaranteed during the lab.
why do we say the Cloud ML Engine is repeatable, scalable, et cetera?
Well, repeatable: using bare TensorFlow yourself, you would have to keep track of all kinds of things such as the order of pre-processing operations, what the scaling parameters are, et cetera.
Cloud AutoML
a new technology that helps to automate the creation of machine learning models.
The batch size
the number of points over which we try out changes to the weights.
Dataproc's third key to overcoming Hadoop's limitations: use managed instance groups
Use custom machine types to closely manage the resources the job requires. So, the primary group has a two node minimum, and you can define a secondary group with as few as zero preemptible instances to start. You can manually scale the cluster later, or you can setup autoscaling.
Yaml
The hyperparameter tuning YAML file names the parameters to tune (a parameter named train_batch_size, a parameter named nbuckets, a parameter named hidden_units) and specifies the exploration space for each.
App Engine
a serverless way to run web applications and autoscales, reliable
Spark knows the truth
data partitioning, data replication, data recovery, pipelining of processing, all are automated by Spark so you don't have to worry about them.
Bigquery match question
Warehousing structured data
Recall
the true positive rate = TP / (TP + FN)
Dataproc is
a way by which you can run a lot of the Hadoop ecosystem tools, Pig, Hive, Spark, etc.
A neuron
a way to combine all of the inputs, and make a single decision on those inputs
Cloud dataflow
a way to execute apache beam data pipelines on Google Cloud platform
The simplest neuron does _______________.
a weighted sum of its inputs.
Dataflow connector for cloud bigtable
Years of engineering to 1. Teach Bigtable to configure itself 2. Isolate performance from noisy neighbors 3. React automatically to new patterns, splitting, and balancing 4. Look at access patterns and improve itself
accessing dataproc
You request a cluster from the service using either the Console web interface, the gcloud command, or programmatically via the API.
Tensorboard
a collection of visualization tools, that are especially designed to help you visualize TensorFlow models and the training of TensorFlow models.
Example is
a combination of label and input. An input and its corresponding label together form an example.
An example in machine learning terms is
a combination of the input, the input for which we want an output, and a label, which is a true output, the thing that we know, this is what it needs to be.
pipeline
a directed graph of steps
TensorFlow, at its heart, is
a high-performance library for numerical computation. It's a library that lets you work with numbers in an efficient way. It's open source and follows a graph-processing idea, in a way that's similar to Apache Beam and Dataflow.
Pub/Sub features
a. Fast: order of 100s of milliseconds b. Fan in, fan out parallel consumption c. Push and pull delivery flows i. a subscriber can keep checking whether a new message is available, which is pull ii. it can register for notifications when there is a new message which is called push. d. Client libraries i. Idiomatic, hand-built in Java, Python, C#, Ruby, PHP, Node.js ii. Auto-generated in 10 gRPC languages
Pub/sub is global service
a. Messages stored in region closest to publisher (in multiple availability zones) b. A subscription collates a topic from different regions c. Subscribers can be anywhere in world; no change of code
1. A table can have only one index (the row key)
a. Rows are stored in ascending order of the row key b. None of the other columns can be indexed
3. Two types of designs
a. Wide tables when every column value exists for every row b. Narrow tables for sparse data
Pub/Sub
an auto-scaling message queue
Dialogflow
an end-to-end development suite for building conversational interfaces for dialogue systems. It uses machine learning to recognize the intent and context of what a user is saying
tensor
an n-dimensional array
The Natural Language API
analyzes free form text and identifies parts of speech and quality such as sentiment analysis, entity analysis, entity sentiment analysis, syntactic analysis, and content classification
Content classification
analyzes text content and returns a content category for the content. Content classification is performed by using the classifyText method
accuracy is the one that you will use if the dataset is
balanced
BigQuery and Sheets:
being able to join a table in sheets with a table in BigQuery
Bigtable details:
big, fast, noSQL, autoscaling 1. Low latency/high throughput 2. 100,000 QPS @ 6 ms latency for a 10-node cluster 3. Paying for the number of Bigtable nodes that you are running
Container Engine
containerize it and put it into a Docker Container and we will basically orchestrate those containers and manage them for you
Eval spec
controls the evaluation
Label
correct output for some input. This is what you train the model with. The label is a correct output for an input.
To read sharded CSV files
create a tf.data.TextLineDataset(filenames).map(decode_csv) giving it a function to decode the CSV into features, labels
Bigquery export
data studio, GCS
Repeat the data and send it along in chunks
dataset = dataset.repeat(num_epochs).batch(batch_size)
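A hedged TensorFlow 1.x sketch tying together the sharded-CSV card and the repeat/batch snippet above; the column names, record defaults, and shuffle buffer size are illustrative:

CSV_COLUMNS = ['fare_amount', 'pickuplon', 'pickuplat']   # illustrative schema
DEFAULTS = [[0.0], [-74.0], [40.0]]
LABEL_COLUMN = 'fare_amount'

def decode_csv(row):
    columns = tf.decode_csv(row, record_defaults=DEFAULTS)
    features = dict(zip(CSV_COLUMNS, columns))
    label = features.pop(LABEL_COLUMN)
    return features, label

def input_fn(filename_pattern, batch_size=128, num_epochs=None):
    dataset = (tf.data.Dataset.list_files(filename_pattern)   # expands the sharded file names
               .flat_map(tf.data.TextLineDataset)             # read each shard line by line
               .map(decode_csv))                              # parse CSV into features, label
    return dataset.shuffle(1000).repeat(num_epochs).batch(batch_size)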
DNNs good for
dense, highly correlated features
Bigquery
destination table write preference: Write if empty, append to table, overwrite table
Training-serving skew
difference between performance during training and performance during serving.
Cloud ML Engine tuning
does hyperparameter tuning and will remember these hyperparameters.
a ParDo transform considers
each element in the input PCollection, performs some processing function (your user code) on that element, and emits zero, one, or multiple elements to an output PCollection.
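A minimal ParDo sketch with the Beam Python SDK (the DoFn and element values are illustrative); note that the empty string emits zero outputs while the other elements emit several:

import apache_beam as beam

class ExtractWords(beam.DoFn):
    def process(self, element):
        # Emit zero, one, or many output elements per input element
        for word in element.split():
            yield word

with beam.Pipeline() as p:
    words = (p
             | 'Create' >> beam.Create(['the quick brown fox', '', 'jumps over'])
             | 'Split' >> beam.ParDo(ExtractWords()))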
Syntactic analysis
extracts linguistic information, breaking up the given text into a series of sentences and tokens (generally, word boundaries), providing further analysis on those tokens. Syntactic Analysis is performed with the analyzeSyntax method
Cloud Spanner Natural use cases include:
financial applications and inventory applications traditionally served by relational database technology. Here are some example mission-critical use cases: powering customer authentication and provisioning for multinational businesses; building consistent systems for transactions and inventory management in the financial services and retail industries; supporting high-volume systems that require low latency and high throughput in the advertising and media industries.
Cross-entropy
for classification problems, the most commonly used error measure - because it is differentiable
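For reference (a standard formulation, not quoted from the course), the binary cross-entropy for a true label y in {0, 1} and predicted probability ŷ is:
cross-entropy = -[ y * log(ŷ) + (1 - y) * log(1 - ŷ) ], averaged over the examples in the batch.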
Weights
free parameters in machine learning model ; the weights are the things that you get to change so that your model captures your data
Both BigQuery and Bigtable are what kind of services?
fully managed, no-ops services.
Command for submitting on cloud
gcloud ml-engine jobs submit training $JOBNAME
Command for running local on cloud
gcloud ml-engine local train
Note: key skew can lead to increased tail latency
Get a count of your groups when trying to understand performance
c. Distribute the writing load between tablets while allowing common queries to return consecutive rows
i. Can you have both distributed writes and block reads? ii. E.g. highway - milemarker - reverse timestamp (I35-347-123456789)
b. Add reverse timestamp to the rowkey
i. Will you often need to retrieve the latest few records? ii. E.g. highway - milemarker - reverse timestamp (I35-347-123456789)
Make query parameter the row key
i. what is the most common query you need to support ii. e.g. highway - milemarker (I35-347) iii. Entities are considered related if users are likely to pull both records in a single query. This makes reads more efficient iv. Results would come from the same tablet
Sentiment analysis
inspects the given text and identifies the prevailing emotional opinion within the text, especially to determine a writer's attitude as positive, negative, or neutral. Sentiment analysis is performed through the analyzeSentiment method
Entity analysis
inspects the given text for known entities (Proper nouns such as public figures, landmarks, and so on. Common nouns such as restaurant, stadium, and so on.) and returns information about those entities
Entity sentiment analysis
inspects the given text for known entities (proper nouns and common nouns), returns information about those entities, and identifies the prevailing emotional opinion of the entity within the text, especially to determine a writer's attitude toward the entity as positive, negative, or neutral. Entity sentiment analysis is performed with the analyzeEntitySentiment method
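A hedged sketch of calling one of these methods (sentiment analysis) with the google-cloud-language Python client; the sample text is illustrative:

from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content='The interface is intuitive and the support team was great.',
    type_=language_v1.Document.Type.PLAIN_TEXT)

response = client.analyze_sentiment(document=document)
print(response.document_sentiment.score,      # -1.0 (negative) to +1.0 (positive)
      response.document_sentiment.magnitude)  # overall strength of emotion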
Dataflow
is a runner. Each step is elastically scaled
Refactoring
is a software engineering term that essentially means that you're taking your program and you're changing the design of your program without adding any extra features. The reason you are changing the design of your program is so that you can do extra things with it.
The timestamp of the message in Pub Sub
is going to be the timestamp at which you call the publish method. The time at which you publish into Pub/Sub is the timestamp of the message
A graph definition
is separate from the training loop because this is a lazy evaluation model. It minimizes the Python/C++ context switches and enables the computation to be very efficient. Note that c, after you call tf.add, is not the actual values. You have to evaluate c in the context of a TensorFlow session to get a numpy array of values, which we are calling "numpy_c".
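A minimal sketch of that lazy-evaluation behaviour, assuming TensorFlow 1.x graph mode (the constant values are illustrative):

import tensorflow as tf

a = tf.constant([3, 5, 7])
b = tf.constant([1, 2, 3])
c = tf.add(a, b)            # only builds a node in the graph; nothing is computed yet

with tf.Session() as sess:
    numpy_c = sess.run(c)   # evaluation happens here; returns array([4, 7, 10])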
So one of the cool things is that the TextLineReader in TensorFlow not only reads from local files,
it also reads from Google Cloud Storage
Dataflow is a core part of this architecture
it does ingest, it does transformation and it does load, it can do filtering, it can do grouping, it can do windowing and windowing of course is very important if you're doing unbounded data, if you're doing stream data.
Bigtable and clusters
it uses clusters, but those clusters only contain pointers to the data; they don't contain the data itself. So the clusters consist of nodes, and these nodes contain the metadata; the data itself remains on Colossus, on Google Cloud Storage.
Cloud Storage Usage
persistent storage and as staging ground for import to other google cloud products.
we looked at how to do resilient stream processing on GCP
it was important to be able to ingest variable volumes because you could have spikes in your data, it's important to be able to deal with latency because latency is a fact of life, and we want to be able to derive real-time insights from the data even as the data are streaming in.
Example
label + input
Cloud Dataflow connector for Cloud Bigtable.
makes it possible to use Cloud Bigtable in a Cloud Dataflow pipeline.
model
mathematical function that takes an input and creates an output that approximates the label for that input.
Cloud Storage default storage class
multi-regional, regional, nearline, coldline
Pig
often used for cleaning up data and for turning semi-structured data into structured data and originally developed to submit MapReduce jobs.
Epoch
one pass through entire dataset; It consists of going through multiple batches. In our example, if we said we had 100,000 samples was our training dataset, and each batch was 100, then an epoch consists of 1,000 batches or 1,000 steps. So that's another word that's often used for one step is one tweak of the weights.
Neural network
only as good as the input that it is provided with.
Side inputs
other, smaller data that you need to get and join with, for example from BigQuery
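A hedged Beam Python sketch of a side input; here the small lookup PCollection is created in memory, but it could just as well come from a BigQuery read (the keys and values are illustrative):

import apache_beam as beam

with beam.Pipeline() as p:
    # Small lookup table used as a side input
    rates = p | 'Rates' >> beam.Create([('USD', 1.0), ('EUR', 1.1)])

    orders = p | 'Orders' >> beam.Create([('EUR', 20.0), ('USD', 15.0)])

    # The whole 'rates' PCollection is handed to each element as a dict
    converted = orders | 'Convert' >> beam.Map(
        lambda order, rate_map: order[1] * rate_map[order[0]],
        rate_map=beam.pvalue.AsDict(rates))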
if for example you're simulating historical data, so you're publishing at sometime in 2017 but the data actually comes from 2008, we might want to set metadata which is the timestamp of the message, because
Pub/Sub is not going to parse this message and figure out what the actual timestamp inside the message is. The other thing that you want to be aware of is that you don't have to call publish one at a time for every message.
The Vision API
recognizes objects and other qualities in images
The Speech API
recognizes spoken words in audio files
PTransform:
represents a data processing operation, or a step, in your pipeline
A PCollection
represents a distributed data set that your Beam pipeline operates on
Direct runner
running locally
Cloud ML Engine Scalable:
scale your service with as many machines as needed to reach the required number of queries per second, and this is important: you need high-quality execution at both training and prediction time, so while running a TensorFlow model once is relatively cheap, the point of an ML model is to do prediction for lots and lots of repeated requests.
Pub/Sub is
serverless as we'll see later, you don't need to start any clusters, just publish messages to a topic
Cloud ML Engine simplifies
simplifies the bookkeeping and ensures that the trained model is what you run at prediction time. This will help you handle training-serving skew. It can otherwise be quite easy for the training pipeline to do something that the prediction pipeline doesn't do
MapReduce approach splits Big Data
so that each compute node processes data local to it
Linear for
sparse, independent features
A chatbot
special purpose program that's designed to conduct a convincingly intelligent conversation
sharding
split the data as it's being copied in from mass storage,and distribute it to the nodes, a process called sharding.
Instead of using HDFS for storage
store your data from Dataproc in Google Cloud Storage
globally
in cases where you combine over the entire PCollection, e.g. a sum over all of the floats
The difference between a view and a table
is that a new table is materialized, so it is no longer live; a view is not materialized and therefore it is live
When a message is read from PubSub by a subscriber
that data includes that timestamp. The timestamp is the basis of all windowing primitives including watermarks, triggers, and lag monitoring of delayed messages
Cloud Spanner is suited for applications
that require relational database support, strong consistency, transactions, and horizontal scalability.
we need this extra input function
that will map between the JSON received from the REST API and the features as expected from the model. So this extra input function is called the serving input function.
Batch size
the amount of data we compute error on
precision and recall, are the measures of scale that you will use to describe your dataset, the performance of your ML model if
the dataset is unbalanced
Dataflow runner
the graph gets launched on the cloud, and all of the compute is now happening on the cloud.
The scale-tier essentially controls
the kind of resources that you want this program to take
Vertical scaling (scaling up):
the mainframe solution: you build or acquire a larger computer.
When a publisher publishes a message,
the message is associated with a timestamp
legacy SQL
the original Google SQL that is currently the default in BigQuery
Input
the thing that you know and can provide even at prediction time. For example, if the inputs are images, the image itself is an input.
Training
this process of adjusting the weights of a model in such a way that it can make predictions, given an input.
Prediction
this process of taking an input in and applying the mathematical model to it, so as to get an output that hopefully is the correct output for that input
Beam supports
time-based shuffle (windowing) (published datetime, not received datetime; event time vs processing time)
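A minimal Beam Python sketch of windowing on event time; 'events' is assumed to be a keyed PCollection whose elements already carry event timestamps:

import apache_beam as beam
from apache_beam import window

per_minute_counts = (events
    | 'Window' >> beam.WindowInto(window.FixedWindows(60))   # 60-second event-time windows
    | 'CountPerKey' >> beam.combiners.Count.PerKey())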
Google Data Studio needs access
to Bigquery
The watermark:
tracks how far behind the system is 1. Where in event time to compute? 2. When in processing time to emit?
The Translation API
translates among 80 languages
weights
tunable parameters
If you are running Java code in Dataflow
use Maven, because Maven will also take care of downloading dependencies for you and managing them. By default this runs locally, where the default runner is the local runner.
Bigquery and Bigtable
user generated queries, ad-hoc queries, queries that you have that you do once in a long while.
A project contains
users and datasets
An important solution to the diminishing returns of vertical scaling
was horizontal scaling, also called distributed processing. Instead of provisioning a bigger machine, you use a cluster of smaller machines called nodes. One early software for coordinating the nodes was called MapReduce. To make distributed processing work in a traditional storage environment, you have to split the data as it's being copied in from mass storage,and distribute it to the nodes, a process called sharding.
Handling late data
watermarks, triggers, accumulation
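A hedged Beam Python sketch of how those three knobs appear in code; 'events', the window size, the trigger delays, and the allowed lateness are illustrative:

import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

windowed = (events
    | beam.WindowInto(
        window.FixedWindows(60),
        trigger=trigger.AfterWatermark(
            early=trigger.AfterProcessingTime(30),     # speculative early results
            late=trigger.AfterCount(1)),               # re-fire for each late element
        accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
        allowed_lateness=3600))                        # keep window state for one hour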
Cloud Pub/Sub pull subscriptions
whenever the client system is ready to process a new message, it goes ahead and asks: are there any new messages?
The ML model microservice
will auto scale for you all the way down to zero if there's no traffic, to how many other machines that you need if you have lots of traffic.
Dataflow instead of keying off the internal PubSub ID
will now key off on this attribute instead and make sure that any particular ID gets processed only once.
Bigtable Connecting:
you work with Bigtable using the HBase API. You basically go to the connection and get your table, you create a put operation, you add all of your columns, and then you put that into the table and you've added a new row to the table.
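The card describes the HBase (Java) API; as a hedged alternative, the same get-table / build-mutation / write pattern with the Cloud Bigtable Python client (project, instance, table, row key, and column names are illustrative):

from google.cloud import bigtable

client = bigtable.Client(project='my-project', admin=True)
instance = client.instance('my-instance')
table = instance.table('sensor-readings')

row = table.direct_row('I35#347#123456789')        # row key
row.set_cell('readings', b'temperature', b'27.5')  # column family, column, value
row.commit()                                       # writes the new row to the table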