Course 5
What is BigQuery?
1. Fully managed data warehouse 2. Fast, petabyte-scale with the convenience of SQL 3. Encrypted, durable, and highly available 4. Virtually unlimited resources; only pay for what you use 5. Provides streaming ingest for unbounded data sets
Create subscriptions, pull messages
1. gcloud pubsub subscriptions create mysub1 --topic sandiego 2. gcloud pubsub subscriptions pull mysub1 --auto-ack
Steps to stream into Bigtable using Dataflow
1. Get/create table a. Get authenticated session b. Create table 2. Convert the object to write into Mutation(s) inside a ParDo 3. Write the mutations to Bigtable (see the sketch below)
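A minimal Java sketch of steps 2 and 3, assuming the Cloud Bigtable HBase Beam connector (CloudBigtableIO); the project/instance/table IDs, the input PCollection name, and the CSV layout are placeholders:

    // Assumes bigtable-hbase-beam (com.google.cloud.bigtable.beam.*) and the HBase client classes.
    CloudBigtableTableConfiguration config = new CloudBigtableTableConfiguration.Builder()
        .withProjectId("my-project")
        .withInstanceId("my-instance")
        .withTableId("my-table")
        .build();

    messages  // PCollection<String>, e.g. read from Pub/Sub earlier in the pipeline
        // Step 2: convert each element into HBase Mutation(s) inside a ParDo
        .apply(ParDo.of(new DoFn<String, Mutation>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            String[] fields = c.element().split(",");
            Put put = new Put(Bytes.toBytes(fields[0]));                 // row key
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("speed"),   // column family, qualifier
                          Bytes.toBytes(fields[1]));
            c.output(put);
          }
        }))
        // Step 3: write the mutations to Bigtable
        .apply(CloudBigtableIO.writeToTable(config));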
Streaming(data processing for unbounded datasets)
1. Infinite data set 2. Is never complete, especially when considering time 3. Stored in multiple temporary, yet durable stores
Challenge #1: Variable volumes require the ingest layer to scale and be fault-tolerant
1. Ingesting variable volumes: massive amounts of streaming events, handle spiky/bursty data, high availability and durability 2. The way to get a durable, highly available messaging system is to use Pub/Sub.
Handling late data
watermarks, triggers, accumulation
PTransform:
represents a data processing operation, or a step, in your pipeline
A PCollection
represents a distributed data set that your Beam pipeline operates on
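A minimal Beam (Java) sketch of the relationship: each apply() of a PTransform produces a new PCollection. The bucket paths are placeholders.

    // Assumes the Beam Java SDK (org.apache.beam.sdk.*).
    PipelineOptions options = PipelineOptionsFactory.create();
    Pipeline p = Pipeline.create(options);

    PCollection<String> lines = p.apply(TextIO.read().from("gs://my-bucket/input.txt")); // PTransform -> PCollection
    PCollection<Long> count = lines.apply(Count.<String>globally());                     // another step
    count.apply(MapElements.into(TypeDescriptors.strings()).via((Long n) -> "lines: " + n))
         .apply(TextIO.write().to("gs://my-bucket/output"));

    p.run();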
Pub/Sub is
serverless; as we'll see later, you don't need to start any clusters, you just publish messages to a topic
When a message is read from PubSub by a subscriber
the data includes a timestamp. The timestamp is the basis of all windowing primitives, including watermarks, triggers, and lag monitoring of delayed messages
Cloud Spanner is suited for applications
that require relational database support, strong consistency, transactions, and horizontal scalability.
When a publisher publishes a message,
the message is associated with a timestamp
Beam supports
time-based shuffle (windowing) keyed on the published datetime, not the received datetime (event time vs. processing time)
Google Data Studio needs access
to BigQuery
The watermark:
tracks how far behind the system is. 1. Where in event time to compute? 2. When in processing time to emit?
If you need globally consistent data or more than one Cloud SQL instance
use Cloud Spanner
Create a Bigtable cluster using gcloud (or the web UI)
"gcloud beta bigtable instances create INSTANCE"
Designing for Bigtable
1. A table has only one index (the row key) 2. Group related columns into column families 3. Two types of designs (wide or narrow tables) 4. Rows are sorted lexicographically by row key, from lowest to highest bytes 5. Queries that use the row key, a row prefix, or a row range are the most efficient 6. Store related entities in adjacent rows 7. Distribute your writes and reads across rows 8. Design row keys to avoid hotspotting
Streaming data into BigQuery
1. BigQuery provides streaming ingestion at a rate of 100,000 rows/table/second a. Provided by the REST API's tabledata().insertAll() method b. Works for partitioned and standard tables 2. Streaming data can be queried as it arrives a. Data available within seconds 3. For data consistency, provide an insertId for each inserted row a. De-duplication is done on a best-effort basis, and can be affected by network errors b. Can be done manually
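A minimal sketch of a streaming insert with an insertId, assuming the google-cloud-bigquery Java client; the dataset, table, fields, and insertId are placeholders:

    // Assumes com.google.cloud.bigquery.* plus java.util.HashMap/Map.
    BigQuery bq = BigQueryOptions.getDefaultInstance().getService();

    Map<String, Object> row = new HashMap<>();
    row.put("sensor_id", "I35-347");
    row.put("speed", 57.0);

    InsertAllResponse resp = bq.insertAll(
        InsertAllRequest.newBuilder(TableId.of("mydataset", "mytable"))
            .addRow("unique-insert-id-123", row)   // insertId used for best-effort de-duplication
            .build());

    if (resp.hasErrors()) {
      System.err.println("Insert errors: " + resp.getInsertErrors());
    }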
Pub/Sub simplifies event distribution:
1. By replacing synchronous point-to-point connections with a single highly available asynchronous bus 2. Asynchronous -> publisher never waits a. A subscriber can get the message now or any time (within 7 days) 3. Can avoid overprovisioning for spikes with Pub/Sub
3 aspects to Big data
1. Can use the same tools for batch as for streaming 2. Another aspect of big data is variety: audio, video, images, unstructured text, blog posts, etc. 3. The third aspect of big data is near real-time data processing: data that's coming in so fast that you need to process it just to keep up with it
Did we use triggers?
1. The default trigger setting was used, which is to trigger first when the watermark passes the end of the window, and then trigger again every time late data arrives
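A Beam (Java) sketch of a trigger configured this way: fire at the watermark, then again for each late element. The window size, allowed lateness, and the keyed PCollection "input" are illustrative assumptions:

    // Assumes org.apache.beam.sdk.transforms.windowing.* and joda-time Duration.
    PCollection<KV<String, Double>> windowed = input.apply(
        Window.<KV<String, Double>>into(FixedWindows.of(Duration.standardMinutes(2)))
            .triggering(AfterWatermark.pastEndOfWindow()              // fire when the watermark passes the end of the window
                .withLateFirings(AfterPane.elementCountAtLeast(1)))   // fire again for late-arriving data
            .withAllowedLateness(Duration.standardMinutes(30))
            .accumulatingFiredPanes());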
Pub/Sub is a low-latency, guaranteed-delivery service
1. Does not guarantee order of messages 2. At-least-once delivery means that repeated delivery is possible
Bigtable: row keys to avoid
1. Domains 2. Sequential (numeric) IDs 3. Static, repeatedly updated identifiers
Data studio lets you build dashboards and reports
1. Easy to read, share, and fully customizable 2. Handles authentication, access rights, and structuring of data
Stream processing:
1. Element-wise stream processing is easy 2. Aggregating is hard 3. Composite (element-wise plus aggregation) processing on unbounded data is hardest
Pull:
1. Endpoint can be any server or device capable of making an API call 2. Delays between publication and delivery 3. Ideal for a large number of dynamically created subscribers
Push:
1. Endpoint can only be an HTTPS server accepting webhooks 2. Immediate delivery; no latency 3. Ideal for subscribers needing closer to real-time performance
Handle late data: watermarks, triggers, accumulation
1. Event time a. Bound to the source of the event, e.g. the event with ID X is given a timestamp relative to the source's scope b. Always < processing time 2. Processing time a. Relative to the engine processing the event, e.g. the event with ID X and event time Y is being processed now b. Always > event time
ParDo is useful for a variety of common data processing operations including
1. Filtering a data set. 2. Formatting or type-converting each element in a data set. 3. Extracting parts of each element in a data set. 4. Performing computations on each element in a data set.
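A minimal ParDo sketch (Java) that filters and extracts a part of each element; the CSV layout and the input PCollection "lines" are hypothetical:

    // Emits zero or one output per input: filtering plus extraction in one DoFn.
    PCollection<String> speeds = lines.apply("ExtractSpeed", ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        String[] fields = c.element().split(",");
        if (fields.length >= 3 && !fields[2].isEmpty()) {   // filtering
          c.output(fields[2]);                              // extracting one part of the element
        }
      }
    }));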
Need to process variable amounts of data that will grow over time:
1. Fixed or slowly scaled clusters are a waste 2. Windowing lets us answer the question of "where in event time" we are computing the aggregates a. Windowing divides data into event-time-based finite chunks b. Required when doing aggregations over unbounded data c. Fixed, sliding, sessions (see the sketch below) 3. The Beam unified model is very powerful and handles different processing paradigms
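Sketches of the three window types in Beam (Java); the durations are illustrative and "events" is an assumed PCollection<String>:

    // Assumes org.apache.beam.sdk.transforms.windowing.*.
    events.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(2))));            // fixed
    events.apply(Window.<String>into(
        SlidingWindows.of(Duration.standardMinutes(60)).every(Duration.standardMinutes(5))));   // sliding
    events.apply(Window.<String>into(Sessions.withGapDuration(Duration.standardMinutes(10))));  // sessions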
Pub/Sub: how it works: Topics and subscriptions
1. It's an asynchronous communication pattern 2. Multiple topics in Pub/Sub 3. One or more publishers send messages to a topic 4. At-least-once delivery guarantee a. A subscriber ACKs each message for every subscription b. A message is resent if the subscriber takes more than the "ackDeadline" to respond c. A subscriber can extend the deadline per message 5. Exactly-once, ordered processing a. Pub/Sub delivers at least once b. Dataflow: de-duplicate, order, and window c. Separation of concerns -> scale
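A minimal subscriber sketch (Java client library) showing the per-message ACK; the project and subscription names are placeholders:

    // Assumes com.google.cloud.pubsub.v1.Subscriber and com.google.pubsub.v1.ProjectSubscriptionName.
    ProjectSubscriptionName sub = ProjectSubscriptionName.of("my-project", "mysub1");

    Subscriber subscriber = Subscriber.newBuilder(sub,
        (PubsubMessage message, AckReplyConsumer consumer) -> {
          System.out.println("Received: " + message.getData().toStringUtf8());
          consumer.ack();   // ACK within the ackDeadline or the message is redelivered
        }).build();

    subscriber.startAsync().awaitRunning();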
The need for fast decisions leads to streaming
1. Massive data, from varied sources, that keeps growing over time 2. Need to derive insights immediately in the form of dashboards 3. Need to make timely decisions
Pub/Sub - publish-subscribe
1. Message bus that's a great way to deal with the challenge of ingesting variable volumes of data 2. Has the ability to ingest data at high speeds 3. Works well with streaming data 4. Durable and fault tolerant 5. No ops design 6. Serverless
Data studio connects to various GCP data sources
1. Offers a BigQuery connector 2. Read from a table or run a custom query 3. Build charts and graphs, or lay data out on a map
Batch and streaming setup
1. Pub/Sub is a global message bus. 2. Dataflow is capable of doing both batch and streaming; the core code doesn't change. It gives you the ability to better deal with late data and unordered data. 3. BigQuery gives you the power of doing analytics both on historical data and on streaming data.
Stream processing makes it possible to derive real-time insights from growing data
1. Scale to variable volumes 2. Act on real-time data using continuous queries 3. Derive insights from data in flight
To use custom timestamps, perhaps based on the message producer's clock:
1. Set an attribute in Pub/Sub with the timestamp when publishing: batch.publish(event_data, mytime="2017-04-12T23:20:50.52Z") 2. Tell Dataflow which Pub/Sub attribute is the timestamp (timestampLabel): p.apply(PubsubIO.readStrings().fromTopic(t).withTimestampAttribute("mytime")).apply(...)
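A Java sketch of both sides, assuming the Pub/Sub client library for publishing; the topic name and payload variable are placeholders, and the attribute name "mytime" matches the note above:

    // Publish side: attach the event time as a message attribute (com.google.cloud.pubsub.v1.Publisher).
    // eventData is the message payload string, assumed defined earlier; build() throws IOException.
    Publisher publisher = Publisher.newBuilder(TopicName.of("my-project", "sandiego")).build();
    PubsubMessage msg = PubsubMessage.newBuilder()
        .setData(ByteString.copyFromUtf8(eventData))
        .putAttributes("mytime", "2017-04-12T23:20:50.52Z")
        .build();
    publisher.publish(msg);

    // Dataflow side: tell PubsubIO which attribute carries the event timestamp.
    PCollection<String> events = p.apply(PubsubIO.readStrings()
        .fromTopic("projects/my-project/topics/sandiego")
        .withTimestampAttribute("mytime"));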
Can enforce only-once handling in Dataflow even if your publisher might retry publishes
1. Specify a unique label when publishing to Pub/Sub 2. When reading, tell Dataflow which PubSub attribute is the idLabel
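A sketch assuming Beam's PubsubIO; the attribute name "myid" and the topic are placeholders:

    // The publisher sets a unique "myid" attribute per logical message; Dataflow then
    // de-duplicates on that attribute instead of the internal Pub/Sub ID.
    PCollection<String> events = p.apply(PubsubIO.readStrings()
        .fromTopic("projects/my-project/topics/sandiego")
        .withIdAttribute("myid"));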
Create topic and publish message
1. To create a topic, you use gcloud 2. Messages are opaque in Pub/Sub (no parsing) 3. Publishing actually sends a web call, a REST API call.
Unbounded datasets are quite common
1. Traffic sensors along highways 2. Usage information of Cloud components by every user with a GCP project 3. Credit card transactions 4. User moves in multi-user online gaming
Reading from Bigtable
1. Typically programmatic using the HBase API 2. HBase command-line client 3. BigQuery
Cluster performance
1. Under typical workloads, Cloud Bigtable delivers highly predictable performance. When everything is running smoothly, you can expect a consistent level of performance from each node in your Cloud Bigtable cluster, depending on which type of storage your cluster uses
Best practices Dataflow + BigQuery to enable fast data-driven decisions
1. Use Dataflow to do the processing/transforms 2. Create multiple tables for easy analysis 3. Take advantage of BigQuery for streaming analysis for dashboards and for long-term storage to reduce storage cost 4. Create views for common query support
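A sketch of the Dataflow-to-BigQuery step (Java), assuming BigQueryIO; the table spec and schema are placeholders, and "laneSpeeds" is an assumed PCollection<TableRow> of results:

    // Assumes org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO and com.google.api.services.bigquery.model.*.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("lane").setType("STRING"),
        new TableFieldSchema().setName("avg_speed").setType("FLOAT")));

    laneSpeeds.apply(BigQueryIO.writeTableRows()
        .to("my-project:demos.average_speeds")
        .withSchema(schema)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));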
Techniques to deal with the three Vs
1. Volume a. Terabytes, petabytes b. MapReduce, autoscaling analysis 2. Velocity a. Real-time or near-real-time b. Streaming 3. Variety a. Social networks, blog posts, logs, sensors b. Unstructured data and machine learning
Triggers control when results are emitted
1. What are you computing? What = transformations 2. Where in event time? Where = windowing 3. When in processing time? When = watermarks + triggers 4. How do refinements relate? How = accumulation
How Dataflow handles streaming data while balancing tradeoffs
1. Windowing model a. Supports unaligned event-time windows, with a simple API for their creation and use 2. Triggering model a. Binds the output times of results to runtime characteristics of the pipeline, with a powerful and flexible declarative API for describing desired triggering semantics 3. Incremental processing model a. Integrates retractions and updates into the windowing and triggering models
Stream processing in Dataflow accounts for this
1. Works with out-of-order messages when computing aggregates 2. Automatically removes duplicates based on the internal Pub/Sub ID
Modifying Bigtable clusters
1. You can add or remove nodes or change a cluster's name without any downtime 2. You cannot modify the Cluster ID, zone, or storage type after a cluster is created
Apache Beam
1. a unified model for batch and stream processing 2. supporting multiple runtimes
Can associate a timestamp with inputs
1. Automatic timestamp when reading from Pub/Sub a. The timestamp is the time the message was published to the topic 2. For batch inputs, explicitly assign a timestamp when emitting at some step in your pipeline a. outputWithTimestamp()
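A sketch of the batch case (Java), assigning an event timestamp parsed from each element; the assumption that the first CSV field holds an ISO-8601 event time, and the input PCollection "lines", are hypothetical:

    // Uses org.joda.time.Instant, which Beam uses for element timestamps.
    PCollection<String> stamped = lines.apply(ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        String[] fields = c.element().split(",");
        Instant eventTime = Instant.parse(fields[0]);    // e.g. "2008-05-01T07:30:00Z"
        c.outputWithTimestamp(c.element(), eventTime);   // explicit timestamp for downstream windowing
      }
    }));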
Pub/Sub:
1. May have to be primed: the first message published may not be received, so only the second message will show 2. Simplifies systems by removing the need for every component to speak to every component 3. Connects applications and services through a messaging infrastructure
So if you're thinking about a good stream processing solution, there are three key challenges that it needs to address.
1. It needs to be able to scale 2. You want to essentially use continuous queries 3. You want to be able to do SQL-like queries that operate over time windows over that data
Spanner
1. it uses familiar relational semantics, so traditional database analysts will adapt to it easily. 2. Data is sharded within the zone, providing high throughput. 3. And it provides high availability by design, so there's no manual intervention required to deal with a zone failure.
Store related entities in adjacent rows
1. Make the query parameter the row key 2. Add a reverse timestamp to the row key 3. Distribute the writing load between tablets while allowing common queries to return consecutive rows
Stream processing poses several challenges:
1. size a. Traffic data will only grow with more sensors and higher frequency 2. scalability and fault-tolerance a. Handle growing traffic data volumes, distributed sensors, and still be fault tolerant 3. programming model a. Compare traffic over past hour against that of last Friday at same time: is this stream or batch? 4. Unboundedness a. What happens if data from a sensor arrives late?
A message in Pub/Sub persists for
7 days
Configure Alerts
: Define thresholds on job or resource group-level metrics and alert when these metrics reach specified values. Stackdriver alerting can notify on a variety of conditions such as long streaming system lag or failed jobs.
Cloud Pub/Sub overview:
A global, multitenant, managed, real-time messaging service. 1. Discoverability 2. Availability 3. Durability 4. Scalability 5. Low latency
BigQuery and Bigtable
BigQuery is for user-generated queries, ad-hoc queries, queries that you only run once in a long while.
Difference between BigQuery and Bigtable Structure:
BigQuery (SQL) Bigtable (NoSQL)
Dataflow exports to
BigQuery, Cloud Storage text file, Cloud Storage Avro file
But if you need very high throughput, very low latency, then you need
Bigtable
Dataflow imports from
Cloud Pub/Sub subscription, Cloud Storage text file
When you load data from Cloud Storage into BigQuery, your data can be in any of the following formats:
Comma-separated values (CSV) JSON (newline-delimited) Avro Parquet ORC (Beta) Cloud Datastore exports Cloud Firestore exports
BigQuery
destination table write preference: Write if empty, append to table, overwrite table
a ParDo transform considers
each element in the input PCollection, performs some processing function (your user code) on that element, and emits zero, one, or multiple elements to an output PCollection.
Chart Dataflow metrics in Stackdriver Dashboards:
Create Dashboards and chart time series of Dataflow metrics.
Dataflow current vCPU count:
Current # of virtual CPUs used by job and updated on value change.
BigQuery details:
easy, inexpensive 1. Latency on the order of seconds 2. 100k rows/second streaming
Metrics for dataflow in lab
Data watermark age: The age (time since event timestamp) of the most recent item of data that has been fully processed by the pipeline. System lag: The current maximum duration that an item of data has been awaiting processing, in seconds
the BigQuery Data Transfer Service supports loading data from the following data sources
Google AdWords DoubleClick Campaign Manager DoubleClick for Publishers Google Play (beta) YouTube - Channel Reports YouTube - Content Owner Reports
Monitor User-Defined Metrics:
In addition to Dataflow metrics, Dataflow exposes user-defined metrics (SDK Aggregators) as Stackdriver custom counters in the Monitoring UI, available for charting and alerting. Any Aggregator defined in a Dataflow pipeline will be reported to Stackdriver as a custom metric. Dataflow will define a new custom metric on behalf of the user and report incremental updates to Stackdriver approximately every 30 seconds.
Dataflow Elapsed time:
Job elapsed time (measured in seconds), reported every 30 secs.
Dataflow Job status:
Job status (Failed, Successful), reported as an enum every 30 secs and on update.
Dataflow System lag:
Max lag across the entire pipeline, reported in seconds.
Dataflow Estimated byte count :
Number of bytes processed per PCollection.
TextIO
Output methods do not support output with the timestamp.
Transform
ParDo GroupByKey CoGroupByKey Combine Flatten Partition
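A sketch (Java) using two of these transforms on an assumed PCollection<KV<String, Double>> named speedsByLane (lane -> speed pairs), built earlier in the pipeline:

    // Combine-style aggregation per key vs. raw grouping.
    PCollection<KV<String, Double>> avgByLane =
        speedsByLane.apply(Mean.<String, Double>perKey());          // averages the values for each key
    PCollection<KV<String, Iterable<Double>>> grouped =
        speedsByLane.apply(GroupByKey.<String, Double>create());    // groups values without aggregating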
Challenge #2: Latency is to be expected
So, the Beam/Dataflow model provides for exactly-once processing of events. 1. What results are calculated? 2. Where in event time should we calculate those results? 3. When in processing time should you emit those results? 4. How do you refine already-computed results?
Cloud Spanner
The first horizontally scalable, globally consistent database. It's proprietary, not open source. Consider what it means to have a relational database that's consistent but also distributed and global. Think about what might be involved in coordinating transactions on components of relational database located around the world. It seems like a very difficult problem to solve.
Dataflow connector for Cloud Bigtable
Years of engineering to 1. Teach Bigtable to configure itself 2. Isolate performance from noisy neighbors 3. React automatically to new patterns, splitting, and balancing 4. Look at access patterns and improve itself
Difference between BigQuery and Bigtable Latency:
a. BigQuery is in the order of seconds b. BigTable is in the order of milliseconds.
Pub/Sub features
a. Fast: order of 100s of milliseconds b. Fan in, fan out parallel consumption c. Push and pull delivery flows i. a subscriber can keep checking whether a new message is available, which is pull ii. it can register for notifications when there is a new message which is called push. d. Client libraries i. Idiomatic, hand-built in Java, Python, C#, Ruby, PHP, Node.js ii. Auto-generated in 10 gRPC languages
Pub/Sub is a global service
a. Messages are stored in the region closest to the publisher (in multiple availability zones) b. A subscription collates a topic from different regions c. Subscribers can be anywhere in the world; no code changes required
1. A table can have only one index (the row key)
a. Rows are stored in ascending order of the row key b. None of the other columns can be indexed
3. Two types of designs
a. Wide tables when every column value exists for every row b. Narrow tables for sparse data
Bigtable details:
big, fast, NoSQL, autoscaling 1. Low latency / high throughput 2. 100,000 QPS @ 6 ms latency for a 10-node cluster 3. You pay for the number of Bigtable nodes you are running
BigQuery exports to
Data Studio, GCS
Cloud Spanner Natural use cases include:
financial applications and inventory applications traditionally served by relational database technology. Some example mission-critical use cases: powering customer authentication and provisioning for multinational businesses; building consistent systems for transactions and inventory management in the financial services and retail industries; supporting high-volume systems that require low latency and high throughput in the advertising and media industries.
Both BigQuery and Bigtable are what kind of services?
fully managed, no-ops services.
c. Distribute the writing load between tablets while allowing common queries to return consecutive rows
i. Can you have both distributed writes and block reads? ii. E.g. highway - milemarker - reverse timestamp (I35-347-123456789)
b. Add reverse timestamp to the rowkey
i. Will you often need to retrieve the latest few records? ii. E.g. highway - milemarker - reverse timestamp (I35-347-123456789)
Make query parameter the row key
i. What is the most common query you need to support? ii. E.g. highway - milemarker (I35-347) iii. Entities are considered related if users are likely to pull both records in a single query; this makes reads more efficient iv. Results would come from the same tablet
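A small Java sketch of constructing such a row key; the separator and field values follow the highway-milemarker examples above, and the reverse-timestamp trick is one common way to make the newest rows sort first:

    // Hypothetical values: highway + milemarker form the query prefix.
    String rowKey = String.format("%s-%d-%d",
        "I35",                                          // highway (most common query parameter)
        347,                                            // milemarker
        Long.MAX_VALUE - System.currentTimeMillis());   // reverse timestamp: newest rows sort first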
The timestamp of the message in Pub/Sub
is the timestamp at which you call the publish method. The time at which you publish into Pub/Sub is the timestamp of the message
Bigtable and clusters
Bigtable uses clusters, but those clusters only contain pointers to the data; they don't contain the data itself. The clusters consist of nodes, and the nodes contain the metadata; the data itself remains on Colossus, Google's storage layer.
Dataflow
provides a fully-managed, autoscaling execution environment for Beam pipelines
If, for example, you're simulating historical data, so you're publishing at some time in 2017 but the data actually comes from 2008, you might want to set metadata carrying the timestamp of the message, because
Pub/Sub is not going to parse the message and figure out what the actual timestamp inside it is. The other thing to be aware of is that you don't have to call publish one at a time for every message.
Dataflow, instead of keying off the internal Pub/Sub ID,
will now key off this attribute instead and make sure that any particular ID gets processed only once.
To associate the messages with the timestamps,
you have to create that association in the PCollection. In this case, rather than use ProcessContext.output(), you can use ProcessContext.outputWithTimestamp().