Course 5

What is BigQuery?

1. Fully managed data warehouse 2. Fast, petabyte-scale with the convenience of SQL 3. Encrypted, durable, and highly available 4. Virtually unlimited resources; pay only for what you use 5. Provides streaming ingest for unbounded data sets

Create subscriptions, pull messages

1. gcloud pubsub subscriptions create --topic sandiego mysub1 2. gcloud pubsub subscriptions pull --auto-ack mysub1

Steps to stream into Bigtable using Dataflow

1. Get/create table a. Get authenticated session b. Create table 2. Convert each object to write into Mutation(s) inside a ParDo 3. Write the mutations to Bigtable

Streaming (data processing for unbounded datasets)

1. Infinite data set 2. Is never complete, especially when considering time 3. Stored in multiple temporary, yet durable stores

Challenge #1: Variable volumes require ability of ingest to scale and be fault-tolerant

1. Ingesting variable volumes: massive amounts of streaming events, spiky/bursty data, high availability and durability 2. The way to get a durable, highly available messaging system is to use Pub/Sub

Handling late data

watermarks, triggers, accumulation

PTransform:

represents a data processing operation, or a step, in your pipeline

A PCollection

represents a distributed data set that your Beam pipeline operates on

Pub/Sub is

serverless: you don't need to start any clusters, just publish messages to a topic

When a message is read from Pub/Sub by a subscriber

the data includes the publish timestamp. That timestamp is the basis of all windowing primitives, including watermarks, triggers, and lag monitoring of delayed messages

Cloud Spanner is suited for applications

that require relational database support, strong consistency, transactions, and horizontal scalability.

When a publisher publishes a message,

the message is associated with a timestamp

Beam supports

time-based shuffle (windowing) based on the published datetime, not the received datetime (event time vs. processing time)

Google Data Studio needs access

to BigQuery

The watermark:

tracks how far behind the system is 1. Where in event time to compute? 2. When in processing time to emit?

If you need globally consistent data or more than one Cloud SQL instance

use Cloud Spanner

Create a Bigtable cluster using gcloud (or web UI)

"gcloud beta bigtable instances create INSTANCE"

Designing for Bigtable

1. A table has only one index (the row key) 2. Group related columns into column families 3. Two types of designs (wide or narrow tables) 4. Rows are sorted lexicographically by row key, from lowest to highest bytes 5. Queries that use the row key, a row prefix, or a row range are the most efficient 6. Store related entities in adjacent rows 7. Distribute your writes and reads across rows 8. Design row keys to avoid hotspotting

Streaming data into BigQuery

1. BigQuery provides streaming ingestion at a rate of 100,000 rows/table/second a. Provided by the REST API's tabledata().insertAll() method b. Works for partitioned and standard tables 2. Streaming data can be queried as it arrives a. Data available within seconds 3. For data consistency, provide an insertId for each inserted row a. De-duplication is done on a best-effort basis and can be affected by network errors b. Can also be done manually
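
A minimal sketch of item 3, assuming the insertAll request body carries a list of rows that each pair an insertId with the row's JSON. The helper name build_insert_all_body, the sample readings, and the choice to hash (sensor, ts) into the id are all assumptions for illustration, not part of the course material. It only builds the payload; it does not call BigQuery.

```python
import hashlib
import json


def build_insert_all_body(rows, key_fields):
    """Build an insertAll-style request body with one insertId per row.

    The insertId is derived from the fields named in key_fields (an
    assumption of this sketch), so a retried row reuses the same id.
    """
    body_rows = []
    for row in rows:
        key = "|".join(str(row[field]) for field in key_fields)
        insert_id = hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]
        body_rows.append({"insertId": insert_id, "json": row})
    return {"rows": body_rows}


# Hypothetical sensor readings; the second entry is a retry of the first.
readings = [
    {"sensor": "I35-347", "ts": "2017-04-12T23:20:50Z", "speed": 61.5},
    {"sensor": "I35-347", "ts": "2017-04-12T23:20:50Z", "speed": 61.5},
    {"sensor": "I35-348", "ts": "2017-04-12T23:20:50Z", "speed": 58.0},
]
body = build_insert_all_body(readings, key_fields=("sensor", "ts"))
print(json.dumps(body, indent=2))
```

Because the id is deterministic, a network retry sends the same insertId, which is what gives best-effort de-duplication something to match on.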

Pub/Sub simplifies event distribution:

1. By replacing synchronous point-to-point connections with a single, highly available asynchronous bus 2. Asynchronous -> publisher never waits a. A subscriber can get the message now or any time (within 7 days) 3. Can avoid overprovisioning for spikes with Pub/Sub

3 aspects to Big data

1. Can use the same tools for batch as for streaming 2. Another aspect of big data is variety: audio, video, images, unstructured text, blog posts, etc. 3. The third aspect of big data is near-real-time processing: data comes in so fast that you need to process it continuously to keep up

Did we use triggers?

1. The default trigger setting was used: it fires first when the watermark passes the end of the window, and then fires again every time late data arrives

Pub/Sub is a low-latency, guaranteed-delivery service

1. Does not guarantee order of messages 2. At-least-once delivery means that repeated delivery is possible

Bigtable: row keys to avoid

1. Domains 2. Sequential (numeric) IDs 3. Static, repeatedly updated identifiers

Data studio lets you build dashboards and reports

1. Easy to read, share, and fully customizable 2. Handles authentication, access rights, and structuring of data

Stream processing:

1. Element-wise stream processing is easy 2. Aggregating is hard 3. Composite on unbounded data is super difficult

Pull:

1. Endpoint can be a server or a device capable of making an API call 2. Delays between publication and delivery 3. Ideal for a large number of dynamically created subscribers

Push:

1. Endpoint can only be an HTTPS server accepting webhooks 2. Immediate delivery, with no polling delay 3. Ideal for subscribers needing closer to real-time performance

Handle late data: watermarks, triggers, accumulation

1. Event time a. Bound to the source of the event, e.g. event of ID X is given a timestamp relative to the source scope b. Always < processing time 2. Processing time a. Relative to the engine processing the event, e.g. event of ID X and event time of Y is being processed now b. Always > event time

ParDo is useful for a variety of common data processing operations including

1. Filtering a data set. 2. Formatting or type-converting each element in a data set. 3. Extracting parts of each element in a data set. 4. Performing computations on each element in a data set.

Need to process variable amounts of data that will grow over time:

1. Fixed or slowly scaled clusters are a waste 2. Windowing lets us answer the question of "where in event time" we are computing the aggregates a. Windowing divides data into event-time-based finite chunks b. Required when doing aggregations over unbounded data c. Fixed, sliding, sessions 3. The Beam unified model is very powerful and handles different processing paradigms
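
A minimal sketch of item 2, written in plain Python rather than Beam, of how fixed and sliding windows divide event time into finite chunks. The function names, the integer-second timestamps, and the sample events are assumptions for illustration; session windows are not covered here.

```python
from collections import defaultdict


def fixed_window(event_time, size):
    """Return the [start, end) fixed window that contains event_time."""
    start = event_time - (event_time % size)
    return (start, start + size)


def sliding_windows(event_time, size, period):
    """Return every [start, end) sliding window that contains event_time."""
    windows = []
    start = event_time - (event_time % period)
    # A window contains the event while start > event_time - size.
    while start > event_time - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


def sum_per_fixed_window(events, size):
    """Sum values per fixed window; events are (event_time_seconds, value)."""
    totals = defaultdict(int)
    for event_time, value in events:
        totals[fixed_window(event_time, size)] += value
    return dict(totals)


# Events arrive out of order; grouping uses event time, not arrival order.
events = [(5, 1), (65, 2), (10, 3), (61, 4), (125, 5)]
totals = sum_per_fixed_window(events, size=60)
print(totals)
print(sliding_windows(65, size=60, period=30))
```

Each event lands in exactly one fixed window but in size/period sliding windows, which is why sliding windows cost more to aggregate.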

Pub/Sub: how it works: topics and subscriptions

1. It's an asynchronous communication pattern 2. Multiple topics in Pub/Sub 3. One or more publishers send messages to a topic 4. At-least-once delivery guarantee a. A subscriber ACKs each message for every subscription b. A message is resent if the subscriber takes more than "ackDeadline" to respond c. A subscriber can extend the deadline per message 5. Exactly once, ordered processing a. Pub/Sub delivers at least once b. Dataflow: deduplicate, order and window c. Separation of concerns -> scale
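
A toy in-memory simulation of item 4 (ACKs, ackDeadline, and deadline extension). It is not the Pub/Sub client library: the Subscription class, its method names, and the integer clock passed in as `now` are assumptions made for illustration only.

```python
class Subscription:
    """Toy subscription with at-least-once redelivery (not the real API)."""

    def __init__(self, ack_deadline):
        self.ack_deadline = ack_deadline
        # message id -> (data, earliest time it may be delivered again)
        self.pending = {}

    def publish(self, msg_id, data, now):
        self.pending[msg_id] = (data, now)

    def pull(self, now):
        """Deliver every unacked message that is not inside an open deadline."""
        delivered = []
        for msg_id, (data, available_at) in list(self.pending.items()):
            if available_at <= now:
                delivered.append((msg_id, data))
                self.pending[msg_id] = (data, now + self.ack_deadline)
        return delivered

    def ack(self, msg_id):
        """Acknowledge a message so it is never delivered again."""
        self.pending.pop(msg_id, None)

    def extend_deadline(self, msg_id, now, extra):
        """Push back redelivery of one message to now + extra."""
        data, _ = self.pending[msg_id]
        self.pending[msg_id] = (data, now + extra)


sub = Subscription(ack_deadline=10)
sub.publish("m1", "a", now=0)
sub.publish("m2", "b", now=0)
first = sub.pull(now=0)    # both messages delivered
sub.ack("m1")              # only m1 is acknowledged
second = sub.pull(now=5)   # m2 is still inside its ack deadline
third = sub.pull(now=10)   # deadline expired, so m2 is delivered again
sub.extend_deadline("m2", now=12, extra=30)
fourth = sub.pull(now=20)  # extension holds back redelivery
print(first, second, third, fourth)
```

The repeated delivery of m2 is the reason downstream processing has to tolerate duplicates.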

The need for fast decisions leads to streaming

1. Massive data, from varied sources, that keeps growing over time 2. Need to derive insights immediately in the form of dashboards 3. Need to make timely decisions

Pub/Sub - publish-subscribe

1. Message bus that's a great way to deal with the challenge of ingesting variable volumes of data 2. Has the ability to ingest data at high speeds 3. Works well with streaming data 4. Durable and fault tolerant 5. No-ops design 6. Serverless

Data Studio connects to various GCP data sources

1. Offers a BigQuery connector 2. Read from a table or run a custom query 3. Build charts and graphs, or lay the data on a map

Batch and Streaming set up

1. Pub/Sub is a global message bus. 2. Dataflow is capable of doing batch and streaming; the code doesn't change. It gives you a better way to deal with late data and unordered data. 3. BigQuery gives you the power of doing analytics both on historical data and on streaming data.

Stream processing makes it possible to derive real-time insights from growing data

1. Scale to variable volumes 2. Act on real-time data using continuous queries 3. Derive insights from data in flight

To use custom timestamps, perhaps based on message producer's clock:

1. Set an attribute in Pub/Sub with the timestamp when publishing: batch.publish(event_data, mytime="2017-04-12T23:20:50.52Z") 2. Tell Dataflow which Pub/Sub attribute is the timestampLabel: p.apply(PubsubIO.readStrings().fromTopic(t).withTimestampAttribute("mytime")).apply(...)
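
A plain-Python sketch of what using a timestamp attribute means for a consumer: prefer the producer's clock (the "mytime" attribute) over the publish time. The message dictionary layout, the event_time helper, and the publish_time value are assumptions for illustration; this is not the Beam or Pub/Sub API. The parser also assumes the attribute always includes fractional seconds and a trailing Z, as in the card's example.

```python
from datetime import datetime, timezone


def event_time(message, timestamp_attribute="mytime"):
    """Return the event time of a message as an aware UTC datetime.

    Uses the named attribute when present, else falls back to publish time.
    """
    raw = message["attributes"].get(timestamp_attribute)
    if raw is None:
        return message["publish_time"]
    # %f accepts one to six fractional digits, so ".52" parses as 520000 us.
    parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


# Hypothetical message: published a day after the event actually happened.
message = {
    "data": b"event_data",
    "attributes": {"mytime": "2017-04-12T23:20:50.52Z"},
    "publish_time": datetime(2017, 4, 13, 8, 0, 0, tzinfo=timezone.utc),
}
untagged = {"data": b"x", "attributes": {}, "publish_time": message["publish_time"]}

ts = event_time(message)
fallback = event_time(untagged)
print(ts.isoformat(), fallback.isoformat())
```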

Can enforce only-once handling in dataflow even if your publisher might retry publishes

1. Specify a unique label when publishing to Pub/Sub 2. When reading, tell Dataflow which PubSub attribute is the idLabel
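
A minimal sketch of the idea behind the idLabel, in plain Python: keep the first message seen for each unique id attribute and drop retried publishes. The process_once helper, the "event_id" attribute name, and the message layout are assumptions for illustration; a real pipeline would bound how long ids are remembered, which this sketch does not.

```python
def process_once(messages, id_attribute):
    """Yield each message once, keyed by the given attribute."""
    seen = set()  # unbounded here; a simplification of this sketch
    for message in messages:
        key = message["attributes"][id_attribute]
        if key in seen:
            continue  # a retried publish of a message already handled
        seen.add(key)
        yield message


# Hypothetical stream in which the publisher retried event "e1".
stream = [
    {"data": "speed=61", "attributes": {"event_id": "e1"}},
    {"data": "speed=58", "attributes": {"event_id": "e2"}},
    {"data": "speed=61", "attributes": {"event_id": "e1"}},
]
unique = list(process_once(stream, id_attribute="event_id"))
print([m["attributes"]["event_id"] for m in unique])
```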

Create topic and publish message

1. To create a topic, you use gcloud 2. Messages are opaque in Pub/Sub (no parsing) 3. Publishing actually sends a web call, a REST API call.

Unbounded datasets are quite common

1. Traffic sensors along highways 2. Usage information of Cloud component by every user with a GCP project 3. Credit card transactions 4. User moves in multi-user online gaming

Reading from Bigtable

1. Typically programmatic using the HBase API 2. HBase command-line client 3. BigQuery

Cluster performance

1. Under typical workloads, Cloud Bigtable delivers highly predictable performance. When everything is running smoothly, you can expect the following performance for each node in your Cloud Bigtable cluster, depending on which type of storage your cluster uses

Best practices Dataflow + BigQuery to enable fast data-driven decisions

1. Use dataflow to do the processing/transforms 2. Create multiple tables for easy analysis 3. Take advantage of BigQuery for streaming analysis for dashboards and long term storage to reduce storage cost 4. Create views for common query support

Techniques to deal with the three Vs

1. Volume a. Terabytes, petabytes b. Mapreduce autoscaling analysis 2. Velocity a. Realtime or near-realtime b. streaming 3. Variety a. Social networks, blog posts, logs, sensors b. Unstructured data and machine learning

Triggers control when results are emitted

1. What are you computing? What = transformations 2. Where in event time? Where = windowing 3. When in processing time? When = watermarks + triggers 4. How do refinements relate? How = accumulation
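
A small sketch of the "how do refinements relate" question, in plain Python. It assumes one window, a trigger that fires once when the watermark passes the end of the window and then once per late element, and a sum as the aggregation. The emit_panes helper and the sample values are assumptions for illustration.

```python
def emit_panes(on_time, late, accumulating):
    """Return the sums emitted for one window.

    The first pane covers the on-time elements; each late element then
    causes one more pane. In accumulating mode a pane repeats everything
    seen so far; in discarding mode it holds only the new element.
    """
    running = sum(on_time)
    panes = [running]
    for value in late:
        running += value
        panes.append(running if accumulating else value)
    return panes


on_time_values = [3, 4]   # arrived before the watermark passed the window
late_values = [5, 1]      # arrived afterwards

accumulated = emit_panes(on_time_values, late_values, accumulating=True)
discarded = emit_panes(on_time_values, late_values, accumulating=False)
print(accumulated, discarded)
```

With accumulation, a downstream sink should overwrite the previous result; with discarding, it should add the panes together.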

How dataflow handles streaming data while balancing tradeoffs

1. Windowing model a. Which supports unaligned event-time windows, and a simple API for their creation and use 2. Triggering model a. That binds the output times of results to runtime characteristics of the pipeline, with a powerful and flexible declarative API for describing desired triggering semantics 3. Incremental processing model a. That integrates retractions and updates into the windowing and triggering models

Stream processing in Dataflow accounts for this

1. Works with out-of-order messages when computing aggregates 2. Automatically removes duplicates based on the internal Pub/Sub ID

Modifying Bigtable clusters

1. You can add or remove nodes or change a cluster's name without any downtime 2. You cannot modify the Cluster ID, zone, or storage type after a cluster is created

Apache Beam

1. a unified model for batch and stream processing 2. supporting multiple runtimes

Can associate a timestamp with inputs

1. automatic timestamp when reading from PubSub a. Timestamp is the time that message was published to topic 2. For batch inputs, explicitly assign timestamp when emitting at some step in your pipeline a. outputWithTimestamp()

Pub/Sub :

1. has to be primed: a subscription must exist before publishing, so only messages published after the subscription is created will show up 2. simplifies systems by removing the need for every component to speak to every component 3. connects applications and services through a messaging infrastructure

So if you're thinking about a good stream processing solution, there are three key challenges that it needs to address.

1. It needs to be able to scale. 2. You want to essentially use continuous queries. 3. You want to be able to do SQL-like queries that operate over time windows over that data.

Spanner

1. It uses familiar relational semantics, so traditional database analysts will adapt to it easily. 2. Data is sharded within the zone, providing high throughput. 3. It provides high availability by design, so no manual intervention is required to deal with a zone failure.

Store related entities in adjacent rows

1. Make the query parameter the row key 2. Add a reverse timestamp to the row key 3. Distribute the writing load between tablets while allowing common queries to return consecutive rows
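
A minimal sketch of items 1 and 2: a row key made of the query parameters (highway and milemarker) followed by a reverse timestamp, so the newest reading for a location sorts first. The make_row_key helper, the MAX_TS bound, millisecond timestamps, and the zero-padding width are assumptions for illustration; the sketch sorts Python strings instead of talking to Bigtable.

```python
# Assumed upper bound for millisecond timestamps (13 digits).
MAX_TS = 10**13 - 1


def make_row_key(highway, milemarker, ts_millis):
    """Build a 'highway-milemarker-reverse timestamp' row key.

    The reverse timestamp is zero-padded so lexicographic order matches
    numeric order, which puts the latest reading first.
    """
    reverse_ts = MAX_TS - ts_millis
    return f"{highway}-{milemarker}-{reverse_ts:013d}"


# Three hypothetical readings for the same sensor, written out of order.
timestamps = [1492039250000, 1492039310000, 1492039190000]
keys = sorted(make_row_key("I35", 347, ts) for ts in timestamps)
latest_key = make_row_key("I35", 347, max(timestamps))
print(keys)
```

A prefix scan on "I35-347-" then returns that location's rows in adjacent positions, newest first, so reading the latest few records touches only the start of the range.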

Stream processing poses several challenges:

1. size a. Traffic data will only grow with more sensors and higher frequency 2. scalability and fault-tolerance a. Handle growing traffic data volumes, distributed sensors, and still be fault tolerant 3. programming model a. Compare traffic over past hour against that of last Friday at same time: is this stream or batch? 4. Unboundedness a. What happens if data from a sensor arrives late?

A message in Pub/Sub persists for

7 days

Configure Alerts

Define thresholds on job or resource group-level metrics and alert when these metrics reach specified values. Stackdriver alerting can notify on a variety of conditions such as long streaming system lag or failed jobs.

Cloud Pub/Sub overview:

A global, multitenant, managed, real-time messaging service. 1. Discoverability 2. Availability 3. Durability 4. Scalability 5. Low latency

BigQuery and Bigtable

user generated queries, ad-hoc queries, queries that you have that you do once in a long while.

Difference between BigQuery and Bigtable Structure:

BigQuery (SQL) Bigtable (NoSQL)

Dataflow export to

BigQuery, Cloud Storage text file, Cloud Storage Avro file

But if you need very high throughput, very low latency, then you need

Bigtable

Dataflow import from

Cloud Pub/Sub subscription, Cloud Storage text file

When you load data from Cloud Storage into BigQuery, your data can be in any of the following formats:

Comma-separated values (CSV), JSON (newline-delimited), Avro, Parquet, ORC (Beta), Cloud Datastore exports, Cloud Firestore exports

BigQuery

Destination table write preference: write if empty, append to table, overwrite table

a ParDo transform considers

each element in the input PCollection, performs some processing function (your user code) on that element, and emits zero, one, or multiple elements to an output PCollection.
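
A plain-Python sketch of that per-element behaviour, using generators in place of Beam. The par_do, split_words, and keep_long names are assumptions for illustration and are not Beam's API; the point is only that the user function may emit zero, one, or several outputs for each input.

```python
def par_do(elements, do_fn):
    """Apply do_fn to each element and flatten whatever it emits."""
    for element in elements:
        yield from do_fn(element)


def split_words(line):
    """Emit zero, one, or many lowercase words per input line."""
    for word in line.split():
        yield word.lower()


def keep_long(word):
    """Filtering as a ParDo: emit the element or nothing at all."""
    if len(word) > 4:
        yield word


lines = ["Hello Beam", "", "ParDo"]
words = list(par_do(lines, split_words))
long_words = list(par_do(words, keep_long))
print(words, long_words)
```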

Chart Dataflow metrics in Stackdriver Dashboards:

Create Dashboards and chart time series of Dataflow metrics.

Dataflow current vCPU count:

Current # of virtual CPUs used by job and updated on value change.

BigQuery details:

easy, inexpensive 1. Latency in order of seconds 2. 100k rows/second streaming

Metrics for Dataflow in lab

Data watermark age: The age (time since event timestamp) of the most recent item of data that has been fully processed by the pipeline. System lag: The current maximum duration that an item of data has been awaiting processing, in seconds.
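
One way to read these two definitions, sketched in plain Python. The function names, the integer-second clock, and the sample timestamps are assumptions for illustration; this is an interpretation of the wording above, not how Dataflow computes its metrics internally.

```python
def data_watermark_age(now, processed_event_times):
    """Age of the most recent fully processed item: now minus its event time."""
    return now - max(processed_event_times)


def system_lag(now, pending_arrival_times):
    """Longest time any pending item has waited; 0 when nothing is pending."""
    if not pending_arrival_times:
        return 0
    return now - min(pending_arrival_times)


now = 100
processed = [40, 70, 55]   # event timestamps of fully processed items
pending = [90, 82, 97]     # arrival times of items still awaiting processing

age = data_watermark_age(now, processed)
lag = system_lag(now, pending)
print(age, lag)
```

A growing watermark age with a small system lag suggests the data itself is arriving late, while a growing system lag points at the pipeline falling behind.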

The BigQuery Data Transfer Service supports loading data from the following data sources

Google AdWords, DoubleClick Campaign Manager, DoubleClick for Publishers, Google Play (beta), YouTube - Channel Reports, YouTube - Content Owner Reports

Monitor User-Defined Metrics:

In addition to Dataflow metrics, Dataflow exposes user-defined metrics (SDK Aggregators) as Stackdriver custom counters in the Monitoring UI, available for charting and alerting. Any Aggregator defined in a Dataflow pipeline will be reported to Stackdriver as a custom metric. Dataflow will define a new custom metric on behalf of the user and report incremental updates to Stackdriver approximately every 30 seconds.

Dataflow Elapsed time:

Job elapsed time (measured in seconds), reported every 30 secs.

Dataflow Job status:

Job status (Failed, Successful), reported as an enum every 30 secs and on update.

Dataflow System lag:

Max lag across the entire pipeline, reported in seconds.

Dataflow Estimated byte count :

Number of bytes processed per PCollection.

TextIO

Output methods do not support output with the timestamp.

Transform

ParDo, GroupByKey, CoGroupByKey, Combine, Flatten, Partition

Challenge #2: Latency is to be expected

The Beam/Dataflow model provides for exactly-once processing of events. 1. What results are calculated? 2. Where in event time should those results be calculated? 3. When in processing time should those results be emitted? 4. How do you change already computed results? How do you refine them?

Cloud Spanner

The first horizontally scalable, globally consistent database. It's proprietary, not open source. Consider what it means to have a relational database that's consistent but also distributed and global. Think about what might be involved in coordinating transactions on components of relational database located around the world. It seems like a very difficult problem to solve.

Dataflow connector for Cloud Bigtable

Years of engineering to 1. Teach Bigtable to configure itself 2. Isolate performance from noisy neighbors 3. React automatically to new patterns, splitting, and balancing 4. Look at access patterns and improve itself

Difference between BigQuery and Bigtable Latency:

a. BigQuery is on the order of seconds b. Bigtable is on the order of milliseconds.

Pub/Sub features

a. Fast: order of 100s of milliseconds b. Fan in, fan out parallel consumption c. Push and pull delivery flows i. a subscriber can keep checking whether a new message is available, which is pull ii. it can register for notifications when there is a new message which is called push. d. Client libraries i. Idiomatic, hand-built in Java, Python, C#, Ruby, PHP, Node.js ii. Auto-generated in 10 gRPC languages

Pub/Sub is a global service

a. Messages stored in the region closest to the publisher (in multiple availability zones) b. A subscription collates a topic from different regions c. Subscribers can be anywhere in the world; no change of code

1. A table can have only one index (the row key)

a. Rows are stored in ascending order of the row key b. None of the other columns can be indexed

3. Two types of designs

a. Wide tables when every column value exists for every row b. Narrow tables for sparse data

Bigtable details:

big, fast, NoSQL, autoscaling 1. Low latency/high throughput 2. 100,000 QPS @ 6 ms latency for a 10-node cluster 3. You pay for the number of Bigtable nodes that you are running

BigQuery export

Data Studio, GCS

Cloud Spanner Natural use cases include:

financial applications and inventory applications traditionally served by relational database technology. Some example mission-critical use cases: powering customer authentication and provisioning for multinational businesses; building consistent systems for transactions and inventory management in the financial services and retail industries; supporting high-volume systems that require low latency and high throughput in the advertising and media industries.

Both BigQuery and Bigtable are what kind of services?

fully managed, no-ops services.

c. Distribute the writing load between tablets while allowing common queries to return consecutive rows

i. Can you have both distributed writes and block reads? ii. E.g. highway-milemarker-reverse timestamp (I35-347-123456789)

b. Add reverse timestamp to the row key

i. Will you often need to retrieve the latest few records? ii. E.g. highway-milemarker-reverse timestamp (I35-347-123456789)

Make query parameter the row key

i. What is the most common query you need to support? ii. E.g. highway-milemarker (I35-347) iii. Entities are considered related if users are likely to pull both records in a single query. This makes reads more efficient. iv. Results would come from the same tablet

The timestamp of the message in Pub/Sub

is the timestamp at which you call the publish method. The time at which you publish into Pub/Sub is the timestamp of the message.

Bigtable and clusters

Bigtable uses clusters, but those clusters only contain pointers to the data, not the data itself. The clusters consist of nodes, and these nodes contain the metadata; the data itself remains on Colossus, Google's distributed file system.

Dataflow

provides a fully-managed, autoscaling execution environment for Beam pipelines

If, for example, you're simulating historical data, so you're publishing at some time in 2017 but the data actually comes from 2008, you might want to set metadata that carries the timestamp of the message, because

Pub/Sub is not going to parse the message and figure out what the actual timestamp inside it is. The other thing to be aware of is that you don't have to call publish one at a time for every message.

Dataflow, instead of keying off the internal Pub/Sub ID,

will key off this attribute instead and make sure that any particular ID gets processed only once.

To associate the messages with the timestamps

you have to create that association in the PCollection. In this case, rather than use ProcessContext.output(), you can use ProcessContext.outputWithTimestamp().

