BigQuery

Pataasin ang iyong marka sa homework at exams ngayon gamit ang Quizwiz!

BigQuery: Streaming

$0.01/200 MB

BigQuery: Queries

$5/TB (first TB/month free)

Storage

0.02/GB/month (first 10GB/mo free) (long term storage (not edited for 90 days) = $0.01/GB/month)

BigQuery: General best practices

1. Avoid using SELECT * 2. Denormalize Data when possible 3. Filter early and big with WHere clause 4. Do biggest joins first, and filter pre-JOIN 5. LIMIT does not affect cost 6 Partition data by date

What is BigQuery

1. Fully Managed Data warehouse (Near-real time analysis of petabyte scale databases 2. Serverless (no-ops) 3. Auto-scaling to petabyte range 4. Both storage and analysis 5. Accepts batch and streaming loads 6. Locations = multi-regional, Regional 7. Replicated, durable 8. Interact primarily with standard SQL (Legacy SQL)

Denormalize data when possible

1. Grouping data into single table 2. Often with nested/repeated data 3. Good for read performance, not for write (transactional) performance

What is 'work'?

1. I/O - how many bytes read? 2. Shuffle - how much passed to next stage 3. How many bytes written? 4. CPU work in functions

BigQuery Transfer Service

1. Move BigQuery data to other Google advertising SaaS applications 2. Google AdWords 3. DoubleClick 4. Youtube reports

How BigQuery works

1. Part of the "3rd wave" of cloud computing 2. Focus on serverless compute, real time insights, machine learning 3. Jobs (queries) can scale up to thousands of CPU's across many nodes, but the process is completely invisible to end user 4. Storage and compute are separated, connected by petabit network 5. Columnar data store

Partition data by date

1. Partition by ingest time 2. Partition by specified data columns

Searching multiple tables with wildcards

1. Query across multiple, similarly named tables 2. Filter further in WHERE clause

Pricing

1. Storage, queries, streaming insert 2. pay as you go, with high end flat-rate query pricing (starts at $40k per month with 2000 slots)

Monitoring query performance

1. Understand color codes 2. Understand "skew" in difference between average and max time

views

1. Virtual table defined by query 2. 'Querying a query" 3. Contains data only from query that contains view 4. Useful for limiting table data to others

BigQuery: Interaction methods

1. Web UI 2. Command line (bq commands) 3. Programmatic (REST Api, client libraries) 4. interact via queries

BigQuery: Loading data with command line

1. bq load --source_format = [format][dataset].[table] [source-path][schema] 2. can load multiple files with command line (not WebUI)

BigQuery: Exporting tables

1. can only export to Cloud Storage 2. can copy table to another BigQuery dataset 3. can export multiple tables with command line 4. Can only export up to 1GB per file, but can split into multiple files with wildcards

BigQuery: IAM

1. control by project, dataset, view 2. cannot control at table level (but can control by views via datasets as alternative (virtual table defined by SQL query) 3. predefined roles - BigQuery (Admin, Data Owner, Data Editor, Data Viewer, Job User, User) 4. sharing datasets (make public with all authenticated users)

BigQuery: Why use external sources?

1. load and clean data in one pass from external, then write to BigQuery 2. Small amount of frequently changing data to join to other tables

Cached queries

1. queries cost money 2. previous queries are cached to avoid charges if ran again 3. command line to disable cached results 4. caching is per user only

columnar data store

1. separates records into column values, stores each value on different storage volume 2. Traditional RDBMS stores whole record on one volume 3. Extremely fast read performance, poor write (update) performance - BigQuery does not update existing records 4. Not transactional

User Defined Functions(UDF)

1.combine SQL code with JavaScript/SQL functions 2. Combine SQL queries with programming logic 3. Allow much more complex operations (loops, complex conditionals) 4. WebUI only usable with Legacy SQL

Advanced SQL queries are allowed including

JOINS, sub queries, CONCAT

Data editor

edit dataset tables

Admin

full access

Data Owner

full dataset access

Job user

run jobs

user

run queries and create data sets(but not tables)

dataviewer

view datasets and tables

BigQuery: Data formats: read

Avro CSV JSON (Newline delimited) Parquet

BigQuery: Data formats: Load

Avro (best for compressed files) CSV Datastore backups JSON (Newline delimited) Parquet

BigQuery: Exporting tables: formats

CSV, JSON, Avro

Connecting to/from other Google Cloud Services

Dataproc - Use BigQuery connector (installed by default), job uses Cloud Storage for staging

BigQuery structure

Dataset - contains tables/views Table - collection of columns Job - long running action/query

Querying tables in SQL

FROM 'project.dataset.table' (Standard SQL) FROM [project:dataset.table] (Legacy SQL)


Kaugnay na mga set ng pag-aaral

Chapter 9- Motivating the Workforce

View Set

Chapter 14 Gene regulationHow is the lac operon regulated?

View Set

CH 41 Sexually Transmitted Infections NCLEX

View Set

Main Scenes and Main Themes (Poems, Hamlet, Paradise Lost, and Moby Dick)

View Set

Business Law: Chapter 6 (Tort Law)

View Set