BigQuery
BigQuery: Streaming
$0.01/200 MB
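Worked example from this rate: 1 GB is five 200 MB blocks, so streaming 1 GB costs roughly 5 x $0.01 = $0.05.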
BigQuery: Queries
$5/TB (first TB/month free)
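Worked example from this rate: scanning 3 TB of query data in a month costs nothing for the first TB, then 2 TB x $5 = $10.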
Storage
$0.02/GB/month (first 10 GB/month free); long-term storage (table not edited for 90 days) drops to $0.01/GB/month
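Worked example from these rates: 100 GB stored for a month = (100 - 10 free) x $0.02 = $1.80; after 90 days with no edits, the billable 90 GB drops to 90 x $0.01 = $0.90.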
BigQuery: General best practices
1. Avoid using SELECT * 2. Denormalize data when possible 3. Filter early and big with WHERE clauses 4. Do biggest joins first, and filter pre-JOIN 5. LIMIT does not affect cost 6. Partition data by date
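A minimal sketch of several of these practices at once (project, table, and column names are hypothetical): select only the needed columns instead of SELECT *, and filter early on the date partition in the WHERE clause.

    -- Select only needed columns (no SELECT *); filter early on the partition
    SELECT user_id, event_type
    FROM `my-project.analytics.events`
    WHERE _PARTITIONTIME >= TIMESTAMP('2018-01-01')
      AND event_type = 'purchase';

Adding LIMIT here would shorten the result set, but not the bytes scanned or the cost.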
What is BigQuery
1. Fully managed data warehouse (near-real-time analysis of petabyte-scale databases) 2. Serverless (no-ops) 3. Auto-scaling to petabyte range 4. Both storage and analysis 5. Accepts batch and streaming loads 6. Locations: multi-regional or regional 7. Replicated, durable 8. Interact primarily with Standard SQL (Legacy SQL also supported)
Denormalize data when possible
1. Group data into a single table 2. Often with nested/repeated fields 3. Good for read performance, not for write (transactional) performance
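A sketch of the idea, assuming a hypothetical orders table where line items are nested and repeated inside each order row rather than living in a separate table that would need a join:

    -- items is ARRAY<STRUCT<sku STRING, quantity INT64>>: nested, repeated data
    SELECT order_id, item.sku, item.quantity
    FROM `my-project.sales.orders`,
      UNNEST(items) AS item;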
What is 'work'?
1. I/O: how many bytes read? 2. Shuffle: how many bytes passed to the next stage? 3. Output: how many bytes written? 4. CPU: work in functions
BigQuery Transfer Service
1. Automates data movement from Google advertising SaaS applications into BigQuery 2. Google AdWords 3. DoubleClick 4. YouTube reports
How BigQuery works
1. Part of the "3rd wave" of cloud computing 2. Focus on serverless compute, real-time insights, machine learning 3. Jobs (queries) can scale up to thousands of CPUs across many nodes, but the process is completely invisible to the end user 4. Storage and compute are separated, connected by a petabit network 5. Columnar data store
Partition data by date
1. Partition by ingestion time 2. Partition by a specified date column
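A sketch of both options (all names hypothetical). A column-partitioned table is declared with PARTITION BY; an ingestion-time partitioned table is filtered through the _PARTITIONTIME pseudo-column.

    -- Partition by a specified DATE column (Standard SQL DDL)
    CREATE TABLE `my-project.analytics.events_by_day`
    PARTITION BY event_date AS
    SELECT event_date, user_id, event_type
    FROM `my-project.analytics.events_raw`;

    -- Ingestion-time partitioned tables are filtered via _PARTITIONTIME
    SELECT user_id
    FROM `my-project.analytics.events`
    WHERE _PARTITIONTIME = TIMESTAMP('2018-06-01');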
Searching multiple tables with wildcards
1. Query across multiple, similarly named tables with a wildcard (*) 2. Filter further on the table suffix in the WHERE clause
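A sketch against hypothetical date-sharded tables (events_2017, events_2018, ...): the * wildcard matches the table-name suffix, and _TABLE_SUFFIX narrows the match in the WHERE clause.

    SELECT user_id, event_type
    FROM `my-project.analytics.events_*`
    WHERE _TABLE_SUFFIX BETWEEN '2017' AND '2018';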
Pricing
1. Storage, queries, streaming inserts 2. Pay as you go, with a high-end flat-rate query pricing option (starts at $40,000 per month for 2,000 slots)
Monitoring query performance
1. Understand the color codes in the query plan 2. Understand "skew": the gap between average and max worker time
views
1. Virtual table defined by a query 2. "Querying a query" 3. Contains only the data from the query that defines the view 4. Useful for limiting table data exposed to others
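A minimal sketch (names hypothetical): the view exposes only non-sensitive columns, so sharing the view rather than the base table limits what others can see.

    -- Virtual table: users of the view see only the columns it selects
    CREATE VIEW `my-project.shared.users_redacted` AS
    SELECT user_id, signup_date      -- email, address, etc. deliberately omitted
    FROM `my-project.private.users`;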
BigQuery: Interaction methods
1. Web UI 2. Command line (bq commands) 3. Programmatic (REST API, client libraries) 4. Interact via queries
BigQuery: Loading data with command line
1. bq load --source_format=[FORMAT] [DATASET].[TABLE] [SOURCE_PATH] [SCHEMA] 2. Can load multiple files with the command line (not the Web UI)
BigQuery: Exporting tables
1. Can only export to Cloud Storage 2. Can copy a table to another BigQuery dataset 3. Can export multiple tables with the command line 4. Can only export up to 1 GB per file, but can split into multiple files with wildcards
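To illustrate the wildcard split, a hedged sketch using Standard SQL's EXPORT DATA statement (bucket and table names hypothetical): the * in the Cloud Storage URI lets BigQuery shard output that would exceed 1 GB into multiple files.

    EXPORT DATA OPTIONS (
      uri = 'gs://my-bucket/exports/orders-*.csv',  -- wildcard: output split into shards
      format = 'CSV',
      overwrite = true
    ) AS
    SELECT * FROM `my-project.sales.orders`;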
BigQuery: IAM
1. Control by project, dataset, or view 2. Cannot control at the table level (but views, virtual tables defined by SQL queries, placed in their own datasets work as an alternative) 3. Predefined roles: BigQuery Admin, Data Owner, Data Editor, Data Viewer, Job User, User 4. Sharing datasets (can be made public to all authenticated users)
BigQuery: Why use external sources?
1. Load and clean data in one pass from an external source, then write the results to BigQuery 2. Keep a small amount of frequently changing data external and join it to other tables
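A sketch of the one-pass load-and-clean pattern, assuming an external table my-project.staging.ext_orders has already been defined over files in Cloud Storage:

    -- Read the external source once, clean it, and persist the result in BigQuery
    CREATE TABLE `my-project.sales.orders_clean` AS
    SELECT order_id, LOWER(TRIM(email)) AS email, total
    FROM `my-project.staging.ext_orders`
    WHERE total IS NOT NULL;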
Cached queries
1. Queries cost money 2. Previous query results are cached to avoid charges if the same query is run again 3. Caching can be disabled from the command line 4. Caching is per user only
columnar data store
1. Separates records into column values, stores each value on a different storage volume 2. Traditional RDBMS stores the whole record on one volume 3. Extremely fast read performance, poor write (update) performance; BigQuery does not update existing records 4. Not transactional
User Defined Functions(UDF)
1. Combine SQL code with JavaScript/SQL functions 2. Combine SQL queries with programming logic 3. Allow much more complex operations (loops, complex conditionals) 4. In the Web UI, usable only with Legacy SQL
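A minimal JavaScript UDF sketch in Standard SQL (the function and its inputs are made up for illustration):

    CREATE TEMP FUNCTION multiply(x FLOAT64, y FLOAT64)
    RETURNS FLOAT64
    LANGUAGE js AS """
      // Arbitrary JavaScript (loops, conditionals) can run here
      return x * y;
    """;

    SELECT multiply(2.0, 3.0) AS product;  -- 6.0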
Advanced SQL queries are supported, including
JOINs, subqueries, CONCAT
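One hedged sketch combining all three (tables and columns are hypothetical):

    SELECT
      CONCAT(u.first_name, ' ', u.last_name) AS full_name,  -- CONCAT
      o.total
    FROM `my-project.sales.users` AS u
    JOIN `my-project.sales.orders` AS o                     -- JOIN
      ON u.user_id = o.user_id
    WHERE o.total > (SELECT AVG(total)                      -- subquery
                     FROM `my-project.sales.orders`);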
Data Editor
edit dataset tables
Admin
full access
Data Owner
full dataset access
Job User
run jobs
User
run queries and create datasets (but not tables)
Data Viewer
view datasets and tables
BigQuery: Data formats: read
Avro, CSV, JSON (newline-delimited), Parquet
BigQuery: Data formats: Load
Avro (best for compressed files), CSV, Datastore backups, JSON (newline-delimited), Parquet
BigQuery: Exporting tables: formats
CSV, JSON, Avro
Connecting to/from other Google Cloud Services
Dataproc: use the BigQuery connector (installed by default); jobs use Cloud Storage for staging
BigQuery structure
Dataset: contains tables/views. Table: collection of columns. Job: long-running action/query
Querying tables in SQL
FROM `project.dataset.table` (Standard SQL, with backticks) FROM [project:dataset.table] (Legacy SQL)
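A runnable Standard SQL example against a real public dataset (note the backticks around the table path):

    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 10;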