BigQuery: Best Practices
BigQuery: Introduction to Optimizing Query Performance
*Input data and data sources (I/O):* How many bytes does your query read? *Communication between nodes (shuffling):* How many bytes does your query pass to the next stage? How many bytes does your query pass to each slot? *Computation:* How much CPU work does your query require? *Outputs (materialization):* How many bytes does your query write? *Query anti-patterns:* Are your queries following SQL best practices?
BigQuery: Managing Query Outputs
1. Avoid repeatedly joining the same tables and using the same subqueries. 2. Carefully consider materializing large result sets to a destination table; writing large result sets has performance and cost impacts. 3. If you are sorting a very large number of values, use a LIMIT clause (items 2 and 3 are sketched in the example below).
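For example, a minimal sketch of materializing a bounded, sorted result into a destination table (the dataset, table, and column names here are hypothetical):

  CREATE OR REPLACE TABLE mydataset.top_orders AS
  SELECT order_id, customer_id, order_total
  FROM mydataset.orders
  ORDER BY order_total DESC
  LIMIT 1000;  -- the LIMIT bounds both the sort and the bytes written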
BigQuery: Avoiding SQL Anti-Patterns
1. Avoid self-joins; use a window function instead (sketched below). 2. If your query processes keys that are heavily skewed to a few values, filter your data as early as possible. 3. Avoid joins that generate more outputs than inputs. When a CROSS JOIN is required, pre-aggregate your data. 4. Avoid point-specific DML statements (updating or inserting one row at a time). Batch your updates and inserts.
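A minimal sketch of item 1, using a window (analytic) function where a self-join might otherwise be tempting (table and column names are hypothetical):

  -- Rank each customer's orders by value without joining the table to itself
  SELECT
    customer_id,
    order_id,
    order_total,
    RANK() OVER (PARTITION BY customer_id ORDER BY order_total DESC) AS order_rank
  FROM mydataset.orders;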
BigQuery Best Practices: Optimizing Storage
1. Configure the default table expiration for your datasets, the expiration time for your tables, and the partition expiration for time-partitioned tables (see the DDL sketch below). 2. Keep your data in BigQuery. 3. Estimate your storage costs using the Google Cloud Platform Pricing Calculator.
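A DDL sketch for item 1, assuming hypothetical dataset and table names:

  -- Default expiration for new tables created in a dataset
  ALTER SCHEMA mydataset SET OPTIONS (default_table_expiration_days = 30);

  -- Expiration for an existing table
  ALTER TABLE mydataset.staging_events
    SET OPTIONS (expiration_timestamp = TIMESTAMP '2026-01-01 00:00:00 UTC');

  -- Partition expiration for a time-partitioned table
  ALTER TABLE mydataset.events SET OPTIONS (partition_expiration_days = 90);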
BigQuery: Input Data and Data Sources (I/O)
1. Control projection: query only the columns that you need. 2. When querying a time-partitioned table, use the _PARTITIONTIME pseudo column to filter the partitions (items 1 and 2 are sketched below). 3. BigQuery performs best when your data is denormalized. Rather than preserving a relational schema such as a star or snowflake schema, denormalize your data and take advantage of nested and repeated fields, which maintain relationships without the performance impact of a normalized schema. 4. If query performance is a top priority, do not use an external data source. 5. When querying wildcard tables, use the most granular prefix possible.
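A sketch of items 1 and 2 together, assuming an ingestion-time partitioned table with hypothetical names:

  SELECT event_id, event_type              -- project only the columns you need
  FROM mydataset.events
  WHERE _PARTITIONTIME BETWEEN TIMESTAMP '2024-01-01'
                           AND TIMESTAMP '2024-01-31';  -- prune partitions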
BigQuery: Optimizing Query Computation
1. If you are using SQL to perform ETL operations, avoid situations where you are repeatedly transforming the same data. 2. Avoid JavaScript user-defined functions; use native (SQL) UDFs instead. 3. If your use case supports it, use an approximate aggregation function (sketched below). 4. Use ORDER BY only in the outermost query or within window clauses (analytic functions), and push complex operations to the end of the query. 5. For queries that join data from multiple tables, optimize your join patterns: start with the largest table. 6. When querying a time-partitioned table, use the _PARTITIONTIME pseudo column to filter the partitions.
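A sketch of item 3, trading a small amount of accuracy for less computation (names are hypothetical):

  -- Exact, but more expensive on large data:
  SELECT COUNT(DISTINCT user_id) AS exact_users FROM mydataset.events;

  -- Approximate, cheaper when a small statistical error is acceptable:
  SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_users FROM mydataset.events;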
BigQuery: Denormalization
Denormalization in BigQuery doesn't require a completely flat schema. You can use nested and repeated fields to maintain relationships:
1. Nesting data (STRUCT) 2. Repeated data (ARRAY) 3. Nested and repeated data (ARRAY of STRUCTs)
BigQuery Best Practices: Controlling Costs
1. Query only the columns that you need. 2. Don't run queries to explore or preview table data. 3. Before running queries, preview them to estimate costs. 4. Use the maximum bytes billed setting to limit query costs. 5. Do not use a LIMIT clause as a method of cost control. 6. Create a dashboard to view your billing data so you can make adjustments to your BigQuery usage, and consider streaming your audit logs to BigQuery so you can analyze usage patterns. 7. Partition your tables by date (see the DDL sketch below). 8. If possible, materialize your query results in stages. 9. If you are writing large query results to a destination table, use the default table expiration time to remove the data when it's no longer needed. 10. Use streaming inserts only if your data must be immediately available.
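A DDL sketch of item 7 (with an expiration, per item 9), using hypothetical names:

  CREATE TABLE mydataset.events_by_day (
    event_id  STRING,
    event_ts  TIMESTAMP,
    payload   STRING
  )
  PARTITION BY DATE(event_ts)
  OPTIONS (partition_expiration_days = 60);  -- old partitions are removed automatically

Queries that filter on the partitioning column then scan only the matching partitions, which directly reduces bytes billed.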
BigQuery: Optimizing Communication Between Slots
1. Reduce the amount of data that is processed before a JOIN clause (sketched below). 2. Use WITH clauses primarily for readability. 3. Do not use tables sharded by date (also called date-named tables) in place of time-partitioned tables. 4. Avoid creating too many table shards; if you are sharding tables by date, use time-partitioned tables instead.
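A sketch of item 1: filter and aggregate one side before the join so less data is shuffled between slots (names are hypothetical):

  WITH daily_clicks AS (
    SELECT user_id, COUNT(*) AS clicks
    FROM mydataset.click_events
    WHERE event_date = DATE '2024-06-01'   -- filter early
    GROUP BY user_id                       -- pre-aggregate before joining
  )
  SELECT u.user_id, u.country, d.clicks
  FROM mydataset.users AS u
  JOIN daily_clicks AS d USING (user_id);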
BigQuery: Connectors
BigQuery provides JDBC and ODBC drivers for connecting external tools.
BigQuery: Nested and repeated data (ARRAY of STRUCTs)
Nesting and repetition complement each other. For example, in a table of transaction records, you could include an array of line item STRUCTs.
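A sketch of that shape, with hypothetical names: each transaction row carries its line items as an ARRAY of STRUCTs, and UNNEST flattens them at query time.

  SELECT
    t.transaction_id,
    item.product_id,
    item.quantity * item.unit_price AS line_total
  FROM mydataset.transactions AS t,
       UNNEST(t.line_items) AS item;  -- line_items is an ARRAY of STRUCTs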
BigQuery: Nesting Data
Nesting data allows you to represent foreign entities inline. Querying nested data uses "dot" syntax to reference leaf fields, similar to the syntax you would use with a join. Nested data is represented as a STRUCT type in standard SQL.
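A sketch of the dot syntax, assuming a hypothetical orders table with a customer STRUCT that itself contains an address STRUCT:

  SELECT
    order_id,
    customer.name,           -- leaf field of the customer STRUCT
    customer.address.city    -- STRUCTs can nest further
  FROM mydataset.orders;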
BigQuery Connectors
Spark / Cloud Dataproc, Excel, Data Studio
BigQuery: Keep your data in BigQuery.
You can load data into BigQuery at no cost. Rather than exporting your older data to another storage option (such as Google Cloud Storage), take advantage of BigQuery's long-term storage pricing. If a table is not edited for 90 consecutive days, the price of storage for that table automatically drops by 50 percent to $0.01 per GB per month, the same cost as Cloud Storage Nearline. Each partition of a partitioned table is considered separately for long-term storage pricing: if a partition hasn't been modified in the last 90 days, the data in that partition is considered long-term storage and is charged at the discounted price.
BigQuery: Repeated Data
Creating a field of type RECORD with the mode set to REPEATED allows you to preserve a one-to-many relationship inline (as long as the relationship isn't high cardinality). Because the related values are stored with the parent row, no shuffling is needed to bring them together. Repeated data is represented as an ARRAY, and you can use the ARRAY functions in standard SQL when you query it.
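A sketch with a hypothetical posts table whose tags field is REPEATED (an ARRAY of STRING):

  SELECT post_id, ARRAY_LENGTH(tags) AS tag_count
  FROM mydataset.posts
  WHERE 'bigquery' IN UNNEST(tags);   -- filter on values inside the repeated field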
BigQuery: Authorized Views
An authorized view lets you share query results with particular users and groups without giving them access to the underlying tables. You can also use the view's SQL query to restrict the columns (fields) the users are able to query.
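A minimal sketch, assuming a private source dataset and a separate dataset that holds the shared view (all names hypothetical). The view exposes only selected columns; authorizing the view on the source dataset and sharing the view's dataset with users is done through dataset access settings and is not shown here.

  CREATE VIEW shared_views.orders_summary AS
  SELECT order_date, region, order_total   -- expose only these columns
  FROM private_dataset.orders;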
BigQuery: Loading Data
Supported source formats include Avro, Parquet, ORC, CSV, and JSON (newline-delimited).
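For example, a sketch of a SQL-based load from Cloud Storage (the table name and bucket path are hypothetical):

  LOAD DATA INTO mydataset.events
  FROM FILES (
    format = 'PARQUET',
    uris = ['gs://my-bucket/events/*.parquet']
  );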