DataExamTopics


Your business users need a way to clean and prepare data before using the data for analysis. Your business users are less technically savvy and prefer to work with graphical user interfaces to define their transformations. After the data has been transformed, the business users want to perform their analysis directly in a spreadsheet. You need to recommend a solution that they can use. What should you do?

A. Use Dataprep to clean the data, and write the results to BigQuery. Analyze the data by using Connected Sheets.

Your company's customer_order table in BigQuery stores the order history for 10 million customers, with a table size of 10 PB. You need to create a dashboard for the support team to view the order history. The dashboard has two filters, country_name and username. Both are string data types in the BigQuery table. When a filter is applied, the dashboard fetches the order history from the table and displays the query results. However, the dashboard is slow to show the results when applying the filters to the following query:

A. Cluster the table by the country_name and username fields. Info: the fields are both strings, which are not supported for partitioning. Moreover, both fields are regularly used in filters, which is exactly where clustering improves performance. - Clustering organizes the data based on the specified columns (in this case, country_name and username). - When a query filters on these columns, BigQuery can efficiently scan only the relevant parts of the table.
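As a hedged illustration of this answer (project and dataset names are hypothetical), a clustered copy of the table can be created with a CREATE TABLE ... CLUSTER BY statement issued through the BigQuery Python client:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical project/dataset names; the clustering columns come from the question.
    ddl = """
    CREATE TABLE `my_project.support.customer_order_clustered`
    CLUSTER BY country_name, username
    AS SELECT * FROM `my_project.support.customer_order`
    """
    client.query(ddl).result()

    # Dashboard queries that filter on country_name and username can now prune blocks
    # instead of scanning the whole table.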

Your company's data platform ingests CSV file dumps of booking and user profile data from upstream sources into Cloud Storage. The data analyst team wants to join these datasets on the email field available in both the datasets to perform analysis. However, personally identifiable information (PII) should not be accessible to the analysts. You need to de-identify the email field in both the datasets before loading them into BigQuery for analysts. What should you do? ABSOLUTELY NEEDS VERIFICATION!!!!

(Answer not recorded.)

You are designing a data mesh on Google Cloud by using Dataplex to manage data in BigQuery and Cloud Storage. You want to simplify data asset permissions. You are creating a customer virtual lake with two user groups: • Data engineers, which require full data lake access • Analytic users, which require access to curated data. You need to assign access rights to these two groups. What should you do?

A. 1. Grant the dataplex.dataOwner role to the data engineer group on the customer data lake. 2. Grant the dataplex.dataReader role to the analytic user group on the customer curated zone. - dataplex.dataOwner: Grants full control over data assets, including reading, writing, managing, and granting access to others. - dataplex.dataReader: Allows users to read data but not modify it.

You are creating the CI/CD cycle for the code of the directed acyclic graphs (DAGs) running in Cloud Composer. Your team has two Cloud Composer instances: one instance for development and another instance for production. Your team is using a Git repository to maintain and develop the code of the DAGs. You want to deploy the DAGs automatically to Cloud Composer when a certain tag is pushed to the Git repository. What should you do? TO VERIFY!!!!

A. 1. Use Cloud Build to copy the code of the DAG to the Cloud Storage bucket of the development instance for DAG testing. 2. If the tests pass, use Cloud Build to copy the code to the bucket of the production instance.

You have designed an Apache Beam processing pipeline that reads from a Pub/Sub topic. The topic has a message retention duration of one day, and the pipeline writes to a Cloud Storage bucket. You need to select a bucket location and processing strategy to prevent data loss in case of a regional outage with an RPO of 15 minutes. What should you do? TO VERIFY — NOT CERTAIN AT ALL

A. 1. Use a dual-region Cloud Storage bucket. 2. Monitor Dataflow metrics with Cloud Monitoring to determine when an outage occurs. 3. Seek the subscription back in time by 15 minutes to recover the acknowledged messages. 4. Start the Dataflow job in a secondary region.
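A minimal sketch of step 3, assuming hypothetical project and subscription names; the topic's one-day retention is what makes replaying already-delivered messages possible:

    from datetime import datetime, timedelta, timezone

    from google.cloud import pubsub_v1
    from google.protobuf import timestamp_pb2

    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path("my-project", "beam-input-sub")

    # Rewind the subscription 15 minutes so messages delivered during the outage
    # window are redelivered to the Dataflow job started in the secondary region.
    target_time = datetime.now(tz=timezone.utc) - timedelta(minutes=15)
    ts = timestamp_pb2.Timestamp()
    ts.FromDatetime(target_time)
    subscriber.seek(request={"subscription": subscription_path, "time": ts})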

You are planning to use Cloud Storage as part of your data lake solution. The Cloud Storage bucket will contain objects ingested from external systems. Each object will be ingested once, and the access patterns of individual objects will be random. You want to minimize the cost of storing and retrieving these objects. You want to ensure that any cost optimization efforts are transparent to the users and applications. What should you do?

A. Create a Cloud Storage bucket with Autoclass enabled. - Autoclass automatically analyzes access patterns of objects and automatically transitions them to the most cost-effective storage class within Standard, Nearline, Coldline, or Archive. - This eliminates the need for manual intervention or setting specific age thresholds. - No user or application interaction is required, ensuring transparency.

You are creating a data model in BigQuery that will hold retail transaction data. Your two largest tables, sales_transaction_header and sales_transaction_line, have a tightly coupled immutable relationship. These tables are rarely modified after load and are frequently joined when queried. You need to model the sales_transaction_header and sales_transaction_line tables to improve the performance of data analytics queries. What should you do?

A. Create a sales_transaction table that holds the sales_transaction_header information as rows and the sales_transaction_line rows as nested and repeated fields. Info: - In BigQuery, nested and repeated fields can significantly improve performance for certain types of queries, especially joins, because the data is co-located and can be read efficiently. - This approach is often used in data warehousing scenarios where query performance is a priority, and the data relationships are immutable and rarely modified.
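A sketch of the denormalized model, created through the BigQuery Python client; project, dataset, and column names are hypothetical, and the line items live in a repeated STRUCT column:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Each header row carries its line items as an ARRAY of STRUCTs,
    # so the frequent header/line join disappears entirely.
    client.query("""
    CREATE TABLE `my_project.retail.sales_transaction` (
      transaction_id   STRING,
      transaction_date DATE,
      customer_id      STRING,
      lines ARRAY<STRUCT<
        line_number INT64,
        product_id  STRING,
        quantity    INT64,
        unit_price  NUMERIC
      >>
    )
    PARTITION BY transaction_date
    """).result()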

You have an inventory of VM data stored in the BigQuery table. You want to prepare the data for regular reporting in the most cost-effective way. You need to exclude VM rows with fewer than 8 vCPU in your report. What should you do?

A. Create a view with a filter to drop rows with fewer than 8 vCPU, and use the UNNEST operator. - The table structure shows that the vCPU data is stored in a nested field within the components column. - Using the UNNEST operator to flatten the nested field and apply the filter.

Your organization stores customer data in an on-premises Apache Hadoop cluster in Apache Parquet format. Data is processed on a daily basis by Apache Spark jobs that run on the cluster. You are migrating the Spark jobs and Parquet data to Google Cloud. BigQuery will be used on future transformation pipelines so you need to ensure that your data is available in BigQuery. You want to use managed services, while minimizing ETL data processing changes and overhead costs. What should you do?

A. Migrate your data to Cloud Storage and migrate the metadata to Dataproc Metastore (DPMS). Refactor Spark pipelines to write and read data on Cloud Storage, and run them on Dataproc Serverless. Info: - This option involves moving Parquet files to Cloud Storage, which is a common and cost-effective storage solution for big data and is compatible with Spark jobs. - Using Dataproc Metastore to manage metadata lets you keep the Hadoop ecosystem's structural information. - Running Spark jobs on Dataproc Serverless takes advantage of managed Spark services without managing clusters. - Once the data is in Cloud Storage, you can also easily load it into BigQuery for further analysis.
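A minimal PySpark sketch of a refactored job, assuming hypothetical bucket paths; it reads the migrated Parquet files from Cloud Storage, writes results back, and could be submitted with, for example, gcloud dataproc batches submit pyspark:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("daily-orders-job").getOrCreate()

    # Read the migrated Parquet data directly from Cloud Storage (hypothetical paths).
    orders = spark.read.parquet("gs://my-data-lake/raw/orders/")

    daily_totals = orders.groupBy("order_date").sum("amount")

    # Write results back to Cloud Storage; BigQuery can read or load them from there.
    daily_totals.write.mode("overwrite").parquet("gs://my-data-lake/curated/daily_totals/")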

You are managing a Dataplex environment with raw and curated zones. A data engineering team is uploading JSON and CSV files to a bucket asset in the curated zone, but the files are not being automatically discovered by Dataplex. What should you do to ensure that the files are discovered by Dataplex? NOT CERTAIN OF THE ANSWER — TO REVIEW

A. Move the JSON and CSV files to the raw zone.

You are using BigQuery with a multi-region dataset that includes a table with the daily sales volumes. This table is updated multiple times per day. You need to protect your sales table in case of regional failures with a recovery point objective (RPO) of less than 24 hours, while keeping costs to a minimum. What should you do?

A. Schedule a daily export of the table to a Cloud Storage dual-region or multi-region bucket.

You have important legal hold documents in a Cloud Storage bucket. You need to ensure that these documents are not deleted or modified. What should you do?

A. Set a retention policy. Lock the retention policy. - Setting a retention policy on a Cloud Storage bucket prevents objects from being deleted for the duration of the retention period. - Locking the policy makes it immutable, meaning that the retention period cannot be reduced or removed, thus ensuring that the documents cannot be deleted or overwritten until the retention period expires.
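A short sketch with the Cloud Storage Python client; the bucket name and retention period are hypothetical:

    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket("legal-hold-documents")  # hypothetical bucket name

    # Retain every object for 7 years (the value is in seconds).
    bucket.retention_period = 7 * 365 * 24 * 60 * 60
    bucket.patch()

    # Locking is irreversible: the retention period can no longer be reduced or removed.
    bucket.lock_retention_policy()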

You are preparing an organization-wide dataset. You need to preprocess customer data stored in a restricted bucket in Cloud Storage. The data will be used to create consumer analyses. You need to comply with data privacy requirements. What should you do?

A. Use Dataflow and the Cloud Data Loss Prevention API to mask sensitive data. Write the processed data in BigQuery.

You migrated your on-premises Apache Hadoop Distributed File System (HDFS) data lake to Cloud Storage. The data scientist team needs to process the data by using Apache Spark and SQL. Security policies need to be enforced at the column level. You need a cost-effective solution that can scale into a data mesh. What should you do? NOT CERTAIN AT ALL — THIS ANSWER NEEDS REVIEW

B. 1. Define a BigLake table. 2. Create a taxonomy of policy tags in Data Catalog. 3. Add policy tags to columns. 4. Process with the Spark-BigQuery connector or BigQuery SQL.

You are running a Dataflow streaming pipeline, with Streaming Engine and Horizontal Autoscaling enabled. You have set the maximum number of workers to 1000. The input of your pipeline is Pub/Sub messages with notifications from Cloud Storage. One of the pipeline transforms reads CSV files and emits an element for every CSV line. The job performance is low, the pipeline is using only 10 workers, and you notice that the autoscaler is not spinning up additional workers. What should you do to improve performance?

B. Change the pipeline code, and introduce a Reshuffle step to prevent fusion. - Fusion optimization in Dataflow can lead to steps being "fused" together, which can sometimes hinder parallelization. - Introducing a Reshuffle step can prevent fusion and force the distribution of work across more workers. - This can be an effective way to improve parallelism and potentially trigger the autoscaler to increase the number of workers.
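A hedged sketch of the fix, with a hypothetical subscription name and parsing logic; the Reshuffle sits right after the step that fans one file out into many lines:

    import apache_beam as beam
    from apache_beam.io.filesystems import FileSystems
    from apache_beam.options.pipeline_options import PipelineOptions


    def expand_csv(file_name):
        # Hypothetical fan-out: one notification element becomes one element per CSV line.
        with FileSystems.open(file_name) as f:
            for line in f:
                yield line.decode("utf-8")


    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (p
         | "ReadNotifications" >> beam.io.ReadFromPubSub(
               subscription="projects/my-project/subscriptions/gcs-notifications")
         | "ToFileName" >> beam.Map(lambda msg: msg.decode("utf-8"))
         | "ExpandCsv" >> beam.FlatMap(expand_csv)
         | "BreakFusion" >> beam.Reshuffle()   # prevents fusion so lines spread across workers
         | "ProcessLine" >> beam.Map(lambda line: line.split(",")))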

You have one BigQuery dataset which includes customers' street addresses. You want to retrieve all occurrences of street addresses from the dataset. What should you do?

B. Create a deep inspection job on each table in your dataset with Cloud Data Loss Prevention and create an inspection template that includes the STREET_ADDRESS infoType. - Cloud Data Loss Prevention (Cloud DLP) provides powerful inspection capabilities for sensitive data, including predefined detectors for infoTypes such as STREET_ADDRESS. - By creating a deep inspection job for each table with the STREET_ADDRESS infoType, you can accurately identify and retrieve rows that contain street addresses.

You currently have transactional data stored on-premises in a PostgreSQL database. To modernize your data environment, you want to run transactional workloads and support analytics needs with a single database. You need to move to Google Cloud without changing database management systems, and minimize cost and complexity. What should you do?

B. Migrate your workloads to AlloyDB for PostgreSQL.

You have a network of 1000 sensors. The sensors generate time series data: one metric per sensor per second, along with a timestamp. You already have 1 TB of data, and expect the data to grow by 1 GB every day. You need to access this data in two ways. The first access pattern requires retrieving the metric from one specific sensor stored at a specific timestamp, with a median single-digit millisecond latency. The second access pattern requires running complex analytic queries on the data, including joins, once a day. How should you store this data?

B. Store your data in Bigtable. Concatenate the sensor ID and timestamp and use it as the row key. Perform an export to BigQuery every day. info : Based on your requirements, Option B seems most suitable. Bigtable's design caters to the low-latency access of time-series data (your first requirement), and the daily export to BigQuery enables complex analytics (your second requirement). The use of sensor ID and timestamp as the row key in Bigtable would facilitate efficient access to specific sensor data at specific times.
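A sketch of the row-key design with the Bigtable Python client; project, instance, table, and column-family names are hypothetical:

    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("sensor-instance").table("sensor_metrics")

    # Row key = sensor ID + '#' + timestamp, so a point lookup addresses exactly one row.
    row_key = b"sensor-0042#2024-05-01T12:00:00Z"
    row = table.direct_row(row_key)
    row.set_cell("metrics", "value", "23.7")
    row.commit()

    # Single-row read for the first access pattern (low-millisecond latency).
    result = table.read_row(row_key)
    print(result.cells["metrics"][b"value"][0].value)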

You need to modernize your existing on-premises data strategy. Your organization currently uses: • Apache Hadoop clusters for processing multiple large data sets, including on-premises Hadoop Distributed File System (HDFS) for data replication. • Apache Airflow to orchestrate hundreds of ETL pipelines with thousands of job steps. You need to set up a new architecture in Google Cloud that can handle your Hadoop workloads and requires minimal changes to your existing orchestration processes. What should you do?

B. Use Dataproc to migrate Hadoop clusters to Google Cloud, and Cloud Storage to handle any HDFS use cases. Orchestrate your pipelines with Cloud Composer. Info: Cloud Composer is managed Apache Airflow, so the existing DAGs carry over with minimal changes.

You work for a large ecommerce company. You store your customer's order data in Bigtable. You have a garbage collection policy set to delete the data after 30 days and the number of versions is set to 1. When the data analysts run a query to report total customer spending, the analysts sometimes see customer data that is older than 30 days. You need to ensure that the analysts do not see customer data older than 30 days while minimizing cost and overhead. What should you do?

B. Use a timestamp range filter in the query to fetch the customer's data for a specific range.

You are monitoring your organization's data lake hosted on BigQuery. The ingestion pipelines read data from Pub/Sub and write the data into tables on BigQuery. After a new version of the ingestion pipelines is deployed, the daily stored data increased by 50%. The volumes of data in Pub/Sub remained the same and only some tables had their daily partition data size doubled. You need to investigate and fix the cause of the data increase. What should you do? TO VERIFY!!!

C. 1. Check for duplicate rows in the BigQuery tables that have the daily partition data size doubled. 2. Check the BigQuery Audit logs to find job IDs. 3. Use Cloud Monitoring to determine when the identified Dataflow jobs started and the pipeline code version. 4. When more than one pipeline ingests data into a table, stop all versions except the latest one. Info: a detailed investigation of logs and jobs. - Checking for duplicate rows targets the potential immediate cause of the issue. - Checking the BigQuery Audit logs helps identify which jobs might be contributing to the increased data volume. - Using Cloud Monitoring to correlate job starts with pipeline versions helps identify whether a specific version of the pipeline is responsible. - Managing multiple versions of pipelines ensures that only the intended version is active, addressing any versioning errors that might have occurred during deployment.

You work for a farming company. You have one BigQuery table named sensors, which is about 500 MB and contains the list of your 5000 sensors, with columns for id, name, and location. This table is updated every hour. Each sensor generates one metric every 30 seconds along with a timestamp, which you want to store in BigQuery. You want to run an analytical query on the data once a week for monitoring purposes. You also want to minimize costs. What data model should you use?

C. 1. Create a metrics table partitioned by timestamp. 2. Create a sensorId column in the metrics table, that points to the id column in the sensors table. 3. Use an INSERT statement every 30 seconds to append new metrics to the metrics table. 4. Join the two tables, if needed, when running the analytical query.
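A hedged sketch of the model and the weekly query; project, dataset, and column names are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Metrics partitioned by day on the timestamp, so the weekly query scans ~7 partitions.
    client.query("""
    CREATE TABLE IF NOT EXISTS `my_project.farm.metrics` (
      sensorId INT64,
      ts       TIMESTAMP,
      value    FLOAT64
    )
    PARTITION BY DATE(ts)
    """).result()

    # Weekly monitoring query joining the metrics to the small sensors table.
    rows = client.query("""
    SELECT s.name, AVG(m.value) AS avg_value
    FROM `my_project.farm.metrics` AS m
    JOIN `my_project.farm.sensors` AS s ON s.id = m.sensorId
    WHERE m.ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    GROUP BY s.name
    """).result()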

You are administering a BigQuery dataset that uses a customer-managed encryption key (CMEK). You need to share the dataset with a partner organization that does not have access to your CMEK. What should you do?

C. Copy the tables you need to share to a dataset without CMEKs. Create an Analytics Hub listing for this dataset. Info: - Create a copy of the necessary tables in a new dataset that doesn't use CMEK, ensuring the data is accessible without requiring the partner to have access to the encryption key. - Analytics Hub can then be used to share this data securely and efficiently with the partner organization, maintaining control and governance over the shared data. - Preserves key confidentiality: avoids sharing your CMEK with the partner, upholding key security and control.

You are administering shared BigQuery datasets that contain views used by multiple teams in your organization. The marketing team is concerned about the variability of their monthly BigQuery analytics spend using the on-demand billing model. You need to help the marketing team establish a consistent BigQuery analytics spend each month. What should you do?

C. Create a BigQuery reservation with a baseline of 500 slots with no autoscaling for the marketing team, and bill them back accordingly.

You are developing a model to identify the factors that lead to sales conversions for your customers. You have completed processing your data. You want to continue through the model development lifecycle. What should you do next?

C. Delineate what data will be used for testing and what will be used for training the model. info : - Before you can train a model, you need to decide how to split your dataset.

You created a new version of a Dataflow streaming data ingestion pipeline that reads from Pub/Sub and writes to BigQuery. The previous version of the pipeline that runs in production uses a 5-minute window for processing. You need to deploy the new version of the pipeline without losing any data, creating inconsistencies, or increasing the processing latency by more than 10 minutes. What should you do?

C. Drain the old pipeline, then start the new pipeline.

You want to store your team's shared tables in a single dataset to make data easily accessible to various analysts. You want to make this data readable but unmodifiable by analysts. At the same time, you want to provide the analysts with individual workspaces in the same project, where they can create and store tables for their own use, without the tables being accessible by other analysts. What should you do?

C. Give analysts the BigQuery Data Viewer role on the shared dataset. Create a dataset for each analyst, and give each analyst the BigQuery Data Editor role at the dataset level for their assigned dataset. info : - Data Viewer on Shared Dataset: Grants read-only access to the shared dataset. - Data Editor on Individual Datasets: Giving each analyst Data Editor role on their respective dataset creates private workspaces where they can create and store personal tables without exposing them to other analysts.

You need to connect multiple applications with dynamic public IP addresses to a Cloud SQL instance. You configured users with strong passwords and enforced the SSL connection to your Cloud SQL instance. You want to use Cloud SQL public IP and ensure that you have secured connections. What should you do?

C. Leave the Authorized Network empty. Use Cloud SQL Auth proxy on all applications.

You are designing a Dataflow pipeline for a batch processing job. You want to mitigate multiple zonal failures at job submission time. What should you do?

C. Specify a worker region by using the --region flag. info : - Specifying a worker region (instead of a specific zone) allows Google Cloud's Dataflow service to manage the distribution of resources across multiple zones within that region

You have created an external table for Apache Hive partitioned data that resides in a Cloud Storage bucket, which contains a large number of files. You notice that queries against this table are slow. You want to improve the performance of these queries. What should you do?

C. Upgrade the external table to a BigLake table. Enable metadata caching for the table. info : - BigLake Table: BigLake allows for more efficient querying of data lakes stored in Cloud Storage. It can handle large datasets more effectively than standard external tables. - Metadata Caching: Enabling metadata caching can significantly improve query performance by reducing the time taken to read and process metadata from a large number of files.

One of your encryption keys stored in Cloud Key Management Service (Cloud KMS) was exposed. You need to re-encrypt all of your CMEK-protected Cloud Storage data that used that key, and then delete the compromised key. You also want to reduce the risk of objects getting written without customer-managed encryption key (CMEK) protection in the future. What should you do?

D. Create a new Cloud KMS key. Create a new Cloud Storage bucket configured to use the new key as the default CMEK key. Copy all objects from the old bucket to the new bucket without specifying a key.

You are deploying a batch pipeline in Dataflow. This pipeline reads data from Cloud Storage, transforms the data, and then writes the data into BigQuery. The security team has enabled an organizational constraint in Google Cloud, requiring all Compute Engine instances to use only internal IP addresses and no external IP addresses. What should you do?

D. Ensure that Private Google Access is enabled in the subnetwork. Use Dataflow with only internal IP addresses. VM instances that only have internal IP addresses (no external IP addresses) can use Private Google Access. They can reach the external IP addresses of Google APIs and services. - Private Google Access for services allows VM instances with only internal IP addresses in a VPC network or on-premises networks (via Cloud VPN or Cloud Interconnect) to reach Google APIs and services. - When you launch a Dataflow job, you can specify that it should use worker instances without external IP addresses if Private Google Access is enabled on the subnetwork where these instances are launched. - This way, your Dataflow workers will be able to access Cloud Storage and BigQuery without violating the organizational constraint of no external IPs.

You are designing a messaging system by using Pub/Sub to process clickstream data with an event-driven consumer app that relies on a push subscription. You need to configure the messaging system that is reliable enough to handle temporary downtime of the consumer app. You also need the messaging system to store the input messages that cannot be consumed by the subscriber. The system needs to retry failed messages gradually, avoiding overloading the consumer app, and store the failed messages after a maximum of 10 retries in a topic. How should you configure the Pub/Sub subscription?

D. Use exponential backoff as the subscription retry policy, and configure dead lettering to a different topic with maximum delivery attempts set to 10.
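A sketch of the subscription configuration with the Pub/Sub Python client; project, topic, subscription, and endpoint names are hypothetical:

    from google.cloud import pubsub_v1
    from google.protobuf import duration_pb2

    subscriber = pubsub_v1.SubscriberClient()

    subscriber.create_subscription(request={
        "name": subscriber.subscription_path("my-project", "clickstream-push-sub"),
        "topic": "projects/my-project/topics/clickstream",
        "push_config": {"push_endpoint": "https://consumer.example.com/push"},
        # Exponential backoff between redelivery attempts avoids overloading the consumer.
        "retry_policy": {
            "minimum_backoff": duration_pb2.Duration(seconds=10),
            "maximum_backoff": duration_pb2.Duration(seconds=600),
        },
        # After 10 failed delivery attempts, messages are forwarded to the dead-letter topic.
        "dead_letter_policy": {
            "dead_letter_topic": "projects/my-project/topics/clickstream-dead-letter",
            "max_delivery_attempts": 10,
        },
    })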

You store and analyze your relational data in BigQuery on Google Cloud with all data that resides in US regions. You also have a variety of object stores across Microsoft Azure and Amazon Web Services (AWS), also in US regions. You want to query all your data in BigQuery daily with as little movement of data as possible. What should you do?

D. Use the BigQuery Omni functionality and BigLake tables to query files in Azure and AWS. Info: - BigQuery Omni lets you analyze data stored across Google Cloud, AWS, and Azure directly from BigQuery without having to move or copy the data. - It extends BigQuery's data analysis capabilities to other clouds, enabling cross-cloud analytics.

You designed a data warehouse in BigQuery to analyze sales data. You want a self-serving, low-maintenance, and cost-effective solution to share the sales dataset to other business units in your organization. What should you do?

A. Create an Analytics Hub private exchange, and publish the sales dataset.

You have data located in BigQuery that is used to generate reports for your company. You have noticed that some weekly executive report fields do not conform to company format standards. For example, report errors include different telephone formats and different country code identifiers. This is a frequent issue, so you need to create a recurring job to normalize the data. You want a quick solution that requires no coding. What should you do?

A. Use Cloud Data Fusion and Wrangler to normalize the data, and set up a recurring job.

You are designing a fault-tolerant architecture to store data in a regional BigQuery dataset. You need to ensure that your application is able to recover from a corruption event in your tables that occurred within the past seven days. You want to adopt managed services with the lowest RPO and most cost-effective solution. What should you do?

A. Access historical data by using time travel in BigQuery. info : - Lowest RPO: Time travel offers point-in-time recovery for the past seven days by default, providing the shortest possible recovery point objective (RPO) among the given options. You can recover data to any state within that window. - No Additional Costs: Time travel is a built-in feature of BigQuery, incurring no extra storage or operational costs. - Managed Service: BigQuery handles time travel automatically, eliminating manual backup and restore processes.
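A minimal example of recovering a pre-corruption state with time travel; the table names and one-hour offset are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Recreate the table as it looked one hour before the corruption event.
    client.query("""
    CREATE OR REPLACE TABLE `my_project.reporting.sales_restored` AS
    SELECT *
    FROM `my_project.reporting.sales`
      FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
    """).result()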

You are designing the architecture to process your data from Cloud Storage to BigQuery by using Dataflow. The network team provided you with the Shared VPC network and subnetwork to be used by your pipelines. You need to enable the deployment of the pipeline on the Shared VPC network. What should you do?

A. Assign the compute.networkUser role to the Dataflow service agent.

You currently use a SQL-based tool to visualize your data stored in BigQuery. The data visualizations require the use of outer joins and analytic functions. Visualizations must be based on data that is no less than 4 hours old. Business users are complaining that the visualizations are too slow to generate. You want to improve the performance of the visualization queries while minimizing the maintenance overhead of the data preparation pipeline. What should you do?

A. Create materialized views with the allow_non_incremental_definition option set to true for the visualization queries. Specify the max_staleness parameter to 4 hours and the enable_refresh parameter to true. Reference the materialized views in the data visualization tool. Info: - Materialized views in BigQuery precompute and store the result of a base query, which can speed up data retrieval for complex queries used in visualizations. - The max_staleness parameter lets you specify how old the data can be, ensuring that the visualizations are based on data no less than 4 hours old. - The enable_refresh parameter ensures that the materialized view is periodically refreshed. - The allow_non_incremental_definition option enables the creation of non-incrementally refreshable materialized views; it must be accompanied by the max_staleness option. To ensure a periodic refresh of the materialized view, you should also configure a refresh policy; without a refresh policy, you must refresh the materialized view manually.
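A hedged DDL sketch issued through the BigQuery Python client; the dataset, tables, and join columns are hypothetical, and the outer join is what forces the non-incremental definition:

    from google.cloud import bigquery

    client = bigquery.Client()

    client.query("""
    CREATE MATERIALIZED VIEW `my_project.reporting.sales_viz_mv`
    OPTIONS (
      allow_non_incremental_definition = true,
      max_staleness = INTERVAL "4:0:0" HOUR TO SECOND,
      enable_refresh = true,
      refresh_interval_minutes = 60
    ) AS
    SELECT o.region, c.segment, SUM(o.amount) AS total_amount
    FROM `my_project.sales.orders` AS o
    LEFT OUTER JOIN `my_project.sales.customers` AS c USING (customer_id)
    GROUP BY o.region, c.segment
    """).result()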

Your team is building a data lake platform on Google Cloud. As a part of the data foundation design, you are planning to store all the raw data in Cloud Storage. You are expecting to ingest approximately 25 GB of data a day and your billing department is worried about the increasing cost of storing old data. The current business requirements are: • The old data can be deleted anytime. • There is no predefined access pattern of the old data. • The old data should be available instantly when accessed. • There should not be any charges for data retrieval. What should you do to optimize for cost?

A. Create the bucket with the Autoclass storage class feature. - Autoclass automatically moves objects between storage classes without impacting performance or availability, nor incurring retrieval costs. - It continuously optimizes storage costs based on access patterns without the need to set specific lifecycle management policies.

You are architecting a data transformation solution for BigQuery. Your developers are proficient with SQL and want to use the ELT development technique. In addition, your developers need an intuitive coding environment and the ability to manage SQL as code. You need to identify a solution for your developers to build these pipelines. What should you do?

A. Use Dataform to build, manage, and schedule SQL pipelines.

You want to encrypt the customer data stored in BigQuery. You need to implement per-user crypto-deletion on data stored in your tables. You want to adopt native features in Google Cloud to avoid custom solutions. What should you do?

A. Implement Authenticated Encryption with Associated Data (AEAD) BigQuery functions while storing your data in BigQuery. - AEAD cryptographic functions in BigQuery allow for encryption and decryption of data at the column level. - You can encrypt specific data fields using a unique key per user and manage these keys outside of BigQuery (for example, in your application or using a key management system). - By "deleting" or revoking access to the key for a specific user, you effectively make their data unreadable, achieving crypto-deletion. - This method provides fine-grained encryption control but requires careful key management and integration with your applications.
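A hedged sketch of per-user crypto-deletion with AEAD functions; the keys table, dataset, and column names are hypothetical. Deleting a customer's keyset row makes that customer's ciphertext permanently unreadable:

    from google.cloud import bigquery

    client = bigquery.Client()

    # One keyset per customer, stored in a tightly controlled keys table.
    client.query("""
    CREATE TABLE IF NOT EXISTS `my_project.secure.customer_keys` AS
    SELECT 1001 AS customer_id, KEYS.NEW_KEYSET('AEAD_AES_GCM_256') AS keyset
    """).result()

    # Encrypt the email column with each customer's own keyset.
    client.query("""
    SELECT
      d.customer_id,
      AEAD.ENCRYPT(k.keyset, d.email, CAST(d.customer_id AS STRING)) AS email_encrypted
    FROM `my_project.secure.customer_data` AS d
    JOIN `my_project.secure.customer_keys` AS k USING (customer_id)
    """).result()

    # Crypto-deletion for customer 1001: drop the keyset row.
    client.query("""
    DELETE FROM `my_project.secure.customer_keys` WHERE customer_id = 1001
    """).result()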

You maintain ETL pipelines. You notice that a streaming pipeline running on Dataflow is taking a long time to process incoming data, which causes output delays. You also noticed that the pipeline graph was automatically optimized by Dataflow and merged into one step. You want to identify where the potential bottleneck is occurring. What should you do?

A. Insert a Reshuffle operation after each processing step, and monitor the execution details in the Dataflow console. - The Reshuffle operation is used in Dataflow pipelines to break fusion and redistribute elements, which can sometimes help improve parallelization and identify bottlenecks. - By inserting Reshuffle after each processing step and observing the pipeline's performance in the Dataflow console, you can potentially identify stages that are disproportionately slow or stalled. - This can help in pinpointing the step where the bottleneck might be occurring.

You are on the data governance team and are implementing security requirements to deploy resources. You need to ensure that resources are limited to only the europe-west3 region. You want to follow Google-recommended practices. What should you do?

A. Set the constraints/gcp.resourceLocations organization policy constraint to in:europe-west3-locations. Info: - The constraints/gcp.resourceLocations organization policy constraint is used to define where resources in the organization can be created. - Setting it to in:europe-west3-locations would specify that resources can only be created in the europe-west3 region.

Your organization uses a multi-cloud data storage strategy, storing data in Cloud Storage, and data in Amazon Web Services' (AWS) S3 storage buckets. All data resides in US regions. You want to query up-to-date data by using BigQuery, regardless of which cloud the data is stored in. You need to allow users to query the tables from BigQuery without giving direct access to the data in the storage buckets. What should you do?

A. Set up a BigQuery Omni connection to the AWS S3 bucket data. Create BigLake tables over the Cloud Storage and S3 data, and query the data using BigQuery directly.

Your infrastructure team has set up an interconnect link between Google Cloud and the on-premises network. You are designing a high-throughput streaming pipeline to ingest streaming data from an Apache Kafka cluster hosted on-premises. You want to store the data in BigQuery, with as minimal latency as possible. What should you do?

A. Set up a Kafka Connect bridge between Kafka and Pub/Sub. Use a Google-provided Dataflow template to read the data from Pub/Sub, and write the data to BigQuery.

You want to migrate your existing Teradata data warehouse to BigQuery. You want to move the historical data to BigQuery by using the most efficient method that requires the least amount of programming, but local storage space on your existing data warehouse is limited. What should you do?

A. Use BigQuery Data Transfer Service by using the Java Database Connectivity (JDBC) driver with FastExport connection. - Reduced Local Storage: By using FastExport, data is directly streamed from Teradata to BigQuery without the need for local storage, addressing your storage limitations. - Minimal Programming: BigQuery Data Transfer Service offers a user-friendly interface, eliminating the need for extensive scripting or coding.

You are using a Dataflow streaming job to read messages from a message bus that does not support exactly-once delivery. Your job then applies some transformations, and loads the result into BigQuery. You want to ensure that your data is being streamed into BigQuery with exactly-once delivery semantics. You expect your ingestion throughput into BigQuery to be about 1.5 GB per second. What should you do?

A. Use the BigQuery Storage Write API and ensure that your target BigQuery table is regional. info : - BigQuery Storage Write API: This API is designed for high-throughput, low-latency writing of data into BigQuery. It also provides tools to prevent data duplication, which is essential for exactly-once delivery semantics. - Regional Table: Choosing a regional location for the BigQuery table could potentially provide better performance and lower latency, as it would be closer to the Dataflow job if they are in the same region.

You are running a streaming pipeline with Dataflow and are using hopping windows to group the data as the data arrives. You noticed that some data is arriving late but is not being marked as late data, which is resulting in inaccurate aggregations downstream. You need to find a solution that allows you to capture the late data in the appropriate window. What should you do?

A. Use watermarks to define the expected data arrival window. Allow late data as it arrives. info : - Watermarks: Watermarks in a streaming pipeline are used to specify the point in time when Dataflow expects all data up to that point to have arrived. - Allow Late Data: configure the pipeline to accept and correctly process data that arrives after the watermark, ensuring it's captured in the appropriate window.
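A hedged Beam sketch (subscription name, window sizes, and lateness budget are hypothetical) showing allowed lateness on the hopping (sliding) windows so late elements still land in their windows:

    import apache_beam as beam
    from apache_beam import window
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms.trigger import AccumulationMode, AfterCount, AfterWatermark

    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.io.ReadFromPubSub(
               subscription="projects/my-project/subscriptions/events")
         | "KeyByType" >> beam.Map(lambda msg: ("event", 1))
         | "Window" >> beam.WindowInto(
               window.SlidingWindows(size=30, period=2),   # 30s hopping windows every 2s
               trigger=AfterWatermark(late=AfterCount(1)), # re-fire when late data arrives
               accumulation_mode=AccumulationMode.ACCUMULATING,
               allowed_lateness=120)                       # accept data up to 2 minutes late
         | "Count" >> beam.CombinePerKey(sum))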

Your organization has two Google Cloud projects, project A and project B. In project A, you have a Pub/Sub topic that receives data from confidential sources. Only the resources in project A should be able to access the data in that topic. You want to ensure that project B and any future project cannot access data in the project A topic. What should you do?

B. Configure VPC Service Controls in the organization with a perimeter around project A. Info: - It allows you to create a secure boundary around all resources in project A, including the Pub/Sub topic. - It prevents data exfiltration to other projects and ensures that only resources within the perimeter (project A) can access the sensitive data. - VPC Service Controls are specifically designed for scenarios where you need to secure sensitive data within a specific context or boundary in Google Cloud.

You have a Standard Tier Memorystore for Redis instance deployed in a production environment. You need to simulate a Redis instance failover in the most accurate disaster recovery situation, and ensure that the failover has no impact on production data. What should you do?

B. Create a Standard Tier Memorystore for Redis instance in a development environment. Initiate a manual failover by using the force-data-loss data protection mode. Info: The key points are: • The failover should be tested in a separate development environment, not production, to avoid impacting real data. • The force-data-loss mode simulates a full failover and restart, which is the most accurate test of disaster recovery. • Limited-data-loss mode only fails over reads, which does not fully test write capabilities. • Increasing replicas in production and failing over (C) risks losing real production data. • Failing over production (D) also risks impacting real data and traffic. Option B isolates the test from production and uses the most rigorous failover mode to fully validate disaster recovery capabilities.

You have a table that contains millions of rows of sales data, partitioned by date. Various applications and users query this data many times a minute. The query requires aggregating values by using AVG, MAX, and SUM, and does not require joining to other tables. The required aggregations are only computed over the past year of data, though you need to retain full historical data in the base tables. You want to ensure that the query results always include the latest data from the tables, while also reducing computation cost, maintenance overhead, and duration. What should you do?

B. Create a materialized view to aggregate the base table data. Configure a partition expiration on the base table to retain only the last one year of partitions.

You have a BigQuery table that ingests data directly from a Pub/Sub subscription. The ingested data is encrypted with a Google-managed encryption key. You need to meet a new organization policy that requires you to use keys from a centralized Cloud Key Management Service (Cloud KMS) project to encrypt data at rest. What should you do? TO VERIFY!!!

B. Create a new BigQuery table by using customer-managed encryption keys (CMEK), and migrate the data from the old BigQuery table.

You stream order data by using a Dataflow pipeline, and write the aggregated result to Memorystore. You provisioned a Memorystore for Redis instance with Basic Tier, 4 GB capacity, which is used by 40 clients for read-only access. You are expecting the number of read-only clients to increase significantly to a few hundred and you need to be able to support the demand. You want to ensure that read and write access availability is not impacted, and any changes you make can be deployed quickly. What should you do?

B. Create a new Memorystore for Redis instance with Standard Tier. Set capacity to 5 GB and create multiple read replicas. Delete the old instance. Info: - Upgrading to the Standard Tier and adding read replicas is an effective way to scale and manage increased read load. - The additional capacity (5 GB) provides more space for data, and read replicas help distribute the read load across multiple instances.

You are on the data governance team and are implementing security requirements. You need to encrypt all your data in BigQuery by using an encryption key managed by your team. You must implement a mechanism to generate and store encryption material only on your on-premises hardware security module (HSM). You want to rely on Google managed solutions. What should you do? TO VERIFY!!!

B. Create the encryption key in the on-premises HSM and link it to a Cloud External Key Manager (Cloud EKM) key. Associate the created Cloud KMS key while creating the BigQuery resources.

You have two projects where you run BigQuery jobs: • One project runs production jobs that have strict completion time SLAs. These are high priority jobs that must have the required compute resources available when needed. These jobs generally never go below a 300 slot utilization, but occasionally spike up an additional 500 slots. • The other project is for users to run ad-hoc analytical queries. This project generally never uses more than 200 slots at a time. You want these ad-hoc queries to be billed based on how much data users scan rather than by slot capacity. You need to ensure that both projects have the appropriate compute resources available. What should you do?

B. Create two reservations, one for each of the projects. For the SLA project, use an Enterprise Edition with a baseline of 300 slots and enable autoscaling up to 500 slots. For the ad-hoc project, configure on-demand billing. - The SLA project gets a dedicated reservation with autoscaling to handle spikes, ensuring it meets its strict completion time SLAs. - The ad-hoc project uses on-demand billing, which means it will be billed based on the amount of data scanned rather than slot capacity, fitting the billing preference for ad-hoc queries.

You have 100 GB of data stored in a BigQuery table. This data is outdated and will only be accessed one or two times a year for analytics with SQL. For backup purposes, you want to store this data to be immutable for 3 years. You want to minimize storage costs. What should you do?

D. 1. Perform a BigQuery export to a Cloud Storage bucket with archive storage class. 2. Set a locked retention policy on the bucket. 3. Create a BigQuery external table on the exported files. Info: To keep the data immutable for 3 years, use a bucket lock with a retention policy.

You are troubleshooting your Dataflow pipeline that processes data from Cloud Storage to BigQuery. You have discovered that the Dataflow worker nodes cannot communicate with one another. Your networking team relies on Google Cloud network tags to define firewall rules. You need to identify the issue while following Google-recommended networking security practices. What should you do?

B. Determine whether there is a firewall rule set to allow traffic on TCP ports 12345 and 12346 for the Dataflow network tag. Info: When you create firewall rules for Dataflow, specify the Dataflow network tags. Otherwise, the firewall rules apply to all VMs in the VPC network. This option focuses directly on ensuring that the firewall rules are set up correctly for the network tags used by Dataflow worker nodes. It specifically addresses the potential issue of worker nodes not being able to communicate due to restrictive firewall rules blocking the necessary ports.

You have a BigQuery table that contains customer data, including sensitive information such as names and addresses. You need to share the customer data with your data analytics and consumer support teams securely. The data analytics team needs to access the data of all the customers, but must not be able to access the sensitive data. The consumer support team needs access to all data columns, but must not be able to access customers that no longer have active contracts. You enforced these requirements by using an authorized dataset and policy tags. After implementing these steps, the data analytics team reports that they still have access to the sensitive columns. You need to ensure that the data analytics team does not have access to restricted data. What should you do? (Choose two.) TO VERIFY!!!

B. Ensure that the data analytics team members do not have the Data Catalog Fine-Grained Reader role for the policy tags. E. Enforce access control in the policy tag taxonomy. Info: B ensures they don't have access to the protected columns; E enforces column-level security through the policy tag taxonomy.

You are designing a real-time system for a ride hailing app that identifies areas with high demand for rides to effectively reroute available drivers to meet the demand. The system ingests data from multiple sources to Pub/Sub, processes the data, and stores the results for visualization and analysis in real-time dashboards. The data sources include driver location updates every 5 seconds and app-based booking events from riders. The data processing involves real-time aggregation of supply and demand data for the last 30 seconds, every 2 seconds, and storing the results in a low-latency system for visualization. What should you do?

B. Group the data by using a hopping window in a Dataflow pipeline, and write the aggregated data to Memorystore. info : - Hopping Window: Hopping windows are fixed-sized, overlapping intervals. - Aggregate data over the last 30 seconds, every 2 seconds, as hopping windows allow for overlapping data analysis. - Memorystore: Ideal for low-latency access required for real-time visualization and analysis.

A shipping company has live package-tracking data that is sent to an Apache Kafka stream in real time. This is then loaded into BigQuery. Analysts in your company want to query the tracking data in BigQuery to analyze geospatial trends in the lifecycle of a package. The table was originally created with ingest-date partitioning. Over time, the query processing time has increased. You need to copy all the data to a new clustered table. What should you do?

B. Implement clustering in BigQuery on the package-tracking ID column. Info: Implementing clustering on the package-tracking ID column seems the most appropriate. It directly addresses the query slowdown issue by reorganizing the data in a way that aligns with the analysts' query patterns, leading to more efficient and faster query execution.

You work for a large ecommerce company. You are using Pub/Sub to ingest the clickstream data to Google Cloud for analytics. You observe that when a new subscriber connects to an existing topic to analyze data, they are unable to subscribe to older data. For an upcoming yearly sale event in two months, you need a solution that, once implemented, will enable any new subscriber to read the last 30 days of data. What should you do?

B. Set the topic retention policy to 30 days.

The data analyst team at your company uses BigQuery for ad-hoc queries and scheduled SQL pipelines in a Google Cloud project with a slot reservation of 2000 slots. However, with the recent introduction of hundreds of new non time-sensitive SQL pipelines, the team is encountering frequent quota errors. You examine the logs and notice that approximately 1500 queries are being triggered concurrently during peak time. You need to resolve the concurrency issue. What should you do?

B. Update SQL pipelines to run as a batch query, and run ad-hoc queries as interactive query jobs. - BigQuery allows you to specify job priority as either BATCH or INTERACTIVE. - Batch queries are queued and then started when idle resources are available, making them suitable for non-time-sensitive workloads. - Running ad-hoc queries as interactive ensures they have prompt access to resources.
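A small sketch of submitting a pipeline query with BATCH priority; the query and table name are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Batch queries queue and start when slots become idle, so the hundreds of
    # non time-sensitive pipelines stop competing with interactive ad-hoc queries.
    job_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.BATCH)
    job = client.query(
        "SELECT COUNT(*) FROM `my_project.analytics.events`",
        job_config=job_config,
    )
    job.result()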

You have several different file type data sources, such as Apache Parquet and CSV. You want to store the data in Cloud Storage. You need to set up an object sink for your data that allows you to use your own encryption keys. You want to use a GUI-based solution. What should you do?

B. Use Cloud Data Fusion to move files into Cloud Storage. - Cloud Data Fusion is a fully managed, code-free, GUI-based data integration service that allows you to visually connect, transform, and move data between various sources and sinks. - It supports various file formats and can write to Cloud Storage. - You can configure it to use Customer-Managed Encryption Keys (CMEK) for the buckets where it writes data.

Your organization's data assets are stored in BigQuery, Pub/Sub, and a PostgreSQL instance running on Compute Engine. Because there are multiple domains and diverse teams using the data, teams in your organization are unable to discover existing data assets. You need to design a solution to improve data discoverability while keeping development and configuration efforts to a minimum. What should you do?

B. Use Data Catalog to automatically catalog BigQuery datasets and Pub/Sub topics. Use Data Catalog APIs to manually catalog PostgreSQL tables. info : - It utilizes Data Catalog's native support for both BigQuery datasets and Pub/Sub topics. - For PostgreSQL tables running on a Compute Engine instance, you'd use Data Catalog APIs to create custom entries, as Data Catalog does not automatically discover external databases like PostgreSQL.

You are migrating your on-premises data warehouse to BigQuery. One of the upstream data sources resides on a MySQL database that runs in your on-premises data center with no public IP addresses. You want to ensure that the data ingestion into BigQuery is done securely and does not go through the public internet. What should you do?

B. Use Datastream to replicate data from your on-premises MySQL database to BigQuery. Set up Cloud Interconnect between your on-premises data center and Google Cloud. Use Private connectivity as the connectivity method and allocate an IP address range within your VPC network to the Datastream connectivity configuration. Use Server-only as the encryption type when setting up the connection profile in Datastream. Info: - Datastream is a serverless change data capture and replication service, which can be used to replicate data changes from MySQL to BigQuery. - Using Cloud Interconnect provides a private, secure connection between your on-premises environment and Google Cloud. This method ensures that data doesn't go through the public internet and is a recommended approach for secure, large-scale data migrations. - Setting up private connectivity with Datastream allows for secure and direct data transfer.

You have a streaming pipeline that ingests data from Pub/Sub in production. You need to update this streaming pipeline with improved business logic. You need to ensure that the updated pipeline reprocesses the previous two days of delivered Pub/Sub messages. What should you do? (Choose two.)

B. Use Pub/Sub Snapshot capture two days before the deployment. E. Use Pub/Sub Seek with a timestamp. Info: - Pub/Sub snapshots allow you to capture the state of a subscription's unacknowledged messages at a particular point in time. - By creating a snapshot two days before deploying the updated pipeline, you can later use this snapshot to replay the messages from that point. - Pub/Sub Seek lets you alter the acknowledgment state of messages in bulk, so you can rewind a subscription to a point in time or to a snapshot. - Using Seek with a timestamp corresponding to two days ago allows the updated pipeline to reprocess messages from that time.

Your car factory is pushing machine measurements as messages into a Pub/Sub topic in your Google Cloud project. A Dataflow streaming job, that you wrote with the Apache Beam SDK, reads these messages, sends acknowledgment to Pub/Sub, applies some custom business logic in a DoFn instance, and writes the result to BigQuery. You want to ensure that if your business logic fails on a message, the message will be sent to a Pub/Sub topic that you want to monitor for alerting purposes. What should you do?

B. Use an exception handling block in your Dataflow's DoFn code to push the messages that failed to be transformed through a side output and to a new Pub/Sub topic. Use Cloud Monitoring to monitor the topic/num_unacked_messages_by_region metric on this new topic. info : - Exception Handling in DoFn: Implementing an exception handling block within DoFn in Dataflow to catch failures during processing is a direct way to manage errors. - Side Output to New Topic: Using a side output to redirect failed messages to a new Pub/Sub topic is an effective way to isolate and manage these messages. - Monitoring: Monitoring the num_unacked_messages_by_region on the new topic can alert you to the presence of failed messages.
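A hedged sketch of the DoFn with a dead-letter side output; the topic, subscription, table, and parsing logic are hypothetical, and the BigQuery table is assumed to exist:

    import json

    import apache_beam as beam
    from apache_beam import pvalue
    from apache_beam.options.pipeline_options import PipelineOptions


    class ApplyBusinessLogic(beam.DoFn):
        def process(self, message):
            try:
                # Hypothetical business logic: parse the measurement payload.
                yield json.loads(message.decode("utf-8"))
            except Exception:
                # Failed elements go to a tagged side output instead of failing the bundle.
                yield pvalue.TaggedOutput("dead_letter", message)


    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        results = (p
                   | "Read" >> beam.io.ReadFromPubSub(
                         subscription="projects/my-project/subscriptions/measurements")
                   | "Transform" >> beam.ParDo(ApplyBusinessLogic()).with_outputs(
                         "dead_letter", main="ok"))
        results.ok | "WriteBigQuery" >> beam.io.WriteToBigQuery("my_project:telemetry.results")
        results.dead_letter | "WriteFailed" >> beam.io.WriteToPubSub(
            "projects/my-project/topics/failed-measurements")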

You need to create a SQL pipeline. The pipeline runs an aggregate SQL transformation on a BigQuery table every two hours and appends the result to another existing BigQuery table. You need to configure the pipeline to retry if errors occur. You want the pipeline to send an email notification after three consecutive failures. What should you do? TO VERIFY!

B. Use the BigQueryInsertJobOperator in Cloud Composer, set the retry parameter to three, and set the email_on_failure parameter to true. Info: - It provides a direct and controlled way to manage the SQL pipeline using Cloud Composer (Apache Airflow). - The BigQueryInsertJobOperator is well-suited for running SQL jobs in BigQuery, including aggregate transformations and handling of results. - The retry and email_on_failure parameters align with the requirements for error handling and notifications. - Cloud Composer requires more setup than using BigQuery's scheduled queries directly, but it offers robust workflow management, retry logic, and notification capabilities, making it suitable for more complex and controlled data pipeline requirements.
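A hedged Cloud Composer DAG sketch; the schedule, SQL, project, and email address are hypothetical:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

    with DAG(
        dag_id="two_hourly_sales_aggregate",
        schedule_interval="0 */2 * * *",          # every two hours
        start_date=datetime(2024, 1, 1),
        catchup=False,
        default_args={"email": ["data-team@example.com"]},
    ) as dag:
        aggregate = BigQueryInsertJobOperator(
            task_id="aggregate_sales",
            configuration={
                "query": {
                    "query": """
                        INSERT INTO `my_project.reporting.sales_agg`
                        SELECT DATE(order_ts) AS day, SUM(amount) AS total
                        FROM `my_project.sales.orders`
                        GROUP BY day
                    """,
                    "useLegacySql": False,
                },
            },
            retries=3,                            # retry before the task is marked failed
            retry_delay=timedelta(minutes=5),
            email_on_failure=True,                # notify once all retries are exhausted
        )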

You orchestrate ETL pipelines by using Cloud Composer. One of the tasks in the Apache Airflow directed acyclic graph (DAG) relies on a third-party service. You want to be notified when the task does not succeed. What should you do?

C. Assign a function with notification logic to the on_failure_callback parameter for the operator responsible for the task at risk. Info: - The on_failure_callback is a function that gets called when a task fails. - Assigning a function with notification logic to this parameter is a direct way to handle task failures. - When the task fails, this function can trigger a notification, making it an appropriate solution for the need to be alerted on task failures.

You are building an ELT solution in BigQuery by using Dataform. You need to perform uniqueness and null value checks on your final tables. What should you do to efficiently integrate these checks into your pipeline?

C. Build Dataform assertions into your code. Info: - Dataform provides a feature called "assertions," which are essentially SQL-based tests that you can define to verify the quality of your data. - Assertions in Dataform are a built-in way to perform data quality checks, including checking for uniqueness and null values in your tables.

Your company operates in three domains: airlines, hotels, and ride-hailing services. Each domain has two teams: analytics and data science, which create data assets in BigQuery with the help of a central data platform team. However, as each domain is evolving rapidly, the central data platform team is becoming a bottleneck. This is causing delays in deriving insights from data, and resulting in stale data when pipelines are not kept up to date. You need to design a data mesh architecture by using Dataplex to eliminate the bottleneck. What should you do?

C. 1. Create one lake for each domain. Inside each lake, create one zone for each team. 2. Attach each of the BigQuery datasets created by the individual teams as assets to the respective zone. 3. Direct each domain to manage their own lake's data assets. Info: each domain should manage its own lake's data assets.

You are deploying an Apache Airflow directed acyclic graph (DAG) in a Cloud Composer 2 instance. You have incoming files in a Cloud Storage bucket that the DAG processes, one file at a time. The Cloud Composer instance is deployed in a subnetwork with no Internet access. Instead of running the DAG based on a schedule, you want to run the DAG in a reactive way every time a new file is received. What should you do?

C. 1. Enable the Airflow REST API, and set up Cloud Storage notifications to trigger a Cloud Function instance. 2. Create a Private Service Connect (PSC) endpoint. 3. Write a Cloud Function that connects to the Cloud Composer cluster through the PSC endpoint. - Enable Airflow REST API: In Cloud Composer, enable the "Airflow web server" option. - Set Up Cloud Storage Notifications: Create a notification for new files, routing to a Cloud Function. - Create PSC Endpoint: Establish a PSC endpoint for Cloud Composer. - Write Cloud Function: Code the function to use the Airflow REST API (via PSC endpoint) to trigger the DAG.

You are designing the architecture of your application to store data in Cloud Storage. Your application consists of pipelines that read data from a Cloud Storage bucket that contains raw data, and write the data to a second bucket after processing. You want to design an architecture with Cloud Storage resources that are capable of being resilient if a Google Cloud regional failure occurs. You want to minimize the recovery point objective (RPO) if a failure occurs, with no impact on applications that use the stored data. What should you do?

C. Adopt a dual-region Cloud Storage bucket, and enable turbo replication in your architecture. info : Turbo replication provides faster redundancy across regions for data in your dual-region buckets, which reduces the risk of data loss exposure and helps support uninterrupted service following a regional outage. - Dual-region buckets are a specific type of storage that automatically replicates data between two geographically distinct regions. - Turbo replication is an enhanced feature that provides faster replication between the two regions, thus minimizing RPO. - This option ensures that your data is resilient to regional failures and is replicated quickly, meeting the needs for low RPO and no impact on application performance.
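
A hedged sketch with the Cloud Storage Python client; the bucket name and the US-CENTRAL1/US-EAST1 region pair are placeholders, and the data_locations argument requires a recent version of the google-cloud-storage library:

```python
from google.cloud import storage

client = storage.Client()

# Create a dual-region bucket by passing a custom region pair (placeholder names).
bucket = client.create_bucket(
    "my-resilient-bucket",
    location="US",
    data_locations=["US-CENTRAL1", "US-EAST1"],
)

# Enable turbo replication, which targets a 15-minute replication RPO.
bucket.rpo = "ASYNC_TURBO"
bucket.patch()
```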

Different teams in your organization store customer and performance data in BigQuery. Each team needs to keep full control of their collected data, be able to query data within their projects, and be able to exchange their data with other teams. You need to implement an organization-wide solution, while minimizing operational tasks and costs. What should you do?

C. Ask each team to publish their data in Analytics Hub. Direct the other teams to subscribe to them. - Analytics Hub allows organizations to create and manage exchanges where producers can publish their data and consumers can discover and subscribe to data products. - Asking each team to publish their data in Analytics Hub and having other teams subscribe to them is a scalable and controlled way of sharing data. - It minimizes operational tasks because data doesn't need to be duplicated or manually managed after setup, and teams retain full control over their datasets. - Analytics Hub also removes the operational overhead of creating and maintaining authorized views and per-dataset permissions.

You are deploying a MySQL database workload onto Cloud SQL. The database must be able to scale up to support several readers from various geographic regions. The database must be highly available and meet low RTO and RPO requirements, even in the event of a regional outage. You need to ensure that interruptions to the readers are minimal during a database failover. What should you do?

C. Create a highly available Cloud SQL instance in region A. Create a highly available read replica in region B. Scale up read workloads by creating cascading read replicas in multiple regions. Promote the read replica in region B when region A is down. --------------------------------------------------------------------- - Combines high availability with geographic distribution of read workloads. - Promoting a highly available read replica can provide a quick failover solution, potentially meeting low RTO and RPO requirements.

You are migrating a large number of files from a public HTTPS endpoint to Cloud Storage. The files are protected from unauthorized access using signed URLs. You created a TSV file that contains the list of object URLs and started a transfer job by using Storage Transfer Service. You notice that the job has run for a long time and eventually failed. Checking the logs of the transfer job reveals that the job was running fine until one point, and then it failed due to HTTP 403 errors on the remaining files. You verified that there were no changes to the source system. You need to fix the problem to resume the migration process. What should you do?

C. Create a new TSV file for the remaining files by generating signed URLs with a longer validity period. Split the TSV file into multiple smaller files and submit them as separate Storage Transfer Service jobs in parallel.
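
A hedged sketch of splitting the remaining signed URLs into several smaller URL-list files; remaining_signed_urls and the chunk size are placeholders, and the URL-list format shown (a TsvHttpData-1.0 header followed by one URL per line) is the minimal form, with the optional size and MD5 columns omitted:

```python
def write_url_lists(remaining_signed_urls, chunk_size=10_000):
    """Split the remaining signed URLs into several URL-list (TSV) files."""
    for i in range(0, len(remaining_signed_urls), chunk_size):
        chunk = remaining_signed_urls[i : i + chunk_size]
        with open(f"urllist_{i // chunk_size:03d}.tsv", "w") as f:
            f.write("TsvHttpData-1.0\n")   # required first line of a URL list
            for url in chunk:
                f.write(url + "\n")        # one (longer-validity) signed URL per line
```

Each resulting file is then submitted as its own Storage Transfer Service job so the remaining objects migrate in parallel.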

You have terabytes of customer behavioral data streaming from Google Analytics into BigQuery daily. Your customers' information, such as their preferences, is hosted on a Cloud SQL for MySQL database. Your CRM database is hosted on a Cloud SQL for PostgreSQL instance. The marketing team wants to use your customers' information from the two databases and the customer behavioral data to create marketing campaigns for yearly active customers. You need to ensure that the marketing team can run the campaigns over 100 times a day on typical days and up to 300 during sales. At the same time, you want to keep the load on the Cloud SQL databases to a minimum. What should you do?

C. Create streams in Datastream to replicate the required tables from both Cloud SQL databases to BigQuery for these queries.

You have a BigQuery dataset named "customers". All tables will be tagged by using a Data Catalog tag template named "gdpr". The template contains one mandatory field, "has_sensitive_data", with a boolean value. All employees must be able to do a simple search and find tables in the dataset that have either true or false in the "has_sensitive_data" field. However, only the Human Resources (HR) group should be able to see the data inside the tables for which "has_sensitive_data" is true. You give the all employees group the bigquery.metadataViewer and bigquery.connectionUser roles on the dataset. You want to minimize configuration overhead. What should you do next? TO VERIFY!!!

C. Create the "gdpr" tag template with public visibility. Assign the bigquery.dataViewer role to the HR group on the tables that contain sensitive data.
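
A hedged sketch of creating the tag template with the Data Catalog Python client; the project and location are placeholders, and the bigquery.dataViewer grant to the HR group is a separate, ordinary IAM binding on the sensitive tables:

```python
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

template = datacatalog_v1.TagTemplate()
template.display_name = "gdpr"

field = datacatalog_v1.TagTemplateField()
field.display_name = "has_sensitive_data"
field.is_required = True                    # the mandatory boolean field
field.type_.primitive_type = datacatalog_v1.FieldType.PrimitiveType.BOOL
template.fields["has_sensitive_data"] = field

client.create_tag_template(
    parent="projects/my-project/locations/us-central1",  # placeholder project/location
    tag_template_id="gdpr",
    tag_template=template,
)
```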

You have a Cloud SQL for PostgreSQL instance in Region1 with one read replica in Region2 and another read replica in Region3. An unexpected event in Region1 requires that you perform disaster recovery by promoting a read replica in Region2. You need to ensure that your application has the same database capacity available before you switch over the connections. What should you do?

C. Create two new read replicas from the new primary instance, one in Region3 and one in a new region. ================================ After promoting the read replica in Region2 to be the new primary instance, creating additional read replicas from it can help distribute the read load and maintain or increase the database's total capacity.

You are planning to load some of your existing on-premises data into BigQuery on Google Cloud. You want to either stream or batch-load data, depending on your use case. Additionally, you want to mask some sensitive data before loading into BigQuery. You need to do this in a programmatic way while keeping costs to a minimum. What should you do? Still to verify!

C. Create your pipeline with Dataflow through the Apache Beam SDK for Python, customizing separate options within your code for streaming, batch processing, and Cloud DLP. Select BigQuery as your data sink.
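
A hedged sketch of such a pipeline; the project, bucket, table, and field names are placeholders, and the per-element Cloud DLP call is shown only for clarity (in practice you would batch rows or de-identify whole records):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class MaskEmail(beam.DoFn):
    """Replace the email value with its info type by calling Cloud DLP."""

    def setup(self):
        from google.cloud import dlp_v2
        self.dlp = dlp_v2.DlpServiceClient()

    def process(self, row):
        response = self.dlp.deidentify_content(
            request={
                "parent": "projects/my-project/locations/global",  # placeholder project
                "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
                "deidentify_config": {
                    "info_type_transformations": {
                        "transformations": [
                            {"primitive_transformation": {"replace_with_info_type_config": {}}}
                        ]
                    }
                },
                "item": {"value": row["email"]},
            }
        )
        row["email"] = response.item.value
        yield row


options = PipelineOptions(streaming=False)  # flip to True for the streaming use case

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/exports/customers.csv")  # placeholder source
        | "Parse" >> beam.Map(lambda line: {"email": line.split(",")[0]})
        | "MaskEmail" >> beam.ParDo(MaskEmail())
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:staging.customers",  # placeholder destination table
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```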

You migrated a data backend for an application that serves 10 PB of historical product data for analytics. Only the last known state for a product, which is about 10 GB of data, needs to be served through an API to the other applications. You need to choose a cost-effective persistent storage solution that can accommodate the analytics requirements and the API performance of up to 1000 queries per second (QPS) with less than 1 second latency. What should you do? To verify

D. 1. Store the historical data in BigQuery for analytics. 2. In a Cloud SQL table, store the last state of the product after every product change. 3. Serve the last state data directly from Cloud SQL to the API.

You recently deployed several data processing jobs into your Cloud Composer 2 environment. You notice that some tasks are failing in Apache Airflow. On the monitoring dashboard, you see an increase in the total workers memory usage, and there were worker pod evictions. You need to resolve these errors. What should you do? (Choose two.) ABSOLUTELY TO VERIFY!!!

C. Increase the maximum number of workers and reduce worker concurrency. D. Increase the memory available to the Airflow workers. ------------------- C: Reducing worker concurrency means each worker runs fewer tasks at the same time, so every task gets a larger share of the worker's memory; raising the maximum number of workers preserves overall throughput. D: Increasing the memory allocated to the Airflow workers directly addresses the high memory usage and the worker pod evictions, because each worker can then handle more demanding tasks, or a higher volume of tasks, without running out of memory.

You are a BigQuery admin supporting a team of data consumers who run ad hoc queries and downstream reporting in tools such as Looker. All data and users are combined under a single organizational project. You recently noticed some slowness in query results and want to troubleshoot where the slowdowns are occurring. You think that there might be some job queuing or slot contention occurring as users run jobs, which slows down access to results. You need to investigate the query job information and determine where performance is being affected. What should you do?

C. Use available administrative resource charts to determine how slots are being used and how jobs are performing over time. Run a query on the INFORMATION_SCHEMA to review query performance. -------------- - BigQuery provides administrative resource charts that show slot utilization and job performance, which can help identify patterns of heavy usage or contention. - Additionally, querying the INFORMATION_SCHEMA with the JOBS or JOBS_BY_PROJECT view can provide detailed information about specific queries, including execution time, slot usage, and whether they were queued.
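
A hedged sketch of the INFORMATION_SCHEMA side of the investigation, using the BigQuery Python client; the region qualifier and the one-day lookback are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  user_email,
  job_id,
  creation_time,
  TIMESTAMP_DIFF(start_time, creation_time, SECOND) AS queued_seconds,
  total_slot_ms,
  total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY total_slot_ms DESC
LIMIT 20
"""

for row in client.query(sql).result():
    print(row.job_id, row.user_email, row.queued_seconds, row.total_slot_ms)
```

A long gap between creation_time and start_time points to job queuing, while high total_slot_ms identifies the queries driving slot contention.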

You are building a streaming Dataflow pipeline that ingests noise level data from hundreds of sensors placed near construction sites across a city. The sensors measure noise level every ten seconds, and send that data to the pipeline when levels reach above 70 dBA. You need to detect the average noise level from a sensor when data is received for a duration of more than 30 minutes, but the window ends when no data has been received for 15 minutes. What should you do?

C. Use session windows with a 15-minute gap duration. - A session window stays open while data keeps arriving and closes after the specified gap of inactivity (here, 15 minutes), which matches the requirement that the window ends when no data has been received for 15 minutes. - Hopping (sliding) windows have a fixed size and period and do not close based on gaps in the data.
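
A minimal sketch of session windowing in the Beam Python SDK; the sample (sensor_id, dBA, event-time) tuples stand in for the real streaming source, and the additional check that a session lasted more than 30 minutes is omitted:

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.combiners import Mean

with beam.Pipeline() as p:
    (
        p
        | "Sample" >> beam.Create([
            ("sensor-1", 72.0, 0.0),       # (sensor_id, dBA, event time in seconds)
            ("sensor-1", 75.0, 600.0),     # 10 minutes later: same session
            ("sensor-1", 80.0, 2400.0),    # 30-minute gap: starts a new session
        ])
        | "AttachTimestamps" >> beam.Map(
            lambda e: window.TimestampedValue((e[0], e[1]), e[2]))
        | "SessionWindow" >> beam.WindowInto(window.Sessions(gap_size=15 * 60))
        | "AvgPerSensor" >> Mean.PerKey()
        | "Print" >> beam.Map(print)
    )
```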

Your organization is modernizing their IT services and migrating to Google Cloud. You need to organize the data that will be stored in Cloud Storage and BigQuery. You need to enable a data mesh approach to share the data between sales, product design, and marketing departments. What should you do? TO VERIFY!!!!

D. 1. Create multiple projects for storage of the data for each of your departments' applications. 2. Enable each department to create Cloud Storage buckets and BigQuery datasets. 3. In Dataplex, map each department to a data lake, and map the Cloud Storage buckets and BigQuery datasets to zones. 4. Enable each department to own and share the data of their data lakes.

You are designing a data mesh on Google Cloud with multiple distinct data engineering teams building data products. The typical data curation design pattern consists of landing files in Cloud Storage, transforming raw data in Cloud Storage and BigQuery datasets, and storing the final curated data product in BigQuery datasets. You need to configure Dataplex to ensure that each team can access only the assets needed to build their data products. You also need to ensure that teams can easily share the curated data product. What should you do?

D. 1. Create a Dataplex virtual lake for each data product, and create multiple zones for landing, raw, and curated data. 2. Provide the data engineering teams with full access to the virtual lake assigned to their data product.

You have an upstream process that writes data to Cloud Storage. This data is then read by an Apache Spark job that runs on Dataproc. These jobs are run in the us-central1 region, but the data could be stored anywhere in the United States. You need to have a recovery process in place in case of a catastrophic single region failure. You need an approach with a maximum of 15 minutes of data loss (RPO=15 mins). You want to ensure that there is minimal latency when reading the data. What should you do?

D. 1. Create a dual-region Cloud Storage bucket in the us-central1 and us-south1 regions. 2. Enable turbo replication. 3. Run the Dataproc cluster in a zone in the us-central1 region, reading from the bucket in the same region. 4. In case of a regional failure, redeploy the Dataproc clusters to the us-south1 region and read from the same bucket. info : Dataproc clusters can read from the bucket in the same region

You want to schedule a number of sequential load and transformation jobs. Data files will be added to a Cloud Storage bucket by an upstream process. There is no fixed schedule for when the new data arrives. Next, a Dataproc job is triggered to perform some transformations and write the data to BigQuery. You then need to run additional transformation jobs in BigQuery. The transformation jobs are different for every table. These jobs might take hours to complete. You need to determine the most efficient and maintainable workflow to process hundreds of tables and provide the freshest data to your end users. What should you do?

D. 1. Create an Apache Airflow directed acyclic graph (DAG) in Cloud Composer with sequential tasks by using the Dataproc and BigQuery operators. 2. Create a separate DAG for each table that needs to go through the pipeline. 3. Use a Cloud Storage object trigger to launch a Cloud Function that triggers the DAG. --------------------- Tailored handling and scheduling for each table; triggered by data arrival for more timely and efficient processing.
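
A hedged sketch of one such per-table DAG; the DAG id, cluster, PySpark file, and BigQuery routine are placeholders, and schedule_interval is None because the Cloud Function triggers the DAG when a file arrives:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

with DAG(
    dag_id="load_transform_orders",        # one DAG per table, placeholder name
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,                # triggered by the Cloud Function, not by a schedule
    catchup=False,
) as dag:
    spark_transform = DataprocSubmitJobOperator(
        task_id="dataproc_transform",
        project_id="my-project",
        region="us-central1",
        job={
            "placement": {"cluster_name": "etl-cluster"},
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/transform_orders.py"},
        },
    )

    bq_transform = BigQueryInsertJobOperator(
        task_id="bigquery_transform",
        configuration={
            "query": {
                "query": "CALL `my-project.transforms.refresh_orders`()",  # placeholder routine
                "useLegacySql": False,
            }
        },
    )

    spark_transform >> bq_transform
```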

You created an analytics environment on Google Cloud so that your data scientist team can explore data without impacting the on-premises Apache Hadoop solution. The data in the on-premises Hadoop Distributed File System (HDFS) cluster is in Optimized Row Columnar (ORC) formatted files with multiple columns of Hive partitioning. The data scientist team needs to be able to explore the data in a similar way as they used the on-premises HDFS cluster with SQL on the Hive query engine. You need to choose the most cost-effective storage and processing solution. What should you do?

D. Copy the ORC files to Cloud Storage, then create external BigQuery tables for the data scientist team. info: - It leverages the strengths of BigQuery for SQL-based exploration while avoiding the additional cost and complexity of data transformation or migration. - The data remains in ORC format in Cloud Storage, and BigQuery's external tables feature allows direct querying of this data. This approach uses BigQuery's analytics capabilities without the overhead of transforming the data or maintaining a separate cluster, while letting the team explore the data with SQL, similar to their experience with the on-premises Hadoop/Hive environment.
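
A hedged sketch of defining the external table with the BigQuery Python client; the bucket paths and table name are placeholders, and AUTO mode infers the Hive partition keys from the directory layout:

```python
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("ORC")
external_config.source_uris = ["gs://my-bucket/warehouse/events/*"]   # placeholder path

hive_opts = bigquery.HivePartitioningOptions()
hive_opts.mode = "AUTO"                                               # infer partition keys from the path
hive_opts.source_uri_prefix = "gs://my-bucket/warehouse/events/"
external_config.hive_partitioning = hive_opts

table = bigquery.Table("my-project.analytics.events_external")        # placeholder table
table.external_data_configuration = external_config
client.create_table(table)
```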

You are running your BigQuery project in the on-demand billing model and are executing a change data capture (CDC) process that ingests data. The CDC process loads 1 GB of data every 10 minutes into a temporary table, and then performs a merge into a 10 TB target table. This process is very scan intensive and you want to explore options to enable a predictable cost model. You need to create a BigQuery reservation based on utilization information gathered from BigQuery Monitoring and apply the reservation to the CDC process. What should you do?

D. Create a BigQuery reservation for the project. info: Under the on-demand model, the per-project slot quota with transient burst capability is sufficient for most users, but the CDC merge is billed per byte scanned, so this scan-intensive process produces variable costs. Creating a reservation and assigning it to the project moves all jobs in the project, including the CDC merge, to slot-based capacity with a predictable cost, and the slot utilization shown in BigQuery Monitoring indicates how large the reservation needs to be.

You have an Oracle database deployed in a VM as part of a Virtual Private Cloud (VPC) network. You want to replicate and continuously synchronize 50 tables to BigQuery. You want to minimize the need to manage infrastructure. What should you do?

D. Create a Datastream service from Oracle to BigQuery, use a private connectivity configuration to the same VPC network, and a connection profile to BigQuery. - Datastream is a serverless and easy-to-use change data capture (CDC) and replication service. - You would create a Datastream service that sources from your Oracle database and targets BigQuery, with private connectivity configuration to the same VPC. - This option is designed to minimize the need to manage infrastructure and is a fully managed service.

You are designing a data warehouse in BigQuery to analyze sales data for a telecommunication service provider. You need to create a data model for customers, products, and subscriptions. All customers, products, and subscriptions can be updated monthly, but you must maintain a historical record of all data. You plan to use the visualization layer for current and historical reporting. You need to ensure that the data model is simple, easy-to-use, and cost-effective. What should you do?

D. Create a denormalized, append-only model with nested and repeated fields. Use the ingestion timestamp to track historical data. - A denormalized, append-only model simplifies query complexity by eliminating the need for joins. - Adding data with an ingestion timestamp allows for easy retrieval of both current and historical states. - Instead of updating records, new records are appended, which maintains historical information without the need to create separate snapshots.
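
A hedged sketch of such a table with the BigQuery Python client; the dataset, field names, and types are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("customer_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("ingestion_time", "TIMESTAMP", mode="REQUIRED"),
    bigquery.SchemaField(
        "subscriptions",
        "RECORD",
        mode="REPEATED",          # nested, repeated subscription records
        fields=[
            bigquery.SchemaField("product_id", "STRING"),
            bigquery.SchemaField("status", "STRING"),
            bigquery.SchemaField("monthly_price", "NUMERIC"),
        ],
    ),
]

table = bigquery.Table("my-project.sales_dw.customer_snapshots", schema=schema)
client.create_table(table)

# Each monthly update is appended as a new row; the current state is simply the
# latest ingestion_time per customer, and history is a filter on ingestion_time.
```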

You have thousands of Apache Spark jobs running in your on-premises Apache Hadoop cluster. You want to migrate the jobs to Google Cloud. You want to use managed services to run your jobs instead of maintaining a long-lived Hadoop cluster yourself. You have a tight timeline and want to keep code changes to a minimum. What should you do?

D. Move your data to Cloud Storage. Run your jobs on Dataproc.

You have a variety of files in Cloud Storage that your data science team wants to use in their models. Currently, users do not have a method to explore, cleanse, and validate the data in Cloud Storage. You are looking for a low code solution that can be used by your data science team to quickly cleanse and explore data within Cloud Storage. What should you do?

D. Provide the data science team access to Dataprep to prepare, validate, and explore the data within Cloud Storage. ========================= - Dataprep is a serverless, no-code data preparation tool that allows users to visually explore, cleanse, and prepare data for analysis. - It's designed for business analysts, data scientists, and others who want to work with data without writing code. - Dataprep can directly access and transform data in Cloud Storage, making it a suitable choice for a team that prefers a low-code, user-friendly solution.

You are developing an Apache Beam pipeline to extract data from a Cloud SQL instance by using JdbcIO. You have two projects running in Google Cloud. The pipeline will be deployed and executed on Dataflow in Project A. The Cloud SQL instance is running in Project B and does not have a public IP address. After deploying the pipeline, you noticed that the pipeline failed to extract data from the Cloud SQL instance due to connection failure. You verified that VPC Service Controls and shared VPC are not in use in these projects. You want to resolve this error while ensuring that the data does not go through the public internet. What should you do?

D. Set up VPC Network Peering between Project A and Project B. Create a Compute Engine instance without external IP address in Project B on the peered subnet to serve as a proxy server to the Cloud SQL database.

You are part of a healthcare organization where data is organized and managed by respective data owners in various storage services. As a result of this decentralized ecosystem, discovering and managing data has become difficult. You need to quickly identify and implement a cost-optimized solution to assist your organization with the following: • Data management and discovery • Data lineage tracking • Data quality validation. How should you build the solution?

D. Use Dataplex to manage data, track data lineage, and perform data quality validation.

A web server sends click events to a Pub/Sub topic as messages. The web server includes an eventTimestamp attribute in the messages, which is the time when the click occurred. You have a Dataflow streaming job that reads from this Pub/Sub topic through a subscription, applies some transformations, and writes the result to another Pub/Sub topic for use by the advertising department. The advertising department needs to receive each message within 30 seconds of the corresponding click occurrence, but they report receiving the messages late. Your Dataflow job's system lag is about 5 seconds, and the data freshness is about 40 seconds. Inspecting a few messages show no more than 1 second lag between their eventTimestamp and publishTime. What is the problem and what should you do?

G. Messages in your Dataflow job are processed in less than 30 seconds, but your job cannot keep up with the backlog in the Pub/Sub subscription. Optimize your job or increase the number of workers to fix this. ==================================== - This points to a backlog problem. - It indicates that while individual messages are processed quickly once they are picked up (system lag of about 5 seconds), the job cannot keep up with the rate of incoming messages, so messages wait in the subscription and data freshness grows to about 40 seconds, which explains the late delivery.

