GCP Pro Data Engineer Exam


Auditors have determined that your company's processes for storing, processing, and transmitting sensitive data are insufficient. They believe that additional measures must be taken to ensure that sensitive information, such as personally identifiable government-issued numbers, is not disclosed. They suggest masking or removing sensitive data before it is transmitted outside the company. What GCP service would you recommend? A. Data loss prevention API B. In-transit encryption C. Storing sensitive information in Cloud Key Management D. Cloud Dataflow

A. A data loss prevention API can be used to remove many forms of sensitive data, such as government identifiers. Option B is incorrect; encryption can help keep data from being read, but it does not remove or mask sensitive data. Option C is incorrect; Cloud Key Management is a service for storing and managing encryption keys. Option D is incorrect; Cloud Dataflow is a batch and stream processing service.
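
As a concrete illustration, here is a minimal sketch of masking with the DLP API's Python client; the project ID, input text, and the choice of InfoType are illustrative assumptions, not part of the question:

    from google.cloud import dlp_v2

    client = dlp_v2.DlpServiceClient()
    parent = "projects/my-project"  # hypothetical project ID

    response = client.deidentify_content(
        request={
            "parent": parent,
            "inspect_config": {"info_types": [{"name": "US_SOCIAL_SECURITY_NUMBER"}]},
            "deidentify_config": {
                "info_type_transformations": {
                    "transformations": [
                        {"primitive_transformation": {"replace_with_info_type_config": {}}}
                    ]
                }
            },
            "item": {"value": "SSN 222-22-2222 must not leave the company."},
        }
    )
    # Prints the text with the SSN replaced by the InfoType name.
    print(response.item.value)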

Your company would like to start experimenting with machine learning, but no one in the company is experienced with ML. Analysts in the marketing department have identified some data in their relational database that they think may be useful for training a model. What would you recommend that they try first to build proof-of-concept models? A. AutoML Tables B. Kubeflow C. Cloud Firestore D. Spark MLlib

A. AutoML Tables is a service for generating machine learning models from structured data. Option B is incorrect; Kubeflow is an orchestration platform for running machine learning workloads in Kubernetes, which is more than is needed for this use case. Option C is incorrect; Cloud Firestore is a document database, not a machine learning service. Option D is incorrect because Spark MLlib requires more knowledge of machine learning than AutoML Tables, and therefore it is not as good an option for this use case.

You have developed a machine learning algorithm for identifying objects in images. Your company has a mobile app that allows users to upload images and get back a list of identified objects. You need to implement the mechanism to detect when a new image is uploaded to Cloud Storage and invoke the model to perform the analysis. Which GCP service would you use for that? A. Cloud Functions B. Cloud Storage Nearline C. Cloud Dataflow D. Cloud Dataproc

A. Cloud Functions is a managed serverless product that is able to respond to events in the cloud, such as creating a file in Cloud Storage. Option B is incorrect; Cloud Storage Nearline is a class of object storage. Option C is incorrect; Cloud Dataflow is a stream and batch processing service that does not respond to events. Option D is incorrect; Cloud Dataproc is a managed Hadoop and Spark service.
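
For illustration, a minimal sketch of a 1st-gen Python Cloud Function deployed with a Cloud Storage "finalize" (object created) trigger; the function name and the commented-out model call are hypothetical:

    def analyze_image(event, context):
        """Triggered when an object is created in the bucket."""
        bucket = event["bucket"]
        name = event["name"]
        print(f"Processing gs://{bucket}/{name}")
        # invoke_object_detection_model(bucket, name)  # hypothetical model call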

You have a BigQuery table with data about customer purchases, including the date of purchase, the type of product purchased, the product name, and several other descriptive attributes. There is approximately three years of data. You tend to query data by month and then by customer. You would like to minimize the amount of data scanned. How would you organize the table? A. Partition by purchase date and cluster by customer B. Partition by purchase date and cluster by product C. Partition by customer and cluster by product D. Partition by customer and cluster by purchase date

A. Partitioning by purchase date will keep all data for a day in a single partition. Clustering by customer will order the data in a partition by customer. This strategy will minimize the amount of data that needs to be scanned in order to answer a query by purchase date and customer. Option B is incorrect; clustering by product does not help reduce the amount of data scanned for date and customer-based queries. Option C is incorrect because partitioning by customer is not helpful in reducing the amount of data scanned. Option D is incorrect because partitioning by customer would spread data from one date over many partitions, and that would lead to scanning more data than partitioning by purchase date.
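
A sketch of such a table definition, submitted through the BigQuery Python client; the dataset, table, and column names are assumptions:

    from google.cloud import bigquery

    client = bigquery.Client()
    client.query(
        """
        CREATE TABLE mydataset.purchases (
          purchase_date DATE,
          customer_id STRING,
          product_name STRING
        )
        PARTITION BY purchase_date   -- one partition per day
        CLUSTER BY customer_id       -- ordered by customer within a partition
        """
    ).result()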

Your finance department is migrating a third-party application from an on-premises physical server. The system was written in C, but only the executable binary is available. After the migration, data will be extracted from the application database, transformed, and stored in a BigQuery data warehouse. The application is no longer actively supported by the original developer, and it must run on an Ubuntu 14.04 operating system that has been configured with several required packages. Which compute platform would you use? A. Compute Engine B. Kubernetes Engine C. App Engine Standard D. Cloud Functions

A. The answer is A. This scenario calls for full control over the choice of the operating system, and the application is moving from a physical server so that it is not containerized. Compute Engine can run the application in a VM configured with Ubuntu 14.04 and the additional packages. Option B is incorrect because the application is not containerized (although it may be modified to be containerized). Option C is incorrect because the application cannot run in one of the language-specific runtimes of App Engine Standard. Option D is incorrect because the Cloud Functions product runs code in response to events and does not support long-running applications.

You are working with a group of genetics researchers analyzing data generated by gene sequencers. The data is stored in Cloud Storage. The analysis requires running a series of six programs, each of which will output data that is used by the next process in the pipeline. The final result set is loaded into BigQuery. What tool would you recommend for orchestrating this workflow? A. Cloud Composer B. Cloud Dataflow C. Apache Flink D. Cloud Dataproc

A. The correct answer is A, Cloud Composer, which is designed to support workflow orchestration. Options B and C are incorrect because they are both stream and batch processing engines (Cloud Dataflow implements the Apache Beam model, and Apache Flink is a comparable open source engine), not workflow orchestration tools. Option D is incorrect; Cloud Dataproc is a managed Hadoop and Spark service.

A team of analysts has collected several CSV datasets with a total size of 50 GB. They plan to store the datasets in GCP and use Compute Engine instances to run RStudio, an interactive statistical application. Data will be loaded into RStudio using an RStudio data loading tool. Which of the following is the most appropriate GCP storage service for the datasets? A. Cloud Storage B. Cloud Datastore C. MongoDB D. Bigtable

A. The correct answer is A, Cloud Storage, because the data in the files is treated as an atomic unit of data that is loaded into RStudio. Options B and C are incorrect because those are document databases and there is no requirement for storing the data in semi-structured format with support for fully indexed querying. Also, MongoDB is not a GCP service. Option D is incorrect because, although you could load CSV data into a Bigtable table, the volume of data is not sufficient to warrant using Bigtable.

The CTO of your company is concerned about the rising costs of maintaining your company's enterprise data warehouse. Some members of your team are advocating to migrate to a cloud-based data warehouse such as BigQuery. What is the first step for migrating from the on-premises data warehouse to a cloud-based data warehouse? A. Assessing the current state of the data warehouse B. Designing the future state of the data warehouse C. Migrating data, jobs, and access controls to the cloud D. Validating the cloud data warehouse

A. The correct answer is A. An assessment should be done first. Options B, C, and D are all parts of a data warehouse migration plan but come after the assessment phase.

The CTO of your company wants to reduce the cost of running an HBase and Hadoop cluster on premises. Only one HBase application is run on the cluster. The cluster currently supports 10 TB of data, but it is expected to double in the next six months. Which of the following managed services would you recommend to replace the on-premises cluster in order to minimize migration and ongoing operational costs? A. Cloud Bigtable using the HBase API B. Cloud Dataflow using the HBase API C. Cloud Spanner D. Cloud Datastore

A. The correct answer is A. Cloud Bigtable using the HBase API would minimize migration efforts, and since Bigtable is a managed service, it would help reduce operational costs. Option B is incorrect. Cloud Dataflow is a stream and batch processing service, not a database. Options C and D are incorrect. Relational databases are not likely to be appropriate choices for an HBase database, which is a wide-column NoSQL database, and trying to migrate from a wide-column to a relational database would incur unnecessary costs.

When gathering requirements for a data warehouse migration, which of the following would you include in a listing of technical requirements? A. Data sources, data model, and ETL scripts B. Data sources, data model, and business sponsor roles C. Data sources only D. Data model, data catalog, ETL scripts, and business sponsor roles

A. The correct answer is A. Data sources, the data model, and ETL scripts would all be included. Options B and D are incorrect; technical requirements do not include information about business sponsors and their roles. Option C is incorrect because more than data sources should be included.

A team of machine learning engineers is creating a repository of data for training and testing machine learning models. All of the engineers work in the same city, and they all contribute datasets to the repository. The data files will be accessed frequently, usually at least once a week. The engineers want to minimize their storage costs. They plan to use Cloud Storage; what storage class would you recommend? A. Regional B. Multi-regional C. Nearline D. Coldline

A. The correct answer is A. Regional storage is sufficient for serving users in the same geographic location and costs less than multi-regional storage. Option B is incorrect because it does not minimize cost, and there is no need for multi-regional storage. Options C and D are incorrect because Nearline and Coldline are less expensive only for infrequently accessed data.

You are running a Redis cache using Cloud Memorystore. One day, you receive an alert notification that the memory usage is exceeding 80 percent. You do not want to scale up the instance, but you need to reduce the amount of memory used. What could you try? A. Setting shorter TTLs and trying a different eviction policy. B. Switching from Basic Tier to Standard Tier. C. Exporting the cache. D. There is no other option—you must scale the instance.

A. The correct answer is A. Setting shorter TTLs will make keys eligible for eviction sooner, and a different eviction policy may lead to more evictions. For example, switching from an eviction policy that evicts only keys with a TTL to a policy that can evict any key may reduce memory use.
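
A brief sketch with the redis-py client against a Memorystore endpoint; the host IP, key, and TTL value are illustrative, and the eviction policy (maxmemory-policy) is configured on the Memorystore instance rather than in application code:

    import redis

    r = redis.Redis(host="10.0.0.3", port=6379)  # hypothetical Memorystore IP
    r.set("session:123", "payload", ex=600)      # 10-minute TTL so the key is evictable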

A Cloud Spanner database is being deployed in us-west1 and will have to store up to 20 TB of data. What is the minimum number of nodes required? A. 10 B. 20 C. 5 D. 40

A. The correct answer is A. Since each node can store 2 TB, storing 20 TB requires at least 10 nodes. Options B and D are incorrect because they are more nodes than needed. Option C is incorrect; five nodes are not sufficient for storing 20 TB of data.
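
The arithmetic behind the answer, as a quick check:

    import math

    storage_tb, tb_per_node = 20, 2
    print(math.ceil(storage_tb / tb_per_node))  # 10 nodes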

Your startup is developing a mobile app that takes an image as input and produces a list of names of objects in the image. The image file is uploaded from the mobile device to a Cloud Storage bucket. A service account is associated with the server-side application that will retrieve the image. The application will not perform any other operation on the file or the bucket. Following the principle of least privilege, what role would you assign to the service account? A. roles/storage.objectViewer B. roles/storage.objectAdmin C. roles/storage.objectCreator D. roles/storage.objectViewer and roles/storage.objectCreator

A. The correct answer is A. Since the application needs to read the contents of only the object, the roles/storage.objectViewer role is sufficient. Option B grants more permissions than needed. Option C would not allow the application to read the object. Option D has more permissions than needed.
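
A sketch of granting that role at the bucket level with the Cloud Storage Python client; the bucket name and service account are assumptions:

    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("uploaded-images")  # hypothetical bucket
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append({
        "role": "roles/storage.objectViewer",
        "members": {"serviceAccount:app@my-project.iam.gserviceaccount.com"},
    })
    bucket.set_iam_policy(policy)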

You are migrating several terabytes of historical sensor data to Google Cloud Storage. The data is organized into files with one file per sensor per day. The files are named with the date followed by the sensor ID. After loading 10 percent of the data, you realize that the data loads are not proceeding as fast as expected. What might be the cause? A. The filenaming convention uses dates as the first part of the file name. If the files are loaded in this order, they may be creating hotspots when writing the data to Cloud Storage. B. The data is in text instead of Avro or Parquet format. C. You are using a gcloud command-line utility instead of the REST API. D. The data is being written to regional instead of multi-regional storage.

A. The correct answer is A. Since the files have sequential names, they may be loading in lexicographic order, and this can create hotspots. Option B is incorrect; the volume of data, not the format, will determine upload speed. Option C is incorrect; there should be no noticeable difference between the command-line SDK and the REST API. Option D is incorrect; writing to multi-regional storage would not make the uploads faster.
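
One common mitigation, sketched below, is to prepend a short hash so that names no longer sort in upload order; the filename format is illustrative:

    import hashlib

    def prefixed(name: str) -> str:
        """Prepend a short hash so sequential names spread out lexicographically."""
        digest = hashlib.md5(name.encode()).hexdigest()[:6]
        return f"{digest}_{name}"

    print(prefixed("20230401_sensor42.csv"))  # e.g. "3fa9c1_20230401_sensor42.csv"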

A startup is creating a business service for the hotel industry. The service will allow hotels to sell unoccupied rooms on short notice using the startup's platform. The startup wants to make it as easy as possible for hotels to share data with the platform, so it uses a message queue to collect data about rooms that are available for rent. Hotels send a message for each room that is available and the days that it is available. Room identifier and dates are the keys that uniquely identify a listing. If a listing exists and a message is received with the same room identifier and dates, the message is discarded. What are the minimal guarantees that you would want from the message queue? A. At-least-once delivery B. Exactly-once delivery C. In-order delivery D. Exactly-once, in-order delivery

A. The correct answer is A. The purpose of this queue is to list rooms on the platform, so as long as each message is processed at least once, the room will appear in the listing. Options B and D are incorrect because processing does not have to be exactly once; listing a room is an idempotent operation. For example, adding a listing of the same room twice does not change the listing, since duplicate listing messages are dropped by the application. Option C is incorrect because no ordering is implied in the requirements.

A software developer asks your advice about storing data. The developer has hundreds of thousands of 1 KB JSON objects that need to be accessed in sub-millisecond times if possible. All objects are referenced by a key. There is no need to look up values by the contents of the JSON structure. What kind of NoSQL database would you recommend? A. Key-value database B. Analytical database C. Wide-column database D. Graph database

A. The correct answer is A. This is a good use case for key-value databases because the value is looked up by key only and the value is a JSON structure. Option B is incorrect. Analytical databases are not a type of NoSQL database. Option C is not a good option because wide-column databases work well with larger databases, typically in the terabyte range. Option D is incorrect because the data is not modeled as nodes and links, such as a network model.

You are querying a BigQuery table that has been partitioned by time. You create a query and use the --dry_run flag with the bq query command. The amount of data scanned is far more than you expected. What is a possible cause of this? A. You did not include _PARTITIONTIME in the WHERE clause to limit the amount of data that needs to be scanned. B. You used CSV instead of AVRO file format when loading the data. C. Both active and long-term data are included in the query results. D. You used JSON instead of the Parquet file format when loading the data.

A. The correct answer is A. You probably did not include the pseudo-column _PARTITIONTIME in the WHERE clause to limit the amount of data scanned. Options B and D are incorrect; the format of the file from which data is loaded does not affect the amount of data scanned. Option C is incorrect; the distinction between active and long-term data impacts only the cost of storage, not the execution of a query.
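
A sketch of checking this with the BigQuery Python client's dry-run mode (the programmatic equivalent of bq query --dry_run); the dataset, table, and date range are assumptions:

    from google.cloud import bigquery

    client = bigquery.Client()
    job = client.query(
        """
        SELECT customer_id, SUM(amount)
        FROM mydataset.purchases
        WHERE _PARTITIONTIME >= TIMESTAMP('2023-01-01')
          AND _PARTITIONTIME <  TIMESTAMP('2023-02-01')
        GROUP BY customer_id
        """,
        job_config=bigquery.QueryJobConfig(dry_run=True),
    )
    # With the partition filter, far fewer bytes should be reported.
    print(f"Bytes scanned: {job.total_bytes_processed}")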

Data can be encrypted at multiple levels, such as at the platform, infrastructure, and device levels. At the device level, how is data encrypted in the Google Cloud Platform? A. AES256 or AES128 encryption B. Elliptic curve cryptography C. Data Encryption Standard (DES) D. Blowfish

A. The correct answer is A: AES256 or AES128 encryption. Option B is incorrect; although elliptic curve cryptography is a strong encryption technique that could be used to encrypt data, it is not what GCP uses at the device level. Option C is incorrect; DES is a weak encryption algorithm that is easily broken by today's methods. Option D is incorrect; Blowfish is a strong encryption algorithm designed as a replacement for DES and other weak encryption algorithms, but it is not used here.

Your team is migrating applications from running on bare-metal servers and virtual machines to running in containers. You would like to use Kubernetes Engine to run those containers. One member of the team is unfamiliar with Kubernetes and does not understand why they cannot find a command to deploy a container. How would you describe the reason why there is no deploy container command? A. Kubernetes uses pods as the smallest deployable unit, and pods have usually one but possibly multiple containers that are deployed as a unit. B. Kubernetes uses deployments as the smallest deployable unit, and pods have usually one but possibly multiple containers that are deployed as a unit. C. Kubernetes uses replicas as the smallest deployable unit, and pods have usually one but possibly multiple containers that are deployed as a unit. D. Kubernetes calls containers "pods," and the command to deploy is kubectl deploy pod.

A. The correct answer is A; Kubernetes uses pods as the smallest deployable unit. Options B and C are incorrect because deployments and replicas are Kubernetes abstractions, but they are not used as the mechanism for logically encapsulating containers. Option D is incorrect, since pods and containers are not synonymous.

A group of IoT sensors is sending streaming data to a Cloud Pub/Sub topic. A Cloud Dataflow service pulls messages from the topic and reorders the messages by event time. A message is expected from each sensor every minute. If a message is not received from a sensor, the stream processing application should use the average of the values in the last four messages. What kind of window would you use to implement the missing data logic? A. Sliding window B. Tumbling window C. Extrapolation window D. Crossover window

A. The correct answer is A; a sliding window would have the data for the past four minutes. Option B is incorrect because tumbling windows do not overlap, and the requirement calls for using the last four messages so the window must slide. Options C and D are not actually names of window types.
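
A sketch of the windowing step in the Apache Beam Python SDK, which Cloud Dataflow runs: a four-minute window sliding every minute, so each sensor's last four readings are always available. The sensor values and timestamps are illustrative, and the in-memory source stands in for the Pub/Sub stream:

    import apache_beam as beam
    from apache_beam import window

    with beam.Pipeline() as p:
        (
            p
            | beam.Create([("sensor-1", 10.0, 0), ("sensor-1", 12.0, 60)])
            | beam.Map(lambda e: window.TimestampedValue((e[0], e[1]), e[2]))
            | beam.WindowInto(window.SlidingWindows(size=4 * 60, period=60))
            | beam.combiners.Mean.PerKey()  # average of readings in each window
            | beam.Map(print)
        )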

It is considered a good practice to make your processing logic idempotent when consuming messages from a Cloud Pub/Sub topic. Why is that? A. Messages may be delivered multiple times. B. Messages may be received out of order. C. Messages may be delivered out of order. D. A consumer service may need to wait extended periods of time between the delivery of messages.

A. The correct answer is A; messages may be delivered multiple times and therefore processed multiple times. If the logic were not idempotent, it could leave the application in an incorrect state, such as that which could occur if you counted the same message multiple times. Options B and C are incorrect; the order of delivery does not require idempotent operations. Option D is incorrect; the time between messages is not a factor in requiring logic to be idempotent.
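
One idempotency technique, sketched with the Cloud Pub/Sub Python client: skip messages whose IDs have already been processed. The in-memory set and the project and subscription names are illustrative; production code would use a durable store:

    from google.cloud import pubsub_v1

    processed_ids = set()  # illustrative; use a durable store in production

    def callback(message):
        if message.message_id not in processed_ids:
            processed_ids.add(message.message_id)
            print(f"processing {message.data}")  # actual work goes here
        message.ack()  # acking a duplicate is safe; the work ran only once

    subscriber = pubsub_v1.SubscriberClient()
    path = subscriber.subscription_path("my-project", "room-listings")  # hypothetical
    subscriber.subscribe(path, callback=callback)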

The auditors for your company have determined that several employees have more permissions than needed to carry out their job responsibilities. All the employees have user accounts on GCP that have been assigned predefined roles. You have concluded that the optimal way to meet the auditors' recommendations is by using custom roles. What permission is needed to create a custom role? A. iam.roles.create B. iam.custom.roles C. roles/iam.custom.create D. roles/iam.create.custom

A. The correct answer is A; the iam.roles.create permission is needed to create custom roles. Option B is incorrect; it is not an actual permission. Options C and D are incorrect; they are examples of fictitious roles, not permissions.

A Kubernetes administrator wants to improve the performance of an application running in Kubernetes. They have determined that the four replicas currently running are not enough to meet demand and want to increase the total number of replicas to six. The name of the deployment is my-app-123. What command should they use? A. kubectl scale deployment my-app-123 --replicas 6 B. kubectl scale deployment my-app-123 --replicas 2 C. gcloud containers scale deployment my-app-123 --replicas 6 D. gcloud containers scale deployment my-app-123 --replicas 2

A. The correct answer is A; the kubectl scale deployment command specifying the desired number of replicas is the correct command. Option B is incorrect, since this would set the number of replicas to 2. Options C and D are incorrect; there is no gcloud containers scale deployment command.

Your department is migrating some stream processing to GCP and keeping some on premises. You are tasked with designing a way to share data from on-premises pipelines that use Kafka with GCP data pipelines that use Cloud Pub/Sub. How would you do that? A. Use CloudPubSubConnector and Kafka Connect B. Stream data to a Cloud Storage bucket and read from there C. Write a service to read from Kafka and write to Cloud Pub/Sub D. Use Cloud Pub/Sub Import Service

A. The correct answer is A; you should use CloudPubSubConnector and Kafka Connect. The connector is developed and maintained by the Cloud Pub/Sub team for this purpose. Option B is incorrect because it is a less direct and less efficient method. Option C is incorrect because it would require writing and maintaining a service when an existing connector is available. Option D is incorrect because there is no such service.

An IoT system streams data to a Cloud Pub/Sub topic for ingestion, and the data is processed in a Cloud Dataflow pipeline before being written to Cloud Bigtable. Latency is increasing as more data is added, even though nodes are not at maximum utilization. What would you look for first as a possible cause of this problem? A. Too many nodes in the cluster B. A poorly designed row key C. Too many column families D. Too many indexes being updated during write operations

B. A poorly designed row key could be causing hot spotting. Option A is incorrect; more nodes in a cluster will not increase latency. Option C is incorrect; the number of column families on its own would not lead to higher latency. Option D is incorrect; Bigtable does not have indexes.

The CTO of your company is concerned about the rising costs of maintaining your company's enterprise data warehouse. The current data warehouse runs in a PostgreSQL instance. You would like to migrate to GCP and use a managed service that reduces operational overhead and one that will scale to meet future needs of up to 3 PB. What service would you recommend? A. Cloud SQL using PostgreSQL B. BigQuery C. Cloud Bigtable D. Cloud Spanner

B. BigQuery is a managed service that is well suited to data warehousing, and it can scale to petabytes of storage. Option A is incorrect; Cloud SQL will not scale to meet future needs. Option C is incorrect; Bigtable is a NoSQL, wide-column database, which is not suitable for use with a data warehouse design that uses a relational data model. Option D is incorrect; Cloud Spanner is a transactional, scalable relational database.

You are currently using Java to implement an ELT pipeline in Hadoop. You'd like to replace your Java programs with a managed service in GCP. Which would you use? A. Data Studio B. Cloud Dataflow C. Cloud Bigtable D. BigQuery

B. Cloud Dataflow is a stream and batch processing managed service that is a good replacement for Java ELT programs. Option A is incorrect; Data Studio is a reporting tool. Option C is incorrect; Cloud Bigtable is a NoSQL, wide-column database. Option D is incorrect; BigQuery is an analytical database.

You are analyzing several datasets and will likely use them to build regression models. You will receive additional datasets, so you'd like to have a workflow to transform the raw data into a form suitable for analysis. You'd also like to work with the data in an interactive manner using Python. What services would you use in GCP? A. Cloud Dataflow and Data Studio B. Cloud Dataflow and Cloud Datalab C. Cloud Dataprep and Data Studio D. Cloud Datalab and Data Studio

B. Cloud Dataflow is well suited to transforming batch data, and Cloud Datalab is a Jupyter Notebook managed service, which is useful for ad hoc analysis using Python. Options A, C, and D are incorrect; Data Studio is a reporting tool, which is not needed in this use case.

You are migrating your machine learning operations to GCP and want to take advantage of managed services. You have been managing a Spark cluster because you use the MLlib library extensively. Which GCP managed service would you use? A. Cloud Dataprep B. Cloud Dataproc C. Cloud Dataflow D. Cloud Pub/Sub

B. Cloud Dataproc is a Hadoop and Spark managed service. Option A is incorrect; Cloud Dataprep is a service for preparing data for analysis. Option C is incorrect; Cloud Dataflow is an implementation of Apache Beam, a stream and batch processing service. Option D is incorrect; Cloud Pub/Sub is a messaging service that can buffer data in a topic until a service is ready to process the data.

You are using Cloud Firestore to store data about online game players' state while in a game. The state information includes health score, a set of possessions, and a list of team members collaborating with the player. You have noticed that the size of the raw data in the database is approximately 2 TB, but the amount of space used by Cloud Firestore is almost 5 TB. What could be causing the need for so much more space? A. The data model has been denormalized. B. There are multiple indexes. C. Nodes in the database cluster are misconfigured. D. There are too many column families in use.

B. Cloud Firestore stores data redundantly when multiple indexes are used, so having more indexes will lead to greater storage sizes. Option A is incorrect; Cloud Firestore is a NoSQL document database that supports a denormalized data model without using excessive storage. Option C is incorrect; you do not configure nodes in Cloud Firestore. Option D is incorrect; column families are not used with document databases such as Cloud Firestore.

You have several analysis programs running in production. Sometimes they are failing, but there is no apparent pattern to the failures. You'd like to use a GCP service to record custom information from the programs so that you can better understand what is happening. Which service would you use? A. Stackdriver Debugger B. Stackdriver Logging C. Stackdriver Monitoring D. Stackdriver Trace

B. Stackdriver Logging is used to collect semi-structured data about events. Option A is incorrect; Stackdriver Debugger is used to inspect the state of running code. Option C is incorrect because Stackdriver Monitoring collects performance metrics, not custom data. Option D is incorrect; Stackdriver Trace is used to collect information about the time required to execute functions in a call stack.

You have several large deep learning networks that you have built using TensorFlow. The models use only standard TensorFlow components. You have been running the models on an n1-highcpu-64 VM, but the models are taking longer to train than you would like. What would you try first to accelerate the model training? A. GPUs B. TPUs C. Shielded VMs D. Preemptible VMs

B. TPUs are the correct accelerator because they are designed specifically to accelerate TensorFlow models. Option A is incorrect because, although GPUs would accelerate the model training, they are not as highly optimized as TPUs for the matrix math performed when training deep learning networks. Option C is incorrect; shielded VMs have additional security controls, but they do not accelerate model training. Option D is incorrect; preemptible machines cost less than non-preemptible machines, but they do not provide acceleration.

You are using Cloud Functions to start the processing of images as they are uploaded into Cloud Storage. In the past, there have been spikes in the number of images uploaded, and many instances of the Cloud Function were created at those times. What can you do to prevent too many instances from starting? A. Use the --max-limit parameter when deploying the function. B. Use the --max-instances parameter when deploying the function. C. Configure the --max-instance parameter in the resource hierarchy. D. Nothing. There is no option to limit the number of instances.

B. The --max-instances parameter limits the number of concurrently executing function instances. Option A is incorrect; --max-limit is not a parameter used with function deployments. Option C is incorrect; there is no --max-instance parameter to set in the resource hierarchy. Option D is incorrect; there is a way to specify a limit using the --max-instances parameter.
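
For example, a deployment command along the following lines caps concurrency; the function name, runtime, and trigger bucket are illustrative, and only --max-instances is the flag at issue:

    gcloud functions deploy process_image --runtime python39 --trigger-bucket uploaded-images --max-instances 20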

Your department is experimenting with using Cloud Spanner for a globally accessible database. You are starting with a pilot project using a regional instance. You would like to follow Google's recommendations for the maximum sustained CPU utilization of a regional instance. What is the maximum CPU utilization that you would target? A. 50% B. 65% C. 75% D. 45%

B. The correct answer is B, 65%. Options A and C are not recommended levels for any Cloud Spanner configuration. Option D, 45%, is the recommended CPU utilization for a multi-regional Cloud Spanner instance.

A database administrator (DBA) who is new to Google Cloud has asked for your help configuring network access to a Cloud SQL PostgreSQL database. The DBA wants to ensure that traffic is encrypted while minimizing administrative tasks, such as managing SQL certificates. What would you recommend? A. Use the TLS protocol B. Use Cloud SQL Proxy C. Use a private IP address D. Configure the database instance to use auto-encryption

B. The correct answer is B, Cloud SQL Proxy. Cloud SQL Proxy provides secure access to Second Generation instances without you having to create allow lists or configure SSL. The proxy manages authentication and automatically encrypts data. Option A is incorrect because TLS is the successor to SSL. It can be used to encrypt traffic, but it would require the DBA to manage certificates, so it is not as good an answer as Option B. Option C is incorrect; using an IP address does not ensure encryption of data. Option D is incorrect; there is no such thing as an auto-encryption feature in Cloud SQL.
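
A sketch of the pattern, assuming the (v1) proxy is already running locally; the instance, database, and user names are illustrative:

    # Proxy started separately, e.g.:
    #   ./cloud_sql_proxy -instances=my-project:us-central1:mydb=tcp:5432
    import psycopg2

    conn = psycopg2.connect(
        host="127.0.0.1", port=5432,    # the proxy's local listener
        dbname="finance", user="dba",   # hypothetical database and user
        password="...",
    )
    # Traffic between the proxy and the instance is encrypted automatically,
    # with no certificates for the DBA to manage.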

A team of developers has been tasked with rewriting the ETL process that populates an enterprise data warehouse. They plan to use a microservices architecture. Each microservice will run in its own Docker container. The amount of data processed during a run can vary, but the ETL process must always finish within one hour of starting. You want to minimize the amount of DevOps tasks the team needs to perform, but you do not want to sacrifice efficient utilization of compute resources. What GCP compute service would you recommend? A. Compute Engine B. Kubernetes Engine C. App Engine Standard D. Cloud Functions

B. The correct answer is B, Kubernetes Engine, because the application will be designed using containerized microservices that should be run in a way that minimizes DevOps overhead. Option A is incorrect because Compute Engine would require more DevOps work to manage your own Kubernetes Cluster or configure managed instance groups to run different containers needed for each microservice. Options C and D are incorrect because App Engine Standard and Cloud Functions do not run containers.

As a database administrator tasked with migrating a MongoDB instance to Google Cloud, you are concerned about your ability to configure the database optimally. You want to collect metrics at both the instance level and the database server level. What would you do in addition to creating an instance and installing and configuring MongoDB to ensure that you can monitor key instance and database metrics? A. Install Stackdriver Logging agent. B. Install Stackdriver Monitoring agent. C. Install Stackdriver Debug agent. D. Nothing. By default, the database instance will send metrics to Stackdriver.

B. The correct answer is B, installing the Stackdriver Monitoring agent. This will collect application-level metrics and send them to Stackdriver for alerting and charting. Option A is incorrect because Stackdriver Logging does not collect metrics, but you would install the Stackdriver Logging agent if you also wanted to collect database logs. Option C is incorrect; Stackdriver Debug is for analyzing a running program. Option D is incorrect; by default, you will get only instance metrics and audit logs.

A group of climate scientists is collecting weather data every minute from 10,000 sensors across the globe. Data often arrives near the beginning of a minute, and almost all data arrives within the first 30 seconds of a minute. The data ingestion process is losing some data because servers cannot ingest the data as fast as it is arriving. The scientists have scaled up the number of servers in their managed instance group, but that has not completely eliminated the problem. They do not wish to increase the maximum size of the managed instance group. What else can the scientists do to prevent data loss? A. Write data to a Cloud Dataflow stream B. Write data to a Cloud Pub/Sub topic C. Write data to Cloud SQL table D. Write data to Cloud Dataprep

B. The correct answer is B, write data to a Cloud Pub/Sub topic, which scales automatically to match the workload. The ingestion process can read data from the topic and then process it. Some data will likely accumulate early in every minute, but the ingestion process can catch up later in the minute after new data stops arriving. Option A is incorrect; Cloud Dataflow is a batch and stream processing service, not a message queue for buffering data. Option C is incorrect; Cloud SQL is not designed to scale for ingestion as needed in this example. Option D is incorrect; Cloud Dataprep is a tool for cleaning and preparing datasets for analysis.

You are running a high-performance computing application in a managed instance group. You notice that the throughput of one instance is significantly lower than that for other instances. The poorly performing instance is terminated, and another instance is created to replace it. What feature of managed instance groups is at work here? A. Autoscaling B. Autohealing C. Redundancy D. Eventual consistency

B. The correct answer is B. Autohealing uses a health check function to determine whether an application is functioning correctly, and if not, the instance is replaced. Option A is incorrect; autoscaling adds or removes instances based on instance metrics. Option C is incorrect; redundancy is a feature of instance groups, but it is not the mechanism that replaces poorly performing nodes. Option D is incorrect; eventual consistency describes a model for storing writes in a way that they will eventually be visible to all queries.

You are developing a new application and will be storing semi-structured data that will only be accessed by a single key. The total volume of data will be at least 40 TB. What GCP database service would you use? A. BigQuery B. Bigtable C. Cloud Spanner D. Cloud SQL

B. The correct answer is B. Bigtable is a wide-column NoSQL database that supports semi-structured data and works well with datasets over 1 TB. Options A, C, and D are incorrect because they are all used for structured data. Option D is also incorrect because Cloud SQL does not currently scale to 40 TB in a single database.

You are responsible for developing an ingestion mechanism for a large number of IoT sensors. The ingestion service should accept data up to 10 minutes late. The service should also perform some transformations before writing the data to a database. Which of the managed services would be the best option for managing late arriving data and performing transformations? A. Cloud Dataproc B. Cloud Dataflow C. Cloud Dataprep D. Cloud SQL

B. The correct answer is B. Cloud Dataflow is a stream and batch processing service that is used for transforming data and processing streaming data. Option A, Cloud Dataproc, is a managed Hadoop and Spark service and not as well suited as Cloud Dataflow for the kind of stream processing specified. Option C, Cloud Dataprep, is an interactive tool for exploring and preparing data sets for analysis. Option D, Cloud SQL, is a relational database service, so it may be used to store data, but it is not a service specifically for ingesting and transforming data before writing to a database.
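
A sketch of the late-data setting in the Apache Beam Python SDK, which Cloud Dataflow executes; the window size, sample element, and in-memory source (standing in for a streaming source) are illustrative:

    import apache_beam as beam
    from apache_beam import window

    with beam.Pipeline() as p:
        (
            p
            | beam.Create([("sensor-1", 10.0, 0)])
            | beam.Map(lambda e: window.TimestampedValue((e[0], e[1]), e[2]))
            | beam.WindowInto(
                window.FixedWindows(60),     # one-minute windows
                allowed_lateness=10 * 60,    # accept elements up to 10 minutes late
            )
            | beam.combiners.Mean.PerKey()
            | beam.Map(print)
        )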

A team of data warehouse developers is migrating a set of legacy Python scripts that have been used to transform data as part of an ETL process. They would like to use a service that allows them to use Python and requires minimal administration and operations support. Which GCP service would you recommend? A. Cloud Dataproc B. Cloud Dataflow C. Cloud Spanner D. Cloud Dataprep

B. The correct answer is B. Cloud Dataflow supports Python and is a serverless platform. Option A is incorrect because, although it supports Python, you have to create and configure clusters. Option C is incorrect; Cloud Spanner is a horizontally scalable global relational database. Option D is incorrect; Cloud Dataprep is an interactive tool for preparing data for analysis.

A startup is designing a data processing pipeline for its IoT platform. Data from sensors will stream into a pipeline running in GCP. As soon as data arrives, a validation process, written in Python, is run to verify data integrity. If the data passes the validation, it is ingested; otherwise, it is discarded. What services would you use to implement the validation check and ingestion? A. Cloud Storage and Cloud Pub/Sub B. Cloud Functions and Cloud Pub/Sub C. Cloud Functions and BigQuery D. Cloud Storage and BigQuery

B. The correct answer is B. IoT sensors can write data to a Cloud Pub/Sub topic. When a message is written, it can trigger a Cloud Function that runs the associated code. Cloud Functions can execute the Python validation check, and if the validation check fails, the message is removed from the queue. Option A is incorrect; Cloud Storage is not a service for streaming ingestion. Option C is incorrect because BigQuery is an analytical database that could be used in later stages but not during ingest. Option D is incorrect because Cloud Storage is not a suitable choice for high-volume streaming ingestion, and BigQuery is not suitable for storing data during ingestion.
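
A sketch of the validation step as a 1st-gen Pub/Sub-triggered Cloud Function in Python; the field names in the integrity rule are hypothetical:

    import base64
    import json

    def validate(event, context):
        """Triggered by a message published to a Cloud Pub/Sub topic."""
        record = json.loads(base64.b64decode(event["data"]))
        if "sensor_id" in record and "value" in record:  # hypothetical integrity rule
            print(f"ingesting {record}")  # downstream ingestion would go here
        # Returning normally acknowledges the message; invalid data is discarded.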

You created a Cloud SQL database that uses replication to improve read performance. Occasionally, the read replica will be unavailable. You haven't noticed a pattern, but the disruptions occur once or twice a month. No DBA operations are occurring when the incidents occur. What might be the cause of this issue? A. The read replica is being promoted to a standalone Cloud SQL instance. B. Maintenance is occurring on the read replica. C. A backup is being performed on the read replica. D. The primary Cloud SQL instance is failing over to the read replica.

B. The correct answer is B. Maintenance could be occurring. Maintenance on read replicas is not restricted to the maintenance window of the primary instance or to other windows, so it can occur anytime. That would make the read replica unavailable. Option A is incorrect because a database administrator would have to promote a read replica, and the problem stated that there is no pattern detected and DBAs were not performing database operations. Option C is incorrect; backups are not performed on read replicas. Option D is incorrect; Cloud SQL instances do not fail over to a read replica.

Your company has implemented an organizational hierarchy consisting of two layers of folders and tens of projects. The top layer of folders corresponds to a department, and the second layer of folders are working groups within a department. Each working group has one or more projects in the resource hierarchy. You have to ensure that all projects comply with regulations, so you have created several policies. Policy A applies to all departments. Policies B, C, D, and E are department specific. At what level of the resource hierarchy would you assign each policy? A. Assign policies A, B, C, D, and E to each folder B. Assign policy A to the organizational hierarchy and policies B, C, D, and E to each department's corresponding folder C. Assign policy A to the organizational hierarchy and policies B, C, D, and E to each department's corresponding projects D. Assign policy A to each department's folder and policies B, C, D, and E to each project

B. The correct answer is B. Policy A applies to all departments, so it should be assigned at the organizational level. Policies B, C, D, and E are department specific and apply to all projects, so they can be inherited by projects when they are assigned to each department's folder. Option A is incorrect; policy A belongs at the organizational level, and each of the other policies should apply only to one department's folder. Option C is incorrect; the policies should not be assigned to individual projects. Option D is incorrect because policy A belongs at the organization level, and policies B, C, D, and E belong at the folder level.

An on-premises data warehouse is currently deployed using HBase on Hadoop. You want to migrate the database to GCP. You could continue to run HBase within a Cloud Dataproc cluster, but what other option would help ensure consistent performance and support the HBase API? A. Store the data in Cloud Storage B. Store the data in Cloud Bigtable C. Store the data in Cloud Datastore D. Store the data in Cloud Dataflow

B. The correct answer is B. The data could be stored in Cloud Bigtable, which provides consistent, scalable performance. Option A is incorrect because Cloud Storage is an object storage system, not a database. Option C is incorrect, since Cloud Datastore is a document-style NoSQL database and is not suitable for a data warehouse. Option D is incorrect; Cloud Dataflow is not a database.

A software-as-a-service (SaaS) company specializing in automobile IoT sensors collects streaming time-series data from tens of thousands of vehicles. The vehicles are owned and operated by 40 different companies, who are the primary customers of the SaaS company. The data will be stored in Bigtable using a multitenant database; that is, all customer data will be stored in the same database. The data sent from the IoT device includes a sensor ID, which is globally unique; a timestamp; and several metrics about engine efficiency. Each customer will query their own data only. Which of the following would you use as a row key? A. Customer ID, timestamp, sensor ID B. Customer ID, sensor ID, timestamp C. Sensor ID, timestamp, customer ID D. Sensor ID, customer ID, timestamp

B. The correct answer is B. The database is multitenant, so each tenant, or customer, will query only its own data, so all that data should be in close proximity. Using customer ID first accomplishes this. Next, the sensor ID is globally unique, so data would be distributed evenly across database storage segments when sorting based on sensor ID. Since this is time-series data, virtually all data arriving at the same time will have timestamps around the same time. Using a timestamp early in the key could create hotspots. Using sensor ID first would avoid hotspots but would require more scans to retrieve customer data because multiple customers' data would be stored in each data block.
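
A sketch of composing such a row key with the Cloud Bigtable Python client; the project, instance, table, column family, IDs, and delimiter are illustrative:

    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("iot-instance").table("engine-metrics")

    # customer ID first, then sensor ID, then timestamp
    row_key = "customer42#sensor-00017#1684000000".encode()
    row = table.direct_row(row_key)
    row.set_cell("metrics", "efficiency", b"0.87")
    row.commit()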

As part of a cloud migration effort, you are tasked with compiling an inventory of existing applications that will move to the cloud. One of the attributes that you need to track for each application is a description of its architecture. An application used by the finance department is written in Java, deployed on virtual machines, has several distinct services, and uses the SOAP protocol for exchanging messages. How would you categorize this architecture? A. Monolithic B. Service-oriented architecture (SOA) C. Microservice D. Serverless functions

B. The correct answer is B. The description of independent services, using SOAP, and deployed on virtual machines fits the definition of an SOA architecture. Option A is incorrect; since the application consists of multiple distinct services, it is not a monolithic architecture. Option C could be a possibility, but it is not the best fit since the application uses SOAP and is deployed on VMs. Option D is incorrect because the application does not use a serverless deployment.

A data analyst asks for your help on a problem that users are having that involves BigQuery. The data analyst has been granted permissions to read the tables in a particular dataset. However, when the analyst runs a query, an error message is returned. What role would you think is missing from the users' assigned roles? A. roles/bigquery.admin B. roles/bigquery.jobUser C. roles/bigquery.metadataViewer D. roles/bigquery.queryRunner

B. The correct answer is B. The roles/bigquery.jobUser role allows users to run jobs, including queries. Option A is incorrect because that would grant more permissions than needed. Option C is incorrect; it would allow access to table and dataset metadata only. Option D is incorrect; there is no such role.

The enterprise data warehouse has been migrated to BigQuery. The CTO wants to shut down the on-premises data warehouse but first wants to verify that the new cloud-based data warehouse is functioning correctly. What should you include in the verification process? A. Verify that schemas are correct and that data is loaded B. Verify schemas, data loads, transformations, and queries C. Verify that schemas are correct, data is loaded, and the backlog of feature requests is prioritized D. Verify schemas, data loads, transformations, queries, and that the backlog of feature requests is prioritized

B. The correct answer is B. The set of tasks to verify a correct data warehouse migration include verifying schemas, data loads, transformations, and queries, among other things. Option A is incorrect because more is required than just verifying schemas and data loads. Options C and D are incorrect; the backlog of feature requests is important but not relevant to verifying the migration.

You are designing a data pipeline to populate a sales data mart. The sponsor of the project has had quality control problems in the past and has defined a set of rules for filtering out bad data before it gets into the data mart. At what stage of the data pipeline would you implement those rules? A. Ingestion B. Transformation C. Storage D. Analysis

B. The correct answer is B. The transformation stage is where business logic and filters are applied. Option A is incorrect; ingestion is when data is brought into the GCP environment. Option C is incorrect; data should be processed and problematic data removed before storage. Option D is incorrect; by the analysis stage, data should be fully transformed and available for analysis.

Your company is about to start a huge project to analyze a large number of documents to redact sensitive information. You would like to follow Google-recommended best practices. What would you do first? A. Identify InfoTypes to use B. Prioritize the order of scanning, starting with the most at-risk data C. Run a risk analysis job first D. Extract a sample of data and apply all InfoTypes to it

B. The correct answer is B. You should prioritize the order of scanning, starting with the most at-risk data. Option A is incorrect; identifying InfoTypes to use comes later. Option C is incorrect; a risk analysis is done after inspection. Option D is incorrect; that is not the recommended first step.

A data modeler is designing a database to support ad hoc querying, including drilling down and slicing and dicing queries. What kind of data model is the data modeler likely to use? A. OLTP B. OLAP C. Normalized D. Graph

B. The correct answer is B; OLAP data models are designed to support drilling down and slicing and dicing. Option A is incorrect; OLTP models are designed to facilitate storing, searching, and retrieving individual records in a database. Option C is incorrect; OLAP databases often employ denormalization. Option D is incorrect; graph data models are used to model nodes and their relationships, such as those in social networks.

In GCP, each data chunk written to a storage system is encrypted with a data encryption key. How does GCP protect the data encryption key so that an attacker who gained access to the storage system storing the key could not use it to decrypt the data chunk? A. GCP writes the data encryption key to a hidden location on disk. B. GCP encrypts the data encryption key with a key encryption key. C. GCP stores the data encryption key in a secure Cloud SQL database. D. GCP applies an elliptic curve encryption algorithm for each data encryption key.

B. The correct answer is B; the data encryption key is encrypted using a key encryption key. Option A is incorrect; hiding the key on disk would not protect it from an attacker with access to the storage system. Option C is incorrect; keys are not stored in a relational database. Option D is incorrect; an elliptic curve encryption algorithm is not used for this purpose.

You have a large number of files that you would like to store for several years. The files will be accessed frequently by users around the world. You decide to store the data in multi-regional Cloud Storage. You want users to be able to view files and their metadata in a Cloud Storage bucket. What role would you assign to those users? (Assume you are practicing the principle of least privilege.) A. roles/storage.objectCreator B. roles/storage.objectViewer C. roles/storage.admin D. roles/storage.bucketList

B. The roles/storage.objectViewer role allows users to view objects and list metadata. Option A is incorrect; roles/storage.objectCreator allows a user to create an object only. Option C is incorrect; the roles/storage.admin role gives a user full control over buckets and objects, which is more privilege than needed. Option D is incorrect; there is no such role as roles/storage.bucketList.

Your company has been collecting vehicle performance data for the past year and now has 500 TB of data. Analysts at the company want to analyze the data to understand performance differences better across classes of vehicles. The analysts are advanced SQL users, but not all have programming experience. They want to minimize administrative overhead by using a managed service, if possible. What service might you recommend for conducting preliminary analysis of the data? A. Compute Engine B. Kubernetes Engine C. BigQuery D. Cloud Functions

C. BigQuery is an analytical database that supports SQL. Options A and B are incorrect because, although they could be used for ad hoc analysis, doing so would require more administrative overhead. Option D is incorrect; the Cloud Functions feature is intended for running short programs in response to events in GCP.

A group of data scientists is using Hadoop to store and analyze IoT data. They have decided to use GCP because they are spending too much time managing the Hadoop cluster. They are particularly interested in using services that would allow them to port their models and machine learning workflows to other clouds. What service would you use as a replacement for their existing platform? A. BigQuery B. Cloud Storage C. Cloud Dataproc D. Cloud Spanner

C. Cloud Dataproc is a managed Hadoop and Spark service; Spark has a machine learning library called MLlib, and Spark is an open source platform that can run in other clouds. Option A is incorrect; BigQuery is a managed data warehouse and analytical database that is not available in other clouds. Option B is incorrect; Cloud Storage is used for unstructured data and not a substitute for a Hadoop/Spark platform. Option D is incorrect; Cloud Spanner is used for global transaction-processing systems, not large-scale analytics and machine learning.

Your team is designing a database to store product catalog information. They have determined that you need to use a database that supports flexible schemas and transactions. What service would you expect to use? A. Cloud SQL B. Cloud BigQuery C. Cloud Firestore D. Cloud Storage

C. Cloud Firestore is a managed document database that supports flexible schemas and transactions. Option A is incorrect; Cloud SQL does not support flexible schemas. Option B is incorrect; BigQuery is an analytical database, not a NoSQL database with a flexible schema. Option D is incorrect; Cloud Storage is an object storage system, not a NoSQL database.

Your company is migrating from an on-premises pipeline that uses Apache Kafka for ingesting data and MongoDB for storage. What two managed services would you recommend as replacements for these? A. Cloud Dataflow and Cloud Bigtable B. Cloud Dataprep and Cloud Pub/Sub C. Cloud Pub/Sub and Cloud Firestore D. Cloud Pub/Sub and BigQuery

C. Cloud Pub/Sub is a good replacement for Kafka, and Cloud Firestore is a good replacement for MongoDB, which is another document database. Option A is incorrect; Cloud Dataflow is for stream and batch processing, not ingestion. Option B is incorrect; there is no database in that option. Option D is incorrect; BigQuery is an analytical database and not a good replacement for a document database such as MongoDB.

An airline is moving its luggage-tracking applications to Google Cloud. There are many requirements, including support for SQL and strong consistency. The database will be accessed by users in the United States, Europe, and Asia. The database will store approximately 50 TB in the first year and grow at approximately 10 percent a year after that. What managed database service would you recommend? A. Cloud SQL B. BigQuery C. Cloud Spanner D. Cloud Dataflow

C. Cloud Spanner is a globally scalable, strongly consistent relational database that can be queried using SQL. Option A is incorrect because Cloud SQL does not scale globally the way Cloud Spanner does, and it does not support storing 50 TB of data. Option B is incorrect; the requirements call for a transaction processing system, and BigQuery is designed for analytics and data warehousing. Option D is incorrect; Cloud Dataflow is a stream and batch processing service.

The finance department at your company has been archiving data on premises. They no longer want to maintain a costly dedicated storage system. They would like to store up to 300 TB of data for 10 years. The data will likely not be accessed at all. They also want to minimize cost. What storage service would you recommend? A. Cloud Storage multi-regional storage B. Cloud Storage Nearline storage C. Cloud Storage Coldline storage D. Cloud Bigtable

C. Cloud Storage Coldline is the lowest-cost option, and it is designed for data that is accessed less than once a year. Options A and B are incorrect because they cost more than Coldline storage. Option D is incorrect because Cloud Bigtable is a low-latency, wide-column database.

Your company wants to build a data lake to store data in its raw form for extended periods of time. The data lake should provide access controls, virtually unlimited storage, and the lowest cost possible. Which GCP service would you suggest? A. Cloud Bigtable B. BigQuery C. Cloud Storage D. Cloud Spanner

C. Cloud Storage is an object storage system that meets all of the requirements. Option A is incorrect; Cloud Bigtable is a wide-column database. Option B is incorrect; BigQuery is an analytical database. Option D is incorrect; Cloud Spanner is a horizontally scalable relational database.

Your company has been losing market share because competitors are attracting your customers with a more personalized experience on their e-commerce platforms, including providing recommendations for products that might be of interest to them. The CEO has stated that your company will provide equivalent services within 90 days. What GCP service would you use to help meet this objective? A. Cloud Bigtable B. Cloud Storage C. AI Platform D. Cloud Datastore

C. The AI Platform is a managed service for machine learning, which is needed to provide recommendations. Options A and B are incorrect because, although they are useful for storing data, they do not provide managed machine learning services. Option D is incorrect; Cloud Datastore is a NoSQL database.

A health and wellness startup in Canada has been more successful than expected. Investors are pushing the founders to expand into new regions outside of North America. The CEO and CTO are discussing the possibility of expanding into Europe. The app offered by the startup collects personal information, storing some locally on the user's device and some in the cloud. What regulation will the startup need to plan for before expanding into the European market? A. HIPAA B. PCI-DSS C. GDPR D. SOX

C. The General Data Protection Regulation (GDPR) is a European Union regulation protecting the personal information of persons in, and citizens of, the European Union. Option A is incorrect; HIPAA is a U.S. healthcare regulation. Option B is incorrect; PCI-DSS is a global security standard imposed by major brands in the credit card industry, not a government regulation. Although it is not law, the standard may apply to the startup in Europe if it accepts payment cards from brands that require PCI-DSS compliance. Option D is incorrect; SOX is a U.S. regulation that applies to publicly traded companies in the United States, including wholly owned subsidiaries and foreign companies that are publicly traded and do business in the United States. The company may already be subject to that regulation, and expanding to Europe will not change its status.

A genomics research institute is developing a platform for analyzing data related to genetic diseases. The genomics data is in a specialized format known as FASTQ, which stores nucleotide sequences and quality scores in a text format. Files may be up to 400 GB and are uploaded in batches. Once the files finish uploading, an analysis pipeline runs, reads the data in the FASTQ file, and outputs data to a database. What storage system is a good option for storing the uploaded FASTQ data? A. Cloud Bigtable B. Cloud Datastore C. Cloud Storage D. Cloud Spanner

C. The correct answer is C because the FASTQ files are unstructured; their internal format is not used to organize storage structures. Also, at up to 400 GB, the files are too large to store efficiently as objects in a database. Options A and B are incorrect because a NoSQL database is not needed for the given requirements. Similarly, there is no need to store the data in a structured database like Cloud Spanner, so Option D is incorrect.

As a member of a team of game developers, you have been tasked with devising a way to track players' possessions. Possessions may be purchased from a catalog, traded with other players, or awarded for game activities. Possessions are categorized as clothing, tools, books, and coins. Players may have any number of possessions of any type. Players can search for other players who have particular possession types to facilitate trading. The game designer has informed you that there will likely be new types of possessions and ways to acquire them in the future. What kind of a data store would you recommend using? A. Transactional database B. Wide-column database C. Document database D. Analytic database

C. The correct answer is C because the requirements call for a semi-structured schema. You will need to search players' possessions and not just look them up using a single key, because of the requirement to facilitate trading. Option A is not correct; transactional databases have fixed schemas, and this use case calls for a semi-structured schema. Option B is incorrect because wide-column databases do not support the secondary indexes needed to search on non-key attributes. Option D is incorrect; analytical databases are designed for structured data.

A team of analysts has collected several terabytes of telemetry data in CSV datasets. They plan to store the datasets in GCP and query and analyze the data using SQL. Which of the following is the most appropriate GCP storage service for the datasets? A. Cloud SQL B. Cloud Spanner C. BigQuery D. Bigtable

C. The correct answer is C, BigQuery, which is a managed analytical database service that supports SQL and scales to petabyte volumes of data. Options A and B are incorrect because both are used for transaction processing applications, not analytics. Option D is incorrect because Bigtable does not support SQL.
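For illustration, a minimal sketch of loading CSV files into BigQuery and querying them with SQL, assuming the google-cloud-bigquery Python client; the project, dataset, table, and Cloud Storage paths are hypothetical.

```python
# A minimal sketch, assuming the google-cloud-bigquery client.
# Project, dataset, table, and GCS paths are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.telemetry.readings"

# Load CSV files from Cloud Storage, letting BigQuery infer the schema.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
client.load_table_from_uri(
    "gs://telemetry-drop/readings-*.csv", table_id, job_config=job_config
).result()  # wait for the load job to finish

# Analyze the data with SQL.
query = f"""
    SELECT device_id, AVG(temperature) AS avg_temp
    FROM `{table_id}`
    GROUP BY device_id
"""
for row in client.query(query).result():
    print(row.device_id, row.avg_temp)
```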

A large enterprise using GCP has recently acquired a startup that has an IoT platform. The acquiring company wants to migrate the IoT platform from an on-premises data center to GCP and wants to use Google Cloud managed services whenever possible. What GCP service would you recommend for ingesting IoT data? A. Cloud Storage B. Cloud SQL C. Cloud Pub/Sub D. BigQuery streaming inserts

C. The correct answer is C, Cloud Pub/Sub, which is a scalable, managed messaging queue that is typically used for ingesting high-volume streaming data. It is designed to scale for high-volume writes and has other features useful for stream processing, such as message acknowledgment. Option A is incorrect; Cloud Storage does not support streaming ingestion. Option B is incorrect; Cloud SQL is not designed to support the high volume of low-latency writes needed in IoT applications. Option D is incorrect; although BigQuery has streaming inserts, the database is designed for analytic operations.
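For illustration, a minimal sketch of ingesting an IoT reading through Cloud Pub/Sub, assuming the google-cloud-pubsub Python client; the project and topic names are hypothetical.

```python
# A minimal sketch, assuming the google-cloud-pubsub client.
# Project and topic names are hypothetical.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "iot-telemetry")

reading = {"sensor_id": "s-42", "temp_c": 21.5, "ts": "2023-05-01T12:00:00Z"}

# publish() returns a future; result() blocks until Pub/Sub accepts the message.
future = publisher.publish(topic_path, json.dumps(reading).encode("utf-8"))
print("Published message ID:", future.result())
```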

A new engineer in your group asks for your help with creating a managed instance group. The engineer knows the configuration and the minimum and maximum number of instances in the MIG. What is the next thing the engineer should do to create the desired MIG? A. Create each of the initial members of the instance group using gcloud compute instances create commands B. Create each of the initial members of the instance group using the Cloud Console C. Create an instance template using the gcloud compute instance-templates create command D. Create an instance template using the cbt create instance-template command

C. The correct answer is C, defining an instance template using the gcloud compute instance-templates create command. Options A and B are incorrect, since there is no need to create each instance individually. Option D is incorrect. cbt is the command-line utility for working with Cloud Bigtable.

Auditors have informed your company CFO that to comply with a new regulation, your company will need to ensure that financial reporting data is kept for at least three years. The CFO asks for your advice on how to comply with the regulation with the least administrative overhead. What would you recommend? A. Store the data on Coldline storage B. Store the data on multi-regional storage C. Define a data retention policy D. Define a lifecycle policy

C. The correct answer is C. A data retention policy will ensure that files are not deleted from a storage bucket until they reach a specified age. Options A and B are incorrect because files in Coldline or multi-regional storage can be deleted unless a data retention policy is in place. Option D is incorrect because a lifecycle policy can change an object's storage class or delete it, but it does not prevent deletion.
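For illustration, a minimal sketch of applying a three-year retention policy to a bucket, assuming the google-cloud-storage Python client; the bucket name is hypothetical. Objects cannot be deleted or overwritten until they are older than the retention period.

```python
# A minimal sketch, assuming the google-cloud-storage client.
# The bucket name is hypothetical.
from google.cloud import storage

THREE_YEARS_IN_SECONDS = 3 * 365 * 24 * 60 * 60

client = storage.Client()
bucket = client.get_bucket("financial-reports-example")
bucket.retention_period = THREE_YEARS_IN_SECONDS  # blocks deletes until this age
bucket.patch()  # persist the policy change
```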

To ensure high availability of a mission-critical application, your team has determined that it needs to run the application in multiple regions. If the application becomes unavailable in one region, traffic from that region should be routed to another region. Since you are designing a solution for this set of requirements, what would you expect to include? A. Cloud Storage bucket B. Cloud Pub/Sub topic C. Global load balancer D. HA VPN

C. The correct answer is C. A global load balancer is needed to distribute workload across multiple regions. Options A and B are incorrect because there is no indication in the requirements that object storage or a message queue is required. Option D is incorrect because there is no indication of a hybrid cloud that would necessitate a VPN or dedicated interconnect.

A group of data scientists wants to preprocess a large dataset that will be delivered in batches. The data will be written to Cloud Storage and processed by custom applications running on Compute Engine instances. They want to process the data as quickly as possible when it arrives and are willing to pay the cost of running up to 10 instances at a time. When a batch is finished, they'd like to reduce the number of instances to 1 until the next batch arrives. The batches do not arrive on a known schedule. How would you recommend that they provision Compute Engine instances? A. Use a Cloud Function to monitor Stackdriver metrics, add instances when CPU utilization peaks, and remove them when demand drops. B. Use a script running on one dedicated instance to monitor Stackdriver metrics, add instances when CPU utilization peaks, and remove them when demand drops. C. Use managed instance groups with a minimum of 1 instance and a maximum of 10. D. Use Cloud Dataproc with an autoscaling policy set to have a minimum of 1 instance and a maximum of 10.

C. The correct answer is C. A managed instance group will provision instances as required to meet the load and stay within the bounds set for the number of instances. Option A is incorrect; Cloud Functions are for event-driven processing, not continually monitoring metrics. Option B is incorrect because it is not the most efficient way to scale instances. Option D is incorrect, since the requirements call for Compute Engine instances, not a Hadoop/Spark cluster.

You have created a managed instance group in Compute Engine to run a high-performance computing application. The application will read source data from a Cloud Storage bucket and write results to another bucket. The application will run whenever new data is uploaded to Cloud Storage via a Cloud Function that invokes the script to start the job. You will need to assign the role roles/storage.objectCreator to an identity so that the application can write the output data to Cloud Storage. To what kind of identity would you assign the role? A. User. B. Group. C. Service account. D. You wouldn't. The role would be assigned to the bucket.

C. The correct answer is C. A service account associated with the application should have the roles/storage.objectCreator assigned to it. Options A and B are incorrect; those are identities associated with actual users. Option D is incorrect; access control lists can be assigned to a bucket, but roles are assigned to identities.

You have been hired to consult with a startup that is developing software for self-driving vehicles. The company's product uses machine learning to predict the trajectory of persons and vehicles. Currently, the software is being developed using 20 vehicles, all located in the same city. IoT data is sent from vehicles every 60 seconds to a MySQL database running on Cloud SQL using an instance with 8 vCPUs and 32 GB of memory. The startup wants to review their architecture and make any necessary changes to support tens of thousands of self-driving vehicles, all transmitting IoT data every second. The vehicles will be located across North America and Europe. Approximately 4 KB of data is sent in each transmission. What changes to the architecture would you recommend? A. None. The current architecture is well suited to the use case. B. Replace Cloud SQL with Cloud Spanner. C. Replace Cloud SQL with Bigtable. D. Replace Cloud SQL with Cloud Datastore.

C. The correct answer is C. Bigtable is the best storage service for IoT data, especially when a large number of devices will be sending data at short intervals. Option A is incorrect, because Cloud SQL is designed for transaction processing at a regional level. Option B is incorrect because Cloud Spanner is designed for transaction processing, and although it scales to global levels, it is not the best option for IoT data. Option D is incorrect because there is no need for indexed, semi-structured data.
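For illustration, a minimal sketch of writing vehicle telemetry to Cloud Bigtable, assuming the google-cloud-bigtable Python client; the project, instance, table, and column names are hypothetical. The row key combines a vehicle ID and a timestamp so that reads of one vehicle's recent data scan contiguous rows.

```python
# A minimal sketch, assuming the google-cloud-bigtable client.
# Project, instance, table, and column names are hypothetical.
import datetime
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("vehicle-telemetry").table("positions")

now = datetime.datetime.utcnow()
row_key = f"vehicle-0042#{now.isoformat()}".encode("utf-8")

row = table.direct_row(row_key)
row.set_cell("loc", "lat", b"45.5231", timestamp=now)
row.set_cell("loc", "lon", b"-122.6765", timestamp=now)
row.commit()
```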

A multinational corporation is building a global inventory database. The database will support OLTP-type transactions at a global scale. Which of the following would you consider as possible databases for the system? A. Cloud SQL and Cloud Spanner B. Cloud SQL and Cloud Datastore C. Cloud Spanner only D. Cloud Datastore only

C. The correct answer is C. Cloud Spanner is the only globally scalable relational database for OLTP applications. Options A and B are incorrect because Cloud SQL will not meet the scaling requirements. Options B and D are incorrect because Cloud Datastore, a document database, is not designed for relational OLTP workloads.

The data modelers who built your company's enterprise data warehouse are asking for your guidance to migrate the data warehouse to BigQuery. They understand that BigQuery is an analytical database that uses SQL as a query language. They also know that BigQuery supports joins, but reports currently run on the data warehouse are consuming significant amounts of CPU because of the number and scale of joins. What feature of BigQuery would you suggest they consider in order to reduce the number of joins required? A. Colossus filesystem B. Columnar data storage C. Nested and repeated fields D. Federated storage

C. The correct answer is C. Denormalization reduces the number of joins required, and nested and repeated fields can be used to store related data in a single row. Option A is incorrect; BigQuery does use the Colossus filesystem, but that does not change the number of joins. Option B is incorrect; BigQuery does use columnar storage, but that does not affect the number of joins. Option D is incorrect; federated storage allows BigQuery to access data stored outside of BigQuery, but it does not change the need for joins.
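For illustration, a minimal sketch of querying a table whose line items are stored as a nested, repeated field, assuming the google-cloud-bigquery Python client; the table and column names are hypothetical. Because each order row embeds its own line items, no join is needed.

```python
# A minimal sketch, assuming the google-cloud-bigquery client.
# Table and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT
      o.order_id,
      SUM(item.quantity * item.unit_price) AS order_total
    FROM `my-project.sales.orders` AS o,
         UNNEST(o.line_items) AS item  -- flatten the repeated field; no join
    GROUP BY o.order_id
"""
for row in client.query(query).result():
    print(row.order_id, row.order_total)
```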

Sensors on manufacturing machines send performance metrics to a cloud-based service that uses the data to build models that predict when a machine will break down. Metrics are sent in messages. Messages include a sensor identifier, a timestamp, a machine type, and a set of measurements. Different machine types have different characteristics related to failures, and machine learning engineers have determined that for highest accuracy, each machine type should have its own model. Once messages are written to a message broker, how should they be routed to instances of a machine learning service? A. Route randomly to any instance that is building a machine learning model B. Route based on the sensor identifier so that identifiers in close proximity are used in the same model C. Route based on machine type so that only data from one machine type is used for each model D. Route based on timestamp

C. The correct answer is C. Machines of different types have different failure characteristics and therefore will have their own models. Option A is incorrect; randomly distributing messages will mix metrics from different types of machines. Option B is incorrect because identifiers in close proximity are not necessarily from machines of the same type. Option D is incorrect; routing based on timestamp will mix metrics from different machine types.
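For illustration, one way to implement this routing is to publish the machine type as a Pub/Sub message attribute and give each model's consumers a filtered subscription. This is a hedged sketch assuming the google-cloud-pubsub Python client; the project, topic, subscription, and attribute names are hypothetical.

```python
# A hedged sketch, assuming the google-cloud-pubsub client.
# Project, topic, subscription, and attribute names are hypothetical.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()
topic_path = publisher.topic_path("my-project", "machine-metrics")

# Publisher side: attach the machine type as a message attribute.
publisher.publish(topic_path, b'{"vibration": 0.8}', machine_type="lathe")

# Subscriber side: a subscription that receives only lathe metrics.
subscriber.create_subscription(
    request={
        "name": subscriber.subscription_path("my-project", "lathe-model"),
        "topic": topic_path,
        "filter": 'attributes.machine_type = "lathe"',
    }
)
```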

Your consulting company is contracted to help an enterprise customer negotiate a contract with a SaaS provider. Your client wants to ensure that they will have access to the SaaS service and it will be functioning correctly with only minimal downtime. What metric would you use when negotiating with the SaaS provider to ensure that your client's reliability requirements are met? A. Average CPU utilization B. A combination of CPU and memory utilization C. Mean time between failure D. Mean time to recovery

C. The correct answer is C. Mean time between failure is used for measuring reliability. Options A and B are incorrect because they are related to utilization and efficiency but unrelated to reliability. Option D is incorrect, since mean time to recovery is used as a metric for restoring service after an outage. Mean time to recovery is important and would likely be included in negotiations, but it is not used as a measure of reliability.

As part of a migration to the cloud, your department wants to restructure a distributed application that currently runs several services on a cluster of virtual machines. Each service implements several functions, and it is difficult to update one function without disrupting operations of the others. Some of the services require third-party libraries to be installed. Your company has standardized on Docker containers for deploying new services. What kind of architecture would you recommend? A. Monolithic B. Hub-and-spoke C. Microservices D. Pipeline architecture

C. The correct answer is C. Microservices would allow each function to be deployed independently in its own container. Option A is incorrect; a monolithic architecture would make the update problems worse. Option B is incorrect, because hub-and-spoke is a message broker pattern. Option D is incorrect; pipelines are abstractions for thinking about workflows—they are not a type of architecture.

In addition to concerns about the rising costs of maintaining an on-premises data warehouse, the CTO of your company has complained that new features and reporting are not being rolled out fast enough. The lack of adequate business intelligence has been blamed for a drop in sales in the last quarter. Your organization is incurring what kind of cost because of the backlog? A. Capital B. Operating C. Opportunity D. Fiscal

C. The correct answer is C. The company is incurring an opportunity cost: if it had migrated to a modern cloud-based data warehouse, the team would have had the opportunity to develop new reports and features. Options A and B are incorrect; capital and operating costs both involve actual expenditures of funds, which is not what the backlog represents. Option D is not a type of cost.

A group of data scientists have uploaded multiple time-series datasets to BigQuery over the last year. They have noticed that their queries, which select up to six columns, apply four SQL functions, and group by the day of a timestamp, are taking longer to run and are incurring higher BigQuery costs as they add data. They do not understand why this is the case since they typically work only with the most recent set of data loaded. What would you recommend they consider in order to reduce query latency and query costs? A. Sort the data by time order before loading B. Stop using Legacy SQL and use Standard SQL dialect C. Partition the table and use clustering D. Add more columns to the SELECT statement to use data fetched by BigQuery more efficiently

C. The correct answer is C. The queries are likely scanning more data than needed. Partitioning the table will enable BigQuery to scan only the data within the relevant partitions, and clustering will order how column data is stored within each partition. Option A is incorrect because BigQuery organizes data according to table configuration parameters, and there is no indication that queries need ordered results. Option B is incorrect; Standard SQL dialect has more SQL features, but none of them are needed here, and it is unlikely that the query execution plan would be more efficient with Standard SQL. Option D is incorrect; selecting more columns would actually require more data to be scanned and fetched, because BigQuery uses a columnar storage model.
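For illustration, a minimal sketch of rebuilding the table as a partitioned, clustered table with BigQuery DDL, assuming the google-cloud-bigquery Python client; the project, dataset, and column names are hypothetical.

```python
# A minimal sketch, assuming the google-cloud-bigquery client.
# Project, dataset, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
ddl = """
    CREATE TABLE `my-project.telemetry.events_partitioned`
    PARTITION BY DATE(event_ts)   -- date-filtered queries scan fewer partitions
    CLUSTER BY series_id          -- orders column data within each partition
    AS SELECT * FROM `my-project.telemetry.events`
"""
client.query(ddl).result()
```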

The CTO has asked you to participate in a prototype project to provide better privacy controls. The CTO asks you to run a risk analysis job on a text file that has been inspected by the Data Loss Prevention API. What is the CTO interested in knowing? A. The number of times sensitive information is redacted B. The percentage of text that is redacted C. The likelihood that the data can be re-identified D. What InfoType patterns were detected

C. The correct answer is C. The risk analysis job assesses the likelihood that redacted data can be re-identified. Option A and Option B are incorrect. The results are not measures of counts or percent of times that data is redacted. Option D is incorrect. The result is not a list of InfoType patterns detected.

You have been tasked with creating a pilot project in GCP to demonstrate the feasibility of migrating workloads from an on-premises Hadoop cluster to Cloud Dataproc. Three other engineers will work with you. None of the data that you will use contains sensitive information. You want to minimize the amount of time that you spend on administering the development environment. What would you use to control access to resources in the development environment? A. Predefined roles B. Custom roles C. Primitive roles D. Access control lists

C. The correct answer is C. This is an appropriate use case for primitive roles because there are few users working in a development environment, not production, and working with data that does not contain sensitive information. In this case, there is no need for fine-grained access controls. Options A and B are incorrect because they would require more administration, and fine-grained access controls are not needed. Option D is incorrect; access control lists are used with Cloud Storage resources and should be used only when roles are insufficient.

A team of developers wants to create standardized patterns for processing IoT data. Several teams will use these patterns. The developers would like to support collaboration and facilitate the use of patterns for building streaming data pipelines. What component should they use? A. Cloud Dataflow Python Scripts B. Cloud Dataproc PySpark jobs C. Cloud Dataflow templates D. Cloud Dataproc templates

C. The correct answer is C. Use Cloud Dataflow templates to specify the pattern and provide parameters for users to customize the template. Option A is incorrect since this would require users to customize the code in the script. Options B and D are incorrect because Cloud Dataproc is not a good fit for this requirement; Cloud Dataproc workflow templates orchestrate jobs on a cluster rather than define reusable, parameterized streaming pipelines.
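For illustration, a minimal sketch of a parameterized pipeline that could be staged as a classic Cloud Dataflow template, assuming the Apache Beam Python SDK; the option names and file paths are hypothetical. ValueProvider arguments are resolved when the template is run, not when it is built, so one template can serve many teams.

```python
# A minimal sketch, assuming the Apache Beam Python SDK.
# Option names and file paths are hypothetical.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class IoTPatternOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Supplied when the template is run, not when it is built.
        parser.add_value_provider_argument("--input_path")
        parser.add_value_provider_argument("--output_path")


options = IoTPatternOptions()
with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText(options.input_path)
        | "Normalize" >> beam.Map(str.strip)
        | "Write" >> beam.io.WriteToText(options.output_path)
    )
```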

Your startup is creating an app to help students with math homework. The app will track assignments, how long the student takes to answer a question, the number of incorrect answers, and so on. The app will be used by students ages 9 to 14. You expect to market the app in the United States. With which of the following regulations must you comply? A. HIPAA B. GDPR C. COPPA D. FedRAMP

C. The correct answer is C; COPPA is a regulation that governs the collection of data from children under the age of 13. Option A is incorrect; HIPAA is a healthcare regulation. Option B is incorrect; GDPR is a European Union privacy regulation. Option D is incorrect; FedRAMP applies to cloud providers supplying services to U.S. federal agencies.

You are using Cloud Pub/Sub to buffer records from an application that generates a stream of data based on user interactions with a website. The messages are read by another service that transforms the data and sends it to a machine learning model that will use it for training. A developer has just released some new code, and you notice that messages are sent repeatedly at 10-minute intervals. What might be the cause of this problem? A. The new code release changed the subscription ID. B. The new code release changed the topic ID. C. The new code disabled acknowledgments from the consumer. D. The new code changed the subscription from pull to push.

C. The correct answer is C; the new code disabled message acknowledgments. That caused Cloud Pub/Sub to consider the messages outstanding until the acknowledgment deadline expired and then resend them. Options A and B are incorrect; changing the subscription or topic IDs would cause problems, but not the kind described. Option D is incorrect because the type of subscription does not influence whether messages are delivered multiple times.
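For illustration, a minimal sketch of a consumer that acknowledges messages, assuming the google-cloud-pubsub Python client; the project, subscription, and process() helper are hypothetical. Omitting message.ack() produces exactly the redelivery behavior described above.

```python
# A minimal sketch, assuming the google-cloud-pubsub client.
# Project, subscription, and the process() helper are hypothetical.
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "training-data")

def process(data: bytes) -> None:
    print("transforming", data)  # stand-in for the real transformation

def callback(message):
    process(message.data)
    message.ack()  # omit this and Pub/Sub redelivers after the ack deadline

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
streaming_pull.result()  # block and process messages indefinitely
```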

You have built a deep learning neural network to perform multiclass classification. You find that the model is overfitting. Which of the following would not be used to reduce overfitting? A. Dropout B. L2 Regularization C. L1 Regularization D. Logistic regression

D. Logistic regression is a binary classifier algorithm. Options A, B, and C are all regularization techniques.
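For illustration, a minimal sketch of dropout and L2 regularization in a multiclass classifier, assuming TensorFlow/Keras; the layer sizes and ten-class output are hypothetical.

```python
# A minimal sketch, assuming TensorFlow/Keras.
# Layer sizes and the ten-class output are hypothetical.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        128,
        activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(0.01),  # L2 weight penalty
    ),
    tf.keras.layers.Dropout(0.5),  # randomly drop 50% of units during training
    tf.keras.layers.Dense(10, activation="softmax"),  # multiclass output
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```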

Your company is subject to financial industry regulations that require all customer data to be encrypted when persistently stored. Your CTO has tasked you with assessing options for encrypting the data. What must you do to ensure that applications processing protected data encrypt it when it is stored on disk or SSD? A. Configure a database to use database encryption. B. Configure persistent disks to use disk encryption. C. Configure the application to use application encryption. D. Nothing. Data is encrypted at rest by default.

D. Option D is correct. You do not need to configure any settings to have data encrypted at rest in GCP. Options A, B, and C are all incorrect because no configuration is required.

You will be developing machine learning models using sensitive data. Your company has several policies regarding protecting sensitive data, including requiring enhanced security on virtual machines (VMs) processing sensitive data. Which GCP service would you look to for meeting those requirements? A. Identity and access management (IAM) B. Cloud Key Management Service C. Cloud Identity D. Shielded VMs

D. Shielded VMs are instances with additional security controls. Option A is incorrect; IAM is used for managing identities and authorizations. Option B is incorrect; the Cloud Key Management Service is a service for managing encryption keys. Option C is incorrect; Cloud Identity is used for authentication.

A genomics research institute is developing a platform for analyzing data related to genetic diseases. The genomics data is in a specialized format known as FASTQ, which stores nucleotide sequences and quality scores in a text format. Once the files finish uploading, an analysis pipeline runs, reads the data in the FASTQ file, and outputs data to a database. The output is in tabular structure, the data is queried using SQL, and typically queries retrieve only a small number of columns but many rows. What database would you recommend for storing the output of the workflow? A. Cloud Bigtable B. Cloud Datastore C. Cloud Storage D. BigQuery

D. The correct answer is D because the output is structured, will be queried with SQL, and will retrieve a large number of rows but few columns, making this a good use case for columnar storage, which BigQuery uses. Options A and B are not good options because neither database supports SQL. Option C is incorrect because Cloud Storage is used for unstructured data and does not support querying the contents of objects.

A software developer asks your advice about storing data. The developer has hundreds of thousands of 10 KB JSON objects that need to be searchable by most attributes in the JSON structure. What kind of NoSQL database would you recommend? A. Key-value database B. Analytical database C. Wide-column database D. Document database

D. The correct answer is D. A document database could store the volume of data, and it provides for indexing on columns other than a single key. Options A and C do not support indexing on non-key attributes. Option B is incorrect because analytical is not a type of NoSQL database.

A team of game developers is using Cloud Firestore to store player data, including character description, character state, and possessions. Descriptions are up to a 60-character alphanumeric string that is set when the character is created and not updated. Character state includes health score, active time, and passive time. When they are updated, they are all updated at the same time. Possessions are updated whenever the character acquires or loses a possession. Possessions may be complex objects, such as bags of items, where each item may be a simple object or another complex object. Simple objects are described with a character string. Complex objects have multiple properties. How would you model player data in Cloud Firestore? A. Store description and character state as strings and possessions as entities B. Store description, character state, and possessions as strings C. Store description, character state, and possessions as entities D. Store description as a string; character state as an entity with properties for health score, active time, and passive time; and possessions as an entity that may have embedded entities

D. The correct answer is D. Description can be represented as a string. Character state consists of three properties that are accessed together, so they can be grouped into an entity. Possessions need a recursive representation, since a possession can include sets of other possessions. Options A and B are incorrect; character state requires multiple properties, so it should not be represented in a single string. Option B is also incorrect because possessions are complex objects and should not be represented as strings. Option C is incorrect; description is an atomic property and does not need to be modeled as an entity.
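For illustration, a minimal sketch of the recommended model as a single Cloud Firestore document, assuming the google-cloud-firestore Python client; the collection, field, and item names are hypothetical.

```python
# A minimal sketch, assuming the google-cloud-firestore client.
# Collection, field, and item names are hypothetical.
from google.cloud import firestore

db = firestore.Client()
db.collection("players").document("player-123").set({
    "description": "Stormcaller, level 12 ranger",  # plain string
    "state": {                                      # entity with three properties
        "health": 87,
        "active_time": 3600,
        "passive_time": 1200,
    },
    "possessions": {                                # entity with embedded entities
        "satchel": {
            "type": "bag",
            "items": {
                "torch": {"type": "tool"},
                "coin_pouch": {"type": "bag",
                               "items": {"coin": {"count": 30}}},
            },
        },
    },
})
```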

While the CTO is interested in having your enterprise data warehouse migrated to the cloud as quickly as possible, the CTO is particularly risk averse because of errors in reporting in the past. Which prioritization strategy would you recommend? A. Exploiting current opportunities B. Migrating analytical workloads first C. Focusing on the user experience first D. Prioritizing low-risk use cases first

D. The correct answer is D. Prioritizing low-risk use cases will allow the team to make progress on migrating while minimizing the impact if something goes wrong. Options A, B, and C are incorrect because they do not give priority to minimizing risk; other factors are prioritized in each case.

A developer is planning a mobile application for your company's customers to use to track information about their accounts. The developer is asking for your advice on storage technologies. In one case, the developer explains that they want to write messages each time a significant event occurs, such as the client opening, viewing, or deleting an account. This data is collected for compliance reasons, and the developer wants to minimize administrative overhead. What system would you recommend for storing this data? A. Cloud SQL using MySQL B. Cloud SQL using PostgreSQL C. Cloud Datastore D. Stackdriver Logging

D. The correct answer is D. Stackdriver Logging is the best option because it is a managed service designed for storing logging data. Neither Option A nor B is as good a fit because the developer would have to design and maintain a relational data model and user interface to view and manage log data. Option C, Cloud Datastore, would not require a fixed data model, but it would still require the developer to create and maintain a user interface to manage log events.
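For illustration, a minimal sketch of writing a structured audit-style event to Stackdriver Logging (now Cloud Logging), assuming the google-cloud-logging Python client; the log name and event payload are hypothetical.

```python
# A minimal sketch, assuming the google-cloud-logging client.
# The log name and event payload are hypothetical.
from google.cloud import logging

client = logging.Client()
logger = client.logger("account-events")

logger.log_struct({
    "event": "account_viewed",
    "account_id": "acct-987",
    "client_ip": "203.0.113.7",
})
```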

Your department is planning to expand the use of BigQuery. The CFO has asked you to investigate whether the company should invest in flat-rate billing for BigQuery. What tools and data would you use to help answer that question? A. Stackdriver Logging and audit log data B. Stackdriver Logging and CPU utilization metrics C. Stackdriver Monitoring and CPU utilization metrics D. Stackdriver Monitoring and slot utilization metrics

D. The correct answer is D. Stackdriver Monitoring collects metrics, and the slot metrics are the ones that show resource utilization related to queries. Options A and B are incorrect; logging does not collect the metrics that are needed. Option C is incorrect because CPU utilization is not a metric associated with a serverless service like BigQuery.

The business owners of a data warehouse have determined that the current design of the data warehouse is not meeting their needs. In addition to having data about the state of systems at certain points in time, they need to know about all the times that data changed between those points in time. What kind of data warehousing pipeline should be used to meet this new requirement? A. ETL B. ELT C. Extraction and load D. Change data capture

D. The correct answer is D. With change data capture, each change in a source system is captured and recorded in a data store. Options A, B, and C all capture the state of source systems at a point in time and do not capture changes between those times.

You need to run several MapReduce jobs on Hadoop along with one Pig job and four PySpark jobs. When you ran the jobs on premises, you used the department's Hadoop cluster. Now you are running the jobs in GCP. What configuration for running these jobs would you recommend? A. Create a single cluster and deploy Pig and Spark in the cluster. B. Create one persistent cluster for the Hadoop jobs, one for the Pig job, and one for the PySpark jobs. C. Create one cluster for each job, and keep the cluster running continuously so that you do not need to start a new cluster for each job. D. Create one cluster for each job and shut down the cluster when the job completes.

D. The correct answer is D. You should create an ephemeral cluster for each job and delete the cluster after the job completes. Option A is incorrect because it is a more complicated configuration. Option B is incorrect because it keeps the clusters running instead of shutting them down after the jobs complete. Option C is incorrect because it keeps the clusters running after the jobs complete.

You are querying a Cloud Firestore collection of order entities, searching for all orders that were created today and have a total sales amount greater than $100. You have not excluded any indexes, and you have not created any additional indexes using index.yaml. What do you expect the results to be? A. A set of all orders created today with a total sales amount greater than $100 B. A set of orders created today with any total sales amount C. A set of orders with a total sales amount greater than $100 and any sales date D. No entities returned

D. The correct answer is D: no entities are returned. The query requires a composite index, but the question states that no additional indexes were created. All other answers are wrong because querying by a property other than a key will return only entities found in an index.

A group of attorneys has hired you to help them categorize over a million documents in an intellectual property case. The attorneys need to isolate documents that are relevant to a patent that the plaintiffs argue has been infringed. The attorneys have 50,000 labeled examples of documents, and when the model is evaluated on training data, it performs quite well. However, when evaluated on test data, it performs quite poorly. What would you try to improve the performance? A. Perform feature engineering B. Perform validation testing C. Add more data D. Regularization

D. This is a case of the model overfitting the training data. Regularization is a set of methods used to reduce the risk of overfitting. Option A is incorrect; feature engineering could be used to create new features if the existing set of features was not sufficient, but that is not a problem in this case. Option B is incorrect; validation testing will not improve the quality of the model, but it will measure the quality. Option C is incorrect; the existing dataset has a sufficient number of training instances.

