Google Cloud Platform DevOps Engineer
Which IAM roles allow an end user to create an export sink of logs from Cloud Logging?
1) Logs Configuration Writer 2) Logging Admin 3) Project Owner
You are running a production application on Compute Engine. You want to monitor the key metrics of CPU, Memory, and Disk I/O time. You want to ensure that the metrics are visible to the team and explorable if an issue occurs. What should you do? (Choose 2)
1) Set up alerts in Cloud Monitoring for key metrics breaching defined thresholds. 2) Create a Dashboard with key metrics and indicators that can be viewed by the team.
What happens when an error budget is exceeded?
1) releases are temporarily halted 2) system testing and development are expanded 3) performance is improved
What is the formula for calculating an error budget?
Error budget = 100% - SLO. For example, a 99.9% SLO leaves an error budget of 0.1%.
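A minimal sketch of the arithmetic, assuming an illustrative 99.9% SLO and a 28-day window (both values are examples, not from the question):

```python
# Minimal sketch: derive an error budget from an SLO target.
# The 99.9% target and the 28-day window are illustrative values.

def error_budget(slo_target: float) -> float:
    """Error budget = 100% - SLO, expressed as a fraction."""
    return 1.0 - slo_target

slo = 0.999                      # 99.9% availability SLO
budget = error_budget(slo)       # 0.001 -> 0.1% of events/time may fail
window_minutes = 28 * 24 * 60    # 28-day rolling window
print(f"Allowed downtime: {budget * window_minutes:.1f} minutes")  # ~40.3 minutes
```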
You deploy a new release of an internal application during a weekend maintenance window when there is minimal user traffic. After the window ends, you learn that one of the new features isn't working as expected in the production environment. After an extended outage, you roll back the new release and deploy a fix. You want to modify your release process to reduce the mean time to recovery so you can avoid extended outages in the future. What should you do? (Choose two.)
Adopt the blue/green deployment strategy when releasing new code via a CD server. Configure a CI server. Add a suite of unit tests to your code and have your CI server run them on commit and verify any changes.
What is an error budget?
An error budget is a quantitative measurement shared between the product and SRE teams to balance innovation and stability.
You are writing a post-mortem for an incident that severely affected users. You want to prevent similar incidents in the future. Which two of the following sections should you include in the post-mortem? (Choose two.)
An explanation of the root cause of the incident. A list of action items to prevent a recurrence of the incident.
You created a Stackdriver chart for CPU utilization in a dashboard within your workspace project. You want to share the chart with your Site Reliability Engineering (SRE) team only. You want to ensure you follow the principle of least privilege. What should you do?
Click "Share chart by URL" and provide the URL to the SRE team. Assign the SRE team the Monitoring Viewer IAM role in the workspace project.
You support a high-traffic web application and want to ensure that the home page loads in a timely manner. As a first step, you decide to implement a Service Level Indicator (SLI) to represent home page request latency with an acceptable page load time set to 100 ms. What is the Google-recommended way of calculating this SLI?
Count the number of home page requests that load in under 100 ms, and then divide by the total number of home page requests. The SRE book recommends treating an SLI as the ratio of two numbers: the number of good events divided by the total number of events. For example: number of successful HTTP requests / total HTTP requests (success rate).
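A minimal sketch of the good-events / total-events calculation; the function name and the sample latencies are illustrative:

```python
# Sketch of the good-events / total-events SLI over a list of observed
# request latencies in milliseconds (illustrative data).

def latency_sli(latencies_ms: list[float], threshold_ms: float = 100.0) -> float:
    """Proportion of home page requests served under the latency threshold."""
    if not latencies_ms:
        return 1.0  # no traffic: treat the SLI as met
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)

print(latency_sli([42, 87, 130, 95, 250]))  # 0.6 -> 60% of requests were "good"
```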
You support a Python application running in production on Compute Engine. You want to debug some of the application code by inspecting the value of a specific variable. What should you do?
Create a Cloud Debugger snapshot at a specific line location in your application's source code, and view the value of the variable in the Google Cloud Console. This is the Google recommended approach.
You use a multiple step Cloud Build pipeline to build and deploy your application to Google Kubernetes Engine (GKE). You want to integrate with a third-party monitoring platform by performing a HTTP POST of the build information to a webhook. You want to minimize the development effort. What should you do?
Create a Cloud Pub/Sub push subscription to the Cloud Build cloud-builds Pub/Sub topic to HTTP POST the build information to a webhook. Cloud Build publishes messages to a Pub/Sub topic named cloud-builds when your build's state changes, such as when the build is created, transitions to a working state, or completes. Each message contains a base64-encoded JSON representation of your Build resource in the message.data attribute; the build's unique ID and status can be found in the message.attributes field. You can use a push or pull model for your Pub/Sub subscriptions. Push subscriptions deliver messages to an HTTP endpoint that you define, as soon as they are published to the topic.
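As a sketch of what such a push endpoint could look like: the handler below decodes the Pub/Sub push envelope (message.data is base64-encoded JSON, message.attributes carries the status) and relays it onward. Flask, requests, and the MONITORING_WEBHOOK_URL destination are assumptions for illustration, not part of the question:

```python
# Sketch of a push-subscription endpoint that relays Cloud Build status to a
# third-party monitoring webhook. MONITORING_WEBHOOK_URL is hypothetical.
import base64
import json

import requests
from flask import Flask, request

app = Flask(__name__)
MONITORING_WEBHOOK_URL = "https://monitoring.example.com/hooks/cloud-build"

@app.route("/pubsub/push", methods=["POST"])
def handle_build_event():
    envelope = request.get_json()
    message = envelope["message"]
    build = json.loads(base64.b64decode(message["data"]))  # full Build resource
    status = message["attributes"].get("status")           # e.g. SUCCESS, FAILURE
    requests.post(MONITORING_WEBHOOK_URL, json={"status": status, "build": build})
    return ("", 204)  # ack the message so Pub/Sub does not redeliver it
```

If the third-party platform accepts the raw Pub/Sub envelope, the push subscription's endpoint could even point directly at its webhook, removing this relay entirely.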
Your company is developing applications that are deployed on Google Kubernetes Engine (GKE). Each team manages a different application. You need to create the development and production environments for each team, while minimizing costs. Different teams should not be able to access other teams' environments. What should you do?
Create a Development and a Production GKE cluster in separate projects. In each cluster, create a Kubernetes namespace per team, and then configure Kubernetes Role-based access control (RBAC) so that each team can only access its own namespace.
You support a production service that runs on a single Compute Engine instance. You regularly need to spend time on recreating the service by deleting the crashing instance and creating a new instance based on the relevant image. You want to reduce the time spent performing manual operations while following Site Reliability Engineering principles. What should you do?
Create a managed instance group (MIG) with a single instance and configure autohealing health checks so that a failed instance is recreated automatically.
How can you view a subset of logs from multiple projects in a folder at the same time?
Create an aggregated logging export sink at the folder level that includes the folder's child resources, and define a logging filter for the subset of logs you need. Export these logs to another service such as BigQuery.
How can you monitor metrics from multiple projects at the same time?
Create a single workspace in a project, and add each monitored project to that workspace. You can monitor multiple projects from a single workspace. You can only have a single workspace per project.
Which audit log type is NOT enabled by default?
Data Access audit logs are NOT enabled by default (except for BigQuery); they must be enabled manually and can later be disabled.
You work with a video rendering application that publishes small tasks as messages to a Cloud Pub/Sub topic. You need to deploy the application that will execute these tasks on multiple virtual machines (VMs). Each task takes less than 1 hour to complete. The rendering is expected to be completed within a month. You need to minimize rendering costs. What should you do?
Deploy the application as a managed instance group with Preemptible VMs. Preemptible VMs are the cheapest way of running a VM, and running them in a MIG will automatically start new instances as running ones are preempted.
You support a website with a global audience. The website has a frontend web service and a backend database service that runs on different clusters. All clusters are scaled to handle at least ⅓ of the total user traffic. You use 4 different regions in Google Cloud and Cloud Load Balancing to direct traffic to a region closer to the user. You are applying a critical security patch to the backend database. You successfully patch the database in the first 2 regions, but you make a configuration error while patching Region 3. The unsuccessful patching causes 50% of user requests to Region 3 to time out. You want to mitigate the impact of unsuccessful patching on users. What should you do?
Drain the requests to Region 3 and redirect new requests to other regions. The remaining 3 regions can handle the total traffic load, which gives you time to fix the configuration error in Region 3 and then apply the patch.
When is an SRE least engaged in the life-cycle of a service?
During the Active Development phase.
You encounter a large number of outages in the production systems you support. You receive alerts for all the outages that wake you up at night. The alerts are due to unhealthy systems that are automatically restarted within a minute. You want to set up a process that would prevent staff burnout while following Site Reliability Engineering practices. What should you do?
Eliminate unactionable alerts. Eliminate bad monitoring. Good monitoring alerts on actionable problems.
Your company follows Site Reliability Engineering principles. You are writing a post-mortem for an incident, triggered by a software change, that severely affected users. You want to prevent severe incidents from happening in the future. What should you do?
Ensure that test cases that catch errors of this type are run successfully before new software releases.
Your Site Reliability Engineering team does toil work to archive unused data in tables within your application's relational database. This toil is required to ensure that your application has a low Latency Service Level Indicator (SLI) to meet your Service Level Objective (SLO). Toil is preventing your team from focusing on a high-priority engineering project that will improve the Availability SLI of your application. You want to: (1) reduce repetitive tasks to avoid burnout, (2) improve organizational efficiency, and (3) follow the Site Reliability Engineering recommended practices. What should you do?
Identify repetitive tasks that contribute to toil and automate them. Organizational culture should allow for openly expressing concerns for the benefit of service reliability. Toil does not diminish on its own; it must be eliminated through action. Changing the SLO will not eliminate toil, and assigning the Availability SLI engineering project to the Software Engineering team means the SRE team would still be overwhelmed with toil that would also block future projects.
Several teams in your company want to use Cloud Build to deploy to their own Google Kubernetes Engine (GKE) clusters. The clusters are in projects that are dedicated to each team. The teams only have access to their own projects. One team should not have access to the cluster of another team. You are in charge of designing the Cloud Build setup, and want to follow Google-recommended practices. What should you do?
In each team's project, list the service accounts and identify the one used by Cloud Build for each project. In each project, grant the Kubernetes Engine Developer IAM role to the service account used by Cloud Build. Ask each team to execute Cloud Build builds in their own project.
You are managing an application that exposes an HTTP endpoint without using a load balancer. The latency of the HTTP responses is important for the user experience. You want to understand what HTTP latencies all of your users are experiencing. You use Stackdriver Monitoring. What should you do?
In your application, create a metric with a metricKind set to GAUGE and a valueType set to DISTRIBUTION. In Stackdriver's Metrics Explorer, use a Heatmap graph to visualize the metric. Latency is commonly measured as a distribution. A gauge metric measures a specific instant in time.
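A real exporter would report such values through the Cloud Monitoring custom-metrics API; the pure-Python sketch below only illustrates what a DISTRIBUTION value carries, namely explicit bucket bounds plus a count per bucket (the bounds are illustrative):

```python
# Illustration of what a DISTRIBUTION value carries: explicit bucket upper
# bounds plus a count per bucket. A real exporter would send these values
# via the Cloud Monitoring API; the bounds here are illustrative.
import bisect

BOUNDS_MS = [25, 50, 100, 250, 500, 1000]  # bucket upper bounds in ms

def bucketize(latencies_ms: list[float]) -> list[int]:
    counts = [0] * (len(BOUNDS_MS) + 1)  # one overflow bucket at the end
    for latency in latencies_ms:
        counts[bisect.bisect_left(BOUNDS_MS, latency)] += 1
    return counts

print(bucketize([12, 48, 130, 900, 2300]))  # [1, 1, 0, 1, 0, 1, 1]
```

A heatmap in Metrics Explorer then renders these per-bucket counts over time, which is why a distribution-valued gauge is the right shape for latency.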
You support a service with a well-defined Service Level Objective (SLO). Over the previous 6 months, your service has consistently met its SLO and customer satisfaction has been consistently high. Most of your service's operations tasks are automated and few repetitive tasks occur frequently. You want to optimize the balance between reliability and deployment velocity while following site reliability engineering best practices. What should you do? (Choose two.)
Increase the service's deployment velocity and/or risk. Shift engineering time to other services that need more reliability.
You have a Compute Engine instance that uses the default Debian image. The application hosted on this instance recently suffered a series of crashes that you weren't able to debug in real time: the application process died suddenly every time. The application usually consumes 50% of the instance's memory, and normally never more than 70%, but you suspect that a memory leak was responsible for the crashes. You want to validate this hypothesis. What should you do?
Install the Cloud Monitoring agent on the instance. Create an alert policy on the "agent.googleapis.com/memory/percent_used" metric for that instance to be alerted when the memory used is higher than 75%. When you receive an alert, use your usual debugging tools to investigate the behavior of the application in real time.
Your application is hosted on a Linux Compute Engine instance. Your instance periodically crashes, leading you to suspect a fault in the OS. You want to query syslog entries in Cloud Logging. What do you need to do to achieve this?
Install the Logging Agent and 'catch-all' configuration on the affected instance. New syslog entries will automatically appear in Cloud Logging for analysis. The Logging Agent is required for OS troubleshooting logs. Installing the Logging Agent and catch-all configuration will automatically capture logs from syslog from the Linux OS.
Which IAM role grants the SMALLEST scope of rights for a service account to write logs to Cloud Logging?
Logs Writer (roles/logging.logWriter). The Logs Writer role allows a service account to write logs to Cloud Logging without giving it read permissions.
Does viewing Windows Event Viewer logs on a Compute Engine instance require installing and configuring logging agents?
Yes. OS logs (syslog and Windows Event Viewer) require the Logging agent to be installed and configured.
Your company follows Site Reliability Engineering practices. You are the person in charge of Communications for a large, ongoing incident affecting your customer-facing applications. There is still no estimated time for a resolution of the outage. You are receiving emails from internal stakeholders who want updates on the outage, as well as emails from customers who want to know what is happening. You want to efficiently provide updates to everyone affected by the outage. What should you do?
Provide periodic updates to all stakeholders in a timely manner. Commit to a "next update" time in all communications.
What is recall?
Recall is the proportion of significant events detected. Recall is 100% if every significant event results in an alert.
You support a popular mobile game application deployed on Google Kubernetes Engine (GKE) across several Google Cloud regions. Each region has multiple Kubernetes clusters. You receive a report that none of the users in a specific region can connect to the application. You want to resolve the incident while following Site Reliability Engineering practices. What should you do first?
Reroute the user traffic from the affected region to other regions that don't report issues. This follows the SRE best practice of mitigating and resolving the incident quickly, then conducting a post-mortem analysis to prevent similar incidents from happening in the future.
You have a service running on Compute Engine virtual machine instances behind a global load balancer. You need to ensure that when an instance fails, it is recovered. What should you do?
Set up health checks in the managed instance group configuration. The managed instance group health check will recreate the instance when it fails, and this is the platform-native way to satisfy this use case. A health check on the load balancer will not recover the instance. The load balancer health check will exclude the instance from receiving traffic.
When using Spinnaker as a continuous delivery tool, how does Spinnaker control traffic to different versions of your Kubernetes application?
Spinnaker updates existing staging and production ReplicaSets with each new version of a deployed application.
Your application artifacts are being built and deployed via a CI/CD pipeline. You want the CI/CD pipeline to securely access application secrets. You also want to more easily rotate secrets in case of a security breach. What should you do?
Store secrets in Cloud Storage encrypted with a key from Cloud KMS. Provide the CI/CD pipeline with access to Cloud KMS via IAM.
Your team of Infrastructure DevOps Engineers is growing, and you are starting to use Terraform to manage infrastructure. You need a way to implement code versioning and to share code with other team members. What should you do?
Store the Terraform code in a version-control system. Establish procedures for pushing new versions and merging changes into the master branch.
You are deploying an application to a Kubernetes cluster that requires a username and password to connect to another service. When you deploy the application, you want to ensure that the credentials are used securely in multiple environments with minimal code changes. What should you do?
Store the credentials as a Kubernetes Secret and let the application access it via environment variables at runtime. This approach enables secrets usage without needing to modify the code per environment, update build pipelines, or store secrets insecurely.
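A sketch of the application side, assuming the Secret's keys were mapped to DB_USERNAME and DB_PASSWORD environment variables via secretKeyRef in the pod spec (both variable names are hypothetical):

```python
# Sketch: the application reads credentials injected from a Kubernetes Secret.
# DB_USERNAME / DB_PASSWORD are hypothetical names, assumed to be mapped from
# the Secret via env.valueFrom.secretKeyRef in the pod spec.
import os

def get_db_credentials() -> tuple[str, str]:
    username = os.environ["DB_USERNAME"]
    password = os.environ["DB_PASSWORD"]
    return username, password

# The same image runs unchanged in every environment; only the Secret differs.
```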
How do you modify log entries before they are sent to Cloud Logging?
The fluentd filter_record_transformer plugin modifies log entries before the Logging agent sends them to Cloud Logging.
Your application runs in Google Kubernetes Engine (GKE). You want to use Spinnaker with the Kubernetes Provider to perform blue/green deployments and control which version of the application receives traffic. What should you do?
Use a Kubernetes ReplicaSet and use Spinnaker to update the ReplicaSet for each new version of the application to be deployed. Spinnaker manages ReplicaSets directly, so it can roll out a new version without conflicting with Kubernetes, and it updates the existing Service rather than creating a new one. Updating a Kubernetes Deployment instead would conflict with the Deployment controller's own operations on its ReplicaSets.
What tool should you use to analyze a sample of network traffic on a specific subnet of your VPC?
VPC Flow Logs. VPC Flow Logs are enabled per subnet and log a sample of all network traffic on that subnet.
You need to deploy a new service to production. The service needs to automatically scale using a Managed Instance Group (MIG) and should be deployed over multiple regions. The service needs a large number of resources for each instance and you need to plan for capacity. What should you do?
Validate that the resource requirements are within the available quota limits of each region. It is important to ensure that the resource requirements are within the available quota limits in each region before deploying the service, to avoid exceeding the limits and causing problems. This is essential to ensure that the service is deployed correctly and has the necessary capacity to handle the load.
According to Google Cloud recommended best practices, when can an application or service move from limited availability to general availability?
When the Production Readiness Review has been passed. The Production Readiness Review (PRR) is a process that identifies the reliability needs of a service based on its specific details. Through a PRR, SREs seek to apply what they've learned and experienced to ensure the reliability of a service operating in production.
What steps are required to set up Error Reporting notifications on your team's project?
1) Enable Error Reporting notifications for the project. 2) Assign team members a custom role with the cloudnotifications.activities.list permission. Error Reporting notifications are available to users with the Project Owner, Project Editor, or Project Viewer role, or with a custom role that includes the cloudnotifications.activities.list permission.
You support a multi-region web service running on Google Kubernetes Engine (GKE) behind a Global HTTP/S Cloud Load Balancer (CLB). For legacy reasons, user requests first go through a third-party Content Delivery Network (CDN), which then routes traffic to the CLB. You have already implemented an availability Service Level Indicator (SLI) at the CLB level. However, you want to increase coverage in case of a potential load balancer misconfiguration, CDN failure, or other global networking catastrophe. Where should you measure this new SLI? (Choose two.)
1) Instrumentation coded directly in the client. 2) A synthetic client that periodically sends simulated user requests.
You are helping with the design of an e-commerce application. The web application receives web requests and stores sales transactions in a database. A batch job runs every hour to trigger analysis of sales numbers, available inventory, and forecasted sales numbers. You want to identify minimal Service Level Indicators (SLIs) for the application to ensure that forecasted numbers are based on the latest sales numbers. Which SLIs should you set for the application?
1) Web Application - Availability 2) Database - Availability 3) Batch Job - Freshness. These are the minimal SLIs to measure in order to meet the objective of using the latest data in the batch job. Web Application Quality and Batch Job Coverage SLIs don't help meet the objective, and Web Application Latency, although important to measure, doesn't affect whether the latest data is available in the database.
You currently store the virtual machine (VM) utilization logs in Stackdriver. You need to provide an easy-to-share interactive VM utilization dashboard that is updated in real time and contains information aggregated on a quarterly basis. You want to use Google Cloud Platform solutions. What should you do?
1. Export VM utilization logs from Stackdriver to BigQuery. 2. Create a dashboard in Data Studio. 3. Share the dashboard with your stakeholders. Data Studio provides real-time, interactive dashboards and integrates directly with BigQuery.
You need to define Service Level Objectives (SLOs) for a high-traffic multi-region web application. Customers expect the application to always be available and have fast response times. Customers are currently happy with the application performance and availability. Based on current measurement, you observe that the 90th percentile of latency is 120ms and the 95th percentile of latency is 275ms over a 28-day window. What latency SLO would you recommend to the team to publish?
90th percentile: 150 ms; 95th percentile: 300 ms. Basing an SLO directly on current performance can commit you to unnecessarily strict SLOs, so publish a slightly looser SLO than the measured values.
You support a high-traffic web application with a microservice architecture. The home page of the application displays multiple widgets containing content such as the current weather, stock prices, and news headlines. The main serving thread makes a call to a dedicated microservice for each widget and then lays out the homepage for the user. The microservices occasionally fail; when that happens, the serving thread serves the homepage with some missing content. Users of the application are unhappy if this degraded mode occurs too frequently, but they would rather have some content served instead of no content at all. You want to set a Service Level Objective (SLO) to ensure that the user experience does not degrade too much. What Service Level Indicator (SLI) should you use to measure this?
A quality SLI: the ratio of non-degraded responses to total responses. Quality is a helpful SLI for complex services that are designed to fail gracefully by degrading when dependencies are slow or unavailable. The SLI for quality is defined as follows: The proportion of valid requests served without degradation of service.
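A sketch of how the serving thread could record this SLI; the widget-fetcher callables and the module-level counters are illustrative, not part of the question:

```python
# Sketch of measuring a quality SLI: the serving thread records whether each
# homepage response was degraded (one or more widget calls failed).
# The fetcher callables are hypothetical widget microservice clients.

degraded_responses = 0
total_responses = 0

def render_homepage(widget_fetchers: dict) -> dict:
    global degraded_responses, total_responses
    widgets, degraded = {}, False
    for name, fetch in widget_fetchers.items():
        try:
            widgets[name] = fetch()
        except Exception:
            widgets[name] = None  # serve the page without this widget
            degraded = True
    total_responses += 1
    degraded_responses += int(degraded)
    return widgets

# Quality SLI = (total_responses - degraded_responses) / total_responses
```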
You are responsible for the reliability of a high-volume enterprise application. A large number of users report that an important subset of the application's functionality "a data intensive reporting feature" is consistently failing with an HTTP 500 error. When you investigate your application's dashboards, you notice a strong correlation between the failures and a metric that represents the size of an internal queue used for generating reports. You trace the failures to a reporting backend that is experiencing high I/O wait times. You quickly fix the issue by resizing the backend's persistent disk (PD). Now you need to create an availability Service Level Indicator (SLI) for the report generation feature. How would you define it?
As the proportion of report generation requests that result in a successful response. This is a valid availability SLI for the report generation feature: it measures the percentage of report generation requests that complete successfully, which directly reflects the availability of the feature. It is a simple, easy-to-understand metric that can be monitored and reported over time, lets you detect and diagnose issues quickly so you can mitigate them, and aligns with the customer's expectation that report generation requests succeed.
You manage several production systems that run on Compute Engine in the same Google Cloud Platform (GCP) project. Each system has its own set of dedicated Compute Engine instances. You want to know how much it costs to run each of the systems. What should you do?
Assign all instances a label specific to the system they run. Configure BigQuery billing export and query costs per label. Using labels to tag instances with the specific system they run allows you to easily filter and query costs by system in BigQuery. This allows you to see the costs associated with each system and make informed decisions about cost optimization.
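A sketch of querying the billing export per label with the BigQuery client; the project, dataset, and table names are placeholders, and the schema detail relied on is that the export stores labels as an array of key/value structs (hence the UNNEST):

```python
# Sketch: query the billing export for cost per "system" label.
# `my-project.billing.gcp_billing_export_v1_XXXXXX` is a placeholder table;
# the billing export stores labels as ARRAY<STRUCT<key, value>>.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT l.value AS system, ROUND(SUM(cost), 2) AS total_cost
    FROM `my-project.billing.gcp_billing_export_v1_XXXXXX`,
         UNNEST(labels) AS l
    WHERE l.key = 'system'
    GROUP BY system
    ORDER BY total_cost DESC
"""
for row in client.query(query).result():
    print(row.system, row.total_cost)
```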
You are creating and assigning action items in a post-mortem for an outage. The outage is over, but you need to address the root causes. You want to ensure that your team handles the action items quickly and efficiently. How should you assign owners and collaborators to action items?
Assign one owner for each action item and any necessary collaborators. This will ensure clear accountability and ownership for each action item, and the necessary collaborators can provide support and expertise to the owner in completing the action item. This approach allows for clear communication and delegation of responsibilities, which can help ensure that the action items are handled quickly and efficiently.
Your team uses Cloud Build for all CI/CD pipelines. You want to use the kubectl builder for Cloud Build to deploy new images to Google Kubernetes Engine (GKE). You need to authenticate to GKE while minimizing development effort. What should you do?
Assign the Container Developer role to the Cloud Build service account. Cloud Build runs builds with a default service account that it creates automatically. Granting the Container Developer (Kubernetes Engine Developer) role to this service account gives it the permissions needed to deploy new images to GKE, so you don't need to create a new service account or specify a role in the cloudbuild.yaml file. This is a simple, secure way to authenticate to GKE without adding extra steps to the CI/CD pipeline.
You use Spinnaker to deploy your application and have created a canary deployment stage in the pipeline. Your application has an in-memory cache that loads objects at start time. You want to automate the comparison of the canary version against the production version. How should you configure the canary analysis?
Compare the canary with a new deployment of the current production version. You might be tempted to compare the canary deployment against your current production deployment. Instead, always compare the canary against an equivalent baseline deployed at the same time. The baseline uses the same version and configuration that is currently running in production, but is otherwise identical to the canary: same time of deployment, same size of deployment, and same type and amount of traffic. In this way, you control for version and configuration only, and you reduce factors that could affect the analysis, like cache warmup time and heap size.
You support the backend of a mobile phone game that runs on a Google Kubernetes Engine (GKE) cluster. The application is serving HTTP requests from users. You need to implement a solution that will reduce the network cost. What should you do?
Configure a Google Cloud HTTP Load Balancer as Ingress. A Google Cloud HTTP Load Balancer can help reduce network costs by efficiently routing traffic to the backend services running on the GKE cluster. By configuring the load balancer as ingress, it will receive all incoming traffic and then route it to the appropriate backend service. This eliminates the need for each service to have its own external IP address, which can be costly in terms of network usage.
You are using Stackdriver to monitor applications hosted on Google Cloud Platform (GCP). You recently deployed a new application, but its logs are not appearing on the Stackdriver dashboard. You need to troubleshoot the issue. What should you do?
Confirm that the Stackdriver agent has been installed in the hosting virtual machine. To troubleshoot logs not appearing on the Stackdriver dashboard, first confirm that the Stackdriver agent has been installed on the hosting virtual machine, because the agent is responsible for sending the logs from the machine to the Stackdriver service.
Your company experiences bugs, outages, and slowness in its production systems. Developers use the production environment for new feature development and bug fixes. Configuration and experiments are done in the production environment, causing outages for users. Testers use the production environment for load testing, which often slows the production systems. You need to redesign the environment to reduce the number of bugs and outages in production and to enable testers to load test new features. What should you do?
Create a development environment for writing code and a test environment for configurations, experiments, and load testing.
You support an application that stores product information in cached memory. For every cache miss, an entry is logged in Stackdriver Logging. You want to visualize how often a cache miss happens over time. What should you do?
Create a logs-based metric in Stackdriver Logging and a dashboard for that metric in Stackdriver Monitoring. Stackdriver Logging provides the ability to extract metrics from logs, these metrics are called logs-based metrics. You can create a logs-based metric that counts the number of cache miss logs and configure it to be collected at a regular interval, this way you can see how often a cache miss happens over time. Additionally, Stackdriver Monitoring provides the ability to create dashboards that display the metrics collected by logs-based metrics, you can use this dashboard to visualize the cache misses over time and easily identify trends or spikes in the data.
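A sketch of creating such a counter metric with the google-cloud-logging client; the metric name and filter string are illustrative and assume the application logs the literal text "cache miss":

```python
# Sketch: create a counter logs-based metric for cache-miss entries.
# The metric name and filter are illustrative assumptions.
from google.cloud import logging

client = logging.Client()
metric = client.metric(
    "cache_miss_count",
    filter_='resource.type="gce_instance" AND textPayload:"cache miss"',
    description="Counts cache-miss log entries",
)
if not metric.exists():
    metric.create()
# The metric then surfaces in Monitoring under
# logging.googleapis.com/user/cache_miss_count for charting on a dashboard.
```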
You are developing a strategy for monitoring your Google Cloud Platform (GCP) projects in production using Stackdriver Workspaces. One of the requirements is to be able to quickly identify and react to production environment issues without false alerts from development and staging projects. You want to ensure that you adhere to the principle of least privilege when providing relevant team members with access to Stackdriver Workspaces. What should you do?
Create a new GCP monitoring project and create a Stackdriver Workspace inside it. Attach the production projects to this workspace. Grant relevant team members read access to the Stackdriver Workspace. When you want to manage metrics for multiple projects, Google recommends that you create a project to be the scoping project for that metrics scope.
You are working with a government agency that requires you to archive application logs for seven years. You need to configure Stackdriver to export and store the logs while minimizing costs of storage. What should you do?
Create a sink in Stackdriver, name it, create a Cloud Storage bucket for storing the archived logs (a Coldline or Archive storage class minimizes storage costs for rarely accessed data), and then select the bucket as the log export destination.
You support a high-traffic web application that runs on Google Cloud Platform (GCP). You need to measure application reliability from a user perspective without making any engineering changes to it. What should you do? (Choose two.)
Create new synthetic clients to simulate a user journey using the application. Use current and historic Request Logs to trace customer interaction with the application.
You need to reduce the cost of virtual machines (VM) for your organization. After reviewing different options, you decide to leverage preemptible VM instances. Which application is suitable for preemptible VMs? A. A scalable in-memory caching system. B. The organization's public-facing website. C. A distributed, eventually consistent NoSQL database cluster with sufficient quorum. D. A GPU-accelerated video rendering platform that retrieves and stores videos in a storage bucket.
D. A GPU-accelerated video rendering platform that retrieves and stores videos in a storage bucket.
You have a set of applications running on a Google Kubernetes Engine (GKE) cluster, and you are using Stackdriver Kubernetes Engine Monitoring. You are bringing a new containerized application required by your company into production. This application is written by a third party and cannot be modified or reconfigured. The application writes its log information to /var/log/app_messages.log, and you want to send these log entries to Stackdriver Logging. What should you do?
Deploy a Fluentd daemonset to GKE, then create a customized input and output configuration to tail the log file in the application's pods and write to Stackdriver Logging. Some applications can't easily be configured to write logs to stdout and stderr; because they write to log files on disk, the default Stackdriver Kubernetes Engine Monitoring configuration doesn't collect them, since it can't tail an arbitrary file. Fluentd is a log collector and forwarder that can be configured to tail a specific file, in this case /var/log/app_messages.log, and send the entries to Stackdriver Logging. Deploying a Fluentd daemonset with a customized configuration therefore collects this third-party application's logs and makes them available for analysis. (A documented alternative for a single application is the sidecar pattern: a logging-agent container in the same pod that shares an emptyDir volume with the application and forwards the logs written there.)
You have a pool of application servers running on Compute Engine. You need to provide a secure solution that requires the least amount of configuration and allows developers to easily access application logs for troubleshooting. How would you implement the solution on GCP?
Deploy the Stackdriver logging agent to the application servers. Give the developers the IAM Logs Viewer role to access Stackdriver and view logs.
You're the engineer on duty and you receive a text message that a data center has gone offline shortly after a new release of your application went live. Following Google Cloud SRE best practices, what should you do?
Designate a development team member as the incident commander. Designating an incident commander - or taking that role yourself - is the first step in a best practice for incident response.
Your organization wants to implement Site Reliability Engineering (SRE) culture and principles. Recently, a service that you support had a limited outage. A manager on another team asks you to provide a formal explanation of what happened so they can action remediations. What should you do?
Develop a post-mortem that includes the root causes, resolution, lessons learned, and a prioritized list of action items. Share it on the engineering organization's document portal.
You encountered a major service outage that affected all users of the service for multiple hours. After several hours of incident management, the service returned to normal, and user access was restored. You need to provide an incident summary to relevant stakeholders following the Site Reliability Engineering recommended practices. What should you do first?
Develop a post-mortem to be distributed to stakeholders. Post-mortems are a common practice in Site Reliability Engineering (SRE) where an incident summary is written to document the incident, including root causes, resolution, lessons learned, and a prioritized list of action items. This information can be used to improve processes, identify areas for improvement, and prevent similar incidents from occurring in the future.
You support a large service with a well-defined Service Level Objective (SLO). The development team deploys new releases of the service multiple times a week. If a major incident causes the service to miss its SLO, you want the development team to shift its focus from working on features to improving service reliability. What should you do before a major incident occurs?
Develop an appropriate error budget policy in cooperation with all service stakeholders. The goals of this policy are to: 1) Protect customers from repeated SLO misses 2) Provide an incentive to balance reliability with other features
You are managing the production deployment to a set of Google Kubernetes Engine (GKE) clusters. You want to make sure only images which are successfully built by your trusted CI/CD pipeline are deployed to production. What should you do?
Enable Binary Authorization on the GKE clusters. Binary Authorization is a feature of Google Kubernetes Engine that ensures only containers verified to be from a trusted source are deployed to your clusters. It works by using a policy that checks the signatures of container images before they are deployed. You can configure Binary Authorization to require that all images are signed by a trusted certificate authority (CA) or by a trusted key that you manage. This ensures that only images successfully built by your trusted CI/CD pipeline are deployed to your production clusters.
You are running a real-time gaming application on Compute Engine that has a production and testing environment. Each environment has their own Virtual Private Cloud (VPC) network. The application frontend and backend servers are located on different subnets in the environment's VPC. You suspect there is a malicious process communicating intermittently in your production frontend servers. You want to ensure that network traffic is captured for analysis. What should you do?
Enable VPC Flow Logs on the production VPC network frontend and backend subnets only with a sample volume scale of 1.0.
Your organization recently adopted a container-based workflow for application development. Your team develops numerous applications that are deployed continuously through an automated build pipeline to a Kubernetes cluster in the production environment. The security auditor is concerned that developers or operators could circumvent automated testing and push code changes to production without approval. What should you do to enforce approvals?
Enable binary authorization inside the Kubernetes cluster and configure the build pipeline as an attestor. Binary Authorization is a deploy-time security control that ensures only trusted container images are deployed on Google Kubernetes Engine (GKE) or Cloud Run. With Binary Authorization, you can require images to be signed by trusted authorities during the development process and then enforce signature validation when deploying. By enforcing validation, you can gain tighter control over your container environment by ensuring only verified images are integrated into the build-and-release process.
What is Google's best practice for securely accessing sensitive credentials for authentication purposes on each new deployment of a Kubernetes application?
Enable secrets management on your Kubernetes cluster, and store the credentials in Secret Manager as a secret. Design the application to access the secret via variables at runtime. You want to store your secrets in a secure manner, whether using KMS, an encrypted Cloud Storage bucket, or another controlled source.
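A sketch of reading a secret from Secret Manager at runtime; the project and secret IDs are placeholders, and the service account is assumed to have roles/secretmanager.secretAccessor on the secret:

```python
# Sketch: fetch a credential from Secret Manager at startup.
# "my-project" and "db-password" are placeholder IDs; the caller's service
# account is assumed to hold roles/secretmanager.secretAccessor.
from google.cloud import secretmanager

def access_secret(project_id: str, secret_id: str, version: str = "latest") -> str:
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version}"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("UTF-8")

db_password = access_secret("my-project", "db-password")
```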
You support an application running on GCP and want to configure SMS notifications to your team for the most critical alerts in Stackdriver Monitoring. You have already identified the alerting policies you want to configure this for. What should you do?
Ensure that your team members set their SMS/phone numbers in their Stackdriver Profile. Select the SMS notification option for each alerting policy and then select the appropriate SMS/phone numbers from the list. Google Cloud Monitoring (previously known as Stackdriver) supports SMS notifications.
You are on-call for an infrastructure service that has a large number of dependent systems. You receive an alert indicating that the service is failing to serve most of its requests and all of its dependent systems with hundreds of thousands of users are affected. As part of your Site Reliability Engineering (SRE) incident management protocol, you declare yourself Incident Commander (IC) and pull in two experienced people from your team as Operations Lead (OL) and Communications Lead (CL). What should you do next?
Establish a communication channel where incident responders and leads can communicate with each other.
You support a service that recently had an outage. The outage was caused by a new release that exhausted the service memory resources. You rolled back the release successfully to mitigate the impact on users. You are now in charge of the post-mortem for the outage. You want to follow Site Reliability Engineering practices when developing the post-mortem. What should you do?
Focus on identifying the contributing causes of the incident rather than the individual responsible for the cause. According to Site Reliability Engineering (SRE) practices, the goal of a post-mortem is to identify the underlying causes of the incident in order to take steps to prevent it from happening again in the future. This involves looking for patterns and issues in the system rather than looking for a specific person to blame. It's important to have a focus on learning and continuous improvement, rather than assigning blame.
You manage an application that is writing logs to Stackdriver Logging. You need to give some team members the ability to export logs. What should you do?
Grant the team members the IAM role of logging.configWriter on Cloud IAM. Logs Configuration Writer (roles/logging.configWriter) permissions allow you to create, delete, or modify sinks.
You are part of an organization that follows SRE practices and principles. You are taking over the management of a new service from the Development Team, and you conduct a Production Readiness Review (PRR). After the PRR analysis phase, you determine that the service cannot currently meet its Service Level Objectives (SLOs). You want to ensure that the service can meet its SLOs in production. What should you do next?
Identify recommended reliability improvements to the service to be completed before handover. A Production Readiness Review (PRR) is an assessment of a service's readiness to be deployed in production; a service that cannot meet its Service Level Objectives (SLOs) is not ready. The next step is to identify the reliability improvements that should be made to the service before it is handed over to the SRE team.
Your application runs on Google Cloud Platform (GCP). You need to implement Jenkins for deploying application releases to GCP. You want to streamline the release process, lower operational toil, and keep user data secure. What should you do?
Implement Jenkins on Compute Engine virtual machines. This will allow you to leverage GCP's security and compliance features, and integrate with other GCP services such as Cloud Storage or Cloud SQL for storing build artifacts and user data. Additionally, using Compute Engine virtual machines for Jenkins will provide flexibility in terms of scaling and managing resources.
You support a user-facing web application. When analyzing the application's error budget over the previous six months, you notice that the application has never consumed more than 5% of its error budget in any given time window. You hold a Service Level Objective (SLO) review with business stakeholders and confirm that the SLO is set appropriately. You want your application's SLO to more closely reflect its observed reliability. What steps can you take to further that goal while balancing velocity, reliability, and business needs? (Choose two.)
Implement and measure additional Service Level Indicators (SLIs) for the application. Announce planned downtime to consume more error budget, and ensure that users are not depending on a tighter SLO. The key observation is that error budget consumption never exceeds 5%, so the service can tolerate additional downtime and still stay within budget. The SRE book (https://sre.google/sre-book/service-level-objectives/) notes that "you can avoid over-dependence by deliberately taking the system offline occasionally" (Google's Chubby service introduced planned outages in response to being overly available). Measuring additional SLIs also makes the SLO more accurately reflect the system's observed reliability.
Your team is designing a new application for deployment both inside and outside Google Cloud Platform (GCP). You need to collect detailed metrics such as system resource utilization. You want to use centralized GCP services while minimizing the amount of work required to set up this collection system. What should you do?
Import the Stackdriver Profiler package, and configure it to relay function timing data to Stackdriver for further analysis. Stackdriver Profiler is a statistical, low-overhead profiler that continuously gathers CPU usage and memory-allocation information from your production applications. Profiler works both inside and outside of GCP.
Your team has recently deployed an NGINX-based application into Google Kubernetes Engine (GKE) and has exposed it to the public via an HTTP Google Cloud Load Balancer (GCLB) ingress. You want to scale the deployment of the application's frontend using an appropriate Service Level Indicator (SLI). What should you do?
Install the Stackdriver custom metrics adapter and configure a horizontal pod autoscaler to use the number of requests provided by the GCLB. To scale the deployment of the application's frontend using an appropriate Service Level Indicator (SLI), we need to monitor the traffic coming to the application. One way to do this is to install the Stackdriver custom metrics adapter, which provides visibility into GCLB metrics such as request counts, bytes sent and received, and active connections. We can then configure a horizontal pod autoscaler (HPA) to scale the number of pods based on the request count coming through the GCLB, which will help to ensure that our application is always available to handle the incoming traffic.
You support a Node.js application running on Google Kubernetes Engine (GKE) in production. The application makes several HTTP requests to dependent applications. You want to anticipate which dependent applications might cause performance issues. What should you do?
Instrument all applications with Stackdriver Trace and review inter-service HTTP requests.
Your development team has created a new version of their service's API. You need to deploy the new versions of the API with the least disruption to third-party developers and end users of third-party installed applications. What should you do?
Introduce the new version of the API. Announce deprecation of the old version of the API. Deprecate the old version of the API. Contact remaining users of the old API. Provide best-effort support to users of the old API. Turn down the old version of the API. You cannot deprecate, or announce deprecation of, an old version before introducing the newer one. The deprecation pattern: deprecate but don't turn down immediately, providing best-effort support until the old version is fully closed, with no support after that.
You have migrated an e-commerce application to Google Cloud Platform (GCP). You want to prepare the application for the upcoming busy season. What should you do first to prepare for the busy season?
Load test the application to profile its performance for scaling. The objective of the preparation stage is to test the system's ability to scale for peak user traffic and to document the results. Completing the preparation stage results in architecture refinement to handle peak traffic more efficiently and increase system reliability. This stage also yields procedures for operations and support that help streamline processes for handling the peak event and any issues that might occur. Consider this stage as practice for the peak event from a system and operations perspective.
Your product is currently deployed in three Google Cloud Platform (GCP) zones with your users divided between the zones. You can fail over from one zone to another, but it causes a 10-minute service disruption for the affected users. You typically experience a database failure once per quarter and can detect it within five minutes. You are cataloging the reliability risks of a new real-time chat feature for your product. You catalog the following information for each risk:* Mean Time to Detect (MTTD) in minutes* Mean Time to Repair (MTTR) in minutes* Mean Time Between Failure (MTBF) in days* User Impact Percentage The chat feature requires a new database system that takes twice as long to successfully fail over between zones. You want to account for the risk of the new database failing in one zone. What would be the values for the risk of database failover with the new system?
MTTD: 5, MTTR: 20, MTBF: 90, Impact: 33%. Detection still takes 5 minutes; the 10-minute failover doubles to 20 minutes with the new database; one failure per quarter gives an MTBF of 90 days; and with users divided across three zones, a single-zone failure affects roughly 33% of users.
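One common way to compare cataloged risks is to convert them into expected user-impact minutes per year; the sketch below uses the values above (the formula itself is an assumption for illustration, not from the question):

```python
# Sketch: turn the cataloged risk fields into expected "bad minutes" per year,
# a common way to compare risks against an error budget.
mttd_min = 5        # detection takes 5 minutes
mttr_min = 20       # the new database doubles the 10-minute failover
mtbf_days = 90      # one failure per quarter
impact = 1 / 3      # users in one of three zones are affected

incidents_per_year = 365 / mtbf_days
bad_minutes_per_year = (mttd_min + mttr_min) * incidents_per_year * impact
print(f"{bad_minutes_per_year:.1f} user-impact minutes/year")  # ~33.8
```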
You support a web application that runs on App Engine and uses CloudSQL and Cloud Storage for data storage. After a short spike in website traffic, you notice a big increase in latency for all user requests, increase in CPU use, and the number of processes running the application. Initial troubleshooting reveals: ✑ After the initial spike in traffic, load levels returned to normal but users still experience high latency. ✑ Requests for content from the CloudSQL database and images from Cloud Storage show the same high latency. ✑ No changes were made to the website around the time the latency increased. ✑ There is no increase in the number of errors to the users. You expect another spike in website traffic in the coming days and want to make sure users don't experience latency. What should you do?
Modify the App Engine configuration to have additional idle instances.
You are planning on deploying an Apache web server using Compute Engine. You need to track Apache requests in Cloud Monitoring. What do you need to do to make this happen?
On the GCE instance, install the Cloud Monitoring Agent. Then install the necessary configuration files for Apache. For the monitoring agent, we need to install both the agent and specific configuration files on an application-by-application basis.
Your company follows Site Reliability Engineering practices. You are the Incident Commander for a new, customer-impacting incident. You need to immediately assign two incident management roles to assist you in an effective incident response. What roles should you assign? (Choose two.)
Operations Lead and Communications Lead. The main roles in incident response are the Incident Commander (IC), Communications Lead (CL), and Operations or Ops Lead (OL).
What is precision?
Precision is the proportion of events detected that were significant. Precision is 100% if every alert corresponds to a significant event. Note that alerting can become particularly sensitive to non-significant events during low-traffic periods.
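A small sketch computing both recall (see the earlier definition) and precision from labeled (alerted, significant) event pairs; the data values are illustrative:

```python
# Sketch: compute alerting recall and precision from labeled events, where
# each event is a pair (alerted, significant). Data values are illustrative.
events = [
    (True, True), (True, False), (False, True), (True, True), (False, False),
]

alerted_significant = sum(1 for a, s in events if a and s)
significant = sum(1 for _, s in events if s)
alerted = sum(1 for a, _ in events if a)

recall = alerted_significant / significant    # 2/3: one significant event missed
precision = alerted_significant / alerted     # 2/3: one alert was noise
print(f"recall={recall:.0%} precision={precision:.0%}")
```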
You've been promoted to the team's primary log manager. As such, you know which options are available to you for log usage. Which of the following is NOT one of those options? 1) Export 2) Discard 3) Ingest 4) Profile
Profile. Profiling your application CPU and memory usage is handled by Cloud Profiler, not by Cloud Logging.
Your team is designing a new application for deployment into Google Kubernetes Engine (GKE). You need to set up monitoring to collect and aggregate various application-level metrics in a centralized location. You want to use Google Cloud Platform services while minimizing the amount of work required to set up monitoring. What should you do?
Publish various metrics from the application directly to the Stackdriver Monitoring API, and then observe these custom metrics in Stackdriver.
You need to run a business-critical workload on a fixed set of Compute Engine instances for several months. The workload is stable with the exact amount of resources allocated to it. You want to lower the costs for this workload without any performance implications. What should you do?
Purchase Committed Use Discounts.
Some of your production services are running in Google Kubernetes Engine (GKE) in the eu-west-1 region. Your build system runs in the us-west-1 region. You want to push the container images from your build system to a scalable registry to maximize the bandwidth for transferring the images to the cluster. What should you do?
Push the images to Google Container Registry (GCR) using the eu.gcr.io hostname. Pushing the images to Google Container Registry (GCR) using the eu.gcr.io hostname will allow the images to be transferred to the GKE cluster in the eu-west-1 region with the best possible network performance. This will minimize the latency when the cluster pulls the images from the registry, maximizing the bandwidth for transferring the images to the cluster.
You are running an experiment to see whether your users like a new feature of a web application. Shortly after deploying the feature as a canary release, you receive a spike in the number of 500 errors sent to users, and your monitoring reports show increased latency. You want to quickly minimize the negative impact on users. What should you do first?
Roll back the experimental canary release.
You support a web application that is hosted on Compute Engine. The application provides a booking service for thousands of users. Shortly after the release of a new feature, your monitoring dashboard shows that all users are experiencing latency at login. You want to mitigate the impact of the incident on the users of your service. What should you do first?
Roll back the recent release.
You have a CI/CD pipeline that uses Cloud Build to build new Docker images and push them to Docker Hub. You use Git for code versioning. After making a change in the Cloud Build YAML configuration, you notice that no new artifacts are being built by the pipeline. You need to resolve the issue following Site Reliability Engineering practices. What should you do?
Run a Git compare between the previous and current Cloud Build Configuration files to find and fix the bug.
You are running an application in a virtual machine (VM) using a custom Debian image. The image has the Stackdriver Logging agent installed. The VM has the cloud-platform scope. The application is logging information via syslog. You want to use Stackdriver Logging in the Google Cloud Platform Console to visualize the logs. You notice that syslog is not showing up in the "All logs" dropdown list of the Logs Viewer. What is the first thing you should do?
SSH to the VM and execute the following command: ps ax | grep fluentd. When an instance is created, you can specify which service account the instance uses when calling Google Cloud APIs, and the instance is configured with access scopes; for the Logging agent, the relevant scope is logging.write. Since this VM already has the cloud-platform scope and the agent installed, the first recommended troubleshooting step is to check whether the agent process is actually running.
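On a Debian image with the Logging agent, checking and restarting the agent service could look like this (service management commands may vary by distribution):

```
ps ax | grep fluentd                  # is the agent process running?
sudo service google-fluentd status    # status of the Logging agent service
sudo service google-fluentd restart   # restart it if it is not running
```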
Your organization recently adopted a container-based workflow for application development. Your team develops numerous applications that are deployed continuously through an automated build pipeline to the production environment. A recent security audit alerted your team that the code pushed to production could contain vulnerabilities and that the existing tooling around virtual machine (VM) vulnerabilities no longer applies to the containerized environment. You need to ensure the security and patch level of all code running through the pipeline. What should you do?
Set up Container Analysis to scan and report Common Vulnerabilities and Exposures (CVEs). Container Analysis is a GCP service that scans container images for known vulnerabilities, so you can identify issues in the images moving through your pipeline and take appropriate action to address them.
You support a stateless web-based API that is deployed on a single Compute Engine instance in the europe-west2-a zone. The Service Level Indicator (SLI) for service availability is below the specified Service Level Objective (SLO). A post-mortem has revealed that requests to the API regularly time out. The timeouts are due to the API receiving a high number of requests and running out of memory. You want to improve service availability. What should you do?
Set up additional service instances in other zones and load balance the traffic between all instances. This will provide redundancy and increase the availability of the service by distributing the traffic across multiple instances. Additionally, if one instance goes down, the load balancer will redirect the traffic to the other healthy instances, minimizing the impact on the service availability.
You are deploying an application that needs to access sensitive information. You need to ensure that this information is encrypted and the risk of exposure is minimal if a breach occurs. What should you do?
Store the encryption keys in Cloud Key Management Service (KMS) and rotate the keys frequently. This keeps the sensitive information encrypted at rest, and frequent key rotation limits how much data any single key version can expose if a breach occurs.
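A minimal sketch of encrypting a secret with the google-cloud-kms Python client; the project, key ring, key, and plaintext below are placeholders:

```python
# Sketch: encrypt a secret with a Cloud KMS key (resource names are placeholders)
from google.cloud import kms

client = kms.KeyManagementServiceClient()
key_name = client.crypto_key_path("my-project", "global", "my-keyring", "my-key")

response = client.encrypt(request={"name": key_name, "plaintext": b"db-password"})
ciphertext = response.ciphertext  # store this; decrypt at runtime with the same key
```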
You are responsible for creating and modifying the Terraform templates that define your Infrastructure. Because two new engineers will also be working on the same code, you need to define a process and adopt a tool that will prevent you from overwriting each other's code. You also want to ensure that you capture all updates in the latest version. What should you do?
Store your code in a Git-based version control system. Establish a process that includes code reviews by peers and unit testing to ensure integrity and functionality before integration of code. Establish a process where the fully integrated code in the repository becomes the latest master version.
What does the Logs Writer IAM Role allow for?
The Logs Writer IAM role grants permission to write log entries to Cloud Logging without granting read access. It is commonly assigned to service accounts that only need to emit logs.
What is a Production Readiness Review (PRR)?
The Production Readiness Review (PRR) is a process that identifies the reliability needs of a service based on its specific details. Through a PRR, SREs seek to apply what they've learned and experienced to ensure the reliability of a service operating in production.
Your application images are built using Cloud Build and pushed to Google Container Registry (GCR). You want to be able to specify a particular version of your application for deployment based on the release version tagged in source control. What should you do when you push the image?
Use Cloud Build to include the release version tag in the application image. This automates adding the source-control release tag to the image during the build, making it easy to trace each image version back to the codebase. Cloud Build can also automate the other steps of the pipeline, such as building, testing, and deploying the images, which simplifies managing and tracking your deployments.
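A minimal cloudbuild.yaml sketch, assuming the build is fired by a tag trigger (Cloud Build populates the $TAG_NAME substitution from the Git tag); the image name is a placeholder:

```yaml
steps:
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', 'gcr.io/$PROJECT_ID/my-app:$TAG_NAME', '.']
images:
- 'gcr.io/$PROJECT_ID/my-app:$TAG_NAME'
```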
You use Cloud Build to build and deploy your application. You want to securely incorporate database credentials and other application secrets into the build pipeline. You also want to minimize the development effort. What should you do?
Use Cloud Key Management Service (Cloud KMS) to encrypt the secrets and include them in your Cloud Build deployment configuration. Grant Cloud Build access to the KeyRing. This allows you to use Google-managed encryption and access controls, and it also minimizes the development effort required to securely incorporate the secrets into the build pipeline.
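A sketch of the KMS-backed secrets syntax in cloudbuild.yaml; the key ring, key, and base64 ciphertext are placeholders, and the ciphertext would be produced beforehand with gcloud kms encrypt:

```yaml
steps:
- name: 'gcr.io/cloud-builders/docker'
  entrypoint: 'bash'
  args: ['-c', 'docker login --username=my-user --password=$$DB_PASSWORD']
  secretEnv: ['DB_PASSWORD']  # Cloud Build decrypts this via KMS at build time
secrets:
- kmsKeyName: 'projects/my-project/locations/global/keyRings/my-ring/cryptoKeys/my-key'
  secretEnv:
    DB_PASSWORD: '<base64-encoded ciphertext from gcloud kms encrypt>'
```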
Your application images are built and pushed to Google Container Registry (GCR). You want to build an automated pipeline that deploys the application when the image is updated while minimizing the development effort. What should you do?
Use Cloud Pub/Sub to trigger a Spinnaker pipeline. What is Spinnaker? A multi-cloud continuous delivery platform for releasing software changes with high velocity and confidence. Created at Netflix, it has been battle-tested in production by hundreds of teams over millions of deployments, and it combines a powerful and flexible pipeline management system with integrations to the major cloud providers. Cloud Build can't trigger a Spinnaker pipeline directly; you have to use Cloud Pub/Sub (for example, the gcr topic that GCR publishes image-change notifications to) to trigger it.
You use Cloud Build to build your application. You want to reduce the build time while minimizing cost and development effort. What should you do?
Use Cloud Storage to cache intermediate artifacts. To increase the speed of a build, reuse the results from a previous build: copy the results of a previous build from a Google Cloud Storage bucket, use them to speed up the current build, and then copy the new results back to the bucket.
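A sketch of this caching pattern in cloudbuild.yaml; the bucket name and cached directory are placeholders, and the first copy step would need a tolerance for a cold (missing) cache object:

```yaml
steps:
# Restore the cache saved by a previous build (bucket name is a placeholder).
- name: 'gcr.io/cloud-builders/gsutil'
  args: ['cp', 'gs://my-build-cache/cache.tar.gz', 'cache.tar.gz']
- name: 'ubuntu'
  args: ['tar', 'xzf', 'cache.tar.gz']
# ... build steps that reuse and refresh the cached directory ...
- name: 'ubuntu'
  args: ['tar', 'czf', 'cache.tar.gz', '.cache']
# Save the refreshed cache for the next build.
- name: 'gcr.io/cloud-builders/gsutil'
  args: ['cp', 'cache.tar.gz', 'gs://my-build-cache/cache.tar.gz']
```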
You support an e-commerce application that runs on a large Google Kubernetes Engine (GKE) cluster deployed on-premises and on Google Cloud Platform. The application consists of microservices that run in containers. You want to identify containers that are using the most CPU and memory. What should you do?
Use Stackdriver Kubernetes Engine Monitoring. Cloud Operations for GKE is designed to monitor GKE clusters. It manages Monitoring and Logging services together and features a Cloud Operations for GKE dashboard that provides a customized interface for GKE clusters. You can view a cluster's key metrics, such as CPU utilization, memory utilization, and the number of open incidents, and you can view clusters by their infrastructure, workloads, or services.
You support an application deployed on Compute Engine. The application connects to a Cloud SQL instance to store and retrieve data. After an update to the application, users report errors showing database timeout messages. The number of concurrent active users remained stable. You need to find the most probable cause of the database timeout. What should you do?
Use Stackdriver Profiler to visualize resource utilization throughout the application. Because the number of concurrent active users remained stable, the most probable cause of the database timeouts is a performance regression introduced by the update. Stackdriver Profiler can identify and diagnose such issues: it visualizes resource utilization across the application, including CPU and memory usage, and highlights the parts of the code generating the most load. This helps you understand how the application uses its resources and find the bottlenecks that might be causing the timeouts.
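Enabling the profiler in a Python service is a one-time call at startup with the google-cloud-profiler package; the service name and version here are placeholders:

```python
# Sketch: start Cloud Profiler in a Python application (names are placeholders)
import googlecloudprofiler

googlecloudprofiler.start(
    service="booking-app",
    service_version="1.0.0",
)
```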
Your application services run in Google Kubernetes Engine (GKE). You want to make sure that only images from your centrally-managed Google Container Registry (GCR) image registry in the altostrat-images project can be deployed to the cluster while minimizing development time. What should you do?
Use a Binary Authorization policy that includes the whitelist name pattern gcr.io/altostrat-images/.
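A sketch of what such a Binary Authorization policy might look like; the enforcement and evaluation settings below are illustrative choices, not the only valid ones:

```yaml
# Only images from the central registry are admitted; everything else is denied.
admissionWhitelistPatterns:
- namePattern: gcr.io/altostrat-images/*
defaultAdmissionRule:
  evaluationMode: ALWAYS_DENY
  enforcementMode: ENFORCED_BLOCK_AND_AUDIT_LOG
globalPolicyEvaluationMode: ENABLE
```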
You are running an application on Compute Engine and collecting logs through Stackdriver. You discover that some personally identifiable information (PII) is leaking into certain log entry fields. All PII entries begin with the text userinfo. You want to capture these log entries in a secure location for later review and prevent them from leaking to Stackdriver Logging. What should you do?
Use a Fluentd filter plugin with the Stackdriver Agent to remove log entries containing userinfo, and then copy the entries to a Cloud Storage bucket. Fluentd can filter logs before passing them to Stackdriver, removing sensitive information such as credit card details or social security numbers. Once filtered out, the matching entries can be routed to Cloud Storage for secure review; the unfiltered PII should never reach Stackdriver at all.
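A minimal sketch of the exclusion side using Fluentd's grep filter; the field name (message) and prefix pattern are assumptions, and routing the excluded entries to Cloud Storage would be a separate copy/output stage not shown here:

```
<filter **>
  @type grep
  <exclude>
    key message
    pattern /^userinfo/
  </exclude>
</filter>
```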
You have an application running in Google Kubernetes Engine. The application invokes multiple services per request but responds too slowly. You need to identify which downstream service or services are causing the delay. What should you do?
Use a distributed tracing framework such as OpenTelemetry or Stackdriver Trace. Distributed tracing allows you to trace the path of a request as it travels through multiple services and identify where delays may be occurring. This can provide detailed information about the request and response timings for each service, making it easier to pinpoint which services are causing delays in your application. OpenTelemetry and Stackdriver Trace are both available on GCP, and provide easy integration with Kubernetes and other GCP services.
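A minimal sketch of instrumenting a request handler with OpenTelemetry and exporting spans to Cloud Trace (assumes the opentelemetry-sdk and opentelemetry-exporter-gcp-trace packages; span names are placeholders):

```python
# Sketch: one span per downstream call makes slow services stand out in the trace view
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(CloudTraceSpanExporter())
)
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("handle-request"):
    with tracer.start_as_current_span("call-inventory-service"):
        pass  # placeholder for the actual downstream RPC
```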
You are ready to deploy a new feature of a web-based application to production. You want to use Google Kubernetes Engine (GKE) to perform a phased rollout to half of the web server pods. What should you do?
Use a partitioned rolling update. A partitioned rolling update lets you control how many Pods receive the new version at a time (only Pods at or above the partition ordinal are updated), which allows you to perform a phased rollout. This way you can incrementally test and monitor the new feature before it is deployed to all the Pods. The approach minimizes the risk of introducing bugs or breaking changes into production, gives you more control over the process, and avoids having all the Pods down at the same time.
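A sketch of the relevant manifest fields, assuming the web servers run as a StatefulSet named web with 10 replicas; setting partition: 5 rolls the new version out to only Pods with ordinal 5 through 9, i.e. half of them:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  replicas: 10
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 5   # only Pods with ordinal >= 5 receive the new version
  # ... serviceName, selector, and Pod template omitted for brevity ...
```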
You support a trading application written in Python and hosted on App Engine flexible environment. You want to customize the error information being sent to Stackdriver Error Reporting. What should you do?
Use the Stackdriver Error Reporting API to write errors from your application to ReportedErrorEvent, and then generate log entries with properly formatted error messages in Stackdriver Logging. App Engine grants the Error Reporting Writer role by default. The Error Reporting library for Python can be used without needing to explicitly provide credentials. Error Reporting is automatically enabled for App Engine flexible environment applications. No additional setup is required.
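In Python this is a few lines with the google-cloud-error-reporting client; risky_operation below is a hypothetical application call:

```python
# Sketch: report a handled exception to Error Reporting
from google.cloud import error_reporting

client = error_reporting.Client()  # App Engine flex supplies credentials implicitly

try:
    risky_operation()  # hypothetical application call
except Exception:
    client.report_exception()  # sends a ReportedErrorEvent with the stack trace
```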
You are running an application on Compute Engine and collecting logs through Stackdriver. You discover that some personally identifiable information (PII) is leaking into certain log entry fields. You want to prevent these fields from being written in new log entries as quickly as possible. What should you do?
Use the filter_record_transformer Fluentd filter plugin to remove the fields from the log entries in flight. Fluentd is a log collector and processor that is commonly used with Google Cloud Platform. The filter_record_transformer plugin can modify log entries as they are collected, removing sensitive fields in real time before they are written to Stackdriver. This is fast to roll out because it requires no changes to the application code. The plugin mutates/transforms incoming event streams in a versatile manner; if you need to add, delete, or modify events, it is the first filter to try, and it is included in Fluentd's core.
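A minimal sketch of the plugin configuration, assuming the PII sits in a record field named userinfo:

```
<filter **>
  @type record_transformer
  remove_keys userinfo
</filter>
```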
You are performing a semi-annual capacity planning exercise for your flagship service. You expect a service user growth rate of 10% month-over-month over the next six months. Your service is fully containerized and runs on Google Cloud Platform (GCP), using a Google Kubernetes Engine (GKE) Standard regional cluster on three zones with cluster autoscaler enabled. You currently consume about 30% of your total deployed CPU capacity, and you require resilience against the failure of a zone. You want to ensure that your users experience minimal negative impact as a result of this growth or as a result of zone failure, while avoiding unnecessary costs. How should you prepare to handle the predicted growth?
Verify the maximum node pool size, enable a horizontal pod autoscaler, and then perform a load test to verify your expected resource needs. Cluster autoscaler will add nodes when pods can't be scheduled, but you also need something to add new pods when load increases, and that is the Horizontal Pod Autoscaler. The Horizontal Pod Autoscaler changes the shape of your Kubernetes workload by automatically increasing or decreasing the number of Pods in response to the workload's CPU or memory consumption.
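A sketch of an HPA manifest for such a workload; the Deployment name, replica bounds, and CPU target are illustrative assumptions to be validated by the load test:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: flagship-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: flagship-service
  minReplicas: 3        # at least one Pod per zone for zone-failure resilience
  maxReplicas: 30       # must fit within the node pool's maximum size
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
```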
You support an application running on App Engine. The application is used globally and accessed from various device types. You want to know the number of connections. You are using Stackdriver Monitoring for App Engine. What metric should you use?
flex/connections/current. This App Engine flexible environment metric (appengine.googleapis.com/flex/connections/current) reports the number of current active connections, which is what you need in order to track connection counts for the application.