You work as a data engineer in charge of running transformation jobs on subscription sales and turnover data for a large movie and TV streaming platform. Currently, the sales and turnover data resides in Apache Parquet files on S3. You notice that there are 40,000 files in S3 and all are small in size, ranging from 256-512 KB. You need to find a way for the AWS Glue job to group the files when they are read from S3. Which solution can be used to accomplish this with the least development effort? Use the join, filter, and relationalize transformations together within the AWS Glue job. Set the groupFiles property to inPartition for the AWS Glue job. Use an EMR cluster to read in the files from S3 using the S3DistCp command and an Apache Spark job to merge the files into fewer, but larger, files. Use the mergeDynamicFrame transformation within the AWS Glue job.

Set the groupFiles property to inPartition for the AWS Glue job. To enable grouping files for a table, you set key-value pairs in the parameters field of your table structure. Use JSON notation to set a value for the parameter field of your table. For more information about editing the properties of a table, see Viewing and Editing Table Details. You can use this method to enable grouping for tables in the Data Catalog with Amazon S3 data stores. Setting groupFiles to inPartition enables the grouping of files within an Amazon S3 data partition. Reading Input Files in Larger Groups (https://docs.aws.amazon.com/glue/latest/dg/grouping-input-files.html)
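As an illustration, here is a minimal sketch of the same grouping settings passed directly in a Glue PySpark script when reading straight from S3 (the bucket path and group size below are placeholders, not values from the question):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the many small Parquet files as one DynamicFrame, grouping them
# within each S3 partition; groupSize (bytes) is an illustrative value.
sales = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://example-sales-bucket/turnover/"],  # placeholder path
        "groupFiles": "inPartition",
        "groupSize": "134217728",  # roughly 128 MB per group
    },
    format="parquet",
)
```

When reading through the Data Catalog instead, the same key-value pairs go in the table's parameters field, as described in the referenced documentation.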

Bob's Turnip Farm is storing their financial records in an S3 bucket. Due to corporate espionage in the turnip market, they want to ensure that their financial records are secured with an easily rotated strong encryption key that they create and manage. How can they most easily accomplish this? Add a Customer Master Key to Key Management Service and utilize the Server Side Encryption - KMS (SSE-KMS) option for your S3 objects. Enable Server Side Encryption - S3 (SSE-S3) on a new bucket and copy all objects from the existing bucket to the new bucket. Download all files from the S3 bucket, delete them from S3 once they've been downloaded, and encrypt all of the files locally and upload them again. Create a custom workflow to automatically pick up each file in the bucket, pass it through an on-premises HSM utility, and upload to the S3 bucket.

Add a Customer Master Key to Key Management Service and utilize the Server Side Encryption - KMS (SSE-KMS) option for your S3 objects. Correct! This will satisfy all of the requested features.
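A minimal boto3 sketch of this approach, assuming placeholder names for the bucket and key description: create a customer managed KMS key, turn on automatic rotation, and set it as the bucket's default encryption.

```python
import boto3

kms = boto3.client("kms")
s3 = boto3.client("s3")

# Customer managed key that Bob's Turnip Farm controls and can rotate.
key_id = kms.create_key(Description="Financial records key")["KeyMetadata"]["KeyId"]
kms.enable_key_rotation(KeyId=key_id)  # automatic yearly rotation

# Default SSE-KMS encryption for every new object in the bucket (placeholder name).
s3.put_bucket_encryption(
    Bucket="turnip-financial-records",
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": key_id,
            }
        }]
    },
)
```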

You are creating an AWS Glue crawler to populate your AWS Glue Data Catalog. The data store that you want to crawl is located in an Amazon S3 bucket. Which of the following are the correct approaches to enabling access for AWS Glue? Create an IAM user with permission to access the S3 bucket. Use Cognito to authenticate the AWS Crawler. Attach the role to the AWS Glue crawler to give temporary security credentials the crawler will need to access Amazon S3. Provide the IAM credentials as a parameter to the AWS Glue crawler job. Attach the user to the AWS Glue crawler to give temporary security credentials the crawler will need to access Amazon S3. Create an IAM role with permission to access the S3 bucket and write to the AWS Glue Data Catalog.

Attach the role to the AWS Glue crawler to give temporary security credentials the crawler will need to access Amazon S3. The IAM role that you specify for a crawler must have permission to access the data store that is crawled, and permission to create and update tables and partitions in the AWS Glue Data Catalog. Crawler Prerequisites (https://docs.aws.amazon.com/glue/latest/dg/crawler-prereqs.html) Create an IAM role with permission to access the S3 bucket and write to the AWS Glue Data Catalog. The IAM role that you specify for a crawler must have permission to access the data store that is crawled, and permission to create and update tables and partitions in the AWS Glue Data Catalog. Crawler Prerequisites (https://docs.aws.amazon.com/glue/latest/dg/crawler-prereqs.html)
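A sketch, with hypothetical role, bucket, and database names, of creating such a role and attaching it to a crawler with boto3; the AWSGlueServiceRole managed policy covers the Data Catalog permissions and an inline policy grants read access to the bucket.

```python
import json
import boto3

iam = boto3.client("iam")
glue = boto3.client("glue")

# Trust policy so the Glue service can assume the role.
trust = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow",
                   "Principal": {"Service": "glue.amazonaws.com"},
                   "Action": "sts:AssumeRole"}],
}
role = iam.create_role(RoleName="GlueCrawlerRole",
                       AssumeRolePolicyDocument=json.dumps(trust))
iam.attach_role_policy(RoleName="GlueCrawlerRole",
                       PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole")
iam.put_role_policy(
    RoleName="GlueCrawlerRole",
    PolicyName="S3Read",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow",
                       "Action": ["s3:GetObject", "s3:ListBucket"],
                       "Resource": ["arn:aws:s3:::example-data-bucket",
                                    "arn:aws:s3:::example-data-bucket/*"]}],
    }),
)

glue.create_crawler(
    Name="sales-crawler",
    Role=role["Role"]["Arn"],
    DatabaseName="sales_catalog",
    Targets={"S3Targets": [{"Path": "s3://example-data-bucket/sales/"}]},
)
```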

A geologic study group has installed thousands of IoT sensors across the globe to measure various soil attributes. Each sensor delivers its data to a prefix in a single S3 bucket in Parquet-formatted files. The study group would like to be able to query this data using SQL and avoid any data processing in order to minimize their costs. Which of the following is the best solution? Use Glue to catalog the data, load the catalog into Elastic Map Reduce, and query the data from EMR. Configure Athena tables to make the data queryable and provide the appropriate access to team members via IAM policy. Utilize Glue to ETL the data to a Redshift cluster, run SQL queries from Redshift, and visualize data with QuickSight. Write a Lambda function that leverages the S3 Select API call to collect data from each sensor and join them together to answer queries.

Configure Athena tables to make the data queryable and provide the appropriate access to team members via IAM policy. This is the best option. Because the data is already stored in an Athena-friendly format in S3, this would be a cost-effective solution.
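A sketch of defining an Athena table directly over the Parquet files through boto3; the column names, database, and bucket paths are hypothetical.

```python
import boto3

athena = boto3.client("athena")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS soil_readings (
  sensor_id    string,
  reading_time timestamp,
  moisture     double,
  ph           double
)
STORED AS PARQUET
LOCATION 's3://example-sensor-bucket/readings/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "geology"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```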

You are working for an investment bank, which has just employed a new team of analysts who will focus on using historical data stored in Amazon EMR to predict future stock market performance using an in-house Business Intelligence (BI) application. The new Trading Analytics Team are working from the New York office and, in order to complete their analysis, they need to connect the BI application running on their local desktop to the EMR cluster. The BI application is extremely sensitive to network inconsistencies, and during initial testing it frequently hangs and becomes unresponsive at the busiest time of day. How would you configure the network between the BI application and the EMR cluster to ensure that the network is consistent and the application does not hang? Configure a Direct Connect connection between the New York office and AWS. Configure a bastion host to manage the network connectivity between the New York office and the EMR cluster. Configure an Internet Gateway to manage the network connectivity between the New York office and the EMR cluster. Configure an AWS managed IPsec VPN connection over the internet to the VPC where the EMR cluster is running.

Configure a Direct Connect connection between the New York office and AWS. AWS Direct Connect is a service you can use to establish a private dedicated network connection to AWS from your data center, office, or colocation environment. If you have large amounts of input data, using AWS Direct Connect may reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections. Connect to Data with AWS DirectConnect (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-input-directconnect.html)

You've been contracted to help secure an analytics architecture for a government agency. Their primary point of concern is to ensure that their primary analytics system is only accessible from their VPN/Bastion subnet, which uses the CIDR range 10.125.36.0/24. The subnets that are being utilized by the analytics systems should ONLY be accessible from this CIDR range. What is the most efficient way to accomplish this? Configure a Network Access Control List rule to only allow traffic in from the 10.125.36.0/24 CIDR range for the NACL associated with subnets the analytics system is utilizing. Configure a Network Access Control List rule to only allow traffic in from the 10.125.0.0/16 CIDR range for the NACL associated with subnets that the analytics system is utilizing. Create an IAM policy that restricts access to the analytics subnets and apply it to the appropriate VPC subnets. Create a security group that only allows inbound traffic from the 10.125.36.0/24 CIDR range and apply it to the Redshift cluster.

Configure a Network Access Control List rule to only allow traffic in from the 10.125.36.0/24 CIDR range for the NACL associated with subnets the analytics system is utilizing. Correct! This will prevent any inbound traffic from unwanted IP ranges from entering the target subnets, which will cover all resources within those subnets.
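A boto3 sketch of that allow rule, using a placeholder NACL ID; everything not explicitly allowed falls through to the NACL's default deny rule.

```python
import boto3

ec2 = boto3.client("ec2")

# Allow inbound traffic only from the VPN/bastion subnet; all other inbound
# traffic is caught by the NACL's implicit * deny rule.
ec2.create_network_acl_entry(
    NetworkAclId="acl-0123456789abcdef0",  # placeholder NACL of the analytics subnets
    RuleNumber=100,
    Protocol="-1",        # all protocols
    RuleAction="allow",
    Egress=False,
    CidrBlock="10.125.36.0/24",
)
```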

Your EMR cluster needs to access data located in an S3 bucket. Your CTO has security concerns about sending and receiving data across the public internet, and has requested that any traffic between EMR and S3 must not use the internet. How can you configure your architecture so that the EMR cluster can access resources in S3 using a private IP address, and without exposing traffic to the internet? Configure a VPC endpoint for S3 and configure the EMR cluster to access S3 using the VPC endpoint, so that all requests to S3 remain within the Amazon network. Configure a Direct Connect connection and configure the EMR cluster to access S3 using Direct Connect, so that all requests to S3 remain within a dedicated private network. Configure a site-to-site VPN so that all requests to S3 remain within an encrypted private network. Configure a NAT gateway and route all traffic through the AWS network using a private IP address.

Configure a VPC endpoint for S3 and configure the EMR cluster to access S3 using the VPC endpoint, so that all requests to S3 remain within the Amazon network. Many customers have legitimate privacy and security concerns about sending and receiving data across the public internet. A VPC endpoint for Amazon S3 allows you to use private IP addresses to access Amazon S3 with no exposure to the public internet. No public IP address is required, and you don't need an internet gateway, a NAT device, or a virtual private gateway in your VPC. Traffic between your VPC and the AWS service does not leave the Amazon network. S3 VPC Endpoints (https://docs.aws.amazon.com/glue/latest/dg/vpc-endpoints-s3.html)
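A minimal sketch of creating the gateway endpoint with boto3; the VPC, route table, and Region values are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")

# Gateway endpoint for S3; traffic from subnets using this route table
# stays on the Amazon network with no internet gateway or NAT required.
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",
    VpcEndpointType="Gateway",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
```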

Clicky Looky is an analytics company that specializes in processing mobile device clickstream data. They've created a delivery API for their clients to send this data. They require that the data be delivered to the API in JSON format. They need a near real-time workflow to make the data available to their clients in the application and the data should be validated early in the analytics pipeline. How can they best accomplish this? Utilize Glue to ingest and validate API data. Create a Glue job to process the data into an S3 bucket. Configure Athena to provide a SQL interface for the data. Configure the API to deliver the records to an S3 bucket. Use Database Migration Service to ingest and process the records from S3. Utilize DMS Data Validation to validate the records before writing them to a Redshift cluster. Create an EC2 auto scaling group to send API data to through an Application Load Balancer. Validate each record on the EC2 instance, and write the record to an Aurora Postgres instance. Configure the end user application to use Aurora Postgres as its datastore. Configure the API to deliver records to a Kinesis Data Stream. Create a Lambda consumer for the data stream that validates the data and sends valid records to a Kinesis Firehose delivery stream with a Redshift cluster configured as the destination. Configure the application to use Redshift as its datastore.

Configure the API to deliver records to a Kinesis Data Stream. Create a Lambda consumer for the data stream that validates the data and sends valid records to a Kinesis Firehose delivery stream with a Redshift cluster configured as the destination. Configure the application to use Redshift as its datastore. This will provide a fast, scalable pipeline for the clickstream data which may not have a regular delivery pattern.
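A sketch of the Lambda consumer under stated assumptions: the delivery stream name and the validation rule (required fields) are hypothetical.

```python
import base64
import json
import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM = "clickstream-to-redshift"  # hypothetical Firehose delivery stream


def handler(event, context):
    """Triggered by the Kinesis Data Stream; forwards only valid JSON records."""
    valid = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        try:
            doc = json.loads(payload)
        except ValueError:
            continue  # drop malformed records early in the pipeline
        if "device_id" in doc and "event_time" in doc:  # illustrative validation rule
            valid.append({"Data": (json.dumps(doc) + "\n").encode()})
    if valid:
        firehose.put_record_batch(DeliveryStreamName=DELIVERY_STREAM, Records=valid)
```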

You work for a new startup that is developing gaming applications for mobile devices. Your team is about to launch a new feature to the game and projections have determined that the user base will grow exponentially. Your job as a data engineer is to design a data warehousing solution to store and manage all of the metadata about users, game play, in-app purchases, and error logs. You've decided to use a Redshift cluster for the data warehousing solution. At first, a beta version of the application will launch and the data that needs to be loaded into the data warehousing solution is small. Following orders from the leadership team, it's your job to ensure the data warehousing solution will be able to grow to keep up with demand. What are some considerations you should take when creating the initial Redshift tables and deciding on a distribution style, in order to ensure a cost-effective and high-performing solution? Consider using KEY distribution. Consider using EVEN distribution. Consider using ALL distribution. Consider using AUTO distribution.

Consider using AUTO distribution. With AUTO distribution, Amazon Redshift assigns an optimal distribution style based on the size of the table data. For example, Amazon Redshift initially assigns ALL distribution to a small table, then changes to EVEN distribution when the table grows larger. When a table is changed from ALL to EVEN distribution, storage utilization might change slightly. The change in distribution occurs in the background, in a few seconds. Distribution Styles (https://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html)
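A brief sketch of creating such a table with AUTO distribution through the Redshift Data API; the cluster, database, and column names are placeholders.

```python
import boto3

redshift_data = boto3.client("redshift-data")

redshift_data.execute_statement(
    ClusterIdentifier="game-warehouse",   # placeholder cluster
    Database="analytics",
    DbUser="admin",
    Sql="""
        CREATE TABLE game_events (
            user_id    BIGINT,
            event_type VARCHAR(64),
            event_time TIMESTAMP
        ) DISTSTYLE AUTO;
    """,
)
```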

You've been contacted by a company that is looking to clear stale data from an existing multi-terabyte DynamoDB table. They want to store all of the records in a data lake, and want to unload the data as quickly as possible before performing pruning on the table. They have requested a solution that does not require writing code, if possible. What is the optimal way to accomplish this? Create a Data Pipeline with the Export DynamoDB table to S3 template. Provide the source DynamoDB table and destination S3 bucket, and start the pipeline. Configure a Glue crawler to catalog the DynamoDB table. Configure a Glue job to migrate all records from the table to an S3 bucket. Write a custom script to run from an EC2 instance that reads batches of records from the DynamoDB table and writes them to the S3 destination. Create a Lambda function that reads batches of records from the DynamoDB table and writes them to the S3 destination.

Create a Data Pipeline with the Export DynamoDB table to S3 template. Provide the source DynamoDB table and destination S3 bucket, and start the pipeline. Data Pipeline is the best tool available to extract all of the records from a DynamoDB table as quickly as possible without writing any code.

You work as a data analyst for a major airline company that operates flights scheduled all around the globe. The current ticketing system is going through a technical audit and has the requirement, by air traffic control law, that all parts of the ticketing system be digitized. The volume of ticketing data created on a daily basis is incredibly high. Your team has been tasked with collecting the ticketing data and storing it in S3, which is copied on a nightly basis to a company data lake for retrieval. There is also a requirement that the ticketing data be transformed and grouped into batches according to the flight departure location. The data must be optimized for high-performance retrieval rates, as well as collected and stored with high durability. Which solution would you use to ensure the data is collected and stored in a cost-effective, durable, and high-performing manner? Create a Kinesis Data Stream and set the partition key to the flight's departure location. Use multiple shards to batch the data, before sending it to a Kinesis Data Firehose delivery stream that delivers the data to S3. Create a Kinesis Data Firehose delivery stream to receive the data that is then sent to Lambda, where records will be batched by the buffer interval/size. Once the data is transformed, convert the records to CSV format and store the results onto S3. Create an Elastic MapReduce (EMR) cluster with Spark Streaming to receive the data, and use a spark-submit job to batch and transform the data into ORC before it is delivered into S3. Create a Kinesis Data Firehose delivery stream to receive the data, with transformations enabled to allow the data to be batched and transformed into ORC before it is delivered into S3.

Create a Kinesis Data Firehose delivery stream to receive the data, with transformations enabled to allow the data to be batched and transformed into ORC before it is delivered into S3. Stream Real-Time Data in Apache Parquet or ORC Format Using Amazon Kinesis Data Firehose (https://aws.amazon.com/about-aws/whats-new/2018/05/stream_real_time_data_in_apache_parquet_or_orc_format_using_firehose/) This is the best answer because it uses ORC files, which are partitioned in batches by Kinesis Data Firehose transformations and allow for highly optimized SQL queries in the company data lake.
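A sketch of creating such a delivery stream with record format conversion to ORC; the conversion requires a Glue table describing the ticket schema, and all names and ARNs below are placeholders.

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="ticketing-to-s3",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::example-ticketing-bucket",
        "BufferingHints": {"IntervalInSeconds": 300, "SizeInMBs": 128},
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"OrcSerDe": {}}},
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
                "DatabaseName": "ticketing",   # Glue database holding the ticket schema
                "TableName": "tickets",
                "Region": "us-east-1",
            },
        },
    },
)
```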

You're in charge of the data backend for a very popular social media website. All live OLTP data is stored in several DynamoDB tables. The company you work for has started a new analytics initiative, and needs a system that enables text search and produces near real-time analytics output for Business Intelligence tooling. Timely results are more important than cost, as revenue projections for the deliverables of the project are enormous. What is the best way to accomplish this? Use the DynamoDB S3 export feature to export all existing data to an S3 bucket. Configure Athena to provide an SQL interface for the S3 stored data. Use Athena for Business Intelligence and search functionality. Enable streams on all production DynamoDB tables with Lambda functions to add any new records to the S3 data store. Create a Redshift cluster and Elasticsearch Service cluster. Configure two Kinesis Firehose streams, one configured to deliver data to each new cluster respectively. Use Glue to crawl and catalog the application DynamoDB tables. Create Glue jobs to process the existing DynamoDB stored data into the Kinesis Firehose delivery streams. Enable the DynamoDB table streams on all application tables. Create a Lambda function for each production DynamoDB table that is triggered from the respective table's stream to appropriately process stream data into the two Kinesis Firehose delivery streams. Provide the Elasticsearch cluster endpoint for text search and Redshift endpoint for analytics querying and Business Intelligence tooling. Create a Redshift cluster and Elasticsearch Service cluster. Configure a Kinesis Firehose stream to deliver data to each new cluster. Use Glue to crawl and catalog the application DynamoDB tables. Create Glue jobs to process the existing DynamoDB stored data into the Kinesis Firehose delivery stream. Enable the DynamoDB table streams on all application tables. Create a Lambda function for each production DynamoDB table that is triggered from the respective table's stream to appropriately process stream data into the Kinesis Firehose delivery stream. Provide the Elasticsearch cluster endpoint for text search and Redshift endpoint for analytics querying and Business Intelligence tooling. Utilize Glue to crawl and catalog all production DynamoDB tables. Launch an Elastic MapReduce (EMR) cluster and utilize the Glue data catalog for the production DynamoDB tables to create Hive tables. Provide the EMR cluster endpoint for querying the DynamoDB stored data and Business Intelligence tooling.

Create a Redshift cluster and Elasticsearch Service cluster. Configure two Kinesis Firehose streams, one configured to deliver data to each new cluster respectively. Use Glue to crawl and catalog the application DynamoDB tables. Create Glue jobs to process the existing DynamoDB stored data into the Kinesis Firehose delivery streams. Enable the DynamoDB table streams on all application tables. Create a Lambda function for each production DynamoDB table that is triggered from the respective table's stream to appropriately process stream data into the two Kinesis Firehose delivery streams. Provide the Elasticsearch cluster endpoint for text search and Redshift endpoint for analytics querying and Business Intelligence tooling. While this seems like a fairly complex configuration, it will fulfill the requirements and can be tuned and expanded to meet various analytics and search needs. While data will be stored in triplicate in this configuration, data storage is relatively inexpensive and each access pattern (OLTP, OLAP, and search) is handled by the best-suited service.
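A sketch of one such stream-triggered Lambda function, fanning new items out to both delivery streams; the Firehose names are hypothetical and the item is forwarded in its DynamoDB-typed form for simplicity.

```python
import json
import boto3

firehose = boto3.client("firehose")
STREAMS = ["social-to-redshift", "social-to-elasticsearch"]  # hypothetical Firehose names


def handler(event, context):
    """Triggered by a DynamoDB table stream; sends new/changed items to both destinations."""
    records = []
    for rec in event["Records"]:
        if rec["eventName"] in ("INSERT", "MODIFY"):
            image = rec["dynamodb"]["NewImage"]  # DynamoDB-typed attribute map
            records.append({"Data": (json.dumps(image) + "\n").encode()})
    for name in STREAMS:
        if records:
            firehose.put_record_batch(DeliveryStreamName=name, Records=records)
```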

Unboxy is a company that specializes in high-resolution videos of product unboxing. They've amassed more 4K video footage of product unboxing than they can reasonably store on-premises. The problem is that they need to keep unboxing new products and producing videos. They would like to move their video library to S3 as quickly as possible. They have a 1 Gb/s internet connection to utilize, and each video averages 6 GB. How can they get their videos into S3 as quickly as possible? Open as many browser tabs as there are videos, and start an upload for one video per tab in the S3 web console. Create a multi-threaded script that splits each video into 100MB chunks and utilize S3 Multipart Upload to upload the videos to S3, utilizing all available bandwidth. Utilize the S3 mv command from the AWS CLI to move the video files to S3. Order a Snowmobile, include the appropriate S3 information in the order, and load the video files onto the Snowmobile once it arrives.

Create a multi-threaded script that splits each video into 100MB chunks and utilize S3 Multipart Upload to upload the videos to S3, utilizing all available bandwidth. Given the ample connection, you can leverage Multipart Upload to quickly transfer the large video files to S3.
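A minimal sketch of that script using boto3's managed transfer, assuming a placeholder bucket name and a local "videos" directory; the transfer manager handles the 100 MB part splitting and parallel threads.

```python
import os
import boto3
from boto3.s3.transfer import TransferConfig

MB = 1024 * 1024
config = TransferConfig(
    multipart_threshold=100 * MB,   # every 6 GB video is split into 100 MB parts
    multipart_chunksize=100 * MB,
    max_concurrency=10,             # parts uploaded in parallel threads
    use_threads=True,
)

s3 = boto3.client("s3")
for name in os.listdir("videos"):
    s3.upload_file(os.path.join("videos", name),
                   "unboxy-video-archive",   # placeholder bucket
                   name,
                   Config=config)
```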

Your company uses Athena to run standard SQL queries on data stored in S3. You have a new team member who requires access to run Athena queries on a number of different S3 buckets. Which of the following should you do to configure access for this new team member? Create a new IAM user account and attach an IAM policy which allows access to Athena. Configure S3 bucket policies to allow the new user to access objects in the required buckets. Create a new IAM account and attach the AmazonAthenaFullAccess managed policy to allow the new user access to run Athena queries on objects in the required buckets. Create a new IAM account and configure S3 bucket policies to allow the new user to access objects in the required buckets. Create a new IAM account and attach a new AWS Managed Policy, allowing the new employee to access the required S3 buckets using Athena.

Create a new IAM user account and attach an IAM policy which allows access to Athena. Configure S3 bucket policies to allow the new user to access objects in the required buckets. Amazon Athena allows you to control access to your data by using IAM policies, Access Control Lists (ACLs), and S3 bucket policies. With IAM policies, you can grant IAM users fine-grained control to your S3 buckets. By controlling access to data in S3, you can restrict users from querying it using Athena. Athena reads data from S3 buckets using the IAM credentials of the user who submitted the query. Query results are stored in a separate S3 bucket. Usually, an Access Denied error means that you don't have permission to read the data in the bucket, or permission to write to the results bucket. Athena FAQs (https://aws.amazon.com/athena/faqs/#Security_.26_availability) Athena "Access Denied" error (https://aws.amazon.com/premiumsupport/knowledge-center/access-denied-athena/)
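A sketch of such an IAM policy attached to the new user with boto3; the user name, bucket names, and the exact Athena actions granted are illustrative assumptions (in practice Glue Data Catalog permissions are usually needed as well).

```python
import json
import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["athena:StartQueryExecution", "athena:GetQueryExecution",
                    "athena:GetQueryResults"],
         "Resource": "*"},
        {"Effect": "Allow",  # read access to the data buckets
         "Action": ["s3:GetObject", "s3:ListBucket"],
         "Resource": ["arn:aws:s3:::example-data-bucket",
                      "arn:aws:s3:::example-data-bucket/*"]},
        {"Effect": "Allow",  # read/write access to the query results bucket
         "Action": ["s3:GetBucketLocation", "s3:GetObject", "s3:ListBucket",
                    "s3:PutObject"],
         "Resource": ["arn:aws:s3:::example-athena-results",
                      "arn:aws:s3:::example-athena-results/*"]},
    ],
}

iam.put_user_policy(UserName="new-analyst",
                    PolicyName="athena-query-access",
                    PolicyDocument=json.dumps(policy))
```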

You have recently started a new role as a Data Analyst for a car rental company. The company uses Amazon EMR for the majority of analytics workloads. While working on a new report, you notice the root volumes of the EMR cluster are not encrypted. You suggest to your boss that the volumes should be encrypted as soon as possible and she agrees, asking you to recommend the best approach. Which course of action do you recommend? Create a new security configuration specifying local disk encryption. Re-create the cluster using the security configuration. Select root volume encryption in the EMR console. Detach the EBS volumes from the master node. Encrypt the EBS volumes and attach them back to the master node. Specify encryption in transit in a security configuration. Re-create the cluster using the security configuration.

Create a new security configuration specifying local disk encryption. Re-create the cluster using the security configuration. Local disk encryption can be enabled as part of a security configuration to encrypt root and storage volumes. EMR Security Configuration (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-create-security-configuration.html)
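A sketch of creating that security configuration with boto3; the KMS key ARN and configuration name are placeholders.

```python
import json
import boto3

emr = boto3.client("emr")

security_config = {
    "EncryptionConfiguration": {
        "EnableInTransitEncryption": False,
        "EnableAtRestEncryption": True,
        "AtRestEncryptionConfiguration": {
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",
                "AwsKmsKey": "arn:aws:kms:us-east-1:123456789012:key/example",  # placeholder
                "EnableEbsEncryption": True,  # encrypts EBS root and storage volumes
            }
        },
    }
}

emr.create_security_configuration(
    Name="local-disk-encryption",
    SecurityConfiguration=json.dumps(security_config),
)
# The cluster is then re-created with
# run_job_flow(..., SecurityConfiguration="local-disk-encryption").
```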

You work for a company that is currently using a private Redshift cluster as their data warehousing solution running inside a VPC. You have been tasked with producing a dashboard for sales and KPI data that is stored in the Redshift cluster. You have decided to use QuickSight as the visualization and BI tool to create these dashboards. Which of the following must be done to enable access and create the dashboards? Create a security group that contains an inbound rule authorizing access from the appropriate IP address range for the region where the QuickSight servers are located. Create a Redshift Spectrum external table that allows QuickSight access through the security configurations. Create an IAM role and policy-based rules allowing QuickSight access to the Redshift cluster. Assign the IAM role to the Redshift cluster. Set up an AWS Glue Crawler to crawl the Redshift cluster in order to create a Glue Data Catalog with the Redshift metadata. Use QuickSight to connect to the Glue Data Catalog.

Create a security group that contains an inbound rule authorizing access from the appropriate IP address range for the region where the QuickSight servers are located. To give QuickSight access to a Redshift cluster, the cluster's security group needs to allow the appropriate IP range for the QuickSight servers in the AWS region where those servers are located. Authorizing Connections from Amazon QuickSight to Amazon Redshift Clusters (https://docs.aws.amazon.com/quicksight/latest/user/enabling-access-redshift.html)
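A boto3 sketch of the inbound rule; the security group ID is a placeholder, and the CIDR shown is the QuickSight range published for us-east-1 at the time of writing, so it should be confirmed against the QuickSight documentation for your Region.

```python
import boto3

ec2 = boto3.client("ec2")

# Allow the QuickSight service IP range to reach the Redshift port (5439).
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # security group attached to the Redshift cluster
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5439,
        "ToPort": 5439,
        "IpRanges": [{"CidrIp": "52.23.63.224/27",
                      "Description": "Amazon QuickSight us-east-1"}],
    }],
)
```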

You work for a company that is currently using a private Redshift as their data warehousing solution running inside a VPC. You have been tasked with producing a dashboard for sales and KPI data that is stored in the Redshift cluster. You have decided to use QuickSight as the visualization and a BI tool to create these dashboard. Which of the following must be done to enable access and create the dashboards? Create an IAM role and policy-based rules allowing QuickSight access to the Redshift cluster. Assign the IAM role the Redshift cluster. Setup an AWS Glue Crawler to crawl the Redshift cluster in order to create a Glue Data Catalog with the Redshift metadata. Use QuickSight to connect to the Glue Data Catalog. Create a security group that contains an inbound rule authorizing access from the appropriate IP address range for the region where the QuickSight servers are located. Create a Redshift Spectrum external table that allows QuickSight access through the security configurations.

Create a security group that contains an inbound rule authorizing access from the appropriate IP address range for the region where the QuickSight servers are located. To give QuickSight access to a Redshift cluster, it needs to allow the appropriate IP ranges for the QuickSight servers in the AWS region where the servers are located. Authorizing Connections from Amazon QuickSight to Amazon Redshift Clusters (https://docs.aws.amazon.com/quicksight/latest/user/enabling-access-redshift.html)

You work for a home improvement chain as a data engineer, monitoring and overseeing the data that the home improvement stores log. Every hour, online orders are batched up and stored into a CSV formatted file onto S3. Your team runs queries using Athena to determine which products are most popular during a particular date for a particular region. On average, the CSV files stored in S3 are 5 GB in size, but are growing in size to tens and hundreds of GBs. Queries are taking longer to run as the files grow larger. Which of the following solutions can help improve query performance in Athena? Break the CSV files up into smaller sizes of 128 MB each. Use queries that utilize the WHERE clause with a smaller date range. Create an AWS Glue job to transform the CSV files into Apache Parquet files. Use queries that use the GROUP BY clause. Consider partitioning the data by date and region.

Create an AWS Glue job to transform the CSV files into Apache Parquet files. This can help speed up queries by transforming the row-based storage format of CSV to columnar-based storage or Apache Parquet. Consider partitioning the data by date and region. Partitioning divides your table into parts and keeps the related data together based on column values such as date, country, region, etc. Partitions act as virtual columns. You define them at table creation, and they can help reduce the amount of data scanned per query, thereby improving performance. Top 10 Performance Tuning Tips for Amazon Athena (https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/)
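A sketch of a Glue job script that applies both ideas, converting the catalogued CSV table to Parquet partitioned by date and region; the database, table, and output path are hypothetical.

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the hourly CSV orders through the Data Catalog (placeholder names).
orders = glue_context.create_dynamic_frame.from_catalog(
    database="store_sales", table_name="orders_csv")

# Write columnar Parquet, partitioned by the columns the queries filter on.
glue_context.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={"path": "s3://example-orders-bucket/parquet/",
                        "partitionKeys": ["order_date", "region"]},
    format="parquet",
)
job.commit()
```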

You work for an organization that is responsible for ingesting GPS tracking data from devices your team has created into a data lake in EMR. Once the data is in EMR, the company then runs sophisticated Apache Spark processing on the data. To save on cost, there is a requirement that the data must be compressed before it is streamed to the Apache Spark process on EMR. There is also a requirement that the GPS data have an unlimited retention period in the streaming service. Which steps would you take to most efficiently gather the data from the GPS tracking devices? Create an Amazon Managed Streaming for Kafka (MSK) cluster with connector applications to export data from MSK to S3. Then, use the S3DistCp command to move the data from S3 to EMR. Use the Kinesis Producer Library to write the GPS data to a Kinesis Data Stream, and use Kinesis Data Firehose to load the data into S3. Then, use the DistCp command to move the data from S3 to EMR. Use the Kinesis SDKs to run REST API calls directly to Kinesis Data Firehose, and use Kinesis Data Firehose to load the data into S3. Then, use the S3DistCp command to move the data from S3 to EMR. Use the Kinesis Producer Library to write the GPS data to a Kinesis Data Stream, and use Kinesis Data Firehose to load the data into S3. Then, use the S3DistCp command to move the data from S3 to EMR.

Create an Amazon Managed Streaming for Kafka (MSK) cluster with connector applications to export data from MSK to S3. Then, use the S3DistCp command to move the data from S3 to EMR. Since there is a requirement for the data to have an unlimited retention period, MSK is the only option that will work. With MSK, you can provide a custom configuration that allows for an unlimited retention period by setting the log.retention.ms configuration to -1. S3DistCp is always recommended over Apache DistCp since it was created and optimized specifically for loading data to and from S3 and EMR. You can also specify the compression codec to use for the copied files (gzip, gz, lzo, snappy).
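A brief sketch of creating that custom MSK configuration with boto3; the configuration name and Kafka version are illustrative.

```python
import boto3

kafka = boto3.client("kafka")

# Custom MSK configuration giving the GPS topics an unlimited retention period.
kafka.create_configuration(
    Name="gps-unlimited-retention",
    KafkaVersions=["2.8.1"],  # illustrative version
    ServerProperties=b"log.retention.ms=-1\n",
)
```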

You are designing a Kinesis Client Library (KCL) application that reads data from a Kinesis Data Stream and immediately writes the data to S3. The data is batched into 15-second intervals and sent off to S3. The batching interval is a regulatory requirement set by the team who owns the data and cannot be changed. You are using Athena to query the results in S3, but notice over time that the query results are taking longer and longer to process. Which of the following is the BEST solution to improve query speeds? Use Redshift to run CREATE EXTERNAL SCHEMA SPECTRUM AND CREATE EXTERNAL TABLE to make the data accessible from Redshift spectrum. Run the queries in Redshift. Create an EMR cluster to run the S3DistCp command to combine smaller files into larger objects. Increase the batch interval to help create larger files that are put onto S3. Create a Lambda function to iterate over the smaller files and compress them using gzip.

Create an EMR cluster to run the S3DistCp command to combine smaller files into larger objects. Use the S3DistCP utility on Amazon EMR. You can use it to combine smaller files into larger objects. You can also use S3DistCP to move large amounts of data in an optimized fashion from HDFS to Amazon S3, Amazon S3 to Amazon S3, and Amazon S3 to HDFS. S3DistCp
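A sketch of submitting S3DistCp as an EMR step with boto3; the cluster ID, bucket paths, target size, and grouping pattern are placeholders.

```python
import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLECLUSTER",  # placeholder cluster ID
    Steps=[{
        "Name": "combine-small-files",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", "s3://example-kcl-bucket/raw/",
                "--dest", "s3://example-kcl-bucket/combined/",
                "--targetSize", "128",                       # aim for ~128 MB objects
                "--groupBy", ".*(\\d{4}/\\d{2}/\\d{2}).*",   # illustrative grouping regex
            ],
        },
    }],
)
```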

You've been contacted by a Data Engineer at your company, Swallow Speed Coconut Delivery, who is attempting to utilize the UNLOAD command to move data from the company's Redshift cluster to an S3 bucket which will be used to create Redshift Spectrum tables. They're receiving a permission denied error when they run their command. What are the likely corrective steps to resolve this error? Create an IAM Redshift service role and attach a policy with the appropriate S3 permissions to the newly created role. Assign the role to the Redshift cluster and retry the UNLOAD command. Ensure the user has the appropriate permissions in the Redshift user schema. Create an IAM policy granting the user permissions to upload files to S3 and attach the role to the IAM user. Create a bucket policy granting full read/write permissions to the 0.0.0.0/0 CIDR range and attach the policy to the appropriate S3 bucket. Retry the UNLOAD command.

Create an IAM Redshift service role and attach a policy with the appropriate S3 permissions to the newly created role. Assign the role to the Redshift cluster and retry the UNLOAD command. Correct! The most likely cause of a permission failure for an UNLOAD or COPY command involving S3 is a lack of permissions for S3.
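A sketch of the corrective steps under stated assumptions (the role ARN, cluster name, table, and bucket are placeholders): attach the role to the cluster, then reference it in the UNLOAD command.

```python
import boto3

redshift = boto3.client("redshift")
redshift_data = boto3.client("redshift-data")

ROLE_ARN = "arn:aws:iam::123456789012:role/redshift-s3-unload"  # placeholder service role

# Attach the role to the cluster so UNLOAD can write to S3 on its behalf.
redshift.modify_cluster_iam_roles(ClusterIdentifier="coconut-warehouse",
                                  AddIamRoles=[ROLE_ARN])

redshift_data.execute_statement(
    ClusterIdentifier="coconut-warehouse",
    Database="analytics",
    DbUser="admin",
    Sql=f"""
        UNLOAD ('SELECT * FROM deliveries')
        TO 's3://example-spectrum-bucket/deliveries/'
        IAM_ROLE '{ROLE_ARN}'
        FORMAT AS PARQUET;
    """,
)
```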

Your company uses Athena heavily for analytics queries. You've been tasked with optimizing costs for the company's data storage. Data fits into 3 categories, based on the age of the data. Some data is accessed very frequently to populate application views. Some data is accessed once a week to generate weekly status reports. Some data is only utilized yearly to generate annual report information. Application (Category 1) data is used for a month, Weekly Report (Category 2) data is used for up to one year, and Yearly Report (Category 3) data should be retained for 5 years. It is acceptable to have an extended data retrieval time for Category 3 data as it is only accessed annually. Which solution best meets these requirements while optimizing cost? Move all data to a new Elastic File System volume, and configure Athena to utilize Elastic File System for its data storage backend. Create a Lambda function that runs on a schedule and that checks the age of all objects in the S3 bucket backing Athena, then have it modify the storage classes to Standard-Infrequent Access and Glacier Deep Archive as appropriate. Create an S3 bucket Lifecycle policy that moves objects older than one month to One Zone-Infrequent Access, then Glacier after one year, and finally deletes objects 5 years after creation. Create an S3 bucket Lifecycle policy that moves objects older than one month to Standard-Infrequent Access, then Glacier Deep Archive after one year, and finally deletes objects 5 years after creation.

Create an S3 bucket Lifecycle policy that moves objects older than one month to Standard-Infrequent Access, then Glacier Deep Archive after one year, and finally deletes objects 5 years after creation. Standard-Infrequent Access will save a significant amount due to lower storage cost and the infrequent nature of the access pattern. Because extended retrieval time is acceptable for data access once a year, Glacier Deep Archive will realize the greatest cost savings for the more infrequently accessed data.
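A boto3 sketch of that lifecycle policy, assuming a placeholder bucket name and applying the rule to the whole bucket.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-athena-data",  # placeholder bucket backing the Athena tables
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-and-expire",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # apply to every object
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": 1825},  # delete 5 years after creation
        }]
    },
)
```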

You work for Gerald's Gorgeous Gramophones as a data architect. Gramophones have been flying off the shelf and the sales and site usage data is growing exponentially. You have been using a Redshift cluster to store this data and make it available to Business Intelligence tooling, but Gerald is concerned about the cost of increasing the size of the Redshift cluster to accommodate the growing data. Fresh data is only accessed regularly for two weeks, and then infrequently for BI reports that require JOINing between fresh data and older data. Which solution will maintain functionality and avoid the increased cost of a larger Redshift cluster? Create an S3 bucket to store the infrequently accessed data. Add a Redshift Spectrum table to the existing Redshift cluster to enable access of infrequently accessed data. Create a scheduled Lambda function that runs once a week that utilizes the UNLOAD command to maintain the requested divide in Active/Infrequently Accessed data. Once the data has been confirmed in the Redshift Spectrum table via automated testing, it can be dropped from the Redshift cluster's table. Utilize the UNLOAD command in your Redshift cluster to export all data to an S3 bucket. Configure Athena to be able to query the data, and modify all application and business intelligence processes to query Athena. Create an S3 bucket to store the infrequently accessed data. Utilize the UNLOAD command from a Lambda function to move the data that is not accessed frequently to the S3 bucket on a two-week schedule. Configure Athena to be able to query the infrequently accessed data. Create an RDS Postgres instance and an S3 bucket. Utilize the UNLOAD command to copy the data that will be infrequently accessed to the S3 bucket. Utilize Database Migration Service to load the data from S3 to the RDS Postgres instance. Utilize Elastic MapReduce to query the data from Redshift and RDS for Business Intelligence.

Create an S3 bucket to store the infrequently accessed data. Add a Redshift Spectrum table to the existing Redshift cluster to enable access of infrequently accessed data. Create a scheduled Lambda function that runs once a week that utilizes the UNLOAD command to maintain the requested divide in Active/Infrequently Accessed data. Once the data has been confirmed in the Redshift Spectrum table via automated testing, it can be dropped from the Redshift cluster's table. Redshift Spectrum allows us to create external tables with data stored in S3, which will have a much lower cost than maintaining the data in our Redshift cluster. The trade-off is increased access time and limited SQL operation availability.
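A sketch of defining the Spectrum side of this design through the Redshift Data API; the IAM role, Glue database, cluster, columns, and S3 path are all placeholders.

```python
import boto3

redshift_data = boto3.client("redshift-data")

create_schema = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS archive
FROM DATA CATALOG DATABASE 'gramophone_archive'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

create_table = """
CREATE EXTERNAL TABLE archive.sales_history (
    sale_id BIGINT,
    sold_at TIMESTAMP,
    amount  DECIMAL(10,2)
)
STORED AS PARQUET
LOCATION 's3://example-gramophone-archive/sales/';
"""

redshift_data.batch_execute_statement(
    ClusterIdentifier="gramophone-warehouse",
    Database="analytics",
    DbUser="admin",
    Sqls=[create_schema, create_table],
)
```

Queries can then JOIN the hot table in the cluster against archive.sales_history without the archived rows consuming cluster storage.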

We See Your Bird is a company that is building an application that collects Twitter traffic and emits it as JSON documents. They need you to design a system that will store these JSON documents, enable full-text search functionality, and provide visualization of the stored data. Which of the following will accomplish this with the least development effort? Write a Lambda function to process the JSON documents and insert them into a RDS Oracle instance. Utilize full-text compound indexes to enable text searching. Configure QuickSight to visualize the data stored in the RDS database. Utilize Glue to create a catalog of the existing data. Add the Glue catalog to a newly created Elastic MapReduce (EMR) cluster to create Hive tables. Query the data from EMR, and configure QuickSight to utilize EMR as a datasource to generate visualizations of the data. Create and configure an Elasticsearch cluster. Create a Kinesis Firehose that is configured to deliver data to the newly created Elasticsearch cluster. Create an index in Elasticsearch to index the received JSON documents. Use a simple loader script to send the existing JSON documents to the Kinesis Firehose, and configure the scraping application to deliver new documents to Kinesis Firehose. Supply the relevant teams with the Elasticsearch cluster endpoint and Kinesis instance endpoint. Create a new Redshift cluster. Utilize Glue to catalogue and ETL the existing records into the Redshift cluster. Modify the Twitter scraping code to insert data into the Redshift cluster instead of the existing data pooling solution. Utilize QuickSight with the Redshift cluster to generate data visualizations.

Create and configure an Elasticsearch cluster. Create a Kinesis Firehose that is configured to deliver data to the newly created Elasticsearch cluster. Create an index in Elasticsearch to index the received JSON documents. Use a simple loader script to send the existing JSON documents to the Kinesis Firehose, and configure the scraping application to deliver new documents to Kinesis Firehose. Supply the relevant teams with the Elasticsearch cluster endpoint and Kinesis instance endpoint. This option provides a solution that meets all the requirements, with as little development effort as possible. Elasticsearch will provide the requested functionality, and Kinesis Firehose is a no-hassle way to ingest data to an Elasticsearch index.
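A sketch of creating the Firehose delivery stream pointed at the Elasticsearch domain; the ARNs, index name, and backup bucket are placeholders, with failed documents backed up to S3.

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="tweets-to-es",
    ElasticsearchDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-es-delivery",
        "DomainARN": "arn:aws:es:us-east-1:123456789012:domain/bird-traffic",
        "IndexName": "tweets",
        "S3BackupMode": "FailedDocumentsOnly",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-es-delivery",
            "BucketARN": "arn:aws:s3:::example-tweet-backup",
        },
    },
)
```

The loader script and the scraping application then simply call put_record or put_record_batch against "tweets-to-es" with each JSON document.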

A large amount of homogeneous data has been ingested from IoT sensors around the world into an S3 bucket. You've been asked to process this data into a Redshift cluster for analytics purposes. Which of the following is the most efficient way to do this? Utilize Glue to crawl and catalogue the IoT sensor data, and create a Glue job to process the S3 stored data into Redshift. Utilize Database Migration Service to load the data from S3 to Redshift. Create a Lambda function to read the IoT sensor data and perform Inserts to the appropriate tables in the Redshift cluster. Create appropriately defined tables in Redshift, and utilize the COPY command to load the data from S3 to the appropriate Redshift tables.

Create appropriately defined tables in Redshift, and utilize the COPY command to load the data from S3 to the appropriate Redshift tables. It is easy to load data from S3 to Redshift. S3 is the intermediary step most ETL/Migration tools use to pool data before loading it into Redshift.
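A brief sketch of issuing the COPY through the Redshift Data API; the cluster, role ARN, table, bucket prefix, and the assumption that the sensor files are CSV are all placeholders.

```python
import boto3

redshift_data = boto3.client("redshift-data")

redshift_data.execute_statement(
    ClusterIdentifier="iot-warehouse",   # placeholder cluster
    Database="analytics",
    DbUser="admin",
    Sql="""
        COPY sensor_readings
        FROM 's3://example-iot-bucket/readings/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'
        CSV;
    """,
)
```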

You are working as a data analyst for a marketing agency. Through a mobile app, the company gathers data from thousands of mobile devices every hour, which is stored in an S3 bucket. The COPY command is used to move data to a Redshift cluster for further analysis. The data reconciliation team notified you that some of the original data present in S3 files is missing from the Redshift cluster. Which of the following actions would you take to mitigate this issue with the least amount of development effort? Feed the data directly into an Elastic MapReduce (EMR) cluster and use the COPY command to move it to the Redshift cluster for better consistency. Use multiple S3 buckets to store incoming data, then use multiple COPY commands against them to move data to Redshift. Use Step Functions to aggregate data and check the integrity of data across S3 and Redshift, then kick off an extra step function which inserts any missing data back into Redshift. Create new object keys in S3 for new incoming data and use manifest files for stronger consistency when moving data to Redshift.

Create new object keys in S3 for new incoming data and use manifest files for stronger consistency when moving data to Redshift. If you overwrite existing files with new data, and then issue a COPY command immediately following the upload, it is possible for the COPY operation to begin loading from the old files before all of the new data is available. Therefore, creating new object keys will ensure better consistency of data. When you use a manifest file, COPY enforces strong consistency by searching secondary data sources if it does not find a listed file on the primary server and can predetermine which files to move data from. Managing Data Consistency (https://docs.aws.amazon.com/redshift/latest/dg/managing-data-consistency.html)
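A sketch of the manifest-based load, with hypothetical bucket, key, cluster, and table names: write a manifest listing the exact new object keys, then reference it in the COPY command with the MANIFEST option.

```python
import json
import boto3

s3 = boto3.client("s3")
redshift_data = boto3.client("redshift-data")

# "mandatory" makes COPY fail rather than silently skip a missing file.
manifest = {"entries": [
    {"url": "s3://example-app-bucket/2023/07/01/batch-0001.csv", "mandatory": True},
    {"url": "s3://example-app-bucket/2023/07/01/batch-0002.csv", "mandatory": True},
]}
s3.put_object(Bucket="example-app-bucket",
              Key="manifests/2023-07-01.manifest",
              Body=json.dumps(manifest).encode())

redshift_data.execute_statement(
    ClusterIdentifier="marketing-warehouse",
    Database="analytics",
    DbUser="admin",
    Sql="""
        COPY app_events
        FROM 's3://example-app-bucket/manifests/2023-07-01.manifest'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'
        MANIFEST
        CSV;
    """,
)
```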

The Sales and Marketing Teams at your company are using business intelligence applications to run a number of Presto queries on an Amazon EMR cluster with an EMR File System (EMRFS). There is a new Marketing Analyst starting today as well as a new Sales Data Analyst. The Marketing Analyst will need to access the marketing table only. The Sales Data Analyst will need to access the sales table only. How should you configure access for these two new employees? Create separate IAM roles for the Marketing and Sales users. Assign the roles using an S3 bucket policy to enable the users to access the corresponding tables in the EMR cluster. Configure Presto to use the AWS Glue Data Catalog as the Apache Hive metastore. Create separate IAM roles for the Marketing and Sales users. Assign the roles with AWS Glue resource-based policies to enable the users to access the corresponding tables in the AWS Glue Data Catalog. Configure Presto to use the AWS Glue Data Catalog as the Apache Hive metastore. Create separate IAM roles for the Marketing and Sales users. Assign the roles using IAM policies to enable the users to access the corresponding tables in the AWS Glue Data Catalog. Configure Presto to use the AWS Glue Data Catalog as the Apache Hive metastore. Create separate IAM roles for the Marketing and Sales users. Configure access to the relevant tables using an S3 Access Control List to enable the users to access the corresponding tables in the EMR cluster. Configure Presto to use the AWS Glue Data Catalog as the Apache Hive metastore.

Create separate IAM roles for the Marketing and Sales users. Assign the roles with AWS Glue resource-based policies to enable the users to access the corresponding tables in the AWS Glue Data Catalog. Configure Presto to use the AWS Glue Data Catalog as the Apache Hive metastore. AWS Glue resource policies can be used to control access to Data Catalog resources. AWS Glue Resource Policies for Access Control (https://docs.aws.amazon.com/glue/latest/dg/glue-resource-policies.html)
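A sketch of such a resource policy granted to the Sales role for the sales table only; the account ID, Region, role name, database, and table names are hypothetical, and a matching statement would be added for the Marketing role.

```python
import json
import boto3

glue = boto3.client("glue")

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::123456789012:role/SalesAnalyst"},
        "Action": ["glue:GetTable", "glue:GetPartitions"],
        "Resource": [
            "arn:aws:glue:us-east-1:123456789012:catalog",
            "arn:aws:glue:us-east-1:123456789012:database/reporting",
            "arn:aws:glue:us-east-1:123456789012:table/reporting/sales",
        ],
    }],
}

glue.put_resource_policy(PolicyInJson=json.dumps(policy))
```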

You are developing a security information and event management system to handle threat detection and incident management using Amazon ElasticSearch Service and Kibana. The data held in ElasticSearch is highly sensitive and your Head of Security is keen to limit access to only your team of 5 analysts. The number of analysts may grow in time. How should you configure access to the ElasticSearch service and Kibana with the least operational overhead? Create programmatic keys for each of your team members so that they can SSH directly to the cluster nodes. Do not share the keys with anyone outside of your team. Create separate IAM user accounts for each team member. Configure an IAM role with permissions to access the ElasticSearch cluster. Create an SSH key pair for each of the cluster nodes. Distribute the SSH key to each of your team members so that they can SSH directly to the cluster nodes. Do not share the key with anyone outside of your team. Configure a Cognito User Pool for your team and enable federated access using a web identity provider.

Create separate IAM user accounts for each team member. Configure an IAM role with permissions to access the ElasticSearch cluster. One of the key benefits of using Amazon ElasticSearch Service is that you can leverage IAM to grant or deny access to your search domains. The simplest way to provide access in a secure way is to create IAM user accounts for the team and attach an IAM role with permissions to the ElasticSearch cluster. How To Control Access To Elasticsearch (https://aws.amazon.com/blogs/security/how-to-control-access-to-your-amazon-elasticsearch-service-domain/) Configure a Cognito User Pool for your team and enable federated access using a web identity provider. Cognito supports sign-in with social identity providers - such as Facebook, Google, and Amazon - and enterprise identity providers via SAML 2.0, and is generally used as an authentication mechanism for web and mobile applications built on AWS. In 2018 AWS announced support for authenticating to Kibana using Amazon Cognito. Cognito provides the least operational overhead as it allows use of existing identities as opposed to creating IAM users for each member of staff and managing usernames/passwords and access keys. Cognito FAQs (https://aws.amazon.com/about-aws/whats-new/2018/04/amazon-elasticsearch-service-simplifies-user-authentication-and-access-for-kibana-with-amazon-cognito/)
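A sketch of enabling Cognito authentication for Kibana on an existing domain with boto3; the domain name, pool IDs, and role ARN are placeholders that would be created beforehand in Cognito and IAM.

```python
import boto3

es = boto3.client("es")

es.update_elasticsearch_domain_config(
    DomainName="siem-domain",
    CognitoOptions={
        "Enabled": True,
        "UserPoolId": "us-east-1_EXAMPLE",
        "IdentityPoolId": "us-east-1:11111111-2222-3333-4444-555555555555",
        "RoleArn": "arn:aws:iam::123456789012:role/CognitoAccessForAmazonES",
    },
)
```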

You work for a movie theater organization that is integrating a new concession system. The movie theaters will be spread across the globe, showing movies in hundreds of different languages. The new concession system needs to be able to handle users all throughout the day and during any given time. The amount of concession purchases spikes during certain times of the day and night, so the collected data volume fluctuates. The data that is stored for concession purchases, items, and prices needs to be delivered at low latency and high throughput no matter the size of data; however, the data is typically small in size. What storage option is the best solution for the new concession system? ElastiCache with multi-AZ enabled S3 with CloudFront to create a global CDN Neptune with multi-AZ enabled RDS with multi-AZ enabled DynamoDB with global tables and multi-region replication

DynamoDB with global tables and multi-region replication. DynamoDB scales horizontally and allows applications to deliver data at single-digit millisecond latency at large scale. DynamoDB also offers global tables for multi-region replication that can be used for your global application. Global Tables: Multi-Region Replication with DynamoDB. Amazon DynamoDB FAQs
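A sketch of adding replicas with boto3 under stated assumptions: a "concessions" table with streams enabled already exists, with the same name, in every listed Region (a prerequisite of this global tables API), and the table and Region names are illustrative.

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.create_global_table(
    GlobalTableName="concessions",
    ReplicationGroup=[
        {"RegionName": "us-east-1"},
        {"RegionName": "eu-west-1"},
        {"RegionName": "ap-southeast-1"},
    ],
)
```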

You have been asked to build an EMR cluster. The EBS storage volumes attached to the cluster nodes must be encrypted. Your Head of Security would like you to confirm which type of encryption is supported. Which of the following are the correct responses? EBS encryption SSE-S3 CSE-KMS Open-source HDFS encryption LUKS encryption

EBS encryption Beginning with Amazon EMR version 5.24.0, you can choose to enable EBS encryption. The EBS encryption option encrypts the EBS root device volume and attached storage volumes. EMR Encryption Options (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-data-encryption-options.html#emr-encryption-localdisk) Open-source HDFS encryption Open-source HDFS encryption is supported for encrypting data on EBS volumes attached to EC2 instances in an EMR cluster. EMR Encryption Options (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-data-encryption-options.html#emr-encryption-localdisk) LUKS encryption Linux Unified Key Setup (LUKS) encryption is supported for encrypting data on EBS volumes attached to EC2 instances in an EMR cluster. EMR Encryption Options (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-data-encryption-options.html#emr-encryption-localdisk)
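
A minimal boto3 sketch of an EMR security configuration that turns on EBS/local disk (LUKS via KMS) encryption; the configuration name and KMS key ARN are placeholders, and the exact JSON keys may vary slightly by EMR release:

    import boto3
    import json

    emr = boto3.client("emr")

    security_config = {
        "EncryptionConfiguration": {
            "EnableInTransitEncryption": False,
            "EnableAtRestEncryption": True,
            "AtRestEncryptionConfiguration": {
                "S3EncryptionConfiguration": {"EncryptionMode": "SSE-S3"},
                "LocalDiskEncryptionConfiguration": {
                    "EncryptionKeyProviderType": "AwsKms",
                    "AwsKmsKey": "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE",
                    "EnableEbsEncryption": True,  # requires EMR 5.24.0 or later
                },
            },
        }
    }

    emr.create_security_configuration(
        Name="ebs-localdisk-encryption",
        SecurityConfiguration=json.dumps(security_config),
    )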

You are working on a project to consolidate a large amount of confidential information onto Redshift. In order to meet compliance requirements, you need to demonstrate that you can produce a record of authentication attempts, user activity on the database, connections, and disconnections. Which of the following will create the required logs? Enable QuickSight logs. Enable CloudTrail logs. Enable CloudWatch logs. Enable Redshift audit logs.

Enable Redshift audit logs. Redshift can log information about connections and user activities in your database. Audit logging is not enabled by default in Amazon Redshift. The connection log, user log, and user activity log can be enabled using the AWS Management Console, the Amazon Redshift API, or the AWS Command Line Interface. RedShift Audit Logs (https://docs.aws.amazon.com/redshift/latest/mgmt/db-auditing.html#db-auditing-logs)
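
A minimal boto3 sketch of enabling audit logging on a cluster (cluster, bucket, and prefix names are placeholders); note that the user activity log additionally requires the enable_user_activity_logging parameter to be set to true in the cluster's parameter group:

    import boto3

    redshift = boto3.client("redshift")

    # Deliver connection, user, and user activity logs to S3.
    redshift.enable_logging(
        ClusterIdentifier="compliance-cluster",
        BucketName="my-redshift-audit-logs",
        S3KeyPrefix="audit/",
    )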

You work for an organization that uses legacy Microsoft applications to run the day-to-day services, as well as the authentication mechanisms. Currently, all employees are authenticated into applications using AWS Managed Microsoft AD in us-west-2. You have recently set up a QuickSight account in us-east-1 that you need teammates to authenticate into, so they can run data analytics tasks. Your teammates are not able to authenticate into the QuickSight account. Which of the following is the cause for the issue and what are the possible solutions? Use the Standard edition for the QuickSight account. Invite the users to the QuickSight account using their email addresses. Ensure Active Directory is the identity provider for QuickSight and associate your AD groups with Amazon QuickSight. Use the Enterprise edition for the QuickSight account. Set up an AWS Managed Microsoft AD directory in the same region as the QuickSight account and migrate users using the new directory.

Ensure Active Directory is the identity provider for QuickSight and associate your AD groups with Amazon QuickSight. When you subscribe to Amazon QuickSight Enterprise edition and choose Active Directory as your identity provider, you can associate your AD groups with Amazon QuickSight. You can also add or change your AD groups later on. Using Active Directory with Amazon QuickSight Enterprise Edition (https://docs.aws.amazon.com/quicksight/latest/user/aws-directory-service.html)

As Herbert's Hyper Hot Chillies has expanded their hot pepper and spice sales to the global market, they've accumulated a significant number of S3-backed data lakes in multiple AWS accounts across multiple regions. They would like to produce some Business Intelligence visualizations that combine data from all of these sources. How can they do this with minimal cost and development effort? Utilize Glue to ETL the data into JSON format, and load it into an Elasticsearch index. Utilize Kibana to create visualizations of the data. Utilize Glue to create a catalog of all involved data, and use the catalog to inform Hive tables in Elastic MapReduce (EMR). Then, utilize EMR as a data source for QuickSight. Ensure that all data sources are configured with the appropriate permissions to provide QuickSight access. Configure QuickSight to access the S3 data in the various regions and accounts. Write a custom visualization frontend with the D3 framework, and back this frontend with a custom API that accesses each data lake individually to aggregate the data before visualization.

Ensure that all data sources are configured with the appropriate permissions to provide QuickSight access. Configure QuickSight to access the S3 data in the various regions and accounts. Given the correct permissions, QuickSight can be utilized to aggregate data for creating visualizations.

You work for a large university developing a web application that allows students to upload various applications regarding their attendance. Many of these applications contain large files up to 3 GB in size. After each upload, the attached files need to be processed through an in-house developed OCR application that is hosted on a SageMaker endpoint. The application submissions happen unpredictably: some applications are sent every few hours and sometimes hundreds of applications are sent per minute. Which architecture best suits the workload and is cost efficient? First, use the AWS SDK to store the file on an EBS volume. Use a fleet of EC2 instances to read the attachments from the EBS volume, sending the attachment as input to invoke the SageMaker endpoint. First, use an SQS queue to process the file. Use a fleet of EC2 instances to poll the SQS queue, sending the attachment as input to invoke the SageMaker endpoint. First, use a multipart upload to deliver the attachments to S3. Use S3 event notifications to trigger a Lambda function, sending the attachment as input to invoke the SageMaker endpoint. First, use a Kinesis Data Firehose to deliver the attachment to S3. Use S3 event notifications to trigger a Lambda function, sending the attachment as input to invoke the SageMaker endpoint.

First, use a multipart upload to deliver the attachments to S3. Use S3 event notifications to trigger a Lambda function, sending the attachment as input to invoke the SageMaker endpoint. This architecture is best suited for the workload requirements and would be the most cost-efficient solution. Configuring Amazon S3 Event Notifications (https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html) Multipart Upload Overview (https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html) InvokeEndpoint (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html)
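
A minimal sketch of such a Lambda handler, assuming a placeholder endpoint name and content type; it reads the object named in the S3 event and passes the bytes to the endpoint:

    import boto3

    s3 = boto3.client("s3")
    runtime = boto3.client("sagemaker-runtime")

    ENDPOINT_NAME = "ocr-endpoint"  # placeholder SageMaker endpoint name

    def handler(event, context):
        # Triggered by an S3 ObjectCreated event once the multipart upload completes.
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            response = runtime.invoke_endpoint(
                EndpointName=ENDPOINT_NAME,
                ContentType="application/octet-stream",
                Body=body,
            )
            print(key, response["ResponseMetadata"]["HTTPStatusCode"])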

You work with a team of data scientists who use EMR clusters to analyze large datasets using Presto jobs and YARN jobs. Many of your team members forget to terminate EMR clusters when they are finished with their workload. You have been notified by the finance team that the cost of EMR is often exceeding the monthly budget, and have been tasked with automating a solution to terminate idle running EMR clusters. Which of the following solutions meets these requirements, for both clusters running Presto jobs and clusters running YARN jobs? Create tags for the EMR clusters that are running Presto jobs and YARN jobs separately. Use CloudWatch alarms to monitor the billing amount for each tag that is set to the monthly billing amount. When the alarm exceeds the monthly billing amount, send a message to an SNS topic. Create a Lambda function that is subscribed to the topic to terminate the EMR cluster. For the Presto jobs, create a bash script that is installed directly onto the master node of the EMR cluster that runs every 5 minutes with a cron job. Implement the script to terminate the EMR cluster after 8 hours each day. For the Presto jobs, create a bash script that is installed directly onto the master node of the EMR cluster that runs every 5 minutes with a cron job. The script monitors the clusters and sends a CUSTOM metric EMR-INUSE (0=inactive; 1=active) to CloudWatch every 5 minutes. If CloudWatch receives 0 (inactive), send a message to an SNS topic. Create a Lambda function that is subscribed to the topic to terminate the EMR cluster. Create tags for the EMR clusters that are running Presto jobs and YARN jobs separately. Use AWS Systems Manager to continuously monitor the EMR clusters by tags and check for idle clusters. If the clusters are idle, issue an aws emr terminate-clusters command on all of the clusters. For the YARN jobs, create a CloudWatch alarm for the IsIdle metric from the EMR cluster that sends a message to an SNS topic. Create a Lambda function that is subscribed to the topic to terminate the EMR cluster.

For the Presto jobs, create a bash script that is installed directly onto the master node of the EMR cluster that runs every 5 minutes with a cron job. The script monitors the clusters and sends a CUSTOM metric EMR-INUSE (0=inactive; 1=active) to CloudWatch every 5 minutes. If CloudWatch receives 0 (inactive), send a message to an SNS topic. Create a Lambda function that is subscribed to the topic to terminate the EMR cluster. The Amazon EMR native IsIdle Amazon CloudWatch metric determines the idleness of the cluster by checking whether there's a YARN job running. However, you should consider additional metrics, such as SSH users connected or Presto jobs running, to determine whether the cluster is idle. Also, when you execute any Spark jobs in Apache Zeppelin, the IsIdle metric remains active (1) for long hours, even after the job is finished executing. In such cases, the IsIdle metric is not ideal in deciding the inactivity of a cluster. Optimize Amazon EMR Costs with Idle Checks and Automatic Resource Termination Using Advanced Amazon CloudWatch Metrics and AWS Lambda (https://aws.amazon.com/blogs/big-data/optimize-amazon-emr-costs-with-idle-checks-and-automatic-resource-termination-using-advanced-amazon-cloudwatch-metrics-and-aws-lambda/) For the YARN jobs, create a CloudWatch alarm for the IsIdle metric from the EMR cluster that sends a message to an SNS topic. Create a Lambda function that is subscribed to the topic to terminate the EMR cluster. The EMR native IsIdle Amazon CloudWatch metric determines the idleness of the cluster by checking whether there's a YARN job running. Optimize Amazon EMR Costs with Idle Checks and Automatic Resource Termination Using Advanced Amazon CloudWatch Metrics and AWS Lambda (https://aws.amazon.com/blogs/big-data/optimize-amazon-emr-costs-with-idle-checks-and-automatic-resource-termination-using-advanced-amazon-cloudwatch-metrics-and-aws-lambda/)
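
A minimal boto3 sketch of both pieces, with a placeholder cluster ID, namespace, and SNS topic ARN; the first call is what the cron script on the master node would publish, and the second is an alarm on the native IsIdle metric for the YARN case:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # 1) Published every 5 minutes by the cron script on the master node.
    cloudwatch.put_metric_data(
        Namespace="EMRShutdown",  # placeholder custom namespace
        MetricData=[{
            "MetricName": "EMR-INUSE",
            "Dimensions": [{"Name": "JobFlowId", "Value": "j-EXAMPLE123"}],
            "Value": 0,  # 0 = inactive, 1 = active
        }],
    )

    # 2) For clusters running only YARN jobs, alarm on the native IsIdle metric.
    cloudwatch.put_metric_alarm(
        AlarmName="emr-isidle-j-EXAMPLE123",
        Namespace="AWS/ElasticMapReduce",
        MetricName="IsIdle",
        Dimensions=[{"Name": "JobFlowId", "Value": "j-EXAMPLE123"}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=6,  # idle for roughly 30 minutes
        Threshold=1,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:terminate-idle-emr"],
    )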

You were recently hired by a company that has been using a Redshift data warehouse for many years. They had been noticing some slowness when running queries against certain database tables with high traffic, likely due to small regions of unsorted rows. You have been tasked with analyzing the data to determine which tables require better sorting and clean-up, and communicating those changes to the Engineering team lead. Which solution would you propose that requires the least development effort and lowest cost for this task? Inspect the SVV_TABLE_INFO table's unsorted_rows and vacuum_sort_benefit to determine the number of unsorted rows and performance benefit from sorting them. Perform a deep copy of the tables in question to recreate and re-sort the tables automatically. No additional action is required — the automatically scheduled vacuuming is ideal for recreating and sorting tables for efficiency in all cases. Inspect the OPTIMIZE_TABLE table's sorted_row and vacuum_select_sort fields to determine if you need to run manual VACUUM DELETE for cleanup.

Inspect the SVV_TABLE_INFO table's unsorted_rows and vacuum_sort_benefit to determine the number of unsorted rows and performance benefit from sorting them. The SVV_TABLE_INFO table can be helpful when you want to take a more detailed look at optimizing your Redshift database for better sorting. Although VACUUM DELETE runs automatically in the background, you might still want to look at specific tables and see whether more frequent VACUUM runs would give a table better performance.
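
A minimal sketch of pulling these figures through the Redshift Data API (cluster, database, and user names are placeholders); the query uses the documented SVV_TABLE_INFO columns unsorted (the percentage of unsorted rows) and vacuum_sort_benefit:

    import boto3

    rsd = boto3.client("redshift-data")

    sql = """
        SELECT "table", unsorted, vacuum_sort_benefit
        FROM svv_table_info
        ORDER BY unsorted DESC NULLS LAST
        LIMIT 20;
    """

    rsd.execute_statement(
        ClusterIdentifier="analytics-cluster",
        Database="dev",
        DbUser="admin",
        Sql=sql,
    )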

You are creating an EMR cluster which will handle highly sensitive data. The Chief of Security has mandated that the EMR cluster must not be accessible from the public internet, and subnets must be configured with maximum network security. Which of the following options will best meet this requirement? Launch the Amazon EMR cluster in a private subnet, use VPC endpoints to access services within AWS, and use a NAT gateway to access resources that can only be accessed using the internet. Launch the Amazon EMR cluster in a private subnet, and use a NAT gateway to access services within AWS and a VPC endpoint to access resources that can only be accessed using the internet. Launch the Amazon EMR cluster in the default subnet, update the routing table to remove the route to the internet gateway, and use VPC endpoints to access services within AWS and resources that can only be accessed using the internet. Launch the Amazon EMR cluster in a public subnet with no public IP space or internet gateway. Use a NAT gateway to access services within AWS and resources that can only be accessed using the internet.

Launch the Amazon EMR cluster in a private subnet, use VPC endpoints to access services within AWS, and use a NAT gateway to access resources that can only be accessed using the internet. This is the correct answer because the EMR cluster will not be exposed to the internet, any traffic to the VPC endpoints will remain within Amazon's network, and the use of a NAT gateway is the most secure way to access internet-based resources because it does not allow ingress connections or incoming connections from external networks. Securely Access Web Interfaces on Amazon EMR Launched in a Private Subnet (https://aws.amazon.com/blogs/big-data/securely-access-web-interfaces-on-amazon-emr-launched-in-a-private-subnet/) VPC Endpoints (https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints.html)
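
A minimal boto3 sketch of the networking pieces, using placeholder VPC, route table, subnet, and Elastic IP identifiers:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Gateway endpoint so the cluster in the private subnet reaches S3 without
    # leaving Amazon's network.
    ec2.create_vpc_endpoint(
        VpcEndpointType="Gateway",
        VpcId="vpc-0abc1234567890",
        ServiceName="com.amazonaws.us-east-1.s3",
        RouteTableIds=["rtb-0abc1234567890"],
    )

    # NAT gateway in a public subnet for the few internet-only dependencies; the
    # private subnet's route table then points 0.0.0.0/0 at this gateway.
    ec2.create_nat_gateway(
        SubnetId="subnet-0public1234567890",
        AllocationId="eipalloc-0abc1234567890",
    )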

You work for a coffee company which has thousands of branches all over the country. The sales system generates logs regarding transactions. The logs are aggregated and uploaded to an S3 bucket 'transaction-logs' which has a subfolder for logs for each item like those shown below: transaction-logs/dt=11-22-2019-0700/Hot-Drinks/ transaction-logs/dt=11-22-2019-0800/Cold-Drinks/ transaction-logs/dt=11-22-2019-0900/Edibles-Sweet/ transaction-logs/dt=11-22-2019-1000/Edibles-Salty/ Some store locations are open from 8 AM to 5 PM, but there are many 24 hour locations as well, which means there are millions of transactions being reported per hour. Consequently, to parse and analyze the data, an Elastic MapReduce (EMR) cluster is used to process and upload data to a Redshift data warehouse. What changes should you make to the S3 bucket for better read performance without altering current architecture? Set up an EC2 AutoScaling group to issue multiple parallel connections to S3 for better concurrent reads. Modify the S3 prefix to better spread out the read requests from EMR and utilize the read request performance for each unique prefix. Use the S3 Select feature to read the required objects and stream them directly into Redshift. Use the COPY command within Redshift to directly pull S3 object data.

Modify the S3 prefix to better spread out the read requests from EMR and utilize the read request performance for each unique prefix. S3 is a massively distributed and scalable service and scales request throughput per S3 prefix (several thousand GET requests per second for each prefix), which means that a new unique S3 key prefix will offer better read performance. The S3 key could be named this way to create a new prefix for each item category and offer separate read performance for that prefix. In this scenario, for example, we could aggregate logs by hour and use the date and hour as part of a unique prefix. transaction-logs/dt=2020-11-22-0800/item-drinks/hot/mocha transaction-logs/dt=2020-11-22-0900/item-drinks/cold/iced_coffee transaction-logs/dt=2020-11-22-1000/item-edibles/sweet/donut transaction-logs/dt=2020-11-22-1100/item-edibles/salty/egg_roll

You work for a large computer hardware organization that has many different IT stores across the world. The computer parts, order details, shipping details, customer, and sales person information is stored in a data lake in S3. You have been tasked with developing a visualization to show the amount of hardware that was shipped out by various stores and the sales person who sold the hardware. You have a requirement that the visualization must be able to apply statistical functions, as well as cluster columns and rows to show values for subcategories grouped by related dimension. Which type of visualization would meet these requirements? Combo chart Heat map Tree map Pivot table

Pivot table A pivot table would be the best choice for visualizing this data. With a pivot table you can: specify multiple measures to populate the cell values of the table, so that you can see a range of data; cluster pivot table columns and rows to show values for subcategories grouped by related dimension; change row sort order; apply statistical functions; add totals and subtotals to rows and columns; use infinite scroll; and transpose fields used by rows and columns. Pivot Table Features (https://docs.aws.amazon.com/quicksight/latest/user/pivot-table.html#pivot-table-features)

What's That Thing is a medical reference company. They've been provided with a huge store of medical data, but to be able to utilize the images and associated data, it needs to be anonymized and trimmed to remove any non-pertinent information from the records. They would like to accomplish this with minimal development effort. What is the best workflow to accomplish this? Load each record into an SQS queue. Create an SQS client Lambda function to process each record, write code to clean and filter the data to remove personally identifiable information, and add an anonymized identifier to associate data with image files. Process the stream into a Kinesis Firehose delivery stream. Utilize a Kinesis Analytics Application to clean the data of extraneous information and any personally identifiable information, and add unique identifiers to connect data with images. Manually edit each data record and image to remove any personally identifiable information and add anonymized identifiers to enable connecting data and images. Write a Lambda function to ingest the data, perform data filtering to remove any personally identifiable information, and add anonymized identifiers to connect data and images.

Process the stream into a Kinesis Firehose delivery stream. Utilize a Kinesis Analytics Application to clean the data of extraneous information and any personally identifiable information, and add unique identifiers to connect data with images. This is the best option, as you can easily filter, augment, and enhance data on the fly with Kinesis Analytics Applications.

You work as a data scientist for a small team in charge of collecting and processing data from handheld devices that air conditioning technicians use in the field. You've successfully set up consumer applications to consume, partition, compress, and store the data in Amazon S3. For environmental compliance, your team must process the device data on a daily basis and send a report to management, who then forwards the results to the local government agencies. You have set up an AWS Glue Job that processes the data on a daily basis. The job ran fast and was low cost the first few weeks, but after several months of collecting data, you notice that the AWS Glue Job is taking longer to run and is costing more money. You also notice that each daily run is processing all the data, instead of only new data. Which of the following actions can you take to improve the speed, reduce the cost, and also prevent the reprocessing of already processed data? Review the active executors, completed stages, and maximum needed executors in CloudWatch during previous job executions. Create a new AWS Glue Job that runs before the daily job to partition and compress the data. Determine the optimal DPU capacity. Enable auto-scaling when creating the job. Disable the Job Bookmark feature. Enable the Job Bookmark feature.

Review the active executors, completed stages, and maximum needed executors in CloudWatch during previous job executions. The number of maximum needed executors is computed by adding the total number of running tasks and pending tasks and dividing by the tasks per executor. This result is a measure of the total number of executors required to satisfy the current load. In contrast, the number of actively running executors measures how many executors are running active Apache Spark tasks. As the job progresses, the maximum needed executors can change and typically goes down towards the end of the job as the pending task queue diminishes. Number of Actively Running Executors (https://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html#glue.driver.ExecutorAllocationManager.executors.numberAllExecutors) Number of Maximum Needed Executors (https://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html#glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors) Determine the optimal DPU capacity. You can use job metrics in AWS Glue to estimate the number of data processing units (DPUs) that can be used to scale out an AWS Glue job. Monitoring for DPU Capacity Planning (https://docs.aws.amazon.com/glue/latest/dg/monitor-debug-capacity.html) Enable the Job Bookmark feature. Enabling Job Bookmarks keeps track of previously processed data and processes new data since the last checkpoint. Tracking Processed Data Using Job Bookmarks (https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html)
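
A minimal boto3 sketch of running the existing job with bookmarks enabled (the job name is a placeholder); the same --job-bookmark-option argument can also be saved in the job definition itself:

    import boto3

    glue = boto3.client("glue")

    # Process only data added since the last checkpoint.
    glue.start_job_run(
        JobName="daily-device-report",
        Arguments={"--job-bookmark-option": "job-bookmark-enable"},
    )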

You work as a data engineer who performs data processing solutions for your customers. You have been tasked with designing an EMR solution that will process a large amount of data with little to no time constraint. It's important that the data processing solution be cost effective. Due to the size of the data, you know that the EMR map reduce job is going to require 20 mappers to process the input data. Which of the following configurations for your EMR cluster would help you achieve this? Use 10 core nodes, where each node can process 2 mappers in parallel. Run 10 mappers first, while the remaining 10 mappers stay in queue. Once Hadoop has processed the first 10 mappers, the remaining 10 mappers run. Use 10 core nodes, where each node can process 3 mappers in parallel. Use 5 core nodes, where each node can process 2 mappers in parallel. Run all the mappers in parallel.

Run 10 mappers first, while the remaining 10 mappers stay in queue. Once Hadoop has processed the first 10 mappers, the remaining 10 mappers run. This option will help you save on cost, because you will only have to use 5 nodes (as compared to 10 nodes), while still processing all 20 mappers. Best Practices for Amazon EMR (https://d0.awsstatic.com/whitepapers/aws-amazon-emr-best-practices.pdf) Use 5 core nodes, where each node can process 2 mappers in parallel. Using this option — in conjunction with having 10 mappers run first, then have the remaining mappers run after — will provide you with the most cost savings. Best Practices for Amazon EMR (https://d0.awsstatic.com/whitepapers/aws-amazon-emr-best-practices.pdf)

You are building an EMR cluster and have been asked to enable encryption at rest for EMRFS data. Which of the following encryption methods can you use? Open-source HDFS Encryption SSE-C SSE-KMS SSE-S3

SSE-KMS EMRFS is an implementation of HDFS that all Amazon EMR clusters use for reading and writing regular files from Amazon EMR directly to Amazon S3. SSE-KMS is a supported encryption method for EMRFS. With SSE-KMS, you use an AWS KMS customer master key (CMK) set up with policies suitable for Amazon EMR. EMR Encryption Options (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-data-encryption-options.html) SSE-S3 EMRFS is an implementation of HDFS that all Amazon EMR clusters use for reading and writing regular files from Amazon EMR directly to Amazon S3. SSE-S3 is a supported encryption method for EMRFS. With SSE-S3, Amazon S3 manages the encryption keys for you. EMR Encryption Options (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-data-encryption-options.html)
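
A minimal boto3 sketch of a security configuration that applies SSE-KMS to EMRFS data in S3 (the configuration name and KMS key ARN are placeholders); depending on your EMR release you may also need to specify local disk encryption settings alongside it:

    import boto3
    import json

    emr = boto3.client("emr")

    config = {
        "EncryptionConfiguration": {
            "EnableInTransitEncryption": False,
            "EnableAtRestEncryption": True,
            "AtRestEncryptionConfiguration": {
                "S3EncryptionConfiguration": {
                    "EncryptionMode": "SSE-KMS",
                    "AwsKmsKey": "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE",
                }
            },
        }
    }

    emr.create_security_configuration(
        Name="emrfs-sse-kms",
        SecurityConfiguration=json.dumps(config),
    )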

Safety Pattern is a company that specializes in recognizing patterns in cloud-based architectures that indicate unusual behavior. They would like to launch a data access application that detects specific patterns in data storage services. They're experiencing difficulties because of the volume of data that needs to be processed to make real-time alerting functional. Which of the following pipelines would be the best option to accomplish this goal? Create an S3 bucket, and send event objects to the bucket. Create a Lambda function that evaluates for alert-worthy states on a state match. Send a message to an SNS topic configured to alert the appropriate subscribers. Send access events to a Kinesis Data Stream. Create a Kinesis Data Analytics application that utilizes a Flink application with the alert-worthy event patterns. Send matches to a Lambda function that sends a message to an SNS topic configured to alert the appropriate subscribers. Create a custom EMR application that accepts events and publishes alerts to an SNS topic configured to alert the appropriate subscribers. Create a custom application and Docker image. Run the Docker image in ECS with an Application Load Balancer configured to send events to the ECS containers. Configure the application to send messages to an SNS topic when there is a state match.

Send access events to a Kinesis Data Stream. Create a Kinesis Data Analytics application that utilizes a Flink application with the alert-worthy event patterns. Send matches to a Lambda function that sends a message to an SNS topic configured to alert the appropriate subscribers. This is the best available solution. Kinesis can be scaled to handle a huge amount of data input, and Flink enables efficiently managing and flexibly filtering that data based on various states.
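
A minimal sketch of the alerting Lambda function, assuming the pattern matches arrive as records in the invocation event (the exact event shape depends on how the Flink application hands matches off) and using a placeholder topic ARN:

    import json
    import boto3

    sns = boto3.client("sns")
    TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:access-pattern-alerts"  # placeholder

    def handler(event, context):
        # Each record is one alert-worthy pattern match emitted by the Flink app.
        for record in event.get("Records", []):
            sns.publish(
                TopicArn=TOPIC_ARN,
                Subject="Unusual data access pattern detected",
                Message=json.dumps(record),
            )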

You work as a Data Specialist for a new startup that processes and transforms data for organizations that lack the support and staff to handle data processing themselves. Your typical workflow is to use EMR to process and transform the data that is backed by EMRFS, since most of your customers have data stored in S3. Your newest customer has multiple buckets, each with its own encryption key. Which solution would you use to allow EMR access to the encrypted data stored in the S3 buckets? Use S3 Select to access the data in S3 in each bucket by specifying the encryption keys as options. Upload the encryption keys to the EMR master node using SFTP. Use the s3 cp command with encryption keys as options. Modify the S3 bucket policy to allow access by the IAM role associated with the EMR cluster. Set up the per bucket encryption overrides option in EMR that specifies the encryption keys for each bucket.

Set up the per bucket encryption overrides option in EMR that specifies the encryption keys for each bucket. If you have highly sensitive content in specific S3 buckets, you may want to manage the encryption of these buckets separately by using different CMKs or encryption modes for individual buckets. You can accomplish this using the per bucket encryption overrides option in Amazon EMR. Secure your data on Amazon EMR using native EBS and per bucket S3 encryption options (https://aws.amazon.com/blogs/big-data/secure-your-data-on-amazon-emr-using-native-ebs-and-per-bucket-s3-encryption-options/)

You work as a data engineer for a large health agency that runs data analytics on world health data. Currently, there are large datasets of world health data in S3 that are not accessible over the internet. You have been tasked with setting up a QuickSight account that will enable you to build dashboards from the data in S3 without moving the data over the public internet. Which of these methods meets these requirements? Create a new data source and select the S3 bucket with the world health data. Set up a QuickSight VPC connection and a VPC endpoint for S3 to allow QuickSight private access to S3 world health data. Create a VPC endpoint for S3 to allow QuickSight private access to the S3 world health data. Download the S3 data using FTP and upload the S3 data into the QuickSight SPICE.

Set up a QuickSight VPC connection and a VPC endpoint for S3 to allow QuickSight private access to S3 world health data. A VPC endpoint for S3 provides the private access to the world health data, and a QuickSight VPC connection is also needed in the same VPC where the VPC endpoint resides. Configuring the VPC Connection in the QuickSight Console (https://docs.aws.amazon.com/quicksight/latest/user/vpc-creating-a-connection-in-quicksight.html)

Your company is looking to reduce the cost of their Business Intelligence applications. Currently, all data is stored in a Redshift cluster, which has grown exponentially with the increase in sales. Additionally, the bespoke visualizations for quarterly reports are incredibly cumbersome to generate by hand. What steps can be taken to reduce the cost of business intelligence workflow, while keeping all data available for generating reports from time to time? Store data no longer being actively utilized in an S3 bucket using the Standard Infrequent Access storage class. Create a Redshift Spectrum table to access this data and join it with warm data in the Redshift cluster for reporting. Utilize QuickSight to create the appropriate charts and graphs to accompany the BI reports. Move the cold data to S3 and use the S3 Select API call to query the cold data, then join the data with Redshift query results in a custom application layer. Store data no longer being actively utilized in an S3 bucket using the Standard storage class. Create a Redshift Spectrum table to access this data and join it with warm data in the Redshift cluster for reporting. Utilize QuickSight to create the appropriate charts and graphs to accompany the BI reports. Export all data to S3 with the Redshift UNLOAD command, configure an Athena table, and update/rebuild the application layer to query Athena instead of Redshift.

Store data no longer being actively utilized in an S3 bucket using the Standard Infrequent Access storage class. Create a Redshift Spectrum table to access this data and join it with warm data in the Redshift cluster for reporting. Utilize QuickSight to create the appropriate charts and graphs to accompany the BI reports. By leveraging the Standard Infrequent Access storage class, data that is not accessed frequently will cost less to store. Redshift Spectrum will keep the cold data available for analytics purposes.
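
A minimal sketch of the Spectrum side, run here through the Redshift Data API; the schema, table, columns, role ARN, bucket, and cluster details are placeholders:

    import boto3

    rsd = boto3.client("redshift-data")
    conn = dict(ClusterIdentifier="bi-cluster", Database="prod", DbUser="admin")

    # External schema backed by the Glue Data Catalog.
    rsd.execute_statement(Sql="""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
        FROM DATA CATALOG DATABASE 'bi_cold'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
        CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """, **conn)

    # External table over the cold data sitting in S3 Standard-IA.
    rsd.execute_statement(Sql="""
        CREATE EXTERNAL TABLE spectrum.sales_archive (
            sale_id BIGINT,
            sale_date DATE,
            amount DECIMAL(10,2)
        )
        STORED AS PARQUET
        LOCATION 's3://bi-archive-bucket/sales/';
    """, **conn)

Queries and QuickSight reports can then join spectrum.sales_archive with the warm tables that remain in the cluster.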

You are a Data Analyst at a retail bank. You are working on a project to encrypt all Personally Identifiable Information (PII) that is generated by customer credit card applications. This data is generated in the form of a JSON document each time a customer applies for a credit card. For each successful application, the data must be encrypted, and you also need to be alerted of any attempted access by unauthorized individuals. Which of the following is the best solution for storing and protecting this data? Use an encrypted DynamoDB table to store the customer data and use Amazon Macie to scan the data against compliance rules. Use Amazon CloudWatch Events to trigger alerts. Use S3 with encryption enabled to store JSON files, use AWS Lambda to scan the data to detect PII, and use SNS to alert for unauthorized access. Store the customer data in an S3 bucket with encryption enabled. Use Macie to scan the Amazon S3 bucket to identify PII. Configure CloudWatch to alert for unauthorized access events in CloudTrail. Store the customer data in an encrypted DynamoDB table, use Lambda to scan the data to detect PII and use CloudWatch Events to alert for unauthorized access.

Store the customer data in an S3 bucket with encryption enabled. Use Macie to scan the Amazon S3 bucket to identify PII. Configure CloudWatch to alert for unauthorized access events in CloudTrail. Amazon Macie uses machine learning and pattern matching to discover sensitive data at scale, including Personally Identifiable Information such as names, addresses, and credit card numbers. It also gives you constant visibility of the data security and data privacy of your data stored in Amazon S3. A CloudWatch alarm can be configured to alert when an unauthorized API call is made, based on CloudTrail logs. Amazon Macie FAQs (https://aws.amazon.com/macie/faq/) Creating CloudWatch Alarms for CloudTrail Events (https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudwatch-alarms-for-cloudtrail.html#cloudwatch-alarms-for-cloudtrail-authorization-failures)
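
A minimal boto3 sketch of the alerting piece, assuming CloudTrail is already delivering to a CloudWatch Logs group (the log group, metric namespace, and topic ARN are placeholders):

    import boto3

    logs = boto3.client("logs")
    cloudwatch = boto3.client("cloudwatch")

    # Count unauthorized API calls recorded by CloudTrail.
    logs.put_metric_filter(
        logGroupName="CloudTrail/DefaultLogGroup",
        filterName="UnauthorizedAPICalls",
        filterPattern='{ ($.errorCode = "*UnauthorizedOperation") || ($.errorCode = "AccessDenied*") }',
        metricTransformations=[{
            "metricName": "UnauthorizedAPICalls",
            "metricNamespace": "CloudTrailMetrics",
            "metricValue": "1",
        }],
    )

    # Alert the security team when any such call occurs.
    cloudwatch.put_metric_alarm(
        AlarmName="pii-unauthorized-access",
        Namespace="CloudTrailMetrics",
        MetricName="UnauthorizedAPICalls",
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=1,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:security-alerts"],
    )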

You are working as a data engineer within a financial institution. You're required to move a large amount of data from various datasets in S3 to a Redshift cluster. You've attached the appropriate IAM role to your cluster and have issued a COPY command to move data from the S3 bucket into your Redshift database. After a while, you check and notice the data is not populated in Redshift. Which of the following errors could be causing the issue with your data population? You are not connecting to your Redshift cluster as the default "ec2-user" database user when running the COPY command. The Redshift cluster is in maintenance mode and therefore buffering all queries for whenever it gets back to "Available" state. The default Security Group attached to your Redshift cluster does not allow outbound traffic through the Redshift cluster's VPC. The Redshift cluster does not have permissions to access the S3 files. The COPY command is not committing data into the Redshift cluster.

The Redshift cluster does not have permissions to access the S3 files. When using the COPY command to move data from an S3 bucket, two things are required: an IAM role for accessing S3 resources and ensuring that data is either auto-committed or there's an explicit COMMIT at the end of your COPY command to save the changes uploaded from S3. The COPY command is not committing data into the Redshift cluster. When using the COPY command to move data from an S3 bucket, two things are required: an IAM role for accessing S3 resources and ensuring that data is either auto-committed or there's an explicit COMMIT at the end of your COPY command to save the changes uploaded from S3.
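
A minimal sketch of a COPY that addresses both points, using psycopg2 as an example SQL client (connection details, schema, table, bucket, and role ARN are all placeholders):

    import psycopg2  # any Redshift-compatible PostgreSQL client works

    copy_sql = """
        COPY staging.trades
        FROM 's3://finance-datasets/trades/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS PARQUET;
    """

    conn = psycopg2.connect(
        host="finance-cluster.xxxxxx.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="prod", user="etl_user", password="***",
    )
    with conn.cursor() as cur:
        cur.execute(copy_sql)
    conn.commit()  # without autocommit, the loaded rows are not visible until here
    conn.close()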

You work as a data engineer for an HVAC and refrigerant recycling company that uses near real-time IoT devices to stream data about air conditioning equipment to a centralized data repository for data analytical purposes and monitoring metrics. To capture this data, you have created a Kinesis Data Firehose delivery stream to collect the data and store the data in DynamoDB, which will be accessible through HTTP endpoints via API Gateway. The data is loaded to the DynamoDB table through a synchronous Lambda function before the raw data is loaded into S3. After launching the beta version of the application, the Lambda function attempts to ingest the buffered Kinesis Data Firehose records three times before skipping the batch of records. What could be the cause of the skipped batch records and how can the issue be resolved? The buffer size for the Kinesis Data Firehose is set to 8 MB, which is too high. This is causing the Lambda function to fail due to an invocation limit error. Lower the buffer size on the Kinesis Data Firehose delivery stream. The buffer interval for the Kinesis Data Firehose is set to 60 seconds, which is too high. This is causing the Lambda function to fail due to an invocation limit error. Lower the buffer interval on the Kinesis Data Firehose delivery stream. The buffer size for the Kinesis Data Firehose is set to 1 MB, which is too high. This is causing the Lambda function to fail due to an invocation limit error. Lower the buffer size on the Kinesis Data Firehose delivery stream. The buffer interval for the Kinesis Data Firehose is set to 900 seconds, which is too high. This is causing the Lambda function to fail due to a function timeout error. Lower the buffer interval on the Kinesis Data Firehose delivery stream.

The buffer size for the Kinesis Data Firehose is set to 8 MB, which is too high. This is causing the Lambda function to fail due to a invocation limit error. Lower the buffer size on the Kinesis Data Firehose delivery stream. The buffer size for Kinesis Data Firehose can be set between 1 MB - 128 MB when delivering data to S3. However, when running a Lambda function synchronously, the request limit is 6 MB. The 8 MB buffer size is exceeding the request limit for Lambda. Amazon Kinesis Data Firehose FAQs - Data Delivery Lambda Quotas (https://aws.amazon.com/kinesis/data-firehose/faqs/#:~:text=Amazon%20Kinesis%20Data%20Firehose%20buffers%20incoming%20data%20before%20delivering%20it,data%20delivery%20to%20Amazon%20S3) Amazon Kinesis Data Firehose Data Transformation (https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html)
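
A minimal boto3 sketch of a delivery stream whose S3 buffering hint stays under the 6 MB synchronous Lambda request limit (stream, role, bucket, and function names are placeholders):

    import boto3

    firehose = boto3.client("firehose")

    firehose.create_delivery_stream(
        DeliveryStreamName="hvac-telemetry",
        DeliveryStreamType="DirectPut",
        ExtendedS3DestinationConfiguration={
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
            "BucketARN": "arn:aws:s3:::hvac-raw-data",
            # Keep each buffered batch below Lambda's 6 MB synchronous limit.
            "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 60},
            "ProcessingConfiguration": {
                "Enabled": True,
                "Processors": [{
                    "Type": "Lambda",
                    "Parameters": [{
                        "ParameterName": "LambdaArn",
                        "ParameterValue": "arn:aws:lambda:us-east-1:123456789012:function:load-to-dynamodb",
                    }],
                }],
            },
        },
    )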

You are part of a team of engineers building an attendance tracking system used to keep track of students in a university classroom. The students will be sent a unique QR code to their email address each day before a particular class starts. The QR code will then be scanned as the student enters the university classroom and they will be marked present for class. It is expected that the creation of the QR codes and scanning of the QR codes will happen at various times throughout the day, and high traffic spikes will happen regularly. It's also important that the data is highly durable and operates with low latency. What bundle of AWS services do you suggest using to meet all of the requirements to build the attendance tracking system? Use API Gateway as a REST API that receives QR code requests and responses. Trigger a Lambda function for each service the tracking system needs to implement. Use Neptune as the data storage system for student information and DynamoDB for QR code image URLs and attendance tracking validations. Use API Gateway as a REST API that receives QR code requests and responses. Trigger a Lambda function for each service the tracking system needs to implement. Use DynamoDB as the data storage system for student information, QR code image URLs, and attendance tracking validations. Use API Gateway as a REST API that receives QR code requests and responses. Trigger a Lambda function for each service the tracking system needs to implement. Use Neptune as the data storage system for student information, QR code image URLs, and attendance tracking validations. Use an EC2 instance with a Spring API that receives QR code requests and responses. Use RDS as the data storage system for student information, QR code image URLs, and attendance tracking validations.

Use API Gateway as a REST API that receives QR code requests and responses. Trigger a Lambda function for each service the tracking system needs to implement. Use DynamoDB as the data storage system for student information, QR code image URLs, and attendance tracking validations. API Gateway is used as a REST API and uses Lambda to implement the functionality. Using DynamoDB for storage needs is a great solution, providing high durability and low latency for your application. Query Your AWS Database From Your Serverless Application (https://aws.amazon.com/blogs/database/query-your-aws-database-from-your-serverless-application/)

You work for a stock trading company that runs daily ad-hoc queries on data using Athena. There are multiple silos within the company using Athena to run trading queries specific to their team. The finance department has a requirement to control the amount of money being spent by each team for the queries that they run in Athena. The security department has a requirement to enforce that all query results be encrypted. Which solution could be implemented that would meet both of these requirements? Use Athena Workgroups to assign a unique workgroup to each silo, tagging them appropriately. Configure the workgroup to encrypt the query results. Generate cost reports from the tags as well as resource-based policies that assign each workgroup to a silo. Use CloudWatch logs to determine the run time for scanned data for each query that is run by each silo, and trigger an alarm at a specified threshold. Create a Lambda function trigger to enforce cost control. Use CloudTrail logs to audit the silos and run times for scanned data for each query that is run. Use an S3 bucket with an SSE-S3 key and point the Athena query results to the S3 bucket. Create tags for each silo in the AWS Glue Data Catalog associated with the data that each silo is querying, generating a cost report for each tag. Use the AWS Glue security settings to ensure the metadata is encrypted.

Use Athena Workgroups to assign a unique workgroup to each silo, tagging them appropriately. Configure the workgroup to encrypt the query results. Generate cost reports from the tags as well as resource-based policies that assign each workgroup to a silo. By default, all Athena queries execute in the primary workgroup. As an administrator, you can create new workgroups to separate different types of workloads. Administrators commonly turn to workgroups to separate analysts running ad-hoc queries from automated reports. Separating Queries and Managing Costs Using Amazon Athena Workgroups (https://aws.amazon.com/blogs/big-data/separating-queries-and-managing-costs-using-amazon-athena-workgroups/)
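
A minimal boto3 sketch of creating one such workgroup (names, output location, scan limit, and tags are placeholders):

    import boto3

    athena = boto3.client("athena")

    athena.create_work_group(
        Name="equities-trading",
        Configuration={
            "ResultConfiguration": {
                "OutputLocation": "s3://athena-results-equities/",
                "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
            },
            "EnforceWorkGroupConfiguration": True,  # users cannot override result encryption
            "PublishCloudWatchMetricsEnabled": True,
            "BytesScannedCutoffPerQuery": 10 * 1024 ** 3,  # cap each query at 10 GB scanned
        },
        Tags=[{"Key": "team", "Value": "equities"}],
    )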

You work for an organization that contracts with healthcare providers to build data lakes in AWS for personal health records. The data stored in the data lake has both Personal Identifiable Information (PII) as well as Personal Health Information (PHI), so a Health Insurance Portability and Accountability Act (HIPAA) compliant data lake is a requirement. You are using an EMR cluster with EMRFS to read and write data to and from S3. To meet the HIPAA requirement for the data lake, all data moving to and from S3 must be encrypted. What needs to be done to ensure all data is encrypted moving to and from S3? Use SSE-KMS to encrypt data server-side. Use SSE-S3 to encrypt data server-side. Manually create PEM certificates, referenced in S3, to encrypt data in transit. Use CSE-KMS/CSE-C to encrypt data client-side.

Use CSE-KMS/CSE-C to encrypt data client-side. Amazon S3 encryption and decryption takes place client-side on your Amazon EMR cluster. You can use keys provided by AWS KMS (CSE-KMS) or use a custom Java class that provides the master key (CSE-C). Best Practices for Securing Amazon EMR (https://aws.amazon.com/blogs/big-data/best-practices-for-securing-amazon-emr/) How Amazon EMR uses AWS KMS (https://docs.aws.amazon.com/kms/latest/developerguide/services-emr.html)
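
As a rough sketch, a security configuration that enforces CSE-KMS could be created with boto3 like this (the configuration name and KMS key ARN are placeholders) and then referenced when launching the cluster:

```python
import json
import boto3

emr = boto3.client("emr")

# Security configuration enabling S3 client-side encryption with a KMS CMK (CSE-KMS)
security_config = {
    "EncryptionConfiguration": {
        "EnableInTransitEncryption": False,
        "EnableAtRestEncryption": True,
        "AtRestEncryptionConfiguration": {
            "S3EncryptionConfiguration": {
                "EncryptionMode": "CSE-KMS",
                "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE",  # placeholder
            }
        },
    }
}

emr.create_security_configuration(
    Name="hipaa-cse-kms",  # hypothetical name
    SecurityConfiguration=json.dumps(security_config),
)
# The configuration is then referenced with the SecurityConfiguration
# parameter of run_job_flow when launching the cluster.
```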

You have a legacy Business Intelligence (BI) application running on servers located in your own Data Center. The BI application needs to access data stored in a Redshift cluster. Your CEO has requested that you make sure the network connection between your Data Center and AWS is private, dedicated, and delivers consistent network performance to prevent the BI application from timing out. Which of the following approaches do you recommend? Configure a dedicated NAT gateway to consistently route all network traffic between the Redshift cluster and your Data Center. Use a site-to-site VPN to provide a dedicated, consistent network connection to AWS. Use Direct Connect to provide a dedicated, consistent network connection to AWS. Configure VPC peering between your VPC and the Data Center to provide a dedicated, consistent connection.

Use Direct Connect to provide a dedicated, consistent network connection to AWS. AWS Direct Connect is a dedicated network connection from your premises to AWS. In many cases, Direct Connect can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections. AWS Direct Connect (https://aws.amazon.com/directconnect/)

Gustof's Training Emporium is looking to combine the data from all of their testing centers spread around the world. Each testing center is storing data in their own RDS database and they're planning to utilize Glue to perform ETL and combine the disparate data. They need to be able to run SQL-based queries against this data and create visualizations. What is the lowest cost solution to accomplish this? Create a custom Glue Job script that processes the data and adds it to an Elasticsearch Index. Use the SQL interface for Elasticsearch, and Kibana to visualize the data. Use Glue Crawlers to crawl the data, and utilize Glue Jobs to perform ETL into an S3 bucket in the Parquet format. Configure Athena as an SQL endpoint for the data, and configure QuickSight to use Athena as its data source to create visualizations. Utilize the Glue Data Catalog to add the data to Elastic Map Reduce (EMR), query data from EMR, and configure QuickSight to utilize the EMR cluster as a data source for generating visualizations. Pool the data in a single Postgres RDS instance and utilize Database Migration Service to deliver data to Redshift. Configure QuickSight to utilize Redshift as its data source to create visualizations.

Use Glue Crawlers to crawl the data, and utilize Glue Jobs to perform ETL into an S3 bucket in the Parquet format. Configure Athena as an SQL endpoint for the data, and configure QuickSight to use Athena as its data source to create visualizations. Glue is an ETL tool suite capable of cataloguing data which can be used in other services (Athena, EMR), as well as performing ETL jobs to move data between various sources and destinations.

You work as a data scientist for a food and beverage manufacturer that creates and distributes food all over the world. The data associated with the distribution of food is processed and analyzed using an EMR cluster, which also functions as the data lake. Due to a new law that has been passed by the food administration, there are new requirements that must be met around processing hot data for food distribution and processing cold data for nutritional values for the food. The hot data accessed must be presented in real time to the food administration, while the cold data is typically reviewed in weekly or monthly reports. The hot data that is reviewed in real time must be fast to access and can be temporary. The cold data does not require reviewing in real time; however, the data must be persistent. Which of the following data processing configurations meets the business needs as well as usage pattern requirements? Use the S3 block file system for the hot data, and use S3 with EMRFS for the cold data. Use HDFS for the hot data, and use S3 with EMRFS for the cold data. Use HDFS for both the hot data and cold data. Use S3 with EMRFS for the hot data, and use HDFS for the cold data.

Use HDFS for the hot data, and use S3 with EMRFS for the cold data. HDFS is the cluster's distributed file system and its storage is ephemeral, which suits fast, temporary hot data. EMRFS allows the cluster to read and write data directly to S3, so data stored there is persistent, which suits the cold data. Work with Storage and File Systems (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html)

You've been contacted by a consulting client to assist with optimizing Athena query performance. They have a large amount of data stored in CSV format, and are not happy with either the expense of using Athena or its performance. Each file is 5-10GB in size, and all files are in a single S3 bucket in the root prefix. The data in question is being used for analytics purposes with heavy single-column reads. How can this data most easily be optimized in order to reduce access cost and improve query performance? Use Database Migration Service to reformat the data into Parquet format in a new S3 bucket. Recreate Athena tables to utilize this newly-formatted data. Use a CREATE TABLE AS SELECT (CTAS) query in Athena to process the existing data into Parquet format, partitioning the data as appropriate and compressing any non-index columns in the data with SNAPPY compression. Once all the data has been processed, DROP the original tables and ensure the data has been deleted from the underlying S3 bucket. Utilize Glue to catalogue the data. Create a Glue job to ETL the data into appropriately partitioned and compressed ORC format files. Utilize an EC2 instance to read all CSV files out of the S3 bucket backing Athena. Write a custom ETL script to reformat, partition, and apply compression to appropriate columns. Write the Parquet-formatted files to a new S3 bucket with the appropriate prefix schema to maximize performance. Recreate Athena tables with these new files.

Use a CREATE TABLE AS SELECT (CTAS) query in Athena to process the existing data into Parquet format, partitioning the data as appropriate and compressing any non-index columns in the data with SNAPPY compression. Once all the data has been processed, DROP the original tables and ensure the data has been deleted from the underlying S3 bucket. You can process the data "in place" with just Athena. By default, CTAS queries will store the output in Parquet format, and from there it's relatively simple to create partitions and configure column compression. All of these things will improve query performance and reduce the cost of querying the data.
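
A sketch of what such a CTAS statement might look like when submitted with boto3; the database, table, column, and bucket names are assumptions for illustration:

```python
import boto3

athena = boto3.client("athena")

# CTAS rewrites the CSV-backed table into partitioned, compressed Parquet
ctas_sql = """
CREATE TABLE analytics_db.events_parquet
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://example-analytics-bucket/events_parquet/',
    partitioned_by = ARRAY['event_date']
) AS
SELECT user_id, event_type, payload, event_date
FROM analytics_db.events_csv;
"""

athena.start_query_execution(
    QueryString=ctas_sql,
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```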

You work as a data scientist for a new startup in the rapidly growing field of health and telemedicine. The organization's health data needs to be stored in a data warehousing solution with an initial data load of around 1,000 GB. You also expect rapid data growth due to the growing demand for telemedicine services. You've been tasked with coming up with a data warehousing solution to host the health data that also allows for daily and weekly visualizations regarding global and regional health statistics. These visualizations will help determine health funding from the government and private lenders. It's important that the data warehousing solution be able to scale and that compute and managed storage are billed independently. Which of the data warehousing solutions would you suggest that allows for a simple and cost-effective approach? Use a Redshift cluster with RA3 nodes Use a Redshift cluster with DC2 nodes Use a Redshift cluster with DS2 nodes Use a cluster of EBS volumes with SSD on EC2 instances Use S3 Glacier

Use a Redshift cluster with RA3 nodes Amazon Redshift managed storage uses large, high-performance SSDs in each RA3 node for fast local storage and Amazon S3 for longer-term durable storage. If the data in a node grows beyond the size of the large local SSDs, Amazon Redshift managed storage automatically offloads that data to Amazon S3. You pay the same low rate for Amazon Redshift managed storage regardless of whether the data sits in high-performance SSDs or Amazon S3. For workloads that require ever-growing storage, managed storage lets you automatically scale your data warehouse storage capacity without adding and paying for additional nodes. RA3 nodes with managed storage enable you to optimize your data warehouse by scaling and paying for compute and managed storage independently. With RA3, you choose the number of nodes based on your performance requirements and only pay for the managed storage that you use. Size your RA3 cluster based on the amount of data you process daily. You launch clusters that use the RA3 node types in a virtual private cloud (VPC). You can't launch RA3 clusters in EC2-Classic. Amazon Redshift Clusters
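
For illustration, provisioning an RA3 cluster with boto3 might look roughly like the following (identifiers, node count, and credentials are placeholders):

```python
import boto3

redshift = boto3.client("redshift")

# RA3 nodes bill compute and Redshift managed storage independently
redshift.create_cluster(
    ClusterIdentifier="telemedicine-dw",       # placeholder identifier
    NodeType="ra3.4xlarge",
    NumberOfNodes=2,
    ClusterType="multi-node",
    DBName="healthdata",
    MasterUsername="admin",
    MasterUserPassword="REPLACE_WITH_SECRET",  # use Secrets Manager in practice
    PubliclyAccessible=False,
)
```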

You work as a data engineer for a large sports team that collects stats on plays, ticket and concession sales, clickstream data on the sports team's website, social media feeds, and more. Your team is planning to use EMR to process and transform the constantly growing data. The data analytics team runs reports by querying the data using tools like Apache Hive and Presto, so the ability to run queries is a must. There is a requirement that the EMR cluster not run persistently. To do this, your team has decided to implement a solution that initiates EMR to process data when it lands onto S3, run Apache Spark ETL jobs, save the transformed data onto S3, and finally terminate the cluster. Which of the following is the best solution based on the requirements? Create a Lambda function that populates an RDS instance with the Apache Hive metadata for the EMR cluster. Use the EMR cluster and create an external table to run queries on the transformed data. Use an AWS Glue crawler to crawl the data that is transformed in S3 to populate a Glue Data Catalog with the metadata. Use Athena to run queries on the transformed data. Use HDFS on the EMR cluster to store the data. When the data analytics team wants to run queries on the transformed data, use the S3DistCp command to copy the data to S3. Once the data is in S3, use Athena to query the transformed data. Store the Apache Hive metadata externally in DynamoDB for the EMR cluster. Use S3 Select to run queries on the transformed data.

Use an AWS Glue crawler to crawl the data that is transformed in S3 to populate a Glue Data Catalog with the metadata. Use Athena to run queries on the transformed data. This is the best solution given the requirements. You can use AWS Glue to crawl the data in S3 and then use Athena to query the data. Using AWS Glue to Connect to Data Sources in Amazon S3 (https://docs.aws.amazon.com/athena/latest/ug/data-sources-glue.html) Best Practices When Using Athena with AWS Glue (https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html)

You work as a data engineer for a mid-sized paper company that distributes paper all across the nation. After some recent economic cutbacks, you have been tasked with reviewing the current data warehousing processing pipeline in order to try to lower operational costs. Currently, Redshift is being used as the data warehousing solution. The data is first ingested into EMR before being loaded into Redshift. This data processing usually takes less than 1 hour and is done 2 times daily (at 7 AM and 7 PM). What operational adjustment could you make to help lower costs? Use three master nodes for the EMR cluster Use spot instances for the task nodes of the EMR cluster Use spot instances for the core nodes of the EMR cluster Use an EMR transient cluster Use an EMR long-running cluster

Use spot instances for the task nodes of the EMR cluster You can use Spot Instances for task nodes to add processing capacity. Task nodes do not run the DataNode daemon, which means there is no HDFS storage on task nodes, so they can be added and removed easily to scale processing power. Understanding Master, Core, and Task Nodes Task Nodes on Spot Instances (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html#emr-dev-task-instance-group-spot) Use an EMR transient cluster Transient clusters are used when a particular processing task or job needs to occur on a defined schedule. After all of the steps are complete, the cluster terminates and does not incur any more costs (unlike long-running clusters). Plan and Configure Clusters Overview of Amazon EMR (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan.html)
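
A rough boto3 sketch that combines both ideas: a transient cluster (terminates when its steps finish) with On-Demand core nodes and Spot task nodes. Names, counts, instance types, and the script location are assumptions:

```python
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="nightly-etl",                      # placeholder name
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
             "InstanceCount": 1, "Market": "ON_DEMAND"},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
             "InstanceCount": 2, "Market": "ON_DEMAND"},
            # Spot task nodes: no HDFS, safe to lose, cheaper compute
            {"InstanceRole": "TASK", "InstanceType": "m5.xlarge",
             "InstanceCount": 4, "Market": "SPOT"},
        ],
        # Transient cluster: terminate when the last step completes
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/scripts/etl.py"],  # assumed script
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```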

You work for a large phone marketing firm that specializes in making campaign calls about local and national government candidates and laws. Every call that is made is logged into a streaming application and mapped to news and social media feeds to try and recommend talking points for the campaign calls. Your team has decided to use Kinesis Data Streams to capture the streaming data, and Kinesis Data Analytics to run SQL queries on the streaming data. Since news and social media feeds happen all over the world at inconsistent intervals, windowed queries are also used to process the data that arrives at these intervals. Which of the windowed querying options should you use to ensure that the data is aggregated using time-based windows that open as data arrives? Use tumbling windowed queries Use stagger windowed queries Use continuous queries Use sliding windowed queries

Use stagger windowed queries Stagger windowed queries are queries that aggregate data using keyed time-based windows that open as data arrives. The keys allow for multiple overlapping windows. This is the recommended way to aggregate data using time-based windows, because stagger windows reduce late or out-of-order data compared to tumbling windows. Windowed Queries Stagger Windows (https://docs.aws.amazon.com/kinesisanalytics/latest/dev/stagger-window-concepts.html)

You work for an organization that heavily utilizes QuickSight as their Business Intelligence tool. The latest project you have been asked to join is looking for ways to set up automation in the process of building dashboards and updating them based on fresh data and data transformations. The goal of your project is to build an automated BI dashboard that customers can use to gain insights about their data with minimal development involved. The dashboards should be as up-to-date as possible, using the most current and recent data. Which of the following solutions will satisfy the requirements with the least amount of development effort? Create a scheduled refresh in the QuickSight configurations to occur only when new data is added to the QuickSight data source, which will then refresh the dashboard with the most up-to-date data. Use the CreateIngestion API call to create and start a new SPICE ingestion on the dataset. Automate the transformation process and use this API call to refresh the dashboard after the transformations have completed. Create a socket connection into the QuickSight instance by using its publicly routable IP address. Since sockets are long-lived, you can refresh the dashboard only when new data is added to the QuickSight data source. Create a Lambda function that runs periodically to check to see if new data has been added to the QuickSight data source. Transform the data and refresh the dashboard.

Use the CreateIngestion API call to create and start a new SPICE ingestion on the dataset. Automate the transformation process and use this API call to refresh the dashboard after the transformations have completed. You can use the CreateIngestion command to create and start a new SPICE ingestion on the dataset which will refresh the dashboard with the most up-to-date data. CreateIngestion (https://docs.aws.amazon.com/quicksight/latest/APIReference/API_CreateIngestion.html)
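
A minimal sketch of calling CreateIngestion with boto3 once the upstream transformations have completed (the account ID and dataset ID are placeholders):

```python
import uuid
import boto3

quicksight = boto3.client("quicksight")

# Kick off a SPICE refresh for the dataset backing the dashboard
response = quicksight.create_ingestion(
    AwsAccountId="111122223333",        # placeholder account ID
    DataSetId="example-dataset-id",     # placeholder dataset ID
    IngestionId=str(uuid.uuid4()),      # must be unique per ingestion
)
print(response["IngestionStatus"])
```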

A global wildlife research group has been collecting a huge amount of data in regionally located Redshift clusters. While planning for the next increase in storage capacity for their cluster, there was significant pushback regarding increased cost. At least 3/4 of the data being stored in Redshift is only accessed 4 times a year to generate reports that are delayed one quarter and do not include the most recent quarter's data. The leadership of the research group has requested a solution that will continue generating reports from SQL queries, and charts and graphs generated from the data with QuickSight. Which of the following is the lowest cost solution? Make the necessary changes to the Redshift cluster to enable Redshift Spectrum, create Redshift Spectrum tables in the cluster, and move the infrequently used data to the Redshift Spectrum tables. Utilize Glue to migrate all data to an Elastic MapReduce (EMR) cluster of appropriate size, and reconfigure application and analytics workflows to utilize EMR instead of Redshift. Use the Redshift UNLOAD command to an S3 bucket located in the region closest to the group that generates quarterly reports with the FORMAT PARQUET option to create a single data lake. Configure Athena to support SQL queries, and configure QuickSight to utilize Athena for its data source. Once operation of the new system is confirmed, delete the cold data from the Redshift cluster. Once cold data is removed from the cluster, scale the cluster down to the appropriate size to accommodate hot data needs. Use Glue to ETL all cold data to a DynamoDB table in the region closest to the Analytics group, then use a third party framework like D3 to generate visualizations of the data for reporting purposes.

Use the Redshift UNLOAD command to an S3 bucket located in the region closest to the group that generates quarterly reports with the FORMAT PARQUET option to create a single data lake. Configure Athena to support SQL queries, and configure QuickSight to utilize Athena for its data source. Once operation of the new system is confirmed, delete the cold data from the Redshift cluster. Once cold data is removed from the cluster, scale the cluster down to the appropriate size to accommodate hot data needs. Because the fresh data is not accessed for this purpose and the cold data is not utilized for live application operations, it is viable to move cold data outside of the Redshift ecosystem.
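
For illustration, the UNLOAD could be submitted through the Redshift Data API roughly as follows; the cluster, database, IAM role, table, partition column, and bucket names are assumptions:

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Unload the cold data to S3 in Parquet format, partitioned by quarter
unload_sql = """
UNLOAD ('SELECT * FROM observations WHERE observed_at < DATEADD(quarter, -1, GETDATE())')
TO 's3://example-wildlife-datalake/observations/'
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftUnloadRole'
FORMAT PARQUET
PARTITION BY (observation_quarter);
"""

redshift_data.execute_statement(
    ClusterIdentifier="wildlife-cluster",   # placeholder
    Database="research",                    # placeholder
    DbUser="admin",                         # placeholder
    Sql=unload_sql,
)
```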

You work as a data scientist for an organization that builds videos for university students who use them in place of classroom settings. Each video has a rating system that is positive or negative, which is determined by the students who view the content. Some of the ratings appear to come from bots that are flooding the platform with massive amounts of negative feedback. You've been tasked with creating real-time visualizations for these outliers to bring to the department heads. You have a large dataset of historical data, as well as the streaming data from current student viewing metrics. Which of the following provides the most cost-effective way to visualize these outliers? Use Kinesis Data Analytics' RANDOM_CUT_FOREST anomaly detection function to detect outliers. Use SageMaker to train a model using the built-in Random Cut Forest algorithm to detect outliers storing the results into S3. Use QuickSight to visualize the data. Use SageMaker to train a model using the built-in Random Cut Forest algorithm to detect outliers storing results in memory in the Jupyter notebook used to create the model. Visualize the results using the matplotlib library. Use the anomaly detection feature in QuickSight to detect outliers.

Use the anomaly detection feature in QuickSight to detect outliers. Amazon QuickSight uses proven Amazon technology to continuously run ML-powered anomaly detection across millions of metrics to discover hidden trends and outliers in your data. This anomaly detection enables you to get deep insights that are often buried in the aggregates and not scalable with manual analysis. With ML-powered anomaly detection, you can find outliers in your data without the need for manual analysis, custom development, or ML domain expertise. Detecting Outliers with ML-Powered Anomaly Detection (https://docs.aws.amazon.com/quicksight/latest/user/anomaly-detection.html)

You have been tasked with going through your company's AWS Glue jobs to audit which jobs are currently being used and which ones are outdated. You notice that one job that runs every day at 5 PM is failing with the error "Command failed with exit code 1" and CloudWatch logs shows the "java.lang.OutOfMemoryError: Java heap space" error. Which of the following methods should you use to resolve this issue? Use actions like collect and count. Use the grouping feature in AWS Glue to coalesce multiple files together into a group. Set useS3ListImplementation to False so AWS Glue doesn't cache the list of files in memory all at once. Change the AWS Glue job worker type from G.1X to G.2X. Change the AWS Glue job worker type from G.1X to P2.X.

Use the grouping feature in AWS Glue to coalesce multiple files together into a group. You can fix the processing of the multiple files by using the grouping feature in AWS Glue. Grouping is automatically enabled when you use dynamic frames and when the input dataset has a large number of files (more than 50,000). Grouping allows you to coalesce multiple files together into a group, and it allows a task to process the entire group instead of a single file. As a result, the Spark driver stores significantly less state in memory to track fewer tasks. Fix the Processing of Multiple Files Using Grouping (https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-debug-oom-abnormalities.html#monitor-debug-oom-fix)
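
In a Glue ETL script, grouping can also be set explicitly through the S3 connection options when creating the dynamic frame; a sketch with assumed paths and sizes (note that grouping applies to formats such as CSV and JSON):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read many small files as larger groups so the Spark driver tracks fewer tasks
dynamic_frame = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://example-input-bucket/logs/"],  # assumed path
        "groupFiles": "inPartition",
        "groupSize": "134217728",  # target roughly 128 MB per group
    },
    format="csv",
    format_options={"withHeader": True},
)
```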

You work for a major university where thousands of students study and do research. The students' information, course schedules, activities, and financials are constantly being captured and analyzed by the university to help improve the students' learning experience. As the lead data engineer for the university, you have built out a sophisticated system of data structures to capture all of this data and store it in DynamoDB. A new policy is being launched by the university, and it's your job to create a new DynamoDB table and select a partition key and sort key to use as a composite key to help improve performance and DynamoDB capacity when the table is utilized. The new policy will only need to be sent out to and monitored by students that fall into certain categories such as age, nationality, major, residential status, etc. Which of the following selections will give you the best performance for the table when the policy is launched? Use the student's ID number as the partition key Use the student's major as the partition key and student's name as the sort key. Use the student's ID number plus the student's age as the composite partition key and the student's age as the sort key. Use the student's major plus the student's name as the composite partition key and the student's age as the sort key.

Use the student's ID number as the partition key Best practice for designing and using partition keys effectively is using a partition key that uniquely identifies each item in a DynamoDB table. Since the student ID number uniquely identifies the student, then a composite key is not required. Best Practices for Designing and Using Partition Keys Effectively (https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-design.html)
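
A minimal sketch of defining the new table with the student ID as the partition key (the table and attribute names are assumptions):

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="PolicyAcknowledgements",  # hypothetical table name
    AttributeDefinitions=[
        {"AttributeName": "student_id", "AttributeType": "S"},
    ],
    KeySchema=[
        # A high-cardinality partition key spreads requests evenly across partitions
        {"AttributeName": "student_id", "KeyType": "HASH"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
```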

You work as a data analyst and have been tasked with creating processing jobs for data that lives in an S3 data lake. Currently, you are manually starting and chaining together AWS Glue jobs that transform the data and store it back to S3, where it is then processed by the next AWS Glue job. These jobs can sometimes run for 20 minutes or longer. To save on cost, you have automated a script to purge the intermediate data that the AWS Glue jobs store in S3. What other methods of automation can you introduce? Use workflows in AWS Glue to chain together AWS Glue jobs with event triggers. Create multiple Lambda functions, one for each AWS Glue job. Trigger the appropriate Lambda function based on the job's CloudWatch Event rule. The event rule is triggered on a custom event pattern when the AWS Glue job state changes to Succeeded. Create a single Lambda function for the initial AWS Glue job. Inside the Lambda function, check for the state to change to Succeeded. Once the state changes, use AWS Step Functions to trigger the remaining AWS Glue jobs, one after another. Create multiple Lambda functions, one for each AWS Glue job. Trigger the appropriate Lambda function based on the job's CloudWatch Event rule. The event rule is triggered on a custom event pattern when the AWS Glue job run ID matches the JobRunId. Move the script to purge intermediate data to a Lambda function. Trigger the Lambda function to execute 20 minutes after the last AWS Glue job has been run.

Use workflows in AWS Glue to chain together AWS Glue jobs with event triggers. In AWS Glue, you can use workflows to create and visualize complex extract, transform, and load (ETL) activities involving multiple crawlers, jobs, and triggers. Each workflow manages the execution and monitoring of all its components. As a workflow runs each component, it records execution progress and status, providing you with an overview of the larger task and the details of each step. The AWS Glue console provides a visual representation of a workflow as a graph. Event triggers within workflows can be fired by both jobs and crawlers, and can start both jobs and crawlers. Thus, you can create large chains of interdependent jobs and crawlers. Overview of Workflows in AWS Glue (https://docs.aws.amazon.com/glue/latest/dg/workflows_overview.html) Create multiple Lambda functions, one for each AWS Glue job. Trigger the appropriate Lambda function based on the job's CloudWatch Event rule. The event rule is triggered on a custom event pattern when the AWS Glue job state changes to Succeeded. To start a job when a crawler run completes, create an AWS Lambda function and an Amazon CloudWatch Events rule. You can modify this method to automate other AWS Glue functions. How Can I Use a Lambda Function to Automatically Start an AWS Glue Job When a Crawler Run Completes? (https://aws.amazon.com/premiumsupport/knowledge-center/start-glue-job-crawler-completes-lambda/) How Can I Automatically Start an AWS Glue Job When a Crawler Run Completes? (https://aws.amazon.com/premiumsupport/knowledge-center/start-glue-job-run-end/)
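
As an illustration of the Lambda-based variant, a function attached to a CloudWatch Events rule that matches the Glue "Job State Change" event with state SUCCEEDED could simply start the next job in the chain (the job names are placeholders):

```python
import boto3

glue = boto3.client("glue")

# Hypothetical mapping from a finished job to the next job in the chain
NEXT_JOB = {
    "stage-1-cleanse": "stage-2-enrich",
    "stage-2-enrich": "stage-3-aggregate",
}

def lambda_handler(event, context):
    detail = event["detail"]
    # The rule is filtered to state SUCCEEDED, but check defensively
    if detail.get("state") != "SUCCEEDED":
        return
    next_job = NEXT_JOB.get(detail["jobName"])
    if next_job:
        glue.start_job_run(JobName=next_job)
```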

Pickle Scanz is a company utilizing LIDAR technology to make high-resolution scans of pickles. They've utilized a machine learning model to identify areas they suspect represent a specific bump shape on the skin of the pickles. They've loaded candidates into a Redshift table and need to filter the candidates for a specific base64 binary pattern, but want to fuzz their search to include any records that are a close, but not exact, match to the pattern. How can they most easily identify this pattern? Craft a regex matching pattern and utilize the REGEXP_SUBSTR Redshift query function to identify possible close matches. Perform queries against the Redshift table with all near-value combinations of base64 values, and load all returns into a newly created near_matches table. Utilize Glue to ETL the Redshift candidate table to a DynamoDB table, and use the CONTAINS query function to find matches to the base64 pattern in the possible candidates. Utilize Glue to ETL the Redshift candidate table to an Elasticsearch cluster, and utilize the Elasticsearch string search functionality to fuzz the search for the identified base64 pattern.

Utilize Glue to ETL the Redshift candidate table to an Elasticsearch cluster, and utilize the Elasticsearch string search functionality to fuzz the search for the identified base64 pattern. Elasticsearch has very powerful string search functionality, which will give a match-ranked response to a search that can be tuned to increase accuracy.
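
Once the candidate records are indexed, a fuzzy match could be run against the Elasticsearch search API roughly as follows (the endpoint, index, field name, and example pattern are assumptions, and request signing/authentication is omitted):

```python
import json
import requests

ES_ENDPOINT = "https://search-example-domain.us-east-1.es.amazonaws.com"  # placeholder

query = {
    "query": {
        "match": {
            "pattern_b64": {                  # assumed field holding the base64 pattern
                "query": "aGVsbG8gd29ybGQ=",  # example pattern to search for
                "fuzziness": "AUTO",          # allow close, non-exact matches
            }
        }
    }
}

response = requests.post(
    f"{ES_ENDPOINT}/candidates/_search",      # assumed index name
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
)
print(response.json()["hits"]["hits"])
```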

You've been provided with a large amount of highly structured, normalized data stored in disparate relational databases. This data needs to be combined to enable business intelligence tooling and analytics queries. The solution will be used frequently. Speed and cost are equally important. Which of the following will provide the best data repository given the access requirements? Utilize multiple Database Migration Service tasks to migrate the data into a single Aurora Postgres cluster. Provide the appropriate teams access to the Aurora Postgres cluster endpoint. Utilize Glue to catalog and ETL the data into a DynamoDB table and provide the appropriate teams access to the DynamoDB table. Utilize multiple Database Migration Service tasks to migrate the data into a Redshift cluster. Provide the appropriate teams access to the Redshift cluster endpoint. Utilize Glue to catalog and ETL the data into a Redshift data warehouse and provide the appropriate teams access to the Redshift cluster endpoint.

Utilize Glue to catalog and ETL the data into a Redshift data warehouse and provide the appropriate teams access to the Redshift cluster endpoint. Glue will provide ETL tools to aggregate, clean, and extend the data in the process of loading it into the Redshift cluster, which is well-suited to the workload described in the scenario.

Mistry's Mysteries collects mysterious stories from all over the world. Each reporting office utilizes an Aurora MySQL Serverless cluster to manage local stories. The editorial team would like to collect and collate this data into a central repository. They're already running on a tight budget and need this solution to be as cheap as possible. Speed is not important, and the data will not be accessed more than a few times a week. What is the best solution to meet their needs? Utilize Glue's cataloging and ETL capabilities to extract data on a regular schedule from the disparate Aurora clusters. Write the data in Parquet format to a single S3 bucket, and configure Athena to provide a SQL interface for the data. Leverage Database Migration Service to extract the data with Full Load and Change Data Capture. Use an S3 bucket for the target, and utilize Parquet format. Configure Athena to provide a SQL interface for the data. Leverage Database Migration Service to extract the data with Full Load and Change Data Capture. Utilize a Redshift cluster as the target, and provide the cluster endpoint to the appropriate team. Utilize Glue's cataloging and ETL capabilities to extract data on a regular schedule from the disparate Aurora clusters. Write the data to a Redshift cluster and provide the endpoint to the appropriate team.

Utilize Glue's cataloging and ETL capabilities to extract data on a regular schedule from the disparate Aurora clusters. Write the data in Parquet format to a single S3 bucket, and configure Athena to provide a SQL interface for the data. This option would be the most appropriate and meet the requirements, as it is the least costly option for both storing the data and accessing it infrequently.

Your company has been hired to create a search and analytics system for Percival's Peculiar Pickles, which is a site where people post and discuss pictures of peculiar pickles. The solution should provide a REST API interface, enable deep text search capabilities, and be able to generate visualizations of the data stored in the system. Which solution will meet these requirements with minimal development effort? Store the files in Elastic File System. Access the files through a custom API that provides search services hosted on EC2 instances in an Auto Scaling group behind an Application Load Balancer. Create a custom API with API Gateway and Lambda. Use S3 and Athena as the datastore. Perform text filtering in the application layer. Utilize Kinesis Firehose to deliver data from the various elements of the application to an Elasticsearch Service cluster. Provide Elasticsearch API and Kibana endpoints to the customer with appropriate security credential information. Use DynamoDB as the data store. Utilize Elastic MapReduce to create a full-text search system. Write a custom API with API Gateway and Lambda.

Utilize Kinesis Firehose to deliver data from the various elements of the application to an Elasticsearch Service cluster. Provide Elasticsearch API and Kibana endpoints to the customer with appropriate security credential information. Kinesis Firehose is able to deliver records to Elasticsearch Service with no additional development needed. Elasticsearch provides a REST API which satisfies the rest of the requirements.

Stupendous Fantasy Football League would like to create near real-time scoreboards for all games being played on any given day. They have an existing Kinesis Data Firehose which ingests all relevant statistics about each game as it is being played, but they would like to be able to extract just score data on the fly. Data is currently being delivered to a Redshift cluster, but they would like score data to be stored and updated in a DynamoDB table. This table will then function as the datastore for the live scoreboard on their web application. Which of the following is the best way to accomplish this? Utilize a Kinesis Data Analytics Application to filter out data for each individual active game from the data stream. Send the data to a Lambda function, which then inserts/updates the data in the DynamoDB table. Utilize a Kinesis Data Analytics Application to filter out just the score data from the data stream. Send the data to a Lambda function, which then inserts/updates the data in the DynamoDB table. Create a scheduled Lambda function that periodically polls the Redshift cluster for updated score data and inserts/updates the data in the DynamoDB table. Insert an SQS queue and Lambda function in front of the Kinesis Firehose. Use the Lambda function to filter score data into the DynamoDB table, and leave the rest of the pipeline as is.

Utilize a Kinesis Data Analytics Application to filter out just the score data from the data stream. Send the data to a Lambda function, which then inserts/updates the data in the DynamoDB table. This is the best option: Kinesis Data Analytics applications allow us to split out data from within a Kinesis Data Stream or Firehose.
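
A sketch of the Lambda function that receives the filtered score records from the Kinesis Data Analytics application output and upserts them into DynamoDB (the table name and output column names are assumptions):

```python
import base64
import json
import boto3

table = boto3.resource("dynamodb").Table("LiveScores")  # hypothetical table

def lambda_handler(event, context):
    results = []
    for record in event["records"]:
        # Kinesis Data Analytics delivers each output row base64 encoded
        payload = json.loads(base64.b64decode(record["data"]))
        table.put_item(
            Item={
                "game_id": payload["GAME_ID"],        # assumed output column names
                "home_score": payload["HOME_SCORE"],
                "away_score": payload["AWAY_SCORE"],
            }
        )
        results.append({"recordId": record["recordId"], "result": "Ok"})
    # Acknowledge each record so the application does not retry it
    return {"records": results}
```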

You work for a large data warehousing company that is constantly running large scale processing jobs for customers. Every team has the freedom to use whichever EMR cluster configuration they need to accomplish a particular task, but the solution must be cost optimized. The latest contract requires a very large EMR cluster to be used throughout the year to process ML data and statistical functions. During a few months out of the year, the processing will be massive and, during other months, it will be minimal. To contend with this, your team uses a combination of on-demand and spot instances for the EMR cluster nodes, which is estimated to be around 40 core and task nodes. The team also varies the instance types to handle different workload types; for example, GPU-intensive ML processes will use g3 instance types and storage-optimized processes will use i2 instance types. Which type of EMR cluster solution would need to be set up to meet the requirements for the new contract? Utilize instance fleets configurations when creating the EMR cluster. Utilize instance fleets and instance groups configurations when creating the EMR cluster. Utilize spot instances for core nodes and instance groups for master and task nodes. Utilize instance groups configurations when creating the EMR cluster.

Utilize instance fleets configurations when creating the EMR cluster. The instance fleets configuration for a cluster offers the widest variety of provisioning options for EC2 instances. With instance fleets, you specify target capacities for On-Demand Instances and Spot Instances within each fleet. When the cluster launches, Amazon EMR provisions instances until the targets are fulfilled. Create a Cluster with Instance Fleets or Uniform Instance Groups (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-instance-group-configuration.html)

Magic Tuber Face is the hot new IoT toy. Each kit comes with facial feature- and body-shaped plastic sensors that are powered by plugging them into a potato or other tuber. Each kit writes data to a regional RDS Postgres instance after being processed by an ingestion API. Spud Toys, the company that makes Magic Tuber Face, wants to perform analytics with the data collected from these toys, but is experiencing difficulties combining the data from multiple regional databases. Their goal is to utilize Athena to make the data queryable from a single data lake. What is the best way to accomplish this? Use Glue with a regularly scheduled crawler and a Glue job for each regional database to replicate data to S3. Configure Athena to provide a SQL interface for the S3-stored data. Utilize multiple regional Database Migration Service instances with a task for each database. Configure the task to perform Full Load for each regional database, and use an S3 bucket in the location nearest to the analytics team as the target for all tasks. Configure Athena to provide a SQL interface for the S3-stored data. Write a custom script to run on EC2 that connects to each regional database in order and copies any new records to an S3 data lake. Configure Athena to provide a SQL interface for the S3-stored data. Utilize multiple regional Database Migration Service instances with a task for each database. Configure the task to perform Full Load and Change Data Capture for each regional database, and use an S3 bucket in the location nearest to the analytics team as the target for all tasks. Configure Athena to provide a SQL interface for the S3-stored data.

Utilize multiple regional Database Migration Service instances with a task for each database. Configure the task to perform Full Load and Change Data Capture for each regional database, and use an S3 bucket in the location nearest to the analytics team as the target for all tasks. Configure Athena to provide a SQL interface for the S3-stored data. Because you only need to move the data from regional endpoints to a single data lake, this is a good use for Database Migration Service. IoT devices typically emit data relatively constantly, so the cost of running DMS instances constantly to handle replicating the data to S3 would be both justified and well-utilized.

You work for a large organization that uses Redshift as their data warehousing solution. The members of the HR department run simple ad-hoc queries that take very little time and resources to execute. The members of the engineering team run complex queries that use multiple joins and usually take a long time to run. The HR department is complaining that their queries are getting stuck in queues behind long-running queries by the engineering team. Which of the following solutions could resolve this issue in the most cost-effective manner? Utilize AWS Step Functions to manage query queues. Utilize workload management (WLM) in Redshift to manage query queues. Create a snapshot of the Redshift cluster to create a new analytic Redshift cluster. Assign group-based IAM policies to the HR department and engineering department assigned to the Redshift cluster. Configure two query queues, one for each department. Set the number of queries that can run in each of the queues to 10. Configure two query queues, one for each department. Set the number of queries that can run in each of the queues to 50.

Utilize workload management (WLM) in Redshift to manage query queues. Amazon Redshift workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won't get stuck in queues behind long-running queries. Amazon Redshift WLM creates query queues at runtime according to service classes, which define the configuration parameters for various types of queues, including internal system queues and user-accessible queues. From a user perspective, a user-accessible service class and a queue are functionally equivalent. For consistency, this documentation uses the term queue to mean a user-accessible service class as well as a runtime queue. Amazon Redshift - Workload Management Amazon Redshift - Implementing Workload Management (https://docs.aws.amazon.com/redshift/latest/dg/cm-c-implementing-workload-management.html) Configure two query queues, one for each department. Set the number of queries that can run in each of the queues to 10. At runtime, you can route queries to these queues according to user groups or query groups. You can enable this manual configuration using the Amazon Redshift console by switching to Manual WLM. With this choice, you specify the queues used to manage queries, and the Memory and Concurrency on main field values. With a manual configuration, you can configure up to eight query queues and set the number of queries that can run in each of those queues concurrently. You can set up rules to route queries to particular queues based on the user running the query or labels that you specify. You can also configure the amount of memory allocated to each queue, so that large queries run in queues with more memory than other queues. You can also configure a query monitoring rule (QMR) to limit long-running queries. Amazon Redshift - Implementing Workload Management (https://docs.aws.amazon.com/redshift/latest/dg/cm-c-implementing-workload-management.html)
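
For reference, a manual WLM setup with a separate queue for each department could be applied through a cluster parameter group roughly like this (the parameter group name, user groups, concurrency, and memory splits are assumptions):

```python
import json
import boto3

redshift = boto3.client("redshift")

# Two manual WLM queues routed by user group, plus the default queue
wlm_config = [
    {"user_group": ["hr"], "query_concurrency": 10, "memory_percent_to_use": 20},
    {"user_group": ["engineering"], "query_concurrency": 10, "memory_percent_to_use": 60},
    {"query_concurrency": 5},  # default queue
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="custom-wlm",  # placeholder parameter group
    Parameters=[{
        "ParameterName": "wlm_json_configuration",
        "ParameterValue": json.dumps(wlm_config),
    }],
)
```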

Congratulations, your website for people to comment on your collection of Magic Nose Goblins has gone viral! Unfortunately, the Elasticsearch domain you've set up to make user comments and your written descriptions searchable is running out of storage. Fortunately, search volume is well served by the number of nodes in the domain; you just need more storage. The domain is configured to utilize Elastic Block Store (EBS) storage. How can you add additional storage? Create a script on an EC2 instance to extract all of the data from your Elasticsearch Service Domain to S3. Create a new Elasticsearch Service Domain with increased storage. Load the data onto the newly created domain and update application code to utilize the newly created domain. Utilizing the Elasticsearch Service web console, modify the Storage configuration of your Elasticsearch Domain to increase the per-node storage. In the Elasticsearch Service web console, modify the Elasticsearch domain and add additional nodes to the domain to increase available storage. Utilize Glue to catalog all data on the Elasticsearch domain, and create a Glue Job that moves all data to an S3 bucket once it is more than a week old. Utilize the S3 Select API call to query the data from S3.

Utilizing the Elasticsearch Service web console, modify the Storage configuration of your Elasticsearch Domain to increase the per-node storage. This would be the best way to add additional storage. Since Elasticsearch Service is a fully-managed service, it is capable of managing your domain storage configuration with minimal work on the end user's part.
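
A minimal boto3 sketch of increasing the per-node EBS storage on the domain (the domain name and volume size are placeholders):

```python
import boto3

es = boto3.client("es")

# Increase the per-node EBS volume size; Elasticsearch Service handles the rest
es.update_elasticsearch_domain_config(
    DomainName="nose-goblins",     # placeholder domain name
    EBSOptions={
        "EBSEnabled": True,
        "VolumeType": "gp2",
        "VolumeSize": 100,         # new size in GiB per data node
    },
)
```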

A new machine learning startup is training a model to detect diseases with medical imagery. They have a large number of images stored in an S3 bucket. The images still contain personally identifying information (PII) in a 300x100 pixel area in the bottom left corner. Each image needs to be processed and written to a second S3 bucket, where it will be ingested by a machine learning pipeline. What's the best solution for processing these images? Write a Lambda function that loads each image and replaces the PII area with black pixels. Put the updated image in the second S3 bucket, and delete the original file from the source S3 bucket. Utilize a Kinesis Data Analytics application to remove the PII from each image. Configure the application to send processed images to a Kinesis Firehose delivery stream with the second S3 bucket set as the delivery target. Write an image processing script to replace the PII area with black pixels. Copy all images to an EC2 instance, and process the images into a "Processed" directory. Use the S3 sync command to copy the data into the target bucket. Utilize the S3 mv command to move all image files to an EC2 instance local storage. Use a custom image processing script to replace the PII area with black pixels. Once all of the images have been processed, use the S3 mv command to move the files to the target S3 bucket.

Write a Lambda function that loads each image and replaces the PII area with black pixels. Put the updated image in the second S3 bucket, and delete the original file from the source S3 bucket. Utilizing Lambda avoids any risk of accidentally leaving any image files containing PII on secondary storage.
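
A sketch of such a Lambda function, assuming the Pillow imaging library is available as a Lambda layer, the function is triggered by S3 PUT events, and the PII always occupies the 300x100 pixel area in the bottom-left corner (bucket names are placeholders):

```python
import io
import boto3
from PIL import Image, ImageDraw  # provided via a Lambda layer (assumption)

s3 = boto3.client("s3")
TARGET_BUCKET = "example-processed-images"  # placeholder

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Load the source image from S3
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        image = Image.open(io.BytesIO(body))

        # Black out the 300x100 PII area in the bottom-left corner
        draw = ImageDraw.Draw(image)
        draw.rectangle([(0, image.height - 100), (300, image.height)], fill="black")

        # Write the scrubbed image to the target bucket, then remove the original
        buffer = io.BytesIO()
        image.save(buffer, format=image.format or "PNG")
        s3.put_object(Bucket=TARGET_BUCKET, Key=key, Body=buffer.getvalue())
        s3.delete_object(Bucket=bucket, Key=key)
```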

You've been provided with an S3 bucket with several terabytes of log data that needs to be prepared for analysis. Unfortunately, the logs are not in a common data format and use irregular delimiters, but are grouped in prefixes in such a way that each prefix contains logs with identical data formatting. The logs need to be processed and loaded into an Elasticsearch domain. This process needs to be completed as quickly as possible. What is the best workflow to accomplish this? Utilize Database Migration Service to ingest the data, format it, and deliver it to the Elasticsearch domain. Ingest each prefix's worth of logs to an EC2 instance and run a processing script to format each line as a JSON document, then send the JSON document to the Elasticsearch domain's REST API. Write a Lambda function with a format template for each S3 prefix data format. Process each line in the log into a JSON document, and deliver the JSON documents to a Kinesis Firehose delivery stream with the Elasticsearch domain configured as the target. Utilize Glue to catalog the data, and create Glue jobs to process the log files and deliver them to the Elasticsearch domain.

Write a Lambda function with a format template for each S3 prefix data format. Process each line in the log into a JSON document, and deliver the JSON documents to a Kinesis Firehose delivery stream with the Elasticsearch domain configured as the target. Because you have irregularly formatted data, you need to perform manual data transforms. So, the bulk of the work here is to map a line from a file in each bucket prefix and then apply that transformation where appropriate. After that, you can easily deliver the results to a Kinesis Firehose stream, which can have an Elasticsearch domain configured as the target.
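
A minimal sketch of that transform Lambda follows, assuming the function is invoked once per log file, the per-prefix formats are captured in a simple template map, and the delivery stream and prefix names are illustrative.

import json

import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")

DELIVERY_STREAM = "logs-to-elasticsearch"  # hypothetical Firehose delivery stream

# One field layout per S3 prefix; each prefix uses a single, known delimiter.
FORMAT_TEMPLATES = {
    "app-logs/": {"delimiter": "|", "fields": ["timestamp", "level", "message"]},
    "web-logs/": {"delimiter": ";", "fields": ["timestamp", "ip", "path", "status"]},
}


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        template = next(v for prefix, v in FORMAT_TEMPLATES.items() if key.startswith(prefix))

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        batch = []
        for line in body.splitlines():
            # Map the delimited fields onto the prefix's template to build a JSON document.
            doc = dict(zip(template["fields"], line.split(template["delimiter"])))
            batch.append({"Data": (json.dumps(doc) + "\n").encode("utf-8")})
            if len(batch) == 500:  # PutRecordBatch accepts at most 500 records per call
                firehose.put_record_batch(DeliveryStreamName=DELIVERY_STREAM, Records=batch)
                batch = []
        if batch:
            firehose.put_record_batch(DeliveryStreamName=DELIVERY_STREAM, Records=batch)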

You're creating an application to process art school portfolios. Most of the data being ingested for this application will be high-resolution images that average 50MB each. It is imperative that no data is lost in the process of ingesting the data. Each image has roughly 20KB of metadata that will be the primary focus of the application, but the images themselves need to be accessible as well. The starting point of the ingestion flow for this application will be in an admissions office, where digital media is processed. The front end application will mostly be performing OLTP workloads. Which of the following will ensure all data is available and is able to be ingested in a timely manner? As part of the ingestion process, resize each image to be approximately 900KB. Create an ingestion S3 bucket and configure a Kinesis Firehose to deliver data to the S3 bucket. Load each image into the Firehose after resizing. Create a Lambda function triggered by S3 PUTs that processes each image, extracting the relevant metadata and writing it to an Aurora MySQL cluster. Write the processed images to a final storage S3 bucket and delete the image file from the ingestion S3 bucket. Create a Kinesis Data Firehose configured to deliver records to an S3 bucket. Write an ingestion application that places each file in the Kinesis Firehose. Create a Lambda function triggered by PUTs to the S3 bucket, which processes each file to extract the metadata and add it to a separate CSV file in the same bucket prefix as the image file. Configure Athena to provide a SQL interface for the S3 bucket. Create an S3 bucket. Before uploading each image file, modify the Exif data to include any additional metadata. Write the image files to the S3 bucket with a deterministic prefix schema (/Year/Month/Day/Applicant ID/...). Write the front end application to read the Exif data from each image as it is being loaded. Write the application to extract the metadata from each image file, and combine it with any other metadata that is not part of the file (Applicant Name, ID, etc.). Upload the image file to an S3 bucket with a deterministic prefix schema (/Year/Month/Day/Applicant ID/...). Configure a Kinesis Data Stream to handle the metadata records. Write a Lambda-based Kinesis consumer to process the metadata records into a DynamoDB table. Have the consumer Lambda function also write the metadata for each file to the appropriate S3 location to accompany the image it relates to.

Write the application to extract the metadata from each image file, and combine it with any other metadata that is not part of the file (Applicant Name, ID, etc.). Upload the image file to an S3 bucket with a deterministic prefix schema (/Year/Month/Day/Applicant ID/...). Configure a Kinesis Data Stream to handle the metadata records. Write a Lambda-based Kinesis consumer to process the metadata records into a DynamoDB table. Have the consumer Lambda function also write the metadata for each file to the appropriate S3 location to accompany the image it relates to. This will ensure that both metadata and image files are stored in a fault-tolerant, durable manner. By also storing the metadata files with the images, you add an easy-to-locate backup of each metadata record, should there be an issue with DynamoDB or if you need a convenient place to collect records for secondary use cases without increasing load on the DynamoDB table.
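
A minimal sketch of the consumer Lambda follows, assuming each Kinesis record is a JSON metadata document (string and numeric attributes only) that carries the S3 key of the image it describes; the table, bucket, and field names are illustrative.

import base64
import json

import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

TABLE = dynamodb.Table("PortfolioMetadata")  # hypothetical DynamoDB table
IMAGE_BUCKET = "portfolio-images"            # hypothetical image bucket


def handler(event, context):
    for record in event["Records"]:  # batch of Kinesis Data Stream records
        payload = base64.b64decode(record["kinesis"]["data"])
        metadata = json.loads(payload)

        # Primary copy of the metadata, serving the OLTP front end.
        TABLE.put_item(Item=metadata)

        # Secondary copy alongside the image it describes, e.g.
        # /Year/Month/Day/ApplicantID/image.jpg -> image.jpg.metadata.json
        metadata_key = metadata["image_key"] + ".metadata.json"  # assumed field name
        s3.put_object(
            Bucket=IMAGE_BUCKET,
            Key=metadata_key,
            Body=json.dumps(metadata).encode("utf-8"),
        )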
