Bob's Turnip Farm is storing their financial records in an S3 bucket. Due to corporate espionage in the turnip market, they want to ensure that their financial records are secured with an easily rotated strong encryption key that they create and manage. How can they most easily accomplish this? Add a Customer Master Key to Key Management Service and utilize the Server Side Encryption - KMS (SSE-KMS) option for your S3 objects. Enable Server Side Encryption - S3 (SSE-S3) on a new bucket and copy all objects from the existing bucket to the new bucket. Download all files from the S3 bucket, delete them from S3 once they've been downloaded, and encrypt all of the files locally and upload them again. Create a custom workflow to automatically pick up each file in the bucket, pass it through an on-premises HSM utility, and upload to the S3 bucket.
Add a Customer Master Key to Key Management Service and utilize the Server Side Encryption - KMS (SSE-KMS) option for your S3 objects. Correct! This will satisfy all of the requested features.
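For reference, a minimal boto3 sketch of this pattern, assuming a customer-managed CMK with the alias alias/turnip-finance and a bucket named bobs-turnip-finance (both hypothetical names). Because objects are encrypted with SSE-KMS, rotating the CMK in KMS requires no change to the upload code or to existing objects.

    import boto3

    s3 = boto3.client("s3")

    # Upload a financial record encrypted server-side with the customer-managed KMS key.
    with open("ledger.csv", "rb") as f:
        s3.put_object(
            Bucket="bobs-turnip-finance",        # hypothetical bucket name
            Key="records/2024/ledger.csv",
            Body=f,
            ServerSideEncryption="aws:kms",
            SSEKMSKeyId="alias/turnip-finance",  # hypothetical CMK alias
        )

The same key can also be set as the bucket's default encryption with put_bucket_encryption so every new object is protected automatically.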
You work for a startup that is implementing a new data store to handle data types with fixed schemas, as well as key-value access patterns. The queries that will be run on this data will be complex SQL queries that must be transactional. Additionally, the data store must also be able to provide strongly consistent reads for global secondary indexes. Which data store should the startup use to fulfill all of these requirements? Amazon S3 Amazon RDS Amazon DynamoDB Amazon Elasticsearch
Amazon RDS RDS handles all of the requirements. Although RDS is not typically used for key-value access, a schema with a well-chosen primary key can support this access pattern. Amazon Relational Database Service (https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Welcome.html)
A geologic study group has installed thousands of IoT sensors across the globe to measure various soil attributes. Each sensor delivers its data to a prefix in a single S3 bucket in Parquet-formatted files. The study group would like to be able to query this data using SQL and avoid any data processing in order to minimize their costs. Which of the following is the best solution? Write a Lambda function that leverages the S3 Select API call to collect data from each sensor and join them together to answer queries. Use Glue to catalog the data, load the catalog into Elastic MapReduce (EMR), and query the data from EMR. Configure Athena tables to make the data queryable and provide the appropriate access to team members via IAM policy. Utilize Glue to ETL the data to a Redshift cluster, run SQL queries from Redshift, and visualize data with QuickSight.
Configure Athena tables to make the data queryable and provide the appropriate access to team members via IAM policy. This is the best option. Because the data is already stored in an Athena-friendly format in S3, this would be a cost-effective solution.
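As an illustration, a sketch of defining the Athena table directly over the Parquet files via boto3; the database, bucket, column names, and results location are hypothetical.

    import boto3

    athena = boto3.client("athena")

    # External table over the Parquet files in place; no ETL or data movement required.
    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS soil_readings (
        sensor_id    string,
        reading_time timestamp,
        moisture     double,
        ph           double
    )
    STORED AS PARQUET
    LOCATION 's3://soil-sensor-data/readings/'
    """

    athena.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": "geology"},
        ResultConfiguration={"OutputLocation": "s3://geology-athena-results/"},
    )

Access is then scoped per analyst with IAM policies covering both the athena: actions and the underlying S3 objects.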
You are working for an investment bank, which has just employed a new team of analysts who will focus on using historical data stored in Amazon EMR to predict future stock market performance using an in-house Business Intelligence (BI) application. The new Trading Analytics Team are working from the New York office and, in order to complete their analysis, they need to connect the BI application running on their local desktop to the EMR cluster. The BI application is extremely sensitive to network inconsistencies, and during initial testing it frequently hangs and becomes unresponsive at the busiest time of day. How would you configure the network between the BI application and the EMR cluster to ensure that the network is consistent and the application does not hang? Configure an Internet Gateway to manage the network connectivity between the New York office and the EMR cluster. Configure a Direct Connect connection between the New York office and AWS. Configure a bastion host to manage the network connectivity between the New York office and the EMR cluster. Configure an AWS managed IPsec VPN connection over the internet to the VPC where the EMR cluster is running.
Configure a Direct Connect connection between the New York office and AWS. AWS Direct Connect is a service you can use to establish a private dedicated network connection to AWS from your data center, office, or colocation environment. If you have large amounts of input data, using AWS Direct Connect may reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections. Connect to Data with AWS DirectConnect (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-input-directconnect.html)
Clicky Looky is an analytics company that specializes in processing mobile device clickstream data. They've created a delivery API for their clients to send this data. They require that the data be delivered to the API in JSON format. They need a near real-time workflow to make the data available to their clients in the application and the data should be validated early in the analytics pipeline. How can they best accomplish this? Utilize Glue to ingest and validate API data. Create a Glue job to process the data into an S3 bucket. Configure Athena to provide a SQL interface for the data. Configure the API to deliver the records to an S3 bucket. Use Database Migration Service to ingest and process the records from S3. Utilize DMS Data Validation to validate the records before writing them to a Redshift cluster. Create an EC2 auto scaling group to send API data to through an Application Load Balancer. Validate each record on the EC2 instance, and write the record to an Aurora Postgres instance. Configure the end user application to use Aurora Postgres as its datastore. Configure the API to deliver records to a Kinesis Data Stream. Create a Lambda consumer for the data stream that validates the data and sends valid records to a Kinesis Firehose delivery stream with a Redshift cluster configured as the destination. Configure the application to use Redshift as its datastore.
Configure the API to deliver records to a Kinesis Data Stream. Create a Lambda consumer for the data stream that validates the data and sends valid records to a Kinesis Firehose delivery stream with a Redshift cluster configured as the destination. Configure the application to use Redshift as its datastore. This will provide a fast, scalable pipeline for the clickstream data which may not have a regular delivery pattern.
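A minimal sketch of the validating Lambda consumer, assuming the records are JSON and a Firehose delivery stream named clickstream-to-redshift already points at the Redshift cluster (names and required fields are hypothetical).

    import base64
    import json
    import boto3

    firehose = boto3.client("firehose")

    def handler(event, context):
        valid = []
        for record in event["Records"]:
            payload = base64.b64decode(record["kinesis"]["data"])
            try:
                doc = json.loads(payload)  # validation step: must be well-formed JSON
                if "device_id" in doc and "event_time" in doc:  # hypothetical required fields
                    valid.append({"Data": json.dumps(doc) + "\n"})
            except json.JSONDecodeError:
                continue  # drop (or dead-letter) malformed records early in the pipeline
        if valid:
            firehose.put_record_batch(
                DeliveryStreamName="clickstream-to-redshift",
                Records=valid,
            )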
You work for an event coordination firm that uses anywhere from hundreds to thousands of low-power IoT devices to produce information and records about the people who attend the events you are contracted to coordinate. The current architecture consists of EC2 instances that serve as a proxy for the IoT events. Your development team has a requirement that the IoT devices must write data to a Kinesis Data Stream, which is proxied through the EC2 instances. The lead architect also recommends using batching and multithreading. Additionally, retry logic must be implemented, as well as the ability for record deaggregation from consumer applications. Which architecture could you use to accomplish this? Install the Kinesis Agent on the proxy EC2 instances. Create a SQS queue and Lambda function for retry logic and batching respectively. Create Kinesis Producer Library (KPL) applications on the proxy EC2 instances. Create Kinesis Client Library (KCL) applications on the proxy EC2 instances.
Create Kinesis Producer Library (KPL) applications on the proxy EC2 instances. The KPL simplifies producer application development, allowing developers to achieve high write throughput to one or more Kinesis streams. The KPL is an easy-to-use, highly configurable library that you install on your hosts that generate the data that you wish to stream to Kinesis Streams. It acts as an intermediary between your producer application code and the Kinesis Streams API actions. The KPL performs the following primary tasks: 1) Writes to one or more Kinesis streams with an automatic and configurable retry mechanism; 2) Collects records and uses PutRecords to write multiple records to multiple shards per request; 3) Aggregates user records to increase payload size and improve throughput; 4) Integrates seamlessly with the Amazon Kinesis Client Library (KCL) to de-aggregate batched records on the consumer; 5) Submits Amazon CloudWatch metrics on your behalf to provide visibility into producer performance. Developing Producers Using the Amazon Kinesis Producer Library (https://docs.aws.amazon.com/streams/latest/dev/developing-producers-with-kpl.html) KPL Key Concepts (https://docs.aws.amazon.com/streams/latest/dev/kinesis-kpl-concepts.html)
You've been contacted by a company that is looking to clear stale data from an existing multi-terabyte DynamoDB table. They want to store all of the records in a data lake, and want to unload the data as quickly as possible before performing pruning on the table. They have requested a solution that does not require writing code, if possible. What is the optimal way to accomplish this? Create a Data Pipeline with the Export DynamoDB table to S3 template. Provide the source DynamoDB table and destination S3 bucket, and start the pipeline. Configure a Glue crawler to catalog the DynamoDB table. Configure a Glue job to migrate all records from the table to an S3 bucket. Write a custom script to run from an EC2 instance that reads batches of records from the DynamoDB table and writes them to the S3 destination. Create a Lambda function that reads batches of records from the DynamoDB table and writes them to the S3 destination.
Create a Data Pipeline with the Export DynamoDB table to S3 template. Provide the source DynamoDB table and destination S3 bucket, and start the pipeline. Data Pipeline is the best tool available to extract all of the records from a DynamoDB table as quickly as possible without writing any code.
You work as a data analyst for a major airline company that operates flights scheduled all around the globe. The current ticketing system is going through a technical audit and has the requirement, by air traffic control law, that all parts of the ticketing system be digitized. The volume of ticketing data created on a daily basis is incredibly high. Your team has been tasked with collecting the ticketing data and storing it in S3, which is copied on a nightly basis to a company data lake for retrieval. There is also a requirement that the ticketing data be transformed and grouped into batches according to the flight departure location. The data must be optimized for high-performance retrieval rates, as well as collected and stored with high durability. Which solution would you use to ensure the data is collected and stored in a cost-effective, durable, and high-performing manner? Create a Kinesis Data Stream and set the partition key to the flight's departure location. Use multiple shards to batch the data, before sending it to a Kinesis Data Firehose delivery stream that delivers the data to S3. Create a Kinesis Data Firehose delivery stream to receive the data that is then sent to Lambda, where records will be batched by the buffer interval/size. Once the data is transformed, convert the records to CSV format and store the results in S3. Create an Elastic MapReduce (EMR) cluster with Spark Streaming to receive the data, and use a spark-submit job to batch and transform the data into ORC before it is delivered into S3. Create a Kinesis Data Firehose delivery stream to receive the data, with transformations enabled to allow the data to be batched and transformed into ORC before it is delivered into S3.
Create a Kinesis Data Firehose delivery stream to receive the data, with transformations enabled to allow the data to be batched and transformed into ORC before it is delivered into S3. Stream Real-Time Data in Apache Parquet or ORC Format Using Amazon Kinesis Data Firehose (https://aws.amazon.com/about-aws/whats-new/2018/05/stream_real_time_data_in_apache_parquet_or_orc_format_using_firehose/) This is the best answer because it uses ORC files, which are partitioned in batches by Kinesis Data Firehose transformations and allow for highly optimized SQL queries in the company data lake.
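A hedged sketch of creating such a delivery stream with record format conversion enabled; it assumes a Glue table ticketing.tickets already describes the record schema, and all ARNs and names are placeholders.

    import boto3

    firehose = boto3.client("firehose")

    firehose.create_delivery_stream(
        DeliveryStreamName="ticketing-to-s3",
        DeliveryStreamType="DirectPut",
        ExtendedS3DestinationConfiguration={
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
            "BucketARN": "arn:aws:s3:::airline-ticketing-data",
            # Larger buffers produce bigger ORC files, which query more efficiently.
            "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
            "DataFormatConversionConfiguration": {
                "Enabled": True,
                "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
                "OutputFormatConfiguration": {"Serializer": {"OrcSerDe": {}}},
                "SchemaConfiguration": {
                    "DatabaseName": "ticketing",
                    "TableName": "tickets",
                    "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
                    "Region": "us-east-1",
                },
            },
        },
    )

Grouping by departure location can then be handled as a partition column when the data lake catalogs these files.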
You're in charge of the data backend for a very popular social media website. All live OLTP data is stored in several DynamoDB tables. The company you work for has started a new analytics initiative, and needs a system that enables text search and produces near real-time analytics output for Business Intelligence tooling. Timely results are more important than cost, as revenue projections for the deliverables of the project are enormous. What is the best way to accomplish this? Use the DynamoDB S3 export feature to export all existing data to an S3 bucket. Configure Athena to provide an SQL interface for the S3 stored data. Use Athena for Business Intelligence and search functionality. Enable streams on all production DynamoDB tables with Lambda functions to add any new records to the S3 data store. Create a Redshift cluster and Elasticsearch Service cluster. Configure two Kinesis Firehose streams, one configured to deliver data to each new cluster respectively. Use Glue to crawl and catalog the application DynamoDB tables. Create Glue jobs to process the existing DynamoDB stored data into the Kinesis Firehose delivery streams. Enable the DynamoDB table streams on all application tables. Create a Lambda function for each production DynamoDB table that is triggered from the respective table's stream to appropriately process stream data into the two Kinesis Firehose delivery streams. Provide the Elasticsearch cluster endpoint for text search and Redshift endpoint for analytics querying and Business Intelligence tooling. Create a Redshift cluster and Elasticsearch Service cluster. Configure a Kinesis Firehose stream to deliver data to each new cluster. Use Glue to crawl and catalog the application DynamoDB tables. Create Glue jobs to process the existing DynamoDB stored data into the Kinesis Firehose delivery stream. Enable the DynamoDB table streams on all application tables. Create a Lambda function for each production DynamoDB table that is triggered from the respective table's stream to appropriately process stream data into the Kinesis Firehose delivery stream. Provide the Elasticsearch cluster endpoint for text search and Redshift endpoint for analytics querying and Business Intelligence tooling. Utilize Glue to crawl and catalog all production DynamoDB tables. Launch an Elastic MapReduce (EMR) cluster and utilize the Glue data catalog for the production DynamoDB tables to create Hive tables. Provide the EMR cluster endpoint for querying the DynamoDB stored data and Business Intelligence tooling.
Create a Redshift cluster and Elasticsearch Service cluster. Configure two Kinesis Firehose streams, one configured to deliver data to each new cluster respectively. Use Glue to crawl and catalog the application DynamoDB tables. Create Glue jobs to process the existing DynamoDB stored data into the Kinesis Firehose delivery streams. Enable the DynamoDB table streams on all application tables. Create a Lambda function for each production DynamoDB table that is triggered from the respective table's stream to appropriately process stream data into the two Kinesis Firehose delivery streams. Provide the Elasticsearch cluster endpoint for text search and Redshift endpoint for analytics querying and Business Intelligence tooling. While this seems like a fairly complex configuration, it will fulfill the requirements and can be tuned and expanded to meet various analytics and search needs. While data will be stored in triplicate in this configuration, data storage is relatively inexpensive and each access pattern (OLTP, OLAP, and search) is handled by the best-suited service.
Your company uses Athena to run standard SQL queries on data stored in S3. You have a new team member who requires access to run Athena queries on a number of different S3 buckets. Which of the following should you do to configure access for this new team member? Create a new IAM user account and attach an IAM policy which allows access to Athena. Configure S3 bucket policies to allow the new user to access objects in the required buckets. Create a new IAM account and attach the AmazonAthenaFullAccess managed policy to allow the new user access to run Athena queries on objects in the required buckets. Create a new IAM account and configure S3 bucket policies to allow the new user to access objects in the required buckets. Create a new IAM account and attach a new AWS Managed Policy, allowing the new employee to access the required S3 buckets using Athena.
Create a new IAM user account and attach an IAM policy which allows access to Athena. Configure S3 bucket policies to allow the new user to access objects in the required buckets. Amazon Athena allows you to control access to your data by using IAM policies, Access Control Lists (ACLs), and S3 bucket policies. With IAM policies, you can grant IAM users fine-grained control to your S3 buckets. By controlling access to data in S3, you can restrict users from querying it using Athena. Athena reads data from S3 buckets using the IAM credentials of the user who submitted the query. Query results are stored in a separate S3 bucket. Usually, an Access Denied error means that you don't have permission to read the data in the bucket, or permission to write to the results bucket. Athena FAQs (https://aws.amazon.com/athena/faqs/#Security_.26_availability) Athena "Access Denied" error (https://aws.amazon.com/premiumsupport/knowledge-center/access-denied-athena/)
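As a sketch, the IAM side of that setup might look like the following, with user, bucket, and policy names as placeholders; the data buckets additionally need bucket policies (or this same policy) granting the user read access, plus write access to the query-results bucket.

    import json
    import boto3

    iam = boto3.client("iam")

    policy = {
        "Version": "2012-10-17",
        "Statement": [
            # Permission to run Athena queries.
            {"Effect": "Allow",
             "Action": ["athena:StartQueryExecution", "athena:GetQueryExecution",
                        "athena:GetQueryResults", "athena:ListWorkGroups"],
             "Resource": "*"},
            # Read access to the data buckets Athena will scan.
            {"Effect": "Allow",
             "Action": ["s3:GetObject", "s3:ListBucket"],
             "Resource": ["arn:aws:s3:::analytics-data", "arn:aws:s3:::analytics-data/*"]},
            # Read/write access to the bucket where Athena stores query results.
            {"Effect": "Allow",
             "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
             "Resource": ["arn:aws:s3:::athena-query-results",
                          "arn:aws:s3:::athena-query-results/*"]},
        ],
    }

    iam.put_user_policy(
        UserName="new-analyst",
        PolicyName="athena-s3-access",
        PolicyDocument=json.dumps(policy),
    )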
You have recently started a new role as a Data Analyst for a car rental company. The company uses Amazon EMR for the majority of analytics workloads. While working on a new report, you notice the root volumes of the EMR cluster are not encrypted. You suggest to your boss that the volumes should be encrypted as soon as possible and she agrees, asking you to recommend the best approach. Which course of action do you recommend? Create a new security configuration specifying local disk encryption. Re-create the cluster using the security configuration. Select root volume encryption in the EMR console. Detach the EBS volumes from the master node. Encrypt the EBS volumes and attach them back to the master node. Specify encryption in transit in a security configuration. Re-create the cluster using the security configuration.
Create a new security configuration specifying local disk encryption. Re-create the cluster using the security configuration. Local disk encryption can be enabled as part of a security configuration to encrypt root and storage volumes. EMR Security Configuration (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-create-security-configuration.html)
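A sketch of creating such a security configuration with boto3; the configuration name and KMS key ARN are placeholders.

    import json
    import boto3

    emr = boto3.client("emr")

    # At-rest encryption for local disks covers the EBS root and storage volumes
    # (EnableEbsEncryption) as well as instance store volumes.
    security_config = {
        "EncryptionConfiguration": {
            "EnableInTransitEncryption": False,
            "EnableAtRestEncryption": True,
            "AtRestEncryptionConfiguration": {
                "LocalDiskEncryptionConfiguration": {
                    "EncryptionKeyProviderType": "AwsKms",
                    "AwsKmsKey": "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID",
                    "EnableEbsEncryption": True,
                }
            },
        }
    }

    emr.create_security_configuration(
        Name="local-disk-encryption",
        SecurityConfiguration=json.dumps(security_config),
    )

The cluster is then re-created with SecurityConfiguration="local-disk-encryption" passed to run_job_flow, since a security configuration cannot be applied to a running cluster.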
You work for a company that is currently using a private Redshift cluster as their data warehousing solution, running inside a VPC. You have been tasked with producing a dashboard for sales and KPI data that is stored in the Redshift cluster. You have decided to use QuickSight as the visualization and BI tool to create these dashboards. Which of the following must be done to enable access and create the dashboards? Create a security group that contains an inbound rule authorizing access from the appropriate IP address range for the region where the QuickSight servers are located. Create a Redshift Spectrum external table that allows QuickSight access through the security configurations. Create an IAM role and policy-based rules allowing QuickSight access to the Redshift cluster. Assign the IAM role to the Redshift cluster. Set up an AWS Glue Crawler to crawl the Redshift cluster in order to create a Glue Data Catalog with the Redshift metadata. Use QuickSight to connect to the Glue Data Catalog.
Create a security group that contains an inbound rule authorizing access from the appropriate IP address range for the region where the QuickSight servers are located. To give QuickSight access to a Redshift cluster, it needs to allow the appropriate IP ranges for the QuickSight servers in the AWS region where the servers are located. Authorizing Connections from Amazon QuickSight to Amazon Redshift Clusters (https://docs.aws.amazon.com/quicksight/latest/user/enabling-access-redshift.html)
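For example, the inbound rule could be added to the Redshift cluster's security group like this; the security group ID is a placeholder, and the CIDR must be replaced with the QuickSight IP range documented for the region where your QuickSight account runs.

    import boto3

    ec2 = boto3.client("ec2")

    ec2.authorize_security_group_ingress(
        GroupId="sg-0123456789abcdef0",  # security group attached to the Redshift cluster
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 5439,            # default Redshift port
            "ToPort": 5439,
            "IpRanges": [{
                "CidrIp": "52.23.63.224/27",  # placeholder: documented QuickSight range for the region
                "Description": "Amazon QuickSight service IP range",
            }],
        }],
    )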
You work for a home improvement chain as a data engineer, monitoring and overseeing the data that the home improvement stores log. Every hour, online orders are batched up and stored in a CSV-formatted file on S3. Your team runs queries using Athena to determine which products are most popular during a particular date for a particular region. On average, the CSV files stored in S3 are 5 GB in size, but they are growing to tens and hundreds of GBs. Queries are taking longer to run as the files grow larger. Which of the following solutions can help improve query performance in Athena? Use queries that use the GROUP BY clause. Create an AWS Glue job to transform the CSV files into Apache Parquet files. Use queries that utilize the WHERE clause with a smaller date range. Consider partitioning the data by date and region. Break the CSV files up into smaller sizes of 128 MB each.
Create an AWS Glue job to transform the CSV files into Apache Parquet files. This can help speed up queries by transforming the row-based CSV files into the columnar Apache Parquet format. Consider partitioning the data by date and region. Partitioning divides your table into parts and keeps the related data together based on column values such as date, country, region, etc. Partitions act as virtual columns. You define them at table creation, and they can help reduce the amount of data scanned per query, thereby improving performance. Top 10 Performance Tuning Tips for Amazon Athena (https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/)
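A sketch of the heart of such a Glue (PySpark) job, assuming the CSV files have already been crawled into a catalog table orders.online_orders and that order_date and region exist as columns (all names hypothetical).

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read the crawled CSV table from the Glue Data Catalog.
    orders = glue_context.create_dynamic_frame.from_catalog(
        database="orders", table_name="online_orders"
    )

    # Rewrite as Parquet, partitioned by date and region, so Athena can prune
    # partitions and scan only the columns and partitions a query needs.
    glue_context.write_dynamic_frame.from_options(
        frame=orders,
        connection_type="s3",
        connection_options={
            "path": "s3://orders-parquet/",            # hypothetical target bucket
            "partitionKeys": ["order_date", "region"],
        },
        format="parquet",
    )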
You've been contacted by a Data Engineer at your company, Swallow Speed Coconut Delivery, who is attempting to utilize the UNLOAD command to move data from the company's Redshift cluster to an S3 bucket which will be used to create Redshift Spectrum tables. They're receiving a permission denied error when they run their command. What are the likely corrective steps to resolve this error? Create an IAM Redshift service role and attach a policy with the appropriate S3 permissions to the newly created role. Assign the role to the Redshift cluster and retry the UNLOAD command. Ensure the user has the appropriate permissions in the Redshift user schema. Create an IAM policy granting the user permissions to upload files to S3 and attach the role to the IAM user. Create a bucket policy granting full read/write permissions to the 0.0.0.0/0 CIDR range and attach the policy to the appropriate S3 bucket. Retry the UNLOAD command.
Create an IAM Redshift service role and attach a policy with the appropriate S3 permissions to the newly created role. Assign the role to the Redshift cluster and retry the UNLOAD command. Correct! The most likely cause of a permission failure for an UNLOAD or COPY command involving S3 is a lack of permissions for S3.
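A sketch of the corrected flow once the role is attached to the cluster, using the Redshift Data API; cluster, database, role, and bucket names are placeholders.

    import boto3

    redshift_data = boto3.client("redshift-data")

    # UNLOAD runs with the credentials of the IAM role attached to the cluster, so that
    # role (not the SQL user) needs s3:PutObject on the target bucket.
    redshift_data.execute_statement(
        ClusterIdentifier="coconut-dw",
        Database="analytics",
        DbUser="data_engineer",
        Sql="""
            UNLOAD ('SELECT * FROM deliveries')
            TO 's3://coconut-spectrum-data/deliveries/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-unload'
            FORMAT AS PARQUET;
        """,
    )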
You work for Gerald's Gorgeous Gramophones as a data architect. Gramophones have been flying off the shelf and the sales and site usage data is growing exponentially. You have been using a Redshift cluster to store this data and make it available to Business Intelligence tooling, but Gerald is concerned about the cost of increasing the size of the Redshift cluster to accommodate the growing data. Fresh data is only accessed regularly for two weeks, and then infrequently for BI reports that require JOINing between fresh data and older data. Which solution will maintain functionality and avoid the increased cost of a larger Redshift cluster? Create an S3 bucket to store the infrequently accessed data. Add a Redshift Spectrum table to the existing Redshift cluster to enable access of infrequently accessed data. Create a scheduled Lambda function that runs once a week that utilizes the UNLOAD command to maintain the requested divide in Active/Infrequently Accessed data. Once the data has been confirmed in the Redshift Spectrum table via automated testing, it can be dropped from the Redshift cluster's table. Utilize the UNLOAD command in your Redshift cluster to export all data to an S3 bucket. Configure Athena to be able to query the data, and modify all application and business intelligence processes to query Athena. Create an S3 bucket to store the infrequently accessed data. Utilize the UNLOAD command from a Lambda function to move the data that is not accessed frequently to the S3 bucket on a two-week schedule. Configure Athena to be able to query the infrequently accessed data. Create a RDS Postgres instance and a S3 Bucket. Utilize the UNLOAD command to copy the data that will be infrequently accessed to the S3 Bucket. Utilize Database Migration Service to load the data from S3 to the RDS Postgres instance. Utilize Elastic MapReduce to query the data from Redshift and RDS for Business Intelligence.
Create an S3 bucket to store the infrequently accessed data. Add a Redshift Spectrum table to the existing Redshift cluster to enable access of infrequently accessed data. Create a scheduled Lambda function that runs once a week that utilizes the UNLOAD command to maintain the requested divide in Active/Infrequently Accessed data. Once the data has been confirmed in the Redshift Spectrum table via automated testing, it can be dropped from the Redshift cluster's table. Redshift Spectrum allows us to create external tables with data stored in S3, which will have a much lower cost than maintaining the data in our Redshift cluster. The trade-off is increased access time and limited SQL operation availability.
We See Your Bird is a company that is building an application that collects Twitter traffic and emits it as JSON documents. They need you to design a system that will store these JSON documents, enable full-text search functionality, and provide visualization of the stored data. Which of the following will accomplish this with the least development effort? Write a Lambda function to process the JSON documents and insert them into a RDS Oracle instance. Utilize full-text compound indexes to enable text searching. Configure QuickSight to visualize the data stored in the RDS database. Utilize Glue to create a catalog of the existing data. Add the Glue catalog to a newly created Elastic MapReduce (EMR) cluster to create Hive tables. Query the data from EMR, and configure QuickSight to utilize EMR as a datasource to generate visualizations of the data. Create and configure an Elasticsearch cluster. Create a Kinesis Firehose that is configured to deliver data to the newly created Elasticsearch cluster. Create an index in Elasticsearch to index the received JSON documents. Use a simple loader script to send the existing JSON documents to the Kinesis Firehose, and configure the scraping application to deliver new documents to Kinesis Firehose. Supply the relevant teams with the Elasticsearch cluster endpoint and Kinesis instance endpoint. Create a new Redshift cluster. Utilize Glue to catalogue and ETL the existing records into the Redshift cluster. Modify the Twitter scraping code to insert data into the Redshift cluster instead of the existing data pooling solution. Utilize QuickSight with the Redshift cluster to generate data visualizations.
Create and configure an Elasticsearch cluster. Create a Kinesis Firehose that is configured to deliver data to the newly created Elasticsearch cluster. Create an index in Elasticsearch to index the received JSON documents. Use a simple loader script to send the existing JSON documents to the Kinesis Firehose, and configure the scraping application to deliver new documents to Kinesis Firehose. Supply the relevant teams with the Elasticsearch cluster endpoint and Kinesis instance endpoint. This option provides a solution that meets all the requirements, with as little development effort as possible. Elasticsearch will provide the requested functionality, and Kinesis Firehose is a no-hassle way to ingest data to an Elasticsearch index.
A large amount of homogeneous data has been ingested from IoT sensors around the world in a S3 bucket. You've been asked to process this data into a Redshift cluster for analytics purposes. Which of the following is the most efficient way to do this? Utilize Glue to crawl and catalogue the IoT sensor data, and create a Glue job to process the S3 stored data into Redshift. Utilize Database Migration Service to load the data from S3 to Redshift. Create a Lambda function to read the IoT sensor data and perform Inserts to the appropriate tables in the Redshift cluster. Create appropriately defined tables in Redshift, and utilize the COPY command to load the data from S3 to the appropriate Redshift tables.
Create appropriately defined tables in Redshift, and utilize the COPY command to load the data from S3 to the appropriate Redshift tables. It is easy to load data from S3 to Redshift. S3 is the intermediary step most ETL/Migration tools use to pool data before loading it into Redshift.
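A sketch of the load, issued through the Redshift Data API; table, cluster, bucket, and role names are placeholders, and the Parquet source format is an assumption about how the sensor files are stored.

    import boto3

    redshift_data = boto3.client("redshift-data")

    # COPY reads every object under the prefix in parallel across the cluster's slices.
    # The attached role needs s3:GetObject and s3:ListBucket on the source bucket.
    redshift_data.execute_statement(
        ClusterIdentifier="analytics-cluster",
        Database="iot",
        DbUser="loader",
        Sql="""
            COPY sensor_readings
            FROM 's3://iot-sensor-data/readings/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'
            FORMAT AS PARQUET;
        """,
    )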
You are working as a data analyst for a marketing agency. Through a mobile app, the company gathers data from thousands of mobile devices every hour, which is stored in an S3 bucket. The COPY command is used to move data to a Redshift cluster for further analysis. The data reconciliation team notified you that some of the original data present in S3 files is missing from the Redshift cluster. Which of the following actions would you take to mitigate this issue with the least amount of development effort? Feed the data directly into an Elastic MapReduce (EMR) cluster and use the COPY command to move it to the Redshift cluster for better consistency. Use multiple S3 buckets to store incoming data, then use multiple COPY commands against them to move data to Redshift. Use Step Functions to aggregate data and check the integrity of data across S3 and Redshift, then kick off an extra step function which inserts any missing data back into Redshift. Create new object keys in S3 for new incoming data and use manifest files for stronger consistency when moving data to Redshift.
Create new object keys in S3 for new incoming data and use manifest files for stronger consistency when moving data to Redshift. If you overwrite existing files with new data, and then issue a COPY command immediately following the upload, it is possible for the COPY operation to begin loading from the old files before all of the new data is available. Therefore, creating new object keys will ensure better consistency of data. When you use a manifest file, COPY enforces strong consistency by searching secondary data sources if it does not find a listed file on the primary server and can predetermine which files to move data from. Managing Data Consistency (https://docs.aws.amazon.com/redshift/latest/dg/managing-data-consistency.html)
The Sales and Marketing Teams at your company are using business intelligence applications to run a number of Presto queries on an Amazon EMR cluster with an EMR File System (EMRFS). There is a new Marketing Analyst starting today as well as a new Sales Data Analyst. The Marketing Analyst will need to access the marketing table only. The Sales Data Analyst will need to access to the sales table only. How should you configure access for these two new employees? Create separate IAM roles for the Marketing and Sales users. Assign the roles using an S3 bucket policy to enable the users to access the corresponding tables in the EMR cluster. Configure Presto to use the AWS Glue Data Catalog as the Apache Hive metastore. Create separate IAM roles for the Marketing and Sales users. Assign the roles with AWS Glue resource-based policies to enable the users to access the corresponding tables in the AWS Glue Data Catalog. Configure Presto to use the AWS Glue Data Catalog as the Apache Hive metastore. Create separate IAM roles for the Marketing and Sales users. Assign the roles using IAM policies to enable the users to access the corresponding tables in the AWS Glue Data Catalog. Configure Presto to use the AWS Glue Data Catalog as the Apache Hive metastore. Create separate IAM roles for the Marketing and Sales users. Configure access to the relevant tables using an S3 Access Control List to enable the users to access the corresponding tables in the EMR cluster. Configure Presto to use the AWS Glue Data Catalog as the Apache Hive metastore.
Create separate IAM roles for the Marketing and Sales users. Assign the roles with AWS Glue resource-based policies to enable the users to access the corresponding tables in the AWS Glue Data Catalog. Configure Presto to use the AWS Glue Data Catalog as the Apache Hive metastore. AWS Glue resource policies can be used to control access to Data Catalog resources. AWS Glue Resource Policies for Access Control (https://docs.aws.amazon.com/glue/latest/dg/glue-resource-policies.html)
You work for a movie theater organization that is integrating a new concession system. The movie theaters will be spread across the globe, showing movies in hundreds of different languages. The new concession system needs to be able to handle users at any time of day. The amount of concession purchases spikes during certain times of the day and night, so the collected data volume fluctuates. The data that is stored for concession purchases, items, and prices needs to be delivered at low latency and high throughput no matter the size of data; however, the data is typically small in size. What storage option is the best solution for the new concession system? ElastiCache with Multi-AZ enabled S3 with CloudFront to create a global CDN Neptune with Multi-AZ enabled RDS with Multi-AZ enabled DynamoDB with global tables and multi-region replication
DynamoDB with global tables and multi-region replication DynamoDB scales horizontally and allows applications to deliver data at single-digit millisecond latency at large scale. DynamoDB also offers global tables for multi-region replication that can be used for your global application. Global Tables: Multi-Region Replication with DynamoDB Amazon DynamoDB FAQs
You are working on a project to consolidate a large amount of confidential information onto Redshift. In order to meet compliance requirements, you need to demonstrate that you can produce a record of authentication attempts, user activity on the database, connections, and disconnections. Which of the following will create the required logs? Enable QuickSight logs. Enable CloudTrail logs. Enable CloudWatch logs. Enable Redshift audit logs.
Enable Redshift audit logs. Redshift can log information about connections and user activities in your database. Audit logging is not enabled by default in Amazon Redshift. The connection log, user log, and user activity log can be enabled using the AWS Management Console, the Amazon Redshift API, or the AWS Command Line Interface. RedShift Audit Logs (https://docs.aws.amazon.com/redshift/latest/mgmt/db-auditing.html#db-auditing-logs)
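A sketch of turning the logging on with boto3; cluster and bucket names are placeholders. Note that the user activity log additionally requires the enable_user_activity_logging parameter to be set to true in the cluster's parameter group.

    import boto3

    redshift = boto3.client("redshift")

    # Enables the connection log, user log, and user activity log, delivered to S3.
    redshift.enable_logging(
        ClusterIdentifier="confidential-dw",
        BucketName="redshift-audit-logs",
        S3KeyPrefix="confidential-dw/",
    )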
You work as a data scientist for a company that delivers a new messaging application that is becoming very popular. Currently, the architecture consists of a single Kinesis Data Stream with multiple shards. Due to the growing popularity of the application, it has become apparent that the number of consumer applications using the Kinesis Client Library (KCL) also needs to grow. After adding more consumer applications, the consumer applications are constantly polling the data stream shards, causing the consumers to contend with other consumers. You've been tasked with finding a solution wherein the consumer applications can subscribe to the shards instead of polling them, and consume messages at a rate of 2 MB/s per shard. What solution should you propose to handle the demand and requirements? Create a new Kinesis Data Stream with double the amount of shards. Make this new stream the output for the original Kinesis Data Stream. Enable enhanced fan-out on the consumer applications. The KCL automatically subscribes the application to all the shards of a stream. Increase the number of shards to the maximum amount of 500. Create a new Kinesis Data Stream with the same amount of shards. Make this new stream the output for the original Kinesis Data Stream.
Enable enhanced fan-out on the consumer applications. The KCL automatically subscribes the application to all the shards of a stream. This feature enables consumers to receive records from a stream with throughput of up to 2 MB of data per second per shard. This throughput is dedicated, which means that consumers that use enhanced fan-out don't have to contend with other consumers that are receiving data from the stream. Kinesis Data Streams pushes data records from the stream to consumers that use enhanced fan-out. Therefore, these consumers don't need to poll for data. Developing Custom Consumers with Dedicated Throughput - Enhanced Fan-Out (https://docs.aws.amazon.com/streams/latest/dev/enhanced-consumers.html) Amazon Kinesis Data Streams Adds Enhanced Fan-Out and HTTP/2 for Faster Streaming (https://aws.amazon.com/blogs/aws/kds-enhanced-fanout/) High Performance Data Streaming with Amazon Kinesis: Best Practices and Common Pitfalls (https://www.youtube.com/watch?v=MELPeni0p04&t=909s)
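A sketch of registering an enhanced fan-out consumer on the existing stream; the stream ARN and consumer name are placeholders. A KCL 2.x application registers itself this way automatically.

    import boto3

    kinesis = boto3.client("kinesis")

    # Each registered consumer gets a dedicated 2 MB/s per shard and receives records
    # pushed over HTTP/2 (SubscribeToShard) instead of sharing the polling read limits.
    response = kinesis.register_stream_consumer(
        StreamARN="arn:aws:kinesis:us-east-1:123456789012:stream/messaging-events",
        ConsumerName="analytics-consumer",
    )
    print(response["Consumer"]["ConsumerARN"])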
You work for an organization that uses legacy Microsoft applications to run the day-to-day services, as well as the authentication mechanisms. Currently, all employees are authenticated into applications using AWS Managed Microsoft AD in us-west-2. You have recently set up a QuickSight account in us-east-1 that you need teammates to authenticate into, so they can run data analytics tasks. Your teammates are not able to authenticate into the QuickSight account. Which of the following is the cause for the issue and what are the possible solutions? Use the Standard edition for the QuickSight account. Invite the users to the QuickSight account using their email addresses. Ensure Active Directory is the identity provider for QuickSight and associate your AD groups with Amazon QuickSight. Use the Enterprise edition for the QuickSight account. Set up an AWS Managed Microsoft AD directory in the same region as the QuickSight account and migrate users using the new directory.
Ensure Active Directory is the identity provider for QuickSight and associate your AD groups with Amazon QuickSight. When you subscribe to Amazon QuickSight Enterprise edition and choose Active Directory as your identity provider, you can associate your AD groups with Amazon QuickSight. You can also add or change your AD groups later on. Using Active Directory with Amazon QuickSight Enterprise Edition (https://docs.aws.amazon.com/quicksight/latest/user/aws-directory-service.html)
As Herbert's Hyper Hot Chillies has expanded their hot pepper and spice sales to the global market, they've accumulated a significant number of S3-backed data lakes in multiple AWS accounts across multiple regions. They would like to produce some Business Intelligence visualizations that combine data from all of these sources. How can they do this with minimal cost and development effort? Utilize Glue to ETL the data into JSON format, and load it into an Elasticsearch index. Utilize Kibana to create visualizations of the data. Utilize Glue to create a catalog of all involved data, and use the catalog to inform Hive tables in Elastic MapReduce (EMR). Then, utilize EMR as a datasource for QuickSight. Ensure that all data sources are configured with the appropriate permissions to provide QuickSight access. Configure QuickSight to access the S3 data in the various regions and accounts. Write a custom visualization frontend with the D3 framework, and back this frontend with a custom API that accesses each data lake individually to aggregate the data before visualization.
Ensure that all data sources are configured with the appropriate permissions to provide QuickSight access. Configure QuickSight to access the S3 data in the various regions and accounts. Given the correct permissions, QuickSight can be utilized to aggregate data for creating visualizations.
You work for a large university developing a web application that allows students to upload various applications regarding their attendance. Many of these applications contain large files up to 3 GB in size. After each upload, the attached files need to be processed through an in-house developed OCR application that is hosted on a SageMaker endpoint. The application submissions happen unpredictably: some applications are sent every few hours and sometimes hundreds of applications are sent per minute. Which architecture best suits the workload and is cost efficient? First, use the AWS SDK to store the file on an EBS volume. Use a fleet of EC2 instances to read the attachments from the EBS volume, sending the attachment as input to invoke the SageMaker endpoint. First, use an SQS queue to process the file. Use a fleet of EC2 instances to poll the SQS queue, sending the attachment as input to invoke the SageMaker endpoint. First, use a multipart upload to deliver the attachments to S3. Use S3 event notifications to trigger a Lambda function, sending the attachment as input to invoke the SageMaker endpoint. First, use a Kinesis Data Firehose to deliver the attachment to S3. Use S3 event notifications to trigger a Lambda function, sending the attachment as input to invoke the SageMaker endpoint.
First, use a multipart upload to deliver the attachments to S3. Use S3 event notifications to trigger a Lambda function, sending the attachment as input to invoke the SageMaker endpoint. This architecture is best suited for the workload requirements and would be the most cost-efficient solution. Configuring Amazon S3 Event Notifications (https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html) Multipart Upload Overview (https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html) InvokeEndpoint (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html)
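A minimal sketch of the Lambda function behind the S3 event notification; the endpoint name is hypothetical, and whether the OCR endpoint expects the raw bytes or an S3 reference depends on how the container was written, so the object location is passed here.

    import json
    import boto3

    runtime = boto3.client("sagemaker-runtime")

    def handler(event, context):
        # Fired by s3:ObjectCreated:* once the multipart upload completes.
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            runtime.invoke_endpoint(
                EndpointName="ocr-endpoint",          # hypothetical endpoint name
                ContentType="application/json",
                Body=json.dumps({"s3_uri": f"s3://{bucket}/{key}"}),
            )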
You work with a team of data scientists who use EMR clusters to analyze large datasets using Presto jobs and YARN jobs. Many of your team members forget to terminate EMR clusters when they are finished with their workload. You have been notified by the finance team that the cost of EMR is often exceeding the monthly budget, and have been tasked with automating a solution to terminate idle running EMR clusters. Which of the following solutions meets these requirements, for both clusters running Presto jobs and clusters running YARN jobs? For the Presto jobs, create a bash script that is installed directly onto the master node of the EMR cluster that runs every 5 minutes with a cron job. Implement the script to terminate the EMR cluster after 8 hours each day. For the YARN jobs, create a CloudWatch alarm for the IsIdle metric from the EMR cluster that sends a message to an SNS topic. Create a Lambda function that is subscribed to the topic to terminate the EMR cluster. For the Presto jobs, create a bash script that is installed directly onto the master node of the EMR cluster that runs every 5 minutes with a cron job. The script monitors the clusters and sends a CUSTOM metric EMR-INUSE (0=inactive; 1=active) to CloudWatch every 5 minutes. If CloudWatch receives 0 (inactive), send a message to an SNS topic. Create a Lambda function that is subscribed to the topic to terminate the EMR cluster. Create tags for the EMR clusters that are running Presto jobs and YARN jobs separately. Use AWS Systems Manager to continuously monitor the EMR clusters by tags and check for idle clusters. If the clusters are idle, issue an aws emr terminate-clusters command on all of the clusters. Create tags for the EMR clusters that are running Presto jobs and YARN jobs separately. Use CloudWatch alarms to monitor the billing amount for each tag that is set to the monthly billing amount. When the alarm exceeds the monthly billing amount, send a message to an SNS topic. Create a Lambda function that is subscribed to the topic to terminate the EMR cluster.
For the YARN jobs, create a CloudWatch alarm for the IsIdle metric from the EMR cluster that sends a message to an SNS topic. Create a Lambda function that is subscribed to the topic to terminate the EMR cluster. The EMR native IsIdle Amazon CloudWatch metric determines the idleness of the cluster by checking whether there's a YARN job running. Optimize Amazon EMR Costs with Idle Checks and Automatic Resource Termination Using Advanced Amazon CloudWatch Metrics and AWS Lambda (https://aws.amazon.com/blogs/big-data/optimize-amazon-emr-costs-with-idle-checks-and-automatic-resource-termination-using-advanced-amazon-cloudwatch-metrics-and-aws-lambda/) For the Presto jobs, create a bash script that is installed directly onto the master node of the EMR cluster that runs every 5 minutes with a cron job. The script monitors the clusters and sends a CUSTOM metric EMR-INUSE (0=inactive; 1=active) to CloudWatch every 5 minutes. If CloudWatch receives 0 (inactive), send a message to an SNS topic. Create a Lambda function that is subscribed to the topic to terminate the EMR cluster. The Amazon EMR native IsIdle Amazon CloudWatch metric determines the idleness of the cluster by checking whether there's a YARN job running. However, you should consider additional metrics, such as SSH users connected or Presto jobs running, to determine whether the cluster is idle. Also, when you execute any Spark jobs in Apache Zeppelin, the IsIdle metric remains active (1) for long hours, even after the job is finished executing. In such cases, the IsIdle metric is not ideal in deciding the inactivity of a cluster. Optimize Amazon EMR Costs with Idle Checks and Automatic Resource Termination Using Advanced Amazon CloudWatch Metrics and AWS Lambda (https://aws.amazon.com/blogs/big-data/optimize-amazon-emr-costs-with-idle-checks-and-automatic-resource-termination-using-advanced-amazon-cloudwatch-metrics-and-aws-lambda/)
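For the YARN-side check, the alarm could be created like this; the cluster ID and SNS topic ARN are placeholders, and the thresholds are an assumption about how long a cluster may sit idle.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Fires after the cluster reports IsIdle = 1 (no running YARN jobs) for six
    # consecutive 5-minute periods, notifying the topic that triggers the
    # terminating Lambda function.
    cloudwatch.put_metric_alarm(
        AlarmName="emr-idle-terminate",
        Namespace="AWS/ElasticMapReduce",
        MetricName="IsIdle",
        Dimensions=[{"Name": "JobFlowId", "Value": "j-EXAMPLE123456"}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=6,
        Threshold=1.0,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:emr-idle-alerts"],
    )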
You were recently hired by a company that has been using a Redshift data warehouse for many years. They had been noticing some slowness when running queries against certain database tables with high traffic, likely due to small regions of unsorted rows. You have been tasked with analyzing the data to determine which tables require better sorting and clean-up, and communicating those changes to the Engineering team lead. Which solution would you propose that requires the least development effort and lowest cost for this task? Inspect the SVV_TABLE_INFO table's unsorted_rows and vacuum_sort_benefit to determine the number of unsorted rows and performance benefit from sorting them. Perform a deep copy of the tables in question to recreate and re-sort the tables automatically. No additional action is required — the automatically scheduled vacuuming is ideal for recreating and sorting tables for efficiency in all cases. Inspect the OPTIMIZE_TABLE table's sorted_row and vacuum_select_sort fields to determine if you need to run manual VACUUM DELETE for cleanup.
Inspect the SVV_TABLE_INFO table's unsorted_rows and vacuum_sort_benefit to determine the number of unsorted rows and performance benefit from sorting them. The SVV_TABLE_INFO table can be really helpful if you want to take a more detailed look into optimizing your Redshift database for better sorting. Although VACUUM DELETE is run every now and then, you might still want to look at specific tables and see if further VACUUM action can give a table better performance at a more frequent interval.
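A sketch of the inspection query via the Redshift Data API; cluster, database, and user names are placeholders. In SVV_TABLE_INFO the relevant columns are unsorted (the percentage of unsorted rows) and vacuum_sort_benefit (the estimated query improvement from sorting).

    import boto3

    redshift_data = boto3.client("redshift-data")

    redshift_data.execute_statement(
        ClusterIdentifier="legacy-dw",
        Database="analytics",
        DbUser="analyst",
        Sql="""
            SELECT "table", unsorted, vacuum_sort_benefit
            FROM svv_table_info
            ORDER BY unsorted DESC NULLS LAST;
        """,
    )

Tables with a high unsorted percentage and a meaningful vacuum_sort_benefit are the candidates to report to the Engineering team lead for a targeted VACUUM.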
You work for a social media platform that automates the streaming of view-length and time-on-post information. The information streamed is typically small in size (256 bytes) and is sent at a rapid rate (2,500 records per second) to your application, using the Kinesis Producer Library (KPL) to stream data into a Kinesis Data Streams stream. Currently, after the data is streamed from the KPL to the Kinesis Data Streams stream, it is then ingested by Kinesis Data Firehose, transformed and enriched by a Lambda function, and output into an S3 bucket. The data collection process is being constrained, resulting in low throughput. What is causing the throughput to be low and what can be done to alleviate it? S3 has a limit of 2,000 PUT requests per second. Enable S3 Transfer Acceleration on the S3 bucket where the transformed output data is being stored. This will allow more PUT requests from the Kinesis Data Firehose delivery stream. Kinesis Data Streams has a limit of 1,000 records per second or 1 MB throughput. The amount of data being sent to the Kinesis Data Streams shard is greater than 1 MB per second. To solve this, compress the data using an optimized compression algorithm before the data is sent via the KPL to the Kinesis Data Streams shard. The Lambda function being used for transformation is limited by a 15 minute execution runtime. Increase the buffer size and buffer interval in Kinesis Data Firehose so more records are batched together before being sent to Lambda for transformation. Kinesis Data Streams has a limit of 1,000 records per second or 1 MB throughput. By taking advantage of KPL aggregation, you can aggregate the 2,500 records into 15 Kinesis Data Stream records, which in turn brings the records per second to 15, where each record contains 42 KB of data.
Kinesis Data Streams has a limit of 1,000 records per second or 1 MB throughput. By taking advantage of KPL aggregation, you can aggregate the 2,500 records into 15 Kinesis Data Stream records, which in turn brings the records per second to 15, where each record contains 42 KB of data. Using KPL aggregation is the best solution. Aggregation refers to the storage of multiple records in a Kinesis Data Streams record. Aggregation allows customers to increase the number of records sent per API call, which effectively increases producer throughput. Kinesis Data Streams shards support up to 1,000 Kinesis Data Streams records per second, or 1 MB throughput. The Kinesis Data Streams records per second limit binds customers with records smaller than 1 KB. Record aggregation allows customers to combine multiple records into a single Kinesis Data Streams record. This allows customers to improve their per shard throughput. KPL Key Concepts - Aggregation (https://docs.aws.amazon.com/streams/latest/dev/kinesis-kpl-concepts.html#kinesis-kpl-concepts-aggretation)
You are creating an EMR cluster which will handle highly sensitive data. The Chief of Security has mandated that the EMR cluster must not be accessible from the public internet, and subnets must be configured with maximum network security. Which of the following options will best meet this requirement? Launch the Amazon EMR cluster in a private subnet, use VPC endpoints to access services within AWS, and use a NAT gateway to access resources that can only be accessed using the internet. Launch the Amazon EMR cluster in a public subnet with no public IP space or internet gateway. Use a NAT gateway to access services within AWS and resources that can only be accessed using the internet. Launch the Amazon EMR cluster in the default subnet, update the routing table to remove the route to the internet gateway, and use VPC endpoints to access services within AWS and resources that can only be accessed using the internet. Launch the Amazon EMR cluster in a private subnet, and use a NAT gateway to access services within AWS and a VPC endpoint to access resources that can only be accessed using the internet.
Launch the Amazon EMR cluster in a private subnet, use VPC endpoints to access services within AWS, and use a NAT gateway to access resources that can only be accessed using the internet. This is the correct answer because the EMR cluster will not be exposed to the internet, any traffic to the VPC endpoints will remain within Amazon's network, and the use of a NAT gateway is the most secure way to access internet-based resources because it does not allow ingress connections or incoming connections from external networks. Securely Access Web Interfaces on Amazon EMR Launched in a Private Subnet (https://aws.amazon.com/blogs/big-data/securely-access-web-interfaces-on-amazon-emr-launched-in-a-private-subnet/) VPC Endpoints (https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints.html)
You work for a coffee company which has thousands of branches all over the country. The sales system generates logs regarding transactions. The logs are aggregated and uploaded to an S3 bucket 'transaction-logs', which has a subfolder for logs for each item like those shown below: transaction-logs/dt=11-22-2019-0700/Hot-Drinks/ transaction-logs/dt=11-22-2019-0800/Cold-Drinks/ transaction-logs/dt=11-22-2019-0900/Edibles-Sweet/ transaction-logs/dt=11-22-2019-1000/Edibles-Salty/ Some store locations are open from 8 AM to 5 PM, but there are many 24-hour locations as well, which means there are millions of transactions being reported per hour. Consequently, to parse and analyze the data, an Elastic MapReduce (EMR) cluster is used to process and upload data to a Redshift data warehouse. What changes should you make to the S3 bucket for better read performance without altering the current architecture? Set up an EC2 Auto Scaling group to issue multiple parallel connections to S3 for better concurrent reads. Use the COPY command within Redshift to directly pull S3 object data. Modify the S3 prefix to better spread out the read requests from EMR and utilize the read request performance for each unique prefix. Use the S3 Select feature to read the required objects and stream them directly into Redshift.
Modify the S3 prefix to better spread out the read requests from EMR and utilize the read request performance for each unique prefix. S3 is a massively distributed and scalable service and allows read throughput per S3 prefix, which means that a new unique S3 key prefix will offer better read performance. The S3 key could be named this way to create a new prefix for each flavor category and offer separate read performance for that prefix. In this scenario, for example, we could aggregate logs by hour and set the date, time and hour as a unique prefix. transaction-logs/dt=2020-11-22-0800/item-drinks/hot/mocha transaction-logs/dt=2020-11-22-0900/item-drinks/cold/iced_coffee transaction-logs/dt=2020-11-22-1000/item-edibles/sweet/donut transaction-logs/dt=2020-11-22-1100/item-edibles/salty/egg_roll
You work for a large computer hardware organization that has many different IT stores across the world. The computer parts, order details, shipping details, customer, and sales person information is stored in a data lake in S3. You have been tasked with developing a visualization to show the amount of hardware that was shipped out by various stores and the sales person who sold the hardware. You have a requirement that the visualization must be able to apply statistical functions, as well as cluster columns and rows to show values for subcategories grouped by related dimension. Which type of visualization would meet these requirements? Combo chart Heat map Tree map Pivot table
Pivot table A pivot table would be the best choice for visualizing this data. With a pivot table you can: Specify multiple measures to populate the cell values of the table, so that you can see a range of data Cluster pivot table columns and rows to show values for subcategories grouped by related dimension Change row sort order Apply statistical functions Add totals and subtotals to rows and columns Use infinite scroll Transpose fields used by rows and columns Pivot Table Features (https://docs.aws.amazon.com/quicksight/latest/user/pivot-table.html#pivot-table-features)
What's That Thing is a medical reference company. They've been provided with a huge store of medical data, but to be able to utilize the images and associated data, it needs to be anonymized and trimmed to remove any non-pertinent information from the records. They would like to accomplish this with minimal development effort. What is the best workflow to accomplish this? Load each record into an SQS queue. Create an SQS client Lambda function to process each record, write code to clean and filter the data to remove personally identifiable information, and add an anonymized identifier to associate data with image files. Process the stream into a Kinesis Firehose delivery stream. Utilize a Kinesis Analytics Application to clean the data of extraneous data and any personally identifiable information, and add unique identifiers to connect data with images. Manually edit each data record and image to remove any personally identifiable information and add anonymized identifiers to enable connecting data and images. Write a Lambda function to ingest the data, perform data filtering to remove any personally identifiable information, and add anonymized identifiers to connect data and images.
Process the stream into a Kinesis Firehose delivery stream. Utilize a Kinesis Analytics Application to clean the data of extraneous fields and any personally identifiable information, and add unique identifiers to connect data with images. This is the best option, as you can easily filter, augment, and enhance data on the fly with Kinesis Analytics Applications.
You work as a data engineer who builds data processing solutions for your customers. You have been tasked with designing an EMR solution that will process a large amount of data with little to no time constraint. It's important that the data processing solution be as cost effective as possible. Due to the size of the data, you know that the EMR MapReduce job is going to require 20 mappers to process the input data. Which of the following configurations for your EMR cluster would help you achieve this? Use 10 core nodes, where each node can process 3 mappers in parallel. Use 10 core nodes, where each node can process 2 mappers in parallel. Run all the mappers in parallel. Run 10 mappers first, while the remaining 10 mappers stay in queue. Once Hadoop has processed the first 10 mappers, the remaining 10 mappers run. Use 5 core nodes, where each node can process 2 mappers in parallel.
Run 10 mappers first, while the remaining 10 mappers stay in queue. Once Hadoop has processed the first 10 mappers, the remaining 10 mappers run. This option will help you save on cost, because you will only have to use 5 nodes (as compared to 10 nodes), while still processing all 20 mappers. Best Practices for Amazon EMR (https://d0.awsstatic.com/whitepapers/aws-amazon-emr-best-practices.pdf) Use 5 core nodes, where each node can process 2 mappers in parallel. Using this option, in conjunction with running 10 mappers first and the remaining 10 mappers afterward, will provide you with the most cost savings. Best Practices for Amazon EMR (https://d0.awsstatic.com/whitepapers/aws-amazon-emr-best-practices.pdf)
You are building an EMR cluster and have been asked to enable encryption at rest for EMRFS data. Which of the following encryption methods can you use? Open-source HDFS Encryption SSE-C SSE-KMS SSE-S3
SSE-KMS EMRFS is an implementation of HDFS that all Amazon EMR clusters use for reading and writing regular files from Amazon EMR directly to Amazon S3. SSE-KMS is a supported encryption method for EMRFS. With SSE-KMS, you use an AWS KMS customer master key (CMK) set up with policies suitable for Amazon EMR. EMR Encryption Options (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-data-encryption-options.html) SSE-S3 EMRFS is an implementation of HDFS that all Amazon EMR clusters use for reading and writing regular files from Amazon EMR directly to Amazon S3. SSE-S3 is a supported encryption method for EMRFS. With SSE-S3, Amazon S3 manages the encryption keys for you. EMR Encryption Options (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-data-encryption-options.html)
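A hedged sketch of how such a cluster might be configured, assuming an EMR security configuration that enables SSE-KMS for EMRFS data at rest; the configuration name and KMS key ARN are placeholders:

```python
# Sketch: create an EMR security configuration enabling at-rest encryption
# for EMRFS data in S3 using SSE-KMS. Names and ARNs are illustrative.
import json
import boto3

emr = boto3.client("emr")

security_config = {
    "EncryptionConfiguration": {
        "EnableInTransitEncryption": False,
        "EnableAtRestEncryption": True,
        "AtRestEncryptionConfiguration": {
            "S3EncryptionConfiguration": {
                "EncryptionMode": "SSE-KMS",
                "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
            }
        },
    }
}

emr.create_security_configuration(
    Name="emrfs-sse-kms",
    SecurityConfiguration=json.dumps(security_config),
)
# Reference the configuration by name (SecurityConfiguration="emrfs-sse-kms")
# when launching the cluster with run_job_flow.
```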
Safety Pattern is a company that specializes in recognizing patterns in cloud-based architectures that indicate unusual behavior. They would like to launch a data access application that detects specific patterns in data storage services. They're experiencing difficulties because of the volume of data that needs to be processed to make real-time alerting functional. Which of the following pipelines would be the best option to accomplish this goal? Create an S3 bucket, and send event objects to the bucket. Create a Lambda function that evaluates each object for alert-worthy states. On a state match, send a message to an SNS topic configured to alert the appropriate subscribers. Send access events to a Kinesis Data Stream. Create a Kinesis Data Analytics application that utilizes a Flink application with the alert-worthy event patterns. Send matches to a Lambda function that sends a message to an SNS topic configured to alert the appropriate subscribers. Create a custom EMR application that accepts events and publishes alerts to an SNS topic configured to alert the appropriate subscribers. Create a custom application and Docker image. Run the Docker image in ECS with an Application Load Balancer configured to send events to the ECS containers. Configure the application to send messages to an SNS topic when there is a state match.
Send access events to a Kinesis Data Stream. Create a Kinesis Data Analytics application that utilizes a Flink application with the alert-worthy event patterns. Send matches to a Lambda function that sends a message to an SNS topic configured to alert the appropriate subscribers. This is the best available solution. Kinesis can be scaled to handle a huge amount of data input, and Flink enables efficiently managing and flexibly filtering that data based on various states.
You work as a data engineer for an organization responsible for tracking the spending on company issued tablets and the apps that are purchased. The tablet owners' spending on app usage is logged to a Kinesis Data Firehose, where the data is then delivered to S3 and copied onto Redshift every 15 minutes. Your job is to set up a billing alert system to notify tablet owners when they have spent too much on apps within 10 minutes. Currently, a DynamoDB table contains the cumulative app spending total, as well as the threshold amount. If the cumulative total surpasses the threshold amount, a notification must be sent out to the tablet owner. What is a solution that will allow for timely notifications to be sent to tablet owners when the spending threshold is surpassed? Set up a Kinesis Data Analytics application to total the app spending for each tablet owner over a 5-minute window. At the end of the window, send the totals to a Kinesis Data Streams stream which is the event source for a Lambda function. The Lambda function queries the DynamoDB table to retrieve the threshold and current total spent by each tablet owner. If the threshold has been exceeded, send a Simple Notification Service (SNS) to notify the tablet owner. Set up a Lambda function to trigger every 10 minutes and use S3 Select to query each of the objects for the data that is already in S3 to calculate a cumulative total. The Lambda function queries the DynamoDB table to retrieve the threshold and current total spent by each tablet owner. If the threshold has been exceeded, send a Simple Notification Service (SNS) to notify the tablet owner. Set up a Lambda function to trigger every 10 minutes and use Redshift Spectrum to query the tables and data that are in Redshift to calculate a cumulative total. The Lambda function queries the DynamoDB table to retrieve the threshold and current total spent by each tablet owner. If the threshold has been exceeded, send a Simple Notification Service (SNS) to notify the tablet owner. Set up a second Kinesis Data Firehose delivery stream as a connector to the original delivery stream and set the batch interval to 600 seconds. Use the newly created delivery stream as the event source for a Lambda function. The Lambda function queries the DynamoDB table to retrieve the threshold and current total spent by each tablet owner. If the threshold has been exceeded, send a Simple Notification Service (SNS) to notify the tablet owner.
Set up a Kinesis Data Analytics application to total the app spending for each tablet owner over a 5-minute window. At the end of the window, send the totals to a Kinesis Data Streams stream which is the event source for a Lambda function. The Lambda function queries the DynamoDB table to retrieve the threshold and current total spent by each tablet owner. If the threshold has been exceeded, send a Simple Notification Service (SNS) to notify the tablet owner. Since the data is already being streamed through Kinesis Data Firehose, Kinesis Data Analytics lets you apply SQL aggregation to the stream and calculate the spending in a timely manner. Streaming Data Solutions on AWS with Amazon Kinesis (https://d0.awsstatic.com/whitepapers/whitepaper-streaming-data-solutions-on-aws-with-amazon-kinesis.pdf)
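A minimal sketch of the Lambda consumer described above, assuming the window totals arrive as JSON records on the Kinesis stream and that the table, topic, and attribute names shown are placeholders:

```python
# Sketch: Kinesis-triggered Lambda that compares each owner's spending
# against the threshold in DynamoDB and publishes an SNS alert if exceeded.
import base64
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("AppSpending")          # hypothetical table name
sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:111122223333:spending-alerts"  # placeholder

def handler(event, context):
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        owner_id = payload["owner_id"]
        window_total = float(payload["window_total"])

        item = table.get_item(Key={"owner_id": owner_id})["Item"]
        cumulative = float(item["cumulative_total"]) + window_total
        threshold = float(item["threshold"])

        if cumulative > threshold:
            sns.publish(
                TopicArn=TOPIC_ARN,
                Subject="App spending threshold exceeded",
                Message=f"Owner {owner_id} has spent {cumulative:.2f}",
            )
```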
You work as a data engineer for a large health agency that runs data analytics on world health data. Currently, there are large datasets of world health data in S3 that are not accessible over the internet. You have been tasked with setting up a QuickSight account that will enable you to build dashboards from the data in S3 without moving the data over the public internet. Which of these methods meets these requirements? Create a new data source and select the S3 bucket with the world health data. Setup a QuickSight VPC connection and a VPC endpoint for S3 to allow QuickSight private access to S3 world health data. Create a VPC endpoint for S3 to allow QuickSight private access to the S3 world health data. Download the S3 data using FTP and upload the S3 data into the QuickSight SPICE.
Setup a QuickSight VPC connection and a VPC endpoint for S3 to allow QuickSight private access to S3 world health data. A VPC endpoint for S3 provides the private access to the world health data, and a QuickSight VPC connection is also needed in the same VPC where the VPC endpoint resides. Configuring the VPC Connection in the QuickSight Console (https://docs.aws.amazon.com/quicksight/latest/user/vpc-creating-a-connection-in-quicksight.html)
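A hedged sketch of the endpoint half of this setup: creating a gateway VPC endpoint for S3 in the same VPC that the QuickSight VPC connection uses. The VPC ID, route table ID, and region are placeholders:

```python
# Sketch: gateway VPC endpoint for S3 so traffic from the QuickSight VPC
# connection to the bucket stays on the AWS network. IDs are illustrative.
import boto3

ec2 = boto3.client("ec2")

ec2.create_vpc_endpoint(
    VpcId="vpc-0abc1234",                      # VPC used by the QuickSight VPC connection
    ServiceName="com.amazonaws.us-east-1.s3",  # region-specific S3 service name
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0def5678"],
)
```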
You are a data scientist working for an outdoor expedition company that specializes in building GPS tracking devices for outdoor gear retail stores. You have been tasked with building out a data collection architecture that will gather all of the waypoints, moving speed, temperature, etc., and load them into a data lake. You have already set up a Kinesis Data Streams stream with a Kinesis Producer Library (KPL) application to deliver the data to the shards within the stream. Your next task is to build a Kinesis Client Library (KCL) application to deliver the data to the S3 data lake. Before you begin, another team member points out that there is a requirement that the KCL application must implement handling for any failures that occur while it is retrieving records from the stream. What should your architecture consist of to ensure this requirement is met? Since KCL takes care of tracking by passing a checkpointer when processing records, if the worker fails, the shutdown method is called. The shutdown method includes the checkpointer which is used in subsequent processRecords() method calls to retrieve failed record reads. Since KCL takes care of tracking by passing a checkpointer when processing records, if the worker fails, an exception with the checkpoint information will be thrown - allowing your KCL application to retry the record from the last known processed record. Since KCL takes care of tracking by passing a checkpointer when processing records, if the worker fails, the KCL will use this information to restart the processing of the shard at the last known processed record. Whenever KCL encounters a failed record read from a shard, the processRecords() method will throw an exception with the sequence number and partition key for the failed record in the shard, allowing your KCL application to retry the record from the last known processed record.
Since KCL takes care of tracking by passing a checkpointer when processing records, if the worker fails, the KCL will use this information to restart the processing of the shard at the last known processed record. Kinesis Data Streams requires the record processor to keep track of the records that have already been processed in a shard. The KCL takes care of this tracking for you by passing a checkpointer (IRecordProcessorCheckpointer) to processRecords(). The record processor calls the checkpoint method on this interface to inform the KCL of how far it has progressed in processing the records in the shard. If the worker fails, the KCL uses this information to restart the processing of the shard at the last known processed record. Developing a Kinesis Client Library Consumer in Java (https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-implementation-app-java.html)
Your company is looking to reduce the cost of their Business Intelligence applications. Currently, all data is stored in a Redshift cluster, which has grown exponentially with the increase in sales. Additionally, the bespoke visualizations for quarterly reports are incredibly cumbersome to generate by hand. What steps can be taken to reduce the cost of the business intelligence workflow, while keeping all data available for generating reports from time to time? Store data no longer being actively utilized in an S3 bucket using the Standard Infrequent Access storage class. Create a Redshift Spectrum table to access this data and join it with warm data in the Redshift cluster for reporting. Utilize QuickSight to create the appropriate charts and graphs to accompany the BI reports. Move the cold data to S3 and use the S3 Select API call to query the cold data, then join the data with Redshift query results in a custom application layer. Store data no longer being actively utilized in an S3 bucket using the Standard storage class. Create a Redshift Spectrum table to access this data and join it with warm data in the Redshift cluster for reporting. Utilize QuickSight to create the appropriate charts and graphs to accompany the BI reports. Export all data to S3 with the Redshift UNLOAD command, configure an Athena table, and update/rebuild the application layer to query Athena instead of Redshift.
Store data no longer being actively utilized in an S3 bucket using the Standard Infrequent Access storage class. Create a Redshift Spectrum table to access this data and join it with warm data in the Redshift cluster for reporting. Utilize QuickSight to create the appropriate charts and graphs to accompany the BI reports. By leveraging the Standard Infrequent Access storage class, data that is not accessed frequently will cost less to store. Redshift Spectrum will keep the cold data available for analytics purposes.
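A hedged sketch of the Redshift Spectrum setup, issued through the Redshift Data API; the schema, table, bucket, role, and cluster names are all assumptions:

```python
# Sketch: define an external schema and table so Redshift Spectrum can
# query the cold data left in S3 and join it with warm cluster data.
import boto3

rsd = boto3.client("redshift-data")

ddl = [
    """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG DATABASE 'cold_sales'
    IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftSpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS
    """,
    """
    CREATE EXTERNAL TABLE spectrum.sales_archive (
        sale_id BIGINT,
        sale_ts TIMESTAMP,
        amount  DECIMAL(10,2)
    )
    STORED AS PARQUET
    LOCATION 's3://bi-cold-data/sales_archive/'
    """,
]

for stmt in ddl:
    rsd.execute_statement(
        ClusterIdentifier="bi-cluster",
        Database="analytics",
        DbUser="admin",
        Sql=stmt,
    )
```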
You are a Data Analyst at a retail bank. You are working on a project to encrypt all Personally Identifiable Information (PII) that is generated by customer credit card applications. This data is generated in the form of a JSON document each time a customer applies for a credit card. For each successful application, the data must be encrypted, and you also need to be alerted of any attempted access by unauthorized individuals. Which of the following is the best solution for storing and protecting this data? Use an encrypted DynamoDB table to store the customer data and use Amazon Macie to scan the data against compliance rules. Use Amazon CloudWatch Events to trigger alerts. Use S3 with encryption enabled to store JSON files, use AWS Lambda to scan the data to detect PII, and use SNS to alert for unauthorized access. Store the customer data in an S3 bucket with encryption enabled. Use Macie to scan the Amazon S3 bucket to identify PII. Configure CloudWatch to alert for unauthorized access events in CloudTrail. Store the customer data in an encrypted DynamoDB table, use Lambda to scan the data to detect PII and use CloudWatch Events to alert for unauthorized access.
Store the customer data in an S3 bucket with encryption enabled. Use Macie to scan the Amazon S3 bucket to identify PII. Configure CloudWatch to alert for unauthorized access events in CloudTrail. Amazon Macie uses machine learning and pattern matching to discover sensitive data at scale, including Personally Identifiable Information such as names, addresses, and credit card numbers. It also gives you constant visibility of the data security and data privacy of your data stored in Amazon S3. A CloudWatch alarm can be configured to alert when an unauthorized API call is made, based on CloudTrail logs. Amazon Macie FAQs (https://aws.amazon.com/macie/faq/) Creating CloudWatch Alarms for CloudTrail Events (https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudwatch-alarms-for-cloudtrail.html#cloudwatch-alarms-for-cloudtrail-authorization-failures)
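A hedged sketch of the alerting half of this answer, assuming CloudTrail already delivers to a CloudWatch Logs group; the log group name, namespace, and SNS topic ARN are placeholders:

```python
# Sketch: turn CloudTrail "unauthorized" errors into a CloudWatch alarm
# by way of a metric filter on the CloudTrail log group.
import boto3

logs = boto3.client("logs")
cw = boto3.client("cloudwatch")

logs.put_metric_filter(
    logGroupName="CloudTrail/DefaultLogGroup",
    filterName="UnauthorizedAPICalls",
    filterPattern='{ ($.errorCode = "*UnauthorizedOperation") || ($.errorCode = "AccessDenied*") }',
    metricTransformations=[{
        "metricName": "UnauthorizedAPICalls",
        "metricNamespace": "CloudTrailMetrics",
        "metricValue": "1",
    }],
)

cw.put_metric_alarm(
    AlarmName="unauthorized-api-calls",
    MetricName="UnauthorizedAPICalls",
    Namespace="CloudTrailMetrics",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:security-alerts"],
)
```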
You are working as a data engineer within a financial institution. You're required to move a large amount of data, gathered from various datasets in S3, to a Redshift cluster. You've attached the appropriate IAM role to your cluster and have issued a COPY command to move data from the S3 bucket into your Redshift database. After a while, you check and notice the data has not been populated in Redshift. Which of the following errors could be causing the issue with your data population? You are not connecting to your Redshift cluster as the default "ec2-user" database user when running the COPY command. The Redshift cluster is in maintenance mode and therefore buffering all queries for whenever it gets back to "Available" state. The default Security Group attached to your Redshift cluster does not allow outbound traffic through the Redshift cluster's VPC. The Redshift cluster does not have permissions to access the S3 files. The COPY command is not committing data into the Redshift cluster.
The Redshift cluster does not have permissions to access the S3 files. When using the COPY command to move data from an S3 bucket, two things are required: an IAM role with permission to access the S3 resources, and either auto-commit enabled or an explicit COMMIT at the end of your COPY transaction to save the changes uploaded from S3. The COPY command is not committing data into the Redshift cluster. When using the COPY command to move data from an S3 bucket, two things are required: an IAM role with permission to access the S3 resources, and either auto-commit enabled or an explicit COMMIT at the end of your COPY transaction to save the changes uploaded from S3.
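A minimal sketch showing both fixes together, using the redshift_connector driver (which does not auto-commit by default); connection details, schema, and role ARN are assumptions:

```python
# Sketch: COPY using the cluster's attached IAM role, followed by an
# explicit COMMIT so the loaded rows become visible.
import redshift_connector

conn = redshift_connector.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    database="analytics",
    user="admin",
    password="...",
)
cur = conn.cursor()

cur.execute("""
    COPY finance.transactions
    FROM 's3://finance-datasets/transactions/'
    IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftCopyRole'
    FORMAT AS PARQUET
""")

conn.commit()   # without auto-commit, the COPY is not persisted until committed
conn.close()
```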
You work as a data engineer for an HVAC and refrigerant recycling company that uses near real-time IoT devices to stream data about air conditioning equipment to a centralized data repository for data analytical purposes and monitoring metrics. To capture this data, you have created a Kinesis Data Firehose delivery stream to collect the data and store the data in DynamoDB, which will be accessible through HTTP endpoints via API Gateway. The data is loaded to the DynamoDB table through a synchronous Lambda function before the raw data is loaded into S3. After launching the beta version of the application, the Lambda function attempts to ingest the buffered Kinesis Data Firehose records three times before skipping the batch of records. What could be the cause of the skipped batch records and how can the issue be resolved? The buffer size for the Kinesis Data Firehose is set to 8 MB, which is too high. This is causing the Lambda function to fail due to an invocation limit error. Lower the buffer size on the Kinesis Data Firehose delivery stream. The buffer interval for the Kinesis Data Firehose is set to 60 seconds, which is too high. This is causing the Lambda function to fail due to an invocation limit error. Lower the buffer interval on the Kinesis Data Firehose delivery stream. The buffer size for the Kinesis Data Firehose is set to 1 MB, which is too high. This is causing the Lambda function to fail due to an invocation limit error. Lower the buffer size on the Kinesis Data Firehose delivery stream. The buffer interval for the Kinesis Data Firehose is set to 900 seconds, which is too high. This is causing the Lambda function to fail due to a function timeout error. Lower the buffer interval on the Kinesis Data Firehose delivery stream.
The buffer size for the Kinesis Data Firehose is set to 8 MB, which is too high. This is causing the Lambda function to fail due to an invocation limit error. Lower the buffer size on the Kinesis Data Firehose delivery stream. The buffer size for Kinesis Data Firehose can be set between 1 MB - 128 MB when delivering data to S3. However, when invoking a Lambda function synchronously, the request payload limit is 6 MB. The 8 MB buffer size exceeds the request limit for Lambda. Amazon Kinesis Data Firehose FAQs - Data Delivery Lambda Quotas (https://aws.amazon.com/kinesis/data-firehose/faqs/#:~:text=Amazon%20Kinesis%20Data%20Firehose%20buffers%20incoming%20data%20before%20delivering%20it,data%20delivery%20to%20Amazon%20S3) Amazon Kinesis Data Firehose Data Transformation (https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html)
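A hedged sketch of a delivery stream whose buffering hints keep each batch comfortably under the 6 MB synchronous Lambda limit; all names and ARNs are placeholders:

```python
# Sketch: Firehose delivery stream with small buffering hints and a Lambda
# processing configuration attached for the DynamoDB load step.
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="hvac-telemetry",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/FirehoseDeliveryRole",
        "BucketARN": "arn:aws:s3:::hvac-raw-data",
        "BufferingHints": {"SizeInMBs": 3, "IntervalInSeconds": 60},
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [{
                "Type": "Lambda",
                "Parameters": [{
                    "ParameterName": "LambdaArn",
                    "ParameterValue": "arn:aws:lambda:us-east-1:111122223333:function:load-to-dynamodb",
                }],
            }],
        },
    },
)
```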
You are part of a team of engineers building an attendance tracking system used to keep track of students in a university classroom. The students will be sent a unique QR code to their email address each day before a particular class starts. The QR code will then be scanned as the student enters the university classroom and they will be marked present for class. It is expected that the creation and scanning of the QR codes will happen at various times throughout the day, and high traffic spikes will happen regularly. It's also important that the data is highly durable and operates with low latency. What bundle of AWS services do you suggest using to meet all of the requirements to build the attendance tracking system? Use API Gateway as a REST API that receives QR code requests and responses. Trigger a Lambda function for each service the tracking system needs to implement. Use Neptune as the data storage system for student information and DynamoDB for QR code image URL and attendance tracking validations. Use API Gateway as a REST API that receives QR code requests and responses. Trigger a Lambda function for each service the tracking system needs to implement. Use DynamoDB as the data storage system for student information, QR code image URLs, and attendance tracking validations. Use API Gateway as a REST API that receives QR code requests and responses. Trigger a Lambda function for each service the tracking system needs to implement. Use Neptune as the data storage system for student information, QR code image URLs, and attendance tracking validations. Use an EC2 instance with a Spring API that receives QR code requests and responses. Use RDS as the data storage system for student information, QR code image URL, and attendance tracking validations.
Use API Gateway as a REST API that receives QR code requests and responses. Trigger a Lambda function for each service the tracking system needs to implement. Use DynamoDB as the data storage system for student information, QR code image URLs, and attendance tracking validations. API Gateway is used as a REST API and uses Lambda to implement the functionality. Using DynamoDB for storage needs is a great solution, providing high durability and low latency for your application. Query Your AWS Database From Your Serverless Application (https://aws.amazon.com/blogs/database/query-your-aws-database-from-your-serverless-application/)
You work for a stock trading company that runs daily ad-hoc queries on data using Athena. There are multiple silos within the company using Athena to run trading queries specific to their team. The finance department has a requirement to enforce limits on the amount of money being spent by each team for the queries that they run in Athena. The security department has a requirement that all query results be encrypted. Which solution could be implemented that would meet both of these requirements? Use Athena Workgroups to assign a unique workgroup to each silo, tagging them appropriately. Configure the workgroup to encrypt the query results. Generate cost reports from the tags as well as resource-based policies that assign each workgroup to a silo. Use CloudWatch logs to determine the run time for scanned data for each query that is run by each silo, and trigger an alarm at a specified threshold. Create a Lambda function trigger to enforce cost control. Use CloudTrail logs to audit the silos and run times for scanned data for each query that is run. Use an S3 bucket with an SSE-S3 key and point the Athena query results to the S3 bucket. Create tags for each silo in the AWS Glue Data Catalog associated with the data that each silo is querying, generating a cost report for each tag. Use the AWS Glue security settings to ensure the metadata is encrypted.
Use Athena Workgroups to assign a unique workgroup to each silo, tagging them appropriately. Configure the workgroup to encrypt the query results. Generate cost reports from the tags as well as resource-based policies that assign each workgroup to a silo. By default, all Athena queries execute in the primary workgroup. As an administrator, you can create new workgroups to separate different types of workloads. Administrators commonly turn to workgroups to separate analysts running ad-hoc queries from automated reports. Separating Queries and Managing Costs Using Amazon Athena Workgroups (https://aws.amazon.com/blogs/big-data/separating-queries-and-managing-costs-using-amazon-athena-workgroups/)
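A hedged sketch of creating one such workgroup; the results bucket, tag values, and per-query scan cutoff are illustrative assumptions:

```python
# Sketch: one Athena workgroup per silo with encrypted results, enforced
# configuration, a per-query scan cap, and a cost-allocation tag.
import boto3

athena = boto3.client("athena")

athena.create_work_group(
    Name="trading-desk-a",
    Configuration={
        "ResultConfiguration": {
            "OutputLocation": "s3://athena-results-trading/desk-a/",
            "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
        },
        "EnforceWorkGroupConfiguration": True,       # members cannot override encryption
        "PublishCloudWatchMetricsEnabled": True,
        "BytesScannedCutoffPerQuery": 10 * 1024**3,  # cap each query at ~10 GB scanned
    },
    Tags=[{"Key": "silo", "Value": "trading-desk-a"}],
)
```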
You work for an organization that contracts with healthcare providers to provide data lakes in AWS for personal health records. The data stored in the data lake has both Personally Identifiable Information (PII) as well as Personal Health Information (PHI), so a Health Insurance Portability and Accountability Act (HIPAA) compliant data lake is a requirement. You are using an EMR cluster with EMRFS to read and write data to and from S3. HIPAA compliance requires that all data moving to and from S3 be encrypted. What needs to be done to ensure all data is encrypted moving to and from S3? Use SSE-KMS to encrypt data server-side. Use SSE-S3 to encrypt data server-side. Manually create PEM certificates, referenced in S3 to encrypt data in transit. Use CSE-KMS/CSE-C to encrypt data client-side.
Use CSE-KMS/CSE-C to encrypt data client-side. Amazon S3 encryption and decryption takes place client-side on your Amazon EMR cluster. You can use keys provided by AWS KMS (CSE-KMS) or use a custom Java class that provides the master key (CSE-C). Best Practices for Securing Amazon EMR (https://aws.amazon.com/blogs/big-data/best-practices-for-securing-amazon-emr/) How Amazon EMR uses AWS KMS (https://docs.aws.amazon.com/kms/latest/developerguide/services-emr.html)
You have a legacy Business Intelligence (BI) application running on servers located in your own Data Center. The BI application needs to access data stored in a Redshift cluster. Your CEO has requested that you make sure the network connection between your Data Center and AWS is private, dedicated, and has consistent network performance to prevent the BI application from timing out. Which of the following approaches do you recommend? Configure a dedicated NAT gateway to consistently route all network traffic between the Redshift cluster and your Data Center. Use a site-to-site VPN to provide a dedicated, consistent network connection to AWS. Use Direct Connect to provide a dedicated, consistent network connection to AWS. Configure VPC peering between your VPC and the Data Center to provide a dedicated, consistent connection.
Use Direct Connect to provide a dedicated, consistent network connection to AWS. AWS Direct Connect is a dedicated network connection from your premises to AWS. In many cases, Direct Connect can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections. AWS Direct Connect (https://aws.amazon.com/directconnect/)
You work as a data scientist for a food and beverage manufacturer that creates and distributes food all over the world. The data associated with the distribution of food is processed and analyzed using an EMR cluster, which also functions as the data lake. Due to a new law that has been passed by the food administration, there are new requirements that must be met around processing hot data for food distribution and processing cold data for nutritional values for the food. The hot data that is accessed must be presented in real time to the food administration, while the cold data is typically reviewed in weekly or monthly reports. The hot data that is reviewed in real time must perform fast and only needs to be stored temporarily. The cold data does not require reviewing in real time; however, the data must be persistent. Which of the following data processing configurations meets the business needs as well as usage pattern requirements? Use S3 block file system for hot data, and use S3 EMRFS for the cold data. Use HDFS for the hot data, and use S3 EMRFS for the cold data. Use HDFS for both the hot data and cold data. Use S3 EMRFS for the hot data, and use HDFS for the cold data.
Use HDFS for the hot data, and use S3 EMRFS for the cold data. HDFS is the cluster's distributed file system, and its storage is temporary, which suits fast, short-lived hot data. S3 EMRFS is HDFS integrated with S3, allowing the data to be stored on a distributed file system while remaining persistent. Work with Storage and File Systems (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html)
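A hedged PySpark sketch of this usage pattern: hot data read from the cluster's HDFS, cold results persisted through EMRFS by writing to an s3:// path. Paths and column names are assumptions:

```python
# Sketch: HDFS for fast, ephemeral hot data; s3:// (EMRFS) for persistent cold data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("food-distribution").getOrCreate()

hot = spark.read.parquet("hdfs:///hot/distribution/current/")   # fast, temporary
hot.createOrReplaceTempView("current_distribution")

nutrition = spark.sql("""
    SELECT item_id, region, avg(calories) AS avg_calories
    FROM current_distribution
    GROUP BY item_id, region
""")

# persistent copy for the weekly/monthly reports
nutrition.write.mode("append").parquet("s3://food-data-lake/nutrition/")
```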
You work as a data scientist for a team that is developing a REST API for internal applications. Currently, every API call that is made to your application is logged in JSON format to an HTTP endpoint. You have been tasked with streaming the JSON data into S3 in Parquet format. How can this be done with minimum development effort? Use Kinesis Data Firehose as a delivery stream. Enable record transformation that references a table stored in an Apache Hive metastore in EMR. Set up an EMR cluster that uses Apache Streaming to stream data onto the cluster. Create an Apache Spark job to convert the JSON to Parquet format using an Apache Hive metastore to determine the schema of the JSON data. Use Kinesis Data Stream to ingest the data and Kinesis Data Firehose as a delivery stream. Once data lands in S3, use AWS Glue to transform the data through a Glue job. Use Kinesis Data Firehose as the delivery stream. Enable record transformation that references a table stored in AWS Glue defining the schema for your source records.
Use Kinesis Data Firehose as the delivery stream. Enable record transformation that references a table stored in AWS Glue defining the schema for your source records. Amazon Kinesis Data Firehose is the easiest way to load streaming data into AWS storage services such as S3. You can convert your streaming data from JSON to Apache Parquet and Apache ORC, and have a Kinesis Data Firehose delivery stream automatically convert data into Parquet or ORC format before delivering to your S3 bucket. Kinesis Data Firehose references table definitions stored in AWS Glue. Choose an AWS Glue table to specify a schema for your source records. Converting Your Input Record Format in Kinesis Data Firehose (https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html)
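A hedged sketch of a delivery stream with record format conversion enabled, referencing a Glue table for the schema; all ARNs and the Glue database/table names are placeholders:

```python
# Sketch: Firehose delivery stream that converts incoming JSON records to
# Parquet using a schema held in an AWS Glue table.
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="api-logs-parquet",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/FirehoseDeliveryRole",
        "BucketARN": "arn:aws:s3:::api-logs-lake",
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            "SchemaConfiguration": {
                "DatabaseName": "api_logs",
                "TableName": "requests",
                "RoleARN": "arn:aws:iam::111122223333:role/FirehoseDeliveryRole",
            },
        },
    },
)
```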
You've been contacted by a consulting client to assist with optimizing Athena query performance. They have a large amount of data stored in CSV format, and are not happy with either the expense of using Athena or its performance. Each file is 5-10GB in size, and all files are in a single S3 bucket in the root prefix. The data in question is being used for analytics purposes with heavy single-column reads. How can this data most easily be optimized in order to reduce access cost and improve query performance? Use Database Migration Service to reformat the data into Parquet format in a new S3 bucket. Recreate Athena tables to utilize this newly-formatted data. Use a CREATE TABLE AS SELECT (CTAS) query in Athena to process the existing data into Parquet format, partitioning the data as appropriate and compressing any non-index columns in the data with SNAPPY compression. Once all the data has been processed, DROP the original tables and ensure the data has been deleted from the underlying S3 bucket. Utilize Glue to catalogue the data. Create a Glue job to ETL the data into appropriately partitioned and compressed ORC format files. Utilize an EC2 instance to read all CSV files out of the S3 bucket backing Athena. Write a custom ETL script to reformat, partition, and apply compression to appropriate columns. Write the Parquet-formatted files to a new S3 bucket with the appropriate prefix schema to maximize performance. Recreate Athena tables with these new files.
Use a CREATE TABLE AS SELECT (CTAS) query in Athena to process the existing data into Parquet format, partitioning the data as appropriate and compressing any non-index columns in the data with SNAPPY compression. Once all the data has been processed, DROP the original tables and ensure the data has been deleted from the underlying S3 bucket. You can process the data "in place" with just Athena. By default, CTAS queries will store the output in Parquet format, and from there it's relatively simple to create partitions and configure column compression. All of these things will improve query performance and reduce the cost of querying the data.
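A hedged sketch of such a CTAS query submitted through the Athena API; database, table, column, and bucket names are assumptions:

```python
# Sketch: rewrite the CSV-backed table as partitioned, SNAPPY-compressed Parquet
# with a single Athena CTAS query.
import boto3

athena = boto3.client("athena")

ctas = """
CREATE TABLE analytics.events_parquet
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://analytics-optimized/events_parquet/',
    partitioned_by = ARRAY['event_date']
) AS
SELECT user_id, event_type, payload, event_date
FROM analytics.events_csv
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://athena-results/ctas/"},
)
```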
You're working as a data scientist for a company that is migrating their global application to the AWS cloud. The application has frequently-accessed, business-critical data that is stored in a transactional MySQL database with another cloud provider. Your team has been tasked with migrating the current data from the MySQL database to an Amazon Aurora MySQL compatible database to save on cost. However, there is never a time when the application can be drained of users or experience downtime. This means the solution must handle the full data migration, as well as replicate any ongoing changes to the destination database. What data migration solution would you suggest in order to accomplish this task with minimal development effort? Set up a Data Pipeline CopyActivity to perform a full load and a ShellCommandActivity setup on a cron job to copy the ongoing changes. Use a Database Migration Service (DMS) task to perform a full load and set up a Data Pipeline CopyActivity to copy the ongoing changes. Use a Database Migration Service (DMS) task to perform a full load and change data capture (CDC). Set up an AWS DataSync agent on the MySQL database to perform a full load and configure the schedule option to automate the transfer of ongoing changes.
Use a Database Migration Service (DMS) task to perform a full load and change data capture (CDC). You can create an AWS DMS task that captures ongoing changes to the source data store. You can do this capture while you are migrating your data. You can also create a task that captures ongoing changes after you complete your initial (full-load) migration to a supported target data store. This process is called ongoing replication or change data capture (CDC). AWS DMS uses this process when replicating ongoing changes from a source data store. This process works by collecting changes to the database logs using the database engine's native API. Creating Tasks for Ongoing Replication Using AWS DMS (https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Task.CDC.html)
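A hedged sketch of creating such a task with the 'full-load-and-cdc' migration type, assuming the source/target endpoints and replication instance already exist; all ARNs and the table mapping are placeholders:

```python
# Sketch: DMS replication task covering the initial full load plus ongoing
# change data capture in a single task.
import json
import boto3

dms = boto3.client("dms")

table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-all",
        "object-locator": {"schema-name": "%", "table-name": "%"},
        "rule-action": "include",
    }]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="mysql-to-aurora",
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:INSTANCE",
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps(table_mappings),
)
```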
You work as a data engineer for a large sports team that collects stats on plays, ticket and concession sales, clickstream data on the sports team's website, social media feeds, and more. Your team is planning to use EMR to process and transform the constantly growing data. The data analytics team runs reports by querying the data using tools like Apache Hive and Presto, so the ability to run queries is a must. There is a requirement that the EMR cluster not run persistently. To do this, your team has decided to implement a solution that starts an EMR cluster to process data when it lands in S3, runs Apache Spark ETL jobs, saves the transformed data to S3, and finally terminates the cluster. Which of the following is the best solution based on the requirements? Create a Lambda function that populates an RDS instance with the Apache Hive metadata for the EMR cluster. Use the EMR cluster and create an external table to run queries on the transformed data. Use an AWS Glue crawler to crawl the data that is transformed in S3 to populate a Glue Data Catalog with the metadata. Use Athena to run queries on the transformed data. Use HDFS on the EMR cluster to store the data. When the data analytics team wants to run queries on the transformed data, use the S3DistCp command to copy the data to S3. Once the data is in S3, use Athena to query the transformed data. Store the Apache Hive metadata externally in DynamoDB for the EMR cluster. Use S3 Select to run queries on the transformed data.
Use an AWS Glue crawler to crawl the data that is transformed in S3 to populate a Glue Data Catalog with the metadata. Use Athena to run queries on the transformed data. This is the best solution given the requirements. You can use AWS Glue to crawl the data in S3 and then use Athena to query the data. Using AWS Glue to Connect to Data Sources in Amazon S3 (https://docs.aws.amazon.com/athena/latest/ug/data-sources-glue.html) Best Practices When Using Athena with AWS Glue (https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html)
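A hedged sketch of the catalog-then-query flow; the crawler role, bucket paths, database, and query are illustrative assumptions:

```python
# Sketch: crawl the transformed S3 output into the Glue Data Catalog, then
# query the resulting table with Athena.
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

glue.create_crawler(
    Name="transformed-stats-crawler",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    DatabaseName="sports_analytics",
    Targets={"S3Targets": [{"Path": "s3://team-data/transformed/"}]},
)
glue.start_crawler(Name="transformed-stats-crawler")

# Once the crawler has populated the catalog, analysts query via Athena:
athena.start_query_execution(
    QueryString="SELECT play_type, count(*) AS plays FROM plays GROUP BY play_type",
    QueryExecutionContext={"Database": "sports_analytics"},
    ResultConfiguration={"OutputLocation": "s3://athena-results/sports/"},
)
```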
You work as a data engineer for a mid-sized paper company that distributes paper all across the nation. After some recent economic cutbacks, you have been tasked with reviewing the current data warehousing processing pipeline in order to try to lower operational costs. Currently, Redshift is being used as the data warehousing solution. The data is first ingested into EMR before being loaded into Redshift. This data processing usually takes less than 1 hour and is done 2 times daily (at 7 AM and 7 PM). What operational adjustment could you make to help lower costs? Use three master nodes for the EMR cluster Use spot instances for the task nodes of the EMR cluster Use spot instances for the core nodes of the EMR cluster Use an EMR transient cluster Use an EMR long-running cluster
Use spot instances for the task nodes of the EMR cluster You can use Spot Instances for task nodes to process data. Task nodes do not run the DataNode daemon, which means there is no HDFS storage on task nodes, so they can easily be added and removed to provide additional processing power. Understanding Master, Core, and Task Nodes; Task Nodes on Spot Instances (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html#emr-dev-task-instance-group-spot) Use an EMR transient cluster Transient clusters are used when a particular processing task or job needs to occur on a defined schedule. After all of the steps are complete, the cluster terminates and does not incur any more costs (unlike long-running clusters). Plan and Configure Clusters; Overview of Amazon EMR (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan.html)
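A hedged sketch combining both ideas: a transient cluster that runs the twice-daily step and terminates itself, with Spot task nodes for the extra capacity. Instance types, counts, script locations, and roles are assumptions:

```python
# Sketch: transient EMR cluster (terminates after its steps) with Spot task nodes.
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="daily-ingest",
    ReleaseLabel="emr-6.4.0",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
             "InstanceCount": 1, "Market": "ON_DEMAND"},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
             "InstanceCount": 2, "Market": "ON_DEMAND"},
            {"InstanceRole": "TASK", "InstanceType": "m5.xlarge",
             "InstanceCount": 4, "Market": "SPOT"},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,   # transient: terminate when steps finish
    },
    Steps=[{
        "Name": "process-and-load",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://paper-co-jobs/etl.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```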
You work as a data engineer consultant for customers who are moving data from S3 to Redshift. Your current customer has a 620 GB file that needs to be loaded from S3 to Redshift using the COPY command. The file has corrupted portions that have been found, modified, and fixed. However, there are other portions of the file that might be corrupted and could cause the data load to fail. You've been tasked with finding a way to efficiently detect load errors without performing any excess development to clean up failed load processes. Which solution should you consider using? Compress the data using the optimized algorithm gzip before using the COPY command. Create a Lambda function that uses try/catch statements when using the COPY command. Use the catch statement to clean up and handle failed load processes as they occur. Use the COPY and NOLOAD commands to check the integrity of all of the data without loading it into Redshift. Use splitting techniques to split the file into equal 20 GB chunks. Load each chunk separately, and then clean up and handle failed load processes as they occur.
Use the COPY and NOLOAD commands to check the integrity of all of the data without loading it into Redshift. The NOLOAD option checks the integrity of all of the data without loading it into the database and displays any errors that would occur if you had attempted to load the data. All other options will require subsequent processing on the cluster, which will consume resources. Validating Input Data (https://docs.aws.amazon.com/redshift/latest/dg/t_Validating_input_files.html)
A global wildlife research group has been collecting a huge amount of data in regionally located Redshift clusters. While planning for the next increase in storage capacity for their cluster, there was significant pushback regarding increased cost. At least 3/4 of the data being stored in Redshift is only accessed 4 times a year to generate reports that are delayed one quarter and do not include the most recent quarter's data. The leadership of the research group has requested a solution that will continue generating reports from SQL queries, and charts and graphs generated from the data with QuickSight. Which of the following is the lowest cost solution? Make the necessary changes to the Redshift cluster to enable Redshift Spectrum, create Redshift Spectrum tables in the cluster, and move the infrequently used data to the Redshift Spectrum tables. Utilize Glue to migrate all data to an Elastic MapReduce (EMR) cluster of appropriate size, and reconfigure application and analytics workflows to utilize EMR instead of Redshift. Use the Redshift UNLOAD command to an S3 bucket located in the region closest to the group that generates quarterly reports with the FORMAT PARQUET option to create a single data lake. Configure Athena to support SQL queries, and configure QuickSight to utilize Athena for its data source. Once operation of the new system is confirmed, delete the cold data from the Redshift cluster. Once cold data is removed from the cluster, scale the cluster down to the appropriate size to accommodate hot data needs. Use Glue to ETL all cold data to a DynamoDB table in the region closest to the Analytics group, then use a third party framework like D3 to generate visualizations of the data for reporting purposes.
Use the Redshift UNLOAD command to an S3 bucket located in the region closest to the group that generates quarterly reports with the FORMAT PARQUET option to create a single data lake. Configure Athena to support SQL queries, and configure QuickSight to utilize Athena for its data source. Once operation of the new system is confirmed, delete the cold data from the Redshift cluster. Once cold data is removed from the cluster, scale the cluster down to the appropriate size to accommodate hot data needs. Because the fresh data is not accessed for this purpose and the cold data is not utilized for live application operations, it is viable to move cold data outside of the Redshift ecosystem.
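A hedged sketch of the UNLOAD step, issued through the Redshift Data API; cluster, role, bucket, table, and partition column names are assumptions:

```python
# Sketch: export cold data to S3 as Parquet so Athena can query it in place.
import boto3

rsd = boto3.client("redshift-data")

rsd.execute_statement(
    ClusterIdentifier="wildlife-cluster",
    Database="research",
    DbUser="admin",
    Sql="""
        UNLOAD ('SELECT * FROM observations WHERE obs_date < DATEADD(month, -3, CURRENT_DATE)')
        TO 's3://wildlife-cold-data/observations/'
        IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftUnloadRole'
        FORMAT AS PARQUET
        PARTITION BY (region)
    """,
)
```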
You work as a data scientist for an organization that builds videos for university students who use them in place of classroom settings. Each video has a rating system that is positive or negative, which is determined by the students who view the content. Some of the ratings appear to come from bots that are flooding the platform with massive amounts of negative feedback. You've been tasked with creating real-time visualizations for these outliers to bring to the department heads. You have a large dataset of historical data, as well as the streaming data from current student viewing metrics. Which of the following provides the most cost-effective way to visualize these outliers? Use Kinesis Data Analytics' RANDOM_CUT_FOREST anomaly detection function to detect outliers. Use SageMaker to train a model using the built-in Random Cut Forest algorithm to detect outliers storing the results into S3. Use QuickSight to visualize the data. Use the anomaly detection feature in QuickSight to detect outliers. Use SageMaker to train a model using the built-in Random Cut Forest algorithm to detect outliers storing results in memory in the Jupyter notebook used to create the model. Visualize the results using the matplotlib library.
Use the anomaly detection feature in QuickSight to detect outliers. Amazon QuickSight uses proven Amazon technology to continuously run ML-powered anomaly detection across millions of metrics to discover hidden trends and outliers in your data. This anomaly detection enables you to get deep insights that are often buried in the aggregates and not scalable with manual analysis. With ML-powered anomaly detection, you can find outliers in your data without the need for manual analysis, custom development, or ML domain expertise. Detecting Outliers with ML-Powered Anomaly Detection (https://docs.aws.amazon.com/quicksight/latest/user/anomaly-detection.html)
You have been tasked with going through your company's AWS Glue jobs to audit which jobs are currently being used and which ones are outdated. You notice that one job that runs every day at 5 PM is failing with the error "Command failed with exit code 1" and CloudWatch Logs shows the "java.lang.OutOfMemoryError: Java heap space" error. Which of the following methods should you use to resolve this issue? Use actions like collect and count. Use the grouping feature in AWS Glue to coalesce multiple files together into a group. Set useS3ListImplementation to False so AWS Glue doesn't cache the list of files in memory all at once. Configure the AWS Glue job from G1.X to G2.x workers. Configure the AWS Glue job from G1.X to P2.x workers.
Use the grouping feature in AWS Glue to coalesce multiple files together into a group. You can fix the processing of the multiple files by using the grouping feature in AWS Glue. Grouping is automatically enabled when you use dynamic frames and when the input dataset has a large number of files (more than 50,000). Grouping allows you to coalesce multiple files together into a group, and it allows a task to process the entire group instead of a single file. As a result, the Spark driver stores significantly less state in memory to track fewer tasks. Fix the Processing of Multiple Files Using Grouping (https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-debug-oom-abnormalities.html#monitor-debug-oom-fix)
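A hedged Glue (PySpark) sketch of reading the input with grouping enabled; the paths and group size are illustrative assumptions:

```python
# Sketch: read many small files with Glue grouping so files are coalesced
# per task and the driver tracks far fewer tasks.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

logs = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://daily-exports/raw/"],
        "groupFiles": "inPartition",
        "groupSize": "104857600",   # target roughly 100 MB per group, in bytes
    },
    format="json",
)
print(logs.count())
```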
You work for a major university where thousands of students study and do research. The students' information, course schedules, activities, and financials are constantly being captured and analyzed by the university to help improve the students' learning experience. As the lead data engineer for the university, you have built out a sophisticated system of data structures to capture all of this data and store it in DynamoDB. A new policy is being launched by the university, and it's your job to create a new DynamoDB table and select a partition key and sort key to use as a composite key to help improve performance and DynamoDB capacity when the table is utilized. The new policy will only need to be sent out to and monitored by students that fall into certain categories such as age, nationality, major, residential status, etc. Which of the following selections will give you the best performance for the table when the policy is launched? Use the student's major plus the student's name as the composite partition key and the student's age as the sort key. Use the student's ID number as the partition key Use the student's ID number plus the student's age as the composite partition key and the student's age as the sort key. Use the student's major as the partition key and student's name as the sort key.
Use the student's ID number as the partition key The best practice for designing and using partition keys effectively is to use a partition key that uniquely identifies each item in a DynamoDB table. Since the student ID number uniquely identifies the student, a composite key is not required. Best Practices for Designing and Using Partition Keys Effectively (https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-design.html)
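A minimal sketch of the new table keyed solely on the student's unique ID, which spreads activity evenly across partitions; the table and attribute names are assumptions:

```python
# Sketch: DynamoDB table with the student ID as the sole (partition) key.
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="PolicyNotifications",
    AttributeDefinitions=[{"AttributeName": "student_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "student_id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
)
```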
Pickle Scanz is a company utilizing LIDAR technology to make high-resolution scans of pickles. They've utilized a machine learning model to identify areas they suspect represent a specific bump shape on the skin of the pickles. They've loaded candidates into a Redshift table and need to filter the candidates for a specific base64 binary pattern, but want to fuzz their search to include any records that are a close, but not exact, match to the pattern. How can they most easily identify this pattern? Craft a regex matching pattern and utilize the REGEXP_SUBSTR Redshift query function to identify possible close matches. Perform queries against the Redshift table with all near-value combination of base64 values, and load all returns into a newly created near_matches table. Utilize Glue to ETL the Redshift candidate table to a DynamoDB table, and use the CONTAINS query function to find matches to the base64 pattern in the possible candidates. Utilize Glue to ETL the Redshift candidate table to an Elasticsearch cluster, and utilize the Elasticsearch string search functionality to fuzz the search for the identified base64 pattern.
Utilize Glue to ETL the Redshift candidate table to an Elasticsearch cluster, and utilize the Elasticsearch string search functionality to fuzz the search for the identified base64 pattern. Elasticsearch has very powerful string search functionality, which will give a match-ranked response to a search that can be tuned to increase accuracy.
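A hedged sketch of the fuzzy search step once the candidates are indexed in Elasticsearch: a match query with fuzziness enabled returns close, relevance-ranked matches. The endpoint, index, field name, and example pattern are assumptions, and authentication/signing is omitted:

```python
# Sketch: fuzzy match query against the Elasticsearch index of candidates.
import json
import requests

ES_ENDPOINT = "https://search-pickle-scans.us-east-1.es.amazonaws.com"  # placeholder

query = {
    "query": {
        "match": {
            "pattern_b64": {
                "query": "aGVsbG8tYnVtcA==",   # target base64 pattern (placeholder)
                "fuzziness": "AUTO",
            }
        }
    }
}

resp = requests.post(
    f"{ES_ENDPOINT}/candidates/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["pattern_b64"])
```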
Your company has been hired to create a search and analytics system for Percival's Peculiar Pickles, which is a site where people post and discuss pictures of peculiar pickles. The solution should provide a REST API interface, enable deep text search capabilities, and be able to generate visualizations of the data stored in the system. Which solution will meet these requirements with minimal development effort? Store the files in Elastic File System. Access the files through a custom API that provides search services hosted on EC2 instances in an Auto Scaling group behind an Application Load Balancer. Create a custom API with API Gateway and Lambda. Use S3 and Athena as the datastore. Perform text filtering in the application layer. Utilize Kinesis Firehose to deliver data from the various elements of the application to an Elasticsearch Service cluster. Provide Elasticsearch API and Kibana endpoints to the customer with appropriate security credential information. Use DynamoDB as the data store. Utilize Elastic MapReduce to create a full-text search system. Write a custom API with API Gateway and Lambda.
Utilize Kinesis Firehose to deliver data from the various elements of the application to an Elasticsearch Service cluster. Provide Elasticsearch API and Kibana endpoints to the customer with appropriate security credential information. Kinesis Firehose is able to deliver records to Elasticsearch Service with no additional development needed. Elasticsearch provides a REST API which satisfies the rest of the requirements.
Stupendous Fantasy Football League would like to create near real-time scoreboards for all games being played on any given day. They have an existing Kinesis Data Firehose which ingests all relevant statistics about each game as it is being played, but they would like to be able to extract just score data on the fly. Data is currently being delivered to a Redshift cluster, but they would like score data to be stored and updated in a DynamoDB table. This table will then function as the datastore for the live scoreboard on their web application. Which of the following is the best way to accomplish this? Utilize a Kinesis Data Analytics Application to filter out data for each individual active game from the data stream. Send the data to a Lambda function, which then inserts/updates the data in the DynamoDB table. Utilize a Kinesis Data Analytics Application to filter out just the score data from the data stream. Send the data to a Lambda function, which then inserts/updates the data in the DynamoDB table. Create a scheduled Lambda function that periodically polls the Redshift cluster for updated score data and inserts/updates the data in the DynamoDB table. Insert an SQS queue and Lambda function in front of the Kinesis Firehose. Use the Lambda function to filter score data into the DynamoDB table, and leave the rest of the pipeline as is.
Utilize a Kinesis Data Analytics Application to filter out just the score data from the data stream. Send the data to a Lambda function, which then inserts/updates the data in the DynamoDB table. This is the best option. Kinesis Data Analytics applications allow us to split data from within a Kinesis Data Stream or Firehose.
You work for a large data warehousing company that is constantly running large scale processing jobs for customers. Every team has the freedom to use whichever EMR cluster configuration they need to accomplish a particular task, but the solution must be cost optimized. The latest contract requires a very large EMR cluster to be used throughout the year to process ML data and statistical functions. During a few months out of the year, the processing will be massive and, during other months, it will be minimal. To contend with this, your team uses a combination of on-demand and spot instances for the EMR cluster nodes, which is estimated to be around 40 core and task nodes. The team also varies the instance types to handle different workload types; for example, GPU-intensive ML processes will use g3 instance types and storage-optimized processes will use i2 instance types. Which type of EMR cluster solution would need to be set up to meet the requirements for the new contract? Utilize instance fleets configurations when creating the EMR cluster. Utilize instance fleets and instance groups configurations when creating the EMR cluster. Utilize spot-instances for core nodes and instance groups for master and task nodes. Utilize instance groups configurations when creating the EMR cluster.
Utilize instance fleets configurations when creating the EMR cluster. The instance fleets configuration for a cluster offers the widest variety of provisioning options for EC2 instances. With instance fleets, you specify target capacities for On-Demand Instances and Spot Instances within each fleet. When the cluster launches, Amazon EMR provisions instances until the targets are fulfilled. Create a Cluster with Instance Fleets or Uniform Instance Groups (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-instance-group-configuration.html)
You work for a nationwide grocery store that allows users to order groceries online through a custom-built mobile application. The current architecture consists of loading historical data of user's past purchases, search history, and liked items into a data lake in S3 using Kinesis Data Firehose. A data blob is typically 500 KB and ingested at a rate of 1 MB/second. The amount of data that is ingested by Kinesis Data Firehose typically reaches the buffer size before reaching the buffer interval. These are set to 128 MB and 300 seconds respectively. You have been tasked with setting up another Kinesis Data Firehose with a destination of Elasticsearch to give your team the ability to make on-the-fly recommendations and also provide reporting and other analysis. After successfully setting up this architecture, you notice the amount of data delivered into Elasticsearch is not as large as the amount of data being delivered into S3, which is leading to lower overall throughput. What is the reason for the lower throughput of the data being delivered into Elasticsearch? When using Elasticsearch as a delivery destination, the maximum value for the buffer interval is 60 seconds. This is causing lower overall throughput for data being delivered into Elasticsearch. When using Elasticsearch as a delivery destination, the maximum value for the buffer size is 512 KB. This is causing lower overall throughput for data being delivered into Elasticsearch. When using Elasticsearch as a delivery destination, the maximum value for the buffer size is 100 MB. This is causing lower overall throughput for data being delivered into Elasticsearch. Transfer Acceleration has been enabled onto the S3 bucket where the historical data is being delivered. This is causing the perception of higher overall throughput for data being delivered into S3.
When using Elasticsearch as a delivery destination, the maximum value for the buffer size is 100 MB. This is causing lower overall throughput for data being delivered into Elasticsearch. Elasticsearch has a minimum value of 1 and maximum value of 100 (in MB) for the buffer size for incoming data before delivering it to the destination. ElasticsearchBufferingHints (https://docs.aws.amazon.com/firehose/latest/APIReference/API_ElasticsearchBufferingHints.html)
You work for a government agency that is building an application in AWS that collects voting ballot data about government candidates and federal laws. Your task is to collect real-time voting data and store it in S3. You have built a streaming data capturing system using Kinesis Data Streams and taking advantage of the Kinesis Producer Library (KPL), since it provides a layer of abstraction for writing data. You have a requirement that exceptions and failed records must be handled with great importance, since losing one voting ballot record could sway results. What should be done to ensure failures are handled appropriately in your KPL application? When using the KPL, retries are automatically rolled back, allowing your application to take advantage of custom retry logic. When using the KPL, retries are automatically added to the buffer for subsequent Kinesis Data Streams API calls. When using the KPL, retries are handled asynchronously after-the-fact, because of Kinesis Data Stream's data replication to three availability zones in a region. When an exception or failure happens, trigger a Lambda function to add the record to the buffer for subsequent Kinesis Data Streams API calls.
When using the KPL, retries are automatically added to the buffer for subsequent Kinesis Data Streams API calls. Retries are automatic when using the KPL: failed records are put back into the KPL buffer and included in subsequent Kinesis Data Streams API calls, which improves throughput and reduces complexity while still enforcing each record's time-to-live value. There is no backoff algorithm, making this a relatively aggressive retry strategy; spamming due to excessive retries is prevented by rate limiting. KPL Retries and Rate Limiting (https://docs.aws.amazon.com/streams/latest/dev/kinesis-producer-adv-retries-rate-limiting.html)
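The re-buffering described above is built into the KPL itself (a Java library), so there is nothing extra to code. For contrast only, the hypothetical boto3 sketch below shows roughly the manual work the KPL saves you if you used the low-level PutRecords API instead: inspecting FailedRecordCount and re-submitting only the records that failed.

```python
import time

import boto3

# Illustrative only: the KPL performs this re-buffering automatically. With the
# low-level API you must check FailedRecordCount yourself and carry the failed
# records into the next PutRecords call.
kinesis = boto3.client("kinesis")


def put_with_retries(stream_name, records, max_attempts=5):
    """records: list of {'Data': bytes, 'PartitionKey': str} dictionaries."""
    pending = records
    for attempt in range(max_attempts):
        response = kinesis.put_records(StreamName=stream_name, Records=pending)
        if response["FailedRecordCount"] == 0:
            return
        # Keep only the records that failed (e.g. throttled) for the next call.
        pending = [
            rec for rec, result in zip(pending, response["Records"])
            if "ErrorCode" in result
        ]
        time.sleep(2 ** attempt)  # unlike this sketch, the KPL does not back off
    raise RuntimeError(f"{len(pending)} records still failing after retries")
```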
You've been provided with an S3 bucket with several terabytes of log data that needs to be prepared for analysis. Unfortunately, the logs are not in a common data format and use irregular delimiters, but are grouped in prefixes in such a way that each prefix contains logs with identical data formatting. The logs need to be processed and loaded into an Elasticsearch domain. This process needs to be completed as quickly as possible. What is the best workflow to accomplish this? Write a Lambda function with a format template for each S3 prefix data format. Process each line in the log into a JSON document, and deliver the JSON documents to a Kinesis Firehose delivery stream with the Elasticsearch domain configured as the target. Ingest each prefix's worth of logs to an EC2 instance and run a processing script to format each line as a JSON document, then send the JSON document to the Elasticsearch domain's REST API. Utilize Glue to catalog the data, and create Glue jobs to process the log files and deliver them to the Elasticsearch domain. Utilize Database Migration Service to ingest the data, format it, and deliver it to the Elasticsearch domain.
Write a Lambda function with a format template for each S3 prefix data format. Process each line in the log into a JSON document, and deliver the JSON documents to a Kinesis Firehose delivery stream with the Elasticsearch domain configured as the target. Because the data is irregularly formatted, you need to perform the transformation yourself: the bulk of the work is defining how a line in each prefix's files maps to a JSON document and then applying the appropriate mapping to each file. After that, you can easily deliver the results to a Kinesis Firehose delivery stream, which can have an Elasticsearch domain configured as the target.
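A minimal sketch of such a Lambda function, assuming an S3 PUT trigger, hypothetical per-prefix format templates, and an existing Firehose delivery stream (here called logs-to-es) that targets the Elasticsearch domain:

```python
import json

import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")

STREAM_NAME = "logs-to-es"  # assumed Firehose stream targeting the ES domain

# Hypothetical per-prefix format templates: the delimiter and field names used
# by the log files stored under that prefix.
FORMAT_TEMPLATES = {
    "app-logs/":  {"delimiter": "|",  "fields": ["timestamp", "level", "message"]},
    "edge-logs/": {"delimiter": "\t", "fields": ["timestamp", "ip", "path", "status"]},
}


def handler(event, context):
    """Triggered per S3 object; parses each line and forwards JSON to Firehose."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        template = next(t for p, t in FORMAT_TEMPLATES.items() if key.startswith(p))

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        batch = []
        for line in body.splitlines():
            doc = dict(zip(template["fields"], line.split(template["delimiter"])))
            batch.append({"Data": (json.dumps(doc) + "\n").encode("utf-8")})
            if len(batch) == 500:  # PutRecordBatch accepts at most 500 records
                firehose.put_record_batch(DeliveryStreamName=STREAM_NAME, Records=batch)
                batch = []
        if batch:
            firehose.put_record_batch(DeliveryStreamName=STREAM_NAME, Records=batch)
```

For very large objects you would stream the body rather than read it into memory, but the overall flow is the same.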
You're creating an application to process art school portfolios. Most of the data being ingested for this application will be high-resolution images that average 50 MB each. It is imperative that no data is lost in the process of ingesting the data. Each image has roughly 20 KB of metadata that will be the primary focus of the application, but the images themselves need to be accessible as well. The starting point of the ingestion flow for this application will be in an admissions office, where digital media is processed. The front end application will mostly be performing OLTP workloads. Which of the following will ensure all data is available and is able to be ingested in a timely manner? As part of the ingestion process, resize each image to be approximately 900 KB. Create an ingestion S3 bucket and configure a Kinesis Firehose to deliver data to the S3 bucket. Load each image into the Firehose after resizing. Create a Lambda function triggered by S3 PUTs that processes each image, extracting the relevant metadata and writing it to an Aurora MySQL cluster. Write the processed images to a final storage S3 bucket and delete the image file from the ingestion S3 bucket. Create a Kinesis Data Firehose configured to deliver records to an S3 bucket. Write an ingestion application that places each file in the Kinesis Firehose. Create a Lambda function triggered by PUTs to the S3 bucket, which processes each file to extract the metadata and add it to a separate CSV file in the same bucket prefix as the image file. Configure Athena to provide a SQL interface for the S3 bucket. Create an S3 bucket. Before uploading each image file, modify the Exif data to include any additional metadata. Write the image files to the S3 bucket with a deterministic prefix schema (/Year/Month/Day/Applicant ID/...). Write the front end application to read the Exif data from each image as it is being loaded. Write the application to extract the metadata from each image file, and combine it with any other metadata that is not part of the file (Applicant Name, ID, etc.). Upload the image file to an S3 bucket with a deterministic prefix schema (/Year/Month/Day/Applicant ID/...). Configure a Kinesis Data Stream to handle the metadata records. Write a Lambda-based Kinesis consumer to process the metadata records into a DynamoDB table. Have the consumer Lambda function also write the metadata for each file to the appropriate S3 location to accompany the image it relates to.
Write the application to extract the metadata from each image file, and combine it with any other metadata that is not part of the file (Applicant Name, ID, etc.). Upload the image file to an S3 bucket with a deterministic prefix schema (/Year/Month/Day/Applicant ID/...). Configure a Kinesis Data Stream to handle the metadata records. Write a Lambda-based Kinesis consumer to process the metadata records into a DynamoDB table. Have the consumer Lambda function also write the metadata for each file to the appropriate S3 location to accompany the image it relates to. This ensures that both metadata and image files are stored in a fault-tolerant, durable manner. By also storing the metadata files with the images, you add an easy-to-locate backup of each metadata record, should there be an issue with DynamoDB or if you need a convenient place to collect records for secondary use cases without increasing load on the DynamoDB table.
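As a rough end-to-end sketch under assumed names (a portfolio-images bucket, a portfolio-metadata stream, and a PortfolioMetadata table), the producer side uploads the image under the deterministic prefix and streams the metadata, while the Lambda consumer writes each record to DynamoDB and drops a JSON copy alongside the image:

```python
import base64
import json
from datetime import datetime, timezone

import boto3

# Bucket, stream, and table names below are placeholder assumptions.
s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("PortfolioMetadata")


def ingest_portfolio_image(image_path, applicant_id, extra_metadata):
    """Producer side: upload the image under /Year/Month/Day/ApplicantID/ and
    put its metadata record on the Kinesis Data Stream."""
    now = datetime.now(timezone.utc)
    key = f"{now:%Y/%m/%d}/{applicant_id}/{image_path.rsplit('/', 1)[-1]}"
    s3.upload_file(image_path, "portfolio-images", key)

    metadata = {"applicant_id": applicant_id, "s3_key": key, **extra_metadata}
    kinesis.put_record(
        StreamName="portfolio-metadata",
        Data=json.dumps(metadata).encode("utf-8"),
        PartitionKey=applicant_id,
    )


def consumer_handler(event, context):
    """Lambda consumer: persist each metadata record to DynamoDB and write a
    JSON copy next to the image it describes."""
    for record in event["Records"]:
        metadata = json.loads(base64.b64decode(record["kinesis"]["data"]))
        table.put_item(Item=metadata)
        s3.put_object(
            Bucket="portfolio-images",
            Key=metadata["s3_key"] + ".metadata.json",
            Body=json.dumps(metadata).encode("utf-8"),
        )
```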