BigData Final

Which of the below is NOT a phase of MapReduce? a. Combine phase b. Map phase c. Shuffle phase d. Reduce phase

Combine phase

In how many ways can an RDD be created? (Enter the number.)

3
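
For reference, a minimal PySpark sketch of the three usual ways; the file path and data are illustrative:

    # Sketch only: the three common ways to create an RDD in PySpark.
    from pyspark import SparkContext

    sc = SparkContext("local", "rdd-creation-demo")

    rdd_from_collection = sc.parallelize([1, 2, 3, 4])        # 1. parallelize an in-memory collection
    rdd_from_storage = sc.textFile("data.txt")                # 2. load from external storage (local or HDFS path)
    rdd_from_rdd = rdd_from_collection.map(lambda x: x * 2)   # 3. transform an existing RDD into a new one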

What is the default replication factor of a block on HDFS?

3

An administrator wants to be notified when an Amazon EC2 instance's CPU utilization threshold is exceeded. Which AWS service could they implement to meet this requirement? a. AWS Config b. Amazon CloudWatch c. Amazon Inspector d. AWS CloudTrail

Amazon CloudWatch
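
For illustration, a hedged boto3 sketch of such an alarm; the instance ID, threshold, and SNS topic ARN are placeholders:

    # Sketch only: CloudWatch alarm on EC2 CPUUtilization (instance ID and ARN are placeholders).
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(
        AlarmName="HighCPUUtilization",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=1,
        Threshold=80.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:notify-admin"],  # SNS topic that notifies the administrator
    )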

Match the AWS services with the corresponding data stores from modern data architecture. AWS services: Amazon Redshift, Amazon S3, Amazon Aurora, Amazon EMR, Amazon DynamoDB Data stores: Data Warehouse, Data lake, Relational, BigData, Non-Relational

Amazon Redshift - Data Warehouse Amazon S3 - Data lake Amazon Aurora - Relational Amazon EMR - BigData Amazon DynamoDB - Non-Relational

What is Apache Hive used for? a. Machine learning b. Batch data processing c. Network security d. Real-time data processing

Batch data processing

In Hadoop, HDFS splits huge files into small chunks that are called ... a. Segments b. Blocks c. Frames d. Pages

Blocks

What type of processing does Spark support? a. Batch processing b. Stream processing c. Both A and B d. None of the above

Both A and B

Which of the following is NOT true for Hadoop and Spark? a. Both are cluster computing environments b. Both have their own file system c. Both use open source APIs to link between different tools d. Both are big data processing platforms

Both have their own file system

What is the data engineer's role in processing data through a pipeline? a. Look for additional insights in the data. b. Evaluate the results. c. Build the infrastructure that the data passes through. d. Train data analysts on analysis and visualization tools.

Build the infrastructure that the data passes through.

Match the Linux OS commands with their functions. Commands: ls, cat, pwd, cd Functions: display the contents of the directory; print the content of the file; show the path to the current folder; go from the current directory to the specified one

Display the contents of the directory - ls Print the content of the file - cat Show the path to the current folder - pwd Go from the current directory to the specified one - cd

Which of these features correspond to RDD actions? a. Manipulate data values b. Return values c. Execute immediately d. Return pointers to new RDDs

Execute immediately Return values
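
A small PySpark sketch of the contrast: transformations are lazy and return new RDDs, while actions execute immediately and return values:

    # Sketch only: transformations are lazy; actions trigger execution and return values.
    from pyspark import SparkContext

    sc = SparkContext("local", "actions-demo")
    rdd = sc.parallelize(range(10))

    evens = rdd.filter(lambda x: x % 2 == 0)   # transformation: returns a new RDD, nothing runs yet
    doubled = evens.map(lambda x: x * 2)       # transformation: still lazy

    print(doubled.count())     # action: executes the lineage, returns 5
    print(doubled.collect())   # action: returns [0, 4, 8, 12, 16]
    print(doubled.take(2))     # action: returns [0, 4]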

With Hadoop MapReduce, you can run applications written only in Java. Select one: True False

False

The data scientist is working with a dataset that has high dimensionality. Which approach might they take to improve the dataset for their ML model? a. Identify features that do not appear to impact outcomes and remove them. b. Create new features to augment existing ones. c. Use binning to create a fixed number of values for a feature. d. Increase the dimensionality by using feature extraction.

Identify features that do not appear to impact outcomes and remove them.

Which programming language is NOT supported by Apache Spark? a. Julia b. Scala c. R d. Java

Julia

What is the most important node in the Hadoop cluster? a. Datanode b. Mainnode c. Namenode d. Masternode

Namenode

What is the process of dividing data into smaller pieces for more efficient processing called in Apache Hive? a. Schema creation b. Partitioning c. Shuffling d. Loading

Partitioning
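
As a rough illustration of Hive-style partitioning (run here through Spark SQL's Hive-compatible DDL rather than the Hive CLI; the table and column names are made up):

    # Sketch only: a Hive-style partitioned table, created via Spark SQL with Hive support enabled.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-demo").enableHiveSupport().getOrCreate()

    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales (order_id INT, amount DOUBLE)
        PARTITIONED BY (country STRING)
    """)

    # A filter on the partition column lets the engine scan only the matching partition.
    spark.sql("SELECT SUM(amount) FROM sales WHERE country = 'DE'").show()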

What is Spark's main abstraction for parallel processing? a. SparkSQL b. RDD c. DataFrame d. DataSet

RDD

Which data type best describes JSON and XML files? a. Structured b. Semistructured c. Relational d. Unstructured

Semistructured

What is the main component of Apache Spark architecture? a. Spark Streaming b. Spark Core c. RDD d. Spark SQL

Spark Core

What is the Spark equivalent of a SQL table? a. Spark Streaming b. RDD c. Spark SQL d. Spark DataFrame

Spark DataFrame
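
A brief PySpark sketch of treating a DataFrame like a SQL table; the column names and rows are illustrative:

    # Sketch only: a Spark DataFrame registered as a temporary view and queried with SQL.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

    df = spark.createDataFrame([("alice", 34), ("bob", 28)], ["name", "age"])
    df.createOrReplaceTempView("people")  # expose the DataFrame as a table for SQL queries

    spark.sql("SELECT name FROM people WHERE age > 30").show()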

Which stages are part of every modern data pipeline? (Select all that apply) a. Storage and processing b. Ingestion c. Analysis and visualization d. Parquet and GZIP compression e. Schema adjustment f. Labeling and resource tagging

Storage and processing Ingestion Analysis and visualization

What types of methods can you run on a Spark RDD? a. Transformations b. Computations c. Actions d. Evaluations

Transformations Actions

Pig and Hive are high-level technologies for processing big data, working on top of Hadoop MapReduce. Select one: a. True b. False

True

Which of these are actions? (Select all that apply) a. collect() b. count() c. orderBy() d. select() e. take() f. filter()

collect() count() take()

A data engineer wants to implement the most cost-effective service to coordinate multiple serverless functions into workflows. Which AWS service would meet their needs? a. Amazon Simple Workflow Service (Amazon SWF) b. AWS Batch c. AWS Step Functions d. AWS Lambda

AWS Step Functions

Which type or types of data movement should the modern data architecture support? a. Only outside-in data movement, which is when data from purpose-built data stores is moved into the data lake. b. The architecture should not support data movement between stores. Data should be ingested into the data store where it will be used. c. Only inside-out data movement, which is when data in the lake is moved to a purpose-built data store. d. Both inside-out and outside-in data movement. The architecture should also support movement directly between purpose-built data stores.

Both inside-out and outside-in data movement. The architecture should also support movement directly between purpose-built data stores.

What is HDFS? a. Distributed programming model b. Big data processing platform c. Operating system for a cluster d. Distributed file system

Distributed file system

What happens to the local file system when installing HDFS? a. They are not compatible b. The local file system will be replaced with HDFS c. The local file system will run on top of HDFS d. HDFS will run on top of the local file system

HDFS will run on top of the local file system

What are the benefits of implementing infrastructure as code? (Select all that apply) a. Programmability b. Serviceability c. Reusability d. Vulnerability e. Repeatability

Reusability Repeatability

Which data characteristic describes how accurate, precise, and trusted the data is? (Enter the answer as a single word)

Veracity

Which Spark RDD methods are transformations? a. count b. filter c. take d. map

filter map

Which statement best describes the data scientist's role in processing data through a pipeline? a. A data scientist focuses on the infrastructure that data passes through. b. A data scientist works with data in the pipeline. c. A data scientist works on moving data into the pipeline. d. A data scientist focuses on granting the correct level of access to different type of users.

A data scientist works with data in the pipeline.

A data engineer is building their infrastructure. They would like to create and deploy infrastructure as code to simplify and automate this process. Which service could the data engineer use to accomplish this task? a. AWS Auto Scaling b. AWS CloudTrail c. AWS Key Management Service (KMS) d. AWS CloudFormation

AWS CloudFormation

Which of the following is true for Amazon Athena? Select three. a. Amazon Athena builds a catalog that contains metadata about the various data sources. b. Amazon Athena is an interactive query service. c. Athena is a serverless service. d. Amazon Athena is a fast, fully managed, petabyte-scale data warehouse service. e. You can build a crawler with Amazon Athena to discover the schema. f. The Athena query engine is based on an open source tool called Presto.

Amazon Athena is an interactive query service. Athena is a serverless service. The Athena query engine is based on an open source tool called Presto.
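
For illustration, a hedged boto3 sketch of submitting an Athena query; the database, table, and result bucket are placeholders:

    # Sketch only: run a query with Amazon Athena via boto3 (names and S3 path are placeholders).
    import boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString="SELECT * FROM access_logs LIMIT 10",
        QueryExecutionContext={"Database": "web_analytics"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    print(response["QueryExecutionId"])  # poll get_query_execution() with this ID to check status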

A startup company is building an order inventory system with a web frontend and is looking for a real-time transactional database. Which service would meet their need? a. Amazon DocumentDB (with MongoDB compatibility) b. Amazon Neptune c. Amazon DynamoDB d. Amazon Redshift

Amazon DynamoDB

Which AWS services are serverless? (Select all that apply) a. Amazon DynamoDB b. Amazon EMR c. Amazon Glue d. Amazon S3 e. Amazon SageMaker f. Amazon Athena

Amazon DynamoDB Amazon S3 Amazon Athena

In real time, a data engineer needs to analyze and visualize a lot of streaming data from private user logs. Which single service would provide this ability? a. Amazon OpenSearch Service b. Amazon Athena c. Amazon Redshift d. Amazon QuickSight

Amazon OpenSearch Service

Which service or feature provides the ability for users to ask questions in natural language about data and receive accurate answers with relevant visualizations to help them gain insights? a. Amazon Athena b. Amazon OpenSearch Service c. Amazon Redshift d. Amazon QuickSight Q

Amazon QuickSight Q

Due to a company merger, a data engineer needs to increase their object storage capacity. They are not sure how much storage they will need. They want a highly scalable service that can store unstructured, semistructured, and structured data. Which service would be the most cost-effective to accomplish this task?

Amazon S3

A developer without ML experience wants to automatically localize documents that are uploaded to their web application. Which AWS service might be a good choice to integrate with their application? a. Amazon Translate b. Amazon Polly c. Amazon Comprehend d. Amazon Rekognition

Amazon Translate

In which layer of a data pipeline are insights produced from data? a. Analysis and visualization b. Processing c. Storage d. Ingestion

Analysis and visualization

Which tool is an open-source, in-memory structured query language (SQL) query engine that emphasizes faster queries?

Presto

A company is developing an infrastructure to support a new focus on data analytics. Due to cost concerns, the company wants to use the inexpensive hardware that they already have. Which programming frameworks are designed to best operate on that hardware? (Select TWO) a. Amazon RDS b. Amazon Redshift c. Presto d. Apache Hive e. Apache Hadoop

Apache Hive Apache Hadoop

Which of the following is a big data platform for analyzing unstructured data? a. Apache Mahout b. Apache Pig c. Apache Hudi d. Apache Hive

Apache Pig

What is the main difference between Apache Pig and Apache Hive? a. Apache Hive is faster than Apache Pig b. Apache Pig supports more data types than Apache Hive c. Apache Hive supports more data types than Apache Pig d. Apache Pig is faster than Apache Hive

Apache Pig is faster than Apache Hive

A data engineer is planning for a machine learning (ML) project that will involve the use of iterative, multi-stage ML algorithms. Which programming framework would best support these requirements? a. Apache Spark b. Presto c. Apache Hive d. Apache Hadoop

Apache Spark

A data engineer is planning for a machine learning (ML) project that will involve the use of an iterative, multi-stage ML algorithm. Which programming framework would best support these requirements? a. Apache Mahout b. Apache Hadoop YARN c. Apache Spark MLlib d. Apache Hive

Apache Spark MLlib

Which of these file formats are big data formats optimized for storage on Hadoop? (select all that apply) a. XML b. Avro c. Parquet d. Optimized Row Columnar (ORC) e. JSON f. CSV

Avro Parquet Optimized Row Columnar (ORC)

What is a common strategy for the preprocessing phase of the ML lifecycle? a. Balance and unbias the data b. Extract features c. Monitor model outputs d. Train the model

Balance and unbias the data

Which of these statements best describes batch and stream ingestion? (Select TWO) a. Batch ingestion is an approach of continuous integration and real-time processing of data across multiple sources towards a target destination. b. Batch ingestion is a process of collecting and transferring data in chunks according to scheduled intervals or on demand. c. Stream ingestion is a process of collecting and transferring data in chunks according to scheduled intervals or on demand. d. Stream ingestion is an approach of continuous integration and real-time processing of data across multiple sources towards a target destination.

Batch ingestion is a process of collecting and transferring data in chunks according to scheduled intervals or on demand. Stream ingestion is an approach of continuous integration and real-time processing of data across multiple sources towards a target destination.

What is the process of reusing intermediate data across multiple Spark operations called in Spark? a. Partitioning b. Shuffling c. Caching d. Filtering

Caching
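
A short PySpark sketch of caching an intermediate dataset that several later operations reuse; the log path is illustrative:

    # Sketch only: cache an intermediate RDD so later actions reuse it instead of recomputing the lineage.
    from pyspark import SparkContext

    sc = SparkContext("local", "cache-demo")

    logs = sc.textFile("app.log")
    errors = logs.filter(lambda line: "ERROR" in line)
    errors.cache()                 # keep the filtered data in memory after the first computation

    print(errors.count())          # first action: computes the filter and populates the cache
    print(errors.take(5))          # second action: served from the cached data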

Which of these features does NOT apply to RDD? a. Can be modified after creation b. Can be created by performing transformations on existing RDD c. Distributed datasets can be processed in parallel with them d. Reliably store very large files across cluster machines

Can be modified after creation Reliably store very large files across cluster machines

Which statement describes how data architectures evolved from 1970 to the present? a. Data warehouses evolved out of the need to process data for artificial intelligence and machine learning (AI/ML) applications. b. Hierarchical databases dominated the market until explosion of data during the rise of the internet. c. Data stores evolved from relational to nonrelational structures to support demand for higher levels of connected users. d. Data stores evolved to adapt to increasing demands of data volume, variety, and velocity.

Data stores evolved to adapt to increasing demands of data volume, variety, and velocity.

Which statement best describes data wrangling? a. Data wrangling provides a set of transformation steps, which are each performed one time in sequence as data is ingested b. Data wrangling provides rigid guidance for data transformations to ensure they adhere to standards that are needed for ML models c. Data wrangling is a data transformation approach that requires the use of sophisticated tools to review and transform data from a given data source d. Data wrangling is a set of steps that are performed to transform large amounts of data from multiple sources into a meaningful dataset

Data wrangling is a set of steps that are performed to transform large amounts of data from multiple sources into a meaningful dataset

What is the term for the combination of cultural philosophies, practices, and tools that increases an organization's ability to deliver applications and services at a high velocity? a. Continuous deployment b. Continuous delivery c. Continuous integration d. DevOps

DevOps

What is MapReduce? a. Programming language b. Operating system for a cluster c. Distributed file system d. Distributed programming model

Distributed programming model

What is the role of AWS Glue in the AWS modern data architecture? a. Help you to monitor and classify data. b. Provide the ability to query data directly from the data lake by using SQL. c. Secure access to sensitive data. d. Facilitate data movement and transformation between data stores.

Facilitate data movement and transformation between data stores.

Arrange the names of the technologies according to the companies where they were created: Pig / Hive / MapReduce / YARN. Google ___ Facebook ___ Yahoo ___

Google - MapReduce Facebook - Hive Yahoo - Pig

You can run Pig in interactive mode using the ____ shell.

Grunt

What is the utility that allows us to create and run MapReduce jobs with any language or executable as the mapper or the reducer in Hadoop? a. Hadoop Cluster b. Hadoop Core c. Hadoop MapReduce d. Hadoop Streaming

Hadoop Streaming
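
A minimal word-count mapper in Python, sketched the way it might be used with Hadoop Streaming; the jar path, input/output paths, and the matching reducer.py are assumptions:

    # Sketch only: word-count mapper for Hadoop Streaming (save as mapper.py).
    # Illustrative invocation (a reducer.py that sums the counts per word is assumed):
    #   hadoop jar hadoop-streaming.jar -input /in -output /out -mapper mapper.py -reducer reducer.py
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")   # emit tab-separated key/value pairs; Hadoop shuffles them to the reducer by key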

A data consumer wants to visualize the digital products that customers are purchasing across multiple countries. Which type of data visualization would help the consumer to quickly identify trends and outliers? a. Donut chart b. Word cloud c. Heat map d. Gauge chart

Heat map

What technology is NOT a part of Hadoop Core? a. HDFS b. YARN c. Hive d. MapReduce

Hive

What is an example of unstructured data? a. Relational database table b. CSV, JSON, or XML file c. Log files or clickstream data d. Image and text files

Image and text files

Which type of data visualization would help a data consumer to visualize trends over intervals of time? a. Geospatial map b. Line chart c. Donut chart d. Pie chart

Line chart

Which of these higher-level tools are NOT in Apache Spark ecosystem? a. Pig b. GraphX c. MLlib d. Hive

Pig Hive

Which of the following is true for Amazon Simple Storage Service (Amazon S3)? Select three. a. S3 bucket names must be unique across all buckets in Amazon S3. b. Amazon S3 is a fast, fully managed, petabyte-scale data warehouse service. c. Objects in Amazon S3 can be up to 5 TB. d. Amazon S3 supports the .gzip and .zip compression formats. e. By default, all data stored in Amazon S3 is viewable by the public. f. Buckets and objects are the basic building blocks for Amazon S3.

S3 bucket names must be unique across all buckets in Amazon S3. Objects in Amazon S3 can be up to 5 TB. Buckets and objects are the basic building blocks for Amazon S3.
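
A hedged boto3 sketch of the bucket/object building blocks; the bucket name is a placeholder and must be globally unique (bucket creation outside us-east-1 also needs a location constraint):

    # Sketch only: basic Amazon S3 usage with boto3 (bucket and key names are placeholders).
    import boto3

    s3 = boto3.client("s3")
    bucket = "my-globally-unique-bucket-name"

    s3.create_bucket(Bucket=bucket)
    s3.put_object(Bucket=bucket, Key="raw/data.json", Body=b'{"hello": "world"}')

    obj = s3.get_object(Bucket=bucket, Key="raw/data.json")
    print(obj["Body"].read())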

Which statement about data types is correct? a. Unstructured data is the easiest to query but the most flexible. Structured data is the hardest to query and the least flexible. b. Unstructured data is the hardest to query and the least flexible. Structured data is the easiest to query and the most flexible. c. Unstructured and structured data are equally difficult to query. d. Unstructured data is the hardest to query but the most flexible. Structured data is the easiest to query and the least flexible.

Unstructured data is the hardest to query but the most flexible. Structured data is the easiest to query and the least flexible.

Which statements are TRUE regarding horizontal and vertical scaling? (Select TWO) a. Upgrading to a higher Amazon EC2 instance type is an example of horizontal scaling. b. Upgrading to a higher Amazon EC2 instance type and adding more EC2 instances to your resource pool are both examples of horizontal scaling. c. Upgrading to a higher Amazon EC2 instance type is an example of vertical scaling. d. Adding more Amazon EC2 instances to your resource pool is an example of horizontal scaling. e. Adding more Amazon EC2 instances to your resource pool is an example of vertical scaling.

Upgrading to a higher Amazon EC2 instance type is an example of vertical scaling. Adding more Amazon EC2 instances to your resource pool is an example of horizontal scaling.

A business analyst wants to use a CSV file that they built to test a hypothesis about predicting an outcome. What approach could simplify their work? a. Load the data into Amazon EMR. b. Load the data into Amazon Redshift. c. Use Amazon SageMaker Canvas. d. Use an AWS Deep Learning AMI (DLAMI).

Use Amazon SageMaker Canvas.

A data engineer is building a pipeline to ingest unstructured and semistructured data. A data scientist will explore the data for potential use in an ML solution. Which ingestion approach is best for this business need? a. Use an extract, load, and transform (ELT) approach and load the minimally transformed data into an Amazon Redshift data warehouse. b. Use an extract, transform, and load (ETL) approach and load the highly transformed data into an Amazon Redshift data warehouse. c. Use an extract, load, and transform (ELT) approach and load the nearly raw data into an S3 data lake. d. Use an extract, transform, and load (ETL) approach and load the highly transformed data into an S3 data lake.

Use an extract, load, and transform (ELT) approach and load the nearly raw data into an S3 data lake.

Which statements describe volume and velocity in a data pipeline? (Select THREE) a. Velocity is about how much data you need to process. b. Volume and velocity together drive the expected throughput and scaling requirements of your pipeline. c. Velocity is about how quickly data enters and moves through a pipeline d. Volume is about how much data you need to process. e. Only volume drives the expected throughput and scaling requirements of a pipeline.

Volume and velocity together drive the expected throughput and scaling requirements of your pipeline. Velocity is about how quickly data enters and moves through a pipeline. Volume is about how much data you need to process.

To determine the appropriate AWS tools and services to analyze and visualize data, which question should a data engineer consider first? a. What is the business need that tools or services need to fulfill? b. What are the characteristics of the data that needs to be analyzed and visualized? c. How do different personas in the organization use tools and services within the stages of the data pipeline? d. What are the available tools and services?

What is the business need that tools or services need to fulfill?

Which of the following is true for AWS Glue? Select three. a. You can build a crawler with AWS Glue to discover the schema of data. b. You can store big data in AWS Glue using the Apache Parquet format. c. AWS Glue is a scalable, serverless data integration service. d. AWS Glue is a fast, fully managed, petabyte-scale data warehouse service. e. AWS Glue builds a catalog that contains metadata about the various data sources. f. AWS Glue is an interactive query service that makes it easy to analyze data directly in Amazon S3.

You can build a crawler with AWS Glue to discover the schema of data. AWS Glue is a scalable, serverless data integration service. AWS Glue builds a catalog that contains metadata about the various data sources.

A data engineer is considering whether to use the Apache Hadoop framework for their data analytics workload. What are the benefits of the Hadoop framework? (Select all that apply) a. You can store and process multiple petabytes of data. b. The open-source framework is free and can run on inexpensive hardware. c. The Hadoop infrastructure is not scalable. d. Hadoop can only process structured and unstructured data. e. Hadoop has a low degree of fault tolerance.

You can store and process multiple petabytes of data. The open-source framework is free and can run on inexpensive hardware.

In the Bash shell (Linux), cat is the command used to: a. exit the shell b. change directory c. compress the contents of a file d. move directory e. display the contents of a file

display the contents of a file

Select the Bash shell command that counts the number of lines in a file. a. cat -n <filename> b. wc -c <filename> c. rm -f <filename> d. wc -l <filename>

wc -l <filename>

