Week 5 Spark and AWS

EMR

Amazon EMR is a platform for rapidly processing, analyzing, and applying machine learning to big data using open-source frameworks. Its use cases include building scalable data pipelines, accelerating data science with ML, processing real-time data, and querying any dataset.

What are objects?

Objects are files stored within S3. Each object has a key, and the full key path is composed of prefix + object_name.

Executor memory exception troubleshooting: executor runs out of memory, FetchFailed exception

Set a higher value for the executor memory using --conf spark.executor.memory=<XX>g or --executor-memory <XX>G.

What are dataframes?

DataFrames are an abstraction of RDDs used with Spark SQL that are similar to their namesake in R and Python. They allow data scientists who may not be familiar with RDD concepts to still perform SQL queries via Spark.

Resource Manager

Resource Manager is the decision-maker unit about the allocation of resources between all applications in the cluster, and it is a part of Cluster Manager.

S3 Standard-IA

Data that is accessed less frequently but requires rapid access when needed; it is stored redundantly in multiple AZs. Ideally suited for long-term file storage, older sync-and-share storage, and other aging data.

Can you assign a new key pair to an existing EC2 instance?

No

When you create a new EMR cluster, once it is ready to run, the cluster's status is

Waiting

different storage levels for caching

We can use different storage levels for caching the data (refer to StorageLevel.scala):
- DISK_ONLY: persist data on disk only, in serialized format.
- MEMORY_ONLY: persist data in memory only, in deserialized format.
- MEMORY_AND_DISK: persist data in memory; if enough memory is not available, evicted blocks are stored on disk.
- OFF_HEAP: data is persisted in off-heap memory (refer to spark.memory.offHeap.enabled in the Spark docs).

Columnar Database vs Relational Database

While a relational database is optimized for storing rows of data, typically for transactional applications, a columnar database is optimized for fast retrieval of columns of data, typically in analytical applications. Column-oriented storage for database tables is an important factor in analytic query performance because it drastically reduces the overall disk I/O requirements and reduces the amount of data you need to load from disk. Like other NoSQL databases, column-oriented databases are designed to scale "out" using distributed clusters of low-cost hardware to increase throughput, making them ideal for data warehousing and Big Data processing. Apache HBase is an open-source, column-oriented, distributed NoSQL database. HBase runs on the Apache Hadoop framework. HBase provides you a fault-tolerant, efficient way of storing large quantities of sparse data using column-based compression and storage.

What ways can you persist memory in Spark?

With persist(), where you specify the storage level, and with cache(), which defaults to MEMORY_AND_DISK for a Dataset/DataFrame and MEMORY_ONLY for an RDD.
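
A minimal Scala sketch of both approaches (the data and variable names are illustrative; assumes an existing SparkContext sc and SparkSession spark):

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000)
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)   // explicit storage level

val df = spark.range(1000).toDF("id")
df.cache()                                      // default level for a DataFrame: MEMORY_AND_DISK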

Different ways to repartition?

With repartition() and coalesce(). repartition() changes the number of partitions to the number given in parentheses and performs a full shuffle; coalesce() can only reduce the partition count below its current value, merging existing partitions without a full shuffle.
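
A short sketch (assumes an existing SparkSession spark; the sizes are arbitrary):

val df = spark.range(0, 1000, 1, 8)       // start with 8 partitions
val more = df.repartition(16)             // full shuffle into 16 roughly equal partitions
val fewer = df.coalesce(4)                // merges existing partitions down to 4, no full shuffle
println(more.rdd.getNumPartitions)        // 16
println(fewer.rdd.getNumPartitions)       // 4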

How to specify the cluster manager in spark submit?

By setting the --master flag in spark-submit (e.g. --master yarn).

Amazon Glacier is designed for: (Choose 2 answers) A. active database storage. B. infrequently accessed data. C. data archives. D. frequently accessed data. E. cached session data.

B. infrequently accessed data and C. data archives

How do you configure properties in spark-submit? How can you do it programmatically?

With the --conf flag, where you set properties such as spark.sql.shuffle.partitions. Programmatically:

val config = new SparkConf()
config.set("spark.sql.shuffle.partitions", "300")
val spark = SparkSession.builder().config(config).getOrCreate()

How to get storage level in Spark?

.getStorageLevel

How can we see the lineage?

.toDebugString
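
A quick sketch covering the two previous cards (assumes an existing SparkContext sc; the input path is hypothetical):

val counts = sc.textFile("input.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

println(counts.toDebugString)     // prints the RDD lineage
println(counts.getStorageLevel)   // current storage level (NONE until persisted)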

Horizontal scaling

Adding more machines (nodes) of the same kind, i.e. scaling out, rather than adding more power to an existing machine (vertical scaling).

What is EC2?

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers. Amazon EC2 provides complete control of your computing resources, integration with other AWS services, and security. Through its instance types it offers a balance of compute, memory, and networking resources for workloads such as code repositories and web servers. Each instance type delivers a mix of CPU, memory, storage, and networking capacity, across one or more size options, and should be carefully matched to your workload's unique demands.

Amazon Glacier

Amazon Glacier is a secure, durable, and extremely low-cost storage service for data archiving and long-term backup. You can reliably store large or small amounts of data for as little as $0.004 per gigabyte per month, a significant savings compared to on-premises solutions. To keep costs low yet suitable for varying retrieval needs, Amazon Glacier provides three options for access to archives, from a few minutes to several hours.

What are buckets?

Amazon S3 allows people to store objects (files) in buckets (directories). Bucket names must be globally unique and are defined at the region level. Naming convention: no uppercase letters, no underscores, 3-63 characters long, and not formatted like an IP address. A bucket name can start with a number.

What is Amazon S3?

Amazon S3 is one of the main building blocks of AWS and is advertised as infinitely scaling storage. It provides a simple web interface that offers a highly scalable, reliable, and low-latency data storage infrastructure at very low cost. Objects are stored in buckets. The buckets must have a globally unique name and are defined at the region level. Each object has a key, and the full key path is composed of prefix + object_name.

What is AWS?

Amazon Web Services (AWS) is a cloud-based service provided by Amazon that allows applications to run on virtual servers hosted on hardware maintained by Amazon. AWS has all the typical benefits of a cloud service, including elasticity, scalability, and the ability to use large amounts of computing resources on a pay-per-use basis, which is ultimately much cheaper than maintaining a server farm directly.

What does it mean to run an EMR Step Execution?

An EMR cluster can be run in two ways. When the cluster is set up to run like any other Hadoop system, it will remain idle when no job is running. The other mode is for "step execution." This is where the cluster is created, runs one or more steps of a submitted job, and then terminates. Obviously the second mode of operation saves costs, but it also means data within the cluster is not persisted when the job finishes.

What is an executor? What are executors when we run Spark on YARN?

An executor is a process that is launched for a Spark application on a worker node; it runs as a JVM process on that node. Executors are launched at the start of a Spark application in coordination with the Cluster Manager. The responsibility of an executor is to run individual tasks and return the results to the driver. It can also cache (persist) data on the worker node. When running Spark on YARN, each Spark executor runs as a YARN container. Where MapReduce schedules a container and fires up a JVM for each task, Spark hosts multiple tasks within the same container. This approach enables several orders of magnitude faster task startup time.

Define the different persistent storage levels in scala

- DISK_ONLY: persist data on disk only, in serialized format.
- MEMORY_ONLY: persist data in memory only, in deserialized format.
- MEMORY_AND_DISK: persist data in memory; if enough memory is not available, evicted blocks are stored on disk.
- OFF_HEAP: data is persisted in off-heap memory.

What is Apache Spark?

Apache Spark is one of the most popular open-source distributed computing platforms for in-memory batch and stream processing. It promises to process millions of records quickly. Spark uses a master/slave architecture with a central coordinator called the Driver and a set of executor processes, called Executors, located on various nodes in the cluster.

Spark caching

Apache Spark provides an important feature to cache intermediate data, which provides a significant performance improvement when running multiple queries on the same data. The rule of thumb for caching is to identify the DataFrame that you will be reusing in your Spark application and cache it. Even if you don't have enough memory to cache all of your data, you should go ahead and cache it: Spark will cache whatever it can in memory and spill the rest to disk.

What are the different types of cloud computing?

As cloud computing has grown in popularity, several different models and deployment strategies have emerged to help meet specific needs of different users. Each type of cloud service and deployment method provides you with different levels of control, flexibility, and management.

Benefits of cache

Reading data from the source (hdfs:// or s3://) is time consuming, so after you read data from the source and apply all the common operations, cache it if you are going to reuse the data. By caching you create a checkpoint in your Spark application, and if any task fails further down the execution, the application will be able to recompute the lost RDD partition from the cache. If you don't have enough memory, data will be cached on the executor's local disk, which is still faster than re-reading from the source. Even caching only a fraction of the data improves performance; the rest can be recomputed by Spark, and that is what "resilient" in RDD means.

Reducebykey vs groupbykey

Both reduceByKey and groupByKey are wide transformations, which means both trigger a shuffle operation. The key difference is that reduceByKey does a map-side combine and groupByKey does not. With groupByKey, since the data is not combined or reduced on the map side, all elements are transferred over the network during the shuffle. Because every element is sent to the task performing the aggregation, that task has to handle many more elements and could run into an OutOfMemory exception. reduceByKey is optimized thanks to the map-side combine, so fewer elements are sent over the network and fewer elements need to be reduced by the task performing the reduce operation after the shuffle.
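
A small sketch of the difference (assumes an existing SparkContext sc; the data is illustrative):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// map-side combine: partial sums are computed before the shuffle
val reduced = pairs.reduceByKey(_ + _)

// no map-side combine: every (key, value) pair is shuffled
val grouped = pairs.groupByKey().mapValues(_.sum)

reduced.collect()   // Array((a,2), (b,1))
grouped.collect()   // same result, but more data moved over the network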

On-Heap Memory

By default, Spark uses on-heap memory only. The size of the on-heap memory is configured by the --executor-memory parameter when the Spark application starts. The concurrent tasks running inside an executor share the JVM's on-heap memory. Two main configurations control executor memory allocation:
- spark.memory.fraction
- spark.memory.storageFraction
Apache Spark supports three memory regions within the executor:
- Reserved Memory
- User Memory
- Spark Memory

when do we cache an rdd?

Caching is very useful for applications that re-use an RDD multiple times. Iterative machine learning applications include such RDDs that are re-used in each iteration. Caching all of the generated RDDs is not a good strategy as useful cached blocks may be evicted from the cache well before being re-used. For such cases, additional computation time is required to re-evaluate the RDD blocks evicted from the cache.

Cluster Manager

Cluster Manager is a process that controls, governs, and reserves computing resources in the form of containers on the cluster. There are lots of cluster manager options for Spark applications, one of them is Hadoop YARN.

Select the correct launch modes of EMR

Cluster and step execution

When during a job do we need to pay attention to the number of partitions and adjust if necessary?

Clusters will not be fully utilized unless you set the level of parallelism for each operation high enough. The general recommendation is to have about 4x as many partitions as cores available to the application. Partition depending on the configuration and requirements of the cluster: too few partitions does not utilize all the cores in the cluster, while too many partitions causes excessive overhead in managing many small tasks as well as data movement.
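
A worked instance of the rule of thumb (the cluster size is hypothetical; assumes an existing SparkSession spark): with 20 cores available to the application, aim for roughly 4 * 20 = 80 partitions.

spark.conf.set("spark.sql.shuffle.partitions", "80")    // partitions produced by shuffles
val repartitioned = spark.range(1000000L).repartition(80)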

Why is Spark faster than MapReduce?

DAG execution engine and in-memory computation

What are datasets?

DataSets are similar to DataFrames, in that they are an abstraction of RDDs, however they differ in that they require strict data typing via the use of case classes, and thus guarantee compile time type safety.

Spark job repeatedly fails

Description: When the cluster is fully scaled but still cannot manage the job size, the Spark job may fail repeatedly. Resolution: Run the Sparklens tool to analyze the job execution and optimize the configuration accordingly.

Executor container killed by yarn for exceeding memory limits

Description: This error occurs when the container hosting the executor needs more memory for overhead tasks or executor tasks. Resolution: Set a higher value for spark.yarn.executor.memoryOverhead based on the requirements of the job. The executor memory overhead value increases with the executor size. If increasing the executor memory overhead value or executor memory value does not resolve the issue, you can either use a larger instance or reduce the number of cores.

Out of memory exceptions

Driver memory exceptions:
- exception due to the Spark driver running out of memory
- job failure because the Application Master that launches the driver exceeds memory limits
Executor memory exceptions:
- exception because the executor runs out of memory
- FetchFailed exception due to the executor running out of memory
- executor container killed by YARN for exceeding memory limits
Troubleshooting: You should understand how much memory and how many cores the application requires; these are the essential parameters for optimizing the Spark application. Based on the resource requirements, you can modify the Spark application parameters to resolve the out-of-memory exceptions.

AWS Core Services

IaaS, PaaS, and SaaS

If the storage level for a persist is MEMORY_ONLY and there isn't enough memory, what happens?

If you use MEMORY_ONLY as the Storage Level and if there is not enough memory in your cluster to hold the entire RDD, then some partitions of the RDD cannot be stored in memory and will have to be recomputed every time it is needed. If you don't want this to happen, you can use the StorageLevel - MEMORY_AND_DISK in which if an RDD does not fit in memory, the partitions that do not fit are saved to disk.

Broadcast variables

In Spark RDD and DataFrame jobs, broadcast variables are read-only shared variables that are cached and available on all nodes in a cluster so that tasks can access and use them. Instead of sending this data along with every task, Spark distributes broadcast variables to the machines using efficient broadcast algorithms to reduce communication costs. When you run a Spark RDD or DataFrame job that defines and uses broadcast variables, Spark broadcasts the common (reusable) data needed by tasks within each stage. The broadcast data is cached in serialized format and deserialized before executing each task. You should create and use broadcast variables for data that is shared across multiple stages and tasks. Note that broadcast variables are not sent to executors by the sc.broadcast(variable) call itself; instead, they are sent to executors when they are first used.
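
A minimal sketch of a broadcast lookup (assumes an existing SparkContext sc; the data is illustrative):

val lookup = Map(1 -> "one", 2 -> "two", 3 -> "three")
val bcLookup = sc.broadcast(lookup)               // shipped to executors on first use

val ids = sc.parallelize(Seq(1, 2, 3, 2))
val named = ids.map(id => bcLookup.value.getOrElse(id, "unknown"))
named.collect()                                   // Array(one, two, three, two)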

What is the difference between cluster mode and client mode on YARN?

In cluster mode, the Spark driver runs inside an Application Master process that is managed by YARN on the cluster, and the client can go away after initiating the application. Cluster mode is used to run production jobs. All of the worker nodes act as executors, and one of them also hosts the Spark driver; this is where the SparkContext lives for the lifetime of the app. One specific node submits the JAR (or .py file), and we can track the execution using the web UI. Note that the node hosting the driver also acts as an executor at the same time. In client mode, the driver runs in the client process, and the Application Master is only used for requesting resources from YARN. Client mode is mainly used for interactive and debugging purposes. In client mode, the node where spark-submit is invoked acts as the Spark driver, and this is where the SparkContext lives for the lifetime of the app; that node does not execute the DAG, as it is designated only as the driver, while all the other nodes act as executors running the job. We can track the execution of the jobs through the web UI.

Iaas

Infrastructure as a Service (IaaS) contains the basic building blocks for cloud IT and typically provides access to networking features, computers (virtual or on dedicated hardware), and data storage space. IaaS provides you with the highest level of flexibility and management control over your IT resources.

SaaS

The service provider is responsible for the application, operating system, storage, and servers; nothing is managed by the user.

Apache Mesos

It is a distributed cluster manager. Like YARN, it is highly available for masters and slaves. It can also manage resources per application. We can run Spark jobs, Hadoop MapReduce, or any other service applications easily. Mesos has APIs for Java, Python, and C++. We can run Mesos on Linux or Mac OS X as well. The Apache Mesos cluster manager first determines the availability of resources and then makes an offer back to a framework; the framework can accept or reject the offer. This is a two-level scheduler model in which schedulers are pluggable. Because frameworks can decline offers, it allows practically any number of scheduling algorithms and can accommodate thousands of schedulers on the same cluster. With this two-level scheduling, each framework decides which algorithm it wants to use for scheduling the jobs it needs to run.

Standalone cluster managers

It is part of the Spark distribution and is available as a simple cluster manager. The standalone cluster manager is resilient in nature and can handle worker failures. It has the capability to manage resources according to the requirements of applications. We can easily run it on Linux, Windows, or Mac. It can also access HDFS (Hadoop Distributed File System) data. This is the easiest way to run Apache Spark on a cluster. It also has high availability for the master.

s3 one zone - infrequent access

It is ideal for customers who want a lower-cost option for infrequently accessed data but do not require the availability and resilience of S3 Standard or S3 Standard-IA. Unlike the others which store data in a minimum of three Availability Zones (AZs), this stores data in a single AZ and costs 20% less than S3 Standard-IA. It's a good choice for storing secondary backup copies of on-premises data or easily re-creatable data.

What does Spark Engine do?

It provides scheduling through its DAG Scheduler, distribution across a cluster through the use of RDDs, and monitoring of data in a cluster through cluster managers like YARN or the default standalone cluster manager.

what are the important components of spark?

Language support for Java, Python, Scala, and R. The core components allow us to do SQL queries, streaming, machine learning, and graph computations without having to import a new library on top. There are also cluster managers which can be used to further optimize your work.

What is the logical plan?

The logical plan is an abstract of all the transformation steps that need to be performed; it does not refer to the driver or the executors. It is generated by and stored in the SparkContext. The logical plan is divided into three parts: the unresolved logical plan, the resolved logical plan, and the optimized logical plan.

Amazon Glacier Deep Archive

Lowest-cost storage class and supports long-term retention and digital preservation for data that will be retained for 7-10 years and may be accessed once or twice in a year. Use case: It is designed for customers — particularly those in highly-regulated industries, such as the Financial Services, Healthcare, and Public Sectors — that retain data sets for 7-10 years or longer to meet regulatory compliance requirements. It can also be used for backup and disaster recovery use cases.

What is the storage level for .cache()?

MEMORY_ONLY (for an RDD; for a DataFrame/Dataset, .cache() defaults to MEMORY_AND_DISK)

How does map work on RDD in spark?

The map() operation applies a function to each element of an RDD and returns the result as a new RDD. Map is a transformation in Spark: it takes one element as input, processes it according to custom code, and returns one element at a time. Map transforms an RDD of length N into another RDD of length N.
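
A tiny illustration (assumes an existing SparkContext sc):

val nums = sc.parallelize(Seq(1, 2, 3, 4))   // RDD of length 4
val squared = nums.map(n => n * n)           // still length 4: one output element per input element
squared.collect()                            // Array(1, 4, 9, 16)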

Off-Heap Memory (External memory)

Off-heap memory means allocating memory objects (serialized to byte arrays) to memory outside the heap of the Java virtual machine (JVM). This memory is directly managed by the operating system (not the JVM) and stored outside the process heap in native memory, so it is not processed by the garbage collector. The result is a smaller heap, which reduces the impact of garbage collection on the application. Accessing this data is slightly slower than accessing on-heap storage, but still faster than reading/writing from disk. The downside is that the user has to manually deal with managing the allocated memory.

How many Spark context can be active in JVM?

Only one

What is the physical plan

The physical plan is an internal optimization in Spark. Once the optimized logical plan is created, the physical plan is generated. A physical plan specifies how the logical plan is going to be executed on the cluster. It generates different kinds of execution strategies and compares them, for example by estimating the execution time and resources taken by each strategy; the most optimal strategy is then selected.

PaaS

Platform as a Service (PaaS) removes the need for your organization to manage the underlying infrastructure (usually hardware and operating systems) and allows you to focus on the deployment and management of your applications. This helps you be more efficient as you don't need to worry about resource procurement, capacity planning, software maintenance, patching, or any of the other undifferentiated heavy lifting involved in running your application.

What is an RDD?

Resilient Distributed Datasets (RDDs) are the core concept in Spark. A Spark RDD is a fault-tolerant, distributed collection of data that can be operated on in parallel. Each RDD is split into multiple partitions, and Spark runs one task for each partition. Spark RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes. They are not the actual data; they are objects that contain information about the data residing on the cluster. RDDs enable fault-tolerant, distributed, in-memory computation.

What is the correct command utilized to transfer a local file into a EMR cluster

scp -i <private key> <local path> hadoop@<master public DNS>:<remote path>

Resolution for driver memory exceptions: Spark driver running out of memory, or the Application Master that launches the driver exceeds memory limits

Set a higher value for the driver memory using one of the following spark-submit command-line options: --conf spark.driver.memory=<XX>g or --driver-memory <XX>G

Saas

Software as a Service (SaaS) provides you with a completed product that is run and managed by the service provider. In most cases, people referring to Software as a Service are referring to end-user applications. With a SaaS offering you do not have to think about how the service is maintained or how the underlying infrastructure is managed; you only need to think about how you will use that particular piece of software.

Some levels have _SER, what does this mean?

Some levels of memory allow for the data to be stored in a serialized format, which will reduce the amount of memory space used to store the data, but will result in additional computing overhead in both the serialization and deserialization processes.

What is the SparkSession?

Spark 2.0 introduced a new entry point called SparkSession that essentially replaced both SQLContext and HiveContext. Additionally, it gives developers immediate access to the SparkContext. In order to create a SparkSession with Hive support, all you have to do is declare the Spark session with .enableHiveSupport().
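
A short sketch (the application name is hypothetical; Hive support requires Hive libraries on the classpath):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("example")
  .enableHiveSupport()         // Hive support, as described above
  .getOrCreate()

val sc = spark.sparkContext    // immediate access to the SparkContext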

How does spark partition data?

Spark automatically partitions data and distributes the partitions across the different nodes.

What are Persistence Storage Levels in Spark?

Spark has various persistence levels to store the RDDs on disk, in memory, or as a combination of both, with different replication levels, namely: MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER, DISK_ONLY, and OFF_HEAP.

Spark use cases

Spark is most effective for scenarios that involve the following:
- dealing with chains of parallel operations by using iterative algorithms
- achieving quick results with in-memory computations
- analyzing stream data in real time
- graph-parallel processing to model data
- ML applications

What does it mean to "spill to disk" when executing spark tasks?

Spark spills data when a given partition is too large to fit into the RAM of the executor. A spill is sometimes Spark's last resort to avoid an OutOfMemory exception. However, this comes at the cost of potentially expensive disk reads and writes: data moves from host RAM to host disk.

The different types of cluster managers

Standalone cluster manager, Hadoop YARN, and Apache Mesos

S3 Storage Classes

Standard, Intelligent-Tiering, Standard-IA, One Zone-IA, Glacier, Glacier Deep Archive.

Spark Memory Management is divided into two types

Static Memory Manager (static memory management) and Unified Memory Manager (unified memory management). The Unified Memory Manager is the default memory manager for Spark; the Static Memory Manager has been deprecated because of its lack of flexibility. In both memory managers, a portion of the Java heap is allocated for processing Spark applications, while the rest of the memory is reserved for Java class references and metadata usage. There is only one MemoryManager per JVM.

FileAlreadyExistsException in Spark job

The FileAlreadyExistsException error occurs in the following scenarios:
- failure of a previous task might leave behind files that trigger the FileAlreadyExistsException errors
- when the executor runs out of memory, the individual tasks of that executor are scheduled on another executor; as a result, the FileAlreadyExistsException error occurs
- when any Spark executor fails, Spark retries the task, which might result in a FileAlreadyExistsException error after the maximum number of retries
Resolution: Identify the original executor failure reason that causes the FileAlreadyExistsException error. Verify the size of the nodes in the cluster and upgrade them to the next tier to increase the Spark executor's memory overhead.

What is the Spark History Server?

The Spark history server is a web UI where you can view the status of running and completed Spark jobs on a provisioned instance of Analytics Engine powered by Apache Spark. If you want to analyse how different stages of your Spark job performed, you can view the details in the Spark history server UI. The history server shows only the running and the completed or completed but failed Spark jobs. It doesn't show the jobs for which the Spark application couldn't be started because, for example, the wrong arguments were passed or the number of passed arguments is less than the expected number.

SparkContext

The SparkContext is used by the driver process of the Spark application to establish communication with the cluster and the resource managers in order to coordinate and execute jobs. The SparkContext also enables access to the other two contexts, namely SQLContext and HiveContext. In order to create a SparkContext, you first need to create a Spark configuration (SparkConf), as shown below:

val sparkConf = new SparkConf().setAppName("app").setMaster("yarn")
val sc = new SparkContext(sparkConf)

In spark-shell, a SparkContext is already available through the variable called sc. With the SparkContext you can create a Hive or SQL context by passing the SparkContext in the parentheses:

val sqlContext = new SQLContext(sc)

IaaS (Infrastructure as a Service) responsibilities

The user is responsible for the application and the operating system.

standard storage class

The default, most expensive option, designed for general, all-purpose storage of frequently accessed data. Use cases: big data analysis, mobile and gaming applications, content distribution.

What is the lineage of an RDD?

The lineage of an RDD is the information an RDD stores about its construction that allows RDDs to maintain a level of fault tolerance. The lineage is made up of two parts, information about the data the RDD needs to read from during construction, as well as the transformations it must perform upon that data. This lineage forms the logical plan for the execution of an RDD.

What is spark.driver.memory? What about spark.executor.memory?

spark.driver.memory configures the total driver memory of a Spark application. spark.executor.memory configures the JVM heap of each executor. The memory of a Spark cluster worker node is shared between HDFS, YARN and other daemons, and the executors of Spark applications. The total memory of each executor container is the sum of the YARN overhead memory and the JVM heap memory, so to derive spark.executor.memory, the total memory available per executor is calculated first and then the memory overhead is subtracted from it.

Difference between repartition and coalesce

The repartition algorithm does a full shuffle of the data and creates equal sized partitions of data. coalesce combines existing partitions to avoid a full shuffle.

spark-submit

The spark-submit command is a utility to run or submit a Spark or PySpark application program (or job) to the cluster by specifying options and configurations; the application you are submitting can be written in Scala, Java, or Python (PySpark). spark-submit allows you to specify the cluster manager you are submitting the Spark application to, such as YARN, Kubernetes, Mesos, or standalone, and whether to submit the application in client or cluster deployment mode.

Application Job Stages Tasks

The user program built on Spark, consisting of a driver program and executors on the cluster, is the application. When you invoke an action on an RDD, a "job" is created; jobs are work submitted to Spark. Jobs are divided into "stages" based on shuffle boundaries. Each stage is further divided into tasks based on the number of partitions in the RDD, so tasks are the smallest units of work in Spark; each task is sent to one executor.

What is an action in Spark RDD?

An action triggers computation and sends results from the executors back to the driver (or writes them to external storage).

What are some benefits to using the cloud?

There are a number of benefits to using the cloud, including elasticity, high availability, and cost effectiveness. Since cloud resources are virtualized, they are highly elastic, meaning that the resources needed for a given job can be expanded as demand increases and contracted when they are no longer necessary. This elasticity leads to scalability, as the system expands to meet higher levels of demand, as well as high availability, as the service is guaranteed to be available at almost all times. Also, the ability to transfer the burden of physical resource storage and maintenance to the cloud provider means it is cheaper and more feasible to run large cluster computing operations.

Action vs transformation

There are two types of Apache Spark RDD operations: transformations and actions. A transformation is a function that produces a new RDD from existing RDDs; when we want to work with the actual dataset, an action is performed. When an action is triggered, no new RDD is formed, unlike with a transformation. Transformations are lazy in nature and execute only when we call an action. Actions are Spark RDD operations that give non-RDD values; the results of an action are returned to the driver or stored in an external storage system. An action is a way to send data from the executors to the driver.
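
A two-line illustration (assumes an existing SparkContext sc):

val nums = sc.parallelize(1 to 10)
val evens = nums.filter(_ % 2 == 0)   // transformation: lazy, returns a new RDD
val count = evens.count()             // action: triggers execution, returns a non-RDD value (5)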

Hadoop Yarn:

This cluster manager works as a distributed computing framework. It maintains job scheduling as well as resource management. In this cluster manager, masters and slaves are highly available. We can run it on Linux and even on Windows. Hadoop YARN is also known as MapReduce 2.0. It separates the functionality of resource management and job scheduling. We can optimize Hadoop jobs with the help of YARN. It is suited neither for long-running services nor for short-lived queries; YARN is suitable for jobs that can be restarted easily if they fail. YARN does not handle distributed file systems or databases.

Is it possible to launch a single-node cluster?

Yes

Unified Memory Manager

It was adopted to replace the Static Memory Manager and provide Spark with dynamic memory allocation. It allocates a region of memory as a unified memory container that is shared by storage and execution. When execution memory is not used, the storage memory can acquire all the available memory, and vice versa. If either storage or execution memory needs more space, a function called acquireMemory() expands one of the memory pools and shrinks the other.
Advantages:
- The boundary between storage memory and execution memory is not static; in case of memory pressure the boundary moves, i.e. one region grows by borrowing space from the other.
- When the application has no cache, execution uses all the memory to avoid unnecessary disk spills.
- When the application has a cache, it reserves the minimum storage memory so that cached data blocks are not affected.
- This approach provides reasonable performance for a variety of workloads without requiring expertise in how memory is divided internally.
In addition to the JVM memory types (on-heap memory and off-heap memory), there is one more segment of memory accessed by Spark: external process memory. This kind of memory is mainly used for PySpark and SparkR applications; it is the memory used by the Python/R process, which resides outside of the JVM.

Static Memory Manager (SMM)

It is the traditional model and a simple scheme for memory management. It statically divides memory into two fixed partitions. The sizes of storage memory, execution memory, and other memory are fixed during application processing, but users can configure them before the application starts. The static memory allocation method was eliminated in Spark 3.0. Advantage: the Static Memory Manager mechanism is simple to implement. Disadvantage: even when space is available in storage memory, it cannot be used for execution, so data spills to disk when execution memory is full (and vice versa).

Some levels have _2, what does this mean?

Replication. These storage levels cache your RDD partitions on multiple machines across the cluster. This can make jobs faster (due to data locality and redundant copies) but uses more resources.

A narrow transformation

is a transformation in which each output partition depends on only a single input partition (e.g., map or filter), so no shuffle is required.
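
A short contrast (assumes an existing SparkContext sc):

// narrow: each output partition depends on a single input partition (no shuffle)
val narrow = sc.parallelize(1 to 100).map(_ * 2).filter(_ > 10)

// wide: reduceByKey needs data from many partitions, so it triggers a shuffle
val wide = narrow.map(n => (n % 3, n)).reduceByKey(_ + _)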

Configure number of executors

spark.executor.instances: the number of executors for the Spark application. To calculate this property, we first determine the number of executors per node: spark.executor.instances = (executor_per_node * number_of_nodes) - 1.
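
A worked example of the formula (the cluster size is hypothetical):

// 3 executors per node on a 6-node cluster:
// spark.executor.instances = (3 * 6) - 1 = 17 (one slot is left for the YARN Application Master)
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.instances", "17")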

Can you launch a cluster of just worker nodes in EMR?

Yes

S3 security

• Buckets are private by default
• Can enable Access Control Lists
• Integrates with IAM
• Endpoints encrypted by SSL

S3 intelligent tiering

• Same low latency and high throughput performance as S3 Standard
• Small monthly monitoring and auto-tiering fee
• Automatically moves objects between two access tiers based on changing access patterns
• Designed for durability of 99.999999999% of objects across multiple Availability Zones
• Resilient against events that impact an entire Availability Zone
• Designed for 99.9% availability over a given year
Use cases: data with unknown or changing access patterns, automatic optimization of storage costs, and unpredictable workloads

