Apache Spark

Java virtual machine

A software "execution engine" that safely and compatibly executes the byte codes in Java class files on a microprocessor (whether in a computer or in another electronic device). When using Spark from Python or R, you don't write explicit JVM instructions; instead, you write Python and R code that Spark translates into code that it then can run on the executor JVMs.

Fault-tolerant computer systems

A system that will not fail to run if one of its components stops working properly.

MapReduce

A two-phase technique for harnessing the power of thousands of computers working in parallel. During the first phase, the Map phase, computers work on a task in parallel; during the second phase, the Reduce phase, the work of the separate computers is combined, eventually obtaining a single result. More specifically, the map procedure performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), and a reduce procedure performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
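
A minimal PySpark sketch of the two phases, counting name frequencies; the student names and the application name are illustrative, and a local SparkContext is assumed.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "name-count")  # hypothetical app name

    students = sc.parallelize(["Ann", "Bob", "Ann", "Cat", "Bob", "Ann"])

    # Map phase: each name is emitted as a (name, 1) pair, in parallel.
    pairs = students.map(lambda name: (name, 1))

    # Reduce phase: pairs with the same key are combined into a single count.
    counts = pairs.reduceByKey(lambda a, b: a + b)

    print(counts.collect())  # e.g. [('Ann', 3), ('Bob', 2), ('Cat', 1)]
    sc.stop()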

Parallel Computing

A type of computation in which many calculations or processes are carried out simultaneously.

Apache Spark

A unified computing engine and a set of libraries for parallel data processing on computer clusters.

batch processing

Accumulating transaction records into groups or batches for processing at a regular interval such as daily or weekly. The records are usually sorted into some sequence (such as numerically or alphabetically) before processing.

Spark Streaming

An API that allows for real-time processing of streaming data.

Spark Application process

Consists of two parts. Driver process: the driver node maintains information about the application, responds to user input, and distributes and schedules work across the executors. Executor process: each executor executes the code assigned to it and reports the state of the computation back to the driver node. The executors, for the most part, will always be running Spark code. However, the driver can be "driven" from a number of different languages through Spark's language APIs.

parallel vs distributed computing

Distributed computing is often used in tandem with parallel computing. Parallel computing on a single computer uses multiple processors to process tasks in parallel, whereas distributed parallel computing uses multiple computing devices to process those tasks.

SparkSession object

Entry point for running Spark code. When we actually go about writing our Spark Application, we need a way to send user commands and data to it; we do that by first creating a SparkSession. One way is to start Spark's local mode: running ./bin/spark-shell opens the Scala console for an interactive session, and ./bin/pyspark opens the Python console. This starts an interactive Spark Application with a SparkSession already created for you. In Python you'll see something like this: <pyspark.sql.session.SparkSession at 0x7efda4c1ccd0>
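
A minimal sketch of creating the SparkSession yourself in a standalone PySpark script (rather than relying on the interactive shells); local mode is assumed and the application name is illustrative.

    from pyspark.sql import SparkSession

    # Build (or reuse) a SparkSession in local mode.
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("demo-app")   # hypothetical application name
        .getOrCreate()
    )

    print(spark)                    # <pyspark.sql.session.SparkSession at 0x...>
    print(spark.range(10).count()) # 10 -- a tiny sanity check

    spark.stop()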

cluster manager

Grants resources to an application submitted by a data-processing engine such as Spark so that it can complete a task. Examples include YARN and Mesos.

Lazy Evaluation

Lazy evaluation means that Spark will wait until the very last moment to execute the graph of computation instructions. In Spark, instead of modifying the data immediately when you express some operation, you build up a plan of transformations that you would like to apply to your source data. By waiting until the last minute to execute the code, Spark compiles this plan from your raw DataFrame transformations to a streamlined physical plan that will run as efficiently as possible across the cluster. This provides immense benefits because Spark can optimize the entire data flow from end to end. An example of this is something called predicate pushdown on DataFrames. If we build a large Spark job but specify a filter at the end that only requires us to fetch one row from our source data, the most efficient way to execute this is to access the single record that we need. Spark will optimize this for us by pushing the filter down automatically.
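
A sketch of lazy evaluation in PySpark, assuming an existing SparkSession named spark and a hypothetical Parquet file path.

    # Nothing is read yet; this only records where the data lives.
    df = spark.read.parquet("/data/flights.parquet")   # hypothetical path

    # Transformations only build up a logical plan; no job runs here.
    filtered = df.select("origin", "dest", "count").where("origin = 'SFO'")

    # The physical plan shows the filter pushed down toward the data source.
    filtered.explain()

    # Only an action such as count() or collect() triggers execution.
    filtered.count()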

SparkContext

Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster. Note: Only one SparkContext should be active per JVM. You must stop() the active SparkContext before creating a new one.
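
A minimal sketch of creating and stopping a SparkContext directly in PySpark; the master URL and application name are illustrative.

    from pyspark import SparkConf, SparkContext

    # Only one SparkContext may be active per JVM; stop() it before making another.
    conf = SparkConf().setMaster("local[*]").setAppName("rdd-demo")
    sc = SparkContext(conf=conf)

    rdd = sc.parallelize(range(100))
    print(rdd.sum())  # 4950

    sc.stop()  # release the context so a new one can be created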

Types of Transformations

Narrow and wide, also referred to as narrow and wide dependencies. In a narrow dependency, each input partition contributes to at most one output partition (a one-to-one relationship); in a wide dependency, an input partition contributes to many output partitions (one-to-many), which requires a shuffle of data across the cluster.
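
A sketch contrasting the two kinds of transformations, assuming an existing SparkSession named spark.

    df = spark.range(1000)

    # Narrow: each input partition maps to exactly one output partition.
    evens = df.where("id % 2 = 0")                 # filter
    scaled = evens.selectExpr("id * 10 as x")      # map-like projection

    # Wide: input partitions contribute to many output partitions (a shuffle).
    grouped = scaled.groupBy("x").count()

    grouped.explain()  # the plan shows an Exchange (shuffle) for the wide dependency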

resilient distributed dataset (RDD)

Resilient Distributed Datasets (RDDs) are a fundamental data structure of Spark. An RDD is an immutable, distributed, fault-tolerant collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes. There are two types of operations that can be performed on an RDD. Transformations: operations that return a new RDD, such as join, union, filter, etc.; transformations build up a lineage, a series of RDDs in which each RDD maps to a new RDD via a transformation. Actions: operations that return a value, such as count, first, etc.; an action triggers execution of the lineage from the first RDD to the last. Spark makes use of the concept of RDDs to achieve faster and more efficient MapReduce operations.
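
A small sketch of RDD transformations versus actions, assuming an existing SparkContext named sc; the word list is illustrative.

    words = sc.parallelize(["spark", "rdd", "spark", "action"])

    # Transformations: return a new RDD and extend the lineage; nothing runs yet.
    upper = words.map(lambda w: w.upper())
    no_dupes = upper.distinct()

    # Actions: trigger execution of the whole lineage and return a value.
    print(no_dupes.count())    # 3
    print(no_dupes.collect())  # ['SPARK', 'RDD', 'ACTION'] (order may vary)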

Spark's Language APIs

Spark code can be run from Scala, Java, and Python. Spark itself is written primarily in Scala, which is its default language; however, Python supports nearly all of the constructs that Scala does. Spark also supports a subset of the ANSI SQL 2003 standard and two R libraries: SparkR, which is part of Spark core, and sparklyr, a community-driven R package.

Spark APIs

Spark has two fundamental sets of APIs: the low-level "unstructured" APIs and the high-level "structured" APIs.

spark vs hadoop

Spark is generally faster at processing data, largely because it keeps intermediate results in memory, whereas Hadoop MapReduce writes them to disk between stages.

GraphX

Spark's graph computation engine and library for graph analytics.

local mode

Spark, in addition to its cluster mode, also has a local mode. The driver and executors are simply processes, which means that they can live on the same machine or different machines. In local mode, the driver and executors run (as threads) on your individual computer instead of a cluster. The examples here assume local mode, so everything should run on a single machine.

partition

To allow every executor to perform work in parallel, Spark breaks up the data into chunks called partitions. A partition is a collection of rows that sit on one physical machine in your cluster. A DataFrame's partitions represent how the data is physically distributed across the cluster of machines during execution. If you have one partition, Spark will have a parallelism of only one, even if you have thousands of executors. If you have many partitions but only one executor, Spark will still have a parallelism of only one because there is only one computation resource. An important thing to note is that with DataFrames you do not (for the most part) manipulate partitions manually or individually. You simply specify high-level transformations of data in the physical partitions, and Spark determines how this work will actually execute on the cluster.
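
A sketch of inspecting and changing a DataFrame's partitioning, assuming an existing SparkSession named spark; the row count and partition count are arbitrary.

    df = spark.range(0, 1_000_000)

    print(df.rdd.getNumPartitions())   # how the data is currently split up

    # repartition() is a wide transformation: it shuffles rows into new chunks.
    repartitioned = df.repartition(8)
    print(repartitioned.rdd.getNumPartitions())  # 8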

Spark UI

You can monitor the progress of a job through the Spark web UI. The Spark UI is available on port 4040 of the driver node; if you are running in local mode, this is http://localhost:4040. The Spark UI displays information on the state of your Spark jobs, their environment, and the cluster state. It's very useful, especially for tuning and debugging.
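
A tiny sketch of locating the UI from PySpark, assuming an existing SparkSession named spark; uiWebUrl is the SparkContext property that reports the driver's UI address.

    # Prints something like http://localhost:4040 in local mode.
    print(spark.sparkContext.uiWebUrl)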

Application Programming Interface (API)

a set of routines, protocols, and tools for building software applications.

distributed computing

Processes and manages algorithms across many machines in a computing environment.

DataFrame

A DataFrame is the most common Structured API and simply represents a table of data with rows and columns. The list that defines the columns and the types within those columns is called the schema. You can think of a DataFrame as a spreadsheet with named columns. The fundamental difference is that a spreadsheet sits on one computer in one specific location, whereas a Spark DataFrame can span thousands of computers. The reason for putting the data on more than one computer should be intuitive: either the data is too large to fit on one machine or it would simply take too long to perform that computation on one machine.
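
A minimal sketch of creating a small DataFrame with a named schema, assuming an existing SparkSession named spark; the rows and column names are illustrative.

    rows = [("Alice", 34), ("Bob", 41)]
    df = spark.createDataFrame(rows, schema=["name", "age"])

    df.printSchema()   # the schema: column names and types
    df.show()          # the rows, rendered as a small table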

.explain

A DataFrame method that returns its lineage, that is, how Spark will execute the query based on the transformations we've performed on the DataFrame.
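
A short sketch of .explain in PySpark, assuming an existing SparkSession named spark.

    df = spark.range(100).where("id > 50").selectExpr("id * 2 as doubled")

    # Prints the physical plan Spark will use to execute these transformations.
    df.explain()

    # Passing True also prints the parsed, analyzed, and optimized logical plans.
    df.explain(True)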

spark-submit

A built-in command-line tool that sends your application to a cluster to be executed.
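
A sketch of a standalone script you could submit with spark-submit; the file name, master URL, and pi-estimation workload are all hypothetical stand-ins.

    # pi_app.py -- submit with, e.g.:
    #   ./bin/spark-submit --master local[4] pi_app.py
    from random import random
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pi-estimate").getOrCreate()

    def inside(_):
        # Throw one random dart at the unit square; 1 if it lands inside the circle.
        x, y = random(), random()
        return 1 if x * x + y * y <= 1.0 else 0

    n = 1_000_000
    count = spark.sparkContext.parallelize(range(n)).map(inside).sum()
    print("pi is roughly", 4.0 * count / n)

    spark.stop()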

Hadoop

A collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.

databricks

A company started by the creators of Spark that hosts a cloud-based Spark service and learning environment. If you would like a simple interactive notebook experience in the cloud for running Spark, Databricks offers a free Community Edition.

cluster

A group of computers working together to complete a single task that a single computer cannot handle on its own, such as large-scale data processing.

Hadoop Distributed File System (HDFS)

A highly distributed, fault-tolerant file storage system designed to manage large amounts of data at high speeds.

MLlib

Spark's scalable machine learning library, accessible from its various language APIs.

