Data Science - Spark Preliminaries
RDD can be shared between SparkContexts.
false
Spark SQL does not provide support for both reading and writing Parquet files.
false
Spark is 100x faster than MapReduce due to development in Scala.
false (it's UP TO 100x faster)
Do you need to install Spark on all nodes of the YARN cluster while running Spark on YARN?
no, because Spark runs on top of YARN
In which year was Apache Spark made an open-source technology?
2010
Which action returns all the elements of the dataset as an array?
collect()
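A minimal PySpark sketch of collect (assuming an active SparkContext available as sc; the data is made up):
rdd = sc.parallelize([1, 2, 3, 4])   # small example RDD
rdd.collect()                        # returns [1, 2, 3, 4] to the driver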
Parquet stores nested data structures in a flat ________ format.
columnar
Column Names and Count (Rows and Columns) (pyspark)
for finding the column names and counting the number of rows and columns:
# column names:
df.columns
['ID', 'Name', 'age', 'nationality', 'Overall', 'Potential', 'Club', 'Value', 'Wage', 'Special']
# row count:
df.count()        # 17884
# column count:
len(df.columns)   # 10
RDD Op example
lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)
- the first line defines a base RDD from an external file
- the second line defines lineLengths as the result of a map transformation
- finally, the third line runs reduce, which is an action
deployment modes
choose the mode with the '--deploy-mode' flag
1. client - the driver runs on a dedicated server (e.g. an edge node) inside a dedicated process; the submitter starts the driver outside of the cluster
2. cluster - the driver runs on one of the cluster's worker nodes; the master selects the worker, and the driver operates as a dedicated, standalone process inside it
components of spark
programming languages: Scala, R, Java, Python
libraries: Spark SQL, MLlib, GraphX, Spark Streaming
engine: Spark Core
cluster management: Hadoop YARN, Apache Mesos, Spark scheduler
storage: HDFS, standalone node, cloud, RDBMS/NoSQL
important classes of Spark SQL and DataFrames
pyspark.sql.SparkSession : main entry point for DataFrame and Spark SQL functionality
pyspark.sql.DataFrame : a distributed collection of data grouped into named columns
pyspark.sql.Column : a column expression in a DataFrame
pyspark.sql.Row : a row of data in a DataFrame
pyspark.sql.GroupedData : aggregation methods, returned by DataFrame.groupBy()
pyspark.sql.DataFrameNaFunctions : methods for handling missing data (null values)
pyspark.sql.DataFrameStatFunctions : methods for statistics functionality
pyspark.sql.functions : list of built-in functions available for DataFrame
pyspark.sql.types : list of data types available
pyspark.sql.Window : for working with window functions
spark streaming
the spark component that enables processing of live streams of data, e.g. log files created by production web servers, or queues of messages containing status updates posted by users of a web service
Which of the following Scala statements would be most appropriate to load the data (sfpd.txt) into an RDD? Assume that SparkContext is available as the variable sc and SQLContext as the variable sqlContext.
val sfpd = sc.textFile("/path to file/sfpd.txt")
named accumulators
you can create unnamed or named accumulators as a user. A named accumulator (here, named 'counter') will be displayed in the web UI for the stage that modifies it. Spark shows the value of each accumulator modified by a task in the 'Tasks' table
reading a parquet file
# here, loading a json file into a DataFrame:
df = spark.read.json("path of file")
# saving the dataframe in parquet format:
df.write.parquet("parquet file name")
# verifying the result by loading it back in parquet format:
pf = spark.read.parquet("parquet file name")
# to view the DataFrame, use the show() method
MEMORY_AND_DISK
(persistence level/storage level that can be assigned to an RDD) store RDD as deserialized java objects in the JVM. If the RDD doesn't fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed
The following languages are supported for Spark Development, except ________
C++
spark vs. MapReduce
DIFFICULTY: spark is simpler to program and doesn't require any abstractions, while MR is hard to program and needs abstractions
INTERACTIVITY: spark provides an interactive mode, while MR has no inbuilt interactive mode (except via Pig and Hive)
STREAMING: MR offers batch processing on historical data, while spark provides streaming of data and processing in real time
LATENCY: spark caches partial results in the memory of its distributed workers, thereby ensuring lower-latency computations; MR is disk-oriented, unlike spark
SPEED: spark places the data in memory, storing it in Resilient Distributed Datasets (RDDs); spark is up to 100x quicker than hadoop MR for big data processing
Which among the following is an example of Action?
foreach(func)
MEMORY_ONLY_2, DISK_ONLY_2, etc.
(persistence level/storage level that can be assigned to an RDD) same as the levels above, but replicate each partition on two cluster nodes
generic load/save functions
* in most cases, the default data source will be used for all ops. The file path can be on the local machine or HDFS
df = spark.read.load("file path")   # spark loads the data source from the defined file path
df.select("column name", "column name").write.save("file name")   # the dataframe is saved in the defined format
# by default, it is saved in the spark warehouse
data sources
- Spark SQL supports operating on a variety of data sources through the DataFrame interface
- a DataFrame can be operated on using relational transformations and can also be used to create a temporary view
- registering a DataFrame as a temporary view allows you to run SQL queries over its data
- loading and saving data while manually specifying options is also supported:
  - you can manually specify the data source to use, along with any extra options to pass to it
  - data sources are specified by their fully qualified names, but for built-in sources you can also use their short names: json, parquet, jdbc, orc, libsvm, csv, text
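A short sketch of manually specifying a source and options (the file paths are placeholders):
# read a CSV by naming the built-in source explicitly and passing an option
df = spark.read.format("csv").option("header", "true").load("path-of-file/data.csv")
# write it back out using another built-in source's short name
df.write.format("parquet").save("path-of-file/output-parquet")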
features of Spark SQL
- provides DataFrame abstraction in Scala, Java, and Python - Spark SQL can read and write data from Hive Tables, JSON, and Parquet in various structured formats - data can be queried by using Spark SQL *capabilities of using structured and semi-structured data
converting DataFrames
DataFrames loaded from any type of data can be converted to other types using the syntax below. For example, a json file can be loaded like:
df = spark.read.load("path of json file", format="json")
An instance of the Spark SQL execution engine that integrates with the data stored in Hive is _________
HiveContext
By default, Spark uses which algorithm to remove old and unused RDD to release more memory?
Least Recently Used (LRU)
interactive analysis
MAPREDUCE: supports batch processing
SPARK: processes data quicker and can thereby serve exploratory queries without sampling
Solving interactive problems : Spark vs MapReduce
MAPREDUCE: the same data is repeatedly read from disk for different queries
SPARK: disk -> (one-time processing) to RAM -> splits into multiple queries, each producing its own result
- the input is read just once into memory, where the different queries act on the data to give their results
Solving iterative problems: Spark vs. MapReduce
MAPREDUCE: disk -> MR -> disk -> MR -> disk
- each iteration is stored to disk and then read back for the next round of processing
SPARK: disk -> spark -> RAM -> spark -> RAM -> disk
- results can be kept in RAM and fetched easily for each iteration; there is no disk-I/O-related latency
Sorting Data (with OrderBy) (pyspark)
orderBy method:
df.filter((df.Club=='FC Barcelona') & (df.Nationality=='Spain')).orderBy('ID').show(5)
by default, pyspark sorts in ascending order, but we can switch to descending:
df.filter((df.Club=='FC Barcelona') & (df.Nationality=='Spain')).orderBy('ID', ascending=False).show(5)
lineage graph
RDDs maintain a graph of one RDD transforming into another, which allows Spark to recompute any RDD (or its lost partitions) in the event of failures. This is how Spark achieves fault tolerance
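In PySpark the lineage of an RDD can be inspected with toDebugString (a sketch; data.txt is a placeholder path and sc is assumed):
rdd = sc.textFile("data.txt").map(lambda s: len(s)).filter(lambda n: n > 0)
rdd.toDebugString()   # describes this RDD and its recursive dependencies, i.e. the lineage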
How can you create an RDD for a text file?
SparkContext.textFile
Apache Parquet
a columnar storage format available to all projects in the Hadoop ecosystem, irrespective of the choice of data-processing framework, data model, or programming language used
- stores nested data structures in a flat columnar format
- column-oriented storage makes it more efficient than row-oriented storage
- serves both efficiency and performance in both storage and processing
- spark SQL provides support for both reading and writing Parquet files
- automatic conversion to nullable occurs when one tries to write Parquet files; this is done for compatibility reasons
CSV file
a file format which allows the user to store the data in tabular format. csv = 'comma-separated values' its data fields are most often separated, or delimited, by a comma
Which of the statement(s) is/are true about Spark?
all
- supports real-time processing
- caches data in-memory and ensures low latency
- saves data in memory with the use of RDDs
Spark Properties can be configured using ___________
all
- SparkConf
- --conf (the command-line option used by spark-shell and spark-submit)
- conf/spark-defaults.conf (the defaults file)
Which among the following is a/are feature(s) of DataFrames?
all distributed immutable lazy evals
What is an Accumulator?
all - can be efficiently supported in parallel - variables that are only added through an associative and commutative operation - used to implement counters
Which of the following is true of caching the RDD?
all - rdd.persist(MEMORY_ONLY) is the same as rdd.cache() - use rdd.cache() to cache the RDD - cache behavior depends on available memory. If there is not enough memory, action will reload from the file instead of the cache - when there is branching in lineage, it is advisable to cache the RDD
Which type of processing can Apache Spark handle?
all: graph processing, interactive processing, batch processing, stream processing
Spark has APIs in __________
all (scala, java, python)
What is/are the benefit(s) of using appropriate file formats in Spark?
all : - faster accessing during read and write - schema oriented - more compression support
In Spark-Shell, which context(s) is/are available by default?
both - SparkContext - SQLContext
Choose the correct statement(s) about Spark Context.
both - interacts with cluster manager - specifies spark how to access cluster
Which is the method to create RDD in Spark?
by parallelizing a collection in your Driver program and by loading an external dataset from external storage like HDFS, HBase, shared file system
describing a different column (pyspark)
df.describe('Age').show()
statistical params (results):
+-------+------------------+
|summary|               Age|
+-------+------------------+
|  count|             17981|
|   mean|25.144541460430453|
| stddev| 4.614272345005111|
|    min|                16|
|    max|                47|
+-------+------------------+
Filter data with AND /OR conditions (pyspark)
df.filter((df.Club=='FC Barcelona') & (df.Nationality=='Spain')).show(3)
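An OR condition works the same way with the | operator (a sketch on the same hypothetical FIFA DataFrame):
df.filter((df.Club=='FC Barcelona') | (df.Club=='Real Madrid')).show(3)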
checking the Schema (pyspark)
df.printSchema()
results look something like:
root
 |-- ID: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Nationality: string (nullable = true)
 |-- Overall: integer (nullable = true)
 |-- Potential: integer (nullable = true)
 |-- Club: string (nullable = true)
 |-- Value: string (nullable = true)
 |-- Wage: string (nullable = true)
 |-- Special: integer (nullable = true)
df.show()
displays results in the pyspark shell, like:
+---------+--------+---+---------+
|firstName|lastName|age|telephone|
+---------+--------+---+---------+
|    David|  Julian| 22|   100000|
|     Mark|    Webb| 23|   658545|
+---------+--------+---+---------+
dataframe features
distributed: makes it fault tolerant and a highly available data structure
lazy evaluation: an evaluation strategy that holds the evaluation of an expression until its value is needed
immutable: makes it an object whose state cannot be modified after it is created
What is meant by RDD Lazy Evaluation?
even the base RDD is not created until an action
creating a DataFrame in PySpark
from pyspark.sql import *

Student = Row("firstName", "lastName", "age", "telephone")
s1 = Student('David', 'Julian', 22, 1000000)
s2 = Student('Mark', 'Webb', 23, 24353)
StudentData = [s1, s2]
df = spark.createDataFrame(StudentData)
df.show()
Which among the following is an example of Transformation?
groupByKey([numPartitions])
The number of partitions of a RDD can be controlled using _________
both
- coalesce
- repartition
SQL on DataFrames
the sql function on a SQLContext allows apps to run SQL queries programmatically and returns the result as a DataFrame
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("home/spark/input.json")
df.registerTempTable("students")
val teenagers = sqlContext.sql("SELECT name, age FROM students WHERE age >= 13 AND age <= 19")
Schema of DataFrame (pyspark)
the structure of the DataFrame. Use the printSchema method to check the schema; it lists the columns in the DataFrame along with their datatypes and nullable conditions
Which is true of a broadcast variable?
to cache a value in memory on all nodes and it's a shared variable
describing a particular column (pyspark)
to get the summary of any particular column, use the describe method. This method gives the statistical summary of the given column; if no column is specified, it gives the statistical summary of the whole DataFrame
df.describe('Name').show()
+-------+-------------+
|summary|         Name|
+-------+-------------+
|  count|        17981|
|   mean|         null|
| stddev|         null|
|    min|     A. Abbas|
|    max|Óscar Whalley|
+-------+-------------+
csv loading (pyspark)
to load a csv data set, the user has to make use of the spark.read.csv method to load it into a DataFrame. Here, we load a soccer player dataset using the spark csv reader:
df = spark.read.csv("path-of-file/fifa_players.csv", inferSchema=True, header=True)
inferSchema (default false): infers the input schema automatically from the data
header (default false): uses the first line as column names
to verify, we run:
df.show(2)
where the argument 2 displays the first 2 rows of the resulting DataFrame
2 operations supported by RDDs
transformations: create a new dataset from an existing one. E.g. map is a transformation that passes each dataset element through a function and returns a new RDD representing the results
actions: return a value to the driver program after running a computation on the dataset. E.g. reduce is an action that aggregates all RDD elements using some function and then returns the final result to the driver program
Parallelized collections are created by calling SparkContext's parallelize method on an existing iterable or collection in driver program.
true
RDDs can also be unpersisted to remove RDD from permanent storage like memory and/or disk.
true
Spark caches the data automatically in the memory as and when needed.
true
Transformations are computed lazily.
true
filtering data (pyspark)
use the filter command
df.filter(df.Club=='FC Barcelona').show(3)
results are all columns for rows where Club == 'FC Barcelona'
selecting multiple columns (pyspark)
use the select method for selecting particular columns from the DataFrame.
df.select('column name 1', 'column name 2', ..., 'column name n').show()
*show() is optional*
load the result into another DataFrame by simply equating:
dfnew = df.select('col 1', 'col 2', ..., 'col n')
ex:
dfnew = df.select('ID', 'Name')
# verifying result
dfnew.show(5)
# results
+------+-----------------+
|    ID|             Name|
+------+-----------------+
| 20801|Cristiano Ronaldo|
|158023|         L. Messi|
|190871|           Neymar|
|176580|        L. Suárez|
|167495|         M. Neuer|
+------+-----------------+
creating DataFrames
can be created from a variety of sources, like:
- existing RDDs
- external databases
- tables in Hive
- structured data files
with an SQLContext, spark is compatible with:
- hive data
- RDDs
- cassandra data
- parquet data
- xml data
- rdbms data
- json data
- csv data
the subsequent example creates a DF based on the content of a JSON file:
val sc: SparkContext // an existing SparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("home/spark/input.json")
// shows the content of the DF to stdout
df.show()
commonly used options
--class : entry point for the app (e.g. org.apache.spark.examples.SparkPi)
--master : master URL for the cluster (e.g. spark://23.195.26.187:7077)
--deploy-mode : whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
--conf : arbitrary Spark configuration property in key=value format
application-jar : path to a bundled jar with the app and dependencies
application-arguments : arguments passed to the main method of your main class, if any
data load from hive to spark
consider the following example of an employee record in a text file named employee.txt. We will first create a hive table, load the employee record data into it using HiveQL, and apply some queries to it.
use the following command to initialize the HiveContext in the Spark shell:
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
now create a table named employee with the fields id, name, and age using HQL:
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS employee(id INT, name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")
load the employee data into the employee table in hive:
scala> sqlContext.sql("LOAD DATA LOCAL INPATH 'employee.txt' INTO TABLE employee")
fetch all records using a HiveQL select query:
scala> val result = sqlContext.sql("FROM employee SELECT id, name, age")
to show the record data, call the show() method on the result DataFrame:
scala> result.show()
built-in spark sources
hive java h2 hdfs parquet mySql json amazon webservices s3 (aws)
MEMORY_AND_DISK_SER
(persistence level/storage level that can be assigned to an RDD) similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed
MEMORY_ONLY_SER
(persistence level/storage level that can be assigned to an RDD) store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read
MEMORY_ONLY
(persistence level/storage level that can be assigned to an RDD) store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
DISK_ONLY
(persistence level/storage level that can be assigned to an RDD) store the RDD partitions only on disk
dataset
- a DataFrame is a Dataset organized into named columns
- a new interface added in Spark 1.6, which provides the advantages of RDDs combined with the advantages of Spark SQL's optimized execution engine
- it's an immutable, strongly-typed set of objects that are mapped to a relational schema
- acts as the new abstraction layer for Spark from Spark 2.0 onward
how to assign a storage level
- persist an RDD to a storage level:
result = input.map(<Computation>)
result.persist(LEVEL)
- by default, spark uses the Least Recently Used (LRU) algorithm to remove old and unused RDDs to release more memory
- we can also manually remove remaining RDDs from memory by using unpersist()
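A PySpark sketch of the same idea (input_rdd and the computation are placeholders):
from pyspark import StorageLevel

result = input_rdd.map(lambda x: x * 2)        # placeholder computation
result.persist(StorageLevel.MEMORY_AND_DISK)   # assign a storage level
result.count()                                 # the first action materializes the cache
result.unpersist()                             # manually remove it from memory/disk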
broadcast variables
- enables the programmer to keep a read-only variable cached on each machine instead of shipping a copy of it with tasks
- created from a variable v by calling SparkContext.broadcast(v)
- its value can be accessed by calling the value method. The subsequent code shows this:
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3, 4, 5))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3, 4, 5)
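A PySpark equivalent (a sketch, assuming sc):
broadcast_var = sc.broadcast([1, 2, 3, 4, 5])   # read-only value cached on each machine
broadcast_var.value                             # [1, 2, 3, 4, 5]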
Spark history
- founded by Matei Zaharia at AMPLab in UC Berkeley in 2009. Later open-sourced under BSD license in 2010 - donated to Apache in 2013 then licensed to Apache 2.0 - recognized as top-level Apache project in Feb. 2014 - Matei's company Databricks created a new world record using it in large scale sorting - Spark 2 was launched in June 2016
transformations
- functions that take an RDD as the input and return one or more RDDs as the output
- 'randomSplit', 'cogroup', 'join', 'reduceByKey', 'filter', and 'map' are a few examples of transformations
- they don't change the input RDD, but always create one or more new RDDs by applying the computations they represent
- by using these, you incrementally build an RDD lineage with all the parent RDDs of the final RDD
- they're lazy, i.e. they aren't run immediately; they're done on demand
- they're executed only after an action is called
spark core
- includes the primary functionality of spark, namely components for task scheduling, fault recovery, memory management, interacting with storage systems, etc.
- home to the API that represents RDDs, which are the primary programming abstraction of spark
benefits of Dataset APIs
1. static-typing and runtime type-safety
2. high-level abstraction and custom view into structured and semi-structured data
3. higher performance and optimization
The following are characteristics shared by Hadoop and Spark, except ___________
Both have their own file system
The following are storage levels in Spark, except ________
HEAP_AND_DISK
The programming paradigm used in Spark is __________. - none of the options - generalized - map only - mapreduce
generalized
Which tells Spark how and where to access a cluster?
Spark Context
Which is responsible for task scheduling and memory management?
Spark Core
GraphX
a library for performing graph-parallel computations and manipulating graphs
Spark can integrate with which of the following data storage systems?
all - google cloud - cassandra - hive
RDD is ____________
all immutable recomputable fault-tolerant
Which of the following is true of caching the RDD?
all - rdd.persist(MEMORY_ONLY) is the same as rdd.cache() - use rdd.cache() to cache the RDD - cache behavior depends on available memory. If there is not enough memory, action will reload from the file instead of the cache
Identify the correct transformation.
all : map join filter
Which of the following file formats is/are supported by Spark?
all : parquet json sequence file csv
Spark can store its data in ____________
all: hdfs cassandra mongoDB
Apache Spark has which of the following capabilities? - all - distributing - monitoring - scheduling
all (distributing, monitoring, scheduling)
Spark
an open-source cluster computing framework that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Serves as a 'general-purpose' and 'fast cluster computing platform'
- runs computations in memory & provides a quicker system for complex applications operating on disk
- covers various workloads that would otherwise need dedicated distributed systems, namely streaming, interactive queries, iterative algorithms, and batch applications
external spark sources
apache hbase cassandra avro dbase aws redshift csv elasticsearch
dataframes
data organized into named columns
- distributed
- immutable
- lazy evaluation
can be defined as a data structure that is tabular in nature. It represents rows, each consisting of a number of observations.
similar to an RDD, a DataFrame is an immutable, distributed collection of data. Widely used for processing large collections of structured or semi-structured data.
- can handle petabytes of data
- supports a wide range of data formats for reading and writing
unlike an RDD, data is arranged into named columns, similar to a table in a relational database.
created to make processing simpler: DataFrames permit developers to impose a structure onto a distributed collection of data, enabling a higher-level abstraction
rows: can have a variety of data formats (heterogeneous)
columns: can only have data of the same data type (homogeneous)
they mainly contain metadata in addition to data, like column and row names
- introduced in Spark 1.3
- used with statistical and math functions, which is an important aspect in modern data science
Choose the correct statement. 1. execution starts with the call of transformation 2. all transformations and actions are lazily evaluated 3. none of the options 4. execution starts with the call of action
execution starts with the call of action
examples of Transformations
filter(func) : returns a new dataset (RDD) created by choosing the elements of the source on which the function returns true
map(func) : passes each element of the RDD through the supplied function
union() : the new RDD contains the elements from the source RDD and the argument RDD
intersection() : the new RDD includes only the elements common to the source RDD and the argument RDD
cartesian() : the new RDD is the cross product of all elements from the source RDD and the argument RDD
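A small PySpark sketch of a few of these (assuming sc; the numbers are made up):
a = sc.parallelize([1, 2, 3])
b = sc.parallelize([3, 4, 5])
a.filter(lambda x: x > 1).collect()   # [2, 3]
a.map(lambda x: x * 10).collect()     # [10, 20, 30]
a.union(b).collect()                  # [1, 2, 3, 3, 4, 5]
a.intersection(b).collect()           # [3]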
open-source/commercial third-party data storage systems compatible with Spark
google cloud elastic search jdbc apache cassandra apache hadoop (hdfs) apache hbase apache hive
creating spark session
it can be built using a builder pattern. The builder will automatically reuse an existing SparkContext if one exists, and create one if it does not.
in the Spark command prompt:
import org.apache.spark.sql.SparkSession
val dataLocation = "file:${system:user.dir}/spark-data"
// Create a SparkSession
val spark = SparkSession
  .builder()
  .appName("SparkSessionExample")
  .config("spark.sql.data.dir", dataLocation)
  .enableHiveSupport()
  .getOrCreate()
or as a class in PySpark:
from pyspark.sql import SparkSession
spark = SparkSession \
  .builder \
  .appName("data frame example") \
  .config("spark.some.config.option", "some-value") \
  .getOrCreate()
or in Spark SQL:
val sparkSession = SparkSession.builder.master("local").appName("Spark session in Fresco").getOrCreate()
configuring properties
once the SparkSession is instantiated, you can configure Spark's runtime config properties. E.g., in this code snippet we alter the existing runtime config options:
// set new runtime options
spark.conf.set("spark.executor.memory", "1g")
spark.conf.set("spark.sql.shuffle.partitions", 4)
// get all settings
val configMap: Map[String, String] = spark.conf.getAll()
MapReduce's example 'Word Count'
showing how Spark offers support for 'operator chaining', which is handy when doing pre- or post-processing on data, like filtering data before running a complex MapReduce job:
val file = sc.textFile("hdfs://.../wordcounts-*.gz")
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://.../wordcountOutput")
MLlib
spark ships with a library of common machine learning (ML) functionality, named MLlib. MLlib offers many types of machine learning algorithms, namely collaborative filtering, clustering, regression, and classification.
Spark ecosystem
spark core streaming SQL GraphX MLlib
spark app structure
spark driver : user code runs in the driver process; it sends tasks to executors for processing data
- the driver launches executors in the cluster (via YARN/Mesos/Spark standalone), so the driver points to multiple executors
- a single driver process and a collection of executor processes scattered over the nodes of the cluster
- both the executors and the driver usually run as long as the app runs
machine learning with spark
spark is provided with a scalable ML library, MLlib, which executes advanced analytics on iterative problems. few of the critical analytics jobs such as sentiment analysis, customer segmentation, and predictive analysis make spark an intelligent tech
access to underlying SparkContext
sparkSession.sparkContext returns the underlying SparkContext, used for building RDDs and managing cluster resources
spark.sparkContext
res17: org.apache.spark.SparkContext = org.apache.spark.SparkContext@2debe9ac
running SQL queries
SparkSession is the entry point for reading data, akin to the old SQLContext.read. It can be used to execute SQL queries over data, getting the results back as a DataFrame
val jsonData = spark.read.json("/home/user/employee.json")
jsonData.createOrReplaceTempView("employee")   // register the DataFrame so it can be queried by name
display(spark.sql("select * from employee"))   // display is a notebook helper; .show() works in the shell
What kind of data can be handled by Spark?
all (structured, semi-structured, and unstructured) - 'structured' alone was marked incorrect
Spark supports loading data from Hbase.
true
shared variables
usually, when a function passed to a spark operation is run on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to every machine, and no updates to the variables on the remote machines are propagated back to the driver program
spark offers two limited types of shared variables for two common usage patterns:
- accumulators
- broadcast variables
spark in companies
Uber : deploys HDFS, spark streaming, and Kafka for developing a continuous ETL pipeline
Conviva : uses spark for handling live traffic and optimizing the videos
Pinterest : deploys spark streaming to gain insight into customer engagement
Event detection
streaming functionality of spark permits organizations to monitor unusual behaviors for protecting systems. Health/security organizations and financial institutions utilize triggers to detect potential risks.
file formats supported by spark
text json csv sequence file parquet Hadoop inputOutput format
The number of stages in a job is usually equal to the number of RDDs in the DAG. However, the scheduler can truncate the lineage when _______
the RDD is cached or persisted
hive integration
- hive comes packaged with the Spark library as HiveContext, which inherits from SQLContext. Using HiveContext, you can create and find tables in the HiveMetaStore and write queries on them using HiveQL
- when hive-site.xml is not configured, the context automatically creates a metastore named metastore_db and a folder called warehouse in the current directory
RDD caching and persisting
- in spark, you may use a few RDDs multiple times. Repeating the same RDD evaluation every time it is needed or acted upon can be time-consuming and memory-consuming, especially for iterative algorithms that look at data multiple times
- to resolve the problem of repeated computation, caching or persistence came into the picture
- RDDs can be cached with the help of the cache operation. They can also be persisted using the persist operation
- cache persists with the default storage level MEMORY_ONLY
- RDDs can also be unpersisted to remove them from permanent storage like memory and disk
applications of spark
- interactive analysis - event detection - machine learning with MLlib
SparkSession - a New Entry Point
- introduced in Apache Spark 2.0
- *offers a single point of entry* to interact with underlying Spark functionality and enables programming spark with the Dataset and DataFrame APIs
- in previous spark versions, SparkContext was the entry point for spark. For streaming you required a StreamingContext, for hive a HiveContext, and for SQL an SQLContext
- as the DataFrame and Dataset APIs are the new standards, Spark 2.0 features SparkSession as the new entry point
- SparkSession is a combination of HiveContext, StreamingContext, and SQLContext - all the APIs available on those contexts are available on SparkSession as well. It internally has a SparkContext for actual computation
RDD (Resilient Distributed Datasets)
- known as the main abstraction in Spark - a partitioned collection of objects spread across a cluster, and can be persisted in memory or on disk - once created, RDDs are immutable
accumulators
- variables that are only 'added' to via an associative and commutative operation and can, hence, be efficiently supported in parallel
- they can be used to implement sums or counters
- programmers can add support for new types
- spark natively offers support for accumulators of numeric types
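A PySpark counter sketch (assuming sc; named accumulators shown in the web UI are created through the Scala/Java API):
acc = sc.accumulator(0)                                     # numeric accumulator starting at 0
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: acc.add(x))  # tasks only add to it
acc.value                                                   # 10, read back on the driver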
SparkContext
- main entry point for spark functionality
- can be used to create broadcast variables, RDDs, and accumulators, and denotes the connection to a Spark cluster
- to create one, you first have to create a SparkConf object that includes details about your app
- the spark driver program uses SparkContext to connect to the cluster manager for resource allocation and job submission, and to know which resource manager (YARN, Mesos, or standalone) to communicate with
- via SparkContext, the driver can access other contexts like StreamingContext, HiveContext, and SQLContext to program spark
**there may be only ONE SparkContext active per JVM** - before creating a new one, you have to stop() the active SparkContext
in the spark shell, a special interpreter-aware SparkContext is already created for you in the variable named sc; otherwise, you create one yourself:
val sc = new SparkContext(conf)
features of RDDs
- resilient - dataset - distributed RESILIENT : i.e. tolerant to faults using RDD lineage graph and therefore ready to recompute damaged or missing partitions due to node failures DATASET : a set of partitioned data with primitive values or values of values, for example, records or tuples DISTRIBUTED : with data remaining on multiple nodes in a cluster
actions
- return concluding results of RDD computations - they trigger execution utilizing 'lineage graph' to load the data into original RDD, and then execute all intermediate transformations and write final results out to file system or return it to Driver program - 'count', 'collect', 'reduce', 'take', and 'first' are few actions in spark
running a spark job
- the spark-submit program is initiated with the spark driver; it creates a logical DAG
- the spark driver program checks with the cluster manager (YARN, Mesos, or standalone) for resource availability for executors and launches them
- executors are created on nodes and register with the spark driver
- the spark driver converts the actions and transformations defined in the main method into tasks and allocates them to executors
- executors perform the transformations; actions return values to the driver
- while reading from HDFS, each executor directly applies the subsequent operations to the partition in the same task
spark driver
- the program that produces the SparkContext, connecting to a given Spark Master - Declares the actions and transformations on RDDs of data
Which of the following is true of running a Spark application on Hadoop YARN? - in Hadoop YARN mode, the RDDs and variables are always in the same memory space - there are two deploy modes that can be used to launch spark application on YARN- client mode and cluster mode - irrespective of the mode, the driver is launched in the client process that submitted the job - running in hadoop YARN has the advantage of having multiple users running the spark interactive shell
- there are two deploy modes that can be used to launch spark application on YARN- client mode and cluster mode
To launch a Spark application in any one of the four modes (local, standalone, Mesos or YARN), use ________
./bin/spark-submit
launching apps with Spark-submit
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
key features of spark
PERFORMANCE:
- faster than Hadoop MapReduce: up to 10x on disk, up to 100x in-memory
- caches datasets in memory for interactive data analysis
- in spark, tasks are threads, while in Hadoop each task spawns a separate JVM
RICH APIs AND LIBRARIES:
- offers a deep set of high-level APIs for languages like R, Python, Scala, and Java
- much less code than Hadoop MapReduce because it uses functional programming constructs
SCALABILITY AND FAULT TOLERANCE:
- scalable to over 8000 nodes in production
- uses resilient distributed datasets (RDDs), a logical collection of data partitioned across machines, which provides an intelligent fault-tolerance mechanism
SUPPORTS HDFS:
- integrated with hadoop and its ecosystem
- can read existing data
REAL-TIME STREAMING:
- supports streams from a variety of data sources like Twitter, Kinesis, Flume, and Kafka
- Spark Streaming provides a high-level library for stream processing
INTERACTIVE SHELL:
- provides an interactive command-line interface (in Python or Scala) for horizontally scalable, low-latency data exploration
- supports structured and relational query processing (SQL) via Spark SQL
MACHINE LEARNING:
- higher-level libraries for graph processing and machine learning
- various machine learning algorithms such as pattern mining, clustering, recommendation, and classification
creating a SparkConf object
val conf = new SparkConf().setMaster("local[4]").setAppName("FirstSparkApp")
val sc = new SparkContext(conf)
here, we specify the master URL and app name in a SparkConf and pass it to the SparkContext
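A PySpark counterpart, as a sketch:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[4]").setAppName("FirstSparkApp")
sc = SparkContext(conf=conf)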
Lazy evaluation
when we call a transformation on an RDD, the operation is not immediately executed. Instead, Spark internally records metadata to note that this operation has been requested
- loading data into an RDD is lazily evaluated, just like transformations are
- in Hadoop, developers often spend a lot of time considering how to group operations together to minimize the number of MapReduce passes. That's not required in Spark
- spark uses laziness to reduce the number of passes it has to take over the data by grouping operations together. Hence, users are free to arrange their programs into smaller, more manageable operations
- evaluation takes place only when the value is needed (delayed until the returned value is needed). For example, elements in a stream are computed only when accessed, so the entire (infinite) stream need not be computed. This allows for scale-free programming (create some infinite thing represented as a function and then sample it at some resolution)
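A sketch of this behaviour in PySpark (data.txt is a placeholder path, sc is assumed):
lines = sc.textFile("data.txt")         # nothing is read yet
lengths = lines.map(lambda s: len(s))   # still nothing executed, only the lineage is recorded
lengths.count()                         # the action triggers the load and the map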
creating a dataset
val lines = sqlContext.read.text("/log_data").as[String]
val words = lines
  .flatMap(_.split(" "))
  .filter(_ != "")
here, we created a dataset 'lines' on which operations like 'flatMap' and 'filter' are applied
spark SQL
programming module for structured data processing
- brings native support for SQL to Spark
- blurs the lines between RDDs and relational tables
- provides a programming abstraction called DataFrame and can act as a distributed SQL query engine
- package for working with structured data
- enables querying data through SQL as well as the Apache Hive variant of SQL, termed the Hive Query Language (HQL)
- supports various data, including JSON, parquet, and hive tables
you can pass SQL queries directly over any DataFrame by creating a table with the registerTempTable method and then using sqlContext.sql() to pass the queries
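In current PySpark the same pattern is written with a temporary view (a sketch; the file path is a placeholder):
df = spark.read.json("path-of-file/students.json")
df.createOrReplaceTempView("students")
teenagers = spark.sql("SELECT name, age FROM students WHERE age BETWEEN 13 AND 19")
teenagers.show()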
examples of Actions
count() : gets the number of data elements in the RDD
collect() : gets all the data elements in the RDD as an array
reduce(func) : aggregates the data elements in the RDD using this function, which takes two arguments and returns one
take(n) : fetches the first n data elements in the RDD, computed by the driver program
foreach(func) : executes the function for each data element in the RDD. Usually used to update an accumulator or interact with external systems
first() : retrieves the first data element in the RDD. Similar to take(1)
saveAsTextFile(path) : writes the content of the RDD to a text file, or a set of text files, in the local file system/HDFS
supported languages
currently supports multiple languages: Java, Scala, R, and Python. The language is chosen based on the efficiency of the functional solutions to tasks, but most developers prefer Scala.
- spark is built on Scala, so being proficient in Scala helps you dig into the source code when something does not work as you expect
- Scala is a multi-paradigm programming language and supports functional as well as object-oriented paradigms. It is a JVM-based, statically typed language that is safe and expressive
- python is in general slower than Scala, while Java is too verbose and does not support a REPL (read-evaluate-print loop)
lineage graph example
rawData -> map -> carsData -> filter -> americanCars -> combineByKey -> makeWeightSum -> map -> makeWeightAvg
here, 'map', 'filter', and 'combineByKey' are the transformations applied to each RDD, such as 'rawData' and 'carsData'
code snippet : transformations and actions on an RDD
records = spark.textFile("hdfs://...")
errors = records.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMessages = messages.cache()
cachedMessages.filter(_.contains("400")).count
here, records is the base RDD and errors is a transformed RDD created by applying the filter transformation. count is the action, upon whose call the transformations start to execute
spark executors
run the tasks and return results to the driver
- offer in-memory storage for RDDs that are cached by user programs
- multiple executors per node are possible
cluster managers
standalone - a simple cluster manager added with Spark that makes it simple to establish a cluster Apache Mesos - a cluster manager that can run service apps and Hadoop MapReduce Hadoop YARN - the resource manager in Hadoop 2
SparkConf
stores configuration parameters for a spark application
- these config parameters can be properties of the spark driver app or used by spark to allocate resources on the cluster, like memory size and cores
- a SparkConf object can be created with new SparkConf() and permits you to configure standard properties and arbitrary key-value pairs via the set() method
Data sources API
offers a single interface for storing and loading data using Spark SQL. Along with the sources prepackaged with spark, it provides an integration point for external developers to add support for custom data sources
creating RDDs
1. parallelizing a collection in the driver program, holding the numbers 1-5:
val data = Array(1, 2, 3, 4, 5)
val newRDD = sc.parallelize(data)
or
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
here, newRDD/distData is the new RDD created by calling SparkContext's parallelize method.
2. referencing a dataset in an external storage system, like a shared filesystem, HBase, HDFS, or any data source providing a Hadoop InputFormat.
for example, text file RDDs can be created using SparkContext's textFile method. This method takes a URI for the file (either a local path on the machine, or an hdfs://, s3n://, etc. URI) and reads it as a collection of lines to produce the RDD:
val newRDD = sc.textFile("data.txt")
or
distFile = sc.textFile("data.txt")