Data Science - Spark Streaming & Structured Streaming


structured streaming key ideas

1. The first key idea is to handle streaming computation the same way batch computation is handled: treat the incoming data stream as an input table, and treat each new batch of arriving data as a new set of rows being appended to that input table.
2. The second key idea is transactional integration with the storage systems to provide an end-to-end, exactly-once guarantee. The goal is to ensure that serving applications reading data from the storage systems see a consistent snapshot of the data produced by the streaming application. Traditionally, a developer had to make sure there was no duplicate data or data loss when moving data from a streaming application to an external storage system, which was one of the pain points raised by streaming application developers. Internally, the Structured Streaming engine already provides an exactly-once guarantee, and that same guarantee is extended to external storage systems, provided those systems support transactions.

creating DStreams

you can create a DStream that represents streaming data by using the StreamingContext (ssc):
val lines = ssc.socketTextStream("samplehost", 007)
the 'lines' DStream represents the stream of data you will receive from the data server. each record in 'lines' corresponds to a line of text

word count program

//splitting each line at <space>
val words = lines.flatMap(_.split(" "))
//creating (<word>, 1) pairs
val pairs = words.map(word => (word, 1))
//summing up all values (the 1's) for a particular word
val wordCounts = pairs.reduceByKey(_ + _)
//print results
wordCounts.print()
****all of the above only define the computation; nothing runs until the StreamingContext (ssc) is started
//start the computation
ssc.start()
//wait for the computation to terminate
ssc.awaitTermination()

goals of spark streaming

- Dynamic load balancing
- Fast failure and straggler recovery
- Unification of batch, streaming, and interactive analytics
- Advanced analytics like machine learning and interactive SQL
- Performance

structured streaming

- Spark's second-generation streaming engine - faster, more scalable, and more fault tolerant, designed to address the deficiencies of the first-generation engine (DStreams). It was meant to let developers build end-to-end streaming applications that can react to data in real time, using a simple programming model built on top of the optimized and solid foundation of the Spark SQL engine. One distinctive aspect of Structured Streaming is that it gives engineers a single, convenient way to build streaming applications. Building a production-grade streaming application means overcoming a number of difficulties, and with that in mind, the Structured Streaming engine was designed to help with these hurdles:
- Handling end-to-end reliability and ensuring correctness.
- Performing complex transformations on a variety of incoming data.
- Processing data with respect to event time and dealing with out-of-order data.
- Integrating with a variety of data sources and data sinks.
===========================================
data source -> processing logic -> output mode -> trigger -> data sink
The core parts of a streaming application are: defining one or more streaming data sources, providing the logic for manipulating the incoming data streams in the form of DataFrame transformations, setting the output mode and the trigger, and finally specifying a data sink to write the result to. Since both the output mode and the trigger have default values, they are optional if their defaults meet your use case.
==================
Structured Streaming does not materialize the entire input table. It reads the latest available data from the streaming data source, processes it incrementally to update the result, and then discards the source data. It only keeps around the minimal intermediate state needed to update the result. This design is significantly different from many other stream processing engines, which require the user to maintain running aggregations themselves and thus reason about data consistency and fault tolerance. In this model, Spark is responsible for updating the Result Table whenever there is new data, freeing users from having to think about it.
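A minimal sketch of such a pipeline in Scala, assuming a socket source and a console sink (the host, port, and 10-second trigger below are illustrative choices, not taken from these notes):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()
import spark.implicits._

// data source: each line read from the socket becomes a new row in the unbounded input table
val lines = spark.readStream
  .format("socket")
  .option("host", "samplehost")
  .option("port", 9999)
  .load()

// processing logic: split lines into words and count them
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// output mode + trigger + data sink
val query = wordCounts.writeStream
  .outputMode("complete")                          // output mode
  .trigger(Trigger.ProcessingTime("10 seconds"))   // trigger (optional; by default data is processed as soon as it arrives)
  .format("console")                               // data sink
  .start()

query.awaitTermination()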

why is streaming so hard? disadvantages

- Streaming computations don't run in isolation.
- Data arriving out of time order is a problem for batch processing.
- Writing stream processing services from scratch is not simple.
problems with DStreams:
- Processing with event-time; dealing with late data.
- Interoperating streaming with batch and interactive queries.
- Reasoning about end-to-end guarantees.

flume vs. kafka

- flume is more tightly integrated with the hadoop ecosystem
- flume may work "out of the box", while kafka may require you to write your own producers and consumers (but that gives you more flexibility)
- you can do both; flume supports kafka sinks and kafka data sources
- kafka is more reliable and scalable

why is streaming necessary

- increased need for real-time data processing
- with the prevalence of online transactions and social media, as well as sensors and devices, companies are generating huge amounts of data
- businesses want to capitalize on this data and pull out meaningful real-time insights at scale to enable growth

accumulators

- variables that can only be 'added' to, via an associative and commutative operation, and can therefore be efficiently supported in parallel
- they can be used to implement sums or counters
- spark natively offers support for accumulators of numeric types; programmers can add support for new types beyond numerics
- for an Executor/worker node, accumulators are write-only variables
- a named accumulator can be seen in the Spark UI, which in turn helps us understand the progress of our app's workflow

why push-based flume is bad

- not transactional; data can be lost
- you must code your receiver's hostname and Avro port into Flume's configuration - if it changes, Flume stops working

DStreams classification

2 categories of sources:
1. basic sources : sources that are directly available in the StreamingContext API, e.g. file systems, socket connections, and Akka actors
2. advanced sources : sources like Kafka, Flume, Twitter, etc. These are available through extra utility classes

transformations on DStreams

map(func) : returns a new DStream by passing each element of the source DStream through a function func
flatMap(func) : similar to map, but each input item can be mapped to 0 or more output elements
filter(func) : returns a new DStream by selecting only those records of the source DStream for which func returns true
see the link below for more: Spark Streaming Programming Guide - Spark 1.1.0 Documentation (https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#transformations-on-dstreams)

Fault Tolerance Semantics

Achieving end-to-end exactly-once semantics was one of the key goals behind the design of Structured Streaming. To accomplish that, the Structured Streaming sources, sinks, and execution engine were designed to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing. Every streaming source is assumed to have offsets to track the read position in the stream. The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent so that reprocessing is handled safely. Together, using replayable sources and idempotent sinks, Structured Streaming can guarantee end-to-end exactly-once semantics under any failure.
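As a hedged illustration of how this offset tracking shows up in user code, a streaming query is normally given a checkpoint location (the path below is a made-up example) where the engine records the offset ranges and write-ahead log for each trigger; this reuses the wordCounts sketch from earlier:

val query = wordCounts.writeStream
  .outputMode("complete")
  .option("checkpointLocation", "/tmp/structured-streaming/wordcount-checkpoint")  // hypothetical path; offsets and WAL metadata are stored here
  .format("console")
  .start()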

topics and partitions

messages in kafka are categorized into topics
topics are broken down into a number of partitions (each message within a partition is identified by an offset, like an index)
partitions provide redundancy and scalability
each partition has its own id

Spark streaming converts the input data streams into ______.

micro-batches

Who is responsible for keeping track of the Block IDs?

Block Management Master on the driver

Creating Broadcast Variables

Broadcast variables are created with the SparkContext.broadcast function as:
scala> val broadcastVar = sc.broadcast(Array(1, 2))
Note : Explicitly create broadcast variables only if tasks across multiple stages need the same data.
The value function is used to get the value assigned to a broadcast variable:
scala> broadcastVar.value
res2: Array[Int] = Array(1, 2)

output operations on DStreams

print() : prints the first ten elements of every batch of data in a DStream
saveAsTextFiles(prefix, [suffix]) : saves this DStream's contents as text files; the file name at each batch interval is generated based on the prefix and suffix
to learn more: https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#output-operations-on-dstreams
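For example, a sketch of persisting each batch of the earlier wordCounts DStream as text files (the HDFS path and suffix are illustrative assumptions):

// writes one directory per batch interval, named <prefix>-<time in ms>[.<suffix>]
wordCounts.saveAsTextFiles("hdfs://namenode:8020/user/spark/wordcounts", "txt")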

kafka architecture

producer -> topics -> consumer

The basic programming abstraction of Spark Streaming is _______.

DStream

What is the programming abstraction in Spark Streaming?

DStream

Caching / Persistence

DStreams can be persisted as streams of data. You can use the persist() method on a DStream, which persists every RDD of that particular DStream in memory. This is useful if the data in the DStream will be computed multiple times. The default persistence level for input streams replicates the data to two nodes for fault-tolerance.

significance of shared variables

while designing distributed tasks, we may come across situations where we want each task to access some shared variables or values instead of independently returning intermediate results back to the driver program. in such situations, you can use a shared variable which can be accessed by all of the tasks

Handling Event-time and Late Data

Event-time is the time embedded in the data itself. For many applications, you may want to operate on this event-time. For example, if you want to count the number of events generated by IoT devices every minute, then you probably want to use the time when the data was generated, rather than the time Spark receives it. This event-time is expressed very naturally in this model - each event from the devices is a row in the table, and event-time is a column value in the row. This allows window-based aggregations to be just a special kind of grouping and aggregation on the event-time column. As a result, such event-time-window-based aggregation queries can be defined consistently on both a static dataset and a data stream, making the life of the user much simpler.
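A sketch of such an event-time-window aggregation, assuming a streaming DataFrame named events with 'eventTime' (timestamp) and 'deviceId' columns (these names and thresholds are assumptions for illustration):

import org.apache.spark.sql.functions.{col, window}

// count events per device per 1-minute event-time window,
// tolerating data that arrives up to 10 minutes late via a watermark
val windowedCounts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(
    window(col("eventTime"), "1 minute"),
    col("deviceId"))
  .count()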

static data sources

MySQL Cassandra HBase

Programming Model : basic concepts

The principal approach in Structured Streaming is to treat a live data stream as a table that is being continuously appended. This leads to a new stream processing model that is very similar to a batch processing model. You express your streaming computation as a standard batch-like query, as if on a static table, and Spark runs it as an incremental query on the unbounded input table.
--------------------------------
Consider the input data stream as the Input Table. Every data item arriving on the stream is like a new row being appended to the Input Table.

Output modes

Output: defines what gets written out to the external storage. The output can be written in different modes:
=====================================
COMPLETE MODE : The entire updated Result Table will be written to the external storage. It is up to the storage connector to decide how to handle writing the entire table.
APPEND MODE : Only the new rows appended to the Result Table since the last trigger will be written to the external storage. This is applicable only to queries where existing rows in the Result Table are not expected to change.
UPDATE MODE : Only the rows that were updated in the Result Table since the last trigger will be written to the external storage. Note that this differs from Complete Mode in that this mode only outputs the rows that have changed since the last trigger. If the query doesn't contain aggregations, it is equivalent to Append mode.
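In code, the mode is set on the writeStream builder; for instance (the windowedCounts and rawEvents DataFrames here are hypothetical, and the console sink is an arbitrary choice):

// aggregation query: update (or complete) mode makes sense
windowedCounts.writeStream.outputMode("update").format("console").start()

// non-aggregating query (plain filter/projection): append mode
rawEvents.writeStream.outputMode("append").format("console").start()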

DStreams are a collection of _______ internally.

RDD

Internally, DStream is represented as a sequence of _______ arriving at discrete time intervals.

RDD

ingesting Flume input streams

Spark Streaming can also ingest data from Flume. The data taken in as streams can be transformed as per the business requirements and fed into a data sink system.

Continuous Processing Model

Starting with Apache Spark 2.3, the Structured Streaming engine's processing model has been extended to support a new model called continuous processing. The earlier processing model was the micro-batching model, which is the default one. Given the nature of micro-batch processing, it is a good fit for use cases that can tolerate end-to-end latency in the range of 100 milliseconds. For use cases that need end-to-end latency as low as 1 millisecond, the continuous processing model should be used.
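A hedged sketch of opting into continuous processing (available from Spark 2.3 on; the events DataFrame and the 1-second value are illustrative assumptions):

import org.apache.spark.sql.streaming.Trigger

// Trigger.Continuous switches the query to the continuous processing model;
// its argument is the checkpoint interval, not a micro-batch interval
val query = events.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))
  .start()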

streaming data from a TCP source example

TCP source : host:port like: samplehost:007

Creating StreamingContext

The first step in creating a streaming application is to create a StreamingContext. This is the primary entry point for all streaming functionality.
//creating the configuration object; at least 2 local threads are needed: one for the receiver, one for processing
val conf = new SparkConf().setMaster("local[2]").setAppName("MyStreamingApp")
//create the streaming context; the batch interval is set to 1 second
val ssc = new StreamingContext(conf, Seconds(1))
required imports:
import org.apache.spark._
import org.apache.spark.streaming._

Why Windowing?

We often come across situations where we are only interested in what happened over, say, the last one hour, and want those statistics to refresh every minute. This is where window operations help: with windowing, such computations become much easier. A typical use case is monitoring a web server. Note : Here one hour is the window length, while one minute is the slide interval.
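A sketch of that idea with DStreams, assuming 'pairs' is the (word, 1) DStream from the word count example and that org.apache.spark.streaming._ is already imported (for Minutes):

// counts over the last 1 hour of data, recomputed every 1 minute
val windowedWordCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // reduce function
  Minutes(60),                 // window duration
  Minutes(1))                  // slide interval
windowedWordCounts.print()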

demo: integrate Kafka with spark

fetch data from a Kafka topic into a Spark app
go to the folder where the Spark app is present:
>cd KafkaSparkStreaming
>ls
>cd src
>ls
>cd main
>ls
>cd scala
>ls
open the Spark app:
>vi kafkar.scala
start zookeeper: command found in the last demo
start kafka: commands found in the last demo
duplicate the session to check that they've started:
>jps
create a topic with the same name as in the program:
>command from the last demo, with the replication-factor and partitions parameters, but the topic must be named the same as in Map("name" -> ...) in the val kafkaStream = KafkaUtils.createStream(...) line
enter a message
duplicate the session to run the Spark app in the streaming dir:
>cd KafkaSparkStreaming
>ls
>sbt compile
>sbt run
**this successfully starts the session; in the producer terminal you can send messages, and they will show up (with timestamps) in the streaming terminal
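The kafkar.scala app itself is not reproduced in these notes; the following is only a sketch of what such a receiver-based Kafka DStream typically looks like (the ZooKeeper address, group id, and topic name follow the earlier demo but are otherwise assumptions):

import org.apache.spark.streaming.kafka.KafkaUtils

// receiver-based stream: (StreamingContext, ZooKeeper quorum, consumer group id,
// Map(topic name -> number of receiver threads))
val kafkaStream = KafkaUtils.createStream(
  ssc,
  "localhost:2181",
  "spark-streaming-group",
  Map("mytopic1" -> 1))

// each element is a (key, message) pair; print the message values
kafkaStream.map(_._2).print()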

DStreams are ________ internally.

a collection of RDD

Window Duration/Size

a duration over which we perform certain fold operations. Window duration should be a multiple of batch interval

broadcast variables

a kind of shared variable that allows developers to keep a read-only variable cached on each node rather than shipping a copy of it with every task. Spark distributes them using efficient broadcast algorithms, which greatly reduces network I/O.

Applying transformations on top of a DStream will __________.

yield a new DStream

Accumulators for Executors & Driver

for an Executor/worker node, accumulators are write-only variables. Tasks running on the executor nodes can add to an accumulator's value; however, they cannot read its value.
e.g.
scala> sc.parallelize(Array(1, 2)).foreach(x => acc += x)
the driver can read the value of the accumulator using the value method:
scala> acc.value
res4: Int = 3

2 types of shared variables

accumulators broadcast variables

push-based receiver

acts as a sink
configure Flume to treat your receiver as an Avro sink running on a new port
install the spark-streaming-flume-assembly package
relatively easy to set up:
a1.sinks = avroSink
a1.sinks.avroSink.type = avro
a1.sinks.avroSink.channel = memoryChannel
a1.sinks.avroSink.hostname = <your receiver hostname>
a1.sinks.avroSink.port = <your avro sink port (not the Spark port)>
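On the Spark side, the matching push-based receiver is typically created with FlumeUtils.createStream; a minimal sketch (the hostname and port are placeholders that must match the Avro sink configured above):

import org.apache.spark.streaming.flume.FlumeUtils

// listens on the hostname/port that the Flume Avro sink points at
val flumeStream = FlumeUtils.createStream(ssc, "receiver-host", 4545)

// each element is a SparkFlumeEvent; extract the payload as a byte array
flumeStream.map(event => new String(event.event.getBody.array())).print()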

Choose the correct statement(s). - blocks in one executor are copied to other executor nodes to prevent data loss - datastreams from the source are taken up by the receiver - the driver will execute the task on the worker nodes on encountering ssc.start() - all

all

Which among the following is/are true about Window Operations?

all - multiple DStreams can be present in a window - window duration should be a multiple of batch interval - parts of DStreams cannot be part of a window

Select the goals of Apache Spark Streaming.

all; - Dynamic load balancing - Fast failure and straggler recovery - Unification of batch, streaming and interactive analytics

The benefits of Discretized Stream Processing is/are ___________.

all; - fast failure and straggler recovery - interactive analytics - dynamic load balancing

________ are the various output modes.

all; complete mode append mode update mode

Which of the following transformations can be applied to a DStream?

all; filter flatMap map

Data sources for Spark Streaming that comes under the 'Advanced sources' category include ________.

all; kafka flume twitter

apache zookeeper

an open source apache project that provides centralized infrastructure and services that enable synchronization across an apache hadoop cluster. stores info in local log files

Kafka

an open source, distributed, publish-subscribe messaging system which manages and maintains real-time streams of data from different apps, websites, etc.
- reduces the complexity of data pipelines
- simpler and more manageable communication
- easy to establish remote communication
- ensures reliable communication
originally developed at LinkedIn

Which among the following needs to be a multiple of batch interval?

both; sliding interval window duration

Which among the following is/are true about Spark Streaming?

both; - DStreams are the programming abstractions of Spark Streaming - Spark Streaming enables real-time processing

kafka components

broker : the servers that manage and mediate the conversation between two different systems; responsible for the delivery of messages to the right party
message : simply byte arrays; any object can be stored in any format by the developers - String, JSON, Avro, etc.
topic : all messages are maintained in topics; messages are stored, published, and organized in these
cluster : every broker holds a few partitions, and each partition can be either a leader or a replica for a topic

consumer

can subscribe to one or more topics and reads messages in the order they were produced
keeps track of the offset of consumed messages
messages with the same key arrive at the same consumer
one consumer can read from one or more partitions at a time

metadata checkpointing

checkpointing info to systems like HDFS. Usually used to recover from failure of the node running the driver of the streaming app. Metadata here means:
configuration : the configuration that was used to create the streaming app
DStream ops : the set of DStream operations defined in the app
incomplete batches : batches whose jobs are queued but have not completed yet
*note : primarily used for recovery from driver failures

broker

a cluster typically consists of multiple brokers to maintain load balance
on receiving messages from a producer, the broker assigns offsets to them
services consumers by responding to fetch requests for partitions

Block Management Master keeps track of _________.

block ID

creating accumulators

created from SparkContext(sc) as : val acc = sc.accumulator(0, "test") you can create built-in accumulators for longs, doubles, or collections. You are free to create accumulators with or without a name, but only named accumulators are displayed in Spark UI

Batch interval is configured at _________.

creating a spark streaming context

What is the strategy taken to prevent the loss of the incoming stream?

data is replicated in different nodes

Batch interval is ______.

defined during the creation of the streaming context

Apache Flume

a distributed, reliable service for collecting, aggregating, and moving large amounts of log data
available as a separate jar: spark-streaming-flume_2.10
two different approaches to receiving data:
- Flume-style push-based
- pull-based using a custom sink

Micro batching is fit for use cases that require end-to-end latency as low as one millisecond.

false

window operation

feature provided in spark streaming allows you to apply transformations over a sliding window of data

internal working

input data stream -> Spark Streaming -> batches of input data -> Spark Engine -> batches of processed data
the live input data streams received by Spark Streaming are divided into several micro-batches. The Spark Engine takes up these batches and processes them to generate the final stream of results, in batches.

What is a batch interval?

interval at which a DStream is created

What is a Sliding Interval?

interval at which the sliding of the window area occurs

role of spark streaming

it's a micro-batch-based streaming library
a technique for transferring data so it can be processed as a steady data stream
Spark and Spark Streaming, with their sophisticated design, unified programming model, and processing capabilities, enable us to build complex pipelines that encompass streaming, batch, and even machine learning capabilities with ease, freeing us from dealing with multiple frameworks, each meant for its own specific purpose, such as Storm for real-time actions, Hadoop MapReduce for batch processing, etc.
takes data from streaming sources like Flume, Kafka, and Kinesis, optionally uses MLlib and/or Spark SQL, then puts the results into data storage systems like Apache HBase, Cassandra, or Kafka
data from various sources can be ingested into Spark Streaming - e.g. Kafka, Flume, Kinesis, or TCP sockets. This data can be processed and pushed out to filesystems, databases, and live dashboards
- scalable
- speed (achieves low latency)
- fault tolerance (efficiently recovers from failures)
- integration (integrates with batch and real-time processing)
============================================
streaming is very difficult; streaming apps are becoming more complicated

checkpointing

it's required to checkpoint or store enough information to any of the fault-tolerant storage systems so that it can recover from failures of the node running the driver of the streaming app
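A common pattern for this (the checkpoint directory below is a hypothetical path) is to create the StreamingContext through StreamingContext.getOrCreate, so that a restarted driver is rebuilt from the checkpoint:

import org.apache.spark._
import org.apache.spark.streaming._

val checkpointDir = "hdfs://namenode:8020/user/spark/checkpoints"  // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setMaster("local[2]").setAppName("MyStreamingApp")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)   // enable checkpointing of metadata (and generated RDDs)
  // ... define DStream sources and operations here ...
  ssc
}

// rebuild the context from the checkpoint if one exists, otherwise create it fresh
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()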

demo for configuring kafka and zookeeper

java, kafka, and zookeeper are already installed on the system
in a terminal, start zookeeper:
>zookeeper-server-start.sh kafka/config/zookeeper.properties
duplicate the session to start kafka:
>kafka-server-start.sh kafka/config/server.properties
duplicate the session and check that kafka and zookeeper are running:
>jps
*QuorumPeerMain in the output means zookeeper has started
create a kafka topic:
>kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic mytopic1
start a producer:
>kafka-console-producer.sh --broker-list localhost:9092 --topic mytopic1
type in a random message
duplicate the session & start a consumer:
>kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic mytopic1 --from-beginning
hit enter and you will see the message you typed in the producer session; type more messages in the producer and they appear in the consumer

streaming data sources

kafka flume kinesis TCP sockets

where to use Spark Streaming

a perfect fit for any real-time data statistics, e.g.:
apps: intrusion detection; fraud detection; log processing
sensors: malfunction detection; dynamic process optimization; supply chain planning
web: site analytics; recommendations; sentiment analysis
mobile phones: network metrics analysis; location-based ads; etc.
==========================================
once the processing is over, the data is pushed out to databases, filesystems, and live dashboards

DStream (discretized stream)

represents a continuous stream of data / a sequence of data that arrives over time
the high-level abstraction that Spark Streaming provides. Internally, it's represented as a sequence of RDDs arriving at discrete time intervals.
operations that can be applied on DStreams are similar to RDD operations:
- transformations : yield a new DStream
- output ops : write data to an external system
can be created in 2 different ways:
1. from input data streams from sources such as Kafka, Flume, etc.
2. by applying high-level operations on other DStreams

What does saveAsTextFiles(prefix, [suffix]) do?

saves the DStream's contents as text files

data checkpointing

saving the generated RDDs to a storage system like HDFS. This is especially relevant in cases where the dependency chain keeps growing with time and recovering an RDD in some intermediate state becomes difficult. In order to avoid an unbounded increase in recovery time, the intermediate RDDs of stateful transformations are periodically checkpointed to a reliable storage system.

producer

sends records (also referred to as messages) to topics
selects the partition to send each message to, per topic
can implement priority systems (by sending records to certain partitions depending on the priority of the record)

pull-based

spark worker nodes 'pull' data from a special Spark Flume sink
more work: you need to install the Spark sink as a plug-in into Flume first -> spark-streaming-flume-sink package
but much more reliable
configure the custom Flume sink:
a1.sinks = spark
a1.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.spark.hostname = <Spark receiver hostname>
a1.sinks.spark.port = <new port, not the Spark port>
a1.sinks.spark.channel = memoryChannel
create a DStream with FlumeUtils.createPollingStream
this DStream contains SparkFlumeEvents; call getBody() on them to get the contents as a byte array
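A minimal sketch of the pull-based receiver side (the hostname and port are placeholders that must match the SparkSink configuration above):

import org.apache.spark.streaming.flume.FlumeUtils

// polls the custom SparkSink running inside the Flume agent
val pollingStream = FlumeUtils.createPollingStream(ssc, "spark-receiver-host", 7777)

// each element is a SparkFlumeEvent; getBody() returns the payload, read here as a byte array
pollingStream.map(e => new String(e.event.getBody.array())).print()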

streaming process example

stage 1: reads streaming data
stage 2: processes the streaming data
stage 3: writes the processed data to an HBase table
stage 4: provides a visualization of the data
time-stamped sensor data -> Spark Streaming -> Spark processing -> HBase table -> reads: data for real-time monitoring

The receiver receives data from the Streaming sources at the start of _________.

the streaming context

We specify _________ when we create streaming context.

streaming source

Streaming App Data Flow

the workflow starts when the Spark Streaming Context is started via ssc.start()
stage 1 : when the Spark Streaming Context starts, the driver will execute tasks on the executors/worker nodes
stage 2 : data streams generated at the streaming sources will be received by the Receivers that sit on top of executor nodes. The receiver is responsible for dividing the stream into blocks and keeping them in memory.
stage 3 : in order to avoid data loss, these blocks are also replicated to another executor.
stage 4 : the Block Management Master on the driver keeps track of the block ID information.
stage 5 : for every batch interval configured in the Spark Streaming Context (commonly in seconds), the driver will launch tasks to process the blocks. After processing, the resulting data blocks are persisted to any number of target data stores, including cloud storage, relational data stores, and NoSQL stores.

streaming abstraction

the high-level abstraction that Spark Streaming provides is called 'Discretized Stream' or 'DStream'

batch interval

the interval at which a DStream is created. We specify this interval while we create the Streaming Context(ssc).

Sliding Interval

the interval at which the sliding of the window occurs. This interval has to be a multiple of the batch interval.

DStream represents a continuous stream of data.

true

DStreams can be created from an existing DStream.

true

For every batch interval, the Driver launches tasks to process a block.

true

HDFS can be a sink for Spark Streaming.

true

In Spark Structured Streaming, whenever a data item arrives on the stream, it is appended as a row in the Input Table.

true

Sliding Interval is the interval at which sliding of the window area occurs.

true

Spark Streaming can be used for real-time processing of data.

true

Spark Streaming has two categories of sources, namely, basic sources and advanced sources.

true

Structured Streaming is Spark's second-generation streaming engine.

true

The principal approach in Structured Streaming is to handle a live data stream as a table that is being continuously appended.

true

The receiver divides the streams into blocks and stores them in the memory.

true

There can be multiple DStreams in a single window.

true

We can configure Twitter as a data source system for Spark Streaming

true

With Spark Streaming, the incoming data is split into micro-batches.

true

ssc.start() is the entry point for a Streaming application.

true

