Reference: http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html
Spark Streaming supports many kinds of stream input, like Kafka, Flume, Twitter, ZeroMQ or plain old TCP sockets. Transform operations can be applied on top of these streams, and the resulting data can finally be stored in HDFS, a database, or a dashboard.
What is particularly interesting is that Spark's in-built machine learning algorithms and graph processing algorithms can also be applied to Spark Streaming.
The principle behind Spark Streaming is shown very clearly in the figure below: the stream data is discretized, which gives rise to the concept of a DStream, essentially just a sequence of RDDs.
Spark Streaming is an extension of the core Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ or plain old TCP sockets and be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark's in-built machine learning algorithms and graph processing algorithms on data streams.
Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka and Flume, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
Before we go into the details of how to write your own Spark Streaming program, let’s take a quick look at what a simple Spark Streaming program looks like.
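For concreteness, here is a minimal network word count sketch in Scala, written against the old (0.9-era, incubator) API that this post references; the host, port and 1-second batch interval are just placeholder choices:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair-DStream operations such as reduceByKey

// StreamingContext with a local master and a 1-second batch interval
val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))

// DStream of text lines read from a TCP socket (e.g. fed by `nc -lk 9999`)
val lines = ssc.socketTextStream("localhost", 9999)

// Split each line into words and count each word per batch
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)

// Output operation: print a few counts of every batch, then start processing
wordCounts.print()
ssc.start()
ssc.awaitTermination()
```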
StreamingContext is the entry point of Spark Streaming, just like SparkContext is for Spark. In fact, StreamingContext is a wrapper around SparkContext, which can be obtained via streamingContext.sparkContext. Apart from batchDuration, its parameters are no different from those of SparkContext.
To initialize a Spark Streaming program in Scala, a StreamingContext object has to be created, which is the main entry point of all Spark Streaming functionality. A StreamingContext object can be created by calling its constructor, as in the sketch below. The master parameter is a standard Spark cluster URL and can be "local" for local testing. The appName parameter is the name of your program, which will be shown on your cluster's web UI. The batchInterval is the size of the batches, as explained earlier. Finally, the last two parameters are needed to deploy your code to a cluster if running in distributed mode, as described in the Spark programming guide. Additionally, the underlying SparkContext can be accessed as streamingContext.sparkContext.
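A rough sketch of the constructor described above, against the old 0.9-era API; the cluster URL, application name, and JAR path are placeholder values:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Old-style constructor: master, appName, batchInterval, [sparkHome], [jars]
val ssc = new StreamingContext(
  "spark://host:7077",                // master: standard Spark cluster URL ("local" for local testing)
  "MyStreamingApp",                   // appName: shown on the cluster's web UI
  Seconds(1),                         // batchInterval: size of the batches
  System.getenv("SPARK_HOME"),        // sparkHome: needed when deploying to a cluster
  Seq("target/my-streaming-app.jar")  // jars: code shipped to the cluster in distributed mode
)

// The underlying SparkContext can be accessed as well
val sc = ssc.sparkContext
```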
The batch interval must be set based on the latency requirements of your application and available cluster resources. See the Performance Tuning section for more details.
The figures illustrate very clearly what a DStream is and how transformations based on DStreams work.
Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from source, or the processed data stream generated by transforming the input stream. Internally, it is represented by a continuous sequence of RDDs, which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval, as shown in the following figure.
Any operation applied on a DStream translates to operations on the underlying RDDs. For example, in the earlier example of converting a stream of lines to words, the flatMap operation is applied on each RDD in the lines DStream to generate the RDDs of the words DStream. This is shown in the following figure.
These underlying RDD transformations are computed by the Spark engine. The DStream operations hide most of these details and provide the developer with a higher-level API for convenience. These operations are discussed in detail in later sections.
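As a small illustration of this translation, the flatMap on the lines DStream from the earlier sketch can equally be written as a per-RDD operation via transform (assuming lines is the socket DStream from the first sketch):

```scala
// These two produce the same words DStream: a DStream operation
// is simply applied to every RDD that makes up the stream.
val words = lines.flatMap(_.split(" "))
val wordsViaTransform = lines.transform(rdd => rdd.flatMap(_.split(" ")))
```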
DStreams support transformations and output operations, which are a bit different from Spark's actions.
There are two kinds of DStream operations - transformations and output operations. Similar to RDD transformations, DStream transformations operate on one or more DStreams to create new DStreams with transformed data. After applying a sequence of transformations to the input streams, output operations need to be called, which write data out to an external data sink, such as a filesystem or a database.
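Continuing the word count sketch above (and assuming its imports), this is the shape of a transformation pipeline terminated by an output operation; the HDFS path is a hypothetical example:

```scala
// Transformations build up the computation on the lines DStream; an output
// operation such as print() or saveAsTextFiles() writes each batch out.
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.saveAsTextFiles("hdfs://namenode:8020/streaming/wordcounts")
```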
Most of the transformations are the same as those on RDDs; the ones below deserve special mention.
UpdateStateByKey Operation
This uses the stream data to continuously update some state, as in the word count example below. You only need to define an updateFunction that specifies how to update the existing state from the new values.
The updateStateByKey operation allows you to maintain arbitrary stateful computation, where you want to maintain some state data and continuously update it with new information. To use this, you will have to do two steps: define the state (which can be an arbitrary data type), and define the state update function (which specifies how to update the state using the previous state and the new values from the input stream).
Let's illustrate this with an example. Say you want to maintain a running count of each word seen in a text data stream. Here, the running count is the state and it is an integer. We define the update function as shown in the sketch below.
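A minimal sketch of that update function and its use, assuming words and ssc come from the word count sketch above; updateStateByKey needs a checkpoint directory, and the one shown here is a placeholder:

```scala
// How to fold the new values for a key into its running count
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  Some(runningCount.getOrElse(0) + newValues.sum)
}

// Required for stateful operations; the directory is a placeholder
ssc.checkpoint("hdfs://namenode:8020/streaming/checkpoints")

// (word, 1) pairs from the words DStream, turned into a continuously updated running count
val pairs = words.map(word => (word, 1))
val runningCounts = pairs.updateStateByKey[Int](updateFunction _)
```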
Transform Operation
The transform operation (along with its variations like transformWith) allows arbitrary RDD-to-RDD functions to be applied on a DStream. It can be used to apply any RDD operation that is not exposed in the DStream API. For example, the functionality of joining every batch in a data stream with another dataset is not directly exposed in the DStream API. However, you can easily use transform to do this, which enables very powerful possibilities. For example, you can do real-time data cleaning by joining the input data stream with precomputed spam information (maybe generated with Spark as well) and then filtering based on it.
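A rough sketch of that spam-filtering idea, assuming wordCounts is the (word, count) DStream from the sketches above; spamInfoRDD (a precomputed RDD of (word, spamScore) pairs) and the 0.5 threshold are made up for illustration:

```scala
import org.apache.spark.SparkContext._ // pair-RDD operations such as join

// Join every batch of word counts against the precomputed spam information
// and drop the entries that look like spam.
val cleanedDStream = wordCounts.transform { rdd =>
  rdd.join(spamInfoRDD)
     .filter { case (word, (count, spamScore)) => spamScore < 0.5 }
     .map { case (word, (count, _)) => (word, count) }
}
```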
In fact, you can also use machine learning and graph computation algorithms in the transform method.
To be continued...
Spark Streaming Programming Guide
Original post: http://www.cnblogs.com/fxjwind/p/3560011.html