Reference: http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html
Spark Streaming supports many kinds of stream input, like Kafka, Flume, Twitter, ZeroMQ or plain old TCP sockets. Transform operations can be applied on top of these streams, and the resulting data can finally be stored in HDFS, a database, or a dashboard.
What is particularly interesting is that Spark's in-built machine learning algorithms and graph processing algorithms can also be applied to Spark Streaming.
The principle behind Spark Streaming is shown very clearly in the figure below: the stream data is discretized, which gives rise to the concept of a DStream, essentially just a sequence of RDDs.
Spark Streaming is an extension of the core Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ or plain old TCP sockets and be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark's in-built machine learning algorithms and graph processing algorithms on data streams.
Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka and Flume, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
Before we go into the details of how to write your own Spark Streaming program, let’s take a quick look at what a simple Spark Streaming program looks like.
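For concreteness, here is a minimal network word count sketch in Scala, written against the old (0.9-era, incubator) API that this post references; the host, port and 1-second batch interval are just placeholder choices:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair-DStream operations such as reduceByKey

// StreamingContext with a local master and a 1-second batch interval
val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))

// DStream of text lines read from a TCP socket (e.g. fed by `nc -lk 9999`)
val lines = ssc.socketTextStream("localhost", 9999)

// Split each line into words and count each word per batch
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)

// Output operation: print a few counts of every batch, then start processing
wordCounts.print()
ssc.start()
ssc.awaitTermination()
```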
StreamingContext is the entry point of Spark Streaming, just like SparkContext is for Spark. In fact, StreamingContext is a wrapper around SparkContext, which can be obtained via streamingContext.sparkContext. Apart from batchDuration, its parameters are no different from those of SparkContext.
To initialize a Spark Streaming program in Scala, a StreamingContext object has to be created, which is the main entry point of all Spark Streaming functionality. A StreamingContext object can be created by calling its constructor, as in the sketch below. The master parameter is a standard Spark cluster URL and can be "local" for local testing. The appName parameter is the name of your program, which will be shown on your cluster's web UI. The batchInterval is the size of the batches, as explained earlier. Finally, the last two parameters are needed to deploy your code to a cluster if running in distributed mode, as described in the Spark programming guide. Additionally, the underlying SparkContext can be accessed as streamingContext.sparkContext.
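A rough sketch of the constructor described above, against the old 0.9-era API; the cluster URL, application name, and JAR path are placeholder values:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Old-style constructor: master, appName, batchInterval, [sparkHome], [jars]
val ssc = new StreamingContext(
  "spark://host:7077",                // master: standard Spark cluster URL ("local" for local testing)
  "MyStreamingApp",                   // appName: shown on the cluster's web UI
  Seconds(1),                         // batchInterval: size of the batches
  System.getenv("SPARK_HOME"),        // sparkHome: needed when deploying to a cluster
  Seq("target/my-streaming-app.jar")  // jars: code shipped to the cluster in distributed mode
)

// The underlying SparkContext can be accessed as well
val sc = ssc.sparkContext
```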
The batch interval must be set based on the latency requirements of your application and available cluster resources. See the Performance Tuning section for more details.
The figures illustrate very clearly what a DStream is and how transformations based on DStreams work.
Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from source, or the processed data stream generated by transforming the input stream. Internally, it is represented by a continuous sequence of RDDs, which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval, as shown in the following figure.
Any operation applied on a DStream translates to operations on the underlying RDDs. For example, in the earlier example of converting a stream of lines to words, the flatMap operation is applied on each RDD in the lines DStream to generate the RDDs of the words DStream. This is shown in the following figure.
These underlying RDD transformations are computed by the Spark engine. The DStream operations hide most of these details and provide the developer with a higher-level API for convenience. These operations are discussed in detail in later sections.
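As a small illustration of this translation, the flatMap on the lines DStream from the earlier sketch can equally be written as a per-RDD operation via transform (assuming lines is the socket DStream from the first sketch):

```scala
// These two produce the same words DStream: a DStream operation
// is simply applied to every RDD that makes up the stream.
val words = lines.flatMap(_.split(" "))
val wordsViaTransform = lines.transform(rdd => rdd.flatMap(_.split(" ")))
```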
DStreams support transformations and output operations, which are a bit different from Spark's actions.
There are two kinds of DStream operations - transformations and output operations. Similar to RDD transformations, DStream transformations operate on one or more DStreams to create new DStreams with transformed data. After applying a sequence of transformations to the input streams, output operations need to be called, which write data out to an external data sink, such as a filesystem or a database.
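Continuing the word count sketch above (and assuming its imports), this is the shape of a transformation pipeline terminated by an output operation; the HDFS path is a hypothetical example:

```scala
// Transformations build up the computation on the lines DStream; an output
// operation such as print() or saveAsTextFiles() writes each batch out.
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.saveAsTextFiles("hdfs://namenode:8020/streaming/wordcounts")
```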
Most of the transformations are the same as those on RDDs; the ones below deserve special mention.
UpdateStateByKey Operation
This uses the stream data to continuously update some state, as in the word count example below. You only need to define an updateFunction that specifies how to update the existing state from the new values.
The updateStateByKey operation allows you to maintain arbitrary stateful computation, where you want to maintain some state data and continuously update it with new information. To use this, you will have to do two steps: define the state (which can be an arbitrary data type), and define the state update function (which specifies how to update the state using the previous state and the new values from the input stream).
Let's illustrate this with an example. Say you want to maintain a running count of each word seen in a text data stream. Here, the running count is the state and it is an integer. We define the update function as shown in the sketch below.
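A minimal sketch of that update function and its use, assuming words and ssc come from the word count sketch above; updateStateByKey needs a checkpoint directory, and the one shown here is a placeholder:

```scala
// How to fold the new values for a key into its running count
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  Some(runningCount.getOrElse(0) + newValues.sum)
}

// Required for stateful operations; the directory is a placeholder
ssc.checkpoint("hdfs://namenode:8020/streaming/checkpoints")

// (word, 1) pairs from the words DStream, turned into a continuously updated running count
val pairs = words.map(word => (word, 1))
val runningCounts = pairs.updateStateByKey[Int](updateFunction _)
```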
Transform Operation
The transform operation (along with its variations like transformWith) allows arbitrary RDD-to-RDD functions to be applied on a DStream. It can be used to apply any RDD operation that is not exposed in the DStream API. For example, the functionality of joining every batch in a data stream with another dataset is not directly exposed in the DStream API. However, you can easily use transform to do this, which enables very powerful possibilities. For example, you can do real-time data cleaning by joining the input data stream with precomputed spam information (maybe generated with Spark as well) and then filtering based on it.
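A rough sketch of that spam-filtering idea, assuming wordCounts is the (word, count) DStream from the sketches above; spamInfoRDD (a precomputed RDD of (word, spamScore) pairs) and the 0.5 threshold are made up for illustration:

```scala
import org.apache.spark.SparkContext._ // pair-RDD operations such as join

// Join every batch of word counts against the precomputed spam information
// and drop the entries that look like spam.
val cleanedDStream = wordCounts.transform { rdd =>
  rdd.join(spamInfoRDD)
     .filter { case (word, (count, spamScore)) => spamScore < 0.5 }
     .map { case (word, (count, _)) => (word, count) }
}
```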
In fact, you can also use machine learning and graph computation algorithms in the transform method.
To be continued...
Spark Streaming Programming Guide
Original post: http://www.cnblogs.com/fxjwind/p/3560011.html