
Spark Streaming Programming Guide


Reference: http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html

 

Overview

Spark Streaming supports many kinds of stream input, like Kafka, Flume, Twitter, ZeroMQ, or plain old TCP sockets. Transformations can be applied on top of these streams, and the final data can be stored in HDFS, a database, or a dashboard.
You can also apply Spark's in-built machine learning algorithms and graph processing algorithms to Spark Streaming, which is quite interesting.
The principle behind Spark Streaming is shown clearly in the figure below: the stream data is discretized, which gives rise to the concept of a DStream, really just a sequence of RDDs.

Spark Streaming is an extension of the core Spark API that enables high-throughput, fault-tolerant stream processing of live data streams.
Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ or plain old TCP sockets and be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s in-built machine learning algorithms, and graph processing algorithms on data streams.

[Figure: data from sources such as Kafka, Flume, Twitter, ZeroMQ, and TCP sockets flows into Spark Streaming, and processed results flow out to filesystems, databases, and dashboards]

Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.

[Figure: Spark Streaming divides the live input data stream into batches of input data, which the Spark engine processes into batches of results]

Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data stream from sources such as Kafka and Flume, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.

A Quick Example

Before we go into the details of how to write your own Spark Streaming program, let’s take a quick look at what a simple Spark Streaming program looks like.

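A minimal sketch of the classic network word count, written against the Spark 0.9-era Scala API (the master string, host, and port here are illustrative):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // implicit pair-DStream operations

// Local master with two threads: one for the socket receiver, one for processing
val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))

// DStream of text lines read from a TCP socket
val lines = ssc.socketTextStream("localhost", 9999)

// Count the words in each batch
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)

// Print a few counts from each batch, then start the computation
wordCounts.print()
ssc.start()
ssc.awaitTermination()

Running a netcat server with nc -lk 9999 and typing words into it will then produce per-batch counts on the console.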

Initializing

StreamingContext is the entry point for Spark Streaming, just as SparkContext is for Spark.
In fact, a StreamingContext wraps a SparkContext, which can be retrieved via streamingContext.sparkContext.
Apart from batchDuration, its parameters are the same as those of SparkContext.

To initialize a Spark Streaming program in Scala, a StreamingContext object has to be created, which is the main entry point of all Spark Streaming functionality.
A StreamingContext object can be created as follows.

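A minimal sketch (the general form follows the parameters described below; the app name and interval values are illustrative):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// General form: new StreamingContext(master, appName, batchInterval, [sparkHome], [jars])
val ssc = new StreamingContext("local", "MyStreamingApp", Seconds(1))

// The underlying SparkContext can be retrieved from the StreamingContext
val sc = ssc.sparkContext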

The master parameter is a standard Spark cluster URL and can be “local” for local testing.
The appName is a name of your program, which will be shown on your cluster’s web UI.
The batchInterval is the size of the batches, as explained earlier.
Finally, the last two parameters are needed to deploy your code to a cluster if running in distributed mode, as described in the Spark programming guide.
Additionally, the underlying SparkContext can be accessed as streamingContext.sparkContext.

The batch interval must be set based on the latency requirements of your application and available cluster resources. See the Performance Tuning section for more details.

DStreams

The figures below illustrate clearly what a DStream is and what a transformation on a DStream looks like.

Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from a source, or the processed data stream generated by transforming the input stream. Internally, it is represented by a continuous sequence of RDDs, which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval, as shown in the following figure.

[Figure: a DStream shown as a continuous sequence of RDDs, each RDD containing the data of one batch interval]

Any operation applied on a DStream translates to operations on the underlying RDDs. For example, in the earlier example of converting a stream of lines to words, the flatMap operation is applied on each RDD in the lines DStream to generate the RDDs of the words DStream. This is shown in the following figure.

[Figure: the flatMap operation applied to each RDD of the lines DStream, generating the corresponding RDDs of the words DStream]

These underlying RDD transformations are computed by the Spark engine. The DStream operations hide most of these details and provide the developer with a higher-level API for convenience. These operations are discussed in detail in later sections.

Operations

DStreams support transformations and output operations; the latter are a bit different from Spark's actions.

There are two kinds of DStream operations - transformations and output operations.
Similar to RDD transformations, DStream transformations operate on one or more DStreams to create new DStreams with transformed data.
After applying a sequence of transformations to the input streams, output operations need to be called, which write data out to an external data sink, such as a filesystem or a database. For instance, see the sketch below.
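A sketch of two common output operations, using the wordCounts stream from the quick example (the HDFS path prefix is illustrative):

// Print the first few elements of each batch on the driver
wordCounts.print()

// Persist each batch as a set of text files named by prefix and suffix
wordCounts.saveAsTextFiles("hdfs://namenode:8020/streaming/counts", "txt")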

Most transformations are the same as those on RDDs; the following ones deserve special attention.

UpdateStateByKey Operation

This continuously updates state with streaming data, as in the word count example below.
You only need to define an updateFunction that describes how to update the existing state with the new values.

The updateStateByKey operation allows you to maintain arbitrary stateful computation, where you want to maintain some state data and continuously update it with new information.
To use this, you will have to do two steps.

  1. Define the state - The state can be of arbitrary data type.
  2. Define the state update function - Specify with a function how to update the state using the previous state and the new values from input stream.

Let’s illustrate this with an example. Say you want to maintain a running count of each word seen in a text data stream. Here, the running count is the state and it is an integer. We define the update function as follows.

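A sketch of the function, together with its application (pairs is assumed to be a DStream of (word, 1) pairs, as in the quick example):

// Merge the values seen in the current batch into the previous running count
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  Some(runningCount.getOrElse(0) + newValues.sum)
}

// Requires checkpointing to be enabled, e.g. ssc.checkpoint("checkpoint-dir")
val runningCounts = pairs.updateStateByKey[Int](updateFunction _)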

Transform Operation

The transform operation (along with its variations like transformWith) allows arbitrary RDD-to-RDD functions to be applied on a DStream. It can be used to apply any RDD operation that is not exposed in the DStream API. For example, the functionality of joining every batch in a data stream with another dataset is not directly exposed in the DStream API. However, you can easily use transform to do this. This enables very powerful possibilities. For example, you can do real-time data cleaning by joining the input data stream with precomputed spam information (maybe generated with Spark as well) and then filtering based on it.

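A minimal sketch of that spam-filtering join (the spam data, variable names, and the use of leftOuterJoin are illustrative assumptions, not the guide's exact code):

import org.apache.spark.SparkContext._ // implicit pair-RDD operations
import org.apache.spark.rdd.RDD

// Precomputed spam information: (key, isSpam) pairs
val spamInfoRDD: RDD[(String, Boolean)] =
  ssc.sparkContext.parallelize(Seq(("known-spammer", true)))

// Join every batch with the spam info and drop flagged records;
// leftOuterJoin keeps records that have no entry in the spam table
val cleanedDStream = wordCounts.transform { rdd =>
  rdd.leftOuterJoin(spamInfoRDD)
     .filter { case (_, (_, isSpam)) => !isSpam.getOrElse(false) }
     .mapValues { case (count, _) => count }
}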

In fact, you can also use machine learning and graph computation algorithms in the transform method.

To be continued...


Original: http://www.cnblogs.com/fxjwind/p/3560011.html
