
[Translation] Fault Tolerance for Stream Computing


This document describes Flink's fault tolerance mechanism for streaming computation.

Introduction

Flink provides a fault tolerance mechanism to consistently recover the state of streaming applications. The mechanism ensures that even in the presence of failures, every record is processed exactly once. Note that there is a switch to downgrade this guarantee to at least once (described in detail below).

The fault tolerance mechanism continuously draws snapshots of the distributed data flow. For streaming applications with small state, these snapshots are very lightweight and have little impact on performance. The state of the streaming application is stored at a configurable place (such as the master node or HDFS).

If the program fails (due to machine failure, network failure, or a software bug), Flink stops the distributed data flows. The system then restarts them and resets them to the latest successful checkpoint. The input streams are rewound to the snapshot point. Any record processed by the restarted data flow is guaranteed not to have been part of the previously checkpointed state; in other words, the system does not redo work that the checkpoint already covers.

Note: For this mechanism to realize its full guarantees, the data source (such as a message queue or broker) needs to be able to rewind the data stream to a defined recent point. Apache Kafka has this ability, and Flink's Kafka connector exploits it.
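To make the Kafka note concrete, here is a minimal sketch (not part of the original post) of a checkpointed job reading from a rewindable Kafka source. The topic name, broker address, group id and the 5-second checkpoint interval are made-up example values, and the exact connector class name varies across Flink versions (older releases shipped version-suffixed consumers such as FlinkKafkaConsumer09).

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class CheckpointedKafkaJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Draw a distributed snapshot every 5 seconds. On failure, the Kafka
        // offsets stored in the latest completed snapshot are used to rewind
        // the source to that point.
        env.enableCheckpointing(5000);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.setProperty("group.id", "flink-demo");              // placeholder group id

        DataStream<String> lines = env.addSource(
                new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props));

        lines.print();
        env.execute("Checkpointed Kafka job");
    }
}
```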

Note: Because Flink's checkpoints are realized through distributed snapshots, we use the words snapshot and checkpoint interchangeably.

Checkpointing

The central part of Flink's fault tolerance mechanism is continuously drawing snapshots of the distributed data stream and operator state. These snapshots act as consistent checkpoints to which the system can fall back if a failure occurs. The snapshotting mechanism is described in "Lightweight Asynchronous Snapshots for Distributed Dataflows". It is inspired by the Chandy-Lamport algorithm for distributed snapshots and is tailored specifically to Flink's execution model.

Barriers

A core element in Flink’s distributed snapshotting are the stream barriers. These barriers are injected into the data stream and flow with the records as part of the data stream. Barriers never overtake records; they flow strictly in line. A barrier separates the records in the data stream into the set of records that goes into the current snapshot, and the records that go into the next snapshot. Each barrier carries the ID of the snapshot whose records it pushed in front of it. Barriers do not interrupt the flow of the stream and are hence very lightweight. Multiple barriers from different snapshots can be in the stream at the same time, which means that various snapshots may happen concurrently.


[Figure: checkpoint barriers injected into the data stream, separating the records of consecutive snapshots]

Stream barriers are injected into the parallel data flow at the stream sources. The point where the barriers for snapshot n are injected (let’s call it Sn) is the position in the source stream up to which the snapshot covers the data. For example, in Apache Kafka, this position would be the last record’s offset in the partition. This position Sn is reported to the checkpoint coordinator (Flink’s JobManager).

The barriers then flow downstream. When an intermediate operator has received a barrier for snapshot n from all of its input streams, it emits a barrier for snapshot n into all of its outgoing streams. Once a sink operator (the end of a streaming DAG) has received barrier n from all of its input streams, it acknowledges snapshot n to the checkpoint coordinator. After all sinks have acknowledged a snapshot, it is considered completed.

When snapshot n is completed, it is certain that no records from before Sn will be needed any more from the source, because these records (and their descendant records) have passed through the entire data flow topology.

[Figure: barrier alignment at an operator with multiple input streams]

Operators that receive more than one input stream need to align the input streams on the snapshot barriers. The figure above illustrates this, and a simplified sketch of the alignment logic follows the list:

  • As soon as the operator receives snapshot barrier n from an incoming stream, it cannot process any further records from that stream until it has received barrier n from the other inputs as well. Otherwise, it would mix records that belong to snapshot n with records that belong to snapshot n+1.
  • Streams that report barrier n are temporarily set aside. Records that are received from these streams are not processed, but put into an input buffer.
  • Once the last stream has received barrier n, the operator emits all pending outgoing records, and then emits snapshot n barriers itself.
  • After that, it resumes processing records from all input streams, processing records from the input buffers before processing the records from the streams.
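To summarize the four steps above, here is a simplified sketch of the alignment logic. It is only an illustration under assumed types (InputChannel, Record, Barrier); Flink's real alignment happens inside its network/runtime layer, not in user code.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Simplified illustration of barrier alignment for an operator with several
// input channels. All types here are hypothetical stand-ins.
class BarrierAligner {
    private final Set<InputChannel> blocked = new HashSet<>();  // channels that already delivered barrier n
    private final Queue<Record> buffered = new ArrayDeque<>();  // records set aside during alignment
    private final List<InputChannel> inputs;

    BarrierAligner(List<InputChannel> inputs) {
        this.inputs = inputs;
    }

    void onRecord(InputChannel from, Record record) {
        if (blocked.contains(from)) {
            buffered.add(record);   // this stream already reported the barrier: set the record aside
        } else {
            process(record);        // this stream is still pre-barrier: process normally
        }
    }

    void onBarrier(InputChannel from, Barrier barrier) {
        blocked.add(from);
        if (blocked.size() == inputs.size()) {       // barrier n has arrived on every input
            snapshotState(barrier.checkpointId());   // snapshot the operator state
            emitBarrierDownstream(barrier);          // forward barrier n to all outgoing streams
            blocked.clear();
            while (!buffered.isEmpty()) {            // drain the input buffer before reading the streams again
                process(buffered.poll());
            }
        }
    }

    // The methods below stand in for the operator's real behavior.
    void process(Record r) { /* apply the user function */ }
    void snapshotState(long checkpointId) { /* write state to the backend */ }
    void emitBarrierDownstream(Barrier b) { /* broadcast to output streams */ }
}

// Hypothetical supporting types for the sketch.
class Record {}
class InputChannel {}
class Barrier {
    private final long id;
    Barrier(long id) { this.id = id; }
    long checkpointId() { return id; }
}
```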

State

When operators contain any form of state, this state must be part of the snapshots as well. Operator state comes in different forms:

  • User-defined state: This is state that is created and modified directly by the transformation functions (like map() or filter()). User-defined state can either be a simple variable in the function’s Java object, or the associated key/value state of a function (see State in Streaming Applications for details; a keyed-state sketch follows this list).
  • System state: This state refers to data buffers that are part of the operator’s computation. A typical example for this state are the window buffers, inside which the system collects (and aggregates) records for windows until the window is evaluated and evicted.
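As an illustration of user-defined state, the sketch below keeps a running count per key in Flink's keyed value state. The class and field names are made up, and the ValueState/ValueStateDescriptor API shown here comes from later Flink releases than the one this post was written against (where the Checkpointed interface played a similar role); either way, state registered like this becomes part of every snapshot.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Counts events per key. The running count is user-defined keyed state, so it
// is written into every checkpoint snapshot and restored on recovery.
public class CountPerKey extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        // Register the state with the runtime so it is checkpointed.
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = count.value();
        long updated = (current == null ? 0L : current) + in.f1;
        count.update(updated);
        out.collect(Tuple2.of(in.f0, updated));
    }
}
```

Such a function would typically be applied after a keyBy, e.g. stream.keyBy(t -> t.f0).flatMap(new CountPerKey()), so that the state is scoped per key and is snapshotted and restored together with the checkpoints.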

Operators snapshot their state at the point in time when they received all snapshot barriers from their input streams, before emitting the barriers to their output streams. At that point, all updates to the state from records before the barriers will have been made, and no updates that depend on records from after the barriers have been applied. Because the state of a snapshot may be potentially large, it is stored in a configurable state backend. By default, this is the JobManager’s memory, but for serious setups, a distributed reliable storage should be configured (such as HDFS). After the state has been stored, the operator acknowledges the checkpoint, emits the snapshot barrier into the output streams, and proceeds.
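A minimal sketch of choosing a durable state backend follows; FsStateBackend and the HDFS URI below are just one example of a "distributed reliable storage", and the path is a placeholder.

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendSetup {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // By default, snapshots are kept in the JobManager's memory. For
        // serious setups, point the backend at a distributed, reliable
        // filesystem instead. The HDFS address here is a placeholder.
        env.setStateBackend(new FsStateBackend("hdfs://namenode:9000/flink/checkpoints"));

        env.enableCheckpointing(10_000);          // snapshot every 10 seconds
        env.fromElements(1, 2, 3).print();        // trivial pipeline so the sketch runs end to end
        env.execute("Job with a durable state backend");
    }
}
```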

The resulting snapshot now contains:

  • For each parallel stream data source, the offset/position in the stream when the snapshot was started
  • For each operator, a pointer to the state that was stored as part of the snapshot
[Figure: a completed snapshot, holding the source offsets and pointers to the stored operator state]
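As a rough mental model (hypothetical classes, not Flink's actual checkpoint metadata), the completed snapshot can be pictured as a small record of source offsets plus pointers into the state backend:

```java
import java.util.Map;

// Hypothetical, simplified view of what a completed snapshot records;
// an illustration of the idea only, not Flink's real metadata classes.
class CompletedSnapshot {
    final long checkpointId;
    final Map<String, Long> sourceOffsets;    // parallel source instance -> offset/position Sn
    final Map<String, String> stateHandles;   // operator instance -> pointer into the state backend

    CompletedSnapshot(long checkpointId,
                      Map<String, Long> sourceOffsets,
                      Map<String, String> stateHandles) {
        this.checkpointId = checkpointId;
        this.sourceOffsets = sourceOffsets;
        this.stateHandles = stateHandles;
    }
}
```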

Exactly Once vs. At Least Once

The alignment step may add latency to the streaming program. Usually, this extra latency is in the order of a few milliseconds, but we have seen cases where the latency of some outliers increased noticeably. For applications that require consistently super low latencies (few milliseconds) for all records, Flink has a switch to skip the stream alignment during a checkpoint. Checkpoint snapshots are still drawn as soon as an operator has seen the checkpoint barrier from each input.

When the alignment is skipped, an operator keeps processing all inputs, even after some checkpoint barriers for checkpoint n arrived. That way, the operator also processes elements that belong to checkpoint n+1 before the state snapshot for checkpoint n was taken. On a restore, these records will occur as duplicates, because they are both included in the state snapshot of checkpoint n, and will be replayed as part of the data after checkpoint n.

NOTE: Alignment happens only for operators with multiple predecessors (joins) as well as operators with multiple senders (after a stream repartitioning/shuffle). Because of that, dataflows with only embarrassingly parallel streaming operations (map(), flatMap(), filter(), …) actually give exactly once guarantees even in at least once mode.
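The switch mentioned at the start of this section corresponds to the checkpointing mode passed when enabling checkpoints; a minimal sketch (the 1-second interval is an example value):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AtLeastOnceSetup {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Exactly-once (with barrier alignment) is the default:
        // env.enableCheckpointing(1000, CheckpointingMode.EXACTLY_ONCE);

        // Skip alignment to shave off latency, accepting at-least-once semantics:
        env.enableCheckpointing(1000, CheckpointingMode.AT_LEAST_ONCE);

        env.fromElements("a", "b", "c").print();   // trivial pipeline for the sketch
        env.execute("At-least-once checkpointing");
    }
}
```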

Recovery

Recovery under this mechanism is straightforward: Upon a failure, Flink selects the latest completed checkpoint k. The system then re-deploys the entire distributed dataflow, and gives each operator the state that was snapshotted as part of checkpoint k. The sources are set to start reading the stream from position Sk. For example in Apache Kafka, that means telling the consumer to start fetching from offset Sk.

If state was snapshotted incrementally, the operators start with the state of the latest full snapshot and then apply a series of incremental snapshot updates to that state.


Original: http://www.cnblogs.com/Chuck-wu/p/5027528.html
