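Pacemaker classifies errors into three types: soft, hard, and fatal. The latter two indicate environment or configuration problems that cannot be repaired automatically without manual intervention. Ordinary failures, such as a crashed service process or an unreachable network, generally use OCF_ERR_GENERIC as the return value, and OCF_ERR_GENERIC belongs to the soft type.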
Table B.3. Types of recovery performed by the cluster
Type | Description | Action Taken by the Cluster |
---|---|---|
soft | A transient error occurred | |
hard | A non-transient error that may be specific to the current node occurred | |
fatal | A non-transient error that will be common to all cluster nodes (e.g. a bad configuration was specified) | |
Table B.4. OCF Return Codes and their Recovery Types
RC | OCF Alias | Description | RT |
---|---|---|---|
0 | OCF_SUCCESS | | soft |
1 | OCF_ERR_GENERIC | | soft |
2 | OCF_ERR_ARGS | | hard |
3 | OCF_ERR_UNIMPLEMENTED | | hard |
4 | OCF_ERR_PERM | | hard |
5 | OCF_ERR_INSTALLED | | hard |
6 | OCF_ERR_CONFIGURED | | fatal |
7 | OCF_NOT_RUNNING | | N/A |
8 | OCF_RUNNING_MASTER | | soft |
9 | OCF_FAILED_MASTER | | soft |
other | NA | | soft |
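As a quick reference, the return-code table can be written down as a lookup. This is only a sketch that transcribes Table B.4; the helper names `recovery_type` and `needs_manual_fix` are made up for illustration and are not part of Pacemaker.

```python
# Return code -> (OCF alias, recovery type), transcribed from Table B.4.
OCF_RETURN_CODES = {
    0: ("OCF_SUCCESS", "soft"),
    1: ("OCF_ERR_GENERIC", "soft"),
    2: ("OCF_ERR_ARGS", "hard"),
    3: ("OCF_ERR_UNIMPLEMENTED", "hard"),
    4: ("OCF_ERR_PERM", "hard"),
    5: ("OCF_ERR_INSTALLED", "hard"),
    6: ("OCF_ERR_CONFIGURED", "fatal"),
    7: ("OCF_NOT_RUNNING", "N/A"),
    8: ("OCF_RUNNING_MASTER", "soft"),
    9: ("OCF_FAILED_MASTER", "soft"),
}


def recovery_type(rc):
    """Hypothetical helper: recovery type for an OCF return code.

    Codes not listed in the table fall into the "other" row, which is soft.
    """
    _alias, rtype = OCF_RETURN_CODES.get(rc, ("other", "soft"))
    return rtype


def needs_manual_fix(rc):
    """Hypothetical helper: hard and fatal errors need manual intervention."""
    return recovery_type(rc) in ("hard", "fatal")


if __name__ == "__main__":
    for rc in (1, 5, 6, 42):
        print(rc, recovery_type(rc), needs_manual_fix(rc))
```

A Resource Agent author can use a table like this to check that the code returned for a given failure leads to the recovery they actually want.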
Each operation of a resource has an on-fail property that controls how failures are handled.
Table 5.3. Properties of an Operation
Field | Description |
---|---|
id | |
name | |
interval | |
timeout | |
on-fail | The action to take if this action ever fails. Allowed values: ignore - Pretend the resource did not fail; block - Don't perform any further operations on the resource; stop - Stop the resource and do not start it elsewhere; restart - Stop the resource and start it again (possibly on a different node); fence - STONITH the node on which the resource failed; standby - Move all resources away from the node on which the resource failed |
enabled | |
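To make the operation properties concrete, here is a minimal sketch that builds an `op` element carrying the fields from Table 5.3 and rejects on-fail values not listed there. The function name `make_op`, the resource name `dummy`, and the id naming scheme are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Allowed on-fail values, as listed in Table 5.3.
ALLOWED_ON_FAIL = {"ignore", "block", "stop", "restart", "fence", "standby"}


def make_op(resource_id, name, interval, timeout, on_fail):
    """Hypothetical helper: render one operation definition as XML text."""
    if on_fail not in ALLOWED_ON_FAIL:
        raise ValueError("unsupported on-fail value: %r" % on_fail)
    op = ET.Element("op", {
        # The id scheme below is an assumption; any unique id would do.
        "id": "%s-%s-%s" % (resource_id, name, interval),
        "name": name,
        "interval": interval,
        "timeout": timeout,
        "on-fail": on_fail,
    })
    return ET.tostring(op, encoding="unicode")


if __name__ == "__main__":
    print(make_op("dummy", "monitor", "10s", "20s", "restart"))
```

Validating the value up front catches typos such as `on-fail="retry"` before the configuration is loaded into the cluster.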
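However, actual testing showed that no matter how on-fail is set, the behavior does not change; in other words, the default behavior always applies.
The table below shows how the resource manager reacts when each Resource Agent operation returns OCF_ERR_GENERIC: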
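Operation | Error handling | Corresponding on-fail value |
---|---|---|
start | Set fail-count=1000000 (Pacemaker's INFINITY); call stop on this node; start the resource on another node | restart |
stop | Set fail-count=1000000; block further operations on the resource, which enters the unmanaged FAILED state, for example: dummy (ocf::heartbeat:Dummy2): Started srdsdevapp69 (unmanaged) FAILED | block |
monitor | Set fail-count+=1; call stop, start, monitor in turn on this node. If monitor still fails, repeat stop, start, monitor until fail-count reaches migration-threshold, then leave the resource stopped | restart |
promote | Set fail-count+=1; call demote, stop, start in turn on this node; call promote on another node to promote the resource there to master | restart |
demote | Set fail-count+=1; call stop, start, demote in turn on this node. If demote still fails, repeat stop, start, demote until fail-count reaches migration-threshold, then leave the resource stopped | restart |
notify | Ignored | ignore |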
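Note 1: Timeouts are handled the same way as OCF_ERR_GENERIC.
Note 2: Pacemaker does not call post-stop notify for a resource that has already been stopped.
Note 3: Test environment: Pacemaker 1.1.7-6, CentOS 6.3.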
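The monitor row can be sketched as a small simulation. This is a simplified single-node model of the behavior observed in the test, not Pacemaker's actual scheduling logic; the function name `recover_from_monitor_failure` and the `monitor_succeeds` callback are hypothetical.

```python
def recover_from_monitor_failure(migration_threshold, monitor_succeeds):
    """Simplified model of recovery after a failed monitor on one node.

    monitor_succeeds: callable returning True when the re-run monitor passes.
    Returns (fail_count, final_state, calls), where calls is the sequence of
    agent operations invoked during recovery.
    """
    fail_count = 0
    calls = []
    while True:
        fail_count += 1  # every failed monitor increments fail-count by one
        if fail_count >= migration_threshold:
            # Threshold reached: stop the resource and leave it stopped.
            calls.append("stop")
            return fail_count, "Stopped", calls
        # Below the threshold: restart in place, then monitor again.
        calls.extend(["stop", "start", "monitor"])
        if monitor_succeeds():
            return fail_count, "Started", calls


if __name__ == "__main__":
    # Monitor never recovers: the resource ends up stopped at the threshold.
    print(recover_from_monitor_failure(3, lambda: False))
    # Monitor passes after the first restart: the resource stays started.
    print(recover_from_monitor_failure(3, lambda: True))
```

The model assumes that reaching migration-threshold ends the retry loop with a final stop; how the cluster then places the resource on other nodes is outside this sketch.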
The error-handling test results above offer Resource Agent authors a few takeaways:
Original article: http://blog.chinaunix.net/uid-20726500-id-5604144.html