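To make it easier to test RAC-related issues, I built a RAC environment on my local machine. When I opened it yesterday, the RAC would not start. Fine: I treated it as a hands-on drill.
Test environment: Red Hat 6.3 x64 + Oracle 11gR2 RAC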
II. Troubleshooting process:
Some time after starting the virtual machines, I checked the cluster status with the following command:
[grid@rac01 ~]$ crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
Check the CRS service status:
[root@rac01 rac-cluster]# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
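When this check has to be repeated many times during a recovery, it can help to reduce the output to just the failing message codes. The following is a minimal sketch, not part of the original procedure: the function name `failed_checks` is made up for illustration, and it assumes that healthy components are always reported with the phrase "is online", as in the CRS-4638 line above.

```shell
#!/bin/sh
# Print the CRS message codes that do NOT report "is online".
# Reads `crsctl check crs` output on stdin, prints one code per line.
# failed_checks is a hypothetical helper name, not an Oracle tool.
failed_checks() {
    grep '^CRS-' | grep -v 'is online' | cut -d: -f1
}

# Sample input copied from the output shown above.
sample_output='CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager'

printf '%s\n' "$sample_output" | failed_checks
```

On a live node the same function could be fed directly, for example `crsctl check crs | failed_checks`.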
Start the cluster resources:
[root@rac01 bin]# crsctl start cluster
CRS-2800: Cannot start resource 'ora.asm' as it is already in the INTERMEDIATE state on server 'rac01'
CRS-4000: Command Start failed, or completed with errors.
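I checked the relevant logs and obtained the information below. I did not find anything more useful in the other logs; if you have a good suggestion, please contact me.
Check the CSS voting disk information: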
[grid@rac01 ~]$ crsctl query css votedisk
##  STATE    File Universal Id                File Name         Disk group
--  -----    -----------------                ---------         ---------
 1. ONLINE   aaaf9f57bc9c4fc7bfb57ac937d2d149 (/dev/asm-diskb)  [CRS]
Next, I checked the ASM disk group information from the ASM instance:
SQL> select NAME , STATE FROM V$ASM_DISKGROUP;
NAME                           STATE
------------------------------ -----------
DATA                           DISMOUNTED
CRS                            DISMOUNTED
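OK, try to mount the disk groups, starting with CRS. (Later, while writing this up, I noticed something odd: the CSS query above reported the voting disk as ONLINE, yet the CRS disk group could not be mounted. I did not try a forced mount, so this still needs further investigation.)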
SQL> alter diskgroup crs mount;
alter diskgroup crs mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
Try to mount the DATA disk group:
SQL> alter diskgroup data mount;
Diskgroup altered.
SQL> select NAME , STATE FROM V$ASM_DISKGROUP;
NAME                           STATE
------------------------------ -----------
DATA                           MOUNTED
CRS                            DISMOUNTED
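The same query result can be filtered mechanically when it is captured from a script. This is a small sketch under stated assumptions: `dismounted_groups` is a hypothetical helper name, and the input is assumed to be the two-column NAME/STATE listing printed by SQL*Plus, exactly as shown above.

```shell
#!/bin/sh
# Print the names of disk groups whose STATE column is DISMOUNTED.
# Reads the NAME/STATE listing from V$ASM_DISKGROUP on stdin.
# dismounted_groups is a hypothetical helper name.
dismounted_groups() {
    awk '$2 == "DISMOUNTED" { print $1 }'
}

# Sample input copied from the query result shown above.
sample_groups='NAME                           STATE
------------------------------ -----------
DATA                           MOUNTED
CRS                            DISMOUNTED'

printf '%s\n' "$sample_groups" | dismounted_groups
```

The header and separator lines are skipped naturally because their second field is never the literal word DISMOUNTED.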
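Note: this write-up records the steps as I performed them at the time. I did not investigate the problem in depth then; more questions came up while preparing this document, but I will leave them aside for now.
Since the DATA disk group is usable, the plan is to move the OCR and voting disk to the DATA disk group first. I had never backed up the OCR manually, so the only option was to restore from an automatic backup.
Stop the CRS services; run this on both nodes: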
[root@rac01 rac-cluster]# crsctl stop has -f
Start the stack again in exclusive mode without CRSD (-excl -nocrs); run this on node 1 only:
[root@rac01 rac-cluster]# crsctl start crs -excl -nocrs
CRS-4123: Oracle High Availability Services has been started.
CRS-2672: Attempting to start 'ora.mdnsd' on 'rac01'
CRS-2676: Start of 'ora.mdnsd' on 'rac01' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'rac01'
CRS-2676: Start of 'ora.gpnpd' on 'rac01' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rac01'
CRS-2672: Attempting to start 'ora.gipcd' on 'rac01'
CRS-2676: Start of 'ora.cssdmonitor' on 'rac01' succeeded
CRS-2676: Start of 'ora.gipcd' on 'rac01' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'rac01'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac01'
CRS-2676: Start of 'ora.diskmon' on 'rac01' succeeded
CRS-2676: Start of 'ora.cssd' on 'rac01' succeeded
CRS-2672: Attempting to start 'ora.drivers.acfs' on 'rac01'
CRS-2679: Attempting to clean 'ora.cluster_interconnect.haip' on 'rac01'
CRS-2672: Attempting to start 'ora.ctssd' on 'rac01'
CRS-2681: Clean of 'ora.cluster_interconnect.haip' on 'rac01' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'rac01'
CRS-2676: Start of 'ora.drivers.acfs' on 'rac01' succeeded
CRS-2676: Start of 'ora.ctssd' on 'rac01' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'rac01' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'rac01'
CRS-2676: Start of 'ora.asm' on 'rac01' succeeded
Edit /etc/oracle/ocr.loc and change the OCR location to the DATA disk group (+DATA). This must be done on both nodes.
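The edit itself is a one-line change, so it can be scripted to keep both nodes consistent. The sketch below is an illustration only: the original post does not show the file, so the contents used here (an `ocrconfig_loc=` line followed by `local_only=FALSE`) are an assumption about the usual 11gR2 layout, and `set_ocr_location` is a hypothetical helper. It works on a throwaway copy in a temporary directory and keeps a `.bak` of the previous version.

```shell
#!/bin/sh
# Rewrite the ocrconfig_loc entry of an ocr.loc-style file, keeping a backup.
# set_ocr_location is a hypothetical helper; the file layout is an assumption.
set_ocr_location() {
    loc_file=$1
    new_location=$2
    # Keep the previous version next to the file before changing anything.
    cp "$loc_file" "$loc_file.bak" || return 1
    sed "s|^ocrconfig_loc=.*|ocrconfig_loc=$new_location|" "$loc_file.bak" > "$loc_file"
}

# Demo on a scratch copy, never on /etc/oracle/ocr.loc directly.
demo_dir=$(mktemp -d)
printf 'ocrconfig_loc=+CRS\nlocal_only=FALSE\n' > "$demo_dir/ocr.loc"

set_ocr_location "$demo_dir/ocr.loc" '+DATA'
cat "$demo_dir/ocr.loc"
```

On a real node the function would be pointed at /etc/oracle/ocr.loc as root, after confirming that the file really has this layout.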
List the available backups and restore from the most recent one.
Command to list the backups: ocrconfig -showbackup
[root@rac01 rac-cluster]# ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          3
         Total space (kbytes)     :     262120
         Used space (kbytes)      :       3088
         Available space (kbytes) :     259032
         ID                       :  471595559
         Device/File Name         :      +DATA
                                    Device/File integrity check succeeded
                                    Device/File not configured
                                    Device/File not configured
                                    Device/File not configured
                                    Device/File not configured
         Cluster registry integrity check succeeded
         Logical corruption check succeeded
Create the voting disk:
The following error occurred during creation; the fix is described below.
[root@rac01 rac-cluster]# crsctl replace votedisk +DATA
CRS-4602: Failed 27 to add voting file 7255773670ae4fa9bf64a150a9fd5915.
Failure 27 with Cluster Synchronization Services while deleting voting disk.
Failed to replace voting disk group with +DATA.
CRS-4000: Command Replace failed, or completed with errors.
Set the ASM disk discovery path (asm_diskstring was empty):
SQL> show parameter asm_diskstring
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
asm_diskstring                       string
SQL> alter system set asm_diskstring = '/dev/asm*';
System altered.
SQL> create spfile='+DATA' from memory;
File created.
SQL> startup force mount;
Create the voting disk again:
[root@rac01 rac-cluster]# crsctl replace votedisk +DATA
Successful addition of voting disk 383b8c3e4db34f72bf9dedd15e47471b.
Successful deletion of voting disk aaaf9f57bc9c4fc7bfb57ac937d2d149.
Successfully replaced voting disk group with +DATA.
CRS-4266: Voting file(s) successfully replaced
Stop the cluster services and start them again:
[root@rac01 rac-cluster]# crsctl stop has -f
CRS-4123: Oracle High Availability Services has been started.
The cluster status check below shows that the CRS disk group (ora.CRS.dg) is OFFLINE; its disks need to be rebuilt with the ASM management tools.
[root@rac01 bin]# crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.CRS.dg     ora....up.type ONLINE    OFFLINE
ora.DATA.dg    ora....up.type ONLINE    ONLINE    rac01
ora....ER.lsnr ora....er.type ONLINE    ONLINE    rac01
ora....N1.lsnr ora....er.type ONLINE    ONLINE    rac01
ora.asm        ora.asm.type   ONLINE    ONLINE    rac01
ora.cvu        ora.cvu.type   ONLINE    ONLINE    rac01
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora....network ora....rk.type ONLINE    ONLINE    rac01
ora.oc4j       ora.oc4j.type  ONLINE    ONLINE    rac01
ora.ons        ora.ons.type   ONLINE    ONLINE    rac01
ora....SM1.asm application    ONLINE    ONLINE    rac01
ora....01.lsnr application    ONLINE    ONLINE    rac01
ora.rac01.gsd  application    OFFLINE   OFFLINE
ora.rac01.ons  application    ONLINE    ONLINE    rac01
ora.rac01.vip  ora....t1.type ONLINE    ONLINE    rac01
ora....SM2.asm application    ONLINE    ONLINE    rac02
ora....02.lsnr application    ONLINE    ONLINE    rac02
ora.rac02.gsd  application    OFFLINE   OFFLINE
ora.rac02.ons  application    ONLINE    ONLINE    rac02
ora.rac02.vip  ora....t1.type ONLINE    ONLINE    rac02
ora.racdb.db   ora....se.type OFFLINE   OFFLINE
ora....ry.acfs ora....fs.type ONLINE    ONLINE    rac01
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    rac01
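In a listing this long, the interesting rows are the ones whose Target is ONLINE while the State is something else. A minimal sketch of that filter follows; `not_running` is a hypothetical helper name, and it assumes the five-column `crs_stat -t` layout (Name, Type, Target, State, Host) shown above. Resources that are intentionally OFFLINE, such as ora.gsd, are not reported.

```shell
#!/bin/sh
# Print resources whose Target is ONLINE but whose State is not ONLINE.
# Reads `crs_stat -t` output on stdin.
# not_running is a hypothetical helper name.
not_running() {
    awk '$3 == "ONLINE" && $4 != "ONLINE" { print $1 }'
}

# Sample input: a few rows taken from the listing above.
sample_status='Name           Type           Target    State     Host
------------------------------------------------------------
ora.CRS.dg     ora....up.type ONLINE    OFFLINE
ora.DATA.dg    ora....up.type ONLINE    ONLINE    rac01
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora.racdb.db   ora....se.type OFFLINE   OFFLINE'

printf '%s\n' "$sample_status" | not_running
```

For the sample rows this leaves only ora.CRS.dg, which matches the observation that the CRS disk group is the one remaining problem.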
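III. Summary:
In this test system, the problem was resolved mainly by restoring the cluster's automatic OCR backup into a different disk group. This only fixed the symptom; the root cause was not identified and still needs to be investigated. Virtual environments do tend to break easily, and working through failures like this one is a good way to build troubleshooting skills. The disk group that failed here was CRS; its contents were restored from backup and moved into the DATA disk group. Two lessons follow. First, data needs a defined backup plan. Second, a problem like this should be handled more carefully and with a better plan.
Original: http://blog.itpub.net/29487349/viewspace-1699535/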