Fixing HACMP volume groups stuck in Non-Concurrent mode in a RAC environment

Date: 2014-04-10 09:53:55

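   In a RAC environment, having the HACMP volume groups in a non-concurrent state is about as bad as it gets, and that misfortune struck one of our production systems.

Environment

AIX 6.1, EMC CLARiiON storage, Oracle 10.2.0.5.

The problem

   First, the state of each volume group: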

data03vg

lsvg data03vg
   VOLUME GROUP:       data03vg                 VG IDENTIFIER:  00f79d1100004c00000001386f00edfb
   VG STATE:           active                   PP SIZE:        128 megabyte(s)
   VG PERMISSION:      read/write               TOTAL PPs:      5315 (680320 megabytes)
   MAX LVs:            512                      FREE PPs:       711 (91008 megabytes)
   LVs:                78                       USED PPs:       4604 (589312 megabytes)
   OPEN LVs:           0                        QUORUM:         3 (Enabled)
   TOTAL PVs:          5                        VG DESCRIPTORS: 5
   STALE PVs:          0                        STALE PPs:      0
   ACTIVE PVs:         5                        AUTO ON:        no
   Concurrent:         Enhanced-Capable         Auto-Concurrent: Disabled
   VG Mode:            Non-Concurrent
   MAX PPs per VG:     130048
   MAX PPs per PV:     2032                     MAX PVs:        64
   LTG size (Dynamic): 1024 kilobyte(s)         AUTO SYNC:      no
   HOT SPARE:          no                       BB POLICY:      relocatable
   PV RESTRICTION:     none                     INFINITE RETRY: no

data01vg

lsvg data01vg
   VOLUME GROUP:       data01vg                 VG IDENTIFIER:  00f79d1100004c00000001386effcc48
   VG STATE:           active                   PP SIZE:        128 megabyte(s)
   VG PERMISSION:      read/write               TOTAL PPs:      6378 (816384 megabytes)
   MAX LVs:            512                      FREE PPs:       1146 (146688 megabytes)
   LVs:                88                       USED PPs:       5232 (669696 megabytes)
   OPEN LVs:           0                        QUORUM:         4 (Enabled)
   TOTAL PVs:          6                        VG DESCRIPTORS: 6
   STALE PVs:          0                        STALE PPs:      0
   ACTIVE PVs:         6                        AUTO ON:        no
   Concurrent:         Enhanced-Capable         Auto-Concurrent: Disabled
   VG Mode:            Non-Concurrent
   MAX PPs per VG:     130048
   MAX PPs per PV:     2032                     MAX PVs:        64
   LTG size (Dynamic): 1024 kilobyte(s)         AUTO SYNC:      no
   HOT SPARE:          no                       BB POLICY:      relocatable
   PV RESTRICTION:     none                     INFINITE RETRY: no

data02vg

lsvg data02vg
   VOLUME GROUP:       data02vg                 VG IDENTIFIER:  00f79d1100004c00000001386f007c90
   VG STATE:           active                   PP SIZE:        128 megabyte(s)
   VG PERMISSION:      read/write               TOTAL PPs:      2126 (272128 megabytes)
   MAX LVs:            512                      FREE PPs:       18 (2304 megabytes)
   LVs:                39                       USED PPs:       2108 (269824 megabytes)
   OPEN LVs:           0                        QUORUM:         2 (Enabled)
   TOTAL PVs:          2                        VG DESCRIPTORS: 3
   STALE PVs:          0                        STALE PPs:      0
   ACTIVE PVs:         2                        AUTO ON:        no
   Concurrent:         Enhanced-Capable         Auto-Concurrent: Disabled
   VG Mode:            Non-Concurrent
   MAX PPs per VG:     130048
   MAX PPs per PV:     2032                     MAX PVs:        64
   LTG size (Dynamic): 1024 kilobyte(s)         AUTO SYNC:      no
   HOT SPARE:          no                       BB POLICY:      relocatable
   PV RESTRICTION:     none                     INFINITE RETRY: no
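
Reading three full listings just to find one field gets tedious. As a rough sketch (not part of the original procedure), the mode can be pulled out with a small awk filter. The function name vg_mode is made up here, and the filter assumes the output contains a "VG Mode:" line laid out as in the listings above.

```shell
# vg_mode: read `lsvg <vgname>` output on stdin and print the VG Mode value.
# Hypothetical helper; assumes a "VG Mode:" line like the ones shown above.
vg_mode() {
    awk '/VG Mode:/ {
        sub(/.*VG Mode:[ \t]*/, "")   # drop everything up to the value
        print $1                      # first word left is the mode
    }'
}

# Usage on a live node (commented out so the block has no side effects):
#   for vg in data01vg data02vg data03vg; do
#       printf '%s: %s\n' "$vg" "$(lsvg "$vg" | vg_mode)"
#   done
```

On the system described here, this would have printed Non-Concurrent for all three volume groups.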

The VG Mode of all three is Non-Concurrent. Next, the state of the PVs in each VG:

data03vg

lsvg -p data03vg
   data03vg:
   PV_NAME           PV STATE          TOTAL PPs   FREE PPs    FREE DISTRIBUTION
   hdiskpower11      active            1063        43          01..00..00..00..42
   hdiskpower17      removed           1063        167         21..00..00..00..146
   hdiskpower18      removed           1063        167         21..00..00..00..146
   hdiskpower19      removed           1063        167         21..00..00..00..146
   hdiskpower20      removed           1063        167         21..00..00..00..146

data01vg

lsvg -p data01vg
   data01vg:
   PV_NAME           PV STATE          TOTAL PPs   FREE PPs    FREE DISTRIBUTION
   hdiskpower7       active            1063        0           00..00..00..00..00
   hdiskpower8       active            1063        20          00..00..00..00..20
   hdiskpower9       active            1063        24          02..00..00..00..22
   hdiskpower10      active            1063        0           00..00..00..00..00
   hdiskpower16      missing           1063        551         21..00..105..212..213
   hdiskpower21      missing           1063        551

data02vg

lsvg -p data02vg
   data02vg:
   PV_NAME           PV STATE          TOTAL PPs   FREE PPs    FREE DISTRIBUTION
   hdiskpower0       active            1063        0           00..00..00..00..00
   hdiskpower24      active            1063        18          00..00..00..00..18
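
To spot the problem disks without scanning every row, a filter along these lines could help. This is only a sketch: inactive_pvs is a name invented for illustration, and it assumes the usual `lsvg -p` layout with the VG name on the first line, a header on the second, and the PV state in the second column.

```shell
# inactive_pvs: read `lsvg -p <vgname>` output on stdin and print
# "<pv> <state>" for every PV whose state is not "active".
# Hypothetical helper; skips the VG-name line and the column header.
inactive_pvs() {
    awk 'NR > 2 && NF >= 2 && $2 != "active" { print $1, $2 }'
}

# Usage on a live node (commented out so the block has no side effects):
#   for vg in data01vg data02vg data03vg; do
#       lsvg -p "$vg" | inactive_pvs
#   done
```

Rows with fewer than two fields are ignored, so wrapped FREE DISTRIBUTION continuation lines do not produce false hits.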

Many of the disks are either missing or removed. The database log shows these errors:

Thu Mar 21 17:53:58 BEIST 2013
Errors in file /oracle/app/oracle/admin/ctsdb/bdump/ctsdb2_m000_19595456.trc:
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error

odmget HACMPdisktype
  HACMPdisktype:
          PdDvLn = "disk/pseudo/power"
          ghostdisks = "SCSI3"
          checkres = "SCSI_TUR"
          breakres = "/usr/lpp/EMC/Symmetrix/bin/emcpowerreset"
          parallel = "false"
          makedev = "MKDEV"
          reserved1 = ""
          reserved2 = ""
          reserved3 = ""

As you can see, parallel = "false". That is bad news.
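
Since this has to be checked on both nodes, before and after the change, a one-line check is handy. The following is a minimal sketch under the assumption that the stanza looks like the odmget output above (one `name = "value"` pair per line, value without spaces). The helper name odm_attr is hypothetical.

```shell
# odm_attr: read an ODM stanza (as printed by odmget) on stdin and print the
# value of the attribute named in $1, without the surrounding double quotes.
# Hypothetical helper; only handles simple values that contain no spaces.
odm_attr() {
    awk -v key="$1" '$1 == key && $2 == "=" {
        val = $3
        gsub(/"/, "", val)   # strip the double quotes around the value
        print val
    }'
}

# Usage on a live node (commented out so the block has no side effects):
#   odmget HACMPdisktype | odm_attr parallel
```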

lssrc -a | grep cl
   clcomd           caa              7929856      active
   clcomdES         clcomdES         9633858      active
   clstrmgrES       cluster          9240596      active
   gsclvmd                                        inoperative
   clinfoES         cluster          17104944     active
   clconfd          caa                           inoperative
   nimsh            nimclient                     inoperative

On both nodes gsclvmd is inoperative, so it looks like the only option is to restart HACMP to bring gsclvmd back up.
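
For repeated checks while restarting the cluster, the status column can be extracted by subsystem name. This is an illustrative sketch only: subsys_status is an invented name, and it simply takes the last column of the matching lssrc row, so a multi-word status would be cut down to its last word.

```shell
# subsys_status: read `lssrc -a` output on stdin and print the status
# (last column) of the subsystem whose name is given in $1.
# Hypothetical helper; matches the subsystem name exactly.
subsys_status() {
    awk -v name="$1" '$1 == name { print $NF }'
}

# Usage on a live node (commented out so the block has no side effects):
#   lssrc -a | subsys_status gsclvmd
```

Taking the last column sidesteps the fact that the group and PID columns are empty for an inoperative subsystem.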

Resolution

1. Back up the database, then shut it down

Node 1:
   su - oracle
   srvctl stop listener -n ctscrm1
   ps -ef | grep "LOCAL=NO" | grep -v grep | awk '{print $2}' | xargs kill -9
   oracle> alter system switch logfile;
   oracle> alter system checkpoint;
   srvctl stop instance -d ctsdb -i ctsdb1

Node 2:
   su - oracle
   srvctl stop listener -n ctscrm2
   ps -ef | grep "LOCAL=NO" | grep -v grep | awk '{print $2}' | xargs kill -9
   oracle> alter system switch logfile;
   oracle> alter system checkpoint;
   srvctl stop instance -d ctsdb -i ctsdb2

Stop CRS on node 1 and node 2:

crsctl stop crs

2. Restart HACMP

smit clstop

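Then disaster struck again. HACMP did stop on both nodes, but the cluster was already in an abnormal state. I had assumed that stopping HACMP would vary off the VGs as well; in fact none of them were varied off. I tried to vary them off by hand, but on node 1 data01vg hung and never came back. With no other way out, I rebooted with shutdown -Fr, and after the reboot all the VGs were offline. Next I changed the HACMP disk method for the EMC CLARiiON storage, which needs parallel mode enabled: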

smit hacmp > extended configure > extended resource configure > hacmp extended resource configure > configure custom disk methods > change/show custom disk methods
Select disk/pseudo/power and change parallel to true.
Do this on both nodes.


odmget HACMPdisktype
  HACMPdisktype:
          PdDvLn = "disk/pseudo/power"
          ghostdisks = "SCSI3"
          checkres = "SCSI_TUR"
          breakres = "/usr/lpp/EMC/Symmetrix/bin/emcpowerreset"
          parallel = "true"
          makedev = "MKDEV"
          reserved1 = ""
          reserved2 = ""
          reserved3 = ""

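OK, parallel is now true.

Start HACMP with smit clstart. This time there was a new problem: startup failed with an error. I did not record the exact message, but it said the volume group timestamps were out of sync. Carrying on: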


On nodes 1 and 2:
exportvg data01vg; importvg -y data01vg hdiskpower7
exportvg data02vg; importvg -y data02vg hdiskpower24
exportvg data03vg; importvg -y data03vg hdiskpower17

Start HACMP again with smit clstart. This time it worked: HACMP started normally, and checking each VG showed they were all in Concurrent mode.

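3. Start RAC

I thought the problem was solved at this point, so I started CRS with crsctl start crs and checked it with crs_stat -t.

CRS did not start; there was no reaction at all. /etc/init.d/init.crs start did not work either: no response, and nothing new in the logs. Then it dawned on me: after the exportvg and importvg, the ownership and permissions on the device files had not been set again. Fix the permissions on the raw devices: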

cd /dev;chown oracle:dba rdb*;chown oracle:dba rbss*;chown oracle:dba rctsbss*;

These are the LVs that hold the data files.

chown oracle:dba rvote*;chmod 766 rvote*
chmod 766 *ocr*;
/etc/init.d/init.crs start;

This time it started. A little later crs_stat -t showed everything online, and with that the problem was fully resolved.
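
The lesson from this step is that the raw device ownership should be checked right after importvg, before trying to start CRS. A rough sketch of such a check follows. The helper name wrong_owner is hypothetical, and it assumes the usual `ls -l` layout with owner in the third column, group in the fourth, and the file name last.

```shell
# wrong_owner: read `ls -l` output on stdin and print the name of every entry
# whose owner:group differs from the value given in $1 (e.g. oracle:dba).
# Hypothetical helper; assumes file names without spaces.
wrong_owner() {
    awk -v want="$1" 'NF >= 4 && ($3 ":" $4) != want { print $NF }'
}

# Usage on a live node (commented out so the block has no side effects):
#   cd /dev && ls -l rdb* rbss* rctsbss* rvote* | wrong_owner oracle:dba
```

An empty result would mean every listed device already has the expected owner and group; this only checks ownership, not the mode bits set by chmod.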

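Summary

The root cause appears to be that the HACMP parallel setting was not changed the last time the system was rebooted. On top of that, the volume groups were force-varied on that time, because on one node they could not be varied on through HACMP. (A colleague maintained the system back then; he did that work, and this is his account.)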

lslpp -l | grep -i emc
  EMC.CELERRA.aix.rte        5.3.0.6  COMMITTED  EMC CELERRA AIX Support
  EMC.CLARiiON.aix.rte       5.3.0.6  COMMITTED  EMC CLARiiON AIX Support
  EMC.CLARiiON.fcp.rte       5.3.0.6  COMMITTED  EMC CLARiiON FCP Support
  EMC.CLARiiON.ha.rte        5.3.0.6  COMMITTED  EMC CLARiiON HA Concurrent
  EMCpower.base              5.3.1.1  COMMITTED  PowerPath Base Driver and
  EMCpower.encryption        5.3.1.1  COMMITTED  PowerPath Encryption with RSA
  EMCpower.migration_enabler
  EMCpower.mpx               5.3.1.1  COMMITTED  PowerPath Multi_Pathing
  EMC.CELERRA.aix.rte        5.3.0.6  COMMITTED  EMC CELERRA AIX Support
  EMC.CLARiiON.aix.rte       5.3.0.6  COMMITTED  EMC CLARiiON AIX Support
  EMC.CLARiiON.fcp.rte       5.3.0.6  COMMITTED  EMC CLARiiON FCP Support
  devices.common.IBM.modemcfg.data

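EMC.CLARiiON.ha.rte (EMC CLARiiON HA Concurrent) is the package through which the EMC storage provides HA concurrent support. Without it, concurrent mode is not supported, so it has to be installed. After that, change parallel as described above: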

smit hacmp > extended configure > extended resource configure > hacmp extended resource configure > configure custom disk methods > change/show custom disk methods

or

/usr/sbin/cluster/utilities/clcustomdisk -c -tdisk/pseudo/power -Ndisk/pseudo/power -gSCSI3 -hSCSI_TUR  -b/usr/lpp/EMC/CLARiiON/bin/emcpowerreset -ptrue -mMKDEV

Luckily this work was done at night, in a long downtime window we had requested in advance.





This article is from the "小鱼儿" blog; please retain this attribution: http://xiaoyuer3.blog.51cto.com/8622790/1393025
