RAC

ASM memory fragmentation problem in 11gR2 RAC

June 7, 2013 oracle, RAC

Two days ago I dealt with an ASM memory problem on RAC. The system is the company's last remaining RAC One Node deployment, running 11.2.0.3.

This core RAC One Node system showed large numbers of log file switch and buffer busy waits, accompanied by many cursor: pin S wait on X and read by other session waits.
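The listing below is a snapshot of active sessions and their current waits. The author's exact script is not shown; as a rough, hedged sketch, a query along these lines against v$session could produce similar columns (the aliases PARAM, W, WT, ST, LT are illustrative guesses at what the listing shows, not the original script):

-- Hypothetical approximation of the session snapshot (not the author's original script)
SELECT s.sid,
       s.username,
       s.machine,
       s.event,
       s.p1 || '/' || s.p2 || '/' || s.p3      AS param,
       s.wait_time                             AS w,      -- 0 = currently waiting
       s.seconds_in_wait                       AS wt,
       s.sql_id || '/' || s.prev_sql_id        AS "SQL",
       SUBSTR(s.status, 1, 1)                  AS st,     -- A = ACTIVE
       s.last_call_et                          AS lt,
       ROUND((SYSDATE - s.logon_time) * 86400) AS logon_time
  FROM v$session s
 WHERE s.username IS NOT NULL
   AND s.status = 'ACTIVE'
 ORDER BY s.event, s.sid;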

SID USERNAME   MACHINE                EVENT                          PARAM                   W   WT SQL                          ST     LT LOGON_TIME
------ ---------- ---------------------- ------------------------------ -------------------- ---- ---- ---------------------------- -- ------ ----------
 10917 PROD_DATA2 JDBC Thin Client       buffer busy waits              14/907008/8             0  294 /bgx0f86ucnw8v               A     355       7628
 13549 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  172 /520mkxqpf15q8               A     294      20514
  4456 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  175 6rsq14x8a7n5r/dsxmfbstazq7u  A     525      94053
  4521 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  175 6rsq14x8a7n5r/dsxmfbstazq7u  A     508      94052
  6831 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  168 1k8vrykzkvccg/d5dfm29f41s0d  A     739       1400
  3170 PROD_DATA2 JDBC Thin Client       enq: SQ - contention:SQ-6:12:  1397817350/21363/0      0  391 4uj3xtfzq3a91/gky4m8njtfrcd  A     392     536146
  4611 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  181 fqc7yvkzxjzk7/cqb6f9fkfkcsq  A     725        932
 13448 PROD_DATA2 JDBC Thin Client       buffer busy waits              56/645165/1             0  730 fqc7yvkzxjzk7/cqb6f9fkfkcsq  A     730        736
   940 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  181 fqc7yvkzxjzk7/cqb6f9fkfkcsq  A     710        953
  8026 PROD_DATA2 JDBC Thin Client       buffer busy waits              20/279682/8             0  441 3f0thwr0rg23c/c4h5dg81ffxyr  A     441      94043
  9290 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  178 3f0thwr0rg23c/c4h5dg81ffxyr  A     220        473
  2750 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  199 b3dmt8325pmvf/bpk6g015uzqzt  A     417      81687
  3510 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  223 c5n6d774as72b/5ak0xdkuvc65n  A     273     536146
  2323 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  172 c5n6d774as72b/5ak0xdkuvc65n  A     407     536140
  3767 PROD_DATA2 JDBC Thin Client       buffer busy waits              21/35200/8              0  725 5q60jbr6700u7/g2th2agwfkk1t  A     725       1133
 10205 PROD_DATA2 JDBC Thin Client       buffer busy waits              21/35200/8              0  717 5q60jbr6700u7/g2th2agwfkk1t  A     717       1186
 10046 PROD_DATA2 JDBC Thin Client       buffer busy waits              21/35200/8              0  670 5q60jbr6700u7/g2th2agwfkk1t  A     670       1186
 10044 PROD_DATA2 JDBC Thin Client       buffer busy waits              21/35200/8              0  689 5q60jbr6700u7/g2th2agwfkk1t  A     689      24971
  6395 PROD_DATA2 JDBC Thin Client       buffer busy waits              21/35201/8              0  693 5q60jbr6700u7/g2th2agwfkk1t  A     693       1522
  9535 PROD_DATA2 JDBC Thin Client       buffer busy waits              21/35200/8              0  703 5q60jbr6700u7/g2th2agwfkk1t  A     703      42595
  6917 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  175 5q60jbr6700u7/g2th2agwfkk1t  A     709       1221
  2902 PROD_DATA2 JDBC Thin Client       buffer busy waits              21/35200/8              0  700 5q60jbr6700u7/g2th2agwfkk1t  A     700       2146
  9554 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  168 5q60jbr6700u7/g2th2agwfkk1t  A     726       2147
 11662 PROD_DATA2 JDBC Thin Client       enq: SQ - contention:SQ-6:73:  1397817350/21911/0      0  377 cg3urj768yv8g/apun8g5mb9byt  A     378      18995
  9378 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  168 cg3urj768yv8g/7g63srczfgfd6  A     627      19057
  8698 PROD_DATA2 JDBC Thin Client       enq: SQ - contention:SQ-6:73:  1397817350/21911/0      0  551 cg3urj768yv8g/5smgjxsscf91w  A     551      18996
  7575 PROD_DATA2 JDBC Thin Client       enq: SQ - contention:SQ-6:0:   1397817350/21911/0      0  204 cg3urj768yv8g/793hdqw3bn04q  A     204      19035

...
 13028 PROD_DATA2 JDBC Thin Client       enq: SQ - contention:SQ-6:0:   1397817350/21911/0      0  623 cg3urj768yv8g/8s97vpbnuktp2  A     623      19061

...
  6659 PROD_DATA2 JDBC Thin Client       enq: SQ - contention:SQ-6:0:   1397817350/21911/0      0  615 cg3urj768yv8g/8s97vpbnuktp2  A     616      19035


  5187 PROD_DATA2 JDBC Thin Client       enq: SQ - contention:SQ-6:56:  1397817350/21911/0      0  418 cg3urj768yv8g/793hdqw3bn04q  A     418      19069
  4933 PROD_DATA2 JDBC Thin Client       enq: SQ - contention:SQ-6:0:   1397817350/21911/0      0  576 cg3urj768yv8g/9quv2rcpqjxwt  A     576      18977


  4346 PROD_DATA2 JDBC Thin Client       enq: SQ - contention:SQ-6:0:   1397817350/21911/0      0  219 cg3urj768yv8g/8s97vpbnuktp2  A     546        943

....
  4082 PROD_DATA2 JDBC Thin Client       enq: SQ - contention:SQ-6:14:  1397817350/21911/0      0  460 cg3urj768yv8g/520mkxqpf15q8  A     460       1522
  7267 PROD_DATA2 JDBC Thin Client       enq: SQ - contention:SQ-6:73:  1397817350/21911/0      0  415 cg3urj768yv8g/8s97vpbnuktp2  A     415        919
  9532 PROD_DATA2 JDBC Thin Client       DFS lock handle                1398145029/21916/0      0  353 2xw94b38cybhv/4001f9ha480pb  A     354     358380
  9716 PROD_DATA2 JDBC Thin Client       DFS lock handle                1398145029/21916/0      0  264 2xw94b38cybhv/4001f9ha480pb  A     265       2147
   512 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  168 fhd852bb8agv6/gq6xnzt1spkzp  A     654     533810
  7001 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  178 4q0c1a3bsynyw/guwq1fpq9r1dy  A     261     536140
 11348 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  175 d3wp3qvbz5bna/1w0c0jhzmp1hh  A     226       7299
 11164 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  168 9fwybsrc3h1dd/70k37x7hwg0w5  A     701       1185
  3096 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  220 9fwybsrc3h1dd/70k37x7hwg0w5  A     510       1522
  8100 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  181 91r0dubcptymp/dsxmfbstazq7u  A     649      95706
  8192 PROD_DATA2 JDBC Thin Client       buffer busy waits              24/274502/1             0  645 91r0dubcptymp/dsxmfbstazq7u  A     645      94041
  8783 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  175 91r0dubcptymp/dsxmfbstazq7u  A     664      94039
  6667 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  175 91r0dubcptymp/dsxmfbstazq7u  A     685       1272
  5445 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  175 91r0dubcptymp/dsxmfbstazq7u  A     688       1185
    38 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  199 91r0dubcptymp/dsxmfbstazq7u  A     695       1263
  9386 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  181 91r0dubcptymp/dsxmfbstazq7u  A     685      95693
  8868 PROD_DATA2 JDBC Thin Client       buffer busy waits              53/118272/8             0  675 ghy358rghc4ty/3p3z1j33tk594  A     675     486608
  6475 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  168 ghy358rghc4ty/3p3z1j33tk594  A     706     486608
  6899 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  178 ftmka9gnhru7t/1w0c0jhzmp1hh  A     738     442302
  4003 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  175 8a5g4frpyags9/1w0c0jhzmp1hh  A     401     435103

   SID USERNAME   MACHINE                EVENT                          PARAM                   W   WT SQL                          ST     LT LOGON_TIME
------ ---------- ---------------------- ------------------------------ -------------------- ---- ---- ---------------------------- -- ------ ----------
   861 PROD_DATA2 dcb-srv-0343-vm02      buffer busy waits              60/1078781/1            0  504 fjnq76rpzzpdy/a6tzf0tjrwvqh  A     505      20966
 12361 PROD_DATA2 xen-4-211-vm05         buffer busy waits              60/1078781/1            0  402 fjnq76rpzzpdy/a6tzf0tjrwvqh  A     403      20976
  6213 TOAD       JDBC Thin Client       buffer busy waits              41/34704/571            0  542 5kjukxbrv5w3r/dbkpyhyfyvdpu  A     542    1577749
  7513 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  168 cvvzaj3syc76b/1w0c0jhzmp1hh  A     703     502831
  8015 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  175 1tg6ux7vc22zm/32bjxyxc0xwfs  A     463        899
 10817 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  205 1tg6ux7vc22zm/32bjxyxc0xwfs  A     720      18977
 12268 PROD_DATA2 JDBC Thin Client       buffer busy waits              57/1302268/1            0  283 3szwvvzw5aq13/bpaj7frdt63mx  A     284      82802
  5959 PROD_DATA2 JDBC Thin Client       log file switch (archiving nee 0/0/0                   0  178 3szwvvzw5aq13/bpaj7frdt63mx  A     583      75654
 10975 PROD_DATA2 JDBC Thin Client       read by other session          27/223916/1             0    2 2b47rbdt23jp8/a4td25ftgtqt8  A      40         51
 10994 PROD_DATA2 JDBC Thin Client       read by other session          27/223916/1             0    2 2b47rbdt23jp8/459f3z9u4fb3u  A      34         34
 10999 PROD_DATA2 JDBC Thin Client       read by other session          27/223916/1             0    2 2b47rbdt23jp8/a4td25ftgtqt8  A      30         30
 11015 PROD_DATA2 JDBC Thin Client       read by other session          27/223916/1             0    2 2b47rbdt23jp8/a4td25ftgtqt8  A       8          9
  4688 PROD_DATA2 JDBC Thin Client       read by other session          20/1806690/1            0    2 4gzs73jtmhyd0/f2n7vrpx4v30v  A       4         10
  4626 PROD_DATA2 JDBC Thin Client       db file sequential read        27/630098/1             0    0 11n7q3tu19vvp/adghpykwhquu2  A       1         23
 10566 PROD_DATA2 JDBC Thin Client       db file sequential read        57/780833/1             0    0 50731ypu4mu31/adghpykwhquu2  A       1         34
 12527 PROD_DATA2 xen-21205-vm02         db file sequential read        28/168522/1             0    0 7szz51xufmuc4/520mkxqpf15q8  A       5         33
  5489 PROD_DATA2 JDBC Thin Client       read by other session          60/931838/1             0    1 1zbqmb9ukkhpd/459f3z9u4fb3u  A      13         13
  5063 PROD_DATA2 JDBC Thin Client       db file sequential read        17/871948/1             0    2 g8jjuy5wt7h92/64yswjj6cdv5j  A      12         16

   SID USERNAME   MACHINE                EVENT                          PARAM                   W   WT SQL                          ST     LT LOGON_TIME
------ ---------- ---------------------- ------------------------------ -------------------- ---- ---- ---------------------------- -- ------ ----------
 10302 PROD_DATA2 JDBC Thin Client       cursor: pin S wait on X        2044424941/278013233    0   44 42zra6dwxqwrd/cnm20xg9c333h  A      44         44
                                                                        07008/21474836480

 11330 PROD_DATA2 JDBC Thin Client       cursor: pin S wait on X        2044424941/278013233    0   15 42zra6dwxqwrd/523rw17fc75fb  A      14         38
                                                                        07008/21474836480
.....
 11698 PROD_DATA2 JDBC Thin Client       cursor: pin S wait on X        2044424941/278013233    0    2 42zra6dwxqwrd/459f3z9u4fb3u  A       1          2
                                                                        07008/21474836480

  6473 PROD_DATA2 JDBC Thin Client       read by other session          6/4044/1                0    0 42zra6dwxqwrd/gzcbp0razp0vn  A      44         46
 10136 PROD_DATA2 JDBC Thin Client       cursor: pin S wait on X        2044424941/278013233    0   25 42zra6dwxqwrd/523rw17fc75fb  A      25         41
                                                                        07008/21474836480
  3669 PROD_DATA2 JDBC Thin Client       cursor: pin S wait on X        2179751475/452904301    0    3 2vncw4q0ysrjm/g6btczrdb4fa1  A       3         47
                                                                       36320/21474836480
  4854 PROD_DATA2 JDBC Thin Client       cursor: pin S wait on X        2179751475/452904301    0    3 2vncw4q0ysrjm/g6btczrdb4fa1  A       3         52
                                                                        36320/21474836480
   265 PROD_DATA2 JDBC Thin Client       cursor: pin S wait on X        2179751475/452904301    0    3 2vncw4q0ysrjm/g6btczrdb4fa1  A       3         50
                                                                        36320/21474836480

  9355 PROD_DATA2 JDBC Thin Client       cursor: pin S wait on X        2179751475/452904301    0    3 2vncw4q0ysrjm/g6btczrdb4fa1  A       3         52
                                                                        36320/21474836480
 ......

4764 rows selected.

The root cause: the archiver processes could not archive because ASM was unable to allocate a new file identifier.
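Before looking at ASM, the archiver state can be confirmed from the database side. A hedged sketch (not part of the original diagnosis output):

-- Archive destination status and the last error reported there
SELECT dest_name, status, error
  FROM v$archive_dest
 WHERE status <> 'INACTIVE';

-- When archiving is stuck, the online redo groups pile up in ACTIVE/CURRENT
SELECT group#, thread#, sequence#, status
  FROM v$log
 ORDER BY thread#, group#;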

ASMCMD [+arch/item/archivelog/2012_10_19] > cp thread_1_seq_3.257.797108745 aaaaa.test;         
ASMCMD-8012: can not determine file type for file
ORA-00569: Failed to acquire global enqueue. (DBD ERROR: OCIStmtExecute)

Combined with the ASM alert/trace log:

Wed Jun 05 07:39:52 2013
Dumping diagnostic data in directory=[cdmp_20130605073952], requested by (instance=2, osid=3061 (LMD0)), summary=[incident=412241].
Wed Jun 05 07:45:58 2013
Dumping diagnostic data in directory=[cdmp_20130605074557], requested by (instance=2, osid=3061 (LMD0)), summary=[incident=412242].
Wed Jun 05 07:52:44 2013
Dumping diagnostic data in directory=[cdmp_20130605075244], requested by (instance=2, osid=3061 (LMD0)), summary=[incident=412243].
Wed Jun 05 07:57:58 2013
Dumping diagnostic data in directory=[cdmp_20130605075758], requested by (instance=2, osid=3061 (LMD0)), summary=[incident=412244].
Wed Jun 05 08:03:59 2013
Dumping diagnostic data in directory=[cdmp_20130605080359], requested by (instance=2, osid=3061 (LMD0)), summary=[incident=412245].
Wed Jun 05 08:36:09 2013
SQL> /* ASMCMD */alter diskgroup /*ASMCMD*/ "data" add directory '+data/arch' 
SUCCESS: /* ASMCMD */alter diskgroup /*ASMCMD*/ "data" add directory '+data/arch'
Wed Jun 05 08:42:40 2013
Dumping diagnostic data in directory=[cdmp_20130605084240], requested by (instance=2, osid=3061 (LMD0)), summary=[incident=463685].
Wed Jun 05 08:47:47 2013
Dumping diagnostic data in directory=[cdmp_20130605084746], requested by (instance=2, osid=3061 (LMD0)), summary=[incident=463686].
Wed Jun 05 08:52:18 2013
NOTE: ASM client item_1:item disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /u01/grid/11.2.0/diag/asm/+asm/+ASM1/trace/+ASM1_ora_25588.trc

At 07:39 in the morning the ASM1 instance ran into trouble, which prevented the ARCn processes from completing archiving and in turn caused the subsequent waits.

(DESCRIPTION=(ADDRESS=(PROTOCOL=beq)(PROGRAM=/u01/oracle/11.2.0/oracle/product/db_1/bin/oracle)(ARGV0=oracleitem_1)(ENVS='ORACLE_HOME=/u01/oracle/11.2.0/oracle/product/db_1,ORACLE_SID=item_1,LD_LIBRARY_PATH=')(ARGS='(DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))'))(CONNECT_DATA=(SID=item_1)))
2013-06-05 07:39:53.587: [ora.item.db][1607948608] {1:21407:34683} [check] InstConnection::connectInt: server not attached
2013-06-05 07:39:53.717: [ USRTHRD][1607948608] {1:21407:34683} InstConnection:~InstConnection: this 1fc95100
2013-06-05 07:39:53.718: [ora.item.db][1607948608] {1:21407:34683} [check] DbAgent::isDgbHandleFO } status = 16646

Wed Jun 05 07:39:52 2013
ARC3: Error 19504 Creating archive log file to '+ARCH'
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance item_1 - Archival Error
ORA-16038: log 7 sequence# 59248 cannot be archived
ORA-19504: failed to create file ""
ORA-00312: online log 7 thread 1: '+DATA/item/onlinelog/group_7.322.797746361'

For ASM2 instance:

Wed Jun 05 07:39:51 2013
Errors in file /u01/grid/11.2.0/diag/asm/+asm/+ASM2/trace/+ASM2_lmd0_3061.trc (incident=412241):
ORA-04031: unable to allocate 3768 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","ges enqueues")
Incident details in: /u01/grid/11.2.0/diag/asm/+asm/+ASM2/incident/incdir_412241/+ASM2_lmd0_3061_i412241.trc

GES resources could no longer be allocated from the shared pool. An ORA-04031 against the ASM instance's shared pool is fairly rare; for 11.2.0.3 and later, Oracle's recommended value is memory_target = 1536MB, although Oracle has not published a best-practice note for this parameter. Presumably the default value of memory_target will be raised in a future release. The current default is:

SQL> show parameter target

NAME				     TYPE			       VALUE
------------------------------------ --------------------------------- ------------------------------
memory_max_target		     big integer		       1084M
memory_target			     big integer		       1084M
pga_aggregate_target		     big integer		       0
sga_target			     big integer		       0

Because memory in the ASM instance also has to hold GES, GCS and GRD resources, memory fragmentation is not a problem that can be ignored; it is just that in cases like this very few people look at ORA-04031 on the ASM side. Given how ASM behaves in 11.2.0.3 (which changed some default behaviors), Oracle appears to have recognized the issue as well; the following documents provide some useful background. 1536M seems to be the minimum safe value Oracle has settled on, and it has been acknowledged by development, so anyone who has not set this parameter should take note.
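To see how close an ASM instance is to another ORA-04031, the shared pool can be inspected from the +ASM instance itself. A minimal sketch, assuming the usual 11.2 component names (they can vary by version):

-- Run while connected to the +ASM instance, e.g. sqlplus / as sysasm
SELECT pool, name, ROUND(bytes / 1024 / 1024, 1) AS mb
  FROM v$sgastat
 WHERE pool = 'shared pool'
   AND name IN ('free memory', 'ges enqueues', 'ges resource')
 ORDER BY bytes DESC;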

Also, even though this cluster's architecture is RAC One Node, Oracle still behaves like regular RAC for some aspects of resource control; RAC One Node merely reduces some of the resource contention on the database side.

For this case, the following can be done:

1. For 11gR2, set the ASM instance's memory_target to a safe value (a sketch follows after this list).

2. Schedule periodic restarts for database machines that have been running for a long time; the machine that hit this problem had been up for more than 220 days.

3. Monitor the ASM alert log, even though ASM's stability has improved a great deal in 11g.
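For item 1, a minimal sketch of raising the ASM instance's memory parameters to the 1536M value discussed above (the exact values and the rolling-restart procedure are assumptions to adapt to your environment):

-- Connected to the +ASM instance as SYSASM; both changes take effect only after ASM restarts
ALTER SYSTEM SET memory_max_target = 1536M SCOPE = SPFILE SID = '*';
ALTER SYSTEM SET memory_target     = 1536M SCOPE = SPFILE SID = '*';
-- Then restart the stack on one node at a time so each ASM instance picks up the new values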

[Repost] ASM Handling Block Corruptions

November 25, 2012 oracle, RAC

Some excellent experiments from Honcho.

To protect the data on its disks, Oracle ASM offers three redundancy levels for disk groups:
(1) external redundancy: Oracle does not manage mirroring for you; it is handled by the external storage, for example via RAID.
(2) normal redundancy (the default): Oracle provides two-way mirroring of the data.
(3) high redundancy: Oracle provides three-way mirroring of the data.

If you rely on ASM's own redundancy, it is implemented through ASM failure groups. ASM's mirroring algorithm does not mirror whole disks; it mirrors at the extent level. Using disks of different sizes in the various failure groups is therefore unwise, because Oracle may run into trouble when it allocates the next extent.
Under normal redundancy, every extent that ASM allocates has a primary copy and a secondary copy, and ASM's algorithm guarantees that the two copies always reside in different failure groups; that is exactly what failure groups are for. This way, even if every disk in one failure group is destroyed, no data is lost.
When Oracle allocates an extent, the extents across the failure groups that hold the same data are called an extent set. When Oracle writes to a file, the primary copy may land in any failure group, with the secondary copy in another; when Oracle reads, it always prefers the primary copy unless that copy is unavailable. With this "unordered write, ordered read" scheme, Oracle spreads reads across as many disks as possible.
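A quick way to confirm a disk group's redundancy type and how its disks map to failure groups is to query the ASM views; a hedged sketch:

-- Redundancy type per disk group: EXTERN / NORMAL / HIGH
SELECT group_number, name, type
  FROM v$asm_diskgroup;

-- Disk-to-failure-group mapping
SELECT group_number, failgroup, name, path
  FROM v$asm_disk
 ORDER BY group_number, failgroup;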

So how does ASM handle block corruption internally? Oracle's documentation describes it as follows:

A) Block Corruption in Primary Extent
Block reads from disk are affected. Corruption will be found during the read from the primary extent; we report an ORA-1578 in the ASM alert.log and do a re-read of the same block (primary extent). If the second read fails as well, we next read the block from the secondary extent (on a different physical disk). This failover to reading from the secondary extent happens automatically and transparently for the user/session. If the secondary extent is not corrupt, the user will get the results without any error indication other than the ORA-1578 about the failed primary extent read in the alert.log. Finally ASM will overwrite the corrupt block in the primary extent with the good copy from the secondary. This happens even when the block is unaffected by any transaction. ASM FINDS corrupt blocks in the primary extent, and that automatically triggers repair at the next read.

B) Block Corruption in Secondary Extent
Because block reads always go to the primary extent first, a corruption in the secondary extent does not have any effect on a read operation. At the next write the block becomes dirty and the new block image will overwrite the old corrupt block on disk. In the case of block corruption in the secondary extent there is no diagnostic information available. A corruption in the secondary extent will normally only be seen if the block in the primary extent is also corrupt. ASM fixes corrupt blocks in the secondary extent automatically at the next write of the block.

Below, these descriptions are verified with an experiment. The test platform is Oracle 11.2.0.3 for Linux x64.

First, create a normal redundancy disk group DATA1:
SQL> create diskgroup data1 normal redundancy failgroup f1 disk '/dev/asm-diske' failgroup f2 disk '/dev/asm-diskf';
 
Diskgroup created.

Create some test data:
SQL> create tablespace xhz datafile '+DATA1' size 100m;
 
Tablespace created.
 
SQL> select name from v$datafile;
 
NAME
--------------------------------------------------------------------------------
+DATA/xhz/datafile/system.260.799281453
+DATA/xhz/datafile/sysaux.261.799281497
+DATA/xhz/datafile/undotbs1.262.799281531
+DATA/xhz/datafile/users.264.799281563
+DATA/xhz/datafile/honcho.266.799334723
+DATA1/xhz/datafile/xhz.256.800304113
 
6 rows selected.
 
SQL> create table honcho (id number,name varchar2(20)) tablespace xhz;
 
Table created.
 
SQL> begin
  2    for i in 1..10000 loop
  3     insert into honcho values (i,'honcho');
  4     end loop;
  5    commit;
  6  end;
  7  /
 
PL/SQL procedure successfully completed.
 
SQL> create index hon_id_idx on honcho(id) nologging;
                                                    
Index created.

Check how table HONCHO is laid out on disk:

SQL> select file_id,extent_id,block_id,blocks from dba_extents where segment_name='HONCHO';
 
   FILE_ID  EXTENT_ID   BLOCK_ID     BLOCKS
---------- ---------- ---------- ----------
         6          0        128          8
         6          1        136          8
         6          2        144          8
         6          3        152          8

Pick an arbitrary row:
SQL> select rowid,id from honcho where id=888;
 
ROWID                      ID
------------------ ----------
AAADbBAAGAAAACEAGr        888             
 
SQL> select dbms_rowid.rowid_object(rowid) object_id,     
  2         dbms_rowid.rowid_relative_fno(rowid) file_id, 
  3         dbms_rowid.rowid_block_number(rowid) block_id,
  4         dbms_rowid.rowid_row_number(rowid) num         
  5    from honcho
  6    where id=888;
 
 OBJECT_ID    FILE_ID   BLOCK_ID        NUM
---------- ---------- ---------- ----------
     14017          6        132        427

The row lives in file #6 and its block lies between BLOCK_ID 128 and 136. Since every extent here is one AU by default, and in this test 1 AU = 1 MB, one AU holds at most 128 blocks of 8K; the row therefore falls in the datafile's second AU (virtual extent 1) in ASM.
SQL> select group_number,disk_number,name,path from v$asm_disk where group_number=1;
 
GROUP_NUMBER DISK_NUMBER NAME                           PATH
------------ ----------- ------------------------------ ----------------------------------------
           1           1 DATA1_0001                     /dev/asm-diskf
           1           0 DATA1_0000                     /dev/asm-diske
 
SQL> select name from v$datafile where file#=6;
 
NAME
---------------------------------------------------------
+DATA1/xhz/datafile/xhz.256.800304113

From the OMF name format +group/dbname/file_type/tag.file#.incarnation# we can tell that database file #6 is ASM file #256. The x$kffxp view then shows how the file's data is distributed across the ASM disk group:
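Instead of parsing the OMF name, the same mapping can be read from the ASM instance through V$ASM_ALIAS; a hedged sketch (the LIKE pattern assumes the alias is stored in upper case, as OMF normally does):

-- Run in the +ASM instance; the alias name ends in <file#>.<incarnation#>
SELECT group_number, file_number, name
  FROM v$asm_alias
 WHERE UPPER(name) LIKE 'XHZ.%';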

SQL> select  XNUM_KFFXP,  DISK_KFFXP,  AU_KFFXP,  INDX from x$kffxp where group_kffxp=1 and  NUMBER_KFFXP=256;
 
XNUM_KFFXP DISK_KFFXP   AU_KFFXP       INDX
---------- ---------- ---------- ----------
         0          1         52          0 
         0          0         52          1 
         1          0         53          2  ---> primary extent on disk 0 (/dev/asm-diske)
         1          1         53          3  ---> secondary extent on disk 1 (/dev/asm-diskf)
         2          1         54          4
         2          0         54          5
         3          0         55          6
         3          1         55          7
         4          1         56          8
         4          0         56          9
         5          0         57         10
       ...         ...       ...        ...
       ...         ...       ...        ...
       100          1        153        200
       100          0        153        201
2147483648          0         82        202
2147483648          1         82        203
2147483648      65534 4294967294        204

The query above shows that the primary extent holding the id=888 row is AU 53 on /dev/asm-diske and the secondary extent is AU 53 on /dev/asm-diskf. Since 1 AU = 1024K and block_size = 8K, AUs 0 through 52 cover 53 * 128 = 6784 blocks, and block 132 sits 4 blocks into its AU (132 - 128 = 4), so dd is used to dump block 6784 + 4 = 6788 on each disk.
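The dd offset can also be derived with a small worked calculation; the numbers below simply restate the reasoning above:

SELECT 53 * (1048576 / 8192)   -- 8K blocks covered by AUs 0 through 52 (= 6784)
       + (132 - 128)           -- offset of block 132 within its AU (file blocks 128-255)
         AS dd_skip
  FROM dual;                   -- returns 6788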

First, look at the data in the primary extent:

[grid@honcho ~]$ dd if=/dev/asm-diske bs=8K skip=6788 count=1|od -xa
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 5.5e-05 seconds, 149 MB/s
0000000 a206 0000 0085 0180 df32 0003 0000 0601
        ack   " nul nul enq nul nul soh   2   _ etx nul nul nul soh ack
0000020 b1f9 0000 0001 0000 365d 0000 df15 0003
          y   1 nul nul soh nul nul nul   ]   6 nul nul nak   _ etx nul
0000040 0000 0000 0002 0032 0080 0180 0003 0000
        nul nul nul nul stx nul   2 nul nul nul nul soh etx nul nul nul
0000060 00d6 0000 069f 00c0 0032 000e 21c6 0000
          V nul nul nul  us ack   @ nul   2 nul  so nul   F   ! nul nul
0000100 df32 0003 0000 0000 0000 0000 0000 0000
          2   _ etx nul nul nul nul nul nul nul nul nul nul nul nul nul
0000120 0000 0000 0000 0000 0000 0000 0000 0000
        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
0000140 0000 0000 0100 01c6 ffff 039e 06c9 032b
        nul nul nul nul nul soh   F soh del del  rs etx   I ack   + etx
0000160 032b 0000 01c6 1f8a 1f7c 1f6e 1f60 1f52
          + etx nul nul   F soh  nl  us   |  us   n  us   `  us   R  us
0000200 1f44 1f36 1f28 1f1a 1f0c 1efe 1ef0 1ee2
          D  us   6  us   (  us sub  us  ff  us   ~  rs   p  rs   b  rs
          ...      ...          ...        ...         ...
          ...      ...          ...        ...         ...
0017760 0302 0ec2 0646 6f68 636e 6f68 0601 df32
        stx etx   B  so   F ack   h   o   n   c   h   o soh ack   2   _
0020000

Then the data in the secondary extent:

[grid@honcho ~]$ dd if=/dev/asm-diskf bs=8K skip=6788 count=1|od -xa
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.009921 seconds, 826 kB/s
0000000 a206 0000 0085 0180 df32 0003 0000 0601
        ack   " nul nul enq nul nul soh   2   _ etx nul nul nul soh ack
0000020 b1f9 0000 0001 0000 365d 0000 df15 0003
          y   1 nul nul soh nul nul nul   ]   6 nul nul nak   _ etx nul
0000040 0000 0000 0002 0032 0080 0180 0003 0000
        nul nul nul nul stx nul   2 nul nul nul nul soh etx nul nul nul
0000060 00d6 0000 069f 00c0 0032 000e 21c6 0000
          V nul nul nul  us ack   @ nul   2 nul  so nul   F   ! nul nul
0000100 df32 0003 0000 0000 0000 0000 0000 0000
          2   _ etx nul nul nul nul nul nul nul nul nul nul nul nul nul
0000120 0000 0000 0000 0000 0000 0000 0000 0000
        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
0000140 0000 0000 0100 01c6 ffff 039e 06c9 032b
        nul nul nul nul nul soh   F soh del del  rs etx   I ack   + etx
0000160 032b 0000 01c6 1f8a 1f7c 1f6e 1f60 1f52
          + etx nul nul   F soh  nl  us   |  us   n  us   `  us   R  us
0000200 1f44 1f36 1f28 1f1a 1f0c 1efe 1ef0 1ee2
          D  us   6  us   (  us sub  us  ff  us   ~  rs   p  rs   b  rs
          ...      ...          ...        ...         ...
          ...      ...          ...        ...         ...
0017760 0302 0ec2 0646 6f68 636e 6f68 0601 df32
        stx etx   B  so   F ack   h   o   n   c   h   o soh ack   2   _
0020000

Comparing the two dumps shows that the data on both disks is identical.
Now first simulate corruption of the secondary extent:
[grid@honcho ~]$ dd if=/dev/zero of=/dev/asm-diskf seek=6788 bs=8K count=1
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 5.5e-05 seconds, 149 MB/s

The block now reads back as all zeros:

[grid@honcho ~]$ dd if=/dev/asm-diskf bs=8K skip=6788 count=1|od -xa
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 4.3e-05 seconds, 191 MB/s
0000000 0000 0000 0000 0000 0000 0000 0000 0000
        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
0020000

The row can still be queried at this point:

SQL> alter system flush buffer_cache;
 
System altered.
 
SQL> select * from honcho where id=888;
 
        ID NAME
---------- --------------------
       888 honcho

Restart the database:

SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.
 
Total System Global Area  546992128 bytes
Fixed Size                  2230192 bytes
Variable Size             167774288 bytes
Database Buffers          373293056 bytes
Redo Buffers                3694592 bytes
Database mounted.
Database opened.
SQL> select * from honcho where id=888;
 
        ID NAME
---------- --------------------
       888 honcho

No errors at all. Check whether the secondary extent has data now:

[grid@honcho ~]$ dd if=/dev/asm-diskf bs=8K skip=6789 count=1|od -xa
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 4.9e-05 seconds, 167 MB/s
0000000 0000 0000 0000 0000 0000 0000 0000 0000
        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
0020000

Still all zeros. Next, update the honcho table:
SQL> update honcho set name='xhz' where id=888;
 
1 row updated.
 
SQL> commit;
 
Commit complete.
 
SQL> select * from honcho where id=888;
 
        ID NAME
---------- --------------------
       888 xhz
 

Check again:
[grid@honcho ~]$ dd if=/dev/asm-diskf bs=8K skip=6789 count=1|od -xa

1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 5.2e-05 seconds, 158 MB/s
0000000 0000 0000 0000 0000 0000 0000 0000 0000
        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
0020000

Still all zeros, so flush the buffer cache:
SQL> alter system flush buffer_cache;
 
System altered.
 
SQL> select * from honcho where id=888;
 
        ID NAME
---------- --------------------
       888 xhz

Now the secondary extent contains the new data:

[grid@honcho ~]$ dd if=/dev/asm-diskf bs=8K skip=6789 count=1|od -xa
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.039857 seconds, 206 kB/s
0000000 a206 0000 0085 0180 f0ca 0003 0000 0601
        ack   " nul nul enq nul nul soh   J   p etx nul nul nul soh ack
0000020 8e2e 0000 0001 0000 365d 0000 f0a0 0003
          .  so nul nul soh nul nul nul   ]   6 nul nul  sp   p etx nul
0000040 0000 0000 0002 0032 0080 0180 0003 0000
        nul nul nul nul stx nul   2 nul nul nul nul soh etx nul nul nul
0000060 00d6 0000 069f 00c0 0032 000e 8000 0000
          V nul nul nul  us ack   @ nul   2 nul  so nul nul nul nul nul
0000100 df32 0003 0003 0003 00d8 0000 06f7 00c0
          2   _ etx nul etx nul etx nul   X nul nul nul   w ack   @ nul
0000120 0032 0035 21c6 0552 f0ca 0003 0000 0000
          2 nul   5 nul   F   !   R enq   J   p etx nul nul nul nul nul
0000140 0000 0000 0100 01c6 ffff 039e 0b73 032b
        nul nul nul nul nul soh   F soh del del  rs etx   s  vt   + etx
0000160 087d 0000 01c6 1f8d 1f82 1f77 1f6c 1f61
          }  bs nul nul   F soh  cr  us stx  us   w  us   l  us   a  us
0000200 1f56 1f4b 1f40 1f35 1f2a 1f1f 1f14 1f09
          V  us   K  us   @  us   5  us   *  us  us  us dc4  us  ht  us
          ...      ...          ...        ...         ...
          ...      ...          ...        ...         ...         
0017760 2c7a 0202 c203 460e 7803 7a68 0601 f0ca
          z   , stx stx etx   B  so   F etx   x   h   z soh ack   J   p
0020000

Next, simulate corruption of the primary extent:

[grid@honcho ~]$ dd if=/dev/zero of=/dev/asm-diske seek=6788 bs=8K count=1
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 6.3e-05 seconds, 130 MB/s
[grid@honcho ~]$ dd if=/dev/asm-diske bs=8K skip=6788 count=1|od -xa
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 4.6e-05 seconds, 178 MB/s
0000000 0000 0000 0000 0000 0000 0000 0000 0000
        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
0020000

The primary extent no longer holds the data. Now flush the buffer cache:

SQL> alter system flush buffer_cache;
 
System altered.
 
SQL> select * from honcho where id=888;
 
        ID NAME
---------- --------------------
       888 xhz

The alert log shows the following:
Mon Nov 19 19:09:34 2012
ALTER SYSTEM: Flushing buffer cache
Mon Nov 19 19:09:52 2012
Hex dump of (file 6, block 132) in trace file /u01/app/oracle/diag/rdbms/xhz/xhz/trace/xhz_ora_7588.trc
Corrupt block relative dba: 0x01800085 (file 6, block 132)
Completely zero block found during multiblock buffer read
Reading datafile '+DATA1/xhz/datafile/xhz.256.799758019' for corruption at rdba: 0x01800085 (file 6, block 132)
Read datafile mirror 'DATA1_0000' (file 6, block 132) found same corrupt data (no logical check)
Read datafile mirror 'DATA1_0001' (file 6, block 132) found valid data
Repaired corruption at (file 6, block 132)

The log shows that during the read the copy on DATA1_0000 was found to be corrupt, ASM then fell back to the secondary extent on the other mirror, and finally repaired the corruption.

[grid@honcho ~]$ dd if=/dev/asm-diske bs=8K skip=6788 count=1|od -xa
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.047051 seconds, 174 kB/s
0000000 a206 0000 0085 0180 f0ca 0003 0000 0601
        ack   " nul nul enq nul nul soh   J   p etx nul nul nul soh ack
0000020 8e2e 0000 0001 0000 365d 0000 f0a0 0003
          .  so nul nul soh nul nul nul   ]   6 nul nul  sp   p etx nul
0000040 0000 0000 0002 0032 0080 0180 0003 0000
        nul nul nul nul stx nul   2 nul nul nul nul soh etx nul nul nul
0000060 00d6 0000 069f 00c0 0032 000e 8000 0000
          V nul nul nul  us ack   @ nul   2 nul  so nul nul nul nul nul
0000100 df32 0003 0003 0003 00d8 0000 06f7 00c0
          2   _ etx nul etx nul etx nul   X nul nul nul   w ack   @ nul
0000120 0032 0035 21c6 0552 f0ca 0003 0000 0000
          2 nul   5 nul   F   !   R enq   J   p etx nul nul nul nul nul
0000140 0000 0000 0100 01c6 ffff 039e 0b73 032b
        nul nul nul nul nul soh   F soh del del  rs etx   s  vt   + etx
0000160 087d 0000 01c6 1f8d 1f82 1f77 1f6c 1f61
          }  bs nul nul   F soh  cr  us stx  us   w  us   l  us   a  us
0000200 1f56 1f4b 1f40 1f35 1f2a 1f1f 1f14 1f09
          V  us   K  us   @  us   5  us   *  us  us  us dc4  us  ht  us
          ...      ...          ...        ...         ...
          ...      ...          ...        ...         ...

0017760 2c7a 0202 c203 460e 7803 7a68 0601 f0ca
          z   , stx stx etx   B  so   F etx   x   h   z soh ack   J   p
0020000

Reading the primary extent again shows that the data is back.

Next, corrupt the block in both the primary and the secondary extent:

[grid@honcho ~]$ dd if=/dev/zero of=/dev/asm-diske seek=6788 bs=8K count=1
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 8.5e-05 seconds, 96.4 MB/s
[grid@honcho ~]$ dd if=/dev/zero of=/dev/asm-diskf seek=6788 bs=8K count=1
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000117 seconds, 70.0 MB/s

Query the id=888 row:
SQL> alter system flush buffer_cache;
 
System altered.
 
SQL> select * from honcho where id=888;
select * from honcho where id=888
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 6, block # 132)
ORA-01110: data file 6: '+DATA1/xhz/datafile/xhz.256.800304113'

Query a different row:
SQL> select * from honcho where id=100;
 
        ID NAME
---------- --------------------
       100 honcho

The data in file #6, block #132 is now permanently lost, while data outside file #6 block #132 can still be queried.

Check the alert log:

Sun Nov 19 19:24:39 2012
ALTER SYSTEM: Flushing buffer cache
Hex dump of (file 6, block 132) in trace file /u01/app/oracle/diag/rdbms/xhz/xhz/trace/xhz_ora_1915.trc
Corrupt block relative dba: 0x01800084 (file 6, block 132)
Completely zero block found during multiblock buffer read
Reading datafile '+DATA1/xhz/datafile/xhz.256.800304113' for corruption at rdba: 0x01800084 (file 6, block 132)
Read datafile mirror 'DATA1_0000' (file 6, block 132) found same corrupt data (no logical check)
Read datafile mirror 'DATA1_0001' (file 6, block 132) found same corrupt data (no logical check)
Errors in file /u01/app/oracle/diag/rdbms/xhz/xhz/trace/xhz_ora_1915.trc  (incident=18163):
ORA-01578: ORACLE data block corrupted (file # 6, block # 132)
ORA-01110: data file 6: '+DATA1/xhz/datafile/xhz.256.800304113'
Incident details in: /u01/app/oracle/diag/rdbms/xhz/xhz/incident/incdir_18163/xhz_ora_1915_i18163.trc
Sun Nov 19 19:24:49 2012
Starting background process SMCO
Sun Nov 19 19:24:50 2012
SMCO started with pid=24, OS id=3119
Corrupt Block Found
         TSN = 6, TSNAME = XHZ
         RFN = 6, BLK = 132, RDBA = 25165956
         OBJN = 14017, OBJD = 14017, OBJECT = HONCHO, SUBOBJECT =
         SEGMENT OWNER = SYS, SEGMENT TYPE = Table Segment
Sun Nov 19 19:24:59 2012
Sweep [inc][18163]: completed
Hex dump of (file 6, block 132) in trace file /u01/app/oracle/diag/rdbms/xhz/xhz/incident/incdir_18163/xhz_m000_3156_i18163_a.trc
Corrupt block relative dba: 0x01800084 (file 6, block 132)
Completely zero block found during validation
Trying mirror side DATA1_0000.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0001.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0000.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0001.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0000.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0001.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0000.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0001.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0000.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0001.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Sun Nov 19 19:04:59 2012
Dumping diagnostic data in directory=[cdmp_20121125190459], requested by (instance=1, osid=1915), summary=[incident=18163].
Checker run found 1 new persistent data failures

The results of these experiments match the documented behavior. There is still one small gap, though: if the secondary extent becomes corrupt and the block is not subsequently dirtied, and the primary extent then gets corrupted during that window, data is still lost. Hence Oracle's own caveat:

When a disk block becomes corrupt for whatever reason, the corruption can affect blocks in primary, secondary, or both extents. However, the probability that the same block becomes corrupt in both primary and secondary extent at the same time is very low.
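Once both mirror sides are corrupt, ASM can no longer repair the block on its own and recovery has to happen at the database layer, typically via block media recovery from a backup. A hedged sketch of checking what the database has recorded about the damage:

-- Corruptions recorded by the instance/RMAN; rows disappear once the blocks are repaired
SELECT file#, block#, blocks, corruption_change#, corruption_type
  FROM v$database_block_corruption;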

A case of automatic CRS reboots with 10g on Solaris

October 18, 2012 oracle, RAC

There have been far too many RAC cases recently, which makes me ever more convinced that the company's gradual move away from RAC is the right direction. On a SunOS system, the servers kept rebooting automatically after 10g CRS was installed. The environment is as follows:

root@bmsa#showrev 
Hostname: bmsa
Hostid: 84f94303
Release: 5.10
Kernel architecture: sun4v
Application architecture: sparc
Hardware provider: Oracle Corporation
Domain: 
Kernel version: SunOS 5.10 Generic_147440-25

root@bmsb#psrinfo -v 
Status of virtual processor 0 as of: 10/18/2012 00:05:32
  on-line since 10/17/2012 23:46:06.
  The sparcv9 processor operates at 2998 MHz,
        and has a sparcv9 floating point processor.
.....

Status of virtual processor 31 as of: 10/18/2012 00:05:32
  on-line since 10/17/2012 23:46:07.
  The sparcv9 processor operates at 2998 MHz,
        and has a sparcv9 floating point processor. 

root@bmsb# /usr/sbin/prtconf | grep "Memory size"
Memory size: 32768 Megabytes 


-bash-3.2$ crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora.bmsa.gsd   application    ONLINE    ONLINE    bmsa        
ora.bmsa.ons   application    ONLINE    ONLINE    bmsa        
ora.bmsa.vip   application    ONLINE    ONLINE    bmsa        
ora.bmsb.gsd   application    ONLINE    ONLINE    bmsb        
ora.bmsb.ons   application    ONLINE    ONLINE    bmsb        
ora.bmsb.vip   application    ONLINE    ONLINE    bmsb        




-bash-3.2$ crsctl query crs activeversion
CRS active version on the cluster is [10.2.0.1.0]
-bash-3.2$ crsctl query crs softwareversion
CRS software version on node [bmsa] is [10.2.0.1.0]

The symptom: both nodes rebooted automatically every 5-10 minutes. Apart from cssd there were no error logs; oprocd logged nothing relevant and neither did syslog. ocssd.log looked like this:

[    CSSD]2012-10-17 23:40:23.955 >USER:    Oracle Database 10g CSS Release 10.2.0.1.0 Production Copyright 1996, 2004 Oracle.  All rights reserved.
[    CSSD]2012-10-17 23:40:23.955 >USER:    CSS daemon log for node bmsa, number 1, in cluster crs
[  clsdmt]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=bmsaDBG_CSSD))
[    CSSD]2012-10-17 23:40:23.959 [1] >TRACE:   clssscmain: local-only set to false
[    CSSD]2012-10-17 23:40:23.966 [1] >TRACE:   clssnmReadNodeInfo: added node 1 (bmsa) to cluster
[    CSSD]2012-10-17 23:40:23.970 [1] >TRACE:   clssnmReadNodeInfo: added node 2 (bmsb) to cluster
[    CSSD]2012-10-17 23:40:23.973 [1] >TRACE:   clssgmInitCMInfo: Wait for remote node teermination set to 13 seconds
[    CSSD]2012-10-17 23:40:23.975 [5] >TRACE:   clssnm_skgxnmon: skgxn init failed, rc 1
[    CSSD]2012-10-17 23:40:23.975 [1] >TRACE:   clssnm_skgxnonline: Using vacuous skgxn monitor
[    CSSD]2012-10-17 23:40:23.975 [1] >TRACE:   clssnmInitNMInfo: misscount set to 30
[    CSSD]2012-10-17 23:40:23.981 [1] >TRACE:   clssnmDiskStateChange: state from 1 to 2 disk (0//dev/rdsk/c3t60A98000572D5A70614A6E3370535749d0s6)
[    CSSD]2012-10-17 23:40:23.983 [1] >TRACE:   clssnmDiskStateChange: state from 1 to 2 disk (1//dev/rdsk/c3t60A98000572D5A70614A6E337053574Fd0s6)
[    CSSD]2012-10-17 23:40:23.984 [1] >TRACE:   clssnmDiskStateChange: state from 1 to 2 disk (2//dev/rdsk/c3t60A98000572D5A70614A6E3370535751d0s6)
[    CSSD]2012-10-17 23:40:25.984 [6] >TRACE:   clssnmDiskStateChange: state from 2 to 4 disk (0//dev/rdsk/c3t60A98000572D5A70614A6E3370535749d0s6)
[    CSSD]2012-10-17 23:40:25.985 [7] >TRACE:   clssnmDiskStateChange: state from 2 to 4 disk (1//dev/rdsk/c3t60A98000572D5A70614A6E337053574Fd0s6)
[    CSSD]2012-10-17 23:40:25.987 [8] >TRACE:   clssnmDiskStateChange: state from 2 to 4 disk (2//dev/rdsk/c3t60A98000572D5A70614A6E3370535751d0s6)
[    CSSD]2012-10-17 23:40:25.991 [6] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(13) LATS(0) Disk lastSeqNo(13)
[    CSSD]2012-10-17 23:40:25.992 [7] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(13) LATS(0) Disk lastSeqNo(13)
[    CSSD]2012-10-17 23:40:25.992 [8] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(13) LATS(0) Disk lastSeqNo(13)
[    CSSD]2012-10-17 23:40:26.055 [1] >TRACE:   clssnmFatalInit: fatal mode enabled
[    CSSD]2012-10-17 23:40:26.066 [10] >TRACE:   clssnmconnect: connecting to node 1, flags 0x0001, connector 1
[    CSSD]2012-10-17 23:40:26.072 [10] >TRACE:   clssnmconnect: connecting to node 0, flags 0x0000, connector 1
[    CSSD]2012-10-17 23:40:26.072 [10] >TRACE:   clssnmClusterListener: Probing node(2)
[    CSSD]2012-10-17 23:40:26.076 [11] >TRACE:   clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=Oracle_CSS_LclLstnr_crs_1))
[    CSSD]2012-10-17 23:40:26.076 [11] >TRACE:   clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_bmsa_crs))
[    CSSD]2012-10-17 23:40:26.078 [15] >TRACE:   clssnmPollingThread: Connection complete
[    CSSD]2012-10-17 23:40:26.078 [16] >TRACE:   clssnmSendingThread: Connection complete
[    CSSD]2012-10-17 23:40:26.078 [17] >TRACE:   clssnmRcfgMgrThread: Connection complete
[    CSSD]2012-10-17 23:40:26.078 [17] >TRACE:   clssnmRcfgMgrThread: Local Join
[    CSSD]2012-10-17 23:40:26.078 [17] >TRACE:   clssnmDoSyncUpdate: Initiating sync 1
[    CSSD]2012-10-17 23:40:26.078 [17] >TRACE:   clssnmSetupAckWait: Ack message type (11) 
[    CSSD]2012-10-17 23:40:26.078 [17] >TRACE:   clssnmSetupAckWait: node(1) is ALIVE
[    CSSD]2012-10-17 23:40:26.078 [17] >TRACE:   clssnmSendSync: syncSeqNo(1)
[    CSSD]2012-10-17 23:40:26.078 [17] >TRACE:   clssnmWaitForAcks: Ack message type(11), ackCount(1)
[    CSSD]2012-10-17 23:40:26.078 [10] >TRACE:   clssnmHandleSync: Acknowledging sync: src[1] srcName[bmsa] seq[1] sync[1]
[    CSSD]2012-10-17 23:40:26.178 [1] >USER:    NMEVENT_SUSPEND [00][00][00][00]
[    CSSD]2012-10-17 23:40:27.078 [17] >TRACE:   clssnmWaitForAcks: done, msg type(11)
[    CSSD]2012-10-17 23:40:27.078 [17] >TRACE:   clssnmSetupAckWait: Ack message type (13) 
[    CSSD]2012-10-17 23:40:27.078 [17] >TRACE:   clssnmSetupAckWait: node(1) is ACTIVE
[    CSSD]2012-10-17 23:40:27.078 [17] >TRACE:   clssnmSendVote: syncSeqNo(1)
[    CSSD]2012-10-17 23:40:27.078 [17] >TRACE:   clssnmWaitForAcks: Ack message type(13), ackCount(1)
[    CSSD]2012-10-17 23:40:27.079 [10] >TRACE:   clssnmSendVoteInfo: node(1) syncSeqNo(1)
[    CSSD]2012-10-17 23:40:28.079 [17] >TRACE:   clssnmWaitForAcks: done, msg type(13)
[    CSSD]2012-10-17 23:40:28.079 [17] >TRACE:   clssnmCheckDskInfo: Checking disk info...
[    CSSD]2012-10-17 23:40:29.079 [17] >TRACE:   clssnmEvict: Start
[    CSSD]2012-10-17 23:40:29.079 [17] >TRACE:   clssnmWaitOnEvictions: Start
[    CSSD]2012-10-17 23:40:29.080 [17] >TRACE:   clssnmWaitOnEvictions: Node(0) down, LATS(0),timeout(-109570438)
[    CSSD]2012-10-17 23:40:29.080 [17] >TRACE:   clssnmWaitOnEvictions: Node(2) down, LATS(0),timeout(-109570438)
[    CSSD]2012-10-17 23:40:29.080 [17] >TRACE:   clssnmSetupAckWait: Ack message type (15) 
[    CSSD]2012-10-17 23:40:29.080 [17] >TRACE:   clssnmSetupAckWait: node(1) is ACTIVE
[    CSSD]2012-10-17 23:40:29.080 [17] >TRACE:   clssnmSendUpdate: syncSeqNo(1)
[    CSSD]2012-10-17 23:40:29.080 [17] >TRACE:   clssnmWaitForAcks: Ack message type(15), ackCount(1)
[    CSSD]2012-10-17 23:40:29.080 [10] >TRACE:   clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[    CSSD]2012-10-17 23:40:29.080 [10] >TRACE:   clssnmDeactivateNode: node 0 () left cluster

[    CSSD]2012-10-17 23:40:29.080 [10] >TRACE:   clssnmUpdateNodeState: node 1, state (2/2) unique (1350488423/1350488423) prevConuni(0) birth (1/1) (old/new)
[    CSSD]2012-10-17 23:40:29.080 [10] >TRACE:   clssnmUpdateNodeState: node 2, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[    CSSD]2012-10-17 23:40:29.080 [10] >TRACE:   clssnmDeactivateNode: node 2 (bmsb) left cluster

[    CSSD]2012-10-17 23:40:29.080 [10] >USER:    clssnmHandleUpdate: SYNC(1) from node(1) completed
[    CSSD]2012-10-17 23:40:29.080 [10] >USER:    clssnmHandleUpdate: NODE 1 (bmsa) IS ACTIVE MEMBER OF CLUSTER
[    CSSD]2012-10-17 23:40:29.180 [18] >TRACE:   clssgmReconfigThread:  started for reconfig (1)
[    CSSD]2012-10-17 23:40:29.180 [18] >USER:    NMEVENT_RECONFIG [00][00][00][02]
[    CSSD]2012-10-17 23:40:29.180 [18] >TRACE:   clssgmEstablishConnections: 1 nodes in cluster incarn 1
[    CSSD]2012-10-17 23:40:29.180 [14] >TRACE:   clssgmPeerListener: connects done (1/1)
[    CSSD]2012-10-17 23:40:29.180 [18] >TRACE:   clssgmEstablishMasterNode: MASTER for 1 is node(1) birth(1)
[    CSSD]2012-10-17 23:40:29.180 [18] >TRACE:   clssgmChangeMasterNode: requeued 0 RPCs
[    CSSD]2012-10-17 23:40:29.180 [18] >TRACE:   clssgmMasterCMSync: Synchronizing group/lock status
[    CSSD]2012-10-17 23:40:29.180 [18] >TRACE:   clssgmMasterSendDBDone: group/lock status synchronization complete
[    CSSD]CLSS-3000: reconfiguration successful, incarnation 1 with 1 nodes

[    CSSD]CLSS-3001: local node number 1, master node number 1

[    CSSD]2012-10-17 23:40:29.180 [18] >TRACE:   clssgmReconfigThread:  completed for reconfig(1), with status(1)
[    CSSD]2012-10-17 23:40:29.281 [11] >TRACE:   clssgmClientConnectMsg: Connect from con(1007738b0) proc(100b1e050) pid() proto(10:2:1:1)
[    CSSD]2012-10-17 23:40:29.437 [11] >TRACE:   clssgmClientConnectMsg: Connect from con(100b1bd90) proc(100b1ea20) pid() proto(10:2:1:1)
[    CSSD]2012-10-17 23:40:30.080 [17] >TRACE:   clssnmWaitForAcks: done, msg type(15)
[    CSSD]2012-10-17 23:40:30.080 [17] >TRACE:   clssnmDoSyncUpdate: Sync Complete!
[    CSSD]2012-10-17 23:40:30.893 [11] >TRACE:   clssgmClientConnectMsg: Connect from con(100b23370) proc(100b267b0) pid() proto(10:2:1:1)
[    CSSD]2012-10-17 23:40:30.916 [11] >TRACE:   clssgmClientConnectMsg: Connect from con(100b27090) proc(100b29350) pid() proto(10:2:1:1)
[    CSSD]2012-10-17 23:40:34.623 [11] >TRACE:   clssgmClientConnectMsg: Connect from con(100b29370) proc(100b27c60) pid() proto(10:2:1:1)
[    CSSD]2012-10-17 23:40:34.642 [11] >TRACE:   clssgmClientConnectMsg: Connect from con(100b23b40) proc(100b29600) pid() proto(10:2:1:1)
[    CSSD]2012-10-17 23:41:34.799 [11] >TRACE:   clssgmClientConnectMsg: Connect from con(100b22210) proc(100b27a90) pid() proto(10:2:1:1)
[    CSSD]2012-10-17 23:41:34.815 [11] >TRACE:   clssgmClientConnectMsg: Connect from con(100b23bf0) proc(100b29690) pid() proto(10:2:1:1)
[    CSSD]2012-10-17 23:41:35.198 [11] >TRACE:   clssgmClientConnectMsg: Connect from con(100b22210) proc(100b27a90) pid() proto(10:2:1:1)
[    CSSD]2012-10-17 23:41:35.216 [11] >TRACE:   clssgmClientConnectMsg: Connect from con(100b26510) proc(100b24ce0) pid() proto(10:2:1:1)
[    CSSD]2012-10-17 23:41:35.525 [11] >TRACE:   clssgmClientConnectMsg: Connect from con(100b221c0) proc(100b26720) pid() proto(10:2:1:1)
[    CSSD]2012-10-17 23:41:35.529 [11] >WARNING: clssgmShutDown: Received explicit shutdown request from client.
[    CSSD]2012-10-17 23:41:35.529 [11] >TRACE:   clssgmClientShutdown: Aborting client (100b1f820) proc (100b1e050)
[    CSSD]2012-10-17 23:41:35.529 [11] >TRACE:   clssgmClientShutdown: Aborting client (100b20050) proc (100b1e050)
[    CSSD]2012-10-17 23:41:35.529 [11] >TRACE:   clssgmClientShutdown: Aborting client (100b20ab0) proc (100b1e050)
[    CSSD]2012-10-17 23:41:35.529 [11] >TRACE:   clssgmClientShutdown: Aborting client (100b1ed80) proc (100b1ea20)
[    CSSD]2012-10-17 23:41:35.529 [11] >TRACE:   clssgmClientShutdown: waited 0 seconds on 4 IO capable clients
[    CSSD]2012-10-17 23:41:35.539 [11] >WARNING: clssgmClientShutdown: graceful shutdown completed.

At 23:41:35.529 a shutdown request was received, and at 23:41:35.539 the system rebooted. The message "clssnm_skgxnmon: skgxn init failed, rc 1" in the log is worth attention; for this error, the following articles are worth reading:

DBA Note
RAC Reboot due to system time change
Both articles mention CSS reboots caused by changing the system time, and both describe bringing OPROCD up in non-fatal mode as a workaround:

In some extreme cases, it may be necessary to disable fatal mode for OPROCD to find a root cause. DO NOT do this in a production environment. This is completely unsupported and could cause data corruption due to lack of I/O fencing.

1. Back up the init.cssd file. Example:

Sun and Linux:
cp /etc/init.d/init.cssd /etc/init.d/init.cssd.bak

HP-UX and HP Tru64:
cp /sbin/init.d/init.cssd /sbin/init.d/init.cssd.bak

IBM AIX:
cp /etc/init.cssd /etc/init.cssd.bak

2. Stop the CRS stack or boot the node in single user mode.

To stop the CRS Stack:

Sun and Linux:
/etc/init.d/init.crs stop

HP-UX and HP Tru64:
/sbin/init.d/init.crs stop

IBM AIX:
/etc/init.crs/init.crs stop

3. Confirm that the CRS stack is down.

ps -ef | grep d.bin

4. Edit the init.cssd file from the location in step 1, change the OPROCD
startup line to a non-fatal startup:

Sun Example:

# in fatal mode we always will start OPROCD FATAL
if [ $OPROCD_EXISTS ]; then
$OPROCD start -t $OPROCD_DEFAULT_TIMEOUT -m $OPROCD_DEFAULT_MARGIN
$OPROCD check -t $OPROCD_CHECK_TIMEOUT 2>$NULL
fi

Change this to:

# in fatal mode we always will start OPROCD FATAL
if [ $OPROCD_EXISTS ]; then
$OPROCD startInstall -t $OPROCD_DEFAULT_TIMEOUT -m $OPROCD_DEFAULT_MARGIN
$OPROCD check -t $OPROCD_CHECK_TIMEOUT 2>$NULL
fi

You could also combine this method with the ‘tracing system calls’ method
for more debugging.

5. Reboot the node(s)

I did not dare to try that approach; instead I changed the DISABLE_OPROCD setting in init.cssd, as follows:

#!/bin/sh
#
# Copyright (c) 2001, 2010, Oracle and/or its affiliates. All rights reserved. 
#
# init.cssd - Control script for the Oracle CSS daemon.
#
#   In a full RAC install, this should not be in an rcX.d
#   directory. It should only be invoked from init.crs.
#   Never by hand.
#
#     No manual invocation of init.cssd is supported on a cluster.
#
#   In a local, non-cluster, installation without RAC, it should
#   be placed in an rcX.d directory. It may be invoked by hand
#   if necessary, however there are a series of risks and complications
#   that should be kept in mind when doing so.
#
#     Actions usable in local-only configuration are:
#          start, stop, disable and enable.
#   
# ==========================================================
# Porting Definitions and requirements:
#
# FAST_REBOOT - take out the machine now. We are concerned about
#               data integrity since the other node has evicted us.
# SLOW_REBOOT - We can rely on vendor clusterware to delay sending
#               the reconfig to the other node due to its IO fencing
#               guarantees. So trigger its diagnostic reboot.
# VC_UP - Determine whether Vendor clusterware processes are active.
#         If they are, then the remote node will hear that CSS/CLSVMON
#         have died, and we will be forced to use FAST_REBOOT.
#         This is also used at startup time for dependency checking.
#         Returns 0 if they are up, 1 otherwise. This should be
#         an extremely fast check.
# CLINFO - Determine whether we are booted in non-cluster mode.
#          Returns 0 for cluster mode, 1 for non-cluster mode
#          This call is allowed to take a long time to decide.
#
# GETBOOTID - Returns a string that uniquely identifies this boot.
#             This string must never change while the machine is up,
#             and must change the next time the machine boots.

ORA_CRS_HOME=/u01/app/oracle/product/10.2.0/crs_1
ORACLE_USER=oracle

ORACLE_HOME=$ORA_CRS_HOME

export ORACLE_HOME
export ORA_CRS_HOME
export ORACLE_USER

# Set DISABLE_OPROCD to false. Platforms that do not ship an oprocd
# binary should override this below.
DISABLE_OPROCD=false   # ------------- modify to true
# Default OPROCD timeout values defined here, so that it can be
# over-ridden as needed by a platform.
# default Timout of 1000 ms and a margin of 500ms
OPROCD_DEFAULT_TIMEOUT=1000
OPROCD_DEFAULT_MARGIN=500
# default Timeout for other actions
OPROCD_CHECK_TIMEOUT=2000

.................

After manually rebooting both nodes:
root@bmsa#ps -ef |grep op
oracle 1706 1 0 09:23:56 ? 0:00 /u01/app/oracle/product/10.2.0/crs_1/opmn/bin/ons -d
oracle 1707 1706 0 09:23:56 ? 0:00 /u01/app/oracle/product/10.2.0/crs_1/opmn/bin/ons -d
root 14424 14345 0 11:36:50 pts/4 0:00 grep op

The oprocd process is gone, while CSS still starts in fatal mode:

[    CSSD]2012-10-17 23:46:46.883 [1] >TRACE:   clssnmFatalInit: fatal mode enabled
[    CSSD]2012-10-17 23:46:58.025 [19] >TRACE:   clssgmMasterSendDBDone: group/lock status synchronization complete
[    CSSD]CLSS-3000: reconfiguration successful, incarnation 2 with 2 nodes

[    CSSD]CLSS-3001: local node number 1, master node number 1

[    CSSD]2012-10-17 23:46:58.025 [19] >TRACE:   clssgmReconfigThread:  completed for reconfig(2), with status(1)
[    CSSD]2012-10-17 23:46:58.959 [17] >TRACE:   clssnmWaitForAcks: done, msg type(15)
[    CSSD]2012-10-17 23:46:58.959 [17] >TRACE:   clssnmDoSyncUpdate: Sync Complete!

At this point the reboot behavior stopped. CRS was then quickly upgraded to 10.2.0.5 and the init.cssd change was reverted:

-bash-3.2$ ps -ef |grep op
  oracle 27803 15868   0 13:48:17 pts/4       0:00 grep op
    root  1172   743   0 09:23:41 ?           0:00 /bin/sh /etc/init.d/init.cssd oprocd
  oracle 24312     1   0 13:41:52 ?           0:00 /u01/app/oracle/product/10.2.0/crs_1/opmn/bin/ons -d
    root  1339  1172   0 09:23:41 ?           0:01 /u01/app/oracle/product/10.2.0/crs_1/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5
  oracle 24313 24312   0 13:41:52 ?           0:00 /u01/app/oracle/product/10.2.0/crs_1/opmn/bin/ons -d
  
-bash-3.2$ crs_stat -t
Name           Type           Target    State     Host        
------------------------------------------------------------
ora....SM1.asm application    ONLINE    ONLINE    bmsa        
ora....SA.lsnr application    ONLINE    ONLINE    bmsa        
ora.bmsa.gsd   application    ONLINE    ONLINE    bmsa        
ora.bmsa.ons   application    ONLINE    ONLINE    bmsa        
ora.bmsa.vip   application    ONLINE    ONLINE    bmsa        
ora....SM2.asm application    ONLINE    ONLINE    bmsb        
ora....SB.lsnr application    ONLINE    ONLINE    bmsb        
ora.bmsb.gsd   application    ONLINE    ONLINE    bmsb        
ora.bmsb.ons   application    ONLINE    ONLINE    bmsb        
ora.bmsb.vip   application    ONLINE    ONLINE    bmsb        
ora....SM3.asm application    ONLINE    ONLINE    bmsc        
ora....SC.lsnr application    ONLINE    ONLINE    bmsc        
ora.bmsc.gsd   application    ONLINE    ONLINE    bmsc        
ora.bmsc.ons   application    ONLINE    ONLINE    bmsc        
ora.bmsc.vip   application    ONLINE    ONLINE    bmsc 

11gR2 rac install ASM terminating the instance due to error 481

October 16, 2012 11g, oracle, RAC No comments

A newly deployed RAC system (Red Hat 5.8, RAC 11.2.0.3) threw error 481 while running the GI root.sh during installation:

CRS-2672: Attempting to start 'ora.asm' on 'ptdb1'
CRS-5017: The resource action "ora.asm start" encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0
. For details refer to "(:CLSN00107:)" in "/g01/app/oracle/product/11.2.0/grid/log/agent/ohasd/oraagent_grid/oraagent_grid.log".
CRS-2674: Start of 'ora.asm' on 'racnode1' failed
..
Failed to start ASM at /software/app/11.2.0.3/crs/install/crsconfig_lib.pm line 1272

The following MOS note covers how to resolve this issue:


ASM on Non First Node (Second or Other Node) Fails to Come up With: PMON (ospid: nnnn): terminating the instance due to error 481

In this Document
Purpose
Details
Case1: link local IP (169.254.x.x) is being used by other adapter/network
Case2: firewall exists between nodes on private network (iptables etc)
Case3: HAIP is up on some nodes but not on all
Case4: HAIP is up on all nodes but some do not have route info
References
Applies to:

Oracle Server – Enterprise Edition – Version 11.2.0.1 and later
Information in this document applies to any platform.
Purpose

This note lists common causes of ASM start up failure with the following error on non-first node (second or others):

alert_.log from non-first node
lmon registered with NM – instance number 2 (internal mem no 1)
Tue Dec 06 06:16:15 2011
System state dump requested by (instance=2, osid=19095 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /g01/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_diag_19138.trc
Tue Dec 06 06:16:15 2011
PMON (ospid: 19095): terminating the instance due to error 481
Dumping diagnostic data in directory=[cdmp_20111206061615], requested by (instance=2, osid=19095 (PMON)), summary=[abnormal instance termination].
Tue Dec 06 06:16:15 2011
ORA-1092 : opitsk aborting process

Note: ASM instance terminates shortly after “lmon registered with NM”
If ASM on non-first node was running previously, likely the following will be in alert.log when it failed originally:
..
IPC Send timeout detected. Sender: ospid 32231 [oracle@ftdcslsedw01b (PING)]
..
ORA-29740: evicted by instance number 1, group incarnation 10
..

diag trace from non-first ASM (+ASMn_diag_.trc)
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE])

alert_.log from first node
LMON (ospid: 15986) detects hung instances during IMR reconfiguration
LMON (ospid: 15986) tries to kill the instance 2 in 37 seconds.
Please check instance 2’s alert log and LMON trace file for more details.
..
Remote instance kill is issued with system inc 64
Remote instance kill map (size 1) : 2
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x20000000
The instance eviction map is 2
Reconfiguration started (old inc 64, new inc 66)

If the issue happens while running root script (root.sh or rootupgrade.sh) as part of Grid Infrastructure installation/upgrade process, the following symptoms will present:

root script screen output
Start of resource “ora.asm” failed

CRS-2672: Attempting to start ‘ora.asm’ on ‘racnode1’
CRS-5017: The resource action “ora.asm start” encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0
. For details refer to “(:CLSN00107:)” in “/ocw/grid/log/racnode1/agent/ohasd/oraagent_grid/oraagent_grid.log”.
CRS-2674: Start of ‘ora.asm’ on ‘racnode1’ failed
..
Failed to start ASM at /ispiris-qa/app/11.2.0.3/crs/install/crsconfig_lib.pm line 1272
$GRID_HOME/cfgtoollogs/crsconfig/rootcrs_.log
2011-11-29 15:56:48: Executing cmd: /ispiris-qa/app/11.2.0.3/bin/crsctl start resource ora.asm -init
..
> CRS-2672: Attempting to start ‘ora.asm’ on ‘racnode1’
> CRS-5017: The resource action “ora.asm start” encountered the following error:
> ORA-03113: end-of-file on communication channel
> Process ID: 0
> Session ID: 0 Serial number: 0
> . For details refer to “(:CLSN00107:)” in “/ispiris-qa/app/11.2.0.3/log/racnode1/agent/ohasd/oraagent_grid/oraagent_grid.log”.
> CRS-2674: Start of ‘ora.asm’ on ‘racnode1’ failed
> CRS-2679: Attempting to clean ‘ora.asm’ on ‘racnode1’
> CRS-2681: Clean of ‘ora.asm’ on ‘racnode1’ succeeded
..
> CRS-4000: Command Start failed, or completed with errors.
>End Command output
2011-11-29 15:59:00: Executing cmd: /ispiris-qa/app/11.2.0.3/bin/crsctl check resource ora.asm -init
2011-11-29 15:59:00: Executing cmd: /ispiris-qa/app/11.2.0.3/bin/crsctl status resource ora.asm -init
2011-11-29 15:59:01: Checking the status of ora.asm
..
2011-11-29 15:59:53: Start of resource “ora.asm” failed
Details

Case1: link local IP (169.254.x.x) is being used by other adapter/network

Symptoms:

$GRID_HOME/log//alert.log
[/ocw/grid/bin/orarootagent.bin(4813)]CRS-5018:(:CLSN00037:) Removed unused HAIP route: 169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0
OS messages (optional)
Dec 6 06:11:14 racnode1 dhclient: DHCPREQUEST on usb0 to 255.255.255.255 port 67
Dec 6 06:11:14 racnode1 dhclient: DHCPACK from 169.254.95.118
ifconfig -a
..
usb0 Link encap:Ethernet HWaddr E6:1F:13:AD:EE:D3
inet addr:169.254.95.120 Bcast:169.254.95.255 Mask:255.255.255.0
..

Note: it’s usb0 in this case, but it can be any other adapter which uses link local

Solution:

Link local IP must not be used by any other network on cluster nodes. In this case, a USB network device got the IP 169.254.95.118 from a DHCP server, which disrupted HAIP routing; the solution is to blacklist the device in udev so it is not activated automatically.
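A minimal sketch of that step, assuming a RHEL-style system where usb0 is the offending adapter; this uses the network-scripts route rather than a udev rule, and the file and device names are examples only:

# keep the adapter from being activated and from requesting a DHCP lease
cat > /etc/sysconfig/network-scripts/ifcfg-usb0 <<'EOF'
DEVICE=usb0
ONBOOT=no
NM_CONTROLLED=no
BOOTPROTO=none
EOF
ifdown usb0    # take the interface down now; it will stay down after reboot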
Case2: firewall exists between nodes on private network (iptables etc)

No firewall is allowed on the private network (cluster_interconnect) between nodes, including software firewalls such as iptables, ipmon, etc.
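A quick, hedged way to confirm that no software firewall is active on the private network (RHEL syntax; the subnet is an example, not taken from the note):

service iptables status                 # should report that the firewall is stopped
chkconfig --list iptables               # should be off in all runlevels
# if a firewall must stay on, make sure the private subnet is fully open, e.g.:
iptables -L -n -v | grep 10.1.0.        # replace with your cluster_interconnect subnet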
Case3: HAIP is up on some nodes but not on all

Symptoms:

alert_<+ASMn>.log for some instances
Cluster communication is configured to use the following interface(s) for this instance
10.1.0.1
alert_<+ASMn>.log for other instances
Cluster communication is configured to use the following interface(s) for this instance
169.254.201.65

Note: some instances are using HAIP while others are not, so they cannot talk to each other
Solution:

The solution is to bring up HAIP on all nodes.

To find out HAIP status, execute the following on all nodes:

$GRID_HOME/bin/crsctl stat res ora.cluster_interconnect.haip -init

If it’s offline, try to bring it up as root:

$GRID_HOME/bin/crsctl start res ora.cluster_interconnect.haip -init

If HAIP fails to start, refer to note 1210883.1 for known issues.

If the "up node" is not using HAIP, and no outage is allowed, the workaround is to set the init.ora/spfile parameter cluster_interconnects to the private IP of each node so that ASM/DB can come up on the "down node". Once a maintenance window is available, the parameter must be removed to allow HAIP to work.
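A hedged sketch of that temporary workaround, shown for one ASM instance (the instance name and private IP are examples; repeat per instance with each node's own private IP, and remove the parameter in the next maintenance window):

sqlplus / as sysasm <<'EOF'
alter system set cluster_interconnects='10.1.0.2' scope=spfile sid='+ASM2';
EOF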

Case4: HAIP is up on all nodes but some do not have route info

Symptoms:

alert_<+ASMn>.log for all instances
Cluster communication is configured to use the following interface(s) for this instance
169.254.xxx.xxx
“netstat -rn” for some nodes (surviving nodes) missing HAIP route
netstat -rn
Destination Gateway Genmask Flags MSS Window irtt Iface
161.130.90.0 0.0.0.0 255.255.248.0 U 0 0 0 bond0
160.131.11.0 0.0.0.0 255.255.255.0 U 0 0 0 bond2
0.0.0.0 160.11.80.1 0.0.0.0 UG 0 0 0 bond0

The line for HAIP is missing, i.e:

169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 bond2

Note: As the HAIP route info is missing on some nodes, the HAIP addresses are not pingable; usually a newly restarted node will have the HAIP route info
Solution:

The solution is to manually add the HAIP route info on the nodes where it is missing:

4.1. Execute “netstat -rn” on any node that has HAIP route info and locate the following:

169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 bond2

Note: the first field is the HAIP subnet ID (it starts with 169.254.x.x), the third field is the HAIP subnet netmask, and the last field is the private network adapter name

4.2. Execute the following as root on the node that’s missing HAIP route:

# route add -net <haip_subnet_id> netmask <netmask> dev <private_interface>

i.e.

# route add -net 169.254.0.0 netmask 255.255.0.0 dev bond2

4.3. Start ora.crsd as root on the node that is partially up:

# $GRID_HOME/bin/crsctl start res ora.crsd -init

The other workaround is to restart GI on the node that is missing the HAIP route with the "crsctl stop crs -f" and "crsctl start crs" commands as root.

In this case the error was indeed caused by usb0: the KVM console used DHCP to assign IPs dynamically, so the two servers received addresses in the 169.254 link-local range. After disabling DHCP and assigning the IPs manually, the following errors were still thrown:

[cssd(22096)]CRS-1605:CSSD voting file is online: /dev/mapper/mpath3; details in /g01/app/oracle/product/11.2.0/grid/log/ptdb2/cssd/ocssd.log.
[cssd(22096)]CRS-1636:The CSS daemon was started in exclusive mode but found an active CSS daemon on node ptdb1 and is terminating; details at (:CSSNM00006:) in /g01/app/oracle/product/11.2.0/grid/log/ptdb2/cssd/ocssd.log
2012-10-15 10:07:33.421
[ohasd(20942)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'ptdb2'.

2012-10-15 10:11:27.669
[cssd(22425)]CRS-1662:Member kill requested by node ptdb1 for member number 1, group DB+ASM
2012-10-15 10:11:29.931
[/g01/app/oracle/product/11.2.0/grid/bin/oraagent.bin(22334)]CRS-5019:All OCR locations are on ASM disk groups [OCRVOT], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/g01/app/oracle/product/11.2.0/grid/log/ptdb2/agent/ohasd/oraagent_grid/oraagent_grid.log".
2012-10-15 10:11:31.079
[/g01/app/oracle/product/11.2.0/grid/bin/oraagent.bin(22334)]CRS-5019:All OCR locations are on ASM disk groups [OCRVOT], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/g01/app/oracle/product/11.2.0/grid/log/ptdb2/agent/ohasd/oraagent_grid/oraagent_grid.log".
2012-10-15 10:11:31.119
[/g01/app/oracle/product/11.2.0/grid/bin/oraagent.bin(22334)]CRS-5019:All OCR locations are on ASM disk groups [OCRVOT], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/g01/app/oracle/product/11.2.0/grid/log/ptdb2/agent/ohasd/oraagent_grid/oraagent_grid.log".
2012-10-15 10:11:31.197
[/g01/app/oracle/product/11.2.0/grid/bin/oraagent.bin(22334)]CRS-5019:All OCR locations are on ASM disk groups [OCRVOT], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/g01/app/oracle/product/11.2.0/grid/log/ptdb2/agent/ohasd/oraagent_grid/oraagent_grid.log".
2012-10-15 10:11:32.837
[/g01/app/oracle/product/11.2.0/grid/bin/orarootagent.bin(22496)]CRS-5016:Process "/g01/app/oracle/product/11.2.0/grid/bin/acfsload" spawned by agent "/g01/app/oracle/product/11.2.0/grid/bin/orarootagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/g01/app/oracle/product/11.2.0/grid/log/ptdb2/agent/ohasd/orarootagent_root/orarootagent_root.log"
2012-10-15 10:11:33.068
[/g01/app/oracle/product/11.2.0/grid/bin/oraagent.bin(22334)]CRS-5019:All OCR locations are on ASM disk groups [OCRVOT], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/g01/app/oracle/product/11.2.0/grid/log/ptdb2/agent/ohasd/oraagent_grid/oraagent_grid.log".

Check orarootagent_root.log:

2012-10-15 10:08:40.295: [ USRTHRD][1115584832] {0:0:209} HAIP:  Updating member info HAIP1;172.168.0.0#0
2012-10-15 10:08:40.298: [ USRTHRD][1115584832] {0:0:209} InitializeHaIps[ 0]  infList 'inf eth1, ip 172.168.0.102, sub 172.168.0.0'
2012-10-15 10:08:40.299: [ USRTHRD][1115584832] {0:0:209} Error in getting Key SYSTEM.network.haip.group.cluster_interconnect.interface.valid in OCR
2012-10-15 10:08:40.303: [ CLSINET][1115584832] failed to open OLR HAIP subtype SYSTEM.network.haip.group.cluster_interconnect.interface.valid key, rc=4

Now the problem is easy to pin down: although the ora.cluster_interconnect.haip resource shows ONLINE on every node, the nodes cannot ping each other's HAIP addresses, so node 2 cannot obtain the resources it needs over HAIP.

Two things can cause this:

1. A problem in the NIC driver, so the HAIP addresses virtualized on the private interface cannot communicate.

2. A problem in the switch configuration, for example the VLAN layout, or a switch bug.

So we can add one more case to the note above:

CASE 5

If HAIP is ONLINE everywhere and "netstat -rn" shows the HAIP entry on every node, you still need to make sure the HAIP addresses can be pinged between nodes (a quick check is sketched below). In this example, manually restarting the haip resource worked around the problem temporarily, but it came back after rebooting the machines. We finally resolved it by connecting the two nodes' private interfaces with a crossover cable; with more than two nodes, the switch itself has to be investigated.
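A minimal check, assuming bond2 is the private interface; take the peer HAIP addresses from ifconfig on each node (they appear as sub-interfaces of the private adapter), the addresses below are placeholders:

# run from every node against every other node's HAIP address
for ip in 169.254.201.65 169.254.33.10; do
    ping -c 3 -I bond2 $ip >/dev/null && echo "HAIP $ip reachable" || echo "HAIP $ip NOT reachable"
done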

For 11.2 CRS troubleshooting, the main logs are the following (a quick way to pull the most important ones together is sketched after the list):
<GRID_HOME>/log/<host>/alert<node>.log : the CRS alert log; CRS start/stop, start/stop of the core processes (crsd, ocssd etc.) and abnormal resource checks are all recorded here.

<GRID_HOME>/log/<host>/crsd : the CRSD log; records the start, stop and failures of the resources monitored by CRSD, such as database instances, ASM, listeners, disk groups, VIPs and so on.
<GRID_HOME>/log/<host>/cssd : the OCSSD log; records inter-node communication. Network heartbeat or voting disk problems, as well as abnormal termination of OCSSD itself, show up here.
<GRID_HOME>/log/<host>/evmd : the event monitor (EVMD) log.
<GRID_HOME>/log/<host>/ohasd : the OHASD log; when CRS fails to start, this is often the log to check.
<GRID_HOME>/log/<host>/mdnsd : DNS-related log.
<GRID_HOME>/log/<host>/gipcd : inter-process and inter-node communication log.
<GRID_HOME>/log/<host>/gpnpd : GPnP (Grid Plug and Play profile) log.
<GRID_HOME>/log/<host>/gnsd (optional) : Grid Naming Service log.
<GRID_HOME>/log/<host>/ctssd : Cluster Time Synchronization Service log.

<GRID_HOME>/log/<host>/agent/ohasd:
<GRID_HOME>/log/<host>/agent/ohasd/oraagent_grid : starts/stops/checks/cleans resources such as ora.asm, ora.evmd, ora.gipcd, ora.gpnpd, ora.mdnsd.
<GRID_HOME>/log/<host>/agent/ohasd/orarootagent_root : starts/stops/checks/cleans resources such as ora.crsd, ora.ctssd, ora.diskmon, ora.drivers.acfs, ora.crf (11.2.0.2).
<GRID_HOME>/log/<host>/agent/ohasd/oracssdagent_root : starts/stops/checks the ocssd process.
<GRID_HOME>/log/<host>/agent/ohasd/oracssdmonitor_root : monitors the cssdagent process.

<GRID_HOME>/log/<host>/agent/crsd:
<GRID_HOME>/log/<host>/agent/crsd/oraagent_grid : starts/stops/checks/cleans resources such as ASM, ora.eons, ora.LISTENER.lsnr, SCAN listeners, ora.ons and disk groups.
<GRID_HOME>/log/<host>/agent/crsd/oraagent_oracle : starts/stops/checks/cleans service and database resources.
<GRID_HOME>/log/<host>/agent/crsd/orarootagent_root : starts/stops/checks/cleans GNS, VIP, SCAN VIP and network resources.
<GRID_HOME>/log/<host>/agent/crsd/scriptagent_grid : log for custom application resources.
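As a hedged convenience, the most frequently needed of these logs can be pulled together in one pass (the GRID_HOME path is an example; Oracle also ships $GRID_HOME/bin/diagcollection.sh for a full collection):

GRID_HOME=/u01/app/11.2.0/grid
HOST=$(hostname -s)
tail -n 100 $GRID_HOME/log/$HOST/alert$HOST.log
tail -n 100 $GRID_HOME/log/$HOST/ohasd/ohasd.log
tail -n 100 $GRID_HOME/log/$HOST/agent/ohasd/oraagent_grid/oraagent_grid.log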

CRS troubleshooting: two cases

September 11, 2012 oracle, RAC No comments

On a 10.2.0.5 RAC system, the evmd and cssd services fail to start.

CRS version

[oracle@ptdb02 oracle]$ crsctl query crs softwareversion
CRS software version on node [ptdb02] is [10.2.0.5.0]

h1:3:respawn:/sbin/init.d/init.evmd run >/dev/null 2>&1

h2:3:respawn:/sbin/init.d/init.cssd fatal >/dev/null 2>&1

[crsd(21292)]CRS-1012:The OCR service started on node ptdb01.
2012-09-08 18:48:24.980
[evmd(21873)]CRS-1401:EVMD started on node ptdb01.
2012-09-08 18:48:25.271
[crsd(22023)]CRS-1005:The OCR upgrade was completed. Version has changed from 169870592 to 169870592. Details in /u01/app/oracle/product/10.2.0/crs_1/log/ptdb01/crsd/crsd.log.
2012-09-08 18:48:25.271
[crsd(22023)]CRS-1012:The OCR service started on node ptdb01.
2012-09-08 18:48:26.679
[evmd(22605)]CRS-1401:EVMD started on node ptdb01.

CSSD reconfiguration never completes; the active node is ptdb01, and the evmd and cssd processes cannot start. Check the evmd log:

Oracle Database 10g CRS Release 10.2.0.5.0 Production Copyright 1996, 2007, Oracle.  All rights reserved
2012-09-08 18:59:03.765: [    EVMD][999623216]0Initializing OCR
2012-09-08 18:59:03.773: [    EVMD][999623216]0Active Version from OCR:10.2.0.5.0
2012-09-08 18:59:03.773: [    EVMD][999623216]0Active Version and Software Version are same
2012-09-08 18:59:03.773: [    EVMD][999623216]0Initializing Diagnostics Settings
2012-09-08 18:59:03.773: [    EVMD][999623216]0ENV Logging level for Module: allcomp  0
2012-09-08 18:59:03.773: [    EVMD][999623216]0ENV Logging level for Module: default  0
2012-09-08 18:59:03.773: [    EVMD][999623216]0ENV Logging level for Module: COMMCRS  0
2012-09-08 18:59:03.773: [    EVMD][999623216]0ENV Logging level for Module: COMMNS  0
2012-09-08 18:59:03.774: [    EVMD][999623216]0ENV Logging level for Module: EVMD  0
2012-09-08 18:59:03.774: [    EVMD][999623216]0ENV Logging level for Module: EVMDMAIN  0
2012-09-08 18:59:03.774: [    EVMD][999623216]0ENV Logging level for Module: EVMCOMM  0
2012-09-08 18:59:03.774: [    EVMD][999623216]0ENV Logging level for Module: EVMEVT  0
2012-09-08 18:59:03.774: [    EVMD][999623216]0ENV Logging level for Module: EVMAPP  0
2012-09-08 18:59:03.774: [    EVMD][999623216]0ENV Logging level for Module: EVMAGENT  0
2012-09-08 18:59:03.774: [    EVMD][999623216]0ENV Logging level for Module: CRSOCR  0
2012-09-08 18:59:03.774: [    EVMD][999623216]0ENV Logging level for Module: CLUCLS  0
2012-09-08 18:59:03.774: [    EVMD][999623216]0ENV Logging level for Module: OCRRAW  0
2012-09-08 18:59:03.774: [    EVMD][999623216]0ENV Logging level for Module: OCROSD  0
2012-09-08 18:59:03.774: [    EVMD][999623216]0ENV Logging level for Module: OCRAPI  0
2012-09-08 18:59:03.775: [    EVMD][999623216]0ENV Logging level for Module: OCRUTL  0
2012-09-08 18:59:03.775: [    EVMD][999623216]0ENV Logging level for Module: OCRMSG  0
2012-09-08 18:59:03.775: [    EVMD][999623216]0ENV Logging level for Module: OCRCLI  0
2012-09-08 18:59:03.775: [    EVMD][999623216]0ENV Logging level for Module: CSSCLNT  0
[  clsdmt][1108588864]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=ptdb01DBG_EVMD))
2012-09-08 18:59:03.777: [    EVMD][999623216]0Creating pidfile  /u01/app/oracle/product/10.2.0/crs_1/evm/init/ptdb01.pid
2012-09-08 18:59:03.781: [    EVMD][999623216]0Authorization database built successfully.
2012-09-08 18:59:04.209: [  OCRAPI][999623216]procr_open: Node Failure. Attempting retry #0

..
2012-09-08 18:59:04.210: [  OCRCLI][999623216]oac_reconnect_server: Could not connect to server. clsc ret 9
2012-09-08 18:59:19.177: [  OCRCLI][999623216]oac_reconnect_server: Could not connect to server. clsc ret 9
2012-09-08 18:59:19.227: [  OCRAPI][999623216]procr_open: Node Failure. Attempting retry #298
2012-09-08 18:59:19.228: [  OCRCLI][999623216]oac_reconnect_server: Could not connect to server. clsc ret 9
2012-09-08 18:59:19.278: [  OCRAPI][999623216]procr_open: Node Failure. Attempting retry #299
...
2012-09-08 18:59:19.635: [  OCRCLI][999623216]oac_reconnect_server: Could not connect to server. clsc ret 9
2012-09-08 18:59:19.636: [  EVMAPP][999623216][PANIC]0Unable to open local accept socket - errno 13
2012-09-08 18:59:19.636: [    EVMD][999623216][PANIC]0EVMD exiting
2012-09-08 18:59:19.636: [    EVMD][999623216]0Done.

OCR initialization fails after 299 retries, which prevents the evmd process from starting. This is different from 11gR2: since 11gR2, CRS obtains the ASM information from the OLR (Oracle Local Repository) and can then bring the OCR online; see the article quoted later in this post.

The OCR location comes from /etc/oracle/ocr.loc:

[oracle@ptdb01 oracle]$ cat ocr.loc
ocrconfig_loc=/u01/app/oracle/product/10.2.0/db_1/cdata/localhost/local.ocr
local_only=TRUE
[oracle@ptdb01 oracle]$ strings  /u01/app/oracle/product/10.2.0/db_1/cdata/localhost/local.ocr
root
root
SYSTEM
DATABASE
local_only
ORA_CRS_HOME
versionstring
version
language
AMERICAN_AMERICA.WE8ISO8859P1
activeversion
 node_numbers
10G Release 2
/u01/app/oracle/product/10.2.0/db_1
true
node0
10.2.0.5.0
hostnames
privatenames
node_numbers
node_names
configured_node_map
clustername
localhost
ptdb01
nodenum
node0
nsendpoint
hostname
privatename
nodename
ptdb01
127.0.0.1
nodenum
ptdb01
nodenum
ptdb01
(ADDRESS=(PROTOCOL=tcp)(HOST=127.0.0.1)(PORT=0))
10.2.0.5.0

local_only=TRUE clearly marks this as a local-only registry pointing at "/u01/app/oracle/product/10.2.0/db_1/cdata/localhost/local.ocr", and local.ocr contains no real OCR location information, so the OCR cannot be initialized and the evmd process cannot come online. After manually setting ocrconfig_loc=/dev/raw/raw2 and starting CRS:
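A hedged sketch of that fix (the raw device path is specific to this system):

# as root: point ocr.loc at the real cluster OCR and restart CRS
cat > /etc/oracle/ocr.loc <<'EOF'
ocrconfig_loc=/dev/raw/raw2
local_only=FALSE
EOF
/u01/app/oracle/product/10.2.0/crs_1/bin/crsctl stop crs
/u01/app/oracle/product/10.2.0/crs_1/bin/crsctl start crs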

2012-09-08 19:04:20.912: [    EVMD][934681136]0Authorization database built successfully.
2012-09-08 19:04:21.270: [  EVMEVT][934681136][ENTER]0EVM Listening on: 54720654
2012-09-08 19:04:21.274: [  EVMAPP][934681136]0EVMD Started
2012-09-08 19:04:21.277: [  EVMEVT][1199147328]0Listening at (ADDRESS=(PROTOCOL=tcp)(HOST=ptdb01-priv)(PORT=0)) for P2P evmd connections requests
2012-09-08 19:04:21.281: [    EVMD][934681136]0Authorization database built successfully.
2012-09-08 19:04:21.395: [  EVMEVT][1230616896][ENTER]0Establishing P2P connection with node: ptdb02
2012-09-08 19:04:21.397: [  EVMEVT][1241106752]0Private Member Update event for ptdb01 received by clssgsgrpstat

2012-09-08 19:00:07.083
[cssd(13795)]CRS-1601:CSSD Reconfiguration complete. Active nodes are ptdb01 ptdb02 .

CSSD starts successfully with both ptdb01 and ptdb02 active. It later turned out that when installing ASM the engineer had mistakenly chosen a single-instance configuration and then simply deleted that instance; the follow-up cleanup was never completed.

———————————————————-

DBCA fails while creating a database:

CRS version:

[grid@db-42 bin]$ crsctl query crs softwareversion
Oracle Clusterware version on node [db-42] is [11.2.0.3.0]

[grid@db-42 bin]$ id oracle
uid=502(oracle) gid=501(oinstall) groups=501(oinstall),502(dba),506(asmdba)
[grid@db-42 bin]$ crs_getperm ora.ARCH.dg
Name: ora.ARCH.dg
owner:grid:rwx,pgrp:oinstall:rwx,other::r--
[grid@db-42 bin]$ crs_getperm ora.DATA.dg
Name: ora.DATA.dg
owner:grid:rwx,pgrp:asmadmin:rwx,other::r--

There are two ways to fix this:

1. Add the oracle user to the asmadmin group (a sketch follows the listing below).

2. Modify the ACL on ora.DATA.dg:

[grid@db-42 bin]$ crs_setperm ora.DATA.dg -u user:oracle:rwx

[grid@db-42 bin]$ crs_getperm ora.DATA.dg
Name: ora.DATA.dg
owner:grid:rwx,pgrp:asmadmin:rwx,other::r--,user:oracle:rwx
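For completeness, a hedged sketch of option 1 above (the group name follows this cluster's setup; the affected Oracle processes must be restarted to pick up the new group):

# as root
usermod -a -G asmadmin oracle
id oracle      # verify the new secondary group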

11g Smart Flash Cache

August 30, 2012 11g, oracle, RAC No comments

Here is a compiled note on 11g Smart Flash Cache (vmcd):

11g r2 flashcache_Tips from Louis liu

Let's run a quick demo using a 160GB FusionIO SLC card.

[oracle@db-41 ~]$ ls -l /dev/fioa 
brw-r----- 1 oracle oinstall 252, 0 Jul 17 14:26 /dev/fioa


[root@db-41 dev]# fio-status

Found 1 ioDrive in this system
Fusion-io driver version: 2.3.1 build 123

Adapter: ioDrive
	Fusion-io ioDrive 160GB, Product Number:VRG5T SN:479912
	External Power: NOT connected
	PCIE Power limit threshold: 24.75W
	Sufficient power available: Unknown
	Connected ioDimm module:
	  fct0:	Fusion-io ioDrive 160GB, Product Number:VRG5T SN:479912

fct0	Attached as 'fioa' (block device)
	Fusion-io ioDrive 160GB, Product Number:VRG5T SN:479912
	Located in slot 0 Upper of ioDrive SN:479912
	PCI:08:00.0, Slot Number:1
	Firmware v5.0.6, rev 101583
	160.00 GBytes block device size, 203 GBytes physical device size
	Sufficient power available: Unknown
	Internal temperature: 43.3 degC, max 44.8 degC
	Media status: Healthy; Reserves: 100.00%, warn at 10.00%
	


SQL*Plus: Release 11.2.0.3.0 Production on Thu Aug 30 11:13:32 2012

Copyright (c) 1982, 2011, Oracle.  All rights reserved.


Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options

SQL> show parameter sga;

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
lock_sga			     boolean	 FALSE
pre_page_sga			     boolean	 FALSE
sga_max_size			     big integer 68096M
sga_target			     big integer 68096M

SQL> alter system set db_flash_cache_size=150G scope=spfile;

System altered.

SQL> alter system set db_flash_cache_file='/dev/fioa' scope=spfile;

System altered.

SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.

Total System Global Area 7.1085E+10 bytes
Fixed Size		    2235608 bytes
Variable Size		 8053064488 bytes
Database Buffers	 6.2814E+10 bytes
Redo Buffers		  215855104 bytes
Database mounted.
ORA-24324: service handle not initialized
ORA-24323: value not allowed
ORA-03113: end-of-file on communication channel


SQL> startup
ORACLE instance started.

Total System Global Area 7.1085E+10 bytes
Fixed Size		    2235608 bytes
Variable Size		 8053064488 bytes
Database Buffers	 6.2814E+10 bytes
Redo Buffers		  215855104 bytes
Database mounted.
Database opened.
SQL> show parameter flash_cache

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
db_flash_cache_file		     string	 /dev/fioa
db_flash_cache_size		     big integer 149G
SQL> alter system set  db_flash_cache_size=150G scope=spfilel
  2  

SQL> alter system set  db_flash_cache_size=150G scope=spfile;

System altered.

SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.

Total System Global Area 7.1085E+10 bytes
Fixed Size		    2235608 bytes
Variable Size		 8053064488 bytes
Database Buffers	 6.2814E+10 bytes
Redo Buffers		  215855104 bytes
Database mounted.
ORA-24324: service handle not initialized
ORA-24323: value not allowed
ORA-03113: end-of-file on communication channel


ORA-27042: not enough space on raw partition to fullfill request
Additional information: 3


Modify db_flash_cache_size to 149G: the card reports 160 GB (decimal), which is only about 149 GiB, so a 150G cache does not fit on the device.


SQL*Plus: Release 11.2.0.3.0 Production on Thu Aug 30 11:04:10 2012

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> create spfile from pfile;

File created.

SQL> startup
ORACLE instance started.

Total System Global Area 7.1085E+10 bytes
Fixed Size		    2235608 bytes
Variable Size		 8053064488 bytes
Database Buffers	 6.2814E+10 bytes
Redo Buffers		  215855104 bytes
Database mounted.
Database opened.
SQL> show parameter flash_cache

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
db_flash_cache_file		     string	 /dev/fioa
db_flash_cache_size		     big integer 149G

SQL> select 149/68 from dual;

    149/68
----------
2.19117647

So the flash cache here is roughly 2.2 times the SGA, which falls inside Oracle's general sizing guidance of 2 to 10 times the buffer cache size.
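As a hedged follow-up check once the instance is running with the flash cache enabled, the flash-cache related statistics can be watched from v$sysstat (statistic names may vary slightly by patch level):

sqlplus -s / as sysdba <<'EOF'
select name, value
  from v$sysstat
 where name like 'flash cache%'
 order by name;
EOF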

Why ASM?

June 13, 2012 oracle, RAC No comments

Why ASM?

. ASM is an Oracle provided Database/Cluster filesystem and Storage Volume manager, optimized for Oracle for any kind of workload (be it OLTP or DSS or Batch, sequential or random access), built-in with the database, with no additional cost.

. It follows SAME (Stripe and Mirror Everything) strategy, eliminating any database level IO hotspots, as the database files are striped equally across all the available disks in the allocated disk groups and are dynamically rebalanced with the addition and removal of disks online.

. ASM provides for better storage utilization and storage consolidation (ie allows multiple databases on the same Server or across multiple Servers to share storage through Oracle clusterware).

. It allows leveraging existing storage RAID technologies (Mirroring, RAID 5 etc) to provide for additional striping and redundancy.

. It can leverage IO multipathing technologies (like EMC Powerpath, Hitachi HDLM etc) to provide for better IO balancing and fault-tolerance.

. Makes use of Oracle ‘Kernel Service Direct File (ksfd)’ interface to perform KAIO (kernel asynchronous IO) against raw devices by-passing OS page cache, providing the best IO performance with no overhead from Operating system or conventional Logical Volume Managers.

Regarding 'ksfd', the following explanation is also helpful:

Oracle performs 1MB I/O when possible

DBWR (the main server process) – Coalesces adjacent buffers to max size of 1MB

Kernel sequential file I/O (ksfq) – buffer size is set to 1MB by default

Archiver writes 1MB I/Os

Log Writer – Max I/O size is set to 1MB

Datapump,Dataguard,Rman and Transfer Packages also use ksfq

Kernel Service Direct File (ksfd) I/O is the central module providing disk I/O service to all of its clients. The main interfaces allow a client to create database files and perform I/O operations on them. Every read/write operation is done in multiples of the block size specified at file creation time.

. Operating System Dependent (OSD) interface is used to access raw and file system files (used with ASM when ASMLIB is not configured) by default

. ASMLIB is a disk access interface defined by Oracle and implemented by storage/system vendors. Oracle currently has an ASMLIB implementation for Linux only. ASMLIB provides I/O enhancements for persistent binding, I/O optimization and ease of management.

. Oracle Disk Manager (ODM) Interface defined by Oracle and implemented by storage/system vendors

ASM is one of the clients of ksfd. This module calls ksfd to issue I/O when the ASM feature is enabled and there is no ASMLIB library present. A database file is identified as an ASM-managed file by its name pattern.

This naturally leads to the question of whether ASMLIB should be used at all:

ASMLIB is a Linux kernel support library designed specifically for the Oracle Automatic Storage Management feature.
Our understanding of ASMLIB has long been incomplete, so let's look at the pros and cons of using it.
In theory, the ASMLIB API gives us the following benefits:
direct, asynchronous I/O is always used
persistent device names are handled, even when device names change after a reboot
file permission and ownership problems are handled
fewer context switches between user mode and kernel mode during I/O, which may reduce CPU usage
fewer file handles are used
the ASMLIB API makes it possible to pass metadata such as I/O priority down to the storage device
Although ASMLIB can in theory deliver a performance benefit, in practice the advantage is essentially negligible: no performance report shows ASMLIB beating native udev device management on Linux. A thread on the Oracle forums discussing the performance benefit of ASMLIB contains the comment "asmlib wouldn't necessarily give you much of an io performance benefit, it's mainly for ease of management as it will find/discover the right devices for you, the io effect of asmlib is large the same as doing async io to raw devices." In other words, there is no real performance difference between using ASMLIB and using raw devices directly.
Possible drawbacks of ASMLIB:
for multipathing devices, ORACLEASM_SCANORDER and ORACLEASM_SCANEXCLUDE must be set in the /etc/sysconfig/oracleasm-_dev_oracleasm configuration file so that ASMLIB finds the correct device files (see the related Metalink note)
because the ASM instance consumes ASM disks provided by ASMLIB, an extra layer is added to the I/O stack
every Linux kernel update requires a matching new ASMLIB package
it increases the chance of downtime caused by human error
using ASMLIB means spending more time on creation and maintenance
ASMLIB itself may introduce additional bugs, which is the last thing we want
A disk created with ASMLIB has a disk header that is no different from an ordinary ASM disk header; it merely carries an extra ASMLIB attribute area at the front.
Conclusion:
My personal view is to avoid ASMLIB where possible, although of course this is not something the DBA alone can decide. It also comes down to habit: a large number of ASM+RAC systems on RHEL 4 ended up using ASMLIB. As readers have pointed out, udev was introduced as a new kernel 2.6 feature and was already included in the initial RHEL 4 releases, but it was not widely used in the RHEL 4 era ("In Linux 2.6, a new feature was introduced to simplify device management and hot plug capabilities. This feature is called udev and is a standard package in RHEL4 or Oracle Enterprise Linux 4 (OEL4) as well as Novell's SLES9 and SLES10."). On RHEL/OEL 5 you already have every reason to use udev and drop ASMLIB; a minimal udev rule is sketched below.
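For reference, a minimal udev-based alternative to ASMLIB on RHEL/OEL 5. This is a sketch only: the WWID, symlink name, owner and group below are placeholders, and multipath setups would match the device-mapper device instead of sd*:

cat > /etc/udev/rules.d/99-oracle-asmdevices.rules <<'EOF'
KERNEL=="sd*", BUS=="scsi", PROGRAM=="/sbin/scsi_id -g -u -s /block/%k", RESULT=="REPLACE_WITH_LUN_WWID", SYMLINK+="asm-disk1", OWNER="grid", GROUP="asmdba", MODE="0660"
EOF
udevcontrol reload_rules    # RHEL5 syntax; later releases use "udevadm control --reload-rules"
# then point asm_diskstring at /dev/asm-disk* once the symlinks appear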

Reference:http://www.oracledatabase12g.com/archives/why-asmlib-and-why-not.html

11gR2 rac ora.asm error (ASM)

April 26, 2012 maintain, oracle, RAC No comments ,

ASM – ora.asm error

The 11.2.0.3 RAC CRS stack fails to start.

[ohasd(25286)]CRS-2112:The OLR service started on node db-41.
2012-04-25 14:46:13.120
[ohasd(25286)]CRS-1301:Oracle High Availability Service started on node db-41.
2012-04-25 14:46:13.120
[ohasd(25286)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2012-04-25 14:46:14.181
[/data/11.2.0/grid/bin/orarootagent.bin(25332)]CRS-5016:Process “/data/11.2.0/grid/bin/acfsload” spawned by agent “/data/11.2.0/grid/bin/orarootagent.bin” for action “check” failed: details at “(:CLSN00010:)” in “/data/11.2.0/grid/log/db-41/agent/ohasd/orarootagent_root/orarootagent_root.log”
2012-04-25 14:46:17.031
[gpnpd(25510)]CRS-2328:GPNPD started on node db-41.
2012-04-25 14:46:20.794
[cssd(25593)]CRS-1713:CSSD daemon is started in clustered mode
2012-04-25 14:46:22.520
[ohasd(25286)]CRS-2767:Resource state recovery not attempted for ‘ora.diskmon’ as its target state is OFFLINE
2012-04-25 14:46:40.715
[cssd(25593)]CRS-1707:Lease acquisition for node db-41 number 1 completed
2012-04-25 14:46:42.003
[cssd(25593)]CRS-1605:CSSD voting file is online: /dev/mapper/Vote03; details in /data/11.2.0/grid/log/db-41/cssd/ocssd.log.
2012-04-25 14:46:42.008
[cssd(25593)]CRS-1605:CSSD voting file is online: /dev/mapper/Vote02; details in /data/11.2.0/grid/log/db-41/cssd/ocssd.log.
2012-04-25 14:46:42.015
[cssd(25593)]CRS-1605:CSSD voting file is online: /dev/mapper/Vote01; details in /data/11.2.0/grid/log/db-41/cssd/ocssd.log.
2012-04-25 14:46:52.173
[cssd(25593)]CRS-1601:CSSD Reconfiguration complete. Active nodes are db-41 .
2012-04-25 14:46:54.177
[ctssd(25738)]CRS-2407:The new Cluster Time Synchronization Service reference node is host db-41.
2012-04-25 14:46:54.177
[ctssd(25738)]CRS-2401:The Cluster Time Synchronization Service started on host db-41.
[client(25843)]CRS-10001:25-Apr-12 14:46 ACFS-9391: Checking for existing ADVM/ACFS installation.
[client(25848)]CRS-10001:25-Apr-12 14:46 ACFS-9392: Validating ADVM/ACFS installation files for operating system.
[client(25850)]CRS-10001:25-Apr-12 14:46 ACFS-9393: Verifying ASM Administrator setup.
[client(25853)]CRS-10001:25-Apr-12 14:46 ACFS-9308: Loading installed ADVM/ACFS drivers.
[client(25856)]CRS-10001:25-Apr-12 14:46 ACFS-9154: Loading ‘oracleoks.ko’ driver.
[client(25913)]CRS-10001:25-Apr-12 14:46 ACFS-9154: Loading ‘oracleadvm.ko’ driver.
[client(25974)]CRS-10001:25-Apr-12 14:46 ACFS-9154: Loading ‘oracleacfs.ko’ driver.
[client(26099)]CRS-10001:25-Apr-12 14:46 ACFS-9327: Verifying ADVM/ACFS devices.
[client(26103)]CRS-10001:25-Apr-12 14:46 ACFS-9156: Detecting control device ‘/dev/asm/.asm_ctl_spec’.
[client(26107)]CRS-10001:25-Apr-12 14:46 ACFS-9156: Detecting control device ‘/dev/ofsctl’.
[client(26113)]CRS-10001:25-Apr-12 14:46 ACFS-9322: completed
2012-04-25 14:47:06.975
[ohasd(25286)]CRS-2807:Resource ‘ora.asm’ failed to start automatically.

[ctssd(25738)]CRS-2405:The Cluster Time Synchronization Service on host db-41 is shutdown by user
2012-04-25 14:51:25.852
[/data/11.2.0/grid/bin/orarootagent.bin(25523)]CRS-5016:Process “/data/11.2.0/grid/bin/acfsload” spawned by agent “/data/11.2.0/grid/bin/orarootagent.bin” for action “check” failed: details at “(:CLSN00010:)” in “/data/11.2.0/grid/log/db-41/agent/ohasd/orarootagent_root/orarootagent_root.log”
2012-04-25 14:51:35.716
[cssd(25593)]CRS-1603:CSSD on node db-41 shutdown by user.
2012-04-25 14:51:35.725
[ohasd(25286)]CRS-2767:Resource state recovery not attempted for ‘ora.cssdmonitor’ as its target state is OFFLINE
2012-04-25 14:51:36.224
[cssd(25593)]CRS-1660:The CSS daemon shutdown has completed
2012-04-25 14:51:38.144
[gpnpd(25510)]CRS-2329:GPNPD on node db-41 shutdown.

Rebuild the spfile:


+ASM.asm_diskgroups='DATA','ARCH'
*.asm_diskstring='/dev/mapper/*'
*.asm_power_limit=1
*.diagnostic_dest='/data/app/oracle'
*.instance_type='asm'
*.large_pool_size=12M
#*.local_listener='LISTENER_+ASM'
*.remote_login_passwordfile='EXCLUSIVE'

e.g. bring the ASM instance up with this pfile, mount the DATA disk group, and then:

SQL> create spfile='+DATA' from pfile='xxx';

[grid@db-41 ~]$ sqlplus '/as sysasm'

SQL*Plus: Release 11.2.0.3.0 Production on Thu Apr 26 15:21:33 2012

Copyright (c) 1982, 2011, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 – 64bit Production
With the Real Application Clusters and Automatic Storage Management options

SQL> show parameter spfile;

NAME TYPE
———————————— ———————————
VALUE
——————————
spfile string
+DATA/yhd-cluster/asmparameter
file/registry.253.781624261
SQL>
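A hedged recap of the manual steps used above (the pfile path is an example):

sqlplus / as sysasm <<'EOF'
startup nomount pfile='/home/grid/init+ASM1.ora';
alter diskgroup DATA mount;
create spfile='+DATA' from pfile='/home/grid/init+ASM1.ora';
EOF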

Restart CRS; it now comes up fine:

[client(2162)]CRS-10001:26-Apr-12 14:47 ACFS-9392: Validating ADVM/ACFS installation files for operating system.
[client(2164)]CRS-10001:26-Apr-12 14:47 ACFS-9393: Verifying ASM Administrator setup.
[client(2167)]CRS-10001:26-Apr-12 14:47 ACFS-9308: Loading installed ADVM/ACFS drivers.
[client(2170)]CRS-10001:26-Apr-12 14:47 ACFS-9154: Loading ‘oracleoks.ko’ driver.
[client(2227)]CRS-10001:26-Apr-12 14:47 ACFS-9154: Loading ‘oracleadvm.ko’ driver.
[client(2286)]CRS-10001:26-Apr-12 14:47 ACFS-9154: Loading ‘oracleacfs.ko’ driver.
[client(2415)]CRS-10001:26-Apr-12 14:47 ACFS-9327: Verifying ADVM/ACFS devices.
[client(2418)]CRS-10001:26-Apr-12 14:47 ACFS-9156: Detecting control device ‘/dev/asm/.asm_ctl_spec’.
[client(2422)]CRS-10001:26-Apr-12 14:47 ACFS-9156: Detecting control device ‘/dev/ofsctl’.
[client(2428)]CRS-10001:26-Apr-12 14:47 ACFS-9322: completed
2012-04-26 14:47:44.001
[crsd(2755)]CRS-1012:The OCR service started on node db-41.
2012-04-26 14:47:44.472
[evmd(2146)]CRS-1401:EVMD started on node db-41.
2012-04-26 14:47:46.297
[crsd(2755)]CRS-1201:CRSD started on node db-41.
2012-04-26 14:47:47.586
[/data/11.2.0/grid/bin/oraagent.bin(2880)]CRS-5016:Process “/data/11.2.0/grid/bin/lsnrctl” spawned by agent “/data/11.2.0/grid/bin/oraagent.bin” for action “check” failed: details at “(:CLSN00010:)” in “/data/11.2.0/grid/log/db-41/agent/crsd/oraagent_grid/oraagent_grid.log”
2012-04-26 14:47:47.592
[/data/11.2.0/grid/bin/oraagent.bin(2880)]CRS-5016:Process “/data/11.2.0/grid/bin/lsnrctl” spawned by agent “/data/11.2.0/grid/bin/oraagent.bin” for action “check” failed: details at “(:CLSN00010:)” in “/data/11.2.0/grid/log/db-41/agent/crsd/oraagent_grid/oraagent_grid.log”
2012-04-26 14:47:47.685
[/data/11.2.0/grid/bin/oraagent.bin(2880)]CRS-5016:Process “/data/11.2.0/grid/opmn/bin/onsctli” spawned by agent “/data/11.2.0/grid/bin/oraagent.bin” for action “check” failed: details at “(:CLSN00010:)” in “/data/11.2.0/grid/log/db-41/agent/crsd/oraagent_grid/oraagent_grid.log”
2012-04-26 14:47:47.832
[crsd(2755)]CRS-2772:Server ‘db-41’ has been assigned to pool ‘Generic’.
2012-04-26 14:47:47.833
[crsd(2755)]CRS-2772:Server ‘db-41’ has been assigned to pool ‘ora.user’.
2012-04-26 14:51:13.534

This raises the obvious questions: how can CRS read the spfile before the disk group is mounted, and how does CRS know where to look for the spfile? The excellent article below walks through the whole read process in detail:

Oracle 11gR2 has many new features in the Grid Computing area. One of them is the Oracle Local Repository (OLR): a repository designed to store information and profiles for local resources, i.e. resources dedicated to a particular node. It improves the performance of accessing local resource profile information, as well as redundancy and manageability. In a Grid Infrastructure RAC configuration there is usually one global shared OCR plus an OLR on each node; in an Oracle Restart environment there is only the OLR. 11gR2 also introduces another new feature, Grid Plug and Play (GPNP); as the name implies, it helps automate and simplify some aspects of grid administration. GPNP maintains a profile, an XML file that stores the configuration information of components maintained by GPNP; for example, VIP and interconnect information is stored there.

But in this post I want to discuss another new feature of 11gR2: the ability to store the server parameter file (spfile) on ASM disks, even for the ASM instance itself. This raises some reasonable questions:

1. How can a process read the spfile from ASM disks if they are not mounted yet?
2. How does the process know where to get the spfile if there is no init.ora file in $ORACLE_HOME/dbs with a spfile=[path to spfile] parameter?
3. What about the asm_diskstring parameter?

The third point is important for the ASM instance because, before reading the spfile from the ASM disks, we need to identify them, and to identify them we need the spfile to get the asm_diskstring parameter!

To make all of these points clear I will use system-call tracing tools to trace the Oracle processes. We can use tools like DTrace, strace, truss and so on, depending on the platform; I will use strace because I am on Linux.

My environment is Oracle Grid Infrastructure 11gR2 (11.2.0.1.0) in Oracle Restart mode. I simulated ASM disks using loop devices (/dev/loop*); I have 5 disks.

bash> /etc/init.d/oracleasm listdisks
DISK1
DISK2
DISK3
DISK4
DISK5

SQL> conn / as sysasm
Connected.
SQL> col path format a15

SQL> select a.name,a.state,b.name,b.path from v$asm_diskgroup a, v$asm_disk b where a.group_number=b.group_number order by b.name;

NAME STATE NAME PATH
—————————— ———– —————————— —————
DGROUP1 MOUNTED DISK1 ORCL:DISK1
DGROUP1 MOUNTED DISK2 ORCL:DISK2
DGROUP2 MOUNTED DISK3 ORCL:DISK3
DGROUP2 MOUNTED DISK4 ORCL:DISK4
DGROUP2 MOUNTED DISK5 ORCL:DISK5

As seen from the listing, I have two disk groups DGROUP1 and DGROUP2. My spfile is located on DGROUP1

SQL> show parameter spfile;

NAME TYPE VALUE
———————————— ———– —————————————————-
spfile string +DGROUP1/asm/asmparameterfile/registry.253.740659373

Lets make some tests

SQL> conn / as sysasm
Connected.

SQL> startup
ASM instance started

Total System Global Area 284565504 bytes
Fixed Size 1336036 bytes
Variable Size 258063644 bytes
ASM Cache 25165824 bytes
ASM diskgroups mounted
SQL>
SQL> show parameter power

NAME TYPE VALUE
———————————— ———– ——————————
asm_power_limit integer 1
SQL> alter system set asm_power_limit=3;

System altered.

SQL> show parameter power

NAME TYPE VALUE
———————————— ———– ——————————
asm_power_limit integer 3

SQL>–We set asm_power_limit to 3 in spfile

SQL> shutdown immediate;
ASM diskgroups dismounted
ASM instance shutdown

SQL> !crs_stat -t
Name Type Target State Host
————————————————————
ora.DGROUP1.dg ora….up.type OFFLINE OFFLINE
ora.DGROUP2.dg ora….up.type OFFLINE OFFLINE
ora….ER.lsnr ora….er.type ONLINE ONLINE testdb03
ora.asm ora.asm.type OFFLINE OFFLINE
ora.cssd ora.cssd.type ONLINE ONLINE testdb03
ora.diskmon ora….on.type ONLINE ONLINE testdb03

SQL>– Lets start instance in nomount mode, it will not mount the diskgroups

SQL> startup nomount;
ASM instance started

Total System Global Area 284565504 bytes
Fixed Size 1336036 bytes
Variable Size 258063644 bytes
ASM Cache 25165824 bytes

SQL> show parameter spfile;

NAME TYPE VALUE
———————————— ———– —————————————————-
spfile string +DGROUP1/asm/asmparameterfile/registry.253.740659373

SQL> show parameter power

NAME TYPE VALUE
———————————— ———– ——————————
asm_power_limit integer 3

SQL> select * from v$spparameter;
select * from v$spparameter
*
ERROR at line 1:
ORA-15001: diskgroup “DGROUP1” does not exist or is not mounted

SQL> alter system set asm_power_limit=10;
alter system set asm_power_limit=10
*
ERROR at line 1:
ORA-32000: write to SPFILE requested but SPFILE is not modifiable

SQL> !asmcmd
ASMCMD> ls
ASMCMD> exit

SQL> !crs_stat -t
Name Type Target State Host
————————————————————
ora.DGROUP1.dg ora….up.type OFFLINE OFFLINE
ora.DGROUP2.dg ora….up.type OFFLINE OFFLINE
ora….ER.lsnr ora….er.type ONLINE ONLINE testdb03
ora.asm ora.asm.type ONLINE ONLINE testdb03
ora.cssd ora.cssd.type ONLINE ONLINE testdb03
ora.diskmon ora….on.type ONLINE ONLINE testdb03

So what do we see? We were able to start the ASM instance without mounting the disk groups, and our process picked up the correct spfile. The SHOW PARAMETER command reads parameter values from memory, but if we try to read directly from the spfile (select from v$spparameter) or write to it (alter system set) we get errors, because the disk groups are not mounted yet. This means our process read the spfile directly from the ASM disks. The decision about which disk to read from is made based on information from the disk header. I use ASMLib, and my asm_diskstring parameter is at its default value of null; by default the ASM instance scans /dev/oracleasm/disks/* on Linux, which is why the process found my disks properly. But what if we are on HP-UX and our disks are multipath disks in /dev/rdisk/*? ASM will not scan them by default; we need the asm_diskstring parameter, so how will our process read asm_diskstring before accessing the spfile?

Let's start sqlplus. After we connect as sysasm it will spawn a server process, and that process will do all the work, so that is the process I will trace.

bash> sqlplus /nolog
SQL> conn / as sysasm
Connected.
SQL> !ps -aefH
.
.
oracle 15505 4255 0 13:26 pts/1 00:00:00 sqlplus
oracle 15507 15505 0 13:26 ? 00:00:00 oracle+ASM (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
.
.
SQL>

Our server process number is 15507; now I will use strace to trace it.


bash> strace -o userproc.out -p 15507

Start the instance with the nomount option. OCSSD must be running, or the instance will not start.

SQL> startup nomount;

Total System Global Area 284565504 bytes
Fixed Size 1336036 bytes
Variable Size 258063644 bytes
ASM Cache 25165824 bytes

SQL>

Now we can analyze the contents of userproc.out; I will not list the whole file, just the useful parts of it.

The first row is

read(9, “001\6\3\212\6\376\377\377\377\1\376\377\377\377\376\377\377\377″…, 8208) = 49

it is our "startup nomount" command, transmitted from sqlplus to the server process over a socket

connect(6, {sa_family=AF_FILE, path=”/var/tmp/.oracle/sOCSSD_LL_testhost_”…}, 110) = 0
open(“/u01/oracle/product/grid11g/auth/css/testdb03/A6857753/84b0b692”, O_WRONLY|O_CREAT|O_EXCL, 0644) = 13
write(13, “\nS\336[“, 4) = 4
close(13)

From these lines we can see that our process connects to the OCSSD socket file /var/tmp/.oracle/sOCSSD_LL_testhost_ to communicate with it and authenticate itself. That is why it is impossible to start the instance without OCSSD: if the process could not connect to this socket, it would fail. It then establishes communication with OHASD (Oracle High Availability Services Daemon) through the socket file /var/tmp/.oracle/sCRSD_UI_SOCKET; after exchanging messages with OHASD it gets the location of the spfile and the asm_diskstring parameter.

access(“/var/tmp/.oracle/sCRSD_UI_SOCKET”, F_OK) = 0
connect(14, {sa_family=AF_FILE, path=”/var/tmp/.oracle/sCRSD_UI_SOCKET”…}, 110) = 0
open(“/u01/oracle/product/grid11g/auth/ohasd/testdb03/A4972914/5140f6df”, O_WRONLY|O_CREAT|O_EXCL, 0644) = 15
write(15, “\20\7\3519”, 4) = 4
close(15)
write(14, “4\2\2\1\1\1\1\3\1T\235\376\t”…, 52) = 52
write(14, “8\3\2\1\1\1\1\4\1T\235\376\t”…, 56) = 56
read(14, “8\3\2\1\1\1\1\3\1T\235\376\t”…, 32768) = 56
write(14, “\212\1PC\v\2\2\5\1T\235\376\t”…, 394) = 394
read(14, “\366\nPC\v\2\2\4\1T\235\376\t”…, 32768) = 2806
write(14, “0\2\20\1\1T\235\376\t”…, 48) = 48

write(3, “kggpnpSIHAGetItem 1 = +dgroup1/ASM/asmparameterfile/registry.253.740659373″…, 75) = 7
write(3, “kfspFileNameGet name=+dgroup1/ASSM/asmparameterfile/registry.253.740659373″…, 78) = 78

write(3, “kggpnpSIHAGetItem 2 = “, 23) = 23
write(3, “kgfspb_shallow_discover dscstr=\”\””…, 33) = 33

As we can see, our process got the spfile location (+dgroup1/ASM/asmparameterfile/registry.253.740659373) and asm_diskstring="" (null, the default) from OHASD. Now let's see where OHASD itself gets this information. To find that out I traced the OHASD process, and found these lines:

open(“/etc/oracle/olr.loc”, O_RDONLY) = 4
read(4, “olrconfig_loc=/u01/oracle/produc”…, 4096) = 108
read(4, “”, 4096) = 0
close(4) = 0
open(“/u01/oracle/product/grid11g/cdata/localhost/testhost.olr”, O_RDONLY|O_SYNC|O_LARGEFILE) = 4
pread64(4, “\1\202VD\31\4\3OCR\226\361nA”…, 4096, 102400) = 4096
pread64(4, “\1\202\300I#\4\3OCR\226\361nA”…, 4096, 143360) = 4096
pread64(4, “\1\202VD\31\4\3OCR\226\361nA”…, 4096, 102400) = 4096
pread64(4, “\1\202hq\33\4\3OCR\226\361nA”…, 4096, 110592) = 4096
pread64(4, “\1\202\271\311#\4\20\3OCR\226\361nA”…, 4096, 4337664) = 4096
pread64(4, “\1\202\276\262$\4\20\3OCR\226\361nA”…, 4096, 4341760) = 4096
pread64(4, “\1\202VD\31\4\3OCR\226\361nA”…, 4096, 102400) = 4096
pread64(4, “\1\202hq\33\4\3OCR\226\361nA”…, 4096, 110592) = 4096
pread64(4, “\1\202\271\311#\4\20\3OCR\226\361nA”…, 4096, 4337664) = 4096
pread64(4, “\1\202\276\262$\4\20\3OCR\226\361nA”…, 4096, 4341760) = 4096
pread64(4, “\1\202\236\363%\4\2\3OCR\226\361nA”…, 4096, 4345856) = 4096
pread64(4, “\1\202\334\n&\4\2\3OCR\226\361nA”…, 4096, 4349952) = 4096
pread64(4, “\1\202\325\357-\4\2\3OCR\226\361nA”…, 4096, 4378624) = 4096
pread64(4, “\1\202VD\31\4\3OCR\226\361nA”…, 4096, 102400) = 4096
pread64(4, “\1\202hq\33\4\3OCR\226\361nA”…, 4096, 110592) = 4096
pread64(4, “\1\202\271\311#\4\20\3OCR\226\361nA”…, 4096, 4337664) = 4096
pread64(4, “\1\202\276\262$\4\20\3OCR\226\361nA”…, 4096, 4341760) = 4096
pread64(4, “\1\202\325\357-\4\2\3OCR\226\361nA”…, 4096, 4378624) = 4096

As one would expect, OHASD reads this information from the OLR; the path to the OLR comes from /etc/oracle/olr.loc. Note that this is true for Oracle Restart mode; in an Oracle RAC environment the information is stored in the GPNP profile, and the GPNPD process maintains and manages that profile. In Oracle Restart, as we can see, the OHASD process plays this role, and because there is no GPNP profile it stores the information in the OLR file. So, what next? Our process starts scanning all disks to identify ASM headers; after identifying them, it determines from the header information which disks belong to which disk groups. There is a lot of other metadata in the ASM disk header that it reads, including pointers to the spfile and the voting disk file: kfdhdb.spfile and kfdhdb.spfflg (first block and number of blocks), and kfdhdb.vfstart and kfdhdb.vfend (begin block and end block). It is possible to read the disk header using the kfed utility from the $ORACLE_HOME/bin directory. For example, let's read the disk header of /dev/loop1, which corresponds to ORCL:DISK1; the spfile is on this disk.

shell> kfed read /dev/loop1 | more
kfbh.type: 1 ; 0x002: KFBTYP_DISKHEAD –Indicates that this is ASM disk header
.
kfdhdb.grptyp: 2 ; 0x026: KFDGTP_NORMAL –Indicates mirroring level, in my case it is NORMAL
kfdhdb.hdrsts: 3 ; 0x027: KFDHDR_MEMBER –Indicates that disk is the member disk of diskgroup
.
kfdhdb.dskname: DISK1 ; 0x028: length=5 –Disk name
kfdhdb.grpname: DGROUP1 ; 0x048: length=7 –Disk group name, to which this disk belongs
kfdhdb.fgname: DISK1 ; 0x068: length=5 –To which failure group this disk belongs
.
kfdhdb.secsize: 512 ; 0x0b8: 0x0200 –Disk sector size
kfdhdb.blksize: 4096 ; 0x0ba: 0x1000 –Disk block size
kfdhdb.ausize: 1048576 ; 0x0bc: 0x00100000 –Allocation Unit size, by default 1M
.
kfdhdb.vfstart: 0 ; 0x0ec: 0x00000000 –Begin block address of voting disk file
kfdhdb.vfend: 0 ; 0x0f0: 0x00000000 –End block address of voting disk file
kfdhdb.spfile: 59 ; 0x0f4: 0x0000003b –Begin block address of spfile
kfdhdb.spfflg: 1 ; 0x0f8: 0x00000001 –Number of blocks containing spfile
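As a hedged aside, with the header fields above (1MB allocation units, kfdhdb.spfile=59, kfdhdb.spfflg=1) the registry contents can be dumped straight off the disk for inspection. This assumes the spfile offset can be addressed in allocation-unit-sized chunks, which is what is commonly observed:

dd if=/dev/loop1 of=/tmp/asm_spfile.raw bs=1M skip=59 count=1
strings /tmp/asm_spfile.raw | head     # the parameter text should be visible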

Now, let's see what the next steps of our process are.

open(“/opt/oracle/extapi/32/asm/orcl/1/libasm.so”, O_RDONLY) = 17
read(17, “\177ELF\1\1\1\3\3\1000\v004″…, 512) = 512
close(17) = 0
open(“/dev/oracleasm/.query_version”, O_RDWR|O_LARGEFILE) = 17
write(17, “MSA\2\1\20”, 16) = 16
read(17, “MSA\2\1\20”, 16) = 16
close(17) = 0
open(“/dev/oracleasm/.get_iid”, O_RDWR|O_LARGEFILE) = 17
write(17, “MSA\2\2\30”, 24) = 24
read(17, “MSA\2\2\30\3”, 24) = 24
close(17) = 0
open(“/dev/oracleasm/.check_iid”, O_RDWR|O_LARGEFILE) = 17
write(17, “MSA\2\3\30\3”, 24) = 24
read(17, “MSA\2\3\30\3”, 24) = 24
close(17) = 0
open(“/dev/oracleasm/iid/0000000000000003”, O_RDWR|O_CREAT|O_LARGEFILE, 0770) = 17

open(“/dev/oracleasm/disks/DISK1”, O_RDONLY|O_LARGEFILE) = 18
read(17, “MSA\2\5 \22T/v\17\200-%\310”, 32) = 32
close(18) = 0
read(17, “MSA\2\7P@\364\311\20\320\301\225\277″…, 80) = 80
open(“/dev/oracleasm/disks/DISK2”, O_RDONLY|O_LARGEFILE) = 18
read(17, “MSA\2\5 \22T/v\17\200+%\310”, 32) = 32
close(18) = 0
read(17, “MSA\2\7P@\364\311\20\320\301\225\277″…, 80) = 80
open(“/dev/oracleasm/disks/DISK3”, O_RDONLY|O_LARGEFILE) = 18
read(17, “MSA\2\5 \22T/v\17\200!%\310”, 32) = 32
close(18) = 0
read(17, “MSA\2\7P@\364\311\20\320\301\225\277″…, 80) = 80
open(“/dev/oracleasm/disks/DISK4”, O_RDONLY|O_LARGEFILE) = 18
read(17, “MSA\2\5 \22T/v\17\200#%\310”, 32) = 32
close(18) = 0
read(17, “MSA\2\7P@\364\311\20\320\301\225\277″…, 80) = 80
open(“/dev/oracleasm/disks/DISK5”, O_RDONLY|O_LARGEFILE) = 18
read(17, “MSA\2\5 \22T/v\17\200)%\310”, 32) = 32
close(18) = 0

read(17, “MSA\2\7P@\364\311\20\320\301\225\277″…, 80) = 80
read(17, “MSA\2\7P@\364\311\20″…, 80) = 80
read(17, “MSA\2\7P@\364\311\20″…, 80) = 80
read(17, “MSA\2\7P@\364\311\20″…, 80) = 80
read(17, “MSA\2\7P@\364\311\20″…, 80) = 80
read(17, “MSA\2\7P@\364\311\20″…, 80) = 80
.
.

open(“/u01/oracle/diag/asm/+asm/+ASM/trace/alert_+ASM.log”, O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0660) = 16
write(16, “Tue Jan 18 17:45:13 2011\n”, 25) = 25
write(16, “Starting ORACLE instance (normal”…, 33) = 33
write(16, “\n”, 1) = 1
close(16)
.
.

It opens the /opt/oracle/extapi/32/asm/orcl/1/libasm.so library and reads it. Then it opens the special files /dev/oracleasm/.query_version, /dev/oracleasm/.get_iid and /dev/oracleasm/.check_iid; these files are interfaces to the ASM device manager. The first is used to get the manager's version, the second to get the identifier of the ASM disk device manager instance (not the ASM instance), and the third to verify that instance identifier. In our case the ASM device manager's instance identifier is 0000000000000003. Then our process opens /dev/oracleasm/iid/0000000000000003; in other words it establishes a connection to the ASM device manager instance. The process will use this interface to read from and write to the ASM disks. It then checks all disks from DISK1 to DISK5. After it has gathered all the information about the disks (groups, mirroring level, AU size, failure group and so on), it starts reading the ASM disks through the established interface; I think this is the step where it reads the spfile. Once it has the initialization parameters, it starts allocating memory structures and starting the necessary background processes.

Now let's see what is stored in the OLR regarding the ora.asm resource.

shell> cd $ORACLE_HOME/cdata/localhost
shell> strings testhost.olr | grep dgroup1/ASM/asmparameterfile/registry | sed 's/~/\n/g'
DEFAULT_TEMPLATE=PROPERTY(RESOURCE_CLASS=asm) ELEMENT(INSTANCE_NAME= %GEN_USR_ORA_INST_NAME%)
DEGREE=1
DESCRIPTION=Oracle ASM resource
ENABLED=1
GEN_USR_ORA_INST_NAME=+ASM
LOAD=1
LOGGING_LEVEL=1
NAME=ora.asm
NLS_LANG=
NOT_RESTARTING_TEMPLATE=
OFFLINE_CHECK_INTERVAL=0
PROFILE_CHANGE_TEMPLATE=
RESTART_ATTEMPTS=5
SCRIPT_TIMEOUT=60
SPFILE=+dgroup1/ASM/asmparameterfile/registry.253.740659373
START_DEPENDENCIES=hard(ora.cssd) weak(ora.LISTENER.lsnr)
START_TIMEOUT=900
STATE_CHANGE_TEMPLATE=
STOP_DEPENDENCIES=hard(ora.cssd)
STOP_TIMEOUT=600
TYPE=ora.asm.type
TYPE_ACL=owner:oracle:rwx,pgrp:dba:rwx,other::r--
UPTIME_THRESHOLD=1d
USR_ORA_ENV=
USR_ORA_INST_NAME=+ASM
USR_ORA_OPEN_MODE=mount
USR_ORA_OPI=false
USR_ORA_STOP_MODE=immediate
VERSION=11.2.0.1.0

We can find the spfile information there ("SPFILE=+dgroup1/ASM/asmparameterfile/registry.253.740659373"); we can also see that ora.cssd is a hard dependency, because OCSSD is responsible for synchronization between the ASM and database instances. But what about the asm_diskstring parameter, where is it?

shell> strings testhost.olr | grep ASM_DISKSTRING | sed 's/~/\n/g'

Nothing. That is because my asm_diskstring parameter is null, the default value. Let's change it:

SQL> alter system set asm_diskstring='ORCL:DISK1,ORCL:DISK2' scope=spfile;

System altered.

Check the OLR file again:

strings testdb03.olr | grep ASM_DISKSTRING | sed 's/~/\n/g'
bASM_DISKSTRING
ACL=owner:oracle:rwx,pgrp:dba:rwx,other::r--
ACTION_FAILURE_TEMPLATE=
ACTION_SCRIPT=
AGENT_FILENAME=%CRS_HOME%/bin/oraagent%CRS_EXE_SUFFIX%
ALIAS_NAME=
ASM_DISKSTRING=ORCL:DISK1,ORCL:DISK2
AUTO_START=restore
BASE_TYPE=ora.local_resource.type

Now we can see that the asm_diskstring information is also stored in the OLR profile. The next time the ASM instance starts, it will scan only the specified disk strings; if we specify disks on which the spfile is not present, the instance will not start.
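A hedged alternative to string-dumping the OLR is to ask the local stack for the resource profile directly (the GRID_HOME path below matches this test environment and is only an example):

/u01/oracle/product/grid11g/bin/crsctl stat res ora.asm -init -p | egrep 'SPFILE|ASM_DISKSTRING|START_DEPENDENCIES'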


SQL> alter system set asm_diskstring='ORCL:DISK3,ORCL:DISK4,ORCL:DISK5' scope=spfile;

System altered.

SQL> startup nomount force;
ORA-01078: failure in processing system parameters
ORA-01565: error in identifying file '+DGROUP1/asm/asmparameterfile/registry.253.740659373'
ORA-17503: ksfdopn:2 Failed to open file +DGROUP1/asm/asmparameterfile/registry.253.740659373
SQL>

You can also use the ocrdump utility to extract these entries on any platform in XML format:

shell> ocrdump -local -keyname SYSTEM.OHASD.RESOURCES.ora\!asm.CONFIG -xml -noheader
shell> more OCRDUMPFILE


(The XML markup of the dump was lost in formatting; the output shows the key SYSTEM.OHASD.RESOURCES.ora!asm.CONFIG with value type ORATEXT, the resource profile text as its value, and security attributes PROCR_ALL_ACCESS (user), PROCR_NONE (group), PROCR_NONE (other) for user oracle and group dba.)

Similarly, in an 11.2.0.3 OLR file we can find the SPFILE location:


SPFile="+DATA/yhd-cluster/asmparameterfile/registry.253.781624261"/>

One thing to note: on 11.2.0.3, after changing the diskstring, the OLR file did not show a corresponding updated entry.

In addition, a note about ASM_DISKGROUPS:

SQL> create pfile='/tmp/tmp.ctl' from spfile='+DATA/yhd-cluster/asmparameterfile/registry.253.781624261';

File created.

[grid@db-41 ~]$ cat /tmp/tmp.ctl
+ASM1.__oracle_base='/data/11.2.0'#ORACLE_BASE set from in memory value
+ASM.asm_diskgroups='DATA','ARCH'
+ASM1.asm_diskgroups='ARCH'#Manual Mount
*.asm_diskgroups='ARCH','DATA'
*.asm_diskstring='/dev/mapper/*'
*.asm_power_limit=1
*.diagnostic_dest='/data/app/oracle'
*.instance_type='asm'
*.large_pool_size=12M
*.remote_login_passwordfile='EXCLUSIVE'

SQL> alter system set asm_diskgroups='ARCH','DATA' scope=both sid='+ASM1';

System altered.

SQL> create pfile='/tmp/tmp.ctl' from spfile='+DATA/yhd-cluster/asmparameterfile/registry.253.781624261';

File created.

[grid@db-41 ~]$ cat /tmp/tmp.ctl
+ASM1.__oracle_base='/data/11.2.0'#ORACLE_BASE set from in memory value
+ASM.asm_diskgroups='DATA','ARCH'
*.asm_diskgroups='ARCH','DATA'
+ASM1.asm_diskgroups='ARCH','DATA'
*.asm_diskstring='/dev/mapper/*'
*.asm_power_limit=1
*.diagnostic_dest='/data/app/oracle'
*.instance_type='asm'
*.large_pool_size=12M
*.remote_login_passwordfile='EXCLUSIVE'

This behavior is also discussed in this article: http://comtmr.blogspot.com/2011/01/11gr2-mysteriously-changing.html

Reference:https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1059856.1

ASM Preferred mirror Read

April 25, 2012 oracle, RAC 2 comments

ASM Preferred mirror Read


Introduction

The concept of the preferred read is not a new idea, but it is now implemented in Oracle's ASM volume management in Oracle 11g. The idea is to read from the storage that can serve the needed data at the lower latency. Initially this was designed for WAN or site-specific storage in order to avoid higher-latency inter-site connections: by restricting data reads to the local storage, the application can service requests at nominal read speeds while only writes need to traverse the long-haul site link. This feature is available in most operating systems' bundled volume managers, and in Symantec/Veritas under the name Preferred Plex. This paper discusses the merits of using PRG technology with Oracle ASM.

eg:

SELECT name, failgroup FROM v$asm_disk;

NAME                           FAILGROUP
------------------------------ ------------------------------
DATA_0000                      DATA_0000
DATA_0001                      DATA_0001
DATA_0002                      DATA_0002

3 rows selected.

SQL> ALTER SYSTEM SET ASM_PREFERRED_READ_FAILURE_GROUPS = 'data.data_0000', 'data.data_0001', 'data.data_0002';
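
Note that in an extended cluster each ASM instance normally lists only its own local failure group, set per instance with the sid clause; a sketch with made-up failure group names SITEA/SITEB:

SQL> alter system set asm_preferred_read_failure_groups = 'DATA.SITEA' sid='+ASM1';
SQL> alter system set asm_preferred_read_failure_groups = 'DATA.SITEB' sid='+ASM2';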


Allocating ASM Extent Sets
With ASM redundancy, the first file extent allocated is chosen as the primary extent, and the mirrored extent is called the secondary extent. In the case of high redundancy, there will be two secondary extents. This logical grouping of primary and secondary extents is called an extent set. When a block is read from disk, it is always read from the primary extent, unless the primary extent cannot be read. In Oracle Database 11g, the preferred read feature allows the database to read the secondary extent first instead of the primary extent. This is especially important for RAC Extended Cluster implementations.


As stated earlier, ASM in Oracle Database 10g always reads the primary copy of a mirrored extent set. Thus, a read for a specific block may require a read of the primary extent at the remote site across the interconnect. Accessing a remote disk through a metropolitan or wide area storage network is substantially slower than accessing a local disk. This can tax the interconnect as well as result in high I/O and network latency. To address this, Oracle Database 11g provides a feature called preferred reads. This feature enables ASM administrators to specify a failure group for local reads, that is, to provide preferred reads. In a normal- or high-redundancy disk group, when a secondary extent is on a preferred disk and the primary extent is remote, the secondary extent is read on that node rather than the primary one. This feature is especially beneficial for extended cluster configurations.
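
To check which disks an instance will actually treat as preferred, v$asm_disk exposes a PREFERRED_READ flag in 11g; a quick sanity check could look like this:

SQL> select name, failgroup, preferred_read, reads from v$asm_disk where group_number <> 0;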

Based on the statement above that "This is especially important for RAC Extended Cluster implementations", hellodba has also proposed a similar idea; see PRG.

In addition, an IOUG paper presents a good scheme that combines this feature with SSDs to improve I/O efficiency.

11gR2 RAC Rebootless Node Fencing

March 6, 2012 Internals, oracle, RAC No comments

Rebootless Node Fencing

In versions before 11.2.0.2 Oracle Clusterware tried to prevent a split-brain with a fast reboot (better: reset) of the server(s) without waiting for ongoing I/O operations or synchronization of the file systems. This mechanism has been changed in version 11.2.0.2 (first 11g Release 2 patch set). After deciding which node to evict, the clusterware:


. attempts to shut down all Oracle resources/processes on the server (especially processes generating I/O)

. stops the clusterware stack itself on the node

. afterwards the Oracle High Availability Services daemon (OHASD) will try to start the Cluster Ready Services (CRS) stack again; once the cluster interconnect is back online, all relevant cluster resources on that node will start automatically

. kills the node only if stopping the resources or the I/O-generating processes is not possible (hanging in kernel mode, stuck in the I/O path, etc.)

This behavior change is particularly useful for non-cluster aware applications.

[cssd(3713)]CRS-1610:Network communication with node rac1 (1) missing for 90% of timeout interval.
Removal of this node from cluster in 2.190 seconds

[cssd(3713)]CRS-1652:Starting clean up of CRSD resources.

[cssd(3713)]CRS-1654:Clean up of CRSD resources finished successfully.
[cssd(3713)]CRS-1655:CSSD on node rac2 detected a problem and started to shutdown.

[cssd(5912)]CRS-1713:CSSD daemon is started in clustered mode
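
The "90% of timeout interval" in the CRS-1610 message above is measured against the CSS misscount (network heartbeat timeout); the current values can be checked with crsctl, for example:

shell> crsctl get css misscount
shell> crsctl get css disktimeout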

Unlike the earlier mechanism, Oracle no longer simply kills the node: it first tries to kill the relevant processes, and only if that attempt fails does it kill the node. For a big cluster this is very good protection, because it avoids the resource freeze caused by re-mastering resources after a node reboot. The following passage explains this point in more detail:


Prior to 11g R2 (11.2.0.2, to be precise), during voting disk failures the node would be rebooted to protect the integrity of the cluster. But the underlying problem is not necessarily just a communication issue: the node may be hanging, or an I/O operation may be hanging, so the reboot decision can potentially be the wrong one. Therefore Oracle Clusterware now fences the node without rebooting it. This is a big achievement and changes the way the cluster is designed.

The reason to avoid the reboot is that during a reboot resources need to be re-mastered and the cluster has to be re-formed among the remaining nodes. In a big cluster with a large number of nodes this can be a very expensive operation, so Oracle fences the node by killing the offending processes: the clusterware stack on that node shuts down, but the node itself is not shut down. Once the I/O path or the network heartbeat is available again, the cluster stack on that node is started again. The data is still protected, but without the pain of rebooting the node. In the cases where a reboot really is needed to protect integrity, the clusterware will still decide to reboot the node.
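
Once the interconnect or I/O path recovers, whether the stack really came back on the evicted node can be checked with crsctl, for example:

shell> crsctl check crs
shell> crsctl stat res -init -t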

Reference: RAC_System_Test_Plan_Outline_11gr2_v2_0