asm

[Repost] ASM Handling Block Corruptions

November 25, 2012 oracle, RAC

An excellent experiment from Honcho.

To protect the data on its disks, Oracle ASM supports three redundancy levels for disk groups:
(1) external redundancy: Oracle does not manage any mirroring for you; protection is provided by the external storage, for example through RAID.
(2) normal redundancy (the default): Oracle maintains two-way mirroring of the data.
(3) high redundancy: Oracle maintains three-way mirroring of the data.

If ASM redundancy is used, it is implemented through ASM failure groups. ASM does not mirror whole disks; it mirrors at the extent level. It is therefore unwise to put disks of different sizes into the failure groups, because Oracle may run into trouble when it needs to allocate the next extent.
Under normal redundancy, every extent allocated in ASM has a primary copy and a secondary copy, and ASM's allocation algorithm guarantees that the secondary copy lands in a different failure group from the primary copy; that is the whole point of failure groups. Thanks to this algorithm, ASM guarantees that no data is lost even if every disk in one failure group fails.
When Oracle allocates an extent, the copies across the failure groups that hold the same data are collectively called an extent set. When Oracle writes to a file, the primary copy may end up in any failure group and the secondary copy goes to a different failure group; when Oracle reads, it reads from the primary copy unless that copy is unavailable. With this scheme of unordered writes and ordered reads, Oracle spreads reads across as many disks as possible.

So how does ASM handle block corruption internally? The Oracle documentation describes it as follows:

A) Block Corruption in Primary Extent
Block reads from disk are affected. The corruption will be found during the read from the primary extent; we report an ORA-1578 in the alert.log and re-read the same block (primary extent). If the second read fails as well, we next read the block from the secondary extent (on a different physical disk). This failover to reading from the secondary extent happens automatically and is transparent to the user/session. If the secondary extent is not corrupt, the user gets the results without any error indication other than the ORA-1578 in the alert.log about the failed primary extent read. Finally, ASM overwrites the corrupt block in the primary extent with the good copy from the secondary extent. This happens even when the block is unaffected by any transaction. In short, ASM finds corrupt blocks in the primary extent, and that automatically triggers a repair at the next read.

B) Block Corruption in Secondary Extent
Because block reads always go to the primary extent first, a corruption in the secondary extent has no effect on read operations. At the next write, the block becomes dirty and the new block image overwrites the old corrupt block on disk. For a corruption in the secondary extent there is no diagnostic information available; a corruption in the secondary extent will normally only be noticed if the block in the primary extent is corrupt as well. ASM fixes corrupt blocks in the secondary extent automatically at the next write of the block.
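As a side note before the experiments: corruption that is visible through the normal read path can also be hunted for proactively from the database side. A minimal sketch using standard 11g RMAN syntax and the V$DATABASE_BLOCK_CORRUPTION view (illustrative only, not part of the original test):

# read every block of the database, record corrupt ones in the control file, write no backup piece
rman target / <<'EOF'
BACKUP VALIDATE CHECK LOGICAL DATABASE;
EOF
# anything the validate run flagged shows up here
sqlplus -s / as sysdba <<'EOF'
select file#, block#, blocks, corruption_type from v$database_block_corruption;
EOF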

Let's verify the description above with an experiment. The test platform is Oracle 11.2.0.3 for Linux x86-64.

First, create a normal redundancy disk group, DATA1:
SQL> create diskgroup data1 normal redundancy failgroup f1 disk '/dev/asm-diske' failgroup f2 disk '/dev/asm-diskf';
 
Diskgroup created.

Create some test data:
SQL> create tablespace xhz datafile '+DATA1' size 100m;
 
Tablespace created.
 
SQL> select name from v$datafile;
 
NAME
--------------------------------------------------------------------------------
+DATA/xhz/datafile/system.260.799281453
+DATA/xhz/datafile/sysaux.261.799281497
+DATA/xhz/datafile/undotbs1.262.799281531
+DATA/xhz/datafile/users.264.799281563
+DATA/xhz/datafile/honcho.266.799334723
+DATA1/xhz/datafile/xhz.256.800304113
 
6 rows selected.
 
SQL> create table honcho (id number,name varchar2(20)) tablespace xhz;
 
Table created.
 
SQL> begin
  2    for i in 1..10000 loop
  3     insert into honcho values (i,'honcho');
  4     end loop;
  5    commit;
  6  end;
  7  /
 
PL/SQL procedure successfully completed.
 
SQL> create index hon_id_idx on honcho(id) nologging;
                                                    
Index created.

Check how the table HONCHO is laid out on disk:

SQL> select file_id,extent_id,block_id,blocks from dba_extents where segment_name='HONCHO';
 
   FILE_ID  EXTENT_ID   BLOCK_ID     BLOCKS
---------- ---------- ---------- ----------
         6          0        128          8
         6          1        136          8
         6          2        144          8
         6          3        152          8

Pick an arbitrary row:
SQL> select rowid,id from honcho where id=888;
 
ROWID                      ID
------------------ ----------
AAADbBAAGAAAACEAGr        888             
 
SQL> select dbms_rowid.rowid_object(rowid) object_id,     
  2         dbms_rowid.rowid_relative_fno(rowid) file_id, 
  3         dbms_rowid.rowid_block_number(rowid) block_id,
  4         dbms_rowid.rowid_row_number(rowid) num         
  5    from honcho
  6    where id=888;
 
 OBJECT_ID    FILE_ID   BLOCK_ID        NUM
---------- ---------- ---------- ----------
     14017          6        132        427

The row is in file #6 and its block (132) falls between BLOCK_ID 128 and 136. Since each ASM extent defaults to 1 AU and 1 AU = 1 MB in this test (so an AU holds at most 128 blocks of 8 KB), the row lives in the second AU of this datafile in ASM, the AU covering file blocks 128-255.
SQL> select group_number,disk_number,name,path from v$asm_disk where group_number=1;
 
GROUP_NUMBER DISK_NUMBER NAME                           PATH
------------ ----------- ------------------------------ ----------------------------------------
           1           1 DATA1_0001                     /dev/asm-diskf
           1           0 DATA1_0000                     /dev/asm-diske
 
SQL> select name from v$datafile where file#=6;
 
NAME
---------------------------------------------------------
+DATA1/xhz/datafile/xhz.256.800304113

From the ASM file-name convention +group/dbname/file_type/tag.file#.incarnation# we can tell that database file #6 is ASM file number 256. The X$KFFXP view shows how this file's extents are distributed across the disk group:

SQL> select  XNUM_KFFXP,  DISK_KFFXP,  AU_KFFXP,  INDX from x$kffxp where group_kffxp=1 and  NUMBER_KFFXP=256;
 
XNUM_KFFXP DISK_KFFXP   AU_KFFXP       INDX
---------- ---------- ---------- ----------
         0          1         52          0 
         0          0         52          1 
         1          0         53          2  ---> primary extent, on disk 0 (/dev/asm-diske)
         1          1         53          3  ---> secondary extent, on disk 1 (/dev/asm-diskf)
         2          1         54          4
         2          0         54          5
         3          0         55          6
         3          1         55          7
         4          1         56          8
         4          0         56          9
         5          0         57         10
       ...         ...       ...        ...
       ...         ...       ...        ...
       100          1        153        200
       100          0        153        201
2147483648          0         82        202
2147483648          1         82        203
2147483648      65534 4294967294        204
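Note that this output does not itself say which row of each pair is the primary copy; the roles were inferred above. To my knowledge the undocumented column LXN_KFFXP carries exactly that information (0 for the primary copy, 1 for the mirror copy), so a variant of the query such as the sketch below makes the roles explicit. Treat the column meaning as an assumption, since X$ views are not documented:

# run against the ASM instance; file 256 in disk group 1 is the XHZ datafile from above
sqlplus -s / as sysasm <<'EOF'
-- one row per copy of each extent; lxn_kffxp: 0 = primary copy, 1 = mirror copy (assumed convention)
select xnum_kffxp extent#, lxn_kffxp copy#, disk_kffxp disk#, au_kffxp au
  from x$kffxp
 where group_kffxp = 1
   and number_kffxp = 256
 order by xnum_kffxp, lxn_kffxp;
EOF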

The query above shows that for the row with id=888 the primary extent is AU 53 on /dev/asm-diske and the secondary extent is AU 53 on /dev/asm-diskf. Since 1 AU = 1024 KB and block_size = 8 KB, one AU holds 128 blocks, so AU 53 starts at disk block 53 * 128 = 6784; file block 132 sits 4 blocks into that AU (the AU covers file blocks 128-255), so we use dd to look at disk block 6788.
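The same arithmetic can be scripted, which makes it less error-prone when the test is repeated for other blocks; a small sketch using the values from this test:

# AU 53, file block 132, 1 MB AU, 8 KB blocks (values taken from the queries above)
au=53; block_id=132; au_kb=1024; block_kb=8
blocks_per_au=$((au_kb / block_kb))                       # 128 blocks per AU
skip=$((au * blocks_per_au + block_id % blocks_per_au))   # 53*128 + 4 = 6788; block_id mod 128 is the offset inside its AU
dd if=/dev/asm-diske bs=8K skip=$skip count=1 | od -xa | head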

First, look at the data in the primary extent:

[grid@honcho ~]$ dd if=/dev/asm-diske bs=8K skip=6788 count=1|od -xa
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 5.5e-05 seconds, 149 MB/s
0000000 a206 0000 0085 0180 df32 0003 0000 0601
        ack   " nul nul enq nul nul soh   2   _ etx nul nul nul soh ack
0000020 b1f9 0000 0001 0000 365d 0000 df15 0003
          y   1 nul nul soh nul nul nul   ]   6 nul nul nak   _ etx nul
0000040 0000 0000 0002 0032 0080 0180 0003 0000
        nul nul nul nul stx nul   2 nul nul nul nul soh etx nul nul nul
0000060 00d6 0000 069f 00c0 0032 000e 21c6 0000
          V nul nul nul  us ack   @ nul   2 nul  so nul   F   ! nul nul
0000100 df32 0003 0000 0000 0000 0000 0000 0000
          2   _ etx nul nul nul nul nul nul nul nul nul nul nul nul nul
0000120 0000 0000 0000 0000 0000 0000 0000 0000
        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
0000140 0000 0000 0100 01c6 ffff 039e 06c9 032b
        nul nul nul nul nul soh   F soh del del  rs etx   I ack   + etx
0000160 032b 0000 01c6 1f8a 1f7c 1f6e 1f60 1f52
          + etx nul nul   F soh  nl  us   |  us   n  us   `  us   R  us
0000200 1f44 1f36 1f28 1f1a 1f0c 1efe 1ef0 1ee2
          D  us   6  us   (  us sub  us  ff  us   ~  rs   p  rs   b  rs
          ...      ...          ...        ...         ...
          ...      ...          ...        ...         ...
0017760 0302 0ec2 0646 6f68 636e 6f68 0601 df32
        stx etx   B  so   F ack   h   o   n   c   h   o soh ack   2   _
0020000

Now look at the data in the secondary extent:

[grid@honcho ~]$ dd if=/dev/asm-diskf bs=8K skip=6788 count=1|od -xa
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.009921 seconds, 826 kB/s
0000000 a206 0000 0085 0180 df32 0003 0000 0601
        ack   " nul nul enq nul nul soh   2   _ etx nul nul nul soh ack
0000020 b1f9 0000 0001 0000 365d 0000 df15 0003
          y   1 nul nul soh nul nul nul   ]   6 nul nul nak   _ etx nul
0000040 0000 0000 0002 0032 0080 0180 0003 0000
        nul nul nul nul stx nul   2 nul nul nul nul soh etx nul nul nul
0000060 00d6 0000 069f 00c0 0032 000e 21c6 0000
          V nul nul nul  us ack   @ nul   2 nul  so nul   F   ! nul nul
0000100 df32 0003 0000 0000 0000 0000 0000 0000
          2   _ etx nul nul nul nul nul nul nul nul nul nul nul nul nul
0000120 0000 0000 0000 0000 0000 0000 0000 0000
        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
0000140 0000 0000 0100 01c6 ffff 039e 06c9 032b
        nul nul nul nul nul soh   F soh del del  rs etx   I ack   + etx
0000160 032b 0000 01c6 1f8a 1f7c 1f6e 1f60 1f52
          + etx nul nul   F soh  nl  us   |  us   n  us   `  us   R  us
0000200 1f44 1f36 1f28 1f1a 1f0c 1efe 1ef0 1ee2
          D  us   6  us   (  us sub  us  ff  us   ~  rs   p  rs   b  rs
          ...      ...          ...        ...         ...
          ...      ...          ...        ...         ...
0017760 0302 0ec2 0646 6f68 636e 6f68 0601 df32
        stx etx   B  so   F ack   h   o   n   c   h   o soh ack   2   _
0020000

Comparing the two dumps shows that the data read from the two disks is identical.
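Rather than comparing the od dumps by eye, the two copies can be checksummed; a one-line sketch using the same offset as above (identical sums mean the two 8 KB block images are byte-for-byte the same):

for d in /dev/asm-diske /dev/asm-diskf; do echo -n "$d  "; dd if=$d bs=8K skip=6788 count=1 2>/dev/null | md5sum; done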
Now, first simulate corruption in the secondary extent:
[grid@honcho ~]$ dd if=/dev/zero of=/dev/asm-diskf seek=6788 bs=8K count=1
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 5.5e-05 seconds, 149 MB/s

The block now reads back as all zeros:

[grid@honcho ~]$ dd if=/dev/asm-diskf bs=8K skip=6788 count=1|od -xa
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 4.3e-05 seconds, 191 MB/s
0000000 0000 0000 0000 0000 0000 0000 0000 0000
        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
0020000

The row can still be queried:

SQL> alter system flush buffer_cache;
 
System altered.
 
SQL> select * from honcho where id=888;
 
        ID NAME
---------- --------------------
       888 honcho

Restart the database:

SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.
 
Total System Global Area  546992128 bytes
Fixed Size                  2230192 bytes
Variable Size             167774288 bytes
Database Buffers          373293056 bytes
Redo Buffers                3694592 bytes
Database mounted.
Database opened.
SQL> select * from honcho where id=888;
 
        ID NAME
---------- --------------------
       888 honcho

No errors at all. Check whether the secondary extent has data now:

[grid@honcho ~]$ dd if=/dev/asm-diskf bs=8K skip=6789 count=1|od -xa
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 4.9e-05 seconds, 167 MB/s
0000000 0000 0000 0000 0000 0000 0000 0000 0000
        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
0020000

It is still all zeros. Next, update the HONCHO table:
SQL> update honcho set name='xhz' where id=888;
 
1 row updated.
 
SQL> commit;
 
Commit complete.
 
SQL> select * from honcho where id=888;
 
        ID NAME
---------- --------------------
       888 xhz
 

Check again:
[grid@honcho ~]$ dd if=/dev/asm-diskf bs=8K skip=6789 count=1|od -xa

1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 5.2e-05 seconds, 158 MB/s
0000000 0000 0000 0000 0000 0000 0000 0000 0000
        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
0020000

Still all zeros. Flush the buffer cache:
SQL> alter system flush buffer_cache;
 
System altered.
 
SQL> select * from honcho where id=888;
 
        ID NAME
---------- --------------------
       888 xhz

Now the secondary extent contains the new data:

[grid@honcho ~]$ dd if=/dev/asm-diskf bs=8K skip=6789 count=1|od -xa
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.039857 seconds, 206 kB/s
0000000 a206 0000 0085 0180 f0ca 0003 0000 0601
        ack   " nul nul enq nul nul soh   J   p etx nul nul nul soh ack
0000020 8e2e 0000 0001 0000 365d 0000 f0a0 0003
          .  so nul nul soh nul nul nul   ]   6 nul nul  sp   p etx nul
0000040 0000 0000 0002 0032 0080 0180 0003 0000
        nul nul nul nul stx nul   2 nul nul nul nul soh etx nul nul nul
0000060 00d6 0000 069f 00c0 0032 000e 8000 0000
          V nul nul nul  us ack   @ nul   2 nul  so nul nul nul nul nul
0000100 df32 0003 0003 0003 00d8 0000 06f7 00c0
          2   _ etx nul etx nul etx nul   X nul nul nul   w ack   @ nul
0000120 0032 0035 21c6 0552 f0ca 0003 0000 0000
          2 nul   5 nul   F   !   R enq   J   p etx nul nul nul nul nul
0000140 0000 0000 0100 01c6 ffff 039e 0b73 032b
        nul nul nul nul nul soh   F soh del del  rs etx   s  vt   + etx
0000160 087d 0000 01c6 1f8d 1f82 1f77 1f6c 1f61
          }  bs nul nul   F soh  cr  us stx  us   w  us   l  us   a  us
0000200 1f56 1f4b 1f40 1f35 1f2a 1f1f 1f14 1f09
          V  us   K  us   @  us   5  us   *  us  us  us dc4  us  ht  us
          ...      ...          ...        ...         ...
          ...      ...          ...        ...         ...         
0017760 2c7a 0202 c203 460e 7803 7a68 0601 f0ca
          z   , stx stx etx   B  so   F etx   x   h   z soh ack   J   p
0020000

Next, simulate corruption in the primary extent:

[grid@honcho ~]$ dd if=/dev/zero of=/dev/asm-diske seek=6788 bs=8K count=1
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 6.3e-05 seconds, 130 MB/s
[grid@honcho ~]$ dd if=/dev/asm-diske bs=8K skip=6788 count=1|od -xa
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 4.6e-05 seconds, 178 MB/s
0000000 0000 0000 0000 0000 0000 0000 0000 0000
        nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
0020000

The primary extent now reads back as all zeros. Flush the buffer cache:

SQL> alter system flush buffer_cache;
 
System altered.
 
SQL> select * from honcho where id=888;
 
        ID NAME
---------- --------------------
       888 xhz

The alert log shows the following:
Mon Nov 19 19:09:34 2012
ALTER SYSTEM: Flushing buffer cache
Mon Nov 19 19:09:52 2012
Hex dump of (file 6, block 132) in trace file /u01/app/oracle/diag/rdbms/xhz/xhz/trace/xhz_ora_7588.trc
Corrupt block relative dba: 0x01800085 (file 6, block 132)
Completely zero block found during multiblock buffer read
Reading datafile '+DATA1/xhz/datafile/xhz.256.799758019' for corruption at rdba: 0x01800085 (file 6, block 132)
Read datafile mirror 'DATA1_0000' (file 6, block 132) found same corrupt data (no logical check)
Read datafile mirror 'DATA1_0001' (file 6, block 132) found valid data
Repaired corruption at (file 6, block 132)

The log shows that the query found the copy on DATA1_0000 corrupt, read the secondary extent instead, and finally repaired the corruption.

[grid@honcho ~]$ dd if=/dev/asm-diske bs=8K skip=6788 count=1|od -xa
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.047051 seconds, 174 kB/s
0000000 a206 0000 0085 0180 f0ca 0003 0000 0601
        ack   " nul nul enq nul nul soh   J   p etx nul nul nul soh ack
0000020 8e2e 0000 0001 0000 365d 0000 f0a0 0003
          .  so nul nul soh nul nul nul   ]   6 nul nul  sp   p etx nul
0000040 0000 0000 0002 0032 0080 0180 0003 0000
        nul nul nul nul stx nul   2 nul nul nul nul soh etx nul nul nul
0000060 00d6 0000 069f 00c0 0032 000e 8000 0000
          V nul nul nul  us ack   @ nul   2 nul  so nul nul nul nul nul
0000100 df32 0003 0003 0003 00d8 0000 06f7 00c0
          2   _ etx nul etx nul etx nul   X nul nul nul   w ack   @ nul
0000120 0032 0035 21c6 0552 f0ca 0003 0000 0000
          2 nul   5 nul   F   !   R enq   J   p etx nul nul nul nul nul
0000140 0000 0000 0100 01c6 ffff 039e 0b73 032b
        nul nul nul nul nul soh   F soh del del  rs etx   s  vt   + etx
0000160 087d 0000 01c6 1f8d 1f82 1f77 1f6c 1f61
          }  bs nul nul   F soh  cr  us stx  us   w  us   l  us   a  us
0000200 1f56 1f4b 1f40 1f35 1f2a 1f1f 1f14 1f09
          V  us   K  us   @  us   5  us   *  us  us  us dc4  us  ht  us
          ...      ...          ...        ...         ...
          ...      ...          ...        ...         ...

0017760 2c7a 0202 c203 460e 7803 7a68 0601 f0ca
          z   , stx stx etx   B  so   F etx   x   h   z soh ack   J   p
0020000

Reading the primary extent again shows the data is back.

Now corrupt the block in both the primary and the secondary extent:

[grid@honcho ~]$ dd if=/dev/zero of=/dev/asm-diske seek=6788 bs=8K count=1
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 8.5e-05 seconds, 96.4 MB/s
[grid@honcho ~]$ dd if=/dev/zero of=/dev/asm-diskf seek=6788 bs=8K count=1
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000117 seconds, 70.0 MB/s

Query the row with id=888:
SQL> alter system flush buffer_cache;
 
System altered.
 
SQL> select * from honcho where id=888;
select * from honcho where id=888
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 6, block # 132)
ORA-01110: data file 6: '+DATA1/xhz/datafile/xhz.256.800304113'

Query some other row:
SQL> select * from honcho where id=100;
 
        ID NAME
---------- --------------------
       100 honcho

The data in file #6, block #132 is now thoroughly corrupted (both copies), while data outside file #6, block #132 can still be queried.

Check the log:

Sun Nov 19 19:24:39 2012
ALTER SYSTEM: Flushing buffer cache
Hex dump of (file 6, block 132) in trace file /u01/app/oracle/diag/rdbms/xhz/xhz/trace/xhz_ora_1915.trc
Corrupt block relative dba: 0x01800084 (file 6, block 132)
Completely zero block found during multiblock buffer read
Reading datafile '+DATA1/xhz/datafile/xhz.256.800304113' for corruption at rdba: 0x01800084 (file 6, block 132)
Read datafile mirror 'DATA1_0000' (file 6, block 132) found same corrupt data (no logical check)
Read datafile mirror 'DATA1_0001' (file 6, block 132) found same corrupt data (no logical check)
Errors in file /u01/app/oracle/diag/rdbms/xhz/xhz/trace/xhz_ora_1915.trc  (incident=18163):
ORA-01578: ORACLE data block corrupted (file # 6, block # 132)
ORA-01110: data file 6: '+DATA1/xhz/datafile/xhz.256.800304113'
Incident details in: /u01/app/oracle/diag/rdbms/xhz/xhz/incident/incdir_18163/xhz_ora_1915_i18163.trc
Sun Nov 19 19:24:49 2012
Starting background process SMCO
Sun Nov 19 19:24:50 2012
SMCO started with pid=24, OS id=3119
Corrupt Block Found
         TSN = 6, TSNAME = XHZ
         RFN = 6, BLK = 132, RDBA = 25165956
         OBJN = 14017, OBJD = 14017, OBJECT = HONCHO, SUBOBJECT =
         SEGMENT OWNER = SYS, SEGMENT TYPE = Table Segment
Sun Nov 19 19:24:59 2012
Sweep [inc][18163]: completed
Hex dump of (file 6, block 132) in trace file /u01/app/oracle/diag/rdbms/xhz/xhz/incident/incdir_18163/xhz_m000_3156_i18163_a.trc
Corrupt block relative dba: 0x01800084 (file 6, block 132)
Completely zero block found during validation
Trying mirror side DATA1_0000.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0001.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0000.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0001.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0000.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0001.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0000.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0001.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0000.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
Trying mirror side DATA1_0001.
Reread of blocknum=132, file=+DATA1/xhz/datafile/xhz.256.800304113. found same corrupt data
SunNov 19 19:04:59 2012
Dumping diagnostic data in directory=[cdmp_20121125190459], requested by (instance=1, osid=1915), summary=[incident=18163].
Checker run found 1 new persistent data failures

The results of these experiments match the documented behavior. There is still one small gap, though: if the secondary extent is corrupted and the block is never dirtied afterwards, and the primary extent also becomes corrupt during that window, data is still lost. Hence Oracle's remark:

When a disk block becomes corrupt for whatever reason, the corruption can affect blocks in primary, secondary, or both extents. However, the probability that the same block becomes corrupt in both primary and secondary extent at the same time is very low.
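When both copies are gone, as in the last test, the block can only come back from a backup. A hedged sketch of 11g block media recovery, assuming a usable RMAN backup of datafile 6 exists (which was not the case in this experiment):

rman target / <<'EOF'
# 11g syntax; 10g used BLOCKRECOVER DATAFILE 6 BLOCK 132
RECOVER DATAFILE 6 BLOCK 132;
EOF
# once the block is repaired, its entry disappears from v$database_block_corruption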

ASM error : DBD ERROR: OCIStmtExecute

September 26, 2012 Exadata, oracle

Problem description: during an Exadata migration, all datafiles of the disk group in use were removed directly from ASM with asmcmd rm. The last level of the directory tree could not be deleted, and the problem started to affect other instances: after the rm hung, no file could be added to the disk group, and all "add datafile" / "drop directory" operations from the original instances hung as well.

[grid@dm02db01 ~]$ crsctl query crs softwareversion
Oracle Clusterware version on node [dm02db01] is [11.2.0.3.0]

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
PL/SQL Release 11.2.0.3.0 - Production
CORE	11.2.0.3.0	Production
TNS for Linux: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production


[root@dm02cel01 ~]# imageinfo 

Kernel version: 2.6.18-274.18.1.0.1.el5 #1 SMP Thu Feb 9 19:07:16 EST 2012 x86_64
Cell version: OSS_11.2.3.1.0_LINUX.X64_120304
Cell rpm version: cell-11.2.3.1.0_LINUX.X64_120304-1

Active image version: 11.2.3.1.0.120304
Active image activated: 2012-05-07 02:04:12 -0700
Active image status: success
Active system partition on device: /dev/md5
Active software partition on device: /dev/md7

In partition rollback: Impossible

Cell boot usb partition: /dev/sdm1
Cell boot usb version: 11.2.3.1.0.120304

Inactive image version: undefined
Rollback to the inactive partitions: Impossible

ASMCMD> ls -l
Type Redund Striped Time Sys Name
Y YHDLOG/
Y EDWT/

ASMCMD> ls -l +data_dm02/edwt
Type Redund Striped Time Sys Name
Y DATAFILE/
Y ONLINELOG/

Running this command:

ASMCMD> rm -rf edwt/
ORA-15032: not all alterations performed
ORA-15177: cannot operate on system aliases (DBD ERROR: OCIStmtExecute)

ERROR: alter diskgroup DATA_DM02 drop directory  '+DATA_DM02/edwt' FORCE
Thu Sep 20 17:12:06 2012
SQL> /* ASMCMD */alter diskgroup /*ASMCMD*/ "DATA_DM02" drop directory '+DATA_DM02/edwt/DATAFILE'
ORA-15032: not all alterations performed
ORA-15177: cannot operate on system aliases
ERROR: /* ASMCMD */alter diskgroup /*ASMCMD*/ "DATA_DM02" drop directory '+DATA_DM02/edwt/DATAFILE'
Thu Sep 20 17:12:22 2012
SQL> /* ASMCMD */alter diskgroup /*ASMCMD*/ "DATA_DM02" drop directory '+DATA_DM02/edwt/DATAFILE'
ORA-15032: not all alterations performed
ORA-15177: cannot operate on system aliases
ERROR: /* ASMCMD */alter diskgroup /*ASMCMD*/ "DATA_DM02" drop directory '+DATA_DM02/edwt/DATAFILE'
Thu Sep 20 17:59:07 2012
Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_67709.trc  (incident=73026):
ORA-04030: out of process memory when trying to allocate 184 bytes (callheap,kffd)
Incident details in: /u01/app/grid/diag/asm/+asm/+ASM2/incident/incdir_73026/+ASM2_ora_67709_i73026.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Thu Sep 20 17:59:40 2012
Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_67709.trc  (incident=73027):
ORA-04030: out of process memory when trying to allocate 48 bytes (kfui heap,kfuilst)
ORA-04030: out of process memory when trying to allocate 184 bytes (callheap,kffd)
Incident details in: /u01/app/grid/diag/asm/+asm/+ASM2/incident/incdir_73027/+ASM2_ora_67709_i73027.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.   
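The ORA-15177 itself is expected here: '+DATA_DM02/edwt' and its DATAFILE subdirectory are system-created aliases (built automatically when the EDWT database placed files there), and system aliases cannot be dropped by hand. A hedged query sketch against the ASM instance to confirm the flag, using the documented V$ASM_ALIAS view:

# SYSTEM_CREATED = 'Y' is what makes the manual "drop directory" fail with ORA-15177
sqlplus -s / as sysasm <<'EOF'
select a.name, a.alias_directory, a.system_created
  from v$asm_alias a, v$asm_diskgroup g
 where a.group_number = g.group_number
   and g.name = 'DATA_DM02'
   and upper(a.name) in ('EDWT', 'DATAFILE');
EOF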

Solution:

Recreate the database with DBCA so that its datafiles land in these directories again, and then use DBCA to drop the database you just created.

ASMCMD> ls -l
Type Redund Striped Time Sys Name
Y YHDLOG/
The EDWT directory has now been removed.

Why ASM?

June 13, 2012 oracle, RAC

Why ASM?

. ASM is an Oracle provided Database/Cluster filesystem and Storage Volume manager, optimized for Oracle for any kind of workload (be it OLTP or DSS or Batch, sequential or random access), built-in with the database, with no additional cost.

. It follows SAME (Stripe and Mirror Everything) strategy, eliminating any database level IO hotspots, as the database files are striped equally across all the available disks in the allocated disk groups and are dynamically rebalanced with the addition and removal of disks online.

. ASM provides for better storage utilization and storage consolidation (ie allows multiple databases on the same Server or across multiple Servers to share storage through Oracle clusterware).

. It allows leveraging existing storage RAID technologies (Mirroring, RAID 5 etc) to provide for additional striping and redundancy.

. It can leverage IO multipathing technologies (like EMC Powerpath, Hitachi HDLM etc) to provide for better IO balancing and fault-tolerance.

. Makes use of Oracle ‘Kernel Service Direct File (ksfd)’ interface to perform KAIO (kernel asynchronous IO) against raw devices by-passing OS page cache, providing the best IO performance with no overhead from Operating system or conventional Logical Volume Managers.

For 'ksfd', the following explanation is also helpful:

Oracle performs 1MB I/O when possible

DBWR (the main server process) – Coalesces adjacent buffers to max size of 1MB

Kernel sequential file I/O (ksfq) – buffer size is set to 1MB by default

Archiver writes 1MB I/Os

Log Writer – Max I/O size is set to 1MB

Datapump,Dataguard,Rman and Transfer Packages also use ksfq

Kernel Service Direct File (ksfd) I/O is the central module providing disk I/O services to all of its clients. The main interfaces it provides allow a client to create database files and perform I/O operations on them. Every read/write operation is done in multiples of the block size specified at file creation time.

. Operating System Dependent (OSD) interface is used to access raw and file system files (used with ASM when ASMLIB is not configured) by default

. ASMLIB is disk access interface defined by Oracle and implemented by storage/system vendors. Oracle currently has an ASMLIB implementation for linux only . ASMLIB provides I/O enhancements for persistent binding, I/O optimization and ease of management

. Oracle Disk Manager (ODM) Interface defined by Oracle and implemented by storage/system vendors

ASM is one of the clients of ksfd. This module calls ksfd to issue I/O when the ASM feature is enabled and no ASMLIB library is present. A database file is identified as an ASM-managed file from its name pattern.
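On Linux it is easy to tell whether the ASMLIB path is actually in play, because the oracleasm kernel module and its /dev/oracleasm device tree only exist when ASMLIB is configured; a small sketch (the init script is the one shipped with the ASMLIB packages):

lsmod | grep oracleasm             # is the ASMLIB kernel module loaded?
/etc/init.d/oracleasm status       # driver loaded and /dev/oracleasm mounted?
/etc/init.d/oracleasm listdisks    # disks labelled for ASMLIB, if any
ls -l /dev/oracleasm/disks/        # the device files ASM opens when ASMLIB is in use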

From here we can also branch into some considerations on whether to use ASMLIB:

ASMLIB is a kernel support library, implemented as a Linux module, designed specifically for the Oracle Automatic Storage Management feature.
Our understanding of ASMLIB has long been incomplete, so let's look at the pros and cons of using it in some detail.
In theory, the ASMLIB API gives us the following benefits:
Direct, asynchronous I/O is always used
The problem of persistent device names is solved, even when device names change after a reboot
The file permission and ownership problem is solved
Context switches between user mode and kernel mode during I/O are reduced, which may lower CPU usage
The number of file handles in use is reduced
The ASMLIB API makes it possible to pass metadata such as I/O priority down to the storage device
Although in theory ASMLIB could bring performance benefits, in practice the advantage is all but negligible: no performance report has shown ASMLIB to have any performance edge over the native udev device management service on Linux. There is a thread on the official Oracle forums discussing the performance benefits of ASMLIB in which you can find the comment "asmlib wouldn't necessarily give you much of an io performance benefit, it's mainly for ease of management as it will find/discover the right devices for you, the io effect of asmlib is large the same as doing async io to raw devices."; in practice there is no real performance difference between using ASMLIB and using raw devices directly.
Potential drawbacks of ASMLIB:
For multipathing devices you need to set ORACLEASM_SCANORDER and ORACLEASM_SCANEXCLUDE in the /etc/sysconfig/oracleasm-_dev_oracleasm configuration file so that ASMLIB can find the correct device files; see the relevant Metalink Note
Because the ASM instance consumes the asm disks presented by ASMLIB, an extra layer is added
Every Linux kernel update requires a matching new ASMLIB package
The chance of downtime caused by human error increases
Using ASMLIB means spending more time on creation and maintenance
Because ASMLIB is one more component, it can introduce additional bugs, which is the last thing we want
A disk created through ASMLIB has a disk header that is no different from an ordinary asm disk header; it merely carries an extra ASMLIB attribute area at the front.
Conclusion:
My personal view is to avoid ASMLIB wherever possible, although of course this is not something a DBA can decide alone. It also depends on habit: the early RHEL 4 releases were long thought not to provide a device management service like udev, which is why so many ASM+RAC systems on RHEL 4 used ASMLIB. As readers have pointed out, udev was introduced as a new feature of kernel 2.6 and the udev binding service was already included in the initial RHEL 4 release, but in the RHEL 4 era udev was not actually in wide use ("In Linux 2.6, a new feature was introduced to simplify device management and hot plug capabilities. This feature is called udev and is a standard package in RHEL4 or Oracle Enterprise Linux 4 (OEL4) as well as Novell's SLES9 and SLES10."). On RHEL/OEL 5 you already have every reason to use udev and drop ASMLIB.
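For reference, the udev alternative boils down to a single rules file that pins a stable name, owner and mode on each ASM device. A minimal sketch in the RHEL/OEL 5 style; the WWID, device names and the grid:asmadmin ownership are placeholders, and the scsi_id options differ between releases, so treat it as an illustration rather than a drop-in rule:

# write /etc/udev/rules.d/99-oracle-asmdevices.rules (RHEL/OEL 5 style; placeholder WWID and names)
cat > /etc/udev/rules.d/99-oracle-asmdevices.rules <<'EOF'
# match the LUN partition by its SCSI WWID and expose it with a fixed name and ASM-friendly ownership
KERNEL=="sd?1", BUS=="scsi", PROGRAM=="/sbin/scsi_id -g -u -s %p", RESULT=="3600a0b80001234560000abcd", NAME="asm-diske", OWNER="grid", GROUP="asmadmin", MODE="0660"
EOF
udevtest /block/sdb/sdb1           # RHEL 5-era dry run; newer releases use "udevadm test"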

Reference:http://www.oracledatabase12g.com/archives/why-asmlib-and-why-not.html

11gR2 rac ora.asm error (ASM)

April 26, 2012 maintain, oracle, RAC

ASM – ora.asm error

11.2.0.3 RAC: CRS fails to start

[ohasd(25286)]CRS-2112:The OLR service started on node db-41.
2012-04-25 14:46:13.120
[ohasd(25286)]CRS-1301:Oracle High Availability Service started on node db-41.
2012-04-25 14:46:13.120
[ohasd(25286)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2012-04-25 14:46:14.181
[/data/11.2.0/grid/bin/orarootagent.bin(25332)]CRS-5016:Process “/data/11.2.0/grid/bin/acfsload” spawned by agent “/data/11.2.0/grid/bin/orarootagent.bin” for action “check” failed: details at “(:CLSN00010:)” in “/data/11.2.0/grid/log/db-41/agent/ohasd/orarootagent_root/orarootagent_root.log”
2012-04-25 14:46:17.031
[gpnpd(25510)]CRS-2328:GPNPD started on node db-41.
2012-04-25 14:46:20.794
[cssd(25593)]CRS-1713:CSSD daemon is started in clustered mode
2012-04-25 14:46:22.520
[ohasd(25286)]CRS-2767:Resource state recovery not attempted for ‘ora.diskmon’ as its target state is OFFLINE
2012-04-25 14:46:40.715
[cssd(25593)]CRS-1707:Lease acquisition for node db-41 number 1 completed
2012-04-25 14:46:42.003
[cssd(25593)]CRS-1605:CSSD voting file is online: /dev/mapper/Vote03; details in /data/11.2.0/grid/log/db-41/cssd/ocssd.log.
2012-04-25 14:46:42.008
[cssd(25593)]CRS-1605:CSSD voting file is online: /dev/mapper/Vote02; details in /data/11.2.0/grid/log/db-41/cssd/ocssd.log.
2012-04-25 14:46:42.015
[cssd(25593)]CRS-1605:CSSD voting file is online: /dev/mapper/Vote01; details in /data/11.2.0/grid/log/db-41/cssd/ocssd.log.
2012-04-25 14:46:52.173
[cssd(25593)]CRS-1601:CSSD Reconfiguration complete. Active nodes are db-41 .
2012-04-25 14:46:54.177
[ctssd(25738)]CRS-2407:The new Cluster Time Synchronization Service reference node is host db-41.
2012-04-25 14:46:54.177
[ctssd(25738)]CRS-2401:The Cluster Time Synchronization Service started on host db-41.
[client(25843)]CRS-10001:25-Apr-12 14:46 ACFS-9391: Checking for existing ADVM/ACFS installation.
[client(25848)]CRS-10001:25-Apr-12 14:46 ACFS-9392: Validating ADVM/ACFS installation files for operating system.
[client(25850)]CRS-10001:25-Apr-12 14:46 ACFS-9393: Verifying ASM Administrator setup.
[client(25853)]CRS-10001:25-Apr-12 14:46 ACFS-9308: Loading installed ADVM/ACFS drivers.
[client(25856)]CRS-10001:25-Apr-12 14:46 ACFS-9154: Loading ‘oracleoks.ko’ driver.
[client(25913)]CRS-10001:25-Apr-12 14:46 ACFS-9154: Loading ‘oracleadvm.ko’ driver.
[client(25974)]CRS-10001:25-Apr-12 14:46 ACFS-9154: Loading ‘oracleacfs.ko’ driver.
[client(26099)]CRS-10001:25-Apr-12 14:46 ACFS-9327: Verifying ADVM/ACFS devices.
[client(26103)]CRS-10001:25-Apr-12 14:46 ACFS-9156: Detecting control device ‘/dev/asm/.asm_ctl_spec’.
[client(26107)]CRS-10001:25-Apr-12 14:46 ACFS-9156: Detecting control device ‘/dev/ofsctl’.
[client(26113)]CRS-10001:25-Apr-12 14:46 ACFS-9322: completed
2012-04-25 14:47:06.975
[ohasd(25286)]CRS-2807:Resource ‘ora.asm’ failed to start automatically.

[ctssd(25738)]CRS-2405:The Cluster Time Synchronization Service on host db-41 is shutdown by user
2012-04-25 14:51:25.852
[/data/11.2.0/grid/bin/orarootagent.bin(25523)]CRS-5016:Process “/data/11.2.0/grid/bin/acfsload” spawned by agent “/data/11.2.0/grid/bin/orarootagent.bin” for action “check” failed: details at “(:CLSN00010:)” in “/data/11.2.0/grid/log/db-41/agent/ohasd/orarootagent_root/orarootagent_root.log”
2012-04-25 14:51:35.716
[cssd(25593)]CRS-1603:CSSD on node db-41 shutdown by user.
2012-04-25 14:51:35.725
[ohasd(25286)]CRS-2767:Resource state recovery not attempted for ‘ora.cssdmonitor’ as its target state is OFFLINE
2012-04-25 14:51:36.224
[cssd(25593)]CRS-1660:The CSS daemon shutdown has completed
2012-04-25 14:51:38.144
[gpnpd(25510)]CRS-2329:GPNPD on node db-41 shutdown.

Rebuild the spfile:


+ASM.asm_diskgroups='DATA','ARCH'
*.asm_diskstring='/dev/mapper/*'
*.asm_power_limit=1
*.diagnostic_dest='/data/app/oracle'
*.instance_type='asm'
*.large_pool_size=12M
#*.local_listener='LISTENER_+ASM'
*.remote_login_passwordfile='EXCLUSIVE'

e.g., bring up the ASM instance with the pfile, mount the DATA disk group, then:

SQL> create spfile='+DATA' from pfile='xxx';

[grid@db-41 ~]$ sqlplus '/as sysasm'

SQL*Plus: Release 11.2.0.3.0 Production on Thu Apr 26 15:21:33 2012

Copyright (c) 1982, 2011, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 – 64bit Production
With the Real Application Clusters and Automatic Storage Management options

SQL> show parameter spfile;

NAME TYPE
———————————— ———————————
VALUE
——————————
spfile string
+DATA/yhd-cluster/asmparameter
file/registry.253.781624261
SQL>

Restart CRS; this time it comes up fine:

[client(2162)]CRS-10001:26-Apr-12 14:47 ACFS-9392: Validating ADVM/ACFS installation files for operating system.
[client(2164)]CRS-10001:26-Apr-12 14:47 ACFS-9393: Verifying ASM Administrator setup.
[client(2167)]CRS-10001:26-Apr-12 14:47 ACFS-9308: Loading installed ADVM/ACFS drivers.
[client(2170)]CRS-10001:26-Apr-12 14:47 ACFS-9154: Loading ‘oracleoks.ko’ driver.
[client(2227)]CRS-10001:26-Apr-12 14:47 ACFS-9154: Loading ‘oracleadvm.ko’ driver.
[client(2286)]CRS-10001:26-Apr-12 14:47 ACFS-9154: Loading ‘oracleacfs.ko’ driver.
[client(2415)]CRS-10001:26-Apr-12 14:47 ACFS-9327: Verifying ADVM/ACFS devices.
[client(2418)]CRS-10001:26-Apr-12 14:47 ACFS-9156: Detecting control device ‘/dev/asm/.asm_ctl_spec’.
[client(2422)]CRS-10001:26-Apr-12 14:47 ACFS-9156: Detecting control device ‘/dev/ofsctl’.
[client(2428)]CRS-10001:26-Apr-12 14:47 ACFS-9322: completed
2012-04-26 14:47:44.001
[crsd(2755)]CRS-1012:The OCR service started on node db-41.
2012-04-26 14:47:44.472
[evmd(2146)]CRS-1401:EVMD started on node db-41.
2012-04-26 14:47:46.297
[crsd(2755)]CRS-1201:CRSD started on node db-41.
2012-04-26 14:47:47.586
[/data/11.2.0/grid/bin/oraagent.bin(2880)]CRS-5016:Process “/data/11.2.0/grid/bin/lsnrctl” spawned by agent “/data/11.2.0/grid/bin/oraagent.bin” for action “check” failed: details at “(:CLSN00010:)” in “/data/11.2.0/grid/log/db-41/agent/crsd/oraagent_grid/oraagent_grid.log”
2012-04-26 14:47:47.592
[/data/11.2.0/grid/bin/oraagent.bin(2880)]CRS-5016:Process “/data/11.2.0/grid/bin/lsnrctl” spawned by agent “/data/11.2.0/grid/bin/oraagent.bin” for action “check” failed: details at “(:CLSN00010:)” in “/data/11.2.0/grid/log/db-41/agent/crsd/oraagent_grid/oraagent_grid.log”
2012-04-26 14:47:47.685
[/data/11.2.0/grid/bin/oraagent.bin(2880)]CRS-5016:Process “/data/11.2.0/grid/opmn/bin/onsctli” spawned by agent “/data/11.2.0/grid/bin/oraagent.bin” for action “check” failed: details at “(:CLSN00010:)” in “/data/11.2.0/grid/log/db-41/agent/crsd/oraagent_grid/oraagent_grid.log”
2012-04-26 14:47:47.832
[crsd(2755)]CRS-2772:Server ‘db-41’ has been assigned to pool ‘Generic’.
2012-04-26 14:47:47.833
[crsd(2755)]CRS-2772:Server ‘db-41’ has been assigned to pool ‘ora.user’.
2012-04-26 14:51:13.534

This raises an obvious question: how can CRS read the spfile before the disk group is mounted, and how does CRS know where to look for the spfile in the first place? The excellent article below walks through the whole read process in detail:

Oracle 11gR2 have many new futures in Grid Computing area. One of them is Oracle Local Repository (OLR), this repository designed to store information and profiles for local resources, resources that dedicated to particular node. It improves the performance of accessing local resources profile information, redundancy and manageability. In Grid Infrastructure RAC configuration there is usual one global shared OCR and OLR’s on each node. In Oracle Restart environment there is only OLR repository. In 11g R2 there is also new feature, Grid Plug and Play (GPNP), as the name implies, it helps to automate and simplify some of the aspects of grid administration. GPNP maintains profile, it is XML file that stores the configuration information of some components maintained by GPNP, for example vips and interconnect information is stored here.

But in this post I want to discuss another new feature of 11g R2: the possibility to store the server parameter file (spfile) on ASM disks, even for the ASM instance itself! Some reasonable questions arise:

1. How can a process read the spfile from the ASM disks if they are not mounted yet?
2. How does the process know where to get the spfile if there is no init.ora file in $ORACLE_HOME/dbs with a spfile=[path to spfile] parameter?
3. What about the asm_diskstring parameter?

The third point is important for the ASM instance because, before we can read the spfile from the ASM disks, we need to identify those disks, and to identify them we need the spfile to get the asm_diskstring parameter!

To clarify all these points I will trace the Oracle processes with a system-call tracing tool. We could use DTrace, strace, truss and so on depending on the platform; I will use strace because I am on Linux.

My environment is Oracle Grid Infrastructure 11g R2 (11.2.0.1.0) in Oracle Restart mode. I simulated the ASM disks using loop devices (/dev/loop*); I have 5 disks:

bash> /etc/init.d/oracleasm listdisks
DISK1
DISK2
DISK3
DISK4
DISK5

SQL> conn / as sysasm
Connected.
SQL> col path format a15

SQL> select a.name,a.state,b.name,b.path from v$asm_diskgroup a, v$asm_disk b where a.group_number=b.group_number order by b.name;

NAME STATE NAME PATH
—————————— ———– —————————— —————
DGROUP1 MOUNTED DISK1 ORCL:DISK1
DGROUP1 MOUNTED DISK2 ORCL:DISK2
DGROUP2 MOUNTED DISK3 ORCL:DISK3
DGROUP2 MOUNTED DISK4 ORCL:DISK4
DGROUP2 MOUNTED DISK5 ORCL:DISK5

As seen from the listing, I have two disk groups DGROUP1 and DGROUP2. My spfile is located on DGROUP1

SQL> show parameter spfile;

NAME TYPE VALUE
———————————— ———– —————————————————-
spfile string +DGROUP1/asm/asmparameterfile/registry.253.740659373

Let's run some tests.

SQL> conn / as sysasm
Connected.

SQL> startup
ASM instance started

Total System Global Area 284565504 bytes
Fixed Size 1336036 bytes
Variable Size 258063644 bytes
ASM Cache 25165824 bytes
ASM diskgroups mounted
SQL>
SQL> show parameter power

NAME TYPE VALUE
———————————— ———– ——————————
asm_power_limit integer 1
SQL> alter system set asm_power_limit=3;

System altered.

SQL> show parameter power

NAME TYPE VALUE
———————————— ———– ——————————
asm_power_limit integer 3

SQL> --We set asm_power_limit to 3 in spfile

SQL> shutdown immediate;
ASM diskgroups dismounted
ASM instance shutdown

SQL> !crs_stat -t
Name Type Target State Host
————————————————————
ora.DGROUP1.dg ora….up.type OFFLINE OFFLINE
ora.DGROUP2.dg ora….up.type OFFLINE OFFLINE
ora….ER.lsnr ora….er.type ONLINE ONLINE testdb03
ora.asm ora.asm.type OFFLINE OFFLINE
ora.cssd ora.cssd.type ONLINE ONLINE testdb03
ora.diskmon ora….on.type ONLINE ONLINE testdb03

SQL> -- Lets start instance in nomount mode, it will not mount the diskgroups

SQL> startup nomount;
ASM instance started

Total System Global Area 284565504 bytes
Fixed Size 1336036 bytes
Variable Size 258063644 bytes
ASM Cache 25165824 bytes

SQL> show parameter spfile;

NAME TYPE VALUE
———————————— ———– —————————————————-
spfile string +DGROUP1/asm/asmparameterfile/registry.253.740659373

SQL> show parameter power

NAME TYPE VALUE
———————————— ———– ——————————
asm_power_limit integer 3

SQL> select * from v$spparameter;
select * from v$spparameter
*
ERROR at line 1:
ORA-15001: diskgroup “DGROUP1” does not exist or is not mounted

SQL> alter system set asm_power_limit=10;
alter system set asm_power_limit=10
*
ERROR at line 1:
ORA-32000: write to SPFILE requested but SPFILE is not modifiable

SQL> !asmcmd
ASMCMD> ls
ASMCMD> exit

SQL> !crs_stat -t
Name Type Target State Host
————————————————————
ora.DGROUP1.dg ora….up.type OFFLINE OFFLINE
ora.DGROUP2.dg ora….up.type OFFLINE OFFLINE
ora….ER.lsnr ora….er.type ONLINE ONLINE testdb03
ora.asm ora.asm.type ONLINE ONLINE testdb03
ora.cssd ora.cssd.type ONLINE ONLINE testdb03
ora.diskmon ora….on.type ONLINE ONLINE testdb03

So, what do we see? We were able to start the ASM instance without mounting the disk groups, and our process picked up the correct spfile. The SHOW PARAMETER command reads parameter values from memory, but if we try to read directly from the spfile (select from v$spparameter) or write to it (alter system set) we get errors, because the disk groups are not mounted yet. That means our process read the spfile directly from the ASM disks. The decision about which disk to read it from is made based on information in the disk header. I use ASMLib, and my asm_diskstring parameter is at its default value, null; by default the ASM instance scans /dev/oracleasm/disks/* on Linux, which is why the process found my disks properly. But what if we were on HP-UX with multipathed disks under /dev/rdisk/*? ASM will not scan those by default, so we need the asm_diskstring parameter; how can our process read the asm_diskstring parameter before accessing the spfile?

Let's start sqlplus; after we connect as sysasm a server process is created, and that process will do all the work, which is why I will trace the server process.

bash> sqlplus /nolog
SQL> conn / as sysasm
Connected.
SQL> !ps -aefH
.
.
oracle 15505 4255 0 13:26 pts/1 00:00:00 sqlplus
oracle 15507 15505 0 13:26 ? 00:00:00 oracle+ASM (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
.
.
SQL>

Our server process number is 15507; now I will use strace to trace it:


bash> strace -o userproc.out -p 15507

Start the instance with the nomount option; OCSSD must be running, or the instance will not start.

SQL> startup nomount;

Total System Global Area 284565504 bytes
Fixed Size 1336036 bytes
Variable Size 258063644 bytes
ASM Cache 25165824 bytes

SQL>

Now we can analyze the contents of userproc.out; I will not list everything, just the useful parts of it.

First row is

read(9, “001\6\3\212\6\376\377\377\377\1\376\377\377\377\376\377\377\377″…, 8208) = 49

it is our "startup nomount" command, transmitted to the server process from sqlplus over the socket

connect(6, {sa_family=AF_FILE, path=”/var/tmp/.oracle/sOCSSD_LL_testhost_”…}, 110) = 0
open(“/u01/oracle/product/grid11g/auth/css/testdb03/A6857753/84b0b692”, O_WRONLY|O_CREAT|O_EXCL, 0644) = 13
write(13, “\nS\336[“, 4) = 4
close(13)

From these lines we can see that our process connects to the OCSSD socket file /var/tmp/.oracle/sOCSSD_LL_testhost_ to communicate with it and authenticates itself. That is why it is impossible to start the instance without OCSSD: if the process cannot connect to this socket, it fails. It then establishes communication with OHASD (Oracle High Availability Services Daemon) through the socket file /var/tmp/.oracle/sCRSD_UI_SOCKET, and after exchanging messages with OHASD it gets the location of the spfile and the asm_diskstring parameter

access(“/var/tmp/.oracle/sCRSD_UI_SOCKET”, F_OK) = 0
connect(14, {sa_family=AF_FILE, path=”/var/tmp/.oracle/sCRSD_UI_SOCKET”…}, 110) = 0
open(“/u01/oracle/product/grid11g/auth/ohasd/testdb03/A4972914/5140f6df”, O_WRONLY|O_CREAT|O_EXCL, 0644) = 15
write(15, “\20\7\3519”, 4) = 4
close(15)
write(14, “4\2\2\1\1\1\1\3\1T\235\376\t”…, 52) = 52
write(14, “8\3\2\1\1\1\1\4\1T\235\376\t”…, 56) = 56
read(14, “8\3\2\1\1\1\1\3\1T\235\376\t”…, 32768) = 56
write(14, “\212\1PC\v\2\2\5\1T\235\376\t”…, 394) = 394
read(14, “\366\nPC\v\2\2\4\1T\235\376\t”…, 32768) = 2806
write(14, “0\2\20\1\1T\235\376\t”…, 48) = 48

write(3, “kggpnpSIHAGetItem 1 = +dgroup1/ASM/asmparameterfile/registry.253.740659373″…, 75) = 7
write(3, “kfspFileNameGet name=+dgroup1/ASSM/asmparameterfile/registry.253.740659373″…, 78) = 78

write(3, “kggpnpSIHAGetItem 2 = “, 23) = 23
write(3, “kgfspb_shallow_discover dscstr=\”\””…, 33) = 33

As we can see, our process got the spfile location (+dgroup1/ASM/asmparameterfile/registry.253.740659373) and asm_diskstring="" (null, the default) from OHASD. Now let's see where OHASD itself gets this information. To find out I traced the OHASD process, and found these lines

open(“/etc/oracle/olr.loc”, O_RDONLY) = 4
read(4, “olrconfig_loc=/u01/oracle/produc”…, 4096) = 108
read(4, “”, 4096) = 0
close(4) = 0
open(“/u01/oracle/product/grid11g/cdata/localhost/testhost.olr”, O_RDONLY|O_SYNC|O_LARGEFILE) = 4
pread64(4, “\1\202VD\31\4\3OCR\226\361nA”…, 4096, 102400) = 4096
pread64(4, “\1\202\300I#\4\3OCR\226\361nA”…, 4096, 143360) = 4096
pread64(4, “\1\202VD\31\4\3OCR\226\361nA”…, 4096, 102400) = 4096
pread64(4, “\1\202hq\33\4\3OCR\226\361nA”…, 4096, 110592) = 4096
pread64(4, “\1\202\271\311#\4\20\3OCR\226\361nA”…, 4096, 4337664) = 4096
pread64(4, “\1\202\276\262$\4\20\3OCR\226\361nA”…, 4096, 4341760) = 4096
pread64(4, “\1\202VD\31\4\3OCR\226\361nA”…, 4096, 102400) = 4096
pread64(4, “\1\202hq\33\4\3OCR\226\361nA”…, 4096, 110592) = 4096
pread64(4, “\1\202\271\311#\4\20\3OCR\226\361nA”…, 4096, 4337664) = 4096
pread64(4, “\1\202\276\262$\4\20\3OCR\226\361nA”…, 4096, 4341760) = 4096
pread64(4, “\1\202\236\363%\4\2\3OCR\226\361nA”…, 4096, 4345856) = 4096
pread64(4, “\1\202\334\n&\4\2\3OCR\226\361nA”…, 4096, 4349952) = 4096
pread64(4, “\1\202\325\357-\4\2\3OCR\226\361nA”…, 4096, 4378624) = 4096
pread64(4, “\1\202VD\31\4\3OCR\226\361nA”…, 4096, 102400) = 4096
pread64(4, “\1\202hq\33\4\3OCR\226\361nA”…, 4096, 110592) = 4096
pread64(4, “\1\202\271\311#\4\20\3OCR\226\361nA”…, 4096, 4337664) = 4096
pread64(4, “\1\202\276\262$\4\20\3OCR\226\361nA”…, 4096, 4341760) = 4096
pread64(4, “\1\202\325\357-\4\2\3OCR\226\361nA”…, 4096, 4378624) = 4096

As one would expect, OHASD reads this information from the OLR; the path to the OLR it gets from /etc/oracle/olr.loc. Note that this is true for Oracle Restart mode; in an Oracle RAC environment the information is stored in the GPnP profile, and the GPNPD process maintains and manages that profile. In Oracle Restart, as we can see, the OHASD process plays this role, and because there is no GPnP profile it stores the information in the OLR file. So, what next? Our process starts to scan all disks to identify ASM headers; after identifying them, it determines which disks belong to which disk groups from the header information. There is a lot of other metadata in the ASM disk header that it reads, including pointers to the spfile and the voting disk file: kfdhdb.spfile, kfdhdb.spfflg (first block and number of blocks) and kfdhdb.vfstart, kfdhdb.vfend (begin block and end block). It is possible to read a disk header using the kfed utility from the $ORACLE_HOME/bin directory. For example, let's read the disk header of /dev/loop1, which corresponds to ORCL:DISK1; the spfile is on this disk.

shell> kfed read /dev/loop1 | more
kfbh.type: 1 ; 0x002: KFBTYP_DISKHEAD –Indicates that this is ASM disk header
.
kfdhdb.grptyp: 2 ; 0x026: KFDGTP_NORMAL –Indicates mirroring level, in my case it is NORMAL
kfdhdb.hdrsts: 3 ; 0x027: KFDHDR_MEMBER –Indicates that disk is the member disk of diskgroup
.
kfdhdb.dskname: DISK1 ; 0x028: length=5 –Disk name
kfdhdb.grpname: DGROUP1 ; 0x048: length=7 –Disk group name, to which this disk belongs
kfdhdb.fgname: DISK1 ; 0x068: length=5 –To which failure group this disk belongs
.
kfdhdb.secsize: 512 ; 0x0b8: 0x0200 –Disk sector size
kfdhdb.blksize: 4096 ; 0x0ba: 0x1000 –Disk block size
kfdhdb.ausize: 1048576 ; 0x0bc: 0x00100000 –Allocation Unit size, by default 1M
.
kfdhdb.vfstart: 0 ; 0x0ec: 0x00000000 –Begin block address of voting disk file
kfdhdb.vfend: 0 ; 0x0f0: 0x00000000 –End block address of voting disk file
kfdhdb.spfile: 59 ; 0x0f4: 0x0000003b –Begin block address of spfile
kfdhdb.spfflg: 1 ; 0x0f8: 0x00000001 –Number of blocks containing spfile

Now, lets see what will be next steps of our process

open(“/opt/oracle/extapi/32/asm/orcl/1/libasm.so”, O_RDONLY) = 17
read(17, “\177ELF\1\1\1\3\3\1000\v004″…, 512) = 512
close(17) = 0
open(“/dev/oracleasm/.query_version”, O_RDWR|O_LARGEFILE) = 17
write(17, “MSA\2\1\20”, 16) = 16
read(17, “MSA\2\1\20”, 16) = 16
close(17) = 0
open(“/dev/oracleasm/.get_iid”, O_RDWR|O_LARGEFILE) = 17
write(17, “MSA\2\2\30”, 24) = 24
read(17, “MSA\2\2\30\3”, 24) = 24
close(17) = 0
open(“/dev/oracleasm/.check_iid”, O_RDWR|O_LARGEFILE) = 17
write(17, “MSA\2\3\30\3”, 24) = 24
read(17, “MSA\2\3\30\3”, 24) = 24
close(17) = 0
open(“/dev/oracleasm/iid/0000000000000003”, O_RDWR|O_CREAT|O_LARGEFILE, 0770) = 17

open(“/dev/oracleasm/disks/DISK1”, O_RDONLY|O_LARGEFILE) = 18
read(17, “MSA\2\5 \22T/v\17\200-%\310”, 32) = 32
close(18) = 0
read(17, “MSA\2\7P@\364\311\20\320\301\225\277″…, 80) = 80
open(“/dev/oracleasm/disks/DISK2”, O_RDONLY|O_LARGEFILE) = 18
read(17, “MSA\2\5 \22T/v\17\200+%\310”, 32) = 32
close(18) = 0
read(17, “MSA\2\7P@\364\311\20\320\301\225\277″…, 80) = 80
open(“/dev/oracleasm/disks/DISK3”, O_RDONLY|O_LARGEFILE) = 18
read(17, “MSA\2\5 \22T/v\17\200!%\310”, 32) = 32
close(18) = 0
read(17, “MSA\2\7P@\364\311\20\320\301\225\277″…, 80) = 80
open(“/dev/oracleasm/disks/DISK4”, O_RDONLY|O_LARGEFILE) = 18
read(17, “MSA\2\5 \22T/v\17\200#%\310”, 32) = 32
close(18) = 0
read(17, “MSA\2\7P@\364\311\20\320\301\225\277″…, 80) = 80
open(“/dev/oracleasm/disks/DISK5”, O_RDONLY|O_LARGEFILE) = 18
read(17, “MSA\2\5 \22T/v\17\200)%\310”, 32) = 32
close(18) = 0

read(17, “MSA\2\7P@\364\311\20\320\301\225\277″…, 80) = 80
read(17, “MSA\2\7P@\364\311\20″…, 80) = 80
read(17, “MSA\2\7P@\364\311\20″…, 80) = 80
read(17, “MSA\2\7P@\364\311\20″…, 80) = 80
read(17, “MSA\2\7P@\364\311\20″…, 80) = 80
read(17, “MSA\2\7P@\364\311\20″…, 80) = 80
.
.

open(“/u01/oracle/diag/asm/+asm/+ASM/trace/alert_+ASM.log”, O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0660) = 16
write(16, “Tue Jan 18 17:45:13 2011\n”, 25) = 25
write(16, “Starting ORACLE instance (normal”…, 33) = 33
write(16, “\n”, 1) = 1
close(16)
.
.

It opens the /opt/oracle/extapi/32/asm/orcl/1/libasm.so library and reads it. Then it opens the special files /dev/oracleasm/.query_version, /dev/oracleasm/.get_iid and /dev/oracleasm/.check_iid; these files are interfaces to the ASM device manager. The first is used to get the manager's version, the second to get the identifier of the ASM disk device manager instance (not the ASM instance), and the third to verify that instance identifier. In our case the ASM device manager's instance identifier is 0000000000000003. Then our process opens /dev/oracleasm/iid/0000000000000003; in other words it establishes a connection to the ASM device manager instance. The process will use this interface to read and write the ASM disks. It then checks all disks from DISK1 to DISK5. After it has gathered all the information about the disks, i.e. groups, mirroring level, AU size, failure group and so on, it starts reading the ASM disks through the established interface; I think this is the step where it reads the spfile. After it has the initialization parameters it starts to allocate memory structures and start the necessary background processes.

Now let's see what is stored in the OLR for the ora.asm resource

shell> cd $ORACLE_HOME/cdata/localhost
shell> strings testhost.olr | grep dgroup1/ASM/asmparameterfile/registry | sed ‘s/~/\n/g’
DEFAULT_TEMPLATE=PROPERTY(RESOURCE_CLASS=asm) ELEMENT(INSTANCE_NAME= %GEN_USR_ORA_INST_NAME%)
DEGREE=1
DESCRIPTION=Oracle ASM resource
ENABLED=1
GEN_USR_ORA_INST_NAME=+ASM
LOAD=1
LOGGING_LEVEL=1
NAME=ora.asm
NLS_LANG=
NOT_RESTARTING_TEMPLATE=
OFFLINE_CHECK_INTERVAL=0
PROFILE_CHANGE_TEMPLATE=
RESTART_ATTEMPTS=5
SCRIPT_TIMEOUT=60
SPFILE=+dgroup1/ASM/asmparameterfile/registry.253.740659373
START_DEPENDENCIES=hard(ora.cssd) weak(ora.LISTENER.lsnr)
START_TIMEOUT=900
STATE_CHANGE_TEMPLATE=
STOP_DEPENDENCIES=hard(ora.cssd)
STOP_TIMEOUT=600
TYPE=ora.asm.type
TYPE_ACL=owner:oracle:rwx,pgrp:dba:rwx,other::r–
UPTIME_THRESHOLD=1d
USR_ORA_ENV=
USR_ORA_INST_NAME=+ASM
USR_ORA_OPEN_MODE=mount
USR_ORA_OPI=false
USR_ORA_STOP_MODE=immediate
VERSION=11.2.0.1.0

There we find the spfile information, "SPFILE=+dgroup1/ASM/asmparameterfile/registry.253.740659373", and we can also see that ora.cssd is a hard dependency, because OCSSD is responsible for synchronization between the ASM and database instances. But what about the asm_diskstring parameter, where is it?

shell> strings testhost.olr | grep ASM_DISKSTRING | sed ‘s/~/\n/g’

Nothing, because my asm_diskstring parameter equals null, the default value. Let's change it:

SQL> alter system set asm_diskstring=’ORCL:DISK1,ORCL:DISK2′ scope=spfile;

System altered.

check OLR file again

strings testdb03.olr | grep ASM_DISKSTRING | sed ‘s/~/\n/g’
bASM_DISKSTRING
ACL=owner:oracle:rwx,pgrp:dba:rwx,other::r–
ACTION_FAILURE_TEMPLATE=
ACTION_SCRIPT=
AGENT_FILENAME=%CRS_HOME%/bin/oraagent%CRS_EXE_SUFFIX%
ALIAS_NAME=
ASM_DISKSTRING=ORCL:DISK1,ORCL:DISK2
AUTO_START=restore
BASE_TYPE=ora.local_resource.type

Now we can see that the asm_diskstring information is also stored in the OLR profile, and at the next ASM instance startup only the specified discovery string will be scanned. If we specify disks on which the spfile is not present, the instance will not start:


SQL> alter system set asm_diskstring=’ORCL:DISK3,ORCL:DISK4,ORCL:DISK5′ scope=spfile;

System altered.

SQL> startup nomount force;
ORA-01078: failure in processing system parameters
ORA-01565: error in identifying file ‘+DGROUP1/asm/asmparameterfile/registry.253.740659373’
ORA-17503: ksfdopn:2 Failed to open file +DGROUP1/asm/asmparameterfile/registry.253.740659373
SQL>

You can also use the ocrdump utility to extract these entries in XML format on any platform:

shell> ocrdump -local -keyname SYSTEM.OHASD.RESOURCES.ora\!asm.CONFIG -xml -noheader
shell> more OCRDUMPFILE


SYSTEM.OHASD.RESOURCES.ora!asm.CONFIG
ORATEXT

PROCR_ALL_ACCESS
PROCR_NONE
PROCR_NONE
oracle
dba

Likewise, in our 11.2.0.3 OLR file we can find the SPFILE location:


SPFile="+DATA/yhd-cluster/asmparameterfile/registry.253.781624261"/>

One note: in 11.2.0.3, changing the disk string did not produce a corresponding entry in the OLR file.
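Instead of running strings against the OLR, the registered values can also be asked for through the stack's own tools; a hedged sketch (attribute names as they appear in the OLR dumps above; output omitted):

srvctl config asm                    # typically reports the registered spfile and the disk discovery string
crsctl stat res ora.asm -init -p | grep -E '^(SPFILE|ASM_DISKSTRING)='   # -init reads the ohasd copy kept in the OLR; drop it for the crsd copy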

One more note about ASM_DISKGROUPS:

SQL> create pfile='/tmp/tmp.ctl' from spfile='+DATA/yhd-cluster/asmparameterfile/registry.253.781624261';

File created.

[grid@db-41 ~]$ cat /tmp/tmp.ctl
+ASM1.__oracle_base='/data/11.2.0'#ORACLE_BASE set from in memory value
+ASM.asm_diskgroups='DATA','ARCH'
+ASM1.asm_diskgroups='ARCH'#Manual Mount
*.asm_diskgroups='ARCH','DATA'
*.asm_diskstring='/dev/mapper/*'
*.asm_power_limit=1
*.diagnostic_dest='/data/app/oracle'
*.instance_type='asm'
*.large_pool_size=12M
*.remote_login_passwordfile='EXCLUSIVE'

SQL> alter system set asm_diskgroups='ARCH','DATA' scope=both sid='+ASM1';

System altered.

SQL> create pfile='/tmp/tmp.ctl' from spfile='+DATA/yhd-cluster/asmparameterfile/registry.253.781624261';

File created.

[grid@db-41 ~]$ cat /tmp/tmp.ctl
+ASM1.__oracle_base='/data/11.2.0'#ORACLE_BASE set from in memory value
+ASM.asm_diskgroups='DATA','ARCH'
*.asm_diskgroups='ARCH','DATA'
+ASM1.asm_diskgroups='ARCH','DATA'
*.asm_diskstring='/dev/mapper/*'
*.asm_power_limit=1
*.diagnostic_dest='/data/app/oracle'
*.instance_type='asm'
*.large_pool_size=12M
*.remote_login_passwordfile='EXCLUSIVE'

This behavior is also discussed in this article: http://comtmr.blogspot.com/2011/01/11gr2-mysteriously-changing.html

Reference:https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1059856.1

ASM Preferred mirror Read

April 25, 2012 oracle, RAC

ASM Preferred mirror Read


Introduction

The concept of the Preferred Read is not a new idea, but is now implemented in Oracle's ASM volume management in Oracle 11g. The concept is to read from the storage that can present the needed data at a lower latency. Initially, this was designed for WAN or site-specific storage in order to avoid higher-latency site connections. By restricting data reads to the local storage, the application would be able to service requests at nominal read speeds while writes were the only communication needed to traverse the long haul site link. This is a feature that is available to most Operating Systems with their included volume manager and as a feature to Symantec/Veritas through the title Preferred Plex. This paper will discuss the merits of using PRG technology with Oracle ASM.

eg:

SELECT name, failgroup FROM v$asm_disk;

NAME FAILGROUP
—————————— ——————————
DATA_0000 DATA_0000
DATA_0001 DATA_0001
DATA_0002 DATA_0002

3 rows selected.

SQL> ALTER SYSTEM SET ASM_PREFERRED_READ_FAILURE_GROUPS = 'data.data_0000', 'data.data_0001', 'data.data_0002';
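To check what an instance actually treats as preferred after setting the parameter, V$ASM_DISK exposes a PREFERRED_READ flag; a small sketch (note that in a real extended cluster each instance would normally list only its own local failure group, typically with a sid= clause, rather than all three failure groups as in the example above):

sqlplus -s / as sysasm <<'EOF'
-- 'Y' means this ASM instance reads that disk's copy of an extent first
select name, failgroup, preferred_read from v$asm_disk where group_number <> 0;
EOF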


Allocating ASM Extent Sets
With ASM redundancy, the first file extent allocated is chosen as the primary extent, and the mirrored extent is called the secondary extent. In the case of high redundancy, there will be two secondary extents. This logical grouping of primary and secondary extents is called an extent set. When a block is read from disk, it is always read from the primary extent, unless the primary extent cannot be read. In Oracle Database 11g, the preferred read feature allows the database to read the secondary extent first instead of reading the primary extent. This is especially important for RAC Extended Cluster implementations.


As stated earlier, ASM in Oracle Database 10g always reads the primary copy of a mirrored extent set. Thus, a read for a specific block may require a read of the primary extent at the remote site across the interconnect. Accessing a remote disk through a metropolitan area or wide area storage network is substantially slower than accessing a local disk. This can tax the interconnect as well as result in high I/ O and network latency. To assuage this, Oracle Database 11g provides a feature called preferred reads. This feature enables ASM administrators to specify a failure group for local reads–– that is, provide preferred reads. In a normal or high-redundancy diskgroup, when a secondary extent is on a preferred disk and the primary extent is remote, the secondary extent is read rather than the primary one on that node. This feature is especially beneficial for extended cluster configurations.

Echoing the statement above that "This is especially important for RAC Extended Cluster implementations", hellodba has proposed a similar idea; see PRG.

In addition, an IOUG paper describes a nice way to combine this feature with SSDs to improve I/O efficiency.

Why ASMLIB and why not?

October 19, 2011 oracle, RAC

ASMLIB is a kernel support library, implemented as a Linux module, designed specifically for the Oracle Automatic Storage Management feature.
Our understanding of ASMLIB has long been incomplete; here we take a closer look at the pros and cons of using it.

In theory, the ASMLIB API gives us the following benefits:

Direct, asynchronous I/O is always used
1. The problem of persistent device names is solved, even when device names change after a reboot
2. The file permission and ownership problem is solved
3. Context switches between user mode and kernel mode during I/O are reduced, which may lower CPU usage
4. The number of file handles in use is reduced
5. The ASMLIB API makes it possible to pass metadata such as I/O priority down to the storage device
Although in theory ASMLIB could bring performance benefits, in practice the advantage is all but negligible: no performance report has shown ASMLIB to have any performance edge over the native udev device management service on Linux.
There is a thread on the official Oracle forums discussing the performance benefits of ASMLIB in which you can find the comment
"asmlib wouldn't necessarily give you much of an io performance benefit,
it's mainly for ease of management as it will find/discover the right devices for you,
the io effect of asmlib is large the same as doing async io to raw devices."
In practice there is no real performance difference between using ASMLIB and using raw devices directly.

Potential drawbacks of ASMLIB:

1. For multipathing devices you need to set ORACLEASM_SCANORDER and ORACLEASM_SCANEXCLUDE in the /etc/sysconfig/oracleasm-_dev_oracleasm
configuration file so that ASMLIB can find the correct device files; see the relevant Metalink Note
2. Because the ASM instance consumes the asm disks presented by ASMLIB, an extra layer is added
3. Every Linux kernel update requires a matching new ASMLIB package, which increases the chance of downtime caused by human error
4. Using ASMLIB means spending more time on creation and maintenance
5. Because ASMLIB is one more component, it can introduce additional bugs, which is the last thing we want
6. A disk created through ASMLIB has a disk header that is no different from an ordinary asm disk header; it merely carries an extra ASMLIB attribute area at the front.
Conclusion:
My personal view is to avoid ASMLIB wherever possible, although of course this is not something a DBA can decide alone. It also depends on habit: the early RHEL 4 releases did not ship a usable device management service like udev, which led to a large number of ASM+RAC systems on RHEL 4 using ASMLIB.
On RHEL/OEL 5 you have every reason to use udev and drop ASMLIB.
reference:http://www.oracledatabase12g.com/archives/why-asmlib-and-why-not.html