replication

Oracle DGHA architecture

November 1, 2013 advanced tech, Architect, network, oracle, replication




DGHA: an automatic Data Guard failover solution

August 15, 2013 Architect, oracle, replication, software

This post publishes a test of DGHA, an automatic failover script for Oracle Data Guard. The script keeps the online redo logs and controlfile on shared storage, so a switchover loses zero data.
After the switchover, the original primary database can seamlessly rejoin the whole DG cluster, turning Data Guard into a complete high-availability solution.
DGHA is custom-developed in Perl and can be deployed on any machine that can connect to the cluster (preferably not a machine inside the cluster itself). It uses multiple layers of checks, which makes the failover decision more reliable and accurate. Thanks to the author, noodba; as the department's data architect, noodba also custom-built myawr (a MySQL performance analysis tool).

Test environment:

10.0.0.63 (primary database)
10.0.0.24 (standby database)
10.0.0.13 (standby database)
The failover pair is 63 <-> 24; 13 stays as a permanent standby.
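Before the test it is worth confirming the role of each site; a minimal sanity check to run on all three machines (not part of the original test record):

SQL> select db_unique_name, database_role, open_mode from v$database;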

Step 1. Use swingbench as the load generator (10 GB of data imported)


[oracle@userdb-pre admin]$ ora active
 17:23:15 up 1 day,  1:15,  5 users,  load average: 2.99, 0.97, 0.34

   SID USERNAME   MACHINE		 EVENT				PARAM			W   WT SQL			    ST	   LT LOGON_TIME
------ ---------- ---------------------- ------------------------------ -------------------- ---- ---- ---------------------------- -- ------ ----------
  1089 TEST	  JDBC Thin Client	 SQL*Net more data from client	1413697536/4/0	       -1    0 /5gkc10wnagprk		    A	    0	     105
   961 TEST	  JDBC Thin Client	 SQL*Net more data from client	1413697536/1/0	       -1    0 5gkc10wnagprk/5gkc10wnagprk  A	    0	     105
  1249 TEST	  JDBC Thin Client	 SQL*Net more data from client	1413697536/5/0	       -1    0 dy4rgmf46x1q4/dy4rgmf46x1q4  A	    0	     105

[oracle@userdb-pre admin]$ ora active
 17:23:15 up 1 day,  1:15,  5 users,  load average: 3.07, 1.02, 0.36

   SID USERNAME   MACHINE		 EVENT				PARAM			W   WT SQL			    ST	   LT LOGON_TIME
------ ---------- ---------------------- ------------------------------ -------------------- ---- ---- ---------------------------- -- ------ ----------
   961 TEST	  JDBC Thin Client	 SQL*Net message from client	1413697536/1/0		3    0 5gkc10wnagprk/5gkc10wnagprk  A	    0	     105
     1 TEST	  JDBC Thin Client	 SQL*Net message from client	1413697536/1/0		2    0 dy4rgmf46x1q4/dy4rgmf46x1q4  A	    0	     104

[oracle@userdb-pre admin]$ ora active
 17:23:49 up 1 day,  1:15,  4 users,  load average: 3.68, 1.35, 0.49

   SID USERNAME   MACHINE		 EVENT				PARAM			W   WT SQL			    ST	   LT LOGON_TIME
------ ---------- ---------------------- ------------------------------ -------------------- ---- ---- ---------------------------- -- ------ ----------
  1025 TEST	  JDBC Thin Client	 SQL*Net message from client	1413697536/1/0		1    0 d1yrz7run4p1s/d1yrz7run4p1s  A	    0	     139

Step 2. Halt the database machine (halt the 63 server directly to simulate a machine crash)


Watch 10.0.0.24 (DGHA will automatically fail the standby over to primary; the redo logs and controlfile on shared storage are copied to the designated directories)

alert_log from 10.0.0.24

Thu Aug 15 17:37:50 2013
ALTER DATABASE   MOUNT
Changing di2dbun from dgha to dgha1
Successful mount of redo thread 1, with mount id 3944140846
Database mounted in Exclusive Mode
Lost write protection disabled
Completed: ALTER DATABASE   MOUNT
Thu Aug 15 17:37:55 2013
ALTER DATABASE RECOVER  database
Media Recovery Start
 started logmerger process
Parallel Media Recovery started with 48 slaves
Thu Aug 15 17:37:56 2013
Recovery of Online Redo Log: Thread 1 Group 3 Seq 60 Reading mem 0
  Mem# 0: /data1/dgha/redo03.log
Media Recovery Complete (dgha)
Completed: ALTER DATABASE RECOVER  database
alter database open
Beginning crash recovery of 1 threads
 parallel recovery started with 32 processes
Started redo scan
Completed redo scan
 read 779943 KB redo, 0 data blocks need recovery
Started redo application at
 Thread 1: logseq 58, block 35644
Recovery of Online Redo Log: Thread 1 Group 1 Seq 58 Reading mem 0
  Mem# 0: /data1/dgha/redo01.log
Recovery of Online Redo Log: Thread 1 Group 2 Seq 59 Reading mem 0
  Mem# 0: /data1/dgha/redo02.log
Recovery of Online Redo Log: Thread 1 Group 3 Seq 60 Reading mem 0
  Mem# 0: /data1/dgha/redo03.log
Completed redo application of 0.00MB
Completed crash recovery at
 Thread 1: logseq 60, block 31, scn 2500652
 0 data blocks read, 0 data blocks written, 779943 redo k-bytes read
LGWR: STARTING ARCH PROCESSES
...

Archived Log entry 191 added for thread 1 sequence 61 ID 0xeb14a3a4 dest 1:
ARC3: Standby redo logfile selected for thread 1 sequence 60 for destination LOG_ARCHIVE_DEST_3
******************************************************************
LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_2
******************************************************************
******************************************************************
LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_3
******************************************************************
LNS: Standby redo logfile selected for thread 1 sequence 62 for destination LOG_ARCHIVE_DEST_3
..


Completed: alter database open

The database has been recovered to the latest point in time (its datafile SCNs are now ahead of the 63 database; 63 can now be regarded as a snapshot of 24 at some point in the past).
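A quick way to confirm this once 63 is mounted again is to compare SCNs on both sites (a minimal sketch, not part of the original test output):

SQL> select to_char(current_scn) from v$database;
SQL> select min(checkpoint_change#), max(checkpoint_change#) from v$datafile_header;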


Step 3. third standby database (10.0.0.13)

alert_log from 10.0.0.13

Error 12543 received logging on to the standby
FAL[client, USER]: Error 12543 connecting to dgha for fetching gap sequence
Thu Aug 15 17:38:14 2013
Media Recovery Log /u01/oracle/11.2.0/oracle/product/11203/db1/dbs/arch1_61_823427815.dbf
Media Recovery Waiting for thread 1 sequence 62 (in transit)
Recovery of Online Redo Log: Thread 1 Group 4 Seq 62 Reading mem 0
  Mem# 0: /u01/oracle/oradata/dgha/standby04.log
Thu Aug 15 17:39:56 2013
Archived Log entry 62 added for thread 1 sequence 62 ID 0xeb14a3a4 dest 1:
Thu Aug 15 17:39:56 2013
RFS[14]: Selected log 4 for thread 1 sequence 63 dbid -350937948 branch 823427815
Thu Aug 15 17:39:56 2013
Media Recovery Waiting for thread 1 sequence 63 (in transit)
Recovery of Online Redo Log: Thread 1 Group 4 Seq 63 Reading mem 0
  Mem# 0: /u01/oracle/oradata/dgha/standby04.log
Thu Aug 15 17:40:08 2013
Archived Log entry 63 added for thread 1 sequence 63 ID 0xeb14a3a4 dest 1:
Thu Aug 15 17:40:08 2013
RFS[14]: Selected log 4 for thread 1 sequence 64 dbid -350937948 branch 823427815

可以看到 primary 自动切换为了10.0.0.24,13继续recover database.

Step 4. restart 63 database and join the dg cluster (24 is new primary database and 63 is standby database)

(1). copy a standby controlfile from the 24 database
(2). replace 63's controlfile with the new standby controlfile
(3). recreate the standby logfiles
(4). alter database recover managed standby database; (a command-level sketch follows below)
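A minimal command-level sketch of those four steps; the controlfile path, logfile group number and size are placeholders rather than values taken from the test systems.

-- on the new primary (10.0.0.24)
alter database create standby controlfile as '/tmp/standby_for_63.ctl';
-- copy that file to 10.0.0.63, overwrite every controlfile copy there, then on 63:
startup mount
alter database add standby logfile group 4 ('/data1/dgha/standby04.log') size 512m;
alter database recover managed standby database disconnect from session;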


alert_log:

..

Media Recovery Log /u01/oracle/arch/1_66_823427815.dbf
Media Recovery Waiting for thread 1 sequence 67 (in transit)

..
Physical standby database opened for read only access.
Completed: alter database open read only

Step 5. check data between 24 and 63 databases

from 24 database

SQL> select count(*) from test.CUSTOMERS              ;
select count(*) from test.WAREHOUSES             ;
select count(*) from test.ORDER_ITEMS            ;
select count(*) from test.ORDERS                 ;
select count(*) from test.INVENTORIES            ;
select count(*) from test.PRODUCT_INFORMATION    ;
select count(*) from test.LOGON                  ;
select count(*) from test.PRODUCT_DESCRIPTIONS   ;
select count(*) from test.ORDERENTRY_METADATA    ;

  COUNT(*)
----------
  39999996

SQL> 
  COUNT(*)
----------
      1000

SQL> 
  COUNT(*)
----------
 134986840

SQL> 
  COUNT(*)
----------
  45000000

SQL> 
  COUNT(*)
----------
    901565

SQL> 
  COUNT(*)
----------
      1000

SQL> 
  COUNT(*)
----------
   9999996

SQL> 
  COUNT(*)
----------
      1000

SQL> 
  COUNT(*)
----------
	 0

from 63 database

SQL> select count(*) from test.CUSTOMERS              ;
select count(*) from test.WAREHOUSES             ;
select count(*) from test.ORDER_ITEMS            ;
select count(*) from test.ORDERS                 ;
select count(*) from test.INVENTORIES            ;
select count(*) from test.PRODUCT_INFORMATION    ;
select count(*) from test.LOGON                  ;
select count(*) from test.PRODUCT_DESCRIPTIONS   ;
select count(*) from test.ORDERENTRY_METADATA    ;

  COUNT(*)
----------
  39999996

SQL> 
  COUNT(*)
----------
      1000

SQL> 
  COUNT(*)
----------
 134986840

SQL> 
  COUNT(*)
----------
  45000000

SQL> 
  COUNT(*)
----------
    901565

SQL> 
  COUNT(*)
----------
      1000

SQL> 
  COUNT(*)
----------
   9999996

SQL> 
  COUNT(*)
----------
      1000

SQL> 
  COUNT(*)
----------
	 0

Perfect: no rows were lost.

—————————————————————————

Now let's dissect DGHA's behavior.

Thu Aug 15 17:25:57 2013 - [info] oraha start,welcome to use this tool...
Thu Aug 15 17:25:57 2013 - [info] try to check (10.0.0.63:1521:dgha) and (10.0.0.24:1521:dgha) are really cluster or not.
Thu Aug 15 17:25:57 2013 - [info] try ssh to standby(10.0.0.24,1521,dgha)...
Thu Aug 15 17:25:57 2013 - [info] ssh to 10.0.0.24 success.
Thu Aug 15 17:25:58 2013 - [debug] standby query result: (DGHA,PHYSICAL STANDBY,dgha1)
Thu Aug 15 17:25:58 2013 - [info] try connect to primary(10.0.0.63,1521,dgha)...
Thu Aug 15 17:25:58 2013 - [debug] primary check query sql: (select count(*) cnt from V$ARCHIVE_DEST A,v$database B where A.status='VALID' AND A.DB_UNIQUE_NAME='dgha1' AND A.TARGET='STANDBY' AND B.NAME='DGHA' AND B.DATABASE_ROLE='PRIMARY')
Thu Aug 15 17:25:58 2013 - [info] Attemp to connect to (10.0.0.63:1521:dgha)...
Thu Aug 15 17:25:59 2013 - [info] connect to (10.0.0.63:1521:dgha) success.
Thu Aug 15 17:25:59 2013 - [debug] primary query result rn_cnt: 1
Thu Aug 15 17:25:59 2013 - [info] (10.0.0.63:1521:dgha) and (10.0.0.24:1521:dgha) are really cluster.
Thu Aug 15 17:25:59 2013 - [info] the primary database is (10.0.0.63:1521:dgha)
Thu Aug 15 17:25:59 2013 - [info] the standby database is (10.0.0.24:1521:dgha)
Thu Aug 15 17:25:59 2013 - [info] ssh to 10.0.0.63 success.
Thu Aug 15 17:25:59 2013 - [info] ssh to 10.0.0.24 success.
Thu Aug 15 17:26:03 2013 - [info] ssh to 192.168.254.63 success.
Thu Aug 15 17:26:03 2013 - [info] ssh check passed.
Thu Aug 15 17:26:03 2013 - [info] (10.0.0.63:1521:dgha) /home/oracle/oraha/db_stop.sh check passed.
Thu Aug 15 17:26:03 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step01.sh check passed.
Thu Aug 15 17:26:03 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step02.pl check passed.
Thu Aug 15 17:26:03 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step03.sh check passed.
Thu Aug 15 17:26:04 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_check.sh check passed.
Thu Aug 15 17:26:04 2013 - [info] scripts under /home/oracle/oraha check passed.
Thu Aug 15 17:26:04 2013 - [info] I will wait until (10.0.0.63:1521:dgha) unreachable.health check interval:8
Thu Aug 15 17:26:04 2013 - [info] ======================================================
Thu Aug 15 17:26:04 2013 - [info] ----------             welcome              ----------
Thu Aug 15 17:26:04 2013 - [info] ======================================================
Thu Aug 15 17:26:04 2013 - [debug] connect to (10.0.0.63:1521:dgha) error_count:0
Thu Aug 15 17:31:50 2013 - [debug] UPDATE HACHECK success.
Thu Aug 15 17:31:58 2013 - [debug] connect to (10.0.0.63:1521:dgha) error_count:0

DGHA periodically updates a check table to decide whether the instance is still alive.
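A minimal sketch of that heartbeat, assuming a one-row check table; the table name HACHECK and its columns are inferred from the "UPDATE HACHECK success." log line above, not taken from the tool's source.

create table hacheck (id number primary key, last_seen date);
-- issued every health-check interval (8 seconds in this test); a timeout or
-- any ORA- error here counts as one failed probe
update hacheck set last_seen = sysdate where id = 1;
commit;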

Thu Aug 15 17:32:03 2013 - [warning] connect to (10.0.0.63:1521:dgha) was abort! connect again.
Thu Aug 15 17:32:03 2013 - [debug] update check == 1.undef_count is 0
Thu Aug 15 17:32:03 2013 - [info] I will destory dbh.
Issuing rollback() due to DESTROY without explicit disconnect() of DBD::Oracle::db handle (DESCRIPTION=(ADDRESS=(HOST=10.0.0.63)(PROTOCOL=tcp)(PORT=1521))(CONNECT_DATA=(SID=dgha))) at oraha.pl line 509.
DBI db handle 0x1efbf710 cleared whilst still active at oraha.pl line 509.
DBI db handle 0x1efbf710 has uncleared implementors data at oraha.pl line 509.
    dbih_clearcom (dbh 0x1efbf710, com 0x1ee4d4f0, imp DBD::Oracle::db):
       FLAGS 0x182097: COMSET IMPSET Active Warn RaiseError PrintWarn ShowErrorStatement 
       PARENT DBI::dr=HASH(0x1ef1df90)
       KIDS 0 (0 Active)
Thu Aug 15 17:32:08 2013 - [warning] destory dbh  was abort!
Thu Aug 15 17:32:08 2013 - [info] Attemp to connect to (10.0.0.63:1521:dgha)...
Thu Aug 15 17:32:13 2013 - [error][oraha.pl, ln426] connect to  (10.0.0.63:1521:dgha) timeout
Thu Aug 15 17:32:13 2013 - [warning] connect to (10.0.0.63:1521:dgha) error_count:1
Thu Aug 15 17:32:21 2013 - [debug] connect to (10.0.0.63:1521:dgha) error_count:1
Thu Aug 15 17:32:21 2013 - [debug] the dbh is null.
Thu Aug 15 17:32:21 2013 - [info] Attemp to connect to (10.0.0.63:1521:dgha)...
Thu Aug 15 17:32:26 2013 - [error][oraha.pl, ln426] connect to  (10.0.0.63:1521:dgha) timeout
Thu Aug 15 17:32:26 2013 - [warning] connect to (10.0.0.63:1521:dgha) was abort! connect again.
Thu Aug 15 17:32:26 2013 - [debug] update check == 1.undef_count is 1
Thu Aug 15 17:32:26 2013 - [info] I will destory dbh.
Thu Aug 15 17:32:26 2013 - [warning] connect to (10.0.0.63:1521:dgha) error_count:2
Thu Aug 15 17:32:34 2013 - [debug] connect to (10.0.0.63:1521:dgha) error_count:2
Thu Aug 15 17:32:34 2013 - [debug] the dbh is null.
Thu Aug 15 17:32:34 2013 - [info] Attemp to connect to (10.0.0.63:1521:dgha)...
Thu Aug 15 17:32:39 2013 - [error][oraha.pl, ln426] connect to  (10.0.0.63:1521:dgha) timeout
Thu Aug 15 17:32:39 2013 - [warning] connect to (10.0.0.63:1521:dgha) was abort! connect again.
Thu Aug 15 17:32:39 2013 - [debug] update check == 1.undef_count is 1
Thu Aug 15 17:32:39 2013 - [info] I will destory dbh.
Thu Aug 15 17:32:39 2013 - [warning] connect to (10.0.0.63:1521:dgha) error_count:3
Thu Aug 15 17:32:39 2013 - [info] try connect to the edm...
Thu Aug 15 17:32:39 2013 - [info] send message:insert into edm_user.tb_queue(ID,PHONE,MSG,STATUS,SENDLEVEL,svrtype,INSERTTIME) 
                                values(edm_user.SEQ_QUE.NEXTVAL,'13817963180','the primary(10.0.0.63:1521:dgha) is unreachable.','',1,'11',sysdate)
Thu Aug 15 17:32:39 2013 - [info] send message:insert into edm_user.tb_queue(ID,PHONE,MSG,STATUS,SENDLEVEL,svrtype,INSERTTIME) 
                                values(edm_user.SEQ_QUE.NEXTVAL,'18666668061','the primary(10.0.0.63:1521:dgha) is unreachable.','',1,'11',sysdate)
Thu Aug 15 17:32:47 2013 - [warning] ssh to 10.0.0.63 err. error_count:1
ssh: connect to host 10.0.0.63 port 22: No route to host
Thu Aug 15 17:32:50 2013 - [warning] ssh to 10.0.0.63 err. error_count:2
ssh: connect to host 10.0.0.63 port 22: No route to host
Thu Aug 15 17:32:53 2013 - [warning] ssh to 10.0.0.63 err. error_count:3
Thu Aug 15 17:32:54 2013 - [info] ssh to 10.0.0.24 success.
Thu Aug 15 17:32:57 2013 - [info] ssh to 192.168.254.63 success.

After several failed updates DGHA concludes that the Oracle instance is dead. It then tries to ssh to 63, which also fails (it will connect to 63's remote management card and power the machine off). It likewise inspects the state of the candidate standby; if the standby is fit for failover it starts the failover procedure and, at the same time, sends an SMS announcing that 63 is down.
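The "ready to failover" checks in the log below (open mode, database role, apply lag) correspond roughly to queries like the following sketch; these statements are illustrative and are not extracted from the DGHA source.

select open_mode, database_role from v$database;
select value from v$dataguard_stats where name = 'apply lag';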


Thu Aug 15 17:32:57 2013 - [info] try to check (10.0.0.24:1521:dgha) is ready to failover to primary or not.
Thu Aug 15 17:32:57 2013 - [info] the standby open mode is:MOUNTED. database role is:PHYSICAL STANDBY.
Thu Aug 15 17:32:57 2013 - [info] standby lag minutes is: 0
Thu Aug 15 17:32:57 2013 - [info] standby (10.0.0.24:1521:dgha) healthcheck is passed. Go on.
Thu Aug 15 17:32:57 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step01.sh check passed.
Thu Aug 15 17:32:58 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step02.pl check passed.
Thu Aug 15 17:32:58 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step03.sh check passed.
Thu Aug 15 17:32:58 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_check.sh check passed.
Thu Aug 15 17:32:58 2013 - [info] standby (10.0.0.24:1521:dgha) scripts under /home/oracle/oraha check passed.
Thu Aug 15 17:32:58 2013 - [info] Power of 10.0.0.63  was successfully turned off.
Thu Aug 15 17:32:58 2013 - [info] I will execute /home/oracle/oraha/db_start_step01.sh on 10.0.0.24
Thu Aug 15 17:35:58 2013 - [debug] db_start output: 
SQL*Plus: Release 11.2.0.3.0 Production on Thu Aug 15 17:32:58 2013

Copyright (c) 1982, 2011, Oracle.  All rights reserved.


Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> 
Thu Aug 15 17:35:58 2013 - [error][oraha.pl, ln180] Oracle on 10.0.0.24 shutdown failed.Kill the pmon next.
Thu Aug 15 17:35:58 2013 - [debug] Oracle pmon pid is: 10179
Thu Aug 15 17:35:58 2013 - [debug] kill -9  output: 
Thu Aug 15 17:36:08 2013 - [info] I will execute /home/oracle/oraha/db_start_step02.pl on 10.0.0.24
Thu Aug 15 17:37:47 2013 - [debug] db_start output: 
Thu Aug 15 17:37:47 2013 - [info] I will execute /home/oracle/oraha/db_start_step03.sh on 10.0.0.24
Thu Aug 15 17:38:06 2013 - [debug] db_start output: 
SQL*Plus: Release 11.2.0.3.0 Production on Thu Aug 15 17:37:47 2013

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> ORACLE instance started.

Total System Global Area 4.0486E+10 bytes
Fixed Size                  2237088 bytes
Variable Size            4026535264 bytes
Database Buffers         3.6373E+10 bytes
Redo Buffers               84631552 bytes
Database mounted.
SQL> Media recovery complete.
SQL> 
Database altered.

SQL> Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Thu Aug 15 17:38:08 2013 - [info] try to check the new primary(10.0.0.24,1521,dgha) after failover .
Thu Aug 15 17:38:08 2013 - [info] try connect to the new primary(10.0.0.24,1521,dgha)...
Thu Aug 15 17:38:08 2013 - [info] Attemp to connect to (10.0.0.24:1521:dgha)...
Thu Aug 15 17:38:08 2013 - [info] connect to (10.0.0.24:1521:dgha) success.
Thu Aug 15 17:38:08 2013 - [info] the new primary open mode is:READ WRITE. database role is:PRIMARY.
Thu Aug 15 17:38:08 2013 - [info] (10.0.0.63:1521:dgha) failover to (10.0.0.24:1521:dgha) successful.
Thu Aug 15 17:38:08 2013 - [info] try connect to the edm...
Thu Aug 15 17:38:08 2013 - [info] send message:insert into edm_user.tb_queue(ID,PHONE,MSG,STATUS,SENDLEVEL,svrtype,INSERTTIME) 
                                values(edm_user.SEQ_QUE.NEXTVAL,'13817963180','(10.0.0.63:1521:dgha) failover to (10.0.0.24:1521:dgha) successful.','',1,'11',sysdate)
Thu Aug 15 17:38:08 2013 - [info] send message:insert into edm_user.tb_queue(ID,PHONE,MSG,STATUS,SENDLEVEL,svrtype,INSERTTIME) 
                                values(edm_user.SEQ_QUE.NEXTVAL,'18666668061','(10.0.0.63:1521:dgha) failover to (10.0.0.24:1521:dgha) successful.','',1,'11',sysdate)
Killed by signal 1.
Killed by signal 1.

After recovery using the redo on shared storage finishes, DGHA likewise sends an SMS notification and exits.


Below are the failover tests for various scenarios:

1. Simulated instance hang (oradebug frozen instance) -> all existing connections hang. A new process is spawned to update the check table, with two possible outcomes: if the update succeeds, no failover; if the hang makes the update fail, failover. The earlier processes, however, are left in a "zombie" state.

2. Simulated listener hang (kill -STOP the listener pid and kill the active sessions): the new process cannot update and times out -> ssh to the host and run shutdown abort -> connect to the remote management card and power the host off -> fail over to the candidate standby.

3. Simulated machine down (halt the machine directly; both the update and ssh fail, the host is judged dead, and failover starts immediately).

4. NIC down (handled like case 1).
In cases 1 and 4 the connections previously created by DGHA cannot be reclaimed, so a script deployed locally on the database host cleans them up periodically based on the DGHA machine name and idle time (see the sketch after this list).

5. After the failover to the new primary completes, the old primary can rejoin the cluster if needed (copy a standby controlfile from the new primary; the standby logfiles must be recreated).
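A hedged sketch of the local cleanup mentioned under case 4: find sessions opened from the DGHA monitoring host that have been idle too long, then kill them. The machine name and the 600-second threshold are placeholders, not values from the real deployment.

select sid, serial#, machine, last_call_et
  from v$session
 where machine = 'dgha-monitor-host'
   and status = 'INACTIVE'
   and last_call_et > 600;
-- then, for each row returned:
-- alter system kill session '<sid>,<serial#>' immediate;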

Active dataguard tips

January 23, 2013 oracle, replication

11g Active Data Guard (ADG) gives us more disaster-recovery options, and events such as the recent hurricane in the US gave ADG a chance to shine. ADG is not as perfect as people imagine, though. Below is a recent case; I won't write up the whole troubleshooting process in detail, and the case is still being followed up, so this is just a record of it:

[oracle@dcb-srv-0174 trace]$ cat itemstd02_ora_22156.trc |grep "LibraryObjectLock" |awk '{print $3 "-- obj handle get"}' | uniq -c |awk  '{ if ($1>100) print $0}'
    107 Handle=0x2efee6a060-- obj handle get
    107 Handle=0x2eff322578-- obj handle get
    113 Handle=0x2eff54acf0-- obj handle get
    118 Handle=0x2f1f0b89c8-- obj handle get
    118 Handle=0x2f1fc813c0-- obj handle get
    125 Handle=0x2f3e772c58-- obj handle get
    127 Handle=0x2f1fccd750-- obj handle get
    128 Handle=0x2f7efaa2a0-- obj handle get
    129 Handle=0x2f7f831ac0-- obj handle get
    130 Handle=0x2f3e171640-- obj handle get
    130 Handle=0x2f7ee7a100-- obj handle get
    131 Handle=0x2f7ee7a268-- obj handle get
    132 Handle=0x2f7f855570-- obj handle get
    147 Handle=0x2f1a8f0bf8-- obj handle get
    147 Handle=0x2f3edd63d0-- obj handle get
    150 Handle=0x2f3edd6620-- obj handle get
    150 Handle=0x2f5ef05050-- obj handle get
    150 Handle=0x2f5f213530-- obj handle get
    154 Handle=0x2f3e81b988-- obj handle get
    159 Handle=0x2f3e926498-- obj handle get
    182 Handle=0x2eff03c3c0-- obj handle get
    190 Handle=0x2f1f5a2858-- obj handle get
    192 Handle=0x2f1f859808-- obj handle get
    195 Handle=0x2fbf7f46f8-- obj handle get
    195 Handle=0x2fbfe9eb80-- obj handle get
    197 Handle=0x2f5ec64840-- obj handle get
    200 Handle=0x2eff4e7698-- obj handle get
    200 Handle=0x2f5f54e510-- obj handle get
    202 Handle=0x2f1fee7520-- obj handle get
    203 Handle=0x2eff975f98-- obj handle get
    204 Handle=0x2eff0480e0-- obj handle get
    204 Handle=0x2efff29160-- obj handle get
    207 Handle=0x2f7eedaae0-- obj handle get
    208 Handle=0x2f3fd32ef0-- obj handle get
    208 Handle=0x2f3fd33510-- obj handle get
    209 Handle=0x2f3fd33ef0-- obj handle get
    209 Handle=0x2f3fd34140-- obj handle get
    211 Handle=0x2f1f9ed050-- obj handle get
    211 Handle=0x2f1fee7770-- obj handle get
    212 Handle=0x2f1fc7ee38-- obj handle get
    212 Handle=0x2f9fa5a770-- obj handle get
    213 Handle=0x2efff297e0-- obj handle get
    213 Handle=0x2f7fa54a28-- obj handle get
    218 Handle=0x2f9fce2748-- obj handle get
    231 Handle=0x2eff04a950-- obj handle get
    247 Handle=0x2ef65d8a88-- obj handle get
    264 Handle=0x2effd1c938-- obj handle get
    268 Handle=0x2f3ee0e1a8-- obj handle get
    268 Handle=0x2f3f6469a0-- obj handle get
    270 Handle=0x2f1f8a6120-- obj handle get
    270 Handle=0x2f1fcda868-- obj handle get
    290 Handle=0x2f7f85ade0-- obj handle get
    378 Handle=0x2f7f85b030-- obj handle get
    468 Handle=0x2f3f28ec08-- obj handle get
    697 Handle=0x2f3fd326e0-- obj handle get
    800 Handle=0x2efca0bdd0-- obj handle get
    847 Handle=0x2f9ff0e448-- obj handle get
[oracle@dcb-srv-0174 trace]$ cat temp.log |grep "library cache load lock" |wc -l && cat temp.log |grep  "cursor: pin S" |wc -l
619
158

..........


ADDR	00002B4642274E78
INDX	51844
INST_ID	1
KGLHDADR	0000002F3F28EC08
KGLHDPAR	0000002F3F28EC08
KGLHDCLT	65657
KGLNAOWN	SYS
KGLNAOBJ	TRG_SERVER_ERRORS
KGLFNOBJ	<CLOB>
KGLNADLK	
KGLNAHSH	102957177
KGLNAHSV	e230e86a06763591475191a006230079
KGLNATIM	27-10月-12
KGLNAPTM	
KGLHDNSP	3
KGLHDNSD	TRIGGER
KGLHDLMD	0
KGLHDPMD	0
KGLHDFLG	10241
KGLHDOBJ	0000002C9F3EA470
KGLHDLDC	1
KGLHDIVC	0
KGLHDEXC	0
KGLHDLKC	0
KGLHDKMK	0
KGLHDDMK	67
KGLHDAMK	0
KGLOBFLG	4
KGLOBSTA	1
KGLOBTYP	12
KGLOBTYD	TRIGGER
KGLOBHMK	0
KGLOBHS0	4712
KGLOBHS1	2480
KGLOBHS2	0
KGLOBHS3	0
KGLOBHS4	0
KGLOBHS5	0
KGLOBHS6	1032
KGLOBHS7	0
KGLOBHD0	0000002F3EE199D8
KGLOBHD1	0000002C9F3EABA0
KGLOBHD2	00
KGLOBHD3	00
KGLOBHD4	00
KGLOBHD5	00
KGLOBHD6	0000002C9F3EAC28
KGLOBHD7	00
KGLOBPC0	0
KGLOBPC6	0
KGLOBTP0	00
KGLOBT00	0
KGLOBT01	0
KGLOBT02	0
KGLOBT03	
KGLOBT04	0
KGLOBT05	0
KGLOBT35	0
KGLOBT06	0
KGLOBT07	0
KGLOBT08	0
KGLOBT09	0
KGLOBT10	0
KGLOBT11	0
KGLOBT12	0
KGLOBT13	0
KGLOBT14	0
KGLOBT15	0
KGLOBT16	0
KGLOBT17	0
KGLOBT18	0
KGLOBT19	0
KGLOBT20	0
KGLOBT21	0
KGLOBT22	0
KGLOBT23	2382
KGLOBT24	1795
KGLOBT25	0
KGLOBT26	0
KGLOBT28	0
KGLOBT29	0
KGLOBT30	0
KGLOBT31	0
KGLOBT27	0
KGLOBT32	0
KGLOBT33	0
KGLOBWAP	0
KGLOBWCC	0
KGLOBWCL	0
KGLOBWUI	0
KGLOBWDW	0
KGLOBT42	0
KGLOBT43	0
KGLOBT44	0
KGLOBT45	0
KGLOBT46	0
KGLOBT47	0
KGLOBT49	0
KGLOBT50	0
KGLOBT52	0
KGLOBT53	0
KGLOBTL0	0
KGLOBTL1	0
KGLOBTS0	
KGLOBTS1	
KGLOBTN0	
KGLOBTN1	
KGLOBTN2	
KGLOBTN3	
KGLOBTN4	
KGLOBTN5	
KGLOBTS2	
KGLOBTS3	
KGLOBTS5	
KGLOBTT0	
KGLOBCCE	
KGLOBCCEH	0
KGLOBCLA	
KGLOBCLC	0
KGLOBCCC	0
KGLOBTS4	
KGLOBCBCA	
KGLOBT48	0
KGLOBDSO	0
KGLOBDEX	0
KGLOBDPX	0
KGLOBDLD	0
KGLOBDIV	0
KGLOBDPS	0
KGLOBDDR	0
KGLOBDDW	0
KGLOBDBF	0
KGLOBDRO	0
KGLOBDCP	0
KGLOBDEL	0
KGLOBDFT	0
KGLOBDEF	0
KGLOBDUI	0
KGLOBDCL	0
KGLOBDAP	0
KGLOBDCC	0
KGLOBDPL	0
KGLOBDJV	0
KGLOBACS	0
KGLOBTS6	
KGLOBTS7	
KGLOBT54	0
KGLOBT55	0
KGLOBT56	0
KGLOBT57	0
KGLOBDCO	0
KGLOBDCI	0
KGLOBDRR	0
KGLOBDRB	0
KGLOBDWR	0
KGLOBDWB	0
KGLOBT58	0
KGLOBDOR	0
KGLHDMTX	0000002F3F28ED38
KGLHDMVL	0
KGLHDMSP	558
KGLHDMGT	667484
KGLHDDMTX	0000002F3F28ECB8
KGLHDDVL	0
KGLHDDSP	0
KGLHDDGT	0
KGLHDBID	65657
KGLHDBMTX	0000002E3546D210
KGLHDBVL	0
KGLHDBSP	3
KGLHDBGT	561277
KGLOBT59	0
KGLOBDCU	0
KGLOBPROP

I won't paste the problem SQL here; it performs full table scans on three tables, and once concurrency ramped up it briefly hung the ADG instance. The state at the time showed a lot of interacting library-cache style waits (library cache load lock and cursor: pin S).
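Assuming a new session can still be opened on the standby during such a hang, a quick way to see which wait events dominate is a query along these lines (a generic sketch, not output from this case):

select event, count(*)
  from v$session
 where wait_class <> 'Idle'
 group by event
 order by 2 desc;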

On the mechanism involved here:

The connection between log file switches and the media recovery checkpoints is just a little messier than I’ve described, and 11g has introduced two new parameters, _defer_log_boundary_ckpt (defer media recovery checkpoint at log boundary) and _defer_log_count (number of log boundaries media recovery checkpoint lags behind), defining this behavior. Effectively these parameters allow Oracle to be lazy about copying buffers to disk on media recovery checkpoints until a few extra log files switches have taken place. This type of mechanism has been in place since 9i—but with no visible controls—presumably in the hope that a little extra delay at peak processing periods would allow incremental checkpointing to spread the write load that an immediate media recovery checkpoint would cause.

Mandatory checkpoint at every log boundary

Recovery slaves wait on log boundary checkpoint

No redo apply during log boundary checkpoint

Incremental recovery checkpoints keep log boundary checkpoint duration smaller

Checkpoint performed prior to stopping MRP

Parameters “_defer_log_boundary_ckpt” and “_defer_log_count” define the behavior (not fully implemented in 11g)

Recovery until consistent required after crash of ADG

Notes on a few other parameters:

_row_cache_cursors: see Steve Adams' "Tuning the _row_cache_cursors".

session_cached_cursors: set this parameter to relieve library cache load lock waits.

_row_cr: setting it to FALSE relieves row CR work; its effect on dc_rollback_segments contention is unclear.

_defer_log_count: affects checkpointing on the ADG target side.

_log_committime_block_cleanout: no concrete scope of impact so far. (When it is set to TRUE we will try to perform block cleanout when a transaction commits, rather than applying delayed logging cleanout to the affected blocks. The commit causes us to walk the associated list of block state objects owned by the transaction and perform the relevant form of cleanout on each block, depending on the parameter's setting.)
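All of these are hidden parameters; their current values and descriptions can be checked with a query along these lines, run as SYS (a generic sketch):

select a.ksppinm name, b.ksppstvl value, a.ksppdesc description
  from x$ksppi a, x$ksppcv b
 where a.indx = b.indx
   and a.ksppinm in ('_defer_log_boundary_ckpt', '_defer_log_count', '_log_committime_block_cleanout');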

Delayed logging behavior, as described by Jonathan Lewis (JL):

When you commit, the commit cleanout updates some of the changed data blocks by stamping them with a couple of SCNs. But the commit still takes only 60 bytes of redo: the changes to the data blocks are not logged, i.e. they are delayed. When the block is subsequently updated by another transaction, the 'full' cleanout of the ITL and its row locks takes place. However, if you examine the redo generated at this point, you will find that it has been engineered to look as if the new update started from a fully cleaned out block. In this respect, the logging of the commit cleanout has not been delayed, it has been totally bypassed.

Also, by Steve Adams: Read Consistency.

_db_lost_write_checking and db_lost_write_protect can also change "_log_committime_block_cleanout".

Set "_log_committime_block_cleanout" to TRUE on the primary if possible; the redo overhead is less than 5% (per PayPal).

For this case we are currently using a TNS-level HA connection that round-robins across several available ADG instances, combined with automatic failover in the application. ADG still has some problems serving highly concurrent, poorly written SQL, including very hot row cache latch contention. A logical standby or OGG (Oracle GoldenGate) can be considered for serving large asynchronous reporting queries.
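A hedged example of such a TNS alias (an Oracle Net configuration sketch); the host names and service name are placeholders, not the ones used in this case.

ADG_READ =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE = ON)
      (FAILOVER = ON)
      (ADDRESS = (PROTOCOL = TCP)(HOST = adg-node1)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = adg-node2)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = report_srv)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 30)(DELAY = 5))
    )
  )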

Physical standby Failover

June 28, 2012 oracle, replication

db1 (liu)

SQL> select * from v$version;

BANNER
—————————————————————-
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 – Prod
PL/SQL Release 10.2.0.1.0 – Production
CORE 10.2.0.1.0 Production
TNS for Linux: Version 10.2.0.1.0 – Production
NLSRTL Version 10.2.0.1.0 – Production

db2 (yang)

SQL> select * from v$archive_gap;

no rows selected

SQL>

Case 1: a normal failover

On the physical standby:

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE;

Database altered.

SQL>

Wed Jun 27 13:47:15 2012
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE
Wed Jun 27 13:47:15 2012
Terminal Recovery: Stopping real time apply
Wed Jun 27 13:47:15 2012
MRP0: Background Media Recovery cancelled with status 16037
Wed Jun 27 13:47:15 2012
Errors in file /u01/app/oracle/admin/liu/bdump/liu_mrp0_10260.trc:
ORA-16037: user requested cancel of managed recovery operation
Managed Standby Recovery not using Real Time Apply
Recovery interrupted!
Recovered data files to a consistent state at change 456564
Wed Jun 27 13:47:16 2012
Errors in file /u01/app/oracle/admin/liu/bdump/liu_mrp0_10260.trc:
ORA-16037: user requested cancel of managed recovery operation
Wed Jun 27 13:47:16 2012
MRP0: Background Media Recovery process shutdown (liu)
Wed Jun 27 13:47:16 2012
Terminal Recovery: Stopped real time apply
Wed Jun 27 13:47:16 2012
Attempt to do a Terminal Recovery (liu)
Wed Jun 27 13:47:16 2012
Media Recovery Start: Managed Standby Recovery (liu)
Managed Standby Recovery not using Real Time Apply
parallel recovery started with 2 processes
Media Recovery Waiting for thread 1 sequence 2 (in transit)
Terminal Recovery timestamp is ’06/27/2012 13:47:18′
Terminal Recovery: applying standby redo logs.
Terminal Recovery: thread 1 seq# 2 redo required
Terminal Recovery: /u01/app/oracle/oradata/liu/redo05.log
Identified End-Of-Redo for thread 1 sequence 2
Wed Jun 27 13:47:18 2012
Incomplete recovery applied all redo ever generated.
Recovery completed through change 456566
Wed Jun 27 13:47:18 2012
Media Recovery Complete (liu)
Terminal Recovery: successful completion
Begin: Standby Redo Logfile archival
End: Standby Redo Logfile archival
Resetting standby activation ID 2612058581 (0x9bb0d9d5)
Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE

SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;

Database altered.

SQL> ALTER DATABASE OPEN;

Database altered.

SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.

db2 is now the primary:

[oracle@yang ~]$ sqlplus '/as sysdba'

SQL*Plus: Release 10.2.0.1.0 – Production on Wed Jun 27 14:01:34 2012

Copyright (c) 1982, 2005, Oracle. All rights reserved.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 – Production
With the Partitioning, OLAP and Data Mining options

SQL> SELECT TO_CHAR(STANDBY_BECAME_PRIMARY_SCN) FROM V$DATABASE;

TO_CHAR(STANDBY_BECAME_PRIMARY_SCN)
—————————————-
479312

SQL>

DB1 becomes the standby: flash it back

SQL> FLASHBACK DATABASE TO SCN 479312;

Flashback complete.

SQL> ALTER DATABASE CONVERT TO PHYSICAL STANDBY;

Database altered.

Re-enable redo apply:

Attempt to start background Managed Standby Recovery process (liu)
MRP0 started with pid=20, OS id=22199
Wed Jun 27 14:52:31 2012
MRP0: Background Managed Standby Recovery process started (liu)
Managed Standby Recovery not using Real Time Apply
Media Recovery Log /u01/arch/1_10_787062936.dbf
Media Recovery Log /u01/arch/1_11_787062936.dbf
Media Recovery Log /u01/arch/1_12_787062936.dbf
Wed Jun 27 14:52:37 2012
Completed: alter database recover managed standby database disconnect
Wed Jun 27 14:52:38 2012
Media Recovery Log /u01/arch/1_13_787062936.dbf
Media Recovery Log /u01/arch/1_14_787062936.dbf
Media Recovery Waiting for thread 1 sequence 15 (in transit)

Redo apply is working normally.


Case 2: failover by activating the standby database

Here we simulate a network outage:

SQL> create table failover(id int);

Table created.

SQL> alter system set log_archive_dest_state_2=defer;

System altered.

SQL> alter system switch logfile;

System altered.

This archived log will not be shipped to the standby.

On the physical standby:

SQL> alter database recover managed standby database disconnect;

Database altered.

SQL>
SQL>
SQL> alter database recover managed standby database cancel;

Database altered.

SQL> recover standby database until cancel;
ORA-00279: change 500198 generated at 06/27/2012 14:57:52 needed for thread 1
ORA-00289: suggestion : /u01/arch/1_16_787062936.dbf
ORA-00280: change 500198 for thread 1 is in sequence #16

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
cancel;
ORA-00308: cannot open archived log 'cancel;'
ORA-27037: unable to obtain file status
Linux Error: 2: No such file or directory
Additional information: 3

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
cancel
Media recovery cancelled.
SQL> alter database activate standby database;

Database altered.

SQL> shutdown immediate;
ORA-01109: database not open

Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.

Total System Global Area 524288000 bytes
Fixed Size 1220336 bytes
Variable Size 150995216 bytes
Database Buffers 364904448 bytes
Redo Buffers 7168000 bytes
Database mounted.
Database opened.
SQL> SELECT TO_CHAR(STANDBY_BECAME_PRIMARY_SCN) FROM V$DATABASE;

TO_CHAR(STANDBY_BECAME_PRIMARY_SCN)
—————————————-
500196

SQL> select database_role from v$database;

DATABASE_ROLE
—————-
PRIMARY

SQL>

The database has now switched to the PRIMARY role.

Resetting resetlogs activation ID 2612053838 (0x9bb0c74e)
Online log /u01/app/oracle/oradata/liu/redo01.log: Thread 1 Group 1 was previously cleared
Online log /u01/app/oracle/oradata/liu/redo02.log: Thread 1 Group 2 was previously cleared
Online log /u01/app/oracle/oradata/liu/redo03.log: Thread 1 Group 3 was previously cleared
Standby became primary SCN: 500196

A resetlogs operation has taken place at this point.

SQL> archive log list;
Database log mode Archive Mode
Automatic archival Enabled
Archive destination /u01/arch
Oldest online log sequence 1
Next log sequence to archive 2
Current log sequence 2

Now let's look at the original primary:

SQL> shutdown abort;
ORACLE instance shut down.

SQL> startup mount;
ORACLE instance started.

Total System Global Area 524288000 bytes
Fixed Size 1220336 bytes
Variable Size 150995216 bytes
Database Buffers 364904448 bytes
Redo Buffers 7168000 bytes
Database mounted.
SQL> flashback database to scn 500196;

Flashback complete.

SQL> ALTER DATABASE CONVERT TO PHYSICAL STANDBY;

Database altered.

SQL> SHUTDOWN IMMEDIATE;
ORA-01507: database not mounted

ORACLE instance shut down.
SQL> STARTUP MOUNT;
ORACLE instance started.

Total System Global Area 524288000 bytes
Fixed Size 1220336 bytes
Variable Size 150995216 bytes
Database Buffers 364904448 bytes
Redo Buffers 7168000 bytes
Database mounted.

SQL> alter database recover managed standby database disconnect;

Database altered.

SQL> select database_role from v$database;

DATABASE_ROLE
—————-
PHYSICAL STANDBY

Let's look at the log on the standby side:

Using STANDBY_ARCHIVE_DEST parameter default value as /u01/arch
Redo Shipping Client Connected as PUBLIC
— Connected User is Valid
RFS LogMiner: Client disabled from further notification
RFS[1]: New Archival REDO Branch(resetlogs_id): 787074684 Prior: 787074091
RFS[1]: Archival Activation ID: 0x9bb07de2 Current: 0x0
RFS[1]: Effect of primary database OPEN RESETLOGS


RFS[1]: Assigned to RFS process 23110
RFS[1]: Identified database type as ‘physical standby’
Wed Jun 27 15:05:38 2012
RFS LogMiner: Client disabled from further notification
RFS[1]: Archived Log: ‘/u01/arch/1_1_787071479.dbf’
New incarnation branch detected in ArchiveLog, filename /u01/arch/1_1_787071479.dbf
Inspection of file changed rdi from 2 to 3
Setting recovery target incarnation to 3

The incarnation was switched automatically.

Wed Jun 27 15:05:40 2012
RFS[1]: Incarnation entry added for Branch(resetlogs_id): 787071479 (liu)
Wed Jun 27 15:05:40 2012
Setting recovery target incarnation to 3
Wed Jun 27 15:05:40 2012
MRP0: Incarnation has changed! Retry recovery…

After the reset, recovery starts again:

Wed Jun 27 15:05:40 2012
Errors in file /u01/app/oracle/admin/liu/bdump/liu_mrp0_23056.trc:
ORA-19906: recovery target incarnation changed during recovery
Recovery interrupted!
Wed Jun 27 15:05:40 2012
Errors in file /u01/app/oracle/admin/liu/bdump/liu_mrp0_23056.trc:
ORA-19906: recovery target incarnation changed during recovery
Wed Jun 27 15:06:00 2012
Managed Standby Recovery not using Real Time Apply
Media Recovery apply resetlogs offline range for datafile 1, incarnation : 0
Media Recovery apply resetlogs offline range for datafile 2, incarnation : 0
Media Recovery apply resetlogs offline range for datafile 3, incarnation : 0
Media Recovery apply resetlogs offline range for datafile 4, incarnation : 0

The resetlogs offline ranges are applied:

Media Recovery Log /u01/arch/1_1_787071479.dbf
Media Recovery Waiting for thread 1 sequence 2
Fetching gap sequence in thread 1, gap sequence 2-14

Gap fetching begins.

Now check the table failover on both servers:

SQL> desc failover;
ERROR:
ORA-04043: object test_liu does not exist

The table is gone. In other words, all data from the SCN at which the failure occurred up to the completion of the failover is lost.

To summarize:

1. In a normal failover, the primary and standby roles change and the other standbys in the DG cluster can remain unaffected.
2. In the non-routine failover (activating the standby database), media recovery is required and the database must be opened with resetlogs; every standby in the DG cluster then has to be flashed back to the STANDBY_BECAME_PRIMARY_SCN, and data will be lost.
3. In both cases flashback logging must already be enabled (see the sketch below).
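A minimal sketch of enabling flashback logging beforehand on a 10.2 database (the database must already be in archivelog mode; the recovery-area size and location are placeholders):

SQL> alter system set db_recovery_file_dest_size=20g;
SQL> alter system set db_recovery_file_dest='/u01/fra';
SQL> shutdown immediate
SQL> startup mount
SQL> alter database flashback on;
SQL> alter database open;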

Recover physical standby database after loss of archive log(2)

June 20, 2012 oracle, replication

Recover physical standby database after loss of archive log(2)

Last time I wrote about handling a Data Guard standby after losing archived logs; in short, an incremental backup is used to cover the gap data so the missing archivelogs can be skipped. Here is another variation of that situation.


[oracle@db61 orcl]$ ora si
SQL*Plus: Release 10.2.0.5.0 – Production on Wed Jun 20 14:35:01 2012
Copyright (c) 1982, 2010, Oracle. All Rights Reserved.

Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 – 64bit Production With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> select database_role from v$database;
DATABASE_ROLE
—————-
PHYSICAL STANDBY
SQL> recover standby database;
ORA-00279: change 40103914365 generated at 05/23/2012 09:26:36 needed for thread 3
ORA-00289: suggestion : /data/oracle/oradata/orcl/arch/3_8658_657561562.dbf
ORA-00280: change 40103914365 for thread 3 is in sequence #8658

Specify log: {<RET>=suggested | filename | AUTO | CANCEL} auto
ORA-00308: cannot open archived log '/data/oracle/oradata/orcl/arch/3_8658_657561562.dbf'
ORA-27037: unable to obtain file status
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3

The rough situation for this database is that every archived log since June 4 has been lost. The obvious idea is to take the standby's current_scn as the starting SCN for an incremental backup, but with that incremental backup an interesting thing happened: as the output above shows, Oracle still asks for the archived log containing SCN 40103914365. Why, given that the database was already recovered with the incremental backup? No need to beat around the bush: once redo apply is started, Oracle looks for the lowest SCN among the datafile headers and starts applying from there. Let's query the current file header SCNs:


SQL> select file#,to_char(checkpoint_change#) from v$datafile_header;

FILE# TO_CHAR(CHECKPOINT_CHANGE#)
———- —————————————-
1 42501726792
2 42501726792
3 42501726801
4 42501726801
5 42501726801
6 42501726801
7 42501726792
8 40103914365
9 42501726792
10 42501726801
11 42501726801

FILE# TO_CHAR(CHECKPOINT_CHANGE#)
———- —————————————-
12 42501726801
13 42501726792
14 42501726801
15 42501726792
16 42501726792
17 42501726801
18 42501726801


File 8's SCN is exactly the SCN Oracle is asking for; it matches the earlier message "change 40103914365 generated at 05/23/2012 09:26:36 needed for thread 3", and that timestamp tells the whole story. Checking file 8 on the primary shows its change time is also 05/23/2012, which means file 8 has not changed at all since 2012-05-23. For such a file, zero blocks have changed, i.e. every block's change SCN predates 05/23/2012 09:26:36, so no block qualifies for an incremental backup that starts from June 4. Oracle therefore skipped all blocks of this file, the file header checkpoint SCN stayed unchanged after the recover, and when redo apply was started Oracle still tried to recover from the smallest header SCN, producing this odd-looking behavior. This extreme case is rare, of course; here we can solve it simply by using RMAN to copy this datafile over to the standby.

eg:

RMAN> copy datafile '+DATA/pri/datafile/udata01_16.dbf' to '/data/xxx.dbf';

Starting backup at 20-JUN-12
using channel ORA_DISK_1
channel ORA_DISK_1: starting datafile copy
input datafile fno=00008 name=+DATA/pri/datafile/udata01_16.dbf
output filename=/data/temp.dbf tag=TAG20120620T145721 recid=172 stamp=786467766
channel ORA_DISK_1: datafile copy complete, elapsed time: 00:18:46
Finished backup at 20-JUN-12

Starting Control File and SPFILE Autobackup at 20-JUN-12
piece handle=/data/oracle/product/10.2/db1/dbs/c-1185342296-20120620-00 comment=NONE
Finished Control File and SPFILE Autobackup at 20-JUN-12

scp ….

[oracle@db61 orcl]$ ora dgstats

NAME VALUE UNIT TIME_COMPUTED
—————————— ——————– —————————— ——————————
apply finish time +00 00:09:07.7 day(2) to second(1) interval 20-JUN-2012 18:54:58
apply lag +00 00:57:49 day(2) to second(0) interval 20-JUN-2012 18:54:58
estimated startup time 116 second 20-JUN-2012 18:54:58
standby has been open N 20-JUN-2012 18:54:58
transport lag +00 00:42:10 day(2) to second(0) interval 20-JUN-2012 18:54:58

Four hours later the standby had caught up a 16-day gap, which is a decent recovery speed. Overall, using an incremental backup to repair a standby that has lost archived logs is a routine technique; 16 days of archived logs amount to 4 TB or more, and restoring them from the tape library would have been a very long process.

Recover physical standby database after loss of archive log

March 20, 2012 oracle, replication

Today the standby of our reporting database ran into trouble. A bug in the monitoring script meant no alert e-mail was sent when the host ran out of space, so archiving stopped and the standby froze at one point in time. Meanwhile the primary keeps only two days of archived logs, so those logs were never shipped to the standby; by the time we noticed, the primary had already deleted them and the standby had a gap of almost eight days that could not be resolved. The primary is over 3 TB, so rebuilding the standby from a full backup was not acceptable time-wise, and there was no space on the primary to hold such a large backup set either. We therefore used an incremental backup to skip the gap. The recovery process is recorded below.

The standby alert log shows:

Tue Mar 13 02:45:00 2012
ARC1: Closing local archive destination LOG_ARCHIVE_DEST_1: ‘/data/oracle/oradata/track1/arch/1_30021_754336076.dbf’ (error 19502) (track1)
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance track1 – Archival Error
ORA-16038: log 7 sequence# 30021 cannot be archived
ORA-19502: write error on file “”, block number (block size=)
ORA-00312: online log 7 thread 1: ‘/data/oracle/oradata/track1/standbyredo7.log’
Tue Mar 13 02:45:00 2012
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance track1 – Archival Error

Tue Mar 13 04:06:56 2012
Non critical error ORA-00001 caught while writing to trace file “/data/oracle/diag/rdbms/trackdg/track1/trace/track1_rfs_9177.trc”
Error message: Linux-x86_64 Error: 28: No space left on device
Additional information: 1
Writing to the above trace file is disabled for now on…
Creating archive destination file : /data/oracle/oradata/track1/arch/1_30038_754336076.dbf (808922 blocks)
Non critical error ORA-00001 caught while writing to trace file “/data/oracle/diag/rdbms/trackdg/track1/trace/track1_rfs_9175.trc”
Error message: Linux-x86_64 Error: 28: No space left on device
Additional information: 1
Writing to the above trace file is disabled for now on…

Fetching gap sequence in thread 1, gap sequence 30028-30127
Mon Mar 19 20:35:42 2012
Fetching gap sequence in thread 1, gap sequence 30028-30127
Mon Mar 19 20:35:52 2012
Fetching gap sequence in thread 1, gap sequence 30028-30127
Mon Mar 19 20:36:02 2012
Fetching gap sequence in thread 1, gap sequence 30028-30127
Mon Mar 19 20:36:12 2012
Fetching gap sequence in thread 1, gap sequence 30028-30127

So the host ran out of space at around 02:00 on the 13th, and all archived logs from the 13th to the 17th are lost (gap sequence 30028-30127), nearly 100 archived logs.

Below, an incremental backup is used to recover this standby.

Find the current SCN on the standby:

SQL> select current_scn from v$database;

CURRENT_SCN
———————–
16895677901

On the primary, take an incremental backup from this SCN to produce a new backup set:

RMAN> run {
2> allocate channel c1 device type disk;
3> allocate channel c2 device type disk;
4> allocate channel c3 device type disk;
5> allocate channel c4 device type disk;
6> BACKUP INCREMENTAL FROM SCN 16895677901 DATABASE FORMAT '/data/track_standby_%U' tag 'track_standby';
7> release channel c1;
8> release channel c2;
9> release channel c3;
10> release channel c4;
11> }

scp this backup set to the /data2 directory on the standby. Because the primary and standby use convert parameters to translate file paths, set newname would be needed to restore the standby; the rough script is as follows:


[oracle@db59 track1]$ rman target /

Recovery Manager: Release 11.2.0.2.0 – Production on Mon Mar 19 23:44:58 2012

Copyright (c) 1982, 2009, Oracle and/or its affiliates. All rights reserved.

connected to target database: TRACK1 (DBID=3812187020, not open)

RMAN> CATALOG START WITH '/data2/track_standby';

searching for all files that match the pattern /data2/track_standby

List of Files Unknown to the Database
=====================================
File Name: /data2/track_standby_76n694la_1_1

.. …

….

File Name: /data2/track_standby_9bn69dqo_1_1 (71 files in total)

Do you really want to catalog the above files (enter YES or NO)? YES
cataloging files…
cataloging done

List of Cataloged Files
=======================
File Name: /data2/track_standby_76n694la_1_1

File Name: /data2/track_standby_9bn69dqo_1_1

RMAN> run {
allocate channel c1 device type disk;
allocate channel c2 device type disk;
allocate channel c3 device type disk;
allocate channel c4 device type disk;
recover database noredo;
release channel c1;
release channel c2;
release channel c3;
release channel c4;
}

[oracle@db59 ~]$ tailalert

Incremental restore complete of datafile 280 /data/oracle/oradata/track1/lg_track3_100.dbf
checkpoint is 36031098956
Incremental restore complete of datafile 294 /data/oracle/oradata/track1/lg_track4_004.dbf
checkpoint is 36031098956
Mon Mar 19 17:54:45 2012
Incremental restore complete of datafile 28 /data/oracle/oradata/track1/lg_track18.dbf
checkpoint is 36025918455
last deallocation scn is 19965755207
Incremental restore complete of datafile 286 /data2/oracle/oradata/track1/lg_track119.dbf
checkpoint is 36025918455
Incremental restore complete of datafile 245 /data/oracle/oradata/track1/lg_track3_68.dbf
checkpoint is 36025918455
Incremental restore complete of datafile 241 /data/oracle/oradata/track1/lg_track3_64.dbf

..

After that completed, because datafiles had been added on the primary in the meantime, RMAN copy was used to copy files 296 and 297 locally and scp them to the corresponding locations on the standby. A standby controlfile was generated on the primary, and on the standby the file paths were redefined with rename file:


alter database rename file '/data/oracle/oradata/track1/system.dbf' to '/data/oracle/oradata/track1/system.dbf';
..
alter database rename file '/storage/disk01/lg_track3_92.dbf' to '/data/oracle/oradata/track1/lg_track3_92.dbf';
alter database rename file '/storage/disk02/lg_track3_93.dbf' to '/data/oracle/oradata/track1/lg_track3_93.dbf';
alter database rename file '/storage/disk03/lg_track3_94.dbf' to '/data/oracle/oradata/track1/lg_track3_94.dbf';
alter database rename file '/storage/disk04/lg_track3_95.dbf' to '/data/oracle/oradata/track1/lg_track3_95.dbf';
alter database rename file '/storage/disk05/lg_track3_96.dbf' to '/data/oracle/oradata/track1/lg_track3_96.dbf';
alter database rename file '/storage/disk06/lg_track3_97.dbf' to '/data/oracle/oradata/track1/lg_track3_97.dbf';
alter database rename file '/storage/disk07/lg_track3_98.dbf' to '/data/oracle/oradata/track1/lg_track3_98.dbf';
alter database rename file '/storage/disk00/lg_track3_99.dbf' to '/data/oracle/oradata/track1/lg_track3_99.dbf';
alter database rename file '/storage/disk01/lg_track3_100.dbf' to '/data/oracle/oradata/track1/lg_track3_100.dbf';
..
....

alter database rename file '/storage/disk06/lg_track4_006.dbf' to '/data2/oracle/oradata/track1/lg_track4_006.dbf';
alter database rename file '/storage/disk02/lg_track4_007.dbf' to '/data2/oracle/oradata/track1/lg_track4_007.dbf';

Add the standby logfiles:


SQL> alter system set standby_file_management=MANUAL;

System altered.

ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 group 6('/data2/oracle/oradata/track1/standbyredo6.log') SIZE 500M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 group 7('/data2/oracle/oradata/track1/standbyredo7.log') SIZE 500M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 group 8('/data2/oracle/oradata/track1/standbyredo8.log') SIZE 500M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 group 9('/data2/oracle/oradata/track1/standbyredo9.log') SIZE 500M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 group 10('/data2/oracle/oradata/track1/standbyredo10.log') SIZE 500M;

Start redo apply:


SQL> alter system set standby_file_management=auto;

System altered.

SQL> alter database recover managed standby database disconnect from session using current logfile;

Watching the log: after nearly four hours the standby finally caught up with the primary.


Recovery of Online Redo Log: Thread 1 Group 12 Seq 31152 Reading mem 0
Mem# 0: /data2/oracle/oradata/track1/standbyredo12.log
Tue Mar 20 00:28:09 2012
RFS[3]: Selected log 11 for thread 1 sequence 31153 dbid -482780276 branch 754336076
Tue Mar 20 00:28:13 2012
Archived Log entry 101 added for thread 1 sequence 31152 ID 0xe339e88c dest 1:
Tue Mar 20 00:28:16 2012
Media Recovery Waiting for thread 1 sequence 31153 (in transit)
Recovery of Online Redo Log: Thread 1 Group 11 Seq 31153 Reading mem 0
Mem# 0: /data2/oracle/oradata/track1/standbyredo11.log
Tue Mar 20 00:32:40 2012
RFS[3]: Selected log 12 for thread 1 sequence 31154 dbid -482780276 branch 754336076
Tue Mar 20 00:32:42 2012
Media Recovery Waiting for thread 1 sequence 31154 (in transit)
Recovery of Online Redo Log: Thread 1 Group 12 Seq 31154 Reading mem 0

NAME VALUE UNIT TIME_COMPUTED DATUM_TIME
—————————— ——————– —————————— —————————— ——————————
transport lag +00 00:00:00 day(2) to second(0) interval 03/20/2012 00:37:51 03/20/2012 00:37:51
apply lag +00 00:00:18 day(2) to second(0) interval 03/20/2012 00:37:51 03/20/2012 00:37:51
apply finish time +00 00:00:00.078 day(2) to second(3) interval 03/20/2012 00:37:51
estimated startup time 17 second 03/20/2012 00:37:51

SQL> select current_scn from v$database;

CURRENT_SCN
—————–
36050243871

SQL> select * from v$archive_gap;

no rows selected

In this rather extreme situation this recovery method was the right choice: rebuilding the standby from scratch would have cost a huge amount of time and space, while the incremental backup shortened the recovery time dramatically.

Logical standby ORA-01403 troubleshooting case (2)

March 18, 2012 oracle, replication

Just a few days after handling the last logical standby incident, and before I had even caught my breath, the logical standby broke again. Same table, same error, which makes me suspect something really is wrong with that table. Here is a record of how it was resolved:

2012-03-16 11:45:00 update "PROD_DATA2"."PM_SUPPLIER" ORA-01403: no data found
set
"PRODUCT_CODE" = '0021244213',

2012-03-16 11:44:29 update "PROD_DATA2"."PM_SUPPLIER" ORA-01403: no data found
set
"PRODUCT_CODE" = '0021244213',

The same old error again: replication of this table has diverged once more. Let's look at the table's structure:

SQL> desc "PROD_DATA2"."PM_SUPPLIER";
Name Null? Type
—————————————– ——– —————————-
ID NOT NULL NUMBER(18)

PRODUCT_DESCRIPTION NCLOB

…..

IS_DELETED NUMBER(1)
UNIT VARCHAR2(10)
CATEGORY_SEARCH_NAME VARCHAR2(200)

This table contains an NCLOB column. Given the earlier errors while re-synchronizing this table, we suspected the NCLOB itself might be the problem. Here is the error that came up when the table was re-instantiated:

SQL> exec DBMS_LOGSTDBY.INSTANTIATE_TABLE('PROD_DATA2','PM_SUPPLIER','dblinkprimary');

ERROR at line 1:
ORA-20001: ORA-31693: Table data object “PROD_DATA2″.”PM_SUPPLIER” failed to
load/unload and is being skipped due to error:
ORA-12161: TNS:internal error: partial data received
ORA-02063: preceding from
ORA-06512: at “SYS.KUPW$WORKER”, line 113
ORA-02063: preceding from
ORA-01555: snapshot too old: rollback segment number with name “” too small
ORA-22924: snapshot too old
ORA-02063: preceding 2 lines from DBLINKPRIMARY
ORA-01555: snapshot too old: rollback segment number with name “” too small
ORA-22924: snaps
ORA-06512: at “SYS.DBMS_LOGSTDBY”, line 664
ORA-06512: at line 1

"snapshot too old: rollback segment number with name "" too small" is a very suspicious error here. By default a LOB column uses pctversion=10, and unless the LOB column is modified extremely frequently this error should not appear, let alone at around 3 a.m. when this table was being re-synchronized. That made it even more suspicious, so the following script was used to check whether the LOB column is readable:

create table corrupted_data (corrupted_rowid rowid);

set concat off
declare
error_1555 exception;
pragma exception_init(error_1555,-1555);
v_lob NCLOB;
n number;
begin
for cursor_lob in (select rowid r from prod_data2.PM_SUPPLIER) loop
begin
select PRODUCT_DESCRIPTION into v_lob from prod_data2.PM_SUPPLIER where rowid=cursor_lob.r;
n := dbms_lob.instr(v_lob, hextoraw('889911'));
exception
when error_1555 then
insert into corrupted_data values (cursor_lob.r);
commit;
end;
end loop;
end;
/
undefine lob_column

Checking corrupted_data afterwards showed 17 problem rows:

sys@item> select * from corrupted_data;

CORRUPTED_ROWID
——————
AAAEKcAAMAABhYrAAA
AAAEKcAAMAABhYrAAB
AAAEKcAAMAABhYrAAC
AAAEKcAAMAABhYrAAD
AAAEKcAAMAABhYrAAE
AAAEKcAAMAABhYrAAF
AAAEKcAAMAABhYrAAG
AAAEKcAAMAABhYrAAH
AAAEKcAAMAABhYrAAI
AAAEKcAAMAABhYrAAJ
AAAEKcAAMAABhYrAAK
AAAEKcAAMAABhYrAAL
AAAEKcAAMAABhYrAAM
AAAEKcAAMAABhYrAAN
AAAEKcAAMAABhYrAAO
AAAEKcAAMAABhZrAAA
AAAEKcAAMAABhXSAAG

17 rows selected.

Selecting any one of these rows raises the error:

select PRODUCT_DESCRIPTION from prod_data2.PM_SUPPLIER where rowid='AAAEKcAAMAABhYrAAA';

ORA-01555: snapshot too old: rollback segment number with name “” too small
ORA-22924: snapshot too old

The NCLOB in all 17 rows is corrupted; next we repair the bad column values:

update prod_data2.PM_SUPPLIER set PRODUCT_DESCRIPTION=null where rowid in (select CORRUPTED_ROWID from corrupted_data);

Reference:
Export Fails With ORA-2354 ORA-1555 ORA-22924 and How To Confirm LOB Segment Corruption Using Export Utility? [ID 833635.1]

Export Receives The Errors ORA-1555 ORA-22924 ORA-1578 ORA-22922 [ID 787004.1]


For this reason I propose a simple PLSQL block in this new note, which runs faster and does not consume much of the system resources.

LOBDATA was used as the table which should be checked for any potential corrupted LOBs:

1. Create a new temporary table for storing all rowids of the corrupted LOBs. Let's call it "corrupt_lobs".

SQL> create table corrupt_lobs (corrupt_rowid rowid, err_num number);

2. Make a desc on the table containing the LOB column:

DESC LOBDATA

Name Null? Type
—————————————– ——–
ID NOT NULL NUMBER
DOCUMENT BLOB

Run the following PLSQL block:

declare
error_1578 exception;
error_1555 exception;
error_22922 exception;
pragma exception_init(error_1578,-1578);
pragma exception_init(error_1555,-1555);
pragma exception_init(error_22922,-22922);
n number;
begin
for cursor_lob in (select rowid r, document from LOBDATA) loop
begin
n := dbms_lob.instr(cursor_lob.document, hextoraw('889911'));
exception
when error_1578 then
insert into corrupt_lobs values (cursor_lob.r, 1578);
commit;
when error_1555 then
insert into corrupt_lobs values (cursor_lob.r, 1555);
commit;
when error_22922 then
insert into corrupt_lobs values (cursor_lob.r, 22922);
commit;
end;
end loop;
end;
/

In the end all rowids of the corrupted LOBs will be inserted into the corrupt_lobs newly created table.

A possible solution would then be to empty the affected LOBs using a statement like:

SQL> update LOBDATA set document = empty_blob()
where rowid in (select corrupt_rowid from corrupt_lobs);

Here we used the set-to-NULL approach. For the difference between a NULL and an empty LOB, see the following definitions:


You can set an internal LOB (that is, a LOB column in a table, or a LOB attribute in an object type defined by you) to be NULL or empty:

Setting an Internal LOB to NULL:

A LOB set to NULL has no locator. A NULL value is stored in the row in the table, not a locator. This is the same process as for all other datatypes.

Setting an Internal LOB to Empty:

By contrast, an empty LOB stored in a table is a LOB of zero length that has a locator. So, if you SELECT from an empty LOB column or attribute, you get back a locator which you can use to populate the LOB with data via one of the six programmatic environments, such as OCI or PL/SQL(DBMS_LOB).

After the fix, pctversion was set back to 10 (we had raised it to 90 earlier) and the table was re-instantiated on the three logical standbys. That completed smoothly; we will keep watching to see whether the problem stays away.

Logical standby ORA-01403 troubleshooting case

March 14, 2012 oracle, replication

Handling yesterday's logical standby incident:

Around midnight both logical standbys stopped applying logs. Checking the logical standby processes showed:

1. select spid, type, status from v$logstdby_process returns no rows

2. 2012-03-12 23:36:16 update "PROD_DATA2"."PM_SUPPLIER" set xxx...

ORA-01403: no data found

This error usually means the table data on the standby has diverged from the primary. The biggest headache is that this table contains a LOB column.

Comparing the data showed that the source and target tables had already diverged badly; the preliminary conclusion is that transactions lost through earlier skip operations caused the data anomaly.

We decided to re-instantiate the table with DBMS_LOGSTDBY.INSTANTIATE_TABLE:

INSTANTIATE_TABLE Procedure

This procedure creates and populates a table in the standby database from a corresponding table in the primary database. The table requires the name of the database link (dblink) as an input parameter. If the table already exists in the logical standby database, it will be dropped and re-created based on the table definition at the primary database. This procedure only brings over the data associated with the table, and not the associated indexes and constraints.

Use the INSTANTIATE_TABLE procedure to:

Add a table to a standby database.

Re-create a table in a standby database.

Syntax

DBMS_LOGSTDBY.INSTANTIATE_TABLE (
schema_name IN VARCHAR2,
table_name IN VARCHAR2,
dblink IN VARCHAR2);
Parameters

Table 56-5 INSTANTIATE_TABLE Procedure Parameters

Parameter Description
schema_name
Name of the schema
table_name
Name of the table to be created or re-created in the standby database
dblink
Name of the database link account that has privileges to read and lock the table in the primary database

SQL*Plus: Release 10.2.0.5.0 - Production on Wed Mar 14 00:42:21 2012

Copyright (c) 1982, 2010, Oracle. All Rights Reserved.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL>
SQL> alter session set db_file_multiblock_read_count = 128;

Session altered.

SQL> alter session set sort_area_size =804857600;

Session altered.

SQL> alter session set workarea_size_policy = manual;

Session altered.

SQL> exec DBMS_LOGSTDBY.INSTANTIATE_TABLE('PROD_DATA2','PM_SUPPLIER','dblinkprimary');

ERROR at line 1:
ORA-20001: ORA-31693: Table data object "PROD_DATA2"."PM_SUPPLIER" failed to
load/unload and is being skipped due to error:
ORA-12161: TNS:internal error: partial data received
ORA-02063: preceding from
ORA-06512: at "SYS.KUPW$WORKER", line 113
ORA-02063: preceding from
ORA-01555: snapshot too old: rollback segment number with name "" too small
ORA-22924: snapshot too old
ORA-02063: preceding 2 lines from DBLINKPRIMARY
ORA-01555: snapshot too old: rollback segment number with name "" too small
ORA-22924: snaps
ORA-06512: at "SYS.DBMS_LOGSTDBY", line 664
ORA-06512: at line 1

This kind of error is generally caused by too little space being kept for old versions of the LOB data (the LOB equivalent of undo):

A few notes on undo for LOBs:

RETENTION – time-based: this specifies how long older versions are to be retained.

PCTVERSION – space-based: this specifies what percentage of the LOB segment is to be used to hold older versions.

Points to Note:

RETENTION is a keyword in the LOB column definition. No value can be specified for RETENTION; the RETENTION value is implicit. If a LOB is created with database compatibility set to 9.2.0.0 or higher, automatic undo management is in use (UNDO_MANAGEMENT=AUTO), and PCTVERSION is not explicitly specified, then time-based retention is used. The LOB RETENTION value is always equal to the value of the UNDO_RETENTION database instance parameter.

You cannot specify both PCTVERSION and RETENTION.

PCTVERSION is applicable only to LOB chunks/pages allocated in LOBSEGMENTS. Other LOB related data in the table column and the LOBINDEX segment use regular undo mechanism.

PCTVERSION=0: the space allocated for older versions of LOB data in LOBSEGMENTS can be reused by other transactions and can cause "snapshot too old" errors.

PCTVERSION=100: the space allocated by older versions of LOB data can never be reused by other transactions. LOB data storage space is never reclaimed and it always increases.

RETENTION and PCTVERSION can be altered.

Recommendation: time-based retention using the RETENTION keyword is preferred. A high value for RETENTION or PCTVERSION may be needed to avoid "snapshot too old" errors in environments with high concurrent read/write LOB access.

Check the LOB attributes of "PROD_DATA2"."PM_SUPPLIER" (a non-zero bitand(flags,32) in sys.lob$ would indicate the LOB is using RETENTION rather than PCTVERSION):

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options

SQL>
SQL> select bitand(flags,32) from sys.lob$ where OBJ#= (select OBJECT_ID from dba_objects where
  2  OWNER='PROD_DATA2' and OBJECT_NAME='PM_SUPPLIER');

BITAND(FLAGS,32)
—————-
0

SQL> select table_name, segment_name, pctversion, retention
  2  from dba_lobs where table_name = 'PM_SUPPLIER';

TABLE_NAME                     SEGMENT_NAME                   PCTVERSION  RETENTION
------------------------------ ------------------------------ ---------- ----------
PM_SUPPLIER                    SYS_LOB0000017052C00041$$              10

Change PCTVERSION to 90:

alter table prod_data2.PM_SUPPLIER modify lob(PRODUCT_DESCRIPTION) (pctversion 90);

sys@item> select table_name, segment_name, pctversion, retention
  2  from dba_lobs where table_name = 'PM_SUPPLIER';

TABLE_NAME                     SEGMENT_NAME                   PCTVERSION  RETENTION
------------------------------ ------------------------------ ---------- ----------
PM_SUPPLIER                    SYS_LOB0000017052C00041$$              90

We restarted the re-instantiation and it failed again, this time with:

ERROR at line 1:
ORA-20001: ORA-39082: Object type
TRIGGER:"PROD_DATA2"."TRIG_PRODUCT_SUPPLIER_INSERT" created with compilation
warnings
ORA-06512: at "SYS.DBMS_LOGSTDBY", line 664
ORA-06512: at line 1

When DBMS_LOGSTDBY.INSTANTIATE_TABLE synchronizes a table, the objects that depend on it are brought over as well. This trigger is already disabled on the source side, so we simply dropped it on the standby and continued the re-sync:

SQL> alter session set db_file_multiblock_read_count = 128;

Session altered.

SQL> alter session set sort_area_size =804857600;

Session altered.

SQL> alter session set workarea_size_policy = manual;

Session altered.

SQL> exec DBMS_LOGSTDBY.INSTANTIATE_TABLE('PROD_DATA2','PM_SUPPLIER','dblinkprimary');

PL/SQL procedure successfully completed.

After a long wait the re-sync finally succeeded. Counting the several earlier failed attempts, the whole exercise took roughly 7 hours, which shows how hard a LOB table hits a logical standby: once something goes wrong, re-synchronization takes far longer than for an ordinary table. A regular table with the same row count takes only about 10 minutes with DBMS_LOGSTDBY.INSTANTIATE_TABLE, so the LOB table was slower by well over an order of magnitude.

After the re-sync completed, apply progress quickly caught up:
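
The process list and progress line below come from the standard logical standby views; a sketch of the kind of queries that produce them (column formatting omitted):

SQL> -- state of the SQL apply processes
SQL> select spid, type, status from v$logstdby_process;

SQL> -- how far SQL apply has progressed compared with what has been mined
SQL> select applied_scn, applied_time, restart_scn, restart_time,
  2         latest_scn, latest_time, mining_scn, mining_time
  3    from dba_logstdby_progress;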

SPID TYPE STATUS
———— ————— —————————————————————————————————-
29989 COORDINATOR ORA-16116: no work available
29940 READER ORA-16116: no work available
29942 BUILDER ORA-16116: no work available
29944 PREPARER ORA-16116: no work available
29946 PREPARER ORA-16116: no work available
29948 PREPARER ORA-16116: no work available
29952 ANALYZER ORA-16116: no work available
29954 APPLIER ORA-16116: no work available
29956 APPLIER ORA-16116: no work available
29958 APPLIER ORA-16116: no work available
29960 APPLIER ORA-16116: no work available
29962 APPLIER ORA-16116: no work available
29964 APPLIER ORA-16116: no work available
29966 APPLIER ORA-16116: no work available
29968 APPLIER ORA-16116: no work available
29970 APPLIER ORA-16116: no work available
29972 APPLIER ORA-16116: no work available
29974 APPLIER ORA-16116: no work available
29976 APPLIER ORA-16116: no work available

19 rows selected.

APPLIED_SCN APPLIED_TIME RESTART_SCN RESTART_TIME LATEST_SCN LATEST_TIME MINING_SCN MINING_TIME
———– ——————- ———– ——————- ———- ——————- ———- ——————-
3.5694E+10 2012-03-13 06:31:33 3.5686E+10 2012-03-12 19:07:22 3.5694E+10 2012-03-13 06:31:56 3.5694E+10 2012-03-13 06:31:34

By the time the incident was finally resolved and I looked up, dawn was already breaking. For a large OLTP system, LOB tables undoubtedly add a great deal of maintenance difficulty. For systems with very high availability requirements, we recommend not storing LOB data in the database at all; whether for Data Guard or for later replication such as OGG, it is a major liability.

Fractured blocks when RMAN backup is running

February 20, 2012 backup, oracle, replication No comments

The following errors appeared in the primary database alert log; the OS is Linux 5.5.


Stopping background process CJQ0
Sat Feb 11 03:41:07 2012
Hex dump of (file 9, block 561424) in trace file /data/oracle/diag/rdbms/yhdstd/yhddb1/trace/yhddb1_ora_11327.trc
Corrupt block relative dba: 0x02489110 (file 9, block 561424)
Fractured block found during backing up datafile
Data in bad block:
type: 6 format: 2 rdba: 0x02489110
last change scn: 0x0007.a9d2831f seq: 0x1 flg: 0x06
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0xe6b90601
check value in block header: 0x8c60
computed block checksum: 0x82af
Reread of blocknum=561424, file=/data/oracle/oradata/yhddb1/md_data01.dbf. found valid data

This was fairly common back in 9i: while RMAN is backing up a datafile that is under heavy I/O (large batches of writes, for example), Oracle flags the block as fractured, but this is not a genuine corruption.

Oracle rereads the block ('Reread of blocknum=561424, file=/data/oracle/oradata/yhddb1/md_data01.dbf. found valid data') and finds it valid. The database version here is 11.2.0.3.

[oracle@rac03 ~]$ crontab -l
0 03 * * 2 sh /home/oracle/monitor/script/rman_level0.sh >> /home/oracle/monitor/script/rman_level0.log
0 03 * * 0,1,3,4,5,6 sh /home/oracle/monitor/script/rman_level1.sh >> /home/oracle/monitor/script/rman_level1.log

The RMAN backup is indeed scheduled at around 03:00, which matches the timestamps in the alert log.

So even on 11g we still recommend keeping RMAN backups out of the database's busiest windows.
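
When in doubt whether a reported block is genuinely corrupt, a quick cross-check is to validate the datafile with RMAN and then look at V$DATABASE_BLOCK_CORRUPTION; a minimal sketch against the datafile reported above (file 9):

RMAN> backup validate check logical datafile 9;

SQL> -- only blocks that are truly corrupt after the validate run show up here
SQL> select file#, block#, blocks, corruption_type
  2    from v$database_block_corruption;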

reference:

fact: Oracle Server – Enterprise Edition 8
fact: Oracle Server – Enterprise Edition 9
fact: Recovery Manager (RMAN)
symptom: Fractured block found during backup up datafile
symptom: Reread of blocknum found some corrupt data
symptom: Analyze table validate structure cascade returns no errors
change: NOTE ROLE:

The messages are of the form

Reread of blocknum=36256, file=/pdscdata/pdsclive/data1/dispatch_data_large2.
dbf. found same corrupt data
***
Corrupt block relative dba: 0xfc008dc0 (file 63, block 36288)
Fractured block found during backing up datafile
Data in bad block –
type: 0 format: 0 rdba: 0x00000000
last change scn: 0x0000.00000000 seq: 0x0 flg: 0x00
consistency value in tail: 0x53494e53
check value in block header: 0x0, block checksum disabled
spare1: 0x0, spare2: 0x0, spare3: 0x0
cause: RMAN backups of datafile are being performed while the datafile is
involved in heavy I/O.

RMAN reads Oracle blocks from disk. If it finds that the block is fractured,
which means it is being actively used, it performs a reread of the block. If
that fails again then the block is assumed to be corrupt.

By identifying the object that these blocks belong to (see "Handling Oracle Block Corruptions in Oracle7/8/8i") and performing an analyze .. validate structure cascade on the object involved, you can confirm that the object is not corrupt.
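
A sketch of that check against the block reported in our alert log (file 9, block 561424); the owner and segment name are whatever the first query returns:

SQL> -- find the segment that owns the reported block
SQL> select owner, segment_name, segment_type
  2    from dba_extents
  3   where file_id = 9
  4     and 561424 between block_id and block_id + blocks - 1;

SQL> -- then confirm the object itself is intact (substitute the owner and name found above)
SQL> analyze table <owner>.<segment_name> validate structure cascade;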

fix:

Run the backups when the tablespace has less I/O activity.

Notes on a physical standby failure caused by a database upgrade

February 9, 2012 migration, oracle, replication No comments

Notes on a physical standby failure caused by a database upgrade.

Upgrade from 10.2.0.5 -> 11.2.0.3.

Upgrading a single-instance database by itself is easy: follow the manual strictly and set compatible=10.2.0.

Reference: Complete Checklist for Manual Upgrades to 11gR2 [ID 837570.1]

When the system being upgraded has a standby, pay attention to the following points:

1. First stop the 10g listener and use the 11g listener.

2. Open the standby database with the 11g software.

3. Start log apply (a sketch follows this list).
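
A minimal sketch of those standby-side steps, run from the new 11g ORACLE_HOME on the standby host (paths and apply options are placeholders; adjust to your environment):

$ lsnrctl start

SQL> startup mount
SQL> alter database recover managed standby database disconnect from session using current logfile;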

Because we did not read the document carefully, the standby was left shut down while the primary was upgraded. When we tried to start apply after the primary upgrade completed, the following errors were reported:

Errors in file /data/oracle/diag/rdbms/paystd/paystd/trace/paystd_pr00_17447.trc (incident=40247):
ORA-00353: log corruption near block 6144 change 2561976 time 02/08/2012 00:00:15
ORA-00334: archived log: '/data/oracle/oradata/paystd/arch/1_78_770564180.dbf'
Incident details in: /data/oracle/diag/rdbms/paystd/paystd/incident/incdir_40247/paystd_pr00_17447_i40247.trc
Completed: alter database recover managed standby database disconnect from session using current logfile
MRP0: Background Media Recovery terminated with error 354
Errors in file /data/oracle/diag/rdbms/paystd/paystd/trace/paystd_pr00_17447.trc:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 6144 change 2561976 time 02/08/2012 00:00:15
ORA-00334: archived log: '/data/oracle/oradata/paystd/arch/1_78_770564180.dbf'
Managed Standby Recovery not using Real Time Apply
Wed Feb 08 09:53:02 2012
Sweep [inc][40247]: completed
Wed Feb 08 09:53:02 2012
Sweep [inc2][40247]: completed

[oracle@db57 ~]$ vi /data/oracle/diag/rdbms/paystd/paystd/incident/incdir_40247/paystd_pr00_17447_i40247.trc

*** 2012-02-08 09:53:01.725
*** SESSION ID:(765.15) 2012-02-08 09:53:01.725
*** CLIENT ID:() 2012-02-08 09:53:01.725
*** SERVICE NAME:() 2012-02-08 09:53:01.725
*** MODULE NAME:() 2012-02-08 09:53:01.725
*** ACTION NAME:() 2012-02-08 09:53:01.725

Dump continued from file: /data/oracle/diag/rdbms/paystd/paystd/trace/paystd_pr00_17447.trc
ORA-00353: log corruption near block 6144 change 2561976 time 02/08/2012 00:00:15
ORA-00334: archived log: '/data/oracle/oradata/paystd/arch/1_78_770564180.dbf'

========= Dump for incident 40247 (ORA 353 [6144] [2561976]) ========

*** 2012-02-08 09:53:01.725
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
—– SQL Statement (None) —–
Current SQL information unavailable – no cursor.

—– Call Stack Trace —–

The standby clearly cannot read the contents of block 6144, and at this point starting real-time apply directly just reproduces the error.

Temporary workaround:

Rebuild the redo logs on the primary:

After looping through the groups and dropping/re-creating each inactive group, the standby returned to normal (a sketch of the loop follows):
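
A minimal sketch of that loop on the primary, handling one group at a time while it is INACTIVE (group number, file path and size here are placeholders):

SQL> -- check which groups are INACTIVE and safe to rebuild
SQL> select group#, status from v$log;

SQL> -- if the target group is still CURRENT or ACTIVE, switch and checkpoint first
SQL> alter system switch logfile;
SQL> alter system checkpoint;

SQL> -- drop and re-create the inactive group with the same size
SQL> alter database drop logfile group 2;
SQL> alter database add logfile group 2 ('/data/oracle/oradata/paystd/redo02.log') size 512m;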


alter database recover managed standby database disconnect using current logfile
Attempt to start background Managed Standby Recovery process (paystd)
Wed Feb 08 10:25:46 2012
MRP0 started with pid=26, OS id=18107
MRP0: Background Managed Standby Recovery process started (paystd)
started logmerger process
Wed Feb 08 10:25:51 2012
Managed Standby Recovery starting Real Time Apply
Parallel Media Recovery started with 12 slaves
Waiting for all non-current ORLs to be archived…
All non-current ORLs have been archived.
Media Recovery Log /data/oracle/oradata/paystd/arch/1_95_770564180.dbf
Media Recovery Waiting for thread 1 sequence 96 (in transit)
Recovery of Online Redo Log: Thread 1 Group 6 Seq 96 Reading mem 0
Mem# 0: /data/oracle/oradata/paystd/std2.log
Completed: alter database recover managed standby database disconnect using current logfile
Wed Feb 08 19:45:46 2012
RFS[1]: Selected log 5 for thread 1 sequence 97 dbid 1905896596 branch 770564180
Wed Feb 08 19:45:46 2012
Media Recovery Waiting for thread 1 sequence 97 (in transit)
Recovery of Online Redo Log: Thread 1 Group 5 Seq 97 Reading mem 0
Mem# 0: /data/oracle/oradata/paystd/std1.log
Wed Feb 08 19:45:46 2012
Archived Log entry 24 added for thread 1 sequence 96 ID 0x7198f794 dest 1:

A baffling error caused by not following the documentation step by step. Why the standby could not read the redo content shipped from the primary remains to be investigated.