dataguard

DGHA,Dataguard自动切换方案

August 15, 2013 Architect, oracle, replication, software No comments

发布对autoswitch over dg脚本DGHA测试,此脚本针对oracle dataguard设计使用共享存储存放redo以及controlfile从而达到了切换数据0丢失。
同时切换以后,original primary database可以无缝加入整个dg集群,从而形成了dataguard高可用方案。
采用perl脚本定制开发,可任意部署于任何一台可连接至集群的机器(最好不要用集群中的机器做监控机),多重判断机制,从而增加了切换时的可靠性以及准确性。感谢作者 noodba. noodba作为部门数据架构师,定制开发了myawr(mysql性能分析工具)

测试环境:

10.0.0.63 (primary database)
10.0.0.24 (standby database)
10.0.0.13 (standby database)
切换site指定为63<->24 13作为恒定standby存在

Step 1. 使用swingbench作为压测工具 (导入10G 数据)

QQ图片20130815220940

[oracle@userdb-pre admin]$ ora active
 17:23:15 up 1 day,  1:15,  5 users,  load average: 2.99, 0.97, 0.34

   SID USERNAME   MACHINE		 EVENT				PARAM			W   WT SQL			    ST	   LT LOGON_TIME
------ ---------- ---------------------- ------------------------------ -------------------- ---- ---- ---------------------------- -- ------ ----------
  1089 TEST	  JDBC Thin Client	 SQL*Net more data from client	1413697536/4/0	       -1    0 /5gkc10wnagprk		    A	    0	     105
   961 TEST	  JDBC Thin Client	 SQL*Net more data from client	1413697536/1/0	       -1    0 5gkc10wnagprk/5gkc10wnagprk  A	    0	     105
  1249 TEST	  JDBC Thin Client	 SQL*Net more data from client	1413697536/5/0	       -1    0 dy4rgmf46x1q4/dy4rgmf46x1q4  A	    0	     105

[oracle@userdb-pre admin]$ ora active
 17:23:15 up 1 day,  1:15,  5 users,  load average: 3.07, 1.02, 0.36

   SID USERNAME   MACHINE		 EVENT				PARAM			W   WT SQL			    ST	   LT LOGON_TIME
------ ---------- ---------------------- ------------------------------ -------------------- ---- ---- ---------------------------- -- ------ ----------
   961 TEST	  JDBC Thin Client	 SQL*Net message from client	1413697536/1/0		3    0 5gkc10wnagprk/5gkc10wnagprk  A	    0	     105
     1 TEST	  JDBC Thin Client	 SQL*Net message from client	1413697536/1/0		2    0 dy4rgmf46x1q4/dy4rgmf46x1q4  A	    0	     104

[oracle@userdb-pre admin]$ ora active
 17:23:49 up 1 day,  1:15,  4 users,  load average: 3.68, 1.35, 0.49

   SID USERNAME   MACHINE		 EVENT				PARAM			W   WT SQL			    ST	   LT LOGON_TIME
------ ---------- ---------------------- ------------------------------ -------------------- ---- ---- ---------------------------- -- ------ ----------
  1025 TEST	  JDBC Thin Client	 SQL*Net message from client	1413697536/1/0		1    0 d1yrz7run4p1s/d1yrz7run4p1s  A	    0	     139

Step 2. Halt database machine (将63 database直接halt关闭,模拟机器crash)

BP9OJX}OU[ZIT2MQ[98[5LA

观察10.0.0.24 (standby 将被dgha自动切换至primary,redo,controlfile将被copy至规定目录)

alert_log from 10.0.0.24

Thu Aug 15 17:37:50 2013
ALTER DATABASE   MOUNT
Changing di2dbun from dgha to dgha1
Successful mount of redo thread 1, with mount id 3944140846
Database mounted in Exclusive Mode
Lost write protection disabled
Completed: ALTER DATABASE   MOUNT
Thu Aug 15 17:37:55 2013
ALTER DATABASE RECOVER  database
Media Recovery Start
 started logmerger process
Parallel Media Recovery started with 48 slaves
Thu Aug 15 17:37:56 2013
Recovery of Online Redo Log: Thread 1 Group 3 Seq 60 Reading mem 0
  Mem# 0: /data1/dgha/redo03.log
Media Recovery Complete (dgha)
Completed: ALTER DATABASE RECOVER  database
alter database open
Beginning crash recovery of 1 threads
 parallel recovery started with 32 processes
Started redo scan
Completed redo scan
 read 779943 KB redo, 0 data blocks need recovery
Started redo application at
 Thread 1: logseq 58, block 35644
Recovery of Online Redo Log: Thread 1 Group 1 Seq 58 Reading mem 0
  Mem# 0: /data1/dgha/redo01.log
Recovery of Online Redo Log: Thread 1 Group 2 Seq 59 Reading mem 0
  Mem# 0: /data1/dgha/redo02.log
Recovery of Online Redo Log: Thread 1 Group 3 Seq 60 Reading mem 0
  Mem# 0: /data1/dgha/redo03.log
Completed redo application of 0.00MB
Completed crash recovery at
 Thread 1: logseq 60, block 31, scn 2500652
 0 data blocks read, 0 data blocks written, 779943 redo k-bytes read
LGWR: STARTING ARCH PROCESSES
...

Completed: ALTER DATABASE   MOUNT
Thu Aug 15 17:37:55 2013
ALTER DATABASE RECOVER  database
Media Recovery Start
 started logmerger process
Parallel Media Recovery started with 48 slaves
Thu Aug 15 17:37:56 2013
Recovery of Online Redo Log: Thread 1 Group 3 Seq 60 Reading mem 0
  Mem# 0: /data1/dgha/redo03.log
Media Recovery Complete (dgha)
Completed: ALTER DATABASE RECOVER  database
alter database open
Beginning crash recovery of 1 threads
 parallel recovery started with 32 processes
Started redo scan
Completed redo scan
 read 779943 KB redo, 0 data blocks need recovery
Started redo application at
 Thread 1: logseq 58, block 35644
Recovery of Online Redo Log: Thread 1 Group 1 Seq 58 Reading mem 0
  Mem# 0: /data1/dgha/redo01.log
Recovery of Online Redo Log: Thread 1 Group 2 Seq 59 Reading mem 0
  Mem# 0: /data1/dgha/redo02.log
Recovery of Online Redo Log: Thread 1 Group 3 Seq 60 Reading mem 0
  Mem# 0: /data1/dgha/redo03.log
Completed redo application of 0.00MB
Completed crash recovery at
 Thread 1: logseq 60, block 31, scn 2500652
 0 data blocks read, 0 data blocks written, 779943 redo k-bytes read
LGWR: STARTING ARCH PROCESSES
...

Archived Log entry 191 added for thread 1 sequence 61 ID 0xeb14a3a4 dest 1:
ARC3: Standby redo logfile selected for thread 1 sequence 60 for destination LOG_ARCHIVE_DEST_3
******************************************************************
LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_2
******************************************************************
******************************************************************
LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_3
******************************************************************
LNS: Standby redo logfile selected for thread 1 sequence 62 for destination LOG_ARCHIVE_DEST_3
..


Completed: alter database open

数据库被recover到了最新的时间点(datafile scn领先于63 database,63此时可以看成24过去某个点的状态)


Step 3. third standby database (10.0.0.13)

alert_log from 10.0.0.13

Error 12543 received logging on to the standby
FAL[client, USER]: Error 12543 connecting to dgha for fetching gap sequence
Thu Aug 15 17:38:14 2013
Media Recovery Log /u01/oracle/11.2.0/oracle/product/11203/db1/dbs/arch1_61_823427815.dbf
Media Recovery Waiting for thread 1 sequence 62 (in transit)
Recovery of Online Redo Log: Thread 1 Group 4 Seq 62 Reading mem 0
  Mem# 0: /u01/oracle/oradata/dgha/standby04.log
Thu Aug 15 17:39:56 2013
Archived Log entry 62 added for thread 1 sequence 62 ID 0xeb14a3a4 dest 1:
Thu Aug 15 17:39:56 2013
RFS[14]: Selected log 4 for thread 1 sequence 63 dbid -350937948 branch 823427815
Thu Aug 15 17:39:56 2013
Media Recovery Waiting for thread 1 sequence 63 (in transit)
Recovery of Online Redo Log: Thread 1 Group 4 Seq 63 Reading mem 0
  Mem# 0: /u01/oracle/oradata/dgha/standby04.log
Thu Aug 15 17:40:08 2013
Archived Log entry 63 added for thread 1 sequence 63 ID 0xeb14a3a4 dest 1:
Thu Aug 15 17:40:08 2013
RFS[14]: Selected log 4 for thread 1 sequence 64 dbid -350937948 branch 823427815

可以看到 primary 自动切换为了10.0.0.24,13继续recover database.

Step 4. restart 63 database and join the dg cluster (24 is new primary database and 63 is standby database)

(1).copy standby controlfile from 24 database
(2).replace 63’s controlfile using new standby controlfile
(3).recreate standby logfiles
(4).alter database recover managed standby database;


alert_log:

..

Media Recovery Log /u01/oracle/arch/1_66_823427815.dbf
Media Recovery Waiting for thread 1 sequence 67 (in transit)

..
Physical standby database opened for read only access.
Completed: alter database open read only

Step 5. check data between 24 and 63 databases

from 24 database

SQL> select count(*) from test.CUSTOMERS              ;
select count(*) from test.WAREHOUSES             ;
select count(*) from test.ORDER_ITEMS            ;
select count(*) from test.ORDERS                 ;
select count(*) from test.INVENTORIES            ;
select count(*) from test.PRODUCT_INFORMATION    ;
select count(*) from test.LOGON                  ;
select count(*) from test.PRODUCT_DESCRIPTIONS   ;
select count(*) from test.ORDERENTRY_METADATA    ;

  COUNT(*)
----------
  39999996

SQL> 
  COUNT(*)
----------
      1000

SQL> 
  COUNT(*)
----------
 134986840

SQL> 
  COUNT(*)
----------
  45000000

SQL> 
  COUNT(*)
----------
    901565

SQL> 
  COUNT(*)
----------
      1000

SQL> 
  COUNT(*)
----------
   9999996

SQL> 
  COUNT(*)
----------
      1000

SQL> 
  COUNT(*)
----------
	 0

from 63 database

SQL> select count(*) from test.CUSTOMERS              ;
select count(*) from test.WAREHOUSES             ;
select count(*) from test.ORDER_ITEMS            ;
select count(*) from test.ORDERS                 ;
select count(*) from test.INVENTORIES            ;
select count(*) from test.PRODUCT_INFORMATION    ;
select count(*) from test.LOGON                  ;
select count(*) from test.PRODUCT_DESCRIPTIONS   ;
select count(*) from test.ORDERENTRY_METADATA    ;

  COUNT(*)
----------
  39999996

SQL> 
  COUNT(*)
----------
      1000

SQL> 
  COUNT(*)
----------
 134986840

SQL> 
  COUNT(*)
----------
  45000000

SQL> 
  COUNT(*)
----------
    901565

SQL> 
  COUNT(*)
----------
      1000

SQL> 
  COUNT(*)
----------
   9999996

SQL> 
  COUNT(*)
----------
      1000

SQL> 
  COUNT(*)
----------
	 0

perfect ! no rows lost.

—————————————————————————

下面我们来剖析一下dgha的行为

Thu Aug 15 17:25:57 2013 - [info] oraha start,welcome to use this tool...
Thu Aug 15 17:25:57 2013 - [info] try to check (10.0.0.63:1521:dgha) and (10.0.0.24:1521:dgha) are really cluster or not.
Thu Aug 15 17:25:57 2013 - [info] try ssh to standby(10.0.0.24,1521,dgha)...
Thu Aug 15 17:25:57 2013 - [info] ssh to 10.0.0.24 success.
Thu Aug 15 17:25:58 2013 - [debug] standby query result: (DGHA,PHYSICAL STANDBY,dgha1)
Thu Aug 15 17:25:58 2013 - [info] try connect to primary(10.0.0.63,1521,dgha)...
Thu Aug 15 17:25:58 2013 - [debug] primary check query sql: (select count(*) cnt from V$ARCHIVE_DEST A,v$database B where A.status='VALID' AND A.DB_UNIQUE_NAME='dgha1' AND A.TARGET='STANDBY' AND B.NAME='DGHA' AND B.DATABASE_ROLE='PRIMARY')
Thu Aug 15 17:25:58 2013 - [info] Attemp to connect to (10.0.0.63:1521:dgha)...
Thu Aug 15 17:25:59 2013 - [info] connect to (10.0.0.63:1521:dgha) success.
Thu Aug 15 17:25:59 2013 - [debug] primary query result rn_cnt: 1
Thu Aug 15 17:25:59 2013 - [info] (10.0.0.63:1521:dgha) and (10.0.0.24:1521:dgha) are really cluster.
Thu Aug 15 17:25:59 2013 - [info] the primary database is (10.0.0.63:1521:dgha)
Thu Aug 15 17:25:59 2013 - [info] the standby database is (10.0.0.24:1521:dgha)
Thu Aug 15 17:25:59 2013 - [info] ssh to 10.0.0.63 success.
Thu Aug 15 17:25:59 2013 - [info] ssh to 10.0.0.24 success.
Thu Aug 15 17:26:03 2013 - [info] ssh to 192.168.254.63 success.
Thu Aug 15 17:26:03 2013 - [info] ssh check passed.
Thu Aug 15 17:26:03 2013 - [info] (10.0.0.63:1521:dgha) /home/oracle/oraha/db_stop.sh check passed.
Thu Aug 15 17:26:03 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step01.sh check passed.
Thu Aug 15 17:26:03 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step02.pl check passed.
Thu Aug 15 17:26:03 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step03.sh check passed.
Thu Aug 15 17:26:04 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_check.sh check passed.
Thu Aug 15 17:26:04 2013 - [info] scripts under /home/oracle/oraha check passed.
Thu Aug 15 17:26:04 2013 - [info] I will wait until (10.0.0.63:1521:dgha) unreachable.health check interval:8
Thu Aug 15 17:26:04 2013 - [info] ======================================================
Thu Aug 15 17:26:04 2013 - [info] ----------             welcome              ----------
Thu Aug 15 17:26:04 2013 - [info] ======================================================
Thu Aug 15 17:26:04 2013 - [debug] connect to (10.0.0.63:1521:dgha) error_count:0
Thu Aug 15 17:31:50 2013 - [debug] UPDATE HACHECK success.
Thu Aug 15 17:31:58 2013 - [debug] connect to (10.0.0.63:1521:dgha) error_count:0

DGHA 会定期去update一张check表判断instance的存活

Thu Aug 15 17:32:03 2013 - [warning] connect to (10.0.0.63:1521:dgha) was abort! connect again.
Thu Aug 15 17:32:03 2013 - [debug] update check == 1.undef_count is 0
Thu Aug 15 17:32:03 2013 - [info] I will destory dbh.
Issuing rollback() due to DESTROY without explicit disconnect() of DBD::Oracle::db handle (DESCRIPTION=(ADDRESS=(HOST=10.0.0.63)(PROTOCOL=tcp)(PORT=1521))(CONNECT_DATA=(SID=dgha))) at oraha.pl line 509.
DBI db handle 0x1efbf710 cleared whilst still active at oraha.pl line 509.
DBI db handle 0x1efbf710 has uncleared implementors data at oraha.pl line 509.
    dbih_clearcom (dbh 0x1efbf710, com 0x1ee4d4f0, imp DBD::Oracle::db):
       FLAGS 0x182097: COMSET IMPSET Active Warn RaiseError PrintWarn ShowErrorStatement 
       PARENT DBI::dr=HASH(0x1ef1df90)
       KIDS 0 (0 Active)
Thu Aug 15 17:32:08 2013 - [warning] destory dbh  was abort!
Thu Aug 15 17:32:08 2013 - [info] Attemp to connect to (10.0.0.63:1521:dgha)...
Thu Aug 15 17:32:13 2013 - [error][oraha.pl, ln426] connect to  (10.0.0.63:1521:dgha) timeout
Thu Aug 15 17:32:13 2013 - [warning] connect to (10.0.0.63:1521:dgha) error_count:1
Thu Aug 15 17:32:21 2013 - [debug] connect to (10.0.0.63:1521:dgha) error_count:1
Thu Aug 15 17:32:21 2013 - [debug] the dbh is null.
Thu Aug 15 17:32:21 2013 - [info] Attemp to connect to (10.0.0.63:1521:dgha)...
Thu Aug 15 17:32:26 2013 - [error][oraha.pl, ln426] connect to  (10.0.0.63:1521:dgha) timeout
Thu Aug 15 17:32:26 2013 - [warning] connect to (10.0.0.63:1521:dgha) was abort! connect again.
Thu Aug 15 17:32:26 2013 - [debug] update check == 1.undef_count is 1
Thu Aug 15 17:32:26 2013 - [info] I will destory dbh.
Thu Aug 15 17:32:26 2013 - [warning] connect to (10.0.0.63:1521:dgha) error_count:2
Thu Aug 15 17:32:34 2013 - [debug] connect to (10.0.0.63:1521:dgha) error_count:2
Thu Aug 15 17:32:34 2013 - [debug] the dbh is null.
Thu Aug 15 17:32:34 2013 - [info] Attemp to connect to (10.0.0.63:1521:dgha)...
Thu Aug 15 17:32:39 2013 - [error][oraha.pl, ln426] connect to  (10.0.0.63:1521:dgha) timeout
Thu Aug 15 17:32:39 2013 - [warning] connect to (10.0.0.63:1521:dgha) was abort! connect again.
Thu Aug 15 17:32:39 2013 - [debug] update check == 1.undef_count is 1
Thu Aug 15 17:32:39 2013 - [info] I will destory dbh.
Thu Aug 15 17:32:39 2013 - [warning] connect to (10.0.0.63:1521:dgha) error_count:3
Thu Aug 15 17:32:39 2013 - [info] try connect to the edm...
Thu Aug 15 17:32:39 2013 - [info] send message:insert into edm_user.tb_queue(ID,PHONE,MSG,STATUS,SENDLEVEL,svrtype,INSERTTIME) 
                                values(edm_user.SEQ_QUE.NEXTVAL,'13817963180','the primary(10.0.0.63:1521:dgha) is unreachable.','',1,'11',sysdate)
Thu Aug 15 17:32:39 2013 - [info] send message:insert into edm_user.tb_queue(ID,PHONE,MSG,STATUS,SENDLEVEL,svrtype,INSERTTIME) 
                                values(edm_user.SEQ_QUE.NEXTVAL,'18666668061','the primary(10.0.0.63:1521:dgha) is unreachable.','',1,'11',sysdate)
Thu Aug 15 17:32:47 2013 - [warning] ssh to 10.0.0.63 err. error_count:1
ssh: connect to host 10.0.0.63 port 22: No route to host
Thu Aug 15 17:32:50 2013 - [warning] ssh to 10.0.0.63 err. error_count:2
ssh: connect to host 10.0.0.63 port 22: No route to host
Thu Aug 15 17:32:53 2013 - [warning] ssh to 10.0.0.63 err. error_count:3
Thu Aug 15 17:32:54 2013 - [info] ssh to 10.0.0.24 success.
Thu Aug 15 17:32:57 2013 - [info] ssh to 192.168.254.63 success.

DGHA 在update多次之后发现oracle死亡,开始尝试ssh到63同样不通(将会连接至63的远程管理卡poweroff掉该机器)同理将会查看待选standby状态,如果可以切换将会启动切换程序,同时发送短信告知63已经死亡。

QQ图片20130815221021

Thu Aug 15 17:32:57 2013 - [info] try to check (10.0.0.24:1521:dgha) is ready to failover to primary or not.
Thu Aug 15 17:32:57 2013 - [info] the standby open mode is:MOUNTED. database role is:PHYSICAL STANDBY.
Thu Aug 15 17:32:57 2013 - [info] standby lag minutes is: 0
Thu Aug 15 17:32:57 2013 - [info] standby (10.0.0.24:1521:dgha) healthcheck is passed. Go on.
Thu Aug 15 17:32:57 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step01.sh check passed.
Thu Aug 15 17:32:58 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step02.pl check passed.
Thu Aug 15 17:32:58 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step03.sh check passed.
Thu Aug 15 17:32:58 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_check.sh check passed.
Thu Aug 15 17:32:58 2013 - [info] standby (10.0.0.24:1521:dgha) scripts under /home/oracle/oraha check passed.
Thu Aug 15 17:32:58 2013 - [info] Power of 10.0.0.63  was successfully turned off.
Thu Aug 15 17:32:58 2013 - [info] I will execute /home/oracle/oraha/db_start_step01.sh on 10.0.0.24
Thu Aug 15 17:35:58 2013 - [debug] db_start output: 
SQL*Plus: Release 11.2.0.3.0 Production on Thu Aug 15 17:32:58 2013

Copyright (c) 1982, 2011, Oracle.  All rights reserved.


Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> 
Thu Aug 15 17:35:58 2013 - [error][oraha.pl, ln180] Oracle on 10.0.0.24 shutdown failed.Kill the pmon next.
Thu Aug 15 17:35:58 2013 - [debug] Oracle pmon pid is: 10179
Thu Aug 15 17:35:58 2013 - [debug] kill -9  output: 
Thu Aug 15 17:36:08 2013 - [info] I will execute /home/oracle/oraha/db_start_step02.pl on 10.0.0.24
Thu Aug 15 17:37:47 2013 - [debug] db_start output: 
Thu Aug 15 17:37:47 2013 - [info] I will execute /home/oracle/oraha/db_start_step03.sh on 10.0.0.24
Thu Aug 15 17:38:06 2013 - [debug] db_start output: 
SQL*Plus: Release 11.2.0.3.0 Production on Thu Aug 15 17:37:47 2013

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> ORACLE instance started.

Total System Global Area 4.0486E+10 bytes
Fixed Size                  2237088 bytes
Variable Size            4026535264 bytes
Database Buffers         3.6373E+10 bytes
Redo Buffers               84631552 bytes
Database mounted.
SQL> Media recovery complete.
SQL> 
Database altered.

SQL> Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Thu Aug 15 17:38:08 2013 - [info] try to check the new primary(10.0.0.24,1521,dgha) after failover .
Thu Aug 15 17:38:08 2013 - [info] try connect to the new primary(10.0.0.24,1521,dgha)...
Thu Aug 15 17:38:08 2013 - [info] Attemp to connect to (10.0.0.24:1521:dgha)...
Thu Aug 15 17:38:08 2013 - [info] connect to (10.0.0.24:1521:dgha) success.
Thu Aug 15 17:38:08 2013 - [info] the new primary open mode is:READ WRITE. database role is:PRIMARY.
Thu Aug 15 17:38:08 2013 - [info] (10.0.0.63:1521:dgha) failover to (10.0.0.24:1521:dgha) successful.
Thu Aug 15 17:38:08 2013 - [info] try connect to the edm...
Thu Aug 15 17:38:08 2013 - [info] send message:insert into edm_user.tb_queue(ID,PHONE,MSG,STATUS,SENDLEVEL,svrtype,INSERTTIME) 
                                values(edm_user.SEQ_QUE.NEXTVAL,'13817963180','(10.0.0.63:1521:dgha) failover to (10.0.0.24:1521:dgha) successful.','',1,'11',sysdate)
Thu Aug 15 17:38:08 2013 - [info] send message:insert into edm_user.tb_queue(ID,PHONE,MSG,STATUS,SENDLEVEL,svrtype,INSERTTIME) 
                                values(edm_user.SEQ_QUE.NEXTVAL,'18666668061','(10.0.0.63:1521:dgha) failover to (10.0.0.24:1521:dgha) successful.','',1,'11',sysdate)
Killed by signal 1.
Killed by signal 1.

使用存储redo recover完毕后同样会发出短信告知并退出。

QQ图片20130815221008

以下附各种场景的切换测试:

1. 模拟instance hang (oradebug frozen instance) –>当前连接全部hang死,新建process去update check表,两种情况:如果update成功,不切换,如果hang导致update不成功,切换。但是之前的process会处于”僵尸”态。

2. 模拟listener hang (kill -STOP pid of listener and kill active session) 新建process无法update timeout超时–>ssh主机执行abort—>连接至远程卡powerff掉主机—>切换至备选standby.

3. 模拟机器down (直接halt机器,update,ssh均失败,判断死亡,直接切换)

4. 网卡down(处理方式类1)
1,4的处理方式会出现之前dgha建立的连接无法回收,采用在本地部署脚本针对dgha的机器名,idle时间做定期清理。

5. 新的primary切换完成后,就的primary如果需要重新加入集群(copy standby controlfile from new primary,standby logfile需要重建)

Physical standby Failover

June 28, 2012 oracle, replication No comments

db1 –liu

SQL> select * from v$version;

BANNER
—————————————————————-
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 – Prod
PL/SQL Release 10.2.0.1.0 – Production
CORE 10.2.0.1.0 Production
TNS for Linux: Version 10.2.0.1.0 – Production
NLSRTL Version 10.2.0.1.0 – Production

db2 –yang

SQL> select * from v$archive_gap;

no rows selected

SQL>

第一种情况 正常的failover

phsical standby :

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE;

Database altered.

SQL>

Wed Jun 27 13:47:15 2012
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE
Wed Jun 27 13:47:15 2012
Terminal Recovery: Stopping real time apply
Wed Jun 27 13:47:15 2012
MRP0: Background Media Recovery cancelled with status 16037
Wed Jun 27 13:47:15 2012
Errors in file /u01/app/oracle/admin/liu/bdump/liu_mrp0_10260.trc:
ORA-16037: user requested cancel of managed recovery operation
Managed Standby Recovery not using Real Time Apply
Recovery interrupted!
Recovered data files to a consistent state at change 456564
Wed Jun 27 13:47:16 2012
Errors in file /u01/app/oracle/admin/liu/bdump/liu_mrp0_10260.trc:
ORA-16037: user requested cancel of managed recovery operation
Wed Jun 27 13:47:16 2012
MRP0: Background Media Recovery process shutdown (liu)
Wed Jun 27 13:47:16 2012
Terminal Recovery: Stopped real time apply
Wed Jun 27 13:47:16 2012
Attempt to do a Terminal Recovery (liu)
Wed Jun 27 13:47:16 2012
Media Recovery Start: Managed Standby Recovery (liu)
Managed Standby Recovery not using Real Time Apply
parallel recovery started with 2 processes
Media Recovery Waiting for thread 1 sequence 2 (in transit)
Terminal Recovery timestamp is ’06/27/2012 13:47:18′
Terminal Recovery: applying standby redo logs.
Terminal Recovery: thread 1 seq# 2 redo required
Terminal Recovery: /u01/app/oracle/oradata/liu/redo05.log
Identified End-Of-Redo for thread 1 sequence 2
Wed Jun 27 13:47:18 2012
Incomplete recovery applied all redo ever generated.
Recovery completed through change 456566
Wed Jun 27 13:47:18 2012
Media Recovery Complete (liu)
Terminal Recovery: successful completion
Begin: Standby Redo Logfile archival
End: Standby Redo Logfile archival
Resetting standby activation ID 2612058581 (0x9bb0d9d5)
Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE

SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;

Database altered.

SQL> ALTER DATABASE OPEN;

Database altered.

SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.

db2 作为主库

[oracle@yang ~]$ sqlplus ‘/as sysdba’

SQL*Plus: Release 10.2.0.1.0 – Production on Wed Jun 27 14:01:34 2012

Copyright (c) 1982, 2005, Oracle. All rights reserved.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 – Production
With the Partitioning, OLAP and Data Mining options

SQL> SELECT TO_CHAR(STANDBY_BECAME_PRIMARY_SCN) FROM V$DATABASE;

TO_CHAR(STANDBY_BECAME_PRIMARY_SCN)
—————————————-
479312

SQL>

DB1 作为备库 flashback

SQL>
FLASHBACK DATABASE TO SCN 479312SQL>
2 ;

Flashback complete.

SQL> ALTER DATABASE CONVERT TO PHYSICAL STANDBY;

Database altered.

继续开启apply

ttempt to start background Managed Standby Recovery process (liu)
MRP0 started with pid=20, OS id=22199
Wed Jun 27 14:52:31 2012
MRP0: Background Managed Standby Recovery process started (liu)
Managed Standby Recovery not using Real Time Apply
Media Recovery Log /u01/arch/1_10_787062936.dbf
Media Recovery Log /u01/arch/1_11_787062936.dbf
Media Recovery Log /u01/arch/1_12_787062936.dbf
Wed Jun 27 14:52:37 2012
Completed: alter database recover managed standby database disconnect
Wed Jun 27 14:52:38 2012
Media Recovery Log /u01/arch/1_13_787062936.dbf
Media Recovery Log /u01/arch/1_14_787062936.dbf
Media Recovery Waiting for thread 1 sequence 15 (in transit)

应用正常


第二种情况 active database 的 failover

此处模拟网络断开

SQL> create table failover(id int);

Table created.

SQL> alter system set log_archive_dest_state_2=defer;

System altered.

SQL> alter system switch logfile;

System altered.

这个归档不会传输过去

physical standby 操作:

SQL> alter database recover managed standby database disconnect;

Database altered.

SQL>
SQL>
SQL> alter database recover managed standby database cancel;

Database altered.

SQL> recover standby database until cancel;
ORA-00279: change 500198 generated at 06/27/2012 14:57:52 needed for thread 1
ORA-00289: suggestion : /u01/arch/1_16_787062936.dbf
ORA-00280: change 500198 for thread 1 is in sequence #16

Specify log: {=suggested | filename | AUTO | CANCEL}
cancel;
ORA-00308: cannot open archived log ‘cancel;’
ORA-27037: unable to obtain file status
Linux Error: 2: No such file or directory
Additional information: 3

Specify log: {=suggested | filename | AUTO | CANCEL}
cancel
Media recovery cancelled.
SQL> alter database activate standby database;

Database altered.

SQL> shutdown immediate;
ORA-01109: database not open

Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.

Total System Global Area 524288000 bytes
Fixed Size 1220336 bytes
Variable Size 150995216 bytes
Database Buffers 364904448 bytes
Redo Buffers 7168000 bytes
Database mounted.
Database opened.
SQL> SELECT TO_CHAR(STANDBY_BECAME_PRIMARY_SCN) FROM V$DATABASE;

TO_CHAR(STANDBY_BECAME_PRIMARY_SCN)
—————————————-
500196

SQL> select database_role from v$database;

DATABASE_ROLE
—————-
PRIMARY

SQL>

已经切换为primary

Resetting resetlogs activation ID 2612053838 (0x9bb0c74e)
Online log /u01/app/oracle/oradata/liu/redo01.log: Thread 1 Group 1 was previously cleared
Online log /u01/app/oracle/oradata/liu/redo02.log: Thread 1 Group 2 was previously cleared
Online log /u01/app/oracle/oradata/liu/redo03.log: Thread 1 Group 3 was previously cleared
Standby became primary SCN: 500196

此时发生了resetlogs 操作

SQL> archive log list;
Database log mode Archive Mode
Automatic archival Enabled
Archive destination /u01/arch
Oldest online log sequence 1
Next log sequence to archive 2
Current log sequence 2

此时我们看一下原来的primary 端:

SQL> shutdown abort;
ORACLE instance shut down.

SQL> startup mount;
ORACLE instance started.

Total System Global Area 524288000 bytes
Fixed Size 1220336 bytes
Variable Size 150995216 bytes
Database Buffers 364904448 bytes
Redo Buffers 7168000 bytes
Database mounted.
SQL> flashback database to scn 500196;

Flashback complete.

SQL> ALTER DATABASE CONVERT TO PHYSICAL STANDBY;

Database altered.

SQL> SHUTDOWN IMMEDIATE;
ORA-01507: database not mounted

ORACLE instance shut down.
SQL> STARTUP MOUNT;
ORACLE instance started.

Total System Global Area 524288000 bytes
Fixed Size 1220336 bytes
Variable Size 150995216 bytes
Database Buffers 364904448 bytes
Redo Buffers 7168000 bytes
Database mounted.

SQL> startup mount;
ORACLE instance started.

Total System Global Area 524288000 bytes
Fixed Size 1220336 bytes
Variable Size 150995216 bytes
Database Buffers 364904448 bytes
Redo Buffers 7168000 bytes
Database mounted.
SQL> alter database recover managed standby database disconnect;

Database altered.

SQL> select database_role from v$database;

DATABASE_ROLE
—————-
PHYSICAL STANDBY

看一下standby 端的log :

Using STANDBY_ARCHIVE_DEST parameter default value as /u01/arch
Redo Shipping Client Connected as PUBLIC
— Connected User is Valid
RFS LogMiner: Client disabled from further notification
RFS[1]: New Archival REDO Branch(resetlogs_id): 787074684 Prior: 787074091
RFS[1]: Archival Activation ID: 0x9bb07de2 Current: 0x0
RFS[1]: Effect of primary database OPEN RESETLOGS


RFS[1]: Assigned to RFS process 23110
RFS[1]: Identified database type as ‘physical standby’
Wed Jun 27 15:05:38 2012
RFS LogMiner: Client disabled from further notification
RFS[1]: Archived Log: ‘/u01/arch/1_1_787071479.dbf’
New incarnation branch detected in ArchiveLog, filename /u01/arch/1_1_787071479.dbf
Inspection of file changed rdi from 2 to 3
Setting recovery target incarnation to 3

incarnation 自动进行了切换

Wed Jun 27 15:05:40 2012
RFS[1]: Incarnation entry added for Branch(resetlogs_id): 787071479 (liu)
Wed Jun 27 15:05:40 2012
Setting recovery target incarnation to 3
Wed Jun 27 15:05:40 2012
MRP0: Incarnation has changed! Retry recovery…

重置之后开始recover

Wed Jun 27 15:05:40 2012
Errors in file /u01/app/oracle/admin/liu/bdump/liu_mrp0_23056.trc:
ORA-19906: recovery target incarnation changed during recovery
Recovery interrupted!
Wed Jun 27 15:05:40 2012
Errors in file /u01/app/oracle/admin/liu/bdump/liu_mrp0_23056.trc:
ORA-19906: recovery target incarnation changed during recovery
Wed Jun 27 15:06:00 2012
Managed Standby Recovery not using Real Time Apply
Media Recovery apply resetlogs offline range for datafile 1, incarnation : 0
Media Recovery apply resetlogs offline range for datafile 2, incarnation : 0
Media Recovery apply resetlogs offline range for datafile 3, incarnation : 0
Media Recovery apply resetlogs offline range for datafile 4, incarnation : 0

开始resetlog

Media Recovery Log /u01/arch/1_1_787071479.dbf
Media Recovery Waiting for thread 1 sequence 2
Fetching gap sequence in thread 1, gap sequence 2-14

开始fetching gap ;

我们在两个服务器上查看 table failover

SQL> desc failover;
ERROR:
ORA-04043: object test_liu does not exist

丢失了这张表,也就是说发生故障的scn 点直至failover 完成之后的数据 将全部丢失

总结一下:

1. 正常情况的failover 切换之后主备角色发生变化 DG集群中的其他standby端 可以不受到影响
2. 非常规情况下的failover (active database) 需要发生media recover,open需要resetlogs DG集群中的所有备库需要flashback到 STANDBY_BECAME_PRIMARY_SCN 点的SCN 将会丢失数据
3. 两种情况下 都需要打开flashback log

Recover physical standby database after loss of archive log(2)

June 20, 2012 oracle, replication No comments

Recover physical standby database after loss of archive log(2)

上次写过一篇DG丢失归档后的处理过程,总体来说就是使用增量备份覆盖gap数据从而跳过gap的archivelog 这里再阐述另一种情况


[oracle@db61 orcl]$ ora si
SQL*Plus: Release 10.2.0.5.0 – Production on Wed Jun 20 14:35:01 2012
Copyright (c) 1982, 2010, Oracle. All Rights Reserved.

Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 – 64bit Production With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> select database_role from v$database;
DATABASE_ROLE
—————-
PHYSICAL STANDBY
SQL> recover standby database;
ORA-00279: change 40103914365 generated at 05/23/2012 09:26:36 needed for thread 3
ORA-00289: suggestion : /data/oracle/oradata/orcl/arch/3_8658_657561562.dbf
ORA-00280: change 40103914365 for thread 3 is in sequence #8658

Specify log: {=suggested | filename | AUTO | CANCEL} auto
ORA-00308: cannot open archived log ‘/data/oracle/oradata/orcl/arch/3_8658_657561562.dbf’
ORA-27037: unable to obtain file status Linux-x86_64 Error: 2: No such file or directory Additional information: 3

这个库的大概情况为丢失了6月4号至今的所有归档,很容易想到使用standby 的current_scn去作为 起始scn增量备份,对于这里的增量备份出现了一个有趣的现象。看上面的操作,可以知道还是需要scn为40103914365的归档文件,but why ?既然已经使用增量备份recover database,这里就不兜圈子了,DG开启redo apply之后 oracle 会寻找file header最低的scn开始apply 我们可以查询下当前的file header scn:


SQL> select file#,to_char(checkpoint_change#) from v$datafile_header;

FILE# TO_CHAR(CHECKPOINT_CHANGE#)
———- —————————————-
1 42501726792
2 42501726792
3 42501726801
4 42501726801
5 42501726801
6 42501726801
7 42501726792
8 40103914365
9 42501726792
10 42501726801
11 42501726801

FILE# TO_CHAR(CHECKPOINT_CHANGE#)
———- —————————————-
12 42501726801
13 42501726792
14 42501726801
15 42501726792
16 42501726792
17 42501726801
18 42501726801


看到file 8的scn正是oracle需要的scn 对应上面的操作:change 40103914365 generated at 05/23/2012 09:26:36 needed for thread 3 这里的05/23/2012 09:26:36足以说明问题。查看主库的file 8文件发现 change time 为 05/23/2012,从这里可以说明file 8自从2012-05-23之后从来没有change过,对于这种file – BLOCK change为0 ,也就是说change scn为上一次的05/23/2012 09:26:36之前的change scn,即所有的块都不满足以上条件,所以对于从6月4号开始的增量备份,oracle将忽略这个文件的所有blocks从而导致recover之后file header checkpoint scn没有发生变化,当开启redo apply之后oracle仍然从最小的scn开始尝试恢复,从而导致这个诡异的现象,当然这种极端情况是很少出现的,这里我们可以采用rman copy这个file到standby端从而解决这个问题。

eg:

RMAN> copy datafile ‘+DATA/pri/datafile/udata01_16.dbf’ to ‘/data/xxx.dbf’;

Starting backup at 20-JUN-12
using channel ORA_DISK_1
channel ORA_DISK_1: starting datafile copy
input datafile fno=00008 name=+DATA/pri/datafile/udata01_16.dbf
output filename=/data/temp.dbf tag=TAG20120620T145721 recid=172 stamp=786467766
channel ORA_DISK_1: datafile copy complete, elapsed time: 00:18:46
Finished backup at 20-JUN-12

Starting Control File and SPFILE Autobackup at 20-JUN-12
piece handle=/data/oracle/product/10.2/db1/dbs/c-1185342296-20120620-00 comment=NONE
Finished Control File and SPFILE Autobackup at 20-JUN-12

scp ….

[oracle@db61 orcl]$ ora dgstats

NAME VALUE UNIT TIME_COMPUTED
—————————— ——————– —————————— ——————————
apply finish time +00 00:09:07.7 day(2) to second(1) interval 20-JUN-2012 18:54:58
apply lag +00 00:57:49 day(2) to second(0) interval 20-JUN-2012 18:54:58
estimated startup time 116 second 20-JUN-2012 18:54:58
standby has been open N 20-JUN-2012 18:54:58
transport lag +00 00:42:10 day(2) to second(0) interval 20-JUN-2012 18:54:58

4个小时之后DG 追上了16天的gap 恢复速度还是不错的 总体来说通过增量备份恢复丢失归档的DG是一个很常规的手法 16天的归档>=4T 如果从带库恢复归档 将是一个漫长的过程

Recover physical standby database after loss of archive log

March 20, 2012 oracle, replication 2 comments

今天报表数据库的备库出现了问题,由于监控脚本出现了问题,主机空间耗尽 而没有及时发出邮件,导致归档无法进行,DB停滞在一个时间点,同时在主库 由于只保留了2天的归档 导致这部分归档没有传输至备库,等我们发现问题时,主库已经删除了归档,备库接近8天gap无法恢复。由于主库超过3个T ,重新用备份恢复一个standby 时间上是不允许的,并且主库上也没有空间存放这么大的备份集,于是采取了增量备份的方法,跳过了这个GAP。下面记录这次恢复过程

查看备库日志发现:

Tue Mar 13 02:45:00 2012
ARC1: Closing local archive destination LOG_ARCHIVE_DEST_1: ‘/data/oracle/oradata/track1/arch/1_30021_754336076.dbf’ (error 19502) (track1)
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance track1 – Archival Error
ORA-16038: log 7 sequence# 30021 cannot be archived
ORA-19502: write error on file “”, block number (block size=)
ORA-00312: online log 7 thread 1: ‘/data/oracle/oradata/track1/standbyredo7.log’
Tue Mar 13 02:45:00 2012
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance track1 – Archival Error

Tue Mar 13 04:06:56 2012
Non critical error ORA-00001 caught while writing to trace file “/data/oracle/diag/rdbms/trackdg/track1/trace/track1_rfs_9177.trc”
Error message: Linux-x86_64 Error: 28: No space left on device
Additional information: 1
Writing to the above trace file is disabled for now on…
Creating archive destination file : /data/oracle/oradata/track1/arch/1_30038_754336076.dbf (808922 blocks)
Non critical error ORA-00001 caught while writing to trace file “/data/oracle/diag/rdbms/trackdg/track1/trace/track1_rfs_9175.trc”
Error message: Linux-x86_64 Error: 28: No space left on device
Additional information: 1
Writing to the above trace file is disabled for now on…

Fetching gap sequence in thread 1, gap sequence 30028-30127
Mon Mar 19 20:35:42 2012
Fetching gap sequence in thread 1, gap sequence 30028-30127
Mon Mar 19 20:35:52 2012
Fetching gap sequence in thread 1, gap sequence 30028-30127
Mon Mar 19 20:36:02 2012
Fetching gap sequence in thread 1, gap sequence 30028-30127
Mon Mar 19 20:36:12 2012
Fetching gap sequence in thread 1, gap sequence 30028-30127

发现从13号的2点 主机space 已经被耗尽. 13-17号的归档全部丢失(gap sequence 30028-30127)近100个归档

下面采取增量备份的方法恢复这个数据库:

在备库查找当前的SCN号

SQL> select current_scn from v$database;

CURRENT_SCN
———————–
16895677901

在主库用这个SCN 通过增量备份 备份出新的backupset :

RMAN> run {
2> allocate channel c1 device type disk;
3> allocate channel c2 device type disk;
4> allocate channel c3 device type disk;
5> allocate channel c4 device type disk;
6> BACKUP INCREMENTAL FROM SCN 16895677901 DATABASE FORMAT ‘/data/track_standby_%U’ tag ‘track_standby’;
7> release channel c1;
8> release channel c2;
9> release channel c3;
10> release channel c4;
11> }

将这个备份集 SCP 到备库的/data2 目录下 由于主备采用了convert 转换了文件的路径 需要通过set newname 来恢复备库 大致脚本如下:


[oracle@db59 track1]$ rman target /

Recovery Manager: Release 11.2.0.2.0 – Production on Mon Mar 19 23:44:58 2012

Copyright (c) 1982, 2009, Oracle and/or its affiliates. All rights reserved.

connected to target database: TRACK1 (DBID=3812187020, not open)

RMAN> CATALOG START WITH ‘/data2/track_standby’;

searching for all files that match the pattern /data2/track_standby

List of Files Unknown to the Database
=====================================
File Name: /data2/track_standby_76n694la_1_1

.. …

….

File Name: /data2/track_standby_9bn69dqo_1_1 (一共71个)

Do you really want to catalog the above files (enter YES or NO)? YES
cataloging files…
cataloging done

List of Cataloged Files
=======================
File Name: /data2/track_standby_76n694la_1_1

File Name: /data2/track_standby_9bn69dqo_1_1

RMAN> run {
allocate channel c1 device type disk;
allocate channel c2 device type disk;
allocate channel c3 device type disk;
allocate channel c4 device type disk;
recover database noredo;
release channel c1;
release channel c2;
release channel c3;
release channel c4;
}

[oracle@db59 ~]$ tailalert

Incremental restore complete of datafile 280 /data/oracle/oradata/track1/lg_track3_100.dbf
checkpoint is 36031098956
Incremental restore complete of datafile 294 /data/oracle/oradata/track1/lg_track4_004.dbf
checkpoint is 36031098956
Mon Mar 19 17:54:45 2012
Incremental restore complete of datafile 28 /data/oracle/oradata/track1/lg_track18.dbf
checkpoint is 36025918455
last deallocation scn is 19965755207
Incremental restore complete of datafile 286 /data2/oracle/oradata/track1/lg_track119.dbf
checkpoint is 36025918455
Incremental restore complete of datafile 245 /data/oracle/oradata/track1/lg_track3_68.dbf
checkpoint is 36025918455
Incremental restore complete of datafile 241 /data/oracle/oradata/track1/lg_track3_64.dbf

..

完成后,由于主库在期间添加过数据文件,使用rman copy 将 file 296,297 copy 到本地,scp至备库的相应位置,主库生成standby controlfile 在备库通过rename file 重新定义file路径:


alter database rename file ‘/data/oracle/oradata/track1/system.dbf’ to ‘/data/oracle/oradata/track1/system.dbf’;
..
alter database rename file ‘/storage/disk01/lg_track3_92.dbf’ to ‘/data/oracle/oradata/track1/lg_track3_92.dbf’;
alter database rename file ‘/storage/disk02/lg_track3_93.dbf’ to ‘/data/oracle/oradata/track1/lg_track3_93.dbf’;
alter database rename file ‘/storage/disk03/lg_track3_94.dbf’ to ‘/data/oracle/oradata/track1/lg_track3_94.dbf’;
alter database rename file ‘/storage/disk04/lg_track3_95.dbf’ to ‘/data/oracle/oradata/track1/lg_track3_95.dbf’;
alter database rename file ‘/storage/disk05/lg_track3_96.dbf’ to ‘/data/oracle/oradata/track1/lg_track3_96.dbf’;
alter database rename file ‘/storage/disk06/lg_track3_97.dbf’ to ‘/data/oracle/oradata/track1/lg_track3_97.dbf’;
alter database rename file ‘/storage/disk07/lg_track3_98.dbf’ to ‘/data/oracle/oradata/track1/lg_track3_98.dbf’;
alter database rename file ‘/storage/disk00/lg_track3_99.dbf’ to ‘/data/oracle/oradata/track1/lg_track3_99.dbf’;
alter database rename file ‘/storage/disk01/lg_track3_100.dbf’ to ‘/data/oracle/oradata/track1/lg_track3_100.dbf’;
..
….

alter database rename file ‘/storage/disk06/lg_track4_006.dbf’ to ‘/data2/oracle/oradata/track1/lg_track4_006.dbf’;
alter database rename file ‘/storage/disk02/lg_track4_007.dbf’ to ‘/data2/oracle/oradata/track1/lg_track4_007.dbf’;

添加standby logfile


SQL> alter system set standby_file_management=MANUAL;

System altered.

ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 group 6(‘/data2/oracle/oradata/track1/standbyredo6.log’) SIZE 500M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 group 7(‘/data2/oracle/oradata/track1/standbyredo7.log’) SIZE 500M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 group 8(‘/data2/oracle/oradata/track1/standbyredo8.log’) SIZE 500M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 group 9(‘/data2/oracle/oradata/track1/standbyredo9.log’) SIZE 500M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 group 10(‘/data2/oracle/oradata/track1/standbyredo10.log’) SIZE 500M;

开启日志apply


SQL> alter system set standby_file_management=auto;

System altered.

SQL> alter database recover managed standby database disconnect from session using current logfile;

观察日志 经过将近4个小时 终于追上了主库


Recovery of Online Redo Log: Thread 1 Group 12 Seq 31152 Reading mem 0
Mem# 0: /data2/oracle/oradata/track1/standbyredo12.log
Tue Mar 20 00:28:09 2012
RFS[3]: Selected log 11 for thread 1 sequence 31153 dbid -482780276 branch 754336076
Tue Mar 20 00:28:13 2012
Archived Log entry 101 added for thread 1 sequence 31152 ID 0xe339e88c dest 1:
Tue Mar 20 00:28:16 2012
Media Recovery Waiting for thread 1 sequence 31153 (in transit)
Recovery of Online Redo Log: Thread 1 Group 11 Seq 31153 Reading mem 0
Mem# 0: /data2/oracle/oradata/track1/standbyredo11.log
Tue Mar 20 00:32:40 2012
RFS[3]: Selected log 12 for thread 1 sequence 31154 dbid -482780276 branch 754336076
Tue Mar 20 00:32:42 2012
Media Recovery Waiting for thread 1 sequence 31154 (in transit)
Recovery of Online Redo Log: Thread 1 Group 12 Seq 31154 Reading mem 0

NAME VALUE UNIT TIME_COMPUTED DATUM_TIME
—————————— ——————– —————————— —————————— ——————————
transport lag +00 00:00:00 day(2) to second(0) interval 03/20/2012 00:37:51 03/20/2012 00:37:51
apply lag +00 00:00:18 day(2) to second(0) interval 03/20/2012 00:37:51 03/20/2012 00:37:51
apply finish time +00 00:00:00.078 day(2) to second(3) interval 03/20/2012 00:37:51
estimated startup time 17 second 03/20/2012 00:37:51

SQL> select current_scn from v$database;

CURRENT_SCN
—————–
36050243871

SQL> select * from v$archive_gap;

no rows selected

在这种比较极端的情况下选择这种恢复的方法,如果重做备库 将花费大量的时间与空间,使用增量备份,极大的缩短了恢复的时间

logic standby ORA-01403 故障处理一例 (2)

March 18, 2012 oracle, replication 3 comments

前几天刚处理了 logic standby故障,还没缓过神来 , logical standby 又出问题了。 还是那张表,那个错误 不得不让我怀疑 是不是那张表出现了问题。记录一下解决的过程:

2012-03-16 11:45:00 update “PROD_DATA2″.”PM_SUPPLIER” ORA-01403: no data found
set
“PRODUCT_CODE” = ‘0021244213’,

2012-03-16 11:44:29 update “PROD_DATA2″.”PM_SUPPLIER” ORA-01403: no data found
set
“PRODUCT_CODE” = ‘0021244213’,

依然是这个老错误,这张表的同步出现了不一致,下面我们来看看这张表的结构:

SQL> desc “PROD_DATA2″.”PM_SUPPLIER”;
Name Null? Type
—————————————– ——– —————————-
ID NOT NULL NUMBER(18)

PRODUCT_DESCRIPTION NCLOB

…..

IS_DELETED NUMBER(1)
UNIT VARCHAR2(10)
CATEGORY_SEARCH_NAME VARCHAR2(200)

这张表包含了一个NCLOB 字段,针对之前同步这张表的报错,让我们怀疑是不是NCLOB 出现了问题,让我们看看当时同步这张表的报错:

SQL> exec DBMS_LOGSTDBY.INSTANTIATE_TABLE(‘PROD_DATA2′,’PM_SUPPLIER’,’dblinkprimary’);

ERROR at line 1:
ORA-20001: ORA-31693: Table data object “PROD_DATA2″.”PM_SUPPLIER” failed to
load/unload and is being skipped due to error:
ORA-12161: TNS:internal error: partial data received
ORA-02063: preceding from
ORA-06512: at “SYS.KUPW$WORKER”, line 113
ORA-02063: preceding from
ORA-01555: snapshot too old: rollback segment number with name “” too small
ORA-22924: snapshot too old
ORA-02063: preceding 2 lines from DBLINKPRIMARY
ORA-01555: snapshot too old: rollback segment number with name “” too small
ORA-22924: snaps
ORA-06512: at “SYS.DBMS_LOGSTDBY”, line 664
ORA-06512: at line 1

snapshot too old: rollback segment number with name “” too small 这种报错十分值得怀疑,一般情况下LOB字段默认采用pctversion=10,在不是特别频繁修改LOB字段的情况下,是不会报错的.何况当时是凌晨3点左右同步的这张表,这种报错更加值得怀疑,使用如下脚本检查lob字段的可用性:

create table corrupted_data (corrupted_rowid rowid);

set concat off
declare
error_1555 exception;
pragma exception_init(error_1555,-1555);
v_lob NCLOB;
n number;
begin
for cursor_lob in (select rowid r from prod_data2.PM_SUPPLIER) loop
begin
select PRODUCT_DESCRIPTION into v_lob from prod_data2.PM_SUPPLIER where rowid=cursor_lob.r;
n:=dbms_lob.instr(v_lob,hextoraw(‘889911′)) ;
exception
when error_1555 then
insert into corrupted_data values (cursor_lob.r);
commit;
end;
end loop;
end;
/
undefine lob_column

之后查看 corrupted_data 发现17条记录存在问题:

sys@item> select * from corrupted_data;

CORRUPTED_ROWID
——————
AAAEKcAAMAABhYrAAA
AAAEKcAAMAABhYrAAB
AAAEKcAAMAABhYrAAC
AAAEKcAAMAABhYrAAD
AAAEKcAAMAABhYrAAE
AAAEKcAAMAABhYrAAF
AAAEKcAAMAABhYrAAG
AAAEKcAAMAABhYrAAH
AAAEKcAAMAABhYrAAI
AAAEKcAAMAABhYrAAJ
AAAEKcAAMAABhYrAAK
AAAEKcAAMAABhYrAAL
AAAEKcAAMAABhYrAAM
AAAEKcAAMAABhYrAAN
AAAEKcAAMAABhYrAAO
AAAEKcAAMAABhZrAAA
AAAEKcAAMAABhXSAAG

17 rows selected.

查看任意一条记录 报错:

select PRODUCT_DESCRIPTION from prod_data2.PM_SUPPLIER where rowid=’AAAEKcAAMAABhYrAAA’;

ORA-01555: snapshot too old: rollback segment number with name “” too small
ORA-22924: snapshot too old

这17条记录的NCLOB 都出现了问题,下面修复这些错误的column 信息

update prod_data2.PM_SUPPLIER set PRODUCT_DESCRIPTION=null where rowid in (select CORRUPTED_ROWID from corrupted_data);

Reference:
Export Fails With ORA-2354 ORA-1555 ORA-22924 and How To Confirm LOB Segment Corruption Using Export Utility? [ID 833635.1]

Export Receives The Errors ORA-1555 ORA-22924 ORA-1578 ORA-22922 [ID 787004.1]


For this reason I propose a simple PLSQL block in this new note, which runs faster and does not consume much of the system resources.

LOBDATA was used as the table which should be checked for any potential corrupted LOBs:

1. Create a new temporary table for storing all rowids of the corrupted LOBs. Let’s call it “corrupt_lobs”

SQL> create table corrupt_lobs (corrupt_rowid rowid, err_num number);

2. Make a desc on the table containing the LOB column:

DESC LOBDATA

Name Null? Type
—————————————– ——–
ID NOT NULL NUMBER
DOCUMENT BLOB

Run the following PLSQL block:

declare
error_1578 exception;
error_1555 exception;
error_22922 exception;
pragma exception_init(error_1578,-1578);
pragma exception_init(error_1555,-1555);
pragma exception_init(error_22922,-22922);
n number;
begin
for cursor_lob in (select rowid r, document from LOBDATA) loop
begin
n:=dbms_lob.instr(cursor_lob.document,hextoraw(‘889911’));
exception
when error_1578 then
insert into corrupt_lobs values (cursor_lob.r, 1578);
commit;
when error_1555 then
insert into corrupt_lobs values (cursor_lob.r, 1555);
commit;
when error_22922 then
insert into corrupt_lobs values (cursor_lob.r, 22922);
commit;
end;
end loop;
end;
/

In the end all rowids of the corrupted LOBs will be inserted into the corrupt_lobs newly created table.

A possible solution would then be to empty the affected LOBs using a statement like:

SQL> update LOBDATA set document = empty_blob()
where rowid in (select corrupt_rowid from corrupt_lobs);

这里我们采用set null的方法 关于LOB 字段 null or empty 参考以下定义:


You can set an internal LOB — that is, a LOB column in a table, or a LOB attribute in an object type defined by you– to be NULL or empty:

Setting an Internal LOB to NULL:

A LOB set to NULL has no locator. A NULL value is stored in the row in the table, not a locator. This is the same process as for all other datatypes.

Setting an Internal LOB to Empty:

By contrast, an empty LOB stored in a table is a LOB of zero length that has a locator. So, if you SELECT from an empty LOB column or attribute, you get back a locator which you can use to populate the LOB with data via one of the six programmatic environments, such as OCI or PL/SQL(DBMS_LOB).

这里修改完毕后恢复pctversion为10(之前我们设置为90) 重新同步这张表到3个逻辑备库,很顺利的完成,后续效果待观察。

logic standby ORA-01403 故障处理一例

March 14, 2012 oracle, replication 2 comments

昨天的logic standby 事故处理:

夜里12点左右 2台logic standby 全部停止日志apply 查看logic standby 相关进程发现:

1.select spid, type, status from v$logstdby_process no rows returns

2.2012-03-12 23:36:16 update “PROD_DATA2″.”PM_SUPPLIER” set xxx…..

ORA-01403: no data found

报出如下错误,这种错误一般是由于standby 端的表数据与primary端出现了不一致导致的,最大的麻烦莫过于这张表是一个LOB字段的表

通过数据的对比,发现源表与目标表 已经出现了很大的不一致,初步判断是由于skip操作导致的 transaction 丢失导致的数据异常

准备使用DBMS_LOGSTDBY.INSTANTIATE_TABLE 进行表的重新同步:

INSTANTIATE_TABLE Procedure

This procedure creates and populates a table in the standby database from a corresponding table in the primary database. The table requires the name of the database link (dblink) as an input parameter. If the table already exists in the logical standby database, it will be dropped and re-created based on the table definition at the primary database. This procedure only brings over the data associated with the table, and not the associated indexes and constraints.

Use the INSTANTIATE_TABLE procedure to:

Add a table to a standby database.

Re-create a table in a standby database.

Syntax

DBMS_LOGSTDBY.INSTANTIATE_TABLE (
schema_name IN VARCHAR2,
table_name IN VARCHAR2,
dblink IN VARCHAR2);
Parameters

Table 56-5 INSTANTIATE_TABLE Procedure Parameters

Parameter Description
schema_name
Name of the schema
table_name
Name of the table to be created or re-created in the standby database
dblink
Name of the database link account that has privileges to read and lock the table in the primary database

SQL*Plus: Release 10.2.0.5.0 – Production on Wed Mar 14 00:42:21 2012

Copyright (c) 1982, 2010, Oracle. All Rights Reserved.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 – 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL>
SQL> alter session set db_file_multiblock_read_count = 128;

Session altered.

SQL> alter session set sort_area_size =804857600;

Session altered.

SQL> alter session set workarea_size_policy = manual;

Session altered.

SQL> exec DBMS_LOGSTDBY.INSTANTIATE_TABLE(‘PROD_DATA2′,’PM_SUPPLIER’,’dblinkprimary’);

ERROR at line 1:
ORA-20001: ORA-31693: Table data object “PROD_DATA2″.”PM_SUPPLIER” failed to
load/unload and is being skipped due to error:
ORA-12161: TNS:internal error: partial data received
ORA-02063: preceding from
ORA-06512: at “SYS.KUPW$WORKER”, line 113
ORA-02063: preceding from
ORA-01555: snapshot too old: rollback segment number with name “” too small
ORA-22924: snapshot too old
ORA-02063: preceding 2 lines from DBLINKPRIMARY
ORA-01555: snapshot too old: rollback segment number with name “” too small
ORA-22924: snaps
ORA-06512: at “SYS.DBMS_LOGSTDBY”, line 664
ORA-06512: at line 1

报出如下错误 这种错误一般是由于LOB 回滚段不足所导致的:

关于LOB的undo做如下说明:

RETENTION – time-based: this specifies how long older versions are to be retained.

PCTVERSION – space-based: this specifies what percentage of the LOB segment is to be used to hold older versions.

Points to Note:

RETENTION is a keyword in the LOB column definition. No value can be specified for RETENTION. The RETENTION value is implicit. If a LOB is created with database compatibility set to 9.2.0.0 or higher, undo_management=TRUE and PCTVERSION is not explicitly specified, time-based retention is used. The LOB RETENTION value is always equal to the value of the UNDO_RETENTION database instance parameter.

You cannot specify both PCTVERSION and RETENTION.

PCTVERSION is applicable only to LOB chunks/pages allocated in LOBSEGMENTS. Other LOB related data in the table column and the LOBINDEX segment use regular undo mechanism.

PCTVERSION=0: the space allocated for older versions of LOB data in LOBSEGMENTS can be reused by other transactions and can cause“snapshot too old” errors.

PCTVERSION=100: the space allocated by older versions of LOB data can never be reused by other transactions. LOB data storage space is never reclaimed and it always increases.

RETENTION and PCTVERSION can be altered Recommendation Time-based retention using the RETENTION keyword is preferred. A high value for RETENTION or PCTVERSION may be needed to avoid “snapshot too old” errors in environments with high concurrent read/write LOB access.

查看”PROD_DATA2″.”PM_SUPPLIER”这张表的LOB属性:

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 – 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options

SQL>
SQL> select bitand(flags,32) from sys.lob$ where OBJ#= (select OBJECT_ID from dba_objects where
OWNER=’PROD_DATA2′ and OBJECT_NAME=’PM_SUPPLIER’); 2

BITAND(FLAGS,32)
—————-
0

SQL> select table_name, segment_name, pctversion, retention
from dba_lobs where table_name = ‘PM_SUPPLIER’;

TABLE_NAME SEGMENT_NAME PCTVERSION RETENTION
—————————— —————————— ———- ———-
PM_SUPPLIER SYS_LOB0000017052C00041$$ 10

修改为90

alter table prod_data2.PM_SUPPLIER modify lob(PRODUCT_DESCRIPTION) (pctversion 90);

sys@item> select table_name, segment_name, pctversion, retention
from dba_lobs where table_name = ‘PM_SUPPLIER’; 2

TABLE_NAME SEGMENT_NAME PCTVERSION RETENTION
—————————— —————————— ———- ———-
PM_SUPPLIER SYS_LOB0000017052C00041$$ 90

重新开启同步,继续报错:

ERROR at line 1:
ORA-20001: ORA-39082: Object type
TRIGGER:”PROD_DATA2″.”TRIG_PRODUCT_SUPPLIER_INSERT” created with compilation
warnings
ORA-06512: at “SYS.DBMS_LOGSTDBY”, line 664
ORA-06512: at line 1

使用 DBMS_LOGSTDBY.INSTANTIATE_TABLE同步时 所依赖的对象会一并同步过来,这个trigger 在源端我们已经禁用了,直接drop掉这个trigger继续同步:

SQL> alter session set db_file_multiblock_read_count = 128;

Session altered.

SQL> alter session set sort_area_size =804857600;

Session altered.

SQL> alter session set workarea_size_policy = manual;

Session altered.

SQL> exec DBMS_LOGSTDBY.INSTANTIATE_TABLE(‘PROD_DATA2′,’PM_SUPPLIER’,’dblinkprimary’);

PL/SQL procedure successfully completed.

经历了漫长的等待终于同步成功,由于此前失败了好几次,总共大概花了7个小时,可见LOB表对于logic standby的影响,一旦出现问题,同步花费的时间将大大的超过普通表,一张记录相同的普通表 使用DBMS_LOGSTDBY.INSTANTIATE_TABLE同步大概只需要10分钟左右,相比较而言差了几百倍之多!

同步完成后 进度很快追了上来:

SPID TYPE STATUS
———— ————— —————————————————————————————————-
29989 COORDINATOR ORA-16116: no work available
29940 READER ORA-16116: no work available
29942 BUILDER ORA-16116: no work available
29944 PREPARER ORA-16116: no work available
29946 PREPARER ORA-16116: no work available
29948 PREPARER ORA-16116: no work available
29952 ANALYZER ORA-16116: no work available
29954 APPLIER ORA-16116: no work available
29956 APPLIER ORA-16116: no work available
29958 APPLIER ORA-16116: no work available
29960 APPLIER ORA-16116: no work available
29962 APPLIER ORA-16116: no work available
29964 APPLIER ORA-16116: no work available
29966 APPLIER ORA-16116: no work available
29968 APPLIER ORA-16116: no work available
29970 APPLIER ORA-16116: no work available
29972 APPLIER ORA-16116: no work available
29974 APPLIER ORA-16116: no work available
29976 APPLIER ORA-16116: no work available

19 rows selected.

APPLIED_SCN APPLIED_TIME RESTART_SCN RESTART_TIME LATEST_SCN LATEST_TIME MINING_SCN MINING_TIME
———– ——————- ———– ——————- ———- ——————- ———- ——————-
3.5694E+10 2012-03-13 06:31:33 3.5686E+10 2012-03-12 19:07:22 3.5694E+10 2012-03-13 06:31:56 3.5694E+10 2012-03-13 06:31:34

终于处理完了这次故障,仰头一看 东方已经微微既白。对于大型的OLTP系统LOB表的存在无疑增加的很多维护的难度,对于要求性极高的系统,LOB字段建议不要存放在数据库系统,无论对于DG还是后期的例如OGG的同步 都是一个极大的隐患。

记录一次database upgrade 导致physical standby故障

February 9, 2012 migration, oracle, replication No comments

记录一次database upgrade 导致physical standby故障

upgrade from 10.2.0.5->11.2.0.3

单节点的database升级很容易,严格按照手册,修改compatible=10.2.0

参考文档 Complete Checklist for Manual Upgrades to 11gR2 [ID 837570.1]

如果带有standby的系统升级要注意以下几点:

1.首先stop 10g listener,使用 11g listener

2.使用11g software打开standby database

3,开启日志apply

由于没有仔细看文档,升级primary的时候 没有打开standby 备库处于shutdown 状态,想primary升级完成后开启sql apply,报如下错误

Errors in file /data/oracle/diag/rdbms/paystd/paystd/trace/paystd_pr00_17447.trc (incident=40247):
ORA-00353: log corruption near block 6144 change 2561976 time 02/08/2012 00:00:15
ORA-00334: archived log: ‘/data/oracle/oradata/paystd/arch/1_78_770564180.dbf’
Incident details in: /data/oracle/diag/rdbms/paystd/paystd/incident/incdir_40247/paystd_pr00_17447_i40247.trc
Completed: alter database recover managed standby database disconnect from session using current logfile
MRP0: Background Media Recovery terminated with error 354
Errors in file /data/oracle/diag/rdbms/paystd/paystd/trace/paystd_pr00_17447.trc:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 6144 change 2561976 time 02/08/2012 00:00:15
ORA-00334: archived log: ‘/data/oracle/oradata/paystd/arch/1_78_770564180.dbf’
Managed Standby Recovery not using Real Time Apply
Wed Feb 08 09:53:02 2012
Sweep [inc][40247]: completed
Wed Feb 08 09:53:02 2012
Sweep [inc2][40247]: completed

[oracle@db57 ~]$ vi /data/oracle/diag/rdbms/paystd/paystd/incident/incdir_40247/paystd_pr00_17447_i40247.trc

*** 2012-02-08 09:53:01.725
*** SESSION ID:(765.15) 2012-02-08 09:53:01.725
*** CLIENT ID:() 2012-02-08 09:53:01.725
*** SERVICE NAME:() 2012-02-08 09:53:01.725
*** MODULE NAME:() 2012-02-08 09:53:01.725
*** ACTION NAME:() 2012-02-08 09:53:01.725

Dump continued from file: /data/oracle/diag/rdbms/paystd/paystd/trace/paystd_pr00_17447.trc
ORA-00353: log corruption near block 6144 change 2561976 time 02/08/2012 00:00:15
ORA-00334: archived log: ‘/data/oracle/oradata/paystd/arch/1_78_770564180.dbf’

========= Dump for incident 40247 (ORA 353 [6144] [2561976]) ========

*** 2012-02-08 09:53:01.725
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
—– SQL Statement (None) —–
Current SQL information unavailable – no cursor.

—– Call Stack Trace —–

可以看到standby 无法读取 block 6144 的内容 而此时的情况是 如果直接使用real-time-apply 则报错

临时解决方案:

重建primary端的redo log:

使用循环 drop inactive group 之后 standby 恢复正常:


alter database recover managed standby database disconnect using current logfile
Attempt to start background Managed Standby Recovery process (paystd)
Wed Feb 08 10:25:46 2012
MRP0 started with pid=26, OS id=18107
MRP0: Background Managed Standby Recovery process started (paystd)
started logmerger process
Wed Feb 08 10:25:51 2012
Managed Standby Recovery starting Real Time Apply
Parallel Media Recovery started with 12 slaves
Waiting for all non-current ORLs to be archived…
All non-current ORLs have been archived.
Media Recovery Log /data/oracle/oradata/paystd/arch/1_95_770564180.dbf
Media Recovery Waiting for thread 1 sequence 96 (in transit)
Recovery of Online Redo Log: Thread 1 Group 6 Seq 96 Reading mem 0
Mem# 0: /data/oracle/oradata/paystd/std2.log
Completed: alter database recover managed standby database disconnect using current logfile
Wed Feb 08 19:45:46 2012
RFS[1]: Selected log 5 for thread 1 sequence 97 dbid 1905896596 branch 770564180
Wed Feb 08 19:45:46 2012
Media Recovery Waiting for thread 1 sequence 97 (in transit)
Recovery of Online Redo Log: Thread 1 Group 5 Seq 97 Reading mem 0
Mem# 0: /data/oracle/oradata/paystd/std1.log
Wed Feb 08 19:45:46 2012
Archived Log entry 24 added for thread 1 sequence 96 ID 0x7198f794 dest 1:

没有按照文档做导致的匪夷所思的错误,至于standby 为何无法直接读取primary 端的redo内容,有待后续查证

Logic standby delay when query with for update clause

January 11, 2012 oracle, replication, system No comments

今天遇到一个奇怪的问题,3个logic standby 中有一个delay 查看任何日志都没有发现错误,有一个transaction在apply一个SQL,这条SQL没有任何问题,SQL所涉及到的表也没有任何问题
当时的apply情况如下:

SID USERNAME MACHINE EVENT PARAM W WT SQL ST LT LOGON_TIME
—— ———- ———————- —————————— ——————– —- —- —————————- — —— ———-
3222 oracle@db-15 (P000) LogMiner: wakeup event for rea 1/0/0 0 718 / A 1587 1588
3279 oracle@db-15 (P003) LogMiner: wakeup event for pre 1/0/0 0 #### / A 1587 1588
3238 oracle@db-15 (P008) rdbms ipc message 500/0/0 0 #### /404qaurwrsnva A 1587 1588
3237 oracle@db-15 (P011) rdbms ipc message 500/0/0 0 #### /404qaurwrsnva A 1587 1588
3276 oracle@db-15 (P002) LogMiner: wakeup event for pre 1/0/0 0 #### / A 1587 1588
3250 oracle@db-15 (P005) rdbms ipc message 500/0/0 0 #### /05dcutzfrnzmw A 1587 1588
3254 oracle@db-15 (P007) rdbms ipc message 500/0/0 0 #### /404qaurwrsnva A 1587 1588
3256 oracle@db-15 (P010) rdbms ipc message 500/0/0 0 #### /404qaurwrsnva A 1587 1588
3257 oracle@db-15 (P012) rdbms ipc message 500/0/0 0 #### /404qaurwrsnva A 1587 1588
3258 oracle@db-15 (P015) rdbms ipc message 500/0/0 0 #### /404qaurwrsnva A 1587 1588
3262 oracle@db-15 (P014) rdbms ipc message 500/0/0 0 #### /404qaurwrsnva A 1587 1588
3263 oracle@db-15 (P013) rdbms ipc message 500/0/0 0 #### /404qaurwrsnva A 1587 1588
3267 oracle@db-15 (P017) rdbms ipc message 500/0/0 0 #### /404qaurwrsnva A 1587 1588
3269 oracle@db-15 (P016) rdbms ipc message 500/0/0 0 #### /404qaurwrsnva A 1587 1588
3274 oracle@db-15 (P009) rdbms ipc message 500/0/0 0 #### /404qaurwrsnva A 1587 1588
3275 oracle@db-15 (P004) LogMiner: wakeup event for pre 1/0/0 0 #### / A 1587 1588
3246 oracle@db-15 (P006) latch: cache buffers chains 63287313528/124/0 -1 #### bmgkxmt7cybyp/bbjphxukw15ap K 1587 1588
3277 oracle@db-15 (P001) LogMiner: wakeup event for bui 1/0/0 0 814 21wwpjmcv56ds/ A 1587 1588

具体SQL为:
select “PRODUCT_DESCRIPTION”,”SALE_SKILL” from “PROD_DATA2″.”PRO
DUCT” p where(:1=”ID”)for update 我们看到这条SQL带有 for update操作
尝试kill掉这个进程 hang 死:
SQL> alter system kill session ‘3272,4’;
alter system kill session ‘3272,4’
*
ERROR at line 1:
ORA-01013: user requested cancel of current operation

尝试kill掉spid
[oracle@db-15 ~]$ kill -9 8823
-bash: kill: (8823) – No such process

尝试手动停止sql apply:

SQL>alter database stop logical standby apply;

依然hang死

所有的进程都在等待这个回话结束but,it’s for update

尝试强制abort这个transaction:

SQL> ALTER DATABASE ABORT LOGICAL STANDBY APPLY;
SQL>alter database start logical standby apply immediate;

done.3个小时的delay很快被追上

Stopping SQL Apply on a Logical Standby Database

To stop SQL Apply, issue the following statement on the logical standby database:

SQL> ALTER DATABASE STOP LOGICAL STANDBY APPLY;When you issue this statement, SQL Apply waits until it has committed all complete transactions that were in the process of being applied. Thus, this command may not stop the SQL Apply processes immediately.

If you want to stop SQL Apply immediately, issue the following statement:

SQL> ALTER DATABASE ABORT LOGICAL STANDBY APPLY;

下面做个测试

maybe 仅是个人猜测:备库由PL/SQLDEV 发起的for update查询,但是DBA 没有关闭这个进程:

开启一个新的session:

SQL> update product set id=100000 where id=56;

出现了等待 。也许由于某个DBA 对这个逻辑备库开启了表的for update,但是忘掉了commit 这个操作 it just a guess !

主表缺少主键导致logic standby delay 一例

December 28, 2011 oracle, replication No comments

table缺少primary key 导致logic standby delay 一例

早上在主库做如下操作:


[oracle@racdg1 ~]$ sqlplus ‘/as sysdba’

SQL*Plus: Release 10.2.0.5.0 – Production on Wed Dec 28 11:23:14 2011

Copyright (c) 1982, 2010, Oracle. All Rights Reserved.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 – 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options

SQL> DECLARE
2 row_num NUMBER := 0;
BEGIN
FOR c_test IN (SELECT ROWID rid FROM MC_SITE_PRODUCT_PIC t ) LOOP
UPDATE MC_SITE_PRODUCT_PIC t SET t.UPDATE_FLAG = 0 WHERE ROWID = c_test.rid;
row_num := row_num + 1;
IF MOD(row_num,2000) =0 THEN
COMMIT;
dbms_lock.sleep(2);
END IF;
END LOOP;
COMMIT;
END; 3 4 5 6 7 8 9 10 11 12 13
14 /

PL/SQL procedure successfully completed.

SQL>

该表大约有150万行,选择update 2000条 提交一次,导致逻辑备库出现delay


SPID TYPE STATUS
———— ————— —————————————————————————————————-
5798 COORDINATOR ORA-16116: no work available
5665 READER ORA-16127: stalled waiting for additional transactions to be applied
5667 BUILDER ORA-16127: stalled waiting for additional transactions to be applied
5669 PREPARER ORA-16127: stalled waiting for additional transactions to be applied
5671 PREPARER ORA-16127: stalled waiting for additional transactions to be applied
5673 PREPARER ORA-16127: stalled waiting for additional transactions to be applied
5675 ANALYZER ORA-16120: dependencies being computed for transaction at SCN 0x0007.1a92b5a4
5677 APPLIER ORA-16124: transaction 21 33 10269852 is waiting on another transaction
5679 APPLIER ORA-16113: applying change to table or sequence “PROD_DATA2″.”MC_SITE_PRODUCT_PIC”
5681 APPLIER ORA-16124: transaction 21 27 10269698 is waiting on another transaction
5683 APPLIER ORA-16124: transaction 21 6 10269799 is waiting on another transaction
5685 APPLIER ORA-16124: transaction 21 29 10269696 is waiting on another transaction
5687 APPLIER ORA-16124: transaction 10 37 2597623 is waiting on another transaction
5689 APPLIER ORA-16124: transaction 21 40 10269557 is waiting on another transaction
5691 APPLIER ORA-16124: transaction 33 35 358073 is waiting on another transaction
5693 APPLIER ORA-16124: transaction 33 16 357996 is waiting on another transaction
5695 APPLIER ORA-16124: transaction 21 46 10269647 is waiting on another transaction
5697 APPLIER ORA-16124: transaction 10 23 2598005 is waiting on another transaction
5699 APPLIER ORA-16124: transaction 21 13 10269728 is waiting on another transaction

—>5679 APPLIER ORA-16113: applying change to table or sequence “PROD_DATA2”.”MC_SITE_PRODUCT_PIC”更新出现延迟

APPLIED_SCN APPLIED_TIME RESTART_SCN RESTART_TIME LATEST_SCN LATEST_TIME MINING_SCN MINING_TIME
———– ——————- ———– ——————- ———- ——————- ———- ——————-
3.0511E+10 2011-12-28 09:12:06 3.0510E+10 2011-12-28 08:48:07 3.0511E+10 2011-12-28 09:27:28 3.0511E+10 2011-12-28 09:24:01

延迟超过了10分钟,并且有不断加大的趋势

查看5679 apply过程 :

SQL_ID CHILD outline/plan_hash_value Ex DISK_READS bg bg/exec rows LOAD_TIME
————- ———- —————————————————————- ———- ———- ————— ———— ———- ———–
8zy2mm8z8fvpw 0 745000887 109851 24 59862576 544.94 109851 12-28/00:44
TOTAL 109851 24 59862576 544.94 109851

Optimizer Plan:
——————————————————————————————————————————
| Operation | Name | Rows | Bytes| Cost | Pstart| Pstop |
————————————————————————————————————
| UPDATE STATEMENT | | | | 4746 | | |
| UPDATE |MC_SITE_PRODUCT_PIC | | | | | |
| COUNT STOPKEY | | | | | | |
| TABLE ACCESS FULL |MC_SITE_PRODUCT_PIC | 1 | 68 | 4746 | | |
————————————————————————————————————

发现主库已经更新完成150万 ,但是逻辑备库仅仅更新了10万条,执行计划走了 TABLE ACCESS FULL

查看主库的执行计划:

Optimizer Plan:
——————————————————————————————————————————
| Operation | Name | Rows | Bytes| Cost | Pstart| Pstop |
————————————————————————————————————
| UPDATE STATEMENT | | | | 1 | | |
| UPDATE |MC_SITE_PRODUCT_PIC | | | | | |
| TABLE ACCESS BY USER ROWID |MC_SITE_PRODUCT_PIC | 1 | 69 | 1 | | |
————————————————————————————————————

完全不同,查看该表的结构发现此表没有主键,导致logic standby的约束检查,导致logic standby 严重delay

解决方法:

SQL> create public database link xxx connect to xxx IDENTIFIED by xxx using ‘xxx’;

SQL> alter session db_file_multiblock_read_count = 128;

SQL> alter session set sort_area_size = 104857600;

SQL> alter session set workarea_size_policy=manual

SQL> alter database stop logical standby apply;

SQL> EXECUTE DBMS_LOGSTDBY.INSTANTIATE_TABLE (‘xxx’, ‘xxx’, ‘xxx’);

SQL> EXECUTE DBMS_LOGSTDBY.UNSKIP(‘DML’,’xx’,’xxx’);

SQL> alter database start logical standby apply immediate

使用 DBMS_LOGSTDBY.INSTANTIATE_TABLE 将表从pri端 重新同步过来。注意上面的步骤需要在空闲时候做,不然可能导致大量的latch等待:
这次操作产生了大量以下事件,幸运的是table本身不是很大,同步大概花了几十秒而已,对于大表的同步切记小心:


latch: undo global data
latch: undo global data
latch: undo global data
latch: undo global data
latch: undo global data
latcg cache buffers chains
latch: cache buffers chains
latch: cache buffers chains
latch: cache buffers chains
latch: cache buffers chains


临时的解决方法可以先skip掉该表。最后强调一点,表结构最好添加主键,没有主键的表对类似ogg等同步工具的使用都会产生影响。

Physical-standby standbylogfile checksum error

December 23, 2011 oracle, replication No comments

总结下今天DG碰到的问题,本来一个很简单的问题,被我们复杂化衍生出很多问题。

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 – 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> show parameter check

NAME TYPE VALUE
———————————— ———– ——————————
db_block_checking string FALSE
db_block_checksum string TRUE
log_checkpoint_interval integer 0
log_checkpoint_timeout integer 1800
log_checkpoints_to_alert boolean FALSE
SQL> show parameter filesystem

通过filesystemio_options设置DIO和ASYNC模式

NAME TYPE VALUE
———————————— ———– ——————————
filesystemio_options string SETALL
SQL>

SQL>alter database recover managed standby database disconnect from session using current logfile;

以下是alert.log信息:

Thu Dec 22 18:23:57 CST 2011
Waiting for all non-current ORLs to be archived…
Media Recovery Log /data/oracle/oradata/paystd/arch/1_3_770564180.dbf
MRP0: Background Media Recovery terminated with error 368
Thu Dec 22 18:23:57 CST 2011
Errors in file /data/oracle/oradata/paystd/bdump/paystd_mrp0_2983.trc:
ORA-00368: checksum error in redo log block
ORA-00353: log corruption near block 6144 change 175261 time 12/22/2011 17:00:09
ORA-00334: archived log: ‘/data/oracle/oradata/paystd/arch/1_3_770564180.dbf’

checksum检查发现错误,同事以前是做SUN的,认为是DIO导致的,修改filesystemio_options=async,之后我也没管,今天早上来问说正常了,我检查了下依然发现问题:

Errors in file /data/oracle/oradata/paystd/udump/paystd_rfs_5428.trc:
ORA-00367: checksum error in log file header
ORA-00315: log 5 of thread 0, wrong thread # 1 in header
ORA-00312: online log 5 thread 0: ‘/data/oracle/oradata/paystd/std1.log’

改成async依然不行?

检查system 设置:

[oracle@db57 bdump]$ cat /proc/slabinfo |grep kio
kioctx 43 72 320 12 1 : tunables 54 27 8 : slabdata 6 6 0
kiocb 11 30 256 15 1 : tunables 120 60 8 : slabdata 2 2 0
[oracle@db57 bdump]$

系统没有问题 async使用正常,至此 我认为filesystemio_options并不是问题的根源,查看相关metalink文档:

Oracle Linux, Filesystem & I/O Type Supportability [ID 279069.1]

此图明确的指出了Oracle在linux5下所支持的IOtype(async sync dio) 并不会因为DG的原因所改变,在测试环境做了如下测试。


session1:

SQL> alter system set filesystemio_options = directio scope=spfile;

System altered.

SQL> shutdown abort;
ORACLE instance shut down.
SQL> startup
ORACLE instance started.

Total System Global Area 218103808 bytes
Fixed Size 1218580 bytes
Variable Size 88082412 bytes
Database Buffers 121634816 bytes
Redo Buffers 7168000 bytes
Database mounted.
Database opened.

SQL> begin
2 for i in 1..10000 loop
3 insert into t1 values (i);
4 commit;
5 end loop;
6 end;
7 /

PL/SQL procedure successfully completed.

SQL> alter system switch logfile;

System altered.

SQL> alter system checkpoint;

System altered.

SQL>
SQL>
SQL> begin
2 for i in 1..1000 loop
3 insert into t1 values (i);
4 commit;
5 end loop;
6 end;
7 /

PL/SQL procedure successfully completed.

SQL> commit;

Commit complete.

SQL>
SQL>
SQL> alter system checkpoint;

System altered.

SQL>

session2:

[root@liu u01]# strace -o /u01/dbwn.log -p 18671
Process 18671 attached – interrupt to quit

pwrite64(20, “\6\242\0\0\323\21@\0\321\350\24\0\0\0\1\6\345v\0\0\2\0\’\0@\2\0\0\320\350\24\0″…, 8192, 37380096) = 8192
times(NULL) = 431759580
times(NULL) = 431759580
pwrite64(20, “\6\242\0\0\326\21@\0\321\350\24\0\0\0\1\6\31g\0\0\2\0\0\0@\2\0\0\320\350\24\0″…, 8192, 37404672) = 8192
times(NULL) = 431759581
times(NULL) = 431759581
pwrite64(20, “\6\242\0\0\333\21@\0\321\350\24\0\0\0\1\6\3624\0\0\2\0\0\0A\2\0\0\320\350\24\0″…, 8192, 37445632) = 8192
times(NULL) = 431759581
times(NULL) = 431759581
pwrite64(20, “\6\242\0\0\336\21@\0\321\350\24\0\0\0\1\6\302\221\0\0\2\0\0\0A\2\0\0\320\350\24\0″…, 8192, 37470208) = 8192
times(NULL) = 431759581
times(NULL) = 431759581
pwrite64(20, “\6\242\0\0\”+@\0\300\343\24\0\0\0\1\6\2358\0\0\1\0\0\0\351\23\0\0\277\343\24\0″…, 8192, 90456064) = 8192
times(NULL) = 431759581
times(NULL) = 431759581
pwrite64(20, “\6\242\0\0002+@\0\300\343\24\0\0\0\1\6|Z\0\0\2\0\5\0\353\23\0\0\277\343\24\0″…, 8192, 90587136) = 8192

证明了此观点。OK 那么这种现象只能定义为block corrupt–>对应 checksum error in redo log block,这么简单的现象,因为一个filesystemio设置,引申出如此多的疑虑。

解决方法:clear or drop logfile;

我们采取 clear logfile的方法:

SQL> alter database clear logfile ‘/data/oracle/oradata/paystd/std5.log’
Clearing online log 9 of thread 0 sequence number 0
Completed: alter database clear logfile ‘/data/oracle/oradata/paystd/std5.log’

SQL> alter system set filesystemio_options=setall scope=spfile;

System altered.

SQL> shutdown abort;
ORACLE instance shut down.
SQL> startup mount;
ORACLE instance started.

Total System Global Area 1.2885E+10 bytes
Fixed Size 2118256 bytes
Variable Size 1744833936 bytes
Database Buffers 1.1123E+10 bytes
Redo Buffers 14655488 bytes
Database mounted.
SQL> alter database recover managed standby database disconnect from session using current logfile;

Database altered.

SQL>

查看alert:

Recovery of Online Redo Log: Thread 1 Group 5 Seq 44 Reading mem 0
Mem# 0: /data/oracle/oradata/paystd/std1.log
Fri Dec 23 12:52:09 CST 2011
Completed: ALTER DATABASE RECOVER managed standby database using current logfile disconnect
Redo Shipping Client Connected as PUBLIC
— Connected User is Valid
RFS[2]: Assigned to RFS process 5942
RFS[2]: Identified database type as ‘physical standby’
Primary database is in MAXIMUM PERFORMANCE mode
Primary database is in MAXIMUM PERFORMANCE mode
RFS[2]: Successfully opened standby log 6: ‘/data/oracle/oradata/paystd/std2.log’
Fri Dec 23 13:31:15 CST 2011
RFS[1]: Successfully opened standby log 5: ‘/data/oracle/oradata/paystd/std1.log’
Fri Dec 23 13:31:16 CST 2011
Media Recovery Waiting for thread 1 sequence 45 (in transit)
Fri Dec 23 13:31:16 CST 2011
Recovery of Online Redo Log: Thread 1 Group 6 Seq 45 Reading mem 0
Mem# 0: /data/oracle/oradata/paystd/std2.log

APPLY 正常 问题解决

总结: 一个比较简单的问题,因为同事的一句话:设置filesystemio为ASYNC解决了,导致后面一连串的认证工作,以至于怀疑DG是否允许DIO特性,看来自己对待问题的态度还不够严谨,但是由此延伸到filesystemio问题也不失为一件好事,以此记之。