RAC

[转] Oracle 10g RAC中的DRM问题及关闭

December 1, 2011 Internals, oracle, RAC 2 comments

在RAC环境中,Oracle使用GRD(Global Resource Service)来记录各个RAC节点的资源信息,具体通过GCS(Global Cache Service)和GES(Global Enqueue Service)这两个服务进行管理。
由于在RAC中每个节点都有自己的SGA和buffer cache,为了保证Cache资源的一致性和提高性能,GCS和GES会指定RAC中的一个instance来管理Cache,这个节点这时就是Resource Master。
在10g以前,Cache资源是不能在各个节点间移动的,除非重启或者某节点因为其他异常被RAC驱逐等情况。而10g的DRM就解决了这个问题,它可以保证cache能够被remaster到频繁访问这部分数据的节点上,从而提高RAC的性能。DRM的全称是Dynamic Resource Mastering,metalink上的Doc ID: 390483.1文档详细介绍了DRM的信息。

从理论上讲,利用此项技术,非master节点对所需资源有频繁访问需求时,可以提升为master节点,从而减少大量后续的跨节点资源访问需求。
但是,首先从根本上说,一个好的RAC应用设计,本就应该极尽所能的取避免同一资源的多节点访问,如果不存在同一资源的多节点访问,则DRM所要解决的问题,就根本不存在。其次,DRM本身是需要消耗资源的,并且存在诸多bug,对于一个设计较差的系统而言,频繁的DRM,也会引发Libary cache lock而导致实例挂住。
更严重的,在10.2.0.3系统上,曾经遇到一个case,电信行业的巨型数据库,rac的2号节点由于批量处理作业在非业务时间段,首先cache了一张40G的表,而到了业务时间段后,rac的1号节点的OLTP业务需要频繁访问该表,此时,故障发生了,由于DRM的介入,2号节点开始将内存内的40Gcache数据向1号节点传输,心跳网段千兆带宽被耗尽,RAC陷入僵死阶段,足足维持了40分钟。
事后检查网络流量图,该时段内,私有网络流量持续保持在90M/s的峰值水平。
根据metalink确认,该问题确实由DRM机制引起,最终解决方案,使用隐含参数,将DRM特性屏蔽:
_gc_affinity_time=0
_gc_undo_affinity=FALSE

修改这两个参数需要重启数据库,可以用下面两个参数实行关闭:

_gc_affinity_limit=250  

_gc_affinity_minimum=10485760 

因此,从根本上来说,drm的出现,只是在理论上的一种缓解,而并不能在实际的大型应用中发挥其作用。就类似于Oracle自己针对RAC推出的自动负载平衡一样,只是一种看起来很美的东西,如果真的有人用了,呵呵,那就只能等死吧。或许压力极小的数据库无所谓,但我没遇到过,话又说回来,压力极小,又何必上RAC呢。

reference: http://www.dbaleading.com/Article.asp?id=42

Oracle 9i 单节点文件系统 升级到 10g RAC

November 10, 2011 migration, oracle, RAC No comments

最近做了一次Oracle 9i 升级到10g RAC 的测试,现做如下简要记录:
1.查看目前数据库版本


SQL> select * from v$version;

BANNER
—————————————————————-
Oracle9i Enterprise Edition Release 9.2.0.4.0 – Production
PL/SQL Release 9.2.0.4.0 – Production
CORE 9.2.0.3.0 Production
TNS for Linux: Version 9.2.0.4.0 – Production
NLSRTL Version 9.2.0.4.0 – Production

FILE_ID FILE_NAME
———- ————————————————————
1 /u01/app/oracle/oradata/honcho/system01.dbf
2 /u01/app/oracle/oradata/honcho/undotbs01.dbf
3 /u01/app/oracle/oradata/honcho/hongzx01.dbf
4 /u01/app/oracle/oradata/honcho/drsys01.dbf
5 /u01/app/oracle/oradata/honcho/example01.dbf
9 /u01/app/oracle/oradata/honcho/users01.dbf
10 /u01/app/oracle/oradata/honcho/xdb01.dbf

FILE# NAME
———- ————————————————————
1 /u01/app/oracle/oradata/honcho/temp01.dbf

2.对当前数据库做全备

RMAN> run
2> {
3> allocate channel c1 type disk;
4> backup full database format=’/u01/backup/honcho_full_%U_%T’ tag=’honcho_full’;
5> sql ‘alter system archive log current’;
6> backup archivelog all tag=’arc_bak’ format=’/u01/backup/arch_%U_%T’ delete input;
7> backup current controlfile tag=’bak_ctlfile’ format=’/u01/backup/ctl_file_%U_%T’;
8> backup spfile tag=’spfile’ format=’/u01/backup/honcho_spfile_%U_%T’;
9> release channel c1;
10> }

RMAN> list backup summary;

List of Backups
===============
Key TY LV S Device Type Completion Time #Pieces #Copies Tag
——- — — – ———– ————— ——- ——- —
1 B F A DISK 08-NOV-11 1 1 honcho_FULL
2 B F A DISK 08-NOV-11 1 1
3 B A A DISK 08-NOV-11 1 1 ARC_BAK
4 B F A DISK 08-NOV-11 1 1 BAK_CTLFILE
5 B F A DISK 08-NOV-11 1 1 SPFILE
6 B F A DISK 08-NOV-11 1 1

3.搭建10g 的clusterware 和ASM 实例
[oracle@rac1 ~]$ crs_stat -t
Name Type Target State Host
————————————————————
ora….SM1.asm application ONLINE ONLINE rac1
ora.rac1.gsd application ONLINE ONLINE rac1
ora.rac1.ons application ONLINE ONLINE rac1
ora.rac1.vip application ONLINE ONLINE rac1
ora….SM2.asm application ONLINE ONLINE rac2
ora.rac2.gsd application ONLINE ONLINE rac2
ora.rac2.ons application ONLINE ONLINE rac2
ora.rac2.vip application ONLINE ONLINE rac2

3.直接拷贝原9.2.0.4的pfile到192.168.1.131相应目录,然后在两节点创建相应目录
[oracle@rac1 dbs]$ mkdir -p /u01/app/oracle/admin/honcho/bdump
[oracle@rac1 dbs]$ mkdir -p /u01/app/oracle/admin/honcho/cdump
[oracle@rac1 dbs]$ mkdir -p /u01/app/oracle/admin/honcho/udump

4.修改之后的参数如下:
[oracle@rac1 dbs]$ cat inithoncho1.ora
*.aq_tm_processes=0
*.background_dump_dest=’/u01/app/oracle/admin/honcho/bdump’
*.compatible=’10.2.0.1.0′
*.core_dump_dest=’/u01/app/oracle/admin/honcho/cdump’
*.db_block_size=8192
*.db_cache_size=33554432
*.db_domain=”
*.db_file_multiblock_read_count=16
*.db_name=’honcho’
*.dispatchers='(PROTOCOL=TCP) (SERVICE=honchoXDB)’
*.fast_start_mttr_target=300
*.instance_name=’honcho’
*.java_pool_size=83886080
*.job_queue_processes=0
*.large_pool_size=16777216
*.open_cursors=300
*.pga_aggregate_target=25165824
*.processes=150
*.query_rewrite_enabled=’FALSE’
*.remote_login_passwordfile=’EXCLUSIVE’
*.shared_pool_size=83886080
*.sort_area_size=524288
*.star_transformation_enabled=’FALSE’
*.timed_statistics=TRUE
*.undo_management=’AUTO’
*.undo_retention=10800
*.user_dump_dest=’/u01/app/oracle/admin/honcho/udump’
*.cluster_database_instances=2
*.cluster_database=true
*.control_files=’+DATA/honcho/controlfile/control01.ctl’,’+RECOVERY/honcho/controlfile/control02.ctl’
*.db_create_file_dest=’+DATA’
*.db_recovery_file_dest=’+RECOVERY’
*.db_recovery_file_dest_size=2147483648
honcho1.instance_name=’honcho1′
honcho2.instance_name=’honcho2′
honcho1.instance_number=1
honcho2.instance_number=2
*.log_archive_dest_1=’LOCATION=+RECOVERY/honcho/archivelog’
honcho1.thread=1
honcho2.thread=2
honcho1.undo_tablespace=’UNDOTBS1′
honcho2.undo_tablespace=’UNDOTBS2′

5.使用pfile生产spfile

[oracle@rac1 dbs]$ sqlplus /nolog

SQL*Plus: Release 10.2.0.1.0 – Production on Wed Nov 9 10:52:17 2011

Copyright (c) 1982, 2005, Oracle. All rights reserved.

SQL> conn / as sysdba
Connected.
SQL> create spfile=’+DATA/honcho/PARAMETERFILE/spfilehoncho.ora’ from pfile=’/u01/app/oracle/product/10.0.2.1/db_1/dbs/inithoncho1.ora’;

File created.

6.重新编译pfile,让其指向spfile,在两节点都需要编译
[oracle@rac1 dbs]$ echo “SPFILE=’+DATA/honcho/PARAMETERFILE/spfilehoncho.ora'” > /u01/app/oracle/product/10.0.2.1/db_1/dbs/inithoncho1.ora
[oracle@rac2 ~]$ cd $ORACLE_HOME/dbs
[oracle@rac2 dbs]$ touch inithoncho2.ora
[oracle@rac2 dbs]$ echo “SPFILE=’+DATA/honcho/PARAMETERFILE/spfilehoncho.ora'” > /u01/app/oracle/product/10.0.2.1/db_1/dbs/inithoncho2.ora

7.创建密码文件:
[oracle@rac1 dbs]$ orapwd file=orapwhoncho1 password=oracle entries=20
[oracle@rac2 dbs]$ orapwd file=orapwhoncho2 password=oracle entries=20

8.开启数据库到nomount状态:
SQL> conn / as sysdba
Connected to an idle instance.
SQL> startup nomount
ORACLE instance started.

Total System Global Area 222298112 bytes
Fixed Size 1218628 bytes
Variable Size 184551356 bytes
Database Buffers 33554432 bytes
Redo Buffers 2973696 bytes

9.恢复controlfile

RMAN> restore controlfile from ‘/u01/backup/ctl_file_04mr66b3_1_1_20111108’;

Starting restore at 09-NOV-11
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=150 instance=honcho1 devtype=DISK

channel ORA_DISK_1: restoring control file
channel ORA_DISK_1: restore complete, elapsed time: 00:00:08
output filename=+DATA/honcho/controlfile/control01.ctl
output filename=+RECOVERY/honcho/controlfile/control02.ctl
Finished restore at 09-NOV-11

RMAN> alter database mount;

database mounted
released channel: ORA_DISK_1

10.restore database

RMAN> run {
2> set newname for datafile 1 to ‘+DATA/honcho/DATAFILE/system01.dbf’;
3> set newname for datafile 2 to ‘+DATA/honcho/DATAFILE/undotbs01.dbf’;
4> set newname for datafile 3 to ‘+DATA/honcho/DATAFILE/hongzx01.dbf’;
5> set newname for datafile 4 to ‘+DATA/honcho/DATAFILE/drsys01.dbf’;
6> set newname for datafile 5 to ‘+DATA/honcho/DATAFILE/example01.dbf’;
7> set newname for datafile 9 to ‘+DATA/honcho/DATAFILE/users01.dbf’;
8> set newname for datafile 10 to ‘+DATA/honcho/DATAFILE/xdb01.dbf’;
9> restore database;
10> switch datafile all;
11> switch tempfile all;
12> }

这里并没有restore tempfile

11.recover database

RMAN> recover database;

Starting recover at 09-NOV-11
using channel ORA_DISK_1

starting media recovery

channel ORA_DISK_1: starting archive log restore to default destination
channel ORA_DISK_1: restoring archive log
archive log thread=1 sequence=5
channel ORA_DISK_1: restoring archive log
archive log thread=1 sequence=6
channel ORA_DISK_1: reading from backup piece /u01/backup/arch_03mr66ar_1_1_20111108
channel ORA_DISK_1: restored backup piece 1
piece handle=/u01/backup/arch_03mr66ar_1_1_20111108 tag=ARC_BAK
channel ORA_DISK_1: restore complete, elapsed time: 00:00:03
archive log filename=+RECOVERY/honcho/archivelog/1_5_766710700.dbf thread=1 sequence=5
archive log filename=+RECOVERY/honcho/archivelog/1_6_766710700.dbf thread=1 sequence=6
unable to find archive log
archive log thread=1 sequence=7
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 11/09/2011 13:39:31
RMAN-06054: media recovery requesting unknown log: thread 1 seq 7 lowscn 196183

12.修改online logfile 到ASM

SQL> alter database rename file ‘/u01/app/oracle/oradata/honcho/redo03.log’ to ‘+DATA/honcho/ONLINELOG/redo03.log’;

Database altered.

SQL> alter database rename file ‘/u01/app/oracle/oradata/honcho/redo02.log’ to ‘+DATA/honcho/ONLINELOG/redo02.log’;

Database altered.

SQL> alter database rename file ‘/u01/app/oracle/oradata/honcho/redo01.log’ to ‘+DATA/honcho/ONLINELOG/redo01.log’;

Database altered.

13.把数据库以resetlogs的方式打开
alter database open resetlogs;

14.升级数据库到10g
SQL> startup upgrade;
ORACLE instance started.

Total System Global Area 222298112 bytes
Fixed Size 1218628 bytes
Variable Size 184551356 bytes
Database Buffers 33554432 bytes
Redo Buffers 2973696 bytes
Database mounted.
ORA-01092: ORACLE instance terminated. Disconnection forced

查看日志
Errors in file /u01/app/oracle/admin/honcho/udump/honcho1_ora_16126.trc:
ORA-39701: database must be mounted EXCLUSIVE for UPGRADE or DOWNGRADE
Error 39701 happened during db open, shutting down database
USER: terminating instance due to error 39701

解决方法:
[oracle@rac1]$ cd $ORACLE_HOME/nls/data/old
[oracle@rac1 old]$ perl cr9idata.pl
Creating directory /u01/app/oracle/product/10.0.2.1/db_1/nls/data/9idata …
Copying files to /u01/app/oracle/product/10.0.2.1/db_1/nls/data/9idata…
Copy finished.
Please reset environment variable ORA_NLS10 to /u01/app/oracle/product/10.0.2.1/db_1/nls/data/9idata!

编辑环境变量,添加:
export ORA_NLS10=/u01/app/oracle/product/10.0.2.1/db_1/nls/data/9idata

同时要将参数文件里面的与cluster 相关的参数去掉

SQL> startup upgrade;
ORACLE instance started.

Total System Global Area 222298112 bytes
Fixed Size 1218628 bytes
Variable Size 184551356 bytes
Database Buffers 33554432 bytes
Redo Buffers 2973696 bytes
Database mounted.
Database opened.
SQL> select * from v$version;

BANNER
—————————————————————-
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 – Prod
PL/SQL Release 10.2.0.1.0 – Production
CORE 10.2.0.1.0 Production
TNS for Linux: Version 10.2.0.1.0 – Production
NLSRTL Version 10.2.0.1.0 – Production

15.给临时表空间添加文件
SQL> alter tablespace temp add tempfile ‘+DATA’ size 100m;

16.由于在9i里面没有sysaux表空间,这里也要添加sysaux表空间:
SQL> create tablespace sysaux datafile ‘+DATA’ size 500m reuse
2 extent management local
3 segment space management auto;

Tablespace created.

17.运行升级脚本

@?/rdbms/admin/catupgrd.sql
@?/rdbms/admin/catclust.sql
@?/rdbms/admin/utlrp.sql

18.添加与RAC相关的一些参数:


SQL> alter system set cluster_database=true scope=spfile;

System altered.

SQL> alter system set cluster_database_instances=2 scope=spfile;

System altered.

SQL> alter system set instance_number=1 scope=spfile sid=’honcho1′;

System altered.

SQL> alter system set instance_number=2 scope=spfile sid=’honcho2′;

System altered.

SQL> alter system set thread=1 scope=spfile sid=’honcho1′;

System altered.

SQL> alter system set thread=2 scope=spfile sid=’honcho2′;

System altered.

19.添加节点2的undo表空间和redolog file

SQL> create undo tablespace UNDOTBS2 datafile ‘+DATA/honcho/datafile/undotbs02.dbf’ size 50m;

Tablespace created.

SQL> alter system set undo_tablespace=’UNDOTBS2′ scope=spfile sid=’honcho2′;

System altered.

SQL> alter database add logfile thread 2 group 4 ‘+DATA’ size 100m;

Database altered.

SQL> alter database add logfile thread 2 group 5 ‘+DATA’ size 100m;

Database altered.

SQL> alter database add logfile thread 2 group 6 ‘+DATA’ size 100m;

Database altered.
SQL> alter database enable public thread 2;

Database altered.

20.重新打开1节点:
SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.

Total System Global Area 222298112 bytes
Fixed Size 1218628 bytes
Variable Size 184551356 bytes
Database Buffers 33554432 bytes
Redo Buffers 2973696 bytes
Database mounted.
Database opened.
SQL> show parameter cluster

NAME TYPE VALUE
———————————— ———– ——————————
cluster_database boolean TRUE
cluster_database_instances integer 2
cluster_interconnects string

在节点2上同样打开实例:
[oracle@rac2 ~]$ sqlplus /nolog

SQL*Plus: Release 10.2.0.1.0 – Production on Wed Nov 9 18:24:02 2011

Copyright (c) 1982, 2005, Oracle. All rights reserved.

SQL> conn / as sysdba
Connected to an idle instance.
SQL> startup
ORACLE instance started.

Total System Global Area 222298112 bytes
Fixed Size 1218628 bytes
Variable Size 184551356 bytes
Database Buffers 33554432 bytes
Redo Buffers 2973696 bytes
Database mounted.
Database opened.

SQL> select instance_number,instance_name,host_name from gv$instance;

INSTANCE_NUMBER INSTANCE_NAME HOST_NAME
————— —————- ——————————
1 honcho1 rac1
2 honcho2 rac2

2 rows selected.

21.添加监听
[oracle@rac1 ~]$ crs_stat -t
Name Type Target State Host
————————————————————
ora….SM1.asm application ONLINE ONLINE rac1
ora.rac1.gsd application ONLINE ONLINE rac1
ora.rac1.ons application ONLINE ONLINE rac1
ora.rac1.vip application ONLINE ONLINE rac1
ora….SM2.asm application ONLINE ONLINE rac2
ora.rac2.gsd application ONLINE ONLINE rac2
ora.rac2.ons application ONLINE ONLINE rac2
ora.rac2.vip application ONLINE ONLINE rac2
使用netca配置
[oracle@rac1 ~]$ crs_stat -t
Name Type Target State Host
————————————————————
ora….SM1.asm application ONLINE ONLINE rac1
ora….C1.lsnr application ONLINE ONLINE rac1
ora.rac1.gsd application ONLINE ONLINE rac1
ora.rac1.ons application ONLINE ONLINE rac1
ora.rac1.vip application ONLINE ONLINE rac1
ora….SM2.asm application ONLINE ONLINE rac2
ora….C2.lsnr application ONLINE ONLINE rac2
ora.rac2.gsd application ONLINE ONLINE rac2
ora.rac2.ons application ONLINE ONLINE rac2
ora.rac2.vip application ONLINE ONLINE rac2

22.添加服务

[oracle@rac1 ~]$ srvctl add database -d honcho -o $ORACLE_HOME -p +DATA/honcho/PARAMETERFILE/spfilehoncho.ora
[oracle@rac1 ~]$ srvctl add instance -d honcho -i honcho1 -n rac1
[oracle@rac1 ~]$ srvctl add instance -d honcho -i honcho2 -n rac2
[oracle@rac1 ~]$ srvctl modify instance -d honcho -i honcho1 -s +ASM1
[oracle@rac1 ~]$ srvctl modify instance -d honcho -i honcho2 -s +ASM2
[oracle@rac1 ~]$ srvctl start database -d honcho

[oracle@rac1 ~]$ crs_stat -t
Name Type Target State Host
————————————————————
ora.honcho.db application ONLINE ONLINE rac1
ora….b1.inst application ONLINE ONLINE rac1
ora….b2.inst application ONLINE ONLINE rac2
ora….SM1.asm application ONLINE ONLINE rac1
ora….C1.lsnr application ONLINE ONLINE rac1
ora.rac1.gsd application ONLINE ONLINE rac1
ora.rac1.ons application ONLINE ONLINE rac1
ora.rac1.vip application ONLINE ONLINE rac1
ora….SM2.asm application ONLINE ONLINE rac2
ora….C2.lsnr application ONLINE ONLINE rac2
ora.rac2.gsd application ONLINE ONLINE rac2
ora.rac2.ons application ONLINE ONLINE rac2
ora.rac2.vip application ONLINE ONLINE rac2

SQL> select * from v$version;

BANNER
—————————————————————-
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 – Prod
PL/SQL Release 10.2.0.1.0 – Production
CORE 10.2.0.1.0 Production
TNS for Linux: Version 10.2.0.1.0 – Production
NLSRTL Version 10.2.0.1.0 – Production

还有一些相关参数需要调整。

OK. Game over!