
YCSB: a NoSQL load-testing tool

July 16, 2014 Architect, NoSQL, software

Download PDF from slideshare

A brief overview of YCSB, a load-testing tool, here used against MongoDB.

YCSB is a NoSQL benchmarking tool that provides several ways to simulate and test NoSQL workloads. Load testing of NoSQL systems has so far received little attention, and YCSB's comprehensive test coverage offers an extra safeguard for a NoSQL deployment that is about to go live.

There are many new serving databases available, including:

PNUTS
BigTable
HBase
Hypertable
Azure
Cassandra
CouchDB
Voldemort
MongoDB
OrientDB
Infinispan
Dynomite
Redis
GemFire
GigaSpaces XAP
DynamoDB
Couchbase
Aerospike 

The numbers below are for reference only; the tests were run on a virtual machine.

Here is a simple example of loading data:

[root@mysqlstd ycsb-0.1.4]# ./bin/ycsb load  mongodb -P workloads/workloada   -p mongodb.url=mongodb://127.0.0.1:27017 -p mongodb.database=newdb -p mongodb.writeConcern=normal  -s  >data
Loading workload...
Starting test.
 0 sec: 0 operations; 
 10 sec: 18448 operations; 1837.08 current ops/sec; [INSERT AverageLatency(us)=423.35] 
 20 sec: 42134 operations; 2366.71 current ops/sec; [INSERT AverageLatency(us)=373.44] 
 30 sec: 61185 operations; 1904.34 current ops/sec; [INSERT AverageLatency(us)=661.58] 
 40 sec: 85308 operations; 2411.09 current ops/sec; [INSERT AverageLatency(us)=324.83] 
 50 sec: 97785 operations; 1247.2 current ops/sec; [INSERT AverageLatency(us)=985.33] 
 50 sec: 100000 operations; 2662.26 current ops/sec; [INSERT AverageLatency(us)=371.24] 

After the data is loaded, we can start simulating load. YCSB ships with the following workload types:

Workload A: Update heavy workload

This workload has a mix of 50/50 reads and writes. An application example is a session store recording recent actions.

Workload B: Read mostly workload

This workload has a 95/5 read/write mix. Application example: photo tagging; adding a tag is an update, but most operations read tags.

Workload C: Read only

This workload is 100% read. Application example: user profile cache, where profiles are constructed elsewhere (e.g., Hadoop).

Workload D: Read latest workload

In this workload, new records are inserted, and the most recently inserted records are the most popular. Application example: user status updates; people want to read the latest.

Workload E: Short ranges

In this workload, short ranges of records are queried, instead of individual records. Application example: threaded conversations, where each scan is for the posts in a given thread (assumed to be clustered by thread id).

Workload F: Read-modify-write

In this workload, the client will read a record, modify it, and write back the changes. Application example: user database, where user records are read and modified by the user or to record user activity.

Workload E's short-range scans have issues; of course, we can also define our own workload (the file below keeps the workloada header comments but has been edited down to a pure-read mix):

# Yahoo! Cloud System Benchmark
# Workload A: Update heavy workload
#   Application example: Session store recording recent actions
#                        
#   Read/update ratio: 50/50
#   Default data size: 1 KB records (10 fields, 100 bytes each, plus key)
#   Request distribution: zipfian

recordcount=100000
operationcount=100000
workload=com.yahoo.ycsb.workloads.CoreWorkload

readallfields=true

readproportion=1
updateproportion=0
scanproportion=0
insertproportion=0
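
Any property in the workload file can also be overridden on the command line with -p (in the 0.1.4 release used here, command-line properties take precedence over those read from the -P file). For example, a hypothetical 95/5 read/update mix against the same database:

./bin/ycsb run mongodb -P workloads/workloada -p mongodb.url=mongodb://127.0.0.1:27017 -p mongodb.database=newdb -p readproportion=0.95 -p updateproportion=0.05 -p scanproportion=0 -p insertproportion=0 -p recordcount=100000 -p operationcount=100000 -s > data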

Adjust the read/update/insert percentages to simulate your real environment. Benchmark examples follow.

For example, benchmarking pure reads:

[root@mysqlstd ycsb-0.1.4]# ./bin/ycsb run  mongodb -P workloads/workloadc  -P large.dat -p mongodb.url=mongodb://127.0.0.1:27017 -p mongodb.database=newdb  -p mongodb.writeConcern=normal  -s  >data
Loading workload...
Starting test.
 0 sec: 0 operations; 
 10 sec: 49375 operations; 4922.24 current ops/sec; [READ AverageLatency(us)=192.56] 
 18 sec: 100000 operations; 6141.57 current ops/sec; [READ AverageLatency(us)=159.72] 

95% reads + 5% writes (inserts):

[root@mysqlstd ycsb-0.1.4]# ./bin/ycsb run  mongodb -P workloads/workloadd  -P large.dat -p mongodb.url=mongodb://127.0.0.1:27017 -p mongodb.database=newdb  -p mongodb.writeConcern=normal  -s  >data
Loading workload...
Starting test.
 0 sec: 0 operations; 
 10 sec: 43497 operations; 4333.23 current ops/sec; [INSERT AverageLatency(us)=633.66] [READ AverageLatency(us)=196.33] 
 20 sec: 92795 operations; 4925.37 current ops/sec; [INSERT AverageLatency(us)=792.15] [READ AverageLatency(us)=167.74] 
 21 sec: 100000 operations; 5637.72 current ops/sec; [INSERT AverageLatency(us)=379.57] [READ AverageLatency(us)=163.45] 

In addition, Thumbtack has modified YCSB and added some features; the fork is open source at:

https://github.com/thumbtack-technology/ycsb

The fork mainly adds support for Aerospike and Couchbase. Aerospike is now open source and is specifically optimized for SSDs:

http://www.aerospike.com/blog/entrepreneurs-break-all-the-rules-aerospike-goes-open-source/

The fork also upgrades the MongoDB driver from 2.8.0 (a version that predates the Mongo 2.2 release) to 2.10.1, and implements support for the readPreference setting:

mongodb.readPreference = primary|primaryPreferred|secondary|secondaryPreferred
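
For example, a run that prefers reading from a secondary might look like this (a hypothetical invocation following the same pattern as the other commands in this post; it is only meaningful against a replica set):

./bin/ycsb run mongodb -P workloads/workloadb -p mongodb.url=mongodb://127.0.0.1:27017 -p mongodb.database=ycsb -p mongodb.readPreference=secondaryPreferred -s > data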


Below is a test using the Thumbtack fork, which produces detailed numbers:

[root@mysqlstd ycsb]# fab ycsb_load:db=mongodb

[10.0.32.38] Executing task 'ycsb_load'
2014-07-15 01:09:00-07:00
[10.0.32.38] run: echo "/root/ycsb/bin/ycsb load mongodb -s -p mongodb.url=mongodb://127.0.0.1:27017 -p workload=com.yahoo.ycsb.workloads.CoreWorkload -p updateretrycount=1000 -p mongodb.writeConcern=normal -p mongodb.database=ycsb -p recordcount=5000000 -p exportmeasurementsinterval=30000 -p fieldcount=10 -p timeseries.granularity=100 -p threadcount=32 -p insertretrycount=10 -p readretrycount=1000 -p ignoreinserterrors=true -p reconnectionthroughput=10 -p operationcount=2400000000 -p fieldnameprefix=f -p maxexecutiontime=2400 -p mongodb.readPreference=primaryPreferred -p measurementtype=timeseries -p reconnectiontime=1000 -p fieldlength=10 -p insertstart=0 -p insertcount=5000000 > /root/ycsb/2014-07-15_01-09_mongodb_load.out 2> /root/ycsb/2014-07-15_01-09_mongodb_load.err" | at 01:09 today
[10.0.32.38] out: job 13 at 2014-07-15 01:09

Done.
Disconnecting from 10.0.32.38... done.

[mongo@mysqlstd ~]$ /data/mongodb/mongodb/bin/mongo

MongoDB shell version: 2.6.1
connecting to: test
> show dbs
admin   0.031GB
local   0.031GB
newdb   0.500GB
newdb1  0.500GB
newdb2  0.500GB
ycsb    1.500GB
> use ycsb
switched to db ycsb
> db.usertable.c
db.usertable.clean(                  db.usertable.convertToCapped(        db.usertable.copyTo(                 db.usertable.createIndex(
db.usertable.constructor             db.usertable.convertToSingleObject(  db.usertable.count(
> db.usertable.count()
2675710

Simulating the load: workload mode = A

[root@mysqlstd ycsb]# fab ycsb_run:db=mongodb,workload=A
[10.0.32.38] Executing task 'ycsb_run'
2014-07-15 02:13:00-07:00
[10.0.32.38] run: echo "/root/ycsb/bin/ycsb run mongodb -s -P /root/ycsb/workloads/workloada -p mongodb.url=mongodb://127.0.0.1:27017 -p workload=com.yahoo.ycsb.workloads.CoreWorkload -p updateretrycount=1000 -p mongodb.writeConcern=normal -p mongodb.database=ycsb -p recordcount=5000000 -p exportmeasurementsinterval=30000 -p fieldcount=10 -p timeseries.granularity=100 -p threadcount=32 -p insertretrycount=10 -p readretrycount=1000 -p ignoreinserterrors=true -p reconnectionthroughput=10 -p operationcount=1800000 -p fieldnameprefix=f -p maxexecutiontime=180 -p mongodb.readPreference=primaryPreferred -p measurementtype=timeseries -p reconnectiontime=1000 -p fieldlength=10 > /root/ycsb/2014-07-15_02-13_mongodb_workloada.out 2> /root/ycsb/2014-07-15_02-13_mongodb_workloada.err" | at 02:13 today
[10.0.32.38] out: job 23 at 2014-07-15 02:13
[10.0.32.38] out: 

Done.
Disconnecting from 10.0.32.38... done.

Use merge.py to get the aggregated numbers:

[root@mysqlstd ycsb]#  ./bin/merge.py																													
	OVERALL	OVERALL	READ	READ	READ	READ	READ	READ	READ	READ	READ	UPDATE	UPDATE	UPDATE	UPDATE	UPDATE	UPDATE	UPDATE	UPDATE	UPDATE	CLEANUP	CLEANUP	CLEANUP	CLEANUP	CLEANUP	CLEANUP	CLEANUP	CLEANUP	CLEANUP	
	RunTime	Throughput	Operations	Retries	Return=0	Return=[^0].*	AverageLatency	MinLatency	MaxLatency	95thPercentileLatency	99thPercentileLatency	Operations	Retries	Return=0	Return=[^0].*	AverageLatency	MinLatency	MaxLatency	95thPercentileLatency	99thPercentileLatency	Operations	Retries	Return=0	Return=[^0].*	AverageLatency	MinLatency	MaxLatency	95thPercentileLatency	99thPercentileLatency
1	61156	28.58264111	1665	747000	918	747	927.7820691	0.132	6630.776			83	33000	50	33	2075.808675	0.504	9767.828			32	0			18.08734375	0.465	207.159		
Total	61156	28.58264111	1665	747000	918	747	927.7820691	0.132	6630.776			83	33000	50	33	2075.808675	0.504	9767.828			32	0	0	0	18.08734375	0.465	207.159		

[mongo@mysqlstd ~]$ /data/mongodb/mongodb/bin/mongostat
connected to: 127.0.0.1

insert  query update delete getmore command flushes mapped  vsize    res faults  locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn       time 
    *0   5568   5369     *0       0  5381|0       0  3.06g  6.47g   282m      9 admin:0.9%          0       0|0     0|0     1m   894k    95   02:14:05 
    *0   4298   6267     *0       0  6279|0       0  3.06g  6.47g   282m      6 admin:0.6%          0       0|0     1|0     1m   962k    96   02:14:06 
    *0   4675   6119     *0       0  6066|0       0  3.06g  6.47g   282m      2 admin:0.0%          0      95|0     1|0     1m   948k    92   02:14:07 
    *0   4137   4866     *0       0  4948|0       0  3.06g  6.47g   282m     18 admin:2.1%          0       0|0     0|0     1m   790k    91   02:14:08 
    *0   4568   5904     *0       0  5922|0       0  3.06g  6.47g   282m      4 admin:0.1%          0       0|0     0|0     1m   927k    92   02:14:09 
    *0   4727   6034     *0       0  6046|0       0  3.06g  6.47g   282m      5 admin:0.0%          0       0|0     0|0     1m   949k    90   02:14:10 
    *0   4991   5673     *0       0  5690|0       0  3.06g  6.47g   282m      3 admin:0.9%          0       0|0     0|0     1m   914k    94   02:14:11 
    *0   4740   5173     *0       0  5183|0       1  3.06g  6.47g   282m      7 admin:0.1%          0       0|0     0|0     1m   839k    94   02:14:12 
    *0   4332   5493     *0       0  5510|0       0  3.06g  6.47g   282m      8 admin:0.9%          0       0|0     0|0     1m   866k    94   02:14:13 
    *0   4980   5583     *0       0  5592|0       0  3.06g  6.47g   282m      8 admin:0.0%          0       0|0     0|0     1m   901k    97   02:14:14 
insert  query update delete getmore command flushes mapped  vsize    res faults  locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn       time 
    *0   5750   5030     *0       0  4997|0       0  3.06g  6.47g   282m     20 admin:1.8%          0      94|0     1|1     1m   853k    97   02:14:15 
    *0   4884   5509     *0       0  5578|0       0  3.06g  6.47g   282m     10 admin:0.1%          0       0|0     0|0     1m   894k    97   02:14:16 
    *0   5733   5773     *0       0  5784|0       0  3.06g  6.47g   282m      5 admin:0.0%          0       0|0     0|0     1m   952k    92   02:14:17 
    *0   5178   5202     *0       0  5219|0       0  3.06g  6.47g   282m     14 admin:0.0%          0       0|0     0|0     1m   861k    95   02:14:18 
    *0   4179   5680     *0       0  5688|0       0  3.06g  6.47g   282m      8 admin:0.0%          0       0|0     0|1     1m   884k    93   02:14:19 
    *0   4879   5695     *0       0  5707|0       0  3.06g  6.47g   282m     11 admin:0.1%          0       0|0     0|0     1m   911k    93   02:14:20 
    *0   5271   5402     *0       0  5413|0       0  3.06g  6.47g   282m     12 admin:0.0%          0       0|0     0|0     1m   887k    95   02:14:21 
    *0   4583   4852     *0       0  4867|0       1  3.06g  6.47g   282m     11 admin:0.0%          0       0|0     0|0     1m   795k    93   02:14:22 
    *0   6654   4956     *0       0  4967|0       0  3.06g  6.47g   282m     10 admin:1.5%          0       0|0     0|0     1m   881k    95   02:14:23 
    

REF:
http://www.aerospike.com/blog/entrepreneurs-break-all-the-rules-aerospike-goes-open-source/

https://github.com/thumbtack-technology/ycsb

http://www.aerospike.com/wp-content/uploads/2013/02/Ultra-High-Performance-NoSQL-Benchmarking_zh-CN.pdf

https://github.com/brianfrankcooper/YCSB/wiki

http://labs.yahoo.com/news/yahoo-cloud-serving-benchmark/

A bird's-eye view of the 1号店 (Yihaodian) architecture

November 18, 2013 Architect, hardware, IDC, software

UPDATE 12.2: you can download the complete PDF from database design of YHD

DGHA, an automatic Data Guard switchover solution

August 15, 2013 Architect, oracle, replication, software

Announcing test results for DGHA, an automatic switchover script for Oracle Data Guard. The script keeps the redo logs and controlfile on shared storage, so a switchover loses zero data.
After the switchover, the original primary database can seamlessly rejoin the whole DG cluster, forming a Data Guard high-availability solution.
It is custom-developed in Perl and can be deployed on any machine that can connect to the cluster (preferably not a machine inside the cluster itself as the monitor). Multiple layers of checks improve the reliability and accuracy of the switchover. Thanks to the author, noodba. As the department's data architect, noodba also developed myawr (a MySQL performance analysis tool).

Test environment:

10.0.0.63 (primary database)
10.0.0.24 (standby database)
10.0.0.13 (standby database)
The switchover pair is 63 <-> 24; 13 remains a permanent standby.

Step 1. Use Swingbench as the load generator (load 10 GB of data)


[oracle@userdb-pre admin]$ ora active
 17:23:15 up 1 day,  1:15,  5 users,  load average: 2.99, 0.97, 0.34

   SID USERNAME   MACHINE		 EVENT				PARAM			W   WT SQL			    ST	   LT LOGON_TIME
------ ---------- ---------------------- ------------------------------ -------------------- ---- ---- ---------------------------- -- ------ ----------
  1089 TEST	  JDBC Thin Client	 SQL*Net more data from client	1413697536/4/0	       -1    0 /5gkc10wnagprk		    A	    0	     105
   961 TEST	  JDBC Thin Client	 SQL*Net more data from client	1413697536/1/0	       -1    0 5gkc10wnagprk/5gkc10wnagprk  A	    0	     105
  1249 TEST	  JDBC Thin Client	 SQL*Net more data from client	1413697536/5/0	       -1    0 dy4rgmf46x1q4/dy4rgmf46x1q4  A	    0	     105

[oracle@userdb-pre admin]$ ora active
 17:23:15 up 1 day,  1:15,  5 users,  load average: 3.07, 1.02, 0.36

   SID USERNAME   MACHINE		 EVENT				PARAM			W   WT SQL			    ST	   LT LOGON_TIME
------ ---------- ---------------------- ------------------------------ -------------------- ---- ---- ---------------------------- -- ------ ----------
   961 TEST	  JDBC Thin Client	 SQL*Net message from client	1413697536/1/0		3    0 5gkc10wnagprk/5gkc10wnagprk  A	    0	     105
     1 TEST	  JDBC Thin Client	 SQL*Net message from client	1413697536/1/0		2    0 dy4rgmf46x1q4/dy4rgmf46x1q4  A	    0	     104

[oracle@userdb-pre admin]$ ora active
 17:23:49 up 1 day,  1:15,  4 users,  load average: 3.68, 1.35, 0.49

   SID USERNAME   MACHINE		 EVENT				PARAM			W   WT SQL			    ST	   LT LOGON_TIME
------ ---------- ---------------------- ------------------------------ -------------------- ---- ---- ---------------------------- -- ------ ----------
  1025 TEST	  JDBC Thin Client	 SQL*Net message from client	1413697536/1/0		1    0 d1yrz7run4p1s/d1yrz7run4p1s  A	    0	     139

Step 2. Halt the database machine (halt the 63 host directly to simulate a machine crash)


Watch 10.0.0.24 (DGHA will automatically promote the standby to primary; the redo logs and controlfile will be copied to the designated directory).

alert_log from 10.0.0.24

Thu Aug 15 17:37:50 2013
ALTER DATABASE   MOUNT
Changing di2dbun from dgha to dgha1
Successful mount of redo thread 1, with mount id 3944140846
Database mounted in Exclusive Mode
Lost write protection disabled
Completed: ALTER DATABASE   MOUNT
Thu Aug 15 17:37:55 2013
ALTER DATABASE RECOVER  database
Media Recovery Start
 started logmerger process
Parallel Media Recovery started with 48 slaves
Thu Aug 15 17:37:56 2013
Recovery of Online Redo Log: Thread 1 Group 3 Seq 60 Reading mem 0
  Mem# 0: /data1/dgha/redo03.log
Media Recovery Complete (dgha)
Completed: ALTER DATABASE RECOVER  database
alter database open
Beginning crash recovery of 1 threads
 parallel recovery started with 32 processes
Started redo scan
Completed redo scan
 read 779943 KB redo, 0 data blocks need recovery
Started redo application at
 Thread 1: logseq 58, block 35644
Recovery of Online Redo Log: Thread 1 Group 1 Seq 58 Reading mem 0
  Mem# 0: /data1/dgha/redo01.log
Recovery of Online Redo Log: Thread 1 Group 2 Seq 59 Reading mem 0
  Mem# 0: /data1/dgha/redo02.log
Recovery of Online Redo Log: Thread 1 Group 3 Seq 60 Reading mem 0
  Mem# 0: /data1/dgha/redo03.log
Completed redo application of 0.00MB
Completed crash recovery at
 Thread 1: logseq 60, block 31, scn 2500652
 0 data blocks read, 0 data blocks written, 779943 redo k-bytes read
LGWR: STARTING ARCH PROCESSES
...


Archived Log entry 191 added for thread 1 sequence 61 ID 0xeb14a3a4 dest 1:
ARC3: Standby redo logfile selected for thread 1 sequence 60 for destination LOG_ARCHIVE_DEST_3
******************************************************************
LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_2
******************************************************************
******************************************************************
LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_3
******************************************************************
LNS: Standby redo logfile selected for thread 1 sequence 62 for destination LOG_ARCHIVE_DEST_3
..


Completed: alter database open

The database has been recovered to the latest point in time (the datafile SCN is now ahead of the 63 database; 63 can be regarded as a snapshot of 24 at some point in the past).


Step 3. The third standby database (10.0.0.13)

alert_log from 10.0.0.13

Error 12543 received logging on to the standby
FAL[client, USER]: Error 12543 connecting to dgha for fetching gap sequence
Thu Aug 15 17:38:14 2013
Media Recovery Log /u01/oracle/11.2.0/oracle/product/11203/db1/dbs/arch1_61_823427815.dbf
Media Recovery Waiting for thread 1 sequence 62 (in transit)
Recovery of Online Redo Log: Thread 1 Group 4 Seq 62 Reading mem 0
  Mem# 0: /u01/oracle/oradata/dgha/standby04.log
Thu Aug 15 17:39:56 2013
Archived Log entry 62 added for thread 1 sequence 62 ID 0xeb14a3a4 dest 1:
Thu Aug 15 17:39:56 2013
RFS[14]: Selected log 4 for thread 1 sequence 63 dbid -350937948 branch 823427815
Thu Aug 15 17:39:56 2013
Media Recovery Waiting for thread 1 sequence 63 (in transit)
Recovery of Online Redo Log: Thread 1 Group 4 Seq 63 Reading mem 0
  Mem# 0: /u01/oracle/oradata/dgha/standby04.log
Thu Aug 15 17:40:08 2013
Archived Log entry 63 added for thread 1 sequence 63 ID 0xeb14a3a4 dest 1:
Thu Aug 15 17:40:08 2013
RFS[14]: Selected log 4 for thread 1 sequence 64 dbid -350937948 branch 823427815

As you can see, the primary has automatically switched to 10.0.0.24, and 13 keeps recovering the database.

Step 4. Restart the 63 database and rejoin it to the DG cluster (24 is the new primary database and 63 becomes a standby); a rough command sketch follows the list.

(1). copy the standby controlfile from the 24 database
(2). replace 63's controlfile with the new standby controlfile
(3). recreate the standby logfiles
(4). alter database recover managed standby database;
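
A rough sketch of what those four steps might look like (the controlfile path and the standby logfile names/sizes are illustrative, not taken from the actual environment):

On the new primary (24), create a standby controlfile and copy it to 63:

SQL> alter database create standby controlfile as '/tmp/standby.ctl';
$ scp /tmp/standby.ctl oracle@10.0.0.63:/data1/dgha/control01.ctl

On the old primary (63), mount with the new controlfile, recreate the standby logfiles and start redo apply:

SQL> startup nomount
SQL> alter database mount standby database;
SQL> alter database add standby logfile '/data1/dgha/standby01.log' size 512m;
SQL> alter database add standby logfile '/data1/dgha/standby02.log' size 512m;
SQL> alter database recover managed standby database disconnect from session;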


alert_log:

..

Media Recovery Log /u01/oracle/arch/1_66_823427815.dbf
Media Recovery Waiting for thread 1 sequence 67 (in transit)

..
Physical standby database opened for read only access.
Completed: alter database open read only

Step 5. Check the data on the 24 and 63 databases

from 24 database

SQL> select count(*) from test.CUSTOMERS              ;
select count(*) from test.WAREHOUSES             ;
select count(*) from test.ORDER_ITEMS            ;
select count(*) from test.ORDERS                 ;
select count(*) from test.INVENTORIES            ;
select count(*) from test.PRODUCT_INFORMATION    ;
select count(*) from test.LOGON                  ;
select count(*) from test.PRODUCT_DESCRIPTIONS   ;
select count(*) from test.ORDERENTRY_METADATA    ;

  COUNT(*)
----------
  39999996

SQL> 
  COUNT(*)
----------
      1000

SQL> 
  COUNT(*)
----------
 134986840

SQL> 
  COUNT(*)
----------
  45000000

SQL> 
  COUNT(*)
----------
    901565

SQL> 
  COUNT(*)
----------
      1000

SQL> 
  COUNT(*)
----------
   9999996

SQL> 
  COUNT(*)
----------
      1000

SQL> 
  COUNT(*)
----------
	 0

from 63 database

SQL> select count(*) from test.CUSTOMERS              ;
select count(*) from test.WAREHOUSES             ;
select count(*) from test.ORDER_ITEMS            ;
select count(*) from test.ORDERS                 ;
select count(*) from test.INVENTORIES            ;
select count(*) from test.PRODUCT_INFORMATION    ;
select count(*) from test.LOGON                  ;
select count(*) from test.PRODUCT_DESCRIPTIONS   ;
select count(*) from test.ORDERENTRY_METADATA    ;

  COUNT(*)
----------
  39999996

SQL> 
  COUNT(*)
----------
      1000

SQL> 
  COUNT(*)
----------
 134986840

SQL> 
  COUNT(*)
----------
  45000000

SQL> 
  COUNT(*)
----------
    901565

SQL> 
  COUNT(*)
----------
      1000

SQL> 
  COUNT(*)
----------
   9999996

SQL> 
  COUNT(*)
----------
      1000

SQL> 
  COUNT(*)
----------
	 0

Perfect! No rows were lost.

—————————————————————————

Now let's dissect DGHA's behavior:

Thu Aug 15 17:25:57 2013 - [info] oraha start,welcome to use this tool...
Thu Aug 15 17:25:57 2013 - [info] try to check (10.0.0.63:1521:dgha) and (10.0.0.24:1521:dgha) are really cluster or not.
Thu Aug 15 17:25:57 2013 - [info] try ssh to standby(10.0.0.24,1521,dgha)...
Thu Aug 15 17:25:57 2013 - [info] ssh to 10.0.0.24 success.
Thu Aug 15 17:25:58 2013 - [debug] standby query result: (DGHA,PHYSICAL STANDBY,dgha1)
Thu Aug 15 17:25:58 2013 - [info] try connect to primary(10.0.0.63,1521,dgha)...
Thu Aug 15 17:25:58 2013 - [debug] primary check query sql: (select count(*) cnt from V$ARCHIVE_DEST A,v$database B where A.status='VALID' AND A.DB_UNIQUE_NAME='dgha1' AND A.TARGET='STANDBY' AND B.NAME='DGHA' AND B.DATABASE_ROLE='PRIMARY')
Thu Aug 15 17:25:58 2013 - [info] Attemp to connect to (10.0.0.63:1521:dgha)...
Thu Aug 15 17:25:59 2013 - [info] connect to (10.0.0.63:1521:dgha) success.
Thu Aug 15 17:25:59 2013 - [debug] primary query result rn_cnt: 1
Thu Aug 15 17:25:59 2013 - [info] (10.0.0.63:1521:dgha) and (10.0.0.24:1521:dgha) are really cluster.
Thu Aug 15 17:25:59 2013 - [info] the primary database is (10.0.0.63:1521:dgha)
Thu Aug 15 17:25:59 2013 - [info] the standby database is (10.0.0.24:1521:dgha)
Thu Aug 15 17:25:59 2013 - [info] ssh to 10.0.0.63 success.
Thu Aug 15 17:25:59 2013 - [info] ssh to 10.0.0.24 success.
Thu Aug 15 17:26:03 2013 - [info] ssh to 192.168.254.63 success.
Thu Aug 15 17:26:03 2013 - [info] ssh check passed.
Thu Aug 15 17:26:03 2013 - [info] (10.0.0.63:1521:dgha) /home/oracle/oraha/db_stop.sh check passed.
Thu Aug 15 17:26:03 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step01.sh check passed.
Thu Aug 15 17:26:03 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step02.pl check passed.
Thu Aug 15 17:26:03 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step03.sh check passed.
Thu Aug 15 17:26:04 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_check.sh check passed.
Thu Aug 15 17:26:04 2013 - [info] scripts under /home/oracle/oraha check passed.
Thu Aug 15 17:26:04 2013 - [info] I will wait until (10.0.0.63:1521:dgha) unreachable.health check interval:8
Thu Aug 15 17:26:04 2013 - [info] ======================================================
Thu Aug 15 17:26:04 2013 - [info] ----------             welcome              ----------
Thu Aug 15 17:26:04 2013 - [info] ======================================================
Thu Aug 15 17:26:04 2013 - [debug] connect to (10.0.0.63:1521:dgha) error_count:0
Thu Aug 15 17:31:50 2013 - [debug] UPDATE HACHECK success.
Thu Aug 15 17:31:58 2013 - [debug] connect to (10.0.0.63:1521:dgha) error_count:0

DGHA periodically updates a check table to determine whether the instance is still alive.
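
A minimal sketch of that kind of liveness probe (the HACHECK table name and the 8-second interval appear in the log above; the TNS alias, credentials, table layout and 5-second timeout are assumptions, not the actual oraha.pl code):

while sleep 8; do
  timeout 5 sqlplus -s dgha_mon/xxxx@dgha63 <<'EOF' || echo "$(date) UPDATE HACHECK failed, primary suspect"
whenever sqlerror exit failure
update hacheck set last_seen = sysdate where id = 1;
commit;
exit
EOF
done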

Thu Aug 15 17:32:03 2013 - [warning] connect to (10.0.0.63:1521:dgha) was abort! connect again.
Thu Aug 15 17:32:03 2013 - [debug] update check == 1.undef_count is 0
Thu Aug 15 17:32:03 2013 - [info] I will destory dbh.
Issuing rollback() due to DESTROY without explicit disconnect() of DBD::Oracle::db handle (DESCRIPTION=(ADDRESS=(HOST=10.0.0.63)(PROTOCOL=tcp)(PORT=1521))(CONNECT_DATA=(SID=dgha))) at oraha.pl line 509.
DBI db handle 0x1efbf710 cleared whilst still active at oraha.pl line 509.
DBI db handle 0x1efbf710 has uncleared implementors data at oraha.pl line 509.
    dbih_clearcom (dbh 0x1efbf710, com 0x1ee4d4f0, imp DBD::Oracle::db):
       FLAGS 0x182097: COMSET IMPSET Active Warn RaiseError PrintWarn ShowErrorStatement 
       PARENT DBI::dr=HASH(0x1ef1df90)
       KIDS 0 (0 Active)
Thu Aug 15 17:32:08 2013 - [warning] destory dbh  was abort!
Thu Aug 15 17:32:08 2013 - [info] Attemp to connect to (10.0.0.63:1521:dgha)...
Thu Aug 15 17:32:13 2013 - [error][oraha.pl, ln426] connect to  (10.0.0.63:1521:dgha) timeout
Thu Aug 15 17:32:13 2013 - [warning] connect to (10.0.0.63:1521:dgha) error_count:1
Thu Aug 15 17:32:21 2013 - [debug] connect to (10.0.0.63:1521:dgha) error_count:1
Thu Aug 15 17:32:21 2013 - [debug] the dbh is null.
Thu Aug 15 17:32:21 2013 - [info] Attemp to connect to (10.0.0.63:1521:dgha)...
Thu Aug 15 17:32:26 2013 - [error][oraha.pl, ln426] connect to  (10.0.0.63:1521:dgha) timeout
Thu Aug 15 17:32:26 2013 - [warning] connect to (10.0.0.63:1521:dgha) was abort! connect again.
Thu Aug 15 17:32:26 2013 - [debug] update check == 1.undef_count is 1
Thu Aug 15 17:32:26 2013 - [info] I will destory dbh.
Thu Aug 15 17:32:26 2013 - [warning] connect to (10.0.0.63:1521:dgha) error_count:2
Thu Aug 15 17:32:34 2013 - [debug] connect to (10.0.0.63:1521:dgha) error_count:2
Thu Aug 15 17:32:34 2013 - [debug] the dbh is null.
Thu Aug 15 17:32:34 2013 - [info] Attemp to connect to (10.0.0.63:1521:dgha)...
Thu Aug 15 17:32:39 2013 - [error][oraha.pl, ln426] connect to  (10.0.0.63:1521:dgha) timeout
Thu Aug 15 17:32:39 2013 - [warning] connect to (10.0.0.63:1521:dgha) was abort! connect again.
Thu Aug 15 17:32:39 2013 - [debug] update check == 1.undef_count is 1
Thu Aug 15 17:32:39 2013 - [info] I will destory dbh.
Thu Aug 15 17:32:39 2013 - [warning] connect to (10.0.0.63:1521:dgha) error_count:3
Thu Aug 15 17:32:39 2013 - [info] try connect to the edm...
Thu Aug 15 17:32:39 2013 - [info] send message:insert into edm_user.tb_queue(ID,PHONE,MSG,STATUS,SENDLEVEL,svrtype,INSERTTIME) 
                                values(edm_user.SEQ_QUE.NEXTVAL,'13817963180','the primary(10.0.0.63:1521:dgha) is unreachable.','',1,'11',sysdate)
Thu Aug 15 17:32:39 2013 - [info] send message:insert into edm_user.tb_queue(ID,PHONE,MSG,STATUS,SENDLEVEL,svrtype,INSERTTIME) 
                                values(edm_user.SEQ_QUE.NEXTVAL,'18666668061','the primary(10.0.0.63:1521:dgha) is unreachable.','',1,'11',sysdate)
Thu Aug 15 17:32:47 2013 - [warning] ssh to 10.0.0.63 err. error_count:1
ssh: connect to host 10.0.0.63 port 22: No route to host
Thu Aug 15 17:32:50 2013 - [warning] ssh to 10.0.0.63 err. error_count:2
ssh: connect to host 10.0.0.63 port 22: No route to host
Thu Aug 15 17:32:53 2013 - [warning] ssh to 10.0.0.63 err. error_count:3
Thu Aug 15 17:32:54 2013 - [info] ssh to 10.0.0.24 success.
Thu Aug 15 17:32:57 2013 - [info] ssh to 192.168.254.63 success.

After several failed updates, DGHA concludes that Oracle is dead and tries to ssh to 63, which also fails (it then connects to 63's remote management card and powers the machine off). In the same way it checks the state of the candidate standby; if a switchover is possible it starts the switchover procedure, and at the same time it sends an SMS announcing that 63 is dead.


Thu Aug 15 17:32:57 2013 - [info] try to check (10.0.0.24:1521:dgha) is ready to failover to primary or not.
Thu Aug 15 17:32:57 2013 - [info] the standby open mode is:MOUNTED. database role is:PHYSICAL STANDBY.
Thu Aug 15 17:32:57 2013 - [info] standby lag minutes is: 0
Thu Aug 15 17:32:57 2013 - [info] standby (10.0.0.24:1521:dgha) healthcheck is passed. Go on.
Thu Aug 15 17:32:57 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step01.sh check passed.
Thu Aug 15 17:32:58 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step02.pl check passed.
Thu Aug 15 17:32:58 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_start_step03.sh check passed.
Thu Aug 15 17:32:58 2013 - [info] (10.0.0.24:1521:dgha) /home/oracle/oraha/db_check.sh check passed.
Thu Aug 15 17:32:58 2013 - [info] standby (10.0.0.24:1521:dgha) scripts under /home/oracle/oraha check passed.
Thu Aug 15 17:32:58 2013 - [info] Power of 10.0.0.63  was successfully turned off.
Thu Aug 15 17:32:58 2013 - [info] I will execute /home/oracle/oraha/db_start_step01.sh on 10.0.0.24
Thu Aug 15 17:35:58 2013 - [debug] db_start output: 
SQL*Plus: Release 11.2.0.3.0 Production on Thu Aug 15 17:32:58 2013

Copyright (c) 1982, 2011, Oracle.  All rights reserved.


Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> 
Thu Aug 15 17:35:58 2013 - [error][oraha.pl, ln180] Oracle on 10.0.0.24 shutdown failed.Kill the pmon next.
Thu Aug 15 17:35:58 2013 - [debug] Oracle pmon pid is: 10179
Thu Aug 15 17:35:58 2013 - [debug] kill -9  output: 
Thu Aug 15 17:36:08 2013 - [info] I will execute /home/oracle/oraha/db_start_step02.pl on 10.0.0.24
Thu Aug 15 17:37:47 2013 - [debug] db_start output: 
Thu Aug 15 17:37:47 2013 - [info] I will execute /home/oracle/oraha/db_start_step03.sh on 10.0.0.24
Thu Aug 15 17:38:06 2013 - [debug] db_start output: 
SQL*Plus: Release 11.2.0.3.0 Production on Thu Aug 15 17:37:47 2013

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> ORACLE instance started.

Total System Global Area 4.0486E+10 bytes
Fixed Size                  2237088 bytes
Variable Size            4026535264 bytes
Database Buffers         3.6373E+10 bytes
Redo Buffers               84631552 bytes
Database mounted.
SQL> Media recovery complete.
SQL> 
Database altered.

SQL> Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Thu Aug 15 17:38:08 2013 - [info] try to check the new primary(10.0.0.24,1521,dgha) after failover .
Thu Aug 15 17:38:08 2013 - [info] try connect to the new primary(10.0.0.24,1521,dgha)...
Thu Aug 15 17:38:08 2013 - [info] Attemp to connect to (10.0.0.24:1521:dgha)...
Thu Aug 15 17:38:08 2013 - [info] connect to (10.0.0.24:1521:dgha) success.
Thu Aug 15 17:38:08 2013 - [info] the new primary open mode is:READ WRITE. database role is:PRIMARY.
Thu Aug 15 17:38:08 2013 - [info] (10.0.0.63:1521:dgha) failover to (10.0.0.24:1521:dgha) successful.
Thu Aug 15 17:38:08 2013 - [info] try connect to the edm...
Thu Aug 15 17:38:08 2013 - [info] send message:insert into edm_user.tb_queue(ID,PHONE,MSG,STATUS,SENDLEVEL,svrtype,INSERTTIME) 
                                values(edm_user.SEQ_QUE.NEXTVAL,'13817963180','(10.0.0.63:1521:dgha) failover to (10.0.0.24:1521:dgha) successful.','',1,'11',sysdate)
Thu Aug 15 17:38:08 2013 - [info] send message:insert into edm_user.tb_queue(ID,PHONE,MSG,STATUS,SENDLEVEL,svrtype,INSERTTIME) 
                                values(edm_user.SEQ_QUE.NEXTVAL,'18666668061','(10.0.0.63:1521:dgha) failover to (10.0.0.24:1521:dgha) successful.','',1,'11',sysdate)
Killed by signal 1.
Killed by signal 1.

After recovery using the redo on shared storage completes, DGHA likewise sends an SMS notification and exits.


Switchover tests for the various scenarios are listed below:

1. Simulated instance hang (oradebug frozen instance) --> all existing connections hang; a new process is spawned to update the check table. Two cases: if the update succeeds, no switchover; if the hang causes the update to fail, switch over. The earlier processes, however, are left in a "zombie" state.

2. Simulated listener hang (kill -STOP the listener pid and kill the active sessions): the new process cannot update and times out --> ssh to the host and abort the instance --> connect to the remote management card and power off the host --> switch over to the candidate standby.

3. Simulated machine down (halt the machine directly; both the update and ssh fail, the host is judged dead, and the switchover proceeds immediately).

4. NIC down (handled like scenario 1).
For scenarios 1 and 4, the connections DGHA opened earlier cannot be reclaimed; a script deployed locally on the database host periodically cleans them up based on DGHA's machine name and idle time (a sketch follows this list).

5. After the switchover to the new primary completes, the old primary can rejoin the cluster if needed (copy the standby controlfile from the new primary; the standby logfiles must be recreated).
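
A minimal sketch of the cleanup mentioned for scenarios 1 and 4, run from cron on each database host; the monitor's machine name ('dghamon') and the 30-minute idle threshold are assumptions:

sqlplus -s / as sysdba <<'EOF'
begin
  -- kill sessions opened from the DGHA monitor host that have been idle too long
  for s in (select sid, serial# from v$session
             where machine = 'dghamon'
               and status = 'INACTIVE'
               and last_call_et > 1800) loop
    execute immediate 'alter system kill session ''' || s.sid || ',' || s.serial# || ''' immediate';
  end loop;
end;
/
EOF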

Database-exporter release v1.0

June 21, 2013 Architect, software

Database-exporter is released. Thanks to wucharles, a young man full of geek spirit.

1. supports exporting data between databases (you need to write the driver yourself; Oracle and MySQL are supported for now)

2. supported databases: Oracle and MySQL (from Oracle to MySQL, or MySQL to Oracle)

3. supports merging multiple source tables into one target table (both MySQL and Oracle)

4. supports splitting one source table into multiple target tables (both MySQL and Oracle)

5. supports source table column names that differ from the target table column names (both MySQL and Oracle)

6. supports exporting multiple tables concurrently

7. supports exporting one table's data concurrently, page by page

8. supports the LOB, LONG, and RAW datatypes

9. supports database plugins (e.g., you can write a driver to load SQL Server data into Oracle)

GitHub repository: Database-exporter

Oracle Direct NFS Performance

November 23, 2012 11g, Architect, oracle, software

Oracle 11g introduces a new feature, Direct Network File System (Oracle Direct NFS): an NFS client packaged inside the Oracle kernel that improves performance when database files live on NFS. DNFS also applies to high-availability scenarios such as RAC.

DIRECT NFS CLIENT OVERVIEW
Standard NFS client software, provided by the operating system, is not optimized for Oracle Database file I/O access patterns. With Oracle Database 11g, you can configure Oracle Database to access NFS V3 NAS devices directly using Oracle Direct NFS Client, rather than using the operating system kernel NFS client. Oracle Database will access files stored on the NFS server directly through the integrated Direct NFS Client eliminating the overhead imposed by the operating system kernel NFS. These files are also accessible via the operating system kernel NFS client thereby allowing seamless administration.

Benefits of Direct NFS Client
Direct NFS Client overcomes many of the challenges associated with using NFS with the Oracle Database. Direct NFS Client outperforms traditional NFS clients and is simple to configure.

Recently our company planned to use DNFS as the storage mode for test databases, which was a good opportunity to benchmark it.

The environment:

[root@DCA-SRV-0133 u01]# dmidecode -s system-product-name
PowerEdge R905
[root@DCA-SRV-0133 u01]# cat /proc/meminfo |grep  MemTotal
MemTotal:       132352732 kB
[root@DCA-SRV-0133 u01]# cat /proc/cpuinfo |grep processor |wc -l
24
[root@DCA-SRV-0133 u01]# cat /etc/issue
Oracle  Server release 5.8
Kernel \r on an \m

The storage is a NetApp filer in a standard NAS configuration.

The test compares DNFS and NFS performance, using database version 11.2.0.3.

Setting up DNFS

1. Mount the NFS filesystem

[root@DCA-SRV-0133 /]# showmount  -e nas
Export list for nas:
/vol/vol0        192.168.254.0/24
/vol/vol_user_db 192.168.254.0/24,10.0.254.0/24
[root@DCA-SRV-0133 /]# cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1             localhost.localdomain localhost
::1          localhost6.localdomain6 localhost6
10.0.254.1   nas
10.63.0.221   DCA-SRV-0133
[root@DCA-SRV-0133 /]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1            101572540   3671028  92658704   4% /
/dev/sda3            101572572    192352  96137408   1% /home
tmpfs                 66176364         0  66176364   0% /dev/shm
/dev/sda4            878850916  14454200 819753688   2% /data
10.0.254.1:/vol/vol_user_db
                     1518128384   1821888 1516306496   1% /u01
[root@DCA-SRV-0133 /]# umount /u01/
[root@DCA-SRV-0133 /]# mount -t nfs 10.0.254.1:/vol/vol_user_db  /u01
[root@DCA-SRV-0133 /]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1            101572540   3671056  92658676   4% /
/dev/sda3            101572572    192352  96137408   1% /home
tmpfs                 66176364         0  66176364   0% /dev/shm
/dev/sda4            878850916  14454200 819753688   2% /data
10.0.254.1:/vol/vol_user_db
                     1518128384   1821888 1516306496   1% /u01

2. Configure the database for DNFS

[oracle@DCA-SRV-0133 lib]$ cd $ORACLE_HOME/lib
[oracle@DCA-SRV-0133 lib]$ mv libodm11.so libodm11.so.old
[oracle@DCA-SRV-0133 lib]$ ln -s libnfsodm11.so libodm11.so
[oracle@DCA-SRV-0133 lib]$ ls -l libodm11.so
lrwxrwxrwx 1 oracle oinstall 14 Nov 22 16:03 libodm11.so -> libnfsodm11.so
[oracle@DCA-SRV-0133 lib]$ ls -l  libodm11.so
lrwxrwxrwx 1 oracle oinstall 14 Nov 22 16:03 libodm11.so -> libnfsodm11.so
[oracle@DCA-SRV-0133 lib]$ ls -l  libodm11.so.old
-rwxr-xr-x 1 oracle oinstall 60431 Nov 22 15:08 libodm11.so.old


[oracle@DCA-SRV-0133 lib]$ cat $ORACLE_HOME/dbs/oranfstab
server: nas
local:10.0.254.21 -----> note: NIC bonding is used here, with two IPs separating the NFS network from the IDC network, so the local IP must be specified manually.
path: 10.0.254.1
path: 192.168.254.1  ------------------- > MP
export: /vol/vol_user_db     mount: /u01
[oracle@DCA-SRV-0133 lib]$

3. Start up the database

SQL> startup
ORACLE instance started.

Total System Global Area 4.0486E+10 bytes
Fixed Size                2237088 bytes
Variable Size              3892317536 bytes
Database Buffers       3.6507E+10 bytes
Redo Buffers                 84631552 bytes

.....




Starting ORACLE instance (normal)
****************** Large Pages Information *****************

Total Shared Global Region in Large Pages = 0 KB (0%)

Large Pages used by this instance: 0 (0 KB)
Large Pages unused system wide = 0 (0 KB) (alloc incr 128 MB)
Large Pages configured system wide = 0 (0 KB)
Large Page size = 2048 KB
..


Oracle instance running with ODM: Oracle Direct NFS ODM Library Version 3.0
Fri Nov 23 13:34:34 2012
PMON started with pid=2, OS id=18612
Fri Nov 23 13:34:34 2012
PSP0 started with pid=3, OS id=18616
Fri Nov 23 13:34:35 2012
VKTM started with pid=4, OS id=18620 at elevated priority
VKTM running at (1)millisec precision with DBRM quantum (100)ms

..
ORACLE_BASE from environment = /data/oracle
Fri Nov 23 13:34:36 2012
ALTER DATABASE   MOUNT
Direct NFS: channel id [0] path [10.0.254.1] to filer [nas] via local [10.0.254.21] is UP
Direct NFS: channel id [1] path [10.0.254.1] to filer [nas] via local [10.0.254.21] is UP
Successful mount of redo thread 1, with mount id 2930804780
Database mounted in Exclusive Mode
Lost write protection disabled
Completed: ALTER DATABASE   MOUNT

4. Use Swingbench to load 10 GB of data

4.1 Set the database SGA to 5 GB.


DNFS performance, 100 users:

AVG TPS 2222, AVG response time 40 ms

NFS performance, 100 users:

AVG TPS 1401, AVG response time 80 ms

4.2 Set the database SGA to 800 MB.

DNFS performance, 1000 users:

AVG TPS 578, AVG response time 1.5 s

NFS performance, 1000 users:

AVG TPS 400, AVG response time 2.0 s

Other DNFS best-practice performance results:

LVS introduction

March 20, 2012 Architect, linux, software

What is virtual server?
Virtual server is a highly scalable and highly available server built on a cluster of real servers. The architecture of server cluster is fully transparent to end users, and the users interact with the cluster system as if it were only a single high-performance virtual server. Please consider the following figure.

The real servers and the load balancers may be interconnected by either a high-speed LAN or by a geographically dispersed WAN. The load balancers can dispatch requests to the different servers and make the parallel services of the cluster appear as a virtual service on a single IP address, and request dispatching can use IP load-balancing technologies or application-level load-balancing technologies. Scalability of the system is achieved by transparently adding or removing nodes in the cluster. High availability is provided by detecting node or daemon failures and reconfiguring the system appropriately.

Recommendation for the Real Application Cluster Interconnect and Jumbo Frames

February 9, 2012 Architect, oracle, RAC, software

Recommendation for the Real Application Cluster Interconnect and Jumbo Frames

Applies to:


Oracle Server – Enterprise Edition – Version: 9.2.0.1 to 11.2.0.0 – Release: 9.2 to 11.2
Information in this document applies to any platform.
Oracle Server Enterprise Edition – Version: 9.2.0.1 to 11.2
Purpose

This note covers the current recommendation for the Real Application Cluster Interconnect and Jumbo Frames
Scope and Application

This article points out the issues surrounding Ethernet Jumbo Frame usage for the Oracle Real Application Cluster (RAC) Interconnect. In Oracle Real Application Clusters, the Cluster Interconnect is designed to run on a dedicated, or stand-alone network. The Interconnect is designed to carry the communication between the nodes in the Cluster needed to check for the Clusters condition and to synchronize the various memory caches used by the database.

Ethernet is a widely used networking technology for Cluster Interconnects. Ethernet’s variable frame size of 46-1500 bytes is the transfer unit between the all Ethernet participants, such as the hosts and switches. The upper bound, in this case 1500, is called MTU (Maximum Transmission Unit). When an application sends a message greater than 1500 bytes (MTU), it is fragmented into 1500 byte, or smaller, frames from one end-point to another. In Oracle RAC, the setting of DB_BLOCK_SIZE multiplied by the MULTI_BLOCK_READ_COUNT determines the maximum size of a message for the Global Cache and the PARALLEL_EXECUTION_MESSAGE_SIZE determines the maximum size of a message used in Parallel Query. These message sizes can range from 2K to 64K or more, and hence will get fragmented more so with a lower/default MTU.
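
For a concrete sense of scale: with an 8 KB DB_BLOCK_SIZE and a multi-block read count of 16, a single Global Cache message can be 128 KB, which is split into roughly 90 frames at the default 1500-byte MTU but only about 15 frames at a 9000-byte MTU (illustrative arithmetic, not figures from the original note).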

Jumbo Frames introduces the ability for an Ethernet frame to exceed its IEEE 802 specified Maximum Transfer Unit of 1500 bytes up to a maximum of 9000 bytes. Even though Jumbo Frames is widely available in most NICs and data-center class managed switches it is not an IEEE approved standard. While the benefits are clear, Jumbo Frames interoperability is not guaranteed with some existing networking devices. Though Jumbo Frames can be implemented for private Cluster Interconnects, it requires very careful configuration and testing to realize its benefits. In many cases, failures or inconsistencies can occur due to incorrect setup, bugs in the driver or switch software, which can result in sub-optimal performance and network errors.

Configuration

In order to make Jumbo Frames work properly for a Cluster Interconnect network, careful configuration in the host, its Network Interface Card and switch level is required:
The host’s network adapter must be configured with a persistent MTU size of 9000 (which will survive reboots).
For example, ifconfig -mtu 9000 followed by ifconfig -a to show the setting completed.
Certain NIC’s require additional hardware configuration.
For example, some Intel NIC’s require special descriptors and buffers to be configured for Jumbo Frames to work properly.
The LAN switches must also be properly configured to increase the MTU for Jumbo Frame support. Ensure the changes made are permanent (survives a power cycle) and that both “Jumbo” refer to same size, recommended 9000 (some switches do not support this size).

Because of the lack of standards with Jumbo Frames the interoperability between switches can be problematic and requires advanced networking skills to troubleshoot.
Remember that the smallest MTU used by any device in a given network path determines the maximum MTU (the MTU ceiling) for all traffic travelling along that path.
Failing to properly set these parameters in all nodes of the Cluster and Switches can result in unpredictable errors as well as a degradation in performance.
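
On the Linux releases used elsewhere in this blog, one common way to make the MTU setting persistent is the interface configuration file; the interface name and addresses below are illustrative:

# /etc/sysconfig/network-scripts/ifcfg-eth1  (private interconnect NIC)
DEVICE=eth1
BOOTPROTO=static
IPADDR=10.10.10.1
NETMASK=255.255.255.0
ONBOOT=yes
MTU=9000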
Testing

Request your network and system administrator along with vendors to fully test the configuration using standard tools such as SPRAY or NETCAT and show that there is an improvement not degradation when using Jumbo Frames. Other basic ways to check it’s configured correctly on Linux/Unix are using:

Traceroute: Notice the 9000 packet goes through with no error, while the 9001 fails, this is a correct configuration that supports a message of up to 9000 bytes with no fragmentation:

[node01] $ traceroute -F node02-priv 9000
traceroute to node02-priv (10.10.10.2), 30 hops max, 9000 byte packets
1 node02-priv (10.10.10.2) 0.232 ms 0.176 ms 0.160 ms

[node01] $ traceroute -F node02-priv 9001
traceroute to node02-priv (10.10.10.2), 30 hops max, 9001 byte packets
traceroute: sendto: Message too long
1 traceroute: wrote node02-priv 9001 chars, ret=-1
* Note: Due to Oracle Bugzilla 7182 (must have logon privileges) — also known as RedHat Bugzilla 464044 — older than EL4.7 traceroute may not work correctly for this purpose.
* Note: Some versions of traceroute, e.g. traceroute 2.0.1 shipped with EL5, add the header size on top of what is specified when using the -F flag (same as the ping behavior below). Newer versions of traceroute, like 2.0.14 (shipped with OL6), have the old behavior of traceroute version 1 (the size of the packet is exactly what is specified with the -F flag).

Ping: With ping we have to take into account an overhead of about 28 bytes per packet, so 8972 bytes go through with no errors, while 8973 fail, this is a correct configuration that supports a message of up to 9000 bytes with no fragmentation:
[node01]$ ping -c 2 -M do -s 8972 node02-priv
PING node02-priv (10.10.10.2) 1472(1500) bytes of data.
1480 bytes from node02-priv (10.10.10.2): icmp_seq=0 ttl=64 time=0.220 ms
1480 bytes from node02-priv (10.10.10.2): icmp_seq=1 ttl=64 time=0.197 ms

[node01]$ ping -c 2 -M do -s 8973 node02-priv
From node02-priv (10.10.10.1) icmp_seq=0 Frag needed and DF set (mtu = 9000)
From node02-priv (10.10.10.1) icmp_seq=0 Frag needed and DF set (mtu = 9000)
— node02-priv ping statistics —
0 packets transmitted, 0 received, +2 errors
* Note: Ping reports fragmentation errors, due to exceeding the MTU size.
Performance

For RAC Interconnect traffic, devices correctly configured for Jumbo Frame improves performance by reducing the TCP, UDP, and Ethernet overhead that occurs when large messages have to be broken up into the smaller frames of standard Ethernet. Because one larger packet can be sent, inter-packet latency between various smaller packets is eliminated. The increase in performance is most noticeable in scenarios requiring high throughput and bandwidth and when systems are CPU bound.

When using Jumbo Frames, fewer buffer transfers are required, which is part of the reduction in fragmentation and reassembly in the IP stack, and thus helps reduce the latency of an Oracle block transfer.

As illustrated in the configuration section, any incorrect setup may prevent instances from starting up or can have a very negative effect on the performance.
Known Bugs

In some versions of Linux there are specific bugs in Intel’s Ethernet drivers and the UDP code path in conjunction with Jumbo Frames that could affect the performance. Check for and use the latest version of these drivers to be sure you are not running into these older bugs.

The following bugzilla bugs 162197, 125122 are limited to RHEL3.
Recommendation

There is some complexity involved in configuring Jumbo Frames, which is highly hardware and OS specific. The lack of a specific standard may present OS and hardware bugs. Even with these considerations, Oracle recommends using Jumbo Frames for private Cluster Interconnects.

Since there is no official standard for Jumbo Frames, this configuration should be properly load tested by Customers. Any indication of packet loss, socket buffer or DMA overflows, TX and RX error in adapters should be noted and checked with the hardware and operating system vendors.

The recommendation in this Note is strictly for the Oracle private interconnect only; it does not apply to other NAS or iSCSI vendor-tested and validated Jumbo Frames configured networks.

Oracle VM does not support Jumbo Frames. Refer to Oracle VM: Jumbo Frame on Oracle VM (Doc ID 1166925.1) for further information.