hardware

IBM will support Flash In DIMM using MCS

January 17, 2014 Architect, hardware, system, unix No comments

The next generation of IBM’s X-series servers will be able to accommodate solid-state Flash drives clipped into their DIMM memory slots, potentially improving the response times of fast-paced enterprise applications.

On Thursday, IBM unveiled the sixth generation of its System x x86-based servers. In addition to the novel reuse of DIMM slots, the X6 architecture will also let customers upgrade to a new generation of processors or memory without swapping in a new motherboard.

[Image: ULLtraDIMM flash module]

Diablo Technologies, a memory technology company, developed Memory Channel Storage (MCS), which
lets flash on a DIMM module be accessed by the CPU over the memory channel, instead of over the
SATA bus as earlier DIMM form-factor SSD products did. Using a host-level driver and an ASIC on
the DIMM, it creates a special storage layer in flash through which the CPU moves data to and
from the RAM memory space. It also requires a minor modification to the server BIOS for CPU
support, which three OEMs have completed so far.

Each DIMM flash module has 16 separate, independently addressable data channels. This lets the
driver issue parallel data writes, improving performance over the DMA process used by PCIe-based
solutions. Published ULLtraDIMM specs show 5-microsecond latencies for these devices, an order of
magnitude better than typical PCIe flash products. The architecture also allows up to 63
ULLtraDIMM modules to be aggregated in a single server, for 25TB of flash capacity and >9M IOPS.
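
As a rough sanity check on those aggregate figures (assuming the 400GB module capacity and the roughly 150K random-read IOPS per module quoted in vendor spec sheets; neither number is stated in this post):

63 × 400 GB = 25,200 GB ≈ 25 TB
63 × 150K IOPS ≈ 9.45M IOPS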

[Image: Memory Channel Storage (MCS) architecture]

Ref:
- IBM X series servers now pack Flash into speedy DIMM slots
- IBM Beefs Up Enterprise X-Architecture With Flash, Modular Design
- Heating Up Storage Performance
- How to Make Flash Accessible on the Memory Bus
- Memory Channel Storage™
- ULLtraDIMM: combining SSD and DRAM for the enterprise

ssd gc review

December 12, 2013 Architect, hardware No comments

UPDATE 12.14: added two docs. Download this PDF

overview of gc_doc1

overview of gc_doc2

A bird's-eye view of the 1号店 (Yihaodian) architecture

November 18, 2013 Architect, hardware, IDC, software No comments

UPDATE 12.2: you can download the complete PDF from database design of YHD

ION: a server-based storage solution

September 5, 2013 Architect, hardware, storage No comments

ION is FusionIO's shared-level acceleration solution built on ioMemory; this post briefly sketches its rough architecture. ION can be configured with InfiniBand or 40Gb Ethernet connectivity, and supports the FCoIB, FCoE, EoIB, and RDMA protocols.
Given FIO's special relationship with HP, the material below comes mainly from ION Accelerator on HP DL380.

Note that ION places real requirements on the host server: on a 1U machine, the limited number of PCIe slots inevitably drags down IO performance.
ION essentially emulates a storage controller head: its own software, combined with InfiniBand, turns an ordinary server into "storage". The concept is similar to QGUARD; interested readers can look into that. For this kind of shared storage, Oracle RAC is naturally the first choice among database applications, and as for open-source stacks, I doubt anyone would spend serious money on a pile of paid software plus hardware just to emulate storage :) Whether it can challenge Exadata (even though Exadata targets a different scenario) remains worth watching.


At the bottom layer, a server emulates storage and still speaks the FC protocol.
ION does not currently support a cluster architecture (multiple servers emulating a storage array); it only offers a simple one-to-one HA setup (similar to storage replication).


In a RAC architecture the layout resembles a traditional deployment, but IO capability is greatly increased (a super-powered application cluster, so to speak). Comparable solutions include XtremSF and Flash Accel.


For details, see FIO ION.

How to use Flash Cache on Red Hat (not OEL)

August 26, 2013 Architect, hardware, linux, system No comments


Thanks to Surachart for his help.

Test: Flash Cache on 11gR2 + RHEL

Flash Cache (11gR2) is only supported on OEL or Solaris. If you want to use it on RHEL (for example, RHEL 5.3):

Apply patch 8974084 first.

SQL> startup
ORA-00439: feature not enabled: Server Flash Cache
ORA-01078: failure in processing system parameters

TEST: ***use the "strace" command to trace system calls & signals***
$ strace -o /tmp/file01.txt -f sqlplus '/ as sysdba' <<EOF
startup
EOF

The trace turns up 2 points:
1. The /etc/*-release files:

3884  open("/etc/enterprise-release", O_RDONLY) = 8
3884  read(8, "Enterprise Linux Enterprise Linu"..., 255) = 64


2. The "rpm" command:
32278 execve("/bin/rpm", ["/bin/rpm", "-qi", "--info", "enterprise-release"], [/* 25 vars */] <unfinished ...>
Next, Oracle greps for "66ced3de1e5e0159" in the following output…
 
Check on Enterprise Linux for comparison:

$ rpm -qi --info "enterprise-release"

Name        : enterprise-release           Relocations: (not relocatable)
Version     : 5                                 Vendor: Oracle USA
Release     : 0.0.17                        Build Date: Wed 21 Jan 2009 06:00:33 PM PST
Install Date: Mon 11 May 2009 11:19:45 AM PDT      Build Host: ca-build10.us.oracle.com
Group       : System Environment/Base       Source RPM: enterprise-release-5-0.0.17.src.rpm
Size        : 59030                            License: GPL
Signature   : DSA/SHA1, Wed 21 Jan 2009 06:56:48 PM PST, Key ID 66ced3de1e5e0159
Summary     : Enterprise Linux release file
Description :
System release and information files


Fixed:
1. Fake the *-release files (don't forget to back them up first)
- Modify the /etc/redhat-release and /etc/enterprise-release files.
$ cat /etc/redhat-release
Enterprise Linux Enterprise Linux Server release 5.3 (Carthage)

$ cat /etc/enterprise-release
Enterprise Linux Enterprise Linux Server release 5.3 (Carthage)

2. Fake rpm so the "enterprise-release" package check passes.
- Replace /bin/rpm with a wrapper:
#  mv /bin/rpm /bin/rpm.bin

# vi /bin/rpm
#!/bin/sh
# Return the key Oracle greps for when it queries enterprise-release;
# pass every other invocation through to the real rpm binary.
# Note: "$@" (not "$*") preserves the original argument boundaries.
if [ "$3" = "enterprise-release" ]
then
    echo 66ced3de1e5e0159
else
    exec /bin/rpm.bin "$@"
fi

# chmod 755 /bin/rpm
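
With the wrapper in place, a quick sanity check (the output follows directly from the script above):

# rpm -qi --info enterprise-release
66ced3de1e5e0159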

Try again: start up the database.

SQL> startup
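
Once the instance starts, Flash Cache itself still has to be pointed at a device with the usual 11gR2 parameters (a sketch only; /dev/fioa1 is a hypothetical flash device path, not one from this test):

SQL> alter system set db_flash_cache_file='/dev/fioa1' scope=spfile;
SQL> alter system set db_flash_cache_size=64G scope=spfile;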

Advanced use of dbms_stats for extended stats on column groups (CG)

June 4, 2013 Architect, hardware No comments

PCIE performance test — LSI vs FusionIO vs VIRI

April 2, 2013 Architect, hardware 2 comments

Testing the performance of three PCIe cards across a range of scenarios using fio.

Reference: fio parameter settings
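
The post links only the parameter settings, so as a rough sketch, a random-read job of the kind such a card comparison typically uses might look like this (the device path, block size, and queue depth are illustrative assumptions, not the actual test parameters):

fio --name=randread-test --filename=/dev/fioa \
    --direct=1 --ioengine=libaio \
    --rw=randread --bs=4k --iodepth=32 --numjobs=4 \
    --runtime=60 --time_based --group_reporting

Sweeping --rw (read, write, randread, randwrite, randrw), --bs, and --iodepth covers the usual "all scenarios" matrix.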

NetAPP DISK SCRUB

September 17, 2012 Architect, hardware No comments

During a recent core-system migration, the NetApp storage behaved unexpectedly: although the front-end load was not very high, storage CPU utilization exceeded 55% and reads actually reached 1GB/s.
Unable to identify the source of that 1GB/s of reads, we were forced to roll back the project, costing a team of more than 50 people a whole night's work. A NetApp check later showed the culprit was a storage self-check, the "NetApp disk scrub", which by default starts at 1:00 a.m. on Sunday, runs for six hours, and happened to collide with our migration window. A summary follows:

At the time, the load on both the A and B controller heads spiked to 60%, with reads on each exceeding 1GB/s, and head A was loaded more heavily than head B. This is because the system used head B as its primary head: during the self-check, NetApp dynamically throttled the scrub load on head B, the head actively serving data.


It’s a well-known fact in the storage world that firmware bugs (and sometimes hardware and data path problems) can cause silent data corruption; the data that ends up on disk is not the data that was sent down the pipe. To protect against this, when Data ONTAP writes data to disk, it creates a checksum for each 4kB block that is stored as part of the block’s metadata. When data is later read from disk, the checksum is recalculated and compared to the stored checksum. If they are different, the requested data is recreated from parity. In addition, the data from parity is rewritten to the original 4kB block, then read back to verify its accuracy.

To ensure the accuracy of archive data that may remain on disk for long periods without being read, NetApp offers the configurable RAID scrub feature. A scrub can be configured to run when the system is idle and reads every 4kB block on disk, triggering the checksum mechanism to identify and correct hidden corruption or media errors that may occur over time. This proactive diagnostic software promotes self-healing and general drive maintenance.

To NetApp, rule number 1 is to protect our customer data at all costs. Protection against firmware-induced silent data corruption is an example of NetApp’s continuing focus on developing innovative storage resiliency features to ensure the highest level of data integrity.



How you schedule automatic RAID-level scrubs

By default, Data ONTAP performs a weekly RAID-level scrub starting on Sunday at 1:00 a.m. for a duration of six hours. You can change the start time and duration of the weekly scrub, add more automatic scrubs, or disable the automatic scrub.

To schedule an automatic RAID-level scrub, you use the raid.scrub.schedule option.
To change the duration of automatic RAID-level scrubbing without changing the start time, you use the raid.scrub.duration option, specifying the number of minutes you want automatic RAID-level scrubs to run. If you set this option to -1, all automatic RAID-level scrubs run to completion.
Note: If you specify a duration using the raid.scrub.schedule option, that value overrides the value you specify with this option.
To enable or disable automatic RAID-level scrubbing, you use the raid.scrub.enable option.

Scheduling example
The following command schedules two weekly RAID scrubs. The first scrub runs for 240 minutes (four hours) every Tuesday starting at 2 a.m. The second runs for eight hours every Saturday starting at 10 p.m.

options raid.scrub.schedule 240m@tue@2,8h@sat@22

Verification example
The following command displays your current automatic RAID-level scrub schedule. If you are using the default schedule, nothing is displayed.

options raid.scrub.schedule

Reverting to the default schedule example
The following command reverts your automatic RAID-level scrub schedule to the default (Sunday at 1:00 a.m., for six hours):

options raid.scrub.schedule " "
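
For example, to cap each automatic scrub at six hours without touching the schedule, using the raid.scrub.duration option described above:

options raid.scrub.duration 360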

FUJITSU RX600 test report

August 21, 2012 Architect, hardware No comments

Evaluating the 11g new feature: IO Calibration

December 13, 2011 11g, hardware, oracle No comments

The 11g new feature IO Calibration can estimate the read and write performance of our storage. Before using this feature, a few prerequisites must be met:

On Linux, asynchronous IO is not enabled by default:
SQL> show parameter filesystemio_options

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
filesystemio_options                 string      none

You can check whether asynchronous I/O is enabled with the following query:

SQL> col name format a50
select name, asynch_io from v$datafile f, v$iostat_file i
where f.file# = i.file_no
and (filetype_name = 'Data File' or filetype_name = 'Temp File');

NAME                                               ASYNCH_IO
-------------------------------------------------- ---------
/data/oracle/oradata/yhddb1/system.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/system.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/sysaux.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/sysaux.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/undotbs01.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/undotbs01.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/qipei_data01.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/lg_index01.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/qipei_index01.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/undotbs02.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/qipei_data02.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/md_data01.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/qipei_data03.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/qipei_index02.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/tms_data01.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/undotbs03.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/lg_index02.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/tms_idx01.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/tms_data02.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/pos_data_01.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/md_data02.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/lg_index03.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/pos_data_02.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/pos_index_01.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/lg_index04.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/lg_index05.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/ims_data01.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/lg_index06.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/ims_index01.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/md_data05.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/md_data03.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/tms_idx02.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/lg_index07.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/md_data04.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/ttuser01.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/tms_data03.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/ttuser02.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/lg_index08.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/lg_index09.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/lg_index10.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/lg_index11.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/lg_index12.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/lg_index13.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/ttuser03.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/tms_data04.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/tms_data05.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/ttuser04.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/ttuser05.dbf ASYNC_OFF
/data/oracle/oradata/yhddb1/ttuser06.dbf ASYNC_OFF

We need to enable asynchronous IO:
SQL> show parameter FILESYSTEMIO_OPTIONS;

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
filesystemio_options                 string      SETALL

Changing this parameter requires a database restart. The filesystemio_options parameter supports four values:
ASYNCH: enables asynchronous IO for Oracle files;
DIRECTIO: enables direct IO for Oracle files;
SETALL: enables both asynchronous IO and direct IO;
NONE: disables both asynchronous IO and direct IO.
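
To enable it, a minimal sketch (the parameter is static, so the restart mentioned above is required):

SQL> alter system set filesystemio_options=setall scope=spfile;
SQL> shutdown immediate
SQL> startup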


Syntax

DBMS_RESOURCE_MANAGER.CALIBRATE_IO (
   num_physical_disks IN  PLS_INTEGER DEFAULT 1,
   max_latency        IN  PLS_INTEGER DEFAULT 20,
   max_iops           OUT PLS_INTEGER,
   max_mbps           OUT PLS_INTEGER,
   actual_latency     OUT PLS_INTEGER);

num_physical_disks - Approximate number of physical disks in the database storage
max_latency - Maximum tolerable latency in milliseconds for database-block-sized IO requests
max_iops - Maximum number of I/O requests per second that can be sustained. The I/O requests are randomly-distributed, database-block-sized reads.
max_mbps - Maximum throughput of I/O that can be sustained, expressed in megabytes per second. The I/O requests are randomly-distributed, 1MB reads.
actual_latency - Average latency of database-block-sized I/O requests at the max_iops rate, expressed in milliseconds

We can use DBMS_RESOURCE_MANAGER.CALIBRATE_IO to measure storage performance: disk_count is the actual number of physical disks, and max_latency is the maximum tolerable latency, set to 10 here.

SQL> set serveroutput on;
SQL> DECLARE
  2  lat  INTEGER;
  3  iops INTEGER;
  4  mbps INTEGER;
  5  BEGIN
  6  -- DBMS_RESOURCE_MANAGER.CALIBRATE_IO (disk_count, max_latency, iops, mbps, lat);
  7  DBMS_RESOURCE_MANAGER.CALIBRATE_IO (2, 10, iops, mbps, lat);
  8
  9  DBMS_OUTPUT.PUT_LINE ('max_iops = ' || iops);
 10  DBMS_OUTPUT.PUT_LINE ('latency  = ' || lat);
 11  dbms_output.put_line('max_mbps = ' || mbps);
 12  end;
 13  /

max_iops = 901
latency = 15
max_mbps = 800

The I/O calibration results can be viewed through the following views:

SQL> desc V$IO_CALIBRATION_STATUS
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 STATUS                                             VARCHAR2(13)
 CALIBRATION_TIME                                   TIMESTAMP(3)

SQL> desc gv$io_calibration_status
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 INST_ID                                            NUMBER
 STATUS                                             VARCHAR2(13)
 CALIBRATION_TIME                                   TIMESTAMP(3)

Column explanation:
-------------------
STATUS:
  IN PROGRESS   : Calibration in progress (results from the previous
                  calibration run are displayed, if available)
  READY         : Results ready and available from an earlier run
  NOT AVAILABLE : Calibration results not available

CALIBRATION_TIME: End time of the last calibration run
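
For example, a minimal check that the calibration run above has finished, querying the view just described:

SQL> select status, calibration_time from v$io_calibration_status;
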
The DBA table that stores I/O calibration results:

SQL> desc DBA_RSRC_IO_CALIBRATE
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 START_TIME                                         TIMESTAMP(6)
 END_TIME                                           TIMESTAMP(6)
 MAX_IOPS                                           NUMBER
 MAX_MBPS                                           NUMBER
 MAX_PMBPS                                          NUMBER
 LATENCY                                            NUMBER
 NUM_PHYSICAL_DISKS                                 NUMBER
comment on table DBA_RSRC_IO_CALIBRATE is
'Results of the most recent I/O calibration'

comment on column DBA_RSRC_IO_CALIBRATE.START_TIME is
'start time of the most recent I/O calibration'

comment on column DBA_RSRC_IO_CALIBRATE.END_TIME is
'end time of the most recent I/O calibration'

comment on column DBA_RSRC_IO_CALIBRATE.MAX_IOPS is
'maximum number of data-block read requests that can be sustained per second'

comment on column DBA_RSRC_IO_CALIBRATE.MAX_MBPS is
'maximum megabytes per second of maximum-sized read requests that can be
sustained'

comment on column DBA_RSRC_IO_CALIBRATE.MAX_PMBPS is
'maximum megabytes per second of large I/O requests that
can be sustained by a single process'

comment on column DBA_RSRC_IO_CALIBRATE.LATENCY is
'latency for data-block read requests'

comment on column DBA_RSRC_IO_CALIBRATE.NUM_PHYSICAL_DISKS is
'number of physical disks in the storage subsystem (as specified by user)'
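
Once a calibration run has finished, the stored results can be pulled back with a simple query against the table above:

SQL> select max_iops, max_mbps, max_pmbps, latency, num_physical_disks from dba_rsrc_io_calibrate;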