linux

linux tools part 4– Monitor process IO state

March 5, 2013 linux, system No comments

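iotop is a top-like Python tool for monitoring per-process I/O. Both the source code and a detailed description are available from the project site. A short demonstration follows: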

First download the sources. Python >= 2.6 is required.

wget http://guichaz.free.fr/iotop/files/iotop-0.5.tar.gz
wget http://www.python.org/ftp/python/2.6.6/Python-2.6.6.tgz

Build Python:

[root@oel58 tmp]# cd /tmp/Python-2.6.6
[root@oel58 Python-2.6.6]#./configure && make && make install

....


checking for %zd printf() format support... yes
checking for socklen_t... yes
checking for build directories... done
configure: creating ./config.status
config.status: creating Makefile.pre
config.status: creating Modules/Setup.config
config.status: creating pyconfig.h
creating Modules/Setup
creating Modules/Setup.local
creating Makefile


...

Writing /usr/local/lib/python2.6/lib-dynload/Python-2.6.6-py2.6.egg-info
if test -f /usr/local/bin/python -o -h /usr/local/bin/python; \
        then rm -f /usr/local/bin/python; \
        else true; \
        fi
(cd /usr/local/bin; ln python2.6 python)
rm -f /usr/local/bin/python-config
(cd /usr/local/bin; ln -s python2.6-config python-config)
/usr/bin/install -c -m 644 ./Misc/python.man \
                /usr/local/share/man/man1/python.1

Once Python is built, install iotop from source:

[root@oel58 tmp]# cd iotop-0.5
[root@oel58 iotop-0.5]# ./setup.py  install
running install
running build
running build_py
creating build
creating build/lib
creating build/lib/iotop
copying iotop/version.py -> build/lib/iotop
copying iotop/netlink.py -> build/lib/iotop
copying iotop/__init__.py -> build/lib/iotop
copying iotop/ui.py -> build/lib/iotop
copying iotop/data.py -> build/lib/iotop
copying iotop/genetlink.py -> build/lib/iotop
copying iotop/vmstat.py -> build/lib/iotop
copying iotop/ioprio.py -> build/lib/iotop
running build_scripts
creating build/scripts-2.6
copying and adjusting sbin/iotop -> build/scripts-2.6
changing mode of build/scripts-2.6/iotop from 644 to 755
running install_lib
creating /usr/local/lib/python2.6/site-packages/iotop
copying build/lib/iotop/version.py -> /usr/local/lib/python2.6/site-packages/iotop
copying build/lib/iotop/netlink.py -> /usr/local/lib/python2.6/site-packages/iotop
copying build/lib/iotop/__init__.py -> /usr/local/lib/python2.6/site-packages/iotop
copying build/lib/iotop/ui.py -> /usr/local/lib/python2.6/site-packages/iotop
copying build/lib/iotop/data.py -> /usr/local/lib/python2.6/site-packages/iotop
copying build/lib/iotop/genetlink.py -> /usr/local/lib/python2.6/site-packages/iotop
copying build/lib/iotop/vmstat.py -> /usr/local/lib/python2.6/site-packages/iotop
copying build/lib/iotop/ioprio.py -> /usr/local/lib/python2.6/site-packages/iotop
byte-compiling /usr/local/lib/python2.6/site-packages/iotop/version.py to version.pyc
byte-compiling /usr/local/lib/python2.6/site-packages/iotop/netlink.py to netlink.pyc
byte-compiling /usr/local/lib/python2.6/site-packages/iotop/__init__.py to __init__.pyc
byte-compiling /usr/local/lib/python2.6/site-packages/iotop/ui.py to ui.pyc
byte-compiling /usr/local/lib/python2.6/site-packages/iotop/data.py to data.pyc
byte-compiling /usr/local/lib/python2.6/site-packages/iotop/genetlink.py to genetlink.pyc
byte-compiling /usr/local/lib/python2.6/site-packages/iotop/vmstat.py to vmstat.pyc
byte-compiling /usr/local/lib/python2.6/site-packages/iotop/ioprio.py to ioprio.pyc
running install_scripts
copying build/scripts-2.6/iotop -> /usr/local/bin
changing mode of /usr/local/bin/iotop to 755
running install_data
copying iotop.8 -> /usr/local/share/man/man8
running install_egg_info
Writing /usr/local/lib/python2.6/site-packages/iotop-0.5-py2.6.egg-info

————————————————————–

[root@oel58 bin]# ./iotop --only
Total DISK READ :       0.00 B/s | Total DISK WRITE :     255.17 M/s
Actual DISK READ:       0.00 B/s | Actual DISK WRITE:     256.64 M/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                                                                 
16389 be/4 root        0.00 B/s   48.40 M/s  0.00 % 11.61 % dd if /dev/zero of /tmp/temp.txt bs 1M count 10000
16418 be/4 root        0.00 B/s   58.64 M/s  0.00 % 11.55 % dd if /dev/zero of /tmp/temp.txt bs 1M count 10000
16359 be/4 root        0.00 B/s   73.66 M/s  0.00 %  3.48 % dd if /dev/zero of /tmp/temp.txt bs 1M count 10000
 6756 be/4 root        0.00 B/s    0.00 B/s  0.00 %  5.97 % [flush-8:0]
16418 be/4 root        0.00 B/s   21.15 M/s  0.00 %  4.52 % dd if /dev/zero of /tmp/temp.txt bs 1M count 10000
16447 be/4 root        0.00 B/s   23.91 M/s  0.00 %  3.02 % dd if /dev/zero of /tmp/temp.txt bs 1M count 10000
  109 be/4 root        3.54 K/s 1030.75 K/s  0.00 %  2.36 % [kjournald]
16359 be/4 root        0.00 B/s   22.91 M/s  0.00 %  1.69 % dd if /dev/zero of /tmp/temp.txt bs 1M count 10000

The iotop documentation states its requirements as: Python ≥ 2.7 and a Linux kernel ≥ 2.6.20 with the CONFIG_TASK_DELAY_ACCT, CONFIG_TASKSTATS, CONFIG_TASK_IO_ACCOUNTING and CONFIG_VM_EVENT_COUNTERS options on.

In my testing, Python 2.6 or later is sufficient. Note also that on a kernel older than 2.6.20 the following problem appears:

Total DISK READ :	0.00 B/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:	0.00 B/s | Actual DISK WRITE:      16.26 K/s
TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND


CONFIG_TASK_DELAY_ACCT not enabled in kernel, cannot determine SWAPIN and IO %   
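
The kernel options listed above can be checked before installing iotop. The following is a minimal sketch, assuming the kernel build configuration is available as text (often /boot/config-<release>, but that location is an assumption and varies by distribution). The function name missing_options and the sample config are hypothetical.

```python
# Sketch: report which kernel options needed by iotop are not enabled in a
# kernel config file. The caller supplies the config text; where that file
# lives (e.g. /boot/config-<release>) is an assumption, not checked here.
REQUIRED_OPTIONS = (
    "CONFIG_TASK_DELAY_ACCT",
    "CONFIG_TASKSTATS",
    "CONFIG_TASK_IO_ACCOUNTING",
    "CONFIG_VM_EVENT_COUNTERS",
)


def missing_options(config_text, required=REQUIRED_OPTIONS):
    """Return the required options that are not set to 'y' in config_text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_X is not set".
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value == "y":
            enabled.add(name)
    return [opt for opt in required if opt not in enabled]


if __name__ == "__main__":
    # Made-up sample resembling the failing kernel shown above.
    sample = (
        "CONFIG_TASKSTATS=y\n"
        "# CONFIG_TASK_DELAY_ACCT is not set\n"
        "CONFIG_TASK_IO_ACCOUNTING=y\n"
        "CONFIG_VM_EVENT_COUNTERS=y\n"
    )
    print(missing_options(sample))
```

An empty list means all four options are built in; any name returned corresponds to a feature iotop cannot use on that kernel.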

Running strace on iotop shows the main Python modules it loads:

open("/usr/local/lib/python2.6/lib-dynload/_socket.so", O_RDONLY) = 9
open("/usr/local/lib/python2.6/lib-dynload/_ssl.so", O_RDONLY|O_LARGEFILE) = 8
open("/usr/local/lib/python2.6/site-packages/iotop/vmstat.py", O_RDONLY|O_LARGEFILE) = 5
open("/usr/local/lib/python2.6/site-packages/iotop/vmstat.pyc", O_RDONLY|O_LARGEFILE) = 6
open("/usr/local/lib/python2.6/lib-dynload/cStringIO.so", O_RDONLY) = 7
open("/usr/local/lib/python2.6/lib-dynload/cStringIO.so", O_RDONLY|O_LARGEFILE) = 6

.....

iotop depends too heavily on Python and is too strict about versions, which reduces its generality and portability.

linux tools part 3– Monitor process status

February 28, 2013 linux, system No comments

pidstat is a very good program for monitoring the status of individual Linux processes (PIDs).

The pidstat command is used for monitoring individual tasks currently being managed by the Linux kernel.
It writes to standard output activities for every task selected with option -p or for every task managed by the Linux kernel if option -p ALL has been used. Not selecting any tasks is equivalent to specifying -p ALL but only active tasks (tasks with non-zero statistics values) will appear in the report.

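See the pidstat man page for detailed usage. The source is distributed as part of the sysstat package, built here from sysstat-10.0.5: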

[root@db56 tmp]# tar -zxvf sysstat-10.0.5.tar.gz 
[root@db56 tmp]# cd sysstat-10.0.5
 [root@db56 sysstat-10.0.5]# ./configure 
.
Check programs:
.
checking for gcc... gcc
checking for C compiler default output file name... a.out
checking whether the C compiler works... yes
.....

config.status: creating contrib/isag/isag
config.status: creating Makefile

   Sysstat version:		10.0.5
   Installation prefix:		/usr/local
   rc directory:		/etc/rc.d
   Init directory:		/etc/rc.d/init.d
   Configuration directory:	/etc/sysconfig
   Man pages directory:		${datarootdir}/man
   Compiler:			gcc
   Compiler flags:		-g -O2

[root@db56 sysstat-10.0.5]# 
[root@db56 sysstat-10.0.5]# 
[root@db56 sysstat-10.0.5]# make -f  Makefile

————————————————-

eg:

[root@db56 sysstat-10.0.5]# ./pidstat -p 12990  2 5 
Linux 2.6.18-194.el5 (db56) 	02/28/2013 	_x86_64_	(12 CPU)

03:07:46 PM       PID    %usr %system  %guest    %CPU   CPU  Command
03:07:48 PM     12990    0.00    0.00    0.00    0.00     7  oracle
03:07:50 PM     12990    0.00    0.00    0.00    0.00     7  oracle
03:07:52 PM     12990    0.00    0.00    0.00    0.00     7  oracle
03:07:54 PM     12990    0.00    0.00    0.00    0.00     7  oracle
03:07:56 PM     12990    0.00    0.00    0.00    0.00     7  oracle
Average:        12990    0.00    0.00    0.00    0.00     -  oracle


[root@db56 sysstat-10.0.5]# pidstat -r -t -p 12990 1 2
Linux 2.6.18-194.el5 (db56) 	02/28/2013 	_x86_64_	(12 CPU)

04:06:28 PM      TGID       TID  minflt/s  majflt/s     VSZ    RSS   %MEM  Command
04:06:29 PM     12990         -      0.00      0.00 12733740  18184   0.06  oracle
04:06:29 PM         -     12990      0.00      0.00 12733740  18184   0.06  |__oracle

04:06:29 PM      TGID       TID  minflt/s  majflt/s     VSZ    RSS   %MEM  Command
04:06:30 PM     12990         -      0.00      0.00 12733740  18184   0.06  oracle
04:06:30 PM         -     12990      0.00      0.00 12733740  18184   0.06  |__oracle

Average:         TGID       TID  minflt/s  majflt/s     VSZ    RSS   %MEM  Command
Average:        12990         -      0.00      0.00 12733740  18184   0.06  oracle
Average:            -     12990      0.00      0.00 12733740  18184   0.06  |__oracle


strace pidstat -p 12990 :

open("/proc/uptime", O_RDONLY)          = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b3fb25f9000
read(3, "45884933.69 45647910.77\n", 4096) = 24
close(3)                                = 0
munmap(0x2b3fb25f9000, 4096)            = 0
open("/proc/stat", O_RDONLY)            = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b3fb25f9000
read(3, "cpu  340880517 704021 59278877 5"..., 4096) = 1513
read(3, "", 4096)                       = 0
close(3)                                = 0
munmap(0x2b3fb25f9000, 4096)            = 0
open("/proc/12990/stat", O_RDONLY)      = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b3fb25f9000
read(3, "12990 (oracle) S 1 12990 12990 0"..., 4096) = 231
read(3, "", 4096)                       = 0
close(3)                                = 0
munmap(0x2b3fb25f9000, 4096)            = 0
open("/proc/12990/status", O_RDONLY)    = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b3fb25f9000
read(3, "Name:\toracle\nState:\tS (sleeping)"..., 4096) = 779
read(3, "", 4096)                       = 0
close(3) 

pidstat gets its information mainly from /proc/$pid/stat:

 /proc/[pid]/stat
              Status information about the process.  This is used by ps(1).  It is
              defined in /usr/src/linux/fs/proc/array.c.

              The fields, in order, with their proper scanf(3) format specifiers,
              are:

              pid %d      The process ID.

              comm %s     The filename of the executable, in parentheses.  This is
                          visible whether or not the executable is swapped out.

              state %c    One character from the string "RSDZTW" where R is running,
                          S is sleeping in an interruptible wait, D is waiting in
                          uninterruptible disk sleep, Z is zombie, T is traced or
                          stopped (on a signal), and W is paging.

              ppid %d     The PID of the parent.

              pgrp %d     The process group ID of the process.

              session %d  The session ID of the process.

              tty_nr %d   The controlling terminal of the process.  (The minor device
                          number is contained in the combination of bits 31 to 20 and
                          7 to 0; the major device number is in bits 15 to 8.)

              tpgid %d    The ID of the foreground process group of the controlling
                          terminal of the process.

              flags %u (%lu before Linux 2.6.22)
                          The kernel flags word of the process.  For bit meanings,
                          see the PF_* defines in <linux/sched.h>.  Details depend on
                          the kernel version.

              minflt %lu  The number of minor faults the process has made which have
                          not required loading a memory page from disk.

              cminflt %lu The number of minor faults that the process's waited-for
                          children have made.

              majflt %lu  The number of major faults the process has made which have
                          required loading a memory page from disk.

              cmajflt %lu The number of major faults that the process's waited-for
                          children have made.

              utime %lu   Amount of time that this process has been scheduled in user
                          mode, measured in clock ticks (divide by
                          sysconf(_SC_CLK_TCK)).  This includes guest time, guest_time
                          (time spent running a virtual CPU, see below), so that
                          applications that are not aware of the guest time field do
                          not lose that time from their calculations.

              stime %lu   Amount of time that this process has been scheduled in
                          kernel mode, measured in clock ticks (divide by
                          sysconf(_SC_CLK_TCK)).

              cutime %ld  Amount of time that this process's waited-for children have
                          been scheduled in user mode, measured in clock ticks
                          (divide by sysconf(_SC_CLK_TCK)).  (See also times(2).)
                          This includes guest time, cguest_time (time spent running a
                          virtual CPU, see below).

              cstime %ld  Amount of time that this process's waited-for children have
                          been scheduled in kernel mode, measured in clock ticks
                          (divide by sysconf(_SC_CLK_TCK)).

              priority %ld
                          (Explanation for Linux 2.6) For processes running a real-
                          time scheduling policy (policy below; see
                          sched_setscheduler(2)), this is the negated scheduling
                          priority, minus one; that is, a number in the range -2 to
                          -100, corresponding to real-time priorities 1 to 99.  For
                          processes running under a non-real-time scheduling policy,
                          this is the raw nice value (setpriority(2)) as represented
                          in the kernel.  The kernel stores nice values as numbers in
                          the range 0 (high) to 39 (low), corresponding to the user-
                          visible nice range of -20 to 19.

                          Before Linux 2.6, this was a scaled value based on the
                          scheduler weighting given to this process.

              nice %ld    The nice value (see setpriority(2)), a value in the range
                          19 (low priority) to -20 (high priority).

              num_threads %ld
                          Number of threads in this process (since Linux 2.6).
                          Before kernel 2.6, this field was hard coded to 0 as a
                          placeholder for an earlier removed field.

              itrealvalue %ld
                          The time in jiffies before the next SIGALRM is sent to the
                          process due to an interval timer.  Since kernel 2.6.17,
                          this field is no longer maintained, and is hard coded as 0.

              starttime %llu (was %lu before Linux 2.6)
                          The time in jiffies the process started after system boot.

              vsize %lu   Virtual memory size in bytes.

              rss %ld     Resident Set Size: number of pages the process has in real
                          memory.  This is just the pages which count toward text,
                          data, or stack space.  This does not include pages which
                          have not been demand-loaded in, or which are swapped out.

              rsslim %lu  Current soft limit in bytes on the rss of the process; see
                          the description of RLIMIT_RSS in getrlimit(2).

              startcode %lu
                          The address above which program text can run.

              endcode %lu The address below which program text can run.

              startstack %lu
                          The address of the start (i.e., bottom) of the stack.

              kstkesp %lu The current value of ESP (stack pointer), as found in the
                          kernel stack page for the process.

              kstkeip %lu The current EIP (instruction pointer).

              signal %lu  The bitmap of pending signals, displayed as a decimal
                          number.  Obsolete, because it does not provide information
                          on real-time signals; use /proc/[pid]/status instead.

              blocked %lu The bitmap of blocked signals, displayed as a decimal
                          number.  Obsolete, because it does not provide information
                          on real-time signals; use /proc/[pid]/status instead.

              sigignore %lu
                          The bitmap of ignored signals, displayed as a decimal
                          number.  Obsolete, because it does not provide information
                          on real-time signals; use /proc/[pid]/status instead.

              sigcatch %lu
                          The bitmap of caught signals, displayed as a decimal
                          number.  Obsolete, because it does not provide information
                          on real-time signals; use /proc/[pid]/status instead.

              wchan %lu   This is the "channel" in which the process is waiting.  It
                          is the address of a system call, and can be looked up in a
                          namelist if you need a textual name.  (If you have an up-
                          to-date /etc/psdatabase, then try ps -l to see the WCHAN
                          field in action.)

              nswap %lu   Number of pages swapped (not maintained).

              cnswap %lu  Cumulative nswap for child processes (not maintained).

              exit_signal %d (since Linux 2.1.22)
                          Signal to be sent to parent when we die.

              processor %d (since Linux 2.2.8)
                          CPU number last executed on.

              rt_priority %u (since Linux 2.5.19; was %lu before Linux 2.6.22)
                          Real-time scheduling priority, a number in the range 1 to
                          99 for processes scheduled under a real-time policy, or 0,
                          for non-real-time processes (see sched_setscheduler(2)).

              policy %u (since Linux 2.5.19; was %lu before Linux 2.6.22)
                          Scheduling policy (see sched_setscheduler(2)).  Decode
                          using the SCHED_* constants in linux/sched.h.

              delayacct_blkio_ticks %llu (since Linux 2.6.18)
                          Aggregated block I/O delays, measured in clock ticks
                          (centiseconds).

              guest_time %lu (since Linux 2.6.24)
                          Guest time of the process (time spent running a virtual CPU
                          for a guest operating system), measured in clock ticks
                          (divide by sysconf(_SC_CLK_TCK)).

              cguest_time %ld (since Linux 2.6.24)
                          Guest time of the process's children, measured in clock
                          ticks (divide by sysconf(_SC_CLK_TCK)).
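
The field list above can be turned into a small parser. This is a sketch rather than what pidstat itself does: the names parse_stat and cpu_seconds are hypothetical, the sample line is made up (only its "12990 (oracle) S 1 12990 12990 0" prefix comes from the strace output), and 100 ticks per second is an assumed default that should be replaced by the real sysconf(_SC_CLK_TCK) value on the target system.

```python
def parse_stat(text):
    """Parse one /proc/[pid]/stat line into a few named fields."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the first '(' and the last ')'.
    lparen = text.index("(")
    rparen = text.rindex(")")
    rest = text[rparen + 1:].split()
    # rest[0] is field 3 (state), so field N of the man page is rest[N - 3].
    return {
        "pid": int(text[:lparen]),
        "comm": text[lparen + 1:rparen],
        "state": rest[0],
        "ppid": int(rest[4 - 3]),
        "utime": int(rest[14 - 3]),
        "stime": int(rest[15 - 3]),
        "num_threads": int(rest[20 - 3]),
    }


def cpu_seconds(fields, ticks_per_second=100):
    """User + kernel CPU time in seconds.

    ticks_per_second=100 is an assumption; on a real system use
    os.sysconf("SC_CLK_TCK") instead.
    """
    return (fields["utime"] + fields["stime"]) / float(ticks_per_second)


if __name__ == "__main__":
    # Made-up sample line; real lines have many more fields.
    sample = ("12990 (oracle) S 1 12990 12990 0 -1 4202496 "
              "1500 0 3 0 250 120 0 0 15 0 1 0 4588")
    fields = parse_stat(sample)
    print(fields["comm"], fields["state"], cpu_seconds(fields))
```

Splitting on the last ')' matters because a process can rename itself to something containing spaces or parentheses, which would break a naive whitespace split.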

linux tools part 2– network Monitoring

February 27, 2013 linux, system No comments

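I had planned to write an article about nicstat, but 霸爷 (Ba Ye) has already written an excellent one; see his post "nicstat 网络流量统计利器" (nicstat, a great tool for network traffic statistics).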

The point to add here is that nicstat is a full replacement for tools such as iptraf and nload.

It can likewise replace polling an interface with watch on the command line, for example:

Every 1.0s: /sbin/ifconfig eth0 | grep bytes                                                                                                                           Wed Feb 27 22:20:58 2013

          RX bytes:1851917520138 (1.6 TiB)  TX bytes:41460941958 (38.6 GiB)

--

Every 1.0s: /sbin/ifconfig eth0 | grep bytes                                                                                                                           Wed Feb 27 22:21:27 2013

          RX bytes:1851921542828 (1.6 TiB)  TX bytes:41461009539 (38.6 GiB)
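
With the watch approach, the rate has to be worked out by hand from two samples. The sketch below does that arithmetic for the two ifconfig samples above (taken 29 seconds apart); the function name is hypothetical, and it assumes the counters did not wrap or reset between samples.

```python
from datetime import datetime


def rate_bytes_per_sec(bytes_before, bytes_after, t_before, t_after):
    """Average byte rate between two samples of a monotonically growing counter."""
    elapsed = (t_after - t_before).total_seconds()
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (bytes_after - bytes_before) / elapsed


if __name__ == "__main__":
    # Timestamps and counters copied from the two watch screens above.
    t0 = datetime(2013, 2, 27, 22, 20, 58)
    t1 = datetime(2013, 2, 27, 22, 21, 27)
    rx = rate_bytes_per_sec(1851917520138, 1851921542828, t0, t1)
    tx = rate_bytes_per_sec(41460941958, 41461009539, t0, t1)
    print("RX %.1f KB/s, TX %.1f KB/s" % (rx / 1024, tx / 1024))
```

nicstat performs this kind of calculation for every interface at each interval, which is why it makes the manual approach unnecessary.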
[root@db-83 nicstat-1.92]# nicstat -help
USAGE: nicstat [-hvnsxpztual] [-i int[,int...]]
   [-S int:mbps[,int:mbps...]] [interval [count]]

         -h                 # help
         -v                 # show version (1.92)
         -i interface       # track interface only
         -n                 # show non-local interfaces only (exclude lo0)
         -s                 # summary output
         -x                 # extended output
         -p                 # parseable output
         -z                 # skip zero value lines
         -t                 # show TCP statistics
         -u                 # show UDP statistics
         -a                 # equivalent to "-x -u -t"
         -l                 # list interface(s)
         -M                 # output in Mbits/sec
         -S int:mbps[fd|hd] # tell nicstat the interface
                            # speed (Mbits/sec) and duplex
    eg,
       nicstat              # print summary since boot only
       nicstat 1            # print every 1 second
       nicstat 1 5          # print 5 times only
       nicstat -z 1         # print every 1 second, skip zero lines
       nicstat -i hme0 1    # print hme0 only every 1 second
[root@db-83 nicstat-1.92]# nicstat -a
22:14:24    InKB   OutKB   InSeg  OutSeg Reset  AttF %ReTX InConn OutCon Drops
TCP         0.00    0.00   125.7   57.71  0.01  0.01 0.000   0.15   0.01  0.00
22:14:24                    InDG   OutDG     InErr  OutErr
UDP                         0.00    0.00      0.00    0.00
22:14:24      RdKB    WrKB   RdPkt   WrPkt   IErr  OErr  Coll  NoCP Defer  %Util
lo            0.00    0.00    0.01    0.01   0.00  0.00  0.00  0.00  0.00   0.00
eth0         185.3    4.15   127.4   57.70   0.00  0.00  0.00  0.00  0.00   0.16
[root@db-83 nicstat-1.92]# nicstat -i eth0  5 4 
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
22:14:31     eth0   185.3    4.15   127.4   57.70  1488.9   73.62  0.16   0.00
22:14:36     eth0    0.32    0.29    3.00    2.40   109.5   125.3  0.00   0.00
22:14:41     eth0    0.23    0.05    3.60    0.40   66.67   136.0  0.00   0.00
22:14:46     eth0   494.6    5.77   337.8   78.99  1499.6   74.83  0.41   0.00
[root@db-83 nicstat-1.92]# nicstat  -v
nicstat version 1.92
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
22:15:50       lo    0.00    0.00    0.01    0.01   50.00   50.00  0.00   0.00
22:15:50     eth0   185.3    4.15   127.4   57.70  1488.9   73.62  0.16   0.00

NOTES:

– Some unusual network cards may not provide all the details to Kstat (or may provide different symbols). Check for newer versions of this program, and the @Network array in the code below.
– Utilisation is based on bytes transferred divided by speed of the interface (if the speed is known). It should be impossible to reach 100% as there are overheads due to bus negotiation and timing.
– Loopback interfaces may only provide packet counts (if anything), and so bytes and %util will always be zero. Newer versions of Solaris (newer than Solaris 10 6/06) may provide loopback byte stats.
– Saturation is determined by counting read and write errors caused by the interface running at saturation. This approach is not ideal, and the value reported is often lower than it should be (eg, 0.0). Reading the rKB/s and wKB/s fields may be more useful.
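
The utilisation note above (bytes transferred divided by interface speed) can be sketched as follows. This is an illustration of the stated formula, not nicstat's actual code: it assumes a 1000 Mbit/s link, treats 1 KB as 1024 bytes, and sums both directions, whereas nicstat may treat full-duplex links differently. The function name is hypothetical.

```python
def utilisation_percent(read_kb_s, write_kb_s, speed_mbps):
    """Link utilisation as a percentage of nominal interface speed.

    Assumptions: 1 KB = 1024 bytes, both directions are summed, and
    speed_mbps is the nominal link speed in megabits per second.
    """
    if speed_mbps <= 0:
        raise ValueError("interface speed must be positive")
    bits_per_sec = (read_kb_s + write_kb_s) * 1024 * 8
    return 100.0 * bits_per_sec / (speed_mbps * 1e6)


if __name__ == "__main__":
    # rKB/s and wKB/s for eth0 from the first nicstat sample above;
    # the 1000 Mbit/s link speed is an assumption.
    print("%.2f%%" % utilisation_percent(185.3, 4.15, 1000))
```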

linux tools part 1– Linux Kernel Performance

February 27, 2013 linux, system No comments

Perf is a profiler tool for Linux 2.6+ based systems that abstracts away CPU hardware differences in Linux performance measurements and presents a simple command-line interface. Perf is based on the perf_events interface exported by recent versions of the Linux kernel. This article demonstrates the perf tool through example runs.

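It was originally called Performance Counter and first appeared in 2.6.31. Since then it has become one of the most active areas of kernel development. In 2.6.32 it was officially renamed Performance Event, because perf is no longer just an abstraction of the PMU but can handle all performance-related events.

With perf you can analyze hardware events that occur while a program runs, such as instructions retired and processor clock cycles. You can also analyze software events such as page faults and process switches.

This gives perf a wide range of performance analysis capabilities. For example, perf can compute the number of instructions executed per clock cycle, known as IPC; a low IPC indicates that the code is not making good use of the CPU. Perf can also sample a program at function level to find out where its performance bottlenecks are. Perf can also replace strace, add dynamic kernel probe points, and run benchmarks to evaluate the scheduler.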
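
The IPC figure mentioned above is simply instructions retired divided by clock cycles. A minimal sketch, with hypothetical counter values standing in for numbers that would come from a perf stat run:

```python
def instructions_per_cycle(instructions, cycles):
    """IPC: instructions retired divided by CPU clock cycles."""
    if cycles <= 0:
        raise ValueError("cycle count must be positive")
    return instructions / float(cycles)


if __name__ == "__main__":
    # Hypothetical counts; real values come from the hardware counters.
    print("IPC = %.2f" % instructions_per_cycle(1200000000, 2000000000))
```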

Subcommands:

perf stat: obtain event counts

perf record: record events for later reporting

perf report: break down events by process, function, etc.

perf annotate: annotate assembly or source code with event counts

perf top: see live event count

perf sched: tracing/measuring of scheduler actions and latencies

perf list: list available events

Related links:
linux-performance-analysis-and-tools

Perf — Linux下的系统性能调优工具介绍

http://en.wikipedia.org/wiki/Perf_(Linux)

perf and Oracle:

linux-perf-utility-with-el-6 by Hoogland p1

linux-perf-utility-with-el-6 by Hoogland p2

The leap second threat

June 26, 2012 linux No comments

An email from our company's SA team:


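Hi all,

In short, servers running a Linux kernel older than 2.6.18-164 with the NTP service enabled are at risk of rebooting on their own.
Most of our servers currently run the following kernel: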
$ uname -a
Linux xen21-vm04 2.6.18-128.1.10.el5.xs5.5.0.51xen #1 SMP Wed Nov 11 08:31:24 EST 2009 x86_64 x86_64 x86_64 GNU/Linux

On 10.1.0.99:
# chkconfig --list | grep ntp
ntpd 0:off 1:off 2:on 3:on 4:on 5:on 6:off

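A simple workaround:
1. Stop the NTP service before 23:00 on June 28.
2. Start the NTP service again after 08:00 on July 2.
3. In the meantime, synchronize the clock by running ntpdate from crontab.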
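
The condition in the email (kernel older than 2.6.18-164 and NTP enabled) can be checked mechanically. The sketch below compares the leading numeric part of a uname -r string against that threshold. It is a simplification: the function names are hypothetical, and a plain version comparison cannot tell whether a vendor has backported a fix into an older build.

```python
import re

# Threshold taken from the email above: kernels older than 2.6.18-164.
THRESHOLD = (2, 6, 18, 164)


def parse_release(release):
    """Extract (major, minor, patch, build) from a `uname -r` style string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())


def is_at_risk(release, ntp_enabled):
    """True if the kernel is older than the threshold and NTP is running."""
    return bool(ntp_enabled) and parse_release(release) < THRESHOLD


if __name__ == "__main__":
    # Release string copied from the uname -a output above.
    print(is_at_risk("2.6.18-128.1.10.el5.xs5.5.0.51xen", ntp_enabled=True))
```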

Related references:
https://bugzilla.redhat.com/show_bug.cgi?id=479765
http://blog.toracat.org/2012/06/leap-seconds-who-cares/
http://zh.wikipedia.org/wiki/%E9%97%B0%E7%A7%92

LVS introduction

March 20, 2012 Architect, linux, software No comments

What is virtual server?
Virtual server is a highly scalable and highly available server built on a cluster of real servers. The architecture of server cluster is fully transparent to end users, and the users interact with the cluster system as if it were only a single high-performance virtual server. Please consider the following figure.

The real servers and the load balancers may be interconnected by either a high-speed LAN or a geographically dispersed WAN. The load balancers dispatch requests to the different servers and make the parallel services of the cluster appear as a virtual service on a single IP address; request dispatching can use IP load balancing technologies or application-level load balancing technologies. Scalability of the system is achieved by transparently adding or removing nodes in the cluster. High availability is provided by detecting node or daemon failures and reconfiguring the system appropriately.

HugePages on Linux

January 16, 2012 linux, system 5 comments

Regular Pages and HugePages

This section aims to give a general picture about memory access in virtual memory systems and how pages are referenced.
When a single process works with a piece of memory, the pages that the process uses are referenced in a local page table for that specific process. The entries in this table also contain references to the system-wide page table, which holds the references to actual physical memory addresses. So, in theory, a user-mode process (e.g. an Oracle process) follows its local page table to reach the system page table, and from there can reference the actual physical memory through virtual addresses. As you can see below, it is also possible (and very common for Oracle RDBMS because of SGA use) that two different O/S processes point to the same entry in the system-wide page table.

When HugePages are in play, the usual page tables are still employed. The basic difference is that the entries in both the process page table and the system page table have attributes about huge pages, so any page in a page table can be a huge page or a regular page. The following diagram illustrates 4096K hugepages, but the diagram would be the same for any huge page size.

Some HugePages Facts/Features

HugePages can be allocated on-the-fly but they must be reserved during system startup. Otherwise the allocation might fail as the memory is already paged in 4K mostly.
HugePage sizes vary from 2MB to 256MB based on kernel version and HW architecture (See related section below.)
HugePages are not subject to reservation / release after the system startup unless there is system administrator intervention, basically changing the hugepages configuration (i.e. number of pages available or pool size)

HugePages and Oracle 11g Automatic Memory Management (AMM)

The AMM and HugePages are not compatible. One needs to disable AMM on 11g to be able to use HugePages. See hugepage in 11g for further information.

Configuring HugePages

[oracle@db-36 ~]$ cat /etc/sysctl.conf |grep nr_hugepages
vm.nr_hugepages=33792

vm.nr_hugepages should be at least SGA / 2 MB. For example, with a 64 GB SGA, vm.nr_hugepages >= 32768.
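
The sizing rule above is a ceiling division. The sketch below computes the minimum page count for a given SGA and, for a chosen page count, the amount of memory in KB that the pool covers. Function names are hypothetical and a 2 MB huge page size is assumed. For the 33792 pages configured in this post the pool is 69206016 KB, which is the memlock value used in limits.conf.

```python
def min_hugepages(sga_bytes, hugepage_bytes=2 * 1024 * 1024):
    """Smallest number of huge pages that covers sga_bytes (ceiling division)."""
    return -(-sga_bytes // hugepage_bytes)


def pool_size_kb(nr_hugepages, hugepage_kb=2048):
    """Total size of the huge page pool in KB (the unit memlock uses)."""
    return nr_hugepages * hugepage_kb


if __name__ == "__main__":
    sga = 64 * 1024 ** 3  # 64 GB SGA, as in the example above
    print(min_hugepages(sga))
    print(pool_size_kb(33792))
```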

Set limits.conf:

cat /etc/security/limits.conf

oracle soft nofile 131072
oracle hard nofile 131072
oracle soft nproc 131072
oracle hard nproc 131072
oracle soft core unlimited
oracle hard core unlimited
oracle soft memlock 69206016 -> value in KB; must be larger than the SGA
oracle hard memlock 69206016 -> value in KB; must be larger than the SGA


[oracle@db-36 ~]$ more /proc/meminfo |grep -i HugePage
HugePages_Total: 33792
HugePages_Free: 998
HugePages_Rsvd: 38
Hugepagesize: 2048 kB

This shows that HugePages are in use (HugePages_Free is far below HugePages_Total).
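
The same check can be scripted. This sketch parses the HugePages lines of /proc/meminfo text and reports how many pages are in use, computed as total minus free. The function name is hypothetical and the sample text is the output shown above.

```python
def hugepage_usage(meminfo_text):
    """Summarize the HugePages counters found in /proc/meminfo text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        key = key.strip()
        if not sep or not key.lower().startswith("hugepage"):
            continue
        # The first token is the number; "Hugepagesize" also has a "kB" unit.
        values[key] = int(rest.split()[0])
    total = values["HugePages_Total"]
    free = values["HugePages_Free"]
    return {
        "total": total,
        "free": free,
        "in_use": total - free,
        "reserved": values.get("HugePages_Rsvd", 0),
        "page_kb": values.get("Hugepagesize"),
    }


if __name__ == "__main__":
    sample = (
        "HugePages_Total: 33792\n"
        "HugePages_Free: 998\n"
        "HugePages_Rsvd: 38\n"
        "Hugepagesize: 2048 kB\n"
    )
    print(hugepage_usage(sample))
```

An in_use count of zero after the database has started would suggest the instance fell back to regular pages.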

Script: computes the number of HugePages the system needs

#!/bin/bash
#
# hugepages_settings.sh
#
# Linux bash script to compute values for the
# recommended HugePages/HugeTLB configuration
#
# Note: This script does calculation for all shared memory
# segments available when the script is run, no matter it
# is an Oracle RDBMS shared memory segment or not.
#
# This script is provided by Doc ID 401749.1 from My Oracle Support
# http://support.oracle.com

# Welcome text
echo "
This script is provided by Doc ID 401749.1 from My Oracle Support
(http://support.oracle.com) where it is intended to compute values for
the recommended HugePages/HugeTLB configuration for the current shared
memory segments. Before proceeding with the execution please make sure
that:
* Oracle Database instance(s) are up and running
* Oracle Database 11g Automatic Memory Management (AMM) is not setup
(See Doc ID 749851.1)
* The shared memory segments can be listed by command:
# ipcs -m

Press Enter to proceed..."

read

# Check for the kernel version
KERN=`uname -r | awk -F. '{ printf("%d.%d\n",$1,$2); }'`

# Find out the HugePage size
HPG_SZ=`grep Hugepagesize /proc/meminfo | awk '{print $2}'`

# Initialize the counter
NUM_PG=0

# Cumulative number of pages required to handle the running shared memory segments
for SEG_BYTES in `ipcs -m | awk '{print $5}' | grep "[0-9][0-9]*"`
do
    MIN_PG=`echo "$SEG_BYTES/($HPG_SZ*1024)" | bc -q`
    if [ $MIN_PG -gt 0 ]; then
        NUM_PG=`echo "$NUM_PG+$MIN_PG+1" | bc -q`
    fi
done

RES_BYTES=`echo "$NUM_PG * $HPG_SZ * 1024" | bc -q`

# An SGA less than 100MB does not make sense
# Bail out if that is the case
if [ $RES_BYTES -lt 100000000 ]; then
    echo "***********"
    echo "** ERROR **"
    echo "***********"
    echo "Sorry! There are not enough total of shared memory segments allocated for
HugePages configuration. HugePages can only be used for shared memory segments
that you can list by command:

# ipcs -m

of a size that can match an Oracle Database SGA. Please make sure that:
* Oracle Database instance is up and running
* Oracle Database 11g Automatic Memory Management (AMM) is not configured"
    exit 1
fi

# Finish with results
case $KERN in
    '2.4') HUGETLB_POOL=`echo "$NUM_PG*$HPG_SZ/1024" | bc -q`;
           echo "Recommended setting: vm.hugetlb_pool = $HUGETLB_POOL" ;;
    '2.6') echo "Recommended setting: vm.nr_hugepages = $NUM_PG" ;;
    *) echo "Unrecognized kernel version $KERN. Exiting." ;;
esac

# End


example:

[oracle@db-36 ~]$ sh page.sh

This script is provided by Doc ID 401749.1 from My Oracle Support
(http://support.oracle.com) where it is intended to compute values for
the recommended HugePages/HugeTLB configuration for the current shared
memory segments. Before proceeding with the execution please make sure
that:
* Oracle Database instance(s) are up and running
* Oracle Database 11g Automatic Memory Management (AMM) is not setup
(See Doc ID 749851.1)
* The shared memory segments can be listed by command:
# ipcs -m

Press Enter to proceed…

Recommended setting: vm.nr_hugepages = 32835
[oracle@db-36 ~]$ cat /etc/sysctl.conf |grep vm.nr_hugepages
vm.nr_hugepages=33792

As we can see, the HugePages value we configured (33792) is reasonable: it is slightly above the recommended 32835.