NoSQL

MongoDB does not preallocate the journal log

June 19, 2014 mongodb, NoSQL

While installing MongoDB 2.6.2 I ran into a strange problem: the journal files on a replica-set node were not preallocated. Normally, when MongoDB starts with journaling enabled, it preallocates the journal files up front — 128MB each when smallfiles is enabled, otherwise 1GB each.

Here is the arbiter node's log:

2014-06-17T11:50:09.842+0800 [initandlisten] MongoDB starting : pid=4749 port=27017 dbpath=/data/mongodb/data 64-bit host=vm-3-57
2014-06-17T11:50:09.844+0800 [initandlisten] db version v2.6.2
2014-06-17T11:50:09.844+0800 [initandlisten] git version: 4d06e27876697d67348a397955b46dabb8443827
2014-06-17T11:50:09.844+0800 [initandlisten] build info: Linux build10.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49
2014-06-17T11:50:09.844+0800 [initandlisten] allocator: tcmalloc
2014-06-17T11:50:09.844+0800 [initandlisten] options: { config: "/data/mongodb/mongod.cnf", net: { http: { enabled: false }, maxIncomingConnections: 5000, port: 27017, unixDomainSocket: { pathPrefix: "/data/mongodb/data" } }, operationProfiling: { mode: "slowOp", slowOpThresholdMs: 500 }, processManagement: { fork: true, pidFilePath: "/data/mongodb/data/mongod.pid" }, replication: { replSet: "rs1" }, security: { authorization: "enabled", keyFile: "/data/mongodb/data/rs1.keyfile" }, storage: { dbPath: "/data/mongodb/data", directoryPerDB: true, journal: { enabled: true }, repairPath: "/data/mongodb/data", syncPeriodSecs: 10.0 }, systemLog: { destination: "file", path: "/data/mongodb/log/mongod_data.log", quiet: true } }
2014-06-17T11:50:09.863+0800 [initandlisten] journal dir=/data/mongodb/data/journal
2014-06-17T11:50:09.864+0800 [initandlisten] recover : no journal files present, no recovery needed
2014-06-17T11:50:10.147+0800 [initandlisten] preallocateIsFaster=true 3.52
2014-06-17T11:50:10.378+0800 [initandlisten] preallocateIsFaster=true 3.4
2014-06-17T11:50:11.662+0800 [initandlisten] preallocateIsFaster=true 2.9
2014-06-17T11:50:11.662+0800 [initandlisten] preallocating a journal file /data/mongodb/data/journal/prealloc.0
2014-06-17T11:50:14.009+0800 [initandlisten]        File Preallocator Progress: 629145600/1073741824    58%
2014-06-17T11:50:26.266+0800 [initandlisten] preallocating a journal file /data/mongodb/data/journal/prealloc.1
2014-06-17T11:50:29.009+0800 [initandlisten]        File Preallocator Progress: 723517440/1073741824    67%
2014-06-17T11:50:40.751+0800 [initandlisten] preallocating a journal file /data/mongodb/data/journal/prealloc.2
2014-06-17T11:50:43.020+0800 [initandlisten]        File Preallocator Progress: 597688320/1073741824    55%
2014-06-17T11:50:55.830+0800 [FileAllocator] allocating new datafile /data/mongodb/data/local/local.ns, filling with zeroes...

MongoDB preallocated three 1GB journal files, as expected.

Now look at the replica-set data node's log:

2014-06-17T14:31:31.095+0800 [initandlisten] MongoDB starting : pid=8630 port=27017 dbpath=/storage/sas/mongodb/data 64-bit host=db-mysql-common01a
2014-06-17T14:31:31.096+0800 [initandlisten] db version v2.6.2
2014-06-17T14:31:31.096+0800 [initandlisten] git version: 4d06e27876697d67348a397955b46dabb8443827
2014-06-17T14:31:31.096+0800 [initandlisten] build info: Linux build10.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49
2014-06-17T14:31:31.096+0800 [initandlisten] allocator: tcmalloc
2014-06-17T14:31:31.096+0800 [initandlisten] options: { config: "/storage/sas/mongodb/mongod.cnf", net: { http: { enabled: false }, maxIncomingConnections: 5000, port: 27017, unixDomainSocket: { pathPrefix: "/storage/sas/mongodb/data" } }, operationProfiling: { mode: "slowOp", slowOpThresholdMs: 500 }, processManagement: { fork: true, pidFilePath: "/storage/sas/mongodb/data/mongod.pid" }, replication: { replSet: "rs1" }, security: { authorization: "enabled", keyFile: "/storage/sas/mongodb/data/rs1.keyfile" }, storage: { dbPath: "/storage/sas/mongodb/data", directoryPerDB: true, journal: { enabled: true }, repairPath: "/storage/sas/mongodb/data", syncPeriodSecs: 10.0 }, systemLog: { destination: "file", path: "/storage/sas/mongodb/log/mongod_data.log", quiet: true } }
2014-06-17T14:31:31.101+0800 [initandlisten] journal dir=/storage/sas/mongodb/data/journal
2014-06-17T14:31:31.102+0800 [initandlisten] recover : no journal files present, no recovery needed
2014-06-17T14:31:31.130+0800 [FileAllocator] allocating new datafile /storage/sas/mongodb/data/local/local.ns, filling with zeroes...
2014-06-17T14:31:31.130+0800 [FileAllocator] creating directory /storage/sas/mongodb/data/local/_tmp
2014-06-17T14:31:31.132+0800 [FileAllocator] done allocating datafile /storage/sas/mongodb/data/local/local.ns, size: 16MB,  took 0 secs
2014-06-17T14:31:31.137+0800 [FileAllocator] allocating new datafile /storage/sas/mongodb/data/local/local.0, filling with zeroes...
2014-06-17T14:31:31.138+0800 [FileAllocator] done allocating datafile /storage/sas/mongodb/data/local/local.0, size: 64MB,  took 0 secs
2014-06-17T14:31:31.141+0800 [initandlisten] build index on: local.startup_log properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "local.startup_log" }

No journal files were preallocated — MongoDB went straight to creating data files. This looked odd. At first I suspected an ext4 issue, but after asking a friend I learned that MongoDB runs a check before deciding whether to preallocate journal files. The relevant source:

// @file dur_journal.cpp writing to the writeahead logging journal

bool _preallocateIsFaster() {
    bool faster = false;
    boost::filesystem::path p = getJournalDir() / "tempLatencyTest";
    if (boost::filesystem::exists(p)) {
        try {
            remove(p);
        }
        catch(const std::exception& e) {
            log() << "Unable to remove temporary file due to: " << e.what() << endl;
        }
    }
    try {
        AlignedBuilder b(8192);
        int millis[2];
        const int N = 50;
        for( int pass = 0; pass < 2; pass++ ) {
            LogFile f(p.string());
            Timer t;
            for( int i = 0 ; i < N; i++ ) {
                f.synchronousAppend(b.buf(), 8192);
            }
            millis[pass] = t.millis();
            // second time through, file exists and is prealloc case
        }
        int diff = millis[0] - millis[1];
        if( diff > 2 * N ) {
            // at least 2ms faster for prealloc case?
            faster = true;
            log() << "preallocateIsFaster=true " << diff / (1.0*N) << endl;
        }
    }
    catch (const std::exception& e) {
        log() << "info preallocateIsFaster couldn't run due to: " << e.what()
              << "; returning false" << endl;
    }
    if (boost::filesystem::exists(p)) {
        try {
            remove(p);
        }
        catch(const std::exception& e) {
            log() << "Unable to remove temporary file due to: " << e.what() << endl;
        }
    }
    return faster;
}

bool preallocateIsFaster() {
    Timer t;
    bool res = false;
    if( _preallocateIsFaster() && _preallocateIsFaster() ) {
        // maybe system is just super busy at the moment? sleep a second to let it calm down.
        // deciding to to prealloc is a medium big decision:
        sleepsecs(1);
        res = _preallocateIsFaster();
    }
    if( t.millis() > 3000 )
        log() << "preallocateIsFaster check took " << t.millis()/1000.0 << " secs" << endl;
    return res;
}
        
The key check:

    int diff = millis[0] - millis[1];
    if( diff > 2 * N ) {
        // at least 2ms faster for prealloc case?
        faster = true;
        log() << "preallocateIsFaster=true " << diff / (1.0*N) << endl;
    }

If diff > 2*N — that is, if preallocation saves more than 2ms per write on average — MongoDB decides preallocation is the better option and allocates the journal files up front. The arbiter's log shows exactly this; every measured saving was above 2ms:

2014-06-17T11:50:10.147+0800 [initandlisten] preallocateIsFaster=true 3.52
2014-06-17T11:50:10.378+0800 [initandlisten] preallocateIsFaster=true 3.4
2014-06-17T11:50:11.662+0800 [initandlisten] preallocateIsFaster=true 2.9
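The decision rule can be sketched in a few lines of Python (a simplification of the C++ above; the function name and the hard-coded timings are illustrative, not MongoDB's API):

```python
# Sketch of the preallocateIsFaster heuristic. The server times N
# synchronous 8KB appends twice: first into a fresh, growing file, then
# into the same file which now already exists at full size -- the
# "preallocated" case.
N = 50  # synchronous 8KB appends per pass

def preallocate_is_faster(millis_fresh, millis_prealloc, n=N):
    """Mirror of the `diff > 2 * N` check: preallocation wins only if it
    saves (strictly) more than 2ms per write on average."""
    diff = millis_fresh - millis_prealloc
    if diff > 2 * n:
        print("preallocateIsFaster=true %.2f" % (diff / float(n)))
        return True
    return False

# The arbiter's storage cleared the 2ms bar (3.52, 3.4, 2.9 ms/write
# saved), so it preallocated; on the data node the gap was smaller, so
# preallocation was skipped.
print(preallocate_is_faster(276, 100))  # 3.52 ms/write saved -> True
print(preallocate_is_faster(150, 100))  # 1.00 ms/write saved -> False
```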

Personally I find this design pointless. I doubt anyone minds the small amount of time spent preallocating at startup — and if the journal files instead get allocated later, during a traffic spike, that is an extra hit on I/O at the worst possible moment.

How to change oplog size — mongo

September 10, 2013 mongodb, NoSQL

There are two main approaches:

1. Resize the oplog on each member in turn (from primary to secondary).

2. Re-initialize a secondary with the custom oplog size, then fail over the original primary.

Here are the concrete steps for method 1; for details, see the MongoDB oplog docs.

1). Step the current primary down to secondary

rs1:PRIMARY> rs.stepDown();

2). Shut down MongoDB

rs1:SECONDARY> db.shutdownServer();

3). Comment out the replSet option and restart in standalone mode — on a different port
4). Find the last sync point

> use local
> db.oplog.rs.find( { }, { ts: 1, h: 1 } ).sort( {$natural : -1} ).limit(1).next();
{ "ts" : Timestamp(1378716098, 2), "h" : NumberLong("-654971153597320397") }

5). Drop the old oplog

> db.oplog.rs.drop();

6). Create the new oplog — 30GB here

> db.runCommand({create:"oplog.rs", capped:true, size:(30*1024*1024*1024)});

7). Write the last sync point back

> db.oplog.rs.save({ "ts" : Timestamp(1378716098, 2), "h" : NumberLong("-654971153597320397") });

8). Shut down MongoDB

> db.shutdownServer();

9). Re-enable the replSet option and restart in replica-set mode
10). Check that replication catches up
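Steps 4-7 are the easy part to get wrong, so here is the arithmetic and the command documents spelled out as a small sketch (plain Python dicts standing in for what you would type in the mongo shell; the shell's Timestamp/NumberLong types are shown as plain placeholders):

```python
# Step 6: the replacement oplog must be a capped collection, and `size`
# is given in bytes -- 30GB here.
new_oplog_bytes = 30 * 1024 * 1024 * 1024   # 32212254720 bytes
create_cmd = {"create": "oplog.rs", "capped": True, "size": new_oplog_bytes}

# Steps 4/7: the last sync point read *before* dropping the old oplog is
# saved back into the new one, so the member resumes replication from the
# same position. (Values copied from the example above.)
last_sync_point = {"ts": (1378716098, 2),        # Timestamp(1378716098, 2)
                   "h": -654971153597320397}     # NumberLong(...)

print(new_oplog_bytes)  # -> 32212254720
```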

MySQL key partition and MongoDB test

March 8, 2013 mongodb, MYSQL, NoSQL, performance

For a business requirement around activation codes, I compared MySQL against MongoDB. On the MySQL side there are two variants — a normal table and a KEY-partitioned one — tested at 100 million and 1 billion rows; all MySQL access goes directly through the primary key, which is also the partition key. The MySQL table is 90GB; the MongoDB collection is 157GB.

[liuyang@yhdem ~]$ cat /proc/cpuinfo  |grep processor |wc -l
24

[liuyang@yhdem ~]$ cat /etc/issue
Oracle Linux Server release 5.8
Kernel \r on an \m

MySQL env:

mysql> select version();
+-----------+
| version() |
+-----------+
| 5.5.25a   | 
+-----------+
1 row in set (0.00 sec)
      
      log_bin[OFF] innodb_flush_log_at_trx_commit [2]  query_cache_type[OFF]
      max_connect_errors[10] max_connections[214] max_user_connections[0] 
      sync_binlog[0] table_definition_cache[400] 
      table_open_cache[400] thread_cache_size[8]  open_files_limit[30000]
      innodb_adaptive_flushing[ON] innodb_adaptive_hash_index[ON] innodb_buffer_pool_size[30.234375G] 
      innodb_file_per_table[ON] innodb_flush_log_at_trx_commit[2] innodb_flush_method[] 
      innodb_io_capacity[200] innodb_lock_wait_timeout[100] innodb_log_buffer_size[128M] 
      innodb_log_file_size[200M] innodb_log_files_in_group[2] innodb_max_dirty_pages_pct[75] 
      innodb_open_files[1600] innodb_read_io_threads[4] innodb_thread_concurrency[0] 
      innodb_write_io_threads[4]

The charts below are all QPS measurements; TPS tests were not done.

no partition table with one billion rows –> small random select by pk

[QPS chart: mysql_test_1]

xDiskName Busy  Read WriteKB|0          |25         |50          |75	   100|                                                                        
xsda        1%    2.0   35.9|>                                                |                                                                      
xsda1       0%    0.0    0.0|>                                                |                                                                      
xsda2       0%    0.0    0.0|>                                                |                                                                      
xsda3       0%    0.0    0.0|>                                                |                                                                      
xsda4       0%    0.0    0.0|>disk busy not available                         |                                                                      
xsda5       0%    0.0    0.0|>                                                |                                                                      
xsda6       1%    2.0   35.9|>                                                |                                                                      
xsdb        0%    0.0   55.9|>                                                |                                                                      
xsdb1       0%    0.0   55.9|>                                                |                                                                      
xTotals Read-MB/s=0.0      Writes-MB/s=0.2      Transfers/sec=18.0 

partition table with one billion rows –> small random select by pk

[QPS chart: mysql_test_2]

xDiskName Busy  Read WriteKB|0          |25         |50          |75	   100|                                                                       
xsda        0%    0.0    8.0|>                                                |                                                                     
xsda1       0%    0.0    0.0|>                                                |                                                                     
xsda2       0%    0.0    8.0|>                                                |                                                                     
xsda3       0%    0.0    0.0|>                                                |                                                                     
xsda4       0%    0.0    0.0|>disk busy not available                         |                                                                     
xsda5       0%    0.0    0.0|>                                                |                                                                     
xsda6       0%    0.0    0.0|>                                                |                                                                     
xsdb        0%    0.0  201.5|                         >                       |                                                                     
xsdb1       0%    0.0  201.5|W                        >                       |                                                                     
xTotals Read-MB/s=0.0      Writes-MB/s=0.4      Transfers/sec=46.9             

no partition table with one billion rows –> full random select by pk

[QPS chart: mysql_test_3]

xDiskName Busy  Read WriteMB|0          |25         |50          |75	   100|                                                                        
xsda        0%    0.0    0.0| >                                               |                                                                      
xsda1       0%    0.0    0.0|>                                                |                                                                      
xsda2       0%    0.0    0.0|>                                                |                                                                      
xsda3       0%    0.0    0.0|>                                                |                                                                      
xsda4       0%    0.0    0.0|>disk busy not available                         |                                                                      
xsda5       0%    0.0    0.0|>                                                |                                                                      
xsda6       0%    0.0    0.0| >                                               |                                                                      
xsdb      100%   86.8    0.2|RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR>                                                                      
xsdb1     100%   86.8    0.2|RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR>                                                                      
xTotals Read-MB/s=173.6    Writes-MB/s=0.4      Transfers/sec=6448.1    

partition table with one billion rows –> full random select by pk

[QPS chart: mysql_test_4]

xDiskName Busy  Read WriteMB|0          |25         |50          |75	   100|                                                                        
xsda        0%    0.0    0.0| >                                               |                                                                      
xsda1       0%    0.0    0.0|>                                                |                                                                      
xsda2       0%    0.0    0.0| >                                               |                                                                      
xsda3       0%    0.0    0.0|>                                                |                                                                      
xsda4       0%    0.0    0.0|>disk busy not available                         |                                                                      
xsda5       0%    0.0    0.0|>                                                |                                                                      
xsda6       0%    0.0    0.0| >                                               |                                                                      
xsdb      100%   89.6    0.2|RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR>                                                                      
xsdb1     100%   89.6    0.2|                                                 >                                                                      
xTotals Read-MB/s=179.2    Writes-MB/s=0.3      Transfers/sec=6539.3        

no partition table with 100 million rows –> full random select by pk

[QPS chart: mysql_test_5]

Now the MongoDB test — the same 1-billion-row data set, 157GB:

[root@db-13 tmp]# mongo
MongoDB shell version: 2.0.8
connecting to: test
> db.foo.totalSize();
157875838416
> db.foo.find().count();
1000000000

——

Run 1: full 128GB of memory available, 16 threads, random queries across all 1 billion rows:

[root@db-13 tmp]# mongo test ./mongodb_benchmark_query.js 
MongoDB shell version: 2.0.8
connecting to: test
threads: 16      queries/sec: 126151.69666666667

Run 2: 128GB of memory, 24 threads, random queries across the first 100 million of the 1 billion rows:

[root@db-13 tmp]# mongo test ./mongodb_benchmark_query.js 
MongoDB shell version: 2.0.8
connecting to: test
threads: 24      queries/sec: 166527.42333333334

Run 3: mongod started as the mysql user, whose memory is limited to 24GB; 24 threads, random queries across the first 100 million rows:

[mysql@db-13 ~]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1052672
max locked memory       (kbytes, -l) 26055452
max memory size         (kbytes, -m) 26055452
open files                      (-n) 131072
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

[mysql@db-13 tmp]$ mongo test ./mongodb_benchmark_query.js 
MongoDB shell version: 2.0.8
connecting to: test
threads: 24	 queries/sec: 161358.03333333333

Run 4: mongod started as the mysql user with memory limited to 24GB; 24 threads, random queries across all 1 billion rows:

[mysql@db-13 tmp]$ mongo test ./mongodb_benchmark_query.js 
MongoDB shell version: 2.0.8
connecting to: test
threads: 24	 queries/sec: 2549.2 ----------------------> physical disk I/O kicked in here

— the query script:

ops = [ { op: "findOne", ns: "test.foo", query: { _id: { "#RAND_INT": [ 1, 100000000 ] } } } ];
x = 24;
res = benchRun( {
    parallel: x,
    seconds: 60,
    ops: ops
} );
print( "threads: " + x + "\t queries/sec: " + res.query );
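For context, benchRun's "#RAND_INT" template draws a fresh random _id for every findOne. A rough Python emulation of the workload shape (the toy in-memory "collection" and all names are illustrative; exact range semantics may differ by server version, treated as half-open here):

```python
import random

LOW, HIGH = 1, 100000000  # the "#RAND_INT" range from the ops document

def random_id():
    # One op's key: a random _id drawn from the configured range.
    return random.randrange(LOW, HIGH)

# Toy stand-in for test.foo, keyed by _id like the real collection.
collection = {i: {"_id": i, "code": "AC%09d" % i} for i in range(1, 1001)}

def find_one(_id):
    # Equivalent of the findOne op: a point lookup by primary key.
    return collection.get(_id)

hits = sum(1 for _ in range(10000) if find_one(random_id()) is not None)
# With only 1,000 of 100,000,000 possible keys present, nearly every
# lookup misses -- which mirrors how the full-range test touches pages
# far outside the cache, triggering the physical reads seen in Run 4.
print(hits)
```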

In memory, PK lookups against the 1-billion-row normal table show no degradation relative to the 100-million-row normal table. The 1-billion-row partitioned table, however, loses about two thirds of the normal table's in-memory throughput. Once the full table no longer fits in memory, the partitioned table performs somewhat better than the normal one. (Note also that an activation code is typically only ever accessed once.)

For MongoDB, this workload is easily handled: with enough memory, QPS reached 160K+/s — but once memory ran short, it collapsed to 2,549.

SQL to MongoDB, illustrated

April 8, 2012 mongodb, NoSQL

Sharing an illustrated guide mapping SQL concepts to MongoDB.