I read this thread a long time ago and didn't dig into it at the time. These past few days, while reading up on MySQL internals, the question came back to me; after discussing it with three or four friends, I'd like to walk through it properly.
The thread is a question about how Oracle avoids partial writes, the so-called "half I/O" problem: an I/O torn partway through (a momentary disconnect, a sudden crash, a storage or machine failure). Although this rarely happens, it does come up in unconventional recovery work, for example when a block becomes internally inconsistent, what we call a logically corrupt block.
First, let's look at what MySQL's doublewrite is:
-----quoted from the mysqlperformance blog
Its purpose is to guarantee data safety in the event of a partial page write, i.e. a failure that strikes while a data page is only half written. InnoDB does not log entire data pages; instead it uses a technique called "physiological" logging: a log record contains only the page number, the operation performed on the page (such as updating a row), and a log sequence number. The advantage of this log structure is that it reduces the amount of data written to the log, but it requires pages to be internally consistent. The page version does not matter: the page can be the current version (in which case InnoDB skips the update during crash recovery) or an earlier version (in which case InnoDB performs the update). But if the page is internally inconsistent, recovery cannot proceed.
For now let's set aside why MySQL without doublewrite is prone to torn writes. In MySQL the unit of a data write is the page (1 page = 16 KB), while in Oracle it is the block, whose size is configurable. Either way, writes reach the OS in units of OS blocks, which means that if a partial write occurs at the OS-block level, you get the same logical inconsistency.
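To make the torn-write scenario concrete, here is a toy Python sketch (sizes and names are illustrative, not InnoDB's actual I/O path): a 16 KB page reaches disk as four 4 KB OS-block writes, and a crash between them leaves a page that is half new, half old.

```python
PAGE_SIZE = 16 * 1024   # one InnoDB page
OS_BLOCK = 4 * 1024     # a typical OS/filesystem block

def write_page(disk, new_page, crash_after=None):
    """Write one page OS-block by OS-block; optionally 'crash' partway."""
    for i in range(PAGE_SIZE // OS_BLOCK):
        if crash_after is not None and i >= crash_after:
            return False                      # simulated power loss mid-write
        s = i * OS_BLOCK
        disk[s:s + OS_BLOCK] = new_page[s:s + OS_BLOCK]
    return True

disk = bytearray(b"A" * PAGE_SIZE)            # old page image on disk
ok = write_page(disk, b"B" * PAGE_SIZE, crash_after=2)
# disk now holds 8 KB of new data followed by 8 KB of old data: a fractured page
```

Nothing in the page itself was "wrong" at any step; the damage comes purely from the mismatch between the database's write unit and the OS's.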
It's an interesting question. I think the clue is in the link you provided: "Such logging structure is geat as it require less data to be written to the log, however it requires pages to be internally consistent."
What that's saying (I think!) is that the contents of the innodb transaction log can only be replayed to datafile pages which are 'clean' -and that's true for Oracle, too. You can't apply Oracle redo to an Oracle database block that is internally corrupted because some of its constituent "os pages" were written at a different time from others. When such partial writes happen, you get what's called a "fractured block", warnings in the alert log ...and the data file is regarded as corrupt from that point on.
Oracle's fix to this potential problem, however, is also hinted at in the article you linked to: "Innodb does not log full pages to the log files". That's an interesting sentence because, you see, Oracle does write full pages to the logs! I should immediately qualify that: it only does so when you take a "hot backup" using O/S copy commands -because it's only then that you have to worry about the problem. In other words, you only have to worry about the fact that you can only apply redo to an internally consistent database block if you're actually in the business of applying redo... and you're only doing that in the event of a recovery. And complete recoveries in Oracle (as opposed to mere crash recoveries) require you to have restored something from backup. So, it's only during the backup process that you have to worry about the problem of fractured blocks -and so it's only then that Oracle says, 'if you have put the tablespace into hot backup mode (alter tablespace X begin backup), then the first time a block of data is changed, the entire block should be written in a consistent state into the redo (transaction) logs. Then, if the datafile copy of the block in the backup turns out to be fractured, we've got a known good copy in the redo we can restore in its place. And once you have a clean block as a starting point, you can continue to apply redo from that point on'.
Oracle has an alternative (and more up to date) mechanism for achieving this "I know your data block is clean" starting state, though. It's called RMAN -the Oracle backup and recovery tool. Unlike your OS copy command, it's an Oracle utility... so it understands the concept of Oracle blocks, and it can therefore check that a block that's been copied has been copied consistently, with all its constituent OS 'pages' written coherently to disk in the same state. It knows how to compare input and output in a way no OS command could ever hope to do. So when RMAN copies a data block hot, it reads the copy, compares it with the original -and if it sees the copy is fractured, it just has another go copying the block again. Repeat until the copy is indeed verified as a good copy of the original. No need to write the block into the transaction log at all, because you know that the backup file itself contains the necessary clean block copy.
So, putting that into practice. Let's say your server corrupts data on the disk for whatever reason and, in the process, your Oracle instance dies. You try and restart Oracle, but you get told that recovery is needed (you might get a message that file 16, for example, can't be read). So you restore file 16 from your hot backup taken with OS commands. In that backup, one of the blocks is fractured, because only part of the Oracle block had hit disk at the point the backup was performed. So you restore a fractured block. But that's not a problem, because as redo is replayed, you'll find the clean copy of the block in the redo stream, and restore that over the top of the fractured block. The rest of the redo can then be replayed without a problem. Or, you restore file 16 using RMAN... and what it restores cannot be fractured, because it checks for that before it reports the original backup a success. Therefore, you restore a clean copy of file 16, and can apply redo to it without drama. Either way, you get your database recovered.
So, the article you linked to nails the important matter: "It does not matter which page version it is – it could be “current” version in which case Innodb will skip page upate operation or “former” in which case Innodb will perform update. If page is inconsistent recovery can’t proceed." Absolutely true of Oracle, too. But Oracle has two alternatives for ensuring that a clean version of the block is always available: write a whole block into redo if it's changed whilst the database is being backed up with OS commands, or make sure you only write clean blocks into the backup if you're using RMAN -and you achieve that by multiple reads of the block, as many as are necessary to ensure the output is clean.
Oracle's solutions in these regards are, I think, a lot more efficient than double-writing every block all the time, because the only time you have to worry that what's on disk isn't consistent is, as your linked article again points out, when 'power failure' or 'os crash' happens. That is, during some sort of failure. And the response to failure that involves corruption is always to restore something from backup... so, it's really only that backup that needs to worry about 'clean pages'. Instead of writing everything twice to disk during normal running (which sounds like a potentially enormous overhead to me!), therefore, Oracle only has to employ protective measures during the backup process itself (which should, ordinarily, be a mere fraction of 'normal running' time). The overhead is therefore only encountered sparingly and not something you need worry about as a potentially-constant performance problem.
In closing, I'll second Aman's observation that it is generally and usually the case that any variation away from the default 8K block size is a bad idea. Not always, and there may be justification for it in extremis... but you will certainly be at risk of encountering more and weirder bugs than if you stick to the defaults.
This reply, frankly, never gets to the heart of the question. It mainly covers the methods Oracle uses to avoid partial writes, and note that these apply to special scenarios -how to avoid a partial write during backup- not to the database's normal operating mechanism. Let's look at how MySQL avoids the problem in those same scenarios ---> (oracle hot backup/RMAN backup)
In a traditional Oracle hot backup, the backup's reads run in parallel with DBWR writing the files, so the backup may read a fractured block (an inconsistent block). The fix: while a file is being backed up, the dirty pages DBWR writes are also written into the redo log, to be used for repairing fractured blocks. An RMAN backup instead checks the consistency of every page it reads and rereads any inconsistent one. Percona's XtraBackup takes an approach similar to RMAN's.
How do you check whether a page read by the backup is consistent? The method is simple: an Oracle block and an InnoDB page both store the same SCN/checksum in the page header and the page trailer. If the header and trailer SCN/checksum match, the page is consistent; if not, the page is fractured (a fractured block), and the backup simply rereads the damaged page.
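The header/trailer check plus reread loop can be sketched roughly like this (a simplified layout using a CRC32 in place of the real SCN/checksum fields; `seal_page`, `is_consistent` and `backup_read` are illustrative names, not any real API):

```python
import zlib

PAGE_SIZE = 16 * 1024

def seal_page(body):
    """Write the same 4-byte checksum into the page header and trailer."""
    assert len(body) == PAGE_SIZE - 8
    csum = zlib.crc32(body).to_bytes(4, "big")
    return csum + body + csum          # header copy ... body ... trailer copy

def is_consistent(page):
    """A torn page almost certainly disagrees between header and trailer."""
    head, body, tail = page[:4], page[4:-4], page[-4:]
    return head == tail == zlib.crc32(body).to_bytes(4, "big")

def backup_read(read_page, max_retries=10):
    """RMAN/XtraBackup-style loop: reread the page until it verifies."""
    for _ in range(max_retries):
        page = read_page()
        if is_consistent(page):
            return page
    raise IOError("page still fractured after retries")
```

The retry is safe because DBWR (or the InnoDB page cleaner) will eventually finish its write; the backup just needs to catch the page between writes, not during one.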
So this section should be read as: how to avoid partial writes when taking a backup, since that is when fractured blocks or pages are most likely to occur. But what the original poster asked was whether Oracle has something like doublewrite to guard against ordinary, everyday partial writes.
Thanks HJR for detailed analysis.
But the double write mechanism works without restoring from a backup and applying redo: before writing the dirty buffer blocks, innodb flushes the blocks in the double write buffer to disk, which is a sequential disk area (so it's fast),
before writing the dirty buffer blocks, lgwr flushes the blocks in the redo buffer to disk, which is a sequential disk area (so it's fast),
so even if a partial write happens, the blocks in the double write buffer are already persistent on disk,
so even if a partial write happens, the blocks in the redo buffer are already persistent on disk,
and when mysql restarts, innodb can compare the blocks flushed from the double write buffer with the datafile blocks,
and when oracle restarts, smon can compare the control file SCN with the datafile blocks' SCN,
if they match (meaning no partial write), then everything is fine; if they don't match (a partial write happened), just overwrite the datafile blocks with the blocks from the double write buffer.
if they match (meaning no partial write), then everything is fine; if they don't match (a partial write happened), just apply the redo from the redo logs.
So no media recovery is required.
sounds like media recovery to me
Based on your analysis, oracle needs media recovery.
Based on your analysis, so does mysql. It just applies it in a very slightly different fashion, and calls it all something else.
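Leaving the naming dispute aside, the restart-time doublewrite check described in that exchange can be sketched as follows (illustrative structures only; `is_consistent` stands for whatever checksum test detects a torn page):

```python
def doublewrite_recover(datafile, dwbuf, is_consistent):
    """On restart: every page in the doublewrite area was fully written
    before the datafile write began, so any fractured datafile page can
    simply be replaced by its doublewrite copy before redo is applied."""
    for page_no, dw_page in dwbuf.items():
        if not is_consistent(datafile[page_no]):
            datafile[page_no] = dw_page   # restore the known-good copy
    return datafile

# toy usage: page 2 was torn mid-write; the doublewrite area has the intact copy
datafile = {1: "GOOD-1", 2: "TORN"}
dwbuf = {1: "GOOD-1", 2: "GOOD-2"}
doublewrite_recover(datafile, dwbuf, is_consistent=lambda p: p != "TORN")
# datafile[2] is now "GOOD-2"; normal redo apply can then proceed
```

After this pass every page is internally consistent again, which is exactly the precondition physiological redo needs, whatever label you put on the process.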
This reply again contains a mistake. It says "so even if partial write happen, the blocks in double write/redo buffer already persistent in disk". That statement is plainly misleading: when a partial write occurs, redo alone cannot recover the page.
Oracle recovery divides into instance/crash recovery and media recovery. The essential difference: instance recovery needs only the online redo logfiles, i.e. it applies the redo generated after the last incremental checkpoint. Media recovery divides further into datafile and block media
recovery; in essence it restores an older copy of a datafile or block and may need archived logs. Note the prerequisite: a backup must exist. The restore of a datafile/block can only come from a backup, and the subsequent recover applies archive logs. Media recovery here can therefore repair a fractured block, provided a backup exists.
MySQL cannot use redo to recover an internally inconsistent page: redo can only be applied on top of consistent data. If a failure strikes while data is being flushed to disk, say only 4 KB of a 16 KB page lands, and the redo entries are change vectors, i.e. logical records, then during InnoDB recovery the redo replay has no idea where on that page to start.
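The "current version / former version" logic from the quoted blog passage boils down to an LSN comparison, sketched here with toy structures (a simplification, not InnoDB's actual page format): each page carries the LSN of the last change applied to it, and replay applies only newer records, which presupposes that the stored LSN, and the page around it, are intact.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    lsn: int                      # LSN of the last change applied to this page
    rows: dict = field(default_factory=dict)

@dataclass
class RedoRec:                    # physiological record: page-local change + LSN
    lsn: int
    key: str
    value: str

def apply_redo(page, records):
    """Replay only records newer than the page LSN; safe only if the page
    (and hence its stored LSN) is internally consistent."""
    for rec in sorted(records, key=lambda r: r.lsn):
        if rec.lsn > page.lsn:    # "former" version: change missing, apply it
            page.rows[rec.key] = rec.value
            page.lsn = rec.lsn
        # else: "current" version -- the change is already on the page, skip
    return page

page = apply_redo(Page(lsn=100), [RedoRec(90, "a", "old"), RedoRec(110, "b", "new")])
# only the LSN-110 record is applied; page.lsn advances to 110
```

On a fractured page the stored LSN (and the row data around it) cannot be trusted, so this comparison, and therefore redo replay, has nothing sound to stand on, which is exactly why doublewrite (or a backup) must first supply a clean page image.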
The same holds for Oracle: for an internally inconsistent block, redo alone is not enough. We can detect the logically broken block in the dbf and repair it, but only if a backup exists. The method used here is block recovery, a form of media recovery.