Does your high availability design work as you expect?

We tend to pay most of our attention to the core system architecture: RAC with at least two nodes, disk groups with normal redundancy or higher, multiple I/O paths, bonded network cards for the private network, a local Data Guard (DG) standby with the same operating system, computing power, and architecture as the production environment, a remote DG standby, and so on.

So is it a highly available system now? The answer is: maybe.

As we know, this is a very complex system, and many companies provide different parts of it: the storage comes from one vendor and the operating system from another. The question is whether, once the high availability design has been delivered, we actually test it to prove that it works. I have seen it fail many times in certain scenarios.

Here is an example involving ASM redundancy. A customer used normal redundancy for their disk groups and needed to reboot the disk controllers for the disks in a disk group. They planned to handle one failgroup at a time. The disks had two controllers, and the plan was to reboot them sequentially so that I/O would not be interrupted. After rebooting the first disk controller, they found that the databases hung. They traced the problem to a storage bug in which the I/O call hung inside a function. The puzzle is that they had rebooted the controller for only one failgroup (mirror), while the other failgroup (mirror) was healthy and available.

Why did the Oracle database hang?

Oracle Database is software that runs on top of the operating system and uses OS calls to do its work: I/O operations, network operations, memory operations, and so on. With a normal redundancy disk group, if one mirror returns an error during an I/O operation, Oracle makes sure the other mirror completes it, so application users can carry on with their transactions. If one mirror hangs during an I/O operation, the situation is different. Oracle does not know when the operating system will complete the I/O, and there is no reliable way to tell the operating system to cancel it. Oracle can be installed on many operating systems, most of which do not belong to Oracle, so it is difficult to reach an agreement with the different operating system vendors on how abnormal I/O should be handled.
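The difference between a mirror that fails and a mirror that hangs can be illustrated with a small simulation. This is only a sketch of the general idea, not how Oracle or ASM is implemented: the reader functions, `mirrored_read`, and `run_with_deadline` are all hypothetical names invented for the illustration, and the "hang" is simulated with a thread event rather than a real device.

```python
import threading


class MirrorError(Exception):
    """Raised when a simulated mirror reports an I/O error."""


def read_failing(block):
    # Simulated mirror that fails fast with an error.
    raise MirrorError("simulated I/O error")


def read_ok(block):
    # Simulated healthy mirror.
    return f"data-{block}"


def make_hanging_reader(release_event):
    # Simulated mirror whose I/O call never returns until released.
    def read_hanging(block):
        release_event.wait()
        return f"late-{block}"
    return read_hanging


def mirrored_read(block, primary, secondary):
    # Failover is only possible when the primary *returns* an error.
    # If the primary call blocks, control never reaches the except clause.
    try:
        return primary(block)
    except MirrorError:
        return secondary(block)


def run_with_deadline(fn, args, timeout):
    # Run fn in a worker thread and report whether it finished in time.
    # The stuck call itself is not cancelled; it is only observed.
    result = {}

    def target():
        result["value"] = fn(*args)

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        return ("hung", None)
    return ("done", result.get("value"))
```

With `read_failing` as the primary, `mirrored_read` returns the data from the healthy mirror. With a hanging primary, the call never reaches the fallback, and `run_with_deadline` can only report that it is stuck; the healthy mirror is never consulted even though it is available.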

So test your system to make sure the high availability design actually works. It is a step you should not skip before the system goes into production.
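One simple way to observe I/O behaviour during such a test is to run a small probe that issues synchronous writes and records how long each one takes while the failure is being injected. The following is a minimal sketch under stated assumptions: it writes to an ordinary file on a filesystem backed by the storage under test (it does not talk to ASM), and the function names, the 4 KB payload, and the thresholds are arbitrary choices made for the illustration.

```python
import os
import time


def timed_sync_write(path, payload=b"x" * 4096):
    # Append one block and force it to stable storage; return elapsed seconds.
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)
    finally:
        os.close(fd)
    return time.monotonic() - start


def probe(path, samples=5, interval=0.1, stall_threshold=1.0):
    # Take several timed writes and summarise the worst case.
    latencies = []
    for _ in range(samples):
        latencies.append(timed_sync_write(path))
        time.sleep(interval)
    stalls = [x for x in latencies if x > stall_threshold]
    return {
        "samples": len(latencies),
        "max": max(latencies),
        "stalls": len(stalls),
    }
```

Note that if the underlying I/O hangs indefinitely, the probe blocks as well. A probe that stops producing results during the test is itself the signal that the design did not behave as expected.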

 
