Berkeley DB -- Master-Slave Replication (HA), Part 2

Reposted from: http://blog.sina.com.cn/s/blog_466c6640010002hh.html

Network partitions

The Berkeley DB replication implementation can be affected by network partitioning problems.

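For example, consider a replication group with n members. The network partitions with the master on one side and more than half (n/2) of the sites on the other side. The sites on the master's side continue forward, and the master continues to accept write requests for the databases. Unfortunately, the sites on the other side of the partition, realizing that their master is gone, hold an election. The election succeeds because more than n/2 of the total sites are on that side, and the group then has two masters. Since both masters may be accepting write requests, the databases can diverge and the data become inconsistent.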
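The arithmetic behind this scenario can be sketched as a small model. This is only an illustration, not Berkeley DB code: the function names are hypothetical, and it assumes an election succeeds whenever strictly more than n/2 of the sites take part, as the paragraph above describes.

```python
def majority(n: int) -> int:
    """Smallest number of sites that is strictly more than n/2."""
    return n // 2 + 1


def masters_after_partition(n: int, master_side: int) -> int:
    """Count the masters after a group of n sites splits in two.

    master_side is the number of sites, including the old master,
    that stay on the old master's side of the partition.
    """
    other_side = n - master_side
    masters = 1  # the old master keeps accepting writes on its side
    if other_side >= majority(n):
        masters += 1  # the other side can elect a second master
    return masters
```

With five sites, `masters_after_partition(5, 2)` gives 2 (the three isolated sites form a majority), while `masters_after_partition(5, 3)` gives 1.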

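If multiple masters are ever found in a group, a master that detects the problem returns DB_REP_DUPMASTER. An application that sees this return should reconfigure itself as a client (by calling DB_ENV->rep_start) and then call for an election (by calling DB_ENV->rep_elect). The winner of the election may be one of the two previous masters, or it may be another site entirely. Either way, the winning system brings the other systems into conformance with itself.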
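The recovery order described above (step down to client first, then call an election) can be sketched as follows. This is a model only: `DB_REP_DUPMASTER` here is a plain stand-in value, and `rep_start_as_client` and `rep_elect` are hypothetical callables that an application would wire to DB_ENV->rep_start and DB_ENV->rep_elect; the real calls and their arguments are not shown.

```python
DB_REP_DUPMASTER = "DB_REP_DUPMASTER"  # stand-in for the library's return code


def handle_return(ret, rep_start_as_client, rep_elect):
    """React to a replication return value.

    Both callables are hypothetical hooks supplied by the application.
    Returns the list of steps taken, in order.
    """
    steps = []
    if ret == DB_REP_DUPMASTER:
        rep_start_as_client()  # first give up the master role
        steps.append("client")
        rep_elect()            # then let the group pick one master
        steps.append("elect")
    return steps
```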

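As another example, consider a replication group with a master environment and two clients, A and B, where A may be upgraded to master status and B may not. Suppose client A is partitioned from the other two database environments and its data becomes out of date. Then suppose the master crashes and does not come back online. Later the network partition is repaired, and clients A and B hold an election. Because client B cannot win the election, client A wins by default. To bring B back into sync with A, transactions that may have been committed on B are rolled back until the two sites can once again move forward together.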

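In both examples there is a step in which the newly elected master brings the members of the group into conformance with itself so that it can start sending them new information. This can lose information, because previously committed transactions are rolled back.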

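In architectures where network partitions are an issue, applications may want to implement a heartbeat protocol to minimize the consequences of a bad partition. As long as a master can communicate with at least half of the sites in the group, two masters cannot exist. If a master can no longer contact enough sites, it should reconfigure itself as a client and hold an election.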
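One possible shape for the master-side check in such a heartbeat protocol is sketched below. It is an application-level illustration, not part of Berkeley DB: the function names, the `last_heard` dictionary (peer name to timestamp of the last heartbeat) and the timeout are all assumptions, and the master is assumed to count itself among the sites it can reach.

```python
def reachable_peers(last_heard, now, timeout):
    """Count peers whose last heartbeat arrived within `timeout` seconds."""
    return sum(1 for t in last_heard.values() if now - t <= timeout)


def master_should_step_down(group_size, last_heard, now, timeout):
    """True when the master, counting itself, reaches fewer than half the group."""
    in_contact = 1 + reachable_peers(last_heard, now, timeout)
    return 2 * in_contact < group_size
```

In a group of five, a master that still hears from two peers stays master (3 of 5), while one that hears from only a single peer should step down, since the other three sites could form a majority and elect their own master.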

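There is another tool applications can use to minimize the damage from a network partition. By passing DB_ENV->rep_elect an nsites argument that is larger than the actual number of members in the group, applications can keep systems from declaring themselves master unless they can talk to a large share of the sites in the group. For example, if the group has 20 database environments and 30 is passed to the DB_ENV->rep_elect method, a system must be able to talk to at least 16 sites before it can declare itself master.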

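Passing DB_ENV->rep_elect an nsites argument that is smaller than the actual number of members in the group has its uses as well. For example, consider a group with only two database environments. If they are partitioned from each other, neither can collect enough votes to become master. A reasonable alternative is to give one system an nsites argument of 2 and the other an nsites argument of 1. That way, when partitioned, one of the systems can win an election and the other cannot, which lets one of the systems keep accepting write requests while the network is partitioned.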
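Both nsites examples follow from the same majority rule, which can be worked through in a few lines. This is an arithmetic sketch only; `votes_needed` and `can_win` are hypothetical helper names, and `reachable` is assumed to count the site itself along with the peers it can talk to.

```python
def votes_needed(nsites: int) -> int:
    """Votes required to win when the election is told there are nsites sites."""
    return nsites // 2 + 1


def can_win(nsites: int, reachable: int) -> bool:
    """reachable counts the site itself plus every peer it can talk to."""
    return reachable >= votes_needed(nsites)


print(votes_needed(30))  # 16: the 20-site group with nsites set to 30
print(can_win(2, 1))     # False: the isolated site configured with nsites=2
print(can_win(1, 1))     # True: the isolated site configured with nsites=1
```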

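These scenarios stress the importance of good network infrastructure in Berkeley DB replicated environments. When database environments are replicated over a network with heavy packet loss, the best solution may be to pick a single master and hold an election only when human intervention has determined that the selected master cannot be brought back online.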

Replication FAQ

Does Berkeley DB provide support for forwarding write queries from clients to masters?
No, it does not. The Berkeley DB RPC server code could be modified to support this functionality, but in general this protocol is left entirely to the application. Note, there is no reason not to use the communications channels the application establishes for replication support to forward database update messages to the master, since Berkeley DB does not require those channels to be used exclusively for replication messages.

Can I use replication to partition my environment across multiple sites?
No, this is not possible. All replicated databases must be equally shared by all environments in the replication group.

I'm running with replication but I don't see my databases on the client.
This problem may be the result of the application using absolute path names for its databases when those path names are not valid on the client system.

How can I distinguish Berkeley DB messages from application messages?
There is no way to distinguish Berkeley DB messages from application-specific messages, nor does Berkeley DB offer any way to wrap application messages inside of Berkeley DB messages. Distributed applications exchanging their own messages should either enclose Berkeley DB messages in their own wrappers, or use separate network connections to send and receive Berkeley DB messages. The one exception to this rule is connection information for new sites; Berkeley DB offers a simple method for sites joining replication groups to send connection information to the other database environments in the group (see Connecting to a new site for more information).

How should I build my send function?
This depends on the specifics of the application. One common way is to write the rec and control arguments' sizes and data to a socket connected to each remote site. On a fast, local area net, the simplest method is likely to be to construct broadcast messages. Each Berkeley DB message would be encapsulated inside an application specific message, with header information specifying the intended recipient(s) for the message. This will likely require a global numbering scheme, however, as the Berkeley DB library has to be able to send specific log records to clients apart from the general broadcast of new log records intended for all members of a replication group.
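A minimal sketch of the "sizes and data" framing described above, wrapped in an application header that names the recipient. The header layout, the `encode_frame`/`decode_frame` names and the use of -1 as a broadcast marker are all assumptions made for illustration; Berkeley DB does not define this wire format.

```python
import struct

# Hypothetical header: recipient id, control length, rec length (network byte order).
HEADER = struct.Struct("!iII")
BROADCAST = -1  # illustrative marker for "all sites"; not a library constant


def encode_frame(recipient: int, control: bytes, rec: bytes) -> bytes:
    """Prefix the two opaque buffers with their sizes and the recipient."""
    return HEADER.pack(recipient, len(control), len(rec)) + control + rec


def decode_frame(frame: bytes):
    """Split a frame back into (recipient, control, rec)."""
    recipient, clen, rlen = HEADER.unpack_from(frame)
    body = frame[HEADER.size:]
    if len(body) != clen + rlen:
        raise ValueError("frame length does not match its header")
    return recipient, body[:clen], body[clen:]
```

The receiving side would hand the decoded control and rec buffers, unchanged, to its message-processing call; how frames are read off the socket is left to the application.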

Does every one of my threads of control on the master have to set up its own connection to every client? And, does every one of my threads of control on the client have to set up its own connection to every master?
This is not always necessary. In the Berkeley DB replication model, any thread of control which modifies a database in the master environment must be prepared to send a message to the client environments, and any thread of control which delivers a message to a client environment must be prepared to send a message to the master. There are many ways in which these requirements can be satisfied.

The simplest case is probably a single, multithreaded process running on the master and clients. The process running on the master would require a single write connection to each client and a single read connection from each client. A process running on each client would require a single read connection from the master and a single write connection to the master. Threads running in these processes on the master and clients would use the same network connections to pass messages back and forth.
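When several threads share one connection, as in the simplest case above, their messages must not interleave on the wire. One way to arrange that, sketched here with a hypothetical `SharedConnection` wrapper around an ordinary socket, is to serialize sends with a lock:

```python
import socket
import threading


class SharedConnection:
    """One socket shared by many threads; each send goes out as a whole."""

    def __init__(self, sock: socket.socket):
        self._sock = sock
        self._lock = threading.Lock()

    def send(self, frame: bytes) -> None:
        # Hold the lock for the full sendall so frames from different
        # threads cannot be interleaved.
        with self._lock:
            self._sock.sendall(frame)
```

Each thread's send function would call `send` with a complete, self-delimiting message; the read side can stay in a single dedicated thread.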

A common complication is when there are multiple processes running on the master and clients. A straightforward solution is to increase the number of connections on the master: each process running on the master has its own write connection to each client. However, this requires only one additional connection for each possible client in the master process. The master environment still requires only a single read connection from each client (this can be done by allocating a separate thread of control which does nothing other than receive client messages and forward them into the database). Similarly, each client still only requires a single thread of control that receives master messages and forwards them into the database, and which also takes database messages and forwards them back to the master. This model requires that the networking infrastructure support many-to-one writers-to-readers, of course.

If the number of network connections is a problem in the multiprocess model, and inter-process communication on the system is inexpensive enough, an alternative is to have a single process which communicates between the master and each client. Whenever a process' send function is called, the process passes the message to the communications process, which is responsible for forwarding the message to the appropriate client. Alternatively, a broadcast mechanism will simplify the entire networking infrastructure, as processes will likely no longer have to maintain their own specific network connections.

(End of the HA part)

whycold
