An In-Depth, 10,000-Word Guide to the Redis Cluster Gossip Protocol

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":5}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hi everyone, I'm 歷小冰. Today I'll walk through the Gossip protocol and cluster operations in Redis Cluster. The mind map below outlines the article.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3f/3f12dc1d7bfd582af1e9b0770f72aa28.jpeg","alt":"xmind","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Cluster Modes and an Introduction to Gossip","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"In data storage, once data volume or request traffic grows past a certain point, going distributed becomes unavoidable","attrs":{}},{"type":"text","text":". Take Redis: a single instance performs very well, yet the following limits still force a move to a cluster.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A single instance cannot guarantee high availability; multiple instances are needed to provide it;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A single instance can serve up to roughly 80,000 QPS; anything higher requires multiple instances;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A single instance can hold only a limited amount of data; handling more requires multiple instances;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The network traffic of a single instance can exceed the capacity of the server's network card; multiple instances are needed to spread the load.","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A cluster usually has to maintain some metadata, such as instance IP addresses and the slot assignments of the cache shards, so it needs a distributed mechanism to keep that metadata consistent. Such mechanisms generally follow one of two models: decentralized or centralized.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The decentralized model stores metadata on some or all nodes, and the nodes communicate continuously to propagate changes and keep the metadata consistent. Redis Cluster and Consul both follow this model.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/81/81bec470767494f7bc38553fbdf25740.png","alt":"Gossip_model","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The centralized model stores cluster metadata on an external node or in middleware such as ZooKeeper. Older versions of Kafka and Storm use this model.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a4/a454062226cbb71a4d1edd637a5f6ab4.png","alt":"center_model","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Each model has its strengths and weaknesses, as the table below shows:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/0e/0e084c756f1dba71722d945adff35fdc.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The decentralized model can use several algorithms to synchronize metadata, for example Paxos, Raft, and Gossip. Paxos and Raft need a majority of nodes (more than half) to be healthy before the cluster can operate stably, whereas Gossip does not require more than half of the nodes to be running.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Gossip 
is, as the name suggests, a protocol that works like a rumor: it spreads information through the whole network in a random, contagious fashion and, within a bounded time, brings the data on all nodes in the system into agreement. Understanding it gives you a solid grasp of this widely used eventual-consistency algorithm and makes it easier to implement eventual consistency in your own work.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/35/3570a44bce7b03b658793aaf1122167e.gif","alt":"Gossip_gif","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Gossip protocol, also called an epidemic protocol, exchanges information between nodes or processes the way an epidemic spreads. It is widely used in P2P networks and distributed systems, and its underlying idea is very simple:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In a cluster on a bounded network, if every node randomly exchanges a particular piece of information with other nodes, then after a sufficiently long time all nodes' views of that information will converge.","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Here, “particular information” usually means the cluster state, the state of each node, and other metadata. The Gossip protocol fits the BASE principles and can be used wherever eventual consistency is what is required, such as distributed storage and service registries. It also makes elastic clusters easy to build: nodes can join and leave at any time, and the protocol supports fast failure detection and dynamic load balancing.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The biggest advantage of the Gossip protocol is that the load on each node grows very little as the cluster grows; it stays almost constant. This is what allows the node count managed by Redis Cluster or Consul to scale out horizontally into the thousands (the Redis Cluster specification itself targets up to about 1,000 nodes).","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Gossip Communication in Redis Cluster","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Redis introduced clustering in version 3.0. So that every instance in the cluster knows the state of all other instances, Redis Cluster has the instances exchange information using the Gossip protocol.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/37/373150051022efb26cdd1868a366bdc9.jpeg","alt":"redis_cluster","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure above shows a Redis Cluster with a master-slave architecture. Solid lines are replication links between masters and their slaves, and dashed lines are Gossip communication between nodes.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Every node in Redis Cluster ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"maintains its own view of the current state of the whole cluster","attrs":{}},{"type":"text","text":", which mainly includes:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"The current cluster state","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"The slots each node is responsible for, and their migration state","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"The master-slave state of each node","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"The liveness of each node, including suspected-failure state","attrs":{}}]}],"attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This information is what the nodes gossip about. It is fairly comprehensive: each node passes on facts about itself and about others, so as everyone keeps exchanging, the information eventually becomes complete and consistent across the cluster.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Redis Cluster nodes send each other several kinds of messages. The most important ones are:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MEET: triggered by the “cluster meet ip port” command. A node already in the cluster invites a new node to join, after which the new node starts communicating with the other nodes;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"PING: each node sends ping messages to other nodes in the cluster at a configured interval. A ping carries the sender's own state, the cluster metadata it maintains, and metadata about some of the other nodes;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"PONG: the reply to PING and MEET. Its structure is similar to PING, it also carries the sender's state and other information, and it can be used to broadcast updates;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"FAIL: once a node concludes that another node is down (its PING went unanswered and a majority of masters report the same, as described later), it broadcasts a FAIL message about that node to the whole cluster. Nodes that receive it mark the node as offline.","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The cluster.h file in the Redis source defines all message types. The code below is from Redis 4.0.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"// Note: PING, PONG and MEET are actually the same kind of message.\n// PONG is the reply to PING and uses the PING format,\n// while MEET is a special PING that forces the receiver to add the sender to the cluster (if the sender is not already in its node table)\n#define CLUSTERMSG_TYPE_PING 0 /* Ping */\n#define CLUSTERMSG_TYPE_PONG 1 /* Pong, the reply to Ping */\n#define CLUSTERMSG_TYPE_MEET 2 /* Meet: ask for a node to be added to the cluster */\n#define CLUSTERMSG_TYPE_FAIL 3 /* Fail: mark a node as FAIL */\n#define CLUSTERMSG_TYPE_PUBLISH 4 /* Broadcast a message via Pub/Sub */\n#define CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST 5 /* Failover request: asks the receivers to vote for the sender */\n#define CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK 6 /* The receiver agrees to vote for the sender */\n#define CLUSTERMSG_TYPE_UPDATE 7 /* Slots have changed: the sender asks the receiver to update accordingly */\n#define CLUSTERMSG_TYPE_MFSTART 8 /* Pause clients for a manual failover */\n#define CLUSTERMSG_TYPE_COUNT 9 /* Total number of message types */","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With these messages, every instance in the cluster can learn the state of all other instances. Even when a node joins, a node fails, or slots move, the instances synchronize cluster state among themselves by exchanging PING and PONG messages. Let's look at a few common scenarios in turn.","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Periodic PING/PONG Messages","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Every node in Redis Cluster periodically sends PING messages to other nodes to exchange node state and check each node's status: online, suspected offline (PFAIL), or confirmed offline (FAIL).","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How periodic PING/PONG works in Redis Cluster comes down to two points:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, at a fixed frequency each instance randomly picks some instances in the cluster and sends them PING messages, both to check whether they are online and to exchange state. A PING message carries the sender's own state, the state of some other instances, and the sender's slot map.","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Second, an instance that receives a PING replies to the sender with a PONG, which carries the same kind of content as a PING.","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure below shows a PING/PONG exchange between two instances, where instance 1 is the sender and instance 2 is the receiver.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e6/e6b7e6a68753378f2eacfe9e68e4bad6.png","alt":"Gossip_PING","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"A New Node Comes Online","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To add a new node to Redis Cluster, a client has to run the CLUSTER MEET command, as shown below.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/5f/5f018745480788cec1ab865f5fb03eda.png","alt":"meet","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When node 1 executes CLUSTER MEET, it first creates a clusterNode structure for the new node and adds it to the nodes dictionary of its own clusterState. The relationship between clusterState and clusterNode is covered in detail, with a diagram and source code, in the final section.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Node 1 then sends a MEET message to the new node, using the IP address and port given in the CLUSTER MEET command. On receiving it, the new node likewise creates a clusterNode structure for node 1 and adds it to the nodes dictionary of its own clusterState.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next, the new node replies to node 1 with a PONG message. When node 1 receives this PONG, it knows the new node has successfully received its MEET message.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Finally, node 1 sends the new node a PING message. When the new node receives it, it knows node 1 has successfully received its PONG, which completes the handshake for the new node.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After the MEET succeeds, node 1 uses the periodic PING mechanism described earlier to pass the new node's information to the other nodes in the cluster, so that they also handshake with the new node. After some time, every node in the cluster knows the new node.","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Suspected Failure and Confirmed Failure","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Each node in Redis Cluster periodically checks whether the nodes it has pinged have replied with a PONG within the allowed time (cluster-node-timeout). If a node has not, it is marked as suspected offline, that is, PFAIL, as shown below.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e2/e25ac066e2e1f2936127d5f3643c264c.png","alt":"pfail","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Node 1 then uses PING messages to tell other nodes, for example node 3, that node 2 is suspected offline. When node 3 learns from node 1's PING that node 2 is in the PFAIL state, it looks up node 2's clusterNode structure in the nodes dictionary of its own clusterState and appends the failure report from node 1 (a master) to that structure's fail_reports list.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ba/bae0848b55a0b8e107cbe96f4884c1cb.png","alt":"PING_FAIL","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Suppose that, over time, node 10 (as an example) also decides because of a PONG timeout that node 2 is suspected offline, and finds that the fail_reports of its clusterNode for node 2 contain ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"unexpired PFAIL reports about node 2 from more than half of the master nodes","attrs":{}},{"type":"text","text":". Node 10 then marks node 2 as FAIL (confirmed offline) and ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"immediately","attrs":{}},{"type":"text","text":" broadcasts a FAIL message to the rest of the cluster announcing that master node 2 is down. Every node that receives the FAIL message immediately marks node 2 as offline, as shown below.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/6d/6d0933cfd4ee30dce48df05d5adaf162.png","alt":"fail","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Note that failure reports expire: a report older than cluster-node-timeout * 2 is ignored, which lets node 2 go back to being treated as healthy.","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Redis Cluster 
Communication: Source Code Implementation","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So far we have covered the principles and flow of periodic PING/PONG, a new node coming online, and suspected and confirmed failure in Redis Cluster. Now let's look at how the Redis source code implements each of these.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Data Structures Involved","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, the data structures involved, namely clusterNode and the related structures mentioned above.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Every node maintains a clusterState structure","attrs":{}},{"type":"text","text":" that represents the overall state of the cluster as that node sees it. It is defined as follows.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"typedef struct clusterState {\n clusterNode *myself; /* This node's own clusterNode */\n /* .... */\n dict *nodes; /* Dictionary: node name -> clusterNode */\n /* .... */\n clusterNode *slots[CLUSTER_SLOTS]; /* Mapping from each slot to the node that owns it */\n /* .... */\n} clusterState;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"It has three key fields, illustrated below:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"myself: a pointer to a clusterNode structure that records this node's own state;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"nodes: a dictionary mapping node names to clusterNode structures, which records the state of the other nodes;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"slots: an array that records, for each slot, the clusterNode of the node that owns it.","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/49/4967d374799792b27d9714632a0e6b26.png","alt":"redis_cluster","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The clusterNode structure ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"stores the current state of one node","attrs":{}},{"type":"text","text":", such as ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"its creation time, name, current configuration epoch, IP address, and port","attrs":{}},{"type":"text","text":". In addition, the link field of clusterNode is a clusterLink structure that holds what is needed to talk to the node","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":", such as","attrs":{}},{"type":"text","text":" the socket descriptor and the input and output buffers. clusterNode also has a fail_reports list that records failure reports. The full definition follows.","attrs":{}}]},
{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"typedef struct clusterNode {\n mstime_t ctime; /* Time this node object was created */\n char name[CLUSTER_NAMELEN]; /* Node name */\n int flags; /* Node flags: role or state, e.g. master or slave, online or offline */\n uint64_t configEpoch; /* Last configEpoch observed for this node */\n unsigned char slots[CLUSTER_SLOTS/8]; /* slots handled by this node */\n int numslots; /* Number of slots handled by this node */\n 
int numslaves; /* Number of slave nodes, if this is a master */\n struct clusterNode **slaves; /* pointers to slave nodes */\n struct clusterNode *slaveof; /* pointer to the master node. Note that it\n may be NULL even if the node is a slave\n if we don't have the master node in our\n tables. */\n mstime_t ping_sent; /* Time we last sent a PING to this node */\n mstime_t pong_received; /* Time we last received a PONG from this node */\n mstime_t fail_time; /* Time the FAIL flag was set */\n mstime_t voted_time; /* Last time we voted for a slave of this master */\n mstime_t repl_offset_time; /* Unix time we received offset for this node */\n mstime_t orphaned_time; /* Starting time of orphaned master condition */\n long long repl_offset; /* Last known replication offset for this node */\n char ip[NET_IP_STR_LEN]; /* Node IP address */\n int port; /* Client port */\n int cport; /* Cluster bus port, by default the client port + 10000 */\n clusterLink *link; /* TCP link to this node */\n list *fail_reports; /* List of failure reports about this node */\n} clusterNode;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"clusterNodeFailReport is the structure that records one failure report: node is the node that made the report, and time is when the report was made.","attrs":{}}]},
{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"typedef struct clusterNodeFailReport {\n struct clusterNode *node; /* Node that reported the failure */\n mstime_t time; /* Time of the report */\n} clusterNodeFailReport;","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Message Structures","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With the data structures each Redis node maintains in mind, let's look at the message structures nodes use to communicate. The outermost structure is clusterMsg. It contains header information about the message: the RCmb signature, total message length, protocol version, and message type. It also describes the sending node: its name, the slots it serves, its IP and port, and so on. Finally, it contains a clusterMsgData that carries the type-specific payload.","attrs":{}}]},
{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"typedef struct {\n char sig[4]; /* Signature \"RCmb\" (Redis Cluster 
message bus). */\n uint32_t totlen; /* Total length of this message */\n uint16_t ver; /* Protocol version */\n uint16_t port; /* Client port of the sender */\n uint16_t type; /* Message type */\n uint16_t count; /* Only used for some kinds of messages */\n uint64_t currentEpoch; /* The cluster-wide epoch as currently known to the sender, used for elections and voting. Unlike configEpoch, which identifies a master's configuration, currentEpoch applies to the cluster as a whole. */\n uint64_t configEpoch; /* Each master has a unique configEpoch; on a collision with another master it is bumped so that it stays unique in the cluster */\n uint64_t offset; /* Replication offset; its meaning differs for masters and slaves */\n char sender[CLUSTER_NAMELEN]; /* Name of the sending node */\n unsigned char myslots[CLUSTER_SLOTS/8]; /* Slots served by the sender: 16384/8 chars, i.e. a 16384-bit bitmap */\n char slaveof[CLUSTER_NAMELEN]; /* Name of the master, if the sender is a slave */\n char myip[NET_IP_STR_LEN]; /* Sender IP */\n char notused1[34]; /* Reserved */\n uint16_t cport; /* Sender's cluster bus port */\n uint16_t flags; /* Sender's node flags, e.g. CLUSTER_NODE_HANDSHAKE, CLUSTER_NODE_MEET */\n unsigned char state; /* Cluster state from the POV of the sender */\n unsigned char mflags[3]; /* Message flags; currently two: CLUSTERMSG_FLAG0_PAUSED and CLUSTERMSG_FLAG0_FORCEACK */\n union clusterMsgData data;\n} clusterMsg;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"clusterMsgData is a union that can hold the body of a PING, MEET, PONG, FAIL, or other message. For PING, MEET, and PONG messages the ping member is populated; for FAIL messages the fail member is populated.","attrs":{}}]},
{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"// Note the union keyword\nunion clusterMsgData {\n /* PING, MEET and PONG: the ping member is used */\n struct {\n /* Array of N clusterMsgDataGossip structures */\n clusterMsgDataGossip gossip[1];\n } ping;\n /* FAIL: the fail member is used */\n struct {\n clusterMsgDataFail about;\n } fail;\n // .... 
fields for publish and update messages omitted\n};","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"clusterMsgDataGossip is the payload structure for PING, PONG, and MEET messages. It carries what the sender knows about other nodes, that is, information from the nodes field of clusterState described above. The code is shown below; notice that its fields resemble those of clusterNode.","attrs":{}}]},
{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"typedef struct {\n /* Node name. It starts out random; after the MEET handshake completes, the node's real name is recorded */\n char nodename[CLUSTER_NAMELEN];\n uint32_t ping_sent; /* Time the sender last sent a PING to this node (0 once the matching PONG has arrived) */\n uint32_t pong_received; /* Time the sender last received a PONG from this node */\n char ip[NET_IP_STR_LEN]; /* IP address last time it was seen */\n uint16_t port; /* Client port last time it was seen */\n uint16_t cport; /* Cluster bus port last time it was seen */\n uint16_t flags; /* Node flags */\n uint32_t notused1; /* Padding for alignment */\n} clusterMsgDataGossip;\n\ntypedef struct {\n char nodename[CLUSTER_NAMELEN]; /* Name of the node that is down */\n} clusterMsgDataFail;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Having covered the data structures nodes maintain and the message structures they send, let's look at the source code for Redis's actual behavior.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Sending PING Messages Periodically to Random Nodes","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Redis calls the clusterCron function on a timer. Every 10th invocation, it prepares to send a PING message to one randomly chosen node.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"It first picks 5 nodes at random, then chooses among them the one it has gone longest without hearing from (the oldest pong_received), and calls clusterSendPing to send it a message of type CLUSTERMSG_TYPE_PING.","attrs":{}}]},
{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"// cluster.c\n// Every 10 iterations of clusterCron() (so at least one second apart), send gossip info to one random node\nif (!(iteration % 10)) {\n int j;\n\n /* Sample 5 random nodes and pick one of them */\n for (j = 0; j < 5; j++) {\n de = dictGetRandomKey(server.cluster->nodes);\n clusterNode *this = dictGetVal(de);\n\n /* Don't ping nodes that are disconnected or that already have a PING in flight */\n if (this->link == NULL || this->ping_sent != 0) continue;\n if (this->flags & (CLUSTER_NODE_MYSELF|CLUSTER_NODE_HANDSHAKE))\n continue;\n /* Compare pong_received and keep the node whose last PONG is oldest */\n if (min_pong_node == NULL || min_pong > this->pong_received) {\n min_pong_node = this;\n min_pong = this->pong_received;\n }\n }\n /* Send a PING to the node whose last PONG is oldest */\n if (min_pong_node) {\n serverLog(LL_DEBUG,\"Pinging node %.40s\", min_pong_node->name);\n clusterSendPing(min_pong_node->link, CLUSTERMSG_TYPE_PING);\n }\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We will look at what clusterSendPing does later, because it is also used in several other places.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"A Node Joins the Cluster","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After a node executes CLUSTER MEET, it keeps a clusterNode structure for the new node. That structure's link field, the TCP connection, is NULL, meaning no connection to the new node has been established yet.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"clusterCron also handles these not-yet-connected nodes: it calls createClusterLink to set up the link and then calls clusterSendPing to send a MEET message.","attrs":{}}]},
{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"/* cluster.c, part of clusterCron: create links for nodes that have none */\nif (node->link == NULL) {\n int fd;\n mstime_t old_ping_sent;\n clusterLink *link;\n /* Connect to the node */\n fd = anetTcpNonBlockBindConnect(server.neterr, node->ip,\n node->cport, NET_FIRST_BIND_ADDR);\n /* .... error handling when fd is -1 */\n /* Create the link */\n link = createClusterLink(node);\n link->fd = fd;\n node->link = link;\n aeCreateFileEvent(server.el,link->fd,AE_READABLE,\n clusterReadHandler,link);\n /* Ping the newly connected node right away so it is not mistakenly treated as offline */\n /* If the node is flagged MEET, send a MEET message; otherwise send a PING */\n old_ping_sent = node->ping_sent;\n clusterSendPing(link, node->flags & CLUSTER_NODE_MEET ?\n CLUSTERMSG_TYPE_MEET : CLUSTERMSG_TYPE_PING);\n /* .... */\n /* If this node (the sender) never gets a reply to the MEET, it stops sending to the target. */\n /* If a reply arrives, the target leaves the HANDSHAKE state and keeps receiving normal PINGs */\n node->flags &= ~CLUSTER_NODE_MEET;\n}","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Guarding Against False Timeouts and Stale State","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Guarding against false timeouts and setting the suspected-offline flag also happen in clusterCron, as shown below. It iterates over every entry in nodes. If it has been waiting for a node's PONG for more than half of the configured timeout, it frees the link to that node so that the next clusterCron run reconnects, in case the connection rather than the node is at fault. Separately, if no PING is outstanding and the last PONG from a node is older than half of the timeout, it proactively sends that node a PING.","attrs":{}}]},
{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"/* cluster.c, part of clusterCron: iterate over the nodes to check for failed ones */\nwhile((de = dictNext(di)) != NULL) {\n clusterNode *node = dictGetVal(de);\n now = mstime(); /* Use an updated time at every iteration. 
*/\n    mstime_t delay;\n\n    /* We have been waiting for a PONG for more than half of the node timeout: */\n    /* the node may be fine while the connection itself is broken. */\n    if (node->link && /* is connected */\n        now - node->link->ctime >\n        server.cluster_node_timeout && /* the link was not just re-created */\n        node->ping_sent && /* a PING has been sent */\n        node->pong_received < node->ping_sent && /* still waiting for the PONG */\n        /* and we have been waiting for more than timeout/2 */\n        now - node->ping_sent > server.cluster_node_timeout/2)\n    {\n        /* Free the link; the next clusterCron() run reconnects automatically */\n        freeClusterLink(node->link);\n    }\n\n    /* If no PING is currently outstanding for this node */\n    /* and no PONG has arrived from it for half of the node timeout, */\n    /* send a PING so that its information does not go stale (random selection may never have picked it). */\n    if (node->link &&\n        node->ping_sent == 0 &&\n        (now - node->pong_received) > server.cluster_node_timeout/2)\n    {\n        clusterSendPing(node->link, CLUSTERMSG_TYPE_PING);\n        continue;\n    }\n    /* .... handle failover and mark possibly failing nodes */\n}","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Handling Failover and Marking PFAIL","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If, after the false-timeout handling above, the node still has not received a PONG from the target and the wait has exceeded cluster_node_timeout, the target is marked as possibly failing (PFAIL).","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"/* If we are a master and one of our replicas requested a manual failover, keep PINGing that replica */\nif (server.cluster->mf_end &&\n    nodeIsMaster(myself) &&\n    server.cluster->mf_slave == node &&\n    node->link)\n{\n    clusterSendPing(node->link, CLUSTERMSG_TYPE_PING);\n    continue;\n}\n\n/* The code below only applies when a PING is outstanding for this node */\nif (node->ping_sent == 0) continue;\n\n/* How long we have been waiting for the PONG */\ndelay = now - node->ping_sent;\n/* The wait exceeded the limit: mark the target as PFAIL (possibly failing) */\nif (delay > server.cluster_node_timeout) {\n    /* Timed out and not flagged yet */\n    if (!(node->flags & (CLUSTER_NODE_PFAIL|CLUSTER_NODE_FAIL))) {\n        serverLog(LL_DEBUG,\"*** NODE %.40s possibly failing\",\n            node->name);\n        // Turn on the PFAIL flag\n        node->flags |= CLUSTER_NODE_PFAIL;\n        update_state = 1;\n    
}\n}","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Actually Sending Gossip Messages","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Below is the source of clusterSendPing(), which has been called several times above. The code is commented in detail, so you can read through it yourself. Its main job is to convert the clusterState that the node maintains into the corresponding message struct.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"/* Send a MEET, PING or PONG message over the given link */\nvoid clusterSendPing(clusterLink *link, int type) {\n    unsigned char *buf;\n    clusterMsg *hdr;\n    int gossipcount = 0; /* Number of gossip sections added so far. */\n    int wanted; /* Number of gossip sections we want to append if possible. */\n    int totlen; /* Total packet length. */\n    // freshnodes is the budget of nodes that can still be added as gossip entries.\n    // Each time an entry is added, freshnodes is decremented by one.\n    // When freshnodes drops to 0, no more gossip entries are added.\n    // It starts at the number of nodes in the nodes table minus 2:\n    // one is myself (the node sending the message)\n    // and the other is the node receiving the gossip.\n    int freshnodes = dictSize(server.cluster->nodes)-2;\n\n    /* How many gossip entries to carry: 1/10 of the nodes in the table, but at least 3 and never more than freshnodes */\n    wanted = floor(dictSize(server.cluster->nodes)/10);\n    if (wanted < 3) wanted = 3;\n    if (wanted > freshnodes) wanted = freshnodes;\n\n    /* .... 
computation of totlen etc. omitted */\n\n    /* If this is a PING, record the time at which it is sent */\n    if (link->node && type == CLUSTERMSG_TYPE_PING)\n        link->node->ping_sent = mstime();\n    /* Write this node's own information (name, address, port, slots it serves) into the message */\n    clusterBuildMessageHdr(hdr,type);\n\n    /* Populate the gossip fields */\n    int maxiterations = wanted*3;\n    /* Add up to `wanted` gossip entries (counted by gossipcount).\n     * freshnodes and maxiterations bound the number of attempts. */\n    while(freshnodes > 0 && gossipcount < wanted && maxiterations--) {\n        /* Pick a random node from the nodes dictionary */\n        dictEntry *de = dictGetRandomKey(server.cluster->nodes);\n        clusterNode *this = dictGetVal(de);\n\n        /* The following nodes are never picked here:\n         * Myself: the node itself.\n         * Nodes in PFAIL state (they are appended at the end instead).\n         * Nodes in HANDSHAKE state.\n         * Nodes with the NOADDR flag.\n         * Disconnected nodes that serve no slots.\n         */\n        if (this == myself) continue;\n        if (this->flags & CLUSTER_NODE_PFAIL) continue;\n        if (this->flags & (CLUSTER_NODE_HANDSHAKE|CLUSTER_NODE_NOADDR) ||\n            (this->link == NULL && this->numslots == 0))\n        {\n            freshnodes--; /* Technically not correct, but saves CPU. */\n            continue;\n        }\n\n        // Check whether the picked node is already in the hdr->data.ping.gossip array.\n        // If so, it was picked earlier,\n        // so skip it to avoid duplicates.\n        if (clusterNodeIsInGossipSection(hdr,gossipcount,this)) continue;\n\n        /* The picked node is valid: add it and update the counters */\n        clusterSetGossipEntry(hdr,gossipcount,this);\n        freshnodes--;\n        gossipcount++;\n    }\n\n    /* .... 
if there are PFAIL nodes, append them last */\n\n\n    /* Compute the message length */\n    totlen = sizeof(clusterMsg)-sizeof(union clusterMsgData);\n    totlen += (sizeof(clusterMsgDataGossip)*gossipcount);\n    /* Record the number of gossip entries carried by this message in the count field */\n    hdr->count = htons(gossipcount);\n    /* Record the message length in the message */\n    hdr->totlen = htonl(totlen);\n    /* Send the message over the link */\n    clusterSendMessage(link,buf,totlen);\n    zfree(buf);\n}\n\n\nvoid clusterSetGossipEntry(clusterMsg *hdr, int i, clusterNode *n) {\n    clusterMsgDataGossip *gossip;\n    /* Point at the i-th gossip entry */\n    gossip = &(hdr->data.ping.gossip[i]);\n    /* Copy the picked node's name into the gossip entry */\n    memcpy(gossip->nodename,n->name,CLUSTER_NAMELEN);\n    /* Record when we last sent a PING to the picked node (in seconds) */\n    gossip->ping_sent = htonl(n->ping_sent/1000);\n    /* Record when we last received a PONG from the picked node (in seconds) */\n    gossip->pong_received = htonl(n->pong_received/1000);\n    /* Copy the picked node's IP into the gossip entry */\n    memcpy(gossip->ip,n->ip,sizeof(n->ip));\n    /* Record the picked node's client port and cluster bus port */\n    gossip->port = htons(n->port);\n    gossip->cport = htons(n->cport);\n    /* Record the picked node's flags */\n    gossip->flags = htons(n->flags);\n    gossip->notused1 = 0;\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Below is clusterBuildMessageHdr, which fills in the basic fields of the message struct and the current node's state.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"/* Build the message header */\nvoid clusterBuildMessageHdr(clusterMsg *hdr, int type) {\n    int totlen = 0;\n    uint64_t offset;\n    clusterNode *master;\n\n    /* If this node is a slave, master is the node it replicates; if this node is a master, master is the node itself */\n    master = (nodeIsSlave(myself) && myself->slaveof) ?\n              myself->slaveof : myself;\n\n    memset(hdr,0,sizeof(*hdr));\n    /* Set the protocol version, signature and message type */\n    hdr->ver = htons(CLUSTER_PROTO_VER);\n    hdr->sig[0] = 'R';\n    hdr->sig[1] = 'C';\n    hdr->sig[2] = 'm';\n    hdr->sig[3] = 'b';\n    hdr->type = htons(type);\n    /* Put this node's id into the header */\n    memcpy(hdr->sender,myself->name,CLUSTER_NAMELEN);\n\n    /* Put this node's IP into the header (only when an announce IP is configured) */\n    memset(hdr->myip,0,NET_IP_STR_LEN);\n    if 
(server.cluster_announce_ip) {\n        strncpy(hdr->myip,server.cluster_announce_ip,NET_IP_STR_LEN);\n        hdr->myip[NET_IP_STR_LEN-1] = '\\0';\n    }\n\n    /* Client port and cluster bus port */\n    int announced_port = server.cluster_announce_port ?\n                         server.cluster_announce_port : server.port;\n    int announced_cport = server.cluster_announce_bus_port ?\n                          server.cluster_announce_bus_port :\n                          (server.port + CLUSTER_PORT_INCR);\n    /* Slot bitmap: a slave advertises the slots of its master */\n    memcpy(hdr->myslots,master->slots,sizeof(hdr->myslots));\n    memset(hdr->slaveof,0,CLUSTER_NAMELEN);\n    if (myself->slaveof != NULL)\n        memcpy(hdr->slaveof,myself->slaveof->name, CLUSTER_NAMELEN);\n    hdr->port = htons(announced_port);\n    hdr->cport = htons(announced_cport);\n    hdr->flags = htons(myself->flags);\n    hdr->state = server.cluster->state;\n\n    /* Set the currentEpoch and configEpoch. */\n    hdr->currentEpoch = htonu64(server.cluster->currentEpoch);\n    hdr->configEpoch = htonu64(master->configEpoch);\n\n    /* Set the replication offset */\n    if (nodeIsSlave(myself))\n        offset = replicationGetSlaveOffset();\n    else\n        offset = server.master_repl_offset;\n    hdr->offset = htonu64(offset);\n\n    /* Set the message flags. 
*/\n    if (nodeIsMaster(myself) && server.cluster->mf_end)\n        hdr->mflags[0] |= CLUSTERMSG_FLAG0_PAUSED;\n\n    /* Compute and set the total length for FAIL and UPDATE messages; */\n    /* for PING, PONG and MEET the caller fills in totlen afterwards. */\n    if (type == CLUSTERMSG_TYPE_FAIL) {\n        totlen = sizeof(clusterMsg)-sizeof(union clusterMsgData);\n        totlen += sizeof(clusterMsgDataFail);\n    } else if (type == CLUSTERMSG_TYPE_UPDATE) {\n        totlen = sizeof(clusterMsg)-sizeof(union clusterMsgData);\n        totlen += sizeof(clusterMsgDataUpdate);\n    }\n    hdr->totlen = htonl(totlen);\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Afterword","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I originally only meant to write about the Gossip protocol of Redis Cluster, but the article kept growing as I wrote, and the source-code analysis at the end is admittedly a bit rushed. I hope it is still useful, and please stay tuned for my follow-up articles.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}