Facebook緩存技術演進:從單集羣到多區域

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文的主要內容翻譯自 "},{"type":"link","attrs":{"href":"https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final170_update.pdf","title":""},"content":[{"type":"text","text":"Scaling Memcache at Facebook (2013) "}]},{"type":"text","text":" 這篇論文,其中圖表來自於論文及 Rajesh Nishtala 在 NSDI '13 | USENIX 上的演講幻燈片。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文介紹 FB 基於 memcached 構建統一緩存層的最佳實踐。全文遞進式地講述 "},{"type":"text","marks":[{"type":"strong"}],"text":"單集羣 (Single Front-end Cluster)"},{"type":"text","text":"、"},{"type":"text","marks":[{"type":"strong"}],"text":"多集羣 (Multiple Front-end Clusters)"},{"type":"text","text":"、"},{"type":"text","marks":[{"type":"strong"}],"text":"多區域 (Multiple Regions)"},{"type":"text","text":" 環境下遇到的問題和相應的解決方案。儘管整個解決方案以 memcached 爲基本單元,但我們可以任意地將 memcached 替換成 redis、boltDB、levelDB 等其它服務作爲緩存單元。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在下文中,需要注意兩個詞語的區別:"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"memcached:指 memcached 源碼或運行時,即單機版"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"memcache:指基於 memcached 構建的分佈式緩存系統,即分佈式版"}]}]}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Background"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"與大部分互聯網公司的讀寫流量特點類似,FB 的整體業務呈現出明顯讀多寫少的特點,其讀請求量比寫請求量高出若 "},{"type":"text","marks":[{"type":"strong"}],"text":"2"},{"type":"text","text":" 個數量級 (數據來自於 "},{"type":"link","attrs":{"href":"https://www.usenix.org/sites/default/files/conference/protected-files/nishtala_nsdi13_slides.pdf","title":null},"content":[{"type":"text","text":"slides"}]},{"type":"text","text":"),因此增加緩存層可以顯著提高業務穩定性,保護 DB。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Pre-memcache"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在使用緩存層之前,FB 的 Web Server 直接訪問數據庫,通過 "},{"type":"text","marks":[{"type":"strong"}],"text":"數據分片"},{"type":"text","text":" 和 "},{"type":"text","marks":[{"type":"strong"}],"text":"一主多從"},{"type":"text","text":" 的方式來扛住讀寫流量:"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ff/ff1d6266f41494640c94e200bfa9e6db.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但隨着用戶數數量飆升,單純靠數據庫來抗壓成本高,效率低。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Design Requirements"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"系統設計的第一步,就是要明白系統的要求。FB 緩存層的設計要求可以概括爲:"}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"數據讀 QPS 爲 10億"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"支持多區域"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"支持平滑升級"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"緩存層與持久化層數據保持 "},{"type":"text","marks":[{"type":"strong"}],"text":"最大努力最終一致性 (best-effort eventual consistency)"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Cache Policy"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"memcache 的 cache policy 可以用 2 個詞概括:"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"demand-filled look-aside (read)"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"write-invalidate (write)"}]}]}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a8/a8c047a3cbccd67ef3a5f34a65cdfea0.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如上圖所示:demand-filled look-aside 指"},{"type":"text","marks":[{"type":"strong"}],"text":"讀數據時"},{"type":"text","text":",web server 先嚐試從 memcache 中讀數據,若讀取失敗則從持久化存儲中獲取數據填充到 memcache 中;"},{"type":"text","marks":[{"type":"strong"}],"text":"寫數據時"},{"type":"text","text":",先更新數據庫,然後將 memcache 中相應的數據刪除。採用 write-invalide 的主要原因有兩個:"}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"刪除操作冪等,當任何異常發生時可以重試;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"write-invalidate 與 demand-filled 在語義上是天作之合;"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這裏需要注意的是,不論使用哪種 cache policy,沒有分佈式事務的支持都無法保證緩存層數據與持久化層數據的一致性。但實踐中往往不需要二者的強一致保證,因此類似 look-aside demand-filled 和 write-invalidate 這種組合策略在實踐中比較流行。更多有關 Cache Policies 的討論歡迎閱讀我的 "},{"type":"link","attrs":{"href":"https://xie.infoq.cn/article/fa1f0f9ac1cfee7845f7b29fe","title":""},"content":[{"type":"text","text":"這篇博客"}]},{"type":"text","text":"。"}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"In a Cluster: Latency and Load"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本節探討在一個集羣內部部署上千個 memcached 服務遇到的挑戰和相應的解決方案。在這個規模上,系統優化的主要精力集中在如何減少獲取緩存數據的時延 (latency),抵抗 cache miss 時造成的負載壓力 (load)。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Scale-out"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"隨着用戶數量增加,服務本身可以通過橫向擴容支撐更高的併發請求,相應地緩存層也需要擴容。FB 採用的是一種常見的擴容方案:"},{"type":"text","marks":[{"type":"strong"}],"text":"部署多個 memcached 服務,形成單個緩存集羣,並通過 consistent hashing 將緩存數據散列在不同的 memcached 實例上"},{"type":"text","text":"。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"High Fanout"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在 FB 的服務中,載入一個熱門的網頁平均需要從 memcache 中獲取 521 條不同的數據,如果出現 cache miss 則需要從持久化存儲中獲取數據,這些數據讀取請求的時延都將影響到服務的質量。通常不同數據的讀取之間存在一定的先後依賴關係,可以表示成一個有向無環圖 (DAG),如下圖所示:"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c3/c3f3e2bf7404ca51c676e9862d5ca68b.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們稱這種放射狀的數據讀取模式爲 fanout。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Reducing Latency"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"面對 high fanout,memcache 集羣首先要面對的問題就是 "},{"type":"text","marks":[{"type":"strong"}],"text":"all-to-all communication"},{"type":"text","text":"。由於緩存數據被散列到不同的 memcached 實例上,每個 web server 都可能需要與所有 memcached 服務通信:"}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"由於 fanout 的存在,處理每個請求需要各個 web server 從多個 memcached 實例上獲取數據,如果這些數據在短時間內忽然到來,可能造成網絡擁堵,即 incast congestion;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"由於每個 memcached 實例都持有一部分數據,這使得每個實例在高負載下都有可能成爲服務瓶頸;"}]}]}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"Parallel Requests and Batching"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"面對這些問題,應用層上至少可以做一件事:parallel requests and batching。由於每個請求可能需要在同一個 memcached server 上取多條數據,那麼我們可以在 web server 的邏輯中減少 RTT 次數,將可以一起取的數據通過一次 RTT 一併取出,減少時延。"}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"Client-server Communication"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在緩存層上,FB 的主要思路就是將控制邏輯集中到 memcache client 上。memcache client 分成兩部分:sdk 與 proxy,後者被稱爲 mcrouter。mcrouter 向外暴露與 memcached 相同的接口,在 web server 與 memcached server 之間增加一層抽象。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於讀多寫少,且讀數據對錯誤的容忍度高,因此 memcache client 使用 UDP 與 memcached server 通信,因爲 UDP 沒有連接的概念,通常處理讀請求時都是由 sdk 與 memcached server 之間直接通信。sdk 使用 UDP 通信時,一旦發現丟包或者順序錯誤,就會報錯,而不嘗試解決錯誤。在論文發表當時,FB 的服務高峯期中,只有約 0.25% 的讀請求被拋棄。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"處理寫請求時,memcache sdk 使用 TCP 與部署在該宿主機上 mcrouter 通信。如此一來,每個 web server 就只需要與單個 mcrouter 建立連接,由後者來保持與不同 memcached server 之間的連接,從而大大減少維持 TCP 連接、處理網絡 I/O 所需的 CPU 與內存資源,這種做法通常被稱爲 connection coalescing。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從統計數據上看,通過使用 UDP 來處理讀請求能夠將讀請求的整體時延降低 20% 左右。"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/2d/2dbed11b564155878a29d70f650b8413.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"Incast Congestion"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲解決 incast congestion 問題,memcache clients 也實現了擁塞控制邏輯。類似於 TCP 的 congestion control,client 的滑動窗口會根據網絡擁堵狀況自動擴容和縮容。與 TCP 不同的是,來自於同一個 web server 的請求都會被放入同一個滑動窗口中。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下圖展示的是 window size 對請求時延的影響:"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/53/5341c7e4e835c531f8b72db7b5f97673.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"window size 太小時,許多請求都在排隊;window size 太大時,可能出現網絡擁堵。因此動態地找到其中的 sweet spot 就是擁塞控制的主要目標。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Reducing Load"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"使用 memcache 可以減少請求直接訪問 DB 的次數,但出現 cache miss 時,DB 依然會承受負載壓力,一條熱點數據可能造成瞬間高壓。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Leases"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"FB 在 memcache 中通過引入 leases 來解決兩個問題:"}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"stale set:過期寫入"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"thundering herds:瞬間高壓"}]}]}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"Stale Set"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"look-aside cache policy 下可能發生數據不一致:"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/df/df0d8d3a8f2d2e3b684c05f20f6caab7.jpeg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"假設兩個 web server, x 和 y,需要讀取同一條數據 d,其執行順序如下:"}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"x 從 memcache 中讀取數據 d,發生 cache miss,從數據庫讀出 d = A;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"另一個 memcache client 將 DB 中的 d 更新爲 B;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"y 從 memcache 中讀取數據 d,發生 cache miss,從數據庫讀出 d = B;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"y 將 d = B 寫入 memcache 中;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"x 將 d = A 寫入 memcache 中;"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"此時,在 d 過期或者被刪除之前,數據庫與緩存內的數據將保持不一致的狀態。引入 leases 可以解決這個問題:"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"每次出現 cache miss 時返回一個 lease id,每個 lease id 都只針對單條數據;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當數據被刪除 (write-invalidate) 時,之前發出的 lease id 失效;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"寫入數據時,sdk 會將上次收到的 lease id 帶上,memcached server 如果發現 lease id 失效,則拒絕執行;"}]}]}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"Thundering Herds"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"look-aside cache policy 的另一個問題是可能引發瞬間高壓:"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c8/c8836888c8e12313a0804fcd3e847ba3.jpeg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當數據出現訪問熱點時,可能導致成千上萬個請求同時發生 cache miss,從而重擊 DB。通過擴展 lease 機制可以解決這個問題。每個 memcached server 都會控制每個 key 的 lease 發放速率。默認配置下,每個 key 在 10 秒內只會發放一個 lease,餘下訪問同一個 key 的請求都會被告知"},{"type":"text","marks":[{"type":"strong"}],"text":"要麼等待一小段時間後重試或者拿過期數據"},{"type":"text","text":"走人。通常在數毫秒內,獲得 lease 的 web server 就會將數據填上,這時其它 client 重試時就會成功,整個過程只有一個請求會穿透到 DB。"}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"Stale Values"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在一些能夠容忍過期數據的場景下,我們還有可能進一步減少負載。當數據被刪除時,memcached server 可以將它短暫地保存到另一個數據結構中,後者存儲着最新刪除的數據。此時 web server 可以自行決定是等待新的數據還是讀取過期數據。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Memcache Pools"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"將 memcache 作爲通用緩存層意味着所有的、不同的 workloads 將共享這一基礎設施。不同的 workload 之間可能互補,也可能互斥。從更新頻率的維度出發,FB 團隊統計了不同更新頻率數據的 working set 大小:"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/df/df451f4c429fda771f9ea6d444da002f.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可以看出,緩存層存儲的所有數據中,更新頻率高的佔大頭。這時候就有可能出現更新頻率高的數據將更新頻率低的數據從緩存中擠出的現象。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了解決這種問題,FB 將一個集羣內部的 memcached 實例分成不同的 pools:"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"default(wildcard) pool:默認 pool 用來存儲大部分數據"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"small pool:存儲訪問頻率高但 cache miss 的成本不高的數據"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"large pool:存儲訪問頻率低但 cache miss 的成本特別高的數據"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"…"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可以理解成是 Bulkheads 設計模式的應用。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Replication Within Pools"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在一些 pools 內部,當一個 memcached 實例無法承載讀壓力時,可以通過副本 (replication) 來提高讀效率,降低時延,即"},{"type":"text","marks":[{"type":"strong"}],"text":"將整個 memcached 實例中的數據複製到另一個實例中"},{"type":"text","text":"。之所以選擇複製整個實例的數據而不是在更細粒度上覆制數據,主要目的在於不想增加 web server 獲取數據所需的 RTT 次數:如果只複製一部分數據,原本只需要一次批量讀取請求就能獲取的數據,就可能需要通過請求多個實例來獲取,這反而可能增加時延,降低效率。當然,這種方案也需要數據發生更新時,需要讓它在多個副本中失效纔行,所以本質上是一個"},{"type":"text","marks":[{"type":"strong"}],"text":"寫效率換讀效率"},{"type":"text","text":"的過程。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Handling Failures"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在雲原生環境中,memcached server 同樣可能遭遇網絡失聯或者自身宕機。如果整個數據中心出現大面積問題,FB 會將用戶請求直接轉移到另一個數據中心;如果只是少數幾個 server 因爲網絡原因失聯,則依賴於一種自動恢復機制,通常恢復需要幾分鐘時間,但幾分鐘就有可能將 DB 和後臺服務擊垮。爲此, FB 團隊專門用少量的機器配置一個小的 memcache 集羣,稱爲 Gutter。當集羣內部少量的 server 發生故障時,memcached client 會將請求先轉發到 Gutter 中。可以理解爲 Gutter 是備胎,平時不工作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Gutter 與普通的 rehash 不同,後者將失聯機器的負載轉嫁到了剩餘的 server 上,可能造成雪崩效應/鏈式反應。"}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"In a Region: Replication"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"隨着用戶的訪問量繼續增大,你可能會想要購買更多的機器來部署 web server 和 memcached server,實現橫向擴容。然而簡單地橫向擴容不能解決所有問題。越來越多的用戶會將原本不嚴重的問題暴露出來:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"用戶增多會導致熱點數量增多、單個熱點熱度增大"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"由於 memcached client 需要與所有 memcached server 通信,incast congestion 問題會更嚴重"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因此有必要將 memcached servers 分成多個集羣,將熱點問題和網絡問題分而治之。多個集羣將繼續共享同一個 DB 集羣:"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/be/be5446fdf54ddfab4e730afe7205e60e.jpeg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Regional Invalidations"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"部署多個 memcached server 集羣,"},{"type":"text","marks":[{"type":"strong"}],"text":"同一條數據的不同版本可能會出現在不同集羣上"},{"type":"text","text":"。一種簡單的解決方案是讓 web server 每次發生 cache miss 時,將所有集羣中的對應數據刪除。顯然這會造成大量的跨集羣通信,又重新引發了網絡問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"既然數據在 DB 中只有一份,何不利用 DB 數據的更新日誌來保證數據在不同集羣間的最終一致性?"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/8e/8e060be47894af21fc98dd1301cbc914.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"FB 在持久化層中使用 MySQL 集羣,於是它們順着思路開發了 mcsqueal 中間件,並將其部署到每個 MySQL 集羣上。mcsqueal 負責讀取 MySQL 的 commit log,解析其中的 SQL 語句,捕獲數據更新信息,並將其廣播給所有 memcached 集羣。"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f1/f1eac629b9b125b5e2a966373c9d5874.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從架構圖中,不難看出 fanout 問題再次出現,大量的跨集羣通信數據同樣可能將網絡打垮。解決方案也不難想到,即"},{"type":"text","marks":[{"type":"strong"}],"text":"分而治之"},{"type":"text","text":":"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c0/c09403a2690114afe863c23aeb40ecda.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一個區域內部部署多個 memcache 集羣能夠給我們帶來諸多好處,除了緩解熱點問題、網絡擁堵問題,還能讓運維人員方便地下線單個節點、集羣,而不至於使得 cash miss rate 忽然增大。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Regional Pools"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"是否所有數據都需要在一個區域中儲存多份?如果一些數據訪問頻率很低,存一份就足夠了。基於該思路,FB 會在單個區域內單獨劃分一個 pool 用來存儲一些訪問率低的數據。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Cold Cluster Warmup"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上線新的 memcache 集羣時,如果不預熱可能會出現大量 cache miss。因此 FB 團隊構建了一個 Cold Cluster Warmup 系統,可以讓新的集羣在發生 cache miss 時先從已經加載好數據的集羣中獲取數據,而不是從持久化存儲中,如此一來,集羣上線就能夠變得更加平滑。"}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Across Regions: Consistency"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"隨着 FB 的服務推廣到世界各地,將 web servers 推進到離用戶最近的地方能夠給用戶帶來更好的體驗;將 FB 的數據中心同步到不同區域 (region),也能幫助提高 FB 服務的容災能力;在新的區域可能在各方面產生規模經濟效應。因此 memcache 服務也需要能夠被部署到多個區域。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"利用 MySQL 的複製機制,FB 將一個區域設置爲 master 區域,而其它區域爲只讀區域,負責從 master 中同步數據。web servers 處理讀請求時只需要訪問本地的 DB 或緩存服務即可:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/0b/0b648de1e329ef592daeaadf597d3684.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但這裏將產生一個新的問題:"},{"type":"text","marks":[{"type":"strong"}],"text":"只讀區域的數據庫有同步延遲,可能導致競爭條件出現"},{"type":"text","text":"。想象以下這個場景:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/54/5421f5886c0d00b9ec67a78983635dd0.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"複製集羣中的 web server A 寫入數據到 master DB"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"A 將本地 memcache 中的數據刪除"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"複製集羣中的 web server B 從 memcache 中讀取數據發生 cache miss,從本地 DB 中獲取數據"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"A 寫入的數據從 master DB 中同步到 replica DB,並通過 mcsqueal 將本地 memcache 中的數據刪除"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"web server B 將其讀到的數據寫入 memcache 中"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"此時,DB 與 memcache 中的數據將再次出現不一致,且必須等待數據過期之後才能恢復。如何解決這個問題?FB 在 memcache 上引入 remote marker 機制:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/99/997b39d41b04b0927d66d640d5b8f38c.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當 replica 區域的 web server 需要寫入某數據 d 時:"}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"在本地 memcache 上打上 remote marker,標記爲 rd"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"將 d 寫入到 master DB 中"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"將 d 從 memcache 中刪除 (rd 不刪除)"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"等待 master DB 將數據同步到本地 replica DB 中,並且在 SQL 語句中埋入 rd 的信息"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"本地 replica DB 通過 mcsqueal 解析 SQL 語句中,刪除 remote marker rd"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當 replica 區域的 web server 想要讀取數據 d 發生 cache miss 時:"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果 memcache 中數據 d 帶了 rd,則從 master DB 中讀取數據"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果 memcache 中數據 d 沒有 rd,則直接從本地的 replica DB 中讀取數據"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"remote marker 機制實際上就是標記了 "},{"type":"text","marks":[{"type":"strong"}],"text":"數據寫入 master DB 但尚未同步到 replica DB"},{"type":"text","text":" 的中間狀態。"}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"References"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Usenix 2013: Scaling Memcache at Facebook, "},{"type":"link","attrs":{"href":"https://www.youtube.com/watch?v=6phA3IAcEJ8","title":null},"content":[{"type":"text","text":"video"}]},{"type":"text","text":", "},{"type":"link","attrs":{"href":"https://www.usenix.org/sites/default/files/conference/protected-files/nishtala_nsdi13_slides.pdf","title":null},"content":[{"type":"text","text":"slides"}]},{"type":"text","text":", "},{"type":"link","attrs":{"href":"https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final170_update.pdf","title":null},"content":[{"type":"text","text":"paper"}]}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章