Storage Master Class | Introduction to RDMA and Programming Basics

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"About the author: Ma Qiang","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Consulting software engineer, primarily responsible for R&D on the QingStor NEONSAN product. He has more than ten years of experience in back-end product development and a deep understanding of the practice and application of enterprise storage technologies.","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This week's Storage Master Class covers “Introduction to RDMA and Programming Basics”. In today's session we walk through what RDMA is, how it works, the network protocols that support it, the basics of RDMA programming, and the details of RDMA operations. Read carefully!","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#00a971","name":"user"}}],"text":"What Is RDMA","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"RDMA (Remote Direct Memory Access) is a way of moving data held in buffers between two applications over a network.","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Remote: data is transferred over the network to and from a remote machine.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Direct: the kernel is not involved; everything related to sending and transferring is offloaded to the NIC.
","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Memory: data moves directly between user-space virtual memory and the NIC without involving the kernel, with no extra data movement or copying.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Access: operations such as send, receive, read, write, and atomic.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"RDMA differs from traditional network interfaces because it bypasses the operating system kernel. This gives programs that use RDMA the following characteristics:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The absolute lowest latency","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The highest throughput","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The smallest CPU footprint (that is, CPU involvement is minimized)","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/30/30bf74848da7a28dc05400bdaeaa57b4.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#00a971","name":"user"}}],"text":"RDMA
Working Principles","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7c/7c9816d16b8e426e414f470ba3ba1d23.jpeg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Traditional TCP/IP communication works as follows:","attrs":{}}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Received packets are buffered in the system; after they are processed, the corresponding data is assigned to a TCP connection.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"The receiving system then associates the unsolicited TCP data with the corresponding application and copies the data from the system buffer to the destination memory address.","attrs":{}}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As this shows, when data is sent and received, the source copies and encapsulates it layer by layer from the application layer downward, and the destination copies and decapsulates it from the bottom up. This is relatively slow and requires the CPU to step in many times.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ab/ab478684efe1857577da2a6a9ea10dfe.jpeg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"RDMA
communication is different: for send, receive, read, and write operations alike, the NIC transfers data directly to and from the registered memory regions involved. This is fast and needs no CPU involvement in the data path; the RDMA NIC takes over the CPU's work, and the resources saved can be used for other computation and services.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"RDMA works as follows:","attrs":{}}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"When an application issues an RDMA read or write request, no data is copied. The request goes from the application running in user space to the local NIC without involving any kernel memory.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"The NIC reads the contents of the buffer and sends them over the network to the remote NIC.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"The RDMA message sent over the network contains the virtual memory address on the target machine and the data itself. Request completion can be handled entirely in user space (by polling the RDMA completion queue from user space). RDMA operations let an application read data from, or write data to, the memory of a remote application.","attrs":{}}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In short, RDMA uses dedicated hardware and network technology to let the NIC read and write the memory of a remote server directly, which yields high bandwidth, low latency, and low resource usage. The application does not take part in the data transfer; it only specifies the memory addresses to read or write, starts the transfer, and waits for it to complete.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#00a971","name":"user"}}],"text":"Network Protocols That Support RDMA","attrs":{}},{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As a host-offload, host-bypass technology, RDMA makes low-latency, high-bandwidth, direct memory-to-memory data communication possible. The network protocols that currently support RDMA are:
","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"InfiniBand (IB): a new-generation network protocol that has supported RDMA from the start. Because it is a new network technology, it requires NICs and switches that support it.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"RDMA over Converged Ethernet (RoCE): a network protocol that allows RDMA to run over Ethernet. It lets RDMA be used on standard Ethernet infrastructure (switches), but the NICs must be special ones that support RoCE.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Internet Wide Area RDMA Protocol (iWARP): essentially RDMA over TCP, a network protocol that allows RDMA to run over TCP. It lets RDMA be used on standard Ethernet infrastructure (switches), but the NICs must support iWARP if CPU offload is used. Otherwise the whole iWARP stack can be implemented in software, at the cost of most of RDMA's performance advantage.","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#00a971","name":"user"}}],"text":"RDMA Programming Basics","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To use RDMA, we need a NIC that supports RDMA communication (that is, one that implements an RDMA engine). We call such a card an HCA (Host Channel Adapter). Over the PCIe (Peripheral Component Interconnect Express) bus, the adapter creates a channel from the RDMA engine to application memory. A good HCA implements all the logic required by the RDMA
protocol in hardware. This includes packetization, reassembly, flow control, and reliability guarantees. From the application's point of view, all that remains is to manage the buffers.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/cc/cc783e4dd9bade7a7189a48f2c3cce54.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the figure above shows, in RDMA programming we use a command channel to call the kernel-mode driver and set up a data channel, which lets us bypass the kernel completely when moving data. Once the data channel is established, we can read and write data buffers directly. The API used to set up the data channel is called “verbs”. The verbs API is maintained by a Linux open-source project called the OpenFabrics Enterprise Distribution (OFED). It is completely different from the socket programming API, but once you have grasped a few concepts it becomes very easy to use and makes designing your program simpler.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#00a971","name":"user"}}],"text":"1.
Key Concepts","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"RDMA operations start by pinning memory. Pinning tells the kernel that this memory is spoken for: it belongs to your application. You then tell the HCA to address this memory and prepare a channel from the HCA to it. We call this registering a memory region (MR). At registration time you can set the access permissions of the region (including local write, remote read, remote write, atomic, and bind). Registration is done by calling the verbs API ibv_reg_mr, which returns the MR's local and remote keys. The local key is used by the local HCA to access local memory. The remote key is given to a remote HCA so that it can access local memory. Once the MR is registered, this memory can be used for any RDMA operation. The figure below shows registered memory regions (MRs) and the buffers inside them that the communication queues use.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/21/21fa8f094647ac58f893fc3172f2c993.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"RDMA communication is built on a set of three queues: the send queue (SQ), the receive queue (RQ), and the completion queue (CQ). The SQ and RQ schedule work; they are always created as a pair, called a queue pair (QP). The CQ is used to deliver notifications when instructions placed on the work queues have completed.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Placing an instruction on a work queue tells the HCA which buffers are to be sent or used to receive data. These instructions are small structures called work requests (WRs) or work queue elements (WQEs). A WQE mainly contains a pointer to a buffer. A WQE placed on the send queue (SQ) contains a pointer to the message to be sent. A WQE placed on the receive queue (RQ) contains a pointer to a buffer that will hold the incoming message.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"RDMA is an asynchronous transport mechanism, so we can queue many send or receive WQEs on a work queue at once. The HCA processes these WQEs in order, as fast as it can. Once a WQE
has been processed, the data has been moved. When the transfer completes, the HCA creates a completion queue element (CQE) with a success status and places it on the completion queue (CQ). If the transfer fails for any reason, the HCA likewise creates a CQE, with a failure status, and places it on the CQ.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#00a971","name":"user"}}],"text":"2. A Simple Example","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let's look at a simple example in which we move the data in a buffer from the memory of system A to the memory of system B. This is what we call message-passing semantics. The operation described next is SEND, the most basic RDMA operation type.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Step 1: Systems A and B have each created the completion queues (CQs) for their QPs and registered the memory regions (MRs) for the upcoming RDMA transfer. System A has identified a buffer whose data will be moved to system B. System B has allocated an empty buffer to hold the data sent by system A.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/90/90141778babff7ccacbc4a2fd43076f4.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Step 2: System B creates a WQE and places it on its receive queue (RQ). This WQE contains a pointer to the memory buffer that will hold the received data. System A also creates a WQE and places it on its send queue (SQ); the pointer in that WQE points to the memory buffer whose data is to be sent.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/81/812ac371642d30a337a2434bd42d1f39.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Step 3: The HCA on system A is always at work in hardware, checking whether there are WQEs on the send queue. The HCA consumes the WQE from system A and streams the data in the memory region to system B. When the data stream starts arriving at system B, the HCA on system B consumes system B's
WQE and places the data into the designated buffer. The data stream travelling over the high-speed channel bypasses the operating system kernel completely.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c7/c7957a82d1aee550549e3b0889690542.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Note: the arrows on the WQEs represent pointers (addresses) into user-space memory. In receive/send mode, both sides must prepare their own WQEs (on their work queues) in advance, and the HCA writes to the CQ when it finishes.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Step 4: When the data movement completes, the HCA creates a CQE and places it on the completion queue (CQ) to indicate that the transfer has finished. The HCA generates one CQE for every WQE it consumes. A CQE on system A's completion queue therefore means that the send operation of the corresponding WQE has completed; likewise, a CQE is placed on system B's completion queue to indicate that the receive operation of the corresponding WQE has completed. If an error occurs, the HCA still creates a CQE. The CQE contains a field that records the transfer status.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f2/f2c47561d3975beb37eaef935bd45f28.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What we have just walked through is an RDMA Send operation. On IB or RoCE, the total time to transfer the data in a small buffer is about 1.3 µs. By creating many WQEs at once, the data held in millions of buffers can be transferred within one second.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#00a971","name":"user"}}],"text":"3.
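RDMA Operation Details","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In RDMA transfers, SEND/RECEIVE is a two-sided operation: both parties must take part, and the RECEIVE must be posted before the SEND so that the peer can send data. If the peer has no data to send, the RECEIVE can be skipped. The process therefore resembles traditional communication; the difference lies in RDMA's zero-copy networking and kernel bypass, which give low latency. It is mostly used for short control messages.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"WRITE/READ is a one-sided operation: as the name suggests, only one party performs the read or write. In practice the client performs the WRITE/READ, and the server does not need to do anything. In an RDMA WRITE, the client pushes data from a local buffer directly into a contiguous block of memory in the remote QP's virtual address space (the physical memory is not necessarily contiguous), so it must know the destination address (remote addr) and the access permission (remote key). In an RDMA READ, the client pulls data directly from a contiguous block of memory in the remote QP's virtual address space into a local destination buffer, so it likewise needs the remote QP's memory address and access permission. One-sided operations are mostly used for bulk data transfer.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As this shows, for one-sided operations the client must know the remote QP's remote addr and remote key, and these two pieces of information can be exchanged through SEND/RECEIVE operations.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#00a971","name":"user"}}],"text":"3.1 RDMA One-Sided Operation (RDMA READ)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"READ and WRITE are one-sided operations: only the local side needs to know the source and destination addresses of the data, and the remote application need not be aware of the communication. The data is read or written through RDMA between the NIC and the application buffer, and the remote NIC then packages the result into a message and returns it to the local side.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Taking storage in a storage-network environment as an example, the data flow of a one-sided operation is as follows:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1.   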
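First, A and B establish a connection; the QPs have been created and initialized.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2.   The data is stored in B's buffer at address VB. Note that VB must be registered with B's NIC in advance (as a memory region), which returns a remote key; the key amounts to the permission for RDMA to operate on this buffer.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3.   B packs the data address VB and the key into a dedicated message and sends it to A, which in effect hands A the right to operate on the data buffer. At the same time, B posts a WR to its WQ to receive the status that A returns for the data transfer.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4.   After A receives the address VB and the remote key from B, its NIC packs them, together with A's own storage address VA, into an RDMA READ request and sends the request to B. No software on either A or B needs to take part in this process for B's data to be stored at A's virtual address VA.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5.   When the data has been stored, A returns the status of the whole data transfer to B.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The one-sided transfer mode is the biggest difference between RDMA and traditional network transfer: it only requires a virtual address for direct access to the remote side, with no involvement from the remote application. This mode is suited to bulk data transfer.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#00a971","name":"user"}}],"text":"3.2 RDMA One-Sided Operation (RDMA WRITE)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Taking storage in a storage-network environment as an example, the data flow of a one-sided operation is as follows:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1.   First, A and B establish a connection; the QPs have been created and initialized.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2.   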
The remote destination buffer for the data is at address VB. Note that VB must be registered with B's NIC in advance (as a memory region), which returns a remote key; the key amounts to the permission for RDMA to operate on this buffer.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3.   B packs the data address VB and the key into a dedicated message and sends it to A, which in effect hands A the right to operate on the data buffer. At the same time, B posts a WR to its WQ to receive the status that A returns for the data transfer.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4.   After A receives the address VB and the remote key from B, its NIC packs them, together with A's own source address VA, into an RDMA WRITE request. No software on either A or B needs to take part in this process for A's data to be sent to B's virtual address VB.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5.   When the data has been sent, A returns the status of the whole data transfer to B.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The one-sided transfer mode is the biggest difference between RDMA and traditional network transfer: it only requires a virtual address for direct access to the remote side, with no involvement from the remote application. This mode is suited to bulk data transfer.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#00a971","name":"user"}}],"text":"3.3 RDMA Two-Sided Operation (RDMA SEND/RECEIVE)","attrs":{}},{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In RDMA, SEND/RECEIVE is a two-sided operation: the remote application must be aware of and take part in the exchange for it to complete. In practice, SEND/RECEIVE is mostly used for connection-control messages, while data messages are mostly carried by READ/WRITE.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Taking a two-sided operation as an example, host A sends data to host B (A and B for short) as follows:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1.   
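First, A and B each create and initialize their own QP and CQ.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2.   A and B each post a WQE to their own WQ. For A, WQ = SQ and the WQE points to the data waiting to be sent; for B, WQ = RQ and the WQE points to a buffer that will store the data.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3.   A's NIC asynchronously schedules A's WQE, determines that it is a SEND message, and sends the data to B directly from the buffer. When the data stream reaches B's NIC, B's WQE is consumed and the data is stored directly at the location the WQE points to.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4.   When the communication between A and B completes, a completion entry (CQE) is generated in A's CQ to indicate that the send has finished. At the same time, a completion entry is generated in B's CQ to indicate that the receive has finished. Every WQE processed in a WQ produces a CQE.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Two-sided operations resemble the underlying buffer pool of a traditional network: the way sender and receiver take part is no different, and the distinction lies in zero copy and kernel bypass. For RDMA this is in fact a relatively complex message-transfer mode, mostly used for short control messages.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#00a971","name":"user"}}],"text":"Summary","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"RDMA passes buffers directly between application software on two endpoints over the network; compared with traditional network transfer, it needs no intervention from the operating system or the protocol stack. RDMA readily achieves ultra-low-latency, ultra-high-throughput transfer between endpoints with almost no involvement from the CPU or OS, and without spending excessive additional resources on processing and moving network data. RDMA provides a programming interface completely different from sockets, so to use it you need to learn the native RDMA programming APIs (verbs/RDMA_CM); a solid understanding of RDMA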
technology is a prerequisite for doing the development well.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#00a971","name":"user"}}],"text":"More Articles","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=Mzg3OTU2MzU2NA==&mid=2247484259&idx=1&sn=9820de34620195bd853846681da2b978&chksm=cf03c520f8744c36aa67e698e737f16e02c6ff6a888a3972b9cf09c94989bb12eac9c001e01d&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"How to Build the Storage Foundation of the Cloud Era","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=Mzg3OTU2MzU2NA==&mid=2247483933&idx=1&sn=9dafe567fd4e4cda14633ac90e7be165&chksm=cf03c45ef8744d48687ebe66c3a68c216a21d1e0450b5599bdbe48210ef5eece8b8b8e0e720b&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"Mounting Object Storage for Traditional Applications to Meet High Cluster Concurrency","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=Mzg3OTU2MzU2NA==&mid=2247483912&idx=1&sn=5a5a90c6278ec0462c84c848c48fbf76&chksm=cf03c44bf8744d5da51206430e2ec891883684f23a03c6a78ed85f161ac5fed3ccfbea547e33&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"QingStor Object Storage: Architecture Design and Best Practices","attrs":{}}]}]}]}
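
The SEND/RECEIVE walkthrough in the article (post a receive WQE, post a send WQE, let the HCA move the data, then poll each CQ) can be sketched as a toy model. This is only an in-process simulation under stated assumptions, not the verbs API: `ToyEndpoint`, `WQE`, `CQE`, and `toy_hca_transfer` are hypothetical names invented for illustration, and real code would go through libibverbs and an actual HCA.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WQE:
    """Toy work queue element: an id plus a reference to a buffer."""
    wr_id: int
    buf: bytearray


@dataclass
class CQE:
    """Toy completion queue element carrying a status field."""
    wr_id: int
    status: str          # "success" or "error"
    byte_len: int = 0


@dataclass
class ToyEndpoint:
    """Hypothetical stand-in for one side's QP (SQ + RQ) and its CQ."""
    sq: deque = field(default_factory=deque)
    rq: deque = field(default_factory=deque)
    cq: deque = field(default_factory=deque)

    def post_send(self, wqe: WQE) -> None:
        self.sq.append(wqe)

    def post_recv(self, wqe: WQE) -> None:
        self.rq.append(wqe)

    def poll_cq(self):
        # Return the oldest completion, or None if the CQ is empty.
        return self.cq.popleft() if self.cq else None


def toy_hca_transfer(sender: ToyEndpoint, receiver: ToyEndpoint) -> None:
    """Consume one send WQE and one receive WQE, as the two HCAs would."""
    if not sender.sq:
        return
    send = sender.sq.popleft()
    if not receiver.rq:
        # Simplification: the toy fails at once when no receive is posted;
        # real hardware may retry before reporting an error.
        sender.cq.append(CQE(send.wr_id, "error"))
        return
    recv = receiver.rq.popleft()
    n = len(send.buf)
    if n > len(recv.buf):
        # Receive buffer too small: both sides get a failed CQE.
        sender.cq.append(CQE(send.wr_id, "error"))
        receiver.cq.append(CQE(recv.wr_id, "error"))
        return
    recv.buf[:n] = send.buf            # the "data movement" step
    sender.cq.append(CQE(send.wr_id, "success", n))
    receiver.cq.append(CQE(recv.wr_id, "success", n))


# Usage: B posts its receive before A posts its send.
a, b = ToyEndpoint(), ToyEndpoint()
recv_buf = bytearray(16)
b.post_recv(WQE(wr_id=1, buf=recv_buf))
a.post_send(WQE(wr_id=7, buf=bytearray(b"hello")))
toy_hca_transfer(a, b)
send_cqe = a.poll_cq()
recv_cqe = b.poll_cq()
```

The model keeps the property the article stresses: every consumed WQE yields exactly one CQE, and failures are reported through the CQE status field rather than through a separate path.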