The Distributed Philosophy of GFS: HDFS Owes Its Consistency Achievements to My Failures...

{"type":"doc","content":[
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GFS (Google File System) is a distributed file system developed by Google. Although GFS was widely used inside Google, it remained largely unknown to the outside world for quite a long time. Only after Google published a paper [1] in 2003 describing GFS in detail did people begin to understand it. Open source software then began to imitate GFS; HDFS, covered in Chapter 3, is one such imitator."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"I. GFS External Interface and Architecture"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let us start with the interface design and the architecture design of GFS."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. The GFS External Interface"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GFS adopts an interface that people are very familiar with, but it does not implement the standard POSIX file interface. The usual GFS operations include create, delete, open, close, read, write, and record append. These closely resemble the standard file interface defined by POSIX, but are not fully identical to it."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The semantics of create, delete, open, and close are similar to those of the standard POSIX interface, so they are not described one by one here. The semantics of write and record append are described in detail below."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"write (random write): writes data of arbitrary length at a specified position in a file; this position is also called the offset."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"record append (append at the tail): atomically writes data of less than 16 MB at the end of a specified file. GFS provides this as a separate interface because record append is not simply a write whose offset is set to the end of the file; it is an operation distinct from write, and it is atomic (atomicity is explained later)."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Both write and record append allow multiple clients to operate on one file concurrently; that is, one file may be opened and written by multiple clients at the same time."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. The GFS Architecture"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The architecture of GFS is shown in Figure 2.1:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/79\/79b58807b1fd92610b209d84f5d95772.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 2.1 The GFS architecture (figure taken from the GFS paper [1])"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The main architectural components of GFS are the GFS client, the GFS master, and the GFS chunkserver. A GFS cluster consists of one master and multiple chunkservers, and can be accessed by multiple GFS clients. The three components are described below:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The GFS client is code that runs inside the application process, usually in the form of an SDK."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Files in GFS are divided into fixed-size chunks, each 64 MB long. A GFS chunkserver stores these chunks in its local Linux file system, that is, on local disk. Each chunk is normally stored as three replicas, on three chunkservers. A chunkserver holds many different chunks, and each chunk has an identifier called the chunk handle."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The GFS master maintains the file system metadata, including:"}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The namespace (that is, the file tree of a conventional file system)."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Access control information."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Which chunks make up each file."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Which chunkservers store the replicas of each chunk, that is, the chunk location."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Under this architecture, the components interact as follows."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"1) Interaction between the client and the master"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From the chunk size (the fixed 64 MB) and the offset to operate on, the client can compute which chunk the operation falls on, that is, the chunk index. During a file operation, the client sends the file name and the chunk index to the master, and obtains from the master the chunk handle and chunk location of the chunk to operate on."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Having obtained the chunk handle and chunk location, the client sends a request to a chunkserver recorded in the chunk location, asking to operate on the chunk identified by the chunk handle on that chunkserver."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If a single read crosses a chunk boundary, the client can obtain multiple chunk handles and chunk locations from the master, and split the file read into multiple chunk reads."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Likewise, if a single write crosses a chunk boundary, the file write is split into multiple chunk writes. When a chunk is full, the client needs to send the master an instruction to create a new chunk."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2) The client writes data to chunkservers"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The client sends the data to the three chunkservers that hold the chunk being written. Each chunkserver, on receiving the data, writes it to its local file system. After the client has received success replies from all three chunkservers, it sends a request to the master to report that the chunk was written successfully, and also reports success to the application."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This write flow is highly simplified and abstracted. The actual write flow is more complex: it must take into account the type of write (random write or append) as well as concurrent writes. Section II below (Section 2.2 in the book) describes the write flow in detail and explains how GFS handles the different write types and concurrent writes."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3) The client reads data from a chunkserver"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The client sends a request to one of the chunkservers holding the chunk to be read; the request contains the chunk handle and the byte range to read. The chunkserver reads the data from its local file system according to the chunk handle and byte range, and returns it to the client. Compared with the write flow above, this read flow is not much simplified or abstracted, although the actual read flow includes some optimizations (they have little to do with the subject of the book and are not covered here)."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"II. Details of the GFS Write Flow"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This section covers several details that were left out of the description of writing data above."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. Namespace Management and Lock Protection"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the write flow, the client must contact the master to operate on the master's namespace both when creating a new file and when writing data to a new chunk."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Creating a new file: a new object representing the file is created in the namespace."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Writing data to a new chunk: information about the new chunk is created in the master's metadata."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If multiple clients write at the same time, they also send instructions to create files or create new chunks to the master at the same time. When the master receives multiple requests simultaneously, it uses locks to prevent multiple clients from modifying the metadata of the same file at the same time."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. Leases"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A client needs to write data to three replicas. Under concurrency, multiple clients write to the three replicas at the same time. GFS needs a rule to govern these writes. Put simply, the rule is that for each chunk, exactly one replica manages the concurrent writes of all clients. In other words, for a given chunk, the master grants a chunk lease to one of the replicas, and the replica holding the lease manages all data to be written to that chunk. The replica holding the lease is called the primary replica; the other replicas are called secondary replicas."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. Mutations and Mutation Order"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A write to a file is called a mutation. The primary replica manages the concurrent requests of all clients and has all requests applied to the chunk in a definite order, called the mutation order. There are two kinds of mutation: the write operation and the record append operation described earlier. The basic GFS mutation flow is introduced next. A write follows this basic flow, while a record append adds some special handling to it."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"1) The basic mutation flow:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Figure 2.2 shows the basic GFS mutation flow:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/71\/7150d71dfe76c09d22611f79d4779e96.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 2.2 The basic GFS mutation flow (figure taken from the GFS paper [1])"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The whole write process consists of the following seven steps."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"① When a client wants to perform a write, it asks the master which chunkserver holds the lease for the chunk, and where the other replicas are. If no replica holds the lease for the chunk, the master picks one replica and notifies it that it holds the lease."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"② The master replies to the client with the location of the primary replica and the locations of all secondary replicas. The client contacts the primary replica; if the primary does not respond, or replies that it is not the primary, the client contacts the master again."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"③ The client pushes the data to all replicas, in any order. Each chunkserver caches the data in a buffer."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"④ Once all replicas have acknowledged receiving the data, the client sends a write request to the primary replica; the request identifies the data pushed earlier. On receiving the write request, the primary assigns the write a consecutive serial number, and then writes the data to its local disk in serial-number order."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"⑤ The primary forwards the numbered write request to the secondary replicas. The secondaries also write the data locally in serial-number order, and reply to the primary that the write succeeded."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"⑥ When the primary has received replies from all secondaries, the write has succeeded."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"⑦ The primary replies to the client that the write succeeded. Any error encountered on any replica is reported to the client as a write failure."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2) Atomic record append"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The record append interface is called atomic record append in the paper [1]. It also follows the basic mutation flow, with some additional logic. The client pushes the data to be written (called a record here) to all replicas; if the push succeeds, the client sends a request to the primary replica."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"On receiving the write request, the primary checks whether appending the record at the tail would exceed the chunk boundary. If it would, the primary pads the remaining space of the chunk (what it is padded with does not matter; Section IV, Section 2.4 in the book, explains this padding), tells the secondaries to do the same, and then tells the client to retry the write on the next chunk. If the record fits in the remaining space of the chunk, the primary appends it at the tail, tells the secondaries to write the record at the same position, and finally notifies the client that the operation succeeded."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"III. Atomicity in GFS"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next we analyze the consistency of GFS, starting with atomicity."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. The Difference Between write and record append"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As mentioned earlier, a write that crosses a chunk boundary is split into multiple operations. write and record append behave differently when the data crosses a boundary."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The following examples illustrate this."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Example 1: The file currently has two chunks, chunk1 and chunk2. Client 1 writes 20 MB of data at the 54 MB position. At the same time, client 2 also writes 20 MB of data at the 54 MB position. Both writes succeed."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As stated earlier, the chunk size is fixed at 64 MB. Client 1's write crosses the chunk boundary, so it is split into two operations: the first operation writes the last 10 MB of chunk1, and the second operation writes the first 10 MB of chunk2."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Client 2's write also crosses the chunk boundary and is likewise split into two operations: its first operation (call it the third operation) writes the last 10 MB of chunk1, and its second operation (the fourth operation) writes the first 10 MB of chunk2."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The two clients write concurrently, so the first and third operations execute concurrently on chunk1, and the second and fourth operations execute concurrently on chunk2. Suppose chunk1 executes the first operation and then the third, while chunk2 executes the fourth operation and then the second. Because the later operation overwrites the earlier one, chunk1 ends up holding the data written by client 2 and chunk2 ends up holding the data written by client 1. Although both writes succeeded, the final result is neither what client 1 intended nor what client 2 intended, but a mixture of the two. For client 1 and client 2 alike, the operation was not atomic."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Example 2: The file currently has two chunks, chunk1 and chunk2. A client writes 20 MB of data at the 54 MB position, but the write fails."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This write crosses the chunk boundary, so it is split into two operations: the first writes the last 10 MB of chunk1, and the second writes the first 10 MB of chunk2. The first operation succeeds on chunk1 and the second fails on chunk2. In other words, part of the data is written successfully and part is not. This is not an atomic operation either."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Example 3: The file currently has one chunk, chunk1. A client appends a 12 MB record at the 54 MB position, and the write eventually succeeds."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This record append can write at most 10 MB into chunk1, and the data to be written (12 MB) exceeds the remaining space of the chunk. The remaining space is therefore padded, GFS creates a new chunk, chunk2, and the write is retried on chunk2. This guarantees that a record append takes effect on only one chunk. It avoids splitting a file operation that crosses a boundary into multiple chunk operations, and thereby avoids two non-atomic behaviors: data that is partly written and partly failed, and data from concurrent writes being mixed together."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. What Atomicity Means in GFS"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A single write in GFS may be split into multiple operations spread over multiple chunks. Thanks to the master's locking mechanism and the chunk lease mechanism, a write that takes place on a single chunk is guaranteed to be atomic. But when a file write is split into multiple chunk writes, GFS cannot guarantee that those chunk writes either all succeed or all fail; some chunk writes may succeed while others fail, so the write as a whole is not atomic. record append is called atomic because GFS guarantees that a record append is never split into multiple chunk writes. A write that does not cross a chunk boundary also satisfies GFS atomicity."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. No Atomicity Across Replicas in GFS"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The replicas of a chunk in GFS are not updated atomically. This non-atomic replication shows up as follows: if a write succeeds, it has succeeded on all replicas; if it fails, it may have succeeded on some replicas and failed on others."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With this behavior, a failure has the following consequences:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After a failed write, the client can retry until the write succeeds and a consistent state is reached. But if the client crashes before a retry succeeds, the inconsistency becomes permanent."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A failed record append is also retried, but unlike a write, it is retried not at the original offset but after the failed record. The inconsistency left by the failed record append is therefore permanent, and duplicates can occur as well. If a record is written successfully on some replicas and fails on others, the record append reports failure to the client, and the client retries. If the retry succeeds, the record ends up written twice on some replicas."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The conclusion from the above is that record append guarantees that a record is written atomically at least once (at least once atomic)."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"IV. The Relaxed Consistency of GFS"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GFS calls its consistency a relaxed consistency model. Consistency in GFS is divided into metadata consistency and file data consistency; relaxed consistency refers mainly to file data."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. Metadata Consistency"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"All metadata operations are handled by the single master and are protected by locks, which guarantees both atomicity and correctness."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. File Data Consistency"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Before introducing the relaxed consistency model, let us look at two of its concepts. For a region of a file:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If all clients always see the same data no matter which replica they read from, the region is consistent."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If, after a data mutation, the region is consistent and clients can see all the data written by that mutation, the region is defined."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The GFS paper [1] summarizes the relaxed consistency of GFS as shown in Table 2.1."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b1\/b1a1dcd81448f976e75241bc2ba870fa.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Table 2.1 The relaxed consistency of GFS"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The cases in the table are explained below:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Without concurrency, writes do not interfere with one another. A successful write is defined, and therefore also consistent."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With concurrency, successful writes are consistent but not defined. In Example 1 above, the replicas of chunk1 are consistent with one another, as are the replicas of chunk2, but the data in chunk1 and chunk2 together is neither all of what client 1 wrote nor all of what client 2 wrote."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If a write fails, whether it is a write operation or a record append operation, the replicas can become inconsistent. In Example 2 above, once some writes have failed, the replicas of a chunk may be inconsistent with one another."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"record append guarantees that the appended regions are defined, but some inconsistent regions are interspersed between the defined regions. record append pads data, and whether or not the replicas are padded with the same data, the padded region is regarded as inconsistent. Example 3 above is such a case."}]}]}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. Adapting to the Relaxed Consistency of GFS"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The relaxed consistency model of GFS is in fact an inconsistent model, or more precisely, a model in which inconsistent data is interspersed among consistent data."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This interspersed inconsistent data is unacceptable to applications. How, then, should GFS be used under this kind of consistency? The GFS paper [1] gives several recommendations: rely on appends rather than overwrites, use checkpoints, write self-validating records, and write self-identifying records. Two scenarios illustrate these techniques below."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Scenario 1: A single client writes a file from beginning to end."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Method 1: Write to a temporary file first. After all the data has been written successfully, rename the file to a permanent name. Readers access the file only through the permanent name."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Method 2: The writer writes data periodically and, after each successful write, records a checkpoint of the write progress, which includes an application-level checksum. Readers verify and process only the data up to the checkpoint. Even if the writer crashes, the restarted writer or a new writer resumes writing from the checkpoint, which repairs the inconsistent data."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Scenario 2: Multiple clients concurrently append data to the end of one file, like a producer-consumer queue: multiple producers append messages to the end of a file, and consumers read messages from the file."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Method: Use the record append interface, which guarantees that the data is written successfully at least once. The application, however, must cope with inconsistent data and duplicate data."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To detect inconsistent data, add a checksum to every record. Readers use the checksum to identify inconsistent data and discard it."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Duplicate data can be handled through idempotent processing, in one of two ways. First, if processing the same data multiple times has no adverse effect, nothing more is needed. Second, if processing it multiple times would produce a different result, the application must filter out the duplicates: the writer writes a unique identifier along with each record, and after reading the data the reader uses the identifier to tell whether it has already processed that data."}]}]}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4. The Design Philosophy of GFS"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As described above, applications built on GFS need special measures to cope with the problems caused by its relaxed consistency model. For users, the consistency guarantees of GFS are very unfriendly, and many people are quite surprised the first time they see them."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GFS chose this architecture for reasons rooted in its own design philosophy: keep it simple and good enough. The main problem GFS set out to solve was how to store massive amounts of data on inexpensive servers while achieving very high throughput (GFS did both very well, but that is not the subject of the book and is not covered here). The file system itself also had to be simple and quick to implement (after finishing GFS, its developers soon moved on to developing BigTable [2])."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GFS met these goals well, but it left the consistency problem behind as a burden on its users. The problem was not obvious in the early stage of GFS adoption, because the main users of GFS (BigTable was the main caller of GFS) were the GFS developers themselves, who knew exactly how GFS should be used. BigTable masks the inconsistency (using the techniques described above) and provides good consistency guarantees."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"But as GFS adoption deepened, its simple, good-enough architecture began to cause many problems, of which consistency was only one. Sean Quinlan led GFS development for a long time; in an interview, he described in detail the problems this simple architecture caused once GFS had passed the early stage of adoption [2]."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Having seen clearly the inconvenience that the GFS consistency model caused its users, the open source HDFS (Hadoop Distributed File System) firmly abandoned the GFS consistency model and provides better consistency guarantees (Chapter 3 describes how HDFS is implemented)."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Note: "},{"type":"text","text":"The above is the second chapter of the book 《分佈式系統與一致性》 (Distributed Systems and Consistency). The book details the consistency-related implementation of GFS, HDFS, BigTable, MongoDB, RabbitMQ, ZooKeeper, Spanner, and CockroachDB, as well as the important distributed algorithms Paxos, Raft, and Zab. It also covers transactional consistency and isolation levels, sequential consistency, linearizability and strong consistency, and the CAP theorem as a trade-off in architecture design. The book is available through the [Read the original] link at the end of the article. We hope to discuss the hard problem of distributed consistency together with you."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"References"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[1] Ghemawat S, Gobioff H, Leung S T. The Google File System. ACM SIGOPS Operating Systems Review, 2003."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[2] McKusick M K, Quinlan S. GFS: Evolution on Fast-forward. Communications of the ACM, 2009."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"About the Author"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"陳東明 (Chen Dongming) "},{"type":"text","text":"has extensive experience in building large-scale systems and developing infrastructure, and is skilled at designing and continuously optimizing high-concurrency distributed systems under complex business requirements. In recent years he has focused on research into consistency in distributed systems, and has long been writing technical articles and sharing with the community. He previously worked at 餓了麼 (Ele.me) and Baidu, where he led the development of Ele.me's key-value database and was responsible for the architecture of Baidu's instant messaging product. His personal WeChat official account is dongming_cdm. This article is a chapter from the author's new book 《分佈式系統與一致性》, excerpted here for sharing and discussion."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article is reposted from: dbaplus社羣 (ID: dbaplus)"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original link: "},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/ut8Q7vXa5Lm0auNaN2_Emg","title":"xxx","type":null},"content":[{"type":"text","text":"The Distributed Philosophy of GFS: HDFS Owes Its Consistency Achievements to My Failures..."}]}]}]}
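The boundary arithmetic behind Examples 1 to 3 in Section III can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not GFS code: the function names `split_write` and `plan_record_append`, the zero-based chunk index (so chunk1 in the article is index 0), and the tuple return shapes are hypothetical choices made here. Only the fixed 64 MB chunk size and the pad-then-retry behavior of record append come from the article.

```python
MB = 1024 * 1024
CHUNK_SIZE = 64 * MB  # fixed chunk size stated in the article


def split_write(offset, length, chunk_size=CHUNK_SIZE):
    """Split a file-level write into per-chunk operations (hypothetical helper).

    Returns a list of (chunk_index, offset_in_chunk, length) tuples, one per
    chunk touched. chunk_index is zero-based, which is an assumption here.
    """
    ops = []
    end = offset + length
    while offset < end:
        chunk_index = offset // chunk_size
        offset_in_chunk = offset % chunk_size
        # Write up to the chunk boundary or the end of the data, whichever comes first.
        n = min(chunk_size - offset_in_chunk, end - offset)
        ops.append((chunk_index, offset_in_chunk, n))
        offset += n
    return ops


def plan_record_append(chunk_used, record_len, chunk_size=CHUNK_SIZE):
    """Sketch of the primary's decision for a record append (hypothetical helper).

    chunk_used is the number of bytes already occupied in the last chunk.
    Returns ("append", offset_in_chunk) when the record fits, otherwise
    ("pad_and_retry", padding_bytes): the rest of the chunk is padded and the
    client retries on the next chunk, so the record never spans two chunks.
    """
    remaining = chunk_size - chunk_used
    if record_len <= remaining:
        return ("append", chunk_used)
    return ("pad_and_retry", remaining)


# Examples 1 and 2: a 20 MB write at the 54 MB position becomes two chunk operations.
write_ops = split_write(54 * MB, 20 * MB)

# Example 3: a 12 MB record does not fit in the 10 MB left in the chunk.
append_plan = plan_record_append(54 * MB, 12 * MB)
```

Because `split_write` can return more than one operation, each of which may succeed, fail, or be ordered independently, a write that crosses a boundary cannot be atomic. `plan_record_append` never returns an operation that spans two chunks, which is the property the article relies on when it calls record append atomic.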