取代HDFS?Ozone在騰訊的最新研究進展

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"作爲下一代對象存儲還不夠,取代 HDFS 纔是 Ozone 的目標。Ozone 是當前 Apache Hadoop 生態圈的一款新的對象存儲系統,Ozone 與 HDFS 有着很深的關係,在設計上,很多地方也參考了 HDFS,並對 HDFS 存在的不足做了很多改進。很多公司看重的不僅僅是 Ozone 的對象存儲能力,更是 Ozone 標榜自己是 HDFS 的下一代的目標。我們抓住了這一點,並做出了比開源社區 Ozone Filesystem 方案更徹底的 HDFS on Ozone 架構設計和實現,取得了階段性成績。本文將爲大家介紹如何讓 Ozone 成爲 HDFS 的下一代分佈式存儲系統,主要內容包括:"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ozone 介紹"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"NameNode on HDDS"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"騰訊的貢獻"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ozone 的未來"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Ozone 介紹"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先,我們一起簡單瞭解下 Ozone;然後,重點介紹一下我們正在做的一個跨時代的事情 NameNode on HDDS。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. HDFS的組織架構"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在介紹 Apache Ozone 之前,我們先一起回顧一下 HDFS,看看 HDFS 發生了什麼事情,在大數據生態裏,這麼多年不變的分佈式存儲系統爲什麼要被取代?它到底有什麼問題?我們先來了解 HDFS 的組成架構,可以看下邊這張圖。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/1b\/1b596eb511cb8435cf7a0996996e1b26.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HDFS 採用的 Master\/Slave 架構,並且 master 節點也就是 NameNode 節點,統一管理文件系統命名空間中的元數據,對外提供文件系統元數據服務、數據塊管理服務、節點管理、冗餘存儲管理、心跳管理等衆多服務於一身的一個高度中心化的 Master 服務。此外 NameNode 的內部還有全局的 FSlock 以及 DirLock,這些都是全局鎖。這些設計都讓 HDFS NameNode 架構變得極其簡單,這是它的一些優點,這些設計是個雙刃劍,既成就了 HDFS 也限制 HDFS。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/36\/366aa80fb728cd2b2508dac4ed043957.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"而且 NameNode 的 HA ( 高可用性 High Availability ) 也顯得特別複雜,至少要引入三個 zookeeper 節點,三個 JournalNode 節點,兩個 ZKFC 節點,部署維護都顯得比較複雜。文件系統的 inode 信息和 block 信息以及 block 的 location 信息全部在 NameNode 的內存中維護,這使得 NameNode 對內存的要求非常高,需要定製大內存的機器才能承載更大的元數據量。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"京東的 NameNode 內存是 512GB,還有某廠的 NameNode 的機器是 1TB 的內存。此外 NameNode 堆分配巨大,京東的 NameNode 需要 360G 的堆,因此對 GC 的要求是特別高,在不斷的調優和改動的情況下,京東定製的 JDK11 以及 JDK G1 GC 發揮着不錯的性能。但是一般小規模的公司是不具備維護 JDK 的能力的,方案不具備普遍性。字節跳動把 NameNode 修改成 C++ 版本,這樣分配內存和釋放內存都由程序控制,也達到了不錯的性能,這個方案仍然不具有普遍性,因爲開發和維護這樣一個 C++ 版本的 NameNode 實現也需要不小規模的團隊。不管怎麼樣,元數據的 scalability ( 擴展性 ) 受限於物理內存大小是一個致命的缺點。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大家都耳聞 HDFS 對小文件不友好,但是爲什麼呢?其實從根本上不是因爲這個文件小,而是因爲產生了很多小文件和小 block,致使在同樣的數據存儲容量的情況下,NameNode 中產生更多的內存對象,這些內存對象都要被 NameNode 管理起來,這就會導致數據存儲量還不是很大的時候,元數據量就很大了。即便沒有小文件問題,當數據量達到幾百個 PB,NameNode 由於啓動時,需要接受絕大多數 FullBlockReport 纔可以退出 safemode ( 安全模式 ),因此啓動時間會達數個小時。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"集羣常規的 FullBlockReport、Decommision datanode、balancer block,也都會對集羣元數據服務性能造成影響,這些根本原因都是因爲 DataNode 需要把所有 block report 給 NameNode,以及幾乎所有的操作都要獲取全局鎖。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一般情況下,當5000臺規模 HDFS 集羣,在 200PB 左右,大概會有4億左右元數據,按2億文件,2億塊來說,這些文件和 block 的信息,都會維護在 NameNode 的 heap 上,佔用了昂貴的內存,也給 GC 造成了壓力。此外,這麼大的數據量,在啓動、以及後續的 FullBlockReport 期間,會減少總體集羣的可用時間,限制了 HDFS 集羣的吞吐量。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"總結一下,HDFS 的問題是,元數據都在內存中,造成的 NameNode scalability 受物理內存限制的問題 和 GC 災難問題。以及 全局鎖和塊彙報風暴,導致的吞吐差和 NameNode 啓動慢的問題。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Ozone"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/0e\/0edfd5ffdc53cf0c17cc712b724775be.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"近些年 Ozone 在衆多新興的開源分佈式存儲系統裏邊脫穎而出,由於 Ozone 同樣是 Apache 開源基金會出品,也是許多 HDFS 社區的 PMC 和 Committer 共同設計的,因此在架構設計上既避免了 HDFS 諸多設計上的缺陷,同時也借鑑參考了 HDFS 多年曆經時間考驗的優秀設計功能,因此 Ozone 力求成爲 HDFS 的下一代分佈式存儲系統。目前使用 Ozone 和參與 Ozone 開發的公司有很多,我知道的,包括騰訊、京東、思科、Cloudera、谷歌、360,還有一些其他的公司。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中騰訊是重度的參與開發,也有很多公司在觀望,希望等成熟以後再應用到公司內部,這可能是有一些偏運維的團隊對於開發新系統的激情還沒那麼強烈,希望大家不要再觀望了。中國有很多的 Ozone 的 Committer,希望大家能夠踊躍的加入 Ozone 社區,一起把 Ozone 向前推。從社區活躍度來講,Ozone 和 Ratis 的活躍度比 HDFS 的活躍度要高很多,可以從 slack 的 hdfs channel 和 ozone channel 的成員數做對比,HDFS 有111個成員數,Ozone 已經有124個成員了,還有 Ratis 是 Ozone 下面的一個就是寫數據必備的依賴項目也是有100多人,加起來就有200多人了。從整個 Hadoop 項目上一個月的 Pull Request 數量對比,就可以看出來活躍度已經相差很大了,可能有些誤差,因爲有一些 Hadoop 的 patch 還是通過 JIRA 的 patch 的方式提交的。大家可以參與一下 Ozone 社區的代碼貢獻,比較一下,Ozone 的社區的基本設施還是非常完備的,比如集成了 checkstyle、findbug、Test Coverage、UT、Integrity Test,以及在 docker 內的一些測試,基礎設施建設的比較好,開發者的入門門檻也會比較低。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不用編譯安裝 Protocol Buffers,也不用擔心 Protocol 中的一些消息定義會向前不兼容 Ozone 使用的 maven 插件自動幫你去下載你對應版本的 protoc 工具,也會自動檢查 Protocol 消息定義向前兼容。Ozone 社區大佬服務也很周到,百忙之中還給大家錄製了系列的視頻教程放到 YouTube,還寫了一鍵在 intellij idea 裏運行 Ozone 的腳本,一鍵 docker 裏運行 Ozone 腳本,一鍵 kubernete 中啓動 Ozone 的腳本。每週五的中午還有 Ozone 社區的同步會議,每個人都可以參加。我覺得這些都是 Ozone 能成爲下一代分佈式存儲系統的關鍵點之一。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/bd\/bdd3aad96433ae1bc7ad6e3d0c3a486c.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"雖然 HDFS 有些問題,雖然 Ozone 社區比較活躍和 open,憑什麼 Ozone 就會成爲下一代分佈式存儲?有很多原因,先從 Ozone 的訪問接口方面來說,Ozone 提供了很多 API,可以讓各種應用使用各種主流的 API 來訪問到 Ozone。比如 Ozone 實現了 hadoop 兼容文件系統 API,這使得 spark、presto、MapReduce,Hive、Alluxio 等這些大數據生態中的系統可以訪問到 Ozone。此外,我們還向 Alluxio 貢獻了 Ozone 底層存儲模塊,這使得 Alluxio 原生支持 Ozone 作爲 Alluxio 的底層存儲。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ozone 還提供了 S3 gateway 服務,使得 Ozone 可以適用於 S3 的場景。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"此外,Ozone 通過 Goofys 和 Ozone CSI 實現了 k8s 掛盤的能力。所以,想把你的應用的存儲轉移到 Ozone 上,非常方便。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/57\/572858597be8fd05c5337c13bc9e6e33.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Ozone 設計上的優勢"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ozone 的管理節點服務是有一些拆分的,我們可以看到 Ozone 同樣採用了 Master\/Slave 架構,但是有一點不同,Ozone 的 master 是有兩個,一個是 OzoneManager 作爲對象存儲元數據服務,另一個是 StorageContainerManager,作爲存儲容器管理服務。相比 HDFS,Ozone 就像是把 NameNode 拆分成這兩個服務。Ozone 把管理節點是拆分成 OM 和 SCM 是有很多好處的,這使得 OM 和 SCM 可以各自運行在獨立的進程當中甚至可以運行在不同的機器當中,各自維護進程的生命週期,可以單獨的重啓、升級、維護。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SCM 和 DataNode 組成了一個通用的存儲層 HDDS,上圖中淺黃色部分。HDDS 是 Hadoop Distributed Data store 的縮寫。如果在 HDDS 之上加上 Ozone manager,就是 Ozone 了,Ozone manager 是對外提供的一個對象存儲服務。在 HDDS 之上我們也可以去搭建一個 HDDS 的 NameNode,這樣的話其實對外界來看,它就像一個 HDFS,就可以提供文件服務。如果說在 HDDS 之上,我們去啓動一個 blockstorage 服務,這樣的話它就可以提供一個高速掛盤的能力。Ozone 這樣一個分層設計還是比較高明的。再來看看 Ozone Manager,Ozone 區別於 HDFS 的一個最大的設計上的不同,就是 Ozone 是對象存儲,沒有維護一個文件系統數,對象語義的操作不存在目錄和文件的關係,因此可以達到很高的吞吐量,Ozone Manager 內部也可以達到 bucket 級別的併發讀寫。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外一個很優秀的設計就是 OM 和 SCM 都使用 RocksDB 進行元數據管理,Ozone 的元數據不像 HDFS 的 NameNode 都放在內存裏,而是放到 RocksDB 裏,不管 OM 的元數據還是 SCM 中 Container 信息都維護在 RocksDB 中,不需要使用堆內的存儲,理論上元數據可以無限的擴展。還有一個很高明的設計,就是引入了 Storage Container 的概念,由 DataNode 的管理 Container 中的 block,SCM 就無需管理 block 只需要管理 Container 即可。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這就好像是國家的領導只需要管理每一個城市,而城市的領導在管理城市裏面的人。每個 Storage Container 都有默認 5G 的容量,其中 block 的狀態是由 container 來管理,所以也極大地減少了 SCM 管理的數據量,從而提升了 SCM 的服務性能和擴展能力。無論是全量的塊彙報,還是增量的塊彙報,以及增刪副本,和 block balance 集羣,這些都不會對 SCM 的性能產生很大影響,因爲塊佔大多數,但是 container 相對來說還是很少。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Ozone 所面臨的挑戰"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/04\/04e374df4bf766790dd0a313984c5893.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ozone 雖然有那麼多的接入方式,也有這麼多的設計上的優勢,但是 HDFS 豐功偉績引領風騷數十年,不是那麼容易被取代,想要取代它需要面臨很多的挑戰。我們逐一來解釋一下。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ozone 不支持 Append、truncate 的操作,不過這個缺點是騰訊的兩個 Ozone Committer 已經在內部實現了,並且正在 push 到開源倉庫中,但是用到 hflash 的場景現在還不支持,需要我們後續再去努力。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外 Ozone 寫 key 的方式是不寫完不可見,這是因爲 Ozone 在寫 key 的過程中不會向 OM 去提交 block,而是把所有的數據全寫完之後一次性的向 OM 去提交所有 block 信息,因此在寫作的過程中 key 是不可見的,這個也是我們通過後續的改動也是可以去彌補這個缺點。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ozone 的 RPC 鏈路比 HDFS 的 RPC 鏈路長,這個也是乍一看每個人都會抱怨這個缺點,因爲它把服務拆分成兩個,它的一個 RPC 訪問,就必然會從 client 到 OM,再從 OM 到 SCM。這個也是很正常的,但是你仔細去想一想,對於文件系統的操作,到底是元數據的耗時佔得多,還是數據的耗時佔得多?如果說你是想去打開一個文件去讀取它的內容,或者說創建一個文件去往裏邊寫內容,這些元數據的訪問耗時比重非常小。但是有一些像 getblockLocation 等,這些就是純元數據操作,肯定會慢。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外一個就是 Ozone 目前還沒有像文件夾的這樣的一個 metadata,因此他就沒辦法去獲得文件夾的 owner、modificationTime 等信息。像我之前做過一個實驗,就是把 MapReduce 的 JobHistoryServer 把它去跑在 Ozone 之上,就是有一些 yarn 的 job 可以放在 Ozone 裏面,這樣就可以拿它去看的時候會發現掃不出新生成的 job 的一些日誌,原因我們看了一下代碼 JobHistory 裏面會去根據文件夾的 modificationTime 是否變化去判斷是否新生成子目錄。Ozone 的 modificationTime 現在永遠是零,所以也就不會感覺到變化,也就掃不出新的 job 了。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ozone 底層寫數據是基於 ratis,而目前 ratis 寫也只支持一副本和三副本,騰訊正在向開源社區去貢獻都基於的 Strorage cluster 框架的任意副本的支持。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SCM HA 也是一個很大的問題,現在還沒有,不過騰訊現在正在積極地主導 feature,正在向我們的開源社區去 push 相關的改動,預計在 Ozone 的下一個大版本的發佈會包含這個功能。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ozone 還缺少一個 container balancer 這樣的功能,用它來均衡存儲,使集羣的存儲能分佈均勻。比如我們一開始有1000臺機器,後來又新上了10臺那是空的,這樣就需要把存儲均衡到使用量較低的機器上去,這塊功能騰訊也在開發中。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ozone 目前還缺少 DataNode 磁盤預留功能,Ozone 的寫性能正在通過 ratis stream 的 feature 進行優化,我們目前內部測試有的 feature 的性能會提升很大,它的一個本質上的改變就是把寫數據的過程中,讓它提交塊的元數據的一些信息的操作以及數據塊內容的 stream 操作讓它能夠分開,可以在不同的節點上。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ozone 的寫性能,正在通過 RATIS Streaming feature 進行優化。目前我們內部測試 有了這個 Feature 性能提升巨大。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"NameNode on HDDS"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這麼多的挑戰中,其實大多數我們是可以通過後續的開發來完善這些缺失的功能,但是文件夾元數據的支持,還有提速文件夾操作的性能,這些其實都不太容易實現,而這些恰恰是大數據文件系統語義的場景以及必要的功能和性能的要求。那麼怎麼辦?看起來如果什麼都不做,只靠 Ozone 現有的組件的完善,貌似是沒有辦法再勝任 HDFS 文件語義的這些場景了,所以我們 HDDS 之上啓動了一個基於 HDDS 的 NameNode 服務。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們可以在 HDDS 之上創建一個 NameNode 服務讓它來取代 Ozone Manager,這樣對外就提供了一個文件系統服務,就可以從根本上解決 Ozone 文件語義支持不好,以及文件夾元數據缺失的問題。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. 文件系統語義與 Ozone 的對象存儲的區別"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/4a\/4a13490c4c31bcbcf4c16c60e918a9e6.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們一直在說文件系統語義,還有 Ozone 的對象存儲,它們到底有什麼區別。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可以通過上面這張圖簡單可以看出來,對象存儲就像左邊種是 KV 方式去管理對象元數據,它無需管理元數據的之間的關係。文件系統就像右邊,它是額外的需要採用樹狀的結構去作爲索引去管理元數據之間的關係,這就是它的本質上區別。這些區別,使得對象存儲和文件系統有着各自的優缺點和各自的適用場景。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對象存儲,一個 URL 路徑,就是一個 key,本身就沒有文件夾的概念,只是我們非要把對象存儲硬要當文件系統使用,那隻能靠 key 中的正斜槓,切割出文件夾的概念來,實際沒有文件夾節點,也就不會有文件夾的元數據信息。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"list 慢也是因爲 list 操作需要在 rocksdb 中,按前綴掃表。rename 文件夾 操作需要修改文件夾中的每一個 key,而對於文件系統,list 操作直接把文件夾的 children 返回即可,rename 文件夾,只需要修改一下文件夾節點的 name 即可。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"明白了原理,我們也就確定了一點,OzoneManager 是對象存儲服務,不管你怎麼優化,也無法媲美 HDFS。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. Ozone 的存儲分層設計"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/0f\/0f774e2c55994a49d86a1a7b84c6e3ce.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所以怎麼辦?我們來回顧一下 Ozone 的存儲分層的設計。Ozone 起初設計爲一個對象存儲,但是經過抽象出來的 HDDS 層使得存儲層和元數據層是可以分離的,就像 Ozone 就是把 Ozone Manager 以及下面的 HDDS 是可以做分離,這樣就可以在 HDDS 層之上,增加 HDDS NameNode 或者 Block Storage。或者除此之外,我們也可以在 HDDS 之上去創建一個新的其它的對象存儲,可能你的系統裏面需要這樣的一個定製化的對象存儲,你也可以實現一個其它的類似於 Ozone Manager 這樣的一個節點。我們重點就是要介紹基於 Ozone 分層存儲的設計,我們就可以在 HDDS 層之上去實現一個 HDFS NameNode 去承接大數據場景文件系統語義的 workload。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/97\/97826c5c6e66f41529746c9c71f3cc45.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"事情說清楚了,現在真要幹起來,其實很重要的一點就是要分階段去迭代,不能直接朝着終態目標,埋頭苦幹,中間沒有輸出成果。那樣,對於老闆,看不到你的進展。對於項目成員,也沒有辦法持續興奮起來,所以,我們把事情分爲幾個階段做。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Base 就是我們的基礎了,我們不是從0實現的,我們基於 HDSF client 和 Namenode 以及 Ozone 的 HDDS 開始我們的開發工作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Basic 的目標,是實現一個 HddsClient 和 HddsNameNode,加上未經修改的 HDDS 集羣。可以實現 一個基於 HDDS 的 Filesystem,我們稱之爲 OZONE-DFS。這就實現了與 HDFS 幾乎等同的文件系統服務了,可以接入大數據 workload 了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二階段,我們要優於 HDFS,提升吞吐和性能,因此,我們在 OZONE-DFS ( HDDSFS ) 的 NameNode 上大刀闊斧做 HDFS 多年沒有做的事情,鎖優化,也叫細粒度鎖改造。我們借鑑了美團的 NameNode 鎖優化思路,也仔細剖析了 Alluxio 細粒度鎖的實現。這些給力我們很大幫助。這個階段完成,我們的 OZONE-DFS 就已經超越了 HDFS 了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這還不夠,我們要設計實現一個遠遠超越 HDFS 的 OZONE-DFS,所以,我們在第三階段,對 NameNode 的元數據進行 KV 化改造,並採用兩層元數據管理的方式,上層爲內存,下層爲 rocksdb,既實現元數據無限擴展,又最大程度上,不損失太大性能。此外,利用 Apache Ratis 實現 NameNode 的 HA,不再需要 ZK、ZKFC、JournalNode 等 HA 相關服務,3個 HDDS NameNode 組成一個 RAFT Group 足以。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/13\/1339810a3dc981bae11c36b8345bb929.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們是基於 HDFS 源碼進行二次開發:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"實現了 Client 到 HDDS Datanode 的通信,這就可以進行正常的塊讀寫了。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們還修改了 Client 到 HDDSNN 的協議,使得 Client 可以通過 HDDS NameNode 向 SCM 申請 block 了。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於與塊無關的操作,比如 createFile,mkdir,rename 等等,都無需任何改動。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這張圖,可以看出我們增加了 HDDSNN,作爲文件系統元數據管理操作。HddsNN 可以與 SCM 通信,進行 block 的申請和刪除操作。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"還增加了 HddsNNClient 可以同時訪問 HddsNN,進行文件系統元數據請求,也可以訪問 HddsDataNode,進行數據塊的讀寫。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/8d\/8d0f99732d9b4d82ea08e33d06ad0dfa.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上面兩張圖分別展示了具體的讀寫流程,左邊是寫流程,比如說你在你想創建一個文件,你先是有一個創建文件的 RPC 請求,由 client 發到 HDDSNN,然後 HDDSNN 創建了一個元數據,接下來 HDDSFS Client 向輸出流去寫數據,寫數據過程當中它會發現沒有 Block,所以就會去向 HDDSNN 去發送 allocateBlock 請求,HDDSNN 就什麼也不幹,把 allocateBlock 請求發給 SCM,SCM 負責把 Block 給 allocate 出來,然後返回給 HDDSFS Client。接下來就可以執行寫 block 邏輯了。Client 向 DataNode 通信,先 write 一些 chunk,最後到了一個 block 的時候就會發送一個 putBlock 請求,完成寫一個 block 的流程。整個過程都完成了,最後就會發一個 complete 的 RPC 請求,HDFSNN 收到 RPC 請求就會把文件標記成完成。這樣一個寫流程就完成了。讀流程大致類似,不再贅述。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/dd\/dda8419c15b8d4c711e740bbe9454032.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"實現了功能的同時,我們還有一個更高端的目標:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一套代碼,既可以啓動 HDDSFS,也可以啓動 HDFS,通過一個配置項進行區分。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這是因爲我們抽象化了 BlockManager、DatanodeManager、heartbeatManager,分別有 HDDS 版本 和 HDFS 版本的實現。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們也實現了 odfs schema 對應的 Hadoop 兼容文件系統實現類 —— HddsFilesystem。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這樣同一個 client,就可以同時支持讀寫 hdfs 和 hddsfs 了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"右邊的類依賴關係圖,紅框是我們新加的類。大部分是繼承或擴展自 HDFS 現有的類。寫塊的具體實現細節在 HddsOutputStream 裏。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/4d\/4d03b5158d3e3f205da0c3e9b0f8fa10.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們這一階段的目標,非常清晰明確,就是實現一個 Client 和 NameNode,可以藉助 HDDS 的存儲能力,提供文件系統服務。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HDDSFS 由於將塊管理、節點管理、心跳管理等功能分離出去,存儲相同規模元數據,節省內存40+%。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因爲 HDDSFS 避免了塊彙報風暴,對於1千萬個小文件,重啓速度是 HDFS 的4倍。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另一方面,由於 HDDSFS 有文件語義的元數據樹形結構,相比 Ozone 來說,在文件夾操作性能上有着明顯的優勢。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中在10萬子節點目錄的 Rename、delete 等操作就有100+倍性能提升。( 異步刪除功能,提升更高 )"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於文件夾屬性的相關操作 ( modification Time, set mode, set owner ),OzoneManager 根本上就不支持,而 HDDSFS 完美支持。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/95\/953053a5588c58fc01ce13315bb65a5b.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面這一階段,我們的目標就是通過實現細粒度鎖,優化高併發吞吐,提升集羣性能。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對 HDFS 的原理是有一些瞭解的同學,都應該知道 HDFS 是需要把 FsLock 還有 FsDirLock 都加鎖之後才能做具體的操作。現在我們把 FsLock 的部分 writelog 變成了 readlock,這是因爲我們把 Block Manager 功能對接到 SCM 裏了。有這樣的一些改動,這樣就使得我們有一些操作是可以併發執行的。同時我們把 FSDirectory 裏邊的 lock 改成細粒度鎖了,這使得集羣的吞吐可以得到很大限度的提升。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們的改動方案就是把 FSDirectory 中的 readwritelock 修改成基於 locklist 的層級鎖。我們看上圖右邊表格來具體說明一下,比如說你要去 list 一個文件夾 \/a\/b,原來可能是要去加個全局讀鎖,然後再去做操作。現在不是加一個全局鎖,而是給這樣的一個根加一個讀鎖,a 加一個讀鎖 b 加一個讀鎖,然後纔會把下面的子節點 list 出來返回回去。這樣就可以看出來是一個 locklist。再舉個寫的例子,比如 createFile 要在 \/a\/b 個文件夾下面去創建一個 c.txt,要給 c 加讀鎖 d 加讀鎖,然後下面再加一個 c.txt 新創建這個文件是加了一個寫鎖,瞭解內幕的同學可能就會發現這樣不對,我們現在一個 directory,下面這個是一個 Children 是一個 array list,相當於是線程不安全的數據結構。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果幾個線程同時 create 就會出現線程不安全的問題。沒錯,我們就踩過坑,把 children 改成線條安全的數據結構,這樣的好處就是它父節點就不需要加寫鎖了,通過這樣設計的一個細粒鎖隸屬於兩個文件就是不同的大目錄下邊的文件的操作就可以相互不干擾。相比 HDFS 的刪除或者 DU 某個目錄的時候,整個集羣的性能就會明顯的下降,通過粒粒鎖的優化,這個性能會有明顯的提升,它的吞吐從整體上也會顯著的提高。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/81\/814f7dfb0b121e8a9ee3f12d83207abc.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大家可以試想一下,如果一個 Inode 都關聯一個 ReentrantReadWriteLock, 內存會增加多大?本來內存就很喫緊,額外增加太大的內存,肯定是無法接受的方案。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因此,我們借鑑了 Alluxio 2.0 引入的 LockPool 的概念。什麼是 LockPool 呢?就是,有一個資源池,lock 就是資源池中的一個鎖資源。每一個 Inode 不再關聯一個 Lock 了,而是需要 Lock 加鎖的時候,就去資源池裏申請鎖,同時引用計數會增加,用完了 unlock 掉的時候,引用計數會減少。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"LockPool 有一個勤勞的 LockEvcitor,會在 Lock 達到 high watermark 的時候,進行 evict,直到降低到 low watermark。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ad\/ad98bd90634cd55435c1b434a0cdcffe.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上面這一張流程圖,也可以幫我們去理解 lockPool 它異步的 lock Evictor 的工作原理,上面左側是一個線程從 lockPool 裏去獲取一個 key 對應的 lock 的流程,判斷 key 在 lockPool 裏面是否存在,這是第一步,如果不存在就會創建一個 lock,如果存在就會讓它的 refCount 加一,直接就返回了。然後他會判斷一下 lockPool 的 size 是否達到了高水位,如果超過了高水位就會發一個 Evictor 的信號。右側的是一個線程,它會一直檢測 lockPool 的 size 是否超過高水位,如果沒超過高水位,它就會 await 等待信號來給它觸發。當發現了已經超過高水位了,就會進行 Evictor 操作踢出沒有引用的 lock,直到達到了低水位。關於細粒度鎖,後續還可以區分具體的 創建、刪除、修改等寫操作,加不同的鎖,從而最大限度的提升吞吐,能不鎖就不鎖,能用讀鎖不用寫鎖。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"NN on HDDS——Lock Guard"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/82\/8227b02eac6187e9f0de7db5d44180ed.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們做了這麼大的一個根基性的改造,就難免會出一些 bug 或者意料之外的事情,所以我們也給自己留了後路,就是開發了一個 Lock Guard 這樣的一個 feature,幫我們去診斷出死鎖、忘放鎖、常佔鎖等情況,採用的辦法就是啓動一個檢測線程,檢測 lock 的持有的時間是否超過了預期,這個時間預期可能有多個標準,達到不同的標準會有不同的處理,然後會有一個 metric 去來統計,超過這個時間標準的 lock 數量,當 metric 增加了,然後我們就可以收到一些報警的一些 alert,同時會向日志裏邊去打印一些警告日誌。我們收到報警以後就可以去登錄到機器上去看具體的到底是哪個 lock 出了問題以及lock的一些上下文信息都可以把它打出來。當然這個功能會增加一些內存的開銷性性能的損耗,所以是可以動態去開關的。當收到報警信息後,就可以使用命令行工具利用拿到的鎖 ID 去分析佔鎖的情況。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"NN on HDDS——Tired Metadata Management"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/bd\/bd039081db558414eefaa881f95a54dd.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第三個階段目標就是通過 RocksDB 實現元數據無限擴展。在內存空間中,我們可以很容易把文件和文件夾用一個 Tree 來進行管理,但是 RocksDB 也是一個 KV 存儲。我們需要把元數據切分成兩張表,第一張表是放 inode,這裏 inode 包括 inodeFile 和 inodeDirectory,使用 inodeId 作爲 key。第二張表是放 inode 之間的關係,key 爲 parentId 和 childName 聚合的字符串,它的形式就是,””,value 就是 child 的 inodeID。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們舉個例子,比如說你要去獲得 “\/dir\/file” 的元數據信息怎麼辦?首先是根不用找,根的 ID 就是0,接下來就是找 \/dir,於是去 Edge 表中查找 key 爲 “0,dir” 的記錄,可以找到,value 爲1,也就是說 “dir” 的 ID 是1,接着再找 “1,file”,在 Edge 表中是可以找到,也就是說 “file” 的 ID 也是2,最後我們在 inode table 裏面去找2的 value,就可以找到了 “\/dir” 和 “\/file” 的元數據了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"NN on HDDS——Tired Metadata Management"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/27\/27331f9bb4e87ef431aa2f74b923457f.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這一通操作顯然是在內存裏邊是會比在 RocksDB 裏面要快很多。因此需要進行分級元數據管理,比如說增加一個 cache layer。什麼是分層級元數據管理呢? 有什麼益處呢?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在一個存儲系統中,不光光只有它所存儲的數據文件重要,它的存儲系統的元數據管理同樣十分的重要。因爲涉及到存儲系統數據訪問操作時,會經過存儲系統的,元數據的查詢或更新操作,如果元數據這邊的操作出現性能瓶頸,同樣會導致用戶訪問數據的行爲出現緩慢的情況。本文我們來聊聊存儲系統一般是如何做高效的元數據管理的,這裏面會涉及到多種不同的元數據管理方式。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其實,存儲系統的元數據管理,是存在演變過程的。初代元數據管理,元數據存儲於外部 db 中,然後 master 服務和 db 進行數據的交互。內存管理式的元數據管理,master 服務在初啓動後加載外部元數據 db 文件到內存中。分區式元數據管理,將元數據按照給定規則進行 partition 的分拆,然後啓動多個 master 服務來管理各自的應該維護的元數據。分層級的元數據管理 ( tiered metastore ),是一個既可以無限擴展,又可以保證活躍數據性能不下降的一種策略。最近訪問的熱點元數據,做內存緩存,叫做 cached layer,很久沒有訪問過的數據 ( 也可稱作冷數據 ),做持久化保存,叫做 persisted layer。在此模式系統下,服務只 cache當前 active 的數據,所以也就不會有內存瓶頸這樣的問題。這張圖是一個此模式的樣例系統的元數據管理模型圖。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相比較於 HDFS Namenode 將元數據全部 load 到 memory 然後以此提高快速訪問能力的元數據管理方式,但內存卻成爲了元數據擴展上限,以及 Ozone 對所有元數據的讀寫都要通過 rocksdb,性能下降成爲了最大的問題。因此,HDDSFS NN 需要在這點上做優化改進,只 cache 那些 active 的數據。對於那些近期沒有訪問過的冷數據,則保存在本地的 rocksdb 內。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在分層級的元數據管理策略中,在內存中 cache active 數據的存儲層,我們叫做 cache store,底層 rocksdb 層則叫做 backing store。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這個概念以及相應的設計,我們也參考了 Alluxio 和 Ozone filesystem 的設計。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"NN on HDDS——RAFT Based HA"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/30\/3046180617af12532788dc1e7e4e59f2.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後一個大的架構升級就是 HA 升級,只需要有3個 HDFS NameNode 組成的這樣一個 RAFT Group 不依賴於外部的組件就可以實現 HA 功能。區別於 Ozone Manager 的 HA 實現需要 client 通過 RAFT 協議與三個 OM 組成的 RAFT Group 通信,也就是說與 leader 的 OM 通信,leader OM 把 client 的請求記成 RAFT 的 log,然後再同步給兩個 follower。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們的設計中是由三個 NameNode 組成一個 RAFT Group,但 client 採用的 follower 的方式找到 leader 的 NameNode 與之通信,leader 的 NameNode 會執行 client 的請求,然後再把寫操作記錄到基於操作日誌 journal 裏面。當 RAFT Group 的 leader 切換時,每一個 NameNode 會設置自己的 leader 狀態,只有 leader 纔會接受外部的請求,當 leader 切換的時候,失去 leader 狀態的 NameNode 就會重置狀態,並且載入到 NameNode,這樣就可以達到內存狀態是與 RAFT log 是一致的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在 HDDS NameNode 的 HA 設計中,我們參考的也是 Ozone Manager 的 HA,還有 SCM 的 HA,還有 Alluxio 2.0 引入的基於 Ratis 的 HA,目前我們前兩個階段其實已經實現了,但是我們第三個階段還在進行中,所以感興趣的小夥伴可以加入我們一起去幹這些偉大的事情。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"騰訊的貢獻"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/31\/31cf3b5a4af8c2244de8b75283e8a0e4.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"騰訊的貢獻清單有很多,不一一介紹了。6個 committer,其中還有2個 PMC,還有1個 Chair。這位 Chair,就是我們團隊的 Sammi ( 陳怡 )。同時也作爲 Ozone 1.0.0 這個非常有重大意義的版本的 Release Manager。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Ozone 的未來"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. Ozone 的未來"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/7a\/7a60f2ade54a745a663c564c54a606c2.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"再看看 Ozone 的一些未來,Ozone 未來還有很多事情要做,像 SCM HA、HA 狀態切換等等都需要繼續去開發。在大數據的生態裏邊 Ozone 還是有很多對應的一些功能要去實現。Erasure Coding 功能現在 Hadoop 裏邊已經實現了還不錯了,但是在 Ozone 裏面目前還是設計階段。HDFS 的一些 API,如果說你想替換它,肯定要把這些 API 完美的支持,像 append\/truncate\/hflush 都需要繼續把它們提到社區裏面。Datanode 的一些健康檢查、動態改配置、中心化的配置管理、還有是 container 的一些 balancer 這樣的一個功能,這些現在還是都需要我們補充的,就不一一介紹了,其實還有很多工作要去做。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/7a\/7a6c54645b123d1bff7ca59b11126dd4.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Native Object Store"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https:\/\/www.slidestalk.com\/TencentCloud\/TencentCloudOzone"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"向成熟化邁進 - 騰訊 Ozone 千臺能力突破"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https:\/\/cloud.tencent.com\/developer\/article\/1667033"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"文件系統和對象存儲區別"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https:\/\/ubuntu.com\/blog\/what-are-the-different-types-of-storage-block-object-and-file"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Ozone 開發者資源"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https:\/\/docs.qq.com\/doc\/DZUJFSXFuZHFXRGZp"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Goofys 的增量版 Goofuse"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https:\/\/github.com\/opendataio\/goofuse"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HCFSFuse"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https:\/\/github.com\/opendataio\/hcfsfuse"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這個就是我本次分享所引用的一些文章,其中標綠色的 Ozone 開發者資源,就建議收藏,它裏邊有很多面向開發者的一些資源,比如說本文提到的 Ozone 的一些視頻,以及新開發者怎麼去開發等等,建議大家去加入 Apache slack workspace,並加入 ozone channel,這裏邊有邀請連接,後邊是兩個我們開發的 Fuse 項目,HCFSFuse,以及 GooFuse,作爲 Ozone 的 Fuse 實現都可以利用 Fuse,把 Ozone 掛載成本地盤。大家感興趣也可以關注一下。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. 總結"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後總結下,我們總體還是超前設計,但是分階段去向前迭代,我們也制定了一些優先級,先完成高優先級的需求,在這一階段把它產出一些成果,我們也不像某些公司做一些分包制,這個功能就包給你了,然後你自己去對他結果負責。我們不是這樣,我們是高度的協作,大家一起來去面對問題。然後測試這方面也是重視質量保證的,像 TDD ( 測試驅動開發 )、CI ( 持續集成 )、Nightly build、code review。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們借鑑了 Alluxio、HDFS、Ozone、RATIS、Ceph 等諸多開源軟件的設計和源碼,也參考了美團的拆鎖方案。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ozone 想成爲 HDFS 的下一代的存儲系統,需要切實替代 HDFS 的一些場景能力,期待下一個十年 Ozone 可以家喻戶曉,成爲各大技術公司中的技術棧中的底層存儲。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文轉載自:DataFunTalk(ID:atafuntalk)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文鏈接:"},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/6rtkmwjfI_Cl-hYMrDOOyA","title":"xxx","type":null},"content":[{"type":"text","text":"取代HDFS?Ozone在騰訊的最新研究進展"}]}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章