Accelerating Flink Container Startup with JuiceFS

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Flink 因爲其可靠性和易用性,已經成爲當前最流行的流處理框架之一,在流計算領域佔據了主導地位。早在 18 年知乎就引入了 Flink,發展到現在,Flink 已經成爲知乎內部最重要的組件之一,積累了 4000 多個 Flink 實時任務,每天處理 PB 級的數據。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Flink 的部署方式有多種,根據資源調度器來分類,大致可分爲 standalone、Flink on YARN、Flink on Kubernetes 等。目前知乎內部使用的部署方式是 Flink 官方提供的 native Kubernetes。談到 Kubernetes,就不得不說容器鏡像的問題,因爲 Flink 任務的依賴多種多樣,如何給 Flink 打鏡像也是一個比較頭疼的問題。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Flink 鏡像及依賴處理"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Flink 的任務大致可分爲兩類,第一類是 Flink SQL 任務,Flink SQL 任務的依賴大致有以下幾種:、"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1.官方的 connector JAR 包,如 flink-hive-connector、flink-jdbc-connector、flink-kafka-connector 等;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2.非官方或者是內部實現的 connector JAR 包;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3.用戶的 UDF JAR 包,一些複雜的計算邏輯,用戶可能會自己實現 UDF。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二類 Flink 任務是 Flink 的 jar 包任務,除了以上三種依賴,還需要依賴用戶自己寫的 Flink jar 程序包。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"顯然,對於每一個 Flink 任務,它的依賴不盡相同,我們也不可能爲每一個 Flink 任務單獨打一個鏡像,我們目前的處理如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1.將依賴進行分類,分爲穩定依賴和非穩定依賴;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2.穩定依賴包括組件(如 Flink、JDK 等)以及官方的 connector 包,這類依賴十分穩定,只會在 Flink 版本升級和 bug 修復這兩種情況下進行改動,因此我們會在構建鏡像時,將這類依賴打入鏡像;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3.非穩定依賴包括第三方的 connector 和用戶自己的 JAR 包。第三方的 connector 因爲不是 Flink 官方維護,所以出問題需要修復的概率相對更大;用戶自己的 JAR 包對於每個任務來說都不相同,而且用戶會經常改動重新提交。對於這類不穩定的依賴,我們會動態注入,注入的方式是將依賴存入分佈式文件系統,在容器啓動的時候,利用 pre command 下載進容器裏。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"經過以上處理,Flink 鏡像具備了一定的動態加載依賴的能力,Flink Job 
With the handling above, the Flink image gains a degree of dynamic dependency loading, and a Flink job now starts up roughly as follows:

![Flink job startup flow](https://static001.infoq.cn/resource/image/4a/6f/4acb9d0ece2f55545a6529cdc8ff936f.jpg)

## Choosing a File System

### Pain points of storing dependencies on HDFS

We had always used HDFS as the file system for Flink dependencies, but in practice we ran into the following pain points:

1. The NameNode comes under heavy pressure at peak hours, so metadata requests from containers downloading dependencies can stall. Some small batch jobs only need a dozen seconds to run, yet downloading their dependencies can take several minutes purely because of NameNode pressure;
2. Our Flink clusters are deployed across multiple data centers, while HDFS is a single large cluster in the offline data center, so dependencies are sometimes pulled across data centers and eat into dedicated-line bandwidth;
3. Some Flink jobs do not depend on HDFS at all, i.e. they neither use checkpoints nor read or write HDFS, yet because the container dependencies live on HDFS, these jobs still cannot break free of it.

### Pain points of using object storage

We then replaced HDFS with object storage, which solved some of the HDFS pain points, but we quickly hit a new problem: single-threaded downloads from object storage are slow. The usual options for speeding up object storage downloads are:

1. Use multiple threads to download in segments (option 1 is sketched after this list). However, the container pre command is really only suited to running fairly simple shell commands; adopting segmented downloads would require a sizeable rework of this step, which is a significant pain point;
2. Put a caching proxy layer in front of the object storage, so the proxy handles acceleration and the client can keep reading with a single thread. The drawback is that we would have to maintain an extra object storage proxy component, and its stability would also need to be guaranteed.
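To make the trade-off in option 1 concrete, here is a rough sketch of a ranged, multi-threaded download from S3-compatible storage; the endpoint, bucket, and key are hypothetical. Even this minimal version is clearly more than a one-line pre command, which is why we were reluctant to go this way.

```python
# Hypothetical sketch of a segmented (ranged) parallel download from
# S3-compatible object storage. Endpoint, bucket, and key are made up.
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3", endpoint_url="http://object-store.internal:9000")
BUCKET, KEY, DEST = "flink-deps", "jobs/example-job/user.jar", "/tmp/user.jar"
PART_SIZE = 8 * 1024 * 1024  # 8 MiB per Range request

def fetch_range(offset: int, length: int):
    # Inclusive HTTP byte range, e.g. "bytes=0-8388607".
    rng = f"bytes={offset}-{offset + length - 1}"
    body = s3.get_object(Bucket=BUCKET, Key=KEY, Range=rng)["Body"].read()
    return offset, body

def parallel_download() -> None:
    size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
    parts = [(off, min(PART_SIZE, size - off)) for off in range(0, size, PART_SIZE)]
    with open(DEST, "wb") as f:
        f.truncate(size)  # pre-size the file so parts can be written in place
        with ThreadPoolExecutor(max_workers=8) as pool:
            for offset, data in pool.map(lambda p: fetch_range(*p), parts):
                f.seek(offset)
                f.write(data)

if __name__ == "__main__":
    parallel_download()
```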
### Trying JuiceFS

As it happened, the company was running a JuiceFS POC at the time, so a ready-made object storage proxy layer was available. We ran a series of tests and found that JuiceFS fully met the needs of this scenario. The points that pleasantly surprised us were:

1. JuiceFS ships with an S3 gateway that is fully compatible with the S3 object storage protocol, which let us go live quickly without any changes; the S3 gateway is itself stateless, so scaling it in and out is very easy;
2. JuiceFS has built-in cache acceleration; in our tests, reading files with a single thread through the JuiceFS proxy was four times as fast as before;
3. JuiceFS can be mounted as a local file system, so later we can try mounting the dependencies directly into the container directory;
4. JuiceFS can be deployed with metadata and storage separated. For storage we kept our original object storage, for which the cloud vendor guarantees eleven nines of durability; for metadata we chose TiKV, a distributed key-value store, because our online architecture team has extensive experience developing and operating TiKV and can provide a very strong SLA. This makes JuiceFS highly available and highly scalable.

### Rolling out JuiceFS

The JuiceFS rollout went through the following phases:

1. Data migration: we needed to sync the data previously stored on HDFS and object storage to JuiceFS. Since JuiceFS provides a data synchronization tool and the Flink dependencies are not particularly large, this part was finished quickly;
2. Change the address from which the Flink image pulls dependencies. Because JuiceFS is compatible with the object storage protocol, all we had to do on the platform side was change the original object storage endpoint to the address of the JuiceFS S3 gateway (illustrated right after this list).
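Because the JuiceFS S3 gateway speaks the same S3 protocol, the platform-side change really is just the endpoint. A minimal illustration, with hypothetical gateway address and bucket layout:

```python
# Minimal illustration of the platform-side change: only the endpoint moves
# from the raw object storage to the JuiceFS S3 gateway. Both addresses and
# the bucket/key names are hypothetical.
import boto3

# Before: the pre command downloads straight from the object storage.
old_client = boto3.client("s3", endpoint_url="http://object-store.internal:9000")

# After: same S3 calls and same bucket layout, but served by the JuiceFS
# S3 gateway, which adds caching in front of the same object storage.
new_client = boto3.client("s3", endpoint_url="http://juicefs-gateway.internal:9000")

# The download logic itself is unchanged.
new_client.download_file("flink-deps", "jobs/example-job/user.jar", "/tmp/user.jar")
```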
After JuiceFS went live, the Flink job startup flow looks roughly like this:

![Flink job startup flow with JuiceFS](https://static001.infoq.cn/resource/image/64/40/64b78f4133d06c2c2020a8b7dc8a9f40.jpg)

Compared with the HDFS approach, container startup time is now predictable, and the speed of downloading dependencies is no longer affected by business peaks; compared with raw object storage, containers download their dependencies roughly four times faster.

## Outlook

It took less than half a month from starting to evaluate JuiceFS to putting it into production, mainly because the JuiceFS documentation is very thorough and saved us a lot of detours, and the JuiceFS community answered every question we had, so the rollout went very smoothly.

The benefits from this first attempt with JuiceFS are already quite clear. Next we plan to apply JuiceFS to data lake scenarios and to loading algorithm models, making our use of data more flexible and efficient.

**About the author:**

Hu Mengyu, big data architecture development engineer at Zhihu, mainly responsible for the secondary development of Zhihu's internal big data components and for building the data platform.