【乾貨篇】bilibili:基於 Flink 的機器學習工作流平臺在 b 站的應用

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"分享嘉賓:張楊,B 站資深開發工程師","attrs":{}}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"導讀:整個機器學習的過程,從數據上報、到特徵計算、到模型訓練、再到線上部署、最終效果評估,整個流程非常冗長。在 b 站,多個團隊都會搭建自己的機器學習鏈路,來完成各自的機器學習需求,工程效率和數據質量都難以保證。於是我們基於 Flink 社區的 aiflow 項目,構建了整套機器學習的標準工作流平臺,加速機器學習流程構建,提升多個場景的數據實效和準確性。本次分享將介紹 b 站的機器學習工作流平臺 ultron 在 b 站多個機器學習場景上的應用。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"目錄:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1、機器學習實時化","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2、Flink 在 B 站機器學習的使用","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3、機器學習工作流平臺構建","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4、未來規劃","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"一、機器學習實時化","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c1/c13523ef070aaca4276fc15d39a6210a.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先講下機器學習的實時化,主要是分爲三部分:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一是樣本的實時化。傳統的機器學習,樣本全部都是 t+1,也就是說,今天模型用的是昨天的訓練數據,每天早上使用昨天的全天數據訓練一次模型;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二是特徵的實時化。以前的特徵也基本都是 t+1,這樣就會帶來一些推薦不準確的問題。比如,今天我看了很多新的視頻,但給我推薦的卻還是一些昨天或者更久之前看到的內容;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第三就是模型訓練的實時化。我們有了樣本的實時化和特徵的實時化之後,模型訓練也是完全可以做到在線訓練實時化的,能帶來更實時的推薦效果。","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"傳統離線鏈路","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/6d/6d40cd8300be346b7d50228161ddc25a.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上圖是傳統的離線鏈路圖,首先是 APP 產生日誌或者服務端產生 log,整個數據會通過數據管道落到 HDFS 上,然後每天 t+1 做一些特徵生成和模型訓練,特徵生成會放到特徵存儲裏面,可能是 redis 或者一些其他的 kv 存儲,再給到上面的 inference 在線服務。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"傳統離線鏈路的不足","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a2/a29f5efc959323d431ff383812a47576.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"那它有什麼問題呢?","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一是 t+1 數據模型的特徵時效性都很低,很難做到特別高時效性的更新;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二是整個模型訓練或者一些特徵生產的過程中,每天都要用天級的數據,整個訓練或者特徵生產的時間非常長,對集羣的算力要求非常高。","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"實時鏈路","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/cb/cb0f0d5e25124d096a21a7a2084e0da1.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上圖我們進行優化之後整個實時鏈路的過程,紅叉的部分是被去掉的。整個數據上報後通過 pipeline 直接落到實時的 kafka,之後會做一個實時特徵的生成,還有實時樣本的生成,特徵結果會寫到 feature store 裏面去,樣本的生成也需要從 feature store 裏面去讀取一些特徵。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"生成完樣本之後我們直接進行實時訓練。整個右邊的那個很長的鏈路已經去掉了,但是離線特徵的部分我們還是保存了。因爲針對一些特殊特徵我們還是要做一些離線計算,比如一些特別複雜不好實時化的或者沒有實時化需求的。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"二、Flink 在 b 站機器學習的使用","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/57/57418e204f7232cbec92212d1855d05a.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面講下我們是怎麼做到實時樣本、實時特徵和實時效果評估的。","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一個是實時樣本。Flink 目前託管 b 站所有推薦業務樣本數據生產流程;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二個是實時特徵。目前相當一部分特徵都使用了 Flink 進行實時計算,時效性非常高。有很多特徵是使用離線 + 實時組合的方式得出結果,歷史數據用離線算,實時數據用 Flink,讀取特徵的時候就用拼接。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但是,這兩套計算邏輯有的時候不能複用,所以我們也在嘗試使用 Flink 做批流一體,將特徵的定義全部用 Flink 來做,根據業務需要,實時算或者離線算,底層的計算引擎全部是 Flink;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第三是實時效果的一個評估,我們使用了 Flink+olap 來打通整個實時計算 + 實時分析鏈路,進行最終的模型效果評估。","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"實時樣本生成","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/71/712a8973c6a0d1edd68a3c31aac4faa0.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上圖是目前實時樣本的生成,是針對整個推薦業務鏈路的。日誌數據落入 kafka 後,首先我們做一個 Flink 的 label-join,把點擊和展現進行拼接。結果繼續落入 kafka 後,再接一個 Flink 任務進行特徵 join,特徵 join 會拼接多個特徵,有些特徵是公域特徵,有些是業務方的私域特徵。特徵的來源比較多樣,有離線也有實時。特徵全部補全之後,就會生成一個 instance 樣本數據落到 kafka,給後面的訓練模型使用。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"實時特徵生成","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/fc/fc6b36c979db5b4f2d643c19901559df.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上圖是實時特徵的生成,這邊列的是一個比較複雜的特徵的過程,整個計算流程涉及到了 5 個任務。第一個任務是離線任務,後面有 4 個 Flink 任務,一系列複雜計算後生成的一個特徵落到 kafka 裏面,再寫入 feature-store,然後被在線預測或者實時訓練所用到。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"實時效果評估","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d1/d10e753303c50d1ed12adc506834edaf.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上圖是實時效果的評估,推薦算法關注的一個非常核心的指標就是 ctr 點擊率,做完 label-join 之後,就可以算出 ctr 數據了,除了進行下一步的樣本生成之外,同時會導一份數據到 clickhouse 裏面,報表系統對接後就可以看到非常實時的效果。數據本身會帶上實驗標籤,在 clickhouse 裏面可以根據標籤進行實驗區分,看出對應的實驗效果。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"三、機器學習工作流平臺構建","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"痛點","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/84/84819212c1dc8225e516d713bbe4a09e.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"機器學習的整個鏈路裏面有樣本生成、特徵生成、訓練、預測、效果評估,每個部分都要配置開發很多任務,一個模型的上線最終需要橫跨多個任務,鏈路非常長。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"新的算法同學很難去理解這個複雜鏈路的全貌,學習成本極高。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"整個鏈路的改動牽一髮而動全身,非常容易出故障。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"計算層用到多個引擎,批流混用,語義很難保持一致,同樣的邏輯要開發兩套,保持沒有 gap 也很困難。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"整個實時化成本門檻也比較高,需要有很強的實時離線能力,很多小的業務團隊在沒有平臺支持下難以完成。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/28/28eae484d1e632c5dc294b710a6307ff.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上圖是一個模型從數據準備到訓練的大概過程,中間涉及到了七八個節點,那我們能不能在一個平臺上完成所有的流程操作?我們爲什麼要用 Flink?是因爲我們團隊實時計算平臺是基於 Flink 來做的,我們也看到了 Flink 在批流一體上的潛力以及在實時模型訓練和部署上一些未來發展路徑。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"引入 Aiflow","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/bf/bfbac30aff4c8316a9b8bb3367a096cd.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Aiflow 是阿里的 Flink 生態團隊開源的一套機器學習工作流平臺,專注於流程和整個機器學習鏈路的標準化。去年八、九月份,我們在和他們接觸後,引入了這樣一套系統,一起共建完善,並開始逐漸在 b 站落地。它把整個機器學習抽象成圖上的 example、transform 、Train、validation、inference 這些過程。在項目架構上非常核心的能力調度就是支持流批混合依賴,元數據層支持模型管理,非常方便的進行模型的迭代更新。我們基於此搭建了我們的機器學習工作流平臺。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"平臺特性","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/70/70afc200b01a45f2c1c2f53f083568e6.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"接下來講一下平臺特性:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一是使用 Python 定義工作流。在 ai 方向,大家用 Python 還是比較多的,我們也參考了一些外部的,像 Netflix 也是使用 Python 來定義這種機器學習的工作流。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二是支持批流任務混合依賴。在一個完整鏈路裏面,涉及到的實時離線過程都可以加入到裏面,並且批流任務之間可以通過信號就行互相依賴。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第三是支持一鍵克隆整個實驗過程。從原始 log 到最終整個實驗拉起訓練這塊,我們是希望能夠一鍵整體鏈路克隆,快速拉起一個全新的實驗鏈路。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第四是一些性能方面的優化,支持資源共享。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第五是支持特徵回溯批流一體。很多特徵的冷啓動需要計算曆史很長時間的數據,專門爲冷啓動寫一套離線特徵計算邏輯成本非常高,而且很難和實時特徵計算結果對齊,我們支持直接在實時鏈路上來回溯離線特徵。","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"基本架構","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/9d/9d45f585d07badd84c43d310bdad5197.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上圖是基本架構,最上面是業務,最下面是引擎。目前支持的引擎也比較多:Flink、spark、Hive、kafka、Hbase、Redis。其中有計算引擎,也有存儲引擎。以 aiflow 作爲中間的工作流程管理,Flink 作爲核心的計算引擎,來設計整個工流平臺。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"工作流描述","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/72/7242a69890e55ac65bf83c52bc81318f.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"整個工作流是用 Python 來描述的,在 python 裏面用戶只需要定義計算節點和資源節點,以及這些節點之間的依賴關係即可,語法有點像調度框架 airflow。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"依賴關係定義","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/54/54d96f481dac04e1e8fbf1f8e5c5482d.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"批流的依賴關係主要有 4 種:流到批,流到流,批到流,批到批。基本可以滿足目前我們業務上的所有需求。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"資源共享","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/49/49218e3c651d99775a7ceb0e2d636e24.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"資源共享主要是用來做性能方面,因爲很多時候一個機器的學習鏈路非常長,比如剛剛那個圖裏面我經常改動的可能只有五六個節點,當我想重新拉起整個實驗流程,把整個圖克隆一遍,中間我只需要改動其中的部分節點或者大部分節點,上游節點是可以做數據共享的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/0e/0ec76a896d8af251a09a2e6fad615b53.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這個是技術上的實現,克隆之後對共享節點做了一個狀態追蹤。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"實時訓練","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7a/7a54c10d7fa997e17cc2965ebb0196b0.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上圖是實時訓練的過程。特徵穿越是一個非常常見的問題,多個計算任務的進度不一致時就會發生。在工作流平臺裏面,我們定義好各個節點的依賴關係即可,一旦節點之間發生了依賴,處理進度就會進行同步,通俗來說就是快的等慢的,避免特徵穿越。在 Flink 裏面我們是使用 watermark 來定義處理進度。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"特徵回溯","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4b/4bda6925dbf64ef8ef855c88e934b57e.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上圖是特徵回溯的過程,我們使用實時鏈路,直接去回溯它歷史數據。離線和實時數據畢竟不同,這中間有很多問題需要解決,因此也用到了 spark,後面這塊我們會改成 Flink。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"特徵回溯的問題","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/cb/cb280d022d45e44d158260ac0d99f8c7.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"特徵回溯有幾個比較大的問題:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一是如何保證數據的順序性。實時數據有個隱含的語義就是數據是順序進來的,生產出來立馬處理,天然有一定的順序性。但是離線的 HDFS 不是,HDFS 是有分區的,分區內的數據完全亂序,實際業務裏面大量計算過程是依賴時序的,如何解決離線數據的亂序是一個很大的問題。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二是如何保證特徵和樣本版本的一致性。比如有兩條鏈路,一條是特徵的生產,一條是樣本生產,樣本生產依賴特徵生產,如何保證它們之間版本的一致性,沒有穿越?","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第三就是如何保證實時鏈路和回溯鏈路計算邏輯的一致?這個問題其實對我們來說不用擔心,我們是直接在實時鏈路上回溯離線數據。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第四是一些性能方面的問題,怎麼快速得算完大量的歷史數據。","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"解決方案","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e7/e79ad21826f0fa2f3fc09906f0d75777.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以下是第一、第二個問題的解決方案:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一個問題。爲了數據的順序性,我們 HDFS 的離線數據進行 kafka 化處理,這裏不是把它灌到 kafka 裏面去,而是模擬 kafka 的數據架構,分區並且分區內有序,我們把 HDFS 數據也處理成類似的架構,模擬成邏輯上的分區,並且邏輯分區內有序,Flink 讀取的 hdfssource 也進行了對應的開發支持這種模擬的數據架構。這塊的模擬計算目前是使用 spark 做的,後面我們會改成 Flink。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二個問題分爲兩部分:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"實時特徵部分的解決依賴於 Hbase 存儲,Hbase 支持根據版本查詢。特徵計算完後直接按照版本寫入 Hbase,樣本生成的時候去查 Hbase 帶上對應的版本號即可,這裏面的版本通常是數據時間。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"離線特徵部分,因爲不需要重新計算了,離線存儲 hdfs 都有,但是不支持點查,這塊進行 kv 化處理就好,爲了性能我們做了異步預加載。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f1/f11e09e67d091f9d4d75b79e04cab691.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"異步預加載的過程如圖。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"四、未來規劃","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"接下來介紹下我們後面規劃。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e3/e3a40399325785c043f842f827096665.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一個是數據質量保證。現在整個鏈路越來越長,可能有 10 個節點、 20 個節點,那怎麼在整個鏈路出問題的時候快速發現問題點。這裏我們是想針對節點集來做 dpc,對每個節點我們可以自定義一些數據質量校驗規則,數據通過旁路到統一的 dqc-center 進行規則運算告警。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/26/2685ade4f8e55a42e73a5a9ce728cac0.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二是全鏈路的 exactly once,工作流節點之間如何保證精確一致,這塊目前還沒有想清楚。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/2d/2dd6bfae5dfb8405f4611d1eac8947c8.png","alt":"img","title":"img","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第三是我們會在工作流裏面加入模型訓練和部署的節點。訓練和部署可以是連接到別的平臺,也可能是 Flink 本身支持的訓練模型和部署服務。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"嘉賓介紹:","attrs":{}},{"type":"text","text":"張楊,17 年入職 b 站,從事大數據方面工作。","attrs":{}}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章