TensorFlow On Flink 原理解析

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"簡介:"},{"type":"text","text":" 本文將分享如何使用一套引擎搞定機器學習全流程的解決方案。先介紹一下典型的機器學習工作流程。如圖所示,整個流程包含特徵工程、模型訓練、離線或者是在線預測等環節。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"作者:陳戊超(仲卓),阿里巴巴技術專家"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"深度學習技術在當代社會發揮的作用越來越大。目前深度學習被廣泛應用於個性化推薦、商品搜索、人臉識別、機器翻譯、自動駕駛等多個領域,此外還在向社會各個領域迅速滲透。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"背景"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當前,深度學習的應用越來越多樣化,隨之涌現出諸多優秀的計算框架。其中 TensorFlow,PyTorch,MXNeT 作爲廣泛使用的框架更是備受矚目。在將深度學習應用於實際業務的過程中,往往需要結合數據處理相關的計算框架如:模型訓練之前需要對訓練數據進行加工生成訓練樣本,模型預測過程中需要對處理數據的一些指標進行監控等。在這樣的情況下,數據處理和模型訓練分別需要使用不同的計算引擎,增加了用戶使用的難度。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文將分享如何使用一套引擎搞定機器學習全流程的解決方案。先介紹一下典型的機器學習工作流程。如圖所示,整個流程包含特徵工程、模型訓練、離線或者是在線預測等環節。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/06/06f5d8d438e63890eb9c296f840a6fcb.png","alt":"圖片 1.png","title":"圖片 1.png","style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在此過程中,無論是特徵工程、模型訓練還是模型預測,中間都會產生日誌。需要先用數據處理引擎比如 Flink 對這些日誌進行分析,然後進入特徵工程。再使用深度學習的計算引擎 TensorFlow 進行模型訓練和模型預測。當模型訓練好了以後再用 tensor serving 做在線的打分。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上述流程雖然可以跑通,但也存在一定的問題,比如:"}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"同一個機器學習項目在做特徵工程、模型訓練、模型預測時需要用到 Flink 和 TensorFlow 兩個計算引擎,部署相對而言更復雜。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"TensorFlow 在分佈式的支持上還不夠友好,運行過程中需要指定機器的 IP 地址和端口號;而實際生產過程經常是運行在一個調度系統上比如 Yarn,需要動態分配 IP 地址和端口號。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"TensorFlow 的分佈式運行缺乏自動的 failover 機制。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/54/54e8e065694084e426c445b2cb9bd855.png","alt":"圖片 2.png","title":"圖片 2.png","style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"特徵工程用 Flink 去執行,模型訓練和模型的準實時預測目標使 TensorFlow 計算引擎可以跑在 Flink 集羣上。這樣就可以用 Flink 一套計算引擎去支持模型訓練和模型的預測,部署上更簡單的同時也節約了資源。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Flink 計算簡介"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/2e/2e6b43bb1fd7ff6a04fb93ccfa583910.png","alt":"圖片 3.png","title":"圖片 3.png","style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Flink 是一款開源大數據分佈式計算引擎,在 Flink 裏所有的計算都抽象成 operator,如上圖所示,數據讀取的節點叫 source operator,輸出數據的節點叫 sink operator。source 和 sink 中間有多種多樣的 Flink operator 去處理,上圖的計算拓撲包含了三個 source 和兩個 sink。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"機器學習分佈式拓撲"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"機器學習分佈式運行拓撲如下圖所示:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/5a/5adfc88de95e317dbd205fcd25314460.png","alt":"圖片 4.png","title":"圖片 4.png","style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在一個機器學習的集羣當中,經常會對一組節點(node)進行分組,如上圖所示,一組節點可以是 worker(運行算法),也可以是 ps(更新參數)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如何將 Flink 的 operator 結構與 Machine Learning 的 node、Application Manager 角色結合起來?下面將詳細講解 flink-ai-extended 的抽象。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Flink-ai-extended 抽象"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先,對機器學習的 cluster 進行一層抽象,命名爲 ML framework,同時機器學習也包含了 ML operator。通過這兩個模塊,可以把 Flink 和 Machine Learning Cluster 結合起來,並且可以支持不同的計算引擎,包括 TensorFlow。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如下圖所示:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/80/807e50799f7001987794f68eb0e26e44.png","alt":"圖片 5.png","title":"圖片 5.png","style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在 Flink 運行環境上,抽象了 ML Framework 和 ML Operator 模塊,負責連接 Flink 和其他計算引擎。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"ML Framework"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/10/107f165185c8025117b0821e243092c4.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/12/12a800ec5c246f5308bcf6416e0e6f84.png","alt":"圖片 6.png","title":"圖片 6.png","style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ML Framework 分爲 2 個角色。"}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Application Manager(以下簡稱 am) 角色,負責管理所有 node 的節點的生命週期。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"node 角色,負責執行機器學習的算法程序。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在上述過程中,還可以對 Application Manager 和 node 進行進一步的抽象,Application Manager 裏面我們單獨把 state machine 的狀態機做成可擴展的,這樣就可以支持不同類型的作業。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"深度學習引擎,可以自己定義其狀態機。從 node 的節點抽象 runner 接口,這樣用戶就可以根據不同的深度學習引擎去自定義運行算法程序。 "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/24/24b4c997e72abf3c35aacac34e615dc2.png","alt":"圖片 7.png","title":"圖片 7.png","style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"ML Operator"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ML Operator 模塊提供了兩個接口:"}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"addAMRole,這個接口的作用是在 Flink 的作業裏添加一個 Application Manager 的角色。Application Manager 角色如上圖所示就是機器學習集羣的管理節點。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"addRole,增加的是機器學習的一組節點。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"利用 ML Operator 提供的接口,可以實現 Flink Operator 中包含一個Application Manager 及 3 組 node 的角色,這三組 node 分別叫 role a、 role b,、role c,三個不同角色組成機器學習的一個 cluster。如上圖代碼所示。Flink 的 operator 與機器學習作業的 node 一一對應。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"機器學習的 node 節點運行在 Flink 的 operator 裏,需要進行數據交換,原理如下圖所示:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/48/48b639d8a892d4edaf4de2128ec59966.png","alt":"圖片 8.png","title":"圖片 8.png","style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Flink operator 是 java 進程,機器學習的 node 節點一般是 python 進程,java 和 python 進程通過共享內存交換數據。 "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TensorFlow On Flink"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"TensorFlow 分佈式運行"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/93/93c658a53474a39534beae51b0b1a46e.png","alt":"圖片 9.png","title":"圖片 9.png","style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TensorFlow 分佈式訓練一般分爲 worker 和 ps 角色。worker 負責機器學習計算,ps 負責參數更新。下面將講解 TensorFlow 如何運行在 Flink 集羣中。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"TensorFlow Batch 訓練運行模式"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1d/1d71e04010fac304ef809dfb829dfc74.png","alt":"圖片 10.png","title":"圖片 10.png","style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Batch 模式下,樣本數據可以是放在 HDFS 上的,對於 Flink 作業而言,它會起一個source 的 operator,然後 TensorFlow 的 work 角色就會啓動。如上圖所示,如果 worker 的角色有三個節點,那麼 source 的並行度就會設爲 3。同理下面 ps 角色有 2 個,所以 ps source 節點就會設爲 2。而 Application Manager 和別的角色並沒有數據交換,所以 Application Manager 是單獨的一個節點,因此它的 source 節點並行度始終爲 1。這樣 Flink 作業上啓動了三個 worker 和兩個 ps 節點,worker 和 ps 之間的通訊是通過原始的 TensorFlow 的 GRPC 通訊來實現的,並不是走 Flink 的通信機制。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"TensorFlow stream 訓練運行模式"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/0c/0cd1e364e66ede80f8f8fd018aa8e5fb.png","alt":"圖片 11.png","title":"圖片 11.png","style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如上圖所示,前面有兩個 source operator,然後接 join operator,把兩份數據合併爲一份數據,再加自定義處理的節點,生成樣本數據。在 stream 模式下,worker 的角色是通過 UDTF 或者 flatmap 來實現的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同時,TensorFlow worker node 有3 個,所以 flatmap 和 UDTF 相對應的 operator 的並行度也爲 3, 由於ps 角色並不去讀取數據,所以是通過 flink source operator 來實現。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面我們再講一下,如果已經訓練好的模型,如何去支持實時的預測。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"使用 Python 進行預測"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1b/1b611cf28c375d297d8a404ac8b966d6.png","alt":"圖片 12.png","title":"圖片 12.png","style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"使用 Python 進行預測流程如圖所示,如果 TensorFlow 的模型是分佈式訓練出來的模型,並且這個模型非常大,比如說單機放不下的情況,一般出現在推薦和搜索的場景下。那麼實時預測和實時訓練原理相同,唯一不同的地方是多了一個加載模型的過程。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在預測的情況下,通過讀取模型,將所有的參數加載到 ps 裏面去,然後上游的數據還是經過和訓練時候一樣的處理形式,數據流入到 worker 這樣一個角色中去進行處理,將預測的分數再寫回到 flink operator,並且發送到下游 operator。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"使用 Java 進行預測"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/65/65fcd1ad0981907d0d710a747688e1db.png","alt":"圖片 13.png","title":"圖片 13.png","style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如圖所示,模型單機進行預測時就沒必要再去起 ps 節點,單個 worker 就可以裝下整個模型進行預測,尤其是使用 TensorFlow 導出 save model。同時,因爲 saved model 格式包含了整個深度學習預測的全部計算邏輯和輸入輸出,所以不需要運行 Python 的代碼就可以進行預測。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"此外,還有一種方式可以進行預測。前面 source、join、UDTF 都是對數據進行加工處理變成預測模型可以識別的數據格式,在這種情況下,可以直接在 Java 進程裏面通過 TensorFlow Java API,將訓練好的模型 load 到內存裏,這時會發現並不需要 ps 角色, worker 角色也都是 Java 進程,並不是 Python 的進程,所以我們可以直接在 Java 進程內進行預測,並且可以將預測結果繼續發給 Flink 的下游。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"總結"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在本文中,我們講解了 flink-ai-extended 原理,以及Flink 結合 TensorFlow 如何進行模型訓練和預測。希望通過本文大分享,大家能夠使用 flink-ai-extended, 通過 Flink 作業去支持模型訓練和模型的預測。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"機器學習/深度學習 自然語言處理 算法 自動駕駛 Java TensorFlow 數據處理 算法框架/工具 流計算Python"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"版權聲明:"},{"type":"text","text":"本文中所有內容均屬於阿里雲開發者社區所有,任何媒體、網站或個人未經阿里雲開發者社區協議授權不得轉載、鏈接、轉貼或以其他方式複製發佈/發表。申請授權請郵件[email protected],已獲得阿里雲開發者社區協議授權的媒體、網站,在轉載使用時必須註明\"稿件來源:阿里雲開發者社區,原文作者姓名\",違者本社區將依法追究責任。 如果您發現本社區中有涉嫌抄襲的內容,歡迎發送郵件至:[email protected] 進行舉報,並提供相關證據,一經查實,本社區將立刻刪除涉嫌侵權內容。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章