機器學習平臺建設指南

原創

ThoughtWorks洞见

2021-05-11 08:03

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"伴隨着數據化、智能化的浪潮，很多大企業爲了沉澱通用技術和業務能力；加快企業智能化、規模化智能開發，開始了自建機器學習平臺。從零搭建一個機器學習平臺的複雜度是不容小覷的，關於平臺的定位、需要解決的問題；及其架構、技術選型等需要提前考量和設計。本文根據幾個從零到一的機器學習平臺構建經歷，再結合目前新興熱門的雲上機器學習平臺，試圖對機器學習平臺做一個概念和技術拆解。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"平臺的業務"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從平臺這個概念本身來說，它提供的是支撐作用，通過整合、管理不同的基礎設施、技術框架，一些通用的流程規範來形成一個通用的、易用的GUI來給用戶使用。通用性是它的考量之一、也是所有平臺的願景之一：希望平臺能適用於各個不同的業務線來產生價值。所以從業務上來說，作爲一個平臺本身是不會、也不應該有太多specific的業務功能的。當然這只是理想情況，有時候爲了平臺使用方的需求，也不得不加上一些業務領域特定的功能或者補丁來適應業務方，特別是平臺建設初期，在沒有太多業務的使用的時候。整體來看，平臺自身的業務可謂是非常簡單，可以用一張圖來表示："}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/2a\/2a1dcafa44919e4772e8d268823079ad.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上面這個分支是標準的機器學習流程的抽象。從數據準備，數據處理，模型訓練，再到模型上線實現價值完成整個流程。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面這個分支，也是機器學習和數據科學領域中不可或缺的，主要指的是類似於Jupyter Notebook這類，提供高度靈活性和可視化的數據探索服務，用戶可以在裏面進行數據探索、嘗試一些實驗來驗證想法。當然當平臺擁有了SDK\/CLI之後，也可以在裏面無縫集成上面這條線裏面的功能，將靈活性與功能性融合在一起。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"基礎設施"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上面提到平臺本身是整合了不同的基礎設施、技術框架和流程規範。在正式開始之前，有必要介紹下本文後續所使用的的基礎設施和框架。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在容器化、雲化的潮流中，Kubernetes基本上是必選的基礎設施，它提供靈活易用的基礎設施和應用管理能力，同時擴展性非常好。它可以用於"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"部署平臺本身"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"調度一些批處理任務(這裏主要指離線任務，如數據處理、模型訓練。適用的技術爲：Spark on kubernetes\/Kubernetes Job)，"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"部署常駐服務(一般指的是RESTful\/gRPC等爲基礎的服務，如啓動的Notebook、模型發佈後的模型推理服務。適用的技術爲Kubernetes Deployment\/Kubernetes Statefulset\/Service等)。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"機器學習的主要場景下，數據量都是非常大的，所以Hadoop這一套也是必不可少的，其中包含基礎的Hadoop(HDFS\/HIVE\/HBase\/Yarn)以及上層計算框架Spark等。大數據技術體系主要用於數據存儲和分佈式數據處理、訓練的業務。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後就是一些機器學習框架，Spark系的(Spark MLlib\/Angel)，Python系的(Tensorflow\/Pytorch)。主要用途就是模型訓練和模型發佈(Serving)的業務。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"原始數據"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原始數據，也叫做數據源，也就是機器學習的燃料。平臺本身並不關心原始數據是如何被收集的，只關心數據存儲的方式和位置。存儲的方式決定了平臺是否能支持此種數據的操作。存儲的位置決定了平臺是否有權限、有能力去讀取到此數據。按主流的情況來看，原始數據的存儲一般支持四類形式："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"傳統的數據庫，例如RDS\/NoSQL。"},{"type":"text","text":"這類數據源見得比較少，因爲一般在大數據場景下，通用的解決方案是將此類數據源通過一些工具導入到大數據體系中(如Sqoop)。對於此類數據源的支持也是很簡單的，使用通用的Driver 或者 Connector即可。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"以HDFS爲媒介的通用大數據存儲。"},{"type":"text","text":"此類數據源使用較爲廣泛，最常見的是HDFS文件(parquet\/csv\/txt)和基於HDFS的HIVE數據源。另外，由於是大數據場景下的經典數據源，所以上層的框架支持較爲完善。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"NFS。"},{"type":"text","text":"NFS由於其快速的讀寫能力，以及悠久的歷史。很多企業內部都有此基礎設施，因而已有的數據也極有可能存儲在上面。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"OSS(對象存儲)。"},{"type":"text","text":"最過於流行的要屬S3了。對象存儲是作爲數據湖的經典方案，使用簡單，存儲理論上無限，和HDFS一樣具備數據高可用，不允許按片段更改數據，只能修改整個對象是其缺點。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"值得注意的是，NFS和OSS一般用於存儲非結構化數據，例如圖片和視屏。或者用於持久化輸出目的，如容器存儲，業務日誌存儲。而HDFS和數據庫裏面存放的都是結構化、半結構化的數據，一般都是已經經過ETL處理過的數據。存儲的數據不一樣決定了後續的處理流程的區別："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"NFS\/OSS係數據源，基本上都是通過TensorFlow\/Pytorch來處理，數據一般通過Mount或者API來操作使用。當然也有特例，如果是使用雲服務，例如AWS的大數據體系的話，絕大多數場景下，是使用S3來代替HDFS使用的，這也得益於AWS本身對於S3的專屬EMRFS的定製化。當然Spark本身等大數據處理框架也是支持此類雲存儲的。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"HDFS系和傳統數據庫係數據源，這個大數據框架、Python系框架都是可以的。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"平臺一般會內嵌對以上數據源的支持能力。對於不支持的其他存儲，比如本地文件，一般的解決方案是數據遷移到支持的環境。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"數據導入"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這應該是整個平臺裏較爲簡單的功能，有點類似於元數據管理的功能，不過功能要更簡單。要做的是在平臺創建一個對應的數據Mapping。爲保證數據源的可訪問，以及用戶的操作感知。平臺要做的事情可以分爲三步："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"訪問對應的數據源，以測試數據是否存在、是否有權限訪問。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"拉取小部分數據作爲採樣存放在平臺中(如MySQL)，以便於快速展示，同時這也是數據本身的直觀展現(Sample)。對於圖片、視頻類的，只需要存儲元信息(如資源的url、大小、名稱)，展示的時候通過Nginx之類的代理訪問展示即可。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"如果是有Schema的數據源，例如Hive表，數據源的Schema也需要獲取到。爲後續數據處理作爲輸入信息。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"技術上來說，平臺中會存在與各種存儲設施交互的代碼，大量的外部依賴。此時，外部依賴可能會影響到平臺本身的性能和可用性，比如Hive的不可用、訪問緩慢等，這個是需要注意的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"業務上來說，更多考量的是提高系統的易用性，舉個例子，創建Hive表數據源，是不是可以支持自動識別分區，選擇分區又或者是動態分區(支持變量)等。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"數據處理"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這裏的數據處理是一個大的概念，從我的認知上來看。大體可以分成兩個部分，數據標註以及特徵處理。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"數據標註"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據標註針對是的監督學習任務，目前機器學習的應用場景大多都是應用監督學習任務習得的。巧婦難爲無米之炊，沒有足夠的標註數據，算法的發揮也不會好到哪兒去。對於標註這塊。業務功能上來說，基本上可以當成是另一套系統了，我們可以把它叫做標註平臺。但從邏輯關係上來說，標註數據是爲了機器學習服務的。本質上也是對數據的處理，所以劃到機器學習平臺裏面也沒有什麼問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"標註平臺對應了之前提到平臺的另一功能：提供通用的流程規範。這裏，流程指的是整個標註流程，規範指的是標註數據存儲的規範，比如在圖像領域，目前沒有通用而且規範的存儲格式，比較需要平臺來提供一種通用的存儲格式。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據標註有兩種，一種是人工標註; 另一種是使用已訓練好的機器學習模型來標註，然後再輔以人工確定和修訂。無論是使用哪種方式，最後都需要人工介入。因爲數據標註的正確性是非常重要的，大家應該都聽過一句話：數據和特徵決定了機器學習的上限，而模型和算法只是逼近這個上限而已。如果給你的數據都是有問題的，那模型如何能學習到正確的能力呢。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在標註平臺中，往往都是人工+模型混合使用的，整體流程如下："}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/49\/49b5b5cfc50e66456a8b18e18ec513ac.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"在最開始可能沒有模型來輔助進行標註(這裏的預標註)，這個時候就需要人爲手工介入，以此來訓練出一個模型"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"當我們根據標註出的數據(訓練數據集)訓練出不錯的模型之後，就可以使用此模型來對新輸入的數據做一個預標註的工作，然後再輔以人工確認和修訂"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"如此反覆迭代，隨着輸入數據的增多，當我們訓練出的模型準確率達到一個很高的水準之後，需要人工操作的數據就會越來越少。以此減少人工成本費"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"標註平臺中需要考量的問題有幾點："}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"一般來說，需要標註的數據量都是非常巨大的。不存在說兩三個人隨便標幾下就完事。所以更爲通用的做法是做一個衆包平臺，外包給其他公司或者自由職業者(標註員)來做。參考AWS的Amazon Mechanical Turk。這就涉及到了權限控制、角色控制。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"標註數據的準確性問題。不能避免有些標註員划水摸魚，隨意標註。此時是需要己方人員去做一個審計工作的，這裏就是一個工作流，和上圖所示差不多。是一個相對來說較爲規範的流程，可以隨着業務來增加額外的流程。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"標註工具的開發。如果有合適開源工具，可以集成開源工具，如VGG-VIA\/Vatic， poplar。開源工具一般不會滿足所有的需求，在開源上做二次開發往往成本更高。這裏更傾向於自己開發，這樣更能滿足需要，自定義程度高，靈活性強"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"標註數據的存儲。最好是一個相對通用的格式，以便於往各種其他格式做一個轉換。在使用時做一個數據格式轉換。這裏更偏好Json格式類似的存儲，可讀性和代碼可操作性都非常高"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"技術上的考量，這裏並不是太多。唯一需要注意的可能是數據的處理，如圖片、視頻標註，資源的拆剪、縮放，各種標註數據的保存，座標計算等。相對而言，後端主要是工作流處理，前端的各種標註工具是技術核心點。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"特徵處理"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在正式開始模型構建與訓練之前，對數據做特徵處理基本上是必不可少的環節。特徵處理的輸入是之前創建的數據集、或者是標註完成後產生的數據集。它的輸出是經過一系列數據操作後、可以直接輸入到模型中的數據，這裏稱它爲訓練數據。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這裏又可以分爲兩類，一類是不需要太多操作的特徵處理，常見於非結構化數據，如圖片、視屏。另一類是針對結構化\/半結構化數據的特徵處理流程，這類數據往往不是拿來即可用的，需要一定的特徵處理。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先來看第一類，第一類的典型應用場景就是CNN圖像領域。對於輸入的圖片，一般不需要太多的操作。此類數據源往往也是存儲在NFS\/OSS上的。對於此類數據的處理，一般也不會使用到大數據框架，一般都是自己手擼Python(C++)來處理；或者是放在模型訓練一起處理，不單獨拎開。Spark2.3開始是支持圖片格式的，不過目前沒有看到太多的使用案例。整體流程如下："}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/77\/771f5473587d34b301adc9a1a2cd7949.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本質上是一個批處理過程，Kubernetes的Job天然適用於此類問題。當然其他調度框架來做這個任務調度也是可以的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然後是第二類，這種基本上就是傳統的特徵工程了，以Hive\/HDFS輸入爲經典例子。由於對數據的操作是非常多樣的，所以一般傾向於構建一個數據處理的流水線，也就是Pipeline："}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d5\/d512c63dfab631f8c4cd0f09ebb46f51.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"把對數據的操作拆分成一個一個的操作運算(也就是一個Pipeline Node節點)，可以把它叫做算子,然後利用Pipeline把操作進行自由組合，從而達到對數據進行各種變換處理的過程。使用過Sklearn pipeline或者Spark MLlib pipeline的小夥伴應該對此並不陌生，事實上，這就是基於此的可視化建模。使得用戶可以在界面上進行拖拉拽，選擇和輸入一些參數來完成對於數據的操作，提供更靈活的功能。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"那麼這條Pipeline裏面都有哪些操作呢，整個算子大致可以分爲這幾類別："}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b5\/b505ada481ff1ba8afee2dfba0eed343.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"技術上，使用Spark MLlib是個不錯的選擇，可以適應不同數據量的規模。且天然支持這種Pipeline的形式，只需要對此做一定的封裝。同時，考慮到業務功能，還需要手動定製、修改一些算子；對於擴展這點，它的支持也是非常不錯的，自定義算子、算法都比較簡單。當然也可以基於其他框架(如Azkaban)或者自己造輪子。需要注意的是，在大數據場景下，Spark MLlib裏面部分自帶的算子會有性能問題(有些直接OOM)，需要分析源碼定位然後修復。包括我們自己實現的算子，也都要考量大數據場景下這個問題，這對於測試而言是個Case。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"整個Pipeline的核心是構造一個DAG，解析後，提交給對應的框架來做處理。拋開技術來說，功能上來說，易用性和快速反饋是Pipeline需要注意的。比如："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"所有的算子都應該能自動獲取到可用的輸入、對應類型，自動生成輸出"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"支持參數校驗，Pipeline完整性校驗"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"支持調試\/快速模式，利用部分Sample data快速驗證效果，相當於一個試驗功能"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"日誌查看，每個算子狀態查看，以及操作完成後的Sample data查看"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"支持一些變量、SQL書寫，以擴展更靈活的功能，比如參數動態化"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":6,"align":null,"origin":null},"content":[{"type":"text","text":"支持定時調度，任務狀態郵件、短袖及時告警、同步等"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"模型訓練"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"到了模型訓練這塊。從功能上來看需要的不是很多，更多是體驗上的設計。比如參數在界面上的展示和使用，日誌、訓練出的Metrics的可視化、交互方式，模型的一些可視化等。核心功能包含這幾塊兒："}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1.預置算法\/模型"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這裏考量的點是："}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"支持界面手動調參"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"訓練中日誌、Metrics的可視化。支持同時使用不同的參數多次訓練，直觀對比結果方便調參"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"需要支持常用的框架，如Spark MLlib\/Tensorflow\/Pytorch，或者一些特定的框架(業務方需要的)"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"預置算法\/模型很簡單，在不同的框架上封裝一層並且暴露對應的參數給到用戶即可。有些算法可能框架不支持，這種根據情況決定如何解決(自己實現或者其他解決方案)。有一點就是，框架不宜太多，否則在後續模型發佈的Serving上，也需要多套解決方案。對於訓練本身來說，分爲兩種：單體\/單機訓練和分佈式訓練，分佈式訓練是訓練這塊的大頭(也是技術難點所在)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於單體\/單機訓練的解決方案，和上訴特徵處理中的解決方案一致。優先使用Kubernetes的Job，或者使用Spark，控制下Executor即可。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於分佈式訓練，目前接觸到的，基本上都是數據並行："}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/bc\/bc12d5cb5ae7c696001e89ae4c48be4d.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以Spark爲基礎的基本都使用的是PS模式。PS模式一般使用的都是異步更新參數，每個Worker只需要和PS通信，所以容錯管理上會更加簡單。AllReduce模式則是同步的參數模式，而且每個Worker都需要上下游通信，相對而言，容錯處理會更苛刻一些。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark系的分佈式訓練，經典的算法可以由Spark MLlib提供。但是其對於DNN的支持並不好，而且Spark的訓練是一個同步的過程，如果數據傾斜的話，整個訓練的速度取決於最慢的RDD。如果要在Spark上跑深度學習的模型，一般會選用其他框架或者基於Spark的框架，如騰訊的Angel、 Intel的BigDL。兩者的思想都是一致的，同屬於PS模式下的分佈式訓練，整個流程可以概括爲這幾步："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"利用Spark來作爲任務調度、資源分配、數據讀取和容錯處理的容器"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"從PS拉取參數，然後將數據和參數通過JNI喂到底層的框架代碼中去(如Pytorch C++ Frontend)完成訓練，返回梯度"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"使用優化算法和梯度，更新參數，再將新的參數更新到PS。如此反覆"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"借用一張Angel的圖："}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/2a\/2aae2257cb47424a34d549df58a85b7f.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"順着這個思路，可以以Spark爲基礎，兼容任何深度學習框架。只要底層框架："}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"支持模型\/代碼序列化，以支持分發模型到所有Worker中進行本地訓練"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"通過JNI能調用模型提供的接口，如Forward\/Backward"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這要比自己造輪子更加便利和友好，畢竟Spark已經是一個成熟的分佈式計算框架了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果使用的是Python系的深度學習框架有兩點需要注意："}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"如何讀取分佈式存儲上的數據以支持數據並行。也就是Spark裏面Partition的概念。當然最簡單的方法，就是將數據拆分成塊，不同的節點讀取不同的數據來訓練。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"另外一點在於容錯性，基於Spark的分佈式訓練的優勢就在於數據的分配、容錯處理。現在就得自己去保障這兩點了。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"訓練部署上面。Spark系的，使用Spark支持的即可(Yarn\/Kubernetes\/Standalone)。Python系的更傾向於雲原生、或者基於Kubernetes的解決方案(自定義Operator是個比較好的選擇)，這裏有個爲雲原生和容器化而生的框架：Polyaxon，值得一試。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. 自定義算法"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"關於自定義算法，個人更傾向於將這個功能放在Notebook中而不是界面化的操作上，因爲這個功能的自由度很高，允許用戶自己寫代碼。平臺其實只需要提供："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"數據讀取能力，這裏包括源數據、平臺內生成的數據"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"NFS\/OSS使用Mount形式或者API提供(如Databricks 的DBFS)"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"HDFS系列的API提供支持(如提供pyspark)"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"調度能力，能根據需求調度起任務。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以上兩者功能，都或多或少依賴於SDK\/API\/CLI。可以參考大廠的實現Amazon SageMaker、Databricks Notebooks。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外一點是GPU的支持，這點，如果是選擇Kubernetes作爲基礎設施，是有基本的功能支持的，利用Extended Resource(ER)配合 Device plugin體系："}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/5b\/5b336fce6ab8c7a88370886e323365e5.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於Hadoop體系的話，3.1.X之後也是支持的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這兩者對於GPU的支持都非常有限，都是處於一個能用，但是不好用的狀態。使用的時候，淌坑可能在所難免。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. AutoML"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"關於AutoML這塊。目前還沒有太多的接觸。從我看來，做個簡單的隨機搜索還是問題不大的，集成算法來自動調參、自動構建模型也是不難的。就跟上訴集成算法進來是一樣的。唯一考量的是，集成那些算法進來以及計算資源的考量，尤其是深度學習大行其道的今天，類似於隨機搜索這樣耗費的資源是巨大的。如何設計這個功能，以及具體實現還是需要非常專業的領域知識。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"總的來說，模型訓練這塊。是一個很偏技術的功能。完成一個Demo的端到端，實現基本的功能很簡單。但是要做到好用、穩定性這兩點就並不那麼容易。尤其是分佈式機器學習這一塊兒，還是需要不斷淌坑的。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"模型發佈"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"終於來到最後一步，是整個平臺產生價值的最後一環。模型發佈，通常包含兩種發佈："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"發佈成API(RESTful\/gRPC)，以供業務側實時調用。這類服務通常對於性能要求苛刻"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"發佈成批處理形式，用於批處理數據，上述標註平臺中的預標註屬於此類任務"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於批處理形式的發佈，較爲簡單。一般框架本身都支持模型的序列化和加載做預測："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Spark系的就依舊走原始路線，提交Spark Job，做批處理預測"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"TensorFlow類似的，配合不同的鏡像環境，Mount模型，讀取數據，配合Kubernetes Job的方式來啓動批處理達到一個統一的處理流程"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於實時推理服務，於平臺而言，能使用通用的框架\/技術來做那當然是極好的。然鵝，現實很殘酷。通用的解決方案，要麼性能不夠(如MLeap)，要麼支持的算子\/操作\/輸入有限(如ONNX。想要在真正的業務場景下發揮價值都是需要淌坑的。用的更多的，往往都是性能領域強悍的C\/C++\/RUST來實現，比如自研、或者框架本身自帶的(如TensorFlow的Serving)，會更能滿足性能需求。平臺選用的框架最好是不要太多，這樣做Serving會比較麻煩，不同的框架都要去找解決方案實現一遍。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"雖然做不到通用的模型Serving方案，但是通用的模型發佈、提供服務的平臺卻是可以做到的。這是這部分的重點，即如何設計一個通用的推理平臺。一個推理平臺需要考量的點如下所示："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"支持各種框架，多語言、多環境"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"模型版本管理"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"API安全性"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"可用性、可擴展性，服務狀態監控、動態擴容"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"監控、告警，日誌可視化"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":6,"align":null,"origin":null},"content":[{"type":"text","text":"灰度上線\/AB Testing"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"把Serving看做是普通的Web服務。對於上訴的點，Kubernetes體系是絕佳的選擇："}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"鏡像，配合NFS的模型，啓動時Mount再複製到容器內部。可以支持多語言、多環境"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"版本管理，每個版本在NFS上創建新的版本目錄，比如0.0.1，放入對應的模型和啓動腳本配合一個Service。這樣就允許不同版本的Service同時在線，或者部分在線"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"API安全性，與Kubernetes本身無關。簡單的做法是爲每個API生成對應的調用Token來做鑑權，或者使用API網關來做"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"可用性、可擴展性。利用Kubernetes原生的Liveness\/Readness probe，可以檢測單個Pod中服務的狀態。使用Kubernetes的HPA、或者自己實現監控再配合Autoscale可以滿足需求，而且這些都可以成爲一些配置暴露給用戶"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"監控、告警。可以Prometheus+Grafana+Alertmanager三件套，或者配合企業內部的運維體系"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":6,"align":null,"origin":null},"content":[{"type":"text","text":"交給Istio即可。灰度發佈肯定妥妥的，利用一些Cookie\/Header來做分流到不同版本的服務也可以做到部分AB Testing的事情"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於Serving服務與服務之間是沒有關係的，是一個無狀態的單體服務。對於單體服務的各種處理就會比較簡單。比如不使用Kubernetes，使用AWS雲的EC2來部署，配合Auto Scaling、CloudWatch等也是非常簡單的。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Addons"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除開上訴的核心功能。一般來說，機器學習平臺還有一些擴展。最經典的當屬CLI和SDK了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"CLI就不用多說了，主要提供快捷入口。可以參考AWS的CLI。需要包含的點主要是："}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"提供基本的操作功能，滿足用戶的使用"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"以及導入導出功能。方便用戶共享配置、快速創建和使用平臺的功能"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"支持多環境、認證"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"完善的操作使用文檔、幫助命令、模板文件等"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於SDK這塊，需要和平臺本身的功能做深度集成；同時，還需要一定的靈活性。比如上訴的自定義算法，用戶可以在Notebook裏面讀取平臺生成的數據集，寫完代碼後，還得支持提交分佈式訓練。既要集成平臺功能，又得支持一定程度的自定義，如何設計會是一個難點。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"探索實驗(Notebook)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Notebook本身功能並不複雜。實質上就是對Jupyter Notebook\/JupyterLab等的包裝。通常的做法是使用Kubernetes的Service，啓動一個Notebook。以下幾點是需要考量的："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"存儲要能持久化，需要引入類似於NFS這樣的持久目錄以供用戶使用"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"內置Demo，如讀取平臺數據、訪問資源、提交任務"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"預置不同的鏡像，支持不同的框架，以及CPU\/GPU訴求用於啓動不同的Notebook"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"安全問題，一個是Notebook本身的安全。另一個是容器本身的安全"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"有可用的安全源，以供用戶安裝和下載一些包和數據"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":6,"align":null,"origin":null},"content":[{"type":"text","text":"能夠訪問到基礎設施裏面的數據，Mount\/API"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":7,"align":null,"origin":null},"content":[{"type":"text","text":"最好能有Checkpoint機制，不用的時候回收資源。需要的時候再啓動，但環境不變"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大致的解決思路如下："}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"持久化需要Mount存儲。Storage class可以幫助"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Notebook自身的安全，可以在裏面增加Handler或者外面配置一層Proxy來鑑權。容器的安全大概有這幾點需要考量(要做到更嚴格還有很多可以深挖)："}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"限制容器內部用戶的權限"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"開啓SELinux"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"Docker的—cap-add\/—cap-drop"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":6,"align":null,"origin":null},"content":[{"type":"text","text":"heckpoint機制的實現，數據已經持久化在外部存儲了。只需要Save新的鏡像，作爲下次啓動的鏡像即可，可使用Daemonset+Docker in dokcer的形式做一個微服務來提交鏡像。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"平臺的基石"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這一節主要討論的是做平臺一定繞不開的問題。首先逃脫不了的是多租戶問題："}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"可能要和已有的系統進行集成。而不是新開發"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"租戶間權限、數據隔離"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"租戶資源配額、任務優先級劃分"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"技術上來說，系統自身的租戶功能難度不高。主要是數據、資源隔離這塊。要看對應的基礎設施："}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Hadoop可以使用Keberos"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Kubernetes可以使用Namespace"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"再配合不同的調度隊列。即可完成需求。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外一點是偏技術層面的調度功能。這可以算是整個平臺運行的基礎，無論是數據處理還是模型訓練、模型部署都依賴於此，調度系統分爲兩塊："}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"調度本身，即如何調度不同的任務到不同的基礎設施上"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"狀態管理，有些任務可能存在依賴關係，會觸發不同的Action。這也是一個狀態機的問題"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"狀態管理這塊，業務場景不復雜的情況下自己實現即可。一般平臺上也沒有太複雜的依賴關係，經典的模型訓練也不過單向鏈路而已。如果過於複雜，可以考慮其他開源任務編排工具："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/85\/85b173d9526e0ed9b5e8eac5794a3f30.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"調度本身是一個任務隊列+任務調度組合而成。隊列用啥都可以，數據庫也可以。只要能保證多個Consumer的數據一致性即可。任務調度，如果不考慮用戶的感受，可以直接丟給對應的基礎設施，如Yarn\/Kubernetes都可以。由這些基礎設施來負責調度和管理，這樣最簡單。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果想做的盡善盡美，那還是需要維護一個資源池。同步基礎設施的使用情況，然後根據剩餘資源和任務優先級來做調度。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"狀態更新方面。對於Yarn體系，只能通過API去輪詢狀態、獲取日誌。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於Kubernetes體系，推薦一個Informer機制，通過List&Watch機制可以方便、快速獲取整個集羣的情況："}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/28\/281c43aae9de83677ff2ff4bc0675e8d.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"關於如何提交任務到基礎設施："}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Spark\/Yarn - 直接API、包裝下命令行者Livy都可"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Kubernetes - SDK"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"最終Boss"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"做平臺，最怕的是什麼？當然不是某某技術太過於複雜，攻克不下。而是沒有用戶使用平臺，或者不能通過平臺產生真實業務價值，這纔是最重要的。有某一業務場景在平臺實現端到端產生價值，才能證明這個平臺的價值，才能吸引其他用戶來使用。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於用戶來說，一般都已經有自己成熟的技術線。想要他們往平臺上遷移，實屬不易，除非已有系統滿足不了他們的需求、或者已有系統不好用。畢竟，自己用的好好的系統，遷移是需要耗時耗力的，用別人的東西也沒有自己做來的靈活；學習一個新的系統使用也是如此，有可能一開始用着並不習慣。此外，新的系統，沒有人期望第一個來當小白鼠的。有這麼多缺點，當然平臺本身也有其優點。比如將基礎設施這塊甩給了第三方，不用去關注底層的東西，只要關注自己的業務了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"事實上的流程往往都是，隨着平臺MVP的發佈。部分用戶開始上來試用，然後提各種想法和需求；緊接着，會逐漸遷移部分功能上來試用，比如搞些數據上來跑跑Demo，看看訓練的情況等。同時，在這個階段，就會需要各種適配用戶原始的系統了，比如原始數據導入與現有系統不兼容，需要格式的轉換；再比如只想用平臺的部分功能，想集成到自己現有的系統中等。最後纔是，真正的在平臺上使用。想要用戶能端到端的使用平臺，來完成他們的業務需求，還是需要漫長的過程的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相對而言，至上而下的推動往往更爲有效。總有一批人要先來體驗、先來淌坑，給出建議和反饋。這樣這個平臺纔會越來越好，朝好的地方發展。而不是一開始上來就堆功能，什麼炫酷搞什麼、大而全。但是在用戶使用上，易用性和穩定性並不好，或者是並不能解決用戶的需求和難點。那這種平臺是活不下去的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文轉載自：ThoughtWorks洞見（ID：TW-Insights）"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文鏈接："},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/HEg_6Gly2WMrcPD5Ao2n6g","title":"xxx","type":null},"content":[{"type":"text","text":"機器學習平臺建設指南"}]}]}]}

發表評論

所有評論

還沒有人評論，想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.

相關文章

從前端到全棧 -- 最全面向對象總結

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragr

程序员海军

2021-12-21 10:54:01

跨語言的多模態、多任務檢索模型MURAL解讀

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"typ

2021-12-21 10:54:01

谷歌發佈生態系統RLDS，可在強化學習中生成、共享和使用數據集

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"typ

2021-12-20 10:53:54

程序員如何建立第二大腦

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"typ

2021-12-20 10:43:54

改善十年應用的部署體驗

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"typ

2021-12-21 11:13:52

智慧家庭場景的推薦系統的發展歷程和方向 | InfoQ《公開課》

直播概要：隨着計算機的蓬勃發展，互聯網進入大數據和人工智能時代，爲了解決信息過載和長尾商品，推薦系統成爲唯一選擇，而面對不同的業務場景，爲了解決業務痛點，會根據不同的場景特點尋找不同的方法和手段來解決推薦中實際遇到的問題。在智慧家庭領域，

InfoQ 中文站

2021-12-21 10:54:01

一場數據架構變革正在來臨

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null

2021-12-21 10:54:01

Log4j2 維護者：沒工資還捱罵；阿里每週可選一天靈活辦公；亞馬遜 CTO 預測2022年五大技術趨勢；蘋果正式推出“數字遺產”...

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"typ

2021-12-21 10:53:51

一篇帶你用 VuePress + Github Pages 搭建博客

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"前言","attrs

2021-12-21 10:53:51

【HZERO微服務平臺3】源碼分析之oauth服務token生成、校驗、獲取信息、傳遞

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"headin

2021-12-20 11:08:55

BPF 和 Go: Linux 中的現代內省形式

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null

2021-12-20 11:08:55

從混合包開發到100%純鴻蒙應用還有多遠？優酷鴻蒙版的開發實踐與思考｜卓越技術團隊訪談錄

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"typ

2021-12-19 12:03:53

「如何從零到一實現一個玩具瀏覽器🌏」

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null

2021-12-18 13:28:55

Facebook 如何做大規模服務的自主測試

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragr

2021-12-21 10:54:01

K8s 安全指南

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"typ

2021-12-17 17:58:58

24小時熱門文章

.NET開源強大、易於使用的緩存框架 - FusionCache

最新文章

最新評論文章