爲什麼數據科學家不需要了解Kubernetes

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"摘要"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最近,關於數據科學家的工作應該包含哪些,有許多激烈的討論。許多公司都希望數據科學家是全棧的,其中包括瞭解比較底層的基礎設施工具,如Kubernetes(K8s)和資源管理。本文旨在說明,雖然數據科學傢俱備全棧知識有好處,但如果他們有一個良好的基礎設施抽象工具可以使用,那麼即使他們不瞭解K8s,依然可以專注於實際的數據科學工作,而不是編寫有效的YAML文件。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"正文"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最近,關於數據科學家的工作應該包含哪些,有許多激烈的討論("},{"type":"link","attrs":{"href":"https:\/\/www.reddit.com\/r\/datascience\/comments\/i48b5q\/for_those_that_work_for_a_team_that_has_both_data\/","title":"xxx","type":null},"content":[{"type":"text","text":"1"}]},{"type":"text","text":"、"},{"type":"link","attrs":{"href":"https:\/\/twitter.com\/bernhardsson\/status\/1417664482776690692","title":"xxx","type":null},"content":[{"type":"text","text":"2"}]},{"type":"text","text":"、"},{"type":"link","attrs":{"href":"https:\/\/veekaybee.github.io\/2019\/02\/13\/data-science-is-different\/","title":"xxx","type":null},"content":[{"type":"text","text":"3"}]},{"type":"text","text":")。許多公司都希望數據科學家是全棧的,其中包括瞭解比較底層的基礎設施工具,如Kubernetes(K8s)和資源管理。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文旨在說明,雖然數據科學傢俱備全棧知識有好處,但如果他們有一個良好的基礎設施抽象工具可以使用,那麼即使他們不瞭解K8s,依然可以專注於實際的數據科學工作,而不是"},{"type":"link","attrs":{"href":"https:\/\/www.arp242.net\/yaml-config.html3\/data-science-is-different\/","title":"xxx","type":null},"content":[{"type":"text","text":"編寫有效的YAML文件"}]},{"type":"text","text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文是基於這樣一個假設,即對於全棧數據科學家的期望來自這些公司開發和生產環境的巨大差異。接下來,本文討論了消除環境差異的兩個步驟:第一步是容器化;第二步是基礎設施抽象。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於容器化,人們或多或少都有所瞭解,但基礎設施抽象是相對比較新的一類工具,許多人仍然把它們和工作流編排弄混。本文最後一部分是比較各種工作流編排和基礎設施工具,包括Airflow、Argo、Prefect、Kubeflow和Metaflow。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Roles and Responsibilities:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Automate horrible business practices"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Write ad hoc SQL as needed"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"REQUIRED EXPERIENCE:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"15 years exp deep learning in Python"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"PhD thesis on Bayesian modeling"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"NLP experience in 7 languages"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"10 years of creating Hadoop clusters from scratch"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"— Nick Heitzman (@NickDoesData) February 12, 2019 Requirementsfor data scientists in real-time Network latency from VermontTworeal-life data scientist job descriptions"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/70\/70a466c4160d6f9a67983a6efe98df50.png","alt":"此處輸入圖片的描述","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"兩份真實的數據科學職位描述"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"目錄"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"全棧的期望"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"開發和生產環境分離"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"消除差異第一步:容器化"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"消除差異第二部:基礎設施抽象"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"工作流編排 vs. 基礎設施抽象"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"工作流編排:Airflow vs. Prefect vs. Argo"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基礎設施抽象:Kubeflow vs. Metaflow"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"注意"}]},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"生產是一個範疇。對於有些團隊,生產意味着從筆記本生成的結果生成漂亮的圖表向業務團隊展示。對於其他團隊,生產意味着保證每天服務於數百萬用戶的模型正常運行。在第一種情況下,生產環境和開發環境類似。本文提到的生產環境更接近於第二種情況。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"本文不是要論證K8s是否有用。K8s有用。在本文中,我們只討論數據科學家是否需要了解K8s。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"本文不是要論證全棧沒用。如果你精通這個管道中的每個部分,我認爲會有十幾家公司當場僱用你(如果你允許的話,我也會努力招募你)。但是,如果你想成爲一名數據科學家,不要想着要掌握全棧。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"全棧的期望"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大約1年前,我在推特上"},{"type":"link","attrs":{"href":"https:\/\/twitter.com\/chipro\/status\/1315283623910805504","title":"xxx","type":null},"content":[{"type":"text","text":"羅列"}]},{"type":"text","text":"了對於一名ML工程師或數據科學家而言非常重要的技能。該列表幾乎涵蓋了工作流的每一部分:數據查詢、建模、分佈式訓練、配置端點,甚至還包括像Kubernetes和Airflow這樣的工具。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果我想自學成爲一名ML工程師,那麼我會優先學習下列內容:"}]},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"版本控制"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"SQL + NoSQL"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"Python"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"Pandas\/Dask"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"數據結構"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":6,"align":null,"origin":null},"content":[{"type":"text","text":"概率 & 統計"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":7,"align":null,"origin":null},"content":[{"type":"text","text":"ML algos"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":8,"align":null,"origin":null},"content":[{"type":"text","text":"並行計算"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":9,"align":null,"origin":null},"content":[{"type":"text","text":"REST API"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":10,"align":null,"origin":null},"content":[{"type":"text","text":"Kubernetes + Airflow"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":11,"align":null,"origin":null},"content":[{"type":"text","text":"單元\/集成測試"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"——— Chip Huyen (@chipro),2020年11月11日"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這條推特似乎引起了我的粉絲的共鳴。之後,Eugene Yan給我發消息說,他也撰文討論了"},{"type":"link","attrs":{"href":"https:\/\/eugeneyan.com\/writing\/end-to-end-data-science\/","title":"xxx","type":null},"content":[{"type":"text","text":"數據科學家如何在更大程度上做到端到端"}]},{"type":"text","text":"。Stitch Fix首席算法官Eric Colson(之前是Netflix數據科學和工程副總裁)也寫了一篇博文“"},{"type":"link","attrs":{"href":"https:\/\/multithreaded.stitchfix.com\/blog\/2019\/03\/11\/FullStackDS-Generalists\/","title":"xxx","type":null},"content":[{"type":"text","text":"全棧數據科學通才的強大與職能分工的危險性"}]},{"type":"text","text":"”。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在我發那條推特時,我認爲Kubernetes是DS\/ML工作流必不可少的部分。這個看法源於我在工作中的挫敗感——我是一名ML工程師,如果我能更熟練地使用K8s,那麼我的工作會更簡單。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然而,隨着對底層基礎設施瞭解的深入,我認識到,期望數據科學家瞭解這些並不合理。基礎設施需要的技能集與數據科學的需求完全不同。理論上,你可以都學。但實際上,你在一個方面花的時間多,在另一個方面花的時間肯定就少。我很喜歡Erik Bernhardsson打的那個比方,"},{"type":"link","attrs":{"href":"https:\/\/twitter.com\/fulhack\/status\/1417664482776690692","title":"xxx","type":null},"content":[{"type":"text","text":"期望數據科學家瞭解基礎設施就像期望應用開發人員瞭解Linux內核的工作原理"}]},{"type":"text","text":"。我成爲數據科學家,是因爲我想把更多時間花在數據上,而不是花在啓動AWS實例、編寫Dockerfile、調度\/擴展集羣或是調試YAML配置文件。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/5d\/5dc18fee720180248c4a4cc44a679fd5.png","alt":"此處輸入圖片的描述","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"開發和生產環境分離"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"那麼爲什麼會有這種不合理的預期?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在我看來,一個原因是數據科學的開發和生產環境之間存在着很大的差別。開發和生產環境之間有許多不同的地方,但是有兩個關鍵的差異使得數據科學家不得不掌握兩個環境的兩套工具,那就是"},{"type":"text","marks":[{"type":"strong"}],"text":"規模"},{"type":"text","text":"和"},{"type":"text","marks":[{"type":"strong"}],"text":"狀態"},{"type":"text","text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/03\/0320798bc62b5d7d3978be928846a072.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在開發過程中,你可能會啓動一個conda環境,使用notebook,藉助pandas的DataFrame操作靜態數據,藉助sklearn、PyTorch或TensorFlow編寫模型代碼,運行並跟蹤多個實驗。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一旦對結果滿意了(或是沒時間了),你就會選取最好的模型將其投入生產應用。將模型投入生產應用基本上是說“將其從開發環境移到生產環境”。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"幸運的話,開發環境中的Python代碼可以在生產環境中重用,你所要做的是將notebook代碼粘貼複製到合適的腳本中。如果運氣不好,你可能需要將Python代碼用C++或公司在生產環境中使用的其他語言來重寫。依賴項(pandas、dask、PyTorch、TF等)就需要在運行模型的生產實例上重新打包和生成。如果你的模型服務於大量的流量,並且需要大量的計算資源,那麼你可能需要進行任務調度。之前,你需要手動啓動實例,或是在流量比較小的時候關閉實例,但現在,大部分公有云提供商都幫我們做了這項工作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在傳統軟件開發中,CI\/CD可以幫助我們彌補這種差距。精心開發的測試集讓我們可以測出在本地進行的修改到生產環境會產生什麼行爲。不過,對於數據科學而言,只有CI\/CD還不夠。除此之外,"},{"type":"link","attrs":{"href":"https:\/\/twitter.com\/chipro\/status\/1313921889061015557","title":"xxx","type":null},"content":[{"type":"text","text":"生產環境中的數據分佈一直在變化"}]},{"type":"text","text":"。不管你的ML模型在開發環境中效果多好,你都無法確定它們在實際的生產環境中表現如何。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於存在這種差別,所以數據科學項目會涉及兩套工具:一套用於開發環境,一套用於生產環境。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"消除差異第一步:容器化"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"容器化技術,包括Docker,其設計初衷就是爲了幫助我們在生產機器上重建開發環境。使用Dokcer的時候,你創建一個Dockerfile文件,其中包含一步步的指令(安裝這個包,下載這個預訓練的模型,設置環境變量,導航到一個文件夾,等等),讓你可以重建運行模型的環境。這些指令讓你的代碼可以在任何地方的硬件運行上運行。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果你的應用程序做了什麼有趣的事情,那麼你可能需要不只一個容器。考慮這樣一種情況:你的項目既包含運行速度快但需要大量內存的特徵提取代碼,也包含運行速度慢但需要較少內存的模型訓練代碼。如果要在相同的GPU實例上運行這兩部分代碼,則需要大內存的GPU實例,這可能非常昂貴。相反,你可以在CPU實例上運行特徵提取代碼,在GPU實例上運行模型訓練代碼。這意味着你需要一個特徵提取實例的容器和一個訓練實例的容器。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當管道的不同步驟存在相互衝突的依賴項時,也可能需要不同的容器,如特徵提取代碼需要NumPy 0.8,但模型需要NumPy 1.0。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當存在多個實例的多個容器時,你需要建立一個網絡來實現它們之間的通信和資源共享。你可能還需要一個容器編排工具來管理它們,保證高可用。Kubernetes就是幹這個的。當你需要更多的計算\/內存資源時,它可以幫助你啓動更多實例的容器,反過來,當你不再需要它們時,它可以把它們關掉。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"目前,爲了協調開發和生產兩個環境,許多團隊選擇了下面兩種方法中的一種:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"由一個單獨的團隊管理生產環境"}]}]},{"type":"listitem","attrs":{"listStyle":"none"},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"在這種方法中,數據科學\/ML團隊在開發環境中開發模型。然後由一個單獨的團隊(通常是Ops\/Platform\/MLE團隊)在生產環境中將模型生產化。這種方法存在許多缺點。"}]}]}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"增加溝通和協調開銷:不同的團隊之間可能相互妨礙。按照Frederick P. Brooks的說法是,"},{"type":"text","marks":[{"type":"italic"}],"text":"一名程序員一個月可以完成,兩名程序需要兩個月"},{"type":"text","text":"。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"增加調試難度:當出現問題時,你不知道是自己團隊的代碼導致的,還是其他團隊的代碼導致的。可能根本就和你團隊的代碼無關。你需要和多個團隊一起才能找出問題所在。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相互指責:即使你弄清楚了問題出在哪裏,每個團隊也可能會覺得另一個團隊應該負責修復。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"窄語境:沒有人瞭解整個過程,也就無法對總體流程進行優化\/改進。例如,平臺團隊知道如何改進基礎設施,但他們只會應數據科學家的請求來做工作,但數據科學家並不一定要與基礎設施打交道,所以他們不關心。"}]}]}]},{"type":"numberedlist","attrs":{"start":2,"normalizeStart":2},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"數據科學家擁有整個過程"}]}]},{"type":"listitem","attrs":{"listStyle":"none"},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"在這種方法中,數據科學團隊還需要考慮如何將模型投入生產應用。數據科學家變成了脾氣暴躁的獨角獸,人們期望他們瞭解這個過程中的所有工作,與數據科學相比,他們最終可能要寫出更多的樣板代碼。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"消除差異第二步:基礎設施抽象"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果我們有一種抽象方法,讓數據科學家可以擁有端到端的過程,而又不必擔心基礎設施的問題,會怎麼樣?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果我可以直接告訴工具:這裏是我存儲數據的地方(S3),這裏是我運行代碼的步驟(特徵提取、建模),這裏是我運行代碼的地方(EC2實例、AWS Batch、Function等無服務器類的東西),這裏是我的代碼在每一步需要運行的東西(依賴項)。然後這個工具會爲我管理所有基礎設施相關的工作,那會怎麼樣?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"根據Stitch Fix和Netflix的說法,全棧數據科學家的成功依賴於他們擁有的工具。他們需要的工具應該能夠“"},{"type":"link","attrs":{"href":"https:\/\/huyenchip.com\/assets\/pics\/dsinfra\/4_netflix.png","title":"xxx","type":null},"content":[{"type":"text","text":"將數據科學家從容器化、分佈式處理、自動故障轉移及其他複雜的高級計算機科學概念中抽離出來"}]},{"type":"text","text":"”。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在"},{"type":"link","attrs":{"href":"https:\/\/huyenchip.com\/assets\/pics\/dsinfra\/4_netflix.png","title":"xxx","type":null},"content":[{"type":"text","text":"Netflix的模型"}]},{"type":"text","text":"中,專家——那些原本就擁有部分項目的人——首先創建了使自己那部分自動化的工具。數據科學家可以利用這些工具來實現自己項目的端到端。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/e5\/e5a5f15d767d74687d5fe10de4222a95.png","alt":"此處輸入圖片的描述","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Netflix的全生命週期開發人員"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"好消息是,你不在Netflix工作也可以使用他們的工具。兩年前,Netflix開源了"},{"type":"link","attrs":{"href":"https:\/\/netflixtechblog.com\/open-sourcing-metaflow-a-human-centric-framework-for-data-science-fa72e04a5d9","title":"xxx","type":null},"content":[{"type":"text","text":"Metaflow"}]},{"type":"text","text":",這是一個基礎設施抽象工具,使他們的數據科學家能夠開展全棧工作,而不必擔心底層基礎設施。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於大多數公司來說,數據科學對基礎設施進行抽象的需求是一個相當新的問題。這主要是因爲,以前在大多數公司,數據科學工作的規模並沒有達到讓基礎設施成爲問題的程度。基礎設施抽象主要是在雲設置相當複雜的時候纔有用。從中受益最多的公司是那些擁有數據科學家團隊、大型工作流程和多個生產模型的公司。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"工作流編排 vs. 基礎設施抽象"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因爲對基礎設施進行抽象的需求是最近纔出現的問題,所以其前景尚不確定(而且極其混亂)。你是否曾經疑惑,Airflow、Kubeflow、MLflow、Metaflow、Prefect、Argo等之間到底有什麼區別,並不是只有你有這種感覺。Paolo Di Tommaso的"},{"type":"link","attrs":{"href":"https:\/\/github.com\/pditommaso\/awesome-pipeline","title":"xxx","type":null},"content":[{"type":"text","text":"awesome-pipeline"}]},{"type":"text","text":"存儲庫中有近200個工作流\/管道工具包。其中大多數是工作流編排工具,而不是基礎設施抽象工具,但是,人們對這兩類工具多有混淆,讓我們看看它們之間的一些關鍵的相似性和差異。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/14\/143e2151e527c383c653a50fd16eaebb.png","alt":"此處輸入圖片的描述","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"強烈建議企業不要在工具名稱中使用“flow”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"造成這種混亂的一個原因是,所有這些工具的基本概念都相同。它們都把工作流程當作一個DAG,即有向無環圖。工作流程中的每一個步驟都對應圖上的一個節點,而步驟之間的邊表示這些步驟的執行順序。它們的不同之處在於如何定義這些步驟,如何打包它們以及在哪裏執行。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/14\/14e8671fad064560ec54fc349a813720.png","alt":"此處輸入圖片的描述","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"工作流的DAG表示"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"工作流編排:Airflow vs. Prefect vs. Argo"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Airflow最初是由Airbnb開發的,於2014年發佈,是最早的工作流編排器之一。它是一個令人讚歎的任務調度器,並提供了一個非常大的操作符庫,使得Airflow很容易與不同的雲提供商、數據庫、存儲選項等一起使用。Airflow是“"},{"type":"link","attrs":{"href":"https:\/\/airflow.apache.org\/docs\/apache-airflow\/stable\/","title":"xxx","type":null},"content":[{"type":"text","text":"配置即代碼"}]},{"type":"text","text":"”原則的倡導者。它的創建者認爲,數據工作流很複雜,應該用代碼(Python)而不是YAML或其他聲明性語言來定義。(他們是對的。)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/49\/49ba2b18117223dccfc12af4a5446382.png","alt":"此處輸入圖片的描述","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Airflow中一個使用了DockerOperator的簡單工作流。本示例來自Airflow 存儲庫。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然而,由於比其他大多數工具創建得更早,所以Airflow沒有任何工具可以借鑑,並因此有很多缺點,Uber工程公司的"},{"type":"link","attrs":{"href":"https:\/\/eng.uber.com\/managing-data-workflows-at-scale\/","title":"xxx","type":null},"content":[{"type":"text","text":"這篇博文"}]},{"type":"text","text":"對此做了詳細討論。在這裏,我們只介紹其中三個,讓你大概有個瞭解。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先,Airflow是單體的,這意味着它將整個工作流程打包成了一個容器。如果你的工作流程中存在兩個不同步驟有不同的要求,理論上,你可以使用Airflow提供的"},{"type":"link","attrs":{"href":"https:\/\/github.com\/apache\/airflow\/blob\/main\/airflow\/providers\/docker\/example_dags\/example_docker.py","title":"xxx","type":null},"content":[{"type":"text","text":"DockerOperator"}]},{"type":"text","text":"創建不同的容器,但這並不容易。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二,Airflow的DAG沒有參數化,這意味着你無法向工作流中傳入參數。因此,如果你想用不同的學習率運行同一個模型,就必須創建不同的工作流。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第三,Airflow的DAG是靜態的,這意味着它不能在運行時根據需要自動創建新步驟。想象一下,當你從數據庫中讀取數據時,你想創建一個步驟來處理數據庫中的每一條記錄(如進行預測),但你事先並不知道數據庫中有多少條記錄,Airflow處理不了這個問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下一代工作流編排器(Argo、Prefect)就是爲了解決Airflow不同方面的缺點而創建的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Prefect首席執行官Jeremiah Lowin是Airflow的核心貢獻者。他們在早期的營銷活動中對Prefect和Airflow做了強烈的"},{"type":"link","attrs":{"href":"https:\/\/medium.com\/the-prefect-blog\/why-not-airflow-4cfa423299c4b\/main\/airflow\/providers\/docker\/example_dags\/example_docker.py","title":"xxx","type":null},"content":[{"type":"text","text":"對比"}]},{"type":"text","text":"。Prefect的工作流實現了參數化,而且是動態的,與Airflow相比有很大的改進。它還遵循 “配置即代碼”的原則,因此工作流是用Python定義的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然而,像Airflow一樣,容器化步驟並不是Prefect的首要任務。你可以在容器中運行每個步驟,但仍然需要處理Dockerfile,並在Prefect中註冊工作流docker。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Argo解決了容器的問題。在Argo的工作流程中,每一步都在自己的容器中運行。然而,Argo的工作流是用YAML定義的,這讓你可以在同一個文件中定義每個步驟及其要求。但YAML會讓你的工作流定義變得混亂,難以調試。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/f4\/f44c999f40460b4e283071ebe08370cd.png","alt":"此處輸入圖片的描述","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這是Argo中一個擲硬幣的工作流。可以想象一下,如果你做的事情遠比這個有趣,那麼這個文件會多麼凌亂。本示例來自Argo存儲庫。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了YAML文件比較亂之外,Argo的主要缺點是它只能在Kubernetes集羣上運行,而通常Kubernetes集羣只在生產環境中提供。如果你想在本地測試同樣的工作流,就必須使用minikube或k3d。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"基礎設施抽象:Kubeflow vs. Metaflow"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"像Kubeflow和Metaflow這樣的基礎設施抽象工具,旨在將運行Airflow或Argo通常需要的基礎設施模板代碼抽象出來,幫助你在開發和生產環境中運行工作流。它們承諾讓數據科學家可以從本地筆記本上訪問生產環境的全部計算能力,實際上,這就讓數據科學家可以在開發和生產環境中使用相同的代碼。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"儘管它們有一些工作流編排能力,但它們是要與真正的工作流編排器搭配使用的。事實上,Kubeflow的其中一個組件Kubeflow Pipelines就是基於Argo構建的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了爲你提供一致的開發和生產環境外,Kubeflow和Metaflow還提供了其他一些不錯的特性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"版本控制"},{"type":"text","text":":自動生成工作流模型、數據和工件的快照。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"依賴項管理"},{"type":"text","text":":由於它們允許工作流的每個步驟都在自己的容器中運行,所以你可以控制每個步驟的依賴項。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"可調試性"},{"type":"text","text":":當一個步驟失敗時,你可以從失敗的步驟恢復工作流,而不是從頭開始。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"它們都是完全參數化的,而且是動態的。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"目前,Kubeflow更流行,因爲它與K8s集羣做了集成(同時,它是由谷歌創建的),而Metaflow只能用於AWS服務(Batch、Step Functions等)。然而,它最近從Netflix剝離了出來,成了一家"},{"type":"link","attrs":{"href":"http:\/\/slack.outerbounds.co\/","title":"xxx","type":null},"content":[{"type":"text","text":"創業公司"}]},{"type":"text","text":",所以我預計它很快就會發展到更多的用例。至少,"},{"type":"link","attrs":{"href":"https:\/\/github.com\/Netflix\/metaflow\/issues\/50","title":"xxx","type":null},"content":[{"type":"text","text":"原生的K8s集成正在進行中"}]},{"type":"text","text":"!"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從用戶體驗的角度來看,我認爲Metaflow更勝一籌。在Kubeflow中,雖然你可以用Python定義工作流,但你仍然需要寫一個Dockerfile和一個YAML文件來指定每個組件的規格(如處理數據、訓練、部署),然後才能將它們拼接到Python工作流中。因此,Kubeflow幫助你抽離了其他工具的模板,你只需要編寫Kubeflow模板就行了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c9\/c9976abd97cf75b73d0124547da54a82.png","alt":"此處輸入圖片的描述","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubeflow工作流。儘管可以用Python創建Kubeflow工作流,但仍有許多配置文件需要編寫。本示例來自Kubeflow存儲庫。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在Metaflow中,你可以使用Python裝飾器"},{"type":"codeinline","content":[{"type":"text","text":"@conda"}]},{"type":"text","text":"來指定每個步驟的需求——所需的庫、內存和計算資源需求——Metaflow將自動創建一個滿足所有這些要求的容器來執行該步驟。你不用再編寫Dockerfiles或YAML文件。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Metaflow讓你可以在同一個notebook\/腳本中實現開發和生產環境的無縫銜接。你可以在本機上運行小數據集實驗,當你準備在雲上運行大數據集實驗時,只需添加"},{"type":"codeinline","content":[{"type":"text","text":"@batch"}]},{"type":"text","text":"裝飾器就可以在"},{"type":"link","attrs":{"href":"https:\/\/aws.amazon.com\/batch\/","title":"xxx","type":null},"content":[{"type":"text","text":"AWS Batch"}]},{"type":"text","text":"上執行。你甚至可以在不同的環境中運行同一工作流的不同步驟。例如,如果一個步驟需要的內存較小,就可以在本地機器上運行。但如果下一步需要的內存較大,就可以直接添加"},{"type":"codeinline","content":[{"type":"text","text":"@batch"}]},{"type":"text","text":"在雲端執行。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"# 示例:一個組合使用了兩種模型的推薦系統的框架\n# A模型在本地機器上運行,B模型在AWS上運行\n\nclass RecSysFlow(FlowSpec):\n @step\n def start(self):\n self.data = load_data()\n self.next(self.fitA, self.fitB)\n\n # fitA requires a different version of NumPy compared to fitB\n @conda(libraries={\"scikit-learn\":\"0.21.1\", \"numpy\":\"1.13.0\"})\n @step\n def fitA(self):\n self.model = fit(self.data, model=\"A\")\n self.next(self.ensemble)\n \n @conda(libraries={\"numpy\":\"0.9.8\"})\n # Requires 2 GPU of 16GB memory\n @batch(gpu=2, memory=16000)\n @step\n def fitB(self):\n self.model = fit(self.data, model=\"B\")\n self.next(self.ensemble)\n \n @step\n def ensemble(self, inputs):\n self.outputs = (\n (inputs.fitA.model.predict(self.data) + \n inputs.fitB.model.predict(self.data)) \/ 2\n for input in inputs\n )\n self.next(self.end)\n\n def end(self):\n print(self.outputs)\n"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"總結"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這篇文章的長度和信息量都遠遠超出了我的預期。這有兩個方面的原因,一是所有與工作流有關的工具都很複雜,而且很容易混淆,二是我自己無法找到一種更簡單的方式來解釋它們。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面是本文的一些要點,希望對你有所啓發。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"開發環境和生產環境之間的差異,導致企業希望數據科學家能夠掌握兩套完整的工具:一套用於開發環境,一套用於生產環境。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"數據科學項目端到端可以加速執行,並降低溝通開銷。然而,只有當我們有好的工具來抽象底層基礎設施,幫助數據科學家專注於實際的數據科學工作,而不是配置文件時,這纔有意義。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"基礎設施抽象工具(Kubeflow、Metaflow)與工作流編排器(Airflow、Argo、Prefect)似乎很相似,因爲它們都將工作流視爲DAG。然而,基礎設施抽象的主要價值在於使數據科學家可以在本地和生產環境中使用相同的代碼。基礎設施抽象工具可以和工作流編排器搭配使用。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"在使用它們之前,很多數據科學家都不知道他們需要這樣的基礎設施抽象工具。務必試一下(Kubeflow比較複雜,但Metaflow只需5分鐘就能上手)。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"更新"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Yuan Tang是Argo的頂級貢獻者,他對本文的"},{"type":"link","attrs":{"href":"https:\/\/www.linkedin.com\/posts\/terrytangyuan_why-data-scientists-shouldnt-need-to-know-activity-6843530211145900032-_vG_","title":"xxx","type":null},"content":[{"type":"text","text":"評論"}]},{"type":"text","text":"如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Argo是一個很大的項目,包括Workflows、Events、CD、Rollouts等。因此,在與其他工作流引擎比較時,使用子項目"},{"type":"link","attrs":{"href":"https:\/\/argoproj.github.io\/argo-workflows\/","title":"xxx","type":null},"content":[{"type":"text","text":"Argo Workflows"}]},{"type":"text","text":"更準確。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"還有一些項目爲Argo Workflows提供了更高層次的Python接口,這樣數據科學家就不必使用YAML了。特別地,可以研究下使用Argo Workflows作爲工作流引擎的"},{"type":"link","attrs":{"href":"https:\/\/github.com\/couler-proj\/couler","title":"xxx","type":null},"content":[{"type":"text","text":"Couler"}]},{"type":"text","text":"和Kubeflow Pipelines。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"人們還提到了其他一些很棒的工具,我在這裏就不一一列舉了,比如"},{"type":"link","attrs":{"href":"https:\/\/mlflow.org\/","title":"xxx","type":null},"content":[{"type":"text","text":"MLFlow"}]},{"type":"text","text":"或"},{"type":"link","attrs":{"href":"https:\/\/flytettps\/\/mlflow.org\/","title":"xxx","type":null},"content":[{"type":"text","text":"Flyte"}]},{"type":"text","text":"。我目前還在學習該領域的相關知識。非常感謝您的反饋。謝謝!"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"查看英文原文:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/huyenchip.com\/2021\/09\/13\/data-science-infrastructure.html","title":"xxx","type":null},"content":[{"type":"text","text":"Why data scientists shouldn’t need to know Kubernetes"}]}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章