別再把數據當作商品了

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"機器學習在現實世界的應用潛力是無限的,但其實有87%的機器學習應用項目在概念驗證階段就宣告失敗。作者基於在1萬多個項目上5萬多模型訓練的經驗,總結了機器學習應用項目失敗的一些原因,並提出從以模型爲中心的開發模式轉變爲以數據爲中心的開發模式是促使更多項目成功的解決方案。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"既然你正在閱讀這篇文章,我應該就沒必要講什麼機器學習研究中那些動聽的故事了,也沒必要囉嗦人工智能在現實世界場景中的應用潛力有多大。然而,其實機器學習應用項目中有87%在概念驗證階段就已經宣告失敗,而從未進入生產階段[1]。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們已經和"},{"type":"link","attrs":{"href":"http:\/\/hasty.ai\/?utm_source=blog&utm_medium=medium&utm_campaign=tds","title":"","type":null},"content":[{"type":"text","text":"hasty.ai"}]},{"type":"text","text":"的用戶一起開發了超過1萬個項目,並且爲生產環境訓練了超過5萬個視覺模型,我會在這篇文章中深入探討開發和訓練過程中所學到的東西。至於爲何這麼多的機器學習應用項目會失敗,我總結了一些原因並提出了一個解決方案。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"太長不看版:"},{"type":"text","text":" 我們需要一個範式轉變,從以模型爲中心轉變爲以數據爲中心的機器學習開發。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"免責聲明:在"},{"type":"link","attrs":{"href":"http:\/\/hasty.ai\/?utm_source=blog&utm_medium=medium&utm_campaign=tds","title":"","type":null},"content":[{"type":"text","text":"hasty.ai"}]},{"type":"text","text":",我們正在爲以數據爲中心的視覺AI建立一個端到端平臺。因此,我用到的所有例子都是關於視覺領域的。不過,相信我提出的概念對其他領域的人工智能也同樣適用。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"許多機器學習應用團隊是在模仿研究思維"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"要理解爲什麼87%的機器學習項目會失敗,需要先退一步,瞭解下當前大多數團隊是如何構建機器學習應用的。這其實很大程度上取決於機器學習研究的工作方式。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"研究團隊大多是在嘗試推進特定任務的前沿發展水平(SOTA),而SOTA是通過在給定數據集上達到的最佳性能評分來衡量的。研究團隊需要保持數據集不變,嘗試去改進現有方法的一小部分,以實現0.x%的性能提升,因爲這樣是可以發表論文的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"研究領域中的每一次熱點炒作都是在這種追逐SOTA的心態下產生的,許多從事機器學習應用的團隊都懷揣這種心態,認爲這同樣也能帶領他們走向成功。所以,他們會在模型開發方面投入大量的資源,卻把數據當作給定或可以外包的東西。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"追求SOTA是機器學習應用失敗的祕訣"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"無論如何,追求SOTA這種思路是有缺陷的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先,機器學習研究和機器學習應用的目標差別很大。作爲一個研究人員,你要確保自己處於領先水平。相反,如果從事機器學習應用,首要目標應該是讓它在生產中發揮作用--就算使用一個五年前的架構,那又如何?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二,你在現實世界中遇到的條件與研究環境大相徑庭,事情會複雜得多。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/db\/74\/db2999ba0ab5cae035343f6b45da4374.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":10}}],"text":"在研究中,團隊在完美的條件下用乾淨和結構化的數據開展工作。然而現實中,數據可能大不相同,團隊需要面臨一系列全新的挑戰。圖片由作者提供,通過imgflip.com創建。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"講一個我最喜歡的都市傳說,來自自動駕駛領域:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"和大多數機器學習研究一樣,原型是在灣區開發的,在美國中西部首次試駕之前團隊已經花了數年時間調整模型。但最終在中西部試駕時,模型卻崩潰了,因爲所有訓練數據都是在陽光明媚的加州收集的。這些模型無法處理中西部的惡劣天氣,當道路上突然出現雪、相機上出現雨滴時,模型就會混亂。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"題外話:我反覆聽到這個故事,但一直沒找到出處。如果你知道並分享給我,我會非常感激。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這件趣事揭示了一個大問題:在研究環境中,可能會碰到相當舒適的條件,就像加州的好天氣。然而在實踐中,條件會很惡劣,事情變得更具挑戰性,你無法只用理論上的可行性來應付一切。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果想做機器學習應用,通常意味着要面臨一系列全新的挑戰。而不幸的是,其實沒有辦法通過簡單地改進底層架構和重新定義SOTA來克服這些問題。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"機器學習應用 ≠ 研究"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"更確切地講,以下是機器學習應用環境與研究環境的5個不同之處:"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1、數據質量變得很重要"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在機器學習應用中有這麼一句話:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"“進垃圾,出垃圾。”"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這意味着,你的模型只有在訓練它們的數據上纔好用。在10個最常用的跨領域基準數據集上,其訓練和驗證數據上本身就有3.4%的誤分類率[2],而這總被研究人員忽視。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"許多研究人員認爲,不需要太關注這個問題,因爲最終仍可以通過比較模型架構獲得有意義的見解。但是,如果想在模型上建立一些業務邏輯,那麼確定你看到的是一隻獅子還是一隻猴子就相當關鍵了。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2、更多時候,你需要定製數據"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"研究中使用的大多數基準數據集都是巨大的通用數據集,因爲其目標是儘可能多地泛化。而在機器學習應用中,你很可能會遇到專門的用例,這些用例是不會出現在公開的通用數據集中的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因此,你需要收集和準備自己的數據──能恰當地代表你具體問題的數據。通常,選擇正確的數據對模型性能的影響要大過選擇哪種架構或如何設置超參數。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/85\/d5\/85912b9ab7ac065fe56d8197f02992d5.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":10}}],"text":"在機器學習開發過程中早期決策對最終結果有巨大影響。特別是一開始的數據收集和標籤策略是至關重要的,因爲每一步的問題都會逐級累積。圖片取自[3]。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"谷歌AI lab團隊發表過一篇論文── “每個人都想做模型工作,而不是數據工作:高風險AI中的數據問題累積”[3],展示瞭如果在項目開始時選擇錯誤的數據策略,將如何對模型的後期表現產生不利影響,我非常建議你去讀一讀。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3、有時候,你需要用小數據樣本"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在機器學習研究中,對於模型性能低下的一個常見答案是:\"收集更多的數據\"。通常情況下,如果你向會議期刊提交一篇訓練了幾百個樣本的論文,大概率會被拒絕。即使達到了很高的性能,你的模型也會被認爲是過擬合的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不過在機器學習應用中,收集巨大的數據集往往是不可能的,要麼代價太高,所以必須儘可能地利用你能得到的東西。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不管怎樣,過擬合的說法在這裏沒那麼要緊。如果對待解決的問題定義明確,並且運行在一個相對穩定的環境中,那麼只要能產生正確的結果,訓練一個過擬合的模型或許也是可行的。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4、別忘了數據漂移的問題"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"讓機器學習應用變得更復雜的是你很可能會遇到某種數據漂移。當模型部署到生產環境後,現實世界的特徵或目標變量的基本分佈發生變化時,就會出現這種情況。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一個明顯的例子是,在夏天收集了最初的訓練數據並部署了模型,到了冬天世界完全變了樣,模型就崩潰了。另一個更難預測的例子是,當用戶在與模型互動後開始改變他們的行爲。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/5a\/67\/5aa57b1077f11f7e82819c4dbd08d867.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Josh Tobin總結了數據漂移的最常見情況[4]"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有一小部分研究會關注這個問題(見上圖Josh Tobin的幻燈片),但大多數研究人員在追逐SOTA的工作中忽略了這一點。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5、你會在計算方面遇到限制"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"你有沒有試過把一個深度模型部署到德州儀器的ARM設備上?我們做過。最終雖然成功了,但經歷一個抓狂的過程。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"實際上,有時候你不得不在邊緣設備上進行推理,實時並行地爲數百萬用戶提供服務,或者壓根沒有預算分給海量的GPU資源消耗。所有這些都對研究環境中的理論可行性增加了限制,而"},{"type":"link","attrs":{"href":"https:\/\/www.linkedin.com\/posts\/cgnorthcutt_tag-page-activity-6808863701811134464-iWCH\/","title":"","type":null},"content":[{"type":"text","text":"研究人員已經習慣了爲一個項目在GPU上花費超過20000美元。"}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"解決方案:從以模型爲中心的機器學習到以數據爲中心的機器學習"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如上所述,機器學習應用中截然不同的環境帶來了一系列全新的挑戰。像往常研究那樣,只關注模型的改進並不能應對這些挑戰,因爲模型和數據之間的關係比模型本身更重要!"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然而,機器學習應用的大多數團隊仍在採用研究思維,熱衷追逐STOA。他們把數據當作可以外包的商品,並把所有的資源投入到模型相關的工作上。根據我們的經驗,這是許多機器學習應用項目失敗的最重要原因。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不過,也有越來越多的從事機器學習應用的人認識到,只專注於模型會得到令人失望的結果。一個倡導機器學習應用新方法的運動正在興起:他們正在推動從以模型爲中心到以數據爲中心的機器學習開發的轉變。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Andrew Ng有一場針對以數據爲中心的機器學習的演講,講了一個更有說服力的案例。總結起來,他的主要的論點是:當你做機器學習應用時,沒必要太擔心SOTA。即使是幾年前的模型,對於大多數的用例來說也足夠強大。調整數據並確保數據符合你的用例,以及保證良好的數據質量,纔是能產生更大影響的工作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了更具體一點,我來分享演講中提到的一個趣事。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Andrew和他的團隊在進行一個製造業的缺陷檢測項目時,準確率卡在了76.2%。於是他把團隊拆分,一個小組保持模型不變,增加新的數據或提高數據質量,另一個小組保持數據不變,但嘗試改進模型。從事數據工作的小組能夠將準確率提高到93.1%,而另一個小組卻絲毫沒能提高性能。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們所見的大多數成功團隊都遵循類似的方法,我們也在試圖把這個理念植入到每個用戶心中。具體來說,以數據爲中心的機器學習開發可以總結出以下幾點經驗:"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1、數據飛輪:同步開發模型和數據"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在演講中,Andrew提到他的團隊通常在項目早期先開發一個模型,一開始不去花時間調整它。在第一批標註數據上運行後,通常能發現數據的一些問題,比如某個類別的代表性不足,數據有噪音(例如,模糊的圖像),數據的標籤很差……"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然後,他們花時間修整數據,只有當模型性能無法再通過修整數據改善時,他們纔會回到模型工作中,比較不同的架構,進行微調。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們也在利用一切機會倡導這種被稱爲數據飛輪的方法。思路是需要建一個機器學習管道,它允許你在模型和數據上協同地進行快速迭代。你可以快速建立模型,將其暴露在新的數據中,提取預測不佳的樣本,標註這些樣本,再將其添加到數據集中,重新訓練模型,然後再次測試。這是建立機器學習應用的最快和最可靠的方法。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/d8\/f4\/d8b6c24e9d705711ab933367efc638f4.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":10}}],"text":"數據飛輪是一個迭代開發數據和模型的機器學習管道,可以在現實世界中不斷提高性能。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這篇文章不是爲了宣傳,而是向你展示我們在"},{"type":"link","attrs":{"href":"http:\/\/hasty.ai\/?utm_source=blog&utm_medium=medium&utm_campaign=tds","title":"","type":null},"content":[{"type":"text","text":"hasty.ai"}]},{"type":"text","text":"對這個想法的認真態度:實際上,我們正是圍繞這個思路建立起了整個業務,而且據我們所知,至少在視覺領域,我們是唯一這樣做的公司。當你使用我們的標註工具時,就可以在交互界面得到一個數據飛輪。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在你標註圖像時,我們在後臺不斷地爲你(重新)訓練模型,不需要你寫一行代碼。然後,我們用這個模型給你預測下一張圖片上的標籤,你可以糾正這些標籤,從而有效提高模型的性能。你也可以使用我們的API或導出模型,用你自己的(可能是面向客戶的)界面來建立飛輪。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當你通過改善"},{"type":"link","attrs":{"href":"http:\/\/hasty.ai\/?utm_source=blog&utm_medium=medium&utm_campaign=tds","title":"","type":null},"content":[{"type":"text","text":"Hasty"}]},{"type":"text","text":"中的數據達到性能穩定,就可以在"},{"type":"link","attrs":{"href":"https:\/\/hasty.ai\/model-playground\/?utm_source=blog&utm_medium=medium&utm_campaign=tds","title":"","type":null},"content":[{"type":"text","text":"Model Playground"}]},{"type":"text","text":"來微調模型,並達到99.9%的準確率。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2、自己標註數據,至少在開始的時候"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"構建以數據爲中心的機器學習意味着你應該自己標註數據。大多數停留在SOTA追逐思維的公司會把數據視爲商品,並把標註工作外包。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然而,他們忽略了一些事實,真正在未來成爲競爭優勢的是他們正在建立的數據資產,而非訓練的模型。如何建立模型是公共知識,而且很容易複製(對大多數的用例而言)。現在,每個人都可以免費學習機器學習基礎。但是,建立一個大型的真實數據集是非常耗時和具有挑戰性的,不投入足夠時間的話是無法複製的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"用例越複雜,就越難建立數據資產。例子有很多,我簡單挑一個,比如醫學影像就需要一個相關領域的專家來進行高質量的標註。但是,即使把看起來很簡單的標註工作外包出去也可能會導致問題,比如下面的例子,這個也是從Andrew Ng的精彩演講中借來的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/db\/d8\/db09841485339b287dc188f0b4c254d8.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":10}}],"text":"如何標註對象並不總是十分明確。在標註時識別這樣的邊界情況,可能會節省你在調試模型時尋找它們的時間。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這是一幅來自"},{"type":"link","attrs":{"href":"https:\/\/www.image-net.org\/index.php","title":"","type":null},"content":[{"type":"text","text":"ImageNet"}]},{"type":"text","text":"的圖像。標註指令是 \"使用邊界框來表示鬣蜥的位置\",這個指令乍聽起來不難理解。然而,一位標註者在繪製標籤時,沒有讓邊界框重疊,並且忽略了尾巴。與此相反,第二位標註者考慮了左邊鬣蜥的尾巴。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這兩種方法本身都沒有問題,但當其中一半標註採用一種方式,另一半用另一種方式時,就有問題了。當標註工作由外包完成時,你可能需要花幾個小時才能發現這樣的錯誤,而當你自己在內部做標註時,就可以更好地實現統一。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"此外,處理這樣的問題可以讓你更好地瞭解邊界情況,而這些邊界情況有可能導致模型在生產環境中出現故障。例如,某隻鬣蜥沒有尾巴的時候看起來有點像青蛙,而你的模型可能會把這兩者混淆在一起。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當然,這個例子是杜撰的,但是當你自己標註的時候,經常會遇到不確定如何標註的對象。當模型進入生產環境時,也會在同樣的圖像上掙扎。在早期就意識到這一點,可以讓你提前採取行動,減少潛在問題的出現。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我在"},{"type":"link","attrs":{"href":"http:\/\/hasty.ai\/?utm_source=blog&utm_medium=medium&utm_campaign=tds","title":"","type":null},"content":[{"type":"text","text":"hasty.ai"}]},{"type":"text","text":"工作,當然也認爲"},{"type":"link","attrs":{"href":"https:\/\/hasty.ai\/annotation\/?utm_source=blog&utm_medium=medium&utm_campaign=tds","title":"","type":null},"content":[{"type":"text","text":"我們的標註工具"}]},{"type":"text","text":"是最好的。不過,我不是想說服你使用我們的工具。只是想強調,有必要審視一下現有的工具,看看哪一個最適合。使用正確的標註工具,可以讓內部數據標註的成本降到可承受的水平,並給你帶來上面提到的所有好處。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3、利用工具儘可能地減少MLOps的麻煩"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"遵循以數據爲中心的方法會帶來更多挑戰,這並不僅限於標註數據方面。想建立如上所述的數據飛輪,在基礎設施層面其實相當棘手。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如何把握好這一點是MLOps的藝術。MLOps是機器學習應用領域中的一個新術語,描述了管道管理和如何確保模型在生產環境中按要求運行,有點像傳統軟件工程的DevOps。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果以前瞭解過MLOps,你可能見到過谷歌論文《機器學習系統中隱藏的技術債務》中的圖[5]。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/b5\/ce\/b5bee603d3fd98f27b7849f87831d7ce.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":10}}],"text":"機器學習應用不僅僅是模型代碼。所有其他的任務構成了MLOps。現在越來越多的工具讓MLOs變得簡單。學聰明點,用這些工具來解決機器學習應用中的麻煩吧。圖片來自[5]。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖中展示了MLOps的所有元素。在過去,有能力實現這一目標的公司需要建立龐大的團隊,維護一系列的工具,編寫無數行的膠水代碼讓它運行起來。大多數情況下,只有FAANG(Facebook、Apple、Amazon、Netflix、Google)這樣的公司才負擔得起。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但現在出現了越來越多的創業公司,他們提供可以簡化這一過程的工具,幫助你建立面向生產環境的機器學習應用,不需要再理會MLOps的喧囂。我們"},{"type":"link","attrs":{"href":"http:\/\/hasty.ai\/?utm_source=blog&utm_medium=medium&utm_campaign=tds","title":"","type":null},"content":[{"type":"text","text":"hasty.ai"}]},{"type":"text","text":"就是其中之一,我們提供一個端到端的解決方案來構建複雜的視覺AI應用,併爲你處理所有的飛輪信息。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但是,無論你最終是否使用"},{"type":"link","attrs":{"href":"http:\/\/hasty.ai\/","title":"","type":null},"content":[{"type":"text","text":"Hasty"}]},{"type":"text","text":",都應該明智地審視下有什麼可以用,而不是嘗試自己處理所有的MLOps。這樣可以騰出時間,讓你專注於數據和模型之間的關係,增加應用項目的成功機率。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"總結"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"將機器學習研究中發生的所有令人興奮的事情應用於現實世界是有無限潛力的。然而,有87%的機器學習應用項目仍然在概念驗證階段就宣告失敗。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於在1萬多個項目上爲生產環境訓練5萬多個模型的經驗,我們認爲,從以模型爲中心的開發模式轉變爲以數據爲中心的開發模式是促使更多項目成功的解決方案。我們並不是唯一有這樣想法的公司,這已經成爲行業中迅速發展的一種潮流。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"許多機器學習應用團隊還沒有采用這種思維方式,因爲他們在追逐SOTA,模仿學術界做機器學習研究的方式,忽略了在機器學習應用中會遇到的不同情況。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不過,我決不是想貶低學術界,或者讓人覺得他們的方法毫無價值。儘管對機器學習的學術界有一些合理的批評(這篇文章中我沒有提及)[6],但在過去的幾年中,他們做了這麼多偉大的工作,這很迷人。我期待着學術界會有更多接地氣的東西出來。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這也正是問題的關鍵,學術界的目標是做基礎工作,爲機器學習應用鋪路。而機器學習應用的目標和環境完全不同,所以我們需要採用不同的、以數據爲中心的思維方式來讓機器學習應用發揮作用。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"根據我們的經驗,以數據爲中心的機器學習可以歸結爲以下三點:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"數據飛輪:協同開發模型和數據"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"自己標註數據,至少在剛開始時要這樣做"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"使用工具,儘可能減少MLOps的麻煩"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"謝謝你閱讀這篇文章並堅持到這裏。我很想聽到你的反饋,並瞭解你如何對待機器學習應用的挑戰。你可以隨時在Twitter或LinkedIn上與我聯繫。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果你喜歡這篇文章,請分享它來傳播數據飛輪和以數據爲中心的機器學習思想。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果你想閱讀更多關於如何做以數據爲中心的VisionAI的實踐文章,請務必在Medium上關注我。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"參考文獻"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"},{"type":"size","attrs":{"size":10}}],"text":"[1] “"},{"type":"link","attrs":{"href":"https:\/\/venturebeat.com\/2019\/07\/19\/why-do-87-of-data-science-projects-never-make-it-into-production\/","title":"","type":null},"content":[{"type":"text","marks":[{"type":"italic"}],"text":"爲什麼87%的數據科學項目從未真正投入生產"}],"marks":[{"type":"italic"},{"type":"size","attrs":{"size":10}}]},{"type":"text","marks":[{"type":"italic"},{"type":"size","attrs":{"size":10}}],"text":"” (2019), VentureBeat magazine article"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"},{"type":"size","attrs":{"size":10}}],"text":"[2] C. Northcutt, A. Athalye, J. Mueller,"},{"type":"link","attrs":{"href":"https:\/\/arxiv.org\/abs\/2103.14749","title":"","type":null},"content":[{"type":"text","marks":[{"type":"italic"}],"text":"測試數據集中普遍存在的標籤錯誤破壞了機器學習基準的穩定性"}],"marks":[{"type":"italic"},{"type":"size","attrs":{"size":10}}]},{"type":"text","marks":[{"type":"italic"},{"type":"size","attrs":{"size":10}}],"text":"(2021), ICLR 2021 RobustML and Weakly Supervised Learning Workshops, NeurIPS 2020 Workshop on Dataset Curation and Security"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"},{"type":"size","attrs":{"size":10}}],"text":"[3] N. Sambasivan, S. Kapania, H. Highfill, D. Akrong, P. Paritosh, L. Aroyo,"},{"type":"link","attrs":{"href":"https:\/\/research.google\/pubs\/pub49953\/","title":"","type":null},"content":[{"type":"text","marks":[{"type":"italic"}],"text":"每個人都想做模型工作,而不是數據工作:高風險AI中的數據問題累積"}],"marks":[{"type":"italic"},{"type":"size","attrs":{"size":10}}]},{"type":"text","marks":[{"type":"italic"},{"type":"size","attrs":{"size":10}}],"text":"(2021), proceedings of the 2021 CHI Conference on Human Factors in Computing Systems."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"},{"type":"size","attrs":{"size":10}}],"text":"[4] S. Karayev, J. Tobin, P. Abbeel,"},{"type":"link","attrs":{"href":"https:\/\/fullstackdeeplearning.com\/","title":"","type":null},"content":[{"type":"text","marks":[{"type":"italic"}],"text":"全棧式深度學習"}],"marks":[{"type":"italic"},{"type":"size","attrs":{"size":10}}]},{"type":"text","marks":[{"type":"italic"},{"type":"size","attrs":{"size":10}}],"text":"(2021), UC Berkely course"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"},{"type":"size","attrs":{"size":10}}],"text":"[5] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J. Crespo, D. Dennison,"},{"type":"link","attrs":{"href":"https:\/\/web.kaust.edu.sa\/Faculty\/MarcoCanini\/classes\/CS290E\/F19\/papers\/tech-debt.pdf","title":"","type":null},"content":[{"type":"text","marks":[{"type":"italic"}],"text":"機器學習系統中隱藏的技術債"}],"marks":[{"type":"italic"},{"type":"size","attrs":{"size":10}}]},{"type":"text","marks":[{"type":"italic"},{"type":"size","attrs":{"size":10}}],"text":", Advances in Neural Information Processing Systems 28 (NIPS 2015)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"},{"type":"size","attrs":{"size":10}}],"text":"[6] J. Buckmann,"},{"type":"link","attrs":{"href":"https:\/\/jacobbuckman.com\/2021-05-29-please-commit-more-blatant-academic-fraud\/","title":"","type":null},"content":[{"type":"text","marks":[{"type":"italic"}],"text":"就公然地進行更多的學術欺詐吧"}],"marks":[{"type":"italic"},{"type":"size","attrs":{"size":10}}]},{"type":"text","marks":[{"type":"italic"},{"type":"size","attrs":{"size":10}}],"text":"(2021), personal blog"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"作者介紹:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tobias Schaffrath Rosario,Hasty.ai社區戰略負責人,教機器如何看見世界,持續關注視覺AI領域。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"原文鏈接:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/towardsdatascience.com\/stop-treating-data-as-a-commodity-ee2ac23fa578","title":"","type":null},"content":[{"type":"text","text":"https:\/\/towardsdatascience.com\/stop-treating-data-as-a-commodity-ee2ac23fa578"}]}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章