Machine Learning Model Underperforming in Production? The Problem May Lie in These 9 Places

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"},{"type":"strong"}],"text":"This article was originally published on the Towards Data Science blog and was translated and shared by InfoQ China with permission from the original author, Satyam Kumar."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"The quality and quantity of the dataset have a large effect on a machine learning model"},{"type":"text","text":". One of the most common mistaken assumptions is that a machine learning model will keep working properly without maintenance."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"The Netflix recommender-system competition is one example of a failed model deployment"},{"type":"text","text":". The winning model earned a US$1 million prize but was never put into production."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In this article, I list "},{"type":"text","marks":[{"type":"strong"}],"text":"nine possible reasons a machine learning model may perform poorly in production"},{"type":"text","text":", along with some points data scientists should keep in mind when training models."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1. Improper handling of outliers"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Outliers are extreme observations in a dataset, and they affect model performance. Handling them improperly distorts the model's estimates. Several techniques can be used to deal with outliers:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Outliers have little effect on some machine learning models and a large effect on others, so the model should be chosen with this in mind. For models that are sensitive to outliers, such as linear regression, the outliers should be handled before training."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Multivariate outliers can hurt a model's performance in production. Data scientists often overlook them and handle outliers one feature at a time. Read the article "},{"type":"link","attrs":{"href":"https:\/\/www.statisticssolutions.com\/univariate-and-multivariate-outliers\/","title":"","type":null},"content":[{"type":"text","text":"Univariate and Multivariate Outliers"}]},{"type":"text","text":" to learn more about multivariate outliers."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Read the article "},{"type":"link","attrs":{"href":"https:\/\/towardsdatascience.com\/ways-to-detect-and-remove-the-outliers-404d16608dba","title":"","type":null},"content":[{"type":"text","text":"Ways to Detect and Remove the Outliers"}]},{"type":"text","text":" to learn more about how to detect and remove outliers."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2. Class imbalance"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Class imbalance in the target label affects model performance. Fraud detection and cancer detection are typical examples of class-imbalanced datasets. There are many techniques for training a machine learning model on an imbalanced dataset:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Choose the right metric"},{"type":"text","text":": for an imbalanced dataset, a model must be evaluated with metrics such as AUC-ROC, F1, precision, or recall."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Oversampling and undersampling"},{"type":"text","text":": oversample the minority class to increase its influence on the trained model, or undersample the majority class to reduce its influence."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Read the article "},{"type":"link","attrs":{"href":"https:\/\/towardsdatascience.com\/7-over-sampling-techniques-to-handle-imbalanced-data-ec51c8db349f","title":"","type":null},"content":[{"type":"text","text":"7 Over Sampling techniques to handle Imbalanced Data"}]},{"type":"text","text":" to learn more techniques for handling class imbalance."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3. Wrong performance metric"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To evaluate a model properly, and to have it perform well in production, the right evaluation metric must be chosen. No single metric fits every case. The choice of metric should align with the return-on-investment measures the business cares about. A model trained against a given metric should meet both the performance threshold and the business criteria."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4. Lack of monitoring"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A model in production needs regular monitoring. A model that used to perform well can degrade over time as the data changes. The response variable or the independent variables may shift over time, which can affect the model's predictions. The model must be monitored regularly and updated, whether by revising its variables, re-estimating its parameters, a small redevelopment, or a completely new build."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"5. Bias-variance tradeoff"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"The bias-variance problem"},{"type":"text","text":" is the conflict that arises when trying to minimize two sources of error at the same time, both of which prevent supervised learning algorithms from generalizing beyond their training set."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"High-bias, low-variance models make stronger assumptions about the form of the target function, whereas high-variance, low-bias models overfit the training dataset."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Low-bias"},{"type":"text","text":" and "},{"type":"text","marks":[{"type":"strong"}],"text":"high-variance"},{"type":"text","text":" machine learning algorithms include decision trees, k-nearest neighbors, and support vector machines. "},{"type":"text","marks":[{"type":"strong"}],"text":"High-bias"},{"type":"text","text":" and "},{"type":"text","marks":[{"type":"strong"}],"text":"low-variance"},{"type":"text","text":" machine learning algorithms include linear regression, linear discriminant analysis, and logistic regression."}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/d5\/5c\/d5c345a8663a1ec3b0d2cee74916375c.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Bias-variance tradeoff"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To obtain the best-fitting model, its parameters should be tuned so that it performs as well as possible in production."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"6. Unrepresentative sampling"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In many cases, we end up training a model on a population that differs substantially from the actual population. For example, training a model for a campaign's target population when there are no records from previous campaigns means the sample is not representative."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"7. Unstable models"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Some models tend to be unstable, and their performance degrades over time. The business then has to revise and monitor the model frequently. As the lead time for building models grows, the business may start falling back on intuition-based strategies."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"8. Models that depend on highly dynamic variables"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dynamic variables are variables that change over time. A model that relies heavily on such a variable may gain strong predictive power from it, which boosts performance. But when the variable shifts, the model's performance suffers badly. For example, if the feature the model depends on most is a retailer's monthly sales, and in a given month the stores were open for only 10 to 15 days, the model's performance may be affected."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"9. Training overly complex models"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Predictive power is the soul of solving problems with machine learning, but it comes at the cost of model complexity. Compared with simple models, more complex ensemble models perform better but are harder to interpret. Such models may perform impressively, yet their performance starts to degrade once they are deployed to production."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Summary"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"“"},{"type":"text","marks":[{"type":"strong"}],"text":"Garbage in, garbage out."},{"type":"text","text":"” applies to machine learning as well. A machine learning system cannot keep working properly in production without maintenance, and it needs frequent monitoring. Data scientists should also keep the points above in mind before deploying a model to production."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Other common problems include:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Oversimplification"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Implementation issues"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Lack of business knowledge"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Insufficient or incorrect data"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"About the author:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Satyam Kumar is a software engineer, data science enthusiast, and programmer."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Original article:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https:\/\/towardsdatascience.com\/9-reasons-why-machine-learning-models-not-perform-well-in-production-4497d3e3e7a5"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}