爲什麼機器學習模型會失敗?

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"本文最初發表於 Towards Data Science 博客,經原作者 Delgado Panadero 授權,InfoQ 中文站翻譯並分享。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文通過一個真實的例子,分析了模型選擇不當還是訓練數據噪聲導致了模型性能不佳。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"前言"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在機器學習中,當你建立和訓練一個模型並檢驗其準確性時,一個最常見的問題就是“準確性是我能從數據中得到的最好的,還是能找到一個更好的模型呢?”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"此外,一旦模型被部署,下一個常見的問題就是“爲什麼模型會失敗?”。有時候,這兩個問題都無法回答,但有時我們可以通過研究模型誤差的統計分佈,找出預處理錯誤、模型偏差,以及數據泄露等。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在本教程中,我們將解釋並演示如何統計分析模型結果,以找出示例中錯誤的原因。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"業務案例"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在這個案例中,我們將使用來自 "},{"type":"link","attrs":{"href":"https:\/\/www.drivendata.org\/competitions\/50\/worldbank-poverty-prediction\/","title":"","type":null},"content":[{"type":"text","text":"Driven Data 競賽"}]},{"type":"text","text":"的數據,通過一系列社會經濟變量來預測一個民族是否處於貧困狀態。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這個業務案例的價值不僅在於能夠用機器學習模型來預測貧困狀況,而且還在於通過社會經濟變量對衡量貧困狀態的預測程度,並從特徵上分析原因。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"模型訓練"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據由一組九個描述性變量組成,其中四個是類別變量,另外五個是數值變量(但其中一個似乎是一個 id,所以我們將捨棄它)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"import pandas as pd\n\npd.set_option('display.max_columns', None)\ntrain = pd.read_csv('train.csv', index_col='id')\nprint(train)\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"返回結果如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"Unnamed: 0 kjkrfgld bpowgknt raksnhjf vwpsxrgk omtioxzz yfmzwkru\nid \n29252 2225 KfoTG zPfZR DtMvg NaN 12.0 -3.0 \n98286 1598 ljBjd THHLT DtMvg esAQH 21.0 -2.0 \n49040 7896 Lsuai zPfZR zeYAm ZCIYy 12.0 -3.0 \n35261 1458 KfoTG mDadf zeYAm ZCIYy 12.0 -1.0 \n98833 1817 KfoTG THHLT DtMvg ARuYG 21.0 -4.0 \n\n tiwrsloh weioazcf poor \nid \n29252 -1.0 0.5 False \n98286 -5.0 -9.5 True \n49040 -5.0 -9.5 True \n35261 -5.0 -9.5 False \n98833 -5.0 -9.5 True \n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據分佈可以在下面看到:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/b7\/1b\/b76a2ebc965b32f5bc3ef49de8a3461b.jpg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖由作者提供。數據集中所有特徵的配對圖,以目標爲顏色。黃色塊代表 False,紫色塊表示 True。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過某些預處理(NaN 值插補、縮放、分類編碼等等),我們將對一個支持向量機模型進行訓練(通常在獨熱編碼的高維數據中工作良好)。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"支持向量機"}]},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"from sklearn.pipeline import Pipeline\nfrom sklearn.preprocessing import RobustScaler\nfrom sklearn.neighbors import KNeighborsClassifier\n\nmodel = Pipeline(steps=preprocess+[\n ('scaler', RobustScaler()),\n ('estimator', KNeighborsClassifier(n_neighbors=5))])\n\nmodel.fit(X_train, y_train)\ny_pred = model.predict(X_test)\nprint(classification_report(y_test,y_pred))`\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"返回結果如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"precision recall f1-score support\n\n False 0.73 0.77 0.75 891\n True 0.70 0.66 0.68 750\n\n accuracy 0.72 1641\n macro avg 0.72 0.71 0.71 1641\nweighted avg 0.72 0.72 0.72 1641\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"就二元分類問題而言,0.72 的準確率並不高。相比之下,召回率和查準率看起來是平衡的,這使得我們認爲,這個模型不是一個有利於任何類別的先驗偏見。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"測試其他模型"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"想要改進這個模型,下一步就是嘗試其他機器學習模型和超參數,看看我們是否找到任何可以提高性能的配置(甚至只是檢查性能是否保持穩定)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在不同的函數族集中,我們將使用另外兩個模型。KNN 模型,對於學習局部模型的影響是一個很好的選擇,還有梯度提升樹,它也是機器學習中容量最大的模型之一。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"KNN"}]},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"from sklearn.pipeline import Pipeline\nfrom sklearn.preprocessing import RobustScaler\nfrom sklearn.neighbors import KNeighborsClassifier\n\nmodel = Pipeline(steps=preprocess+[\n ('scaler', RobustScaler()),\n ('estimator', KNeighborsClassifier(n_neighbors=5))])\n\nmodel.fit(X_train, y_train)\ny_pred = model.predict(X_test)\nprint(classification_report(y_test,y_pred))\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"返回結果如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"precision recall f1-score support\n\n False 0.71 0.74 0.72 891\n True 0.67 0.63 0.65 750\n\n accuracy 0.69 1641\n macro avg 0.69 0.69 0.69 1641\nweighted avg 0.69 0.69 0.69 1641\n"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"梯度提升"}]},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"from sklearn.pipeline import Pipeline\nfrom sklearn.ensemble import GradientBoostingClassifier\n\nmodel = Pipeline(steps=preprocess+[\n ('estimator', \n GradientBoostingClassifier(max_depth=5,\n n_estimators=100))])\n\nmodel.fit(X_train, y_train)\ny_pred = model.predict(X_test)\nprint(classification_report(y_test,y_pred))\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"返回結果如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"precision recall f1-score support\n\n False 0.76 0.78 0.77 891\n True 0.73 0.70 0.72 750\n\n accuracy 0.74 1641\n macro avg 0.74 0.74 0.74 1641\nweighted avg 0.74 0.74 0.74 1641\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們可以看到,其他兩個模型的表現似乎都非常相似。這就提出了以下問題:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這就是我們用機器學習模型所能預測的最好結果嗎?"}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"模型預測分佈"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了檢查性能的一般指標外,分析模型的輸出分佈也很重要。不但要檢查測試數據集的分佈,也要檢查訓練數據集的分佈。這是因爲我們不想看到模型的表現,而是想看看它是否也學會了如何分割訓練數據。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"import matplotlib.pyplot as plt\n\npd.DataFrame(model.predict_proba(X_train))[1].hist()\nplt.show()\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/c5\/43\/c5639b97aaeae96f2e943825c370d343.jpg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖由作者提供。對訓練集進行評估的模型輸出分佈。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"pd.DataFrame(model.predict_proba(X_test))[1].hist()\nplt.show()\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/95\/cf\/9519c68313285dd9d4c19417061e79cf.jpg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖由作者提供。對測試集進行評估的模型輸出分佈。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可見,預測爲 0 的數量具有較高的峯值,這表示存在一個數據子集,模型非常確定它的標籤是 0,除此之外,分佈看起來比較均勻。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果模型知道一定要區分這兩個標籤,分佈會有兩個峯值,一個在 0 附近,另一個在 1 附近。因此,我們可以看到,模型並沒有正確地學習模式來區分數據。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"偏差分佈"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們已經看到,該模型還沒有學會明確地區分這兩個類別,但我們還沒有看到它是否在不自信的情況下也能猜到預測結果,還是一直失敗。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"此外,重要的是要檢查模型是否更傾向於一類或另一類的失敗。爲檢驗這兩個方面,我們可以繪製預測值與目標值偏差的分佈圖:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"train_proba = model.predict_proba(X_train)[:,1]\npd.DataFrame(train_proba-y_train.astype(int)).hist(bins=50)\nplt.show()\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/6c\/9c\/6c9059c493797934e8e946b96927239c.jpg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖由作者提供。通過訓練集評估的模型置信度輸出與基準真相的偏差。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"test_proba = model.predict_proba(X_test)[:,1]\npd.DataFrame(test_proba-y_test.astype(int)).hist(bins=50)\nplt.show()\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/e0\/d0\/e05ccaa2ec643899b0c1b12e98636ed0.jpg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖由作者提供。通過測試集評估的模型置信度輸出與基準真相的偏差。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從這兩張圖中,我們可以看到,偏差分佈似乎是對稱的,並且以零點爲中心。差距只是在零點,因爲模型從來沒有返回 0 和 1 的準確值,所以我們不必擔心這個問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果模型的誤差來自於訓練數據的統計\/測量噪聲誤差,而不是偏置誤差,則我們會期望偏差分佈遵循高斯分佈。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這一分佈與高斯分佈相似,在零點處有一個較高的峯值,但這個峯值可能是因爲模型預測的零點數量較多(也就是說,模型已經學會了一種模式來區分 0 和 1 類別的子集)。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"驗證正態性"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於訓練數據中存在的統計噪聲,我們必須確保模型預測的偏差符合高斯分佈,然後才能證明其偏差。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"import scipy\n\nscipy.stats.normaltest(train_proba-y_train.astype(int))\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"返回結果如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"NormaltestResult(statistic=15.602215177113427, pvalue=0.00040928141243470884)\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當 P-value=0.0004 時,我們可以假設預測與目標的偏差遵循高斯分佈,這樣從訓練數據中的噪聲導致模型誤差的理論是合理的。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"模型可解釋性"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如前所述,這一業務案例的目的不僅僅是要預測模型發生的原因,還包括與之相關的社會經濟變量。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可解釋的模型不僅能預測未見過的數據,還能讓你瞭解特徵如何影響模型(全局可解釋性),以及爲什麼某些預測會如此(局部可解釋性)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"儘管如此,一個模型的可解釋性仍然可以幫助我們理解爲什麼它能做出預測,以及爲什麼它會失敗。從梯度提升模型中,我們可以提取全局可解釋性如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"cols = X_train.columns\nvals= dict(model.steps)['estimator'].feature_importances_\n\nplt.figure()\nplt.bar(cols, vals)\nplt.show()\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/fb\/39\/fb57aa520832e5cf859479c5ef020539.jpg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖由作者提供。梯度提升特徵輸入。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"接下來,我們將進行相同的特徵重要性分析,但是隻對數據的一個子集進行訓練。具體地說,我們將只使用明顯爲零的數據(那些模型之前明確預測爲零的數據)來訓練模型的零類別。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"zero_mask = model.predict_proba(X_train)[:,1]<=0.1\none_mask = y_train==1\nmask = np.logical_or(zero_mask,one_mask)\nX_train = X_train.loc[mask,:]\ny_train = y_train.loc[mask]\nmodel.fit(X_train,y_train)\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"現在特徵的重要性是:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/85\/37\/85d68271b890e4fbf871833af1da8237.jpg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖由作者提供。在模型表現最好的訓練集子樣本上訓練的梯度提升特徵導入。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們可以看到,現在,"},{"type":"codeinline","content":[{"type":"text","text":"tiwrsloh"}]},{"type":"text","text":" 和 "},{"type":"codeinline","content":[{"type":"text","text":"yfmzwkru"}]},{"type":"text","text":" 這兩個變量的重要性增加了,而 "},{"type":"codeinline","content":[{"type":"text","text":"vwpsxrgk"}]},{"type":"text","text":" 的數值卻下降了。這意味着,擁有一個子集的人口顯然不是窮人(類別 0),可以通過這兩個變量從窮人的變量和 "},{"type":"codeinline","content":[{"type":"text","text":"vwpsxrgk"}]},{"type":"text","text":" 在許多情況下可能很重要,但不具備決定性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果我們繪製這兩個特徵的過濾值,我們可以看到:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/e1\/65\/e15f2e6092b4540e6ab5472324fb4165.jpg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖由作者提供,對模型明確檢測到非貧困的特徵區域進行分割並表徵。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於這兩個特徵,模型已經學會了區分兩個類別,同時,對於這些變量的其他值,在整個數據集中,類別 0 和類別 1 是混合的,所以不能明確區分。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們還可以從前面的圖表中找出一個明顯的非貧困人口子集的特徵,即 "},{"type":"codeinline","content":[{"type":"text","text":"tiwrsloh<0"}]},{"type":"text","text":" 和 "},{"type":"codeinline","content":[{"type":"text","text":"yfmzwkru
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章