Why Do Machine Learning Models Fail?

**This article was originally published on the Towards Data Science blog and is translated and shared by InfoQ China with the permission of the original author, Delgado Panadero.**

Using a real example, this article analyzes whether poor model performance is caused by a bad choice of model or by noise in the training data.

## Foreword

In machine learning, one of the most common questions once you have built and trained a model and checked its accuracy is: "Is this accuracy the best I can get from the data, or could a better model be found?"

Furthermore, once the model is deployed, the next common question is: "Why did the model fail?" Sometimes neither question can be answered, but sometimes we can identify preprocessing errors, model bias, and data leakage by studying the statistical distribution of the model's errors.

In this tutorial, we explain and demonstrate how to statistically analyze model results to find the cause of the errors in an example.

## Business case

In this case we will use data from a [Driven Data competition](https://www.drivendata.org/competitions/50/worldbank-poverty-prediction/) to predict whether a person is in a state of poverty from a set of socioeconomic variables.

The value of this business case lies not only in being able to predict poverty with a machine learning model, but also in measuring how well the socioeconomic variables predict poverty status and analyzing the reasons from the features.

## Model training

The data consist of a set of nine descriptive variables, four of them categorical and the other five numerical (although one of them appears to be an id, so we will drop it).

```python
import pandas as pd

pd.set_option('display.max_columns', None)
train = pd.read_csv('train.csv', index_col='id')
print(train)
```

The output is:

```
       Unnamed: 0 kjkrfgld bpowgknt raksnhjf vwpsxrgk  omtioxzz  yfmzwkru
id
29252        2225    KfoTG    zPfZR    DtMvg      NaN      12.0      -3.0
98286        1598    ljBjd    THHLT    DtMvg    esAQH      21.0      -2.0
49040        7896    Lsuai    zPfZR    zeYAm    ZCIYy      12.0      -3.0
35261        1458    KfoTG    mDadf    zeYAm    ZCIYy      12.0      -1.0
98833        1817    KfoTG    THHLT    DtMvg    ARuYG      21.0      -4.0

       tiwrsloh  weioazcf   poor
id
29252      -1.0       0.5  False
98286      -5.0      -9.5   True
49040      -5.0      -9.5   True
35261      -5.0      -9.5  False
98833      -5.0      -9.5   True
```

The data distribution can be seen below:

![](https://static001.geekbang.org/resource/image/b7/1b/b76a2ebc965b32f5bc3ef49de8a3461b.jpg)

*Figure by the author. Pairplot of all the features in the dataset, colored by the target. Yellow represents False and purple represents True.*
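The article does not include the code for this figure. A minimal sketch of how a similar pairplot could be produced with seaborn, assuming the `train` DataFrame loaded above (the column selection is my own choice):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise scatter plots of the numerical features, colored by the target.
num_cols = ['omtioxzz', 'yfmzwkru', 'tiwrsloh', 'weioazcf']
sns.pairplot(train, vars=num_cols, hue='poor')
plt.show()
```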
After some preprocessing (NaN imputation, scaling, categorical encoding, and so on), we will train a support vector machine model, which usually works well on high-dimensional one-hot-encoded data.
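The `preprocess` steps and the train/test split used by the code below are not shown in the article. The following is only a sketch of what they could look like: the imputation strategies, the encoder, and the split parameters are assumptions. An ordinal encoding is used here so that each raw column maps to a single transformed column, which keeps the feature-importance plot further down aligned with the raw column names (one-hot encoding, mentioned in the text, would be the other natural choice).

```python
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import Pipeline

# Features and target, dropping the id-like column.
X = train.drop(columns=['Unnamed: 0', 'poor'])
y = train['poor']

cat_cols = X.select_dtypes(include='object').columns.tolist()
num_cols = X.select_dtypes(exclude='object').columns.tolist()

# Impute missing values; encode categorical variables as integers so that
# the transformed matrix keeps one column per original feature.
transformer = ColumnTransformer([
    ('categorical',
     Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
               ('encode', OrdinalEncoder(handle_unknown='use_encoded_value',
                                         unknown_value=-1))]),
     cat_cols),
    ('numerical', SimpleImputer(strategy='median'), num_cols),
])

# The models below are built as Pipeline(steps=preprocess + [...]).
preprocess = [('preprocess', transformer)]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```

With `preprocess` defined as a list of steps, the pipelines below can simply concatenate it with their own scaling and estimator steps.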
## Support vector machine

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

model = Pipeline(steps=preprocess+[
    ('scaler', RobustScaler()),
    ('estimator', SVC())])

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```

The output is:

```
              precision    recall  f1-score   support

       False       0.73      0.77      0.75       891
        True       0.70      0.66      0.68       750

    accuracy                           0.72      1641
   macro avg       0.72      0.71      0.71      1641
weighted avg       0.72      0.72      0.72      1641
```

An accuracy of 0.72 is not great for a binary classification problem. On the other hand, recall and precision look balanced, which suggests the model does not have a prior bias towards either class.

## Testing other models

To improve on this model, the next step is to try other machine learning models and hyperparameters and see whether we find any configuration with better performance (or even just to check whether the performance stays the same).

From different families of functions, we will use two more models: a KNN model, which is a good choice for learning local patterns, and gradient boosted trees, one of the model families with the highest capacity in machine learning.

## KNN

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.neighbors import KNeighborsClassifier

model = Pipeline(steps=preprocess+[
    ('scaler', RobustScaler()),
    ('estimator', KNeighborsClassifier(n_neighbors=5))])

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```

The output is:

```
              precision    recall  f1-score   support

       False       0.71      0.74      0.72       891
        True       0.67      0.63      0.65       750

    accuracy                           0.69      1641
   macro avg       0.69      0.69      0.69      1641
weighted avg       0.69      0.69      0.69      1641
```

## Gradient boosting

```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier

model = Pipeline(steps=preprocess+[
    ('estimator',
     GradientBoostingClassifier(max_depth=5,
                                n_estimators=100))])

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```

The output is:

```
              precision    recall  f1-score   support

       False       0.76      0.78      0.77       891
        True       0.73      0.70      0.72       750

    accuracy                           0.74      1641
   macro avg       0.74      0.74      0.74      1641
weighted avg       0.74      0.74      0.74      1641
```

We can see that the other two models perform very similarly, which raises the following question:

> Is this the best we can predict with a machine learning model?

## Model prediction distribution

Besides checking the overall performance metrics, it is also important to analyze the distribution of the model's outputs, and not only on the test set but also on the training set. The point here is not so much how well the model performs, but whether it has actually learned how to separate the training data.
```python
import matplotlib.pyplot as plt

pd.DataFrame(model.predict_proba(X_train))[1].hist()
plt.show()
```

![](https://static001.geekbang.org/resource/image/c5/43/c5639b97aaeae96f2e943825c370d343.jpg)

*Figure by the author. Distribution of the model output evaluated on the training set.*

```python
pd.DataFrame(model.predict_proba(X_test))[1].hist()
plt.show()
```

![](https://static001.geekbang.org/resource/image/95/cf/9519c68313285dd9d4c19417061e79cf.jpg)

*Figure by the author. Distribution of the model output evaluated on the test set.*

We can see a high peak of predictions at 0, which means there is a subset of the data for which the model is very confident that the label is 0. Apart from that, the distribution looks fairly uniform.

If the model knew how to clearly separate the two labels, the distribution would show two peaks, one near 0 and another near 1. We can therefore see that the model has not properly learned a pattern that separates the data.
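To quantify that peak, one can check what share of the test samples receives a predicted probability close to zero and how often those samples really belong to class 0. A small sketch reusing the fitted pipeline, where the 0.1 threshold is an arbitrary choice:

```python
# Test samples the model is very confident belong to class 0 (not poor).
test_proba = model.predict_proba(X_test)[:, 1]
confident_zero = test_proba <= 0.1

print('share of confident class-0 predictions:', confident_zero.mean())
# Fraction of those samples whose label really is False.
print('accuracy on that subset:', (~y_test[confident_zero]).mean())
```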
## Deviation distribution

We have seen that the model has not learned to clearly separate the two classes, but we have not yet seen whether, when it is not confident, it still tends to guess the prediction right or whether it fails consistently.

It is also important to check whether the model tends to fail more towards one class than the other. To examine both aspects, we can plot the distribution of the deviation of the predictions from the target:

```python
train_proba = model.predict_proba(X_train)[:,1]
pd.DataFrame(train_proba-y_train.astype(int)).hist(bins=50)
plt.show()
```

![](https://static001.geekbang.org/resource/image/6c/9c/6c9059c493797934e8e946b96927239c.jpg)

*Figure by the author. Deviation of the model's confidence output from the ground truth, evaluated on the training set.*

```python
test_proba = model.predict_proba(X_test)[:,1]
pd.DataFrame(test_proba-y_test.astype(int)).hist(bins=50)
plt.show()
```

![](https://static001.geekbang.org/resource/image/e0/d0/e05ccaa2ec643899b0c1b12e98636ed0.jpg)

*Figure by the author. Deviation of the model's confidence output from the ground truth, evaluated on the test set.*

From both plots we can see that the deviation distribution appears symmetric and centered at zero. The gap exactly at zero is only there because the model never returns the exact values 0 and 1, so we do not need to worry about it.

If the model's error came from statistical/measurement noise in the training data, rather than from a bias error, we would expect the deviation distribution to follow a Gaussian distribution.

This distribution does resemble a Gaussian, with a higher peak at zero, but that peak may simply come from the large number of predictions at zero (that is, the model has learned a pattern that separates a subset of class 0 from class 1).
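To make the visual comparison with a Gaussian more concrete, a normal density fitted to the deviations can be overlaid on the histogram. A minimal sketch reusing `test_proba` and `y_test` from above:

```python
import numpy as np
from scipy.stats import norm

deviation = test_proba - y_test.astype(int)

# Normalized histogram of the deviations so it is comparable with a density.
plt.hist(deviation, bins=50, density=True, alpha=0.6)

# Normal density with the same mean and standard deviation as the deviations.
x = np.linspace(deviation.min(), deviation.max(), 200)
plt.plot(x, norm.pdf(x, loc=deviation.mean(), scale=deviation.std()))
plt.show()
```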
## Testing for normality

Before we can attribute the deviation of the model's predictions to the statistical noise present in the training data, we have to verify that it is compatible with a Gaussian distribution.

```python
import scipy.stats

scipy.stats.normaltest(train_proba - y_train.astype(int))
```

The output is:

```
NormaltestResult(statistic=15.602215177113427, pvalue=0.00040928141243470884)
```

With a p-value of 0.0004, we can assume that the deviation of the predictions from the target follows a Gaussian distribution, which makes the theory that the model error is caused by noise in the training data plausible.

## Model interpretability

As mentioned before, the goal of this business case is not only the prediction itself, but also understanding which socioeconomic variables are related to poverty.

An interpretable model not only predicts unseen data, it also lets you understand how the features affect the model (global interpretability) and why individual predictions come out the way they do (local interpretability).

Beyond that, the interpretability of a model can also help us understand why it makes the predictions it makes and why it fails. From the gradient boosting model we can extract the global interpretability as follows:

```python
cols = X_train.columns
vals = dict(model.steps)['estimator'].feature_importances_

plt.figure()
plt.bar(cols, vals)
plt.show()
```

![](https://static001.geekbang.org/resource/image/fb/39/fb57aa520832e5cf859479c5ef020539.jpg)

*Figure by the author. Gradient boosting feature importances.*

Next, we will run the same feature-importance analysis, but training only on a subset of the data. Specifically, for the zero class we will keep only the samples that are clearly zero (the ones the previous model confidently predicted as zero).

```python
import numpy as np

zero_mask = model.predict_proba(X_train)[:,1] <= 0.1
one_mask = y_train == 1
mask = np.logical_or(zero_mask, one_mask)
X_train = X_train.loc[mask,:]
y_train = y_train.loc[mask]
model.fit(X_train, y_train)
```

The feature importances are now:

![](https://static001.geekbang.org/resource/image/85/37/85d68271b890e4fbf871833af1da8237.jpg)

*Figure by the author. Gradient boosting feature importances, trained on the subsample of the training set where the model performs best.*

We can see that the importance of the two variables `tiwrsloh` and `yfmzwkru` has now increased, while the importance of `vwpsxrgk` has dropped. This means that the subset of the population that is clearly not poor (class 0) can be separated from the poor by these two variables, whereas `vwpsxrgk` may matter in many cases but is not decisive.
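To see that shift side by side, the importances of the first fit (stored in `vals` above, before retraining) can be compared with those of the refitted model. A small sketch:

```python
importances_after = dict(model.steps)['estimator'].feature_importances_

comparison = pd.DataFrame({
    'full training set': vals,            # importances from the first fit
    'filtered subset': importances_after  # importances after retraining
}, index=cols)

print(comparison.sort_values('filtered subset', ascending=False))
```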
If we plot the filtered values of these two features, we can see the following:

![](https://static001.geekbang.org/resource/image/e1/65/e15f2e6092b4540e6ab5472324fb4165.jpg)

*Figure by the author. Segmentation and characterization of the feature region where the model clearly detects non-poverty.*

For these two features the model has learned to separate the two classes, while for other values of these variables classes 0 and 1 are mixed throughout the dataset and cannot be clearly distinguished.

From the chart above we can also characterize a clearly non-poor subset of the population, namely `tiwrsloh < 0` combined with a corresponding range of `yfmzwkru`.
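The article does not show the code for this last figure either. A minimal sketch of how a similar view could be produced from the raw `train` DataFrame loaded at the beginning (the colors and the dashed boundary are my own choices):

```python
# Scatter of the two most relevant features, colored by the target.
colors = train['poor'].map({False: 'gold', True: 'purple'})

plt.figure()
plt.scatter(train['tiwrsloh'], train['yfmzwkru'],
            c=colors.to_numpy(), s=10, alpha=0.5)
plt.axvline(0, linestyle='--')  # boundary suggested by the analysis above
plt.xlabel('tiwrsloh')
plt.ylabel('yfmzwkru')
plt.show()
```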