Why Do Machine Learning Models Fail?

**This article was originally published on the Towards Data Science blog and is translated and shared by InfoQ China with the permission of the original author, Delgado Panadero.**

Using a real example, this article analyzes whether poor model performance is caused by a bad choice of model or by noise in the training data.

## Foreword

In machine learning, one of the most common questions once you have built and trained a model and checked its accuracy is: "Is this accuracy the best I can get from the data, or could a better model be found?"

Furthermore, once the model is deployed, the next common question is: "Why did the model fail?" Sometimes neither question can be answered, but sometimes we can identify preprocessing errors, model bias, and data leakage by studying the statistical distribution of the model's errors.

In this tutorial, we explain and demonstrate how to statistically analyze model results to find the cause of the errors in an example.

## Business case

In this case we will use data from a [Driven Data competition](https://www.drivendata.org/competitions/50/worldbank-poverty-prediction/) to predict whether a person is in a state of poverty from a set of socioeconomic variables.

The value of this business case lies not only in being able to predict poverty with a machine learning model, but also in measuring how well the socioeconomic variables predict poverty status and analyzing the reasons from the features.

## Model training

The data consist of a set of nine descriptive variables, four of them categorical and the other five numerical (although one of them appears to be an id, so we will drop it).

```python
import pandas as pd

pd.set_option('display.max_columns', None)
train = pd.read_csv('train.csv', index_col='id')
print(train)
```

The output is:

```
       Unnamed: 0 kjkrfgld bpowgknt raksnhjf vwpsxrgk  omtioxzz  yfmzwkru
id
29252        2225    KfoTG    zPfZR    DtMvg      NaN      12.0      -3.0
98286        1598    ljBjd    THHLT    DtMvg    esAQH      21.0      -2.0
49040        7896    Lsuai    zPfZR    zeYAm    ZCIYy      12.0      -3.0
35261        1458    KfoTG    mDadf    zeYAm    ZCIYy      12.0      -1.0
98833        1817    KfoTG    THHLT    DtMvg    ARuYG      21.0      -4.0

       tiwrsloh  weioazcf   poor
id
29252      -1.0       0.5  False
98286      -5.0      -9.5   True
49040      -5.0      -9.5   True
35261      -5.0      -9.5  False
98833      -5.0      -9.5   True
```

The data distribution can be seen below:

![](https://static001.geekbang.org/resource/image/b7/1b/b76a2ebc965b32f5bc3ef49de8a3461b.jpg)

*Figure by the author. Pairplot of all the features in the dataset, colored by the target. Yellow represents False and purple represents True.*
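The article does not include the code for this figure. A minimal sketch of how a similar pairplot could be produced with seaborn, assuming the `train` DataFrame loaded above (the column selection is my own choice):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise scatter plots of the numerical features, colored by the target.
num_cols = ['omtioxzz', 'yfmzwkru', 'tiwrsloh', 'weioazcf']
sns.pairplot(train, vars=num_cols, hue='poor')
plt.show()
```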
After some preprocessing (NaN imputation, scaling, categorical encoding, and so on), we will train a support vector machine model, which usually works well on high-dimensional one-hot-encoded data.
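The `preprocess` steps and the train/test split used by the code below are not shown in the article. The following is only a sketch of what they could look like: the imputation strategies, the encoder, and the split parameters are assumptions. An ordinal encoding is used here so that each raw column maps to a single transformed column, which keeps the feature-importance plot further down aligned with the raw column names (one-hot encoding, mentioned in the text, would be the other natural choice).

```python
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import Pipeline

# Features and target, dropping the id-like column.
X = train.drop(columns=['Unnamed: 0', 'poor'])
y = train['poor']

cat_cols = X.select_dtypes(include='object').columns.tolist()
num_cols = X.select_dtypes(exclude='object').columns.tolist()

# Impute missing values; encode categorical variables as integers so that
# the transformed matrix keeps one column per original feature.
transformer = ColumnTransformer([
    ('categorical',
     Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
               ('encode', OrdinalEncoder(handle_unknown='use_encoded_value',
                                         unknown_value=-1))]),
     cat_cols),
    ('numerical', SimpleImputer(strategy='median'), num_cols),
])

# The models below are built as Pipeline(steps=preprocess + [...]).
preprocess = [('preprocess', transformer)]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```

With `preprocess` defined as a list of steps, the pipelines below can simply concatenate it with their own scaling and estimator steps.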
## Support vector machine

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

model = Pipeline(steps=preprocess+[
    ('scaler', RobustScaler()),
    ('estimator', SVC())])

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```

The output is:

```
              precision    recall  f1-score   support

       False       0.73      0.77      0.75       891
        True       0.70      0.66      0.68       750

    accuracy                           0.72      1641
   macro avg       0.72      0.71      0.71      1641
weighted avg       0.72      0.72      0.72      1641
```

An accuracy of 0.72 is not great for a binary classification problem. On the other hand, recall and precision look balanced, which suggests the model does not have a prior bias towards either class.

## Testing other models

To improve on this model, the next step is to try other machine learning models and hyperparameters and see whether we find any configuration with better performance (or even just to check whether the performance stays the same).

From different families of functions, we will use two more models: a KNN model, which is a good choice for learning local patterns, and gradient boosted trees, one of the model families with the highest capacity in machine learning.

## KNN

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.neighbors import KNeighborsClassifier

model = Pipeline(steps=preprocess+[
    ('scaler', RobustScaler()),
    ('estimator', KNeighborsClassifier(n_neighbors=5))])

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```

The output is:

```
              precision    recall  f1-score   support

       False       0.71      0.74      0.72       891
        True       0.67      0.63      0.65       750

    accuracy                           0.69      1641
   macro avg       0.69      0.69      0.69      1641
weighted avg       0.69      0.69      0.69      1641
```

## Gradient boosting

```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier

model = Pipeline(steps=preprocess+[
    ('estimator',
     GradientBoostingClassifier(max_depth=5,
                                n_estimators=100))])

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```

The output is:

```
              precision    recall  f1-score   support

       False       0.76      0.78      0.77       891
        True       0.73      0.70      0.72       750

    accuracy                           0.74      1641
   macro avg       0.74      0.74      0.74      1641
weighted avg       0.74      0.74      0.74      1641
```

We can see that the other two models perform very similarly, which raises the following question:

> Is this the best we can predict with a machine learning model?

## Model prediction distribution

Besides checking the overall performance metrics, it is also important to analyze the distribution of the model's outputs, and not only on the test set but also on the training set. The point here is not so much how well the model performs, but whether it has actually learned how to separate the training data.
```python
import matplotlib.pyplot as plt

pd.DataFrame(model.predict_proba(X_train))[1].hist()
plt.show()
```

![](https://static001.geekbang.org/resource/image/c5/43/c5639b97aaeae96f2e943825c370d343.jpg)

*Figure by the author. Distribution of the model output evaluated on the training set.*

```python
pd.DataFrame(model.predict_proba(X_test))[1].hist()
plt.show()
```

![](https://static001.geekbang.org/resource/image/95/cf/9519c68313285dd9d4c19417061e79cf.jpg)

*Figure by the author. Distribution of the model output evaluated on the test set.*

We can see a high peak of predictions at 0, which means there is a subset of the data for which the model is very confident that the label is 0. Apart from that, the distribution looks fairly uniform.

If the model knew how to clearly separate the two labels, the distribution would show two peaks, one near 0 and another near 1. We can therefore see that the model has not properly learned a pattern that separates the data.
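To quantify that peak, one can check what share of the test samples receives a predicted probability close to zero and how often those samples really belong to class 0. A small sketch reusing the fitted pipeline, where the 0.1 threshold is an arbitrary choice:

```python
# Test samples the model is very confident belong to class 0 (not poor).
test_proba = model.predict_proba(X_test)[:, 1]
confident_zero = test_proba <= 0.1

print('share of confident class-0 predictions:', confident_zero.mean())
# Fraction of those samples whose label really is False.
print('accuracy on that subset:', (~y_test[confident_zero]).mean())
```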
## Deviation distribution

We have seen that the model has not learned to clearly separate the two classes, but we have not yet seen whether, when it is not confident, it still tends to guess the prediction right or whether it fails consistently.

It is also important to check whether the model tends to fail more towards one class than the other. To examine both aspects, we can plot the distribution of the deviation of the predictions from the target:

```python
train_proba = model.predict_proba(X_train)[:,1]
pd.DataFrame(train_proba-y_train.astype(int)).hist(bins=50)
plt.show()
```

![](https://static001.geekbang.org/resource/image/6c/9c/6c9059c493797934e8e946b96927239c.jpg)

*Figure by the author. Deviation of the model's confidence output from the ground truth, evaluated on the training set.*

```python
test_proba = model.predict_proba(X_test)[:,1]
pd.DataFrame(test_proba-y_test.astype(int)).hist(bins=50)
plt.show()
```

![](https://static001.geekbang.org/resource/image/e0/d0/e05ccaa2ec643899b0c1b12e98636ed0.jpg)

*Figure by the author. Deviation of the model's confidence output from the ground truth, evaluated on the test set.*

From both plots we can see that the deviation distribution appears symmetric and centered at zero. The gap exactly at zero is only there because the model never returns the exact values 0 and 1, so we do not need to worry about it.

If the model's error came from statistical/measurement noise in the training data, rather than from a bias error, we would expect the deviation distribution to follow a Gaussian distribution.

This distribution does resemble a Gaussian, with a higher peak at zero, but that peak may simply come from the large number of predictions at zero (that is, the model has learned a pattern that separates a subset of class 0 from class 1).
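To make the visual comparison with a Gaussian more concrete, a normal density fitted to the deviations can be overlaid on the histogram. A minimal sketch reusing `test_proba` and `y_test` from above:

```python
import numpy as np
from scipy.stats import norm

deviation = test_proba - y_test.astype(int)

# Normalized histogram of the deviations so it is comparable with a density.
plt.hist(deviation, bins=50, density=True, alpha=0.6)

# Normal density with the same mean and standard deviation as the deviations.
x = np.linspace(deviation.min(), deviation.max(), 200)
plt.plot(x, norm.pdf(x, loc=deviation.mean(), scale=deviation.std()))
plt.show()
```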
## Testing for normality

Before we can attribute the deviation of the model's predictions to the statistical noise present in the training data, we have to verify that it is compatible with a Gaussian distribution.

```python
import scipy.stats

scipy.stats.normaltest(train_proba - y_train.astype(int))
```

The output is:

```
NormaltestResult(statistic=15.602215177113427, pvalue=0.00040928141243470884)
```

With a p-value of 0.0004, we can assume that the deviation of the predictions from the target follows a Gaussian distribution, which makes the theory that the model error is caused by noise in the training data plausible.

## Model interpretability

As mentioned before, the goal of this business case is not only the prediction itself, but also understanding which socioeconomic variables are related to poverty.

An interpretable model not only predicts unseen data, it also lets you understand how the features affect the model (global interpretability) and why individual predictions come out the way they do (local interpretability).

Beyond that, the interpretability of a model can also help us understand why it makes the predictions it makes and why it fails. From the gradient boosting model we can extract the global interpretability as follows:

```python
cols = X_train.columns
vals = dict(model.steps)['estimator'].feature_importances_

plt.figure()
plt.bar(cols, vals)
plt.show()
```

![](https://static001.geekbang.org/resource/image/fb/39/fb57aa520832e5cf859479c5ef020539.jpg)

*Figure by the author. Gradient boosting feature importances.*

Next, we will run the same feature-importance analysis, but training only on a subset of the data. Specifically, for the zero class we will keep only the samples that are clearly zero (the ones the previous model confidently predicted as zero).

```python
import numpy as np

zero_mask = model.predict_proba(X_train)[:,1] <= 0.1
one_mask = y_train == 1
mask = np.logical_or(zero_mask, one_mask)
X_train = X_train.loc[mask,:]
y_train = y_train.loc[mask]
model.fit(X_train, y_train)
```

The feature importances are now:

![](https://static001.geekbang.org/resource/image/85/37/85d68271b890e4fbf871833af1da8237.jpg)

*Figure by the author. Gradient boosting feature importances, trained on the subsample of the training set where the model performs best.*

We can see that the importance of the two variables `tiwrsloh` and `yfmzwkru` has now increased, while the importance of `vwpsxrgk` has dropped. This means that the subset of the population that is clearly not poor (class 0) can be separated from the poor by these two variables, whereas `vwpsxrgk` may matter in many cases but is not decisive.
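To see that shift side by side, the importances of the first fit (stored in `vals` above, before retraining) can be compared with those of the refitted model. A small sketch:

```python
importances_after = dict(model.steps)['estimator'].feature_importances_

comparison = pd.DataFrame({
    'full training set': vals,            # importances from the first fit
    'filtered subset': importances_after  # importances after retraining
}, index=cols)

print(comparison.sort_values('filtered subset', ascending=False))
```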
If we plot the filtered values of these two features, we can see the following:

![](https://static001.geekbang.org/resource/image/e1/65/e15f2e6092b4540e6ab5472324fb4165.jpg)

*Figure by the author. Segmentation and characterization of the feature region where the model clearly detects non-poverty.*

For these two features the model has learned to separate the two classes, while for other values of these variables classes 0 and 1 are mixed throughout the dataset and cannot be clearly distinguished.

From the chart above we can also characterize a clearly non-poor subset of the population, namely `tiwrsloh < 0` combined with a corresponding range of `yfmzwkru`.
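The article does not show the code for this last figure either. A minimal sketch of how a similar view could be produced from the raw `train` DataFrame loaded at the beginning (the colors and the dashed boundary are my own choices):

```python
# Scatter of the two most relevant features, colored by the target.
colors = train['poor'].map({False: 'gold', True: 'purple'})

plt.figure()
plt.scatter(train['tiwrsloh'], train['yfmzwkru'],
            c=colors.to_numpy(), s=10, alpha=0.5)
plt.axvline(0, linestyle='--')  # boundary suggested by the analysis above
plt.xlabel('tiwrsloh')
plt.ylabel('yfmzwkru')
plt.show()
```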