How to Run 40 Regression Models with a Few Lines of Code

Original

2021-05-12 09:33

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"},{"type":"strong"}],"text":"This article was originally published on the Towards Data Science blog and was translated and shared by InfoQ China with the permission of the original author, Ismael Araujo."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article shows you how to use Lazy Predict to run more than 40 machine learning models for a regression project."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Suppose you need to carry out a regression machine learning project. You have analyzed your data, done some data cleaning, and created some dummy variables, and now it is time to run machine learning regression models. Which ten models come to mind? Most people probably cannot name ten regression models. If you cannot, don't worry, because by the end of this article you will be able to run not just 10 but more than 40 machine learning regression models."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A few weeks ago, I published a blog post called "},{"type":"link","attrs":{"href":"https:\/\/towardsdatascience.com\/how-to-run-30-machine-learning-models-with-2-lines-of-code-d0f94a537e52?fileGuid=2V67vFzHJsUWw1aV","title":"","type":null},"content":[{"type":"text","text":"How to Run 30 Machine Learning Models with a Few Lines of Code"}]},{"type":"text","text":", and the response was great. In fact, it is my most popular post so far. In that post, I built a classification project to try out Lazy Predict. Now I am going to test Lazy Predict on a regression project. For that, I will use the classic Seattle house price dataset, which is available on Kaggle."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"What Is Lazy Predict?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With very little code, Lazy Predict 
helps you build dozens of models and see which ones perform better without any parameter tuning. The best way to show how it works is with a small project, so let's get started."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Using Lazy Predict for a Regression Project"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, to install Lazy Predict, run "},{"type":"codeinline","content":[{"type":"text","text":"pip install lazypredict"}]},{"type":"text","text":" in your terminal. It's that simple. Next, let's import the libraries used in this project. You can find the complete notebook here."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"# Importing important libraries\n# pyforest lazily imports pd, np, plt and sns on first use\nimport pyforest\nfrom lazypredict.Supervised import LazyRegressor\nfrom pandas.plotting import scatter_matrix\n# Scikit-learn packages\nfrom sklearn.linear_model import LinearRegression\nfrom sklearn.tree import DecisionTreeRegressor\nfrom sklearn.ensemble import ExtraTreesRegressor\nfrom sklearn import metrics\nfrom sklearn.metrics import mean_squared_error\n# Hide warnings\nimport warnings\nwarnings.filterwarnings('ignore')\n# Setting up max columns displayed to 100\npd.options.display.max_columns = 100\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"You can see that I imported "},{"type":"codeinline","content":[{"type":"text","text":"pyforest"}]},{"type":"text","text":" instead of Pandas and NumPy. In a notebook, PyForest 
lets you import all the important libraries very quickly. I wrote a blog post about it, which you can find "},{"type":"link","attrs":{"href":"https:\/\/towardsdatascience.com\/how-to-import-all-python-libraries-with-one-line-of-code-2b9e66a5879f?fileGuid=2V67vFzHJsUWw1aV","title":"","type":null},"content":[{"type":"text","text":"here"}]},{"type":"text","text":". Next, let's import the dataset."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"# Import dataset\ndf = pd.read_csv('..\/data\/kc_house_data_train.csv', index_col=0)\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let's see what this dataset looks like."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/7d\/74\/7d354df35a86b32db5e46cab56c13774.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Now let's check the data types."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"# Checking data types and null 
values\ndf.info()\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/c9\/4c\/c90f2ca4702b58270fdfd1379f229a4c.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A few things caught my attention. First, the "},{"type":"codeinline","content":[{"type":"text","text":"id"}]},{"type":"text","text":" column is not relevant to this small project. However, if you want to dig deeper into the project, you should check it for duplicates. Also, the "},{"type":"codeinline","content":[{"type":"text","text":"date"}]},{"type":"text","text":" column is an object type and should be converted to a DateTime type. The "},{"type":"codeinline","content":[{"type":"text","text":"zipcode"}]},{"type":"text","text":", "},{"type":"codeinline","content":[{"type":"text","text":"lat"}]},{"type":"text","text":", and "},{"type":"codeinline","content":[{"type":"text","text":"long"}]},{"type":"text","text":" columns probably have little or no relationship with price. However, since the goal of this project is to demonstrate "},{"type":"codeinline","content":[{"type":"text","text":"lazy 
predict"}]},{"type":"text","text":", I will keep them."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next, before running the first model, let's look at some summary statistics to find anything that needs fixing."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/b0\/77\/b074dcfab6b42256c507a9c69c698a77.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Yes, I can see some interesting things. First, one house has 33 bedrooms, which cannot be right. I looked it up online using its "},{"type":"codeinline","content":[{"type":"text","text":"id"}]},{"type":"text","text":" and found that the house actually has 3 bedrooms. You can find the house "},{"type":"link","attrs":{"href":"https:\/\/www.zillow.com\/homedetails\/8033-Corliss-Ave-N-Seattle-WA-98103\/48795791_zpid\/?fileGuid=2V67vFzHJsUWw1aV","title":"","type":null},"content":[{"type":"text","text":"here"}]},{"type":"text","text":". In addition, some houses appear to have no bathrooms. I will set those to at least 1 bathroom, and with that we can finish the data cleaning."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"# Fixing house with 33 bedrooms: set only the bedrooms value of that row to 3\ndf.loc[df['bedrooms'] == 33, 'bedrooms'] = 3\n# Set bathrooms to 1 for houses listed with less than 1 bathroom\ndf['bathrooms'] = df.bathrooms.apply(lambda x: 1 if x < 1 else 
x)\n"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Splitting the Training and Test Sets"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We can now split the data into training and test sets. Before that, though, let's make sure the data contains no "},{"type":"codeinline","content":[{"type":"text","text":"nan"}]},{"type":"text","text":" or "},{"type":"codeinline","content":[{"type":"text","text":"infinite"}]},{"type":"text","text":" values."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"# Removing nan and infinite values\ndf.replace([np.inf, -np.inf], np.nan, inplace=True)\ndf.dropna(inplace=True)\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Now split the dataset into two variables, X and y. I will assign 75% of the data to the training set and 25% to the test set."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"from sklearn.model_selection import train_test_split\n# Creating train test split\nX = df.drop(columns=['price'])\ny = df.price\n# Call train_test_split on the data and capture the results\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3, test_size=0.25)\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Time to have some fun! The code below will run more than 40 models and show the R-squared and RMSE for each one. Ready, set, go!"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"reg = LazyRegressor(ignore_warnings=False, custom_metric=None)\nmodels, predictions = reg.fit(X_train, X_test, y_train, 
y_test)\nprint(models)\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/47\/f9\/47146d1a8a5b72d05467e11816ed48f9.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Wow! These results are great for the amount of work put in. For vanilla models, these are very good R-squared and RMSE values. As we can see, we ran 41 vanilla models and got the metrics we need, and you can also see the time each model took. Not bad at all. So how can you be sure these results are correct? We can run a single model ourselves and check whether its results are close to what we got. Shall we test the histogram-based gradient boosting regression tree? If you have never heard of this algorithm, don't worry, because I had never heard of it either. You can find an article about it "},{"type":"link","attrs":{"href":"https:\/\/machinelearningmastery.com\/histogram-based-gradient-boosting-ensembles\/?fileGuid=2V67vFzHJsUWw1aV","title":"","type":null},"content":[{"type":"text","text":"here"}]},{"type":"text","text":"."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Double-Checking the Results"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, let's import this model from scikit-learn."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"# Explicitly require this experimental feature\n# (only needed on scikit-learn versions before 1.0)\nfrom sklearn.experimental import enable_hist_gradient_boosting\n# Now you can import normally from ensemble\nfrom sklearn.ensemble import 
HistGradientBoostingRegressor\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In addition, let's create functions to check the model's metrics."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"# Evaluation Functions\ndef rmse(model, y_test, y_pred, X_train, y_train):\n    # X_test is read from the notebook's global scope\n    r_squared = model.score(X_test, y_test)\n    mse = mean_squared_error(y_test, y_pred)\n    rmse_value = np.sqrt(mse)\n    print('R-squared: ' + str(r_squared))\n    print('Root Mean Squared Error: ' + str(rmse_value))\n# Create model line scatter plot\ndef scatter_plot(y_test, y_pred, model_name):\n    plt.figure(figsize=(10,6))\n    sns.residplot(x=y_test, y=y_pred, lowess=True, color='#4682b4',\n                  line_kws={'lw': 2, 'color': 'r'})\n    plt.title(str('Price vs Residuals for '+ model_name))\n    plt.xlabel('Price',fontsize=16)\n    plt.xticks(fontsize=13)\n    plt.yticks(fontsize=13)\n    plt.show()\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Finally, let's run the model and look at the results."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"# Histogram-based Gradient Boosting Regression Tree\n# Assumes X_train and X_test contain only numeric columns\nhist = HistGradientBoostingRegressor()\nhist.fit(X_train, y_train)\ny_pred = 
hist.predict(X_test)\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Voilà! The results we got with Lazy Predict are very close to this one. It looks like it really works."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Final Thoughts"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Lazy Predict is an amazing library that is easy to use and very fast, and it needs very little code to run vanilla models. Instead of setting up multiple vanilla models by hand, you can do it with 2 to 3 lines of code. Remember, do not treat the results as final models, and always double-check them to make sure the library is working correctly. As I have mentioned in other posts, data science is a complex field, and Lazy Predict cannot replace the expertise of professionals who optimize models. Please let me know how it works for you."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"About the author:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ismael Araujo is a data scientist and machine learning engineer based in New York."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Original link:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https:\/\/towardsdatascience.com\/how-to-run-40-regression-models-with-a-few-lines-of-code-5a24186de7d"}]}]}
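The article compares models by R-squared and RMSE, so it may help to see how those two numbers are computed. The following is a minimal sketch in plain Python, not the implementation used by Lazy Predict or scikit-learn. The function names `rmse_score` and `r_squared_score` and the sample prices are hypothetical and chosen only for illustration.

```python
import math


def rmse_score(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical house prices and predictions, for illustration only
actual = [300000, 450000, 500000, 750000]
predicted = [310000, 440000, 520000, 730000]

print("RMSE:", rmse_score(actual, predicted))
print("R-squared:", r_squared_score(actual, predicted))
```

RMSE is expressed in the same unit as the target (here, price), while R-squared is unitless. That is why both are worth checking when you compare a hand-run model against the Lazy Predict table.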