Machine Learning with MLlib (Part 10, First Half)

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"A note up front:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hi everyone, I'm 強哥, a technology enthusiast who loves to share. I have 12 years of project experience in big data and AI, and 10 years of research and practice in recommender systems. In my spare time I enjoy reading, power walking, and writing.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Outside of work I focus on writing about big data, AI, and related topics. So far I have published a 400,000-character series of articles on recommender systems, and my book 「構建企業級推薦系統:算法、工程實現與案例分析」 (Building Enterprise Recommender Systems: Algorithms, Engineering Implementation, and Case Studies) is due out at the end of June this year. If these articles help you get started quickly and advance your career, I will be delighted.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"inden
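t":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Up to this point, we have focused on data engineering workloads with Apache Spark. Data engineering is often a precursory step to preparing your data for machine learning (ML) tasks, which will be the focus of this chapter. We live in an era in which machine learning and artificial intelligence applications are widespread. Whether we realize it or not, every day we are likely to come into contact with ML models for purposes such as online shopping recommendations and advertisements, fraud detection, classification, image recognition, and pattern matching. These ML models drive important business decisions for many companies. According to a McKinsey study, 35% of what consumers purchase on Amazon and 75% of what they watch on Netflix is driven by machine learning-based product recommendations. Building a model that performs well can make or break a company.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In this chapter we will get you started building ML models using MLlib, the machine learning library among Apache Spark's core components. We will begin with a brief introduction to machine learning, then cover best practices for ML at scale and for feature engineering (if you are already familiar with machine learning fundamentals, you can skip straight to “Designing Machine Learning Pipelines”). Through the short code snippets presented here and the notebooks available in the book's GitHub repo, you will learn how to build basic ML models and use MLlib.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This chapter covers the Scala and Python APIs. If you are interested in using Spark with R (sparklyr) for machine learning, we suggest you check out Mastering Spark with R by Javier Luraschi, Kevin Kuo, and Edgar Ruiz (O'Reilly).","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"What Is Machine Learning?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Machine learning is getting a lot of hype these days, but what is it exactly? Broadly speaking, machine learning is a process for extracting patterns from your data, using statistics, linear algebra, and numerical optimization. Machine learning can be applied to problems such as predicting power consumption, determining whether or not there is a cat in a video, or clustering items with similar characteristics.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There are several types of machine learning, including supervised, semi-supervised, unsupervised, and reinforcement learning. This chapter will mainly focus on supervised machine learning and only touch on unsupervised learning. Before we dive in, let's briefly discuss the differences between supervised and unsupervised ML.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Supervised Learning","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In supervised machine learning, your data consists of a set of input records, each of which has an associated label, and the goal is to predict the output label given a new unlabeled input. These output labels can be either ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"discrete","attrs":{}},{"type":"text","text":" or ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"continuous","attrs":{}},{"type":"text","text":", which brings us to the two types of supervised machine learning: ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"classification","attrs":{}},{"type":"text","text":" and ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"regression","attrs":{}},{"type":"text","text":".","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In a classification problem, the aim is to separate the inputs into a discrete set of classes or labels. With ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"binary","attrs":{}},{"type":"text","text":" classification, there are two discrete labels you want to predict, such as “dog” or “not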
dog”, as shown in Figure 10-1.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/87/87bf4d8a26555dfcc63e672f0dd4eaf6.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"multiclass","attrs":{}},{"type":"text","text":" (also known as ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"multinomial","attrs":{}},{"type":"text","text":") classification, there can be three or more discrete labels, such as predicting the breed of a dog (e.g., Australian shepherd, golden retriever, or poodle, as shown in Figure 10-2).","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/33/3351bf18f577b76d06a2d2810f2cb62d.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In regression problems, the value to predict is a continuous number, not a label. This means you might predict values that your model has not seen during training, as shown in Figure 10-3. For example, you might build a model to predict daily ice cream sales given the temperature. Your model might predict the value $77.67, even if none of the input/output pairs it was trained on contained that value.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e6/e6ddfc3631be2c14db8e7471b4b5ee0c.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Table 10-1 below lists some commonly used supervised ML algorithms that are available in Spark MLlib, with a note as to whether they can be used for regression, classification, or both.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/2b/2bb0d25bdf45adafb774609e9a6f6189.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4f/4faba6e3639b9050353c25c0223fd118.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Unsupervised Learning","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Obtaining the labeled data required by supervised machine learning can be very expensive, and sometimes it is simply infeasible. This is where unsupervised machine learning comes into play. Instead of predicting a label, unsupervised ML helps you to better understand the structure of your data.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As an example, consider the original unclustered data on the left in Figure 10-4. For each data point (","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"x","attrs":{}},{"type":"text","text":"1, ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"x","attrs":{}},{"type":"text","text":"2) there is no known true label, but by applying unsupervised machine learning to our data we can find the clusters that naturally form, as shown on the right.","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/73/737efcdae631d09f1d146a7eb659ee04.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Unsupervised machine learning can be used for outlier detection or as a preprocessing step for supervised machine learning, for example to reduce the dimensionality of the data set (i.e., the number of dimensions per data point), which is useful for reducing storage requirements or simplifying downstream tasks. Some unsupervised machine learning algorithms in MLlib include ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"k-","attrs":{}},{"type":"text","text":"means, Latent Dirichlet Allocation (LDA), and Gaussian mixture models.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Why Spark for Machine Learning?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark is a unified analytics engine that provides an ecosystem for data ingestion, feature engineering, model training, and deployment. Without Spark, developers would need many disparate tools to accomplish this set of tasks, and might still struggle with scalability.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark has two machine learning packages: spark.mllib and spark.ml. spark.mllib is the original machine learning API, based on the RDD API (which has been in maintenance mode since Spark 2.0), while spark.ml is the newer API, based on DataFrames. The rest of this chapter will focus on using the spark.ml package and on how to design machine learning pipelines in Spark. However, we use “MLlib” as an umbrella term to refer to both machine learning library packages in Apache Spark.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With spark.ml, data scientists can use one ecosystem for their data preparation and model building, without the need to downsample their data to fit on a single machine. spark.ml focuses on O(n) scale-out, where the model scales linearly with the number of data points you have, so it can scale to massive amounts of data. In the following chapter, we will discuss some of the trade-offs involved in choosing between a distributed framework such as spark.ml and a single-node framework like scikit-learn (sklearn). If you have previously used scikit-learn, many of the APIs in spark.ml will feel quite familiar, but there are some subtle differences that we will discuss.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Designing Machine Learning Pipelines","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In this section, we will cover how to create and tune ML pipelines. The concept of pipelines is common across many ML frameworks as a way to organize a series of operations to apply to your data. In MLlib, the Pipeline API provides a high-level API built on top of DataFrames to organize your machine learning workflow. The Pipeline API is composed of a series of transformers and estimators, which we will discuss in depth later.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Throughout this chapter, we will use the San Francisco housing data set from Inside Airbnb. It contains information about Airbnb rentals in San Francisco, such as the number of bedrooms, location, and review scores, and our goal is to build a model to predict the nightly rental prices for listings in that city. This is a regression problem, because price is a continuous variable. We will guide you through the workflow a data scientist would go through to approach this problem, including feature engineering, building models, hyperparameter tuning, and evaluating model performance. This data set is quite messy and can be difficult to model (like most real-world data sets!), so if you are experimenting on your own, don't feel bad if your early models aren't great.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The intent of this chapter is not to show you every API in MLlib, but rather to equip you with the skills and knowledge to get started with using MLlib to build end-to-end pipelines. Before going into the details, let's define some MLlib terminology:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Transformer","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Accepts a DataFrame as input, and returns a new DataFrame with one or more columns appended to it. Transformers do not learn any parameters from your data and simply apply rule-based transformations, either to prepare data for model training or to generate predictions using a trained MLlib model. They have a .transform()
method.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Estimator","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Learns (or “fits”) parameters from your DataFrame via a .fit() method and returns a model, which is a transformer.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Pipeline","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Organizes a series of transformers and estimators into a single model. While pipelines themselves are estimators, the output of pipeline.fit() is a PipelineModel, which is a transformer.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"While these concepts may seem rather abstract right now, the code snippets and examples in this chapter will help you understand how they all come together. But before we can build our ML models and use transformers, estimators, and pipelines, we need to load in our data and perform some data preparation.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Data Ingestion and Exploration","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We have slightly preprocessed the data in our example data set to remove outliers (e.g., Airbnbs posted for $0/night), converted all integers to doubles, and selected an informative subset of the more than one hundred fields. Further, for any missing numerical values in our data columns, we have imputed the median value and added an indicator column (the column name followed by _na, such as bedrooms_na). This way the ML model or a human analyst can tell which values in that column were imputed rather than observed. You can see the data preparation notebook in the book's GitHub repo. Note that there are many other ways to handle missing values, which are outside the scope of this book.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let's take a quick peek at the data set and the corresponding schema (with the output showing just a subset of the columns):","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# In Python","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"filePath = \"/databricks-datasets/learning-spark-v2/sf-airbnb/sf-airbnb-clean.parquet/\"","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"airbnbDF = spark.read.parquet(filePath)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"airbnbDF.select(\"neighbourhood_cleansed\", \"room_type\", \"bedrooms\", \"bathrooms\",","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"                \"number_of_reviews\", \"price\").show(5)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val filePath =","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  \"/databricks-datasets/learning-spark-v2/sf-airbnb/sf-airbnb-clean.parquet/\"","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val airbnbDF = spark.read.parquet(filePath)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"airbnbDF.select(\"neighbourhood_cleansed\", \"room_type\", \"bedrooms\", 
\"bathrooms\",","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"                \"number_of_reviews\", \"price\").show(5)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+----------------------+---------------+--------+---------+--------","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|neighbourhood_cleansed|room_type|bedrooms|bathrooms|number_...|price|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+----------------------+---------------+--------+---------+----------","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"| Western Addition|Entire home/apt| 1.0| 1.0| 180.0|170.0|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"| Bernal Heights  |Entire home/apt| 2.0| 1.0| 111.0|235.0|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"| Haight Ashbury  | Private room  | 1.0| 4.0| 17.0 | 65.0|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"| Haight Ashbury  | Private room  | 1.0| 4.0| 8.0  | 65.0|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"| Western Addition|Entire home/apt| 2.0| 1.5| 27.0 
|785.0|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+----------------------+---------------+--------+---------+--------","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Our goal is to predict the price per night for a rental property, given our features.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Before data scientists can get to model building, they need to explore and understand their data. They will often use Spark to group their data, then use data visualization libraries such as matplotlib to visualize it. We will leave data exploration as an exercise for the reader.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Creating Training and Test Data Sets","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Before we begin feature engineering and modeling, we will divide our data set into two groups: ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"train","attrs":{}},{"type":"text","text":" and ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"test","attrs":{}},{"type":"text","text":". Depending on the size of your data set, your train/test ratio may vary, but many data scientists use 80/20 as a standard train/test split. You might be wondering, “Why not use the entire data set to train the model?” The problem is that if we built a model on the entire data set, the model could memorize or “overfit” to the training data we provided, and we would have no data left with which to evaluate how well it generalizes to previously unseen data. Assuming the data follows similar distributions, the model's performance on the test set is a proxy for how well it will perform on unseen data (i.e., in the wild or in production). This split into training and test data sets is depicted in Figure 10-5.","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/0d/0d6216db6af8b8a2fe2cc0f2d46869b1.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Our training set consists of a set of features, X, and a label, y. Here we use capital X to denote a matrix with dimensions ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"n","attrs":{}},{"type":"text","text":" x ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"d","attrs":{}},{"type":"text","text":", where ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"n","attrs":{}},{"type":"text","text":" is the number of data points (or examples) and ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"d","attrs":{}},{"type":"text","text":" is the number of features (this is what we call the fields or columns in our DataFrame). We use lowercase y to denote a vector, with dimensions ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"n","attrs":{}},{"type":"text","text":" x 
1; there is one label for every example.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Different metrics are used to measure the performance of a model. For classification problems, a standard metric is the ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"accuracy","attrs":{}},{"type":"text","text":", or percentage, of correct predictions. Once the model has satisfactory performance on the training set using that metric, we apply the model to our test set. If it performs well on our test set according to our evaluation metrics, then we can feel confident that we have built a model that will generalize to unseen data.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For our Airbnb data set, we will keep 80% of the data for the training set and set aside 20% for the test set. Further, we will set a random seed for reproducibility, such that if we rerun this code we will get the same data points going to our training and test data sets, respectively. The value of the seed itself does ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"not","attrs":{}},{"type":"text","text":" matter, but data scientists often like setting it to 42, as that is the answer to the Ultimate Question of Life:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# In Python","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"trainDF, testDF = airbnbDF.randomSplit([.8, .2], seed=42)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"print(f\"\"\"There are {trainDF.count()} rows in the training set,","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"and {testDF.count()} in the test set\"\"\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In 
Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val Array(trainDF, testDF) = airbnbDF.randomSplit(Array(.8, .2), seed=42)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"println(f\"\"\"There are ${trainDF.count} rows in the training set, and","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"${testDF.count} in the test set\"\"\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This produces the following output:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There are 5780 rows in the training set, and 1366 in the test set","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"But what happens if we change the number of executors in our Spark cluster? The Catalyst optimizer determines the optimal way to partition your data as a function of your cluster resources and the size of your data set. Given that data in a Spark DataFrame is row-partitioned and each worker performs its split independently of the other workers, if the data in the partitions changes, then the result of the split (by randomSplit()) will not be the same.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"While you could fix your cluster configuration and your seed to ensure that you get consistent results, our recommendation is to split your data once, then write it out to its own train/test folders so that you don't have these reproducibility issues.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"During your exploratory analysis, you should cache the training data set, because you will be accessing it many times throughout the machine learning process. Please refer to the section “Caching and Persistence of Data” in Chapter 7.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Preparing Features with Transformers","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Now that we have split our data into training and test sets, let's prepare the data to build a linear regression model predicting price given the number of bedrooms. In a later example we will include all of the relevant features, but for now let's make sure we have the mechanics in place. Linear regression (like many other algorithms in Spark) requires that all the input features are contained within a single vector in your DataFrame. Thus, we need to ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"transform","attrs":{}},{"type":"text","text":" our data.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Transformers in Spark accept a DataFrame as input and return a new DataFrame with one or more columns appended to it. They do not learn from your data, but apply rule-based transformations using the transform() method.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For the task of putting all of our features into a single vector, we will use the VectorAssembler transformer. VectorAssembler takes a list of input columns and creates a new DataFrame with an additional column, which we will call features. It combines the values of those input columns into a single vector:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# In Python","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"from pyspark.ml.feature import VectorAssembler","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"vecAssembler = VectorAssembler(inputCols=[\"bedrooms\"], outputCol=\"features\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"vecTrainDF = 
vecAssembler.transform(trainDF)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"vecTrainDF.select(\"bedrooms\", \"features\", \"price\").show(10)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"import org.apache.spark.ml.feature.VectorAssembler","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val vecAssembler = new VectorAssembler()","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .setInputCols(Array(\"bedrooms\"))","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .setOutputCol(\"features\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val vecTrainDF = vecAssembler.transform(trainDF)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"vecTrainDF.select(\"bedrooms\", \"features\", \"price\").show(10)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 
","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+--------+--------+-----+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|bedrooms|features|price|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+--------+--------+-----+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|     1.0|   [1.0]|200.0|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|     1.0|   [1.0]|130.0|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|     1.0|   [1.0]| 95.0|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|     1.0|   [1.0]|250.0|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|     3.0|   [3.0]|250.0|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|     1.0|   [1.0]|115.0|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|     1.0|   [1.0]|105.0|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|     1.0|   [1.0]| 86.0|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|     1.0|   [1.0]|100.0|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|     2.0|   
[2.0]|220.0|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+--------+--------+-----+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"You will notice that in the Scala code, we had to instantiate the new VectorAssembler object and then use setter methods to change the input and output columns. In Python, you have the option to pass the parameters directly to the constructor of VectorAssembler or to use the setter methods, but in Scala you can only use the setter methods.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We cover the fundamentals of linear regression next, but if you are already familiar with the algorithm, please skip to “Using Estimators to Build Models”.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Understanding Linear Regression","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Linear regression models a linear relationship between your dependent variable (or label) and one or more independent variables (or features). In our case, we want to fit a linear regression model to predict the price of an Airbnb rental given the number of bedrooms.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In Figure 10-6, we have a single feature ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"x","attrs":{}},{"type":"text","text":" and an output ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"y","attrs":{}},{"type":"text","text":" (this is our dependent variable). Linear regression seeks to fit the equation of a line to the relationship between ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"x","attrs":{}},{"type":"text","text":" and ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"y","attrs":{}},{"type":"text","text":", which for scalar variables can be expressed as ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"y = mx + 
b","attrs":{}},{"type":"text","text":", where ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"m","attrs":{}},{"type":"text","text":" is the slope and ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"b","attrs":{}},{"type":"text","text":" is the offset or intercept.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The dots indicate the true (","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"x","attrs":{}},{"type":"text","text":", ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"y","attrs":{}},{"type":"text","text":") pairs from our data set, and the solid line indicates the line of best fit for this data set. The data points do not line up perfectly, so we usually think of linear regression as fitting a model of the form ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"y ≈ mx + b + ε","attrs":{}},{"type":"text","text":", where ε is an error drawn from some distribution, independently for each record ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"x","attrs":{}},{"type":"text","text":". These are the errors between our model predictions and the true values. Often we think of ε as being Gaussian, or normally distributed. The vertical lines above the regression line indicate positive ε (or residuals), where the true values are above the predicted values, and the vertical lines below the regression line indicate negative residuals. The goal of linear regression is to find the line that minimizes the sum of the squares of these residuals. You will notice that the line can extrapolate predictions for data points it has not seen.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4f/4f239eb0468a422c5f31dd5c62a461e5.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Linear regression can also be extended to handle multiple independent variables. If we had three features as input, x = [x1, x2, x3], then we could model y as y ≈ w0 + w1x1 + w2x2 + w3x3 + 
ε. In this case, there is a separate coefficient (or weight) for each feature and a single intercept (here ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"w","attrs":{}},{"type":"text","text":"0 instead of ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"b","attrs":{}},{"type":"text","text":"). The process of estimating the coefficients and intercept for our model is called ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"learning","attrs":{}},{"type":"text","text":" (or ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"fitting","attrs":{}},{"type":"text","text":") the parameters for the model. For now, we will focus on the univariate regression example of predicting price given the number of bedrooms, and we will return to multivariable linear regression later.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Using Estimators to Build Models","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After setting up our vectorAssembler, we have our data prepared and transformed into the format that our linear regression model expects. In Spark, LinearRegression is a type of estimator: it takes in a DataFrame and returns a model. Estimators learn parameters from your data, have an estimator_name.fit() method, and are eagerly evaluated (i.e., they kick off Spark jobs), whereas transformers are lazily evaluated. Other examples of estimators include Imputer, DecisionTreeClassifier, and RandomForestRegressor.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"You will notice that the input column for linear regression (features) is the output from our vectorAssembler:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# In Python","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"from pyspark.ml.regression import LinearRegression","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"lr = LinearRegression(featuresCol=\"features\", labelCol=\"price\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"lrModel = 
lr.fit(vecTrainDF)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"import org.apache.spark.ml.regression.LinearRegression","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val lr = new LinearRegression()","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .setFeaturesCol(\"features\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .setLabelCol(\"price\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val lrModel = lr.fit(vecTrainDF)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"lr.fit() returns a LinearRegressionModel (lrModel), which is a transformer. In other words, the output of an estimator's fit() method is a transformer. Once the estimator has learned the parameters, the transformer can apply these parameters to new data points to generate predictions. Let's inspect the parameters it learned:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# In Python","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"m = round(lrModel.coefficients[0], 2)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"b = round(lrModel.intercept, 
2)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"print(f\"\"\"The formula for the linear regression line is","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"price = {m}*bedrooms + {b}\"\"\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val m = lrModel.coefficients(0)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val b = lrModel.intercept","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"println(f\"\"\"The formula for the linear regression line is","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"price = $m%1.2f*bedrooms + $b%1.2f\"\"\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This prints:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The formula for the linear regression line is price = 123.68*bedrooms + 47.51","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Creating a Pipeline","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If we want to apply our model to our test set, then we need to prepare that data in the same way as the training set (i.e., pass it through the vector assembler). Data preparation pipelines often have multiple steps, and it becomes cumbersome to remember not only which steps to apply, but also the order in which to apply them. This is the motivation for the Pipeline API: you simply specify the stages you want your data to pass through, in order, and Spark takes care of the processing for you. Pipelines give the user better code reusability and organization. In Spark, Pipelines are estimators, whereas PipelineModels (fitted Pipelines) are transformers.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let's build our pipeline now:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# In Python","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"from pyspark.ml import Pipeline","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"pipeline = Pipeline(stages=[vecAssembler, lr])","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"pipelineModel = pipeline.fit(trainDF)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"import org.apache.spark.ml.Pipeline","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val pipeline = new Pipeline().setStages(Array(vecAssembler, lr))","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val pipelineModel = pipeline.fit(trainDF)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Another advantage of using the Pipeline API is that it determines which stages are estimators and which are transformers for you, so you don't have to worry about specifying name.fit() versus name.transform() for each of the stages.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Since pipelineModel is a transformer, it is straightforward to apply it to our test dataset too:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# In Python","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"predDF = pipelineModel.transform(testDF)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"predDF.select(\"bedrooms\", \"features\", \"price\", \"prediction\").show(10)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val predDF = pipelineModel.transform(testDF)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"predDF.select(\"bedrooms\", \"features\", \"price\", \"prediction\").show(10)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+--------+--------+------+------------------+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|bedrooms|features| price|        
prediction|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+--------+--------+------+------------------+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|     1.0|   [1.0]|  85.0|171.18598011578285|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|     1.0|   [1.0]|  45.0|171.18598011578285|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|     1.0|   [1.0]|  70.0|171.18598011578285|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|     1.0|   [1.0]| 128.0|171.18598011578285|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|     1.0|   [1.0]| 159.0|171.18598011578285|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|     2.0|   [2.0]| 250.0|294.86172649777757|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|     1.0|   [1.0]|  99.0|171.18598011578285|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|     1.0|   [1.0]|  95.0|171.18598011578285|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|     1.0|   [1.0]| 100.0|171.18598011578285|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|     1.0|   
[1.0]|2010.0|171.18598011578285|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+--------+--------+------+------------------+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In this code we built a model using only a single feature, bedrooms (you can find the notebook for this chapter in the book's GitHub repo). However, you may want to build a model using all of your features, some of which may be categorical, such as host_is_superhost. Categorical features take on discrete values and have no intrinsic ordering; examples include occupations or country names. In the next section we will consider a solution for handling these kinds of variables, called “one-hot encoding.”","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"One-Hot Encoding","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the pipeline we just created, we only had two stages, and our linear regression model only used one feature. Let's take a look at how to build a slightly more complex pipeline that incorporates all of our numeric and categorical features.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Most machine learning models in MLlib expect numerical values as input, represented as vectors. To convert categorical values into numeric values, we can use a technique called one-hot encoding (OHE). Suppose we have a column called Animal and we have three types of animals: Dog, Cat, and Fish. We can't pass the string types into our ML model directly, so we need to assign a numeric mapping, such as this:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Animal = {\"Dog\", \"Cat\", \"Fish\"}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\"Dog\" = 1, \"Cat\" = 2, \"Fish\" = 3 ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"However, using this approach we have introduced some spurious relationships into our dataset that weren't there before. For example, why did we assign Cat twice the value of Dog? The numeric values we use should not introduce any relationships into our dataset. Instead, we want to create a separate column for every distinct value in our Animal column:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\"Dog\" = [ 
1, 0, 0]","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\"Cat\" = [ 0, 1, 0]","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\"Fish\" = [0, 0, 1]","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If the animal is a dog, it has a one in the first column and zeros elsewhere. If it is a cat, it has a one in the second column and zeros elsewhere. The ordering of the columns is irrelevant. If you've used pandas before, you'll note that this does the same thing as pandas.get_dummies().","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If we had a zoo of 300 animals, would OHE massively increase consumption of memory and compute resources? Not with Spark! Spark internally uses a SparseVector when the majority of the entries are 0, as is often the case after OHE, so it does not waste space storing 0 values. Let's take a look at an example to better understand how SparseVectors work:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DenseVector(0,0,0,7,0,2,0,0,0,0)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SparseVector(10,[3,5],[7,2])","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In this example, the DenseVector contains 10 values, all but 2 of which are 0. To create a SparseVector, we need to keep track of the size of the vector, the indices of the nonzero elements, and the corresponding values at those indices. Here the size of the vector is 10, there are two nonzero values at indices 3 and 5, and the corresponding values at those indices are 7 and 2.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There are a few ways to one-hot encode your data with Spark. A common approach is to use the StringIndexer and OneHotEncoder. With this approach, the first step is to apply the StringIndexer estimator to convert categorical values into category indices. These category indices are ordered by label frequency, so the most frequent label gets index 0, which gives us reproducible results across runs on the same data.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once you have created your category indices, you can pass those as input to the OneHotEncoder (OneHotEncoderEstimator if using Spark 2.3/2.4). The OneHotEncoder maps a column of category indices to a column of binary vectors. Take a look at Table 10-2 to see the differences in the StringIndexer and OneHotEncoder APIs between Spark 2.3/2.4 and 3.0.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ad/adfd6dbb3bcc866433aa854b34860477.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The following code demonstrates how to one-hot encode our categorical features. In our dataset, any column of type string is treated as a categorical feature, but sometimes you might have numeric features you want treated as categorical, or vice versa. You'll need to carefully identify which columns are numeric and which are categorical:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# In Python","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"from pyspark.ml.feature import OneHotEncoder, StringIndexer","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"categoricalCols = [field for (field, dataType) in trainDF.dtypes","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"                   if dataType == \"string\"]","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"indexOutputCols = [x + \"Index\" for x in categoricalCols]","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"oheOutputCols = [x + \"OHE\" for x in categoricalCols]","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 
","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"stringIndexer = StringIndexer(inputCols=categoricalCols,","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"                              outputCols=indexOutputCols,","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"                              handleInvalid=\"skip\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"oheEncoder = OneHotEncoder(inputCols=indexOutputCols,","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"                           outputCols=oheOutputCols)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"numericCols = [field for (field, dataType) in trainDF.dtypes","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"               if ((dataType == \"double\") & (field != \"price\"))]","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"assemblerInputs = oheOutputCols + numericCols","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"vecAssembler = VectorAssembler(inputCols=assemblerInputs,","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"                               
outputCol=\"features\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"                               ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val categoricalCols = trainDF.dtypes.filter(_._2 == \"StringType\").map(_._1)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val indexOutputCols = categoricalCols.map(_ + \"Index\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val oheOutputCols = categoricalCols.map(_ + \"OHE\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val stringIndexer = new StringIndexer()","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .setInputCols(categoricalCols)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  
.setOutputCols(indexOutputCols)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .setHandleInvalid(\"skip\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val oheEncoder = new OneHotEncoder()","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .setInputCols(indexOutputCols)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .setOutputCols(oheOutputCols)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val numericCols = trainDF.dtypes.filter{ case (field, dataType) =>","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  dataType == \"DoubleType\" && field != \"price\"}.map(_._1)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val assemblerInputs = oheOutputCols ++ numericCols","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val vecAssembler = new VectorAssembler()","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .setInputCols(assemblerInputs)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  
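.setOutputCol(\"features\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Now you might be wondering, “How does the StringIndexer handle new categories that appear in the test dataset but not in the training dataset?” There is a handleInvalid parameter that specifies how you want to handle them. The options are skip (filter out rows with invalid data), error (throw an error), or keep (put invalid data in a special additional bucket, at index numLabels). For this example, we just skipped the invalid records.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"One difficulty with this approach is that you need to tell StringIndexer explicitly which features should be treated as categorical. You could use VectorIndexer to automatically detect all the categorical variables, but it is computationally expensive, as it has to iterate over every single column and detect whether it has fewer than maxCategories distinct values. maxCategories is a parameter the user specifies, and determining this value can also be difficult.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Another approach is to use RFormula, whose syntax is inspired by the R programming language. With RFormula, you provide your label and the features you want to include. It supports a limited subset of the R operators, including ~, ., :, +, and -. For example, you might specify formula = \"y ~ bedrooms + bathrooms\", which means to predict y given just bedrooms and bathrooms, or formula = \"y ~ .\", which means to use all of the available features (automatically excluding y from the features). RFormula will automatically StringIndex and OHE all of your string columns, convert your numeric columns to double type, and combine all of these into a single vector using VectorAssembler. Thus, we can replace all of the preceding code with a single line, and we will get the same result:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# In Python","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"from pyspark.ml.feature import RFormula","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"rFormula = RFormula(formula=\"price ~ .\",","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"                    featuresCol=\"features\",","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"                    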
labelCol=\"price\",","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"                    handleInvalid=\"skip\")         ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"import org.apache.spark.ml.feature.RFormula","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val rFormula = new RFormula()","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .setFormula(\"price ~ .\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .setFeaturesCol(\"features\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .setLabelCol(\"price\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  
.setHandleInvalid(\"skip\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The downside of RFormula automatically combining the StringIndexer and OneHotEncoder is that one-hot encoding is not required or recommended for all algorithms. For example, tree-based algorithms can handle categorical variables directly if you just use the StringIndexer for the categorical features. You do not need to one-hot encode categorical features for tree-based methods, and doing so will often make your tree-based models worse. Unfortunately, there is no one-size-fits-all solution for feature engineering, and the ideal approach is closely related to the downstream algorithms you plan to apply to your dataset.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If someone else performs the feature engineering for you, make sure they document how they generated those features.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once you've written the code to transform your dataset, you can add a linear regression model using all of the features as input.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Here, we put all the feature preparation and model building into the pipeline, and apply it to our dataset:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# In Python","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"lr = LinearRegression(labelCol=\"price\", featuresCol=\"features\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"pipeline = Pipeline(stages = [stringIndexer, oheEncoder, vecAssembler, lr])","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# Or use RFormula","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# pipeline = Pipeline(stages = [rFormula, lr])","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"pipelineModel = 
pipeline.fit(trainDF)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"predDF = pipelineModel.transform(testDF)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"predDF.select(\"features\", \"price\", \"prediction\").show(5)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val lr = new LinearRegression()","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .setLabelCol(\"price\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .setFeaturesCol(\"features\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val pipeline = new Pipeline()","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .setStages(Array(stringIndexer, oheEncoder, vecAssembler, lr))","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// Or use RFormula","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// val pipeline = new Pipeline().setStages(Array(rFormula, lr))","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 
","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val pipelineModel = pipeline.fit(trainDF)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val predDF = pipelineModel.transform(testDF)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"predDF.select(\"features\", \"price\", \"prediction\").show(5)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+--------------------+-----+------------------+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|            features|price|        prediction|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+--------------------+-----+------------------+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|(98,[0,3,6,7,23,4...| 85.0| 55.80250714362137|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|(98,[0,3,6,7,23,4...| 45.0| 22.74720286761658|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|(98,[0,3,6,7,23,4...| 
70.0|27.115811183814913|","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|(98,[0,3,6,7,13,4...|128.0|-91.60763412465076|","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|(98,[0,3,6,7,13,4...|159.0| 94.70374072351933|","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+--------------------+-----+------------------+","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As you can see, the features column is represented as a SparseVector. There are 98 features after one-hot encoding, followed by the nonzero indices and then the values themselves. You can see the whole output if you pass truncate=False to show().","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How is our model performing? You can see that while some of the predictions might be considered “close,” others are far off (one predicted rental price is even negative!). Next, we will numerically evaluate how well the model performs across the entire test set.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Evaluating Models","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Now that we have built a model, we need to evaluate how well it performs. In spark.ml there are classification, regression, clustering, and ranking evaluators (the last introduced in Spark 3.0). Given that this is a regression problem, we will use root-mean-square error (RMSE) and R² (R-squared) to evaluate our model's performance.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"RMSE","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"RMSE is a metric that ranges from zero to infinity. The closer it is to zero, the better.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let's walk through the mathematical formula step by step:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. Compute the difference (or error) between the true value ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"yᵢ","attrs":{}},{"type":"text","text":" and the predicted value ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"ŷᵢ","attrs":{}},{"type":"text","text":" (pronounced y-hat, where the “hat” indicates that it is a predicted value of the quantity under the hat):","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a3/a3fc99808dc23dac6709a9403276b2f8.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. Square the difference between ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"yᵢ","attrs":{}},{"type":"text","text":" and ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"ŷᵢ","attrs":{}},{"type":"text","text":" so that positive and negative residuals do not cancel out. This is known as the squared error:","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b6/b6a7c73cb53b660df2ba291171ef19cb.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"
type":"text","text":"3. Then we sum up the squared errors for all ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"n","attrs":{}},{"type":"text","text":" of our data points, giving us the sum of squared errors (SSE), also called the residual sum of squares:","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/8c/8c27adb3e8f1d6eb62315e2e78539e7a.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4. However, the SSE grows with the number of records n in the dataset, so we want to normalize it by the number of records. This gives us the mean squared error (MSE), a very commonly used regression metric:","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/bc/bc94139181e075072bc4ebf18d4a4569.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5. If we stop at MSE, our error term is on the scale of the squared unit of the predicted variable. We usually take the square root of the MSE to bring the error back to the scale of the original unit, which gives us the root-mean-square error (RMSE):","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/43/4361d7852c97d341560dd6433b60b119.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let's evaluate our model using RMSE:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# In Python","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"from pyspark.ml.evaluation import RegressionEvaluator","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"regressionEvaluator = 
RegressionEvaluator(","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  predictionCol=\"prediction\",","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  labelCol=\"price\",","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  metricName=\"rmse\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"rmse = regressionEvaluator.evaluate(predDF)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"print(f\"RMSE is {rmse:.1f}\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"import org.apache.spark.ml.evaluation.RegressionEvaluator","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val regressionEvaluator = new RegressionEvaluator()","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .setPredictionCol(\"prediction\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  .setLabelCol(\"price\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  
.setMetricName(\"rmse\")","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val rmse = regressionEvaluator.evaluate(predDF)","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"println(f\"RMSE is $rmse%.1f\")","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This produces the following output:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"RMSE is 220.6","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Interpreting the value of RMSE","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So how do we know whether 220.6 is a good value for the RMSE? There are various ways to interpret this value, one of which is to build a simple baseline model and compute its RMSE for comparison. A common baseline model for regression tasks is to compute the average value of the label on the training set, ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"ȳ","attrs":{}},{"type":"text","text":" (pronounced ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"y","attrs":{}},{"type":"text","text":"-bar), then predict that average for every record in the test dataset and compute the resulting RMSE (example code is available in the book's GitHub repo). If you try this, you will see that our baseline model has an RMSE of 240.7, so our predictions beat the baseline. If you don't beat the baseline, something probably went wrong in your model building process.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If this were a classification problem, you might want to use predicting the most prevalent class as your baseline model.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Keep in mind that the unit of your label directly impacts your RMSE. For example, if your label is height, your RMSE will be higher if you use centimeters rather than meters as the unit of measurement. You could arbitrarily decrease the RMSE by using a different unit, which is why it is important to compare your RMSE against a baseline.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There are also some metrics that give you a direct intuition of how you are performing against a baseline, such as ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"R²","attrs":{}},{"type":"text","text":", which we discuss next.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"R²","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Despite the name ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"R²","attrs":{}},{"type":"text","text":" containing “squared,” ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"R²","attrs":{}},{"type":"text","text":" values range from negative infinity to 1. Let's take a look at the math behind this metric. ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"R²","attrs":{}},{"type":"text","text":" is computed as follows:","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/8f/8fc955fbc769d88250f998aaf00641c5.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If you always predict ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"ȳ","attrs":{}},{"type":"text","text":", then ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"SS tot","attrs":{}},{"type":"text","text":" is the total sum of squares:","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c5/c591593c2f16bd0dc1f94eb1960e8a66.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"and ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"SS res","attrs":{}},{"type":"text","text":" is the sum of the squared residuals from your model's predictions (also known as the sum of squared errors, which we used to compute the RMSE):","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7b/7b759ded3a04178286000b2e2020404e.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If your model perfectly predicts every data point, then your ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"SS res","attrs":{}},{"type":"text","text":" = 0, making your ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"R²","attrs":{}},{"type":"text","text":" = 1. If your ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"SS res","attrs":{}},{"type":"text","text":" = ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"SS tot","attrs":{}},{"type":"text","text":", then the fraction is 1/1, so your ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"R²","attrs":{}},{"type":"text","text":" is 0. This is what happens when your model performs the same as always predicting the average value.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"But what if your model performs worse than always predicting ȳ, and your ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"SS res","attrs":{}},{"type":"text","text":" is really large? Then your ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"R²","attrs":{}},{"type":"text","text":" can actually be negative! If your ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"R²","attrs":{}},{"type":"text","text":" is negative, you should reevaluate your modeling process. The nice thing about using ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"R²","attrs":{}},{"type":"text","text":" is that you don't need to define a baseline model to compare against.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If we want to change our regression evaluator to use ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"R²","attrs":{}},{"type":"text","text":" without redefining it, we can set the metric name using the setter method:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# In Python","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"r2 = regressionEvaluator.setMetricName(\"r2\").evaluate(predDF)","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"print(f\"R2 is {r2}\")","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val r2 = regressionEvaluator.setMetricName(\"r2\").evaluate(predDF)","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"println(s\"R2 is $r2\")","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The output is:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"R2 is 0.159854","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Our ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"R²","attrs":{}},{"type":"text","text":" is positive, but it is very close to 0. One reason our model is not performing well is that our label, price, appears to be log-normally distributed. If a distribution is log-normal, then taking the logarithm of the values gives something that looks like a normal distribution. Prices are often log-normally distributed. If you think about rental prices in San Francisco, most cost around $200 per night, but some rent for thousands of dollars a night! You can see the distribution of Airbnb prices in our training dataset in Figure 10-7.","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/02/02bbffcd6393e7fa31f0e7284a004fa0.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let's look at the resulting distribution if we take the log of the price instead (Figure 10-8).","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/6f/6f3ea805bc6ebbbef207ccbf1924fde2.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"You can see here that our log-price distribution looks more like a normal distribution. As an exercise, try building a model that predicts price on the log scale, then exponentiate the predictions and evaluate the model. The code can also be found in this chapter's notebook in the book's GitHub repo. You should see the RMSE decrease and the ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"R²","attrs":{}},{"type":"text","text":" increase for this dataset.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Saving and Loading Models","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Now that we have built and evaluated a model, let's save it to persistent storage for later reuse (or so that we don't have to recompute the model if our cluster goes down). Saving models is very similar to writing DataFrames: the API is model.write().save(path). You can optionally call overwrite() to overwrite any data already at that path:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# In Python","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"pipelinePath = \"/tmp/lr-pipeline-model\"","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"pipelineModel.write().overwrite().save(pipelinePath)","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val pipelinePath = \"/tmp/lr-pipeline-model\"","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"pipelineModel.write.overwrite().save(pipelinePath)","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When you load a saved model, you need to specify the type of model you are loading back in (for example, a LinearRegressionModel or a LogisticRegressionModel). For this reason, we recommend always putting your transformers/estimators into a Pipeline, so that for every model you load a PipelineModel and only need to change the file path:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# In Python","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"from pyspark.ml import PipelineModel","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"savedPipelineModel = PipelineModel.load(pipelinePath)","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"import org.apache.spark.ml.PipelineModel","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val savedPipelineModel = PipelineModel.load(pipelinePath)","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After loading, you can apply the model to new data points. However, you cannot use the weights from this model as initialization parameters for training a new model (as opposed to starting from random weights), because Spark has no concept of “warm starts.” If your dataset changes slightly, you have to retrain the entire linear regression model from scratch.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With our linear regression model built and evaluated, let's explore how a few other models perform on our dataset. In the next section, we will explore tree-based models and look at some common hyperparameters to tune in order to improve model performance.","attrs":{}}]}]}
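The RMSE, baseline, and R² definitions in this section can be checked by hand on a few numbers. What follows is a minimal sketch in plain Python (no Spark required). The helper names rmse and r2 and all of the toy values are assumptions made for illustration; they are not taken from the Airbnb dataset or from the book's repo. As in the formulas above, r2 measures SS tot around the mean of the labels being evaluated.

```python
import math


def rmse(labels, predictions):
    """Root-mean-square error: square root of the mean squared residual."""
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(labels, predictions))
    return math.sqrt(sse / len(labels))


def r2(labels, predictions):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the label mean."""
    mean = sum(labels) / len(labels)
    ss_tot = sum((y - mean) ** 2 for y in labels)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(labels, predictions))
    return 1 - ss_res / ss_tot


# Hypothetical toy values, made up for this sketch.
train_labels = [100.0, 150.0, 200.0, 350.0]
test_labels = [85.0, 45.0, 70.0, 128.0, 159.0]
model_preds = [60.0, 40.0, 90.0, 100.0, 140.0]

# Baseline: always predict y-bar, the mean label of the training set.
baseline_value = sum(train_labels) / len(train_labels)
baseline_preds = [baseline_value] * len(test_labels)

model_rmse = rmse(test_labels, model_preds)
baseline_rmse = rmse(test_labels, baseline_preds)
model_r2 = r2(test_labels, model_preds)
baseline_r2 = r2(test_labels, baseline_preds)

print(f"model RMSE = {model_rmse:.1f}, baseline RMSE = {baseline_rmse:.1f}")
print(f"model R2 = {model_r2:.3f}, baseline R2 = {baseline_r2:.3f}")
```

On these made-up numbers the model's RMSE comes out below the baseline's. The baseline's R² is negative because its constant prediction is the training-set mean, which differs from the mean of the test labels used for SS tot.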