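Getting Started with PySpark: Hands-On Machine Learning, Predicting Infant Survival (Part 2), Using the ML Library

Hands-On Machine Learning: Predicting Infant Survival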

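The previous article walked through the workflow of the Spark MLlib machine learning library.

It started with setting up the environment, then loaded and explored the data, trained and evaluated a model, and finally made predictions on unseen data, namely an infant's chance of survival.

This article shows how to do the same with the ML library. It uses the same dataset as the previous part to demonstrate how an ML workflow is built, and again tries to predict the probability that an infant survives.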

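**Pipeline**: the pipeline workflow

A pipeline chains multiple transformers and estimators into a single machine learning workflow.

A pipeline is specified as a sequence of stages, and each stage is either a transformer or an estimator.

In the pipeline diagram, the top row represents a Pipeline with three stages. Tokenizer and HashingTF are transformers, and the third stage, LogisticRegression, is an estimator. The bottom row represents the data flowing through the pipeline, with cylinders denoting DataFrames. The flow is as follows:

(1) Build a DataFrame from the raw text and labels, then call the Pipeline's fit method;

(2) the Pipeline calls Tokenizer's transform method to split the text into words and adds the words to the DataFrame as a new column;

(3) after HashingTF's transform method turns the words into feature vectors, the Pipeline calls LogisticRegression.fit, which produces a LogisticRegressionModel.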
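The fit flow described above can be mimicked in a few lines of plain Python. This is a toy sketch, not Spark's implementation: the classes below only borrow the names Tokenizer, HashingTF, Pipeline and PipelineModel, the hash function is a stand-in, and MajorityClassEstimator is a hypothetical placeholder for LogisticRegression.

```python
class Tokenizer:
    """Transformer: adds a 'words' field by splitting 'text' on whitespace."""
    def transform(self, rows):
        return [dict(r, words=r["text"].lower().split()) for r in rows]


class HashingTF:
    """Transformer: maps words to a fixed-size term-frequency vector."""
    def __init__(self, num_features=8):
        self.num_features = num_features

    def transform(self, rows):
        out = []
        for r in rows:
            vec = [0] * self.num_features
            for w in r["words"]:
                # Toy deterministic hash (an assumption, not Spark's hash).
                vec[sum(map(ord, w)) % self.num_features] += 1
            out.append(dict(r, features=vec))
        return out


class MajorityClassModel:
    """Fitted model: a transformer that adds a 'prediction' field."""
    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        return [dict(r, prediction=self.label) for r in rows]


class MajorityClassEstimator:
    """Estimator: fit() learns from the rows and returns a model."""
    def fit(self, rows):
        labels = [r["label"] for r in rows]
        return MajorityClassModel(max(set(labels), key=labels.count))


class PipelineModel:
    """Fitted pipeline: every stage is now a transformer."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # estimator: fit it to get a model
                stage = stage.fit(rows)
            fitted.append(stage)
            rows = stage.transform(rows)   # feed the output to the next stage
        return PipelineModel(fitted)


train = [
    {"text": "spark is fast", "label": 1},
    {"text": "hello spark", "label": 1},
    {"text": "slow job", "label": 0},
]
model = Pipeline([Tokenizer(), HashingTF(), MajorityClassEstimator()]).fit(train)
out = model.transform([{"text": "Spark ML"}])
```

The point of the sketch is the shape of the API: fitting a pipeline returns a new object in which each estimator has been replaced by its fitted model, and that object can then transform unseen rows.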

Parameter: all Transformers and Estimators share a common API for specifying parameters.

An example of using a Pipeline appears in the ML case study below.

1. Load the data

Dataset: https://pan.baidu.com/s/1RJIAR4em2L2XiQpZhBWOgg
Extraction code: yw39

from pyspark.ml import Pipeline
import pyspark.ml.classification as cl
import pyspark.ml.evaluation as ev
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
import pyspark.sql.types as typ

labels = [('INFANT_ALIVE_AT_REPORT', typ.IntegerType()),
          ('BIRTH_PLACE', typ.StringType()),
          ('MOTHER_AGE_YEARS', typ.IntegerType()),
          ('FATHER_COMBINE_AGE', typ.IntegerType()),
          ('CIG_BEFORE', typ.IntegerType()),
          ('CIG_1_TRI', typ.IntegerType()),
          ('CIG_2_TRI', typ.IntegerType()),
          ('CIG_3_TRI', typ.IntegerType()),
          ('MOTHER_HEIGHT_IN', typ.IntegerType()),
          ('MOTHER_PRE_WEIGHT', typ.IntegerType()),
          ('MOTHER_DELIVERY_WEIGHT', typ.IntegerType()),
          ('MOTHER_WEIGHT_GAIN', typ.IntegerType()),
          ('DIABETES_PRE', typ.IntegerType()),
          ('DIABETES_GEST', typ.IntegerType()),
          ('HYP_TENS_PRE', typ.IntegerType()),
          ('HYP_TENS_GEST', typ.IntegerType()),
          ('PREV_BIRTH_PRETERM', typ.IntegerType())
          ]

schema = typ.StructType([
    typ.StructField(e[0], e[1], False) for e in labels
])

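# Create (or reuse) a SparkSession, since `spark` is not defined automatically
# outside the pyspark shell; the app name here is arbitrary
spark = SparkSession.builder.appName('births_ml').getOrCreate()

births = spark.read.csv(
    'data/births_transformed.csv', header=True, schema=schema)
births.show(3)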

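2. Create the transformers

(1) First, cast the BIRTH_PLACE column of births to a numeric type.

(2) Then create a OneHotEncoder, which encodes a numeric category index as a one-hot vector.

(3) Finally, create a VectorAssembler that gathers all the features into a single column. Its inputCols parameter is a list of every column to combine (the label column is not included), and outputCol is the name of the output column, here 'features'.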

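# Create the transformers
import pyspark.ml.feature as ft

# Cast BIRTH_PLACE from string to integer so it can be one-hot encoded
births = births.withColumn('BIRTH_PLACE_INT', births['BIRTH_PLACE']\
    .cast(typ.IntegerType()))

# One-hot encode the birth place
encoder = ft.OneHotEncoder(inputCol='BIRTH_PLACE_INT',
                           outputCol='BIRTH_PLACE_VEC')

# Assemble all features into a single vector column
# (labels[2:] skips the label column and the raw BIRTH_PLACE string)
featuresCreator = ft.VectorAssembler(
    inputCols=[col[0] for col in labels[2:]] + [encoder.getOutputCol()],
    outputCol='features'
)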
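To make the two transformers more concrete, here is a plain-Python sketch of what one-hot encoding and feature assembly do to a single row. It is not Spark code: the helper names one_hot and assemble, the sample row, and the category count of 4 are all assumptions made for illustration. The drop_last flag mirrors the dropLast option of OneHotEncoder, under which the last category is represented by an all-zero vector.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a one-hot list of floats."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:          # the dropped last category stays all zeros
        vec[index] = 1.0
    return vec


def assemble(row, input_cols):
    """Concatenate scalar and vector columns into one flat feature list."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, list):
            features.extend(value)
        else:
            features.append(float(value))
    return features


# Hypothetical row with only two of the numeric columns, for brevity
row = {"MOTHER_AGE_YEARS": 29, "CIG_BEFORE": 0, "BIRTH_PLACE_INT": 1}
row["BIRTH_PLACE_VEC"] = one_hot(row["BIRTH_PLACE_INT"], num_categories=4)
features = assemble(row, ["MOTHER_AGE_YEARS", "CIG_BEFORE", "BIRTH_PLACE_VEC"])
```

With these inputs the birth place index 1 becomes [0.0, 1.0, 0.0], and the assembled feature list is the two numeric values followed by that vector.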

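3. Create the estimator

This part uses the same logistic regression model as the previous part, LogisticRegression, except that it now comes from the pyspark.ml.classification module.

As before, import the dependency first and then create the model.

Note: if the label column were named 'label', labelCol would not need to be specified; here it is INFANT_ALIVE_AT_REPORT, so it must be set. Likewise, if the output column of featuresCreator were not 'features', you would pass featuresCreator.getOutputCol() as featuresCol.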

# Create the estimator
import pyspark.ml.classification as cl

# labelCol must be set because the label column is not named 'label'.
# maxIter and regParam are illustrative values, not tuned settings.
logistic = cl.LogisticRegression(
    maxIter=10,
    regParam=0.01,
    labelCol='INFANT_ALIVE_AT_REPORT'
)
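As a rough illustration of what a fitted logistic regression model computes for one row, the sketch below applies the sigmoid function to a weighted sum of the features. It is plain Python, not the Spark implementation, and the weights, intercept and feature values are made-up numbers chosen only to keep the arithmetic easy to follow.

```python
import math


def predict_probability(features, weights, intercept):
    """Return P(label = 1) for one feature vector under a logistic model."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


weights = [0.5, -1.0]   # hypothetical coefficients
intercept = 0.0         # hypothetical intercept

# z = 0.5 * 2.0 + (-1.0) * 0.0 = 1.0
p = predict_probability([2.0, 0.0], weights, intercept)

# Predict class 1 when the probability exceeds a threshold of 0.5
prediction = 1 if p > 0.5 else 0
```

Training is the process of finding the weights and intercept; prediction afterwards is just this formula applied to each row's features vector.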

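4. Create the pipeline

So far we have created two transformers and one estimator.

The next step is to connect the two transformers and the estimator by putting them into a pipeline.

To do that, import the relevant module and instantiate a Pipeline.

The code below creates a pipeline that chains the transformers encoder and featuresCreator and the estimator logistic, in that order.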

# Create a pipeline
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[encoder, featuresCreator, logistic])

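5. Train the model

With the transformers, the estimator and the pipeline in place, we can train a model on the dataset.

To make the evaluation credible, the dataset has to be split into a training set and a test set before training.

The training set birth_train is used to train the model, and the test set birth_test is used to evaluate it.

During training, the training set is passed to the encoder transformer, the DataFrame it outputs is passed to the featuresCreator transformer, which produces the features column, and that column is then passed to the logistic estimator.

Calling the transform method of the fitted pipeline model, model, returns the predictions.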

# Split the data and fit the model
birth_train, birth_test = births.randomSplit([0.7,0.3],seed=123)

model = pipeline.fit(birth_train)
test_model = model.transform(birth_test)
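The idea behind a seeded random split can be sketched in plain Python. This is an analogy rather than Spark's algorithm: the function random_split below is hypothetical, and it assumes each row is assigned independently, which is why the resulting sizes are only approximately 70/30 while the same seed always reproduces the same split.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to train or test independently, reproducibly."""
    rng = random.Random(seed)
    train_fraction = weights[0] / sum(weights)   # normalize the weights
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))
train, test = random_split(rows, [0.7, 0.3], seed=123)
train_again, test_again = random_split(rows, [0.7, 0.3], seed=123)
```

Fixing the seed matters when comparing models: without it, each run would evaluate on a different test set and the metrics would not be comparable.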

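6. Evaluate the model with BinaryClassificationEvaluator

The model is again evaluated with BinaryClassificationEvaluator, and this step is largely the same as in MLlib.

First import the evaluation module, then instantiate a BinaryClassificationEvaluator.

Finally, print the evaluation results.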

# Evaluate model performance
import pyspark.ml.evaluation as ev

evaluator = ev.BinaryClassificationEvaluator(
    rawPredictionCol='probability',
    labelCol='INFANT_ALIVE_AT_REPORT'
)

print(evaluator.evaluate(test_model, {evaluator.metricName:'areaUnderROC'}))
print(evaluator.evaluate(test_model, {evaluator.metricName:'areaUnderPR'}))
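To give some intuition for the areaUnderROC metric, it can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The plain-Python sketch below computes it by comparing every positive/negative pair. It is meant only to explain the number, not to reproduce how Spark calculates it, and the labels and scores are made-up values.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # pair ordered correctly
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = area_under_roc(labels, scores)   # 3 of the 4 pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect one, so the further the printed areaUnderROC is above 0.5, the better the model separates the two classes.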

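7. Save and load the model

This covers saving and loading both the pipeline definition and the fitted model.

This step is largely the same as in MLlib.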

# Save the (unfitted) pipeline definition
pipelinePath = './infant_oneHotEncoder_Logistic_Pipeline'
pipeline.write().overwrite().save(pipelinePath)

# Reload the pipeline; it must be fitted again before it can predict
loadedPipeline = Pipeline.load(pipelinePath)
loadedPipeline.fit(birth_train).transform(birth_test).take(1)


# Save the fitted model
from pyspark.ml import PipelineModel

modelPath = './infant_oneHotEncoder_LogisticPipelineModel'
model.write().overwrite().save(modelPath)

# Load the fitted model; it can predict right away
loadedPipelineModel = PipelineModel.load(modelPath)
test_reloadedModel = loadedPipelineModel.transform(birth_test)
test_reloadedModel.take(1)

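This article has given an overview of the steps for using the Spark ML machine learning library:

loading the data, creating the transformers and the estimator, creating the pipeline, training the model, evaluating it, and saving and loading the model.

The main differences from the MLlib workflow in the previous part lie in how the estimator and the pipeline are created, which is what is distinctive about ML.

In addition, since Spark 2.x, ML has become the generally preferred library.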
