In this article we introduce the concept of machine learning pipelines. ML Pipelines provide a set of high-level, DataFrame-based APIs that help users create and tune practical machine learning pipelines.
Main Concepts in Pipelines
MLlib standardizes APIs so that multiple algorithms can be combined into a single pipeline, or workflow. The pipeline concept is mostly inspired by the scikit-learn project.
1. DataFrame: The ML API uses DataFrames from Spark SQL as its dataset type, which can hold a variety of data types. For example, a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.
2. Transformer: A Transformer is an algorithm that transforms one DataFrame into another. For example, an ML model is a Transformer that transforms a DataFrame with features into a DataFrame with predictions.
3. Estimator: An Estimator is an algorithm that can be fit on a DataFrame to produce a Transformer. For example, a learning algorithm is an Estimator that trains on a DataFrame and produces a model.
4. Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
5. Parameter: All Transformers and Estimators share a common API for specifying parameters.
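The five abstractions above can be sketched in a few lines of plain Python. This is only a toy model, not Spark's actual API: a "DataFrame" here is just a list of dict rows, and the `Tokenizer` / `MeanLabelEstimator` classes are made up for illustration.

```python
# Toy sketch of the Transformer/Estimator abstractions (NOT Spark code).
# A "DataFrame" here is just a list of dict rows.

class Tokenizer:
    """A Transformer: maps the 'text' column to a new 'words' column."""
    def transform(self, df):
        return [{**row, "words": row["text"].split()} for row in df]

class MeanLabelEstimator:
    """An Estimator: fit() inspects the data and returns a Transformer."""
    def fit(self, df):
        mean = sum(row["label"] for row in df) / len(df)
        class Model:  # the fitted model is itself a Transformer
            def transform(self, df):
                return [{**row, "prediction": mean} for row in df]
        return Model()

df = [{"text": "a b spark", "label": 1.0},
      {"text": "b d", "label": 0.0}]
with_words = Tokenizer().transform(df)      # Transformer usage
model = MeanLabelEstimator().fit(df)        # Estimator produces a Transformer
predictions = model.transform(with_words)   # the fitted model transforms data
```

The key point the sketch shows: a Transformer only has transform(), while an Estimator's fit() returns a new object that itself has transform().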
DataFrame
Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data. The Pipelines API adopts the DataFrame from Spark SQL in order to support these types. See the Spark SQL datatype reference for the basic and structured types a DataFrame supports. In addition to the types listed in the Spark SQL guide, a DataFrame can also use the ML Vector type. A DataFrame can be created either explicitly or implicitly from a regular RDD; the code below gives examples. Columns in a DataFrame are named; the examples here use names such as "text", "features", and "label".
Pipeline Components
Transformers
A Transformer is an abstraction that covers feature transformers and learned models. Technically, a Transformer implements a transform() method, which converts one DataFrame into another, generally by appending one or more columns. For example:
1. A feature transformer takes a DataFrame, reads a text column, maps it into a new feature-vector column, and outputs a new DataFrame with that column appended.
2. A learning model takes a DataFrame, reads the column containing feature vectors, predicts a label for each feature vector, and outputs a new DataFrame with the predicted labels appended as a column.
Estimators
An Estimator abstracts a learning algorithm, or any algorithm that fits or trains on data. Technically, an Estimator implements a fit() method, which accepts a DataFrame and produces a model. For example, LogisticRegression is an Estimator; calling fit() on it trains a LogisticRegressionModel.
Properties of pipeline components
A Transformer's transform() method and an Estimator's fit() method are both stateless. In the future, stateful algorithms may be supported via alternative concepts.
Each instance of a Transformer or Estimator has a unique ID, which is very useful when specifying parameters (discussed below).
Pipeline
In machine learning, it is common to run a sequence of algorithms to process and learn from data. For example, a document-processing workflow might include these steps:
1. Split each document into individual words.
2. Convert each document's words into a numerical feature vector.
3. Learn a prediction model using the feature vectors and labels.
MLlib represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) to be run in order. In this section we will use this document-processing workflow as a running example.
How it works
A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. The stages run in order, and the input DataFrame is transformed as it passes through each one. For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer, and that Transformer's transform() method is then called on the DataFrame.
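The fit/transform alternation just described can be sketched as a short loop. This is a toy sketch, not Spark's implementation: a "DataFrame" is a list of dict rows, the stage classes are made up, and for simplicity the loop calls transform() even after the last stage (real Spark skips that final call during fit()).

```python
# Toy sketch of Pipeline.fit(): run stages in order, fitting each
# Estimator into a Transformer as we go. NOT Spark's implementation.

class UpperCaser:  # Transformer stage: has transform() only
    def transform(self, df):
        return [{**row, "text": row["text"].upper()} for row in df]

class RowCountEstimator:  # Estimator stage: has fit()
    def fit(self, df):
        n = len(df)
        class Model:
            def transform(self, df):
                return [{**row, "n_train": n} for row in df]
        return Model()

def pipeline_fit(stages, df):
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):   # Estimator: fit it first
            stage = stage.fit(df)
        fitted.append(stage)        # every fitted stage is a Transformer
        df = stage.transform(df)    # pass the updated data to the next stage
    return fitted, df               # fitted plays the role of a PipelineModel

fitted_stages, out = pipeline_fit(
    [UpperCaser(), RowCountEstimator()],
    [{"text": "spark"}, {"text": "hadoop"}])
```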
The figure below illustrates this for the simple document-processing workflow.
In the figure above, the top row represents the three stages of the Pipeline. The first two stages (in blue) are Transformers; the third, LogisticRegression (in the red box), is an Estimator. The bottom row represents data flowing through the pipeline, where cylinders indicate DataFrames. The Pipeline's fit() method is called on the original DataFrame, which holds raw documents and labels. The Tokenizer's transform() method splits the raw documents into words, adding a new words column to the DataFrame. The HashingTF transform() method converts the words column into feature vectors, adding a new vector column. Then, since LogisticRegression is an Estimator, the Pipeline first calls LogisticRegression's fit() method to produce a LogisticRegressionModel. If the Pipeline had more stages, it would call the model's transform() method on the DataFrame before passing the DataFrame to the next stage.
A Pipeline is itself an Estimator. Thus, after a Pipeline's fit() method runs, it produces a PipelineModel, which is a Transformer. The PipelineModel is used at test time; the figure below illustrates this usage.
In the figure above, the PipelineModel has the same number of stages as the original Pipeline, but all Estimators in the original Pipeline have become Transformers. When the PipelineModel's transform() method is called on a test dataset, the data passes through the fitted pipeline's stages in order. Each stage's transform() method updates the dataset and passes it on to the next stage.
Pipelines and PipelineModels help ensure that training and test data go through identical feature-processing steps.
Details
DAG Pipelines: A Pipeline's stages are specified as an ordered sequence. The examples given here are all linear Pipelines, i.e., Pipelines in which each stage uses data produced by the previous stage. It is also possible to create non-linear Pipelines, as long as the data flow forms a Directed Acyclic Graph (DAG). This graph is determined by the input and output column names of each stage (generally specified as parameters). If the Pipeline forms a DAG, then the stages must be specified in topological order.
Runtime checking: Since Pipelines can operate on DataFrames with varied data types, they cannot use compile-time type checking. Pipelines and PipelineModels instead perform runtime checking before actually running the Pipeline. This checking is done using the DataFrame schema, which describes the types of the columns in the DataFrame.
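Such a schema-based runtime check can be sketched as follows. This is a toy illustration, not Spark's implementation: the schema is a plain dict from column name to type name, and `validate_input` is a made-up helper.

```python
# Toy sketch of schema-based runtime checking (NOT Spark's implementation).

def validate_input(schema, required):
    """Raise before running the pipeline if a required column is missing
    or has the wrong type. schema and required map column name -> type name."""
    for col, expected in required.items():
        actual = schema.get(col)
        if actual != expected:
            raise TypeError(f"column {col!r}: expected {expected}, got {actual}")

schema = {"text": "string", "label": "double"}
validate_input(schema, {"text": "string"})  # a Tokenizer's input: passes
```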
Unique Pipeline stages: A Pipeline's stages must be unique instances. For example, the same HashingTF instance cannot be inserted into a Pipeline twice, since every Pipeline stage must have a unique ID. However, two different instances hashingTF1 and hashingTF2 (both of type HashingTF) can be put into the same Pipeline, since they have different IDs.
Parameters
MLlib Estimators and Transformers use a uniform API for specifying parameters. A Param is a named parameter with self-contained documentation. A ParamMap is a set of (parameter, value) pairs.
There are two main ways to pass parameters to an algorithm:
1. Set parameters on an instance. For example, if lr is a LogisticRegression instance, calling lr.setMaxIter(10) makes lr use at most 10 iterations when fitting. This API resembles the one in the spark.mllib package.
2. Pass a ParamMap to fit() or transform(). Any parameter in the ParamMap overrides the value previously set on the instance.
Parameters belong to specific instances of Estimators and Transformers. Therefore, if we have two LogisticRegression instances lr1 and lr2, we can build a single ParamMap that specifies the maximum-iterations parameter for both: ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20). This is useful when a Pipeline contains two algorithms that both have a maxIter parameter.
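Why the same parameter name on two instances stays distinct can be illustrated with a toy sketch (not Spark code: `Param` and `ToyLogisticRegression` are made-up classes). Each Param carries its owner's unique ID, so lr1.maxIter and lr2.maxIter are different keys in the same map.

```python
# Toy sketch of instance-scoped Params (NOT Spark code).

class Param:
    """Identifies a parameter by its owner's uid plus a name."""
    def __init__(self, parent_uid, name):
        self.parent_uid, self.name = parent_uid, name
    def __repr__(self):
        return f"{self.parent_uid}.{self.name}"

class ToyLogisticRegression:
    _counter = 0
    def __init__(self):
        type(self)._counter += 1
        self.uid = f"logreg_{type(self)._counter}"  # unique instance ID
        self.maxIter = Param(self.uid, "maxIter")

lr1, lr2 = ToyLogisticRegression(), ToyLogisticRegression()
# Two entries, not one: the Param objects are distinct keys.
param_map = {lr1.maxIter: 10, lr2.maxIter: 20}
```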
Saving and Loading Pipelines
Often it is worthwhile to save a pipeline to disk for later use. In Spark 1.6, model import/export functionality was added to the Pipeline API, and it supports most Transformers. Refer to an algorithm's API documentation to see whether saving and loading are supported.
Code Examples
The following code examples illustrate the functionality discussed above.
Example: Estimator, Transformer, and Param
Scala:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Row

// Prepare training data from a list of (label, features) tuples.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

// Create a LogisticRegression instance. This instance is an Estimator.
val lr = new LogisticRegression()
// Print out the parameters, documentation, and any default values.
println("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

// We may set parameters using setter methods.
lr.setMaxIter(10)
  .setRegParam(0.01)

// Learn a LogisticRegression model. This uses the parameters stored in lr.
val model1 = lr.fit(training)
// Since model1 is a Model (i.e., a Transformer produced by an Estimator),
// we can view the parameters it used during fit().
// This prints the parameter (name: value) pairs, where names are unique IDs for this
// LogisticRegression instance.
println("Model 1 was fit using parameters: " + model1.parent.extractParamMap)

// We may alternatively specify parameters using a ParamMap,
// which supports several methods for specifying parameters.
val paramMap = ParamMap(lr.maxIter -> 20)
  .put(lr.maxIter, 30)  // Specify 1 Param. This overwrites the original maxIter.
  .put(lr.regParam -> 0.1, lr.threshold -> 0.55)  // Specify multiple Params.

// One can also combine ParamMaps.
val paramMap2 = ParamMap(lr.probabilityCol -> "myProbability")  // Change output column name.
val paramMapCombined = paramMap ++ paramMap2

// Now learn a new model using the paramMapCombined parameters.
// paramMapCombined overrides all parameters set earlier via lr.set* methods.
val model2 = lr.fit(training, paramMapCombined)
println("Model 2 was fit using parameters: " + model2.parent.extractParamMap)

// Prepare test data.
val test = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
  (0.0, Vectors.dense(3.0, 2.0, -0.1)),
  (1.0, Vectors.dense(0.0, 2.2, -1.5))
)).toDF("label", "features")

// Make predictions on test data using the Transformer.transform() method.
// LogisticRegression.transform will only use the 'features' column.
// Note that model2.transform() outputs a 'myProbability' column instead of the usual
// 'probability' column since we renamed the lr.probabilityCol parameter previously.
model2.transform(test)
  .select("features", "label", "myProbability", "prediction")
  .collect()
  .foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
    println(s"($features, $label) -> prob=$prob, prediction=$prediction")
  }
Java:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.ml.param.ParamMap;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Prepare training data.
List<Row> dataTraining = Arrays.asList(
  RowFactory.create(1.0, Vectors.dense(0.0, 1.1, 0.1)),
  RowFactory.create(0.0, Vectors.dense(2.0, 1.0, -1.0)),
  RowFactory.create(0.0, Vectors.dense(2.0, 1.3, 1.0)),
  RowFactory.create(1.0, Vectors.dense(0.0, 1.2, -0.5))
);
StructType schema = new StructType(new StructField[]{
  new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
  new StructField("features", new VectorUDT(), false, Metadata.empty())
});
Dataset<Row> training = spark.createDataFrame(dataTraining, schema);

// Create a LogisticRegression instance. This instance is an Estimator.
LogisticRegression lr = new LogisticRegression();
// Print out the parameters, documentation, and any default values.
System.out.println("LogisticRegression parameters:\n" + lr.explainParams() + "\n");

// We may set parameters using setter methods.
lr.setMaxIter(10).setRegParam(0.01);

// Learn a LogisticRegression model. This uses the parameters stored in lr.
LogisticRegressionModel model1 = lr.fit(training);
// Since model1 is a Model (i.e., a Transformer produced by an Estimator),
// we can view the parameters it used during fit().
// This prints the parameter (name: value) pairs, where names are unique IDs for this
// LogisticRegression instance.
System.out.println("Model 1 was fit using parameters: " + model1.parent().extractParamMap());

// We may alternatively specify parameters using a ParamMap.
ParamMap paramMap = new ParamMap()
  .put(lr.maxIter().w(20))  // Specify 1 Param.
  .put(lr.maxIter(), 30)  // This overwrites the original maxIter.
  .put(lr.regParam().w(0.1), lr.threshold().w(0.55));  // Specify multiple Params.

// One can also combine ParamMaps.
ParamMap paramMap2 = new ParamMap()
  .put(lr.probabilityCol().w("myProbability"));  // Change output column name
ParamMap paramMapCombined = paramMap.$plus$plus(paramMap2);

// Now learn a new model using the paramMapCombined parameters.
// paramMapCombined overrides all parameters set earlier via lr.set* methods.
LogisticRegressionModel model2 = lr.fit(training, paramMapCombined);
System.out.println("Model 2 was fit using parameters: " + model2.parent().extractParamMap());

// Prepare test documents.
List<Row> dataTest = Arrays.asList(
  RowFactory.create(1.0, Vectors.dense(-1.0, 1.5, 1.3)),
  RowFactory.create(0.0, Vectors.dense(3.0, 2.0, -0.1)),
  RowFactory.create(1.0, Vectors.dense(0.0, 2.2, -1.5))
);
Dataset<Row> test = spark.createDataFrame(dataTest, schema);

// Make predictions on test documents using the Transformer.transform() method.
// LogisticRegression.transform will only use the 'features' column.
// Note that model2.transform() outputs a 'myProbability' column instead of the usual
// 'probability' column since we renamed the lr.probabilityCol parameter previously.
Dataset<Row> results = model2.transform(test);
Dataset<Row> rows = results.select("features", "label", "myProbability", "prediction");
for (Row r : rows.collectAsList()) {
  System.out.println("(" + r.get(0) + ", " + r.get(1) + ") -> prob=" + r.get(2)
    + ", prediction=" + r.get(3));
}
Python:

from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

# Prepare training data from a list of (label, features) tuples.
training = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])

# Create a LogisticRegression instance. This instance is an Estimator.
lr = LogisticRegression(maxIter=10, regParam=0.01)
# Print out the parameters, documentation, and any default values.
print("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

# Learn a LogisticRegression model. This uses the parameters stored in lr.
model1 = lr.fit(training)
# Since model1 is a Model (i.e., a transformer produced by an Estimator),
# we can view the parameters it used during fit().
# This prints the parameter (name: value) pairs, where names are unique IDs for this
# LogisticRegression instance.
print("Model 1 was fit using parameters: ")
print(model1.extractParamMap())

# We may alternatively specify parameters using a Python dictionary as a paramMap.
paramMap = {lr.maxIter: 20}
paramMap[lr.maxIter] = 30  # Specify 1 Param, overwriting the original maxIter.
paramMap.update({lr.regParam: 0.1, lr.threshold: 0.55})  # Specify multiple Params.

# You can combine paramMaps, which are Python dictionaries.
paramMap2 = {lr.probabilityCol: "myProbability"}  # Change output column name
paramMapCombined = paramMap.copy()
paramMapCombined.update(paramMap2)

# Now learn a new model using the paramMapCombined parameters.
# paramMapCombined overrides all parameters set earlier via lr.set* methods.
model2 = lr.fit(training, paramMapCombined)
print("Model 2 was fit using parameters: ")
print(model2.extractParamMap())

# Prepare test data.
test = spark.createDataFrame([
    (1.0, Vectors.dense([-1.0, 1.5, 1.3])),
    (0.0, Vectors.dense([3.0, 2.0, -0.1])),
    (1.0, Vectors.dense([0.0, 2.2, -1.5]))], ["label", "features"])

# Make predictions on test data using the Transformer.transform() method.
# LogisticRegression.transform will only use the 'features' column.
# Note that model2.transform() outputs a 'myProbability' column instead of the usual
# 'probability' column since we renamed the lr.probabilityCol parameter previously.
prediction = model2.transform(test)
selected = prediction.select("features", "label", "myProbability", "prediction")
for row in selected.collect():
    print(row)
Example: Pipeline
Scala:
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Prepare training documents from a list of (id, text, label) tuples.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

// Now we can optionally save the fitted pipeline to disk
model.write.overwrite().save("/tmp/spark-logistic-regression-model")

// We can also save this unfit pipeline to disk
pipeline.write.overwrite().save("/tmp/unfit-lr-model")

// And load it back in during production
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")

// Prepare test documents, which are unlabeled (id, text) tuples.
val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "mapreduce spark"),
  (7L, "apache hadoop")
)).toDF("id", "text")

// Make predictions on test documents.
model.transform(test)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }
Java:

import java.util.Arrays;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Prepare training documents, which are labeled.
// JavaLabeledDocument (id, text, label) and JavaDocument (id, text) are
// simple JavaBean classes defined in Spark's accompanying example code.
Dataset<Row> training = spark.createDataFrame(Arrays.asList(
  new JavaLabeledDocument(0L, "a b c d e spark", 1.0),
  new JavaLabeledDocument(1L, "b d", 0.0),
  new JavaLabeledDocument(2L, "spark f g h", 1.0),
  new JavaLabeledDocument(3L, "hadoop mapreduce", 0.0)
), JavaLabeledDocument.class);

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
Tokenizer tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words");
HashingTF hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol())
  .setOutputCol("features");
LogisticRegression lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01);
Pipeline pipeline = new Pipeline()
  .setStages(new PipelineStage[] {tokenizer, hashingTF, lr});

// Fit the pipeline to training documents.
PipelineModel model = pipeline.fit(training);

// Prepare test documents, which are unlabeled.
Dataset<Row> test = spark.createDataFrame(Arrays.asList(
  new JavaDocument(4L, "spark i j k"),
  new JavaDocument(5L, "l m n"),
  new JavaDocument(6L, "mapreduce spark"),
  new JavaDocument(7L, "apache hadoop")
), JavaDocument.class);

// Make predictions on test documents.
Dataset<Row> predictions = model.transform(test);
for (Row r : predictions.select("id", "text", "probability", "prediction").collectAsList()) {
  System.out.println("(" + r.get(0) + ", " + r.get(1) + ") --> prob=" + r.get(2)
    + ", prediction=" + r.get(3));
}
Python:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the pipeline to training documents.
model = pipeline.fit(training)

# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "mapreduce spark"),
    (7, "apache hadoop")], ["id", "text"])

# Make predictions on test documents and print columns of interest.
prediction = model.transform(test)
selected = prediction.select("id", "text", "prediction")
for row in selected.collect():
    print(row)
Source: https://blog.csdn.net/liulingyuan6/article/details/53576550