Spark Pipeline Official Documentation

ML Pipelines (Translation)

Official documentation: https://spark.apache.org/docs/latest/ml-pipeline.html

Overview

In this section, we introduce ML Pipelines, which provide a uniform set of high-level APIs built on top of DataFrames that help users create and tune practical machine learning workflows.

Contents:

  • Main concepts in Pipelines:
    • DataFrame
    • Pipeline components
      • Transformers
      • Estimators
      • Properties of pipeline components
    • Pipeline
      • How it works
      • Details
    • Parameters
    • ML persistence: saving and loading Pipelines
      • Backwards compatibility for ML persistence
  • Code examples:
    • Example: Estimator, Transformer, and Param
    • Example: Pipeline
    • Model selection (hyperparameter tuning)

Main concepts in Pipelines

MLlib standardizes APIs for machine learning algorithms, making it easy to combine multiple algorithms into a single pipeline or workflow. This section covers the key concepts introduced by the Pipelines API, where the pipeline concept is mostly inspired by the scikit-learn project.

  • DataFrame: This ML API uses DataFrame from Spark SQL as the ML dataset, which can hold a variety of data types. For example, a DataFrame can have different columns storing text, feature vectors, labels, and predictions.
  • Transformer: A Transformer is an algorithm that can transform one DataFrame into another DataFrame. For example, an ML model is a Transformer that turns a DataFrame with features into a new DataFrame with a prediction column appended.
  • Estimator: An Estimator is an algorithm that can be fit on a DataFrame to produce a Transformer. For example, a learning algorithm is an Estimator that trains on a DataFrame and produces a model.
  • Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
  • Parameter: All Transformers and Estimators share a common API for specifying parameters.

DataFrame

Machine learning can be applied to many different data types, such as vectors, text, images, and structured data. This ML API adopts the DataFrame from Spark SQL in order to support a variety of data types.

DataFrames support many basic and structured types.

A DataFrame can be created either implicitly or explicitly from a regular RDD.

Columns in a DataFrame are named, for example "name", "age", or "income".
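As a minimal sketch (assuming a SparkSession named spark, as in the code examples later in this document), a DataFrame with named columns of mixed types can be created directly from local data:

from pyspark.ml.linalg import Vectors

# A DataFrame whose named columns hold different data types: text, a feature vector, and a label.
df = spark.createDataFrame([
    ("a b c", Vectors.dense([0.0, 1.1]), 1.0),
    ("d e f", Vectors.dense([2.0, 1.0]), 0.0)
], ["text", "features", "label"])
df.printSchema()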

Pipeline components

Transformers

A Transformer is an abstraction that includes feature transformers and learned models. Technically, a Transformer implements a transform() method, which converts one DataFrame into another, generally by appending one or more columns. For example:

  • A feature transformer might take a DataFrame, read a column (e.g., text), map it into a new column (e.g., feature vectors), and output a new DataFrame with the mapped column appended.
  • A learning model might take a DataFrame, read the column containing feature vectors, predict a label for each feature vector, and output a new DataFrame with the predicted labels appended as a column.
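For instance, the Tokenizer feature transformer used later in this document appends a new column when transform() is called. A minimal sketch, reusing the hypothetical df from the DataFrame sketch above:

from pyspark.ml.feature import Tokenizer

# A feature Transformer: reads the 'text' column and appends a new 'words' column.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tokenized = tokenizer.transform(df)  # df: any DataFrame with a 'text' column
tokenized.select("text", "words").show(truncate=False)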

Estimators

An Estimator abstracts the concept of a learning algorithm, or any algorithm that fits or trains on data. Technically, an Estimator implements a fit() method, which accepts a DataFrame and produces a Model, which is itself a Transformer. For example, LogisticRegression is an Estimator: calling its fit() method trains a LogisticRegressionModel, which is a Model and therefore a Transformer.
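A minimal sketch of this fit()/transform() relationship, assuming training is a DataFrame with "features" and "label" columns like the ones built in the code examples below:

from pyspark.ml.classification import LogisticRegression

# An Estimator: fit() consumes a DataFrame and produces a Model, which is itself a Transformer.
lr = LogisticRegression(maxIter=10)
model = lr.fit(training)                 # Estimator -> Model
predictions = model.transform(training)  # the fitted Model transforms DataFrames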

Properties of pipeline components

Transformer.transform() and Estimator.fit() are both stateless. In the future, stateful algorithms may be supported via alternative concepts.

Each instance of a Transformer or Estimator has a unique ID, which is useful when specifying parameters (discussed below); a small illustration follows.
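In PySpark this unique ID is exposed as the uid attribute; the printed values below are illustrative, not literal:

# Every Transformer/Estimator instance carries its own unique ID.
print(tokenizer.uid)  # e.g. Tokenizer_a1b2c3d4e5f6
print(lr.uid)         # e.g. LogisticRegression_0f1e2d3c4b5a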

Pipeline

In machine learning, it is common to run a sequence of algorithms to process and learn from data. For example, a simple text document processing workflow might include several stages:

  • Split each document's text into words.
  • Convert each document's words into a numerical feature vector.
  • Learn a prediction model using the feature vectors and labels.

MLlib represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) to be run in a specific order. We will use this simple workflow as a running example in this section.

How it works

A Pipeline is specified as a sequence of stages, where each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. For Transformer stages, the transform() method is called on the DataFrame; for Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, i.e., the fitted Pipeline), and that Transformer's transform() method is then called on the DataFrame.

The figure below illustrates this for the simple document processing workflow:

In the figure, the top row represents a Pipeline with three stages: Tokenizer and HashingTF are Transformers (blue), and LogisticRegression is an Estimator (red). The bottom row represents data flowing through the Pipeline, where cylinders indicate DataFrames. Pipeline.fit() is called on the original DataFrame, which contains raw text documents and labels. Tokenizer.transform() splits the raw text documents into words, adding a new words column to the DataFrame. HashingTF.transform() converts the words column into feature vectors, again adding a new column to the DataFrame. Now, since LogisticRegression is an Estimator, the Pipeline first calls its fit() method to produce a LogisticRegressionModel. If the Pipeline had more Estimators, it would call the LogisticRegressionModel's transform() method on the DataFrame (the model acting as a Transformer at that point) before passing the DataFrame on to the next stage.

A Pipeline is itself an Estimator. Thus, after a Pipeline's fit() method runs, it produces a PipelineModel, which is a Transformer. This PipelineModel is used at test time; the figure below illustrates this usage:

In the figure above, the PipelineModel has the same number of stages as the original Pipeline, but all Estimators in the original Pipeline have become Transformers. When the PipelineModel's transform() method is called on a test dataset, the data are passed through the fitted Pipeline in order: each stage's transform() method updates the DataFrame and passes it to the next stage.

Pipelines and PipelineModels help to ensure that training and test data go through identical feature processing steps.
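A minimal PySpark sketch of this fit/transform split, using the stage and DataFrame names assumed from the full Pipeline example later in this document:

from pyspark.ml import Pipeline

# Fitting a Pipeline (an Estimator) yields a PipelineModel (a Transformer).
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
pipelineModel = pipeline.fit(training)       # runs each stage's fit()/transform() in order
predictions = pipelineModel.transform(test)  # test data goes through the identical stages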

Details

DAG Pipelines (directed acyclic graph Pipelines): A Pipeline's stages are specified as an ordered array. The examples given here are all for linear Pipelines, i.e., Pipelines in which each stage uses data produced by the previous stage. It is possible to create non-linear Pipelines as long as the data flow graph forms a DAG. This graph is currently specified implicitly, based on the input and output column names of each stage (generally specified as parameters). If the Pipeline forms a DAG, then the stages must be specified in topological order.

Runtime checking: Since Pipelines can operate on DataFrames with varied types, they cannot use compile-time type checking. Pipelines and PipelineModels instead do runtime checking before actually running the Pipeline. This type checking is done using the DataFrame schema, a description of the data types of the columns in the DataFrame.

Unique Pipeline stages: A Pipeline's stages should be unique instances. For example, the same instance myHashingTF should not be inserted into the Pipeline twice, since Pipeline stages must have unique IDs. However, different instances myHashingTF1 and myHashingTF2 (both of type HashingTF) can be put into the same Pipeline, since different instances are created with different IDs. The ID can be thought of as the object's identity, so simply aliasing one instance to another (myHashingTF2 = myHashingTF1) does not work either.
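A small sketch of the uniqueness rule, using HashingTF as elsewhere in this document:

from pyspark.ml.feature import HashingTF

# Two distinct instances get two distinct IDs and may both appear in one Pipeline.
myHashingTF1 = HashingTF(inputCol="words", outputCol="tf1")
myHashingTF2 = HashingTF(inputCol="words", outputCol="tf2")
print(myHashingTF1.uid != myHashingTF2.uid)  # True

# Aliasing does not create a new instance, so this would NOT give a second usable stage:
# myHashingTF2 = myHashingTF1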

Parameters

MLlib Estimators and Transformers use a uniform API for specifying parameters.

A Param is a named parameter documented by each Transformer or Estimator. A ParamMap is a set of (parameter, value) pairs.

There are two main ways to pass parameters to an algorithm:

  1. Set parameters on an instance. For example, if lr is an instance of LogisticRegression, calling lr.setMaxIter(10) makes lr.fit() use at most 10 iterations. This API resembles the one used in the spark.mllib package.
  2. Pass a ParamMap to fit() or transform(). Any parameter in the ParamMap overrides the value previously specified via setter methods.

Parameters belong to specific instances of Estimators and Transformers. For example, if we have two LogisticRegression instances lr1 and lr2, we can build a single ParamMap that specifies a maxIter value for each instance, which yields two logistic regression algorithms with different parameters in one Pipeline.
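In PySpark a ParamMap can be expressed as a plain dict keyed by instance-specific Param objects, so a single map can address two different instances. A sketch:

from pyspark.ml.classification import LogisticRegression

# Params belong to specific instances: lr1.maxIter and lr2.maxIter are different keys.
lr1 = LogisticRegression()
lr2 = LogisticRegression()
paramMap = {lr1.maxIter: 10, lr2.maxIter: 20}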

ML persistence: saving and loading Pipelines

Often it is worth saving a model or pipeline to disk for later use. Model import/export functionality was added to the Pipeline API in Spark 1.6, and as of Spark 2.3 the DataFrame-based API in spark.ml and pyspark.ml has complete coverage.

ML persistence works across Scala, Java, and Python. However, R currently uses a modified format, so models saved in R can only be loaded back in R; this should be fixed in the future.
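A minimal save/load sketch in PySpark; the path and the pipelineModel variable are illustrative:

from pyspark.ml import PipelineModel

# Persist a fitted pipeline to disk and load it back later.
pipelineModel.write().overwrite().save("/tmp/fitted-pipeline")
sameModel = PipelineModel.load("/tmp/fitted-pipeline")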

Backwards compatibility for ML persistence

In general, MLlib maintains backwards compatibility for ML persistence: if you save a model or Pipeline in one version of Spark, you should be able to load it back in a newer version. However, there are rare exceptions, described below.

Model persistence: Is a model or Pipeline saved with Spark version X loadable by Spark version Y?

  • Major versions: no guarantees, but best-effort compatibility.
  • Minor and patch versions: yes, these are backwards compatible.
  • Note about the format: there are no guarantees for a stable persistence format, but model loading itself is designed to be backwards compatible.

Model behavior: Does a model or Pipeline behave identically in Spark version X and version Y?

  • Major versions: no guarantees, but best-effort consistency.
  • Minor and patch versions: identical behavior, except for bug fixes.

For both model persistence and model behavior, any breaking change in a minor or patch version is reported in the release notes. If a change is not reported there, it should be treated as a bug to be fixed.

Code examples

This section gives code examples illustrating the functionality discussed above. For more information, refer to the API documentation (Scala/Java/Python).

Example: Estimator, Transformer, and Param

This example covers the main concepts of Estimator, Transformer, and Param.

Scala:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Row

// Prepare training data from a list of (label, features) tuples.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

// Create a LogisticRegression instance. This instance is an Estimator.
val lr = new LogisticRegression()
// Print out the parameters, documentation, and any default values.
println(s"LogisticRegression parameters:\n ${lr.explainParams()}\n")

// We may set parameters using setter methods.
lr.setMaxIter(10)
  .setRegParam(0.01)

// Learn a LogisticRegression model. This uses the parameters stored in lr.
val model1 = lr.fit(training)
// Since model1 is a Model (i.e., a Transformer produced by an Estimator),
// we can view the parameters it used during fit().
// This prints the parameter (name: value) pairs, where names are unique IDs for this
// LogisticRegression instance.
println(s"Model 1 was fit using parameters: ${model1.parent.extractParamMap}")

// We may alternatively specify parameters using a ParamMap,
// which supports several methods for specifying parameters.
val paramMap = ParamMap(lr.maxIter -> 20)
  .put(lr.maxIter, 30)  // Specify 1 Param. This overwrites the original maxIter.
  .put(lr.regParam -> 0.1, lr.threshold -> 0.55)  // Specify multiple Params.

// One can also combine ParamMaps.
val paramMap2 = ParamMap(lr.probabilityCol -> "myProbability")  // Change output column name.
val paramMapCombined = paramMap ++ paramMap2

// Now learn a new model using the paramMapCombined parameters.
// paramMapCombined overrides all parameters set earlier via lr.set* methods.
val model2 = lr.fit(training, paramMapCombined)
println(s"Model 2 was fit using parameters: ${model2.parent.extractParamMap}")

// Prepare test data.
val test = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
  (0.0, Vectors.dense(3.0, 2.0, -0.1)),
  (1.0, Vectors.dense(0.0, 2.2, -1.5))
)).toDF("label", "features")

// Make predictions on test data using the Transformer.transform() method.
// LogisticRegression.transform will only use the 'features' column.
// Note that model2.transform() outputs a 'myProbability' column instead of the usual
// 'probability' column since we renamed the lr.probabilityCol parameter previously.
model2.transform(test)
  .select("features", "label", "myProbability", "prediction")
  .collect()
  .foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
    println(s"($features, $label) -> prob=$prob, prediction=$prediction")
  }

Java:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.ml.param.ParamMap;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Prepare training data.
List<Row> dataTraining = Arrays.asList(
    RowFactory.create(1.0, Vectors.dense(0.0, 1.1, 0.1)),
    RowFactory.create(0.0, Vectors.dense(2.0, 1.0, -1.0)),
    RowFactory.create(0.0, Vectors.dense(2.0, 1.3, 1.0)),
    RowFactory.create(1.0, Vectors.dense(0.0, 1.2, -0.5))
);
StructType schema = new StructType(new StructField[]{
    new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
    new StructField("features", new VectorUDT(), false, Metadata.empty())
});
Dataset<Row> training = spark.createDataFrame(dataTraining, schema);

// Create a LogisticRegression instance. This instance is an Estimator.
LogisticRegression lr = new LogisticRegression();
// Print out the parameters, documentation, and any default values.
System.out.println("LogisticRegression parameters:\n" + lr.explainParams() + "\n");

// We may set parameters using setter methods.
lr.setMaxIter(10).setRegParam(0.01);

// Learn a LogisticRegression model. This uses the parameters stored in lr.
LogisticRegressionModel model1 = lr.fit(training);
// Since model1 is a Model (i.e., a Transformer produced by an Estimator),
// we can view the parameters it used during fit().
// This prints the parameter (name: value) pairs, where names are unique IDs for this
// LogisticRegression instance.
System.out.println("Model 1 was fit using parameters: " + model1.parent().extractParamMap());

// We may alternatively specify parameters using a ParamMap.
ParamMap paramMap = new ParamMap()
  .put(lr.maxIter().w(20))  // Specify 1 Param.
  .put(lr.maxIter(), 30)  // This overwrites the original maxIter.
  .put(lr.regParam().w(0.1), lr.threshold().w(0.55));  // Specify multiple Params.

// One can also combine ParamMaps.
ParamMap paramMap2 = new ParamMap()
  .put(lr.probabilityCol().w("myProbability"));  // Change output column name
ParamMap paramMapCombined = paramMap.$plus$plus(paramMap2);

// Now learn a new model using the paramMapCombined parameters.
// paramMapCombined overrides all parameters set earlier via lr.set* methods.
LogisticRegressionModel model2 = lr.fit(training, paramMapCombined);
System.out.println("Model 2 was fit using parameters: " + model2.parent().extractParamMap());

// Prepare test documents.
List<Row> dataTest = Arrays.asList(
    RowFactory.create(1.0, Vectors.dense(-1.0, 1.5, 1.3)),
    RowFactory.create(0.0, Vectors.dense(3.0, 2.0, -0.1)),
    RowFactory.create(1.0, Vectors.dense(0.0, 2.2, -1.5))
);
Dataset<Row> test = spark.createDataFrame(dataTest, schema);

// Make predictions on test documents using the Transformer.transform() method.
// LogisticRegression.transform will only use the 'features' column.
// Note that model2.transform() outputs a 'myProbability' column instead of the usual
// 'probability' column since we renamed the lr.probabilityCol parameter previously.
Dataset<Row> results = model2.transform(test);
Dataset<Row> rows = results.select("features", "label", "myProbability", "prediction");
for (Row r: rows.collectAsList()) {
  System.out.println("(" + r.get(0) + ", " + r.get(1) + ") -> prob=" + r.get(2)
    + ", prediction=" + r.get(3));
}

Python:

from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

# Prepare training data from a list of (label, features) tuples.
training = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])

# Create a LogisticRegression instance. This instance is an Estimator.
lr = LogisticRegression(maxIter=10, regParam=0.01)
# Print out the parameters, documentation, and any default values.
print("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

# Learn a LogisticRegression model. This uses the parameters stored in lr.
model1 = lr.fit(training)

# Since model1 is a Model (i.e., a transformer produced by an Estimator),
# we can view the parameters it used during fit().
# This prints the parameter (name: value) pairs, where names are unique IDs for this
# LogisticRegression instance.
print("Model 1 was fit using parameters: ")
print(model1.extractParamMap())

# We may alternatively specify parameters using a Python dictionary as a paramMap
paramMap = {lr.maxIter: 20}
paramMap[lr.maxIter] = 30  # Specify 1 Param, overwriting the original maxIter.
paramMap.update({lr.regParam: 0.1, lr.threshold: 0.55})  # Specify multiple Params.

# You can combine paramMaps, which are python dictionaries.
paramMap2 = {lr.probabilityCol: "myProbability"}  # Change output column name
paramMapCombined = paramMap.copy()
paramMapCombined.update(paramMap2)

# Now learn a new model using the paramMapCombined parameters.
# paramMapCombined overrides all parameters set earlier via lr.set* methods.
model2 = lr.fit(training, paramMapCombined)
print("Model 2 was fit using parameters: ")
print(model2.extractParamMap())

# Prepare test data
test = spark.createDataFrame([
    (1.0, Vectors.dense([-1.0, 1.5, 1.3])),
    (0.0, Vectors.dense([3.0, 2.0, -0.1])),
    (1.0, Vectors.dense([0.0, 2.2, -1.5]))], ["label", "features"])

# Make predictions on test data using the Transformer.transform() method.
# LogisticRegression.transform will only use the 'features' column.
# Note that model2.transform() outputs a "myProbability" column instead of the usual
# 'probability' column since we renamed the lr.probabilityCol parameter previously.
prediction = model2.transform(test)
result = prediction.select("features", "label", "myProbability", "prediction") \
    .collect()

for row in result:
    print("features=%s, label=%s -> prob=%s, prediction=%s"
          % (row.features, row.label, row.myProbability, row.prediction))

Example: Pipeline

This example follows the simple text document processing workflow described above.

Scala:

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Prepare training documents from a list of (id, text, label) tuples.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

// Now we can optionally save the fitted pipeline to disk
model.write.overwrite().save("/tmp/spark-logistic-regression-model")

// We can also save this unfit pipeline to disk
pipeline.write.overwrite().save("/tmp/unfit-lr-model")

// And load it back in during production
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")

// Prepare test documents, which are unlabeled (id, text) tuples.
val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "spark hadoop spark"),
  (7L, "apache hadoop")
)).toDF("id", "text")

// Make predictions on test documents.
model.transform(test)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }

Java:

import java.util.Arrays;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Prepare training documents, which are labeled.
// JavaLabeledDocument and JavaDocument are simple Java beans (id, text, and optionally label)
// provided in the Spark example code.
Dataset<Row> training = spark.createDataFrame(Arrays.asList(
  new JavaLabeledDocument(0L, "a b c d e spark", 1.0),
  new JavaLabeledDocument(1L, "b d", 0.0),
  new JavaLabeledDocument(2L, "spark f g h", 1.0),
  new JavaLabeledDocument(3L, "hadoop mapreduce", 0.0)
), JavaLabeledDocument.class);

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
Tokenizer tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words");
HashingTF hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol())
  .setOutputCol("features");
LogisticRegression lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001);
Pipeline pipeline = new Pipeline()
  .setStages(new PipelineStage[] {tokenizer, hashingTF, lr});

// Fit the pipeline to training documents.
PipelineModel model = pipeline.fit(training);

// Prepare test documents, which are unlabeled.
Dataset<Row> test = spark.createDataFrame(Arrays.asList(
  new JavaDocument(4L, "spark i j k"),
  new JavaDocument(5L, "l m n"),
  new JavaDocument(6L, "spark hadoop spark"),
  new JavaDocument(7L, "apache hadoop")
), JavaDocument.class);

// Make predictions on test documents.
Dataset<Row> predictions = model.transform(test);
for (Row r : predictions.select("id", "text", "probability", "prediction").collectAsList()) {
  System.out.println("(" + r.get(0) + ", " + r.get(1) + ") --> prob=" + r.get(2)
    + ", prediction=" + r.get(3));
}

Python:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the pipeline to training documents.
model = pipeline.fit(training)

# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Make predictions on test documents and print columns of interest.
prediction = model.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    rid, text, prob, prediction = row
    print("(%d, %s) --> prob=%s, prediction=%f" % (rid, text, str(prob), prediction))

Model selection (hyperparameter tuning)

A big benefit of using ML Pipelines is hyperparameter tuning. See the MLlib model selection (tuning) guide for more information on automatic model selection.
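As a hedged sketch of what such tuning can look like (reusing the pipeline, hashingTF, lr, and training names from the Pipeline example above), a CrossValidator can search a parameter grid over the whole Pipeline:

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Tune the whole Pipeline at once: the grid spans parameters of individual stages.
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=paramGrid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)
cvModel = cv.fit(training)  # cvModel.bestModel is the best fitted PipelineModel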
