Spark - ML Tuning

Official documentation: https://spark.apache.org/docs/2.2.0/ml-tuning.html

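This section describes how to use MLlib's tooling to tune ML algorithms and Pipelines. Built-in cross-validation and other tools let users optimize the hyperparameters of algorithms and Pipelines.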

Model selection (hyperparameter tuning)
Cross-validation
Train-validation split

Model selection (hyperparameter tuning)

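An important task in machine learning is model selection: using data to find the best model and parameters for a given task. This is also called tuning. Tuning can target a single model or an entire Pipeline with all of its stages, so users can tune the whole Pipeline at once instead of tuning each stage separately.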

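MLlib supports model selection tools such as CrossValidator and TrainValidationSplit. These tools require the following:

Estimator: the algorithm or Pipeline to tune.
A set of ParamMaps: the parameter grid to search over.
Evaluator: the metric used to measure how well a fitted model performs on held-out test data.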

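These tools work as follows:

Split the input data into separate training and test datasets.
For each (training, test) pair, iterate through every parameter combination in the grid:
- For each combination, fit the Estimator with those parameters to get a Model, then evaluate that Model's performance with the Evaluator.
Select the parameter combination used by the best-performing Model.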
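
The steps above can be sketched in plain Python, without Spark. This is only an illustration of the search loop, not MLlib's implementation: `grid_search`, `fit_constant`, and `neg_mse` are hypothetical names, the toy "estimator" just predicts a shrunken mean, and a larger score is assumed to be better.

```python
from itertools import product


def grid_search(train, validation, param_grid, fit, evaluate):
    """Return (best_params, best_score) over every combination in param_grid.

    fit(train, **params) and evaluate(model, validation) are hypothetical
    callables standing in for an Estimator and an Evaluator.
    A larger score is assumed to be better.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        model = fit(train, **params)          # fit with this combination
        score = evaluate(model, validation)   # measure on held-out data
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy "estimator": predict a constant, shrunk toward zero by reg.
def fit_constant(train, reg):
    return sum(train) / (len(train) + reg)


# Toy "evaluator": negative mean squared error, so larger is better.
def neg_mse(model, validation):
    return -sum((y - model) ** 2 for y in validation) / len(validation)


best_params, best_score = grid_search(
    train=[1.0, 2.0, 3.0],
    validation=[2.0, 2.0],
    param_grid={"reg": [0.0, 1.0, 10.0]},
    fit=fit_constant,
    evaluate=neg_mse,
)
print(best_params, best_score)
```

With a single (training, validation) pair this mirrors what TrainValidationSplit does; CrossValidator repeats the inner evaluation once per fold and averages the scores.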

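The Evaluator can be a RegressionEvaluator for regression problems, a BinaryClassificationEvaluator for binary data, or a MulticlassClassificationEvaluator for multiclass problems. The default metric of each evaluator can be changed with setMetricName.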

Cross-validation

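CrossValidator begins by splitting the dataset into a set of folds, which are then used as separate training and test datasets. With k=3, for example, CrossValidator generates 3 (training, test) pairs by rotating the folds: each pair uses 2 folds for training and the remaining fold for testing. To evaluate a particular parameter combination, CrossValidator computes the average metric of the 3 Models fitted on those 3 (training, test) pairs.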
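
To make the fold rotation concrete, here is a minimal plain-Python sketch. `k_fold_pairs` is a hypothetical helper, and it assigns rows to folds by striding so the output is deterministic; Spark assigns rows to folds randomly, so treat this only as an illustration of the (training, test) pairing.

```python
def k_fold_pairs(rows, k):
    """Yield k (train, test) pairs; each fold is the test set exactly once."""
    # Hypothetical deterministic fold assignment: row i goes to fold i % k.
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test


pairs = list(k_fold_pairs(list(range(9)), k=3))
for train, test in pairs:
    print(len(train), len(test))
```

Each row lands in the test set of exactly one pair and in the training set of the other two, which is why the averaged metric uses every row for evaluation once.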

After identifying the best parameter combination, CrossValidator finally re-fits the Estimator on the entire dataset using that combination.

Example: model selection via cross-validation

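Note: cross-validation over a full parameter grid is expensive. In the example below, the grid has 3 values for numFeatures and 2 values for regParam, and CrossValidator uses 2 folds, so (3 x 2) x 2 = 12 different models are trained. Real workloads usually try more parameters, more values per parameter, and more folds. In other words, CrossValidator is costly by nature. Even so, it is a more principled and more automated way to choose parameters than tuning by hand.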
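
The arithmetic in the note can be checked with a few lines of plain Python. The variable names are made up for this sketch; the values are the ones used in the example, and the extra fit is the final re-fit on the full dataset described earlier.

```python
from itertools import product

num_features_values = [10, 100, 1000]   # hashingTF.numFeatures candidates
reg_param_values = [0.1, 0.01]          # lr.regParam candidates
num_folds = 2

# Every (numFeatures, regParam) combination in the grid.
grid = list(product(num_features_values, reg_param_values))

# Each combination is fitted once per fold during the search.
fits_during_search = len(grid) * num_folds

# One more fit: the best combination is re-fitted on all of the data.
total_fits = fits_during_search + 1

print(len(grid), fits_during_search, total_fits)
```

Adding one more value to either parameter, or one more fold, multiplies the search cost, which is why large grids get expensive quickly.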

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Prepare training documents, which are labeled.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
# This will allow us to jointly choose parameters for all Pipeline stages.
# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
# this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)

# Prepare test documents, which are unlabeled.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "mapreduce spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Make predictions on test documents. cvModel uses the best model found (lrModel).
prediction = cvModel.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    print(row)

Train-validation split

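For hyperparameter tuning, Spark also offers TrainValidationSplit. It evaluates each parameter combination only once, as opposed to k times with CrossValidator. It is therefore faster, but its results are less reliable when the training dataset is not sufficiently large.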

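Unlike CrossValidator, TrainValidationSplit creates a single (training, test) pair. It splits the dataset into two parts according to trainRatio: with trainRatio=0.75, for example, 75% of the data is used for training and 25% for validation.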
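
As a rough illustration of what trainRatio controls, the following plain-Python sketch shuffles a list and cuts it at the ratio. `train_validation_split` and its `seed` argument are hypothetical and not part of MLlib. This sketch produces an exact 75/25 cut, whereas Spark's split is randomized, so the real proportions should be expected to be approximate.

```python
import random


def train_validation_split(rows, train_ratio, seed=42):
    """Shuffle a copy of rows and cut it into (train, validation) parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


train_rows, validation_rows = train_validation_split(
    list(range(100)), train_ratio=0.75
)
print(len(train_rows), len(validation_rows))
```

Because there is only one such pair, each parameter combination is scored once, on the validation part alone.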

Like CrossValidator, TrainValidationSplit finally re-fits the Estimator on the entire dataset using the best parameter combination.

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# Prepare training and test data.
data = spark.read.format("libsvm")\
    .load("data/mllib/sample_linear_regression_data.txt")
train, test = data.randomSplit([0.9, 0.1], seed=12345)

lr = LinearRegression(maxIter=10)

# We use a ParamGridBuilder to construct a grid of parameters to search over.
# TrainValidationSplit will try all combinations of values and determine best model using
# the evaluator.
paramGrid = ParamGridBuilder()\
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.fitIntercept, [False, True])\
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
    .build()

# In this case the estimator is simply the linear regression.
# A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
tvs = TrainValidationSplit(estimator=lr,
                           estimatorParamMaps=paramGrid,
                           evaluator=RegressionEvaluator(),
                           # 80% of the data will be used for training, 20% for validation.
                           trainRatio=0.8)

# Run TrainValidationSplit, and choose the best set of parameters.
model = tvs.fit(train)

# Make predictions on test data. model is the model with combination of parameters
# that performed best.
model.transform(test)\
    .select("features", "label", "prediction")\
    .show()
