Using the Spark ML machine learning package

import org.apache.spark.SparkConf
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.{Row, SparkSession}

    val spark = SparkSession.builder().config(new SparkConf().setMaster("local[*]")).getOrCreate()

    // Prepare training documents: (id, text, label) tuples.
    val training = spark.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0),
      (4L, "b spark who", 1.0),
      (5L, "g d a y", 0.0),
      (6L, "spark fly", 1.0),
      (7L, "was mapreduce", 0.0),
      (8L, "e spark program", 1.0),
      (9L, "a e c l", 0.0),
      (10L, "spark compile", 1.0),
      (11L, "hadoop software", 0.0)
    )).toDF("id", "text", "label")

    // Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)


    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))

    // We use ParamGridBuilder to construct a grid of parameters to search over:
    // 3 values for hashingTF.numFeatures, 2 values for lr.regParam (the
    // regularization strength), and 2 values for lr.elasticNetParam.
    // This grid has 3 x 2 x 2 = 12 parameter settings for CrossValidator to choose from.
    val paramGrid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
      .addGrid(lr.regParam, Array(0.1, 0.01))
      .addGrid(lr.elasticNetParam, Array(0.1, 0.0))
      .build()

    // Model selection and tuning rest on three basic components: an Estimator,
    // a ParamGrid, and an Evaluator. The Estimator is an algorithm or a Pipeline;
    // the ParamGrid is a set of ParamMaps defining the parameter search space;
    // the Evaluator is the evaluation metric.


    // We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator
    // instance. This allows us to jointly choose parameters for all Pipeline stages.
    // A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an
    // Evaluator. Note that the evaluator here is a BinaryClassificationEvaluator,
    // whose default metric is areaUnderROC. Evaluators by task:
    //   BinaryClassificationEvaluator      binary classification
    //   RegressionEvaluator                regression
    //   MulticlassClassificationEvaluator  multiclass classification
    val cv = new CrossValidator()
      .setEstimator(pipeline)                           // the pipeline to tune
      .setEvaluator(new BinaryClassificationEvaluator)  // the evaluation metric
      .setEstimatorParamMaps(paramGrid)                 // the parameter search space
      .setNumFolds(2)                                   // number of cross-validation folds

    // Run cross-validation, and choose the best set of parameters.
    val cvModel = cv.fit(training)
    val bestPipelineModel = cvModel.bestModel.asInstanceOf[PipelineModel]
    val bestHashingTF = bestPipelineModel.stages(1).asInstanceOf[HashingTF]
    val bestNumFeatures = bestHashingTF.getNumFeatures
    val bestLrModel = bestPipelineModel.stages(2).asInstanceOf[LogisticRegressionModel]
    val bestRegParam = bestLrModel.getRegParam
    val bestElasticNet = bestLrModel.getElasticNetParam
    println(s"Best HashingTF parameter:\nnumFeatures = $bestNumFeatures\nBest LogisticRegression parameters:\nregParam = $bestRegParam, elasticNetParam = $bestElasticNet")

    // Prepare test documents, which are unlabeled (id, text) tuples.
    val test = spark.createDataFrame(Seq(
      (4L, "spark i j k"),
      (5L, "l m n"),
      (6L, "mapreduce spark"),
      (7L, "apache hadoop")
    )).toDF("id", "text")

    // Make predictions on the test documents; cvModel uses the best model found.
    cvModel.transform(test)
      .select("id", "text", "probability", "prediction")
      .collect()
      .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
        println(s"($id, $text) --> prob=$prob, prediction=$prediction")
      }
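
Beyond the best model, CrossValidatorModel also records the average metric achieved by each parameter combination. A minimal sketch for inspecting them, continuing from the cvModel above (the sorting and printing are illustrative additions, not part of the original example):

    // Pair each ParamMap in the grid with its average areaUnderROC, best first.
    cvModel.getEstimatorParamMaps
      .zip(cvModel.avgMetrics)
      .sortBy { case (_, metric) => -metric }
      .foreach { case (paramMap, metric) =>
        println(s"avg metric = $metric for:\n$paramMap")
      }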

Output

{
  logreg_ced5bcd864cb-elasticNetParam: 0.1,
  hashingTF_d06823caef37-numFeatures: 100,
  logreg_ced5bcd864cb-regParam: 0.1
}

This parameter map for the best model is written to the log when the following
lines are added to the log4j configuration file:

log4j.logger.org.apache.spark.ml.tuning.TrainValidationSplit=INFO
log4j.logger.org.apache.spark.ml.tuning.CrossValidator=INFO
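
For reference, a minimal log4j.properties sketch with these loggers enabled (the console appender settings are an assumption; Spark also ships a conf/log4j.properties.template that can be copied and extended):

    # Minimal log4j.properties sketch; appender setup is illustrative.
    log4j.rootCategory=WARN, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
    # Surface the model-tuning logs shown above
    log4j.logger.org.apache.spark.ml.tuning.TrainValidationSplit=INFO
    log4j.logger.org.apache.spark.ml.tuning.CrossValidator=INFO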

Best HashingTF parameter:
numFeatures = 100
Best LogisticRegression parameters:
regParam = 0.1, elasticNetParam = 0.1

(4, spark i j k) --> prob=[0.18790027301642065,0.8120997269835794], prediction=1.0
(5, l m n) --> prob=[0.8895957990095681,0.11040420099043186], prediction=0.0
(6, mapreduce spark) --> prob=[0.33625307291668444,0.6637469270833155], prediction=1.0
(7, apache hadoop) --> prob=[0.706788417403727,0.293211582596273], prediction=0.0

In each probability vector, the first entry is P(label = 0) and the second is P(label = 1); the prediction is the label with the larger probability.
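
The log configuration above also mentions TrainValidationSplit, a cheaper alternative to CrossValidator: it evaluates each parameter combination once against a single train/validation split rather than numFolds times. A hedged sketch reusing the pipeline, paramGrid, and training values from the example (the 0.75 ratio is an arbitrary choice):

    import org.apache.spark.ml.tuning.TrainValidationSplit

    // Evaluate each ParamMap once on a single 75%/25% train/validation split.
    val tvs = new TrainValidationSplit()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator)
      .setEstimatorParamMaps(paramGrid)
      .setTrainRatio(0.75)

    val tvsModel = tvs.fit(training)  // returns a TrainValidationSplitModel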
