Spark MLlib Study Notes (2): Classification and Regression

MLlib supports several problem types, including binary classification, multiclass classification, and regression:

Problem type              | Supported methods
Binary classification     | linear SVMs, logistic regression, decision trees, random forests, GBDTs, naive Bayes
Multiclass classification | decision trees, random forests, naive Bayes
Regression                | linear least squares, Lasso, ridge regression, decision trees, random forests, GBDTs, isotonic regression

1. Linear Models

  • Classification (SVMs, logistic regression)
  • Linear regression (least squares, Lasso, ridge regression)

(1) Classification
MLlib provides two linear classification methods: logistic regression and linear support vector machines (SVMs). SVMs support only binary classification, while logistic regression supports both binary and multiclass classification. Training data is represented as an RDD[LabeledPoint], where the label is the class index, starting from 0.
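For reference, a LabeledPoint can also be constructed directly; a minimal illustration (the examples below instead load the same structure from LIBSVM files):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// The label 1.0 is the class index; features are a dense vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
// Sparse form: (vector size, nonzero indices, nonzero values).
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))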

  • Linear support vector machines (SVMs)

import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

// Load data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "file:///home/hdfs/data_mllib/sample_libsvm_data.txt")
// Split 60% for training, 40% for testing.
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

// Train an SVM with stochastic gradient descent.
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
// Clear the default threshold so predict returns raw scores (needed for ROC).
model.clearThreshold()

val scoreAndLabels = test.map { point =>
    val score = model.predict(point.features)
    (score, point.label)
}
scoreAndLabels.take(5)  // peek at a few (score, label) pairs

// Evaluate with area under the ROC curve.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " + auROC)

// Save and load the model.
model.save(sc, "myModelPath")
val sameModel = SVMModel.load(sc, "myModelPath")
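
Because clearThreshold() makes predict return raw margins, the threshold can be restored when hard 0/1 labels are needed; a small sketch (0.0 is the SVM default):

// Restore a decision threshold so predict returns 0.0/1.0 labels again.
model.setThreshold(0.0)
val labelPredictions = test.map(p => (model.predict(p.features), p.label))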
  • Logistic regression

LogisticRegressionWithLBFGS supports both binomial and multinomial logistic regression, while the SGD version supports only binomial logistic regression. The L-BFGS version does not support L1 regularization, whereas the SGD version does. When L1 regularization is not required, the L-BFGS version is recommended: by approximating the inverse Hessian matrix with a quasi-Newton method, it converges faster and more accurately.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionModel}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "file:///home/hdfs/data_mllib/sample_libsvm_data.txt")

val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

// Train with L-BFGS; setNumClasses(10) allows multinomial classification with up to 10 classes.
val model = new LogisticRegressionWithLBFGS().setNumClasses(10).run(training)

val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}  

// Evaluate with multiclass metrics.
val metrics = new MulticlassMetrics(predictionAndLabels)
val precision = metrics.precision
println("precision = " + precision)

model.save(sc, "myModelPath")
val sameModel = LogisticRegressionModel.load(sc, "myModelPath")
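
When L1 regularization is required, the SGD variant can be configured through its optimizer; a minimal sketch reusing the training split above (the regParam value is an arbitrary example):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.optimization.L1Updater

val sgdAlg = new LogisticRegressionWithSGD()
sgdAlg.optimizer.
  setNumIterations(100).
  setRegParam(0.1).          // example regularization strength
  setUpdater(new L1Updater)  // switch the updater to L1 regularization
val sgdModel = sgdAlg.run(training)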

(2) Regression

Linear least squares trained with SGD. Each line of lpsa.data holds a label, then a comma, then space-separated feature values.

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("file:///home/hdfs/data_mllib/lpsa.data")
// Parse each line into a LabeledPoint: "label,feature1 feature2 ...".
val parsedData = data.map { line =>
    val parts = line.split(',')
    LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Train a linear least-squares model with SGD.
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

val valuesAndPreds = parsedData.map { point =>
    val prediction = model.predict(point.features)
    (point.label, prediction)
}

// Evaluate with mean squared error on the training set.
val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
println("training Mean Squared Error = " + MSE)

model.save(sc, "myModelPath")
val sameModel = LinearRegressionModel.load(sc, "myModelPath")
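
Lasso (L1) and ridge (L2) regression share the same API; a minimal sketch reusing parsedData (the stepSize and regParam values are arbitrary examples):

import org.apache.spark.mllib.regression.{LassoWithSGD, RidgeRegressionWithSGD}

// train(input, numIterations, stepSize, regParam)
val lassoModel = LassoWithSGD.train(parsedData, 100, 1.0, 0.01)
val ridgeModel = RidgeRegressionWithSGD.train(parsedData, 100, 1.0, 0.01)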

2. Decision Trees

A greedy algorithm that recursively splits the feature space. MLlib supports decision trees for binary classification, multiclass classification, and regression, handling both continuous and categorical features. MLlib provides two impurity measures for classification (Gini impurity and entropy) and one for regression (variance).
Note: ID3 and C4.5 are the classic information-gain algorithms. ID3 uses raw information gain, which can bias splits toward attributes with many values; C4.5, an improved version of ID3, uses the information gain ratio and also discretizes continuous features based on information gain. CART uses Gini impurity as its measure.
Ensemble tree algorithms such as random forests and gradient-boosted trees are also widely used for classification and regression. A toy illustration of the two classification impurity measures follows.
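
In plain Scala (not an MLlib API), for class proportions p_i, Gini = 1 - sum(p_i^2) and entropy = -sum(p_i * log2(p_i)):

def gini(p: Seq[Double]): Double = 1.0 - p.map(x => x * x).sum
def entropy(p: Seq[Double]): Double =
  -p.filter(_ > 0).map(x => x * math.log(x) / math.log(2)).sum

gini(Seq(0.5, 0.5))     // 0.5: maximum Gini impurity for two classes
entropy(Seq(0.5, 0.5))  // 1.0: maximum entropy for two classes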

(1) Classification

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "file:///home/hdfs/data_mllib/sample_libsvm_data.txt")
// Split 70% training / 30% test.
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainData, testData) = (splits(0), splits(1))

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()  // empty: all features are continuous
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

val model = DecisionTree.trainClassifier(trainData, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)

val labelAndPreds = testData.map { point =>
    val prediction = model.predict(point.features)
    (point.label, prediction)
}

val testErr = labelAndPreds.filter(r => r._1 != r._2).count().toDouble / testData.count()
println("Test Error = " + testErr)
println("Learned classification tree model:\n" + model.toDebugString)

model.save(sc, "target/tmp/myDecisionTreeClassificationModel")
val sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeClassificationModel")
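
When some features are categorical, categoricalFeaturesInfo maps each categorical feature's index to its number of categories; a hypothetical example (the indices and arities here are made up for illustration):

// Feature 0 takes 2 distinct values, feature 4 takes 10;
// features not listed are treated as continuous.
val categoricalFeaturesInfo = Map(0 -> 2, 4 -> 10)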

(2) Regression

Only the impurity measure differs (variance); the data loading and train/test split are the same as in the classification example above.

val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "variance"  // regression trees use variance as the impurity measure
val maxDepth = 5
val maxBins = 32

val model = DecisionTree.trainRegressor(trainData, categoricalFeaturesInfo, impurity,
  maxDepth, maxBins)

val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testMSE = labelsAndPredictions.map { case (v, p) => math.pow(v - p, 2) }.mean()
println("Test Mean Squared Error = " + testMSE)

3. Tree Ensembles

Random forests vs. GBTs:
- GBTs train one tree at a time, so training takes longer than for a random forest, which can train many trees in parallel. On the other hand, GBTs typically use shallower trees, which take less time each.
- Random forests are less prone to overfitting: training more trees reduces the chance of overfitting, whereas training too many trees in GBTs increases it. (Random forests reduce variance by combining many trees; GBTs reduce bias by adding trees.)
- Random forests are easier to tune, since performance improves monotonically with the number of trees.

(1) Random forests

Two main tunable parameters:

- numTrees: the number of trees in the forest. Increasing it reduces prediction variance and improves test accuracy.
- maxDepth: the maximum depth of each tree. Increasing it makes the model more expressive, but too large a value leads to overfitting. A random forest can typically tolerate greater depth than a single decision tree.

Two parameters that usually need no tuning but can be adjusted to speed up training:

- subsamplingRate: the fraction of the full dataset used to train each tree. The default of 1.0 is recommended; decreasing it can speed up training.
- featureSubsetStrategy: the number of features to consider as split candidates, given as a fraction or as a function of the total number of features. Decreasing it speeds up training, but too small a value hurts accuracy.

A classification sketch follows.
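
A minimal sketch with the MLlib RandomForest API, reusing the trainData/testData split from the decision tree example (the parameter values are illustrative):

import org.apache.spark.mllib.tree.RandomForest

val numTrees = 10                   // more trees lower prediction variance
val featureSubsetStrategy = "auto"  // let the algorithm choose per task
val rfModel = RandomForest.trainClassifier(trainData, 2, Map[Int, Int](),
  numTrees, featureSubsetStrategy, "gini", 4, 32)

val rfTestErr = testData.map { point =>
  if (rfModel.predict(point.features) == point.label) 0.0 else 1.0
}.mean()
println("RF Test Error = " + rfTestErr)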
(2) GBTs

MLlib supports GBTs for binary classification and regression, handling both continuous and categorical features. Multiclass classification is not currently supported; for multiclass problems, use decision trees or random forests instead.

  • Classification

import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "file:///home/hdfs/data_mllib/sample_libsvm_data.txt")
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainData, testData) = (splits(0), splits(1))

val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 3  // in practice, use many more iterations
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 5
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

val model = GradientBoostedTrees.train(trainData, boostingStrategy)

val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
println("Test Error = " + testErr)
println("Learned classification GBT model:\n" + model.toDebugString)
  • Regression (largely unchanged)

val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 3  // in practice, use many more iterations
boostingStrategy.treeStrategy.maxDepth = 5
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

val model = GradientBoostedTrees.train(trainData, boostingStrategy)

val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testMSE = labelsAndPredictions.map { case (v, p) => math.pow((v - p), 2) }.mean()
println("Test Mean Squared Error = " + testMSE)
println("Learned regression GBT model:\n" + model.toDebugString)