1. Spark Pipeline
1.1 Machine Learning Pipeline
A Pipeline consists of a sequence of stages, each of which is either a Transformer or an Estimator; the stages are chained together and executed in order.
1.2 Pipeline Components
Transformer: an algorithm that transforms one DataFrame into another DataFrame.
- A feature transformer reads a column of the input dataset (for example, text) and produces a new feature column.
- A learning model transforms a DataFrame with feature columns into a DataFrame with prediction columns.
Estimator: a machine learning algorithm that learns from the input data and produces a trained model, which is itself a Transformer (a minimal sketch of this contract follows).
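To make this contract concrete, here is a minimal sketch (assuming an existing SparkSession named spark, as in the examples below; the data and column names are purely illustrative): StringIndexer is an Estimator whose fit() produces a StringIndexerModel, which is in turn a Transformer.

import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}

// Toy DataFrame of (id, color) rows.
val colors = spark.createDataFrame(Seq(
  (0L, "red"), (1L, "blue"), (2L, "red")
)).toDF("id", "color")

// StringIndexer is an Estimator: fit() learns the label-to-index mapping from the data.
val indexer = new StringIndexer()
  .setInputCol("color")
  .setOutputCol("colorIndex")
val indexerModel: StringIndexerModel = indexer.fit(colors)

// The fitted model is a Transformer: transform() appends the new colorIndex column.
indexerModel.transform(colors).show()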
1.3 Building a Pipeline
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Prepare training documents as (id, text, label) rows; assumes an existing SparkSession `spark`.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")
// Configure the ML Pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

// Fitting the Pipeline runs the stages in order and returns a PipelineModel.
val model = pipeline.fit(training)
The fitted PipelineModel can be persisted to disk:
model.write.overwrite().save("/tmp/spark-logistic-regression-model")
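The unfitted Pipeline can be persisted the same way, for example to re-fit it later on new data; the path below is illustrative.
pipeline.write.overwrite().save("/tmp/unfit-lr-model-pipeline")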
1.4 Predicting with the Pipeline
// Load the fitted PipelineModel back from disk.
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
// Prepare test documents, which are unlabeled (id, text) pairs.
val rawdata = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "spark hadoop spark"),
  (7L, "apache hadoop")
)).toDF("id", "text")
// Make predictions on the test documents with the loaded model.
sameModel.transform(rawdata)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }
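Because save/load round-trips every fitted stage, the loaded sameModel produces the same predictions as model, so prediction can run in a completely fresh session.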