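Software versions:
CDH: 5.8.0; CDH-hadoop: 2.6.0; CDH-spark: 1.6.0
Goal:
Use Spark to load a PMML file into a model and run predictions on the Spark platform (the tests here run Spark on YARN).
Specific sub-goals:
1. Following https://github.com/jpmml/jpmml-spark, get a simple example running;
2. Read the input data file directly from HDFS and run predictions with the model built from the PMML file;
(Sub-goals 1 and 2 differ in how the input data is constructed; see the code below.)
Steps:
1. Prepare the raw data, which consists of the PMML file and the test data. They are shown below: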
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
<Header description="linear SVM">
<Application name="Apache Spark MLlib"/>
<Timestamp>2016-11-16T22:17:47</Timestamp>
</Header>
<DataDictionary numberOfFields="4">
<DataField name="field_0" optype="continuous" dataType="double"/>
<DataField name="field_1" optype="continuous" dataType="double"/>
<DataField name="field_2" optype="continuous" dataType="double"/>
<DataField name="target" optype="categorical" dataType="string"/>
</DataDictionary>
<RegressionModel modelName="linear SVM" functionName="classification" normalizationMethod="none">
<MiningSchema>
<MiningField name="field_0" usageType="active"/>
<MiningField name="field_1" usageType="active"/>
<MiningField name="field_2" usageType="active"/>
<MiningField name="target" usageType="target"/>
</MiningSchema>
<RegressionTable intercept="0.0" targetCategory="1">
<NumericPredictor name="field_0" coefficient="-0.36682158807862086"/>
<NumericPredictor name="field_1" coefficient="3.8787681305811765"/>
<NumericPredictor name="field_2" coefficient="-1.6134308474471166"/>
</RegressionTable>
<RegressionTable intercept="0.0" targetCategory="0"/>
</RegressionModel>
</PMML>
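The PMML file above was built from an SVM model. It has three input fields and one target output, which represents the class. The input test data is as follows: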
field_0,field_1,field_2
98,97,96
1,2,7
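This data consists of a header row of column names followed by the data rows. Note that the column names must match the field names in the PMML file.
2. Download the https://github.com/jpmml/jpmml-spark project locally and add the following code: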
package org.jpmml.spark;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.Transformer;
import org.apache.spark.sql.*;
import org.jpmml.evaluator.Evaluator;

public class SVMEvaluationSparkExample {

    public static void main(String... args) throws Exception {
        if (args.length != 3) {
            System.err.println("Usage: java " + SVMEvaluationSparkExample.class.getName() + " <PMML file> <Input file> <Output directory>");
            System.exit(-1);
        }

        // Part 1: read the PMML file from HDFS and build the model
        FileSystem fs = FileSystem.get(new Configuration());
        Evaluator evaluator = EvaluatorUtil.createEvaluator(fs.open(new Path(args[0])));
        TransformerBuilder modelBuilder = new TransformerBuilder(evaluator)
            .withTargetCols()
            .withOutputCols()
            .exploded(true);
        Transformer transformer = modelBuilder.build();

        // Part 2: build a DataFrame from the raw data with DataFrameReader.
        // The raw data must contain a header row of column names.
        SparkConf conf = new SparkConf();
        try (JavaSparkContext sparkContext = new JavaSparkContext(conf)) {
            SQLContext sqlContext = new SQLContext(sparkContext);
            DataFrameReader reader = sqlContext.read()
                .format("com.databricks.spark.csv")
                .option("header", "true")
                .option("inferSchema", "true");
            DataFrame dataFrame = reader.load(args[1]); // the input data must include column names

            // Part 3: predict with the model
            dataFrame = transformer.transform(dataFrame);

            // Part 4: write the results
            DataFrameWriter writer = dataFrame.write()
                .format("com.databricks.spark.csv")
                .option("header", "true");
            writer.save(args[2]);
        }
    }
}
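This code mainly implements sub-goal 1, that is, it follows the example provided by the jpmml-spark project. It has four parts: the first reads the PMML file from HDFS and builds the model; the second uses DataFrameReader to build a DataFrame from the input data; the third uses the model to predict on that DataFrame; the fourth writes the predictions to HDFS. Note that .option("header", "true") is required when building the data, for two reasons: 1) the raw data really does contain column names; 2) without it, the column names are not read, so the columns cannot be matched to the field names in the model (there is another way to handle that case, shown further down).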
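Because the header has to line up with the model's field names, it can help to check this before submitting a job. The following is only a sketch that uses the JDK's XML parser: the class name HeaderCheck and the methods activeFields and headerCovers are hypothetical helpers, not part of jpmml-spark, and the embedded PMML is an abbreviated copy of the MiningSchema shown earlier.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class HeaderCheck {

    // Collect the names of all MiningField elements whose usageType is "active".
    // In PMML, usageType defaults to "active" when the attribute is absent.
    static List<String> activeFields(String pmmlXml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(pmmlXml.getBytes(StandardCharsets.UTF_8)))
            .getDocumentElement();
        NodeList miningFields = root.getElementsByTagName("MiningField");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < miningFields.getLength(); i++) {
            Element field = (Element) miningFields.item(i);
            String usage = field.getAttribute("usageType"); // "" if the attribute is missing
            if (usage.isEmpty() || usage.equals("active")) {
                names.add(field.getAttribute("name"));
            }
        }
        return names;
    }

    // True if every required field name appears in the comma-separated header line.
    static boolean headerCovers(String headerLine, List<String> required) {
        List<String> columns = new ArrayList<>();
        for (String column : headerLine.split(",", -1)) {
            columns.add(column.trim());
        }
        return columns.containsAll(required);
    }

    public static void main(String[] args) throws Exception {
        // Abbreviated, illustrative copy of the MiningSchema from the PMML file above.
        String pmml = "<PMML><RegressionModel><MiningSchema>"
            + "<MiningField name=\"field_0\" usageType=\"active\"/>"
            + "<MiningField name=\"field_1\" usageType=\"active\"/>"
            + "<MiningField name=\"field_2\" usageType=\"active\"/>"
            + "<MiningField name=\"target\" usageType=\"target\"/>"
            + "</MiningSchema></RegressionModel></PMML>";
        List<String> required = activeFields(pmml);
        System.out.println(required);
        System.out.println(headerCovers("field_0,field_1,field_2", required));
        System.out.println(headerCovers("a,b,c", required));
    }
}
```

The check is deliberately one-directional: extra columns in the data are tolerated, but every active field of the model has to be present in the header.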
3. Upload the test data and the PMML file to HDFS and run the test with the following command:
spark-submit --master yarn --class org.jpmml.spark.SVMEvaluationSparkExample /opt/tmp/example-1.0-SNAPSHOT.jar hdfs://quickstart.cloudera:8020/tmp/svm/part-00000 sample_test_data.txt sample_out00
Here, example-1.0-SNAPSHOT.jar is the compiled jar; /tmp/svm/part-00000 is the PMML file of the SVM model; sample_test_data.txt is the test data; sample_out00 is the output directory.
Inspecting the result in the output directory shows that the predictions are correct.
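The expected classes can also be worked out by hand from the coefficients in the PMML file. The sketch below assumes the usual reading of a PMML classification RegressionModel with normalizationMethod="none": each RegressionTable yields a score and the category with the highest score wins. Since the table for category "0" has intercept 0.0 and no predictors, category "1" wins exactly when its linear score is positive. The class name SvmScoreCheck is hypothetical; this is not how jpmml-spark evaluates the model internally.

```java
import java.util.Arrays;

public class SvmScoreCheck {

    // Coefficients copied from the RegressionTable with targetCategory="1".
    static final double[] COEFFICIENTS = {
        -0.36682158807862086, // field_0
        3.8787681305811765,   // field_1
        -1.6134308474471166   // field_2
    };
    static final double INTERCEPT = 0.0;

    // Linear score for category "1": intercept + sum(coefficient * feature).
    static double score(double[] features) {
        double sum = INTERCEPT;
        for (int i = 0; i < COEFFICIENTS.length; i++) {
            sum += COEFFICIENTS[i] * features[i];
        }
        return sum;
    }

    // Category "0" always scores 0.0, so "1" is chosen only for a positive score
    // (assumption: highest score wins, as described in the lead-in).
    static String predict(double[] features) {
        return score(features) > 0.0 ? "1" : "0";
    }

    public static void main(String[] args) {
        double[][] rows = {{98, 97, 96}, {1, 2, 7}}; // the two test rows from above
        for (double[] row : rows) {
            System.out.println(Arrays.toString(row)
                + " -> score " + score(row) + ", class " + predict(row));
        }
    }
}
```

Under that assumption, the row 98,97,96 scores about 185.4 and falls in class 1, while the row 1,2,7 scores about -3.9 and falls in class 0.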
4. How can sub-goal 2 be achieved?
Write the following code:
/*
* Copyright (c) 2015 Villu Ruusmann
*
* This file is part of JPMML-Spark
*
* JPMML-Spark is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* JPMML-Spark is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU Affero General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with JPMML-Spark. If not, see <http://www.gnu.org/licenses/>.
*/
package org.jpmml.spark;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.ml.Transformer;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.dmg.pmml.FieldName;
import org.jpmml.evaluator.Evaluator;

import java.util.ArrayList;
import java.util.List;

public class EvaluationSparkExample {

    public static void main(String... args) throws Exception {
        if (args.length != 3) {
            System.err.println("Usage: java " + EvaluationSparkExample.class.getName() + " <PMML file> <Input file> <Output directory>");
            System.exit(-1);
        }

        // Build the model from the PMML file on HDFS
        FileSystem fs = FileSystem.get(new Configuration());
        Evaluator evaluator = EvaluatorUtil.createEvaluator(fs.open(new Path(args[0])));
        TransformerBuilder modelBuilder = new TransformerBuilder(evaluator)
            .withTargetCols()
            .withOutputCols()
            .exploded(true);
        Transformer transformer = modelBuilder.build();

        // Build the column names (schema) from the model's active fields
        List<StructField> fields = new ArrayList<>();
        for (FieldName fieldName : evaluator.getActiveFields()) {
            fields.add(DataTypes.createStructField(fieldName.getValue(), DataTypes.StringType, true));
        }
        StructType schema = DataTypes.createStructType(fields);

        // Turn the raw data into a DataFrame
        SparkConf conf = new SparkConf();
        final String splitter = ",";
        try (JavaSparkContext sparkContext = new JavaSparkContext(conf)) {
            JavaRDD<Row> data = sparkContext.textFile(args[1]).map(new Function<String, Row>() {
                @Override
                public Row call(String line) throws Exception {
                    // limit -1 keeps trailing empty fields
                    String[] lineArr = line.split(splitter, -1);
                    return RowFactory.create(lineArr);
                }
            });
            SQLContext sqlContext = new SQLContext(sparkContext);
            DataFrame dataFrame = sqlContext.createDataFrame(data, schema);

            // Predict, producing a new DataFrame
            dataFrame = transformer.transform(dataFrame);

            // Write the scored data to HDFS, without a header row
            DataFrameWriter writer = dataFrame.write()
                .format("com.databricks.spark.csv");
            writer.save(args[2]);
        }
    }
}
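This code differs from the previous one only in how the DataFrame is built from the raw test data: here the column names are taken from the PMML model (see http://spark.apache.org/docs/1.6.0/sql-programming-guide.html#interoperating-with-rdds). As a result, the raw test data no longer needs a header row. Because the code also omits the header on output, only the data is written. The command used to run it is: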
spark-submit --master yarn --class org.jpmml.spark.EvaluationSparkExample /opt/tmp/example-1.0-SNAPSHOT.jar hdfs://quickstart.cloudera:8020/tmp/svm/part-00000 sample_test_data1.txt sample_out02
Here, sample_test_data1.txt is the data without a header row.
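One detail of EvaluationSparkExample is easy to overlook: each input line is split with line.split(splitter, -1) rather than the one-argument form. With the default limit, Java's String.split drops trailing empty strings, so a row whose last field is empty would yield fewer values than the schema has columns. The snippet below is a small plain-Java illustration; the class name SplitLimitDemo and the sample line are made up for the demonstration.

```java
import java.util.Arrays;

public class SplitLimitDemo {

    // Same call as in the example: a negative limit keeps trailing empty fields.
    static String[] parse(String line) {
        return line.split(",", -1);
    }

    public static void main(String[] args) {
        String line = "1,2,"; // hypothetical row whose third field is empty

        String[] dropped = line.split(","); // default limit: trailing "" is discarded
        String[] kept = parse(line);        // limit -1: trailing "" is preserved

        System.out.println(dropped.length + " " + Arrays.toString(dropped)); // prints: 2 [1, 2]
        System.out.println(kept.length + " " + Arrays.toString(kept));       // prints: 3 [1, 2, ]
    }
}
```

Keeping the empty field means every Row has one value per schema column, even when the data has gaps.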
When reposting, please credit the blog: http://blog.csdn.net/fansy1990