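Loading PMML in Spark for Prediction

Software versions:

CDH: 5.8.0; CDH Hadoop: 2.6.0; CDH Spark: 1.6.0

Goal:

Load a PMML file into a model with Spark and run predictions on the Spark platform (the tests here use Spark on YARN).

Specific sub-goals:

1. Follow https://github.com/jpmml/jpmml-spark and get a simple example running;

2. Read the input data file directly from HDFS and predict with the model built from the PMML file;

(Sub-goals 1 and 2 differ in how the input data is constructed; see the code below.)

Steps:

1. Prepare the raw data, which consists of the PMML file and the test data, shown below: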

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
    <Header description="linear SVM">
        <Application name="Apache Spark MLlib"/>
        <Timestamp>2016-11-16T22:17:47</Timestamp>
    </Header>
    <DataDictionary numberOfFields="4">
        <DataField name="field_0" optype="continuous" dataType="double"/>
        <DataField name="field_1" optype="continuous" dataType="double"/>
        <DataField name="field_2" optype="continuous" dataType="double"/>
        <DataField name="target" optype="categorical" dataType="string"/>
    </DataDictionary>
    <RegressionModel modelName="linear SVM" functionName="classification" normalizationMethod="none">
        <MiningSchema>
            <MiningField name="field_0" usageType="active"/>
            <MiningField name="field_1" usageType="active"/>
            <MiningField name="field_2" usageType="active"/>
            <MiningField name="target" usageType="target"/>
        </MiningSchema>
        <RegressionTable intercept="0.0" targetCategory="1">
            <NumericPredictor name="field_0" coefficient="-0.36682158807862086"/>
            <NumericPredictor name="field_1" coefficient="3.8787681305811765"/>
            <NumericPredictor name="field_2" coefficient="-1.6134308474471166"/>
        </RegressionTable>
        <RegressionTable intercept="0.0" targetCategory="0"/>
    </RegressionModel>
</PMML>

The PMML file above was built from a linear SVM model. It has three input fields and one target output, which represents the class.
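To see which input columns a PMML file like this expects, the `MiningField` entries can be listed with the JDK's built-in XML parser. The following is a minimal sketch, not part of jpmml-spark: the class `PmmlActiveFields` and its method are hypothetical names, and it assumes that a `MiningField` without a `usageType` attribute counts as active.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; jpmml-spark does not ship this class.
final class PmmlActiveFields {

    // Returns the names of all MiningField elements whose usageType is "active".
    // Assumption: a missing usageType attribute is treated as "active".
    static List<String> activeFields(String pmmlXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(pmmlXml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("MiningField");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element field = (Element) nodes.item(i);
                String usage = field.getAttribute("usageType"); // "" when absent
                if (usage.isEmpty() || usage.equals("active")) {
                    names.add(field.getAttribute("name"));
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse PMML", e);
        }
    }

    public static void main(String[] args) {
        // A trimmed copy of the MiningSchema from the PMML file above.
        String pmml = "<PMML version=\"4.2\" xmlns=\"http://www.dmg.org/PMML-4_2\">"
                + "<RegressionModel functionName=\"classification\"><MiningSchema>"
                + "<MiningField name=\"field_0\" usageType=\"active\"/>"
                + "<MiningField name=\"field_1\" usageType=\"active\"/>"
                + "<MiningField name=\"field_2\" usageType=\"active\"/>"
                + "<MiningField name=\"target\" usageType=\"target\"/>"
                + "</MiningSchema></RegressionModel></PMML>";
        System.out.println(activeFields(pmml));
    }
}
```

For the model above, the three `field_*` names are active and `target` is not, so the input data only needs the three feature columns.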

The input test data is as follows:

field_0,field_1,field_2
98,97,96
1,2,7

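This data consists of a header row of column names followed by the data rows. Note that the column names must match the field names in the PMML file.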
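A quick pre-check of the header row can catch a naming mismatch before a Spark job is submitted. This is only a sketch in plain Java: `HeaderCheck` and `missingColumns` are hypothetical names, and the required field list is typed in by hand rather than read from the PMML file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only.
final class HeaderCheck {

    // Returns the required field names that do not appear in the CSV header line.
    static List<String> missingColumns(String headerLine, List<String> requiredFields) {
        List<String> header = new ArrayList<>();
        for (String column : headerLine.split(",", -1)) {
            header.add(column.trim());
        }
        List<String> missing = new ArrayList<>(requiredFields);
        missing.removeAll(header);
        return missing;
    }

    public static void main(String[] args) {
        // Field names taken from the MiningSchema of the PMML file above.
        List<String> required = Arrays.asList("field_0", "field_1", "field_2");

        // Header of the sample test data: nothing is missing.
        System.out.println(missingColumns("field_0,field_1,field_2", required));

        // A header with a misspelled first column: field_0 is reported as missing.
        System.out.println(missingColumns("f0,field_1,field_2", required));
    }
}
```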

2. Download the https://github.com/jpmml/jpmml-spark project to your local machine and add the following code:

package org.jpmml.spark;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.Transformer;
import org.apache.spark.sql.*;
import org.jpmml.evaluator.Evaluator;

public class SVMEvaluationSparkExample {

	static
	public void main(String... args) throws Exception {

		if(args.length != 3){
			System.err.println("Usage: java " + SVMEvaluationSparkExample.class.getName() + " <PMML file> <Input file> <Output directory>");

			System.exit(-1);
		}
        /**
         * Build the model from the PMML file
         */
        FileSystem fs = FileSystem.get(new Configuration());
        Evaluator evaluator = EvaluatorUtil.createEvaluator(fs.open(new Path(args[0])));

        TransformerBuilder modelBuilder = new TransformerBuilder(evaluator)
                .withTargetCols()
                .withOutputCols()
                .exploded(true);

        Transformer transformer = modelBuilder.build();

        /**
         * Use DataFrameReader to build a DataFrame from the raw data.
         * The raw data must include a header row of column names.
         */
        SparkConf conf = new SparkConf();
        try(JavaSparkContext sparkContext = new JavaSparkContext(conf)){

            SQLContext sqlContext = new SQLContext(sparkContext);

            DataFrameReader reader = sqlContext.read()
                    .format("com.databricks.spark.csv")
                    .option("header", "true")
                    .option("inferSchema", "true");
            DataFrame dataFrame = reader.load(args[1]); // the input data must include a header row

            /**
             * Predict with the model
             */
            dataFrame = transformer.transform(dataFrame);

            /**
             * Write the results
             */
            DataFrameWriter writer = dataFrame.write()
                    .format("com.databricks.spark.csv")
                    .option("header", "true");

            writer.save(args[2]);
        }
	}
}

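This code mainly implements sub-goal 1, that is, it follows the example provided by the jpmml-spark project. It has four parts: the first reads the PMML file from HDFS and builds the model; the second uses DataFrameReader to build a DataFrame from the input data; the third runs the model on that DataFrame to make predictions; the fourth writes the predictions to HDFS.

Note that .option("header", "true") is required when loading the data, for two reasons: 1) the raw data does contain a header row; 2) without it, the column names are not read and the columns cannot be matched to the field names in the model. (There are other ways to handle this case, described further down.)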

3. Upload the test data and the PMML file to HDFS and run the test with the following command:

spark-submit --master yarn --class org.jpmml.spark.SVMEvaluationSparkExample /opt/tmp/example-1.0-SNAPSHOT.jar hdfs://quickstart.cloudera:8020/tmp/svm/part-00000 sample_test_data.txt sample_out00

Here, example-1.0-SNAPSHOT.jar is the compiled jar; /tmp/svm/part-00000 is the PMML file of the SVM model; sample_test_data.txt is the test data; sample_out00 is the output directory.

Check the results:

The output also shows that the predictions are correct.
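The predictions can also be cross-checked by hand from the coefficients in the PMML file. The sketch below assumes the usual reading of a PMML RegressionModel with normalizationMethod="none": each category's score is its intercept plus the weighted sum of the inputs, and the category with the higher score wins. Category "0" has an empty RegressionTable, so its score is 0.0. The class name `LinearSvmByHand` is hypothetical, and this is not how jpmml-spark evaluates the model internally.

```java
// Hypothetical hand calculation for illustration only.
final class LinearSvmByHand {

    // Coefficients copied from the RegressionTable with targetCategory="1".
    static final double[] COEFFICIENTS = {
        -0.36682158807862086, 3.8787681305811765, -1.6134308474471166
    };
    static final double INTERCEPT = 0.0;

    // Score of category "1": intercept + sum of coefficient * input.
    static double score(double... inputs) {
        double sum = INTERCEPT;
        for (int i = 0; i < COEFFICIENTS.length; i++) {
            sum += COEFFICIENTS[i] * inputs[i];
        }
        return sum;
    }

    // Assumption: category "0" scores 0.0, so "1" wins only on a positive score.
    // A tie is resolved to "0" here, which is an arbitrary choice of this sketch.
    static String predict(double... inputs) {
        return score(inputs) > 0.0 ? "1" : "0";
    }

    public static void main(String[] args) {
        // The two rows of the sample test data.
        System.out.println(score(98, 97, 96) + " -> " + predict(98, 97, 96));
        System.out.println(score(1, 2, 7) + " -> " + predict(1, 2, 7));
    }
}
```

Under these assumptions the first row scores roughly 185.4 and falls in category "1", while the second scores roughly -3.9 and falls in category "0".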

4. How can sub-goal 2 be achieved?

Write the following code:

/*
 * Copyright (c) 2015 Villu Ruusmann
 *
 * This file is part of JPMML-Spark
 *
 * JPMML-Spark is free software: you can redistribute it and/or modify
 * it under the terms of the GNU Affero General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * JPMML-Spark is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU Affero General Public License for more details.
 *
 * You should have received a copy of the GNU Affero General Public License
 * along with JPMML-Spark.  If not, see <http://www.gnu.org/licenses/>.
 */
package org.jpmml.spark;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.ml.Transformer;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.dmg.pmml.FieldName;
import org.jpmml.evaluator.Evaluator;

import java.util.ArrayList;
import java.util.List;

//import org.jpmml.evaluator.FieldValue;

public class EvaluationSparkExample {

	static
	public void main(String... args) throws Exception {

		if(args.length != 3){
			System.err.println("Usage: java " + EvaluationSparkExample.class.getName() + " <PMML file> <Input file> <Output directory>");

			System.exit(-1);
		}

        /**
         * Build the model
         */
        FileSystem fs = FileSystem.get(new Configuration());
        Evaluator evaluator = EvaluatorUtil.createEvaluator(fs.open(new Path(args[0])));

        TransformerBuilder modelBuilder = new TransformerBuilder(evaluator)
                .withTargetCols()
                .withOutputCols()
                .exploded(true);
        Transformer transformer = modelBuilder.build();

        /**
         * Build the column names (schema) from the model's active fields
         */
        List<StructField> fields = new ArrayList<>();
        for (FieldName fieldName: evaluator.getActiveFields()) {
            fields.add(DataTypes.createStructField(fieldName.getValue(), DataTypes.StringType, true));
        }
        StructType schema = DataTypes.createStructType(fields);

        /**
         * Build a DataFrame from the raw data
         */
        SparkConf conf = new SparkConf();
        final String splitter = ",";
        try(JavaSparkContext sparkContext = new JavaSparkContext(conf)){
            JavaRDD<Row> data = sparkContext.textFile(args[1]).map(new Function<String, Row>() {
                @Override
                public Row call(String line) throws Exception {
                    String[] lineArr = line.split(splitter,-1);
                    return  RowFactory.create(lineArr);
                }
            });

            SQLContext sqlContext = new SQLContext(sparkContext);
            DataFrame dataFrame = sqlContext.createDataFrame(data, schema);

            /**
             * Predict, producing a new DataFrame
             */
            dataFrame = transformer.transform(dataFrame);

            /**
             * Write the evaluated data to HDFS, without a header row
             */
            DataFrameWriter writer = dataFrame.write()
                    .format("com.databricks.spark.csv");
            writer.save(args[2]);

        }
	}
}

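The only difference between this code and the previous one is how the DataFrame is built from the raw test data: here the column names come from the PMML model. The code follows http://spark.apache.org/docs/1.6.0/sql-programming-guide.html#interoperating-with-rdds. As a result, the raw test data no longer needs a header row. Because the code also omits the header when writing the output, only data rows are written.

The command used to run it is as follows: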

spark-submit --master yarn --class org.jpmml.spark.EvaluationSparkExample /opt/tmp/example-1.0-SNAPSHOT.jar hdfs://quickstart.cloudera:8020/tmp/svm/part-00000 sample_test_data1.txt sample_out02

Here, sample_test_data1.txt is the data without a header row.
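One detail of the second program is the call `line.split(splitter, -1)`. With the default limit, Java's `String.split` drops trailing empty strings, so a row whose last field is empty would yield fewer values than the schema has columns. The sketch below shows the difference in plain Java; `SplitLimitDemo` is a hypothetical name and the sample line is made up for illustration.

```java
import java.util.Arrays;

// Hypothetical demo for illustration only.
final class SplitLimitDemo {

    // Same parsing as in the example above: keep trailing empty fields.
    static String[] parse(String line) {
        return line.split(",", -1);
    }

    public static void main(String[] args) {
        String line = "1,2,"; // made-up row whose third field is empty

        // Default limit: the trailing empty field is dropped (2 values).
        System.out.println(Arrays.toString(line.split(",")));

        // Limit -1: the trailing empty field is kept (3 values).
        System.out.println(Arrays.toString(parse(line)));
    }
}
```

Keeping the empty field means every row has as many values as the schema built from the model's active fields has columns.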

Share, grow, enjoy

When reposting, please credit the blog: http://blog.csdn.net/fansy1990
