A PMML Based Framework for Developing R-Java Analytic Applications

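Preface

This is an article about "A PMML Based Framework for Developing R-Java Analytic Applications".

I wrote it in English from the start, and the English is fairly easy to follow, but I may write a Chinese version later if I have time.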

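Summary

Building R-Java analytic applications brings challenges such as data cleaning, integration, and model representation. This article summarizes four methods of integrating R and Java (Rserve, DB, Plumber, PMML) and their related scenarios. It proposes a framework based on the Predictive Modeling Markup Language (PMML) standard that allows independent R model producers and Java consumers, and it demonstrates the lifecycle of developing and deploying an R-Java analytic application based on this framework.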


JianTang

Abstract

The process of building R-Java analytic applications presents challenges such as data wrangling, integration, and model presentation. In this paper, four R and Java integration methods and their related scenarios are summarized. A framework based on the Predictive Modeling Markup Language (PMML) standard that allows independent R model producers and Java consumers is presented. A demo showing the lifecycle of developing and deploying an R-Java analytic application based on this framework is also presented.

Introduction

R is very powerful for data analysis and visualization, but in software engineering, languages such as C++ and Java are more popular. By an R-Java analytic application we mean an application written in R and Java that can analyze historical data, extract its underlying meaning, and make predictions for newly arriving data. Examples include a website or a smartphone app.

R data analysis produces outputs in a variety of formats, such as plots and analytic models. By integration we mean plugging these analytic results into a software project written in another language; this integration can be implemented in various ways. The goal here is to find ways to implement the integration between R and Java. More ambitiously, we want these methods to be fast, stable, and easy to understand and implement in the relevant scenarios.

In this paper, several integration methods and their applicable scenarios are listed. The explanation, usage, merits, and demerits of each method are summarized.

The Streaming Machine Learning Architecture proposed by Oracle (Arango and Ardel, 2016) builds models with Oracle R Enterprise (ORE) and presents them with Oracle Stream Explorer (OSX). The Predictive Modeling Markup Language (PMML) standard (Guazzelli et al., 2009) is used to transfer information from ORE to OSX. However, heavy reliance on Oracle products makes this architecture too heavyweight to implement. A PMML based framework for building analytic applications is presented here instead. It is much lighter, allows independent R model producers and consumers, and supports many machine learning algorithms well.

A demo showing the complete lifecycle of developing and deploying an R-Java analytic application based on this PMML framework is presented. The source code is available on GitHub, and a publicly accessible website provided by this demo is also shown.

Optional Integration methods

By integration methods we mean tools that enable applications written in Java to use R outputs. Since R is particularly good at data analysis and visualization, R outputs such as ggplot2 graphs and analytic models should be considered. Use cases include plugging graphs into Java applications or using R analytic models for prediction. Stability and runtime performance can vary between methods.

1. Coding R in Java (and vice versa). This means writing R code inside a Java project, or Java code inside an R project. We can implement this with Rserve (Urbanek, 2003) or JRI (Urbanek, 2011). Rserve and JRI are widely used in R-Java integration projects. Programmers need to know both R and Java to get started.

  • Rserve. Rserve is a TCP/IP server. It works in client/server mode, and the client side supports various programming languages such as Java, C++ and PHP. The client gets R responses back by sending raw R code or R files to the server, without needing any R installation on the client side. Authentication and session control are also supported.

Rserve is easy to start with, since no R setup is needed on the client side. But using Rserve means building a client/server system, and because it is based on TCP/IP it runs slower than JRI, as more time is spent on network communication.

  • JRI. JRI is a Java/R interface. It supports single-threaded execution of raw R code inside Java applications. JRI is the inverse of rJava (Urbanek, 2013), which lets us run Java code from an R project. The basic idea behind JRI is to load the required R libraries into the Java environment. Calls to R functions and running a REPL (read-eval-print loop) are supported.

JRI requires a more complex environment setup, which varies between operating systems. But it runs inside the Java process, which means better runtime performance.

2. Using DB or static files. By DB we mean a database such as MySQL or Oracle. This approach is easy to start with: we simply save the R results to a database or to local files, and then read them from applications written in other languages. The R package RMySQL (James and DebRoy, 2012) provides an R interface to the MySQL database, and a variety of file formats, for example .txt, .csv and .png, are well supported by R.

Using a database or files as a bridge is intuitive and easy to start with, and R outputs such as plots and numbers can be transferred without ambiguity. However, R outputs such as analytic models are in a format that other languages cannot understand or use directly without a dedicated interpreter. In addition, heavy I/O challenges runtime performance and stability with large data or in high-concurrency scenarios.
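As a rough illustration of the file bridge, the Java side might read a CSV file that R wrote with write.csv(..., row.names = FALSE). The sketch below is only an assumption about how such a reader could look: the class name RCsvReader, the column names and the sample row are made up for illustration, and the parser deliberately handles only simple files in which no field contains a comma or a line break.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: reads a simple CSV file produced by R's write.csv.
class RCsvReader {

    // write.csv wraps strings in double quotes; strip one surrounding pair.
    static String unquote(String field) {
        String s = field.trim();
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            return s.substring(1, s.length() - 1);
        }
        return s;
    }

    // First line is the header, every later non-empty line is one row.
    // Assumption: no field contains a comma or a line break.
    static List<Map<String, String>> parse(List<String> lines) {
        List<Map<String, String>> rows = new ArrayList<>();
        if (lines.isEmpty()) {
            return rows;
        }
        String[] header = lines.get(0).split(",", -1);
        for (int i = 1; i < lines.size(); i++) {
            if (lines.get(i).isEmpty()) {
                continue;
            }
            String[] cells = lines.get(i).split(",", -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int j = 0; j < header.length && j < cells.length; j++) {
                row.put(unquote(header[j]), unquote(cells[j]));
            }
            rows.add(row);
        }
        return rows;
    }

    // Convenience wrapper for a file on disk.
    static List<Map<String, String>> read(Path file) {
        try {
            return parse(Files.readAllLines(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample content standing in for a file written by R.
        List<String> sample = List.of(
                "\"Species\",\"meanSepalLength\"",
                "\"setosa\",5.006");
        System.out.println(parse(sample));
    }
}
```

Plots and numbers travel well this way, but a fitted model does not: the reader above only ever sees strings, which is the limitation that motivates the PMML approach.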

3. HTTP API in R. This means exposing R code as services in the form of an HTTP (or REST) API. The Hypertext Transfer Protocol (HTTP) API has become the predominant standard for letting systems written in different languages talk to each other. We can create HTTP APIs in R with the package Plumber (Cran.R-Project.Org, plumber, 2020), so that our R code, dressed up as services, can communicate with other systems as long as they can reach it over the same subnet or the public Internet.

An HTTP API offers rich input and output formats such as JSON strings, HTML, PNG, or JPEG. The R package Plumber lets users build a web API merely by adding special annotations above a normal R function; no other changes to the function body or definition are needed. These R functions are then exposed as web APIs automatically. The APIs use port 8000 by default and keep running in the R session until an interrupt such as Esc is triggered.

No editing of the function definition or body is needed, but the Plumber API does require specific annotation formats. Annotations can be prefixed with either #* or #':

  • #' @param a. This specifies a as the name of an input parameter of the function.
  • #' @get /echo. This exposes the function as a GET endpoint at the path /echo. Other request types such as @post, @put, @delete and @head are also supported.
  • #' @png. This specifies that the output of the function is a PNG image. Other formats such as @html and @jpeg are also supported.

The Plumber API is easy to start with: all we need to do is decorate a normal R function with a few special annotations, and Plumber does the rest, such as port setting and routing. These defaults can of course be changed manually. Our R function is then available online before we know it. As with Rserve, more time is spent on network communication if the system calls many web APIs. The difference between Plumber and Rserve is that with Rserve we send raw R code to a server and get a response, whereas with Plumber we send API parameters, which means users do not need to know R.
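From the Java side, calling such an endpoint needs nothing beyond an HTTP client. The following is a minimal sketch using the JDK's built-in java.net.http client (Java 11 or later). It assumes a Plumber service listening on localhost:8000 with a GET endpoint /echo that takes a parameter named msg; the host, the parameter name and the class name PlumberClient are assumptions for illustration only.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical client for an R function exposed through Plumber.
class PlumberClient {

    // Builds base + path + "?k=v&..." with URL-encoded keys and values.
    static URI buildUri(String baseUrl, String path, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        String q = query.toString();
        return URI.create(baseUrl + path + (q.isEmpty() ? "" : "?" + q));
    }

    // Sends a GET request and returns the response body as text.
    static String get(URI uri) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Assumed endpoint and parameter name; adjust to the actual R function.
        URI uri = buildUri("http://localhost:8000", "/echo", Map.of("msg", "hello"));
        System.out.println("GET " + uri);
        // Only contact the service when asked to, since it must be running first.
        if (args.length > 0 && args[0].equals("--send")) {
            System.out.println(get(uri));
        }
    }
}
```

Note that the Java code only passes parameters; no R code crosses the wire, which is the practical difference from Rserve described above.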

4. PMML. When we need to build predictive and descriptive models, such as GLM, Hclust, SVM, randomForest or nnet, on the producer side using R or Python, and consume these models in a system written in another language such as Java or C++, we can use PMML as the bridge.

The goal of PMML is to establish a standard for wrapping models so that two different sides, one called the PMML producer and the other the consumer, can work together (Ferrucci et al., 2006). The R package pmml lets users export a variety of analytic models and the necessary datasets from R to .pmml files on the producer side. These files use the XML schema defined by PMML to describe the model structure. The consumer side, commonly written in another language and running in an environment without any R installation, can then import these .pmml files and perform scoring with the help of a dedicated evaluator. JPMML (Java PMML API, 2020) is an open source Java library that provides interfaces for this PMML interpretation.

The PMML producer and consumer can be linked by static .pmml files, so the consumer side requires no network communication and no running of R or Python code. Because scoring is an in-process, pure Java operation, fast and stable performance is possible with large data or in high-concurrency scenarios.

Outputs such as plots are not supported by PMML. But if the integration is about producing and consuming predictive and descriptive models, and those models are supported by PMML, then PMML is a safer and more efficient approach.
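Because a .pmml file is ordinary XML, its contents can be inspected with standard tooling even before an evaluator is involved. The sketch below uses only the JDK's DOM parser to list the fields declared in a PMML DataDictionary together with their data types. The embedded document is a hand-written fragment for illustration, not real output of the R pmml package, and the class name PmmlFieldLister is hypothetical. This is not a replacement for an evaluator such as JPMML; it only shows that the model description is readable without R.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: lists the DataField entries of a PMML document.
class PmmlFieldLister {

    // Hand-written fragment; a real exported file also contains the model itself.
    static final String SAMPLE =
            "<PMML version=\"4.4\" xmlns=\"http://www.dmg.org/PMML-4_4\">"
            + "<DataDictionary numberOfFields=\"2\">"
            + "<DataField name=\"Sepal.Length\" optype=\"continuous\" dataType=\"double\"/>"
            + "<DataField name=\"Species\" optype=\"categorical\" dataType=\"string\"/>"
            + "</DataDictionary>"
            + "</PMML>";

    // Returns field name -> declared dataType, in document order.
    static Map<String, String> dataFields(String pmmlXml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(pmmlXml)));
            NodeList nodes = doc.getElementsByTagName("DataField");
            Map<String, String> fields = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element field = (Element) nodes.item(i);
                fields.put(field.getAttribute("name"), field.getAttribute("dataType"));
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parsable PMML document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(dataFields(SAMPLE));
    }
}
```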

A PMML based framework

The Streaming Machine Learning Architecture proposed by Oracle supports separate, independent components: model development and building based on Oracle R Enterprise (ORE), and model presentation based on Oracle Stream Explorer (OSX). PMML is used in this architecture to transfer models from ORE to OSX, and a special PMML evaluator in OSX is needed for preprocessing, model import and scoring. Since heavy reliance on Oracle products makes this architecture too heavyweight to implement, a PMML based framework for building analytic applications, built on the same idea, is proposed here. It is much lighter and allows independent R model producers and consumers.

The PMML framework involves two main stages:

Model Training.

  • Uses structured or semi-structured data as input; data cleaning and preparation are required.
  • May require processing of large datasets.
  • Should support a variety of machine learning algorithms.
  • Runs at low frequency.

Model Scoring.

  • Contains all scoring information and provides interfaces that handle incoming streaming data directly.
  • Supports automatic and manual model refresh.
  • Runs at high frequency.

The PMML framework supports the following step-by-step workflow:

Model generation. Structured or semi-structured data is used as input to build analytic models. Training can be triggered automatically or manually. An R environment and the related R libraries are needed, and data wrangling and preparation are required. The machine learning algorithms supported by PMML can be found in the latest documentation of the R pmml package; an incomplete list (Cran.R-Project.Org, pmml, 2020) includes:

  • ada, rules, coxph, cv.glmnet, glm, hclust, kmeans, ksvm, lm, multinom, naiveBayes, neighbr, nnet, randomForest.

Model export by PMML standard. The pmml package in R exports models to .pmml files in XML format. All necessary scoring information and datasets are included.

Model storage. Model files can be stored as static files or as file streams in user-defined memory. A database is not necessary, since model building runs at low frequency. Triggers for model refresh are supported.
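One way to picture the storage and refresh step on the consumer side is a small holder that loads the model once and swaps it atomically when a refresh is triggered. The sketch below is an assumption about how this could be done, not the demo project's actual code: RefreshableModel is a hypothetical class, and the loader is any function that produces a model object, for example by reading a .pmml file.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical holder: scoring threads call get(), a trigger calls refresh().
class RefreshableModel<M> {

    private final Supplier<M> loader;
    private final AtomicReference<M> current;

    // Loads the model once up front so scoring never waits on file I/O.
    RefreshableModel(Supplier<M> loader) {
        this.loader = loader;
        this.current = new AtomicReference<>(loader.get());
    }

    // Cheap read used on the high-frequency scoring path.
    M get() {
        return current.get();
    }

    // Low-frequency path: reload, then swap in the new model atomically.
    void refresh() {
        current.set(loader.get());
    }

    public static void main(String[] args) {
        // Stand-in loader; a real one would parse the exported .pmml file.
        AtomicInteger version = new AtomicInteger();
        RefreshableModel<String> model =
                new RefreshableModel<>(() -> "model-v" + version.incrementAndGet());
        System.out.println(model.get());
        model.refresh();
        System.out.println(model.get());
    }
}
```

Readers of get() always see either the old or the new model, never a half-loaded one, which fits the split between low-frequency training and high-frequency scoring.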

Model import and scoring function creation. The PMML evaluator JPMML is needed to import PMML files or file streams and to design the scoring interfaces. A suitable Maven dependency is org.jpmml pmml-evaluator, version 1.4.6.

JPMML provides a common model evaluator Java class, but the consumer interfaces can still vary because the models can be trained with a variety of algorithms and datasets. This means the scoring interfaces can be generated semi-automatically: all we need to do is specify the parameter names and any other required formats.

Exposing and consuming. In this step, exposed scoring functions and visual interfaces such as a web portal are generated. Java functions can be exposed as web APIs using Spring Boot (Walls, 2016) to support integration with web programming languages such as JavaScript, or as Dubbo services (Apache Dubbo, 2020) that act as plugins in distributed systems. A visual front end such as a website can then be built by calling these services.
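To make the exposing step concrete without pulling in a framework, the sketch below swaps Spring Boot for the JDK's built-in com.sun.net.httpserver server. It is a stand-in meant only to show the shape of the idea: query parameters are parsed into a map and handed to a scoring function. The class name ScoringEndpoint, the /score path, port 8080 and the placeholder scorer are all assumptions; a real deployment would delegate to the PMML evaluator and add error handling.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical endpoint: exposes a scoring function over plain HTTP.
class ScoringEndpoint {

    // Turns a raw query such as "Age=38&Income=81838" into a name -> value map.
    static Map<String, String> parseQuery(String rawQuery) {
        Map<String, String> params = new LinkedHashMap<>();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return params;
        }
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(URLDecoder.decode(key, StandardCharsets.UTF_8),
                    URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return params;
    }

    // Binds the scorer to GET /score and starts serving on the given port.
    static HttpServer start(int port, Function<Map<String, String>, String> scorer)
            throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/score", exchange -> {
            Map<String, String> params =
                    parseQuery(exchange.getRequestURI().getRawQuery());
            byte[] body = scorer.apply(params).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder scorer: a real one would call the PMML evaluator.
        start(8080, params -> "received " + params.keySet());
        System.out.println("Try: http://localhost:8080/score?Age=38&Income=81838");
    }
}
```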

Building Applications Using the PMML framework

The PMML framework provides an end-to-end workflow and an implementation for each component. In this section, an example of developing and deploying an analytic application based on this framework is presented.

This demo application has the following components:

Model producer in R. Six machine learning models (rpart, glm, hclust, SVM, nnet, randomForest) are built using the datasets “iris” or “audit”. The models are exported as PMML files. No database is used.

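library(rpart)  # decision trees
library(pmml)   # pmml() converter; assumed here to also provide the audit dataset
library(XML)    # saveXML()

# Decision tree on iris, exported as a PMML file
fit <- rpart(Species ~ ., data = iris)
saveXML(pmml(fit), ".//R//irisRpart.pmml")

# Logistic regression on audit, exported as a PMML file
mod <- glm(TARGET_Adjusted ~ Age + Employment + Education + Income, data = audit, family = binomial(logit))
saveXML(pmml(mod), ".//R//auditGLM.pmml")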

Scoring in Java. An optional maven dependency used for JPMML is:

<dependency>
    <groupId>org.jpmml</groupId>
    <artifactId>pmml-evaluator</artifactId>
    <version>1.4.6</version>
</dependency>

Semi-automatic scoring interfaces are generated with JPMML. The model evaluator class loads the model files produced by the model producer into the JVM, so no file reading is needed during high-frequency scoring unless a model refresh is triggered.

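// Parse the exported file once (PMML is org.dmg.pmml.PMML, PMMLUtil is org.jpmml.model.PMMLUtil)
PMML pmml = PMMLUtil.unmarshal(new FileInputStream("irisRpart.pmml"));
ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
Evaluator evaluator = modelEvaluatorFactory.newModelEvaluator(pmml);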

The scoring interfaces accept an input data stream and return a stream of predictions using the evaluator defined above. The input parameter formats can be determined by checking the model's predictor variable information.
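As an illustration of how predictor information can drive the input format, the sketch below converts raw string parameters into typed values according to each predictor's declared data type, and rejects requests that omit a predictor. It is a simplified stand-in written against plain Java collections: InputPreparer is a hypothetical class, only the PMML data types double, integer and string are handled, and the evaluator library may already offer equivalent preparation that should be preferred in practice.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: coerces raw request values to the declared predictor types.
class InputPreparer {

    // declaredTypes: predictor name -> PMML dataType ("double", "integer", "string").
    // raw: predictor name -> value as received, for example from a web form.
    static Map<String, Object> prepare(Map<String, String> declaredTypes,
                                       Map<String, String> raw) {
        Map<String, Object> typed = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : declaredTypes.entrySet()) {
            String name = field.getKey();
            String value = raw.get(name);
            if (value == null) {
                throw new IllegalArgumentException("Missing predictor: " + name);
            }
            switch (field.getValue()) {
                case "double":
                    typed.put(name, Double.parseDouble(value));
                    break;
                case "integer":
                    typed.put(name, Integer.parseInt(value));
                    break;
                default:
                    // Anything else is passed through as text in this sketch.
                    typed.put(name, value);
            }
        }
        return typed;
    }

    public static void main(String[] args) {
        // Predictor names follow the audit glm example; the types are assumed.
        Map<String, String> types = new LinkedHashMap<>();
        types.put("Age", "integer");
        types.put("Income", "double");
        types.put("Employment", "string");

        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("Age", "38");
        raw.put("Income", "81838.0");
        raw.put("Employment", "Private");

        System.out.println(prepare(types, raw));
    }
}
```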

Exposing. The scoring functions are exposed externally as web APIs and Dubbo services. The web APIs are created with Spring Boot and consumed by the web portal. The Dubbo services are created with Dubbo and work as plugins in the distributed system to provide high-performance function calls. For the full code, see the GitHub repository at https://github.com/JianTang2000/RPortal.

Web. A website is built for introducing the models and for operations such as model refresh and prediction. It is built with Vue (Macrae, 2018) and ElementUI (Element, 2020). The following web screenshots show how to enter predictor variables and get a prediction in the “audit” dataset, glm scenario and in the “iris” dataset, rpart scenario.

Deployment. This application is deployed on the Google Cloud Platform (Krishnan and Ugia Gonzalez, 2015). A quick setup includes a public network, a static IP address, an SSH tunnel, the CentOS 7 operating system, and the Java, Dubbo and Node.js environments. Most of these are well supported by the Google Cloud Platform, so deployment is easy. The site is accessible at http://35.189.127.74:8110. It will keep running and remain available on the Internet as long as the associated Google resources are not released.

Conclusion

In this paper, we have summarized four methods for implementing R-Java integration and their related scenarios. We have presented a lightweight framework based on the Predictive Modeling Markup Language (PMML) standard that allows independent R model producers and consumers. We have also shown the development and deployment of a demo analytic application based on the PMML framework.

References

Arango, Mauricio, and Alex Ardel. "Machine Learning on Streaming Data via Integration of Oracle Advanced Analytics and Oracle Stream Explorer." (2016).

Guazzelli, Alex, et al. "PMML: An open standard for sharing models." The R Journal 1.1 (2009): 60-65.

Urbanek, Simon. "Rserve--a fast way to provide R functionality to applications." Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003), ISSN 1609-395X, eds. Kurt Hornik, Friedrich Leisch & Achim Zeileis, 2003. http://rosuda.org/Rserve.

Urbanek, Simon. "JRI: Java-R Interface." URL http://www.rforge.net/JRI/index.html (2011).

Urbanek, Simon. "rJava: Low-level R to Java interface." (2013).

James, David A., and Saikat DebRoy. "RMySQL: R interface to the MySQL database." R package version 0.9-3 (2012).

Cran.R-Project.Org, 2020, https://cran.r-project.org/web/packages/plumber/plumber.pdf.

Ferrucci, David, Robert L. Grossman, and Anthony Levas. "PMML and UIMA based frameworks for deploying analytic applications and services." Proceedings of the 4th international workshop on Data mining standards, services and platforms. 2006.

"Java PMML API". Github, 2020, https://github.com/jpmml.

Cran.R-Project.Org, 2020, https://cran.r-project.org/web/packages/pmml/pmml.pdf.

Walls, Craig. Spring Boot in action. Manning Publications, 2016.

"Apache/Dubbo". Github, 2020, https://github.com/apache/dubbo.

Macrae, Callum. Vue.js: Up and Running: Building Accessible and Performant Web Apps. O'Reilly Media, Inc., 2018.

"Element - The World's Most Popular Vue UI Framework". Element.Eleme.Io, 2020, https://element.eleme.io/#/en-US.

Krishnan, S. P. T., and Jose L. Ugia Gonzalez. Building your next big thing with google cloud platform: A guide for developers and enterprise architects. Apress, 2015.

 
