http://amplab-extras.github.io/SparkR-pkg/
Bookmarking this article for now; it may come in handy someday, and I hope it does…
The prospects for R, Spark, and Hadoop are bright, but the learning road is long and the technology changes fast. The Hadoop book I just bought is barely a few pages in and already seems to have been superseded, so I had better step up my studies.
R on Spark
SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. SparkR exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster.
Features
RDDs as Distributed Lists
SparkR exposes the RDD API of Spark as distributed lists in R. For example, we can read an input file from HDFS and process every line using lapply on an RDD.
sc <- sparkR.init("local")
lines <- textFile(sc, "hdfs://data.txt")
wordsPerLine <- lapply(lines, function(line) { length(unlist(strsplit(line, " "))) })
In addition to lapply, SparkR also allows closures to be applied on every partition using lapplyPartition. Other supported RDD functions include operations like reduce, reduceByKey, groupByKey, and collect.
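To illustrate these operations, a word count over the same file can be sketched as follows. The flatMap function and the list(key, value) pair convention follow the SparkR-pkg examples; treat the exact signatures, such as the partition count argument to reduceByKey, as assumptions rather than canonical usage.
words <- flatMap(lines, function(line) { strsplit(line, " ")[[1]] })
pairs <- lapply(words, function(word) { list(word, 1L) })   # (word, 1) pairs
counts <- reduceByKey(pairs, function(x, y) { x + y }, 2L)  # sum counts per word across 2 partitions
wordCounts <- collect(counts)                               # bring the results back to the driver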
Serializing closures
SparkR automatically serializes the necessary variables to execute a function on the cluster. For example, if you use some global variables in a function passed to lapply, SparkR will automatically capture these variables and copy them to the cluster. An example of using a random weight vector to initialize a matrix is shown below:
lines <- textFile(sc, "hdfs://data.txt")
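# D (the length of the weight vector) is assumed to be defined earlier in the session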
initialWeights <- runif(n = D, min = -1, max = 1)
createMatrix <- function(line) {
as.numeric(unlist(strsplit(line, " "))) %*% t(initialWeights)
}
# initialWeights is automatically serialized
matrixRDD <- lapply(lines, createMatrix)
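Since matrixRDD is an ordinary RDD, the resulting matrices can be fetched back to the driver with collect, listed among the supported operations above; a usage sketch:
localMatrices <- collect(matrixRDD)   # a list with one matrix per input line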
Using existing R packages
SparkR also allows easy use of existing R packages inside closures. The includePackage command can be used to indicate packages that should be loaded before every closure is executed on the cluster. For example, to use the Matrix package in a closure applied on each partition of an RDD, you could run:
generateSparse <- function(x) {
# Use the sparseMatrix function from the Matrix package
sparseMatrix(i=c(1, 2, 3), j=c(1, 2, 3), x=c(1, 2, 3))
}
includePackage(sc, Matrix)
sparseMat <- lapplyPartition(rdd, generateSparse)
Installing SparkR
SparkR requires Scala 2.10 and Spark version >= 0.9.0, and depends on the R packages rJava and testthat (testthat is only required for running unit tests).
If you wish to try out SparkR, you can use install_github from the devtools package to directly install the package.
library(devtools)
install_github("amplab-extras/SparkR-pkg",subdir="pkg")
If you wish to clone the repository and build from source, you can use the following script to build the package locally.
./install-dev.sh
Running sparkR
If you have installed it directly from GitHub, you can load the SparkR package and then initialize a SparkContext. For example, to run with a local Spark master, you can launch R and then run:
library(SparkR)
sc <- sparkR.init(master="local")
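Once the context is up, a quick smoke test can confirm that jobs run end to end; this sketch assumes the parallelize helper from the SparkR API:
rdd <- parallelize(sc, 1:10)                           # distribute a local vector across the cluster
doubled <- collect(lapply(rdd, function(x) { 2 * x })) # should return 2, 4, ..., 20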
If you have cloned and built SparkR, you can start using it by launching the SparkR shell with:
./sparkR
SparkR also comes with several sample programs in the examples directory. To run one of them, use ./sparkR <filename> <args>. For example:
./sparkR examples/pi.R local[2]
You can also run the unit tests for SparkR by running:
./run-tests.sh