I recently looked at two Spark-based projects on GitHub; this post is a short summary of both. Links to the projects are given in the detailed introductions below.
SpatialSpark
SpatialSpark is a distributed data-processing system built on Spark for spatial data. It provides a rich set of spatial query operations and, to achieve high performance, uses several different spatial index structures. Below is a brief tour of some of its features.
I. Creating the Spark context
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("Test for Spark SpatialRDD").setMaster("local[*]")
val spark = new SparkContext(conf)
II. Reading the test data from a local file and building a SpatialRDD (an RDD type defined by the project)
val locationRDD = datardd.flatMap { line =>
  // datardd is an RDD[String] read from the test data file
  val arry = line.split(",")
  try {
    Some((Point(arry(0).toFloat, arry(1).toFloat), arry(2)))
  } catch {
    case e: Exception =>
      // skip lines that do not match the expected "x,y,attribute" format
      None
  }
}
val indexed = SpatialRDD(locationRDD).cache()
A sample of the data file, in the format (x-coordinate, y-coordinate, attribute value):
24.8597823387,-92.1412089223,1
29.2580133673,-81.478279848,2
28.9985588549,-82.3540992077,3
40.626048454,-86.0555605318,4
36.1591546487,-88.6869851979,5
32.5855413933,-82.3103059884,6
……
III. Spatial query operations
1 Range query: determine which data points fall inside the given regions
def testForSJOIN[V](srdd: SpatialRDD[Point, V], spark: SparkContext) = {
  val numpartition = srdd.partitions.length
  val boxes = Array(
    Box(20.10094f, -86.8612f, 30.41f, -81.222f),
    Box(29.10094f, -83.8612f, 32.41f, -80.222f))
  val queryBoxes = spark.parallelize(boxes, numpartition)
  val joinresultRdd = srdd.sjoin(queryBoxes)((k, id) => id)
  println("sjoin:" + joinresultRdd.count())
  joinresultRdd.foreach(println)
}
The result is as follows:
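The core predicate behind such a spatial join is point-in-box containment. A minimal plain-Scala sketch (hypothetical `Pt`/`Bx` types standing in for SpatialSpark's `Point`/`Box`; the real sjoin answers this through a spatial index rather than a nested loop):

```scala
// Pt and Bx are hypothetical stand-ins, not SpatialSpark's own types.
case class Pt(x: Float, y: Float)
case class Bx(xmin: Float, ymin: Float, xmax: Float, ymax: Float) {
  // axis-aligned containment test
  def contains(p: Pt): Boolean =
    p.x >= xmin && p.x <= xmax && p.y >= ymin && p.y <= ymax
}

// A naive "sjoin": pair each point with the ids of the boxes containing it.
def naiveSJoin[V](points: Seq[(Pt, V)], boxes: Seq[(Bx, Int)]): Seq[(Pt, V, Int)] =
  for ((p, v) <- points; (b, id) <- boxes if b.contains(p)) yield (p, v, id)
```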
2 Range query with merged regions: when there are many query rectangles, they can be merged before the query (some regions may contain one another, in which case they are combined).
def testForRJOIN[V](srdd: SpatialRDD[Point, V], spark: SparkContext) = {
  val numpartition = srdd.partitions.length
  val boxes = Array(
    Box(20.10094f, -81.8612f, 30.41f, -84.222f), Box(29.10094f, -83.8612f, 32.41f, -80.222f),
    Box(20.10094f, -86.8612f, 30.41f, -81.222f), Box(19.10094f, -83.8612f, 32.41f, -83.222f),
    Box(20.10094f, -96.8612f, 30.41f, -81.222f), Box(19.10094f, -83.8612f, 34.41f, -82.222f),
    Box(20.10094f, -86.8612f, 40.41f, -81.222f), Box(10.10094f, -83.8612f, 43.41f, -84.222f))
  val queryBoxes = spark.parallelize(boxes, numpartition)
  println("length:" + numpartition)
  // count the matching points within each region...
  def aggfunction1[K](itr: Iterator[(K, V)]): Int = itr.size
  // ...and sum the partial counts
  def aggfunction2(v1: Int, v2: Int): Int = v1 + v2
  val joinresultRdd = srdd.rjoin(queryBoxes)(aggfunction1, aggfunction2)
  println("rjoin:" + joinresultRdd.count())
  joinresultRdd.foreach(println)
}
The result is as follows:
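The merge step described above — dropping query rectangles that are wholly contained in another — can be sketched in plain Scala (hypothetical `Rect` type; SpatialSpark's own merging logic may differ):

```scala
// Rect is a hypothetical stand-in for SpatialSpark's Box.
case class Rect(xmin: Float, ymin: Float, xmax: Float, ymax: Float) {
  // true when this rectangle fully covers `o`
  def containsRect(o: Rect): Boolean =
    xmin <= o.xmin && ymin <= o.ymin && xmax >= o.xmax && ymax >= o.ymax
}

// Keep only maximal rectangles: drop any box contained in a different box.
def dropContained(boxes: Seq[Rect]): Seq[Rect] =
  boxes.distinct.filter { b =>
    !boxes.exists(o => o != b && o.containsRect(b))
  }
```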
3 kNN query: given a point, find the k data points nearest to it
def testForKNNQuery[V](srdd: SpatialRDD[Point, V]) = {
  val k = 2
  val querypoint = Point(30.40094f, -86.8612f)
  val knnresults = srdd.knnFilter(querypoint, k, (id) => true)
  knnresults.foreach(println)
}
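Conceptually, knnFilter returns the k entries closest to the query point. A brute-force plain-Scala equivalent for small data (SpatialSpark itself answers this through its index instead of scanning everything):

```scala
// Brute-force kNN: sort all candidates by squared Euclidean distance, take k.
def knn[V](points: Seq[((Float, Float), V)],
           q: (Float, Float), k: Int): Seq[((Float, Float), V)] = {
  def dist2(p: (Float, Float)): Double = {
    val dx = p._1 - q._1
    val dy = p._2 - q._2
    dx.toDouble * dx + dy.toDouble * dy // squared distance avoids a sqrt
  }
  points.sortBy(p => dist2(p._1)).take(k)
}
```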
4 Text query: a data point's attribute can also be text, so a spatial region and a text condition can be queried together
def testForTextQuery[V](indexed: SpatialRDD[Point, V]) = {
  val searchbox = Box(20.10094f, -86.8612f, 30.41f, -81.222f)
  def textcondition[V](z: Entry[V]): Boolean = {
    z.value match {
      case v: String =>
        // match points whose attribute mentions a hotel, restaurant, or cinema
        val vl = v.toLowerCase()
        vl.contains("酒店") || vl.contains("饭馆") || vl.contains("电影院")
      case _ => false // non-string attributes never match
    }
  }
  val textsearchresult = indexed.rangeFilter(searchbox, textcondition)
  textsearchresult.foreach(println)
}
SparkDistributedMatrix
The SparkDistributedMatrix project builds a more efficient distributed linear-algebra system for matrices on Spark. It extends the data types in Spark MLlib with richer linear operations: for example, its LocalVector supports inner products, vector addition, and scalar multiplication, while its LocalMatrix supports multiplication of CSC-format sparse matrices, among others. Below is an application example: PageRank.
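To make the CSC (compressed sparse column) format concrete, here is a plain-Scala sketch of a CSC matrix-vector product. The `CscMatrix` type and its field names are hypothetical; the project's LocalMatrix internals may differ, but the storage scheme is the standard one:

```scala
// CSC storage: values and their row indices, laid out column by column;
// colPtr(j)..colPtr(j+1) delimits the non-zeros of column j.
case class CscMatrix(numRows: Int, numCols: Int,
                     colPtr: Array[Int], rowIdx: Array[Int], values: Array[Double]) {
  // y = A * x, visiting only the stored non-zeros
  def multiply(x: Array[Double]): Array[Double] = {
    require(x.length == numCols)
    val y = new Array[Double](numRows)
    var j = 0
    while (j < numCols) {
      var k = colPtr(j)
      while (k < colPtr(j + 1)) {
        y(rowIdx(k)) += values(k) * x(j)
        k += 1
      }
      j += 1
    }
    y
  }
}
```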
I. Read the command-line arguments (args(0): data file; args(1): number of iterations)
val graphName = args(0)
val niter = if (args.length > 1) args(1).toInt else 10
II. Configure Spark and create the Spark context
val conf = new SparkConf()
.setAppName("PageRank algorithm on block matrices")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.shuffle.consolidateFiles", "true")
.set("spark.shuffle.compress", "false")
.set("spark.cores.max", "64")
.set("spark.executor.memory", "6g")
.set("spark.default.parallelism", "256")
.set("spark.akka.frameSize", "64").setMaster("local[*]")
conf.setJars(SparkContext.jarOfClass(this.getClass).toArray)
val sc = new SparkContext(conf)
III. Use BlockPartitionMatrix for blocked computation: first estimate a suitable block size from the data and the Spark configuration, then run the per-block computations in parallel.
val coordinateRdd = genCoordinateRdd(sc, graphName)
val blkSize = BlockPartitionMatrix.estimateBlockSize(coordinateRdd)
var matrix = BlockPartitionMatrix.PageRankMatrixFromCoordinateEntries(coordinateRdd, blkSize, blkSize)
//matrix.partitionByBlockCyclic()
matrix.partitionBy(new ColumnPartitioner(8))
val vecRdd = sc.parallelize(BlockPartitionMatrix.onesMatrixList(matrix.nCols(), 1, blkSize, blkSize), 8)
var x = new BlockPartitionMatrix(vecRdd, blkSize, blkSize, matrix.nCols(), 1)
var v = x
v.partitionBy(new RowPartitioner(8))
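The idea behind a block-partitioned matrix is that the matrix is stored as a grid of small blocks, and each output block is assembled from per-block products (which Spark can then distribute across partitions). A plain-Scala sketch of a blocked matrix-vector product, with hypothetical types rather than the project's API:

```scala
// Block is a small dense block stored row-major; hypothetical stand-in
// for the project's internal block type.
case class Block(rows: Int, cols: Int, data: Array[Double]) {
  def apply(i: Int, j: Int): Double = data(i * cols + j)
}

// y_I = sum over J of A(I,J) * x_J, where I and J index blocks.
def blockMatVec(blocks: Map[(Int, Int), Block], x: Map[Int, Array[Double]],
                nRowBlocks: Int, nColBlocks: Int, blkSize: Int): Map[Int, Array[Double]] = {
  (0 until nRowBlocks).map { bi =>
    val y = new Array[Double](blkSize)
    for (bj <- 0 until nColBlocks) {
      val a = blocks((bi, bj))
      val xv = x(bj)
      for (i <- 0 until a.rows; j <- 0 until a.cols) y(i) += a(i, j) * xv(j)
    }
    bi -> y
  }.toMap
}
```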
IV. Run the main loop
val alpha = 0.85
matrix = (alpha *: matrix).cache()
matrix.stat()
v = (1.0 - alpha) *: v
val t1 = System.currentTimeMillis()
for (i <- 0 until niter) {
  x = matrix %*% x + (v, (blkSize, blkSize), v.partitioner)
}
println("PageRank iterations took " + (System.currentTimeMillis() - t1) + " ms")
The result is as follows:
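For reference, the iteration the driver runs on block matrices, x = alpha * M x + (1 - alpha) v, can be sketched in plain Scala on a tiny dense column-stochastic matrix (a hypothetical helper, not the project's BlockPartitionMatrix API):

```scala
// Power iteration for PageRank on a small dense matrix.
// m must be column-stochastic: m(i)(j) is the share of node j's rank sent to i.
def pageRank(m: Array[Array[Double]], alpha: Double, niter: Int): Array[Double] = {
  val n = m.length
  var x = Array.fill(n)(1.0 / n)           // initial rank vector
  val v = Array.fill(n)((1.0 - alpha) / n) // teleport term, (1 - alpha) * v
  for (_ <- 0 until niter) {
    // x = alpha * M x + (1 - alpha) v, matching the loop in the driver above
    val mx = Array.tabulate(n)(i => (0 until n).map(j => m(i)(j) * x(j)).sum)
    x = Array.tabulate(n)(i => alpha * mx(i) + v(i))
  }
  x
}
```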