I recently looked into two Spark-based projects on GitHub; this post is a short summary of both. The two projects are introduced in detail below.
SpatialSpark
SpatialSpark is a distributed data-processing system built on Spark for handling spatial data. It provides a rich set of spatial query operations and, to achieve high performance, employs several different spatial index structures. Below is a brief introduction to some of its features.
I. Create the Spark context
val conf = new SparkConf().setAppName("Test for Spark SpatialRDD").setMaster("local[*]")
val spark = new SparkContext(conf)
II. Read the test data file locally and build a SpatialRDD (a custom RDD type defined by the project)
// datardd is the raw text RDD; the file path here is illustrative
val datardd = spark.textFile("points.csv")
val locationRDD = datardd.flatMap { line =>
  val arr = line.split(",")
  try {
    // (x, y) coordinates plus an attribute value
    Some((Point(arr(0).toFloat, arr(1).toFloat), arr(2)))
  } catch {
    case e: Exception => None // skip malformed input lines
  }
}
val indexed = SpatialRDD(locationRDD).cache()
Some sample records from the data file, in the format (x-coordinate, y-coordinate, attribute value):
24.8597823387,-92.1412089223,1
29.2580133673,-81.478279848,2
28.9985588549,-82.3540992077,3
40.626048454,-86.0555605318,4
36.1591546487,-88.6869851979,5
32.5855413933,-82.3103059884,6
……
III. Spatial query operations
1. Spatial join: determine which data points fall inside the given query regions
def testForSJOIN[V](srdd: SpatialRDD[Point, V], spark: SparkContext) = {
  val numPartitions = srdd.partitions.length
  val boxes = Array(
    Box(20.10094f, -86.8612f, 30.41f, -81.222f),
    Box(29.10094f, -83.8612f, 32.41f, -80.222f))
  val queryBoxes = spark.parallelize(boxes, numPartitions)
  // join each point with the query boxes that contain it
  val joinResultRdd = srdd.sjoin(queryBoxes)((k, id) => id)
  println("sjoin: " + joinResultRdd.count())
  joinResultRdd.foreach(println)
}
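To make the semantics of the spatial join concrete, here is a minimal plain-Scala sketch of the point-in-box test that such a join performs per record. The Point and Box case classes here are simplified stand-ins for the project's types (assuming the field order xMin, yMin, xMax, yMax); this is not SpatialSpark's actual implementation:

```scala
// Hypothetical, simplified stand-ins for the project's Point and Box types
case class Point(x: Float, y: Float)
case class Box(xMin: Float, yMin: Float, xMax: Float, yMax: Float)

// True when the point lies inside the box (boundary inclusive)
def contains(b: Box, p: Point): Boolean =
  p.x >= b.xMin && p.x <= b.xMax && p.y >= b.yMin && p.y <= b.yMax

// A spatial join pairs each box with every point it contains
def sjoinLocal[V](points: Seq[(Point, V)], boxes: Seq[Box]): Seq[(Box, Point, V)] =
  for {
    b <- boxes
    (p, v) <- points
    if contains(b, p)
  } yield (b, p, v)
```

SpatialSpark performs the same test, but distributed and accelerated by spatial indexes rather than a nested scan.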
The results are as follows:
2. Join with merged query regions: in some cases there are many rectangular query regions, and they can be merged before the query runs (some regions may contain others, and those can be collapsed together).
def testForRJOIN[V](srdd: SpatialRDD[Point, V], spark: SparkContext) = {
  val numPartitions = srdd.partitions.length
  val boxes = Array(
    Box(20.10094f, -81.8612f, 30.41f, -84.222f), Box(29.10094f, -83.8612f, 32.41f, -80.222f),
    Box(20.10094f, -86.8612f, 30.41f, -81.222f), Box(19.10094f, -83.8612f, 32.41f, -83.222f),
    Box(20.10094f, -96.8612f, 30.41f, -81.222f), Box(19.10094f, -83.8612f, 34.41f, -82.222f),
    Box(20.10094f, -86.8612f, 40.41f, -81.222f), Box(10.10094f, -83.8612f, 43.41f, -84.222f))
  val queryBoxes = spark.parallelize(boxes, numPartitions)
  println("length: " + numPartitions)
  // count the entries that fall in each merged region
  def aggfunction1[K](itr: Iterator[(K, V)]): Int = itr.size
  // combine the partial counts
  def aggfunction2(v1: Int, v2: Int): Int = v1 + v2
  val joinResultRdd = srdd.rjoin(queryBoxes)(aggfunction1, aggfunction2)
  println("rjoin: " + joinResultRdd.count())
  joinResultRdd.foreach(println)
}
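The merging step that rjoin relies on can be illustrated with a small stand-alone sketch: a box fully contained in another contributes nothing extra to a containment query, so it can be dropped before the query runs. This is a simplified illustration (assuming the input boxes are distinct), not the project's actual merge logic:

```scala
// Simplified stand-in for the project's Box type (xMin, yMin, xMax, yMax)
case class Box(xMin: Float, yMin: Float, xMax: Float, yMax: Float)

// True when b1 fully covers b2
def covers(b1: Box, b2: Box): Boolean =
  b1.xMin <= b2.xMin && b1.yMin <= b2.yMin &&
  b1.xMax >= b2.xMax && b1.yMax >= b2.yMax

// Drop every box that is covered by some other box in the list
def mergeContained(boxes: Seq[Box]): Seq[Box] =
  boxes.filterNot(b => boxes.exists(other => (other ne b) && covers(other, b)))
```

With many overlapping query rectangles, pruning contained ones this way reduces the number of regions each partition has to test.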
The results are as follows:
3. k-nearest-neighbour (kNN) query: given a coordinate point, find the k data points closest to it
def testForKNNQuery[V](srdd: SpatialRDD[Point, V]) = {
  val k = 2
  val queryPoint = Point(30.40094f, -86.8612f)
  // the third argument is a predicate for filtering candidate results
  val knnResults = srdd.knnFilter(queryPoint, k, (id) => true)
  knnResults.foreach(println)
}
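For reference, the result of knnFilter corresponds to what a brute-force computation would return. The following stand-alone sketch ranks points by squared Euclidean distance (assuming Euclidean distance; the project uses spatial indexes to avoid scanning every point):

```scala
// Simplified stand-in for the project's Point type
case class Point(x: Float, y: Float)

// Squared Euclidean distance; the square root is unnecessary for ranking
def dist2(a: Point, b: Point): Double = {
  val dx = (a.x - b.x).toDouble
  val dy = (a.y - b.y).toDouble
  dx * dx + dy * dy
}

// Brute-force k nearest neighbours of `query` among `points`
def knnLocal[V](points: Seq[(Point, V)], query: Point, k: Int): Seq[(Point, V)] =
  points.sortBy { case (p, _) => dist2(p, query) }.take(k)
```

A brute-force scan is O(n) per query, which is exactly the cost an index structure such as an R-tree is there to avoid.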
4. Text query: a data point's attribute can also be text, so a query can combine the data's location with a match on a particular text attribute
def testForTextQuery[V](indexed: SpatialRDD[Point, V]) = {
  val searchBox = Box(20.10094f, -86.8612f, 30.41f, -81.222f)
  // keep only entries whose text attribute mentions a hotel (酒店),
  // restaurant (飯館) or cinema (電影院)
  def textcondition[V](z: Entry[V]): Boolean =
    z.value match {
      case v: String =>
        val vl = v.toLowerCase()
        vl.contains("酒店") || vl.contains("飯館") || vl.contains("電影院")
    }
  val textSearchResult = indexed.rangeFilter(searchBox, textcondition)
}
SparkDistributedMatrix
The SparkDistributedMatrix project builds on Spark to provide a more efficient distributed system for linear-algebra operations on matrices. It extends the data types in Spark MLlib to support a richer set of linear operations: for example, its LocalVector supports inner products, vector addition and scalar multiplication, and its LocalMatrix supports multiplication of sparse matrices stored in CSC format, among others. Below is a worked application example: PageRank.
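As a rough illustration of what CSC (compressed sparse column) storage looks like, here is a minimal sketch of a CSC matrix and a matrix-vector product. This is not the project's LocalMatrix API, only the standard format: non-zero values are stored column by column, with colPtrs marking where each column's entries start:

```scala
// Minimal CSC matrix: values stored column-major, colPtrs has numCols + 1 entries
case class CSCMatrix(numRows: Int, numCols: Int,
                     colPtrs: Array[Int],
                     rowIndices: Array[Int],
                     values: Array[Double])

// y = M * v, touching only the stored non-zeros
def multiply(m: CSCMatrix, v: Array[Double]): Array[Double] = {
  val result = new Array[Double](m.numRows)
  for (col <- 0 until m.numCols; i <- m.colPtrs(col) until m.colPtrs(col + 1))
    result(m.rowIndices(i)) += m.values(i) * v(col)
  result
}
```

The product costs O(nnz) rather than O(rows × cols), which is what makes sparse formats worthwhile for large, mostly-zero matrices such as web graphs.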
I. Read the command-line arguments (args(0): the data file; args(1): the number of iterations)
val graphName = args(0)
val niter = if (args.length > 1) args(1).toInt else 10
II. Configure Spark and create the Spark context
val conf = new SparkConf()
.setAppName("PageRank algorithm on block matrices")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.shuffle.consolidateFiles", "true")
.set("spark.shuffle.compress", "false")
.set("spark.cores.max", "64")
.set("spark.executor.memory", "6g")
.set("spark.default.parallelism", "256")
.set("spark.akka.frameSize", "64").setMaster("local[*]")
conf.setJars(SparkContext.jarOfClass(this.getClass).toArray)
val sc = new SparkContext(conf)
III. Use BlockPartitionMatrix for blocked computation: first estimate a suitable block size from the data and the Spark configuration, then partition and run in parallel.
val coordinateRdd = genCoordinateRdd(sc, graphName)
val blkSize = BlockPartitionMatrix.estimateBlockSize(coordinateRdd)
var matrix = BlockPartitionMatrix.PageRankMatrixFromCoordinateEntries(coordinateRdd, blkSize, blkSize)
//matrix.partitionByBlockCyclic()
matrix.partitionBy(new ColumnPartitioner(8))
val vecRdd = sc.parallelize(BlockPartitionMatrix.onesMatrixList(matrix.nCols(), 1, blkSize, blkSize), 8)
var x = new BlockPartitionMatrix(vecRdd, blkSize, blkSize, matrix.nCols(), 1)
var v = x
v.partitionBy(new RowPartitioner(8))
IV. Run the main computation
val alpha = 0.85
matrix = (alpha *: matrix).cache()
matrix.stat()
v = (1.0 - alpha) *: v
val t1 = System.currentTimeMillis()
// each iteration computes x = alpha * M * x + (1 - alpha) * v
for (i <- 0 until niter) {
  x = matrix %*% x + (v, (blkSize, blkSize), v.partitioner)
}
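The loop above is the standard PageRank power iteration. A dense, single-machine sketch of the same update, x = alpha * M * x + (1 - alpha) * v with a column-stochastic transition matrix M, looks like this (illustrative only; the project performs it on distributed block matrices):

```scala
// Dense local PageRank power iteration; m(i)(j) is the probability of
// moving from page j to page i, so each column of m sums to 1
def pageRank(m: Array[Array[Double]], alpha: Double, niter: Int): Array[Double] = {
  val n = m.length
  var x = Array.fill(n)(1.0 / n)            // uniform initial rank vector
  val v = Array.fill(n)((1.0 - alpha) / n)  // teleport term
  for (_ <- 0 until niter) {
    val mx = Array.tabulate(n)(i => (0 until n).map(j => m(i)(j) * x(j)).sum)
    x = Array.tabulate(n)(i => alpha * mx(i) + v(i))
  }
  x
}
```

Because M is column-stochastic and the teleport term redistributes the remaining (1 - alpha) mass, the entries of x continue to sum to 1 across iterations.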
The results are as follows: