Source: http://blog.csdn.net/buring_/article/details/42424477
I. In Spark you constantly run into the question of how to store RDDs, so here is a record of the situations I face most often: saveAsObjectFile, SequenceFile, mapFile. Plain textFile I won't go into.
First: when writing output, the target directory often already exists, so you need something that deletes the directory if it is there. Roughly like this:
- import java.net.URI
- import org.apache.hadoop.fs.{FileSystem, Path}
- import org.apache.spark.SparkContext
- // logInfo/logWarning come from Spark's Logging trait, which the enclosing class is assumed to mix in
- def checkDirExist(sc: SparkContext, outpath: String) = {
-   logInfo("check whether the output dir already exists")
-   val hdfs = FileSystem.get(new URI("hdfs://hdfs_host:port"), sc.hadoopConfiguration)
-   try {
-     hdfs.delete(new Path(outpath), true) // recursive delete: whatever is at the path is gone afterwards
-     logInfo("output dir existed, deleted it: %s".format(outpath))
-   } catch {
-     case _: Throwable => logWarning("output dir does not exist, nothing to delete") // delete() returns false instead of throwing, so this branch is probably never reached
-   }
- }
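With the helper in place, you call it right before any of the save operations below (a minimal sketch; rdd and outpath stand in for your own values):
- checkDirExist(sc, outpath) // clear stale output so the save cannot fail with FileAlreadyExistsException
- rdd.saveAsObjectFile(outpath) // or saveAsSequenceFile, etc.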
1) Storing and loading objects: saveAsObjectFile is very convenient, much like Python's cPickle. The main usage is as follows.
Say you have:
- val rdd:RDD[(Long,Array[Double])]
- rdd.saveAsObjectFile(outpath)
- Reading the object file back, you have to specify the element type:
- val input:RDD[(Long,Array[Double])] = sc.objectFile[(Long,Array[Double])](outpath)
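For context, here is a complete round trip (a minimal sketch of mine, assuming a local SparkContext and a scratch path /tmp/objtest, neither of which is from the original post):
- import org.apache.spark.{SparkConf, SparkContext}
-
- val sc = new SparkContext(new SparkConf().setAppName("objtest").setMaster("local[2]"))
- val outpath = "/tmp/objtest"
- val rdd = sc.parallelize(Seq((1L, Array(1.0, 2.0)), (2L, Array(3.0))))
- rdd.saveAsObjectFile(outpath) // Java-serializes batches of elements into a SequenceFile under the hood
- val input = sc.objectFile[(Long, Array[Double])](outpath) // the type parameter drives the cast back
- input.collect().foreach { case (k, v) => println(s"$k -> ${v.mkString(",")}") }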
2) Sometimes you need to save space, and that means writing a serialized SequenceFile. For example:
- val rdd: RDD[(Long, Array[Byte])]
- new SequenceFileRDDFunctions(rdd).saveAsSequenceFile(outpath)
- Note that for the serialization, both Long and Array[Byte] need an implicit conversion to a Hadoop Writable, like so:
- implicit def arrayBytetoBytesArray(bytes:Array[Byte]):BytesWritable = new BytesWritable(bytes)
- implicit def long2LongWritable(ll:Long):LongWritable = new LongWritable(ll)
- Reading the SequenceFile back:
- val rdd = sc.sequenceFile(dir + "/svd", classOf[LongWritable], classOf[BytesWritable]).map { case (uid, sessions) =>
-   sessions.setCapacity(sessions.getLength) // getBytes returns the padded backing buffer, so trim it to the valid length first
-   (uid.get(), sessions.getBytes.clone()) // clone, because Hadoop reuses the same Writable instance for every record
- }
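Putting the pieces together, a full round trip looks roughly like this (a sketch of mine against Spark 1.x, where SequenceFileRDDFunctions still takes view bounds; the path /tmp/seqtest and the sample data are my assumptions):
- import org.apache.hadoop.io.{BytesWritable, LongWritable}
- import org.apache.spark.rdd.SequenceFileRDDFunctions
-
- implicit def arrayBytetoBytesArray(bytes: Array[Byte]): BytesWritable = new BytesWritable(bytes)
- implicit def long2LongWritable(ll: Long): LongWritable = new LongWritable(ll)
-
- val data = sc.parallelize(Seq((1L, "foo".getBytes), (2L, "bar".getBytes)))
- new SequenceFileRDDFunctions(data).saveAsSequenceFile("/tmp/seqtest")
-
- val back = sc.sequenceFile("/tmp/seqtest", classOf[LongWritable], classOf[BytesWritable]).map { case (k, v) =>
-   v.setCapacity(v.getLength)
-   (k.get(), new String(v.getBytes))
- }
- back.collect().foreach(println) // (1,foo) and (2,bar), in some order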
3) Sometimes you need to store a MapFile, which lets you look values up quickly by key. In practice the lookups are indeed fast, and it is economical on storage.
When writing a MapFile, note that the records have to be sorted by key before they are written out; given that the format exists for fast indexed lookup, the sorting requirement makes sense.
- import java.net.URI
- import org.apache.hadoop.fs.FileSystem
- import org.apache.hadoop.io.{BytesWritable, IOUtils, LongWritable, MapFile}
-
- val temp: RDD[(Long, Array[Byte])]
- val mapfile = temp.collect().sortBy(_._1) // entries must be appended in key order; note this pulls the whole RDD to the driver
- var writer: MapFile.Writer = null
- val conf = sc.hadoopConfiguration
- val fs = FileSystem.get(URI.create(dir + "/merge"), conf)
- val key = new LongWritable()
- val value = new BytesWritable()
- try {
-   writer = new MapFile.Writer(conf, fs, dir + "/merge", classOf[LongWritable], classOf[BytesWritable])
-   // every 1024th key is copied into the index file; I am not sure what the best value is here, anything seemed fine
-   writer.setIndexInterval(1024)
-   for (ele <- mapfile) {
-     key.set(ele._1)
-     value.set(ele._2, 0, ele._2.length) // BytesWritable has no single-argument set(Array[Byte])
-     writer.append(key, value)
-   }
- } finally {
-   IOUtils.closeStream(writer)
- }
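Because collect() pulls the whole RDD onto the driver, the snippet above only works when the data fits in driver memory. A distributed alternative (my sketch, not from the original post) is to sort the RDD and let Hadoop's MapFileOutputFormat write one MapFile per partition:
- import org.apache.hadoop.io.{BytesWritable, LongWritable}
- import org.apache.hadoop.mapred.MapFileOutputFormat
-
- temp.sortByKey() // range-partitions and sorts, so each partition is internally ordered as MapFile requires
-   .map { case (k, v) => (new LongWritable(k), new BytesWritable(v)) }
-   .saveAsHadoopFile(dir + "/merge_dist", classOf[LongWritable], classOf[BytesWritable], classOf[MapFileOutputFormat])
The resulting part-* MapFiles can then be opened together via MapFileOutputFormat.getReaders.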
- Quick lookup of a value by key:
- val reader = new MapFile.Reader(FileSystem.get(sc.hadoopConfiguration),
-   dir, sc.hadoopConfiguration)
- val key = new LongWritable(args(1).toLong)
- val value = new BytesWritable()
- reader.get(key, value) // binary-searches the in-memory index, then seeks into the data file; returns null if the key is absent
- value.setCapacity(value.getLength) // trim the padded backing buffer, same trick as with the SequenceFile above
- val bytes = value.getBytes
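Wrapped up as a small helper that also closes the reader (a sketch of mine; the names are illustrative and it assumes the MapFile written above):
- import org.apache.hadoop.conf.Configuration
- import org.apache.hadoop.fs.FileSystem
- import org.apache.hadoop.io.{BytesWritable, LongWritable, MapFile}
-
- def lookup(dir: String, conf: Configuration, k: Long): Option[Array[Byte]] = {
-   val reader = new MapFile.Reader(FileSystem.get(conf), dir, conf)
-   try {
-     val value = new BytesWritable()
-     if (reader.get(new LongWritable(k), value) != null) { // get() returns null when the key is absent
-       value.setCapacity(value.getLength)
-       Some(value.getBytes.clone())
-     } else None
-   } finally {
-     reader.close() // the reader keeps file handles on the index and data files open
-   }
- }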