Spark - Loading and Saving Data

- File Formats

  • Text File

sc.textFile loads a text file as an RDD with one element per line.

sc.wholeTextFiles loads every file under the specified directory as (filename, entire content) pairs.
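A runnable sketch of the difference between the two calls, assuming a local SparkContext and sample files created on the fly (the paths, file names, and data are illustrative, not from the original):

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object TextFileExample {
  def run(): (Long, Long) = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("text"))
    val dir = Files.createTempDirectory("txt")
    Files.write(dir.resolve("a.txt"), "line1\nline2\n".getBytes)
    Files.write(dir.resolve("b.txt"), "line3\n".getBytes)

    val lines = sc.textFile(dir.toString)          // one element per line, across all files
    val files = sc.wholeTextFiles(dir.toString)    // one (filename, content) pair per file
    val counts = (lines.count(), files.count())    // 3 lines total, 2 files
    sc.stop()
    counts
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Because wholeTextFiles keeps each file as a single element, it suits formats where a record spans the whole file; textFile suits line-oriented records.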

  • JSON
Read with sc.textFile, then map each line to a JSON object using a third-party library such as Jackson (e.g. people.add(mapper.readValue(line, Person.class))).
  • CSV
Read with sc.textFile, then map each line to an array of fields using a third-party library such as opencsv (e.g. val reader = new CSVReader(new StringReader(line)); reader.readNext()).
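opencsv handles quoting and embedded commas; for simple files without quoted fields, a plain split shows the shape of the per-line transformation. A minimal pure-Scala sketch (the sample data is an assumption, and the Seq stands in for sc.textFile):

```scala
object CsvSplitExample {
  // Naive CSV parsing: fine only when fields contain no quotes or
  // embedded commas. Real pipelines should use a library like opencsv.
  def parseLine(line: String): Array[String] = line.split(",").map(_.trim)

  def main(args: Array[String]): Unit = {
    val lines = Seq("alice, 30", "bob, 25")       // stand-in for sc.textFile(...)
    val rows  = lines.map(parseLine)              // in Spark: rdd.map(parseLine)
    rows.foreach(r => println(r.mkString("|")))
  }
}
```

In a Spark job the same parseLine would be passed to rdd.map, keeping the parsing logic testable outside the cluster.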
  • Sequence File
sc.sequenceFile(inFile, classOf[Text], classOf[IntWritable]).
map{case (x, y) => (x.toString, y.get())}
  • Object File 
Uses Java serialization (saveAsObjectFile / objectFile); mainly used for communicating RDDs between Spark jobs.
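A hedged round-trip sketch of the two calls, assuming a local SparkContext and a temp directory (the sample data is illustrative):

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object ObjectFileExample {
  def roundTrip(): Map[String, Int] = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("obj"))
    val out = Files.createTempDirectory("obj").resolve("data").toString

    // Write: elements go through Java serialization, so they must implement
    // java.io.Serializable (tuples and case classes already do).
    sc.parallelize(Seq(("alice", 30), ("bob", 25))).saveAsObjectFile(out)

    // Read back, typically in a different Spark job; the element type
    // is supplied explicitly because the file carries no schema.
    val restored = sc.objectFile[(String, Int)](out).collect().toMap
    sc.stop()
    restored
  }

  def main(args: Array[String]): Unit = println(roundTrip())
}
```

Java serialization ties the saved bytes to the classes' serialized form, so object files are best kept to short-lived job-to-job handoffs rather than long-term storage.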


- Hadoop InputFormat and OutputFormat

// Hadoop old API
val input = sc.hadoopFile[Text, Text, KeyValueTextInputFormat](inputFile).map {
    case (x, y) => (x.toString, y.toString)
}

// Hadoop new API
val input = sc.newAPIHadoopFile(inputFile, classOf[LzoJsonInputFormat],
    classOf[LongWritable], classOf[MapWritable], conf)


- Others

hadoopDataset/saveAsHadoopDataset (and the newAPIHadoopDataset variants) access Hadoop-supported storage systems that are not filesystems, such as HBase, through a JobConf.
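A minimal sketch of saveAsHadoopDataset, assuming a local SparkContext and using TextOutputFormat on a temp directory so it can actually run; for a non-filesystem store such as HBase, only the JobConf settings would change (e.g. TableOutputFormat), not the save call:

```scala
import java.nio.file.Files
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{FileOutputFormat, JobConf, TextOutputFormat}
import org.apache.spark.{SparkConf, SparkContext}

object HadoopDatasetExample {
  def run(): Boolean = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("ds"))
    val out = Files.createTempDirectory("ds").resolve("data").toString

    // The JobConf carries the OutputFormat, key/value classes, and target.
    val conf = new JobConf()
    conf.setOutputFormat(classOf[TextOutputFormat[Text, Text]])
    conf.setOutputKeyClass(classOf[Text])
    conf.setOutputValueClass(classOf[Text])
    FileOutputFormat.setOutputPath(conf, new Path(out))

    sc.parallelize(Seq(("k1", "v1"), ("k2", "v2")))
      .map { case (k, v) => (new Text(k), new Text(v)) }
      .saveAsHadoopDataset(conf)
    sc.stop()

    // The old-API OutputFormat writes part-NNNNN files under the target dir.
    new java.io.File(out).listFiles.exists(_.getName.startsWith("part-"))
  }

  def main(args: Array[String]): Unit = println(run())
}
```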


