We can use the toDebugString method to see how many RDDs are produced:
val rdd = sc.textFile("file:///home/hadoop/data/wc.dat")
rdd.toDebugString
As the figure below shows, two RDDs are produced: a HadoopRDD and a MapPartitionsRDD.
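For reference, the lineage printed by toDebugString typically looks something like the sketch below (the RDD ids, partition count, and console line numbers depend on your environment; this is illustrative rather than captured output):

(2) file:///home/hadoop/data/wc.dat MapPartitionsRDD[1] at textFile at <console>:24 []
 |  file:///home/hadoop/data/wc.dat HadoopRDD[0] at textFile at <console>:24 []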
Why two RDDs?
Let's step into the textFile source code to find out:
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  // Step 1: read the file as (LongWritable, Text) pairs
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions)
    // Step 2: keep only each line's content
    .map(pair => pair._2.toString).setName(path)
}
The source first calls hadoopFile and then map. Looking at the hadoopFile source (the code below), it ultimately returns a HadoopRDD, and the parameters passed in are TextInputFormat, LongWritable (the byte offset of each line), and Text (the content of each line). These are the same types a mapper receives in MapReduce, so the returned records look like (0, xxxx), (7, yyyy): the key is the line's starting byte offset and the value is the line content.
The offset is of no use to us, so textFile then applies map(pair => pair._2.toString) on top to keep only each line's content.
That is why two RDDs are produced: a HadoopRDD and a MapPartitionsRDD.
def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
  assertNotStopped()
  FileSystem.getLocal(hadoopConfiguration)
  val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
  val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
  // The key part: a HadoopRDD is constructed and returned here
  new HadoopRDD(
    this,
    confBroadcast,
    Some(setInputPathsFunc),
    inputFormatClass,
    keyClass,
    valueClass,
    minPartitions).setName(path)
}
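To see the split for yourself, you can call hadoopFile directly and compare its lineage with textFile's. The sketch below is illustrative and assumes the same wc.dat file; the imports come from Hadoop's old mapred API, which is what textFile uses under the hood.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Reading the file with hadoopFile alone produces just the HadoopRDD
// of (byte offset, line) pairs that textFile's map() later discards.
val raw = sc.hadoopFile("file:///home/hadoop/data/wc.dat",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
println(raw.toDebugString)            // expect a single HadoopRDD entry

// Hadoop reuses the same Writable objects for every record, so copy the
// values out before collecting them to the driver.
raw.map { case (offset, line) => (offset.get, line.toString) }
   .collect()
   .foreach(println)                  // e.g. (0,hello world), (12,hello spark)

// Adding the same map that textFile applies introduces the second RDD
// (a MapPartitionsRDD) on top of the HadoopRDD.
val lines = raw.map(pair => pair._2.toString)
println(lines.toDebugString)          // HadoopRDD + MapPartitionsRDD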