We can use the toDebugString method to see how many RDDs are produced:
val rdd = sc.textFile("file:///home/hadoop/data/wc.dat")
rdd.toDebugString
As the figure below shows, two RDDs are produced: a HadoopRDD and a MapPartitionsRDD.
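For reference, the lineage printed by toDebugString typically looks something like the sketch below (the RDD ids, partition count, and console line numbers depend on your environment; this is illustrative rather than captured output):

(2) file:///home/hadoop/data/wc.dat MapPartitionsRDD[1] at textFile at <console>:24 []
 |  file:///home/hadoop/data/wc.dat HadoopRDD[0] at textFile at <console>:24 []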
Why two RDDs?
Let's step into the textFile source code to find out:
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  // Step 1: read the file as (LongWritable, Text) pairs
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions)
    // Step 2: keep only each line's content
    .map(pair => pair._2.toString).setName(path)
}
The source first calls hadoopFile and then map. Looking at the hadoopFile source (the code below), it ultimately returns a HadoopRDD, and the parameters passed in are TextInputFormat, LongWritable (the byte offset of each line), and Text (the content of each line). These are the same types a mapper receives in MapReduce, so the returned records look like (0, xxxx), (7, yyyy): the key is the line's starting byte offset and the value is the line content.
The offset is of no use to us, so textFile then applies map(pair => pair._2.toString) on top to keep only each line's content.
That is why two RDDs are produced: a HadoopRDD and a MapPartitionsRDD.
def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
  assertNotStopped()
  FileSystem.getLocal(hadoopConfiguration)
  val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
  val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
  // The key part: a HadoopRDD is constructed and returned here
  new HadoopRDD(
    this,
    confBroadcast,
    Some(setInputPathsFunc),
    inputFormatClass,
    keyClass,
    valueClass,
    minPartitions).setName(path)
}
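To see the split for yourself, you can call hadoopFile directly and compare its lineage with textFile's. The sketch below is illustrative and assumes the same wc.dat file; the imports come from Hadoop's old mapred API, which is what textFile uses under the hood.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Reading the file with hadoopFile alone produces just the HadoopRDD
// of (byte offset, line) pairs that textFile's map() later discards.
val raw = sc.hadoopFile("file:///home/hadoop/data/wc.dat",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
println(raw.toDebugString)            // expect a single HadoopRDD entry

// Hadoop reuses the same Writable objects for every record, so copy the
// values out before collecting them to the driver.
raw.map { case (offset, line) => (offset.get, line.toString) }
   .collect()
   .foreach(println)                  // e.g. (0,hello world), (12,hello spark)

// Adding the same map that textFile applies introduces the second RDD
// (a MapPartitionsRDD) on top of the HadoopRDD.
val lines = raw.map(pair => pair._2.toString)
println(lines.toDebugString)          // HadoopRDD + MapPartitionsRDD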