Some Questions About Spark RDD Partitioning
VM configuration: 2 processors, 16 cores
Case 1: makeRDD
Code
import org.apache.spark.{SparkConf, SparkContext}

object Spark21 extends App {
  // Set up the Spark environment
  val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
  // Create the context
  val sc = new SparkContext(config)
  val listRDD = sc.makeRDD(List(1, 2, 3, 4))
  // Save the RDD's data to files
  listRDD.saveAsTextFile("output")
}
Result
The numbers in the list were scattered across 16 partitions.
makeRDD source code
override def defaultParallelism(): Int = {
  conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
}

/** Get a parameter as an integer, falling back to a default if not set */
def getInt(key: String, defaultValue: Int): Int = {
  getOption(key).map(_.toInt).getOrElse(defaultValue)
}

// Use an atomic variable to track total number of cores in the cluster for simplicity and speed
protected val totalCoreCount = new AtomicInteger(0)
From the source code we can see:
totalCoreCount gets the total number of cores, and math.max takes the larger of that and 2. Here the total is 16, so the partition count is 16.
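To make this concrete, here is a minimal sketch (reusing the sc from the code above; the counts shown assume the same 16-core local[*] run) that inspects how makeRDD spreads the elements, and shows that an explicit slice count overrides defaultParallelism:

  val listRDD = sc.makeRDD(List(1, 2, 3, 4))
  println(listRDD.getNumPartitions) // 16 here: max(totalCoreCount, 2)

  // glom() turns each partition into an array so its contents can be printed
  listRDD.glom().zipWithIndex().collect().foreach { case (part, idx) =>
    println(s"partition $idx: ${part.mkString(",")}")
  }

  // makeRDD also accepts an explicit numSlices argument, which takes precedence
  val fourParts = sc.makeRDD(List(1, 2, 3, 4), 4)
  println(fourParts.getNumPartitions) // 4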
Case 2: textFile
Code
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Spark21 extends App {
  // Set up the Spark environment
  val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
  // Create the context
  val sc = new SparkContext(config)
  // val listRDD = sc.makeRDD(List(1, 2, 3, 4))
  // Read the text files under the "in" directory
  private val listRDD: RDD[String] = sc.textFile("in")
  // Save the RDD's data to files
  listRDD.saveAsTextFile("output")
}
The result has two partitions. Let's analyze the source code to see why.
textFile source code
/**
 * Read a text file from HDFS, a local file system (available on all nodes), or any
 * Hadoop-supported file system URI, and return it as an RDD of Strings.
 * @param path path to the text file on a supported file system
 * @param minPartitions suggested minimum number of partitions for the resulting RDD
 * @return RDD of lines of the text file
 */
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}

/**
 * Default min number of partitions for Hadoop RDDs when not given by user
 * Notice that we use math.min so the "defaultMinPartitions" cannot be higher than 2.
 * The reasons for this are discussed in https://github.com/mesos/spark/pull/718
 */
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)

// Get the default level of parallelism to use in the cluster, as a hint for sizing jobs.
def defaultParallelism(): Int
From the source code we can see:
The textFile method takes two parameters: a path and a suggested minimum partition count (minPartitions).
If minPartitions is not set, the default is the smaller of defaultParallelism and 2.
defaultParallelism is the parallelism level, which is 16 here; compared with 2, the latter is smaller, so the result is 2 partitions.
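A minimal sketch of both paths (the "in" directory is the one used above; the exact count for the explicit case depends on the files' sizes and Hadoop's split computation):

  val rdd2 = sc.textFile("in")     // minPartitions = min(16, 2) = 2
  val rdd4 = sc.textFile("in", 4)  // suggest at least 4 input splits
  println(rdd2.getNumPartitions)   // 2 in this run
  println(rdd4.getNumPartitions)   // usually >= 4, but hadoopFile decides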
Addendum: when textFile reads files with minPartitions set to 2, the result can still have 3 partitions.
Here textFile's actual split count is determined by hadoopFile's input-split computation; I don't fully understand the details yet, so I'm recording it here for now.
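My current understanding (an assumption based on reading Hadoop's old-API FileInputFormat.getSplits, not verified against this run): minPartitions only sets a goal size per split, and the 1.1 "split slop" factor plus the leftover bytes can produce one extra split. A rough sketch of the arithmetic for a hypothetical 7-byte file with minPartitions = 2:

  val totalSize = 7L                   // total bytes across the input files (hypothetical)
  val numSplits = 2                    // the requested minPartitions
  val goalSize = totalSize / numSplits // 7 / 2 = 3 (integer division)
  val SPLIT_SLOP = 1.1                 // Hadoop tolerates a 10% overshoot per split

  var remaining = totalSize
  var splits = 0
  while (remaining.toDouble / goalSize > SPLIT_SLOP) {
    splits += 1                        // carve off goalSize bytes: splits of 3, then 3
    remaining -= goalSize
  }
  if (remaining > 0) splits += 1       // the 1-byte remainder becomes its own split
  println(splits)                      // 3, not the requested 2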
Case 3
setMaster source code
/**
 * The master URL to connect to, such as "local" to run locally with one thread, "local[4]" to
 * run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster.
 */
def setMaster(master: String): SparkConf = {
  set("spark.master", master)
}
In other words, the bracketed number in local[N] controls how many cores (threads) the local run uses.
Based on this doc comment, change the master from local[*] to local[1]:
// before
val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
// after
val config: SparkConf = new SparkConf().setMaster("local[1]").setAppName("WordCount")
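A quick way to check the effect (a sketch; the printed values depend on which scheduler backend supplies defaultParallelism for the chosen master):

  val conf1 = new SparkConf().setMaster("local[1]").setAppName("WordCount")
  val sc1 = new SparkContext(conf1)
  println(sc1.defaultParallelism)                         // parallelism under local[1]
  println(sc1.makeRDD(List(1, 2, 3, 4)).getNumPartitions) // partitions makeRDD now creates
  sc1.stop()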